Summarize_target

This tutorial will guide you through the summarease.summarize_target module, which includes tools to summarize target variables and visualize class balance, making it easier to evaluate target variable characteristics.

Getting started

The summarease.summarize_target module provides the following core functionalities:

1. Summarizing Target Variables:

  • Handles both categorical and numerical target variables.

  • Outputs class proportions and flags imbalanced classes against a threshold for categorical targets.

  • Provides basic statistical summaries for numerical targets.

2. Visualizing Target Balance:

  • Generates bar plots to visualize class proportions and expected balanced ranges for categorical targets.

  • Highlights imbalanced classes with different colors for easy identification.

Necessary libraries

To use the summarease.summarize_target module, make sure pandas, altair, and summarease are installed, then import them:

import pandas as pd
import altair as alt
from summarease.summarize_target import summarize_target_df, summarize_target_balance_plot

Example datasets

We’ll use the following datasets to demonstrate the module’s functionality:

Dataset 1: Categorical Target Dataset

categorical_data = pd.DataFrame({
    'target': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D']
})

Dataset 2: Numerical Target Dataset

numerical_data = pd.DataFrame({
    'target': [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1]
})

Example usage

1. summarize_target_df

This function calculates class proportions and checks for imbalance when the target is categorical, and provides basic statistical summaries when the target is numerical. summarize_target_df takes four parameters.

  • dataset_name: The input dataset containing the target variable. It must be a DataFrame.

  • target_variable: The name of the target column. It must be a string.

  • target_type: The type of the target variable. It must be a string, either “categorical” or “numerical”.

  • threshold: Used only when target_type is “categorical”, to identify class imbalance. It must be a float, and the default value is 0.2. The function raises a warning if you pass a value when target_type is “numerical”.

The following two examples demonstrate summarize_target_df for categorical and numerical targets.

Example 1: Summarizing a Categorical Target using a table

# Summarize the categorical target variable
summary_categorical = summarize_target_df(
    dataset_name=categorical_data, 
    target_variable='target', 
    target_type='categorical', 
    threshold=0.2
)
summary_categorical
  class  proportion  imbalanced  threshold
0     A    0.181818        True        0.2
1     B    0.181818        True        0.2
2     C    0.272727       False        0.2
3     D    0.363636        True        0.2
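The flags in this table can be reproduced by hand with the standard library. The sketch below assumes a rule inferred from the output rather than taken from the summarease source: a class counts as balanced when its proportion lies within a relative band of ±threshold around the average proportion 1/k, where k is the number of classes. With four classes and a threshold of 0.2, that band is 0.20 to 0.30. An absolute band of 0.25 ± 0.2 would flag no class at all, so the relative reading is the one consistent with the table.

```python
from collections import Counter

# Same labels as categorical_data['target'] in this tutorial
labels = ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D']
threshold = 0.2

counts = Counter(labels)
total = sum(counts.values())

# Assumed rule (not confirmed by the library source):
# balanced means the proportion is within +/- threshold, relative to 1/k
expected = 1 / len(counts)
lower = expected * (1 - threshold)
upper = expected * (1 + threshold)

manual_flags = {}
for cls in sorted(counts):
    proportion = counts[cls] / total
    manual_flags[cls] = not (lower <= proportion <= upper)
    print(cls, round(proportion, 6), manual_flags[cls])
```

Under this assumption, A and B fall below the band, D falls above it, and only C lies inside it, which matches the imbalanced column above.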

Example 2: Summarizing a Numerical Target

# Summarize the numerical target variable
summary_numerical = summarize_target_df(
    dataset_name=numerical_data, 
    target_variable='target', 
    target_type='numerical',
    threshold=None
)
summary_numerical
        count   mean       std  min  25%   50%   75%   max
target    8.0  5.325  3.137219  1.2  2.9  5.15  7.25  10.1
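The columns of this summary match those produced by pandas' describe(). As a cross-check, the same statistics can be recomputed with Python's statistics module. This is only a sketch: it assumes the sample standard deviation (n - 1 in the denominator) and linearly interpolated quartiles, which are the pandas defaults, and it does not show how summarease computes the table internally.

```python
import statistics

# Same values as numerical_data['target'] in this tutorial
values = [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1]

# method='inclusive' interpolates linearly between data points,
# which corresponds to the default quantile method in pandas
q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')

manual_stats = {
    'count': len(values),
    'mean': statistics.mean(values),
    'std': statistics.stdev(values),  # sample standard deviation
    'min': min(values),
    '25%': q1,
    '50%': median,
    '75%': q3,
    'max': max(values),
}

for name, value in manual_stats.items():
    print(name, round(value, 6))
```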

2. summarize_target_balance_plot

This function visualizes the class balance of a categorical target variable using an interactive bar plot. It takes one parameter.

  • summary_df: The input DataFrame, expected to match the output of summarize_target_df() when target_type is categorical. It must contain the columns [‘class’, ‘proportion’, ‘imbalanced’, ‘threshold’].
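If the summary table was built or modified by hand, a quick pre-check of its column names can catch a missing column before plotting. The helper below is hypothetical and not part of summarease. It only compares names against the four columns listed above and accepts any iterable of column names.

```python
# Columns the tutorial says summary_df must contain
REQUIRED_COLUMNS = ('class', 'proportion', 'imbalanced', 'threshold')


def missing_summary_columns(columns):
    """Return the required column names that are absent from `columns`.

    Hypothetical helper for illustration; not provided by summarease.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A complete set of columns leaves nothing missing
print(missing_summary_columns(['class', 'proportion', 'imbalanced', 'threshold']))

# A partial table reports which columns still need to be added
print(missing_summary_columns(['class', 'proportion']))
```

With a DataFrame, the same check would be called as missing_summary_columns(summary_categorical.columns).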

The following example demonstrates the usage of summarize_target_balance_plot.

Example 3: Visualizing Class Balance

# Visualize the balance of the categorical target variable
balance_plot = summarize_target_balance_plot(summary_categorical)

# Display the plot
balance_plot.show()

The error bars show the balanced range defined by the threshold (0.2 by default) around the average proportion. Classes whose proportion falls within this range are colored green, and classes whose proportion falls outside it are colored red.

Final notes

If you run into an error or unexpected behavior while using these functions, you can submit an issue in the GitHub repo, and it will be addressed as soon as possible. Thanks for your time!