{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Summarize_target\n", "\n", "This tutorial will guide you through the `summarease.summarize_target` module, which includes tools to summarize target variables and visualize class balance, making it easier to evaluate target variable characteristics.\n", "\n", "\n", "## Getting started\n", "\n", "The `summarease.summarize_target` module provides the following core functionalities:\n", "\n", "**1. Summarizing Target Variables:**\n", "- Handles both categorical and numerical target variables. \n", "- Outputs class proportions and imbalance flags with threshold for categorical targets. \n", "- Provides basic statistical summaries for numerical targets.\n", "\n", "**2. Visualizing Target Balance:**\n", "- Generates bar plots to visualize class proportions and expected balanced ranges for categorical targets. \n", "- Highlights imbalanced classes with different colors for easy identification.\n", "\n", "## Necessary libraries\n", "\n", "To use the `summarease.summarize_target` module, ensure the following libraries are installed:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import altair as alt\n", "from summarease.summarize_target import summarize_target_df, summarize_target_balance_plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example datasets\n", "\n", "We'll use the following datasets to demonstrate the module's functionality:\n", "\n", "### Dataset 1: Categorical Target Dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "categorical_data = pd.DataFrame({\n", " 'target': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D']\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset 2: Numerical Target Dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "numerical_data = 
pd.DataFrame({\n", " 'target': [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1]\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example usage\n", "\n", "### 1. `summarize_target_df`\n", "\n", "This function calculates class proportions and flags imbalance for categorical targets, and provides basic statistical summaries for numerical targets. `summarize_target_df` takes four parameters:\n", "\n", "- **dataset_name:** The input dataset containing the target variable. It must be a DataFrame.\n", "- **target_variable:** The name of the target column. It must be a string.\n", "- **target_type:** The type of the target variable. It must be a string, either \"categorical\" or \"numerical\".\n", "- **threshold:** Used only when `target_type` is \"categorical\", to identify class imbalance. It must be a float; the default value is 0.2. A warning is raised if you pass a value when `target_type` is \"numerical\".\n", "\n", "The following two examples demonstrate `summarize_target_df` for a categorical and a numerical target." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 1: Summarizing a Categorical Target using a table" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>class</th><th>proportion</th><th>imbalanced</th><th>threshold</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>A</td><td>0.181818</td><td>True</td><td>0.2</td></tr>\n",
" <tr><th>1</th><td>B</td><td>0.181818</td><td>True</td><td>0.2</td></tr>\n",
" <tr><th>2</th><td>C</td><td>0.272727</td><td>False</td><td>0.2</td></tr>\n",
" <tr><th>3</th><td>D</td><td>0.363636</td><td>True</td><td>0.2</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " class proportion imbalanced threshold\n", "0 A 0.181818 True 0.2\n", "1 B 0.181818 True 0.2\n", "2 C 0.272727 False 0.2\n", "3 D 0.363636 True 0.2" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Summarize the categorical target variable\n", "summary_categorical = summarize_target_df(\n", " dataset_name=categorical_data, \n", " target_variable='target', \n", " target_type='categorical', \n", " threshold=0.2\n", ")\n", "summary_categorical" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 2: Summarizing a Numerical Target" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>count</th><th>mean</th><th>std</th><th>min</th><th>25%</th><th>50%</th><th>75%</th><th>max</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>target</th><td>8.0</td><td>5.325</td><td>3.137219</td><td>1.2</td><td>2.9</td><td>5.15</td><td>7.25</td><td>10.1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "target 8.0 5.325 3.137219 1.2 2.9 5.15 7.25 10.1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Summarize the numerical target variable\n", "summary_numerical = summarize_target_df(\n", " dataset_name=numerical_data, \n", " target_variable='target', \n", " target_type='numerical',\n", " threshold=None\n", ")\n", "summary_numerical" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. `summarize_target_balance_plot` \n", "\n", "This function visualizes the balance condition of a categorical target variable using an interactive bar plot. There is one parameter in `summarize_target_balance_plot`.\n", "\n", "- **summary_df**: The input DataFrame, expected to match the output of `summarize_target_df()` when `target_type` is categorical. It must contain the columns ['class', 'proportion', 'imbalanced', 'threshold'].\n", "\n", "The following is an example for demostrating the usage of `summarize_target_balance_plot`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 3: Visualizing Class Balance" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize the balance of the categorical target variable\n", "balance_plot = summarize_target_balance_plot(summary_categorical)\n", "\n", "# Display the plot\n", "balance_plot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The error bars mark the balanced range around the average class proportion, as set by the threshold (0.2 by default). Classes whose proportion falls within that range are shown in green; classes outside it are shown in red." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final notes\n", "\n", "If you run into an error or unexpected behavior while using these functions, please open an issue in the GitHub repository, and it will be addressed as soon as possible. Thanks for your time!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 4 }