{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Summarize\n", "\n", "This tutorial walks you through the `summarize` function from the `summarease.summarize` module, which summarizes a dataset and saves the result as a PDF report." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Purpose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine you are working on a data science problem and have a dataset from which you need to extract interesting patterns. Time is limited, so you want a convenient function that handles the initial exploratory data analysis for you by generating a quick summary of the dataset at hand. It could then compile that information into a well-organized PDF report, complete with charts and tables, so the team can quickly grasp trends and discuss them without processing the data manually.\n", "\n", "Situations like this call for the `summarize` function from the `summarease` package." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The core functionality of the `summarize` function from the `summarease.summarize` module includes the following:\n", "\n", "**1. Summarizing the dataset using plots, which includes:**\n", "\n", "- Dataset description (if provided),\n", "- A plot summarizing the numeric columns,\n", "- A correlation heatmap for the numeric columns,\n", "- A plot summarizing the target variable distribution,\n", "- A table summarizing the data types.\n", "\n", "\n", "**2. Summarizing the dataset using tables, which includes:**\n", "\n", "- Dataset description (if provided),\n", "- A table summarizing the numeric columns,\n", "- A table summarizing the target variable distribution,\n", "- A table summarizing the data types."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To import the `summarize` function, use the following code:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from summarease.summarize import summarize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For demonstration purposes, we will use the Iris dataset, loaded with the `load_iris` function from `sklearn.datasets`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing the necessary functions and libraries" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1 - Load the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the `load_iris` function returns a dictionary-like object, which we assign to the `iris` variable. Its feature data must first be converted to a DataFrame. Next, we add the target variable from the `iris` object to the DataFrame as a new column. See the code below for details." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th><th>target</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "0                5.1               3.5                1.4               0.2   \n", "1                4.9               3.0                1.4               0.2   \n", "2                4.7               3.2                1.3               0.2   \n", "3                4.6               3.1                1.5               0.2   \n", "4                5.0               3.6                1.4               0.2   \n", "\n", "   target  \n", "0       0  \n", "1       0  \n", "2       0  \n", "3       0  \n", "4       0  " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the dataset\n", "iris = load_iris()\n", "\n", "# Create a DataFrame from the feature matrix\n", "iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)\n", "# Add the target variable as a new column\n", "iris_df['target'] = iris.target\n", "\n", "# Display the first few rows\n", "iris_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the dataset is ready, we can move on to summarizing it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2 - Prepare the dataset information (optional)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, information about the dataset can be included in the report. These are details the user supplies because they cannot always be extracted from the dataset itself:\n", "\n", "- **The dataset name**, used as the title of the summary (default: \"Dataset Summary\").\n", "- **The dataset description** (default: \"Dataset summary generated by summarease.\")." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "dataset_name = \"Iris Dataset Summary\"\n", "dataset_description = \"The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, \\\n", "and can also be found on the UCI Machine Learning Repository. It includes three iris species with 50 samples each as well as some properties about each flower. 
\\\n", "One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3 - Run the summarize function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is all we need to create the report. The dataset can now be summarized using either plots or tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using plots" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PDF created!\n" ] } ], "source": [ "summarize(\n", "    dataset=iris_df,\n", "    dataset_name=dataset_name,\n", "    description=dataset_description,\n", "    summarize_by=\"plot\",\n", "    target_variable=\"target\",\n", "    target_type=\"categorical\",\n", "    output_file=\"iris_summary.pdf\",\n", "    output_dir=\"./dataset_summary/\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The message **PDF created!** indicates that the report was successfully saved in the provided directory `./dataset_summary/` under the provided file name `iris_summary.pdf`. The report consists mainly of plots and figures. If a table format suits the problem better, use the `table` option instead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using tables" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PDF created!\n" ] } ], "source": [ "summarize(\n", "    dataset=iris_df,\n", "    dataset_name=dataset_name,\n", "    description=dataset_description,\n", "    summarize_by=\"table\",\n", "    target_variable=\"target\",\n", "    target_type=\"categorical\",\n", "    output_file=\"iris_summary.pdf\",\n", "    output_dir=\"./dataset_summary/\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The message **PDF created!** indicates that the report was successfully saved in the provided directory `./dataset_summary/` under the provided file name `iris_summary.pdf`. The report will contain only tables. Because this call reuses the same `output_file` name as the plot example, choose a different name if you want to keep both reports." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final notes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As simple as that! Note that the `summarize_by` argument controls whether the output is in plot or table format. For more details, see the function documentation with `help(summarize)`.\n", "\n", "If you encounter an error or something goes wrong during report generation, you can submit an issue in the GitHub repo, and it will be addressed as soon as possible. Thanks for your time!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 4 }