Summarize

This tutorial will guide you through the summarize function from the summarease.summarize module, which summarizes a dataset and saves the result as a PDF report.

Purpose

Imagine you are working on a data science problem and have a dataset from which you need to extract interesting patterns. Time is limited, so you want a convenient function that performs the initial exploratory data analysis for you and generates a quick summary of the dataset at hand. It would compile the information into a well-organized PDF report, complete with charts and tables, so that the team can quickly grasp trends and discuss them without manually processing the data.

Situations like this call for the summarize function from the summarease package.

Getting started

The core functionalities of the summarize function from the summarease.summarize module are the following:

1. Summarize the dataset using plots, which includes:

  • Dataset description (if provided),

  • A plot for summarizing numeric columns,

  • A correlation heatmap for numeric columns,

  • A plot for summarizing the target variable distribution,

  • A table for summarizing the data types.

2. Summarize the dataset using tables, which includes:

  • Dataset description (if provided),

  • A table summarizing numeric columns,

  • A table for summarizing the target variable distribution,

  • A table for summarizing the data types.
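To get a feel for what these components contain, similar summaries can be approximated with plain pandas. The sketch below is not how summarease builds its report (its internals are not shown in this tutorial); it only illustrates the kind of information each table or plot is based on, using a tiny hand-made stand-in for the dataset.

```python
import pandas as pd

# Tiny stand-in dataset so the snippet runs on its own (values are illustrative).
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 6.3, 5.8],
    "petal length (cm)": [1.4, 1.4, 6.0, 5.1],
    "target": [0, 0, 2, 2],
})
features = df.drop(columns="target")

# Numeric column summary: count, mean, std, min, quartiles, max.
numeric_summary = features.describe()

# Target variable distribution: number of rows per class.
target_counts = df["target"].value_counts().sort_index()

# Data types: one row per column.
dtype_table = df.dtypes.astype(str).to_frame("dtype")

# Pairwise correlations of numeric columns (the data behind a heatmap).
correlations = features.corr()

print(numeric_summary)
print(target_counts)
print(dtype_table)
print(correlations)
```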

Importing the function

To import the summarize function, use the following code:

from summarease.summarize import summarize

Example usage

We will use the Iris dataset, loaded with the load_iris function from sklearn.datasets, for demonstration purposes.

Importing the necessary functions and libraries

from sklearn.datasets import load_iris
import pandas as pd

Step 1 - Load the dataset

By default, the load_iris function returns a dictionary-like object, which is assigned to the iris variable. Its data must first be converted to a DataFrame. Next, we add the target variable from the iris object to the DataFrame as a new column. See the code below for details.

# Load the dataset
iris = load_iris()

# Create a DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# Add the target variable as a column
iris_df['target'] = iris.target

# Display the first few rows
iris_df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
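As an alternative to building the DataFrame by hand, scikit-learn can return pandas objects directly. The sketch below assumes scikit-learn 0.23 or newer, where load_iris accepts as_frame=True; the resulting frame already contains the four feature columns plus a target column.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects;
# the .frame attribute holds the features together with a "target" column.
iris_df = load_iris(as_frame=True).frame

print(iris_df.shape)          # (150, 5)
print(list(iris_df.columns))
```

Either approach produces a DataFrame that can be passed as the dataset argument in the steps that follow.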

Once the dataset is ready, we can proceed to summarize it.

Step 2 - Prepare the dataset information (optional)

Optionally, information about the dataset can be included in the report. This is information that the user supplies because it cannot always be extracted from the dataset itself. It consists of the following:

  • The dataset name or the title for the summary (by default “Dataset Summary”).

  • The dataset description (by default “Dataset summary generated by summarease.”).

dataset_name = "Iris Dataset Summary"
dataset_description = "The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, \
and can also be found on the UCI Machine Learning Repository. It includes three iris species with 50 samples each as well as some properties about each flower. \
One flower species is linearly separable from the other two, but the other two are not linearly separable from each other."
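The backslash continuations above work, but they break if any whitespace follows a backslash. A slightly more robust way to write the same long string, shown here as a stylistic alternative rather than a requirement of summarease, is to place adjacent string literals inside parentheses so that Python joins them automatically.

```python
dataset_name = "Iris Dataset Summary"

# Adjacent string literals inside parentheses are concatenated by Python,
# so no backslashes are needed and no newlines end up in the text.
dataset_description = (
    "The Iris dataset was used in R.A. Fisher's classic 1936 paper, "
    "The Use of Multiple Measurements in Taxonomic Problems, "
    "and can also be found on the UCI Machine Learning Repository. "
    "It includes three iris species with 50 samples each as well as some "
    "properties about each flower. "
    "One flower species is linearly separable from the other two, but the "
    "other two are not linearly separable from each other."
)

print(dataset_description)
```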

Step 3 - Run the summarize function

That is all we need to create the report. The dataset can now be summarized using either plots or tables.

Using plots

summarize(
    dataset=iris_df, 
    dataset_name=dataset_name, 
    description=dataset_description,
    summarize_by="plot",
    target_variable="target",
    target_type="categorical",
    output_file="iris_summary.pdf",
    output_dir="./dataset_summary/"
)
PDF created!

The message PDF created! indicates that the report was successfully saved in the provided directory ./dataset_summary/ under the provided file name iris_summary.pdf. The report will contain predominantly plots and figures. If a table format suits the problem better, the table option can be used instead.
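If you want to confirm from code that the report exists, rather than relying on the printed message, a small check with the standard library is enough. The helper name find_report below is hypothetical and not part of summarease; it simply joins the output_dir and output_file values used above and verifies that a non-empty file is there.

```python
from pathlib import Path


def find_report(output_dir: str, output_file: str) -> Path:
    """Return the report path, raising if the file is missing or empty.

    Hypothetical helper for illustration; not part of summarease.
    """
    path = Path(output_dir) / output_file
    if not path.is_file():
        raise FileNotFoundError(f"No report found at {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Report at {path} is empty")
    return path


# Usage after running summarize(...) as shown above:
# report = find_report("./dataset_summary/", "iris_summary.pdf")
# print(report.resolve())
```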

Using tables

summarize(
    dataset=iris_df, 
    dataset_name=dataset_name, 
    description=dataset_description,
    summarize_by="table",
    target_variable="target",
    target_type="categorical",
    output_file="iris_summary.pdf",
    output_dir="./dataset_summary/"
)
PDF created!

The message PDF created! indicates that the report was successfully saved in the provided directory ./dataset_summary/ under the provided file name iris_summary.pdf. The report will contain only tables.
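Both examples above write to the same file name, so the second report presumably replaces the first. One way to keep both is to derive a separate file name for each mode. The sketch below uses two hypothetical helpers, report_kwargs and summarize_both, that are not part of summarease; the summarize call itself uses only the arguments shown in this tutorial, and the import is deferred so that the helper definitions can be loaded without the package.

```python
def report_kwargs(mode, base_name="iris_summary", output_dir="./dataset_summary/"):
    """Build the mode-specific keyword arguments for one summarize() call.

    Hypothetical helper for illustration; not part of summarease.
    """
    if mode not in ("plot", "table"):
        raise ValueError(f"mode must be 'plot' or 'table', got {mode!r}")
    return {
        "summarize_by": mode,
        "output_file": f"{base_name}_{mode}.pdf",
        "output_dir": output_dir,
    }


def summarize_both(dataset, **common):
    """Generate one plot-based and one table-based report."""
    # Imported here so the helpers above work even without summarease installed.
    from summarease.summarize import summarize

    for mode in ("plot", "table"):
        summarize(dataset=dataset, **common, **report_kwargs(mode))


# Usage with the variables defined earlier in this tutorial:
# summarize_both(
#     iris_df,
#     dataset_name=dataset_name,
#     description=dataset_description,
#     target_variable="target",
#     target_type="categorical",
# )
```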

Final notes

It is as simple as that! Note that the summarize_by argument controls whether the output is in plot or table format. For more details, see the function documentation with help(summarize).

If you get an error or something goes wrong during report generation, you can submit an issue in the GitHub repo, and it will be addressed as soon as possible. Thanks for your time!