Summarize_numeric

This tutorial walks through the summarease.summarize_numeric module, which generates summary statistics and visualizations for the numeric variables in a dataset so you can inspect their distributions and the relationships between them.

Getting started

The summarease.summarize_numeric module offers the following core functionalities:

  1. Summarizing Numeric Variables:

  • Outputs summary statistics for each numeric column (mean, standard deviation, min, max, etc.).

  • Useful for quickly understanding the distribution and spread of numeric features.

  2. Visualizing Numeric Relationships:

  • Generates density plots for each numeric column to visualize their distributions.

  • Creates a correlation heatmap to explore the relationships between multiple numeric variables.

Necessary libraries

To use the summarease.summarize_numeric module, make sure pandas, altair, and summarease are installed, then import the following:

import pandas as pd
import altair as alt
from summarease.summarize_numeric import summarize_numeric, plot_numeric_density, plot_correlation_heatmap

Example dataset

We’ll use the following dataset to demonstrate the module’s functionality:

Dataset: Numeric Data

numeric_data = pd.DataFrame({
    'feature1': [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1],
    'feature2': [3.2, 4.5, 5.1, 6.0, 7.8, 8.3, 9.1, 10.7],
    'feature3': [1.1, 2.2, 3.1, 4.0, 5.4, 6.6, 7.7, 8.5]
})

Example usage

summarize_numeric

This function calculates summary statistics or generates visualizations for numeric columns. It takes two parameters:

  • dataset: The input dataset containing numeric variables. It must be a DataFrame.

  • summarize_by: Specifies the format for summarizing the numeric variables. It can be either "table" (default) to generate a summary table, or "plot" to generate visualizations like density plots and a correlation heatmap.
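Because dataset must be a DataFrame of numeric variables, it can help to isolate the numeric columns yourself before calling the function. The sketch below is only an illustration using plain pandas: the mixed_data frame and its label column are hypothetical, and this tutorial does not say how summarize_numeric itself treats non-numeric columns.

```python
import pandas as pd

# Hypothetical mixed-type dataset (not part of the tutorial's example data)
mixed_data = pd.DataFrame({
    "feature1": [1.2, 2.3, 3.1],
    "label": ["a", "b", "c"],   # non-numeric column
    "feature2": [3, 4, 5],
})

# Confirm the input really is a DataFrame before going further
if not isinstance(mixed_data, pd.DataFrame):
    raise TypeError("dataset must be a pandas DataFrame")

# Keep only integer and float columns
numeric_only = mixed_data.select_dtypes(include="number")
print(list(numeric_only.columns))  # ['feature1', 'feature2']
```

The resulting numeric_only frame could then be passed as the dataset argument.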

Example 1: Summarizing with a Table

# Summarize numeric variables in the dataset by generating summary statistics
summary_table = summarize_numeric(dataset=numeric_data, summarize_by="table")
summary_table["numeric_describe"]
        feature1   feature2  feature3
count   8.000000   8.000000  8.000000
mean    5.325000   6.837500  4.825000
std     3.137219   2.550035  2.663912
min     1.200000   3.200000  1.100000
25%     2.900000   4.950000  2.875000
50%     5.150000   6.900000  4.700000
75%     7.250000   8.500000  6.875000
max    10.100000  10.700000  8.500000
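The rows of this table (count, mean, std, quartiles, min, max) have the same layout as the output of pandas' DataFrame.describe(). Assuming that correspondence holds, which this tutorial does not state explicitly, the figures can be cross-checked with pandas alone:

```python
import pandas as pd

# Same example dataset as in this tutorial
numeric_data = pd.DataFrame({
    'feature1': [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1],
    'feature2': [3.2, 4.5, 5.1, 6.0, 7.8, 8.3, 9.1, 10.7],
    'feature3': [1.1, 2.2, 3.1, 4.0, 5.4, 6.6, 7.7, 8.5]
})

# describe() reports count, mean, sample std, min, quartiles, and max
reference = numeric_data.describe()
print(reference)

# Individual statistics can be looked up by row label and column name
print(reference.loc["mean", "feature1"])
```

Note that std here is the sample standard deviation (dividing by n - 1), which is the pandas default.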

Example 2: Visualizing Numeric Relationships

# Visualize the distribution of numeric variables and their correlations
numeric_plots = summarize_numeric(dataset=numeric_data, summarize_by="plot")
numeric_plots["numeric_plot"].show()  # Display density plots
numeric_plots["corr_plot"].show()    # Display correlation heatmap
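A correlation heatmap is a colored rendering of a pairwise correlation matrix. To see the kind of numbers such a plot encodes, the sketch below computes that matrix directly with pandas. It assumes Pearson correlation; this tutorial does not say which method corr_plot uses, so treat the values as a reference point rather than a guaranteed match.

```python
import pandas as pd

# Same example dataset as in this tutorial
numeric_data = pd.DataFrame({
    'feature1': [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1],
    'feature2': [3.2, 4.5, 5.1, 6.0, 7.8, 8.3, 9.1, 10.7],
    'feature3': [1.1, 2.2, 3.1, 4.0, 5.4, 6.6, 7.7, 8.5]
})

# Pairwise Pearson correlation between all numeric columns (assumed method)
corr_matrix = numeric_data.corr(method="pearson")
print(corr_matrix.round(3))
```

All three features increase together in this dataset, so every off-diagonal entry is strongly positive.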

In conclusion, the summarease.summarize_numeric module provides an easy way to explore your dataset’s numeric features, whether you need a table of summary statistics or plots of their distributions and correlations.

Final notes

If you run into an error or unexpected behavior while using the function, please open an issue in the GitHub repo and it will be addressed as soon as possible. Thanks for your time!