{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Summarize_numeric\n", "\n", "This tutorial walks you through the `summarease.summarize_numeric` module, which provides tools to generate summary statistics and visualizations for the numeric variables in a dataset, helping you understand their distributions and the relationships among them.\n", "\n", "## Getting started\n", "\n", "\n", "The `summarease.summarize_numeric` module offers the following core functionality:\n", "\n", "\n", "1. **Summarizing Numeric Variables:**\n", "   - Outputs summary statistics for each numeric column (count, mean, standard deviation, min, quartiles, max).\n", "   - Useful for quickly understanding the distribution and spread of numeric features.\n", "\n", "2. **Visualizing Numeric Relationships:**\n", "   - Generates a density plot for each numeric column to visualize its distribution.\n", "   - Creates a correlation heatmap to explore the relationships between multiple numeric variables.\n", "\n", "## Necessary libraries\n", "\n", "\n", "To use the `summarease.summarize_numeric` module, make sure the following libraries are installed, then import them:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import altair as alt\n", "from summarease.summarize_numeric import summarize_numeric, plot_numeric_density, plot_correlation_heatmap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example dataset\n", "\n", "\n", "We'll use the following dataset to demonstrate the module's functionality:\n", "\n", "### Dataset: Numeric Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "numeric_data = pd.DataFrame({\n", "    'feature1': [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1],\n", "    'feature2': [3.2, 4.5, 5.1, 6.0, 7.8, 8.3, 9.1, 10.7],\n", "    'feature3': [1.1, 2.2, 3.1, 4.0, 5.4, 6.6, 7.7, 8.5]\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example usage\n",
"\n", "### `summarize_numeric`\n", "\n", "This function calculates summary statistics or generates visualizations for numeric columns. It takes two parameters:\n", "\n", "\n", "- `dataset`: The input dataset containing numeric variables. It must be a DataFrame.\n", "- `summarize_by`: Specifies the format for summarizing the numeric variables. It can be either `\"table\"` (default) to generate a summary table, or `\"plot\"` to generate visualizations like density plots and a correlation heatmap.\n", "\n", "#### Example 1: Summarizing with a Table" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | feature1 | feature2 | feature3 |
|---|---|---|---|
| count | 8.000000 | 8.000000 | 8.000000 |
| mean | 5.325000 | 6.837500 | 4.825000 |
| std | 3.137219 | 2.550035 | 2.663912 |
| min | 1.200000 | 3.200000 | 1.100000 |
| 25% | 2.900000 | 4.950000 | 2.875000 |
| 50% | 5.150000 | 6.900000 | 4.700000 |
| 75% | 7.250000 | 8.500000 | 6.875000 |
| max | 10.100000 | 10.700000 | 8.500000 |
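
The rows in this table (count, mean, std, min, quartiles, max) have the same layout as the output of pandas' `DataFrame.describe()`. As a cross-check, the sketch below reproduces the figures with pandas alone. Whether `summarize_numeric` calls `describe()` internally is an assumption here, not something this tutorial confirms, and the variable name `summary` is introduced only for illustration.

```python
import pandas as pd

# Same example data as in the "Example dataset" section.
numeric_data = pd.DataFrame({
    "feature1": [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1],
    "feature2": [3.2, 4.5, 5.1, 6.0, 7.8, 8.3, 9.1, 10.7],
    "feature3": [1.1, 2.2, 3.1, 4.0, 5.4, 6.6, 7.7, 8.5],
})

# Assumption: the summary table is equivalent to describe() on the numeric columns.
# select_dtypes keeps only numeric columns, so non-numeric ones would be ignored.
summary = numeric_data.select_dtypes(include="number").describe()

print(summary.round(6))
```

Comparing `summary` against the table is a quick way to confirm that the statistics use the sample standard deviation (divisor n - 1) and linearly interpolated quartiles, which are the pandas defaults.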