{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Summarize\n", "\n", "This tutorial walks you through the `summarize` function from the `summarease.summarize` module, which summarizes a dataset and writes the result to a PDF report. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Purpose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine you are working on a data science problem and have a dataset from which you need to extract interesting patterns. Time is limited, so you want a convenient function that performs the initial exploratory data analysis for you and generates a quick summary of the dataset at hand. It then compiles the information into a well-organized PDF report, complete with charts and tables, so the team can quickly grasp trends and discuss them without processing the data manually. \n", "\n", "Situations like this call for the `summarize` function from the `summarease` package." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The core functionality of the `summarize` function from the `summarease.summarize` module includes the following:\n", "\n", "**1. Summarize the dataset using plots, which includes:**\n", " \n", "- Dataset description (if provided),\n", "- A plot for summarizing numeric columns,\n", "- A correlation heatmap for numeric columns,\n", "- A plot for summarizing the target variable distribution,\n", "- A table for summarizing the data types.\n", "\n", "\n", "**2. Summarize the dataset using tables, which includes:**\n", " \n", "- Dataset description (if provided),\n", "- A table for summarizing numeric columns,\n", "- A table for summarizing the target variable distribution,\n", "- A table for summarizing the data types." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To import the `summarize` function, use the following code:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from summarease.summarize import summarize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For demonstration purposes, we will use the Iris dataset, loaded with the `load_iris` function from `sklearn.datasets`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing the necessary functions and libraries" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1 - Load the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the `load_iris` function returns a dictionary-like object (a `Bunch`), which we assign to the `iris` variable. Its feature data must first be converted to a pandas DataFrame. We then add the target variable from the `iris` object to the DataFrame as a new column. The code below shows these steps." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |\n", "
|---|---|---|---|---|---|\n", "
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |\n", "
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |\n", "
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |\n", "
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |\n", "
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |\n", "