summarease.summarize

Functions

validate_or_create_path(path)

Validate if the provided path is a valid Path object and create necessary directories.

add_image(pdf, image_path, pdf_height, pdf_width[, ...])

Adds an image to a PDF document at the current y-position with consideration for page size and scaling.

add_table(pdf, table, pdf_height, pdf_width[, ...])

Adds a table to the PDF document with the provided data, scaling the column widths to fit within the page width.

switch_page_if_needed(pdf)

summarize(dataset[, dataset_name, description, ...])

Summarizes the given dataset by generating various statistics, visualizations, and/or tables.

Module Contents

summarease.summarize.validate_or_create_path(path)

Validate if the provided path is a valid Path object and create necessary directories.

If the path represents a file, the function ensures that its parent directory exists. If the path represents a directory, the function ensures it exists, creating it if necessary.

Parameters:

path (Path) – The path to validate or create. Can represent a file or a directory.

Raises:

TypeError – If the provided path is not an instance of Path.

Notes

  • If path is a directory and it does not exist, the function creates it, including any necessary parent directories.
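The behavior described above can be sketched with pathlib alone. This is a minimal illustration, not the library's implementation: the name ensure_path is hypothetical, and the rule that a path with a suffix counts as a file is an assumption, since the documentation does not say how files and directories are told apart.

```python
from pathlib import Path


def ensure_path(path):
    """Illustrative stand-in for validate_or_create_path (not the library's code)."""
    if not isinstance(path, Path):
        raise TypeError(f"Expected a pathlib.Path, got {type(path).__name__}")
    # Assumption: a path with a suffix (e.g. ".pdf") is treated as a file,
    # so only its parent directory is created.
    target_dir = path.parent if path.suffix else path
    # parents=True creates missing ancestors; exist_ok=True tolerates
    # directories that already exist.
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir
```

Under these assumptions, ensure_path(Path("out/summary.pdf")) would create out/ without touching the file itself, while passing a plain string raises TypeError.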

summarease.summarize.add_image(pdf, image_path, pdf_height, pdf_width, element_padding=15)

Adds an image to a PDF document at the current y-position with consideration for page size and scaling. If the image height exceeds the remaining space on the current page, a new page is added to the PDF. The image is scaled proportionally to fit the page width while maintaining the aspect ratio.

Parameters:
  • pdf – An FPDF object representing the PDF document to which the image will be added.

  • image_path (str or Path) – The file path to the image to be added. It supports various image formats such as .jpg, .jpeg, .png, .gif, .bmp, .tiff, and .webp.

  • pdf_height (float) – The total height of the PDF page in units consistent with the FPDF settings.

  • pdf_width (float) – The total width of the PDF page in units consistent with the FPDF settings.

  • element_padding (int, optional) – The padding (in units consistent with FPDF) to be applied between the image and the page’s top margin. Default is 15.

Returns:

The updated FPDF object with the image added at the correct position.

Return type:

pdf

Notes

  • The function checks if the image file exists and has a valid image extension.

  • The image is scaled to fit within the page width, and if necessary, a new page is added.

  • The function assumes a DPI of 96 for the image size conversion from pixels to millimeters.

  • If the image height exceeds the remaining space on the current page, a new page is created before adding the image.
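The sizing arithmetic in these notes can be illustrated without FPDF. The sketch below is a guess at the calculation, not the library's code: fit_image and px_to_mm are hypothetical names, the PDF unit is assumed to be millimeters, padding is assumed to apply on both sides and at the bottom of the page, and images narrower than the available width are assumed to be left unscaled.

```python
MM_PER_INCH = 25.4
ASSUMED_DPI = 96  # DPI stated in the notes above


def px_to_mm(pixels, dpi=ASSUMED_DPI):
    """Convert a pixel length to millimeters at the given DPI."""
    return pixels * MM_PER_INCH / dpi


def fit_image(width_px, height_px, pdf_width, pdf_height, current_y,
              element_padding=15):
    """Return (width_mm, height_mm, needs_new_page) for an image."""
    width_mm = px_to_mm(width_px)
    height_mm = px_to_mm(height_px)
    # Assumption: padding is reserved on both the left and right sides.
    available_width = pdf_width - 2 * element_padding
    # Assumption: images are only shrunk, never enlarged.
    scale = min(1.0, available_width / width_mm)
    scaled_width = width_mm * scale
    scaled_height = height_mm * scale  # same factor keeps the aspect ratio
    # Assumption: the usable page ends element_padding above the bottom edge.
    needs_new_page = current_y + scaled_height > pdf_height - element_padding
    return scaled_width, scaled_height, needs_new_page
```

For example, a 960 x 480 pixel image is 254 x 127 mm at 96 DPI. On an A4 page (210 x 297 mm) with the default padding it would shrink to roughly 180 x 90 mm under these assumptions, and it would trigger a page break if the current y-position were 250.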

summarease.summarize.add_table(pdf, table, pdf_height, pdf_width, element_padding=15)

Adds a table to the PDF document with the provided data, scaling the column widths to fit within the page width while maintaining their relative proportions. The first row (header) has a gray background, and the first column (index) is highlighted with a gray background.

Parameters:
  • pdf – An FPDF object representing the PDF document to which the table will be added.

  • table (pandas.DataFrame) – The table containing the data to be added. The DataFrame's index is inserted as the first column of the rendered table.

  • pdf_height (float) – The total height of the PDF page in units consistent with the FPDF settings.

  • pdf_width (float) – The total width of the PDF page in units consistent with the FPDF settings.

  • element_padding (int, optional) – The padding (in units consistent with FPDF) to be applied around the table. Default is 15.

Returns:

The updated FPDF object with the table added.

Return type:

pdf

Notes

  • The function calculates the maximum column width based on the longest entry or column name, scaling the column widths to fit the available page width while maintaining relative proportions.

  • The first row (header) is filled with a light gray background, and the first column (index) is also highlighted with a gray background for better readability.

  • Column names are truncated if they are too long to fit in the cell, and the font size is adjusted accordingly for long column names.

  • Numeric values are rounded to 2 decimal places for consistency.
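The proportional width scaling and numeric rounding described in these notes can be sketched on plain lists. This is an illustration only: scaled_column_widths and format_cell are hypothetical names, character count is used as a stand-in for rendered text width, and only floats are rounded (integers are left as they are), all of which are assumptions rather than documented behavior.

```python
def format_cell(value):
    """Render a cell as text, rounding floats to 2 decimal places."""
    # Assumption: only floats are rounded; ints and strings pass through.
    if isinstance(value, float):
        return f"{value:.2f}"
    return str(value)


def scaled_column_widths(header, rows, available_width):
    """Split available_width across columns in proportion to their longest entry."""
    formatted = [[format_cell(value) for value in row] for row in rows]
    natural = []
    for col, name in enumerate(header):
        # Longest of the column name and every formatted entry in the column.
        longest = max([len(str(name))] + [len(row[col]) for row in formatted])
        natural.append(longest)
    total = sum(natural)
    # Each column keeps its share of the total, so relative proportions hold.
    return [available_width * length / total for length in natural]
```

With header ["id", "name"] and rows [[1, "alice"], [2, "bob"]], the longest entries are 2 and 5 characters, so an available width of 70 would be split into about 20 and 50 units.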

summarease.summarize.switch_page_if_needed(pdf)

summarease.summarize.summarize(dataset: pandas.DataFrame, dataset_name: str = 'Dataset Summary', description: str = 'Dataset summary generated by summarease.', summarize_by: str = 'plot', target_variable: str = None, target_type: str = 'categorical', output_file: str = 'summary.pdf', output_dir: str = './summarease_summary/')

Summarizes the given dataset by generating various statistics, visualizations, and/or tables based on the provided options.

Parameters:

  • dataset (pandas.DataFrame) – The dataframe to be summarized.

  • dataset_name (str, optional) – The title of the summary; this can simply be the name of the dataset. Default is "Dataset Summary".

  • description (str, optional) – A description of the dataset to provide context in the summary. Default is "Dataset summary generated by summarease.".

  • summarize_by (str, optional) – The visual elements used to summarize the dataset: "table" summarizes using tables, "plot" summarizes using plots. Default is "plot", per the signature above.

  • target_variable (str, optional) – The name of the target variable in the dataset. This helps in identifying the dependent variable for further analysis. Default is None.

  • target_type (str, optional) – The type of the target variable, one of "categorical" or "numerical". Default is "categorical".

  • output_file (str, optional) – The name of the output file where the summary will be saved. Default is "summary.pdf".

  • output_dir (str, optional) – The directory where the output summary file will be saved. Default is "./summarease_summary/".

Returns:

None

This function outputs the summary of the dataset in an output file, including statistical summaries, visualizations, and cleaning steps (if applicable).

Notes:

  • The signature above does not list a show_observations parameter, so the number of displayed observations cannot be set through the arguments documented here.

  • The summarize_by parameter offers flexibility in the type of summary (table or plot).

Example:

>>> import pandas as pd
>>> from summarease.summarize import summarize
>>> data = pd.DataFrame({
...     "Age": [23, 45, 31, 35, 29],
...     "Gender": ["Male", "Female", "Female", "Male", "Male"],
...     "Salary": [50000, 60000, 75000, 80000, 65000]
... })
>>> summarize(
...     dataset=data,
...     dataset_name="Employee Data Summary",
...     description="Summary of employee demographic and salary data.",
...     summarize_by="plot",
...     output_file="employee_summary.pdf"
... )
# This will generate a summary of the `data` dataframe
# and save the summary as 'employee_summary.pdf' in the default output directory.
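
The permitted option values and the resulting output location can be checked before calling summarize. The helper below is a hypothetical pre-flight check, not part of summarease. It assumes the output path is simply output_dir joined with output_file, and that values outside the documented sets should be rejected. The documentation does not say how summarize itself handles invalid values.

```python
from pathlib import Path

# Values documented for summarize_by and target_type above.
ALLOWED_SUMMARIZE_BY = {"table", "plot"}
ALLOWED_TARGET_TYPES = {"categorical", "numerical"}


def check_summarize_options(summarize_by="plot", target_type="categorical",
                            output_file="summary.pdf",
                            output_dir="./summarease_summary/"):
    """Hypothetical helper: validate options and return the expected PDF path."""
    if summarize_by not in ALLOWED_SUMMARIZE_BY:
        raise ValueError(f"summarize_by must be one of {sorted(ALLOWED_SUMMARIZE_BY)}")
    if target_type not in ALLOWED_TARGET_TYPES:
        raise ValueError(f"target_type must be one of {sorted(ALLOWED_TARGET_TYPES)}")
    # Assumption: the summary is written to output_dir / output_file.
    return Path(output_dir) / output_file
```

For the example above, check_summarize_options(output_file="employee_summary.pdf") would point at summarease_summary/employee_summary.pdf under the current working directory.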