summarease.summarize
Functions
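| validate_or_create_path(path) | Validate if the provided path is a valid Path object and create necessary directories. |
| add_image(pdf, image_path, pdf_height, pdf_width, element_padding=15) | Adds an image to a PDF document at the current y-position with consideration for page size and scaling. |
| add_table(pdf, table, pdf_height, pdf_width, element_padding=15) | Adds a table to the PDF document with the provided data, scaling the column widths to fit within the page width while maintaining their relative proportions. |
| summarize(dataset, ...) | Summarizes the given dataset by generating various statistics, visualizations, and/or tables based on the provided options. |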
Module Contents
- summarease.summarize.validate_or_create_path(path)
Validate if the provided path is a valid Path object and create necessary directories.
If the path represents a file, the function ensures that its parent directory exists. If the path represents a directory, the function ensures it exists, creating it if necessary.
- Parameters:
path (Path) – The path to validate or create. Can represent a file or a directory.
- Raises:
TypeError – If the provided path is not an instance of Path.
Notes
If path is a directory and it does not exist, the function creates it, including any necessary parent directories.
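The behaviour described above can be sketched with pathlib alone. This is an illustration, not summarease's implementation: ensure_path is a hypothetical stand-in, and the rule "a path with a suffix is a file" is an assumption made only for this sketch.

```python
from pathlib import Path
import tempfile


def ensure_path(path):
    """Hypothetical stand-in mirroring the documented behaviour."""
    if not isinstance(path, Path):
        raise TypeError("path must be a pathlib.Path instance")
    # Assumption: a path with a suffix (e.g. ".pdf") denotes a file,
    # so only its parent directory is created.
    target_dir = path.parent if path.suffix else path
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    file_parent = ensure_path(base / "reports" / "summary.pdf")
    plots_dir = ensure_path(base / "reports" / "plots")
    # Both directories exist while the temporary directory is alive.
    created = (file_parent.is_dir(), plots_dir.is_dir())
```

Passing a plain string instead of a Path raises TypeError, which matches the Raises entry above.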
- summarease.summarize.add_image(pdf, image_path, pdf_height, pdf_width, element_padding=15)
Adds an image to a PDF document at the current y-position with consideration for page size and scaling. If the image height exceeds the remaining space on the current page, a new page is added to the PDF. The image is scaled proportionally to fit the page width while maintaining the aspect ratio.
- Parameters:
pdf – A FPDF object representing the PDF document to which the image will be added.
image_path (str or Path) – The file path to the image to be added. It supports various image formats such as .jpg, .jpeg, .png, .gif, .bmp, .tiff, and .webp.
pdf_height (float) – The total height of the PDF page in units consistent with the FPDF settings.
pdf_width (float) – The total width of the PDF page in units consistent with the FPDF settings.
element_padding (int, optional) – The padding (in units consistent with FPDF) to be applied between the image and the page’s top margin. Default is 15.
- Returns:
The updated FPDF object with the image added at the correct position.
- Return type:
pdf
Notes
The function checks if the image file exists and has a valid image extension.
The image is scaled to fit within the page width, and if necessary, a new page is added.
The function assumes a DPI of 96 for the image size conversion from pixels to millimeters.
If the image height exceeds the remaining space on the current page, a new page is created before adding the image.
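The sizing arithmetic in these notes can be worked through without FPDF. The sketch below is an approximation under stated assumptions: plan_image is a hypothetical helper, page units are millimeters, padding is applied on both sides of the page, and images are only scaled down, never enlarged. The library's exact rules may differ.

```python
MM_PER_INCH = 25.4
ASSUMED_DPI = 96  # DPI stated in the notes above


def px_to_mm(pixels, dpi=ASSUMED_DPI):
    """Convert a pixel length to millimeters at the given DPI."""
    return pixels * MM_PER_INCH / dpi


def plan_image(width_px, height_px, page_width, page_height, current_y, padding=15):
    """Return (width_mm, height_mm, needs_new_page) for an image (hypothetical helper)."""
    width_mm = px_to_mm(width_px)
    height_mm = px_to_mm(height_px)
    # Assumption: padding is reserved on both the left and right side.
    available_width = page_width - 2 * padding
    # Assumption: shrink to fit the width, keeping the aspect ratio; never enlarge.
    scale = min(1.0, available_width / width_mm)
    width_mm *= scale
    height_mm *= scale
    # A new page is needed when the image would run past the bottom padding.
    needs_new_page = current_y + height_mm > page_height - padding
    return width_mm, height_mm, needs_new_page


# A 960x480 px image on an A4 page (210 x 297 mm), cursor near the bottom.
width_mm, height_mm, needs_new_page = plan_image(960, 480, 210, 297, current_y=250)
```

At 96 DPI a 960 px wide image is 254 mm wide, so on an A4 page it is scaled to the 180 mm that remain after padding, and its height shrinks by the same factor.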
- summarease.summarize.add_table(pdf, table, pdf_height, pdf_width, element_padding=15)
Adds a table to the PDF document with the provided data, scaling the column widths to fit within the page width while maintaining their relative proportions. The first row (header) has a gray background, and the first column (index) is highlighted with a gray background.
- Parameters:
pdf – A FPDF object representing the PDF document to which the table will be added.
table (pandas.DataFrame) – The table containing the data to be added. The DataFrame's index is inserted as the first column of the rendered table.
pdf_height (float) – The total height of the PDF page in units consistent with the FPDF settings.
pdf_width (float) – The total width of the PDF page in units consistent with the FPDF settings.
element_padding (int, optional) – The padding (in units consistent with FPDF) to be applied around the table. Default is 15.
- Returns:
The updated FPDF object with the table added.
- Return type:
pdf
Notes
The function calculates the maximum column width based on the longest entry or column name, scaling the column widths to fit the available page width while maintaining relative proportions.
The first row (header) is filled with a light gray background, and the first column (index) is also highlighted with a gray background for better readability.
Column names are truncated if they are too long to fit in the cell, and the font size is adjusted accordingly for long column names.
Numeric values are rounded to 2 decimal places for consistency.
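The proportional column scaling and the rounding rule can be illustrated in plain Python. This is a sketch under assumptions, not the library's code: scaled_column_widths and format_cell are hypothetical names, and width is taken to be proportional to the character count of the longest entry or column name.

```python
def format_cell(value):
    """Render a cell as text, rounding floats to 2 decimal places."""
    if isinstance(value, float):
        return str(round(value, 2))
    return str(value)


def scaled_column_widths(header, rows, available_width):
    """Split available_width across columns in proportion to their longest text."""
    longest = [len(str(name)) for name in header]
    for row in rows:
        for i, value in enumerate(row):
            longest[i] = max(longest[i], len(format_cell(value)))
    total = sum(longest)
    # Each column keeps its share of the total, so relative proportions survive.
    return [available_width * n / total for n in longest]


widths = scaled_column_widths(
    ["name", "value"],
    [["a", 1.23456], ["bb", 2.5]],
    available_width=90,
)
```

Here the longest texts are "name" (4 characters) and "value" (5 characters), so the 90 available units are split 4:5 between the two columns.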
- summarease.summarize.summarize(dataset: pandas.DataFrame, dataset_name: str = 'Dataset Summary', description: str = 'Dataset summary generated by summarease.', summarize_by: str = 'plot', target_variable: str = None, target_type: str = 'categorical', output_file: str = 'summary.pdf', output_dir: str = './summarease_summary/')
Summarizes the given dataset by generating various statistics, visualizations, and/or tables based on the provided options.
Parameters:
- dataset : pd.DataFrame
The dataframe to be summarized.
- dataset_name : str, optional, default="Dataset Summary"
The title of the summary; this can simply be the name of the dataset.
- description : str, optional, default="Dataset summary generated by summarease."
A description of the dataset to provide context in the summary.
- summarize_by : str, optional, default="plot"
Specifies what visual elements to use when summarizing the dataset:
- "table" : Summarize using tables.
- "plot" : Summarize using plots.
- target_variable : str, optional, default=None
The name of the target variable in the dataset. This helps in identifying the dependent variable for further analysis.
- target_type : str, one of {"categorical", "numerical"}, default="categorical"
The type of target variable.
- output_file : str, optional, default="summary.pdf"
The name of the output file where the summary will be saved.
- output_dir : str, optional, default="./summarease_summary/"
The directory where the output summary file will be saved.
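Since summarize_by and target_type each accept a small fixed set of strings, a caller may want to check them before building a report. The sketch below is a hypothetical pre-check written for illustration; check_options is not part of summarease, and how the library itself reacts to an unsupported value is not documented here.

```python
ALLOWED_SUMMARIZE_BY = {"table", "plot"}
ALLOWED_TARGET_TYPES = {"categorical", "numerical"}


def check_options(summarize_by="plot", target_type="categorical"):
    """Hypothetical caller-side check of the two enumerated options."""
    if summarize_by not in ALLOWED_SUMMARIZE_BY:
        raise ValueError(
            f"summarize_by must be one of {sorted(ALLOWED_SUMMARIZE_BY)}, got {summarize_by!r}"
        )
    if target_type not in ALLOWED_TARGET_TYPES:
        raise ValueError(
            f"target_type must be one of {sorted(ALLOWED_TARGET_TYPES)}, got {target_type!r}"
        )
    return summarize_by, target_type


# The defaults from the signature pass the check unchanged.
options = check_options()
```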
Returns:
- None
The function writes the summary of the dataset to the output file, including statistical summaries, visualizations, and cleaning steps (if applicable).
Notes:
The show_observations parameter can be customized to display a certain number of observations.
The summarize_by parameter selects the type of summary: "table" or "plot".
Example:
>>> import pandas as pd
>>> from summarease import summarize
>>> data = pd.DataFrame({
...     "Age": [23, 45, 31, 35, 29],
...     "Gender": ["Male", "Female", "Female", "Male", "Male"],
...     "Salary": [50000, 60000, 75000, 80000, 65000]
... })
>>> summarize(
...     dataset=data,
...     dataset_name="Employee Data Summary",
...     description="Summary of employee demographic and salary data.",
...     summarize_by="plot",
...     output_file="employee_summary.pdf"
... )
# This will generate a summary of the `data` dataframe
# and save the summary as 'employee_summary.pdf' in the default output directory.
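Because summarize returns None, the result of the call is the file on disk. A minimal sketch of where that file would be expected, assuming the file name is simply joined onto output_dir (the documented parameter descriptions suggest this, but the join itself is an assumption):

```python
from pathlib import Path

output_dir = "./summarease_summary/"   # documented default for output_dir
output_file = "employee_summary.pdf"   # value passed in the example above

# Assumption: the report path is output_dir joined with output_file.
expected_pdf = Path(output_dir) / output_file
```

After running the example, expected_pdf.exists() is a quick way to confirm that the report was written.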