summarease.summarize
====================

.. py:module:: summarease.summarize


Functions
---------

.. autoapisummary::

   summarease.summarize.validate_or_create_path
   summarease.summarize.add_image
   summarease.summarize.add_table
   summarease.summarize.switch_page_if_needed
   summarease.summarize.summarize


Module Contents
---------------

.. py:function:: validate_or_create_path(path)

   Validate that the provided path is a `Path` object and create any necessary directories.

   If the path represents a file, the function ensures that its parent directory exists.
   If the path represents a directory, the function ensures it exists, creating it if necessary.

   :param path: The path to validate or create. Can represent a file or a directory.
   :type path: Path

   :raises TypeError: If the provided `path` is not an instance of `Path`.

   .. rubric:: Notes

   - If `path` is a directory and it does not exist, the function creates it, including any necessary
     parent directories.


.. py:function:: add_image(pdf, image_path, pdf_height, pdf_width, element_padding=15)

   Adds an image to a PDF document at the current y-position, taking page size and scaling into account.

   If the image height exceeds the remaining space on the current page, a new page is added to the PDF.
   The image is scaled proportionally to fit the page width while maintaining the aspect ratio.

   :param pdf: An FPDF object representing the PDF document to which the image will be added.
   :param image_path: The file path to the image to be added. Supported image formats include .jpg, .jpeg, .png, .gif, .bmp, .tiff, and .webp.
   :type image_path: str or Path
   :param pdf_height: The total height of the PDF page in units consistent with the FPDF settings.
   :type pdf_height: float
   :param pdf_width: The total width of the PDF page in units consistent with the FPDF settings.
   :type pdf_width: float
   :param element_padding: The padding (in units consistent with FPDF) to be applied between the image and the page's top margin. Default is 15.
   :type element_padding: int, optional

   :returns: The updated FPDF object with the image added at the correct position.
   :rtype: FPDF

   .. rubric:: Notes

   - The function checks that the image file exists and has a valid image extension.
   - The image is scaled to fit within the page width, and if necessary, a new page is added.
   - The function assumes a DPI of 96 when converting the image size from pixels to millimeters.
   - If the image height exceeds the remaining space on the current page, a new page is created before
     adding the image.


.. py:function:: add_table(pdf, table, pdf_height, pdf_width, element_padding=15)

   Adds a table to the PDF document with the provided data, scaling the column widths to fit within
   the page width while maintaining their relative proportions.
   The first row (header) and the first column (index) are both highlighted with a gray background.

   :param pdf: An FPDF object representing the PDF document to which the table will be added.
   :param table: The table containing the data to be added. The DataFrame's index is inserted into the table as its first column.
   :type table: pandas.DataFrame
   :param pdf_height: The total height of the PDF page in units consistent with the FPDF settings.
   :type pdf_height: float
   :param pdf_width: The total width of the PDF page in units consistent with the FPDF settings.
   :type pdf_width: float
   :param element_padding: The padding (in units consistent with FPDF) to be applied around the table. Default is 15.
   :type element_padding: int, optional

   :returns: The updated FPDF object with the table added.
   :rtype: FPDF

   .. rubric:: Notes

   - The function calculates the maximum width of each column from its longest entry or its column name,
     then scales the column widths to fit the available page width while maintaining relative proportions.
   - The first row (header) is filled with a light gray background, and the first column (index) is
     also highlighted with a gray background for better readability.
   - Column names are truncated if they are too long to fit in the cell, and the font size is adjusted
     accordingly for long column names.
   - Numeric values are rounded to 2 decimal places for consistency.


.. py:function:: switch_page_if_needed(pdf)

.. py:function:: summarize(dataset: pandas.DataFrame, dataset_name: str = 'Dataset Summary', description: str = 'Dataset summary generated by summarease.', summarize_by: str = 'plot', target_variable: str = None, target_type: str = 'categorical', output_file: str = 'summary.pdf', output_dir: str = './summarease_summary/')

   Summarizes the given dataset by generating various statistics, visualizations, and/or tables
   based on the provided options.

   :param dataset: The dataframe to be summarized.
   :type dataset: pandas.DataFrame
   :param dataset_name: The title of the summary; this can simply be the name of the dataset. Default is "Dataset Summary".
   :type dataset_name: str, optional
   :param description: A description of the dataset to provide context in the summary. Default is "Dataset summary generated by summarease."
   :type description: str, optional
   :param summarize_by: Specifies which visual elements to use when summarizing the dataset: "table" (summarize using tables) or "plot" (summarize using plots). Default is "plot".
   :type summarize_by: str, optional
   :param target_variable: The name of the target variable in the dataset. This helps in identifying the dependent variable for further analysis. Default is None.
   :type target_variable: str, optional
   :param target_type: The type of the target variable, one of "categorical" or "numerical". Default is "categorical".
   :type target_type: str, optional
   :param output_file: The name of the output file where the summary will be saved. Default is "summary.pdf".
   :type output_file: str, optional
   :param output_dir: The directory where the output summary file will be saved. Default is "./summarease_summary/".
   :type output_dir: str, optional

   :returns: Nothing. The function writes the summary of the dataset to an output file, including statistical summaries, visualizations, and cleaning steps (if applicable).
   :rtype: None

   .. rubric:: Notes

   - The `show_observations` parameter can be customized to display a certain number of observations.
   - The `summarize_by` parameter offers flexibility in the type of summary (table or plot).

   .. rubric:: Example

   >>> import pandas as pd
   >>> from summarease import summarize
   >>> data = pd.DataFrame({
   ...     "Age": [23, 45, 31, 35, 29],
   ...     "Gender": ["Male", "Female", "Female", "Male", "Male"],
   ...     "Salary": [50000, 60000, 75000, 80000, 65000]
   ... })
   >>> summarize(
   ...     dataset=data,
   ...     dataset_name="Employee Data Summary",
   ...     description="Summary of employee demographic and salary data.",
   ...     summarize_by="plot",
   ...     output_file="employee_summary.pdf"
   ... )

   This generates a summary of the `data` dataframe and saves it as 'employee_summary.pdf'
   in the default output directory.
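
As a minimal sketch of the behaviour documented for `validate_or_create_path` above (not the library's actual implementation), the hypothetical helper `ensure_path` below rejects non-`Path` input and creates any missing directories. How the real function tells a file path from a directory path is not stated in this reference, so the sketch assumes that a path whose last component has a suffix such as `.pdf` is a file.

```python
import tempfile
from pathlib import Path


def ensure_path(path):
    """Hypothetical stand-in for validate_or_create_path (illustration only)."""
    if not isinstance(path, Path):
        raise TypeError(f"Expected a pathlib.Path, got {type(path).__name__}")
    # Assumption: a suffix (e.g. ".pdf") marks a file, so only its parent
    # directory is created; otherwise the path itself is the directory.
    target_dir = path.parent if path.suffix else path
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir


# Try it in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "summarease_summary"
    output_file = output_dir / "summary.pdf"
    created = ensure_path(output_file)
    dir_exists = created.is_dir()         # the parent directory now exists
    file_exists = output_file.exists()    # the file itself is not created
```

Under these assumptions, a helper of this kind would create a directory such as `./summarease_summary/` ahead of time and leave `summary.pdf` itself to be written later.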