summarease.summarize
====================

.. py:module:: summarease.summarize


Functions
---------

.. autoapisummary::

   summarease.summarize.validate_or_create_path
   summarease.summarize.add_image
   summarease.summarize.add_table
   summarease.summarize.switch_page_if_needed
   summarease.summarize.summarize


Module Contents
---------------

.. py:function:: validate_or_create_path(path)

   Validate that the provided path is a `Path` object and create any missing directories.

   If the path represents a file, the function ensures that its parent directory exists.
   If the path represents a directory, the function ensures it exists, creating it if necessary.

   :param path: The path to validate or create. Can represent a file or a directory.
   :type path: Path

   :raises TypeError: If the provided `path` is not an instance of `Path`.

   .. rubric:: Notes

   - If `path` is a directory and it does not exist, the function creates it, including
     any necessary parent directories.
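
The behaviour described for `validate_or_create_path` can be sketched with `pathlib` alone. This is a minimal illustration, not the library's implementation. The name `validate_or_create_path_sketch` is hypothetical. Treating any path with a suffix as a file is also an assumption, because the source does not say how files and directories are told apart.

```python
import tempfile
from pathlib import Path


def validate_or_create_path_sketch(path):
    """Hypothetical stand-in for validate_or_create_path (not the library code)."""
    if not isinstance(path, Path):
        raise TypeError(f"expected a pathlib.Path, got {type(path).__name__}")
    # Assumption: a path with a suffix (for example ".pdf") names a file,
    # so only its parent directory has to exist.
    target_dir = path.parent if path.suffix else path
    # parents=True creates missing ancestors; exist_ok=True makes repeat calls safe.
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir


with tempfile.TemporaryDirectory() as tmp:
    file_dir = validate_or_create_path_sketch(Path(tmp) / "reports" / "summary.pdf")
    plots_dir = validate_or_create_path_sketch(Path(tmp) / "reports" / "plots")
    # Both the file's parent and the directory itself now exist.
    created = (file_dir.is_dir(), plots_dir.is_dir())
```

Passing a plain string such as `"reports/summary.pdf"` to the sketch raises `TypeError`, mirroring the documented contract.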


.. py:function:: add_image(pdf, image_path, pdf_height, pdf_width, element_padding=15)

   Adds an image to a PDF document at the current y-position with consideration for page size
   and scaling. If the image height exceeds the remaining space on the current page, a new page
   is added to the PDF. The image is scaled proportionally to fit the page width while maintaining
   the aspect ratio.

   :param pdf: An FPDF object representing the PDF document to which the image will be added.
   :param image_path: The file path to the image to be added. It supports various image
                      formats such as .jpg, .jpeg, .png, .gif, .bmp, .tiff, and .webp.
   :type image_path: str or Path
   :param pdf_height: The total height of the PDF page in units consistent with the FPDF settings.
   :type pdf_height: float
   :param pdf_width: The total width of the PDF page in units consistent with the FPDF settings.
   :type pdf_width: float
   :param element_padding: The padding (in units consistent with FPDF) to be applied between
                           the image and the page's top margin. Default is 15.
   :type element_padding: int, optional

   :returns: The updated FPDF object with the image added at the correct position.
   :rtype: FPDF

   .. rubric:: Notes

   - The function checks if the image file exists and has a valid image extension.
   - The image is scaled proportionally so that it fits within the page width.
   - The function assumes a DPI of 96 for the image size conversion from pixels to millimeters.
   - If the image height exceeds the remaining space on the current page, a new page is created before adding the image.
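
The pixel-to-millimetre conversion and the page-break check in the notes above come down to a few lines of arithmetic. The sketch below is illustrative only. `fit_image` and its `current_y` argument are hypothetical names, and two details are assumptions rather than documented behaviour: padding is reserved on both sides of the page, and the image is only scaled down when it is wider than the available width.

```python
MM_PER_INCH = 25.4
DPI = 96  # the DPI that the notes say is assumed for the conversion


def px_to_mm(pixels, dpi=DPI):
    """Convert a length in pixels to millimetres at the given DPI."""
    return pixels * MM_PER_INCH / dpi


def fit_image(width_px, height_px, pdf_width, pdf_height, current_y, element_padding=15):
    """Return (width_mm, height_mm, needs_new_page); a hypothetical helper."""
    width_mm = px_to_mm(width_px)
    height_mm = px_to_mm(height_px)
    # Assumption: element_padding is kept free on the left and right edges.
    max_width = pdf_width - 2 * element_padding
    if width_mm > max_width:
        # One scale factor for both dimensions keeps the aspect ratio.
        scale = max_width / width_mm
        width_mm *= scale
        height_mm *= scale
    # A new page is needed when the image would run past the bottom padding.
    needs_new_page = current_y + height_mm > pdf_height - element_padding
    return width_mm, height_mm, needs_new_page


# A 960x480 px image on an A4 page (210 x 297 mm), placed near the bottom.
near_bottom = fit_image(960, 480, pdf_width=210, pdf_height=297, current_y=250)
# The same image placed near the top of the page.
near_top = fit_image(960, 480, pdf_width=210, pdf_height=297, current_y=20)
```

In this example the image is 254 mm wide at 96 DPI, so it is scaled down to the 180 mm available and its height shrinks by the same factor. Only the placement near the bottom of the page triggers a page break.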


.. py:function:: add_table(pdf, table, pdf_height, pdf_width, element_padding=15)

   Adds a table to the PDF document with the provided data, scaling the column widths to fit
   within the page width while maintaining their relative proportions. The first row (header)
   has a gray background, and the first column (index) is highlighted with a gray background.

   :param pdf: An FPDF object representing the PDF document to which the table will be added.
   :param table: The table containing the data to be added. The DataFrame index
                 is inserted as the first column of the rendered table.
   :type table: pandas.DataFrame
   :param pdf_height: The total height of the PDF page in units consistent with the FPDF settings.
   :type pdf_height: float
   :param pdf_width: The total width of the PDF page in units consistent with the FPDF settings.
   :type pdf_width: float
   :param element_padding: The padding (in units consistent with FPDF) to be applied
                           around the table. Default is 15.
   :type element_padding: int, optional

   :returns: The updated FPDF object with the table added.
   :rtype: FPDF

   .. rubric:: Notes

   - The function calculates the maximum column width based on the longest entry or column name,
     scaling the column widths to fit the available page width while maintaining relative proportions.
   - The first row (header) is filled with a light gray background, and the first column (index)
     is also highlighted with a gray background for better readability.
   - Column names are truncated if they are too long to fit in the cell, and the font size is adjusted
     accordingly for long column names.
   - Numeric values are rounded to 2 decimal places for consistency.
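
The proportional column scaling described in the notes can be sketched without pandas or FPDF. This is a hedged illustration, not the library's code. `format_cell` and `scaled_column_widths` are hypothetical names, and measuring column width in characters is an assumption (the real function may measure rendered text width instead).

```python
def format_cell(value):
    """Render a cell as text, rounding floats to 2 decimal places."""
    if isinstance(value, float):
        return f"{value:.2f}"
    return str(value)


def scaled_column_widths(columns, rows, available_width):
    """Split available_width across columns in proportion to their longest text."""
    natural = []
    for i, name in enumerate(columns):
        # Longest rendered entry in this column (0 if the table has no rows).
        longest_cell = max((len(format_cell(row[i])) for row in rows), default=0)
        # A column is at least as wide as its header.
        natural.append(max(len(str(name)), longest_cell))
    total = sum(natural)
    if total == 0:
        # Degenerate case: nothing to measure, so share the width equally.
        return [available_width / len(columns)] * len(columns) if columns else []
    # Scale so the widths sum to available_width while keeping their ratios.
    return [available_width * n / total for n in natural]


columns = ["index", "Age", "Salary"]
rows = [
    ["mean", 32.6, 66000.0],
    ["std", 8.173, 12449.9],
]
widths = scaled_column_widths(columns, rows, available_width=180)
```

Here the longest texts are 5, 5, and 8 characters, so the 180 units are split in the ratio 5:5:8.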


.. py:function:: switch_page_if_needed(pdf)

.. py:function:: summarize(dataset: pandas.DataFrame, dataset_name: str = 'Dataset Summary', description: str = 'Dataset summary generated by summarease.', summarize_by: str = 'plot', target_variable: str = None, target_type: str = 'categorical', output_file: str = 'summary.pdf', output_dir: str = './summarease_summary/')

   Summarizes the given dataset by generating various statistics, visualizations,
   and/or tables based on the provided options.

   :param dataset: The dataframe to be summarized.
   :type dataset: pandas.DataFrame
   :param dataset_name: The title of the summary; this can simply be the name of the dataset.
                        Default is "Dataset Summary".
   :type dataset_name: str, optional
   :param description: A description of the dataset that provides context in the summary.
                       Default is "Dataset summary generated by summarease."
   :type description: str, optional
   :param summarize_by: The visual elements used to summarize the dataset: "table" (summarize
                        using tables) or "plot" (summarize using plots). Default is "plot".
   :type summarize_by: str, optional
   :param target_variable: The name of the target (dependent) variable in the dataset.
                           Default is None.
   :type target_variable: str, optional
   :param target_type: The type of the target variable, either "categorical" or "numerical".
                       Default is "categorical".
   :type target_type: str, optional
   :param output_file: The name of the output file where the summary is saved.
                       Default is "summary.pdf".
   :type output_file: str, optional
   :param output_dir: The directory where the output file is saved.
                      Default is "./summarease_summary/".
   :type output_dir: str, optional

   :returns: Nothing. The summary is written to the output file, including statistical
             summaries, visualizations, and cleaning steps (if applicable).
   :rtype: None

   .. rubric:: Notes

   - The `summarize_by` parameter selects the type of summary (table or plot).

   .. rubric:: Example

   >>> import pandas as pd
   >>> from summarease.summarize import summarize
   >>> data = pd.DataFrame({
   ...     "Age": [23, 45, 31, 35, 29],
   ...     "Gender": ["Male", "Female", "Female", "Male", "Male"],
   ...     "Salary": [50000, 60000, 75000, 80000, 65000]
   ... })
   >>> summarize(
   ...     dataset=data,
   ...     dataset_name="Employee Data Summary",
   ...     description="Summary of employee demographic and salary data.",
   ...     summarize_by="plot",
   ...     output_file="employee_summary.pdf"
   ... )

   This generates a summary of the `data` dataframe and saves it as
   'employee_summary.pdf' in the default output directory.