build_tools.pyphen_syllable_normaliser.aggregator
=================================================

.. py:module:: build_tools.pyphen_syllable_normaliser.aggregator

.. autoapi-nested-parse::

   File aggregation for the syllable normalization pipeline.

   This module handles Step 1 of the normalization pipeline: combining multiple
   input files into a single raw syllable file while preserving all occurrences
   and maintaining raw counts.
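
The point of preserving every occurrence can be illustrated with a minimal sketch. It uses in-memory lists in place of real input files; the names ``file_a``, ``file_b``, and ``raw`` are hypothetical and not part of this module. Because aggregation is plain concatenation, the raw counts of the combined list equal the sum of the per-file counts.

```python
from collections import Counter

# Hypothetical contents of two input files, already split into syllables.
file_a = ["ka", "ra", "ka"]
file_b = ["ka", "mi"]

# Aggregation without deduplication is plain concatenation.
raw = file_a + file_b

# Counting the combined list gives the same totals as adding per-file counts.
raw_counts = Counter(raw)
per_file_counts = Counter(file_a) + Counter(file_b)

print(raw_counts == per_file_counts)  # True
print(raw_counts["ka"])  # 3
```

Deduplicating at this stage would collapse ``"ka"`` to a single entry and lose the frequency information that later pipeline steps rely on.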


Classes
-------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.aggregator.FileAggregator


Functions
---------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.aggregator.discover_input_files


Module Contents
---------------

.. py:class:: FileAggregator

   Aggregates syllables from multiple input files.

   This class handles the first step of the normalization pipeline: combining
   syllables from multiple .txt files into a single raw aggregated file. All
   occurrences are preserved (no deduplication), maintaining the original
   frequency distribution from the input files.

   .. admonition:: Example

      >>> from pathlib import Path
      >>> from build_tools.pyphen_syllable_normaliser.aggregator import FileAggregator
      >>> aggregator = FileAggregator()
      >>> input_files = [Path("file1.txt"), Path("file2.txt")]
      >>> syllables = aggregator.aggregate_files(input_files)
      >>> len(syllables)  # Total from both files
      450
      >>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))


   .. py:method:: aggregate_files(input_files)

      Aggregate syllables from multiple input files.

      Reads all syllables from the provided input files and combines them
      into a single list. Each line in each input file is treated as one
      syllable. Empty lines are skipped. All occurrences are preserved
      (no deduplication).

      :param input_files: List of Path objects pointing to input .txt files.
                          Each file should contain one syllable per line.

      :returns: List of all syllables from all input files, preserving duplicates
                and maintaining the original order (file by file).

      :raises FileNotFoundError: If any input file does not exist.
      :raises PermissionError: If any input file cannot be read.
      :raises UnicodeDecodeError: If any input file contains invalid UTF-8.

      .. admonition:: Example

         >>> from pathlib import Path
         >>> aggregator = FileAggregator()
         >>> files = [Path("corpus1.txt"), Path("corpus2.txt")]
         >>> syllables = aggregator.aggregate_files(files)
         >>> syllables[:3]
         ['hello', 'world', 'test']

      .. note::

         Files are processed in the order given. For deterministic output,
         sort ``input_files`` before calling (``discover_input_files``
         already returns a sorted list).


   .. py:method:: read_syllables_from_file(file_path)

      Read syllables from a single file.

      Reads a file line by line, treating each line as one syllable.
      Blank and whitespace-only lines are skipped. Apart from stripping
      surrounding whitespace, no normalization or transformation is
      applied; syllables are preserved exactly as they appear in the file.

      :param file_path: Path to the input file to read.

      :returns: List of syllable strings from the file, one per non-empty line.

      :raises FileNotFoundError: If the file does not exist.
      :raises PermissionError: If the file cannot be read.
      :raises UnicodeDecodeError: If the file contains invalid UTF-8.

      .. admonition:: Example

         >>> from pathlib import Path
         >>> aggregator = FileAggregator()
         >>> syllables = aggregator.read_syllables_from_file(Path("input.txt"))
         >>> syllables
         ['ka', 'ra', 'mi', 'ka', 'ta']

      .. note::

         Leading and trailing whitespace is stripped from each line,
         but the syllable content itself is not modified. This allows
         files with varying whitespace formatting to be processed
         consistently.


   .. py:method:: save_raw_syllables(syllables, output_path)

      Save raw aggregated syllables to file.

      Writes syllables to the output file, one per line, in the order
      provided. This creates the syllables_raw.txt file for the pipeline.
      All syllables are written exactly as provided (no normalization).

      :param syllables: List of syllable strings to write.
      :param output_path: Path where the raw syllables file should be saved.

      :raises PermissionError: If the output file cannot be written.
      :raises OSError: If there are filesystem issues (disk full, etc.).

      .. admonition:: Example

         >>> from pathlib import Path
         >>> aggregator = FileAggregator()
         >>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta']
         >>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
         >>> Path("syllables_raw.txt").read_text(encoding="utf-8").splitlines()
         ['ka', 'ra', 'mi', 'ka', 'ta']

      .. note::

         This method creates the output file if it doesn't exist and
         overwrites it if it does. The output directory must already exist.
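
The behaviour documented for the three methods above can be sketched as standalone functions. This is a minimal illustration under stated assumptions, not the class's actual implementation: the function names ``read_syllables``, ``aggregate``, and ``save_raw`` are hypothetical, and UTF-8 is assumed because the methods document ``UnicodeDecodeError`` for invalid UTF-8.

```python
import tempfile
from pathlib import Path


def read_syllables(file_path: Path) -> list[str]:
    """Hypothetical stand-in: one syllable per non-blank line, whitespace stripped."""
    with file_path.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def aggregate(input_files: list[Path]) -> list[str]:
    """Concatenate syllables file by file, keeping every duplicate."""
    syllables: list[str] = []
    for path in input_files:
        syllables.extend(read_syllables(path))
    return syllables


def save_raw(syllables: list[str], output_path: Path) -> None:
    """Write one syllable per line, overwriting any existing file."""
    with output_path.open("w", encoding="utf-8") as handle:
        handle.writelines(f"{syllable}\n" for syllable in syllables)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    first = tmp_dir / "corpus1.txt"
    second = tmp_dir / "corpus2.txt"
    # The blank line and the padded "  ra " exercise skipping and stripping.
    first.write_text("ka\n\n  ra \nmi\n", encoding="utf-8")
    second.write_text("ka\nta\n", encoding="utf-8")

    combined = aggregate([first, second])
    raw_path = tmp_dir / "syllables_raw.txt"
    save_raw(combined, raw_path)
    written = raw_path.read_text(encoding="utf-8").splitlines()

print(combined)  # ['ka', 'ra', 'mi', 'ka', 'ta']
print(written == combined)  # True
```

The duplicate ``'ka'`` appears twice in the result, once per source file, and the order follows the order of the input list.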


.. py:function:: discover_input_files(source_dir, pattern = '*.txt', recursive = False)

   Discover input files in a directory matching a pattern.

   Scans a directory for files matching the specified glob pattern.
   Returns files in sorted order for deterministic processing.

   :param source_dir: Directory to scan for input files.
   :param pattern: Glob pattern for matching files. Default: ``"*.txt"``.
   :param recursive: If True, also scan subdirectories by prefixing the pattern
                     with ``**/``. Default: False (scan only the immediate directory).

   :returns: Sorted list of Path objects for all matching files.

   :raises ValueError: If source_dir is not a directory.
   :raises FileNotFoundError: If source_dir does not exist.

   .. admonition:: Example

      >>> from pathlib import Path
      >>> # Non-recursive scan
      >>> files = discover_input_files(Path("data/"), pattern="*.txt")
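      >>> files  # repr shown for POSIX; Windows displays WindowsPath
      [PosixPath('data/corpus1.txt'), PosixPath('data/corpus2.txt')]
      >>>
      >>> # Recursive scan (still includes files in the top-level directory)
      >>> files = discover_input_files(
      ...     Path("data/"),
      ...     pattern="*.txt",
      ...     recursive=True
      ... )
      >>> files
      [PosixPath('data/corpus1.txt'),
       PosixPath('data/corpus2.txt'),
       PosixPath('data/subdir/corpus3.txt'),
       PosixPath('data/subdir/corpus4.txt')]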

   .. note::

      Files are always returned in sorted order to ensure deterministic
      processing. This is critical for reproducible normalization results.
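
The discovery behaviour described for ``discover_input_files`` can be approximated with ``pathlib`` alone. The sketch below is an assumption about how such a function might work, not the module's actual code; the name ``discover`` and the error messages are hypothetical, and filtering out directories with ``is_file()`` is an extra assumption.

```python
import tempfile
from pathlib import Path


def discover(source_dir: Path, pattern: str = "*.txt", recursive: bool = False) -> list[Path]:
    """Hypothetical stand-in for the documented discovery behaviour."""
    if not source_dir.exists():
        raise FileNotFoundError(f"Source directory not found: {source_dir}")
    if not source_dir.is_dir():
        raise ValueError(f"Not a directory: {source_dir}")
    # Prefixing the pattern with "**/" makes Path.glob descend into subdirectories.
    glob_pattern = f"**/{pattern}" if recursive else pattern
    # Sorting makes the result independent of filesystem enumeration order.
    return sorted(path for path in source_dir.glob(glob_pattern) if path.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subdir").mkdir()
    # Files are created out of order, and one file does not match the pattern.
    for name in ("corpus2.txt", "corpus1.txt", "subdir/corpus3.txt", "notes.md"):
        (root / name).write_text("ka\n", encoding="utf-8")

    flat = [p.relative_to(root).as_posix() for p in discover(root)]
    deep = [p.relative_to(root).as_posix() for p in discover(root, recursive=True)]

print(flat)  # ['corpus1.txt', 'corpus2.txt']
print(deep)  # ['corpus1.txt', 'corpus2.txt', 'subdir/corpus3.txt']
```

Glob results come back in whatever order the filesystem enumerates them, so the explicit ``sorted`` call is what keeps repeated runs over the same directory identical.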