build_tools.pyphen_syllable_normaliser.aggregator
=================================================

.. py:module:: build_tools.pyphen_syllable_normaliser.aggregator

.. autoapi-nested-parse::

   File aggregation for syllable normalization pipeline.

   This module handles Step 1 of the normalization pipeline: combining multiple
   input files into a single raw syllable file while preserving all occurrences
   and maintaining raw counts.


Classes
-------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.aggregator.FileAggregator


Functions
---------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.aggregator.discover_input_files


Module Contents
---------------

.. py:class:: FileAggregator

   Aggregates syllables from multiple input files.

   This class handles the first step of the normalization pipeline: combining
   syllables from multiple .txt files into a single raw aggregated file. All
   occurrences are preserved (no deduplication), maintaining the original
   frequency distribution from the input files.

   .. admonition:: Example

      >>> from pathlib import Path
      >>> aggregator = FileAggregator()
      >>> input_files = [Path("file1.txt"), Path("file2.txt")]
      >>> syllables = aggregator.aggregate_files(input_files)
      >>> len(syllables)  # Total from both files
      450
      >>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))


   .. py:method:: aggregate_files(input_files)

      Aggregate syllables from multiple input files.

      Reads all syllables from the provided input files and combines them into
      a single list. Each line in each input file is treated as one syllable.
      Empty lines are skipped. All occurrences are preserved (no deduplication).

      :param input_files: List of Path objects pointing to input .txt files.
                          Each file should contain one syllable per line.

      :returns: List of all syllables from all input files, preserving duplicates
                and maintaining the original order (file by file).

      :raises FileNotFoundError: If any input file does not exist.
      :raises PermissionError: If any input file cannot be read.
      :raises UnicodeDecodeError: If any input file contains invalid UTF-8.

      .. admonition:: Example

         >>> aggregator = FileAggregator()
         >>> files = [Path("corpus1.txt"), Path("corpus2.txt")]
         >>> syllables = aggregator.aggregate_files(files)
         >>> syllables[:3]
         ['hello', 'world', 'test']

      .. note::

         Files are processed in the order provided. If deterministic ordering
         is required, ensure input_files is sorted before calling.


   .. py:method:: read_syllables_from_file(file_path)

      Read syllables from a single file.

      Reads a file line by line, treating each line as one syllable. Empty
      lines (whitespace only) are skipped. No normalization or transformation
      is applied - syllables are preserved exactly as they appear in the file.

      :param file_path: Path to the input file to read.

      :returns: List of syllable strings from the file, one per non-empty line.

      :raises FileNotFoundError: If the file does not exist.
      :raises PermissionError: If the file cannot be read.
      :raises UnicodeDecodeError: If the file contains invalid UTF-8.

      .. admonition:: Example

         >>> aggregator = FileAggregator()
         >>> syllables = aggregator.read_syllables_from_file(Path("input.txt"))
         >>> syllables
         ['ka', 'ra', 'mi', 'ka', 'ta']

      .. note::

         Leading and trailing whitespace is stripped from each line, but the
         syllable content itself is not modified. This allows files with
         varying whitespace formatting to be processed consistently.


   .. py:method:: save_raw_syllables(syllables, output_path)

      Save raw aggregated syllables to file.

      Writes syllables to the output file, one per line, in the order provided.
      This creates the syllables_raw.txt file for the pipeline. All syllables
      are written exactly as provided (no normalization).

      :param syllables: List of syllable strings to write.
      :param output_path: Path where the raw syllables file should be saved.

      :raises PermissionError: If the output file cannot be written.
      :raises OSError: If there are filesystem issues (disk full, etc.).

      .. admonition:: Example

         >>> aggregator = FileAggregator()
         >>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta']
         >>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
         # File contains:
         # ka
         # ra
         # mi
         # ka
         # ta

      .. note::

         This method creates the output file if it doesn't exist and overwrites
         it if it does. The output directory must already exist.


.. py:function:: discover_input_files(source_dir, pattern = '*.txt', recursive = False)

   Discover input files in a directory matching a pattern.

   Scans a directory for files matching the specified glob pattern. Returns
   files in sorted order for deterministic processing.

   :param source_dir: Directory to scan for input files.
   :param pattern: Glob pattern for matching files. Default: "*.txt".
   :param recursive: If True, scan subdirectories recursively using "**/" prefix.
                     Default: False (only scan the immediate directory).

   :returns: Sorted list of Path objects for all matching files.

   :raises ValueError: If source_dir is not a directory.
   :raises FileNotFoundError: If source_dir does not exist.

   .. admonition:: Example

      >>> from pathlib import Path
      >>> # Non-recursive scan
      >>> files = discover_input_files(Path("data/"), pattern="*.txt")
      >>> files
      [Path('data/corpus1.txt'), Path('data/corpus2.txt')]
      >>>
      >>> # Recursive scan
      >>> files = discover_input_files(
      ...     Path("data/"),
      ...     pattern="*.txt",
      ...     recursive=True
      ... )
      >>> files
      [Path('data/corpus1.txt'), Path('data/subdir/corpus3.txt'), Path('data/subdir/corpus4.txt')]

   .. note::

      Files are always returned in sorted order to ensure deterministic
      processing. This is critical for reproducible normalization results.
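
The three documented steps (discover, aggregate, save) can be illustrated end to end with a minimal sketch. This is not the module's actual source: the functions below are hypothetical standalone stand-ins that only mirror the behavior described above (sorted discovery, whitespace stripping, skipped empty lines, no deduplication, one syllable per output line). The use of ``Path.rglob`` for the recursive case and the temporary directory layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def discover_input_files(source_dir, pattern="*.txt", recursive=False):
    """Hypothetical stand-in: return matching files in sorted order."""
    source_dir = Path(source_dir)
    if not source_dir.exists():
        raise FileNotFoundError(f"Source directory not found: {source_dir}")
    if not source_dir.is_dir():
        raise ValueError(f"Not a directory: {source_dir}")
    # Assumption: rglob(pattern) behaves like glob("**/" + pattern).
    matches = source_dir.rglob(pattern) if recursive else source_dir.glob(pattern)
    return sorted(path for path in matches if path.is_file())


def read_syllables_from_file(file_path):
    """Hypothetical stand-in: one syllable per non-empty line, stripped."""
    with Path(file_path).open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def aggregate_files(input_files):
    """Hypothetical stand-in: concatenate file by file, keeping duplicates."""
    syllables = []
    for file_path in input_files:
        syllables.extend(read_syllables_from_file(file_path))
    return syllables


def save_raw_syllables(syllables, output_path):
    """Hypothetical stand-in: write one syllable per line, overwriting."""
    with Path(output_path).open("w", encoding="utf-8") as handle:
        for syllable in syllables:
            handle.write(f"{syllable}\n")


def run_demo():
    """Build a throwaway corpus and run discover -> aggregate -> save."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "data"
        (data_dir / "subdir").mkdir(parents=True)
        # Created out of order on purpose; discovery sorts the result.
        (data_dir / "corpus2.txt").write_text("ka\n\n  ta  \n", encoding="utf-8")
        (data_dir / "corpus1.txt").write_text("ka\nra\nmi\n", encoding="utf-8")
        # Only picked up when recursive=True.
        (data_dir / "subdir" / "corpus3.txt").write_text("no\n", encoding="utf-8")

        files = discover_input_files(data_dir)
        syllables = aggregate_files(files)

        output_path = Path(tmp) / "syllables_raw.txt"
        save_raw_syllables(syllables, output_path)
        written = output_path.read_text(encoding="utf-8")
        return [path.name for path in files], syllables, written


if __name__ == "__main__":
    names, syllables, written = run_demo()
    print(names)
    print(syllables)
    print(written, end="")
```

In this sketch the duplicate ``ka`` survives aggregation, the blank line and surrounding spaces in ``corpus2.txt`` are dropped, and the file in ``subdir`` is ignored because the scan is non-recursive. Sorting happens only in the discovery step, so a caller who builds the file list by hand would need to sort it before aggregating to get the same ordering.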