build_tools.pyphen_syllable_normaliser.aggregator
File aggregation for syllable normalization pipeline.
This module handles Step 1 of the normalization pipeline: combining multiple input files into a single raw syllable file while preserving all occurrences and maintaining raw counts.
Classes
FileAggregator | Aggregates syllables from multiple input files.
Functions
discover_input_files(source_dir[, pattern, recursive]) | Discover input files in a directory matching a pattern.
Module Contents
- class build_tools.pyphen_syllable_normaliser.aggregator.FileAggregator
Aggregates syllables from multiple input files.
This class handles the first step of the normalization pipeline: combining syllables from multiple .txt files into a single raw aggregated file. All occurrences are preserved (no deduplication), maintaining the original frequency distribution from the input files.
Example
>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> input_files = [Path("file1.txt"), Path("file2.txt")]
>>> syllables = aggregator.aggregate_files(input_files)
>>> len(syllables)  # Total from both files
450
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
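Because duplicates are kept, the aggregated list still carries the frequency information of the inputs. The following is a minimal sketch of how raw counts could be recovered from such a list. It assumes the aggregated result is a plain list of strings; the sample values are made up for illustration and do not come from a real corpus.

```python
from collections import Counter

# Hypothetical aggregated output: duplicates are intentionally preserved.
syllables = ["ka", "ra", "mi", "ka", "ta"]

# Raw counts can be derived later because nothing was deduplicated.
raw_counts = Counter(syllables)

print(raw_counts["ka"])  # 2 occurrences of "ka"
print(len(syllables))    # 5 occurrences in total
print(len(raw_counts))   # 4 unique syllables
```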
- aggregate_files(input_files)
Aggregate syllables from multiple input files.
Reads all syllables from the provided input files and combines them into a single list. Each line in each input file is treated as one syllable. Empty lines are skipped. All occurrences are preserved (no deduplication).
- Parameters:
input_files (list[pathlib.Path]) – List of Path objects pointing to input .txt files. Each file should contain one syllable per line.
- Returns:
List of all syllables from all input files, preserving duplicates and maintaining the original order (file by file).
- Raises:
FileNotFoundError – If any input file does not exist.
PermissionError – If any input file cannot be read.
UnicodeDecodeError – If any input file contains invalid UTF-8.
- Return type:
list[str]
Example
>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> files = [Path("corpus1.txt"), Path("corpus2.txt")]
>>> syllables = aggregator.aggregate_files(files)
>>> syllables[:3]
['hello', 'world', 'test']
Note
Files are processed in the order provided. If deterministic ordering is required, ensure input_files is sorted before calling.
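The behaviour described above (file-by-file order, empty lines skipped, duplicates kept) can be sketched as follows. This is an illustrative stand-in and not the actual implementation of aggregate_files; the name aggregate_sketch is hypothetical, and reading the files as UTF-8 is an assumption based on the documented UnicodeDecodeError.

```python
from pathlib import Path


def aggregate_sketch(input_files: list[Path]) -> list[str]:
    """Illustrative stand-in for the documented aggregation behaviour."""
    syllables: list[str] = []
    for file_path in input_files:  # files are visited in the order given
        text = file_path.read_text(encoding="utf-8")
        for line in text.splitlines():
            syllable = line.strip()
            if syllable:  # skip empty and whitespace-only lines
                syllables.append(syllable)  # duplicates are kept
    return syllables
```

Since the result depends on the order of input_files, a caller who needs reproducible output could pass sorted(input_files) instead of the raw list.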
- read_syllables_from_file(file_path)
Read syllables from a single file.
Reads a file line by line, treating each line as one syllable. Empty and whitespace-only lines are skipped. Apart from stripping leading and trailing whitespace, no normalization or transformation is applied: the syllable content is preserved exactly as it appears in the file.
- Parameters:
file_path (pathlib.Path) – Path to the input file to read.
- Returns:
List of syllable strings from the file, one per non-empty line.
- Raises:
FileNotFoundError – If the file does not exist.
PermissionError – If the file cannot be read.
UnicodeDecodeError – If the file contains invalid UTF-8.
- Return type:
list[str]
Example
>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> syllables = aggregator.read_syllables_from_file(Path("input.txt"))
>>> syllables
['ka', 'ra', 'mi', 'ka', 'ta']
Note
Leading and trailing whitespace is stripped from each line, but the syllable content itself is not modified. This allows files with varying whitespace formatting to be processed consistently.
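The whitespace handling described in this note can be illustrated on an in-memory stream. This is a sketch under assumptions, not the method itself: read_syllables_sketch is a hypothetical helper, and io.StringIO stands in for an opened input file.

```python
import io


def read_syllables_sketch(stream) -> list[str]:
    """Strip surrounding whitespace and drop lines that end up empty."""
    return [line.strip() for line in stream if line.strip()]


# Mixed formatting: padded line, blank line, tab-only line, CRLF ending.
sample = io.StringIO("ka\n  ra  \n\n\t\nmi\r\nKa\n")

# Case and content are untouched; only surrounding whitespace is removed.
print(read_syllables_sketch(sample))  # ['ka', 'ra', 'mi', 'Ka']
```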
- save_raw_syllables(syllables, output_path)
Save raw aggregated syllables to file.
Writes syllables to the output file, one per line, in the order provided. This creates the syllables_raw.txt file for the pipeline. All syllables are written exactly as provided (no normalization).
- Parameters:
syllables (list[str]) – List of syllables to write, one per line.
output_path (pathlib.Path) – Path where the raw syllables file should be saved.
- Raises:
PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).
Example
>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta']
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
# File contains:
# ka
# ra
# mi
# ka
# ta
Note
This method creates the output file if it doesn’t exist and overwrites it if it does. The output directory must already exist.
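Because the output directory must already exist, a caller may want to create it before saving. The sketch below shows one way to do that together with a one-syllable-per-line write. It is a hypothetical stand-in (save_raw_sketch is not part of the module), and the UTF-8 encoding and the trailing newline after the last syllable are assumptions.

```python
from pathlib import Path


def save_raw_sketch(syllables: list[str], output_path: Path) -> None:
    """Create the parent directory if needed, then write one syllable per line."""
    # Assumption: creating missing directories is acceptable for the caller.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" creates the file or overwrites an existing one.
    with output_path.open("w", encoding="utf-8") as handle:
        for syllable in syllables:
            handle.write(f"{syllable}\n")  # written as given, no normalization
```

With the real method, the same effect could be approximated by calling output_path.parent.mkdir(parents=True, exist_ok=True) before save_raw_syllables.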
- build_tools.pyphen_syllable_normaliser.aggregator.discover_input_files(source_dir, pattern='*.txt', recursive=False)
Discover input files in a directory matching a pattern.
Scans a directory for files matching the specified glob pattern. Returns files in sorted order for deterministic processing.
- Parameters:
source_dir (pathlib.Path) – Directory to scan for input files.
pattern (str) – Glob pattern for matching files. Default: “*.txt”.
recursive (bool) – If True, scan subdirectories recursively using “**/” prefix. Default: False (only scan the immediate directory).
- Returns:
Sorted list of Path objects for all matching files.
- Raises:
ValueError – If source_dir is not a directory.
FileNotFoundError – If source_dir does not exist.
- Return type:
list[pathlib.Path]
Example
>>> from pathlib import Path
>>> # Non-recursive scan
>>> files = discover_input_files(Path("data/"), pattern="*.txt")
>>> files
[Path('data/corpus1.txt'), Path('data/corpus2.txt')]
>>>
>>> # Recursive scan
>>> files = discover_input_files(
...     Path("data/"),
...     pattern="*.txt",
...     recursive=True
... )
>>> files
[Path('data/corpus1.txt'), Path('data/subdir/corpus3.txt'), Path('data/subdir/corpus4.txt')]
Note
Files are always returned in sorted order to ensure deterministic processing. This is critical for reproducible normalization results.
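The documented discovery behaviour (glob matching, optional "**/" prefix, sorted result, and the two error cases) could look roughly like the sketch below. It is a hypothetical stand-in named discover_sketch, not the module's code; the order of the two checks and the filtering to regular files are assumptions.

```python
from pathlib import Path


def discover_sketch(
    source_dir: Path, pattern: str = "*.txt", recursive: bool = False
) -> list[Path]:
    """Illustrative stand-in for the documented discovery behaviour."""
    if not source_dir.exists():
        raise FileNotFoundError(f"Directory does not exist: {source_dir}")
    if not source_dir.is_dir():
        raise ValueError(f"Not a directory: {source_dir}")
    # The "**/" prefix makes pathlib descend into subdirectories as well.
    glob_pattern = f"**/{pattern}" if recursive else pattern
    # Assumption: only regular files are wanted, so directories are filtered out.
    matches = (path for path in source_dir.glob(glob_pattern) if path.is_file())
    # glob() does not guarantee an order, so sort for reproducible results.
    return sorted(matches)
```

Sorting matters because the order in which a filesystem lists entries can differ between machines, which would otherwise change the order of the aggregated syllables.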