build_tools.pyphen_syllable_normaliser.aggregator

File aggregation for the syllable normalization pipeline.

This module handles Step 1 of the normalization pipeline: combining multiple input files into a single raw syllable file while preserving all occurrences and maintaining raw counts.

Classes

FileAggregator

Aggregates syllables from multiple input files.

Functions

discover_input_files(source_dir[, pattern, recursive])

Discover input files in a directory matching a pattern.

Module Contents

class build_tools.pyphen_syllable_normaliser.aggregator.FileAggregator

Aggregates syllables from multiple input files.

This class handles the first step of the normalization pipeline: combining syllables from multiple .txt files into a single raw aggregated file. All occurrences are preserved (no deduplication), maintaining the original frequency distribution from the input files.

Example

>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> input_files = [Path("file1.txt"), Path("file2.txt")]
>>> syllables = aggregator.aggregate_files(input_files)
>>> len(syllables)  # Total from both files
450
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
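Because duplicates are kept, the raw frequency of each syllable can be recovered later simply by counting the aggregated list. The sketch below illustrates this with a hand-written list standing in for the output of aggregate_files; the sample data and variable names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical aggregated output: duplicates kept, in file-by-file order.
raw_syllables = ["ka", "ra", "mi", "ka", "ta", "ka", "ra"]

# Nothing was deduplicated, so counting the list yields the raw frequencies.
raw_counts = Counter(raw_syllables)

print(raw_counts.most_common(2))  # [('ka', 3), ('ra', 2)]
print(sum(raw_counts.values()) == len(raw_syllables))  # True
```

A deduplicating aggregator would lose this information, which is why counting is deferred to a later pipeline step.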

aggregate_files(input_files)

Aggregate syllables from multiple input files.

Reads all syllables from the provided input files and combines them into a single list. Each line in each input file is treated as one syllable. Empty lines are skipped. All occurrences are preserved (no deduplication).

Parameters:

input_files (list[pathlib.Path]) – List of Path objects pointing to input .txt files. Each file should contain one syllable per line.

Returns:

List of all syllables from all input files, preserving duplicates and maintaining the original order (file by file).

Raises:

FileNotFoundError – If any input file does not exist.
PermissionError – If any input file cannot be read.
UnicodeDecodeError – If any input file contains invalid UTF-8.

Return type:

list[str]

Example

>>> aggregator = FileAggregator()
>>> files = [Path("corpus1.txt"), Path("corpus2.txt")]
>>> syllables = aggregator.aggregate_files(files)
>>> syllables[:3]
['hello', 'world', 'test']

Note

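Files are processed in the order provided, so the output order depends on the order of input_files. If deterministic ordering is required, sort input_files before calling; lists returned by discover_input_files are already sorted.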
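The effect of input order can be sketched with a small stand-in for the aggregation loop. This is not the library's implementation: read_lines, aggregate_in_order, and aggregate_sorted are hypothetical helpers that only mirror the documented behaviour (one syllable per line, empty lines skipped, files concatenated in the order given).

```python
from pathlib import Path


def read_lines(path: Path) -> list[str]:
    """Hypothetical helper: stripped, non-empty lines of one UTF-8 file."""
    with path.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def aggregate_in_order(input_files: list[Path]) -> list[str]:
    """Concatenate per-file syllables in exactly the order the paths are given."""
    combined: list[str] = []
    for path in input_files:
        combined.extend(read_lines(path))  # duplicates are kept
    return combined


def aggregate_sorted(input_files: list[Path]) -> list[str]:
    """Sort the paths first so the result does not depend on caller order."""
    return aggregate_in_order(sorted(input_files))
```

With this sketch, aggregate_sorted([b, a]) and aggregate_sorted([a, b]) return the same list, whereas aggregate_in_order returns a different list for each ordering.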

read_syllables_from_file(file_path)

Read syllables from a single file.

Reads a file line by line, treating each line as one syllable. Empty and whitespace-only lines are skipped. Apart from stripping leading and trailing whitespace, no normalization or transformation is applied: each syllable is kept exactly as it appears in the file.

Parameters:

file_path (pathlib.Path) – Path to the input file to read.

Returns:

List of syllable strings from the file, one per non-empty line.

Raises:

FileNotFoundError – If the file does not exist.
PermissionError – If the file cannot be read.
UnicodeDecodeError – If the file contains invalid UTF-8.

Return type:

list[str]

Example

>>> aggregator = FileAggregator()
>>> syllables = aggregator.read_syllables_from_file(Path("input.txt"))
>>> syllables
['ka', 'ra', 'mi', 'ka', 'ta']

Note

Leading and trailing whitespace is stripped from each line, but the syllable content itself is not modified. This allows files with varying whitespace formatting to be processed consistently.
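The line-handling rules above can be illustrated on an in-memory stream. The function parse_syllable_lines and the sample text are hypothetical; the sketch only mirrors the documented behaviour and makes no claim about how the method is actually written.

```python
import io
from collections.abc import Iterable


def parse_syllable_lines(lines: Iterable[str]) -> list[str]:
    """Hypothetical sketch: strip surrounding whitespace, drop blank lines."""
    syllables: list[str] = []
    for line in lines:
        syllable = line.strip()  # removes spaces, tabs, and line endings
        if syllable:  # empty and whitespace-only lines are skipped
            syllables.append(syllable)  # content and case left untouched
    return syllables


# Mixed indentation, blank lines, a Windows line ending, and mixed case.
sample = io.StringIO("ka\n  ra  \n\n\t\nmi\r\nKa\n")
result = parse_syllable_lines(sample)
print(result)  # ['ka', 'ra', 'mi', 'Ka']
```

Note that 'ka' and 'Ka' remain distinct entries, since case folding belongs to a later normalization step rather than to aggregation.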

save_raw_syllables(syllables, output_path)

Save raw aggregated syllables to file.

Writes syllables to the output file, one per line, in the order provided. This creates the syllables_raw.txt file for the pipeline. All syllables are written exactly as provided (no normalization).

Parameters:

syllables (list[str]) – List of syllable strings to write.
output_path (pathlib.Path) – Path where the raw syllables file should be saved.

Raises:

PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).

Example

>>> aggregator = FileAggregator()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta']
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
>>> # File contains:
>>> # ka
>>> # ra
>>> # mi
>>> # ka
>>> # ta

Note

This method creates the output file if it doesn’t exist and overwrites it if it does. The output directory must already exist.
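Since the output directory must already exist, a caller may want to create it before saving. The sketch below shows one way to do that around a stand-in writer. write_lines_sketch is hypothetical, and its use of UTF-8 and a trailing newline after every syllable are assumptions rather than documented behaviour of save_raw_syllables.

```python
from pathlib import Path


def write_lines_sketch(syllables: list[str], output_path: Path) -> None:
    """Hypothetical stand-in for a one-syllable-per-line writer."""
    # Caller-side preparation: make sure the target directory exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" creates the file or overwrites an existing one.
    with output_path.open("w", encoding="utf-8") as handle:
        for syllable in syllables:
            handle.write(f"{syllable}\n")  # written as given, in order
```

When using the real method, the mkdir call alone would be placed before aggregator.save_raw_syllables(...) to avoid an error on a missing directory.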

build_tools.pyphen_syllable_normaliser.aggregator.discover_input_files(source_dir, pattern='*.txt', recursive=False)

Discover input files in a directory matching a pattern.

Scans a directory for files matching the specified glob pattern. Returns files in sorted order for deterministic processing.

Parameters:

source_dir (pathlib.Path) – Directory to scan for input files.
pattern (str) – Glob pattern for matching files. Default: "*.txt".
recursive (bool) – If True, scan subdirectories recursively by prefixing the pattern with "**/". Default: False (scan only the immediate directory).

Returns:

Sorted list of Path objects for all matching files.

Raises:

ValueError – If source_dir is not a directory.
FileNotFoundError – If source_dir does not exist.

Return type:

list[pathlib.Path]

Example

>>> from pathlib import Path
>>> # Non-recursive scan
>>> files = discover_input_files(Path("data/"), pattern="*.txt")
>>> files
[Path('data/corpus1.txt'), Path('data/corpus2.txt')]
>>>
>>> # Recursive scan
>>> files = discover_input_files(
...     Path("data/"),
...     pattern="*.txt",
...     recursive=True
... )
>>> files
[Path('data/corpus1.txt'),
 Path('data/corpus2.txt'),
 Path('data/subdir/corpus3.txt'),
 Path('data/subdir/corpus4.txt')]

Note

Files are always returned in sorted order to ensure deterministic processing. This is critical for reproducible normalization results.
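The discovery behaviour described above can be approximated with pathlib alone. This is a minimal sketch, not the module's code: discover_sketch is a hypothetical name, and filtering out directories with is_file() is an assumption that the documentation does not confirm.

```python
from pathlib import Path


def discover_sketch(
    source_dir: Path, pattern: str = "*.txt", recursive: bool = False
) -> list[Path]:
    """Hypothetical sketch of sorted, optionally recursive file discovery."""
    if not source_dir.exists():
        raise FileNotFoundError(f"Directory not found: {source_dir}")
    if not source_dir.is_dir():
        raise ValueError(f"Not a directory: {source_dir}")
    # A "**/" prefix makes the glob descend into subdirectories as well.
    glob_pattern = f"**/{pattern}" if recursive else pattern
    # Assumption: only regular files count as input files.
    matches = (p for p in source_dir.glob(glob_pattern) if p.is_file())
    # Sorting removes any dependence on filesystem enumeration order.
    return sorted(matches)
```

Sorting matters because the order in which a filesystem lists directory entries is not guaranteed, so an unsorted scan could feed files to the aggregator in a different order on each machine.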