Pyphen Syllable Normaliser
Overview
Syllable Normaliser - 3-Step Normalization Pipeline
The syllable normaliser transforms raw syllable files into canonical form through a 3-step pipeline, creating the authoritative syllable inventory for pattern development. This is a build-time tool only - not used during runtime name generation.
3-Step Normalization Pipeline:
Aggregation - Combine multiple input files while preserving all occurrences
Canonicalization - Unicode normalization, diacritic stripping, charset validation
Frequency Analysis - Count occurrences and generate frequency intelligence
Features:
Unicode normalization (NFKD, NFC, NFD, NFKC)
Diacritic stripping using unicodedata
Configurable charset and length constraints
Frequency intelligence capture (pre-deduplication counts)
Deterministic processing (same input = same output)
Comprehensive metadata reporting
5 output files for complete analysis
The pipeline produces 5 output files (with pyphen_ prefix for provenance):
pyphen_syllables_raw.txt: Aggregated raw syllables (all occurrences preserved)
pyphen_syllables_canonicalised.txt: Normalized canonical syllables
pyphen_syllables_frequencies.json: Frequency intelligence (syllable → count)
pyphen_syllables_unique.txt: Deduplicated canonical syllable inventory
pyphen_normalization_meta.txt: Detailed statistics and metadata report
- Usage:
>>> from pathlib import Path
>>> from build_tools.pyphen_syllable_normaliser import (
...     NormalizationConfig,
...     run_full_pipeline
... )
>>>
>>> # Create configuration
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>>
>>> # Run pipeline on a pyphen run directory
>>> result = run_full_pipeline(
...     run_directory=Path("_working/output/20260110_143022_pyphen/"),
...     config=config,
...     verbose=True
... )
>>>
>>> # Access results
>>> print(f"Processed {result.stats.raw_count:,} raw syllables")
>>> print(f"Canonical: {result.stats.after_canonicalization:,}")
>>> print(f"Unique: {result.stats.unique_canonical:,}")
CLI Usage:
# Process specific pyphen run directory (in-place)
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_143022_pyphen/
# Auto-detect all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser --source _working/output/
Command-Line Interface
Pyphen Syllable Normaliser - 3-step normalization pipeline
usage: python -m build_tools.pyphen_syllable_normaliser [-h]
(--run-dir RUN_DIR | --source SOURCE)
[--min MIN]
[--max MAX]
[--charset CHARSET]
[--unicode-form {NFC,NFD,NFKC,NFKD}]
[--verbose] [--quiet]
Named Arguments
- --run-dir
Specific pyphen run directory to process (e.g., _working/output/20260110_143022_pyphen/)
- --source
Directory to scan for pyphen run directories (auto-detects *_pyphen/ directories)
- --min
Minimum syllable length (characters). Default: 2
- --max
Maximum syllable length (characters). Default: 20
- --charset
Allowed character set for syllables. Default: 'abcdefghijklmnopqrstuvwxyz' (a-z)
- --unicode-form
Possible choices: NFC, NFD, NFKC, NFKD
Unicode normalization form. Default: 'NFKD'
- --verbose, -v
Enable verbose output with detailed progress information. Default: False
- --quiet, -q
Suppress all output except errors. Default: False
Examples:
# Process specific pyphen run directory
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_143022_pyphen/
# Auto-detect and process all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser --source _working/output/
# Custom normalization config
python -m build_tools.pyphen_syllable_normaliser \
--run-dir _working/output/20260110_143022_pyphen/ \
--min 2 --max 8
Output Format
The pipeline generates 5 output files in the pyphen run directory with pyphen_ prefix for provenance:
pyphen_syllables_raw.txt - Aggregated raw syllables (all occurrences preserved)
pyphen_syllables_canonicalised.txt - Normalized canonical syllables
pyphen_syllables_frequencies.json - Frequency intelligence (syllable → count mapping)
pyphen_syllables_unique.txt - Deduplicated canonical syllable inventory
pyphen_normalization_meta.txt - Detailed statistics and metadata report
In-Place Processing:
Unlike older versions, which wrote to a separate output directory, the pyphen normaliser processes run directories in place and writes its output files directly into the run directory:
_working/output/20260110_143022_pyphen/
├── syllables/ # Input (from pyphen extractor)
│ ├── en_US_alice.txt
│ ├── en_US_middlemarch.txt
│ └── ...
├── meta/ # Metadata (from extractor)
│ └── ...
├── pyphen_syllables_raw.txt # Output: Aggregated
├── pyphen_syllables_canonicalised.txt # Output: Normalized
├── pyphen_syllables_frequencies.json # Output: Frequency intelligence
├── pyphen_syllables_unique.txt # Output: Deduplicated
└── pyphen_normalization_meta.txt # Output: Statistics
File structure examples:
pyphen_syllables_raw.txt (preserves all occurrences):
café
Café
hello
hello
world
pyphen_syllables_canonicalised.txt (normalized, duplicates preserved):
cafe
cafe
hello
hello
world
pyphen_syllables_frequencies.json (counts before deduplication):
{
"cafe": 2,
"hello": 2,
"world": 1
}
pyphen_syllables_unique.txt (deduplicated, sorted):
cafe
hello
world
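The four example files above can be reproduced with a short sketch. The helper name canonicalise and the inline input list are illustrative assumptions, not the tool's actual internals; only the standard library is used.

```python
import json
import unicodedata
from collections import Counter
from typing import Optional

ALLOWED = set("abcdefghijklmnopqrstuvwxyz")  # default charset from the docs


def canonicalise(raw: str, min_len: int = 2, max_len: int = 20) -> Optional[str]:
    """Hypothetical helper: return the canonical form, or None if rejected."""
    decomposed = unicodedata.normalize("NFKD", raw)
    # Drop the combining marks left behind by decomposition (the accent in "é").
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    candidate = stripped.lower().strip()
    if not candidate or not set(candidate) <= ALLOWED:
        return None
    if not min_len <= len(candidate) <= max_len:
        return None
    return candidate


raw = ["café", "Café", "hello", "hello", "world"]      # pyphen_syllables_raw.txt
canonical = [c for c in map(canonicalise, raw) if c]    # pyphen_syllables_canonicalised.txt
frequencies = dict(Counter(canonical))                  # pyphen_syllables_frequencies.json
unique = sorted(frequencies)                            # pyphen_syllables_unique.txt

print(json.dumps(frequencies, indent=2, sort_keys=True))
print(unique)
```

Counting happens on the canonical list before deduplication, which is why "cafe" carries a count of 2 even though it appears once in the unique inventory.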
Integration Guide
The pyphen syllable normaliser is the second step after pyphen extraction, processing syllables within their run directories:
Standard workflow:
# Step 1: Extract syllables using pyphen
python -m build_tools.pyphen_syllable_extractor \
--source data/corpus/ \
--pattern "*.txt" \
--output _working/output/ \
--lang en_US
# Step 2: Normalize extracted syllables (in-place)
python -m build_tools.pyphen_syllable_normaliser \
--run-dir _working/output/20260110_143022_pyphen/
# Alternative: Auto-detect all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser \
--source _working/output/
# Step 3: Annotate with phonetic features (source-agnostic)
python -m build_tools.syllable_feature_annotator \
--syllables _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
--frequencies _working/output/20260110_143022_pyphen/pyphen_syllables_frequencies.json
Comparing with NLTK normaliser:
# Pyphen pipeline (typographic hyphenation)
python -m build_tools.pyphen_syllable_extractor \
--source data/corpus/ \
--lang en_US \
--output _working/output/
python -m build_tools.pyphen_syllable_normaliser \
--run-dir _working/output/20260110_143022_pyphen/
# NLTK pipeline (phonetic splitting)
python -m build_tools.nltk_syllable_extractor \
--source data/corpus/ \
--output _working/output/
python -m build_tools.nltk_syllable_normaliser \
--run-dir _working/output/20260110_095213_nltk/
# Compare outputs - both use different prefixes (pyphen_* vs nltk_*)
diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
_working/output/20260110_095213_nltk/nltk_syllables_unique.txt
When to use pyphen normaliser vs NLTK normaliser:
Use pyphen normaliser when:
You used the pyphen syllable extractor
Your syllables are well-formed from typographic hyphenation
You want multi-language support (40+ languages)
You want in-place processing within run directories
You’re working with pyphen’s dictionary-based splits
Use NLTK normaliser when:
You used the NLTK syllable extractor
Your syllables contain many single-letter fragments
You want phonetically coherent syllables reconstructed
You’re working with NLTK’s onset/coda-based splits (English only)
You want fragment cleaning preprocessing
3-Step Normalization Pipeline:
Step 1 - Aggregation:
Combines all input files into pyphen_syllables_raw.txt
Preserves ALL occurrences (no deduplication)
Maintains raw counts for frequency analysis
Empty lines filtered during file reading
Step 2 - Canonicalization:
Unicode normalization (NFKD - compatibility decomposition)
Strip diacritics: café → cafe, résumé → resume
Lowercase conversion
Trim whitespace
Charset validation (reject invalid characters)
Length constraint enforcement (default: min=2, max=20)
Outputs to pyphen_syllables_canonicalised.txt
Step 3 - Frequency Analysis:
Count occurrences of each canonical syllable
Generate frequency rankings and percentages
Create deduplicated unique list (alphabetically sorted)
Outputs:
pyphen_syllables_frequencies.json - Frequency counts before deduplication
pyphen_syllables_unique.txt - Authoritative syllable inventory
pyphen_normalization_meta.txt - Comprehensive statistics report
Pipeline characteristics:
Deterministic: same input always produces same output
Fast: processes thousands of syllables per second
Configurable: adjust length constraints, charset, unicode form
Comprehensive: detailed rejection statistics and metadata
Notes
Frequency Intelligence:
The frequency data captures how often each canonical syllable occurs before deduplication. This intelligence is essential for understanding natural language patterns and can inform weighted name generation:
{
"ka": 187,
"ra": 162,
"mi": 145,
"ta": 98
}
This shows “ka” appears 187 times in the canonical syllables, indicating it’s a high-frequency pattern that may be desirable for common or natural-sounding names.
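As a rough illustration of how these counts could feed weighted name generation, the sketch below draws syllables in proportion to their frequency. The function sample_syllables is hypothetical and not part of this package; it relies only on random.choices from the standard library.

```python
import random

frequencies = {"ka": 187, "ra": 162, "mi": 145, "ta": 98}


def sample_syllables(freqs: dict[str, int], k: int, seed: int) -> list[str]:
    """Hypothetical helper: draw k syllables, weighted by corpus frequency."""
    rng = random.Random(seed)      # seeded so repeated runs give the same draw
    syllables = sorted(freqs)      # fixed order keeps the draw deterministic
    weights = [freqs[s] for s in syllables]
    return rng.choices(syllables, weights=weights, k=k)


draw = sample_syllables(frequencies, k=10_000, seed=42)
print({s: draw.count(s) for s in sorted(frequencies)})
```

Sorting the keys and seeding the generator keeps the sketch in line with the deterministic behaviour the pipeline aims for.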
Normalization Behavior:
All syllable processing is case-insensitive (output is lowercase)
Unicode normalization form NFKD provides maximum compatibility decomposition
Empty lines are filtered during aggregation (not counted as rejections)
Frequency counts capture occurrences BEFORE deduplication
Invalid syllables (wrong charset, wrong length) are rejected and counted in metadata
Default Constraints:
Min length: 2 characters
Max length: 20 characters
Allowed charset: a-z (lowercase ASCII letters)
Unicode form: NFKD (compatibility decomposition)
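The effect of the Unicode form can be checked directly with unicodedata. The sketch below uses two sample strings chosen for illustration: a ligature, which only the compatibility forms (NFKC, NFKD) expand, and an accented letter, which only the decomposed forms (NFD, NFKD) split into base letter plus combining mark.

```python
import unicodedata

# "\ufb01n" is the "fi" ligature followed by "n"; "caf\u00e9" is "café".
samples = {"ligature": "\ufb01n", "accent": "caf\u00e9"}

for label, text in samples.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, text)
        print(f"{label:8} {form:4} {normalized!r} len={len(normalized)}")
```

With the default a-z charset, NFKD is the only one of the four forms that both expands the ligature and exposes the accent as a removable combining mark.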
Use Cases:
Combining syllables from multiple language extractions
Normalizing Unicode variations across corpora (precomposed vs. decomposed characters, compatibility forms)
Filtering syllables by length for specific pattern requirements
Building frequency-aware name generation systems
In-Place Processing Philosophy:
The pyphen normaliser writes outputs directly into the run directory (not a separate location) because:
Convention: Each pyphen run is self-contained (extractor + normaliser outputs together)
Simplicity: No confusion about where normalized files live
Provenance: Run directory name (*_pyphen) and file prefix (pyphen_*) both indicate source
Processing Modes:
Specific run directory: --run-dir /path/to/run/ - Process one pyphen run
Auto-detection: --source /path/to/output/ - Find and process all pyphen runs
Auto-Detection Criteria:
The auto-detection feature (--source) finds pyphen run directories by:
Scanning for directories ending with _pyphen
Verifying existence of syllables/ subdirectory
Sorting chronologically by directory name
This allows batch processing:
# Process all pyphen runs at once
python -m build_tools.pyphen_syllable_normaliser --source _working/output/
# Output:
# Found 3 pyphen run directories:
# - 20260110_143022_pyphen
# - 20260110_153045_pyphen
# - 20260110_163010_pyphen
# Processing...
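The three detection criteria can be sketched with pathlib. The function find_pyphen_runs is a hypothetical stand-in for whatever the CLI does internally, and the temporary directory layout exists only for the demonstration.

```python
import tempfile
from pathlib import Path


def find_pyphen_runs(source: Path) -> list[Path]:
    """Hypothetical helper mirroring the three criteria listed above."""
    runs = [
        d for d in source.iterdir()
        if d.is_dir() and d.name.endswith("_pyphen") and (d / "syllables").is_dir()
    ]
    # Timestamped names (YYYYMMDD_HHMMSS_pyphen) sort chronologically as strings.
    return sorted(runs, key=lambda d: d.name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "20260110_153045_pyphen" / "syllables").mkdir(parents=True)
    (root / "20260110_143022_pyphen" / "syllables").mkdir(parents=True)
    (root / "20260110_095213_nltk" / "syllables").mkdir(parents=True)  # wrong suffix
    (root / "20260110_163010_pyphen").mkdir()                          # no syllables/
    found = [d.name for d in find_pyphen_runs(root)]

print(found)
```

Only the two directories that satisfy both the suffix and the syllables/ check are returned, oldest first.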
Build-time tool:
This is a build-time tool only - not used during runtime name generation.
API Reference
- class build_tools.pyphen_syllable_normaliser.FileAggregator[source]
Bases: object
Aggregates syllables from multiple input files.
This class handles the first step of the normalization pipeline: combining syllables from multiple .txt files into a single raw aggregated file. All occurrences are preserved (no deduplication), maintaining the original frequency distribution from the input files.
Example
>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> input_files = [Path("file1.txt"), Path("file2.txt")]
>>> syllables = aggregator.aggregate_files(input_files)
>>> len(syllables)  # Total from both files
450
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
- aggregate_files(input_files)[source]
Aggregate syllables from multiple input files.
Reads all syllables from the provided input files and combines them into a single list. Each line in each input file is treated as one syllable. Empty lines are skipped. All occurrences are preserved (no deduplication).
- Parameters:
input_files (list[Path]) – List of Path objects pointing to input .txt files. Each file should contain one syllable per line.
- Returns:
List of all syllables from all input files, preserving duplicates and maintaining the original order (file by file).
- Raises:
FileNotFoundError – If any input file does not exist.
PermissionError – If any input file cannot be read.
UnicodeDecodeError – If any input file contains invalid UTF-8.
Example
>>> aggregator = FileAggregator()
>>> files = [Path("corpus1.txt"), Path("corpus2.txt")]
>>> syllables = aggregator.aggregate_files(files)
>>> syllables[:3]
['hello', 'world', 'test']
Note
Files are processed in the order provided. If deterministic ordering is required, ensure input_files is sorted before calling.
- read_syllables_from_file(file_path)[source]
Read syllables from a single file.
Reads a file line by line, treating each line as one syllable. Empty lines (whitespace only) are skipped. No normalization or transformation is applied - syllables are preserved exactly as they appear in the file.
- Parameters:
file_path (Path) – Path to the input file to read.
- Returns:
List of syllable strings from the file, one per non-empty line.
- Raises:
FileNotFoundError – If the file does not exist.
PermissionError – If the file cannot be read.
UnicodeDecodeError – If the file contains invalid UTF-8.
Example
>>> aggregator = FileAggregator()
>>> syllables = aggregator.read_syllables_from_file(Path("input.txt"))
>>> syllables
['ka', 'ra', 'mi', 'ka', 'ta']
Note
Leading and trailing whitespace is stripped from each line, but the syllable content itself is not modified. This allows files with varying whitespace formatting to be processed consistently.
- save_raw_syllables(syllables, output_path)[source]
Save raw aggregated syllables to file.
Writes syllables to the output file, one per line, in the order provided. This creates the syllables_raw.txt file for the pipeline. All syllables are written exactly as provided (no normalization).
- Raises:
PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).
Example
>>> aggregator = FileAggregator()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta']
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
# File contains:
# ka
# ra
# mi
# ka
# ta
Note
This method creates the output file if it doesn’t exist and overwrites it if it does. The output directory must already exist.
- class build_tools.pyphen_syllable_normaliser.FrequencyAnalyzer[source]
Bases: object
Analyzes frequency distribution of canonical syllables.
This class handles the intelligence capture phase of the normalization pipeline. It counts occurrences of each canonical syllable, creates frequency rankings, and generates output files for downstream analysis and feature annotation.
Example
>>> from pathlib import Path
>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 1, 'mi': 1, 'ta': 1}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']
- calculate_frequencies(syllables)[source]
Calculate frequency counts for canonical syllables.
Counts how many times each unique syllable appears in the input list. This captures the natural frequency distribution from the source corpus before deduplication.
- Parameters:
syllables (list[str]) – List of canonical syllables (may contain duplicates).
- Returns:
Dictionary mapping each unique syllable to its occurrence count.
Example
>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 2, 'mi': 1, 'ta': 1}
>>> sum(frequencies.values())  # Total syllable count
7
Note
The returned dictionary is not sorted. Use create_frequency_entries() to generate sorted frequency rankings.
- create_frequency_entries(frequencies)[source]
Create ranked frequency entries from frequency counts.
Converts a frequency dictionary into a list of FrequencyEntry objects with ranking information and percentage calculations. Entries are sorted by frequency (descending) then alphabetically (ascending).
- Parameters:
frequencies (dict[str, int]) – Dictionary mapping syllable to occurrence count.
- Returns:
List of FrequencyEntry objects sorted by frequency (highest first), with alphabetical secondary sort for ties.
Example
>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145, 'ta': 98}
>>> entries = analyzer.create_frequency_entries(frequencies)
>>> entries[0]
FrequencyEntry(canonical='ka', frequency=187, rank=1, percentage=31.5)
>>> entries[0].canonical
'ka'
>>> entries[0].rank
1
Note
Percentage is calculated as (frequency / total_count) * 100. Ranks start at 1 (most frequent syllable has rank=1).
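A minimal sketch of the ranking described above, using a stand-in dataclass named Entry with the same four fields as FrequencyEntry. The function rank_frequencies and the choice to give tied syllables consecutive ranks are assumptions; the source does not say how ties are numbered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:  # stand-in for FrequencyEntry; field names follow the docs
    canonical: str
    frequency: int
    rank: int
    percentage: float


def rank_frequencies(frequencies: dict[str, int]) -> list[Entry]:
    """Hypothetical helper: sort by count descending, then alphabetically."""
    total = sum(frequencies.values())
    ordered = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        Entry(syllable, count, rank, count / total * 100)
        for rank, (syllable, count) in enumerate(ordered, start=1)
    ]


entries = rank_frequencies({"ra": 2, "ka": 3, "ta": 1, "mi": 1})
for e in entries:
    print(f"{e.rank}. {e.canonical}: {e.frequency} ({e.percentage:.1f}%)")
```

The composite sort key (-count, syllable) gives a stable order regardless of how the input dictionary was built.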
- extract_unique_syllables(syllables)[source]
Extract unique syllables and return in sorted order.
Removes duplicates from the syllable list and returns a sorted list of unique canonical syllables. This creates the authoritative syllable inventory for downstream feature annotation.
- Parameters:
syllables (list[str]) – List of canonical syllables (may contain duplicates).
- Returns:
Sorted list of unique syllable strings (alphabetical order).
Example
>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']
>>> len(unique)
4
Note
Sorting is alphabetical (a-z) for deterministic output. Empty syllable lists return an empty list.
- save_frequencies(frequencies, output_path)[source]
Save frequency dictionary to JSON file.
Writes the frequency intelligence to a JSON file for downstream analysis. The output is formatted with indentation for readability and sorted by key for deterministic output.
- Raises:
PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).
Example
>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
# File contains:
# {
#   "ka": 187,
#   "mi": 145,
#   "ra": 162
# }
Note
The JSON is formatted with 2-space indentation and keys are sorted alphabetically for consistent diffs in version control.
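The formatting described in the note corresponds to json.dumps with indent=2 and sort_keys=True. The sketch below assumes that mapping (the function write_frequencies is hypothetical) and shows that two dictionaries with the same content but different insertion order produce byte-identical files.

```python
import json
import tempfile
from pathlib import Path


def write_frequencies(frequencies: dict[str, int], output_path: Path) -> None:
    """Hypothetical equivalent of the behaviour described in the note."""
    text = json.dumps(frequencies, indent=2, sort_keys=True, ensure_ascii=False)
    output_path.write_text(text + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.json"
    second = Path(tmp) / "b.json"
    write_frequencies({"ka": 187, "ra": 162, "mi": 145}, first)
    write_frequencies({"mi": 145, "ka": 187, "ra": 162}, second)  # other order
    identical = first.read_bytes() == second.read_bytes()
    loaded = json.loads(first.read_text(encoding="utf-8"))

print(identical, list(loaded))
```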
- save_unique_syllables(unique_syllables, output_path)[source]
Save unique syllables to text file.
Writes the deduplicated canonical syllable inventory to a text file, one syllable per line. This creates the authoritative syllable list for feature annotation and downstream processing.
- Raises:
PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).
Example
>>> analyzer = FrequencyAnalyzer()
>>> unique = ['ka', 'mi', 'ra', 'ta']
>>> analyzer.save_unique_syllables(unique, Path("syllables_unique.txt"))
# File contains:
# ka
# mi
# ra
# ta
Note
Syllables should be pre-sorted (alphabetically) before calling this method. Use extract_unique_syllables() which returns sorted output, or sort manually.
- class build_tools.pyphen_syllable_normaliser.FrequencyEntry(canonical, frequency, rank, percentage)[source]
Bases: object
Single syllable with frequency and ranking information.
This dataclass represents one syllable in the frequency analysis, including its occurrence count and relative ranking.
- canonical
The canonical form of the syllable (e.g., “ka”)
- frequency
Number of times this syllable appears
- rank
Frequency rank (1 = most common, 2 = second most common, etc.)
- percentage
Percentage of total syllables (0-100)
Example
>>> entry = FrequencyEntry(canonical="ka", frequency=187, rank=1, percentage=10.2)
>>> print(f"{entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
ka: 187 (10.2%)
- class build_tools.pyphen_syllable_normaliser.NormalizationConfig(min_length=2, max_length=20, allowed_charset='abcdefghijklmnopqrstuvwxyz', unicode_form='NFKD')[source]
Bases: object
Configuration for syllable normalization process.
This dataclass stores all parameters that control how syllables are normalized to canonical form.
- min_length
Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2
- max_length
Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20
- allowed_charset
String of allowed characters. Only syllables containing these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”
- unicode_form
Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)
Example
>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'
- class build_tools.pyphen_syllable_normaliser.NormalizationResult(config, stats, frequencies, unique_syllables, input_files, output_dir, timestamp=<factory>, raw_file=PosixPath('syllables_raw.txt'), canonical_file=PosixPath('syllables_canonicalised.txt'), frequency_file=PosixPath('syllables_frequencies.json'), unique_file=PosixPath('syllables_unique.txt'), meta_file=PosixPath('normalization_meta.txt'))[source]
Bases: object
Complete result from the syllable normalization pipeline.
This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.
- config
Configuration used for normalization
- stats
Statistics from the processing
- frequencies
Dictionary mapping canonical syllable to frequency count
- unique_syllables
Sorted list of unique canonical syllables
- input_files
List of input file paths that were processed
- output_dir
Directory where output files were saved
- timestamp
When the normalization was performed
- raw_file
Path to raw aggregated file (syllables_raw.txt)
- canonical_file
Path to canonicalized file (syllables_canonicalised.txt)
- frequency_file
Path to frequency JSON (syllables_frequencies.json)
- unique_file
Path to unique syllables (syllables_unique.txt)
- meta_file
Path to metadata report (normalization_meta.txt)
Example
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000
- config: NormalizationConfig
- format_metadata()[source]
Format normalization metadata as a human-readable string.
Creates a detailed report including statistics, rejection breakdown, and top frequencies.
- Returns:
Multi-line string containing all normalization metadata formatted for display or file output.
Example
>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp: 2026-01-05 17:30:22
...
- stats: NormalizationStats
- class build_tools.pyphen_syllable_normaliser.NormalizationStats(raw_count=0, after_canonicalization=0, rejected_charset=0, rejected_length=0, rejected_empty=0, unique_canonical=0, processing_time=0.0)[source]
Bases: object
Statistics from the syllable normalization process.
This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.
- raw_count
Total number of syllables in raw input (before normalization)
- after_canonicalization
Number of syllables after normalization
- rejected_charset
Syllables rejected due to invalid characters
- rejected_length
Syllables rejected due to length constraints
- rejected_empty
Syllables that became empty after normalization
- unique_canonical
Number of unique canonical syllables
- processing_time
Total processing time in seconds
Example
>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0
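The rejection_rate of 5.0 in this example is consistent with (30 + 20 + 0) / 1000 expressed as a percentage. The sketch below shows one way such a derived property could be written. It uses a reduced stand-in class, and both the formula and the total_rejected helper are assumptions inferred from the example, not confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class Stats:  # stand-in for NormalizationStats, reduced to the fields used here
    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        # Hypothetical helper: sum of the three rejection counters.
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Guard against division by zero when nothing was processed.
        if self.raw_count == 0:
            return 0.0
        return 100 * self.total_rejected / self.raw_count


stats = Stats(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.rejection_rate)
```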
- class build_tools.pyphen_syllable_normaliser.SyllableNormalizer(config)[source]
Bases: object
Normalizes syllables to canonical form.
This class applies a multi-step normalization pipeline to transform raw syllables into a standardized canonical representation. The pipeline includes Unicode normalization, diacritic removal, case normalization, and validation against charset and length constraints.
- config
Configuration specifying normalization parameters such as allowed charset, length constraints, and Unicode normalization form.
Example
>>> from build_tools.pyphen_syllable_normaliser import NormalizationConfig
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize("x")  # Too short
None
>>> normalizer.normalize("résumé123")  # Invalid characters
None
- __init__(config)[source]
Initialize normalizer with configuration.
- Parameters:
config (NormalizationConfig) – NormalizationConfig instance specifying normalization parameters including charset, length constraints, and Unicode normalization form.
Example
>>> config = NormalizationConfig(
...     min_length=3,
...     max_length=10,
...     allowed_charset="abcdefghijklmnopqrstuvwxyz",
...     unicode_form="NFKD"
... )
>>> normalizer = SyllableNormalizer(config)
- normalize(syllable)[source]
Normalize a single syllable to canonical form.
Applies the complete normalization pipeline:
1. Unicode normalization (NFKD by default)
2. Strip diacritics (remove combining characters)
3. Lowercase conversion
4. Trim whitespace
5. Validate charset (only allowed characters)
6. Check length constraints
- Parameters:
syllable (str) – Raw syllable string to normalize.
- Returns:
Normalized canonical syllable string, or None if the syllable is rejected due to:
- Becoming empty after normalization
- Containing invalid characters
- Not meeting length constraints
Example
>>> config = NormalizationConfig()
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize(" HELLO ")
'hello'
>>> normalizer.normalize("résumé")
'resume'
>>> normalizer.normalize("")  # Empty
None
>>> normalizer.normalize("x")  # Too short (min_length=2)
None
>>> normalizer.normalize("hello123")  # Invalid chars
None
- strip_diacritics(text)[source]
Remove diacritics (accent marks) from Unicode text.
Uses Unicode normalization (NFD/NFKD) to decompose characters into base characters and combining marks, then removes the combining marks. This converts accented characters like ‘é’ → ‘e’, ‘ñ’ → ‘n’, etc.
- Parameters:
text (str) – Unicode string potentially containing diacritics.
- Returns:
String with all combining diacritical marks removed.
Example
>>> normalizer = SyllableNormalizer(NormalizationConfig())
>>> normalizer.strip_diacritics("café")
'cafe'
>>> normalizer.strip_diacritics("naïve")
'naive'
>>> normalizer.strip_diacritics("Zürich")
'Zurich'
>>> normalizer.strip_diacritics("São Paulo")
'Sao Paulo'
Note
This method assumes the text has already been normalized to NFD or NFKD form. The normalize() method handles this automatically.
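The precondition in this note matters because a precomposed "é" is a single code point with no combining mark to remove. The sketch below illustrates the point with a hypothetical helper, strip_combining, that filters on unicodedata.combining. That is a common way to implement diacritic stripping, but the source does not confirm it is what strip_diacritics does.

```python
import unicodedata


def strip_combining(text: str) -> str:
    """Drop combining marks; assumes text is already decomposed (NFD/NFKD)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


precomposed = "caf\u00e9"                                # "é" as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # "e" + U+0301

print(strip_combining(precomposed))  # accent survives: nothing to strip
print(strip_combining(decomposed))   # accent removed
```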
- build_tools.pyphen_syllable_normaliser.create_argument_parser()[source]
Create and return the argument parser for pyphen syllable normaliser.
- Returns:
Configured ArgumentParser ready to parse command-line arguments.
- build_tools.pyphen_syllable_normaliser.discover_input_files(source_dir, pattern='*.txt', recursive=False)[source]
Discover input files in a directory matching a pattern.
Scans a directory for files matching the specified glob pattern. Returns files in sorted order for deterministic processing.
- Returns:
Sorted list of Path objects for all matching files.
- Raises:
ValueError – If source_dir is not a directory.
FileNotFoundError – If source_dir does not exist.
Example
>>> from pathlib import Path
>>> # Non-recursive scan
>>> files = discover_input_files(Path("data/"), pattern="*.txt")
>>> files
[Path('data/corpus1.txt'), Path('data/corpus2.txt')]
>>>
>>> # Recursive scan
>>> files = discover_input_files(
...     Path("data/"),
...     pattern="*.txt",
...     recursive=True
... )
>>> files
[Path('data/corpus1.txt'), Path('data/subdir/corpus3.txt'), Path('data/subdir/corpus4.txt')]
Note
Files are always returned in sorted order to ensure deterministic processing. This is critical for reproducible normalization results.
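The documented behaviour (glob match, optional recursion, sorted result, the two exceptions) can be approximated with pathlib. The function find_inputs below is a hypothetical re-implementation for illustration, not the package's own code.

```python
import tempfile
from pathlib import Path


def find_inputs(source_dir: Path, pattern: str = "*.txt", recursive: bool = False) -> list[Path]:
    """Hypothetical re-implementation of the documented behaviour."""
    if not source_dir.exists():
        raise FileNotFoundError(source_dir)
    if not source_dir.is_dir():
        raise ValueError(f"not a directory: {source_dir}")
    matches = source_dir.rglob(pattern) if recursive else source_dir.glob(pattern)
    # glob order depends on the filesystem, so sort for reproducible results.
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subdir").mkdir()
    for name in ("corpus2.txt", "corpus1.txt", "notes.md", "subdir/corpus3.txt"):
        (root / name).write_text("ka\n", encoding="utf-8")
    flat = [p.relative_to(root).as_posix() for p in find_inputs(root)]
    deep = [p.relative_to(root).as_posix() for p in find_inputs(root, recursive=True)]

print(flat)
print(deep)
```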
- build_tools.pyphen_syllable_normaliser.load_frequencies_from_file(file_path)[source]
Load frequency dictionary from JSON file.
Reads a previously saved syllables_frequencies.json file and returns the frequency dictionary. Useful for analysis and inspection of normalization results.
- Parameters:
file_path (Path) – Path to the JSON frequency file.
- Returns:
Dictionary mapping syllable to occurrence count.
- Raises:
FileNotFoundError – If the file does not exist.
json.JSONDecodeError – If the file is not valid JSON.
PermissionError – If the file cannot be read.
Example
>>> from pathlib import Path
>>> frequencies = load_frequencies_from_file(Path("syllables_frequencies.json"))
>>> frequencies['ka']
187
>>> len(frequencies)
412
Note
The JSON file must have been created by save_frequencies() or follow the same format: {"syllable": count, ...}
- build_tools.pyphen_syllable_normaliser.load_unique_syllables_from_file(file_path)[source]
Load unique syllables from text file.
Reads a previously saved syllables_unique.txt file and returns the syllable list. Useful for loading the authoritative syllable inventory for feature annotation or analysis.
- Parameters:
file_path (Path) – Path to the text file containing unique syllables.
- Returns:
List of syllable strings (one per line from file).
- Raises:
FileNotFoundError – If the file does not exist.
PermissionError – If the file cannot be read.
UnicodeDecodeError – If the file contains invalid UTF-8.
Example
>>> from pathlib import Path
>>> syllables = load_unique_syllables_from_file(Path("syllables_unique.txt"))
>>> syllables[:5]
['ka', 'mi', 'ra', 'ta', 'wa']
>>> len(syllables)
412
Note
Empty lines are skipped. Leading/trailing whitespace is stripped from each line.
- build_tools.pyphen_syllable_normaliser.normalize_batch(syllables, config)[source]
Normalize a batch of syllables and collect rejection statistics.
This is a convenience function for normalizing multiple syllables at once while tracking why syllables were rejected.
- Parameters:
syllables (list[str]) – List of raw syllable strings to normalize.
config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.
- Returns:
Tuple of (normalized_syllables, rejection_stats) where
normalized_syllables: List of successfully normalized syllables
rejection_stats: Dictionary with rejection counts:
"rejected_empty": Syllables that became empty after normalization
"rejected_charset": Syllables with invalid characters
"rejected_length": Syllables outside length constraints
Example
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> syllables = ["Café", "x", "Hello", "world123", " résumé "]
>>> normalized, stats = normalize_batch(syllables, config)
>>> normalized
['cafe', 'hello', 'resume']
>>> stats
{'rejected_empty': 0, 'rejected_charset': 1, 'rejected_length': 1}
Note
This function processes syllables in order and preserves duplicates. For frequency analysis, use the frequency.py module which handles deduplication and counting.
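To show how the three rejection counters could be populated, here is a self-contained sketch that reproduces the example above. The names classify and batch are hypothetical, and the order of the checks (empty, then charset, then length) is an assumption based on the step order given for normalize().

```python
import unicodedata
from collections import Counter

CHARSET = set("abcdefghijklmnopqrstuvwxyz")


def classify(raw: str, min_len: int = 2, max_len: int = 8) -> tuple[str, str]:
    """Return (canonical, "ok") or ("", reason). Check order is an assumption."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    if not text:
        return "", "rejected_empty"
    if not set(text) <= CHARSET:
        return "", "rejected_charset"
    if not min_len <= len(text) <= max_len:
        return "", "rejected_length"
    return text, "ok"


def batch(raws: list[str]) -> tuple[list[str], dict[str, int]]:
    # Pre-seed the counters so every reason appears even when its count is zero.
    stats = Counter(rejected_empty=0, rejected_charset=0, rejected_length=0)
    kept = []
    for raw in raws:
        canonical, outcome = classify(raw)
        if outcome == "ok":
            kept.append(canonical)  # duplicates and input order are preserved
        else:
            stats[outcome] += 1
    return kept, dict(stats)


normalized, stats = batch(["Café", "x", "Hello", "world123", " résumé "])
print(normalized)
print(stats)
```

Each rejected syllable is counted under exactly one reason, so the three counters add up to the number of syllables dropped.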
- build_tools.pyphen_syllable_normaliser.run_full_pipeline(run_directory, config, verbose=False, quiet=False)[source]
Run complete pyphen normalization pipeline with in-place processing.
Executes the full pyphen-specific workflow:
1. Aggregate syllables from run_directory/syllables/*.txt
2. Canonicalize syllables (Unicode normalization, etc.)
3. Frequency analysis
4. Write 5 output files to run_directory (in-place)
- Parameters:
run_directory (Path) – Pyphen run directory (e.g., _working/output/20260110_143022_pyphen/).
config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.
verbose (bool) – If True, print detailed progress information.
quiet (bool) – If True, suppress all output except errors.
- Returns:
NormalizationResult containing all outputs, statistics, and file paths.
- Raises:
FileNotFoundError – If run_directory or syllables/ subdirectory doesn’t exist.
ValueError – If run_directory is not a directory.
Example
>>> from pathlib import Path
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> run_dir = Path("_working/output/20260110_143022_pyphen/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=config,
...     verbose=True
... )
>>> result.stats.raw_count
15234
>>> result.stats.unique_canonical
4821