build_tools.pyphen_syllable_normaliser.normalizer
Core normalization logic for syllable canonicalization.
This module provides the SyllableNormalizer class, which transforms raw syllables into canonical form through Unicode normalization, diacritic stripping, lowercase conversion, and validation.
Classes
SyllableNormalizer | Normalizes syllables to canonical form.
Functions
normalize_batch(syllables, config) | Normalize a batch of syllables and collect rejection statistics.
Module Contents
- class build_tools.pyphen_syllable_normaliser.normalizer.SyllableNormalizer(config)
Normalizes syllables to canonical form.
This class applies a multi-step normalization pipeline to transform raw syllables into a standardized canonical representation. The pipeline includes Unicode normalization, diacritic removal, case normalization, and validation against charset and length constraints.
- config
Configuration specifying normalization parameters such as allowed charset, length constraints, and Unicode normalization form.
Example
>>> from build_tools.pyphen_syllable_normaliser import NormalizationConfig
>>> from build_tools.pyphen_syllable_normaliser.normalizer import SyllableNormalizer
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> print(normalizer.normalize("x"))  # Too short
None
>>> print(normalizer.normalize("résumé123"))  # Invalid characters
None
Initialize normalizer with configuration.
- Parameters:
config (build_tools.pyphen_syllable_normaliser.models.NormalizationConfig) – NormalizationConfig instance specifying normalization parameters including charset, length constraints, and Unicode normalization form.
Example
>>> config = NormalizationConfig(
...     min_length=3,
...     max_length=10,
...     allowed_charset="abcdefghijklmnopqrstuvwxyz",
...     unicode_form="NFKD"
... )
>>> normalizer = SyllableNormalizer(config)
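The choice of unicode_form matters for what gets decomposed. As a small illustration using only the standard library (not this package's code), the snippet below compares canonical decomposition (NFD) with compatibility decomposition (NFKD) on an accented letter and on a ligature:

```python
import unicodedata

accented = "\u00e9"   # precomposed 'é'
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI, a compatibility character

# NFD splits 'é' into 'e' + U+0301 (combining acute) but leaves the ligature alone.
nfd_accented = unicodedata.normalize("NFD", accented)
nfd_ligature = unicodedata.normalize("NFD", ligature)

# NFKD performs the same canonical split and also expands the ligature to 'f' + 'i'.
nfkd_accented = unicodedata.normalize("NFKD", accented)
nfkd_ligature = unicodedata.normalize("NFKD", ligature)

print([hex(ord(ch)) for ch in nfd_accented])
print(nfd_ligature == ligature, nfkd_ligature)
```

With NFKD, compatibility characters such as ligatures are folded into plain letters before charset validation, whereas NFD would leave them in place and they would likely be rejected by an ASCII-only charset.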
- normalize(syllable)
Normalize a single syllable to canonical form.
Applies the complete normalization pipeline:
1. Unicode normalization (NFKD by default)
2. Strip diacritics (remove combining characters)
3. Lowercase conversion
4. Trim whitespace
5. Validate charset (only allowed characters)
6. Check length constraints
- Parameters:
syllable (str) – Raw syllable string to normalize.
- Returns:
Normalized canonical syllable string, or None if the syllable is rejected due to:
- Becoming empty after normalization
- Containing invalid characters
- Not meeting length constraints
- Return type:
Optional[str]
Example
>>> config = NormalizationConfig()
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize(" HELLO ")
'hello'
>>> normalizer.normalize("résumé")
'resume'
>>> print(normalizer.normalize(""))  # Empty
None
>>> print(normalizer.normalize("x"))  # Too short (min_length=2)
None
>>> print(normalizer.normalize("hello123"))  # Invalid chars
None
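The six pipeline steps listed for normalize() can be sketched with the standard library alone. This is a minimal sketch, not the package's implementation: the function name normalize_sketch and the constants ALLOWED, MIN_LENGTH, and MAX_LENGTH are hypothetical stand-ins for values that would come from NormalizationConfig.

```python
import unicodedata
from typing import Optional

# Hypothetical stand-ins for NormalizationConfig values (assumed for illustration).
ALLOWED = frozenset("abcdefghijklmnopqrstuvwxyz")
MIN_LENGTH, MAX_LENGTH = 2, 8


def normalize_sketch(syllable: str, form: str = "NFKD") -> Optional[str]:
    # 1. Unicode normalization: decompose letters into base + combining marks.
    text = unicodedata.normalize(form, syllable)
    # 2. Strip diacritics: drop every combining character.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 3. Lowercase conversion.
    text = text.lower()
    # 4. Trim surrounding whitespace.
    text = text.strip()
    if not text:
        return None  # became empty after normalization
    # 5. Validate charset: every character must be allowed.
    if any(ch not in ALLOWED for ch in text):
        return None
    # 6. Check length constraints.
    if not MIN_LENGTH <= len(text) <= MAX_LENGTH:
        return None
    return text


print(normalize_sketch("Café"), normalize_sketch(" HELLO "), normalize_sketch("hello123"))
```

Running the checks in this order means a syllable with invalid characters is rejected for its charset even if it is also too long, which is one plausible way to keep rejection reasons unambiguous.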
- strip_diacritics(text)
Remove diacritics (accent marks) from Unicode text.
Uses Unicode normalization (NFD/NFKD) to decompose characters into base characters and combining marks, then removes the combining marks. This converts accented characters like ‘é’ → ‘e’, ‘ñ’ → ‘n’, etc.
- Parameters:
text (str) – Unicode string potentially containing diacritics.
- Returns:
String with all combining diacritical marks removed.
- Return type:
str
Example
>>> normalizer = SyllableNormalizer(NormalizationConfig())
>>> normalizer.strip_diacritics("café")
'cafe'
>>> normalizer.strip_diacritics("naïve")
'naive'
>>> normalizer.strip_diacritics("Zürich")
'Zurich'
>>> normalizer.strip_diacritics("São Paulo")
'Sao Paulo'
Note
This method assumes the text has already been normalized to NFD or NFKD form. The normalize() method handles this automatically.
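To illustrate why decomposition matters before combining marks are removed, here is a small standard-library sketch. The helper strip_combining_marks is a hypothetical name and only approximates what strip_diacritics is described as doing; it is not the package's code.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Drop combining characters; assumes `text` is already decomposed."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


precomposed = "caf\u00e9"                               # 'é' as one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # 'e' followed by U+0301

# Without decomposition there is no separate combining mark to remove.
unchanged = strip_combining_marks(precomposed)
# After decomposition the accent is its own code point and is dropped.
stripped = strip_combining_marks(decomposed)

print(unchanged == precomposed, stripped)
```

Letters whose mark is not a combining character in Unicode, such as 'ø', have no decomposition and pass through this approach unchanged, so they would presumably be caught by charset validation instead.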
- build_tools.pyphen_syllable_normaliser.normalizer.normalize_batch(syllables, config)
Normalize a batch of syllables and collect rejection statistics.
This is a convenience function for normalizing multiple syllables at once while tracking why syllables were rejected.
- Parameters:
syllables (list[str]) – List of raw syllable strings to normalize.
config (build_tools.pyphen_syllable_normaliser.models.NormalizationConfig) – NormalizationConfig specifying normalization parameters.
- Returns:
Tuple of (normalized_syllables, rejection_stats), where:
- normalized_syllables: List of successfully normalized syllables
- rejection_stats: Dictionary with rejection counts:
  - "rejected_empty": Syllables that became empty after normalization
  - "rejected_charset": Syllables with invalid characters
  - "rejected_length": Syllables outside length constraints
- Return type:
tuple[list[str], dict[str, int]]
Example
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> syllables = ["Café", "x", "Hello", "world123", " résumé "]
>>> normalized, stats = normalize_batch(syllables, config)
>>> normalized
['cafe', 'hello', 'resume']
>>> stats
{'rejected_empty': 0, 'rejected_charset': 1, 'rejected_length': 1}
Note
This function processes syllables in order and preserves duplicates. For frequency analysis, use the frequency.py module, which handles deduplication and counting.
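The batch behaviour described above (order preserved, duplicates kept, one counter per rejection reason) can be sketched as follows. Everything here is an assumption made for illustration: classify and normalize_batch_sketch are hypothetical names, the constants stand in for NormalizationConfig, and collections.Counter is only a stand-in for whatever frequency.py does.

```python
import unicodedata
from collections import Counter

# Hypothetical stand-ins for NormalizationConfig values (assumed for illustration).
ALLOWED = frozenset("abcdefghijklmnopqrstuvwxyz")
MIN_LENGTH, MAX_LENGTH = 2, 8


def classify(syllable):
    """Return (normalized, None) on success or (None, reason) on rejection."""
    text = unicodedata.normalize("NFKD", syllable)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    if not text:
        return None, "rejected_empty"
    if any(ch not in ALLOWED for ch in text):
        return None, "rejected_charset"
    if not MIN_LENGTH <= len(text) <= MAX_LENGTH:
        return None, "rejected_length"
    return text, None


def normalize_batch_sketch(syllables):
    normalized = []
    stats = {"rejected_empty": 0, "rejected_charset": 0, "rejected_length": 0}
    for raw in syllables:  # input order is preserved
        value, reason = classify(raw)
        if reason is None:
            normalized.append(value)  # duplicates are kept
        else:
            stats[reason] += 1
    return normalized, stats


batch = ["Café", "x", "Hello", "world123", " résumé ", "cafe"]
normalized, stats = normalize_batch_sketch(batch)
counts = Counter(normalized)  # deduplicate and count as a separate step
print(normalized, stats, counts["cafe"])
```

Keeping counting out of the batch function means the normalized list still reflects the raw input one-to-one (minus rejections), which is what a later frequency pass needs.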