build_tools.pyphen_syllable_normaliser.normalizer

Core normalization logic for syllable canonicalization.

This module provides the SyllableNormalizer class, which transforms raw syllables into canonical form through Unicode normalization, diacritic stripping, lowercase conversion, and validation.

Classes

SyllableNormalizer

Normalizes syllables to canonical form.

Functions

normalize_batch(syllables, config)

Normalize a batch of syllables and collect rejection statistics.

Module Contents

class build_tools.pyphen_syllable_normaliser.normalizer.SyllableNormalizer(config)

Normalizes syllables to canonical form.

This class applies a multi-step normalization pipeline to transform raw syllables into a standardized canonical representation. The pipeline includes Unicode normalization, diacritic removal, case normalization, and validation against charset and length constraints.

config

Configuration specifying normalization parameters such as allowed charset, length constraints, and Unicode normalization form.

Example

>>> from build_tools.pyphen_syllable_normaliser import NormalizationConfig
>>> from build_tools.pyphen_syllable_normaliser.normalizer import SyllableNormalizer
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize("x") is None  # Too short
True
>>> normalizer.normalize("résumé123") is None  # Invalid characters
True

Initialize normalizer with configuration.

Parameters:

config (build_tools.pyphen_syllable_normaliser.models.NormalizationConfig) – NormalizationConfig instance specifying normalization parameters including charset, length constraints, and Unicode normalization form.

Example

>>> config = NormalizationConfig(
...     min_length=3,
...     max_length=10,
...     allowed_charset="abcdefghijklmnopqrstuvwxyz",
...     unicode_form="NFKD"
... )
>>> normalizer = SyllableNormalizer(config)
normalize(syllable)

Normalize a single syllable to canonical form.

Applies the complete normalization pipeline:

  1. Unicode normalization (NFKD by default)

  2. Strip diacritics (remove combining characters)

  3. Lowercase conversion

  4. Trim whitespace

  5. Validate charset (only allowed characters)

  6. Check length constraints
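The six steps above can be sketched with the standard library alone. This is a minimal illustration, not the class's actual implementation: normalize_sketch, ALLOWED, MIN_LENGTH, and MAX_LENGTH are hypothetical names standing in for the method and for NormalizationConfig fields, and the real defaults may differ.

```python
import unicodedata

# Hypothetical stand-ins for NormalizationConfig values (assumed, not the real defaults).
ALLOWED = frozenset("abcdefghijklmnopqrstuvwxyz")
MIN_LENGTH, MAX_LENGTH = 2, 8


def normalize_sketch(syllable, form="NFKD"):
    """Return the canonical syllable, or None if it is rejected."""
    # 1. Unicode normalization: decompose accented characters.
    text = unicodedata.normalize(form, syllable)
    # 2. Strip diacritics: drop combining marks left by the decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 3-4. Lowercase, then trim surrounding whitespace.
    text = text.lower().strip()
    if not text:
        return None  # became empty
    # 5. Every remaining character must be in the allowed charset.
    if any(ch not in ALLOWED for ch in text):
        return None
    # 6. Length must fall inside the configured bounds.
    if not MIN_LENGTH <= len(text) <= MAX_LENGTH:
        return None
    return text


print(normalize_sketch("Café"))       # cafe
print(normalize_sketch("  HELLO  "))  # hello
print(normalize_sketch("hello123"))   # None
```

Trimming before the charset check matters in this sketch: with a letters-only charset, leading or trailing spaces would otherwise cause a charset rejection.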

Parameters:

syllable (str) – Raw syllable string to normalize.

Returns:

Normalized canonical syllable string, or None if the syllable is rejected due to:

  • Becoming empty after normalization

  • Containing invalid characters

  • Not meeting length constraints

Return type:

str | None

Example

>>> config = NormalizationConfig()
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize("  HELLO  ")
'hello'
>>> normalizer.normalize("résumé")
'resume'
>>> normalizer.normalize("") is None  # Empty
True
>>> normalizer.normalize("x") is None  # Too short (min_length=2)
True
>>> normalizer.normalize("hello123") is None  # Invalid chars
True
strip_diacritics(text)

Remove diacritics (accent marks) from Unicode text.

Uses Unicode normalization (NFD/NFKD) to decompose characters into base characters and combining marks, then removes the combining marks. This converts accented characters like ‘é’ → ‘e’, ‘ñ’ → ‘n’, etc.

Parameters:

text (str) – Unicode string potentially containing diacritics.

Returns:

String with all combining diacritical marks removed.

Return type:

str

Example

>>> normalizer = SyllableNormalizer(NormalizationConfig())
>>> normalizer.strip_diacritics("café")
'cafe'
>>> normalizer.strip_diacritics("naïve")
'naive'
>>> normalizer.strip_diacritics("Zürich")
'Zurich'
>>> normalizer.strip_diacritics("São Paulo")
'Sao Paulo'

Note

This method assumes the text has already been normalized to NFD or NFKD form. The normalize() method handles this automatically.
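To illustrate why decomposition matters, here is a small standard-library sketch of mark stripping. strip_combining is a hypothetical helper, not the method itself, and whether strip_diacritics decomposes its input internally is not confirmed here; the sketch only shows how a filter on combining marks behaves on precomposed versus decomposed text.

```python
import unicodedata


def strip_combining(text):
    """Drop every character with a non-zero canonical combining class."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


precomposed = "caf\u00e9"                               # "é" as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # "e" followed by U+0301

print(len(precomposed), len(decomposed))            # 4 5
print(strip_combining(decomposed))                  # cafe
print(strip_combining(precomposed) == precomposed)  # True: nothing to strip
```

In the precomposed string the accent is not a separate character, so a combining-mark filter leaves it untouched; only after NFD or NFKD decomposition does the mark become removable.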

build_tools.pyphen_syllable_normaliser.normalizer.normalize_batch(syllables, config)

Normalize a batch of syllables and collect rejection statistics.

This is a convenience function for normalizing multiple syllables at once while tracking why syllables were rejected.

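Parameters:

  • syllables – Raw syllable strings to normalize.

  • config (build_tools.pyphen_syllable_normaliser.models.NormalizationConfig) – Normalization parameters applied to every syllable.

Returns:

Tuple of (normalized_syllables, rejection_stats), where:

  • normalized_syllables: List of successfully normalized syllables

  • rejection_stats: Dictionary with rejection counts:
    • "rejected_empty": Syllables that became empty after normalization

    • "rejected_charset": Syllables with invalid characters

    • "rejected_length": Syllables outside length constraints

Return type:

tuple[list[str], dict[str, int]]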

Example

>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> syllables = ["Café", "x", "Hello", "world123", "  résumé  "]
>>> normalized, stats = normalize_batch(syllables, config)
>>> normalized
['cafe', 'hello', 'resume']
>>> stats
{'rejected_empty': 0, 'rejected_charset': 1, 'rejected_length': 1}

Note

This function processes syllables in order and preserves duplicates. For frequency analysis, use the frequency.py module which handles deduplication and counting.
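The batch behaviour described above, including order and duplicate preservation, can be sketched as follows. This is an illustration under stated assumptions rather than the function's real code: classify and normalize_batch_sketch are hypothetical names, the charset and length bounds are assumed values, and collections.Counter merely stands in for whatever frequency.py does.

```python
import unicodedata
from collections import Counter

ALLOWED = frozenset("abcdefghijklmnopqrstuvwxyz")  # assumed charset


def classify(syllable, min_length=2, max_length=8):
    """Return (canonical, None) on success or (None, rejection_key) on failure."""
    text = unicodedata.normalize("NFKD", syllable)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    if not text:
        return None, "rejected_empty"
    if any(ch not in ALLOWED for ch in text):
        return None, "rejected_charset"
    if not min_length <= len(text) <= max_length:
        return None, "rejected_length"
    return text, None


def normalize_batch_sketch(syllables):
    normalized = []
    stats = {"rejected_empty": 0, "rejected_charset": 0, "rejected_length": 0}
    for raw in syllables:
        canonical, reason = classify(raw)
        if reason is None:
            normalized.append(canonical)  # input order and duplicates are kept
        else:
            stats[reason] += 1            # count the first failing check only
    return normalized, stats


normalized, stats = normalize_batch_sketch(["Café", "x", "cafe", "world123", "   "])
print(normalized)           # ['cafe', 'cafe']
print(stats)                # {'rejected_empty': 1, 'rejected_charset': 1, 'rejected_length': 1}
print(Counter(normalized))  # Counter({'cafe': 2})
```

Each rejected syllable is counted once, under the first check it fails, so the three counters plus the length of the normalized list always add up to the number of inputs.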