========================== Pyphen Syllable Normaliser ========================== .. currentmodule:: build_tools.pyphen_syllable_normaliser Overview -------- .. automodule:: build_tools.pyphen_syllable_normaliser :no-members: Command-Line Interface ---------------------- .. argparse:: :module: build_tools.pyphen_syllable_normaliser.cli :func: create_argument_parser :prog: python -m build_tools.pyphen_syllable_normaliser Output Format ------------- The pipeline generates 5 output files in the pyphen run directory with ``pyphen_`` prefix for provenance: 1. **pyphen_syllables_raw.txt** - Aggregated raw syllables (all occurrences preserved) 2. **pyphen_syllables_canonicalised.txt** - Normalized canonical syllables 3. **pyphen_syllables_frequencies.json** - Frequency intelligence (syllable → count mapping) 4. **pyphen_syllables_unique.txt** - Deduplicated canonical syllable inventory 5. **pyphen_normalization_meta.txt** - Detailed statistics and metadata report **In-Place Processing:** Unlike older versions which wrote to a separate output directory, the pyphen normaliser now processes run directories in-place, writing output files directly into the run directory: :: _working/output/20260110_143022_pyphen/ ├── syllables/ # Input (from pyphen extractor) │ ├── en_US_alice.txt │ ├── en_US_middlemarch.txt │ └── ... ├── meta/ # Metadata (from extractor) │ └── ... ├── pyphen_syllables_raw.txt # Output: Aggregated ├── pyphen_syllables_canonicalised.txt # Output: Normalized ├── pyphen_syllables_frequencies.json # Output: Frequency intelligence ├── pyphen_syllables_unique.txt # Output: Deduplicated └── pyphen_normalization_meta.txt # Output: Statistics **File structure examples:** ``pyphen_syllables_raw.txt`` (preserves all occurrences): :: café Café hello hello world ``pyphen_syllables_canonicalised.txt`` (normalized, duplicates preserved): :: cafe cafe hello hello world ``pyphen_syllables_frequencies.json`` (counts before deduplication): .. code-block:: json { "cafe": 2, "hello": 2, "world": 1 } ``pyphen_syllables_unique.txt`` (deduplicated, sorted): :: cafe hello world Integration Guide ----------------- The pyphen syllable normaliser is the second step after pyphen extraction, processing syllables within their run directories: **Standard workflow:** .. code-block:: bash # Step 1: Extract syllables using pyphen python -m build_tools.pyphen_syllable_extractor \ --source data/corpus/ \ --pattern "*.txt" \ --output _working/output/ \ --lang en_US # Step 2: Normalize extracted syllables (in-place) python -m build_tools.pyphen_syllable_normaliser \ --run-dir _working/output/20260110_143022_pyphen/ # Alternative: Auto-detect all pyphen run directories python -m build_tools.pyphen_syllable_normaliser \ --source _working/output/ # Step 3: Annotate with phonetic features (source-agnostic) python -m build_tools.syllable_feature_annotator \ --syllables _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \ --frequencies _working/output/20260110_143022_pyphen/pyphen_syllables_frequencies.json **Comparing with NLTK normaliser:** .. code-block:: bash # Pyphen pipeline (typographic hyphenation) python -m build_tools.pyphen_syllable_extractor \ --source data/corpus/ \ --lang en_US \ --output _working/output/ python -m build_tools.pyphen_syllable_normaliser \ --run-dir _working/output/20260110_143022_pyphen/ # NLTK pipeline (phonetic splitting) python -m build_tools.nltk_syllable_extractor \ --source data/corpus/ \ --output _working/output/ python -m build_tools.nltk_syllable_normaliser \ --run-dir _working/output/20260110_095213_nltk/ # Compare outputs - both use different prefixes (pyphen_* vs nltk_*) diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \ _working/output/20260110_095213_nltk/nltk_syllables_unique.txt **When to use pyphen normaliser vs NLTK normaliser:** **Use pyphen normaliser when:** - You used the pyphen syllable extractor - Your syllables are well-formed from typographic hyphenation - You want multi-language support (40+ languages) - You want in-place processing within run directories - You're working with pyphen's dictionary-based splits **Use NLTK normaliser when:** - You used the NLTK syllable extractor - Your syllables contain many single-letter fragments - You want phonetically coherent syllables reconstructed - You're working with NLTK's onset/coda-based splits (English only) - You want fragment cleaning preprocessing **3-Step Normalization Pipeline:** **Step 1 - Aggregation:** - Combines all input files into ``syllables_raw.txt`` - Preserves ALL occurrences (no deduplication) - Maintains raw counts for frequency analysis - Empty lines filtered during file reading **Step 2 - Canonicalization:** - Unicode normalization (NFKD - compatibility decomposition) - Strip diacritics: café → cafe, résumé → resume - Lowercase conversion - Trim whitespace - Charset validation (reject invalid characters) - Length constraint enforcement (default: min=2, max=20) - Outputs to ``syllables_canonicalised.txt`` **Step 3 - Frequency Analysis:** - Count occurrences of each canonical syllable - Generate frequency rankings and percentages - Create deduplicated unique list (alphabetically sorted) - Outputs: - ``syllables_frequencies.json`` - Frequency counts before deduplication - ``syllables_unique.txt`` - Authoritative syllable inventory - ``normalization_meta.txt`` - Comprehensive statistics report **Pipeline characteristics:** - Deterministic: same input always produces same output - Fast: processes thousands of syllables per second - Configurable: adjust length constraints, charset, unicode form - Comprehensive: detailed rejection statistics and metadata Notes ----- **Frequency Intelligence:** The frequency data captures how often each canonical syllable occurs **before** deduplication. This intelligence is essential for understanding natural language patterns and can inform weighted name generation: .. code-block:: json { "ka": 187, "ra": 162, "mi": 145, "ta": 98 } This shows "ka" appears 187 times in the canonical syllables, indicating it's a high-frequency pattern that may be desirable for common or natural-sounding names. **Normalization Behavior:** - All syllable processing is case-insensitive (output is lowercase) - Unicode normalization form NFKD provides maximum compatibility decomposition - Empty lines are filtered during aggregation (not counted as rejections) - Frequency counts capture occurrences BEFORE deduplication - Invalid syllables (wrong charset, wrong length) are rejected and counted in metadata **Default Constraints:** - Min length: 2 characters - Max length: 20 characters - Allowed charset: a-z (lowercase ASCII letters) - Unicode form: NFKD (compatibility decomposition) **Use Cases:** - Combining syllables from multiple language extractions - Normalizing variations in corpus encoding (UTF-8, Latin-1, etc.) - Filtering syllables by length for specific pattern requirements - Building frequency-aware name generation systems **In-Place Processing Philosophy:** The pyphen normaliser writes outputs directly into the run directory (not a separate location) because: - **Convention**: Each pyphen run is self-contained (extractor + normaliser outputs together) - **Simplicity**: No confusion about where normalized files live - **Provenance**: Run directory name (``*_pyphen``) and file prefix (``pyphen_*``) both indicate source **Processing Modes:** - **Specific run directory**: ``--run-dir /path/to/run/`` - Process one pyphen run - **Auto-detection**: ``--source /path/to/output/`` - Find and process all pyphen runs **Auto-Detection Criteria:** The auto-detection feature (``--source``) finds pyphen run directories by: 1. Scanning for directories ending with ``_pyphen`` 2. Verifying existence of ``syllables/`` subdirectory 3. Sorting chronologically by directory name This allows batch processing: .. code-block:: bash # Process all pyphen runs at once python -m build_tools.pyphen_syllable_normaliser --source _working/output/ # Output: # Found 3 pyphen run directories: # - 20260110_143022_pyphen # - 20260110_153045_pyphen # - 20260110_163010_pyphen # Processing... **Build-time tool:** This is a build-time tool only - not used during runtime name generation. API Reference ------------- .. automodule:: build_tools.pyphen_syllable_normaliser :members: :undoc-members: :show-inheritance: