============================ NLTK Syllable Normaliser ============================ .. currentmodule:: build_tools.nltk_syllable_normaliser Overview -------- .. automodule:: build_tools.nltk_syllable_normaliser :no-members: Command-Line Interface ---------------------- .. argparse:: :module: build_tools.nltk_syllable_normaliser.cli :func: create_argument_parser :prog: python -m build_tools.nltk_syllable_normaliser Output Format ------------- The pipeline generates 5 output files in the NLTK run directory with ``nltk_`` prefix for provenance: 1. **nltk_syllables_raw.txt** - Aggregated raw syllables (all occurrences preserved) 2. **nltk_syllables_canonicalised.txt** - After fragment cleaning + normalization 3. **nltk_syllables_frequencies.json** - Frequency intelligence (syllable → count mapping) 4. **nltk_syllables_unique.txt** - Deduplicated canonical syllable inventory 5. **nltk_normalization_meta.txt** - Detailed statistics and metadata report **In-Place Processing:** Unlike the pyphen normaliser which writes to a separate output directory, the NLTK normaliser processes run directories in-place, writing output files directly into the run directory: :: _working/output/20260110_095213_nltk/ ├── syllables/ # Input (from NLTK extractor) │ ├── en_US_alice.txt │ ├── en_US_middlemarch.txt │ └── ... ├── meta/ # Metadata (from extractor) │ └── ... ├── nltk_syllables_raw.txt # Output: Aggregated ├── nltk_syllables_canonicalised.txt # Output: After cleaning + normalization ├── nltk_syllables_frequencies.json # Output: Frequency intelligence ├── nltk_syllables_unique.txt # Output: Deduplicated └── nltk_normalization_meta.txt # Output: Statistics **File structure examples:** ``nltk_syllables_raw.txt`` (before fragment cleaning): :: cha pter i down the r a bbit ``nltk_syllables_canonicalised.txt`` (after fragment cleaning + normalization): :: cha pter idown the rabbit ``nltk_syllables_frequencies.json`` (counts after cleaning): .. code-block:: json { "cha": 1, "pter": 1, "idown": 1, "the": 1, "rabbit": 1 } ``nltk_syllables_unique.txt`` (deduplicated, sorted): :: cha idown pter rabbit the Integration Guide ----------------- The NLTK syllable normaliser is the second step after NLTK extraction, complementing the NLTK extractor: **Standard workflow (using NLTK extractor + normaliser):** .. code-block:: bash # Step 1: Extract syllables using NLTK/CMUDict python -m build_tools.nltk_syllable_extractor \ --source data/corpus/ \ --pattern "*.txt" \ --output _working/output/ # Step 2: Normalize extracted syllables (in-place) python -m build_tools.nltk_syllable_normaliser \ --run-dir _working/output/20260110_095213_nltk/ # Alternative: Auto-detect all NLTK run directories python -m build_tools.nltk_syllable_normaliser \ --source _working/output/ # Step 3: Annotate with phonetic features (source-agnostic) python -m build_tools.syllable_feature_annotator \ --syllables _working/output/20260110_095213_nltk/nltk_syllables_unique.txt \ --frequencies _working/output/20260110_095213_nltk/nltk_syllables_frequencies.json **Parallel workflow (comparing both extractors):** .. code-block:: bash # Extract and normalize with pyphen (typographic) python -m build_tools.pyphen_syllable_extractor \ --source data/corpus/ \ --lang en_US \ --output _working/output/ python -m build_tools.pyphen_syllable_normaliser \ --source _working/output/20260110_143022_pyphen/syllables/ \ --output _working/output/20260110_143022_pyphen/ # Extract and normalize with NLTK (phonetic) python -m build_tools.nltk_syllable_extractor \ --source data/corpus/ \ --output _working/output/ python -m build_tools.nltk_syllable_normaliser \ --run-dir _working/output/20260110_095213_nltk/ # Compare outputs - both use different prefixes (pyphen_* vs nltk_*) diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \ _working/output/20260110_095213_nltk/nltk_syllables_unique.txt **When to use NLTK normaliser vs pyphen normaliser:** **Use NLTK normaliser when:** - You used the NLTK syllable extractor - Your syllables contain many single-letter fragments - You want phonetically coherent syllables reconstructed - You're working with NLTK's onset/coda-based splits - You want in-place processing within run directories **Use pyphen normaliser when:** - You used the pyphen syllable extractor - Your syllables are already well-formed (typographic hyphenation) - You want to aggregate multiple extraction runs - You prefer explicit output directory specification - You're working with multi-language pyphen extractions **Fragment Cleaning:** The key differentiator of the NLTK normaliser is fragment cleaning. This step reconstructs phonetically coherent syllables from NLTK's over-segmented output: +-------------------+-------------------------+---------------------------+ | Original Fragments| After Fragment Cleaning | Reason | +===================+=========================+===========================+ | i, down | idown | Single vowel merged | +-------------------+-------------------------+---------------------------+ | r, a, bbit | ra, bbit | Single letters merged | +-------------------+-------------------------+---------------------------+ | h, o, le | ho, le | Single letters merged | +-------------------+-------------------------+---------------------------+ | cha, pter | cha, pter | Multi-char preserved | +-------------------+-------------------------+---------------------------+ **Merging Rules:** 1. Single vowels (a, e, i, o, u, y) merge with next fragment 2. Single consonants merge with next fragment 3. Multi-character fragments remain unchanged 4. Processing is left-to-right, deterministic **Processing Modes:** - **Specific run directory**: ``--run-dir /path/to/run/`` - Process one NLTK run - **Auto-detection**: ``--source /path/to/output/`` - Find and process all NLTK runs - **Skip fragment cleaning**: ``--no-fragment-cleaning`` - For comparison with pyphen Notes ----- **In-Place Processing Philosophy:** The NLTK normaliser writes outputs directly into the run directory (not a separate location) because: - **Convention**: Each NLTK run is self-contained (extractor + normaliser outputs together) - **Simplicity**: No confusion about where normalized files live - **Provenance**: Run directory name (``*_nltk``) and file prefix (``nltk_*``) both indicate source **Fragment Cleaning Statistics:** Real-world performance on multi-language corpus (21 files, 2.98M syllables): - **Before cleaning**: 2,977,447 syllables - **After cleaning**: 2,709,503 syllables - **Fragments merged**: 267,944 (9% reduction) - **Processing time**: 6.07 seconds - **Unique syllables**: 33,640 **When Fragment Cleaning Matters:** Fragment cleaning has the most impact on: - Short function words ("i", "a", "the") - Consonant clusters split by onset/coda ("r" + "a" → "ra") - Single-character prefixes/suffixes - Phonetically over-segmented words For well-formed multi-syllable words, fragment cleaning has minimal effect. **Comparing with Pyphen Normaliser:** +--------------------------------+---------------------------+---------------------------+ | Feature | Pyphen Normaliser | NLTK Normaliser | +================================+===========================+===========================+ | **Input Source** | Any directory | NLTK run directories | +--------------------------------+---------------------------+---------------------------+ | **Preprocessing** | None | Fragment cleaning | +--------------------------------+---------------------------+---------------------------+ | **Output Location** | User-specified directory | In-place (run directory) | +--------------------------------+---------------------------+---------------------------+ | **Output Prefix** | pyphen_* | nltk_* | +--------------------------------+---------------------------+---------------------------+ | **Run Detection** | Manual file discovery | Auto-detect *_nltk dirs | +--------------------------------+---------------------------+---------------------------+ | **Normalization Steps** | 3 (aggregate, canon, freq)| 4 (clean, aggregate, ...) | +--------------------------------+---------------------------+---------------------------+ | **Typical Use Case** | Pyphen extractor output | NLTK extractor output | +--------------------------------+---------------------------+---------------------------+ **Auto-Detection Criteria:** The auto-detection feature (``--source``) finds NLTK run directories by: 1. Scanning for directories ending with ``_nltk`` 2. Verifying existence of ``syllables/`` subdirectory 3. Sorting chronologically by directory name This allows batch processing: .. code-block:: bash # Process all NLTK runs at once python -m build_tools.nltk_syllable_normaliser --source _working/output/ # Output: # Found 3 NLTK run directories: # - 20260110_095213_nltk # - 20260110_143022_nltk # - 20260110_153045_nltk # Processing... **Deterministic Processing:** The NLTK normaliser is fully deterministic: - Same input → same output (always) - Fragment cleaning uses left-to-right greedy algorithm - Unicode normalization is deterministic (NFKD) - Frequency analysis preserves insertion order (Python 3.7+) **Build-time tool:** This is a build-time tool only - not used during runtime name generation. **Output File Prefixes:** Both normalisers now use prefixed output files: - Pyphen normaliser: ``pyphen_*`` prefix - NLTK normaliser: ``nltk_*`` prefix This ensures clear provenance when files are shared individually. API Reference ------------- .. automodule:: build_tools.nltk_syllable_normaliser :members: :undoc-members: :show-inheritance: