build_tools.nltk_syllable_normaliser
====================================

.. py:module:: build_tools.nltk_syllable_normaliser

.. autoapi-nested-parse::

   NLTK Syllable Normaliser - Fragment Cleaning + 3-Step Normalization Pipeline

   The NLTK syllable normaliser extends the standard normalization pipeline with
   NLTK-specific fragment cleaning to reconstruct phonetically coherent syllables
   from over-segmented output. This is a **build-time tool only** - not used
   during runtime name generation.

   NLTK-Specific Processing:

   1. **Fragment Cleaning** - Merge single-letter fragments with neighbors (NLTK-specific)
   2. **Aggregation** - Combine multiple input files while preserving all occurrences
   3. **Canonicalization** - Unicode normalization, diacritic stripping, charset validation
   4. **Frequency Analysis** - Count occurrences and generate frequency intelligence

   Key Differences from Pyphen Normaliser:

   - **Input Source**: Processes NLTK run directories with syllables/ subdirectory
   - **Preprocessing**: Fragment cleaning step merges isolated phonemes
   - **Output Location**: In-place in run directory (not separate output directory)
   - **Output Prefix**: nltk_ prefix (for provenance tracking)

   Features:

   - Fragment cleaning (single vowel/consonant merging)
   - Unicode normalization (NFKD, NFC, NFD, NFKC)
   - Diacritic stripping using unicodedata
   - Configurable charset and length constraints
   - Frequency intelligence capture (pre-deduplication counts)
   - Deterministic processing (same input = same output)
   - Comprehensive metadata reporting
   - 5 output files with nltk_ prefix for complete analysis

   The pipeline produces 5 output files (with nltk_ prefix for provenance):

   - nltk_syllables_raw.txt: Aggregated raw syllables (all occurrences preserved)
   - nltk_syllables_canonicalised.txt: Normalized canonical syllables (after fragment cleaning)
   - nltk_syllables_frequencies.json: Frequency intelligence (syllable → count)
   - nltk_syllables_unique.txt: Deduplicated canonical syllable inventory
   - nltk_normalization_meta.txt: Detailed statistics and metadata report

   Usage:

   >>> from pathlib import Path
   >>> from build_tools.nltk_syllable_normaliser import (
   ...     NormalizationConfig,
   ...     run_full_pipeline,
   ... )
   >>>
   >>> # Process NLTK run directory in-place
   >>> run_dir = Path("_working/output/20260110_095213_nltk/")
   >>> result = run_full_pipeline(
   ...     run_directory=run_dir,
   ...     config=NormalizationConfig(min_length=2, max_length=8),
   ...     verbose=True
   ... )
   >>>
   >>> # Access results
   >>> print(f"Processed {result.stats.raw_count:,} raw syllables")
   >>> print(f"After cleaning: {result.stats.after_fragment_cleaning:,}")
   >>> print(f"Canonical: {result.stats.after_canonicalization:,}")
   >>> print(f"Unique: {result.stats.unique_canonical:,}")

   CLI Usage:

   .. code-block:: bash

      # Process specific NLTK run directory
      python -m build_tools.nltk_syllable_normaliser --run-dir _working/output/20260110_095213_nltk/

      # Auto-detect NLTK run directories
      python -m build_tools.nltk_syllable_normaliser --source _working/output/

      # Custom configuration
      python -m build_tools.nltk_syllable_normaliser --run-dir --min 2 --max 8


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/build_tools/nltk_syllable_normaliser/cli/index
   /autoapi/build_tools/nltk_syllable_normaliser/fragment_cleaner/index
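
Fragment Cleaning Sketch
------------------------

The processing steps described in the module docstring (fragment cleaning,
canonicalization, frequency analysis) can be illustrated with a minimal,
standalone sketch. It is not the package's implementation. The function names
``merge_single_letter_fragments``, ``strip_diacritics`` and
``frequency_table`` are hypothetical, and the merge rule used here (attach a
single-letter fragment to the fragment that follows it, or to the previous one
if it is last) is an assumption; the real ``fragment_cleaner`` submodule may
apply different, vowel/consonant-aware rules.

.. code-block:: python

   import unicodedata
   from collections import Counter


   def merge_single_letter_fragments(fragments):
       """Merge single-letter fragments into a neighbouring fragment.

       Hypothetical rule, for illustration only: a single letter is
       prepended to the next fragment; a trailing single letter is
       appended to the previous one.
       """
       merged = []
       carry = ""  # single letter waiting to be prepended to the next fragment
       for fragment in fragments:
           candidate = carry + fragment
           if len(candidate) == 1:
               carry = candidate
           else:
               merged.append(candidate)
               carry = ""
       if carry:
           # No successor exists, so attach backwards (or keep it if alone).
           if merged:
               merged[-1] += carry
           else:
               merged.append(carry)
       return merged


   def strip_diacritics(text):
       """Decompose with NFKD, then drop combining marks."""
       decomposed = unicodedata.normalize("NFKD", text)
       return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


   def frequency_table(syllables):
       """Count occurrences before deduplication."""
       return Counter(syllables)


   # "\u00e9" is a precomposed e-acute, i.e. a single-character fragment.
   raw = ["ca", "f", "\u00e9", "hel", "l", "o", "cat", "s"]
   cleaned = merge_single_letter_fragments(raw)
   canonical = [strip_diacritics(s) for s in cleaned]
   counts = frequency_table(canonical)
   unique = sorted(counts)  # sorted for deterministic output

Counting before deduplication mirrors the "pre-deduplication counts" feature
noted in the docstring, and sorting the unique inventory is one simple way to
keep the output deterministic for identical input.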