build_tools.nltk_syllable_extractor
===================================

.. py:module:: build_tools.nltk_syllable_extractor

.. autoapi-nested-parse::

   NLTK Syllable Extractor - Phonetically-Guided Syllable Extraction

   The NLTK syllable extractor uses CMU Pronouncing Dictionary (via cmudict pip package) with
   onset/coda principles for phonetically-guided orthographic syllabification. This is a
   **build-time tool only** - not used during runtime name generation.

   The tool supports two modes:

   - **Interactive Mode** - Guided prompts for single-file processing
   - **Batch Mode** - Automated processing of multiple files via command-line arguments

   Features:

   - Phonetically-guided syllabification using CMU Pronouncing Dictionary (via cmudict package)
   - Onset/coda principles for natural consonant cluster splitting
   - English only (CMUDict limitation)
   - Preserves all syllables including duplicates (extraction only, no filtering)
   - Configurable syllable length constraints (defaults to no filtering)
   - Deterministic extraction (same input = same output)
   - Unicode support
   - Comprehensive metadata and statistics
   - Automatic provenance tracking via corpus_db ledger (batch mode)

   Key Differences from pyphen Extractor:

   - Uses phonetic information (CMUDict) rather than typographic hyphenation rules
   - Respects phonotactic constraints via onset/coda principles
   - Produces more "natural" phonetic splits (e.g., "Andrew" → "An-drew" not "And-rew")
   - English only vs pyphen's 40+ languages
   - Complementary tool, not a replacement

   Main Components:

   - NltkSyllableExtractor: Core extraction class
   - ExtractionResult: Data model for extraction results
   - FileProcessingResult: Result for single file in batch mode
   - BatchResult: Aggregate results for batch processing

   Usage:

       >>> from pathlib import Path
       >>> from build_tools.nltk_syllable_extractor import NltkSyllableExtractor
       >>>
       >>> # Initialize extractor for English (defaults to no length filtering)
       >>> extractor = NltkSyllableExtractor('en_US')
       >>>
       >>> # Extract syllables from text (preserves duplicates)
       >>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
       >>> print(syllables)  # Note: includes all syllables with duplicates
       ['hel', 'lo', 'won', 'der', 'ful', 'world']
       >>> print(f"Total: {len(syllables)}, Unique: {len(set(syllables))}")
       Total: 6, Unique: 6
       >>>
       >>> # Extract from a file
       >>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
       >>>
       >>> # Save results (preserves duplicates)
       >>> extractor.save_syllables(syllables, Path('output.txt'))

   CLI Usage:

   .. code-block:: bash

      # Interactive mode
      python -m build_tools.nltk_syllable_extractor

      # Single file
      python -m build_tools.nltk_syllable_extractor --file input.txt

      # Batch processing
      python -m build_tools.nltk_syllable_extractor --source ~/docs/ --recursive


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/build_tools/nltk_syllable_extractor/batch/index
   /autoapi/build_tools/nltk_syllable_extractor/cli/index
   /autoapi/build_tools/nltk_syllable_extractor/extractor/index
   /autoapi/build_tools/nltk_syllable_extractor/file_io/index
   /autoapi/build_tools/nltk_syllable_extractor/interactive/index
   /autoapi/build_tools/nltk_syllable_extractor/models/index


Attributes
----------

.. autoapisummary::

   build_tools.nltk_syllable_extractor.main_interactive
   build_tools.nltk_syllable_extractor.main_batch
   build_tools.nltk_syllable_extractor.process_single_file_batch


Package Contents
----------------

.. py:data:: main_interactive

.. py:data:: main_batch

.. py:data:: process_single_file_batch
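
The onset/coda principle mentioned in the module description can be illustrated with a small, spelling-only sketch. This is not the package's implementation: the real extractor consults CMUDict for phonetic guidance, whereas the code below works on letters alone. The names ``LEGAL_ONSETS``, ``split_cluster``, and ``syllabify`` are hypothetical, and the onset inventory is a deliberately tiny assumption chosen for illustration. The idea shown is the maximal onset rule: a consonant cluster between two vowels gives as many trailing consonants as possible to the following syllable, provided they form a legal onset.

```python
import re

# Hypothetical, deliberately small inventory of legal English onsets (assumption).
LEGAL_ONSETS = {
    "", "b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "r",
    "s", "t", "v", "w", "z", "bl", "br", "ch", "cl", "cr", "dr", "fl", "fr",
    "gl", "gr", "pl", "pr", "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st",
    "sw", "th", "tr", "scr", "spl", "spr", "str",
}


def split_cluster(cluster):
    """Split an intervocalic consonant cluster into (coda, onset).

    The onset receives the longest suffix of the cluster that is a legal onset.
    """
    for i in range(len(cluster) + 1):
        if cluster[i:] in LEGAL_ONSETS:
            return cluster[:i], cluster[i:]
    return cluster, ""  # not reached: the empty onset is always legal


def syllabify(word):
    """Split a word into orthographic syllables using the maximal onset rule."""
    # Alternating runs of vowel letters and non-vowel letters.
    runs = re.findall(r"[aeiouy]+|[^aeiouy]+", word.lower())
    syllables = []
    current = ""
    for idx, run in enumerate(runs):
        is_vowel_run = run[0] in "aeiouy"
        if is_vowel_run or idx == 0 or idx == len(runs) - 1:
            # Vowel nuclei, word-initial onsets, and word-final codas
            # stay attached to the syllable being built.
            current += run
        else:
            # Intervocalic cluster: close the syllable with the coda,
            # start the next one with the onset.
            coda, onset = split_cluster(run)
            syllables.append(current + coda)
            current = onset
    if current:
        syllables.append(current)
    return syllables


print(syllabify("Andrew"))     # ['an', 'drew']
print(syllabify("wonderful"))  # ['won', 'der', 'ful']
```

Because "dr" is a legal onset and "ndr" is not, the cluster in "Andrew" splits as "n" + "dr", giving "An-drew" rather than "And-rew". A letter-based sketch like this mishandles silent letters and digraphs, which is one reason a pronunciation dictionary is useful.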