build_tools.nltk_syllable_extractor.extractor
=============================================

.. py:module:: build_tools.nltk_syllable_extractor.extractor

.. autoapi-nested-parse::

   Core CMUDict-based syllable extraction functionality.

   This module provides the NltkSyllableExtractor class for extracting syllables
   from text using the CMU Pronouncing Dictionary, with phonetically guided
   orthographic syllabification based on onset/coda principles.


Attributes
----------

.. autoapisummary::

   build_tools.nltk_syllable_extractor.extractor.CMUDICT_AVAILABLE


Classes
-------

.. autoapisummary::

   build_tools.nltk_syllable_extractor.extractor.NltkSyllableExtractor


Module Contents
---------------

.. py:data:: CMUDICT_AVAILABLE
   :value: True


.. py:class:: NltkSyllableExtractor(language_code, min_syllable_length = 1, max_syllable_length = 999)

   Extracts syllables from text using the CMU Pronouncing Dictionary.

   This class uses phonetic information from CMUDict to guide orthographic
   syllable splitting, respecting English phonotactic constraints via
   onset/coda principles.

   The extractor works by:

   1. Reading text input (string or file)
   2. Tokenizing into words using a regex
   3. Looking up phonetic transcriptions in CMUDict
   4. Using vowel phonemes to identify syllable boundaries
   5. Mapping phonetic structure back to orthographic positions
   6. Applying onset/coda rules to split consonant clusters
   7. Filtering syllables by length constraints
   8. Returning all syllables in lowercase (duplicates preserved)

   Key differences from pyphen:

   - Uses phonetic information (CMUDict) rather than typographic rules
   - Respects consonant cluster constraints (onset/coda principles)
   - Produces more "natural" phonetic splits
   - English only (CMUDict limitation)
   - Includes fallback for out-of-vocabulary words

   Typical usage:

   >>> # Basic extraction
   >>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
   >>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
   >>> print(sorted(syllables))
   ['der', 'ful', 'hel', 'lo', 'won', 'world']

   >>> # Extract from file and save
   >>> from pathlib import Path
   >>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
   >>> extractor.save_syllables(syllables, Path('output.txt'))

   .. attribute:: language_code

      The language code (always 'en_US' for NLTK extractor)

   .. attribute:: min_syllable_length

      Minimum syllable length to include in results

   .. attribute:: max_syllable_length

      Maximum syllable length to include in results

   .. attribute:: cmu_dict

      The loaded CMU Pronouncing Dictionary

   .. note::

      This is a build-time tool. The nltk dependency should not be used at
      runtime in the core name generation system.

   Initialize the NLTK syllable extractor.

   :param language_code: Language code (must be 'en_US' for NLTK extractor)
   :param min_syllable_length: Minimum syllable length to include (default: 1, no filtering)
   :param max_syllable_length: Maximum syllable length to include (default: 999, no filtering)

   :raises ImportError: If cmudict is not installed
   :raises ValueError: If the language code is not 'en_US'

   .. py:attribute:: VALID_ONSETS

   .. py:attribute:: VOWELS
      :value: 'aeiouy'

   .. py:attribute:: language_code

   .. py:attribute:: min_syllable_length
      :value: 1

   .. py:attribute:: max_syllable_length
      :value: 999

   .. py:attribute:: cmu_dict

   .. py:method:: extract_syllables_from_text(text, only_hyphenated = True)

      Extract all syllables from a block of text (preserves duplicates).

      This method tokenizes the input text into words, then applies CMUDict
      phonetic lookup and onset/coda principles to extract the syllables that
      meet the configured length constraints.

      :param text: Input text to process. May contain any characters, but only
                   alphabetic sequences are processed as words.
      :param only_hyphenated: If True, only include syllables from words that were
                              successfully split (CMUDict lookup succeeded). Set to
                              False to include fallback syllabification for unknown
                              words.

      :returns: A tuple ``(syllables, statistics)`` where:

                - syllables: List of all lowercase syllable strings (includes duplicates)
                - statistics: Dict with the following keys:

                  - 'total_words': Total number of words found in source text
                  - 'processed_words': Words that were successfully processed
                  - 'fallback_count': Words not in CMUDict (used fallback heuristics)
                  - 'rejected_syllables': Syllables rejected due to length constraints
      :rtype: tuple

      .. note::

         - Only processes words containing alphabetic characters (a-z, A-Z)
         - Case-insensitive processing (all output is lowercase)
         - Automatically removes punctuation and special characters
         - Filters syllables by configured min/max length constraints
         - When only_hyphenated=True, excludes words not in CMUDict
         - Deterministic: same input always produces same output
         - Uses the first pronunciation when multiple exist
         - Words are extracted using the regex pattern ``\b[a-zA-Z]+\b``

      .. admonition:: Example

         >>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
         >>> syllables, stats = extractor.extract_syllables_from_text("Hello world!")
         >>> print(syllables)
         ['hel', 'lo', 'world']
         >>> print(stats['total_words'])
         2

   .. py:method:: extract_syllables_from_file(input_path)

      Extract all syllables from a text file (preserves duplicates).

      This is a convenience wrapper around extract_syllables_from_text() that
      reads the file as UTF-8 and handles read errors.

      :param input_path: Path to the input text file. The file should be UTF-8 encoded
                         plain text; binary files or non-text formats will cause errors.

      :returns: A tuple ``(syllables, statistics)`` where:

                - syllables: List of all lowercase syllable strings (includes duplicates)
                - statistics: Dict with processing statistics (see extract_syllables_from_text)
      :rtype: tuple

      :raises FileNotFoundError: If the input file doesn't exist at the specified path
      :raises IOError: If there's an error reading the file (permissions, encoding, etc.)

      .. admonition:: Example

         >>> from pathlib import Path
         >>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
         >>> syllables, stats = extractor.extract_syllables_from_file(Path('book.txt'))
         >>> print(f"Extracted {len(syllables)} syllables from {stats['total_words']} words")
         Extracted 1250 syllables from 50000 words

   .. py:method:: save_syllables(syllables, output_path)

      Save syllables to a text file (one syllable per line, preserves all).

      Writes the syllables as UTF-8 text in the order they appear in the list,
      so duplicates are preserved. This format suits downstream processing by
      normalizer tools.

      :param syllables: List of syllables to save (may contain duplicates).
                        Written in the order provided.
      :param output_path: Path to the output file. Parent directories must exist.
                          If the file exists, it will be overwritten.

      :raises IOError: If there's an error writing the file (permissions, disk space, etc.)

      .. admonition:: Example

         >>> from pathlib import Path
         >>> extractor = NltkSyllableExtractor('en_US')
         >>> syllables = ['hel', 'lo', 'world', 'hel']  # Note: 'hel' appears twice
         >>> extractor.save_syllables(syllables, Path('output.txt'))
         >>> # Creates a file with this content (preserving duplicates and order):
         >>> # hel
         >>> # lo
         >>> # world
         >>> # hel

      .. note::

         The output file uses UTF-8 encoding with Unix-style line endings (``\n``).
         Each line contains exactly one syllable with no leading/trailing whitespace.
         Duplicates are preserved. Use downstream tools for deduplication if needed.
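
The note on ``save_syllables`` leaves deduplication to downstream tools. As a minimal sketch of that step, assuming a file in the one-syllable-per-line UTF-8 format described above, the following reads the file back, drops repeats in first-seen order, and counts how often each syllable occurred. The helper names ``load_syllables`` and ``deduplicate`` are hypothetical and are not part of this module, and the temporary file only stands in for real extractor output.

```python
import tempfile
from collections import Counter
from pathlib import Path
from typing import List


def load_syllables(path: Path) -> List[str]:
    """Read a one-syllable-per-line UTF-8 file, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def deduplicate(syllables: List[str]) -> List[str]:
    """Drop repeated syllables while keeping first-seen order."""
    return list(dict.fromkeys(syllables))


# Hypothetical stand-in for a file written by save_syllables():
# one syllable per line, UTF-8, "\n" line endings, duplicates kept.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "output.txt"
    with output_path.open("w", encoding="utf-8", newline="\n") as handle:
        handle.write("hel\nlo\nworld\nhel\n")
    syllables = load_syllables(output_path)

unique = deduplicate(syllables)
counts = Counter(syllables)

print(unique)         # ['hel', 'lo', 'world']
print(counts["hel"])  # 2
```

Keeping the raw list alongside the deduplicated one preserves frequency information, which a normalizer may want before discarding repeats.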