build_tools.nltk_syllable_extractor.extractor

Core CMUDict-based syllable extraction functionality.

This module provides the NltkSyllableExtractor class, which extracts syllables from text using the CMU Pronouncing Dictionary (CMUDict) and phonetically guided orthographic syllabification based on onset/coda principles.

Attributes

CMUDICT_AVAILABLE

Classes

NltkSyllableExtractor

Extracts syllables from text using the CMU Pronouncing Dictionary.

Module Contents

build_tools.nltk_syllable_extractor.extractor.CMUDICT_AVAILABLE = True

class build_tools.nltk_syllable_extractor.extractor.NltkSyllableExtractor(language_code, min_syllable_length=1, max_syllable_length=999)

Extracts syllables from text using the CMU Pronouncing Dictionary.

This class uses phonetic information from CMUDict to guide orthographic syllable splitting, respecting English phonotactic constraints via onset/coda principles.
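The onset/coda idea can be illustrated with a minimal sketch of a maximal-onset split: for a consonant cluster between two vowels, the longest suffix that is a legal onset starts the next syllable, and the rest stays behind as the coda. The SAMPLE_ONSETS set and the split_cluster helper are hypothetical names for illustration only; the contents of the real VALID_ONSETS attribute and the class's internal logic may differ.

```python
# Illustrative subset of English onsets (an assumption, not the real VALID_ONSETS).
SAMPLE_ONSETS = {"d", "f", "l", "n", "r", "s", "t", "w", "st", "tr", "str", "pl", "fl"}


def split_cluster(cluster, onsets=SAMPLE_ONSETS):
    """Split a consonant cluster into (coda, onset) using the maximal-onset principle."""
    # Try the longest candidate onset first, then progressively shorter suffixes.
    for i in range(len(cluster)):
        if cluster[i:] in onsets:
            return cluster[:i], cluster[i:]
    # No suffix is a legal onset: keep the whole cluster as the coda.
    return cluster, ""


print(split_cluster("ll"))    # 'hello'     -> hel | lo
print(split_cluster("nd"))    # 'wonderful' -> won | der
print(split_cluster("nstr"))  # 'construct' -> con | struct
```

Under these assumptions, "ll" splits as ('l', 'l') and "nstr" as ('n', 'str'), which matches the 'hel' / 'lo' split shown in the usage examples.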

The extractor works by:

1. Reading text input (string or file)
2. Tokenizing into words using a regex
3. Looking up phonetic transcriptions in CMUDict
4. Using vowel phonemes to identify syllable boundaries
5. Mapping phonetic structure back to orthographic positions
6. Applying onset/coda rules to split consonant clusters
7. Filtering syllables by length constraints
8. Returning all syllables in lowercase (duplicates are preserved)
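Steps 3 and 4 can be sketched as follows. In CMUDict's ARPAbet notation, vowel phonemes carry a trailing stress digit (0, 1, or 2), so counting them gives the number of syllables in a word. The SAMPLE_PRONUNCIATIONS mapping is a small hand-written stand-in for the loaded dictionary, and count_syllables is a hypothetical helper, not a method of the class.

```python
# Hand-written stand-in for CMUDict entries: word -> list of pronunciations.
SAMPLE_PRONUNCIATIONS = {
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "wonderful": [["W", "AH1", "N", "D", "ER0", "F", "AH0", "L"]],
    "world": [["W", "ER1", "L", "D"]],
}


def count_syllables(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Return the syllable count from the first pronunciation, or None if unknown."""
    entries = pronunciations.get(word.lower())
    if not entries:
        return None  # out-of-vocabulary: a fallback heuristic would be needed
    first = entries[0]  # use the first pronunciation so results stay deterministic
    # Vowel phonemes are the ones ending in a stress digit.
    return sum(1 for phoneme in first if phoneme[-1].isdigit())


for word in ("Hello", "wonderful", "world", "zzyzx"):
    print(word, count_syllables(word))
```

The syllable count tells the splitter how many orthographic pieces to produce; mapping the phoneme positions back onto letters (step 5) is the harder part and is not shown here.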

Key Differences from pyphen:

Uses phonetic information (CMUDict) rather than typographic rules
Respects consonant cluster constraints (onset/coda principles)
Produces more “natural” phonetic splits
English only (CMUDict limitation)
Includes fallback for out-of-vocabulary words

Typical Usage:

>>> # Basic extraction
>>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
>>> print(sorted(syllables))
['der', 'ful', 'hel', 'lo', 'won', 'world']

>>> # Extract from file and save
>>> from pathlib import Path
>>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
>>> extractor.save_syllables(syllables, Path('output.txt'))

language_code: The language code (always ‘en_US’ for NLTK extractor)

min_syllable_length: Minimum syllable length to include in results

max_syllable_length: Maximum syllable length to include in results

cmu_dict: The loaded CMU Pronouncing Dictionary

Note

This is a build-time tool. The nltk dependency should not be used at runtime in the core name generation system.

Initialize the NLTK syllable extractor.

Parameters:

language_code (str) – Language code (must be ‘en_US’ for NLTK extractor)
min_syllable_length (int) – Minimum syllable length to include (default: 1, no filtering)
max_syllable_length (int) – Maximum syllable length to include (default: 999, no filtering)

Raises:

ImportError – If cmudict is not installed
ValueError – If the language code is not ‘en_US’

VALID_ONSETS

VOWELS = 'aeiouy'

language_code

min_syllable_length = 1

max_syllable_length = 999

cmu_dict

extract_syllables_from_text(text, only_hyphenated=True)

Extract all syllables from a block of text (preserves duplicates).

This method processes input text by tokenizing it into words, applying CMUDict phonetic lookup and onset/coda principles to extract individual syllables that meet the configured length constraints.
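The length constraint can be pictured with a short sketch. The filter_by_length helper is hypothetical, and treating both bounds as inclusive is an assumption; the rejected count corresponds in spirit to the 'rejected_syllables' statistic.

```python
def filter_by_length(syllables, min_length=1, max_length=999):
    """Keep syllables whose length lies within [min_length, max_length] (assumed inclusive)."""
    kept = [s for s in syllables if min_length <= len(s) <= max_length]
    rejected = len(syllables) - len(kept)
    return kept, rejected


kept, rejected = filter_by_length(["a", "hel", "lo", "strengths"], min_length=2, max_length=8)
print(kept)      # ['hel', 'lo']
print(rejected)  # 2
```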

Parameters:

text (str) – Input text to process. Can contain any characters, but only alphabetic sequences will be processed as words.
only_hyphenated (bool) – If True, only include syllables from words that were successfully split (CMUDict lookup succeeded). Set to False to include fallback syllabification for unknown words.

Returns:

Tuple of (syllables, statistics), where:

syllables: List of all lowercase syllable strings (includes duplicates)
statistics: Dict with the following keys:
- 'total_words': Total number of words found in source text
- 'processed_words': Words that were successfully processed
- 'fallback_count': Words not in CMUDict (used fallback heuristics)
- 'rejected_syllables': Syllables rejected due to length constraints

Return type:

tuple

Note

Only processes words containing alphabetic characters (a-z, A-Z)
Case-insensitive processing (all output is lowercase)
Automatically removes punctuation and special characters
Filters syllables by configured min/max length constraints
When only_hyphenated=True, excludes words not in CMUDict
Deterministic: same input always produces same output
Uses first pronunciation when multiple exist (deterministic)
Words are extracted using regex pattern: \b[a-zA-Z]+\b
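To see what this word pattern does and does not match, here is a small standalone sketch using Python's re module. The sample sentence is made up for illustration, and the lowercasing step is shown separately because the notes above state that all output is lowercase.

```python
import re

WORD_PATTERN = re.compile(r"\b[a-zA-Z]+\b")

sample = "Hello, world! It's 2024; co-op abc123"
words = [w.lower() for w in WORD_PATTERN.findall(sample)]
print(words)  # ['hello', 'world', 'it', 's', 'co', 'op']
```

Apostrophes and hyphens act as word boundaries, so contractions and hyphenated words are split into separate tokens. Purely numeric tokens and letter runs attached to digits (such as "abc123") are skipped, because no word boundary exists between a letter and a digit.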

Example

>>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello world!")
>>> print(syllables)
['hel', 'lo', 'world']
>>> print(stats['total_words'])
2

extract_syllables_from_file(input_path)

Extract all syllables from a text file (preserves duplicates).

This is a convenience wrapper around extract_syllables_from_text() that handles file reading with proper encoding (UTF-8) and error handling.

Parameters:

input_path (pathlib.Path) – Path to the input text file. File should be UTF-8 encoded plain text. Binary files or non-text formats will cause errors.

Returns:

Tuple of (syllables, statistics), where:

syllables: List of all lowercase syllable strings (includes duplicates)
statistics: Dict with processing statistics (see extract_syllables_from_text)

Return type:

tuple

Raises:

FileNotFoundError – If the input file doesn’t exist at the specified path
IOError – If there’s an error reading the file (permissions, encoding, etc.)

Example

>>> from pathlib import Path
>>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_file(Path('book.txt'))
>>> print(f"Extracted {len(syllables)} syllables from {stats['total_words']} words")
Extracted 1250 syllables from 50000 words

save_syllables(syllables, output_path)

Save syllables to a text file (one syllable per line, preserves all).

Writes syllables with UTF-8 encoding, one syllable per line. Syllables are written in the order they appear in the list (preserving duplicates). This format is ideal for downstream processing by normalizer tools.

Parameters:

syllables (list[str]) – List of syllables to save (may contain duplicates). Written in the order provided.
output_path (pathlib.Path) – Path to the output file. Parent directories must exist. If the file exists, it will be overwritten.

Raises:

IOError – If there’s an error writing the file (permissions, disk space, etc.)

Example

>>> from pathlib import Path
>>> extractor = NltkSyllableExtractor('en_US')
>>> syllables = ['hel', 'lo', 'world', 'hel']  # Note: 'hel' appears twice
>>> extractor.save_syllables(syllables, Path('output.txt'))
# Creates file with content (preserving duplicates and order):
# hel
# lo
# world
# hel

Note

The output file uses UTF-8 encoding with Unix-style line endings (\n). Each line contains exactly one syllable with no leading/trailing whitespace. Duplicates are preserved. Use downstream tools for deduplication if needed.
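Since duplicates are preserved, a downstream step has to deduplicate or count them. The following is a minimal sketch using only the standard library; it is not part of this module, and the project's own normalizer tools may work differently.

```python
from collections import Counter

syllables = ["hel", "lo", "world", "hel"]

# Order-preserving deduplication: dict keys keep first-seen order.
unique = list(dict.fromkeys(syllables))
print(unique)  # ['hel', 'lo', 'world']

# Frequency counts, useful when duplicates carry weighting information.
counts = Counter(syllables)
print(counts["hel"])  # 2
```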