build_tools.nltk_syllable_extractor

NLTK Syllable Extractor - Phonetically-Guided Syllable Extraction

The NLTK syllable extractor uses the CMU Pronouncing Dictionary (via the cmudict pip package) together with onset/coda principles to perform phonetically guided orthographic syllabification. It is a build-time tool only and is not used during runtime name generation.

The tool supports two modes:

Interactive Mode - Guided prompts for single-file processing
Batch Mode - Automated processing of multiple files via command-line arguments

Features:

Phonetically-guided syllabification using CMU Pronouncing Dictionary (via cmudict package)
Onset/coda principles for natural consonant cluster splitting
English only (CMUDict limitation)
Preserves all syllables including duplicates (extraction only, no filtering)
Configurable syllable length constraints (defaults to no filtering)
Deterministic extraction (same input = same output)
Unicode support
Comprehensive metadata and statistics
Automatic provenance tracking via corpus_db ledger (batch mode)
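
The length-constraint and duplicate-preservation features listed above can be pictured with a small sketch. The function name filter_by_length and its parameters min_len and max_len are hypothetical and are not taken from the module; the sketch only shows the behaviour the list describes: no filtering by default, and duplicates and input order kept intact.

```python
def filter_by_length(syllables, min_len=None, max_len=None):
    """Keep syllables whose length lies within the given bounds.

    Hypothetical helper: a bound of None disables that side of the check,
    so the default call returns every syllable unchanged.
    """
    kept = []
    for syllable in syllables:
        if min_len is not None and len(syllable) < min_len:
            continue
        if max_len is not None and len(syllable) > max_len:
            continue
        # Append rather than add to a set: duplicates and order are preserved.
        kept.append(syllable)
    return kept


sample = ["an", "drew", "an", "a"]
unfiltered = filter_by_length(sample)            # default: nothing removed
at_least_two = filter_by_length(sample, min_len=2)
```

Because the result is built by a single ordered pass over the input, the same input always yields the same output, which matches the determinism claim in the list.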

Key Differences from pyphen Extractor:

Uses phonetic information (CMUDict) rather than typographic hyphenation rules
Respects phonotactic constraints via onset/coda principles
Produces more “natural” phonetic splits (e.g., “Andrew” → “An-drew” not “And-rew”)
English only, whereas pyphen supports 40+ languages
Complementary tool, not a replacement
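
The "An-drew" split mentioned above follows from the onset-maximization idea: consonants between two vowels are attached to the following syllable as long as they form a legal onset. The sketch below illustrates that principle on ARPAbet phonemes. It is not the module's implementation: the pronunciation table is a tiny hand-written stand-in for CMUDict, LEGAL_ONSETS is a deliberately incomplete assumption, and syllabify_phones is a hypothetical name.

```python
# Hand-written stand-in for CMUDict entries (illustrative data only).
PRONUNCIATIONS = {
    "andrew": ["AE1", "N", "D", "R", "UW2"],
    "wonderful": ["W", "AH1", "N", "D", "ER0", "F", "AH0", "L"],
}

# Small, non-exhaustive set of multi-consonant onsets (assumption for this sketch).
LEGAL_ONSETS = {("D", "R"), ("T", "R"), ("B", "R"), ("P", "L"), ("S", "T"), ("S", "T", "R")}


def is_vowel(phone):
    # In ARPAbet as used by CMUDict, vowels end in a stress digit (0, 1, or 2).
    return phone[-1].isdigit()


def syllabify_phones(phones):
    """Split a phoneme list into syllables using onset maximization."""
    vowel_positions = [i for i, p in enumerate(phones) if is_vowel(p)]
    if not vowel_positions:
        return [phones]
    boundaries = [0]
    for prev, nxt in zip(vowel_positions, vowel_positions[1:]):
        cluster = phones[prev + 1:nxt]
        split = len(cluster)  # fallback: every consonant goes to the coda
        # Try the longest candidate onset first, then shorter ones.
        for k in range(len(cluster) + 1):
            onset = tuple(cluster[k:])
            # Simplification: any single consonant (or none) is accepted as an onset.
            if len(onset) <= 1 or onset in LEGAL_ONSETS:
                split = k
                break
        boundaries.append(prev + 1 + split)
    boundaries.append(len(phones))
    return [phones[a:b] for a, b in zip(boundaries, boundaries[1:])]


andrew = syllabify_phones(PRONUNCIATIONS["andrew"])
wonderful = syllabify_phones(PRONUNCIATIONS["wonderful"])
```

For "andrew" the cluster N-D-R is not a legal onset, but D-R is, so N stays in the first syllable's coda and D-R opens the second syllable. Mapping the phoneme groups back onto spelling is a separate step that this sketch does not attempt.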

Main Components:

NltkSyllableExtractor: Core extraction class
ExtractionResult: Data model for extraction results
FileProcessingResult: Result for single file in batch mode
BatchResult: Aggregate results for batch processing

Usage:

>>> from pathlib import Path
>>> from build_tools.nltk_syllable_extractor import NltkSyllableExtractor
>>>
>>> # Initialize extractor for English (defaults to no length filtering)
>>> extractor = NltkSyllableExtractor('en_US')
>>>
>>> # Extract syllables from text (preserves duplicates)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
>>> print(syllables)  # Note: includes all syllables with duplicates
['hel', 'lo', 'won', 'der', 'ful', 'world']
>>> print(f"Total: {len(syllables)}, Unique: {len(set(syllables))}")
Total: 6, Unique: 6
>>>
>>> # Extract from a file
>>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
>>>
>>> # Save results (preserves duplicates)
>>> extractor.save_syllables(syllables, Path('output.txt'))
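
Since the extractor keeps duplicates, frequency information remains available to downstream tools. The following sketch shows one way to summarize it with the standard library; the syllable list is illustrative data standing in for a real extraction result, not output produced by the module.

```python
from collections import Counter

# Illustrative stand-in for a list returned by the extractor.
syllables = ["hel", "lo", "won", "der", "ful", "world", "hel", "lo"]

counts = Counter(syllables)        # frequency of each distinct syllable
total = len(syllables)             # includes duplicates
unique = len(counts)               # distinct syllables only
most_frequent = counts.most_common(2)

print(f"Total: {total}, Unique: {unique}")
```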

CLI Usage:

# Interactive mode
python -m build_tools.nltk_syllable_extractor

# Single file
python -m build_tools.nltk_syllable_extractor --file input.txt

# Batch processing
python -m build_tools.nltk_syllable_extractor --source ~/docs/ --recursive
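
As a rough picture of what a --source / --recursive batch run has to do first, the sketch below collects input files from a directory. The function name discover_files and the "*.txt" pattern are assumptions made for illustration; the tool's actual discovery logic and file filters are not described here.

```python
from pathlib import Path


def discover_files(source, recursive=False, pattern="*.txt"):
    """Return matching files under source in a stable, sorted order.

    Hypothetical helper: the pattern default is an assumption.
    """
    root = Path(source).expanduser()  # allows paths such as ~/docs/
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    # Sorting keeps the processing order independent of filesystem ordering.
    return sorted(p for p in matches if p.is_file())
```

Sorting the discovered paths is one simple way to keep batch runs reproducible across machines.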

Submodules