build_tools.nltk_syllable_extractor

NLTK Syllable Extractor - Phonetically-Guided Syllable Extraction

The NLTK syllable extractor uses the CMU Pronouncing Dictionary (via the cmudict pip package) together with onset/coda principles to perform phonetically guided orthographic syllabification. It is a build-time tool only and is not used during runtime name generation.

The tool supports two modes:

  • Interactive Mode - Guided prompts for single-file processing

  • Batch Mode - Automated processing of multiple files via command-line arguments

Features:

  • Phonetically-guided syllabification using CMU Pronouncing Dictionary (via cmudict package)

  • Onset/coda principles for natural consonant cluster splitting

  • English only (CMUDict limitation)

  • Preserves all syllables including duplicates (extraction only, no filtering)

  • Configurable syllable length constraints (defaults to no filtering)

  • Deterministic extraction (same input = same output)

  • Unicode support

  • Comprehensive metadata and statistics

  • Automatic provenance tracking via corpus_db ledger (batch mode)
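As a rough illustration of what "phonetically guided" means here: CMUdict pronunciations are written in ARPAbet, where every vowel phoneme carries a stress digit (0, 1 or 2), so the number of digit-bearing phonemes gives the syllable count for a word. The sketch below uses a small hand-copied table instead of loading the dictionary. The names SAMPLE_PRONUNCIATIONS and count_syllables are hypothetical and are not part of this module.

```python
# Hypothetical, hand-copied CMUdict-style entries (ARPAbet phonemes).
# A real run would look these up in the dictionary instead.
SAMPLE_PRONUNCIATIONS = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "wonderful": ["W", "AH1", "N", "D", "ER0", "F", "AH0", "L"],
    "world": ["W", "ER1", "L", "D"],
}


def count_syllables(phonemes):
    """Count syllables as the number of vowel phonemes.

    In ARPAbet, only vowels end in a stress digit (0, 1 or 2).
    """
    return sum(1 for phoneme in phonemes if phoneme[-1].isdigit())


for word, phonemes in SAMPLE_PRONUNCIATIONS.items():
    print(word, count_syllables(phonemes))
```

Knowing the target syllable count from the phonemes is what lets an extractor decide how many orthographic pieces a word should be split into.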

Key Differences from pyphen Extractor:

  • Uses phonetic information (CMUDict) rather than typographic hyphenation rules

  • Respects phonotactic constraints via onset/coda principles

  • Produces more “natural” phonetic splits (e.g., “Andrew” → “An-drew” not “And-rew”)

  • English only, whereas pyphen supports 40+ languages

  • Complementary tool, not a replacement
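The onset/coda idea behind the "An-drew" split can be sketched with the maximal onset principle: when consonants sit between two vowel groups, as many of them as form a legal syllable onset are given to the following syllable, and the rest stay behind as the coda. The code below is a simplified, purely orthographic sketch under stated assumptions, not the extractor's actual algorithm. The onset inventory LEGAL_ONSETS is a tiny hypothetical sample, and the function names are illustrative only.

```python
import re

# Hypothetical, deliberately tiny inventory of legal English onsets.
# The empty string guarantees that a split is always found.
LEGAL_ONSETS = {"", "b", "d", "dr", "f", "l", "n", "r", "st", "str", "t", "tr", "w"}


def split_cluster(cluster):
    """Split a consonant cluster into (coda, onset), maximizing the onset."""
    for i in range(len(cluster) + 1):
        # Try the longest suffix first; shrink it until it is a legal onset.
        if cluster[i:] in LEGAL_ONSETS:
            return cluster[:i], cluster[i:]
    return cluster, ""  # Not reached while "" is in LEGAL_ONSETS.


def syllabify(word):
    """Split a word at consonant clusters between vowel groups."""
    word = word.lower()
    nuclei = [m.span() for m in re.finditer(r"[aeiouy]+", word)]
    if len(nuclei) < 2:
        return [word]

    boundaries = []
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        coda, _onset = split_cluster(word[prev_end:next_start])
        boundaries.append(prev_end + len(coda))

    starts = [0] + boundaries
    ends = boundaries + [len(word)]
    return [word[s:e] for s, e in zip(starts, ends)]


print(syllabify("Andrew"))     # "ndr" splits as coda "n" + onset "dr"
print(syllabify("wonderful"))
```

Because "ndr" is not a legal onset but "dr" is, the "n" stays with the first syllable, which is the behaviour described in the list above.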

Main Components:

  • NltkSyllableExtractor: Core extraction class

  • ExtractionResult: Data model for extraction results

  • FileProcessingResult: Result for single file in batch mode

  • BatchResult: Aggregate results for batch processing

Usage:
>>> from pathlib import Path
>>> from build_tools.nltk_syllable_extractor import NltkSyllableExtractor
>>>
>>> # Initialize extractor for English (defaults to no length filtering)
>>> extractor = NltkSyllableExtractor('en_US')
>>>
>>> # Extract syllables from text (preserves duplicates)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
>>> print(syllables)  # Note: includes all syllables with duplicates
['hel', 'lo', 'won', 'der', 'ful', 'world']
>>> print(f"Total: {len(syllables)}, Unique: {len(set(syllables))}")
Total: 6, Unique: 6
>>>
>>> # Extract from a file
>>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
>>>
>>> # Save results (preserves duplicates)
>>> extractor.save_syllables(syllables, Path('output.txt'))
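
Because the extractor keeps duplicates and applies no filtering, frequency analysis is left to downstream steps. A minimal sketch of such a step, using only the standard library and a hypothetical syllable list standing in for real extractor output:

```python
from collections import Counter

# Hypothetical extractor output; the repeated entries are intentional.
syllables = ["hel", "lo", "won", "der", "ful", "hel", "lo", "world"]

counts = Counter(syllables)
total = len(syllables)   # every occurrence, duplicates included
unique = len(counts)     # distinct syllables only
repeated = sorted(s for s, n in counts.items() if n > 1)

print(f"Total: {total}, Unique: {unique}")
print("Repeated:", repeated)
```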

CLI Usage:

# Interactive mode
python -m build_tools.nltk_syllable_extractor

# Single file
python -m build_tools.nltk_syllable_extractor --file input.txt

# Batch processing
python -m build_tools.nltk_syllable_extractor --source ~/docs/ --recursive

Attributes

main_interactive

main_batch

process_single_file_batch

Package Contents

build_tools.nltk_syllable_extractor.main_interactive
build_tools.nltk_syllable_extractor.main_batch
build_tools.nltk_syllable_extractor.process_single_file_batch