NLTK Syllable Extractor

Overview

NLTK Syllable Extractor - Phonetically-Guided Syllable Extraction

The NLTK syllable extractor uses the CMU Pronouncing Dictionary (via the cmudict pip package) with onset/coda principles for phonetically guided orthographic syllabification. This is a build-time tool only; it is not used during runtime name generation.

The tool supports two modes:

  • Interactive Mode - Guided prompts for single-file processing

  • Batch Mode - Automated processing of multiple files via command-line arguments

Features:

  • Phonetically-guided syllabification using CMU Pronouncing Dictionary (via cmudict package)

  • Onset/coda principles for natural consonant cluster splitting

  • English only (CMUDict limitation)

  • Preserves all syllables including duplicates (extraction only, no filtering)

  • Configurable syllable length constraints (defaults to no filtering)

  • Deterministic extraction (same input = same output)

  • Unicode support

  • Comprehensive metadata and statistics

  • Automatic provenance tracking via corpus_db ledger (batch mode)

Key Differences from pyphen Extractor:

  • Uses phonetic information (CMUDict) rather than typographic hyphenation rules

  • Respects phonotactic constraints via onset/coda principles

  • Produces more “natural” phonetic splits (e.g., “Andrew” → “An-drew” not “And-rew”)

  • English only vs pyphen’s 40+ languages

  • Complementary tool, not a replacement

Main Components:

  • NltkSyllableExtractor: Core extraction class

  • ExtractionResult: Data model for extraction results

  • FileProcessingResult: Result for single file in batch mode

  • BatchResult: Aggregate results for batch processing

Usage:
>>> from pathlib import Path
>>> from build_tools.nltk_syllable_extractor import NltkSyllableExtractor
>>>
>>> # Initialize extractor for English (defaults to no length filtering)
>>> extractor = NltkSyllableExtractor('en_US')
>>>
>>> # Extract syllables from text (preserves duplicates)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
>>> print(syllables)  # Note: includes all syllables with duplicates
['hel', 'lo', 'won', 'der', 'ful', 'world']
>>> print(f"Total: {len(syllables)}, Unique: {len(set(syllables))}")
Total: 6, Unique: 6
>>>
>>> # Extract from a file
>>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
>>>
>>> # Save results (preserves duplicates)
>>> extractor.save_syllables(syllables, Path('output.txt'))

CLI Usage:

# Interactive mode
python -m build_tools.nltk_syllable_extractor

# Single file
python -m build_tools.nltk_syllable_extractor --file input.txt

# Batch processing
python -m build_tools.nltk_syllable_extractor --source ~/docs/ --recursive

Command-Line Interface

NLTK Syllable Extractor - Extract syllables using CMUDict with onset/coda principles

usage: python -m build_tools.nltk_syllable_extractor [-h]
                                                     [--file FILE | --files FILE [FILE ...]
                                                     | --source SOURCE]
                                                     [--pattern PATTERN]
                                                     [--recursive] [--min N]
                                                     [--max N]
                                                     [--output OUTPUT]
                                                     [--quiet] [--verbose]

Named Arguments

--file

Process a single file

--files

Process multiple files

--source

Directory to scan for files

--pattern

File pattern for directory scanning (default: *.txt)

Default: '*.txt'

--recursive

Search directories recursively

Default: False

--min

Minimum syllable length (default: 1, no filtering)

Default: 1

--max

Maximum syllable length (default: 999, no filtering)

Default: 999

--output

Output directory (default: _working/output)

--quiet

Suppress all output except errors

Default: False

--verbose

Enable verbose output

Default: False

# Interactive mode (no arguments)
python -m build_tools.nltk_syllable_extractor

# Single file
python -m build_tools.nltk_syllable_extractor --file input.txt

# Multiple files
python -m build_tools.nltk_syllable_extractor --files file1.txt file2.txt file3.txt

# Directory scan (non-recursive)
python -m build_tools.nltk_syllable_extractor --source /data/texts/ --pattern "*.txt"

# Directory scan (recursive)
python -m build_tools.nltk_syllable_extractor --source /data/ --pattern "*.md" --recursive

# Custom output directory and syllable lengths
python -m build_tools.nltk_syllable_extractor --source /data/ --output /results/ --min 3 --max 6

Note: This extractor only supports English (CMUDict). For other languages, use pyphen_syllable_extractor.
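The argument surface listed above can be sketched with argparse. This is a minimal reconstruction from the documented flags and defaults, not the tool's actual parser; the function name build_parser is hypothetical, and the mutually exclusive input group is inferred from the usage synopsis.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="python -m build_tools.nltk_syllable_extractor",
        description="Extract syllables using CMUDict with onset/coda principles",
    )
    # The synopsis shows --file | --files | --source as alternatives.
    inputs = parser.add_mutually_exclusive_group()
    inputs.add_argument("--file", help="Process a single file")
    inputs.add_argument("--files", nargs="+", metavar="FILE", help="Process multiple files")
    inputs.add_argument("--source", help="Directory to scan for files")
    parser.add_argument("--pattern", default="*.txt", help="File pattern for directory scanning")
    parser.add_argument("--recursive", action="store_true", help="Search directories recursively")
    parser.add_argument("--min", type=int, default=1, metavar="N", help="Minimum syllable length")
    parser.add_argument("--max", type=int, default=999, metavar="N", help="Maximum syllable length")
    parser.add_argument("--output", help="Output directory")
    parser.add_argument("--quiet", action="store_true", help="Suppress all output except errors")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose output")
    return parser


args = build_parser().parse_args(["--source", "/data/", "--min", "3", "--max", "6"])
print(args.source, args.pattern, args.recursive, args.min, args.max)
```

With no arguments at all, every input option stays None, which is the condition the tool describes as interactive mode.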

Output Format

Output files are organized in a run-based subdirectory structure under _working/output/. Each extraction run creates a timestamped directory containing syllables/ and meta/ subdirectories:

Directory structure:

_working/output/
  └── YYYYMMDD_HHMMSS_nltk/     # Run directory (one per batch)
      ├── syllables/
      │   ├── file1.txt          # Input filename preserved
      │   ├── file2.txt
      │   └── ...
      └── meta/
          ├── file1.txt          # Matching metadata
          ├── file2.txt
          └── ...

Interactive mode (single file):

_working/output/
  └── 20260110_143022_nltk/
      ├── syllables/
      │   └── en_US.txt
      └── meta/
          └── en_US.txt

Batch mode (multiple files):

_working/output/
  └── 20260110_143022_nltk/     # All files share one run directory
      ├── syllables/
      │   ├── alice.txt
      │   ├── middlemarch.txt
      │   └── don_quijote.txt
      └── meta/
          ├── alice.txt
          ├── middlemarch.txt
          └── don_quijote.txt

Benefits of run-based organization:

  • Each extraction run is self-contained in a timestamped directory

  • Easy to archive, move, or delete entire runs as atomic units

  • Input filenames are preserved for easy identification

  • Clean separation between syllables and metadata

  • All outputs from a batch operation are grouped together
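The run-based layout can be illustrated with a small path builder. This is a sketch under the assumption that paths are composed exactly as in the trees above; build_run_paths is a hypothetical name and is not the module's generate_output_filename function.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def build_run_paths(
    output_dir: Path, filename: str, run_timestamp: Optional[str] = None
) -> Tuple[Path, Path]:
    """Return (syllables_path, metadata_path) inside one timestamped run directory."""
    if run_timestamp is None:
        # A fresh timestamp starts a new run directory.
        run_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = output_dir / f"{run_timestamp}_nltk"
    return run_dir / "syllables" / filename, run_dir / "meta" / filename


# Reusing one timestamp keeps every file of a batch in the same run directory.
stamp = "20260110_143022"
alice_syllables, alice_meta = build_run_paths(Path("_working/output"), "alice.txt", stamp)
march_syllables, march_meta = build_run_paths(Path("_working/output"), "middlemarch.txt", stamp)
print(alice_syllables)
print(march_meta)
```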

Syllables file format:

Each line contains one syllable, preserving duplicates in the order extracted. This preserves natural syllable frequency for downstream processing:

hel
lo
won
der
ful
world
hel
lo
world

Note: Duplicates are intentionally preserved. The extractor’s job is to extract, not to filter. Use build_tools.pyphen_syllable_normaliser for deduplication and frequency analysis.
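Because duplicates are kept, a downstream step can recover frequencies directly from the file contents. The following is a minimal sketch of that idea using the standard library; syllable_frequencies is a hypothetical helper and does not reproduce the normaliser's behaviour.

```python
from collections import Counter
from typing import Iterable


def syllable_frequencies(lines: Iterable[str]) -> Counter:
    """Count occurrences of each syllable, ignoring blank lines."""
    return Counter(line.strip() for line in lines if line.strip())


# Same content as the sample syllables file shown above.
sample_lines = ["hel", "lo", "won", "der", "ful", "world", "hel", "lo", "world"]
frequencies = syllable_frequencies(sample_lines)
print(f"Total: {sum(frequencies.values())}, Unique: {len(frequencies)}")
print(frequencies["hel"], frequencies["won"])
```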

Metadata file format:

The metadata file records extraction parameters and statistics:

  • Source files processed

  • Language code (always en_US for NLTK extractor)

  • Extractor type (nltk_syllable_extractor (CMUDict + onset/coda))

  • Syllable length constraints (min/max)

  • Total syllables (with duplicates)

  • Unique syllable count (for reference)

  • Total word count

  • Processing statistics (processed words, fallback count, rejected syllables)

  • Extraction timestamp

  • Command-line invocation

Metadata distinguishes extractor source:

The NLTK extractor clearly labels its output to distinguish from pyphen-based extraction:

======================================================================
NLTK SYLLABLE EXTRACTION METADATA
======================================================================
Extraction Date:    2026-01-09 22:43:28
Extractor:          nltk_syllable_extractor (CMUDict + onset/coda)
Language Code:      en_US
Syllable Length:    1-999 characters
Total Syllables:    911
Unique Syllables:   401

Processing Statistics:
  Total Words:        503
  Processed Words:    503
  Fallback Used:      17 (not in CMUDict)
  Rejected Syllables: 0 (length filter)
...
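If the metadata needs to be read back programmatically, the "Label: value" lines in the report above lend themselves to a simple parser. This is only a sketch that assumes the layout shown in the excerpt; parse_metadata is a hypothetical helper, and the real file may contain sections this does not handle.

```python
import re
from typing import Dict

# A label made of letters and spaces, a colon, then a non-empty value.
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?):\s+(\S.*)$")


def parse_metadata(text: str) -> Dict[str, str]:
    """Collect 'Label: value' pairs; banner lines and bare section headings are skipped."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


sample_report = """\
======================================================================
NLTK SYLLABLE EXTRACTION METADATA
======================================================================
Extraction Date:    2026-01-09 22:43:28
Extractor:          nltk_syllable_extractor (CMUDict + onset/coda)
Total Syllables:    911
Unique Syllables:   401

Processing Statistics:
  Total Words:        503
  Fallback Used:      17 (not in CMUDict)
"""
metadata = parse_metadata(sample_report)
fallback_words = int(metadata["Fallback Used"].split()[0])
print(metadata["Extractor"], fallback_words)
```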

Integration Guide

The NLTK syllable extractor is an alternative first step in the build pipeline, complementing the pyphen-based extractor:

Standard workflow (using NLTK extractor):

# Step 1: Extract syllables using NLTK/CMUDict
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --pattern "*.txt" \
  --output data/raw/nltk/

# Step 2: Normalize extracted syllables (source-agnostic)
python -m build_tools.pyphen_syllable_normaliser \
  --source data/raw/nltk/ \
  --output data/normalized/

# Step 3: Annotate with phonetic features (source-agnostic)
python -m build_tools.syllable_feature_annotator

Parallel workflow (comparing both extractors):

# Extract with pyphen (typographic)
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --lang en_US \
  --output data/raw/pyphen/

# Extract with NLTK (phonetic)
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --output data/raw/nltk/

# Compare outputs or merge for hybrid corpus
# Both feed into the same downstream tools

When to use NLTK extractor vs pyphen:

Use NLTK extractor when:

  • You want phonetically-guided syllable boundaries

  • You prefer consonant cluster integrity (e.g., “An-drew” not “And-rew”)

  • You want syllables that feel more like spoken language

  • You’re working with English text (CMUDict limitation)

  • You want to explore phonetic texture differences

Use pyphen extractor when:

  • You need multi-language support (40+ languages)

  • You prefer typographic hyphenation rules

  • You want formal, dictionary-based splits

  • You’re working with non-English text

Combining both extractors:

The two extractors produce complementary textures. You can:

  1. Extract with both, compare outputs, choose one

  2. Merge outputs for richer syllable inventory

  3. Use different extractors for different name generation profiles
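Comparing or merging the two outputs amounts to set operations over the extracted syllables. The sketch below shows one way to do that; compare_inventories is a hypothetical helper, and the sample lists are illustrative values taken from the example splits in this documentation rather than real extractor output.

```python
from typing import Dict, Iterable, List


def compare_inventories(
    pyphen_syllables: Iterable[str], nltk_syllables: Iterable[str]
) -> Dict[str, List[str]]:
    """Split two syllable inventories into shared and extractor-specific parts."""
    pyphen_set, nltk_set = set(pyphen_syllables), set(nltk_syllables)
    return {
        "shared": sorted(pyphen_set & nltk_set),
        "pyphen_only": sorted(pyphen_set - nltk_set),
        "nltk_only": sorted(nltk_set - pyphen_set),
        "merged": sorted(pyphen_set | nltk_set),
    }


# Illustrative inputs for "Andrew" and "program".
pyphen_sample = ["and", "rew", "pro", "gram"]
nltk_sample = ["an", "drew", "pro", "gram"]
report = compare_inventories(pyphen_sample, nltk_sample)
print(report["shared"])
print(report["nltk_only"])
```

Note that set operations discard frequency information, so a merge intended for frequency analysis would concatenate the raw lists instead.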

Extraction modes:

  • Interactive mode: No arguments - prompts for file selection

  • Single file: --file input.txt - Process one file

  • Multiple files: --files file1.txt file2.txt - Process specific files

  • Directory scan: --source /data/ --pattern "*.txt" - Scan directory for files

  • Recursive scan: --source /data/ --pattern "*.txt" --recursive - Scan subdirectories

Notes

Language Limitation:

The NLTK extractor is English-only due to CMUDict constraints:

  • CMU Pronouncing Dictionary covers North American English pronunciation

  • No support for other languages (use pyphen for multi-language needs)

  • This is a fundamental limitation of the phonetic dictionary approach

CMUDict Package:

The tool uses the cmudict pip package (python-cmudict) which includes the CMU Pronouncing Dictionary data. No separate corpus download is required - simply install via pip and you’re ready to go.

Phonetic vs Typographic Splitting:

The NLTK extractor can produce different splits than pyphen, although many words split identically:

Word        pyphen (typographic)   NLTK (phonetic)
Andrew      And-rew                An-drew
structure   struc-ture             stru-cture
beautiful   beau-ti-ful            beau-ti-ful
program     pro-gram               pro-gram

These differences create distinct phonetic textures:

  • pyphen: Conservative, formal, typographic breaks

  • NLTK: Natural, phonetic, respects consonant clustering

Deterministic Pronunciation Selection:

When words have multiple pronunciations in CMUDict (e.g., “read” as present vs past tense), the extractor:

  • Always selects the first pronunciation listed

  • This ensures deterministic results (same input → same output)

  • Pronunciation selection cannot be configured
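The first-pronunciation rule can be sketched as follows. The sample dictionary only imitates the shape of a word-to-pronunciations mapping (a list of ARPAbet phoneme lists per word); its entries and their order are illustrative, and both helper names are hypothetical.

```python
from typing import Dict, List, Optional

Pronunciations = Dict[str, List[List[str]]]

# Illustrative entries only; a real lookup would use the loaded dictionary.
SAMPLE_PRONUNCIATIONS: Pronunciations = {
    "read": [["R", "EH1", "D"], ["R", "IY1", "D"]],
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
}


def first_pronunciation(word: str, pronunciations: Pronunciations) -> Optional[List[str]]:
    """Always take the first listed pronunciation so results are repeatable."""
    entries = pronunciations.get(word.lower())
    return entries[0] if entries else None


def count_syllables(phonemes: List[str]) -> int:
    """ARPAbet vowel phonemes end in a stress digit (0, 1, or 2)."""
    return sum(1 for phoneme in phonemes if phoneme[-1].isdigit())


chosen = first_pronunciation("Read", SAMPLE_PRONUNCIATIONS)
print(chosen, count_syllables(chosen))
print(first_pronunciation("zzyzx", SAMPLE_PRONUNCIATIONS))
```

A None result corresponds to the out-of-vocabulary case, where a fallback heuristic takes over.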

Fallback for Unknown Words:

Words not in CMUDict use a heuristic fallback:

  1. Identifies vowel groups as syllable nuclei

  2. Applies onset/coda principles to consonant clusters

  3. Maintains phonetic character even for out-of-vocabulary words

Fallback usage is tracked in metadata as “Fallback Used: N (not in CMUDict)” to clearly distinguish from CMUDict-based extraction.
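The fallback steps can be approximated in a few lines. This is a sketch under stated assumptions, not the extractor's implementation: the VOWELS and VALID_ONSETS values are copied from the class constants listed in the API reference, fallback_syllabify is a hypothetical name, and the rule of giving the next syllable the longest valid onset (or a single consonant) is one plausible reading of "onset/coda principles".

```python
import re
from typing import List

VOWELS = "aeiouy"
VALID_ONSETS = {
    "bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gh", "gl", "gr", "ph",
    "pl", "pr", "scr", "sh", "shr", "sl", "sm", "sn", "sp", "spl", "spr",
    "squ", "st", "str", "sw", "th", "thr", "tr", "tw", "wh",
}


def fallback_syllabify(word: str) -> List[str]:
    """Split a word at consonant clusters between vowel groups."""
    word = word.lower()
    # Step 1: each maximal run of vowels is treated as one syllable nucleus.
    nuclei = [match.span() for match in re.finditer(f"[{VOWELS}]+", word)]
    if len(nuclei) <= 1:
        return [word] if word else []
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        cluster = word[prev_end:next_start]
        # Step 2: the longest cluster suffix that is a valid onset starts the
        # next syllable; otherwise a single consonant does.
        onset_length = 1
        for size in range(len(cluster), 1, -1):
            if cluster[-size:] in VALID_ONSETS:
                onset_length = size
                break
        boundaries.append(next_start - onset_length)
    pieces, start = [], 0
    for boundary in boundaries:
        pieces.append(word[start:boundary])
        start = boundary
    pieces.append(word[start:])
    return pieces


print(fallback_syllabify("Andrew"))
print(fallback_syllabify("wonderful"))
```

In "andrew" the cluster "ndr" contains the valid onset "dr", so the split keeps that cluster intact in the second syllable.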

Extraction Philosophy - Preserving Duplicates:

The extractor preserves all syllables including duplicates, following separation-of-concerns design:

  • Extractor’s job: Extract syllables (preserves frequency information)

  • Normaliser’s job: Deduplicate, filter, aggregate (pyphen_syllable_normaliser)

  • Annotator’s job: Add phonetic features (syllable_feature_annotator)

This design allows downstream tools to:

  • Perform frequency analysis on natural corpus distribution

  • Make informed filtering decisions based on occurrence counts

  • Apply different normalization strategies for different use cases

Syllable Length Constraints:

  • Default: min=1, max=999 (no practical filtering by default)

  • Adjust with --min and --max flags to filter if needed

  • Default behavior preserves all syllables for downstream processing

  • Examples:

    • --min 2 --max 8 - Filter to 2-8 character syllables (like old default)

    • --min 1 --max 1 - Extract only single-character syllables

    • --min 3 - Extract syllables of 3+ characters (no upper limit)
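The effect of the length bounds can be sketched as an inclusive filter that also reports how many syllables were rejected. filter_by_length is a hypothetical helper; the inclusive comparison is an assumption based on the "2-8 character syllables" wording above.

```python
from typing import Iterable, List, Tuple


def filter_by_length(
    syllables: Iterable[str], min_length: int = 1, max_length: int = 999
) -> Tuple[List[str], int]:
    """Keep syllables within the inclusive bounds and count the rejected ones."""
    items = list(syllables)
    kept = [s for s in items if min_length <= len(s) <= max_length]
    return kept, len(items) - len(kept)


sample = ["a", "hel", "lo", "world", "strengths"]
kept, rejected = filter_by_length(sample, min_length=2, max_length=8)
print(kept, rejected)

# With the defaults nothing is filtered out.
unfiltered, none_rejected = filter_by_length(sample)
print(len(unfiltered), none_rejected)
```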

Output Organization:

  • Each extraction run creates a timestamped directory with nltk identifier (YYYYMMDD_HHMMSS_nltk/)

  • Run directory contains syllables/ and meta/ subdirectories

  • Batch processing groups all files into a single run directory

  • Input filenames are preserved in output (e.g., alice.txt)

  • Interactive mode uses en_US.txt as the filename

  • Metadata clearly labels extractor source for provenance

  • All extractions logged to corpus database (if available)

Performance Considerations:

  • CMUDict lookup is fast (dictionary-based)

  • Fallback heuristics are efficient

  • Processing speed comparable to pyphen for English text

  • NLTK initial import may take 1-2 seconds

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

class build_tools.nltk_syllable_extractor.BatchResult(total_files, successful, failed, results, total_time, output_directory)[source]

Bases: object

Aggregate results from a batch processing operation.

This dataclass stores summary statistics and individual file results from processing multiple files in batch mode.

total_files

Total number of files attempted in the batch

successful

Number of files processed successfully

failed

Number of files that failed to process

results

List of individual FileProcessingResult objects

total_time

Total time taken for entire batch operation in seconds

output_directory

Directory where all outputs were saved

Example

>>> result = BatchResult(
...     total_files=5,
...     successful=4,
...     failed=1,
...     results=[...],
...     total_time=12.34,
...     output_directory=Path("_working/output")
... )
>>> print(f"Success rate: {result.successful/result.total_files*100:.1f}%")
Success rate: 80.0%
failed: int
format_summary()[source]

Format batch processing summary as a human-readable string.

Creates a detailed summary report showing overall statistics, successful extractions with details, and failed files with error messages.

Return type:

str

Returns:

Multi-line formatted string with batch statistics and results

Example

>>> summary = batch_result.format_summary()
>>> print(summary)
======================================================================
BATCH PROCESSING SUMMARY
======================================================================
Total Files:        5
Successful:         4 (80.0%)
...
output_directory: Path
results: list[FileProcessingResult]
successful: int
total_files: int
total_time: float
class build_tools.nltk_syllable_extractor.ExtractionResult(syllables, language_code, min_syllable_length, max_syllable_length, input_path, timestamp=<factory>, only_hyphenated=True, length_distribution=<factory>, sample_syllables=<factory>, total_words=0, fallback_count=0, rejected_syllables=0, processed_words=0)[source]

Bases: object

Container for syllable extraction results and associated metadata.

This dataclass stores both the extracted syllables and all relevant metadata about the extraction process for reporting and persistence.

syllables

List of all syllables extracted (includes duplicates)

language_code

Language code used (always “en_US” for NLTK extractor)

min_syllable_length

Minimum syllable length constraint

max_syllable_length

Maximum syllable length constraint

input_path

Path to the input text file

timestamp

When the extraction was performed

only_hyphenated

Whether whole words were excluded

length_distribution

Map of syllable length to count

sample_syllables

Representative sample of extracted syllables

total_words

Total words found in source text

fallback_count

Words not in CMUDict (used fallback heuristics)

rejected_syllables

Syllables rejected due to length constraints

processed_words

Words that were successfully processed

__post_init__()[source]

Calculate derived fields after initialization.

fallback_count: int = 0
format_metadata()[source]

Format extraction metadata as a human-readable string.

Return type:

str

Returns:

Multi-line string containing all extraction metadata formatted for display or file output.

input_path: Path
language_code: str
length_distribution: dict[int, int]
max_syllable_length: int
min_syllable_length: int
only_hyphenated: bool = True
processed_words: int = 0
rejected_syllables: int = 0
sample_syllables: list[str]
syllables: list[str]
timestamp: datetime
total_words: int = 0
class build_tools.nltk_syllable_extractor.FileProcessingResult(input_path, success, syllables_count, language_code, syllables_output_path=None, metadata_output_path=None, error_message=None, processing_time=0.0)[source]

Bases: object

Result of processing a single file in batch mode.

This dataclass stores the outcome of processing one file during batch operations, including success status, extracted syllables count, and any error information if processing failed.

input_path

Path to the input file that was processed

success

Whether processing completed successfully

syllables_count

Number of unique syllables extracted (0 if failed)

language_code

Language code used (always “en_US”)

syllables_output_path

Path where syllables were saved (None if failed)

metadata_output_path

Path where metadata was saved (None if failed)

error_message

Error message if processing failed (None if success)

processing_time

Time taken to process this file in seconds

Example

>>> result = FileProcessingResult(
...     input_path=Path("book.txt"),
...     success=True,
...     syllables_count=245,
...     language_code="en_US",
...     syllables_output_path=Path("output.syllables.en_US.txt"),
...     metadata_output_path=Path("output.meta.en_US.txt"),
...     processing_time=2.45
... )
>>> print(f"Processed {result.syllables_count} syllables")
Processed 245 syllables
error_message: str | None = None
input_path: Path
language_code: str
metadata_output_path: Path | None = None
processing_time: float = 0.0
success: bool
syllables_count: int
syllables_output_path: Path | None = None
class build_tools.nltk_syllable_extractor.NltkSyllableExtractor(language_code, min_syllable_length=1, max_syllable_length=999)[source]

Bases: object

Extracts syllables from text using CMU Pronouncing Dictionary.

This class uses phonetic information from CMUDict to guide orthographic syllable splitting, respecting English phonotactic constraints via onset/coda principles.

The extractor works by:

  1. Reading text input (string or file)

  2. Tokenizing into words using regex

  3. Looking up phonetic transcriptions in CMUDict

  4. Using vowel phonemes to identify syllable boundaries

  5. Mapping phonetic structure back to orthographic positions

  6. Applying onset/coda rules to split consonant clusters

  7. Filtering syllables by length constraints

  8. Returning all syllables in lowercase, with duplicates preserved

Key Differences from pyphen:
  • Uses phonetic information (CMUDict) rather than typographic rules

  • Respects consonant cluster constraints (onset/coda principles)

  • Produces more “natural” phonetic splits

  • English only (CMUDict limitation)

  • Includes fallback for out-of-vocabulary words

Typical Usage:
>>> from pathlib import Path
>>> # Basic extraction (both extract methods return a (syllables, statistics) tuple)
>>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
>>> print(sorted(syllables))
['der', 'ful', 'hel', 'lo', 'won', 'world']
>>> # Extract from file and save
>>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
>>> extractor.save_syllables(syllables, Path('output.txt'))
language_code

The language code (always ‘en_US’ for NLTK extractor)

min_syllable_length

Minimum syllable length to include in results

max_syllable_length

Maximum syllable length to include in results

cmu_dict

The loaded CMU Pronouncing Dictionary

Note

This is a build-time tool. The nltk dependency should not be used at runtime in the core name generation system.

VALID_ONSETS = {'bl', 'br', 'ch', 'cl', 'cr', 'dr', 'fl', 'fr', 'gh', 'gl', 'gr', 'ph', 'pl', 'pr', 'scr', 'sh', 'shr', 'sl', 'sm', 'sn', 'sp', 'spl', 'spr', 'squ', 'st', 'str', 'sw', 'th', 'thr', 'tr', 'tw', 'wh'}
VOWELS = 'aeiouy'
__init__(language_code, min_syllable_length=1, max_syllable_length=999)[source]

Initialize the NLTK syllable extractor.

Parameters:
  • language_code (str) – Language code (must be ‘en_US’ for NLTK extractor)

  • min_syllable_length (int) – Minimum syllable length to include (default: 1, no filtering)

  • max_syllable_length (int) – Maximum syllable length to include (default: 999, no filtering)

Raises:
extract_syllables_from_file(input_path)[source]

Extract all syllables from a text file (preserves duplicates).

This is a convenience wrapper around extract_syllables_from_text() that handles file reading with proper encoding (UTF-8) and error handling.

Parameters:

input_path (Path) – Path to the input text file. File should be UTF-8 encoded plain text. Binary files or non-text formats will cause errors.

Returns:

Tuple of (syllables, statistics), where:

  • syllables: List of all lowercase syllable strings (includes duplicates)

  • statistics: Dict with processing statistics (see extract_syllables_from_text)

Return type:

tuple

Raises:
  • FileNotFoundError – If the input file doesn’t exist at the specified path

  • IOError – If there’s an error reading the file (permissions, encoding, etc.)

Example

>>> from pathlib import Path
>>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_file(Path('book.txt'))
>>> print(f"Extracted {len(set(syllables))} unique syllables from {stats['total_words']} words")
Extracted 1250 unique syllables from 50000 words
extract_syllables_from_text(text, only_hyphenated=True)[source]

Extract all syllables from a block of text (preserves duplicates).

This method processes input text by tokenizing it into words, applying CMUDict phonetic lookup and onset/coda principles to extract individual syllables that meet the configured length constraints.

Parameters:
  • text (str) – Input text to process. Can contain any characters, but only alphabetic sequences will be processed as words.

  • only_hyphenated (bool) – If True, only include syllables from words that were successfully split (CMUDict lookup succeeded). Set to False to include fallback syllabification for unknown words.

Returns:

Tuple of (syllables, statistics), where:

  • syllables: List of all lowercase syllable strings (includes duplicates)

  • statistics: Dict with the following keys:
    • ’total_words’: Total number of words found in source text

    • ’processed_words’: Words that were successfully processed

    • ’fallback_count’: Words not in CMUDict (used fallback heuristics)

    • ’rejected_syllables’: Syllables rejected due to length constraints

Return type:

tuple

Note

  • Only processes words containing alphabetic characters (a-z, A-Z)

  • Case-insensitive processing (all output is lowercase)

  • Automatically removes punctuation and special characters

  • Filters syllables by configured min/max length constraints

  • When only_hyphenated=True, excludes words not in CMUDict

  • Deterministic: same input always produces same output

  • Uses first pronunciation when multiple exist (deterministic)

  • Words are extracted using regex pattern: \b[a-zA-Z]+\b
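The tokenization described in these notes can be sketched with the standard re module, assuming the pattern is \b[a-zA-Z]+\b (ASCII letter runs delimited by word boundaries) and that lowercasing happens at this stage. tokenize is a hypothetical helper name.

```python
import re
from typing import List

# Runs of ASCII letters delimited by word boundaries.
WORD_PATTERN = re.compile(r"\b[a-zA-Z]+\b")


def tokenize(text: str) -> List[str]:
    """Return lowercase alphabetic words; punctuation and digits are dropped."""
    return [word.lower() for word in WORD_PATTERN.findall(text)]


print(tokenize("Hello, wonderful world! 42"))
print(tokenize("don't"))
```

One consequence of this pattern is that an apostrophe acts as a separator, so a contraction is split into two tokens.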

Example

>>> extractor = NltkSyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello world!")
>>> print(syllables)
['hel', 'lo', 'world']
>>> print(stats['total_words'])
2
save_syllables(syllables, output_path)[source]

Save syllables to a text file (one syllable per line, preserves all).

Writes syllables with UTF-8 encoding, one syllable per line. Syllables are written in the order they appear in the list (preserving duplicates). This format is ideal for downstream processing by normalizer tools.

Parameters:
  • syllables (list[str]) – List of syllables to save (may contain duplicates). Written in the order provided.

  • output_path (Path) – Path to the output file. Parent directories must exist. If the file exists, it will be overwritten.

Raises:

IOError – If there’s an error writing the file (permissions, disk space, etc.)

Return type:

None

Example

>>> from pathlib import Path
>>> extractor = NltkSyllableExtractor('en_US')
>>> syllables = ['hel', 'lo', 'world', 'hel']  # Note: 'hel' appears twice
>>> extractor.save_syllables(syllables, Path('output.txt'))
# Creates file with content (preserving duplicates and order):
# hel
# lo
# world
# hel

Note

The output file uses UTF-8 encoding with Unix-style line endings (\n). Each line contains exactly one syllable with no leading/trailing whitespace. Duplicates are preserved. Use downstream tools for deduplication if needed.
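The file format in this note can be reproduced with pathlib alone. The sketch below is not the save_syllables implementation; write_syllables is a hypothetical helper, and passing an explicit newline argument is one way to get Unix line endings on every platform.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_syllables(syllables: Iterable[str], output_path: Path) -> None:
    """Write one syllable per line, UTF-8 encoded, keeping order and duplicates."""
    # newline="\n" stops Python from translating line endings on Windows.
    with output_path.open("w", encoding="utf-8", newline="\n") as handle:
        for syllable in syllables:
            handle.write(f"{syllable}\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output.txt"
    write_syllables(["hel", "lo", "world", "hel"], target)
    written = target.read_bytes()

print(written)
```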

build_tools.nltk_syllable_extractor.discover_files(source, pattern='*.txt', recursive=False)[source]

Discover text files in a directory matching the specified pattern.

This function searches for files matching a glob pattern in the specified directory, optionally recursing into subdirectories. Results are sorted alphabetically for deterministic processing order.

Parameters:
  • source (Path) – Directory to search for files. Must be an existing directory.

  • pattern (str) – Glob pattern for file matching (default: “*.txt”). Examples: “*.txt”, “*.md”, “data_*.csv”

  • recursive (bool) – If True, search recursively into subdirectories using rglob. If False, search only the top level (default: False).

Return type:

list[Path]

Returns:

List of Path objects for matching files, sorted alphabetically. Returns empty list if no files match.

Raises:

ValueError – If source is not a directory or doesn’t exist.
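The documented behaviour (glob or rglob, alphabetical ordering, ValueError for a missing directory) can be sketched with pathlib. find_files is a hypothetical stand-in written for illustration; restricting results to regular files is an assumption.

```python
import tempfile
from pathlib import Path
from typing import List


def find_files(source: Path, pattern: str = "*.txt", recursive: bool = False) -> List[Path]:
    """Return matching files in sorted order for a deterministic processing sequence."""
    if not source.is_dir():
        raise ValueError(f"Not a directory: {source}")
    matches = source.rglob(pattern) if recursive else source.glob(pattern)
    return sorted(path for path in matches if path.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("b.txt", "a.txt", "notes.md", "sub/c.txt"):
        (root / name).write_text("sample", encoding="utf-8")
    top_level = [p.relative_to(root).as_posix() for p in find_files(root)]
    nested = [p.relative_to(root).as_posix() for p in find_files(root, recursive=True)]

print(top_level)
print(nested)
```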

Example

>>> from pathlib import Path
>>> # Find all .txt files in a directory
>>> files = discover_files(Path("/data/texts"))
>>> print(f"Found {len(files)} files")
>>> # Find all .md files recursively
>>> files = discover_files(Path("/data"), pattern="*.md", recursive=True)
>>> # Find files with custom pattern
>>> files = discover_files(Path("/data"), pattern="book_*.txt")
build_tools.nltk_syllable_extractor.generate_output_filename(output_dir=None, language_code=None, run_timestamp=None, input_filename=None)[source]

Generate output filenames in run-based subdirectory structure.

Creates a run directory with timestamp and ‘nltk’ identifier, then organizes outputs into syllables/ and meta/ subdirectories:

  • output_dir/YYYYMMDD_HHMMSS_nltk/syllables/filename.txt

  • output_dir/YYYYMMDD_HHMMSS_nltk/meta/filename.txt

This structure groups each extraction run’s outputs together, making it easier to manage, archive, or delete complete runs as atomic units.

Parameters:
  • output_dir (Path | None) – Base output directory. Defaults to _working/output/

  • language_code (str | None) – Optional language code (e.g., ‘en_US’). Used for filename if input_filename not provided.

  • run_timestamp (str | None) – Optional timestamp string (YYYYMMDD_HHMMSS format). If provided, uses this timestamp for the run directory name. If not provided, generates a new timestamp using datetime.now(). Critical for batch processing - pass the same timestamp to group all files from a batch into one run directory.

  • input_filename (str | None) – Optional input filename to use for output naming. If provided, output files will use this name (e.g., ‘alice.txt’). Takes precedence over language_code for naming.

Return type:

tuple[Path, Path]

Returns:

Tuple of (syllables_path, metadata_path)

Example

>>> # Interactive mode - single file with language code
>>> syllables_path, meta_path = generate_output_filename(language_code='en_US')
>>> print(syllables_path)
_working/output/20260110_153022_nltk/syllables/en_US.txt
>>> # Batch mode - multiple files sharing one run directory
>>> from datetime import datetime
>>> timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
>>> s1, m1 = generate_output_filename(
...     run_timestamp=timestamp,
...     input_filename='alice.txt'
... )
>>> s2, m2 = generate_output_filename(
...     run_timestamp=timestamp,
...     input_filename='middlemarch.txt'
... )
>>> print(s1)
_working/output/20260110_153022_nltk/syllables/alice.txt
>>> print(s2)
_working/output/20260110_153022_nltk/syllables/middlemarch.txt
>>> # Both files share the same run directory

Note

For batch processing, always pass the same run_timestamp to group all outputs into a single run directory. This represents one logical batch operation, regardless of how many input files are processed.

build_tools.nltk_syllable_extractor.main(args=None)[source]

Main entry point for the NLTK syllable extractor CLI.

This function determines whether to run in interactive or batch mode based on the presence of command-line arguments.

Parameters:

args (list[str] | None) – Command-line arguments. If None, uses sys.argv.

Return type:

int

Returns:

Exit code (0 for success, non-zero for error).

Modes:
  • Interactive Mode: No arguments provided. Prompts user for all settings.

  • Batch Mode: Arguments provided. Processes files based on CLI flags.
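
The mode selection described above can be sketched as a small dispatch function. This is an illustration only: choose_mode is a hypothetical name, and skipping the program name with sys.argv[1:] is an assumption about how the fallback to sys.argv is handled.

```python
import sys
from typing import Optional, Sequence


def choose_mode(args: Optional[Sequence[str]] = None) -> str:
    """Return 'interactive' when no CLI arguments are present, else 'batch'."""
    # Mirror the documented default: fall back to sys.argv when args is None.
    # Assumption: the program name (sys.argv[0]) is not counted as an argument.
    argv = list(sys.argv[1:]) if args is None else list(args)
    return "batch" if argv else "interactive"
```

Passing an explicit list, as main(args=...) allows, makes the dispatch easy to exercise in tests without touching sys.argv.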

Examples

Interactive mode (no arguments):

$ python -m build_tools.nltk_syllable_extractor

Batch mode (with arguments):

$ python -m build_tools.nltk_syllable_extractor --file input.txt
$ python -m build_tools.nltk_syllable_extractor --files *.txt
$ python -m build_tools.nltk_syllable_extractor --source ~/docs/ --recursive

build_tools.nltk_syllable_extractor.main_batch(args)

Batch mode entry point for the NLTK syllable extractor CLI.

This function processes multiple files based on command-line arguments, providing progress indicators and comprehensive error reporting.

Parameters:

args (Namespace) – Parsed command-line arguments, an argparse.Namespace containing:
  • file: Single file path (optional)

  • files: List of file paths (optional)

  • source: Directory path for scanning (optional)

  • pattern: File pattern for directory scanning (default: “*.txt”)

  • recursive: Whether to scan directories recursively

  • min: Minimum syllable length (default: 1)

  • max: Maximum syllable length (default: 999)

  • output: Output directory (default: _working/output/)

  • quiet: Suppress progress indicators

  • verbose: Show detailed processing information

Return type:

None

Exit Codes:

  • 0: All files processed successfully

  • 1: One or more files failed to process

Raises:

SystemExit – On validation errors or processing completion
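
A parser producing a Namespace with the attributes listed above might look like the sketch below. Only --file, --files, --source and --recursive appear in the CLI examples in this document; the remaining flag spellings are assumed from the attribute names, and build_parser_sketch is a hypothetical name. Whether the real parser makes the input options mutually exclusive is not stated here, so the sketch does not enforce it.

```python
import argparse
from pathlib import Path


def build_parser_sketch() -> argparse.ArgumentParser:
    """Hypothetical parser matching the documented Namespace attributes."""
    parser = argparse.ArgumentParser(prog="nltk_syllable_extractor")
    # Input selection: these flags appear in the documented CLI examples.
    parser.add_argument("--file", type=Path, default=None)
    parser.add_argument("--files", type=Path, nargs="+", default=None)
    parser.add_argument("--source", type=Path, default=None)
    parser.add_argument("--recursive", action="store_true")
    # Assumption: flag spellings below are inferred from the attribute names.
    parser.add_argument("--pattern", default="*.txt")
    parser.add_argument("--min", type=int, default=1)
    parser.add_argument("--max", type=int, default=999)
    parser.add_argument("--output", type=Path, default=Path("_working/output"))
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser
```

The defaults of 1 and 999 for min and max correspond to the "no length filtering" behavior described in the overview.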

build_tools.nltk_syllable_extractor.main_interactive()

Interactive mode entry point for the NLTK syllable extractor CLI.

Return type:

None

Workflow:
  1. Display tool information and CMUDict notice

  2. Configure extraction parameters (min/max syllable length)

  3. Prompt for input file path

  4. Extract syllables using CMUDict + onset/coda principles

  5. Generate timestamped output filenames

  6. Save syllables and metadata to separate files

  7. Display summary to console

Output Files:
  • YYYYMMDD_HHMMSS.syllables.en_US.txt: One syllable per line, sorted

  • YYYYMMDD_HHMMSS.meta.en_US.txt: Extraction metadata and statistics

Both files are saved to _working/output/ by default.

build_tools.nltk_syllable_extractor.process_batch(files, min_len, max_len, output_dir, quiet=False, verbose=False)

Process multiple files sequentially in batch mode.

This is a backwards-compatible wrapper around run_batch_extraction.

Parameters:
  • files (list[Path]) – List of input file paths to process

  • min_len (int) – Minimum syllable length to include

  • max_len (int) – Maximum syllable length to include

  • output_dir (Path) – Output directory for all results (created if needed)

  • quiet (bool) – If True, suppress all output except errors (default: False)

  • verbose (bool) – If True, show detailed progress for each file (default: False)

Return type:

BatchResult

Returns:

BatchResult with overall statistics and individual file results.

build_tools.nltk_syllable_extractor.process_single_file(input_path, min_len, max_len, output_dir, run_timestamp, verbose=False)

Process a single file in batch mode with comprehensive error handling.

This function attempts to extract syllables from a single file and saves the results. Unlike interactive mode, this function catches all exceptions and returns a result object indicating success or failure, allowing batch processing to continue even when individual files fail.

Parameters:
  • input_path (Path) – Path to the input text file to process

  • min_len (int) – Minimum syllable length to include in results

  • max_len (int) – Maximum syllable length to include in results

  • output_dir (Path) – Directory where output files should be saved

  • run_timestamp (str) – Timestamp for the batch run (shared across all files in batch)

  • verbose (bool) – If True, print detailed progress messages (default: False)

Return type:

FileProcessingResult

Returns:

FileProcessingResult object with success status, syllable count, and output paths (if successful), or an error message (if failed).

Note

This function never raises exceptions. All errors are caught and returned in the FileProcessingResult.error_message field.
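
The never-raise contract can be illustrated with a catch-all wrapper that converts any exception into a result object. This is a sketch under stated assumptions: FileResultSketch is a hypothetical stand-in for FileProcessingResult (only the error_message field name comes from this document), and counting whitespace-separated tokens is a placeholder for real syllable extraction.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FileResultSketch:
    """Hypothetical stand-in for FileProcessingResult."""

    input_path: Path
    success: bool
    syllable_count: int = 0
    error_message: Optional[str] = None


def process_file_safely(input_path: Path) -> FileResultSketch:
    """Read a file and count tokens, returning a result instead of raising."""
    try:
        text = input_path.read_text(encoding="utf-8")
        # Placeholder for real extraction: tokens stand in for syllables.
        count = len(text.split())
        return FileResultSketch(input_path, success=True, syllable_count=count)
    except Exception as exc:  # deliberate catch-all so the batch keeps going
        return FileResultSketch(input_path, success=False, error_message=str(exc))
```

Because failures come back as data, a caller looping over many files can collect every result and report all failures at the end instead of stopping at the first bad file.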

build_tools.nltk_syllable_extractor.process_single_file_batch(input_path, min_len, max_len, output_dir, run_timestamp, verbose=False)

Process a single file in batch mode with comprehensive error handling.

This function attempts to extract syllables from a single file and saves the results. Unlike interactive mode, this function catches all exceptions and returns a result object indicating success or failure, allowing batch processing to continue even when individual files fail.

Parameters:
  • input_path (Path) – Path to the input text file to process

  • min_len (int) – Minimum syllable length to include in results

  • max_len (int) – Maximum syllable length to include in results

  • output_dir (Path) – Directory where output files should be saved

  • run_timestamp (str) – Timestamp for the batch run (shared across all files in batch)

  • verbose (bool) – If True, print detailed progress messages (default: False)

Return type:

FileProcessingResult

Returns:

FileProcessingResult object with success status, syllable count, and output paths (if successful), or an error message (if failed).

Note

This function never raises exceptions. All errors are caught and returned in the FileProcessingResult.error_message field.

build_tools.nltk_syllable_extractor.run_batch(args)

Batch mode entry point for the NLTK syllable extractor CLI.

This function processes multiple files based on command-line arguments, providing progress indicators and comprehensive error reporting.

Parameters:

args (Namespace) – Parsed command-line arguments, an argparse.Namespace containing:
  • file: Single file path (optional)

  • files: List of file paths (optional)

  • source: Directory path for scanning (optional)

  • pattern: File pattern for directory scanning (default: “*.txt”)

  • recursive: Whether to scan directories recursively

  • min: Minimum syllable length (default: 1)

  • max: Maximum syllable length (default: 999)

  • output: Output directory (default: _working/output/)

  • quiet: Suppress progress indicators

  • verbose: Show detailed processing information

Return type:

None

Exit Codes:

  • 0: All files processed successfully

  • 1: One or more files failed to process

Raises:

SystemExit – On validation errors or processing completion
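
The exit-code rule and the SystemExit behavior can be sketched together. Both function names are hypothetical, and the input is assumed to be one success flag per processed file. Note that an empty batch yields 0 here only because all() of an empty sequence is true; the real tool may instead reject an empty batch as a validation error.

```python
from typing import Iterable


def batch_exit_code(file_succeeded: Iterable[bool]) -> int:
    """Return 0 when every file succeeded, 1 when at least one failed."""
    return 0 if all(file_succeeded) else 1


def finish_batch(file_succeeded: Iterable[bool]) -> None:
    """End the run by raising SystemExit carrying the computed exit code."""
    raise SystemExit(batch_exit_code(file_succeeded))
```

Raising SystemExit explains how a function with a None return type can still report an exit status to the shell.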

build_tools.nltk_syllable_extractor.run_interactive()

Interactive mode entry point for the NLTK syllable extractor CLI.

Return type:

None

Workflow:
  1. Display tool information and CMUDict notice

  2. Configure extraction parameters (min/max syllable length)

  3. Prompt for input file path

  4. Extract syllables using CMUDict + onset/coda principles

  5. Generate timestamped output filenames

  6. Save syllables and metadata to separate files

  7. Display summary to console

Output Files:
  • YYYYMMDD_HHMMSS.syllables.en_US.txt: One syllable per line, sorted

  • YYYYMMDD_HHMMSS.meta.en_US.txt: Extraction metadata and statistics

Both files are saved to _working/output/ by default.
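
The "one syllable per line, sorted" output format can be sketched as below. write_sorted_syllables is a hypothetical helper, not the module's own writer; keeping duplicates in the file is an assumption based on the tool's stated policy of preserving duplicates during extraction.

```python
from pathlib import Path
from typing import Iterable


def write_sorted_syllables(syllables: Iterable[str], output_path: Path) -> int:
    """Write one syllable per line in sorted order and return the line count."""
    ordered = sorted(syllables)  # duplicates are kept; only the order changes
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text("".join(f"{s}\n" for s in ordered), encoding="utf-8")
    return len(ordered)
```

Sorting before writing keeps the file stable across runs on the same input, which fits the tool's deterministic-extraction goal.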

build_tools.nltk_syllable_extractor.save_metadata(result, output_path)

Save extraction metadata to a text file.

Parameters:
  • result (ExtractionResult) – ExtractionResult containing metadata to save

  • output_path (Path) – Path to the output metadata file

Raises:

IOError – If there’s an error writing the file

Return type:

None

Example

>>> result = ExtractionResult(...)
>>> save_metadata(result, Path("output.meta.txt"))