NLTK Syllable Normaliser
Overview
NLTK Syllable Normaliser - Fragment Cleaning + 3-Step Normalization Pipeline
The NLTK syllable normaliser extends the standard normalization pipeline with NLTK-specific fragment cleaning to reconstruct phonetically coherent syllables from over-segmented output. This is a build-time tool only - not used during runtime name generation.
NLTK-Specific Processing:
Fragment Cleaning - Merge single-letter fragments with neighbors (NLTK-specific)
Aggregation - Combine multiple input files while preserving all occurrences
Canonicalization - Unicode normalization, diacritic stripping, charset validation
Frequency Analysis - Count occurrences and generate frequency intelligence
Key Differences from Pyphen Normaliser:
Input Source: Processes NLTK run directories with syllables/ subdirectory
Preprocessing: Fragment cleaning step merges isolated phonemes
Output Location: In-place in run directory (not separate output directory)
Output Prefix: nltk_ prefix (for provenance tracking)
Features:
Fragment cleaning (single vowel/consonant merging)
Unicode normalization (NFKD, NFC, NFD, NFKC)
Diacritic stripping using unicodedata
Configurable charset and length constraints
Frequency intelligence capture (pre-deduplication counts)
Deterministic processing (same input = same output)
Comprehensive metadata reporting
5 output files with nltk_ prefix for complete analysis
The pipeline produces 5 output files (with nltk_ prefix for provenance):
nltk_syllables_raw.txt: Aggregated raw syllables (all occurrences preserved)
nltk_syllables_canonicalised.txt: Normalized canonical syllables (after fragment cleaning)
nltk_syllables_frequencies.json: Frequency intelligence (syllable → count)
nltk_syllables_unique.txt: Deduplicated canonical syllable inventory
nltk_normalization_meta.txt: Detailed statistics and metadata report
- Usage:
>>> from pathlib import Path
>>> from build_tools.nltk_syllable_normaliser import (
...     NormalizationConfig,
...     run_full_pipeline,
... )
>>>
>>> # Process NLTK run directory in-place
>>> run_dir = Path("_working/output/20260110_095213_nltk/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=NormalizationConfig(min_length=2, max_length=8),
...     verbose=True
... )
>>>
>>> # Access results
>>> print(f"Processed {result.stats.raw_count:,} raw syllables")
>>> print(f"After cleaning: {result.stats.after_fragment_cleaning:,}")
>>> print(f"Canonical: {result.stats.after_canonicalization:,}")
>>> print(f"Unique: {result.stats.unique_canonical:,}")
CLI Usage:
# Process specific NLTK run directory
python -m build_tools.nltk_syllable_normaliser --run-dir _working/output/20260110_095213_nltk/
# Auto-detect NLTK run directories
python -m build_tools.nltk_syllable_normaliser --source _working/output/
# Custom configuration
python -m build_tools.nltk_syllable_normaliser --run-dir <path> --min 2 --max 8
Command-Line Interface
NLTK Syllable Normaliser - Fragment cleaning + 3-step normalization pipeline
usage: python -m build_tools.nltk_syllable_normaliser [-h]
(--run-dir RUN_DIR | --source SOURCE)
[--min MIN] [--max MAX]
[--charset CHARSET]
[--unicode-form {NFC,NFD,NFKC,NFKD}]
[--no-fragment-cleaning]
[--verbose] [--quiet]
Named Arguments
- --run-dir
Specific NLTK run directory to process (e.g., _working/output/20260110_095213_nltk/)
- --source
Directory to scan for NLTK run directories (auto-detects *_nltk/ directories)
- --min
Minimum syllable length (characters). Default: 2
- --max
Maximum syllable length (characters). Default: 20
- --charset
Allowed character set for syllables. Default: 'abcdefghijklmnopqrstuvwxyz'
- --unicode-form
Possible choices: NFC, NFD, NFKC, NFKD
Unicode normalization form. Default: 'NFKD'
- --no-fragment-cleaning
Skip fragment cleaning step (for comparison purposes). Default: False
- --verbose, -v
Enable verbose output with detailed progress information. Default: False
- --quiet, -q
Suppress all output except errors. Default: False
Examples:
# Process specific NLTK run directory
python -m build_tools.nltk_syllable_normaliser --run-dir _working/output/20260110_095213_nltk/
# Auto-detect and process all NLTK run directories
python -m build_tools.nltk_syllable_normaliser --source _working/output/
# Custom normalization config
python -m build_tools.nltk_syllable_normaliser \
--run-dir _working/output/20260110_095213_nltk/ \
--min 2 --max 8
# Skip fragment cleaning (for comparison with pyphen)
python -m build_tools.nltk_syllable_normaliser \
--run-dir _working/output/20260110_095213_nltk/ \
--no-fragment-cleaning
Output Format
The pipeline generates 5 output files in the NLTK run directory with nltk_ prefix for provenance:
nltk_syllables_raw.txt - Aggregated raw syllables (all occurrences preserved)
nltk_syllables_canonicalised.txt - After fragment cleaning + normalization
nltk_syllables_frequencies.json - Frequency intelligence (syllable → count mapping)
nltk_syllables_unique.txt - Deduplicated canonical syllable inventory
nltk_normalization_meta.txt - Detailed statistics and metadata report
In-Place Processing:
Unlike the pyphen normaliser, which writes to a separate output directory, the NLTK normaliser processes run directories in-place and writes its output files directly into the run directory:
_working/output/20260110_095213_nltk/
├── syllables/ # Input (from NLTK extractor)
│ ├── en_US_alice.txt
│ ├── en_US_middlemarch.txt
│ └── ...
├── meta/ # Metadata (from extractor)
│ └── ...
├── nltk_syllables_raw.txt # Output: Aggregated
├── nltk_syllables_canonicalised.txt # Output: After cleaning + normalization
├── nltk_syllables_frequencies.json # Output: Frequency intelligence
├── nltk_syllables_unique.txt # Output: Deduplicated
└── nltk_normalization_meta.txt # Output: Statistics
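The aggregation step that feeds nltk_syllables_raw.txt can be pictured roughly as follows. This is a minimal sketch, not the module's code: the helper name aggregate_syllables, the sorted file order, and the skipping of blank lines are all assumptions made for illustration.

```python
from pathlib import Path


def aggregate_syllables(run_directory: Path) -> list[str]:
    """Hypothetical helper: collect every syllable from run_directory/syllables/*.txt."""
    syllables: list[str] = []
    # Sort the file names so repeated runs read the inputs in the same order.
    for path in sorted((run_directory / "syllables").glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            token = line.strip()
            if token:  # assumption: blank lines are ignored
                syllables.append(token)  # duplicates are kept on purpose
    return syllables
```

Keeping duplicates at this stage matters because the frequency counts are taken before deduplication.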
File structure examples:
nltk_syllables_raw.txt (before fragment cleaning):
cha
pter
i
down
the
r
a
bbit
nltk_syllables_canonicalised.txt (after fragment cleaning + normalization):
cha
pter
idown
the
ra
bbit
nltk_syllables_frequencies.json (counts after cleaning):
{
"cha": 1,
"pter": 1,
"idown": 1,
"the": 1,
"ra": 1,
"bbit": 1
}
nltk_syllables_unique.txt (deduplicated, sorted):
bbit
cha
idown
pter
ra
the
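The relationship between the canonicalised list, the frequency JSON, and the unique inventory can be sketched with the standard library. The syllables below are made-up illustration data, and the use of collections.Counter is an assumption about one reasonable way to do the counting, not a statement about the module's internals.

```python
import json
from collections import Counter

# Illustrative canonical syllables (made-up data, duplicates kept on purpose).
canonical = ["ka", "ra", "ka", "the", "ra", "ka"]

# Count before deduplicating so frequency information is not lost.
frequencies = dict(Counter(canonical))

# The unique inventory is the sorted set of counted syllables.
unique = sorted(frequencies)

print(json.dumps(frequencies, indent=2))
print(unique)
```

Counter keeps keys in first-seen order, so the JSON comes out the same for the same input.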
Integration Guide
The NLTK syllable normaliser is the second step of the workflow: it runs after the NLTK extractor and before feature annotation.
Standard workflow (using NLTK extractor + normaliser):
# Step 1: Extract syllables using NLTK/CMUDict
python -m build_tools.nltk_syllable_extractor \
--source data/corpus/ \
--pattern "*.txt" \
--output _working/output/
# Step 2: Normalize extracted syllables (in-place)
python -m build_tools.nltk_syllable_normaliser \
--run-dir _working/output/20260110_095213_nltk/
# Alternative: Auto-detect all NLTK run directories
python -m build_tools.nltk_syllable_normaliser \
--source _working/output/
# Step 3: Annotate with phonetic features (source-agnostic)
python -m build_tools.syllable_feature_annotator \
--syllables _working/output/20260110_095213_nltk/nltk_syllables_unique.txt \
--frequencies _working/output/20260110_095213_nltk/nltk_syllables_frequencies.json
Parallel workflow (comparing both extractors):
# Extract and normalize with pyphen (typographic)
python -m build_tools.pyphen_syllable_extractor \
--source data/corpus/ \
--lang en_US \
--output _working/output/
python -m build_tools.pyphen_syllable_normaliser \
--source _working/output/20260110_143022_pyphen/syllables/ \
--output _working/output/20260110_143022_pyphen/
# Extract and normalize with NLTK (phonetic)
python -m build_tools.nltk_syllable_extractor \
--source data/corpus/ \
--output _working/output/
python -m build_tools.nltk_syllable_normaliser \
--run-dir _working/output/20260110_095213_nltk/
# Compare outputs - both use different prefixes (pyphen_* vs nltk_*)
diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
_working/output/20260110_095213_nltk/nltk_syllables_unique.txt
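A line-based diff of two sorted inventories can be noisy. As an alternative, the overlap can be summarised with set operations. The helper names load_inventory and compare_inventories below are hypothetical and not part of build_tools.

```python
from pathlib import Path


def load_inventory(path: Path) -> set[str]:
    """Read a *_syllables_unique.txt file into a set (one syllable per line)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def compare_inventories(pyphen: set[str], nltk: set[str]) -> dict[str, int]:
    """Summarise the overlap between two syllable inventories."""
    return {
        "shared": len(pyphen & nltk),
        "pyphen_only": len(pyphen - nltk),
        "nltk_only": len(nltk - pyphen),
    }
```

Calling compare_inventories on the two loaded sets gives a quick count of how far the typographic and phonetic inventories diverge.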
When to use NLTK normaliser vs pyphen normaliser:
Use NLTK normaliser when:
You used the NLTK syllable extractor
Your syllables contain many single-letter fragments
You want phonetically coherent syllables reconstructed
You’re working with NLTK’s onset/coda-based splits
You want in-place processing within run directories
Use pyphen normaliser when:
You used the pyphen syllable extractor
Your syllables are already well-formed (typographic hyphenation)
You want to aggregate multiple extraction runs
You prefer explicit output directory specification
You’re working with multi-language pyphen extractions
Fragment Cleaning:
The key differentiator of the NLTK normaliser is fragment cleaning. This step reconstructs phonetically coherent syllables from NLTK’s over-segmented output:
| Original Fragments | After Fragment Cleaning | Reason |
|---|---|---|
| i, down | idown | Single vowel merged |
| r, a, bbit | ra, bbit | Single letters merged |
| h, o, le | ho, le | Single letters merged |
| cha, pter | cha, pter | Multi-char preserved |
Merging Rules:
Single vowels (a, e, i, o, u, y) merge with next fragment
Single consonants merge with next fragment
Multi-character fragments remain unchanged
Processing is left-to-right, deterministic
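The merging rules can be sketched as a single greedy pass. This is an illustrative reading of the rules, not the FragmentCleaner source: it assumes that a merge consumes the following fragment and that the merged result is not examined again, which is the behaviour the table rows imply.

```python
def is_single_letter(token: str) -> bool:
    """True for exactly one alphabetic character."""
    return len(token) == 1 and token.isalpha()


def clean_fragments(fragments: list[str]) -> list[str]:
    """Greedy left-to-right merge of single-letter fragments (illustrative sketch)."""
    cleaned: list[str] = []
    i = 0
    while i < len(fragments):
        current = fragments[i]
        has_next = i + 1 < len(fragments)
        # The vowel rule and the consonant rule both merge forward,
        # so a single "is it one letter?" check covers both here.
        if is_single_letter(current) and has_next:
            cleaned.append(current + fragments[i + 1])
            i += 2  # the next fragment was consumed by the merge
        else:
            cleaned.append(current)  # multi-character or trailing fragment
            i += 1
    return cleaned


print(clean_fragments(["i", "down"]))
print(clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"]))
```

Under these assumptions a trailing single letter stays as it is, because there is no following fragment to merge into.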
Processing Modes:
Specific run directory: --run-dir /path/to/run/ - Process one NLTK run
Auto-detection: --source /path/to/output/ - Find and process all NLTK runs
Skip fragment cleaning: --no-fragment-cleaning - For comparison with pyphen
Notes
In-Place Processing Philosophy:
The NLTK normaliser writes outputs directly into the run directory (not a separate location) because:
Convention: Each NLTK run is self-contained (extractor + normaliser outputs together)
Simplicity: No confusion about where normalized files live
Provenance: Run directory name (*_nltk) and file prefix (nltk_*) both indicate source
Fragment Cleaning Statistics:
Real-world performance on multi-language corpus (21 files, 2.98M syllables):
Before cleaning: 2,977,447 syllables
After cleaning: 2,709,503 syllables
Fragments merged: 267,944 (9% reduction)
Processing time: 6.07 seconds
Unique syllables: 33,640
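The merged count and the percentage follow directly from the before and after totals. A quick check of the arithmetic, using only the figures listed above:

```python
before = 2_977_447  # syllables before fragment cleaning
after = 2_709_503   # syllables after fragment cleaning

merged = before - after
reduction_pct = merged * 100 / before

print(f"Fragments merged: {merged:,} ({reduction_pct:.1f}% reduction)")
```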
When Fragment Cleaning Matters:
Fragment cleaning has the most impact on:
Short function words (“i”, “a”, “the”)
Consonant clusters split by onset/coda (“r” + “a” → “ra”)
Single-character prefixes/suffixes
Phonetically over-segmented words
For well-formed multi-syllable words, fragment cleaning has minimal effect.
Comparing with Pyphen Normaliser:
| Feature | Pyphen Normaliser | NLTK Normaliser |
|---|---|---|
| Input Source | Any directory | NLTK run directories |
| Preprocessing | None | Fragment cleaning |
| Output Location | User-specified directory | In-place (run directory) |
| Output Prefix | pyphen_* | nltk_* |
| Run Detection | Manual file discovery | Auto-detect *_nltk dirs |
| Normalization Steps | 3 (aggregate, canon, freq) | 4 (clean, aggregate, …) |
| Typical Use Case | Pyphen extractor output | NLTK extractor output |
Auto-Detection Criteria:
The auto-detection feature (--source) finds NLTK run directories by:
Scanning for directories ending with _nltk
Verifying existence of syllables/ subdirectory
Sorting chronologically by directory name
This allows batch processing:
# Process all NLTK runs at once
python -m build_tools.nltk_syllable_normaliser --source _working/output/
# Output:
# Found 3 NLTK run directories:
# - 20260110_095213_nltk
# - 20260110_143022_nltk
# - 20260110_153045_nltk
# Processing...
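The three detection criteria map onto a few pathlib calls. The function below is a hypothetical sketch (the name find_nltk_run_directories is not taken from the module) and assumes run directory names begin with a sortable timestamp, as in the examples above.

```python
from pathlib import Path


def find_nltk_run_directories(source: Path) -> list[Path]:
    """Hypothetical helper mirroring the three detection criteria."""
    candidates = [
        path
        for path in source.iterdir()
        if path.is_dir()
        and path.name.endswith("_nltk")        # criterion 1: name suffix
        and (path / "syllables").is_dir()      # criterion 2: syllables/ exists
    ]
    # Criterion 3: names start with a timestamp, so a name sort is chronological.
    return sorted(candidates, key=lambda path: path.name)
```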
Deterministic Processing:
The NLTK normaliser is fully deterministic:
Same input → same output (always)
Fragment cleaning uses left-to-right greedy algorithm
Unicode normalization is deterministic (NFKD)
Frequency analysis preserves insertion order (Python 3.7+)
Build-time tool:
This is a build-time tool only - not used during runtime name generation.
Output File Prefixes:
Both normalisers now use prefixed output files:
Pyphen normaliser: pyphen_* prefix
NLTK normaliser: nltk_* prefix
This ensures clear provenance when files are shared individually.
API Reference
- class build_tools.nltk_syllable_normaliser.FragmentCleaner[source]
Bases: object
Clean NLTK-produced syllable fragments by merging isolated letters.
The NLTK syllable extractor uses phonetically-guided splitting with onset/coda principles, which can sometimes over-segment words into isolated single-letter fragments. This cleaner reconstructs more coherent syllables by applying mechanical merging rules.
- Merging Rules:
Single vowels (a, e, i, o, u, y) merge with next fragment
Single consonants merge with next fragment
Multi-character fragments remain unchanged
Example
>>> cleaner = FragmentCleaner()
>>> fragments = ["i", "down", "the", "r", "abbit"]
>>> cleaner.clean_fragments(fragments)
['idown', 'the', 'rabbit']
Note
This is NLTK-specific preprocessing. Pyphen output doesn’t need fragment cleaning as it uses typographic hyphenation rules.
- clean_fragments(fragments)[source]
Perform mechanical cleanup by merging single-letter fragments.
Applies two merging rules in sequence:
1. Merge isolated single vowels with the following fragment
2. Merge isolated single consonants with the following fragment
This reconstructs more phonetically coherent syllables from NLTK’s onset/coda-based over-segmentation.
- Parameters:
fragments (list[str]) – List of syllable fragments (possibly over-segmented).
- Return type:
list[str]
- Returns:
List of cleaned fragments with single letters merged.
Example
>>> cleaner = FragmentCleaner()
>>> # Example 1: Single vowel merging
>>> cleaner.clean_fragments(["i", "down"])
['idown']
>>>
>>> # Example 2: Single consonant merging
>>> cleaner.clean_fragments(["r", "abbit"])
['rabbit']
>>>
>>> # Example 3: Mixed fragments
>>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
['cha', 'pter', 'idown', 'the', 'ra']
>>>
>>> # Example 4: Preserve multi-character fragments
>>> cleaner.clean_fragments(["hel", "lo", "world"])
['hel', 'lo', 'world']
Note
Fragments are processed left-to-right
Single letters merge with next fragment (if available)
Last fragment never merges (no next fragment available)
Empty input returns empty output
- clean_fragments_from_file(input_path, output_path)[source]
Clean fragments from input file and write to output file.
Convenience method for file-based processing. Reads one fragment per line from input file, applies cleaning, and writes cleaned fragments to output file (one per line).
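- Parameters:
input_path – Input file containing one fragment per line.
output_path – Output file that receives the cleaned fragments, one per line.
- Return type:
tuple[int, int]
- Returns:
Tuple of (original_count, cleaned_count). The difference between the two is the number of fragments removed by merging.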
- Raises:
FileNotFoundError – If input file doesn’t exist.
IOError – If there’s an error reading or writing files.
Example
>>> # input.txt contains:
>>> # i
>>> # down
>>> # the
>>> # r
>>> # abbit
>>>
>>> cleaner = FragmentCleaner()
>>> original, cleaned = cleaner.clean_fragments_from_file(
...     "input.txt", "output.txt"
... )
>>> print(f"Cleaned {original} → {cleaned} fragments")
Cleaned 5 → 3 fragments
>>>
>>> # output.txt now contains:
>>> # idown
>>> # the
>>> # rabbit
- static is_single_letter(token)[source]
Check if token is a single alphabetic character.
- Parameters:
token (str) – String to check.
- Return type:
bool
- Returns:
True if token is exactly one alphabetic character, False otherwise.
Example
>>> FragmentCleaner.is_single_letter("a")
True
>>> FragmentCleaner.is_single_letter("ab")
False
>>> FragmentCleaner.is_single_letter("1")
False
- static is_single_vowel(token)[source]
Check if token is a single vowel character.
- Parameters:
token (str) – String to check.
- Return type:
bool
- Returns:
True if token is a single vowel (a, e, i, o, u, y), False otherwise.
Example
>>> FragmentCleaner.is_single_vowel("a")
True
>>> FragmentCleaner.is_single_vowel("b")
False
>>> FragmentCleaner.is_single_vowel("ae")
False
- class build_tools.nltk_syllable_normaliser.NormalizationConfig(min_length=2, max_length=20, allowed_charset='abcdefghijklmnopqrstuvwxyz', unicode_form='NFKD')[source]
Bases: object
Configuration for syllable normalization process.
This dataclass stores all parameters that control how syllables are normalized to canonical form.
- min_length
Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2
- max_length
Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20
- allowed_charset
String of allowed characters. Only syllables containing these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”
- unicode_form
Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)
Example
>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'
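To make the four configuration fields concrete, here is a minimal sketch of how a single syllable might be canonicalised with them. The function canonicalise is hypothetical; lowercasing the input and the order of the checks are assumptions, not documented behaviour.

```python
import unicodedata
from typing import Optional


def canonicalise(
    syllable: str,
    *,
    min_length: int = 2,
    max_length: int = 20,
    allowed_charset: str = "abcdefghijklmnopqrstuvwxyz",
    unicode_form: str = "NFKD",
) -> Optional[str]:
    """Return the canonical form, or None if the syllable is rejected."""
    # Assumption: input is trimmed and lowercased before normalization.
    text = unicodedata.normalize(unicode_form, syllable.strip().lower())
    # Drop combining marks (category Mn) left behind by decomposition.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    if not text:
        return None  # became empty after normalization
    if any(ch not in allowed_charset for ch in text):
        return None  # contains a character outside the allowed charset
    if not min_length <= len(text) <= max_length:
        return None  # violates the length constraints
    return text


print(canonicalise("Café"))
print(canonicalise("a"))
```

Stripping combining marks only removes diacritics when a decomposing form (NFD or NFKD) is used. With NFC or NFKC an accented letter stays composed and, in this sketch, is rejected by the charset check instead.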
- class build_tools.nltk_syllable_normaliser.NormalizationResult(config, stats, frequencies, unique_syllables, input_files, output_dir, timestamp=<factory>, raw_file=PosixPath('syllables_raw.txt'), canonical_file=PosixPath('syllables_canonicalised.txt'), frequency_file=PosixPath('syllables_frequencies.json'), unique_file=PosixPath('syllables_unique.txt'), meta_file=PosixPath('normalization_meta.txt'))[source]
Bases: object
Complete result from the syllable normalization pipeline.
This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.
- config
Configuration used for normalization
- stats
Statistics from the processing
- frequencies
Dictionary mapping canonical syllable to frequency count
- unique_syllables
Sorted list of unique canonical syllables
- input_files
List of input file paths that were processed
- output_dir
Directory where output files were saved
- timestamp
When the normalization was performed
- raw_file
Path to raw aggregated file (syllables_raw.txt)
- canonical_file
Path to canonicalized file (syllables_canonicalised.txt)
- frequency_file
Path to frequency JSON (syllables_frequencies.json)
- unique_file
Path to unique syllables (syllables_unique.txt)
- meta_file
Path to metadata report (normalization_meta.txt)
Example
>>> from datetime import datetime
>>> from pathlib import Path
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000
- config: NormalizationConfig
- format_metadata()[source]
Format normalization metadata as a human-readable string.
Creates a detailed report including statistics, rejection breakdown, and top frequencies.
- Return type:
str
- Returns:
Multi-line string containing all normalization metadata formatted for display or file output.
Example
>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp: 2026-01-05 17:30:22
...
- stats: NormalizationStats
- class build_tools.nltk_syllable_normaliser.NormalizationStats(raw_count=0, after_canonicalization=0, rejected_charset=0, rejected_length=0, rejected_empty=0, unique_canonical=0, processing_time=0.0)[source]
Bases: object
Statistics from the syllable normalization process.
This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.
- raw_count
Total number of syllables in raw input (before normalization)
- after_canonicalization
Number of syllables after normalization
- rejected_charset
Syllables rejected due to invalid characters
- rejected_length
Syllables rejected due to length constraints
- rejected_empty
Syllables that became empty after normalization
- unique_canonical
Number of unique canonical syllables
- processing_time
Total processing time in seconds
Example
>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0
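The rejection_rate value in the example is consistent with "total rejected divided by raw count, as a percentage". The sketch below shows that reading with a stand-in dataclass. StatsSketch and total_rejected are hypothetical names, and the formula is inferred from the example rather than confirmed against the source.

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Stand-in for NormalizationStats with only the fields the rate needs."""

    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Guard against division by zero when nothing was processed.
        if self.raw_count == 0:
            return 0.0
        return self.total_rejected * 100 / self.raw_count


stats = StatsSketch(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.rejection_rate)
```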
- build_tools.nltk_syllable_normaliser.create_argument_parser()[source]
Create and return the argument parser for NLTK syllable normaliser.
- Return type:
argparse.ArgumentParser
- Returns:
Configured ArgumentParser ready to parse command-line arguments.
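A parser matching the usage line in the Command-Line Interface section could be built along these lines. This is a reconstruction from the documented flags and defaults, not the actual body of create_argument_parser. The function name build_parser, the int types for --min and --max, and the abbreviated help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser with the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="python -m build_tools.nltk_syllable_normaliser",
        description="NLTK Syllable Normaliser - Fragment cleaning + 3-step normalization pipeline",
    )
    # Exactly one of --run-dir / --source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--run-dir", help="Specific NLTK run directory to process")
    source.add_argument("--source", help="Directory to scan for *_nltk/ run directories")

    parser.add_argument("--min", type=int, default=2, help="Minimum syllable length")
    parser.add_argument("--max", type=int, default=20, help="Maximum syllable length")
    parser.add_argument("--charset", default="abcdefghijklmnopqrstuvwxyz")
    parser.add_argument(
        "--unicode-form", choices=["NFC", "NFD", "NFKC", "NFKD"], default="NFKD"
    )
    parser.add_argument("--no-fragment-cleaning", action="store_true")
    parser.add_argument("--verbose", "-v", action="store_true")
    parser.add_argument("--quiet", "-q", action="store_true")
    return parser


args = build_parser().parse_args(["--run-dir", "some_run_nltk", "--min", "3"])
print(args.run_dir, args.min, args.max, args.unicode_form)
```

The mutually exclusive group is what produces the (--run-dir RUN_DIR | --source SOURCE) form in the usage text.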
- build_tools.nltk_syllable_normaliser.run_full_pipeline(run_directory, config, verbose=False, skip_fragment_cleaning=False)[source]
Run complete NLTK normalization pipeline with in-place processing.
Executes the full NLTK-specific workflow:
1. Aggregate syllables from run_directory/syllables/*.txt
2. Fragment cleaning (NLTK-specific preprocessing)
3. Canonicalize syllables (Unicode normalization, etc.)
4. Frequency analysis
5. Write 5 output files to run_directory (in-place)
- Parameters:
run_directory (Path) – NLTK run directory (e.g., _working/output/20260110_095213_nltk/).
config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.
verbose (bool) – If True, print detailed progress information.
skip_fragment_cleaning (bool) – If True, skip fragment cleaning step (for comparison).
- Return type:
NormalizationResult
- Returns:
NormalizationResult containing all outputs, statistics, and file paths.
- Raises:
FileNotFoundError – If run_directory or syllables/ subdirectory doesn’t exist.
ValueError – If run_directory is not a directory.
Example
>>> from pathlib import Path
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> run_dir = Path("_working/output/20260110_095213_nltk/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=config,
...     verbose=True
... )
>>> result.stats.raw_count
15234
>>> result.stats.unique_canonical
4821