NLTK Syllable Normaliser

Overview

NLTK Syllable Normaliser - Fragment Cleaning + 3-Step Normalization Pipeline

The NLTK syllable normaliser extends the standard normalization pipeline with NLTK-specific fragment cleaning to reconstruct phonetically coherent syllables from over-segmented output. This is a build-time tool only - not used during runtime name generation.

NLTK-Specific Processing:

  1. Fragment Cleaning - Merge single-letter fragments with neighbors (NLTK-specific)

  2. Aggregation - Combine multiple input files while preserving all occurrences

  3. Canonicalization - Unicode normalization, diacritic stripping, charset validation

  4. Frequency Analysis - Count occurrences and generate frequency intelligence

Key Differences from Pyphen Normaliser:

  • Input Source: Processes NLTK run directories with syllables/ subdirectory

  • Preprocessing: Fragment cleaning step merges isolated phonemes

  • Output Location: In-place in run directory (not separate output directory)

  • Output Prefix: nltk_ prefix (for provenance tracking)

Features:

  • Fragment cleaning (single vowel/consonant merging)

  • Unicode normalization (NFKD, NFC, NFD, NFKC)

  • Diacritic stripping using unicodedata

  • Configurable charset and length constraints

  • Frequency intelligence capture (pre-deduplication counts)

  • Deterministic processing (same input = same output)

  • Comprehensive metadata reporting

  • 5 output files with nltk_ prefix for complete analysis
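As a rough illustration of the canonicalization features listed above, the sketch below uses only unicodedata from the standard library. The function name canonicalise, the lowercasing step, and the order of the checks are assumptions made for illustration. They are not taken from the module's actual implementation.

```python
import unicodedata
from typing import Optional

ALLOWED = "abcdefghijklmnopqrstuvwxyz"


def canonicalise(
    token: str,
    *,
    min_length: int = 2,
    max_length: int = 20,
    allowed_charset: str = ALLOWED,
    unicode_form: str = "NFKD",
) -> Optional[str]:
    """Return a canonical form of token, or None if it is rejected (hypothetical helper)."""
    # Decompose so accents become separate combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", token.strip())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Apply the requested normalization form; lowercasing is an assumption.
    text = unicodedata.normalize(unicode_form, stripped).lower()
    if not text:
        return None  # became empty after normalization
    if any(ch not in allowed_charset for ch in text):
        return None  # contains a character outside the allowed charset
    if not (min_length <= len(text) <= max_length):
        return None  # violates the length constraints
    return text


print(canonicalise("café"))
print(canonicalise("x1"))
```

Returning None for rejected tokens keeps the sketch short; a real pipeline would more likely count each rejection reason separately, as the statistics fields later in this page suggest.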

The pipeline produces 5 output files (with nltk_ prefix for provenance):

  • nltk_syllables_raw.txt: Aggregated raw syllables (all occurrences preserved)

  • nltk_syllables_canonicalised.txt: Normalized canonical syllables (after fragment cleaning)

  • nltk_syllables_frequencies.json: Frequency intelligence (syllable → count)

  • nltk_syllables_unique.txt: Deduplicated canonical syllable inventory

  • nltk_normalization_meta.txt: Detailed statistics and metadata report

Usage:
>>> from pathlib import Path
>>> from build_tools.nltk_syllable_normaliser import (
...     NormalizationConfig,
...     run_full_pipeline,
... )
>>>
>>> # Process NLTK run directory in-place
>>> run_dir = Path("_working/output/20260110_095213_nltk/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=NormalizationConfig(min_length=2, max_length=8),
...     verbose=True
... )
>>>
>>> # Access results
>>> print(f"Processed {result.stats.raw_count:,} raw syllables")
>>> print(f"After cleaning: {result.stats.after_fragment_cleaning:,}")
>>> print(f"Canonical: {result.stats.after_canonicalization:,}")
>>> print(f"Unique: {result.stats.unique_canonical:,}")

CLI Usage:

# Process specific NLTK run directory
python -m build_tools.nltk_syllable_normaliser --run-dir _working/output/20260110_095213_nltk/

# Auto-detect NLTK run directories
python -m build_tools.nltk_syllable_normaliser --source _working/output/

# Custom configuration
python -m build_tools.nltk_syllable_normaliser --run-dir <path> --min 2 --max 8

Command-Line Interface

NLTK Syllable Normaliser - Fragment cleaning + 3-step normalization pipeline

usage: python -m build_tools.nltk_syllable_normaliser [-h]
                                                      (--run-dir RUN_DIR | --source SOURCE)
                                                      [--min MIN] [--max MAX]
                                                      [--charset CHARSET]
                                                      [--unicode-form {NFC,NFD,NFKC,NFKD}]
                                                      [--no-fragment-cleaning]
                                                      [--verbose] [--quiet]

Named Arguments

--run-dir

Specific NLTK run directory to process (e.g., _working/output/20260110_095213_nltk/)

--source

Directory to scan for NLTK run directories (auto-detects *_nltk/ directories)

--min

Minimum syllable length (characters). Default: 2

--max

Maximum syllable length (characters). Default: 20

--charset

Allowed character set for syllables. Default: 'abcdefghijklmnopqrstuvwxyz' (a-z)

--unicode-form

Possible choices: NFC, NFD, NFKC, NFKD

Unicode normalization form. Default: NFKD

--no-fragment-cleaning

Skip fragment cleaning step (for comparison purposes)

Default: False

--verbose, -v

Enable verbose output with detailed progress information

Default: False

--quiet, -q

Suppress all output except errors

Default: False

Examples:

# Process specific NLTK run directory
python -m build_tools.nltk_syllable_normaliser --run-dir _working/output/20260110_095213_nltk/

# Auto-detect and process all NLTK run directories
python -m build_tools.nltk_syllable_normaliser --source _working/output/

# Custom normalization config
python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/ \
  --min 2 --max 8

# Skip fragment cleaning (for comparison with pyphen)
python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/ \
  --no-fragment-cleaning
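The argument list above maps fairly directly onto argparse. The following is a minimal sketch of such a parser, built only from the flags, choices, and defaults documented in this section. The function name build_parser_sketch is hypothetical, and the real create_argument_parser() may differ in help text and attribute names.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="python -m build_tools.nltk_syllable_normaliser"
    )
    # Exactly one of --run-dir / --source must be given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--run-dir")
    target.add_argument("--source")
    parser.add_argument("--min", type=int, default=2)
    parser.add_argument("--max", type=int, default=20)
    parser.add_argument("--charset", default="abcdefghijklmnopqrstuvwxyz")
    parser.add_argument(
        "--unicode-form",
        choices=["NFC", "NFD", "NFKC", "NFKD"],
        default="NFKD",
    )
    parser.add_argument("--no-fragment-cleaning", action="store_true")
    parser.add_argument("--verbose", "-v", action="store_true")
    parser.add_argument("--quiet", "-q", action="store_true")
    return parser


args = build_parser_sketch().parse_args(["--run-dir", "runs/x_nltk", "--max", "8"])
print(args.run_dir, args.min, args.max, args.unicode_form)
```

Using a required mutually exclusive group reproduces the "(--run-dir RUN_DIR | --source SOURCE)" part of the usage line: passing both, or neither, makes argparse exit with an error.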

Output Format

The pipeline generates 5 output files in the NLTK run directory with nltk_ prefix for provenance:

  1. nltk_syllables_raw.txt - Aggregated raw syllables (all occurrences preserved)

  2. nltk_syllables_canonicalised.txt - After fragment cleaning + normalization

  3. nltk_syllables_frequencies.json - Frequency intelligence (syllable → count mapping)

  4. nltk_syllables_unique.txt - Deduplicated canonical syllable inventory

  5. nltk_normalization_meta.txt - Detailed statistics and metadata report

In-Place Processing:

Unlike the pyphen normaliser, which writes to a separate output directory, the NLTK normaliser processes run directories in-place and writes its output files directly into the run directory:

_working/output/20260110_095213_nltk/
├── syllables/                          # Input (from NLTK extractor)
│   ├── en_US_alice.txt
│   ├── en_US_middlemarch.txt
│   └── ...
├── meta/                               # Metadata (from extractor)
│   └── ...
├── nltk_syllables_raw.txt              # Output: Aggregated
├── nltk_syllables_canonicalised.txt    # Output: After cleaning + normalization
├── nltk_syllables_frequencies.json     # Output: Frequency intelligence
├── nltk_syllables_unique.txt           # Output: Deduplicated
└── nltk_normalization_meta.txt         # Output: Statistics
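Given the layout above, a downstream script could read the unique inventory and the frequency mapping straight from the run directory. This is a small sketch under stated assumptions: the helper name load_run_outputs is hypothetical, and the files are assumed to be UTF-8 with one syllable per line (text file) and a flat syllable-to-count object (JSON file).

```python
import json
from pathlib import Path


def load_run_outputs(run_dir: Path) -> tuple[list[str], dict[str, int]]:
    """Read the unique inventory and frequency mapping from a run directory (hypothetical helper)."""
    unique_path = run_dir / "nltk_syllables_unique.txt"
    freq_path = run_dir / "nltk_syllables_frequencies.json"
    # One syllable per line; blank lines are ignored.
    unique = [
        line.strip()
        for line in unique_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    frequencies = json.loads(freq_path.read_text(encoding="utf-8"))
    return unique, frequencies
```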

File structure examples:

nltk_syllables_raw.txt (before fragment cleaning):

cha
pter
i
down
the
r
a
bbit

nltk_syllables_canonicalised.txt (after fragment cleaning + normalization):

cha
pter
idown
the
ra
bbit

nltk_syllables_frequencies.json (counts after cleaning):

{
  "cha": 1,
  "pter": 1,
  "idown": 1,
  "the": 1,
  "ra": 1,
  "bbit": 1
}

nltk_syllables_unique.txt (deduplicated, sorted):

bbit
cha
idown
pter
ra
the
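The relationship between the canonicalised list, the frequency mapping, and the unique inventory can be sketched with collections.Counter. The function name summarise and the sample data are illustrative assumptions. The point is that counts are taken before deduplication, and the unique file is simply the sorted set of keys.

```python
import json
from collections import Counter


def summarise(canonical: list[str]) -> tuple[dict[str, int], list[str]]:
    """Return (frequencies, sorted unique syllables) for a canonical list (hypothetical helper)."""
    # Count every occurrence first, so frequency information survives deduplication.
    frequencies = dict(Counter(canonical))
    unique = sorted(frequencies)
    return frequencies, unique


frequencies, unique = summarise(["cha", "pter", "the", "cha", "the", "the"])
print(json.dumps(frequencies, ensure_ascii=False, indent=2))
print("\n".join(unique))
```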

Integration Guide

The NLTK syllable normaliser is the second step of the workflow: it runs after the NLTK extractor and processes that extractor's output in place:

Standard workflow (using NLTK extractor + normaliser):

# Step 1: Extract syllables using NLTK/CMUDict
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --pattern "*.txt" \
  --output _working/output/

# Step 2: Normalize extracted syllables (in-place)
python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/

# Alternative: Auto-detect all NLTK run directories
python -m build_tools.nltk_syllable_normaliser \
  --source _working/output/

# Step 3: Annotate with phonetic features (source-agnostic)
python -m build_tools.syllable_feature_annotator \
  --syllables _working/output/20260110_095213_nltk/nltk_syllables_unique.txt \
  --frequencies _working/output/20260110_095213_nltk/nltk_syllables_frequencies.json

Parallel workflow (comparing both extractors):

# Extract and normalize with pyphen (typographic)
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --lang en_US \
  --output _working/output/

python -m build_tools.pyphen_syllable_normaliser \
  --source _working/output/20260110_143022_pyphen/syllables/ \
  --output _working/output/20260110_143022_pyphen/

# Extract and normalize with NLTK (phonetic)
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --output _working/output/

python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/

# Compare outputs - the two normalisers use different prefixes (pyphen_* vs nltk_*)
diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
     _working/output/20260110_095213_nltk/nltk_syllables_unique.txt

When to use NLTK normaliser vs pyphen normaliser:

Use NLTK normaliser when:

  • You used the NLTK syllable extractor

  • Your syllables contain many single-letter fragments

  • You want phonetically coherent syllables reconstructed

  • You’re working with NLTK’s onset/coda-based splits

  • You want in-place processing within run directories

Use pyphen normaliser when:

  • You used the pyphen syllable extractor

  • Your syllables are already well-formed (typographic hyphenation)

  • You want to aggregate multiple extraction runs

  • You prefer explicit output directory specification

  • You’re working with multi-language pyphen extractions

Fragment Cleaning:

The key differentiator of the NLTK normaliser is fragment cleaning. This step reconstructs phonetically coherent syllables from NLTK’s over-segmented output:

Original Fragments    After Fragment Cleaning    Reason
i, down               idown                      Single vowel merged
r, a, bbit            ra, bbit                   Single letters merged
h, o, le              ho, le                     Single letters merged
cha, pter             cha, pter                  Multi-char preserved

Merging Rules:

  1. Single vowels (a, e, i, o, u, y) merge with next fragment

  2. Single consonants merge with next fragment

  3. Multi-character fragments remain unchanged

  4. Processing is left-to-right, deterministic
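The merging rules above can be sketched as a single left-to-right pass. This is an illustration inferred from the stated rules and the table, not the FragmentCleaner source. The function name merge_single_letters is hypothetical.

```python
def merge_single_letters(fragments: list[str]) -> list[str]:
    """Greedy left-to-right merge of single-letter fragments (hypothetical sketch)."""
    cleaned: list[str] = []
    i = 0
    while i < len(fragments):
        current = fragments[i]
        has_next = i + 1 < len(fragments)
        if len(current) == 1 and current.isalpha() and has_next:
            # A single vowel or consonant is glued onto the following fragment,
            # and that following fragment is consumed.
            cleaned.append(current + fragments[i + 1])
            i += 2
        else:
            # Multi-character fragments, and a trailing single letter, pass through.
            cleaned.append(current)
            i += 1
    return cleaned


print(merge_single_letters(["r", "a", "bbit"]))
print(merge_single_letters(["cha", "pter", "i", "down"]))
```

Because a merged fragment is consumed immediately, "r, a, bbit" becomes "ra, bbit" rather than "rabbit", which matches the table above. Vowels and consonants follow the same rule in this sketch, so no separate vowel check is needed.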

Processing Modes:

  • Specific run directory: --run-dir /path/to/run/ - Process one NLTK run

  • Auto-detection: --source /path/to/output/ - Find and process all NLTK runs

  • Skip fragment cleaning: --no-fragment-cleaning - For comparison with pyphen

Notes

In-Place Processing Philosophy:

The NLTK normaliser writes outputs directly into the run directory (not a separate location) because:

  • Convention: Each NLTK run is self-contained (extractor + normaliser outputs together)

  • Simplicity: No confusion about where normalized files live

  • Provenance: Run directory name (*_nltk) and file prefix (nltk_*) both indicate source

Fragment Cleaning Statistics:

Real-world performance on a multi-language corpus (21 files, 2.98M syllables):

  • Before cleaning: 2,977,447 syllables

  • After cleaning: 2,709,503 syllables

  • Fragments merged: 267,944 (9% reduction)

  • Processing time: 6.07 seconds

  • Unique syllables: 33,640
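The reduction figure follows directly from the before and after counts. A quick check of the arithmetic, using only the numbers quoted above:

```python
before = 2_977_447  # syllables before fragment cleaning
after = 2_709_503   # syllables after fragment cleaning

# Each merge replaces two fragments with one, so the drop in count
# equals the number of merges performed.
merged = before - after
reduction_pct = 100 * merged / before

print(f"{merged:,} fragments merged ({reduction_pct:.1f}% reduction)")
# prints: 267,944 fragments merged (9.0% reduction)
```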

When Fragment Cleaning Matters:

Fragment cleaning has the most impact on:

  • Single-letter function words (“i”, “a”), which merge into the following fragment

  • Consonant clusters split by onset/coda (“r” + “a” → “ra”)

  • Single-character prefixes/suffixes

  • Phonetically over-segmented words

For well-formed multi-syllable words, fragment cleaning has minimal effect.

Comparing with Pyphen Normaliser:

Feature                Pyphen Normaliser             NLTK Normaliser
Input Source           Any directory                 NLTK run directories
Preprocessing          None                          Fragment cleaning
Output Location        User-specified directory      In-place (run directory)
Output Prefix          pyphen_*                      nltk_*
Run Detection          Manual file discovery         Auto-detect *_nltk dirs
Normalization Steps    3 (aggregate, canon, freq)    4 (clean, aggregate, …)
Typical Use Case       Pyphen extractor output       NLTK extractor output

Auto-Detection Criteria:

The auto-detection feature (--source) finds NLTK run directories by:

  1. Scanning for directories ending with _nltk

  2. Verifying existence of syllables/ subdirectory

  3. Sorting chronologically by directory name

This allows batch processing:

# Process all NLTK runs at once
python -m build_tools.nltk_syllable_normaliser --source _working/output/

# Output:
# Found 3 NLTK run directories:
#   - 20260110_095213_nltk
#   - 20260110_143022_nltk
#   - 20260110_153045_nltk
# Processing...
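The three detection criteria listed above can be sketched with pathlib. The function name find_nltk_run_dirs is hypothetical, and the sketch only looks at direct children of the source directory, which is an assumption about how deep the real scan goes.

```python
from pathlib import Path


def find_nltk_run_dirs(source: Path) -> list[Path]:
    """Return NLTK run directories under source, oldest first (hypothetical helper)."""
    candidates = [
        path
        for path in source.iterdir()
        # Criterion 1: directory name ends with _nltk.
        # Criterion 2: it contains a syllables/ subdirectory.
        if path.is_dir()
        and path.name.endswith("_nltk")
        and (path / "syllables").is_dir()
    ]
    # Criterion 3: timestamp-prefixed names sort chronologically as plain strings.
    return sorted(candidates, key=lambda path: path.name)
```

Sorting by name works here because the run directories start with a fixed-width YYYYMMDD_HHMMSS timestamp.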

Deterministic Processing:

The NLTK normaliser is fully deterministic:

  • Same input → same output (always)

  • Fragment cleaning uses a left-to-right greedy algorithm

  • Unicode normalization is deterministic (NFKD)

  • Frequency analysis preserves insertion order (Python 3.7+)

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

Output File Prefixes:

Both normalisers now use prefixed output files:

  • Pyphen normaliser: pyphen_* prefix

  • NLTK normaliser: nltk_* prefix

This ensures clear provenance when files are shared individually.

API Reference

class build_tools.nltk_syllable_normaliser.FragmentCleaner

Bases: object

Clean NLTK-produced syllable fragments by merging isolated letters.

The NLTK syllable extractor uses phonetically guided splitting based on onset/coda principles, which can over-segment words into isolated single-letter fragments. This cleaner reconstructs more coherent syllables by applying mechanical merging rules.

Merging Rules:
  1. Single vowels (a, e, i, o, u, y) merge with next fragment

  2. Single consonants merge with next fragment

  3. Multi-character fragments remain unchanged

Example

>>> cleaner = FragmentCleaner()
>>> fragments = ["i", "down", "the", "r", "abbit"]
>>> cleaner.clean_fragments(fragments)
['idown', 'the', 'rabbit']

Note

This is NLTK-specific preprocessing. Pyphen output doesn’t need fragment cleaning as it uses typographic hyphenation rules.

clean_fragments(fragments)

Perform mechanical cleanup by merging single-letter fragments.

Applies two merging rules in sequence:

  1. Merge isolated single vowels with the following fragment

  2. Merge isolated single consonants with the following fragment

This reconstructs more phonetically coherent syllables from NLTK’s onset/coda-based over-segmentation.

Parameters:

fragments (list[str]) – List of syllable fragments (possibly over-segmented).

Return type:

list[str]

Returns:

List of cleaned fragments with single letters merged.

Example

>>> cleaner = FragmentCleaner()
>>> # Example 1: Single vowel merging
>>> cleaner.clean_fragments(["i", "down"])
['idown']
>>>
>>> # Example 2: Single consonant merging
>>> cleaner.clean_fragments(["r", "abbit"])
['rabbit']
>>>
>>> # Example 3: Mixed fragments
>>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
['cha', 'pter', 'idown', 'the', 'ra']
>>>
>>> # Example 4: Preserve multi-character fragments
>>> cleaner.clean_fragments(["hel", "lo", "world"])
['hel', 'lo', 'world']

Note

  • Fragments are processed left-to-right

  • Single letters merge with next fragment (if available)

  • Last fragment never merges (no next fragment available)

  • Empty input returns empty output

clean_fragments_from_file(input_path, output_path)

Clean fragments from input file and write to output file.

Convenience method for file-based processing. Reads one fragment per line from input file, applies cleaning, and writes cleaned fragments to output file (one per line).

Parameters:
  • input_path (str) – Path to input file (one fragment per line).

  • output_path (str) – Path to output file for cleaned fragments.

Return type:

tuple[int, int]

Returns:

Tuple of (original_count, cleaned_count) indicating how many fragments were merged.

Raises:

Example

>>> # input.txt contains:
>>> # i
>>> # down
>>> # the
>>> # r
>>> # abbit
>>>
>>> cleaner = FragmentCleaner()
>>> original, cleaned = cleaner.clean_fragments_from_file(
...     "input.txt", "output.txt"
... )
>>> print(f"Cleaned {original} → {cleaned} fragments")
Cleaned 5 → 3 fragments
>>>
>>> # output.txt now contains:
>>> # idown
>>> # the
>>> # rabbit

static is_single_letter(token)

Check if token is a single alphabetic character.

Parameters:

token (str) – String to check.

Return type:

bool

Returns:

True if token is exactly one alphabetic character, False otherwise.

Example

>>> FragmentCleaner.is_single_letter("a")
True
>>> FragmentCleaner.is_single_letter("ab")
False
>>> FragmentCleaner.is_single_letter("1")
False

static is_single_vowel(token)

Check if token is a single vowel character.

Parameters:

token (str) – String to check.

Return type:

bool

Returns:

True if token is a single vowel (a, e, i, o, u, y), False otherwise.

Example

>>> FragmentCleaner.is_single_vowel("a")
True
>>> FragmentCleaner.is_single_vowel("b")
False
>>> FragmentCleaner.is_single_vowel("ae")
False

class build_tools.nltk_syllable_normaliser.NormalizationConfig(min_length=2, max_length=20, allowed_charset='abcdefghijklmnopqrstuvwxyz', unicode_form='NFKD')

Bases: object

Configuration for syllable normalization process.

This dataclass stores all parameters that control how syllables are normalized to canonical form.

min_length

Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2

max_length

Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20

allowed_charset

String of allowed characters. Only syllables containing these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”

unicode_form

Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)

Example

>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'

__post_init__()

Validate configuration parameters after initialization.

allowed_charset: str = 'abcdefghijklmnopqrstuvwxyz'
max_length: int = 20
min_length: int = 2
unicode_form: str = 'NFKD'
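The documentation does not say which checks __post_init__ performs. The sketch below shows one plausible shape for such validation on a dataclass with the same fields and defaults. The class name ConfigSketch and every individual check are assumptions for illustration only.

```python
from dataclasses import dataclass

VALID_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for NormalizationConfig with assumed validation rules."""

    min_length: int = 2
    max_length: int = 20
    allowed_charset: str = "abcdefghijklmnopqrstuvwxyz"
    unicode_form: str = "NFKD"

    def __post_init__(self) -> None:
        # All of the checks below are assumptions, not documented behaviour.
        if self.min_length < 1:
            raise ValueError("min_length must be at least 1")
        if self.max_length < self.min_length:
            raise ValueError("max_length must be >= min_length")
        if not self.allowed_charset:
            raise ValueError("allowed_charset must not be empty")
        if self.unicode_form not in VALID_FORMS:
            raise ValueError(f"unicode_form must be one of {VALID_FORMS}")


config = ConfigSketch(min_length=3, max_length=10)
print(config.min_length, config.unicode_form)
```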
class build_tools.nltk_syllable_normaliser.NormalizationResult(config, stats, frequencies, unique_syllables, input_files, output_dir, timestamp=<factory>, raw_file=PosixPath('syllables_raw.txt'), canonical_file=PosixPath('syllables_canonicalised.txt'), frequency_file=PosixPath('syllables_frequencies.json'), unique_file=PosixPath('syllables_unique.txt'), meta_file=PosixPath('normalization_meta.txt'))

Bases: object

Complete result from the syllable normalization pipeline.

This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.

config

Configuration used for normalization

stats

Statistics from the processing

frequencies

Dictionary mapping canonical syllable to frequency count

unique_syllables

Sorted list of unique canonical syllables

input_files

List of input file paths that were processed

output_dir

Directory where output files were saved

timestamp

When the normalization was performed

raw_file

Path to raw aggregated file (syllables_raw.txt)

canonical_file

Path to canonicalized file (syllables_canonicalised.txt)

frequency_file

Path to frequency JSON (syllables_frequencies.json)

unique_file

Path to unique syllables (syllables_unique.txt)

meta_file

Path to metadata report (normalization_meta.txt)

Example

>>> from datetime import datetime
>>> from pathlib import Path
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000

canonical_file: Path = PosixPath('syllables_canonicalised.txt')
config: NormalizationConfig
format_metadata()

Format normalization metadata as a human-readable string.

Creates a detailed report including statistics, rejection breakdown, and top frequencies.

Return type:

str

Returns:

Multi-line string containing all normalization metadata formatted for display or file output.

Example

>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp:           2026-01-05 17:30:22
...

frequencies: dict[str, int]
frequency_file: Path = PosixPath('syllables_frequencies.json')
input_files: list[Path]
meta_file: Path = PosixPath('normalization_meta.txt')
output_dir: Path
raw_file: Path = PosixPath('syllables_raw.txt')
stats: NormalizationStats
timestamp: datetime
unique_file: Path = PosixPath('syllables_unique.txt')
unique_syllables: list[str]

class build_tools.nltk_syllable_normaliser.NormalizationStats(raw_count=0, after_canonicalization=0, rejected_charset=0, rejected_length=0, rejected_empty=0, unique_canonical=0, processing_time=0.0)

Bases: object

Statistics from the syllable normalization process.

This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.

raw_count

Total number of syllables in raw input (before normalization)

after_canonicalization

Number of syllables after normalization

rejected_charset

Syllables rejected due to invalid characters

rejected_length

Syllables rejected due to length constraints

rejected_empty

Syllables that became empty after normalization

unique_canonical

Number of unique canonical syllables

processing_time

Total processing time in seconds

Example

>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0

after_canonicalization: int = 0
processing_time: float = 0.0
raw_count: int = 0
rejected_charset: int = 0
rejected_empty: int = 0
rejected_length: int = 0
property rejection_rate: float

Calculate rejection rate as percentage of raw count.

property total_rejected: int

Calculate total number of rejected syllables.

unique_canonical: int = 0
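The two properties described above, total_rejected and rejection_rate, could be computed as in the following sketch. The class name StatsSketch is hypothetical, and returning 0.0 when raw_count is zero is an assumption. The formula itself reproduces the documented example (50 rejected out of 1000 gives 5.0).

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Hypothetical subset of NormalizationStats, enough to show the derived properties."""

    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        # Sum of the three rejection counters.
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Percentage of raw syllables rejected; the zero guard is an assumption.
        if self.raw_count == 0:
            return 0.0
        return 100.0 * self.total_rejected / self.raw_count


stats = StatsSketch(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.total_rejected, stats.rejection_rate)
```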
build_tools.nltk_syllable_normaliser.create_argument_parser()

Create and return the argument parser for NLTK syllable normaliser.

Return type:

ArgumentParser

Returns:

Configured ArgumentParser ready to parse command-line arguments.

build_tools.nltk_syllable_normaliser.main(args=None)

Main entry point for CLI.

Parameters:

args (list[str] | None) – Command-line arguments (for testing). If None, uses sys.argv.

Return type:

int

Returns:

Exit code (0 for success, 1 for error).

build_tools.nltk_syllable_normaliser.run_full_pipeline(run_directory, config, verbose=False, skip_fragment_cleaning=False)

Run complete NLTK normalization pipeline with in-place processing.

Executes the full NLTK-specific workflow:

  1. Aggregate syllables from run_directory/syllables/*.txt

  2. Fragment cleaning (NLTK-specific preprocessing)

  3. Canonicalize syllables (Unicode normalization, etc.)

  4. Frequency analysis

  5. Write 5 output files to run_directory (in-place)

Parameters:
  • run_directory (Path) – NLTK run directory (e.g., _working/output/20260110_095213_nltk/).

  • config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.

  • verbose (bool) – If True, print detailed progress information.

  • skip_fragment_cleaning (bool) – If True, skip fragment cleaning step (for comparison).

Return type:

NormalizationResult

Returns:

NormalizationResult containing all outputs, statistics, and file paths.

Raises:
  • FileNotFoundError – If run_directory or syllables/ subdirectory doesn’t exist.

  • ValueError – If run_directory is not a directory.

Example

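>>> from pathlib import Path
>>> from build_tools.nltk_syllable_normaliser import (
...     NormalizationConfig,
...     run_full_pipeline,
... )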
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> run_dir = Path("_working/output/20260110_095213_nltk/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=config,
...     verbose=True
... )
>>> result.stats.raw_count
15234
>>> result.stats.unique_canonical
4821