NLTK Syllable Normaliser

Overview

NLTK Syllable Normaliser - Fragment Cleaning + 3-Step Normalization Pipeline

The NLTK syllable normaliser extends the standard normalization pipeline with NLTK-specific fragment cleaning to reconstruct phonetically coherent syllables from over-segmented output. This is a build-time tool only - not used during runtime name generation.

NLTK-Specific Processing:

Fragment Cleaning - Merge single-letter fragments with neighbors (NLTK-specific)
Aggregation - Combine multiple input files while preserving all occurrences
Canonicalization - Unicode normalization, diacritic stripping, charset validation
Frequency Analysis - Count occurrences and generate frequency intelligence

Key Differences from Pyphen Normaliser:

Input Source: Processes NLTK run directories with syllables/ subdirectory
Preprocessing: Fragment cleaning step merges isolated phonemes
Output Location: In-place in run directory (not separate output directory)
Output Prefix: nltk_ prefix (for provenance tracking)

Features:

Fragment cleaning (single vowel/consonant merging)
Unicode normalization (NFKD, NFC, NFD, NFKC)
Diacritic stripping using unicodedata
Configurable charset and length constraints
Frequency intelligence capture (pre-deduplication counts)
Deterministic processing (same input = same output)
Comprehensive metadata reporting
5 output files with nltk_ prefix for complete analysis

The pipeline produces 5 output files (with nltk_ prefix for provenance):

nltk_syllables_raw.txt: Aggregated raw syllables (all occurrences preserved)
nltk_syllables_canonicalised.txt: Normalized canonical syllables (after fragment cleaning)
nltk_syllables_frequencies.json: Frequency intelligence (syllable → count)
nltk_syllables_unique.txt: Deduplicated canonical syllable inventory
nltk_normalization_meta.txt: Detailed statistics and metadata report

Usage:

>>> from pathlib import Path
>>> from build_tools.nltk_syllable_normaliser import (
...     NormalizationConfig,
...     run_full_pipeline,
... )
>>>
>>> # Process NLTK run directory in-place
>>> run_dir = Path("_working/output/20260110_095213_nltk/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=NormalizationConfig(min_length=2, max_length=8),
...     verbose=True
... )
>>>
>>> # Access results
>>> print(f"Processed {result.stats.raw_count:,} raw syllables")
>>> print(f"After cleaning: {result.stats.after_fragment_cleaning:,}")
>>> print(f"Canonical: {result.stats.after_canonicalization:,}")
>>> print(f"Unique: {result.stats.unique_canonical:,}")

CLI Usage:

# Process specific NLTK run directory
python -m build_tools.nltk_syllable_normaliser --run-dir _working/output/20260110_095213_nltk/

# Auto-detect NLTK run directories
python -m build_tools.nltk_syllable_normaliser --source _working/output/

# Custom configuration
python -m build_tools.nltk_syllable_normaliser --run-dir <path> --min 2 --max 8

Command-Line Interface

NLTK Syllable Normaliser - Fragment cleaning + 3-step normalization pipeline

usage: python -m build_tools.nltk_syllable_normaliser [-h]
                                                      (--run-dir RUN_DIR | --source SOURCE)
                                                      [--min MIN] [--max MAX]
                                                      [--charset CHARSET]
                                                      [--unicode-form {NFC,NFD,NFKC,NFKD}]
                                                      [--no-fragment-cleaning]
                                                      [--verbose] [--quiet]

Named Arguments

--run-dir

Specific NLTK run directory to process (e.g., _working/output/20260110_095213_nltk/)

--source

Directory to scan for NLTK run directories (auto-detects *_nltk/ directories)

--min

Minimum syllable length (characters).

Default: 2

--max

Maximum syllable length (characters).

Default: 20

--charset

Allowed character set for syllables.

Default: 'abcdefghijklmnopqrstuvwxyz'

--unicode-form

Possible choices: NFC, NFD, NFKC, NFKD

Unicode normalization form.

Default: 'NFKD'

--no-fragment-cleaning

Skip fragment cleaning step (for comparison purposes)

Default: False

--verbose, -v

Enable verbose output with detailed progress information

Default: False

--quiet, -q

Suppress all output except errors

Default: False

Examples:

# Process specific NLTK run directory
python -m build_tools.nltk_syllable_normaliser --run-dir _working/output/20260110_095213_nltk/

# Auto-detect and process all NLTK run directories
python -m build_tools.nltk_syllable_normaliser --source _working/output/

# Custom normalization config
python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/ \
  --min 2 --max 8

# Skip fragment cleaning (for comparison with pyphen)
python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/ \
  --no-fragment-cleaning

Output Format

The pipeline generates 5 output files in the NLTK run directory with nltk_ prefix for provenance:

nltk_syllables_raw.txt - Aggregated raw syllables (all occurrences preserved)
nltk_syllables_canonicalised.txt - After fragment cleaning + normalization
nltk_syllables_frequencies.json - Frequency intelligence (syllable → count mapping)
nltk_syllables_unique.txt - Deduplicated canonical syllable inventory
nltk_normalization_meta.txt - Detailed statistics and metadata report

In-Place Processing:

Unlike the pyphen normaliser, which writes to a separate output directory, the NLTK normaliser processes run directories in-place and writes its output files directly into the run directory:

_working/output/20260110_095213_nltk/
├── syllables/                          # Input (from NLTK extractor)
│   ├── en_US_alice.txt
│   ├── en_US_middlemarch.txt
│   └── ...
├── meta/                               # Metadata (from extractor)
│   └── ...
├── nltk_syllables_raw.txt              # Output: Aggregated
├── nltk_syllables_canonicalised.txt    # Output: After cleaning + normalization
├── nltk_syllables_frequencies.json     # Output: Frequency intelligence
├── nltk_syllables_unique.txt           # Output: Deduplicated
└── nltk_normalization_meta.txt         # Output: Statistics
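
Downstream tools can read these outputs with the standard library alone. The sketch below is illustrative: load_outputs is a hypothetical helper, not part of build_tools, and it assumes one syllable per line in nltk_syllables_unique.txt and a flat syllable-to-count JSON object in nltk_syllables_frequencies.json.

```python
import json
from pathlib import Path


def load_outputs(run_dir: Path) -> tuple[list[str], dict[str, int]]:
    """Hypothetical helper: read the unique inventory and frequency map."""
    unique_path = run_dir / "nltk_syllables_unique.txt"
    freq_path = run_dir / "nltk_syllables_frequencies.json"
    # Assumed layout: one syllable per line, blank lines ignored.
    unique = [
        line.strip()
        for line in unique_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    # Assumed layout: a flat JSON object mapping syllable -> count.
    frequencies = json.loads(freq_path.read_text(encoding="utf-8"))
    return unique, frequencies
```

Calling load_outputs(Path("_working/output/20260110_095213_nltk/")) would return the inventory and the counts for that run, provided the normaliser has already been run on it.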

File structure examples:

nltk_syllables_raw.txt (before fragment cleaning):

cha
pter
i
down
the
r
a
bbit

nltk_syllables_canonicalised.txt (after fragment cleaning + normalization):

cha
pter
idown
the
ra
bbit

nltk_syllables_frequencies.json (counts after cleaning):

{
  "cha": 1,
  "pter": 1,
  "idown": 1,
  "the": 1,
  "ra": 1,
  "bbit": 1
}

nltk_syllables_unique.txt (deduplicated, sorted):

bbit
cha
idown
pter
ra
the
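
The relationship between the canonicalised list, the frequency map, and the unique inventory can be sketched with collections.Counter. This is an illustration only: the sample data is made up, and it assumes counts are taken over the canonicalised list before deduplication. The normaliser's own implementation may differ.

```python
from collections import Counter

# Made-up canonicalised syllables, with repeats preserved.
canonical = ["ka", "ra", "ka", "mi", "ra", "ka"]

# Count every occurrence before deduplicating.
frequencies = Counter(canonical)

# The unique inventory is the sorted set of keys.
unique = sorted(frequencies)

print(dict(frequencies))  # {'ka': 3, 'ra': 2, 'mi': 1}
print(unique)             # ['ka', 'mi', 'ra']
```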

Integration Guide

The NLTK syllable normaliser is the second step of the workflow: it runs after the NLTK extractor and processes the extractor's output:

Standard workflow (using NLTK extractor + normaliser):

# Step 1: Extract syllables using NLTK/CMUDict
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --pattern "*.txt" \
  --output _working/output/

# Step 2: Normalize extracted syllables (in-place)
python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/

# Alternative: Auto-detect all NLTK run directories
python -m build_tools.nltk_syllable_normaliser \
  --source _working/output/

# Step 3: Annotate with phonetic features (source-agnostic)
python -m build_tools.syllable_feature_annotator \
  --syllables _working/output/20260110_095213_nltk/nltk_syllables_unique.txt \
  --frequencies _working/output/20260110_095213_nltk/nltk_syllables_frequencies.json

Parallel workflow (comparing both extractors):

# Extract and normalize with pyphen (typographic)
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --lang en_US \
  --output _working/output/

python -m build_tools.pyphen_syllable_normaliser \
  --source _working/output/20260110_143022_pyphen/syllables/ \
  --output _working/output/20260110_143022_pyphen/

# Extract and normalize with NLTK (phonetic)
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --output _working/output/

python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/

# Compare outputs - both use different prefixes (pyphen_* vs nltk_*)
diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
     _working/output/20260110_095213_nltk/nltk_syllables_unique.txt
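
A plain diff of two sorted inventories can be noisy when the extractors disagree often. As an alternative, a set comparison summarises the overlap. The helpers below are hypothetical (they are not part of build_tools) and assume one syllable per line in each file.

```python
from pathlib import Path


def read_inventory(path: Path) -> set[str]:
    """Read one syllable per line into a set (assumed file layout)."""
    return {
        line.strip()
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    }


def compare_inventories(pyphen_file: Path, nltk_file: Path) -> dict[str, int]:
    """Hypothetical helper: count shared and extractor-specific syllables."""
    pyphen_set = read_inventory(pyphen_file)
    nltk_set = read_inventory(nltk_file)
    return {
        "shared": len(pyphen_set & nltk_set),
        "pyphen_only": len(pyphen_set - nltk_set),
        "nltk_only": len(nltk_set - pyphen_set),
    }
```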

When to use NLTK normaliser vs pyphen normaliser:

Use NLTK normaliser when:

You used the NLTK syllable extractor
Your syllables contain many single-letter fragments
You want phonetically coherent syllables reconstructed
You’re working with NLTK’s onset/coda-based splits
You want in-place processing within run directories

Use pyphen normaliser when:

You used the pyphen syllable extractor
Your syllables are already well-formed (typographic hyphenation)
You want to aggregate multiple extraction runs
You prefer explicit output directory specification
You’re working with multi-language pyphen extractions

Fragment Cleaning:

The key differentiator of the NLTK normaliser is fragment cleaning. This step reconstructs phonetically coherent syllables from NLTK’s over-segmented output:

Original Fragments	After Fragment Cleaning	Reason
i, down	idown	Single vowel merged
r, a, bbit	ra, bbit	Single letters merged
h, o, le	ho, le	Single letters merged
cha, pter	cha, pter	Multi-char preserved

Merging Rules:

Single vowels (a, e, i, o, u, y) merge with next fragment
Single consonants merge with next fragment
Multi-character fragments remain unchanged
Processing is left-to-right, deterministic
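
The rules above can be sketched as a single left-to-right pass. This is an illustrative reimplementation of the stated rules, not the code of FragmentCleaner; merge_single_letters is a hypothetical name. Because single vowels and single consonants are both merged forward, the sketch treats any single alphabetic character the same way.

```python
def merge_single_letters(fragments: list[str]) -> list[str]:
    """Greedy left-to-right merge of single-letter fragments (sketch)."""
    cleaned: list[str] = []
    i = 0
    while i < len(fragments):
        current = fragments[i]
        is_single = len(current) == 1 and current.isalpha()
        if is_single and i + 1 < len(fragments):
            # Merge the single letter with the fragment that follows it
            # and skip past both.
            cleaned.append(current + fragments[i + 1])
            i += 2
        else:
            # Multi-character fragments, and a trailing single letter with
            # nothing after it, pass through unchanged.
            cleaned.append(current)
            i += 1
    return cleaned


print(merge_single_letters(["i", "down"]))       # ['idown']
print(merge_single_letters(["r", "a", "bbit"]))  # ['ra', 'bbit']
print(merge_single_letters(["cha", "pter"]))     # ['cha', 'pter']
```

The pass consumes two fragments whenever it merges, so an already merged result such as "ra" is never merged again with what follows.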

Processing Modes:

Specific run directory: --run-dir /path/to/run/ - Process one NLTK run
Auto-detection: --source /path/to/output/ - Find and process all NLTK runs
Skip fragment cleaning: --no-fragment-cleaning - For comparison with pyphen

Notes

In-Place Processing Philosophy:

The NLTK normaliser writes outputs directly into the run directory (not a separate location) because:

Convention: Each NLTK run is self-contained (extractor + normaliser outputs together)
Simplicity: No confusion about where normalized files live
Provenance: Run directory name (*_nltk) and file prefix (nltk_*) both indicate source

Fragment Cleaning Statistics:

Real-world performance on a multi-language corpus (21 files, 2.98M syllables):

Before cleaning: 2,977,447 syllables
After cleaning: 2,709,503 syllables
Fragments merged: 267,944 (9% reduction)
Processing time: 6.07 seconds
Unique syllables: 33,640
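
The merged count and the reduction percentage follow directly from the before and after counts, as this short check shows:

```python
before = 2_977_447  # syllables before cleaning
after = 2_709_503   # syllables after cleaning

# Each merge removes exactly one fragment from the stream.
merged = before - after
reduction_pct = 100 * merged / before

print(merged)                   # 267944
print(f"{reduction_pct:.1f}%")  # 9.0%
```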

When Fragment Cleaning Matters:

Fragment cleaning has the most impact on:

Single-letter function words (“i”, “a”)
Consonant clusters split by onset/coda (“r” + “a” → “ra”)
Single-character prefixes/suffixes
Phonetically over-segmented words

For well-formed multi-syllable words, fragment cleaning has minimal effect.

Comparing with Pyphen Normaliser:

Feature	Pyphen Normaliser	NLTK Normaliser
Input Source	Any directory	NLTK run directories
Preprocessing	None	Fragment cleaning
Output Location	User-specified directory	In-place (run directory)
Output Prefix	pyphen_*	nltk_*
Run Detection	Manual file discovery	Auto-detect *_nltk dirs
Normalization Steps	3 (aggregate, canon, freq)	4 (clean, aggregate, …)
Typical Use Case	Pyphen extractor output	NLTK extractor output

Auto-Detection Criteria:

The auto-detection feature (--source) finds NLTK run directories by:

Scanning for directories ending with _nltk
Verifying existence of syllables/ subdirectory
Sorting chronologically by directory name

This allows batch processing:

# Process all NLTK runs at once
python -m build_tools.nltk_syllable_normaliser --source _working/output/

# Output:
# Found 3 NLTK run directories:
#   - 20260110_095213_nltk
#   - 20260110_143022_nltk
#   - 20260110_153045_nltk
# Processing...
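
The detection criteria listed above can be sketched with pathlib. find_nltk_run_dirs is a hypothetical name used for illustration; the normaliser's actual detection code may differ in details such as how symlinks or nested directories are handled.

```python
from pathlib import Path


def find_nltk_run_dirs(source: Path) -> list[Path]:
    """Hypothetical sketch of the auto-detection criteria."""
    candidates = [
        path
        for path in source.iterdir()
        # Criterion 1: a directory whose name ends with _nltk.
        if path.is_dir() and path.name.endswith("_nltk")
        # Criterion 2: it contains a syllables/ subdirectory.
        and (path / "syllables").is_dir()
    ]
    # Criterion 3: timestamped names sort chronologically as plain strings.
    return sorted(candidates, key=lambda path: path.name)
```

Sorting by name works here because the run directories start with a fixed-width YYYYMMDD_HHMMSS timestamp.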

Deterministic Processing:

The NLTK normaliser is fully deterministic:

Same input → same output (always)
Fragment cleaning uses a left-to-right greedy algorithm
Unicode normalization is deterministic (NFKD)
Frequency analysis preserves insertion order (Python 3.7+)

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

Output File Prefixes:

Both normalisers now use prefixed output files:

Pyphen normaliser: pyphen_* prefix
NLTK normaliser: nltk_* prefix

This ensures clear provenance when files are shared individually.

API Reference

class build_tools.nltk_syllable_normaliser.FragmentCleaner

Bases: object

Clean NLTK-produced syllable fragments by merging isolated letters.

The NLTK syllable extractor uses phonetically-guided splitting with onset/coda principles, which can sometimes over-segment words into isolated single-letter fragments. This cleaner reconstructs more coherent syllables by applying mechanical merging rules.

Merging Rules:

Single vowels (a, e, i, o, u, y) merge with next fragment
Single consonants merge with next fragment
Multi-character fragments remain unchanged

Example

>>> cleaner = FragmentCleaner()
>>> fragments = ["i", "down", "the", "r", "abbit"]
>>> cleaner.clean_fragments(fragments)
['idown', 'the', 'rabbit']

Note

This is NLTK-specific preprocessing. Pyphen output doesn’t need fragment cleaning as it uses typographic hyphenation rules.

clean_fragments(fragments)

Perform mechanical cleanup by merging single-letter fragments.

Applies two merging rules in sequence:

1. Merge isolated single vowels with the following fragment
2. Merge isolated single consonants with the following fragment

This reconstructs more phonetically coherent syllables from NLTK’s onset/coda-based over-segmentation.

Parameters: fragments (list[str]) – List of syllable fragments (possibly over-segmented).
Return type: list[str]
Returns: List of cleaned fragments with single letters merged.

Example

>>> cleaner = FragmentCleaner()
>>> # Example 1: Single vowel merging
>>> cleaner.clean_fragments(["i", "down"])
['idown']
>>>
>>> # Example 2: Single consonant merging
>>> cleaner.clean_fragments(["r", "abbit"])
['rabbit']
>>>
>>> # Example 3: Mixed fragments
>>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
['cha', 'pter', 'idown', 'the', 'ra']
>>>
>>> # Example 4: Preserve multi-character fragments
>>> cleaner.clean_fragments(["hel", "lo", "world"])
['hel', 'lo', 'world']

Note

Fragments are processed left-to-right
Single letters merge with next fragment (if available)
Last fragment never merges (no next fragment available)
Empty input returns empty output

clean_fragments_from_file(input_path, output_path)

Clean fragments from input file and write to output file.

Convenience method for file-based processing. Reads one fragment per line from input file, applies cleaning, and writes cleaned fragments to output file (one per line).

Parameters:

input_path (str) – Path to input file (one fragment per line).
output_path (str) – Path to output file for cleaned fragments.

Return type:

tuple[int, int]

Returns:

Tuple of (original_count, cleaned_count): the number of fragments before and after cleaning. The difference between the two is the number of merges performed.

Raises:

FileNotFoundError – If input file doesn’t exist.
IOError – If there’s an error reading or writing files.

Example

>>> # input.txt contains:
>>> # i
>>> # down
>>> # the
>>> # r
>>> # abbit
>>>
>>> cleaner = FragmentCleaner()
>>> original, cleaned = cleaner.clean_fragments_from_file(
...     "input.txt", "output.txt"
... )
>>> print(f"Cleaned {original} → {cleaned} fragments")
Cleaned 5 → 3 fragments
>>>
>>> # output.txt now contains:
>>> # idown
>>> # the
>>> # rabbit

static is_single_letter(token)

Check if token is a single alphabetic character.

Parameters: token (str) – String to check.
Return type: bool
Returns: True if token is exactly one alphabetic character, False otherwise.

Example

>>> FragmentCleaner.is_single_letter("a")
True
>>> FragmentCleaner.is_single_letter("ab")
False
>>> FragmentCleaner.is_single_letter("1")
False

static is_single_vowel(token)

Check if token is a single vowel character.

Parameters: token (str) – String to check.
Return type: bool
Returns: True if token is a single vowel (a, e, i, o, u, y), False otherwise.

Example

>>> FragmentCleaner.is_single_vowel("a")
True
>>> FragmentCleaner.is_single_vowel("b")
False
>>> FragmentCleaner.is_single_vowel("ae")
False

class build_tools.nltk_syllable_normaliser.NormalizationConfig(min_length=2, max_length=20, allowed_charset='abcdefghijklmnopqrstuvwxyz', unicode_form='NFKD')

Bases: object

Configuration for syllable normalization process.

This dataclass stores all parameters that control how syllables are normalized to canonical form.

min_length: Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2

max_length: Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20

allowed_charset: String of allowed characters. Only syllables containing these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”

unicode_form: Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)

Example

>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'
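
How these parameters might act on a single syllable can be sketched as follows. This is an illustration under stated assumptions, not the normaliser's implementation: canonicalise is a hypothetical function, the lowercasing step is an assumption (the documentation does not state it), and the order of the checks is a guess.

```python
import unicodedata
from typing import Optional


def canonicalise(
    syllable: str,
    min_length: int = 2,
    max_length: int = 20,
    allowed_charset: str = "abcdefghijklmnopqrstuvwxyz",
    unicode_form: str = "NFKD",
) -> Optional[str]:
    """Return the canonical form, or None if the syllable is rejected."""
    text = unicodedata.normalize(unicode_form, syllable.strip())
    # Strip diacritics: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.lower()  # assumption: not stated in the documentation
    if not text:
        return None  # became empty after normalization
    if any(ch not in allowed_charset for ch in text):
        return None  # contains a character outside the allowed set
    if not min_length <= len(text) <= max_length:
        return None  # violates the length constraints
    return text


print(canonicalise("caf\u00e9"))  # cafe  (input is "cafe" with an acute accent)
print(canonicalise("x"))          # None  (shorter than min_length)
print(canonicalise("na1"))        # None  ("1" is outside the charset)
```

The three rejection branches correspond to the rejected_empty, rejected_charset, and rejected_length counters described for NormalizationStats.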

__post_init__(): Validate configuration parameters after initialization.

allowed_charset: str = 'abcdefghijklmnopqrstuvwxyz'

max_length: int = 20

min_length: int = 2

unicode_form: str = 'NFKD'

class build_tools.nltk_syllable_normaliser.NormalizationResult(config, stats, frequencies, unique_syllables, input_files, output_dir, timestamp=<factory>, raw_file=PosixPath('syllables_raw.txt'), canonical_file=PosixPath('syllables_canonicalised.txt'), frequency_file=PosixPath('syllables_frequencies.json'), unique_file=PosixPath('syllables_unique.txt'), meta_file=PosixPath('normalization_meta.txt'))

Bases: object

Complete result from the syllable normalization pipeline.

This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.

config: Configuration used for normalization

stats: Statistics from the processing

frequencies: Dictionary mapping canonical syllable to frequency count

unique_syllables: Sorted list of unique canonical syllables

input_files: List of input file paths that were processed

output_dir: Directory where output files were saved

timestamp: When the normalization was performed

raw_file: Path to raw aggregated file (syllables_raw.txt)

canonical_file: Path to canonicalized file (syllables_canonicalised.txt)

frequency_file: Path to frequency JSON (syllables_frequencies.json)

unique_file: Path to unique syllables (syllables_unique.txt)

meta_file: Path to metadata report (normalization_meta.txt)

Example

>>> from datetime import datetime
>>> from pathlib import Path
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000

canonical_file: Path = PosixPath('syllables_canonicalised.txt')

config: NormalizationConfig

format_metadata()

Format normalization metadata as a human-readable string.

Creates a detailed report including statistics, rejection breakdown, and top frequencies.

Return type: str
Returns: Multi-line string containing all normalization metadata formatted for display or file output.

Example

>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp:           2026-01-05 17:30:22
...

frequencies: dict[str, int]

frequency_file: Path = PosixPath('syllables_frequencies.json')

input_files: list[Path]

meta_file: Path = PosixPath('normalization_meta.txt')

output_dir: Path

raw_file: Path = PosixPath('syllables_raw.txt')

stats: NormalizationStats

timestamp: datetime

unique_file: Path = PosixPath('syllables_unique.txt')

unique_syllables: list[str]

class build_tools.nltk_syllable_normaliser.NormalizationStats(raw_count=0, after_canonicalization=0, rejected_charset=0, rejected_length=0, rejected_empty=0, unique_canonical=0, processing_time=0.0)

Bases: object

Statistics from the syllable normalization process.

This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.

raw_count: Total number of syllables in raw input (before normalization)

after_canonicalization: Number of syllables after normalization

rejected_charset: Syllables rejected due to invalid characters

rejected_length: Syllables rejected due to length constraints

rejected_empty: Syllables that became empty after normalization

unique_canonical: Number of unique canonical syllables

processing_time: Total processing time in seconds

Example

>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0

after_canonicalization: int = 0

processing_time: float = 0.0

raw_count: int = 0

rejected_charset: int = 0

rejected_empty: int = 0

rejected_length: int = 0

property rejection_rate: float

Calculate rejection rate as percentage of raw count.

property total_rejected: int

Calculate total number of rejected syllables.

unique_canonical: int = 0
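
The two properties can be reproduced from the counters. The sketch below is a simplified stand-in (StatsSketch is a hypothetical name). It assumes total_rejected is the sum of the three rejection counters and that rejection_rate is 0.0 when raw_count is zero. The first assumption matches the class example (50 rejected out of 1000 gives 5.0), but neither is confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Simplified stand-in for NormalizationStats (illustration only)."""

    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        # Assumed: the sum of all three rejection counters.
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Assumed: guard against division by zero on empty input.
        if self.raw_count == 0:
            return 0.0
        return 100 * self.total_rejected / self.raw_count


stats = StatsSketch(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.total_rejected)  # 50
print(stats.rejection_rate)  # 5.0
```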

build_tools.nltk_syllable_normaliser.create_argument_parser()

Create and return the argument parser for NLTK syllable normaliser.

Return type: ArgumentParser
Returns: Configured ArgumentParser ready to parse command-line arguments.

build_tools.nltk_syllable_normaliser.main(args=None)

Main entry point for CLI.

Parameters: args (list[str] | None) – Command-line arguments (for testing). If None, uses sys.argv.
Return type: int
Returns: Exit code (0 for success, 1 for error).

build_tools.nltk_syllable_normaliser.run_full_pipeline(run_directory, config, verbose=False, skip_fragment_cleaning=False)

Run complete NLTK normalization pipeline with in-place processing.

Executes the full NLTK-specific workflow:

1. Aggregate syllables from run_directory/syllables/*.txt
2. Fragment cleaning (NLTK-specific preprocessing)
3. Canonicalize syllables (Unicode normalization, etc.)
4. Frequency analysis
5. Write 5 output files to run_directory (in-place)

Parameters:

run_directory (Path) – NLTK run directory (e.g., _working/output/20260110_095213_nltk/).
config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.
verbose (bool) – If True, print detailed progress information.
skip_fragment_cleaning (bool) – If True, skip fragment cleaning step (for comparison).

Return type:

NormalizationResult

Returns:

NormalizationResult containing all outputs, statistics, and file paths.

Raises:

FileNotFoundError – If run_directory or syllables/ subdirectory doesn’t exist.
ValueError – If run_directory is not a directory.

Example

>>> from pathlib import Path
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> run_dir = Path("_working/output/20260110_095213_nltk/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=config,
...     verbose=True
... )
>>> result.stats.raw_count
15234
>>> result.stats.unique_canonical
4821