build_tools.nltk_syllable_normaliser.cli
Command-line interface for the NLTK syllable normalization pipeline.
This module provides the main CLI entry point for the nltk_syllable_normaliser tool, which processes NLTK extractor output through a fragment-cleaning and normalization pipeline.
Functions
Detect NLTK run directories within source directory.
Run complete NLTK normalization pipeline with in-place processing.
Create and return the argument parser for NLTK syllable normaliser.
Parse command-line arguments.
Main entry point for CLI.
Module Contents
- build_tools.nltk_syllable_normaliser.cli.detect_nltk_run_directories(source_dir)[source]
Detect NLTK run directories within source directory.
Searches for directories matching the pattern YYYYMMDD_HHMMSS_nltk/ which contain a syllables/ subdirectory.
- Parameters:
source_dir (pathlib.Path) – Directory to search for NLTK run directories.
- Returns:
List of Path objects pointing to NLTK run directories, sorted by directory name (chronological order).
- Return type:
Example
>>> from pathlib import Path
>>> source = Path("_working/output/")
>>> runs = detect_nltk_run_directories(source)
>>> for run in runs:
...     print(run.name)
20260110_095213_nltk
20260110_143022_nltk
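The matching rule described above (a YYYYMMDD_HHMMSS_nltk name plus a syllables/ subdirectory) can be sketched roughly as follows. This is an illustration only, not the module's actual implementation: the name `find_nltk_runs`, the regular expression, and the choice to return an empty list for a missing source directory are all assumptions.

```python
import re
from pathlib import Path

# Assumed pattern: eight date digits, six time digits, then the "_nltk" suffix.
RUN_DIR_PATTERN = re.compile(r"^\d{8}_\d{6}_nltk$")


def find_nltk_runs(source_dir: Path) -> list[Path]:
    """Hypothetical stand-in for detect_nltk_run_directories."""
    if not source_dir.is_dir():
        # Assumption: a missing source directory yields no runs.
        return []
    runs = [
        child
        for child in source_dir.iterdir()
        if child.is_dir()
        and RUN_DIR_PATTERN.match(child.name)
        and (child / "syllables").is_dir()
    ]
    # Timestamped names sort chronologically when sorted as plain strings.
    return sorted(runs, key=lambda path: path.name)
```

Because the timestamp is zero-padded and ordered from year down to second, sorting by directory name is enough to get chronological order without parsing dates.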
- build_tools.nltk_syllable_normaliser.cli.run_full_pipeline(run_directory, config, verbose=False, skip_fragment_cleaning=False)[source]
Run complete NLTK normalization pipeline with in-place processing.
Executes the full NLTK-specific workflow:
1. Aggregate syllables from run_directory/syllables/*.txt
2. Fragment cleaning (NLTK-specific preprocessing)
3. Canonicalize syllables (Unicode normalization, etc.)
4. Frequency analysis
5. Write 5 output files to run_directory (in-place)
- Parameters:
run_directory (pathlib.Path) – NLTK run directory (e.g., _working/output/20260110_095213_nltk/).
config (build_tools.pyphen_syllable_normaliser.NormalizationConfig) – NormalizationConfig specifying normalization parameters.
verbose (bool) – If True, print detailed progress information.
skip_fragment_cleaning (bool) – If True, skip fragment cleaning step (for comparison).
- Returns:
NormalizationResult containing all outputs, statistics, and file paths.
- Raises:
FileNotFoundError – If run_directory or syllables/ subdirectory doesn’t exist.
ValueError – If run_directory is not a directory.
- Return type:
Example
>>> from pathlib import Path
>>> from build_tools.pyphen_syllable_normaliser import NormalizationConfig
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> run_dir = Path("_working/output/20260110_095213_nltk/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=config,
...     verbose=True
... )
>>> result.stats.raw_count
15234
>>> result.stats.unique_canonical
4821
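To make the workflow steps of run_full_pipeline more concrete, here is a simplified in-memory sketch of steps 1 to 4. It is not the tool's real code. Every function name below is hypothetical, the input files are assumed to hold one syllable per line, the canonicalization rules (NFKC plus lowercasing plus a length filter) are an assumption, and fragment cleaning is a pass-through placeholder because its rules are not described here. Step 5 (writing output files) is omitted.

```python
import unicodedata
from collections import Counter
from pathlib import Path


def aggregate_syllables(run_directory: Path) -> list[str]:
    """Step 1: collect non-empty lines from syllables/*.txt (assumed format)."""
    syllables_dir = run_directory / "syllables"
    if not syllables_dir.is_dir():
        raise FileNotFoundError(f"Missing syllables directory: {syllables_dir}")
    collected = []
    for txt_file in sorted(syllables_dir.glob("*.txt")):
        for line in txt_file.read_text(encoding="utf-8").splitlines():
            if line.strip():
                collected.append(line.strip())
    return collected


def clean_fragments(syllables: list[str]) -> list[str]:
    """Step 2 placeholder: the real NLTK-specific rules are not shown here."""
    return list(syllables)


def canonicalize(syllables: list[str], min_length: int = 2, max_length: int = 8) -> list[str]:
    """Step 3 (assumed rules): NFKC-normalize, lowercase, filter by length."""
    result = []
    for syllable in syllables:
        canonical = unicodedata.normalize("NFKC", syllable).lower()
        if min_length <= len(canonical) <= max_length:
            result.append(canonical)
    return result


def sketch_pipeline(run_directory: Path, skip_fragment_cleaning: bool = False) -> dict:
    """Run steps 1-4 in memory and return simple summary statistics."""
    raw = aggregate_syllables(run_directory)
    cleaned = raw if skip_fragment_cleaning else clean_fragments(raw)
    canonical = canonicalize(cleaned)
    frequencies = Counter(canonical)  # Step 4: frequency analysis
    return {
        "raw_count": len(raw),
        "unique_canonical": len(frequencies),
        "frequencies": frequencies,
    }
```

The `skip_fragment_cleaning` flag mirrors the documented parameter: bypassing step 2 lets the two outputs be compared to see what cleaning changes.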
- build_tools.nltk_syllable_normaliser.cli.create_argument_parser()[source]
Create and return the argument parser for NLTK syllable normaliser.
- Returns:
Configured ArgumentParser ready to parse command-line arguments.
- Return type: