Pyphen Syllable Normaliser

Overview

Syllable Normaliser - 3-Step Normalization Pipeline

The syllable normaliser transforms raw syllable files into canonical form through a 3-step pipeline, creating the authoritative syllable inventory for pattern development. This is a build-time tool only - not used during runtime name generation.

3-Step Normalization Pipeline:

Aggregation - Combine multiple input files while preserving all occurrences
Canonicalization - Unicode normalization, diacritic stripping, charset validation
Frequency Analysis - Count occurrences and generate frequency intelligence

Features:

Unicode normalization (NFKD, NFC, NFD, NFKC)
Diacritic stripping using unicodedata
Configurable charset and length constraints
Frequency intelligence capture (pre-deduplication counts)
Deterministic processing (same input = same output)
Comprehensive metadata reporting
5 output files for complete analysis

The pipeline produces 5 output files (with pyphen_ prefix for provenance):

pyphen_syllables_raw.txt: Aggregated raw syllables (all occurrences preserved)
pyphen_syllables_canonicalised.txt: Normalized canonical syllables
pyphen_syllables_frequencies.json: Frequency intelligence (syllable → count)
pyphen_syllables_unique.txt: Deduplicated canonical syllable inventory
pyphen_normalization_meta.txt: Detailed statistics and metadata report

Usage:

>>> from pathlib import Path
>>> from build_tools.pyphen_syllable_normaliser import (
...     NormalizationConfig,
...     run_full_pipeline
... )
>>>
>>> # Create configuration
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>>
>>> # Run pipeline on a pyphen run directory
>>> result = run_full_pipeline(
...     run_directory=Path("_working/output/20260110_143022_pyphen/"),
...     config=config,
...     verbose=True
... )
>>>
>>> # Access results
>>> print(f"Processed {result.stats.raw_count:,} raw syllables")
>>> print(f"Canonical: {result.stats.after_canonicalization:,}")
>>> print(f"Unique: {result.stats.unique_canonical:,}")

CLI Usage:

# Process specific pyphen run directory (in-place)
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_143022_pyphen/

# Auto-detect all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser --source _working/output/

Command-Line Interface

Pyphen Syllable Normaliser - 3-step normalization pipeline

usage: python -m build_tools.pyphen_syllable_normaliser [-h]
                                                        (--run-dir RUN_DIR | --source SOURCE)
                                                        [--min MIN]
                                                        [--max MAX]
                                                        [--charset CHARSET]
                                                        [--unicode-form {NFC,NFD,NFKC,NFKD}]
                                                        [--verbose] [--quiet]

Named Arguments

--run-dir

Specific pyphen run directory to process (e.g., _working/output/20260110_143022_pyphen/)

--source

Directory to scan for pyphen run directories (auto-detects *_pyphen/ directories)

--min

Minimum syllable length (characters).

Default: 2

--max

Maximum syllable length (characters).

Default: 20

--charset

Allowed character set for syllables.

Default: 'abcdefghijklmnopqrstuvwxyz' (a-z)

--unicode-form

Possible choices: NFC, NFD, NFKC, NFKD

Unicode normalization form.

Default: 'NFKD'

--verbose, -v

Enable verbose output with detailed progress information

Default: False

--quiet, -q

Suppress all output except errors

Default: False

Examples:

# Process specific pyphen run directory
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_143022_pyphen/

# Auto-detect and process all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser --source _working/output/

# Custom normalization config
python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_143022_pyphen/ \
  --min 2 --max 8

Output Format

The pipeline generates 5 output files in the pyphen run directory with pyphen_ prefix for provenance:

pyphen_syllables_raw.txt - Aggregated raw syllables (all occurrences preserved)
pyphen_syllables_canonicalised.txt - Normalized canonical syllables
pyphen_syllables_frequencies.json - Frequency intelligence (syllable → count mapping)
pyphen_syllables_unique.txt - Deduplicated canonical syllable inventory
pyphen_normalization_meta.txt - Detailed statistics and metadata report

In-Place Processing:

Unlike older versions, which wrote to a separate output directory, the pyphen normaliser now processes run directories in place, writing output files directly into the run directory:

_working/output/20260110_143022_pyphen/
├── syllables/                          # Input (from pyphen extractor)
│   ├── en_US_alice.txt
│   ├── en_US_middlemarch.txt
│   └── ...
├── meta/                               # Metadata (from extractor)
│   └── ...
├── pyphen_syllables_raw.txt            # Output: Aggregated
├── pyphen_syllables_canonicalised.txt  # Output: Normalized
├── pyphen_syllables_frequencies.json   # Output: Frequency intelligence
├── pyphen_syllables_unique.txt         # Output: Deduplicated
└── pyphen_normalization_meta.txt       # Output: Statistics

File structure examples:

pyphen_syllables_raw.txt (preserves all occurrences):

café
Café
hello
hello
world

pyphen_syllables_canonicalised.txt (normalized, duplicates preserved):

cafe
cafe
hello
hello
world

pyphen_syllables_frequencies.json (counts before deduplication):

{
  "cafe": 2,
  "hello": 2,
  "world": 1
}

pyphen_syllables_unique.txt (deduplicated, sorted):

cafe
hello
world
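The relationship between the derived files shown above can be checked in a few lines of Python. This is a minimal sketch that rebuilds the example data in memory rather than reading a real run directory; the helper name check_outputs is hypothetical and not part of the package.

```python
from collections import Counter


def check_outputs(canonical: list[str], frequencies: dict[str, int], unique: list[str]) -> bool:
    """Return True if the three derived outputs agree with each other."""
    # Frequencies must equal the per-syllable counts of the canonical list.
    counts_match = dict(Counter(canonical)) == frequencies
    # The unique inventory must be the sorted set of canonical syllables.
    unique_match = unique == sorted(set(canonical))
    return counts_match and unique_match


# Data copied from the file structure examples.
canonical = ["cafe", "cafe", "hello", "hello", "world"]
frequencies = {"cafe": 2, "hello": 2, "world": 1}
unique = ["cafe", "hello", "world"]

print(check_outputs(canonical, frequencies, unique))
```

A check of this kind may be useful as a sanity test after a run, since the frequency file and the unique list are both derived from the canonicalised file.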

Integration Guide

The pyphen syllable normaliser is the second pipeline step, run after pyphen extraction. It processes syllables within their run directories:

Standard workflow:

# Step 1: Extract syllables using pyphen
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --pattern "*.txt" \
  --output _working/output/ \
  --lang en_US

# Step 2: Normalize extracted syllables (in-place)
python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_143022_pyphen/

# Alternative: Auto-detect all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser \
  --source _working/output/

# Step 3: Annotate with phonetic features (source-agnostic)
python -m build_tools.syllable_feature_annotator \
  --syllables _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
  --frequencies _working/output/20260110_143022_pyphen/pyphen_syllables_frequencies.json

Comparing with NLTK normaliser:

# Pyphen pipeline (typographic hyphenation)
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --lang en_US \
  --output _working/output/

python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_143022_pyphen/

# NLTK pipeline (phonetic splitting)
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --output _working/output/

python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/

# Compare outputs - the two pipelines use different prefixes (pyphen_* vs nltk_*)
diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
     _working/output/20260110_095213_nltk/nltk_syllables_unique.txt

When to use pyphen normaliser vs NLTK normaliser:

Use pyphen normaliser when:

You used the pyphen syllable extractor
Your syllables are well-formed from typographic hyphenation
You want multi-language support (40+ languages)
You want in-place processing within run directories
You’re working with pyphen’s dictionary-based splits

Use NLTK normaliser when:

You used the NLTK syllable extractor
Your syllables contain many single-letter fragments
You want phonetically coherent syllables reconstructed
You’re working with NLTK’s onset/coda-based splits (English only)
You want fragment cleaning preprocessing

3-Step Normalization Pipeline:

Step 1 - Aggregation:

Combines all input files into pyphen_syllables_raw.txt
Preserves ALL occurrences (no deduplication)
Maintains raw counts for frequency analysis
Empty lines filtered during file reading

Step 2 - Canonicalization:

Unicode normalization (NFKD - compatibility decomposition)
Strip diacritics: café → cafe, résumé → resume
Lowercase conversion
Trim whitespace
Charset validation (reject invalid characters)
Length constraint enforcement (default: min=2, max=20)
Outputs to pyphen_syllables_canonicalised.txt

Step 3 - Frequency Analysis:

Count occurrences of each canonical syllable
Generate frequency rankings and percentages
Create deduplicated unique list (alphabetically sorted)
Outputs:

pyphen_syllables_frequencies.json - Frequency counts before deduplication
pyphen_syllables_unique.txt - Authoritative syllable inventory
pyphen_normalization_meta.txt - Comprehensive statistics report
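The three steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the function canonicalise and the constant ALLOWED are hypothetical names, and the order of operations follows the Step 2 list with the default constraints.

```python
import unicodedata
from collections import Counter
from typing import Optional

ALLOWED = set("abcdefghijklmnopqrstuvwxyz")  # default charset (a-z)


def canonicalise(raw: str, min_len: int = 2, max_len: int = 20) -> Optional[str]:
    """Return the canonical form of one syllable, or None if it is rejected."""
    decomposed = unicodedata.normalize("NFKD", raw)
    # Drop the combining marks produced by decomposition (e.g. U+0301).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    candidate = stripped.lower().strip()
    if not candidate or not set(candidate) <= ALLOWED:
        return None  # empty or contains characters outside the charset
    if not min_len <= len(candidate) <= max_len:
        return None  # outside the length constraints
    return candidate


# Step 1: aggregation keeps every occurrence.
raw = ["caf\u00e9", "Caf\u00e9", "hello", "hello", "world", "x", "abc123"]
# Step 2: canonicalisation, dropping rejected syllables.
canonical = [c for c in map(canonicalise, raw) if c is not None]
# Step 3: frequency counts and the sorted unique inventory.
frequencies = Counter(canonical)
unique = sorted(frequencies)

print(canonical)
print(dict(frequencies))
print(unique)
```

Here "x" is rejected for length and "abc123" for charset, so five of the seven raw syllables survive canonicalisation.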

Pipeline characteristics:

Deterministic: same input always produces same output
Fast: processes thousands of syllables per second
Configurable: adjust length constraints, charset, unicode form
Comprehensive: detailed rejection statistics and metadata

Notes

Frequency Intelligence:

The frequency data captures how often each canonical syllable occurs before deduplication. This intelligence is essential for understanding natural language patterns and can inform weighted name generation:

{
  "ka": 187,
  "ra": 162,
  "mi": 145,
  "ta": 98
}

This shows “ka” appears 187 times in the canonical syllables, indicating it’s a high-frequency pattern that may be desirable for common or natural-sounding names.
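One way frequency data could inform weighted name generation is frequency-proportional sampling. The sketch below is only an illustration: sample_syllables is a hypothetical helper, and nothing here describes how the runtime name generator actually consumes the file.

```python
import random

frequencies = {"ka": 187, "ra": 162, "mi": 145, "ta": 98}


def sample_syllables(freqs: dict[str, int], k: int, seed: int = 0) -> list[str]:
    """Draw k syllables with probability proportional to their frequency."""
    rng = random.Random(seed)  # seeded so repeated runs give the same draw
    syllables = sorted(freqs)  # fixed order keeps the draw reproducible
    weights = [freqs[s] for s in syllables]
    return rng.choices(syllables, weights=weights, k=k)


name = "".join(sample_syllables(frequencies, k=3))
print(name)
```

With these weights, "ka" is drawn roughly twice as often as "ta" over many samples.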

Normalization Behavior:

All syllable processing is case-insensitive (output is lowercase)
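Unicode normalization form NFKD (the default) applies compatibility decomposition: accented characters are split into a base character plus combining marks, which lets the diacritic-stripping step remove the marks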
Empty lines are filtered during aggregation (not counted as rejections)
Frequency counts capture occurrences BEFORE deduplication
Invalid syllables (wrong charset, wrong length) are rejected and counted in metadata

Default Constraints:

Min length: 2 characters
Max length: 20 characters
Allowed charset: a-z (lowercase ASCII letters)
Unicode form: NFKD (compatibility decomposition)

Use Cases:

Combining syllables from multiple language extractions
Normalizing variations in Unicode representation (precomposed vs. decomposed characters, compatibility forms); input files must be valid UTF-8
Filtering syllables by length for specific pattern requirements
Building frequency-aware name generation systems

In-Place Processing Philosophy:

The pyphen normaliser writes outputs directly into the run directory (not a separate location) because:

Convention: Each pyphen run is self-contained (extractor + normaliser outputs together)
Simplicity: No confusion about where normalized files live
Provenance: Run directory name (*_pyphen) and file prefix (pyphen_*) both indicate source

Processing Modes:

Specific run directory: --run-dir /path/to/run/ - Process one pyphen run
Auto-detection: --source /path/to/output/ - Find and process all pyphen runs

Auto-Detection Criteria:

The auto-detection feature (--source) finds pyphen run directories by:

Scanning for directories ending with _pyphen
Verifying existence of syllables/ subdirectory
Sorting chronologically by directory name

This allows batch processing:

# Process all pyphen runs at once
python -m build_tools.pyphen_syllable_normaliser --source _working/output/

# Output:
# Found 3 pyphen run directories:
#   - 20260110_143022_pyphen
#   - 20260110_153045_pyphen
#   - 20260110_163010_pyphen
# Processing...
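The three detection criteria listed above can be sketched as follows. This is a minimal illustration, assuming a flat scan of the source directory; find_pyphen_runs is a hypothetical name and the package's own detection code may differ.

```python
from pathlib import Path
import tempfile


def find_pyphen_runs(source: Path) -> list[Path]:
    """Return *_pyphen run directories under source that contain syllables/."""
    runs = [
        d for d in source.iterdir()
        if d.is_dir() and d.name.endswith("_pyphen") and (d / "syllables").is_dir()
    ]
    # Timestamp-prefixed names sort chronologically when sorted as strings.
    return sorted(runs, key=lambda d: d.name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "20260110_153045_pyphen" / "syllables").mkdir(parents=True)
    (root / "20260110_143022_pyphen" / "syllables").mkdir(parents=True)
    (root / "20260110_095213_nltk" / "syllables").mkdir(parents=True)
    (root / "20260110_163010_pyphen").mkdir()  # no syllables/ subdirectory
    found = [d.name for d in find_pyphen_runs(root)]

print(found)
```

In this sketch the NLTK run and the directory without a syllables/ subdirectory are both skipped.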

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

class build_tools.pyphen_syllable_normaliser.FileAggregator[source]

Bases: object

Aggregates syllables from multiple input files.

This class handles the first step of the normalization pipeline: combining syllables from multiple .txt files into a single raw aggregated file. All occurrences are preserved (no deduplication), maintaining the original frequency distribution from the input files.

Example

>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> input_files = [Path("file1.txt"), Path("file2.txt")]
>>> syllables = aggregator.aggregate_files(input_files)
>>> len(syllables)  # Total from both files
450
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))

aggregate_files(input_files)[source]

Aggregate syllables from multiple input files.

Reads all syllables from the provided input files and combines them into a single list. Each line in each input file is treated as one syllable. Empty lines are skipped. All occurrences are preserved (no deduplication).

Parameters:

input_files (list[Path]) – List of Path objects pointing to input .txt files. Each file should contain one syllable per line.

Return type:

list[str]

Returns:

List of all syllables from all input files, preserving duplicates and maintaining the original order (file by file).

Raises:

FileNotFoundError – If any input file does not exist.
PermissionError – If any input file cannot be read.
UnicodeDecodeError – If any input file contains invalid UTF-8.

Example

>>> aggregator = FileAggregator()
>>> files = [Path("corpus1.txt"), Path("corpus2.txt")]
>>> syllables = aggregator.aggregate_files(files)
>>> syllables[:3]
['hello', 'world', 'test']

Note

Files are processed in the order provided. If deterministic ordering is required, ensure input_files is sorted before calling.

read_syllables_from_file(file_path)[source]

Read syllables from a single file.

Reads a file line by line, treating each line as one syllable. Empty lines (whitespace only) are skipped. No normalization or transformation is applied - syllables are preserved exactly as they appear in the file.

Parameters:

file_path (Path) – Path to the input file to read.

Return type:

list[str]

Returns:

List of syllable strings from the file, one per non-empty line.

Raises:

FileNotFoundError – If the file does not exist.
PermissionError – If the file cannot be read.
UnicodeDecodeError – If the file contains invalid UTF-8.

Example

>>> aggregator = FileAggregator()
>>> syllables = aggregator.read_syllables_from_file(Path("input.txt"))
>>> syllables
['ka', 'ra', 'mi', 'ka', 'ta']

Note

Leading and trailing whitespace is stripped from each line, but the syllable content itself is not modified. This allows files with varying whitespace formatting to be processed consistently.
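The reading behaviour described above (one syllable per line, surrounding whitespace stripped, whitespace-only lines skipped) might look like the following standalone function. It is a sketch, not the method's source; read_syllables is a hypothetical name.

```python
from pathlib import Path
import tempfile


def read_syllables(file_path: Path) -> list[str]:
    """Read one syllable per line, skipping whitespace-only lines."""
    with file_path.open(encoding="utf-8") as handle:
        # strip() removes surrounding whitespace, including the newline.
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "input.txt"
    sample.write_text("ka\n  ra  \n\n   \nmi\nka\n", encoding="utf-8")
    syllables = read_syllables(sample)

print(syllables)
```

Duplicates such as the second "ka" are kept, matching the aggregation step's rule that all occurrences are preserved.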

save_raw_syllables(syllables, output_path)[source]

Save raw aggregated syllables to file.

Writes syllables to the output file, one per line, in the order provided. This creates the syllables_raw.txt file for the pipeline. All syllables are written exactly as provided (no normalization).

Parameters:

syllables (list[str]) – List of syllable strings to write.
output_path (Path) – Path where the raw syllables file should be saved.

Raises:

PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).

Return type:

None

Example

>>> aggregator = FileAggregator()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta']
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
# File contains:
# ka
# ra
# mi
# ka
# ta

Note

This method creates the output file if it doesn’t exist and overwrites it if it does. The output directory must already exist.

class build_tools.pyphen_syllable_normaliser.FrequencyAnalyzer[source]

Bases: object

Analyzes frequency distribution of canonical syllables.

This class handles the intelligence capture phase of the normalization pipeline. It counts occurrences of each canonical syllable, creates frequency rankings, and generates output files for downstream analysis and feature annotation.

Example

>>> from pathlib import Path
>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 1, 'mi': 1, 'ta': 1}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']

calculate_frequencies(syllables)[source]

Calculate frequency counts for canonical syllables.

Counts how many times each unique syllable appears in the input list. This captures the natural frequency distribution from the source corpus before deduplication.

Parameters:

syllables (list[str]) – List of canonical syllables (may contain duplicates).

Return type:

dict[str, int]

Returns:

Dictionary mapping each unique syllable to its occurrence count.

Example

>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 2, 'mi': 1, 'ta': 1}
>>> sum(frequencies.values())  # Total syllable count
7

Note

The returned dictionary is not sorted. Use create_frequency_entries() to generate sorted frequency rankings.

create_frequency_entries(frequencies)[source]

Create ranked frequency entries from frequency counts.

Converts a frequency dictionary into a list of FrequencyEntry objects with ranking information and percentage calculations. Entries are sorted by frequency (descending) then alphabetically (ascending).

Parameters:

frequencies (dict[str, int]) – Dictionary mapping syllable to occurrence count.

Return type:

list[FrequencyEntry]

Returns:

List of FrequencyEntry objects sorted by frequency (highest first), with alphabetical secondary sort for ties.

Example

>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145, 'ta': 98}
>>> entries = analyzer.create_frequency_entries(frequencies)
>>> entries[0]
FrequencyEntry(canonical='ka', frequency=187, rank=1, percentage=31.5)
>>> entries[0].canonical
'ka'
>>> entries[0].rank
1

Note

Percentage is calculated as (frequency / total_count) * 100. Ranks start at 1 (most frequent syllable has rank=1).
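The sort order and percentage formula described above can be sketched as below. RankedEntry is a hypothetical stand-in for FrequencyEntry, and giving tied syllables consecutive ranks is an assumption; the source does not say how ties are ranked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedEntry:  # hypothetical stand-in for FrequencyEntry
    canonical: str
    frequency: int
    rank: int
    percentage: float


def rank_frequencies(frequencies: dict[str, int]) -> list[RankedEntry]:
    """Rank syllables by descending frequency, then alphabetically."""
    total = sum(frequencies.values())
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [
        # percentage = (frequency / total_count) * 100; ranks start at 1
        RankedEntry(syllable, count, rank, count / total * 100)
        for rank, (syllable, count) in enumerate(ordered, start=1)
    ]


entries = rank_frequencies({"ra": 2, "ka": 3, "ta": 1, "mi": 1})
for entry in entries:
    print(f"{entry.rank}. {entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
```

Negating the count in the sort key gives descending frequency while keeping the alphabetical tie-break ascending in a single sort.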

extract_unique_syllables(syllables)[source]

Extract unique syllables and return in sorted order.

Removes duplicates from the syllable list and returns a sorted list of unique canonical syllables. This creates the authoritative syllable inventory for downstream feature annotation.

Parameters:

syllables (list[str]) – List of canonical syllables (may contain duplicates).

Return type:

list[str]

Returns:

Sorted list of unique syllable strings (alphabetical order).

Example

>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']
>>> len(unique)
4

Note

Sorting is alphabetical (a-z) for deterministic output. Empty syllable lists return an empty list.

save_frequencies(frequencies, output_path)[source]

Save frequency dictionary to JSON file.

Writes the frequency intelligence to a JSON file for downstream analysis. The output is formatted with indentation for readability and sorted by key for deterministic output.

Parameters:

frequencies (dict[str, int]) – Dictionary mapping syllable to occurrence count.
output_path (Path) – Path where the JSON file should be saved.

Raises:

PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).

Return type:

None

Example

>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
# File contains:
# {
#   "ka": 187,
#   "mi": 145,
#   "ra": 162
# }

Note

The JSON is formatted with 2-space indentation and keys are sorted alphabetically for consistent diffs in version control.
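The formatting described in this note corresponds to two standard json options. The snippet below is a sketch of that serialisation, shown on an in-memory string rather than a file; it does not reproduce the method's source.

```python
import json

frequencies = {"ka": 187, "ra": 162, "mi": 145}

# indent=2 gives 2-space indentation; sort_keys=True orders keys alphabetically.
text = json.dumps(frequencies, indent=2, sort_keys=True)
print(text)
```

Because the keys are sorted, two runs over the same input produce byte-identical JSON regardless of dictionary insertion order.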

save_unique_syllables(unique_syllables, output_path)[source]

Save unique syllables to text file.

Writes the deduplicated canonical syllable inventory to a text file, one syllable per line. This creates the authoritative syllable list for feature annotation and downstream processing.

Parameters:

unique_syllables (list[str]) – Sorted list of unique canonical syllables.
output_path (Path) – Path where the text file should be saved.

Raises:

PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).

Return type:

None

Example

>>> analyzer = FrequencyAnalyzer()
>>> unique = ['ka', 'mi', 'ra', 'ta']
>>> analyzer.save_unique_syllables(unique, Path("syllables_unique.txt"))
# File contains:
# ka
# mi
# ra
# ta

Note

Syllables should be pre-sorted (alphabetically) before calling this method. Use extract_unique_syllables() which returns sorted output, or sort manually.

class build_tools.pyphen_syllable_normaliser.FrequencyEntry(canonical, frequency, rank, percentage)[source]

Bases: object

Single syllable with frequency and ranking information.

This dataclass represents one syllable in the frequency analysis, including its occurrence count and relative ranking.

canonical: The canonical form of the syllable (e.g., “ka”)

frequency: Number of times this syllable appears

rank: Frequency rank (1 = most common, 2 = second most common, etc.)

percentage: Percentage of total syllables (0-100)

Example

>>> entry = FrequencyEntry(canonical="ka", frequency=187, rank=1, percentage=10.2)
>>> print(f"{entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
ka: 187 (10.2%)

canonical: str

frequency: int

percentage: float

rank: int

class build_tools.pyphen_syllable_normaliser.NormalizationConfig(min_length=2, max_length=20, allowed_charset='abcdefghijklmnopqrstuvwxyz', unicode_form='NFKD')[source]

Bases: object

Configuration for syllable normalization process.

This dataclass stores all parameters that control how syllables are normalized to canonical form.

min_length: Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2

max_length: Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20

allowed_charset: String of allowed characters. Only syllables containing these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”

unicode_form: Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)

Example

>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'

__post_init__()[source]: Validate configuration parameters after initialization.

allowed_charset: str = 'abcdefghijklmnopqrstuvwxyz'

max_length: int = 20

min_length: int = 2

unicode_form: str = 'NFKD'

class build_tools.pyphen_syllable_normaliser.NormalizationResult(config, stats, frequencies, unique_syllables, input_files, output_dir, timestamp=<factory>, raw_file=PosixPath('syllables_raw.txt'), canonical_file=PosixPath('syllables_canonicalised.txt'), frequency_file=PosixPath('syllables_frequencies.json'), unique_file=PosixPath('syllables_unique.txt'), meta_file=PosixPath('normalization_meta.txt'))[source]

Bases: object

Complete result from the syllable normalization pipeline.

This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.

config: Configuration used for normalization

stats: Statistics from the processing

frequencies: Dictionary mapping canonical syllable to frequency count

unique_syllables: Sorted list of unique canonical syllables

input_files: List of input file paths that were processed

output_dir: Directory where output files were saved

timestamp: When the normalization was performed

raw_file: Path to raw aggregated file (syllables_raw.txt)

canonical_file: Path to canonicalized file (syllables_canonicalised.txt)

frequency_file: Path to frequency JSON (syllables_frequencies.json)

unique_file: Path to unique syllables (syllables_unique.txt)

meta_file: Path to metadata report (normalization_meta.txt)

Example

>>> from datetime import datetime
>>> from pathlib import Path
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000

canonical_file: Path = PosixPath('syllables_canonicalised.txt')

config: NormalizationConfig

format_metadata()[source]

Format normalization metadata as a human-readable string.

Creates a detailed report including statistics, rejection breakdown, and top frequencies.

Return type:

str

Returns:

Multi-line string containing all normalization metadata formatted for display or file output.

Example

>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp:           2026-01-05 17:30:22
...

frequencies: dict[str, int]

frequency_file: Path = PosixPath('syllables_frequencies.json')

input_files: list[Path]

meta_file: Path = PosixPath('normalization_meta.txt')

output_dir: Path

raw_file: Path = PosixPath('syllables_raw.txt')

stats: NormalizationStats

timestamp: datetime

unique_file: Path = PosixPath('syllables_unique.txt')

unique_syllables: list[str]

class build_tools.pyphen_syllable_normaliser.NormalizationStats(raw_count=0, after_canonicalization=0, rejected_charset=0, rejected_length=0, rejected_empty=0, unique_canonical=0, processing_time=0.0)[source]

Bases: object

Statistics from the syllable normalization process.

This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.

raw_count: Total number of syllables in raw input (before normalization)

after_canonicalization: Number of syllables after normalization

rejected_charset: Syllables rejected due to invalid characters

rejected_length: Syllables rejected due to length constraints

rejected_empty: Syllables that became empty after normalization

unique_canonical: Number of unique canonical syllables

processing_time: Total processing time in seconds

Example

>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0

after_canonicalization: int = 0

processing_time: float = 0.0

raw_count: int = 0

rejected_charset: int = 0

rejected_empty: int = 0

rejected_length: int = 0

property rejection_rate: float

Calculate rejection rate as percentage of raw count.

property total_rejected: int

Calculate total number of rejected syllables.

unique_canonical: int = 0
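The two derived properties can be illustrated with a simplified dataclass. Stats is a hypothetical stand-in for NormalizationStats, and the zero-input guard is an assumption; the source only states that the rate is a percentage of the raw count.

```python
from dataclasses import dataclass


@dataclass
class Stats:  # hypothetical, simplified stand-in for NormalizationStats
    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        """Sum of all rejection categories."""
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        """Rejected syllables as a percentage of the raw count."""
        if self.raw_count == 0:
            return 0.0  # assumed behaviour when no input was processed
        return 100 * self.total_rejected / self.raw_count


stats = Stats(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.total_rejected, stats.rejection_rate)
```

With the figures from the class example (30 charset and 20 length rejections out of 1000), this gives 50 rejected syllables and a rate of 5.0.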

class build_tools.pyphen_syllable_normaliser.SyllableNormalizer(config)[source]

Bases: object

Normalizes syllables to canonical form.

This class applies a multi-step normalization pipeline to transform raw syllables into a standardized canonical representation. The pipeline includes Unicode normalization, diacritic removal, case normalization, and validation against charset and length constraints.

config: Configuration specifying normalization parameters such as allowed charset, length constraints, and Unicode normalization form.

Example

>>> from build_tools.pyphen_syllable_normaliser import NormalizationConfig
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> print(normalizer.normalize("x"))  # Too short
None
>>> print(normalizer.normalize("résumé123"))  # Invalid characters
None

__init__(config)[source]

Initialize normalizer with configuration.

Parameters:

config (NormalizationConfig) – NormalizationConfig instance specifying normalization parameters including charset, length constraints, and Unicode normalization form.

Example

>>> config = NormalizationConfig(
...     min_length=3,
...     max_length=10,
...     allowed_charset="abcdefghijklmnopqrstuvwxyz",
...     unicode_form="NFKD"
... )
>>> normalizer = SyllableNormalizer(config)

normalize(syllable)[source]

Normalize a single syllable to canonical form.

Applies the complete normalization pipeline:

1. Unicode normalization (NFKD by default)
2. Strip diacritics (remove combining characters)
3. Lowercase conversion
4. Trim whitespace
5. Validate charset (only allowed characters)
6. Check length constraints

Parameters:

syllable (str) – Raw syllable string to normalize.

Return type:

str | None

Returns:

Normalized canonical syllable string, or None if the syllable is rejected due to:

Becoming empty after normalization
Containing invalid characters
Not meeting length constraints

Example

>>> config = NormalizationConfig()
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize("  HELLO  ")
'hello'
>>> normalizer.normalize("résumé")
'resume'
>>> print(normalizer.normalize(""))  # Empty
None
>>> print(normalizer.normalize("x"))  # Too short (min_length=2)
None
>>> print(normalizer.normalize("hello123"))  # Invalid chars
None

strip_diacritics(text)[source]

Remove diacritics (accent marks) from Unicode text.

Uses Unicode normalization (NFD/NFKD) to decompose characters into base characters and combining marks, then removes the combining marks. This converts accented characters like ‘é’ → ‘e’, ‘ñ’ → ‘n’, etc.

Parameters:: text (str) – Unicode string potentially containing diacritics.
Return type:: str
Returns:: String with all combining diacritical marks removed.

Example

>>> normalizer = SyllableNormalizer(NormalizationConfig())
>>> normalizer.strip_diacritics("café")
'cafe'
>>> normalizer.strip_diacritics("naïve")
'naive'
>>> normalizer.strip_diacritics("Zürich")
'Zurich'
>>> normalizer.strip_diacritics("São Paulo")
'Sao Paulo'

Note

This method assumes the text has already been normalized to NFD or NFKD form. The normalize() method handles this automatically.
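The note matters because a filter that only drops combining marks has nothing to remove from precomposed (NFC) text. The sketch below illustrates that behaviour under the assumption that stripping is a pure combining-mark filter; strip_marks is a hypothetical helper, not the class method.

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks; performs no decomposition of its own."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


precomposed = unicodedata.normalize("NFC", "café")  # 'é' is a single code point
decomposed = unicodedata.normalize("NFD", "café")   # 'e' followed by U+0301

still_accented = strip_marks(precomposed)  # nothing to strip
stripped = strip_marks(decomposed)         # accent removed
```

Only the decomposed string loses its accent, which is why normalization has to run before the stripping step.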

build_tools.pyphen_syllable_normaliser.create_argument_parser()[source]

Create and return the argument parser for pyphen syllable normaliser.

Return type: ArgumentParser
Returns: Configured ArgumentParser ready to parse command-line arguments.

build_tools.pyphen_syllable_normaliser.discover_input_files(source_dir, pattern='*.txt', recursive=False)[source]

Discover input files in a directory matching a pattern.

Scans a directory for files matching the specified glob pattern. Returns files in sorted order for deterministic processing.

Parameters:

source_dir (Path) – Directory to scan for input files.
pattern (str) – Glob pattern for matching files. Default: "*.txt".
recursive (bool) – If True, scan subdirectories recursively using the "**/" prefix. Default: False (scan only the immediate directory).

Return type:

list[Path]

Returns:

Sorted list of Path objects for all matching files.

Raises:

ValueError – If source_dir is not a directory.
FileNotFoundError – If source_dir does not exist.

Example

>>> from pathlib import Path
>>> # Non-recursive scan
>>> files = discover_input_files(Path("data/"), pattern="*.txt")
>>> files
[Path('data/corpus1.txt'), Path('data/corpus2.txt')]
>>>
>>> # Recursive scan
>>> files = discover_input_files(
...     Path("data/"),
...     pattern="*.txt",
...     recursive=True
... )
>>> files
[Path('data/corpus1.txt'),
 Path('data/subdir/corpus3.txt'),
 Path('data/subdir/corpus4.txt')]

Note

Files are always returned in sorted order to ensure deterministic processing. This is critical for reproducible normalization results.
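A sorted, optionally recursive glob along these lines can be sketched with pathlib. The function name discover_sketch and the error messages are assumptions; only the documented behaviour (pattern, "**/" prefix, sorted result, the two exceptions) is taken from the description above.

```python
import tempfile
from pathlib import Path


def discover_sketch(source_dir, pattern="*.txt", recursive=False):
    """Hypothetical stand-in for discover_input_files()."""
    if not source_dir.exists():
        raise FileNotFoundError(f"Directory not found: {source_dir}")
    if not source_dir.is_dir():
        raise ValueError(f"Not a directory: {source_dir}")
    # The "**/" prefix makes glob descend into subdirectories.
    glob_pattern = f"**/{pattern}" if recursive else pattern
    # Sorting removes any dependence on filesystem listing order.
    return sorted(p for p in source_dir.glob(glob_pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subdir").mkdir()
    for name in ("corpus2.txt", "corpus1.txt", "subdir/corpus3.txt", "notes.md"):
        (root / name).write_text("ka\n", encoding="utf-8")
    flat = [p.relative_to(root).as_posix() for p in discover_sketch(root)]
    deep = [p.relative_to(root).as_posix()
            for p in discover_sketch(root, recursive=True)]
```

The files are created out of order on purpose: the returned lists are still sorted, and notes.md is excluded by the pattern.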

build_tools.pyphen_syllable_normaliser.load_frequencies_from_file(file_path)[source]

Load frequency dictionary from JSON file.

Reads a previously saved frequency file (the pipeline writes it as pyphen_syllables_frequencies.json) and returns the frequency dictionary. Useful for analysing and inspecting normalization results.

Parameters:

file_path (Path) – Path to the JSON frequency file.

Return type:

dict[str, int]

Returns:

Dictionary mapping syllable to occurrence count.

Raises:

FileNotFoundError – If the file does not exist.
json.JSONDecodeError – If the file is not valid JSON.
PermissionError – If the file cannot be read.

Example

>>> from pathlib import Path
>>> frequencies = load_frequencies_from_file(Path("syllables_frequencies.json"))
>>> frequencies['ka']
187
>>> len(frequencies)
412

Note

The JSON file must have been created by save_frequencies() or follow the same format: {"syllable": count, …}
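Loading a file in that format amounts to reading a flat JSON object. The sketch below assumes UTF-8 encoding and uses the hypothetical name load_frequencies_sketch; the counts written to the temporary file are sample values for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def load_frequencies_sketch(file_path):
    """Hypothetical stand-in: read {"syllable": count, ...} from JSON."""
    with file_path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Coerce defensively so callers always get dict[str, int].
    return {str(syllable): int(count) for syllable, count in data.items()}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "syllables_frequencies.json"
    path.write_text(json.dumps({"ka": 187, "mi": 42}), encoding="utf-8")
    frequencies = load_frequencies_sketch(path)
    most_common = max(frequencies, key=frequencies.get)
```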

build_tools.pyphen_syllable_normaliser.load_unique_syllables_from_file(file_path)[source]

Load unique syllables from text file.

Reads a previously saved unique-syllable file (the pipeline writes it as pyphen_syllables_unique.txt) and returns the syllable list. Useful for loading the authoritative syllable inventory for feature annotation or analysis.

Parameters:

file_path (Path) – Path to the text file containing unique syllables.

Return type:

list[str]

Returns:

List of syllable strings (one per line from file).

Raises:

FileNotFoundError – If the file does not exist.
PermissionError – If the file cannot be read.
UnicodeDecodeError – If the file contains invalid UTF-8.

Example

>>> from pathlib import Path
>>> syllables = load_unique_syllables_from_file(Path("syllables_unique.txt"))
>>> syllables[:5]
['ka', 'mi', 'ra', 'ta', 'wa']
>>> len(syllables)
412

Note

Empty lines are skipped. Leading/trailing whitespace is stripped from each line.
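The line handling described in the note can be sketched as follows, assuming UTF-8 input; load_unique_sketch is a hypothetical name used only for this example.

```python
import tempfile
from pathlib import Path


def load_unique_sketch(file_path):
    """Hypothetical stand-in: one syllable per line, blank lines skipped."""
    with file_path.open(encoding="utf-8") as handle:
        stripped = (line.strip() for line in handle)
        return [line for line in stripped if line]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "syllables_unique.txt"
    # Includes a blank line and padded whitespace on purpose.
    path.write_text("ka\n\n  mi  \nra\n", encoding="utf-8")
    syllables = load_unique_sketch(path)
```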

build_tools.pyphen_syllable_normaliser.main(args=None)[source]

Main entry point for CLI.

Parameters: args (list[str] | None) – Command-line arguments (for testing). If None, uses sys.argv.
Return type: int
Returns: Exit code (0 for success, 1 for error).
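Accepting an explicit argument list and returning an exit code makes an entry point easy to test without spawning a process. The sketch below shows only that convention: build_parser, main_sketch, and the positional run_dir argument are hypothetical and do not describe the tool's real command-line interface.

```python
import argparse
import sys
import tempfile
from pathlib import Path


def build_parser():
    """Hypothetical parser; 'run_dir' is an assumed argument name."""
    parser = argparse.ArgumentParser(prog="normaliser-sketch")
    parser.add_argument("run_dir", type=Path, help="hypothetical run directory")
    return parser


def main_sketch(args=None):
    """Parse args (sys.argv[1:] when args is None) and return an exit code."""
    parsed = build_parser().parse_args(args)
    if not parsed.run_dir.is_dir():
        print(f"Error: not a directory: {parsed.run_dir}", file=sys.stderr)
        return 1
    return 0


ok_code = main_sketch([tempfile.gettempdir()])
bad_code = main_sketch([str(Path(tempfile.gettempdir()) / "no_such_run_dir_sketch")])
```

A script wrapper would typically end with sys.exit(main_sketch()), so the returned integer becomes the process exit status.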

build_tools.pyphen_syllable_normaliser.normalize_batch(syllables, config)[source]

Normalize a batch of syllables and collect rejection statistics.

This is a convenience function for normalizing multiple syllables at once while tracking why syllables were rejected.

Parameters:

syllables (list[str]) – List of raw syllable strings to normalize.
config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.

Return type:

tuple[list[str], dict[str, int]]

Returns:

Tuple of (normalized_syllables, rejection_stats), where:

normalized_syllables: List of successfully normalized syllables
rejection_stats: Dictionary with rejection counts:
- "rejected_empty": Syllables that became empty after normalization
- "rejected_charset": Syllables with invalid characters
- "rejected_length": Syllables outside length constraints

Example

>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> syllables = ["Café", "x", "Hello", "world123", "  résumé  "]
>>> normalized, stats = normalize_batch(syllables, config)
>>> normalized
['cafe', 'hello', 'resume']
>>> stats
{'rejected_empty': 0, 'rejected_charset': 1, 'rejected_length': 1}

Note

This function processes syllables in order and preserves duplicates. For frequency analysis, use the frequency.py module which handles deduplication and counting.
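The bookkeeping behind the three rejection counters can be sketched as a single loop that classifies each syllable by the first check it fails. The function name, the default charset, and the order of the checks are assumptions for illustration.

```python
import unicodedata


def normalize_batch_sketch(syllables, min_length=2, max_length=8,
                           allowed="abcdefghijklmnopqrstuvwxyz"):
    """Hypothetical stand-in for normalize_batch()."""
    normalized = []
    stats = {"rejected_empty": 0, "rejected_charset": 0, "rejected_length": 0}
    for raw in syllables:
        text = unicodedata.normalize("NFKD", raw)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        text = text.lower().strip()
        # Each rejected syllable is counted once, under the first failed check.
        if not text:
            stats["rejected_empty"] += 1
        elif any(ch not in allowed for ch in text):
            stats["rejected_charset"] += 1
        elif not min_length <= len(text) <= max_length:
            stats["rejected_length"] += 1
        else:
            normalized.append(text)  # duplicates are kept, order preserved
    return normalized, stats


batch = ["Café", "x", "Hello", "world123", "  résumé  ", "CAFE"]
normalized, stats = normalize_batch_sketch(batch)
```

The trailing "CAFE" normalizes to a second 'cafe', showing that duplicates survive the batch step and are left for frequency analysis to count.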

build_tools.pyphen_syllable_normaliser.run_full_pipeline(run_directory, config, verbose=False, quiet=False)[source]

Run complete pyphen normalization pipeline with in-place processing.

Executes the full pyphen-specific workflow:

1. Aggregate syllables from run_directory/syllables/*.txt
2. Canonicalize syllables (Unicode normalization, etc.)
3. Frequency analysis
4. Write 5 output files to run_directory (in-place)

Parameters:

run_directory (Path) – Pyphen run directory (e.g., _working/output/20260110_143022_pyphen/).
config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.
verbose (bool) – If True, print detailed progress information.
quiet (bool) – If True, suppress all output except errors.

Return type:

NormalizationResult

Returns:

NormalizationResult containing all outputs, statistics, and file paths.

Raises:

FileNotFoundError – If run_directory or syllables/ subdirectory doesn’t exist.
ValueError – If run_directory is not a directory.

Example

>>> from pathlib import Path
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> run_dir = Path("_working/output/20260110_143022_pyphen/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=config,
...     verbose=True
... )
>>> result.stats.raw_count
15234
>>> result.stats.unique_canonical
4821
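The relationship between the three counts (raw, canonical, unique) can be sketched end to end on a temporary run directory. This is a simplified illustration, not the real pipeline: canonicalise is a deliberately reduced stand-in (lowercase and trim only), blank input lines are assumed to be skipped, and only two of the five output files are written.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def canonicalise(raw):
    """Reduced stand-in for step 2: lowercase, trim, ASCII letters, length 2-8."""
    text = raw.strip().lower()
    valid = text.isascii() and text.isalpha() and 2 <= len(text) <= 8
    return text if valid else None


def pipeline_sketch(run_directory):
    # Step 1: aggregate every occurrence, reading files in sorted order.
    raw = []
    for path in sorted((run_directory / "syllables").glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        raw.extend(line for line in lines if line.strip())
    # Step 2: canonicalise and drop rejected syllables.
    canonical = [c for c in map(canonicalise, raw) if c is not None]
    # Step 3: count occurrences before deduplicating.
    frequencies = Counter(canonical)
    unique = sorted(frequencies)
    # Write two of the outputs in place, next to the syllables/ directory.
    (run_directory / "pyphen_syllables_frequencies.json").write_text(
        json.dumps(frequencies, sort_keys=True, indent=2), encoding="utf-8")
    (run_directory / "pyphen_syllables_unique.txt").write_text(
        "\n".join(unique) + "\n", encoding="utf-8")
    return {"raw_count": len(raw),
            "after_canonicalization": len(canonical),
            "unique_canonical": len(unique)}


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "syllables").mkdir()
    (run_dir / "syllables" / "a.txt").write_text("Ka\nmi\nka\n", encoding="utf-8")
    (run_dir / "syllables" / "b.txt").write_text("ra\nx\nka\n", encoding="utf-8")
    counts = pipeline_sketch(run_dir)
    saved = json.loads(
        (run_dir / "pyphen_syllables_frequencies.json").read_text(encoding="utf-8"))
```

Six raw syllables become five canonical ones once "x" is rejected for length, and three unique ones after deduplication, while the saved frequencies keep the pre-deduplication counts.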