Pyphen Syllable Normaliser

Overview

Syllable Normaliser - 3-Step Normalization Pipeline

The syllable normaliser transforms raw syllable files into canonical form through a 3-step pipeline, creating the authoritative syllable inventory for pattern development. This is a build-time tool only - not used during runtime name generation.

3-Step Normalization Pipeline:

  1. Aggregation - Combine multiple input files while preserving all occurrences

  2. Canonicalization - Unicode normalization, diacritic stripping, charset validation

  3. Frequency Analysis - Count occurrences and generate frequency intelligence

Features:

  • Unicode normalization (NFKD, NFC, NFD, NFKC)

  • Diacritic stripping using unicodedata

  • Configurable charset and length constraints

  • Frequency intelligence capture (pre-deduplication counts)

  • Deterministic processing (same input = same output)

  • Comprehensive metadata reporting

  • 5 output files for complete analysis

The pipeline produces 5 output files (with pyphen_ prefix for provenance):

  • pyphen_syllables_raw.txt: Aggregated raw syllables (all occurrences preserved)

  • pyphen_syllables_canonicalised.txt: Normalized canonical syllables

  • pyphen_syllables_frequencies.json: Frequency intelligence (syllable → count)

  • pyphen_syllables_unique.txt: Deduplicated canonical syllable inventory

  • pyphen_normalization_meta.txt: Detailed statistics and metadata report

Usage:
>>> from pathlib import Path
>>> from build_tools.pyphen_syllable_normaliser import (
...     NormalizationConfig,
...     run_full_pipeline
... )
>>>
>>> # Create configuration
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>>
>>> # Run pipeline on a pyphen run directory
>>> result = run_full_pipeline(
...     run_directory=Path("_working/output/20260110_143022_pyphen/"),
...     config=config,
...     verbose=True
... )
>>>
>>> # Access results
>>> print(f"Processed {result.stats.raw_count:,} raw syllables")
>>> print(f"Canonical: {result.stats.after_canonicalization:,}")
>>> print(f"Unique: {result.stats.unique_canonical:,}")

CLI Usage:

# Process specific pyphen run directory (in-place)
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_143022_pyphen/

# Auto-detect all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser --source _working/output/

Command-Line Interface

Pyphen Syllable Normaliser - 3-step normalization pipeline

usage: python -m build_tools.pyphen_syllable_normaliser [-h]
                                                        (--run-dir RUN_DIR | --source SOURCE)
                                                        [--min MIN]
                                                        [--max MAX]
                                                        [--charset CHARSET]
                                                        [--unicode-form {NFC,NFD,NFKC,NFKD}]
                                                        [--verbose] [--quiet]

Named Arguments

--run-dir

Specific pyphen run directory to process (e.g., _working/output/20260110_143022_pyphen/)

--source

Directory to scan for pyphen run directories (auto-detects *_pyphen/ directories)

--min

Minimum syllable length (characters). Default: 2

--max

Maximum syllable length (characters). Default: 20

--charset

Allowed character set for syllables. Default: a-z

--unicode-form

Possible choices: NFC, NFD, NFKC, NFKD

Unicode normalization form. Default: NFKD

--verbose, -v

Enable verbose output with detailed progress information

Default: False

--quiet, -q

Suppress all output except errors

Default: False

Examples:

# Process specific pyphen run directory
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_143022_pyphen/

# Auto-detect and process all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser --source _working/output/

# Custom normalization config
python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_143022_pyphen/ \
  --min 2 --max 8

Output Format

The pipeline generates 5 output files in the pyphen run directory with pyphen_ prefix for provenance:

  1. pyphen_syllables_raw.txt - Aggregated raw syllables (all occurrences preserved)

  2. pyphen_syllables_canonicalised.txt - Normalized canonical syllables

  3. pyphen_syllables_frequencies.json - Frequency intelligence (syllable → count mapping)

  4. pyphen_syllables_unique.txt - Deduplicated canonical syllable inventory

  5. pyphen_normalization_meta.txt - Detailed statistics and metadata report

In-Place Processing:

Unlike older versions, which wrote to a separate output directory, the pyphen normaliser now processes run directories in place and writes its output files directly into the run directory:

_working/output/20260110_143022_pyphen/
├── syllables/                          # Input (from pyphen extractor)
│   ├── en_US_alice.txt
│   ├── en_US_middlemarch.txt
│   └── ...
├── meta/                               # Metadata (from extractor)
│   └── ...
├── pyphen_syllables_raw.txt            # Output: Aggregated
├── pyphen_syllables_canonicalised.txt  # Output: Normalized
├── pyphen_syllables_frequencies.json   # Output: Frequency intelligence
├── pyphen_syllables_unique.txt         # Output: Deduplicated
└── pyphen_normalization_meta.txt       # Output: Statistics

File structure examples:

pyphen_syllables_raw.txt (preserves all occurrences):

café
Café
hello
hello
world

pyphen_syllables_canonicalised.txt (normalized, duplicates preserved):

cafe
cafe
hello
hello
world

pyphen_syllables_frequencies.json (counts before deduplication):

{
  "cafe": 2,
  "hello": 2,
  "world": 1
}

pyphen_syllables_unique.txt (deduplicated, sorted):

cafe
hello
world

Integration Guide

The pyphen syllable normaliser is the second step of the workflow: it runs after pyphen extraction and processes syllables within their run directories:

Standard workflow:

# Step 1: Extract syllables using pyphen
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --pattern "*.txt" \
  --output _working/output/ \
  --lang en_US

# Step 2: Normalize extracted syllables (in-place)
python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_143022_pyphen/

# Alternative: Auto-detect all pyphen run directories
python -m build_tools.pyphen_syllable_normaliser \
  --source _working/output/

# Step 3: Annotate with phonetic features (source-agnostic)
python -m build_tools.syllable_feature_annotator \
  --syllables _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
  --frequencies _working/output/20260110_143022_pyphen/pyphen_syllables_frequencies.json

Comparing with NLTK normaliser:

# Pyphen pipeline (typographic hyphenation)
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --lang en_US \
  --output _working/output/

python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_143022_pyphen/

# NLTK pipeline (phonetic splitting)
python -m build_tools.nltk_syllable_extractor \
  --source data/corpus/ \
  --output _working/output/

python -m build_tools.nltk_syllable_normaliser \
  --run-dir _working/output/20260110_095213_nltk/

# Compare outputs - the two pipelines use different file prefixes (pyphen_* vs nltk_*)
diff _working/output/20260110_143022_pyphen/pyphen_syllables_unique.txt \
     _working/output/20260110_095213_nltk/nltk_syllables_unique.txt

When to use pyphen normaliser vs NLTK normaliser:

Use pyphen normaliser when:

  • You used the pyphen syllable extractor

  • Your syllables are well-formed from typographic hyphenation

  • You want multi-language support (40+ languages)

  • You want in-place processing within run directories

  • You’re working with pyphen’s dictionary-based splits

Use NLTK normaliser when:

  • You used the NLTK syllable extractor

  • Your syllables contain many single-letter fragments

  • You want phonetically coherent syllables reconstructed

  • You’re working with NLTK’s onset/coda-based splits (English only)

  • You want fragment cleaning preprocessing

3-Step Normalization Pipeline:

Step 1 - Aggregation:

  • Combines all input files into pyphen_syllables_raw.txt

  • Preserves ALL occurrences (no deduplication)

  • Maintains raw counts for frequency analysis

  • Empty lines filtered during file reading
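As a rough illustration of Step 1, the sketch below reads several one-syllable-per-line files and concatenates them without deduplication. The function name aggregate is hypothetical and the UTF-8 encoding is an assumption; the package's own FileAggregator may differ in detail.

```python
from pathlib import Path


def aggregate(paths: list[Path]) -> list[str]:
    """Concatenate syllables from several files, keeping every occurrence."""
    syllables: list[str] = []
    for path in paths:  # files are read in the order given
        with path.open(encoding="utf-8") as handle:  # encoding is an assumption
            for line in handle:
                stripped = line.strip()
                if stripped:  # whitespace-only lines are skipped
                    syllables.append(stripped)
    return syllables
```

Because the result depends on the order of paths, sorting the list before calling keeps the aggregated file reproducible.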

Step 2 - Canonicalization:

  • Unicode normalization (NFKD - compatibility decomposition)

  • Strip diacritics: café → cafe, résumé → resume

  • Lowercase conversion

  • Trim whitespace

  • Charset validation (reject invalid characters)

  • Length constraint enforcement (default: min=2, max=20)

  • Outputs to pyphen_syllables_canonicalised.txt
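The Step 2 operations can be sketched with the standard unicodedata module. This is a minimal sketch that follows the order of the list above; the function name canonicalise is hypothetical, the parameter defaults are copied from the documented defaults, and the real SyllableNormalizer may handle edge cases differently.

```python
import unicodedata
from typing import Optional


def canonicalise(
    raw: str,
    min_length: int = 2,
    max_length: int = 20,
    charset: str = "abcdefghijklmnopqrstuvwxyz",
    form: str = "NFKD",
) -> Optional[str]:
    """Return the canonical form of a syllable, or None if it is rejected."""
    text = unicodedata.normalize(form, raw)
    # Drop combining marks left behind by decomposition (e.g. U+0301).
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    if not text:
        return None  # became empty after normalization
    if any(ch not in charset for ch in text):
        return None  # contains a character outside the allowed set
    if not min_length <= len(text) <= max_length:
        return None  # too short or too long
    return text
```

With a composing form such as NFC, a precomposed é is not split into a base letter and a mark, so in this sketch canonicalise("Café", form="NFC") is rejected by the a-z charset check rather than mapped to "cafe".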

Step 3 - Frequency Analysis:

  • Count occurrences of each canonical syllable

  • Generate frequency rankings and percentages

  • Create deduplicated unique list (alphabetically sorted)

  • Outputs:

      • pyphen_syllables_frequencies.json - Frequency counts before deduplication

      • pyphen_syllables_unique.txt - Authoritative syllable inventory

      • pyphen_normalization_meta.txt - Comprehensive statistics report
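Step 3 amounts to counting first and deduplicating second. The sketch below shows that order of operations on the small example used earlier in this document; the function name analyse is hypothetical and is not part of the package API.

```python
from collections import Counter


def analyse(canonical: list[str]) -> tuple[dict[str, int], list[str]]:
    """Count occurrences before deduplication, then build the sorted inventory."""
    frequencies = dict(Counter(canonical))
    unique = sorted(frequencies)  # alphabetical, so the output is deterministic
    return frequencies, unique


frequencies, unique = analyse(["cafe", "cafe", "hello", "hello", "world"])
print(frequencies)  # {'cafe': 2, 'hello': 2, 'world': 1}
print(unique)       # ['cafe', 'hello', 'world']
```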

Pipeline characteristics:

  • Deterministic: same input always produces same output

  • Fast: processes thousands of syllables per second

  • Configurable: adjust length constraints, charset, unicode form

  • Comprehensive: detailed rejection statistics and metadata

Notes

Frequency Intelligence:

The frequency data captures how often each canonical syllable occurs before deduplication. This intelligence is essential for understanding natural language patterns and can inform weighted name generation:

{
  "ka": 187,
  "ra": 162,
  "mi": 145,
  "ta": 98
}

This shows “ka” appears 187 times in the canonical syllables, indicating it’s a high-frequency pattern that may be desirable for common or natural-sounding names.
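One way such counts could feed weighted generation is frequency-proportional sampling. The following is only a sketch of that idea: sample_syllables is a hypothetical helper, not something this build-time tool provides, and the seed parameter is an assumption added to keep the draw repeatable.

```python
import random

frequencies = {"ka": 187, "ra": 162, "mi": 145, "ta": 98}


def sample_syllables(freqs: dict[str, int], k: int, seed: int = 0) -> list[str]:
    """Draw k syllables with probability proportional to corpus frequency."""
    rng = random.Random(seed)  # seeded so repeated runs give the same draw
    syllables = list(freqs)
    weights = [freqs[s] for s in syllables]
    return rng.choices(syllables, weights=weights, k=k)


name = "".join(sample_syllables(frequencies, k=3))
```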

Normalization Behavior:

  • All syllable processing is case-insensitive (output is lowercase)

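  • The default Unicode form, NFKD, applies compatibility decomposition: accented letters are split into a base letter plus combining marks, and compatibility characters such as ligatures are replaced by their plain equivalents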

  • Empty lines are filtered during aggregation (not counted as rejections)

  • Frequency counts capture occurrences BEFORE deduplication

  • Invalid syllables (wrong charset, wrong length) are rejected and counted in metadata
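To make the difference between the four Unicode forms concrete, the sketch below prints the code points that each form produces for a precomposed é and for the ﬁ ligature. It uses only the standard unicodedata module; the helper name code_points is illustrative.

```python
import unicodedata


def code_points(text: str, form: str) -> list[str]:
    """Return the code points of text after normalizing to the given form."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, text)]


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, code_points("\u00e9", form), code_points("\ufb01", form))
# NFC  ['U+00E9']           ['U+FB01']
# NFD  ['U+0065', 'U+0301'] ['U+FB01']
# NFKC ['U+00E9']           ['U+0066', 'U+0069']
# NFKD ['U+0065', 'U+0301'] ['U+0066', 'U+0069']
```

Only the decomposing forms (NFD, NFKD) expose the accent as a separate combining mark, and only the compatibility forms (NFKC, NFKD) expand the ligature.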

Default Constraints:

  • Min length: 2 characters

  • Max length: 20 characters

  • Allowed charset: a-z (lowercase ASCII letters)

  • Unicode form: NFKD (compatibility decomposition)

Use Cases:

  • Combining syllables from multiple language extractions

  • Normalizing variations in how corpora represent the same text (precomposed vs. decomposed accents, compatibility characters, letter case)

  • Filtering syllables by length for specific pattern requirements

  • Building frequency-aware name generation systems

In-Place Processing Philosophy:

The pyphen normaliser writes outputs directly into the run directory (not a separate location) because:

  • Convention: Each pyphen run is self-contained (extractor + normaliser outputs together)

  • Simplicity: No confusion about where normalized files live

  • Provenance: Run directory name (*_pyphen) and file prefix (pyphen_*) both indicate source

Processing Modes:

  • Specific run directory: --run-dir /path/to/run/ - Process one pyphen run

  • Auto-detection: --source /path/to/output/ - Find and process all pyphen runs

Auto-Detection Criteria:

The auto-detection feature (--source) finds pyphen run directories by:

  1. Scanning for directories ending with _pyphen

  2. Verifying existence of syllables/ subdirectory

  3. Sorting chronologically by directory name
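The three criteria above could be implemented roughly as follows. This is a sketch under the stated criteria only; find_pyphen_runs is a hypothetical name and the actual detection code may apply further checks.

```python
from pathlib import Path


def find_pyphen_runs(source: Path) -> list[Path]:
    """Return directories under source that look like pyphen run directories."""
    runs = [
        entry
        for entry in source.iterdir()
        if entry.is_dir()
        and entry.name.endswith("_pyphen")
        and (entry / "syllables").is_dir()
    ]
    # Timestamped names (YYYYMMDD_HHMMSS_pyphen) sort chronologically as strings.
    return sorted(runs, key=lambda path: path.name)
```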

This allows batch processing:

# Process all pyphen runs at once
python -m build_tools.pyphen_syllable_normaliser --source _working/output/

# Output:
# Found 3 pyphen run directories:
#   - 20260110_143022_pyphen
#   - 20260110_153045_pyphen
#   - 20260110_163010_pyphen
# Processing...

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

class build_tools.pyphen_syllable_normaliser.FileAggregator

Bases: object

Aggregates syllables from multiple input files.

This class handles the first step of the normalization pipeline: combining syllables from multiple .txt files into a single raw aggregated file. All occurrences are preserved (no deduplication), maintaining the original frequency distribution from the input files.

Example

>>> from pathlib import Path
>>> aggregator = FileAggregator()
>>> input_files = [Path("file1.txt"), Path("file2.txt")]
>>> syllables = aggregator.aggregate_files(input_files)
>>> len(syllables)  # Total from both files
450
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
aggregate_files(input_files)

Aggregate syllables from multiple input files.

Reads all syllables from the provided input files and combines them into a single list. Each line in each input file is treated as one syllable. Empty lines are skipped. All occurrences are preserved (no deduplication).

Parameters:

input_files (list[Path]) – List of Path objects pointing to input .txt files. Each file should contain one syllable per line.

Return type:

list[str]

Returns:

List of all syllables from all input files, preserving duplicates and maintaining the original order (file by file).

Raises:

Example

>>> aggregator = FileAggregator()
>>> files = [Path("corpus1.txt"), Path("corpus2.txt")]
>>> syllables = aggregator.aggregate_files(files)
>>> syllables[:3]
['hello', 'world', 'test']

Note

Files are processed in the order provided. If deterministic ordering is required, ensure input_files is sorted before calling.

read_syllables_from_file(file_path)

Read syllables from a single file.

Reads a file line by line, treating each line as one syllable. Empty lines (whitespace only) are skipped. No normalization or transformation is applied - syllables are preserved exactly as they appear in the file.

Parameters:

file_path (Path) – Path to the input file to read.

Return type:

list[str]

Returns:

List of syllable strings from the file, one per non-empty line.

Raises:

Example

>>> aggregator = FileAggregator()
>>> syllables = aggregator.read_syllables_from_file(Path("input.txt"))
>>> syllables
['ka', 'ra', 'mi', 'ka', 'ta']

Note

Leading and trailing whitespace is stripped from each line, but the syllable content itself is not modified. This allows files with varying whitespace formatting to be processed consistently.

save_raw_syllables(syllables, output_path)

Save raw aggregated syllables to file.

Writes syllables to the output file, one per line, in the order provided. This creates the syllables_raw.txt file for the pipeline. All syllables are written exactly as provided (no normalization).

Parameters:
  • syllables (list[str]) – List of syllable strings to write.

  • output_path (Path) – Path where the raw syllables file should be saved.

Raises:
  • PermissionError – If the output file cannot be written.

  • OSError – If there are filesystem issues (disk full, etc.).

Return type:

None

Example

>>> aggregator = FileAggregator()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta']
>>> aggregator.save_raw_syllables(syllables, Path("syllables_raw.txt"))
# File contains:
# ka
# ra
# mi
# ka
# ta

Note

This method creates the output file if it doesn’t exist and overwrites it if it does. The output directory must already exist.

class build_tools.pyphen_syllable_normaliser.FrequencyAnalyzer

Bases: object

Analyzes frequency distribution of canonical syllables.

This class handles the intelligence capture phase of the normalization pipeline. It counts occurrences of each canonical syllable, creates frequency rankings, and generates output files for downstream analysis and feature annotation.

Example

>>> from pathlib import Path
>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 1, 'mi': 1, 'ta': 1}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']
calculate_frequencies(syllables)

Calculate frequency counts for canonical syllables.

Counts how many times each unique syllable appears in the input list. This captures the natural frequency distribution from the source corpus before deduplication.

Parameters:

syllables (list[str]) – List of canonical syllables (may contain duplicates).

Return type:

dict[str, int]

Returns:

Dictionary mapping each unique syllable to its occurrence count.

Example

>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 2, 'mi': 1, 'ta': 1}
>>> sum(frequencies.values())  # Total syllable count
7

Note

The returned dictionary is not sorted. Use create_frequency_entries() to generate sorted frequency rankings.

create_frequency_entries(frequencies)

Create ranked frequency entries from frequency counts.

Converts a frequency dictionary into a list of FrequencyEntry objects with ranking information and percentage calculations. Entries are sorted by frequency (descending) then alphabetically (ascending).

Parameters:

frequencies (dict[str, int]) – Dictionary mapping syllable to occurrence count.

Return type:

list[FrequencyEntry]

Returns:

List of FrequencyEntry objects sorted by frequency (highest first), with alphabetical secondary sort for ties.

Example

>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145, 'ta': 98}
>>> entries = analyzer.create_frequency_entries(frequencies)
>>> entries[0]
FrequencyEntry(canonical='ka', frequency=187, rank=1, percentage=31.6)
>>> entries[0].canonical
'ka'
>>> entries[0].rank
1

Note

Percentage is calculated as (frequency / total_count) * 100. Ranks start at 1 (most frequent syllable has rank=1).
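The ranking rule described above (frequency descending, then alphabetical) can be sketched as below. Entry is a hypothetical stand-in for FrequencyEntry, and two details are assumptions: tied syllables receive consecutive ranks rather than a shared rank, and the percentage is left unrounded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:  # hypothetical stand-in for FrequencyEntry
    canonical: str
    frequency: int
    rank: int
    percentage: float


def rank_entries(frequencies: dict[str, int]) -> list[Entry]:
    """Sort by count (highest first), break ties alphabetically, number from 1."""
    total = sum(frequencies.values())
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [
        Entry(syllable, count, rank, count / total * 100 if total else 0.0)
        for rank, (syllable, count) in enumerate(ordered, start=1)
    ]
```

For the counts {'ka': 187, 'ra': 162, 'mi': 145, 'ta': 98} the total is 592, so the first entry's share is 187 / 592 * 100, or roughly 31.59 percent.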

extract_unique_syllables(syllables)

Extract unique syllables and return in sorted order.

Removes duplicates from the syllable list and returns a sorted list of unique canonical syllables. This creates the authoritative syllable inventory for downstream feature annotation.

Parameters:

syllables (list[str]) – List of canonical syllables (may contain duplicates).

Return type:

list[str]

Returns:

Sorted list of unique syllable strings (alphabetical order).

Example

>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']
>>> len(unique)
4

Note

Sorting is alphabetical (a-z) for deterministic output. Empty syllable lists return an empty list.

save_frequencies(frequencies, output_path)

Save frequency dictionary to JSON file.

Writes the frequency intelligence to a JSON file for downstream analysis. The output is formatted with indentation for readability and sorted by key for deterministic output.

Parameters:
  • frequencies (dict[str, int]) – Dictionary mapping syllable to occurrence count.

  • output_path (Path) – Path where the JSON file should be saved.

Raises:
  • PermissionError – If the output file cannot be written.

  • OSError – If there are filesystem issues (disk full, etc.).

Return type:

None

Example

>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
# File contains:
# {
#   "ka": 187,
#   "mi": 145,
#   "ra": 162
# }

Note

The JSON is formatted with 2-space indentation and keys are sorted alphabetically for consistent diffs in version control.
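The formatting described in this note maps directly onto json.dumps options. A minimal sketch, assuming UTF-8 output and a trailing newline (neither is stated in the documentation); write_frequencies is a hypothetical name.

```python
import json
from pathlib import Path


def write_frequencies(frequencies: dict[str, int], output_path: Path) -> None:
    """Write counts as JSON with 2-space indentation and alphabetically sorted keys."""
    text = json.dumps(frequencies, indent=2, sort_keys=True, ensure_ascii=False)
    output_path.write_text(text + "\n", encoding="utf-8")  # newline is an assumption
```

Sorting the keys means that two runs over the same input produce byte-identical files, whatever order the syllables were counted in.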

save_unique_syllables(unique_syllables, output_path)

Save unique syllables to text file.

Writes the deduplicated canonical syllable inventory to a text file, one syllable per line. This creates the authoritative syllable list for feature annotation and downstream processing.

Parameters:
  • unique_syllables (list[str]) – Sorted list of unique canonical syllables.

  • output_path (Path) – Path where the text file should be saved.

Raises:
  • PermissionError – If the output file cannot be written.

  • OSError – If there are filesystem issues (disk full, etc.).

Return type:

None

Example

>>> analyzer = FrequencyAnalyzer()
>>> unique = ['ka', 'mi', 'ra', 'ta']
>>> analyzer.save_unique_syllables(unique, Path("syllables_unique.txt"))
# File contains:
# ka
# mi
# ra
# ta

Note

Syllables should be pre-sorted (alphabetically) before calling this method. Use extract_unique_syllables() which returns sorted output, or sort manually.

class build_tools.pyphen_syllable_normaliser.FrequencyEntry(canonical, frequency, rank, percentage)

Bases: object

Single syllable with frequency and ranking information.

This dataclass represents one syllable in the frequency analysis, including its occurrence count and relative ranking.

canonical

The canonical form of the syllable (e.g., “ka”)

frequency

Number of times this syllable appears

rank

Frequency rank (1 = most common, 2 = second most common, etc.)

percentage

Percentage of total syllables (0-100)

Example

>>> entry = FrequencyEntry(canonical="ka", frequency=187, rank=1, percentage=10.2)
>>> print(f"{entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
ka: 187 (10.2%)
canonical: str
frequency: int
percentage: float
rank: int
class build_tools.pyphen_syllable_normaliser.NormalizationConfig(min_length=2, max_length=20, allowed_charset='abcdefghijklmnopqrstuvwxyz', unicode_form='NFKD')

Bases: object

Configuration for syllable normalization process.

This dataclass stores all parameters that control how syllables are normalized to canonical form.

min_length

Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2

max_length

Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20

allowed_charset

String of allowed characters. Only syllables containing these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”

unicode_form

Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)

Example

>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'
__post_init__()

Validate configuration parameters after initialization.

allowed_charset: str = 'abcdefghijklmnopqrstuvwxyz'
max_length: int = 20
min_length: int = 2
unicode_form: str = 'NFKD'
class build_tools.pyphen_syllable_normaliser.NormalizationResult(config, stats, frequencies, unique_syllables, input_files, output_dir, timestamp=<factory>, raw_file=PosixPath('syllables_raw.txt'), canonical_file=PosixPath('syllables_canonicalised.txt'), frequency_file=PosixPath('syllables_frequencies.json'), unique_file=PosixPath('syllables_unique.txt'), meta_file=PosixPath('normalization_meta.txt'))

Bases: object

Complete result from the syllable normalization pipeline.

This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.

config

Configuration used for normalization

stats

Statistics from the processing

frequencies

Dictionary mapping canonical syllable to frequency count

unique_syllables

Sorted list of unique canonical syllables

input_files

List of input file paths that were processed

output_dir

Directory where output files were saved

timestamp

When the normalization was performed

raw_file

Path to raw aggregated file (syllables_raw.txt)

canonical_file

Path to canonicalized file (syllables_canonicalised.txt)

frequency_file

Path to frequency JSON (syllables_frequencies.json)

unique_file

Path to unique syllables (syllables_unique.txt)

meta_file

Path to metadata report (normalization_meta.txt)

Example

>>> from datetime import datetime
>>> from pathlib import Path
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000
canonical_file: Path = PosixPath('syllables_canonicalised.txt')
config: NormalizationConfig
format_metadata()

Format normalization metadata as a human-readable string.

Creates a detailed report including statistics, rejection breakdown, and top frequencies.

Return type:

str

Returns:

Multi-line string containing all normalization metadata formatted for display or file output.

Example

>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp:           2026-01-05 17:30:22
...
frequencies: dict[str, int]
frequency_file: Path = PosixPath('syllables_frequencies.json')
input_files: list[Path]
meta_file: Path = PosixPath('normalization_meta.txt')
output_dir: Path
raw_file: Path = PosixPath('syllables_raw.txt')
stats: NormalizationStats
timestamp: datetime
unique_file: Path = PosixPath('syllables_unique.txt')
unique_syllables: list[str]
class build_tools.pyphen_syllable_normaliser.NormalizationStats(raw_count=0, after_canonicalization=0, rejected_charset=0, rejected_length=0, rejected_empty=0, unique_canonical=0, processing_time=0.0)

Bases: object

Statistics from the syllable normalization process.

This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.

raw_count

Total number of syllables in raw input (before normalization)

after_canonicalization

Number of syllables after normalization

rejected_charset

Syllables rejected due to invalid characters

rejected_length

Syllables rejected due to length constraints

rejected_empty

Syllables that became empty after normalization

unique_canonical

Number of unique canonical syllables

processing_time

Total processing time in seconds

Example

>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0
after_canonicalization: int = 0
processing_time: float = 0.0
raw_count: int = 0
rejected_charset: int = 0
rejected_empty: int = 0
rejected_length: int = 0
property rejection_rate: float

Calculate rejection rate as percentage of raw count.

property total_rejected: int

Calculate total number of rejected syllables.

unique_canonical: int = 0
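
The two properties above can be sketched as follows. Stats is a hypothetical stand-in for NormalizationStats; it assumes that total_rejected is the sum of the three rejected_* counters and that an empty input yields a rate of 0.0, neither of which is spelled out in the documentation.

```python
from dataclasses import dataclass


@dataclass
class Stats:  # hypothetical stand-in for NormalizationStats
    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        if self.raw_count == 0:  # avoid division by zero on empty input
            return 0.0
        return self.total_rejected / self.raw_count * 100
```

With raw_count=1000, rejected_charset=30 and rejected_length=20, this gives 50 rejections and a rate of 5.0, matching the example above.
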
class build_tools.pyphen_syllable_normaliser.SyllableNormalizer(config)

Bases: object

Normalizes syllables to canonical form.

This class applies a multi-step normalization pipeline to transform raw syllables into a standardized canonical representation. The pipeline includes Unicode normalization, diacritic removal, case normalization, and validation against charset and length constraints.

config

Configuration specifying normalization parameters such as allowed charset, length constraints, and Unicode normalization form.

Example

>>> from build_tools.pyphen_syllable_normaliser import NormalizationConfig
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize("x") is None  # Too short
True
>>> normalizer.normalize("résumé123") is None  # Invalid characters
True
__init__(config)

Initialize normalizer with configuration.

Parameters:

config (NormalizationConfig) – NormalizationConfig instance specifying normalization parameters including charset, length constraints, and Unicode normalization form.

Example

>>> config = NormalizationConfig(
...     min_length=3,
...     max_length=10,
...     allowed_charset="abcdefghijklmnopqrstuvwxyz",
...     unicode_form="NFKD"
... )
>>> normalizer = SyllableNormalizer(config)
normalize(syllable)

Normalize a single syllable to canonical form.

Applies the complete normalization pipeline:

  1. Unicode normalization (NFKD by default)

  2. Strip diacritics (remove combining characters)

  3. Lowercase conversion

  4. Trim whitespace

  5. Validate charset (only allowed characters)

  6. Check length constraints

Parameters:

syllable (str) – Raw syllable string to normalize.

Return type:

str | None

Returns:

Normalized canonical syllable string, or None if the syllable is rejected due to:

  • Becoming empty after normalization

  • Containing invalid characters

  • Not meeting length constraints

Example

>>> config = NormalizationConfig()
>>> normalizer = SyllableNormalizer(config)
>>> normalizer.normalize("Café")
'cafe'
>>> normalizer.normalize("  HELLO  ")
'hello'
>>> normalizer.normalize("résumé")
'resume'
>>> normalizer.normalize("") is None  # Empty
True
>>> normalizer.normalize("x") is None  # Too short (min_length=2)
True
>>> normalizer.normalize("hello123") is None  # Invalid chars
True
strip_diacritics(text)

Remove diacritics (accent marks) from Unicode text.

Uses Unicode normalization (NFD/NFKD) to decompose characters into base characters and combining marks, then removes the combining marks. This converts accented characters like ‘é’ → ‘e’, ‘ñ’ → ‘n’, etc.

Parameters:

text (str) – Unicode string potentially containing diacritics.

Return type:

str

Returns:

String with all combining diacritical marks removed.

Example

>>> normalizer = SyllableNormalizer(NormalizationConfig())
>>> normalizer.strip_diacritics("café")
'cafe'
>>> normalizer.strip_diacritics("naïve")
'naive'
>>> normalizer.strip_diacritics("Zürich")
'Zurich'
>>> normalizer.strip_diacritics("São Paulo")
'Sao Paulo'

Note

This method assumes the text has already been normalized to NFD or NFKD form. The normalize() method handles this automatically.
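
The precondition in the note can be seen with a small experiment. The helper below is hypothetical and only filters combining marks; it is a sketch of the general technique, not the method's actual body, which may handle decomposition itself.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Hypothetical helper: drop characters with a non-zero combining class."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# Precomposed form: 'é' is the single code point U+00E9.
precomposed = unicodedata.normalize("NFC", "café")
# Decomposed form: 'e' followed by the combining acute accent U+0301.
decomposed = unicodedata.normalize("NFD", "café")

# Nothing to remove in precomposed text, so the accent survives.
unchanged = strip_combining_marks(precomposed)
# In decomposed text the accent is a separate code point and is dropped.
stripped = strip_combining_marks(decomposed)
```

Here unchanged still contains the accented character, while stripped is plain ASCII, which is why decomposition has to happen before the marks are filtered.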

build_tools.pyphen_syllable_normaliser.create_argument_parser()[source]

Create and return the argument parser for pyphen syllable normaliser.

Return type:

ArgumentParser

Returns:

Configured ArgumentParser ready to parse command-line arguments.

build_tools.pyphen_syllable_normaliser.discover_input_files(source_dir, pattern='*.txt', recursive=False)[source]

Discover input files in a directory matching a pattern.

Scans a directory for files matching the specified glob pattern. Returns files in sorted order for deterministic processing.

Parameters:
  • source_dir (Path) – Directory to scan for input files.

  • pattern (str) – Glob pattern for matching files. Default: "*.txt".

  • recursive (bool) – If True, scan subdirectories recursively using "**/" prefix. Default: False (only scan the immediate directory).

Return type:

list[Path]

Returns:

Sorted list of Path objects for all matching files.

Raises:

Example

>>> from pathlib import Path
>>> # Non-recursive scan
>>> files = discover_input_files(Path("data/"), pattern="*.txt")
>>> files
[Path('data/corpus1.txt'), Path('data/corpus2.txt')]
>>>
>>> # Recursive scan
>>> files = discover_input_files(
...     Path("data/"),
...     pattern="*.txt",
...     recursive=True
... )
>>> files
[Path('data/corpus1.txt'),
 Path('data/subdir/corpus3.txt'),
 Path('data/subdir/corpus4.txt')]

Note

Files are always returned in sorted order to ensure deterministic processing. This is critical for reproducible normalization results.
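
A sorted, optionally recursive scan of this kind can be sketched with pathlib. The function name discover_files_sketch is hypothetical, and filtering out directories with is_file() is an assumption rather than documented behaviour.

```python
from pathlib import Path


def discover_files_sketch(
    source_dir: Path, pattern: str = "*.txt", recursive: bool = False
) -> list[Path]:
    """Hypothetical sketch of a deterministic file discovery helper."""
    # rglob(pattern) is equivalent to glob("**/" + pattern).
    matches = source_dir.rglob(pattern) if recursive else source_dir.glob(pattern)
    # glob order is filesystem-dependent, so sort before returning.
    return sorted(p for p in matches if p.is_file())
```

Sorting is the step that makes the result reproducible: without it, two machines could aggregate the same files in different orders.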

build_tools.pyphen_syllable_normaliser.load_frequencies_from_file(file_path)[source]

Load frequency dictionary from JSON file.

Reads a previously saved pyphen_syllables_frequencies.json file and returns the frequency dictionary. Useful for analysis and inspection of normalization results.

Parameters:

file_path (Path) – Path to the JSON frequency file.

Return type:

dict[str, int]

Returns:

Dictionary mapping syllable to occurrence count.

Raises:

Example

>>> from pathlib import Path
>>> frequencies = load_frequencies_from_file(Path("syllables_frequencies.json"))
>>> frequencies['ka']
187
>>> len(frequencies)
412

Note

The JSON file must have been created by save_frequencies() or follow the same format: {"syllable": count, …}
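
Reading a file in that format and inspecting it might look like the sketch below. Both function names are hypothetical, UTF-8 encoding is an assumption, and the ranking helper is an illustration of one possible inspection step rather than part of the module.

```python
import json
from pathlib import Path


def load_frequencies_sketch(file_path: Path) -> dict[str, int]:
    """Hypothetical loader for a {"syllable": count} JSON file."""
    with file_path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Coerce keys and values so callers always get dict[str, int].
    return {str(syllable): int(count) for syllable, count in data.items()}


def top_syllables(frequencies: dict[str, int], n: int) -> list[tuple[str, int]]:
    """Return the n most frequent syllables, ties broken alphabetically."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```

Breaking ties alphabetically keeps the ranking deterministic, in line with the tool's reproducibility goal.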

build_tools.pyphen_syllable_normaliser.load_unique_syllables_from_file(file_path)[source]

Load unique syllables from text file.

Reads a previously saved pyphen_syllables_unique.txt file and returns the syllable list. Useful for loading the authoritative syllable inventory for feature annotation or analysis.

Parameters:

file_path (Path) – Path to the text file containing unique syllables.

Return type:

list[str]

Returns:

List of syllable strings (one per line from file).

Raises:

Example

>>> from pathlib import Path
>>> syllables = load_unique_syllables_from_file(Path("syllables_unique.txt"))
>>> syllables[:5]
['ka', 'mi', 'ra', 'ta', 'wa']
>>> len(syllables)
412

Note

Empty lines are skipped. Leading/trailing whitespace is stripped from each line.
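
The line handling described in the note can be sketched as follows. The function name is hypothetical and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def load_unique_syllables_sketch(file_path: Path) -> list[str]:
    """Hypothetical loader for a one-syllable-per-line text file."""
    syllables = []
    with file_path.open(encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()  # remove leading/trailing whitespace
            if stripped:             # skip empty and whitespace-only lines
                syllables.append(stripped)
    return syllables
```

File order is preserved, so a file that was written sorted is read back sorted.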

build_tools.pyphen_syllable_normaliser.main(args=None)[source]

Main entry point for CLI.

Parameters:

args (list[str] | None) – Command-line arguments (for testing). If None, uses sys.argv.

Return type:

int

Returns:

Exit code (0 for success, 1 for error).
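
The args=None convention and the integer exit code follow a common argparse pattern, sketched below. The function name main_sketch and the --min-length option are invented for this example and are not the tool's real command-line interface.

```python
import argparse
import sys
from typing import Optional


def main_sketch(args: Optional[list[str]] = None) -> int:
    """Hypothetical CLI entry point returning an exit code."""
    parser = argparse.ArgumentParser(prog="normaliser-sketch")
    # Hypothetical option, invented for illustration only.
    parser.add_argument("--min-length", type=int, default=2)
    # When args is None, argparse falls back to sys.argv[1:].
    parsed = parser.parse_args(args)
    try:
        if parsed.min_length < 1:
            raise ValueError("--min-length must be at least 1")
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1  # error
    return 0  # success
```

Passing an explicit list, such as main_sketch(["--min-length", "3"]), is what makes the entry point testable without touching sys.argv; a script would typically finish with sys.exit(main_sketch()).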

build_tools.pyphen_syllable_normaliser.normalize_batch(syllables, config)[source]

Normalize a batch of syllables and collect rejection statistics.

This is a convenience function for normalizing multiple syllables at once while tracking why syllables were rejected.

Parameters:
  • syllables (list[str]) – List of raw syllable strings to normalize.

  • config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.

Returns:

Tuple of (normalized_syllables, rejection_stats), where:

  • normalized_syllables: List of successfully normalized syllables

  • rejection_stats: Dictionary with rejection counts:
    • "rejected_empty": Syllables that became empty after normalization

    • "rejected_charset": Syllables with invalid characters

    • "rejected_length": Syllables outside length constraints

Return type:

tuple[list[str], dict[str, int]]

Example

>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> syllables = ["Café", "x", "Hello", "world123", "  résumé  "]
>>> normalized, stats = normalize_batch(syllables, config)
>>> normalized
['cafe', 'hello', 'resume']
>>> stats
{'rejected_empty': 0, 'rejected_charset': 1, 'rejected_length': 1}

Note

This function processes syllables in order and preserves duplicates. For frequency analysis, use the frequency.py module, which handles deduplication and counting.
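
Because duplicates are preserved, counting can happen as a separate step over the normalized list. The sketch below uses collections.Counter to show the idea; the function name is hypothetical, it is not the frequency.py implementation, and sorting the unique inventory alphabetically is an assumption.

```python
from collections import Counter


def count_frequencies_sketch(normalized: list[str]) -> tuple[dict[str, int], list[str]]:
    """Hypothetical counting step: frequencies first, then deduplication."""
    # Count every occurrence before any deduplication takes place.
    frequencies = Counter(normalized)
    # The unique inventory is the set of counted keys (sorted here by assumption).
    unique = sorted(frequencies)
    return dict(frequencies), unique


frequencies, unique = count_frequencies_sketch(["ka", "mi", "ka", "ra", "ka", "mi"])
```

The counts always sum to the length of the input list, which is a convenient sanity check that no occurrence was lost.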

build_tools.pyphen_syllable_normaliser.run_full_pipeline(run_directory, config, verbose=False, quiet=False)[source]

Run complete pyphen normalization pipeline with in-place processing.

Executes the full pyphen-specific workflow:

  1. Aggregate syllables from run_directory/syllables/*.txt

  2. Canonicalize syllables (Unicode normalization, etc.)

  3. Frequency analysis

  4. Write 5 output files to run_directory (in-place)

Parameters:
  • run_directory (Path) – Pyphen run directory (e.g., _working/output/20260110_143022_pyphen/).

  • config (NormalizationConfig) – NormalizationConfig specifying normalization parameters.

  • verbose (bool) – If True, print detailed progress information.

  • quiet (bool) – If True, suppress all output except errors.

Return type:

NormalizationResult

Returns:

NormalizationResult containing all outputs, statistics, and file paths.

Raises:
  • FileNotFoundError – If run_directory or syllables/ subdirectory doesn’t exist.

  • ValueError – If run_directory is not a directory.

Example

>>> from pathlib import Path
>>> config = NormalizationConfig(min_length=2, max_length=8)
>>> run_dir = Path("_working/output/20260110_143022_pyphen/")
>>> result = run_full_pipeline(
...     run_directory=run_dir,
...     config=config,
...     verbose=True
... )
>>> result.stats.raw_count
15234
>>> result.stats.unique_canonical
4821
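
The directory checks and the aggregation step can be sketched as below. This is an illustration under stated assumptions, not the pipeline's code: the function name is hypothetical, input files are assumed to be UTF-8 with one syllable per line, and blank lines are assumed to be skipped.

```python
from pathlib import Path


def aggregate_run_directory_sketch(run_directory: Path) -> list[str]:
    """Hypothetical sketch of validation plus step 1 (aggregation)."""
    if not run_directory.exists():
        raise FileNotFoundError(f"Run directory not found: {run_directory}")
    if not run_directory.is_dir():
        raise ValueError(f"Not a directory: {run_directory}")
    syllables_dir = run_directory / "syllables"
    if not syllables_dir.is_dir():
        raise FileNotFoundError(f"Missing syllables/ subdirectory: {syllables_dir}")

    raw: list[str] = []
    # Sorted file order keeps the aggregate deterministic.
    for path in sorted(syllables_dir.glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                raw.append(line)  # keep every occurrence, duplicates included
    return raw
```

The returned list corresponds conceptually to the aggregated raw output, which later steps canonicalize and count.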