build_tools.pyphen_syllable_normaliser.models

Data models for syllable normalization.

This module defines the data structures used to represent normalization configuration, statistics, and results.

Classes

`NormalizationConfig`	Configuration for syllable normalization process.
`NormalizationStats`	Statistics from the syllable normalization process.
`FrequencyEntry`	Single syllable with frequency and ranking information.
`NormalizationResult`	Complete result from the syllable normalization pipeline.

Module Contents

class build_tools.pyphen_syllable_normaliser.models.NormalizationConfig

Configuration for syllable normalization process.

This dataclass stores all parameters that control how syllables are normalized to canonical form.

min_length: Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2

max_length: Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20

allowed_charset: String of allowed characters. Only syllables containing these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”

unicode_form: Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)

Example

>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'

min_length: int = 2

max_length: int = 20

allowed_charset: str = 'abcdefghijklmnopqrstuvwxyz'

unicode_form: str = 'NFKD'
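As a rough illustration of how these four parameters could interact, the sketch below applies them to a single raw syllable. `Config` and `canonicalize` are hypothetical names, not part of this module, and the order of steps (Unicode normalization, dropping combining marks, lowercasing, then the charset and length checks) is an assumption about the pipeline, not its documented behaviour.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    """Hypothetical stand-in mirroring the documented NormalizationConfig fields."""
    min_length: int = 2
    max_length: int = 20
    allowed_charset: str = "abcdefghijklmnopqrstuvwxyz"
    unicode_form: str = "NFKD"


def canonicalize(raw: str, config: Config) -> Optional[str]:
    """Return a canonical syllable, or None if it would be rejected (assumed logic)."""
    text = unicodedata.normalize(config.unicode_form, raw)
    # Assumption: combining marks left behind by decomposition are dropped.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.strip().lower()
    if not text:
        return None  # became empty after normalization
    if any(ch not in config.allowed_charset for ch in text):
        return None  # contains a character outside allowed_charset
    if not config.min_length <= len(text) <= config.max_length:
        return None  # outside the configured length bounds
    return text


config = Config(min_length=2, max_length=10)
results = [canonicalize(s, config) for s in ["Café", "ka", "a", "x-ray"]]
print(results)
```

With the default “NFKD” form, the accented input decomposes into a base letter plus a combining mark, so it can survive the charset check once the mark is dropped. Under “NFC” the same input would keep its precomposed character and be rejected by this sketch.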

class build_tools.pyphen_syllable_normaliser.models.NormalizationStats

Statistics from the syllable normalization process.

This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.

raw_count: Total number of syllables in raw input (before normalization)

after_canonicalization: Number of syllables after normalization

rejected_charset: Syllables rejected due to invalid characters

rejected_length: Syllables rejected due to length constraints

rejected_empty: Syllables that became empty after normalization

unique_canonical: Number of unique canonical syllables

processing_time: Total processing time in seconds

Example

>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0

raw_count: int = 0

after_canonicalization: int = 0

rejected_charset: int = 0

rejected_length: int = 0

rejected_empty: int = 0

unique_canonical: int = 0

processing_time: float = 0.0

property total_rejected: int

Calculate total number of rejected syllables.

property rejection_rate: float

Calculate rejection rate as percentage of raw count.
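The two properties are not spelled out here, but the example above (30 + 20 + 0 rejected out of 1000 raw syllables giving 5.0) is consistent with the following minimal sketch. `Stats` is a hypothetical stand-in, and both the summation and the zero-input guard are assumptions, not the module's confirmed implementation.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Hypothetical stand-in with the rejection counters documented above."""
    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        # Assumption: the total is the sum of the three rejection counters.
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Assumption: an empty input yields 0.0 rather than a division error.
        if self.raw_count == 0:
            return 0.0
        return 100 * self.total_rejected / self.raw_count


stats = Stats(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.total_rejected, stats.rejection_rate)
```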

class build_tools.pyphen_syllable_normaliser.models.FrequencyEntry

Single syllable with frequency and ranking information.

This dataclass represents one syllable in the frequency analysis, including its occurrence count and relative ranking.

canonical: The canonical form of the syllable (e.g., “ka”)

frequency: Number of times this syllable appears

rank: Frequency rank (1 = most common, 2 = second most common, etc.)

percentage: Percentage of total syllables (0-100)

Example

>>> entry = FrequencyEntry(canonical="ka", frequency=187, rank=1, percentage=10.2)
>>> print(f"{entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
ka: 187 (10.2%)

canonical: str

frequency: int

rank: int

percentage: float
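To show how the rank and percentage fields relate to raw counts, here is a hedged sketch that derives ranked entries from a syllable-to-count mapping. `Entry` and `rank_frequencies` are hypothetical names, and the alphabetical tie-break is an assumption made only so the ordering is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in mirroring the documented FrequencyEntry fields."""
    canonical: str
    frequency: int
    rank: int
    percentage: float


def rank_frequencies(frequencies: dict[str, int]) -> list[Entry]:
    """Build ranked entries from a syllable-to-count mapping (assumed logic)."""
    total = sum(frequencies.values())
    # Most frequent first; ties are broken alphabetically (an assumption).
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [
        Entry(
            canonical=syllable,
            frequency=count,
            rank=position,
            percentage=100 * count / total if total else 0.0,
        )
        for position, (syllable, count) in enumerate(ordered, start=1)
    ]


entries = rank_frequencies({"ra": 30, "ka": 50, "ti": 20})
for entry in entries:
    print(f"{entry.rank}. {entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
```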

class build_tools.pyphen_syllable_normaliser.models.NormalizationResult

Complete result from the syllable normalization pipeline.

This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.

config: Configuration used for normalization

stats: Statistics from the processing

frequencies: Dictionary mapping canonical syllable to frequency count

unique_syllables: Sorted list of unique canonical syllables

input_files: List of input file paths that were processed

output_dir: Directory where output files were saved

timestamp: When the normalization was performed

raw_file: Path to raw aggregated file (syllables_raw.txt)

canonical_file: Path to canonicalized file (syllables_canonicalised.txt)

frequency_file: Path to frequency JSON (syllables_frequencies.json)

unique_file: Path to unique syllables (syllables_unique.txt)

meta_file: Path to metadata report (normalization_meta.txt)

Example

>>> from datetime import datetime
>>> from pathlib import Path
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000

config: NormalizationConfig

stats: NormalizationStats

frequencies: dict[str, int]

unique_syllables: list[str]

input_files: list[pathlib.Path]

output_dir: pathlib.Path

timestamp: datetime.datetime

raw_file: pathlib.Path

canonical_file: pathlib.Path

frequency_file: pathlib.Path

unique_file: pathlib.Path

meta_file: pathlib.Path

format_metadata()

Format normalization metadata as a human-readable string.

Creates a detailed report including statistics, rejection breakdown, and top frequencies.

Returns: Multi-line string containing all normalization metadata formatted for display or file output.
Return type: str

Example

>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp:           2026-01-05 17:30:22
...
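The full report layout is elided above, so the following is only a sketch of how a report in this style might be assembled. `format_report`, its parameters, the row labels other than the timestamp, and the column widths are all hypothetical; only the banner title and the timestamp format are taken from the sample output.

```python
from datetime import datetime


def format_report(timestamp: datetime, raw_count: int, total_rejected: int,
                  frequencies: dict[str, int], top_n: int = 3) -> str:
    """Render a plain-text summary (hypothetical layout, not the module's own)."""

    def row(label: str, value: str) -> str:
        # Left-align the label in a fixed-width column (width is an assumption).
        return f"{label:<21}{value}"

    rule = "=" * 70
    lines = [
        rule,
        "SYLLABLE NORMALIZATION METADATA",
        rule,
        row("Timestamp:", timestamp.strftime("%Y-%m-%d %H:%M:%S")),
        row("Raw syllables:", f"{raw_count:,}"),
        row("Rejected:", f"{total_rejected:,}"),
        "",
        "Top syllables:",
    ]
    # Most frequent first, limited to top_n entries.
    top = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    lines.extend(f"  {syllable:<10}{count:>8,}" for syllable, count in top)
    return "\n".join(lines)


report = format_report(
    timestamp=datetime(2026, 1, 5, 17, 30, 22),
    raw_count=1000,
    total_rejected=50,
    frequencies={"ka": 187, "ra": 162},
)
print(report)
```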