build_tools.pyphen_syllable_normaliser.models

Data models for syllable normalization.

This module defines the data structures used to represent normalization configuration, statistics, and results.

Classes

NormalizationConfig

Configuration for the syllable normalization process.

NormalizationStats

Statistics from the syllable normalization process.

FrequencyEntry

Single syllable with frequency and ranking information.

NormalizationResult

Complete result from the syllable normalization pipeline.

Module Contents

class build_tools.pyphen_syllable_normaliser.models.NormalizationConfig

Configuration for the syllable normalization process.

This dataclass stores all parameters that control how syllables are normalized to canonical form.

min_length

Minimum syllable length (characters). Syllables shorter than this are rejected. Default: 2

max_length

Maximum syllable length (characters). Syllables longer than this are rejected. Default: 20

allowed_charset

String of allowed characters. Only syllables that consist entirely of these characters (after normalization) are kept. Default: “abcdefghijklmnopqrstuvwxyz”

unicode_form

Unicode normalization form. Options: “NFC”, “NFD”, “NFKC”, “NFKD”. Default: “NFKD” (compatibility decomposition)

Example

>>> config = NormalizationConfig(min_length=3, max_length=10)
>>> config.min_length
3
>>> config.allowed_charset
'abcdefghijklmnopqrstuvwxyz'

min_length: int = 2
max_length: int = 20
allowed_charset: str = 'abcdefghijklmnopqrstuvwxyz'
unicode_form: str = 'NFKD'
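
To illustrate how these four parameters could interact, here is a minimal sketch. The function `normalize_syllable` is hypothetical and is not part of this module; the lowercasing, whitespace stripping, and removal of combining marks are assumptions about what "canonical form" means, and the dataclass is a stand-in that only mirrors the documented fields and defaults.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalizationConfig:
    # Stand-in mirroring the documented fields and defaults.
    min_length: int = 2
    max_length: int = 20
    allowed_charset: str = "abcdefghijklmnopqrstuvwxyz"
    unicode_form: str = "NFKD"


def normalize_syllable(raw: str, config: NormalizationConfig) -> Optional[str]:
    """Hypothetical helper: return the canonical form, or None if rejected."""
    # Assumption: input is trimmed and lowercased before Unicode normalization.
    text = unicodedata.normalize(config.unicode_form, raw.strip().lower())
    # Assumption: combining marks produced by decomposition are dropped.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if not text:
        return None  # became empty after normalization
    if any(ch not in config.allowed_charset for ch in text):
        return None  # contains a character outside allowed_charset
    if not (config.min_length <= len(text) <= config.max_length):
        return None  # violates the length constraints
    return text
```

With the defaults, an accented input decomposes under NFKD and loses its combining mark, while a one-character input falls below `min_length` and is rejected.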
class build_tools.pyphen_syllable_normaliser.models.NormalizationStats

Statistics from the syllable normalization process.

This dataclass tracks counts and metrics throughout the normalization pipeline, useful for understanding data quality and processing results.

raw_count

Total number of syllables in raw input (before normalization)

after_canonicalization

Number of syllables after normalization

rejected_charset

Syllables rejected due to invalid characters

rejected_length

Syllables rejected due to length constraints

rejected_empty

Syllables that became empty after normalization

unique_canonical

Number of unique canonical syllables

processing_time

Total processing time in seconds

Example

>>> stats = NormalizationStats(
...     raw_count=1000,
...     after_canonicalization=950,
...     rejected_charset=30,
...     rejected_length=20,
...     rejected_empty=0,
...     unique_canonical=412,
...     processing_time=1.5
... )
>>> stats.rejection_rate
5.0

raw_count: int = 0
after_canonicalization: int = 0
rejected_charset: int = 0
rejected_length: int = 0
rejected_empty: int = 0
unique_canonical: int = 0
processing_time: float = 0.0
property total_rejected: int

Calculate total number of rejected syllables.

property rejection_rate: float

Calculate rejection rate as percentage of raw count.
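
The two properties can be sketched as follows. This is an assumption about how they might be computed, chosen to agree with the example above (30 + 20 + 0 rejected out of 1000 gives 5.0); the zero-input guard is also an assumption, and the dataclass is a stand-in holding only the fields the properties need.

```python
from dataclasses import dataclass


@dataclass
class NormalizationStats:
    # Stand-in with only the fields the two properties depend on.
    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        # Sum of the three documented rejection counters.
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Assumption: empty input yields 0.0 instead of raising ZeroDivisionError.
        if self.raw_count == 0:
            return 0.0
        return 100 * self.total_rejected / self.raw_count


stats = NormalizationStats(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.total_rejected)  # 50
print(stats.rejection_rate)  # 5.0
```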

class build_tools.pyphen_syllable_normaliser.models.FrequencyEntry

Single syllable with frequency and ranking information.

This dataclass represents one syllable in the frequency analysis, including its occurrence count and relative ranking.

canonical

The canonical form of the syllable (e.g., “ka”)

frequency

Number of times this syllable appears

rank

Frequency rank (1 = most common, 2 = second most common, etc.)

percentage

Percentage of total syllables (0-100)

Example

>>> entry = FrequencyEntry(canonical="ka", frequency=187, rank=1, percentage=10.2)
>>> print(f"{entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
ka: 187 (10.2%)

canonical: str
frequency: int
rank: int
percentage: float
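
As an illustration of how `rank` and `percentage` relate to raw counts, the sketch below derives a list of entries from a frequency dictionary. `build_entries` is a hypothetical helper, not a function documented in this module, and the alphabetical tie-break is an assumption made only to keep the ordering deterministic.

```python
from dataclasses import dataclass


@dataclass
class FrequencyEntry:
    # Stand-in mirroring the documented fields.
    canonical: str
    frequency: int
    rank: int
    percentage: float


def build_entries(frequencies: dict[str, int]) -> list[FrequencyEntry]:
    """Hypothetical helper: rank syllables by descending frequency."""
    total = sum(frequencies.values())
    # Assumption: ties are broken alphabetically by canonical form.
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [
        FrequencyEntry(
            canonical=syllable,
            frequency=count,
            rank=rank,  # 1 = most common
            percentage=100 * count / total if total else 0.0,
        )
        for rank, (syllable, count) in enumerate(ordered, start=1)
    ]


entries = build_entries({"ra": 1, "ka": 3})
```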
class build_tools.pyphen_syllable_normaliser.models.NormalizationResult

Complete result from the syllable normalization pipeline.

This dataclass encapsulates all outputs from the normalization process, including configuration, statistics, frequencies, and file paths.

config

Configuration used for normalization

stats

Statistics from the processing

frequencies

Dictionary mapping canonical syllable to frequency count

unique_syllables

Sorted list of unique canonical syllables

input_files

List of input file paths that were processed

output_dir

Directory where output files were saved

timestamp

When the normalization was performed

raw_file

Path to raw aggregated file (syllables_raw.txt)

canonical_file

Path to canonicalized file (syllables_canonicalised.txt)

frequency_file

Path to frequency JSON (syllables_frequencies.json)

unique_file

Path to unique syllables (syllables_unique.txt)

meta_file

Path to metadata report (normalization_meta.txt)

Example

>>> from datetime import datetime
>>> from pathlib import Path
>>> result = NormalizationResult(
...     config=NormalizationConfig(),
...     stats=NormalizationStats(raw_count=1000),
...     frequencies={"ka": 187, "ra": 162},
...     unique_syllables=["ka", "ra"],
...     input_files=[Path("file1.txt")],
...     output_dir=Path("_working/normalized"),
...     timestamp=datetime.now(),
...     raw_file=Path("syllables_raw.txt"),
...     canonical_file=Path("syllables_canonicalised.txt"),
...     frequency_file=Path("syllables_frequencies.json"),
...     unique_file=Path("syllables_unique.txt"),
...     meta_file=Path("normalization_meta.txt")
... )
>>> result.stats.raw_count
1000

config: NormalizationConfig
stats: NormalizationStats
frequencies: dict[str, int]
unique_syllables: list[str]
input_files: list[pathlib.Path]
output_dir: pathlib.Path
timestamp: datetime.datetime
raw_file: pathlib.Path
canonical_file: pathlib.Path
frequency_file: pathlib.Path
unique_file: pathlib.Path
meta_file: pathlib.Path
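
The `frequencies` and `unique_syllables` fields are closely related: one maps each canonical syllable to its count, the other lists the same keys in sorted order. A minimal sketch of producing both from a flat list of canonical syllables, assuming one list item per occurrence (the variable `canonical_syllables` is illustrative and not part of the module):

```python
from collections import Counter

# Assumption: canonical syllables arrive as a flat list, one item per occurrence.
canonical_syllables = ["ka", "ra", "ka", "mi", "ka", "ra"]

# Mapping of canonical syllable to frequency count.
frequencies = dict(Counter(canonical_syllables))

# Sorted list of the unique canonical syllables.
unique_syllables = sorted(frequencies)
```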
format_metadata()

Format normalization metadata as a human-readable string.

Creates a detailed report including statistics, rejection breakdown, and top frequencies.

Returns:

Multi-line string containing all normalization metadata formatted for display or file output.

Return type:

str

Example

>>> result = NormalizationResult(...)
>>> print(result.format_metadata())
======================================================================
SYLLABLE NORMALIZATION METADATA
======================================================================
Timestamp:           2026-01-05 17:30:22
...