build_tools.pyphen_syllable_normaliser.models
=============================================

.. py:module:: build_tools.pyphen_syllable_normaliser.models

.. autoapi-nested-parse::

   Data models for syllable normalization.

   This module defines the data structures used to represent normalization
   configuration, statistics, and results.


Classes
-------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.models.NormalizationConfig
   build_tools.pyphen_syllable_normaliser.models.NormalizationStats
   build_tools.pyphen_syllable_normaliser.models.FrequencyEntry
   build_tools.pyphen_syllable_normaliser.models.NormalizationResult


Module Contents
---------------

.. py:class:: NormalizationConfig

   Configuration for the syllable normalization process.

   This dataclass stores all parameters that control how syllables are
   normalized to canonical form.

   .. attribute:: min_length

      Minimum syllable length (characters). Syllables shorter than this
      are rejected. Default: 2

   .. attribute:: max_length

      Maximum syllable length (characters). Syllables longer than this
      are rejected. Default: 20

   .. attribute:: allowed_charset

      String of allowed characters. Only syllables consisting solely of
      these characters (after normalization) are kept.
      Default: "abcdefghijklmnopqrstuvwxyz"

   .. attribute:: unicode_form

      Unicode normalization form. Options: "NFC", "NFD", "NFKC", "NFKD".
      Default: "NFKD" (compatibility decomposition)

   .. admonition:: Example

      >>> config = NormalizationConfig(min_length=3, max_length=10)
      >>> config.min_length
      3
      >>> config.allowed_charset
      'abcdefghijklmnopqrstuvwxyz'


   .. py:attribute:: min_length
      :type: int
      :value: 2


   .. py:attribute:: max_length
      :type: int
      :value: 20


   .. py:attribute:: allowed_charset
      :type: str
      :value: 'abcdefghijklmnopqrstuvwxyz'


   .. py:attribute:: unicode_form
      :type: str
      :value: 'NFKD'


.. py:class:: NormalizationStats

   Statistics from the syllable normalization process.

   This dataclass tracks counts and metrics throughout the normalization
   pipeline, useful for understanding data quality and processing results.

   .. attribute:: raw_count

      Total number of syllables in raw input (before normalization)

   .. attribute:: after_canonicalization

      Number of syllables remaining after normalization

   .. attribute:: rejected_charset

      Syllables rejected due to invalid characters

   .. attribute:: rejected_length

      Syllables rejected due to length constraints

   .. attribute:: rejected_empty

      Syllables that became empty after normalization

   .. attribute:: unique_canonical

      Number of unique canonical syllables

   .. attribute:: processing_time

      Total processing time in seconds

   .. admonition:: Example

      >>> stats = NormalizationStats(
      ...     raw_count=1000,
      ...     after_canonicalization=950,
      ...     rejected_charset=30,
      ...     rejected_length=20,
      ...     rejected_empty=0,
      ...     unique_canonical=412,
      ...     processing_time=1.5
      ... )
      >>> stats.rejection_rate
      5.0


   .. py:attribute:: raw_count
      :type: int
      :value: 0


   .. py:attribute:: after_canonicalization
      :type: int
      :value: 0


   .. py:attribute:: rejected_charset
      :type: int
      :value: 0


   .. py:attribute:: rejected_length
      :type: int
      :value: 0


   .. py:attribute:: rejected_empty
      :type: int
      :value: 0


   .. py:attribute:: unique_canonical
      :type: int
      :value: 0


   .. py:attribute:: processing_time
      :type: float
      :value: 0.0


   .. py:property:: total_rejected
      :type: int

      Calculate the total number of rejected syllables.


   .. py:property:: rejection_rate
      :type: float

      Calculate the rejection rate as a percentage of the raw count.


.. py:class:: FrequencyEntry

   Single syllable with frequency and ranking information.

   This dataclass represents one syllable in the frequency analysis,
   including its occurrence count and relative ranking.

   .. attribute:: canonical

      The canonical form of the syllable (e.g., "ka")

   .. attribute:: frequency

      Number of times this syllable appears

   .. attribute:: rank

      Frequency rank (1 = most common, 2 = second most common, etc.)

   .. attribute:: percentage

      Percentage of total syllables (0-100)

   .. admonition:: Example

      >>> entry = FrequencyEntry(canonical="ka", frequency=187, rank=1, percentage=10.2)
      >>> print(f"{entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
      ka: 187 (10.2%)


   .. py:attribute:: canonical
      :type: str


   .. py:attribute:: frequency
      :type: int


   .. py:attribute:: rank
      :type: int


   .. py:attribute:: percentage
      :type: float


.. py:class:: NormalizationResult

   Complete result from the syllable normalization pipeline.

   This dataclass encapsulates all outputs from the normalization process,
   including configuration, statistics, frequencies, and file paths.

   .. attribute:: config

      Configuration used for normalization

   .. attribute:: stats

      Statistics from the processing

   .. attribute:: frequencies

      Dictionary mapping canonical syllable to frequency count

   .. attribute:: unique_syllables

      Sorted list of unique canonical syllables

   .. attribute:: input_files

      List of input file paths that were processed

   .. attribute:: output_dir

      Directory where output files were saved

   .. attribute:: timestamp

      When the normalization was performed

   .. attribute:: raw_file

      Path to raw aggregated file (syllables_raw.txt)

   .. attribute:: canonical_file

      Path to canonicalized file (syllables_canonicalised.txt)

   .. attribute:: frequency_file

      Path to frequency JSON (syllables_frequencies.json)

   .. attribute:: unique_file

      Path to unique syllables (syllables_unique.txt)

   .. attribute:: meta_file

      Path to metadata report (normalization_meta.txt)

   .. admonition:: Example

      >>> from datetime import datetime
      >>> from pathlib import Path
      >>> result = NormalizationResult(
      ...     config=NormalizationConfig(),
      ...     stats=NormalizationStats(raw_count=1000),
      ...     frequencies={"ka": 187, "ra": 162},
      ...     unique_syllables=["ka", "ra"],
      ...     input_files=[Path("file1.txt")],
      ...     output_dir=Path("_working/normalized"),
      ...     timestamp=datetime.now(),
      ...     raw_file=Path("syllables_raw.txt"),
      ...     canonical_file=Path("syllables_canonicalised.txt"),
      ...     frequency_file=Path("syllables_frequencies.json"),
      ...     unique_file=Path("syllables_unique.txt"),
      ...     meta_file=Path("normalization_meta.txt")
      ... )
      >>> result.stats.raw_count
      1000


   .. py:attribute:: config
      :type: NormalizationConfig


   .. py:attribute:: stats
      :type: NormalizationStats


   .. py:attribute:: frequencies
      :type: Dict[str, int]


   .. py:attribute:: unique_syllables
      :type: List[str]


   .. py:attribute:: input_files
      :type: List[pathlib.Path]


   .. py:attribute:: output_dir
      :type: pathlib.Path


   .. py:attribute:: timestamp
      :type: datetime.datetime


   .. py:attribute:: raw_file
      :type: pathlib.Path


   .. py:attribute:: canonical_file
      :type: pathlib.Path


   .. py:attribute:: frequency_file
      :type: pathlib.Path


   .. py:attribute:: unique_file
      :type: pathlib.Path


   .. py:attribute:: meta_file
      :type: pathlib.Path


   .. py:method:: format_metadata()

      Format normalization metadata as a human-readable string.

      Creates a detailed report including statistics, rejection breakdown,
      and top frequencies.

      :returns: Multi-line string containing all normalization metadata
                formatted for display or file output.

      .. admonition:: Example

         >>> result = NormalizationResult(...)
         >>> print(result.format_metadata())
         ======================================================================
         SYLLABLE NORMALIZATION METADATA
         ======================================================================
         Timestamp: 2026-01-05 17:30:22
         ...
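
The ``total_rejected`` and ``rejection_rate`` properties of ``NormalizationStats`` are documented only by their one-line summaries, so the following is a minimal sketch of how they could be derived from the stored counters. ``StatsSketch`` is a hypothetical stand-in, not the project's class. It assumes that the total is the sum of the three ``rejected_*`` fields and that an empty input yields a rate of 0.0; neither is confirmed by the reference above, although the first matches the documented example (30 + 20 + 0 rejected out of 1000 gives 5.0).

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Hypothetical stand-in for NormalizationStats (not the project's code)."""

    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        # Assumption: the total is the sum of the three rejection counters.
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Assumption: an empty input reports 0.0 instead of dividing by zero.
        if self.raw_count == 0:
            return 0.0
        return 100.0 * self.total_rejected / self.raw_count


stats = StatsSketch(raw_count=1000, rejected_charset=30, rejected_length=20)
print(stats.total_rejected)  # 50
print(stats.rejection_rate)  # 5.0
```

Keeping both values as read-only properties means they cannot drift out of sync with the counters they are computed from.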