build_tools.pyphen_syllable_normaliser.models
=============================================

.. py:module:: build_tools.pyphen_syllable_normaliser.models

.. autoapi-nested-parse::

   Data models for syllable normalization.

   This module defines the data structures used to represent normalization
   configuration, statistics, and results.


Classes
-------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.models.NormalizationConfig
   build_tools.pyphen_syllable_normaliser.models.NormalizationStats
   build_tools.pyphen_syllable_normaliser.models.FrequencyEntry
   build_tools.pyphen_syllable_normaliser.models.NormalizationResult


Module Contents
---------------

.. py:class:: NormalizationConfig

   Configuration for the syllable normalization process.

   This dataclass stores all parameters that control how syllables are
   normalized to canonical form.

   .. attribute:: min_length

      Minimum syllable length (characters). Syllables shorter
      than this are rejected. Default: 2

   .. attribute:: max_length

      Maximum syllable length (characters). Syllables longer
      than this are rejected. Default: 20

   .. attribute:: allowed_charset

      String of allowed characters. A syllable is kept only if
      every character in it (after normalization) appears in this string.
      Default: "abcdefghijklmnopqrstuvwxyz"

   .. attribute:: unicode_form

      Unicode normalization form. Options: "NFC", "NFD",
      "NFKC", "NFKD". Default: "NFKD" (compatibility decomposition)

   .. admonition:: Example

      >>> config = NormalizationConfig(min_length=3, max_length=10)
      >>> config.min_length
      3
      >>> config.allowed_charset
      'abcdefghijklmnopqrstuvwxyz'


   .. py:attribute:: min_length
      :type:  int
      :value: 2


   .. py:attribute:: max_length
      :type:  int
      :value: 20


   .. py:attribute:: allowed_charset
      :type:  str
      :value: 'abcdefghijklmnopqrstuvwxyz'


   .. py:attribute:: unicode_form
      :type:  str
      :value: 'NFKD'
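
The sketch below shows one way the four configuration fields could drive a
per-syllable filter. It is a minimal illustration, not the module's actual
pipeline: ``ConfigSketch`` and ``canonicalise`` are hypothetical names, and the
order of steps (Unicode normalization, dropping combining marks, lowercasing,
then the charset and length checks) is an assumption.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    min_length: int = 2
    max_length: int = 20
    allowed_charset: str = "abcdefghijklmnopqrstuvwxyz"
    unicode_form: str = "NFKD"


def canonicalise(syllable: str, config: ConfigSketch) -> Optional[str]:
    """Return a canonical form, or None if the syllable is rejected.

    The steps are assumed, not taken from the real implementation.
    """
    # Decompose characters according to the configured Unicode form.
    text = unicodedata.normalize(config.unicode_form, syllable)
    # Drop combining marks (e.g. the accent left over from decomposing "é").
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    if not text:
        return None  # became empty after normalization
    if any(ch not in config.allowed_charset for ch in text):
        return None  # contains a character outside allowed_charset
    if not config.min_length <= len(text) <= config.max_length:
        return None  # outside the [min_length, max_length] range
    return text
```

Under these assumptions, ``canonicalise("Café", ConfigSketch())`` returns
``"cafe"``, while ``"a"`` (too short) and ``"ka-"`` (disallowed character) are
both rejected.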


.. py:class:: NormalizationStats

   Statistics from the syllable normalization process.

   This dataclass tracks counts and timing across the normalization
   pipeline. The figures help assess data quality and processing results.

   .. attribute:: raw_count

      Total number of syllables in raw input (before normalization)

   .. attribute:: after_canonicalization

      Number of syllables remaining after canonicalization

   .. attribute:: rejected_charset

      Syllables rejected due to invalid characters

   .. attribute:: rejected_length

      Syllables rejected due to length constraints

   .. attribute:: rejected_empty

      Syllables that became empty after normalization

   .. attribute:: unique_canonical

      Number of unique canonical syllables

   .. attribute:: processing_time

      Total processing time in seconds

   .. admonition:: Example

      >>> stats = NormalizationStats(
      ...     raw_count=1000,
      ...     after_canonicalization=950,
      ...     rejected_charset=30,
      ...     rejected_length=20,
      ...     rejected_empty=0,
      ...     unique_canonical=412,
      ...     processing_time=1.5
      ... )
      >>> stats.rejection_rate
      5.0


   .. py:attribute:: raw_count
      :type:  int
      :value: 0


   .. py:attribute:: after_canonicalization
      :type:  int
      :value: 0


   .. py:attribute:: rejected_charset
      :type:  int
      :value: 0


   .. py:attribute:: rejected_length
      :type:  int
      :value: 0


   .. py:attribute:: rejected_empty
      :type:  int
      :value: 0


   .. py:attribute:: unique_canonical
      :type:  int
      :value: 0


   .. py:attribute:: processing_time
      :type:  float
      :value: 0.0


   .. py:property:: total_rejected
      :type: int


      Total number of rejected syllables: the sum of ``rejected_charset``,
      ``rejected_length``, and ``rejected_empty``.


   .. py:property:: rejection_rate
      :type: float


      Rejection rate as a percentage of ``raw_count`` (0-100).
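
The two derived properties can be sketched as follows. ``StatsSketch`` is a
hypothetical, trimmed-down stand-in for ``NormalizationStats``. The formulas are
inferred from the class example above (30 + 20 + 0 rejected out of 1000 gives
5.0). Returning ``0.0`` for an empty input is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Hypothetical subset of the documented statistics fields."""

    raw_count: int = 0
    rejected_charset: int = 0
    rejected_length: int = 0
    rejected_empty: int = 0

    @property
    def total_rejected(self) -> int:
        # Sum of the three rejection counters.
        return self.rejected_charset + self.rejected_length + self.rejected_empty

    @property
    def rejection_rate(self) -> float:
        # Assumption: avoid division by zero by reporting 0.0 for empty input.
        if self.raw_count == 0:
            return 0.0
        return 100.0 * self.total_rejected / self.raw_count
```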


.. py:class:: FrequencyEntry

   A single syllable with its frequency and ranking information.

   This dataclass represents one syllable in the frequency analysis,
   including its occurrence count and relative ranking.

   .. attribute:: canonical

      The canonical form of the syllable (e.g., "ka")

   .. attribute:: frequency

      Number of times this syllable appears

   .. attribute:: rank

      Frequency rank (1 = most common, 2 = second most common, etc.)

   .. attribute:: percentage

      Percentage of total syllables (0-100)

   .. admonition:: Example

      >>> entry = FrequencyEntry(canonical="ka", frequency=187, rank=1, percentage=10.2)
      >>> print(f"{entry.canonical}: {entry.frequency} ({entry.percentage:.1f}%)")
      ka: 187 (10.2%)


   .. py:attribute:: canonical
      :type:  str


   .. py:attribute:: frequency
      :type:  int


   .. py:attribute:: rank
      :type:  int


   .. py:attribute:: percentage
      :type:  float
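
As an illustration of how ranked entries might be derived from a list of
canonical syllables, the sketch below counts occurrences and assigns ranks and
percentages. ``EntrySketch`` and ``rank_syllables`` are hypothetical names, and
the alphabetical tie-break is an assumption. The real module may order ties
differently.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class EntrySketch:
    """Hypothetical stand-in with the same four fields as FrequencyEntry."""

    canonical: str
    frequency: int
    rank: int
    percentage: float


def rank_syllables(syllables: Iterable[str]) -> List[EntrySketch]:
    """Build ranked entries from canonical syllables (rank 1 = most common)."""
    counts = Counter(syllables)
    total = sum(counts.values())
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        EntrySketch(
            canonical=syllable,
            frequency=count,
            rank=rank,
            percentage=100.0 * count / total,
        )
        for rank, (syllable, count) in enumerate(ordered, start=1)
    ]
```

An empty input yields an empty list, so the percentage division never runs with
a zero total.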


.. py:class:: NormalizationResult

   Complete result from the syllable normalization pipeline.

   This dataclass encapsulates all outputs from the normalization process,
   including configuration, statistics, frequencies, and file paths.

   .. attribute:: config

      Configuration used for normalization

   .. attribute:: stats

      Statistics from the processing

   .. attribute:: frequencies

      Dictionary mapping canonical syllable to frequency count

   .. attribute:: unique_syllables

      Sorted list of unique canonical syllables

   .. attribute:: input_files

      List of input file paths that were processed

   .. attribute:: output_dir

      Directory where output files were saved

   .. attribute:: timestamp

      When the normalization was performed

   .. attribute:: raw_file

      Path to raw aggregated file (syllables_raw.txt)

   .. attribute:: canonical_file

      Path to canonicalized file (syllables_canonicalised.txt)

   .. attribute:: frequency_file

      Path to frequency JSON (syllables_frequencies.json)

   .. attribute:: unique_file

      Path to unique syllables (syllables_unique.txt)

   .. attribute:: meta_file

      Path to metadata report (normalization_meta.txt)

   .. admonition:: Example

      >>> from datetime import datetime
      >>> from pathlib import Path
      >>> result = NormalizationResult(
      ...     config=NormalizationConfig(),
      ...     stats=NormalizationStats(raw_count=1000),
      ...     frequencies={"ka": 187, "ra": 162},
      ...     unique_syllables=["ka", "ra"],
      ...     input_files=[Path("file1.txt")],
      ...     output_dir=Path("_working/normalized"),
      ...     timestamp=datetime.now(),
      ...     raw_file=Path("syllables_raw.txt"),
      ...     canonical_file=Path("syllables_canonicalised.txt"),
      ...     frequency_file=Path("syllables_frequencies.json"),
      ...     unique_file=Path("syllables_unique.txt"),
      ...     meta_file=Path("normalization_meta.txt")
      ... )
      >>> result.stats.raw_count
      1000


   .. py:attribute:: config
      :type:  NormalizationConfig


   .. py:attribute:: stats
      :type:  NormalizationStats


   .. py:attribute:: frequencies
      :type:  Dict[str, int]


   .. py:attribute:: unique_syllables
      :type:  List[str]


   .. py:attribute:: input_files
      :type:  List[pathlib.Path]


   .. py:attribute:: output_dir
      :type:  pathlib.Path


   .. py:attribute:: timestamp
      :type:  datetime.datetime


   .. py:attribute:: raw_file
      :type:  pathlib.Path


   .. py:attribute:: canonical_file
      :type:  pathlib.Path


   .. py:attribute:: frequency_file
      :type:  pathlib.Path


   .. py:attribute:: unique_file
      :type:  pathlib.Path


   .. py:attribute:: meta_file
      :type:  pathlib.Path


   .. py:method:: format_metadata()

      Format normalization metadata as a human-readable string.

      Creates a detailed report including statistics, rejection breakdown,
      and top frequencies.

      :returns: Multi-line string containing all normalization metadata formatted
                for display or file output.

      .. admonition:: Example

         >>> result = NormalizationResult(...)
         >>> print(result.format_metadata())
         ======================================================================
         SYLLABLE NORMALIZATION METADATA
         ======================================================================
         Timestamp:           2026-01-05 17:30:22
         ...
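
A report in the style of the sample output above could be assembled roughly as
follows. This is a sketch only: ``format_report_sketch`` is a hypothetical
helper, the 70-character rule and 21-character label column are guesses based on
the sample, and the layout of the frequency lines is invented for illustration.

```python
from datetime import datetime
from typing import Dict


def format_report_sketch(
    timestamp: datetime, frequencies: Dict[str, int], top_n: int = 3
) -> str:
    """Return a multi-line report with a banner, timestamp, and top syllables."""
    rule = "=" * 70  # banner width is an assumption
    lines = [
        rule,
        "SYLLABLE NORMALIZATION METADATA",
        rule,
        # Left-align the label in a 21-character column (assumed width).
        f"{'Timestamp:':<21}{timestamp:%Y-%m-%d %H:%M:%S}",
    ]
    # Most frequent syllables first; ties broken alphabetically (assumption).
    top = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    for rank, (syllable, count) in enumerate(top, start=1):
        lines.append(f"{rank:>3}. {syllable:<10}{count:>8}")
    return "\n".join(lines)
```

Returning a string rather than printing keeps the same text usable for both
console display and writing to a metadata file.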