build_tools.pyphen_syllable_normaliser.frequency
================================================

.. py:module:: build_tools.pyphen_syllable_normaliser.frequency

.. autoapi-nested-parse::

   Frequency analysis for canonical syllables.

   This module handles Step 3 of the normalization pipeline: analyzing the
   frequency distribution of canonical syllables and generating frequency
   intelligence data structures.

   This captures "how often each canonical syllable occurs before we collapse
   identity", which is essential for understanding natural language patterns in
   the source corpus.


Classes
-------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.frequency.FrequencyAnalyzer


Functions
---------

.. autoapisummary::

   build_tools.pyphen_syllable_normaliser.frequency.load_frequencies_from_file
   build_tools.pyphen_syllable_normaliser.frequency.load_unique_syllables_from_file


Module Contents
---------------

.. py:class:: FrequencyAnalyzer

   Analyzes the frequency distribution of canonical syllables.

   This class handles the intelligence capture phase of the normalization
   pipeline. It counts occurrences of each canonical syllable, creates
   frequency rankings, and generates output files for downstream analysis
   and feature annotation.

   .. admonition:: Example

      >>> from pathlib import Path
      >>> from build_tools.pyphen_syllable_normaliser.frequency import FrequencyAnalyzer
      >>> analyzer = FrequencyAnalyzer()
      >>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka']
      >>> frequencies = analyzer.calculate_frequencies(syllables)
      >>> frequencies
      {'ka': 3, 'ra': 1, 'mi': 1, 'ta': 1}
      >>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
      >>> unique = analyzer.extract_unique_syllables(syllables)
      >>> unique
      ['ka', 'mi', 'ra', 'ta']

   .. py:method:: calculate_frequencies(syllables)

      Calculate frequency counts for canonical syllables.

      Counts how many times each unique syllable appears in the input list.
      This captures the natural frequency distribution from the source corpus
      before deduplication.

      :param syllables: List of canonical syllables (may contain duplicates).

      :returns: Dictionary mapping each unique syllable to its occurrence count.

      .. admonition:: Example

         >>> analyzer = FrequencyAnalyzer()
         >>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
         >>> frequencies = analyzer.calculate_frequencies(syllables)
         >>> frequencies
         {'ka': 3, 'ra': 2, 'mi': 1, 'ta': 1}
         >>> sum(frequencies.values())  # Total syllable count
         7

      .. note::

         The returned dictionary is not sorted. Use create_frequency_entries()
         to generate sorted frequency rankings.

   .. py:method:: create_frequency_entries(frequencies)

      Create ranked frequency entries from frequency counts.

      Converts a frequency dictionary into a list of FrequencyEntry objects
      with ranking information and percentage calculations. Entries are sorted
      by frequency (descending), then alphabetically (ascending).

      :param frequencies: Dictionary mapping syllable to occurrence count.

      :returns: List of FrequencyEntry objects sorted by frequency (highest
                first), with an alphabetical secondary sort for ties.

      .. admonition:: Example

         >>> analyzer = FrequencyAnalyzer()
         >>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145, 'ta': 98}
         >>> entries = analyzer.create_frequency_entries(frequencies)
         >>> entries[0].canonical
         'ka'
         >>> entries[0].frequency
         187
         >>> entries[0].rank
         1
         >>> round(entries[0].percentage, 1)  # 187 / 592 * 100
         31.6

      .. note::

         Percentage is calculated as (frequency / total_count) * 100.
         Ranks start at 1 (the most frequent syllable has rank=1).

   .. py:method:: extract_unique_syllables(syllables)

      Extract unique syllables and return them in sorted order.

      Removes duplicates from the syllable list and returns a sorted list of
      unique canonical syllables. This creates the authoritative syllable
      inventory for downstream feature annotation.

      :param syllables: List of canonical syllables (may contain duplicates).

      :returns: Sorted list of unique syllable strings (alphabetical order).

      .. admonition:: Example

         >>> analyzer = FrequencyAnalyzer()
         >>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
         >>> unique = analyzer.extract_unique_syllables(syllables)
         >>> unique
         ['ka', 'mi', 'ra', 'ta']
         >>> len(unique)
         4

      .. note::

         Sorting is alphabetical (a-z) for deterministic output.
         An empty syllable list returns an empty list.

   .. py:method:: save_frequencies(frequencies, output_path)

      Save the frequency dictionary to a JSON file.

      Writes the frequency intelligence to a JSON file for downstream
      analysis. The output is indented for readability and sorted by key
      for deterministic output.

      :param frequencies: Dictionary mapping syllable to occurrence count.
      :param output_path: Path where the JSON file should be saved.

      :raises PermissionError: If the output file cannot be written.
      :raises OSError: If there are filesystem issues (disk full, etc.).

      .. admonition:: Example

         >>> from pathlib import Path
         >>> analyzer = FrequencyAnalyzer()
         >>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145}
         >>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
         # File contains:
         # {
         #   "ka": 187,
         #   "mi": 145,
         #   "ra": 162
         # }

      .. note::

         The JSON is formatted with 2-space indentation, and keys are sorted
         alphabetically for consistent diffs in version control.

   .. py:method:: save_unique_syllables(unique_syllables, output_path)

      Save unique syllables to a text file.

      Writes the deduplicated canonical syllable inventory to a text file,
      one syllable per line. This creates the authoritative syllable list
      for feature annotation and downstream processing.

      :param unique_syllables: Sorted list of unique canonical syllables.
      :param output_path: Path where the text file should be saved.

      :raises PermissionError: If the output file cannot be written.
      :raises OSError: If there are filesystem issues (disk full, etc.).

      .. admonition:: Example

         >>> from pathlib import Path
         >>> analyzer = FrequencyAnalyzer()
         >>> unique = ['ka', 'mi', 'ra', 'ta']
         >>> analyzer.save_unique_syllables(unique, Path("syllables_unique.txt"))
         # File contains:
         # ka
         # mi
         # ra
         # ta

      .. note::

         Syllables should be sorted alphabetically before calling this
         method. Use extract_unique_syllables(), which returns sorted output,
         or sort manually.

.. py:function:: load_frequencies_from_file(file_path)

   Load a frequency dictionary from a JSON file.

   Reads a previously saved syllables_frequencies.json file and returns the
   frequency dictionary. Useful for analysis and inspection of normalization
   results.

   :param file_path: Path to the JSON frequency file.

   :returns: Dictionary mapping syllable to occurrence count.

   :raises FileNotFoundError: If the file does not exist.
   :raises json.JSONDecodeError: If the file is not valid JSON.
   :raises PermissionError: If the file cannot be read.

   .. admonition:: Example

      >>> from pathlib import Path
      >>> from build_tools.pyphen_syllable_normaliser.frequency import load_frequencies_from_file
      >>> frequencies = load_frequencies_from_file(Path("syllables_frequencies.json"))
      >>> frequencies['ka']
      187
      >>> len(frequencies)
      412

   .. note::

      The JSON file must have been created by save_frequencies() or follow
      the same format: {"syllable": count, ...}

.. py:function:: load_unique_syllables_from_file(file_path)

   Load unique syllables from a text file.

   Reads a previously saved syllables_unique.txt file and returns the
   syllable list. Useful for loading the authoritative syllable inventory
   for feature annotation or analysis.

   :param file_path: Path to the text file containing unique syllables.

   :returns: List of syllable strings (one per line from the file).

   :raises FileNotFoundError: If the file does not exist.
   :raises PermissionError: If the file cannot be read.
   :raises UnicodeDecodeError: If the file contains invalid UTF-8.

   .. admonition:: Example

      >>> from pathlib import Path
      >>> from build_tools.pyphen_syllable_normaliser.frequency import load_unique_syllables_from_file
      >>> syllables = load_unique_syllables_from_file(Path("syllables_unique.txt"))
      >>> syllables[:5]
      ['ka', 'mi', 'ra', 'ta', 'wa']
      >>> len(syllables)
      412

   .. note::

      Empty lines are skipped. Leading/trailing whitespace is stripped from
      each line.
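
The behaviour documented above (counting, ranking with percentages, deduplication, and the two file formats) can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not the module's actual implementation: ``RankedEntry`` and ``rank_frequencies`` are hypothetical names introduced for illustration, ``RankedEntry`` only mirrors the fields shown for FrequencyEntry, and the zero-total guard is an assumption the documentation does not confirm.

```python
import json
import tempfile
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List


@dataclass(frozen=True)
class RankedEntry:
    """Hypothetical stand-in for FrequencyEntry (same field names as the docs)."""

    canonical: str
    frequency: int
    rank: int
    percentage: float


def rank_frequencies(frequencies: Dict[str, int]) -> List[RankedEntry]:
    """Sort by count (descending), then by syllable (ascending); ranks start at 1."""
    total = sum(frequencies.values())
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [
        RankedEntry(
            canonical=syllable,
            frequency=count,
            rank=position,
            # Assumption: an empty input yields 0.0 rather than raising.
            percentage=(count / total) * 100 if total else 0.0,
        )
        for position, (syllable, count) in enumerate(ordered, start=1)
    ]


syllables = ["ka", "ra", "mi", "ka", "ta", "ka", "ra"]

# Count occurrences before deduplication collapses identity.
frequencies = dict(Counter(syllables))

# Deduplicate and sort alphabetically for a deterministic inventory.
unique = sorted(set(syllables))

entries = rank_frequencies(frequencies)
print([entry.canonical for entry in entries])  # ['ka', 'ra', 'mi', 'ta']
print(round(entries[0].percentage, 1))  # 42.9

with tempfile.TemporaryDirectory() as tmp:
    # JSON with 2-space indentation and sorted keys, as the notes describe.
    json_path = Path(tmp) / "syllables_frequencies.json"
    json_path.write_text(
        json.dumps(frequencies, indent=2, sort_keys=True), encoding="utf-8"
    )
    reloaded = json.loads(json_path.read_text(encoding="utf-8"))

    # One syllable per line; on reading, strip whitespace and skip empty lines.
    text_path = Path(tmp) / "syllables_unique.txt"
    text_path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    reread = [
        line.strip()
        for line in text_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
```

Negating the count in the sort key gives the descending-frequency order while leaving the alphabetical tiebreak ascending, which is why ``mi`` precedes ``ta`` at equal counts.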