build_tools.pyphen_syllable_normaliser.frequency

Frequency analysis for canonical syllables.

This module handles Step 3 of the normalization pipeline: analyzing the frequency distribution of canonical syllables and generating frequency intelligence data structures. It captures “how often each canonical syllable occurs before we collapse identity”, which is essential for understanding natural language patterns in the source corpus.

Classes

FrequencyAnalyzer

Analyzes frequency distribution of canonical syllables.

Functions

`load_frequencies_from_file`(file_path)	Load frequency dictionary from JSON file.
`load_unique_syllables_from_file`(file_path)	Load unique syllables from text file.

Module Contents

class build_tools.pyphen_syllable_normaliser.frequency.FrequencyAnalyzer

Analyzes frequency distribution of canonical syllables.

This class handles the intelligence capture phase of the normalization pipeline. It counts occurrences of each canonical syllable, creates frequency rankings, and generates output files for downstream analysis and feature annotation.

Example

>>> from pathlib import Path
>>> from build_tools.pyphen_syllable_normaliser.frequency import FrequencyAnalyzer
>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 1, 'mi': 1, 'ta': 1}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']

calculate_frequencies(syllables)

Calculate frequency counts for canonical syllables.

Counts how many times each unique syllable appears in the input list. This captures the natural frequency distribution from the source corpus before deduplication.

Parameters: syllables (List[str]) – List of canonical syllables (may contain duplicates).
Returns: Dictionary mapping each unique syllable to its occurrence count.
Return type: Dict[str, int]

Example

>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> frequencies = analyzer.calculate_frequencies(syllables)
>>> frequencies
{'ka': 3, 'ra': 2, 'mi': 1, 'ta': 1}
>>> sum(frequencies.values())  # Total syllable count
7

Note

The returned dictionary is not sorted. Use create_frequency_entries() to generate sorted frequency rankings.
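The counting step described above can be sketched with the standard library. This is a minimal illustration, not the module's actual implementation: the helper name count_syllables is hypothetical, and using collections.Counter is an assumption about one reasonable way to do it.

```python
from collections import Counter
from typing import Dict, List


def count_syllables(syllables: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in for calculate_frequencies(); not the module's code."""
    # Counter tallies every distinct syllable in a single pass over the list.
    return dict(Counter(syllables))


counts = count_syllables(['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra'])
total = sum(counts.values())  # equals the length of the input list
print(counts, total)
```

Because duplicates are counted rather than dropped, the sum of the values always equals the number of input syllables, which is a handy sanity check.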

create_frequency_entries(frequencies)

Create ranked frequency entries from frequency counts.

Converts a frequency dictionary into a list of FrequencyEntry objects with ranking information and percentage calculations. Entries are sorted by frequency (descending) then alphabetically (ascending).

Parameters: frequencies (Dict[str, int]) – Dictionary mapping syllable to occurrence count.
Returns: List of FrequencyEntry objects sorted by frequency (highest first), with alphabetical secondary sort for ties.
Return type: List[build_tools.pyphen_syllable_normaliser.models.FrequencyEntry]

Example

>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145, 'ta': 98}
>>> entries = analyzer.create_frequency_entries(frequencies)
>>> entries[0]
FrequencyEntry(canonical='ka', frequency=187, rank=1, percentage=31.6)
>>> entries[0].canonical
'ka'
>>> entries[0].rank
1

Note

Percentage is calculated as (frequency / total_count) * 100, where total_count is the sum of all occurrence counts in the input dictionary. Ranks start at 1 (most frequent syllable has rank=1).
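The sort order and percentage formula can be illustrated with a small self-contained sketch. EntrySketch and rank_entries are hypothetical names standing in for FrequencyEntry and create_frequency_entries(). The sketch also assumes that tied syllables receive consecutive ranks and that percentages are left unrounded, neither of which is stated by this reference.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EntrySketch:
    """Hypothetical mirror of the documented FrequencyEntry fields."""
    canonical: str
    frequency: int
    rank: int
    percentage: float


def rank_entries(frequencies: Dict[str, int]) -> List[EntrySketch]:
    total = sum(frequencies.values())
    # Negating the count sorts by frequency descending; the syllable itself
    # breaks ties in ascending alphabetical order.
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [
        EntrySketch(syllable, freq, rank, (freq / total) * 100 if total else 0.0)
        for rank, (syllable, freq) in enumerate(ordered, start=1)
    ]


# 'mi' and 'ra' are tied on purpose to show the alphabetical tie-break.
entries = rank_entries({'ta': 98, 'ra': 162, 'mi': 162, 'ka': 187})
for entry in entries:
    print(entry)
```

Sorting on the tuple (-count, syllable) keeps the ordering deterministic even when several syllables share a count.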

extract_unique_syllables(syllables)

Extract unique syllables and return in sorted order.

Removes duplicates from the syllable list and returns a sorted list of unique canonical syllables. This creates the authoritative syllable inventory for downstream feature annotation.

Parameters: syllables (List[str]) – List of canonical syllables (may contain duplicates).
Returns: Sorted list of unique syllable strings (alphabetical order).
Return type: List[str]

Example

>>> analyzer = FrequencyAnalyzer()
>>> syllables = ['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra']
>>> unique = analyzer.extract_unique_syllables(syllables)
>>> unique
['ka', 'mi', 'ra', 'ta']
>>> len(unique)
4

Note

Sorting is alphabetical (a-z) for deterministic output. Empty syllable lists return an empty list.
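A deduplicate-then-sort step like the one described here might look as follows. The helper name unique_sorted is hypothetical, and combining set() with sorted() is an assumption about the approach rather than a description of the module's code.

```python
from typing import List


def unique_sorted(syllables: List[str]) -> List[str]:
    """Hypothetical stand-in for extract_unique_syllables()."""
    # set() drops duplicates; sorted() restores a deterministic order.
    return sorted(set(syllables))


inventory = unique_sorted(['ka', 'ra', 'mi', 'ka', 'ta', 'ka', 'ra'])
empty = unique_sorted([])
print(inventory, empty)
```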

save_frequencies(frequencies, output_path)

Save frequency dictionary to JSON file.

Writes the frequency intelligence to a JSON file for downstream analysis. The output is formatted with indentation for readability and sorted by key for deterministic output.

Parameters:

frequencies (Dict[str, int]) – Dictionary mapping syllable to occurrence count.
output_path (pathlib.Path) – Path where the JSON file should be saved.

Raises:

PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).

Example

>>> from pathlib import Path
>>> analyzer = FrequencyAnalyzer()
>>> frequencies = {'ka': 187, 'ra': 162, 'mi': 145}
>>> analyzer.save_frequencies(frequencies, Path("syllables_frequencies.json"))
# File contains:
# {
#   "ka": 187,
#   "mi": 145,
#   "ra": 162
# }

Note

The JSON is formatted with 2-space indentation and keys are sorted alphabetically for consistent diffs in version control.
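The documented layout (2-space indentation, alphabetically sorted keys) corresponds to standard json.dump options. The sketch below is an assumption about how such a file could be produced, not the module's source. write_counts is a hypothetical helper, and a temporary directory is used so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def write_counts(frequencies: Dict[str, int], output_path: Path) -> None:
    """Hypothetical stand-in for save_frequencies()."""
    with output_path.open("w", encoding="utf-8") as handle:
        # indent=2 and sort_keys=True give the stable, diff-friendly layout.
        json.dump(frequencies, handle, indent=2, sort_keys=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "syllables_frequencies.json"
    # Keys are deliberately out of order to show that the output is sorted.
    write_counts({'ra': 162, 'ka': 187, 'mi': 145}, target)
    written = target.read_text(encoding="utf-8")

print(written)
```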

save_unique_syllables(unique_syllables, output_path)

Save unique syllables to text file.

Writes the deduplicated canonical syllable inventory to a text file, one syllable per line. This creates the authoritative syllable list for feature annotation and downstream processing.

Parameters:

unique_syllables (List[str]) – Sorted list of unique canonical syllables.
output_path (pathlib.Path) – Path where the text file should be saved.

Raises:

PermissionError – If the output file cannot be written.
OSError – If there are filesystem issues (disk full, etc.).

Example

>>> from pathlib import Path
>>> analyzer = FrequencyAnalyzer()
>>> unique = ['ka', 'mi', 'ra', 'ta']
>>> analyzer.save_unique_syllables(unique, Path("syllables_unique.txt"))
# File contains:
# ka
# mi
# ra
# ta

Note

Syllables should be sorted alphabetically before calling this method. Use extract_unique_syllables(), which returns sorted output, or sort the list manually.
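A one-syllable-per-line writer that leaves ordering to the caller could be sketched like this. The helper name write_syllable_lines is hypothetical, and the trailing newline after the last syllable is an assumption; this reference does not say whether the real file ends with one.

```python
import tempfile
from pathlib import Path
from typing import List


def write_syllable_lines(unique_syllables: List[str], output_path: Path) -> None:
    """Hypothetical stand-in for save_unique_syllables()."""
    # Lines are written in the order given; sorting is the caller's job.
    text = "".join(f"{syllable}\n" for syllable in unique_syllables)
    output_path.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "syllables_unique.txt"
    # Sort explicitly before saving, as the note above recommends.
    write_syllable_lines(sorted({'ta', 'ka', 'ra', 'mi'}), target)
    lines = target.read_text(encoding="utf-8").splitlines()

print(lines)
```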

build_tools.pyphen_syllable_normaliser.frequency.load_frequencies_from_file(file_path)

Load frequency dictionary from JSON file.

Reads a previously saved syllables_frequencies.json file and returns the frequency dictionary. Useful for analysis and inspection of normalization results.

Parameters:

file_path (pathlib.Path) – Path to the JSON frequency file.

Returns:

Dictionary mapping syllable to occurrence count.

Raises:

FileNotFoundError – If the file does not exist.
json.JSONDecodeError – If the file is not valid JSON.
PermissionError – If the file cannot be read.

Return type:

Dict[str, int]

Example

>>> from pathlib import Path
>>> frequencies = load_frequencies_from_file(Path("syllables_frequencies.json"))
>>> frequencies['ka']
187
>>> len(frequencies)
412

Note

The JSON file must have been created by save_frequencies() or follow the same format: {"syllable": count, ...}
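Callers may want to handle the documented exceptions rather than let them propagate. The following sketch shows one possible pattern. It uses json.load directly as a stand-in for load_frequencies_from_file so that it runs without the package installed, and load_counts_or_none is a hypothetical wrapper, not part of this module.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def load_counts_or_none(path: Path) -> Optional[Dict[str, int]]:
    """Hypothetical wrapper: return the counts, or None if the file is unusable."""
    try:
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        # The file has not been written yet.
        return None
    except json.JSONDecodeError:
        # The file exists but does not contain valid JSON.
        return None


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "syllables_frequencies.json"
    good.write_text('{"ka": 187, "ra": 162}', encoding="utf-8")
    bad = Path(tmp) / "broken.json"
    bad.write_text("{not json", encoding="utf-8")

    loaded = load_counts_or_none(good)
    broken = load_counts_or_none(bad)
    missing = load_counts_or_none(Path(tmp) / "absent.json")

print(loaded, broken, missing)
```

Returning None is only one design choice; re-raising with a clearer message may suit pipelines where a missing frequency file should stop the run.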

build_tools.pyphen_syllable_normaliser.frequency.load_unique_syllables_from_file(file_path)

Load unique syllables from text file.

Reads a previously saved syllables_unique.txt file and returns the syllable list. Useful for loading the authoritative syllable inventory for feature annotation or analysis.

Parameters:

file_path (pathlib.Path) – Path to the text file containing unique syllables.

Returns:

List of syllable strings (one per line from file).

Raises:

FileNotFoundError – If the file does not exist.
PermissionError – If the file cannot be read.
UnicodeDecodeError – If the file contains invalid UTF-8.

Return type:

List[str]

Example

>>> from pathlib import Path
>>> syllables = load_unique_syllables_from_file(Path("syllables_unique.txt"))
>>> syllables[:5]
['ka', 'mi', 'ra', 'ta', 'wa']
>>> len(syllables)
412

Note

Empty lines are skipped. Leading/trailing whitespace is stripped from each line.
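The line-handling rules in the note (skip empty lines, strip surrounding whitespace) can be sketched as below. read_syllable_lines is a hypothetical helper written for illustration under the assumption of UTF-8 input. It is not the module's implementation.

```python
import tempfile
from pathlib import Path
from typing import List


def read_syllable_lines(path: Path) -> List[str]:
    """Hypothetical stand-in for load_unique_syllables_from_file()."""
    raw_lines = path.read_text(encoding="utf-8").splitlines()
    # Strip each line, then keep only the ones that still have content.
    return [line.strip() for line in raw_lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "syllables_unique.txt"
    # Includes a blank line and padded entries to exercise both rules.
    source.write_text("ka\n\n  mi \nra\t\n", encoding="utf-8")
    loaded_syllables = read_syllable_lines(source)

print(loaded_syllables)
```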