build_tools.syllable_feature_annotator.file_io

File I/O helper functions for syllable feature annotation.

This module provides simple, reusable functions for loading syllable data and saving annotated results. All functions are designed to be deterministic, predictable, and easy to test.

Design Principles

Boring is Good: Simple, straightforward I/O operations
Explicit Errors: Clear error messages for common failure modes
Minimal Abstraction: Functions do exactly what they say
Type Clarity: Clear input/output types with type hints
No Magic: No hidden transformations or side effects

Functions

load_syllables(file_path: Path) -> list[str]: Load syllables from text file (one per line)
load_frequencies(file_path: Path) -> dict[str, int]: Load frequency mapping from JSON file
save_annotated_syllables(syllables: list[dict], file_path: Path) -> None: Save annotated syllables to JSON file with formatting

Usage

Load syllables from normalizer output:

>>> from pathlib import Path
>>> from build_tools.syllable_feature_annotator.file_io import load_syllables
>>> syllables = load_syllables(Path("data/normalized/syllables_unique.txt"))
>>> print(f"Loaded {len(syllables)} syllables")

Load frequency data:

>>> from build_tools.syllable_feature_annotator.file_io import load_frequencies
>>> frequencies = load_frequencies(Path("data/normalized/syllables_frequencies.json"))
>>> print(f"Most frequent: {max(frequencies.items(), key=lambda x: x[1])}")

Save annotated results:

>>> from build_tools.syllable_feature_annotator.file_io import save_annotated_syllables
>>> annotated = [
...     {"syllable": "ka", "frequency": 187, "features": {...}},
...     {"syllable": "ra", "frequency": 162, "features": {...}},
... ]
>>> save_annotated_syllables(annotated, Path("output/syllables_annotated.json"))

Implementation Notes

Empty Line Handling:

The load_syllables function filters out empty lines automatically. This is consistent with the syllable normalizer’s behavior and prevents issues with trailing newlines or blank lines in input files.

JSON Formatting:

The save_annotated_syllables function uses indent=2 for human-readable output. While this increases file size, it makes the annotated dataset much easier to inspect, debug, and version control.

UTF-8 Encoding:

All file operations explicitly use UTF-8 encoding to ensure consistent behavior across different platforms and locales. This is critical for syllables that may contain non-ASCII characters before normalization.
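The effect of an explicit encoding can be illustrated with a small, self-contained sketch. It does not use this module. The sample syllable "ñu" and the file name are made up for illustration, and Latin-1 stands in for whatever a platform's locale default might be.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "syllables.txt"  # hypothetical file name
    # Write the raw UTF-8 bytes for two syllables, one of them non-ASCII.
    sample.write_bytes("ñu\nka\n".encode("utf-8"))

    # Explicit UTF-8 decoding recovers the original text on any platform.
    explicit = sample.read_text(encoding="utf-8").split()

    # Decoding the same bytes as Latin-1 (standing in for a mismatched
    # locale default) silently garbles the non-ASCII character.
    mis_decoded = sample.read_text(encoding="latin-1").split()

print(explicit)        # ['ñu', 'ka']
print(mis_decoded[0])  # Ã±u
```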

Error Handling:

Functions raise clear exceptions with informative messages:

FileNotFoundError: Input file doesn’t exist
ValueError: Malformed JSON or unexpected data format
IOError: Permission errors or disk issues

Error Handling Strategy

These functions intentionally do NOT catch exceptions. Instead, they let exceptions propagate to the caller with clear error messages. This follows the principle of “fail fast” and makes debugging easier.

The caller (annotator.py or cli.py) is responsible for catching exceptions and providing user-friendly error messages if needed.
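A minimal sketch of what such caller-side handling might look like follows. It is not the actual code in annotator.py or cli.py. The names load_frequencies_sketch and run, the exit codes, and the message wording are all hypothetical. The sketch relies on two facts about Python 3: json.JSONDecodeError is a subclass of ValueError, and IOError is an alias of OSError.

```python
import json
import sys
import tempfile
from pathlib import Path


def load_frequencies_sketch(file_path: Path) -> dict:
    """Hypothetical stand-in for load_frequencies: no try/except inside."""
    with file_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def run(file_path: Path) -> int:
    """Hypothetical caller that turns exceptions into exit codes."""
    try:
        frequencies = load_frequencies_sketch(file_path)
    except FileNotFoundError:
        # Must come before OSError, which is its parent class.
        print(f"error: input file not found: {file_path}", file=sys.stderr)
        return 1
    except ValueError as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        print(f"error: malformed JSON in {file_path}: {exc}", file=sys.stderr)
        return 2
    except OSError as exc:
        # IOError is an alias of OSError in Python 3.
        print(f"error: could not read {file_path}: {exc}", file=sys.stderr)
        return 3
    print(f"Loaded {len(frequencies)} frequencies")
    return 0


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text('{"ka": 187, "ra": 162}', encoding="utf-8")
    bad = Path(tmp) / "bad.json"
    bad.write_text('{"ka": ', encoding="utf-8")  # truncated JSON
    exit_codes = [run(good), run(bad), run(Path(tmp) / "missing.json")]

print(exit_codes)  # [0, 2, 1]
```

Keeping the try/except in the caller means the I/O helpers stay small, and each entry point can decide how much detail to show the user.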

Functions

load_syllables(file_path): Load syllables from a text file (one syllable per line).
load_frequencies(file_path): Load syllable frequency mapping from a JSON file.
save_annotated_syllables(syllables, file_path): Save annotated syllables to a JSON file with human-readable formatting.

Module Contents

build_tools.syllable_feature_annotator.file_io.load_syllables(file_path)

Load syllables from a text file (one syllable per line).

Reads a text file containing one syllable per line and returns a list of syllable strings. Empty lines are automatically filtered out.

This function is designed to load the output from the syllable normalizer, specifically the syllables_unique.txt file.

Parameters

file_path (Path): Path to text file containing syllables (one per line)

Returns

list[str]: List of syllable strings, with empty lines filtered out

Raises

FileNotFoundError: If the input file doesn’t exist
IOError: If there are permission or disk errors

Examples

Load syllables from normalizer output:

>>> from pathlib import Path
>>> syllables = load_syllables(Path("data/normalized/syllables_unique.txt"))
>>> len(syllables)
1523
>>> syllables[:3]
['ka', 'ra', 'mi']

Handle missing file:

>>> syllables = load_syllables(Path("nonexistent.txt"))
Traceback (most recent call last):
    ...
FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent.txt'

Notes

Empty lines are filtered automatically
Lines are stripped of leading/trailing whitespace
UTF-8 encoding is used explicitly
The file is read entirely into memory (suitable for typical syllable counts)
Order is preserved from the input file
Deterministic: same file always produces same output
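The behaviour listed in these notes can be sketched as follows. This is an illustration written from the notes, not the module's actual source. The name load_syllables_sketch is hypothetical, and stripping each line before filtering (so that whitespace-only lines are dropped too) is an assumption.

```python
import tempfile
from pathlib import Path


def load_syllables_sketch(file_path: Path) -> list:
    """Hypothetical stand-in for load_syllables, written from the notes."""
    with file_path.open(encoding="utf-8") as handle:
        # Strip surrounding whitespace, then drop lines that end up empty.
        # A list comprehension keeps the input order.
        stripped = (line.strip() for line in handle)
        return [syllable for syllable in stripped if syllable]


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "syllables_unique.txt"
    # Blank lines, padded lines, and trailing newlines are all present.
    sample.write_text("ka\n\n  ra  \nmi\n\n", encoding="utf-8")
    loaded = load_syllables_sketch(sample)

print(loaded)  # ['ka', 'ra', 'mi']
```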

build_tools.syllable_feature_annotator.file_io.load_frequencies(file_path)

Load syllable frequency mapping from a JSON file.

Reads a JSON file containing a dictionary mapping syllables to their occurrence counts. The expected format is: {"syllable": count, ...}

This function is designed to load the output from the syllable normalizer, specifically the syllables_frequencies.json file.

Parameters

file_path (Path): Path to JSON file containing frequency mapping

Returns

dict[str, int]: Dictionary mapping syllable strings to integer counts

Raises

FileNotFoundError: If the input file doesn’t exist
ValueError: If the JSON is malformed or doesn’t contain expected format
IOError: If there are permission or disk errors

Examples

Load frequencies from normalizer output:

>>> from pathlib import Path
>>> frequencies = load_frequencies(Path("data/normalized/syllables_frequencies.json"))
>>> len(frequencies)
1523
>>> frequencies["ka"]
187
>>> frequencies["ra"]
162

Get most frequent syllable:

>>> most_frequent = max(frequencies.items(), key=lambda x: x[1])
>>> print(f"{most_frequent[0]}: {most_frequent[1]} occurrences")
ka: 187 occurrences

Handle missing syllable (returns default):

>>> frequencies.get("xyz", 1)  # Default to 1 if missing
1

Notes

UTF-8 encoding is used explicitly
The entire file is loaded into memory (suitable for typical dataset sizes)
No validation is performed on syllable strings or counts
Deterministic: same file always produces same output
Missing syllables should be handled by caller (use .get(syllable, 1))
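Since no validation is performed and missing syllables are left to the caller, a caller might add its own checks along the lines of the sketch below. Both helper names, check_frequencies and attach_frequencies, are hypothetical and are not part of this module. Only the default of 1 for missing syllables comes from the notes above.

```python
import json


def check_frequencies(data):
    """Hypothetical caller-side check; the module itself does not validate."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping syllables to counts")
    for syllable, count in data.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(count, bool) or not isinstance(count, int):
            raise ValueError(f"count for {syllable!r} is not an integer: {count!r}")
    return data


def attach_frequencies(syllables, frequencies):
    """Hypothetical helper: pair each syllable with its count, default 1."""
    return [
        {"syllable": syllable, "frequency": frequencies.get(syllable, 1)}
        for syllable in syllables
    ]


frequencies = check_frequencies(json.loads('{"ka": 187, "ra": 162}'))
records = attach_frequencies(["ka", "xyz"], frequencies)
print(records)
# [{'syllable': 'ka', 'frequency': 187}, {'syllable': 'xyz', 'frequency': 1}]
```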

build_tools.syllable_feature_annotator.file_io.save_annotated_syllables(syllables, file_path)

Save annotated syllables to a JSON file with human-readable formatting.

Writes a list of annotated syllable dictionaries to a JSON file. Each dictionary should contain ‘syllable’, ‘frequency’, and ‘features’ keys.

Output is formatted with 2-space indentation for readability and version control friendliness.

Parameters

syllables (list[dict]): List of annotated syllable dictionaries, each containing:
    - syllable (str): The syllable string
    - frequency (int): Occurrence count
    - features (dict[str, bool]): Feature detection results
file_path (Path): Path where JSON output should be written

Returns

None: File is written to disk, nothing returned

Raises

IOError: If there are permission or disk errors
TypeError: If syllables is not JSON-serializable

Examples

Save annotated syllables:

>>> from pathlib import Path
>>> annotated = [
...     {
...         "syllable": "kran",
...         "frequency": 7,
...         "features": {
...             "starts_with_cluster": True,
...             "contains_plosive": True,
...             "short_vowel": True,
...             # ... other features ...
...         }
...     },
...     # ... more syllables ...
... ]
>>> save_annotated_syllables(annotated, Path("output/syllables_annotated.json"))

Expected output format:

[
  {
    "syllable": "kran",
    "frequency": 7,
    "features": {
      "starts_with_cluster": true,
      "contains_plosive": true,
      "short_vowel": true
    }
  }
]

Notes

Parent directories are created automatically if they don’t exist
UTF-8 encoding is used explicitly
2-space indentation for readability
Output is valid JSON that can be consumed by other tools
File is overwritten if it already exists
Deterministic: same input always produces same output
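The notes above can be illustrated with a minimal sketch. It is written from the notes rather than taken from the module's source. The name save_annotated_sketch is hypothetical, and passing ensure_ascii=False is an assumption about how non-ASCII syllables might be kept readable in the output.

```python
import json
import tempfile
from pathlib import Path


def save_annotated_sketch(syllables, file_path: Path) -> None:
    """Hypothetical stand-in for save_annotated_syllables."""
    # Create missing parent directories; do nothing if they already exist.
    file_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" truncates, so an existing file is overwritten.
    with file_path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False is an assumption, not confirmed by the source.
        json.dump(syllables, handle, indent=2, ensure_ascii=False)


annotated = [
    {
        "syllable": "kran",
        "frequency": 7,
        "features": {"starts_with_cluster": True, "contains_plosive": True},
    }
]

with tempfile.TemporaryDirectory() as tmp:
    # Neither "output" nor "nested" exists yet.
    target = Path(tmp) / "output" / "nested" / "syllables_annotated.json"
    save_annotated_sketch(annotated, target)
    first = target.read_text(encoding="utf-8")
    save_annotated_sketch(annotated, target)  # second call overwrites
    second = target.read_text(encoding="utf-8")
    round_trip = json.loads(second)

print(first == second)          # True
print(round_trip == annotated)  # True
```

Writing the same input twice produces byte-identical files, which is what makes the output friendly to version control diffs.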