build_tools.syllable_walk_tui.services.corpus

Corpus directory validation and utilities for Syllable Walker TUI.

This module provides functions for validating and loading corpus data from normalized syllable extraction output directories.

Functions

`validate_corpus_directory`(path)	Validate that a directory contains valid corpus files.
`get_corpus_info`(path)	Get display-friendly corpus information string.
`load_corpus_data`(path)	Load syllables and frequencies from a validated corpus directory.
`load_annotated_data_from_sqlite`(db_path)	Load phonetic feature annotations from a SQLite corpus database.
`load_annotated_data`(path)	Load phonetic feature annotations from a validated corpus directory.

Module Contents

build_tools.syllable_walk_tui.services.corpus.validate_corpus_directory(path)

Validate that a directory contains valid corpus files.

Checks for either NLTK or Pyphen corpus structure:

- nltk_syllables_unique.txt + nltk_syllables_frequencies.json
- pyphen_syllables_unique.txt + pyphen_syllables_frequencies.json

Parameters:

path (pathlib.Path) – Directory path to validate

Returns:

Tuple of (is_valid, corpus_type, message):
- is_valid: True if valid corpus directory
- corpus_type: "NLTK" or "Pyphen" if valid, empty string otherwise
- message: Description if valid, error description if invalid

Return type:

tuple[bool, str, str]

Examples

>>> validate_corpus_directory(Path("/path/to/20260110_115601_nltk"))
(True, "NLTK", "Valid NLTK corpus")

>>> validate_corpus_directory(Path("/invalid/path"))
(False, "", "Directory does not exist")
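The file-pair check described above could be sketched roughly as follows. This is an illustration only, not the module's actual implementation: the helper name `sketch_validate` is hypothetical, and any message string that does not appear in the examples above is an assumption.

```python
from pathlib import Path

# File name prefixes taken from the documented corpus layouts.
CORPUS_PREFIXES = {"NLTK": "nltk", "Pyphen": "pyphen"}


def sketch_validate(path: Path) -> tuple[bool, str, str]:
    """Hypothetical re-implementation of the documented file-pair check."""
    if not path.exists():
        return (False, "", "Directory does not exist")
    if not path.is_dir():
        # Assumed message; the real wording is not documented.
        return (False, "", "Path is not a directory")
    for corpus_type, prefix in CORPUS_PREFIXES.items():
        unique_file = path / f"{prefix}_syllables_unique.txt"
        freq_file = path / f"{prefix}_syllables_frequencies.json"
        # Both files of a pair must be present for the corpus to count as valid.
        if unique_file.is_file() and freq_file.is_file():
            return (True, corpus_type, f"Valid {corpus_type} corpus")
    # Assumed message; the real wording is not documented.
    return (False, "", "No NLTK or Pyphen corpus files found")
```

The sketch checks NLTK before Pyphen; which structure the real function prefers when both are present is not stated in this reference.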

build_tools.syllable_walk_tui.services.corpus.get_corpus_info(path)

Get display-friendly corpus information string.

Parameters:

path (pathlib.Path) – Path to corpus directory

Returns:

Short description string for UI display

Return type:

str

Examples

>>> get_corpus_info(Path("/path/to/20260110_115601_nltk"))
"NLTK (20260110_115601_nltk)"
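The label in the example combines the corpus type with the directory name. A minimal sketch of that formatting is shown here, assuming the corpus type has already been determined. The helper `sketch_corpus_label` is hypothetical, and what the real function returns for an invalid directory is not documented.

```python
from pathlib import Path


def sketch_corpus_label(path: Path, corpus_type: str) -> str:
    """Hypothetical formatter: '<corpus type> (<directory name>)'."""
    return f"{corpus_type} ({path.name})"


label = sketch_corpus_label(Path("/path/to/20260110_115601_nltk"), "NLTK")
print(label)  # prints NLTK (20260110_115601_nltk)
```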

build_tools.syllable_walk_tui.services.corpus.load_corpus_data(path)

Load syllables and frequencies from a validated corpus directory.

Parameters:

path (pathlib.Path) – Path to validated corpus directory

Returns:

Tuple of (syllables_list, frequencies_dict):
- syllables_list: List of unique syllables (one per line from .txt file)
- frequencies_dict: Dictionary mapping syllable to frequency count

Raises:

ValueError – If directory is invalid or files cannot be loaded
FileNotFoundError – If expected corpus files are missing
json.JSONDecodeError – If frequencies JSON is malformed

Return type:

tuple[list[str], dict[str, int]]

Examples

>>> syllables, freqs = load_corpus_data(Path("/path/to/20260110_115601_nltk"))
>>> len(syllables)
15234
>>> freqs["hello"]
42

Note

This function assumes the directory has already been validated with validate_corpus_directory(). It will attempt to load from either NLTK or Pyphen corpus files based on what exists.
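As a rough illustration of the loading behaviour described above, the following sketch reads whichever file pair exists and returns the documented tuple shape. The function name `sketch_load_corpus` is hypothetical, and details such as skipping blank lines and the UTF-8 encoding are assumptions rather than documented behaviour.

```python
import json
from pathlib import Path


def sketch_load_corpus(path: Path) -> tuple[list[str], dict[str, int]]:
    """Hypothetical loader mirroring the documented return shape."""
    for prefix in ("nltk", "pyphen"):
        unique_file = path / f"{prefix}_syllables_unique.txt"
        freq_file = path / f"{prefix}_syllables_frequencies.json"
        if unique_file.is_file() and freq_file.is_file():
            # One syllable per line; blank lines are skipped (an assumption).
            text = unique_file.read_text(encoding="utf-8")
            syllables = [line.strip() for line in text.splitlines() if line.strip()]
            # json.JSONDecodeError propagates if the frequencies file is malformed.
            frequencies = json.loads(freq_file.read_text(encoding="utf-8"))
            return syllables, frequencies
    raise FileNotFoundError(f"No corpus files found in {path}")
```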

build_tools.syllable_walk_tui.services.corpus.load_annotated_data_from_sqlite(db_path)

Load phonetic feature annotations from a SQLite corpus database.

This function loads syllable data from an optimized SQLite database, which is much faster and more memory-efficient than loading from JSON.

Parameters:

db_path (pathlib.Path) – Path to corpus.db file

Returns:

List of dictionaries, each containing:
- syllable: The syllable string
- frequency: Occurrence count in source corpus
- features: Dict of boolean phonetic feature flags

Return type:

list[dict]

Raises:

FileNotFoundError – If database file doesn’t exist
sqlite3.Error – If database is corrupted or incompatible

Performance Notes:

Much faster than JSON loading (<100ms vs 1-2s)
Memory-efficient (loads on-demand)
Can be called on main thread without freezing UI

Examples

>>> db_path = Path("/path/to/20260110_115601_nltk/data/corpus.db")
>>> data = load_annotated_data_from_sqlite(db_path)
>>> len(data)
33640
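The database schema is not described in this reference, so the sketch below rests on assumptions: a single table named `syllables` with `syllable` and `frequency` columns, and one integer (0/1) column per feature flag. The function name `sketch_load_sqlite` is also hypothetical. It shows one way rows could be mapped to the documented record shape, and it reads all rows eagerly rather than on demand.

```python
import sqlite3
from pathlib import Path


def sketch_load_sqlite(db_path: Path) -> list[dict]:
    """Hypothetical loader; table and column names are assumptions."""
    if not db_path.exists():
        # sqlite3.connect would otherwise create an empty database file.
        raise FileNotFoundError(f"Database not found: {db_path}")
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    try:
        rows = conn.execute("SELECT * FROM syllables ORDER BY syllable").fetchall()
    finally:
        conn.close()
    records = []
    for row in rows:
        # Every column other than syllable/frequency is treated as a feature flag.
        features = {
            key: bool(row[key])
            for key in row.keys()
            if key not in ("syllable", "frequency")
        }
        records.append(
            {"syllable": row["syllable"], "frequency": row["frequency"], "features": features}
        )
    return records
```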

build_tools.syllable_walk_tui.services.corpus.load_annotated_data(path)

Load phonetic feature annotations from a validated corpus directory.

This function loads from SQLite when a database is available and falls back to JSON otherwise, for backwards compatibility. SQLite loading is faster and more memory-efficient (see Performance Notes below).

Data structure (same for both sources):

    [
        {
            "syllable": "aa",
            "frequency": 1022,
            "features": {
                "starts_with_vowel": true,
                "starts_with_cluster": false,
                "starts_with_heavy_cluster": false,
                "contains_plosive": false,
                "contains_fricative": false,
                "contains_liquid": false,
                "contains_nasal": false,
                "short_vowel": false,
                "long_vowel": true,
                "ends_with_vowel": true,
                "ends_with_nasal": false,
                "ends_with_stop": false
            }
        }
    ]
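Because every record carries the same boolean feature flags, the loaded list can be filtered with ordinary Python. The helper below is a hypothetical usage sketch, not part of this module, and the sample records use a shortened feature set with illustrative values.

```python
def filter_by_features(records: list[dict], required: dict[str, bool]) -> list[dict]:
    """Keep records whose feature flags match every entry in `required`."""
    return [
        record
        for record in records
        if all(record["features"].get(name) == value for name, value in required.items())
    ]


# Sample records following the documented layout (illustrative values only).
records = [
    {"syllable": "aa", "frequency": 1022,
     "features": {"starts_with_vowel": True, "ends_with_nasal": False}},
    {"syllable": "an", "frequency": 10,
     "features": {"starts_with_vowel": True, "ends_with_nasal": True}},
]

matches = filter_by_features(records, {"starts_with_vowel": True, "ends_with_nasal": False})
print([r["syllable"] for r in matches])  # prints ['aa']
```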

Parameters:

path (pathlib.Path) – Path to validated corpus directory

Returns:

Tuple of (data, metadata):
- data: List of dictionaries, each containing:
  - syllable: The syllable string
  - frequency: Occurrence count in source corpus
  - features: Dict of boolean phonetic feature flags
- metadata: Dictionary with loading information:
  - source: "sqlite" or "json"
  - file_name: Name of the file loaded from
  - load_time_ms: Approximate load time in milliseconds

Return type:

tuple[list[dict], dict]

Raises:

ValueError – If directory is invalid or file cannot be loaded
FileNotFoundError – If neither SQLite nor JSON data is available
json.JSONDecodeError – If JSON is malformed (when loading from JSON)

Performance Notes:

SQLite: <100ms load time, memory-efficient (preferred)
JSON: 1-2s load time, loads entire file into memory (fallback)
Automatically chooses best available format

Examples

>>> data, meta = load_annotated_data(Path("/path/to/20260110_115601_nltk"))
>>> len(data)
33640
>>> meta["source"]
"sqlite"
>>> data[0]["syllable"]
"aa"
>>> data[0]["features"]["starts_with_vowel"]
True
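The SQLite-first, JSON-fallback selection and the metadata dictionary could be sketched as follows. This is a hedged illustration, not the module's implementation: the function name `sketch_load_with_fallback` is hypothetical, the caller supplies both file paths because the JSON file name is not given in this reference, and the SQLite loader is passed in as a callable.

```python
import json
import time
from pathlib import Path
from typing import Callable


def sketch_load_with_fallback(
    db_path: Path,
    json_path: Path,
    load_sqlite: Callable[[Path], list[dict]],
) -> tuple[list[dict], dict]:
    """Hypothetical loader: prefer SQLite, fall back to JSON, time the load."""
    start = time.perf_counter()
    if db_path.is_file():
        data, source, used = load_sqlite(db_path), "sqlite", db_path
    elif json_path.is_file():
        # The whole JSON file is read into memory, as the fallback notes describe.
        data = json.loads(json_path.read_text(encoding="utf-8"))
        source, used = "json", json_path
    else:
        raise FileNotFoundError("Neither SQLite nor JSON annotated data found")
    elapsed_ms = (time.perf_counter() - start) * 1000
    metadata = {
        "source": source,
        "file_name": used.name,
        "load_time_ms": round(elapsed_ms),
    }
    return data, metadata
```

Passing the SQLite loader as a parameter keeps the sketch independent of any particular database schema.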