build_tools.syllable_walk_tui.services.corpus
Corpus directory validation and utilities for Syllable Walker TUI.
This module provides functions for validating and loading corpus data from normalized syllable extraction output directories.
Functions
validate_corpus_directory(path) | Validate that a directory contains valid corpus files.
get_corpus_info(path) | Get display-friendly corpus information string.
load_corpus_data(path) | Load syllables and frequencies from a validated corpus directory.
load_annotated_data_from_sqlite(db_path) | Load phonetic feature annotations from a SQLite corpus database.
load_annotated_data(path) | Load phonetic feature annotations from a validated corpus directory.
Module Contents
- build_tools.syllable_walk_tui.services.corpus.validate_corpus_directory(path)
Validate that a directory contains valid corpus files.
Checks for either NLTK or Pyphen corpus structure:
- NLTK: nltk_syllables_unique.txt + nltk_syllables_frequencies.json
- Pyphen: pyphen_syllables_unique.txt + pyphen_syllables_frequencies.json
- Parameters:
path (pathlib.Path) – Directory path to validate
- Returns:
Tuple of (is_valid, corpus_type, message):
  is_valid: True if the directory is a valid corpus directory
  corpus_type: "NLTK" or "Pyphen" if valid, empty string otherwise
  message: Description if valid, error description if invalid
- Return type:
tuple[bool, str, str]
Examples
>>> validate_corpus_directory(Path("/path/to/20260110_115601_nltk"))
(True, "NLTK", "Valid NLTK corpus")
>>> validate_corpus_directory(Path("/invalid/path"))
(False, "", "Directory does not exist")
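The check described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the helper name `sketch_validate_corpus_directory`, the `_CORPUS_FILES` table, and every message string other than the two shown in the examples are assumptions.

```python
from __future__ import annotations

from pathlib import Path

# File pairs named in the documentation above, keyed by corpus type.
_CORPUS_FILES = {
    "NLTK": ("nltk_syllables_unique.txt", "nltk_syllables_frequencies.json"),
    "Pyphen": ("pyphen_syllables_unique.txt", "pyphen_syllables_frequencies.json"),
}


def sketch_validate_corpus_directory(path: Path) -> tuple[bool, str, str]:
    """Hypothetical re-implementation of the documented check."""
    if not path.exists():
        return (False, "", "Directory does not exist")
    if not path.is_dir():
        return (False, "", "Path is not a directory")  # assumed message
    for corpus_type, file_names in _CORPUS_FILES.items():
        # A corpus is accepted only when both files of its pair are present.
        if all((path / name).is_file() for name in file_names):
            return (True, corpus_type, f"Valid {corpus_type} corpus")
    return (False, "", "No NLTK or Pyphen corpus files found")  # assumed message
```

Returning a tuple instead of raising lets a UI show the message directly without wrapping the call in exception handling.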
- build_tools.syllable_walk_tui.services.corpus.get_corpus_info(path)
Get display-friendly corpus information string.
- Parameters:
path (pathlib.Path) – Path to corpus directory
- Returns:
Short description string for UI display
- Return type:
str
Examples
>>> get_corpus_info(Path("/path/to/20260110_115601_nltk"))
"NLTK (20260110_115601_nltk)"
- build_tools.syllable_walk_tui.services.corpus.load_corpus_data(path)
Load syllables and frequencies from a validated corpus directory.
- Parameters:
path (pathlib.Path) – Path to validated corpus directory
- Returns:
Tuple of (syllables_list, frequencies_dict):
  syllables_list: List of unique syllables (one per line from the .txt file)
  frequencies_dict: Dictionary mapping each syllable to its frequency count
- Raises:
ValueError – If directory is invalid or files cannot be loaded
FileNotFoundError – If expected corpus files are missing
json.JSONDecodeError – If frequencies JSON is malformed
- Return type:
tuple[list[str], dict[str, int]]
Examples
>>> syllables, freqs = load_corpus_data(Path("/path/to/20260110_115601_nltk"))
>>> len(syllables)
15234
>>> freqs["hello"]
42
Note
This function assumes the directory has already been validated with validate_corpus_directory(). It will attempt to load from either NLTK or Pyphen corpus files based on what exists.
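As a rough sketch of the loading behaviour described in this entry, the function below tries the NLTK file pair first and then the Pyphen pair. The name `sketch_load_corpus_data`, the lookup order, the UTF-8 encoding, and the skipping of blank lines are assumptions rather than documented behaviour.

```python
from __future__ import annotations

import json
from pathlib import Path


def sketch_load_corpus_data(path: Path) -> tuple[list[str], dict[str, int]]:
    """Hypothetical loader mirroring the documented return value."""
    for prefix in ("nltk", "pyphen"):  # assumed lookup order
        syllables_file = path / f"{prefix}_syllables_unique.txt"
        frequencies_file = path / f"{prefix}_syllables_frequencies.json"
        if syllables_file.is_file() and frequencies_file.is_file():
            # One syllable per line; blank lines are skipped (assumed handling).
            lines = syllables_file.read_text(encoding="utf-8").splitlines()
            syllables = [line.strip() for line in lines if line.strip()]
            # json.loads raises json.JSONDecodeError on malformed input.
            frequencies = json.loads(frequencies_file.read_text(encoding="utf-8"))
            return syllables, frequencies
    raise FileNotFoundError(f"No NLTK or Pyphen corpus files in {path}")
```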
- build_tools.syllable_walk_tui.services.corpus.load_annotated_data_from_sqlite(db_path)
Load phonetic feature annotations from a SQLite corpus database.
This function loads syllable data from an optimized SQLite database, which is much faster and more memory-efficient than loading from JSON.
- Parameters:
db_path (pathlib.Path) – Path to corpus.db file
- Returns:
List of dictionaries, each containing:
  syllable: The syllable string
  frequency: Occurrence count in source corpus
  features: Dict of boolean phonetic feature flags
- Raises:
FileNotFoundError – If database file doesn’t exist
sqlite3.Error – If database is corrupted or incompatible
- Performance Notes:
Much faster than JSON loading (<100ms vs 1-2s)
Memory-efficient (loads on-demand)
Can be called on main thread without freezing UI
Examples
>>> db_path = Path("/path/to/20260110_115601_nltk/data/corpus.db")
>>> data = load_annotated_data_from_sqlite(db_path)
>>> len(data)
33640
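The sketch below shows one way rows from a SQLite file could be turned into the documented list of dictionaries. The database schema is not described on this page, so the table name `syllables`, its column layout (one integer column per feature), the three-feature subset in `FEATURE_COLUMNS`, and the function name are all assumptions made for illustration.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path

# Assumed schema: table "syllables" with one 0/1 integer column per feature.
# Only a subset of the documented feature names is used to keep the sketch short.
FEATURE_COLUMNS = ("starts_with_vowel", "long_vowel", "ends_with_vowel")


def sketch_load_from_sqlite(db_path: Path) -> list[dict]:
    """Hypothetical loader; the real corpus.db layout may differ."""
    # sqlite3.connect() would silently create a missing file, so check first.
    if not db_path.exists():
        raise FileNotFoundError(f"Database not found: {db_path}")
    conn = sqlite3.connect(db_path)
    try:
        conn.row_factory = sqlite3.Row  # allows access to columns by name
        rows = conn.execute("SELECT * FROM syllables ORDER BY syllable").fetchall()
    finally:
        conn.close()
    return [
        {
            "syllable": row["syllable"],
            "frequency": row["frequency"],
            # SQLite has no boolean type, so 0/1 integers are converted here.
            "features": {name: bool(row[name]) for name in FEATURE_COLUMNS},
        }
        for row in rows
    ]
```

The explicit existence check matters because opening a nonexistent path with `sqlite3.connect` creates an empty database instead of failing.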
- build_tools.syllable_walk_tui.services.corpus.load_annotated_data(path)
Load phonetic feature annotations from a validated corpus directory.
This function loads from SQLite when a corpus database is available and falls back to JSON otherwise (kept for backwards compatibility). SQLite loading is much faster and more memory-efficient.
Data structure (same for both sources):
[
  {
    "syllable": "aa",
    "frequency": 1022,
    "features": {
      "starts_with_vowel": true,
      "starts_with_cluster": false,
      "starts_with_heavy_cluster": false,
      "contains_plosive": false,
      "contains_fricative": false,
      "contains_liquid": false,
      "contains_nasal": false,
      "short_vowel": false,
      "long_vowel": true,
      "ends_with_vowel": true,
      "ends_with_nasal": false,
      "ends_with_stop": false
    }
  }
]
- Parameters:
path (pathlib.Path) – Path to validated corpus directory
- Returns:
Tuple of (data, metadata):
- data: List of dictionaries, each containing:
  syllable: The syllable string
  frequency: Occurrence count in source corpus
  features: Dict of boolean phonetic feature flags
- metadata: Dictionary with loading information:
  source: "sqlite" or "json"
  file_name: Name of the file loaded from
  load_time_ms: Approximate load time in milliseconds
- Raises:
ValueError – If directory is invalid or file cannot be loaded
FileNotFoundError – If neither SQLite nor JSON data is available
json.JSONDecodeError – If JSON is malformed (when loading from JSON)
- Performance Notes:
SQLite: <100ms load time, memory-efficient (preferred)
JSON: 1-2s load time, loads entire file into memory (fallback)
Automatically chooses best available format
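The SQLite-first, JSON-fallback selection and the metadata dictionary can be sketched as below. This is only an illustration under stated assumptions: the function name, the `json_name` parameter (the JSON file name is not given on this page), and the injected `load_sqlite` callable standing in for `load_annotated_data_from_sqlite` are hypothetical. The `data/corpus.db` location is taken from the `load_annotated_data_from_sqlite` example.

```python
from __future__ import annotations

import json
import time
from pathlib import Path
from typing import Callable


def sketch_load_annotated_data(
    corpus_dir: Path,
    json_name: str,  # hypothetical: the real JSON file name is not documented here
    load_sqlite: Callable[[Path], list],  # stand-in for the SQLite loader
) -> tuple[list, dict]:
    """Hypothetical fallback logic: prefer SQLite, otherwise read JSON."""
    start = time.perf_counter()
    db_path = corpus_dir / "data" / "corpus.db"
    json_path = corpus_dir / json_name
    if db_path.is_file():
        data, source, used = load_sqlite(db_path), "sqlite", db_path
    elif json_path.is_file():
        data = json.loads(json_path.read_text(encoding="utf-8"))
        source, used = "json", json_path
    else:
        raise FileNotFoundError(f"No SQLite or JSON data in {corpus_dir}")
    metadata = {
        "source": source,
        "file_name": used.name,
        # Wall-clock time for the whole load, converted to milliseconds.
        "load_time_ms": (time.perf_counter() - start) * 1000,
    }
    return data, metadata
```

Passing the SQLite loader in as a parameter keeps the sketch self-contained; the real function presumably calls `load_annotated_data_from_sqlite` directly.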
Examples
>>> data, meta = load_annotated_data(Path("/path/to/20260110_115601_nltk"))
>>> len(data)
33640
>>> meta["source"]
"sqlite"
>>> data[0]["syllable"]
"aa"
>>> data[0]["features"]["starts_with_vowel"]
True
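Once loaded, the list of dictionaries can be filtered on its feature flags. The helper below is a hypothetical convenience, not part of this module. The "aa" entry reuses values from the data structure above (with a subset of its features), while the "ban" entry is made-up sample data.

```python
def syllables_with_features(data, **required):
    """Return syllables whose feature flags match every keyword argument."""
    return [
        entry["syllable"]
        for entry in data
        if all(entry["features"].get(name) == wanted for name, wanted in required.items())
    ]


# Small in-memory sample standing in for the result of load_annotated_data().
sample = [
    {
        "syllable": "aa",
        "frequency": 1022,
        "features": {"starts_with_vowel": True, "long_vowel": True, "ends_with_nasal": False},
    },
    {
        "syllable": "ban",  # made-up entry for illustration
        "frequency": 7,
        "features": {"starts_with_vowel": False, "long_vowel": False, "ends_with_nasal": True},
    },
]

vowel_initial = syllables_with_features(sample, starts_with_vowel=True)
nasal_final = syllables_with_features(sample, starts_with_vowel=False, ends_with_nasal=True)
```

Here `vowel_initial` contains only "aa" and `nasal_final` contains only "ban".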