build_tools.syllable_walk_tui.services.corpus

Corpus directory validation and utilities for Syllable Walker TUI.

This module provides functions for validating and loading corpus data from normalized syllable extraction output directories.

Functions

validate_corpus_directory(path)

Validate that a directory contains valid corpus files.

get_corpus_info(path)

Get display-friendly corpus information string.

load_corpus_data(path)

Load syllables and frequencies from a validated corpus directory.

load_annotated_data_from_sqlite(db_path)

Load phonetic feature annotations from a SQLite corpus database.

load_annotated_data(path)

Load phonetic feature annotations from a validated corpus directory.

Module Contents

build_tools.syllable_walk_tui.services.corpus.validate_corpus_directory(path)

Validate that a directory contains valid corpus files.

Checks for either NLTK or Pyphen corpus structure:

  • nltk_syllables_unique.txt + nltk_syllables_frequencies.json

  • pyphen_syllables_unique.txt + pyphen_syllables_frequencies.json

Parameters:

path (pathlib.Path) – Directory path to validate

Returns:

Tuple of (is_valid, corpus_type, message):

  • is_valid: True if valid corpus directory

  • corpus_type: "NLTK" or "Pyphen" if valid, empty string otherwise

  • message: Description if valid, error description if invalid

Return type:

tuple[bool, str, str]

Examples

>>> validate_corpus_directory(Path("/path/to/20260110_115601_nltk"))
(True, 'NLTK', 'Valid NLTK corpus')
>>> validate_corpus_directory(Path("/invalid/path"))
(False, '', 'Directory does not exist')
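
The file check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: the helper name `sketch_validate`, the `CORPUS_FILES` table, and the wording of the final error message are hypothetical. Only the four file names and the two messages shown in the examples come from this page.

```python
from pathlib import Path

# File pairs named in the description above; the dict layout is illustrative.
CORPUS_FILES = {
    "NLTK": ("nltk_syllables_unique.txt", "nltk_syllables_frequencies.json"),
    "Pyphen": ("pyphen_syllables_unique.txt", "pyphen_syllables_frequencies.json"),
}


def sketch_validate(path: Path) -> tuple[bool, str, str]:
    """Hypothetical stand-in for validate_corpus_directory()."""
    if not path.is_dir():
        return (False, "", "Directory does not exist")
    for corpus_type, names in CORPUS_FILES.items():
        # A corpus type counts as present only if both of its files exist.
        if all((path / name).is_file() for name in names):
            return (True, corpus_type, f"Valid {corpus_type} corpus")
    # The wording of this message is an assumption.
    return (False, "", "No NLTK or Pyphen corpus files found")
```

Checking the pairs in a fixed order means a directory holding both sets of files is reported as NLTK in this sketch; how the real function resolves that case is not stated here.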
build_tools.syllable_walk_tui.services.corpus.get_corpus_info(path)

Get display-friendly corpus information string.

Parameters:

path (pathlib.Path) – Path to corpus directory

Returns:

Short description string for UI display

Return type:

str

Examples

>>> get_corpus_info(Path("/path/to/20260110_115601_nltk"))
'NLTK (20260110_115601_nltk)'
build_tools.syllable_walk_tui.services.corpus.load_corpus_data(path)

Load syllables and frequencies from a validated corpus directory.

Parameters:

path (pathlib.Path) – Path to validated corpus directory

Returns:

Tuple of (syllables_list, frequencies_dict):

  • syllables_list: List of unique syllables (one per line from .txt file)

  • frequencies_dict: Dictionary mapping syllable to frequency count

Raises:
Return type:

tuple[list[str], dict[str, int]]

Examples

>>> syllables, freqs = load_corpus_data(Path("/path/to/20260110_115601_nltk"))
>>> len(syllables)
15234
>>> freqs["hello"]
42

Note

This function assumes the directory has already been validated with validate_corpus_directory(). It will attempt to load from either NLTK or Pyphen corpus files based on what exists.
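
As a rough sketch of the loading step, assuming the file names listed under validate_corpus_directory() and UTF-8 text files: the helper name `sketch_load_corpus`, the blank-line handling, and the exception type are all assumptions made for illustration, not documented behavior.

```python
import json
from pathlib import Path


def sketch_load_corpus(path: Path) -> tuple[list[str], dict[str, int]]:
    """Hypothetical stand-in for load_corpus_data()."""
    for prefix in ("nltk", "pyphen"):
        unique_file = path / f"{prefix}_syllables_unique.txt"
        freq_file = path / f"{prefix}_syllables_frequencies.json"
        if unique_file.is_file() and freq_file.is_file():
            # One syllable per line; skipping blank lines is an assumption.
            text = unique_file.read_text(encoding="utf-8")
            syllables = [line.strip() for line in text.splitlines() if line.strip()]
            # The JSON file maps each syllable to its frequency count.
            frequencies = json.loads(freq_file.read_text(encoding="utf-8"))
            return syllables, frequencies
    # The exception type is an assumption; the real errors are not listed here.
    raise FileNotFoundError(f"No corpus files found in {path}")
```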

build_tools.syllable_walk_tui.services.corpus.load_annotated_data_from_sqlite(db_path)

Load phonetic feature annotations from a SQLite corpus database.

This function loads syllable data from an optimized SQLite database, which is faster and more memory-efficient than loading from JSON (under 100 ms versus 1-2 s).

Parameters:

db_path (pathlib.Path) – Path to corpus.db file

Returns:

List of dictionaries, each containing:

  • syllable: The syllable string

  • frequency: Occurrence count in source corpus

  • features: Dict of boolean phonetic feature flags

Raises:
Performance Notes:
  • Much faster than JSON loading (<100ms vs 1-2s)

  • Memory-efficient (loads on-demand)

  • Can be called on main thread without freezing UI

Examples

>>> db_path = Path("/path/to/20260110_115601_nltk/data/corpus.db")
>>> data = load_annotated_data_from_sqlite(db_path)
>>> len(data)
33640
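
Because each record is a plain dictionary with a `features` mapping, the loaded list can be filtered with ordinary Python. The sketch below works on hand-written sample records in the documented shape. The helper `filter_by_features` and the sample values are illustrative and not part of this module.

```python
# Illustrative records in the documented shape; the values are made up.
records = [
    {"syllable": "aa", "frequency": 1022,
     "features": {"starts_with_vowel": True, "ends_with_nasal": False}},
    {"syllable": "ban", "frequency": 310,
     "features": {"starts_with_vowel": False, "ends_with_nasal": True}},
    {"syllable": "om", "frequency": 57,
     "features": {"starts_with_vowel": True, "ends_with_nasal": True}},
]


def filter_by_features(data, required):
    """Keep records whose flags match every (name, value) pair in required."""
    return [
        rec for rec in data
        if all(rec["features"].get(name) == value for name, value in required.items())
    ]


vowel_initial = filter_by_features(records, {"starts_with_vowel": True})
print([rec["syllable"] for rec in vowel_initial])  # prints ['aa', 'om']
```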
build_tools.syllable_walk_tui.services.corpus.load_annotated_data(path)

Load phonetic feature annotations from a validated corpus directory.

This function loads from SQLite when a corpus database is available and falls back to JSON for backwards compatibility. SQLite loading is much faster and more memory-efficient.

Data structure (same for both sources):

    [
        {
            "syllable": "aa",
            "frequency": 1022,
            "features": {
                "starts_with_vowel": true,
                "starts_with_cluster": false,
                "starts_with_heavy_cluster": false,
                "contains_plosive": false,
                "contains_fricative": false,
                "contains_liquid": false,
                "contains_nasal": false,
                "short_vowel": false,
                "long_vowel": true,
                "ends_with_vowel": true,
                "ends_with_nasal": false,
                "ends_with_stop": false
            }
        }
    ]

Parameters:

path (pathlib.Path) – Path to validated corpus directory

Returns:

Tuple of (data, metadata):

  • data: List of dictionaries, each containing:
    • syllable: The syllable string

    • frequency: Occurrence count in source corpus

    • features: Dict of boolean phonetic feature flags

  • metadata: Dictionary with loading information:
    • source: "sqlite" or "json"

    • file_name: Name of the file loaded from

    • load_time_ms: Approximate load time in milliseconds

Raises:
Performance Notes:
  • SQLite: <100ms load time, memory-efficient (preferred)

  • JSON: 1-2s load time, loads entire file into memory (fallback)

  • Automatically chooses best available format

Examples

>>> data, meta = load_annotated_data(Path("/path/to/20260110_115601_nltk"))
>>> len(data)
33640
>>> meta["source"]
'sqlite'
>>> data[0]["syllable"]
'aa'
>>> data[0]["features"]["starts_with_vowel"]
True
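
The SQLite-first, JSON-fallback behavior and the metadata dictionary can be sketched as below. This is a hedged illustration only. `sketch_load_with_fallback` is a hypothetical helper, the two loader callables are supplied by the caller so that no file layout or schema is assumed, and rounding `load_time_ms` to one decimal place is a choice made for this sketch.

```python
import time
from pathlib import Path
from typing import Callable


def sketch_load_with_fallback(
    db_path: Path,
    json_path: Path,
    load_sqlite: Callable[[Path], list],
    load_json: Callable[[Path], list],
) -> tuple[list, dict]:
    """Hypothetical SQLite-first loader returning (data, metadata)."""
    start = time.perf_counter()
    if db_path.is_file():
        # Preferred path: the database exists, so use it.
        source, used = "sqlite", db_path
        data = load_sqlite(db_path)
    else:
        # Fallback path for corpora that only have the JSON file.
        source, used = "json", json_path
        data = load_json(json_path)
    elapsed_ms = (time.perf_counter() - start) * 1000
    metadata = {
        "source": source,
        "file_name": used.name,
        "load_time_ms": round(elapsed_ms, 1),  # rounding is an assumption
    }
    return data, metadata
```

Passing the loaders in as arguments keeps the selection logic testable on its own. The real function takes only the corpus directory and resolves the file paths itself.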