Corpus SQLite Builder
Overview
Corpus SQLite Builder - JSON to SQLite Conversion Tool
Converts large annotated JSON files into optimized SQLite databases for efficient querying in interactive tools like the syllable_walk_tui.
This is a build-time tool only - not used during runtime name generation.
Features: - Memory-efficient conversion of 100MB+ JSON files - Batched transactions for performance - Idempotent conversion (safe to re-run) - Auto-discovery of annotated JSON files - Batch conversion support
- Usage:
>>> from build_tools.corpus_sqlite_builder import convert_json_to_sqlite >>> from pathlib import Path >>> corpus_dir = Path("_working/output/20260110_115453_pyphen") >>> db_path = convert_json_to_sqlite(corpus_dir) >>> print(f"Created: {db_path}")
Command-line usage:
# Convert single corpus
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/
# Force overwrite
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/ --force
# Batch convert all
python -m build_tools.corpus_sqlite_builder --batch _working/output/
- Design Philosophy:
JSON is the canonical source of truth (human-readable, portable)
SQLite is derived data (optimized for queries, regeneratable)
Both formats coexist in data/ subdirectory
TUI prefers SQLite, falls back to JSON
Command-Line Interface
Convert annotated JSON files to SQLite databases for efficient querying.
This tool converts large JSON files (*_syllables_annotated.json) into optimized SQLite databases (corpus.db) for memory-efficient, high-performance queries in interactive tools like the TUI.
The JSON file remains the canonical source of truth. The SQLite database is derived data that can be regenerated at any time.
usage: python -m build_tools.corpus_sqlite_builder [-h] [--force]
[--batch OUTPUT_DIR]
[--batch-size N]
[--dry-run]
[corpus_dir]
Positional Arguments
- corpus_dir
Path to corpus directory containing data/ subdirectory with annotated JSON. Example: _working/output/20260110_115453_pyphen/
Named Arguments
- --force
Overwrite existing corpus.db file if it exists. Default: False (raise error if database exists)
Default:
False- --batch
Batch convert all corpora in the specified output directory. Example: –batch _working/output/ Will discover and convert all corpus directories.
- --batch-size
Number of records to insert per database transaction. Larger values are faster but use more memory. Default: 10000
Default:
10000- --dry-run
Show what would be converted without actually creating the database. Useful for checking if JSON files are discoverable.
Default:
False
# Convert single corpus (auto-discovers JSON in data/ subdirectory)
python -m build_tools.corpus_sqlite_builder \
_working/output/20260110_115453_pyphen/
# Force overwrite existing database
python -m build_tools.corpus_sqlite_builder \
_working/output/20260110_115453_pyphen/ --force
# Dry run to check what would be converted
python -m build_tools.corpus_sqlite_builder \
_working/output/20260110_115453_pyphen/ --dry-run
# Batch convert all corpora in output directory
python -m build_tools.corpus_sqlite_builder --batch _working/output/
Creates corpus.db in the data/ subdirectory of the corpus:
_working/output/20260110_115453_pyphen/data/corpus.db
Output Format
The corpus SQLite builder creates optimized database files in the data/ subdirectory
of each corpus. The SQLite database is a derived format that coexists with the canonical
JSON files.
Output structure:
_working/output/20260110_115453_pyphen/
├── data/
│ ├── pyphen_syllables_annotated.json (canonical, 11MB)
│ └── corpus.db (derived, size varies by corpus)
├── pyphen_syllables_unique.txt
├── pyphen_syllables_frequencies.json
└── ... (other corpus files)
SQLite Database Schema:
The corpus.db file contains two application tables:
syllables table - Main data storage
syllable(TEXT, PRIMARY KEY): The syllable stringfrequency(INTEGER): Occurrence count in source corpus12 phonetic feature columns (INTEGER 0/1):
starts_with_vowelstarts_with_clusterstarts_with_heavy_clustercontains_plosivecontains_fricativecontains_liquidcontains_nasalshort_vowellong_vowelends_with_vowelends_with_nasalends_with_stop
metadata table - Database information
schema_version: Database schema version (for migrations)source_tool: Tool that created the databasesource_version: Version of the converter toolgenerated_at: ISO 8601 timestamptotal_syllables: Number of syllables in databasesource_json_path: Original JSON file name
SQLite may also include internal statistics tables (for example sqlite_stat1)
after optimization runs (ANALYZE). These are SQLite-managed and not part of
the application data model.
Indexes created:
idx_starts_with_vowelonsyllables(starts_with_vowel)idx_ends_with_vowelonsyllables(ends_with_vowel)idx_frequencyonsyllables(frequency DESC)idx_vowel_boundariesonsyllables(starts_with_vowel, ends_with_vowel)idx_starts_with_clusteronsyllables(starts_with_cluster)idx_ends_with_stoponsyllables(ends_with_stop)
Performance characteristics:
File size: typically much smaller than canonical JSON (exact ratio depends on corpus size/content)
Load time: significantly faster than JSON loading for interactive tools (Walker web + TUI)
Memory: On-demand loading vs full file in memory
Query: Indexed, optimized for filtering by features
Integration Guide
The corpus SQLite builder is an optional performance optimization that runs after syllable feature annotation. The JSON file remains the canonical source of truth, while the SQLite database provides a faster query interface for interactive Walker tools (web and TUI).
Syllable Walk Web Integration
build_tools.syllable_walk_web integrates with corpus.db at four points:
Pipeline execution (database creation)
In the web pipeline runner, the database stage runs after annotation and executes:
python -m build_tools.corpus_sqlite_builder <run_directory> --force
Output location is the standard path:
<run_directory>/data/corpus.db.If normalization/annotation stages are skipped, this database stage does not run.
Run discovery and listing
The web History flow is manifest-driven: run discovery requires
<run_dir>/manifest.json.corpus_db_pathis surfaced whendata/corpus.dbis listed in manifest artifacts and the file exists.Syllable counts in History come from manifest metrics (
metrics.syllable_count_unique), not ad-hoc DB/JSON recounting.Runs without a valid manifest are excluded from History.
Corpus loading for Walker API
/api/walker/load-corpuscalls the corpus loader, which delegates tobuild_tools.syllable_walk.db.load_syllables.Loading behavior is:
Prefer SQLite when
corpus.dbexists and is readable.Fall back to
*_syllables_annotated.jsonif SQLite is missing or read fails.Return source info to the UI (for example
"SQLite (2,088 syllables)"or"JSON (... syllables)").
Read-only runtime queries
SQLite access in the walker data layer uses read-only URI mode (
file:<path>?mode=ro).The web stack reads from
syllables(for full corpus load and counts) and does not mutatecorpus.dbduring browsing/walking.
Standard build pipeline:
# Step 1: Extract syllables (pyphen or NLTK)
python -m build_tools.pyphen_syllable_extractor --file input.txt
# Step 2: Normalize syllables
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_115453_pyphen/
# Step 3: Annotate with phonetic features
python -m build_tools.syllable_feature_annotator \
--syllables _working/output/20260110_115453_pyphen/pyphen_syllables_unique.txt \
--frequencies _working/output/20260110_115453_pyphen/pyphen_syllables_frequencies.json
# Step 4: Convert to SQLite (OPTIONAL, for performance)
python -m build_tools.corpus_sqlite_builder \
_working/output/20260110_115453_pyphen/
When to use this tool:
High-value for interactive Walker performance: If you’re using
syllable_walk_tuiorsyllable_walk_webwith large corpora (10,000+ syllables), converting to SQLite dramatically improves load times and responsiveness.Optional for other tools: Command-line analysis tools can work directly with JSON files. SQLite is only necessary when interactive performance matters.
Batch conversion: Run with
--batchflag to convert all existing corpora at once.
Regeneration is safe:
The SQLite database can be deleted and regenerated at any time from the JSON source:
# Delete database
rm _working/output/20260110_115453_pyphen/data/corpus.db
# Regenerate from JSON
python -m build_tools.corpus_sqlite_builder \
_working/output/20260110_115453_pyphen/
Idempotent operation:
Running the converter multiple times is safe. Use --force to overwrite existing databases:
# Regenerate, overwriting existing database
python -m build_tools.corpus_sqlite_builder \
_working/output/20260110_115453_pyphen/ --force
Notes
Design Philosophy:
The corpus SQLite builder follows the principle that JSON is the canonical format and SQLite is derived data. This means:
JSON files are never deleted or modified by this tool
SQLite databases can be regenerated at any time
Both formats coexist peacefully in the
data/subdirectoryWalker loaders automatically prefer SQLite when available and fall back to JSON
Performance considerations:
Conversion time: ~5-10 seconds for 30,000 syllables
Memory usage during conversion: ~50MB peak (batched inserts)
The
--batch-sizeparameter controls transaction size (default: 10,000)Larger batch sizes are faster but use more memory
Backwards compatibility:
Walker tools maintain full backwards compatibility with JSON-only corpora. If no
corpus.db file exists, load falls back to JSON with no errors. This allows
gradual migration to SQLite without breaking existing workflows.
Build-time tool:
This is a build-time tool only - not used during runtime name generation. The SQLite databases are used exclusively by interactive analysis tools.
API Reference
Corpus SQLite Builder - JSON to SQLite Conversion Tool
Converts large annotated JSON files into optimized SQLite databases for efficient querying in interactive tools like the syllable_walk_tui.
This is a build-time tool only - not used during runtime name generation.
Features: - Memory-efficient conversion of 100MB+ JSON files - Batched transactions for performance - Idempotent conversion (safe to re-run) - Auto-discovery of annotated JSON files - Batch conversion support
- Usage:
>>> from build_tools.corpus_sqlite_builder import convert_json_to_sqlite >>> from pathlib import Path >>> corpus_dir = Path("_working/output/20260110_115453_pyphen") >>> db_path = convert_json_to_sqlite(corpus_dir) >>> print(f"Created: {db_path}")
Command-line usage:
# Convert single corpus
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/
# Force overwrite
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/ --force
# Batch convert all
python -m build_tools.corpus_sqlite_builder --batch _working/output/
- Design Philosophy:
JSON is the canonical source of truth (human-readable, portable)
SQLite is derived data (optimized for queries, regeneratable)
Both formats coexist in data/ subdirectory
TUI prefers SQLite, falls back to JSON
- build_tools.corpus_sqlite_builder.convert_json_to_sqlite(corpus_dir, force=False, batch_size=10000)[source]
Convert an annotated JSON file to a SQLite database.
This function discovers the annotated JSON file in the corpus directory, creates a SQLite database with the appropriate schema, and efficiently converts all syllable data using batched transactions.
- Parameters:
- Return type:
- Returns:
Path to the created corpus.db file
- Raises:
FileNotFoundError – If corpus_dir doesn’t exist or no annotated JSON found
FileExistsError – If corpus.db exists and force=False
ValueError – If JSON structure is invalid
json.JSONDecodeError – If JSON is malformed
- build_tools.corpus_sqlite_builder.create_database(db_path)[source]
Create a new corpus database with the standard schema.
- Parameters:
db_path (
Path) – Path where the database will be created- Return type:
- Returns:
SQLite connection to the newly created database
- Raises:
sqlite3.Error – If database creation fails
- build_tools.corpus_sqlite_builder.find_annotated_json(data_dir)[source]
Find the annotated JSON file in a corpus data directory.
- Parameters:
data_dir (
Path) – Path to the data directory (e.g., _working/output/…/data/)- Return type:
- Returns:
Path to the annotated JSON file, or None if not found
Looks for files matching the pattern: *_syllables_annotated.json Supports both pyphen and nltk prefixes.
- build_tools.corpus_sqlite_builder.get_metadata(conn)[source]
Retrieve all metadata from the database.
- Parameters:
conn (
Connection) – SQLite database connection- Return type:
- Returns:
Dictionary of metadata key-value pairs
- build_tools.corpus_sqlite_builder.insert_metadata(conn, metadata)[source]
Insert metadata key-value pairs into the database.
- Parameters:
conn (
Connection) – SQLite database connectionmetadata (
dict[str,str]) – Dictionary of metadata key-value pairs
- Return type:
- Common metadata keys:
schema_version: Database schema version (int as string)
source_tool: Name of the tool that created this database
source_version: Version of the source tool
generated_at: ISO 8601 timestamp of creation
total_syllables: Number of syllables in the database (int as string)
source_json_path: Path to the source JSON file
- build_tools.corpus_sqlite_builder.verify_schema_version(conn)[source]
Verify the database schema version matches the current version.
- Parameters:
conn (
Connection) – SQLite database connection- Return type:
- Returns:
Schema version number from the database
- Raises:
ValueError – If schema version is missing or incompatible