Corpus SQLite Builder

Overview

Corpus SQLite Builder - JSON to SQLite Conversion Tool

Converts large annotated JSON files into optimized SQLite databases for efficient querying in interactive tools like the syllable_walk_tui.

This is a build-time tool only - not used during runtime name generation.

Features: - Memory-efficient conversion of 100MB+ JSON files - Batched transactions for performance - Idempotent conversion (safe to re-run) - Auto-discovery of annotated JSON files - Batch conversion support

Usage:

>>> from build_tools.corpus_sqlite_builder import convert_json_to_sqlite
>>> from pathlib import Path
>>> corpus_dir = Path("_working/output/20260110_115453_pyphen")
>>> db_path = convert_json_to_sqlite(corpus_dir)
>>> print(f"Created: {db_path}")

Command-line usage:

# Convert single corpus
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/

# Force overwrite
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/ --force

# Batch convert all
python -m build_tools.corpus_sqlite_builder --batch _working/output/

Design Philosophy:

JSON is the canonical source of truth (human-readable, portable)
SQLite is derived data (optimized for queries, regeneratable)
Both formats coexist in data/ subdirectory
TUI prefers SQLite, falls back to JSON

Command-Line Interface

Convert annotated JSON files to SQLite databases for efficient querying.

This tool converts large JSON files (*_syllables_annotated.json) into optimized SQLite databases (corpus.db) for memory-efficient, high-performance queries in interactive tools like the TUI.

The JSON file remains the canonical source of truth. The SQLite database is derived data that can be regenerated at any time.

usage: python -m build_tools.corpus_sqlite_builder [-h] [--force]
                                                   [--batch OUTPUT_DIR]
                                                   [--batch-size N]
                                                   [--dry-run]
                                                   [corpus_dir]

Positional Arguments

corpus_dir: Path to corpus directory containing data/ subdirectory with annotated JSON. Example: _working/output/20260110_115453_pyphen/

Named Arguments

--force

Overwrite existing corpus.db file if it exists. Default: False (raise error if database exists)

Default: False

--batch

Batch convert all corpora in the specified output directory. Example: –batch _working/output/ Will discover and convert all corpus directories.

--batch-size

Number of records to insert per database transaction. Larger values are faster but use more memory. Default: 10000

Default: 10000

--dry-run

Show what would be converted without actually creating the database. Useful for checking if JSON files are discoverable.

Default: False

# Convert single corpus (auto-discovers JSON in data/ subdirectory)
python -m build_tools.corpus_sqlite_builder \
  _working/output/20260110_115453_pyphen/

# Force overwrite existing database
python -m build_tools.corpus_sqlite_builder \
  _working/output/20260110_115453_pyphen/ --force

# Dry run to check what would be converted
python -m build_tools.corpus_sqlite_builder \
  _working/output/20260110_115453_pyphen/ --dry-run

# Batch convert all corpora in output directory
python -m build_tools.corpus_sqlite_builder --batch _working/output/

Creates corpus.db in the data/ subdirectory of the corpus:

_working/output/20260110_115453_pyphen/data/corpus.db

Output Format

The corpus SQLite builder creates optimized database files in the data/ subdirectory of each corpus. The SQLite database is a derived format that coexists with the canonical JSON files.

Output structure:

_working/output/20260110_115453_pyphen/
├── data/
│   ├── pyphen_syllables_annotated.json  (canonical, 11MB)
│   └── corpus.db                         (derived, size varies by corpus)
├── pyphen_syllables_unique.txt
├── pyphen_syllables_frequencies.json
└── ... (other corpus files)

SQLite Database Schema:

The corpus.db file contains two application tables:

syllables table - Main data storage
- syllable (TEXT, PRIMARY KEY): The syllable string
- frequency (INTEGER): Occurrence count in source corpus
- 12 phonetic feature columns (INTEGER 0/1):
  - starts_with_vowel
  - starts_with_cluster
  - starts_with_heavy_cluster
  - contains_plosive
  - contains_fricative
  - contains_liquid
  - contains_nasal
  - short_vowel
  - long_vowel
  - ends_with_vowel
  - ends_with_nasal
  - ends_with_stop
metadata table - Database information
- schema_version: Database schema version (for migrations)
- source_tool: Tool that created the database
- source_version: Version of the converter tool
- generated_at: ISO 8601 timestamp
- total_syllables: Number of syllables in database
- source_json_path: Original JSON file name

SQLite may also include internal statistics tables (for example sqlite_stat1) after optimization runs (ANALYZE). These are SQLite-managed and not part of the application data model.

Indexes created:

idx_starts_with_vowel on syllables(starts_with_vowel)
idx_ends_with_vowel on syllables(ends_with_vowel)
idx_frequency on syllables(frequency DESC)
idx_vowel_boundaries on syllables(starts_with_vowel, ends_with_vowel)
idx_starts_with_cluster on syllables(starts_with_cluster)
idx_ends_with_stop on syllables(ends_with_stop)

Performance characteristics:

File size: typically much smaller than canonical JSON (exact ratio depends on corpus size/content)
Load time: significantly faster than JSON loading for interactive tools (Walker web + TUI)
Memory: On-demand loading vs full file in memory
Query: Indexed, optimized for filtering by features

Integration Guide

The corpus SQLite builder is an optional performance optimization that runs after syllable feature annotation. The JSON file remains the canonical source of truth, while the SQLite database provides a faster query interface for interactive Walker tools (web and TUI).

Syllable Walk Web Integration

build_tools.syllable_walk_web integrates with corpus.db at four points:

Pipeline execution (database creation)
- In the web pipeline runner, the database stage runs after annotation and executes:
```
python -m build_tools.corpus_sqlite_builder <run_directory> --force
```
- Output location is the standard path: <run_directory>/data/corpus.db.
- If normalization/annotation stages are skipped, this database stage does not run.
Run discovery and listing
- The web History flow is manifest-driven: run discovery requires <run_dir>/manifest.json.
- corpus_db_path is surfaced when data/corpus.db is listed in manifest artifacts and the file exists.
- Syllable counts in History come from manifest metrics (metrics.syllable_count_unique), not ad-hoc DB/JSON recounting.
- Runs without a valid manifest are excluded from History.
Corpus loading for Walker API
- /api/walker/load-corpus calls the corpus loader, which delegates to build_tools.syllable_walk.db.load_syllables.
- Loading behavior is:
  - Prefer SQLite when corpus.db exists and is readable.
  - Fall back to *_syllables_annotated.json if SQLite is missing or read fails.
  - Return source info to the UI (for example "SQLite (2,088 syllables)" or "JSON (... syllables)").
Read-only runtime queries
- SQLite access in the walker data layer uses read-only URI mode (file:<path>?mode=ro).
- The web stack reads from syllables (for full corpus load and counts) and does not mutate corpus.db during browsing/walking.

Standard build pipeline:

# Step 1: Extract syllables (pyphen or NLTK)
python -m build_tools.pyphen_syllable_extractor --file input.txt

# Step 2: Normalize syllables
python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_115453_pyphen/

# Step 3: Annotate with phonetic features
python -m build_tools.syllable_feature_annotator \
    --syllables _working/output/20260110_115453_pyphen/pyphen_syllables_unique.txt \
    --frequencies _working/output/20260110_115453_pyphen/pyphen_syllables_frequencies.json

# Step 4: Convert to SQLite (OPTIONAL, for performance)
python -m build_tools.corpus_sqlite_builder \
    _working/output/20260110_115453_pyphen/

When to use this tool:

High-value for interactive Walker performance: If you’re using syllable_walk_tui or syllable_walk_web with large corpora (10,000+ syllables), converting to SQLite dramatically improves load times and responsiveness.
Optional for other tools: Command-line analysis tools can work directly with JSON files. SQLite is only necessary when interactive performance matters.
Batch conversion: Run with --batch flag to convert all existing corpora at once.

Regeneration is safe:

The SQLite database can be deleted and regenerated at any time from the JSON source:

# Delete database
rm _working/output/20260110_115453_pyphen/data/corpus.db

# Regenerate from JSON
python -m build_tools.corpus_sqlite_builder \
    _working/output/20260110_115453_pyphen/

Idempotent operation:

Running the converter multiple times is safe. Use --force to overwrite existing databases:

# Regenerate, overwriting existing database
python -m build_tools.corpus_sqlite_builder \
    _working/output/20260110_115453_pyphen/ --force

Notes

Design Philosophy:

The corpus SQLite builder follows the principle that JSON is the canonical format and SQLite is derived data. This means:

JSON files are never deleted or modified by this tool
SQLite databases can be regenerated at any time
Both formats coexist peacefully in the data/ subdirectory
Walker loaders automatically prefer SQLite when available and fall back to JSON

Performance considerations:

Conversion time: ~5-10 seconds for 30,000 syllables
Memory usage during conversion: ~50MB peak (batched inserts)
The --batch-size parameter controls transaction size (default: 10,000)
Larger batch sizes are faster but use more memory

Backwards compatibility:

Walker tools maintain full backwards compatibility with JSON-only corpora. If no corpus.db file exists, load falls back to JSON with no errors. This allows gradual migration to SQLite without breaking existing workflows.

Build-time tool:

This is a build-time tool only - not used during runtime name generation. The SQLite databases are used exclusively by interactive analysis tools.

API Reference

Corpus SQLite Builder - JSON to SQLite Conversion Tool

Converts large annotated JSON files into optimized SQLite databases for efficient querying in interactive tools like the syllable_walk_tui.

This is a build-time tool only - not used during runtime name generation.

Features: - Memory-efficient conversion of 100MB+ JSON files - Batched transactions for performance - Idempotent conversion (safe to re-run) - Auto-discovery of annotated JSON files - Batch conversion support

Usage:

>>> from build_tools.corpus_sqlite_builder import convert_json_to_sqlite
>>> from pathlib import Path
>>> corpus_dir = Path("_working/output/20260110_115453_pyphen")
>>> db_path = convert_json_to_sqlite(corpus_dir)
>>> print(f"Created: {db_path}")

Command-line usage:

# Convert single corpus
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/

# Force overwrite
python -m build_tools.corpus_sqlite_builder _working/output/20260110_115453_pyphen/ --force

# Batch convert all
python -m build_tools.corpus_sqlite_builder --batch _working/output/

Design Philosophy:

JSON is the canonical source of truth (human-readable, portable)
SQLite is derived data (optimized for queries, regeneratable)
Both formats coexist in data/ subdirectory
TUI prefers SQLite, falls back to JSON

build_tools.corpus_sqlite_builder.convert_json_to_sqlite(corpus_dir, force=False, batch_size=10000)[source]

Convert an annotated JSON file to a SQLite database.

This function discovers the annotated JSON file in the corpus directory, creates a SQLite database with the appropriate schema, and efficiently converts all syllable data using batched transactions.

Parameters:

corpus_dir (Path) – Path to corpus directory (e.g., _working/output/20260110_115453_pyphen/)
force (bool) – If True, overwrite existing database. If False, raise error if exists.
batch_size (int) – Number of records to insert per transaction (default: 10000)

Return type:

Path

Returns:

Path to the created corpus.db file

Raises:

FileNotFoundError – If corpus_dir doesn’t exist or no annotated JSON found
FileExistsError – If corpus.db exists and force=False
ValueError – If JSON structure is invalid
json.JSONDecodeError – If JSON is malformed

build_tools.corpus_sqlite_builder.create_database(db_path)[source]

Create a new corpus database with the standard schema.

Parameters:: db_path (Path) – Path where the database will be created
Return type:: Connection
Returns:: SQLite connection to the newly created database
Raises:: sqlite3.Error – If database creation fails

build_tools.corpus_sqlite_builder.find_annotated_json(data_dir)[source]

Find the annotated JSON file in a corpus data directory.

Parameters:: data_dir (Path) – Path to the data directory (e.g., _working/output/…/data/)
Return type:: Path | None
Returns:: Path to the annotated JSON file, or None if not found

Looks for files matching the pattern: *_syllables_annotated.json Supports both pyphen and nltk prefixes.

build_tools.corpus_sqlite_builder.get_metadata(conn)[source]

Retrieve all metadata from the database.

Parameters:: conn (Connection) – SQLite database connection
Return type:: dict[str, str]
Returns:: Dictionary of metadata key-value pairs

build_tools.corpus_sqlite_builder.insert_metadata(conn, metadata)[source]

Insert metadata key-value pairs into the database.

Parameters:

conn (Connection) – SQLite database connection
metadata (dict[str, str]) – Dictionary of metadata key-value pairs

Return type:

None

Common metadata keys:

schema_version: Database schema version (int as string)
source_tool: Name of the tool that created this database
source_version: Version of the source tool
generated_at: ISO 8601 timestamp of creation
total_syllables: Number of syllables in the database (int as string)
source_json_path: Path to the source JSON file

build_tools.corpus_sqlite_builder.verify_schema_version(conn)[source]

Verify the database schema version matches the current version.

Parameters:: conn (Connection) – SQLite database connection
Return type:: int
Returns:: Schema version number from the database
Raises:: ValueError – If schema version is missing or incompatible