====================== Corpus SQLite Builder ====================== .. currentmodule:: build_tools.corpus_sqlite_builder Overview -------- .. automodule:: build_tools.corpus_sqlite_builder :no-members: Command-Line Interface ---------------------- .. argparse:: :module: build_tools.corpus_sqlite_builder.cli :func: create_argument_parser :prog: python -m build_tools.corpus_sqlite_builder Output Format ------------- The corpus SQLite builder creates optimized database files in the ``data/`` subdirectory of each corpus. The SQLite database is a derived format that coexists with the canonical JSON files. **Output structure:** .. code-block:: text _working/output/20260110_115453_pyphen/ ├── data/ │ ├── pyphen_syllables_annotated.json (canonical, 11MB) │ └── corpus.db (derived, size varies by corpus) ├── pyphen_syllables_unique.txt ├── pyphen_syllables_frequencies.json └── ... (other corpus files) **SQLite Database Schema:** The ``corpus.db`` file contains two application tables: 1. **syllables table** - Main data storage - ``syllable`` (TEXT, PRIMARY KEY): The syllable string - ``frequency`` (INTEGER): Occurrence count in source corpus - 12 phonetic feature columns (INTEGER 0/1): - ``starts_with_vowel`` - ``starts_with_cluster`` - ``starts_with_heavy_cluster`` - ``contains_plosive`` - ``contains_fricative`` - ``contains_liquid`` - ``contains_nasal`` - ``short_vowel`` - ``long_vowel`` - ``ends_with_vowel`` - ``ends_with_nasal`` - ``ends_with_stop`` 2. **metadata table** - Database information - ``schema_version``: Database schema version (for migrations) - ``source_tool``: Tool that created the database - ``source_version``: Version of the converter tool - ``generated_at``: ISO 8601 timestamp - ``total_syllables``: Number of syllables in database - ``source_json_path``: Original JSON file name SQLite may also include internal statistics tables (for example ``sqlite_stat1``) after optimization runs (``ANALYZE``). These are SQLite-managed and not part of the application data model. **Indexes created:** - ``idx_starts_with_vowel`` on ``syllables(starts_with_vowel)`` - ``idx_ends_with_vowel`` on ``syllables(ends_with_vowel)`` - ``idx_frequency`` on ``syllables(frequency DESC)`` - ``idx_vowel_boundaries`` on ``syllables(starts_with_vowel, ends_with_vowel)`` - ``idx_starts_with_cluster`` on ``syllables(starts_with_cluster)`` - ``idx_ends_with_stop`` on ``syllables(ends_with_stop)`` **Performance characteristics:** - File size: typically much smaller than canonical JSON (exact ratio depends on corpus size/content) - Load time: significantly faster than JSON loading for interactive tools (Walker web + TUI) - Memory: On-demand loading vs full file in memory - Query: Indexed, optimized for filtering by features Integration Guide ----------------- The corpus SQLite builder is an **optional performance optimization** that runs after syllable feature annotation. The JSON file remains the canonical source of truth, while the SQLite database provides a faster query interface for interactive Walker tools (web and TUI). Syllable Walk Web Integration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``build_tools.syllable_walk_web`` integrates with ``corpus.db`` at four points: 1. **Pipeline execution (database creation)** - In the web pipeline runner, the database stage runs after annotation and executes: .. code-block:: bash python -m build_tools.corpus_sqlite_builder --force - Output location is the standard path: ``/data/corpus.db``. - If normalization/annotation stages are skipped, this database stage does not run. 2. **Run discovery and listing** - The web History flow is manifest-driven: run discovery requires ``/manifest.json``. - ``corpus_db_path`` is surfaced when ``data/corpus.db`` is listed in manifest artifacts and the file exists. - Syllable counts in History come from manifest metrics (``metrics.syllable_count_unique``), not ad-hoc DB/JSON recounting. - Runs without a valid manifest are excluded from History. 3. **Corpus loading for Walker API** - ``/api/walker/load-corpus`` calls the corpus loader, which delegates to ``build_tools.syllable_walk.db.load_syllables``. - Loading behavior is: - Prefer SQLite when ``corpus.db`` exists and is readable. - Fall back to ``*_syllables_annotated.json`` if SQLite is missing or read fails. - Return source info to the UI (for example ``"SQLite (2,088 syllables)"`` or ``"JSON (... syllables)"``). 4. **Read-only runtime queries** - SQLite access in the walker data layer uses read-only URI mode (``file:?mode=ro``). - The web stack reads from ``syllables`` (for full corpus load and counts) and does not mutate ``corpus.db`` during browsing/walking. **Standard build pipeline:** .. code-block:: bash # Step 1: Extract syllables (pyphen or NLTK) python -m build_tools.pyphen_syllable_extractor --file input.txt # Step 2: Normalize syllables python -m build_tools.pyphen_syllable_normaliser --run-dir _working/output/20260110_115453_pyphen/ # Step 3: Annotate with phonetic features python -m build_tools.syllable_feature_annotator \ --syllables _working/output/20260110_115453_pyphen/pyphen_syllables_unique.txt \ --frequencies _working/output/20260110_115453_pyphen/pyphen_syllables_frequencies.json # Step 4: Convert to SQLite (OPTIONAL, for performance) python -m build_tools.corpus_sqlite_builder \ _working/output/20260110_115453_pyphen/ **When to use this tool:** - **High-value for interactive Walker performance:** If you're using ``syllable_walk_tui`` or ``syllable_walk_web`` with large corpora (10,000+ syllables), converting to SQLite dramatically improves load times and responsiveness. - **Optional for other tools:** Command-line analysis tools can work directly with JSON files. SQLite is only necessary when interactive performance matters. - **Batch conversion:** Run with ``--batch`` flag to convert all existing corpora at once. **Regeneration is safe:** The SQLite database can be deleted and regenerated at any time from the JSON source: .. code-block:: bash # Delete database rm _working/output/20260110_115453_pyphen/data/corpus.db # Regenerate from JSON python -m build_tools.corpus_sqlite_builder \ _working/output/20260110_115453_pyphen/ **Idempotent operation:** Running the converter multiple times is safe. Use ``--force`` to overwrite existing databases: .. code-block:: bash # Regenerate, overwriting existing database python -m build_tools.corpus_sqlite_builder \ _working/output/20260110_115453_pyphen/ --force Notes ----- **Design Philosophy:** The corpus SQLite builder follows the principle that **JSON is the canonical format** and **SQLite is derived data**. This means: - JSON files are never deleted or modified by this tool - SQLite databases can be regenerated at any time - Both formats coexist peacefully in the ``data/`` subdirectory - Walker loaders automatically prefer SQLite when available and fall back to JSON **Performance considerations:** - Conversion time: ~5-10 seconds for 30,000 syllables - Memory usage during conversion: ~50MB peak (batched inserts) - The ``--batch-size`` parameter controls transaction size (default: 10,000) - Larger batch sizes are faster but use more memory **Backwards compatibility:** Walker tools maintain full backwards compatibility with JSON-only corpora. If no ``corpus.db`` file exists, load falls back to JSON with no errors. This allows gradual migration to SQLite without breaking existing workflows. **Build-time tool:** This is a build-time tool only - not used during runtime name generation. The SQLite databases are used exclusively by interactive analysis tools. API Reference ------------- .. automodule:: build_tools.corpus_sqlite_builder :members: :undoc-members: :show-inheritance: