===============
Corpus Database
===============

.. currentmodule:: build_tools.corpus_db

Overview
--------

.. automodule:: build_tools.corpus_db
   :no-members:

Output Format
-------------

Database Schema
~~~~~~~~~~~~~~~

The corpus database uses a simple three-table schema to track extraction runs:

**runs** - One row per extractor invocation
   Records all configuration parameters, timestamps, command-line invocations,
   and execution outcomes. Each run gets a unique ID for tracking.

**inputs** - Source files/directories processed (many-to-one with runs)
   Tracks which files or directories were used as input for each extraction run.

**outputs** - Generated output files (many-to-one with runs)
   Records the ``.syllables`` and ``.meta`` files produced, along with syllable counts.

For a detailed schema description:

.. code-block:: python

   from build_tools.corpus_db import get_schema_description

   print(get_schema_description())

Database Location
~~~~~~~~~~~~~~~~~

Default location: ``data/raw/syllable_extractor.db``

A custom location can be specified when initializing the ledger:

.. code-block:: python

   from pathlib import Path

   from build_tools.corpus_db import CorpusLedger

   ledger = CorpusLedger(db_path=Path("_working/test.db"))

Integration Guide
-----------------

When building new syllable extractors, integrate the ledger by:

1. Calling ``start_run()`` at the beginning of extraction
2. Recording all input sources with ``record_input()``
3. Recording all output files with ``record_output()``
4. Marking completion with ``complete_run()`` on both the success and the
   failure path, using a try/except block

**Example integration pattern:**

.. code-block:: python

   import sys

   from build_tools.corpus_db import CorpusLedger

   ledger = CorpusLedger()

   # Open a ledger entry before any extraction work begins.
   run_id = ledger.start_run(
       extractor_tool="my_extractor",
       command_line=" ".join(sys.argv),
       # ... other parameters ...
   )

   try:
       # ... record input sources with ledger.record_input(...) ...
       # ... extraction logic ...
       ledger.record_output(run_id, output_path, unique_syllable_count=count)
       ledger.complete_run(run_id, exit_code=0, status="completed")
   except Exception:
       # Mark the run as failed, then re-raise so the caller still sees the error.
       ledger.complete_run(run_id, exit_code=1, status="failed")
       raise

**When to use this tool:**

- Track provenance of all syllable extraction runs
- Query history to understand what corpus files were generated
- Find which run produced a specific output file
- Monitor extraction success rates across tools

Notes
-----

**Cross-Platform Compatibility:**

Paths are stored in POSIX format (forward slashes) for cross-platform consistency.
This ensures the database can be shared between Windows, macOS, and Linux systems
without path separator issues.

**Build-time tool:**

This is a build-time tool only; it is not used during runtime name generation.

API Reference
-------------

.. automodule:: build_tools.corpus_db
   :members:
   :undoc-members:
   :show-inheritance:
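
To illustrate the cross-platform note under Notes, here is a minimal sketch of
how a path can be normalized to POSIX format before it is stored. It uses only
the standard library ``pathlib``. The helper name ``to_posix`` and the sample
file name are hypothetical, and this is not necessarily how ``CorpusLedger``
performs the conversion internally.

.. code-block:: python

   from pathlib import PureWindowsPath


   def to_posix(path_str: str) -> str:
       """Return ``path_str`` with forward slashes as separators.

       Hypothetical helper, not part of build_tools.corpus_db.
       PureWindowsPath accepts both backslashes and forward slashes,
       so either separator style is normalized the same way.
       """
       return PureWindowsPath(path_str).as_posix()


   # A Windows-style relative path and its POSIX-style equivalent
   # end up as the same stored string.
   print(to_posix(r"data\raw\corpus.syllables"))
   print(to_posix("data/raw/corpus.syllables"))

Because both calls produce the same string, a lookup by output path behaves the
same regardless of which operating system recorded the run.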