build_tools.corpus_db.schema

Database schema definitions for corpus extraction run tracking.

This module defines the SQLite database schema for tracking all syllable extractor runs across different tools (pyphen, NLTK, eSpeak, etc.). The schema provides full provenance tracking - who, what, when, with what settings, and which outputs.

The database serves as an observational ledger only - it records what happened but does not influence extraction behavior. This keeps extraction logic pure and deterministic while providing queryable build history.

Schema Design Philosophy:

Append-only: Runs are never modified, only added
Observational: Records outcomes, doesn’t control behavior
Queryable: Easy to find “which run produced this file?”
Tool-agnostic: Works for pyphen, NLTK, eSpeak, or future extractors

Tables:

runs: One row per extractor invocation with all configuration inputs: Source files/directories that were processed (one-to-many) outputs: Generated .syllables and .meta files (one-to-many)

Example Queries:

– Find all runs using en_GB SELECT * FROM runs WHERE pyphen_lang = ‘en_GB’;

– Which run produced this file? SELECT r.* FROM runs r JOIN outputs o ON r.id = o.run_id WHERE o.output_path = ‘data/raw/en_US/corpus_v1.syllables’;

– Compare syllable counts across tools SELECT extractor_tool, AVG(unique_syllable_count) as avg_unique FROM runs r JOIN outputs o ON r.id = o.run_id GROUP BY extractor_tool;

Attributes

`SCHEMA_VERSION`
`CREATE_RUNS_TABLE`
`CREATE_INPUTS_TABLE`
`CREATE_OUTPUTS_TABLE`
`CREATE_INDEXES`
`CREATE_SCHEMA_VERSION_TABLE`

Functions

`get_all_ddl_statements`()	Get all DDL statements required to initialize the database schema.
`get_schema_description`()	Generate a human-readable description of the database schema.

Module Contents

build_tools.corpus_db.schema.SCHEMA_VERSION = 1

build_tools.corpus_db.schema.CREATE_RUNS_TABLE = Multiline-String

Show Value

"""
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    run_timestamp TEXT NOT NULL,
    extractor_tool TEXT NOT NULL,
    extractor_version TEXT,
    hostname TEXT,
    exit_code INTEGER,
    status TEXT CHECK(status IN ('running', 'completed', 'failed', 'interrupted')),
    pyphen_lang TEXT,
    auto_lang_detected TEXT,
    min_len INTEGER,
    max_len INTEGER,
    recursive INTEGER,
    pattern TEXT,
    command_line TEXT,
    notes TEXT
);
"""

build_tools.corpus_db.schema.CREATE_INPUTS_TABLE = Multiline-String

Show Value

"""
CREATE TABLE IF NOT EXISTS inputs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id INTEGER NOT NULL,
    source_path TEXT NOT NULL,
    file_count INTEGER,
    FOREIGN KEY(run_id) REFERENCES runs(id)
);
"""

build_tools.corpus_db.schema.CREATE_OUTPUTS_TABLE = Multiline-String

Show Value

"""
CREATE TABLE IF NOT EXISTS outputs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id INTEGER NOT NULL,
    output_path TEXT NOT NULL,
    syllable_count INTEGER,
    unique_syllable_count INTEGER,
    meta_path TEXT,
    FOREIGN KEY(run_id) REFERENCES runs(id)
);
"""

build_tools.corpus_db.schema.CREATE_INDEXES = ['CREATE INDEX IF NOT EXISTS idx_runs_timestamp ON runs(run_timestamp);', 'CREATE INDEX IF NOT...

build_tools.corpus_db.schema.CREATE_SCHEMA_VERSION_TABLE = Multiline-String

Show Value

"""
CREATE TABLE IF NOT EXISTS schema_version (
    version INTEGER PRIMARY KEY,
    applied_at TEXT NOT NULL
);
"""

build_tools.corpus_db.schema.get_all_ddl_statements()[source]

Get all DDL statements required to initialize the database schema.

Returns a list of SQL statements that should be executed in order to create a fresh database with the current schema version.

Returns:: List of SQL DDL statements (CREATE TABLE, CREATE INDEX)
Return type:: list[str]

Example

>>> statements = get_all_ddl_statements()
>>> for stmt in statements:
...     cursor.execute(stmt)

build_tools.corpus_db.schema.get_schema_description()[source]

Generate a human-readable description of the database schema.

Useful for documentation, debugging, and understanding the database structure without examining SQL directly.

Returns:: Multi-line string describing tables, columns, and relationships
Return type:: str

Example

>>> print(get_schema_description())
Corpus Extraction Run Ledger Schema (v1)
========================================
...