Corpus Database
Overview
Corpus Database - Build Provenance Ledger for Syllable Extraction
This module provides observational tracking for all syllable extractor runs across different tools (pyphen, NLTK, eSpeak, etc.). The database records who ran what extraction, when, with which settings, and what outputs were produced.
This is a build-time tool only - not used during runtime name generation.
- Design Philosophy:
Observational only: Records outcomes, doesn’t control behavior
Append-only: Runs are never modified, only added
Tool-agnostic: Works for pyphen, NLTK, eSpeak, or future extractors
Queryable: Easy to find “which run produced this file?”
- Key Features:
Full provenance tracking (inputs, outputs, settings, timestamps)
Support for multiple extractor tools
Command-line reproducibility via full CLI capture
Manual annotation support via notes field
Simple query API for run history analysis
- Main Components:
CorpusLedger: Main API for recording and querying runs
SCHEMA_VERSION: Schema version for migration tracking
get_schema_description: Human-readable schema documentation
- Usage:
>>> from build_tools.corpus_db import CorpusLedger
>>> from pathlib import Path
>>>
>>> # Initialize ledger (creates database if needed)
>>> ledger = CorpusLedger()
>>>
>>> # Start recording a run
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt"
... )
>>>
>>> # Record what went in
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>>
>>> # ... extraction happens (ledger doesn't participate) ...
>>>
>>> # Record what came out
>>> ledger.record_output(
...     run_id,
...     output_path=Path("data/raw/en_US/corpus.syllables"),
...     unique_syllable_count=1234
... )
>>>
>>> # Mark run complete
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Query runs later
>>> recent = ledger.get_recent_runs(limit=10)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")
- Common Queries:
# Which run produced this file?
>>> run = ledger.find_run_by_output(Path("data/raw/corpus.syllables"))
>>> print(run['command_line'])

# Show all runs using en_GB
>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> en_gb = [r for r in runs if r['pyphen_lang'] == 'en_GB']

# Get overall statistics
>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")
- Database Location:
Default: data/raw/syllable_extractor.db
Custom: Pass db_path to CorpusLedger(db_path=Path(…))
- Context Manager Support:
>>> with CorpusLedger() as ledger:
...     run_id = ledger.start_run(...)
...     # Connection automatically closed on exit
Output Format
Database Schema
The corpus database uses a simple three-table schema to track extraction runs:
- runs - One row per extractor invocation
Records all configuration parameters, timestamps, command-line invocations, and execution outcomes. Each run gets a unique ID for tracking.
- inputs - Source files/directories processed (many-to-one with runs)
Tracks which files or directories were used as input for each extraction run.
- outputs - Generated output files (many-to-one with runs)
Records the .syllables and .meta files produced, along with syllable counts.
For a detailed schema description:
from build_tools.corpus_db import get_schema_description
print(get_schema_description())
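To make the three-table layout concrete, the following is a minimal sketch of what such a schema could look like in SQLite. The column names are assumptions inferred from the dictionary keys used in this page's examples (`run_timestamp`, `source_path`, `unique_syllable_count`, and so on); the real schema may differ, so treat `get_schema_description()` as the authoritative source.

```python
import sqlite3

# Hypothetical DDL: column names are inferred from the examples on this page
# and are NOT guaranteed to match the real corpus_db schema.
SKETCH_SCHEMA = """
CREATE TABLE runs (
    id INTEGER PRIMARY KEY,
    run_timestamp TEXT NOT NULL,
    extractor_tool TEXT NOT NULL,
    pyphen_lang TEXT,
    command_line TEXT,
    status TEXT NOT NULL DEFAULT 'running',
    exit_code INTEGER
);
CREATE TABLE inputs (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),  -- many inputs per run
    source_path TEXT NOT NULL,
    file_count INTEGER
);
CREATE TABLE outputs (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),  -- many outputs per run
    output_path TEXT NOT NULL,
    unique_syllable_count INTEGER
);
"""

# Build the sketch schema in memory and list the tables it creates.
conn = sqlite3.connect(":memory:")
conn.executescript(SKETCH_SCHEMA)
table_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
conn.close()
```

The `run_id` foreign keys are what give the inputs and outputs tables their many-to-one relationship with runs.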
Database Location
Default location: data/raw/syllable_extractor.db
Custom location can be specified when initializing the ledger:
from pathlib import Path
from build_tools.corpus_db import CorpusLedger
ledger = CorpusLedger(db_path=Path("_working/test.db"))
Integration Guide
When building new syllable extractors, integrate the ledger by:
Calling start_run() at the beginning of extraction
Recording all input sources with record_input()
Recording all output files with record_output()
Marking completion with complete_run() in a try/finally block
Example integration pattern:
import sys

from build_tools.corpus_db import CorpusLedger

ledger = CorpusLedger()
run_id = ledger.start_run(
    extractor_tool="my_extractor",
    command_line=" ".join(sys.argv),
    # ... other parameters ...
)
try:
    # ... extraction logic (produces output_path and count) ...
    ledger.record_output(run_id, output_path, unique_syllable_count=count)
    ledger.complete_run(run_id, exit_code=0, status="completed")
except Exception:
    # Record the failure, then let the original error propagate
    ledger.complete_run(run_id, exit_code=1, status="failed")
    raise
finally:
    # Release the database connection whether the run succeeded or failed
    ledger.close()
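If several extractors repeat this pattern, it may be convenient to wrap it in a context manager. The sketch below is one possible way to do that; `tracked_run` is a hypothetical helper, not part of `build_tools.corpus_db`, and `RecordingLedger` is a stand-in that mimics only the documented `start_run()` and `complete_run()` calls so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def tracked_run(ledger, **run_kwargs):
    """Hypothetical helper: start a run and always record how it ended."""
    run_id = ledger.start_run(**run_kwargs)
    try:
        yield run_id
    except Exception:
        # Record the failure, then re-raise so the caller still sees the error
        ledger.complete_run(run_id, exit_code=1, status="failed")
        raise
    else:
        ledger.complete_run(run_id, exit_code=0, status="completed")


class RecordingLedger:
    """Stand-in for CorpusLedger (assumption: only these two methods are needed)."""

    def __init__(self):
        self.calls = []

    def start_run(self, **kwargs):
        self.calls.append(("start_run", kwargs))
        return 1

    def complete_run(self, run_id, exit_code, status="completed"):
        self.calls.append(("complete_run", run_id, exit_code, status))


# Successful run: the body finishes without raising
ok_ledger = RecordingLedger()
with tracked_run(ok_ledger, extractor_tool="my_extractor") as run_id:
    pass  # ... extraction logic would go here ...

# Failed run: the body raises, and the failure is recorded before re-raising
bad_ledger = RecordingLedger()
try:
    with tracked_run(bad_ledger, extractor_tool="my_extractor"):
        raise RuntimeError("extraction failed")
except RuntimeError:
    pass
```

With a real ledger, the same helper would be called as `with tracked_run(ledger, extractor_tool="my_extractor") as run_id:`.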
When to use this tool:
Track provenance of all syllable extraction runs
Query history to understand what corpus files were generated
Find which run produced a specific output file
Monitor extraction success rates across tools
Notes
Cross-Platform Compatibility:
Paths are stored in POSIX format (forward slashes) for cross-platform consistency. This ensures the database can be shared between Windows, macOS, and Linux systems without path separator issues.
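As an illustration of why POSIX-style storage helps, the sketch below normalizes a Windows-style path and a POSIX-style path to the same stored string using the standard library's `as_posix()`. The `to_stored_path` helper is hypothetical; the ledger's internal conversion may be implemented differently.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_stored_path(path):
    """Hypothetical helper: render any pathlib path with forward slashes."""
    return path.as_posix()


# The same relative location, written on two different platforms
windows_style = PureWindowsPath(r"data\raw\en_US\corpus.syllables")
posix_style = PurePosixPath("data/raw/en_US/corpus.syllables")

stored_from_windows = to_stored_path(windows_style)
stored_from_posix = to_stored_path(posix_style)
```

Because both values end up identical, a lookup by output path can match regardless of which operating system recorded the run.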
Build-time tool:
This is a build-time tool only - not used during runtime name generation.
API Reference
- class build_tools.corpus_db.CorpusLedger(db_path=None)[source]
Bases: object
Manages the corpus extraction run ledger database.
The CorpusLedger provides a simple API for recording extraction runs, their inputs, outputs, and outcomes. All operations are append-only - runs are never modified or deleted once recorded.
The database file location is configurable but defaults to data/raw/syllable_extractor.db.
- db_path
Path to the SQLite database file
- _conn
Active database connection (None if not connected)
Example
>>> ledger = CorpusLedger()
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US"
... )
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
- __exit__(exc_type, exc_val, exc_tb)[source]
Context manager exit - ensures connection is closed.
- Return type:
- __init__(db_path=None)[source]
Initialize the corpus ledger.
Creates the database and tables if they don’t exist. If the database already exists, validates the schema version.
- Parameters:
db_path (Path | None) – Path to SQLite database file. If None, defaults to data/raw/syllable_extractor.db in the project root.
- Raises:
sqlite3.Error – If database initialization fails
Example
>>> # Use default location
>>> ledger = CorpusLedger()
>>>
>>> # Use custom location
>>> ledger = CorpusLedger(Path("_working/test.db"))
- close()[source]
Close the database connection.
Should be called when done with the ledger. Using the ledger as a context manager (with statement) is preferred as it handles cleanup automatically.
Example
>>> ledger = CorpusLedger()
>>> # ... use ledger ...
>>> ledger.close()
>>>
>>> # Preferred: use context manager
>>> with CorpusLedger() as ledger:
...     ledger.start_run(...)
- Return type:
- complete_run(run_id, exit_code, status='completed')[source]
Mark a run as complete or failed.
Updates the run status and exit code. This should be called when extraction finishes, whether successfully or with errors.
- Parameters:
- Raises:
ValueError – If status is not a valid value
- Return type:
Example
>>> # Successful run
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Failed run
>>> ledger.complete_run(run_id, exit_code=1, status="failed")
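The ValueError behaviour described above can be pictured with a small validation sketch. This is only an assumption about how such a check might look: the set of accepted values below is taken from the two statuses shown in the examples ("completed" and "failed"), and the real method may accept others.

```python
# Assumed set of final statuses, based only on the examples on this page
VALID_FINAL_STATUSES = {"completed", "failed"}


def check_final_status(status):
    """Hypothetical check: reject any status outside the assumed valid set."""
    if status not in VALID_FINAL_STATUSES:
        allowed = ", ".join(sorted(VALID_FINAL_STATUSES))
        raise ValueError(f"Invalid status {status!r}; expected one of: {allowed}")
    return status


accepted = check_final_status("completed")

# An unrecognised status is rejected before anything would be written
try:
    check_final_status("done")
    rejected = False
except ValueError:
    rejected = True
```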
- find_run_by_output(output_path)[source]
Find which run produced a specific output file.
This is the “reverse lookup” - given a .syllables file, find out how it was created.
- Parameters:
output_path (Path) – Path to .syllables file to search for
- Return type:
- Returns:
Run dictionary if found, None otherwise
Example
>>> run = ledger.find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
>>> if run:
...     print(f"Created by: {run['command_line']}")
...     print(f"On: {run['run_timestamp']}")
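Conceptually, a reverse lookup like this is a join from the outputs table back to the runs table. The sketch below shows one way it could be expressed with the standard `sqlite3` module against a simplified, hypothetical pair of tables; the table and column names are assumptions, and `find_run_by_output_sketch` is not the library's implementation.

```python
import sqlite3
from pathlib import PurePosixPath

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be converted to dictionaries

# Simplified, hypothetical tables (the real schema may differ)
conn.executescript("""
CREATE TABLE runs (id INTEGER PRIMARY KEY, extractor_tool TEXT, command_line TEXT);
CREATE TABLE outputs (
    id INTEGER PRIMARY KEY,
    run_id INTEGER REFERENCES runs(id),
    output_path TEXT
);
""")
conn.execute(
    "INSERT INTO runs (id, extractor_tool, command_line) VALUES (?, ?, ?)",
    (1, "syllable_extractor", "python -m build_tools.syllable_extractor --file input.txt"),
)
conn.execute(
    "INSERT INTO outputs (run_id, output_path) VALUES (?, ?)",
    (1, "data/raw/en_US/corpus_v1.syllables"),
)


def find_run_by_output_sketch(conn, output_path):
    """Return the newest run that recorded output_path, or None if no run did."""
    row = conn.execute(
        "SELECT runs.* FROM runs "
        "JOIN outputs ON outputs.run_id = runs.id "
        "WHERE outputs.output_path = ? "
        "ORDER BY runs.id DESC LIMIT 1",
        (output_path.as_posix(),),  # compare against the POSIX-style stored form
    ).fetchone()
    return dict(row) if row is not None else None


found = find_run_by_output_sketch(conn, PurePosixPath("data/raw/en_US/corpus_v1.syllables"))
missing = find_run_by_output_sketch(conn, PurePosixPath("data/raw/unknown.syllables"))
conn.close()
```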
- get_recent_runs(limit=10)[source]
Get the most recent extraction runs.
- Parameters:
limit (int) – Maximum number of runs to return (default: 10)
- Return type:
- Returns:
List of run dictionaries, ordered by timestamp descending (newest first)
Example
>>> recent = ledger.get_recent_runs(limit=5)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")
- get_run(run_id)[source]
Get details for a specific run.
- Parameters:
run_id (int) – Run ID to fetch
- Return type:
- Returns:
Dictionary with run details, or None if run_id doesn’t exist
Example
>>> run = ledger.get_run(42)
>>> if run:
...     print(f"Tool: {run['extractor_tool']}")
...     print(f"Status: {run['status']}")
...     print(f"Command: {run['command_line']}")
- get_run_inputs(run_id)[source]
Get all input sources for a run.
- Parameters:
run_id (int) – Run ID to fetch inputs for
- Return type:
- Returns:
List of input dictionaries with source_path and file_count
Example
>>> inputs = ledger.get_run_inputs(42)
>>> for inp in inputs:
...     print(f"Source: {inp['source_path']}")
...     if inp['file_count']:
...         print(f"  Files: {inp['file_count']}")
- get_run_outputs(run_id)[source]
Get all outputs for a run.
- Parameters:
run_id (int) – Run ID to fetch outputs for
- Return type:
- Returns:
List of output dictionaries with paths and syllable counts
Example
>>> outputs = ledger.get_run_outputs(42)
>>> for out in outputs:
...     print(f"Output: {out['output_path']}")
...     print(f"  Unique syllables: {out['unique_syllable_count']}")
- get_runs_by_tool(extractor_tool)[source]
Get all runs for a specific extractor tool.
- Parameters:
extractor_tool (str) – Tool name to filter by (e.g., 'syllable_extractor')
- Return type:
- Returns:
List of run dictionaries, ordered by timestamp descending (newest first)
Example
>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> for run in runs:
...     print(f"Run {run['id']}: {run['pyphen_lang']} ({run['status']})")
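Since the returned runs are plain dictionaries, further filtering and grouping can be done with ordinary Python. As a sketch, the hypothetical helper below counts completed runs per language; the sample list is illustrative data standing in for the result of `get_runs_by_tool()`, using only the `pyphen_lang` and `status` keys shown above.

```python
from collections import Counter


def count_completed_by_language(runs):
    """Hypothetical helper: tally completed runs per pyphen language code."""
    return Counter(
        run["pyphen_lang"] or "(none)"  # non-pyphen tools have no language code
        for run in runs
        if run["status"] == "completed"
    )


# Illustrative stand-in for the list returned by get_runs_by_tool()
sample_runs = [
    {"id": 3, "pyphen_lang": "en_GB", "status": "failed"},
    {"id": 2, "pyphen_lang": "en_GB", "status": "completed"},
    {"id": 1, "pyphen_lang": "en_US", "status": "completed"},
]

language_counts = count_completed_by_language(sample_runs)
```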
- get_stats()[source]
Get overall ledger statistics.
Returns summary stats about all recorded runs, useful for understanding build history at a glance.
- Returns:
Dictionary with statistics:
total_runs: Total number of runs recorded
completed_runs: Runs with status='completed'
failed_runs: Runs with status='failed'
tools_used: Set of unique extractor tools
languages_used: Set of unique pyphen language codes
Example
>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")
>>> print(f"Tools: {', '.join(stats['tools_used'])}")
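Note that the success-rate expression in the example divides by `total_runs`, which would raise ZeroDivisionError on a ledger with no recorded runs. A small guard avoids that; `success_rate` below is a hypothetical helper that only assumes the `total_runs` and `completed_runs` keys documented above.

```python
def success_rate(stats):
    """Return the percentage of completed runs, or None if no runs exist."""
    total = stats["total_runs"]
    if total == 0:
        return None  # an empty ledger has no meaningful success rate
    return stats["completed_runs"] / total * 100


empty_rate = success_rate({"total_runs": 0, "completed_runs": 0})
typical_rate = success_rate({"total_runs": 4, "completed_runs": 3})
```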
- record_input(run_id, source_path, file_count=None)[source]
Record an input source for a run.
Associates an input file or directory with an extraction run. Multiple inputs can be recorded for a single run.
Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.
- Parameters:
- Return type:
Example
>>> run_id = ledger.start_run("syllable_extractor", "0.2.0")
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>> ledger.record_input(run_id, Path("data/corpus/german/"), file_count=42)
- record_output(run_id, output_path, syllable_count=None, unique_syllable_count=None, meta_path=None)[source]
Record an output file for a run.
Associates an output .syllables file with an extraction run. Multiple outputs can be recorded for a single run (e.g., batch processing).
Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.
- Parameters:
run_id (int) – Run ID from start_run()
output_path (Path) – Path to generated .syllables file
syllable_count (int | None) – Total number of syllables (including duplicates)
unique_syllable_count (int | None) – Number of unique syllables
meta_path (Path | None) – Path to corresponding .meta file (if generated)
- Return type:
Example
>>> ledger.record_output(
...     run_id=42,
...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
...     syllable_count=5432,
...     unique_syllable_count=1234,
...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
... )
- start_run(extractor_tool, extractor_version=None, pyphen_lang=None, auto_lang_detected=None, min_len=None, max_len=None, recursive=False, pattern=None, command_line=None, notes=None)[source]
Record the start of a new extraction run.
Creates a new run record with status=’running’ and returns the run ID. The caller should use this ID to record inputs, outputs, and eventually mark the run complete or failed.
- Parameters:
extractor_tool (str) – Name of the extraction tool (e.g., 'syllable_extractor', 'syllable_extractor_nltk', 'syllable_extractor_espeak')
extractor_version (str | None) – Version string or git SHA of the tool
pyphen_lang (str | None) – Pyphen language code (NULL for non-pyphen tools)
auto_lang_detected (str | None) – Auto-detected language code if auto-detection was used
recursive (bool) – Whether source directory was processed recursively
command_line (str | None) – Full command-line invocation for reproducibility
notes (str | None) – User-provided annotations about this run
- Return type:
- Returns:
Unique run ID (integer) for this extraction run
Example
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt",
...     notes="Testing new corpus from Project Gutenberg"
... )
>>> print(f"Started run {run_id}")
Started run 42
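Because command_line is meant to support reproducibility, it can help to capture the invocation in a form that survives arguments containing spaces. The sketch below compares a plain space join with the standard library's `shlex.join()` (Python 3.8+); the argument list is a made-up example, and whether the extractors use this approach is not stated here.

```python
import shlex

# Made-up argument list with a file name that contains a space
argv = ["python", "-m", "build_tools.syllable_extractor", "--file", "my corpus.txt"]

naive = " ".join(argv)     # loses the boundary around "my corpus.txt"
quoted = shlex.join(argv)  # shell-quotes each argument where needed

# Splitting the recorded strings back into arguments shows the difference
naive_round_trip = shlex.split(naive)
quoted_round_trip = shlex.split(quoted)
```

Only the quoted form splits back into the original argument list, so it is the safer value to pass as `command_line` when file names may contain spaces.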
- build_tools.corpus_db.get_schema_description()[source]
Generate a human-readable description of the database schema.
Useful for documentation, debugging, and understanding the database structure without examining SQL directly.
- Return type:
- Returns:
Multi-line string describing tables, columns, and relationships
Example
>>> print(get_schema_description())
Corpus Extraction Run Ledger Schema (v1)
========================================
...