build_tools.corpus_db.ledger
Corpus extraction run ledger - observational database for build provenance.
This module provides the CorpusLedger class, which manages SQLite-based tracking of all syllable extraction runs. The ledger records who ran what extraction tool, when, with which settings, and what outputs were produced.
- Critical Design Principle:
The ledger is observational only - it records what happened but does not influence extraction behavior. Extractors remain pure, deterministic functions. The ledger just watches and remembers.
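The observational principle can be illustrated with a small sketch. Everything here is hypothetical (`extract_syllables`, `run_with_ledger`, and the list used as a log are stand-ins, not part of `build_tools`); the point is only that the extractor never reads what the recorder writes, so recording cannot change the result.

```python
def extract_syllables(text):
    """Hypothetical stand-in for a pure extractor: same input, same output."""
    return [word.lower() for word in text.split()]


def run_with_ledger(extractor, text, log):
    """Call the extractor, then record what happened.

    The log is written after the call and is never passed to the extractor,
    so observing a run cannot influence its output.
    """
    result = extractor(text)
    log.append({"tool": extractor.__name__, "syllable_count": len(result)})
    return result


log = []  # assumption: a plain list stands in for the SQLite ledger
observed = run_with_ledger(extract_syllables, "Hello World", log)
unobserved = extract_syllables("Hello World")
```

With or without the recorder, the extractor returns the same list; only the log differs.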
- Typical Usage:
>>> from build_tools.corpus_db import CorpusLedger
>>> from pathlib import Path
>>>
>>> # Initialize ledger (finds or creates database)
>>> ledger = CorpusLedger()
>>>
>>> # Start a new extraction run
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt"
... )
>>>
>>> # Record input sources
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>>
>>> # ... extraction happens (ledger doesn't participate) ...
>>>
>>> # Record outputs
>>> ledger.record_output(
...     run_id,
...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
...     syllable_count=5432,
...     unique_syllable_count=1234,
...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
... )
>>>
>>> # Mark run complete
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Query runs later
>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> recent = ledger.get_recent_runs(limit=10)
Classes
CorpusLedger: Manages the corpus extraction run ledger database.
Module Contents
- class build_tools.corpus_db.ledger.CorpusLedger(db_path=None)[source]
Manages the corpus extraction run ledger database.
The CorpusLedger provides a simple API for recording extraction runs, their inputs, outputs, and outcomes. The ledger is append-only: runs are never deleted once recorded, and a run record is changed only when complete_run() sets its final status and exit code.
The database file location is configurable but defaults to data/raw/syllable_extractor.db.
- db_path
Path to the SQLite database file
- _conn
Active database connection (None if not connected)
Example
>>> ledger = CorpusLedger()
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US"
... )
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
Initialize the corpus ledger.
Creates the database and tables if they don’t exist. If the database already exists, validates the schema version.
- Parameters:
db_path (pathlib.Path | None) – Path to SQLite database file. If None, defaults to data/raw/syllable_extractor.db in the project root.
- Raises:
sqlite3.Error – If database initialization fails
Example
>>> # Use default location
>>> ledger = CorpusLedger()
>>>
>>> # Use custom location
>>> ledger = CorpusLedger(Path("_working/test.db"))
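One way the "create if missing, otherwise validate the schema version" behaviour could be implemented is with SQLite's `PRAGMA user_version`. This is a minimal sketch under stated assumptions: the `open_ledger` helper, the `SCHEMA_VERSION` value, and the single-table schema are hypothetical and are not taken from the actual `CorpusLedger` source.

```python
import sqlite3

SCHEMA_VERSION = 1  # hypothetical version number


def open_ledger(db_path):
    """Open the database, creating tables on first use and checking the version."""
    conn = sqlite3.connect(db_path)
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    if version == 0:
        # A fresh SQLite file reports user_version 0: create tables and stamp it.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "extractor_tool TEXT NOT NULL, "
            "status TEXT NOT NULL)"
        )
        # PRAGMA does not accept bound parameters; the value is a trusted constant.
        conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
        conn.commit()
    elif version != SCHEMA_VERSION:
        conn.close()
        # DatabaseError is a subclass of sqlite3.Error.
        raise sqlite3.DatabaseError(
            f"Unsupported schema version {version}, expected {SCHEMA_VERSION}"
        )
    return conn


conn = open_ledger(":memory:")
```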
- db_path = None
- close()[source]
Close the database connection.
Should be called when done with the ledger. Using the ledger as a context manager (with statement) is preferred as it handles cleanup automatically.
Example
>>> ledger = CorpusLedger()
>>> # ... use ledger ...
>>> ledger.close()
>>>
>>> # Preferred: use context manager
>>> with CorpusLedger() as ledger:
...     ledger.start_run(...)
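The context-manager behaviour described above follows the standard `__enter__`/`__exit__` protocol. The sketch below uses a hypothetical `MiniLedger` class (not the real `CorpusLedger`) to show how `close()` can be called automatically, even when the body of the `with` block raises.

```python
import sqlite3


class MiniLedger:
    """Hypothetical stand-in showing the cleanup pattern only."""

    def __init__(self, db_path=":memory:"):
        self._conn = sqlite3.connect(db_path)

    def close(self):
        # Safe to call more than once: _conn is None when not connected.
        if self._conn is not None:
            self._conn.close()
            self._conn = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions from the with-block


with MiniLedger() as ledger:
    connected_inside = ledger._conn is not None
closed_after = ledger._conn is None
```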
- start_run(extractor_tool, extractor_version=None, pyphen_lang=None, auto_lang_detected=None, min_len=None, max_len=None, recursive=False, pattern=None, command_line=None, notes=None)[source]
Record the start of a new extraction run.
Creates a new run record with status='running' and returns the run ID. The caller should use this ID to record inputs, outputs, and eventually mark the run complete or failed.
- Parameters:
extractor_tool (str) – Name of the extraction tool (e.g., 'syllable_extractor', 'syllable_extractor_nltk', 'syllable_extractor_espeak')
extractor_version (str | None) – Version string or git SHA of the tool
pyphen_lang (str | None) – Pyphen language code (None, stored as NULL, for non-pyphen tools)
auto_lang_detected (str | None) – Auto-detected language code if auto-detection was used
min_len (int | None) – Minimum syllable length constraint
max_len (int | None) – Maximum syllable length constraint
recursive (bool) – Whether source directory was processed recursively
command_line (str | None) – Full command-line invocation for reproducibility
notes (str | None) – User-provided annotations about this run
- Returns:
Unique run ID (integer) for this extraction run
- Return type:
int
Example
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt",
...     notes="Testing new corpus from Project Gutenberg"
... )
>>> print(f"Started run {run_id}")
Started run 42
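As a rough illustration of what recording a run start might look like at the SQL level, the sketch below inserts a row with status 'running' and returns the generated row ID. The table layout and the free-standing `start_run` function are assumptions for illustration; the real schema has more columns and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema: only a few of the documented fields.
conn.execute(
    "CREATE TABLE runs ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "extractor_tool TEXT NOT NULL, "
    "pyphen_lang TEXT, "
    "status TEXT NOT NULL)"
)


def start_run(conn, extractor_tool, pyphen_lang=None):
    """Insert a new run with status 'running' and return its integer ID."""
    cur = conn.execute(
        "INSERT INTO runs (extractor_tool, pyphen_lang, status) "
        "VALUES (?, ?, 'running')",
        (extractor_tool, pyphen_lang),
    )
    conn.commit()
    return cur.lastrowid


first = start_run(conn, "syllable_extractor", "en_US")
second = start_run(conn, "syllable_extractor_nltk")  # non-pyphen tool: lang is NULL
```

A Python `None` passed as a bound parameter is stored as SQL NULL, which matches the description of `pyphen_lang` for non-pyphen tools.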
- record_input(run_id, source_path, file_count=None)[source]
Record an input source for a run.
Associates an input file or directory with an extraction run. Multiple inputs can be recorded for a single run.
Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.
- Parameters:
run_id (int) – Run ID from start_run()
source_path (pathlib.Path) – Path to input file or directory
file_count (int | None) – Number of files if source_path is a directory (None for single file)
Example
>>> run_id = ledger.start_run("syllable_extractor", "0.2.0")
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>> ledger.record_input(run_id, Path("data/corpus/german/"), file_count=42)
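The POSIX-format note can be illustrated with `pathlib`, whose `as_posix()` method renders any path with forward slashes. The `to_stored_path` helper is hypothetical; whether the ledger uses `as_posix()` internally is an assumption.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_stored_path(path):
    """Render a path with forward slashes, regardless of the originating platform."""
    return path.as_posix()


# The same relative location, written the Windows way and the POSIX way.
windows_style = to_stored_path(PureWindowsPath(r"data\corpus\english.txt"))
posix_style = to_stored_path(PurePosixPath("data/corpus/english.txt"))
```

Both calls yield the same string, so a path recorded on one platform can be matched by a lookup from another.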
- record_output(run_id, output_path, syllable_count=None, unique_syllable_count=None, meta_path=None)[source]
Record an output file for a run.
Associates an output .syllables file with an extraction run. Multiple outputs can be recorded for a single run (e.g., batch processing).
Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.
- Parameters:
run_id (int) – Run ID from start_run()
output_path (pathlib.Path) – Path to generated .syllables file
syllable_count (int | None) – Total number of syllables (including duplicates)
unique_syllable_count (int | None) – Number of unique syllables
meta_path (pathlib.Path | None) – Path to corresponding .meta file (if generated)
Example
>>> ledger.record_output(
...     run_id=42,
...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
...     syllable_count=5432,
...     unique_syllable_count=1234,
...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
... )
- complete_run(run_id, exit_code, status='completed')[source]
Mark a run as complete or failed.
Updates the run status and exit code. This should be called when extraction finishes, whether successfully or with errors.
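- Parameters:
run_id (int) – Run ID from start_run()
exit_code (int) – Exit code of the extraction process (0 in the successful example below, 1 in the failed one)
status (str) – Final run status, e.g. 'completed' or 'failed' (default: 'completed')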
- Raises:
ValueError – If status is not a valid value
Example
>>> # Successful run
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Failed run
>>> ledger.complete_run(run_id, exit_code=1, status="failed")
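The ValueError on an invalid status suggests a simple membership check. The sketch below assumes the set of valid values is exactly the two shown in the examples above; the real set may include others, and `VALID_STATUSES` and `validate_status` are hypothetical names.

```python
# Assumption: only the two statuses that appear in the documented examples.
VALID_STATUSES = {"completed", "failed"}


def validate_status(status):
    """Return the status unchanged, or raise ValueError if it is not recognised."""
    if status not in VALID_STATUSES:
        raise ValueError(
            f"Invalid status {status!r}; expected one of {sorted(VALID_STATUSES)}"
        )
    return status
```

Rejecting unknown values before the database is touched keeps misspelled statuses out of the ledger.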
- get_run(run_id)[source]
Get details for a specific run.
- Parameters:
run_id (int) – Run ID to fetch
- Returns:
Dictionary with run details, or None if run_id doesn’t exist
- Return type:
Example
>>> run = ledger.get_run(42)
>>> if run:
...     print(f"Tool: {run['extractor_tool']}")
...     print(f"Status: {run['status']}")
...     print(f"Command: {run['command_line']}")
- get_runs_by_tool(extractor_tool)[source]
Get all runs for a specific extractor tool.
- Parameters:
extractor_tool (str) – Tool name to filter by (e.g., 'syllable_extractor')
- Returns:
List of run dictionaries, ordered by timestamp descending (newest first)
- Return type:
Example
>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> for run in runs:
...     print(f"Run {run['id']}: {run['pyphen_lang']} ({run['status']})")
- get_recent_runs(limit=10)[source]
Get the most recent extraction runs.
- Parameters:
limit (int) – Maximum number of runs to return (default: 10)
- Returns:
List of run dictionaries, ordered by timestamp descending (newest first)
- Return type:
Example
>>> recent = ledger.get_recent_runs(limit=5)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")
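A newest-first query with a row limit could be written roughly as follows. The table layout, the sample timestamps, and the `recent_runs` function are made up for illustration; only the column names `run_timestamp` and `extractor_tool` come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be converted to dictionaries
conn.execute(
    "CREATE TABLE runs ("
    "id INTEGER PRIMARY KEY, run_timestamp TEXT, extractor_tool TEXT)"
)
# Illustrative rows, inserted out of chronological order on purpose.
conn.executemany(
    "INSERT INTO runs (run_timestamp, extractor_tool) VALUES (?, ?)",
    [
        ("2024-01-01T10:00:00", "syllable_extractor"),
        ("2024-01-03T10:00:00", "syllable_extractor_nltk"),
        ("2024-01-02T10:00:00", "syllable_extractor"),
    ],
)


def recent_runs(conn, limit=10):
    """Return up to `limit` runs as dictionaries, newest first."""
    rows = conn.execute(
        "SELECT * FROM runs ORDER BY run_timestamp DESC LIMIT ?", (limit,)
    ).fetchall()
    return [dict(row) for row in rows]


newest_two = recent_runs(conn, limit=2)
```

ISO 8601 timestamps stored as text sort chronologically, which is why a plain `ORDER BY` on the text column is enough in this sketch.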
- get_run_inputs(run_id)[source]
Get all input sources for a run.
- Parameters:
run_id (int) – Run ID to fetch inputs for
- Returns:
List of input dictionaries with source_path and file_count
- Return type:
Example
>>> inputs = ledger.get_run_inputs(42)
>>> for inp in inputs:
...     print(f"Source: {inp['source_path']}")
...     if inp['file_count']:
...         print(f" Files: {inp['file_count']}")
- get_run_outputs(run_id)[source]
Get all outputs for a run.
- Parameters:
run_id (int) – Run ID to fetch outputs for
- Returns:
List of output dictionaries with paths and syllable counts
- Return type:
Example
>>> outputs = ledger.get_run_outputs(42)
>>> for out in outputs:
...     print(f"Output: {out['output_path']}")
...     print(f" Unique syllables: {out['unique_syllable_count']}")
- find_run_by_output(output_path)[source]
Find which run produced a specific output file.
This is the “reverse lookup” - given a .syllables file, find out how it was created.
- Parameters:
output_path (pathlib.Path) – Path to .syllables file to search for
- Returns:
Run dictionary if found, None otherwise
- Return type:
Example
>>> run = ledger.find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
>>> if run:
...     print(f"Created by: {run['command_line']}")
...     print(f"On: {run['run_timestamp']}")
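The reverse lookup amounts to a join from an outputs table back to the runs table. The sketch below assumes a two-table layout (`runs` and `outputs` linked by `run_id`) and a free-standing `find_run_by_output` function; both are hypothetical and only approximate whatever schema the ledger actually uses.

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Hypothetical schema and one illustrative run with a single output.
conn.executescript(
    """
    CREATE TABLE runs (id INTEGER PRIMARY KEY, extractor_tool TEXT, command_line TEXT);
    CREATE TABLE outputs (run_id INTEGER REFERENCES runs(id), output_path TEXT);
    INSERT INTO runs (id, extractor_tool, command_line)
        VALUES (42, 'syllable_extractor',
                'python -m build_tools.syllable_extractor --file input.txt');
    INSERT INTO outputs (run_id, output_path)
        VALUES (42, 'data/raw/en_US/corpus_v1.syllables');
    """
)


def find_run_by_output(conn, output_path):
    """Return the run that produced output_path as a dict, or None if unknown."""
    row = conn.execute(
        "SELECT runs.* FROM runs "
        "JOIN outputs ON outputs.run_id = runs.id "
        "WHERE outputs.output_path = ?",
        (output_path.as_posix(),),  # compare in the stored POSIX form
    ).fetchone()
    return dict(row) if row is not None else None


found = find_run_by_output(conn, Path("data/raw/en_US/corpus_v1.syllables"))
missing = find_run_by_output(conn, Path("data/raw/en_US/unknown.syllables"))
```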
- get_stats()[source]
Get overall ledger statistics.
Returns summary stats about all recorded runs, useful for understanding build history at a glance.
- Returns:
Dictionary with statistics:
total_runs: Total number of runs recorded
completed_runs: Runs with status='completed'
failed_runs: Runs with status='failed'
tools_used: Set of unique extractor tools
languages_used: Set of unique pyphen language codes
- Return type:
dict
Example
>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")
>>> print(f"Tools: {', '.join(stats['tools_used'])}")
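The success-rate expression in the example above divides by total_runs, which fails on an empty ledger, and joining a set gives no guaranteed order. A small sketch of a more defensive report follows; `success_rate`, `format_tools`, and the sample dictionaries are hypothetical and only mirror the documented keys.

```python
def success_rate(stats):
    """Percentage of completed runs, or 0.0 when no runs have been recorded."""
    total = stats["total_runs"]
    if total == 0:
        return 0.0
    return stats["completed_runs"] / total * 100


def format_tools(stats):
    """Join tool names in sorted order so the output is stable across calls."""
    return ", ".join(sorted(stats["tools_used"]))


# Illustrative dictionaries shaped like the documented get_stats() result.
empty_stats = {
    "total_runs": 0, "completed_runs": 0, "failed_runs": 0,
    "tools_used": set(), "languages_used": set(),
}
sample_stats = {
    "total_runs": 4, "completed_runs": 3, "failed_runs": 1,
    "tools_used": {"syllable_extractor_nltk", "syllable_extractor"},
    "languages_used": {"en_US"},
}
```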