build_tools.corpus_db.ledger

Corpus extraction run ledger - observational database for build provenance.

This module provides the CorpusLedger class, which manages SQLite-based tracking of all syllable extraction runs. The ledger records who ran what extraction tool, when, with which settings, and what outputs were produced.

Critical Design Principle:

The ledger is observational only - it records what happened but does not influence extraction behavior. Extractors remain pure, deterministic functions. The ledger just watches and remembers.
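
As an illustration of this principle, the following minimal sketch wraps a pure function with a recorder that only watches. The names extract_syllables, RecordingLedger, and run_extraction are hypothetical and invented for this example; they are not part of build_tools. The point is only that the result is identical with or without a ledger attached.

```python
from typing import Optional


def extract_syllables(words: list[str], min_len: int = 2, max_len: int = 8) -> list[str]:
    """Hypothetical pure extractor: keeps items whose length is within bounds."""
    return [w for w in words if min_len <= len(w) <= max_len]


class RecordingLedger:
    """Hypothetical stand-in for a ledger: it only appends observations."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def observe(self, event: str, **details) -> None:
        self.events.append((event, details))


def run_extraction(words: list[str], ledger: Optional[RecordingLedger] = None) -> list[str]:
    """Run the extractor; the ledger, if present, only watches."""
    if ledger is not None:
        ledger.observe("start", input_count=len(words))
    result = extract_syllables(words)  # no ledger state is passed to the extractor
    if ledger is not None:
        ledger.observe("output", syllable_count=len(result))
    return result


words = ["a", "ka", "tor", "extraordinarily"]
without_ledger = run_extraction(words)
ledger = RecordingLedger()
with_ledger = run_extraction(words, ledger)
```

Because the extractor never reads from the ledger, removing the ledger argument changes what is remembered, not what is produced.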

Typical Usage:

>>> from build_tools.corpus_db import CorpusLedger
>>> from pathlib import Path
>>>
>>> # Initialize ledger (finds or creates database)
>>> ledger = CorpusLedger()
>>>
>>> # Start a new extraction run
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt"
... )
>>>
>>> # Record input sources
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>>
>>> # ... extraction happens (ledger doesn't participate) ...
>>>
>>> # Record outputs
>>> ledger.record_output(
...     run_id,
...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
...     syllable_count=5432,
...     unique_syllable_count=1234,
...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
... )
>>>
>>> # Mark run complete
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Query runs later
>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> recent = ledger.get_recent_runs(limit=10)

Classes

CorpusLedger

Manages the corpus extraction run ledger database.

Module Contents

class build_tools.corpus_db.ledger.CorpusLedger(db_path=None)

Manages the corpus extraction run ledger database.

The CorpusLedger provides a simple API for recording extraction runs, their inputs, outputs, and outcomes. The ledger is append-only: run, input, and output records are never deleted, and the only change made to an existing run is the final status and exit code written by complete_run().

The database file location is configurable but defaults to data/raw/syllable_extractor.db.

db_path: Path to the SQLite database file

_conn: Active database connection (None if not connected)

Example

>>> ledger = CorpusLedger()
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US"
... )
>>> ledger.complete_run(run_id, exit_code=0, status="completed")

Initialize the corpus ledger.

Creates the database and tables if they don’t exist. If the database already exists, validates the schema version.

Parameters: db_path (pathlib.Path | None) – Path to SQLite database file. If None, defaults to data/raw/syllable_extractor.db in the project root.
Raises: sqlite3.Error – If database initialization fails

Example

>>> # Use default location
>>> ledger = CorpusLedger()
>>>
>>> # Use custom location
>>> ledger = CorpusLedger(Path("_working/test.db"))
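
The create-or-validate behavior described above can be sketched with the standard sqlite3 module. This is an assumption-laden illustration, not the module's implementation: the use of PRAGMA user_version to hold the schema version, the SCHEMA_VERSION value, the open_ledger_db helper, and the table layout are all hypothetical.

```python
import sqlite3

SCHEMA_VERSION = 1  # hypothetical version number, chosen for this sketch


def open_ledger_db(db_path: str) -> sqlite3.Connection:
    """Create the schema on first use, otherwise check its version."""
    conn = sqlite3.connect(db_path)
    (found,) = conn.execute("PRAGMA user_version").fetchone()
    if found == 0:
        # Fresh database: create a (hypothetical) runs table and stamp the version.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "extractor_tool TEXT NOT NULL, "
            "status TEXT NOT NULL DEFAULT 'running')"
        )
        # PRAGMA values cannot be bound as parameters; the constant is an int.
        conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
        conn.commit()
    elif found != SCHEMA_VERSION:
        conn.close()
        raise sqlite3.DatabaseError(
            f"schema version {found} does not match expected {SCHEMA_VERSION}"
        )
    return conn


conn = open_ledger_db(":memory:")
version = conn.execute("PRAGMA user_version").fetchone()[0]
```

sqlite3.DatabaseError is a subclass of sqlite3.Error, so a caller that catches sqlite3.Error, as documented above, would also catch this mismatch.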

db_path = None

close()

Close the database connection.

Should be called when done with the ledger. Using the ledger as a context manager (with statement) is preferred as it handles cleanup automatically.

Example

>>> ledger = CorpusLedger()
>>> # ... use ledger ...
>>> ledger.close()
>>>
>>> # Preferred: use context manager
>>> with CorpusLedger() as ledger:
...     ledger.start_run(...)
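
For readers unfamiliar with the pattern, here is a minimal sketch of how a class can support the with statement and close its connection on exit. LedgerSketch is a hypothetical stand-in written for this example; CorpusLedger's actual implementation may differ. Note that a bare sqlite3.Connection used as a context manager commits or rolls back the transaction but does not close the connection, which is why an explicit close() in __exit__ is shown.

```python
import sqlite3
from typing import Optional


class LedgerSketch:
    """Hypothetical stand-in showing the context-manager pattern."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self._conn: Optional[sqlite3.Connection] = sqlite3.connect(db_path)

    def close(self) -> None:
        # Safe to call more than once.
        if self._conn is not None:
            self._conn.close()
            self._conn = None

    def __enter__(self) -> "LedgerSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Always close; returning None lets any exception propagate.
        self.close()


with LedgerSketch() as ledger:
    was_open = ledger._conn is not None
is_closed = ledger._conn is None
```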

start_run(extractor_tool, extractor_version=None, pyphen_lang=None, auto_lang_detected=None, min_len=None, max_len=None, recursive=False, pattern=None, command_line=None, notes=None)

Record the start of a new extraction run.

Creates a new run record with status='running' and returns the run ID. The caller should use this ID to record inputs, outputs, and eventually mark the run complete or failed.

Parameters:

extractor_tool (str) – Name of the extraction tool (e.g., 'syllable_extractor', 'syllable_extractor_nltk', 'syllable_extractor_espeak')
extractor_version (str | None) – Version string or git SHA of the tool
pyphen_lang (str | None) – Pyphen language code (None for non-pyphen tools, stored as NULL)
auto_lang_detected (str | None) – Auto-detected language code if auto-detection was used
min_len (int | None) – Minimum syllable length constraint
max_len (int | None) – Maximum syllable length constraint
recursive (bool) – Whether source directory was processed recursively
pattern (str | None) – File pattern filter (e.g., '*.txt')
command_line (str | None) – Full command-line invocation for reproducibility
notes (str | None) – User-provided annotations about this run

Returns:

Unique run ID (integer) for this extraction run

Return type:

int

Example

>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt",
...     notes="Testing new corpus from Project Gutenberg"
... )
>>> print(f"Started run {run_id}")
Started run 42
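
As a rough sketch of what recording a run start can look like in SQLite, the snippet below inserts a row with status 'running' and returns the new row ID. The table layout is an assumption made for this example (only a few of the documented fields are included), and the function is a simplified stand-in rather than the real start_run().

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema for illustration only.
conn.execute(
    "CREATE TABLE runs ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "run_timestamp TEXT NOT NULL, "
    "extractor_tool TEXT NOT NULL, "
    "pyphen_lang TEXT, "
    "status TEXT NOT NULL)"
)


def start_run(extractor_tool: str, pyphen_lang: Optional[str] = None) -> int:
    """Insert a new run with status 'running' and return its ID."""
    cur = conn.execute(
        "INSERT INTO runs (run_timestamp, extractor_tool, pyphen_lang, status) "
        "VALUES (?, ?, ?, 'running')",
        (datetime.now(timezone.utc).isoformat(), extractor_tool, pyphen_lang),
    )
    conn.commit()
    return cur.lastrowid


first = start_run("syllable_extractor", pyphen_lang="en_US")
second = start_run("syllable_extractor_nltk")  # pyphen_lang=None is stored as NULL
row = conn.execute(
    "SELECT status, pyphen_lang FROM runs WHERE id = ?", (second,)
).fetchone()
```

Bound parameters (the ? placeholders) keep values such as command lines and notes from being interpreted as SQL.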

record_input(run_id, source_path, file_count=None)

Record an input source for a run.

Associates an input file or directory with an extraction run. Multiple inputs can be recorded for a single run.

Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.

Parameters:

run_id (int) – Run ID from start_run()
source_path (pathlib.Path) – Path to input file or directory
file_count (int | None) – Number of files if source_path is a directory (None for single file)

Example

>>> run_id = ledger.start_run("syllable_extractor", "0.2.0")
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>> ledger.record_input(run_id, Path("data/corpus/german/"), file_count=42)
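
The note on POSIX-format storage can be illustrated with pathlib alone. Whether the ledger uses as_posix() internally is an assumption; the sketch only shows that this standard call produces forward-slash strings regardless of the platform a path was written for.

```python
from pathlib import PurePosixPath, PureWindowsPath

# A Windows-style path as it might appear on a developer machine.
windows_path = PureWindowsPath(r"data\corpus\english.txt")

# as_posix() renders the path with forward slashes on any platform.
stored = windows_path.as_posix()

# A POSIX path is unchanged by the same call.
posix_stored = PurePosixPath("data/corpus/english.txt").as_posix()
```

Storing one canonical form means a lookup by path can match a record regardless of which operating system wrote it.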

record_output(run_id, output_path, syllable_count=None, unique_syllable_count=None, meta_path=None)

Record an output file for a run.

Associates an output .syllables file with an extraction run. Multiple outputs can be recorded for a single run (e.g., batch processing).

Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.

Parameters:

run_id (int) – Run ID from start_run()
output_path (pathlib.Path) – Path to generated .syllables file
syllable_count (int | None) – Total number of syllables (including duplicates)
unique_syllable_count (int | None) – Number of unique syllables
meta_path (pathlib.Path | None) – Path to corresponding .meta file (if generated)

Example

>>> ledger.record_output(
...     run_id=42,
...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
...     syllable_count=5432,
...     unique_syllable_count=1234,
...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
... )

complete_run(run_id, exit_code, status='completed')

Mark a run as complete or failed.

Updates the run status and exit code. This should be called when extraction finishes, whether successfully or with errors.

Parameters:

run_id (int) – Run ID from start_run()
exit_code (int) – Unix exit code (0 = success, non-zero = failure)
status (str) – Final status - one of 'completed', 'failed', 'interrupted'

Raises:

ValueError – If status is not a valid value

Example

>>> # Successful run
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Failed run
>>> ledger.complete_run(run_id, exit_code=1, status="failed")
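
One way a caller might make sure every run receives exactly one final status is to wrap the extraction in try/except. The sketch below uses a hypothetical StubLedger that mimics only the documented status validation, and a hypothetical run_and_record helper; neither is part of build_tools. The exit code 130 for an interrupted run follows the common shell convention (128 + SIGINT) and is an assumption, not something the ledger requires.

```python
VALID_STATUSES = {"completed", "failed", "interrupted"}


class StubLedger:
    """Hypothetical stand-in that mimics only complete_run()'s validation."""

    def __init__(self) -> None:
        self.final: dict[int, tuple[int, str]] = {}

    def complete_run(self, run_id: int, exit_code: int, status: str = "completed") -> None:
        if status not in VALID_STATUSES:
            raise ValueError(f"invalid status: {status!r}")
        self.final[run_id] = (exit_code, status)


def run_and_record(ledger: StubLedger, run_id: int, extract) -> None:
    """Call extract() and record one final status for the run."""
    try:
        extract()
    except KeyboardInterrupt:
        # Conventional exit code for Ctrl+C; re-raise so the caller still sees it.
        ledger.complete_run(run_id, exit_code=130, status="interrupted")
        raise
    except Exception:
        ledger.complete_run(run_id, exit_code=1, status="failed")
        raise
    else:
        ledger.complete_run(run_id, exit_code=0, status="completed")


def failing_extract() -> None:
    raise RuntimeError("corpus file missing")


ledger = StubLedger()
run_and_record(ledger, 1, lambda: None)
try:
    run_and_record(ledger, 2, failing_extract)
except RuntimeError:
    pass  # the failure was recorded before the exception propagated
```

Re-raising after recording keeps the ledger observational: the extraction's own error handling is unchanged.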

get_run(run_id)

Get details for a specific run.

Parameters: run_id (int) – Run ID to fetch
Returns: Dictionary with run details, or None if run_id doesn’t exist
Return type: dict[str, Any] | None

Example

>>> run = ledger.get_run(42)
>>> if run:
...     print(f"Tool: {run['extractor_tool']}")
...     print(f"Status: {run['status']}")
...     print(f"Command: {run['command_line']}")

get_runs_by_tool(extractor_tool)

Get all runs for a specific extractor tool.

Parameters: extractor_tool (str) – Tool name to filter by (e.g., 'syllable_extractor')
Returns: List of run dictionaries, ordered by timestamp descending (newest first)
Return type: list[dict[str, Any]]

Example

>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> for run in runs:
...     print(f"Run {run['id']}: {run['pyphen_lang']} ({run['status']})")

get_recent_runs(limit=10)

Get the most recent extraction runs.

Parameters: limit (int) – Maximum number of runs to return (default: 10)
Returns: List of run dictionaries, ordered by timestamp descending (newest first)
Return type: list[dict[str, Any]]

Example

>>> recent = ledger.get_recent_runs(limit=5)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")

get_run_inputs(run_id)

Get all input sources for a run.

Parameters: run_id (int) – Run ID to fetch inputs for
Returns: List of input dictionaries with source_path and file_count
Return type: list[dict[str, Any]]

Example

>>> inputs = ledger.get_run_inputs(42)
>>> for inp in inputs:
...     print(f"Source: {inp['source_path']}")
...     if inp['file_count']:
...         print(f"  Files: {inp['file_count']}")

get_run_outputs(run_id)

Get all outputs for a run.

Parameters: run_id (int) – Run ID to fetch outputs for
Returns: List of output dictionaries with paths and syllable counts
Return type: list[dict[str, Any]]

Example

>>> outputs = ledger.get_run_outputs(42)
>>> for out in outputs:
...     print(f"Output: {out['output_path']}")
...     print(f"  Unique syllables: {out['unique_syllable_count']}")

find_run_by_output(output_path)

Find which run produced a specific output file.

This is the “reverse lookup” - given a .syllables file, find out how it was created.

Parameters: output_path (pathlib.Path) – Path to .syllables file to search for
Returns: Run dictionary if found, None otherwise
Return type: dict[str, Any] | None

Example

>>> run = ledger.find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
>>> if run:
...     print(f"Created by: {run['command_line']}")
...     print(f"On: {run['run_timestamp']}")
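
A reverse lookup of this kind is typically a join from an outputs table back to the runs table. The sketch below shows the idea with sqlite3; the table names and layout are assumptions made for illustration (the column names id, extractor_tool, command_line, and output_path follow the examples in this reference), and the function is a simplified stand-in for the real method.

```python
import sqlite3
from pathlib import Path
from typing import Any, Optional

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be converted to dictionaries
# Hypothetical two-table schema with one sample run and one sample output.
conn.executescript(
    """
    CREATE TABLE runs (
        id INTEGER PRIMARY KEY,
        extractor_tool TEXT NOT NULL,
        command_line TEXT
    );
    CREATE TABLE outputs (
        run_id INTEGER NOT NULL REFERENCES runs(id),
        output_path TEXT NOT NULL
    );
    INSERT INTO runs VALUES
        (42, 'syllable_extractor',
         'python -m build_tools.syllable_extractor --file input.txt');
    INSERT INTO outputs VALUES (42, 'data/raw/en_US/corpus_v1.syllables');
    """
)


def find_run_by_output(output_path: Path) -> Optional[dict[str, Any]]:
    """Return the newest run that recorded output_path, or None."""
    row = conn.execute(
        "SELECT runs.* FROM runs "
        "JOIN outputs ON outputs.run_id = runs.id "
        "WHERE outputs.output_path = ? "
        "ORDER BY runs.id DESC LIMIT 1",
        (output_path.as_posix(),),  # compare against the stored POSIX form
    ).fetchone()
    return dict(row) if row is not None else None


found = find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
missing = find_run_by_output(Path("data/raw/en_US/other.syllables"))
```

Choosing the newest matching run when a path was written more than once is a design decision of this sketch; the reference above does not state how the real method resolves that case.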

get_stats()

Get overall ledger statistics.

Returns summary stats about all recorded runs, useful for understanding build history at a glance.

Returns:

Dictionary with the following keys:

total_runs: Total number of runs recorded
completed_runs: Number of runs with status='completed'
failed_runs: Number of runs with status='failed'
tools_used: Set of unique extractor tools
languages_used: Set of unique pyphen language codes

Return type:

dict

Example

>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> if stats['total_runs']:  # avoid dividing by zero on an empty ledger
...     rate = stats['completed_runs'] / stats['total_runs'] * 100
...     print(f"Success rate: {rate:.1f}%")
>>> print(f"Tools: {', '.join(stats['tools_used'])}")
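
Summary statistics like these map naturally onto SQL aggregates. The following is a minimal sketch under assumed table and column names, not the module's actual query; it returns the same keys documented above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced runs table with three sample rows.
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, extractor_tool TEXT, "
    "pyphen_lang TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO runs (extractor_tool, pyphen_lang, status) VALUES (?, ?, ?)",
    [
        ("syllable_extractor", "en_US", "completed"),
        ("syllable_extractor", "de_DE", "failed"),
        ("syllable_extractor_nltk", None, "completed"),
    ],
)


def get_stats() -> dict:
    """Aggregate run counts plus the sets of tools and languages seen."""
    # In SQLite a comparison evaluates to 0 or 1, so SUM counts matches;
    # COALESCE turns the NULL that SUM returns on an empty table into 0.
    total, completed, failed = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(status = 'completed'), 0), "
        "COALESCE(SUM(status = 'failed'), 0) FROM runs"
    ).fetchone()
    tools = {r[0] for r in conn.execute("SELECT DISTINCT extractor_tool FROM runs")}
    langs = {
        r[0]
        for r in conn.execute(
            "SELECT DISTINCT pyphen_lang FROM runs WHERE pyphen_lang IS NOT NULL"
        )
    }
    return {
        "total_runs": total,
        "completed_runs": completed,
        "failed_runs": failed,
        "tools_used": tools,
        "languages_used": langs,
    }


stats = get_stats()
```

Because tools_used and languages_used are sets, wrap them in sorted() when a stable display order matters.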