build_tools.corpus_db.ledger
============================

.. py:module:: build_tools.corpus_db.ledger

.. autoapi-nested-parse::

   Corpus extraction run ledger - observational database for build provenance.

   This module provides the CorpusLedger class, which manages SQLite-based
   tracking of all syllable extraction runs. The ledger records who ran what
   extraction tool, when, with which settings, and what outputs were produced.

   Critical Design Principle:
       The ledger is **observational only** - it records what happened but does
       not influence extraction behavior. Extractors remain pure, deterministic
       functions. The ledger just watches and remembers.

   Typical Usage:
       >>> from build_tools.corpus_db import CorpusLedger
       >>> from pathlib import Path
       >>>
       >>> # Initialize ledger (finds or creates database)
       >>> ledger = CorpusLedger()
       >>>
       >>> # Start a new extraction run
       >>> run_id = ledger.start_run(
       ...     extractor_tool="syllable_extractor",
       ...     extractor_version="0.2.0",
       ...     pyphen_lang="en_US",
       ...     min_len=2,
       ...     max_len=8,
       ...     command_line="python -m build_tools.syllable_extractor --file input.txt"
       ... )
       >>>
       >>> # Record input sources
       >>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
       >>>
       >>> # ... extraction happens (ledger doesn't participate) ...
       >>>
       >>> # Record outputs
       >>> ledger.record_output(
       ...     run_id,
       ...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
       ...     syllable_count=5432,
       ...     unique_syllable_count=1234,
       ...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
       ... )
       >>>
       >>> # Mark run complete
       >>> ledger.complete_run(run_id, exit_code=0, status="completed")
       >>>
       >>> # Query runs later
       >>> runs = ledger.get_runs_by_tool("syllable_extractor")
       >>> recent = ledger.get_recent_runs(limit=10)


Classes
-------

.. autoapisummary::

   build_tools.corpus_db.ledger.CorpusLedger


Module Contents
---------------

.. py:class:: CorpusLedger(db_path = None)

   Manages the corpus extraction run ledger database.

   The CorpusLedger provides a simple API for recording extraction runs, their
   inputs, outputs, and outcomes. All operations are append-only - runs are
   never modified or deleted once recorded.

   The database file location is configurable but defaults to:

   - data/raw/syllable_extractor.db

   .. attribute:: db_path

      Path to the SQLite database file

   .. attribute:: _conn

      Active database connection (None if not connected)

   .. admonition:: Example

      >>> ledger = CorpusLedger()
      >>> run_id = ledger.start_run(
      ...     extractor_tool="syllable_extractor",
      ...     extractor_version="0.2.0",
      ...     pyphen_lang="en_US"
      ... )
      >>> ledger.complete_run(run_id, exit_code=0, status="completed")

   Initialize the corpus ledger.

   Creates the database and tables if they don't exist. If the database
   already exists, validates the schema version.

   :param db_path: Path to SQLite database file. If None, defaults to
                   data/raw/syllable_extractor.db in the project root.

   :raises sqlite3.Error: If database initialization fails

   .. admonition:: Example

      >>> # Use default location
      >>> ledger = CorpusLedger()
      >>>
      >>> # Use custom location
      >>> ledger = CorpusLedger(Path("_working/test.db"))


   .. py:attribute:: db_path
      :value: None


   .. py:method:: close()

      Close the database connection.

      Should be called when done with the ledger. Using the ledger as a
      context manager (with statement) is preferred as it handles cleanup
      automatically.

      .. admonition:: Example

         >>> ledger = CorpusLedger()
         >>> # ... use ledger ...
         >>> ledger.close()
         >>>
         >>> # Preferred: use context manager
         >>> with CorpusLedger() as ledger:
         ...     ledger.start_run(...)


   .. py:method:: start_run(extractor_tool, extractor_version = None, pyphen_lang = None, auto_lang_detected = None, min_len = None, max_len = None, recursive = False, pattern = None, command_line = None, notes = None)

      Record the start of a new extraction run.

      Creates a new run record with status='running' and returns the run ID.
      The caller should use this ID to record inputs, outputs, and eventually
      mark the run complete or failed.

      :param extractor_tool: Name of the extraction tool (e.g.,
                             'syllable_extractor', 'syllable_extractor_nltk',
                             'syllable_extractor_espeak')
      :param extractor_version: Version string or git SHA of the tool
      :param pyphen_lang: Pyphen language code (NULL for non-pyphen tools)
      :param auto_lang_detected: Auto-detected language code if auto-detection
                                 was used
      :param min_len: Minimum syllable length constraint
      :param max_len: Maximum syllable length constraint
      :param recursive: Whether source directory was processed recursively
      :param pattern: File pattern filter (e.g., '*.txt')
      :param command_line: Full command-line invocation for reproducibility
      :param notes: User-provided annotations about this run

      :returns: Unique run ID (integer) for this extraction run

      .. admonition:: Example

         >>> run_id = ledger.start_run(
         ...     extractor_tool="syllable_extractor",
         ...     extractor_version="0.2.0",
         ...     pyphen_lang="en_US",
         ...     min_len=2,
         ...     max_len=8,
         ...     command_line="python -m build_tools.syllable_extractor --file input.txt",
         ...     notes="Testing new corpus from Project Gutenberg"
         ... )
         >>> print(f"Started run {run_id}")
         Started run 42


   .. py:method:: record_input(run_id, source_path, file_count = None)

      Record an input source for a run.

      Associates an input file or directory with an extraction run. Multiple
      inputs can be recorded for a single run.

      Note: Paths are stored in POSIX format (forward slashes) for
      cross-platform compatibility.

      :param run_id: Run ID from start_run()
      :param source_path: Path to input file or directory
      :param file_count: Number of files if source_path is a directory (None
                         for single file)

      .. admonition:: Example

         >>> run_id = ledger.start_run("syllable_extractor", "0.2.0")
         >>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
         >>> ledger.record_input(run_id, Path("data/corpus/german/"), file_count=42)


   .. py:method:: record_output(run_id, output_path, syllable_count = None, unique_syllable_count = None, meta_path = None)

      Record an output file for a run.

      Associates an output .syllables file with an extraction run. Multiple
      outputs can be recorded for a single run (e.g., batch processing).

      Note: Paths are stored in POSIX format (forward slashes) for
      cross-platform compatibility.

      :param run_id: Run ID from start_run()
      :param output_path: Path to generated .syllables file
      :param syllable_count: Total number of syllables (including duplicates)
      :param unique_syllable_count: Number of unique syllables
      :param meta_path: Path to corresponding .meta file (if generated)

      .. admonition:: Example

         >>> ledger.record_output(
         ...     run_id=42,
         ...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
         ...     syllable_count=5432,
         ...     unique_syllable_count=1234,
         ...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
         ... )


   .. py:method:: complete_run(run_id, exit_code, status = 'completed')

      Mark a run as complete or failed.

      Updates the run status and exit code. This should be called when
      extraction finishes, whether successfully or with errors.

      :param run_id: Run ID from start_run()
      :param exit_code: Unix exit code (0 = success, non-zero = failure)
      :param status: Final status - one of 'completed', 'failed', 'interrupted'

      :raises ValueError: If status is not a valid value

      .. admonition:: Example

         >>> # Successful run
         >>> ledger.complete_run(run_id, exit_code=0, status="completed")
         >>>
         >>> # Failed run
         >>> ledger.complete_run(run_id, exit_code=1, status="failed")


   .. py:method:: get_run(run_id)

      Get details for a specific run.

      :param run_id: Run ID to fetch

      :returns: Dictionary with run details, or None if run_id doesn't exist

      .. admonition:: Example

         >>> run = ledger.get_run(42)
         >>> if run:
         ...     print(f"Tool: {run['extractor_tool']}")
         ...     print(f"Status: {run['status']}")
         ...     print(f"Command: {run['command_line']}")


   .. py:method:: get_runs_by_tool(extractor_tool)

      Get all runs for a specific extractor tool.

      :param extractor_tool: Tool name to filter by (e.g., 'syllable_extractor')

      :returns: List of run dictionaries, ordered by timestamp descending
                (newest first)

      .. admonition:: Example

         >>> runs = ledger.get_runs_by_tool("syllable_extractor")
         >>> for run in runs:
         ...     print(f"Run {run['id']}: {run['pyphen_lang']} ({run['status']})")


   .. py:method:: get_recent_runs(limit = 10)

      Get the most recent extraction runs.

      :param limit: Maximum number of runs to return (default: 10)

      :returns: List of run dictionaries, ordered by timestamp descending
                (newest first)

      .. admonition:: Example

         >>> recent = ledger.get_recent_runs(limit=5)
         >>> for run in recent:
         ...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")


   .. py:method:: get_run_inputs(run_id)

      Get all input sources for a run.

      :param run_id: Run ID to fetch inputs for

      :returns: List of input dictionaries with source_path and file_count

      .. admonition:: Example

         >>> inputs = ledger.get_run_inputs(42)
         >>> for inp in inputs:
         ...     print(f"Source: {inp['source_path']}")
         ...     if inp['file_count']:
         ...         print(f"  Files: {inp['file_count']}")


   .. py:method:: get_run_outputs(run_id)

      Get all outputs for a run.

      :param run_id: Run ID to fetch outputs for

      :returns: List of output dictionaries with paths and syllable counts

      .. admonition:: Example

         >>> outputs = ledger.get_run_outputs(42)
         >>> for out in outputs:
         ...     print(f"Output: {out['output_path']}")
         ...     print(f"  Unique syllables: {out['unique_syllable_count']}")


   .. py:method:: find_run_by_output(output_path)

      Find which run produced a specific output file.

      This is the "reverse lookup" - given a .syllables file, find out how it
      was created.

      :param output_path: Path to .syllables file to search for

      :returns: Run dictionary if found, None otherwise

      .. admonition:: Example

         >>> run = ledger.find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
         >>> if run:
         ...     print(f"Created by: {run['command_line']}")
         ...     print(f"On: {run['run_timestamp']}")


   .. py:method:: get_stats()

      Get overall ledger statistics.

      Returns summary stats about all recorded runs, useful for understanding
      build history at a glance.

      :returns: - total_runs: Total number of runs recorded
                - completed_runs: Runs with status='completed'
                - failed_runs: Runs with status='failed'
                - tools_used: Set of unique extractor tools
                - languages_used: Set of unique pyphen language codes
      :rtype: Dictionary with statistics

      .. admonition:: Example

         >>> stats = ledger.get_stats()
         >>> print(f"Total runs: {stats['total_runs']}")
         >>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")
         >>> print(f"Tools: {', '.join(stats['tools_used'])}")
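The success-rate expression in the ``get_stats()`` example divides by ``total_runs``, so it would raise ``ZeroDivisionError`` on a ledger with no runs (assuming an empty ledger reports ``total_runs`` as 0, which this page does not state). The sketch below shows one way to guard that case. It works on a plain dictionary shaped like the documented return value; ``summarize_stats`` is a hypothetical helper and not part of ``CorpusLedger``, and only the key names are taken from the documentation above.

```python
from typing import Any, Mapping


def summarize_stats(stats: Mapping[str, Any]) -> str:
    """Format a one-line summary from a get_stats()-shaped mapping.

    Hypothetical helper: only the key names come from the documented
    return value; the formatting and zero-run handling are assumptions.
    """
    total = stats["total_runs"]
    completed = stats["completed_runs"]
    # Guard the division so an empty ledger does not raise ZeroDivisionError.
    rate = f"{completed / total * 100:.1f}%" if total else "n/a"
    # Sort the tool names: a set has no stable order, so sorting keeps
    # the summary reproducible between calls.
    tools = ", ".join(sorted(stats["tools_used"])) or "none"
    return f"Total runs: {total} | Success rate: {rate} | Tools: {tools}"


# Hand-written sample data standing in for ledger.get_stats() output.
sample = {
    "total_runs": 4,
    "completed_runs": 3,
    "failed_runs": 1,
    "tools_used": {"syllable_extractor", "syllable_extractor_nltk"},
    "languages_used": {"en_US"},
}
print(summarize_stats(sample))
```

With a real ledger the call would presumably be ``summarize_stats(ledger.get_stats())``. Sorting ``tools_used`` is a deliberate choice here, since joining a set directly can print the tools in a different order from one run to the next.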