build_tools.corpus_db.ledger
============================

.. py:module:: build_tools.corpus_db.ledger

.. autoapi-nested-parse::

   Corpus extraction run ledger - observational database for build provenance.

   This module provides the CorpusLedger class, which manages SQLite-based tracking
   of all syllable extraction runs. The ledger records who ran what extraction tool,
   when, with which settings, and what outputs were produced.

   Critical Design Principle:
       The ledger is **observational only** - it records what happened but does not
       influence extraction behavior. Extractors remain pure, deterministic functions.
       The ledger just watches and remembers.

   Typical Usage:
       >>> from build_tools.corpus_db import CorpusLedger
       >>> from pathlib import Path
       >>>
       >>> # Initialize ledger (finds or creates database)
       >>> ledger = CorpusLedger()
       >>>
       >>> # Start a new extraction run
       >>> run_id = ledger.start_run(
       ...     extractor_tool="syllable_extractor",
       ...     extractor_version="0.2.0",
       ...     pyphen_lang="en_US",
       ...     min_len=2,
       ...     max_len=8,
       ...     command_line="python -m build_tools.syllable_extractor --file input.txt"
       ... )
       >>>
       >>> # Record input sources
       >>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
       >>>
       >>> # ... extraction happens (ledger doesn't participate) ...
       >>>
       >>> # Record outputs
       >>> ledger.record_output(
       ...     run_id,
       ...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
       ...     syllable_count=5432,
       ...     unique_syllable_count=1234,
       ...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
       ... )
       >>>
       >>> # Mark run complete
       >>> ledger.complete_run(run_id, exit_code=0, status="completed")
       >>>
       >>> # Query runs later
       >>> runs = ledger.get_runs_by_tool("syllable_extractor")
       >>> recent = ledger.get_recent_runs(limit=10)
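
The observational-only principle can be illustrated with a minimal sketch. It does not use the real ``CorpusLedger``: ``extract``, ``run_with_ledger``, and the one-table schema are hypothetical stand-ins, assumed purely for illustration. The point is that the extractor returns the same result whether or not a ledger is attached.

```python
import sqlite3
from typing import List, Optional


def extract(text: str, min_len: int = 2, max_len: int = 8) -> List[str]:
    """Hypothetical stand-in extractor: keep words within the length bounds."""
    return [w for w in text.lower().split() if min_len <= len(w) <= max_len]


def run_with_ledger(text: str, conn: Optional[sqlite3.Connection]) -> List[str]:
    """Run the extractor; record the run only if a ledger connection is given."""
    result = extract(text)  # the extractor never sees the ledger
    if conn is not None:
        conn.execute(
            "INSERT INTO runs (extractor_tool, status) VALUES (?, ?)",
            ("stand_in_extractor", "completed"),
        )
        conn.commit()
    return result


# Assumed schema for this sketch only; the real ledger schema is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, extractor_tool TEXT, status TEXT)"
)

with_ledger = run_with_ledger("The quick brown fox", conn)
without_ledger = run_with_ledger("The quick brown fox", None)
run_count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
```

Only the bookkeeping differs between the two calls: one row is written when a connection is supplied, and the extraction output is unchanged.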


Classes
-------

.. autoapisummary::

   build_tools.corpus_db.ledger.CorpusLedger


Module Contents
---------------

.. py:class:: CorpusLedger(db_path = None)

   Manages the corpus extraction run ledger database.

   The CorpusLedger provides a simple API for recording extraction runs,
   their inputs, outputs, and outcomes. Recorded runs are never deleted, and
   the only change made to an existing run is the final status and exit code
   written by ``complete_run()``.

   The database file location is configurable but defaults to
   ``data/raw/syllable_extractor.db``.

   .. attribute:: db_path

      Path to the SQLite database file

   .. attribute:: _conn

      Active database connection (None if not connected)

   .. admonition:: Example

      >>> ledger = CorpusLedger()
      >>> run_id = ledger.start_run(
      ...     extractor_tool="syllable_extractor",
      ...     extractor_version="0.2.0",
      ...     pyphen_lang="en_US"
      ... )
      >>> ledger.complete_run(run_id, exit_code=0, status="completed")

   Initialize the corpus ledger.

   Creates the database and tables if they don't exist. If the database
   already exists, validates the schema version.

   :param db_path: Path to SQLite database file. If None, defaults to
                   data/raw/syllable_extractor.db in the project root.

   :raises sqlite3.Error: If database initialization fails

   .. admonition:: Example

      >>> # Use default location
      >>> ledger = CorpusLedger()
      >>>
      >>> # Use custom location
      >>> ledger = CorpusLedger(Path("_working/test.db"))


   .. py:attribute:: db_path
      :value: None


   .. py:method:: close()

      Close the database connection.

      Should be called when done with the ledger. Using the ledger as a
      context manager (with statement) is preferred as it handles cleanup
      automatically.

      .. admonition:: Example

         >>> ledger = CorpusLedger()
         >>> # ... use ledger ...
         >>> ledger.close()
         >>>
         >>> # Preferred: use context manager
         >>> with CorpusLedger() as ledger:
         ...     ledger.start_run(...)
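
      The context-manager behaviour described above can be sketched as follows. ``LedgerSketch`` is a hypothetical stand-in, not the real class; it only assumes that ``close()`` releases the connection and resets ``_conn`` to None, as the attribute description suggests. Note that a bare ``sqlite3.Connection`` used in a ``with`` statement commits or rolls back but does not close itself, which is why an explicit ``__exit__`` is useful.

      ```python
      import sqlite3


      class LedgerSketch:
          """Hypothetical stand-in showing close() and with-statement cleanup."""

          def __init__(self, db_path=":memory:"):
              self._conn = sqlite3.connect(db_path)

          def close(self):
              # Safe to call more than once: only close an open connection.
              if self._conn is not None:
                  self._conn.close()
                  self._conn = None

          def __enter__(self):
              return self

          def __exit__(self, exc_type, exc, tb):
              self.close()
              return False  # do not swallow exceptions raised in the with-block


      with LedgerSketch() as ledger:
          open_inside = ledger._conn is not None
      closed_after = ledger._conn is None
      ```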


   .. py:method:: start_run(extractor_tool, extractor_version = None, pyphen_lang = None, auto_lang_detected = None, min_len = None, max_len = None, recursive = False, pattern = None, command_line = None, notes = None)

      Record the start of a new extraction run.

      Creates a new run record with status='running' and returns the run ID.
      The caller should use this ID to record inputs, outputs, and eventually
      mark the run complete or failed.

      :param extractor_tool: Name of the extraction tool (e.g., 'syllable_extractor',
                             'syllable_extractor_nltk', 'syllable_extractor_espeak')
      :param extractor_version: Version string or git SHA of the tool
      :param pyphen_lang: Pyphen language code (None for non-pyphen tools)
      :param auto_lang_detected: Auto-detected language code if auto-detection was used
      :param min_len: Minimum syllable length constraint
      :param max_len: Maximum syllable length constraint
      :param recursive: Whether source directory was processed recursively
      :param pattern: File pattern filter (e.g., '*.txt')
      :param command_line: Full command-line invocation for reproducibility
      :param notes: User-provided annotations about this run

      :returns: Unique run ID (integer) for this extraction run

      .. admonition:: Example

         >>> run_id = ledger.start_run(
         ...     extractor_tool="syllable_extractor",
         ...     extractor_version="0.2.0",
         ...     pyphen_lang="en_US",
         ...     min_len=2,
         ...     max_len=8,
         ...     command_line="python -m build_tools.syllable_extractor --file input.txt",
         ...     notes="Testing new corpus from Project Gutenberg"
         ... )
         >>> print(f"Started run {run_id}")
         Started run 42
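
      As a rough sketch of what recording a run start might look like in SQLite, the snippet below inserts a row with status ``'running'`` and returns the generated row ID. The table layout and the helper name ``start_run_sketch`` are assumptions made for illustration; the actual schema used by ``CorpusLedger`` is not documented here.

      ```python
      import sqlite3

      # Assumed, simplified schema for this sketch only.
      conn = sqlite3.connect(":memory:")
      conn.execute(
          "CREATE TABLE runs ("
          "id INTEGER PRIMARY KEY AUTOINCREMENT, "
          "extractor_tool TEXT NOT NULL, "
          "extractor_version TEXT, "
          "status TEXT NOT NULL)"
      )


      def start_run_sketch(conn, extractor_tool, extractor_version=None):
          """Insert a new run with status 'running' and return its row ID."""
          cur = conn.execute(
              "INSERT INTO runs (extractor_tool, extractor_version, status) "
              "VALUES (?, ?, 'running')",
              (extractor_tool, extractor_version),
          )
          conn.commit()
          return cur.lastrowid


      first_id = start_run_sketch(conn, "syllable_extractor", "0.2.0")
      second_id = start_run_sketch(conn, "syllable_extractor_nltk")
      first_status = conn.execute(
          "SELECT status FROM runs WHERE id = ?", (first_id,)
      ).fetchone()[0]
      ```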


   .. py:method:: record_input(run_id, source_path, file_count = None)

      Record an input source for a run.

      Associates an input file or directory with an extraction run. Multiple
      inputs can be recorded for a single run.

      Note: Paths are stored in POSIX format (forward slashes) for cross-platform
      compatibility.

      :param run_id: Run ID from start_run()
      :param source_path: Path to input file or directory
      :param file_count: Number of files if source_path is a directory (None for single file)

      .. admonition:: Example

         >>> run_id = ledger.start_run("syllable_extractor", "0.2.0")
         >>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
         >>> ledger.record_input(run_id, Path("data/corpus/german/"), file_count=42)
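
      The POSIX-format note can be made concrete with the standard ``pathlib`` API. How ``CorpusLedger`` normalizes paths internally is not shown in this reference; ``as_posix()`` is simply one standard way to get forward-slash strings, and the pure path classes let the example run on any platform.

      ```python
      from pathlib import PurePosixPath, PureWindowsPath

      # The same relative location, written in each platform's native style.
      windows_style = PureWindowsPath(r"data\corpus\english.txt")
      posix_style = PurePosixPath("data/corpus/english.txt")

      # as_posix() yields forward slashes in both cases, so the stored
      # strings compare equal regardless of where the run happened.
      stored_from_windows = windows_style.as_posix()
      stored_from_posix = posix_style.as_posix()
      ```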


   .. py:method:: record_output(run_id, output_path, syllable_count = None, unique_syllable_count = None, meta_path = None)

      Record an output file for a run.

      Associates an output .syllables file with an extraction run. Multiple
      outputs can be recorded for a single run (e.g., batch processing).

      Note: Paths are stored in POSIX format (forward slashes) for cross-platform
      compatibility.

      :param run_id: Run ID from start_run()
      :param output_path: Path to generated .syllables file
      :param syllable_count: Total number of syllables (including duplicates)
      :param unique_syllable_count: Number of unique syllables
      :param meta_path: Path to corresponding .meta file (if generated)

      .. admonition:: Example

         >>> ledger.record_output(
         ...     run_id=42,
         ...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
         ...     syllable_count=5432,
         ...     unique_syllable_count=1234,
         ...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
         ... )


   .. py:method:: complete_run(run_id, exit_code, status = 'completed')

      Mark a run as completed, failed, or interrupted.

      Updates the run status and exit code. This should be called when
      extraction finishes, whether successfully or with errors.

      :param run_id: Run ID from start_run()
      :param exit_code: Unix exit code (0 = success, non-zero = failure)
      :param status: Final status - one of 'completed', 'failed', 'interrupted'

      :raises ValueError: If status is not one of the values listed above

      .. admonition:: Example

         >>> # Successful run
         >>> ledger.complete_run(run_id, exit_code=0, status="completed")
         >>>
         >>> # Failed run
         >>> ledger.complete_run(run_id, exit_code=1, status="failed")
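
      One way a caller might choose the arguments for ``complete_run()`` is sketched below. The helpers ``outcome_for`` and ``check_status`` are hypothetical, and the exit codes 1 and 130 are common shell conventions rather than values required by the ledger.

      ```python
      VALID_STATUSES = ("completed", "failed", "interrupted")


      def outcome_for(exc):
          """Map the exception that ended extraction (or None) to (exit_code, status)."""
          if exc is None:
              return 0, "completed"
          if isinstance(exc, KeyboardInterrupt):
              # 130 is the conventional shell exit code for Ctrl-C (128 + SIGINT).
              return 130, "interrupted"
          return 1, "failed"


      def check_status(status):
          """Reject anything outside the documented status values."""
          if status not in VALID_STATUSES:
              raise ValueError(f"invalid status: {status!r}")
          return status


      success = outcome_for(None)
      interrupted = outcome_for(KeyboardInterrupt())
      failed = outcome_for(RuntimeError("boom"))
      ```

      Wrapping extraction in ``try``/``except`` and passing the resulting pair to ``complete_run()`` helps ensure a run does not stay in the ``'running'`` state after an error.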


   .. py:method:: get_run(run_id)

      Get details for a specific run.

      :param run_id: Run ID to fetch

      :returns: Dictionary with run details, or None if run_id doesn't exist

      .. admonition:: Example

         >>> run = ledger.get_run(42)
         >>> if run:
         ...     print(f"Tool: {run['extractor_tool']}")
         ...     print(f"Status: {run['status']}")
         ...     print(f"Command: {run['command_line']}")


   .. py:method:: get_runs_by_tool(extractor_tool)

      Get all runs for a specific extractor tool.

      :param extractor_tool: Tool name to filter by (e.g., 'syllable_extractor')

      :returns: List of run dictionaries, ordered by timestamp descending (newest first)

      .. admonition:: Example

         >>> runs = ledger.get_runs_by_tool("syllable_extractor")
         >>> for run in runs:
         ...     print(f"Run {run['id']}: {run['pyphen_lang']} ({run['status']})")


   .. py:method:: get_recent_runs(limit = 10)

      Get the most recent extraction runs.

      :param limit: Maximum number of runs to return (default: 10)

      :returns: List of run dictionaries, ordered by timestamp descending (newest first)

      .. admonition:: Example

         >>> recent = ledger.get_recent_runs(limit=5)
         >>> for run in recent:
         ...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")


   .. py:method:: get_run_inputs(run_id)

      Get all input sources for a run.

      :param run_id: Run ID to fetch inputs for

      :returns: List of input dictionaries with source_path and file_count

      .. admonition:: Example

         >>> inputs = ledger.get_run_inputs(42)
         >>> for inp in inputs:
         ...     print(f"Source: {inp['source_path']}")
         ...     if inp['file_count']:
         ...         print(f"  Files: {inp['file_count']}")


   .. py:method:: get_run_outputs(run_id)

      Get all outputs for a run.

      :param run_id: Run ID to fetch outputs for

      :returns: List of output dictionaries with paths and syllable counts

      .. admonition:: Example

         >>> outputs = ledger.get_run_outputs(42)
         >>> for out in outputs:
         ...     print(f"Output: {out['output_path']}")
         ...     print(f"  Unique syllables: {out['unique_syllable_count']}")


   .. py:method:: find_run_by_output(output_path)

      Find which run produced a specific output file.

      This is the "reverse lookup" - given a .syllables file, find out
      how it was created.

      :param output_path: Path to .syllables file to search for

      :returns: Run dictionary if found, None otherwise

      .. admonition:: Example

         >>> run = ledger.find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
         >>> if run:
         ...     print(f"Created by: {run['command_line']}")
         ...     print(f"On: {run['run_timestamp']}")
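
      A reverse lookup of this kind is typically a join from the outputs table back to the runs table. The sketch below assumes a hypothetical two-table layout (``runs`` and ``outputs`` linked by ``run_id``); the real table names and the helper ``find_run`` are not taken from the ledger source.

      ```python
      import sqlite3
      from pathlib import Path

      conn = sqlite3.connect(":memory:")
      conn.row_factory = sqlite3.Row  # rows can be converted to dictionaries

      # Assumed schema for this sketch only.
      conn.execute(
          "CREATE TABLE runs (id INTEGER PRIMARY KEY, extractor_tool TEXT, command_line TEXT)"
      )
      conn.execute(
          "CREATE TABLE outputs (run_id INTEGER REFERENCES runs(id), output_path TEXT)"
      )
      conn.execute(
          "INSERT INTO runs (id, extractor_tool, command_line) VALUES (?, ?, ?)",
          (42, "syllable_extractor", "python -m build_tools.syllable_extractor --file input.txt"),
      )
      conn.execute(
          "INSERT INTO outputs (run_id, output_path) VALUES (?, ?)",
          (42, "data/raw/en_US/corpus_v1.syllables"),
      )


      def find_run(conn, output_path):
          """Return the run that produced output_path as a dict, or None."""
          row = conn.execute(
              "SELECT runs.* FROM runs "
              "JOIN outputs ON outputs.run_id = runs.id "
              "WHERE outputs.output_path = ?",
              (Path(output_path).as_posix(),),  # match the stored POSIX form
          ).fetchone()
          return dict(row) if row is not None else None


      found = find_run(conn, Path("data/raw/en_US/corpus_v1.syllables"))
      missing = find_run(conn, Path("data/raw/en_US/other.syllables"))
      ```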


   .. py:method:: get_stats()

      Get overall ledger statistics.

      Returns summary stats about all recorded runs, useful for understanding
      build history at a glance.

      :returns: Dictionary with the following keys:

                - total_runs: Total number of runs recorded
                - completed_runs: Runs with status='completed'
                - failed_runs: Runs with status='failed'
                - tools_used: Set of unique extractor tools
                - languages_used: Set of unique pyphen language codes
      :rtype: dict

      .. admonition:: Example

         >>> stats = ledger.get_stats()
         >>> print(f"Total runs: {stats['total_runs']}")
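         >>> total = stats['total_runs']
         >>> # Guard against an empty ledger to avoid dividing by zero
         >>> rate = stats['completed_runs'] / total * 100 if total else 0.0
         >>> print(f"Success rate: {rate:.1f}%")
         >>> # Sort the set so the printed order is stable
         >>> print(f"Tools: {', '.join(sorted(stats['tools_used']))}")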
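
The statistics returned by ``get_stats()`` could be computed with a handful of aggregate queries. The following is a minimal sketch under an assumed single-table schema; ``summarize`` is a hypothetical helper and the sample rows are made up for the example. It is not the ledger's actual implementation.

```python
import sqlite3

# Assumed schema and sample data for this sketch only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, extractor_tool TEXT, "
    "pyphen_lang TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO runs (extractor_tool, pyphen_lang, status) VALUES (?, ?, ?)",
    [
        ("syllable_extractor", "en_US", "completed"),
        ("syllable_extractor", "de_DE", "failed"),
        ("syllable_extractor_nltk", None, "completed"),
    ],
)


def summarize(conn):
    """Build a stats dictionary with the keys documented for get_stats()."""
    total, completed, failed = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(status = 'completed'), 0), "
        "COALESCE(SUM(status = 'failed'), 0) FROM runs"
    ).fetchone()
    tools = {row[0] for row in conn.execute("SELECT DISTINCT extractor_tool FROM runs")}
    langs = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT pyphen_lang FROM runs WHERE pyphen_lang IS NOT NULL"
        )
    }
    return {
        "total_runs": total,
        "completed_runs": completed,
        "failed_runs": failed,
        "tools_used": tools,
        "languages_used": langs,
    }


stats = summarize(conn)
# Guard against an empty ledger before dividing.
success_rate = (
    stats["completed_runs"] / stats["total_runs"] * 100 if stats["total_runs"] else 0.0
)
```

The ``COALESCE`` calls keep the counts at zero instead of NULL when the table is empty, so the same code works on a freshly created ledger.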