Corpus Database

Overview

Corpus Database - Build Provenance Ledger for Syllable Extraction

This module provides observational tracking for all syllable extractor runs across different tools (pyphen, NLTK, eSpeak, etc.). The database records who ran what extraction, when, with which settings, and what outputs were produced.

This is a build-time tool only - not used during runtime name generation.

Design Philosophy:
  • Observational only: Records outcomes, doesn’t control behavior

  • Append-only: Runs are only ever added, never deleted; the one later change to a run is the final status and exit code written by complete_run()

  • Tool-agnostic: Works for pyphen, NLTK, eSpeak, or future extractors

  • Queryable: Easy to find “which run produced this file?”

Key Features:
  • Full provenance tracking (inputs, outputs, settings, timestamps)

  • Support for multiple extractor tools

  • Command-line reproducibility via full CLI capture

  • Manual annotation support via notes field

  • Simple query API for run history analysis

Main Components:
  • CorpusLedger: Main API for recording and querying runs

  • SCHEMA_VERSION: Schema version for migration tracking

  • get_schema_description: Human-readable schema documentation

Usage:
>>> from build_tools.corpus_db import CorpusLedger
>>> from pathlib import Path
>>>
>>> # Initialize ledger (creates database if needed)
>>> ledger = CorpusLedger()
>>>
>>> # Start recording a run
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt"
... )
>>>
>>> # Record what went in
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>>
>>> # ... extraction happens (ledger doesn't participate) ...
>>>
>>> # Record what came out
>>> ledger.record_output(
...     run_id,
...     output_path=Path("data/raw/en_US/corpus.syllables"),
...     unique_syllable_count=1234
... )
>>>
>>> # Mark run complete
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Query runs later
>>> recent = ledger.get_recent_runs(limit=10)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")
Common Queries:

>>> # Which run produced this file?
>>> run = ledger.find_run_by_output(Path("data/raw/corpus.syllables"))
>>> print(run['command_line'])
>>>
>>> # Show all runs using en_GB
>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> en_gb = [r for r in runs if r['pyphen_lang'] == 'en_GB']
>>>
>>> # Get overall statistics
>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")

Database Location:

Default: data/raw/syllable_extractor.db

Custom: Pass db_path to CorpusLedger(db_path=Path(…))

Context Manager Support:
>>> with CorpusLedger() as ledger:
...     run_id = ledger.start_run(...)
...     # Connection automatically closed on exit

Output Format

Database Schema

The corpus database uses a simple three-table schema to track extraction runs:

runs - One row per extractor invocation

Records all configuration parameters, timestamps, command-line invocations, and execution outcomes. Each run gets a unique ID for tracking.

inputs - Source files/directories processed (many-to-one with runs)

Tracks which files or directories were used as input for each extraction run.

outputs - Generated output files (many-to-one with runs)

Records the .syllables and .meta files produced, along with syllable counts.
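
As a rough illustration of the three-table layout, the sketch below builds a comparable schema in an in-memory SQLite database. The column names are assumptions inferred from the parameters and dictionary keys shown elsewhere on this page; the real DDL may differ, and get_schema_description() remains the authoritative source.

```python
import sqlite3

# Hypothetical DDL: column names are inferred from this page, not copied
# from the real module.
SKETCH_SCHEMA = """
CREATE TABLE runs (
    id INTEGER PRIMARY KEY,
    run_timestamp TEXT NOT NULL,
    extractor_tool TEXT NOT NULL,
    pyphen_lang TEXT,
    command_line TEXT,
    status TEXT NOT NULL DEFAULT 'running',
    exit_code INTEGER
);
CREATE TABLE inputs (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    source_path TEXT NOT NULL,
    file_count INTEGER
);
CREATE TABLE outputs (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    output_path TEXT NOT NULL,
    meta_path TEXT,
    syllable_count INTEGER,
    unique_syllable_count INTEGER
);
"""


def build_sketch_db() -> sqlite3.Connection:
    """Create the sketch schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(SKETCH_SCHEMA)
    return conn


conn = build_sketch_db()
tables = [
    row["name"]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Both inputs and outputs carry a run_id foreign key, which is what makes them many-to-one with runs.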

For a detailed schema description:

from build_tools.corpus_db import get_schema_description
print(get_schema_description())

Database Location

Default location: data/raw/syllable_extractor.db

Custom location can be specified when initializing the ledger:

from pathlib import Path
from build_tools.corpus_db import CorpusLedger

ledger = CorpusLedger(db_path=Path("_working/test.db"))

Integration Guide

When building new syllable extractors, integrate the ledger by:

  1. Calling start_run() at the beginning of extraction

  2. Recording all input sources with record_input()

  3. Recording all output files with record_output()

  4. Marking completion with complete_run() in a try/finally block

Example integration pattern:

import sys

from build_tools.corpus_db import CorpusLedger

ledger = CorpusLedger()
run_id = ledger.start_run(
    extractor_tool="my_extractor",
    command_line=" ".join(sys.argv),
    # ... other parameters ...
)

try:
    # ... extraction logic (produces output_path and count) ...
    ledger.record_output(run_id, output_path, unique_syllable_count=count)
    ledger.complete_run(run_id, exit_code=0, status="completed")
except Exception:
    # Record the failure, then re-raise so the caller still sees the error
    ledger.complete_run(run_id, exit_code=1, status="failed")
    raise
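
The same pattern can also be written with try/finally, so that complete_run() is called exactly once on every exit path, including Ctrl-C. The sketch below is only illustrative: StubLedger is a hypothetical stand-in that keeps results in memory so the example runs without the real database, and the exit code 130 for an interrupt is a common shell convention, not something this module prescribes.

```python
class StubLedger:
    """Hypothetical stand-in for CorpusLedger; remembers final statuses only."""

    def __init__(self):
        self.completed = {}

    def start_run(self, extractor_tool, **kwargs):
        run_id = len(self.completed) + 1
        self.completed[run_id] = None  # run is still in progress
        return run_id

    def complete_run(self, run_id, exit_code, status="completed"):
        self.completed[run_id] = (exit_code, status)


def run_with_ledger(ledger, extract):
    """Run extract() and always record the outcome in the ledger."""
    run_id = ledger.start_run(extractor_tool="my_extractor")
    # Assume failure until the extraction proves otherwise.
    exit_code, status = 1, "failed"
    try:
        extract()
        exit_code, status = 0, "completed"
    except KeyboardInterrupt:
        exit_code, status = 130, "interrupted"
        raise
    finally:
        # Runs on success, on failure, and on interrupt.
        ledger.complete_run(run_id, exit_code=exit_code, status=status)
    return run_id


def failing_extraction():
    raise RuntimeError("extraction failed")


ledger = StubLedger()
ok_id = run_with_ledger(ledger, lambda: None)
try:
    run_with_ledger(ledger, failing_extraction)
except RuntimeError:
    pass  # the error still propagates; the ledger has already recorded it
print(ledger.completed)
```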

When to use this tool:

  • Track provenance of all syllable extraction runs

  • Query history to understand what corpus files were generated

  • Find which run produced a specific output file

  • Monitor extraction success rates across tools

Notes

Cross-Platform Compatibility:

Paths are stored in POSIX format (forward slashes) for cross-platform consistency. This ensures the database can be shared between Windows, macOS, and Linux systems without path separator issues.
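
A small sketch of what POSIX-format storage means in practice, using only pathlib. The helper name to_stored_form is hypothetical; how the ledger converts paths internally is not shown on this page.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_stored_form(path) -> str:
    """Return the forward-slash form of a path, as it might be stored."""
    return path.as_posix()


# Pure paths can be constructed on any operating system.
windows_style = PureWindowsPath(r"data\raw\en_US\corpus.syllables")
posix_style = PurePosixPath("data/raw/en_US/corpus.syllables")

stored_from_windows = to_stored_form(windows_style)
stored_from_posix = to_stored_form(posix_style)
print(stored_from_windows)
```

Because both spellings normalise to the same string, a lookup by output path can match regardless of which platform recorded the run.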

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

class build_tools.corpus_db.CorpusLedger(db_path=None)[source]

Bases: object

Manages the corpus extraction run ledger database.

The CorpusLedger provides a simple API for recording extraction runs, their inputs, outputs, and outcomes. Recording is append-only: once a run is recorded it is never deleted, and the only later change is the status and exit code written by complete_run().

The database file location is configurable but defaults to data/raw/syllable_extractor.db.

db_path

Path to the SQLite database file

_conn

Active database connection (None if not connected)

Example

>>> ledger = CorpusLedger()
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US"
... )
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
__enter__()[source]

Context manager entry.

Return type:

CorpusLedger

__exit__(exc_type, exc_val, exc_tb)[source]

Context manager exit - ensures connection is closed.

Return type:

None
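
For readers unfamiliar with the protocol, here is a minimal sketch of how a class can close its SQLite connection on context-manager exit. LedgerSketch is a hypothetical class written for this illustration; it mirrors the documented method names but is not the actual CorpusLedger implementation.

```python
import sqlite3


class LedgerSketch:
    """Hypothetical illustration of the context-manager protocol."""

    def __init__(self, db_path=":memory:"):
        self._conn = sqlite3.connect(db_path)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Returning None lets any exception from the with-body propagate.
        self.close()

    def close(self):
        if self._conn is not None:
            self._conn.close()
            self._conn = None


with LedgerSketch() as ledger:
    ledger._conn.execute("SELECT 1")

closed = ledger._conn is None
print(closed)
```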

__init__(db_path=None)[source]

Initialize the corpus ledger.

Creates the database and tables if they don’t exist. If the database already exists, validates the schema version.

Parameters:

db_path (Path | None) – Path to SQLite database file. If None, defaults to data/raw/syllable_extractor.db in the project root.

Raises:

sqlite3.Error – If database initialization fails

Example

>>> # Use default location
>>> ledger = CorpusLedger()
>>>
>>> # Use custom location
>>> ledger = CorpusLedger(Path("_working/test.db"))
close()[source]

Close the database connection.

Should be called when done with the ledger. Using the ledger as a context manager (with statement) is preferred as it handles cleanup automatically.

Example

>>> ledger = CorpusLedger()
>>> # ... use ledger ...
>>> ledger.close()
>>>
>>> # Preferred: use context manager
>>> with CorpusLedger() as ledger:
...     ledger.start_run(...)
Return type:

None

complete_run(run_id, exit_code, status='completed')[source]

Mark a run as complete or failed.

Updates the run status and exit code. This should be called when extraction finishes, whether successfully or with errors.

Parameters:
  • run_id (int) – Run ID from start_run()

  • exit_code (int) – Unix exit code (0 = success, non-zero = failure)

  • status (str) – Final status - one of ‘completed’, ‘failed’, ‘interrupted’

Raises:

ValueError – If status is not a valid value

Return type:

None

Example

>>> # Successful run
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Failed run
>>> ledger.complete_run(run_id, exit_code=1, status="failed")
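
The ValueError described above implies a check against the three allowed status strings. A sketch of such a check follows; the names VALID_STATUSES and validate_status are assumptions for illustration, and the real method may validate differently.

```python
VALID_STATUSES = frozenset({"completed", "failed", "interrupted"})


def validate_status(status: str) -> str:
    """Return status unchanged, or raise ValueError if it is not allowed."""
    if status not in VALID_STATUSES:
        allowed = ", ".join(sorted(VALID_STATUSES))
        raise ValueError(f"Invalid status {status!r}; expected one of: {allowed}")
    return status


checked = validate_status("interrupted")

try:
    validate_status("done")
    rejected = False
except ValueError:
    rejected = True
```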
find_run_by_output(output_path)[source]

Find which run produced a specific output file.

This is the “reverse lookup” - given a .syllables file, find out how it was created.

Parameters:

output_path (Path) – Path to .syllables file to search for

Return type:

dict[str, Any] | None

Returns:

Run dictionary if found, None otherwise

Example

>>> run = ledger.find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
>>> if run:
...     print(f"Created by: {run['command_line']}")
...     print(f"On: {run['run_timestamp']}")
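
Conceptually, the reverse lookup is a join from outputs back to runs. The sketch below shows one way that query could look against a simplified schema; the table layout, the sample row values, and the function name find_run_by_output_sketch are all assumptions made for this example.

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Simplified, hypothetical tables with one made-up sample run.
conn.executescript(
    """
    CREATE TABLE runs (
        id INTEGER PRIMARY KEY, command_line TEXT, run_timestamp TEXT
    );
    CREATE TABLE outputs (
        id INTEGER PRIMARY KEY,
        run_id INTEGER REFERENCES runs(id),
        output_path TEXT
    );
    INSERT INTO runs VALUES (
        1,
        'python -m build_tools.syllable_extractor --file input.txt',
        '2024-01-01T00:00:00'
    );
    INSERT INTO outputs VALUES (1, 1, 'data/raw/en_US/corpus_v1.syllables');
    """
)


def find_run_by_output_sketch(conn, output_path: Path):
    """Return the newest run that recorded output_path, or None."""
    row = conn.execute(
        "SELECT runs.* FROM runs "
        "JOIN outputs ON outputs.run_id = runs.id "
        "WHERE outputs.output_path = ? "
        "ORDER BY runs.id DESC LIMIT 1",
        (output_path.as_posix(),),  # compare in POSIX form
    ).fetchone()
    return dict(row) if row is not None else None


found = find_run_by_output_sketch(conn, Path("data/raw/en_US/corpus_v1.syllables"))
missing = find_run_by_output_sketch(conn, Path("data/raw/unknown.syllables"))
```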
get_recent_runs(limit=10)[source]

Get the most recent extraction runs.

Parameters:

limit (int) – Maximum number of runs to return (default: 10)

Return type:

list[dict[str, Any]]

Returns:

List of run dictionaries, ordered by timestamp descending (newest first)

Example

>>> recent = ledger.get_recent_runs(limit=5)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")
get_run(run_id)[source]

Get details for a specific run.

Parameters:

run_id (int) – Run ID to fetch

Return type:

dict[str, Any] | None

Returns:

Dictionary with run details, or None if run_id doesn’t exist

Example

>>> run = ledger.get_run(42)
>>> if run:
...     print(f"Tool: {run['extractor_tool']}")
...     print(f"Status: {run['status']}")
...     print(f"Command: {run['command_line']}")
get_run_inputs(run_id)[source]

Get all input sources for a run.

Parameters:

run_id (int) – Run ID to fetch inputs for

Return type:

list[dict[str, Any]]

Returns:

List of input dictionaries with source_path and file_count

Example

>>> inputs = ledger.get_run_inputs(42)
>>> for inp in inputs:
...     print(f"Source: {inp['source_path']}")
...     if inp['file_count']:
...         print(f"  Files: {inp['file_count']}")
get_run_outputs(run_id)[source]

Get all outputs for a run.

Parameters:

run_id (int) – Run ID to fetch outputs for

Return type:

list[dict[str, Any]]

Returns:

List of output dictionaries with paths and syllable counts

Example

>>> outputs = ledger.get_run_outputs(42)
>>> for out in outputs:
...     print(f"Output: {out['output_path']}")
...     print(f"  Unique syllables: {out['unique_syllable_count']}")
get_runs_by_tool(extractor_tool)[source]

Get all runs for a specific extractor tool.

Parameters:

extractor_tool (str) – Tool name to filter by (e.g., ‘syllable_extractor’)

Return type:

list[dict[str, Any]]

Returns:

List of run dictionaries, ordered by timestamp descending (newest first)

Example

>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> for run in runs:
...     print(f"Run {run['id']}: {run['pyphen_lang']} ({run['status']})")
get_stats()[source]

Get overall ledger statistics.

Returns summary stats about all recorded runs, useful for understanding build history at a glance.

Return type:

dict

Returns:

Dictionary with statistics:

  • total_runs: Total number of runs recorded

  • completed_runs: Runs with status=’completed’

  • failed_runs: Runs with status=’failed’

  • tools_used: Set of unique extractor tools

  • languages_used: Set of unique pyphen language codes

Example

>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")
>>> print(f"Tools: {', '.join(stats['tools_used'])}")
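
Note that the success-rate expression divides by total_runs, which would raise ZeroDivisionError on an empty ledger. A guarded helper might look like the sketch below; success_rate is a hypothetical name and takes a plain dictionary shaped like the get_stats() result described above.

```python
from typing import Optional


def success_rate(stats: dict) -> Optional[float]:
    """Return the completed-run percentage, or None if no runs exist."""
    total = stats.get("total_runs", 0)
    if not total:
        return None
    return stats["completed_runs"] / total * 100


empty = success_rate({"total_runs": 0, "completed_runs": 0})
three_of_four = success_rate({"total_runs": 4, "completed_runs": 3})
```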
record_input(run_id, source_path, file_count=None)[source]

Record an input source for a run.

Associates an input file or directory with an extraction run. Multiple inputs can be recorded for a single run.

Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.

Parameters:
  • run_id (int) – Run ID from start_run()

  • source_path (Path) – Path to input file or directory

  • file_count (int | None) – Number of files if source_path is a directory (None for single file)

Return type:

None

Example

>>> run_id = ledger.start_run("syllable_extractor", "0.2.0")
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>> ledger.record_input(run_id, Path("data/corpus/german/"), file_count=42)
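
The ledger only stores whatever file_count it is given, so the caller has to compute it. One way to do that is sketched below; count_files and its defaults are assumptions for illustration, chosen to mirror the pattern and recursive settings accepted by start_run().

```python
import tempfile
from pathlib import Path
from typing import Optional


def count_files(
    source: Path, pattern: str = "*.txt", recursive: bool = False
) -> Optional[int]:
    """Count matching files under a directory; None for a single file."""
    if source.is_file():
        return None
    matches = source.rglob(pattern) if recursive else source.glob(pattern)
    return sum(1 for path in matches if path.is_file())


# Build a throwaway directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.txt", "b.txt", "notes.md", "sub/c.txt"):
        (root / name).write_text("sample")

    flat = count_files(root)
    deep = count_files(root, recursive=True)
    single = count_files(root / "a.txt")
```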
record_output(run_id, output_path, syllable_count=None, unique_syllable_count=None, meta_path=None)[source]

Record an output file for a run.

Associates an output .syllables file with an extraction run. Multiple outputs can be recorded for a single run (e.g., batch processing).

Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.

Parameters:
  • run_id (int) – Run ID from start_run()

  • output_path (Path) – Path to generated .syllables file

  • syllable_count (int | None) – Total number of syllables (including duplicates)

  • unique_syllable_count (int | None) – Number of unique syllables

  • meta_path (Path | None) – Path to corresponding .meta file (if generated)

Return type:

None

Example

>>> ledger.record_output(
...     run_id=42,
...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
...     syllable_count=5432,
...     unique_syllable_count=1234,
...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
... )
start_run(extractor_tool, extractor_version=None, pyphen_lang=None, auto_lang_detected=None, min_len=None, max_len=None, recursive=False, pattern=None, command_line=None, notes=None)[source]

Record the start of a new extraction run.

Creates a new run record with status=’running’ and returns the run ID. The caller should use this ID to record inputs, outputs, and eventually mark the run complete or failed.

Parameters:
  • extractor_tool (str) – Name of the extraction tool (e.g., ‘syllable_extractor’, ‘syllable_extractor_nltk’, ‘syllable_extractor_espeak’)

  • extractor_version (str | None) – Version string or git SHA of the tool

  • pyphen_lang (str | None) – Pyphen language code (NULL for non-pyphen tools)

  • auto_lang_detected (str | None) – Auto-detected language code if auto-detection was used

  • min_len (int | None) – Minimum syllable length constraint

  • max_len (int | None) – Maximum syllable length constraint

  • recursive (bool) – Whether source directory was processed recursively

  • pattern (str | None) – File pattern filter (e.g., ‘*.txt’)

  • command_line (str | None) – Full command-line invocation for reproducibility

  • notes (str | None) – User-provided annotations about this run

Return type:

int

Returns:

Unique run ID (integer) for this extraction run

Example

>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt",
...     notes="Testing new corpus from Project Gutenberg"
... )
>>> print(f"Started run {run_id}")
Started run 42
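
For the command_line parameter, joining sys.argv with plain spaces loses quoting when an argument contains whitespace. A possible alternative is the standard library's shlex.join, sketched below. The helper name capture_command_line is hypothetical, and note that sys.argv[0] is the script path, so the captured string may not match the "python -m ..." form typed by the user.

```python
import shlex
import sys


def capture_command_line(argv=None) -> str:
    """Return argv as one shell-quoted string (defaults to sys.argv)."""
    return shlex.join(sys.argv if argv is None else argv)


original = [
    "python",
    "-m",
    "build_tools.syllable_extractor",
    "--file",
    "my input.txt",  # contains a space, so it needs quoting
]
captured = capture_command_line(original)
print(captured)

# Splitting the captured string recovers the original arguments.
round_trip = shlex.split(captured)
```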
build_tools.corpus_db.get_schema_description()[source]

Generate a human-readable description of the database schema.

Useful for documentation, debugging, and understanding the database structure without examining SQL directly.

Return type:

str

Returns:

Multi-line string describing tables, columns, and relationships

Example

>>> print(get_schema_description())
Corpus Extraction Run Ledger Schema (v1)
========================================
...