Corpus Database

Overview

Corpus Database - Build Provenance Ledger for Syllable Extraction

This module provides observational tracking for all syllable extractor runs across different tools (pyphen, NLTK, eSpeak, etc.). The database records who ran what extraction, when, with which settings, and what outputs were produced.

This is a build-time tool only - not used during runtime name generation.

Design Philosophy:

Observational only: Records outcomes, doesn’t control behavior
Append-only: Runs are never modified, only added
Tool-agnostic: Works for pyphen, NLTK, eSpeak, or future extractors
Queryable: Easy to find “which run produced this file?”

Key Features:

Full provenance tracking (inputs, outputs, settings, timestamps)
Support for multiple extractor tools
Command-line reproducibility via full CLI capture
Manual annotation support via notes field
Simple query API for run history analysis

Main Components:

CorpusLedger: Main API for recording and querying runs
SCHEMA_VERSION: Schema version for migration tracking
get_schema_description: Human-readable schema documentation

Usage:

>>> from build_tools.corpus_db import CorpusLedger
>>> from pathlib import Path
>>>
>>> # Initialize ledger (creates database if needed)
>>> ledger = CorpusLedger()
>>>
>>> # Start recording a run
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt"
... )
>>>
>>> # Record what went in
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>>
>>> # ... extraction happens (ledger doesn't participate) ...
>>>
>>> # Record what came out
>>> ledger.record_output(
...     run_id,
...     output_path=Path("data/raw/en_US/corpus.syllables"),
...     unique_syllable_count=1234
... )
>>>
>>> # Mark run complete
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Query runs later
>>> recent = ledger.get_recent_runs(limit=10)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")

Common Queries:

>>> # Which run produced this file?
>>> run = ledger.find_run_by_output(Path("data/raw/corpus.syllables"))
>>> print(run['command_line'])
>>>
>>> # Show all runs using en_GB
>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> en_gb = [r for r in runs if r['pyphen_lang'] == 'en_GB']
>>>
>>> # Get overall statistics
>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")

Database Location:

Default: data/raw/syllable_extractor.db
Custom: Pass db_path to CorpusLedger(db_path=Path(…))

Context Manager Support:

>>> with CorpusLedger() as ledger:
...     run_id = ledger.start_run(...)
...     # Connection automatically closed on exit

Output Format

Database Schema

The corpus database uses a simple three-table schema to track extraction runs:

runs - One row per extractor invocation: Records all configuration parameters, timestamps, command-line invocations, and execution outcomes. Each run gets a unique ID for tracking.
inputs - Source files/directories processed (many-to-one with runs): Tracks which files or directories were used as input for each extraction run.
outputs - Generated output files (many-to-one with runs): Records the .syllables and .meta files produced, along with syllable counts.
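
As a rough illustration of the three-table layout described above, the following sketch builds a comparable schema in an in-memory SQLite database. The table names come from this page, but every column name and type here is an assumption chosen to mirror the documented parameters; the authoritative definition is whatever get_schema_description() prints.

```python
import sqlite3

# Hypothetical schema: table names follow the description above, but the
# columns and types are assumptions, not the module's actual DDL.
SKETCH_SCHEMA = """
CREATE TABLE runs (
    id INTEGER PRIMARY KEY,
    run_timestamp TEXT NOT NULL,
    extractor_tool TEXT NOT NULL,
    command_line TEXT,
    status TEXT NOT NULL DEFAULT 'running',
    exit_code INTEGER
);
CREATE TABLE inputs (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    source_path TEXT NOT NULL,
    file_count INTEGER
);
CREATE TABLE outputs (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    output_path TEXT NOT NULL,
    unique_syllable_count INTEGER
);
"""


def create_sketch_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sketched three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SKETCH_SCHEMA)
    return conn


def list_tables(conn: sqlite3.Connection) -> list:
    """Return the user-defined table names in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

The inputs and outputs tables each carry a run_id column pointing back at runs, which is what makes the relationship many-to-one.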

For a detailed schema description:

from build_tools.corpus_db import get_schema_description
print(get_schema_description())

Database Location

Default location: data/raw/syllable_extractor.db

Custom location can be specified when initializing the ledger:

from pathlib import Path
from build_tools.corpus_db import CorpusLedger

ledger = CorpusLedger(db_path=Path("_working/test.db"))

Integration Guide

When building new syllable extractors, integrate the ledger by:

Calling start_run() at the beginning of extraction
Recording all input sources with record_input()
Recording all output files with record_output()
Marking completion with complete_run() on every exit path (success and failure), typically from a try/except block

Example integration pattern:

from build_tools.corpus_db import CorpusLedger
import sys

ledger = CorpusLedger()
run_id = ledger.start_run(
    extractor_tool="my_extractor",
    command_line=" ".join(sys.argv),
    # ... other parameters ...
)

try:
    # ... extraction logic that sets output_path and count ...
    ledger.record_output(run_id, output_path, unique_syllable_count=count)
    ledger.complete_run(run_id, exit_code=0, status="completed")
except Exception:
    # Record the failure, then re-raise so the original error still surfaces
    ledger.complete_run(run_id, exit_code=1, status="failed")
    raise
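
The pattern above covers the 'completed' and 'failed' statuses. One possible way to also cover 'interrupted' (which complete_run() accepts) is sketched below. RecordingLedger is a hypothetical stand-in so the example runs without a database, run_with_ledger is an illustrative helper rather than part of the module, and exit code 130 is only the common shell convention for Ctrl+C, not something this module requires.

```python
class RecordingLedger:
    """Hypothetical stand-in for CorpusLedger that remembers complete_run() calls."""

    def __init__(self):
        self.completed = []

    def complete_run(self, run_id, exit_code, status="completed"):
        self.completed.append((run_id, exit_code, status))


def run_with_ledger(ledger, run_id, extract):
    """Call extract() and record exactly one final status for run_id."""
    try:
        result = extract()
    except KeyboardInterrupt:
        # KeyboardInterrupt is not an Exception subclass, so it needs its own branch.
        ledger.complete_run(run_id, exit_code=130, status="interrupted")
        raise
    except Exception:
        ledger.complete_run(run_id, exit_code=1, status="failed")
        raise
    ledger.complete_run(run_id, exit_code=0, status="completed")
    return result
```

Keeping the success call outside the try block means an error raised by complete_run() itself is not recorded a second time as a failed extraction.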

When to use this tool:

Track provenance of all syllable extraction runs
Query history to understand what corpus files were generated
Find which run produced a specific output file
Monitor extraction success rates across tools

Notes

Cross-Platform Compatibility:

Paths are stored in POSIX format (forward slashes) for cross-platform consistency. This ensures the database can be shared between Windows, macOS, and Linux systems without path separator issues.
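
A minimal sketch of what POSIX-style storage means in practice, using only pathlib from the standard library. The helper name to_stored_path is hypothetical; the module may normalise paths differently internally.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def to_stored_path(path: PurePath) -> str:
    """Return the forward-slash form of a path, regardless of its flavour."""
    return path.as_posix()


# The same relative location written in Windows and POSIX style.
windows_style = PureWindowsPath(r"data\raw\en_US\corpus.syllables")
posix_style = PurePosixPath("data/raw/en_US/corpus.syllables")

# Both normalise to the same string, so a lookup succeeds on either platform.
same_key = to_stored_path(windows_style) == to_stored_path(posix_style)
```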

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

class build_tools.corpus_db.CorpusLedger(db_path=None)[source]

Bases: object

Manages the corpus extraction run ledger database.

The CorpusLedger provides a simple API for recording extraction runs, their inputs, outputs, and outcomes. Recorded runs are never deleted or rewritten; the only change made to an existing run is setting its final status and exit code through complete_run().

The database file location is configurable but defaults to data/raw/syllable_extractor.db.

db_path: Path to the SQLite database file

_conn: Active database connection (None if not connected)

Example

>>> ledger = CorpusLedger()
>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US"
... )
>>> ledger.complete_run(run_id, exit_code=0, status="completed")

__enter__()[source]

Context manager entry.

Return type: CorpusLedger

__exit__(exc_type, exc_val, exc_tb)[source]

Context manager exit - ensures connection is closed.

Return type: None

__init__(db_path=None)[source]

Initialize the corpus ledger.

Creates the database and tables if they don’t exist. If the database already exists, validates the schema version.

Parameters: db_path (Path | None) – Path to SQLite database file. If None, defaults to data/raw/syllable_extractor.db in the project root.
Raises: sqlite3.Error – If database initialization fails

Example

>>> # Use default location
>>> ledger = CorpusLedger()
>>>
>>> # Use custom location
>>> ledger = CorpusLedger(Path("_working/test.db"))

close()[source]

Close the database connection.

Should be called when done with the ledger. Using the ledger as a context manager (with statement) is preferred as it handles cleanup automatically.

Example

>>> ledger = CorpusLedger()
>>> # ... use ledger ...
>>> ledger.close()
>>>
>>> # Preferred: use context manager
>>> with CorpusLedger() as ledger:
...     ledger.start_run(...)

Return type: None
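
The cleanup behaviour described for close() and the with statement can be sketched with a small stand-in class. MiniLedger is hypothetical and only mirrors the documented _conn attribute and method names; it is not the module's implementation.

```python
import sqlite3


class MiniLedger:
    """Hypothetical stand-in showing the close()/context-manager pattern."""

    def __init__(self, db_path=":memory:"):
        self._conn = sqlite3.connect(db_path)

    def close(self):
        # Safe to call more than once: the connection is dropped after closing.
        if self._conn is not None:
            self._conn.close()
            self._conn = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Always release the connection; returning None lets any exception
        # raised inside the with block propagate to the caller.
        self.close()
```

With this shape, the connection is released even when the body of the with block raises, which is why the context-manager form is preferred over a manual close() call.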

complete_run(run_id, exit_code, status='completed')[source]

Mark a run as complete or failed.

Updates the run status and exit code. This should be called when extraction finishes, whether successfully or with errors.

Parameters:

run_id (int) – Run ID from start_run()
exit_code (int) – Unix exit code (0 = success, non-zero = failure)
status (str) – Final status - one of ‘completed’, ‘failed’, ‘interrupted’

Raises:

ValueError – If status is not a valid value

Return type:

None

Example

>>> # Successful run
>>> ledger.complete_run(run_id, exit_code=0, status="completed")
>>>
>>> # Failed run
>>> ledger.complete_run(run_id, exit_code=1, status="failed")
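
A minimal sketch of the kind of status check that would produce the documented ValueError. The three allowed values come from the parameter description above; the constant and function names are hypothetical, and the module's actual error message may differ.

```python
# Allowed final statuses, as listed in the parameter description above.
VALID_FINAL_STATUSES = frozenset({"completed", "failed", "interrupted"})


def validate_final_status(status: str) -> str:
    """Return status unchanged if it is allowed, otherwise raise ValueError."""
    if status not in VALID_FINAL_STATUSES:
        allowed = ", ".join(sorted(VALID_FINAL_STATUSES))
        raise ValueError(f"Invalid status {status!r}; expected one of: {allowed}")
    return status
```

Note that 'running' is the initial status set by start_run(), so it is rejected here as a final status.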

find_run_by_output(output_path)[source]

Find which run produced a specific output file.

This is the “reverse lookup” - given a .syllables file, find out how it was created.

Parameters: output_path (Path) – Path to .syllables file to search for
Return type: dict[str, Any] | None
Returns: Run dictionary if found, None otherwise

Example

>>> run = ledger.find_run_by_output(Path("data/raw/en_US/corpus_v1.syllables"))
>>> if run:
...     print(f"Created by: {run['command_line']}")
...     print(f"On: {run['run_timestamp']}")
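
Conceptually, the reverse lookup is a join from the outputs table back to runs. The sketch below shows one way such a query could look against a tiny demo database. The column names, the demo rows, and the choice to return the newest matching run are all assumptions for illustration, not the module's actual query.

```python
import sqlite3
from pathlib import PurePath


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with one run and one output (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    conn.executescript(
        """
        CREATE TABLE runs (id INTEGER PRIMARY KEY, command_line TEXT);
        CREATE TABLE outputs (run_id INTEGER REFERENCES runs(id), output_path TEXT);
        INSERT INTO runs (id, command_line) VALUES (1, 'python -m demo --file a.txt');
        INSERT INTO outputs (run_id, output_path) VALUES (1, 'data/raw/a.syllables');
        """
    )
    return conn


def find_run_by_output_sketch(conn, output_path):
    """Return the newest run that recorded output_path, or None if there is none."""
    row = conn.execute(
        "SELECT runs.* FROM runs "
        "JOIN outputs ON outputs.run_id = runs.id "
        "WHERE outputs.output_path = ? "
        "ORDER BY runs.id DESC LIMIT 1",
        (PurePath(output_path).as_posix(),),  # compare in stored POSIX form
    ).fetchone()
    return dict(row) if row is not None else None
```

Because the result can be None, callers should check it before indexing, as the example above does with `if run:`.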

get_recent_runs(limit=10)[source]

Get the most recent extraction runs.

Parameters: limit (int) – Maximum number of runs to return (default: 10)
Return type: list[dict[str, Any]]
Returns: List of run dictionaries, ordered by timestamp descending (newest first)

Example

>>> recent = ledger.get_recent_runs(limit=5)
>>> for run in recent:
...     print(f"{run['run_timestamp']}: {run['extractor_tool']}")

get_run(run_id)[source]

Get details for a specific run.

Parameters: run_id (int) – Run ID to fetch
Return type: dict[str, Any] | None
Returns: Dictionary with run details, or None if run_id doesn’t exist

Example

>>> run = ledger.get_run(42)
>>> if run:
...     print(f"Tool: {run['extractor_tool']}")
...     print(f"Status: {run['status']}")
...     print(f"Command: {run['command_line']}")

get_run_inputs(run_id)[source]

Get all input sources for a run.

Parameters: run_id (int) – Run ID to fetch inputs for
Return type: list[dict[str, Any]]
Returns: List of input dictionaries with source_path and file_count

Example

>>> inputs = ledger.get_run_inputs(42)
>>> for inp in inputs:
...     print(f"Source: {inp['source_path']}")
...     if inp['file_count']:
...         print(f"  Files: {inp['file_count']}")

get_run_outputs(run_id)[source]

Get all outputs for a run.

Parameters: run_id (int) – Run ID to fetch outputs for
Return type: list[dict[str, Any]]
Returns: List of output dictionaries with paths and syllable counts

Example

>>> outputs = ledger.get_run_outputs(42)
>>> for out in outputs:
...     print(f"Output: {out['output_path']}")
...     print(f"  Unique syllables: {out['unique_syllable_count']}")

get_runs_by_tool(extractor_tool)[source]

Get all runs for a specific extractor tool.

Parameters: extractor_tool (str) – Tool name to filter by (e.g., ‘syllable_extractor’)
Return type: list[dict[str, Any]]
Returns: List of run dictionaries, ordered by timestamp descending (newest first)

Example

>>> runs = ledger.get_runs_by_tool("syllable_extractor")
>>> for run in runs:
...     print(f"Run {run['id']}: {run['pyphen_lang']} ({run['status']})")

get_stats()[source]

Get overall ledger statistics.

Returns summary stats about all recorded runs, useful for understanding build history at a glance.

Returns:

Dictionary with statistics, containing:

total_runs: Total number of runs recorded
completed_runs: Runs with status='completed'
failed_runs: Runs with status='failed'
tools_used: Set of unique extractor tools
languages_used: Set of unique pyphen language codes

Return type:

dict

Example

>>> stats = ledger.get_stats()
>>> print(f"Total runs: {stats['total_runs']}")
>>> print(f"Success rate: {stats['completed_runs']/stats['total_runs']*100:.1f}%")
>>> print(f"Tools: {', '.join(stats['tools_used'])}")
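
The success-rate expression above divides by total_runs, which is zero for a freshly created ledger, and tools_used is a set, so its join order is not stable. The helpers below are a small hedged sketch for handling both cases; their names are hypothetical, and they only assume the stats keys documented above.

```python
from typing import Any, Mapping, Optional


def success_rate(stats: Mapping[str, Any]) -> Optional[float]:
    """Return the percentage of completed runs, or None when no runs exist."""
    total = stats["total_runs"]
    if total == 0:
        return None  # avoid ZeroDivisionError on an empty ledger
    return stats["completed_runs"] / total * 100


def format_tools(stats: Mapping[str, Any]) -> str:
    """Join the tool names in sorted order so the output is reproducible."""
    return ", ".join(sorted(stats["tools_used"]))
```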

record_input(run_id, source_path, file_count=None)[source]

Record an input source for a run.

Associates an input file or directory with an extraction run. Multiple inputs can be recorded for a single run.

Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.

Parameters:

run_id (int) – Run ID from start_run()
source_path (Path) – Path to input file or directory
file_count (int | None) – Number of files if source_path is a directory (None for single file)

Return type:

None

Example

>>> run_id = ledger.start_run("syllable_extractor", "0.2.0")
>>> ledger.record_input(run_id, Path("data/corpus/english.txt"))
>>> ledger.record_input(run_id, Path("data/corpus/german/"), file_count=42)
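
For directory inputs, the caller has to work out file_count before calling record_input(). One possible way to do that with pathlib is sketched below. The helper count_input_files is hypothetical, and its pattern and recursive arguments merely echo the start_run() parameters of the same names; the extractors may count files differently.

```python
from pathlib import Path
from typing import Optional


def count_input_files(
    source_path: Path, pattern: str = "*.txt", recursive: bool = False
) -> Optional[int]:
    """Count matching files under a directory; return None for a single file."""
    if not source_path.is_dir():
        return None  # mirrors file_count=None for single-file inputs
    matcher = source_path.rglob if recursive else source_path.glob
    return sum(1 for candidate in matcher(pattern) if candidate.is_file())
```

The returned value can be passed straight through as the file_count argument.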

record_output(run_id, output_path, syllable_count=None, unique_syllable_count=None, meta_path=None)[source]

Record an output file for a run.

Associates an output .syllables file with an extraction run. Multiple outputs can be recorded for a single run (e.g., batch processing).

Note: Paths are stored in POSIX format (forward slashes) for cross-platform compatibility.

Parameters:

run_id (int) – Run ID from start_run()
output_path (Path) – Path to generated .syllables file
syllable_count (int | None) – Total number of syllables (including duplicates)
unique_syllable_count (int | None) – Number of unique syllables
meta_path (Path | None) – Path to corresponding .meta file (if generated)

Return type:

None

Example

>>> ledger.record_output(
...     run_id=42,
...     output_path=Path("data/raw/en_US/corpus_v1.syllables"),
...     syllable_count=5432,
...     unique_syllable_count=1234,
...     meta_path=Path("data/raw/en_US/corpus_v1.meta")
... )

start_run(extractor_tool, extractor_version=None, pyphen_lang=None, auto_lang_detected=None, min_len=None, max_len=None, recursive=False, pattern=None, command_line=None, notes=None)[source]

Record the start of a new extraction run.

Creates a new run record with status='running' and returns the run ID. The caller should use this ID to record inputs, outputs, and eventually mark the run complete or failed.

Parameters:

extractor_tool (str) – Name of the extraction tool (e.g., ‘syllable_extractor’, ‘syllable_extractor_nltk’, ‘syllable_extractor_espeak’)
extractor_version (str | None) – Version string or git SHA of the tool
pyphen_lang (str | None) – Pyphen language code (NULL for non-pyphen tools)
auto_lang_detected (str | None) – Auto-detected language code if auto-detection was used
min_len (int | None) – Minimum syllable length constraint
max_len (int | None) – Maximum syllable length constraint
recursive (bool) – Whether source directory was processed recursively
pattern (str | None) – File pattern filter (e.g., ‘*.txt’)
command_line (str | None) – Full command-line invocation for reproducibility
notes (str | None) – User-provided annotations about this run

Return type:

int

Returns:

Unique run ID (integer) for this extraction run

Example

>>> run_id = ledger.start_run(
...     extractor_tool="syllable_extractor",
...     extractor_version="0.2.0",
...     pyphen_lang="en_US",
...     min_len=2,
...     max_len=8,
...     command_line="python -m build_tools.syllable_extractor --file input.txt",
...     notes="Testing new corpus from Project Gutenberg"
... )
>>> print(f"Started run {run_id}")
Started run 42
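
Since command_line exists for reproducibility, it can help to capture it in a form that survives arguments containing spaces. The sketch below uses shlex.join from the standard library (Python 3.8+) for that. The helper name capture_command_line is hypothetical, and the module itself accepts any string here.

```python
import shlex
import sys
from typing import Optional, Sequence


def capture_command_line(argv: Optional[Sequence[str]] = None) -> str:
    """Return a shell-quoted command line that can be pasted back into a shell."""
    args = sys.argv if argv is None else argv
    return shlex.join(args)


# An argument with a space is quoted, so the recorded string stays unambiguous.
example = capture_command_line(["extract.py", "--file", "my corpus.txt"])
```

A plain " ".join(sys.argv) would record the same invocation as four separate words, which cannot be replayed faithfully.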

build_tools.corpus_db.get_schema_description()[source]

Generate a human-readable description of the database schema.

Useful for documentation, debugging, and understanding the database structure without examining SQL directly.

Return type: str
Returns: Multi-line string describing tables, columns, and relationships

Example

>>> print(get_schema_description())
Corpus Extraction Run Ledger Schema (v1)
========================================
...