Corpus Database Viewer

Overview

Corpus Database Viewer - Interactive TUI for Build Provenance

Interactive terminal user interface for viewing corpus database provenance records. It is a build-time tool only, used for inspecting extraction run history and outputs.

This tool provides:

Interactive table browsing with pagination
Schema inspection (columns, types, indexes)
Data export to CSV and JSON formats
Keyboard-driven navigation

Main Components:

CorpusDBViewerApp: Main Textual TUI application class
queries: Database query functions (table lists, schema, data)
formatters: Export functions for CSV and JSON
main: CLI entry point with argument parsing

CLI Usage:

# Launch interactive TUI with default database
python -m build_tools.corpus_db_viewer

# Specify custom database path
python -m build_tools.corpus_db_viewer --db /path/to/database.db

# Set custom export directory
python -m build_tools.corpus_db_viewer --export-dir _working/my_exports/

Keyboard Shortcuts (in TUI):

↑/↓: Navigate rows
←/→: Previous/Next page
t: Switch table
i: Show schema info
e: Export data
q: Quit
?: Show help

Python API Usage:

>>> from build_tools.corpus_db_viewer import queries
>>> from pathlib import Path
>>>
>>> # Get list of tables
>>> db_path = Path("data/raw/syllable_extractor.db")
>>> tables = queries.get_tables_list(db_path)
>>>
>>> # Get schema for a table
>>> schema = queries.get_table_schema(db_path, "runs")
>>> print(schema['columns'])
>>>
>>> # Get paginated data
>>> data = queries.get_table_data(db_path, "runs", page=1, limit=50)
>>> print(f"Total rows: {data['total']}")
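
The paginated call above can be pictured as a LIMIT/OFFSET query plus a row count. The following is a minimal sketch under stated assumptions, not the module's actual implementation: fetch_page is a hypothetical helper, it takes an open sqlite3 connection rather than a path, and the "columns" and "rows" keys are illustrative (only the "total" key appears in the example above).

```python
import sqlite3
from typing import Any


def fetch_page(
    conn: sqlite3.Connection, table: str, page: int = 1, limit: int = 50
) -> dict[str, Any]:
    """Return one page of rows plus the total row count (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the catalog before interpolating it into the query.
    known = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table not in known:
        raise ValueError(f"unknown table: {table}")

    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    offset = (max(page, 1) - 1) * limit  # pages are 1-indexed
    # No ORDER BY here, so rows come back in SQLite's default scan order.
    cursor = conn.execute(
        f'SELECT * FROM "{table}" LIMIT ? OFFSET ?', (limit, offset)
    )
    columns = [description[0] for description in cursor.description]
    return {"columns": columns, "rows": cursor.fetchall(), "total": total}
```

With 120 rows and the default page size of 50, page 3 would hold the last 20 rows while "total" still reports 120.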

The Corpus Database Viewer is an interactive terminal user interface (TUI) for inspecting corpus database provenance records. Built with Textual, it provides a keyboard-driven interface for browsing extraction run history.

Replaces: Flask-based web viewer (archived in _working/_archived/pipeworks_db_viewer_flask/)

Key Features:

Browse all database tables interactively
Paginated data display (50 rows per page)
View table schemas and CREATE TABLE statements
Export data to CSV or JSON
Read-only database access (safe inspection)
Keyboard shortcuts for efficient navigation

Command-Line Interface

Interactive TUI for viewing corpus database provenance records

usage: corpus_db_viewer [-h] [--db DB_PATH] [--export-dir EXPORT_DIR]
                        [--page-size PAGE_SIZE]

Named Arguments

--db

Path to corpus database.

Default: data/raw/syllable_extractor.db

--export-dir

Directory for exported data.

Default: _working/exports

--page-size

Number of rows per page.

Default: 50

Examples:

# Launch viewer with default database
python -m build_tools.corpus_db_viewer

# Specify custom database path
python -m build_tools.corpus_db_viewer --db /path/to/database.db

# Set custom export directory
python -m build_tools.corpus_db_viewer --export-dir _working/my_exports/
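
For readers who want to see how the three documented options map onto a parser, here is a minimal argparse sketch. The flag names, defaults, and description come from the usage text above; the build_parser function name, the destination attribute names, and the use of Path values are assumptions and may differ from the real entry point.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="corpus_db_viewer",
        description="Interactive TUI for viewing corpus database provenance records",
    )
    parser.add_argument(
        "--db",
        dest="db_path",  # assumed name; yields the DB_PATH metavar in the usage line
        type=Path,
        default=Path("data/raw/syllable_extractor.db"),
        help="Path to corpus database.",
    )
    parser.add_argument(
        "--export-dir",
        type=Path,
        default=Path("_working/exports"),
        help="Directory for exported data.",
    )
    parser.add_argument(
        "--page-size",
        type=int,
        default=50,
        help="Number of rows per page.",
    )
    return parser
```

Parsing ["--db", "x.db", "--page-size", "25"] with this sketch would leave the export directory at its default and set the page size to 25.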

Keyboard shortcuts:

↑/↓: Navigate rows
←/→: Previous/Next page
PageUp/PageDn: Jump 10 pages
Home/End: First/Last page

t: Switch table (table selector)
i: Show schema info
e: Export current view
r: Refresh data
q: Quit application
?: Show help screen

Tips:

Use arrow keys to navigate through table data
Press ‘t’ to open the table selector and choose a different table
Press ‘i’ to view detailed schema information
Press ‘e’ to export the current table or view to CSV/JSON

Exports:

Exports are saved to the export directory (default: _working/exports/)
Files are named: <table_name>_<timestamp>.<format>
Both CSV and JSON formats are supported

Output Format

Export Files

The viewer can export table data to two formats:

CSV Format:

Comma-separated values with header row:

id,run_timestamp,extractor_tool,status
1,2026-01-09T14:30:22,syllable_extractor,completed
2,2026-01-09T15:12:45,syllable_extractor,completed

JSON Format:

Array of objects, one per row. Values keep their JSON types (for example, id stays a number), whereas CSV stores every value as text:

[
  {
    "id": 1,
    "run_timestamp": "2026-01-09T14:30:22",
    "extractor_tool": "syllable_extractor",
    "status": "completed"
  }
]
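
The two formats above can be produced with the standard library alone. This is a rough sketch, assuming rows arrive as tuples alongside a list of column names; to_csv and to_json are hypothetical names and are not taken from the formatters module.

```python
import csv
import io
import json

# Sample data matching the documented example output.
columns = ["id", "run_timestamp", "extractor_tool", "status"]
rows = [(1, "2026-01-09T14:30:22", "syllable_extractor", "completed")]


def to_csv(columns, rows) -> str:
    """Serialize rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(columns, rows) -> str:
    """Serialize rows as a JSON array of objects keyed by column name."""
    return json.dumps([dict(zip(columns, row)) for row in rows], indent=2)
```

Reading the CSV text back yields the string "1" for id, while parsing the JSON yields the integer 1, which is the practical difference between the two formats.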

Export file naming:

Files are timestamped and named by table:

_working/exports/
├── runs_20260109_143022.csv
├── runs_20260109_143022.json
└── outputs_20260109_143145.csv

Important: Exports include ALL rows, not just the current page.
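
The file names in the listing above follow a table name, a YYYYMMDD_HHMMSS timestamp, and the format extension. A small sketch of building such a path is shown here; export_path is a hypothetical helper, and the strftime pattern is inferred from the example file names rather than taken from the tool's source.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def export_path(
    export_dir: Path, table: str, fmt: str, now: Optional[datetime] = None
) -> Path:
    """Build <export_dir>/<table>_<YYYYMMDD_HHMMSS>.<fmt> (illustrative only)."""
    # The timestamp pattern is an assumption inferred from names such as
    # runs_20260109_143022.csv.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return export_dir / f"{table}_{stamp}.{fmt}"
```

Passing a fixed datetime, as in export_path(Path("_working/exports"), "runs", "csv", datetime(2026, 1, 9, 14, 30, 22)), makes the result reproducible for tests.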

Database Structure

The corpus database tracks syllable extraction runs:

runs: Extraction run metadata (tool name, version, status, timestamps, command-line arguments)
inputs: Source files processed (input files or directories used for each run)
outputs: Generated files (output .syllables and .meta files with syllable counts)
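
To check which of these tables exist in a given database file and how many rows each holds, the SQLite catalog can be queried directly. The sketch below assumes the database is SQLite; table_row_counts is a hypothetical helper and not part of the viewer's API.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Map each user table name to its row count (illustrative helper)."""
    # Collect the names first so the catalog cursor is exhausted before
    # further queries run on the same connection.
    names = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }
```

On a fresh database containing only the three tables described above, this would return the keys inputs, outputs, and runs in alphabetical order.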

Integration Guide

Use the viewer to inspect corpus database provenance after extraction runs:

# Step 1: Extract syllables (populates database)
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --lang en_US

# Step 2: Inspect extraction history with TUI viewer
python -m build_tools.corpus_db_viewer

When to use this tool:

To verify extraction runs completed successfully
To inspect which corpus files were processed
To track provenance of generated syllable files
To export extraction history for reporting or analysis
To debug failed extraction runs by examining status and error messages

Common workflows:

Browse recent runs: Launch viewer → select “runs” table → sort by timestamp
Find run details: Press i to view schema → browse rows for run metadata
Export history: Press e → select format (CSV/JSON) → save to export directory
Track file provenance: Switch to “outputs” table → identify which run created specific files

Advanced Topics

Keyboard Shortcuts

Navigation:

Key(s)	Action
`↑` / `↓`	Navigate rows
`←` / `→`	Previous/Next page
`PageUp` / `PageDn`	Jump 10 pages
`Home` / `End`	First/Last page

Actions:

Key	Action
`t`	Switch table
`i`	Show schema info
`e`	Export data
`r`	Refresh
`?`	Show help
`q`	Quit
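
The paging keys in the navigation table above come down to simple page arithmetic. The sketch below shows one way to compute the target page; clamping to the first and last page is an assumption about the viewer's behavior, and both function names are hypothetical.

```python
import math


def total_pages(total_rows: int, page_size: int = 50) -> int:
    """Number of pages needed for total_rows; an empty table still has one page."""
    return max(1, math.ceil(total_rows / page_size))


def jump(page: int, delta: int, total_rows: int, page_size: int = 50) -> int:
    """Move by delta pages, clamped to the valid range (assumed behavior)."""
    return min(max(page + delta, 1), total_pages(total_rows, page_size))
```

Under these assumptions, PageDn from page 1 of a 120-row table (3 pages) would land on page 3 rather than page 11.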

Usage Examples

Browsing Tables:

Launch the viewer and it automatically loads the first table. Navigate using:

Press t to focus the table list
Use ↑ / ↓ to navigate tables
Press Enter to select

Or click directly on table names in the sidebar.

Viewing Schema:

Press i to view detailed schema information:

Column definitions (name, type, constraints)
Indexes (name, columns, UNIQUE flags)
CREATE TABLE statement (original SQL)

Example output:

Schema: runs

Columns:
  • id: INTEGER [PRIMARY KEY]
  • run_timestamp: TEXT NOT NULL
  • extractor_tool: TEXT NOT NULL
  • status: TEXT

Indexes:
  • idx_status (status)
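
Schema details like those in the example output can be read from SQLite with PRAGMA statements and the sqlite_master catalog. This is a sketch assuming a SQLite database; describe_table and the keys of its result are hypothetical and may not match what queries.get_table_schema returns.

```python
import sqlite3
from typing import Any


def describe_table(conn: sqlite3.Connection, table: str) -> dict[str, Any]:
    """Collect columns, indexes, and the CREATE TABLE SQL for one table."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = [
        {"name": name, "type": col_type, "notnull": bool(notnull), "pk": bool(pk)}
        for _cid, name, col_type, notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        )
    ]
    indexes = []
    # PRAGMA index_list rows start with (seq, name, unique, ...)
    for row in conn.execute(f'PRAGMA index_list("{table}")').fetchall():
        index_name, unique = row[1], bool(row[2])
        # PRAGMA index_info rows: (seqno, cid, name)
        index_columns = [
            info[2] for info in conn.execute(f'PRAGMA index_info("{index_name}")')
        ]
        indexes.append({"name": index_name, "columns": index_columns, "unique": unique})
    found = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    return {
        "columns": columns,
        "indexes": indexes,
        "create_sql": found[0] if found else None,
    }
```

The table name is interpolated into the PRAGMA text because PRAGMA arguments cannot be bound as parameters, so callers should pass only names taken from the catalog.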

Exporting Data:

Press e to export the current table:

Edit filename (optional)
Choose CSV or JSON format
File saved to export directory (default: _working/exports/)

Design Philosophy

Read-Only Access:

The viewer opens databases in read-only mode (the SQLite URI parameter ?mode=ro) to prevent accidental modifications.
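
With Python's sqlite3 module, a read-only connection of this kind is opened by passing a file: URI with mode=ro and uri=True. The sketch below shows the general technique; open_read_only is a hypothetical helper, and the viewer's own connection code may differ.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite database so that writes are rejected."""
    # as_uri() needs an absolute path and produces a file:// URI;
    # mode=ro tells SQLite to refuse writes and not to create the file.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

On such a connection, SELECT statements work as usual, while an INSERT, UPDATE, or DELETE raises sqlite3.OperationalError.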

Observational Tool:

Like the corpus_db ledger, this viewer is observational only: it displays run history but doesn’t control or modify build processes.

Benefits Over Flask Version:

The Textual TUI offers advantages over the previous Flask-based web viewer:

No web server overhead (terminal-native)
Better build tools integration
Reduced dependencies (no Flask, pandas, Werkzeug)
Single-language codebase (Python only)
Native keyboard navigation

Trade-offs:

No SQL query interface (may be added later)
No cross-table search (may be added later)
Terminal-only (no browser UI)

Notes

Dependencies:

Requires the Textual library for TUI functionality. Install with:

pip install -e ".[dev]"

Troubleshooting:

Database Not Found:

Error: Database not found: data/raw/syllable_extractor.db

Solution: Ensure the database exists or specify a different path:

python -m build_tools.corpus_db_viewer --db /path/to/database.db

Textual Not Installed:

Error: Textual library not found

Solution: Install development dependencies:

pip install -e ".[dev]"

Terminal Too Small:

If the layout looks broken, resize your terminal. Minimum recommended: 80 columns × 24 rows.

Database Access:

Database opened in read-only mode for safety
No modification operations available
Safe to run while extraction tools are active

Build-time tool:

This is a build-time inspection tool; it is not used during runtime name generation.

Related Documentation:

Corpus Database - Build provenance ledger that this tool reads
Pyphen Syllable Extractor - Pyphen tool that populates the database
NLTK Syllable Extractor - NLTK tool that populates the database

API Reference