build_tools.corpus_sqlite_builder.converter

JSON to SQLite conversion logic for corpus data.

This module handles converting large annotated JSON files into optimized SQLite databases for efficient querying.

Functions

`find_annotated_json`(data_dir)	Find the annotated JSON file in a corpus data directory.
`convert_json_to_sqlite`(corpus_dir[, force, batch_size])	Convert an annotated JSON file to a SQLite database.

Module Contents

build_tools.corpus_sqlite_builder.converter.find_annotated_json(data_dir)

Find the annotated JSON file in a corpus data directory.

Parameters: data_dir (pathlib.Path) – Path to the data directory (e.g., _working/output/…/data/)
Returns: Path to the annotated JSON file, or None if not found
Return type: pathlib.Path | None

Looks for files matching the pattern *_syllables_annotated.json. Both pyphen and nltk filename prefixes are supported.
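The lookup described above can be approximated with a glob over the data directory. The following is only a sketch, not the module's actual implementation: the name `find_annotated_json_sketch` is hypothetical, the example filename `pyphen_syllables_annotated.json` is an assumed instance of the documented pattern, and picking the first match in sorted order is an assumption about how ties might be resolved.

```python
from pathlib import Path
from typing import Optional


def find_annotated_json_sketch(data_dir: Path) -> Optional[Path]:
    """Hypothetical stand-in for find_annotated_json (illustration only)."""
    # A missing directory is treated the same as "no file found".
    if not data_dir.is_dir():
        return None
    # The pattern comes from the documentation; it matches any prefix,
    # so pyphen_... and nltk_... filenames are both picked up.
    matches = sorted(data_dir.glob("*_syllables_annotated.json"))
    # Assumption: if several files match, take the first in sorted order.
    return matches[0] if matches else None
```

A caller would typically check the result for None before opening the file.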

build_tools.corpus_sqlite_builder.converter.convert_json_to_sqlite(corpus_dir, force=False, batch_size=10000)

Convert an annotated JSON file to a SQLite database.

This function discovers the annotated JSON file in the corpus directory, creates a SQLite database with the appropriate schema, and efficiently converts all syllable data using batched transactions.
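To illustrate what "batched transactions" means in practice, here is a minimal sketch using only the standard-library sqlite3 module. It is not the module's code: the `syllables` table, its two columns, and the helper names `batched` and `insert_in_batches` are all assumptions made for the example. The real schema is not shown in this reference.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable, batch_size: int) -> Iterator[list]:
    """Yield successive lists of at most batch_size items."""
    it = iter(rows)
    while chunk := list(islice(it, batch_size)):
        yield chunk


def insert_in_batches(conn: sqlite3.Connection, rows: Iterable[tuple],
                      batch_size: int = 10000) -> int:
    """Insert rows into a hypothetical table, one transaction per batch."""
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS syllables "
        "(syllable TEXT NOT NULL, frequency INTEGER NOT NULL)"
    )
    total = 0
    for chunk in batched(rows, batch_size):
        # Using the connection as a context manager commits the batch on
        # success and rolls it back if an exception is raised.
        with conn:
            conn.executemany(
                "INSERT INTO syllables (syllable, frequency) VALUES (?, ?)",
                chunk,
            )
        total += len(chunk)
    return total
```

Committing once per batch rather than once per row avoids paying the commit cost for every record, while bounding how much work is lost if a batch fails.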

Parameters:

corpus_dir (pathlib.Path) – Path to corpus directory (e.g., _working/output/20260110_115453_pyphen/)
force (bool) – If True, overwrite an existing database. If False, raise FileExistsError when corpus.db already exists (default: False)
batch_size (int) – Number of records to insert per transaction (default: 10000)

Returns:

Path to the created corpus.db file

Raises:

FileNotFoundError – If corpus_dir does not exist or no annotated JSON file is found
FileExistsError – If corpus.db exists and force=False
ValueError – If JSON structure is invalid
json.JSONDecodeError – If JSON is malformed

Return type:

pathlib.Path
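The interaction between force and the documented exceptions can be sketched as a small pre-flight check. This is an illustration under stated assumptions, not the function's real logic: `prepare_db_path` is a hypothetical helper, the database is assumed to live directly under corpus_dir (this reference does not say where corpus.db is written), and "overwrite" is assumed to mean deleting the old file first.

```python
from pathlib import Path


def prepare_db_path(corpus_dir: Path, force: bool = False) -> Path:
    """Hypothetical helper mirroring the documented force/exception rules."""
    if not corpus_dir.is_dir():
        raise FileNotFoundError(f"Corpus directory not found: {corpus_dir}")
    # Assumption: corpus.db sits directly inside corpus_dir.
    db_path = corpus_dir / "corpus.db"
    if db_path.exists():
        if not force:
            raise FileExistsError(
                f"{db_path} already exists; pass force=True to overwrite"
            )
        # Assumption: overwriting means removing the old database first.
        db_path.unlink()
    return db_path
```

Callers that rebuild a corpus repeatedly would pass force=True; one-off conversions can leave the default and treat FileExistsError as a signal that the database is already built.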