build_tools.corpus_sqlite_builder.converter
JSON to SQLite conversion logic for corpus data.
This module handles converting large annotated JSON files into optimized SQLite databases for efficient querying.
Functions
find_annotated_json(data_dir) | Find the annotated JSON file in a corpus data directory. |
convert_json_to_sqlite(corpus_dir, force=False, batch_size=10000) | Convert an annotated JSON file to a SQLite database. |
Module Contents
- build_tools.corpus_sqlite_builder.converter.find_annotated_json(data_dir)[source]
Find the annotated JSON file in a corpus data directory.
- Parameters:
data_dir (pathlib.Path) – Path to the data directory (e.g., _working/output/…/data/)
- Returns:
Path to the annotated JSON file, or None if not found
- Return type:
pathlib.Path | None
Looks for files matching the pattern *_syllables_annotated.json. Supports both pyphen and nltk prefixes.
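The lookup described above can be approximated with a glob over the data directory. This is a minimal sketch, not the module's actual implementation: the function name find_annotated_json_sketch is hypothetical, and taking the first match in sorted order when several files match is an assumption, since the documentation does not say how ties are resolved.

```python
from pathlib import Path
from typing import Optional


def find_annotated_json_sketch(data_dir: Path) -> Optional[Path]:
    """Hypothetical stand-in for find_annotated_json (not the module's code)."""
    # A missing directory is treated the same as "no file found".
    if not data_dir.is_dir():
        return None
    # The wildcard covers any extractor prefix, e.g. pyphen_ or nltk_.
    # sorted() makes the choice deterministic if several files match
    # (assumption: the real function may pick differently).
    matches = sorted(data_dir.glob("*_syllables_annotated.json"))
    return matches[0] if matches else None
```

Callers need to handle the None case explicitly, for example by raising FileNotFoundError themselves.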
- build_tools.corpus_sqlite_builder.converter.convert_json_to_sqlite(corpus_dir, force=False, batch_size=10000)[source]
Convert an annotated JSON file to a SQLite database.
This function discovers the annotated JSON file in the corpus directory, creates a SQLite database with the appropriate schema, and efficiently converts all syllable data using batched transactions.
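To illustrate what batched transactions mean here, the sketch below inserts rows in groups of batch_size and commits once per group, using only the standard sqlite3 module. It is an assumption-laden illustration rather than the module's code: the syllables table, its two columns, and the helper names batched and insert_in_batches are all hypothetical, and the real schema is not documented on this page.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # hypothetical record shape: (syllable, frequency)


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(
    conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int = 10000
) -> int:
    """Insert rows with one transaction per batch; return the row count."""
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS syllables (syllable TEXT, frequency INTEGER)"
    )
    total = 0
    for batch in batched(rows, batch_size):
        # The connection context manager commits on success and
        # rolls back if the block raises.
        with conn:
            conn.executemany("INSERT INTO syllables VALUES (?, ?)", batch)
        total += len(batch)
    return total
```

Committing once per batch rather than once per row avoids per-row commit overhead, while keeping each transaction bounded in size for very large inputs.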
- Parameters:
corpus_dir (pathlib.Path) – Path to corpus directory (e.g., _working/output/20260110_115453_pyphen/)
force (bool) – If True, overwrite an existing database. If False, raise FileExistsError if corpus.db already exists.
batch_size (int) – Number of records to insert per transaction (default: 10000)
- Returns:
Path to the created corpus.db file
- Raises:
FileNotFoundError – If corpus_dir doesn’t exist or no annotated JSON file is found
FileExistsError – If corpus.db exists and force=False
ValueError – If JSON structure is invalid
json.JSONDecodeError – If JSON is malformed
- Return type: