build_tools.corpus_sqlite_builder.converter

JSON to SQLite conversion logic for corpus data.

This module converts large annotated JSON files into SQLite databases optimized for efficient querying.

Functions

find_annotated_json(data_dir)

Find the annotated JSON file in a corpus data directory.

convert_json_to_sqlite(corpus_dir[, force, batch_size])

Convert an annotated JSON file to a SQLite database.

Module Contents

build_tools.corpus_sqlite_builder.converter.find_annotated_json(data_dir)

Find the annotated JSON file in a corpus data directory.

Parameters:

data_dir (pathlib.Path) – Path to the data directory (e.g., _working/output/…/data/)

Returns:

Path to the annotated JSON file, or None if not found

Return type:

pathlib.Path | None

Looks for files matching the pattern *_syllables_annotated.json. Supports both pyphen and nltk prefixes.
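
The lookup described above can be approximated with a glob over the data directory. This is a minimal sketch, not the module's actual implementation: the helper name find_annotated_json_sketch is hypothetical, and returning the first match in sorted order when several files match is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_annotated_json_sketch(data_dir: Path) -> Optional[Path]:
    """Hypothetical stand-in for find_annotated_json (illustration only)."""
    # Match any prefix, e.g. pyphen_syllables_annotated.json
    # or nltk_syllables_annotated.json.
    matches = sorted(data_dir.glob("*_syllables_annotated.json"))
    # Assumption: if several files match, take the first in sorted order.
    return matches[0] if matches else None
```

Returning None rather than raising keeps the "not found" case explicit for the caller, which matches the documented return type pathlib.Path | None.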

build_tools.corpus_sqlite_builder.converter.convert_json_to_sqlite(corpus_dir, force=False, batch_size=10000)

Convert an annotated JSON file to a SQLite database.

This function discovers the annotated JSON file in the corpus directory, creates a SQLite database with the appropriate schema, and efficiently converts all syllable data using batched transactions.

Parameters:
  • corpus_dir (pathlib.Path) – Path to corpus directory (e.g., _working/output/20260110_115453_pyphen/)

  • force (bool) – If True, overwrite an existing database. If False, raise an error if the database already exists.

  • batch_size (int) – Number of records to insert per transaction (default: 10000)

Returns:

Path to the created corpus.db file

Raises:

An error if the database already exists and force is False

Return type:

pathlib.Path
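
The batched-transaction approach and the force flag can be illustrated with the standard library alone. The sketch below makes several assumptions, since none of these details are stated on this page: the function name convert_sketch, the syllables table schema, the JSON layout (a list of objects with "syllable" and "frequency" keys), and the use of FileExistsError are all hypothetical.

```python
import json
import sqlite3
from itertools import islice
from pathlib import Path


def convert_sketch(json_path: Path, db_path: Path, force: bool = False,
                   batch_size: int = 10000) -> Path:
    """Illustrative only: schema, JSON layout, and exception type are assumptions."""
    if db_path.exists():
        if not force:
            # Assumption: the real function's exception type is not documented here.
            raise FileExistsError(f"{db_path} already exists; pass force=True to overwrite")
        db_path.unlink()

    # Assumption: the file holds a list of {"syllable": str, "frequency": int} records.
    records = json.loads(json_path.read_text(encoding="utf-8"))

    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema; the real schema is not shown on this page.
        conn.execute("CREATE TABLE syllables (syllable TEXT PRIMARY KEY, frequency INTEGER)")
        rows = ((r["syllable"], r["frequency"]) for r in records)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            # One transaction per batch: commits on success, rolls back on error.
            with conn:
                conn.executemany("INSERT INTO syllables VALUES (?, ?)", batch)
    finally:
        conn.close()
    return db_path
```

Grouping inserts into one transaction per batch avoids paying a commit for every row, while keeping each transaction bounded in size.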