=========================== Syllable Feature Annotator =========================== .. currentmodule:: build_tools.syllable_feature_annotator Overview -------- .. automodule:: build_tools.syllable_feature_annotator :no-members: Command-Line Interface ---------------------- .. argparse:: :module: build_tools.syllable_feature_annotator.cli :func: create_argument_parser :prog: python -m build_tools.syllable_feature_annotator Output Format ------------- Input/Output Contract ~~~~~~~~~~~~~~~~~~~~~ **Inputs** (from syllable normaliser): - ``pyphen_syllables_unique.txt`` or ``nltk_syllables_unique.txt`` - One canonical syllable per line - ``pyphen_syllables_frequencies.json`` or ``nltk_syllables_frequencies.json`` - ``{"syllable": count}`` mapping **Output** (auto-detected from input paths): - ``/data/pyphen_syllables_annotated.json`` - Pyphen extractor output - ``/data/nltk_syllables_annotated.json`` - NLTK extractor output **Output path auto-detection:** The tool automatically detects the extractor type (pyphen or nltk) from the input file paths and places output in the same run directory. This creates a self-contained workflow where each extraction run contains all its derived data. **Example directory structure after annotation:** .. code-block:: text _working/output/20260110_115601_nltk/ ├── data/ │ └── nltk_syllables_annotated.json ← Auto-detected output ├── meta/ ├── syllables/ ├── nltk_syllables_unique.txt ← Input ├── nltk_syllables_frequencies.json ← Input └── ... Output Structure ~~~~~~~~~~~~~~~~ The annotator produces JSON with this structure: .. code-block:: json [ { "syllable": "kran", "frequency": 7, "features": { "starts_with_vowel": false, "starts_with_cluster": true, "starts_with_heavy_cluster": false, "contains_plosive": true, "contains_fricative": false, "contains_liquid": true, "contains_nasal": true, "short_vowel": true, "long_vowel": false, "ends_with_vowel": false, "ends_with_nasal": true, "ends_with_stop": false } } ] **Feature set:** All 12 features are applied to every syllable: - Onset features (starts_with_vowel, starts_with_cluster, starts_with_heavy_cluster) - Content features (contains_plosive, contains_fricative, contains_liquid, contains_nasal) - Vowel features (short_vowel, long_vowel) - Coda features (ends_with_vowel, ends_with_nasal, ends_with_stop) Integration Guide ----------------- The feature annotator sits between the normaliser and pattern development. It automatically detects the extractor type and places output in the run directory. **Recommended workflow (with auto-detection):** .. code-block:: bash # Step 1: Extract and normalize syllables python -m build_tools.pyphen_syllable_extractor --file input.txt python -m build_tools.pyphen_syllable_normaliser \ --run-dir _working/output/20260110_115453_pyphen/ # Step 2: Annotate with features (output path auto-detected) python -m build_tools.syllable_feature_annotator \ --syllables _working/output/20260110_115453_pyphen/pyphen_syllables_unique.txt \ --frequencies _working/output/20260110_115453_pyphen/pyphen_syllables_frequencies.json # Creates: _working/output/20260110_115453_pyphen/data/pyphen_syllables_annotated.json # Step 3: Use annotated syllables for pattern generation (future) **Alternative: Explicit output path (overrides auto-detection):** .. code-block:: bash python -m build_tools.syllable_feature_annotator \ --syllables path/to/syllables.txt \ --frequencies path/to/frequencies.json \ --output custom/output.json **When to use this tool:** - After syllable normalization is complete - Before developing phonotactic patterns or constraints - To add structural feature metadata to your syllable corpus - For analysis tasks requiring feature-based filtering or grouping **Extractor type detection:** The tool detects the extractor type by examining: 1. Filename prefix (``pyphen_*`` or ``nltk_*``) 2. Parent directory name (``*_pyphen`` or ``*_nltk``) 3. Falls back to ``data/annotated/syllables_annotated.json`` if detection fails Notes ----- **Auto-detection creates self-contained workflows:** The automatic output path detection ensures each extraction run is self-contained with all derived data in one directory. This makes it easy to manage multiple extraction runs and compare results from different extractors (pyphen vs NLTK). **Features are structural observations:** Features are structural observations based on phoneme presence, not linguistic interpretations. This ensures deterministic, language-agnostic detection. **Processing characteristics:** - Fast and deterministic (same input = same output) - All 12 features applied to every syllable (no selective detection) - Designed to integrate seamlessly with syllable normalizer output - Output path auto-detection works with both pyphen and NLTK normalizer outputs **Backward compatibility:** Explicit ``--output`` paths still work and override auto-detection. This ensures compatibility with existing workflows and scripts. **Build-time tool:** This is a build-time tool only - not used during runtime name generation. API Reference ------------- .. automodule:: build_tools.syllable_feature_annotator :members: :undoc-members: :show-inheritance: