build_tools.syllable_feature_annotator
======================================

.. py:module:: build_tools.syllable_feature_annotator

.. autoapi-nested-parse::

   Syllable Feature Annotator - Phonetic Feature Detection

   The syllable feature annotator attaches phonetic features to normalized syllables, creating
   a feature-annotated dataset for downstream pattern generation. It is a **build-time tool only**
   and is not used during runtime name generation.

   The tool sits between the syllable normaliser and pattern generation:

   1. **Input**: Normalized syllables from syllable_normaliser
   2. **Process**: Apply 12 feature detectors to each syllable
   3. **Output**: Feature-annotated syllable dataset

   Design Principles:

   - **Pure observation** - Observes patterns, never interprets or filters
   - **Deterministic** - Same input always produces same output
   - **Feature independence** - No detector depends on another
   - **Language agnostic** - Structural patterns only, no linguistic knowledge
   - **Conservative detection** - Uses simple approximations instead of exhaustive phonetic analysis
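
   The principles above suggest a simple shape: a registry of pure, independent
   predicates, all of which run on every syllable. The following is a minimal
   sketch of that shape, not the package's actual code. ``DETECTORS``, ``annotate``
   and the two toy detectors are hypothetical stand-ins, and the vowel set is an
   assumption.

   ```python
   from typing import Callable, Dict

   # Assumed vowel set; the package's own VOWELS definition may differ.
   VOWELS = frozenset("aeiou")


   def starts_with_vowel(syllable: str) -> bool:
       """Return True when the first character is a vowel."""
       return bool(syllable) and syllable[0] in VOWELS


   def ends_with_vowel(syllable: str) -> bool:
       """Return True when the last character is a vowel."""
       return bool(syllable) and syllable[-1] in VOWELS


   # Hypothetical registry: each entry is a pure function of the syllable
   # alone, so no detector depends on another detector's result.
   DETECTORS: Dict[str, Callable[[str], bool]] = {
       "starts_with_vowel": starts_with_vowel,
       "ends_with_vowel": ends_with_vowel,
   }


   def annotate(syllable: str) -> Dict[str, bool]:
       """Run every registered detector; nothing is filtered or skipped."""
       return {name: detect(syllable) for name, detect in DETECTORS.items()}


   # Determinism: annotating the same syllable twice gives equal results.
   first = annotate("ka")
   second = annotate("ka")
   ```

   Because each detector takes only the syllable as input, detectors can be added
   to or removed from such a registry without affecting the others.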

   Feature Set (12 features):

   **Onset Features (3)**:
       - starts_with_vowel: Open onset (vowel-initial)
       - starts_with_cluster: Initial consonant cluster (2+ consonants)
       - starts_with_heavy_cluster: Heavy initial cluster (3+ consonants)

   **Internal Features (4)**:
       - contains_plosive: Contains plosive consonant (p, t, k, b, d, g)
       - contains_fricative: Contains fricative consonant (f, s, z, v, h)
       - contains_liquid: Contains liquid consonant (l, r, w)
       - contains_nasal: Contains nasal consonant (m, n)

   **Nucleus Features (2)**:
       - short_vowel: Exactly one vowel (weight proxy)
       - long_vowel: Two or more vowels (weight proxy)

   **Coda Features (3)**:
       - ends_with_vowel: Open syllable (vowel-final)
       - ends_with_nasal: Nasal coda
       - ends_with_stop: Stop coda
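
   Several of the features listed above reduce to counting characters or testing
   set membership. The sketch below illustrates a few of them under stated
   assumptions: the vowel set ``aeiou`` is assumed, the onset is taken to be every
   character before the first vowel, and ``onset_length`` and ``vowel_count`` are
   hypothetical helpers. The real detectors in ``feature_rules.py`` may be
   implemented differently.

   ```python
   VOWELS = frozenset("aeiou")      # assumption, not taken from the package
   PLOSIVES = frozenset("ptkbdg")   # as listed for contains_plosive
   NASALS = frozenset("mn")         # as listed for contains_nasal


   def onset_length(syllable: str) -> int:
       """Count the characters that precede the first vowel."""
       count = 0
       for ch in syllable:
           if ch in VOWELS:
               break
           count += 1
       return count


   def vowel_count(syllable: str) -> int:
       """Count vowel characters anywhere in the syllable."""
       return sum(ch in VOWELS for ch in syllable)


   def starts_with_cluster(syllable: str) -> bool:
       return onset_length(syllable) >= 2       # 2+ initial consonants


   def starts_with_heavy_cluster(syllable: str) -> bool:
       return onset_length(syllable) >= 3       # 3+ initial consonants


   def short_vowel(syllable: str) -> bool:
       return vowel_count(syllable) == 1        # exactly one vowel


   def long_vowel(syllable: str) -> bool:
       return vowel_count(syllable) >= 2        # two or more vowels


   def contains_plosive(syllable: str) -> bool:
       return any(ch in PLOSIVES for ch in syllable)


   def ends_with_nasal(syllable: str) -> bool:
       return syllable[-1:] in NASALS           # empty string yields False
   ```

   With these definitions, ``"kran"`` has a two-consonant onset (cluster, but not
   heavy), one vowel and a nasal coda, while ``"spla"`` has a three-consonant
   onset and so satisfies both cluster features.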

   Quick Start
   -----------
   Command-line usage::

       $ python -m build_tools.syllable_feature_annotator \
           --syllables data/normalized/syllables_unique.txt \
           --frequencies data/normalized/syllables_frequencies.json \
           --output data/annotated/syllables_annotated.json \
           --verbose

   Programmatic usage::

       >>> from pathlib import Path
       >>> from build_tools.syllable_feature_annotator import run_annotation_pipeline
       >>> result = run_annotation_pipeline(
       ...     syllables_path=Path("data/normalized/syllables_unique.txt"),
       ...     frequencies_path=Path("data/normalized/syllables_frequencies.json"),
       ...     output_path=Path("data/annotated/syllables_annotated.json"),
       ...     verbose=True
       ... )
       >>> print(f"Annotated {result.statistics.syllable_count} syllables")

   Annotate syllables in code::

       >>> from build_tools.syllable_feature_annotator import (
       ...     annotate_corpus,
       ...     FEATURE_DETECTORS
       ... )
       >>> syllables = ["ka", "kran", "spla"]
       >>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
       >>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
       >>> for record in result.annotated_syllables:
       ...     print(f"{record.syllable}: {sum(record.features.values())} features")

   Public API
   ----------
   This package exports the following components for programmatic use:

   **Pipeline Functions**:
       - run_annotation_pipeline: Complete end-to-end pipeline with I/O
       - annotate_corpus: Annotate syllables without I/O
       - annotate_syllable: Annotate single syllable

   **Data Models**:
       - AnnotatedSyllable: Single annotated syllable record
       - AnnotationStatistics: Processing statistics
       - AnnotationResult: Complete result with syllables and stats

   **Feature Detection**:
       - FEATURE_DETECTORS: Registry of all 12 feature detector functions
       - Individual detector functions (starts_with_vowel, contains_plosive, etc.)

   **Phoneme Sets**:
       - VOWELS, PLOSIVES, FRICATIVES, NASALS, LIQUIDS, STOPS

   **File I/O**:
       - load_syllables: Load syllables from text file
       - load_frequencies: Load frequencies from JSON
       - save_annotated_syllables: Save annotated output to JSON
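
   The File I/O helpers above read a plain-text syllable list and a JSON
   frequency map. As a rough sketch of what such inputs might look like,
   the example below writes and reads two temporary files. It assumes one
   syllable per line and a JSON object mapping each syllable to its count.
   ``read_syllables`` and ``read_frequencies`` are hypothetical stand-ins and do
   not reproduce the signatures of ``load_syllables`` or ``load_frequencies``.

   ```python
   import json
   import tempfile
   from pathlib import Path
   from typing import Dict, List


   def read_syllables(path: Path) -> List[str]:
       """Read one syllable per line, skipping blank lines (assumed format)."""
       lines = path.read_text(encoding="utf-8").splitlines()
       return [line.strip() for line in lines if line.strip()]


   def read_frequencies(path: Path) -> Dict[str, int]:
       """Read a JSON object mapping syllable -> count (assumed format)."""
       return json.loads(path.read_text(encoding="utf-8"))


   with tempfile.TemporaryDirectory() as tmp:
       syllables_path = Path(tmp) / "syllables_unique.txt"
       frequencies_path = Path(tmp) / "syllables_frequencies.json"

       # Sample values reuse the syllables from the Quick Start example.
       syllables_path.write_text("ka\nkran\nspla\n", encoding="utf-8")
       frequencies_path.write_text(
           json.dumps({"ka": 187, "kran": 7, "spla": 2}), encoding="utf-8"
       )

       syllables = read_syllables(syllables_path)
       frequencies = read_frequencies(frequencies_path)

   # Pair each syllable with its frequency, defaulting to 0 when missing.
   pairs = [(s, frequencies.get(s, 0)) for s in syllables]
   ```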

   Architecture
   ------------
   The package is organized into focused modules:

   - **phoneme_sets.py**: Character class definitions (VOWELS, PLOSIVES, etc.)
   - **feature_rules.py**: Pure feature detector functions (12 detectors)
   - **annotator.py**: Core orchestration and data models
   - **file_io.py**: Simple I/O helpers
   - **cli.py**: Command-line interface with argument parsing
   - **__main__.py**: Module entry point for ``python -m``

   Integration with Pipeline
   -------------------------
   This tool is designed to run directly after the syllable normaliser::

       # Step 1: Normalize syllables
       $ python -m build_tools.pyphen_syllable_normaliser \
           --source data/corpus/ \
           --output data/normalized/

       # Step 2: Annotate with features
       $ python -m build_tools.syllable_feature_annotator \
           --syllables data/normalized/syllables_unique.txt \
           --frequencies data/normalized/syllables_frequencies.json \
           --output data/annotated/syllables_annotated.json

       # Step 3: Use annotated data for pattern generation (future)

   Output Format
   -------------
   The annotator produces JSON with this structure::

       [
         {
           "syllable": "kran",
           "frequency": 7,
           "features": {
             "starts_with_vowel": false,
             "starts_with_cluster": true,
             "starts_with_heavy_cluster": false,
             "contains_plosive": true,
             "contains_fricative": false,
             "contains_liquid": true,
             "contains_nasal": true,
             "short_vowel": true,
             "long_vowel": false,
             "ends_with_vowel": false,
             "ends_with_nasal": true,
             "ends_with_stop": false
           }
         }
       ]
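
   Because the output is a JSON array of flat records, downstream code can
   consume it with the standard library alone. The sketch below parses the
   ``kran`` record shown above from an inline string and collects the names of
   its active features. A real consumer would read the file written by the
   annotator instead, and the variable names here are illustrative only.

   ```python
   import json

   # The single record from the Output Format example, inlined for the demo.
   sample = """
   [
     {
       "syllable": "kran",
       "frequency": 7,
       "features": {
         "starts_with_vowel": false,
         "starts_with_cluster": true,
         "starts_with_heavy_cluster": false,
         "contains_plosive": true,
         "contains_fricative": false,
         "contains_liquid": true,
         "contains_nasal": true,
         "short_vowel": true,
         "long_vowel": false,
         "ends_with_vowel": false,
         "ends_with_nasal": true,
         "ends_with_stop": false
       }
     }
   ]
   """

   records = json.loads(sample)

   # Map each syllable to the sorted names of the features that are true.
   active = {
       record["syllable"]: sorted(
           name for name, value in record["features"].items() if value
       )
       for record in records
   }

   # Select syllables by a single feature, here a nasal coda.
   nasal_codas = [r["syllable"] for r in records if r["features"]["ends_with_nasal"]]
   ```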

   Notes
   -----
   - This is a build-time tool only (not used during runtime name generation)
   - The annotator is deterministic (same input → same output)
   - Features are structural observations, not linguistic interpretations
   - All 12 features are applied to every syllable (no selective detection)
   - Processing is fast: typically <1 second for 1,000-10,000 syllables

   See Also
   --------
   - CLAUDE.md: Complete project documentation
   - syllable_normaliser: Upstream tool that produces input data
   - feature_rules.py: Detailed documentation of each feature detector


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/build_tools/syllable_feature_annotator/annotator/index
   /autoapi/build_tools/syllable_feature_annotator/cli/index
   /autoapi/build_tools/syllable_feature_annotator/feature_rules/index
   /autoapi/build_tools/syllable_feature_annotator/file_io/index
   /autoapi/build_tools/syllable_feature_annotator/phoneme_sets/index