build_tools.syllable_feature_annotator
======================================

.. py:module:: build_tools.syllable_feature_annotator

.. autoapi-nested-parse::

   Syllable Feature Annotator - Phonetic Feature Detection

   The syllable feature annotator attaches phonetic features to normalized
   syllables, creating a feature-annotated dataset for downstream pattern
   generation. This is a **build-time tool only** - not used during runtime
   name generation.

   The tool sits between the syllable normaliser and pattern development:

   1. **Input**: Normalized syllables from syllable_normaliser
   2. **Process**: Apply 12 feature detectors to each syllable
   3. **Output**: Feature-annotated syllable dataset

   Design Principles:

   - **Pure observation** - Observes patterns, never interprets or filters
   - **Deterministic** - Same input always produces same output
   - **Feature independence** - No detector depends on another
   - **Language agnostic** - Structural patterns only, no linguistic knowledge
   - **Conservative detection** - Approximate patterns without overthinking

   Feature Set (12 features):

   **Onset Features (3)**:

   - starts_with_vowel: Open onset (vowel-initial)
   - starts_with_cluster: Initial consonant cluster (2+ consonants)
   - starts_with_heavy_cluster: Heavy initial cluster (3+ consonants)

   **Internal Features (4)**:

   - contains_plosive: Contains plosive consonant (p, t, k, b, d, g)
   - contains_fricative: Contains fricative consonant (f, s, z, v, h)
   - contains_liquid: Contains liquid consonant (l, r, w)
   - contains_nasal: Contains nasal consonant (m, n)

   **Nucleus Features (2)**:

   - short_vowel: Exactly one vowel (weight proxy)
   - long_vowel: Two or more vowels (weight proxy)

   **Coda Features (3)**:

   - ends_with_vowel: Open syllable (vowel-final)
   - ends_with_nasal: Nasal coda
   - ends_with_stop: Stop coda

   Quick Start
   -----------

   Command-line usage::

       $ python -m build_tools.syllable_feature_annotator \
           --syllables data/normalized/syllables_unique.txt \
           --frequencies data/normalized/syllables_frequencies.json \
           --output data/annotated/syllables_annotated.json \
           --verbose

   Programmatic usage::

       >>> from pathlib import Path
       >>> from build_tools.syllable_feature_annotator import run_annotation_pipeline
       >>> result = run_annotation_pipeline(
       ...     syllables_path=Path("data/normalized/syllables_unique.txt"),
       ...     frequencies_path=Path("data/normalized/syllables_frequencies.json"),
       ...     output_path=Path("data/annotated/syllables_annotated.json"),
       ...     verbose=True
       ... )
       >>> print(f"Annotated {result.statistics.syllable_count} syllables")

   Annotate syllables in code::

       >>> from build_tools.syllable_feature_annotator import (
       ...     annotate_corpus,
       ...     FEATURE_DETECTORS
       ... )
       >>> syllables = ["ka", "kran", "spla"]
       >>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
       >>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
       >>> for record in result.annotated_syllables:
       ...     print(f"{record.syllable}: {sum(record.features.values())} features")

   Public API
   ----------

   This package exports the following components for programmatic use:

   **Pipeline Functions**:

   - run_annotation_pipeline: Complete end-to-end pipeline with I/O
   - annotate_corpus: Annotate syllables without I/O
   - annotate_syllable: Annotate single syllable

   **Data Models**:

   - AnnotatedSyllable: Single annotated syllable record
   - AnnotationStatistics: Processing statistics
   - AnnotationResult: Complete result with syllables and stats

   **Feature Detection**:

   - FEATURE_DETECTORS: Registry of all 12 feature detector functions
   - Individual detector functions (starts_with_vowel, contains_plosive, etc.)

   **Phoneme Sets**:

   - VOWELS, PLOSIVES, FRICATIVES, NASALS, LIQUIDS, STOPS

   **File I/O**:

   - load_syllables: Load syllables from text file
   - load_frequencies: Load frequencies from JSON
   - save_annotated_syllables: Save annotated output to JSON

   Architecture
   ------------

   The package is organized into focused modules:

   - **phoneme_sets.py**: Character class definitions (VOWELS, PLOSIVES, etc.)
   - **feature_rules.py**: Pure feature detector functions (12 detectors)
   - **annotator.py**: Core orchestration and data models
   - **file_io.py**: Simple I/O helpers
   - **cli.py**: Command-line interface with argument parsing
   - **__main__.py**: Module entry point for python -m

   Integration with Pipeline
   -------------------------

   This tool is designed to work with the syllable normalizer::

       # Step 1: Normalize syllables
       $ python -m build_tools.pyphen_syllable_normaliser \
           --source data/corpus/ \
           --output data/normalized/

       # Step 2: Annotate with features
       $ python -m build_tools.syllable_feature_annotator \
           --syllables data/normalized/syllables_unique.txt \
           --frequencies data/normalized/syllables_frequencies.json \
           --output data/annotated/syllables_annotated.json

       # Step 3: Use annotated data for pattern generation (future)

   Output Format
   -------------

   The annotator produces JSON with this structure::

       [
         {
           "syllable": "kran",
           "frequency": 7,
           "features": {
             "starts_with_vowel": false,
             "starts_with_cluster": true,
             "starts_with_heavy_cluster": false,
             "contains_plosive": true,
             "contains_fricative": false,
             "contains_liquid": true,
             "contains_nasal": true,
             "short_vowel": true,
             "long_vowel": false,
             "ends_with_vowel": false,
             "ends_with_nasal": true,
             "ends_with_stop": false
           }
         }
       ]

   Notes
   -----

   - This is a build-time tool only (not used during runtime name generation)
   - The annotator is deterministic (same input → same output)
   - Features are structural observations, not linguistic interpretations
   - All 12 features are applied to every syllable (no selective detection)
   - Processing is fast: typically <1 second for 1,000-10,000 syllables

   See Also
   --------

   - CLAUDE.md: Complete project documentation
   - syllable_normaliser: Upstream tool that produces input data
   - feature_rules.py: Detailed documentation of each feature detector


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/build_tools/syllable_feature_annotator/annotator/index
   /autoapi/build_tools/syllable_feature_annotator/cli/index
   /autoapi/build_tools/syllable_feature_annotator/feature_rules/index
   /autoapi/build_tools/syllable_feature_annotator/file_io/index
   /autoapi/build_tools/syllable_feature_annotator/phoneme_sets/index
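
The 12-feature set described in the module overview can be illustrated with a small, standalone sketch. It is not the package's implementation: ``SKETCH_DETECTORS`` and ``sketch_annotate`` are hypothetical names used only here, the consonant classes are copied from the letters listed in the feature set, and two details are assumptions because the overview does not spell them out: the vowel set is taken to be ``a, e, i, o, u``, and a "stop coda" is taken to mean a final plosive letter. Under those assumptions the sketch reproduces the ``kran`` record shown under Output Format.

```python
from typing import Callable, Dict

# Consonant classes as listed in the feature set of the module overview.
PLOSIVES = frozenset("ptkbdg")
FRICATIVES = frozenset("fszvh")
LIQUIDS = frozenset("lrw")
NASALS = frozenset("mn")
# Assumptions for this sketch only (not stated in the overview):
VOWELS = frozenset("aeiou")   # assumed vowel inventory
STOPS = PLOSIVES              # assumed: a stop coda is a final plosive letter


def _onset_length(s: str) -> int:
    """Count the non-vowel characters before the first vowel."""
    n = 0
    for ch in s:
        if ch in VOWELS:
            break
        n += 1
    return n


def _vowel_count(s: str) -> int:
    """Count vowel characters anywhere in the syllable."""
    return sum(ch in VOWELS for ch in s)


# Hypothetical registry: each detector is a pure function of the syllable
# string and never calls another detector (feature independence).
SKETCH_DETECTORS: Dict[str, Callable[[str], bool]] = {
    # Onset features
    "starts_with_vowel": lambda s: bool(s) and s[0] in VOWELS,
    "starts_with_cluster": lambda s: _onset_length(s) >= 2,
    "starts_with_heavy_cluster": lambda s: _onset_length(s) >= 3,
    # Internal features
    "contains_plosive": lambda s: any(ch in PLOSIVES for ch in s),
    "contains_fricative": lambda s: any(ch in FRICATIVES for ch in s),
    "contains_liquid": lambda s: any(ch in LIQUIDS for ch in s),
    "contains_nasal": lambda s: any(ch in NASALS for ch in s),
    # Nucleus features
    "short_vowel": lambda s: _vowel_count(s) == 1,
    "long_vowel": lambda s: _vowel_count(s) >= 2,
    # Coda features
    "ends_with_vowel": lambda s: bool(s) and s[-1] in VOWELS,
    "ends_with_nasal": lambda s: bool(s) and s[-1] in NASALS,
    "ends_with_stop": lambda s: bool(s) and s[-1] in STOPS,
}


def sketch_annotate(syllable: str, frequency: int) -> dict:
    """Apply every detector to one syllable and build an output-style record."""
    return {
        "syllable": syllable,
        "frequency": frequency,
        "features": {name: fn(syllable) for name, fn in SKETCH_DETECTORS.items()},
    }


if __name__ == "__main__":
    for syl, freq in {"ka": 187, "kran": 7, "spla": 2}.items():
        record = sketch_annotate(syl, freq)
        active = [name for name, on in record["features"].items() if on]
        print(f"{syl}: {active}")
```

Keeping the detectors in a name-to-function mapping means every syllable receives all 12 features in a fixed order, which matches the determinism and "no selective detection" notes in the overview. Note that, with the "2+" and "3+" thresholds read literally, a heavy cluster such as ``spl`` also counts as a cluster.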