build_tools.syllable_feature_annotator
Syllable Feature Annotator - Phonetic Feature Detection
The syllable feature annotator attaches phonetic features to normalized syllables, creating a feature-annotated dataset for downstream pattern generation. This is a build-time tool only - not used during runtime name generation.
The tool sits between the syllable normaliser and pattern development:
Input: Normalized syllables from syllable_normaliser
Process: Apply 12 feature detectors to each syllable
Output: Feature-annotated syllable dataset
Design Principles:
Pure observation - Observes patterns, never interprets or filters
Deterministic - Same input always produces same output
Feature independence - No detector depends on another
Language agnostic - Structural patterns only, no linguistic knowledge
Conservative detection - Simple character-level approximations, not full phonetic analysis
Feature Set (12 features):
- Onset Features (3):
starts_with_vowel: Open onset (vowel-initial)
starts_with_cluster: Initial consonant cluster (2+ consonants)
starts_with_heavy_cluster: Heavy initial cluster (3+ consonants)
- Internal Features (4):
contains_plosive: Contains plosive consonant (p, t, k, b, d, g)
contains_fricative: Contains fricative consonant (f, s, z, v, h)
contains_liquid: Contains liquid consonant (l, r, w)
contains_nasal: Contains nasal consonant (m, n)
- Nucleus Features (2):
short_vowel: Exactly one vowel (weight proxy)
long_vowel: Two or more vowels (weight proxy)
- Coda Features (3):
ends_with_vowel: Open syllable (vowel-final)
ends_with_nasal: Nasal coda
ends_with_stop: Stop coda
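The feature set above can be sketched as plain character-class checks. This is a minimal illustration, not the package's own code: the consonant sets are copied from the list above, while the vowel set (a, e, i, o, u), the stop set (assumed equal to the plosives), and the helper names `leading_consonants` and `sketch_features` are assumptions made for this example.

```python
# Assumed character classes. Consonant sets follow the feature list;
# VOWELS and STOPS are guesses made for this sketch only.
VOWELS = frozenset("aeiou")
PLOSIVES = frozenset("ptkbdg")
FRICATIVES = frozenset("fszvh")
LIQUIDS = frozenset("lrw")
NASALS = frozenset("mn")
STOPS = PLOSIVES  # assumption: "stop coda" uses the plosive set


def leading_consonants(syllable: str) -> int:
    """Count characters before the first vowel (hypothetical helper)."""
    count = 0
    for ch in syllable:
        if ch in VOWELS:
            break
        count += 1
    return count


def sketch_features(syllable: str) -> dict:
    """Compute all 12 features; each value depends only on the syllable."""
    first, last = syllable[:1], syllable[-1:]
    onset = leading_consonants(syllable)
    vowels = sum(ch in VOWELS for ch in syllable)
    return {
        "starts_with_vowel": first in VOWELS,
        "starts_with_cluster": onset >= 2,
        "starts_with_heavy_cluster": onset >= 3,
        "contains_plosive": any(ch in PLOSIVES for ch in syllable),
        "contains_fricative": any(ch in FRICATIVES for ch in syllable),
        "contains_liquid": any(ch in LIQUIDS for ch in syllable),
        "contains_nasal": any(ch in NASALS for ch in syllable),
        "short_vowel": vowels == 1,
        "long_vowel": vowels >= 2,
        "ends_with_vowel": last in VOWELS,
        "ends_with_nasal": last in NASALS,
        "ends_with_stop": last in STOPS,
    }
```

Under these assumptions, a syllable with three leading consonants sets both starts_with_cluster and starts_with_heavy_cluster, since the two thresholds (2+ and 3+) overlap.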
Quick Start
Command-line usage:
$ python -m build_tools.syllable_feature_annotator \
--syllables data/normalized/syllables_unique.txt \
--frequencies data/normalized/syllables_frequencies.json \
--output data/annotated/syllables_annotated.json \
--verbose
Programmatic usage:
>>> from pathlib import Path
>>> from build_tools.syllable_feature_annotator import run_annotation_pipeline
>>> result = run_annotation_pipeline(
... syllables_path=Path("data/normalized/syllables_unique.txt"),
... frequencies_path=Path("data/normalized/syllables_frequencies.json"),
... output_path=Path("data/annotated/syllables_annotated.json"),
... verbose=True
... )
>>> print(f"Annotated {result.statistics.syllable_count} syllables")
Annotate syllables in code:
>>> from build_tools.syllable_feature_annotator import (
... annotate_corpus,
... FEATURE_DETECTORS
... )
>>> syllables = ["ka", "kran", "spla"]
>>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
>>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
>>> for record in result.annotated_syllables:
... print(f"{record.syllable}: {sum(record.features.values())} features")
Public API
This package exports the following components for programmatic use:
- Pipeline Functions:
run_annotation_pipeline: Complete end-to-end pipeline with I/O
annotate_corpus: Annotate syllables without I/O
annotate_syllable: Annotate single syllable
- Data Models:
AnnotatedSyllable: Single annotated syllable record
AnnotationStatistics: Processing statistics
AnnotationResult: Complete result with syllables and stats
- Feature Detection:
FEATURE_DETECTORS: Registry of all 12 feature detector functions
Individual detector functions (starts_with_vowel, contains_plosive, etc.)
- Phoneme Sets:
VOWELS, PLOSIVES, FRICATIVES, NASALS, LIQUIDS, STOPS
- File I/O:
load_syllables: Load syllables from text file
load_frequencies: Load frequencies from JSON
save_annotated_syllables: Save annotated output to JSON
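A detector registry of the kind listed above is commonly a mapping from feature name to a function of one syllable. The sketch below shows that pattern in isolation; `SketchRecord`, `SKETCH_DETECTORS`, `annotate_one`, and `annotate_many` are hypothetical stand-ins, not the package's real classes or signatures, and the default frequency of 0 for a missing entry is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Detector = Callable[[str], bool]

_VOWELS = frozenset("aeiou")  # assumed vowel set
_NASALS = frozenset("mn")     # from the feature list


@dataclass(frozen=True)
class SketchRecord:
    """Hypothetical stand-in for an annotated syllable record."""
    syllable: str
    frequency: int
    features: Dict[str, bool]


# A reduced registry with three of the twelve features, for illustration.
SKETCH_DETECTORS: Dict[str, Detector] = {
    "starts_with_vowel": lambda s: s[:1] in _VOWELS,
    "ends_with_vowel": lambda s: s[-1:] in _VOWELS,
    "ends_with_nasal": lambda s: s[-1:] in _NASALS,
}


def annotate_one(syllable: str, frequency: int,
                 detectors: Dict[str, Detector]) -> SketchRecord:
    # Every detector runs on every syllable; none reads another's result.
    features = {name: detect(syllable) for name, detect in detectors.items()}
    return SketchRecord(syllable, frequency, features)


def annotate_many(syllables: List[str], frequencies: Dict[str, int],
                  detectors: Dict[str, Detector]) -> List[SketchRecord]:
    # Assumption: a syllable missing from the frequency map gets 0.
    return [annotate_one(s, frequencies.get(s, 0), detectors)
            for s in syllables]
```

Because each detector takes only the syllable as input, adding or removing an entry in the registry does not change any other feature's value.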
Architecture
The package is organized into focused modules:
phoneme_sets.py: Character class definitions (VOWELS, PLOSIVES, etc.)
feature_rules.py: Pure feature detector functions (12 detectors)
annotator.py: Core orchestration and data models
file_io.py: Simple I/O helpers
cli.py: Command-line interface with argument parsing
__main__.py: Module entry point for python -m
Integration with Pipeline
This tool is designed to run directly after the syllable normaliser, consuming the files it writes:
# Step 1: Normalize syllables
$ python -m build_tools.pyphen_syllable_normaliser \
--source data/corpus/ \
--output data/normalized/
# Step 2: Annotate with features
$ python -m build_tools.syllable_feature_annotator \
--syllables data/normalized/syllables_unique.txt \
--frequencies data/normalized/syllables_frequencies.json \
--output data/annotated/syllables_annotated.json
# Step 3: Use annotated data for pattern generation (future)
Output Format
The annotator produces JSON with this structure:
[
{
"syllable": "kran",
"frequency": 7,
"features": {
"starts_with_vowel": false,
"starts_with_cluster": true,
"starts_with_heavy_cluster": false,
"contains_plosive": true,
"contains_fricative": false,
"contains_liquid": true,
"contains_nasal": true,
"short_vowel": true,
"long_vowel": false,
"ends_with_vowel": false,
"ends_with_nasal": true,
"ends_with_stop": false
}
}
]
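A downstream step could read this structure with the standard json module and select syllables by feature. The sketch below is illustrative only: the inline sample reuses the "kran" record above (trimmed to a few features) plus a made-up "ka" record, and `syllables_with` is a hypothetical helper rather than part of the package.

```python
import json

# Inline sample in the documented shape; a real run would read the
# annotated JSON file instead. The "ka" record is illustrative.
SAMPLE = """
[
  {"syllable": "kran", "frequency": 7,
   "features": {"starts_with_cluster": true, "contains_liquid": true,
                "ends_with_nasal": true, "ends_with_vowel": false}},
  {"syllable": "ka", "frequency": 187,
   "features": {"starts_with_cluster": false, "contains_liquid": false,
                "ends_with_nasal": false, "ends_with_vowel": true}}
]
"""


def syllables_with(records, *required):
    """Return syllables whose listed features are all true."""
    return [
        rec["syllable"]
        for rec in records
        if all(rec["features"].get(name, False) for name in required)
    ]


records = json.loads(SAMPLE)
clustered_nasal = syllables_with(records, "starts_with_cluster", "ends_with_nasal")
open_syllables = syllables_with(records, "ends_with_vowel")
```

Treating a missing feature key as false keeps the filter tolerant of trimmed records such as the sample here; full output carries all 12 keys for every syllable.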
Notes
This is a build-time tool only (not used during runtime name generation)
The annotator is deterministic (same input → same output)
Features are structural observations, not linguistic interpretations
All 12 features are applied to every syllable (no selective detection)
Processing is fast: typically <1 second for 1,000-10,000 syllables
See Also
CLAUDE.md: Complete project documentation
syllable_normaliser: Upstream tool that produces input data
feature_rules.py: Detailed documentation of each feature detector