"""
Syllable Feature Annotator - Phonetic Feature Detection
The syllable feature annotator attaches phonetic features to normalized syllables, creating
a feature-annotated dataset for downstream pattern generation. This is a **build-time tool only** -
not used during runtime name generation.
The tool sits between the syllable normaliser and pattern development:
1. **Input**: Normalized syllables from syllable_normaliser
2. **Process**: Apply 12 feature detectors to each syllable
3. **Output**: Feature-annotated syllable dataset
Design Principles:
- **Pure observation** - Observes patterns, never interprets or filters
- **Deterministic** - Same input always produces same output
- **Feature independence** - No detector depends on another
- **Language agnostic** - Structural patterns only, no linguistic knowledge
- **Conservative detection** - Simple approximations of each pattern; ambiguous cases are not over-analysed
Feature Set (12 features):
**Onset Features (3)**:
- starts_with_vowel: Open onset (vowel-initial)
- starts_with_cluster: Initial consonant cluster (2+ consonants)
- starts_with_heavy_cluster: Heavy initial cluster (3+ consonants)
**Internal Features (4)**:
- contains_plosive: Contains plosive consonant (p, t, k, b, d, g)
- contains_fricative: Contains fricative consonant (f, s, z, v, h)
- contains_liquid: Contains liquid consonant (l, r, w)
- contains_nasal: Contains nasal consonant (m, n)
**Nucleus Features (2)**:
- short_vowel: Exactly one vowel (weight proxy)
- long_vowel: Two or more vowels (weight proxy)
**Coda Features (3)**:
- ends_with_vowel: Open syllable (vowel-final)
- ends_with_nasal: Nasal coda
- ends_with_stop: Stop coda
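A minimal sketch of how detectors of this kind could be written, assuming
plain character-set membership tests. The vowel set, the "non-vowel means
consonant" rule, and every sketch_-prefixed name below are illustrative
assumptions, not the package's actual definitions (those live in
phoneme_sets.py and feature_rules.py):

```python
# Illustrative sketch only: the package's real definitions live in
# phoneme_sets.py and feature_rules.py and may differ from these.
VOWELS = frozenset("aeiou")     # assumption: the vowel set is not listed above
PLOSIVES = frozenset("ptkbdg")  # from the contains_plosive description
NASALS = frozenset("mn")        # from the contains_nasal description


def sketch_starts_with_cluster(syllable: str) -> bool:
    # Two or more leading characters that are not vowels.
    return len(syllable) >= 2 and all(ch not in VOWELS for ch in syllable[:2])


def sketch_contains_plosive(syllable: str) -> bool:
    return any(ch in PLOSIVES for ch in syllable)


def sketch_short_vowel(syllable: str) -> bool:
    # Exactly one vowel character anywhere in the syllable.
    return sum(ch in VOWELS for ch in syllable) == 1


def sketch_ends_with_nasal(syllable: str) -> bool:
    # Slicing keeps this safe for an empty string.
    return syllable[-1:] in NASALS


# Each detector is independent, so a registry is just a name-to-function map.
SKETCH_DETECTORS = {
    "starts_with_cluster": sketch_starts_with_cluster,
    "contains_plosive": sketch_contains_plosive,
    "short_vowel": sketch_short_vowel,
    "ends_with_nasal": sketch_ends_with_nasal,
}


def sketch_annotate(syllable: str) -> dict:
    # Apply every detector to the syllable; no detector sees another's result.
    return {name: detect(syllable) for name, detect in SKETCH_DETECTORS.items()}


print(sketch_annotate("kran"))
```

Under these assumptions all four flags come out true for "kran", which
agrees with the corresponding entries in the Output Format example.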
Quick Start
-----------
Command-line usage::

    $ python -m build_tools.syllable_feature_annotator \\
        --syllables data/normalized/syllables_unique.txt \\
        --frequencies data/normalized/syllables_frequencies.json \\
        --output data/annotated/syllables_annotated.json \\
        --verbose

Programmatic usage::

    >>> from pathlib import Path
    >>> from build_tools.syllable_feature_annotator import run_annotation_pipeline
    >>> result = run_annotation_pipeline(
    ...     syllables_path=Path("data/normalized/syllables_unique.txt"),
    ...     frequencies_path=Path("data/normalized/syllables_frequencies.json"),
    ...     output_path=Path("data/annotated/syllables_annotated.json"),
    ...     verbose=True
    ... )
    >>> print(f"Annotated {result.statistics.syllable_count} syllables")

Annotate syllables in code::

    >>> from build_tools.syllable_feature_annotator import (
    ...     annotate_corpus,
    ...     FEATURE_DETECTORS
    ... )
    >>> syllables = ["ka", "kran", "spla"]
    >>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
    >>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
    >>> for record in result.annotated_syllables:
    ...     print(f"{record.syllable}: {sum(record.features.values())} features")
Public API
----------
This package exports the following components for programmatic use:
**Pipeline Functions**:
- run_annotation_pipeline: Complete end-to-end pipeline with I/O
- annotate_corpus: Annotate syllables without I/O
- annotate_syllable: Annotate single syllable
**Data Models**:
- AnnotatedSyllable: Single annotated syllable record
- AnnotationStatistics: Processing statistics
- AnnotationResult: Complete result with syllables and stats
**Feature Detection**:
- FEATURE_DETECTORS: Registry of all 12 feature detector functions
- Individual detector functions (starts_with_vowel, contains_plosive, etc.)
**Phoneme Sets**:
- VOWELS, PLOSIVES, FRICATIVES, NASALS, LIQUIDS, STOPS
**File I/O**:
- load_syllables: Load syllables from text file
- load_frequencies: Load frequencies from JSON
- save_annotated_syllables: Save annotated output to JSON
Architecture
------------
The package is organized into focused modules:
**phoneme_sets.py**: Character class definitions (VOWELS, PLOSIVES, etc.)
**feature_rules.py**: Pure feature detector functions (12 detectors)
**annotator.py**: Core orchestration and data models
**file_io.py**: Simple I/O helpers
**cli.py**: Command-line interface with argument parsing
**__main__.py**: Module entry point for python -m
Integration with Pipeline
-------------------------
This tool is designed to work with the syllable normaliser::

    # Step 1: Normalize syllables
    $ python -m build_tools.pyphen_syllable_normaliser \\
        --source data/corpus/ \\
        --output data/normalized/

    # Step 2: Annotate with features
    $ python -m build_tools.syllable_feature_annotator \\
        --syllables data/normalized/syllables_unique.txt \\
        --frequencies data/normalized/syllables_frequencies.json \\
        --output data/annotated/syllables_annotated.json

    # Step 3: Use annotated data for pattern generation (future)
Output Format
-------------
The annotator produces JSON with this structure::

    [
      {
        "syllable": "kran",
        "frequency": 7,
        "features": {
          "starts_with_vowel": false,
          "starts_with_cluster": true,
          "starts_with_heavy_cluster": false,
          "contains_plosive": true,
          "contains_fricative": false,
          "contains_liquid": true,
          "contains_nasal": true,
          "short_vowel": true,
          "long_vowel": false,
          "ends_with_vowel": false,
          "ends_with_nasal": true,
          "ends_with_stop": false
        }
      }
    ]
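Because the output is a flat JSON array of records, downstream code can
filter it with the standard library alone. The sketch below parses a small
inline sample instead of the real output file; the sample keeps only three
of the twelve feature keys for brevity, and the helper name syllables_with
is hypothetical:

```python
import json

# Abbreviated sample in the output format shown above (three of the
# twelve feature keys, to keep the sketch short).
SAMPLE_OUTPUT = (
    '[{"syllable": "kran", "frequency": 7, "features": '
    '{"starts_with_cluster": true, "ends_with_nasal": true, "ends_with_vowel": false}}, '
    '{"syllable": "ka", "frequency": 187, "features": '
    '{"starts_with_cluster": false, "ends_with_nasal": false, "ends_with_vowel": true}}]'
)


def syllables_with(records, feature):
    # Hypothetical helper: syllables whose flag for `feature` is true.
    # A missing feature key is treated as false.
    return [r["syllable"] for r in records if r["features"].get(feature, False)]


records = json.loads(SAMPLE_OUTPUT)
nasal_final = syllables_with(records, "ends_with_nasal")
total_frequency = sum(r["frequency"] for r in records)
print(nasal_final, total_frequency)
```

For real data, the same helper would apply to the list obtained by loading
the annotated JSON file with json.load.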
Notes
-----
- This is a build-time tool only (not used during runtime name generation)
- The annotator is deterministic (same input → same output)
- Features are structural observations, not linguistic interpretations
- All 12 features are applied to every syllable (no selective detection)
- Processing is fast: typically <1 second for 1,000-10,000 syllables
See Also
--------
- CLAUDE.md: Complete project documentation
- syllable_normaliser: Upstream tool that produces input data
- feature_rules.py: Detailed documentation of each feature detector
"""
# Core pipeline functions
from build_tools.syllable_feature_annotator.annotator import (
    AnnotatedSyllable,
    AnnotationResult,
    AnnotationStatistics,
    annotate_corpus,
    annotate_syllable,
    run_annotation_pipeline,
)

# Feature detection
from build_tools.syllable_feature_annotator.feature_rules import (
    FEATURE_DETECTORS,
    contains_fricative,
    contains_liquid,
    contains_nasal,
    contains_plosive,
    ends_with_nasal,
    ends_with_stop,
    ends_with_vowel,
    long_vowel,
    short_vowel,
    starts_with_cluster,
    starts_with_heavy_cluster,
    starts_with_vowel,
)

# File I/O helpers
from build_tools.syllable_feature_annotator.file_io import (
    load_frequencies,
    load_syllables,
    save_annotated_syllables,
)

# Phoneme sets (character classes)
from build_tools.syllable_feature_annotator.phoneme_sets import (
    FRICATIVES,
    LIQUIDS,
    NASALS,
    PLOSIVES,
    STOPS,
    VOWELS,
)
__all__ = [
    # Pipeline functions
    "run_annotation_pipeline",
    "annotate_corpus",
    "annotate_syllable",
    # Data models
    "AnnotatedSyllable",
    "AnnotationStatistics",
    "AnnotationResult",
    # Feature detection
    "FEATURE_DETECTORS",
    "starts_with_vowel",
    "starts_with_cluster",
    "starts_with_heavy_cluster",
    "contains_plosive",
    "contains_fricative",
    "contains_liquid",
    "contains_nasal",
    "short_vowel",
    "long_vowel",
    "ends_with_vowel",
    "ends_with_nasal",
    "ends_with_stop",
    # Phoneme sets
    "VOWELS",
    "PLOSIVES",
    "FRICATIVES",
    "NASALS",
    "LIQUIDS",
    "STOPS",
    # File I/O
    "load_syllables",
    "load_frequencies",
    "save_annotated_syllables",
]
# Package metadata
__version__ = "0.1.0"
__author__ = "pipeworks_name_generation contributors"
__description__ = "Deterministic phonetic feature annotation for syllables"
# Backward compatibility imports for analysis tools (DEPRECATED)
# Analysis tools have been moved to build_tools.syllable_analysis
# These will be removed in a future version
import warnings as _warnings
def _deprecated_import_warning(old_path: str, new_path: str) -> None:
    """Issue deprecation warning for moved analysis tools."""
    _warnings.warn(
        f"Importing from '{old_path}' is deprecated. "
        f"Use '{new_path}' instead. "
        "This compatibility layer will be removed in version 0.2.0.",
        DeprecationWarning,
        stacklevel=3,
    )
def __getattr__(name: str):
    """Lazy import with deprecation warning for moved analysis tools."""
    # Random sampler functions
    if name == "sample_syllables":
        _deprecated_import_warning(
            "build_tools.syllable_feature_annotator.random_sampler",
            "build_tools.syllable_analysis.random_sampler",
        )
        from build_tools.syllable_analysis.random_sampler import sample_syllables  # noqa: F401

        return locals()[name]

    # Functions moved to syllable_analysis.common during refactoring
    if name in ("load_annotated_syllables", "save_samples"):
        _deprecated_import_warning(
            "build_tools.syllable_feature_annotator",
            "build_tools.syllable_analysis.common",
        )
        if name == "load_annotated_syllables":
            from build_tools.syllable_analysis.common import load_annotated_syllables  # noqa: F401

            return load_annotated_syllables
        elif name == "save_samples":
            # save_samples was renamed to save_json_output
            from build_tools.syllable_analysis.common import save_json_output  # noqa: F401

            return save_json_output

    # Feature signatures functions
    if name in (
        "extract_signature",
        "analyze_feature_signatures",
        "format_signature_report",
        "run_analysis",
        "save_report",
    ):
        _deprecated_import_warning(
            "build_tools.syllable_feature_annotator.feature_signatures",
            "build_tools.syllable_analysis.feature_signatures",
        )
        from build_tools.syllable_analysis.feature_signatures import (  # noqa: F401
            analyze_feature_signatures,
            extract_signature,
            format_signature_report,
            run_analysis,
            save_report,
        )

        return locals()[name]

    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")