build_tools.syllable_feature_annotator

Syllable Feature Annotator - Phonetic Feature Detection

The syllable feature annotator attaches phonetic features to normalized syllables, creating a feature-annotated dataset for downstream pattern generation. It is a build-time tool only and is not used during runtime name generation.

The tool sits between the syllable normalizer and pattern development:

Input: Normalized syllables from syllable_normaliser
Process: Apply 12 feature detectors to each syllable
Output: Feature-annotated syllable dataset

Design Principles:

Pure observation - Observes patterns, never interprets or filters
Deterministic - Same input always produces same output
Feature independence - No detector depends on another
Language agnostic - Structural patterns only, no linguistic knowledge
Conservative detection - Approximates patterns with simple structural checks rather than exhaustive phonetic analysis

Feature Set (12 features):

Onset Features (3):

starts_with_vowel: Open onset (vowel-initial)
starts_with_cluster: Initial consonant cluster (2+ consonants)
starts_with_heavy_cluster: Heavy initial cluster (3+ consonants)

Internal Features (4):

contains_plosive: Contains plosive consonant (p, t, k, b, d, g)
contains_fricative: Contains fricative consonant (f, s, z, v, h)
contains_liquid: Contains liquid consonant (l, r, w)
contains_nasal: Contains nasal consonant (m, n)

Nucleus Features (2):

short_vowel: Exactly one vowel character (proxy for a light nucleus)
long_vowel: Two or more vowel characters (proxy for a heavy nucleus)

Coda Features (3):

ends_with_vowel: Open syllable (vowel-final)
ends_with_nasal: Nasal coda
ends_with_stop: Stop coda
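
Illustrative sketch of the feature set:

The following is a minimal, hypothetical sketch of how the 12 features listed above could be expressed as independent, pure functions over character sets. It is not the package's actual implementation. The consonant sets are taken from the lists above, but VOWELS being "aeiou" and STOPS being the same set as PLOSIVES are assumptions, and the names SKETCH_DETECTORS and sketch_annotate are invented for this example.

```python
from typing import Callable, Dict

# Character classes. Consonant sets follow the lists in this document.
VOWELS = frozenset("aeiou")      # ASSUMPTION: the real set lives in phoneme_sets.py
PLOSIVES = frozenset("ptkbdg")
FRICATIVES = frozenset("fszvh")
LIQUIDS = frozenset("lrw")
NASALS = frozenset("mn")
STOPS = PLOSIVES                 # ASSUMPTION: stop coda reuses the plosive set


def _leading_consonants(syllable: str) -> int:
    """Count characters before the first vowel."""
    count = 0
    for ch in syllable:
        if ch in VOWELS:
            break
        count += 1
    return count


def _vowel_count(syllable: str) -> int:
    """Count vowel characters anywhere in the syllable."""
    return sum(ch in VOWELS for ch in syllable)


def _contains(syllable: str, chars: frozenset) -> bool:
    """True if any character of the syllable is in the given set."""
    return any(ch in chars for ch in syllable)


# Hypothetical registry: each detector takes a syllable and returns a bool.
# No detector calls another, so each feature can be checked in isolation.
SKETCH_DETECTORS: Dict[str, Callable[[str], bool]] = {
    "starts_with_vowel": lambda s: bool(s) and s[0] in VOWELS,
    "starts_with_cluster": lambda s: _leading_consonants(s) >= 2,
    "starts_with_heavy_cluster": lambda s: _leading_consonants(s) >= 3,
    "contains_plosive": lambda s: _contains(s, PLOSIVES),
    "contains_fricative": lambda s: _contains(s, FRICATIVES),
    "contains_liquid": lambda s: _contains(s, LIQUIDS),
    "contains_nasal": lambda s: _contains(s, NASALS),
    "short_vowel": lambda s: _vowel_count(s) == 1,
    "long_vowel": lambda s: _vowel_count(s) >= 2,
    "ends_with_vowel": lambda s: bool(s) and s[-1] in VOWELS,
    "ends_with_nasal": lambda s: bool(s) and s[-1] in NASALS,
    "ends_with_stop": lambda s: bool(s) and s[-1] in STOPS,
}


def sketch_annotate(syllable: str) -> Dict[str, bool]:
    """Apply every detector to one syllable (no selective detection)."""
    return {name: detect(syllable) for name, detect in SKETCH_DETECTORS.items()}


if __name__ == "__main__":
    for syllable in ["ka", "kran", "spla"]:
        features = sketch_annotate(syllable)
        print(f"{syllable}: {sum(features.values())} features")
```

Under these assumptions, sketch_annotate("kran") yields the same feature values as the "kran" record shown in the Output Format section. Because every detector is a pure function of the syllable string, the same input always produces the same output.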

Quick Start

Command-line usage:

$ python -m build_tools.syllable_feature_annotator \
    --syllables data/normalized/syllables_unique.txt \
    --frequencies data/normalized/syllables_frequencies.json \
    --output data/annotated/syllables_annotated.json \
    --verbose

Programmatic usage:

>>> from pathlib import Path
>>> from build_tools.syllable_feature_annotator import run_annotation_pipeline
>>> result = run_annotation_pipeline(
...     syllables_path=Path("data/normalized/syllables_unique.txt"),
...     frequencies_path=Path("data/normalized/syllables_frequencies.json"),
...     output_path=Path("data/annotated/syllables_annotated.json"),
...     verbose=True
... )
>>> print(f"Annotated {result.statistics.syllable_count} syllables")

Annotate syllables in code:

>>> from build_tools.syllable_feature_annotator import (
...     annotate_corpus,
...     FEATURE_DETECTORS
... )
>>> syllables = ["ka", "kran", "spla"]
>>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
>>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
>>> for record in result.annotated_syllables:
...     print(f"{record.syllable}: {sum(record.features.values())} features")

Public API

This package exports the following components for programmatic use:

Pipeline Functions:

run_annotation_pipeline: Complete end-to-end pipeline with I/O
annotate_corpus: Annotate syllables without I/O
annotate_syllable: Annotate a single syllable

Data Models:

AnnotatedSyllable: Single annotated syllable record
AnnotationStatistics: Processing statistics
AnnotationResult: Complete result with syllables and stats

Feature Detection:

FEATURE_DETECTORS: Registry of all 12 feature detector functions
Individual detector functions (starts_with_vowel, contains_plosive, etc.)

Phoneme Sets:

VOWELS, PLOSIVES, FRICATIVES, NASALS, LIQUIDS, STOPS

File I/O:

load_syllables: Load syllables from text file
load_frequencies: Load frequencies from JSON
save_annotated_syllables: Save annotated output to JSON

Architecture

The package is organized into focused modules:

phoneme_sets.py: Character class definitions (VOWELS, PLOSIVES, etc.)
feature_rules.py: Pure feature detector functions (12 detectors)
annotator.py: Core orchestration and data models
file_io.py: Simple I/O helpers
cli.py: Command-line interface with argument parsing
__main__.py: Module entry point for python -m

Integration with Pipeline

This tool is designed to work with the syllable normalizer:

# Step 1: Normalize syllables
$ python -m build_tools.pyphen_syllable_normaliser \
    --source data/corpus/ \
    --output data/normalized/

# Step 2: Annotate with features
$ python -m build_tools.syllable_feature_annotator \
    --syllables data/normalized/syllables_unique.txt \
    --frequencies data/normalized/syllables_frequencies.json \
    --output data/annotated/syllables_annotated.json

# Step 3: Use annotated data for pattern generation (future)

Output Format

The annotator produces JSON with this structure:

[
  {
    "syllable": "kran",
    "frequency": 7,
    "features": {
      "starts_with_vowel": false,
      "starts_with_cluster": true,
      "starts_with_heavy_cluster": false,
      "contains_plosive": true,
      "contains_fricative": false,
      "contains_liquid": true,
      "contains_nasal": true,
      "short_vowel": true,
      "long_vowel": false,
      "ends_with_vowel": false,
      "ends_with_nasal": true,
      "ends_with_stop": false
    }
  }
]
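
Reading the output:

As a rough sketch of how a downstream step might consume this structure, the snippet below parses records in the documented shape and selects syllables by feature flags. It uses only the standard json module. The helper select_syllables is hypothetical and not part of the package. The "kran" record is copied from the example above, while the "ka" record is illustrative, filled in by hand from the feature definitions.

```python
import json

# Two records in the documented output shape (see the assumptions above).
SAMPLE_JSON = """
[
  {"syllable": "kran", "frequency": 7, "features": {
    "starts_with_vowel": false, "starts_with_cluster": true,
    "starts_with_heavy_cluster": false, "contains_plosive": true,
    "contains_fricative": false, "contains_liquid": true,
    "contains_nasal": true, "short_vowel": true, "long_vowel": false,
    "ends_with_vowel": false, "ends_with_nasal": true,
    "ends_with_stop": false}},
  {"syllable": "ka", "frequency": 187, "features": {
    "starts_with_vowel": false, "starts_with_cluster": false,
    "starts_with_heavy_cluster": false, "contains_plosive": true,
    "contains_fricative": false, "contains_liquid": false,
    "contains_nasal": false, "short_vowel": true, "long_vowel": false,
    "ends_with_vowel": true, "ends_with_nasal": false,
    "ends_with_stop": false}}
]
"""


def select_syllables(records, **required):
    """Return syllables whose features match every required flag.

    Hypothetical helper: `required` maps feature names to the boolean
    value each selected record must have.
    """
    return [
        record["syllable"]
        for record in records
        if all(
            record["features"].get(name) == value
            for name, value in required.items()
        )
    ]


records = json.loads(SAMPLE_JSON)

print(select_syllables(records, contains_plosive=True))   # ['kran', 'ka']
print(select_syllables(records, ends_with_nasal=True))    # ['kran']
print(select_syllables(records, starts_with_cluster=False, ends_with_vowel=True))  # ['ka']
```

For a real run, the same helper could be applied to the list obtained by loading data/annotated/syllables_annotated.json with json.load.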

Notes

This is a build-time tool only (not used during runtime name generation)
The annotator is deterministic (same input → same output)
Features are structural observations, not linguistic interpretations
All 12 features are applied to every syllable (no selective detection)
Processing is fast: typically <1 second for 1,000-10,000 syllables