build_tools.syllable_feature_annotator

Syllable Feature Annotator - Phonetic Feature Detection

The syllable feature annotator attaches phonetic features to normalized syllables, creating a feature-annotated dataset for downstream pattern generation. This is a build-time tool only; it is not used during runtime name generation.

The tool sits between the syllable normaliser and pattern development:

  1. Input: Normalized syllables from syllable_normaliser

  2. Process: Apply 12 feature detectors to each syllable

  3. Output: Feature-annotated syllable dataset

Design Principles:

  • Pure observation - Observes patterns, never interprets or filters

  • Deterministic - Same input always produces same output

  • Feature independence - No detector depends on another

  • Language agnostic - Structural patterns only, no linguistic knowledge

  • Conservative detection - Approximate patterns with simple structural checks rather than full phonological analysis

Feature Set (12 features):

Onset Features (3):
  • starts_with_vowel: Open onset (vowel-initial)

  • starts_with_cluster: Initial consonant cluster (2+ consonants)

  • starts_with_heavy_cluster: Heavy initial cluster (3+ consonants)

Internal Features (4):
  • contains_plosive: Contains plosive consonant (p, t, k, b, d, g)

  • contains_fricative: Contains fricative consonant (f, s, z, v, h)

  • contains_liquid: Contains liquid consonant (l, r, w)

  • contains_nasal: Contains nasal consonant (m, n)

Nucleus Features (2):
  • short_vowel: Exactly one vowel (weight proxy)

  • long_vowel: Two or more vowels (weight proxy)

Coda Features (3):
  • ends_with_vowel: Open syllable (vowel-final)

  • ends_with_nasal: Nasal coda

  • ends_with_stop: Stop coda
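
The feature list above can be read as twelve independent predicates over a syllable string. The following is a minimal sketch of that idea, not the code in feature_rules.py: the registry name SKETCH_DETECTORS and the helper functions are hypothetical, the vowel set is assumed to be a, e, i, o, u, and the stop set is assumed to match the plosive set, since this page does not list either.

```python
# Character classes taken from the feature list above. VOWELS and STOPS are
# assumptions: this page names them but does not list their members.
VOWELS = frozenset("aeiou")      # assumption
PLOSIVES = frozenset("ptkbdg")
FRICATIVES = frozenset("fszvh")
LIQUIDS = frozenset("lrw")
NASALS = frozenset("mn")
STOPS = PLOSIVES                 # assumption: stop codas reuse the plosive set


def leading_consonants(syllable: str) -> int:
    """Count characters before the first vowel (any non-vowel counts as a consonant)."""
    count = 0
    for ch in syllable:
        if ch in VOWELS:
            break
        count += 1
    return count


def vowel_count(syllable: str) -> int:
    """Count vowel characters anywhere in the syllable."""
    return sum(1 for ch in syllable if ch in VOWELS)


def contains_any(syllable: str, charset: frozenset) -> bool:
    """Return True if at least one character belongs to the given class."""
    return any(ch in charset for ch in syllable)


# Hypothetical registry: feature name -> pure predicate. No entry calls another
# detector, which mirrors the "feature independence" principle.
SKETCH_DETECTORS = {
    "starts_with_vowel": lambda s: s[:1] in VOWELS,
    "starts_with_cluster": lambda s: leading_consonants(s) >= 2,
    "starts_with_heavy_cluster": lambda s: leading_consonants(s) >= 3,
    "contains_plosive": lambda s: contains_any(s, PLOSIVES),
    "contains_fricative": lambda s: contains_any(s, FRICATIVES),
    "contains_liquid": lambda s: contains_any(s, LIQUIDS),
    "contains_nasal": lambda s: contains_any(s, NASALS),
    "short_vowel": lambda s: vowel_count(s) == 1,
    "long_vowel": lambda s: vowel_count(s) >= 2,
    "ends_with_vowel": lambda s: s[-1:] in VOWELS,
    "ends_with_nasal": lambda s: s[-1:] in NASALS,
    "ends_with_stop": lambda s: s[-1:] in STOPS,
}

# Apply every detector to one syllable.
kran = {name: detect("kran") for name, detect in SKETCH_DETECTORS.items()}
print(kran)
```

Under these assumptions, the result for "kran" agrees with the sample record shown under Output Format. The real detectors may treat edge cases (empty strings, non-alphabetic characters, "y") differently.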

Quick Start

Command-line usage:

$ python -m build_tools.syllable_feature_annotator \
    --syllables data/normalized/syllables_unique.txt \
    --frequencies data/normalized/syllables_frequencies.json \
    --output data/annotated/syllables_annotated.json \
    --verbose

Programmatic usage:

>>> from pathlib import Path
>>> from build_tools.syllable_feature_annotator import run_annotation_pipeline
>>> result = run_annotation_pipeline(
...     syllables_path=Path("data/normalized/syllables_unique.txt"),
...     frequencies_path=Path("data/normalized/syllables_frequencies.json"),
...     output_path=Path("data/annotated/syllables_annotated.json"),
...     verbose=True
... )
>>> print(f"Annotated {result.statistics.syllable_count} syllables")

Annotate an in-memory list of syllables (no file I/O):

>>> from build_tools.syllable_feature_annotator import (
...     annotate_corpus,
...     FEATURE_DETECTORS
... )
>>> syllables = ["ka", "kran", "spla"]
>>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
>>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
>>> for record in result.annotated_syllables:
...     print(f"{record.syllable}: {sum(record.features.values())} features")

Public API

This package exports the following components for programmatic use:

Pipeline Functions:
  • run_annotation_pipeline: Complete end-to-end pipeline with I/O

  • annotate_corpus: Annotate syllables without I/O

  • annotate_syllable: Annotate single syllable

Data Models:
  • AnnotatedSyllable: Single annotated syllable record

  • AnnotationStatistics: Processing statistics

  • AnnotationResult: Complete result with syllables and stats

Feature Detection:
  • FEATURE_DETECTORS: Registry of all 12 feature detector functions

  • Individual detector functions (starts_with_vowel, contains_plosive, etc.)

Phoneme Sets:
  • VOWELS, PLOSIVES, FRICATIVES, NASALS, LIQUIDS, STOPS

File I/O:
  • load_syllables: Load syllables from text file

  • load_frequencies: Load frequencies from JSON

  • save_annotated_syllables: Save annotated output to JSON
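
To illustrate how the data models and the I/O-free pipeline function could fit together, here is a minimal standalone sketch. It does not reproduce the package's classes: RecordSketch, StatsSketch, ResultSketch, and annotate_corpus_sketch are hypothetical stand-ins, the feature_count field is invented for illustration, and defaulting a missing frequency to 1 is an assumption. Only the attribute names syllable, features, annotated_syllables, and statistics.syllable_count come from the examples on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping, Sequence

Detector = Callable[[str], bool]


@dataclass(frozen=True)
class RecordSketch:
    """Stand-in for a single annotated syllable record."""
    syllable: str
    frequency: int
    features: Dict[str, bool]


@dataclass(frozen=True)
class StatsSketch:
    """Stand-in for processing statistics."""
    syllable_count: int
    feature_count: int  # hypothetical field, not taken from the package


@dataclass(frozen=True)
class ResultSketch:
    """Stand-in for the complete result: records plus statistics."""
    annotated_syllables: List[RecordSketch]
    statistics: StatsSketch


def annotate_corpus_sketch(
    syllables: Sequence[str],
    frequencies: Mapping[str, int],
    detectors: Mapping[str, Detector],
) -> ResultSketch:
    """Apply every detector to every syllable, preserving input order."""
    records = [
        RecordSketch(
            syllable=s,
            frequency=frequencies.get(s, 1),  # assumption: default to 1 if missing
            features={name: bool(detect(s)) for name, detect in detectors.items()},
        )
        for s in syllables
    ]
    stats = StatsSketch(syllable_count=len(records), feature_count=len(detectors))
    return ResultSketch(annotated_syllables=records, statistics=stats)


# Two toy detectors; the vowel set is an assumption.
VOWELS = frozenset("aeiou")
TOY_DETECTORS = {
    "starts_with_vowel": lambda s: s[:1] in VOWELS,
    "ends_with_vowel": lambda s: s[-1:] in VOWELS,
}

result = annotate_corpus_sketch(
    ["ka", "kran", "spla"], {"ka": 187, "kran": 7, "spla": 2}, TOY_DETECTORS
)
for record in result.annotated_syllables:
    print(record.syllable, record.frequency, record.features)
```

Passing the detector registry in as an argument, as annotate_corpus does in the Quick Start example, keeps the loop free of any knowledge about individual features.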

Architecture

The package is organized into focused modules:

  • phoneme_sets.py: Character class definitions (VOWELS, PLOSIVES, etc.)

  • feature_rules.py: Pure feature detector functions (12 detectors)

  • annotator.py: Core orchestration and data models

  • file_io.py: Simple I/O helpers

  • cli.py: Command-line interface with argument parsing

  • __main__.py: Module entry point for python -m

Integration with Pipeline

This tool is designed to run after the syllable normaliser and consume its output files:

# Step 1: Normalize syllables
$ python -m build_tools.pyphen_syllable_normaliser \
    --source data/corpus/ \
    --output data/normalized/

# Step 2: Annotate with features
$ python -m build_tools.syllable_feature_annotator \
    --syllables data/normalized/syllables_unique.txt \
    --frequencies data/normalized/syllables_frequencies.json \
    --output data/annotated/syllables_annotated.json

# Step 3: Use annotated data for pattern generation (future)

Output Format

The annotator produces JSON with this structure:

[
  {
    "syllable": "kran",
    "frequency": 7,
    "features": {
      "starts_with_vowel": false,
      "starts_with_cluster": true,
      "starts_with_heavy_cluster": false,
      "contains_plosive": true,
      "contains_fricative": false,
      "contains_liquid": true,
      "contains_nasal": true,
      "short_vowel": true,
      "long_vowel": false,
      "ends_with_vowel": false,
      "ends_with_nasal": true,
      "ends_with_stop": false
    }
  }
]
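
Because the output is a plain JSON array, downstream code can consume it with the standard library alone. The sketch below parses the sample record shown above from an inline string and selects syllables by feature values; the helper name select_syllables is hypothetical and not part of the package.

```python
import json
from typing import Dict, List

# The sample record from this section, inlined so the sketch is self-contained.
SAMPLE = """
[
  {
    "syllable": "kran",
    "frequency": 7,
    "features": {
      "starts_with_vowel": false,
      "starts_with_cluster": true,
      "starts_with_heavy_cluster": false,
      "contains_plosive": true,
      "contains_fricative": false,
      "contains_liquid": true,
      "contains_nasal": true,
      "short_vowel": true,
      "long_vowel": false,
      "ends_with_vowel": false,
      "ends_with_nasal": true,
      "ends_with_stop": false
    }
  }
]
"""


def select_syllables(records: List[dict], wanted: Dict[str, bool]) -> List[str]:
    """Return syllables whose features match every name/value pair in `wanted`."""
    return [
        record["syllable"]
        for record in records
        if all(record["features"].get(name) == value for name, value in wanted.items())
    ]


records = json.loads(SAMPLE)

# Syllables that open with a cluster and close on a nasal.
print(select_syllables(records, {"starts_with_cluster": True, "ends_with_nasal": True}))

# Vowel-initial syllables (none in this sample).
print(select_syllables(records, {"starts_with_vowel": True}))
```

To read a real output file instead, replace json.loads(SAMPLE) with json.load on the opened file, for example the data/annotated/syllables_annotated.json path used in the commands above.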

Notes

  • This is a build-time tool only (not used during runtime name generation)

  • The annotator is deterministic (same input → same output)

  • Features are structural observations, not linguistic interpretations

  • All 12 features are applied to every syllable (no selective detection)

  • Processing is fast: typically <1 second for 1,000-10,000 syllables

See Also

  • CLAUDE.md: Complete project documentation

  • syllable_normaliser: Upstream tool that produces input data

  • feature_rules.py: Detailed documentation of each feature detector

Submodules