Syllable Feature Annotator

Overview

Syllable Feature Annotator - Phonetic Feature Detection

The syllable feature annotator attaches phonetic features to normalized syllables, creating a feature-annotated dataset for downstream pattern generation. This is a build-time tool only - not used during runtime name generation.

The tool sits between the syllable normaliser and pattern development:

  1. Input: Normalized syllables from syllable_normaliser

  2. Process: Apply 12 feature detectors to each syllable

  3. Output: Feature-annotated syllable dataset

Design Principles:

  • Pure observation - Observes patterns, never interprets or filters

  • Deterministic - Same input always produces same output

  • Feature independence - No detector depends on another

  • Language agnostic - Structural patterns only, no linguistic knowledge

  • Conservative detection - Simple character-level approximations, not full phonological analysis

Feature Set (12 features):

Onset Features (3):
  • starts_with_vowel: Open onset (vowel-initial)

  • starts_with_cluster: Initial consonant cluster (2+ consonants)

  • starts_with_heavy_cluster: Heavy initial cluster (3+ consonants)

Internal Features (4):
  • contains_plosive: Contains plosive consonant (p, t, k, b, d, g)

  • contains_fricative: Contains fricative consonant (f, s, z, v, h)

  • contains_liquid: Contains liquid consonant (l, r, w)

  • contains_nasal: Contains nasal consonant (m, n)

Nucleus Features (2):
  • short_vowel: Exactly one vowel (weight proxy)

  • long_vowel: Two or more vowels (weight proxy)

Coda Features (3):
  • ends_with_vowel: Open syllable (vowel-final)

  • ends_with_nasal: Nasal coda

  • ends_with_stop: Stop coda
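
The feature definitions above can be read as simple character-level checks. The following is a minimal sketch of that idea, not the package's actual feature_rules.py. The set contents come from the feature lists and detector notes in this documentation; the helper names (_leading_consonants, _vowel_count, SKETCH_DETECTORS) and the rule that every non-vowel character counts as a consonant are assumptions made for illustration.

```python
# Illustrative sketch only; the real detectors live in feature_rules.py.
VOWELS = frozenset("aeiou")
PLOSIVES = frozenset("ptkbdg")
FRICATIVES = frozenset("fszvh")
LIQUIDS = frozenset("lrw")
NASALS = frozenset("mn")
STOPS = PLOSIVES | {"q"}


def _leading_consonants(s: str) -> int:
    """Count characters before the first vowel (assumption: non-vowel = consonant)."""
    count = 0
    for ch in s:
        if ch in VOWELS:
            break
        count += 1
    return count


def _vowel_count(s: str) -> int:
    """Count vowel characters anywhere in the syllable."""
    return sum(ch in VOWELS for ch in s)


# Hypothetical registry mirroring the 12 documented feature names.
SKETCH_DETECTORS = {
    "starts_with_vowel": lambda s: bool(s) and s[0] in VOWELS,
    "starts_with_cluster": lambda s: _leading_consonants(s) >= 2,
    "starts_with_heavy_cluster": lambda s: _leading_consonants(s) >= 3,
    "contains_plosive": lambda s: any(ch in PLOSIVES for ch in s),
    "contains_fricative": lambda s: any(ch in FRICATIVES for ch in s),
    "contains_liquid": lambda s: any(ch in LIQUIDS for ch in s),
    "contains_nasal": lambda s: any(ch in NASALS for ch in s),
    "short_vowel": lambda s: _vowel_count(s) == 1,
    "long_vowel": lambda s: _vowel_count(s) >= 2,
    "ends_with_vowel": lambda s: bool(s) and s[-1] in VOWELS,
    "ends_with_nasal": lambda s: bool(s) and s[-1] in NASALS,
    "ends_with_stop": lambda s: bool(s) and s[-1] in STOPS,
}

# Apply every detector to one syllable, in registry order.
features = {name: detect("kran") for name, detect in SKETCH_DETECTORS.items()}
print(features)
```

Each detector looks only at the syllable string, which matches the feature-independence principle: no detector reads another detector's result.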

Quick Start

Command-line usage:

$ python -m build_tools.syllable_feature_annotator \
    --syllables data/normalized/syllables_unique.txt \
    --frequencies data/normalized/syllables_frequencies.json \
    --output data/annotated/syllables_annotated.json \
    --verbose

Programmatic usage:

>>> from pathlib import Path
>>> from build_tools.syllable_feature_annotator import run_annotation_pipeline
>>> result = run_annotation_pipeline(
...     syllables_path=Path("data/normalized/syllables_unique.txt"),
...     frequencies_path=Path("data/normalized/syllables_frequencies.json"),
...     output_path=Path("data/annotated/syllables_annotated.json"),
...     verbose=True
... )
>>> print(f"Annotated {result.statistics.syllable_count} syllables")

Annotate syllables in code:

>>> from build_tools.syllable_feature_annotator import (
...     annotate_corpus,
...     FEATURE_DETECTORS
... )
>>> syllables = ["ka", "kran", "spla"]
>>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
>>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
>>> for record in result.annotated_syllables:
...     print(f"{record.syllable}: {sum(record.features.values())} features")

Public API

This package exports the following components for programmatic use:

Pipeline Functions:
  • run_annotation_pipeline: Complete end-to-end pipeline with I/O

  • annotate_corpus: Annotate syllables without I/O

  • annotate_syllable: Annotate single syllable

Data Models:
  • AnnotatedSyllable: Single annotated syllable record

  • AnnotationStatistics: Processing statistics

  • AnnotationResult: Complete result with syllables and stats

Feature Detection:
  • FEATURE_DETECTORS: Registry of all 12 feature detector functions

  • Individual detector functions (starts_with_vowel, contains_plosive, etc.)

Phoneme Sets:
  • VOWELS, PLOSIVES, FRICATIVES, NASALS, LIQUIDS, STOPS

File I/O:
  • load_syllables: Load syllables from text file

  • load_frequencies: Load frequencies from JSON

  • save_annotated_syllables: Save annotated output to JSON

Architecture

The package is organized into focused modules:

  • phoneme_sets.py: Character class definitions (VOWELS, PLOSIVES, etc.)

  • feature_rules.py: Pure feature detector functions (12 detectors)

  • annotator.py: Core orchestration and data models

  • file_io.py: Simple I/O helpers

  • cli.py: Command-line interface with argument parsing

  • __main__.py: Module entry point for python -m

Integration with Pipeline

This tool is designed to work with the syllable normalizer:

# Step 1: Normalize syllables
$ python -m build_tools.pyphen_syllable_normaliser \
    --source data/corpus/ \
    --output data/normalized/

# Step 2: Annotate with features
$ python -m build_tools.syllable_feature_annotator \
    --syllables data/normalized/syllables_unique.txt \
    --frequencies data/normalized/syllables_frequencies.json \
    --output data/annotated/syllables_annotated.json

# Step 3: Use annotated data for pattern generation (future)

Output Format

The annotator produces JSON with this structure:

[
  {
    "syllable": "kran",
    "frequency": 7,
    "features": {
      "starts_with_vowel": false,
      "starts_with_cluster": true,
      "starts_with_heavy_cluster": false,
      "contains_plosive": true,
      "contains_fricative": false,
      "contains_liquid": true,
      "contains_nasal": true,
      "short_vowel": true,
      "long_vowel": false,
      "ends_with_vowel": false,
      "ends_with_nasal": true,
      "ends_with_stop": false
    }
  }
]

Notes

  • This is a build-time tool only (not used during runtime name generation)

  • The annotator is deterministic (same input → same output)

  • Features are structural observations, not linguistic interpretations

  • All 12 features are applied to every syllable (no selective detection)

  • Processing is fast: typically <1 second for 1,000-10,000 syllables

See Also

  • CLAUDE.md: Complete project documentation

  • syllable_normaliser: Upstream tool that produces input data

  • feature_rules.py: Detailed documentation of each feature detector

Command-Line Interface

Annotate syllables with phonetic feature detection

usage: python -m build_tools.syllable_feature_annotator [-h]
                                                        [--syllables SYLLABLES]
                                                        [--frequencies FREQUENCIES]
                                                        [--output OUTPUT]
                                                        [--verbose]

Named Arguments

--syllables

Path to syllables text file (one per line). Default: data/normalized/syllables_unique.txt

--frequencies

Path to frequencies JSON file. Default: data/normalized/syllables_frequencies.json

--output

Path for annotated output JSON. If not specified, automatically computes path based on syllables file location: <run_directory>/data/<extractor_type>_syllables_annotated.json. Falls back to data/annotated/syllables_annotated.json if auto-detection fails.

--verbose, -v

Show detailed progress information

Default: False

# Annotate with default paths (normalizer output)
python -m build_tools.syllable_feature_annotator

# Annotate with custom paths
python -m build_tools.syllable_feature_annotator \
  --syllables data/normalized/syllables_unique.txt \
  --frequencies data/normalized/syllables_frequencies.json \
  --output data/annotated/syllables_annotated.json

# Enable verbose output
python -m build_tools.syllable_feature_annotator --verbose

For more information, see the documentation in CLAUDE.md

Output Format

Input/Output Contract

Inputs (from syllable normaliser):

  • pyphen_syllables_unique.txt or nltk_syllables_unique.txt - One canonical syllable per line

  • pyphen_syllables_frequencies.json or nltk_syllables_frequencies.json - {"syllable": count} mapping

Output (auto-detected from input paths):

  • <run_directory>/data/pyphen_syllables_annotated.json - Pyphen extractor output

  • <run_directory>/data/nltk_syllables_annotated.json - NLTK extractor output

Output path auto-detection:

The tool automatically detects the extractor type (pyphen or nltk) from the input file paths and places output in the same run directory. This creates a self-contained workflow where each extraction run contains all its derived data.

Example directory structure after annotation:

_working/output/20260110_115601_nltk/
├── data/
│   └── nltk_syllables_annotated.json      ← Auto-detected output
├── meta/
├── syllables/
├── nltk_syllables_unique.txt              ← Input
├── nltk_syllables_frequencies.json        ← Input
└── ...

Output Structure

The annotator produces JSON with this structure:

[
  {
    "syllable": "kran",
    "frequency": 7,
    "features": {
      "starts_with_vowel": false,
      "starts_with_cluster": true,
      "starts_with_heavy_cluster": false,
      "contains_plosive": true,
      "contains_fricative": false,
      "contains_liquid": true,
      "contains_nasal": true,
      "short_vowel": true,
      "long_vowel": false,
      "ends_with_vowel": false,
      "ends_with_nasal": true,
      "ends_with_stop": false
    }
  }
]
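
Because the output is a flat JSON array of records, downstream analysis can filter on any combination of feature flags. The sketch below is a hedged illustration: the embedded sample is abbreviated to two features per record, and the helper name select is hypothetical rather than part of the package.

```python
import json

# Abbreviated sample in the documented output shape (two features only).
sample_output = """
[
  {"syllable": "kran", "frequency": 7,
   "features": {"starts_with_cluster": true, "ends_with_nasal": true}},
  {"syllable": "ka", "frequency": 187,
   "features": {"starts_with_cluster": false, "ends_with_nasal": false}}
]
"""
records = json.loads(sample_output)


def select(records, **required):
    """Return syllables whose features match every required flag (hypothetical helper)."""
    return [
        record["syllable"]
        for record in records
        if all(record["features"].get(name) == value
               for name, value in required.items())
    ]


cluster_nasal = select(records, starts_with_cluster=True, ends_with_nasal=True)
print(cluster_nasal)
```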

Feature set:

All 12 features are applied to every syllable:

  • Onset features (starts_with_vowel, starts_with_cluster, starts_with_heavy_cluster)

  • Content features (contains_plosive, contains_fricative, contains_liquid, contains_nasal)

  • Vowel features (short_vowel, long_vowel)

  • Coda features (ends_with_vowel, ends_with_nasal, ends_with_stop)

Integration Guide

The feature annotator sits between the normaliser and pattern development. It automatically detects the extractor type and places output in the run directory.

Recommended workflow (with auto-detection):

# Step 1: Extract and normalize syllables
python -m build_tools.pyphen_syllable_extractor --file input.txt
python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_115453_pyphen/

# Step 2: Annotate with features (output path auto-detected)
python -m build_tools.syllable_feature_annotator \
  --syllables _working/output/20260110_115453_pyphen/pyphen_syllables_unique.txt \
  --frequencies _working/output/20260110_115453_pyphen/pyphen_syllables_frequencies.json
# Creates: _working/output/20260110_115453_pyphen/data/pyphen_syllables_annotated.json

# Step 3: Use annotated syllables for pattern generation (future)

Alternative: Explicit output path (overrides auto-detection):

python -m build_tools.syllable_feature_annotator \
  --syllables path/to/syllables.txt \
  --frequencies path/to/frequencies.json \
  --output custom/output.json

When to use this tool:

  • After syllable normalization is complete

  • Before developing phonotactic patterns or constraints

  • To add structural feature metadata to your syllable corpus

  • For analysis tasks requiring feature-based filtering or grouping

Extractor type detection:

The tool detects the extractor type by examining, in order:

  1. Filename prefix (pyphen_* or nltk_*)

  2. Parent directory name (*_pyphen or *_nltk)

If neither matches, the output path falls back to data/annotated/syllables_annotated.json.
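
The detection rules above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the function names detect_extractor_type and default_output_path are hypothetical, and the run directory is assumed to be the directory containing the syllables file, as in the example directory structure shown earlier.

```python
from pathlib import PurePosixPath

# Fallback path documented for the --output option.
FALLBACK_OUTPUT = PurePosixPath("data/annotated/syllables_annotated.json")
KNOWN_EXTRACTORS = ("pyphen", "nltk")


def detect_extractor_type(syllables_path):
    """Return 'pyphen' or 'nltk', or None if neither rule matches."""
    path = PurePosixPath(syllables_path)
    # Rule 1: filename prefix, e.g. nltk_syllables_unique.txt
    for name in KNOWN_EXTRACTORS:
        if path.name.startswith(f"{name}_"):
            return name
    # Rule 2: parent directory suffix, e.g. 20260110_115601_nltk
    for name in KNOWN_EXTRACTORS:
        if path.parent.name.endswith(f"_{name}"):
            return name
    return None


def default_output_path(syllables_path):
    """Compute <run_directory>/data/<extractor_type>_syllables_annotated.json."""
    extractor = detect_extractor_type(syllables_path)
    if extractor is None:
        return FALLBACK_OUTPUT
    # Assumption: the run directory is the folder holding the syllables file.
    run_dir = PurePosixPath(syllables_path).parent
    return run_dir / "data" / f"{extractor}_syllables_annotated.json"


print(default_output_path(
    "_working/output/20260110_115601_nltk/nltk_syllables_unique.txt"
))
```

An explicit --output value would bypass this computation entirely, which is how existing scripts keep working.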

Notes

Auto-detection creates self-contained workflows:

The automatic output path detection ensures each extraction run is self-contained with all derived data in one directory. This makes it easy to manage multiple extraction runs and compare results from different extractors (pyphen vs NLTK).

Features are structural observations:

Features are structural observations based on phoneme presence, not linguistic interpretations. This ensures deterministic, language-agnostic detection.

Processing characteristics:

  • Fast and deterministic (same input = same output)

  • All 12 features applied to every syllable (no selective detection)

  • Designed to integrate seamlessly with syllable normalizer output

  • Output path auto-detection works with both pyphen and NLTK normalizer outputs

Backward compatibility:

Explicit --output paths still work and override auto-detection. This ensures compatibility with existing workflows and scripts.

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

class build_tools.syllable_feature_annotator.AnnotatedSyllable(syllable, frequency, features)

Bases: object

Structured record for a single annotated syllable.

This dataclass represents the output format for each syllable after feature annotation. It contains the syllable string, its frequency count, and a dictionary of feature detection results.

Attributes

syllable : str

The canonical syllable string (from normalizer)

frequency : int

Occurrence count in the corpus (from frequency analysis)

features : dict[str, bool]

Dictionary mapping feature names to boolean detection results. Example: {"starts_with_cluster": True, "contains_plosive": True, ...}

Examples

Create an annotated syllable:

>>> record = AnnotatedSyllable(
...     syllable="kran",
...     frequency=7,
...     features={
...         "starts_with_cluster": True,
...         "contains_plosive": True,
...         "short_vowel": True,
...         # ... other features ...
...     }
... )
>>> record.syllable
'kran'
>>> record.features["starts_with_cluster"]
True

Convert to dictionary (for JSON serialization):

>>> from dataclasses import asdict
>>> record_dict = asdict(record)
>>> record_dict["syllable"]
'kran'

Notes

  • All feature values must be boolean (True or False)

  • Frequency defaults to 1 if not found in frequency mapping

  • The features dict should contain all 12 feature detectors

  • Order of features in dict matches FEATURE_DETECTORS registry order

class build_tools.syllable_feature_annotator.AnnotationResult(annotated_syllables, statistics)

Bases: object

Complete annotation result with syllables and metadata.

Encapsulates the full output of the annotation process, including all annotated syllables and processing statistics.

Attributes

annotated_syllables : list[AnnotatedSyllable]

List of all annotated syllable records

statistics : AnnotationStatistics

Metadata about the annotation process

Examples

Access annotated syllables:

>>> result = run_annotation_pipeline(...)
>>> for record in result.annotated_syllables[:3]:
...     print(f"{record.syllable}: {record.frequency}")

Get statistics:

>>> result.statistics.syllable_count
1523

Convert to JSON-serializable format:

>>> from dataclasses import asdict
>>> output = [asdict(syl) for syl in result.annotated_syllables]

Notes

  • annotated_syllables list order matches input syllables order

  • Each syllable appears exactly once (no duplicates)

  • statistics provides overview of the annotation run

class build_tools.syllable_feature_annotator.AnnotationStatistics(syllable_count, feature_count, processing_time, total_frequency)

Bases: object

Statistics about the annotation process.

Tracks metadata about the annotation run, including counts, feature coverage, and performance metrics.

Attributes

syllable_count : int

Total number of syllables annotated

feature_count : int

Total number of features applied per syllable

processing_time : float

Time taken for annotation in seconds

total_frequency : int

Sum of all syllable frequencies (total corpus size)

Examples

Create statistics manually:

>>> stats = AnnotationStatistics(
...     syllable_count=1523,
...     feature_count=12,
...     processing_time=0.34,
...     total_frequency=8472
... )
>>> stats.syllable_count
1523

Access statistics from result:

>>> result = run_annotation_pipeline(...)
>>> print(f"Processed {result.statistics.syllable_count} syllables")
>>> print(f"Time: {result.statistics.processing_time:.2f}s")

Notes

  • processing_time uses time.perf_counter() for precision

  • total_frequency represents pre-deduplication corpus size

  • feature_count should always be 12 (current feature set size)

build_tools.syllable_feature_annotator.annotate_corpus(syllables, frequencies, detectors)

Annotate entire syllable corpus with feature detection.

Processes a list of syllables and produces annotated records with feature detection results and processing statistics.

This is a pure function with no I/O operations. It takes data structures as input and returns data structures as output.

Return type:

AnnotationResult

Parameters

syllables : list[str]

List of syllable strings to annotate (typically from syllables_unique.txt)

frequencies : dict[str, int]

Dictionary mapping syllables to occurrence counts (typically from syllables_frequencies.json)

detectors : dict[str, Callable[[str], bool]]

Dictionary mapping feature names to detector functions (typically FEATURE_DETECTORS from feature_rules.py)

Returns

AnnotationResult

Complete result with annotated syllables and statistics

Examples

Annotate a small corpus:

>>> from build_tools.syllable_feature_annotator import annotate_corpus
>>> from build_tools.syllable_feature_annotator.feature_rules import FEATURE_DETECTORS
>>> syllables = ["ka", "kran", "spla"]
>>> frequencies = {"ka": 187, "kran": 7, "spla": 2}
>>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
>>> result.statistics.syllable_count
3
>>> result.statistics.feature_count
12

Handle missing frequency (defaults to 1):

>>> syllables = ["xyz"]
>>> frequencies = {}  # Empty frequency dict
>>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
>>> result.annotated_syllables[0].frequency
1

Check processing time:

>>> result = annotate_corpus(syllables, frequencies, FEATURE_DETECTORS)
>>> result.statistics.processing_time < 1.0  # Should be very fast
True

Notes

  • Syllables are processed in input order (deterministic)

  • Missing frequencies default to 1 (no error raised)

  • All syllables are processed (no filtering or exclusion)

  • Processing time uses time.perf_counter() for precision

  • Function is deterministic (same inputs → same outputs)

  • Memory usage: O(n) where n = number of syllables
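
The behaviour described in these notes (input order preserved, missing frequencies defaulting to 1, timing via time.perf_counter()) can be sketched in a few lines. This is a simplified stand-in, not the package's annotator.py: the names Record, Stats, annotate_corpus_sketch, and toy_detectors are hypothetical, only two toy detectors are used, and counting defaulted frequencies in total_frequency is an assumption.

```python
import time
from dataclasses import dataclass


@dataclass
class Record:
    """Stand-in for AnnotatedSyllable."""
    syllable: str
    frequency: int
    features: dict


@dataclass
class Stats:
    """Stand-in for AnnotationStatistics."""
    syllable_count: int
    feature_count: int
    processing_time: float
    total_frequency: int


def annotate_corpus_sketch(syllables, frequencies, detectors):
    """Annotate every syllable in input order; no I/O, no filtering."""
    start = time.perf_counter()
    records = [
        Record(
            syllable=s,
            frequency=frequencies.get(s, 1),  # missing frequency defaults to 1
            features={name: detect(s) for name, detect in detectors.items()},
        )
        for s in syllables
    ]
    stats = Stats(
        syllable_count=len(records),
        feature_count=len(detectors),
        processing_time=time.perf_counter() - start,
        # Assumption: defaulted frequencies are included in the total.
        total_frequency=sum(r.frequency for r in records),
    )
    return records, stats


# Two toy detectors stand in for the full registry of 12.
toy_detectors = {
    "starts_with_vowel": lambda s: bool(s) and s[0] in "aeiou",
    "ends_with_vowel": lambda s: bool(s) and s[-1] in "aeiou",
}

records, stats = annotate_corpus_sketch(
    ["ka", "kran", "xyz"], {"ka": 187, "kran": 7}, toy_detectors
)
print(stats.syllable_count, stats.total_frequency)
```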

build_tools.syllable_feature_annotator.annotate_syllable(syllable, frequency, detectors)

Apply all feature detectors to a single syllable.

This is the core annotation function. It mechanically applies each feature detector to the syllable and assembles the results into an AnnotatedSyllable record.

The function is pure: it has no side effects and always produces the same output for the same inputs.

Return type:

AnnotatedSyllable

Parameters

syllable : str

The syllable string to annotate

frequency : int

Occurrence count for this syllable

detectors : dict[str, Callable[[str], bool]]

Dictionary mapping feature names to detector functions (typically FEATURE_DETECTORS from feature_rules.py)

Returns

AnnotatedSyllable

Complete annotated record with all feature detection results

Examples

Annotate a simple syllable:

>>> from build_tools.syllable_feature_annotator import annotate_syllable
>>> from build_tools.syllable_feature_annotator.feature_rules import FEATURE_DETECTORS
>>> record = annotate_syllable("ka", 187, FEATURE_DETECTORS)
>>> record.syllable
'ka'
>>> record.frequency
187
>>> record.features["starts_with_cluster"]
False
>>> record.features["short_vowel"]
True

Annotate a complex cluster:

>>> record = annotate_syllable("spla", 2, FEATURE_DETECTORS)
>>> record.features["starts_with_heavy_cluster"]
True
>>> record.features["contains_liquid"]
True

Notes

  • Function is deterministic (same input → same output)

  • All detectors are applied (no short-circuiting or skipping)

  • Features are stored in detector iteration order

  • No validation or filtering of results

  • Empty syllables will produce all-False features

  • Processing is fast: O(n*m) where n=syllable length, m=detector count

build_tools.syllable_feature_annotator.contains_fricative(s)

Detect if syllable contains any fricative consonant.

Fricatives (f, s, z, v, h) are consonants produced by forcing air through a narrow channel, creating turbulent airflow and friction. They create “hissing” or “buzzing” texture.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable contains any fricative, False otherwise

Examples

>>> contains_fricative("fish")
True
>>> contains_fricative("zone")
True
>>> contains_fricative("bat")
False
>>> contains_fricative("")
False

Notes

  • Checks entire syllable, not just specific positions

  • Fricatives defined in phoneme_sets.FRICATIVES (f, s, z, v, h)

  • Multiple fricatives in one syllable still return True

  • Empty strings return False

build_tools.syllable_feature_annotator.contains_liquid(s)

Detect if syllable contains any liquid consonant.

Liquids (l, r, w) are consonants with vowel-like qualities that flow smoothly. They have lateral (l) or rhotic (r) characteristics and contribute to syllable fluidity.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable contains any liquid, False otherwise

Examples

>>> contains_liquid("kran")
True
>>> contains_liquid("slow")
True
>>> contains_liquid("bat")
False
>>> contains_liquid("")
False

Notes

  • Checks entire syllable, not just specific positions

  • Liquids defined in phoneme_sets.LIQUIDS (l, r, w)

  • ‘w’ is included due to its semi-vowel/glide properties

  • Multiple liquids in one syllable still return True

  • Empty strings return False

build_tools.syllable_feature_annotator.contains_nasal(s)

Detect if syllable contains any nasal consonant.

Nasals (m, n) are consonants where air flows through the nasal cavity. They have resonant qualities and often appear in coda positions, contributing to syllable closure patterns.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable contains any nasal, False otherwise

Examples

>>> contains_nasal("kran")
True
>>> contains_nasal("man")
True
>>> contains_nasal("bat")
False
>>> contains_nasal("")
False

Notes

  • Checks entire syllable, not just specific positions

  • Nasals defined in phoneme_sets.NASALS (m, n)

  • Multiple nasals in one syllable still return True

  • Empty strings return False

  • See also: ends_with_nasal for coda-specific detection

build_tools.syllable_feature_annotator.contains_plosive(s)

Detect if syllable contains any plosive consonant.

Plosives (p, t, k, b, d, g) are consonants produced by completely blocking airflow then releasing it suddenly. They inject “hardness” and percussive texture into syllables.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable contains any plosive, False otherwise

Examples

>>> contains_plosive("takt")
True
>>> contains_plosive("pat")
True
>>> contains_plosive("sal")
False
>>> contains_plosive("")
False

Notes

  • Checks entire syllable, not just specific positions

  • Plosives defined in phoneme_sets.PLOSIVES (p, t, k, b, d, g)

  • Multiple plosives in one syllable still return True

  • Empty strings return False

build_tools.syllable_feature_annotator.ends_with_nasal(s)

Detect if syllable ends with a nasal consonant (nasal coda).

Nasal codas (m, n) create specific closure patterns and resonance. They are common syllable-final consonants across many languages and contribute to syllable weight.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable ends with nasal, False otherwise

Examples

>>> ends_with_nasal("turn")
True
>>> ends_with_nasal("man")
True
>>> ends_with_nasal("bat")
False
>>> ends_with_nasal("")
False

Notes

  • Only checks the final character

  • Nasals defined in phoneme_sets.NASALS (m, n)

  • Nasal codas are distinct from stop codas in sonority

  • Empty strings return False (no coda to analyze)

  • See also: contains_nasal for position-independent detection

build_tools.syllable_feature_annotator.ends_with_stop(s)

Detect if syllable ends with a stop consonant (stop coda).

Stop codas create abrupt syllable termination with complete airflow closure. They include plosives and other stops that contribute to syllable closure and weight.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable ends with stop, False otherwise

Examples

>>> ends_with_stop("takt")
True
>>> ends_with_stop("bat")
True
>>> ends_with_stop("man")
False
>>> ends_with_stop("")
False

Notes

  • Only checks the final character

  • Stops defined in phoneme_sets.STOPS (p, t, k, b, d, g, q)

  • STOPS includes all PLOSIVES plus ‘q’ (terminal closure)

  • Stop codas create heavier, more closed syllables

  • Empty strings return False (no coda to analyze)

  • Distinction: STOPS for coda detection, PLOSIVES for internal texture
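
The STOPS versus PLOSIVES distinction is easiest to see on a syllable ending in 'q'. The following sketch re-declares the two sets from the notes above and uses simplified re-implementations of the two detectors; the syllable "raq" is a made-up example, not taken from any corpus.

```python
# Sets as described in the notes: STOPS is PLOSIVES plus 'q'.
PLOSIVES = frozenset("ptkbdg")
STOPS = PLOSIVES | {"q"}


def contains_plosive(s: str) -> bool:
    """True if any character is a plosive (position-independent)."""
    return any(ch in PLOSIVES for ch in s)


def ends_with_stop(s: str) -> bool:
    """True if the final character is a stop; empty strings yield False."""
    return bool(s) and s[-1] in STOPS


for syllable in ("bat", "raq", "man"):
    print(syllable, contains_plosive(syllable), ends_with_stop(syllable))
```

A syllable can therefore carry a stop coda without containing any plosive, which is why the two features are tracked separately.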

build_tools.syllable_feature_annotator.ends_with_vowel(s)

Detect if syllable ends with a vowel (open syllable).

Syllables ending in vowels are “open” in phonological terms. They tend to have higher sonority and different prosodic properties compared to consonant-final syllables.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable ends with vowel, False otherwise

Examples

>>> ends_with_vowel("na")
True
>>> ends_with_vowel("hello")
True
>>> ends_with_vowel("bat")
False
>>> ends_with_vowel("")
False

Notes

  • Only checks the final character

  • Vowels defined in phoneme_sets.VOWELS (a, e, i, o, u)

  • Open syllables (vowel-final) vs closed syllables (consonant-final)

  • Empty strings return False (no coda to analyze)

  • Mutually exclusive with ends_with_nasal and ends_with_stop

build_tools.syllable_feature_annotator.load_frequencies(file_path)

Load syllable frequency mapping from a JSON file.

Reads a JSON file containing a dictionary mapping syllables to their occurrence counts. The expected format is: {"syllable": count, ...}

This function is designed to load the output from the syllable normalizer, specifically the syllables_frequencies.json file.

Return type:

dict[str, int]

Parameters

file_path : Path

Path to JSON file containing frequency mapping

Returns

dict[str, int]

Dictionary mapping syllable strings to integer counts

Raises

FileNotFoundError

If the input file doesn’t exist

ValueError

If the JSON is malformed or doesn’t contain expected format

IOError

If there are permission or disk errors

Examples

Load frequencies from normalizer output:

>>> from pathlib import Path
>>> from build_tools.syllable_feature_annotator import load_frequencies
>>> frequencies = load_frequencies(Path("data/normalized/syllables_frequencies.json"))
>>> len(frequencies)
1523
>>> frequencies["ka"]
187
>>> frequencies["ra"]
162

Get most frequent syllable:

>>> most_frequent = max(frequencies.items(), key=lambda x: x[1])
>>> print(f"{most_frequent[0]}: {most_frequent[1]} occurrences")
ka: 187 occurrences

Handle missing syllable (returns default):

>>> frequencies.get("xyz", 1)  # Default to 1 if missing
1

Notes

  • UTF-8 encoding is used explicitly

  • The entire file is loaded into memory (suitable for typical dataset sizes)

  • No validation is performed on syllable strings or counts

  • Deterministic: same file always produces same output

  • Missing syllables should be handled by caller (use .get(syllable, 1))
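
A loader with the behaviour described above could look roughly like this. It is a sketch under stated assumptions, not the module's source: the name load_frequencies_sketch is hypothetical, and converting a JSON decoding failure into ValueError is one plausible reading of the Raises section.

```python
import json
from pathlib import Path


def load_frequencies_sketch(file_path: Path) -> dict[str, int]:
    """Load a {"syllable": count} mapping from a UTF-8 JSON file."""
    # Path.open raises FileNotFoundError if the file does not exist.
    with file_path.open(encoding="utf-8") as handle:
        try:
            data = json.load(handle)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Malformed JSON in {file_path}: {exc}") from exc
    # Only the top-level shape is checked; keys and counts are not validated.
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object in {file_path}")
    return data
```

Since json.JSONDecodeError is itself a subclass of ValueError, the explicit re-raise only adds the file path to the message. Callers still handle absent syllables themselves, for example with frequencies.get(syllable, 1).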

build_tools.syllable_feature_annotator.load_syllables(file_path)

Load syllables from a text file (one syllable per line).

Reads a text file containing one syllable per line and returns a list of syllable strings. Empty lines are automatically filtered out.

This function is designed to load the output from the syllable normalizer, specifically the syllables_unique.txt file.

Return type:

list[str]

Parameters

file_path : Path

Path to text file containing syllables (one per line)

Returns

list[str]

List of syllable strings, with empty lines filtered out

Raises

FileNotFoundError

If the input file doesn’t exist

IOError

If there are permission or disk errors

Examples

Load syllables from normalizer output:

>>> from pathlib import Path
>>> syllables = load_syllables(Path("data/normalized/syllables_unique.txt"))
>>> len(syllables)
1523
>>> syllables[:3]
['ka', 'ra', 'mi']

Handle missing file:

>>> syllables = load_syllables(Path("nonexistent.txt"))
Traceback (most recent call last):
    ...
FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent.txt'

Notes

  • Empty lines are filtered automatically

  • Lines are stripped of leading/trailing whitespace

  • UTF-8 encoding is used explicitly

  • The file is read entirely into memory (suitable for typical syllable counts)

  • Order is preserved from the input file

  • Deterministic: same file always produces same output
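
The stripping and filtering behaviour listed above can be sketched in a few lines. The name load_syllables_sketch is hypothetical and this is not the module's source; it only mirrors the documented behaviour (UTF-8, whitespace stripped, empty lines dropped, order preserved).

```python
from pathlib import Path


def load_syllables_sketch(file_path: Path) -> list[str]:
    """Read one syllable per line, dropping blank lines."""
    # Path.open raises FileNotFoundError if the file does not exist.
    with file_path.open(encoding="utf-8") as handle:
        # Strip each line, then keep only the non-empty results, in file order.
        return [line.strip() for line in handle if line.strip()]
```

Lines that contain only whitespace are treated as empty in this sketch, because they are stripped before the emptiness check.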

build_tools.syllable_feature_annotator.long_vowel(s)

Detect if syllable has two or more vowels (long vowel proxy).

This is a structural proxy for syllable weight and nucleus complexity, not linguistic vowel length. Syllables with multiple vowels tend to be heavier and more open, including diphthongs and vowel sequences.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable contains 2+ vowels, False otherwise

Examples

>>> long_vowel("beat")  # 'ea' = 2 vowels
True
>>> long_vowel("aura")  # 'au' + 'a' = 3 vowels
True
>>> long_vowel("bat")
False
>>> long_vowel("")
False

Notes

  • Counts total vowels in syllable (any position)

  • Returns True if count >= 2

  • Not linguistic vowel length (e.g. short /a/ vs long /aː/)

  • Catches diphthongs (ae, au, etc.) and vowel sequences

  • Provides proxy for syllable weight and complexity

  • Mutually exclusive with short_vowel

  • Empty strings return False (no nucleus)
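
The counting rule in these notes can be sketched as below. The helper count_vowels and the name long_vowel_sketch are illustrative only, and the vowel set is assumed to be (a, e, i, o, u) as stated for phoneme_sets.VOWELS elsewhere in this document.

```python
# Assumed vowel set, matching the (a, e, i, o, u) stated in this document.
VOWELS = frozenset("aeiou")


def count_vowels(s: str) -> int:
    """Count vowel characters anywhere in the syllable."""
    return sum(1 for ch in s if ch in VOWELS)


def long_vowel_sketch(s: str) -> bool:
    """Return True if the syllable contains two or more vowels."""
    return count_vowels(s) >= 2
```

Because the count ignores position, non-adjacent vowels also qualify: "aura" counts three vowels even though the final 'a' is separated from 'au' by a consonant.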

build_tools.syllable_feature_annotator.run_annotation_pipeline(syllables_path, frequencies_path, output_path, verbose=False)

Run complete end-to-end annotation pipeline with I/O.

This is the main entry point for the annotation tool. It:

  1. Loads syllables from file

  2. Loads frequencies from file

  3. Annotates the corpus

  4. Saves output to file

  5. Returns results

Return type:

AnnotationResult

Parameters

syllables_path : Path

Path to syllables text file (one per line). Example: data/normalized/syllables_unique.txt

frequencies_path : Path

Path to frequencies JSON file. Example: data/normalized/syllables_frequencies.json

output_path : Path

Path where annotated JSON should be written. Example: data/annotated/syllables_annotated.json

verbose : bool, optional

If True, print progress information (default: False)

Returns

AnnotationResult

Complete annotation result with syllables and statistics

Raises

FileNotFoundError

If input files don’t exist

ValueError

If input data is malformed

IOError

If there are file permission or disk errors

Examples

Run full pipeline:

>>> from pathlib import Path
>>> result = run_annotation_pipeline(
...     syllables_path=Path("data/normalized/syllables_unique.txt"),
...     frequencies_path=Path("data/normalized/syllables_frequencies.json"),
...     output_path=Path("data/annotated/syllables_annotated.json"),
...     verbose=True
... )
Loading syllables...
Loading frequencies...
Annotating corpus...
Saving results...
Annotated 1523 syllables in 0.34s

Check results:

>>> print(f"Processed {result.statistics.syllable_count} syllables")
>>> print(f"Time: {result.statistics.processing_time:.2f}s")

Notes

  • Input files must exist and be readable

  • Output directory is created automatically if needed

  • Output file is overwritten if it exists

  • Processing is deterministic (same inputs → same outputs)

  • Verbose mode prints progress to stdout

  • All exceptions from file I/O propagate to caller
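
The five pipeline steps can be sketched end to end as follows. This is a simplified stand-in, not the tool's implementation: run_pipeline_sketch and DETECTORS are hypothetical names, only two of the 12 detectors are included, and a plain dict is returned in place of AnnotationResult, whose exact definition is not shown in this section.

```python
import json
import time
from pathlib import Path

VOWELS = frozenset("aeiou")  # assumed vowel set

# Two stand-in detectors; the real tool applies 12.
DETECTORS = {
    "starts_with_vowel": lambda s: bool(s) and s[0] in VOWELS,
    "ends_with_vowel": lambda s: bool(s) and s[-1] in VOWELS,
}


def run_pipeline_sketch(syllables_path: Path, frequencies_path: Path,
                        output_path: Path, verbose: bool = False) -> dict:
    """Load, annotate, save, and return results (illustrative only)."""
    start = time.perf_counter()

    if verbose:
        print("Loading syllables...")
    lines = syllables_path.read_text(encoding="utf-8").splitlines()
    syllables = [line.strip() for line in lines if line.strip()]

    if verbose:
        print("Loading frequencies...")
    frequencies = json.loads(frequencies_path.read_text(encoding="utf-8"))

    if verbose:
        print("Annotating corpus...")
    annotated = [
        {
            "syllable": syl,
            "frequency": frequencies.get(syl, 1),  # default to 1 if missing
            "features": {name: detect(syl) for name, detect in DETECTORS.items()},
        }
        for syl in syllables
    ]

    if verbose:
        print("Saving results...")
    # Create the output directory if needed; overwrite any existing file.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(annotated, indent=2), encoding="utf-8")

    elapsed = time.perf_counter() - start
    return {
        "annotated": annotated,
        "syllable_count": len(annotated),
        "processing_time": elapsed,
    }
```

Nothing in the sketch catches exceptions, so file errors propagate to the caller as the notes describe.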

build_tools.syllable_feature_annotator.save_annotated_syllables(syllables, file_path)

Save annotated syllables to a JSON file with human-readable formatting.

Writes a list of annotated syllable dictionaries to a JSON file. Each dictionary should contain ‘syllable’, ‘frequency’, and ‘features’ keys.

Output is formatted with 2-space indentation for readability and version control friendliness.

Return type:

None

Parameters

syllables : list[dict]

List of annotated syllable dictionaries, each containing:

  • syllable (str): The syllable string

  • frequency (int): Occurrence count

  • features (dict[str, bool]): Feature detection results

file_path : Path

Path where JSON output should be written

Returns

None

File is written to disk, nothing returned

Raises

IOError

If there are permission or disk errors

TypeError

If syllables is not JSON-serializable

Examples

Save annotated syllables:

>>> from pathlib import Path
>>> annotated = [
...     {
...         "syllable": "kran",
...         "frequency": 7,
...         "features": {
...             "starts_with_cluster": True,
...             "contains_plosive": True,
...             "short_vowel": True,
...             # ... other features ...
...         }
...     },
...     # ... more syllables ...
... ]
>>> save_annotated_syllables(annotated, Path("output/syllables_annotated.json"))

Expected output format:

[
  {
    "syllable": "kran",
    "frequency": 7,
    "features": {
      "starts_with_cluster": true,
      "contains_plosive": true,
      "short_vowel": true
    }
  }
]

Notes

  • Parent directories are created automatically if they don’t exist

  • UTF-8 encoding is used explicitly

  • 2-space indentation for readability

  • Output is valid JSON that can be consumed by other tools

  • File is overwritten if it already exists

  • Deterministic: same input always produces same output
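
A writer matching these notes could be sketched as below. The name save_annotated_syllables_sketch is hypothetical, and ensure_ascii=False is an assumption made here so that non-ASCII syllables stay readable in the file; the source does not say which setting the tool uses.

```python
import json
from pathlib import Path


def save_annotated_syllables_sketch(syllables: list[dict], file_path: Path) -> None:
    """Write annotated syllables as indented UTF-8 JSON (illustrative only)."""
    # Create missing parent directories before opening the file.
    file_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" truncates, so an existing file is overwritten.
    with file_path.open("w", encoding="utf-8") as handle:
        # json.dump raises TypeError for values that are not JSON-serializable.
        json.dump(syllables, handle, indent=2, ensure_ascii=False)
```

Python's True and False are written as JSON true and false, which is why the expected output format above shows lowercase booleans.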

build_tools.syllable_feature_annotator.short_vowel(s)

Detect if syllable has exactly one vowel (short vowel proxy).

This is a structural proxy for syllable weight and nucleus complexity, not linguistic vowel length. Syllables with one vowel tend to be lighter and more closed.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable contains exactly one vowel, False otherwise

Examples

>>> short_vowel("bat")
True
>>> short_vowel("kran")
True
>>> short_vowel("beat")  # 'ea' = 2 vowels
False
>>> short_vowel("")
False

Notes

  • Counts total vowels in syllable (any position)

  • Returns True only if count == 1

  • Not linguistic vowel length (e.g. short /a/ vs long /aː/)

  • Provides proxy for syllable weight and openness

  • Mutually exclusive with long_vowel

  • Empty strings return False (no nucleus)
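
Mutually exclusive does not mean exhaustive: a syllable with no vowels satisfies neither nucleus feature. A small sketch, assuming the vowel set (a, e, i, o, u) and the thresholds given in the notes; the name nucleus_features_sketch is illustrative and not part of the module:

```python
# Assumed vowel set, matching the (a, e, i, o, u) stated in this document.
VOWELS = frozenset("aeiou")


def nucleus_features_sketch(s: str) -> dict[str, bool]:
    """Derive both nucleus features from a single vowel count."""
    count = sum(1 for ch in s if ch in VOWELS)
    return {
        "short_vowel": count == 1,  # exactly one vowel
        "long_vowel": count >= 2,   # two or more vowels
    }
```

Under these assumptions "my" yields False for both features, since 'y' is not in the vowel set, as does the empty string.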

build_tools.syllable_feature_annotator.starts_with_cluster(s)

Detect if syllable starts with a consonant cluster (2+ consonants).

A consonant cluster is two or more adjacent consonants at the beginning of a syllable. This creates increased phonetic complexity and affects pronunciation difficulty and syllable weight.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable starts with 2+ consonants, False otherwise

Examples

>>> starts_with_cluster("kran")
True
>>> starts_with_cluster("train")
True
>>> starts_with_cluster("na")
False
>>> starts_with_cluster("a")
False

Notes

  • Requires at least 2 characters

  • Checks that first two characters are both non-vowels

  • Conservative detection: catches obvious clusters (tr, kr, st, etc.)

  • Does not handle vowel-glides or language-specific edge cases

  • Heavy clusters (3+ consonants) will also trigger this detector
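
The two-character check described above can be sketched as follows. The name starts_with_cluster_sketch is hypothetical and the vowel set is assumed to be (a, e, i, o, u); this is an illustration of the documented rule, not the module's source.

```python
# Assumed vowel set, matching the (a, e, i, o, u) stated in this document.
VOWELS = frozenset("aeiou")


def starts_with_cluster_sketch(s: str) -> bool:
    """Return True if the first two characters are both non-vowels."""
    return len(s) >= 2 and s[0] not in VOWELS and s[1] not in VOWELS
```

Note that "non-vowel" is broader than "consonant" in this sketch: any character outside the vowel set counts, so "my" is reported as a cluster. This is consistent with the conservative, language-agnostic approach, and it assumes the input has already been normalized to plain lowercase letters.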

build_tools.syllable_feature_annotator.starts_with_heavy_cluster(s)

Detect if syllable starts with a heavy consonant cluster (3+ consonants).

Heavy clusters are particularly complex initial consonant sequences. These are relatively rare in natural language but create distinctive phonetic patterns when present.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable starts with 3+ consonants, False otherwise

Examples

>>> starts_with_heavy_cluster("spla")
True
>>> starts_with_heavy_cluster("stra")
True
>>> starts_with_heavy_cluster("kran")
False
>>> starts_with_heavy_cluster("na")
False

Notes

  • Requires at least 3 characters

  • Checks that first three characters are all non-vowels

  • Future-safe: can be refined or replaced without breaking consumers

  • This is a placeholder definition that catches obvious cases

  • Syllables with heavy clusters will also trigger starts_with_cluster
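
The overlap between the two cluster features is easiest to see when both are derived from the length of the leading non-vowel run. This is an equivalent formulation chosen for illustration, not necessarily how the module implements it; leading_non_vowel_run and onset_cluster_features_sketch are hypothetical names.

```python
# Assumed vowel set, matching the (a, e, i, o, u) stated in this document.
VOWELS = frozenset("aeiou")


def leading_non_vowel_run(s: str) -> int:
    """Count characters before the first vowel."""
    run = 0
    for ch in s:
        if ch in VOWELS:
            break
        run += 1
    return run


def onset_cluster_features_sketch(s: str) -> dict[str, bool]:
    """Derive both cluster features from the leading run length."""
    run = leading_non_vowel_run(s)
    return {
        "starts_with_cluster": run >= 2,
        "starts_with_heavy_cluster": run >= 3,
    }
```

A run of three or more is necessarily a run of two or more, so every heavy cluster also sets starts_with_cluster, while "kran" sets only the lighter feature.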

build_tools.syllable_feature_annotator.starts_with_vowel(s)

Detect if syllable starts with a vowel (vowel-initial or open onset).

This feature identifies syllables that begin directly with a vowel, without any initial consonant. Such syllables have an “open onset” in phonological terms.

Return type:

bool

Parameters

s : str

Syllable string to analyze

Returns

bool

True if syllable starts with vowel, False otherwise

Examples

>>> starts_with_vowel("apple")
True
>>> starts_with_vowel("kran")
False
>>> starts_with_vowel("a")
True
>>> starts_with_vowel("")  # Edge case: empty string
False

Notes

  • Empty strings return False (no onset to analyze)

  • Only checks the first character

  • Vowels are defined in phoneme_sets.VOWELS (a, e, i, o, u)
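
A first-character check of this kind can be sketched in one line. The name starts_with_vowel_sketch is hypothetical, and the lowercase-only vowel set is an assumption based on the (a, e, i, o, u) listed above.

```python
# Assumed vowel set: lowercase only, as listed in the notes above.
VOWELS = frozenset("aeiou")


def starts_with_vowel_sketch(s: str) -> bool:
    """Return True if the first character is a vowel (illustrative only)."""
    # bool(s) guards the empty string, which has no onset to analyze.
    return bool(s) and s[0] in VOWELS
```

In this sketch an uppercase "Apple" returns False because 'A' is not in the set. Whether the real detector is case-sensitive is not stated here; the sketch simply relies on the input having been normalized beforehand.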