build_tools.name_combiner.aggregator

Feature aggregation for name-level evaluation.

This module implements the rules for aggregating syllable-level features into name-level features. The aggregation produces a boolean feature vector for each name candidate, enabling policy evaluation by the name_selector.

Aggregation Rules

Onset Features (first syllable only):

starts_with_vowel
starts_with_cluster
starts_with_heavy_cluster

These features describe how a name begins. Only the first syllable’s onset features are relevant: internal syllable onsets don’t affect how the name “enters” the listener’s ear.

Coda Features (final syllable only):

ends_with_vowel
ends_with_nasal
ends_with_stop

These features describe how a name ends. Only the final syllable’s coda features are relevant: internal syllable codas don’t affect how the name “lands” or closes.
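The two positional rules above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the helper `positional_features`, the local tuples `ONSET` and `CODA`, and the hand-written feature values for "an" and "ka" are all assumptions; only the feature names come from the lists above.

```python
# Illustrative sketch only; the helper and tuple names are assumptions.
ONSET = ("starts_with_vowel", "starts_with_cluster", "starts_with_heavy_cluster")
CODA = ("ends_with_vowel", "ends_with_nasal", "ends_with_stop")


def positional_features(syllables):
    """Take onset features from the first syllable and coda features from the last."""
    first = syllables[0]["features"]
    last = syllables[-1]["features"]
    result = {name: first[name] for name in ONSET}
    result.update({name: last[name] for name in CODA})
    return result


# Hypothetical two-syllable input "an" + "ka" (feature values chosen by hand).
syllables = [
    {"syllable": "an", "features": {
        "starts_with_vowel": True, "starts_with_cluster": False,
        "starts_with_heavy_cluster": False, "ends_with_vowel": False,
        "ends_with_nasal": True, "ends_with_stop": False,
    }},
    {"syllable": "ka", "features": {
        "starts_with_vowel": False, "starts_with_cluster": False,
        "starts_with_heavy_cluster": False, "ends_with_vowel": True,
        "ends_with_nasal": False, "ends_with_stop": False,
    }},
]
name_level = positional_features(syllables)
print(name_level["starts_with_vowel"])  # True: taken from "an"
print(name_level["ends_with_nasal"])    # False: the nasal coda of "an" is internal
```

The internal nasal coda of "an" is dropped on purpose: only the final syllable decides how the name closes.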

Internal Features (OR across all syllables):

contains_plosive
contains_fricative
contains_liquid
contains_nasal

These features describe the texture of a name. If ANY syllable has the feature, the name has it. A name like “kalira” has contains_liquid=True because “li” contains a liquid; one such syllable is enough, whatever the other syllables contain.
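The OR rule maps directly onto Python's built-in `any`. The sketch below assumes a hypothetical helper named `any_syllable_has` and uses trimmed feature dictionaries for "ka" and "li"; it is not taken from the module itself.

```python
def any_syllable_has(syllables, feature):
    """Return True if at least one syllable has the given feature set."""
    return any(s["features"][feature] for s in syllables)


# Trimmed feature dicts for illustration: only the two features being checked.
syllables = [
    {"syllable": "ka", "features": {"contains_liquid": False, "contains_fricative": False}},
    {"syllable": "li", "features": {"contains_liquid": True, "contains_fricative": False}},
]
has_liquid = any_syllable_has(syllables, "contains_liquid")        # one syllable suffices
has_fricative = any_syllable_has(syllables, "contains_fricative")  # no syllable has it
print(has_liquid, has_fricative)
```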

Nucleus Features (majority rule):

short_vowel
long_vowel

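These features describe the dominant vowel character of a name. We use majority rule (>50% of syllables) to determine the name-level value. Because the threshold is strict, a feature held by exactly half of the syllables is False at the name level. The rationale is given under “Why Majority Rule for Nucleus Features”.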

Why Majority Rule for Nucleus Features

We use majority rule (>50% of syllables) rather than proportional scoring, for the following reasons:

Preserves Architectural Consistency: The entire feature registry is built on boolean features. The policy matrix uses checkmark/tilde/cross symbols that map cleanly to boolean logic. Introducing fractional features would break this elegant simplicity.
Keeps the Implementation Simple: Majority rule means the name-level feature vector remains a simple boolean array, identical in structure to syllable-level vectors. No special cases, no type conversions.
Sufficient for Initial Policy Evaluation: For a first implementation, knowing “this name is mostly short-vowel” vs. “this name is mostly long-vowel” is enough information to make good selection decisions. Precise ratios are not needed yet.
Easier to Debug and Explain: When a name gets rejected, you can say “this name has short_vowel=True (2 of 3 syllables), which is discouraged for Place Names.” That’s clear and inspectable. With proportional scoring, the same rejection would depend on a fractional value and a threshold, which is harder to trace.
Aligns with Project Philosophy: The system is about shape and suitability, not precise optimization. Majority rule captures the dominant character of a name, which is what matters for admissibility.

Future Consideration

If finer-grained nucleus control is needed, proportional scoring could be introduced as an optional mode. This would require extending the policy matrix to handle float thresholds (e.g., short_vowel > 0.6). For now, majority rule provides the right balance of simplicity and expressiveness.
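For comparison, a proportional mode might look like the sketch below. Nothing here exists in the module: `proportion` and `meets_threshold` are hypothetical names, and the 0.6 threshold is simply the example value mentioned above.

```python
def proportion(syllables, feature):
    """Return the fraction of syllables (0.0 to 1.0) that have the feature."""
    return sum(1 for s in syllables if s["features"][feature]) / len(syllables)


def meets_threshold(syllables, feature, threshold):
    """Hypothetical float-threshold policy check, e.g. short_vowel > 0.6."""
    return proportion(syllables, feature) > threshold


# Three trimmed syllables, two of which have a short vowel.
syllables = [
    {"syllable": "?", "features": {"short_vowel": True}},
    {"syllable": "?", "features": {"short_vowel": True}},
    {"syllable": "?", "features": {"short_vowel": False}},
]
ratio = proportion(syllables, "short_vowel")             # 2 of 3 syllables
passes = meets_threshold(syllables, "short_vowel", 0.6)  # 2/3 exceeds 0.6
print(ratio, passes)
```

Note that the name-level value becomes a float here, which is exactly the type change the boolean design avoids.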

Usage

>>> from build_tools.name_combiner.aggregator import aggregate_features
>>> syllables = [
...     {"syllable": "ka", "features": {"starts_with_vowel": False, ...}},
...     {"syllable": "li", "features": {"contains_liquid": True, ...}},
... ]
>>> name_features = aggregate_features(syllables)
>>> name_features["starts_with_vowel"]  # From first syllable
False
>>> name_features["contains_liquid"]  # OR across all
True

Attributes

`ONSET_FEATURES`
`CODA_FEATURES`
`INTERNAL_FEATURES`
`NUCLEUS_FEATURES`
`ALL_FEATURES`

Functions

aggregate_features(syllables)

Aggregate syllable-level features into a name-level feature vector.

Module Contents

build_tools.name_combiner.aggregator.ONSET_FEATURES

build_tools.name_combiner.aggregator.CODA_FEATURES

build_tools.name_combiner.aggregator.INTERNAL_FEATURES

build_tools.name_combiner.aggregator.NUCLEUS_FEATURES

build_tools.name_combiner.aggregator.ALL_FEATURES = ('starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'contains_plosive',...

build_tools.name_combiner.aggregator.aggregate_features(syllables)

Aggregate syllable-level features into a name-level feature vector.

Takes a sequence of syllable dictionaries (each with a “features” key) and produces a single boolean feature vector for the combined name.

Parameters

syllables (Sequence[dict]): List of syllable dictionaries, each containing:
- “syllable” (str): The syllable text
- “features” (dict[str, bool]): The 12 boolean features

Returns

dict[str, bool]: Name-level feature vector with all 12 features as booleans.

Raises

ValueError: If the syllables list is empty or a syllable is missing a required key.
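The checks behind this error could be sketched as below. The function name `validate_syllables`, the error messages, and the decision to check only the two top-level keys are assumptions; the real function may validate more (for example, individual feature names).

```python
def validate_syllables(syllables):
    """Raise ValueError for an empty input or a syllable missing a required key."""
    if not syllables:
        raise ValueError("syllables must not be empty")
    for index, syllable in enumerate(syllables):
        for key in ("syllable", "features"):
            if key not in syllable:
                raise ValueError(f"syllable at index {index} is missing key {key!r}")


# Collect the error messages for two invalid inputs.
errors = []
for bad_input in ([], [{"syllable": "ka"}]):
    try:
        validate_syllables(bad_input)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```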

Examples

>>> syllables = [
...     {"syllable": "ka", "features": {
...         "starts_with_vowel": False,
...         "starts_with_cluster": False,
...         "starts_with_heavy_cluster": False,
...         "contains_plosive": True,
...         "contains_fricative": False,
...         "contains_liquid": False,
...         "contains_nasal": False,
...         "short_vowel": True,
...         "long_vowel": False,
...         "ends_with_vowel": True,
...         "ends_with_nasal": False,
...         "ends_with_stop": False,
...     }},
...     {"syllable": "li", "features": {
...         "starts_with_vowel": False,
...         "starts_with_cluster": False,
...         "starts_with_heavy_cluster": False,
...         "contains_plosive": False,
...         "contains_fricative": False,
...         "contains_liquid": True,
...         "contains_nasal": False,
...         "short_vowel": True,
...         "long_vowel": False,
...         "ends_with_vowel": True,
...         "ends_with_nasal": False,
...         "ends_with_stop": False,
...     }},
... ]
>>> features = aggregate_features(syllables)
>>> features["starts_with_vowel"]  # From first syllable ("ka")
False
>>> features["ends_with_vowel"]  # From final syllable ("li")
True
>>> features["contains_liquid"]  # OR: True because "li" has it
True
>>> features["short_vowel"]  # Majority: 2/2 = 100% > 50%
True

Notes

Aggregation follows these rules:

Onset (starts_with_*): First syllable only
Coda (ends_with_*): Final syllable only
Internal (contains_*): OR across all syllables
Nucleus (short_vowel, long_vowel): Majority rule (>50%)

See module docstring for detailed rationale on majority rule.
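Taken together, the four rules amount to a table from feature name to reducer. The sketch below is a hypothetical re-implementation for illustration, not the module's source: `aggregate_sketch`, `GROUPS`, `RULES`, and the `_features` helper are invented names, and the feature values for "ka" and "li" follow the example above.

```python
def _first(values):
    return values[0]


def _last(values):
    return values[-1]


def _majority(values):
    # Strict majority: more than half of the syllables have the feature.
    return sum(values) * 2 > len(values)


# Each feature group is paired with the reducer that implements its rule.
GROUPS = (
    (("starts_with_vowel", "starts_with_cluster", "starts_with_heavy_cluster"), _first),
    (("contains_plosive", "contains_fricative", "contains_liquid", "contains_nasal"), any),
    (("short_vowel", "long_vowel"), _majority),
    (("ends_with_vowel", "ends_with_nasal", "ends_with_stop"), _last),
)
RULES = {name: rule for names, rule in GROUPS for name in names}


def aggregate_sketch(syllables):
    """Apply each feature's reducer to its per-syllable values."""
    if not syllables:
        raise ValueError("syllables must not be empty")
    return {
        feature: bool(rule([s["features"][feature] for s in syllables]))
        for feature, rule in RULES.items()
    }


def _features(*true_names):
    # Hypothetical helper: all 12 features False except the ones listed.
    return {**dict.fromkeys(RULES, False), **dict.fromkeys(true_names, True)}


syllables = [
    {"syllable": "ka", "features": _features("contains_plosive", "short_vowel", "ends_with_vowel")},
    {"syllable": "li", "features": _features("contains_liquid", "short_vowel", "ends_with_vowel")},
]
result = aggregate_sketch(syllables)
print(sorted(name for name, value in result.items() if value))
```

Keeping the rules in one table makes it easy to see which reducer governs each feature, at the cost of building a per-feature list of values for every name.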