Source code for build_tools.name_combiner.aggregator

"""
Feature aggregation for name-level evaluation.

This module implements the rules for aggregating syllable-level features
into name-level features. The aggregation produces a boolean feature vector
for each name candidate, enabling policy evaluation by the name_selector.

Aggregation Rules
-----------------
**Onset Features** (first syllable only):
    - starts_with_vowel
    - starts_with_cluster
    - starts_with_heavy_cluster

    These features describe how a name begins. Only the first syllable's
    onset features are relevant - internal syllable onsets don't affect
    how the name "enters" the listener's ear.

**Coda Features** (final syllable only):
    - ends_with_vowel
    - ends_with_nasal
    - ends_with_stop

    These features describe how a name ends. Only the final syllable's
    coda features are relevant - internal syllable codas don't affect
    how the name "lands" or closes.

**Internal Features** (OR across all syllables):
    - contains_plosive
    - contains_fricative
    - contains_liquid
    - contains_nasal

    These features describe the texture of a name. If ANY syllable contains
    the feature, the name has it. A name like "kalira" contains_liquid=True
    because "li" has a liquid, even though "ka" and "ra" might not.

**Nucleus Features** (majority rule):
    - short_vowel
    - long_vowel

    These features describe the dominant vowel character of a name.
    We use majority rule (>50% of syllables) to determine the name-level
    value. See the module docstring for detailed rationale.

Why Majority Rule for Nucleus Features
--------------------------------------
We use majority (>50% of syllables) rather than proportional scoring.

1. **Preserves Architectural Consistency**: The entire feature registry is
   built on boolean features. The policy matrix uses checkmark/tilde/cross
   symbols that map cleanly to boolean logic. Introducing fractional
   features would break this elegant simplicity.

2. **Keeps the Implementation Simple**: Majority rule means the name-level
   feature vector remains a simple boolean array, identical in structure
   to syllable-level vectors. No special cases, no type conversions.

3. **Sufficient for Initial Policy Evaluation**: For a first implementation,
   knowing "this name is mostly short-vowel" vs. "this name is mostly
   long-vowel" is enough information to make good selection decisions.
   Precise ratios are not needed yet.

4. **Easier to Debug and Explain**: When a name gets rejected, you can say
   "this name has short_vowel=true (2 of 3 syllables), which is discouraged
   for Place Names." That's clear and inspectable. Proportional scoring
   makes debugging harder.

5. **Aligns with Project Philosophy**: The system is about shape and
   suitability, not precise optimization. Majority rule captures the
   dominant character of a name, which is what matters for admissibility.

Future Consideration
--------------------
If finer-grained nucleus control is needed, proportional scoring could be
introduced as an optional mode. This would require extending the policy
matrix to handle float thresholds (e.g., short_vowel > 0.6). For now,
majority rule provides the right balance of simplicity and expressiveness.

Usage
-----
>>> from build_tools.name_combiner.aggregator import aggregate_features
>>> syllables = [
...     {"syllable": "ka", "features": {"starts_with_vowel": False, ...}},
...     {"syllable": "li", "features": {"contains_liquid": True, ...}},
... ]
>>> name_features = aggregate_features(syllables)
>>> name_features["starts_with_vowel"]  # From first syllable
False
>>> name_features["contains_liquid"]  # OR across all
True
"""

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from collections.abc import Sequence

# Feature categories for aggregation
ONSET_FEATURES = frozenset(
    {
        "starts_with_vowel",
        "starts_with_cluster",
        "starts_with_heavy_cluster",
    }
)

CODA_FEATURES = frozenset(
    {
        "ends_with_vowel",
        "ends_with_nasal",
        "ends_with_stop",
    }
)

INTERNAL_FEATURES = frozenset(
    {
        "contains_plosive",
        "contains_fricative",
        "contains_liquid",
        "contains_nasal",
    }
)

NUCLEUS_FEATURES = frozenset(
    {
        "short_vowel",
        "long_vowel",
    }
)

# All 12 features in canonical order (matches feature_rules.py)
ALL_FEATURES = (
    "starts_with_vowel",
    "starts_with_cluster",
    "starts_with_heavy_cluster",
    "contains_plosive",
    "contains_fricative",
    "contains_liquid",
    "contains_nasal",
    "short_vowel",
    "long_vowel",
    "ends_with_vowel",
    "ends_with_nasal",
    "ends_with_stop",
)



[docs]
def aggregate_features(syllables: Sequence[dict]) -> dict[str, bool]:
    """
    Aggregate syllable-level features into a name-level feature vector.

    Takes a sequence of syllable dictionaries (each with a "features" key)
    and produces a single boolean feature vector for the combined name.

    Parameters
    ----------
    syllables : Sequence[dict]
        List of syllable dictionaries, each containing:
        - "syllable": str - The syllable text
        - "features": dict[str, bool] - The 12 boolean features

    Returns
    -------
    dict[str, bool]
        Name-level feature vector with all 12 features as booleans.

    Raises
    ------
    ValueError
        If syllables list is empty or missing required keys.

    Examples
    --------
    >>> syllables = [
    ...     {"syllable": "ka", "features": {
    ...         "starts_with_vowel": False,
    ...         "starts_with_cluster": False,
    ...         "starts_with_heavy_cluster": False,
    ...         "contains_plosive": True,
    ...         "contains_fricative": False,
    ...         "contains_liquid": False,
    ...         "contains_nasal": False,
    ...         "short_vowel": True,
    ...         "long_vowel": False,
    ...         "ends_with_vowel": True,
    ...         "ends_with_nasal": False,
    ...         "ends_with_stop": False,
    ...     }},
    ...     {"syllable": "li", "features": {
    ...         "starts_with_vowel": False,
    ...         "starts_with_cluster": False,
    ...         "starts_with_heavy_cluster": False,
    ...         "contains_plosive": False,
    ...         "contains_fricative": False,
    ...         "contains_liquid": True,
    ...         "contains_nasal": False,
    ...         "short_vowel": True,
    ...         "long_vowel": False,
    ...         "ends_with_vowel": True,
    ...         "ends_with_nasal": False,
    ...         "ends_with_stop": False,
    ...     }},
    ... ]
    >>> features = aggregate_features(syllables)
    >>> features["starts_with_vowel"]  # From first syllable ("ka")
    False
    >>> features["ends_with_vowel"]  # From final syllable ("li")
    True
    >>> features["contains_liquid"]  # OR: True because "li" has it
    True
    >>> features["short_vowel"]  # Majority: 2/2 = 100% > 50%
    True

    Notes
    -----
    Aggregation follows these rules:

    - **Onset** (starts_with_*): First syllable only
    - **Coda** (ends_with_*): Final syllable only
    - **Internal** (contains_*): OR across all syllables
    - **Nucleus** (short_vowel, long_vowel): Majority rule (>50%)

    See module docstring for detailed rationale on majority rule.
    """
    if not syllables:
        raise ValueError("Cannot aggregate features from empty syllable list")

    # Validate structure
    for i, syl in enumerate(syllables):
        if "features" not in syl:
            raise ValueError(f"Syllable at index {i} missing 'features' key")

    first_features = syllables[0]["features"]
    final_features = syllables[-1]["features"]

    result: dict[str, bool] = {}

    # Onset features: first syllable only
    for feature in ONSET_FEATURES:
        result[feature] = first_features.get(feature, False)

    # Coda features: final syllable only
    for feature in CODA_FEATURES:
        result[feature] = final_features.get(feature, False)

    # Internal features: OR across all syllables
    for feature in INTERNAL_FEATURES:
        result[feature] = any(syl["features"].get(feature, False) for syl in syllables)

    # Nucleus features: majority rule (>50%)
    total = len(syllables)
    for feature in NUCLEUS_FEATURES:
        count = sum(1 for syl in syllables if syl["features"].get(feature, False))
        # Majority means strictly more than half
        # For 2 syllables: need 2 (>1)
        # For 3 syllables: need 2 (>1.5)
        # For 4 syllables: need 3 (>2)
        result[feature] = count > total / 2

    return result