Name Combiner

Overview

Name Combiner - Structural Name Candidate Generation

Generates N-syllable name candidates from an annotated syllable corpus by combining syllables and aggregating their features to the name level. This is a build-time tool only - not used during runtime name generation.

This module is the first stage of the Selection Policy Layer. It performs structural combination without policy evaluation - that responsibility belongs to the name_selector module.

Architectural Boundary: Candidate generation is a structural step, not a decision-making step. All governance, admissibility, and rejection logic remains exclusively within the name_selector module.

Features:

- Deterministic combination with seed control
- Frequency-weighted syllable sampling
- Feature aggregation to name level (majority rule for nucleus)
- Output to extraction run’s candidates/ directory

Aggregation Rules:

- Onset features (starts_with_*): First syllable only
- Coda features (ends_with_*): Final syllable only
- Internal features (contains_*): Boolean OR across all syllables
- Nucleus features (short_vowel, long_vowel): Majority rule (>50%)

Usage:

>>> from build_tools.name_combiner import combine_syllables, aggregate_features
>>> candidates = combine_syllables(annotated_data, syllable_count=2, count=100)
>>> for candidate in candidates:
...     print(f"{candidate['name']}: {candidate['features']}")

CLI:

python -m build_tools.name_combiner \
    --run-dir _working/output/20260110_115453_pyphen/ \
    --syllables 2 \
    --count 10000 \
    --seed 42

Command-Line Interface

Generate N-syllable name candidates from an annotated syllable corpus. Combines syllables structurally and aggregates features to the name level. This is a build-time tool for the Selection Policy Layer.

usage: python -m build_tools.name_combiner [-h] --run-dir RUN_DIR --syllables
                                           {2,3,4} [--count COUNT]
                                           [--seed SEED]
                                           [--frequency-weight FREQUENCY_WEIGHT]

Named Arguments

--run-dir

Path to extraction run directory containing annotated JSON. Example: _working/output/20260110_115453_pyphen/

--syllables

Number of syllables per candidate name. Possible choices: 2, 3, 4.

--count

Number of candidates to generate. Default: 10000.

--seed

RNG seed for deterministic output. If not specified, uses system entropy. Same seed always produces identical candidates.

--frequency-weight

Weight for frequency-biased sampling. 0.0 = uniform sampling, 1.0 = fully frequency-weighted. Values between 0 and 1 interpolate. Default: 1.0.

Examples:

# Generate 2-syllable candidates with default settings
python -m build_tools.name_combiner \
    --run-dir _working/output/20260110_115453_pyphen/ \
    --syllables 2

# Generate 10000 3-syllable candidates with fixed seed
python -m build_tools.name_combiner \
    --run-dir _working/output/20260110_115453_pyphen/ \
    --syllables 3 \
    --count 10000 \
    --seed 42

# Generate with uniform sampling (no frequency weighting)
python -m build_tools.name_combiner \
    --run-dir _working/output/20260110_115453_pyphen/ \
    --syllables 2 \
    --frequency-weight 0.0

Output: Creates candidates/{prefix}_candidates_{N}syl.json in the run directory. The prefix (pyphen or nltk) is auto-detected from the run directory name.

Output Format

Input/Output Contract

Inputs (from syllable feature annotator):

<run_directory>/data/pyphen_syllables_annotated.json or nltk_syllables_annotated.json

Output (auto-detected from run directory):

<run_directory>/candidates/{prefix}_candidates_{N}syl.json

Example directory structure after combination:

_working/output/20260110_115453_pyphen/
├── candidates/
│   └── pyphen_candidates_2syl.json      ← Generated output
├── data/
│   └── pyphen_syllables_annotated.json  ← Input
├── meta/
├── syllables/
└── ...
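
The path convention above can be sketched as follows. The helper names detect_prefix and candidates_path are hypothetical, and matching the prefix against the directory-name suffix is an assumption; the source only states that the prefix is auto-detected from the run directory name.

```python
from pathlib import Path

KNOWN_PREFIXES = ("pyphen", "nltk")  # extractor names mentioned in this document


def detect_prefix(run_dir: Path) -> str:
    """Guess the extractor prefix from the run directory name (assumed rule)."""
    name = run_dir.name
    for prefix in KNOWN_PREFIXES:
        if name.endswith(f"_{prefix}"):
            return prefix
    raise ValueError(f"cannot detect prefix from run directory: {name}")


def candidates_path(run_dir: Path, syllable_count: int) -> Path:
    """Build <run_directory>/candidates/{prefix}_candidates_{N}syl.json."""
    prefix = detect_prefix(run_dir)
    return run_dir / "candidates" / f"{prefix}_candidates_{syllable_count}syl.json"


run_dir = Path("_working/output/20260110_115453_pyphen/")
print(candidates_path(run_dir, 2).as_posix())
```

A directory name that ends in neither suffix raises ValueError in this sketch; how the real tool reports that case is not documented here.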

Output Structure

The combiner produces JSON with this structure:

{
  "metadata": {
    "source_run": "20260110_115453_pyphen",
    "source_annotated": "pyphen_syllables_annotated.json",
    "syllable_count": 2,
    "total_candidates": 10000,
    "seed": 42,
    "frequency_weight": 1.0,
    "aggregation_rule": "majority",
    "generated_at": "2026-01-10T12:00:00Z"
  },
  "candidates": [
    {
      "name": "kali",
      "syllables": ["ka", "li"],
      "features": {
        "starts_with_vowel": false,
        "starts_with_cluster": false,
        "starts_with_heavy_cluster": false,
        "contains_plosive": true,
        "contains_fricative": false,
        "contains_liquid": true,
        "contains_nasal": false,
        "short_vowel": true,
        "long_vowel": false,
        "ends_with_vowel": true,
        "ends_with_nasal": false,
        "ends_with_stop": false
      }
    }
  ]
}
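
A downstream script might consume this structure roughly as shown here. This is a sketch only: it parses an inline, abbreviated copy of the example instead of a real file, and the vowel-final filter is purely illustrative.

```python
import json

# Abbreviated copy of the example structure above (features trimmed for brevity).
raw = """
{
  "metadata": {"syllable_count": 2, "total_candidates": 1, "seed": 42},
  "candidates": [
    {
      "name": "kali",
      "syllables": ["ka", "li"],
      "features": {"starts_with_vowel": false, "contains_liquid": true, "ends_with_vowel": true}
    }
  ]
}
"""

data = json.loads(raw)
meta = data["metadata"]

# Each candidate should have as many syllables as the metadata declares.
for candidate in data["candidates"]:
    assert len(candidate["syllables"]) == meta["syllable_count"]

# Illustrative filter: names whose final syllable ends with a vowel.
vowel_final = [c["name"] for c in data["candidates"] if c["features"]["ends_with_vowel"]]
print(vowel_final)
```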

Feature Aggregation Rules

Name-level features are aggregated from syllable-level features using these rules:

Onset features (starts_with_*): First syllable only
Coda features (ends_with_*): Final syllable only
Internal features (contains_*): Boolean OR across all syllables
Nucleus features (short_vowel, long_vowel): Majority rule (>50%)

Majority rule was chosen for nucleus features because it yields a deterministic, binary outcome like every other feature category, and because it keeps the selection pipeline simple while remaining sufficient for the current 3-class system.
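
The four rules can be illustrated with a small re-implementation. This is a sketch, not the library code: aggregate_sketch is a hypothetical name, the feature dictionaries are trimmed to five keys, and the annotations for "lee" are invented for the example.

```python
from typing import Sequence

NUCLEUS = ("short_vowel", "long_vowel")


def aggregate_sketch(syllables: Sequence[dict]) -> dict[str, bool]:
    """Illustrative re-implementation of the four rules; not the library code."""
    if not syllables:
        raise ValueError("at least one syllable is required")
    first, last = syllables[0]["features"], syllables[-1]["features"]
    result: dict[str, bool] = {}
    for key in first:
        if key.startswith("starts_with_"):
            result[key] = first[key]  # onset: first syllable only
        elif key.startswith("ends_with_"):
            result[key] = last[key]  # coda: final syllable only
        elif key.startswith("contains_"):
            result[key] = any(s["features"][key] for s in syllables)  # boolean OR
        elif key in NUCLEUS:
            hits = sum(s["features"][key] for s in syllables)
            result[key] = hits * 2 > len(syllables)  # strict majority (>50%)
    return result


# Hypothetical, trimmed annotations for two syllables.
ka = {"starts_with_vowel": False, "contains_liquid": False,
      "short_vowel": True, "long_vowel": False, "ends_with_vowel": True}
lee = {"starts_with_vowel": False, "contains_liquid": True,
       "short_vowel": False, "long_vowel": True, "ends_with_vowel": True}

name = aggregate_sketch([{"syllable": "ka", "features": ka},
                         {"syllable": "lee", "features": lee}])
print(name)
```

Under a strict >50% reading, a one-of-two tie leaves both short_vowel and long_vowel False for the name, which is worth keeping in mind when generating 2- and 4-syllable candidates.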

Integration Guide

The name combiner sits between the feature annotator and the name selector. It performs structural combination without policy evaluation - that responsibility belongs to the name_selector module.

Typical workflow:

# Step 1: Extract and normalize syllables
python -m build_tools.pyphen_syllable_extractor --file corpus.txt
python -m build_tools.pyphen_syllable_normaliser \
  --run-dir _working/output/20260110_115453_pyphen/

# Step 2: Annotate with features
python -m build_tools.syllable_feature_annotator \
  --syllables _working/output/20260110_115453_pyphen/pyphen_syllables_unique.txt \
  --frequencies _working/output/20260110_115453_pyphen/pyphen_syllables_frequencies.json

# Step 3: Generate candidates
python -m build_tools.name_combiner \
  --run-dir _working/output/20260110_115453_pyphen/ \
  --syllables 2 \
  --count 10000 \
  --seed 42

# Step 4: Select names (see name_selector)
python -m build_tools.name_selector \
  --run-dir _working/output/20260110_115453_pyphen/ \
  --candidates candidates/pyphen_candidates_2syl.json \
  --name-class first_name

When to use this tool:

After syllable annotation is complete
Before selecting names against policies
When you need large pools of name candidates for filtering
For deterministic name generation pipelines

Notes

Determinism:

The combiner uses random.Random(seed) for isolated RNG, ensuring the same seed always produces identical candidates. This is critical for reproducible name generation.
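
The isolation property can be checked with the standard library alone. In this sketch, sample_names is a hypothetical stand-in for the combiner's sampling step, not its actual code.

```python
import random

syllables = ["ka", "li", "ra"]


def sample_names(seed: int, count: int) -> list[str]:
    rng = random.Random(seed)  # private generator, independent of the global RNG
    return ["".join(rng.choices(syllables, k=2)) for _ in range(count)]


random.seed(0)
state_before = random.getstate()

first = sample_names(42, 5)
second = sample_names(42, 5)

print(first == second)                    # True: same seed, same sequence
print(random.getstate() == state_before)  # True: module-level RNG untouched
```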

Frequency weighting:

frequency_weight=1.0 (default): High-frequency syllables dominate
frequency_weight=0.0: Uniform random sampling
Values between 0-1: Interpolated weighting
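
The source does not spell out the interpolation formula. One plausible reading, shown here strictly as an assumption, is a linear blend between a uniform distribution and a frequency-proportional one; sampling_weights is a hypothetical helper name.

```python
def sampling_weights(frequencies: list[int], frequency_weight: float) -> list[float]:
    """Blend uniform and frequency-proportional probabilities (assumed formula)."""
    n = len(frequencies)
    total = sum(frequencies)
    return [
        (1.0 - frequency_weight) / n + frequency_weight * f / total
        for f in frequencies
    ]


freqs = [100, 50, 50]  # hypothetical counts for three syllables
for w in (0.0, 0.5, 1.0):
    print(w, sampling_weights(freqs, w))
```

With this formula the weights always sum to 1, so they can be passed directly as the weights argument of random.Random.choices.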

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

build_tools.name_combiner.aggregate_features(syllables)[source]

Aggregate syllable-level features into a name-level feature vector.

Takes a sequence of syllable dictionaries (each with a “features” key) and produces a single boolean feature vector for the combined name.

Return type: dict[str, bool]

Parameters

syllables (Sequence[dict]): List of syllable dictionaries, each containing:

- “syllable”: str - The syllable text
- “features”: dict[str, bool] - The 12 boolean features

Returns

dict[str, bool]: Name-level feature vector with all 12 features as booleans.

Raises

ValueError: If syllables list is empty or missing required keys.

Examples

>>> syllables = [
...     {"syllable": "ka", "features": {
...         "starts_with_vowel": False,
...         "starts_with_cluster": False,
...         "starts_with_heavy_cluster": False,
...         "contains_plosive": True,
...         "contains_fricative": False,
...         "contains_liquid": False,
...         "contains_nasal": False,
...         "short_vowel": True,
...         "long_vowel": False,
...         "ends_with_vowel": True,
...         "ends_with_nasal": False,
...         "ends_with_stop": False,
...     }},
...     {"syllable": "li", "features": {
...         "starts_with_vowel": False,
...         "starts_with_cluster": False,
...         "starts_with_heavy_cluster": False,
...         "contains_plosive": False,
...         "contains_fricative": False,
...         "contains_liquid": True,
...         "contains_nasal": False,
...         "short_vowel": True,
...         "long_vowel": False,
...         "ends_with_vowel": True,
...         "ends_with_nasal": False,
...         "ends_with_stop": False,
...     }},
... ]
>>> features = aggregate_features(syllables)
>>> features["starts_with_vowel"]  # From first syllable ("ka")
False
>>> features["ends_with_vowel"]  # From final syllable ("li")
True
>>> features["contains_liquid"]  # OR: True because "li" has it
True
>>> features["short_vowel"]  # Majority: 2/2 = 100% > 50%
True

Notes

Aggregation follows these rules:

Onset (starts_with_*): First syllable only
Coda (ends_with_*): Final syllable only
Internal (contains_*): OR across all syllables
Nucleus (short_vowel, long_vowel): Majority rule (>50%)

See module docstring for detailed rationale on majority rule.

build_tools.name_combiner.combine_syllables(annotated_data, syllable_count, count, seed=None, frequency_weight=1.0)[source]

Generate name candidates by combining syllables from an annotated corpus.

Takes an annotated syllable corpus and produces N-syllable name candidates with aggregated feature vectors suitable for policy evaluation.

Return type: list[dict]

Parameters

annotated_data (Sequence[dict]): List of annotated syllable dictionaries, each containing:

- “syllable”: str - The syllable text
- “frequency”: int - Occurrence count in source corpus
- “features”: dict[str, bool] - The 12 boolean features

syllable_count (int): Number of syllables per generated name (typically 2, 3, or 4).
count (int): Number of candidates to generate.
seed (int | None, optional): RNG seed for deterministic output. If None, uses system entropy. Default: None.
frequency_weight (float, optional): Weight for frequency-biased sampling. 0.0 = uniform sampling, 1.0 = fully frequency-weighted. Values between 0 and 1 interpolate. Default: 1.0.

Returns

list[dict]: List of candidate dictionaries, each containing:

- “name”: str - The combined name (concatenated syllables)
- “syllables”: list[str] - The constituent syllables
- “features”: dict[str, bool] - Aggregated name-level features

Raises

ValueError: If annotated_data is empty or syllable_count < 1.

Examples

>>> corpus = [
...     {"syllable": "ka", "frequency": 100, "features": {...}},
...     {"syllable": "li", "frequency": 50, "features": {...}},
...     {"syllable": "ra", "frequency": 75, "features": {...}},
... ]
>>> candidates = combine_syllables(corpus, syllable_count=2, count=5, seed=42)
>>> len(candidates)
5
>>> candidates[0]["name"]  # Deterministic with seed=42; 'kali' is illustrative
'kali'
>>> candidates[0]["syllables"]
['ka', 'li']

Notes

Determinism: Uses random.Random(seed) for isolated RNG. Same seed always produces identical output.

Frequency Weighting: Higher frequency syllables are more likely to be sampled. This reflects the natural distribution of the source corpus and tends to produce more “natural-sounding” combinations.

No Policy Evaluation: This function performs structural combination only. Policy-based filtering is done by the name_selector module.

Feature aggregation for name-level evaluation.

This module implements the rules for aggregating syllable-level features into name-level features. The aggregation produces a boolean feature vector for each name candidate, enabling policy evaluation by the name_selector.

Aggregation Rules

Onset Features (first syllable only):

starts_with_vowel
starts_with_cluster
starts_with_heavy_cluster

These features describe how a name begins. Only the first syllable’s onset features are relevant - internal syllable onsets don’t affect how the name “enters” the listener’s ear.

Coda Features (final syllable only):

ends_with_vowel
ends_with_nasal
ends_with_stop

These features describe how a name ends. Only the final syllable’s coda features are relevant - internal syllable codas don’t affect how the name “lands” or closes.

Internal Features (OR across all syllables):

contains_plosive
contains_fricative
contains_liquid
contains_nasal

These features describe the texture of a name. If ANY syllable contains the feature, the name has it. A name like “kalira” contains_liquid=True because “li” has a liquid, even though “ka” and “ra” might not.

Nucleus Features (majority rule):

short_vowel
long_vowel

These features describe the dominant vowel character of a name. We use majority rule (>50% of syllables) to determine the name-level value. The rationale is given under “Why Majority Rule for Nucleus Features”.

Why Majority Rule for Nucleus Features

We use majority (>50% of syllables) rather than proportional scoring.

Preserves Architectural Consistency: The entire feature registry is built on boolean features. The policy matrix uses checkmark/tilde/cross symbols that map cleanly to boolean logic. Introducing fractional features would break this elegant simplicity.
Keeps the Implementation Simple: Majority rule means the name-level feature vector remains a simple boolean array, identical in structure to syllable-level vectors. No special cases, no type conversions.
Sufficient for Initial Policy Evaluation: For a first implementation, knowing “this name is mostly short-vowel” vs. “this name is mostly long-vowel” is enough information to make good selection decisions. Precise ratios are not needed yet.
Easier to Debug and Explain: When a name gets rejected, you can say “this name has short_vowel=true (2 of 3 syllables), which is discouraged for Place Names.” That’s clear and inspectable. Proportional scoring makes debugging harder.
Aligns with Project Philosophy: The system is about shape and suitability, not precise optimization. Majority rule captures the dominant character of a name, which is what matters for admissibility.

Future Consideration

If finer-grained nucleus control is needed, proportional scoring could be introduced as an optional mode. This would require extending the policy matrix to handle float thresholds (e.g., short_vowel > 0.6). For now, majority rule provides the right balance of simplicity and expressiveness.
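
To make the contrast concrete, the sketch below computes a proportional score next to the current boolean outcome. Nothing here exists in the module: nucleus_fraction is a hypothetical helper, the three-syllable name is invented, and 0.6 is the example threshold quoted above.

```python
def nucleus_fraction(syllables: list[dict], feature: str) -> float:
    """Share of syllables whose features mark `feature` as True."""
    return sum(s["features"][feature] for s in syllables) / len(syllables)


# Hypothetical three-syllable name: two short-vowel syllables, one long.
syllables = [
    {"syllable": "ka", "features": {"short_vowel": True, "long_vowel": False}},
    {"syllable": "li", "features": {"short_vowel": True, "long_vowel": False}},
    {"syllable": "roo", "features": {"short_vowel": False, "long_vowel": True}},
]

share = nucleus_fraction(syllables, "short_vowel")
majority = share > 0.5          # current boolean rule
passes_threshold = share > 0.6  # example float threshold

print(round(share, 3), majority, passes_threshold)
```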

Usage

>>> from build_tools.name_combiner.aggregator import aggregate_features
>>> syllables = [
...     {"syllable": "ka", "features": {"starts_with_vowel": False, ...}},
...     {"syllable": "li", "features": {"contains_liquid": True, ...}},
... ]
>>> name_features = aggregate_features(syllables)
>>> name_features["starts_with_vowel"]  # From first syllable
False
>>> name_features["contains_liquid"]  # OR across all
True

Core combination logic for name candidate generation.

This module provides the main combination functionality that takes an annotated syllable corpus and produces N-syllable name candidates with aggregated feature vectors.

The combiner is intentionally simple - it performs structural combination without any policy evaluation. Policy-based filtering is the responsibility of the name_selector module.

Combination Strategy

The default combination strategy uses frequency-weighted random sampling:

Load annotated syllables with their frequencies
Build a weighted probability distribution (higher frequency = more likely)
Sample N syllables using the isolated RNG instance
Concatenate syllables to form a name
Aggregate features using the rules in aggregator.py

This produces candidates that reflect the natural distribution of the corpus while maintaining full determinism through seed control.
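
The steps above can be sketched with the standard library. This is an illustration under stated assumptions, not the module's code: combine_sketch is a hypothetical name, only the frequency_weight=1.0 case is shown, sampling is done with replacement (the source does not say whether a syllable may repeat within a name), and feature aggregation is left out.

```python
import random

# Step 1: a tiny hypothetical annotated corpus (features omitted for brevity).
corpus = [
    {"syllable": "ka", "frequency": 100},
    {"syllable": "li", "frequency": 50},
    {"syllable": "ra", "frequency": 75},
]


def combine_sketch(corpus, syllable_count, count, seed=None):
    rng = random.Random(seed)                   # isolated RNG instance
    weights = [s["frequency"] for s in corpus]  # step 2: frequency weights
    candidates = []
    for _ in range(count):
        # Step 3: weighted sampling, with replacement (assumption).
        picked = rng.choices(corpus, weights=weights, k=syllable_count)
        parts = [s["syllable"] for s in picked]
        # Step 4: concatenate syllables to form the name.
        candidates.append({"name": "".join(parts), "syllables": parts})
        # Step 5 (feature aggregation) would attach a "features" dict here.
    return candidates


a = combine_sketch(corpus, syllable_count=2, count=5, seed=42)
b = combine_sketch(corpus, syllable_count=2, count=5, seed=42)
print(a == b, len(a))
```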

Determinism

Critical: All combination uses random.Random(seed) to create isolated RNG instances. This ensures:

Same seed always produces identical candidates
No global state contamination
Reproducible builds across sessions

Usage

>>> from build_tools.name_combiner.combiner import combine_syllables
>>> candidates = combine_syllables(
...     annotated_data=corpus,
...     syllable_count=2,
...     count=100,
...     seed=42,
... )
>>> for c in candidates[:3]:
...     print(f"{c['name']}: score-ready features")
