build_tools.name_combiner.combiner

Core combination logic for name candidate generation.

This module provides the main combination functionality that takes an annotated syllable corpus and produces N-syllable name candidates with aggregated feature vectors.

The combiner is intentionally simple: it performs structural combination only, with no policy evaluation. Policy-based filtering is the responsibility of the name_selector module.

Combination Strategy

The default combination strategy uses frequency-weighted random sampling:

1. Load annotated syllables with their frequencies.
2. Build a weighted probability distribution (higher frequency = more likely).
3. Sample N syllables using the isolated RNG instance.
4. Concatenate the syllables to form a name.
5. Aggregate features using the rules in aggregator.py.

This produces candidates that reflect the natural distribution of the corpus while maintaining full determinism through seed control.
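The steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: `sketch_combine` is a hypothetical name, sampling with replacement is an assumption, and feature aggregation is left out because its rules live in aggregator.py and are not described here.

```python
import random


def sketch_combine(annotated_data, syllable_count, count, seed=None):
    """Illustrative only: mirrors the documented steps, minus feature aggregation."""
    # Isolated RNG instance; the module-level random state is never touched.
    rng = random.Random(seed)

    # Steps 1-2: pull out syllables and use raw frequencies as sampling weights.
    syllables = [entry["syllable"] for entry in annotated_data]
    weights = [entry["frequency"] for entry in annotated_data]

    candidates = []
    for _ in range(count):
        # Step 3: sample with replacement (an assumption), so a syllable may repeat.
        picked = rng.choices(syllables, weights=weights, k=syllable_count)
        # Step 4: concatenate the sampled syllables into a name.
        candidates.append({"name": "".join(picked), "syllables": picked})
    return candidates


corpus = [
    {"syllable": "ka", "frequency": 100},
    {"syllable": "li", "frequency": 50},
    {"syllable": "ra", "frequency": 75},
]
names = sketch_combine(corpus, syllable_count=2, count=5, seed=42)
```

Calling `sketch_combine` twice with the same corpus, arguments, and seed returns the same list, which is the property the Determinism section relies on.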

Determinism

Critical: Every combination run uses random.Random(seed) to create an isolated RNG instance. This ensures:

The same integer seed always produces identical candidates for the same corpus and parameters
No global state contamination
Reproducible builds across sessions
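A small standalone check of these guarantees, using only the standard library. `sample_indices` is a hypothetical helper written for this illustration; it is not part of the module.

```python
import random


def sample_indices(seed, n=5, upper=100):
    """Draw n integers from an RNG instance that is private to this call."""
    rng = random.Random(seed)
    return [rng.randrange(upper) for _ in range(n)]


# Snapshot the global RNG state before doing any isolated sampling.
random.seed(0)
state_before = random.getstate()

first = sample_indices(42)
second = sample_indices(42)

# The global RNG state should be unchanged by the isolated instances.
state_after = random.getstate()

same_output = first == second
global_untouched = state_before == state_after
```

Both `same_output` and `global_untouched` should be true: the private `random.Random` instance repeats its sequence for a given seed and never advances the module-level generator.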

Usage

>>> from build_tools.name_combiner.combiner import combine_syllables
>>> candidates = combine_syllables(
...     annotated_data=corpus,
...     syllable_count=2,
...     count=100,
...     seed=42,
... )
>>> for c in candidates[:3]:
...     print(f"{c['name']}: score-ready features")

Functions

combine_syllables(annotated_data, syllable_count, count, seed=None, frequency_weight=1.0)

Generate name candidates by combining syllables from an annotated corpus.

Module Contents

build_tools.name_combiner.combiner.combine_syllables(annotated_data, syllable_count, count, seed=None, frequency_weight=1.0)

Generate name candidates by combining syllables from an annotated corpus.

Takes an annotated syllable corpus and produces N-syllable name candidates with aggregated feature vectors suitable for policy evaluation.

Parameters

annotated_data (Sequence[dict]): List of annotated syllable dictionaries, each containing:
- "syllable" (str): The syllable text
- "frequency" (int): Occurrence count in source corpus
- "features" (dict[str, bool]): The 12 boolean features

syllable_count (int): Number of syllables per generated name (typically 2, 3, or 4).

count (int): Number of candidates to generate.

seed (int | None, optional): RNG seed for deterministic output. If None, uses system entropy. Default: None.

frequency_weight (float, optional): Weight for frequency-biased sampling. 0.0 = uniform sampling, 1.0 = fully frequency-weighted. Values between 0 and 1 interpolate. Default: 1.0.

Returns

list[dict]: List of candidate dictionaries, each containing:
- "name" (str): The combined name (concatenated syllables)
- "syllables" (list[str]): The constituent syllables
- "features" (dict[str, bool]): Aggregated name-level features
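For readers who prefer types, the returned shape could be described along these lines. The `Candidate` class and `is_well_formed` helper are hypothetical (the module is documented as returning plain dicts), and the feature names in the example are placeholders, since the 12 real feature names are not listed here.

```python
from typing import TypedDict


class Candidate(TypedDict):
    """Hypothetical typed view of one returned candidate dictionary."""
    name: str
    syllables: list[str]
    features: dict[str, bool]


def is_well_formed(candidate: Candidate) -> bool:
    """Check the two invariants the documentation implies."""
    return (
        # The name is the concatenation of its constituent syllables.
        candidate["name"] == "".join(candidate["syllables"])
        # Every aggregated feature is a boolean.
        and all(isinstance(v, bool) for v in candidate["features"].values())
    )


example: Candidate = {
    "name": "kali",
    "syllables": ["ka", "li"],
    # Placeholder feature names, for illustration only.
    "features": {"feature_a": True, "feature_b": False},
}
```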

Raises

ValueError: If annotated_data is empty or syllable_count < 1.
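A sketch of the kind of guard that would produce these errors. The helper name `validate_inputs` and the error messages are assumptions made for illustration; only the two conditions come from the documentation above.

```python
def validate_inputs(annotated_data, syllable_count):
    """Hypothetical pre-flight check mirroring the documented ValueError cases."""
    if not annotated_data:
        # An empty corpus leaves nothing to sample from.
        raise ValueError("annotated_data must not be empty")
    if syllable_count < 1:
        # A name needs at least one syllable.
        raise ValueError("syllable_count must be at least 1")


def raises_value_error(annotated_data, syllable_count):
    """Return True if the guard rejects the given inputs."""
    try:
        validate_inputs(annotated_data, syllable_count)
    except ValueError:
        return True
    return False


sample = [{"syllable": "ka", "frequency": 100, "features": {}}]
```

Callers that build the corpus dynamically may want to catch ValueError around `combine_syllables` rather than pre-checking, since the function already enforces both conditions.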

Examples

>>> corpus = [
...     {"syllable": "ka", "frequency": 100, "features": {...}},
...     {"syllable": "li", "frequency": 50, "features": {...}},
...     {"syllable": "ra", "frequency": 75, "features": {...}},
... ]
>>> candidates = combine_syllables(corpus, syllable_count=2, count=5, seed=42)
>>> len(candidates)
5
>>> candidates[0]["name"]  # Deterministic with seed=42; the value shown is illustrative
'kali'
>>> candidates[0]["syllables"]  # Always consistent with the name above
['ka', 'li']

Notes

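Determinism: Uses random.Random(seed) as an isolated RNG. The same integer seed always produces identical output for the same corpus and parameters. With seed=None the RNG is seeded from system entropy, so output varies between runs.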

Frequency Weighting: Higher frequency syllables are more likely to be sampled. This reflects the natural distribution of the source corpus and tends to produce more “natural-sounding” combinations.
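The documentation does not specify how frequency_weight values between 0 and 1 interpolate. One plausible reading, shown here purely as an assumption, is a linear blend of the uniform distribution and the normalized frequency distribution. `blended_weights` is a hypothetical helper, not the module's code.

```python
def blended_weights(frequencies, frequency_weight=1.0):
    """Linearly blend uniform and frequency-proportional probabilities.

    Assumed formula: p_i = (1 - w) / n + w * f_i / sum(f).
    """
    n = len(frequencies)
    total = sum(frequencies)
    w = frequency_weight
    return [(1.0 - w) / n + w * f / total for f in frequencies]


freqs = [100, 50, 75]  # frequencies of "ka", "li", "ra" from the Examples section

uniform = blended_weights(freqs, frequency_weight=0.0)   # every syllable equally likely
weighted = blended_weights(freqs, frequency_weight=1.0)  # proportional to frequency
halfway = blended_weights(freqs, frequency_weight=0.5)   # midpoint of the two
```

Under this reading the probabilities always sum to 1, and the result can be passed directly as the `weights` argument of `random.Random.choices`. Other interpolation schemes, such as raising each frequency to the power of the weight, would also satisfy the documented endpoints.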

No Policy Evaluation: This function performs structural combination only. Policy-based filtering is done by the name_selector module.