build_tools.name_combiner.combiner
==================================

.. py:module:: build_tools.name_combiner.combiner

.. autoapi-nested-parse::

   Core combination logic for name candidate generation.

   This module provides the main combination functionality that takes an
   annotated syllable corpus and produces N-syllable name candidates with
   aggregated feature vectors.

   The combiner is intentionally simple - it performs structural combination
   without any policy evaluation. Policy-based filtering is the responsibility
   of the name_selector module.

   Combination Strategy
   --------------------

   The default combination strategy uses frequency-weighted random sampling:

   1. Load annotated syllables with their frequencies
   2. Build a weighted probability distribution (higher frequency = more likely)
   3. Sample N syllables using the isolated RNG instance
   4. Concatenate syllables to form a name
   5. Aggregate features using the rules in aggregator.py

   This produces candidates that reflect the natural distribution of the corpus
   while maintaining full determinism through seed control.

   Determinism
   -----------

   **Critical**: All combination uses ``random.Random(seed)`` to create isolated
   RNG instances. This ensures:

   - Same seed always produces identical candidates
   - No global state contamination
   - Reproducible builds across sessions

   Usage
   -----

   >>> from build_tools.name_combiner.combiner import combine_syllables
   >>> candidates = combine_syllables(
   ...     annotated_data=corpus,
   ...     syllable_count=2,
   ...     count=100,
   ...     seed=42,
   ... )
   >>> for c in candidates[:3]:
   ...     print(f"{c['name']}: score-ready features")


Functions
---------

.. autoapisummary::

   build_tools.name_combiner.combiner.combine_syllables


Module Contents
---------------

.. py:function:: combine_syllables(annotated_data, syllable_count, count, seed = None, frequency_weight = 1.0)

   Generate name candidates by combining syllables from an annotated corpus.

   Takes an annotated syllable corpus and produces N-syllable name candidates
   with aggregated feature vectors suitable for policy evaluation.

   Parameters
   ----------
   annotated_data : Sequence[dict]
       List of annotated syllable dictionaries, each containing:

       - "syllable": str - The syllable text
       - "frequency": int - Occurrence count in source corpus
       - "features": dict[str, bool] - The 12 boolean features
   syllable_count : int
       Number of syllables per generated name (typically 2, 3, or 4).
   count : int
       Number of candidates to generate.
   seed : int | None, optional
       RNG seed for deterministic output. If None, uses system entropy.
       Default: None.
   frequency_weight : float, optional
       Weight for frequency-biased sampling. 0.0 = uniform sampling,
       1.0 = fully frequency-weighted. Values between 0 and 1 interpolate.
       Default: 1.0.

   Returns
   -------
   list[dict]
       List of candidate dictionaries, each containing:

       - "name": str - The combined name (concatenated syllables)
       - "syllables": list[str] - The constituent syllables
       - "features": dict[str, bool] - Aggregated name-level features

   Raises
   ------
   ValueError
       If annotated_data is empty or syllable_count < 1.

   Examples
   --------
   >>> corpus = [
   ...     {"syllable": "ka", "frequency": 100, "features": {...}},
   ...     {"syllable": "li", "frequency": 50, "features": {...}},
   ...     {"syllable": "ra", "frequency": 75, "features": {...}},
   ... ]
   >>> candidates = combine_syllables(corpus, syllable_count=2, count=5, seed=42)
   >>> len(candidates)
   5
   >>> candidates[0]["name"]  # Deterministic with seed=42
   'kali'  # Example output
   >>> candidates[0]["syllables"]
   ['ka', 'li']

   Notes
   -----
   **Determinism**: Uses ``random.Random(seed)`` for isolated RNG. Same seed
   always produces identical output.

   **Frequency Weighting**: Higher frequency syllables are more likely to be
   sampled. This reflects the natural distribution of the source corpus and
   tends to produce more "natural-sounding" combinations.

   **No Policy Evaluation**: This function performs structural combination only.
   Policy-based filtering is done by the name_selector module.
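
The sampling steps described above (weighted distribution, isolated RNG, concatenation) can be illustrated with a minimal sketch. This is not the module's actual implementation: ``sketch_combine`` is a hypothetical name, the linear blend used for ``frequency_weight`` is an assumed interpretation of "values between 0 and 1 interpolate", and feature aggregation is left out because the rules in aggregator.py are not shown here.

```python
import random
from typing import Optional, Sequence


def sketch_combine(
    annotated_data: Sequence[dict],
    syllable_count: int,
    count: int,
    seed: Optional[int] = None,
    frequency_weight: float = 1.0,
) -> list:
    """Hypothetical sketch of frequency-weighted syllable combination."""
    if not annotated_data or syllable_count < 1:
        raise ValueError("annotated_data must be non-empty and syllable_count >= 1")

    # Isolated RNG instance: the global `random` module state is never touched.
    rng = random.Random(seed)

    n = len(annotated_data)
    total = sum(entry["frequency"] for entry in annotated_data)
    if total <= 0:
        # No usable frequency information: fall back to uniform weights.
        weights = [1.0] * n
    else:
        # ASSUMPTION: linear blend between a uniform distribution (weight 0.0)
        # and the normalized frequency distribution (weight 1.0).
        weights = [
            (1.0 - frequency_weight) / n
            + frequency_weight * entry["frequency"] / total
            for entry in annotated_data
        ]

    candidates = []
    for _ in range(count):
        # Sample with replacement, so a syllable may repeat within a name.
        picked = rng.choices(annotated_data, weights=weights, k=syllable_count)
        syllables = [entry["syllable"] for entry in picked]
        # Feature aggregation is intentionally omitted from this sketch.
        candidates.append({"name": "".join(syllables), "syllables": syllables})
    return candidates
```

Calling ``sketch_combine`` twice with the same corpus and seed yields identical lists, and a call made between two draws from the global ``random`` module does not change what those draws return, which is the property the Determinism notes rely on.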