Name Combiner
Overview
Name Combiner - Structural Name Candidate Generation
Generates N-syllable name candidates from an annotated syllable corpus by combining syllables and aggregating their features to the name level. This is a build-time tool only - not used during runtime name generation.
This module is the first stage of the Selection Policy Layer. It performs structural combination without policy evaluation - that responsibility belongs to the name_selector module.
- Architectural Boundary:
Candidate generation is a structural step, not a decision-making step. All governance, admissibility, and rejection logic remains exclusively within the name_selector module.
Features:
- Deterministic combination with seed control
- Frequency-weighted syllable sampling
- Feature aggregation to name level (majority rule for nucleus)
- Output to the extraction run's candidates/ directory
Aggregation Rules:
- Onset features (starts_with_*): First syllable only
- Coda features (ends_with_*): Final syllable only
- Internal features (contains_*): Boolean OR across all syllables
- Nucleus features (short_vowel, long_vowel): Majority rule (>50%)
- Usage:
>>> from build_tools.name_combiner import combine_syllables, aggregate_features
>>> candidates = combine_syllables(annotated_data, syllable_count=2, count=100)
>>> for candidate in candidates:
...     print(f"{candidate['name']}: {candidate['features']}")
CLI:
python -m build_tools.name_combiner \
--run-dir _working/output/20260110_115453_pyphen/ \
--syllables 2 \
--count 10000 \
--seed 42
Command-Line Interface
Generate N-syllable name candidates from an annotated syllable corpus. Combines syllables structurally and aggregates features to the name level. This is a build-time tool for the Selection Policy Layer.
usage: python -m build_tools.name_combiner [-h] --run-dir RUN_DIR --syllables
{2,3,4} [--count COUNT]
[--seed SEED]
[--frequency-weight FREQUENCY_WEIGHT]
Named Arguments
- --run-dir
Path to extraction run directory containing annotated JSON. Example: _working/output/20260110_115453_pyphen/
- --syllables
Number of syllables per candidate name. Possible choices: 2, 3, 4.
- --count
Number of candidates to generate. Default: 10000.
- --seed
RNG seed for deterministic output. If not specified, uses system entropy. Same seed always produces identical candidates.
- --frequency-weight
Weight for frequency-biased sampling. 0.0 = uniform sampling, 1.0 = fully frequency-weighted. Values between 0 and 1 interpolate. Default: 1.0.
Examples:
# Generate 2-syllable candidates with default settings
python -m build_tools.name_combiner \
--run-dir _working/output/20260110_115453_pyphen/ \
--syllables 2
# Generate 10000 3-syllable candidates with fixed seed
python -m build_tools.name_combiner \
--run-dir _working/output/20260110_115453_pyphen/ \
--syllables 3 \
--count 10000 \
--seed 42
# Generate with uniform sampling (no frequency weighting)
python -m build_tools.name_combiner \
--run-dir _working/output/20260110_115453_pyphen/ \
--syllables 2 \
--frequency-weight 0.0
- Output:
Creates candidates/{prefix}_candidates_{N}syl.json in the run directory. The prefix (pyphen or nltk) is auto-detected from the run directory name.
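The auto-detection is described above but not shown. The following is a minimal sketch of how it could work, assuming the run directory name ends in _pyphen or _nltk as in the examples on this page. The helpers detect_prefix and candidates_path are hypothetical names used for illustration, not part of the module's API.

```python
from pathlib import Path

KNOWN_PREFIXES = ("pyphen", "nltk")  # extractor names that appear in this document


def detect_prefix(run_dir: str) -> str:
    """Hypothetical helper: infer the extractor prefix from the directory name."""
    name = Path(run_dir).name  # pathlib ignores a trailing slash
    for prefix in KNOWN_PREFIXES:
        if name.endswith(f"_{prefix}"):
            return prefix
    raise ValueError(f"Cannot detect extractor prefix from {name!r}")


def candidates_path(run_dir: str, syllable_count: int) -> Path:
    """Build the output path following the naming pattern described above."""
    prefix = detect_prefix(run_dir)
    filename = f"{prefix}_candidates_{syllable_count}syl.json"
    return Path(run_dir) / "candidates" / filename


print(candidates_path("_working/output/20260110_115453_pyphen/", 2))
```

Failing loudly on an unrecognised directory name is one possible design choice; the real tool may behave differently.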
Output Format
Input/Output Contract
Inputs (from syllable feature annotator):
<run_directory>/data/pyphen_syllables_annotated.json or nltk_syllables_annotated.json
Output (auto-detected from run directory):
<run_directory>/candidates/{prefix}_candidates_{N}syl.json
Example directory structure after combination:
_working/output/20260110_115453_pyphen/
├── candidates/
│ └── pyphen_candidates_2syl.json ← Generated output
├── data/
│ └── pyphen_syllables_annotated.json ← Input
├── meta/
├── syllables/
└── ...
Output Structure
The combiner produces JSON with this structure:
{
"metadata": {
"source_run": "20260110_115453_pyphen",
"source_annotated": "pyphen_syllables_annotated.json",
"syllable_count": 2,
"total_candidates": 10000,
"seed": 42,
"frequency_weight": 1.0,
"aggregation_rule": "majority",
"generated_at": "2026-01-10T12:00:00Z"
},
"candidates": [
{
"name": "kali",
"syllables": ["ka", "li"],
"features": {
"starts_with_vowel": false,
"starts_with_cluster": false,
"starts_with_heavy_cluster": false,
"contains_plosive": true,
"contains_fricative": false,
"contains_liquid": true,
"contains_nasal": false,
"short_vowel": true,
"long_vowel": false,
"ends_with_vowel": true,
"ends_with_nasal": false,
"ends_with_stop": false
}
}
]
}
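To illustrate how a downstream consumer might read this structure, here is a small sketch. The JSON is inlined and trimmed to three features so the snippet runs on its own; a real file would be read from the candidates/ directory and would carry all 12 features. The consistency checks are illustrative assumptions about what a consumer may want to verify, not behaviour of the combiner itself.

```python
import json

# Trimmed stand-in for a candidates file; real files carry all 12 features.
raw = """
{
  "metadata": {"syllable_count": 2, "total_candidates": 1, "seed": 42},
  "candidates": [
    {"name": "kali", "syllables": ["ka", "li"],
     "features": {"starts_with_vowel": false, "contains_liquid": true,
                  "ends_with_vowel": true}}
  ]
}
"""

data = json.loads(raw)

# Structural checks: syllable count matches metadata, name is the concatenation.
expected = data["metadata"]["syllable_count"]
for candidate in data["candidates"]:
    assert len(candidate["syllables"]) == expected
    assert "".join(candidate["syllables"]) == candidate["name"]

# Map each name to the features that are switched on.
active = {
    c["name"]: sorted(key for key, value in c["features"].items() if value)
    for c in data["candidates"]
}
print(active)
```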
Feature Aggregation Rules
Name-level features are aggregated from syllable-level features using these rules:
Onset features (starts_with_*): First syllable only
Coda features (ends_with_*): Final syllable only
Internal features (contains_*): Boolean OR across all syllables
Nucleus features (short_vowel, long_vowel): Majority rule (>50%)
The majority rule for nucleus features was chosen for architectural consistency with other feature categories (deterministic, binary outcome), and to maintain simplicity in the selection pipeline while being sufficient for the current 3-class system.
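The four rules can be expressed compactly. The sketch below is an illustrative re-implementation written from the rules as stated on this page, not the module's actual code; aggregate_sketch and make are hypothetical names.

```python
from typing import Sequence

ONSET = ("starts_with_vowel", "starts_with_cluster", "starts_with_heavy_cluster")
CODA = ("ends_with_vowel", "ends_with_nasal", "ends_with_stop")
INTERNAL = ("contains_plosive", "contains_fricative", "contains_liquid", "contains_nasal")
NUCLEUS = ("short_vowel", "long_vowel")


def aggregate_sketch(syllables: Sequence[dict]) -> dict:
    """Apply the four aggregation rules to a list of annotated syllables."""
    if not syllables:
        raise ValueError("at least one syllable is required")
    feats = [s["features"] for s in syllables]
    out = {}
    for key in ONSET:      # first syllable only
        out[key] = feats[0][key]
    for key in CODA:       # final syllable only
        out[key] = feats[-1][key]
    for key in INTERNAL:   # boolean OR across all syllables
        out[key] = any(f[key] for f in feats)
    for key in NUCLEUS:    # strict majority (>50% of syllables)
        out[key] = sum(f[key] for f in feats) * 2 > len(feats)
    return out


def make(**enabled) -> dict:
    """Hypothetical helper: a 12-feature dict with only the given keys set to True."""
    base = {key: False for key in ONSET + CODA + INTERNAL + NUCLEUS}
    base.update(enabled)
    return base


ka = {"syllable": "ka", "features": make(contains_plosive=True, short_vowel=True, ends_with_vowel=True)}
li = {"syllable": "li", "features": make(contains_liquid=True, short_vowel=True, ends_with_vowel=True)}
name = aggregate_sketch([ka, li])
print(sorted(key for key, value in name.items() if value))
```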
Integration Guide
The name combiner sits between the feature annotator and the name selector. It performs structural combination without policy evaluation - that responsibility belongs to the name_selector module.
Typical workflow:
# Step 1: Extract and normalize syllables
python -m build_tools.pyphen_syllable_extractor --file corpus.txt
python -m build_tools.pyphen_syllable_normaliser \
--run-dir _working/output/20260110_115453_pyphen/
# Step 2: Annotate with features
python -m build_tools.syllable_feature_annotator \
--syllables _working/output/20260110_115453_pyphen/pyphen_syllables_unique.txt \
--frequencies _working/output/20260110_115453_pyphen/pyphen_syllables_frequencies.json
# Step 3: Generate candidates
python -m build_tools.name_combiner \
--run-dir _working/output/20260110_115453_pyphen/ \
--syllables 2 \
--count 10000 \
--seed 42
# Step 4: Select names (see name_selector)
python -m build_tools.name_selector \
--run-dir _working/output/20260110_115453_pyphen/ \
--candidates candidates/pyphen_candidates_2syl.json \
--name-class first_name
When to use this tool:
After syllable annotation is complete
Before selecting names against policies
When you need large pools of name candidates for filtering
For deterministic name generation pipelines
Notes
Determinism:
The combiner uses random.Random(seed) for isolated RNG, ensuring the same seed always
produces identical candidates. This is critical for reproducible name generation.
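The effect of an isolated generator can be checked with the standard library alone. This sketch uses a toy syllable list and a hypothetical sample_names function; it only demonstrates the random.Random(seed) pattern, not the combiner's internals.

```python
import random

syllables = ["ka", "li", "ra", "mo", "te"]  # toy data for illustration


def sample_names(seed, count=3, length=2):
    """Draw names from a private generator so the global RNG is untouched."""
    rng = random.Random(seed)
    return ["".join(rng.choice(syllables) for _ in range(length)) for _ in range(count)]


random.seed(0)
global_state_before = random.getstate()

first = sample_names(42)
second = sample_names(42)

assert first == second                             # same seed, same names
assert random.getstate() == global_state_before    # global RNG state unchanged
```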
Frequency weighting:
frequency_weight=1.0 (default): High-frequency syllables dominate
frequency_weight=0.0: Uniform random sampling
Values between 0-1: Interpolated weighting
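This page does not specify how intermediate values are interpolated. One plausible reading, shown here purely as an assumption, is a linear blend between the uniform distribution and the normalised frequency distribution. The function blended_weights is hypothetical.

```python
import math


def blended_weights(frequencies, frequency_weight):
    """Linear blend of uniform and frequency-proportional probabilities.

    Assumption: this is one possible interpolation scheme, not necessarily
    the one used by the combiner.
    """
    if not 0.0 <= frequency_weight <= 1.0:
        raise ValueError("frequency_weight must be between 0.0 and 1.0")
    total = sum(frequencies)
    n = len(frequencies)
    return [
        frequency_weight * (f / total) + (1.0 - frequency_weight) * (1.0 / n)
        for f in frequencies
    ]


freqs = [100, 50, 75]
uniform = blended_weights(freqs, 0.0)   # every syllable equally likely
halfway = blended_weights(freqs, 0.5)   # between the two extremes
weighted = blended_weights(freqs, 1.0)  # proportional to corpus frequency

assert math.isclose(sum(halfway), 1.0)
```

Under this scheme the most frequent syllable keeps the highest probability for any weight above 0.0, and the gap between syllables narrows as the weight approaches 0.0.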
Build-time tool:
This is a build-time tool only - not used during runtime name generation.
API Reference
- build_tools.name_combiner.aggregate_features(syllables)[source]
Aggregate syllable-level features into a name-level feature vector.
Takes a sequence of syllable dictionaries (each with a “features” key) and produces a single boolean feature vector for the combined name.
Parameters
- syllables : Sequence[dict]
List of syllable dictionaries, each containing:
"syllable" (str): The syllable text
"features" (dict[str, bool]): The 12 boolean features
Returns
- dict[str, bool]
Name-level feature vector with all 12 features as booleans.
Raises
- ValueError
If syllables list is empty or missing required keys.
Examples
>>> syllables = [
...     {"syllable": "ka", "features": {
...         "starts_with_vowel": False,
...         "starts_with_cluster": False,
...         "starts_with_heavy_cluster": False,
...         "contains_plosive": True,
...         "contains_fricative": False,
...         "contains_liquid": False,
...         "contains_nasal": False,
...         "short_vowel": True,
...         "long_vowel": False,
...         "ends_with_vowel": True,
...         "ends_with_nasal": False,
...         "ends_with_stop": False,
...     }},
...     {"syllable": "li", "features": {
...         "starts_with_vowel": False,
...         "starts_with_cluster": False,
...         "starts_with_heavy_cluster": False,
...         "contains_plosive": False,
...         "contains_fricative": False,
...         "contains_liquid": True,
...         "contains_nasal": False,
...         "short_vowel": True,
...         "long_vowel": False,
...         "ends_with_vowel": True,
...         "ends_with_nasal": False,
...         "ends_with_stop": False,
...     }},
... ]
>>> features = aggregate_features(syllables)
>>> features["starts_with_vowel"]  # From first syllable ("ka")
False
>>> features["ends_with_vowel"]  # From final syllable ("li")
True
>>> features["contains_liquid"]  # OR: True because "li" has it
True
>>> features["short_vowel"]  # Majority: 2/2 = 100% > 50%
True
Notes
Aggregation follows these rules:
Onset (starts_with_*): First syllable only
Coda (ends_with_*): Final syllable only
Internal (contains_*): OR across all syllables
Nucleus (short_vowel, long_vowel): Majority rule (>50%)
See module docstring for detailed rationale on majority rule.
- build_tools.name_combiner.combine_syllables(annotated_data, syllable_count, count, seed=None, frequency_weight=1.0)[source]
Generate name candidates by combining syllables from an annotated corpus.
Takes an annotated syllable corpus and produces N-syllable name candidates with aggregated feature vectors suitable for policy evaluation.
Parameters
- annotated_data : Sequence[dict]
List of annotated syllable dictionaries, each containing:
"syllable" (str): The syllable text
"frequency" (int): Occurrence count in source corpus
"features" (dict[str, bool]): The 12 boolean features
- syllable_count : int
Number of syllables per generated name (typically 2, 3, or 4).
- count : int
Number of candidates to generate.
- seed : int | None, optional
RNG seed for deterministic output. If None, uses system entropy. Default: None.
- frequency_weight : float, optional
Weight for frequency-biased sampling. 0.0 = uniform sampling, 1.0 = fully frequency-weighted. Values between 0 and 1 interpolate. Default: 1.0.
Returns
- list[dict]
List of candidate dictionaries, each containing:
"name" (str): The combined name (concatenated syllables)
"syllables" (list[str]): The constituent syllables
"features" (dict[str, bool]): Aggregated name-level features
Raises
- ValueError
If annotated_data is empty or syllable_count < 1.
Examples
>>> corpus = [
...     {"syllable": "ka", "frequency": 100, "features": {...}},
...     {"syllable": "li", "frequency": 50, "features": {...}},
...     {"syllable": "ra", "frequency": 75, "features": {...}},
... ]
>>> candidates = combine_syllables(corpus, syllable_count=2, count=5, seed=42)
>>> len(candidates)
5
>>> candidates[0]["name"]  # Deterministic with seed=42
'kali'  # Example output
>>> candidates[0]["syllables"]
['ka', 'li']
Notes
Determinism: Uses random.Random(seed) for isolated RNG. Same seed always produces identical output.
Frequency Weighting: Higher frequency syllables are more likely to be sampled. This reflects the natural distribution of the source corpus and tends to produce more “natural-sounding” combinations.
No Policy Evaluation: This function performs structural combination only. Policy-based filtering is done by the name_selector module.
Feature aggregation for name-level evaluation.
This module implements the rules for aggregating syllable-level features into name-level features. The aggregation produces a boolean feature vector for each name candidate, enabling policy evaluation by the name_selector.
Aggregation Rules
- Onset Features (first syllable only):
starts_with_vowel
starts_with_cluster
starts_with_heavy_cluster
These features describe how a name begins. Only the first syllable’s onset features are relevant - internal syllable onsets don’t affect how the name “enters” the listener’s ear.
- Coda Features (final syllable only):
ends_with_vowel
ends_with_nasal
ends_with_stop
These features describe how a name ends. Only the final syllable’s coda features are relevant - internal syllable codas don’t affect how the name “lands” or closes.
- Internal Features (OR across all syllables):
contains_plosive
contains_fricative
contains_liquid
contains_nasal
These features describe the texture of a name. If ANY syllable contains the feature, the name has it. A name like “kalira” contains_liquid=True because “li” has a liquid, even though “ka” and “ra” might not.
- Nucleus Features (majority rule):
short_vowel
long_vowel
These features describe the dominant vowel character of a name. We use majority rule (>50% of syllables) to determine the name-level value. The rationale is given under Why Majority Rule for Nucleus Features.
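One consequence of the strict ">50%" threshold is worth spelling out: with an even number of syllables, an exact 50/50 split does not count as a majority. The sketch below applies the rule literally as stated here; the majority function is a hypothetical helper, and the real implementation may handle ties differently.

```python
def majority(values):
    """True when strictly more than half of the values are True."""
    return sum(values) * 2 > len(values)


# Three syllables, two with short vowels: 2/3 is a majority.
three_syllables = majority([True, True, False])

# Two syllables, one short and one long: 1/2 is not strictly more than half.
short_vowel = majority([True, False])
long_vowel = majority([False, True])

print(three_syllables, short_vowel, long_vowel)  # True False False
```

Read this way, a two-syllable name with one short-vowel and one long-vowel syllable would have both nucleus features set to False.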
Why Majority Rule for Nucleus Features
We use majority (>50% of syllables) rather than proportional scoring.
Preserves Architectural Consistency: The entire feature registry is built on boolean features. The policy matrix uses checkmark/tilde/cross symbols that map cleanly to boolean logic. Introducing fractional features would break this elegant simplicity.
Keeps the Implementation Simple: Majority rule means the name-level feature vector remains a simple boolean array, identical in structure to syllable-level vectors. No special cases, no type conversions.
Sufficient for Initial Policy Evaluation: For a first implementation, knowing “this name is mostly short-vowel” vs. “this name is mostly long-vowel” is enough information to make good selection decisions. Precise ratios are not needed yet.
Easier to Debug and Explain: When a name gets rejected, you can say “this name has short_vowel=true (2 of 3 syllables), which is discouraged for Place Names.” That’s clear and inspectable. Proportional scoring makes debugging harder.
Aligns with Project Philosophy: The system is about shape and suitability, not precise optimization. Majority rule captures the dominant character of a name, which is what matters for admissibility.
Future Consideration
If finer-grained nucleus control is needed, proportional scoring could be introduced as an optional mode. This would require extending the policy matrix to handle float thresholds (e.g., short_vowel > 0.6). For now, majority rule provides the right balance of simplicity and expressiveness.
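To make the contrast concrete, here is a hedged sketch of what a proportional mode might compute. Nothing like this exists in the tool according to this page; nucleus_ratio and the 0.6 threshold are illustrative only, with the threshold taken from the example above.

```python
def nucleus_ratio(syllable_features, key):
    """Fraction of syllables for which the boolean feature `key` is True."""
    return sum(f[key] for f in syllable_features) / len(syllable_features)


features = [
    {"short_vowel": True, "long_vowel": False},
    {"short_vowel": True, "long_vowel": False},
    {"short_vowel": False, "long_vowel": True},
]

ratio = nucleus_ratio(features, "short_vowel")  # 2 of 3 syllables
passes_threshold = ratio > 0.6   # hypothetical float policy threshold
majority_value = ratio > 0.5     # what the current boolean rule yields
```

The ratio carries more information than the boolean, but a policy would then need a float comparison instead of a simple true/false lookup, which is the extension described above.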
Usage
>>> from build_tools.name_combiner.aggregator import aggregate_features
>>> syllables = [
... {"syllable": "ka", "features": {"starts_with_vowel": False, ...}},
... {"syllable": "li", "features": {"contains_liquid": True, ...}},
... ]
>>> name_features = aggregate_features(syllables)
>>> name_features["starts_with_vowel"] # From first syllable
False
>>> name_features["contains_liquid"] # OR across all
True
Core combination logic for name candidate generation.
This module provides the main combination functionality that takes an annotated syllable corpus and produces N-syllable name candidates with aggregated feature vectors.
The combiner is intentionally simple - it performs structural combination without any policy evaluation. Policy-based filtering is the responsibility of the name_selector module.
Combination Strategy
The default combination strategy uses frequency-weighted random sampling:
Load annotated syllables with their frequencies
Build a weighted probability distribution (higher frequency = more likely)
Sample N syllables using the isolated RNG instance
Concatenate syllables to form a name
Aggregate features using the rules in aggregator.py
This produces candidates that reflect the natural distribution of the corpus while maintaining full determinism through seed control.
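The steps above can be sketched with the standard library. This is an illustrative outline, not the module's code: combine_sketch is a hypothetical name, feature aggregation (the last step) is omitted for brevity, and sampling with replacement via random.Random.choices is an assumption, since this page does not say whether a syllable may repeat within a name.

```python
import random

# Toy annotated corpus; feature dicts are left out to keep the sketch short.
corpus = [
    {"syllable": "ka", "frequency": 100},
    {"syllable": "li", "frequency": 50},
    {"syllable": "ra", "frequency": 75},
]


def combine_sketch(corpus, syllable_count, count, seed=None):
    """Frequency-weighted sampling with an isolated RNG instance."""
    rng = random.Random(seed)                              # isolated generator
    weights = [entry["frequency"] for entry in corpus]     # higher = more likely
    candidates = []
    for _ in range(count):
        # Assumption: syllables are drawn with replacement.
        picked = rng.choices(corpus, weights=weights, k=syllable_count)
        parts = [entry["syllable"] for entry in picked]
        candidates.append({"name": "".join(parts), "syllables": parts})
    return candidates


run_a = combine_sketch(corpus, syllable_count=2, count=5, seed=42)
run_b = combine_sketch(corpus, syllable_count=2, count=5, seed=42)
assert run_a == run_b  # identical seed gives identical candidates
```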
Determinism
Critical: All combination uses random.Random(seed) to create isolated RNG instances. This ensures:
Same seed always produces identical candidates
No global state contamination
Reproducible builds across sessions
Usage
>>> from build_tools.name_combiner.combiner import combine_syllables
>>> candidates = combine_syllables(
... annotated_data=corpus,
... syllable_count=2,
... count=100,
... seed=42,
... )
>>> for c in candidates[:3]:
... print(f"{c['name']}: score-ready features")