build_tools.syllable_walk.walker
Syllable Walker: Navigate syllable feature space via cost-based random walks.
This module provides an efficient implementation for exploring large syllable datasets (500k+ entries) using topology-respecting random walks through phonetic feature space.
The core algorithm:
1. Pre-compute a neighbor graph during initialization (one-time O(N^2) cost)
2. For each walk step, select the next syllable from the neighbors using weighted probabilities
3. Weights are based on feature flip cost, frequency bias, and temperature
4. An inertia option allows staying at the current syllable
Key performance characteristics:
- Initialization: ~2-3 minutes for 500k syllables (builds neighbor graph)
- Walk generation: <10ms per walk after initialization
- Memory: ~200-300 MB for 500k syllables
Example
>>> walker = SyllableWalker("data/annotated/syllables_annotated.json")
>>> walk = walker.walk_from_profile(start="ka", profile="dialect", steps=5, seed=42)
>>> print(" → ".join(s["syllable"] for s in walk))
ka → ki → ti → ta → da → de
Attributes
Classes
SyllableWalker: Navigate syllable feature space via cost-based random walks.
Module Contents
- build_tools.syllable_walk.walker.FEATURE_KEYS = ['starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'contains_plosive',...
- build_tools.syllable_walk.walker.DEFAULT_INERTIA_COST = 0.5
- build_tools.syllable_walk.walker.DEFAULT_MAX_NEIGHBOR_DISTANCE = 3
- class build_tools.syllable_walk.walker.SyllableWalker(data_path, max_neighbor_distance=DEFAULT_MAX_NEIGHBOR_DISTANCE, feature_costs=None, inertia_cost=DEFAULT_INERTIA_COST, verbose=False)
Navigate syllable feature space via cost-based random walks.
This class efficiently handles large syllable datasets (500k+) by pre-computing neighbor relationships and using vectorized operations where possible.
The walker performs a one-time expensive computation during initialization to build a neighbor graph, mapping each syllable to nearby syllables within a maximum Hamming distance. After initialization, walk generation is extremely fast (<10ms per walk).
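The neighbor graph described above can be illustrated with a minimal pure-Python sketch. This is not the module's implementation (which is described as using NumPy arrays); the function names `hamming` and `build_neighbor_graph` and the four-bit toy vectors are assumptions made for illustration only.

```python
def hamming(a, b):
    """Count the positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def build_neighbor_graph(vectors, max_distance):
    """Map each index to the indices of vectors within max_distance flips.

    Brute-force pairwise comparison, so the cost grows as O(N^2),
    which is why it is worth doing only once at initialization.
    """
    graph = {i: [] for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if hamming(vectors[i], vectors[j]) <= max_distance:
                # Hamming distance is symmetric, so record both directions.
                graph[i].append(j)
                graph[j].append(i)
    return graph


# Toy feature vectors (hypothetical; real vectors have 12 features).
features = [
    (0, 0, 0, 0),
    (1, 0, 0, 0),
    (1, 1, 0, 0),
    (1, 1, 1, 1),
]
graph = build_neighbor_graph(features, max_distance=1)
```

With `max_distance=1`, the last vector has no neighbors because it differs from every other vector in at least two positions; raising the limit trades more memory for more candidate steps.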
- syllables
List of all syllable strings
- frequencies
NumPy array of syllable frequencies (uint32)
- feature_matrix
NumPy array of binary feature vectors (N x 12, uint8)
- syllable_to_idx
Dict mapping syllable text to index
- neighbor_graph
Dict mapping syllable index to list of neighbor indices
- max_neighbor_distance
Maximum Hamming distance for neighbors
- feature_costs
Dict of costs for each feature flip
- inertia_cost
Cost of staying at current syllable
Example
>>> walker = SyllableWalker("syllables_annotated.json", verbose=True)
>>> walk = walker.walk_from_profile(
...     start="ka", profile="dialect", steps=5, seed=42
... )
>>> print(walker.format_walk(walk))
ka → ki → ti → ta → da → de
Notes
Initialization time: ~2-3 minutes for 500k syllables
Walk generation: <10ms per walk after initialization
Memory usage: ~200-300 MB for 500k syllables
Thread safety: Not thread-safe (use separate instances)
Initialize the syllable walker with pre-computed neighbor graph.
- Parameters:
data_path (pathlib.Path | str) – Path to syllables_annotated.json file (output of syllable_feature_annotator)
max_neighbor_distance (int) – Maximum Hamming distance for pre-computing neighbors (1-3). Higher values yield more neighbors, which slows initialization and uses more memory, but allows larger feature flips per step. Default: 3 (recommended)
feature_costs (dict[str, float] | None) – Custom feature cost dictionary. If None, uses DEFAULT_FEATURE_COSTS. Keys must match FEATURE_KEYS.
inertia_cost (float) – Cost of staying at current syllable. Higher values discourage staying put. Default: 0.5
verbose (bool) – If True, print progress during initialization (neighbor graph construction can take 2-3 minutes for 500k syllables)
- Raises:
FileNotFoundError – If data_path does not exist
ValueError – If data_path is not valid JSON
ValueError – If feature_costs keys don’t match FEATURE_KEYS
ValueError – If max_neighbor_distance < 1 or > len(FEATURE_KEYS)
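The two option checks listed under Raises could look roughly like the sketch below. The helper name `validate_options` and the error messages are hypothetical, and the key list is only the visible prefix of FEATURE_KEYS, since the full list is truncated in this reference.

```python
# Visible prefix of FEATURE_KEYS only; the real list is longer (assumption).
KEYS = [
    "starts_with_vowel",
    "starts_with_cluster",
    "starts_with_heavy_cluster",
    "contains_plosive",
]


def validate_options(feature_keys, feature_costs, max_neighbor_distance):
    """Raise ValueError for the option errors documented above."""
    # Custom costs must define exactly one cost per known feature.
    if feature_costs is not None and set(feature_costs) != set(feature_keys):
        raise ValueError("feature_costs keys must match FEATURE_KEYS")
    # A distance above the number of features can never be reached.
    if not 1 <= max_neighbor_distance <= len(feature_keys):
        raise ValueError("max_neighbor_distance out of range")


validate_options(KEYS, None, 3)  # passes silently
```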
Notes
Initialization performs expensive one-time computation
Use verbose=True for long-running initializations
Consider caching the neighbor graph (future optimization)
- data_path
- max_neighbor_distance = 3
- feature_costs
- inertia_cost = 0.5
- verbose = False
- walk(start, steps, max_flips, temperature=1.0, frequency_weight=0.0, neighbor_limit=None, min_length=None, max_length=None, seed=None)
Execute a syllable walk through feature space.
Starting from a syllable, takes steps steps through feature space, choosing each next syllable probabilistically based on:
- Feature flip cost (weighted Hamming distance)
- Frequency cost (rarity penalty/bonus)
- Temperature (exploration vs exploitation)
- Inertia (tendency to stay put)
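As an illustration of the first factor, a weighted Hamming distance sums a per-feature cost over the features that differ. The sketch below assumes a `feature_costs` dict keyed by feature name; the cost values and the helper name `flip_cost` are hypothetical, and how the module combines this with the frequency cost is not specified here.

```python
def flip_cost(a, b, keys, feature_costs):
    """Sum the configured cost of every feature whose value differs."""
    return sum(
        feature_costs[key]
        for key, x, y in zip(keys, a, b)
        if x != y
    )


# First three names from FEATURE_KEYS; the cost values are made up.
keys = ["starts_with_vowel", "starts_with_cluster", "starts_with_heavy_cluster"]
costs = {
    "starts_with_vowel": 1.0,
    "starts_with_cluster": 0.5,
    "starts_with_heavy_cluster": 2.0,
}

# Two features differ (first and third), so their costs are added.
cost = flip_cost((0, 0, 1), (1, 0, 0), keys, costs)
```

With uniform costs of 1.0 this reduces to the plain Hamming distance.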
The walk uses softmax selection over candidate neighbors:
1. Find all neighbors within max_flips distance
2. Compute cost for each neighbor (flip cost + rarity cost)
3. Add inertia option (staying at current syllable)
4. Apply softmax with temperature: weight_i = exp(-cost_i / T)
5. Sample next syllable proportional to weights
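The softmax selection step can be sketched with the standard library alone. The formula weight_i = exp(-cost_i / T) comes from the description above; the helper names, the candidate list, and the cost values are assumptions for illustration.

```python
import math
import random


def softmax_weights(costs, temperature):
    """Normalized weights proportional to exp(-cost / temperature)."""
    raw = [math.exp(-cost / temperature) for cost in costs]
    total = sum(raw)
    return [w / total for w in raw]


def choose_next(candidates, costs, temperature, rng):
    """Sample one candidate with probability proportional to its weight."""
    weights = softmax_weights(costs, temperature)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical step: "ka" is the inertia option, the others are neighbors.
candidates = ["ka", "ki", "ta"]
costs = [0.5, 1.0, 2.0]

probs = softmax_weights(costs, temperature=0.7)
next_syllable = choose_next(candidates, costs, 0.7, random.Random(42))
```

Lower temperatures concentrate probability on the cheapest candidate, while higher temperatures flatten the distribution toward uniform.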
- Parameters:
start (int | str) – Starting syllable (syllable text or index)
steps (int) – Number of steps to take (each step visits one syllable)
max_flips (int) – Maximum feature flips allowed per step (1-3). Must be <= max_neighbor_distance from __init__.
temperature (float) – Exploration temperature (0.1-5.0). Higher values increase randomness. Typical values:
  - 0.3: Conservative, minimal exploration
  - 0.7: Balanced
  - 1.5: High exploration
  - 2.5: Maximum randomness
frequency_weight (float) – Frequency bias (-2.0 to 2.0). Typical values: -1.0, 0.0, 1.0
  - Positive: Favor common syllables
  - Zero: Neutral
  - Negative: Favor rare syllables
neighbor_limit (int | None) – Optional cap on neighbor candidates considered at each step. None means use all neighbors.
min_length (int | None) – Optional minimum syllable character length allowed during traversal.
max_length (int | None) – Optional maximum syllable character length allowed during traversal.
seed (int | None) – Random seed for reproducibility. Same seed = same walk. If None, uses system randomness (non-reproducible).
- Returns:
List of syllable dictionaries with keys:
  - "syllable": Syllable text (str)
  - "frequency": Corpus frequency (int)
  - "features": Binary feature vector (list of 12 ints)
Length = steps + 1 (includes starting syllable)
- Raises:
ValueError – If start syllable not found in dataset
ValueError – If max_flips > max_neighbor_distance
ValueError – If steps < 0
Example
>>> walker = SyllableWalker("data.json")
>>> walk = walker.walk(
...     start="ka",
...     steps=5,
...     max_flips=2,
...     temperature=0.7,
...     frequency_weight=0.0,
...     seed=42
... )
>>> len(walk)  # start + 5 steps
6
>>> walk[0]["syllable"]
'ka'
Notes
Deterministic: Same seed always produces same walk
Uses local Random instance (doesn’t affect global random state)
Inertia option allows walk to stay at current syllable
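The determinism notes above rely on a seeded generator that is local to the call. A minimal sketch of that pattern follows; the helper `pick` is hypothetical and is not part of the module.

```python
import random


def pick(items, seed=None):
    """Choose one item using a private generator seeded per call."""
    rng = random.Random(seed)  # local instance; module-level state untouched
    return rng.choice(items)


syllables = ["ka", "ki", "ta", "da"]

random.seed(123)
state_before = random.getstate()

first = pick(syllables, seed=42)
second = pick(syllables, seed=42)

state_after = random.getstate()
```

Here `first` and `second` are equal because both calls start from the same seed, and `state_before` equals `state_after` because the global generator was never consumed.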
- walk_from_profile(start, profile, steps=5, neighbor_limit=None, min_length=None, max_length=None, seed=None)
Execute a walk using a named profile.
Convenience method that uses predefined WalkProfile parameters. See WALK_PROFILES for available profiles.
- Parameters:
profile (str | build_tools.syllable_walk.profiles.WalkProfile) – Profile name (“clerical”, “dialect”, “goblin”, “ritual”) or WalkProfile object
steps (int) – Number of steps to take (default: 5)
neighbor_limit (int | None) – Optional cap on neighbors considered per step.
min_length (int | None) – Optional minimum syllable length allowed.
max_length (int | None) – Optional maximum syllable length allowed.
seed (int | None) – Random seed for reproducibility (default: None)
- Returns:
List of syllable dictionaries (same as walk())
- Raises:
ValueError – If profile name not found
Example
>>> walker = SyllableWalker("data.json")
>>> walk = walker.walk_from_profile("ka", "goblin", steps=10, seed=42)
>>> print(walker.format_walk(walk))
ka → kha → gha → ghe → ge → gwe → ...
- get_random_syllable(seed=None, min_length=None, max_length=None)
Get a random syllable from the dataset.
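- Parameters:
seed (int | None) – Random seed for reproducibility. Same seed = same result.
min_length (int | None) – Optional minimum syllable length allowed.
max_length (int | None) – Optional maximum syllable length allowed.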
- Returns:
Random syllable text
- Raises:
ValueError – If length constraints are invalid or no syllables match.
Example
>>> walker.get_random_syllable(seed=42)
'ka'
>>> walker.get_random_syllable(seed=42)  # Same seed = same result
'ka'
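The documented behavior (seeded choice, optional length bounds, ValueError when the constraints are invalid or nothing matches) can be approximated as below. This is a sketch under those stated assumptions, not the method's actual code; `random_syllable` and its messages are hypothetical.

```python
import random


def random_syllable(syllables, seed=None, min_length=None, max_length=None):
    """Pick a random syllable whose length falls within the given bounds."""
    if (min_length is not None and max_length is not None
            and min_length > max_length):
        raise ValueError("min_length must not exceed max_length")
    # Keep only syllables that satisfy whichever bounds were supplied.
    pool = [
        s for s in syllables
        if (min_length is None or len(s) >= min_length)
        and (max_length is None or len(s) <= max_length)
    ]
    if not pool:
        raise ValueError("no syllables match the length constraints")
    return random.Random(seed).choice(pool)


pool = ["a", "ka", "kha", "ghe"]
long_only = random_syllable(pool, seed=1, min_length=3)
```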
- get_syllable_info(syllable)
Get information about a specific syllable.
- Parameters:
syllable (str) – Syllable text to look up
- Returns:
Syllable dictionary with keys syllable, frequency, features. Returns None if syllable not found.
Example
>>> info = walker.get_syllable_info("ka")
>>> if info:
...     print(f"Frequency: {info['frequency']}")
Frequency: 1234
- format_walk(walk, arrow=' → ')
Format a walk as a string with arrows.
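- Parameters:
walk – Walk to format, as returned by walk() or walk_from_profile()
arrow – Separator placed between syllables (default: ' → ')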
- Returns:
Formatted walk string
Example
>>> walk = walker.walk_from_profile("ka", "dialect", steps=5, seed=42)
>>> walker.format_walk(walk)
'ka → ki → ti → ta → da → de'
>>> walker.format_walk(walk, arrow=" -> ")
'ka -> ki -> ti -> ta -> da -> de'
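The formatting shown above amounts to joining the "syllable" field of each step with the separator, as the module-level example does by hand. A standalone equivalent, assuming each step is a dict with a "syllable" key (the sample walk is made up):

```python
def format_walk(walk, arrow=" → "):
    """Join the syllable text of each step with the given separator."""
    return arrow.join(step["syllable"] for step in walk)


# Hypothetical two-step walk; frequency and features omitted for brevity.
sample = [{"syllable": "ka"}, {"syllable": "ki"}, {"syllable": "ti"}]

default_text = format_walk(sample)
ascii_text = format_walk(sample, arrow=" -> ")
```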
- get_available_profiles()
Get all available walk profiles.
- Returns:
Dictionary mapping profile names to WalkProfile objects
Example
>>> profiles = walker.get_available_profiles()
>>> for name in profiles:
...     print(name)
clerical
dialect
goblin
ritual
- classmethod from_data(data, max_neighbor_distance=DEFAULT_MAX_NEIGHBOR_DISTANCE, feature_costs=None, inertia_cost=DEFAULT_INERTIA_COST, verbose=False, progress_callback=None)
Create a SyllableWalker from in-memory data.
This is useful when syllable data is loaded from a source other than a JSON file (e.g., SQLite database).
- Parameters:
data (list[dict]) – List of syllable records, each with keys: ‘syllable’, ‘frequency’, ‘features’ (dict of bool values)
max_neighbor_distance (int) – Maximum Hamming distance for neighbors (1-3)
feature_costs (dict[str, float] | None) – Custom feature cost dictionary
inertia_cost (float) – Cost of staying at current syllable
verbose (bool) – If True, print progress during initialization
progress_callback (Callable[[str], None] | None) – Optional callable invoked with a progress message string during neighbor graph construction. Used by the web UI to show live loading progress to the user.
- Returns:
Initialized SyllableWalker instance
Example
>>> data = [
...     {"syllable": "ka", "frequency": 100,
...      "features": {"starts_with_vowel": False, ...}}
... ]
>>> walker = SyllableWalker.from_data(data, verbose=True)
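Records in this shape carry features as a dict of booleans, while the walker stores binary vectors. One plausible conversion is sketched below; it is an assumption about the approach, not the module's code. `FEATURE_KEYS_SUBSET` holds only the four key names visible in this reference (the full FEATURE_KEYS list is truncated), and `to_feature_vector` is a hypothetical helper.

```python
# Only the keys visible in this reference; the real FEATURE_KEYS is longer.
FEATURE_KEYS_SUBSET = [
    "starts_with_vowel",
    "starts_with_cluster",
    "starts_with_heavy_cluster",
    "contains_plosive",
]


def to_feature_vector(record, keys=FEATURE_KEYS_SUBSET):
    """Turn a record's boolean feature dict into a list of 0/1 ints."""
    missing = [k for k in keys if k not in record["features"]]
    if missing:
        raise ValueError(f"record is missing features: {missing}")
    # Fixed key order keeps every vector aligned column by column.
    return [int(bool(record["features"][k])) for k in keys]


record = {
    "syllable": "ka",
    "frequency": 100,
    "features": {
        "starts_with_vowel": False,
        "starts_with_cluster": False,
        "starts_with_heavy_cluster": False,
        "contains_plosive": True,
    },
}
vector = to_feature_vector(record)
```

Iterating over a fixed key list rather than the dict itself means the vector layout does not depend on the order in which a data source happens to emit its fields.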