build_tools.syllable_walk.walker
================================

.. py:module:: build_tools.syllable_walk.walker

.. autoapi-nested-parse::

   Syllable Walker: Navigate syllable feature space via cost-based random walks.

   This module provides an efficient implementation for exploring large syllable
   datasets (500k+ entries) using topology-respecting random walks through
   phonetic feature space.

   The core algorithm:

   1. Pre-compute a neighbor graph during initialization (one-time O(N^2) cost)
   2. For each walk step, select the next syllable from the neighbors using weighted probabilities
   3. Weights are based on feature flip cost, frequency bias, and temperature
   4. An inertia option allows staying at the current syllable

   Key performance characteristics:

   - Initialization: ~2-3 minutes for 500k syllables (builds neighbor graph)
   - Walk generation: <10ms per walk after initialization
   - Memory: ~200-300 MB for 500k syllables

   .. admonition:: Example

      >>> walker = SyllableWalker("data/annotated/syllables_annotated.json")
      >>> walk = walker.walk_from_profile(start="ka", profile="dialect", steps=5, seed=42)
      >>> print(" → ".join(s["syllable"] for s in walk))
      ka → ki → ti → ta → da → de


Attributes
----------

.. autoapisummary::

   build_tools.syllable_walk.walker.FEATURE_KEYS
   build_tools.syllable_walk.walker.DEFAULT_FEATURE_COSTS
   build_tools.syllable_walk.walker.DEFAULT_INERTIA_COST
   build_tools.syllable_walk.walker.DEFAULT_MAX_NEIGHBOR_DISTANCE


Classes
-------

.. autoapisummary::

   build_tools.syllable_walk.walker.SyllableWalker


Module Contents
---------------

.. py:data:: FEATURE_KEYS
   :value: ['starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'contains_plosive',...

.. py:data:: DEFAULT_FEATURE_COSTS

.. py:data:: DEFAULT_INERTIA_COST
   :value: 0.5

.. py:data:: DEFAULT_MAX_NEIGHBOR_DISTANCE
   :value: 3

.. py:class:: SyllableWalker(data_path, max_neighbor_distance = DEFAULT_MAX_NEIGHBOR_DISTANCE, feature_costs = None, inertia_cost = DEFAULT_INERTIA_COST, verbose = False)

   Navigate syllable feature space via cost-based random walks.

   This class efficiently handles large syllable datasets (500k+) by pre-computing
   neighbor relationships and using vectorized operations where possible.

   The walker performs a one-time expensive computation during initialization to
   build a neighbor graph, mapping each syllable to nearby syllables within a
   maximum Hamming distance. After initialization, walk generation is fast
   (<10ms per walk).

   .. attribute:: syllables

      List of all syllable strings

   .. attribute:: frequencies

      NumPy array of syllable frequencies (uint32)

   .. attribute:: feature_matrix

      NumPy array of binary feature vectors (N x 12, uint8)

   .. attribute:: syllable_to_idx

      Dict mapping syllable text to index

   .. attribute:: neighbor_graph

      Dict mapping syllable index to list of neighbor indices

   .. attribute:: max_neighbor_distance

      Maximum Hamming distance for neighbors

   .. attribute:: feature_costs

      Dict of costs for each feature flip

   .. attribute:: inertia_cost

      Cost of staying at current syllable

   .. admonition:: Example

      >>> walker = SyllableWalker("syllables_annotated.json", verbose=True)
      >>> walk = walker.walk_from_profile(
      ...     start="ka", profile="dialect", steps=5, seed=42
      ... )
      >>> print(walker.format_walk(walk))
      ka → ki → ti → ta → da → de

   .. admonition:: Notes

      - Initialization time: ~2-3 minutes for 500k syllables
      - Walk generation: <10ms per walk after initialization
      - Memory usage: ~200-300 MB for 500k syllables
      - Thread safety: Not thread-safe (use separate instances)

   Initialize the syllable walker with pre-computed neighbor graph.

   :param data_path: Path to syllables_annotated.json file
                     (output of syllable_feature_annotator)
   :param max_neighbor_distance: Maximum Hamming distance for pre-computing
                                 neighbors (1-3). Higher values yield more neighbors,
                                 which slows initialization and uses more memory but
                                 allows larger feature flips per step.
                                 Default: 3 (recommended)
   :param feature_costs: Custom feature cost dictionary. If None, uses
                         DEFAULT_FEATURE_COSTS. Keys must match FEATURE_KEYS.
   :param inertia_cost: Cost of staying at current syllable. Higher values
                        discourage staying put. Default: 0.5
   :param verbose: If True, print progress during initialization
                   (neighbor graph construction can take 2-3 minutes for 500k syllables)

   :raises FileNotFoundError: If data_path does not exist
   :raises ValueError: If data_path is not valid JSON
   :raises ValueError: If feature_costs keys don't match FEATURE_KEYS
   :raises ValueError: If max_neighbor_distance < 1 or > len(FEATURE_KEYS)

   .. admonition:: Notes

      - Initialization performs expensive one-time computation
      - Use verbose=True for long-running initializations
      - Consider caching the neighbor graph (future optimization)


   .. py:attribute:: data_path

   .. py:attribute:: max_neighbor_distance
      :value: 3

   .. py:attribute:: feature_costs

   .. py:attribute:: inertia_cost
      :value: 0.5

   .. py:attribute:: verbose
      :value: False

   .. py:attribute:: syllables
      :type:  List[str]
      :value: []

   .. py:attribute:: frequencies
      :type:  Optional[numpy.ndarray]
      :value: None

   .. py:attribute:: feature_matrix
      :type:  Optional[numpy.ndarray]
      :value: None

   .. py:attribute:: syllable_to_idx
      :type:  Dict[str, int]

   .. py:attribute:: neighbor_graph
      :type:  Dict[int, List[int]]

   .. py:method:: walk(start, steps, max_flips, temperature = 1.0, frequency_weight = 0.0, seed = None)

      Execute a syllable walk through feature space.

      Starting from a syllable, takes `steps` steps through feature space,
      choosing each next syllable probabilistically based on:

      - Feature flip cost (weighted Hamming distance)
      - Frequency cost (rarity penalty/bonus)
      - Temperature (exploration vs exploitation)
      - Inertia (tendency to stay put)

      The walk uses softmax selection over candidate neighbors:

      1. Find all neighbors within max_flips distance
      2. Compute cost for each neighbor (flip cost + rarity cost)
      3. Add inertia option (staying at current syllable)
      4. Apply softmax with temperature: weight_i = exp(-cost_i / T)
      5. Sample next syllable proportional to weights

      :param start: Starting syllable (syllable text or index)
      :param steps: Number of steps to take (each step visits one syllable)
      :param max_flips: Maximum feature flips allowed per step (1-3).
                        Must be <= max_neighbor_distance from __init__.
      :param temperature: Exploration temperature (0.1-5.0). Higher values
                          increase randomness. Typical values:

                          - 0.3: Conservative, minimal exploration
                          - 0.7: Balanced
                          - 1.5: High exploration
                          - 2.5: Maximum randomness
      :param frequency_weight: Frequency bias (-2.0 to 2.0):

                               - Positive: Favor common syllables
                               - Zero: Neutral
                               - Negative: Favor rare syllables

                               Typical values: -1.0, 0.0, 1.0
      :param seed: Random seed for reproducibility. Same seed = same walk.
                   If None, uses system randomness (non-reproducible).

      :returns: List of syllable dictionaries with keys:

                - "syllable": Syllable text (str)
                - "frequency": Corpus frequency (int)
                - "features": Binary feature vector (list of 12 ints)

                Length = steps + 1 (includes starting syllable)

      :raises ValueError: If start syllable not found in dataset
      :raises ValueError: If max_flips > max_neighbor_distance
      :raises ValueError: If steps < 0

      .. admonition:: Example

         >>> walker = SyllableWalker("data.json")
         >>> walk = walker.walk(
         ...     start="ka",
         ...     steps=5,
         ...     max_flips=2,
         ...     temperature=0.7,
         ...     frequency_weight=0.0,
         ...     seed=42
         ... )
         >>> len(walk)  # start + 5 steps
         6
         >>> walk[0]["syllable"]
         'ka'

      .. admonition:: Notes

         - Deterministic: Same seed always produces same walk
         - Uses local Random instance (doesn't affect global random state)
         - Inertia option allows walk to stay at current syllable


   .. py:method:: walk_from_profile(start, profile, steps = 5, seed = None)

      Execute a walk using a named profile.

      Convenience method that uses predefined WalkProfile parameters.
      See WALK_PROFILES for available profiles.

      :param start: Starting syllable (text or index)
      :param profile: Profile name ("clerical", "dialect", "goblin", "ritual")
                      or WalkProfile object
      :param steps: Number of steps to take (default: 5)
      :param seed: Random seed for reproducibility (default: None)

      :returns: List of syllable dictionaries (same as walk())

      :raises ValueError: If profile name not found

      .. admonition:: Example

         >>> walker = SyllableWalker("data.json")
         >>> walk = walker.walk_from_profile("ka", "goblin", steps=10, seed=42)
         >>> print(walker.format_walk(walk))
         ka → kha → gha → ghe → ge → gwe → ...


   .. py:method:: get_random_syllable(seed = None)

      Get a random syllable from the dataset.

      :param seed: Random seed for reproducibility (default: None)

      :returns: Random syllable text

      .. admonition:: Example

         >>> walker.get_random_syllable(seed=42)
         'ka'
         >>> walker.get_random_syllable(seed=42)  # Same seed = same result
         'ka'


   .. py:method:: get_syllable_info(syllable)

      Get information about a specific syllable.

      :param syllable: Syllable text to look up

      :returns: Syllable dictionary with keys syllable, frequency, features.
                Returns None if syllable not found.

      .. admonition:: Example

         >>> info = walker.get_syllable_info("ka")
         >>> if info:
         ...     print(f"Frequency: {info['frequency']}")
         Frequency: 1234


   .. py:method:: format_walk(walk, arrow = ' → ')

      Format a walk as a string with arrows.

      :param walk: Walk result from walk() or walk_from_profile()
      :param arrow: Separator between syllables (default: " → ")

      :returns: Formatted walk string

      .. admonition:: Example

         >>> walk = walker.walk_from_profile("ka", "dialect", steps=5, seed=42)
         >>> walker.format_walk(walk)
         'ka → ki → ti → ta → da → de'
         >>> walker.format_walk(walk, arrow=" -> ")
         'ka -> ki -> ti -> ta -> da -> de'


   .. py:method:: get_available_profiles()

      Get all available walk profiles.

      :returns: Dictionary mapping profile names to WalkProfile objects

      .. admonition:: Example

         >>> profiles = walker.get_available_profiles()
         >>> for name in profiles:
         ...     print(name)
         clerical
         dialect
         goblin
         ritual
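
The softmax selection that ``SyllableWalker.walk`` describes (find candidates within ``max_flips``, cost each one, add the inertia option, weight by ``exp(-cost / T)``, then sample) can be illustrated with a small standalone sketch. It is not the module's implementation: the toy data, the three-feature vectors, the per-feature costs, and in particular the form of the rarity cost (``frequency_weight`` times the negative log of relative frequency) are assumptions chosen only to make the steps concrete, and every function name here is hypothetical.

```python
import math
import random

# Hypothetical toy data: three binary features per syllable (the real module uses 12).
FEATURES = {"ka": (0, 1, 0), "ki": (0, 1, 1), "ta": (0, 0, 0), "a": (1, 0, 0)}
FREQUENCIES = {"ka": 900, "ki": 300, "ta": 600, "a": 50}
FLIP_COSTS = (1.0, 0.5, 0.2)  # assumed per-feature flip costs
INERTIA_COST = 0.5            # same value as DEFAULT_INERTIA_COST


def hamming(a, b):
    """Number of features that differ between two syllables."""
    return sum(x != y for x, y in zip(FEATURES[a], FEATURES[b]))


def flip_cost(a, b):
    """Weighted Hamming distance: sum the cost of every differing feature."""
    return sum(c for x, y, c in zip(FEATURES[a], FEATURES[b], FLIP_COSTS) if x != y)


def rarity_cost(syllable, frequency_weight):
    """Assumed form: a positive weight penalises rare syllables, a negative one rewards them."""
    rarity = -math.log(FREQUENCIES[syllable] / max(FREQUENCIES.values()))
    return frequency_weight * rarity


def step_weights(current, max_flips, temperature, frequency_weight):
    """Unnormalised softmax weights for staying put and for each reachable neighbor."""
    costs = {current: INERTIA_COST}  # inertia option
    for cand in FEATURES:
        if cand != current and hamming(current, cand) <= max_flips:
            costs[cand] = flip_cost(current, cand) + rarity_cost(cand, frequency_weight)
    return {s: math.exp(-c / temperature) for s, c in costs.items()}


def toy_walk(start, steps, max_flips=2, temperature=0.7, frequency_weight=0.0, seed=None):
    """Return the visited syllables, including the start (length = steps + 1)."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    path = [start]
    for _ in range(steps):
        weights = step_weights(path[-1], max_flips, temperature, frequency_weight)
        candidates = list(weights)
        # random.choices samples proportionally to the given weights.
        path.append(rng.choices(candidates, weights=[weights[c] for c in candidates])[0])
    return path


walk = toy_walk("ka", steps=5, seed=42)
print(" → ".join(walk))
```

Lowering ``temperature`` in this sketch concentrates the weights on the cheapest option, while raising it flattens them toward a uniform choice, which matches the exploration behaviour the ``temperature`` parameter documents.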