build_tools.syllable_analysis.dimensionality.feature_matrix
Feature matrix extraction for dimensionality reduction.
This module provides utilities for extracting numerical feature matrices from annotated syllable records. The matrices are suitable for dimensionality reduction algorithms such as t-SNE, PCA, and UMAP.
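As a rough illustration of what "suitable for dimensionality reduction" means in practice, the sketch below projects a small binary matrix onto two principal components using plain NumPy SVD. The matrix values and the helper name `pca_project` are hypothetical and not part of this module; a real pipeline would more likely pass the matrix to a dedicated PCA, t-SNE, or UMAP implementation.

```python
import numpy as np

# Hypothetical 4-syllable, 3-feature binary matrix
# (rows: syllables, columns: features). Values are made up.
feature_matrix = np.array(
    [
        [1, 0, 1],
        [0, 1, 0],
        [1, 1, 0],
        [0, 0, 1],
    ]
)


def pca_project(matrix, n_components=2):
    """Project rows onto the top principal components using SVD."""
    data = matrix.astype(float)
    # PCA operates on mean-centered columns.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


embedding = pca_project(feature_matrix, n_components=2)
print(embedding.shape)  # (4, 2): one 2D point per syllable
```

Each row of `embedding` corresponds to the syllable in the same row of the input matrix, which is why a stable row and column order in the feature matrix matters.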
Attributes
Functions
extract_feature_matrix | Extract binary feature matrix from annotated syllable records.
validate_feature_matrix | Validate feature matrix shape and contents.
get_feature_vector | Extract a single feature vector from a feature dictionary.
Module Contents
- build_tools.syllable_analysis.dimensionality.feature_matrix.ALL_FEATURES = ['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel',...
- build_tools.syllable_analysis.dimensionality.feature_matrix.extract_feature_matrix(records, feature_names=ALL_FEATURES)[source]
Extract binary feature matrix from annotated syllable records.
Converts feature dictionaries to a numerical matrix suitable for dimensionality reduction algorithms. Each row represents a syllable, each column represents a feature (0 or 1).
- Parameters:
records (List[Dict]) –
List of annotated syllable records with 'features' and 'frequency' keys. Each record should have the structure:
{
    "syllable": "ka",
    "frequency": 187,
    "features": {"contains_liquid": False, "contains_plosive": True, …}
}
feature_names (List[str]) – Ordered list of feature names to extract (default: ALL_FEATURES). Order determines column order in output matrix.
- Returns:
feature_matrix: numpy array of shape (n_syllables, n_features) with binary values
frequencies: List of frequency counts for each syllable
- Return type:
Tuple of (feature_matrix, frequencies)
Example
>>> records = [
...     {
...         "syllable": "ka",
...         "frequency": 187,
...         "features": {"contains_liquid": False, "contains_plosive": True, ...}
...     }
... ]
>>> matrix, freqs = extract_feature_matrix(records)
>>> matrix.shape
(1, 12)
>>> freqs
[187]
Notes
Missing features default to False (0)
Feature values are converted to int (True→1, False→0)
Output matrix dtype is int for memory efficiency
Empty record list returns (0, n_features) shaped array
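The behaviour described in these notes can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the module's actual source: `sketch_extract_feature_matrix` and the shortened `DEMO_FEATURES` list are hypothetical names, and the second record ("la", frequency 42) is made-up data.

```python
import numpy as np

# Hypothetical, shortened feature list for illustration only;
# the real ALL_FEATURES list is longer.
DEMO_FEATURES = ["contains_liquid", "contains_plosive", "long_vowel"]


def sketch_extract_feature_matrix(records, feature_names=DEMO_FEATURES):
    """Illustrative sketch; not the module's actual implementation."""
    rows = []
    frequencies = []
    for record in records:
        features = record.get("features", {})
        # Missing features fall back to False, and bools become 0/1.
        rows.append([int(features.get(name, False)) for name in feature_names])
        frequencies.append(record["frequency"])
    # reshape preserves the (0, n_features) shape when there are no records.
    matrix = np.array(rows, dtype=int).reshape(len(rows), len(feature_names))
    return matrix, frequencies


records = [
    {"syllable": "ka", "frequency": 187,
     "features": {"contains_liquid": False, "contains_plosive": True}},
    {"syllable": "la", "frequency": 42,
     "features": {"contains_liquid": True}},
]
matrix, freqs = sketch_extract_feature_matrix(records)
empty_matrix, empty_freqs = sketch_extract_feature_matrix([])

print(matrix.tolist())     # [[0, 1, 0], [1, 0, 0]]
print(freqs)               # [187, 42]
print(empty_matrix.shape)  # (0, 3)
```

Column order follows `feature_names`, so passing the same list everywhere keeps matrices from different runs comparable.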
- build_tools.syllable_analysis.dimensionality.feature_matrix.validate_feature_matrix(feature_matrix, expected_features=12)[source]
Validate feature matrix shape and contents.
Ensures the feature matrix has the expected structure for dimensionality reduction algorithms.
- Parameters:
feature_matrix (numpy.ndarray) – Binary feature matrix
expected_features (int) – Expected number of features (default: 12)
- Raises:
ValueError – If validation fails (wrong shape, non-binary values, etc.)
Example
>>> import numpy as np
>>> matrix = np.array([[1, 0, 1], [0, 1, 0]])
>>> validate_feature_matrix(matrix, expected_features=3)  # OK
>>> validate_feature_matrix(matrix, expected_features=4)  # Raises ValueError
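The documentation names "wrong shape" and "non-binary values" as failure cases without listing the exact checks. The sketch below shows one plausible set of checks; `sketch_validate_feature_matrix` is a hypothetical stand-in, and the real function may test more conditions or use different error messages.

```python
import numpy as np


def sketch_validate_feature_matrix(feature_matrix, expected_features=12):
    """Illustrative checks only; the real function may differ."""
    if not isinstance(feature_matrix, np.ndarray):
        raise ValueError("feature_matrix must be a numpy array")
    if feature_matrix.ndim != 2:
        raise ValueError(f"expected a 2D matrix, got {feature_matrix.ndim}D")
    if feature_matrix.shape[1] != expected_features:
        raise ValueError(
            f"expected {expected_features} features, got {feature_matrix.shape[1]}"
        )
    # Every entry must be exactly 0 or 1.
    if not np.isin(feature_matrix, (0, 1)).all():
        raise ValueError("feature matrix contains non-binary values")


matrix = np.array([[1, 0, 1], [0, 1, 0]])
sketch_validate_feature_matrix(matrix, expected_features=3)  # passes silently

try:
    sketch_validate_feature_matrix(np.array([[1, 2, 0]]), expected_features=3)
except ValueError as err:
    print(f"rejected: {err}")
```

Raising on the first failed check keeps the error message specific, which is usually more helpful than a generic "invalid matrix" report.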
- build_tools.syllable_analysis.dimensionality.feature_matrix.get_feature_vector(features, feature_names=ALL_FEATURES)[source]
Extract a single feature vector from a feature dictionary.
Converts a dictionary of feature flags to an ordered binary vector. Useful for extracting vectors from individual syllables.
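- Parameters:
features – Dictionary of feature flags, keyed by feature name.
feature_names – Ordered list of feature names to extract (default: ALL_FEATURES). Order determines the order of the output vector.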
- Returns:
Binary feature vector matching feature_names order
- Return type:
List[int]
Example
>>> features = {"contains_liquid": True, "contains_plosive": False}
>>> vector = get_feature_vector(features, ["contains_liquid", "contains_plosive"])
>>> vector
[1, 0]
Notes
Missing features default to False (0)
Order of output matches order of feature_names
Output is Python list, not numpy array (for flexibility)
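A minimal sketch of the behaviour in these notes, assuming a plain dictionary lookup with a False default. `sketch_get_feature_vector` is a hypothetical name used for illustration and is not the module's actual source.

```python
def sketch_get_feature_vector(features, feature_names):
    """Illustrative sketch; not the module's actual implementation."""
    # dict.get supplies False for absent names; int() maps True/False to 1/0.
    return [int(features.get(name, False)) for name in feature_names]


names = ["contains_liquid", "contains_plosive", "contains_nasal"]
features = {"contains_liquid": True, "contains_plosive": False}

vector = sketch_get_feature_vector(features, names)
print(vector)  # [1, 0, 0]: "contains_nasal" is missing, so it becomes 0

# Reordering feature_names reorders the output accordingly.
reversed_vector = sketch_get_feature_vector(features, list(reversed(names)))
print(reversed_vector)  # [0, 0, 1]
```

Returning a plain list leaves it to the caller to decide whether to wrap the result in a NumPy array or stack several vectors into a matrix.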