build_tools.syllable_analysis.dimensionality.feature_matrix

Feature matrix extraction for dimensionality reduction.

This module provides utilities for extracting numerical feature matrices from annotated syllable records. The matrices are suitable as input to dimensionality reduction algorithms such as t-SNE, PCA, and UMAP.
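
As a rough illustration of how such a matrix might feed a reduction step, the sketch below runs a plain PCA through NumPy's SVD. The matrix values are made up, and nothing here is part of this module; a library implementation of PCA, t-SNE, or UMAP would normally take its place.

```python
import numpy as np

# Illustrative binary matrix: 4 syllables x 3 features (made-up values,
# standing in for the output of extract_feature_matrix).
feature_matrix = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# PCA via SVD: center each column, then project the rows onto the
# top two right-singular vectors (the principal axes).
centered = feature_matrix - feature_matrix.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ components[:2].T

print(embedding.shape)  # (4, 2)
```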

Attributes

ALL_FEATURES

Functions

extract_feature_matrix(records[, feature_names])

Extract a binary feature matrix from annotated syllable records.

validate_feature_matrix(feature_matrix[, ...])

Validate feature matrix shape and contents.

get_feature_vector(features[, feature_names])

Extract a single feature vector from a feature dictionary.

Module Contents

build_tools.syllable_analysis.dimensionality.feature_matrix.ALL_FEATURES = ['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel',...
build_tools.syllable_analysis.dimensionality.feature_matrix.extract_feature_matrix(records, feature_names=ALL_FEATURES)

Extract a binary feature matrix from annotated syllable records.

Converts feature dictionaries to a numerical matrix suitable for dimensionality reduction algorithms. Each row represents a syllable, each column represents a feature (0 or 1).

Parameters:
  • records (list[dict]) –

    List of annotated syllable records with 'features' and 'frequency' keys. Each record should have the structure:

    {"syllable": "ka", "frequency": 187, "features": {"contains_liquid": False, "contains_plosive": True, …}}

  • feature_names (list[str]) – Ordered list of feature names to extract (default: ALL_FEATURES). Order determines column order in output matrix.

Returns:

  • feature_matrix: numpy array of shape (n_syllables, n_features) with binary values

  • frequencies: List of frequency counts for each syllable

Return type:

Tuple of (feature_matrix, frequencies)

Example

>>> records = [
...     {
...         "syllable": "ka",
...         "frequency": 187,
...         "features": {"contains_liquid": False, "contains_plosive": True, ...}
...     }
... ]
>>> matrix, freqs = extract_feature_matrix(records)
>>> matrix.shape
(1, 12)
>>> freqs
[187]

Notes

  • Missing features default to False (0)

  • Feature values are converted to int (True→1, False→0)

  • Output matrix dtype is int for memory efficiency

  • An empty record list returns an array of shape (0, n_features)
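
The behavior described in these notes can be sketched roughly as follows. This is an illustrative re-implementation, not the module's actual source: the function name, the three-item `FEATURE_NAMES` list (a stand-in for `ALL_FEATURES`), and the sample records are assumptions made for the example.

```python
import numpy as np

# Hypothetical stand-in for ALL_FEATURES; only three names are used here.
FEATURE_NAMES = ["contains_liquid", "contains_plosive", "contains_nasal"]


def extract_feature_matrix_sketch(records, feature_names=FEATURE_NAMES):
    """Illustrative version of the documented behavior (not the real source)."""
    # One row per syllable; a missing feature is treated as False (0).
    rows = [
        [int(bool(record.get("features", {}).get(name, False))) for name in feature_names]
        for record in records
    ]
    frequencies = [record["frequency"] for record in records]
    # The reshape preserves the (0, n_features) shape for an empty record list.
    matrix = np.array(rows, dtype=int).reshape(len(rows), len(feature_names))
    return matrix, frequencies


# Made-up records for illustration.
records = [
    {"syllable": "ka", "frequency": 187, "features": {"contains_plosive": True}},
    {"syllable": "ra", "frequency": 42, "features": {"contains_liquid": True}},
]
matrix, freqs = extract_feature_matrix_sketch(records)
print(matrix.tolist())  # [[0, 1, 0], [1, 0, 0]]
print(freqs)            # [187, 42]
print(extract_feature_matrix_sketch([])[0].shape)  # (0, 3)
```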

build_tools.syllable_analysis.dimensionality.feature_matrix.validate_feature_matrix(feature_matrix, expected_features=12)

Validate feature matrix shape and contents.

Ensures the feature matrix has the expected structure for dimensionality reduction algorithms.

Parameters:
  • feature_matrix (numpy.ndarray) – Binary feature matrix

  • expected_features (int) – Expected number of features (default: 12)

Raises:

ValueError – If validation fails (wrong shape, non-binary values, etc.)

Example

>>> import numpy as np
>>> matrix = np.array([[1, 0, 1], [0, 1, 0]])
>>> validate_feature_matrix(matrix, expected_features=3)  # OK
>>> validate_feature_matrix(matrix, expected_features=4)  # Raises ValueError
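
The kinds of checks named under Raises (wrong shape, non-binary values) might look like the sketch below. The function name, the exact checks, and the error messages are assumptions; the real function may test more or different conditions.

```python
import numpy as np


def validate_feature_matrix_sketch(feature_matrix, expected_features=12):
    """Illustrative checks only; not the module's actual implementation."""
    # The matrix must be two-dimensional: (n_syllables, n_features).
    if feature_matrix.ndim != 2:
        raise ValueError(f"expected a 2D matrix, got {feature_matrix.ndim}D")
    # The column count must match the expected number of features.
    if feature_matrix.shape[1] != expected_features:
        raise ValueError(
            f"expected {expected_features} features, got {feature_matrix.shape[1]}"
        )
    # Every entry must be 0 or 1.
    if not np.isin(feature_matrix, (0, 1)).all():
        raise ValueError("feature matrix contains non-binary values")


matrix = np.array([[1, 0, 1], [0, 1, 0]])
validate_feature_matrix_sketch(matrix, expected_features=3)  # passes silently
try:
    validate_feature_matrix_sketch(matrix, expected_features=4)
except ValueError as exc:
    print(exc)  # expected 4 features, got 3
```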

build_tools.syllable_analysis.dimensionality.feature_matrix.get_feature_vector(features, feature_names=ALL_FEATURES)

Extract a single feature vector from a feature dictionary.

Converts a dictionary of feature flags to an ordered binary vector. Useful for extracting vectors from individual syllables.

Parameters:
  • features (dict[str, bool]) – Dictionary of feature name → boolean value

  • feature_names (list[str]) – Ordered list of feature names (default: ALL_FEATURES)

Returns:

Binary feature vector matching feature_names order

Return type:

list[int]

Example

>>> features = {"contains_liquid": True, "contains_plosive": False}
>>> vector = get_feature_vector(features, ["contains_liquid", "contains_plosive"])
>>> vector
[1, 0]

Notes

  • Missing features default to False (0)

  • Order of output matches order of feature_names

  • Output is a Python list, not a NumPy array (for flexibility)
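
A minimal sketch of the conversion these notes describe, assuming a plain dictionary lookup with a False default. The function name and the feature names passed in are illustrative, not taken from the module's source.

```python
def get_feature_vector_sketch(features, feature_names):
    """Illustrative version: ordered 0/1 list, missing names treated as False."""
    return [int(bool(features.get(name, False))) for name in feature_names]


# The output order follows feature_names, not the dictionary's own order.
names = ["contains_liquid", "contains_plosive", "contains_nasal"]
vector = get_feature_vector_sketch({"contains_plosive": True}, names)
print(vector)  # [0, 1, 0]
```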