Analysis Tools

Overview

Analysis tools for annotated syllables.

This subpackage provides post-annotation analysis utilities for inspecting and understanding the annotated syllable corpus.

Subpackages

  • common: Shared utilities (data I/O, paths, output management)

  • dimensionality: Dimensionality reduction (feature matrices, t-SNE, mapping)

  • plotting: Visualization utilities (static matplotlib, interactive Plotly)

Available Tools

  • random_sampler: Random sampling utility for QA and inspection

  • feature_signatures: Feature signature analysis and distribution reporting

  • tsne_visualizer: t-SNE visualization of feature signature space

Quick Start

Random sampling:

$ python -m build_tools.syllable_analysis.random_sampler --samples 50

Feature signature analysis:

$ python -m build_tools.syllable_analysis.feature_signatures

t-SNE visualization:

$ python -m build_tools.syllable_analysis.tsne_visualizer

Programmatic Usage

Using common utilities:

>>> from build_tools.syllable_analysis import (
...     default_paths,
...     load_annotated_syllables,
...     ensure_output_dir,
... )
>>> # Load data using default paths
>>> records = load_annotated_syllables(default_paths.annotated_syllables)
>>> # Prepare output directory
>>> output_dir = ensure_output_dir(default_paths.analysis_output_dir("my_tool"))

Random sampling:

>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
...     load_annotated_syllables,
...     sample_syllables,
...     save_json_output
... )
>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"))
>>> samples = sample_syllables(records, 50, seed=42)
>>> save_json_output(samples, Path("output.json"))

Feature signature analysis:

>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
...     run_analysis,
...     extract_signature,
...     analyze_feature_signatures
... )
>>> result = run_analysis(
...     input_path=Path("data/annotated/syllables_annotated.json"),
...     output_dir=Path("_working/analysis/"),
...     limit=20
... )

t-SNE visualization:

>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
...     run_tsne_visualization,
...     extract_feature_matrix
... )
>>> result = run_tsne_visualization(
...     input_path=Path("data/annotated/syllables_annotated.json"),
...     output_dir=Path("_working/analysis/tsne/")
... )

Random Sampler

Random sampling utility for QA and inspection of annotated syllables.

Command-Line Interface

Randomly sample annotated syllables for inspection.

usage: python -m build_tools.syllable_analysis.random_sampler
       [-h] [--input INPUT] [--output OUTPUT] [--samples SAMPLES]
       [--seed SEED]

Named Arguments

--input

Path to input annotated syllables JSON file

Default: data/annotated/syllables_annotated.json, resolved against the auto-detected project root

--output

Path to output samples JSON file

Default: _working/random_samples.json, resolved against the auto-detected project root

--samples

Number of syllables to sample (default: 100)

Default: 100

--seed

Random seed for reproducibility (default: None, uses system randomness)

# Sample 100 syllables (default)
python -m build_tools.syllable_analysis.random_sampler

# Sample 50 syllables
python -m build_tools.syllable_analysis.random_sampler --samples 50

# Use custom paths
python -m build_tools.syllable_analysis.random_sampler \
    --input data/annotated/syllables_annotated.json \
    --output _working/my_samples.json \
    --samples 200

# Use a specific seed for reproducibility
python -m build_tools.syllable_analysis.random_sampler --samples 50 --seed 42
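The effect of --seed can be illustrated with a small sketch. It assumes the sampler draws without replacement using Python's random module; the helper name sample_records is hypothetical and is not part of the package.

```python
import random


def sample_records(records, count, seed=None):
    """Draw `count` records without replacement (hypothetical helper)."""
    # A dedicated Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    # Assumption: never request more records than are available.
    count = min(count, len(records))
    return rng.sample(records, count)


records = [{"syllable": s} for s in ("ka", "ra", "mi", "to", "ne", "lu")]
first = sample_records(records, 3, seed=42)
second = sample_records(records, 3, seed=42)
```

With the same seed and the same input, both calls return the same three records; with seed=None the selection varies from run to run.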

Integration Guide

Use after annotation to randomly sample syllables for quality assurance:

# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator

# Step 2: Sample for QA inspection
python -m build_tools.syllable_analysis.random_sampler \
  --input data/annotated/syllables_annotated.json \
  --samples 50 \
  --output _working/samples.json \
  --seed 42

Feature Signature Analysis

The feature signature analysis tool examines annotated syllables to identify which feature combinations actually exist in the data and how frequently each combination appears.

A “feature signature” is the set of all active (True) features for a syllable. For example, a syllable with only starts_with_vowel and ends_with_vowel active would have the signature: ('ends_with_vowel', 'starts_with_vowel').
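As a rough illustration of this definition, the sketch below derives signatures from feature dictionaries and counts them. The helper name signature_of and the sample records are hypothetical; sorting the names alphabetically is an assumption that matches the example signature above.

```python
from collections import Counter


def signature_of(features):
    """Return the sorted tuple of feature names whose value is True."""
    return tuple(sorted(name for name, active in features.items() if active))


# Hypothetical records, shaped like entries in syllables_annotated.json.
vowel_only = {"starts_with_vowel": True, "ends_with_vowel": True, "contains_plosive": False}
records = [
    {"syllable": "a", "features": dict(vowel_only)},
    {"syllable": "o", "features": dict(vowel_only)},
    {"syllable": "ka", "features": {"starts_with_vowel": False, "ends_with_vowel": True, "contains_plosive": True}},
]

# Count how many syllables share each signature.
signature_counts = Counter(signature_of(r["features"]) for r in records)
```

Because signatures are tuples, they are hashable and can be used directly as Counter keys.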

This analysis answers questions like:

  • What feature patterns are most common in natural language?

  • Are certain feature combinations rare or impossible?

  • How diverse is the feature space in the corpus?

Command-Line Interface

Analyze feature signatures in annotated syllable corpus

usage: python -m build_tools.syllable_analysis.feature_signatures
       [-h] [--input INPUT] [--output OUTPUT] [--limit LIMIT]

Named Arguments

--input

Path to syllables_annotated.json

Default: data/annotated/syllables_annotated.json, resolved against the auto-detected project root

--output

Output directory for analysis results

Default: _working/analysis/feature_signatures, resolved against the auto-detected project root

--limit

Limit number of signatures in report (default: show all)

# Analyze with default paths
python -m build_tools.syllable_analysis.feature_signatures

# Show only top 50 signatures
python -m build_tools.syllable_analysis.feature_signatures --limit 50

# Custom input/output paths
python -m build_tools.syllable_analysis.feature_signatures \
  --input data/annotated/syllables_annotated.json \
  --output _working/my_analysis/

Output Format

The tool generates timestamped plain text reports (YYYYMMDD_HHMMSS.feature_signatures.txt) with the following layout:

================================================================================
FEATURE SIGNATURE ANALYSIS
================================================================================
Generated: 2026-01-06 13:55:56
Total syllables analyzed: 23,160
Unique feature signatures: 361

SUMMARY STATISTICS
--------------------------------------------------------------------------------
Most common signature: 661 syllables (2.9%)
  Features: contains_liquid, contains_plosive, ends_with_vowel, long_vowel

Feature cardinality distribution:
  1 features: 3 unique signatures
  2 features: 17 unique signatures
  3 features: 54 unique signatures
  4 features: 86 unique signatures
  5 features: 96 unique signatures
  6 features: 70 unique signatures
  7 features: 30 unique signatures
  8 features: 5 unique signatures

================================================================================
SIGNATURE RANKINGS
--------------------------------------------------------------------------------
Rank   Count    Pct      Features
--------------------------------------------------------------------------------
1      661        2.85%  [4] contains_liquid, contains_plosive, ends_with_vowel, long_vowel
2      506        2.18%  [3] contains_plosive, ends_with_vowel, long_vowel
...
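The ranking rows in a report like the one above can be derived from a Counter of signatures. This is a minimal sketch, not the tool's actual code: rank_signatures is a hypothetical helper and the counts are made up for illustration.

```python
from collections import Counter


def rank_signatures(signature_counts, limit=None):
    """Return (rank, count, pct, n_features, features) rows, most common first."""
    total = sum(signature_counts.values())
    rows = []
    # most_common(None) returns every signature; a number keeps only the top N.
    for rank, (signature, count) in enumerate(signature_counts.most_common(limit), start=1):
        pct = round(100.0 * count / total, 2)
        rows.append((rank, count, pct, len(signature), ", ".join(signature)))
    return rows


counts = Counter({
    ("contains_plosive", "ends_with_vowel"): 3,
    ("starts_with_vowel",): 1,
})
rows = rank_signatures(counts)
top_only = rank_signatures(counts, limit=1)
```

The same percentage arithmetic reproduces the sample report: 661 of 23,160 syllables rounds to 2.85%.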

Integration Guide

Use after annotation to understand feature patterns:

# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator

# Step 2: Analyze feature signatures
python -m build_tools.syllable_analysis.feature_signatures \
  --input data/annotated/syllables_annotated.json \
  --output _working/analysis/feature_signatures/

t-SNE Visualization

The t-SNE (t-distributed Stochastic Neighbor Embedding) visualization tool creates 2D visualizations of the high-dimensional feature signature space. This helps identify clustering patterns, syllable similarity, and natural groupings in the annotated syllable corpus.

t-SNE is a dimensionality reduction technique that projects 12-dimensional feature vectors into 2D space while preserving local structure. The visualization uses:

  • Position (x, y): t-SNE projection coordinates

  • Size: Syllable frequency (larger points = more common)

  • Color: Syllable frequency (warmer colors = more common)

Command-Line Interface

Generate t-SNE visualization of feature signature space

usage: python -m build_tools.syllable_analysis.tsne_visualizer
       [-h] [--input INPUT] [--output OUTPUT] [--perplexity PERPLEXITY]
       [--random-state RANDOM_STATE] [--dpi DPI] [--save-mapping]
       [--interactive] [--verbose]

Named Arguments

--input

Path to syllables_annotated.json

Default: data/annotated/syllables_annotated.json, resolved against the auto-detected project root

--output

Output directory for visualizations

Default: _working/analysis/tsne, resolved against the auto-detected project root

--perplexity

t-SNE perplexity parameter (default: 30, range: 5-50)

Default: 30

--random-state

Random seed for reproducibility (default: 42)

Default: 42

--dpi

Output resolution in DPI (default: 300)

Default: 300

--save-mapping

Save syllable→features→coordinates mapping as JSON (default: False)

Default: False

--interactive

Generate interactive HTML visualization in addition to static PNG (requires Plotly)

Default: False

--verbose

Print detailed progress information

Default: False

# Generate visualization with default settings
python -m build_tools.syllable_analysis.tsne_visualizer

# Custom input/output paths
python -m build_tools.syllable_analysis.tsne_visualizer \
  --input data/annotated/syllables_annotated.json \
  --output _working/analysis/tsne/

# Adjust t-SNE parameters
python -m build_tools.syllable_analysis.tsne_visualizer \
  --perplexity 50 \
  --random-state 123

# High-resolution output
python -m build_tools.syllable_analysis.tsne_visualizer \
  --dpi 600

# Verbose output
python -m build_tools.syllable_analysis.tsne_visualizer --verbose

Output Format

The visualizer generates timestamped files in the output directory:

  1. YYYYMMDD_HHMMSS.tsne_visualization.png - High-resolution static visualization (PNG, always generated)

  2. YYYYMMDD_HHMMSS.tsne_metadata.txt - Detailed metadata and interpretation guide

  3. YYYYMMDD_HHMMSS.tsne_mapping.json - Syllable→features→coordinates mapping (optional, requires --save-mapping)

  4. YYYYMMDD_HHMMSS.tsne_interactive.html - Interactive Plotly visualization (optional, requires --interactive)
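The YYYYMMDD_HHMMSS prefix corresponds to a standard strftime pattern. The sketch below shows one way to build such names; timestamped_name is a hypothetical helper, not the visualizer's own function.

```python
from datetime import datetime


def timestamped_name(suffix, when=None):
    """Build '<YYYYMMDD_HHMMSS>.<suffix>' (hypothetical helper)."""
    # Fall back to the current local time when no timestamp is supplied.
    when = when or datetime.now()
    return f"{when.strftime('%Y%m%d_%H%M%S')}.{suffix}"


name = timestamped_name("tsne_visualization.png", datetime(2026, 1, 7, 14, 30, 22))
```

Sharing one timestamp across the PNG, metadata, mapping, and HTML files keeps the outputs of a single run grouped together when sorted by name.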

Output Formats:

  • Static PNG: High-resolution matplotlib visualization (always generated, 300 DPI default)

  • Interactive HTML: Plotly-based interactive visualization with hover tooltips, zoom, pan, and export (optional)

Static PNG metadata file includes:

  • Algorithm parameters (method, perplexity, random seed, dimensions, distance metric, features)

  • Visualization encoding (axis meanings, point size/color)

  • Interpretation guide (how to read the visualization)

  • Technical details (DPI, generation timestamp)

Interactive HTML features:

  • Hover tooltips showing syllable text, frequency, and active features

  • Interactive zoom, pan, and exploration controls

  • Export to high-resolution PNG directly from browser

  • Self-contained HTML file with embedded metadata

  • Works in any modern web browser without additional dependencies

Integration Guide

Use after annotation to visualize the feature space:

# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator

# Step 2: Generate t-SNE visualization
python -m build_tools.syllable_analysis.tsne_visualizer \
  --input data/annotated/syllables_annotated.json \
  --output _working/analysis/tsne/ \
  --interactive

Advanced Topics

Understanding t-SNE Parameters:

Perplexity (default: 30):

  • Balances attention between local and global structure

  • Typical range: 5-50

  • Lower values: emphasize local clusters

  • Higher values: preserve global structure

  • Rule of thumb: should be less than number of syllables

  • Default of 30 works well for most corpus sizes (100-10,000 syllables)

Random State (default: 42):

  • Controls random initialization of t-SNE

  • Same value = reproducible visualizations

  • Different values = different (but valid) layouts

  • Use fixed value (e.g., 42) for consistent results

Distance Metric:

  • Uses Hamming distance, which suits binary feature vectors because it counts the positions at which two vectors differ

  • Automatically configured for 12-dimensional binary features

  • Not configurable via command-line (intentional design choice)
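To make the distance metric concrete, here is a pure-Python sketch of Hamming distance on binary feature vectors. It follows the SciPy and scikit-learn convention of returning the proportion of differing positions; the two 12-element vectors are made-up examples, not real corpus data.

```python
def hamming_distance(a, b):
    """Proportion of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical 12-dimensional binary feature vectors.
first = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
second = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0]

distance = hamming_distance(first, second)  # the vectors differ in 3 of 12 positions
```

Two syllables with identical feature signatures have distance 0.0, so they tend to land on or near the same point in the projection.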

Interpreting the Visualization:

What to look for:

  • Nearby points: Syllables with similar phonetic features

  • Clusters: Natural groupings in the feature space

  • Large/bright points: High-frequency syllables (common patterns)

  • Small/dark points: Low-frequency syllables (rare patterns)

  • Isolated points: Unique or rare feature combinations

  • Dense regions: Common feature patterns

  • Sparse regions: Less common feature patterns

Example insights:

  • Vowel-initial syllables might cluster together

  • Heavy consonant clusters might form distinct groups

  • Frequency might correlate with certain feature patterns

  • Outliers might indicate unusual phonetic combinations

Notes

Dependencies:

Required dependencies for t-SNE visualization (install with pip install -e ".[build-tools]"):

  • scikit-learn, matplotlib, numpy, pandas (for static PNG generation)

  • plotly (for interactive HTML generation, optional)

Performance:

  • Deterministic analysis with fixed random seeds (same input = same output)

  • Fast processing: typically <10 seconds for 1,000-10,000 syllables

  • t-SNE processing time scales roughly O(n²) with corpus size

  • For very large datasets (>50,000 syllables), consider sampling first

Output Characteristics:

  • Human-readable plain text reports with formatted tables

  • Timestamped output files for historical tracking

  • High-resolution visualizations (default 300 DPI)

  • Static visualizations saved as PNG files for easy sharing and embedding

  • Interactive visualizations saved as self-contained HTML files

Build-time tool:

These are build-time analysis tools; they are not used during runtime name generation.

API Reference

class build_tools.syllable_analysis.AnalysisPathConfig(root=None)

Bases: object

Centralized path configuration for analysis tools.

This class manages all default paths used by analysis tools, including:

  • Project root detection

  • Input file paths (annotated syllables, frequencies)

  • Output directory paths (per-tool subdirectories)

The class automatically detects the project root based on this file’s location in the directory structure, but can also accept a custom root path for testing or alternative project layouts.

Attributes

root : Path

Project root directory (auto-detected or explicitly set)

Examples

Using default (auto-detected) root:

>>> config = AnalysisPathConfig()
>>> config.root
PosixPath('/path/to/pipeworks_name_generation')
>>> config.annotated_syllables
PosixPath('/path/to/pipeworks_name_generation/data/annotated/syllables_annotated.json')

Using custom root:

>>> from pathlib import Path
>>> config = AnalysisPathConfig(root=Path("/custom/root"))
>>> config.annotated_syllables
PosixPath('/custom/root/data/annotated/syllables_annotated.json')

Getting tool-specific output directories:

>>> config = AnalysisPathConfig()
>>> config.analysis_output_dir("tsne")
PosixPath('/path/to/pipeworks_name_generation/_working/analysis/tsne')
>>> config.analysis_output_dir("feature_signatures")
PosixPath('/path/to/pipeworks_name_generation/_working/analysis/feature_signatures')

Notes

This class is designed to be instantiated once per process (typically via the module-level default_paths singleton). Multiple instances are supported for testing purposes.

The auto-detection assumes this file is located at: build_tools/syllable_analysis/common/paths.py

If the directory structure changes, the _detect_project_root() method must be updated accordingly.
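The detection described here amounts to walking up a fixed number of parent directories. The sketch below illustrates the idea under the stated file location; detect_project_root is a hypothetical stand-in for the private _detect_project_root() method, whose real implementation is not shown in this document.

```python
from pathlib import PurePosixPath


def detect_project_root(module_file):
    """Return the project root for a file at build_tools/syllable_analysis/common/paths.py."""
    # parents[0] = common/, parents[1] = syllable_analysis/,
    # parents[2] = build_tools/, parents[3] = project root
    return PurePosixPath(module_file).parents[3]


root = detect_project_root("/path/to/project/build_tools/syllable_analysis/common/paths.py")
annotated = root / "data" / "annotated" / "syllables_annotated.json"
```

If paths.py moved one directory deeper, the parent index would need to increase by one, which is why the note above ties the method to the directory layout.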

__init__(root=None)

Initialize path configuration.

Args

root : Path, optional

Project root path. If None (default), auto-detects based on this file’s location.

Examples

Default auto-detection:

>>> config = AnalysisPathConfig()

Custom root path:

>>> from pathlib import Path
>>> config = AnalysisPathConfig(root=Path("/my/project"))

analysis_output_dir(tool_name)

Get output directory for a specific analysis tool.

Each analysis tool should have its own subdirectory under _working/analysis/ to keep outputs organized and avoid naming conflicts.

Return type:

Path

Args

tool_name : str

Name of the analysis tool (e.g., ‘tsne’, ‘feature_signatures’, ‘random_sampler’). This will be used as the subdirectory name.

Returns

Path

Path to _working/analysis/{tool_name}/

Examples

>>> config = AnalysisPathConfig()
>>> config.analysis_output_dir("tsne")
PosixPath('.../pipeworks_name_generation/_working/analysis/tsne')
>>> config.analysis_output_dir("feature_signatures")
PosixPath('.../pipeworks_name_generation/_working/analysis/feature_signatures')

Use in argument parser:

parser.add_argument(
    "--output",
    type=Path,
    default=default_paths.analysis_output_dir("tsne"),
    help="Output directory"
)

Notes

The directory is not created by this method - it only returns the path. Use common.output.ensure_output_dir() to create the directory if needed.

The _working/ directory is typically git-ignored and used for build-time artifacts that don’t need to be committed.

property annotated_syllables: Path

Default path to syllables_annotated.json.

This is the primary input file for most analysis tools, containing syllables with their frequencies and feature annotations.

Returns

Path

Path to data/annotated/syllables_annotated.json

Examples

>>> config = AnalysisPathConfig()
>>> config.annotated_syllables
PosixPath('.../data/annotated/syllables_annotated.json')

Use in argument parser:

parser.add_argument(
    "--input",
    type=Path,
    default=default_paths.annotated_syllables,
    help="Path to annotated syllables"
)

Notes

This file is produced by the syllable feature annotator pipeline and contains a JSON array of syllable records with structure:

[
    {
        "syllable": "ka",
        "frequency": 187,
        "features": {
            "starts_with_vowel": false,
            "contains_plosive": true,
            ...
        }
    },
    ...
]

property syllables_frequencies: Path

Default path to syllables_frequencies.json.

This file contains frequency counts for each syllable from the normalizer, useful for weighted analysis or filtering.

Returns

Path

Path to data/normalized/syllables_frequencies.json

Examples

>>> config = AnalysisPathConfig()
>>> config.syllables_frequencies
PosixPath('.../data/normalized/syllables_frequencies.json')

Notes

This file is produced by the syllable normalizer and contains a JSON object mapping syllables to their occurrence counts:

{
    "ka": 187,
    "ra": 162,
    "mi": 145,
    ...
}

The frequencies represent pre-deduplication counts, capturing how often each canonical syllable appeared in the raw corpus.
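A short sketch of working with this mapping, using an inline JSON string in place of the real file; the three entries are the ones shown above and everything else is illustrative.

```python
import json
from collections import Counter

# Inline stand-in for the contents of syllables_frequencies.json.
raw = '{"ka": 187, "ra": 162, "mi": 145}'

# Counter accepts a plain mapping and adds helpers such as most_common().
frequencies = Counter(json.loads(raw))
top_two = frequencies.most_common(2)
total = sum(frequencies.values())
```

Dividing a syllable's count by the total gives a relative weight that could be used for frequency-weighted analysis or filtering.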

build_tools.syllable_analysis.analyze_feature_signatures(records)

Analyze feature signatures across all syllable records.

Counts how many syllables share each unique feature signature.

Parameters:

records (list[dict]) – List of syllable records from syllables_annotated.json. Each record should have “syllable”, “frequency”, and “features” keys.

Return type:

Counter

Returns:

Counter mapping feature signatures to occurrence counts

Example

>>> records = [
...     {"syllable": "ka", "features": {"starts_with_vowel": False}},
...     {"syllable": "a", "features": {"starts_with_vowel": True}}
... ]
>>> counter = analyze_feature_signatures(records)
>>> counter[('starts_with_vowel',)]
1

build_tools.syllable_analysis.apply_tsne(feature_matrix, n_components=2, perplexity=30, random_state=42, metric='hamming')

Apply t-SNE dimensionality reduction to feature matrix.

t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for dimensionality reduction that projects high-dimensional data into lower dimensions while preserving local structure.

Parameters:
  • feature_matrix (ndarray) – Input feature matrix (n_samples, n_features). For binary features, should contain only 0s and 1s.

  • n_components (int) – Number of dimensions for output (default: 2). 2D is typical for visualization, 3D also common.

  • perplexity (int) – t-SNE perplexity parameter (default: 30). Controls balance between local and global structure. Typical range: 5-50. Higher values consider more neighbors. Should be less than n_samples.

  • random_state (int) – Random seed for reproducibility (default: 42). Same seed ensures identical output for same input.

  • metric (str) – Distance metric (default: ‘hamming’). ‘hamming’ suits binary features because it counts the positions at which two vectors differ. Other options: ‘euclidean’, ‘manhattan’, ‘cosine’, etc.

Return type:

ndarray

Returns:

Reduced coordinates array of shape (n_samples, n_components). For default n_components=2, output is (n_samples, 2) with x,y coordinates.

Raises:
  • ImportError – If scikit-learn is not installed

  • ValueError – If perplexity is invalid (too large for sample size)

Example

>>> import numpy as np
>>> from build_tools.syllable_analysis.dimensionality import apply_tsne
>>> # Create sample binary feature matrix (100 samples, 12 features)
>>> feature_matrix = np.random.randint(0, 2, size=(100, 12))
>>> # Apply t-SNE to reduce to 2D
>>> coords_2d = apply_tsne(feature_matrix, n_components=2, perplexity=30)
>>> coords_2d.shape
(100, 2)

Notes

  • Processing time scales roughly O(n²) with sample size

  • Perplexity should be less than n_samples (typically n_samples/3 max)

  • Hamming distance is best for binary features (our use case)

  • Fixed random_state ensures reproducible results

  • For large datasets (>10,000 samples), consider using approximate methods

build_tools.syllable_analysis.build_hover_text(record, max_features=4)

Build rich hover text for a single syllable record.

Creates HTML-formatted hover text showing syllable details, frequency, and active features. Features are truncated if more than max_features are present.

Parameters:
  • record (dict) – Syllable record with ‘syllable’, ‘frequency’, ‘features’ keys

  • max_features (int) – Maximum features to show before truncating (default: 4)

Return type:

str

Returns:

HTML-formatted hover text string

Example

>>> record = {
...     "syllable": "kran",
...     "frequency": 150,
...     "features": {
...         "contains_plosive": True,
...         "contains_liquid": True,
...         "contains_nasal": True,
...         "starts_with_cluster": True,
...         "ends_with_nasal": True,
...     }
... }
>>> text = build_hover_text(record, max_features=4)
>>> print(text)
<b>kran</b><br>Frequency: 150<br>Features: 5/12<br><i>contains_plosive, ...</i><br>...

Notes

  • Syllable shown in bold

  • Frequency shown with comma separators (e.g., “1,234”)

  • Feature count shows active/total (e.g., “5/12”)

  • First N features shown in italics

  • If more than N features, shows “+M more” truncation message
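The notes above can be approximated with the following sketch. It is not the library's implementation: hover_text is a hypothetical helper, the exact line layout and the wording of the truncation message are assumptions, and features are listed in dictionary order.

```python
def hover_text(record, max_features=4, total_features=12):
    """Approximate the hover text described above (hypothetical helper)."""
    active = [name for name, on in record["features"].items() if on]
    lines = [
        f"<b>{record['syllable']}</b>",                # syllable in bold
        f"Frequency: {record['frequency']:,}",         # comma separators
        f"Features: {len(active)}/{total_features}",   # active/total count
        f"<i>{', '.join(active[:max_features])}</i>",  # first N features in italics
    ]
    hidden = len(active) - max_features
    if hidden > 0:
        lines.append(f"+{hidden} more")                # assumed truncation wording
    return "<br>".join(lines)


record = {
    "syllable": "kran",
    "frequency": 150,
    "features": {
        "contains_plosive": True,
        "contains_liquid": True,
        "contains_nasal": True,
        "starts_with_cluster": True,
        "ends_with_nasal": True,
    },
}
text = hover_text(record)
```

Plotly renders a small subset of HTML in hover labels, including bold, italics, and line breaks, which is why the text is assembled with those tags.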

build_tools.syllable_analysis.calculate_optimal_perplexity(n_samples, min_perplexity=5, max_perplexity=50)

Suggest optimal perplexity value based on dataset size.

Perplexity is a key t-SNE parameter that balances local vs global structure. This function provides a reasonable default based on dataset size.

Rule of thumb:
  • Perplexity should be between 5 and 50

  • Perplexity should be less than n_samples

  • Common heuristic: perplexity ≈ sqrt(n_samples), clamped to [5, 50]

Parameters:
  • n_samples (int) – Number of samples in dataset

  • min_perplexity (int) – Minimum perplexity value (default: 5)

  • max_perplexity (int) – Maximum perplexity value (default: 50)

Return type:

int

Returns:

Suggested perplexity value

Example

>>> calculate_optimal_perplexity(100)
10
>>> calculate_optimal_perplexity(1000)
31
>>> calculate_optimal_perplexity(10000)
50
>>> calculate_optimal_perplexity(10)
5

Notes

  • For small datasets (<25 samples): use min_perplexity (5)

  • For large datasets (>2500 samples): use max_perplexity (50)

  • For medium datasets: use sqrt(n_samples)

  • This is a heuristic, not a strict rule - experiment for best results
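The heuristic can be written in a few lines. This sketch reproduces the documented example values by taking the integer square root and clamping it; suggest_perplexity is a hypothetical name, and the real function may round or validate differently.

```python
import math


def suggest_perplexity(n_samples, min_perplexity=5, max_perplexity=50):
    """Clamp floor(sqrt(n_samples)) to [min_perplexity, max_perplexity]."""
    candidate = math.isqrt(n_samples)  # integer (floor) square root
    return max(min_perplexity, min(max_perplexity, candidate))


small = suggest_perplexity(10)      # sqrt is about 3.2, clamped up to 5
medium = suggest_perplexity(1000)   # floor(sqrt(1000)) = 31
large = suggest_perplexity(10000)   # sqrt = 100, clamped down to 50
```

The thresholds in the notes follow from the clamp: sqrt(25) = 5 and sqrt(2500) = 50, so only datasets between those sizes receive a value strictly inside the range.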

build_tools.syllable_analysis.create_interactive_scatter(records, tsne_coords, title='t-SNE: Feature Signature Space (Interactive)')

Create interactive Plotly scatter plot of t-SNE coordinates.

Generates an interactive HTML-compatible visualization with rich hover tooltips, zoom/pan controls, and export capabilities. Points are sized (log scale) and colored by frequency.

Parameters:
  • records (list[dict]) – List of annotated syllable records. Each must contain:

      - syllable (str): Syllable text

      - frequency (int): Occurrence count

      - features (dict): Boolean feature flags (12 features)

  • tsne_coords (ndarray) – 2D coordinate array of shape (n_samples, 2) from t-SNE

  • title (str) – Plot title (default: “t-SNE: Feature Signature Space (Interactive)”)

Return type:

Figure

Returns:

Plotly Figure object with configured interactive scatter plot

Raises:
  • ImportError – If Plotly is not installed

  • ValueError – If inputs are invalid or lengths don’t match

Example

>>> import numpy as np
>>> records = [
...     {"syllable": "ka", "frequency": 100, "features": {"contains_plosive": True}},
...     {"syllable": "mi", "frequency": 50, "features": {"contains_nasal": True}},
... ]
>>> coords = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> fig = create_interactive_scatter(records, coords)
>>> fig.show()  # Opens in browser

Notes

  • Point size uses log1p scale for better visibility across frequency ranges

  • Hover text shows syllable, frequency, feature count, and up to 4 features

  • If more than 4 features, shows “…+N more” truncation

  • Viridis colorscale provides perceptually uniform coloring

  • Fixed height (900px) with responsive width for consistent aspect ratio

  • Plotly CDN used when saving to HTML for smaller file size
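The log1p sizing mentioned in the notes compresses a wide frequency range into a narrow range of marker sizes. The sketch below shows the idea; marker_sizes, base, and scale are hypothetical names and values, since the actual constants are not documented here.

```python
import math


def marker_sizes(frequencies, base=4.0, scale=3.0):
    """Map frequencies to marker sizes on a log1p scale (illustrative constants)."""
    # log1p(f) = log(1 + f), so a frequency of 0 maps cleanly to the base size.
    return [base + scale * math.log1p(f) for f in frequencies]


sizes = marker_sizes([1, 10, 100, 1000])
```

A thousandfold increase in frequency grows the marker by roughly a factor of four in this sketch, so rare syllables stay visible next to very common ones.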

build_tools.syllable_analysis.create_metadata_footer(perplexity, random_state)

Create HTML metadata footer with algorithm parameters.

Generates a styled HTML block showing t-SNE parameters and generation information. Designed to be appended to the end of the HTML file.

Parameters:
  • perplexity (int) – t-SNE perplexity parameter used

  • random_state (int) – Random seed used for reproducibility

Return type:

str

Returns:

HTML string with formatted metadata table

Example

>>> footer = create_metadata_footer(perplexity=30, random_state=42)
>>> "t-SNE Visualization Parameters" in footer
True

Notes

  • Uses inline CSS for styling (no external dependencies)

  • Light gray background (#f5f5f5) for visual separation

  • Monospace font for technical parameters

  • Includes usage instructions for toolbar

  • Shows current timestamp of generation

build_tools.syllable_analysis.create_metadata_text(output_filename, dpi, perplexity, random_state, processing_time)

Generate formatted metadata text for static visualization.

Creates a human-readable metadata report describing the visualization parameters, algorithm settings, and interpretation guide. Suitable for saving alongside PNG output files.

Parameters:
  • output_filename (str) – Name of the output PNG file (e.g., “20260107_143022.tsne_visualization.png”)

  • dpi (int) – Resolution used for PNG export

  • perplexity (int) – t-SNE perplexity parameter used

  • random_state (int) – Random seed used for reproducibility

  • processing_time (float) – Total processing time in seconds

Return type:

str

Returns:

Formatted multi-line metadata string ready for file output

Example

>>> from pathlib import Path
>>> metadata = create_metadata_text(
...     output_filename="20260107_143022.tsne_visualization.png",
...     dpi=300,
...     perplexity=30,
...     random_state=42,
...     processing_time=2.5
... )
>>> Path("metadata.txt").write_text(metadata)

Notes

  • Includes timestamp of generation

  • Documents all algorithm parameters

  • Provides interpretation guidance

  • Uses Unicode box-drawing characters for formatting

build_tools.syllable_analysis.create_tsne_mapping(records, tsne_coords)

Create syllable→features→coordinates mapping.

Combines annotated syllable records with their t-SNE coordinates to create a comprehensive mapping structure. This is useful for:

  • Post-hoc cluster analysis

  • Cross-referencing visualizations

  • Interactive exploration

  • Sharing visualizations with collaborators

Parameters:
  • records (list[dict]) – Original annotated syllable records from load_annotated_syllables(). Each record should have:

      - syllable (str): The syllable text

      - frequency (int): Occurrence count

      - features (dict): Boolean feature flags

  • tsne_coords (ndarray) – t-SNE coordinate array (n_syllables × n_dimensions). Typically 2D for visualization, but can be 3D or higher.

Returns:

List of mapping records with structure:

[
    {
        "syllable": "kran",
        "frequency": 7,
        "tsne_x": -2.34,
        "tsne_y": 5.67,
        "features": {…}
    }
]

Raises:

ValueError – If records and tsne_coords have mismatched lengths

Example

>>> records = [
...     {"syllable": "ka", "frequency": 187, "features": {...}},
...     {"syllable": "ran", "frequency": 42, "features": {...}}
... ]
>>> import numpy as np
>>> coords = np.array([[-2.1, 3.4], [1.5, -0.8]])
>>> mapping = create_tsne_mapping(records, coords)
>>> mapping[0]["tsne_x"]
-2.1
>>> mapping[0]["syllable"]
'ka'

Notes

  • Array indices preserve order from input records

  • Coordinates are converted from numpy float64 to Python float for JSON compatibility

  • All original record fields are preserved in the mapping

  • For 2D t-SNE: creates tsne_x and tsne_y fields

  • For 3D+ t-SNE: creates tsne_x, tsne_y, tsne_z, … fields
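
The behavior described in these notes can be sketched roughly as follows. This is an illustration only, not the package's implementation: the helper name build_mapping_sketch and the AXIS_NAMES list are assumptions, and plain nested lists stand in for the NumPy array so the example runs with the standard library alone.

```python
from typing import Any, Sequence

# Hypothetical axis suffixes; the notes above only confirm tsne_x, tsne_y, tsne_z.
# Dimensions beyond the third are ignored by this sketch.
AXIS_NAMES = ["x", "y", "z"]


def build_mapping_sketch(
    records: Sequence[dict[str, Any]],
    coords: Sequence[Sequence[float]],
) -> list[dict[str, Any]]:
    """Pair each record with its coordinate row (illustrative only)."""
    if len(records) != len(coords):
        raise ValueError(
            f"Length mismatch: {len(records)} records vs {len(coords)} coordinate rows"
        )
    mapping = []
    for record, row in zip(records, coords):
        entry = dict(record)  # shallow copy keeps every original field
        for axis, value in zip(AXIS_NAMES, row):
            # float() turns numpy.float64 (or int) into a JSON-friendly Python float
            entry[f"tsne_{axis}"] = float(value)
        mapping.append(entry)
    return mapping


records = [
    {"syllable": "ka", "frequency": 187, "features": {"contains_plosive": True}},
    {"syllable": "ran", "frequency": 42, "features": {"contains_nasal": True}},
]
coords = [[-2.1, 3.4], [1.5, -0.8]]
mapping = build_mapping_sketch(records, coords)
print(mapping[0]["syllable"], mapping[0]["tsne_x"], mapping[0]["tsne_y"])
```

Because the rows are matched by position, the output order follows the input records, which is what makes the length check worthwhile.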

build_tools.syllable_analysis.create_tsne_scatter(tsne_coords, frequencies, title='t-SNE: Feature Signature Space', figsize=(14, 10), cmap='viridis', alpha=0.6)[source]

Create static matplotlib scatter plot of t-SNE coordinates.

Generates a publication-quality scatter plot showing t-SNE dimensionality reduction results. Points are sized and colored by frequency, with larger and brighter points indicating higher-frequency syllables.

Parameters:
  • tsne_coords (ndarray) – 2D coordinate array of shape (n_samples, 2) from t-SNE reduction

  • frequencies (list[int]) – Frequency values for each point (used for sizing and coloring)

  • title (str) – Plot title (default: “t-SNE: Feature Signature Space”)

  • figsize (tuple[int, int]) – Figure size in inches as (width, height) (default: (14, 10))

  • cmap (str) – Matplotlib colormap name (default: “viridis”)

  • alpha (float) – Point transparency, 0.0=transparent to 1.0=opaque (default: 0.6)

Return type:

Figure

Returns:

matplotlib Figure object with configured scatter plot

Raises:

ValueError – If tsne_coords shape is invalid or lengths don’t match

Example

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> coords = np.random.randn(100, 2)
>>> freqs = list(range(1, 101))
>>> fig = create_tsne_scatter(coords, freqs)
>>> fig.savefig("output.png", dpi=300)
>>> plt.close(fig)

Notes

  • Point size is proportional to frequency (frequency × 2)

  • Colorbar is added automatically to show frequency scale

  • Black edge lines improve visibility of overlapping points

  • Layout uses tight_layout() for optimal spacing

build_tools.syllable_analysis.ensure_output_dir(output_dir)[source]

Ensure output directory exists, creating it if necessary.

This function creates the specified directory and all parent directories if they don’t already exist. It is idempotent: calling it multiple times with the same path is safe, and repeat calls change nothing on disk.

Parameters:

output_dir (Path) – Directory path to ensure exists

Return type:

Path

Returns:

The same path that was passed in (for chaining)

Examples

Basic usage:

>>> from pathlib import Path
>>> output_dir = ensure_output_dir(Path("_working/analysis/tsne/"))
>>> output_dir.exists()
True

Create nested directories:

>>> nested = ensure_output_dir(Path("_working/new/nested/dirs/"))
>>> nested.exists()
True

Idempotent operation:

>>> dir1 = ensure_output_dir(Path("_working/test/"))
>>> dir2 = ensure_output_dir(Path("_working/test/"))
>>> dir1 == dir2
True

Chaining:

>>> output_file = ensure_output_dir(Path("_working/analysis/")) / "output.json"

Notes

This function uses Path.mkdir(parents=True, exist_ok=True), which:

  • Creates all parent directories as needed (like mkdir -p)

  • Does not raise an error if the directory already exists

  • Raises PermissionError if permissions are insufficient

  • Raises OSError for other filesystem errors

The function returns the input path unchanged, which allows for convenient chaining in expressions.
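
A minimal sketch of the pattern these notes describe, assuming nothing beyond the Path.mkdir call named above. The function name ensure_dir_sketch is hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_dir_sketch(output_dir: Path) -> Path:
    """Create output_dir (and any missing parents), then hand the path back."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "analysis" / "tsne"
    first = ensure_dir_sketch(target)
    second = ensure_dir_sketch(target)  # repeat call: directory already exists
    created = target.is_dir()
    same_path = first == second
    # Returning the path allows building a file path in one expression
    report_name = (ensure_dir_sketch(target) / "output.json").name
```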

build_tools.syllable_analysis.extract_feature_matrix(records, feature_names=['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel', 'short_vowel', 'starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'ends_with_vowel', 'ends_with_stop', 'ends_with_nasal'])[source]

Extract binary feature matrix from annotated syllable records.

Converts feature dictionaries to a numerical matrix suitable for dimensionality reduction algorithms. Each row represents a syllable, each column represents a feature (0 or 1).

Parameters:
  • records (list[dict]) –

    List of annotated syllable records with ‘features’ and ‘frequency’ keys. Each record should have the structure:

    {
        "syllable": "ka",
        "frequency": 187,
        "features": {"contains_liquid": False, "contains_plosive": True, …}
    }

  • feature_names (list[str]) – Ordered list of feature names to extract (default: ALL_FEATURES). Order determines column order in output matrix.

Returns:

Tuple of (feature_matrix, frequencies):

  • feature_matrix: numpy array of shape (n_syllables, n_features) with binary values

  • frequencies: List of frequency counts for each syllable

Return type:

tuple

Example

>>> records = [
...     {
...         "syllable": "ka",
...         "frequency": 187,
...         "features": {"contains_liquid": False, "contains_plosive": True}
...     }
... ]
>>> matrix, freqs = extract_feature_matrix(records)
>>> matrix.shape
(1, 12)
>>> freqs
[187]

Notes

  • Missing features default to False (0)

  • Feature values are converted to int (True→1, False→0)

  • Output matrix dtype is int for memory efficiency

  • Empty record list returns (0, n_features) shaped array

build_tools.syllable_analysis.extract_signature(features)[source]

Extract the feature signature from a feature dictionary.

A signature is a sorted tuple of feature names where the feature value is True. This creates a canonical representation of the active feature set.

Parameters:

features (dict[str, bool]) – Dictionary mapping feature names to boolean values

Return type:

tuple[str, ...]

Returns:

Sorted tuple of feature names that are active (True)

Example

>>> extract_signature({"starts_with_vowel": True, "ends_with_vowel": False})
('starts_with_vowel',)

build_tools.syllable_analysis.format_signature_report(signature_counter, total_syllables, limit=None)[source]

Format the signature analysis results as a human-readable report.

Parameters:
  • signature_counter (Counter) – Counter of signatures to their occurrence counts

  • total_syllables (int) – Total number of syllables in the corpus

  • limit (int | None) – Maximum number of signatures to include (None = all)

Return type:

str

Returns:

Formatted multi-line string report
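
To make the relationship between signatures, the Counter, and the report concrete, here is a rough sketch. The function names signature_of and format_report_sketch are hypothetical, and the line layout of the report is invented for illustration; the real report format is not documented here.

```python
from collections import Counter


def signature_of(features):
    """Sorted tuple of the feature names whose value is True."""
    return tuple(sorted(name for name, active in features.items() if active))


def format_report_sketch(signature_counter, total_syllables, limit=None):
    """Render the most common signatures, one per line (layout is invented)."""
    lines = [
        f"{len(signature_counter)} unique signatures across {total_syllables} syllables"
    ]
    # most_common(None) returns every entry, ordered by descending count
    ranked = signature_counter.most_common(limit)
    for rank, (signature, count) in enumerate(ranked, start=1):
        share = 100.0 * count / total_syllables if total_syllables else 0.0
        label = ", ".join(signature) if signature else "(no active features)"
        lines.append(f"{rank:>3}. {count:>5}  {share:5.1f}%  {label}")
    return "\n".join(lines)


records = [
    {"syllable": "ka", "features": {"contains_plosive": True, "ends_with_vowel": True}},
    {"syllable": "ta", "features": {"contains_plosive": True, "ends_with_vowel": True}},
    {"syllable": "an", "features": {"starts_with_vowel": True, "ends_with_nasal": True}},
]
counter = Counter(signature_of(record["features"]) for record in records)
report = format_report_sketch(counter, total_syllables=len(records), limit=1)
print(report)
```

Using tuples as signatures keeps them hashable, so they can serve directly as Counter keys.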

build_tools.syllable_analysis.generate_output_pair(output_dir, primary_suffix, metadata_suffix, primary_ext='txt', metadata_ext='txt')[source]

Generate matching pair of timestamped output paths.

This function is useful for tools that generate both primary output and accompanying metadata files. It ensures both files use the same timestamp, making it easy to associate files and maintain chronological ordering.

Parameters:
  • output_dir (Path) – Output directory (should exist or be created first)

  • primary_suffix (str) – Suffix for primary output file (e.g., ‘tsne_visualization’)

  • metadata_suffix (str) – Suffix for metadata file (e.g., ‘tsne_metadata’)

  • primary_ext (str) – File extension for primary file (e.g., ‘png’, ‘json’, ‘html’) (default: ‘txt’)

  • metadata_ext (str) – File extension for metadata file, typically ‘txt’ or ‘json’ (default: ‘txt’)

Return type:

tuple[Path, Path]

Returns:

Tuple of (primary_path, metadata_path) with matching timestamps

Examples

Generate visualization and metadata pair:

>>> from pathlib import Path
>>> viz_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/analysis/tsne/"),
...     primary_suffix="tsne_visualization",
...     metadata_suffix="tsne_metadata",
...     primary_ext="png",
...     metadata_ext="txt"
... )
>>> viz_path.name
'20260107_143022.tsne_visualization.png'
>>> meta_path.name
'20260107_143022.tsne_metadata.txt'

Generate data and metadata pair:

>>> data_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/results/"),
...     primary_suffix="analysis_results",
...     metadata_suffix="analysis_meta",
...     primary_ext="json",
...     metadata_ext="json"
... )

Verify timestamp matching:

>>> viz_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/"),
...     primary_suffix="primary",
...     metadata_suffix="metadata"
... )
>>> viz_path.stem.split('.')[0] == meta_path.stem.split('.')[0]  # same timestamp
True

Notes

Output Format:

Primary:  {output_dir}/{timestamp}.{primary_suffix}.{primary_ext}
Metadata: {output_dir}/{timestamp}.{metadata_suffix}.{metadata_ext}

Examples:

  • Primary: _working/analysis/tsne/20260107_143022.tsne_visualization.png

  • Metadata: _working/analysis/tsne/20260107_143022.tsne_metadata.txt

The timestamp is generated once and used for both files, ensuring they are always associated and sort together in directory listings.

The directory is NOT created by this function - use ensure_output_dir() first if the directory might not exist.

Use Cases:

  • t-SNE visualizer: PNG + metadata TXT

  • Feature analysis: results JSON + metadata TXT

  • Interactive plots: HTML + metadata JSON
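
The key idea, generating the timestamp once and reusing it for both paths, might look like the sketch below. The name output_pair_sketch is hypothetical and the strftime pattern is an assumed equivalent of the documented YYYYMMDD_HHMMSS format; like the documented function, the sketch does not create the directory.

```python
from datetime import datetime
from pathlib import Path


def output_pair_sketch(output_dir, primary_suffix, metadata_suffix,
                       primary_ext="txt", metadata_ext="txt"):
    """Build two paths that share one timestamp (illustrative only)."""
    # Computed exactly once, so both names are guaranteed to match even if
    # the clock ticks over to the next second in between.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    primary = output_dir / f"{stamp}.{primary_suffix}.{primary_ext}"
    metadata = output_dir / f"{stamp}.{metadata_suffix}.{metadata_ext}"
    return primary, metadata


viz_path, meta_path = output_pair_sketch(
    Path("_working/analysis/tsne"),
    primary_suffix="tsne_visualization",
    metadata_suffix="tsne_metadata",
    primary_ext="png",
)
shared_stamp = viz_path.name.split(".")[0]
print(viz_path.name)
print(meta_path.name)
```

Calling a single-path helper twice without passing an explicit timestamp could yield two different stamps, which is the situation the shared value avoids.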

build_tools.syllable_analysis.generate_timestamped_path(output_dir, suffix, extension='txt', timestamp=None)[source]

Generate timestamped output file path.

This function creates a path with format: {output_dir}/{timestamp}.{suffix}.{extension}

The timestamp format is YYYYMMDD_HHMMSS (e.g., 20260107_143022), which provides:

  • Chronological sorting

  • Uniqueness (assuming no more than one file per second)

  • Human readability

  • No special characters that could cause path issues

Parameters:
  • output_dir (Path) – Output directory (should exist or be created first)

  • suffix (str) – File suffix describing content (e.g., ‘tsne_visualization’, ‘metadata’)

  • extension (str) – File extension without leading dot (e.g., ‘txt’, ‘json’, ‘png’) (default: ‘txt’)

  • timestamp (str, optional) – Specific timestamp string (format: YYYYMMDD_HHMMSS). If None (default), uses current time via datetime.now()

Return type:

Path

Returns:

Timestamped output file path

Examples

Basic usage (auto-generated timestamp):

>>> from pathlib import Path
>>> path = generate_timestamped_path(
...     output_dir=Path("_working/analysis/tsne/"),
...     suffix="tsne_visualization",
...     extension="png"
... )
>>> path.name
'20260107_143022.tsne_visualization.png'

Custom extension:

>>> path = generate_timestamped_path(
...     output_dir=Path("_working/"),
...     suffix="results",
...     extension="json"
... )
>>> path.suffix
'.json'

Explicit timestamp (for reproducibility or paired files):

>>> path = generate_timestamped_path(
...     output_dir=Path("_working/"),
...     suffix="output",
...     timestamp="20260107_120000"
... )
>>> "20260107_120000" in str(path)
True

Notes

File Naming Convention:

{YYYYMMDD_HHMMSS}.{suffix}.{extension}

Examples:

  • 20260107_143022.tsne_visualization.png

  • 20260107_143022.tsne_metadata.txt

  • 20260107_143022.feature_signatures.txt

The directory is NOT created by this function - use ensure_output_dir() first if the directory might not exist.

Timestamp Format:

  • YYYY: 4-digit year

  • MM: 2-digit month (01-12)

  • DD: 2-digit day (01-31)

  • HH: 2-digit hour (00-23)

  • MM: 2-digit minute (00-59)

  • SS: 2-digit second (00-59)
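
As a small check of the properties claimed for this format, the snippet below round-trips a timestamp and sorts a few file names. TIMESTAMP_FORMAT is an assumed strftime equivalent of YYYYMMDD_HHMMSS, and the file names other than the documented 20260107_143022 example are made up.

```python
from datetime import datetime

# Assumed strftime/strptime equivalent of YYYYMMDD_HHMMSS
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"

moment = datetime(2026, 1, 7, 14, 30, 22)
stamp = moment.strftime(TIMESTAMP_FORMAT)
parsed = datetime.strptime(stamp, TIMESTAMP_FORMAT)  # lossless at 1 s resolution

# Fixed-width, most-significant-first fields mean plain string sorting
# is also chronological sorting.
names = [
    "20260107_143022.tsne_visualization.png",
    "20251231_235959.tsne_visualization.png",
    "20260107_090000.tsne_visualization.png",
]
ordered = sorted(names)
print(stamp)
print(ordered)
```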

build_tools.syllable_analysis.get_feature_vector(features, feature_names=['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel', 'short_vowel', 'starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'ends_with_vowel', 'ends_with_stop', 'ends_with_nasal'])[source]

Extract a single feature vector from a feature dictionary.

Converts a dictionary of feature flags to an ordered binary vector. Useful for extracting vectors from individual syllables.

Parameters:
  • features (dict[str, bool]) – Dictionary of feature name → boolean value

  • feature_names (list[str]) – Ordered list of feature names (default: ALL_FEATURES)

Return type:

list[int]

Returns:

Binary feature vector matching feature_names order

Example

>>> features = {"contains_liquid": True, "contains_plosive": False}
>>> vector = get_feature_vector(features, ["contains_liquid", "contains_plosive"])
>>> vector
[1, 0]

Notes

  • Missing features default to False (0)

  • Order of output matches order of feature_names

  • Output is Python list, not numpy array (for flexibility)
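
The conversion described in these notes can be sketched in a few lines. The name feature_vector_sketch is hypothetical; the defaulting and ordering behavior mirrors the notes above rather than the actual source.

```python
def feature_vector_sketch(features, feature_names):
    """Ordered 0/1 list; names absent from `features` count as False."""
    return [int(bool(features.get(name, False))) for name in feature_names]


names = ["contains_liquid", "contains_plosive", "contains_nasal"]
vector = feature_vector_sketch(
    {"contains_liquid": True, "contains_plosive": False},
    names,
)
print(vector)  # contains_nasal is absent, so its slot is 0

# Stacking one vector per record gives the rows of a feature matrix
records = [
    {"syllable": "ka", "features": {"contains_plosive": True}},
    {"syllable": "ran", "features": {"contains_liquid": True, "contains_nasal": True}},
]
rows = [feature_vector_sketch(record["features"], names) for record in records]
```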

build_tools.syllable_analysis.inject_responsive_css(html_content, min_width=1250)[source]

Inject responsive CSS into HTML content.

Adds CSS rules to ensure the plot has a minimum width and proper scrolling behavior. This prevents the plot from becoming too narrow on small screens while allowing horizontal scrolling when necessary.

Parameters:
  • html_content (str) – Original HTML content from Plotly

  • min_width (int) – Minimum width constraint in pixels (default: 1250)

Return type:

str

Returns:

HTML content with injected CSS in <head> section

Example

>>> html = "<html><head></head><body>...</body></html>"
>>> modified = inject_responsive_css(html, min_width=1250)
>>> "<style>" in modified
True

Notes

  • CSS is inserted after the opening <head> tag

  • Sets body margin/padding to 0 for full-width layout

  • Enables horizontal scrolling when plot exceeds viewport width

  • Sets fixed height (900px) matching plot configuration

  • Uses !important to override Plotly’s inline styles
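
One way such an injection could work is sketched below. This is not the package's code: inject_css_sketch is a hypothetical name, the exact CSS rules and the .plotly-graph-div selector are assumptions chosen to match the notes, and returning the input unchanged when no <head> tag is found is also an assumption.

```python
def inject_css_sketch(html_content: str, min_width: int = 1250) -> str:
    """Insert a <style> block right after the opening <head> tag."""
    css = (
        "<style>"
        "body { margin: 0; padding: 0; overflow-x: auto; }"
        f".plotly-graph-div {{ min-width: {min_width}px !important; "
        "height: 900px !important; }"
        "</style>"
    )
    head_tag = "<head>"
    index = html_content.find(head_tag)
    if index == -1:
        # Assumed fallback: nothing to anchor on, so leave the HTML untouched
        return html_content
    insert_at = index + len(head_tag)
    return html_content[:insert_at] + css + html_content[insert_at:]


html = "<html><head></head><body>...</body></html>"
modified = inject_css_sketch(html, min_width=1250)
print(modified)
```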

build_tools.syllable_analysis.load_annotated_syllables(input_path, validate=True)[source]

Load annotated syllables from JSON file with optional validation.

This function loads the output of the syllable feature annotator, which contains syllables with their frequencies and phonetic feature annotations. It provides optional validation to ensure the data structure is correct.

Parameters:
  • input_path (Path) – Path to syllables_annotated.json file

  • validate (bool) – Whether to validate the structure of loaded data (default: True). When True, checks:

      - Data is a list
      - List is non-empty
      - First record has required keys: ‘syllable’, ‘frequency’, ‘features’

Return type:

list[dict[str, Any]]

Returns:

List of syllable records, each containing:

  • syllable (str): The syllable text

  • frequency (int): Occurrence count in corpus

  • features (dict): Boolean feature flags (12 features)

Raises:
  • FileNotFoundError – If input file does not exist

  • json.JSONDecodeError – If file is not valid JSON

  • ValueError – If validation is enabled and data structure is invalid

Examples

Basic loading with validation:

>>> from pathlib import Path
>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"))
>>> len(records)
1247
>>> records[0].keys()
dict_keys(['syllable', 'frequency', 'features'])

Loading without validation (faster, use when structure is guaranteed):

>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"),
...                                     validate=False)

Error handling:

>>> try:
...     records = load_annotated_syllables(Path("nonexistent.json"))
... except FileNotFoundError as e:
...     print(f"File not found: {e}")

Notes

Expected file format (syllables_annotated.json):

[
    {
        "syllable": "ka",
        "frequency": 187,
        "features": {
            "contains_liquid": false,
            "contains_plosive": true,
            "contains_fricative": false,
            "contains_nasal": false,
            "long_vowel": false,
            "short_vowel": true,
            "starts_with_vowel": false,
            "starts_with_cluster": false,
            "starts_with_heavy_cluster": false,
            "ends_with_vowel": true,
            "ends_with_stop": false,
            "ends_with_nasal": false
        }
    },
    ...
]

This file is produced by the syllable feature annotator pipeline and is the primary input for analysis tools.

Performance

Loading a typical corpus of 1,000-10,000 syllables takes <100ms. Validation adds negligible overhead (~1ms).
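
The three validation checks listed for this function can be sketched as follows. The name load_records_sketch and the error messages are hypothetical; only the checks themselves (list, non-empty, required keys on the first record) come from the description above.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"syllable", "frequency", "features"}


def load_records_sketch(input_path: Path, validate: bool = True):
    """Load JSON and optionally apply the three documented checks."""
    with input_path.open(encoding="utf-8") as handle:
        data = json.load(handle)  # may raise json.JSONDecodeError
    if validate:
        if not isinstance(data, list):
            raise ValueError("Expected a JSON list of records")
        if not data:
            raise ValueError("Record list is empty")
        first = data[0]
        if not isinstance(first, dict) or not REQUIRED_KEYS.issubset(first):
            raise ValueError(f"First record must contain {sorted(REQUIRED_KEYS)}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "syllables_annotated.json"
    good.write_text(
        json.dumps([{"syllable": "ka", "frequency": 187,
                     "features": {"contains_plosive": True}}]),
        encoding="utf-8",
    )
    records = load_records_sketch(good)

    bad = Path(tmp) / "bad.json"
    bad.write_text(json.dumps({"ka": 187}), encoding="utf-8")
    try:
        load_records_sketch(bad)
        validation_error = None
    except ValueError as exc:
        validation_error = str(exc)
    # With validate=False the same file loads without complaint
    unchecked = load_records_sketch(bad, validate=False)
```

Checking only the first record keeps validation cheap, at the cost of not catching malformed records later in the list.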

build_tools.syllable_analysis.load_frequency_data(frequencies_path)[source]

Load frequency mapping from JSON file.

This function loads the output of the syllable normalizer’s frequency analysis, which maps each canonical syllable to its occurrence count in the raw corpus.

Parameters:

frequencies_path (Path) – Path to syllables_frequencies.json file

Return type:

dict[str, int]

Returns:

Dictionary mapping syllable strings to their frequency counts

Raises:
  • FileNotFoundError – If input file does not exist

  • json.JSONDecodeError – If file is not valid JSON

  • ValueError – If data structure is invalid (not a dict)

Examples

Basic loading:

>>> from pathlib import Path
>>> frequencies = load_frequency_data(Path("data/normalized/syllables_frequencies.json"))
>>> frequencies["ka"]
187
>>> len(frequencies)
1247

Checking most common syllables:

>>> sorted_freqs = sorted(frequencies.items(), key=lambda x: x[1], reverse=True)
>>> sorted_freqs[:3]
[('ka', 187), ('ra', 162), ('mi', 145)]

Error handling:

>>> try:
...     frequencies = load_frequency_data(Path("nonexistent.json"))
... except FileNotFoundError:
...     print("File not found")

Notes

Expected file format (syllables_frequencies.json):

{
    "ka": 187,
    "ra": 162,
    "mi": 145,
    "ta": 98,
    ...
}

This file is produced by the syllable normalizer’s frequency analysis step and captures pre-deduplication counts (how many times each syllable appeared in the raw corpus before creating the unique syllable list).

The frequencies can be used for:

  • Weighted analysis (prioritize common syllables)

  • Filtering (exclude rare syllables)

  • Visualization (size/color by frequency)

  • Statistical analysis

Performance

Loading a typical frequency file (1,000-10,000 entries) takes <50ms.
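
A few of the uses listed in the notes can be illustrated on an in-memory mapping. The values for "ka", "ra", "mi", and "ta" come from the format example above; the "zu" entry and the MIN_FREQUENCY threshold are made up for the demonstration.

```python
frequencies = {"ka": 187, "ra": 162, "mi": 145, "ta": 98, "zu": 3}

MIN_FREQUENCY = 10  # hypothetical cutoff for "rare" syllables

# Filtering: drop syllables below the cutoff
common = {syl: n for syl, n in frequencies.items() if n >= MIN_FREQUENCY}

# Ranking: most frequent syllables first
top_three = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)[:3]

# Weighting: relative share of each syllable in the raw corpus
total = sum(frequencies.values())
weights = {syl: n / total for syl, n in frequencies.items()}

print(top_three)
print(sorted(common))
```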

build_tools.syllable_analysis.parse_feature_signatures_args()

Parse command-line arguments.

Return type:

Namespace

build_tools.syllable_analysis.parse_random_sampler_arguments()

Parse command-line arguments.

Return type:

Namespace

Returns:

Parsed argument namespace.

build_tools.syllable_analysis.parse_tsne_visualizer_args()

Parse command-line arguments.

Return type:

Namespace

Returns:

Parsed argument namespace with validated parameters

build_tools.syllable_analysis.run_analysis(input_path, output_dir, limit=None)[source]

Run the complete feature signature analysis pipeline.

Parameters:
  • input_path (Path) – Path to syllables_annotated.json

  • output_dir (Path) – Directory to save analysis results

  • limit (int | None) – Maximum number of signatures to include in report (None = all)

Returns:

Dictionary with analysis results including:

  • total_syllables: Total number of syllables analyzed

  • unique_signatures: Number of unique feature signatures

  • output_path: Path to the saved report

Return type:

dict

build_tools.syllable_analysis.run_tsne_visualization(input_path, output_dir, perplexity=30, random_state=42, dpi=300, verbose=False, save_mapping=False, interactive=False)[source]

Run the complete t-SNE visualization pipeline.

This is the main entry point for programmatic use. It handles the full workflow:

  1. Load annotated syllables

  2. Extract feature matrix

  3. Apply t-SNE dimensionality reduction

  4. Create visualization

  5. Save outputs (PNG + optional HTML + optional mapping)

Parameters:
  • input_path (Path) – Path to syllables_annotated.json

  • output_dir (Path) – Directory to save visualization outputs

  • perplexity (int) – t-SNE perplexity parameter (default: 30)

  • random_state (int) – Random seed for reproducibility (default: 42)

  • dpi (int) – Output resolution in dots per inch (default: 300)

  • verbose (bool) – Print detailed progress information

  • save_mapping (bool) – Save syllable→features→coordinates mapping as JSON (default: False)

  • interactive (bool) – Generate interactive HTML visualization (requires Plotly, default: False)

Returns:

Dictionary containing:

  • syllable_count: Number of syllables visualized

  • feature_count: Number of features (always 12)

  • output_path: Path to saved visualization PNG

  • metadata_path: Path to saved metadata file

  • tsne_coordinates: numpy array of 2D coordinates

  • mapping_path: Path to mapping JSON (None if save_mapping=False)

  • interactive_path: Path to interactive HTML (None if interactive=False or Plotly unavailable)

  • processing_time: Total processing time in seconds

Return type:

dict

Example

>>> from pathlib import Path
>>> result = run_tsne_visualization(
...     input_path=Path("data/annotated/syllables_annotated.json"),
...     output_dir=Path("_working/analysis/tsne/"),
...     interactive=True,
...     save_mapping=True
... )
>>> print(f"Visualized {result['syllable_count']} syllables")
>>> print(f"Interactive HTML: {result['interactive_path']}")

build_tools.syllable_analysis.sample_syllables(records, sample_count, seed=None)[source]

Randomly sample syllables from the full corpus.

Parameters:
  • records (list[dict[str, Any]]) – List of annotated syllable records.

  • sample_count (int) – Number of samples to draw.

  • seed (int | None) – Optional random seed for reproducibility.

Return type:

list[dict[str, Any]]

Returns:

List of sampled syllable records.

Raises:

ValueError – If sample_count is larger than available records.
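
Seeded sampling with the documented error condition could be sketched like this. The name sample_sketch is hypothetical, and using a local random.Random instance is a design assumption, not something the documentation confirms.

```python
import random


def sample_sketch(records, sample_count, seed=None):
    """Draw sample_count distinct records; the same seed gives the same draw."""
    if sample_count > len(records):
        raise ValueError(
            f"Requested {sample_count} samples but only {len(records)} records exist"
        )
    rng = random.Random(seed)  # local generator leaves the global RNG untouched
    return rng.sample(records, sample_count)


records = [{"syllable": s} for s in ["ka", "ra", "mi", "ta", "no", "li"]]
first_draw = sample_sketch(records, 3, seed=42)
second_draw = sample_sketch(records, 3, seed=42)
print([r["syllable"] for r in first_draw])
```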

build_tools.syllable_analysis.save_interactive_html(fig, output_path, perplexity, random_state, min_width=1250)[source]

Save interactive Plotly figure as standalone HTML.

Creates a self-contained HTML file with embedded Plotly visualization that can be:

  • Opened directly in any web browser

  • Shared with collaborators

  • Embedded in reports or documentation

  • Explored with zoom, pan, hover, and export controls

The HTML file uses Plotly CDN for JavaScript dependencies (smaller file size) and includes responsive CSS and a metadata footer.

Parameters:
  • fig (Figure) – Plotly Figure object from create_interactive_scatter()

  • output_path (Path) – Output HTML file path (parent directory must exist)

  • perplexity (int) – t-SNE perplexity parameter (for metadata footer)

  • random_state (int) – Random seed used (for metadata footer)

  • min_width (int) – Minimum width constraint in pixels (default: 1250)

Return type:

None

Example

>>> from pathlib import Path
>>> fig = create_interactive_scatter(records, tsne_coords)
>>> output_path = Path("_working/visualization.html")
>>> save_interactive_html(fig, output_path, perplexity=30, random_state=42)

Notes

  • Plotly CDN used for smaller file size vs. full JS bundle

  • Mode bar configured with additional tools (hoverclosest, hovercompare)

  • Export to PNG button configured for high-resolution (1600x1200, 2x scale)

  • Responsive CSS ensures minimum width of 1250px

  • Metadata footer includes algorithm parameters and generation time

build_tools.syllable_analysis.save_json_output(data, output_path, indent=2, ensure_ascii=False)[source]

Save data as formatted JSON file.

This function provides consistent JSON output formatting across all analysis tools. It ensures proper Unicode handling, readable indentation, and creates parent directories if needed.

Parameters:
  • data (Any) – Data to serialize as JSON (must be JSON-serializable)

  • output_path (Path) – Output file path (parent directories will be created if needed)

  • indent (int | None) – Number of spaces for JSON indentation. Use 2 for readability, None for compact output (default: 2)

  • ensure_ascii (bool) – If True, escape non-ASCII characters. If False (default), preserve Unicode characters for better readability

Return type:

None

Raises:
  • TypeError – If data is not JSON-serializable

  • OSError – If file cannot be written (permissions, disk full, etc.)

Examples

Save analysis results:

>>> from pathlib import Path
>>> results = {"total": 1247, "unique": 892}
>>> save_json_output(results, Path("output/results.json"))

Save with compact formatting:

>>> save_json_output(results, Path("output/compact.json"), indent=None)

Save with ASCII-only encoding:

>>> save_json_output(results, Path("output/ascii.json"), ensure_ascii=True)

Auto-create parent directories:

>>> save_json_output(results, Path("output/new/dir/results.json"))
>>> # Creates output/new/dir/ automatically

Notes

Default settings (indent=2, ensure_ascii=False) are optimized for:

  • Human readability (indented)

  • Unicode support (preserve accented characters, emojis, etc.)

  • Version control friendliness (consistent line breaks)

File encoding is always UTF-8 for maximum compatibility.

Performance

Saving 1,000-10,000 records typically takes <100ms. Using indent=None (compact) is ~20% faster but much less readable.
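
The effect of the indent and ensure_ascii options is easiest to see side by side. The sketch below is an assumption about how such a helper could be written with the standard json module; save_json_sketch is a hypothetical name and the sample data is invented.

```python
import json
import tempfile
from pathlib import Path


def save_json_sketch(data, output_path, indent=2, ensure_ascii=False):
    """Write UTF-8 JSON, creating parent directories first."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=indent, ensure_ascii=ensure_ascii)


with tempfile.TemporaryDirectory() as tmp:
    # Defaults: indented, Unicode preserved, nested directories auto-created
    pretty_path = Path(tmp) / "new" / "dir" / "results.json"
    save_json_sketch({"syllable": "né", "total": 1247}, pretty_path)
    pretty_text = pretty_path.read_text(encoding="utf-8")

    # Compact and ASCII-only: one line, non-ASCII escaped as \uXXXX
    ascii_path = Path(tmp) / "ascii.json"
    save_json_sketch({"syllable": "né"}, ascii_path, indent=None, ensure_ascii=True)
    ascii_text = ascii_path.read_text(encoding="utf-8")

print(pretty_text)
print(ascii_text)
```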

build_tools.syllable_analysis.save_report(report, output_dir)[source]

Save the formatted report to the output directory.

Parameters:
  • report (str) – Formatted report string

  • output_dir (Path) – Directory to save the report in

Return type:

Path

Returns:

Path to the saved report file

build_tools.syllable_analysis.save_static_plot(fig, output_path, dpi=300)[source]

Save matplotlib figure to PNG file.

Saves a matplotlib Figure to a high-resolution PNG file suitable for publication or presentation. Uses tight bounding box to minimize whitespace.

Parameters:
  • fig (Figure) – Matplotlib Figure object to save

  • output_path (Path) – Output PNG file path (parent directory must exist)

  • dpi (int) – Resolution in dots per inch (default: 300 for publication quality)

Return type:

None

Example

>>> import matplotlib.pyplot as plt
>>> from pathlib import Path
>>> fig, ax = plt.subplots()
>>> ax.plot([1, 2, 3], [1, 4, 9])
>>> save_static_plot(fig, Path("output.png"), dpi=300)
>>> plt.close(fig)

Notes

  • Uses bbox_inches=’tight’ to remove excess whitespace

  • Higher DPI values create larger files but better quality

  • Common DPI values: 150 (screen), 300 (print), 600 (high-quality print)

  • Figure is NOT automatically closed after saving

build_tools.syllable_analysis.save_tsne_mapping(mapping, output_path, indent=2)[source]

Save t-SNE mapping to JSON file.

Writes the syllable→coordinates mapping as formatted JSON for human readability and programmatic access.

Parameters:
  • mapping (list[dict]) – Mapping data from create_tsne_mapping()

  • output_path (Path) – Output file path (should end in .json)

  • indent (int) – JSON indentation for readability (default: 2)

Return type:

None

Example

>>> from pathlib import Path
>>> mapping = [{"syllable": "ka", "tsne_x": -2.1, "tsne_y": 3.4, "features": {...}}]
>>> save_tsne_mapping(mapping, Path("output.json"))

Notes

  • Output is formatted with indentation for human readability

  • Uses ensure_ascii=False to preserve Unicode characters

  • UTF-8 encoding ensures international character support

  • Parent directories are created if they don’t exist

build_tools.syllable_analysis.validate_feature_matrix(feature_matrix, expected_features=12)[source]

Validate feature matrix shape and contents.

Ensures the feature matrix has the expected structure for dimensionality reduction algorithms.

Parameters:
  • feature_matrix (ndarray) – Binary feature matrix

  • expected_features (int) – Expected number of features (default: 12)

Raises:

ValueError – If validation fails (wrong shape, non-binary values, etc.)

Return type:

None

Example

>>> import numpy as np
>>> matrix = np.array([[1, 0, 1], [0, 1, 0]])
>>> validate_feature_matrix(matrix, expected_features=3)  # OK
>>> validate_feature_matrix(matrix, expected_features=4)  # Raises ValueError