Analysis Tools

Overview

Analysis tools for annotated syllables.

This subpackage provides post-annotation analysis utilities for inspecting and understanding the annotated syllable corpus.

Subpackages

common: Shared utilities (data I/O, paths, output management)
dimensionality: Dimensionality reduction (feature matrices, t-SNE, mapping)
plotting: Visualization utilities (static matplotlib, interactive Plotly)

Available Tools

random_sampler: Random sampling utility for QA and inspection
feature_signatures: Feature signature analysis and distribution reporting
tsne_visualizer: t-SNE visualization of feature signature space

Quick Start

Random sampling:

$ python -m build_tools.syllable_analysis.random_sampler --samples 50

Feature signature analysis:

$ python -m build_tools.syllable_analysis.feature_signatures

t-SNE visualization:

$ python -m build_tools.syllable_analysis.tsne_visualizer

Programmatic Usage

Using common utilities:

>>> from build_tools.syllable_analysis import (
...     default_paths,
...     load_annotated_syllables,
...     ensure_output_dir,
... )
>>> # Load data using default paths
>>> records = load_annotated_syllables(default_paths.annotated_syllables)
>>> # Prepare output directory
>>> output_dir = ensure_output_dir(default_paths.analysis_output_dir("my_tool"))

Random sampling:

>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
...     load_annotated_syllables,
...     sample_syllables,
...     save_json_output
... )
>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"))
>>> samples = sample_syllables(records, 50, seed=42)
>>> save_json_output(samples, Path("output.json"))

Feature signature analysis:

>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
...     run_analysis,
...     extract_signature,
...     analyze_feature_signatures
... )
>>> result = run_analysis(
...     input_path=Path("data/annotated/syllables_annotated.json"),
...     output_dir=Path("_working/analysis/"),
...     limit=20
... )

t-SNE visualization:

>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
...     run_tsne_visualization,
...     extract_feature_matrix
... )
>>> result = run_tsne_visualization(
...     input_path=Path("data/annotated/syllables_annotated.json"),
...     output_dir=Path("_working/analysis/tsne/")
... )

Random Sampler

Random sampling utility for QA and inspection of annotated syllables.

Command-Line Interface

Randomly sample annotated syllables for inspection.

usage: python -m build_tools.syllable_analysis.random_sampler
       [-h] [--input INPUT] [--output OUTPUT] [--samples SAMPLES]
       [--seed SEED]

Named Arguments

--input

Path to input annotated syllables JSON file (default: <project_root>/data/annotated/syllables_annotated.json)

Default: <project_root>/data/annotated/syllables_annotated.json

--output

Path to output samples JSON file (default: <project_root>/_working/random_samples.json)

Default: <project_root>/_working/random_samples.json

--samples

Number of syllables to sample (default: 100)

Default: 100

--seed

Random seed for reproducibility (default: None, uses system randomness)

# Sample 100 syllables (default)
python -m build_tools.syllable_analysis.random_sampler

# Sample 50 syllables
python -m build_tools.syllable_analysis.random_sampler --samples 50

# Use custom paths
python -m build_tools.syllable_analysis.random_sampler \
    --input data/annotated/syllables_annotated.json \
    --output _working/my_samples.json \
    --samples 200

# Use a specific seed for reproducibility
python -m build_tools.syllable_analysis.random_sampler --samples 50 --seed 42
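
The effect of --seed can be illustrated with a minimal sketch built on Python's random module. The helper name sample_records is hypothetical and is not the tool's actual implementation; it only shows how a fixed seed gives a repeatable sample, while a seed of None falls back to system randomness.

```python
import random


def sample_records(records, count, seed=None):
    """Return `count` records chosen without replacement (hypothetical helper)."""
    if count > len(records):
        raise ValueError("count exceeds the number of available records")
    rng = random.Random(seed)  # seed=None uses system randomness
    return rng.sample(records, count)


# Illustrative records only; real records also carry frequency and features
records = [{"syllable": s} for s in ["ka", "ra", "mi", "to", "an"]]
first = sample_records(records, 3, seed=42)
second = sample_records(records, 3, seed=42)
```

Using a dedicated random.Random instance, rather than seeding the global generator, keeps the sampling reproducible without affecting other code that uses the random module.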

Integration Guide

Use after annotation to randomly sample syllables for quality assurance:

# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator

# Step 2: Sample for QA inspection
python -m build_tools.syllable_analysis.random_sampler \
  --input data/annotated/syllables_annotated.json \
  --samples 50 \
  --output _working/samples.json \
  --seed 42

Feature Signature Analysis

The feature signature analysis tool examines annotated syllables to identify which feature combinations actually exist in the data and how frequently each combination appears.

A “feature signature” is the set of all active (True) features for a syllable. For example, a syllable with only starts_with_vowel and ends_with_vowel active would have the signature: ('ends_with_vowel', 'starts_with_vowel').

This analysis answers questions like:

What feature patterns are most common in natural language?
Are certain feature combinations rare or impossible?
How diverse is the feature space in the corpus?
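
The signature definition above can be sketched in a few lines. The helper name signature_of is hypothetical (the package exposes its own extract_signature), and sorting the names alphabetically is an assumption that matches the ordering shown in the example signature.

```python
def signature_of(features):
    """Return the sorted tuple of feature names whose value is True."""
    return tuple(sorted(name for name, active in features.items() if active))


# Illustrative feature flags for a single syllable
features = {
    "starts_with_vowel": True,
    "ends_with_vowel": True,
    "contains_plosive": False,
}
signature = signature_of(features)
```

A tuple is used because it is hashable, so signatures can serve directly as keys when counting how many syllables share each combination.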

Command-Line Interface

Analyze feature signatures in annotated syllable corpus

usage: python -m build_tools.syllable_analysis.feature_signatures
       [-h] [--input INPUT] [--output OUTPUT] [--limit LIMIT]

Named Arguments

--input

Path to syllables_annotated.json (default: <project_root>/data/annotated/syllables_annotated.json)

Default: <project_root>/data/annotated/syllables_annotated.json

--output

Output directory for analysis results (default: <project_root>/_working/analysis/feature_signatures)

Default: <project_root>/_working/analysis/feature_signatures

--limit

Limit number of signatures in report (default: show all)

# Analyze with default paths
python -m build_tools.syllable_analysis.feature_signatures

# Show only top 50 signatures
python -m build_tools.syllable_analysis.feature_signatures --limit 50

# Custom input/output paths
python -m build_tools.syllable_analysis.feature_signatures \
  --input data/annotated/syllables_annotated.json \
  --output _working/my_analysis/

Output Format

The tool generates timestamped plain text reports (YYYYMMDD_HHMMSS.feature_signatures.txt) with:

================================================================================
FEATURE SIGNATURE ANALYSIS
================================================================================
Generated: 2026-01-06 13:55:56
Total syllables analyzed: 23,160
Unique feature signatures: 361

SUMMARY STATISTICS
--------------------------------------------------------------------------------
Most common signature: 661 syllables (2.9%)
  Features: contains_liquid, contains_plosive, ends_with_vowel, long_vowel

Feature cardinality distribution:
  1 features: 3 unique signatures
  2 features: 17 unique signatures
  3 features: 54 unique signatures
  4 features: 86 unique signatures
  5 features: 96 unique signatures
  6 features: 70 unique signatures
  7 features: 30 unique signatures
  8 features: 5 unique signatures

================================================================================
SIGNATURE RANKINGS
--------------------------------------------------------------------------------
Rank   Count    Pct      Features
--------------------------------------------------------------------------------
1      661        2.85%  [4] contains_liquid, contains_plosive, ends_with_vowel, long_vowel
2      506        2.18%  [3] contains_plosive, ends_with_vowel, long_vowel
...
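
The figures in a report like this can be derived from a Counter that maps signatures to syllable counts. The following is a rough sketch with made-up counts, not the tool's reporting code; the variable names are assumptions.

```python
from collections import Counter

# Hypothetical signature counts: signature -> number of syllables
signature_counts = Counter({
    ("contains_plosive", "ends_with_vowel"): 5,
    ("contains_liquid", "ends_with_vowel"): 3,
    ("ends_with_vowel",): 2,
})

total = sum(signature_counts.values())

# Rank signatures by syllable count and compute each share of the corpus
rankings = [
    (rank, count, 100.0 * count / total, sig)
    for rank, (sig, count) in enumerate(signature_counts.most_common(), start=1)
]

# Count how many unique signatures have each number of active features
cardinality = Counter(len(sig) for sig in signature_counts)
```

Here rankings corresponds to the SIGNATURE RANKINGS table and cardinality to the feature cardinality distribution.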

Integration Guide

Use after annotation to understand feature patterns:

# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator

# Step 2: Analyze feature signatures
python -m build_tools.syllable_analysis.feature_signatures \
  --input data/annotated/syllables_annotated.json \
  --output _working/analysis/feature_signatures/

t-SNE Visualization

The t-SNE (t-distributed Stochastic Neighbor Embedding) visualization tool creates 2D visualizations of the high-dimensional feature signature space. This helps identify clustering patterns, syllable similarity, and natural groupings in the annotated syllable corpus.

t-SNE is a dimensionality reduction technique that projects 12-dimensional feature vectors into 2D space while preserving local structure. The visualization uses:

Position (x, y): t-SNE projection coordinates
Size: Syllable frequency (larger points = more common)
Color: Syllable frequency (warmer colors = more common)

Command-Line Interface

Generate t-SNE visualization of feature signature space

usage: python -m build_tools.syllable_analysis.tsne_visualizer
       [-h] [--input INPUT] [--output OUTPUT] [--perplexity PERPLEXITY]
       [--random-state RANDOM_STATE] [--dpi DPI] [--save-mapping]
       [--interactive] [--verbose]

Named Arguments

--input

Path to syllables_annotated.json (default: <project_root>/data/annotated/syllables_annotated.json)

Default: <project_root>/data/annotated/syllables_annotated.json

--output

Output directory for visualizations (default: <project_root>/_working/analysis/tsne)

Default: <project_root>/_working/analysis/tsne

--perplexity

t-SNE perplexity parameter (default: 30, range: 5-50)

Default: 30

--random-state

Random seed for reproducibility (default: 42)

Default: 42

--dpi

Output resolution in DPI (default: 300)

Default: 300

--save-mapping

Save syllable→features→coordinates mapping as JSON (default: False)

Default: False

--interactive

Generate interactive HTML visualization in addition to static PNG (requires Plotly)

Default: False

--verbose

Print detailed progress information

Default: False

# Generate visualization with default settings
python -m build_tools.syllable_analysis.tsne_visualizer

# Custom input/output paths
python -m build_tools.syllable_analysis.tsne_visualizer \
  --input data/annotated/syllables_annotated.json \
  --output _working/analysis/tsne/

# Adjust t-SNE parameters
python -m build_tools.syllable_analysis.tsne_visualizer \
  --perplexity 50 \
  --random-state 123

# High-resolution output
python -m build_tools.syllable_analysis.tsne_visualizer \
  --dpi 600

# Verbose output
python -m build_tools.syllable_analysis.tsne_visualizer --verbose

Output Format

The visualizer generates timestamped files in the output directory:

YYYYMMDD_HHMMSS.tsne_visualization.png - High-resolution static visualization (PNG, always generated)
YYYYMMDD_HHMMSS.tsne_metadata.txt - Detailed metadata and interpretation guide
YYYYMMDD_HHMMSS.tsne_mapping.json - Syllable→features→coordinates mapping (optional, requires --save-mapping)
YYYYMMDD_HHMMSS.tsne_interactive.html - Interactive Plotly visualization (optional, requires --interactive)
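
The timestamped naming scheme listed above could be produced along these lines. This is a sketch under assumptions: timestamped_path is a hypothetical helper, and the real tool may build its filenames differently.

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(output_dir, suffix, now=None):
    """Build `<output_dir>/YYYYMMDD_HHMMSS.<suffix>` (hypothetical helper)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{stamp}.{suffix}"


# A fixed timestamp is passed here only to make the result predictable
example = timestamped_path(
    "_working/analysis/tsne",
    "tsne_visualization.png",
    now=datetime(2026, 1, 7, 14, 30, 22),
)
```

Because the timestamp sorts lexicographically, files from successive runs appear in chronological order in a directory listing.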

Output Formats:

Static PNG: High-resolution matplotlib visualization (always generated, 300 DPI default)
Interactive HTML: Plotly-based interactive visualization with hover tooltips, zoom, pan, and export (optional)

The metadata file saved alongside the static PNG includes:

Algorithm parameters (method, perplexity, random seed, dimensions, distance metric, features)
Visualization encoding (axis meanings, point size/color)
Interpretation guide (how to read the visualization)
Technical details (DPI, generation timestamp)

Interactive HTML features:

Hover tooltips showing syllable text, frequency, and active features
Interactive zoom, pan, and exploration controls
Export to high-resolution PNG directly from browser
Self-contained HTML file with embedded metadata
Works in any modern web browser without additional dependencies

Integration Guide

Use after annotation to visualize the feature space:

# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator

# Step 2: Generate t-SNE visualization
python -m build_tools.syllable_analysis.tsne_visualizer \
  --input data/annotated/syllables_annotated.json \
  --output _working/analysis/tsne/ \
  --interactive

Advanced Topics

Understanding t-SNE Parameters:

Perplexity (default: 30):

Balances attention between local and global structure
Typical range: 5-50
Lower values: emphasize local clusters
Higher values: preserve global structure
Rule of thumb: should be less than number of syllables
Default of 30 works well for most corpus sizes (100-10,000 syllables)

Random State (default: 42):

Controls random initialization of t-SNE
Same value = reproducible visualizations
Different values = different (but valid) layouts
Use fixed value (e.g., 42) for consistent results

Distance Metric:

Uses Hamming distance (optimal for binary feature vectors)
Automatically configured for 12-dimensional binary features
Not configurable via command-line (intentional design choice)
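
To make the distance metric concrete, here is a small pure-Python sketch of Hamming distance in its normalized form (the fraction of positions that differ, which is how SciPy defines it). The two 12-element vectors are invented for illustration and do not correspond to real syllables.

```python
def hamming_distance(a, b):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Two hypothetical 12-dimensional binary feature vectors
first = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
second = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
distance = hamming_distance(first, second)
```

The vectors differ in 2 of 12 positions, so the distance is 2/12. Syllables with identical feature flags have distance 0 and tend to land close together in the projection.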

Interpreting the Visualization:

What to look for:

Nearby points: Syllables with similar phonetic features
Clusters: Natural groupings in the feature space
Large/bright points: High-frequency syllables (common patterns)
Small/dark points: Low-frequency syllables (rare patterns)
Isolated points: Unique or rare feature combinations
Dense regions: Common feature patterns
Sparse regions: Less common feature patterns

Example insights:

Vowel-initial syllables might cluster together
Heavy consonant clusters might form distinct groups
Frequency might correlate with certain feature patterns
Outliers might indicate unusual phonetic combinations

Notes

Dependencies:

Required dependencies for t-SNE visualization (install with pip install -e ".[build-tools]"):

scikit-learn, matplotlib, numpy, pandas (for static PNG generation)
plotly (for interactive HTML generation, optional)

Performance:

Deterministic analysis with fixed random seeds (same input = same output)
Fast processing: typically <10 seconds for 1,000-10,000 syllables
t-SNE processing time scales roughly O(n²) with corpus size
For very large datasets (>50,000 syllables), consider sampling first

Output Characteristics:

Human-readable plain text reports with formatted tables
Timestamped output files for historical tracking
High-resolution visualizations (default 300 DPI)
Static visualizations saved as PNG files for easy sharing and embedding
Interactive visualizations saved as self-contained HTML files

Build-time tool:

These are build-time analysis tools - not used during runtime name generation.

API Reference

class build_tools.syllable_analysis.AnalysisPathConfig(root=None)[source]

Bases: object

Centralized path configuration for analysis tools.

This class manages all default paths used by analysis tools, including:

Project root detection
Input file paths (annotated syllables, frequencies)
Output directory paths (per-tool subdirectories)

The class automatically detects the project root based on this file’s location in the directory structure, but can also accept a custom root path for testing or alternative project layouts.

Attributes

root (Path): Project root directory (auto-detected or explicitly set)

Examples

Using default (auto-detected) root:

>>> config = AnalysisPathConfig()
>>> config.root
PosixPath('/path/to/pipeworks_name_generation')
>>> config.annotated_syllables
PosixPath('/path/to/pipeworks_name_generation/data/annotated/syllables_annotated.json')

Using custom root:

>>> from pathlib import Path
>>> config = AnalysisPathConfig(root=Path("/custom/root"))
>>> config.annotated_syllables
PosixPath('/custom/root/data/annotated/syllables_annotated.json')

Getting tool-specific output directories:

>>> config = AnalysisPathConfig()
>>> config.analysis_output_dir("tsne")
PosixPath('/path/to/pipeworks_name_generation/_working/analysis/tsne')
>>> config.analysis_output_dir("feature_signatures")
PosixPath('/path/to/pipeworks_name_generation/_working/analysis/feature_signatures')

Notes

This class is designed to be instantiated once per process (typically via the module-level default_paths singleton). Multiple instances are supported for testing purposes.

The auto-detection assumes this file is located at: build_tools/syllable_analysis/common/paths.py

If the directory structure changes, the _detect_project_root() method must be updated accordingly.
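
Given the stated file location, root detection amounts to walking up a fixed number of parent directories. The sketch below is an assumption about how that could be done; detect_project_root is a hypothetical stand-in for the private _detect_project_root() method, whose actual code is not shown here.

```python
from pathlib import Path


def detect_project_root(module_file):
    """Walk up from .../build_tools/syllable_analysis/common/paths.py to the root."""
    path = Path(module_file)
    # parents[0]=common, [1]=syllable_analysis, [2]=build_tools, [3]=project root
    return path.parents[3]


# In real code the argument would typically be __file__
root = detect_project_root("/proj/build_tools/syllable_analysis/common/paths.py")
```

The hard-coded index is what makes the method sensitive to layout changes: moving paths.py one level deeper would require changing 3 to 4.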

__init__(root=None)[source]

Initialize path configuration.

Args

root (Path, optional): Project root path. If None (default), auto-detects based on this file’s location.

Examples

Default auto-detection:

>>> config = AnalysisPathConfig()

Custom root path:

>>> from pathlib import Path
>>> config = AnalysisPathConfig(root=Path("/my/project"))

analysis_output_dir(tool_name)[source]

Get output directory for a specific analysis tool.

Each analysis tool should have its own subdirectory under _working/analysis/ to keep outputs organized and avoid naming conflicts.

Return type: Path

Args

tool_name (str): Name of the analysis tool (e.g., ‘tsne’, ‘feature_signatures’, ‘random_sampler’). This will be used as the subdirectory name.

Returns

Path: Path to _working/analysis/{tool_name}/

Examples

>>> config = AnalysisPathConfig()
>>> config.analysis_output_dir("tsne")
PosixPath('.../pipeworks_name_generation/_working/analysis/tsne')
>>> config.analysis_output_dir("feature_signatures")
PosixPath('.../pipeworks_name_generation/_working/analysis/feature_signatures')

Use in argument parser:

parser.add_argument(
    "--output",
    type=Path,
    default=default_paths.analysis_output_dir("tsne"),
    help="Output directory"
)

Notes

The directory is not created by this method - it only returns the path. Use common.output.ensure_output_dir() to create the directory if needed.

The _working/ directory is typically git-ignored and used for build-time artifacts that don’t need to be committed.
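
The split between computing a path and creating the directory can be sketched as follows. Both helpers are hypothetical illustrations of that separation, not the package's analysis_output_dir() or ensure_output_dir() implementations.

```python
import tempfile
from pathlib import Path


def output_dir_for(root, tool_name):
    """Return <root>/_working/analysis/<tool_name> without touching the disk."""
    return Path(root) / "_working" / "analysis" / tool_name


def make_dir(path):
    """Create the directory (and any missing parents), then return it."""
    path.mkdir(parents=True, exist_ok=True)
    return path


# A temporary directory stands in for the project root
with tempfile.TemporaryDirectory() as tmp:
    target = output_dir_for(tmp, "tsne")
    existed_before = target.exists()
    created = make_dir(target)
    exists_after = created.is_dir()
```

Keeping path computation free of side effects means default paths can be used in argument parsers and tests without creating directories as a by-product.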

property annotated_syllables: Path

Default path to syllables_annotated.json.

This is the primary input file for most analysis tools, containing syllables with their frequencies and feature annotations.

Returns

Path: Path to data/annotated/syllables_annotated.json

Examples

>>> config = AnalysisPathConfig()
>>> config.annotated_syllables
PosixPath('.../data/annotated/syllables_annotated.json')

Use in argument parser:

parser.add_argument(
    "--input",
    type=Path,
    default=default_paths.annotated_syllables,
    help="Path to annotated syllables"
)

Notes

This file is produced by the syllable feature annotator pipeline and contains a JSON array of syllable records with structure:

[
    {
        "syllable": "ka",
        "frequency": 187,
        "features": {
            "starts_with_vowel": false,
            "contains_plosive": true,
            ...
        }
    },
    ...
]
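
A loader for this structure might parse the JSON array and check that each record carries the three expected keys. The sketch below uses a hypothetical parse_annotated function and an inline JSON string; it is not the package's load_annotated_syllables().

```python
import json

REQUIRED_KEYS = {"syllable", "frequency", "features"}


def parse_annotated(text):
    """Parse a JSON array of syllable records and check the expected keys."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {index} is missing keys: {sorted(missing)}")
    return records


# Inline sample in the documented shape (feature set shortened)
sample = '[{"syllable": "ka", "frequency": 187, "features": {"contains_plosive": true}}]'
records = parse_annotated(sample)
```

Validating early gives a clear error message at load time instead of a KeyError deep inside an analysis step.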

property syllables_frequencies: Path

Default path to syllables_frequencies.json.

This file contains frequency counts for each syllable from the normalizer, useful for weighted analysis or filtering.

Returns

Path: Path to data/normalized/syllables_frequencies.json

Examples

>>> config = AnalysisPathConfig()
>>> config.syllables_frequencies
PosixPath('.../data/normalized/syllables_frequencies.json')

Notes

This file is produced by the syllable normalizer and contains a JSON object mapping syllables to their occurrence counts:

{
    "ka": 187,
    "ra": 162,
    "mi": 145,
    ...
}

The frequencies represent pre-deduplication counts, capturing how often each canonical syllable appeared in the raw corpus.
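
As an illustration of filtering and weighting with this mapping, the sketch below loads a small frequency object and derives relative weights. The "zu" entry and the threshold of 100 are invented for the example.

```python
import json
from collections import Counter

# Inline sample in the documented shape; "zu" is a made-up rare syllable
text = '{"ka": 187, "ra": 162, "mi": 145, "zu": 3}'
frequencies = Counter(json.loads(text))

# Keep only syllables seen at least `min_count` times (threshold is arbitrary)
min_count = 100
common = {s: n for s, n in frequencies.items() if n >= min_count}

# Relative weight of each remaining syllable, e.g. for weighted analysis
total = sum(common.values())
weights = {s: n / total for s, n in common.items()}

# The two most frequent syllables in the full mapping
top_two = frequencies.most_common(2)
```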

build_tools.syllable_analysis.analyze_feature_signatures(records)[source]

Analyze feature signatures across all syllable records.

Counts how many syllables share each unique feature signature.

Parameters: records (list[dict]) – List of syllable records from syllables_annotated.json. Each record should have “syllable”, “frequency”, and “features” keys.
Return type: Counter
Returns: Counter mapping feature signatures to occurrence counts

Example

>>> records = [
...     {"syllable": "ka", "features": {"starts_with_vowel": False}},
...     {"syllable": "a", "features": {"starts_with_vowel": True}}
... ]
>>> counter = analyze_feature_signatures(records)
>>> counter[('starts_with_vowel',)]
1

build_tools.syllable_analysis.apply_tsne(feature_matrix, n_components=2, perplexity=30, random_state=42, metric='hamming')[source]

Apply t-SNE dimensionality reduction to feature matrix.

t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for dimensionality reduction that projects high-dimensional data into lower dimensions while preserving local structure.

Parameters:

feature_matrix (ndarray) – Input feature matrix (n_samples, n_features). For binary features, should contain only 0s and 1s.
n_components (int) – Number of dimensions for output (default: 2). 2D is typical for visualization, 3D also common.
perplexity (int) – t-SNE perplexity parameter (default: 30). Controls balance between local and global structure. Typical range: 5-50. Higher values consider more neighbors. Should be less than n_samples.
random_state (int) – Random seed for reproducibility (default: 42). Same seed ensures identical output for same input.
metric (str) – Distance metric (default: ‘hamming’). ‘hamming’ is well suited to binary features because it counts the positions at which two vectors differ. Other options: ‘euclidean’, ‘manhattan’, ‘cosine’, etc.

Return type:

ndarray

Returns:

Reduced coordinates array of shape (n_samples, n_components). For default n_components=2, output is (n_samples, 2) with x,y coordinates.

Raises:

ImportError – If scikit-learn is not installed
ValueError – If perplexity is invalid (too large for sample size)

Example

>>> import numpy as np
>>> from build_tools.syllable_analysis.dimensionality import apply_tsne
>>> # Create sample binary feature matrix (100 samples, 12 features)
>>> feature_matrix = np.random.randint(0, 2, size=(100, 12))
>>> # Apply t-SNE to reduce to 2D
>>> coords_2d = apply_tsne(feature_matrix, n_components=2, perplexity=30)
>>> coords_2d.shape
(100, 2)

Notes

Processing time scales roughly O(n²) with sample size
Perplexity should be less than n_samples (typically n_samples/3 max)
Hamming distance is best for binary features (our use case)
Fixed random_state ensures reproducible results
For large datasets (>10,000 samples), consider using approximate methods

build_tools.syllable_analysis.build_hover_text(record, max_features=4)[source]

Build rich hover text for a single syllable record.

Creates HTML-formatted hover text showing syllable details, frequency, and active features. Features are truncated if more than max_features are present.

Parameters:

record (dict) – Syllable record with ‘syllable’, ‘frequency’, ‘features’ keys
max_features (int) – Maximum features to show before truncating (default: 4)

Return type:

str

Returns:

HTML-formatted hover text string

Example

>>> record = {
...     "syllable": "kran",
...     "frequency": 150,
...     "features": {
...         "contains_plosive": True,
...         "contains_liquid": True,
...         "contains_nasal": True,
...         "starts_with_cluster": True,
...         "ends_with_nasal": True,
...     }
... }
>>> text = build_hover_text(record, max_features=4)
>>> print(text)
<b>kran</b><br>Frequency: 150<br>Features: 5/12<br><i>contains_plosive, ...</i><br>...

Notes

Syllable shown in bold
Frequency shown with comma separators (e.g., “1,234”)
Feature count shows active/total (e.g., “5/12”)
First N features shown in italics
If more than N features, shows “+M more” truncation message
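
The formatting rules in these notes can be approximated with the sketch below. It is an assumption-laden illustration: hover_text is a hypothetical function, and the exact line layout and the wording of the truncation message in the real build_hover_text may differ.

```python
def hover_text(record, max_features=4, total_features=12):
    """Build HTML hover text for one record (layout details are assumptions)."""
    active = [name for name, on in record["features"].items() if on]
    lines = [
        f"<b>{record['syllable']}</b>",
        f"Frequency: {record['frequency']:,}",  # comma thousands separator
        f"Features: {len(active)}/{total_features}",
        f"<i>{', '.join(active[:max_features])}</i>",
    ]
    hidden = len(active) - max_features
    if hidden > 0:
        lines.append(f"+{hidden} more")
    return "<br>".join(lines)


record = {
    "syllable": "kran",
    "frequency": 1234,
    "features": {
        "contains_plosive": True,
        "contains_liquid": True,
        "contains_nasal": True,
        "starts_with_cluster": True,
        "ends_with_nasal": True,
    },
}
text = hover_text(record)
```

With five active features and max_features=4, the first four names are listed and the fifth is summarized as one additional feature.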

build_tools.syllable_analysis.calculate_optimal_perplexity(n_samples, min_perplexity=5, max_perplexity=50)[source]

Suggest optimal perplexity value based on dataset size.

Perplexity is a key t-SNE parameter that balances local vs global structure. This function provides a reasonable default based on dataset size.

Rule of thumb:

Perplexity should be between 5 and 50
Perplexity should be less than n_samples
Common heuristic: perplexity ≈ sqrt(n_samples), clamped to [5, 50]

Parameters:

n_samples (int) – Number of samples in dataset
min_perplexity (int) – Minimum perplexity value (default: 5)
max_perplexity (int) – Maximum perplexity value (default: 50)

Return type:

int

Returns:

Suggested perplexity value

Example

>>> calculate_optimal_perplexity(100)
10
>>> calculate_optimal_perplexity(1000)
31
>>> calculate_optimal_perplexity(10000)
50
>>> calculate_optimal_perplexity(10)
5

Notes

For small datasets (<25 samples): use min_perplexity (5)
For large datasets (>2500 samples): use max_perplexity (50)
For medium datasets: use sqrt(n_samples)
This is a heuristic, not a strict rule - experiment for best results
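
The heuristic described above can be written as a square root followed by a clamp. In this sketch, suggest_perplexity is a hypothetical name, and taking the integer floor of the square root is an assumption chosen because it agrees with the documented example values.

```python
import math


def suggest_perplexity(n_samples, min_perplexity=5, max_perplexity=50):
    """Clamp floor(sqrt(n_samples)) to [min_perplexity, max_perplexity]."""
    estimate = math.isqrt(n_samples)  # integer floor of the square root
    return max(min_perplexity, min(max_perplexity, estimate))


suggestions = [suggest_perplexity(n) for n in (10, 100, 1000, 10000)]
```

The clamp bounds explain the thresholds in the notes: sqrt(25) = 5 and sqrt(2500) = 50, so datasets below 25 samples or above 2,500 samples hit the minimum or maximum respectively.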

build_tools.syllable_analysis.create_interactive_scatter(records, tsne_coords, title='t-SNE: Feature Signature Space (Interactive)')[source]

Create interactive Plotly scatter plot of t-SNE coordinates.

Generates an interactive HTML-compatible visualization with rich hover tooltips, zoom/pan controls, and export capabilities. Points are sized (log scale) and colored by frequency.

Parameters:

records (list[dict]) – List of annotated syllable records. Each must contain:
    syllable (str): Syllable text
    frequency (int): Occurrence count
    features (dict): Boolean feature flags (12 features)
tsne_coords (ndarray) – 2D coordinate array of shape (n_samples, 2) from t-SNE
title (str) – Plot title (default: “t-SNE: Feature Signature Space (Interactive)”)

Return type:

Figure

Returns:

Plotly Figure object with configured interactive scatter plot

Raises:

ImportError – If Plotly is not installed
ValueError – If inputs are invalid or lengths don’t match

Example

>>> import numpy as np
>>> records = [
...     {"syllable": "ka", "frequency": 100, "features": {"contains_plosive": True}},
...     {"syllable": "mi", "frequency": 50, "features": {"contains_nasal": True}},
... ]
>>> coords = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> fig = create_interactive_scatter(records, coords)
>>> fig.show()  # Opens in browser

Notes

Point size uses log1p scale for better visibility across frequency ranges
Hover text shows syllable, frequency, feature count, and up to 4 features
If more than 4 features, shows “…+N more” truncation
Viridis colorscale provides perceptually uniform coloring
Fixed height (900px) with responsive width for consistent aspect ratio
Plotly CDN used when saving to HTML for smaller file size

build_tools.syllable_analysis.create_metadata_footer(perplexity, random_state)[source]

Create HTML metadata footer with algorithm parameters.

Generates a styled HTML block showing t-SNE parameters and generation information. Designed to be appended to the end of the HTML file.

Parameters:

perplexity (int) – t-SNE perplexity parameter used
random_state (int) – Random seed used for reproducibility

Return type:

str

Returns:

HTML string with formatted metadata table

Example

>>> footer = create_metadata_footer(perplexity=30, random_state=42)
>>> "t-SNE Visualization Parameters" in footer
True

Notes

Uses inline CSS for styling (no external dependencies)
Light gray background (#f5f5f5) for visual separation
Monospace font for technical parameters
Includes usage instructions for toolbar
Shows current timestamp of generation

build_tools.syllable_analysis.create_metadata_text(output_filename, dpi, perplexity, random_state, processing_time)[source]

Generate formatted metadata text for static visualization.

Creates a human-readable metadata report describing the visualization parameters, algorithm settings, and interpretation guide. Suitable for saving alongside PNG output files.

Parameters:

output_filename (str) – Name of the output PNG file (e.g., “20260107_143022.tsne_visualization.png”)
dpi (int) – Resolution used for PNG export
perplexity (int) – t-SNE perplexity parameter used
random_state (int) – Random seed used for reproducibility
processing_time (float) – Total processing time in seconds

Return type:

str

Returns:

Formatted multi-line metadata string ready for file output

Example

>>> from pathlib import Path
>>> metadata = create_metadata_text(
...     output_filename="20260107_143022.tsne_visualization.png",
...     dpi=300,
...     perplexity=30,
...     random_state=42,
...     processing_time=2.5
... )
>>> Path("metadata.txt").write_text(metadata)

Notes

Includes timestamp of generation
Documents all algorithm parameters
Provides interpretation guidance
Uses Unicode box-drawing characters for formatting

build_tools.syllable_analysis.create_tsne_mapping(records, tsne_coords)[source]

Create syllable→features→coordinates mapping.

Combines annotated syllable records with their t-SNE coordinates to create a comprehensive mapping structure. This is useful for:

Post-hoc cluster analysis
Cross-referencing visualizations
Interactive exploration
Sharing visualizations with collaborators

Parameters:

records (list[dict]) – Original annotated syllable records from load_annotated_syllables(). Each record should have:
    syllable (str): The syllable text
    frequency (int): Occurrence count
    features (dict): Boolean feature flags
tsne_coords (ndarray) – t-SNE coordinate array (n_syllables × n_dimensions). Typically 2D for visualization, but can be 3D or higher.

Returns:

List of mapping records with structure:

[
    {
        "syllable": "kran",
        "frequency": 7,
        "tsne_x": -2.34,
        "tsne_y": 5.67,
        "features": {…}
    }
]

Raises:

ValueError – If records and tsne_coords have mismatched lengths

Example

>>> import numpy as np
>>> records = [
...     {"syllable": "ka", "frequency": 187, "features": {...}},
...     {"syllable": "ran", "frequency": 42, "features": {...}}
... ]
>>> coords = np.array([[-2.1, 3.4], [1.5, -0.8]])
>>> mapping = create_tsne_mapping(records, coords)
>>> mapping[0]["tsne_x"]
-2.1
>>> mapping[0]["syllable"]
'ka'

Notes

Array indices preserve order from input records
Coordinates are converted from numpy float64 to Python float for JSON compatibility
All original record fields are preserved in the mapping
For 2D t-SNE: creates tsne_x and tsne_y fields
For 3D+ t-SNE: creates tsne_x, tsne_y, tsne_z, … fields
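
The field naming described in these notes can be sketched as follows. This is a minimal illustration, not the library's implementation: build_mapping and AXIS_NAMES are hypothetical names, plain nested lists stand in for the NumPy array, and only the x, y, and z axes are handled because the names of any further axes are not given here.

```python
# Hypothetical sketch; the real create_tsne_mapping may differ.
AXIS_NAMES = ("x", "y", "z")  # names past z are not documented, so not guessed


def build_mapping(records, coords):
    """Merge each record with its coordinates as tsne_x, tsne_y, (tsne_z)."""
    if len(records) != len(coords):
        raise ValueError("records and coords have mismatched lengths")
    mapping = []
    for record, point in zip(records, coords):
        if len(point) > len(AXIS_NAMES):
            raise ValueError("this sketch only names up to three axes")
        entry = dict(record)  # keep all original record fields
        for axis, value in zip(AXIS_NAMES, point):
            entry[f"tsne_{axis}"] = float(value)  # plain float for JSON
        mapping.append(entry)
    return mapping


records = [
    {"syllable": "ka", "frequency": 187, "features": {"contains_plosive": True}},
    {"syllable": "ran", "frequency": 42, "features": {"contains_liquid": True}},
]
mapping = build_mapping(records, [[-2.1, 3.4], [1.5, -0.8]])
print(mapping[0]["syllable"], mapping[0]["tsne_x"], mapping[0]["tsne_y"])
```

Copying each record with dict(record) keeps the input list untouched, which matters if the same records are reused for a second plot.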

build_tools.syllable_analysis.create_tsne_scatter(tsne_coords, frequencies, title='t-SNE: Feature Signature Space', figsize=(14, 10), cmap='viridis', alpha=0.6)[source]

Create static matplotlib scatter plot of t-SNE coordinates.

Generates a publication-quality scatter plot showing t-SNE dimensionality reduction results. Points are sized and colored by frequency, with larger and brighter points indicating higher-frequency syllables.

Parameters:

tsne_coords (ndarray) – 2D coordinate array of shape (n_samples, 2) from t-SNE reduction
frequencies (list[int]) – Frequency values for each point (used for sizing and coloring)
title (str) – Plot title (default: “t-SNE: Feature Signature Space”)
figsize (tuple[int, int]) – Figure size in inches as (width, height) (default: (14, 10))
cmap (str) – Matplotlib colormap name (default: “viridis”)
alpha (float) – Point transparency, 0.0=transparent to 1.0=opaque (default: 0.6)

Return type:

Figure

Returns:

matplotlib Figure object with configured scatter plot

Raises:

ValueError – If tsne_coords shape is invalid or lengths don’t match

Example

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> coords = np.random.randn(100, 2)
>>> freqs = list(range(1, 101))
>>> fig = create_tsne_scatter(coords, freqs)
>>> fig.savefig("output.png", dpi=300)
>>> plt.close(fig)

Notes

Point size is proportional to frequency (frequency × 2)
Colorbar is added automatically to show frequency scale
Black edge lines improve visibility of overlapping points
Layout uses tight_layout() for optimal spacing

build_tools.syllable_analysis.ensure_output_dir(output_dir)[source]

Ensure output directory exists, creating it if necessary.

This function creates the specified directory and all parent directories if they don’t already exist. It is idempotent - calling it multiple times with the same path is safe and has no side effects.

Return type: Path

Parameters

output_dir (Path) – Directory path to ensure exists

Returns

Path – The same path that was passed in (for chaining)

Examples

Basic usage:

>>> from pathlib import Path
>>> output_dir = ensure_output_dir(Path("_working/analysis/tsne/"))
>>> output_dir.exists()
True

Create nested directories:

>>> nested = ensure_output_dir(Path("_working/new/nested/dirs/"))
>>> nested.exists()
True

Idempotent operation:

>>> dir1 = ensure_output_dir(Path("_working/test/"))
>>> dir2 = ensure_output_dir(Path("_working/test/"))
>>> dir1 == dir2
True

Chaining:

>>> output_file = ensure_output_dir(Path("_working/analysis/")) / "output.json"

Notes

This function uses Path.mkdir(parents=True, exist_ok=True), which:

- Creates all parent directories as needed (like mkdir -p)
- Does not raise an error if the directory already exists
- Raises PermissionError if permissions are insufficient
- Raises OSError for other filesystem errors

The function returns the input path unchanged, which allows for convenient chaining in expressions.
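
The behaviour described above can be reproduced with the standard library alone. The sketch below uses a hypothetical helper named ensure_dir and a temporary directory so that it leaves nothing behind; it illustrates the Path.mkdir call named in the notes rather than the function's actual source.

```python
import tempfile
from pathlib import Path


def ensure_dir(path: Path) -> Path:
    """Create path and any missing parents, then return it unchanged."""
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "analysis" / "tsne"
    first = ensure_dir(target)
    second = ensure_dir(target)  # second call finds the directory and does nothing
    created = target.is_dir()
    same_path = first == second
    # Returning the path allows chaining straight into a file name
    output_file = ensure_dir(target) / "output.json"
    output_name = output_file.name

print(created, same_path, output_name)
```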

build_tools.syllable_analysis.extract_feature_matrix(records, feature_names=['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel', 'short_vowel', 'starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'ends_with_vowel', 'ends_with_stop', 'ends_with_nasal'])[source]

Extract binary feature matrix from annotated syllable records.

Converts feature dictionaries to a numerical matrix suitable for dimensionality reduction algorithms. Each row represents a syllable, each column represents a feature (0 or 1).

Parameters:

records (list[dict]) – List of annotated syllable records with ‘features’ and ‘frequency’ keys. Each record should have the structure:

{"syllable": "ka", "frequency": 187, "features": {"contains_liquid": False, "contains_plosive": True, …}}

feature_names (list[str]) – Ordered list of feature names to extract (default: ALL_FEATURES). Order determines column order in output matrix.

Returns:

Tuple of (feature_matrix, frequencies):

feature_matrix: numpy array of shape (n_syllables, n_features) with binary values
frequencies: List of frequency counts for each syllable

Return type:

tuple

Example

>>> records = [
...     {
...         "syllable": "ka",
...         "frequency": 187,
...         "features": {"contains_liquid": False, "contains_plosive": True, ...}
...     }
... ]
>>> matrix, freqs = extract_feature_matrix(records)
>>> matrix.shape
(1, 12)
>>> freqs
[187]

Notes

Missing features default to False (0)
Feature values are converted to int (True→1, False→0)
Output matrix dtype is int for memory efficiency
Empty record list returns (0, n_features) shaped array
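
The conversion rules in these notes can be illustrated without NumPy. In the sketch below, build_feature_rows is a hypothetical stand-in that returns nested lists instead of an array, and FEATURES is a three-name subset chosen for brevity; the real function uses the full feature list shown in its signature.

```python
# Hypothetical sketch using plain lists; the real function returns a numpy array.
FEATURES = ["contains_liquid", "contains_plosive", "ends_with_vowel"]


def build_feature_rows(records, feature_names):
    """Return (rows, frequencies) with one 0/1 row per record."""
    rows = []
    frequencies = []
    for record in records:
        flags = record["features"]
        # A missing feature counts as False, and bool converts to 0 or 1
        rows.append([int(flags.get(name, False)) for name in feature_names])
        frequencies.append(record["frequency"])
    return rows, frequencies


records = [
    {
        "syllable": "ka",
        "frequency": 187,
        "features": {
            "contains_liquid": False,
            "contains_plosive": True,
            "ends_with_vowel": True,
        },
    },
    # Only one flag is present here; the other two default to 0
    {"syllable": "ran", "frequency": 42, "features": {"contains_liquid": True}},
]
rows, freqs = build_feature_rows(records, FEATURES)
print(rows)   # [[0, 1, 1], [1, 0, 0]]
print(freqs)  # [187, 42]
```

Column order follows feature_names, so the same list must be reused whenever columns are later mapped back to feature names.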

build_tools.syllable_analysis.extract_signature(features)[source]

Extract the feature signature from a feature dictionary.

A signature is a sorted tuple of feature names where the feature value is True. This creates a canonical representation of the active feature set.

Parameters: features (dict[str, bool]) – Dictionary mapping feature names to boolean values
Return type: tuple[str, ...]
Returns: Sorted tuple of feature names that are active (True)

Example

>>> extract_signature({"starts_with_vowel": True, "ends_with_vowel": False})
('starts_with_vowel',)
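
Because a signature is a sorted tuple, it is hashable and independent of dictionary key order, which makes it usable as a Counter key. The following sketch shows that property with a hypothetical signature_of helper written from the description above; it is not the library's source.

```python
from collections import Counter


def signature_of(features):
    """Sorted tuple of the feature names whose value is True."""
    return tuple(sorted(name for name, active in features.items() if active))


records = [
    {"syllable": "ka", "features": {"contains_plosive": True, "ends_with_vowel": True, "contains_nasal": False}},
    # Same active features as "ka", listed in a different order
    {"syllable": "ta", "features": {"ends_with_vowel": True, "contains_plosive": True}},
    {"syllable": "an", "features": {"starts_with_vowel": True, "ends_with_nasal": True}},
]

counts = Counter(signature_of(record["features"]) for record in records)
print(counts.most_common(1))  # [(('contains_plosive', 'ends_with_vowel'), 2)]
```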

build_tools.syllable_analysis.format_signature_report(signature_counter, total_syllables, limit=None)[source]

Format the signature analysis results as a human-readable report.

Parameters:

signature_counter (Counter) – Counter of signatures to their occurrence counts
total_syllables (int) – Total number of syllables in the corpus
limit (int | None) – Maximum number of signatures to include (None = all)

Return type:

str

Returns:

Formatted multi-line string report
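
The report layout itself is not shown here, so the sketch below only illustrates how the three parameters could interact: Counter.most_common(limit) selects the top signatures, and total_syllables turns counts into percentages. The function name sketch_report and the line format are assumptions made for illustration.

```python
from collections import Counter


def sketch_report(signature_counter, total_syllables, limit=None):
    """Build a simple text report; the layout is hypothetical."""
    lines = [f"Unique signatures: {len(signature_counter)}"]
    # most_common(None) returns every entry, so limit=None means "all"
    for signature, count in signature_counter.most_common(limit):
        share = 100.0 * count / total_syllables if total_syllables else 0.0
        label = ", ".join(signature) if signature else "(no active features)"
        lines.append(f"{count:>6}  {share:5.1f}%  {label}")
    return "\n".join(lines)


counter = Counter({
    ("contains_plosive", "ends_with_vowel"): 3,
    ("starts_with_vowel",): 1,
})
report = sketch_report(counter, total_syllables=4, limit=1)
print(report)
```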

build_tools.syllable_analysis.generate_output_pair(output_dir, primary_suffix, metadata_suffix, primary_ext='txt', metadata_ext='txt')[source]

Generate matching pair of timestamped output paths.

This function is useful for tools that generate both primary output and accompanying metadata files. It ensures both files use the same timestamp, making it easy to associate files and maintain chronological ordering.

Return type: tuple[Path, Path]

Parameters

output_dir (Path) – Output directory (should exist or be created first)
primary_suffix (str) – Suffix for primary output file (e.g., 'tsne_visualization')
metadata_suffix (str) – Suffix for metadata file (e.g., 'tsne_metadata')
primary_ext (str, default='txt') – File extension for primary file (e.g., 'png', 'json', 'html')
metadata_ext (str, default='txt') – File extension for metadata file (typically 'txt' or 'json')

Returns

tuple[Path, Path] – Tuple of (primary_path, metadata_path) with matching timestamps

Examples

Generate visualization and metadata pair:

>>> from pathlib import Path
>>> viz_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/analysis/tsne/"),
...     primary_suffix="tsne_visualization",
...     metadata_suffix="tsne_metadata",
...     primary_ext="png",
...     metadata_ext="txt"
... )
>>> viz_path.name
'20260107_143022.tsne_visualization.png'
>>> meta_path.name
'20260107_143022.tsne_metadata.txt'

Generate data and metadata pair:

>>> data_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/results/"),
...     primary_suffix="analysis_results",
...     metadata_suffix="analysis_meta",
...     primary_ext="json",
...     metadata_ext="json"
... )

Verify timestamp matching:

>>> viz_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/"),
...     primary_suffix="primary",
...     metadata_suffix="metadata"
... )
>>> # Both file names start with the same timestamp
>>> viz_path.stem.split('.')[0] == meta_path.stem.split('.')[0]
True

Notes

Output Format:

Primary:  {output_dir}/{timestamp}.{primary_suffix}.{primary_ext}
Metadata: {output_dir}/{timestamp}.{metadata_suffix}.{metadata_ext}

Examples:

- Primary: _working/analysis/tsne/20260107_143022.tsne_visualization.png
- Metadata: _working/analysis/tsne/20260107_143022.tsne_metadata.txt

The timestamp is generated once and used for both files, ensuring they are always associated and sort together in directory listings.

The directory is NOT created by this function - use ensure_output_dir() first if the directory might not exist.

Use Cases:

- t-SNE visualizer: PNG + metadata TXT
- Feature analysis: results JSON + metadata TXT
- Interactive plots: HTML + metadata JSON
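
The reason for generating the timestamp once is that two separate calls to datetime.now() can straddle a second boundary and produce mismatched names. The sketch below shows the single-timestamp approach with a hypothetical paired_paths helper; it follows the documented naming pattern but is not the library's source.

```python
from datetime import datetime
from pathlib import Path


def paired_paths(output_dir, primary_suffix, metadata_suffix,
                 primary_ext="txt", metadata_ext="txt"):
    """Return (primary, metadata) paths that share one timestamp."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # computed exactly once
    primary = output_dir / f"{stamp}.{primary_suffix}.{primary_ext}"
    metadata = output_dir / f"{stamp}.{metadata_suffix}.{metadata_ext}"
    return primary, metadata


viz_path, meta_path = paired_paths(
    Path("_working/analysis/tsne"),
    "tsne_visualization",
    "tsne_metadata",
    primary_ext="png",
)
# The text before the first dot is the timestamp in both names
print(viz_path.name.split(".")[0] == meta_path.name.split(".")[0])
```

No directory is created by this sketch either; the paths are only constructed.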

build_tools.syllable_analysis.generate_timestamped_path(output_dir, suffix, extension='txt', timestamp=None)[source]

Generate timestamped output file path.

This function creates a path with format: {output_dir}/{timestamp}.{suffix}.{extension}

The timestamp format is YYYYMMDD_HHMMSS (e.g., 20260107_143022), which provides:

- Chronological sorting
- Uniqueness (assuming no more than one file per second)
- Human readability
- No special characters that could cause path issues

Return type: Path

Parameters

output_dir (Path) – Output directory (should exist or be created first)
suffix (str) – File suffix describing content (e.g., 'tsne_visualization', 'metadata')
extension (str, default='txt') – File extension without leading dot (e.g., 'txt', 'json', 'png')
timestamp (str, optional) – Specific timestamp string (format: YYYYMMDD_HHMMSS). If None (default), uses the current time via datetime.now()

Returns

Path – Timestamped output file path

Examples

Basic usage (auto-generated timestamp):

>>> from pathlib import Path
>>> path = generate_timestamped_path(
...     output_dir=Path("_working/analysis/tsne/"),
...     suffix="tsne_visualization",
...     extension="png"
... )
>>> path.name
'20260107_143022.tsne_visualization.png'

Custom extension:

>>> path = generate_timestamped_path(
...     output_dir=Path("_working/"),
...     suffix="results",
...     extension="json"
... )
>>> path.suffix
'.json'

Explicit timestamp (for reproducibility or paired files):

>>> path = generate_timestamped_path(
...     output_dir=Path("_working/"),
...     suffix="output",
...     timestamp="20260107_120000"
... )
>>> "20260107_120000" in str(path)
True

Notes

File Naming Convention:

{YYYYMMDD_HHMMSS}.{suffix}.{extension}

Examples:

- 20260107_143022.tsne_visualization.png
- 20260107_143022.tsne_metadata.txt
- 20260107_143022.feature_signatures.txt

The directory is NOT created by this function - use ensure_output_dir() first if the directory might not exist.

Timestamp Format:

- YYYY: 4-digit year
- MM: 2-digit month (01-12)
- DD: 2-digit day (01-31)
- HH: 2-digit hour (00-23)
- MM: 2-digit minute (00-59)
- SS: 2-digit second (00-59)
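
In strftime terms, the YYYYMMDD_HHMMSS layout corresponds to the format string "%Y%m%d_%H%M%S". The sketch below uses a fixed datetime so the result is predictable, and shows that such names can be parsed back and that they sort chronologically as plain strings. The constant name TIMESTAMP_FORMAT is an assumption for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"  # hypothetical constant name

moment = datetime(2026, 1, 7, 14, 30, 22)
stamp = moment.strftime(TIMESTAMP_FORMAT)
filename = f"{stamp}.tsne_visualization.png"
print(filename)  # 20260107_143022.tsne_visualization.png

# The timestamp is everything before the first dot, so it can be recovered
parsed = datetime.strptime(filename.split(".")[0], TIMESTAMP_FORMAT)

# Fixed-width, most-significant-first fields make string order match time order
earlier = datetime(2025, 12, 31, 23, 59, 59).strftime(TIMESTAMP_FORMAT)
print(earlier < stamp)  # True
```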

build_tools.syllable_analysis.get_feature_vector(features, feature_names=['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel', 'short_vowel', 'starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'ends_with_vowel', 'ends_with_stop', 'ends_with_nasal'])[source]

Extract a single feature vector from a feature dictionary.

Converts a dictionary of feature flags to an ordered binary vector. Useful for extracting vectors from individual syllables.

Parameters:

features (dict[str, bool]) – Dictionary of feature name → boolean value
feature_names (list[str]) – Ordered list of feature names (default: ALL_FEATURES)

Return type:

list[int]

Returns:

Binary feature vector matching feature_names order

Example

>>> features = {"contains_liquid": True, "contains_plosive": False}
>>> vector = get_feature_vector(features, ["contains_liquid", "contains_plosive"])
>>> vector
[1, 0]

Notes

Missing features default to False (0)
Order of output matches order of feature_names
Output is Python list, not numpy array (for flexibility)

build_tools.syllable_analysis.inject_responsive_css(html_content, min_width=1250)[source]

Inject responsive CSS into HTML content.

Adds CSS rules to ensure the plot has a minimum width and proper scrolling behavior. This prevents the plot from becoming too narrow on small screens while allowing horizontal scrolling when necessary.

Parameters:

html_content (str) – Original HTML content from Plotly
min_width (int) – Minimum width constraint in pixels (default: 1250)

Return type:

str

Returns:

HTML content with injected CSS in <head> section

Example

>>> html = "<html><head></head><body>...</body></html>"
>>> modified = inject_responsive_css(html, min_width=1250)
>>> "<style>" in modified
True

Notes

CSS is inserted after the opening <head> tag
Sets body margin/padding to 0 for full-width layout
Enables horizontal scrolling when plot exceeds viewport width
Sets fixed height (900px) matching plot configuration
Uses !important to override Plotly’s inline styles
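
The insertion step can be sketched with a single string replacement. The CSS rules below are illustrative assumptions: the actual stylesheet is not reproduced in this documentation, and the .plotly-graph-div selector is assumed to match the container Plotly emits. The helper name add_min_width_css is also hypothetical.

```python
def add_min_width_css(html_content: str, min_width: int = 1250) -> str:
    """Insert a <style> block right after the first opening <head> tag."""
    # Hypothetical rules; the real injected CSS may differ.
    css = (
        "<style>"
        "body { margin: 0; padding: 0; overflow-x: auto; } "
        f".plotly-graph-div {{ min-width: {min_width}px !important; }}"
        "</style>"
    )
    # count=1 limits the change to the first <head>; if there is no
    # <head> tag, this sketch returns the input unchanged.
    return html_content.replace("<head>", "<head>" + css, 1)


html = "<html><head></head><body>plot</body></html>"
modified = add_min_width_css(html, min_width=800)
print("<style>" in modified)  # True
```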

build_tools.syllable_analysis.load_annotated_syllables(input_path, validate=True)[source]

Load annotated syllables from JSON file with optional validation.

This function loads the output of the syllable feature annotator, which contains syllables with their frequencies and phonetic feature annotations. It provides optional validation to ensure the data structure is correct.

Return type: list[dict[str, Any]]

Parameters

input_path (Path) – Path to syllables_annotated.json file
validate (bool, default=True) – Whether to validate the structure of loaded data. When True, checks that:
- Data is a list
- List is non-empty
- First record has required keys: 'syllable', 'frequency', 'features'

Returns

list[dict[str, Any]] – List of syllable records, each containing:
- syllable (str): The syllable text
- frequency (int): Occurrence count in corpus
- features (dict): Boolean feature flags (12 features)

Raises

FileNotFoundError – If input file does not exist
json.JSONDecodeError – If file is not valid JSON
ValueError – If validation is enabled and data structure is invalid

Examples

Basic loading with validation:

>>> from pathlib import Path
>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"))
>>> len(records)
1247
>>> records[0].keys()
dict_keys(['syllable', 'frequency', 'features'])

Loading without validation (faster, use when structure is guaranteed):

>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"),
...                                     validate=False)

Error handling:

>>> try:
...     records = load_annotated_syllables(Path("nonexistent.json"))
... except FileNotFoundError as e:
...     print(f"File not found: {e}")

Notes

Expected file format (syllables_annotated.json):

[
    {
        "syllable": "ka",
        "frequency": 187,
        "features": {
            "contains_liquid": false,
            "contains_plosive": true,
            "contains_fricative": false,
            "contains_nasal": false,
            "long_vowel": false,
            "short_vowel": true,
            "starts_with_vowel": false,
            "starts_with_cluster": false,
            "starts_with_heavy_cluster": false,
            "ends_with_vowel": true,
            "ends_with_stop": false,
            "ends_with_nasal": false
        }
    },
    ...
]

This file is produced by the syllable feature annotator pipeline and is the primary input for analysis tools.

Performance

Loading a typical corpus of 1,000-10,000 syllables takes <100ms. Validation adds negligible overhead (~1ms).
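
The three validation checks listed for validate=True can be sketched as follows. The helper check_records and its error messages are hypothetical, and JSON is parsed from a string so the example runs without a data file. Note that only the first record is inspected, which is why the overhead stays small.

```python
import json

REQUIRED_KEYS = {"syllable", "frequency", "features"}


def check_records(data):
    """Apply the documented checks; messages are illustrative only."""
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of records")
    if not data:
        raise ValueError("record list is empty")
    first = data[0]
    missing = REQUIRED_KEYS - set(first) if isinstance(first, dict) else REQUIRED_KEYS
    if missing:
        raise ValueError(f"first record is missing keys: {sorted(missing)}")
    return data


good = json.loads(
    '[{"syllable": "ka", "frequency": 187, "features": {"contains_plosive": true}}]'
)
records = check_records(good)

try:
    check_records(json.loads('[{"syllable": "ka"}]'))
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```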

build_tools.syllable_analysis.load_frequency_data(frequencies_path)[source]

Load frequency mapping from JSON file.

This function loads the output of the syllable normalizer’s frequency analysis, which maps each canonical syllable to its occurrence count in the raw corpus.

Return type: dict[str, int]

Parameters

frequencies_path (Path) – Path to syllables_frequencies.json file

Returns

dict[str, int] – Dictionary mapping syllable strings to their frequency counts

Raises

FileNotFoundError – If input file does not exist
json.JSONDecodeError – If file is not valid JSON
ValueError – If data structure is invalid (not a dict)

Examples

Basic loading:

>>> from pathlib import Path
>>> frequencies = load_frequency_data(Path("data/normalized/syllables_frequencies.json"))
>>> frequencies["ka"]
187
>>> len(frequencies)
1247

Checking most common syllables:

>>> sorted_freqs = sorted(frequencies.items(), key=lambda x: x[1], reverse=True)
>>> sorted_freqs[:3]
[('ka', 187), ('ra', 162), ('mi', 145)]

Error handling:

>>> try:
...     frequencies = load_frequency_data(Path("nonexistent.json"))
... except FileNotFoundError:
...     print("File not found")

Notes

Expected file format (syllables_frequencies.json):

{
    "ka": 187,
    "ra": 162,
    "mi": 145,
    "ta": 98,
    ...
}

This file is produced by the syllable normalizer’s frequency analysis step and captures pre-deduplication counts (how many times each syllable appeared in the raw corpus before creating the unique syllable list).

The frequencies can be used for:

- Weighted analysis (prioritize common syllables)
- Filtering (exclude rare syllables)
- Visualization (size/color by frequency)
- Statistical analysis

Performance

Loading a typical frequency file (1,000-10,000 entries) takes <50ms.
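
Two of the uses mentioned in the notes, filtering and weighting, can be sketched on the sample data from the expected file format. The JSON is parsed from a string so the example runs on its own, and the threshold min_count is a hypothetical value chosen for illustration.

```python
import json

raw = '{"ka": 187, "ra": 162, "mi": 145, "ta": 98}'
frequencies = json.loads(raw)
if not isinstance(frequencies, dict):
    raise ValueError("expected a JSON object mapping syllables to counts")

# Filtering: drop syllables below a hypothetical threshold
min_count = 100
common = {syllable: count for syllable, count in frequencies.items() if count >= min_count}

# Weighting: convert raw counts to relative frequencies that sum to 1
total = sum(frequencies.values())
weights = {syllable: count / total for syllable, count in frequencies.items()}

print(sorted(common))  # ['ka', 'mi', 'ra']
print(total)           # 592
```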

build_tools.syllable_analysis.parse_feature_signatures_args()

Parse command-line arguments.

Return type: Namespace

build_tools.syllable_analysis.parse_random_sampler_arguments()

Parse command-line arguments.

Return type: Namespace
Returns: Parsed argument namespace.

build_tools.syllable_analysis.parse_tsne_visualizer_args()

Parse command-line arguments.

Return type: Namespace
Returns: Parsed argument namespace with validated parameters

build_tools.syllable_analysis.run_analysis(input_path, output_dir, limit=None)[source]

Run the complete feature signature analysis pipeline.

Parameters:

input_path (Path) – Path to syllables_annotated.json
output_dir (Path) – Directory to save analysis results
limit (int | None) – Maximum number of signatures to include in report (None = all)

Returns:

Dictionary with analysis results including:

total_syllables: Total number of syllables analyzed
unique_signatures: Number of unique feature signatures
output_path: Path to the saved report

Return type:

dict

build_tools.syllable_analysis.run_tsne_visualization(input_path, output_dir, perplexity=30, random_state=42, dpi=300, verbose=False, save_mapping=False, interactive=False)[source]

Run the complete t-SNE visualization pipeline.

This is the main entry point for programmatic use. It handles the full workflow:

1. Load annotated syllables
2. Extract feature matrix
3. Apply t-SNE dimensionality reduction
4. Create visualization
5. Save outputs (PNG + optional HTML + optional mapping)

Parameters:

input_path (Path) – Path to syllables_annotated.json
output_dir (Path) – Directory to save visualization outputs
perplexity (int) – t-SNE perplexity parameter (default: 30)
random_state (int) – Random seed for reproducibility (default: 42)
dpi (int) – Output resolution in dots per inch (default: 300)
verbose (bool) – Print detailed progress information
save_mapping (bool) – Save syllable→features→coordinates mapping as JSON (default: False)
interactive (bool) – Generate interactive HTML visualization (requires Plotly, default: False)

Returns:

Dictionary containing:

syllable_count: Number of syllables visualized
feature_count: Number of features (always 12)
output_path: Path to saved visualization PNG
metadata_path: Path to saved metadata file
tsne_coordinates: numpy array of 2D coordinates
mapping_path: Path to mapping JSON (None if save_mapping=False)
interactive_path: Path to interactive HTML (None if interactive=False or Plotly unavailable)
processing_time: Total processing time in seconds

Return type:

dict

Raises:

FileNotFoundError – If input file does not exist
ImportError – If required dependencies are missing
ValueError – If input data is invalid

Example

>>> from pathlib import Path
>>> result = run_tsne_visualization(
...     input_path=Path("data/annotated/syllables_annotated.json"),
...     output_dir=Path("_working/analysis/tsne/"),
...     interactive=True,
...     save_mapping=True
... )
>>> print(f"Visualized {result['syllable_count']} syllables")
>>> print(f"Interactive HTML: {result['interactive_path']}")

build_tools.syllable_analysis.sample_syllables(records, sample_count, seed=None)[source]

Randomly sample syllables from the full corpus.

Parameters:

records (list[dict[str, Any]]) – List of annotated syllable records.
sample_count (int) – Number of samples to draw.
seed (int | None) – Optional random seed for reproducibility.

Return type:

list[dict[str, Any]]

Returns:

List of sampled syllable records.

Raises:

ValueError – If sample_count is larger than available records.
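
Seeded sampling of this kind is commonly built on random.Random, and the sketch below assumes that approach; the library's own implementation may differ. The helper name draw_sample is hypothetical. Using a private Random instance keeps the seed from affecting the global random state.

```python
import random


def draw_sample(records, sample_count, seed=None):
    """Return sample_count records chosen without replacement."""
    if sample_count > len(records):
        raise ValueError(
            f"sample_count ({sample_count}) exceeds available records ({len(records)})"
        )
    rng = random.Random(seed)  # seed=None gives a different sample each run
    return rng.sample(records, sample_count)


records = [{"syllable": s} for s in ("ka", "ra", "mi", "ta", "no")]
first = draw_sample(records, 3, seed=42)
second = draw_sample(records, 3, seed=42)
print(first == second)  # True
```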

build_tools.syllable_analysis.save_interactive_html(fig, output_path, perplexity, random_state, min_width=1250)[source]

Save interactive Plotly figure as standalone HTML.

Creates a self-contained HTML file with embedded Plotly visualization that can be:

- Opened directly in any web browser
- Shared with collaborators
- Embedded in reports or documentation
- Explored with zoom, pan, hover, and export controls

The HTML file uses Plotly CDN for JavaScript dependencies (smaller file size) and includes responsive CSS and a metadata footer.

Parameters:

fig (Figure) – Plotly Figure object from create_interactive_scatter()
output_path (Path) – Output HTML file path (parent directory must exist)
perplexity (int) – t-SNE perplexity parameter (for metadata footer)
random_state (int) – Random seed used (for metadata footer)
min_width (int) – Minimum width constraint in pixels (default: 1250)

Raises:

ImportError – If Plotly is not installed
FileNotFoundError – If parent directory doesn’t exist
ValueError – If output_path doesn’t end with .html

Return type:

None

Example

>>> from pathlib import Path
>>> fig = create_interactive_scatter(records, tsne_coords)
>>> output_path = Path("_working/visualization.html")
>>> save_interactive_html(fig, output_path, perplexity=30, random_state=42)

Notes

Plotly CDN used for smaller file size vs. full JS bundle
Mode bar configured with additional tools (hoverclosest, hovercompare)
Export to PNG button configured for high-resolution (1600x1200, 2x scale)
Responsive CSS ensures minimum width of 1250px
Metadata footer includes algorithm parameters and generation time

build_tools.syllable_analysis.save_json_output(data, output_path, indent=2, ensure_ascii=False)[source]

Save data as formatted JSON file.

This function provides consistent JSON output formatting across all analysis tools. It ensures proper Unicode handling, readable indentation, and creates parent directories if needed.

Return type: None

Parameters

data (Any) – Data to serialize as JSON (must be JSON-serializable)
output_path (Path) – Output file path (parent directories will be created if needed)
indent (int | None, default=2) – Number of spaces for JSON indentation. Use 2 for readability, None for compact output
ensure_ascii (bool, default=False) – If True, escape non-ASCII characters. If False (default), preserve Unicode characters for better readability

Raises

TypeError – If data is not JSON-serializable
OSError – If file cannot be written (permissions, disk full, etc.)

Examples

Save analysis results:

>>> from pathlib import Path
>>> results = {"total": 1247, "unique": 892}
>>> save_json_output(results, Path("output/results.json"))

Save with compact formatting:

>>> save_json_output(results, Path("output/compact.json"), indent=None)

Save with ASCII-only encoding:

>>> save_json_output(results, Path("output/ascii.json"), ensure_ascii=True)

Auto-create parent directories:

>>> save_json_output(results, Path("output/new/dir/results.json"))
>>> # Creates output/new/dir/ automatically

Notes

Default settings (indent=2, ensure_ascii=False) are optimized for:

- Human readability (indented)
- Unicode support (preserve accented characters, emojis, etc.)
- Version control friendliness (consistent line breaks)

File encoding is always UTF-8 for maximum compatibility.

Performance

Saving 1,000-10,000 records typically takes <100ms. Using indent=None (compact) is ~20% faster but much less readable.
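
The effect of indent and ensure_ascii is easiest to see side by side. The sketch below writes the same data twice with json.dump into a temporary directory; write_json is a hypothetical helper that mirrors the documented parameters and is not the library's source.

```python
import json
import tempfile
from pathlib import Path


def write_json(data, output_path, indent=2, ensure_ascii=False):
    """Write data as UTF-8 JSON, creating parent directories first."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=indent, ensure_ascii=ensure_ascii)


data = {"syllable": "é"}
with tempfile.TemporaryDirectory() as tmp:
    readable = Path(tmp) / "new" / "dir" / "readable.json"
    escaped = Path(tmp) / "new" / "dir" / "escaped.json"
    write_json(data, readable)  # defaults: indented, Unicode preserved
    write_json(data, escaped, indent=None, ensure_ascii=True)
    readable_text = readable.read_text(encoding="utf-8")
    escaped_text = escaped.read_text(encoding="utf-8")

print(readable_text)  # three lines, with the literal character é
print(escaped_text)   # {"syllable": "\u00e9"}
```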

build_tools.syllable_analysis.save_report(report, output_dir)[source]

Save the formatted report to the output directory.

Parameters:

report (str) – Formatted report string
output_dir (Path) – Directory to save the report in

Return type:

Path

Returns:

Path to the saved report file

build_tools.syllable_analysis.save_static_plot(fig, output_path, dpi=300)[source]

Save matplotlib figure to PNG file.

Saves a matplotlib Figure to a high-resolution PNG file suitable for publication or presentation. Uses tight bounding box to minimize whitespace.

Parameters:

fig (Figure) – Matplotlib Figure object to save
output_path (Path) – Output PNG file path (parent directory must exist)
dpi (int) – Resolution in dots per inch (default: 300 for publication quality)

Raises:

FileNotFoundError – If parent directory doesn’t exist
PermissionError – If file cannot be written
ValueError – If output_path doesn’t end with .png

Return type:

None

Example

>>> import matplotlib.pyplot as plt
>>> from pathlib import Path
>>> fig, ax = plt.subplots()
>>> ax.plot([1, 2, 3], [1, 4, 9])
>>> save_static_plot(fig, Path("output.png"), dpi=300)
>>> plt.close(fig)

Notes

Uses bbox_inches='tight' to remove excess whitespace
Higher DPI values create larger files but better quality
Common DPI values: 150 (screen), 300 (print), 600 (high-quality print)
Figure is NOT automatically closed after saving

build_tools.syllable_analysis.save_tsne_mapping(mapping, output_path, indent=2)[source]

Save t-SNE mapping to JSON file.

Writes the syllable→coordinates mapping as formatted JSON for human readability and programmatic access.

Parameters:

mapping (list[dict]) – Mapping data from create_tsne_mapping()
output_path (Path) – Output file path (should end in .json)
indent (int) – JSON indentation for readability (default: 2)

Return type:

None

Example

>>> from pathlib import Path
>>> mapping = [{"syllable": "ka", "tsne_x": -2.1, "tsne_y": 3.4, "features": {...}}]
>>> save_tsne_mapping(mapping, Path("output.json"))

Notes

Output is formatted with indentation for human readability
Uses ensure_ascii=False to preserve Unicode characters
UTF-8 encoding ensures international character support
Parent directories are created if they don’t exist

build_tools.syllable_analysis.validate_feature_matrix(feature_matrix, expected_features=12)[source]

Validate feature matrix shape and contents.

Ensures the feature matrix has the expected structure for dimensionality reduction algorithms.

Parameters:

feature_matrix (ndarray) – Binary feature matrix
expected_features (int) – Expected number of features (default: 12)

Raises:

ValueError – If validation fails (wrong shape, non-binary values, etc.)

Return type:

None

Example

>>> import numpy as np
>>> matrix = np.array([[1, 0, 1], [0, 1, 0]])
>>> validate_feature_matrix(matrix, expected_features=3)  # OK
>>> validate_feature_matrix(matrix, expected_features=4)  # Raises ValueError