Analysis Tools
Overview
Analysis tools for annotated syllables.
This subpackage provides post-annotation analysis utilities for inspecting and understanding the annotated syllable corpus.
Subpackages
common: Shared utilities (data I/O, paths, output management)
dimensionality: Dimensionality reduction (feature matrices, t-SNE, mapping)
plotting: Visualization utilities (static matplotlib, interactive Plotly)
Available Tools
random_sampler: Random sampling utility for QA and inspection
feature_signatures: Feature signature analysis and distribution reporting
tsne_visualizer: t-SNE visualization of feature signature space
Quick Start
Random sampling:
$ python -m build_tools.syllable_analysis.random_sampler --samples 50
Feature signature analysis:
$ python -m build_tools.syllable_analysis.feature_signatures
t-SNE visualization:
$ python -m build_tools.syllable_analysis.tsne_visualizer
Programmatic Usage
Using common utilities:
>>> from build_tools.syllable_analysis import (
... default_paths,
... load_annotated_syllables,
... ensure_output_dir,
... )
>>> # Load data using default paths
>>> records = load_annotated_syllables(default_paths.annotated_syllables)
>>> # Prepare output directory
>>> output_dir = ensure_output_dir(default_paths.analysis_output_dir("my_tool"))
Random sampling:
>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
... load_annotated_syllables,
... sample_syllables,
... save_json_output
... )
>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"))
>>> samples = sample_syllables(records, 50, seed=42)
>>> save_json_output(samples, Path("output.json"))
Feature signature analysis:
>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
... run_analysis,
... extract_signature,
... analyze_feature_signatures
... )
>>> result = run_analysis(
... input_path=Path("data/annotated/syllables_annotated.json"),
... output_dir=Path("_working/analysis/"),
... limit=20
... )
t-SNE visualization:
>>> from pathlib import Path
>>> from build_tools.syllable_analysis import (
... run_tsne_visualization,
... extract_feature_matrix
... )
>>> result = run_tsne_visualization(
... input_path=Path("data/annotated/syllables_annotated.json"),
... output_dir=Path("_working/analysis/tsne/")
... )
Random Sampler
Random sampling utility for QA and inspection of annotated syllables.
Command-Line Interface
Randomly sample annotated syllables for inspection.
usage: python -m build_tools.syllable_analysis.random_sampler
[-h] [--input INPUT] [--output OUTPUT] [--samples SAMPLES]
[--seed SEED]
Named Arguments
- --input
Path to input annotated syllables JSON file (default: <project root>/data/annotated/syllables_annotated.json)
- --output
Path to output samples JSON file (default: <project root>/_working/random_samples.json)
- --samples
Number of syllables to sample (default: 100)
- --seed
Random seed for reproducibility (default: None, uses system randomness)
# Sample 100 syllables (default)
python -m build_tools.syllable_analysis.random_sampler
# Sample 50 syllables
python -m build_tools.syllable_analysis.random_sampler --samples 50
# Use custom paths
python -m build_tools.syllable_analysis.random_sampler \
--input data/annotated/syllables_annotated.json \
--output _working/my_samples.json \
--samples 200
# Use a specific seed for reproducibility
python -m build_tools.syllable_analysis.random_sampler --samples 50 --seed 42
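The effect of --seed can be illustrated with a small standard-library sketch. This is not the sampler's actual implementation: sample_records is a hypothetical helper, and both sampling without replacement and capping the count at the corpus size are assumptions made for the example.

```python
import random


def sample_records(records, count, seed=None):
    """Return `count` records chosen without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # seed=None falls back to system randomness
    count = min(count, len(records))  # assumption: never request more than exist
    return rng.sample(records, count)


# Illustrative records only; real records also carry frequency and features.
records = [{"syllable": s} for s in ["ka", "ra", "mi", "to", "el", "an"]]

first = sample_records(records, 3, seed=42)
second = sample_records(records, 3, seed=42)
print(first == second)  # a fixed seed reproduces the same sample
```

Using a private random.Random instance rather than the module-level functions keeps the sample reproducible even if other code touches the global random state.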
Integration Guide
Use after annotation to randomly sample syllables for quality assurance:
# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator
# Step 2: Sample for QA inspection
python -m build_tools.syllable_analysis.random_sampler \
--input data/annotated/syllables_annotated.json \
--samples 50 \
--output _working/samples.json \
--seed 42
Feature Signature Analysis
The feature signature analysis tool examines annotated syllables to identify which feature combinations actually exist in the data and how frequently each combination appears.
A “feature signature” is the set of all active (True) features for a syllable. For example, a syllable
with only starts_with_vowel and ends_with_vowel active would have the signature:
('ends_with_vowel', 'starts_with_vowel').
This analysis answers questions like:
What feature patterns are most common in natural language?
Are certain feature combinations rare or impossible?
How diverse is the feature space in the corpus?
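As a rough illustration of the definition above, a signature can be derived as the sorted tuple of feature names whose value is True, and signatures can then be tallied with a Counter. The helper name signature_of and the sample records are assumptions for this sketch, not the tool's own code.

```python
from collections import Counter


def signature_of(features):
    """Sorted tuple of the feature names that are True (hypothetical helper)."""
    return tuple(sorted(name for name, active in features.items() if active))


# Illustrative records; real records come from syllables_annotated.json.
records = [
    {"syllable": "a", "features": {"starts_with_vowel": True, "ends_with_vowel": True, "contains_nasal": False}},
    {"syllable": "e", "features": {"starts_with_vowel": True, "ends_with_vowel": True, "contains_nasal": False}},
    {"syllable": "ka", "features": {"starts_with_vowel": False, "ends_with_vowel": True, "contains_plosive": True}},
]

counts = Counter(signature_of(r["features"]) for r in records)
for signature, count in counts.most_common():
    print(count, signature)
```

Sorting the names makes the signature independent of the key order in each record, so two syllables with the same active features always map to the same tuple.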
Command-Line Interface
Analyze feature signatures in annotated syllable corpus
usage: python -m build_tools.syllable_analysis.feature_signatures
[-h] [--input INPUT] [--output OUTPUT] [--limit LIMIT]
Named Arguments
- --input
Path to syllables_annotated.json (default: <project root>/data/annotated/syllables_annotated.json)
- --output
Output directory for analysis results (default: <project root>/_working/analysis/feature_signatures)
- --limit
Limit number of signatures in report (default: show all)
# Analyze with default paths
python -m build_tools.syllable_analysis.feature_signatures
# Show only top 50 signatures
python -m build_tools.syllable_analysis.feature_signatures --limit 50
# Custom input/output paths
python -m build_tools.syllable_analysis.feature_signatures \
--input data/annotated/syllables_annotated.json \
--output _working/my_analysis/
Output Format
The tool generates timestamped plain text reports (YYYYMMDD_HHMMSS.feature_signatures.txt) with:
================================================================================
FEATURE SIGNATURE ANALYSIS
================================================================================
Generated: 2026-01-06 13:55:56
Total syllables analyzed: 23,160
Unique feature signatures: 361
SUMMARY STATISTICS
--------------------------------------------------------------------------------
Most common signature: 661 syllables (2.9%)
Features: contains_liquid, contains_plosive, ends_with_vowel, long_vowel
Feature cardinality distribution:
1 features: 3 unique signatures
2 features: 17 unique signatures
3 features: 54 unique signatures
4 features: 86 unique signatures
5 features: 96 unique signatures
6 features: 70 unique signatures
7 features: 30 unique signatures
8 features: 5 unique signatures
================================================================================
SIGNATURE RANKINGS
--------------------------------------------------------------------------------
Rank Count Pct Features
--------------------------------------------------------------------------------
1 661 2.85% [4] contains_liquid, contains_plosive, ends_with_vowel, long_vowel
2 506 2.18% [3] contains_plosive, ends_with_vowel, long_vowel
...
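The summary figures in a report like the one above can be derived from a Counter of signatures. The following is a minimal sketch with made-up counts; cardinality_distribution is a hypothetical helper and the output layout only approximates the real report.

```python
from collections import Counter


def cardinality_distribution(signature_counts):
    """Map number of active features -> number of unique signatures of that size."""
    return Counter(len(signature) for signature in signature_counts)


# Made-up counts for illustration only.
signature_counts = Counter({
    ("ends_with_vowel",): 12,
    ("contains_plosive", "ends_with_vowel"): 30,
    ("contains_liquid", "ends_with_vowel"): 8,
})

total = sum(signature_counts.values())
distribution = cardinality_distribution(signature_counts)
for size, n_unique in sorted(distribution.items()):
    print(f"{size} features: {n_unique} unique signatures")

top_signature, top_count = signature_counts.most_common(1)[0]
print(f"Most common signature: {top_count} syllables ({top_count / total:.1%})")
```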
Integration Guide
Use after annotation to understand feature patterns:
# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator
# Step 2: Analyze feature signatures
python -m build_tools.syllable_analysis.feature_signatures \
--input data/annotated/syllables_annotated.json \
--output _working/analysis/feature_signatures/
t-SNE Visualization
The t-SNE (t-distributed Stochastic Neighbor Embedding) visualization tool creates 2D visualizations of the high-dimensional feature signature space. This helps identify clustering patterns, syllable similarity, and natural groupings in the annotated syllable corpus.
t-SNE is a dimensionality reduction technique that projects 12-dimensional feature vectors into 2D space while preserving local structure. The visualization uses:
Position (x, y): t-SNE projection coordinates
Size: Syllable frequency (larger points = more common)
Color: Syllable frequency (warmer colors = more common)
Command-Line Interface
Generate t-SNE visualization of feature signature space
usage: python -m build_tools.syllable_analysis.tsne_visualizer
[-h] [--input INPUT] [--output OUTPUT] [--perplexity PERPLEXITY]
[--random-state RANDOM_STATE] [--dpi DPI] [--save-mapping]
[--interactive] [--verbose]
Named Arguments
- --input
Path to syllables_annotated.json (default: <project root>/data/annotated/syllables_annotated.json)
- --output
Output directory for visualizations (default: <project root>/_working/analysis/tsne)
- --perplexity
t-SNE perplexity parameter (default: 30, range: 5-50)
- --random-state
Random seed for reproducibility (default: 42)
- --dpi
Output resolution in DPI (default: 300)
- --save-mapping
Save syllable→features→coordinates mapping as JSON (default: False)
- --interactive
Generate interactive HTML visualization in addition to static PNG (requires Plotly; default: False)
- --verbose
Print detailed progress information (default: False)
# Generate visualization with default settings
python -m build_tools.syllable_analysis.tsne_visualizer
# Custom input/output paths
python -m build_tools.syllable_analysis.tsne_visualizer \
--input data/annotated/syllables_annotated.json \
--output _working/analysis/tsne/
# Adjust t-SNE parameters
python -m build_tools.syllable_analysis.tsne_visualizer \
--perplexity 50 \
--random-state 123
# High-resolution output
python -m build_tools.syllable_analysis.tsne_visualizer \
--dpi 600
# Verbose output
python -m build_tools.syllable_analysis.tsne_visualizer --verbose
Output Format
The visualizer generates timestamped files in the output directory:
YYYYMMDD_HHMMSS.tsne_visualization.png - High-resolution static visualization (PNG, always generated)
YYYYMMDD_HHMMSS.tsne_metadata.txt - Detailed metadata and interpretation guide
YYYYMMDD_HHMMSS.tsne_mapping.json - Syllable→features→coordinates mapping (optional, requires --save-mapping)
YYYYMMDD_HHMMSS.tsne_interactive.html - Interactive Plotly visualization (optional, requires --interactive)
Output Formats:
Static PNG: High-resolution matplotlib visualization (always generated, 300 DPI default)
Interactive HTML: Plotly-based interactive visualization with hover tooltips, zoom, pan, and export (optional)
Static PNG metadata file includes:
Algorithm parameters (method, perplexity, random seed, dimensions, distance metric, features)
Visualization encoding (axis meanings, point size/color)
Interpretation guide (how to read the visualization)
Technical details (DPI, generation timestamp)
Interactive HTML features:
Hover tooltips showing syllable text, frequency, and active features
Interactive zoom, pan, and exploration controls
Export to high-resolution PNG directly from browser
Self-contained HTML file with embedded metadata
Works in any modern web browser without additional dependencies
Integration Guide
Use after annotation to visualize the feature space:
# Step 1: Annotate syllables
python -m build_tools.syllable_feature_annotator
# Step 2: Generate t-SNE visualization
python -m build_tools.syllable_analysis.tsne_visualizer \
--input data/annotated/syllables_annotated.json \
--output _working/analysis/tsne/ \
--interactive
Advanced Topics
Understanding t-SNE Parameters:
Perplexity (default: 30):
Balances attention between local and global structure
Typical range: 5-50
Lower values: emphasize local clusters
Higher values: preserve global structure
Rule of thumb: should be less than number of syllables
Default of 30 works well for most corpus sizes (100-10,000 syllables)
Random State (default: 42):
Controls random initialization of t-SNE
Same value = reproducible visualizations
Different values = different (but valid) layouts
Use fixed value (e.g., 42) for consistent results
Distance Metric:
Uses Hamming distance (optimal for binary feature vectors)
Automatically configured for 12-dimensional binary features
Not configurable via command-line (intentional design choice)
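To make the distance metric concrete, here is a pure-Python sketch of Hamming distance between two binary feature vectors, expressed as the fraction of positions that differ. The two 12-element vectors are invented for illustration and do not correspond to real annotations.

```python
def hamming_distance(a, b):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical 12-dimensional feature vectors (1 = feature active).
vec_a = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
vec_b = [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]

print(hamming_distance(vec_a, vec_b))  # three of twelve positions differ
```

Because every feature is a yes/no flag, counting disagreements treats all features equally, which is why this metric is a natural fit for binary vectors.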
Interpreting the Visualization:
What to look for:
Nearby points: Syllables with similar phonetic features
Clusters: Natural groupings in the feature space
Large/bright points: High-frequency syllables (common patterns)
Small/dark points: Low-frequency syllables (rare patterns)
Isolated points: Unique or rare feature combinations
Dense regions: Common feature patterns
Sparse regions: Less common feature patterns
Example insights:
Vowel-initial syllables might cluster together
Heavy consonant clusters might form distinct groups
Frequency might correlate with certain feature patterns
Outliers might indicate unusual phonetic combinations
Notes
Dependencies:
Required dependencies for t-SNE visualization (install with pip install -e ".[build-tools]"):
scikit-learn, matplotlib, numpy, pandas (for static PNG generation)
plotly (for interactive HTML generation, optional)
Performance:
Deterministic analysis with fixed random seeds (same input = same output)
Fast processing: typically <10 seconds for 1,000-10,000 syllables
t-SNE processing time scales roughly O(n²) with corpus size
For very large datasets (>50,000 syllables), consider sampling first
Output Characteristics:
Human-readable plain text reports with formatted tables
Timestamped output files for historical tracking
High-resolution visualizations (default 300 DPI)
Static visualizations saved as PNG files for easy sharing and embedding
Interactive visualizations saved as self-contained HTML files
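The YYYYMMDD_HHMMSS prefix used by the output files can be reproduced with datetime.strftime. The helper below is a hypothetical sketch of that naming pattern, not the tools' own code.

```python
from datetime import datetime


def timestamped_name(suffix, now=None):
    """Build a name such as 20260106_135556.feature_signatures.txt (hypothetical helper)."""
    now = now or datetime.now()
    return f"{now.strftime('%Y%m%d_%H%M%S')}.{suffix}"


fixed = datetime(2026, 1, 6, 13, 55, 56)
name = timestamped_name("feature_signatures.txt", fixed)
print(name)
```

Since the timestamp sorts lexicographically in chronological order, listing the output directory shows runs oldest first.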
Build-time tool:
These are build-time analysis tools - not used during runtime name generation.
API Reference
- class build_tools.syllable_analysis.AnalysisPathConfig(root=None)
Bases: object
Centralized path configuration for analysis tools.
This class manages all default paths used by analysis tools, including:
- Project root detection
- Input file paths (annotated syllables, frequencies)
- Output directory paths (per-tool subdirectories)
The class automatically detects the project root based on this file’s location in the directory structure, but can also accept a custom root path for testing or alternative project layouts.
Attributes
- root : Path
Project root directory (auto-detected or explicitly set)
Examples
Using default (auto-detected) root:
>>> config = AnalysisPathConfig()
>>> config.root
PosixPath('/path/to/pipeworks_name_generation')
>>> config.annotated_syllables
PosixPath('/path/to/pipeworks_name_generation/data/annotated/syllables_annotated.json')
Using custom root:
>>> from pathlib import Path
>>> config = AnalysisPathConfig(root=Path("/custom/root"))
>>> config.annotated_syllables
PosixPath('/custom/root/data/annotated/syllables_annotated.json')
Getting tool-specific output directories:
>>> config = AnalysisPathConfig()
>>> config.analysis_output_dir("tsne")
PosixPath('/path/to/pipeworks_name_generation/_working/analysis/tsne')
>>> config.analysis_output_dir("feature_signatures")
PosixPath('/path/to/pipeworks_name_generation/_working/analysis/feature_signatures')
Notes
This class is designed to be instantiated once per process (typically via the module-level default_paths singleton). Multiple instances are supported for testing purposes.
The auto-detection assumes this file is located at:
build_tools/syllable_analysis/common/paths.py
If the directory structure changes, the _detect_project_root() method must be updated accordingly.
- __init__(root=None)
Initialize path configuration.
Args
- root : Path, optional
Project root path. If None (default), auto-detects based on this file’s location.
Examples
Default auto-detection:
>>> config = AnalysisPathConfig()
Custom root path:
>>> from pathlib import Path
>>> config = AnalysisPathConfig(root=Path("/my/project"))
- analysis_output_dir(tool_name)
Get output directory for a specific analysis tool.
Each analysis tool should have its own subdirectory under _working/analysis/ to keep outputs organized and avoid naming conflicts.
- Return type:
Args
- tool_name : str
Name of the analysis tool (e.g., ‘tsne’, ‘feature_signatures’, ‘random_sampler’). This will be used as the subdirectory name.
Returns
- Path
Path to _working/analysis/{tool_name}/
Examples
>>> config = AnalysisPathConfig()
>>> config.analysis_output_dir("tsne")
PosixPath('.../pipeworks_name_generation/_working/analysis/tsne')
>>> config.analysis_output_dir("feature_signatures")
PosixPath('.../pipeworks_name_generation/_working/analysis/feature_signatures')
Use in argument parser:
parser.add_argument(
    "--output",
    type=Path,
    default=default_paths.analysis_output_dir("tsne"),
    help="Output directory",
)
Notes
The directory is not created by this method - it only returns the path. Use common.output.ensure_output_dir() to create the directory if needed.
The _working/ directory is typically git-ignored and used for build-time artifacts that don’t need to be committed.
- property annotated_syllables: Path
Default path to syllables_annotated.json.
This is the primary input file for most analysis tools, containing syllables with their frequencies and feature annotations.
Returns
- Path
Path to data/annotated/syllables_annotated.json
Examples
>>> config = AnalysisPathConfig()
>>> config.annotated_syllables
PosixPath('.../data/annotated/syllables_annotated.json')
Use in argument parser:
parser.add_argument(
    "--input",
    type=Path,
    default=default_paths.annotated_syllables,
    help="Path to annotated syllables",
)
Notes
This file is produced by the syllable feature annotator pipeline and contains a JSON array of syllable records with structure:
[
  {
    "syllable": "ka",
    "frequency": 187,
    "features": {
      "starts_with_vowel": false,
      "contains_plosive": true,
      ...
    }
  },
  ...
]
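Given records in this shape, a binary feature matrix and a parallel frequency list can be built with plain list comprehensions. This is a sketch under stated assumptions: to_feature_matrix is a hypothetical helper, only three of the feature names are used, and the second record is invented for illustration.

```python
# Illustrative subset of feature names; the real annotations use twelve.
FEATURE_NAMES = ["starts_with_vowel", "contains_plosive", "ends_with_vowel"]


def to_feature_matrix(records, feature_names):
    """One row per record, one 0/1 column per feature; missing features count as 0."""
    return [
        [int(bool(record["features"].get(name, False))) for name in feature_names]
        for record in records
    ]


records = [
    {"syllable": "ka", "frequency": 187,
     "features": {"starts_with_vowel": False, "contains_plosive": True, "ends_with_vowel": True}},
    {"syllable": "an", "frequency": 42,
     "features": {"starts_with_vowel": True, "contains_plosive": False, "ends_with_vowel": False}},
]

matrix = to_feature_matrix(records, FEATURE_NAMES)
frequencies = [record["frequency"] for record in records]
print(matrix, frequencies)
```

Fixing the column order in a single list keeps every row comparable, which matters once the matrix is handed to a distance-based method.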
- property syllables_frequencies: Path
Default path to syllables_frequencies.json.
This file contains frequency counts for each syllable from the normalizer, useful for weighted analysis or filtering.
Returns
- Path
Path to data/normalized/syllables_frequencies.json
Examples
>>> config = AnalysisPathConfig()
>>> config.syllables_frequencies
PosixPath('.../data/normalized/syllables_frequencies.json')
Notes
This file is produced by the syllable normalizer and contains a JSON object mapping syllables to their occurrence counts:
{
  "ka": 187,
  "ra": 162,
  "mi": 145,
  ...
}
The frequencies represent pre-deduplication counts, capturing how often each canonical syllable appeared in the raw corpus.
- build_tools.syllable_analysis.analyze_feature_signatures(records)
Analyze feature signatures across all syllable records.
Counts how many syllables share each unique feature signature.
- Parameters:
records (list[dict]) – List of syllable records from syllables_annotated.json. Each record should have “syllable”, “frequency”, and “features” keys.
- Return type:
- Returns:
Counter mapping feature signatures to occurrence counts
Example
>>> records = [
...     {"syllable": "ka", "features": {"starts_with_vowel": False}},
...     {"syllable": "a", "features": {"starts_with_vowel": True}}
... ]
>>> counter = analyze_feature_signatures(records)
>>> counter[('starts_with_vowel',)]
1
- build_tools.syllable_analysis.apply_tsne(feature_matrix, n_components=2, perplexity=30, random_state=42, metric='hamming')
Apply t-SNE dimensionality reduction to feature matrix.
t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for dimensionality reduction that projects high-dimensional data into lower dimensions while preserving local structure.
- Parameters:
feature_matrix (ndarray) – Input feature matrix (n_samples, n_features). For binary features, should contain only 0s and 1s.
n_components (int) – Number of dimensions for output (default: 2). 2D is typical for visualization; 3D is also common.
perplexity (int) – t-SNE perplexity parameter (default: 30). Controls balance between local and global structure. Typical range: 5-50. Higher values consider more neighbors. Should be less than n_samples.
random_state (int) – Random seed for reproducibility (default: 42). Same seed ensures identical output for same input.
metric (str) – Distance metric (default: ‘hamming’). ‘hamming’ is optimal for binary features (counts the number of differences). Other options: ‘euclidean’, ‘manhattan’, ‘cosine’, etc.
- Return type:
ndarray
- Returns:
Reduced coordinates array of shape (n_samples, n_components). For default n_components=2, output is (n_samples, 2) with x,y coordinates.
- Raises:
ImportError – If scikit-learn is not installed
ValueError – If perplexity is invalid (too large for sample size)
Example
>>> import numpy as np
>>> from build_tools.syllable_analysis.dimensionality import apply_tsne
>>> # Create sample binary feature matrix (100 samples, 12 features)
>>> feature_matrix = np.random.randint(0, 2, size=(100, 12))
>>> # Apply t-SNE to reduce to 2D
>>> coords_2d = apply_tsne(feature_matrix, n_components=2, perplexity=30)
>>> coords_2d.shape
(100, 2)
Notes
Processing time scales roughly O(n²) with sample size
Perplexity should be less than n_samples (typically n_samples/3 max)
Hamming distance is best for binary features (our use case)
Fixed random_state ensures reproducible results
For large datasets (>10,000 samples), consider using approximate methods
- build_tools.syllable_analysis.build_hover_text(record, max_features=4)
Build rich hover text for a single syllable record.
Creates HTML-formatted hover text showing syllable details, frequency, and active features. Features are truncated if more than max_features are present.
- Parameters:
- Return type:
- Returns:
HTML-formatted hover text string
Example
>>> record = {
...     "syllable": "kran",
...     "frequency": 150,
...     "features": {
...         "contains_plosive": True,
...         "contains_liquid": True,
...         "contains_nasal": True,
...         "starts_with_cluster": True,
...         "ends_with_nasal": True,
...     }
... }
>>> text = build_hover_text(record, max_features=4)
>>> print(text)
<b>kran</b><br>Frequency: 150<br>Features: 5/12<br><i>contains_plosive, ...</i><br>...
Notes
Syllable shown in bold
Frequency shown with comma separators (e.g., “1,234”)
Feature count shows active/total (e.g., “5/12”)
First N features shown in italics
If more than N features, shows “+M more” truncation message
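The layout described in these notes can be approximated as follows. This is a hypothetical re-creation for illustration only: the function name hover_text, the total_features parameter, and the exact wording of the truncation line are assumptions, and the real build_hover_text output may differ in detail.

```python
def hover_text(record, max_features=4, total_features=12):
    """Approximate the described hover layout (hypothetical, not the package's code)."""
    # Keep the record's own key order for active features.
    active = [name for name, on in record["features"].items() if on]
    lines = [
        f"<b>{record['syllable']}</b>",               # syllable in bold
        f"Frequency: {record['frequency']:,}",        # comma thousands separator
        f"Features: {len(active)}/{total_features}",  # active/total
        f"<i>{', '.join(active[:max_features])}</i>", # first N features in italics
    ]
    hidden = len(active) - max_features
    if hidden > 0:
        lines.append(f"+{hidden} more")               # truncation message
    return "<br>".join(lines)


record = {
    "syllable": "kran",
    "frequency": 1234,
    "features": {
        "contains_plosive": True,
        "contains_liquid": True,
        "contains_nasal": True,
        "starts_with_cluster": True,
        "ends_with_nasal": True,
    },
}
text = hover_text(record)
print(text)
```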
- build_tools.syllable_analysis.calculate_optimal_perplexity(n_samples, min_perplexity=5, max_perplexity=50)
Suggest optimal perplexity value based on dataset size.
Perplexity is a key t-SNE parameter that balances local vs global structure. This function provides a reasonable default based on dataset size.
- Rule of thumb:
Perplexity should be between 5 and 50
Perplexity should be less than n_samples
Common heuristic: perplexity ≈ sqrt(n_samples), clamped to [5, 50]
- Parameters:
- Return type:
- Returns:
Suggested perplexity value
Example
>>> calculate_optimal_perplexity(100)
10
>>> calculate_optimal_perplexity(1000)
31
>>> calculate_optimal_perplexity(10000)
50
>>> calculate_optimal_perplexity(10)
5
Notes
For small datasets (<25 samples): use min_perplexity (5)
For large datasets (>2500 samples): use max_perplexity (50)
For medium datasets: use sqrt(n_samples)
This is a heuristic, not a strict rule - experiment for best results
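The heuristic in these notes can be sketched in a few lines. suggest_perplexity is a hypothetical stand-in, and truncating the square root (rather than rounding it) is an assumption inferred from the documented result of 31 for 1,000 samples.

```python
import math


def suggest_perplexity(n_samples, min_perplexity=5, max_perplexity=50):
    """Integer square root of n_samples, clamped to [min_perplexity, max_perplexity]."""
    estimate = math.isqrt(n_samples)  # floor of sqrt, exact for integers
    return max(min_perplexity, min(max_perplexity, estimate))


for n in (10, 100, 1000, 10000):
    print(n, suggest_perplexity(n))
```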
- build_tools.syllable_analysis.create_interactive_scatter(records, tsne_coords, title='t-SNE: Feature Signature Space (Interactive)')
Create interactive Plotly scatter plot of t-SNE coordinates.
Generates an interactive HTML-compatible visualization with rich hover tooltips, zoom/pan controls, and export capabilities. Points are sized (log scale) and colored by frequency.
- Parameters:
records (list[dict]) – List of annotated syllable records. Each must contain:
- syllable (str): Syllable text
- frequency (int): Occurrence count
- features (dict): Boolean feature flags (12 features)
tsne_coords (ndarray) – 2D coordinate array of shape (n_samples, 2) from t-SNE
title (str) – Plot title (default: “t-SNE: Feature Signature Space (Interactive)”)
- Return type:
Figure
- Returns:
Plotly Figure object with configured interactive scatter plot
- Raises:
ImportError – If Plotly is not installed
ValueError – If inputs are invalid or lengths don’t match
Example
>>> import numpy as np
>>> records = [
...     {"syllable": "ka", "frequency": 100, "features": {"contains_plosive": True}},
...     {"syllable": "mi", "frequency": 50, "features": {"contains_nasal": True}},
... ]
>>> coords = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> fig = create_interactive_scatter(records, coords)
>>> fig.show()  # Opens in browser
Notes
Point size uses log1p scale for better visibility across frequency ranges
Hover text shows syllable, frequency, feature count, and up to 4 features
If more than 4 features, shows “…+N more” truncation
Viridis colorscale provides perceptually uniform coloring
Fixed height (900px) with responsive width for consistent aspect ratio
Plotly CDN used when saving to HTML for smaller file size
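To show why a log1p scale helps with point sizes, the sketch below maps frequencies spanning three orders of magnitude onto a narrow range of marker sizes. The helper marker_sizes and its base and scale constants are assumptions chosen for illustration, not values taken from the plotting code.

```python
import math


def marker_sizes(frequencies, base=4.0, scale=3.0):
    """Compress a wide frequency range into a narrow range of marker sizes."""
    # log1p(f) = log(1 + f), so a frequency of 0 maps cleanly to the base size.
    return [base + scale * math.log1p(f) for f in frequencies]


frequencies = [1, 10, 100, 1000]
sizes = marker_sizes(frequencies)
print([round(size, 1) for size in sizes])
```

With a linear scale the most frequent syllable here would be drawn 1,000 times larger than the rarest; on the log scale the largest marker is only a few times bigger, so rare syllables stay visible.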
Create HTML metadata footer with algorithm parameters.
Generates a styled HTML block showing t-SNE parameters and generation information. Designed to be appended to the end of the HTML file.
- Parameters:
- Return type:
- Returns:
HTML string with formatted metadata table
Example
>>> footer = create_metadata_footer(perplexity=30, random_state=42) >>> "t-SNE Visualization Parameters" in footer True
Notes
Uses inline CSS for styling (no external dependencies)
Light gray background (#f5f5f5) for visual separation
Monospace font for technical parameters
Includes usage instructions for toolbar
Shows current timestamp of generation
- build_tools.syllable_analysis.create_metadata_text(output_filename, dpi, perplexity, random_state, processing_time)
Generate formatted metadata text for static visualization.
Creates a human-readable metadata report describing the visualization parameters, algorithm settings, and interpretation guide. Suitable for saving alongside PNG output files.
- Parameters:
output_filename (str) – Name of the output PNG file (e.g., “20260107_143022.tsne_visualization.png”)
dpi (int) – Resolution used for PNG export
perplexity (int) – t-SNE perplexity parameter used
random_state (int) – Random seed used for reproducibility
processing_time (float) – Total processing time in seconds
- Return type:
- Returns:
Formatted multi-line metadata string ready for file output
Example
>>> from pathlib import Path
>>> metadata = create_metadata_text(
...     output_filename="20260107_143022.tsne_visualization.png",
...     dpi=300,
...     perplexity=30,
...     random_state=42,
...     processing_time=2.5
... )
>>> Path("metadata.txt").write_text(metadata)
Notes
Includes timestamp of generation
Documents all algorithm parameters
Provides interpretation guidance
Uses Unicode box-drawing characters for formatting
- build_tools.syllable_analysis.create_tsne_mapping(records, tsne_coords)
Create syllable→features→coordinates mapping.
Combines annotated syllable records with their t-SNE coordinates to create a comprehensive mapping structure. This is useful for:
- Post-hoc cluster analysis
- Cross-referencing visualizations
- Interactive exploration
- Sharing visualizations with collaborators
- Parameters:
records (list[dict]) – Original annotated syllable records from load_annotated_syllables(). Each record should have:
- syllable (str): The syllable text
- frequency (int): Occurrence count
- features (dict): Boolean feature flags
tsne_coords (ndarray) – t-SNE coordinate array (n_syllables × n_dimensions). Typically 2D for visualization, but can be 3D or higher.
- Returns:
List of mapping records with structure:
[
  {
    "syllable": "kran", "frequency": 7, "tsne_x": -2.34, "tsne_y": 5.67, "features": {…}
  }
]
- Raises:
ValueError – If records and tsne_coords have mismatched lengths
Example
>>> import numpy as np
>>> records = [
...     {"syllable": "ka", "frequency": 187, "features": {...}},
...     {"syllable": "ran", "frequency": 42, "features": {...}}
... ]
>>> coords = np.array([[-2.1, 3.4], [1.5, -0.8]])
>>> mapping = create_tsne_mapping(records, coords)
>>> mapping[0]["tsne_x"]
-2.1
>>> mapping[0]["syllable"]
'ka'
Notes
Array indices preserve order from input records
Coordinates are converted from numpy float64 to Python float for JSON compatibility
All original record fields are preserved in the mapping
For 2D t-SNE: creates tsne_x and tsne_y fields
For 3D+ t-SNE: creates tsne_x, tsne_y, tsne_z, … fields
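The behaviour listed in these notes can be sketched with plain Python lists standing in for the NumPy coordinate array. build_mapping is a hypothetical helper; as a simplification it names at most three axes, whereas the notes describe support for higher dimensions as well.

```python
import json


def build_mapping(records, coords):
    """Attach coordinates to copies of the records (hypothetical helper)."""
    if len(records) != len(coords):
        raise ValueError("records and coords must have the same length")
    axis_names = ["tsne_x", "tsne_y", "tsne_z"]  # simplification: up to 3 axes
    mapping = []
    for record, point in zip(records, coords):
        entry = dict(record)  # shallow copy keeps all original fields
        for axis, value in zip(axis_names, point):
            entry[axis] = float(value)  # plain float so json can serialize it
        mapping.append(entry)
    return mapping


records = [
    {"syllable": "ka", "frequency": 187, "features": {"contains_plosive": True}},
    {"syllable": "ran", "frequency": 42, "features": {"contains_liquid": True}},
]
coords = [[-2.1, 3.4], [1.5, -0.8]]

mapping = build_mapping(records, coords)
print(json.dumps(mapping[0]))
```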
- build_tools.syllable_analysis.create_tsne_scatter(tsne_coords, frequencies, title='t-SNE: Feature Signature Space', figsize=(14, 10), cmap='viridis', alpha=0.6)
Create static matplotlib scatter plot of t-SNE coordinates.
Generates a publication-quality scatter plot showing t-SNE dimensionality reduction results. Points are sized and colored by frequency, with larger and brighter points indicating higher-frequency syllables.
- Parameters:
tsne_coords (
ndarray) – 2D coordinate array of shape (n_samples, 2) from t-SNE reductionfrequencies (
list[int]) – Frequency values for each point (used for sizing and coloring)title (
str) – Plot title (default: “t-SNE: Feature Signature Space”)figsize (
tuple[int,int]) – Figure size in inches as (width, height) (default: (14, 10))cmap (
str) – Matplotlib colormap name (default: “viridis”)alpha (
float) – Point transparency, 0.0=transparent to 1.0=opaque (default: 0.6)
- Return type:
Figure
- Returns:
matplotlib Figure object with configured scatter plot
- Raises:
ValueError – If tsne_coords shape is invalid or lengths don’t match
Example
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> coords = np.random.randn(100, 2)
>>> freqs = list(range(1, 101))
>>> fig = create_tsne_scatter(coords, freqs)
>>> fig.savefig("output.png", dpi=300)
>>> plt.close(fig)
Notes
Point size is proportional to frequency (frequency × 2)
Colorbar is added automatically to show frequency scale
Black edge lines improve visibility of overlapping points
Layout uses tight_layout() for optimal spacing
- build_tools.syllable_analysis.ensure_output_dir(output_dir)[source]
Ensure output directory exists, creating it if necessary.
This function creates the specified directory and all parent directories if they don’t already exist. It is idempotent - calling it multiple times with the same path is safe and has no side effects.
- Return type:
Parameters
- output_dir : Path
Directory path to ensure exists
Returns
- Path
The same path that was passed in (for chaining)
Examples
Basic usage:
>>> from pathlib import Path
>>> output_dir = ensure_output_dir(Path("_working/analysis/tsne/"))
>>> output_dir.exists()
True
Create nested directories:
>>> nested = ensure_output_dir(Path("_working/new/nested/dirs/"))
>>> nested.exists()
True
Idempotent operation:
>>> dir1 = ensure_output_dir(Path("_working/test/"))
>>> dir2 = ensure_output_dir(Path("_working/test/"))
>>> dir1 == dir2
True
Chaining:
>>> output_file = ensure_output_dir(Path("_working/analysis/")) / "output.json"
Notes
This function uses Path.mkdir(parents=True, exist_ok=True), which:
- Creates all parent directories as needed (like mkdir -p)
- Does not raise an error if the directory already exists
- Raises PermissionError if insufficient permissions
- Raises OSError for other filesystem errors
The function returns the input path unchanged, which allows for convenient chaining in expressions.
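Based on the notes above, the function likely reduces to a single `Path.mkdir` call followed by returning the path. The sketch below uses a hypothetical `ensure_dir` helper and a temporary directory so it can run anywhere; it is an illustration, not the package's source.

```python
import tempfile
from pathlib import Path


def ensure_dir(output_dir: Path) -> Path:
    """Hypothetical equivalent of ensure_output_dir, following the notes above."""
    output_dir.mkdir(parents=True, exist_ok=True)  # behaves like `mkdir -p`
    return output_dir  # returned unchanged so calls can be chained


with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "analysis" / "tsne"
    first = ensure_dir(nested)
    second = ensure_dir(nested)  # second call is a no-op
    report_path = ensure_dir(nested) / "output.json"  # chaining
    created = nested.is_dir()
    same_path = first == second
    report_name = report_path.name

print(created, same_path, report_name)
```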
- build_tools.syllable_analysis.extract_feature_matrix(records, feature_names=['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel', 'short_vowel', 'starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'ends_with_vowel', 'ends_with_stop', 'ends_with_nasal'])[source]
Extract binary feature matrix from annotated syllable records.
Converts feature dictionaries to a numerical matrix suitable for dimensionality reduction algorithms. Each row represents a syllable, each column represents a feature (0 or 1).
- Parameters:
records – List of annotated syllable records with 'features' and 'frequency' keys. Each record should have structure:
{"syllable": "ka", "frequency": 187, "features": {"contains_liquid": False, "contains_plosive": True, …}}
feature_names (list[str]) – Ordered list of feature names to extract (default: ALL_FEATURES). Order determines column order in output matrix.
- Returns:
Tuple of (feature_matrix, frequencies):
- feature_matrix: numpy array of shape (n_syllables, n_features) with binary values
- frequencies: List of frequency counts for each syllable
Example
>>> records = [
...     {
...         "syllable": "ka",
...         "frequency": 187,
...         "features": {"contains_liquid": False, "contains_plosive": True, ...}
...     }
... ]
>>> matrix, freqs = extract_feature_matrix(records)
>>> matrix.shape
(1, 12)
>>> freqs
[187]
Notes
Missing features default to False (0)
Feature values are converted to int (True→1, False→0)
Output matrix dtype is int for memory efficiency
Empty record list returns (0, n_features) shaped array
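To make the row and column layout concrete, here is a pure-Python sketch that builds the same kind of binary matrix as nested lists instead of a numpy array. The helper `to_rows` and the shortened three-feature list are assumptions made for brevity; the real function uses the 12 default feature names and returns an ndarray.

```python
from typing import Any

# Assumption: a shortened feature list so the output stays readable.
FEATURES = ["contains_liquid", "contains_plosive", "ends_with_vowel"]


def to_rows(
    records: list[dict[str, Any]], feature_names: list[str]
) -> tuple[list[list[int]], list[int]]:
    """Hypothetical list-based analogue of extract_feature_matrix."""
    rows = [
        # Missing features fall back to False, then True/False become 1/0.
        [int(bool(rec["features"].get(name, False))) for name in feature_names]
        for rec in records
    ]
    freqs = [rec["frequency"] for rec in records]
    return rows, freqs


records = [
    {"syllable": "ka", "frequency": 187,
     "features": {"contains_plosive": True, "ends_with_vowel": True}},
    {"syllable": "ran", "frequency": 42,
     "features": {"contains_liquid": True}},
]
rows, freqs = to_rows(records, FEATURES)
print(rows)   # one row per syllable, one column per feature name
print(freqs)
```

Because column order follows `feature_names`, passing the same list everywhere keeps matrices from different runs comparable.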
- build_tools.syllable_analysis.extract_signature(features)[source]
Extract the feature signature from a feature dictionary.
A signature is a sorted tuple of feature names where the feature value is True. This creates a canonical representation of the active feature set.
- Parameters:
features (dict[str,bool]) – Dictionary mapping feature names to boolean values
- Return type:
- Returns:
Sorted tuple of feature names that are active (True)
Example
>>> extract_signature({"starts_with_vowel": True, "ends_with_vowel": False})
('starts_with_vowel',)
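Sorting the active feature names makes a signature independent of dictionary key order, so it can serve as a dictionary or Counter key. The following sketch assumes that definition; `signature_of` is a hypothetical stand-in, and the Counter usage only illustrates how signatures might be tallied.

```python
from collections import Counter


def signature_of(features: dict[str, bool]) -> tuple[str, ...]:
    """Hypothetical equivalent of extract_signature: sorted names of True features."""
    return tuple(sorted(name for name, active in features.items() if active))


feature_dicts = [
    {"starts_with_vowel": True, "ends_with_vowel": False},
    {"ends_with_vowel": True, "contains_nasal": True},
    {"contains_nasal": True, "ends_with_vowel": True},  # same set, different key order
]
signature_counts = Counter(signature_of(f) for f in feature_dicts)
print(signature_counts.most_common())
```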
- build_tools.syllable_analysis.format_signature_report(signature_counter, total_syllables, limit=None)[source]
Format the signature analysis results as a human-readable report.
- build_tools.syllable_analysis.generate_output_pair(output_dir, primary_suffix, metadata_suffix, primary_ext='txt', metadata_ext='txt')[source]
Generate matching pair of timestamped output paths.
This function is useful for tools that generate both primary output and accompanying metadata files. It ensures both files use the same timestamp, making it easy to associate files and maintain chronological ordering.
Parameters
- output_dir : Path
Output directory (should exist or be created first)
- primary_suffix : str
Suffix for primary output file (e.g., 'tsne_visualization')
- metadata_suffix : str
Suffix for metadata file (e.g., 'tsne_metadata')
- primary_ext : str, default='txt'
File extension for primary file (e.g., 'png', 'json', 'html')
- metadata_ext : str, default='txt'
File extension for metadata file (typically 'txt' or 'json')
Returns
- tuple[Path, Path]
Tuple of (primary_path, metadata_path) with matching timestamps
Examples
Generate visualization and metadata pair:
>>> from pathlib import Path
>>> viz_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/analysis/tsne/"),
...     primary_suffix="tsne_visualization",
...     metadata_suffix="tsne_metadata",
...     primary_ext="png",
...     metadata_ext="txt"
... )
>>> viz_path.name
'20260107_143022.tsne_visualization.png'
>>> meta_path.name
'20260107_143022.tsne_metadata.txt'
Generate data and metadata pair:
>>> data_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/results/"),
...     primary_suffix="analysis_results",
...     metadata_suffix="analysis_meta",
...     primary_ext="json",
...     metadata_ext="json"
... )
Verify timestamp matching:
>>> viz_path, meta_path = generate_output_pair(
...     output_dir=Path("_working/"),
...     primary_suffix="primary",
...     metadata_suffix="metadata"
... )
>>> viz_path.stem.split('.')[0] == meta_path.stem.split('.')[0]  # same timestamp
True
Notes
Output Format:
- Primary: {output_dir}/{timestamp}.{primary_suffix}.{primary_ext}
- Metadata: {output_dir}/{timestamp}.{metadata_suffix}.{metadata_ext}
Examples:
- Primary: _working/analysis/tsne/20260107_143022.tsne_visualization.png
- Metadata: _working/analysis/tsne/20260107_143022.tsne_metadata.txt
The timestamp is generated once and used for both files, ensuring they are always associated and sort together in directory listings.
The directory is NOT created by this function - use ensure_output_dir() first if the directory might not exist.
Use Cases:
- t-SNE visualizer: PNG + metadata TXT
- Feature analysis: results JSON + metadata TXT
- Interactive plots: HTML + metadata JSON
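The key design point is that the timestamp is computed once and reused for both paths. A minimal sketch of that idea follows; `paired_paths` and its `now` parameter are hypothetical (the real function takes no clock argument) and exist only to make the example deterministic.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def paired_paths(
    output_dir: Path,
    primary_suffix: str,
    metadata_suffix: str,
    primary_ext: str = "txt",
    metadata_ext: str = "txt",
    now: Optional[datetime] = None,  # hypothetical: injectable clock for testing
) -> tuple[Path, Path]:
    """Hypothetical stand-in for generate_output_pair."""
    # Format the timestamp exactly once so both files share it.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    primary = output_dir / f"{stamp}.{primary_suffix}.{primary_ext}"
    metadata = output_dir / f"{stamp}.{metadata_suffix}.{metadata_ext}"
    return primary, metadata


viz_path, meta_path = paired_paths(
    Path("_working/analysis/tsne"),
    "tsne_visualization",
    "tsne_metadata",
    primary_ext="png",
    now=datetime(2026, 1, 7, 14, 30, 22),
)
print(viz_path.name)
print(meta_path.name)
```

Calling a timestamp helper twice instead could straddle a second boundary and yield mismatched names, which is presumably why the pair is generated together.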
- build_tools.syllable_analysis.generate_timestamped_path(output_dir, suffix, extension='txt', timestamp=None)[source]
Generate timestamped output file path.
This function creates a path with format: {output_dir}/{timestamp}.{suffix}.{extension}
The timestamp format is YYYYMMDD_HHMMSS (e.g., 20260107_143022), which provides:
- Chronological sorting
- Uniqueness (assuming not more than one file per second)
- Human readability
- No special characters that could cause path issues
- Return type:
Parameters
- output_dir : Path
Output directory (should exist or be created first)
- suffix : str
File suffix describing content (e.g., 'tsne_visualization', 'metadata')
- extension : str, default='txt'
File extension without leading dot (e.g., 'txt', 'json', 'png')
- timestamp : str, optional
Specific timestamp string (format: YYYYMMDD_HHMMSS). If None (default), uses current time via datetime.now()
Returns
- Path
Timestamped output file path
Examples
Basic usage (auto-generated timestamp):
>>> from pathlib import Path
>>> path = generate_timestamped_path(
...     output_dir=Path("_working/analysis/tsne/"),
...     suffix="tsne_visualization",
...     extension="png"
... )
>>> path.name
'20260107_143022.tsne_visualization.png'
Custom extension:
>>> path = generate_timestamped_path(
...     output_dir=Path("_working/"),
...     suffix="results",
...     extension="json"
... )
>>> path.suffix
'.json'
Explicit timestamp (for reproducibility or paired files):
>>> path = generate_timestamped_path(
...     output_dir=Path("_working/"),
...     suffix="output",
...     timestamp="20260107_120000"
... )
>>> "20260107_120000" in str(path)
True
Notes
File Naming Convention: {YYYYMMDD_HHMMSS}.{suffix}.{extension}
Examples:
- 20260107_143022.tsne_visualization.png
- 20260107_143022.tsne_metadata.txt
- 20260107_143022.feature_signatures.txt
The directory is NOT created by this function - use ensure_output_dir() first if the directory might not exist.
Timestamp Format:
- YYYY: 4-digit year
- MM: 2-digit month (01-12)
- DD: 2-digit day (01-31)
- HH: 2-digit hour (00-23)
- MM: 2-digit minute (00-59)
- SS: 2-digit second (00-59)
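Because every field is zero-padded and ordered from most to least significant, sorting file names as plain strings also sorts them chronologically, and the timestamp can be parsed back with `datetime.strptime`. The sketch below illustrates both properties; `timestamp_of` is a hypothetical helper, and the file names reuse the naming pattern shown above with made-up timestamps.

```python
from datetime import datetime
from pathlib import Path

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"  # strftime/strptime form of YYYYMMDD_HHMMSS


def timestamp_of(path: Path) -> datetime:
    """Hypothetical helper: recover the timestamp from a generated file name."""
    stamp = path.name.split(".", 1)[0]  # text before the first dot
    return datetime.strptime(stamp, TIMESTAMP_FORMAT)


names = [
    "20260107_143022.tsne_metadata.txt",
    "20251231_235959.feature_signatures.txt",
    "20260107_120000.output.txt",
]
ordered = sorted(names)  # plain string sort
oldest = timestamp_of(Path("_working") / ordered[0])
newest = timestamp_of(Path("_working") / ordered[-1])
print(oldest, newest)
```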
- build_tools.syllable_analysis.get_feature_vector(features, feature_names=['contains_liquid', 'contains_plosive', 'contains_fricative', 'contains_nasal', 'long_vowel', 'short_vowel', 'starts_with_vowel', 'starts_with_cluster', 'starts_with_heavy_cluster', 'ends_with_vowel', 'ends_with_stop', 'ends_with_nasal'])[source]
Extract a single feature vector from a feature dictionary.
Converts a dictionary of feature flags to an ordered binary vector. Useful for extracting vectors from individual syllables.
- Parameters:
- Return type:
- Returns:
Binary feature vector matching feature_names order
Example
>>> features = {"contains_liquid": True, "contains_plosive": False}
>>> vector = get_feature_vector(features, ["contains_liquid", "contains_plosive"])
>>> vector
[1, 0]
Notes
Missing features default to False (0)
Order of output matches order of feature_names
Output is Python list, not numpy array (for flexibility)
- build_tools.syllable_analysis.inject_responsive_css(html_content, min_width=1250)[source]
Inject responsive CSS into HTML content.
Adds CSS rules to ensure the plot has a minimum width and proper scrolling behavior. This prevents the plot from becoming too narrow on small screens while allowing horizontal scrolling when necessary.
- Parameters:
- Return type:
- Returns:
HTML content with injected CSS in <head> section
Example
>>> html = "<html><head></head><body>...</body></html>"
>>> modified = inject_responsive_css(html, min_width=1250)
>>> "<style>" in modified
True
Notes
CSS is inserted after the opening <head> tag
Sets body margin/padding to 0 for full-width layout
Enables horizontal scrolling when plot exceeds viewport width
Sets fixed height (900px) matching plot configuration
Uses !important to override Plotly’s inline styles
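One simple way to achieve the behaviour in these notes is a single string replacement on the opening `<head>` tag. The sketch below is an assumption-laden illustration: `inject_css_sketch` is a hypothetical name, the exact CSS rules used by the package are not shown in this reference, and the `.plotly-graph-div` selector is assumed to match the container in Plotly's exported HTML.

```python
def inject_css_sketch(html_content: str, min_width: int = 1250) -> str:
    """Hypothetical stand-in for inject_responsive_css (CSS rules are assumptions)."""
    css = (
        "<style>"
        # Zero margins for full-width layout; allow horizontal scrolling.
        "body { margin: 0 !important; padding: 0 !important; "
        "overflow-x: auto !important; }"
        # Assumed selector for the Plotly container; height follows the notes above.
        f".plotly-graph-div {{ min-width: {min_width}px !important; "
        "height: 900px !important; }"
        "</style>"
    )
    # Insert right after the first opening <head> tag only.
    return html_content.replace("<head>", "<head>" + css, 1)


html = "<html><head></head><body></body></html>"
modified = inject_css_sketch(html, min_width=1000)
print(modified)
```

A plain `str.replace` leaves the input unchanged when no literal `<head>` tag is present, so callers may want to check for that case.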
- build_tools.syllable_analysis.load_annotated_syllables(input_path, validate=True)[source]
Load annotated syllables from JSON file with optional validation.
This function loads the output of the syllable feature annotator, which contains syllables with their frequencies and phonetic feature annotations. It provides optional validation to ensure the data structure is correct.
Parameters
- input_path : Path
Path to syllables_annotated.json file
- validate : bool, default=True
Whether to validate the structure of loaded data. When True, checks:
- Data is a list
- List is non-empty
- First record has required keys: 'syllable', 'frequency', 'features'
Returns
- list[dict[str, Any]]
List of syllable records, each containing:
- syllable (str): The syllable text
- frequency (int): Occurrence count in corpus
- features (dict): Boolean feature flags (12 features)
Raises
- FileNotFoundError
If input file does not exist
- json.JSONDecodeError
If file is not valid JSON
- ValueError
If validation is enabled and data structure is invalid
Examples
Basic loading with validation:
>>> from pathlib import Path
>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"))
>>> len(records)
1247
>>> records[0].keys()
dict_keys(['syllable', 'frequency', 'features'])
Loading without validation (faster, use when structure is guaranteed):
>>> records = load_annotated_syllables(Path("data/annotated/syllables_annotated.json"),
...                                    validate=False)
Error handling:
>>> try:
...     records = load_annotated_syllables(Path("nonexistent.json"))
... except FileNotFoundError as e:
...     print(f"File not found: {e}")
Notes
Expected file format (syllables_annotated.json):
[
  {
    "syllable": "ka",
    "frequency": 187,
    "features": {
      "contains_liquid": false,
      "contains_plosive": true,
      "contains_fricative": false,
      "contains_nasal": false,
      "long_vowel": false,
      "short_vowel": true,
      "starts_with_vowel": false,
      "starts_with_cluster": false,
      "starts_with_heavy_cluster": false,
      "ends_with_vowel": true,
      "ends_with_stop": false,
      "ends_with_nasal": false
    }
  },
  ...
]
This file is produced by the syllable feature annotator pipeline and is the primary input for analysis tools.
Performance
Loading a typical corpus of 1,000-10,000 syllables takes <100ms. Validation adds negligible overhead (~1ms).
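The three validation checks listed under the validate parameter are cheap because only the first record is inspected. A minimal sketch of a loader with the same checks is shown here; `load_records` and `REQUIRED_KEYS` are hypothetical names, and the error messages are invented for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

REQUIRED_KEYS = {"syllable", "frequency", "features"}


def load_records(input_path: Path, validate: bool = True) -> list[dict[str, Any]]:
    """Hypothetical loader mirroring the documented checks (not the library code)."""
    with input_path.open(encoding="utf-8") as handle:  # FileNotFoundError if missing
        data = json.load(handle)  # json.JSONDecodeError if malformed
    if validate:
        if not isinstance(data, list):
            raise ValueError("expected a JSON list of records")
        if not data:
            raise ValueError("record list is empty")
        first = data[0]
        if not isinstance(first, dict) or not REQUIRED_KEYS.issubset(first):
            raise ValueError(f"first record must contain {sorted(REQUIRED_KEYS)}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "syllables_annotated.json"
    good.write_text(
        json.dumps([{"syllable": "ka", "frequency": 187, "features": {}}]),
        encoding="utf-8",
    )
    bad = Path(tmp) / "bad.json"
    bad.write_text(json.dumps({"ka": 187}), encoding="utf-8")  # a dict, not a list

    loaded = load_records(good)
    try:
        load_records(bad)
        rejected = False
    except ValueError:
        rejected = True
    unchecked = load_records(bad, validate=False)  # structure is not inspected

print(len(loaded), rejected, unchecked)
```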
- build_tools.syllable_analysis.load_frequency_data(frequencies_path)[source]
Load frequency mapping from JSON file.
This function loads the output of the syllable normalizer’s frequency analysis, which maps each canonical syllable to its occurrence count in the raw corpus.
Parameters
- frequencies_path : Path
Path to syllables_frequencies.json file
Returns
- dict[str, int]
Dictionary mapping syllable strings to their frequency counts
Raises
- FileNotFoundError
If input file does not exist
- json.JSONDecodeError
If file is not valid JSON
- ValueError
If data structure is invalid (not a dict)
Examples
Basic loading:
>>> from pathlib import Path
>>> frequencies = load_frequency_data(Path("data/normalized/syllables_frequencies.json"))
>>> frequencies["ka"]
187
>>> len(frequencies)
1247
Checking most common syllables:
>>> sorted_freqs = sorted(frequencies.items(), key=lambda x: x[1], reverse=True)
>>> sorted_freqs[:3]
[('ka', 187), ('ra', 162), ('mi', 145)]
Error handling:
>>> try:
...     frequencies = load_frequency_data(Path("nonexistent.json"))
... except FileNotFoundError:
...     print("File not found")
Notes
Expected file format (syllables_frequencies.json):
{ "ka": 187, "ra": 162, "mi": 145, "ta": 98, ... }
This file is produced by the syllable normalizer’s frequency analysis step and captures pre-deduplication counts (how many times each syllable appeared in the raw corpus before creating the unique syllable list).
The frequencies can be used for:
- Weighted analysis (prioritize common syllables)
- Filtering (exclude rare syllables)
- Visualization (size/color by frequency)
- Statistical analysis
Performance
Loading a typical frequency file (1,000-10,000 entries) takes <50ms.
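As an illustration of the filtering and weighting uses listed above, the following sketch works on an in-memory dictionary shaped like the documented file format. The four entries are the ones from the format example; the `MIN_COUNT` threshold is an arbitrary assumption.

```python
# Same shape as the dict returned by load_frequency_data (values from the format example).
frequencies = {"ka": 187, "ra": 162, "mi": 145, "ta": 98}

MIN_COUNT = 100  # assumption: arbitrary cut-off for "rare" syllables

# Filtering: keep only syllables at or above the threshold.
common = {syl: count for syl, count in frequencies.items() if count >= MIN_COUNT}

# Weighting: convert counts to relative shares of the corpus.
total = sum(frequencies.values())
share = {syl: count / total for syl, count in frequencies.items()}

# Ranking: most frequent first.
top_three = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)[:3]

print(common)
print(top_three)
```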
- build_tools.syllable_analysis.parse_feature_signatures_args()
Parse command-line arguments.
- Return type:
- build_tools.syllable_analysis.parse_random_sampler_arguments()
Parse command-line arguments.
- Return type:
- Returns:
Parsed argument namespace.
- build_tools.syllable_analysis.parse_tsne_visualizer_args()
Parse command-line arguments.
- Return type:
- Returns:
Parsed argument namespace with validated parameters
- build_tools.syllable_analysis.run_analysis(input_path, output_dir, limit=None)[source]
Run the complete feature signature analysis pipeline.
- Parameters:
- Returns:
Dictionary with analysis results including:
- total_syllables: Total number of syllables analyzed
- unique_signatures: Number of unique feature signatures
- output_path: Path to the saved report
- build_tools.syllable_analysis.run_tsne_visualization(input_path, output_dir, perplexity=30, random_state=42, dpi=300, verbose=False, save_mapping=False, interactive=False)[source]
Run the complete t-SNE visualization pipeline.
This is the main entry point for programmatic use. It handles the full workflow:
1. Load annotated syllables
2. Extract feature matrix
3. Apply t-SNE dimensionality reduction
4. Create visualization
5. Save outputs (PNG + optional HTML + optional mapping)
- Parameters:
input_path (Path) – Path to syllables_annotated.json
output_dir (Path) – Directory to save visualization outputs
perplexity (int) – t-SNE perplexity parameter (default: 30)
random_state (int) – Random seed for reproducibility (default: 42)
dpi (int) – Output resolution in dots per inch (default: 300)
verbose (bool) – Print detailed progress information
save_mapping (bool) – Save syllable→features→coordinates mapping as JSON (default: False)
interactive (bool) – Generate interactive HTML visualization (requires Plotly, default: False)
- Returns:
Dictionary containing:
- syllable_count: Number of syllables visualized
- feature_count: Number of features (always 12)
- output_path: Path to saved visualization PNG
- metadata_path: Path to saved metadata file
- tsne_coordinates: numpy array of 2D coordinates
- mapping_path: Path to mapping JSON (None if save_mapping=False)
- interactive_path: Path to interactive HTML (None if interactive=False or Plotly unavailable)
- processing_time: Total processing time in seconds
- Raises:
FileNotFoundError – If input file does not exist
ImportError – If required dependencies are missing
ValueError – If input data is invalid
Example
>>> from pathlib import Path
>>> result = run_tsne_visualization(
...     input_path=Path("data/annotated/syllables_annotated.json"),
...     output_dir=Path("_working/analysis/tsne/"),
...     interactive=True,
...     save_mapping=True
... )
>>> print(f"Visualized {result['syllable_count']} syllables")
>>> print(f"Interactive HTML: {result['interactive_path']}")
- build_tools.syllable_analysis.sample_syllables(records, sample_count, seed=None)[source]
Randomly sample syllables from the full corpus.
- Parameters:
- Return type:
- Returns:
List of sampled syllable records.
- Raises:
ValueError – If sample_count is larger than available records.
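A seeded sampler with this contract can be sketched with the standard library. The version below assumes sampling without replacement (which the ValueError for oversized requests suggests) and uses a private `random.Random` instance; `sample_records` is a hypothetical name and the real implementation may differ.

```python
import random
from typing import Any, Optional


def sample_records(
    records: list[dict[str, Any]],
    sample_count: int,
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Hypothetical stand-in for sample_syllables (sampling without replacement)."""
    if sample_count > len(records):
        raise ValueError(
            f"Requested {sample_count} samples but only {len(records)} records available"
        )
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(records, sample_count)


records = [{"syllable": f"s{i}", "frequency": i} for i in range(10)]
first = sample_records(records, 3, seed=42)
second = sample_records(records, 3, seed=42)  # same seed, same sample
print([r["syllable"] for r in first])
```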
- build_tools.syllable_analysis.save_interactive_html(fig, output_path, perplexity, random_state, min_width=1250)[source]
Save interactive Plotly figure as standalone HTML.
Creates a self-contained HTML file with embedded Plotly visualization that can be:
- Opened directly in any web browser
- Shared with collaborators
- Embedded in reports or documentation
- Explored with zoom, pan, hover, and export controls
The HTML file uses Plotly CDN for JavaScript dependencies (smaller file size) and includes responsive CSS and a metadata footer.
- Parameters:
fig (Figure) – Plotly Figure object from create_interactive_scatter()
output_path (Path) – Output HTML file path (parent directory must exist)
perplexity (int) – t-SNE perplexity parameter (for metadata footer)
random_state (int) – Random seed used (for metadata footer)
min_width (int) – Minimum width constraint in pixels (default: 1250)
- Raises:
ImportError – If Plotly is not installed
FileNotFoundError – If parent directory doesn’t exist
ValueError – If output_path doesn’t end with .html
- Return type:
Example
>>> from pathlib import Path
>>> fig = create_interactive_scatter(records, tsne_coords)
>>> output_path = Path("_working/visualization.html")
>>> save_interactive_html(fig, output_path, perplexity=30, random_state=42)
Notes
Plotly CDN used for smaller file size vs. full JS bundle
Mode bar configured with additional tools (hoverclosest, hovercompare)
Export to PNG button configured for high-resolution (1600x1200, 2x scale)
Responsive CSS ensures minimum width of 1250px
Metadata footer includes algorithm parameters and generation time
- build_tools.syllable_analysis.save_json_output(data, output_path, indent=2, ensure_ascii=False)[source]
Save data as formatted JSON file.
This function provides consistent JSON output formatting across all analysis tools. It ensures proper Unicode handling, readable indentation, and creates parent directories if needed.
- Return type:
Parameters
- data : Any
Data to serialize as JSON (must be JSON-serializable)
- output_path : Path
Output file path (parent directories will be created if needed)
- indent : int | None, default=2
Number of spaces for JSON indentation. Use 2 for readability, None for compact output
- ensure_ascii : bool, default=False
If True, escape non-ASCII characters. If False (default), preserve Unicode characters for better readability
Raises
- TypeError
If data is not JSON-serializable
- OSError
If file cannot be written (permissions, disk full, etc.)
Examples
Save analysis results:
>>> from pathlib import Path
>>> results = {"total": 1247, "unique": 892}
>>> save_json_output(results, Path("output/results.json"))
Save with compact formatting:
>>> save_json_output(results, Path("output/compact.json"), indent=None)
Save with ASCII-only encoding:
>>> save_json_output(results, Path("output/ascii.json"), ensure_ascii=True)
Auto-create parent directories:
>>> save_json_output(results, Path("output/new/dir/results.json"))
>>> # Creates output/new/dir/ automatically
Notes
Default settings (indent=2, ensure_ascii=False) are optimized for:
- Human readability (indented)
- Unicode support (preserve accented characters, emojis, etc.)
- Version control friendliness (consistent line breaks)
File encoding is always UTF-8 for maximum compatibility.
Performance
Saving 1,000-10,000 records typically takes <100ms. Using indent=None (compact) is ~20% faster but much less readable.
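The documented behaviour (UTF-8 output, preserved Unicode, auto-created parent directories) maps directly onto `json.dump` and `Path.mkdir`. The following is a minimal sketch under that assumption; `write_json` is a hypothetical name and the sample data is invented for the demonstration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def write_json(
    data: Any,
    output_path: Path,
    indent: Optional[int] = 2,
    ensure_ascii: bool = False,
) -> None:
    """Hypothetical stand-in for save_json_output."""
    output_path.parent.mkdir(parents=True, exist_ok=True)  # create missing parents
    with output_path.open("w", encoding="utf-8") as handle:  # always UTF-8
        json.dump(data, handle, indent=indent, ensure_ascii=ensure_ascii)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "new" / "dir" / "results.json"
    write_json({"syllable": "é", "total": 1247}, target)  # illustrative data
    raw_text = target.read_text(encoding="utf-8")
    round_trip = json.loads(raw_text)

    ascii_target = Path(tmp) / "ascii.json"
    write_json({"syllable": "é"}, ascii_target, indent=None, ensure_ascii=True)
    ascii_text = ascii_target.read_text(encoding="utf-8")

print(raw_text)
print(ascii_text)
```

With `ensure_ascii=True` the accented character is written as a `\u` escape, while the default keeps it readable in the file.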
- build_tools.syllable_analysis.save_report(report, output_dir)[source]
Save the formatted report to the output directory.
- build_tools.syllable_analysis.save_static_plot(fig, output_path, dpi=300)[source]
Save matplotlib figure to PNG file.
Saves a matplotlib Figure to a high-resolution PNG file suitable for publication or presentation. Uses tight bounding box to minimize whitespace.
- Parameters:
- Raises:
FileNotFoundError – If parent directory doesn’t exist
PermissionError – If file cannot be written
ValueError – If output_path doesn’t end with .png
- Return type:
Example
>>> import matplotlib.pyplot as plt
>>> from pathlib import Path
>>> fig, ax = plt.subplots()
>>> ax.plot([1, 2, 3], [1, 4, 9])
>>> save_static_plot(fig, Path("output.png"), dpi=300)
>>> plt.close(fig)
Notes
Uses bbox_inches='tight' to remove excess whitespace
Higher DPI values create larger files but better quality
Common DPI values: 150 (screen), 300 (print), 600 (high-quality print)
Figure is NOT automatically closed after saving
- build_tools.syllable_analysis.save_tsne_mapping(mapping, output_path, indent=2)[source]
Save t-SNE mapping to JSON file.
Writes the syllable→coordinates mapping as formatted JSON for human readability and programmatic access.
- Parameters:
- Return type:
Example
>>> from pathlib import Path
>>> mapping = [{"syllable": "ka", "tsne_x": -2.1, "tsne_y": 3.4, "features": {...}}]
>>> save_tsne_mapping(mapping, Path("output.json"))
Notes
Output is formatted with indentation for human readability
Uses ensure_ascii=False to preserve Unicode characters
UTF-8 encoding ensures international character support
Parent directories are created if they don’t exist
- build_tools.syllable_analysis.validate_feature_matrix(feature_matrix, expected_features=12)[source]
Validate feature matrix shape and contents.
Ensures the feature matrix has the expected structure for dimensionality reduction algorithms.
- Parameters:
feature_matrix (ndarray) – Binary feature matrix
expected_features (int) – Expected number of features (default: 12)
- Raises:
ValueError – If validation fails (wrong shape, non-binary values, etc.)
- Return type:
Example
>>> import numpy as np
>>> matrix = np.array([[1, 0, 1], [0, 1, 0]])
>>> validate_feature_matrix(matrix, expected_features=3)  # OK
>>> validate_feature_matrix(matrix, expected_features=4)  # Raises ValueError