==============
Analysis Tools
==============

.. currentmodule:: build_tools.syllable_analysis

Overview
--------

.. automodule:: build_tools.syllable_analysis
   :no-members:

Random Sampler
--------------

Random sampling utility for QA and inspection of annotated syllables.

Command-Line Interface
~~~~~~~~~~~~~~~~~~~~~~

.. argparse::
   :module: build_tools.syllable_analysis.random_sampler
   :func: create_argument_parser
   :prog: python -m build_tools.syllable_analysis.random_sampler

Integration Guide
~~~~~~~~~~~~~~~~~

Use after annotation to randomly sample syllables for quality assurance:

.. code-block:: bash

   # Step 1: Annotate syllables
   python -m build_tools.syllable_feature_annotator

   # Step 2: Sample for QA inspection
   python -m build_tools.syllable_analysis.random_sampler \
     --input data/annotated/syllables_annotated.json \
     --count 50 \
     --output _working/samples.json \
     --seed 42

Feature Signature Analysis
---------------------------

The feature signature analysis tool examines annotated syllables to identify which feature combinations
actually exist in the data and how frequently each combination appears.

A "feature signature" is the set of all active (True) features for a syllable. For example, a syllable
with only ``starts_with_vowel`` and ``ends_with_vowel`` active would have the signature:
``('ends_with_vowel', 'starts_with_vowel')``.

This analysis answers questions like:

- What feature patterns are most common in natural language?
- Are certain feature combinations rare or impossible?
- How diverse is the feature space in the corpus?

Command-Line Interface
~~~~~~~~~~~~~~~~~~~~~~

.. argparse::
   :module: build_tools.syllable_analysis.feature_signatures
   :func: create_argument_parser
   :prog: python -m build_tools.syllable_analysis.feature_signatures

Output Format
~~~~~~~~~~~~~

The tool generates timestamped plain text reports (``YYYYMMDD_HHMMSS.feature_signatures.txt``) with:

.. code-block:: text

   ================================================================================
   FEATURE SIGNATURE ANALYSIS
   ================================================================================
   Generated: 2026-01-06 13:55:56
   Total syllables analyzed: 23,160
   Unique feature signatures: 361

   SUMMARY STATISTICS
   --------------------------------------------------------------------------------
   Most common signature: 661 syllables (2.9%)
     Features: contains_liquid, contains_plosive, ends_with_vowel, long_vowel

   Feature cardinality distribution:
     1 features: 3 unique signatures
     2 features: 17 unique signatures
     3 features: 54 unique signatures
     4 features: 86 unique signatures
     5 features: 96 unique signatures
     6 features: 70 unique signatures
     7 features: 30 unique signatures
     8 features: 5 unique signatures

   ================================================================================
   SIGNATURE RANKINGS
   --------------------------------------------------------------------------------
   Rank   Count    Pct      Features
   --------------------------------------------------------------------------------
   1      661        2.85%  [4] contains_liquid, contains_plosive, ends_with_vowel, long_vowel
   2      506        2.18%  [3] contains_plosive, ends_with_vowel, long_vowel
   ...

Integration Guide
~~~~~~~~~~~~~~~~~

Use after annotation to understand feature patterns:

.. code-block:: bash

   # Step 1: Annotate syllables
   python -m build_tools.syllable_feature_annotator

   # Step 2: Analyze feature signatures
   python -m build_tools.syllable_analysis.feature_signatures \
     --input data/annotated/syllables_annotated.json \
     --output _working/analysis/feature_signatures/

t-SNE Visualization
-------------------

The t-SNE (t-distributed Stochastic Neighbor Embedding) visualization tool creates 2D visualizations of the
high-dimensional feature signature space. This helps identify clustering patterns, syllable similarity, and
natural groupings in the annotated syllable corpus.

t-SNE is a dimensionality reduction technique that projects 12-dimensional feature vectors into 2D space while
preserving local structure. The visualization uses:

- **Position (x, y)**: t-SNE projection coordinates
- **Size**: Syllable frequency (larger points = more common)
- **Color**: Syllable frequency (warmer colors = more common)

Command-Line Interface
~~~~~~~~~~~~~~~~~~~~~~

.. argparse::
   :module: build_tools.syllable_analysis.tsne_visualizer
   :func: create_argument_parser
   :prog: python -m build_tools.syllable_analysis.tsne_visualizer

Output Format
~~~~~~~~~~~~~

The visualizer generates timestamped files in the output directory:

1. **YYYYMMDD_HHMMSS.tsne_visualization.png** - High-resolution static visualization (PNG, always generated)
2. **YYYYMMDD_HHMMSS.tsne_metadata.txt** - Detailed metadata and interpretation guide
3. **YYYYMMDD_HHMMSS.tsne_mapping.json** - Syllable→features→coordinates mapping (optional, requires ``--save-mapping``)
4. **YYYYMMDD_HHMMSS.tsne_interactive.html** - Interactive Plotly visualization (optional, requires ``--interactive``)

**Output Formats:**

- **Static PNG**: High-resolution matplotlib visualization (always generated, 300 DPI default)
- **Interactive HTML**: Plotly-based interactive visualization with hover tooltips, zoom, pan, and export (optional)

**Static PNG metadata file includes:**

- Algorithm parameters (method, perplexity, random seed, dimensions, distance metric, features)
- Visualization encoding (axis meanings, point size/color)
- Interpretation guide (how to read the visualization)
- Technical details (DPI, generation timestamp)

**Interactive HTML features:**

- Hover tooltips showing syllable text, frequency, and active features
- Interactive zoom, pan, and exploration controls
- Export to high-resolution PNG directly from browser
- Self-contained HTML file with embedded metadata
- Works in any modern web browser without additional dependencies

Integration Guide
~~~~~~~~~~~~~~~~~

Use after annotation to visualize the feature space:

.. code-block:: bash

   # Step 1: Annotate syllables
   python -m build_tools.syllable_feature_annotator

   # Step 2: Generate t-SNE visualization
   python -m build_tools.syllable_analysis.tsne_visualizer \
     --input data/annotated/syllables_annotated.json \
     --output _working/analysis/tsne/ \
     --interactive

Advanced Topics
~~~~~~~~~~~~~~~

**Understanding t-SNE Parameters:**

**Perplexity** (default: 30):

- Balances attention between local and global structure
- Typical range: 5-50
- Lower values: emphasize local clusters
- Higher values: preserve global structure
- Rule of thumb: should be less than number of syllables
- Default of 30 works well for most corpus sizes (100-10,000 syllables)

**Random State** (default: 42):

- Controls random initialization of t-SNE
- Same value = reproducible visualizations
- Different values = different (but valid) layouts
- Use fixed value (e.g., 42) for consistent results

**Distance Metric**:

- Uses Hamming distance (optimal for binary feature vectors)
- Automatically configured for 12-dimensional binary features
- Not configurable via command-line (intentional design choice)

**Interpreting the Visualization:**

**What to look for:**

- **Nearby points**: Syllables with similar phonetic features
- **Clusters**: Natural groupings in the feature space
- **Large/bright points**: High-frequency syllables (common patterns)
- **Small/dark points**: Low-frequency syllables (rare patterns)
- **Isolated points**: Unique or rare feature combinations
- **Dense regions**: Common feature patterns
- **Sparse regions**: Less common feature patterns

**Example insights:**

- Vowel-initial syllables might cluster together
- Heavy consonant clusters might form distinct groups
- Frequency might correlate with certain feature patterns
- Outliers might indicate unusual phonetic combinations

Notes
-----

**Dependencies:**

Required dependencies for t-SNE visualization (install with ``pip install -e ".[build-tools]"``):

- scikit-learn, matplotlib, numpy, pandas (for static PNG generation)
- plotly (for interactive HTML generation, optional)

**Performance:**

- Deterministic analysis with fixed random seeds (same input = same output)
- Fast processing: typically <10 seconds for 1,000-10,000 syllables
- t-SNE processing time scales roughly O(n²) with corpus size
- For very large datasets (>50,000 syllables), consider sampling first

**Output Characteristics:**

- Human-readable plain text reports with formatted tables
- Timestamped output files for historical tracking
- High-resolution visualizations (default 300 DPI)
- Static visualizations saved as PNG files for easy sharing and embedding
- Interactive visualizations saved as self-contained HTML files

**Build-time tool:**

These are build-time analysis tools - not used during runtime name generation.

API Reference
-------------

.. automodule:: build_tools.syllable_analysis
   :members:
   :undoc-members:
   :show-inheritance: