=========================
NLTK Syllable Extractor
=========================

.. currentmodule:: build_tools.nltk_syllable_extractor

Overview
--------

.. automodule:: build_tools.nltk_syllable_extractor
   :no-members:

Command-Line Interface
----------------------

.. argparse::
   :module: build_tools.nltk_syllable_extractor.cli
   :func: create_argument_parser
   :prog: python -m build_tools.nltk_syllable_extractor

Output Format
-------------

Output files are organized in a run-based subdirectory structure under ``_working/output/``. Each extraction run creates a timestamped directory containing ``syllables/`` and ``meta/`` subdirectories:

**Directory structure:**

::

    _working/output/
      └── YYYYMMDD_HHMMSS_nltk/     # Run directory (one per batch)
          ├── syllables/
          │   ├── file1.txt          # Input filename preserved
          │   ├── file2.txt
          │   └── ...
          └── meta/
              ├── file1.txt          # Matching metadata
              ├── file2.txt
              └── ...

**Interactive mode (single file):**

::

    _working/output/
      └── 20260110_143022_nltk/
          ├── syllables/
          │   └── en_US.txt
          └── meta/
              └── en_US.txt

**Batch mode (multiple files):**

::

    _working/output/
      └── 20260110_143022_nltk/     # All files share one run directory
          ├── syllables/
          │   ├── alice.txt
          │   ├── middlemarch.txt
          │   └── don_quijote.txt
          └── meta/
              ├── alice.txt
              ├── middlemarch.txt
              └── don_quijote.txt

**Benefits of run-based organization:**

- Each extraction run is self-contained in a timestamped directory
- Easy to archive, move, or delete entire runs as atomic units
- Input filenames are preserved for easy identification
- Clean separation between syllables and metadata
- All outputs from a batch operation are grouped together
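
The layout above can be reproduced in a few lines. This is a minimal sketch, assuming only the naming scheme documented here (``YYYYMMDD_HHMMSS_nltk/`` with ``syllables/`` and ``meta/`` inside). The helper names ``run_directory`` and ``output_paths`` are hypothetical and are not part of the extractor's API:

.. code-block:: python

   from datetime import datetime
   from pathlib import Path


   def run_directory(base, when, extractor="nltk"):
       """Build a run directory path such as base/20260110_143022_nltk."""
       return Path(base) / f"{when:%Y%m%d_%H%M%S}_{extractor}"


   def output_paths(run_dir, input_name):
       """Return (syllables_path, meta_path), keeping the input filename."""
       name = Path(input_name).name
       return run_dir / "syllables" / name, run_dir / "meta" / name


   # One timestamp per run, so every file in a batch shares the directory.
   run_dir = run_directory("_working/output", datetime(2026, 1, 10, 14, 30, 22))
   syll_path, meta_path = output_paths(run_dir, "data/corpus/alice.txt")

   print(syll_path.as_posix())
   # _working/output/20260110_143022_nltk/syllables/alice.txt
   print(meta_path.as_posix())
   # _working/output/20260110_143022_nltk/meta/alice.txt

Computing the timestamp once per run, rather than once per file, is what keeps a batch together in a single directory.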

**Syllables file format:**

Each line contains one syllable, written in extraction order with duplicates kept. This preserves the natural syllable frequencies of the source text for downstream processing:

::

    hel
    lo
    won
    der
    ful
    world
    hel
    lo
    world

**Note:** Duplicates are intentionally preserved. The extractor's job is to extract, not to filter. Use ``build_tools.pyphen_syllable_normaliser`` for deduplication and frequency analysis.
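
Because duplicates are kept, frequencies can be recovered directly from a syllables file. The sketch below is illustrative only; the function name ``syllable_frequencies`` is hypothetical, and the sample data is the listing shown above:

.. code-block:: python

   from collections import Counter


   def syllable_frequencies(lines):
       """Count syllables from an iterable of lines, skipping blank lines."""
       return Counter(line.strip() for line in lines if line.strip())


   # The same nine lines as the example file above.
   sample = ["hel", "lo", "won", "der", "ful", "world", "hel", "lo", "world"]
   freq = syllable_frequencies(sample)

   print(sum(freq.values()))  # 9 total syllables (with duplicates)
   print(len(freq))           # 6 unique syllables
   print(freq["hel"])         # 2

An open file handle can be passed in place of the list, since the function only needs an iterable of lines.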

**Metadata file format:**

The metadata file records extraction parameters and statistics:

- Source files processed
- Language code (always ``en_US`` for the NLTK extractor)
- Extractor type (``nltk_syllable_extractor (CMUDict + onset/coda)``)
- Syllable length constraints (min/max)
- Total syllables (with duplicates)
- Unique syllable count (for reference)
- Total word count
- Processing statistics (processed words, fallback count, rejected syllables)
- Extraction timestamp
- Command-line invocation

**Metadata distinguishes extractor source:**

The NLTK extractor labels its output so that it can be told apart from pyphen-based extraction:

::

    ======================================================================
    NLTK SYLLABLE EXTRACTION METADATA
    ======================================================================
    Extraction Date:    2026-01-09 22:43:28
    Extractor:          nltk_syllable_extractor (CMUDict + onset/coda)
    Language Code:      en_US
    Syllable Length:    1-999 characters
    Total Syllables:    911
    Unique Syllables:   401

    Processing Statistics:
      Total Words:        503
      Processed Words:    503
      Fallback Used:      17 (not in CMUDict)
      Rejected Syllables: 0 (length filter)
    ...
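
The metadata file is plain text, so its ``Key: value`` lines can be read back with a small parser. This is a rough sketch that assumes the layout shown in the excerpt above and nothing more; ``parse_metadata`` and ``leading_int`` are hypothetical helpers, not functions provided by the tool:

.. code-block:: python

   def parse_metadata(text):
       """Collect 'Key: value' lines into a dict, ignoring rules and headings."""
       fields = {}
       for line in text.splitlines():
           # Split on the first colon only, so timestamps stay intact.
           key, sep, value = line.partition(":")
           if sep and key.strip() and value.strip():
               fields[key.strip()] = value.strip()
       return fields


   def leading_int(value):
       """Return the integer at the start of a value such as '17 (not in CMUDict)'."""
       return int(value.split()[0])


   excerpt = "\n".join([
       "Extraction Date:    2026-01-09 22:43:28",
       "Total Syllables:    911",
       "Unique Syllables:   401",
       "Processing Statistics:",
       "  Total Words:        503",
       "  Fallback Used:      17 (not in CMUDict)",
   ])

   meta = parse_metadata(excerpt)
   fallback = leading_int(meta["Fallback Used"])
   total = leading_int(meta["Total Words"])
   print(f"{fallback}/{total} words used the fallback")
   # 17/503 words used the fallback

Section headings such as ``Processing Statistics:`` have no value after the colon and are skipped.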

Integration Guide
-----------------

The NLTK syllable extractor is an alternative first step in the build pipeline, complementing the pyphen-based extractor:

**Standard workflow (using NLTK extractor):**

.. code-block:: bash

   # Step 1: Extract syllables using NLTK/CMUDict
   python -m build_tools.nltk_syllable_extractor \
     --source data/corpus/ \
     --pattern "*.txt" \
     --output data/raw/nltk/

   # Step 2: Normalize extracted syllables (source-agnostic)
   python -m build_tools.pyphen_syllable_normaliser \
     --source data/raw/nltk/ \
     --output data/normalized/

   # Step 3: Annotate with phonetic features (source-agnostic)
   python -m build_tools.syllable_feature_annotator

**Parallel workflow (comparing both extractors):**

.. code-block:: bash

   # Extract with pyphen (typographic)
   python -m build_tools.pyphen_syllable_extractor \
     --source data/corpus/ \
     --lang en_US \
     --output data/raw/pyphen/

   # Extract with NLTK (phonetic)
   python -m build_tools.nltk_syllable_extractor \
     --source data/corpus/ \
     --output data/raw/nltk/

   # Compare outputs or merge for hybrid corpus
   # Both feed into the same downstream tools

**When to use NLTK extractor vs pyphen:**

**Use NLTK extractor when:**

- You want phonetically guided syllable boundaries
- You prefer consonant cluster integrity (e.g., "An-drew" not "And-rew")
- You want syllables that feel more like spoken language
- You're working with English text (CMUDict limitation)
- You want to explore phonetic texture differences

**Use pyphen extractor when:**

- You need multi-language support (40+ languages)
- You prefer typographic hyphenation rules
- You want formal, dictionary-based splits
- You're working with non-English text

**Combining both extractors:**

The two extractors produce complementary textures. You can:

1. Extract with both, compare outputs, choose one
2. Merge outputs for richer syllable inventory
3. Use different extractors for different name generation profiles
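
Options 1 and 2 can be sketched with standard-library containers. The example is an illustration under stated assumptions: the input lists stand in for the contents of two syllables files, they are lowercased by hand, and the function names are hypothetical:

.. code-block:: python

   from collections import Counter


   def compare_inventories(pyphen_syllables, nltk_syllables):
       """Report which unique syllables are shared or specific to one extractor."""
       p, n = set(pyphen_syllables), set(nltk_syllables)
       return {"shared": p & n, "pyphen_only": p - n, "nltk_only": n - p}


   def merge_counts(pyphen_syllables, nltk_syllables):
       """Merge both outputs into one inventory, summing occurrence counts."""
       return Counter(pyphen_syllables) + Counter(nltk_syllables)


   # "Andrew" and "program", split as in the comparison table on this page.
   pyphen_out = ["and", "rew", "pro", "gram"]
   nltk_out = ["an", "drew", "pro", "gram"]

   report = compare_inventories(pyphen_out, nltk_out)
   print(sorted(report["shared"]))       # ['gram', 'pro']
   print(sorted(report["pyphen_only"]))  # ['and', 'rew']
   print(sorted(report["nltk_only"]))    # ['an', 'drew']

   merged = merge_counts(pyphen_out, nltk_out)
   print(merged["pro"])  # 2

Note that a merged inventory counts shared syllables twice, which may or may not be what a downstream frequency analysis expects.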

**Extraction modes:**

- **Interactive mode**: No arguments - prompts for file selection
- **Single file**: ``--file input.txt`` - Process one file
- **Multiple files**: ``--files file1.txt file2.txt`` - Process specific files
- **Directory scan**: ``--source /data/ --pattern "*.txt"`` - Scan directory for files
- **Recursive scan**: ``--source /data/ --pattern "*.txt" --recursive`` - Scan subdirectories
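
The difference between the directory and recursive scans can be pictured with ``pathlib``. This is a sketch of the general technique only; the extractor's own file discovery may differ, and ``find_inputs`` is a hypothetical name:

.. code-block:: python

   from pathlib import Path


   def find_inputs(source, pattern="*.txt", recursive=False):
       """List files under source matching pattern, optionally descending."""
       root = Path(source)
       # rglob walks subdirectories; glob looks at the top level only.
       matches = root.rglob(pattern) if recursive else root.glob(pattern)
       return sorted(p for p in matches if p.is_file())


   # Example calls (paths are placeholders):
   #   find_inputs("data/corpus")                  -> top-level *.txt files
   #   find_inputs("data/corpus", recursive=True)  -> *.txt files at any depth

Sorting the result keeps the processing order stable between runs on the same directory.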

Notes
-----

**Language Limitation:**

The NLTK extractor is **English-only** due to CMUDict constraints:

- CMU Pronouncing Dictionary covers North American English pronunciation
- No support for other languages (use pyphen for multi-language needs)
- This is a fundamental limitation of the phonetic dictionary approach

**CMUDict Package:**

The tool uses the ``cmudict`` pip package (python-cmudict) which includes the
CMU Pronouncing Dictionary data. No separate corpus download is required - simply
install via pip and you're ready to go.

**Phonetic vs Typographic Splitting:**

The NLTK extractor can produce different splits than pyphen. Some words split differently, while others split identically:

+------------+-------------------+-------------------+
| Word       | pyphen (typo)     | NLTK (phonetic)   |
+============+===================+===================+
| Andrew     | And-rew           | An-drew           |
+------------+-------------------+-------------------+
| structure  | struc-ture        | stru-cture        |
+------------+-------------------+-------------------+
| beautiful  | beau-ti-ful       | beau-ti-ful       |
+------------+-------------------+-------------------+
| program    | pro-gram          | pro-gram          |
+------------+-------------------+-------------------+

These differences create distinct **phonetic textures**:

- **pyphen**: Conservative, formal, typographic breaks
- **NLTK**: Natural, phonetic, respects consonant clustering

**Deterministic Pronunciation Selection:**

When words have multiple pronunciations in CMUDict (e.g., "read" as present vs past tense), the extractor:

- Always selects the **first pronunciation** listed
- This ensures deterministic results (same input → same output)
- Pronunciation selection cannot be configured
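
The first-pronunciation rule is easy to show on a small in-memory table. The sketch below does not call the ``cmudict`` package; ``SAMPLE_ENTRIES`` is a hand-written stand-in whose entries and ordering are illustrative, and both function names are hypothetical. It also relies on the CMUDict convention that vowel phonemes end in a stress digit (0, 1 or 2), so counting those gives the syllable count:

.. code-block:: python

   # Illustrative stand-in for a pronunciation table: word -> list of
   # pronunciations, each a list of ARPAbet phonemes. Not real lookup output.
   SAMPLE_ENTRIES = {
       "read": [["R", "EH1", "D"], ["R", "IY1", "D"]],
       "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
   }


   def first_pronunciation(word, entries):
       """Return the first listed pronunciation, or None if the word is unknown."""
       pronunciations = entries.get(word.lower())
       return pronunciations[0] if pronunciations else None


   def count_syllables(phonemes):
       """Count vowel phonemes, which carry a trailing stress digit."""
       return sum(1 for p in phonemes if p[-1].isdigit())


   chosen = first_pronunciation("Read", SAMPLE_ENTRIES)
   print(chosen)                   # ['R', 'EH1', 'D']
   print(count_syllables(chosen))  # 1
   print(first_pronunciation("xyzzy", SAMPLE_ENTRIES))  # None

A ``None`` result corresponds to the out-of-vocabulary case, which is where a fallback heuristic takes over.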

**Fallback for Unknown Words:**

Words not in CMUDict use a heuristic fallback:

1. Identifies vowel groups as syllable nuclei
2. Applies onset/coda principles to consonant clusters
3. Maintains phonetic character even for out-of-vocabulary words

Fallback usage is recorded in the metadata as "Fallback Used: N (not in CMUDict)", so words split by the heuristic can be told apart from words resolved through CMUDict.
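
The three steps above can be approximated on spelling alone. What follows is a simplified sketch and not the extractor's actual heuristic: it treats ``a e i o u y`` as vowels, uses a small hand-picked set of legal onsets, and assumes purely alphabetic input. ``LEGAL_ONSETS`` and ``fallback_split`` are hypothetical names:

.. code-block:: python

   import re

   VOWELS = "aeiouy"

   # Small, illustrative subset of consonant clusters that may start a syllable.
   LEGAL_ONSETS = {
       "b", "bl", "br", "c", "ch", "cl", "cr", "d", "dr", "f", "fl", "fr",
       "g", "gl", "gr", "h", "j", "k", "l", "m", "n", "p", "ph", "pl", "pr",
       "qu", "r", "s", "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st", "str",
       "sw", "t", "th", "tr", "v", "w", "wh", "z",
   }


   def fallback_split(word):
       """Split a word at consonant clusters, giving the next syllable the
       longest legal onset and leaving the rest as the previous coda."""
       chunks = re.findall(r"[aeiouy]+|[^aeiouy]+", word.lower())
       syllables, current = [], ""
       for i, chunk in enumerate(chunks):
           if chunk[0] in VOWELS:
               current += chunk  # vowel group: the syllable nucleus
               continue
           has_nucleus = any(c in VOWELS for c in current)
           more_to_come = i + 1 < len(chunks)
           if has_nucleus and more_to_come:
               # Find the earliest cut whose right side is a legal onset.
               cut = len(chunk)
               for k in range(len(chunk)):
                   if chunk[k:] in LEGAL_ONSETS:
                       cut = k
                       break
               syllables.append(current + chunk[:cut])
               current = chunk[cut:]
           else:
               current += chunk  # word-initial onset or word-final coda
       if current:
           syllables.append(current)
       return syllables


   print(fallback_split("andrew"))  # ['an', 'drew']
   print(fallback_split("hello"))   # ['hel', 'lo']

In "andrew" the cluster ``ndr`` is not a legal onset but ``dr`` is, so ``n`` stays with the first syllable and ``dr`` starts the second.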

**Extraction Philosophy - Preserving Duplicates:**

The extractor preserves all syllables including duplicates, following separation-of-concerns design:

- **Extractor's job**: Extract syllables (preserves frequency information)
- **Normaliser's job**: Deduplicate, filter, aggregate (``pyphen_syllable_normaliser``)
- **Annotator's job**: Add phonetic features (``syllable_feature_annotator``)

This design allows downstream tools to:

- Perform frequency analysis on natural corpus distribution
- Make informed filtering decisions based on occurrence counts
- Apply different normalization strategies for different use cases

**Syllable Length Constraints:**

- Default: min=1, max=999 (no practical filtering by default)
- Adjust with ``--min`` and ``--max`` flags to filter if needed
- Default behavior preserves all syllables for downstream processing
- Examples:

  - ``--min 2 --max 8`` - Keep only 2-8 character syllables (matches the previous default)
  - ``--min 1 --max 1`` - Extract only single-character syllables
  - ``--min 3`` - Extract syllables of 3+ characters (no upper limit)
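
The effect of the length bounds can be sketched as an inclusive filter that also reports how many syllables were rejected, mirroring the "Rejected Syllables" line in the metadata. The function name ``filter_by_length`` is hypothetical, and inclusive bounds are an assumption based on the "2-8 character" wording above:

.. code-block:: python

   def filter_by_length(syllables, min_len=1, max_len=999):
       """Keep syllables with min_len <= length <= max_len.

       Returns (kept, rejected_count). Duplicates and order are preserved.
       """
       kept = [s for s in syllables if min_len <= len(s) <= max_len]
       return kept, len(syllables) - len(kept)


   syllables = ["a", "hel", "lo", "world", "strengths", "lo"]

   kept, rejected = filter_by_length(syllables)
   print(len(kept), rejected)  # 6 0

   kept, rejected = filter_by_length(syllables, min_len=2, max_len=8)
   print(kept)      # ['hel', 'lo', 'world', 'lo']
   print(rejected)  # 2

With the defaults nothing realistic is rejected, which matches the "no practical filtering" behavior described above.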

**Output Organization:**

- Each extraction run creates a timestamped directory with an ``nltk`` suffix (``YYYYMMDD_HHMMSS_nltk/``)
- Run directory contains ``syllables/`` and ``meta/`` subdirectories
- Batch processing groups all files into a single run directory
- Input filenames are preserved in output (e.g., ``alice.txt``)
- Interactive mode uses ``en_US.txt`` as the filename
- Metadata clearly labels extractor source for provenance
- All extractions are logged to the corpus database (if available)

**Performance Considerations:**

- CMUDict lookup is fast (dictionary-based)
- Fallback heuristics are efficient
- Processing speed comparable to pyphen for English text
- NLTK initial import may take 1-2 seconds

**Build-time tool:**

This is a build-time tool only. It is not used during runtime name generation.

API Reference
-------------

.. automodule:: build_tools.nltk_syllable_extractor
   :members:
   :undoc-members:
   :show-inheritance: