========================
Pyphen Syllable Extractor
========================

.. currentmodule:: build_tools.pyphen_syllable_extractor

Overview
--------

.. automodule:: build_tools.pyphen_syllable_extractor
   :no-members:

Command-Line Interface
----------------------

.. argparse::
   :module: build_tools.pyphen_syllable_extractor.cli
   :func: create_argument_parser
   :prog: python -m build_tools.pyphen_syllable_extractor

Output Format
-------------

Output files are organized in a run-based subdirectory structure under ``_working/output/``. Each extraction run creates a timestamped directory containing ``syllables/`` and ``meta/`` subdirectories:

**Directory structure:**

::

    _working/output/
      └── YYYYMMDD_HHMMSS_pyphen/   # Run directory (one per batch)
          ├── syllables/
          │   ├── file1.txt          # Input filename preserved
          │   ├── file2.txt
          │   └── ...
          └── meta/
              ├── file1.txt          # Matching metadata
              ├── file2.txt
              └── ...

**Interactive mode (single file):**

::

    _working/output/
      └── 20260110_143022_pyphen/
          ├── syllables/
          │   └── en_US.txt
          └── meta/
              └── en_US.txt

**Batch mode (multiple files):**

::

    _working/output/
      └── 20260110_143022_pyphen/   # All files share one run directory
          ├── syllables/
          │   ├── alice.txt
          │   ├── middlemarch.txt
          │   └── don_quijote.txt
          └── meta/
              ├── alice.txt
              ├── middlemarch.txt
              └── don_quijote.txt

**Benefits of run-based organization:**

- Each extraction run is self-contained in a timestamped directory
- Easy to archive, move, or delete entire runs as atomic units
- Input filenames are preserved for easy identification
- Clean separation between syllables and metadata
- All outputs from a batch operation are grouped together

**Syllables file format:**

Each line contains one unique syllable, sorted alphabetically:

::

    der
    ful
    hel
    lo
    won
    world

**Metadata file format:**

The metadata file records extraction parameters and statistics:

- Source files processed
- Language code used
- Syllable length constraints (min/max)
- Unique syllable count
- Total word count
- Extraction timestamp
- Command-line invocation

Integration Guide
-----------------

The syllable extractor is the first step in the build pipeline:

.. code-block:: bash

   # Step 1: Extract syllables from corpus (language auto-detected or defaults to en_US)
   python -m build_tools.pyphen_syllable_extractor \
     --source data/corpus/ \
     --pattern "*.txt" \
     --output data/raw/

   # Step 2: Normalize extracted syllables
   python -m build_tools.pyphen_syllable_normaliser \
     --source data/raw/ \
     --output data/normalized/

   # Step 3: Annotate with phonetic features
   python -m build_tools.syllable_feature_annotator

**When to use this tool:**

- To extract syllables from text corpora for the first time
- When adding new language variants to the corpus
- To regenerate syllables after changing extraction parameters (min/max length)
- For exploring syllable patterns in specific text sources

**Extraction modes:**

- **Interactive mode**: No arguments - prompts for file selection
- **Single file**: ``--file input.txt`` - Process one file
- **Multiple files**: ``--files file1.txt file2.txt`` - Process specific files
- **Directory scan**: ``--source /data/ --pattern "*.txt"`` - Scan directory for files
- **Recursive scan**: ``--source /data/ --pattern "*.txt" --recursive`` - Scan subdirectories

**Language selection (optional):**

- **Intelligent defaults**: If no language is specified, the tool automatically:

  - Uses ``--auto`` (automatic detection) if ``langdetect`` is installed
  - Falls back to ``en_US`` if ``langdetect`` is not available
  - Displays which default was chosen at runtime

- **Explicit language**: ``--lang en_US`` - Specify a specific language code
- **Force auto-detect**: ``--auto`` - Explicitly request automatic language detection (requires ``langdetect``)

**Examples:**

.. code-block:: bash

   # Simple usage (language auto-selected)
   python -m build_tools.pyphen_syllable_extractor --file input.txt

   # Explicit language selection
   python -m build_tools.pyphen_syllable_extractor --file input.txt --lang de_DE

   # Force automatic detection
   python -m build_tools.pyphen_syllable_extractor --file input.txt --auto

Notes
-----

**Supported Languages:**

The extractor supports 40+ languages through pyphen's LibreOffice dictionaries:

- English (US: en_US, UK: en_GB)
- Germanic: German (de_DE), Dutch (nl_NL), Swedish (sv_SE), Danish (da_DK), Norwegian (nb_NO, nn_NO)
- Romance: French (fr_FR), Spanish (es_ES), Italian (it_IT), Portuguese (pt_PT), Romanian (ro_RO)
- Slavic: Russian (ru_RU), Polish (pl_PL), Czech (cs_CZ), Slovak (sk_SK), Ukrainian (uk_UA)
- Other: Greek (el_GR), Turkish (tr_TR), Hungarian (hu_HU), Finnish (fi_FI), Estonian (et_EE)
- And many more...

To list all available languages:

.. code-block:: python

   from build_tools.pyphen_syllable_extractor import SUPPORTED_LANGUAGES
   print(f"{len(SUPPORTED_LANGUAGES)} languages available")

**Language Auto-Detection:**

The tool includes automatic language detection (requires ``langdetect`` package):

- Use ``--auto`` flag to enable automatic language detection
- Detection is per-file based on text content
- Falls back to English (en_US) if detection fails or is unavailable
- Install with: ``pip install langdetect``

To check if auto-detection is available:

.. code-block:: python

   from build_tools.pyphen_syllable_extractor import is_detection_available
   if is_detection_available():
       print("Language auto-detection is available")

**Syllable Length Constraints:**

- Default: min=2, max=8 characters
- Adjust with ``--min`` and ``--max`` flags
- Shorter syllables (min=1) include single vowels
- Longer syllables (max=10+) may include compound patterns

**Output Organization:**

- Each extraction run creates a timestamped directory with pyphen identifier (``YYYYMMDD_HHMMSS_pyphen/``)
- Run directory contains ``syllables/`` and ``meta/`` subdirectories
- Batch processing groups all files into a single run directory
- Input filenames are preserved in output (e.g., ``alice.txt``)
- Interactive mode uses language code as filename (e.g., ``en_US.txt``)
- Metadata files provide full provenance tracking
- All extractions are logged to corpus database (if available)

**Build-time tool:**

This is a build-time tool only - not used during runtime name generation.

API Reference
-------------

.. automodule:: build_tools.pyphen_syllable_extractor
   :members:
   :undoc-members:
   :show-inheritance: