build_tools.pyphen_syllable_extractor.extractor
===============================================

.. py:module:: build_tools.pyphen_syllable_extractor.extractor

.. autoapi-nested-parse::

   Core syllable extraction functionality.

   This module provides the main SyllableExtractor class for extracting syllables
   from text using pyphen's dictionary-based hyphenation.


Attributes
----------

.. autoapisummary::

   build_tools.pyphen_syllable_extractor.extractor.PYPHEN_AVAILABLE


Classes
-------

.. autoapisummary::

   build_tools.pyphen_syllable_extractor.extractor.SyllableExtractor


Module Contents
---------------

.. py:data:: PYPHEN_AVAILABLE
   :value: True
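
   The flag presumably records whether ``import pyphen`` succeeded. A minimal
   sketch of that optional-import pattern follows. It is an assumption about how
   the flag is set, not a copy of the module's source, and ``require_pyphen`` is
   a hypothetical helper.

   .. code-block:: python

      # Assumed pattern: probe once for the optional build-time dependency.
      try:
          import pyphen  # noqa: F401
          PYPHEN_AVAILABLE = True
      except ImportError:
          pyphen = None
          PYPHEN_AVAILABLE = False


      def require_pyphen():
          """Raise a descriptive ImportError when pyphen is missing."""
          if not PYPHEN_AVAILABLE:
              raise ImportError("pyphen is not installed; try: pip install pyphen")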


.. py:class:: SyllableExtractor(language_code, min_syllable_length = 1, max_syllable_length = 10)

   Extracts syllables from text using pyphen hyphenation dictionaries.

   This class provides methods to process text files and extract individual
   syllables based on language-specific hyphenation rules from LibreOffice's
   dictionary collection.

   The extractor works by:

   1. Reading text input (string or file)
   2. Tokenizing into words using regex
   3. Applying language-specific hyphenation rules via pyphen
   4. Splitting hyphenated words into syllables
   5. Filtering syllables by length constraints
   6. Returning unique syllables (case-insensitive)
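
   The steps above can be sketched roughly as follows. This is an illustrative
   approximation, not the class's actual implementation: ``extract_syllables``
   and the injected ``hyphenate`` callable are hypothetical names, and the stub
   dictionary stands in for pyphen so the snippet runs without it. With pyphen
   installed, ``pyphen.Pyphen(lang='en_US').inserted`` could be passed instead.

   .. code-block:: python

      import re

      # Word pattern quoted in the notes for extract_syllables_from_text.
      WORD_PATTERN = re.compile(r"\b[a-zA-ZÀ-ÿ]+\b")


      def extract_syllables(text, hyphenate, min_len=1, max_len=10,
                            only_hyphenated=True):
          """Return the set of unique lowercase syllables found in ``text``.

          ``hyphenate`` maps a word to the same word with '-' at break points.
          Lowercasing before hyphenation is an assumption of this sketch.
          """
          syllables = set()
          for word in WORD_PATTERN.findall(text.lower()):
              parts = hyphenate(word).split("-")
              if only_hyphenated and len(parts) == 1:
                  continue  # the hyphenator found no break point
              syllables.update(p for p in parts if min_len <= len(p) <= max_len)
          return syllables


      # Stub hyphenator used only so the example is self-contained.
      STUB_BREAKS = {"hello": "hel-lo", "wonderful": "won-der-ful"}


      def stub_hyphenate(word):
          return STUB_BREAKS.get(word, word)


      result = extract_syllables("Hello wonderful world", stub_hyphenate, 2, 8)
      print(sorted(result))  # ['der', 'ful', 'hel', 'lo', 'won']

   Because "world" has no break point in the stub, it only appears in the
   result when ``only_hyphenated=False`` is passed.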

   Key Features:
       - Support for 40+ languages via pyphen
       - Configurable syllable length constraints
       - Option to include/exclude non-hyphenated words
       - Case-insensitive processing
       - Unicode support for accented characters
       - Deterministic extraction (same input = same output)

   Typical Usage:
       >>> # Basic extraction
       >>> from pathlib import Path
       >>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
       >>> syllables, stats = extractor.extract_syllables_from_text(
       ...     "Hello wonderful world", only_hyphenated=False)
       >>> print(sorted(syllables))
       ['der', 'ful', 'hel', 'lo', 'won', 'world']

       >>> # Extract from file and save
       >>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
       >>> extractor.save_syllables(syllables, Path('output.txt'))

   .. attribute:: dictionary

      Pyphen hyphenation dictionary for the selected language

   .. attribute:: language_code

      The pyphen language/locale code (e.g., 'en_US', 'de_DE')

   .. attribute:: min_syllable_length

      Minimum syllable length to include in results

   .. attribute:: max_syllable_length

      Maximum syllable length to include in results

   .. note::

      This is a build-time tool. The pyphen dependency should not be used
      at runtime in the core name generation system.

   Initialize the syllable extractor with a specific language.

   :param language_code: Pyphen language/locale code (e.g., 'en_US', 'de_DE')
   :param min_syllable_length: Minimum syllable length to include (default: 1)
   :param max_syllable_length: Maximum syllable length to include (default: 10)

   :raises ImportError: If pyphen is not installed
   :raises ValueError: If the language code is not supported by pyphen


   .. py:method:: extract_syllables_from_text(text, only_hyphenated = True)

      Extract unique syllables from a block of text.

      This method processes input text by tokenizing it into words, applying
      hyphenation rules via pyphen, and extracting individual syllables that
      meet the configured length constraints.

      :param text: Input text to process. Can contain any characters, but only
                   alphabetic sequences (including accented characters) will be
                   processed as words.
      :param only_hyphenated: If True, only include syllables from words that pyphen
                              actually hyphenated (default: True). This filters out
                              whole words that couldn't be syllabified. Set to False
                              to include all words, even if they can't be split.

      :returns: Tuple of (syllables, statistics) where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with the following keys:

                  - 'total_words': Total number of words found in source text
                  - 'processed_words': Words that were successfully hyphenated/processed
                  - 'skipped_unhyphenated': Words skipped (only when only_hyphenated=True)
                  - 'rejected_syllables': Syllables rejected due to length constraints
      :rtype: tuple

      .. note::

         - Only processes words containing alphabetic characters (a-z, A-Z, À-ÿ)
         - Case-insensitive processing (all output is lowercase)
         - Automatically removes punctuation and special characters
         - Filters syllables by configured min/max length constraints
         - When only_hyphenated=True, excludes words pyphen couldn't split
         - Deterministic: same input always produces same output
         - Words are extracted using the regex pattern ``\b[a-zA-ZÀ-ÿ]+\b``
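
      To make the tokenization rule concrete, the short sketch below applies the
      pattern quoted in the note with Python's ``re`` module. The ``tokenize``
      helper is a hypothetical name used for illustration only.

      .. code-block:: python

         import re

         WORD_PATTERN = re.compile(r"\b[a-zA-ZÀ-ÿ]+\b")


         def tokenize(text):
             """Return the alphabetic runs that the pattern accepts as words."""
             return WORD_PATTERN.findall(text)


         print(tokenize("Hello, café-au-lait costs 42 euros!"))
         # ['Hello', 'café', 'au', 'lait', 'costs', 'euros']

         # Letters glued to digits or underscores never sit on a word boundary,
         # so such tokens are dropped entirely rather than trimmed.
         print(tokenize("abc123 foo_bar"))
         # []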

      .. admonition:: Example

         >>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
         >>> syllables, stats = extractor.extract_syllables_from_text(
         ...     "Hello world!", only_hyphenated=False)
         >>> print(sorted(syllables))
         ['hel', 'lo', 'world']
         >>> print(stats['total_words'])
         2


   .. py:method:: extract_syllables_from_file(input_path)

      Extract unique syllables from a text file.

      This is a convenience wrapper around extract_syllables_from_text() that
      handles file reading with proper encoding (UTF-8) and error handling.

      :param input_path: Path to the input text file. File should be UTF-8 encoded
                         plain text. Binary files or non-text formats will cause errors.

      :returns: Tuple of (syllables, statistics) where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with processing statistics (see extract_syllables_from_text)
      :rtype: tuple

      :raises FileNotFoundError: If the input file doesn't exist at the specified path
      :raises IOError: If there's an error reading the file (permissions, encoding, etc.)
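
      The error contract above can be illustrated with a small reader function.
      This is a sketch under assumptions: ``read_utf8_text`` is a hypothetical
      helper, and wrapping ``UnicodeDecodeError`` in ``IOError`` is inferred
      from the ``:raises:`` description rather than confirmed from the source.

      .. code-block:: python

         import tempfile
         from pathlib import Path


         def read_utf8_text(input_path):
             """Read a UTF-8 text file, mapping failures onto the documented errors."""
             path = Path(input_path)
             if not path.is_file():
                 raise FileNotFoundError(f"Input file not found: {path}")
             try:
                 return path.read_text(encoding="utf-8")
             except UnicodeDecodeError as exc:
                 # UnicodeDecodeError is a ValueError, so it is re-raised as IOError.
                 raise IOError(f"Could not decode {path} as UTF-8") from exc


         with tempfile.TemporaryDirectory() as tmp:
             good = Path(tmp) / "good.txt"
             good.write_text("Hello wörld", encoding="utf-8")
             text = read_utf8_text(good)
         print(text)  # Hello wörld

      Permission problems and similar failures already surface as ``OSError``,
      of which ``IOError`` is an alias in Python 3.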

      .. admonition:: Example

         >>> from pathlib import Path
         >>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
         >>> syllables, stats = extractor.extract_syllables_from_file(Path('book.txt'))
         >>> print(f"Extracted {len(syllables)} unique syllables from {stats['total_words']} words")
         Extracted 1250 unique syllables from 50000 words


   .. py:method:: save_syllables(syllables, output_path)

      Save syllables to a text file (one syllable per line, sorted).

      Writes syllables in alphabetical order with UTF-8 encoding, one syllable
      per line. This format is ideal for version control and easy importing into
      other tools.

      :param syllables: Set of syllables to save. Each syllable should be a string.
                        The set will be sorted alphabetically before writing.
      :param output_path: Path to the output file. Parent directories must exist.
                          If the file exists, it will be overwritten.

      :raises IOError: If there's an error writing the file (permissions, disk space, etc.)

      .. admonition:: Example

         >>> from pathlib import Path
         >>> extractor = SyllableExtractor('en_US')
         >>> syllables = {'hel', 'lo', 'world'}
         >>> extractor.save_syllables(syllables, Path('output.txt'))
         # Creates file with content:
         # hel
         # lo
         # world

      .. note::

         The output file uses UTF-8 encoding with Unix-style line endings (``\n``).
         Each line contains exactly one syllable with no leading/trailing whitespace.
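
      The file format described above can be reproduced with the standard
      library alone. The sketch below is an assumption about the behaviour, not
      the method's source: ``write_syllables`` is a hypothetical name, and
      "alphabetical" is taken to mean Python's default string ordering.

      .. code-block:: python

         import tempfile
         from pathlib import Path


         def write_syllables(syllables, output_path):
             """Write one syllable per line, sorted, as UTF-8 with '\\n' endings."""
             # newline="\n" stops Python translating line endings on Windows.
             with open(output_path, "w", encoding="utf-8", newline="\n") as handle:
                 for syllable in sorted(syllables):
                     handle.write(f"{syllable}\n")


         with tempfile.TemporaryDirectory() as tmp:
             target = Path(tmp) / "syllables.txt"
             write_syllables({"world", "hel", "lo"}, target)
             content = target.read_bytes()
         print(content)  # b'hel\nlo\nworld\n'

      Default string ordering compares code points, so accented syllables sort
      after all unaccented ones.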


   .. py:method:: extract_with_auto_language(text, min_syllable_length = 1, max_syllable_length = 10, only_hyphenated = True, default_language = 'en_US', min_detection_length = 20, suppress_warnings = False)
      :staticmethod:


      Extract syllables with automatic language detection.

      This convenience method combines language detection with syllable extraction.
      It automatically detects the language of the input text and creates an
      appropriate SyllableExtractor instance for that language.

      :param text: Input text to process. Should be at least 20-50 characters for
                   reliable language detection.
      :param min_syllable_length: Minimum syllable length to include (default: 1)
      :param max_syllable_length: Maximum syllable length to include (default: 10)
      :param only_hyphenated: If True, only include syllables from hyphenated words
                              (default: True)
      :param default_language: Language code to use if detection fails (default: "en_US")
      :param min_detection_length: Minimum text length for detection attempt (default: 20)
      :param suppress_warnings: If True, suppress language detection warnings (default: False)

      :returns: Tuple of (syllables, statistics, detected_language_code) where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with processing statistics
                - detected_language_code: The pyphen language code that was used
      :rtype: tuple

      :raises ImportError: If langdetect is not installed (unless suppress_warnings=True)

      .. admonition:: Example

         >>> # Auto-detect English text
         >>> text = "Hello beautiful world, this is wonderful"
         >>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(text)
         >>> print(f"Detected language: {lang}")
         Detected language: en_US
         >>> print(f"Found {len(syllables)} syllables")
         Found 8 syllables
         
         >>> # Auto-detect French text
         >>> text = "Bonjour le monde, comment allez-vous aujourd'hui?"
         >>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(text)
         >>> print(f"Detected language: {lang}")
         Detected language: fr
         
         >>> # With custom parameters
         >>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(
         ...     text="Das sind deutsche Wörter",
         ...     min_syllable_length=2,
         ...     max_syllable_length=8,
         ...     default_language="en_US"
         ... )
         >>> print(lang)
         de_DE

      .. note::

         - Requires langdetect: pip install langdetect
         - Detection accuracy depends on text length (20-50+ chars recommended)
         - For production use, consider setting suppress_warnings=True
         - Text shorter than min_detection_length falls back to default_language with a warning
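
      The fallback behaviour described in the note can be sketched as below.
      Everything here is illustrative: ``choose_language`` and the
      ``ISO_TO_PYPHEN`` table are hypothetical, and the detector is injected as
      a callable so the snippet runs without langdetect. In practice
      ``langdetect.detect``, which returns codes such as ``'en'`` or ``'de'``,
      could be passed in.

      .. code-block:: python

         # Hypothetical mapping from detector codes to pyphen language codes,
         # limited to the languages used in the examples above.
         ISO_TO_PYPHEN = {"en": "en_US", "de": "de_DE", "fr": "fr"}


         def choose_language(text, detect, default_language="en_US",
                             min_detection_length=20):
             """Pick a pyphen language code, falling back when detection is unreliable."""
             if len(text) < min_detection_length:
                 return default_language  # too short to attempt detection
             try:
                 iso_code = detect(text)
             except Exception:
                 # Any detector failure falls back (broad catch is a simplification).
                 return default_language
             return ISO_TO_PYPHEN.get(iso_code, default_language)


         print(choose_language("Das sind deutsche Wörter", lambda _text: "de"))
         # de_DE
         print(choose_language("Hallo", lambda _text: "de"))
         # en_US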


   .. py:method:: extract_file_with_auto_language(input_path, min_syllable_length = 1, max_syllable_length = 10, only_hyphenated = True, default_language = 'en_US', min_detection_length = 20, suppress_warnings = False)
      :staticmethod:


      Extract syllables from a file with automatic language detection.

      This convenience method reads a file, detects its language, and extracts
      syllables using the appropriate language-specific hyphenation rules.

      :param input_path: Path to the input text file
      :param min_syllable_length: Minimum syllable length to include (default: 1)
      :param max_syllable_length: Maximum syllable length to include (default: 10)
      :param only_hyphenated: If True, only include syllables from hyphenated words
                              (default: True)
      :param default_language: Language code to use if detection fails (default: "en_US")
      :param min_detection_length: Minimum text length for detection attempt (default: 20)
      :param suppress_warnings: If True, suppress language detection warnings (default: False)

      :returns: Tuple of (syllables, statistics, detected_language_code) where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with processing statistics
                - detected_language_code: The pyphen language code that was used
      :rtype: tuple

      :raises FileNotFoundError: If the input file doesn't exist
      :raises IOError: If there's an error reading the file
      :raises ImportError: If langdetect is not installed (unless suppress_warnings=True)

      .. admonition:: Example

         >>> from pathlib import Path
         >>> syllables, stats, lang = SyllableExtractor.extract_file_with_auto_language(
         ...     Path('document.txt'),
         ...     min_syllable_length=2,
         ...     max_syllable_length=8
         ... )
         >>> print(f"Detected: {lang}, Found: {len(syllables)} syllables")
         Detected: de_DE, Found: 1500 syllables