build_tools.pyphen_syllable_extractor.extractor
===============================================

.. py:module:: build_tools.pyphen_syllable_extractor.extractor

.. autoapi-nested-parse::

   Core syllable extraction functionality.

   This module provides the main SyllableExtractor class for extracting syllables
   from text using pyphen's dictionary-based hyphenation.


Attributes
----------

.. autoapisummary::

   build_tools.pyphen_syllable_extractor.extractor.PYPHEN_AVAILABLE


Classes
-------

.. autoapisummary::

   build_tools.pyphen_syllable_extractor.extractor.SyllableExtractor


Module Contents
---------------

.. py:data:: PYPHEN_AVAILABLE
   :value: True


.. py:class:: SyllableExtractor(language_code, min_syllable_length = 1, max_syllable_length = 10)

   Extracts syllables from text using pyphen hyphenation dictionaries.

   This class provides methods to process text files and extract individual
   syllables based on language-specific hyphenation rules from LibreOffice's
   dictionary collection.

   The extractor works by:

   1. Reading text input (string or file)
   2. Tokenizing into words using regex
   3. Applying language-specific hyphenation rules via pyphen
   4. Splitting hyphenated words into syllables
   5. Filtering syllables by length constraints
   6. Returning unique syllables (case-insensitive)

   Key Features:

   - Support for 40+ languages via pyphen
   - Configurable syllable length constraints
   - Option to include/exclude non-hyphenated words
   - Case-insensitive processing
   - Unicode support for accented characters
   - Deterministic extraction (same input = same output)

   Typical Usage:

   >>> # Basic extraction (returns a tuple of syllables and statistics)
   >>> from pathlib import Path
   >>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
   >>> syllables, stats = extractor.extract_syllables_from_text(
   ...     "Hello wonderful world", only_hyphenated=False
   ... )
   >>> print(sorted(syllables))
   ['der', 'ful', 'hel', 'lo', 'won', 'world']

   >>> # Extract from file and save
   >>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
   >>> extractor.save_syllables(syllables, Path('output.txt'))

   .. attribute:: dictionary

      Pyphen hyphenation dictionary for the selected language

   .. attribute:: language_code

      The pyphen language/locale code (e.g., 'en_US', 'de_DE')

   .. attribute:: min_syllable_length

      Minimum syllable length to include in results

   .. attribute:: max_syllable_length

      Maximum syllable length to include in results

   .. note::

      This is a build-time tool. The pyphen dependency should not be used at
      runtime in the core name generation system.

   Initialize the syllable extractor with a specific language.

   :param language_code: Pyphen language/locale code (e.g., 'en_US', 'de_DE')
   :param min_syllable_length: Minimum syllable length to include (default: 1)
   :param max_syllable_length: Maximum syllable length to include (default: 10)

   :raises ImportError: If pyphen is not installed
   :raises ValueError: If the language code is not supported by pyphen


   .. py:method:: extract_syllables_from_text(text, only_hyphenated = True)

      Extract unique syllables from a block of text.

      This method processes input text by tokenizing it into words, applying
      hyphenation rules via pyphen, and extracting individual syllables that
      meet the configured length constraints.

      :param text: Input text to process. Can contain any characters, but only
                   alphabetic sequences (including accented characters) will be
                   processed as words.
      :param only_hyphenated: If True, only include syllables from words that pyphen
                              actually hyphenated (default: True). This filters out
                              whole words that couldn't be syllabified. Set to False
                              to include all words, even if they can't be split.

      :returns: Tuple of (syllables, statistics), where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with the following keys:

                  - 'total_words': Total number of words found in source text
                  - 'processed_words': Words that were successfully hyphenated/processed
                  - 'skipped_unhyphenated': Words skipped (only when only_hyphenated=True)
                  - 'rejected_syllables': Syllables rejected due to length constraints
      :rtype: tuple

      .. note::

         - Only processes words containing alphabetic characters (a-z, A-Z, À-ÿ)
         - Case-insensitive processing (all output is lowercase)
         - Automatically removes punctuation and special characters
         - Filters syllables by configured min/max length constraints
         - When only_hyphenated=True, excludes words pyphen couldn't split
         - Deterministic: same input always produces same output
         - Words are extracted using regex pattern: ``\b[a-zA-ZÀ-ÿ]+\b``

      .. admonition:: Example

         >>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
         >>> syllables, stats = extractor.extract_syllables_from_text(
         ...     "Hello world!", only_hyphenated=False
         ... )
         >>> print(sorted(syllables))
         ['hel', 'lo', 'world']
         >>> print(stats['total_words'])
         2


   .. py:method:: extract_syllables_from_file(input_path)

      Extract unique syllables from a text file.

      This is a convenience wrapper around extract_syllables_from_text() that
      reads the file as UTF-8 and handles read errors.

      :param input_path: Path to the input text file. File should be UTF-8 encoded
                         plain text. Binary files or non-text formats will cause errors.

      :returns: Tuple of (syllables, statistics), where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with processing statistics (see extract_syllables_from_text)
      :rtype: tuple

      :raises FileNotFoundError: If the input file doesn't exist at the specified path
      :raises IOError: If there's an error reading the file (permissions, encoding, etc.)

      .. admonition:: Example

         >>> from pathlib import Path
         >>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
         >>> syllables, stats = extractor.extract_syllables_from_file(Path('book.txt'))
         >>> print(f"Extracted {len(syllables)} unique syllables from {stats['total_words']} words")
         Extracted 1250 unique syllables from 50000 words


   .. py:method:: save_syllables(syllables, output_path)

      Save syllables to a text file (one syllable per line, sorted).

      Writes syllables in alphabetical order with UTF-8 encoding, one syllable
      per line. This format is ideal for version control and easy importing
      into other tools.

      :param syllables: Set of syllables to save. Each syllable should be a string.
                        The set will be sorted alphabetically before writing.
      :param output_path: Path to the output file. Parent directories must exist.
                          If the file exists, it will be overwritten.

      :raises IOError: If there's an error writing the file (permissions, disk space, etc.)

      .. admonition:: Example

         >>> from pathlib import Path
         >>> extractor = SyllableExtractor('en_US')
         >>> syllables = {'hel', 'lo', 'world'}
         >>> extractor.save_syllables(syllables, Path('output.txt'))
         # Creates file with content:
         # hel
         # lo
         # world

      .. note::

         The output file uses UTF-8 encoding with Unix-style line endings (``\n``).
         Each line contains exactly one syllable with no leading/trailing whitespace.


   .. py:method:: extract_with_auto_language(text, min_syllable_length = 1, max_syllable_length = 10, only_hyphenated = True, default_language = 'en_US', min_detection_length = 20, suppress_warnings = False)
      :staticmethod:

      Extract syllables with automatic language detection.

      This convenience method combines language detection with syllable extraction.
      It automatically detects the language of the input text and creates an
      appropriate SyllableExtractor instance for that language.

      :param text: Input text to process. Should be at least 20-50 characters for
                   reliable language detection.
      :param min_syllable_length: Minimum syllable length to include (default: 1)
      :param max_syllable_length: Maximum syllable length to include (default: 10)
      :param only_hyphenated: If True, only include syllables from hyphenated words (default: True)
      :param default_language: Language code to use if detection fails (default: "en_US")
      :param min_detection_length: Minimum text length for detection attempt (default: 20)
      :param suppress_warnings: If True, suppress language detection warnings (default: False)

      :returns: Tuple of (syllables, statistics, detected_language_code), where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with processing statistics
                - detected_language_code: The pyphen language code that was used
      :rtype: tuple

      :raises ImportError: If langdetect is not installed (unless suppress_warnings=True)

      .. admonition:: Example

         >>> # Auto-detect English text
         >>> text = "Hello beautiful world, this is wonderful"
         >>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(text)
         >>> print(f"Detected language: {lang}")
         Detected language: en_US
         >>> print(f"Found {len(syllables)} syllables")
         Found 8 syllables

         >>> # Auto-detect French text
         >>> text = "Bonjour le monde, comment allez-vous aujourd'hui?"
         >>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(text)
         >>> print(f"Detected language: {lang}")
         Detected language: fr

         >>> # With custom parameters
         >>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(
         ...     text="Das sind deutsche Wörter",
         ...     min_syllable_length=2,
         ...     max_syllable_length=8,
         ...     default_language="en_US"
         ... )
         >>> print(lang)
         de_DE

      .. note::

         - Requires langdetect: pip install langdetect
         - Detection accuracy depends on text length (20-50+ chars recommended)
         - For production use, consider setting suppress_warnings=True
         - Short text will fall back to default_language with a warning


   .. py:method:: extract_file_with_auto_language(input_path, min_syllable_length = 1, max_syllable_length = 10, only_hyphenated = True, default_language = 'en_US', min_detection_length = 20, suppress_warnings = False)
      :staticmethod:

      Extract syllables from a file with automatic language detection.

      This convenience method reads a file, detects its language, and extracts
      syllables using the appropriate language-specific hyphenation rules.

      :param input_path: Path to the input text file
      :param min_syllable_length: Minimum syllable length to include (default: 1)
      :param max_syllable_length: Maximum syllable length to include (default: 10)
      :param only_hyphenated: If True, only include syllables from hyphenated words (default: True)
      :param default_language: Language code to use if detection fails (default: "en_US")
      :param min_detection_length: Minimum text length for detection attempt (default: 20)
      :param suppress_warnings: If True, suppress language detection warnings (default: False)

      :returns: Tuple of (syllables, statistics, detected_language_code), where:

                - syllables: Set of unique lowercase syllable strings
                - statistics: Dict with processing statistics
                - detected_language_code: The pyphen language code that was used
      :rtype: tuple

      :raises FileNotFoundError: If the input file doesn't exist
      :raises IOError: If there's an error reading the file
      :raises ImportError: If langdetect is not installed (unless suppress_warnings=True)

      .. admonition:: Example

         >>> from pathlib import Path
         >>> syllables, stats, lang = SyllableExtractor.extract_file_with_auto_language(
         ...     Path('document.txt'),
         ...     min_syllable_length=2,
         ...     max_syllable_length=8
         ... )
         >>> print(f"Detected: {lang}, Found: {len(syllables)} syllables")
         Detected: de_DE, Found: 1500 syllables
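
As an illustration of the extraction steps documented for ``SyllableExtractor`` (tokenize, hyphenate, split, filter by length, deduplicate), the following is a minimal sketch and not the module's actual implementation. The function ``extract_syllables`` and the ``toy_hyphenate`` stand-in are hypothetical names introduced here so the example runs without pyphen; a callable such as ``pyphen.Pyphen(lang='en_US').inserted`` could be passed in its place. The way the statistics counters are incremented is an assumption based on the key descriptions in this reference.

```python
import re
from typing import Callable, Dict, Set, Tuple

# Word pattern quoted in the extract_syllables_from_text notes.
WORD_PATTERN = re.compile(r"\b[a-zA-ZÀ-ÿ]+\b")


def extract_syllables(
    text: str,
    hyphenate: Callable[[str], str],
    min_len: int = 1,
    max_len: int = 10,
    only_hyphenated: bool = True,
) -> Tuple[Set[str], Dict[str, int]]:
    """Hypothetical sketch of the documented pipeline.

    `hyphenate` must return the word with '-' at each break point,
    which is the shape pyphen's Pyphen.inserted() produces.
    """
    syllables: Set[str] = set()
    stats = {
        "total_words": 0,
        "processed_words": 0,
        "skipped_unhyphenated": 0,
        "rejected_syllables": 0,
    }
    # Lowercase first so the result is case-insensitive.
    for word in WORD_PATTERN.findall(text.lower()):
        stats["total_words"] += 1
        parts = hyphenate(word).split("-")
        if only_hyphenated and len(parts) == 1:
            # Assumption: an unsplit word counts as skipped.
            stats["skipped_unhyphenated"] += 1
            continue
        stats["processed_words"] += 1
        for part in parts:
            if min_len <= len(part) <= max_len:
                syllables.add(part)
            else:
                stats["rejected_syllables"] += 1
    return syllables, stats


# Toy stand-in for a real hyphenation dictionary (illustrative data only).
TOY_BREAKS = {"hello": "hel-lo", "wonderful": "won-der-ful"}


def toy_hyphenate(word: str) -> str:
    return TOY_BREAKS.get(word, word)


syllables, stats = extract_syllables(
    "Hello wonderful world!", toy_hyphenate, min_len=2, max_len=8
)
print(sorted(syllables))
print(stats)
```

With the toy hyphenator, ``world`` comes back unsplit, so it is counted under ``skipped_unhyphenated`` because ``only_hyphenated`` defaults to True in this sketch.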