Pyphen Syllable Extractor

Overview

Syllable Extractor - Dictionary-Based Syllable Extraction

The syllable extractor uses dictionary-based hyphenation to extract syllables from text files. This is a build-time tool only - not used during runtime name generation.

The tool supports two modes:

  • Interactive Mode - Guided prompts for single-file processing

  • Batch Mode - Automated processing of multiple files via command-line arguments

Features:

  • Dictionary-based hyphenation using pyphen (LibreOffice dictionaries)

  • Support for 40+ languages

  • Automatic language detection (optional, via langdetect)

  • Configurable syllable length constraints

  • Deterministic extraction (same input = same output)

  • Unicode support for accented characters

  • Comprehensive metadata and statistics

  • Automatic provenance tracking via corpus_db ledger (batch mode)

Main Components:

  • SyllableExtractor: Core extraction class

  • ExtractionResult: Data model for extraction results

  • FileProcessingResult: Result for single file in batch mode

  • BatchResult: Aggregate results for batch processing

  • SUPPORTED_LANGUAGES: Dictionary of supported language codes

Usage:
>>> from pathlib import Path
>>> from build_tools.pyphen_syllable_extractor import SyllableExtractor
>>>
>>> # Initialize extractor for English (US)
>>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>>
>>> # Extract syllables from text
>>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
>>> print(sorted(syllables))
['der', 'ful', 'hel', 'lo', 'won', 'world']
>>>
>>> # Extract from a file
>>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
>>>
>>> # Save results
>>> extractor.save_syllables(syllables, Path('output.txt'))

CLI Usage:

# Interactive mode
python -m build_tools.pyphen_syllable_extractor

# Single file with specific language
python -m build_tools.pyphen_syllable_extractor --file input.txt --lang en_US

# Batch processing with auto-detection
python -m build_tools.pyphen_syllable_extractor --source ~/docs/ --recursive --auto

Command-Line Interface

Syllable Extractor - Extract syllables from text using pyphen hyphenation

usage: python -m build_tools.pyphen_syllable_extractor [-h]
                                                       [--file FILE | --files FILE [FILE ...]
                                                       | --source SOURCE]
                                                       [--lang LANG | --auto]
                                                       [--pattern PATTERN]
                                                       [--recursive] [--min N]
                                                       [--max N]
                                                       [--output OUTPUT]
                                                       [--quiet] [--verbose]

Named Arguments

--file

Process a single file

--files

Process multiple files

--source

Directory to scan for files

--lang

Language code (e.g., en_US, de_DE, fr). If omitted, uses --auto if langdetect is installed, otherwise en_US.

--auto

Automatically detect language (requires langdetect). This is the default if langdetect is installed and --lang is not specified.

Default: False

--pattern

File pattern for directory scanning (default: *.txt)

Default: '*.txt'

--recursive

Search directories recursively

Default: False

--min

Minimum syllable length (default: 2)

Default: 2

--max

Maximum syllable length (default: 8)

Default: 8

--output

Output directory (default: _working/output)

--quiet

Suppress all output except errors

Default: False

--verbose

Enable verbose output

Default: False

# Interactive mode (no arguments)
python -m build_tools.pyphen_syllable_extractor

# Single file (language auto-detected or defaults to en_US)
python -m build_tools.pyphen_syllable_extractor --file input.txt

# Single file with explicit language
python -m build_tools.pyphen_syllable_extractor --file input.txt --lang en_US

# Multiple files with automatic language detection
python -m build_tools.pyphen_syllable_extractor --files file1.txt file2.txt file3.txt --auto

# Directory scan (language auto-detected or defaults to en_US)
python -m build_tools.pyphen_syllable_extractor --source /data/texts/ --pattern "*.txt"

# Directory scan (recursive)
python -m build_tools.pyphen_syllable_extractor --source /data/ --pattern "*.md" --recursive

# Custom output directory and syllable lengths
python -m build_tools.pyphen_syllable_extractor --source /data/ --output /results/ --min 3 --max 6

Output Format

Output files are organized in a run-based subdirectory structure under _working/output/. Each extraction run creates a timestamped directory containing syllables/ and meta/ subdirectories:

Directory structure:

_working/output/
  └── YYYYMMDD_HHMMSS_pyphen/   # Run directory (one per batch)
      ├── syllables/
      │   ├── file1.txt          # Input filename preserved
      │   ├── file2.txt
      │   └── ...
      └── meta/
          ├── file1.txt          # Matching metadata
          ├── file2.txt
          └── ...

Interactive mode (single file):

_working/output/
  └── 20260110_143022_pyphen/
      ├── syllables/
      │   └── en_US.txt
      └── meta/
          └── en_US.txt

Batch mode (multiple files):

_working/output/
  └── 20260110_143022_pyphen/   # All files share one run directory
      ├── syllables/
      │   ├── alice.txt
      │   ├── middlemarch.txt
      │   └── don_quijote.txt
      └── meta/
          ├── alice.txt
          ├── middlemarch.txt
          └── don_quijote.txt

Benefits of run-based organization:

  • Each extraction run is self-contained in a timestamped directory

  • Easy to archive, move, or delete entire runs as atomic units

  • Input filenames are preserved for easy identification

  • Clean separation between syllables and metadata

  • All outputs from a batch operation are grouped together

Syllables file format:

Each line contains one unique syllable, sorted alphabetically:

der
ful
hel
lo
won
world

Metadata file format:

The metadata file records extraction parameters and statistics:

  • Source files processed

  • Language code used

  • Syllable length constraints (min/max)

  • Unique syllable count

  • Total word count

  • Extraction timestamp

  • Command-line invocation

Integration Guide

The syllable extractor is the first step in the build pipeline:

# Step 1: Extract syllables from corpus (language auto-detected or defaults to en_US)
python -m build_tools.pyphen_syllable_extractor \
  --source data/corpus/ \
  --pattern "*.txt" \
  --output data/raw/

# Step 2: Normalize extracted syllables
python -m build_tools.pyphen_syllable_normaliser \
  --source data/raw/ \
  --output data/normalized/

# Step 3: Annotate with phonetic features
python -m build_tools.syllable_feature_annotator

When to use this tool:

  • To extract syllables from text corpora for the first time

  • When adding new language variants to the corpus

  • To regenerate syllables after changing extraction parameters (min/max length)

  • For exploring syllable patterns in specific text sources

Extraction modes:

  • Interactive mode: No arguments - prompts for file selection

  • Single file: --file input.txt - Process one file

  • Multiple files: --files file1.txt file2.txt - Process specific files

  • Directory scan: --source /data/ --pattern "*.txt" - Scan directory for files

  • Recursive scan: --source /data/ --pattern "*.txt" --recursive - Scan subdirectories

Language selection (optional):

  • Intelligent defaults: If no language is specified, the tool automatically:

    • Uses --auto (automatic detection) if langdetect is installed

    • Falls back to en_US if langdetect is not available

    • Displays which default was chosen at runtime

  • Explicit language: --lang en_US - Specify a specific language code

  • Force auto-detect: --auto - Explicitly request automatic language detection (requires langdetect)
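The default-selection rules above can be sketched as a small decision function. This is only an illustration of the documented behavior, not the tool's actual code: resolve_language and detection_installed are hypothetical names, and the returned mode labels are made up for the example.

```python
from importlib.util import find_spec
from typing import Optional, Tuple


def detection_installed() -> bool:
    """Return True if the optional langdetect package can be imported."""
    return find_spec("langdetect") is not None


def resolve_language(lang: Optional[str],
                     detection_available: bool) -> Tuple[str, Optional[str]]:
    """Hypothetical helper mirroring the documented defaults.

    Returns a (mode, language_code) pair; the code is None when the
    language is to be detected per file.
    """
    if lang is not None:
        return ("explicit", lang)       # --lang always wins
    if detection_available:
        return ("auto", None)           # behave as if --auto was passed
    return ("fallback", "en_US")        # langdetect missing: use en_US


print(resolve_language("de_DE", detection_installed()))  # ('explicit', 'de_DE')
print(resolve_language(None, False))                     # ('fallback', 'en_US')
```

Passing the availability flag as an argument keeps the decision itself easy to test without installing or uninstalling langdetect.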

Examples:

# Simple usage (language auto-selected)
python -m build_tools.pyphen_syllable_extractor --file input.txt

# Explicit language selection
python -m build_tools.pyphen_syllable_extractor --file input.txt --lang de_DE

# Force automatic detection
python -m build_tools.pyphen_syllable_extractor --file input.txt --auto

Notes

Supported Languages:

The extractor supports 40+ languages through pyphen’s LibreOffice dictionaries:

  • English (US: en_US, UK: en_GB)

  • Germanic: German (de_DE), Dutch (nl_NL), Swedish (sv_SE), Danish (da_DK), Norwegian (nb_NO, nn_NO)

  • Romance: French (fr_FR), Spanish (es_ES), Italian (it_IT), Portuguese (pt_PT), Romanian (ro_RO)

  • Slavic: Russian (ru_RU), Polish (pl_PL), Czech (cs_CZ), Slovak (sk_SK), Ukrainian (uk_UA)

  • Other: Greek (el_GR), Turkish (tr_TR), Hungarian (hu_HU), Finnish (fi_FI), Estonian (et_EE)

  • And many more…

To list all available languages:

from build_tools.pyphen_syllable_extractor import SUPPORTED_LANGUAGES
print(f"{len(SUPPORTED_LANGUAGES)} languages available")

Language Auto-Detection:

The tool includes optional automatic language detection, which requires the langdetect package:

  • Use --auto flag to enable automatic language detection

  • Detection is per-file based on text content

  • Falls back to English (en_US) if detection fails or is unavailable

  • Install with: pip install langdetect

To check if auto-detection is available:

from build_tools.pyphen_syllable_extractor import is_detection_available
if is_detection_available():
    print("Language auto-detection is available")

Syllable Length Constraints:

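  • CLI defaults: min=2, max=8 characters (the SyllableExtractor class itself defaults to min_syllable_length=1, max_syllable_length=10)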

  • Adjust with --min and --max flags

  • Shorter syllables (min=1) include single vowels

  • Longer syllables (max=10+) may include compound patterns

Output Organization:

  • Each extraction run creates a timestamped directory with pyphen identifier (YYYYMMDD_HHMMSS_pyphen/)

  • Run directory contains syllables/ and meta/ subdirectories

  • Batch processing groups all files into a single run directory

  • Input filenames are preserved in output (e.g., alice.txt)

  • Interactive mode uses language code as filename (e.g., en_US.txt)

  • Metadata files provide full provenance tracking

  • All extractions are logged to corpus database (if available)

Build-time tool:

This is a build-time tool only - not used during runtime name generation.

API Reference

class build_tools.pyphen_syllable_extractor.BatchResult(total_files, successful, failed, results, total_time, output_directory)

Bases: object

Aggregate results from a batch processing operation.

This dataclass stores summary statistics and individual file results from processing multiple files in batch mode.

total_files

Total number of files attempted in the batch

successful

Number of files processed successfully

failed

Number of files that failed to process

results

List of individual FileProcessingResult objects

total_time

Total time taken for entire batch operation in seconds

output_directory

Directory where all outputs were saved
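As a rough illustration of how the aggregate counters relate to the per-file results, the sketch below derives them from a list of outcomes. FileOutcome is a hypothetical stand-in that carries only three of the fields documented for FileProcessingResult, and summarize is not part of the module.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class FileOutcome:
    """Stand-in for FileProcessingResult (only the fields used here)."""
    input_path: Path
    success: bool
    error_message: Optional[str] = None


def summarize(results: List[FileOutcome]) -> dict:
    """Derive batch-level counters and a failure list from per-file outcomes."""
    successful = sum(1 for r in results if r.success)
    return {
        "total_files": len(results),
        "successful": successful,
        "failed": len(results) - successful,
        "failures": [(r.input_path.name, r.error_message)
                     for r in results if not r.success],
    }


outcomes = [
    FileOutcome(Path("alice.txt"), True),
    FileOutcome(Path("broken.bin"), False, "not valid UTF-8"),
]
summary = summarize(outcomes)
print(summary["successful"], summary["failed"])  # 1 1
```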

Example

>>> result = BatchResult(
...     total_files=5,
...     successful=4,
...     failed=1,
...     results=[...],
...     total_time=12.34,
...     output_directory=Path("_working/output")
... )
>>> print(f"Success rate: {result.successful/result.total_files*100:.1f}%")
Success rate: 80.0%
failed: int
format_summary()

Format batch processing summary as a human-readable string.

Creates a detailed summary report showing overall statistics, successful extractions with details, and failed files with error messages.

Return type:

str

Returns:

Multi-line formatted string with batch statistics and results

Example

>>> summary = batch_result.format_summary()
>>> print(summary)
======================================================================
BATCH PROCESSING SUMMARY
======================================================================
Total Files:        5
Successful:         4 (80.0%)
...
output_directory: Path
results: list[FileProcessingResult]
successful: int
total_files: int
total_time: float
class build_tools.pyphen_syllable_extractor.ExtractionResult(syllables, language_code, min_syllable_length, max_syllable_length, input_path, timestamp=<factory>, only_hyphenated=True, length_distribution=<factory>, sample_syllables=<factory>, total_words=0, skipped_unhyphenated=0, rejected_syllables=0, processed_words=0)

Bases: object

Container for syllable extraction results and associated metadata.

This dataclass stores both the extracted syllables and all relevant metadata about the extraction process for reporting and persistence.

syllables

Set of unique syllables extracted from the input text

language_code

Pyphen language/locale code used for hyphenation

min_syllable_length

Minimum syllable length constraint

max_syllable_length

Maximum syllable length constraint

input_path

Path to the input text file

timestamp

When the extraction was performed

only_hyphenated

Whether whole words were excluded

length_distribution

Map of syllable length to count

sample_syllables

Representative sample of extracted syllables

total_words

Total words found in source text

skipped_unhyphenated

Words skipped because they couldn’t be hyphenated

rejected_syllables

Syllables rejected due to length constraints

processed_words

Words that were successfully processed

__post_init__()

Calculate derived fields after initialization.

format_metadata()

Format extraction metadata as a human-readable string.

Return type:

str

Returns:

Multi-line string containing all extraction metadata formatted for display or file output.

input_path: Path
language_code: str
length_distribution: dict[int, int]
max_syllable_length: int
min_syllable_length: int
only_hyphenated: bool = True
processed_words: int = 0
rejected_syllables: int = 0
sample_syllables: list[str]
skipped_unhyphenated: int = 0
syllables: set[str]
timestamp: datetime
total_words: int = 0
class build_tools.pyphen_syllable_extractor.FileProcessingResult(input_path, success, syllables_count, language_code, syllables_output_path=None, metadata_output_path=None, error_message=None, processing_time=0.0)

Bases: object

Result of processing a single file in batch mode.

This dataclass stores the outcome of processing one file during batch operations, including success status, extracted syllables count, and any error information if processing failed.

input_path

Path to the input file that was processed

success

Whether processing completed successfully

syllables_count

Number of unique syllables extracted (0 if failed)

language_code

Detected or specified language code used

syllables_output_path

Path where syllables were saved (None if failed)

metadata_output_path

Path where metadata was saved (None if failed)

error_message

Error message if processing failed (None if success)

processing_time

Time taken to process this file in seconds

Example

>>> result = FileProcessingResult(
...     input_path=Path("book.txt"),
...     success=True,
...     syllables_count=245,
...     language_code="en_US",
...     syllables_output_path=Path("output.syllables.en_US.txt"),
...     metadata_output_path=Path("output.meta.en_US.txt"),
...     processing_time=2.45
... )
>>> print(f"Processed {result.syllables_count} syllables")
Processed 245 syllables
error_message: str | None = None
input_path: Path
language_code: str
metadata_output_path: Path | None = None
processing_time: float = 0.0
success: bool
syllables_count: int
syllables_output_path: Path | None = None
class build_tools.pyphen_syllable_extractor.SyllableExtractor(language_code, min_syllable_length=1, max_syllable_length=10)

Bases: object

Extracts syllables from text using pyphen hyphenation dictionaries.

This class provides methods to process text files and extract individual syllables based on language-specific hyphenation rules from LibreOffice’s dictionary collection.

The extractor works by:

  1. Reading text input (string or file)

  2. Tokenizing into words using regex

  3. Applying language-specific hyphenation rules via pyphen

  4. Splitting hyphenated words into syllables

  5. Filtering syllables by length constraints

  6. Returning unique syllables (case-insensitive)
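The extraction steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the class's real implementation: extract_sketch, toy_hyphenate, and TOY_HYPHENATION are hypothetical names, and a tiny lookup table stands in for pyphen so the snippet runs without it. With pyphen installed, something like pyphen.Pyphen(lang="en_US").inserted could presumably be passed as the hyphenate argument instead.

```python
import re
from typing import Callable, Dict, Set, Tuple

# Word pattern quoted in the extract_syllables_from_text() notes.
WORD_RE = re.compile(r"\b[a-zA-ZÀ-ÿ]+\b")


def extract_sketch(text: str,
                   hyphenate: Callable[[str], str],
                   min_len: int = 2,
                   max_len: int = 8,
                   only_hyphenated: bool = True) -> Tuple[Set[str], Dict[str, int]]:
    """Tokenize, hyphenate, split on '-', filter by length, deduplicate."""
    syllables: Set[str] = set()
    stats = {"total_words": 0, "processed_words": 0,
             "skipped_unhyphenated": 0, "rejected_syllables": 0}
    for word in WORD_RE.findall(text):
        stats["total_words"] += 1
        parts = hyphenate(word.lower()).split("-")
        if only_hyphenated and len(parts) == 1:
            stats["skipped_unhyphenated"] += 1   # word could not be split
            continue
        stats["processed_words"] += 1
        for part in parts:
            if min_len <= len(part) <= max_len:
                syllables.add(part)
            else:
                stats["rejected_syllables"] += 1
    return syllables, stats


# Toy hyphenation table (assumption): stands in for a pyphen dictionary.
TOY_HYPHENATION = {"hello": "hel-lo", "wonderful": "won-der-ful"}


def toy_hyphenate(word: str) -> str:
    return TOY_HYPHENATION.get(word, word)


syllables, stats = extract_sketch("Hello wonderful world", toy_hyphenate)
print(sorted(syllables))  # ['der', 'ful', 'hel', 'lo', 'won']
print(stats["skipped_unhyphenated"])  # 1 ("world" has no split in the toy table)
```

In this sketch a word the hyphenator leaves whole contributes nothing when only_hyphenated is True; passing only_hyphenated=False would add it as a single syllable, subject to the length limits.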

Key Features:
  • Support for 40+ languages via pyphen

  • Configurable syllable length constraints

  • Option to include/exclude non-hyphenated words

  • Case-insensitive processing

  • Unicode support for accented characters

  • Deterministic extraction (same input = same output)

Typical Usage:
>>> # Basic extraction
>>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello wonderful world")
>>> print(sorted(syllables))
['der', 'ful', 'hel', 'lo', 'won', 'world']
>>> # Extract from file and save
>>> syllables, stats = extractor.extract_syllables_from_file(Path('input.txt'))
>>> extractor.save_syllables(syllables, Path('output.txt'))
dictionary

Pyphen hyphenation dictionary for the selected language

language_code

The pyphen language/locale code (e.g., ‘en_US’, ‘de_DE’)

min_syllable_length

Minimum syllable length to include in results

max_syllable_length

Maximum syllable length to include in results

Note

This is a build-time tool. The pyphen dependency should not be used at runtime in the core name generation system.

__init__(language_code, min_syllable_length=1, max_syllable_length=10)

Initialize the syllable extractor with a specific language.

Parameters:
  • language_code (str) – Pyphen language/locale code (e.g., ‘en_US’, ‘de_DE’)

  • min_syllable_length (int) – Minimum syllable length to include (default: 1)

  • max_syllable_length (int) – Maximum syllable length to include (default: 10)

Raises:
  • ImportError – If pyphen is not installed

  • ValueError – If the language code is not supported by pyphen

static extract_file_with_auto_language(input_path, min_syllable_length=1, max_syllable_length=10, only_hyphenated=True, default_language='en_US', min_detection_length=20, suppress_warnings=False)

Extract syllables from a file with automatic language detection.

This convenience method reads a file, detects its language, and extracts syllables using the appropriate language-specific hyphenation rules.

Parameters:
  • input_path (Path) – Path to the input text file

  • min_syllable_length (int) – Minimum syllable length to include (default: 1)

  • max_syllable_length (int) – Maximum syllable length to include (default: 10)

  • only_hyphenated (bool) – If True, only include syllables from hyphenated words (default: True)

  • default_language (str) – Language code to use if detection fails (default: “en_US”)

  • min_detection_length (int) – Minimum text length for detection attempt (default: 20)

  • suppress_warnings (bool) – If True, suppress language detection warnings (default: False)

Return type:

Tuple of (syllables, statistics, detected_language_code)

Returns:

  • syllables: Set of unique lowercase syllable strings

  • statistics: Dict with processing statistics

  • detected_language_code: The pyphen language code that was used

Raises:
  • FileNotFoundError – If the input file doesn’t exist

  • IOError – If there’s an error reading the file

  • ImportError – If langdetect is not installed (unless suppress_warnings=True)

Example

>>> from pathlib import Path
>>> syllables, stats, lang = SyllableExtractor.extract_file_with_auto_language(
...     Path('document.txt'),
...     min_syllable_length=2,
...     max_syllable_length=8
... )
>>> print(f"Detected: {lang}, Found: {len(syllables)} syllables")
Detected: de_DE, Found: 1500 syllables
extract_syllables_from_file(input_path)

Extract unique syllables from a text file.

This is a convenience wrapper around extract_syllables_from_text() that handles file reading with proper encoding (UTF-8) and error handling.

Parameters:

input_path (Path) – Path to the input text file. File should be UTF-8 encoded plain text. Binary files or non-text formats will cause errors.

Return type:

Tuple of (syllables, statistics)

Returns:

  • syllables: Set of unique lowercase syllable strings

  • statistics: Dict with processing statistics (see extract_syllables_from_text)

Raises:
  • FileNotFoundError – If the input file doesn’t exist at the specified path

  • IOError – If there’s an error reading the file (permissions, encoding, etc.)

Example

>>> from pathlib import Path
>>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_file(Path('book.txt'))
>>> print(f"Extracted {len(syllables)} unique syllables from {stats['total_words']} words")
Extracted 1250 unique syllables from 50000 words
extract_syllables_from_text(text, only_hyphenated=True)

Extract unique syllables from a block of text.

This method processes input text by tokenizing it into words, applying hyphenation rules via pyphen, and extracting individual syllables that meet the configured length constraints.

Parameters:
  • text (str) – Input text to process. Can contain any characters, but only alphabetic sequences (including accented characters) will be processed as words.

  • only_hyphenated (bool) – If True, only include syllables from words that pyphen actually hyphenated (default: True). This filters out whole words that couldn’t be syllabified. Set to False to include all words, even if they can’t be split.

Return type:

Tuple of (syllables, statistics)

Returns:

  • syllables: Set of unique lowercase syllable strings

  • statistics: Dict with the following keys:
    • 'total_words': Total number of words found in source text

    • 'processed_words': Words that were successfully hyphenated/processed

    • 'skipped_unhyphenated': Words skipped (only when only_hyphenated=True)

    • 'rejected_syllables': Syllables rejected due to length constraints

Note

  • Only processes words containing alphabetic characters (a-z, A-Z, À-ÿ)

  • Case-insensitive processing (all output is lowercase)

  • Automatically removes punctuation and special characters

  • Filters syllables by configured min/max length constraints

  • When only_hyphenated=True, excludes words pyphen couldn’t split

  • Deterministic: same input always produces same output

  • Words are extracted using regex pattern: \b[a-zA-ZÀ-ÿ]+\b

Example

>>> extractor = SyllableExtractor('en_US', min_syllable_length=2, max_syllable_length=8)
>>> syllables, stats = extractor.extract_syllables_from_text("Hello world!")
>>> print(sorted(syllables))
['hel', 'lo', 'world']
>>> print(stats['total_words'])
2
static extract_with_auto_language(text, min_syllable_length=1, max_syllable_length=10, only_hyphenated=True, default_language='en_US', min_detection_length=20, suppress_warnings=False)

Extract syllables with automatic language detection.

This convenience method combines language detection with syllable extraction. It automatically detects the language of the input text and creates an appropriate SyllableExtractor instance for that language.

Parameters:
  • text (str) – Input text to process. Should be at least 20-50 characters for reliable language detection.

  • min_syllable_length (int) – Minimum syllable length to include (default: 1)

  • max_syllable_length (int) – Maximum syllable length to include (default: 10)

  • only_hyphenated (bool) – If True, only include syllables from hyphenated words (default: True)

  • default_language (str) – Language code to use if detection fails (default: “en_US”)

  • min_detection_length (int) – Minimum text length for detection attempt (default: 20)

  • suppress_warnings (bool) – If True, suppress language detection warnings (default: False)

Return type:

Tuple of (syllables, statistics, detected_language_code)

Returns:

  • syllables: Set of unique lowercase syllable strings

  • statistics: Dict with processing statistics

  • detected_language_code: The pyphen language code that was used

Raises:

ImportError – If langdetect is not installed (unless suppress_warnings=True)

Example

>>> # Auto-detect English text
>>> text = "Hello beautiful world, this is wonderful"
>>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(text)
>>> print(f"Detected language: {lang}")
Detected language: en_US
>>> print(f"Found {len(syllables)} syllables")
Found 8 syllables
>>> # Auto-detect French text
>>> text = "Bonjour le monde, comment allez-vous aujourd'hui?"
>>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(text)
>>> print(f"Detected language: {lang}")
Detected language: fr
>>> # With custom parameters
>>> syllables, stats, lang = SyllableExtractor.extract_with_auto_language(
...     text="Das sind deutsche Wörter",
...     min_syllable_length=2,
...     max_syllable_length=8,
...     default_language="en_US"
... )
>>> print(lang)
de_DE

Note

  • Requires langdetect: pip install langdetect

  • Detection accuracy depends on text length (20-50+ chars recommended)

  • For production use, consider setting suppress_warnings=True

  • Short text will fall back to default_language with a warning

save_syllables(syllables, output_path)

Save syllables to a text file (one syllable per line, sorted).

Writes syllables in alphabetical order with UTF-8 encoding, one syllable per line. This format is ideal for version control and easy importing into other tools.

Parameters:
  • syllables (set[str]) – Set of syllables to save. Each syllable should be a string. The set will be sorted alphabetically before writing.

  • output_path (Path) – Path to the output file. Parent directories must exist. If the file exists, it will be overwritten.

Raises:

IOError – If there’s an error writing the file (permissions, disk space, etc.)

Return type:

None

Example

>>> from pathlib import Path
>>> extractor = SyllableExtractor('en_US')
>>> syllables = {'hel', 'lo', 'world'}
>>> extractor.save_syllables(syllables, Path('output.txt'))
# Creates file with content:
# hel
# lo
# world

Note

The output file uses UTF-8 encoding with Unix-style line endings (\n). Each line contains exactly one syllable with no leading/trailing whitespace.
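A minimal sketch of a writer that produces this file format (sorted, one syllable per line, UTF-8, Unix newlines) is shown below. The function name write_syllables is hypothetical, and the trailing newline after the last syllable is an assumption of this sketch rather than documented behavior.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_syllables(syllables: Iterable[str], output_path: Path) -> None:
    """Write unique syllables in alphabetical order, one per line."""
    ordered = sorted(set(syllables))
    # newline="\n" disables platform newline translation, so Windows also gets "\n".
    with output_path.open("w", encoding="utf-8", newline="\n") as handle:
        for syllable in ordered:
            handle.write(syllable + "\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output.txt"
    write_syllables({"world", "hel", "lo"}, target)
    written = target.read_bytes()

print(written)  # b'hel\nlo\nworld\n'
```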

build_tools.pyphen_syllable_extractor.detect_language_code(text, default='en_US', min_confidence_length=20, suppress_warnings=False)

Auto-detect language from text and return pyphen-compatible language code.

This function analyzes the input text using langdetect and maps the detected ISO 639-1 language code to a pyphen-compatible locale code (e.g., “en” -> “en_US”).

The function requires at least min_confidence_length characters for reliable detection. Shorter text will return the default language with a warning.

Parameters:
  • text (str) – Input text to analyze. Should be at least 20-50 characters for reliable detection. Mixed-language text may produce unpredictable results.

  • default (str) – Default language code to return if detection fails or langdetect is not installed (default: “en_US”).

  • min_confidence_length (int) – Minimum text length (in characters) required for detection attempt (default: 20). Text shorter than this returns the default language.

  • suppress_warnings (bool) – If True, suppress warning messages when detection fails or langdetect is unavailable (default: False).

Return type:

str

Returns:

A pyphen-compatible language code (e.g., “en_US”, “de_DE”, “fr”). Returns default if detection fails, text is too short, or langdetect is not available.

Raises:

ImportError – If langdetect is not installed (only when suppress_warnings=False)

Example

>>> # Detect English text
>>> text = "Hello world, this is a test of language detection"
>>> detect_language_code(text)
'en_US'
>>> # Detect French text
>>> text = "Bonjour le monde, comment allez-vous aujourd'hui?"
>>> detect_language_code(text)
'fr'
>>> # Short text falls back to default
>>> detect_language_code("Hello")
'en_US'
>>> # Custom default for unknown language
>>> detect_language_code("???", default='de_DE')
'de_DE'
>>> # Suppress warnings for production use
>>> code = detect_language_code("abc", default='en_US', suppress_warnings=True)

Note

  • Detection accuracy decreases significantly with text shorter than 50 chars

  • Mixed-language text detection is unreliable

  • Some languages may map to different locales than expected (e.g., “pt” -> “pt_PT”)

  • Use get_alternative_locales() to see all available variants for a language

  • Requires langdetect: pip install langdetect
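The detect-then-map behavior described for detect_language_code() might look roughly like the sketch below. It is an illustration only: to_pyphen_code and LOCALE_MAP are hypothetical, the mapping is a small subset built from the codes mentioned in this document, and the detector is injected as a callable so the snippet runs without langdetect. With langdetect installed, its detect function could presumably be passed in.

```python
from typing import Callable, Dict

# Illustrative subset of an ISO 639-1 -> pyphen locale mapping (assumption).
LOCALE_MAP: Dict[str, str] = {
    "en": "en_US",
    "de": "de_DE",
    "pt": "pt_PT",
    "fr": "fr",
}


def to_pyphen_code(text: str,
                   detect: Callable[[str], str],
                   default: str = "en_US",
                   min_length: int = 20) -> str:
    """Return a pyphen-style code, falling back to `default` when unsure."""
    if len(text) < min_length:
        return default              # too short for a reliable guess
    try:
        iso_code = detect(text)     # e.g. "de"
    except Exception:
        return default              # detector failed on this input
    return LOCALE_MAP.get(iso_code, default)


def fake_detect(text: str) -> str:
    """Stand-in detector used only for this example."""
    return "de"


print(to_pyphen_code("Hello", fake_detect))                     # en_US
print(to_pyphen_code("Das sind deutsche Wörter", fake_detect))  # de_DE
```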

build_tools.pyphen_syllable_extractor.discover_files(source, pattern='*.txt', recursive=False)

Discover text files in a directory matching the specified pattern.

This function searches for files matching a glob pattern in the specified directory, optionally recursing into subdirectories. Results are sorted alphabetically for deterministic processing order.

Parameters:
  • source (Path) – Directory to search for files. Must be an existing directory.

  • pattern (str) – Glob pattern for file matching (default: “*.txt”). Examples: “*.txt”, “*.md”, “data_*.csv”

  • recursive (bool) – If True, search recursively into subdirectories using rglob. If False, search only the top level (default: False).

Return type:

list[Path]

Returns:

List of Path objects for matching files, sorted alphabetically. Returns empty list if no files match.

Raises:

ValueError – If source is not a directory or doesn’t exist.
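For orientation, the documented behavior (glob or rglob, sorted results, ValueError for a non-directory) can be approximated with pathlib alone. The helper name find_files is hypothetical, and skipping non-file matches is an assumption of this sketch.

```python
import tempfile
from pathlib import Path
from typing import List


def find_files(source: Path, pattern: str = "*.txt",
               recursive: bool = False) -> List[Path]:
    """Return matching files under `source`, sorted for a stable order."""
    if not source.is_dir():
        raise ValueError(f"Not a directory: {source}")
    matches = source.rglob(pattern) if recursive else source.glob(pattern)
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("b.txt", "a.txt", "sub/c.txt", "notes.md"):
        (root / name).write_text("x", encoding="utf-8")
    top = [p.name for p in find_files(root)]
    deep = [p.relative_to(root).as_posix()
            for p in find_files(root, recursive=True)]

print(top)   # ['a.txt', 'b.txt']
print(deep)  # ['a.txt', 'b.txt', 'sub/c.txt']
```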

Example

>>> # Find all .txt files in a directory
>>> files = discover_files(Path("/data/texts"))
>>> print(f"Found {len(files)} files")
>>> # Find all .md files recursively
>>> files = discover_files(Path("/data"), pattern="*.md", recursive=True)
>>> # Find files with custom pattern
>>> files = discover_files(Path("/data"), pattern="book_*.txt")

build_tools.pyphen_syllable_extractor.generate_output_filename(output_dir=None, language_code=None, run_timestamp=None, input_filename=None)

Generate output filenames in run-based subdirectory structure.

Creates a run directory with timestamp and ‘pyphen’ identifier, then organizes outputs into syllables/ and meta/ subdirectories:

  • output_dir/YYYYMMDD_HHMMSS_pyphen/syllables/filename.txt

  • output_dir/YYYYMMDD_HHMMSS_pyphen/meta/filename.txt

This structure groups each extraction run’s outputs together, making it easier to manage, archive, or delete complete runs as atomic units.

Parameters:
  • output_dir (Path | None) – Base output directory. Defaults to _working/output/

  • language_code (str | None) – Optional pyphen language code (e.g., ‘en_US’, ‘de_DE’). Used for filename if input_filename not provided.

  • run_timestamp (str | None) – Optional timestamp string (YYYYMMDD_HHMMSS format). If provided, uses this timestamp for the run directory name. If not provided, generates a new timestamp using datetime.now(). Critical for batch processing - pass the same timestamp to group all files from a batch into one run directory.

  • input_filename (str | None) – Optional input filename to use for output naming. If provided, output files will use this name (e.g., ‘alice.txt’). Takes precedence over language_code for naming.

Return type:

tuple[Path, Path]

Returns:

Tuple of (syllables_path, metadata_path)

Example

>>> # Interactive mode - single file with language code
>>> syllables_path, meta_path = generate_output_filename(language_code='en_US')
>>> print(syllables_path)
_working/output/20260110_153022_pyphen/syllables/en_US.txt
>>> # Batch mode - multiple files sharing one run directory
>>> from datetime import datetime
>>> timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
>>> s1, m1 = generate_output_filename(
...     run_timestamp=timestamp,
...     input_filename='alice.txt'
... )
>>> s2, m2 = generate_output_filename(
...     run_timestamp=timestamp,
...     input_filename='middlemarch.txt'
... )
>>> print(s1)
_working/output/20260110_153022_pyphen/syllables/alice.txt
>>> print(s2)
_working/output/20260110_153022_pyphen/syllables/middlemarch.txt
>>> # Both files share the same run directory

Note

For batch processing, always pass the same run_timestamp to group all outputs into a single run directory. This represents one logical batch operation, regardless of how many input files are processed.
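
The run-directory layout can be illustrated with a small path builder. This is a sketch only: build_run_paths is a hypothetical name, it computes paths without creating any directories, and the "output.txt" fallback used when neither a filename nor a language code is given is an assumption, since the documentation does not cover that case.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_run_paths(
    output_dir: Path,
    language_code: Optional[str] = None,
    run_timestamp: Optional[str] = None,
    input_filename: Optional[str] = None,
) -> tuple[Path, Path]:
    """Return (syllables_path, metadata_path) inside one run directory."""
    # Reuse the caller's timestamp so a whole batch shares one run directory.
    timestamp = run_timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = output_dir / f"{timestamp}_pyphen"
    if input_filename is not None:
        name = input_filename  # takes precedence over language_code
    elif language_code is not None:
        name = f"{language_code}.txt"
    else:
        name = "output.txt"  # assumption: fallback not described in the docs
    return run_dir / "syllables" / name, run_dir / "meta" / name


syllables_path, meta_path = build_run_paths(
    Path("_working/output"),
    run_timestamp="20260110_153022",
    input_filename="alice.txt",
)
```

Because the timestamp is the only varying part of the run directory name, two calls with the same run_timestamp always land in the same directory.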

build_tools.pyphen_syllable_extractor.get_alternative_locales(iso_code)

Get alternative pyphen locale codes for a given ISO language code.

Some languages have multiple regional variants (e.g., English has en_US and en_GB). This function returns all available pyphen locales for a language.

Parameters:

iso_code (str) – ISO 639-1 language code (e.g., “en”, “de”, “pt”)

Return type:

list[str] | None

Returns:

List of pyphen locale codes for the language, or None if the language is unknown or has only one locale (no alternatives).

Example

>>> get_alternative_locales("en")
['en_US', 'en_GB']
>>> get_alternative_locales("de")
['de_DE', 'de_AT', 'de_CH']
>>> get_alternative_locales("pt")
['pt_PT', 'pt_BR']
>>> get_alternative_locales("fr")  # Only one variant
None
>>> get_alternative_locales("xx")  # Unknown language
None

build_tools.pyphen_syllable_extractor.get_default_locale(iso_code)

Get the default pyphen locale for an ISO language code.

This is the locale that will be used by detect_language_code() when the specified language is detected.

Parameters:

iso_code (str) – ISO 639-1 language code (e.g., “en”, “de”, “pt”)

Return type:

str | None

Returns:

Default pyphen locale code (e.g., “en_US”), or None if language is not supported.

Example

>>> get_default_locale("en")
'en_US'
>>> get_default_locale("pt")
'pt_PT'
>>> get_default_locale("de")
'de_DE'
>>> get_default_locale("xx")  # Unknown language
None
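
The relationship between get_default_locale and get_alternative_locales can be pictured as two views of one table. The sketch below is an assumption about structure, not the module's code: LOCALES_BY_ISO and both function names are hypothetical, and the table holds only the locales shown in the examples above, with the first entry treated as the default.

```python
from typing import Optional

# Hypothetical table: ISO 639-1 code -> pyphen locales, default first.
LOCALES_BY_ISO: dict[str, list[str]] = {
    "en": ["en_US", "en_GB"],
    "de": ["de_DE", "de_AT", "de_CH"],
    "pt": ["pt_PT", "pt_BR"],
    "fr": ["fr"],
}


def default_locale(iso_code: str) -> Optional[str]:
    """Return the first (default) locale, or None for an unknown language."""
    locales = LOCALES_BY_ISO.get(iso_code)
    return locales[0] if locales else None


def alternative_locales(iso_code: str) -> Optional[list[str]]:
    """Return all locales only when the language has more than one."""
    locales = LOCALES_BY_ISO.get(iso_code)
    if not locales or len(locales) < 2:
        return None
    return list(locales)  # copy, so callers cannot mutate the table
```

Keeping the default as the first list entry means both lookups stay consistent by construction.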

build_tools.pyphen_syllable_extractor.get_language_code(language_name)

Get pyphen language code from language name.

Parameters:

language_name (str) – Full language name (e.g., “English (US)”)

Return type:

str | None

Returns:

Language code (e.g., “en_US”) or None if not found

Example

>>> get_language_code("English (US)")
'en_US'

build_tools.pyphen_syllable_extractor.get_language_name(code)

Get language name from pyphen code.

Parameters:

code (str) – Pyphen language code (e.g., “en_US”)

Return type:

str | None

Returns:

Language name (e.g., “English (US)”) or None if not found

Example

>>> get_language_name("en_US")
'English (US)'

build_tools.pyphen_syllable_extractor.is_detection_available()

Check if language detection is available.

Return type:

bool

Returns:

True if langdetect is installed and functional, False otherwise.

Example

>>> if is_detection_available():
...     print("Language detection is available")
... else:
...     print("Install langdetect: pip install langdetect")
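
An availability check of this kind is commonly built on importlib. The sketch below is an assumption about the approach, not the module's code: langdetect_installed is a hypothetical name, and it only tests that the package can be found, whereas the documented function also reports whether langdetect is functional.

```python
import importlib.util


def langdetect_installed() -> bool:
    """Return True if the langdetect package can be located, without importing it."""
    return importlib.util.find_spec("langdetect") is not None


# Choose a CLI language option based on availability.
detection_flag = "--auto" if langdetect_installed() else "--lang en_US"
```

Checking with find_spec avoids paying the import cost when the caller only needs a yes/no answer.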

build_tools.pyphen_syllable_extractor.list_supported_languages()

Get a dictionary of all ISO codes and their default pyphen locales.

Return type:

dict[str, str]

Returns:

Dictionary mapping ISO 639-1 codes to pyphen locale codes.

Example

>>> langs = list_supported_languages()
>>> print(f"English: {langs['en']}")
English: en_US
>>> print(f"German: {langs['de']}")
German: de_DE
>>> total = len(langs)  # number of supported ISO 639-1 codes

build_tools.pyphen_syllable_extractor.main(args=None)

Main entry point for the pyphen syllable extractor CLI.

This function determines whether to run in interactive or batch mode based on the presence of command-line arguments.

Parameters:

args (list[str] | None) – Command-line arguments. If None, uses sys.argv.

Return type:

int

Returns:

Exit code (0 for success, non-zero for error).

Modes:
  • Interactive Mode: No arguments provided. Prompts user for all settings.

  • Batch Mode: Arguments provided. Processes files based on CLI flags.

Examples

Interactive mode (no arguments):

$ python -m build_tools.pyphen_syllable_extractor

Batch mode (with arguments):

$ python -m build_tools.pyphen_syllable_extractor --file input.txt --lang en_US
$ python -m build_tools.pyphen_syllable_extractor --files *.txt --auto
$ python -m build_tools.pyphen_syllable_extractor --source ~/docs/ --recursive --auto
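
The mode selection described above reduces to checking whether any arguments were supplied. A minimal sketch, assuming that rule is the whole decision; choose_mode is a hypothetical helper, not part of the module:

```python
import sys
from typing import Optional


def choose_mode(args: Optional[list[str]] = None) -> str:
    """Return "interactive" when no CLI arguments are given, else "batch"."""
    # Mirror main(): fall back to the process arguments when args is None.
    argv = sys.argv[1:] if args is None else args
    return "batch" if argv else "interactive"
```

Accepting an explicit args list, as main() does, makes the dispatch testable without touching sys.argv.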

build_tools.pyphen_syllable_extractor.main_batch(args)

Batch mode entry point for the pyphen syllable extractor CLI.

This function processes multiple files based on command-line arguments, providing progress indicators and comprehensive error reporting.

Parameters:

args (Namespace) – Parsed command-line arguments (argparse.Namespace) containing:

  • file: Single file path (optional)

  • files: List of file paths (optional)

  • source: Directory path for scanning (optional)

  • pattern: File pattern for directory scanning (default: “*.txt”)

  • recursive: Whether to scan directories recursively

  • lang: Manual language code (mutually exclusive with auto)

  • auto: Use automatic language detection (mutually exclusive with lang)

  • min: Minimum syllable length (default: 2)

  • max: Maximum syllable length (default: 8)

  • output: Output directory (default: _working/output/)

  • quiet: Suppress progress indicators

  • verbose: Show detailed processing information

Return type:

None

Exit Codes:

  • 0: All files processed successfully

  • 1: One or more files failed to process

Raises:

SystemExit – On validation errors or processing completion

build_tools.pyphen_syllable_extractor.main_interactive()

Interactive mode entry point for the pyphen syllable extractor CLI.

Return type:

None

Workflow:
  1. Prompt user to select a language (or ‘auto’ for automatic detection)

  2. Configure extraction parameters (min/max syllable length)

  3. Prompt for input file path

  4. Extract syllables from input file (with optional auto-detection)

  5. Generate timestamped output filenames

  6. Save syllables and metadata to separate files

  7. Display summary to console

Language Detection:
  • If ‘auto’ is selected and langdetect is installed, the tool will automatically detect the language of the input text

  • Detection requires at least 20-50 characters for reliable results

  • Falls back to English (en_US) if detection fails

Output Files:
  • YYYYMMDD_HHMMSS.syllables.LANG.txt: One syllable per line, sorted

  • YYYYMMDD_HHMMSS.meta.LANG.txt: Extraction metadata and statistics

Both files are saved to _working/output/ by default.

build_tools.pyphen_syllable_extractor.process_batch(files, language_code, min_len, max_len, output_dir, quiet=False, verbose=False)

Process multiple files sequentially in batch mode.

This is a backwards-compatible wrapper around run_batch_extraction.

Parameters:
  • files (list[Path]) – List of input file paths to process

  • language_code (str) – Language code (e.g., “en_US”) or “auto” for detection

  • min_len (int) – Minimum syllable length to include

  • max_len (int) – Maximum syllable length to include

  • output_dir (Path) – Output directory for all results (created if needed)

  • quiet (bool) – If True, suppress all output except errors (default: False)

  • verbose (bool) – If True, show detailed progress for each file (default: False).

Return type:

BatchResult

Returns:

BatchResult with overall statistics and individual file results.

build_tools.pyphen_syllable_extractor.process_single_file(input_path, language_code, min_len, max_len, output_dir, run_timestamp, verbose=False)

Process a single file in batch mode with comprehensive error handling.

This function attempts to extract syllables from a single file and saves the results. Unlike interactive mode, this function catches all exceptions and returns a result object indicating success or failure, allowing batch processing to continue even when individual files fail.

Parameters:
  • input_path (Path) – Path to the input text file to process

  • language_code (str) – Language code (e.g., “en_US”, “de_DE”) or “auto” for automatic language detection

  • min_len (int) – Minimum syllable length to include in results

  • max_len (int) – Maximum syllable length to include in results

  • output_dir (Path) – Directory where output files should be saved

  • run_timestamp (str) – Timestamp for the batch run (shared across all files in batch)

  • verbose (bool) – If True, print detailed progress messages (default: False)

Return type:

FileProcessingResult

Returns:

FileProcessingResult object with success status, syllable count, and output paths (if successful), or an error message (if failed).

Note

This function never raises exceptions. All errors are caught and returned in the FileProcessingResult.error_message field. This design allows batch processing to continue despite individual failures.
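
The "never raises" design can be sketched as a per-file wrapper that converts any exception into a result record, so the batch loop keeps going. All names below (FileOutcome, process_one, process_all, the extract callable) are hypothetical stand-ins, not the module's FileProcessingResult or its functions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class FileOutcome:
    """Hypothetical stand-in for a per-file result record."""
    input_path: Path
    success: bool
    syllable_count: int = 0
    error_message: Optional[str] = None


def process_one(input_path: Path, extract: Callable[[Path], set[str]]) -> FileOutcome:
    """Run `extract` on one file and capture any failure instead of raising."""
    try:
        syllables = extract(input_path)
    except Exception as exc:  # broad on purpose: one bad file must not stop the batch
        return FileOutcome(
            input_path, success=False, error_message=f"{type(exc).__name__}: {exc}"
        )
    return FileOutcome(input_path, success=True, syllable_count=len(syllables))


def process_all(
    paths: list[Path], extract: Callable[[Path], set[str]]
) -> list[FileOutcome]:
    """Process every path in order; failures are recorded, not propagated."""
    return [process_one(path, extract) for path in paths]
```

A caller can then derive an exit code from the outcomes, for example 1 if any record has success set to False.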

build_tools.pyphen_syllable_extractor.process_single_file_batch(input_path, language_code, min_len, max_len, output_dir, run_timestamp, verbose=False)

Process a single file in batch mode with comprehensive error handling.

This function attempts to extract syllables from a single file and saves the results. Unlike interactive mode, this function catches all exceptions and returns a result object indicating success or failure, allowing batch processing to continue even when individual files fail.

Parameters:
  • input_path (Path) – Path to the input text file to process

  • language_code (str) – Language code (e.g., “en_US”, “de_DE”) or “auto” for automatic language detection

  • min_len (int) – Minimum syllable length to include in results

  • max_len (int) – Maximum syllable length to include in results

  • output_dir (Path) – Directory where output files should be saved

  • run_timestamp (str) – Timestamp for the batch run (shared across all files in batch)

  • verbose (bool) – If True, print detailed progress messages (default: False)

Return type:

FileProcessingResult

Returns:

FileProcessingResult object with success status, syllable count, and output paths (if successful), or an error message (if failed).

Note

This function never raises exceptions. All errors are caught and returned in the FileProcessingResult.error_message field. This design allows batch processing to continue despite individual failures.

build_tools.pyphen_syllable_extractor.run_batch(args)

Batch mode entry point for the pyphen syllable extractor CLI.

This function processes multiple files based on command-line arguments, providing progress indicators and comprehensive error reporting.

Parameters:

args (Namespace) – Parsed command-line arguments (argparse.Namespace) containing:

  • file: Single file path (optional)

  • files: List of file paths (optional)

  • source: Directory path for scanning (optional)

  • pattern: File pattern for directory scanning (default: “*.txt”)

  • recursive: Whether to scan directories recursively

  • lang: Manual language code (mutually exclusive with auto)

  • auto: Use automatic language detection (mutually exclusive with lang)

  • min: Minimum syllable length (default: 2)

  • max: Maximum syllable length (default: 8)

  • output: Output directory (default: _working/output/)

  • quiet: Suppress progress indicators

  • verbose: Show detailed processing information

Return type:

None

Exit Codes:

  • 0: All files processed successfully

  • 1: One or more files failed to process

Raises:

SystemExit – On validation errors or processing completion
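
A parser producing a Namespace with the attributes listed above might look like the following sketch. The flags --file, --files, --source, --recursive, --lang and --auto appear in the CLI examples. The spellings --pattern, --min, --max, --output, --quiet and --verbose are assumed from the attribute names, and build_parser itself is a hypothetical helper.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose Namespace matches the documented attributes."""
    parser = argparse.ArgumentParser(prog="pyphen_syllable_extractor")
    parser.add_argument("--file", type=Path, help="single input file")
    parser.add_argument("--files", type=Path, nargs="+", help="several input files")
    parser.add_argument("--source", type=Path, help="directory to scan")
    parser.add_argument("--pattern", default="*.txt", help="glob used with --source")
    parser.add_argument("--recursive", action="store_true")
    # lang and auto are documented as mutually exclusive.
    language = parser.add_mutually_exclusive_group()
    language.add_argument("--lang", help="pyphen language code, e.g. en_US")
    language.add_argument("--auto", action="store_true", help="detect the language")
    parser.add_argument("--min", type=int, default=2, help="minimum syllable length")
    parser.add_argument("--max", type=int, default=8, help="maximum syllable length")
    parser.add_argument("--output", type=Path, default=Path("_working/output"))
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser


ns = build_parser().parse_args(["--file", "input.txt", "--lang", "en_US"])
```

With a mutually exclusive group, argparse itself rejects --lang combined with --auto by exiting with a usage error, which fits the documented SystemExit on validation errors.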

build_tools.pyphen_syllable_extractor.run_interactive()

Interactive mode entry point for the pyphen syllable extractor CLI.

Return type:

None

Workflow:
  1. Prompt user to select a language (or ‘auto’ for automatic detection)

  2. Configure extraction parameters (min/max syllable length)

  3. Prompt for input file path

  4. Extract syllables from input file (with optional auto-detection)

  5. Generate timestamped output filenames

  6. Save syllables and metadata to separate files

  7. Display summary to console

Language Detection:
  • If ‘auto’ is selected and langdetect is installed, the tool will automatically detect the language of the input text

  • Detection requires at least 20-50 characters for reliable results

  • Falls back to English (en_US) if detection fails

Output Files:
  • YYYYMMDD_HHMMSS.syllables.LANG.txt: One syllable per line, sorted

  • YYYYMMDD_HHMMSS.meta.LANG.txt: Extraction metadata and statistics

Both files are saved to _working/output/ by default.

build_tools.pyphen_syllable_extractor.save_metadata(result, output_path)

Save extraction metadata to a text file.

Parameters:
  • result (ExtractionResult) – ExtractionResult containing metadata to save

  • output_path (Path) – Path to the output metadata file

Raises:

IOError – If there’s an error writing the file

Return type:

None

Example

>>> result = ExtractionResult(...)
>>> save_metadata(result, Path("output.meta.txt"))

build_tools.pyphen_syllable_extractor.select_language()

Interactive prompt to select a language from supported options.

Return type:

str

Returns:

The pyphen language code for the selected language, or “auto” for automatic language detection

Note

Exits the program if the user asks to quit or keeps providing invalid input after multiple attempts.

build_tools.pyphen_syllable_extractor.validate_language_code(code)

Check if a language code is supported.

Parameters:

code (str) – Pyphen language code to validate

Return type:

bool

Returns:

True if the code is supported, False otherwise

Example

>>> validate_language_code("en_US")
True
>>> validate_language_code("invalid")
False
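
The name lookup, code lookup, and validation helpers can all be derived from a single name-to-code table. The sketch below assumes such a table. LANGUAGES_SKETCH and the three function names are hypothetical, and only the "English (US)" / "en_US" pair is taken from the examples above; the German display name is invented for illustration.

```python
from typing import Optional

# Hypothetical name -> code table.
LANGUAGES_SKETCH: dict[str, str] = {
    "English (US)": "en_US",
    "German (Germany)": "de_DE",  # display name assumed for illustration
}
# Reverse index built once, so code -> name lookups are also O(1).
CODE_TO_NAME: dict[str, str] = {code: name for name, code in LANGUAGES_SKETCH.items()}


def language_code_for(name: str) -> Optional[str]:
    """Return the code for a display name, or None if not found."""
    return LANGUAGES_SKETCH.get(name)


def language_name_for(code: str) -> Optional[str]:
    """Return the display name for a code, or None if not found."""
    return CODE_TO_NAME.get(code)


def is_supported(code: str) -> bool:
    """Return True if the code appears in the table."""
    return code in CODE_TO_NAME
```

Deriving the reverse index from the forward table keeps the two lookups from drifting apart when languages are added.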