build_tools.nltk_syllable_normaliser.fragment_cleaner
Fragment cleaning logic for NLTK syllable normalization.
This module provides the FragmentCleaner class, which reconstructs phonetically coherent syllables from NLTK’s over-segmented output by merging isolated single-letter fragments with their neighbors.
Attributes
VOWELS
Classes
FragmentCleaner | Clean NLTK-produced syllable fragments by merging isolated letters.
Module Contents
- build_tools.nltk_syllable_normaliser.fragment_cleaner.VOWELS
- class build_tools.nltk_syllable_normaliser.fragment_cleaner.FragmentCleaner
Clean NLTK-produced syllable fragments by merging isolated letters.
The NLTK syllable extractor uses phonetically guided splitting based on onset/coda principles, which can over-segment words into isolated single-letter fragments. This cleaner reconstructs more coherent syllables by applying mechanical merging rules.
- Merging Rules:
Single vowels (a, e, i, o, u, y) merge with next fragment
Single consonants merge with next fragment
Multi-character fragments remain unchanged
Example
>>> cleaner = FragmentCleaner()
>>> fragments = ["i", "down", "the", "r", "abbit"]
>>> cleaner.clean_fragments(fragments)
['idown', 'the', 'rabbit']
Note
This is NLTK-specific preprocessing. Pyphen output doesn’t need fragment cleaning as it uses typographic hyphenation rules.
- static is_single_letter(token)
Check if token is a single alphabetic character.
- Parameters:
token (str) – String to check.
- Returns:
True if token is exactly one alphabetic character, False otherwise.
- Return type:
bool
Example
>>> FragmentCleaner.is_single_letter("a")
True
>>> FragmentCleaner.is_single_letter("ab")
False
>>> FragmentCleaner.is_single_letter("1")
False
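A check with the behavior described above could be written as follows. This is a minimal sketch, not the module's actual source: the function name `is_single_letter_sketch` is hypothetical, and relying on `str.isalpha()` (which also accepts non-ASCII letters such as "é") is an assumption.

```python
def is_single_letter_sketch(token: str) -> bool:
    """Return True only for a string of exactly one alphabetic character.

    Hypothetical stand-in for FragmentCleaner.is_single_letter.
    """
    # The length check rejects "" and multi-character fragments;
    # isalpha() rejects digits, punctuation, and whitespace.
    return len(token) == 1 and token.isalpha()
```

The length test comes first so that the empty string is rejected without depending on how `isalpha()` treats it.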
- static is_single_vowel(token)
Check if token is a single vowel character.
- Parameters:
token (str) – String to check.
- Returns:
True if token is a single vowel (a, e, i, o, u, y), False otherwise.
- Return type:
bool
Example
>>> FragmentCleaner.is_single_vowel("a")
True
>>> FragmentCleaner.is_single_vowel("b")
False
>>> FragmentCleaner.is_single_vowel("ae")
False
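The vowel check can be sketched in the same way. The contents of the module-level `VOWELS` attribute are not shown in this reference, so the set below is an assumption based on the letters listed above (a, e, i, o, u, y); the name `is_single_vowel_sketch` and the case-insensitive comparison are also assumptions.

```python
# Assumed contents; the real VOWELS attribute may be defined differently.
VOWELS_SKETCH = frozenset("aeiouy")


def is_single_vowel_sketch(token: str) -> bool:
    """Return True only for a one-character string that is a vowel.

    Hypothetical stand-in for FragmentCleaner.is_single_vowel.
    """
    # Lowercasing makes "A" and "a" equivalent (an assumption about the API).
    return len(token) == 1 and token.lower() in VOWELS_SKETCH
```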
- clean_fragments(fragments)
Perform mechanical cleanup by merging single-letter fragments.
Applies two merging rules in sequence:
1. Merge isolated single vowels with the following fragment
2. Merge isolated single consonants with the following fragment
This reconstructs more phonetically coherent syllables from NLTK’s onset/coda-based over-segmentation.
- Parameters:
fragments (list[str]) – List of syllable fragments (possibly over-segmented).
- Returns:
List of cleaned fragments with single letters merged.
- Return type:
list[str]
Example
>>> cleaner = FragmentCleaner()
>>> # Example 1: Single vowel merging
>>> cleaner.clean_fragments(["i", "down"])
['idown']
>>>
>>> # Example 2: Single consonant merging
>>> cleaner.clean_fragments(["r", "abbit"])
['rabbit']
>>>
>>> # Example 3: Mixed fragments
>>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
['cha', 'pter', 'idown', 'the', 'ra']
>>>
>>> # Example 4: Preserve multi-character fragments
>>> cleaner.clean_fragments(["hel", "lo", "world"])
['hel', 'lo', 'world']
Note
Fragments are processed left-to-right
Single letters merge with next fragment (if available)
A single letter in the last position is never merged forward (no next fragment available), although a preceding single letter may still merge into it
Empty input returns empty output
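The two-pass behavior described above can be approximated with a short standalone sketch. It is not the class's actual implementation: the helper names (`merge_forward`, `clean_fragments_sketch`), the vowel set, and the choice not to re-examine a fragment once it has been merged are all assumptions made for illustration.

```python
from typing import Callable

# Assumed vowel set, taken from the letters listed in the merging rules.
VOWEL_SET = frozenset("aeiouy")


def merge_forward(fragments: list[str], should_merge: Callable[[str], bool]) -> list[str]:
    """Scan left to right, joining each matching fragment onto the next one."""
    merged: list[str] = []
    i = 0
    while i < len(fragments):
        current = fragments[i]
        has_next = i + 1 < len(fragments)
        if should_merge(current) and has_next:
            # Consume both fragments; the merged result is not re-checked.
            merged.append(current + fragments[i + 1])
            i += 2
        else:
            merged.append(current)
            i += 1
    return merged


def clean_fragments_sketch(fragments: list[str]) -> list[str]:
    """Apply the vowel pass first, then the consonant pass."""
    def is_vowel(token: str) -> bool:
        return len(token) == 1 and token.lower() in VOWEL_SET

    def is_consonant(token: str) -> bool:
        return len(token) == 1 and token.isalpha() and token.lower() not in VOWEL_SET

    after_vowels = merge_forward(fragments, is_vowel)
    return merge_forward(after_vowels, is_consonant)
```

Running the vowel pass before the consonant pass matters for inputs such as `["r", "a"]`: the trailing "a" has no next fragment and survives the first pass, after which "r" merges into it to give "ra".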
- clean_fragments_from_file(input_path, output_path)
Clean fragments from input file and write to output file.
Convenience method for file-based processing. Reads one fragment per line from the input file, applies cleaning, and writes the cleaned fragments to the output file (one per line).
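- Parameters:
input_path – Path to the input file (one fragment per line).
output_path – Path to the output file (one cleaned fragment per line).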
- Returns:
Tuple of (original_count, cleaned_count): the number of fragments before and after cleaning. The difference is the number of merges performed.
- Raises:
FileNotFoundError – If input file doesn’t exist.
IOError – If there’s an error reading or writing files.
- Return type:
tuple[int, int]
Example
>>> # input.txt contains:
>>> # i
>>> # down
>>> # the
>>> # r
>>> # abbit
>>>
>>> cleaner = FragmentCleaner()
>>> original, cleaned = cleaner.clean_fragments_from_file(
...     "input.txt", "output.txt"
... )
>>> print(f"Cleaned {original} → {cleaned} fragments")
Cleaned 5 → 3 fragments
>>>
>>> # output.txt now contains:
>>> # idown
>>> # the
>>> # rabbit
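The read, clean, write flow of this method might look roughly like the sketch below. The function name `clean_file_sketch`, the `clean` callable parameter, the UTF-8 encoding, and the skipping of blank lines are all assumptions; the real method may handle these details differently.

```python
from pathlib import Path
from typing import Callable


def clean_file_sketch(
    input_path: str,
    output_path: str,
    clean: Callable[[list[str]], list[str]],
) -> tuple[int, int]:
    """Read one fragment per line, clean them, and write one per line.

    `clean` is any fragment-cleaning callable, for example the
    clean_fragments method of a FragmentCleaner instance.
    """
    # read_text raises FileNotFoundError if the input file is missing.
    text = Path(input_path).read_text(encoding="utf-8")
    # Assumption: surrounding whitespace and blank lines are ignored.
    fragments = [line.strip() for line in text.splitlines() if line.strip()]
    cleaned = clean(fragments)
    body = ("\n".join(cleaned) + "\n") if cleaned else ""
    Path(output_path).write_text(body, encoding="utf-8")
    return len(fragments), len(cleaned)
```

Passing the cleaning step in as a callable keeps the file handling separate from the merging rules, so either part can be tested on its own.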