build_tools.nltk_syllable_normaliser.fragment_cleaner
Fragment cleaning logic for NLTK syllable normalization.
This module provides the FragmentCleaner class, which reconstructs phonetically coherent syllables from NLTK’s over-segmented output by merging each isolated single-letter fragment with the fragment that follows it.
This module is intentionally strict and dumb.
Its responsibility is limited to:
- Orthographic normalization
- Structural reconstruction of broken fragments
- Enforcing minimal length and basic phonetic viability
It MUST NOT:
- Remove fragments based on perceived meaning or “word-likeness”
- Apply language-, culture-, or corpus-specific filtering
- Decide whether a fragment is suitable as a name, place, or object
- Encode aesthetic judgement or semantic assumptions
Fragments such as common function words or corpus artefacts (e.g. “of”, “the”) are expected to survive this stage.
Such cases are handled explicitly and downstream via:
- Feature detection
- Candidate aggregation
- Name-class selection policies
If a fragment appears questionable here, the correct response is: “preserve and annotate”, not “clean away”.
This separation is deliberate and non-negotiable.
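To make “preserve and annotate” concrete, here is a minimal sketch of what a downstream annotation step could look like. It is not part of this module: the names `AnnotatedFragment`, `FUNCTION_WORDS`, `annotate`, and the `"function_word"` flag are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical lookup table; the real downstream feature detection is not shown here.
FUNCTION_WORDS = frozenset({"of", "the"})


@dataclass
class AnnotatedFragment:
    """A fragment plus the flags attached to it (hypothetical structure)."""

    text: str
    flags: set[str] = field(default_factory=set)


def annotate(fragments: list[str]) -> list[AnnotatedFragment]:
    """Keep every fragment and attach flags instead of dropping anything."""
    annotated = []
    for text in fragments:
        flags = {"function_word"} if text in FUNCTION_WORDS else set()
        annotated.append(AnnotatedFragment(text, flags))
    return annotated
```

The point of the sketch is that the output has the same length as the input: questionable fragments are marked, and the decision to use or skip them is left to later policy.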
Attributes
VOWELS
Classes
FragmentCleaner – Clean NLTK-produced syllable fragments by merging isolated letters.
Module Contents
- build_tools.nltk_syllable_normaliser.fragment_cleaner.VOWELS
- class build_tools.nltk_syllable_normaliser.fragment_cleaner.FragmentCleaner
Clean NLTK-produced syllable fragments by merging isolated letters.
The NLTK syllable extractor uses phonetically-guided splitting with onset/coda principles, which can sometimes over-segment words into isolated single-letter fragments. This cleaner reconstructs more coherent syllables by applying mechanical merging rules.
- Merging Rules:
  - Single vowels (a, e, i, o, u, y) merge with the next fragment
  - Single consonants merge with the next fragment
  - Multi-character fragments remain unchanged
Example
>>> cleaner = FragmentCleaner()
>>> fragments = ["i", "down", "the", "r", "abbit"]
>>> cleaner.clean_fragments(fragments)
['idown', 'the', 'rabbit']
Note
This is NLTK-specific preprocessing. Pyphen output doesn’t need fragment cleaning as it uses typographic hyphenation rules.
- static is_single_letter(token)
Check if token is a single alphabetic character.
- Parameters:
token (str) – String to check.
- Returns:
True if token is exactly one alphabetic character, False otherwise.
- Return type:
bool
Example
>>> FragmentCleaner.is_single_letter("a")
True
>>> FragmentCleaner.is_single_letter("ab")
False
>>> FragmentCleaner.is_single_letter("1")
False
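A check with this documented behaviour can be written as a length test combined with `str.isalpha()`. The sketch below is a standalone function under that assumption, not the class’s actual source.

```python
def is_single_letter(token: str) -> bool:
    """Return True if token is exactly one alphabetic character."""
    # Assumption: "alphabetic" means whatever str.isalpha() accepts.
    return len(token) == 1 and token.isalpha()
```

Note that `str.isalpha()` also accepts non-ASCII letters such as "é"; whether the real method does the same is not stated here.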
- static is_single_vowel(token)
Check if token is a single vowel character.
- Parameters:
token (str) – String to check.
- Returns:
True if token is a single vowel (a, e, i, o, u, y), False otherwise.
- Return type:
bool
Example
>>> FragmentCleaner.is_single_vowel("a")
True
>>> FragmentCleaner.is_single_vowel("b")
False
>>> FragmentCleaner.is_single_vowel("ae")
False
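The vowel check can be sketched as a membership test against the module-level VOWELS attribute. Two assumptions are made here: that VOWELS holds the six letters listed above, and that matching is case-insensitive. Neither is confirmed by this page.

```python
# Assumed contents of the module-level VOWELS attribute.
VOWELS = frozenset("aeiouy")


def is_single_vowel(token: str) -> bool:
    """Return True if token is exactly one character and that character is a vowel."""
    # Lowercasing is an assumption; the real method may be case-sensitive.
    return len(token) == 1 and token.lower() in VOWELS
```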
- clean_fragments(fragments)
Perform mechanical cleanup by merging single-letter fragments.
Applies two merging rules in sequence:
1. Merge isolated single vowels with the following fragment
2. Merge isolated single consonants with the following fragment
This reconstructs more phonetically coherent syllables from NLTK’s onset/coda-based over-segmentation.
- Parameters:
fragments (list[str]) – List of syllable fragments (possibly over-segmented).
- Returns:
List of cleaned fragments with single letters merged.
- Return type:
list[str]
Example
>>> cleaner = FragmentCleaner()
>>> # Example 1: Single vowel merging
>>> cleaner.clean_fragments(["i", "down"])
['idown']

>>> # Example 2: Single consonant merging
>>> cleaner.clean_fragments(["r", "abbit"])
['rabbit']

>>> # Example 3: Mixed fragments
>>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
['cha', 'pter', 'idown', 'the', 'ra']

>>> # Example 4: Preserve multi-character fragments
>>> cleaner.clean_fragments(["hel", "lo", "world"])
['hel', 'lo', 'world']
Note
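- Fragments are processed left to right.
- A single letter merges with the next fragment, if one exists.
- A trailing single letter is never merged forward, because no next fragment exists. It can still be absorbed by a preceding single letter, as with "r" and "a" in Example 3.
- Empty input returns empty output.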
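The two-pass behaviour described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the method’s source: `merge_forward` and `clean_fragments_sketch` are hypothetical names, and the contents of `VOWELS` and the case-insensitive matching are assumed.

```python
from typing import Callable

# Assumed contents of the module-level VOWELS attribute.
VOWELS = frozenset("aeiouy")


def merge_forward(fragments: list[str], should_merge: Callable[[str], bool]) -> list[str]:
    """Left-to-right pass: glue each matching fragment onto the one after it."""
    merged: list[str] = []
    i = 0
    while i < len(fragments):
        current = fragments[i]
        has_next = i + 1 < len(fragments)
        if should_merge(current) and has_next:
            # Consume both fragments and emit them as one.
            merged.append(current + fragments[i + 1])
            i += 2
        else:
            # Multi-character fragments and a trailing single letter pass through.
            merged.append(current)
            i += 1
    return merged


def clean_fragments_sketch(fragments: list[str]) -> list[str]:
    """Apply the vowel pass first, then the consonant pass."""

    def is_vowel(token: str) -> bool:
        return len(token) == 1 and token.lower() in VOWELS

    def is_consonant(token: str) -> bool:
        return len(token) == 1 and token.isalpha() and token.lower() not in VOWELS

    after_vowels = merge_forward(fragments, is_vowel)
    return merge_forward(after_vowels, is_consonant)
```

With these assumptions the sketch reproduces the four documented examples. How the real method treats runs of consecutive single letters is not specified here; in this sketch a merged pair is emitted as a unit, so `["a", "e", "x"]` becomes `["ae", "x"]`.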
- clean_fragments_from_file(input_path, output_path)
Clean fragments from input file and write to output file.
Convenience method for file-based processing. Reads one fragment per line from input file, applies cleaning, and writes cleaned fragments to output file (one per line).
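- Parameters:
input_path – Path to the input file, one fragment per line.
output_path – Path to the output file; cleaned fragments are written one per line.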
- Returns:
Tuple of (original_count, cleaned_count): the fragment counts before and after cleaning. Their difference is the number of merges performed.
- Raises:
FileNotFoundError – If input file doesn’t exist.
IOError – If there’s an error reading or writing files.
- Return type:
tuple[int, int]
Example
>>> # input.txt contains:
>>> # i
>>> # down
>>> # the
>>> # r
>>> # abbit

>>> cleaner = FragmentCleaner()
>>> original, cleaned = cleaner.clean_fragments_from_file(
...     "input.txt", "output.txt"
... )
>>> print(f"Cleaned {original} → {cleaned} fragments")
Cleaned 5 → 3 fragments

>>> # output.txt now contains:
>>> # idown
>>> # the
>>> # rabbit
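The read, clean, and write steps can be sketched as a small standalone helper. This is a sketch under assumptions, not the method’s source: `clean_file_sketch` is a hypothetical name, the cleaning function is passed in as a parameter for self-containment, and both UTF-8 encoding and the skipping of blank lines are assumed.

```python
from pathlib import Path
from typing import Callable


def clean_file_sketch(
    input_path: str,
    output_path: str,
    clean: Callable[[list[str]], list[str]],
) -> tuple[int, int]:
    """Read one fragment per line, clean them, and write one per line."""
    # Raises FileNotFoundError if input_path does not exist.
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    # Assumption: surrounding whitespace is stripped and blank lines are ignored.
    fragments = [line.strip() for line in lines if line.strip()]
    cleaned = clean(fragments)
    Path(output_path).write_text(
        "".join(f"{fragment}\n" for fragment in cleaned), encoding="utf-8"
    )
    return len(fragments), len(cleaned)
```

Returning both counts lets the caller report how much merging happened without re-reading either file.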