build_tools.nltk_syllable_normaliser.fragment_cleaner
=====================================================

.. py:module:: build_tools.nltk_syllable_normaliser.fragment_cleaner

.. autoapi-nested-parse::

   Fragment cleaning logic for NLTK syllable normalization.

   This module provides the FragmentCleaner class which handles reconstruction
   of phonetically coherent syllables from NLTK's over-segmented output by
   merging isolated single-letter fragments with their neighbors.

   ---------------------------------------------------------------------------

   Design Note (for future maintainers, including future-me):

   This module is intentionally *strict and dumb*.

   Its responsibility is limited to:

   - Orthographic normalization
   - Structural reconstruction of broken fragments
   - Enforcing minimal length and basic phonetic viability

   It MUST NOT:

   - Remove fragments based on perceived meaning or “word-likeness”
   - Apply language-, culture-, or corpus-specific filtering
   - Decide whether a fragment is suitable as a name, place, or object
   - Encode aesthetic judgement or semantic assumptions

   Fragments such as common function words or corpus artefacts
   (e.g. "of", "the") are expected to survive this stage.

   Such cases are handled explicitly and *downstream* via:

   - Feature detection
   - Candidate aggregation
   - Name-class selection policies

   If a fragment appears questionable here, the correct response is:
   “preserve and annotate”, not “clean away”.

   This separation is deliberate and non-negotiable.

   ---------------------------------------------------------------------------


Attributes
----------

.. autoapisummary::

   build_tools.nltk_syllable_normaliser.fragment_cleaner.VOWELS


Classes
-------

.. autoapisummary::

   build_tools.nltk_syllable_normaliser.fragment_cleaner.FragmentCleaner


Module Contents
---------------

.. py:data:: VOWELS

.. py:class:: FragmentCleaner

   Clean NLTK-produced syllable fragments by merging isolated letters.

   The NLTK syllable extractor uses phonetically-guided splitting with
   onset/coda principles, which can sometimes over-segment words into isolated
   single-letter fragments. This cleaner reconstructs more coherent syllables
   by applying mechanical merging rules.

   Merging Rules:

   1. Single vowels (a, e, i, o, u, y) merge with the next fragment
   2. Single consonants merge with the next fragment
   3. Multi-character fragments remain unchanged

   .. admonition:: Example

      >>> cleaner = FragmentCleaner()
      >>> fragments = ["i", "down", "the", "r", "abbit"]
      >>> cleaner.clean_fragments(fragments)
      ['idown', 'the', 'rabbit']

   .. note::

      This is NLTK-specific preprocessing. Pyphen output doesn't need
      fragment cleaning as it uses typographic hyphenation rules.


   .. py:method:: is_single_letter(token)
      :staticmethod:

      Check if token is a single alphabetic character.

      :param token: String to check.

      :returns: True if token is exactly one alphabetic character, False otherwise.

      .. admonition:: Example

         >>> FragmentCleaner.is_single_letter("a")
         True
         >>> FragmentCleaner.is_single_letter("ab")
         False
         >>> FragmentCleaner.is_single_letter("1")
         False


   .. py:method:: is_single_vowel(token)
      :staticmethod:

      Check if token is a single vowel character.

      :param token: String to check.

      :returns: True if token is a single vowel (a, e, i, o, u, y), False otherwise.

      .. admonition:: Example

         >>> FragmentCleaner.is_single_vowel("a")
         True
         >>> FragmentCleaner.is_single_vowel("b")
         False
         >>> FragmentCleaner.is_single_vowel("ae")
         False


   .. py:method:: clean_fragments(fragments)

      Perform mechanical cleanup by merging single-letter fragments.

      Applies two merging rules in sequence:

      1. Merge isolated single vowels with the following fragment
      2. Merge isolated single consonants with the following fragment

      This reconstructs more phonetically coherent syllables from NLTK's
      onset/coda-based over-segmentation.

      :param fragments: List of syllable fragments (possibly over-segmented).

      :returns: List of cleaned fragments with single letters merged.

      .. admonition:: Example

         >>> cleaner = FragmentCleaner()
         >>> # Example 1: Single vowel merging
         >>> cleaner.clean_fragments(["i", "down"])
         ['idown']
         >>>
         >>> # Example 2: Single consonant merging
         >>> cleaner.clean_fragments(["r", "abbit"])
         ['rabbit']
         >>>
         >>> # Example 3: Mixed fragments
         >>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
         ['cha', 'pter', 'idown', 'the', 'ra']
         >>>
         >>> # Example 4: Preserve multi-character fragments
         >>> cleaner.clean_fragments(["hel", "lo", "world"])
         ['hel', 'lo', 'world']

      .. note::

         - Fragments are processed left-to-right
         - Single letters merge with the next fragment (if available)
         - The last fragment never merges (no next fragment available)
         - Empty input returns empty output


   .. py:method:: clean_fragments_from_file(input_path, output_path)

      Clean fragments from input file and write to output file.

      Convenience method for file-based processing. Reads one fragment per line
      from the input file, applies cleaning, and writes the cleaned fragments to
      the output file (one per line).

      :param input_path: Path to input file (one fragment per line).
      :param output_path: Path to output file for cleaned fragments.

      :returns: Tuple of (original_count, cleaned_count): the number of fragments
                before and after cleaning.

      :raises FileNotFoundError: If input file doesn't exist.
      :raises IOError: If there's an error reading or writing files.

      .. admonition:: Example

         >>> # input.txt contains:
         >>> # i
         >>> # down
         >>> # the
         >>> # r
         >>> # abbit
         >>>
         >>> cleaner = FragmentCleaner()
         >>> original, cleaned = cleaner.clean_fragments_from_file(
         ...     "input.txt", "output.txt"
         ... )
         >>> print(f"Cleaned {original} → {cleaned} fragments")
         Cleaned 5 → 3 fragments
         >>>
         >>> # output.txt now contains:
         >>> # idown
         >>> # the
         >>> # rabbit
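
The merging rules documented above can be read as two left-to-right passes over the fragment list. The following is a minimal sketch under stated assumptions, not the module's actual implementation: ``SKETCH_VOWELS``, ``merge_pass`` and ``clean_fragments_sketch`` are hypothetical names introduced for illustration, and the vowel set is taken from the docstrings (a, e, i, o, u, y) rather than from the real ``VOWELS`` value.

.. code-block:: python

   from typing import Callable, List

   # Assumption: mirrors the vowels listed in the docstrings, not the real VOWELS.
   SKETCH_VOWELS = frozenset("aeiouy")


   def is_single_letter(token: str) -> bool:
       """True when token is exactly one alphabetic character."""
       return len(token) == 1 and token.isalpha()


   def is_single_vowel(token: str) -> bool:
       """True when token is one character drawn from SKETCH_VOWELS."""
       return len(token) == 1 and token.lower() in SKETCH_VOWELS


   def is_single_consonant(token: str) -> bool:
       """True when token is a single letter that is not a vowel."""
       return is_single_letter(token) and not is_single_vowel(token)


   def merge_pass(fragments: List[str],
                  should_merge: Callable[[str], bool]) -> List[str]:
       """Left-to-right pass: glue each matching fragment onto the next one."""
       merged: List[str] = []
       i = 0
       while i < len(fragments):
           current = fragments[i]
           if should_merge(current) and i + 1 < len(fragments):
               merged.append(current + fragments[i + 1])
               i += 2  # the following fragment has been consumed
           else:
               # Multi-character fragments and a trailing single letter pass through.
               merged.append(current)
               i += 1
       return merged


   def clean_fragments_sketch(fragments: List[str]) -> List[str]:
       """Apply the vowel pass first, then the consonant pass."""
       after_vowels = merge_pass(fragments, is_single_vowel)
       return merge_pass(after_vowels, is_single_consonant)


   print(clean_fragments_sketch(["cha", "pter", "i", "down", "the", "r", "a"]))
   # ['cha', 'pter', 'idown', 'the', 'ra']

Running the vowel pass before the consonant pass is what lets ``["r", "a"]`` become ``"ra"`` in this sketch: the trailing ``"a"`` has no next fragment in the first pass, so it is still available for ``"r"`` to merge with in the second.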