build_tools.nltk_syllable_normaliser.fragment_cleaner
=====================================================

.. py:module:: build_tools.nltk_syllable_normaliser.fragment_cleaner

.. autoapi-nested-parse::

   Fragment cleaning logic for NLTK syllable normalization.

   This module provides the FragmentCleaner class which handles reconstruction
   of phonetically coherent syllables from NLTK's over-segmented output by
   merging isolated single-letter fragments with their neighbors.

   .. admonition:: Design Note (for future maintainers, including future-me)

      This module is intentionally *strict and dumb*.

      Its responsibility is limited to:

      - Orthographic normalization
      - Structural reconstruction of broken fragments
      - Enforcing minimal length and basic phonetic viability

      It MUST NOT:

      - Remove fragments based on perceived meaning or “word-likeness”
      - Apply language-, culture-, or corpus-specific filtering
      - Decide whether a fragment is suitable as a name, place, or object
      - Encode aesthetic judgement or semantic assumptions

      Fragments such as common function words or corpus artefacts
      (e.g. "of", "the") are expected to survive this stage.

      Such cases are handled explicitly and *downstream* via:

      - Feature detection
      - Candidate aggregation
      - Name-class selection policies

      If a fragment appears questionable here, the correct response is:
      “preserve and annotate”, not “clean away”.

      This separation is deliberate and non-negotiable.


Attributes
----------

.. autoapisummary::

   build_tools.nltk_syllable_normaliser.fragment_cleaner.VOWELS


Classes
-------

.. autoapisummary::

   build_tools.nltk_syllable_normaliser.fragment_cleaner.FragmentCleaner


Module Contents
---------------

.. py:data:: VOWELS

.. py:class:: FragmentCleaner

   Clean NLTK-produced syllable fragments by merging isolated letters.

   The NLTK syllable extractor uses phonetically-guided splitting with
   onset/coda principles, which can sometimes over-segment words into
   isolated single-letter fragments. This cleaner reconstructs more
   coherent syllables by applying mechanical merging rules.

   Merging Rules:
       1. Single vowels (a, e, i, o, u, y) merge with next fragment
       2. Single consonants merge with next fragment
       3. Multi-character fragments remain unchanged

   .. admonition:: Example

      >>> cleaner = FragmentCleaner()
      >>> fragments = ["i", "down", "the", "r", "abbit"]
      >>> cleaner.clean_fragments(fragments)
      ['idown', 'the', 'rabbit']

   .. note::

      This is NLTK-specific preprocessing. Pyphen output doesn't need
      fragment cleaning as it uses typographic hyphenation rules.


   .. py:method:: is_single_letter(token)
      :staticmethod:


      Check if token is a single alphabetic character.

      :param token: String to check.

      :returns: True if token is exactly one alphabetic character, False otherwise.

      .. admonition:: Example

         >>> FragmentCleaner.is_single_letter("a")
         True
         >>> FragmentCleaner.is_single_letter("ab")
         False
         >>> FragmentCleaner.is_single_letter("1")
         False


   .. py:method:: is_single_vowel(token)
      :staticmethod:


      Check if token is a single vowel character.

      :param token: String to check.

      :returns: True if token is a single vowel (a, e, i, o, u, y), False otherwise.

      .. admonition:: Example

         >>> FragmentCleaner.is_single_vowel("a")
         True
         >>> FragmentCleaner.is_single_vowel("b")
         False
         >>> FragmentCleaner.is_single_vowel("ae")
         False
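
      The two predicates can be approximated in a few lines. This is a
      minimal sketch, not the module's actual source: ``VOWELS_SKETCH`` is a
      hypothetical stand-in for the module-level ``VOWELS`` constant, and
      case-insensitive matching is an assumption. Note that ``str.isalpha``
      also accepts non-ASCII letters, which may or may not match the real
      behaviour.

      .. code-block:: python

         # Hypothetical stand-in for the module-level VOWELS constant.
         VOWELS_SKETCH = frozenset("aeiouy")


         def is_single_letter(token: str) -> bool:
             """Return True when token is exactly one alphabetic character."""
             return len(token) == 1 and token.isalpha()


         def is_single_vowel(token: str) -> bool:
             """Return True when token is exactly one vowel (y included)."""
             # Lower-casing before the lookup is an assumption of this sketch.
             return len(token) == 1 and token.lower() in VOWELS_SKETCH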


   .. py:method:: clean_fragments(fragments)

      Perform mechanical cleanup by merging single-letter fragments.

      Applies two merging rules in sequence:

      1. Merge isolated single vowels with the following fragment
      2. Merge isolated single consonants with the following fragment

      This reconstructs more phonetically coherent syllables from
      NLTK's onset/coda-based over-segmentation.

      :param fragments: List of syllable fragments (possibly over-segmented).

      :returns: List of cleaned fragments with single letters merged.

      .. admonition:: Example

         >>> cleaner = FragmentCleaner()
         >>> # Example 1: Single vowel merging
         >>> cleaner.clean_fragments(["i", "down"])
         ['idown']
         >>>
         >>> # Example 2: Single consonant merging
         >>> cleaner.clean_fragments(["r", "abbit"])
         ['rabbit']
         >>>
         >>> # Example 3: Mixed fragments
         >>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
         ['cha', 'pter', 'idown', 'the', 'ra']
         >>>
         >>> # Example 4: Preserve multi-character fragments
         >>> cleaner.clean_fragments(["hel", "lo", "world"])
         ['hel', 'lo', 'world']

      .. note::

         - Fragments are processed left-to-right
         - Single letters merge with next fragment (if available)
         - The last fragment never merges forward (no next fragment
           available), though a preceding single letter may merge into it
         - Empty input returns empty output
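
      The two-pass behaviour described above can be sketched as follows.
      This is an illustrative reconstruction under stated assumptions, not
      the actual implementation: ``merge_forward``, ``clean_fragments_sketch``
      and ``VOWELS_SKETCH`` are hypothetical names, and the handling of runs
      of adjacent single letters is a guess.

      .. code-block:: python

         from typing import Callable, List

         # Hypothetical stand-in for the module-level VOWELS constant.
         VOWELS_SKETCH = frozenset("aeiouy")


         def merge_forward(
             fragments: List[str], should_merge: Callable[[str], bool]
         ) -> List[str]:
             """Scan left to right, joining each matching fragment to the next."""
             merged: List[str] = []
             i = 0
             while i < len(fragments):
                 current = fragments[i]
                 if should_merge(current) and i + 1 < len(fragments):
                     # Consume the current fragment and the one after it.
                     merged.append(current + fragments[i + 1])
                     i += 2
                 else:
                     # Multi-character fragments and a trailing single
                     # letter pass through unchanged.
                     merged.append(current)
                     i += 1
             return merged


         def clean_fragments_sketch(fragments: List[str]) -> List[str]:
             """Apply the vowel pass first, then the consonant pass."""

             def is_vowel(token: str) -> bool:
                 return len(token) == 1 and token.lower() in VOWELS_SKETCH

             def is_consonant(token: str) -> bool:
                 return len(token) == 1 and token.isalpha() and not is_vowel(token)

             after_vowels = merge_forward(fragments, is_vowel)
             return merge_forward(after_vowels, is_consonant)

      Running the passes one after the other is what lets a trailing vowel
      survive the first pass and still be absorbed by a preceding single
      consonant in the second, as in Example 3.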


   .. py:method:: clean_fragments_from_file(input_path, output_path)

      Clean fragments from input file and write to output file.

      Convenience method for file-based processing. Reads one fragment
      per line from input file, applies cleaning, and writes cleaned
      fragments to output file (one per line).

      :param input_path: Path to input file (one fragment per line).
      :param output_path: Path to output file for cleaned fragments.

      :returns: Tuple of (original_count, cleaned_count): the number of
                fragments read and the number written after cleaning. Their
                difference indicates how many merges occurred.

      :raises FileNotFoundError: If input file doesn't exist.
      :raises IOError: If there's an error reading or writing files.

      .. admonition:: Example

         >>> # input.txt contains:
         >>> # i
         >>> # down
         >>> # the
         >>> # r
         >>> # abbit
         >>>
         >>> cleaner = FragmentCleaner()
         >>> original, cleaned = cleaner.clean_fragments_from_file(
         ...     "input.txt", "output.txt"
         ... )
         >>> print(f"Cleaned {original} → {cleaned} fragments")
         Cleaned 5 → 3 fragments
         >>>
         >>> # output.txt now contains:
         >>> # idown
         >>> # the
         >>> # rabbit
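
      The read, clean, write flow can be sketched as a thin wrapper around
      any cleaning function. This is a hedged illustration only:
      ``clean_file_sketch`` and its ``clean`` parameter are hypothetical, and
      UTF-8 encoding and the skipping of blank lines are assumptions rather
      than documented behaviour.

      .. code-block:: python

         from pathlib import Path
         from typing import Callable, List, Tuple


         def clean_file_sketch(
             input_path: str,
             output_path: str,
             clean: Callable[[List[str]], List[str]],
         ) -> Tuple[int, int]:
             """Read one fragment per line, clean them, write one per line."""
             # Raises FileNotFoundError if the input file does not exist.
             text = Path(input_path).read_text(encoding="utf-8")

             # Assumption: surrounding whitespace is stripped and blank
             # lines are ignored.
             fragments = [line.strip() for line in text.splitlines() if line.strip()]

             cleaned = clean(fragments)

             Path(output_path).write_text(
                 "".join(f"{fragment}\n" for fragment in cleaned),
                 encoding="utf-8",
             )
             return len(fragments), len(cleaned)

      Passing the cleaning step in as a callable keeps the file handling
      separate from the merging rules, so either can be tested on its own.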