build_tools.nltk_syllable_normaliser.fragment_cleaner

Fragment cleaning logic for NLTK syllable normalization.

This module provides the FragmentCleaner class, which reconstructs phonetically coherent syllables from NLTK’s over-segmented output by merging each isolated single-letter fragment with the fragment that follows it.

Attributes

VOWELS

Classes

FragmentCleaner

Clean NLTK-produced syllable fragments by merging isolated letters.

Module Contents

build_tools.nltk_syllable_normaliser.fragment_cleaner.VOWELS

class build_tools.nltk_syllable_normaliser.fragment_cleaner.FragmentCleaner

Clean NLTK-produced syllable fragments by merging isolated letters.

The NLTK syllable extractor uses phonetically-guided splitting with onset/coda principles, which can sometimes over-segment words into isolated single-letter fragments. This cleaner reconstructs more coherent syllables by applying mechanical merging rules.

Merging Rules:

Single vowels (a, e, i, o, u, y) merge with next fragment
Single consonants merge with next fragment
Multi-character fragments remain unchanged

Example

>>> cleaner = FragmentCleaner()
>>> fragments = ["i", "down", "the", "r", "abbit"]
>>> cleaner.clean_fragments(fragments)
['idown', 'the', 'rabbit']

Note

This is NLTK-specific preprocessing. Pyphen output doesn’t need fragment cleaning as it uses typographic hyphenation rules.

static is_single_letter(token)

Check if token is a single alphabetic character.

Parameters: token (str) – String to check.
Returns: True if token is exactly one alphabetic character, False otherwise.
Return type: bool

Example

>>> FragmentCleaner.is_single_letter("a")
True
>>> FragmentCleaner.is_single_letter("ab")
False
>>> FragmentCleaner.is_single_letter("1")
False
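
The check described above can be sketched as follows. This is a minimal stand-in, not the library's source: the function name mirrors the documented method, and the use of `str.isalpha()` is an assumption (it also accepts non-ASCII letters, which the documentation does not address).

```python
def is_single_letter(token: str) -> bool:
    """Hypothetical stand-in for FragmentCleaner.is_single_letter."""
    # True only for exactly one character that is alphabetic.
    # Assumption: str.isalpha() is the alphabetic test being applied.
    return len(token) == 1 and token.isalpha()


print(is_single_letter("a"))   # True
print(is_single_letter("ab"))  # False
print(is_single_letter("1"))   # False
print(is_single_letter(""))    # False
```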

static is_single_vowel(token)

Check if token is a single vowel character.

Parameters: token (str) – String to check.
Returns: True if token is a single vowel (a, e, i, o, u, y), False otherwise.
Return type: bool

Example

>>> FragmentCleaner.is_single_vowel("a")
True
>>> FragmentCleaner.is_single_vowel("b")
False
>>> FragmentCleaner.is_single_vowel("ae")
False
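
A possible shape for this check, assuming VOWELS is a set of the six letters listed above. The page names the VOWELS attribute but does not show its value, and case handling is not documented, so both the constant and the lowercasing below are assumptions.

```python
# Assumed value: the documentation lists the vowels as a, e, i, o, u, y
# but does not show how VOWELS is actually defined.
VOWELS = frozenset("aeiouy")


def is_single_vowel(token: str) -> bool:
    """Hypothetical stand-in for FragmentCleaner.is_single_vowel."""
    # Assumption: comparison is case-insensitive.
    return len(token) == 1 and token.lower() in VOWELS


print(is_single_vowel("a"))   # True
print(is_single_vowel("b"))   # False
print(is_single_vowel("ae"))  # False
```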

clean_fragments(fragments)

Perform mechanical cleanup by merging single-letter fragments.

Applies two merging rules in sequence:

1. Merge isolated single vowels with the following fragment
2. Merge isolated single consonants with the following fragment

This reconstructs more phonetically coherent syllables from NLTK’s onset/coda-based over-segmentation.

Parameters: fragments (list[str]) – List of syllable fragments (possibly over-segmented).
Returns: List of cleaned fragments with single letters merged.
Return type: list[str]

Example

>>> cleaner = FragmentCleaner()
>>> # Example 1: Single vowel merging
>>> cleaner.clean_fragments(["i", "down"])
['idown']
>>>
>>> # Example 2: Single consonant merging
>>> cleaner.clean_fragments(["r", "abbit"])
['rabbit']
>>>
>>> # Example 3: Mixed fragments
>>> cleaner.clean_fragments(["cha", "pter", "i", "down", "the", "r", "a"])
['cha', 'pter', 'idown', 'the', 'ra']
>>>
>>> # Example 4: Preserve multi-character fragments
>>> cleaner.clean_fragments(["hel", "lo", "world"])
['hel', 'lo', 'world']

Note

Fragments are processed left-to-right
Single letters merge with next fragment (if available)
A single-letter last fragment is never merged forward (no next fragment available), but a preceding single letter can still merge into it
Empty input returns empty output
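
The two-pass behaviour described above can be sketched as follows. This is an illustration written from the documented rules, not the class's actual code: `_merge_forward` and `clean_fragments_sketch` are hypothetical names, VOWELS is assumed to hold the six listed letters, and the handling of consecutive single letters (the merged pair is not re-examined within a pass) is a choice made for this sketch.

```python
VOWELS = frozenset("aeiouy")  # assumed from the documented vowel list


def _merge_forward(fragments, should_merge):
    """Merge each fragment matching should_merge into the fragment after it."""
    merged = []
    i = 0
    while i < len(fragments):
        current = fragments[i]
        has_next = i + 1 < len(fragments)
        if has_next and should_merge(current):
            merged.append(current + fragments[i + 1])
            i += 2  # the next fragment has been consumed by the merge
        else:
            merged.append(current)  # multi-character or last fragment
            i += 1
    return merged


def clean_fragments_sketch(fragments):
    """Pass 1 merges single vowels, pass 2 merges single consonants."""
    def is_vowel(t):
        return len(t) == 1 and t.lower() in VOWELS

    def is_consonant(t):
        return len(t) == 1 and t.isalpha() and t.lower() not in VOWELS

    after_vowels = _merge_forward(list(fragments), is_vowel)
    return _merge_forward(after_vowels, is_consonant)


print(clean_fragments_sketch(["cha", "pter", "i", "down", "the", "r", "a"]))
# ['cha', 'pter', 'idown', 'the', 'ra']
print(clean_fragments_sketch([]))
# []
```

Running the vowel pass first explains why the trailing "a" in the mixed example survives pass 1 (it has no next fragment) and is then absorbed by the preceding "r" in pass 2.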

clean_fragments_from_file(input_path, output_path)

Clean fragments from input file and write to output file.

Convenience method for file-based processing. Reads one fragment per line from input file, applies cleaning, and writes cleaned fragments to output file (one per line).

Parameters:

input_path (str) – Path to input file (one fragment per line).
output_path (str) – Path to output file for cleaned fragments.

Returns:

Tuple of (original_count, cleaned_count): the number of fragments read and the number written after cleaning. The difference is the number of merges performed.

Raises:

FileNotFoundError – If input file doesn’t exist.
IOError – If there’s an error reading or writing files.

Return type:

tuple[int, int]

Example

>>> # input.txt contains:
>>> # i
>>> # down
>>> # the
>>> # r
>>> # abbit
>>>
>>> cleaner = FragmentCleaner()
>>> original, cleaned = cleaner.clean_fragments_from_file(
...     "input.txt", "output.txt"
... )
>>> print(f"Cleaned {original} → {cleaned} fragments")
Cleaned 5 → 3 fragments
>>>
>>> # output.txt now contains:
>>> # idown
>>> # the
>>> # rabbit
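
The read, clean, write flow can be tried without touching real files by using a temporary directory. In this sketch `clean_file_sketch` and `merge_single_letters` are hypothetical names: the latter is a simplified single-pass stand-in for the cleaner, and skipping blank lines and using UTF-8 are assumptions that the documentation does not confirm.

```python
import tempfile
from pathlib import Path


def merge_single_letters(fragments):
    """Simplified stand-in cleaner: merge any single letter into the next fragment."""
    out, i = [], 0
    while i < len(fragments):
        tok = fragments[i]
        if len(tok) == 1 and tok.isalpha() and i + 1 < len(fragments):
            out.append(tok + fragments[i + 1])
            i += 2
        else:
            out.append(tok)
            i += 1
    return out


def clean_file_sketch(input_path, output_path, clean=merge_single_letters):
    """Read one fragment per line, clean, write one per line, return counts."""
    # read_text raises FileNotFoundError if the input file is missing.
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    # Assumption: blank lines are ignored rather than treated as fragments.
    fragments = [line.strip() for line in lines if line.strip()]
    cleaned = clean(fragments)
    Path(output_path).write_text(
        "".join(f + "\n" for f in cleaned), encoding="utf-8"
    )
    return len(fragments), len(cleaned)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text("i\ndown\nthe\nr\nabbit\n", encoding="utf-8")
    counts = clean_file_sketch(src, dst)
    written = dst.read_text(encoding="utf-8").splitlines()

print(counts)   # (5, 3)
print(written)  # ['idown', 'the', 'rabbit']
```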