
New: iter_graphemes() #165

Merged
jquast merged 8 commits into master from jq/next-new-grapheme
Jan 17, 2026

@jquast jquast commented Jan 14, 2026

Add iter_graphemes() function for Unicode grapheme cluster iteration following UAX #29. Enables segmentation of "user-perceived" characters: emoji sequences, combining marks, regional indicators, Indic conjuncts.

  • New file grapheme.py contains the core algorithm.
  • New file table_grapheme.py is auto-generated by bin/update-tables.py.
  • New file bisearch.py is extracted from wcwidth.py so that grapheme.py can share it.
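
The contents of bisearch.py are not shown in this description. As a rough sketch of the kind of lookup such a helper performs, the following does a binary search over a sorted table of inclusive code point ranges. The names `in_ranges` and `DEMO_TABLE` are hypothetical, and the table lists only three combining-mark block ranges for illustration; it is not the generated table from this PR.

```python
def in_ranges(codepoint, table):
    """Return True if codepoint falls inside any (start, end) range.

    Assumes `table` is sorted and its inclusive ranges do not overlap.
    """
    lo, hi = 0, len(table) - 1
    # Fast rejection for values outside the whole table.
    if not table or codepoint < table[0][0] or codepoint > table[hi][1]:
        return False
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end = table[mid]
        if codepoint > end:
            lo = mid + 1
        elif codepoint < start:
            hi = mid - 1
        else:
            return True
    return False


# Hypothetical demo table: three Unicode blocks of combining marks.
DEMO_TABLE = (
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1AB0, 0x1AFF),  # Combining Diacritical Marks Extended
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
)

print(in_ranges(0x0301, DEMO_TABLE))    # combining acute accent
print(in_ranges(ord('e'), DEMO_TABLE))  # plain ASCII letter
```

Returning a boolean here is a choice made for the sketch; a 0/1 integer result would behave the same way in a truth test.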

A few examples from docs/intro.rst:

  >>> # cafe + combining acute accent
  >>> list(iter_graphemes('cafe\u0301'))
  ['c', 'a', 'f', 'é']
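
To illustrate why the accent stays attached to its base letter, here is a deliberately reduced sketch that groups only combining marks with the preceding character, using the standard library's `unicodedata.combining()`. It is not the UAX #29 algorithm from grapheme.py: it ignores Hangul, ZWJ sequences, regional indicators, and Indic conjuncts. The function name `iter_combining_clusters` is hypothetical.

```python
import unicodedata


def iter_combining_clusters(text):
    """Yield base characters together with any trailing combining marks.

    Simplified illustration only; covers a small part of what
    grapheme cluster segmentation requires.
    """
    cluster = ''
    for ch in text:
        # A non-zero combining class means the mark attaches to the
        # character before it, so extend the current cluster.
        if cluster and unicodedata.combining(ch):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


clusters = list(iter_combining_clusters('cafe\u0301'))
print(len(clusters))  # 4 clusters from 5 code points
```

The last cluster is the two code points `'e'` and U+0301, which a terminal renders as a single accented letter.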

Implements Unicode Standard Annex #29 grapheme cluster boundaries.
Handles Hangul syllables, emoji ZWJ sequences, regional indicators,
combining characters, and Indic scripts.

New exports: iter_graphemes, _bisearch
@jquast jquast marked this pull request as ready for review January 14, 2026 22:59
We suggested using ``wcwidth<2`` for years, when it should have been
``wcwidth<1``. I really hope nobody copied & pasted our
recommendation .. :(
It's a private function anyway, so this is still OK.

Below the turtles, 0/1 is very much the definition of Falsey and Truthy.
@jquast jquast changed the title New: iter_graphemes() function New: iter_graphemes() Jan 15, 2026
@jquast jquast merged commit 875011d into master Jan 17, 2026
36 checks passed
@jquast jquast deleted the jq/next-new-grapheme branch January 17, 2026 17:45
jquast added a commit that referenced this pull request Jan 17, 2026
- Add new `width()` function for measuring terminal-aware strings, with support for control codes, escape sequences (SGR, OSC, CSI), cursor movement, and tab stops. 
- Add `iter_sequences()` function to iterate over text containing escape sequences
- New file, `control_codes.py` for control characters, categorized
- New file `escape_sequences.py` for terminal sequence patterns, categorized
- Extract `_bisearch`, duplicating #165

A few examples from docs/intro.rst:

    >>> wcwidth.width('\x1b[38;2;255;150;100mWARN\x1b[0m')
    4

    >>> list(wcwidth.iter_sequences('\x1b[31mred\x1b[0m'))
    [('\x1b[31m', True), ('red', False), ('\x1b[0m', True)]

    >>> wcwidth.width('\U0001F1FF\U0001F1FC')
    2
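
As a rough sketch of how text can be split into escape sequences and plain runs, shaped like the `iter_sequences()` output above, the following uses a regular expression that matches CSI sequences only. It is an assumption-laden illustration, not the patterns in escape_sequences.py: OSC and other sequence types are not handled, and the names `CSI_PATTERN` and `split_csi` are hypothetical.

```python
import re

# ESC [ then parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
# and a single final byte (0x40-0x7E). CSI sequences only.
CSI_PATTERN = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]')


def split_csi(text):
    """Yield (segment, is_sequence) pairs for CSI sequences and plain text."""
    pos = 0
    for match in CSI_PATTERN.finditer(text):
        if match.start() > pos:
            yield (text[pos:match.start()], False)
        yield (match.group(), True)
        pos = match.end()
    if pos < len(text):
        yield (text[pos:], False)


segments = list(split_csi('\x1b[31mred\x1b[0m'))
visible = ''.join(seg for seg, is_seq in segments if not is_seq)
print(visible)
```

Joining the non-sequence segments recovers the visible text, which is the part a width measurement would then need to examine.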
jquast added a commit that referenced this pull request Jan 17, 2026
The new ``wrap()`` function is a version of textwrap.wrap() that is aware of emoji, control and terminal sequences, wide and zero-width characters, and grapheme clusters. This PR builds on #168 and #165 combined.

    >>> # Wrapping CJK text (each character is 2 cells wide)
    >>> wrap('コンニチハ', 4)
    ['コン', 'ニチ', 'ハ']

    >>> # Text with ANSI color sequences
    >>> wrap('\x1b[31mhello world\x1b[0m', 5)
    ['\x1b[31mhello', 'world\x1b[0m']
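
To show why wrapping by cell width differs from wrapping by character count, here is a minimal sketch that breaks text at a cell budget using the standard library's `unicodedata.east_asian_width()`. It makes strong assumptions: every character is 1 or 2 cells wide, there are no escape sequences, zero-width characters, or multi-code-point graphemes, and lines break at any character rather than at word boundaries. The names `cell_width` and `wrap_cells` are hypothetical and this is not how ``wrap()`` is implemented.

```python
import unicodedata


def cell_width(ch):
    """Treat East Asian Wide and Fullwidth characters as 2 cells, others as 1."""
    return 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1


def wrap_cells(text, width):
    """Break text into lines whose total cell width does not exceed `width`."""
    lines, line, used = [], '', 0
    for ch in text:
        w = cell_width(ch)
        # Start a new line when the next character would overflow.
        if line and used + w > width:
            lines.append(line)
            line, used = '', 0
        line += ch
        used += w
    if line:
        lines.append(line)
    return lines


print(wrap_cells('コンニチハ', 4))
```

With a budget of 4 cells, only two of these characters fit per line, whereas counting characters would allow four.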