Characters for Humans ⁕ Nova Patch

The \X Files: programming with extended grapheme clusters and emoji sequences.

A “character” can mean different things to different people, but the largest disparity is between applications and the humans who use them. Programmers aren’t to blame, as our programming languages, libraries, and databases provide little or no support for understanding user-perceived characters. Systems disagree on the basic unit of a character: some use code points, others use code units, and others still operate on individual bytes by default. This frequently leads to products with a poor experience in some users’ languages, especially written languages that use grapheme clusters: sequences of code points that compose a single user-perceived character. With the rise in global emoji usage and the rapid evolution of standard emoji sequences, this problem is increasingly experienced by users worldwide, regardless of their language.
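As a rough illustration of that disagreement, the following Python sketch counts one string three ways: in code points, in UTF-16 code units, and in UTF-8 bytes. The helper name `unit_counts` and the two sample strings are made up for this example and do not come from the talk. Each sample is normally perceived by a reader as a single character, yet none of the three counts is 1.

```python
def unit_counts(text: str) -> dict:
    """Count the same text in three common 'character' units.

    Hypothetical helper for illustration only.
    """
    return {
        # Python's str is a sequence of code points, so len() counts those.
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; "-le" avoids a byte order mark.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 uses 1 to 4 bytes per code point.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster.
accented_e = "e\u0301"

# U+1F469 WOMAN + U+200D ZERO WIDTH JOINER + U+1F4BB PERSONAL COMPUTER:
# one emoji ZWJ sequence.
technologist = "\U0001F469\u200D\U0001F4BB"

if __name__ == "__main__":
    for label, sample in [("accented e", accented_e), ("technologist", technologist)]:
        print(label, unit_counts(sample))
```

Counting user-perceived characters themselves requires the segmentation rules in UAX #29, which Python's standard library does not implement, so this sketch only shows where the lower-level units diverge.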

This presentation covers:

Extended grapheme clusters and emoji sequences
Programming with these user-perceived characters
Data input, parsing, analysis, formatting, and output
Setting product requirements for character support
Examples from Shutterstock’s platforms for content editing and collaboration

Resources

Fabric.js: Open source HTML5 Canvas library
Shutterstock Editor: Browser-based editing of images and text, including emoji
Editor emoji examples: new emoji (json), dolphin bicycle, cinco de mayo, cinco de mayo?, and halloween
Perl 6: strings, regular expressions, and Unicode
Swift: strings, characters, and strings and characters
Unicode Line Breaking Algorithm (UAX #14)
Unicode Regular Expressions (UTS #18)
Unicode Text Segmentation (UAX #29)
Unicode Character Database (UAX #44)
Unicode Emoji (UTS #51)

Presented at

2017-10-18: Internationalization & Unicode Conference 41 (IUC41), Santa Clara, CA
2017-06-21: The Perl Conference (YAPC::NA), Washington, DC

Nova Patch (@novapatch) is a principal engineer at Shutterstock, specializing in internationalization, multilingual search, and building products that support the world’s languages, writing systems, and cultures. They are an open source developer, contributor to the Unicode CLDR, and member of the Unicode Consortium.