The \X Files: programming with extended grapheme clusters and emoji sequences.

A “character” can mean different things to different people, but the largest disparity is between applications and the humans who use them. Programmers aren’t to blame, as our programming languages, libraries, and databases provide little or no support for understanding user-perceived characters. Many systems disagree on the basic unit of a character: some use code points, others use code units, and others still operate on individual bytes by default. This frequently leads to products with a poor experience in some users’ languages, especially written languages that rely on grapheme clusters, which are sequences of code points that compose a single user-perceived character. With the rise in global emoji usage and the rapid evolution of standard emoji sequences, this problem is increasingly experienced by users worldwide, regardless of their language.
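
As a rough illustration of that disagreement, the sketch below counts the same text in three different units. It is not taken from the presentation: the function name `unit_counts` and the sample strings are assumptions chosen for this example, and it uses only the Python standard library. Each sample is a single user-perceived character, yet none of the three counts comes out as 1.

```python
def unit_counts(text: str) -> dict:
    """Count one string in three common units (hypothetical helper)."""
    return {
        # Python's len() counts Unicode code points.
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 uses 1 to 4 bytes per code point.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Each sample renders as ONE user-perceived character (one grapheme cluster).
samples = {
    "e + combining acute accent": "e\u0301",
    "flag (two regional indicators)": "\U0001F1FA\U0001F1F8",
    "family (woman ZWJ woman ZWJ girl)": "\U0001F469\u200D\U0001F469\u200D\U0001F467",
}

for label, text in samples.items():
    print(label, unit_counts(text))
```

Python's standard library does not segment text into extended grapheme clusters, so the sketch stops at the three machine-level counts. Counting user-perceived characters needs a segmentation step on top, such as a regular expression engine that supports `\X`.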

This presentation covers:

  • Extended grapheme clusters and emoji sequences
  • Programming with these user-perceived characters
  • Data input, parsing, analysis, formatting, and output
  • Setting product requirements for character support
  • Examples from Shutterstock’s platforms for content editing and collaboration

Presented at

  • 2017-10-18: Internationalization & Unicode Conference 41 (IUC41), Santa Clara, CA
  • 2017-06-21: The Perl Conference (YAPC::NA), Washington, DC

Nova Patch (@novapatch) is a principal engineer at Shutterstock, specializing in internationalization, multilingual search, and building products that support the world’s languages, writing systems, and cultures. They are an open source developer, contributor to the Unicode CLDR, and member of the Unicode Consortium.