Fixing JavaScript's Unicode/Emoji Problem

Sallar Kaboli / 2016-08-15 18:17:56

A while ago I needed a simple string pad function, and since it’s a freakishly simple task, I just wrote a simple function:

function limit(str, limit = 16, padString = "#", padPosition = "right") {
    const strLength = str.length;

    if (strLength > limit) {
        return str.substring(0, limit);
    } else if (strLength < limit) {
        const padRepeats = padString.repeat(limit - strLength);
        return (padPosition === "left") ? padRepeats + str : str + padRepeats;
    }
    return str;
}

Pretty simple, right? And it works too… until you add an emoji to your text. Then everything falls apart and worlds collide! How? Simple:

"💩".length // 2!

(If you’re reading this on Linux without an emoji font installed in your browser or OS, you might actually see two broken Unicode characters, in which case the result will look like it makes sense.)
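To make the failure concrete, here is a small sketch that runs the function from above on emoji input. The function body is copied from this post; the variable names `padded` and `cut` are only for illustration, and the emoji is written as an escape so the example does not depend on your font.

```javascript
// The padding function from the start of this post, unchanged.
function limit(str, limit = 16, padString = "#", padPosition = "right") {
    const strLength = str.length;

    if (strLength > limit) {
        return str.substring(0, limit);
    } else if (strLength < limit) {
        const padRepeats = padString.repeat(limit - strLength);
        return (padPosition === "left") ? padRepeats + str : str + padRepeats;
    }
    return str;
}

const poo = "\u{1F4A9}"; // 💩

// Asked for 3 characters, but only one "#" is added,
// because the emoji already counts as 2.
const padded = limit(poo, 3);

// Cutting at 3 slices the second emoji in half and leaves
// a lone surrogate at the end of the string.
const cut = limit(poo + poo, 3);

console.log(padded, cut.length);
```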

So how can that be? String.length in JavaScript counts the “code units” in a string, not the characters you see. JavaScript strings are sequences of 16-bit UTF-16 code units, and most emoji (including 💩) sit above U+FFFF, outside the Basic Multilingual Plane, so each of them takes two code units. Take it from this great article:

Internally, JavaScript represents astral symbols as surrogate pairs, and it exposes the separate surrogate halves as separate “characters”. If you represent the symbols using nothing but ECMAScript 5-compatible escape sequences, you’ll see that two escapes are needed for each astral symbol. This is confusing, because humans generally think in terms of Unicode symbols or graphemes instead.

Read more here.
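The surrogate halves mentioned in the quote are easy to inspect. The snippet below is only an illustration (the variable names are mine, not from the quoted article) of how one astral symbol shows up as two code units but a single code point:

```javascript
const poo = "\u{1F4A9}"; // 💩, code point U+1F4A9

// .length counts UTF-16 code units, so one astral symbol reports 2.
const codeUnits = poo.length;

// The two units are the high and low surrogate halves.
const halves = [poo.charCodeAt(0), poo.charCodeAt(1)].map((n) => n.toString(16));

// codePointAt reads the pair back as a single code point.
const codePoint = poo.codePointAt(0).toString(16);

// An ES5-compatible escape needs two \u sequences for the one symbol.
const sameString = poo === "\uD83D\uDCA9";

console.log(codeUnits, halves, codePoint, sameString);
// 2 [ 'd83d', 'dca9' ] 1f4a9 true
```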

Trying to Fix It

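ES6 goes part of the way toward solving this problem. Strings are now iterable by code point rather than by code unit, and Array.from uses that iterator: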

Array.from("💩").length; // 1

Yes! As you can see, ES6 recognizes astral symbols correctly… Mostly. If you pass a “color variation” emoji (one with a skin-tone modifier) to that function, this happens:

Array.from("💅🏼"); // ["💅", "🏼"]

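So again the length will be 2. This time surrogate pairs aren’t the culprit: ES6 splits the string into code points correctly, but the emoji and its skin-tone modifier are two separate code points that render as a single grapheme, and ES6 iteration knows nothing about graphemes.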
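Looking at the code points makes it clearer what Array.from is returning here. This is just a quick sketch with variable names of my own choosing; the emoji is written with escapes so the two parts are visible:

```javascript
// 💅🏼 is two code points: the nail polish emoji followed by
// the medium-light skin tone modifier.
const nails = "\u{1F485}\u{1F3FC}";

// Array.from iterates by code point, so it sees two items.
const codePoints = Array.from(nails, (ch) => ch.codePointAt(0).toString(16));

// Each of those code points is itself a surrogate pair,
// so .length reports 4 code units.
const codeUnits = nails.length;

console.log(codePoints, codeUnits);
// [ '1f485', '1f3fc' ] 4
```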

So after that, I started looking for npm packages that try to fix the problem using regular expressions. There are plenty of them available, but after trying them, I found that almost all of them have the same problem as ES6 strings, so I submitted a lot of issues on GitHub. The only package that did everything correctly was the awesome Lodash:

_.toArray("💅🏼"); // ["💅🏼"]

The Solution

Since Lodash is a fairly big package and I needed a small solution, I borrowed Lodash’s complex RegExp and made my own simple string tools package called Stringz. It comes with a few helpers that make Unicode string padding and cutting much easier. But obviously, if you already have Lodash installed, go ahead and use that.

Stringz.limit("👍🏽👍🏽", 4, "👍🏽"); // "👍🏽👍🏽👍🏽👍🏽" 
Stringz.substring("Emojis 👍🏽 are 🍆 poison. 🌮s are bad.", 7, 14); // "👍🏽 are 🍆"
Stringz.length("💅🏼"); // 1

Stringz is released under the MIT License and can be installed using npm install stringz. Thanks to the Lodash RegExp :-)
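If you can't add a dependency, newer JavaScript runtimes ship Intl.Segmenter, which splits a string into grapheme clusters natively. The sketch below is not how Stringz or Lodash work (they rely on a RegExp); it is an alternative that assumes your runtime supports Intl.Segmenter. The helper names `graphemes` and `limitGraphemes` are made up for this example, and `padString` is assumed to be a single grapheme.

```javascript
// Split a string into user-perceived characters (grapheme clusters).
function graphemes(str) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(segmenter.segment(str), (s) => s.segment);
}

// Same idea as the limit() function at the top of this post,
// but measuring and cutting by grapheme instead of by code unit.
function limitGraphemes(str, limit = 16, padString = "#", padPosition = "right") {
    const chars = graphemes(str);

    if (chars.length > limit) {
        return chars.slice(0, limit).join("");
    }
    // Assumes padString is exactly one grapheme long.
    const padding = padString.repeat(limit - chars.length);
    return (padPosition === "left") ? padding + str : str + padding;
}

const thumb = "\u{1F44D}\u{1F3FD}"; // 👍🏽
console.log(graphemes(thumb).length);                  // 1
console.log(limitGraphemes(thumb + thumb, 4, thumb));  // four 👍🏽
```

Called with the same arguments as the Stringz.limit example above, this sketch pads two thumbs up to four, and cutting never splits an emoji from its skin-tone modifier.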

Same Issue in the Wild

Many websites that rely on counting input characters get it wrong, including Twitter. They have obviously tried to solve the problem, but they still have issues with the color variations:

Twitter

As you can see, the remaining character count is 136 when it should be 138. Hopefully Twitter fixes this problem so you can safely tweet 140 colored emojis at once 🙃.