What is a String?

JavaScript strings are slightly more complex than people sometimes think they are. That comes up in relation to various topics I like to write about, so I thought a post talking about the nature of JavaScript strings would be handy as something I could refer to from future posts.

If you've already read Chapter 10 of JavaScript: The New Toys, there's nothing new here for you and you can move on. If you haven't (why not? 😉), read on...

TL;DR - JavaScript uses Unicode, but the details of how it uses it mean that a "character" in a JavaScript string is not necessarily the entire character. This has implications for doing things like splitting a string into individual characters or just generally isolating a character from the string (for instance, to make the first character uppercase). Naïve handling of JavaScript strings can break their contents.

The details:

JavaScript uses Unicode for text. Unicode describes a very wide range of characters, or more generally code points, in order to accommodate the wide range of alphabets, writing systems, and even emojis in the world. Originally, code points were numbers in the range 0 to 0xFFFF and so they fit in 16 bits. That wasn't sufficient, though, so early on the range was extended to its current 0 to 0x10FFFF (inclusive), which is effectively 21 significant bits. (For now; it could grow, though there's no plan for it to at the moment.)

To avoid forcing systems to use three or four bytes (for alignment) for every character, Unicode defines transformation formats that encode those 21 bits into one or more code units of smaller size. UTF-8 uses byte-sized (8-bit) code units that can represent a lot of western text characters in one byte, but may require two, three, or even four bytes for some other characters. (For instance, the winking emoji code point I used above takes four code units in UTF-8: 0xF0 0x9F 0x98 0x89.) UTF-16 uses 16-bit code units, so it may require one or two code units depending on the code point being represented. That same winking face is 0xD83D 0xDE09 in UTF-16. (There's a lot more to it than this, of course.)
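If you'd like to see those code units for yourself, here's a minimal sketch. It assumes an environment with a global TextEncoder (modern browsers and Node.js have one), and the variable names are just ones I picked for illustration:

```javascript
// The winking face, U+1F609, written with a code point escape.
const winkEmoji = "\u{1F609}";

// UTF-16: read each 16-bit code unit of the string as hex.
const utf16Units = [];
for (let i = 0; i < winkEmoji.length; i++) {
    utf16Units.push(winkEmoji.charCodeAt(i).toString(16).toUpperCase());
}
console.log(utf16Units); // ["D83D", "DE09"]

// UTF-8: TextEncoder always encodes to UTF-8, one byte per code unit.
const utf8Units = Array.from(
    new TextEncoder().encode(winkEmoji),
    (byte) => byte.toString(16).toUpperCase()
);
console.log(utf8Units); // ["F0", "9F", "98", "89"]
```

Same code point, two code units in one format and four in the other.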

Coming back to JavaScript, the specification says a JavaScript string is:

...a finite ordered sequence of zero or more 16-bit unsigned integer values...Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text.

That means sometimes, it takes two "characters" of a JavaScript string to represent a code point:

const wink = "😉";
console.log(wink.length); // 2
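To make the distinction between code units and code points a bit more concrete, here's a small sketch comparing the two views of the same string. The variable names are mine, purely for illustration:

```javascript
const winkFace = "\u{1F609}"; // the same string as "😉"

// Code unit view: length, indexing, and charCodeAt all work in 16-bit units.
const unitCount = winkFace.length;             // 2
const firstUnit = winkFace.charCodeAt(0);      // 0xD83D (high surrogate)
const secondUnit = winkFace.charCodeAt(1);     // 0xDE09 (low surrogate)
const firstIsHalf = winkFace[0] === "\uD83D";  // true: indexing gives half a pair

// Code point view: codePointAt and string iteration understand surrogate pairs.
const codePoint = winkFace.codePointAt(0);     // 0x1F609
const codePointCount = [...winkFace].length;   // 1

console.log(unitCount, codePointCount);        // 2 1
console.log(codePoint.toString(16));           // "1f609"
```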

People often say that JavaScript strings are UTF-16, which is almost true, but in UTF-16 you can't have a surrogate code unit on its own, whereas JavaScript doesn't enforce that limitation or any other limitation other than that the values are 16-bit integers. This leads some folks to say that strings are in the obsolete UCS-2 format rather than UTF-16. But JavaScript does interpret strings for some operations, and when it does it interprets them as UTF-16, not UCS-2. Similarly, environments where JavaScript is used usually treat them as UTF-16. So I prefer to say they're UTF-16, but they tolerate invalid surrogates.
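Here's a rough sketch of what "tolerating invalid surrogates" looks like in practice. It assumes a global TextEncoder is available (as it is in modern browsers and Node.js), and the names are illustrative only:

```javascript
// A high surrogate on its own: not valid UTF-16, but a perfectly legal JavaScript string.
const loneSurrogate = "\uD83D";
const loneLength = loneSurrogate.length;             // 1
const loneCodePoint = loneSurrogate.codePointAt(0);  // 0xD83D, returned as-is

// Add the matching low surrogate and the pair is interpreted as UTF-16.
const rejoined = loneSurrogate + "\uDE09";
const rejoinedCodePoint = rejoined.codePointAt(0);   // 0x1F609

// Leaving the string world is where it shows: encoding to UTF-8 replaces the
// lone surrogate with U+FFFD (the replacement character), bytes EF BF BD.
const encodedLone = Array.from(new TextEncoder().encode(loneSurrogate));

console.log(loneLength, loneCodePoint.toString(16));         // 1 "d83d"
console.log(rejoinedCodePoint.toString(16));                 // "1f609"
console.log(encodedLone.map((byte) => byte.toString(16)));   // ["ef", "bf", "bd"]
```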

Either way, it means that you can't assume one "character" is a character on its own. For instance, a naïve version of "reversing" a string often looks like this:

const reverse = str => str.split("").reverse().join("");

But that will mess up strings containing surrogate pairs (and other things):

const wink = "😉";
const reversedWink = reverse(wink);
console.log(reversedWink); // Outputs two "unknown character" glyphs
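One common way to avoid breaking up surrogate pairs is to split with string iteration (spread), which walks the string by code point rather than by code unit. This is only a sketch, and reverseByCodePoint is a name I made up for it; it still isn't a complete solution, because code points aren't the whole story:

```javascript
// Naive version: splits between the two halves of a surrogate pair.
const naiveReverse = (str) => str.split("").reverse().join("");

// Code-point-aware version: spread uses the string iterator, which keeps
// surrogate pairs together.
const reverseByCodePoint = (str) => [...str].reverse().join("");

const sample = "a\u{1F609}b"; // "a😉b"

const naiveResult = naiveReverse(sample);
const betterResult = reverseByCodePoint(sample);

console.log(naiveResult === "b\uDE09\uD83Da");  // true: the surrogates are now in the wrong order
console.log(betterResult === "b\u{1F609}a");    // true: "b😉a"
```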

But it goes beyond just surrogate pairs. Unicode code points aren't necessarily "characters" on their own, even when you don't break up their surrogate pairs. Sometimes it takes more than one code point to make what someone looking at the text will think of as a "character." If you'd like to know more about it, continue reading with Splitting Strings in 2021.
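As a quick illustration of that last point, here's a sketch using an "e" followed by a combining acute accent, which a reader sees as one character. The Intl.Segmenter part is guarded because I'm assuming it may not exist in every environment; the names are just for illustration:

```javascript
// "e" (U+0065) followed by a combining acute accent (U+0301): displays as one "é".
const accentedE = "e\u0301";

// Splitting by code point keeps each code point whole, but there are still two.
const accentCodePoints = [...accentedE].length;                // 2

// So a code-point-based reverse moves the accent in front of the "e".
const reversedAccent = [...accentedE].reverse().join("");      // "\u0301e"

// Where it's available, Intl.Segmenter can count what a reader would call characters.
let graphemeCount = null;
if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    graphemeCount = [...segmenter.segment(accentedE)].length;
}

console.log(accentCodePoints);               // 2
console.log(reversedAccent === "\u0301e");   // true
console.log(graphemeCount);                  // 1 where Intl.Segmenter is available, otherwise null
```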

Happy coding!
