What is a String?

Posted 2021-01-26 by T.J. Crowder
Updated 2021-06-19
Tags: javascript string unicode

JavaScript strings are slightly more complex than people sometimes think they are. That comes up in relation to various topics I like to write about, so I thought a post talking about the nature of JavaScript strings would be handy as something I could refer to from future posts.

If you've already read Chapter 10 of JavaScript: The New Toys, there's nothing new here for you and you can move on. If you haven't (why not? 😉), read on...

TL;DR - JavaScript uses Unicode, but the details of how it uses it mean that a "character" in a JavaScript string is not necessarily the entire character. This has implications for doing things like splitting a string into individual characters or just generally isolating a character from the string (for instance, to make the first character uppercase). Naïve handling of JavaScript strings can break their contents.

The details:

JavaScript uses Unicode for text. Unicode describes a very wide range of characters, or more generally code points, in order to accommodate the wide range of alphabets, writing systems, and even emojis in the world. Originally, code points were numbers in the range 0 to 0xFFFF and so they fit in 16 bits. That wasn't sufficient, though, so early on the range was extended to its current 0 to 0x10FFFF (inclusive), which is effectively 21 significant bits. (For now; it could grow, though there's no plan for that at the moment.) To avoid forcing systems to use three or four bytes (for alignment) for every character, Unicode defines transformation formats that encode those 21 bits into one or more code units of smaller size. UTF-8 uses byte-sized (8-bit) code units: ASCII characters, which cover a lot of Western text, take one byte each, but other characters require two, three, or even four bytes. (For instance, the winking emoji code point I used above takes four code units in UTF-8: 0xF0 0x9F 0x98 0x89.) UTF-16 uses 16-bit code units, so it may require one or two code units depending on the code point being represented. That same winking face is 0xD83D 0xDE09 in UTF-16. (There's a lot more to it than this, of course.)
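If you want to check those code unit values yourself, here's a minimal sketch. It assumes an environment where TextEncoder is available as a global (modern browsers and Node.js); the hex helper and the variable names are just mine for illustration:

```javascript
// The winking face, U+1F609, written as an escape so the source encoding doesn't matter
const wink = "\u{1F609}";

// Hypothetical helper: format a number as 0x-prefixed uppercase hex
const hex = (n) => "0x" + n.toString(16).toUpperCase();

// UTF-8 code units (bytes): TextEncoder always encodes to UTF-8
const utf8Units = Array.from(new TextEncoder().encode(wink), hex);

// UTF-16 code units: charCodeAt reads the string one 16-bit unit at a time
const utf16Units = [];
for (let i = 0; i < wink.length; ++i) {
    utf16Units.push(hex(wink.charCodeAt(i)));
}

console.log(hex(wink.codePointAt(0))); // 0x1F609
console.log(utf8Units.join(" "));      // 0xF0 0x9F 0x98 0x89
console.log(utf16Units.join(" "));     // 0xD83D 0xDE09
```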

Coming back to JavaScript, a JavaScript string is:

...a finite ordered sequence of zero or more 16-bit unsigned integer values...Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text.

That means sometimes, it takes two "characters" of a JavaScript string to represent a code point:

const wink = "😉";
console.log(wink.length); // 2
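To see the difference between code units and code points, you can compare the index-based methods with string iteration, which works through the string by code point rather than by code unit. This is just a sketch; countCodePoints is a name I made up for the example:

```javascript
const wink = "\u{1F609}"; // The winking face, U+1F609

// Index-based access works in code units
console.log(wink.length);                       // 2
console.log(wink.charCodeAt(0).toString(16));   // d83d (the leading surrogate)
console.log(wink.charCodeAt(1).toString(16));   // de09 (the trailing surrogate)

// codePointAt combines a surrogate pair starting at the given index
console.log(wink.codePointAt(0).toString(16));  // 1f609

// String iteration (spread, for-of, Array.from) visits code points, not code units
const codePoints = [...wink];
console.log(codePoints.length);                 // 1

// Hypothetical helper: count code points rather than code units
const countCodePoints = (str) => [...str].length;
console.log(countCodePoints("a" + wink + "b")); // 3
```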

People often say that JavaScript strings are UTF-16, which is almost true, but in UTF-16 you can't have a surrogate code unit on its own, whereas JavaScript doesn't enforce that limitation, or any limitation other than that the values are 16-bit integers. This leads some folks to say that strings are in the obsolete UCS-2 format rather than UTF-16. But JavaScript does interpret strings for some operations, and when it does, it interprets them as UTF-16, not UCS-2. Similarly, environments where JavaScript is used usually treat them as UTF-16. So I prefer to say they're UTF-16, but they tolerate invalid surrogates.
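Here's a small sketch of both halves of that: a string holding a lone surrogate is perfectly legal, and operations that interpret the string treat it as UTF-16. It assumes TextEncoder is available as a global (modern browsers and Node.js), and the variable names are mine:

```javascript
// A lone leading surrogate: invalid as UTF-16, but a valid JavaScript string
const lone = "\uD83D";
console.log(lone.length);                      // 1
console.log(lone.charCodeAt(0) === 0xD83D);    // true

// Put the two surrogates next to each other and you get the winking face
const pair = "\uD83D" + "\uDE09";
console.log(pair === "\u{1F609}");             // true
console.log(pair.codePointAt(0).toString(16)); // 1f609 (interpreted as UTF-16)

// Converting to UTF-8 can't preserve the lone surrogate, so TextEncoder
// substitutes U+FFFD (the replacement character), which is EF BF BD in UTF-8
const bytes = Array.from(new TextEncoder().encode(lone));
console.log(bytes.map((b) => b.toString(16)).join(" ")); // ef bf bd
```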

Either way, it means that you can't assume one "character" is a character on its own. For instance, a naïve version of "reversing" a string often looks like this:

const reverse = str => str.split("").reverse().join("");

But that will mess up strings containing surrogate pairs (and other things):

const wink = "😉";
const reversedWink = reverse(wink);
console.log(reversedWink); // Outputs two "unknown character" glyphs
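One way to avoid breaking surrogate pairs is to split by code point instead of by code unit, since string iteration works by code point. A minimal sketch (reverseByCodePoint is just a name for this example):

```javascript
// Spread uses the string iterator, which keeps surrogate pairs together
const reverseByCodePoint = (str) => [...str].reverse().join("");

const original = "a\u{1F609}b"; // "a", the winking face, "b"
const reversed = reverseByCodePoint(original);

console.log(reversed === "b\u{1F609}a"); // true: the emoji survives intact
console.log(reversed.length);            // 4 (still four code units)
```

That handles surrogate pairs, but it's still not a complete "reverse": it treats every code point as a separate unit.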

There's more to the story. Beyond surrogate pairs of code units that combine to create a code point, it can even take multiple Unicode code points to create a specific "character" (glyph). For example, in Devanagari, a writing system used in India and Nepal, vowel sounds are written as marks modifying the consonant glyph. Code point U+0928, न, is pronounced "na", but you can follow it with code point U+093F to produce नि ("ni"). (More details on that in Chapter 10 of the book.)
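To work with units like that, you need something that understands Unicode's text segmentation rules. Here's a sketch using Intl.Segmenter with grapheme granularity. It assumes an environment that supports Intl.Segmenter (recent browsers and Node.js versions); the "hi" locale and the helper names are my choices for the example:

```javascript
const ni = "\u0928\u093F"; // U+0928 followed by U+093F: "ni"

console.log(ni.length);      // 2 code units
console.log([...ni].length); // 2 code points

// Split on grapheme cluster boundaries instead
const segmenter = new Intl.Segmenter("hi", { granularity: "grapheme" });

// Hypothetical helper: an array of the grapheme clusters in a string
const toGraphemes = (str) => Array.from(segmenter.segment(str), (s) => s.segment);

console.log(toGraphemes(ni).length); // 1

// Hypothetical helper: reverse a string without splitting grapheme clusters
const reverseByGrapheme = (str) => toGraphemes(str).reverse().join("");

console.log(reverseByGrapheme("a" + ni + "b") === "b" + ni + "a"); // true
```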

Happy coding!

Have a question or comment about this post? Ping me on Mastodon at @tjcrowdertech@hachyderm.io!
