UTF-16
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 non-surrogate code points of Unicode (in fact, this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that more than 2^16 (65,536) code points were needed.
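The "one or two 16-bit code units" rule can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder; the function name `utf16_code_units` is made up for this example. Code points up to FFFFh are stored as a single unit, and anything above is split into a high and a low surrogate.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (as integers) for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: one 16-bit code unit.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


print([hex(u) for u in utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The result can be cross-checked against Python's built-in codec: `"\U0001F600".encode("utf-16-be")` yields the same two units as four bytes.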
UTF-16 is used internally by systems such as Microsoft Windows, the Java programming language and JavaScript/ECMAScript. It is also often used for plain text and for word-processing data files on Microsoft Windows, but rarely for files on Unix-like systems. As of May 2019, Microsoft appears to have reversed course and now supports and recommends UTF-8.
UTF-16 is the only web encoding incompatible with ASCII. It never gained popularity on the web, where it is used by under 0.005% of web pages. UTF-8, by comparison, is used by 97% of all web pages. The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 “the mandatory encoding for all [text]” and states that, for security reasons, browser applications should not use UTF-16.
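The ASCII incompatibility is easy to see by comparing bytes. The short sketch below uses only Python's built-in codecs; the little-endian variant `utf-16-le` is chosen here so that no byte order mark is added.

```python
text = "A"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

# UTF-8 reproduces ASCII byte for byte; UTF-16 pads each unit to 16 bits.
print(utf8_bytes == ascii_bytes)   # True
print(utf16_bytes == ascii_bytes)  # False
print(utf16_bytes)                 # b'A\x00'
```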
What You Need To Know About UTF-16
- UTF-16 stands for 16-bit Unicode Transformation Format.
- UTF-16 is a variable-width encoding that uses 2 or 4 bytes for each code point.
- UTF-16 can encode 1,112,064 code points.
- UTF-16 supports normalization. Normalization treats character sequences that mean the same thing but are represented differently as identical.
- In UTF-16, scripts can identify directionality, allowing the application to correctly render the stored text.
- Current versions of Windows, from Windows 2000 onwards, use UTF-16.
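The 2-byte versus 4-byte behavior listed above can be observed with Python's built-in `utf-16-le` codec. The helper name `utf16_length` is an assumption made for this sketch; it counts 16-bit code units rather than characters.

```python
def utf16_length(text: str) -> int:
    """Count the 16-bit code units needed to store text in UTF-16.

    Hypothetical helper; relies on the built-in 'utf-16-le' codec,
    which emits no byte order mark.
    """
    return len(text.encode("utf-16-le")) // 2


print(utf16_length("A"))           # 1 code unit  (2 bytes)
print(utf16_length("\U0001F600"))  # 2 code units (4 bytes)
```

This is also the length that UTF-16-based string types typically report, which is why a single supplementary-plane character can count as two.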
UCS-2
UCS-2 stands for Universal Character Set coded in 2 octets. UCS-2 is a character encoding standard in which each character is represented by a fixed-length 16 bits (2 bytes). It is used as a fallback on many GSM networks when a message cannot be encoded using GSM-7 or when a language requires more than 128 characters to be rendered.
UCS-2 and the other UCS standards are defined by the International Organization for Standardization (ISO) in ISO 10646. UCS-2 can represent a maximum of 65,536 characters, or in hexadecimal 0000h to FFFFh (2 bytes). The characters in UCS-2 correspond to the Basic Multilingual Plane in Unicode.
Character is an overloaded term, so it is actually more correct to refer to code points. Code points allow abstraction from the character term, and are the atomic unit of storage of information in an encoding.
UCS-2 is a fixed-width encoding; each encoded code point takes exactly 2 bytes. As an SMS message is transmitted in 140 octets, a message encoded in UCS-2 has a maximum of 70 characters (really, code points): (140*8) / (2*8) = 70.
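The capacity arithmetic above can be written out as a small sketch. The constant and function names are invented for this example, and the 7-bit figure for GSM-7 is added only as a point of comparison.

```python
SMS_PAYLOAD_OCTETS = 140  # payload size of a single SMS, as stated above


def max_units(bits_per_unit: int) -> int:
    """Hypothetical helper: how many fixed-size units fit in one SMS."""
    return (SMS_PAYLOAD_OCTETS * 8) // bits_per_unit


print(max_units(16))  # 70  (UCS-2, 16 bits per code point)
print(max_units(7))   # 160 (GSM-7, 7 bits per character, for comparison)
```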
Under the Unicode Standard, UCS-2 is an obsolete encoding because it was not designed to represent characters in the so-called supplementary or ‘astral’ planes of Unicode. Plane 0, the Basic Multilingual Plane, contains what are believed to be the most commonly used characters in modern languages. UCS-2 is limited to code points from 0000h to FFFFh, or 65,536 possible characters.
UTF-16 is the successor to UCS-2 and can address the Basic Multilingual Plane plus 16 supplementary planes, covering code points from 0h to 10FFFFh, for a total of 1,114,112 code points. Of these, 2,048 are surrogates reserved for the encoding itself, which leaves 1,112,064 encodable code points.
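The relationship between the two totals quoted in this article (1,114,112 and 1,112,064) can be checked with a few lines of arithmetic. The constant names below are chosen for this sketch only.

```python
PLANES = 17                        # Basic Multilingual Plane + 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000    # 65,536 code points per plane
SURROGATES = 0xDFFF - 0xD800 + 1   # code points reserved for surrogate pairs

total = PLANES * CODE_POINTS_PER_PLANE
encodable = total - SURROGATES

print(total)      # 1114112
print(encodable)  # 1112064
```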
What You Need To Know About UCS-2
- UCS-2 stands for Universal Character Set coded in 2 octets.
- UCS-2 is a fixed-width 2-byte character encoding for Unicode.
- UCS-2 can encode 65,536 code points (0x0000 to 0xFFFF).
- In UCS-2, normalization does not occur automatically, so the application needs to implement such a feature on its own.
- UCS-2 scripts lack the ability to identify directionality, and thus will not work with scripts like Arabic and Hebrew, which run from right to left.
- Early versions of Windows, such as Windows NT 3.1 and Windows 95, used UCS-2.
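Because UCS-2 stops at FFFFh, a simple range check is enough to tell whether a piece of text could be stored in it. The following is a minimal sketch; `fits_in_ucs2` is a hypothetical helper name, not part of any library.

```python
def fits_in_ucs2(text: str) -> bool:
    """Hypothetical check: True if every code point is within 0x0000-0xFFFF."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_ucs2("héllo"))       # True: all characters are in the BMP
print(fits_in_ucs2("\U0001F600"))  # False: U+1F600 lies in a supplementary plane
```

Text that fails this check needs UTF-16 surrogate pairs (or another encoding such as UTF-8) to be represented.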
Difference Between UCS-2 And UTF-16 In Tabular Form
BASIS OF COMPARISON | UTF-16 | UCS-2 |
Acronym | UTF-16 stands for 16-bit Unicode Transformation Format. | UCS-2 stands for Universal Character Set coded in 2 octets. |
Description | UTF-16 is a variable-width encoding that uses 2 or 4 bytes for each code point. | UCS-2 is a fixed-width 2-byte character encoding for Unicode. |
Code points | UTF-16 can encode 1,112,064 code points. | UCS-2 can encode 65,536 code points (0x0000 to 0xFFFF). |
Normalization | UTF-16 supports normalization. Normalization treats character sequences that mean the same thing but are represented differently as identical. | In UCS-2, normalization does not occur automatically, so the application needs to implement such a feature on its own. |
Scripts | In UTF-16, scripts can identify directionality, allowing the application to correctly render the stored text. | UCS-2 scripts lack the ability to identify directionality, and thus will not work with scripts like Arabic and Hebrew, which run from right to left. |
Application | Current versions of Windows, from Windows 2000 onwards, use UTF-16. | Early versions of Windows, such as Windows NT 3.1 and Windows 95, used UCS-2. |