Trail: Internationalization
Lesson: Working with Text
Section: Unicode

Terminology

A character is a minimal unit of text that has semantic value.

A character set is a collection of characters that might be used by multiple languages. For example, the Latin character set is used by English and most European languages, whereas the Greek character set is used only by the Greek language.

A coded character set is a character set where each character is assigned a unique number.

A code point is a value that can be used in a coded character set. In the Java platform, a code point is stored in a 32-bit int, where the lower 21 bits hold the code point value and the upper 11 bits are 0.
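As a minimal sketch of the int representation, the following class checks whether an int holds a valid code point. The class name CodePointRange and the helper isValid are hypothetical names chosen for illustration; Character.isValidCodePoint and Character.MAX_CODE_POINT are part of java.lang.Character.

```java
class CodePointRange {
    // Hypothetical helper: true only for values in the range U+0000 to U+10FFFF.
    static boolean isValid(int codePoint) {
        return Character.isValidCodePoint(codePoint);
    }

    public static void main(String[] args) {
        int deseretLongI = 0x10400; // a code point stored in a plain int

        System.out.println(isValid(deseretLongI)); // true
        System.out.println(isValid(0x110000));     // false: above U+10FFFF

        // The largest code point, U+10FFFF, needs 21 bits.
        System.out.println(Integer.toBinaryString(Character.MAX_CODE_POINT).length()); // 21
    }
}
```

Note that fitting in 21 bits is necessary but not sufficient: a value such as 0x1FFFFF has its upper 11 bits clear, yet lies above U+10FFFF and is rejected.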

A Unicode code unit is a 16-bit char value. For example, imagine a String that contains the letters "abc" followed by the Deseret LONG I, which is represented with two char values. That string contains four characters, four code points, but five code units.
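The difference between code units and code points in this example can be checked with a short sketch. The class name CodeUnitCounts and its helper methods are hypothetical; String.length and String.codePointCount are standard methods. The Deseret LONG I is written as its surrogate pair so that the result does not depend on the source file encoding.

```java
class CodeUnitCounts {
    // "abc" followed by the Deseret LONG I (U+10400) as two char values.
    static final String TEXT = "abc\uD801\uDC00";

    // Number of 16-bit char values (code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(TEXT));  // 5
        System.out.println(codePoints(TEXT)); // 4
    }
}
```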

To express a character in Unicode, the hexadecimal value of its code point is prefixed with the string U+. The valid code point range for the Unicode standard is U+0000 to U+10FFFF, inclusive. The code point value for the Latin character A is U+0041. The character €, which represents the Euro currency, has the code point value U+20AC. The first letter in the Deseret alphabet, the LONG I, has the code point value U+10400.
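One way to produce the U+ notation from an int code point is sketched below. The class name UPlusNotation and the method toUPlus are hypothetical; the sketch assumes the common convention of padding the hexadecimal value to at least four digits.

```java
class UPlusNotation {
    // Formats a code point as "U+" followed by at least four uppercase hex digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus('A'));      // U+0041
        System.out.println(toUPlus('\u20AC')); // U+20AC (the Euro sign)
        System.out.println(toUPlus(0x10400));  // U+10400 (Deseret LONG I)
    }
}
```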

The following table shows code point values for several characters:

Character         Unicode Code Point   Glyph
Latin A           U+0041               A (the Latin character A)
Latin sharp S     U+00DF               ß (the Latin small letter sharp S)
Han for East      U+6771               東 (the Han character for east, eastern or eastward)
Deseret, LONG I   U+10400              𐐀 (the Deseret capital letter long I)

As previously described, characters that are in the range U+10000 to U+10FFFF are called supplementary characters. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP).
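The two ranges can be told apart in code. The following is a sketch only: the class name PlaneCheck and the method describe are hypothetical, while Character.isBmpCodePoint, Character.isSupplementaryCodePoint, and Character.charCount are standard methods (isBmpCodePoint requires Java SE 7 or later).

```java
class PlaneCheck {
    // Classifies an int as a BMP code point, a supplementary code point, or neither.
    static String describe(int codePoint) {
        if (Character.isBmpCodePoint(codePoint)) {
            return "BMP";
        }
        if (Character.isSupplementaryCodePoint(codePoint)) {
            return "supplementary";
        }
        return "invalid";
    }

    public static void main(String[] args) {
        System.out.println(describe(0x6771));  // BMP (Han for East)
        System.out.println(describe(0x10400)); // supplementary (Deseret LONG I)

        // A BMP character fits in one char; a supplementary character needs two.
        System.out.println(Character.charCount(0x6771));  // 1
        System.out.println(Character.charCount(0x10400)); // 2
    }
}
```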

More terminology can be found in the Glossary of Unicode Terms, listed on the More Information page.

