1. Preface
The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding.
The other (legacy) encodings have been defined to some extent in the past. However, user agents have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification addresses those gaps so that new user agents do not have to reverse engineer encoding implementations and existing user agents can converge.
In particular, this specification defines all those encodings, their algorithms to go from bytes to scalar values and back, and their canonical names and identifying labels. This specification also defines an API to expose part of the encoding algorithms to JavaScript.
User agents have also significantly deviated from the labels listed in the IANA Character Sets registry. To stop spreading legacy encodings further, this specification is exhaustive about the aforementioned details and therefore has no need for the registry. In particular, this specification does not provide a mechanism for extending any aspect of encodings.
2. Security background
There is a set of encoding security issues when the producer and consumer do not agree on the encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was reported in 2011 where a Shift_JIS lead byte 0x82 was used to “mask” a 0x22 trail byte in a JSON resource of which an attacker could control some field. The producer did not see the problem even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of encodings that use multiple bytes for scalar values now require that in case of an illegal byte combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the output would be U+FFFD U+0022.
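The "no masking" rule can be observed with any multi-byte decoder. The following is a minimal sketch that uses UTF-8 rather than Shift_JIS, on the assumption that a UTF-8 `TextDecoder` is available in every environment, whereas legacy encoding support can vary. The byte values are chosen purely for illustration.

```javascript
// A UTF-8 lead byte (0xC2) followed by an ASCII quote (0x22).
// 0x22 is not a valid continuation byte, so the pair is ill-formed.
const bytes = new Uint8Array([0x7b, 0xc2, 0x22, 0x7d]); // { <lead> " }
const text = new TextDecoder("utf-8").decode(bytes);

// The lead byte alone becomes U+FFFD; the quote survives as U+0022.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
console.log(
  codePoints
    .map((cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0"))
    .join(" ")
);
```

The delimiter U+0022 stays visible to the consumer, so the surrounding structure is interpreted the same way by producer and consumer.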
This is a larger issue for encodings that map anything that is an ASCII byte to something that is not an ASCII code point, when there is no lead byte present. These are “ASCII-incompatible” encodings and other than ISO-2022-JP, UTF-16BE, and UTF-16LE, which are unfortunately required due to deployed content, they are not supported. (Investigation is ongoing whether more labels of other such encodings can be mapped to the replacement encoding, rather than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a resource and then encouraging the user to override the encoding, resulting in e.g. script execution.
Encoders used by URLs found in HTML and HTML’s form feature can also result in slight information loss when an encoding is used that cannot represent all scalar values. E.g. when a resource uses the windows-1252 encoding a server will not be able to distinguish between an end user entering “💩” and “&#128169;” into a form.
The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
See also the Browser UI chapter.
3. Terminology
This specification depends on the Infra Standard. [INFRA]
Hexadecimal numbers are prefixed with "0x".
In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", integer division by "/" (returns the quotient), modulo by "%" (returns the remainder of an integer division), logical left shifts by "<<", logical right shifts by ">>", bitwise AND by "&", and bitwise OR by "|".
For logical right shifts operands must have at least twenty-one bits precision.
An I/O queue is a type of list with items of a particular type (i.e., bytes or scalar values). End-of-queue is a special item that can be present in I/O queues of any type and it signifies that there are no more items in the queue.
There are two ways to use an I/O queue: in immediate mode, to represent I/O data stored in memory, and in streaming mode, to represent data coming in from the network. Immediate queues have end-of-queue as their last item, whereas streaming queues need not have it, and so their read operation might block.
It is expected that streaming I/O queues will be created empty, and that new items will be pushed to it as data comes in from the network. When the underlying network stream closes, an end-of-queue item is to be pushed into the queue.
Since reading from a streaming I/O queue might block, streaming I/O queues are not to be used from an event loop. They are to be used in parallel instead.
To read an item from an I/O queue ioQueue, run these steps:
- If ioQueue is empty, then wait until its size is at least 1.
- If ioQueue[0] is end-of-queue, then return end-of-queue.
- Remove ioQueue[0] and return it.
To read a number number of items from ioQueue, run these steps:
- Let readItems be an empty list.
- Perform the following step number times:
  - Append to readItems the result of reading an item from ioQueue.
- Remove end-of-queue from readItems.
- Return readItems.
To peek a number number of items from an I/O queue ioQueue, run these steps:
- Wait until either ioQueue’s size is equal to or greater than number, or ioQueue contains end-of-queue, whichever comes first.
- Let prefix be an empty list.
- For each n in the range 1 to number, inclusive:
  - If ioQueue[n] is end-of-queue, break.
  - Otherwise, append ioQueue[n] to prefix.
- Return prefix.
To push an item item to an I/O queue ioQueue , run these steps:
If the last item in ioQueue is end-of-queue , then:
If item is end-of-queue , do nothing.
Otherwise, append item to ioQueue .
To push a sequence of items to an I/O queue ioQueue is to push each item in the sequence to ioQueue, in the given order.
To prepend an item other than end-of-queue to an I/O queue, perform the normal list prepend operation. To prepend a sequence of items, insert those items, in the given order, before the first item in the queue.
Inserting the sequence of scalar value items &#128169; in an I/O queue of scalar values " hello world", results in an I/O queue "&#128169; hello world". The next item to be read would be &.
To convert an I/O queue ioQueue into a list, string, or byte sequence, return the result of reading an indefinite number of items from ioQueue.
To convert a list, string, or byte sequence input into an I/O queue, return an I/O queue containing the items in input, in given order, followed by end-of-queue.
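The I/O queue operations can be sketched on top of a plain array. This is only an illustration of immediate mode under stated assumptions: `END_OF_QUEUE`, `toIoQueue`, `readItem`, `readItems`, `peekItems`, and `prependItems` are hypothetical names, waiting is replaced by an exception because nothing here is asynchronous, and peeking is interpreted as looking at the front of the queue.

```javascript
// Sentinel standing in for the end-of-queue item (hypothetical name).
const END_OF_QUEUE = Symbol("end-of-queue");

// Convert a list (array) into an immediate-mode I/O queue.
function toIoQueue(items) {
  return [...items, END_OF_QUEUE];
}

// Read one item. End-of-queue is returned but never removed.
function readItem(ioQueue) {
  if (ioQueue.length === 0) {
    throw new Error("empty queue: a streaming queue would wait here");
  }
  if (ioQueue[0] === END_OF_QUEUE) return END_OF_QUEUE;
  return ioQueue.shift();
}

// Read up to `number` items, dropping end-of-queue from the result.
function readItems(ioQueue, number) {
  const result = [];
  for (let i = 0; i < number; i++) result.push(readItem(ioQueue));
  return result.filter((item) => item !== END_OF_QUEUE);
}

// Look at up to `number` items at the front without removing them.
function peekItems(ioQueue, number) {
  const prefix = [];
  for (const item of ioQueue.slice(0, number)) {
    if (item === END_OF_QUEUE) break;
    prefix.push(item);
  }
  return prefix;
}

// Prepend a sequence so that its first item is the next one read.
function prependItems(ioQueue, items) {
  ioQueue.unshift(...items);
}

const queue = toIoQueue([0x68, 0x69]); // "hi"
prependItems(queue, [0x26, 0x23]); // "&#"
console.log(peekItems(queue, 3)); // first three items, queue unchanged
console.log(readItems(queue, 10)); // everything before end-of-queue
console.log(readItem(queue) === END_OF_QUEUE); // true
```

Because reading never removes end-of-queue, asking for more items than are available simply yields a shorter list.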
The Infra standard is expected to define some infrastructure around type conversions. See whatwg/infra issue #319 . [INFRA]
I/O queues are defined as lists, not queues, because they feature a prepend operation. However, this prepend operation is an internal detail of the algorithms in this specification, and is not to be used by other standards. Implementations are free to find alternative ways to implement such algorithms, as detailed in Implementation considerations.
4. Encodings
An encoding defines a mapping from a scalar value sequence to a byte sequence (and vice versa). Each encoding has a name , and one or more labels .
This specification defines three encodings with the same names as encoding schemes defined in the Unicode standard: UTF-8 , UTF-16LE , and UTF-16BE . The encodings differ from the encoding schemes by byte order mark (also known as BOM) handling not being part of the encodings themselves and instead being part of wrapper algorithms in this specification, whereas byte order mark handling is part of the definition of the encoding schemes in the Unicode Standard. UTF-8 used together with the UTF-8 decode algorithm matches the encoding scheme of the same name. This specification does not provide wrapper algorithms that would combine with UTF-16LE and UTF-16BE to match the similarly-named encoding schemes . [UNICODE]
4.1. Encoders and decoders
Each encoding has an associated decoder and most of them have an associated encoder. Each decoder and encoder have a handler algorithm. A handler algorithm takes an input I/O queue and an item, and returns finished, one or more items, error optionally with a code point, or continue.
The replacement, UTF-16BE, and UTF-16LE encodings have no encoder.
An error mode as used below is "replacement" (default) or "fatal" for a decoder and "fatal" (default) or "html" for an encoder.
An XML processor would set error mode to "fatal". [XML]
"html" exists as error mode due to URLs and HTML forms requiring a non-terminating legacy encoder. The "html" error mode causes a sequence to be emitted that cannot be distinguished from legitimate input and can therefore lead to silent data loss. Developers are strongly encouraged to use the UTF-8 encoding to prevent this from happening. [URL] [HTML]
To run an encoding’s decoder or encoder encoderDecoder with input I/O queue input, output I/O queue output, and optional error mode mode, run these steps:
- If mode is not given, then set it to "replacement" if encoderDecoder is a decoder, otherwise "fatal".
- Let encoderDecoderInstance be a new encoderDecoder.
- While true:
  - Let result be the result of processing the result of reading from input for encoderDecoderInstance, input, output, and mode.
  - If result is not continue, return result.
  - Otherwise, do nothing.
To process an item item for an encoding’s encoder or decoder instance encoderDecoderInstance, I/O queue input, output I/O queue output, and optional error mode mode, run these steps:
- If mode is not given, then set it to "replacement" if encoderDecoderInstance is a decoder instance, otherwise "fatal".
- Assert: if encoderDecoderInstance is an encoder instance, item is not a surrogate.
- Let result be the result of running encoderDecoderInstance’s handler on input and item.
- If result is continue, return result.
- Otherwise, if result is finished:
  - Push end-of-queue to output.
  - Return result.
- Otherwise, if result is one or more items:
  - Assert: if encoderDecoderInstance is a decoder instance, result does not contain any surrogates.
  - Push result to output.
- Otherwise, if result is error, switch on mode and run the associated steps:
  - "replacement": Push U+FFFD to output.
  - "html": Prepend U+0026, U+0023, followed by the shortest sequence of ASCII digits representing result’s code point in base ten, followed by U+003B to input.
  - "fatal": Return error.
- Return continue.
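The interplay between running, processing, and the error modes can be sketched as follows. This is an illustration under stated assumptions, not an implementation of any encoding defined here: the two handlers describe a hypothetical ASCII-only encoding, the I/O queues are plain arrays in immediate mode, the error mode is always passed explicitly, and all names (`END`, `FINISHED`, `CONTINUE`, `processItem`, `run`) are made up for the example.

```javascript
const END = Symbol("end-of-queue");
const FINISHED = Symbol("finished");
const CONTINUE = Symbol("continue");

// Hypothetical toy handlers for an ASCII-only encoding. Each returns
// finished, a list of items, or an error (optionally with a code point).
function toyDecoderHandler(input, byte) {
  if (byte === END) return FINISHED;
  return byte <= 0x7f ? [byte] : { error: true };
}
function toyEncoderHandler(input, codePoint) {
  if (codePoint === END) return FINISHED;
  return codePoint <= 0x7f ? [codePoint] : { error: true, codePoint };
}

// "Process an item": run the handler and act on its result.
function processItem(item, handler, input, output, mode) {
  const result = handler(input, item);
  if (result === CONTINUE) return result;
  if (result === FINISHED) {
    output.push(END);
    return result;
  }
  if (Array.isArray(result)) {
    output.push(...result);
    return CONTINUE;
  }
  switch (mode) {
    case "replacement":
      output.push(0xfffd);
      break;
    case "html": {
      // Prepend "&#", the decimal code point, and ";" to the input.
      const reference = `&#${result.codePoint};`;
      input.unshift(...Array.from(reference, (ch) => ch.codePointAt(0)));
      break;
    }
    case "fatal":
      return result;
  }
  return CONTINUE;
}

// "Run": keep reading from input until the result is not continue.
function run(handler, input, output, mode) {
  while (true) {
    if (input.length === 0) throw new Error("immediate queue lacks end-of-queue");
    const item = input[0] === END ? END : input.shift();
    const result = processItem(item, handler, input, output, mode);
    if (result !== CONTINUE) return result;
  }
}

const decoded = [];
run(toyDecoderHandler, [0x68, 0xff, 0x69, END], decoded, "replacement");
const encoded = [];
run(toyEncoderHandler, [0x1f4a9, END], encoded, "html");
console.log(String.fromCharCode(...encoded.slice(0, -1)));
```

In "html" mode the character reference is prepended to the input rather than pushed to the output, so it passes through the encoder like any other input. That is why the emitted bytes cannot be told apart from a user who typed the reference literally.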
4.2. Names and labels
The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels .
For each encoding, ASCII-lowercasing its name yields one of its labels .
Authors must use the UTF-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.
New protocols and formats, as well as existing formats deployed in new contexts, must use the UTF-8 encoding exclusively. If these protocols and formats need to expose the encoding’s name or label, they must expose it as "utf-8".
To get an encoding from a string label , run these steps:
- Remove any leading and trailing ASCII whitespace from label.
- If label is an ASCII case-insensitive match for any of the labels listed in the table below, then return the corresponding encoding; otherwise return failure.
This is a more basic and restrictive algorithm of mapping labels to encodings than section 1.4 of Unicode Technical Standard #22 prescribes, as that is necessary to be compatible with deployed content.
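A minimal sketch of label lookup follows. The `LABELS` map is a small hand-picked subset used only for illustration, `getEncoding` is a hypothetical name, and `null` stands in for failure.

```javascript
// Partial label table for illustration only.
const LABELS = new Map([
  ["utf-8", "UTF-8"],
  ["utf8", "UTF-8"],
  ["latin1", "windows-1252"],
  ["ascii", "windows-1252"],
  ["koi8-r", "KOI8-R"],
  ["utf-16", "UTF-16LE"],
]);

function getEncoding(label) {
  // Strip ASCII whitespace only (tab, LF, FF, CR, space).
  // String.prototype.trim would also strip non-ASCII whitespace.
  const trimmed = label.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
  // Lowercase A-Z only, so that e.g. U+212A KELVIN SIGN does not become "k".
  const lowered = trimmed.replace(/[A-Z]/g, (ch) => ch.toLowerCase());
  return LABELS.get(lowered) ?? null; // null stands in for failure
}

console.log(getEncoding("  UTF-8\n")); // "UTF-8"
console.log(getEncoding("Latin1")); // "windows-1252"
console.log(getEncoding("\u212Aoi8-r")); // null
```

Restricting both trimming and case folding to ASCII keeps the match as strict as the algorithm requires. A general-purpose `toLowerCase()` on the whole string would accept labels that are not ASCII case-insensitive matches.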
Name | Labels |
---|---|
The Encoding | |
UTF-8 | "unicode-1-1-utf-8", "unicode11utf8", "unicode20utf8", "utf-8", "utf8", "x-unicode20utf8" |
Legacy single-byte encodings | |
IBM866 | "866", "cp866", "csibm866", "ibm866" |
ISO-8859-2 | "csisolatin2", "iso-8859-2", "iso-ir-101", "iso8859-2", "iso88592", "iso_8859-2", "iso_8859-2:1987", "l2", "latin2" |
ISO-8859-3 | "csisolatin3", "iso-8859-3", "iso-ir-109", "iso8859-3", "iso88593", "iso_8859-3", "iso_8859-3:1988", "l3", "latin3" |
ISO-8859-4 | "csisolatin4", "iso-8859-4", "iso-ir-110", "iso8859-4", "iso88594", "iso_8859-4", "iso_8859-4:1988", "l4", "latin4" |
ISO-8859-5 | "csisolatincyrillic", "cyrillic", "iso-8859-5", "iso-ir-144", "iso8859-5", "iso88595", "iso_8859-5", "iso_8859-5:1988" |
ISO-8859-6 | "arabic", "asmo-708", "csiso88596e", "csiso88596i", "csisolatinarabic", "ecma-114", "iso-8859-6", "iso-8859-6-e", "iso-8859-6-i", "iso-ir-127", "iso8859-6", "iso88596", "iso_8859-6", "iso_8859-6:1987" |
ISO-8859-7 | "csisolatingreek", "ecma-118", "elot_928", "greek", "greek8", "iso-8859-7", "iso-ir-126", "iso8859-7", "iso88597", "iso_8859-7", "iso_8859-7:1987", "sun_eu_greek" |
ISO-8859-8 | "csiso88598e", "csisolatinhebrew", "hebrew", "iso-8859-8", "iso-8859-8-e", "iso-ir-138", "iso8859-8", "iso88598", "iso_8859-8", "iso_8859-8:1988", "visual" |
ISO-8859-8-I | "csiso88598i", "iso-8859-8-i", "logical" |
ISO-8859-10 | "csisolatin6", "iso-8859-10", "iso-ir-157", "iso8859-10", "iso885910", "l6", "latin6" |
ISO-8859-13 | "iso-8859-13", "iso8859-13", "iso885913" |
ISO-8859-14 | "iso-8859-14", "iso8859-14", "iso885914" |
ISO-8859-15 | "csisolatin9", "iso-8859-15", "iso8859-15", "iso885915", "iso_8859-15", "l9" |
ISO-8859-16 | "iso-8859-16" |
KOI8-R | "cskoi8r", "koi", "koi8", "koi8-r", "koi8_r" |
KOI8-U | "koi8-ru", "koi8-u" |
macintosh | "csmacintosh", "mac", "macintosh", "x-mac-roman" |
windows-874 | "dos-874", "iso-8859-11", "iso8859-11", "iso885911", "tis-620", "windows-874" |
windows-1250 | "cp1250", "windows-1250", "x-cp1250" |
windows-1251 | "cp1251", "windows-1251", "x-cp1251" |
windows-1252 | "ansi_x3.4-1968", "ascii", "cp1252", "cp819", "csisolatin1", "ibm819", "iso-8859-1", "iso-ir-100", "iso8859-1", "iso88591", "iso_8859-1", "iso_8859-1:1987", "l1", "latin1", "us-ascii", "windows-1252", "x-cp1252" |
windows-1253 | "cp1253", "windows-1253", "x-cp1253" |
windows-1254 | "cp1254", "csisolatin5", "iso-8859-9", "iso-ir-148", "iso8859-9", "iso88599", "iso_8859-9", "iso_8859-9:1989", "l5", "latin5", "windows-1254", "x-cp1254" |
windows-1255 | "cp1255", "windows-1255", "x-cp1255" |
windows-1256 | "cp1256", "windows-1256", "x-cp1256" |
windows-1257 | "cp1257", "windows-1257", "x-cp1257" |
windows-1258 | "cp1258", "windows-1258", "x-cp1258" |
x-mac-cyrillic | "x-mac-cyrillic", "x-mac-ukrainian" |
Legacy multi-byte Chinese (simplified) encodings | |
GBK | "chinese", "csgb2312", "csiso58gb231280", "gb2312", "gb_2312", "gb_2312-80", "gbk", "iso-ir-58", "x-gbk" |
gb18030 | "gb18030" |
Legacy multi-byte Chinese (traditional) encodings | |
Big5 | "big5", "big5-hkscs", "cn-big5", "csbig5", "x-x-big5" |
Legacy multi-byte Japanese encodings | |
EUC-JP | "cseucpkdfmtjapanese", "euc-jp", "x-euc-jp" |
ISO-2022-JP | "csiso2022jp", "iso-2022-jp" |
Shift_JIS | "csshiftjis", "ms932", "ms_kanji", "shift-jis", "shift_jis", "sjis", "windows-31j", "x-sjis" |
Legacy multi-byte Korean encodings | |
EUC-KR | "cseuckr", "csksc56011987", "euc-kr", "iso-ir-149", "korean", "ks_c_5601-1987", "ks_c_5601-1989", "ksc5601", "ksc_5601", "windows-949" |
Legacy miscellaneous encodings | |
replacement | "csiso2022kr", "hz-gb-2312", "iso-2022-cn", "iso-2022-cn-ext", "iso-2022-kr", "replacement" |
UTF-16BE | "unicodefffe", "utf-16be" |
UTF-16LE | "csunicode", "iso-10646-ucs-2", "ucs-2", "unicode", "unicodefeff", "utf-16", "utf-16le" |
x-user-defined | "x-user-defined" |
All encodings and their labels are also available as a non-normative encodings.json resource.
The set of supported encodings is primarily based on the intersection of the sets supported by major browser engines when the development of this standard started, while removing encodings that were rarely used legitimately but that could be used in attacks. The inclusion of some encodings is questionable in the light of anecdotal evidence of the level of use by existing Web content. That is, while they have been broadly supported by browsers, it is unclear if they are broadly used by Web content. However, an effort has not been made to eagerly remove single-byte encodings that were broadly supported by browsers or are part of the ISO 8859 series. In particular, the necessity of the inclusion of IBM866 , macintosh , x-mac-cyrillic , ISO-8859-3 , ISO-8859-10 , ISO-8859-14 , and ISO-8859-16 is doubtful for the purpose of supporting existing content, but there are no plans to remove these.
4.3. Output encodings
To get an output encoding from an encoding encoding , run these steps:
- If encoding is replacement, UTF-16BE, or UTF-16LE, return UTF-8.
- Return encoding.
The get an output encoding algorithm is useful for URL parsing and HTML form submission, which both need exactly this.
5. Indexes
Most legacy encodings make use of an index . An index is an ordered list of entries, each entry consisting of a pointer and a corresponding code point. Within an index pointers are unique and code points can be duplicated.
An efficient implementation likely has two indexes per encoding: one optimized for its decoder and one for its encoder.
To find the pointers and their corresponding code points in an index , let lines be the result of splitting the resource’s contents on U+000A. Then remove each item in lines that is the empty string or starts with U+0023. Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009. The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number). Other subitems are not relevant.
To signify changes an index includes an Identifier and a Date . If an Identifier has changed, so has the index .
The index code point for pointer in index is the code point corresponding to pointer in index , or null if pointer is not in index .
The index pointer for code point in index is the first pointer corresponding to code point in index , or null if code point is not in index .
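The parsing rules and the two lookups can be sketched as follows. The sample text is made up in the format of an index resource and is not an excerpt of a real index. `parseIndex`, `indexCodePoint`, and `indexPointer` are hypothetical names, and the linear scans are for clarity rather than speed.

```javascript
// Parse the text of an index resource into [pointer, codePoint] pairs.
function parseIndex(text) {
  return text
    .split("\n")
    .filter((line) => line !== "" && !line.startsWith("#"))
    .map((line) => {
      // Subitems after the second one are not relevant.
      const [pointer, codePoint] = line.split("\t");
      return [parseInt(pointer, 10), parseInt(codePoint, 16)];
    });
}

// The code point for pointer, or null if pointer is not in the index.
function indexCodePoint(pointer, index) {
  const entry = index.find(([p]) => p === pointer);
  return entry ? entry[1] : null;
}

// The first pointer for codePoint, or null if it is not in the index.
function indexPointer(codePoint, index) {
  const entry = index.find(([, cp]) => cp === codePoint);
  return entry ? entry[0] : null;
}

// Made-up sample; pointer 2 deliberately duplicates U+0411.
const sample =
  "# Identifier: example\n\n0\t0x0410\tfirst\n1\t0x0411\tsecond\n2\t0x0411\tduplicate\n";
const index = parseIndex(sample);
console.log(indexCodePoint(1, index).toString(16)); // "411"
console.log(indexPointer(0x0411, index)); // 1
```

The duplicate entry shows why the lookup direction matters: decoding by pointer is unambiguous, while encoding has to pick one of several pointers, and the definition above picks the first.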
There is a non-normative visualization for each index other than index gb18030 ranges and index ISO-2022-JP katakana . index jis0208 also has an alternative Shift_JIS visualization. Additionally, there is visualization of the Basic Multilingual Plane coverage of each index other than index gb18030 ranges and index ISO-2022-JP katakana .
The legend for the visualizations is:
- Unmapped
- Two bytes in UTF-8
- Two bytes in UTF-8, code point follows immediately the code point of previous pointer
- Three bytes in UTF-8 (non-PUA)
- Three bytes in UTF-8 (non-PUA), code point follows immediately the code point of previous pointer
- Private Use
- Private Use, code point follows immediately the code point of previous pointer
- Four bytes in UTF-8
- Four bytes in UTF-8, code point follows immediately the code point of previous pointer
- Duplicate code point already mapped at an earlier index
- CJK Compatibility Ideograph
- CJK Unified Ideographs Extension A
These are the indexes defined by this specification, excluding index single-byte , which have their own table:
Index | Notes | |||
---|---|---|---|---|
index Big5 | index-big5.txt | index Big5 visualization | index Big5 BMP coverage | This matches the Big5 standard in combination with the Hong Kong Supplementary Character Set and other common extensions. |
index EUC-KR | index-euc-kr.txt | index EUC-KR visualization | index EUC-KR BMP coverage | This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949. It covers the Hangul Syllables block of Unicode in its entirety. The Hangul block whose top left corner in the visualization is at pointer 9026 is in the Unicode order. Taken separately, the rest of the Hangul syllables in this index are in the Unicode order, too. |
index gb18030 | index-gb18030.txt | index gb18030 visualization | index gb18030 BMP coverage | This matches the GB18030-2005 standard for code points encoded as two bytes, except for 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content. This index covers the CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or to the left of (the first) U+3000 in the visualization are in the Unicode order. |
index gb18030 ranges | index-gb18030-ranges.txt | | | This index works differently from all others. Listing all code points would result in over a million items whereas they can be represented neatly in 207 ranges combined with trivial limit checks. It therefore only superficially matches the GB18030-2005 standard for code points encoded as four bytes. See also index gb18030 ranges code point and index gb18030 ranges pointer below. |
index jis0208 | index-jis0208.txt | index jis0208 visualization , Shift_JIS visualization | index jis0208 BMP coverage | This is the JIS X 0208 standard including formerly proprietary extensions from IBM and NEC. |
index jis0212 | index-jis0212.txt | index jis0212 visualization | index jis0212 BMP coverage | This is the JIS X 0212 standard. It is only used by the EUC-JP decoder due to lack of widespread support elsewhere. |
index ISO-2022-JP katakana | index-iso-2022-jp-katakana.txt | This maps halfwidth to fullwidth katakana as per Unicode Normalization Form KC, except that U+FF9E and U+FF9F map to U+309B and U+309C rather than U+3099 and U+309A. It is only used by the ISO-2022-JP encoder . [UNICODE] |
The index gb18030 ranges code point for pointer is the return value of these steps:
- If pointer is greater than 39419 and less than 189000, or pointer is greater than 1237575, return null.
- If pointer is 7457, return code point U+E7C7.
- Let offset be the last pointer in index gb18030 ranges that is less than or equal to pointer and let code point offset be its corresponding code point.
- Return a code point whose value is code point offset + pointer − offset.
The index gb18030 ranges pointer for code point is the return value of these steps:
- If code point is U+E7C7, return pointer 7457.
- Let offset be the last code point in index gb18030 ranges that is less than or equal to code point and let pointer offset be its corresponding pointer.
- Return a pointer whose value is pointer offset + code point − offset.
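The two range lookups can be sketched as below. The `RANGES` table is a deliberately tiny, illustrative stand-in for index gb18030 ranges and its values are assumptions for the example. A real implementation would load every entry from the index resource, and results for pointers between the listed range starts are only meaningful with the full table.

```javascript
// Partial, illustrative [pointer, codePoint] range starts (assumed values).
const RANGES = [
  [0, 0x0080],
  [36, 0x00a5],
  [189000, 0x10000],
];

function gb18030RangesCodePoint(pointer) {
  if ((pointer > 39419 && pointer < 189000) || pointer > 1237575) return null;
  if (pointer === 7457) return 0xe7c7;
  // Find the last range start whose pointer is <= pointer.
  let [offset, codePointOffset] = RANGES[0];
  for (const [p, cp] of RANGES) {
    if (p <= pointer) [offset, codePointOffset] = [p, cp];
  }
  return codePointOffset + pointer - offset;
}

function gb18030RangesPointer(codePoint) {
  if (codePoint === 0xe7c7) return 7457;
  // Find the last range start whose code point is <= codePoint.
  let [pointerOffset, offset] = RANGES[0];
  for (const [p, cp] of RANGES) {
    if (cp <= codePoint) [pointerOffset, offset] = [p, cp];
  }
  return pointerOffset + codePoint - offset;
}

console.log(gb18030RangesCodePoint(189001).toString(16)); // "10001"
console.log(gb18030RangesPointer(0x10000)); // 189000
```

The upper limit in the first step lines up with the arithmetic: 189000 + (0x10FFFF − 0x10000) equals 1237575, so the last permitted pointer corresponds to the last code point.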
The index Shift_JIS pointer for code point is the return value of these steps:
- Let index be index jis0208 excluding all entries whose pointer is in the range 8272 to 8835, inclusive.
  The index jis0208 contains duplicate code points so the exclusion of these entries causes later code points to be used.
- Return the index pointer for code point in index.
The index Big5 pointer for code point is the return value of these steps:
- Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.
  Avoid returning Hong Kong Supplementary Character Set extensions literally.
- If code point is U+2550, U+255E, U+2561, U+256A, U+5341, or U+5345, return the last pointer corresponding to code point in index.
  There are other duplicate code points, but for those the first pointer is to be used.
- Return the index pointer for code point in index.
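Both special-cased pointer lookups amount to filtering the index before searching it. The sketch below assumes an index is an array of `[pointer, codePoint]` pairs in pointer order. The function names and the sample entries are hypothetical and do not reflect real index contents.

```javascript
// Shift_JIS: ignore pointers 8272 to 8835, then take the first match.
function indexShiftJisPointer(codePoint, jis0208) {
  const index = jis0208.filter(([p]) => p < 8272 || p > 8835);
  const entry = index.find(([, cp]) => cp === codePoint);
  return entry ? entry[0] : null;
}

// Code points for which the last Big5 pointer is used instead of the first.
const BIG5_USE_LAST = new Set([0x2550, 0x255e, 0x2561, 0x256a, 0x5341, 0x5345]);

// Big5: ignore pointers below (0xA1 - 0x81) * 157 = 5024.
function indexBig5Pointer(codePoint, big5) {
  const index = big5.filter(([p]) => p >= (0xa1 - 0x81) * 157);
  const matches = index.filter(([, cp]) => cp === codePoint);
  if (matches.length === 0) return null;
  return BIG5_USE_LAST.has(codePoint)
    ? matches[matches.length - 1][0]
    : matches[0][0];
}

// Made-up entries, for illustration only.
const fakeJis = [[100, 0x3000], [8300, 0x2252], [10000, 0x2252]];
const fakeBig5 = [[10, 0x5341], [5024, 0x5341], [5025, 0x4e00], [6000, 0x5341], [7000, 0x4e00]];
console.log(indexShiftJisPointer(0x2252, fakeJis)); // 10000
console.log(indexBig5Pointer(0x5341, fakeBig5)); // 6000
console.log(indexBig5Pointer(0x4e00, fakeBig5)); // 5025
```

These filters only affect encoding. Decoding still uses the complete index, so bytes that map to an excluded pointer continue to decode.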
All indexes are also available as a non-normative indexes.json resource. ( Index gb18030 ranges has a slightly different format here, to be able to represent ranges.)
6. Hooks for standards
The algorithms defined below ( decode , UTF-8 decode , UTF-8 decode without BOM , UTF-8 decode without BOM or fail , encode , UTF-8 encode , and BOM sniff ) are intended for usage by other standards.
For decoding, UTF-8 decode is to be used by new formats. For identifiers or byte sequences within a format or protocol, use UTF-8 decode without BOM or UTF-8 decode without BOM or fail .
For encoding, UTF-8 encode is to be used.
Standards are strongly discouraged from using decode , encode , and BOM sniff , except as needed for compatibility.
The get an encoding algorithm is to be used to turn a label into an encoding .
Standards are to ensure that the I/O queues they pass to the encode and UTF-8 encode algorithms are effectively I/O queues of scalar values, i.e., they contain no surrogates.
To decode an I/O queue of bytes ioQueue given a fallback encoding encoding and an optional I/O queue of scalar values output (default « »), run these steps:
- Let BOMEncoding be the result of BOM sniffing ioQueue.
- If BOMEncoding is non-null:
  - Set encoding to BOMEncoding.
  - Read three bytes from ioQueue, if BOMEncoding is UTF-8; otherwise read two bytes. (Do nothing with those bytes.)
  For compatibility with deployed content, the byte order mark is more authoritative than anything else. In a context where HTTP is used this is in violation of the semantics of the `Content-Type` header.
- Run encoding’s decoder with ioQueue and output.
- Return output.
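The precedence of the byte order mark over the fallback encoding can be sketched with the `TextDecoder` API. This is an approximation built on the API rather than on the decode hook itself. `sniff` and `decode` are hypothetical helper names, and `ignoreBOM: true` is used because the sketch removes the BOM bytes itself.

```javascript
// Return the label implied by a byte order mark, or null.
function sniff(bytes) {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "utf-8";
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  return null;
}

// Decode a Uint8Array, letting a BOM override the fallback label.
function decode(bytes, fallbackLabel) {
  const bomLabel = sniff(bytes);
  const label = bomLabel ?? fallbackLabel;
  // Three BOM bytes for UTF-8, two for the UTF-16 encodings.
  const skip = bomLabel === null ? 0 : bomLabel === "utf-8" ? 3 : 2;
  // ignoreBOM: true stops TextDecoder from doing its own BOM handling,
  // since the BOM bytes were already skipped above.
  return new TextDecoder(label, { ignoreBOM: true }).decode(bytes.subarray(skip));
}

// The caller asks for UTF-8, but the UTF-16LE BOM wins.
console.log(decode(new Uint8Array([0xff, 0xfe, 0x61, 0x00]), "utf-8")); // "a"
```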
To UTF-8 decode an I/O queue of bytes ioQueue given an optional I/O queue of scalar values output (default « »), run these steps:
- Let buffer be the result of reading three bytes from ioQueue.
- If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to ioQueue.
- Run UTF-8’s decoder with ioQueue and output.
- Return output.
To UTF-8 decode without BOM an I/O queue of bytes ioQueue given an optional I/O queue of scalar values output (default « »), run these steps:
- Run UTF-8’s decoder with ioQueue and output.
- Return output.
To UTF-8 decode without BOM or fail an I/O queue of bytes ioQueue given an optional I/O queue of scalar values output (default « »), run these steps:
- Let potentialError be the result of running UTF-8’s decoder with ioQueue, output, and "fatal".
- If potentialError is error, return failure.
- Return output.
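The three UTF-8 decode hooks differ only in BOM handling and error handling. The `TextDecoder` API exposes comparable switches, so the differences can be observed as in the following sketch. The mapping between hooks and options is an approximation for illustration; `utf8DecodeWithoutBomOrFail` is a hypothetical name and `null` stands in for failure.

```javascript
const bytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x61]); // BOM + "a"

// Comparable to "UTF-8 decode": a leading BOM is dropped.
const withBomHandling = new TextDecoder("utf-8").decode(bytes);

// Comparable to "UTF-8 decode without BOM": the BOM is kept as U+FEFF.
const withoutBomHandling = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

// Comparable to "UTF-8 decode without BOM or fail": errors are reported
// instead of being replaced with U+FFFD.
function utf8DecodeWithoutBomOrFail(input) {
  try {
    return new TextDecoder("utf-8", { fatal: true, ignoreBOM: true }).decode(input);
  } catch (e) {
    return null; // stands in for failure
  }
}

console.log(withBomHandling.length); // 1
console.log(withoutBomHandling.length); // 2
console.log(utf8DecodeWithoutBomOrFail(new Uint8Array([0xff]))); // null
```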
To encode an I/O queue of scalar values ioQueue given an encoding encoding and an optional I/O queue of bytes output (default « »), run these steps:
- Assert: encoding is not replacement, UTF-16BE, or UTF-16LE.
- Run encoding’s encoder with ioQueue, output, and "html".
- Return output.
This is mostly a legacy hook for URLs and HTML forms. Layering UTF-8 encode on top is safe as it never triggers errors. [URL] [HTML]
To UTF-8 encode an I/O queue of scalar values ioQueue given an optional I/O queue of bytes output (default « »), return the result of encoding ioQueue with encoding UTF-8 and output.
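From JavaScript, the closest counterpart to UTF-8 encode is `TextEncoder`. The short sketch below also shows why the input is expected to contain no surrogates: the API converts its argument to scalar values first, so a lone surrogate in a JavaScript string is turned into U+FFFD before the encoder ever sees it.

```javascript
const encoder = new TextEncoder();

// U+1F4A9 is a single scalar value and encodes to four bytes.
const pile = encoder.encode("\u{1F4A9}");
console.log(Array.from(pile, (b) => b.toString(16))); // f0 9f 92 a9

// A lone surrogate is replaced with U+FFFD (EF BF BD) before encoding.
const lone = encoder.encode("\uD83D");
console.log(Array.from(lone, (b) => b.toString(16))); // ef bf bd
```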
To BOM sniff an I/O queue of bytes ioQueue, run these steps:
- Let BOM be the result of peeking 3 bytes from ioQueue, converted to a byte sequence.
- For each of the rows in the table below, starting with the first one and going down, if BOM starts with the bytes given in the first column, then return the encoding given in the cell in the second column of that row. Otherwise, return null.

Byte order mark | Encoding |
---|---|
0xEF 0xBB 0xBF | UTF-8 |
0xFE 0xFF | UTF-16BE |
0xFF 0xFE | UTF-16LE |
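BOM sniffing is a prefix comparison against three byte sequences, tried in table order. A minimal sketch follows, assuming the input is a `Uint8Array` or plain array of bytes, with `bomSniff` and `BOMS` as hypothetical names.

```javascript
// Byte order marks in the order they are to be tried.
const BOMS = [
  [[0xef, 0xbb, 0xbf], "UTF-8"],
  [[0xfe, 0xff], "UTF-16BE"],
  [[0xff, 0xfe], "UTF-16LE"],
];

function bomSniff(bytes) {
  // Peek at up to three bytes; slice copies, so nothing is consumed.
  const prefix = Array.from(bytes.slice(0, 3));
  for (const [mark, encoding] of BOMS) {
    if (mark.every((b, i) => prefix[i] === b)) return encoding;
  }
  return null;
}

console.log(bomSniff(new Uint8Array([0xef, 0xbb, 0xbf, 0x61]))); // "UTF-8"
console.log(bomSniff(new Uint8Array([0xff, 0xfe]))); // "UTF-16LE"
console.log(bomSniff(new Uint8Array([0x61, 0x62]))); // null
```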
This hook is a workaround for the fact that decode has no way to communicate back to the caller that it has found a byte order mark and is therefore not using the provided encoding. The hook is to be invoked before decode , and it will return an encoding corresponding to the byte order mark found, or null otherwise.
7. API
This section uses terminology from Web IDL. Browser user agents must support this API. JavaScript implementations should support this API. Other user agents or programming languages are encouraged to use an API suitable to their needs, which might not be this one. [WEBIDL]
The following example uses the TextEncoder object to encode an array of strings into an ArrayBuffer. The result is a Uint8Array containing the number of strings (as a Uint32Array), followed by the length of the first string (as a Uint32Array), the UTF-8 encoded string data, the length of the second string (as a Uint32Array), the string data, and so on.
function encodeArrayOfStrings(strings) {
  var encoder, encoded, len, bytes, view, offset;

  encoder = new TextEncoder();
  encoded = [];

  // First pass: encode each string and compute the total byte length.
  len = Uint32Array.BYTES_PER_ELEMENT;
  for (var i = 0; i < strings.length; i++) {
    len += Uint32Array.BYTES_PER_ELEMENT;
    encoded[i] = encoder.encode(strings[i]);
    len += encoded[i].byteLength;
  }

  bytes = new Uint8Array(len);
  view = new DataView(bytes.buffer);
  offset = 0;

  // Second pass: write the string count, then each length and its bytes.
  view.setUint32(offset, strings.length);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (var i = 0; i < encoded.length; i += 1) {
    len = encoded[i].byteLength;
    view.setUint32(offset, len);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    bytes.set(encoded[i], offset);
    offset += len;
  }
  return bytes.buffer;
}
The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example, or an equivalent algorithm for encodings other than UTF-8, back into an array of strings.
function decodeArrayOfStrings(buffer, encoding) {
  var decoder, view, offset, num_strings, strings, len;

  decoder = new TextDecoder(encoding);
  view = new DataView(buffer);
  offset = 0;
  strings = [];

  // Read the string count, then each length-prefixed string in turn.
  num_strings = view.getUint32(offset);
  offset += Uint32Array.BYTES_PER_ELEMENT;
  for (var i = 0; i < num_strings; i++) {
    len = view.getUint32(offset);
    offset += Uint32Array.BYTES_PER_ELEMENT;
    strings[i] = decoder.decode(
      new DataView(view.buffer, offset, len));
    offset += len;
  }
  return strings;
}
7.1. Interface mixin TextDecoderCommon
interface mixin TextDecoderCommon {
  readonly attribute DOMString encoding;
  readonly attribute boolean fatal;
  readonly attribute boolean ignoreBOM;
};
The TextDecoderCommon interface mixin defines common getters that are shared between TextDecoder and TextDecoderStream objects. These objects have an associated:
- encoding
- An encoding.
- decoder
- A decoder.
- I/O queue
- An I/O queue of bytes.
- ignore BOM
- A boolean, initially false.
- BOM seen
- A boolean, initially false.
- error mode
- An error mode, initially "replacement".
The serialize I/O queue algorithm, given a TextDecoderCommon decoder and an I/O queue of scalar values ioQueue, runs these steps:
- Let output be the empty string.
- While true:
  - Let item be the result of reading from ioQueue.
  - If item is end-of-queue, then return output.
  - If decoder’s encoding is UTF-8, UTF-16BE, or UTF-16LE, and decoder’s ignore BOM and BOM seen are false, then:
    - Set decoder’s BOM seen to true.
    - If item is U+FEFF, then continue.
  - Append item to output.
This algorithm is intentionally different with respect to BOM handling from the decode algorithm used by the rest of the platform to give API users more control.
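The effect of ignore BOM can be observed from script. The following is a minimal sketch, assuming a runtime that exposes TextDecoder as a global (browsers and recent Node.js releases); the byte values are illustrative only.

```javascript
// UTF-8 BOM (EF BB BF) followed by the bytes for "hi".
const withBOM = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]);

// Default: ignore BOM is false, so the leading U+FEFF is dropped from the output.
const stripped = new TextDecoder("utf-8").decode(withBOM);

// ignoreBOM: true leaves U+FEFF in the output.
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBOM);

console.log(stripped.length);               // 2
console.log(kept.length);                   // 3
console.log(kept.charCodeAt(0) === 0xFEFF); // true
```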
The encoding getter steps are to return this’s encoding’s name, ASCII lowercased.
The fatal getter steps are to return true if this’s error mode is "fatal", otherwise false.
The ignoreBOM getter steps are to return this’s ignore BOM.
7.2. Interface TextDecoder
dictionary TextDecoderOptions {
  boolean fatal = false;
  boolean ignoreBOM = false;
};
dictionary TextDecodeOptions {
  boolean stream = false;
};
[Exposed=(Window,Worker)]
interface TextDecoder {
  constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options = {});
  USVString decode(optional [AllowShared] BufferSource input, optional TextDecodeOptions options = {});
};
TextDecoder includes TextDecoderCommon;
A TextDecoder object has an associated do not flush, which is a boolean, initially false.
-
decoder = new TextDecoder([label = "utf-8" [, options]])
-
Returns a new TextDecoder object.
If label is either not a label or is a label for replacement, throws a RangeError.
-
decoder.encoding
-
Returns encoding’s name, lowercased.
-
decoder.fatal
-
Returns true if error mode is "fatal", otherwise false.
-
decoder.ignoreBOM
-
Returns the value of ignore BOM.
-
decoder.decode([input [, options]])
-
Returns the result of running encoding’s decoder. The method can be invoked zero or more times with options’s stream set to true, and then once without options’s stream (or set to false), to process a fragmented input. If the invocation without options’s stream (or set to false) has no input, it’s clearest to omit both arguments.
var string = "", decoder = new TextDecoder(encoding), buffer;
while (buffer = next_chunk()) {
  string += decoder.decode(buffer, {stream: true});
}
string += decoder.decode(); // end-of-queue
If the error mode is "fatal" and encoding’s decoder returns error, throws a TypeError.
The new TextDecoder(label, options) constructor steps are:
-
Let encoding be the result of getting an encoding from label.
-
If encoding is failure or replacement, then throw a RangeError.
-
Set this’s encoding to encoding.
-
If options["fatal"] is true, then set this’s error mode to "fatal".
-
Set this’s ignore BOM to options["ignoreBOM"].
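The TextDecoder constructor steps can be exercised as in the following sketch. It assumes a global TextDecoder; describeLabel is a hypothetical helper introduced only for this illustration, and "not-an-encoding" is simply a string that is not a label.

```javascript
// Hypothetical helper: returns the encoding name a label resolves to,
// or "RangeError" when the constructor rejects the label.
function describeLabel(label) {
  try {
    return new TextDecoder(label).encoding;
  } catch (e) {
    return e instanceof RangeError ? "RangeError" : "other error";
  }
}

console.log(describeLabel("UTF8"));            // "utf-8" (labels match case-insensitively)
console.log(describeLabel("not-an-encoding")); // "RangeError"

// Both options are reflected by the getters.
const strict = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
console.log(strict.fatal, strict.ignoreBOM);   // true true
```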
The decode(input, options) method steps are:
-
If this’s do not flush is false, then set this’s decoder to a new decoder for this’s encoding, this’s I/O queue to the I/O queue of bytes « end-of-queue », and this’s BOM seen to false.
-
Set this’s do not flush to options["stream"].
-
If input is given, then push a copy of input to this’s I/O queue.
Implementations are strongly encouraged to use an implementation strategy that avoids this copy. When doing so they will have to make sure that changes to input do not affect future calls to decode().
-
Let output be the I/O queue of scalar values « end-of-queue ».
-
While true:
-
Let item be the result of reading from this’s I/O queue.
-
If item is end-of-queue and this’s do not flush is true, then return the result of running serialize I/O queue with this and output.
The way streaming works is to not handle end-of-queue here when this’s do not flush is true and to not set it to false. That way in a subsequent invocation this’s decoder is not set anew in the first step of the algorithm and its state is preserved.
-
Otherwise:
-
Let result be the result of processing item for this’s decoder, this’s I/O queue, output, and this’s error mode.
-
If result is finished, then return the result of running serialize I/O queue with this and output.
-
Otherwise, if result is error, throw a TypeError.
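The role of do not flush in decode() is easiest to see when a multi-byte sequence is split across calls. A minimal sketch, assuming a global TextDecoder; the chunk boundaries are chosen for illustration.

```javascript
// "€" is U+20AC, which UTF-8 encodes as E2 82 AC. Split it across two chunks.
const chunks = [new Uint8Array([0xE2, 0x82]), new Uint8Array([0xAC])];

// Streaming: do not flush stays true between calls, so decoder state is preserved.
const streaming = new TextDecoder("utf-8");
let streamed = "";
for (const chunk of chunks) {
  streamed += streaming.decode(chunk, { stream: true });
}
streamed += streaming.decode(); // final call without stream flushes the decoder

// Non-streaming: every call starts with a fresh decoder, so the split
// sequence cannot be reassembled and U+FFFD appears in the output.
const oneShot = new TextDecoder("utf-8");
const broken = chunks.map((chunk) => oneShot.decode(chunk)).join("");

console.log(streamed === "\u20AC");     // true
console.log(broken.includes("\uFFFD")); // true
```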
7.3. Interface mixin TextEncoderCommon
interface mixin TextEncoderCommon {
  readonly attribute DOMString encoding;
};
The TextEncoderCommon interface mixin defines common getters that are shared between TextEncoder and TextEncoderStream objects.
The encoding getter steps are to return "utf-8".
7.4. Interface TextEncoder
dictionary TextEncoderEncodeIntoResult {
  unsigned long long read;
  unsigned long long written;
};
[Exposed=(Window,Worker)]
interface TextEncoder {
  constructor();
  [NewObject] Uint8Array encode(optional USVString input = "");
  TextEncoderEncodeIntoResult encodeInto(USVString source, [AllowShared] Uint8Array destination);
};
TextEncoder includes TextEncoderCommon;
A TextEncoder object offers no label argument as it only supports UTF-8. It also offers no stream option as no encoder requires buffering of scalar values.
-
encoder = new TextEncoder()
-
Returns a new TextEncoder object.
-
encoder.encoding
-
Returns "utf-8".
-
encoder.encode([input = ""])
-
Returns the result of running UTF-8’s encoder.
-
encoder.encodeInto(source, destination)
-
Runs the UTF-8 encoder on source, stores the result of that operation into destination, and returns the progress made as an object wherein read is the number of converted code units of source and written is the number of bytes modified in destination.
The new TextEncoder() constructor steps are to do nothing.
The encode(input) method steps are:
-
Convert input to an I/O queue of scalar values.
-
Let output be the I/O queue of bytes « end-of-queue ».
-
While true:
-
Let item be the result of reading from input.
-
Let result be the result of processing item for the UTF-8 encoder, input, output.
-
Assert: result is not error.
The UTF-8 encoder cannot return error.
-
If result is finished, convert output into a byte sequence, and then return a Uint8Array object wrapping an ArrayBuffer containing output.
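A short sketch of encode() in use, assuming a global TextEncoder. The inputs are illustrative; the lone-surrogate case shows the effect of the USVString conversion applied to the argument.

```javascript
const encoder = new TextEncoder();

// U+20AC needs three UTF-8 bytes.
const euro = encoder.encode("\u20AC");

// A lone surrogate is not a scalar value; the USVString conversion replaces
// it with U+FFFD, which UTF-8 encodes as EF BF BD.
const lone = encoder.encode("\uD800");

// With no argument the default input "" yields an empty Uint8Array.
const empty = encoder.encode();

console.log(Array.from(euro)); // [ 226, 130, 172 ]
console.log(Array.from(lone)); // [ 239, 191, 189 ]
console.log(empty.length);     // 0
```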
The encodeInto(source, destination) method steps are:
-
Let read be 0.
-
Let written be 0.
-
Let destinationBytes be the result of getting a reference to the bytes held by destination.
-
Let unused be the I/O queue of scalar values « end-of-queue ».
The handler algorithm invoked below requires this argument, but it is not used by the UTF-8 encoder.
-
Convert source to an I/O queue of scalar values.
-
While true:
-
Let item be the result of reading from source.
-
Let result be the result of running the UTF-8 encoder’s handler on unused and item.
-
If result is finished, then break.
-
Otherwise:
-
If destinationBytes’s length − written is greater than or equal to the number of bytes in result, then:
-
If item is greater than U+FFFF, then increment read by 2.
-
Otherwise, increment read by 1.
-
Write the bytes in result into destinationBytes, from byte offset written.
See the warning for SharedArrayBuffer objects above.
-
Increment written by the number of bytes in result.
-
Otherwise, break.
-
Return «[ "read" → read, "written" → written ]».
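The read and written counters are easiest to follow with a destination that is too small. A minimal sketch, assuming a global TextEncoder with encodeInto() support; the buffer sizes are chosen for illustration.

```javascript
const encoder = new TextEncoder();

// "a" (1 byte), "€" (3 bytes), and U+1F600 (4 bytes, two UTF-16 code units).
const source = "a\u20AC\u{1F600}";

// Five bytes of room: "a" and "€" fit, the four-byte sequence does not,
// so the loop breaks without writing a partial sequence.
const small = new Uint8Array(5);
const partial = encoder.encodeInto(source, small);

// Eight bytes of room: everything fits.
const large = new Uint8Array(8);
const complete = encoder.encodeInto(source, large);

console.log(partial.read, partial.written);   // 2 4
console.log(complete.read, complete.written); // 4 8
```

Note that read counts code units, not scalar values, which is why the supplementary character contributes 2.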
The encodeInto() method can be used to encode a string into an existing ArrayBuffer object. Various details below are left as an exercise for the reader, but this demonstrates an approach one could take to use this method:
function convertString(buffer, input, callback) {
  // malloc(), realloc(), free(), and cachedEncoder are assumed to be defined elsewhere.
  let bufferSize = 256,
      bufferStart = malloc(buffer, bufferSize),
      writeOffset = 0,
      readOffset = 0;
  while (true) {
    // Encode the unconverted remainder of input into the free part of the buffer.
    const view = new Uint8Array(buffer, bufferStart + writeOffset, bufferSize - writeOffset),
          {read, written} = cachedEncoder.encodeInto(input.substring(readOffset), view);
    readOffset += read;
    writeOffset += written;
    if (readOffset === input.length) {
      callback(bufferStart, writeOffset);
      free(buffer, bufferStart);
      return;
    }
    // Not everything fit: grow the buffer and try again.
    bufferSize *= 2;
    bufferStart = realloc(buffer, bufferStart, bufferSize);
  }
}
7.5. Interface mixin GenericTransformStream
The GenericTransformStream interface mixin represents the concept of a transform stream in IDL. It is not a TransformStream, though it has the same interface and it delegates to one.
interface mixin GenericTransformStream {
  readonly attribute ReadableStream readable;
  readonly attribute WritableStream writable;
};
An object that includes GenericTransformStream has an associated transform of type TransformStream.
The readable getter steps are to return this’s transform.[[readable]].
The writable getter steps are to return this’s transform.[[writable]].
7.6. Interface TextDecoderStream
[Exposed=(Window,Worker)]
interface TextDecoderStream {
  constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options = {});
};
TextDecoderStream includes TextDecoderCommon;
TextDecoderStream includes GenericTransformStream;
-
decoder = new TextDecoderStream([label = "utf-8" [, options]])
-
Returns a new TextDecoderStream object.
If label is either not a label or is a label for replacement, throws a RangeError.
-
decoder.encoding
-
Returns encoding’s name, lowercased.
-
decoder.fatal
-
Returns true if error mode is "fatal", and false otherwise.
-
decoder.ignoreBOM
-
Returns the value of ignore BOM.
-
decoder.readable
-
Returns a readable stream whose chunks are strings resulting from running encoding’s decoder on the chunks written to writable.
-
decoder.writable
-
Returns a writable stream which accepts [AllowShared] BufferSource chunks and runs them through encoding’s decoder before making them available to readable.
Typically this will be used via the pipeThrough() method on a ReadableStream source.
var decoder = new TextDecoderStream(encoding);
byteReadable
  .pipeThrough(decoder)
  .pipeTo(textWritable);
If the error mode is "fatal" and encoding’s decoder returns error, both readable and writable will be errored with a TypeError.
The new TextDecoderStream(label, options) constructor steps are:
-
Let encoding be the result of getting an encoding from label.
-
If encoding is failure or replacement, then throw a RangeError.
-
Set this’s encoding to encoding.
-
If options["fatal"] is true, then set this’s error mode to "fatal".
-
Set this’s ignore BOM to options["ignoreBOM"].
-
Set this’s decoder to a new decoder for this’s encoding, and set this’s I/O queue to a new I/O queue.
-
Let startAlgorithm be an algorithm that takes no arguments and returns nothing.
-
Let transformAlgorithm be an algorithm which takes a chunk argument and runs the decode and enqueue a chunk algorithm with this and chunk.
-
Let flushAlgorithm be an algorithm which takes no arguments and runs the flush and enqueue algorithm with this.
-
Let transform be the result of calling CreateTransformStream(startAlgorithm, transformAlgorithm, flushAlgorithm).
-
Set this’s transform to transform.
The decode and enqueue a chunk algorithm, given a TextDecoderStream object decoder and a chunk, runs these steps:
-
Let bufferSource be the result of converting chunk to an [AllowShared] BufferSource. If this throws an exception, then return a promise rejected with that exception.
-
Push a copy of bufferSource to decoder’s I/O queue. If this throws an exception, then return a promise rejected with that exception.
See the warning for SharedArrayBuffer objects above.
-
Let controller be decoder’s transform.[[transformStreamController]].
-
Let output be the I/O queue of scalar values « end-of-queue ».
-
While true:
-
Let item be the result of reading from decoder’s I/O queue.
-
If item is end-of-queue, then:
-
Let outputChunk be the result of running serialize I/O queue with decoder and output.
-
If outputChunk is non-empty, call TransformStreamDefaultControllerEnqueue(controller, outputChunk).
-
Return a new promise resolved with undefined.
-
Let result be the result of processing item for decoder’s decoder, decoder’s I/O queue, output, and decoder’s error mode.
-
If result is error, then return a new promise rejected with a TypeError exception.
The flush and enqueue algorithm, which handles the end of data from the input ReadableStream object, given a TextDecoderStream object decoder, runs these steps:
-
Let output be the I/O queue of scalar values « end-of-queue ».
-
Let result be the result of processing end-of-queue for decoder’s decoder, decoder’s I/O queue, output, and decoder’s error mode.
-
If result is finished, then:
-
Let outputChunk be the result of running serialize I/O queue with decoder and output.
-
Let controller be decoder’s transform.[[transformStreamController]].
-
If outputChunk is non-empty, call TransformStreamDefaultControllerEnqueue(controller, outputChunk).
-
Return a new promise resolved with undefined.
-
Otherwise, return a new promise rejected with a TypeError exception.
7.7. Interface TextEncoderStream
[Exposed=(Window,Worker)]
interface TextEncoderStream {
  constructor();
};
TextEncoderStream includes TextEncoderCommon;
TextEncoderStream includes GenericTransformStream;
A TextEncoderStream object has an associated:
- encoder: An encoder.
- pending high surrogate: Null or a surrogate, initially null.
A TextEncoderStream object offers no label argument as it only supports UTF-8.
-
encoder = new TextEncoderStream()
-
Returns a new TextEncoderStream object.
-
encoder.encoding
-
Returns "utf-8".
-
encoder.readable
-
Returns a readable stream whose chunks are Uint8Arrays resulting from running UTF-8’s encoder on the chunks written to writable.
-
encoder.writable
-
Returns a writable stream which accepts string chunks and runs them through UTF-8’s encoder before making them available to readable.
Typically this will be used via the pipeThrough() method on a ReadableStream source.
textReadable
  .pipeThrough(new TextEncoderStream())
  .pipeTo(byteWritable);
The new TextEncoderStream() constructor steps are:
-
Let startAlgorithm be an algorithm that takes no arguments and returns nothing.
-
Let transformAlgorithm be an algorithm which takes a chunk argument and runs the encode and enqueue a chunk algorithm with this and chunk.
-
Let flushAlgorithm be an algorithm which runs the encode and flush algorithm with this.
-
Let transform be the result of calling CreateTransformStream(startAlgorithm, transformAlgorithm, flushAlgorithm).
-
Set this’s transform to transform.
The encode and enqueue a chunk algorithm, given a TextEncoderStream object encoder and chunk, runs these steps:
-
Let input be the result of converting chunk to a DOMString. If this throws an exception, then return a promise rejected with that exception.
-
Convert input to an I/O queue of code units.
DOMString, as well as an I/O queue of code units rather than scalar values, are used here so that a surrogate pair that is split between chunks can be reassembled into the appropriate scalar value. The behavior is otherwise identical to USVString. In particular, lone surrogates will be replaced with U+FFFD.
-
Let output be the I/O queue of bytes « end-of-queue ».
-
Let controller be encoder’s transform.[[transformStreamController]].
-
While true:
-
Let item be the result of reading from input.
-
If item is end-of-queue, then:
-
Convert output into a byte sequence.
-
If output is non-empty, then:
-
Let chunk be a Uint8Array object wrapping an ArrayBuffer containing output.
-
Call TransformStreamDefaultControllerEnqueue(controller, chunk).
-
Return a new promise resolved with undefined.
-
Let result be the result of executing the convert code unit to scalar value algorithm with encoder, item and input.
-
If result is not continue, then process result for encoder, input, output.
The convert code unit to scalar value algorithm, given a TextEncoderStream object encoder, a code unit item, and an I/O queue of code units input, runs these steps:
-
If encoder’s pending high surrogate is non-null, then:
-
Let high surrogate be encoder’s pending high surrogate.
-
Set encoder’s pending high surrogate to null.
-
If item is in the range U+DC00 to U+DFFF, inclusive, then return a scalar value whose value is 0x10000 + ((high surrogate − 0xD800) << 10) + (item − 0xDC00).
-
Prepend item to input.
-
Return U+FFFD.
-
If item is in the range U+D800 to U+DBFF, inclusive, then set pending high surrogate to item and return continue.
-
If item is in the range U+DC00 to U+DFFF, inclusive, then return U+FFFD.
-
Return item.
This is equivalent to the "convert a string into a scalar value string" algorithm from the Infra Standard, but allows for surrogate pairs that are split between strings. [INFRA]
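The convert code unit to scalar value steps can be sketched in script as follows. This is an illustration only: convertCodeUnit, scalarValuesOf, the state object, and the CONTINUE sentinel are hypothetical names, and a plain array stands in for the I/O queue of code units.

```javascript
const CONTINUE = Symbol("continue");

// state.pending plays the role of the pending high surrogate.
// queue is an array of UTF-16 code units standing in for the I/O queue.
function convertCodeUnit(state, item, queue) {
  if (state.pending !== null) {
    const high = state.pending;
    state.pending = null;
    if (item >= 0xDC00 && item <= 0xDFFF) {
      return 0x10000 + ((high - 0xD800) << 10) + (item - 0xDC00);
    }
    queue.unshift(item); // prepend, so item is read again on the next iteration
    return 0xFFFD;
  }
  if (item >= 0xD800 && item <= 0xDBFF) {
    state.pending = item;
    return CONTINUE;
  }
  if (item >= 0xDC00 && item <= 0xDFFF) return 0xFFFD;
  return item;
}

// Runs one string chunk through the algorithm and collects the scalar values.
function scalarValuesOf(state, chunk) {
  const queue = Array.from({ length: chunk.length }, (_, i) => chunk.charCodeAt(i));
  const out = [];
  while (queue.length > 0) {
    const result = convertCodeUnit(state, queue.shift(), queue);
    if (result !== CONTINUE) out.push(result);
  }
  return out;
}

// A surrogate pair for U+1F600 split between two chunks.
const state = { pending: null };
const first = scalarValuesOf(state, "a\uD83D");  // [0x61], high surrogate left pending
const second = scalarValuesOf(state, "\uDE00b"); // [0x1F600, 0x62]
```

Keeping the pending surrogate on a state object that outlives each chunk is what lets the pair be reassembled across chunk boundaries.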
The encode and flush algorithm, given a TextEncoderStream object encoder, runs these steps:
-
If encoder’s pending high surrogate is non-null, then:
-
Let controller be encoder’s transform.[[transformStreamController]].
-
Let output be the byte sequence 0xEF 0xBF 0xBD.
This is the replacement character U+FFFD encoded as UTF-8.
-
Let chunk be a Uint8Array object wrapping an ArrayBuffer containing output.
-
Call TransformStreamDefaultControllerEnqueue(controller, chunk).
-
Return a new promise resolved with undefined.
8. The encoding
8.1. UTF-8
8.1.1. UTF-8 decoder
A byte order mark has priority over a label as it has been found to be more accurate in deployed content. Therefore it is not part of the UTF-8 decoder algorithm but rather the decode and UTF-8 decode algorithms.
UTF-8’s decoder has an associated UTF-8 code point, UTF-8 bytes seen, and UTF-8 bytes needed (all initially 0), a UTF-8 lower boundary (initially 0x80), and a UTF-8 upper boundary (initially 0xBF).
UTF-8’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue and UTF-8 bytes needed is not 0, set UTF-8 bytes needed to 0 and return error.
-
If byte is end-of-queue, return finished.
-
If UTF-8 bytes needed is 0, based on byte :
- 0x00 to 0x7F
-
Return a code point whose value is byte .
- 0xC2 to 0xDF
-
-
Set UTF-8 bytes needed to 1.
-
Set UTF-8 code point to byte & 0x1F.
The five least significant bits of byte .
-
- 0xE0 to 0xEF
-
-
If byte is 0xE0, set UTF-8 lower boundary to 0xA0.
-
If byte is 0xED, set UTF-8 upper boundary to 0x9F.
-
Set UTF-8 bytes needed to 2.
-
Set UTF-8 code point to byte & 0xF.
The four least significant bits of byte .
-
- 0xF0 to 0xF4
-
-
If byte is 0xF0, set UTF-8 lower boundary to 0x90.
-
If byte is 0xF4, set UTF-8 upper boundary to 0x8F.
-
Set UTF-8 bytes needed to 3.
-
Set UTF-8 code point to byte & 0x7.
The three least significant bits of byte .
-
- Otherwise
-
Return error .
Return continue .
-
If byte is not in the range UTF-8 lower boundary to UTF-8 upper boundary , inclusive, then:
-
Set UTF-8 code point , UTF-8 bytes needed , and UTF-8 bytes seen to 0, set UTF-8 lower boundary to 0x80, and set UTF-8 upper boundary to 0xBF.
-
Prepend byte to ioQueue.
-
Return error .
-
-
Set UTF-8 lower boundary to 0x80 and UTF-8 upper boundary to 0xBF.
-
Set UTF-8 code point to ( UTF-8 code point << 6) | ( byte & 0x3F)
Shift the existing bits of UTF-8 code point left by six places and set the newly-vacated six least significant bits to the six least significant bits of byte .
-
Increase UTF-8 bytes seen by one.
-
If UTF-8 bytes seen is not equal to UTF-8 bytes needed , return continue .
-
Let code point be UTF-8 code point .
-
Set UTF-8 code point , UTF-8 bytes needed , and UTF-8 bytes seen to 0.
-
Return a code point whose value is code point .
The constraints in the UTF-8 decoder above match “Best Practices for Using U+FFFD” from the Unicode standard. No other behavior is permitted per the Encoding Standard (other algorithms that achieve the same result are fine, even encouraged). [UNICODE]
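The handler steps can be sketched as a small state machine. This is an illustration under simplifying assumptions, not a conforming implementation: utf8Decode is a hypothetical name, it takes a whole array of bytes instead of an I/O queue, and it always behaves as if the error mode were "replacement" (each error becomes U+FFFD).

```javascript
// Returns an array of code points decoded from an array of byte values.
function utf8Decode(bytes) {
  let codePoint = 0, bytesSeen = 0, bytesNeeded = 0;
  let lower = 0x80, upper = 0xBF;
  const out = [];
  let i = 0;
  while (i < bytes.length) {
    const byte = bytes[i++];
    if (bytesNeeded === 0) {
      if (byte <= 0x7F) {
        out.push(byte);
      } else if (byte >= 0xC2 && byte <= 0xDF) {
        bytesNeeded = 1;
        codePoint = byte & 0x1F;
      } else if (byte >= 0xE0 && byte <= 0xEF) {
        if (byte === 0xE0) lower = 0xA0;
        if (byte === 0xED) upper = 0x9F;
        bytesNeeded = 2;
        codePoint = byte & 0xF;
      } else if (byte >= 0xF0 && byte <= 0xF4) {
        if (byte === 0xF0) lower = 0x90;
        if (byte === 0xF4) upper = 0x8F;
        bytesNeeded = 3;
        codePoint = byte & 0x7;
      } else {
        out.push(0xFFFD); // not a valid lead byte
      }
      continue;
    }
    if (byte < lower || byte > upper) {
      codePoint = bytesNeeded = bytesSeen = 0;
      lower = 0x80; upper = 0xBF;
      i--;              // stands in for "prepend byte to ioQueue"
      out.push(0xFFFD);
      continue;
    }
    lower = 0x80; upper = 0xBF;
    codePoint = (codePoint << 6) | (byte & 0x3F);
    bytesSeen++;
    if (bytesSeen !== bytesNeeded) continue;
    out.push(codePoint);
    codePoint = bytesNeeded = bytesSeen = 0;
  }
  // end-of-queue in the middle of a sequence is an error.
  if (bytesNeeded !== 0) out.push(0xFFFD);
  return out;
}

console.log(utf8Decode([0xE2, 0x82, 0xAC])); // [ 8364 ], i.e. U+20AC
console.log(utf8Decode([0xE2, 0x22]));       // [ 65533, 34 ]: the 0x22 is not masked
```

The second call shows the property described in the security background: an ASCII byte following a truncated sequence is reprocessed rather than swallowed.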
8.1.2. UTF-8 encoder
UTF-8’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point , return a byte whose value is code point .
-
Set count and offset based on the range code point is in:
- U+0080 to U+07FF, inclusive
- 1 and 0xC0
- U+0800 to U+FFFF, inclusive
- 2 and 0xE0
- U+10000 to U+10FFFF, inclusive
- 3 and 0xF0
-
Let bytes be a byte sequence whose first byte is ( code point >> (6 × count )) + offset .
-
While count is greater than 0:
-
Set temp to code point >> (6 × ( count − 1)).
-
Append to bytes 0x80 | ( temp & 0x3F).
-
Decrease count by one.
-
-
Return bytes bytes , in order.
This algorithm has identical results to the one described in the Unicode standard. It is included here for completeness. [UNICODE]
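The count and offset steps translate almost directly into script. The following sketch handles a single scalar value; utf8EncodeCodePoint is a hypothetical name and the end-of-queue case is left out.

```javascript
// Returns the UTF-8 bytes for one scalar value as an array of numbers.
function utf8EncodeCodePoint(codePoint) {
  if (codePoint <= 0x7F) return [codePoint];
  let count, offset;
  if (codePoint <= 0x7FF) { count = 1; offset = 0xC0; }
  else if (codePoint <= 0xFFFF) { count = 2; offset = 0xE0; }
  else { count = 3; offset = 0xF0; }
  // Lead byte: the highest bits of the code point plus the length marker.
  const bytes = [(codePoint >> (6 * count)) + offset];
  // Continuation bytes: six bits each, most significant group first.
  while (count > 0) {
    const temp = codePoint >> (6 * (count - 1));
    bytes.push(0x80 | (temp & 0x3F));
    count--;
  }
  return bytes;
}

console.log(utf8EncodeCodePoint(0x20AC));  // [ 226, 130, 172 ]
console.log(utf8EncodeCodePoint(0x1F600)); // [ 240, 159, 152, 128 ]
```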
9. Legacy single-byte encodings
An encoding where each byte is either a single code point or nothing, is a single-byte encoding . Single-byte encodings share the decoder and encoder . Index single-byte , as referenced by the single-byte decoder and single-byte encoder , is defined by the following table, and depends on the single-byte encoding in use. All but two single-byte encodings have a unique index .
ISO-8859-8 and ISO-8859-8-I are distinct encoding names , because ISO-8859-8 has influence on the layout direction. And although historically this might have been the case for ISO-8859-6 and "ISO-8859-6-I" as well, that is no longer true.
9.1. single-byte decoder
Single-byte encodings’ decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue, return finished.
-
If byte is an ASCII byte , return a code point whose value is byte .
-
Let code point be the index code point for byte − 0x80 in index single-byte .
-
If code point is null, return error .
-
Return a code point whose value is code point .
9.2. single-byte encoder
Single-byte encodings’ encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point , return a byte whose value is code point .
-
Let pointer be the index pointer for code point in index single-byte .
-
If pointer is null, return error with code point .
-
Return a byte whose value is pointer + 0x80.
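Both single-byte handlers reduce to a table lookup with a 0x80 offset. The sketch below uses a made-up three-entry index fragment purely for illustration (real indexes have 128 entries and are defined by this specification); singleByteDecode and singleByteEncode are hypothetical names, and the string "error" stands in for the error result.

```javascript
// Hypothetical index fragment: pointer -> code point, null meaning unmapped.
const index = [0x20AC, null, 0x201A];

function singleByteDecode(byte) {
  if (byte <= 0x7F) return byte;           // ASCII byte
  const codePoint = index[byte - 0x80];    // index code point for byte − 0x80
  return codePoint == null ? "error" : codePoint;
}

function singleByteEncode(codePoint) {
  if (codePoint <= 0x7F) return codePoint; // ASCII code point
  const pointer = index.indexOf(codePoint); // index pointer for code point
  return pointer === -1 ? "error" : pointer + 0x80;
}

console.log(singleByteDecode(0x80));   // 8364 (U+20AC)
console.log(singleByteDecode(0x81));   // "error"
console.log(singleByteEncode(0x201A)); // 130 (0x82)
```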
10. Legacy multi-byte Chinese (simplified) encodings
10.1. GBK
10.1.1. GBK decoder
GBK ’s decoder is gb18030 ’s decoder .
10.1.2. GBK encoder
GBK ’s encoder is gb18030 ’s encoder with its is GBK set to true.
Not fully aliasing GBK with gb18030 is a conservative move to decrease the chances of breaking legacy servers and other consumers of content generated with GBK ’s encoder .
10.2. gb18030
10.2.1. gb18030 decoder
gb18030 ’s decoder has an associated gb18030 first , gb18030 second , and gb18030 third (all initially 0x00).
gb18030’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue and gb18030 first, gb18030 second, and gb18030 third are 0x00, return finished.
-
If byte is end-of-queue, and gb18030 first, gb18030 second, or gb18030 third is not 0x00, set gb18030 first, gb18030 second, and gb18030 third to 0x00, and return error.
-
If gb18030 third is not 0x00, then:
-
If byte is not in the range 0x30 to 0x39, inclusive, then:
-
Prepend gb18030 second, gb18030 third, and byte to ioQueue.
-
Set gb18030 first , gb18030 second , and gb18030 third to 0x00.
-
Return error .
-
-
Let code point be the index gb18030 ranges code point for (( gb18030 first − 0x81) × (10 × 126 × 10)) + (( gb18030 second − 0x30) × (10 × 126)) + (( gb18030 third − 0x81) × 10) + byte − 0x30.
-
Set gb18030 first , gb18030 second , and gb18030 third to 0x00.
-
If code point is null, return error .
-
Return a code point whose value is code point .
-
-
If gb18030 second is not 0x00, then:
-
If byte is in the range 0x81 to 0xFE, inclusive, set gb18030 third to byte and return continue .
-
Prepend gb18030 second followed by byte to ioQueue, set gb18030 first and gb18030 second to 0x00, and return error.
-
-
If gb18030 first is not 0x00, then:
-
If byte is in the range 0x30 to 0x39, inclusive, set gb18030 second to byte and return continue .
-
Let lead be gb18030 first , let pointer be null, and set gb18030 first to 0x00.
-
Let offset be 0x40 if byte is less than 0x7F, otherwise 0x41.
-
If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFE, inclusive, set pointer to ( lead − 0x81) × 190 + ( byte − offset ).
-
Let code point be null if pointer is null, otherwise the index code point for pointer in index gb18030 .
-
If code point is non-null, return a code point whose value is code point .
-
If byte is an ASCII byte, prepend byte to ioQueue.
-
Return error .
-
-
If byte is an ASCII byte , return a code point whose value is byte .
-
If byte is 0x80, return code point U+20AC.
-
If byte is in the range 0x81 to 0xFE, inclusive, set gb18030 first to byte and return continue .
-
Return error .
10.2.2. gb18030 encoder
gb18030 ’s encoder has an associated is GBK (initially false).
gb18030’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point , return a byte whose value is code point .
-
If code point is U+E5E5, return error with code point .
Index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 for compatibility with deployed content. Therefore it cannot roundtrip.
-
If is GBK is true and code point is U+20AC, return byte 0x80.
-
Let pointer be the index pointer for code point in index gb18030 .
-
If pointer is non-null, then:
-
Let lead be pointer / 190 + 0x81.
-
Let trail be pointer % 190.
-
Let offset be 0x40 if trail is less than 0x3F, otherwise 0x41.
-
Return two bytes whose values are lead and trail + offset .
-
-
Set pointer to the index gb18030 ranges pointer for code point .
-
Let byte1 be pointer / (10 × 126 × 10).
-
Set pointer to pointer % (10 × 126 × 10).
-
Let byte2 be pointer / (10 × 126).
-
Set pointer to pointer % (10 × 126).
-
Let byte3 be pointer / 10.
-
Let byte4 be pointer % 10.
-
Return four bytes whose values are byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30.
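The four-byte arithmetic in the encoder is the inverse of the pointer formula in the decoder. The following sketch shows only that arithmetic; the index gb18030 ranges lookup that maps between pointers and code points is omitted, and both function names are hypothetical.

```javascript
// Pointer -> four bytes, following the last steps of the encoder.
function pointerToFourBytes(pointer) {
  const byte1 = Math.floor(pointer / (10 * 126 * 10));
  pointer %= 10 * 126 * 10;
  const byte2 = Math.floor(pointer / (10 * 126));
  pointer %= 10 * 126;
  const byte3 = Math.floor(pointer / 10);
  const byte4 = pointer % 10;
  return [byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30];
}

// Four bytes -> pointer, following the formula in the decoder.
function fourBytesToPointer([first, second, third, fourth]) {
  return (first - 0x81) * (10 * 126 * 10)
       + (second - 0x30) * (10 * 126)
       + (third - 0x81) * 10
       + (fourth - 0x30);
}

console.log(pointerToFourBytes(0));                         // [ 129, 48, 129, 48 ]
console.log(fourBytesToPointer(pointerToFourBytes(12345))); // 12345
```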
11. Legacy multi-byte Chinese (traditional) encodings
11.1. Big5
11.1.1. Big5 decoder
Big5 ’s decoder has an associated Big5 lead (initially 0x00).
Big5’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue and Big5 lead is not 0x00, set Big5 lead to 0x00 and return error.
-
If byte is end-of-queue and Big5 lead is 0x00, return finished.
-
If Big5 lead is not 0x00, let lead be Big5 lead , let pointer be null, set Big5 lead to 0x00, and then:
-
Let offset be 0x40 if byte is less than 0x7F, otherwise 0x62.
-
If byte is in the range 0x40 to 0x7E, inclusive, or 0xA1 to 0xFE, inclusive, set pointer to ( lead − 0x81) × 157 + ( byte − offset ).
-
If there is a row in the table below whose first column is pointer , return the two code points listed in its second column (the third column is irrelevant):
Pointer | Code points | Notes
1133 | U+00CA U+0304 | Ê̄ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON)
1135 | U+00CA U+030C | Ê̌ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON)
1164 | U+00EA U+0304 | ê̄ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON)
1166 | U+00EA U+030C | ê̌ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON)
Since indexes are limited to single code points this table is used for these pointers.
-
Let code point be null if pointer is null, otherwise the index code point for pointer in index Big5 .
-
If code point is non-null, return a code point whose value is code point .
-
If byte is an ASCII byte, prepend byte to ioQueue.
-
Return error .
-
-
If byte is an ASCII byte , return a code point whose value is byte .
-
If byte is in the range 0x81 to 0xFE, inclusive, set Big5 lead to byte and return continue .
-
Return error .
11.1.2. Big5 encoder
Big5’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point , return a byte whose value is code point .
-
Let pointer be the index Big5 pointer for code point .
-
If pointer is null, return error with code point .
-
Let lead be pointer / 157 + 0x81.
-
Let trail be pointer % 157.
-
Let offset be 0x40 if trail is less than 0x3F, otherwise 0x62.
-
Return two bytes whose values are lead and trail + offset .
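The lead and trail arithmetic of the Big5 encoder mirrors the pointer computation in the decoder. A sketch of just that arithmetic, with hypothetical function names and without the index Big5 lookup:

```javascript
// Pointer -> [lead, trail byte], following the encoder steps.
function big5PointerToBytes(pointer) {
  const lead = Math.floor(pointer / 157) + 0x81;
  const trail = pointer % 157;
  const offset = trail < 0x3F ? 0x40 : 0x62;
  return [lead, trail + offset];
}

// [lead, byte] -> pointer, following the decoder steps; null when byte is
// outside the ranges 0x40 to 0x7E and 0xA1 to 0xFE.
function big5BytesToPointer(lead, byte) {
  const offset = byte < 0x7F ? 0x40 : 0x62;
  const inRange = (byte >= 0x40 && byte <= 0x7E) || (byte >= 0xA1 && byte <= 0xFE);
  return inRange ? (lead - 0x81) * 157 + (byte - offset) : null;
}

console.log(big5PointerToBytes(1133));       // [ 136, 98 ], i.e. 0x88 0x62
console.log(big5BytesToPointer(0x88, 0x62)); // 1133
```

Pointer 1133 is one of the pointers that the decoder maps to two code points.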
12. Legacy multi-byte Japanese encodings
12.1. EUC-JP
12.1.1. EUC-JP decoder
EUC-JP ’s decoder has an associated EUC-JP jis0212 (initially false) and EUC-JP lead (initially 0x00).
EUC-JP’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue and EUC-JP lead is not 0x00, set EUC-JP lead to 0x00, and return error.
-
If byte is end-of-queue and EUC-JP lead is 0x00, return finished.
-
If EUC-JP lead is 0x8E and byte is in the range 0xA1 to 0xDF, inclusive, set EUC-JP lead to 0x00 and return a code point whose value is 0xFF61 − 0xA1 + byte .
-
If EUC-JP lead is 0x8F and byte is in the range 0xA1 to 0xFE, inclusive, set EUC-JP jis0212 to true, set EUC-JP lead to byte , and return continue .
-
If EUC-JP lead is not 0x00, let lead be EUC-JP lead , set EUC-JP lead to 0x00, and then:
-
Let code point be null.
-
If lead and byte are both in the range 0xA1 to 0xFE, inclusive, then set code point to the index code point for ( lead − 0xA1) × 94 + byte − 0xA1 in index jis0208 if EUC-JP jis0212 is false and in index jis0212 otherwise.
-
Set EUC-JP jis0212 to false.
-
If code point is non-null, return a code point whose value is code point .
-
If byte is an ASCII byte, prepend byte to ioQueue.
-
Return error .
-
-
If byte is an ASCII byte , return a code point whose value is byte .
-
If byte is 0x8E, 0x8F, or in the range 0xA1 to 0xFE, inclusive, set EUC-JP lead to byte and return continue .
-
Return error .
12.1.2. EUC-JP encoder
EUC-JP’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point , return a byte whose value is code point .
-
If code point is U+00A5, return byte 0x5C.
-
If code point is U+203E, return byte 0x7E.
-
If code point is in the range U+FF61 to U+FF9F, inclusive, return two bytes whose values are 0x8E and code point − 0xFF61 + 0xA1.
-
If code point is U+2212, set it to U+FF0D.
-
Let pointer be the index pointer for code point in index jis0208 .
If pointer is non-null, it is less than 8836 due to the nature of index jis0208 and the index pointer operation.
-
If pointer is null, return error with code point .
-
Let lead be pointer / 94 + 0xA1.
-
Let trail be pointer % 94 + 0xA1.
-
Return two bytes whose values are lead and trail .
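As a non-normative illustration, the encoder steps for a single code point might look like this in Python. The sketch assumes "/" is integer division, models "return error" as returning None, and takes the index pointer lookup as a dictionary from code point to pointer; the function name and the one-entry excerpt are hypothetical.

```python
def euc_jp_encode_code_point(code_point: int, jis0208_pointers: dict):
    """Return the EUC-JP bytes for one code point, or None on error."""
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point == 0xA5:
        return b"\x5C"
    if code_point == 0x203E:
        return b"\x7E"
    if 0xFF61 <= code_point <= 0xFF9F:  # half-width katakana
        return bytes([0x8E, code_point - 0xFF61 + 0xA1])
    if code_point == 0x2212:
        code_point = 0xFF0D
    pointer = jis0208_pointers.get(code_point)
    if pointer is None:
        return None  # "return error with code point"
    return bytes([pointer // 94 + 0xA1, pointer % 94 + 0xA1])


excerpt = {0x3042: 283}  # illustrative excerpt: U+3042 has pointer 283
print(euc_jp_encode_code_point(0x3042, excerpt).hex())  # a4a2
print(euc_jp_encode_code_point(0xFF71, excerpt).hex())  # 8eb1
```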
12.2. ISO-2022-JP
12.2.1. ISO-2022-JP decoder
ISO-2022-JP ’s decoder has an associated ISO-2022-JP decoder state (initially ASCII ), ISO-2022-JP decoder output state (initially ASCII ), ISO-2022-JP lead (initially 0x00), and ISO-2022-JP output (initially false).
ISO-2022-JP’s decoder’s handler, given ioQueue and byte, runs these steps, switching on ISO-2022-JP decoder state:
- ASCII
-
Based on byte :
- 0x1B
-
Set ISO-2022-JP decoder state to escape start and return continue .
- 0x00 to 0x7F, excluding 0x0E, 0x0F, and 0x1B
-
Set ISO-2022-JP output to false and return a code point whose value is byte .
- end-of-queue
-
Return finished.
- Otherwise
-
Set ISO-2022-JP output to false and return error .
- Roman
-
Based on byte :
- 0x1B
-
Set ISO-2022-JP decoder state to escape start and return continue .
- 0x5C
-
Set ISO-2022-JP output to false and return code point U+00A5.
- 0x7E
-
Set ISO-2022-JP output to false and return code point U+203E.
- 0x00 to 0x7F, excluding 0x0E, 0x0F, 0x1B, 0x5C, and 0x7E
-
Set ISO-2022-JP output to false and return a code point whose value is byte .
- end-of-queue
-
Return finished.
- Otherwise
-
Set ISO-2022-JP output to false and return error .
- katakana
-
Based on byte :
- 0x1B
-
Set ISO-2022-JP decoder state to escape start and return continue .
- 0x21 to 0x5F
-
Set ISO-2022-JP output to false and return a code point whose value is 0xFF61 − 0x21 + byte .
- end-of-queue
-
Return finished.
- Otherwise
-
Set ISO-2022-JP output to false and return error .
- Lead byte
-
Based on byte :
- 0x1B
-
Set ISO-2022-JP decoder state to escape start and return continue .
- 0x21 to 0x7E
-
Set ISO-2022-JP output to false, ISO-2022-JP lead to byte , ISO-2022-JP decoder state to trail byte , and return continue .
- end-of-queue
-
Return finished.
- Otherwise
-
Set ISO-2022-JP output to false and return error .
- Trail byte
-
Based on byte :
- 0x1B
-
Set ISO-2022-JP decoder state to escape start and return error .
- 0x21 to 0x7E
-
-
Set the ISO-2022-JP decoder state to lead byte .
-
Let pointer be ( ISO-2022-JP lead − 0x21) × 94 + byte − 0x21.
-
Let code point be the index code point for pointer in index jis0208 .
-
If code point is null, return error .
-
Return a code point whose value is code point .
- end-of-queue
-
Set the ISO-2022-JP decoder state to lead byte, prepend byte to ioQueue, and return error.
- Otherwise
-
Set ISO-2022-JP decoder state to lead byte and return error.
- Escape start
-
-
If byte is either 0x24 or 0x28, set ISO-2022-JP lead to byte , ISO-2022-JP decoder state to escape , and return continue .
-
Prepend byte to ioQueue.
-
Set ISO-2022-JP output to false, ISO-2022-JP decoder state to ISO-2022-JP decoder output state, and return error.
- Escape
-
-
Let lead be ISO-2022-JP lead and set ISO-2022-JP lead to 0x00.
-
Let state be null.
-
If lead is 0x28 and byte is 0x42, set state to ASCII .
-
If lead is 0x28 and byte is 0x4A, set state to Roman .
-
If lead is 0x28 and byte is 0x49, set state to katakana .
-
If lead is 0x24 and byte is either 0x40 or 0x42, set state to lead byte .
-
If state is non-null, then:
-
Set ISO-2022-JP decoder state and ISO-2022-JP decoder output state to state .
-
Let output be the value of ISO-2022-JP output .
-
Set ISO-2022-JP output to true.
-
Return continue, if output is false, and error otherwise.
-
Prepend lead and byte to ioQueue.
-
Set ISO-2022-JP output to false, ISO-2022-JP decoder state to ISO-2022-JP decoder output state, and return error.
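The state machine above can be sketched in Python as follows. This is a non-normative illustration, not a reference implementation. It assumes "return error" yields U+FFFD, models end-of-queue as None, and takes index jis0208 as a dictionary. It also assumes that a valid escape sequence returns continue when ISO-2022-JP output is false and error otherwise. All names are hypothetical.

```python
from collections import deque

ERR = "\uFFFD"
ESCAPES = {(0x28, 0x42): "ascii", (0x28, 0x4A): "roman",
           (0x28, 0x49): "katakana",
           (0x24, 0x40): "lead", (0x24, 0x42): "lead"}


def iso_2022_jp_decode(data: bytes, jis0208: dict) -> str:
    queue, out = deque(data), []
    state = output_state = "ascii"
    lead, output = 0x00, False
    while True:
        byte = queue.popleft() if queue else None  # None: end-of-queue
        if state == "escape start":
            if byte in (0x24, 0x28):
                lead, state = byte, "escape"
                continue
            if byte is not None:
                queue.appendleft(byte)
            output, state = False, output_state
            out.append(ERR)
        elif state == "escape":
            esc_lead, lead = lead, 0x00
            new_state = ESCAPES.get((esc_lead, byte))
            if new_state is not None:
                state = output_state = new_state
                previous, output = output, True
                if previous:  # escape directly after another escape
                    out.append(ERR)
                continue
            if byte is not None:
                queue.appendleft(byte)
            queue.appendleft(esc_lead)  # lead ends up in front of byte
            output, state = False, output_state
            out.append(ERR)
        elif state == "trail":
            state = "escape start" if byte == 0x1B else "lead"
            code_point = None
            if byte is not None and 0x21 <= byte <= 0x7E:
                code_point = jis0208.get((lead - 0x21) * 94 + byte - 0x21)
            out.append(ERR if code_point is None else chr(code_point))
        elif byte == 0x1B:
            state = "escape start"
        elif byte is None:
            return "".join(out)
        else:
            output = False
            if state == "lead":
                if 0x21 <= byte <= 0x7E:
                    lead, state = byte, "trail"
                else:
                    out.append(ERR)
            elif state == "katakana":
                ok = 0x21 <= byte <= 0x5F
                out.append(chr(0xFF61 - 0x21 + byte) if ok else ERR)
            elif byte > 0x7F or byte in (0x0E, 0x0F):
                out.append(ERR)
            elif state == "roman" and byte in (0x5C, 0x7E):
                out.append("\u00A5" if byte == 0x5C else "\u203E")
            else:
                out.append(chr(byte))


yen = b"\x1B\x28\x4A\x5C\x1B\x28\x42"
print(iso_2022_jp_decode(yen * 2, {}) == "\u00A5\uFFFD\u00A5")  # True
```

The usage line decodes two concatenated encodings of U+00A5. The second run starts with an escape sequence that directly follows another one, so the sketch emits U+FFFD between the two yen signs.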
12.2.2. ISO-2022-JP encoder
The ISO-2022-JP encoder is the only encoder for which the concatenation of multiple outputs can result in an error when run through the corresponding decoder .
Encoding U+00A5 gives 0x1B 0x28 0x4A 0x5C 0x1B 0x28 0x42. Doing that twice, concatenating the results, and then decoding yields U+00A5 U+FFFD U+00A5.
ISO-2022-JP ’s encoder has an associated ISO-2022-JP encoder state which is ASCII , Roman , or jis0208 (initially ASCII ).
ISO-2022-JP’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue and ISO-2022-JP encoder state is not ASCII, prepend code point to ioQueue, set ISO-2022-JP encoder state to ASCII, and return three bytes 0x1B 0x28 0x42.
-
If code point is end-of-queue and ISO-2022-JP encoder state is ASCII, return finished.
-
If ISO-2022-JP encoder state is ASCII or Roman , and code point is U+000E, U+000F, or U+001B, return error with U+FFFD.
This returns U+FFFD rather than code point to prevent attacks.
-
If ISO-2022-JP encoder state is ASCII and code point is an ASCII code point , return a byte whose value is code point .
-
If ISO-2022-JP encoder state is Roman and code point is an ASCII code point , excluding U+005C and U+007E, or is U+00A5 or U+203E, then:
-
If code point is an ASCII code point , return a byte whose value is code point .
-
If code point is U+00A5, return byte 0x5C.
-
If code point is U+203E, return byte 0x7E.
-
If code point is an ASCII code point, and ISO-2022-JP encoder state is not ASCII, prepend code point to ioQueue, set ISO-2022-JP encoder state to ASCII, and return three bytes 0x1B 0x28 0x42.
-
If code point is either U+00A5 or U+203E, and ISO-2022-JP encoder state is not Roman, prepend code point to ioQueue, set ISO-2022-JP encoder state to Roman, and return three bytes 0x1B 0x28 0x4A.
-
If code point is U+2212, set it to U+FF0D.
-
If code point is in the range U+FF61 to U+FF9F, inclusive, set it to the index code point for code point − 0xFF61 in index ISO-2022-JP katakana .
-
Let pointer be the index pointer for code point in index jis0208 .
If pointer is non-null, it is less than 8836 due to the nature of index jis0208 and the index pointer operation.
-
If pointer is null, return error with code point .
-
If ISO-2022-JP encoder state is not jis0208, prepend code point to ioQueue, set ISO-2022-JP encoder state to jis0208, and return three bytes 0x1B 0x24 0x42.
-
Let lead be pointer / 94 + 0x21.
-
Let trail be pointer % 94 + 0x21.
-
Return two bytes whose values are lead and trail .
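The encoder steps above can be sketched in Python as follows. This is a non-normative illustration. It assumes "/" is integer division, models "return error" as raising ValueError, and takes the index jis0208 pointer lookup and index ISO-2022-JP katakana as dictionaries. All names, and the one-entry excerpt, are hypothetical.

```python
from collections import deque


def iso_2022_jp_encode(text: str, jis0208_pointers: dict,
                       katakana_index: dict) -> bytes:
    queue, out, state = deque(map(ord, text)), bytearray(), "ascii"
    while True:
        if not queue:  # end-of-queue
            if state != "ascii":
                out += b"\x1B\x28\x42"  # always finish in the ASCII state
            return bytes(out)
        cp = queue.popleft()
        if state in ("ascii", "roman") and cp in (0x0E, 0x0F, 0x1B):
            raise ValueError("error with U+FFFD")
        is_ascii = cp <= 0x7F
        if state == "ascii" and is_ascii:
            out.append(cp)
        elif state == "roman" and (cp in (0xA5, 0x203E) or
                                   (is_ascii and cp not in (0x5C, 0x7E))):
            out.append({0xA5: 0x5C, 0x203E: 0x7E}.get(cp, cp))
        elif is_ascii:  # state is not ASCII here: switch, then retry
            queue.appendleft(cp)
            state = "ascii"
            out += b"\x1B\x28\x42"
        elif cp in (0xA5, 0x203E):  # state is not Roman here
            queue.appendleft(cp)
            state = "roman"
            out += b"\x1B\x28\x4A"
        else:
            if cp == 0x2212:
                cp = 0xFF0D
            if 0xFF61 <= cp <= 0xFF9F:
                cp = katakana_index[cp - 0xFF61]
            pointer = jis0208_pointers.get(cp)
            if pointer is None:
                raise ValueError(f"error with U+{cp:04X}")
            if state != "jis0208":
                queue.appendleft(cp)
                state = "jis0208"
                out += b"\x1B\x24\x42"
            else:
                out += bytes([pointer // 94 + 0x21, pointer % 94 + 0x21])


print(iso_2022_jp_encode("\u00A5", {}, {}).hex(" "))
# 1b 28 4a 5c 1b 28 42
print(iso_2022_jp_encode("\u3042", {0x3042: 283}, {}).hex(" "))
# 1b 24 42 24 22 1b 28 42
```

Each state switch emits only the escape sequence and puts the code point back on the queue, so the next iteration encodes it in the new state. This mirrors the prepend steps in the algorithm.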
12.3. Shift_JIS
12.3.1. Shift_JIS decoder
Shift_JIS ’s decoder has an associated Shift_JIS lead (initially 0x00).
Shift_JIS’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue and Shift_JIS lead is not 0x00, set Shift_JIS lead to 0x00 and return error.
-
If byte is end-of-queue and Shift_JIS lead is 0x00, return finished.
-
If Shift_JIS lead is not 0x00, let lead be Shift_JIS lead , let pointer be null, set Shift_JIS lead to 0x00, and then:
-
Let offset be 0x40 if byte is less than 0x7F, otherwise 0x41.
-
Let lead offset be 0x81 if lead is less than 0xA0, otherwise 0xC1.
-
If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFC, inclusive, set pointer to ( lead − lead offset ) × 188 + byte − offset .
-
If pointer is in the range 8836 to 10715, inclusive, return a code point whose value is 0xE000 − 8836 + pointer .
This is interoperable legacy from Windows known as EUDC.
-
Let code point be null if pointer is null, otherwise the index code point for pointer in index jis0208 .
-
If code point is non-null, return a code point whose value is code point .
-
If byte is an ASCII byte, prepend byte to ioQueue.
-
Return error.
-
If byte is an ASCII byte or 0x80, return a code point whose value is byte .
-
If byte is in the range 0xA1 to 0xDF, inclusive, return a code point whose value is 0xFF61 − 0xA1 + byte .
-
If byte is in the range 0x81 to 0x9F, inclusive, or 0xE0 to 0xFC, inclusive, set Shift_JIS lead to byte and return continue .
-
Return error .
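The decoder steps above can be sketched in Python as follows. This is a non-normative illustration: "return error" is rendered as U+FFFD, the I/O queue is a deque, and index jis0208 is passed in as a dictionary. The function name and the one-entry excerpt are hypothetical.

```python
from collections import deque


def shift_jis_decode(data: bytes, jis0208: dict) -> str:
    queue, out, lead = deque(data), [], 0x00
    while True:
        if not queue:  # end-of-queue
            if lead != 0x00:
                out.append("\uFFFD")
            return "".join(out)
        byte = queue.popleft()
        if lead != 0x00:
            first, lead, pointer = lead, 0x00, None
            offset = 0x40 if byte < 0x7F else 0x41
            lead_offset = 0x81 if first < 0xA0 else 0xC1
            if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC:
                pointer = (first - lead_offset) * 188 + byte - offset
            if pointer is not None and 8836 <= pointer <= 10715:
                out.append(chr(0xE000 - 8836 + pointer))  # EUDC range
                continue
            code_point = None if pointer is None else jis0208.get(pointer)
            if code_point is not None:
                out.append(chr(code_point))
                continue
            if byte <= 0x7F:
                queue.appendleft(byte)  # an ASCII byte is never masked
            out.append("\uFFFD")
        elif byte <= 0x80:
            out.append(chr(byte))
        elif 0xA1 <= byte <= 0xDF:
            out.append(chr(0xFF61 - 0xA1 + byte))  # half-width katakana
        elif 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC:
            lead = byte
        else:
            out.append("\uFFFD")


excerpt = {283: 0x3042}  # illustrative one-entry excerpt of index jis0208
print(shift_jis_decode(b"\x82\xA0", excerpt))                  # あ
print(shift_jis_decode(b"\x82\x22", excerpt) == "\uFFFD\"")    # True
```

The second call is the sequence from the security background: lead byte 0x82 followed by 0x22 decodes to U+FFFD U+0022, so the quotation mark survives.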
12.3.2. Shift_JIS encoder
Shift_JIS’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point or U+0080, return a byte whose value is code point .
-
If code point is U+00A5, return byte 0x5C.
-
If code point is U+203E, return byte 0x7E.
-
If code point is in the range U+FF61 to U+FF9F, inclusive, return a byte whose value is code point − 0xFF61 + 0xA1.
-
If code point is U+2212, set it to U+FF0D.
-
Let pointer be the index Shift_JIS pointer for code point .
-
If pointer is null, return error with code point .
-
Let lead be pointer / 188.
-
Let lead offset be 0x81 if lead is less than 0x1F, otherwise 0xC1.
-
Let trail be pointer % 188.
-
Let offset be 0x40 if trail is less than 0x3F, otherwise 0x41.
-
Return two bytes whose values are lead + lead offset and trail + offset .
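A non-normative Python sketch of these steps for a single code point follows. It assumes "/" is integer division, models "return error" as returning None, and takes the index Shift_JIS pointer lookup as a dictionary from code point to pointer; the function name and the one-entry excerpt are hypothetical.

```python
def shift_jis_encode_code_point(code_point: int, shift_jis_pointers: dict):
    """Return the Shift_JIS bytes for one code point, or None on error."""
    if code_point <= 0x80:
        return bytes([code_point])
    if code_point == 0xA5:
        return b"\x5C"
    if code_point == 0x203E:
        return b"\x7E"
    if 0xFF61 <= code_point <= 0xFF9F:  # half-width katakana, one byte
        return bytes([code_point - 0xFF61 + 0xA1])
    if code_point == 0x2212:
        code_point = 0xFF0D
    pointer = shift_jis_pointers.get(code_point)
    if pointer is None:
        return None  # "return error with code point"
    lead, trail = divmod(pointer, 188)
    lead_offset = 0x81 if lead < 0x1F else 0xC1  # skip 0xA0-0xDF leads
    offset = 0x40 if trail < 0x3F else 0x41      # skip trail byte 0x7F
    return bytes([lead + lead_offset, trail + offset])


excerpt = {0x3042: 283}  # illustrative excerpt: U+3042 has pointer 283
print(shift_jis_encode_code_point(0x3042, excerpt).hex())  # 82a0
```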
13. Legacy multi-byte Korean encodings
13.1. EUC-KR
13.1.1. EUC-KR decoder
EUC-KR ’s decoder has an associated EUC-KR lead (initially 0x00).
EUC-KR’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue and EUC-KR lead is not 0x00, set EUC-KR lead to 0x00 and return error.
-
If byte is end-of-queue and EUC-KR lead is 0x00, return finished.
-
If EUC-KR lead is not 0x00, let lead be EUC-KR lead , let pointer be null, set EUC-KR lead to 0x00, and then:
-
If byte is in the range 0x41 to 0xFE, inclusive, set pointer to ( lead − 0x81) × 190 + ( byte − 0x41).
-
Let code point be null if pointer is null, otherwise the index code point for pointer in index EUC-KR .
-
If code point is non-null, return a code point whose value is code point .
-
If byte is an ASCII byte, prepend byte to ioQueue.
-
Return error.
-
If byte is an ASCII byte , return a code point whose value is byte .
-
If byte is in the range 0x81 to 0xFE, inclusive, set EUC-KR lead to byte and return continue .
-
Return error .
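The steps above can be sketched in Python as follows. This is a non-normative illustration: "return error" is rendered as U+FFFD, the I/O queue is a deque, and index EUC-KR is passed in as a dictionary. The function name and the one-entry excerpt are hypothetical.

```python
from collections import deque


def euc_kr_decode(data: bytes, index_euc_kr: dict) -> str:
    queue, out, lead = deque(data), [], 0x00
    while True:
        if not queue:  # end-of-queue
            if lead != 0x00:
                out.append("\uFFFD")
            return "".join(out)
        byte = queue.popleft()
        if lead != 0x00:
            first, lead, pointer = lead, 0x00, None
            if 0x41 <= byte <= 0xFE:
                pointer = (first - 0x81) * 190 + (byte - 0x41)
            code_point = None
            if pointer is not None:
                code_point = index_euc_kr.get(pointer)
            if code_point is not None:
                out.append(chr(code_point))
                continue
            if byte <= 0x7F:
                queue.appendleft(byte)  # an ASCII byte is never masked
            out.append("\uFFFD")
        elif byte <= 0x7F:
            out.append(chr(byte))
        elif 0x81 <= byte <= 0xFE:
            lead = byte
        else:
            out.append("\uFFFD")


excerpt = {9026: 0xAC00}  # illustrative one-entry excerpt of index EUC-KR
print(euc_kr_decode(b"\xB0\xA1", excerpt))  # 가
```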
13.1.2. EUC-KR encoder
EUC-KR’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point , return a byte whose value is code point .
-
Let pointer be the index pointer for code point in index EUC-KR .
-
If pointer is null, return error with code point .
-
Let lead be pointer / 190 + 0x81.
-
Let trail be pointer % 190 + 0x41.
-
Return two bytes whose values are lead and trail .
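As a non-normative illustration, the encoder for a single code point might be written as follows in Python. The sketch assumes "/" is integer division, returns None for "return error", and takes the index pointer lookup as a dictionary; the function name and the excerpt are hypothetical.

```python
def euc_kr_encode_code_point(code_point: int, euc_kr_pointers: dict):
    """Return the EUC-KR bytes for one code point, or None on error."""
    if code_point <= 0x7F:
        return bytes([code_point])
    pointer = euc_kr_pointers.get(code_point)
    if pointer is None:
        return None  # "return error with code point"
    # divmod yields the integer quotient and remainder in one call.
    lead, trail = divmod(pointer, 190)
    return bytes([lead + 0x81, trail + 0x41])


excerpt = {0xAC00: 9026}  # illustrative excerpt: U+AC00 has pointer 9026
print(euc_kr_encode_code_point(0xAC00, excerpt).hex())  # b0a1
```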
14. Legacy miscellaneous encodings
14.1. replacement
The replacement encoding exists to prevent certain attacks that abuse a mismatch between encodings supported on the server and the client.
14.1.1. replacement decoder
replacement ’s decoder has an associated replacement error returned (initially false).
replacement’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue, return finished.
-
If replacement error returned is false, set replacement error returned to true and return error .
-
Return finished .
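The net effect of these steps is that any non-empty input decodes to a single error. A non-normative Python sketch, assuming "return error" is rendered as U+FFFD and end-of-queue is modelled as None (class and function names are hypothetical):

```python
class ReplacementDecoder:
    def __init__(self):
        self.error_returned = False  # "replacement error returned"

    def handler(self, byte):
        if byte is None:  # end-of-queue
            return "finished"
        if not self.error_returned:
            self.error_returned = True
            return "error"
        return "finished"


def replacement_decode(data: bytes) -> str:
    decoder, out = ReplacementDecoder(), []
    for byte in list(data) + [None]:
        if decoder.handler(byte) == "finished":
            break
        out.append("\uFFFD")
    return "".join(out)


print(replacement_decode(b"<script>alert(1)</script>") == "\uFFFD")  # True
print(replacement_decode(b"") == "")                                 # True
```

Because the handler returns finished on the second byte, the remaining input is never interpreted, whatever it contains.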
14.2. Common infrastructure for UTF-16BE and UTF-16LE
14.2.1. shared UTF-16 decoder
A byte order mark has priority over a label as it has been found to be more accurate in deployed content. Therefore it is not part of the shared UTF-16 decoder algorithm but rather the decode algorithm.
shared UTF-16 decoder has an associated UTF-16 lead byte and UTF-16 lead surrogate (both initially null), and is UTF-16BE decoder (initially false).
shared UTF-16 decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue and either UTF-16 lead byte or UTF-16 lead surrogate is non-null, set UTF-16 lead byte and UTF-16 lead surrogate to null, and return error.
-
If byte is end-of-queue and UTF-16 lead byte and UTF-16 lead surrogate are null, return finished.
-
If UTF-16 lead byte is null, set UTF-16 lead byte to byte and return continue .
-
Let code unit be the result of:
- is UTF-16BE decoder is true
-
( UTF-16 lead byte << 8) + byte .
- is UTF-16BE decoder is false
-
( byte << 8) + UTF-16 lead byte .
Then set UTF-16 lead byte to null.
-
If UTF-16 lead surrogate is non-null, let lead surrogate be UTF-16 lead surrogate , set UTF-16 lead surrogate to null, and then:
-
If code unit is in the range U+DC00 to U+DFFF, inclusive, return a code point whose value is 0x10000 + (( lead surrogate − 0xD800) << 10) + ( code unit − 0xDC00).
-
Let byte1 be code unit >> 8.
-
Let byte2 be code unit & 0x00FF.
-
Let bytes be two bytes whose values are byte1 and byte2 , if is UTF-16BE decoder is true, and byte2 and byte1 otherwise.
-
Prepend bytes to ioQueue and return error.
-
If code unit is in the range U+D800 to U+DBFF, inclusive, set UTF-16 lead surrogate to code unit and return continue .
-
If code unit is in the range U+DC00 to U+DFFF, inclusive, return error .
-
Return code point code unit .
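The shared decoder can be sketched in Python as follows. This is a non-normative illustration: "return error" is rendered as U+FFFD and the I/O queue is a deque. It assumes that, when a lead surrogate is not followed by a trail surrogate, the two bytes of the offending code unit are prepended to the queue before the error is returned. The function name utf16_decode and its big_endian parameter are hypothetical.

```python
from collections import deque


def utf16_decode(data: bytes, big_endian: bool = False) -> str:
    queue, out = deque(data), []
    lead_byte = lead_surrogate = None
    while True:
        if not queue:  # end-of-queue
            if lead_byte is not None or lead_surrogate is not None:
                out.append("\uFFFD")  # dangling byte or lead surrogate
            return "".join(out)
        byte = queue.popleft()
        if lead_byte is None:
            lead_byte = byte
            continue
        if big_endian:
            code_unit = (lead_byte << 8) + byte
        else:
            code_unit = (byte << 8) + lead_byte
        lead_byte = None
        if lead_surrogate is not None:
            high, lead_surrogate = lead_surrogate, None
            if 0xDC00 <= code_unit <= 0xDFFF:
                out.append(chr(0x10000 + ((high - 0xD800) << 10)
                               + (code_unit - 0xDC00)))
                continue
            byte1, byte2 = code_unit >> 8, code_unit & 0x00FF
            pair = (byte1, byte2) if big_endian else (byte2, byte1)
            queue.extendleft(reversed(pair))  # keeps the pair in order
            out.append("\uFFFD")
        elif 0xD800 <= code_unit <= 0xDBFF:
            lead_surrogate = code_unit
        elif 0xDC00 <= code_unit <= 0xDFFF:
            out.append("\uFFFD")  # trail surrogate without a lead
        else:
            out.append(chr(code_unit))


print(utf16_decode(b"\x3D\xD8\x00\xDE") == "\U0001F600")               # True
print(utf16_decode(b"\xD8\x3D\x00\x61", big_endian=True) == "\uFFFDa")  # True
```

In the second call the lead surrogate is followed by an ordinary code unit, which is put back on the queue and decoded as "a" after the error.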
14.3. UTF-16BE
14.3.1. UTF-16BE decoder
UTF-16BE ’s decoder is shared UTF-16 decoder with its is UTF-16BE decoder set to true.
14.4. UTF-16LE
"utf-16" is a label for UTF-16LE to deal with deployed content.
14.4.1. UTF-16LE decoder
UTF-16LE ’s decoder is shared UTF-16 decoder .
14.5. x-user-defined
While technically this is a single-byte encoding , it is defined separately as it can be implemented algorithmically.
14.5.1. x-user-defined decoder
x-user-defined’s decoder’s handler, given ioQueue and byte, runs these steps:
-
If byte is end-of-queue, return finished.
-
If byte is an ASCII byte , return a code point whose value is byte .
-
Return a code point whose value is 0xF780 + byte − 0x80.
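Because the mapping is purely arithmetic, the decoder fits in a few lines. A non-normative Python sketch (the function name is hypothetical):

```python
def x_user_defined_decode(data: bytes) -> str:
    # ASCII bytes map to themselves; 0x80-0xFF map to U+F780-U+F7FF.
    return "".join(
        chr(byte) if byte <= 0x7F else chr(0xF780 + byte - 0x80)
        for byte in data
    )


print(x_user_defined_decode(b"A\x80\xFF") == "A\uF780\uF7FF")  # True
```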
14.5.2. x-user-defined encoder
x-user-defined’s encoder’s handler, given ioQueue and code point, runs these steps:
-
If code point is end-of-queue, return finished.
-
If code point is an ASCII code point , return a byte whose value is code point .
-
If code point is in the range U+F780 to U+F7FF, inclusive, return a byte whose value is code point − 0xF780 + 0x80.
-
Return error with code point .
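A non-normative Python sketch of the encoder follows. It models "return error with code point" as raising ValueError; the function name is hypothetical.

```python
def x_user_defined_encode(text: str) -> bytes:
    out = bytearray()
    for char in text:
        code_point = ord(char)
        if code_point <= 0x7F:
            out.append(code_point)
        elif 0xF780 <= code_point <= 0xF7FF:
            out.append(code_point - 0xF780 + 0x80)
        else:
            raise ValueError(f"error with U+{code_point:04X}")
    return bytes(out)


print(x_user_defined_encode("A\uF780\uF7FF").hex())  # 4180ff
```

Every byte value has exactly one code point in this encoding, so encoding the output of the corresponding decoder reproduces the original bytes.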
15. Browser UI
Browsers are encouraged to not enable overriding the encoding of a resource. If such a feature is nonetheless present, browsers should not offer either UTF-16BE or UTF-16LE as an option, due to the aforementioned security issues. Browsers should also disable this feature if the resource was decoded using either UTF-16BE or UTF-16LE.
Implementation considerations
Instead of supporting I/O queues with arbitrary prepend, the decoders for encodings in this standard could be implemented with:
-
The ability to unread the current byte.
-
A single-byte buffer for gb18030 (an ASCII byte ) and ISO-2022-JP (0x24 or 0x28).
For gb18030 when hitting a bogus byte while gb18030 third is not 0x00, gb18030 second could be moved into the single-byte buffer to be returned next, and gb18030 third would be the new gb18030 first , checked for not being 0x00 after the single-byte buffer was returned and emptied. This is possible as the range for the first and third byte in gb18030 is identical.
The ISO-2022-JP encoder needs ISO-2022-JP encoder state as additional state, but other than that, none of the encoders for encodings in this standard require additional state or buffers.
Acknowledgments
Many people have helped make encodings more interoperable over the years and thereby furthered the goals of this standard. Likewise, many people have helped make this standard what it is today.
With that, many thanks to Adam Rice, Alan Chaney, Alexander Shtuchkin, Allen Wirfs-Brock, Andreu Botella, Aneesh Agrawal, Arkadiusz Michalski, Asmus Freytag, Ben Noordhuis, Bnaya Peretz, Boris Zbarsky, Bruno Haible, Cameron McCormack, Charles McCathieNeville, Christopher Foo, CodifierNL, David Carlisle, Domenic Denicola, Dominique Hazaël-Massieux, Doug Ewell, Erik van der Poel, 譚永鋒 (Frank Yung-Fong Tang), Glenn Maynard, Gordon P. Hemsley, Henri Sivonen, Ian Hickson, James Graham, Jeffrey Yasskin, John Tamplin, Joshua Bell, 村井純 (Jun Murai), 신정식 (Jungshik Shin), Jxck, 강 성훈 (Kang Seonghoon), 川幡太一 (Kawabata Taichi), Ken Lunde, Ken Whistler, Kenneth Russell, 田村健人 (Kent Tamura), Leif Halvard Silli, Luke Wagner, Maciej Hirsz, Makoto Kato, Mark Callow, Mark Crispin, Mark Davis, Martin Dürst, Masatoshi Kimura, Mattias Buelens, Ms2ger, Nigel Megitt, Nigel Tao, Norbert Lindenberg, Øistein E. Andersen, Peter Krefting, Philip Jägenstedt, Philip Taylor, Richard Ishida, Robbert Broersma, Robert Mustacchi, Ryan Dahl, Sam Sneddon, Shawn Steele, Simon Montagu, Simon Pieters, Simon Sapin, 寺田健 (Takeshi Terada), Vyacheslav Matva, and 成瀬ゆい (Yui Naruse) for being awesome.
This standard is written by Anne van Kesteren ( Mozilla , annevk@annevk.nl ). The API chapter was initially written by Joshua Bell ( Google ).
Copyright © 2020 WHATWG (Apple, Google, Mozilla, Microsoft). This work is licensed under a Creative Commons Attribution 4.0 International License.