Living Standard — Last Updated 13 January 2025
This section only applies to user agents, data mining tools, and conformance checkers.
The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XML syntax".
User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML tools and the XML serialization of HTML.
For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.
As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menu element", then that is an element with the local name "menu", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to such elements are hyperlinked to their definition.
The input to the HTML parsing process consists of a stream of code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.
Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.
In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.
There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.
In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token:
...
<script>
document.write('<p>');
</script>
...
To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.
This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.
Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.
Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.
Some parse errors have dedicated codes outlined in the table below that should be used by conformance checkers in reports.
Error descriptions in the table below are non-normative.
Code | Description |
---|---|
abrupt-closing-of-empty-comment |
This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code point (i.e., |
abrupt-doctype-public-identifier |
This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE public identifier (e.g., |
abrupt-doctype-system-identifier |
This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE system identifier (e.g., |
absence-of-digits-in-numeric-character-reference |
This error occurs if the parser encounters a numeric character reference that doesn't contain any digits (e.g., |
cdata-in-html-content |
This error occurs if the parser encounters a CDATA section outside of foreign content (SVG or MathML). The parser treats such CDATA sections (including leading " |
character-reference-outside-unicode-range |
This error occurs if the parser encounters a numeric character reference that references a code point that is greater than the valid Unicode range. The parser resolves such a character reference to a U+FFFD REPLACEMENT CHARACTER. |
control-character-in-input-stream |
This error occurs if the input stream contains a control code point that is not ASCII whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM. |
control-character-reference |
This error occurs if the parser encounters a numeric character reference that references a control code point that is not ASCII whitespace or is a U+000D CARRIAGE RETURN. The parser resolves such character references as-is except C1 control references that are replaced according to the numeric character reference end state . |
duplicate-attribute |
This error occurs if the parser encounters an attribute in a tag that already has an attribute with the same name. The parser ignores all such duplicate occurrences of the attribute. |
end-tag-with-attributes |
This error occurs if the parser encounters an end tag with attributes . Attributes in end tags are ignored and do not make their way into the DOM. |
end-tag-with-trailing-solidus |
This error occurs if the parser encounters an end tag that has a U+002F (/) code point right before the closing U+003E (>) code point (e.g., |
eof-before-tag-name |
This error occurs if the parser encounters the end of the input stream where a tag name is expected. In this case the parser treats the beginning of a start tag (i.e., |
eof-in-cdata |
This error occurs if the parser encounters the end of the input stream in a CDATA section . The parser treats such CDATA sections as if they are closed immediately before the end of the input stream. |
eof-in-comment |
This error occurs if the parser encounters the end of the input stream in a comment . The parser treats such comments as if they are closed immediately before the end of the input stream. |
eof-in-doctype |
This error occurs if the parser encounters the end of the input stream in a DOCTYPE. In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the |
eof-in-script-html-comment-like-text |
This error occurs if the parser encounters the end of the input stream in text that resembles an HTML comment inside Syntactic structures that resemble HTML comments in |
eof-in-tag |
This error occurs if the parser encounters the end of the input stream in a start tag or an end tag (e.g., |
incorrectly-closed-comment |
This error occurs if the parser encounters a comment that is closed by the " |
incorrectly-opened-comment |
This error occurs if the parser encounters the " One possible cause of this error is using an XML markup declaration (e.g., |
invalid-character-sequence-after-doctype-name |
This error occurs if the parser encounters any code point sequence other than " |
invalid-first-character-of-tag-name |
This error occurs if the parser encounters a code point that is not an ASCII alpha where first code point of a start tag name or an end tag name is expected. If a start tag was expected such code point and a preceding U+003C (<) is treated as text content, and all content that follows is treated as markup. Whereas, if an end tag was expected, such code point and all content that follows up to a U+003E (>) code point (if present) or to the end of the input stream is treated as a comment. For example, consider the following markup:
This will be parsed into: While the first code point of a tag name is limited to an ASCII alpha , a wide range of code points (including ASCII digits ) is allowed in subsequent positions. |
missing-attribute-value |
This error occurs if the parser encounters a U+003E (>) code point where an attribute value is expected (e.g., |
missing-doctype-name |
This error occurs if the parser encounters a DOCTYPE that is missing a name (e.g., |
missing-doctype-public-identifier |
This error occurs if the parser encounters a U+003E (>) code point where start of the DOCTYPE public identifier is expected (e.g., |
missing-doctype-system-identifier |
This error occurs if the parser encounters a U+003E (>) code point where start of the DOCTYPE system identifier is expected (e.g., |
missing-end-tag-name |
This error occurs if the parser encounters a U+003E (>) code point where an end tag name is expected, i.e., |
missing-quote-before-doctype-public-identifier |
This error occurs if the parser encounters the DOCTYPE public identifier that is not preceded by a quote (e.g., |
missing-quote-before-doctype-system-identifier |
This error occurs if the parser encounters the DOCTYPE system identifier that is not preceded by a quote (e.g., |
missing-semicolon-after-character-reference |
This error occurs if the parser encounters a character reference that is not terminated by a U+003B (;) code point . Usually the parser behaves as if character reference is terminated by the U+003B (;) code point; however, there are some ambiguous cases in which the parser includes subsequent code points in the character reference.
For example, |
missing-whitespace-after-doctype-public-keyword |
This error occurs if the parser encounters a DOCTYPE whose " |
missing-whitespace-after-doctype-system-keyword |
This error occurs if the parser encounters a DOCTYPE whose " |
missing-whitespace-before-doctype-name |
This error occurs if the parser encounters a DOCTYPE whose " |
missing-whitespace-between-attributes |
This error occurs if the parser encounters attributes that are not separated by ASCII whitespace (e.g., |
missing-whitespace-between-doctype-public-and-system-identifiers |
This error occurs if the parser encounters a DOCTYPE whose public and system identifiers are not separated by ASCII whitespace . In this case the parser behaves as if ASCII whitespace is present. |
nested-comment |
This error occurs if the parser encounters a nested comment (e.g., |
noncharacter-character-reference |
This error occurs if the parser encounters a numeric character reference that references a noncharacter . The parser resolves such character references as-is. |
noncharacter-in-input-stream |
This error occurs if the input stream contains a noncharacter . Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM. |
non-void-html-element-start-tag-with-trailing-solidus |
This error occurs if the parser encounters a start tag for an element that is not in the list of void elements or is not a part of foreign content (i.e., not an SVG or MathML element) that has a U+002F (/) code point right before the closing U+003E (>) code point. The parser behaves as if the U+002F (/) is not present. For example, consider the following markup:
This will be parsed into: The trailing U+002F (/) in a start tag name can be used only in foreign content to specify self-closing tags. (Self-closing tags don't exist in HTML.) It is also allowed for void elements, but doesn't have any effect in this case. |
null-character-reference |
This error occurs if the parser encounters a numeric character reference that references a U+0000 NULL code point . The parser resolves such character references to a U+FFFD REPLACEMENT CHARACTER. |
surrogate-character-reference |
This error occurs if the parser encounters a numeric character reference that references a surrogate . The parser resolves such character references to a U+FFFD REPLACEMENT CHARACTER. |
surrogate-in-input-stream |
This error occurs if the input stream contains a surrogate . Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM.
Surrogates can only find their way into the input stream via script APIs such as |
unexpected-character-after-doctype-system-identifier |
This error occurs if the parser encounters any code points other than ASCII whitespace or closing U+003E (>) after the DOCTYPE system identifier. The parser ignores these code points. |
unexpected-character-in-attribute-name |
This error occurs if the parser encounters a U+0022 ("), U+0027 ('), or U+003C (<) code point in an attribute name . The parser includes such code points in the attribute name. Code points that trigger this error are usually a part of another syntactic construct and can be a sign of a typo around the attribute name. For example, consider the following markup:
Due to a forgotten U+003E (>) code point after
As another example of this error, consider the following markup:
Due to a forgotten U+003D (=) code point between an attribute name and value the parser treats this markup as a |
unexpected-character-in-unquoted-attribute-value |
This error occurs if the parser encounters a U+0022 ("), U+0027 ('), U+003C (<), U+003D (=), or U+0060 (`) code point in an unquoted attribute value . The parser includes such code points in the attribute value. Code points that trigger this error are usually a part of another syntactic construct and can be a sign of a typo around the attribute value. U+0060 (`) is in the list of code points that trigger this error because certain legacy user agents treat it as a quote. For example, consider the following markup:
Due to a misplaced U+0027 (') code point the parser sets the value of the " |
unexpected-equals-sign-before-attribute-name |
This error occurs if the parser encounters a U+003D (=) code point before an attribute name. In this case the parser treats U+003D (=) as the first code point of the attribute name. The common reason for this error is a forgotten attribute name. For example, consider the following markup:
Due to a forgotten attribute name the parser treats this markup as a |
unexpected-null-character |
This error occurs if the parser encounters a U+0000 NULL code point in the input stream in certain positions. In general, such code points are either ignored or, for security reasons, replaced with a U+FFFD REPLACEMENT CHARACTER. |
unexpected-question-mark-instead-of-tag-name |
This error occurs if the parser encounters a U+003F (?) code point where first code point of a start tag name is expected. The U+003F (?) and all content that follows up to a U+003E (>) code point (if present) or to the end of the input stream is treated as a comment. For example, consider the following markup:
This will be parsed into:
The common reason for this error is an XML processing instruction (e.g., |
unexpected-solidus-in-tag |
This error occurs if the parser encounters a U+002F (/) code point that is not a part of a quoted attribute value and not immediately followed by a U+003E (>) code point in a tag (e.g., |
unknown-named-character-reference |
This error occurs if the parser encounters an ambiguous ampersand . In this case the parser doesn't resolve the character reference . |
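Several rows of the table above describe how numeric character references are resolved. The following Python sketch collects those rows into one function. The function and constant names are illustrative assumptions, the order of the checks is this sketch's choice, and the C1 control replacement mentioned under control-character-reference is deliberately left out because its mapping is defined elsewhere.

```python
from typing import Optional, Tuple

REPLACEMENT_CHARACTER = 0xFFFD
ASCII_WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_control(cp: int) -> bool:
    """C0 controls (U+0000..U+001F) and U+007F..U+009F."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def resolve_numeric_reference(cp: int) -> Tuple[int, Optional[str]]:
    """Return (resolved code point, parse error code or None).

    Hypothetical helper: it mirrors the table rows only and does not apply
    the C1 control replacement.
    """
    if cp == 0x00:
        return REPLACEMENT_CHARACTER, "null-character-reference"
    if cp > 0x10FFFF:
        return REPLACEMENT_CHARACTER, "character-reference-outside-unicode-range"
    if 0xD800 <= cp <= 0xDFFF:
        return REPLACEMENT_CHARACTER, "surrogate-character-reference"
    if is_noncharacter(cp):
        # Resolved as-is, but still reported.
        return cp, "noncharacter-character-reference"
    if cp == 0x0D or (is_control(cp) and cp not in ASCII_WHITESPACE):
        # Resolved as-is here; C1 replacement is out of scope for this sketch.
        return cp, "control-character-reference"
    return cp, None
```

For instance, `resolve_numeric_reference(0xD800)` reports a surrogate-character-reference and resolves to U+FFFD, while `resolve_numeric_reference(0x41)` reports nothing.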
The stream of code points that comprises the input to the tokenization stage will be initially seen by the user agent as a stream of bytes (typically coming over the network or from the local file system). The bytes encode the actual characters according to a particular character encoding , which the user agent uses to decode the bytes into characters.
For XML documents, the algorithm user agents are required to use to determine the character encoding is given by XML . This section does not apply to XML documents. [XML]
Usually, the encoding sniffing algorithm defined below is used to determine the character encoding.
Given a character encoding, the bytes in the input byte stream must be converted to characters for the tokenizer's input stream , by passing the input byte stream and character encoding to decode .
A leading Byte Order Mark (BOM) causes the character encoding argument to be ignored and will itself be skipped.
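As a rough illustration of the BOM behavior just described, the Python sketch below checks for a leading BOM, lets it override the supplied encoding, and skips it before decoding. The function name `decode_with_bom` is an assumption, and Python's built-in codecs only approximate the decoders of the Encoding standard.

```python
import codecs

# BOM byte sequences paired with the Python codec that decodes what follows.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def decode_with_bom(data: bytes, fallback_encoding: str) -> str:
    """Decode data, letting a leading BOM take precedence over fallback_encoding.

    Hypothetical helper. Invalid sequences become U+FFFD through Python's
    "replace" handler, which is only an approximation of the standard decoders.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The BOM selects the encoding and is itself skipped.
            return data[len(bom):].decode(encoding, errors="replace")
    return data.decode(fallback_encoding, errors="replace")
```

With this sketch, `decode_with_bom(b"\xef\xbb\xbfhi", "windows-1252")` yields `"hi"` because the UTF-8 BOM wins over the supplied windows-1252 argument.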
Bytes or sequences of bytes in the original byte stream that did not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]
The decoder algorithms describe how to handle invalid input; for security reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte sequences are handled can result in, amongst other problems, script injection vulnerabilities ("XSS").
When the HTML parser is decoding an input byte stream, it uses a character encoding and a confidence . The confidence is either tentative , certain , or irrelevant . The encoding used, and whether the confidence in that encoding is tentative or certain , is used during the parsing to determine whether to change the encoding . If no encoding is necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a character encoding at all, then the confidence is irrelevant .
Some algorithms feed the parser by directly adding characters to the input stream rather than adding bytes to the input byte stream .
When the HTML parser is to operate on an input byte stream that has a known definite encoding , then the character encoding is that encoding and the confidence is certain .
In some cases, it might be impractical to unambiguously determine the encoding before parsing the document. Because of this, this specification provides for a two-pass mechanism with an optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing algorithm to whatever bytes they have available before beginning to parse the document. Then, the real parser is started, using a tentative encoding derived from this pre-parse and other out-of-band metadata. If, while the document is being loaded, the user agent discovers a character encoding declaration that conflicts with this information, then the parser can get reinvoked to perform a parse of the document with the real encoding.
User agents must use the following algorithm, called the encoding sniffing algorithm , to determine the character encoding to use when decoding a document in the first pass. This algorithm takes as input any out-of-band metadata available to the user agent (e.g. the Content-Type metadata of the document) and all the bytes available so far, and returns a character encoding and a confidence that is either tentative or certain .
If the result of BOM sniffing is an encoding, return that encoding with confidence certain .
Although the decode algorithm will itself change the encoding to use based on the presence of a byte order mark, this algorithm sniffs the BOM as well in order to set the correct document's character encoding and confidence .
If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain .
Typically, user agents remember such user requests across sessions, and in some cases apply them to documents in iframes as well.
The user agent may wait for more bytes of the resource to be available, either in this step or at any later step in this algorithm. For instance, a user agent might wait 500ms or 1024 bytes, whichever came first. In general preparsing the source to find the encoding improves performance, as it reduces the need to throw away the data structures used when parsing upon finding the encoding information. However, if the user agent delays too long to obtain data to determine the encoding, then the cost of the delay could outweigh any performance improvements from the preparse.
The authoring conformance requirements for character encoding declarations limit them to only appearing in the first 1024 bytes . User agents are therefore encouraged to use the prescan algorithm below (as invoked by these steps) on the first 1024 bytes, but not to stall beyond that.
If the transport layer specifies a character encoding, and it is supported, return that encoding with the confidence certain .
Optionally prescan the byte stream to determine its encoding , with the end condition being when the user agent decides that scanning further bytes would not be efficient. User agents are encouraged to only prescan the first 1024 bytes. User agents may decide that scanning any bytes is not efficient, in which case these substeps are entirely skipped.
The aforementioned algorithm returns either a character encoding or failure. If it returns a character encoding, then return the same encoding, with confidence tentative .
If the HTML parser for which this algorithm is being run is associated with a Document d whose container document is non-null, then:
Let parentDocument be d 's container document .
If parentDocument 's origin is same origin with d 's origin and parentDocument 's character encoding is not UTF-16BE/LE , then return parentDocument 's character encoding , with the confidence tentative .
Otherwise, if the user agent has information on the likely encoding for this page, e.g. based on the encoding of the page when it was last visited, then return that encoding, with the confidence tentative .
The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. Such algorithms may use information about the resource other than the resource's contents, including the address of the resource. If autodetection succeeds in determining a character encoding, and that encoding is a supported encoding, then return that encoding, with the confidence tentative . [UNIVCHARDET]
User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network, since doing so involves inherently non-interoperable heuristics. Attempting to detect encodings based on an HTML document's preamble is especially tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin with a lot of markup rather than with text content.
The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective. [PPUTF8] [UTF8DET]
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative .
In controlled environments or in environments where the encoding of documents can be prescribed (for example, for user agents intended for dedicated use in new networks), the comprehensive UTF-8 encoding is suggested.
In other environments, the default encoding is typically dependent on the user's locale (an approximation of the languages, and thus often encodings, of the pages that the user is likely to frequent). The following table gives suggested defaults based on the user's locale, for compatibility with legacy content. Locales are identified by BCP 47 language tags. [BCP47] [ENCODING]
Locale | Language | Suggested default encoding |
---|---|---|
ar | Arabic | windows-1256 |
az | Azeri | windows-1254 |
ba | Bashkir | windows-1251 |
be | Belarusian | windows-1251 |
bg | Bulgarian | windows-1251 |
cs | Czech | windows-1250 |
el | Greek | ISO-8859-7 |
et | Estonian | windows-1257 |
fa | Persian | windows-1256 |
he | Hebrew | windows-1255 |
hr | Croatian | windows-1250 |
hu | Hungarian | ISO-8859-2 |
ja | Japanese | Shift_JIS |
kk | Kazakh | windows-1251 |
ko | Korean | EUC-KR |
ku | Kurdish | windows-1254 |
ky | Kyrgyz | windows-1251 |
lt | Lithuanian | windows-1257 |
lv | Latvian | windows-1257 |
mk | Macedonian | windows-1251 |
pl | Polish | ISO-8859-2 |
ru | Russian | windows-1251 |
sah | Yakut | windows-1251 |
sk | Slovak | windows-1250 |
sl | Slovenian | ISO-8859-2 |
sr | Serbian | windows-1251 |
tg | Tajik | windows-1251 |
th | Thai | windows-874 |
tr | Turkish | windows-1254 |
tt | Tatar | windows-1251 |
uk | Ukrainian | windows-1251 |
vi | Vietnamese | windows-1258 |
zh-Hans, zh-CN, zh-SG | Chinese, Simplified | GBK |
zh-Hant, zh-HK, zh-MO, zh-TW | Chinese, Traditional | Big5 |
All other locales | | windows-1252 |
The contents of this table are derived from the intersection of Windows, Chrome, and Firefox defaults.
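The table above could be consulted along the following lines. This Python sketch is only one possible reading: the specification gives the table, not a matching procedure, so the strategy of progressively dropping trailing subtags from the BCP 47 tag is an assumption, as are the names used.

```python
# Suggested defaults from the table above, grouped by encoding.
_LOCALES_BY_ENCODING = {
    "windows-1250": ["cs", "hr", "sk"],
    "windows-1251": ["ba", "be", "bg", "kk", "ky", "mk", "ru", "sah",
                     "sr", "tg", "tt", "uk"],
    "windows-1254": ["az", "ku", "tr"],
    "windows-1255": ["he"],
    "windows-1256": ["ar", "fa"],
    "windows-1257": ["et", "lt", "lv"],
    "windows-1258": ["vi"],
    "windows-874": ["th"],
    "ISO-8859-2": ["hu", "pl", "sl"],
    "ISO-8859-7": ["el"],
    "Shift_JIS": ["ja"],
    "EUC-KR": ["ko"],
    "GBK": ["zh-hans", "zh-cn", "zh-sg"],
    "Big5": ["zh-hant", "zh-hk", "zh-mo", "zh-tw"],
}

# Invert into a lookup keyed by lowercased locale tag.
_ENCODING_BY_LOCALE = {
    locale: encoding
    for encoding, locales in _LOCALES_BY_ENCODING.items()
    for locale in locales
}


def default_encoding_for_locale(tag: str) -> str:
    """Return the suggested default encoding for a BCP 47 language tag.

    Assumption: try the full tag first, then drop trailing subtags one at a
    time ("ru-RU" then "ru"); anything unmatched gets windows-1252.
    """
    subtags = tag.lower().split("-")
    while subtags:
        encoding = _ENCODING_BY_LOCALE.get("-".join(subtags))
        if encoding is not None:
            return encoding
        subtags.pop()
    return "windows-1252"
```

Under this matching strategy a bare `zh` tag falls through to windows-1252, since the table only lists script-qualified and region-qualified Chinese tags.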
The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input byte stream.
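The order of the encoding sniffing steps can be summarized in a short Python sketch. It is a simplification under stated assumptions: every parameter name is hypothetical, the prescan is passed in as an optional callable rather than implemented, the caller is assumed to supply `parent_encoding` only for a same-origin container document, and autodetection is reduced to the UTF-8 pattern check mentioned above.

```python
import codecs
from typing import Callable, Optional, Tuple


def bom_sniff(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    return None


def looks_like_utf8(data: bytes) -> bool:
    """True if data has bytes above 0x7F and is valid UTF-8 as a whole."""
    if all(byte <= 0x7F for byte in data):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sniff_encoding(
    data: bytes,
    *,
    user_override: Optional[str] = None,
    transport_charset: Optional[str] = None,
    prescan: Optional[Callable[[bytes], Optional[str]]] = None,
    parent_encoding: Optional[str] = None,
    remembered: Optional[str] = None,
    default: str = "windows-1252",
) -> Tuple[str, str]:
    """Return (encoding, confidence), checking sources in the order given above."""
    bom = bom_sniff(data)
    if bom is not None:
        return bom, "certain"
    if user_override is not None:
        return user_override, "certain"
    if transport_charset is not None:
        return transport_charset, "certain"
    if prescan is not None:
        found = prescan(data[:1024])  # only the first 1024 bytes are prescanned
        if found is not None:
            return found, "tentative"
    if parent_encoding is not None and parent_encoding not in ("UTF-16BE", "UTF-16LE"):
        return parent_encoding, "tentative"
    if remembered is not None:
        return remembered, "tentative"
    if looks_like_utf8(data):
        return "UTF-8", "tentative"
    return default, "tentative"
```

Because `looks_like_utf8` decodes everything it is given, it is best suited to the case where the whole file is available; a buffer cut in the middle of a multi-byte sequence would be rejected.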
When an algorithm requires a user agent to prescan a byte stream to determine its encoding , given some defined end condition , then it must run the following steps. If at any point during these steps (including during instances of the get an attribute algorithm invoked by this one) the user agent either runs out of bytes (meaning the position pointer created in the first step below goes beyond the end of the byte stream obtained so far) or reaches its end condition , then abort the prescan a byte stream to determine its encoding algorithm and return the result get an XML encoding applied to the same bytes that the prescan a byte stream to determine its encoding algorithm was applied to. Otherwise, these steps will return a character encoding.
Let fallback encoding be null.
Let position be a pointer to a byte in the input byte stream, initially pointing at the first byte.
Prescan for UTF-16 XML declarations: If position points to:
Return UTF-16LE .
Return UTF-16BE .
For historical reasons, the prefix is two bytes longer than in Appendix F of XML and the encoding name is not checked.
Loop: If position points to:
A sequence of bytes starting with 0x3C, 0x21, 0x2D, 0x2D (`<!--`)
Advance the position pointer so that it points at the first 0x3E byte which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes after the 0x3C byte that was found. (The two 0x2D bytes can be the same as those in the '<!--' sequence.)
Advance the position pointer so that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or 0x2F byte (the one in sequence of characters matched above).
Let attribute list be an empty list of strings.
Let got pragma be false.
Let need pragma be null.
Let charset be the null value (which, for the purposes of this algorithm, is distinct from an unrecognized encoding or the empty string).
Attributes : Get an attribute and its value. If no attribute was sniffed, then jump to the processing step below.
If the attribute's name is already in attribute list , then return to the step labeled attributes .
Add the attribute's name to attribute list .
Run the appropriate step from the following list, if one applies:
If the attribute's name is "http-equiv": If the attribute's value is "content-type", then set got pragma to true.
If the attribute's name is "content": Apply the algorithm for extracting a character encoding from a meta element, giving the attribute's value as the string to parse. If a character encoding is returned, and if charset is still set to null, let charset be the encoding returned, and set need pragma to true.
If the attribute's name is "charset": Let charset be the result of getting an encoding from the attribute's value, and set need pragma to false.
Return to the step labeled attributes .
Processing : If need pragma is null, then jump to the step below labeled next byte .
If need pragma is true but got pragma is false, then jump to the step below labeled next byte .
If charset is failure, then jump to the step below labeled next byte .
If charset is UTF-16BE/LE , then set charset to UTF-8 .
If charset is x-user-defined , then set charset to windows-1252 .
Return charset .
Advance the position pointer so that it points at the next 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x3E (>) byte.
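Repeatedly get an attribute until no further attributes can be found, then jump to the step below labeled next byte .
A sequence of bytes starting with 0x3C, 0x21 (`<!`)
A sequence of bytes starting with 0x3C, 0x2F (`</`)
A sequence of bytes starting with 0x3C, 0x3F (`<?`)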
Advance the position pointer so that it points at the first 0x3E byte (>) that comes after the 0x3C byte that was found.
Do nothing with that byte.
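The attribute-processing part of the prescan (the got pragma, need pragma, and charset bookkeeping above) can be sketched in Python as follows. Several pieces are stand-ins and should be read as assumptions: `get_encoding` uses a tiny illustrative label table instead of the full one in the Encoding standard, and `encoding_from_content` replaces the algorithm for extracting a character encoding from a meta element with a simplified regular expression.

```python
import re
from typing import Iterable, Optional, Tuple

FAILURE = object()  # stands for "getting an encoding" failing on a label

# Illustrative subset of encoding labels; not the complete table.
_LABELS = {
    "utf-8": "UTF-8", "utf8": "UTF-8",
    "utf-16": "UTF-16LE", "utf-16le": "UTF-16LE", "utf-16be": "UTF-16BE",
    "windows-1252": "windows-1252", "iso-8859-1": "windows-1252",
    "x-user-defined": "x-user-defined",
}

_CHARSET_RE = re.compile(
    r"""charset\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s;"']+))""", re.IGNORECASE
)


def get_encoding(label: str):
    """Simplified label lookup: trim ASCII whitespace, lowercase, consult table."""
    return _LABELS.get(label.strip("\t\n\x0c\r ").lower(), FAILURE)


def encoding_from_content(content: str) -> Optional[str]:
    """Simplified stand-in for extracting an encoding from a content value."""
    match = _CHARSET_RE.search(content)
    if match is None:
        return None
    label = next(group for group in match.groups() if group is not None)
    encoding = get_encoding(label)
    return None if encoding is FAILURE else encoding


def charset_from_meta(attributes: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Apply the meta processing steps to already-sniffed (name, value) pairs.

    Returns an encoding name, or None where the steps jump to "next byte".
    """
    seen = set()
    got_pragma = False
    need_pragma: Optional[bool] = None
    charset = None  # null is distinct from FAILURE
    for name, value in attributes:
        if name in seen:
            continue  # duplicate attribute names are skipped
        seen.add(name)
        if name == "http-equiv":
            if value == "content-type":
                got_pragma = True
        elif name == "content":
            encoding = encoding_from_content(value)
            if encoding is not None and charset is None:
                charset, need_pragma = encoding, True
        elif name == "charset":
            charset, need_pragma = get_encoding(value), False
    if need_pragma is None or (need_pragma and not got_pragma):
        return None
    if charset is FAILURE:
        return None
    if charset in ("UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    if charset == "x-user-defined":
        return "windows-1252"
    return charset
```

The sentinel `FAILURE` keeps an unrecognized `charset` attribute distinct from the initial null value, so a later `content` attribute cannot overwrite it.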
When the prescan a byte stream to determine its encoding algorithm says to get an attribute , it means doing this:
If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), 0x20 (SP), or 0x2F (/) then advance position to the next byte and redo this step.
If the byte at position is 0x3E (>), then abort the get an attribute algorithm. There isn't one.
Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.
Process the byte at position as follows:
Advance position to the next byte and return to the previous step.
Spaces : If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP) then advance position to the next byte, then, repeat this step.
If the byte at position is not 0x3D (=), abort the get an attribute algorithm. The attribute's name is the value of attribute name , its value is the empty string.
Advance position past the 0x3D (=) byte.
Value : If the byte at position is one of 0x09 (HT), 0x0A (LF), 0x0C (FF), 0x0D (CR), or 0x20 (SP) then advance position to the next byte, then, repeat this step.
Process the byte at position as follows:
Process the byte at position as follows:
Advance position to the next byte and return to the previous step.
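A compact Python sketch of the get an attribute steps follows. The per-byte cases for the "Process the byte at position as follows" steps are not reproduced in the text above, so the cases used here (lowercasing ASCII uppercase bytes, stopping names at whitespace, `=`, `/`, or `>`, and handling quoted and unquoted values) are this sketch's reading of the algorithm and should be checked against the specification. The function name and return shape are assumptions.

```python
from typing import Optional, Tuple

WHITESPACE = b"\t\n\x0c\r "  # 0x09, 0x0A, 0x0C, 0x0D, 0x20


def _lower(byte: int) -> int:
    """Map ASCII A-Z to a-z and leave every other byte unchanged."""
    return byte + 0x20 if 0x41 <= byte <= 0x5A else byte


def get_attribute(data: bytes, pos: int) -> Tuple[Optional[Tuple[str, str]], int]:
    """Return ((name, value), new_pos), or (None, new_pos) if there is no attribute.

    Indexing past the end of data raises IndexError, which stands in for
    the user agent running out of bytes.
    """
    # Skip whitespace and slashes before the attribute name.
    while data[pos] in b"\t\n\x0c\r /":
        pos += 1
    if data[pos] == 0x3E:  # ">": there is no attribute
        return None, pos
    name, value = bytearray(), bytearray()

    def result() -> Tuple[str, str]:
        return name.decode("latin-1"), value.decode("latin-1")

    # Attribute name.
    while True:
        byte = data[pos]
        if byte == 0x3D and name:  # "=" after a non-empty name
            pos += 1
            break
        if byte in WHITESPACE:
            # "Spaces" step: skip whitespace, then require "=".
            while data[pos] in WHITESPACE:
                pos += 1
            if data[pos] != 0x3D:
                return result(), pos
            pos += 1
            break
        if byte in b"/>":
            return result(), pos
        name.append(_lower(byte))
        pos += 1

    # "Value" step: skip whitespace, then read a quoted or unquoted value.
    while data[pos] in WHITESPACE:
        pos += 1
    byte = data[pos]
    if byte in b"\"'":
        quote = byte
        while True:
            pos += 1
            byte = data[pos]
            if byte == quote:
                return result(), pos + 1
            value.append(_lower(byte))
    if byte == 0x3E:
        return result(), pos
    while byte not in WHITESPACE and byte != 0x3E:
        value.append(_lower(byte))
        pos += 1
        byte = data[pos]
    return result(), pos
```

A caller would invoke it repeatedly, feeding each returned position back in, until it returns `None` for the attribute or raises `IndexError`.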
When the prescan a byte stream to determine its encoding algorithm is aborted without returning an encoding, get an XML encoding means doing this.
Looking for syntax resembling an XML declaration, even in text/html, is necessary for compatibility with existing content.
Let encodingPosition be a pointer to the start of the stream.
If encodingPosition does not point to the start of a byte sequence 0x3C, 0x3F, 0x78, 0x6D, 0x6C (`<?xml`), then return failure.
Let xmlDeclarationEnd be a pointer to the next byte in the input byte stream which is 0x3E (>). If there is no such byte, then return failure.
Set encodingPosition to the position of the first occurrence of the subsequence of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (`encoding`) at or after the current encodingPosition. If there is no such sequence, then return failure.
Advance encodingPosition past the 0x67 (g) byte.
While the byte at encodingPosition is less than or equal to 0x20 (i.e., it is either an ASCII space or control character), advance encodingPosition to the next byte.
If the byte at encodingPosition is not 0x3D (=), then return failure.
Advance encodingPosition to the next byte.
While the byte at encodingPosition is less than or equal to 0x20 (i.e., it is either an ASCII space or control character), advance encodingPosition to the next byte.
Let quoteMark be the byte at encodingPosition .
If quoteMark is not either 0x22 (") or 0x27 ('), then return failure.
Advance encodingPosition to the next byte.
Let encodingEndPosition be the position of the next occurrence of quoteMark at or after encodingPosition . If quoteMark does not occur again, then return failure.
Let potentialEncoding be the sequence of the bytes between encodingPosition (inclusive) and encodingEndPosition (exclusive).
If potentialEncoding contains one or more bytes whose byte value is 0x20 or below, then return failure.
Let encoding be the result of getting an encoding given potentialEncoding isomorphic decoded .
If the encoding is UTF-16BE/LE , then change it to UTF-8 .
Return encoding .
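The get an XML encoding steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a conforming implementation: `get_an_encoding` and its tiny `LABELS` table are hypothetical stand-ins for the label lookup defined by the Encoding standard, and "return failure" is modeled as returning `None`.

```python
from typing import Optional

# Hypothetical, heavily abridged label table (assumption). The real
# "get an encoding" lookup in the Encoding standard covers many more labels.
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "windows-1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "utf-16": "UTF-16LE",
    "utf-16le": "UTF-16LE",
    "utf-16be": "UTF-16BE",
}


def get_an_encoding(label: str) -> Optional[str]:
    """Stand-in for the Encoding standard's label lookup (assumption)."""
    return LABELS.get(label.lower())


def get_xml_encoding(stream: bytes) -> Optional[str]:
    """Return an encoding name, or None where the steps say "return failure"."""
    if not stream.startswith(b"<?xml"):
        return None
    if stream.find(b">") == -1:  # xmlDeclarationEnd must exist
        return None
    pos = stream.find(b"encoding")
    if pos == -1:
        return None
    pos += len(b"encoding")  # advance past the 0x67 (g) byte

    def skip_low_bytes(p: int) -> int:
        # Skip bytes <= 0x20 (ASCII space or control characters).
        while p < len(stream) and stream[p] <= 0x20:
            p += 1
        return p

    pos = skip_low_bytes(pos)
    if pos >= len(stream) or stream[pos] != 0x3D:  # '='
        return None
    pos = skip_low_bytes(pos + 1)
    if pos >= len(stream) or stream[pos] not in (0x22, 0x27):  # '"' or "'"
        return None
    quote_mark = stream[pos]
    pos += 1
    end = stream.find(bytes([quote_mark]), pos)
    if end == -1:
        return None
    potential = stream[pos:end]
    if any(b <= 0x20 for b in potential):
        return None
    # latin-1 maps each byte to the code point of the same value,
    # which is what "isomorphic decode" means.
    encoding = get_an_encoding(potential.decode("latin-1"))
    if encoding in ("UTF-16BE", "UTF-16LE"):
        encoding = "UTF-8"
    return encoding
```

Note that the sketch adds explicit end-of-stream checks, which the prose steps leave implicit.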
For the sake of interoperability, user agents should not use a pre-scan algorithm that returns different results than the one described above. (But, if you do, please at least let us know, so that we can improve this algorithm and benefit everyone...)
User agents must support the encodings defined in Encoding , including, but not limited to, UTF-8 , ISO-8859-2 , ISO-8859-7 , ISO-8859-8 , windows-874 , windows-1250 , windows-1251 , windows-1252 , windows-1254 , windows-1255 , windows-1256 , windows-1257 , windows-1258 , GBK , Big5 , ISO-2022-JP , Shift_JIS , EUC-KR , UTF-16BE , UTF-16LE , UTF-16BE/LE , and x-user-defined . User agents must not support other encodings.
The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32. This specification does not make any attempt to support prohibited encodings in its algorithms; support and use of prohibited encodings would thus lead to unexpected behavior. [CESU8] [UTF7] [BOCU1] [SCSU]
When the parser requires the user agent to change the encoding , it must run the following steps. This might happen if the encoding sniffing algorithm described above failed to find a character encoding, or if it found a character encoding that was not the actual encoding of the file.
If the encoding that is already being used to interpret the input stream is UTF-16BE/LE , then set the confidence to certain and return. The new encoding is ignored; if it was anything but the same encoding, then it would be clearly incorrect.
If the new encoding is UTF-16BE/LE , then change it to UTF-8 .
If the new encoding is x-user-defined , then change it to windows-1252 .
If the new encoding is identical or equivalent to the encoding that is already being used to interpret the input stream, then set the confidence to certain and return. This happens when the encoding information found in the file matches what the encoding sniffing algorithm determined to be the encoding, and in the second pass through the parser if the first pass found that the encoding sniffing algorithm described in the earlier section failed to find the right encoding.
If all the bytes up to the last byte converted by the current decoder have the same Unicode interpretations in both the current encoding and the new encoding, and if the user agent supports changing the converter on the fly, then the user agent may change to the new converter for the encoding on the fly. Set the document's character encoding and the encoding used to convert the input stream to the new encoding, set the confidence to certain , and return.
Otherwise, restart the navigate algorithm, with historyHandling set to "replace" and other inputs kept the same, but this time skip the encoding sniffing algorithm and instead just set the encoding to the new encoding and the confidence to certain. Whenever possible, this should be done without actually contacting the network layer (the bytes should be re-parsed from memory), even if, e.g., the document is marked as not being cacheable. If this is not possible and contacting the network layer would involve repeating a request that uses a method other than `GET`, then instead set the confidence to certain and ignore the new encoding. The resource will be misinterpreted. User agents may notify the user of the situation, to aid in application development.
This algorithm is only invoked when a new encoding is found declared on a meta element.
The input stream consists of the characters pushed into it as the input byte stream is decoded or from the various APIs that directly manipulate the input stream.
Any occurrences of surrogates are surrogate-in-input-stream parse errors . Any occurrences of noncharacters are noncharacter-in-input-stream parse errors and any occurrences of controls other than ASCII whitespace and U+0000 NULL characters are control-character-in-input-stream parse errors .
The handling of U+0000 NULL characters varies based on where the characters are found and happens at the later stages of the parsing. They are either ignored or, for security reasons, replaced with a U+FFFD REPLACEMENT CHARACTER. This handling is, by necessity, spread across both the tokenization stage and the tree construction stage.
Before the tokenization stage, the input stream must be preprocessed by normalizing newlines . Thus, newlines in HTML DOMs are represented by U+000A LF characters, and there are never any U+000D CR characters in the input to the tokenization stage.
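As a small illustration of the newline preprocessing described above, the following Python sketch replaces each CR LF pair with a single LF and then each remaining CR with LF. The function name is an assumption made for this example.

```python
def normalize_newlines(text: str) -> str:
    """Replace CR LF pairs, then any remaining CR, with a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

After this step no U+000D CR character remains, so the tokenizer only ever sees U+000A LF as a newline.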
The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored by the requirements in this section. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed .
The insertion point is the position (just before a character or just before the end of the input stream) where content inserted using document.write() is actually inserted. The insertion point is relative to the position of the character immediately after it, it is not an absolute offset into the input stream. Initially, the insertion point is undefined.
The "EOF" character in the tables below is a conceptual character representing the end of the input stream. If the parser is a script-created parser, then the end of the input stream is reached when an explicit "EOF" character (inserted by the document.close() method) is consumed. Otherwise, the "EOF" character is not a real character in the stream, but rather the lack of any further characters.
The insertion mode is a state variable that controls the primary operation of the tree construction stage.
Initially, the insertion mode is " initial ". It can change to " before html ", " before head ", " in head ", " in head noscript ", " after head ", " in body ", " text ", " in table ", " in table text ", " in caption ", " in column group ", " in table body ", " in row ", " in cell ", " in select ", " in select in table ", " in template ", " after body ", " in frameset ", " after frameset ", " after after body ", and " after after frameset " during the course of the parsing, as described in the tree construction stage. The insertion mode affects how tokens are processed and whether CDATA sections are supported.
Several of these modes, namely " in head ", " in body ", " in table ", and " in select ", are special, in that the other modes defer to them at various times. When the algorithm below says that the user agent is to do something " using the rules for the m insertion mode", where m is one of these modes, the user agent must use the rules described under the m insertion mode 's section, but must leave the insertion mode unchanged unless the rules in m themselves switch the insertion mode to a new value.
When the insertion mode is switched to " text " or " in table text ", the original insertion mode is also set. This is the insertion mode to which the tree construction stage will return.
Similarly, to parse nested template elements, a stack of template insertion modes is used. It is initially empty. The current template insertion mode is the insertion mode that was most recently added to the stack of template insertion modes. The algorithms in the sections below will push insertion modes onto this stack, meaning that the specified insertion mode is to be added to the stack, and pop insertion modes from the stack, which means that the most recently added insertion mode must be removed from the stack.
When the steps below require the UA to reset the insertion mode appropriately , it means the UA must follow these steps:
Let last be false.
Let node be the last node in the stack of open elements .
Loop : If node is the first node in the stack of open elements, then set last to true, and, if the parser was created as part of the HTML fragment parsing algorithm ( fragment case ), set node to the context element passed to that algorithm.
If node is a select element, run these substeps:
If last is true, jump to the step below labeled done .
Let ancestor be node .
Loop : If ancestor is the first node in the stack of open elements , jump to the step below labeled done .
Let ancestor be the node before ancestor in the stack of open elements .
If ancestor is a template node, jump to the step below labeled done.
If ancestor is a table node, switch the insertion mode to "in select in table" and return.
Jump back to the step labeled loop .
Done : Switch the insertion mode to " in select " and return.
If node is a td or th element and last is false, then switch the insertion mode to "in cell" and return.
If node is a tr element, then switch the insertion mode to "in row" and return.
If node is a tbody, thead, or tfoot element, then switch the insertion mode to "in table body" and return.
If node is a caption element, then switch the insertion mode to "in caption" and return.
If node is a colgroup element, then switch the insertion mode to "in column group" and return.
If node is a table element, then switch the insertion mode to "in table" and return.
If node is a template element, then switch the insertion mode to the current template insertion mode and return.
If node is a head element and last is false, then switch the insertion mode to "in head" and return.
If node is a body element, then switch the insertion mode to "in body" and return.
If node is a frameset element, then switch the insertion mode to "in frameset" and return. (fragment case)
If node is an html element, run these substeps:
If the head element pointer is null, switch the insertion mode to "before head" and return. (fragment case)
Otherwise, the head element pointer is not null, switch the insertion mode to "after head" and return.
If last is true, then switch the insertion mode to " in body " and return. ( fragment case )
Let node now be the node before node in the stack of open elements .
Return to the step labeled loop .
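The reset steps above can be sketched in Python as below. This is a simplified model under stated assumptions: the stack of open elements is a list of HTML element names with the first node at index 0, the context element (fragment case) is passed as an optional name, and the function returns the chosen mode instead of mutating parser state. All parameter names are hypothetical.

```python
from typing import List, Optional, Sequence

# Element names whose mode does not depend on the "last" flag.
SIMPLE_MODES = {
    "tr": "in row",
    "tbody": "in table body",
    "thead": "in table body",
    "tfoot": "in table body",
    "caption": "in caption",
    "colgroup": "in column group",
    "table": "in table",
    "body": "in body",
    "frameset": "in frameset",
}


def reset_insertion_mode(
    stack: List[str],
    context: Optional[str] = None,       # context element name (fragment case)
    head_pointer_set: bool = False,      # True if the head element pointer is not null
    template_modes: Sequence[str] = (),  # stack of template insertion modes
) -> str:
    last = False
    index = len(stack) - 1  # start at the last node in the stack
    while True:
        node = stack[index]
        if index == 0:
            last = True
            if context is not None:
                node = context
        if node == "select":
            if not last:
                ancestor = index
                while ancestor > 0:
                    ancestor -= 1
                    if stack[ancestor] == "template":
                        break
                    if stack[ancestor] == "table":
                        return "in select in table"
            return "in select"
        if node in ("td", "th") and not last:
            return "in cell"
        if node in SIMPLE_MODES:
            return SIMPLE_MODES[node]
        if node == "template":
            return template_modes[-1]  # current template insertion mode
        if node == "head" and not last:
            return "in head"
        if node == "html":
            return "after head" if head_pointer_set else "before head"
        if last:
            return "in body"
        index -= 1  # move to the node before node in the stack
```

For example, with the stack `["html", "body", "table", "select"]` the select branch walks upwards, finds the table ancestor, and yields "in select in table".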
Initially, the stack of open elements is empty. The stack grows downwards; the topmost node on the stack is the first one added to the stack, and the bottommost node of the stack is the most recently added node in the stack (notwithstanding when the stack is manipulated in a random access fashion as part of the handling for misnested tags ).
The "before html" insertion mode creates the html document element, which is then added to the stack.
In the fragment case, the stack of open elements is initialized to contain an html element that is created as part of that algorithm. (The fragment case skips the "before html" insertion mode.)
The html node, however it is created, is the topmost node of the stack. It only gets popped off the stack when the parser finishes.
The current node is the bottommost node in this stack of open elements .
The adjusted current node is the context element if the parser was created as part of the HTML fragment parsing algorithm and the stack of open elements has only one element in it ( fragment case ); otherwise, the adjusted current node is the current node .
When the current node is removed from the stack of open elements , process internal resource links given the current node 's node document .
Elements in the stack of open elements fall into the following categories:
The following elements have varying levels of special parsing rules: HTML's address, applet, area, article, aside, base, basefont, bgsound, blockquote, body, br, button, caption, center, col, colgroup, dd, details, dir, div, dl, dt, embed, fieldset, figcaption, figure, footer, form, frame, frameset, h1, h2, h3, h4, h5, h6, head, header, hgroup, hr, html, iframe, img, input, keygen, li, link, listing, main, marquee, menu, meta, nav, noembed, noframes, noscript, object, ol, p, param, plaintext, pre, script, search, section, select, source, style, summary, table, tbody, td, template, textarea, tfoot, th, thead, title, tr, track, ul, wbr, xmp; MathML mi, MathML mo, MathML mn, MathML ms, MathML mtext, and MathML annotation-xml; and SVG foreignObject, SVG desc, and SVG title.
An image start tag token is handled by the tree builder, but it is not in this list because it is not an element; it gets turned into an img element.
The following HTML elements are those that end up in the list of active formatting elements: a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u.
All other elements found while parsing an HTML document.
Typically, the special elements have the start and end tag tokens handled specifically, while ordinary elements' tokens fall into "any other start tag" and "any other end tag" clauses, and some parts of the tree builder check if a particular element in the stack of open elements is in the special category. However, some elements (e.g., the option element) have their start or end tag tokens handled specifically, but are still not in the special category, so that they get the ordinary handling elsewhere.
The stack of open elements is said to have an element target node in a specific scope consisting of a list of element types list when the following algorithm terminates in a match state:
Initialize node to be the current node (the bottommost node of the stack).
If node is the target node, terminate in a match state.
Otherwise, if node is one of the element types in list , terminate in a failure state.
Otherwise, set node to the previous entry in the stack of open elements and return to step 2. (This will never fail, since the loop will always terminate in the previous step if the top of the stack, an html element, is reached.)
The stack of open elements is said to have a particular element in scope when it has that element in the specific scope consisting of the following element types:
applet
caption
html
table
td
th
marquee
object
template
mi
mo
mn
ms
mtext
annotation-xml
foreignObject
desc
title
The stack of open elements is said to have a particular element in list item scope when it has that element in the specific scope consisting of the following element types:
ol in the HTML namespace
ul in the HTML namespace
The stack of open elements is said to have a particular element in button scope when it has that element in the specific scope consisting of the following element types:
button in the HTML namespace
The stack of open elements is said to have a particular element in table scope when it has that element in the specific scope consisting of the following element types:
html in the HTML namespace
table in the HTML namespace
template in the HTML namespace
The stack of open elements is said to have a particular element in select scope when it has that element in the specific scope consisting of all element types except the following:
optgroup in the HTML namespace
option in the HTML namespace
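The "has an element in a specific scope" algorithm can be sketched as a walk from the current node towards the top of the stack. The Python below is a simplified illustration that assumes every stack entry is an HTML-namespace element identified only by its name, and that the target is given as an element name. It shows only the table scope and select scope variants, since those lists are given in full above. The function names are hypothetical.

```python
from typing import Callable, List

TABLE_SCOPE = {"html", "table", "template"}
SELECT_SCOPE_EXCEPTIONS = {"optgroup", "option"}


def has_element_in_specific_scope(
    stack: List[str], target: str, is_boundary: Callable[[str], bool]
) -> bool:
    """Walk from the current (bottommost) node towards the top of the stack."""
    for node in reversed(stack):
        if node == target:
            return True   # match state
        if is_boundary(node):
            return False  # failure state
    return False  # only reached if no boundary element (such as html) is on the stack


def has_element_in_table_scope(stack: List[str], target: str) -> bool:
    return has_element_in_specific_scope(stack, target, lambda n: n in TABLE_SCOPE)


def has_element_in_select_scope(stack: List[str], target: str) -> bool:
    # Select scope is defined by exclusion: everything except optgroup
    # and option acts as a boundary.
    return has_element_in_specific_scope(
        stack, target, lambda n: n not in SELECT_SCOPE_EXCEPTIONS
    )
```

The target check comes before the boundary check, so an element that is itself a boundary (for example table in table scope) is still found when it is the target.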
Nothing happens if at any time any of the elements in the stack of open elements are moved to a new location in, or removed from, the Document tree. In particular, the stack is not changed in this situation. This can cause, amongst other strange effects, content to be appended to nodes that are no longer in the DOM.
In some cases (namely, when closing misnested formatting elements ), the stack is manipulated in a random-access fashion.
Initially, the list of active formatting elements is empty. It is used to handle mis-nested formatting element tags .
The list contains elements in the formatting category, and markers. The markers are inserted when entering applet, object, marquee, template, td, th, and caption elements, and are used to prevent formatting from "leaking" into applet, object, marquee, template, td, th, and caption elements.
In addition, each element in the list of active formatting elements is associated with the token for which it was created, so that further elements can be created for that token if necessary.
When the steps below require the UA to push onto the list of active formatting elements an element element , the UA must perform the following steps:
If there are already three elements in the list of active formatting elements after the last marker , if any, or anywhere in the list if there are no markers , that have the same tag name, namespace, and attributes as element , then remove the earliest such element from the list of active formatting elements . For these purposes, the attributes must be compared as they were when the elements were created by the parser; two elements have the same attributes if all their parsed attributes can be paired such that the two attributes in each pair have identical names, namespaces, and values (the order of the attributes does not matter).
This is the Noah's Ark clause. But with three per family instead of two.
Add element to the list of active formatting elements .
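The push steps, including the "Noah's Ark" limit of three matching entries, might look like this in Python. The `Element` class and the `MARKER` sentinel are hypothetical stand-ins. Attributes are stored in a dict keyed by (namespace, name), so comparing two dicts ignores attribute order, as the text requires.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

MARKER = object()  # hypothetical sentinel standing in for a marker entry


@dataclass(eq=False)  # elements are compared by identity, not by value
class Element:
    name: str
    namespace: str = "http://www.w3.org/1999/xhtml"
    # Parsed attributes keyed by (attribute namespace, attribute name).
    attrs: Dict[Tuple[str, str], str] = field(default_factory=dict)


def same_family(a: Element, b: Element) -> bool:
    """Same tag name, namespace, and attributes (order-insensitive)."""
    return (a.name, a.namespace, a.attrs) == (b.name, b.namespace, b.attrs)


def push_active_formatting(afe: List[object], element: Element) -> None:
    # Only entries after the last marker (or the whole list) are considered.
    start = 0
    for i in range(len(afe) - 1, -1, -1):
        if afe[i] is MARKER:
            start = i + 1
            break
    matches = [i for i in range(start, len(afe)) if same_family(afe[i], element)]
    if len(matches) >= 3:
        del afe[matches[0]]  # remove the earliest matching entry
    afe.append(element)
```

Pushing a fourth identical b element therefore drops the oldest of the three already present, while matching entries that sit before the last marker are not counted.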
When the steps below require the UA to reconstruct the active formatting elements , the UA must perform the following steps:
If there are no entries in the list of active formatting elements , then there is nothing to reconstruct; stop this algorithm.
If the last (most recently added) entry in the list of active formatting elements is a marker , or if it is an element that is in the stack of open elements , then there is nothing to reconstruct; stop this algorithm.
Let entry be the last (most recently added) element in the list of active formatting elements .
Rewind : If there are no entries before entry in the list of active formatting elements , then jump to the step labeled create .
Let entry be the entry one earlier than entry in the list of active formatting elements .
If entry is neither a marker nor an element that is also in the stack of open elements , go to the step labeled rewind .
Advance : Let entry be the element one later than entry in the list of active formatting elements .
Create : Insert an HTML element for the token for which the element entry was created, to obtain new element .
Replace the entry for entry in the list with an entry for new element .
If the entry for new element in the list of active formatting elements is not the last entry in the list, return to the step labeled advance .
This has the effect of reopening all the formatting elements that were opened in the current body, cell, or caption (whichever is youngest) that haven't been explicitly closed.
The way this specification is written, the list of active formatting elements always consists of elements in chronological order with the least recently added element first and the most recently added element last (except for while steps 7 to 10 of the above algorithm are being executed, of course).
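A compact Python sketch of the reconstruct steps follows. It assumes a hypothetical `insert_html_element` callback that creates a new element for a token, pushes it onto the stack of open elements, and returns it. The `Element` class and `MARKER` sentinel are illustrative only.

```python
from typing import Callable, List

MARKER = object()  # hypothetical sentinel standing in for a marker entry


class Element:
    """Hypothetical minimal element that remembers its creating token."""

    def __init__(self, token: str) -> None:
        self.token = token


def reconstruct_active_formatting(
    afe: List[object],
    open_elements: List[Element],
    insert_html_element: Callable[[str], Element],
) -> None:
    def is_boundary(entry: object) -> bool:
        # A marker, or an element that is still on the stack of open elements.
        return entry is MARKER or any(entry is e for e in open_elements)

    if not afe or is_boundary(afe[-1]):
        return  # nothing to reconstruct
    i = len(afe) - 1
    # Rewind: step back until a boundary entry or the start of the list.
    while i > 0:
        i -= 1
        if is_boundary(afe[i]):
            i += 1  # Advance: the first entry to recreate is one later
            break
    # Create: recreate every entry from i to the end of the list.
    while i < len(afe):
        afe[i] = insert_html_element(afe[i].token)
        i += 1
```

The rewind and advance labels collapse into one backwards scan, and the create/advance loop becomes a forward pass that replaces each stale entry with its freshly inserted element.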
When the steps below require the UA to clear the list of active formatting elements up to the last marker , the UA must perform the following steps:
Let entry be the last (most recently added) entry in the list of active formatting elements .
Remove entry from the list of active formatting elements .
If entry was a marker , then stop the algorithm at this point. The list has been cleared up to the last marker .
Go to step 1.
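Clearing the list up to the last marker amounts to popping entries until a marker has been removed. In the Python sketch below, `marker` is whatever sentinel the caller uses for marker entries (an assumption of this example). The empty-list guard is an addition for safety. The prose steps do not spell it out.

```python
from typing import List


def clear_to_last_marker(afe: List[object], marker: object) -> None:
    """Pop entries until a marker has been removed (or the list is empty)."""
    while afe:
        entry = afe.pop()
        if entry is marker:
            break
```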
Initially, the head element pointer and the form element pointer are both null.
Once a head element has been parsed (whether implicitly or explicitly) the head element pointer gets set to point to this node.
The form element pointer points to the last form element that was opened and whose end tag has not yet been seen. It is used to make form controls associate with forms in the face of dramatically bad markup, for historical reasons. It is ignored inside template elements.
The scripting flag is set to "enabled" if scripting was enabled for the Document with which the parser is associated when the parser was created, and "disabled" otherwise.
The scripting flag can be enabled even when the parser was created as part of the HTML fragment parsing algorithm, even though script elements don't execute in that case.
The frameset-ok flag is set to "ok" when the parser is created. It is set to "not ok" after certain tokens are seen.
Implementations must act as if they used the following state machine to tokenize HTML. The state machine must start in the data state. Most states consume a single character, which may have various side-effects, and either switch the state machine to a new state to reconsume the current input character, or switch it to a new state to consume the next character, or stay in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.
When a state says to reconsume a matched character in a specified state, that means to switch to that state, but when it attempts to consume the next input character , provide it with the current input character instead.
The exact behavior of certain states depends on the insertion mode and the stack of open elements . Certain states also use a temporary buffer to track progress, and the character reference state uses a return state to return to the state it was invoked from.
The output of the tokenization step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public identifier, a system identifier, and a force-quirks flag . When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the force-quirks flag must be set to off (its other state is on ). Start and end tag tokens have a tag name, a self-closing flag , and a list of attributes, each of which has a name and a value. When a start or end tag token is created, its self-closing flag must be unset (its other state is that it be set), and its attributes list must be empty. Comment and character tokens have data.
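One possible way to model these tokens is with Python dataclasses, as sketched below. The class and field names are assumptions chosen for this example. The "missing" state of DOCTYPE fields is modeled as `None`, which keeps it distinct from the empty string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class DoctypeToken:
    # None models "missing", which is distinct from the empty string.
    name: Optional[str] = None
    public_identifier: Optional[str] = None
    system_identifier: Optional[str] = None
    force_quirks: bool = False  # False models "off"


@dataclass
class TagToken:
    tag_name: str = ""
    self_closing: bool = False  # unset on creation
    attributes: List[Attribute] = field(default_factory=list)  # empty on creation


@dataclass
class StartTagToken(TagToken):
    pass


@dataclass
class EndTagToken(TagToken):
    pass


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str = ""


@dataclass
class EndOfFileToken:
    pass
```

Using `default_factory=list` gives every tag token its own attribute list, so attributes appended to one token never appear on another.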
When a token is emitted, it must immediately be handled by the tree construction stage. The tree construction stage can affect the state of the tokenization stage, and can insert additional characters into the stream. (For example, the script element can result in scripts executing and using the dynamic markup insertion APIs to insert characters into the stream being tokenized.)
Creating a token and emitting it are distinct actions. It is possible for a token to be created but implicitly abandoned (never emitted), e.g. if the file ends unexpectedly while processing the characters that are being parsed into a start tag token.
When a start tag token is emitted with its self-closing flag set, if the flag is not acknowledged when it is processed by the tree construction stage, that is a non-void-html-element-start-tag-with-trailing-solidus parse error .
When an end tag token is emitted with attributes, that is an end-tag-with-attributes parse error .
When an end tag token is emitted with its self-closing flag set, that is an end-tag-with-trailing-solidus parse error .
An appropriate end tag token is an end tag token whose tag name matches the tag name of the last start tag to have been emitted from this tokenizer, if any. If no start tag has been emitted from this tokenizer, then no end tag token is appropriate.
A character reference is said to be consumed as part of an attribute if the return state is either attribute value (double-quoted) state , attribute value (single-quoted) state , or attribute value (unquoted) state .
When a state says to flush code points consumed as a character reference, it means that for each code point in the temporary buffer (in the order they were added to the buffer) the user agent must append the code point from the buffer to the current attribute's value if the character reference was consumed as part of an attribute, or emit the code point as a character token otherwise.
Before each step of the tokenizer, the user agent must first check the parser pause flag . If it is true, then the tokenizer must abort the processing of any nested invocations of the tokenizer, yielding control back to the caller.
The tokenizer state machine consists of the states defined in the following subsections.
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
script", then switch to the script data double escaped state. Otherwise, switch to the script data escaped state. Emit the current input character as a character token.
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
script", then switch to the script data escaped state. Otherwise, switch to the script data double escaped state. Emit the current input character as a character token.
Consume the next input character :
Consume the next input character :
When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a duplicate-attribute parse error and the new attribute must be removed from the token.
If an attribute is so removed from a token, it, and the value that gets associated with it, if any, are never subsequently used by the parser, and are therefore effectively discarded. Removing the attribute in this way does not change its status as the "current attribute" for the purposes of the tokenizer, however.
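The duplicate-attribute rule can be illustrated with a short Python sketch. The `Attribute` class, the `errors` list, and the function name are hypothetical. The point shown is that the first attribute with a given name wins, while the removed attribute object can still receive a value that is never used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attribute:
    name: str
    value: str = ""


def leave_attribute_name_state(
    attributes: List[Attribute], current: Attribute, errors: List[str]
) -> None:
    """Drop `current` from the token if an earlier attribute has the same name."""
    if any(a is not current and a.name == current.name for a in attributes):
        errors.append("duplicate-attribute")
        # Remove only the new attribute; the earlier one is kept.
        attributes[:] = [a for a in attributes if a is not current]
```

After removal, the tokenizer can keep writing into `current` as its "current attribute". Since the object is no longer in the token's list, that value is effectively discarded.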
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
If the next few characters are:
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
If the six characters starting from the current input character are an ASCII case-insensitive match for the word "PUBLIC", then consume those characters and switch to the after DOCTYPE public keyword state .
Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the after DOCTYPE system keyword state .
Otherwise, this is an invalid-character-sequence-after-doctype-name parse error . Set the current DOCTYPE token's force-quirks flag to on . Reconsume in the bogus DOCTYPE state .
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
U+0000 NULL characters are handled in the tree construction stage, as part of the in foreign content insertion mode, which is the only place where CDATA sections can appear.
Consume the next input character :
Consume the next input character :
Set the temporary buffer to the empty string. Append a U+0026 AMPERSAND (&) character to the temporary buffer . Consume the next input character :
Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table. Append each character to the temporary buffer when it's consumed.
If the character reference was consumed as part of an attribute , and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric , then, for historical reasons, flush code points consumed as a character reference and switch to the return state .
Otherwise:
If the last character matched is not a U+003B SEMICOLON character (;), then this is a missing-semicolon-after-character-reference parse error .
Set the temporary buffer to the empty string. Append one or two characters corresponding to the character reference name (as given by the second column of the named character references table) to the temporary buffer .
If the markup contains (not in an attribute) the string I'm &notit; I tell you, the character reference is parsed as "not", as in, I'm ¬it; I tell you (and this is a parse error). But if the markup was I'm &notin; I tell you, the character reference would be parsed as "notin;", resulting in I'm ∉ I tell you (and no parse error). However, if the markup contains the string I'm &notit; I tell you in an attribute, no character reference is parsed and the string remains intact (and there is no parse error).
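The longest-match behavior and the attribute exception described above can be tried out with the Python sketch below. It uses the standard library's `html.entities.html5` table as a stand-in for the named character references table. The function name and the returned error count are assumptions of this example, and numeric references and the ambiguous ampersand state are not modeled.

```python
from html.entities import html5  # e.g. "notin;" -> "∉", legacy "not" -> "¬"
from typing import Tuple

MAX_NAME_LENGTH = max(len(name) for name in html5)


def expand_named_references(text: str, in_attribute: bool = False) -> Tuple[str, int]:
    """Return (expanded text, number of missing-semicolon parse errors)."""
    out = []
    errors = 0
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        rest = text[i + 1 : i + 1 + MAX_NAME_LENGTH]
        # Consume the maximum number of characters that form a known name.
        match = next(
            (rest[:n] for n in range(len(rest), 0, -1) if rest[:n] in html5), None
        )
        if match is None:
            out.append("&")
            i += 1
            continue
        following = text[i + 1 + len(match) : i + 2 + len(match)]
        historical = (
            in_attribute
            and not match.endswith(";")
            and (following == "=" or (following.isascii() and following.isalnum()))
        )
        if historical:
            out.append("&" + match)  # flushed unchanged, no parse error
        else:
            if not match.endswith(";"):
                errors += 1  # missing-semicolon-after-character-reference
            out.append(html5[match])
        i += 1 + len(match)
    return "".join(out), errors
```

Running it on the three strings from the example shows one parse error for the first, none for the second, and an unchanged string in the attribute case.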
Consume the next input character :
Set the character reference code to zero (0).
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Consume the next input character :
Check the character reference code :
If the number is 0x00, then this is a null-character-reference parse error . Set the character reference code to 0xFFFD.
If the number is greater than 0x10FFFF, then this is a character-reference-outside-unicode-range parse error . Set the character reference code to 0xFFFD.
If the number is a surrogate , then this is a surrogate-character-reference parse error . Set the character reference code to 0xFFFD.
If the number is a noncharacter , then this is a noncharacter-character-reference parse error .
If the number is 0x0D, or a control that's not ASCII whitespace , then this is a control-character-reference parse error . If the number is one of the numbers in the first column of the following table, then find the row with that number in the first column, and set the character reference code to the number in the second column of that row.
Number | Code point | Character name |
---|---|---|
0x80 | 0x20AC | EURO SIGN (€) |
0x82 | 0x201A | SINGLE LOW-9 QUOTATION MARK (‚) |
0x83 | 0x0192 | LATIN SMALL LETTER F WITH HOOK (ƒ) |
0x84 | 0x201E | DOUBLE LOW-9 QUOTATION MARK („) |
0x85 | 0x2026 | HORIZONTAL ELLIPSIS (…) |
0x86 | 0x2020 | DAGGER (†) |
0x87 | 0x2021 | DOUBLE DAGGER (‡) |
0x88 | 0x02C6 | MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ) |
0x89 | 0x2030 | PER MILLE SIGN (‰) |
0x8A | 0x0160 | LATIN CAPITAL LETTER S WITH CARON (Š) |
0x8B | 0x2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹) |
0x8C | 0x0152 | LATIN CAPITAL LIGATURE OE (Œ) |
0x8E | 0x017D | LATIN CAPITAL LETTER Z WITH CARON (Ž) |
0x91 | 0x2018 | LEFT SINGLE QUOTATION MARK (‘) |
0x92 | 0x2019 | RIGHT SINGLE QUOTATION MARK (’) |
0x93 | 0x201C | LEFT DOUBLE QUOTATION MARK (“) |
0x94 | 0x201D | RIGHT DOUBLE QUOTATION MARK (”) |
0x95 | 0x2022 | BULLET (•) |
0x96 | 0x2013 | EN DASH (–) |
0x97 | 0x2014 | EM DASH (—) |
0x98 | 0x02DC | SMALL TILDE (˜) |
0x99 | 0x2122 | TRADE MARK SIGN (™) |
0x9A | 0x0161 | LATIN SMALL LETTER S WITH CARON (š) |
0x9B | 0x203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›) |
0x9C | 0x0153 | LATIN SMALL LIGATURE OE (œ) |
0x9E | 0x017E | LATIN SMALL LETTER Z WITH CARON (ž) |
0x9F | 0x0178 | LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ) |
Set the temporary buffer to the empty string. Append a code point equal to the character reference code to the temporary buffer . Flush code points consumed as a character reference . Switch to the return state .
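The checks on the character reference code can be sketched in Python as follows. This is a minimal illustration, not a normative implementation: the function name `check_character_reference_code` and the helper names are hypothetical, the ranges used for "noncharacter", "control", and "ASCII whitespace" are assumptions taken from the usual definitions of those terms, and the remapping table is copied from the table above.

```python
# Remapping table from the section above (first column -> second column).
C1_REMAP = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
    0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}

# Assumed definition: tab, line feed, form feed, carriage return, space.
ASCII_WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def is_noncharacter(n):
    # Assumed definition: U+FDD0..U+FDEF, or any code point ending in FFFE/FFFF.
    return 0xFDD0 <= n <= 0xFDEF or (n & 0xFFFE) == 0xFFFE


def is_control(n):
    # Assumed definition: C0 controls, or U+007F..U+009F.
    return n <= 0x1F or 0x7F <= n <= 0x9F


def check_character_reference_code(n):
    """Return (resulting code point, parse error name or None)."""
    if n == 0x00:
        return 0xFFFD, "null-character-reference"
    if n > 0x10FFFF:
        return 0xFFFD, "character-reference-outside-unicode-range"
    if 0xD800 <= n <= 0xDFFF:
        return 0xFFFD, "surrogate-character-reference"
    if is_noncharacter(n):
        # Reported as an error, but the code point is kept.
        return n, "noncharacter-character-reference"
    if n == 0x0D or (is_control(n) and n not in ASCII_WHITESPACE):
        # Every table entry is a control, so the remap can live in this branch.
        return C1_REMAP.get(n, n), "control-character-reference"
    return n, None
```

For instance, `&#x80;` would be reported as a control-character-reference parse error and still produce U+20AC under this sketch.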
The input to the tree construction stage is a sequence of tokens from the tokenization stage. The tree construction stage is associated with a DOM Document object when a parser is created. The "output" of this stage consists of dynamically modifying or extending that document's DOM tree.
This specification does not define when an interactive user agent has to render the Document so that it is available to the user, or when it has to begin accepting user input.
As each token is emitted from the tokenizer, the user agent must follow the appropriate steps from the following list, known as the tree construction dispatcher :
annotation-xml element and the token is a start tag whose tag name is "svg"
The next token is the token that is about to be processed by the tree construction dispatcher (even if the token is subsequently just ignored).
A node is a MathML text integration point if it is one of the following elements:
A MathML mi element
A MathML mo element
A MathML mn element
A MathML ms element
A MathML mtext element
A node is an HTML integration point if it is one of the following elements:
A MathML annotation-xml element whose start tag token had an attribute with the name "encoding" whose value was an ASCII case-insensitive match for the string "text/html"
A MathML annotation-xml element whose start tag token had an attribute with the name "encoding" whose value was an ASCII case-insensitive match for the string "application/xhtml+xml"
An SVG foreignObject element
An SVG desc element
An SVG title element
If the node in question is the context element passed to the HTML fragment parsing algorithm, then the start tag token for that element is the "fake" token created during that HTML fragment parsing algorithm.
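The two integration point definitions above can be read as simple predicates. The sketch below assumes that mi, mo, mn, ms, mtext, and annotation-xml are in the MathML namespace and that foreignObject, desc, and title are in the SVG namespace. The `Element` class, its field names, and the namespace constants are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
SVG_NS = "http://www.w3.org/2000/svg"


@dataclass
class Element:
    namespace: str
    local_name: str
    # Attributes as they appeared on the start tag token (name -> value).
    start_tag_attributes: dict = field(default_factory=dict)


def ascii_lower(s):
    # Lowercase only A-Z, as an ASCII case-insensitive comparison requires.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def is_mathml_text_integration_point(node):
    return (node.namespace == MATHML_NS
            and node.local_name in {"mi", "mo", "mn", "ms", "mtext"})


def is_html_integration_point(node):
    if node.namespace == MATHML_NS and node.local_name == "annotation-xml":
        encoding = node.start_tag_attributes.get("encoding")
        return (encoding is not None
                and ascii_lower(encoding) in {"text/html", "application/xhtml+xml"})
    return (node.namespace == SVG_NS
            and node.local_name in {"foreignObject", "desc", "title"})
```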
Not all of the tag names mentioned below are conformant tag names in this specification; many are included to handle legacy content. They still form part of the algorithm that implementations are required to implement to claim conformance.
The algorithm described below places no limit on the depth of the DOM tree generated, or on the length of tag names, attribute names, attribute values, Text nodes, etc. While implementers are encouraged to avoid arbitrary limits, it is recognized that practical concerns will likely force user agents to impose nesting depth constraints.
While the parser is processing a token, it can enable or disable foster parenting . This affects the following algorithm.
The appropriate place for inserting a node , optionally using a particular override target , is the position in an element returned by running the following steps:
If there was an override target specified, then let target be the override target .
Otherwise, let target be the current node .
Determine the adjusted insertion location using the first matching steps from the following list:
If foster parenting is enabled and target is a table, tbody, tfoot, thead, or tr element
Foster parenting happens when content is misnested in tables.
Run these substeps:
Let last template be the last template element in the stack of open elements, if any.
Let last table be the last table element in the stack of open elements, if any.
If there is a last template and either there is no last table , or there is one, but last template is lower (more recently added) than last table in the stack of open elements , then: let adjusted insertion location be inside last template 's template contents , after its last child (if any), and abort these steps.
If there is no last table, then let adjusted insertion location be inside the first element in the stack of open elements (the html element), after its last child (if any), and abort these steps. (fragment case)
If last table has a parent node, then let adjusted insertion location be inside last table 's parent node, immediately before last table , and abort these steps.
Let previous element be the element immediately above last table in the stack of open elements .
Let adjusted insertion location be inside previous element , after its last child (if any).
These steps are involved in part because it's possible for elements, the table element in this case in particular, to have been moved by a script around in the DOM, or indeed removed from the DOM entirely, after the element was inserted by the parser.
Let adjusted insertion location be inside target , after its last child (if any).
If the adjusted insertion location is inside a template element, let it instead be inside the template element's template contents, after its last child (if any).
Return the adjusted insertion location .
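The steps for finding the appropriate place for inserting a node can be sketched over a toy tree. Everything here is an illustrative assumption: the `Node` class, the representation of the stack of open elements as a Python list (bottom first), and the convention of returning a `(parent, before)` pair, where `before` is `None` for "after the last child". The sketch also assumes the foster parenting branch applies only when foster parenting is enabled and the target is a table, tbody, tfoot, thead, or tr element.

```python
class Node:
    def __init__(self, name, template_contents=None):
        self.name = name
        self.parent = None
        self.children = []
        self.template_contents = template_contents  # only set for templates

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


TABLE_CONTEXT = {"table", "tbody", "tfoot", "thead", "tr"}


def last_index(stack, name):
    """Index of the last (most recently added) element with this name."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i].name == name:
            return i
    return None


def foster_parent_location(stack):
    template_i = last_index(stack, "template")
    table_i = last_index(stack, "table")
    if template_i is not None and (table_i is None or template_i > table_i):
        return stack[template_i].template_contents, None
    if table_i is None:
        return stack[0], None            # fragment case: the html element
    table = stack[table_i]
    if table.parent is not None:
        return table.parent, table       # immediately before the table
    return stack[table_i - 1], None      # table was removed from the DOM


def appropriate_place(stack, foster_parenting=False, override_target=None):
    target = override_target if override_target is not None else stack[-1]
    if foster_parenting and target.name in TABLE_CONTEXT:
        parent, before = foster_parent_location(stack)
    else:
        parent, before = target, None
    if parent.name == "template" and parent.template_contents is not None:
        parent, before = parent.template_contents, None
    return parent, before
```

With a stack of html, body, table and foster parenting enabled, this sketch yields the position in body immediately before the table; if a script has detached the table, it falls back to the end of the element above the table in the stack.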
When the steps below require the UA to create an element for a token in a particular given namespace and with a particular intended parent , the UA must run the following steps:
If the active speculative HTML parser is not null, then return the result of creating a speculative mock element given given namespace , the tag name of the given token, and the attributes of the given token.
Otherwise, optionally create a speculative mock element given given namespace , the tag name of the given token, and the attributes of the given token.
The result is not used. This step allows for a speculative fetch to be initiated from non-speculative parsing. The fetch is still speculative at this point, because, for example, by the time the element is inserted, intended parent might have been removed from the document.
Let document be intended parent 's node document .
Let local name be the tag name of the token.
Let is be the value of the "is" attribute in the given token, if such an attribute exists, or null otherwise.
Let definition be the result of looking up a custom element definition given document , given namespace , local name , and is .
Let willExecuteScript be true if definition is non-null and the parser was not created as part of the HTML fragment parsing algorithm ; otherwise false.
If willExecuteScript is true:
Increment document 's throw-on-dynamic-markup-insertion counter .
If the JavaScript execution context stack is empty, then perform a microtask checkpoint .
Push a new element queue onto document 's relevant agent 's custom element reactions stack .
Let element be the result of creating an element given document, local name, given namespace, null, is, and willExecuteScript.
This will cause custom element constructors to run, if willExecuteScript is true. However, since we incremented the throw-on-dynamic-markup-insertion counter , this cannot cause new characters to be inserted into the tokenizer , or the document to be blown away .
Append each attribute in the given token to element .
This can enqueue a custom element callback reaction for the attributeChangedCallback, which might run immediately (in the next step).
Even though the is attribute governs the creation of a customized built-in element, it is not present during the execution of the relevant custom element constructor; it is appended in this step, along with all other attributes.
If willExecuteScript is true:
Let queue be the result of popping from document 's relevant agent 's custom element reactions stack . (This will be the same element queue as was pushed above.)
Invoke custom element reactions in queue .
Decrement document 's throw-on-dynamic-markup-insertion counter .
If element has an xmlns attribute in the XMLNS namespace whose value is not exactly the same as the element's namespace, that is a parse error. Similarly, if element has an xmlns:xlink attribute in the XMLNS namespace whose value is not the XLink Namespace, that is a parse error.
If element is a resettable element , invoke its reset algorithm . (This initializes the element's value and checkedness based on the element's attributes.)
If element is a form-associated element and not a form-associated custom element, the form element pointer is not null, there is no template element on the stack of open elements, element is either not listed or doesn't have a form attribute, and the intended parent is in the same tree as the element pointed to by the form element pointer, then associate element with the form element pointed to by the form element pointer and set element's parser inserted flag.
Return element .
To insert an element at the adjusted insertion location with an element element :
Let the adjusted insertion location be the appropriate place for inserting a node .
If it is not possible to insert element at the adjusted insertion location , abort these steps.
If the parser was not created as part of the HTML fragment parsing algorithm , then push a new element queue onto element 's relevant agent 's custom element reactions stack .
Insert element at the adjusted insertion location .
If the parser was not created as part of the HTML fragment parsing algorithm , then pop the element queue from element 's relevant agent 's custom element reactions stack , and invoke custom element reactions in that queue.
If the adjusted insertion location cannot accept more elements, e.g., because it's a Document that already has an element child, then element is dropped on the floor.
When the steps below require the user agent to insert a foreign element for a token in a given namespace and with a boolean onlyAddToElementStack , the user agent must run these steps:
Let the adjusted insertion location be the appropriate place for inserting a node .
Let element be the result of creating an element for the token in the given namespace, with the intended parent being the element in which the adjusted insertion location finds itself.
If onlyAddToElementStack is false, then run insert an element at the adjusted insertion location with element .
Push element onto the stack of open elements so that it is the new current node .
Return element .
When the steps below require the user agent to insert an HTML element for a token, the user agent must insert a foreign element for the token, with the HTML namespace and false.
When the steps below require the user agent to adjust MathML attributes for a token, then, if the token has an attribute named definitionurl, change its name to definitionURL (note the case difference).
When the steps below require the user agent to adjust SVG attributes for a token, then, for each attribute on the token whose attribute name is one of the ones in the first column of the following table, change the attribute's name to the name given in the corresponding cell in the second column. (This fixes the case of SVG attributes that are not all lowercase.)
Attribute name on token | Attribute name on element |
---|---|
attributename | attributeName |
attributetype | attributeType |
basefrequency | baseFrequency |
baseprofile | baseProfile |
calcmode | calcMode |
clippathunits | clipPathUnits |
diffuseconstant | diffuseConstant |
edgemode | edgeMode |
filterunits | filterUnits |
glyphref | glyphRef |
gradienttransform | gradientTransform |
gradientunits | gradientUnits |
kernelmatrix | kernelMatrix |
kernelunitlength | kernelUnitLength |
keypoints | keyPoints |
keysplines | keySplines |
keytimes | keyTimes |
lengthadjust | lengthAdjust |
limitingconeangle | limitingConeAngle |
markerheight | markerHeight |
markerunits | markerUnits |
markerwidth | markerWidth |
maskcontentunits | maskContentUnits |
maskunits | maskUnits |
numoctaves | numOctaves |
pathlength | pathLength |
patterncontentunits | patternContentUnits |
patterntransform | patternTransform |
patternunits | patternUnits |
pointsatx | pointsAtX |
pointsaty | pointsAtY |
pointsatz | pointsAtZ |
preservealpha | preserveAlpha |
preserveaspectratio | preserveAspectRatio |
primitiveunits | primitiveUnits |
refx | refX |
refy | refY |
repeatcount | repeatCount |
repeatdur | repeatDur |
requiredextensions | requiredExtensions |
requiredfeatures | requiredFeatures |
specularconstant | specularConstant |
specularexponent | specularExponent |
spreadmethod | spreadMethod |
startoffset | startOffset |
stddeviation | stdDeviation |
stitchtiles | stitchTiles |
surfacescale | surfaceScale |
systemlanguage | systemLanguage |
tablevalues | tableValues |
targetx | targetX |
targety | targetY |
textlength | textLength |
viewbox | viewBox |
viewtarget | viewTarget |
xchannelselector | xChannelSelector |
ychannelselector | yChannelSelector |
zoomandpan | zoomAndPan |
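The MathML and SVG attribute adjustments amount to a case-restoring rename. The sketch below is an illustration under stated assumptions: token attributes are modeled as a plain dict of name to value, the function names are hypothetical, and `SVG_ATTRIBUTE_CASE` deliberately holds only a handful of rows from the table above rather than the full list.

```python
# A small subset of the SVG table above, keyed by the lowercase token name.
SVG_ATTRIBUTE_CASE = {
    name.lower(): name
    for name in ("attributeName", "preserveAspectRatio", "refX", "refY",
                 "stdDeviation", "viewBox")
}


def adjust_mathml_attributes(attributes):
    """Rename definitionurl to definitionURL; leave everything else alone."""
    return {("definitionURL" if name == "definitionurl" else name): value
            for name, value in attributes.items()}


def adjust_svg_attributes(attributes):
    """Restore the mixed-case spelling of known SVG attribute names."""
    return {SVG_ATTRIBUTE_CASE.get(name, name): value
            for name, value in attributes.items()}
```

The lowercase keys reflect that the token's attribute names are already lowercase by the time these adjustments run, which is why the table's first column is all lowercase.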
When the steps below require the user agent to adjust foreign attributes for a token, then, if any of the attributes on the token match the strings given in the first column of the following table, let the attribute be a namespaced attribute, with the prefix being the string given in the corresponding cell in the second column, the local name being the string given in the corresponding cell in the third column, and the namespace being the namespace given in the corresponding cell in the fourth column. (This fixes the use of namespaced attributes, in particular lang attributes in the XML namespace.)
Attribute name | Prefix | Local name | Namespace |
---|---|---|---|
xlink:actuate | xlink | actuate | XLink namespace |
xlink:arcrole | xlink | arcrole | XLink namespace |
xlink:href | xlink | href | XLink namespace |
xlink:role | xlink | role | XLink namespace |
xlink:show | xlink | show | XLink namespace |
xlink:title | xlink | title | XLink namespace |
xlink:type | xlink | type | XLink namespace |
xml:lang | xml | lang | XML namespace |
xml:space | xml | space | XML namespace |
xmlns | (none) | xmlns | XMLNS namespace |
xmlns:xlink | xmlns | xlink | XMLNS namespace |
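The foreign attribute adjustment is a table lookup that turns a flat attribute name into a (prefix, local name, namespace) triple. In the sketch below the dict-based token model and the function name are hypothetical, and the namespace URI constants are not given in this section; they are filled in from the commonly used values and should be treated as assumptions here.

```python
XLINK_NS = "http://www.w3.org/1999/xlink"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"

# Attribute name on the token -> (prefix, local name, namespace).
FOREIGN_ATTRIBUTES = {
    f"xlink:{local}": ("xlink", local, XLINK_NS)
    for local in ("actuate", "arcrole", "href", "role", "show", "title", "type")
}
FOREIGN_ATTRIBUTES.update({
    "xml:lang": ("xml", "lang", XML_NS),
    "xml:space": ("xml", "space", XML_NS),
    "xmlns": (None, "xmlns", XMLNS_NS),
    "xmlns:xlink": ("xmlns", "xlink", XMLNS_NS),
})


def adjust_foreign_attributes(attributes):
    """Return a list of (prefix, local_name, namespace, value) tuples.

    Names not in the table stay un-namespaced: prefix and namespace are None.
    """
    adjusted = []
    for name, value in attributes.items():
        prefix, local, namespace = FOREIGN_ATTRIBUTES.get(name, (None, name, None))
        adjusted.append((prefix, local, namespace, value))
    return adjusted
```

Note that only exact matches are namespaced: an attribute such as `xlink:foo` is not in the table and keeps its colon as part of an ordinary name in this sketch.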
When the steps below require the user agent to insert a character while processing a token, the user agent must run the following steps:
Let data be the characters passed to the algorithm, or, if no characters were explicitly specified, the character of the character token being processed.
Let the adjusted insertion location be the appropriate place for inserting a node .
If the adjusted insertion location is in a Document node, then return.
The DOM will not let Document nodes have Text node children, so they are dropped on the floor.
If there is a Text node immediately before the adjusted insertion location, then append data to that Text node's data.
Otherwise, create a new Text node whose data is data and whose node document is the same as that of the element in which the adjusted insertion location finds itself, and insert the newly created node at the adjusted insertion location.
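The character insertion steps above can be sketched with a toy DOM. The `Text`, `Element`, and `Document` classes and the `insert_character` signature are hypothetical; the adjusted insertion location is modeled as a parent plus a child index, which is an assumption of this sketch rather than anything the text prescribes.

```python
class Text:
    def __init__(self, data):
        self.data = data


class Element:
    def __init__(self, name):
        self.name = name
        self.children = []


class Document(Element):
    def __init__(self):
        super().__init__("#document")


def insert_character(parent, index, data):
    """Insert data at child position `index` of `parent`."""
    if isinstance(parent, Document):
        return  # Document nodes cannot have Text children; drop the data.
    previous = parent.children[index - 1] if index > 0 else None
    if isinstance(previous, Text):
        previous.data += data  # merge into the preceding Text node
    else:
        parent.children.insert(index, Text(data))
```

Merging into the preceding Text node is what keeps consecutive character tokens from producing one Text node each.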
Here are some sample inputs to the parser and the corresponding number of Text nodes that they result in, assuming a user agent that executes scripts.
Input | Number of Text nodes |
---|---|
| | One Text node in the document, containing "AB". |
| | Three Text nodes; "A" before the script, the script's contents, and "BC" after the script (the parser appends to the Text node created by the script). |
| | Two adjacent Text nodes in the document, containing "A" and "BC". |
| | One Text node before the table, containing "ABCD". (This is caused by foster parenting.) |
| | One Text node before the table, containing "A B C" (A-space-B-space-C). (This is caused by foster parenting.) |
| | One Text node before the table, containing "A BC" (A-space-B-C), and one Text node inside the table (as a child of a tbody) with a single space character. (Space characters separated from non-space characters by non-character tokens are not affected by foster parenting, even if those other tokens then get ignored.) |
When the steps below require the user agent to insert a comment while processing a comment token, optionally with an explicit insertion position position, the user agent must run the following steps:
Let data be the data given in the comment token being processed.
If position was specified, then let the adjusted insertion location be position . Otherwise, let adjusted insertion location be the appropriate place for inserting a node .
Create a Comment node whose data attribute is set to data and whose node document is the same as that of the node in which the adjusted insertion location finds itself.
Insert the newly created node at the adjusted insertion location .
The generic raw text element parsing algorithm and the generic RCDATA element parsing algorithm consist of the following steps. These algorithms are always invoked in response to a start tag token.
Insert an HTML element for the token.
If the algorithm that was invoked is the generic raw text element parsing algorithm , switch the tokenizer to the RAWTEXT state ; otherwise the algorithm invoked was the generic RCDATA element parsing algorithm , switch the tokenizer to the RCDATA state .
Let the original insertion mode be the current insertion mode .
Then, switch the insertion mode to " text ".
When the steps below require the UA to generate implied end tags, then, while the current node is a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, or an rtc element, the UA must pop the current node off the stack of open elements.
If a step requires the UA to generate implied end tags but lists an element to exclude from the process, then the UA must perform the above steps as if that element was not in the above list.
When the steps below require the UA to generate all implied end tags thoroughly, then, while the current node is a caption element, a colgroup element, a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, or a tr element, the UA must pop the current node off the stack of open elements.
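Both implied end tag procedures are a "pop while the current node is in a set" loop. In the sketch below the stack of open elements is modeled as a list of tag-name strings with the current node last; that model and the function names are assumptions made for illustration.

```python
IMPLIED_END_TAGS = frozenset(
    {"dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc"}
)
THOROUGH_END_TAGS = IMPLIED_END_TAGS | {
    "caption", "colgroup", "tbody", "td", "tfoot", "th", "thead", "tr"
}


def generate_implied_end_tags(stack, exclude=None):
    """Pop implied-end-tag elements; `exclude` names one element to skip."""
    names = IMPLIED_END_TAGS - {exclude}
    while stack and stack[-1] in names:
        stack.pop()


def generate_all_implied_end_tags_thoroughly(stack):
    while stack and stack[-1] in THOROUGH_END_TAGS:
        stack.pop()
```

For example, with the stack `["html", "body", "ul", "li", "p"]`, the first function pops `p` and then `li`, stopping at `ul`; passing `exclude="li"` would stop at `li` instead.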
A Document object has an associated parser cannot change the mode flag (a boolean). It is initially false.
When the user agent is to apply the rules for the " initial " insertion mode , the user agent must handle the token as follows:
Ignore the token.
Insert a comment as the last child of the Document object.
If the DOCTYPE token's name is not "html", or the token's public identifier is not missing, or the token's system identifier is neither missing nor "about:legacy-compat", then there is a parse error.
Append a DocumentType node to the Document node, with its name set to the name given in the DOCTYPE token, or the empty string if the name was missing; its public ID set to the public identifier given in the DOCTYPE token, or the empty string if the public identifier was missing; and its system ID set to the system identifier given in the DOCTYPE token, or the empty string if the system identifier was missing.
This also ensures that the DocumentType node is returned as the value of the doctype attribute of the Document object.
Then, if the document is not an iframe srcdoc document, and the parser cannot change the mode flag is false, and the DOCTYPE token matches one of the conditions in the following list, then set the Document to quirks mode:
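The name is not "html".
The public identifier is set to: "-//W3O//DTD W3 HTML Strict 3.0//EN//"
The public identifier is set to: "-/W3C/DTD HTML 4.0 Transitional/EN"
The public identifier is set to: "HTML"
The system identifier is set to: "http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd"
The public identifier starts with: "+//Silmaril//dtd html Pro v0r11 19970101//"
The public identifier starts with: "-//AS//DTD HTML 3.0 asWedit + extensions//"
The public identifier starts with: "-//AdvaSoft Ltd//DTD HTML 3.0 asWedit + extensions//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Level 1//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Level 2//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Strict Level 1//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Strict Level 2//"
The public identifier starts with: "-//IETF//DTD HTML 2.0 Strict//"
The public identifier starts with: "-//IETF//DTD HTML 2.0//"
The public identifier starts with: "-//IETF//DTD HTML 2.1E//"
The public identifier starts with: "-//IETF//DTD HTML 3.0//"
The public identifier starts with: "-//IETF//DTD HTML 3.2 Final//"
The public identifier starts with: "-//IETF//DTD HTML 3.2//"
The public identifier starts with: "-//IETF//DTD HTML 3//"
The public identifier starts with: "-//IETF//DTD HTML Level 0//"
The public identifier starts with: "-//IETF//DTD HTML Level 1//"
The public identifier starts with: "-//IETF//DTD HTML Level 2//"
The public identifier starts with: "-//IETF//DTD HTML Level 3//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 0//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 1//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 2//"
The public identifier starts with: "-//IETF//DTD HTML Strict Level 3//"
The public identifier starts with: "-//IETF//DTD HTML Strict//"
The public identifier starts with: "-//IETF//DTD HTML//"
The public identifier starts with: "-//Metrius//DTD Metrius Presentational//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 2.0 HTML Strict//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 2.0 HTML//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 2.0 Tables//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 3.0 HTML Strict//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 3.0 HTML//"
The public identifier starts with: "-//Microsoft//DTD Internet Explorer 3.0 Tables//"
The public identifier starts with: "-//Netscape Comm. Corp.//DTD HTML//"
The public identifier starts with: "-//Netscape Comm. Corp.//DTD Strict HTML//"
The public identifier starts with: "-//O'Reilly and Associates//DTD HTML 2.0//"
The public identifier starts with: "-//O'Reilly and Associates//DTD HTML Extended 1.0//"
The public identifier starts with: "-//O'Reilly and Associates//DTD HTML Extended Relaxed 1.0//"
The public identifier starts with: "-//SQ//DTD HTML 2.0 HoTMetaL + extensions//"
The public identifier starts with: "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//"
The public identifier starts with: "-//SoftQuad//DTD HoTMetaL PRO 4.0::19971010::extensions to HTML 4.0//"
The public identifier starts with: "-//Spyglass//DTD HTML 2.0 Extended//"
The public identifier starts with: "-//Sun Microsystems Corp.//DTD HotJava HTML//"
The public identifier starts with: "-//Sun Microsystems Corp.//DTD HotJava Strict HTML//"
The public identifier starts with: "-//W3C//DTD HTML 3 1995-03-24//"
The public identifier starts with: "-//W3C//DTD HTML 3.2 Draft//"
The public identifier starts with: "-//W3C//DTD HTML 3.2 Final//"
The public identifier starts with: "-//W3C//DTD HTML 3.2//"
The public identifier starts with: "-//W3C//DTD HTML 3.2S Draft//"
The public identifier starts with: "-//W3C//DTD HTML 4.0 Frameset//"
The public identifier starts with: "-//W3C//DTD HTML 4.0 Transitional//"
The public identifier starts with: "-//W3C//DTD HTML Experimental 19960712//"
The public identifier starts with: "-//W3C//DTD HTML Experimental 970421//"
The public identifier starts with: "-//W3C//DTD W3 HTML//"
The public identifier starts with: "-//W3O//DTD W3 HTML 3.0//"
The public identifier starts with: "-//WebTechs//DTD Mozilla HTML 2.0//"
The public identifier starts with: "-//WebTechs//DTD Mozilla HTML//"
The system identifier is missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Frameset//"
The system identifier is missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Transitional//"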
Otherwise, if the document is not an iframe srcdoc document, and the parser cannot change the mode flag is false, and the DOCTYPE token matches one of the conditions in the following list, then set the Document to limited-quirks mode:
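The public identifier starts with: "-//W3C//DTD XHTML 1.0 Frameset//"
The public identifier starts with: "-//W3C//DTD XHTML 1.0 Transitional//"
The system identifier is not missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Frameset//"
The system identifier is not missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Transitional//"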
The system identifier and public identifier strings must be compared to the values given in the lists above in an ASCII case-insensitive manner. A system identifier whose value is the empty string is not considered missing for the purposes of the conditions above.
Then, switch the insertion mode to " before html ".
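The DOCTYPE handling above can be sketched as two small functions. Several things here are assumptions and are labeled as such: a "missing" identifier is modeled as `None` (an empty string counts as present), the exact-match versus prefix-match split and the role of the system identifier for the HTML 4.01 entries follow my reading of the full specification rather than anything this extract states unambiguously, the prefix tuple is a small subset of the quirks list, and any quirks trigger not visible in this section is not modeled. The function names are hypothetical.

```python
import string

_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

QUIRKS_PUBLIC_EXACT = {"-//w3o//dtd w3 html strict 3.0//en//",
                       "-/w3c/dtd html 4.0 transitional/en", "html"}
QUIRKS_SYSTEM_EXACT = {"http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd"}
# Subset only; the full list above has many more prefixes.
QUIRKS_PUBLIC_PREFIXES = ("-//ietf//dtd html 2.0//", "-//w3c//dtd html 3.2 final//",
                          "-//w3c//dtd html 4.0 transitional//")
HTML401_PREFIXES = ("-//w3c//dtd html 4.01 frameset//",
                    "-//w3c//dtd html 4.01 transitional//")
LIMITED_QUIRKS_PREFIXES = ("-//w3c//dtd xhtml 1.0 frameset//",
                           "-//w3c//dtd xhtml 1.0 transitional//")


def doctype_is_parse_error(name, public_id, system_id):
    return (name != "html" or public_id is not None
            or system_id not in (None, "about:legacy-compat"))


def document_mode(name, public_id, system_id,
                  is_srcdoc=False, cannot_change_mode=False):
    """Mode a Document would end up in, assuming it starts as no-quirks."""
    if is_srcdoc or cannot_change_mode:
        return "no-quirks"
    # Identifiers are compared ASCII case-insensitively.
    pub = (public_id or "").translate(_LOWER)
    sys_id = system_id.translate(_LOWER) if system_id is not None else None
    if (name != "html" or pub in QUIRKS_PUBLIC_EXACT
            or sys_id in QUIRKS_SYSTEM_EXACT
            or pub.startswith(QUIRKS_PUBLIC_PREFIXES)
            or (system_id is None and pub.startswith(HTML401_PREFIXES))):
        return "quirks"
    if (pub.startswith(LIMITED_QUIRKS_PREFIXES)
            or (system_id is not None and pub.startswith(HTML401_PREFIXES))):
        return "limited-quirks"
    return "no-quirks"
```

Under these assumptions, an HTML 4.01 Transitional public identifier selects quirks mode when no system identifier is given and limited-quirks mode when one is present.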
If the document is not an iframe srcdoc document, then this is a parse error; if the parser cannot change the mode flag is false, set the Document to quirks mode.
In any case, switch the insertion mode to " before html ", then reprocess the token.
When the user agent is to apply the rules for the " before html " insertion mode , the user agent must handle the token as follows:
Parse error . Ignore the token.
Insert a comment as the last child of the Document object.
Ignore the token.
Create an element for the token in the HTML namespace, with the Document as the intended parent. Append it to the Document object. Put this element in the stack of open elements.
Switch the insertion mode to " before head ".
Act as described in the "anything else" entry below.
Parse error . Ignore the token.
Create an html element whose node document is the Document object. Append it to the Document object. Put this element in the stack of open elements.
Switch the insertion mode to " before head ", then reprocess the token.
The document element can end up being removed from the Document object, e.g. by scripts; nothing in particular happens in such cases, content continues being appended to the nodes as described in the next section.
When the user agent is to apply the rules for the " before head " insertion mode , the user agent must handle the token as follows:
Ignore the token.
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
Insert an HTML element for the token.
Set the head element pointer to the newly created head element.
Switch the insertion mode to " in head ".
Act as described in the "anything else" entry below.
Parse error . Ignore the token.
Insert an HTML element for a "head" start tag token with no attributes.
Set the head element pointer to the newly created head element.
Switch the insertion mode to " in head ".
Reprocess the current token.
When the user agent is to apply the rules for the " in head " insertion mode , the user agent must handle the token as follows:
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
If the active speculative HTML parser is null, then:
If the element has a charset attribute, and getting an encoding from its value results in an encoding, and the confidence is currently tentative, then change the encoding to the resulting encoding.
Otherwise, if the element has an http-equiv attribute whose value is an ASCII case-insensitive match for the string "Content-Type", and the element has a content attribute, and applying the algorithm for extracting a character encoding from a meta element to that attribute's value returns an encoding, and the confidence is currently tentative, then change the encoding to the extracted encoding.
The speculative HTML parser doesn't speculatively apply character encoding declarations in order to reduce implementation complexity.
Follow the generic RCDATA element parsing algorithm .
Follow the generic raw text element parsing algorithm .
Insert an HTML element for the token.
Switch the insertion mode to " in head noscript ".
Run these steps:
Let the adjusted insertion location be the appropriate place for inserting a node .
Create an element for the token in the HTML namespace , with the intended parent being the element in which the adjusted insertion location finds itself.
Set the element's parser document to the Document, and set the element's force async to false.
This ensures that, if the script is external, any document.write() calls in the script will execute in-line, instead of blowing the document away, as would happen in most other cases. It also prevents the script from executing until the end tag is seen.
If the parser was created as part of the HTML fragment parsing algorithm, then set the script element's already started to true. (fragment case)
If the parser was invoked via the document.write() or document.writeln() methods, then optionally set the script element's already started to true. (For example, the user agent might use this clause to prevent execution of cross-origin scripts inserted via document.write() under slow network conditions, or when the page has already taken a long time to load.)
Insert the newly created element at the adjusted insertion location .
Push the element onto the stack of open elements so that it is the new current node .
Switch the tokenizer to the script data state .
Let the original insertion mode be the current insertion mode .
Switch the insertion mode to " text ".
Pop the current node (which will be the head element) off the stack of open elements.
Switch the insertion mode to " after head ".
Act as described in the "anything else" entry below.
Let template start tag be the start tag.
Insert a marker at the end of the list of active formatting elements .
Set the frameset-ok flag to "not ok".
Switch the insertion mode to " in template ".
Push " in template " onto the stack of template insertion modes so that it is the new current template insertion mode .
Let the adjusted insertion location be the appropriate place for inserting a node .
Let intended parent be the element in which the adjusted insertion location finds itself.
Let document be intended parent 's node document .
If any of the following are false:
shadowrootmode is not in the none state;
then insert an HTML element for the token.
Otherwise:
Let declarative shadow host element be adjusted current node .
Let template be the result of insert a foreign element for template start tag , with HTML namespace and true.
Let mode be template start tag's shadowrootmode attribute's value.
Let clonable be true if template start tag has a shadowrootclonable attribute; otherwise false.
Let serializable be true if template start tag has a shadowrootserializable attribute; otherwise false.
Let delegatesFocus be true if template start tag has a shadowrootdelegatesfocus attribute; otherwise false.
If declarative shadow host element is a shadow host , then insert an element at the adjusted insertion location with template .
Otherwise:
Attach a shadow root with declarative shadow host element, mode, clonable, serializable, delegatesFocus, and "named".
If an exception is thrown, then catch it and:
Insert an element at the adjusted insertion location with template .
The user agent may report an error to the developer console.
Return.
Let shadow be declarative shadow host element 's shadow root .
Set shadow 's declarative to true.
Set template 's template contents property to shadow .
Set shadow 's available to element internals to true.
If there is no template element on the stack of open elements, then this is a parse error; ignore the token.
Otherwise, run these steps:
If the current node is not a template element, then this is a parse error.
Pop elements from the stack of open elements until a template element has been popped from the stack.
Pop the current template insertion mode off the stack of template insertion modes .
Parse error . Ignore the token.
Pop the current node (which will be the head element) off the stack of open elements.
Switch the insertion mode to " after head ".
Reprocess the token.
When the user agent is to apply the rules for the " in head noscript " insertion mode , the user agent must handle the token as follows:
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
Pop the current node (which will be a noscript element) from the stack of open elements; the new current node will be a head element.
Switch the insertion mode to " in head ".
Process the token using the rules for the " in head " insertion mode .
Act as described in the "anything else" entry below.
Parse error . Ignore the token.
Pop the current node (which will be a noscript element) from the stack of open elements; the new current node will be a head element.
Switch the insertion mode to " in head ".
Reprocess the token.
When the user agent is to apply the rules for the " after head " insertion mode , the user agent must handle the token as follows:
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
Insert an HTML element for the token.
Set the frameset-ok flag to "not ok".
Switch the insertion mode to " in body ".
Insert an HTML element for the token.
Switch the insertion mode to " in frameset ".
Push the node pointed to by the head element pointer onto the stack of open elements.
Process the token using the rules for the " in head " insertion mode .
Remove the node pointed to by the head element pointer from the stack of open elements. (It might not be the current node at this point.)
The head element pointer cannot be null at this point.
Process the token using the rules for the " in head " insertion mode .
Act as described in the "anything else" entry below.
Parse error . Ignore the token.
Insert an HTML element for a "body" start tag token with no attributes.
Switch the insertion mode to " in body ".
Reprocess the current token.
When the user agent is to apply the rules for the " in body " insertion mode , the user agent must handle the token as follows:
Parse error . Ignore the token.
Reconstruct the active formatting elements , if any.
Insert the token's character .
Set the frameset-ok flag to "not ok".
Parse error . Ignore the token.
If there is a template element on the stack of open elements, then ignore the token.
Otherwise, for each attribute on the token, check to see if the attribute is already present on the top element of the stack of open elements . If it is not, add the attribute and its corresponding value to that element.
Process the token using the rules for the " in head " insertion mode .
If the stack of open elements has only one node on it, if the second element on the stack of open elements is not a body element, or if there is a template element on the stack of open elements, then ignore the token. (fragment case or there is a template element on the stack)
Otherwise, set the frameset-ok flag to "not ok"; then, for each attribute on the token, check to see if the attribute is already present on the body element (the second element) on the stack of open elements, and if it is not, add the attribute and its corresponding value to that element.
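The attribute merge for a stray "body" start tag can be sketched as follows. This is a minimal illustration under stated assumptions, not the specification's data model: the stack of open elements is assumed to be a Python list whose first entry is the html element, and `Element`, `merge_body_start_tag`, and the `state` dictionary are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element: a tag name plus attributes."""
    name: str
    attrs: dict = field(default_factory=dict)


def merge_body_start_tag(stack, token_attrs, state):
    """Handle a "body" start tag seen in the "in body" insertion mode.

    stack is the stack of open elements (index 0 is the html element).
    state is a hypothetical dict holding the frameset-ok flag.
    Returns True if the token was used and False if it was ignored.
    """
    if (len(stack) == 1
            or stack[1].name != "body"
            or any(el.name == "template" for el in stack)):
        # Fragment case, or a template element is on the stack.
        return False
    state["frameset_ok"] = False
    body = stack[1]
    for name, value in token_attrs.items():
        # Attributes already present win; only missing ones are added.
        body.attrs.setdefault(name, value)
    return True
```

Existing attribute values are never overwritten, which is why `setdefault` is used here rather than a plain assignment.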
If the stack of open elements has only one node on it, or if the second element on the stack of open elements is not a body element, then ignore the token. (fragment case or there is a template element on the stack)
If the frameset-ok flag is set to "not ok", ignore the token.
Otherwise, run the following steps:
Remove the second element on the stack of open elements from its parent node, if it has one.
Pop all the nodes from the bottom of the stack of open elements, from the current node up to, but not including, the root html element.
Insert an HTML element for the token.
Switch the insertion mode to " in frameset ".
If the stack of template insertion modes is not empty, then process the token using the rules for the " in template " insertion mode .
Otherwise, follow these steps:
If there is a node in the stack of open elements that is not either a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, a tr element, the body element, or the html element, then this is a parse error.
If the stack of open elements does not have a body element in scope, this is a parse error; ignore the token.
Otherwise, if there is a node in the stack of open elements that is not either a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, a tr element, the body element, or the html element, then this is a parse error.
Switch the insertion mode to " after body ".
If the stack of open elements does not have a body element in scope, this is a parse error; ignore the token.
Otherwise, if there is a node in the stack of open elements that is not either a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, an rtc element, a tbody element, a td element, a tfoot element, a th element, a thead element, a tr element, the body element, or the html element, then this is a parse error.
Switch the insertion mode to " after body ".
Reprocess the token.
If the stack of open elements has a p element in button scope, then close a p element.
Insert an HTML element for the token.
If the stack of open elements has a p element in button scope, then close a p element.
If the current node is an HTML element whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6", then this is a parse error ; pop the current node off the stack of open elements .
Insert an HTML element for the token.
If the stack of open elements has a p element in button scope, then close a p element.
Insert an HTML element for the token.
If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of pre blocks are ignored as an authoring convenience.)
Set the frameset-ok flag to "not ok".
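The newline rule above can be sketched as a small filter over the token stream. This is an illustration only: character tokens are assumed to be modelled as one-character Python strings, and `skip_leading_newline` is a hypothetical helper name.

```python
def skip_leading_newline(tokens):
    """Yield tokens, dropping the first one if it is a U+000A LF character token.

    Only the very first token is examined; later newlines are preserved.
    """
    iterator = iter(tokens)
    first = next(iterator, None)
    if first is not None and first != "\n":
        yield first
    yield from iterator
```

Only one newline is dropped, so a pre block that starts with two newlines still keeps the second.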
If the form element pointer is not null, and there is no template element on the stack of open elements, then this is a parse error; ignore the token.
Otherwise:
If the stack of open elements has a p element in button scope, then close a p element.
Insert an HTML element for the token, and, if there is no template element on the stack of open elements, set the form element pointer to point to the element created.
Run these steps:
Set the frameset-ok flag to "not ok".
Initialize node to be the current node (the bottommost node of the stack).
Loop: If node is an li element, then run these substeps:
Generate implied end tags, except for li elements.
If the current node is not an li element, then this is a parse error.
Pop elements from the stack of open elements until an li element has been popped from the stack.
Jump to the step labeled done below.
If node is in the special category, but is not an address, div, or p element, then jump to the step labeled done below.
Otherwise, set node to the previous entry in the stack of open elements and return to the step labeled loop .
Done: If the stack of open elements has a p element in button scope, then close a p element.
Finally, insert an HTML element for the token.
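The "li" start tag steps above can be sketched as follows. This is a simplified model rather than a conforming implementation: the stack of open elements is assumed to be a list of tag-name strings with the current node last, the `SPECIAL` set is an abbreviated subset of the special category, `BUTTON_SCOPE` lists only HTML-namespace scope boundaries, and the frameset-ok flag is left out. All function names are hypothetical.

```python
# Tag names whose end tags may be implied.
IMPLIED = {"dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc"}
# Assumption: abbreviated subset of the "special" category, for illustration.
SPECIAL = {"address", "blockquote", "body", "button", "div", "html", "li",
           "ol", "p", "table", "td", "th", "ul"}
# Assumption: HTML-namespace boundaries of "button scope" only.
BUTTON_SCOPE = {"applet", "button", "caption", "html", "marquee", "object",
                "table", "td", "template", "th"}


def generate_implied_end_tags(stack, except_for=None):
    """Pop implied-end-tag elements off the stack, stopping at except_for."""
    while stack and stack[-1] in IMPLIED and stack[-1] != except_for:
        stack.pop()


def has_in_scope(stack, target, boundaries):
    """Walk from the current node upward until target or a boundary is hit."""
    for name in reversed(stack):
        if name == target:
            return True
        if name in boundaries:
            return False
    return False


def pop_until(stack, target):
    """Pop elements until one named target has been popped."""
    while stack.pop() != target:
        pass


def handle_li_start_tag(stack):
    """Apply the "li" start tag steps; return the number of parse errors."""
    errors = 0
    for node in reversed(list(stack)):  # loop: start at the current node
        if node == "li":
            generate_implied_end_tags(stack, except_for="li")
            if stack[-1] != "li":
                errors += 1
            pop_until(stack, "li")
            break  # jump to "done"
        if node in SPECIAL and node not in {"address", "div", "p"}:
            break  # jump to "done"
    # Done: close a p element if one is in button scope.
    if has_in_scope(stack, "p", BUTTON_SCOPE):
        generate_implied_end_tags(stack, except_for="p")
        if stack[-1] != "p":
            errors += 1
        pop_until(stack, "p")
    stack.append("li")  # insert an HTML element for the token
    return errors
```

The exemption for address, div, and p is what lets an open li be closed by a new li even when one of those elements sits between them on the stack.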
Run these steps:
Set the frameset-ok flag to "not ok".
Initialize node to be the current node (the bottommost node of the stack).
Loop: If node is a dd element, then run these substeps:
Generate implied end tags, except for dd elements.
If the current node is not a dd element, then this is a parse error.
Pop elements from the stack of open elements until a dd element has been popped from the stack.
Jump to the step labeled done below.
If node is a dt element, then run these substeps:
Generate implied end tags, except for dt elements.
If the current node is not a dt element, then this is a parse error.
Pop elements from the stack of open elements until a dt element has been popped from the stack.
Jump to the step labeled done below.
If node is in the special category, but is not an address, div, or p element, then jump to the step labeled done below.
Otherwise, set node to the previous entry in the stack of open elements and return to the step labeled loop .
Done: If the stack of open elements has a p element in button scope, then close a p element.
Finally, insert an HTML element for the token.
If the stack of open elements has a p element in button scope, then close a p element.
Insert an HTML element for the token.
Switch the tokenizer to the PLAINTEXT state .
Once a start tag with the tag name "plaintext" has been seen, all remaining tokens will be character tokens (and a final end-of-file token) because there is no way to switch the tokenizer out of the PLAINTEXT state. However, as the tree builder remains in its existing insertion mode, it might reconstruct the active formatting elements while processing those character tokens. This means that the parser can insert other elements into the plaintext element.
If the stack of open elements has a button element in scope, then run these substeps:
Pop elements from the stack of open elements until a button element has been popped from the stack.
Insert an HTML element for the token.
Set the frameset-ok flag to "not ok".
If the stack of open elements does not have an element in scope that is an HTML element with the same tag name as that of the token, then this is a parse error ; ignore the token.
Otherwise, run these steps:
If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error .
Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.
If there is no template element on the stack of open elements, then run these substeps:
Let node be the element that the form element pointer is set to, or null if it is not set to an element.
Set the form element pointer to null.
If node is null or if the stack of open elements does not have node in scope , then this is a parse error ; return and ignore the token.
If the current node is not node , then this is a parse error .
Remove node from the stack of open elements .
If there is a template element on the stack of open elements, then run these substeps instead:
If the stack of open elements does not have a form element in scope, then this is a parse error; return and ignore the token.
If the current node is not a form element, then this is a parse error.
Pop elements from the stack of open elements until a form element has been popped from the stack.
If the stack of open elements does not have a p element in button scope, then this is a parse error; insert an HTML element for a "p" start tag token with no attributes.
If the stack of open elements does not have an li element in list item scope, then this is a parse error; ignore the token.
Otherwise, run these steps:
Generate implied end tags, except for li elements.
If the current node is not an li element, then this is a parse error.
Pop elements from the stack of open elements until an li element has been popped from the stack.
If the stack of open elements does not have an element in scope that is an HTML element with the same tag name as that of the token, then this is a parse error ; ignore the token.
Otherwise, run these steps:
Generate implied end tags , except for HTML elements with the same tag name as the token.
If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error .
Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.
If the stack of open elements does not have an element in scope that is an HTML element and whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6", then this is a parse error ; ignore the token.
Otherwise, run these steps:
If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error .
Pop elements from the stack of open elements until an HTML element whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6" has been popped from the stack.
Take a deep breath, then act as described in the "any other end tag" entry below.
If the list of active formatting elements contains an a element between the end of the list and the last marker on the list (or the start of the list if there is no marker on the list), then this is a parse error; run the adoption agency algorithm for the token, then remove that element from the list of active formatting elements and the stack of open elements if the adoption agency algorithm didn't already remove it (it might not have if the element is not in table scope).
In the non-conforming stream <a href="a">a<table><a href="b">b</table>x, the first a element would be closed upon seeing the second one, and the "x" character would be inside a link to "b", not to "a". This is despite the fact that the outer a element is not in table scope (meaning that a regular </a> end tag at the start of the table wouldn't close the outer a element). The result is that the two a elements are indirectly nested inside each other; non-conforming markup will often result in non-conforming DOMs when parsed.
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token. Push onto the list of active formatting elements that element.
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token. Push onto the list of active formatting elements that element.
Reconstruct the active formatting elements , if any.
If the stack of open elements has a nobr element in scope, then this is a parse error; run the adoption agency algorithm for the token, then once again reconstruct the active formatting elements, if any.
Insert an HTML element for the token. Push onto the list of active formatting elements that element.
Run the adoption agency algorithm for the token.
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token.
Insert a marker at the end of the list of active formatting elements .
Set the frameset-ok flag to "not ok".
If the stack of open elements does not have an element in scope that is an HTML element with the same tag name as that of the token, then this is a parse error ; ignore the token.
Otherwise, run these steps:
If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error .
Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.
If the Document is not set to quirks mode, and the stack of open elements has a p element in button scope, then close a p element.
Insert an HTML element for the token.
Set the frameset-ok flag to "not ok".
Switch the insertion mode to " in table ".
Parse error . Drop the attributes from the token, and act as described in the next entry; i.e. act as if this was a "br" start tag token with no attributes, rather than the end tag token that it actually is.
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
Set the frameset-ok flag to "not ok".
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
If the token does not have an attribute with the name "type", or if it does, but that attribute's value is not an ASCII case-insensitive match for the string "hidden", then: set the frameset-ok flag to "not ok".
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
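The "type" check for "input" start tags can be sketched as below. This is a minimal sketch under stated assumptions: token attributes are modelled as a plain dictionary, and both helper names are hypothetical. An explicit A-Z translation table is used because Python's `str.lower()` also folds some non-ASCII characters, which would be broader than an ASCII case-insensitive match.

```python
import string

# Translation table that lowercases only the ASCII letters A-Z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def is_ascii_case_insensitive_match(a, b):
    """Compare two strings, treating only A-Z and a-z as equivalent."""
    return a.translate(_ASCII_LOWER) == b.translate(_ASCII_LOWER)


def input_keeps_frameset_ok(attrs):
    """Return True if an "input" start tag leaves the frameset-ok flag alone.

    attrs is a hypothetical dict of the token's attribute names to values.
    """
    value = attrs.get("type")
    return value is not None and is_ascii_case_insensitive_match(value, "hidden")
```

A caller would set the frameset-ok flag to "not ok" whenever `input_keeps_frameset_ok` returns False.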
If the stack of open elements has a p element in button scope, then close a p element.
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
Set the frameset-ok flag to "not ok".
Parse error . Change the token's tag name to "img" and reprocess it. (Don't ask.)
Run these steps:
Insert an HTML element for the token.
If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of textarea elements are ignored as an authoring convenience.)
Switch the tokenizer to the RCDATA state .
Let the original insertion mode be the current insertion mode .
Set the frameset-ok flag to "not ok".
Switch the insertion mode to " text ".
If the stack of open elements has a p element in button scope, then close a p element.
Reconstruct the active formatting elements , if any.
Set the frameset-ok flag to "not ok".
Follow the generic raw text element parsing algorithm .
Set the frameset-ok flag to "not ok".
Follow the generic raw text element parsing algorithm .
Follow the generic raw text element parsing algorithm .
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token.
Set the frameset-ok flag to "not ok".
If the insertion mode is one of " in table ", " in caption ", " in table body ", " in row ", or " in cell ", then switch the insertion mode to " in select in table ". Otherwise, switch the insertion mode to " in select ".
If the current node is an option element, then pop the current node off the stack of open elements.
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token.
If the stack of open elements has a ruby element in scope, then generate implied end tags. If the current node is not now a ruby element, this is a parse error.
Insert an HTML element for the token.
If the stack of open elements has a ruby element in scope, then generate implied end tags, except for rtc elements. If the current node is not now an rtc element or a ruby element, this is a parse error.
Insert an HTML element for the token.
Reconstruct the active formatting elements , if any.
Adjust MathML attributes for the token. (This fixes the case of MathML attributes that are not all lowercase.)
Adjust foreign attributes for the token. (This fixes the use of namespaced attributes, in particular XLink.)
Insert a foreign element for the token, with MathML namespace and false.
If the token has its self-closing flag set, pop the current node off the stack of open elements and acknowledge the token's self-closing flag .
Reconstruct the active formatting elements , if any.
Adjust SVG attributes for the token. (This fixes the case of SVG attributes that are not all lowercase.)
Adjust foreign attributes for the token. (This fixes the use of namespaced attributes, in particular XLink in SVG.)
Insert a foreign element for the token, with SVG namespace and false.
If the token has its self-closing flag set, pop the current node off the stack of open elements and acknowledge the token's self-closing flag .
Parse error . Ignore the token.
Reconstruct the active formatting elements , if any.
Insert an HTML element for the token.
This element will be an ordinary element. With one exception: if the scripting flag is disabled, it can also be a noscript element.
Run these steps:
Initialize node to be the current node (the bottommost node of the stack).
Loop : If node is an HTML element with the same tag name as the token, then:
Generate implied end tags , except for HTML elements with the same tag name as the token.
If node is not the current node , then this is a parse error .
Pop all the nodes from the current node up to node , including node , then stop these steps.
Otherwise, if node is in the special category, then this is a parse error ; ignore the token, and return.
Set node to the previous entry in the stack of open elements .
Return to the step labeled loop .
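The "any other end tag" steps above can be sketched as follows. The model is deliberately simplified and is not a conforming implementation: the stack of open elements is assumed to be a list of tag-name strings with the current node last, `SPECIAL` is an abbreviated subset of the special category, and `any_other_end_tag` is a hypothetical name.

```python
# Tag names whose end tags may be implied.
IMPLIED = {"dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc"}
# Assumption: abbreviated subset of the "special" category, for illustration.
SPECIAL = {"address", "blockquote", "body", "button", "div", "html", "li",
           "ol", "p", "table", "td", "th", "ul"}


def any_other_end_tag(stack, tag):
    """Apply the "any other end tag" steps for an end tag named tag.

    Returns a pair (handled, parse_error). handled is False when the token
    is ignored.
    """
    for index in range(len(stack) - 1, -1, -1):  # start at the current node
        node = stack[index]
        if node == tag:
            # Generate implied end tags, except for elements named tag.
            while stack[-1] in IMPLIED and stack[-1] != tag:
                stack.pop()
            # Parse error if node is not the current node.
            parse_error = index != len(stack) - 1
            # Pop everything from the current node up to and including node.
            del stack[index:]
            return True, parse_error
        if node in SPECIAL:
            return False, True  # parse error; ignore the token
    # Not expected in a real parser, where the html element (which is
    # special) is at the top of the stack; kept so the sketch is total.
    return False, True
```

Because the walk stops at the first special element, a stray end tag cannot close anything outside the nearest enclosing block-like element.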
When the steps above say the user agent is to close a p element, it means that the user agent must run the following steps:
Generate implied end tags, except for p elements.
If the current node is not a p element, then this is a parse error.
Pop elements from the stack of open elements until a p element has been popped from the stack.
The adoption agency algorithm , which takes as its only argument a token token for which the algorithm is being run, consists of the following steps:
Let subject be token 's tag name.
If the current node is an HTML element whose tag name is subject , and the current node is not in the list of active formatting elements , then pop the current node off the stack of open elements and return.
Let outerLoopCounter be 0.
While true:
If outerLoopCounter is greater than or equal to 8, then return.
Increment outerLoopCounter by 1.
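Let formattingElement be the last element in the list of active formatting elements that is between the end of the list and the last marker in the list, if any, or the start of the list otherwise, and has the tag name subject .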
If there is no such element, then return and instead act as described in the "any other end tag" entry above.
If formattingElement is not in the stack of open elements , then this is a parse error ; remove the element from the list, and return.
If formattingElement is in the stack of open elements , but the element is not in scope , then this is a parse error ; return.
If formattingElement is not the current node , this is a parse error . (But do not return.)
Let furthestBlock be the topmost node in the stack of open elements that is lower in the stack than formattingElement , and is an element in the special category. There might not be one.
If there is no furthestBlock , then the UA must first pop all the nodes from the bottom of the stack of open elements , from the current node up to and including formattingElement , then remove formattingElement from the list of active formatting elements , and finally return.
Let commonAncestor be the element immediately above formattingElement in the stack of open elements .
Let a bookmark note the position of formattingElement in the list of active formatting elements relative to the elements on either side of it in the list.
Let node and lastNode be furthestBlock .
Let innerLoopCounter be 0.
While true:
Increment innerLoopCounter by 1.
Let node be the element immediately above node in the stack of open elements , or if node is no longer in the stack of open elements (e.g. because it got removed by this algorithm), the element that was immediately above node in the stack of open elements before node was removed.
If node is formattingElement , then break .
If innerLoopCounter is greater than 3 and node is in the list of active formatting elements , then remove node from the list of active formatting elements .
If node is not in the list of active formatting elements , then remove node from the stack of open elements and continue .
Create an element for the token for which the element node was created, in the HTML namespace , with commonAncestor as the intended parent; replace the entry for node in the list of active formatting elements with an entry for the new element, replace the entry for node in the stack of open elements with an entry for the new element, and let node be the new element.
If lastNode is furthestBlock , then move the aforementioned bookmark to be immediately after the new node in the list of active formatting elements .
Append lastNode to node .
Set lastNode to node .
Insert whatever lastNode ended up being in the previous step at the appropriate place for inserting a node , but using commonAncestor as the override target .
Create an element for the token for which formattingElement was created, in the HTML namespace , with furthestBlock as the intended parent.
Take all of the child nodes of furthestBlock and append them to the element created in the last step.
Append that new element to furthestBlock .
Remove formattingElement from the list of active formatting elements , and insert the new element into the list of active formatting elements at the position of the aforementioned bookmark.
Remove formattingElement from the stack of open elements , and insert the new element into the stack of open elements immediately below the position of furthestBlock in that stack.
This algorithm's name, the "adoption agency algorithm", comes from the way it causes elements to change parents, and is in contrast with other possible algorithms for dealing with misnested content.
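Two of the lookups used by the adoption agency algorithm can be sketched as follows. This is not the full algorithm, only the selection of formattingElement and furthestBlock, under several assumptions: formattingElement is taken to be the last entry after the last marker whose tag name is subject, the stack of open elements is a list with the current node last (so "lower in the stack" means a higher list index), `SPECIAL` is an abbreviated subset of the special category, and `Element`, `MARKER`, and both function names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sentinel standing in for a marker entry in the list.
MARKER = object()
# Assumption: abbreviated subset of the "special" category, for illustration.
SPECIAL = {"address", "blockquote", "body", "button", "div", "html", "li",
           "ol", "p", "table", "td", "th", "ul"}


@dataclass(eq=False)  # eq=False keeps comparison by identity, like DOM nodes
class Element:
    name: str


def find_formatting_element(active_formatting, subject):
    """Return the last entry named subject after the last marker, or None."""
    for entry in reversed(active_formatting):
        if entry is MARKER:
            return None
        if entry.name == subject:
            return entry
    return None


def find_furthest_block(stack, formatting_element):
    """Return the topmost special element below formatting_element, or None."""
    start = stack.index(formatting_element)
    for element in stack[start + 1:]:
        if element.name in SPECIAL:
            return element
    return None
```

When `find_furthest_block` returns None, the algorithm takes its short path: it pops the stack up to and including formattingElement and removes it from the list of active formatting elements.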
When the user agent is to apply the rules for the " text " insertion mode , the user agent must handle the token as follows:
Insert the token's character .
This can never be a U+0000 NULL character; the tokenizer converts those to U+FFFD REPLACEMENT CHARACTER characters.
If the current node is a script element, then set its already started to true.
Pop the current node off the stack of open elements .
Switch the insertion mode to the original insertion mode and reprocess the token.
If the active speculative HTML parser is null and the JavaScript execution context stack is empty, then perform a microtask checkpoint .
Let script be the current node (which will be a script element).
Pop the current node off the stack of open elements .
Switch the insertion mode to the original insertion mode .
Let the old insertion point have the same value as the current insertion point . Let the insertion point be just before the next input character .
Increment the parser's script nesting level by one.
If the active speculative HTML parser is null, then prepare the script element script . This might cause some script to execute, which might cause new characters to be inserted into the tokenizer , and might cause the tokenizer to output more tokens, resulting in a reentrant invocation of the parser .
Decrement the parser's script nesting level by one. If the parser's script nesting level is zero, then set the parser pause flag to false.
Let the insertion point have the value of the old insertion point . (In other words, restore the insertion point to its previous value. This value might be the "undefined" value.)
At this stage, if the pending parsing-blocking script is not null, then:
Set the parser pause flag to true, and abort the processing of any nested invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume when the caller returns to the "outer" tree construction stage.)
The tree construction stage of this particular parser is being called reentrantly, say from a call to document.write().
While the pending parsing-blocking script is not null:
Let the script be the pending parsing-blocking script .
Set the pending parsing-blocking script to null.
Start the speculative HTML parser for this instance of the HTML parser.
Block the tokenizer for this instance of the HTML parser , such that the event loop will not run tasks that invoke the tokenizer .
If the parser's Document has a style sheet that is blocking scripts or the script's ready to be parser-executed is false: spin the event loop until the parser's Document has no style sheet that is blocking scripts and the script's ready to be parser-executed becomes true.
If this parser has been aborted in the meantime, return.
This could happen if, e.g., while the spin the event loop algorithm is running, the Document gets destroyed, or the document.open() method gets invoked on the Document.
Stop the speculative HTML parser for this instance of the HTML parser.
Unblock the tokenizer for this instance of the HTML parser , such that tasks that invoke the tokenizer can again be run.
Let the insertion point be just before the next input character .
Increment the parser's script nesting level by one (it should be zero before this step, so this sets it to one).
Execute the script element the script .
Decrement the parser's script nesting level by one. If the parser's script nesting level is zero (which it always should be at this point), then set the parser pause flag to false.
Let the insertion point be undefined again.
Pop the current node off the stack of open elements .
Switch the insertion mode to the original insertion mode .
When the user agent is to apply the rules for the " in table " insertion mode , the user agent must handle the token as follows:
table, tbody, template, tfoot, thead, or tr element
Let the pending table character tokens be an empty list of tokens.
Let the original insertion mode be the current insertion mode .
Switch the insertion mode to " in table text " and reprocess the token.
Parse error . Ignore the token.
Clear the stack back to a table context . (See below.)
Insert a marker at the end of the list of active formatting elements .
Insert an HTML element for the token, then switch the insertion mode to " in caption ".
Clear the stack back to a table context . (See below.)
Insert an HTML element for the token, then switch the insertion mode to " in column group ".
Clear the stack back to a table context . (See below.)
Insert an HTML element for a "colgroup" start tag token with no attributes, then switch the insertion mode to " in column group ".
Reprocess the current token.
Clear the stack back to a table context . (See below.)
Insert an HTML element for the token, then switch the insertion mode to " in table body ".
Clear the stack back to a table context . (See below.)
Insert an HTML element for a "tbody" start tag token with no attributes, then switch the insertion mode to " in table body ".
Reprocess the current token.
If the stack of open elements does not have a table element in table scope, ignore the token.
Otherwise:
Pop elements from this stack until a table element has been popped from the stack.
Reset the insertion mode appropriately .
Reprocess the token.
If the stack of open elements does not have a table element in table scope, this is a parse error; ignore the token.
Otherwise:
Pop elements from this stack until a table element has been popped from the stack.
Parse error . Ignore the token.
Process the token using the rules for the " in head " insertion mode .
If the token does not have an attribute with the name "type", or if it does, but that attribute's value is not an ASCII case-insensitive match for the string "hidden", then: act as described in the "anything else" entry below.
Otherwise:
Insert an HTML element for the token.
Pop that input element off the stack of open elements.
Acknowledge the token's self-closing flag , if it is set.
If there is a template element on the stack of open elements, or if the form element pointer is not null, ignore the token.
Otherwise:
Insert an HTML element for the token, and set the form element pointer to point to the element created.
Pop that form element off the stack of open elements.
Process the token using the rules for the " in body " insertion mode .
Parse error . Enable foster parenting , process the token using the rules for the " in body " insertion mode , and then disable foster parenting .
When the steps above require the UA to clear the stack back to a table context, it means that the UA must, while the current node is not a table, template, or html element, pop elements from the stack of open elements.
This is the same list of elements as used in the has an element in table scope steps.
The current node being an html element after this process is a fragment case.
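Clearing the stack back to a table context can be sketched in a few lines. This is an illustration only, assuming the stack of open elements is a list of tag-name strings with the current node last; the function name is hypothetical.

```python
TABLE_CONTEXT = {"table", "template", "html"}


def clear_stack_back_to_table_context(stack):
    """Pop elements until the current node is a table, template, or html element."""
    while stack[-1] not in TABLE_CONTEXT:
        stack.pop()
```

The loop terminates because the html element stays on the stack; ending on html rather than table corresponds to the fragment case noted above.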
When the user agent is to apply the rules for the " in table text " insertion mode , the user agent must handle the token as follows:
Parse error . Ignore the token.
Append the character token to the pending table character tokens list.
If any of the tokens in the pending table character tokens list are character tokens that are not ASCII whitespace , then this is a parse error : reprocess the character tokens in the pending table character tokens list using the rules given in the "anything else" entry in the " in table " insertion mode.
Otherwise, insert the characters given by the pending table character tokens list.
Switch the insertion mode to the original insertion mode and reprocess the token.
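The decision made when the pending table character tokens are flushed can be sketched as follows. This is a minimal sketch: character tokens are assumed to be one-character strings, the returned action labels are hypothetical, and the actual insertion or reprocessing is left to the caller.

```python
# ASCII whitespace: TAB, LF, FF, CR, and SPACE.
ASCII_WHITESPACE = "\t\n\f\r "


def flush_pending_table_characters(pending):
    """Decide how the pending table character tokens are to be handled.

    Returns ("reprocess-anything-else", text) if any character is not ASCII
    whitespace (a parse error), otherwise ("insert", text).
    """
    text = "".join(pending)
    if any(ch not in ASCII_WHITESPACE for ch in text):
        return "reprocess-anything-else", text
    return "insert", text
```

Buffering the characters first is what allows whitespace-only runs between table tags to stay inside the table, while any run that contains other text is handed to the "anything else" rules as a whole.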
When the user agent is to apply the rules for the " in caption " insertion mode , the user agent must handle the token as follows:
If the stack of open elements does not have a caption element in table scope, this is a parse error; ignore the token. (fragment case)
Otherwise:
Now, if the current node is not a caption element, then this is a parse error.
Pop elements from this stack until a caption element has been popped from the stack.
Clear the list of active formatting elements up to the last marker .
Switch the insertion mode to " in table ".
If the stack of open elements does not have a caption element in table scope, this is a parse error; ignore the token. (fragment case)
Otherwise:
Now, if the current node is not a caption element, then this is a parse error.
Pop elements from this stack until a caption element has been popped from the stack.
Clear the list of active formatting elements up to the last marker .
Switch the insertion mode to " in table ".
Reprocess the token.
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
When the user agent is to apply the rules for the "in column group" insertion mode, the user agent must handle the token as follows:
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
If the current node is not a colgroup element, then this is a parse error; ignore the token.
Otherwise, pop the current node from the stack of open elements . Switch the insertion mode to " in table ".
Parse error . Ignore the token.
Process the token using the rules for the " in head " insertion mode .
Process the token using the rules for the " in body " insertion mode .
If the current node is not a colgroup element, then this is a parse error; ignore the token.
Otherwise, pop the current node from the stack of open elements .
Switch the insertion mode to " in table ".
Reprocess the token.
When the user agent is to apply the rules for the "in table body" insertion mode, the user agent must handle the token as follows:
Clear the stack back to a table body context . (See below.)
Insert an HTML element for the token, then switch the insertion mode to " in row ".
Clear the stack back to a table body context . (See below.)
Insert an HTML element for a "tr" start tag token with no attributes, then switch the insertion mode to " in row ".
Reprocess the current token.
If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as the token, this is a parse error ; ignore the token.
Otherwise:
Clear the stack back to a table body context . (See below.)
Pop the current node from the stack of open elements . Switch the insertion mode to " in table ".
If the stack of open elements does not have a tbody, thead, or tfoot element in table scope, this is a parse error; ignore the token.
Otherwise:
Clear the stack back to a table body context . (See below.)
Pop the current node from the stack of open elements . Switch the insertion mode to " in table ".
Reprocess the token.
Parse error . Ignore the token.
Process the token using the rules for the " in table " insertion mode .
When the steps above require the UA to clear the stack back to a table body context, it means that the UA must, while the current node is not a tbody, tfoot, thead, template, or html element, pop elements from the stack of open elements.
The current node being an html element after this process is a fragment case.
When the user agent is to apply the rules for the "in row" insertion mode, the user agent must handle the token as follows:
Clear the stack back to a table row context . (See below.)
Insert an HTML element for the token, then switch the insertion mode to " in cell ".
Insert a marker at the end of the list of active formatting elements .
If the stack of open elements does not have a tr element in table scope, this is a parse error; ignore the token.
Otherwise:
Clear the stack back to a table row context. (See below.)
Pop the current node (which will be a tr element) from the stack of open elements. Switch the insertion mode to "in table body".
If the stack of open elements does not have a tr element in table scope, this is a parse error; ignore the token.
Otherwise:
Clear the stack back to a table row context. (See below.)
Pop the current node (which will be a tr element) from the stack of open elements. Switch the insertion mode to "in table body".
Reprocess the token.
If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as the token, this is a parse error ; ignore the token.
If the stack of open elements does not have a tr element in table scope, ignore the token.
Otherwise:
Clear the stack back to a table row context. (See below.)
Pop the current node (which will be a tr element) from the stack of open elements. Switch the insertion mode to "in table body".
Reprocess the token.
Parse error . Ignore the token.
Process the token using the rules for the " in table " insertion mode .
When the steps above require the UA to clear the stack back to a table row context, it means that the UA must, while the current node is not a tr, template, or html element, pop elements from the stack of open elements.
The current node being an html element after this process is a fragment case.
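The "clear the stack back to a table context", "table body context", and "table row context" definitions differ only in the set of elements at which popping stops. The following is a minimal Python sketch, assuming a simplified model in which the stack of open elements is a list of tag-name strings with the current node last; the constant and function names are illustrative and do not come from the standard.

```python
# Elements at which each "clear the stack back to ..." operation stops.
TABLE_CONTEXT = {"table", "template", "html"}
TABLE_BODY_CONTEXT = {"tbody", "tfoot", "thead", "template", "html"}
TABLE_ROW_CONTEXT = {"tr", "template", "html"}

def clear_stack_back_to(stack, context):
    """Pop elements while the current node is not one of the stop elements.

    `stack` is mutated in place and also returned for convenience.
    """
    while stack and stack[-1] not in context:
        stack.pop()
    return stack
```

If the loop stops at an html element, the parser is in the fragment case, as the text notes.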
When the user agent is to apply the rules for the "in cell" insertion mode, the user agent must handle the token as follows:
If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as that of the token, then this is a parse error ; ignore the token.
Otherwise:
Now, if the current node is not an HTML element with the same tag name as the token, then this is a parse error .
Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.
Clear the list of active formatting elements up to the last marker .
Switch the insertion mode to " in row ".
Assert: The stack of open elements has a td or th element in table scope.
Close the cell (see below) and reprocess the token.
Parse error . Ignore the token.
If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as that of the token, then this is a parse error ; ignore the token.
Otherwise, close the cell (see below) and reprocess the token.
Process the token using the rules for the " in body " insertion mode .
Where the steps above say to close the cell , they mean to run the following algorithm:
If the current node is not now a td element or a th element, then this is a parse error.
Pop elements from the stack of open elements until a td element or a th element has been popped from the stack.
Clear the list of active formatting elements up to the last marker .
Switch the insertion mode to " in row ".
The stack of open elements cannot have both a td and a th element in table scope at the same time, nor can it have neither when the close the cell algorithm is invoked.
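The close the cell steps listed above can be sketched in Python. This is an illustration under simplifying assumptions: the stack of open elements is a list of tag names with the current node last, the list of active formatting elements is a list in which `None` stands for a marker, and parse errors are collected in a plain list. Any handling of implied end tags that precedes these steps in the full algorithm is left out. All names are hypothetical.

```python
MARKER = None  # stand-in for a marker in the list of active formatting elements

def close_the_cell(stack, active_formatting, errors):
    """Close the innermost td/th cell and return the new insertion mode."""
    if stack[-1] not in ("td", "th"):
        errors.append("parse error: current node is not a td or th element")
    # Pop until a td or th element has been popped.
    while stack:
        if stack.pop() in ("td", "th"):
            break
    # Clear the list of active formatting elements up to the last marker
    # (the marker itself is removed as well).
    while active_formatting:
        if active_formatting.pop() is MARKER:
            break
    return "in row"
```

The marker removed here is the one inserted when the cell was opened in the "in row" insertion mode, so formatting elements opened inside the cell do not leak into the next cell.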
When the user agent is to apply the rules for the "in select" insertion mode, the user agent must handle the token as follows:
Parse error . Ignore the token.
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
If the current node is an option element, pop that node from the stack of open elements.
Insert an HTML element for the token.
If the current node is an option element, pop that node from the stack of open elements.
If the current node is an optgroup element, pop that node from the stack of open elements.
Insert an HTML element for the token.
If the current node is an option element, pop that node from the stack of open elements.
If the current node is an optgroup element, pop that node from the stack of open elements.
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
First, if the current node is an option element, and the node immediately before it in the stack of open elements is an optgroup element, then pop the current node from the stack of open elements.
If the current node is an optgroup element, then pop that node from the stack of open elements. Otherwise, this is a parse error; ignore the token.
If the current node is an option element, then pop that node from the stack of open elements. Otherwise, this is a parse error; ignore the token.
If the stack of open elements does not have a select element in select scope, this is a parse error; ignore the token. (fragment case)
Otherwise:
Pop elements from the stack of open elements until a select element has been popped from the stack.
If the stack of open elements does not have a select element in select scope, ignore the token. (fragment case)
Otherwise:
Pop elements from the stack of open elements until a select element has been popped from the stack.
Reset the insertion mode appropriately .
It just gets treated like an end tag.
If the stack of open elements does not have a select element in select scope, ignore the token. (fragment case)
Otherwise:
Pop elements from the stack of open elements until a select element has been popped from the stack.
Reset the insertion mode appropriately .
Reprocess the token.
Process the token using the rules for the " in head " insertion mode .
Process the token using the rules for the " in body " insertion mode .
Parse error . Ignore the token.
When the user agent is to apply the rules for the "in select in table" insertion mode, the user agent must handle the token as follows:
Pop elements from the stack of open elements until a select element has been popped from the stack.
Reset the insertion mode appropriately .
Reprocess the token.
If the stack of open elements does not have an element in table scope that is an HTML element with the same tag name as that of the token, then ignore the token.
Otherwise:
Pop elements from the stack of open elements until a select element has been popped from the stack.
Reset the insertion mode appropriately .
Reprocess the token.
Process the token using the rules for the " in select " insertion mode .
When the user agent is to apply the rules for the "in template" insertion mode, the user agent must handle the token as follows:
Process the token using the rules for the " in body " insertion mode .
Process the token using the rules for the " in head " insertion mode .
Pop the current template insertion mode off the stack of template insertion modes .
Push " in table " onto the stack of template insertion modes so that it is the new current template insertion mode .
Switch the insertion mode to " in table ", and reprocess the token.
Pop the current template insertion mode off the stack of template insertion modes .
Push " in column group " onto the stack of template insertion modes so that it is the new current template insertion mode .
Switch the insertion mode to " in column group ", and reprocess the token.
Pop the current template insertion mode off the stack of template insertion modes .
Push " in table body " onto the stack of template insertion modes so that it is the new current template insertion mode .
Switch the insertion mode to " in table body ", and reprocess the token.
Pop the current template insertion mode off the stack of template insertion modes .
Push " in row " onto the stack of template insertion modes so that it is the new current template insertion mode .
Switch the insertion mode to " in row ", and reprocess the token.
Pop the current template insertion mode off the stack of template insertion modes .
Push " in body " onto the stack of template insertion modes so that it is the new current template insertion mode .
Switch the insertion mode to " in body ", and reprocess the token.
Parse error . Ignore the token.
If there is no template element on the stack of open elements, then stop parsing. (fragment case)
Otherwise, this is a parse error .
Pop elements from the stack of open elements until a template element has been popped from the stack.
Clear the list of active formatting elements up to the last marker .
Pop the current template insertion mode off the stack of template insertion modes .
Reset the insertion mode appropriately .
Reprocess the token.
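The repeated "pop, push, switch, reprocess" pattern in the "in template" rules amounts to a lookup from the start tag name to a replacement insertion mode. The sketch below is a hedged Python illustration: the tag names are the token conditions as the author of this sketch understands them from the standard (they are not reproduced in the text above), the stack of template insertion modes is modelled as a list of strings, and the names `TEMPLATE_START_TAG_MODES` and `retarget_template_mode` are hypothetical.

```python
# Assumed mapping from start tag name to the mode pushed in its place.
TEMPLATE_START_TAG_MODES = {
    "caption": "in table", "colgroup": "in table", "tbody": "in table",
    "tfoot": "in table", "thead": "in table",
    "col": "in column group",
    "tr": "in table body",
    "td": "in row", "th": "in row",
}

def retarget_template_mode(tag_name, template_insertion_modes):
    """Replace the current template insertion mode for a start tag.

    Start tags that are handled by the "in head" rules (for example
    script or style) are assumed to have been dispatched before this
    function is called. The caller then switches the insertion mode to
    the returned value and reprocesses the token.
    """
    mode = TEMPLATE_START_TAG_MODES.get(tag_name, "in body")
    template_insertion_modes.pop()          # pop the current template insertion mode
    template_insertion_modes.append(mode)   # push the replacement
    return mode
```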
When the user agent is to apply the rules for the "after body" insertion mode, the user agent must handle the token as follows:
Process the token using the rules for the " in body " insertion mode .
Insert a comment as the last child of the first element in the stack of open elements (the html element).
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
If the parser was created as part of the HTML fragment parsing algorithm , this is a parse error ; ignore the token. ( fragment case )
Otherwise, switch the insertion mode to " after after body ".
Parse error . Switch the insertion mode to " in body " and reprocess the token.
When the user agent is to apply the rules for the "in frameset" insertion mode, the user agent must handle the token as follows:
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
Insert an HTML element for the token.
If the current node is the root html element, then this is a parse error; ignore the token. (fragment case)
Otherwise, pop the current node from the stack of open elements .
If the parser was not created as part of the HTML fragment parsing algorithm (fragment case), and the current node is no longer a frameset element, then switch the insertion mode to "after frameset".
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements .
Acknowledge the token's self-closing flag , if it is set.
Process the token using the rules for the " in head " insertion mode .
If the current node is not the root html element, then this is a parse error.
The current node can only be the root html element in the fragment case.
Parse error . Ignore the token.
When the user agent is to apply the rules for the "after frameset" insertion mode, the user agent must handle the token as follows:
Parse error . Ignore the token.
Process the token using the rules for the " in body " insertion mode .
Switch the insertion mode to " after after frameset ".
Process the token using the rules for the " in head " insertion mode .
Parse error . Ignore the token.
When the user agent is to apply the rules for the "after after body" insertion mode, the user agent must handle the token as follows:
Insert a comment as the last child of the Document object.
Process the token using the rules for the " in body " insertion mode .
Parse error . Switch the insertion mode to " in body " and reprocess the token.
When the user agent is to apply the rules for the "after after frameset" insertion mode, the user agent must handle the token as follows:
Insert a comment as the last child of the Document object.
Process the token using the rules for the " in body " insertion mode .
Process the token using the rules for the " in head " insertion mode .
Parse error . Ignore the token.
When the user agent is to apply the rules for parsing tokens in foreign content, the user agent must handle the token as follows:
Parse error . Insert a U+FFFD REPLACEMENT CHARACTER character .
Insert the token's character .
Set the frameset-ok flag to "not ok".
Parse error . Ignore the token.
While the current node is not a MathML text integration point , an HTML integration point , or an element in the HTML namespace , pop elements from the stack of open elements .
Reprocess the token according to the rules given in the section corresponding to the current insertion mode in HTML content.
If the adjusted current node is an element in the MathML namespace , adjust MathML attributes for the token. (This fixes the case of MathML attributes that are not all lowercase.)
If the adjusted current node is an element in the SVG namespace , and the token's tag name is one of the ones in the first column of the following table, change the tag name to the name given in the corresponding cell in the second column. (This fixes the case of SVG elements that are not all lowercase.)
Tag name | Element name |
---|---|
altglyph | altGlyph |
altglyphdef | altGlyphDef |
altglyphitem | altGlyphItem |
animatecolor | animateColor |
animatemotion | animateMotion |
animatetransform | animateTransform |
clippath | clipPath |
feblend | feBlend |
fecolormatrix | feColorMatrix |
fecomponenttransfer | feComponentTransfer |
fecomposite | feComposite |
feconvolvematrix | feConvolveMatrix |
fediffuselighting | feDiffuseLighting |
fedisplacementmap | feDisplacementMap |
fedistantlight | feDistantLight |
fedropshadow | feDropShadow |
feflood | feFlood |
fefunca | feFuncA |
fefuncb | feFuncB |
fefuncg | feFuncG |
fefuncr | feFuncR |
fegaussianblur | feGaussianBlur |
feimage | feImage |
femerge | feMerge |
femergenode | feMergeNode |
femorphology | feMorphology |
feoffset | feOffset |
fepointlight | fePointLight |
fespecularlighting | feSpecularLighting |
fespotlight | feSpotLight |
fetile | feTile |
feturbulence | feTurbulence |
foreignobject | foreignObject |
glyphref | glyphRef |
lineargradient | linearGradient |
radialgradient | radialGradient |
textpath | textPath |
If the adjusted current node is an element in the SVG namespace , adjust SVG attributes for the token. (This fixes the case of SVG attributes that are not all lowercase.)
Adjust foreign attributes for the token. (This fixes the use of namespaced attributes, in particular XLink in SVG.)
Insert a foreign element for the token, with adjusted current node 's namespace and false.
If the token has its self-closing flag set, then run the appropriate steps from the following list:
Acknowledge the token's self-closing flag , and then act as described in the steps for a "script" end tag below.
Pop the current node off the stack of open elements and acknowledge the token's self-closing flag .
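The SVG tag-name fix-up in the steps above is a plain table lookup, because the tokenizer has already lowercased the tag name. The Python sketch below is only an illustration: it lists an excerpt of the table rather than every row, derives the lookup keys by lowercasing the mixed-case element names, and uses hypothetical names throughout.

```python
# Excerpt of the mixed-case SVG element names from the table; a full
# implementation would list every row.
SVG_ELEMENT_NAMES = [
    "altGlyph", "animateMotion", "clipPath", "feGaussianBlur",
    "foreignObject", "linearGradient", "radialGradient", "textPath",
]

# Key: the all-lowercase tag name produced by the tokenizer.
# Value: the element name to use instead.
SVG_TAG_NAME_FIXUPS = {name.lower(): name for name in SVG_ELEMENT_NAMES}

def adjust_svg_tag_name(tag_name):
    """Return the mixed-case element name, or the tag name unchanged."""
    return SVG_TAG_NAME_FIXUPS.get(tag_name, tag_name)
```

The fix-up applies only when the adjusted current node is in the SVG namespace; tag names without a table entry, such as `circle`, pass through untouched.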
script
element
Pop the current node off the stack of open elements .
Let the old insertion point have the same value as the current insertion point . Let the insertion point be just before the next input character .
Increment the parser's script nesting level by one. Set the parser pause flag to true.
If the active speculative HTML parser is null and the user agent supports SVG, then Process the SVG script element according to the SVG rules. [SVG]
Even if this causes new characters to be inserted into the tokenizer , the parser will not be executed reentrantly, since the parser pause flag is true.
Decrement the parser's script nesting level by one. If the parser's script nesting level is zero, then set the parser pause flag to false.
Let the insertion point have the value of the old insertion point . (In other words, restore the insertion point to its previous value. This value might be the "undefined" value.)
Run these steps:
Initialize node to be the current node (the bottommost node of the stack).
If node 's tag name, converted to ASCII lowercase , is not the same as the tag name of the token, then this is a parse error .
Loop : If node is the topmost element in the stack of open elements , then return. ( fragment case )
If node 's tag name, converted to ASCII lowercase , is the same as the tag name of the token, pop elements from the stack of open elements until node has been popped from the stack, and then return.
Set node to the previous entry in the stack of open elements .
If node is not an element in the HTML namespace , return to the step labeled loop .
Otherwise, process the token according to the rules given in the section corresponding to the current insertion mode in HTML content.
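The end-tag steps above walk the stack of open elements from the current node upwards. The following Python sketch is a minimal illustration, assuming each stack entry is a `(namespace, tag_name)` tuple with the current node last; the function name and the returned labels are hypothetical, and the "process-in-html-mode" result only signals that the caller should hand the token to the current insertion mode.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

def ascii_lower(s):
    """Lowercase only the ASCII letters A-Z."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def foreign_end_tag(stack, tag_name, errors):
    """Handle an end tag in foreign content; `tag_name` is already lowercase."""
    index = len(stack) - 1  # node starts as the current node
    if ascii_lower(stack[index][1]) != tag_name:
        errors.append("parse error: end tag does not match current node")
    while True:
        if index == 0:
            return "ignored"  # reached the topmost element (fragment case)
        if ascii_lower(stack[index][1]) == tag_name:
            del stack[index:]  # pop up to and including node
            return "popped"
        index -= 1  # previous entry in the stack
        if stack[index][0] == HTML_NS:
            return "process-in-html-mode"
```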
Once the user agent stops parsing the document, the user agent must run the following steps:
If the active speculative HTML parser is not null, then stop the speculative HTML parser and return.
Set the insertion point to undefined.
Update the current document readiness to "interactive".
Pop all the nodes off the stack of open elements .
While the list of scripts that will execute when the document has finished parsing is not empty:
Spin the event loop until the first script in the list of scripts that will execute when the document has finished parsing has its ready to be parser-executed set to true and the parser's Document has no style sheet that is blocking scripts.
Execute the script element given by the first script in the list of scripts that will execute when the document has finished parsing.
Remove the first script element from the list of scripts that will execute when the document has finished parsing (i.e. shift out the first entry in the list).
Queue a global task on the DOM manipulation task source given the Document's relevant global object to run the following substeps:
Set the Document's load timing info's DOM content loaded event start time to the current high resolution time given the Document's relevant global object.
Fire an event named DOMContentLoaded at the Document object, with its bubbles attribute initialized to true.
Set the Document's load timing info's DOM content loaded event end time to the current high resolution time given the Document's relevant global object.
Enable the client message queue of the ServiceWorkerContainer object whose associated service worker client is the Document object's relevant settings object.
Invoke WebDriver BiDi DOM content loaded with the Document's browsing context, and a new WebDriver BiDi navigation status whose id is the Document object's during-loading navigation ID for WebDriver BiDi, status is "pending", and url is the Document object's URL.
Spin the event loop until the set of scripts that will execute as soon as possible and the list of scripts that will execute in order as soon as possible are empty.
Spin the event loop until there is nothing that delays the load event in the Document.
Queue a global task on the DOM manipulation task source given the Document's relevant global object to run the following steps:
Update the current document readiness to "complete".
If the Document object's browsing context is null, then abort these steps.
Let window be the Document's relevant global object.
Set the Document's load timing info's load event start time to the current high resolution time given window.
Fire an event named load at window, with legacy target override flag set.
Invoke WebDriver BiDi load complete with the Document's browsing context, and a new WebDriver BiDi navigation status whose id is the Document object's during-loading navigation ID for WebDriver BiDi, status is "complete", and url is the Document object's URL.
Set the Document object's during-loading navigation ID for WebDriver BiDi to null.
Set the Document's load timing info's load event end time to the current high resolution time given window.
Assert: Document's page showing is false.
Set the Document's page showing flag to true.
Fire a page transition event named pageshow at window with false.
Queue the navigation timing entry for the Document.
If the Document's print when loaded flag is set, then run the printing steps.
The Document is now ready for post-load tasks.
When the user agent is to abort a parser , it must run the following steps:
Throw away any pending content in the input stream , and discard any future content that would have been added to it.
Stop the speculative HTML parser for this HTML parser.
Update the current document readiness to "interactive".
Pop all the nodes off the stack of open elements.
Update the current document readiness to "complete".
User agents may implement an optimization, as described in this section, to speculatively fetch resources that are declared in the HTML markup while the HTML parser is waiting for a pending parsing-blocking script to be fetched and executed, or during normal parsing, at the time an element is created for a token . While this optimization is not defined in precise detail, there are some rules to consider for interoperability.
Each HTML parser can have an active speculative HTML parser . It is initially null.
The speculative HTML parser must act like the normal HTML parser (e.g., the tree builder rules apply), with some exceptions:
The state of the normal HTML parser and the document itself must not be affected.
For example, the next input character or the stack of open elements for the normal HTML parser is not affected by the speculative HTML parser .
Bytes pushed into the HTML parser's input byte stream must also be pushed into the speculative HTML parser's input byte stream . Bytes read from the streams must be independent.
The result of the speculative parsing is primarily a series of speculative fetches . Which kinds of resources to speculatively fetch is implementation-defined , but user agents must not speculatively fetch resources that would not be fetched with the normal HTML parser, under the assumption that the script that is blocking the HTML parser does nothing.
It is possible that the same markup is seen multiple times from the speculative HTML parser and then the normal HTML parser. It is expected that duplicated fetches will be prevented by caching rules, which are not yet fully specified.
A speculative fetch for a speculative mock element element must follow these rules:
Should some of these things be applied to the document "for real", even though they are found speculatively?
If the speculative HTML parser encounters one of the following elements, then act as if that element is processed for the purpose of its effect on subsequent speculative fetches.
A base element.
A meta element whose http-equiv attribute is in the Content security policy state.
A meta element whose name attribute is an ASCII case-insensitive match for "referrer".
A meta element whose name attribute is an ASCII case-insensitive match for "viewport". (This can affect whether a media query list matches the environment.) [CSSDEVICEADAPT]
Let url be the URL that element would fetch if it was processed normally. If there is no such URL or if it is the empty string, then do nothing. Otherwise, if url is already in the list of speculative fetch URLs , then do nothing. Otherwise, fetch url as if the element was processed normally, and add url to the list of speculative fetch URLs .
Each Document has a list of speculative fetch URLs, which is a list of URLs, initially empty.
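The de-duplication performed through the list of speculative fetch URLs can be sketched in Python. This is an illustration only: the `fetch` callable stands in for whatever the user agent does to fetch the resource "as if the element was processed normally", URLs are modelled as plain strings, and the function name and boolean return value are hypothetical conveniences.

```python
def speculative_fetch(url, speculative_fetch_urls, fetch):
    """Fetch `url` speculatively unless it is missing or already fetched.

    Returns True if a fetch was started, False otherwise.
    """
    if not url:
        return False  # no such URL, or the empty string: do nothing
    if url in speculative_fetch_urls:
        return False  # already fetched speculatively: do nothing
    fetch(url)
    speculative_fetch_urls.append(url)
    return True
```

Because the list belongs to the Document rather than to a single speculative parser, restarting speculative parsing does not cause the same URL to be fetched speculatively twice.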
To start the speculative HTML parser for an instance of an HTML parser parser :
Optionally, return.
This step allows user agents to opt out of speculative HTML parsing.
If parser 's active speculative HTML parser is not null, then stop the speculative HTML parser for parser .
This can happen when document.write() writes another parser-blocking script. For simplicity, this specification always restarts speculative parsing, but user agents can implement a more efficient strategy, so long as the end result is equivalent.
Let speculativeParser be a new speculative HTML parser , with the same state as parser .
Let speculativeDoc be a new isomorphic representation of parser's Document, where all elements are instead speculative mock elements.
Let speculativeParser parse into speculativeDoc.
Set parser 's active speculative HTML parser to speculativeParser .
In parallel , run speculativeParser until it is stopped or until it reaches the end of its input stream .
To stop the speculative HTML parser for an instance of an HTML parser parser :
Let speculativeParser be parser 's active speculative HTML parser .
If speculativeParser is null, then return.
Throw away any pending content in speculativeParser 's input stream , and discard any future content that would have been added to it.
Set parser 's active speculative HTML parser to null.
The speculative HTML parser will create speculative mock elements instead of normal elements. DOM operations that the tree builder normally does on elements are expected to work appropriately on speculative mock elements.
A speculative mock element is a struct with the following items :
A string namespace , corresponding to an element's namespace .
A string local name , corresponding to an element's local name .
A list attribute list , corresponding to an element's attribute list .
To create a speculative mock element given a namespace , tagName , and attributes :
Let element be a new speculative mock element .
Set element 's namespace to namespace .
Set element 's local name to tagName .
Set element 's attribute list to attributes .
Optionally, perform a speculative fetch for element .
Return element .
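The speculative mock element struct and its creation steps map naturally onto a small record type. The Python sketch below assumes attributes are modelled as `(name, value)` tuples and that the optional speculative fetch is supplied as a callable; the class, field, and parameter names are illustrative and are not defined by the standard.

```python
from dataclasses import dataclass, field

@dataclass
class SpeculativeMockElement:
    namespace: str                       # corresponds to an element's namespace
    local_name: str                      # corresponds to an element's local name
    attribute_list: list = field(default_factory=list)

def create_speculative_mock_element(namespace, tag_name, attributes,
                                    speculative_fetch=None):
    """Create a mock element and optionally trigger a speculative fetch."""
    element = SpeculativeMockElement(namespace, tag_name, list(attributes))
    if speculative_fetch is not None:
        # "Optionally, perform a speculative fetch for element."
        speculative_fetch(element)
    return element
```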
When the tree builder says to insert an element into a template element's template contents, if that is a speculative mock element, and the template element's template contents is not a ShadowRoot node, instead do nothing.
URLs found speculatively inside non-declarative-shadow-root template elements might themselves be templates, and must not be speculatively fetched.
When an application uses an HTML parser in conjunction with an XML pipeline, it is possible that the constructed DOM is not compatible with the XML tool chain in certain subtle ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns, since they conflict with the Namespaces in XML syntax. There is also some data that the HTML parser generates that isn't included in the DOM itself. This section specifies some rules for handling these issues.
If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.
If the XML API doesn't support attributes in no namespace that are named "xmlns", attributes whose names start with "xmlns:", or attributes in the XMLNS namespace, then the tool may drop such attributes.
The tool may annotate the output with any namespace declarations required for proper operation.
If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names that the API wouldn't support to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's code point when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing numeric order.
For example, the element name foo<bar, which can be output by the HTML parser, though it is neither a legal HTML element name nor a well-formed XML element name, would be converted into fooU00003Cbar, which is a well-formed XML element name (though it's still not legal in HTML by any means).
As another example, consider the attribute xlink:href. Used on a MathML element, it becomes, after being adjusted, an attribute with a prefix "xlink" and a local name "href". However, used on an HTML element, it becomes an attribute with no prefix and the local name "xlink:href", which is not a valid NCName, and thus might not be accepted by an XML API. It could thus get converted, becoming "xlinkU00003Ahref".
The resulting names from this conversion conveniently can't clash with any attribute generated by the HTML parser , since those are all either lowercase or those listed in the adjust foreign attributes algorithm's table.
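The local-name mapping described above can be sketched in Python. Which characters an XML API accepts depends on that API, so the sketch takes the decision as a predicate; the default `is_supported` is a deliberately conservative, ASCII-only approximation of NCName rules (real XML names allow many non-ASCII characters) and, like the function names, is an assumption made for illustration.

```python
def is_supported(ch, index):
    """Conservative ASCII-only stand-in for an XML API's name-character check."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return True
    # Digits, "-" and "." are accepted anywhere except the first position.
    return index > 0 and ch.isascii() and (ch.isdigit() or ch in "-.")

def coerce_local_name(name, supported=is_supported):
    """Replace each unsupported character with "U" plus six uppercase hex digits."""
    return "".join(
        ch if supported(ch, i) else "U%06X" % ord(ch)
        for i, ch in enumerate(name)
    )
```

With the default predicate, `coerce_local_name("foo<bar")` gives `fooU00003Cbar` and `coerce_local_name("xlink:href")` gives `xlinkU00003Ahref`, matching the two examples in the text.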
If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters (--), the tool may insert a single U+0020 SPACE character between any such offending characters.
If the XML API restricts comments from ending in a U+002D HYPHEN-MINUS character (-), the tool may insert a single U+0020 SPACE character at the end of such comments.
If the XML API restricts allowed characters in character data, attribute values, or comments, the tool may replace any U+000C FORM FEED (FF) character with a U+0020 SPACE character, and any other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
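The comment and character-data adjustments above can be sketched as two small Python functions. The function names are hypothetical, and `is_xml_char` encodes the Char production of XML 1.0 as the assumed definition of an "XML character"; an API with different restrictions would need a different predicate.

```python
import re

def coerce_comment(data):
    """Make comment data acceptable to an XML API that rejects "--" and a trailing "-"."""
    # Insert a space after every hyphen that is followed by another hyphen.
    data = re.sub(r"-(?=-)", "- ", data)
    # Append a space if the comment would otherwise end in a hyphen.
    if data.endswith("-"):
        data += " "
    return data

def is_xml_char(ch):
    """Char production of XML 1.0 (assumed restriction)."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)

def coerce_text(data):
    """Replace FORM FEED with a space and other non-XML characters with U+FFFD."""
    return "".join(
        " " if ch == "\f" else (ch if is_xml_char(ch) else "\ufffd")
        for ch in data
    )
```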
If the tool has no way to convey out-of-band information, then the tool may drop the following information:
form element ancestor (use of the form element pointer in the parser)
template elements.
The mutations allowed by this section apply after the HTML parser's rules have been applied. For example, a <a::> start tag will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to then generate an actual element in the DOM with the name aU00003AU00003A for that start tag.
This section is non-normative.
This section examines some erroneous markup and discusses how the HTML parser handles these cases.
This section is non-normative.
The most-often discussed example of erroneous markup is as follows:
<p>1<b>2<i>3</b>4</i>5</p>
The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:
Here, the stack of open elements has five elements on it: html, body, p, b, and i. The list of active formatting elements just has two: b and i. The insertion mode is "in body".
Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formattingElement is the b element, and there is no furthest block. Thus, the stack of open elements ends up with just three elements: html, body, and p, while the list of active formatting elements has just one: i. The DOM tree is unmodified at this point.
The next token is a character ("4"), which triggers the reconstruction of the active formatting elements, in this case just the i element. A new i element is thus created for the "4" Text node. After the end tag token for the "i" is also received, and the "5" Text node is inserted, the DOM looks as follows:
This section is non-normative.
A case similar to the previous one is the following:
<b>1<p>2</b>3</p>
Up to the "2" the parsing here is straightforward:
The interesting part is when the end tag token with the tag name "b" is parsed.
Before that token is seen, the stack of open elements has four elements on it: html, body, b, and p. The list of active formatting elements just has the one: b. The insertion mode is "in body".
Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked, as in the previous example. However, in this case, there is a furthest block, namely the p element. Thus, this time the adoption agency algorithm isn't skipped over.
The common ancestor is the body element. A conceptual "bookmark" marks the position of the b in the list of active formatting elements, but since that list has only one element in it, the bookmark won't have much effect.
As the algorithm progresses, node ends up set to the formatting element (b), and last node ends up set to the furthest block (p).
The last node gets appended (moved) to the common ancestor, so that the DOM looks like:
A new b element is created, and the children of the p element are moved to it:
Finally, the new b element is appended to the p element, so that the DOM looks like:
The b element is removed from the list of active formatting elements and the stack of open elements, so that when the "3" is parsed, it is appended to the p element:
This section is non-normative.
Error handling in tables is, for historical reasons, especially strange. For example, consider the following markup:
<table><b><tr><td>aaa</td></tr>bbb</table>ccc
The highlighted b element start tag is not allowed directly inside a table like that, and the parser handles this case by placing the element before the table. (This is called foster parenting.) This can be seen by examining the DOM tree as it stands just after the table element's start tag has been seen:
...and then immediately after the b element start tag has been seen:
At this point, the stack of open elements has on it the elements html, body, table, and b (in that order, despite the resulting DOM tree); the list of active formatting elements just has the b element in it; and the insertion mode is "in table".
The tr start tag causes the b element to be popped off the stack and a tbody start tag to be implied; the tbody and tr elements are then handled in a rather straight-forward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:
Here, the stack of open elements has on it the elements html, body, table, tbody, and tr; the list of active formatting elements still has the b element in it; and the insertion mode is "in row".
The td element start tag token, after putting a td element on the tree, puts a marker on the list of active formatting elements (it also switches to the "in cell" insertion mode).
The marker means that when the "aaa" character tokens are seen, no b element is created to hold the resulting Text node:
The end tags are handled in a straight-forward manner; after handling them, the stack of open elements has on it the elements html, body, table, and tbody; the list of active formatting elements still has the b element in it (the marker having been removed by the "td" end tag token); and the insertion mode is "in table body".
Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original insertion mode set to "in table body"). The character tokens are collected, and when the next token (the table element end tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as per the "anything else" rules in the "in table" insertion mode, which defer to the "in body" insertion mode but with foster parenting.
When the active formatting elements are reconstructed, a b element is created and foster parented, and then the "bbb" Text node is appended to it:
The stack of open elements has on it the elements html, body, table, tbody, and the new b (again, note that this doesn't match the resulting tree!); the list of active formatting elements has the new b element in it; and the insertion mode is still "in table body".
Had the character tokens been only ASCII whitespace instead of "bbb", then that ASCII whitespace would just be appended to the tbody element.
Finally, the table is closed by a "table" end tag. This pops all the nodes from the stack of open elements up to and including the table element, but it doesn't affect the list of active formatting elements, so the "ccc" character tokens after the table result in yet another b element being created, this time after the table:
This section is non-normative.
Consider the following markup, which for this example we will assume is the document with URL https://example.com/inner, being rendered as the content of an iframe in another document with the URL https://example.com/outer:
<div id=a>
<script>
var div = document.getElementById('a');
parent.document.body.appendChild(div);
</script>
<script>
alert(document.URL);
</script>
</div>
<script>
alert(document.URL);
</script>
Up to the first "script" end tag, before the script is parsed, the result is relatively straightforward:
After the script is parsed, though, the div element and its child script element are gone:
They are, at this point, in the Document of the aforementioned outer browsing context. However, the stack of open elements still contains the div element.
Thus, when the second script element is parsed, it is inserted into the outer Document object.
Scripts parsed into different Documents than the one the parser was created for do not execute, so the first alert does not show.
Once the div element's end tag is parsed, the div element is popped off the stack, and so the next script element is in the inner Document:
This script does execute, resulting in an alert that says "https://example.com/inner".
This section is non-normative.
Elaborating on the example in the previous section, consider the case where the second script element is an external script (i.e. one with a src attribute). Since the element was not in the parser's Document when it was created, that external script is not even downloaded.
In a case where a script element with a src attribute is parsed normally into its parser's Document, but while the external script is being downloaded, the element is moved to another document, the script continues to download, but does not execute.
In general, moving script elements between Documents is considered a bad practice.
This section is non-normative.
The following markup shows how nested formatting elements (such as b) get collected and continue to be applied even as the elements they are contained in are closed, but that excessive duplicates are thrown away.
<!DOCTYPE html>
<p><b class=x><b class=x><b><b class=x><b class=x><b>X
<p>X
<p><b><b class=x><b>X
<p></b></b></b></b></b></b>X
The resulting DOM tree is as follows:
Note how the second p element in the markup has no explicit b elements, but in the resulting DOM, up to three of each kind of formatting element (in this case three b elements with the class attribute, and two unadorned b elements) get reconstructed before the element's "X".
Also note how this means that in the final paragraph only six b end tags are needed to completely clear the list of active formatting elements, even though nine b start tags have been seen up to this point.
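The counts in this example can be reproduced with a small model of the list of active formatting elements. This is a sketch under stated assumptions: entries are plain objects rather than real elements, namespaces are ignored, markers are the string "marker", and the rule applied (drop the earliest of three existing identical entries after the last marker before pushing a new one) is my reading of how the parser pushes onto the list. `pushFormattingElement` and `sameEntry` are hypothetical names.

```javascript
// Two entries are "the same kind" if tag name and attributes match exactly.
function sameEntry(a, b) {
  const ka = Object.keys(a.attrs);
  const kb = Object.keys(b.attrs);
  return a.tag === b.tag && ka.length === kb.length &&
    ka.every((k) => Object.prototype.hasOwnProperty.call(b.attrs, k) &&
                    a.attrs[k] === b.attrs[k]);
}

function pushFormattingElement(list, entry) {
  // Only entries after the last marker (if any) are considered.
  const start = list.lastIndexOf("marker") + 1;
  const matches = [];
  for (let i = start; i < list.length; i++) {
    if (sameEntry(list[i], entry)) matches.push(i);
  }
  if (matches.length >= 3) list.splice(matches[0], 1); // drop the earliest one
  list.push(entry);
}

const b = () => ({ tag: "b", attrs: {} });
const bx = () => ({ tag: "b", attrs: { class: "x" } });

const list = [];
// First paragraph of the markup: six b start tags.
for (const e of [bx(), bx(), b(), bx(), bx(), b()]) pushFormattingElement(list, e);
const afterFirstParagraph = list.length;  // 5: three with class=x, two without

// Third paragraph: three more b start tags.
for (const e of [b(), bx(), b()]) pushFormattingElement(list, e);
const afterThirdParagraph = list.length;  // 6, although nine start tags were seen
```

The model leaves out reconstruction, which replaces entries with fresh clones but does not change how many entries the list holds.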
For the purposes of the following algorithm, an element serializes as void if its element type is one of the void elements, or is basefont, bgsound, frame, keygen, or param.
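As a rough illustration, the check could be written as a set lookup. The list of void elements below is my understanding of the current definition and should be treated as an assumption to verify against the definition of void elements; `serializesAsVoid` is a hypothetical helper that takes the local name of an element assumed to be in the HTML namespace.

```javascript
// Assumed list of void elements; verify against the standard's definition.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
]);

// The extra legacy names called out in the paragraph above.
const LEGACY_VOID = new Set(["basefont", "bgsound", "frame", "keygen", "param"]);

function serializesAsVoid(localName) {
  return VOID_ELEMENTS.has(localName) || LEGACY_VOID.has(localName);
}
```

Keeping the legacy names in a separate set makes it visible that they affect serialization only, not the definition of void elements itself.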
The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM Element, Document, or DocumentFragment referred to as the node, a boolean serializableShadowRoots, and a sequence<ShadowRoot> shadowRoots, and returns a string.
This algorithm serializes the children of the node being serialized, not the node itself.
If the node serializes as void, then return the empty string.
Let s be a string, and initialize it to the empty string.
If the node is a template element, then let the node instead be the template element's template contents (a DocumentFragment node).
If current node is a shadow host, then:
Let shadow be current node's shadow root.
If one of the following is true:
serializableShadowRoots is true and shadow's serializable is true; or
shadowRoots contains shadow,
then:
Append "<template shadowrootmode="".
If shadow's mode is "open", then append "open". Otherwise, append "closed".
Append """.
If shadow's delegates focus is set, then append " shadowrootdelegatesfocus=""".
If shadow's serializable is set, then append " shadowrootserializable=""".
If shadow's clonable is set, then append " shadowrootclonable=""".
Append ">".
Append the value of running the HTML fragment serialization algorithm with shadow, serializableShadowRoots, and shadowRoots (thus recursing into this algorithm for that element).
Append "</template>".
For each child node of the node , in tree order , run the following steps:
Let current node be the child node being processed.
Append the appropriate string from the following list to s:
Element
If current node is an element in the HTML namespace, the MathML namespace, or the SVG namespace, then let tagname be current node's local name. Otherwise, let tagname be current node's qualified name.
Append a U+003C LESS-THAN SIGN character (<), followed by tagname.
For HTML elements created by the HTML parser or createElement(), tagname will be lowercase.
If current node's is value is not null, and the element does not have an is attribute in its attribute list, then append the string " is="", followed by current node's is value escaped as described below in attribute mode, followed by a U+0022 QUOTATION MARK character (").
For each attribute that the element has, append a U+0020 SPACE character, the attribute's serialized name as described below, a U+003D EQUALS SIGN character (=), a U+0022 QUOTATION MARK character ("), the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK character (").
An attribute's serialized name for the purposes of the previous paragraph must be determined as follows:
If the attribute has no namespace
The attribute's serialized name is the attribute's local name.
For attributes on HTML elements set by the HTML parser or by setAttribute(), the local name will be lowercase.
If the attribute is in the XML namespace
The attribute's serialized name is the string "xml:" followed by the attribute's local name.
If the attribute is in the XMLNS namespace and the attribute's local name is xmlns
The attribute's serialized name is the string "xmlns".
If the attribute is in the XMLNS namespace and the attribute's local name is not xmlns
The attribute's serialized name is the string "xmlns:" followed by the attribute's local name.
If the attribute is in the XLink namespace
The attribute's serialized name is the string "xlink:" followed by the attribute's local name.
If the attribute is in some other namespace
The attribute's serialized name is the attribute's qualified name.
While the exact order of attributes is implementation-defined, and may depend on factors such as the order that the attributes were given in the original markup, the sort order must be stable, such that consecutive invocations of this algorithm serialize an element's attributes in the same order.
Append a U+003E GREATER-THAN SIGN character (>).
If current node serializes as void, then continue on to the next child node at this point.
Append the value of running the HTML fragment serialization algorithm with current node, serializableShadowRoots, and shadowRoots (thus recursing into this algorithm for that node), followed by a U+003C LESS-THAN SIGN character (<), a U+002F SOLIDUS character (/), tagname again, and finally a U+003E GREATER-THAN SIGN character (>).
Text node
If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data literally.
Otherwise, append the value of current node's data, escaped as described below.
Comment
Append "<!--" (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of current node's data, followed by the literal string "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
ProcessingInstruction
Append "<?" (U+003C LESS-THAN SIGN, U+003F QUESTION MARK), followed by the value of current node's target IDL attribute, followed by a single U+0020 SPACE character, followed by the value of current node's data, followed by a single U+003E GREATER-THAN SIGN character (>).
DocumentType
Append "<!DOCTYPE" (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE), followed by the value of current node's name, followed by ">" (U+003E GREATER-THAN SIGN).
Return s .
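The core of the steps above can be sketched on plain objects. This is a simplified illustration, not the algorithm itself: the node shapes are assumptions made for the example, and shadow roots, template contents, is values, namespaced attributes, processing instructions and the noscript case are all left out. The escaping helper assumes the replacement strings are the usual character references ("&amp;", "&nbsp;", "&quot;", "&lt;", "&gt;").

```javascript
// Assumed node shapes (not real DOM nodes):
//   { type: "element", name, attrs: [[name, value], ...], children: [...] }
//   { type: "text", data }  { type: "comment", data }  { type: "doctype", name }
const VOID = new Set(["area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
  "basefont", "bgsound", "frame", "keygen", "param"]);
const RAW_TEXT_PARENTS = new Set(["style", "script", "xmp", "iframe", "noembed",
  "noframes", "plaintext"]);

// "&" is replaced first so that later replacements are not escaped again.
function escapeString(s, attributeMode) {
  s = s.replace(/&/g, "&amp;").replace(/\u00A0/g, "&nbsp;");
  return attributeMode
    ? s.replace(/"/g, "&quot;")
    : s.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Serializes the children of `node`, not `node` itself.
function serializeChildren(node) {
  if (node.type === "element" && VOID.has(node.name)) return "";
  let s = "";
  for (const child of node.children || []) {
    switch (child.type) {
      case "element": {
        s += "<" + child.name;
        for (const [name, value] of child.attrs || []) {
          s += " " + name + '="' + escapeString(value, true) + '"';
        }
        s += ">";
        if (VOID.has(child.name)) break; // no children, no end tag
        s += serializeChildren(child) + "</" + child.name + ">";
        break;
      }
      case "text":
        // Text inside raw-text parents is written out literally.
        s += node.type === "element" && RAW_TEXT_PARENTS.has(node.name)
          ? child.data
          : escapeString(child.data, false);
        break;
      case "comment":
        s += "<!--" + child.data + "-->";
        break;
      case "doctype":
        s += "<!DOCTYPE " + child.name + ">";
        break;
    }
  }
  return s;
}
```

Note that comment data and raw-text data are emitted without any escaping, which is why some trees cannot survive a serialize and reparse step.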
It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure. Tree structures that do not roundtrip a serialize and reparse step can also be produced by the HTML parser itself, although such cases are typically non-conforming.
For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text control. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains "-->", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. More examples would be making a script element contain a Text node with the text string "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).
This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font family names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters "</style><script>attack</script>" as a font family name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.
For example, consider the following markup:
<form id="outer"><div></form><form id="inner"><input>
This will be parsed into:
The input element will be associated with the inner form element. Now, if this tree structure is serialized and reparsed, the <form id="inner"> start tag will be ignored, and so the input element will be associated with the outer form element instead.
<html><head></head><body><form id="outer"><div><form id="inner"><input></form></div></form></body></html>
As another example, consider the following markup:
<a><table><a>
This will be parsed into:
That is, the a elements are nested, because the second a element is foster parented. After a serialize-reparse roundtrip, the a elements and the table element would all be siblings, because the second <a> start tag implicitly closes the first a element.
<html><head></head><body><a><a></a><table></table></a></body></html>
For historical reasons, this algorithm does not round-trip an initial U+000A LINE FEED (LF) character in pre, textarea, or listing elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A LINE FEED (LF) character.
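A toy model makes the asymmetry concrete. The two functions below are hypothetical stand-ins for the parser and the serializer, reduced to the only behavior that matters here; the sample input assumes markup in which the pre start tag is followed by two line feeds and then the text "Hello.".

```javascript
// Parsing: a single LF immediately after the start tag is dropped.
function parsePreText(innerMarkup) {
  return innerMarkup.startsWith("\n") ? innerMarkup.slice(1) : innerMarkup;
}

// Serializing: the text is written back without re-adding a LF.
// (Escaping is ignored because the sample text needs none.)
function serializePreText(text) {
  return text;
}

const firstParse = parsePreText("\n\nHello.");                  // "\nHello."
const secondParse = parsePreText(serializePreText(firstParse)); // "Hello."
```

Each roundtrip therefore removes one leading line feed until none are left.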
For example, consider the following markup:
<pre>

Hello.</pre>
When this document is first parsed, the pre element's child text content starts with a single newline character. After a serialize-reparse roundtrip, the pre element's child text content is simply "Hello.".
Because of the special role of the is attribute in signaling the creation of customized built-in elements, in that it provides a mechanism for parsed HTML to set the element's is value, we special-case its handling during serialization. This ensures that an element's is value is preserved through serialize-parse roundtrips.
When creating a customized built-in element via the parser, a developer uses the is attribute directly; in such cases serialize-parse roundtrips work fine.
<script>
window.SuperP = class extends HTMLParagraphElement {};
customElements.define("super-p", SuperP, { extends: "p" });
</script>
<div id="container"><p is="super-p">Superb!</p></div>
<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>
But when creating a customized built-in element via its constructor or via createElement(), the is attribute is not added. Instead, the is value (which is what the custom elements machinery uses) is set without intermediating through an attribute.
<script>
container.innerHTML = "";
const p = document.createElement("p", { is: "super-p" });
container.appendChild(p);
// The is attribute is not present in the DOM:
console.assert(!p.hasAttribute("is"));
// But the element is still a super-p:
console.assert(p instanceof SuperP);
</script>
To ensure that serialize-parse roundtrips still work, the serialization process explicitly writes out the element's is value as an is attribute:
<script>
console.log(container.innerHTML); // <p is="super-p">
container.innerHTML = container.innerHTML;
console.log(container.innerHTML); // <p is="super-p">
console.assert(container.firstChild instanceof SuperP);
</script>
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
The HTML fragment parsing algorithm, given an Element node context, string input, and an optional boolean allowDeclarativeShadowRoots (default false) is the following steps. They return a list of zero or more nodes.
Parts marked fragment case in algorithms in the HTML parser section are parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.
Let document be a Document node whose type is "html".
If context's node document is in quirks mode, then set document's mode to "quirks".
Otherwise, if context's node document is in limited-quirks mode, then set document's mode to "limited-quirks".
If allowDeclarativeShadowRoots is true, then set document's allow declarative shadow roots to true.
Create a new HTML parser, and associate it with document.
Set the state of the HTML parser's tokenization stage as follows, switching on the context element:
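title
textarea
Switch the tokenizer to the RCDATA state.
style
xmp
iframe
noembed
noframes
Switch the tokenizer to the RAWTEXT state.
script
Switch the tokenizer to the script data state.
noscript
If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
plaintext
Switch the tokenizer to the PLAINTEXT state.
Any other element
Leave the tokenizer in the data state.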
For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.
Let root be the result of creating an element given document, "html", and the HTML namespace.
Append root to document.
Set up the HTML parser's stack of open elements so that it contains just the single element root.
If context is a template element, then push "in template" onto the stack of template insertion modes so that it is the new current template insertion mode.
Create a start tag token whose name is the local name of context and whose attributes are the attributes of context.
Let this start tag token be the start tag token of context; e.g. for the purposes of determining if it is an HTML integration point.
Reset the parser's insertion mode appropriately.
The parser will reference the context element as part of that algorithm.
Set the HTML parser's form element pointer to the nearest node to context that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), if any. (If there is no such form element, the form element pointer keeps its initial value, null.)
Place the input into the input stream for the HTML parser just created. The encoding confidence is irrelevant.
Start the HTML parser and let it run until it has consumed all the characters just inserted into the input stream.
Return root's children, in tree order.
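Two of the setup steps above lend themselves to a short sketch: choosing the tokenizer's starting state from the context element, and finding the form element pointer. Both helpers are hypothetical and work on simplified inputs (a local name string assumed to belong to an HTML element, and plain objects with `name` and `parent` properties). The state mapping reflects my reading of the tokenizer step and should be checked against it.

```javascript
// Picks the tokenizer's starting state for a given context element name.
function initialTokenizerState(contextName, scriptingEnabled) {
  switch (contextName) {
    case "title":
    case "textarea":
      return "RCDATA";
    case "style":
    case "xmp":
    case "iframe":
    case "noembed":
    case "noframes":
      return "RAWTEXT";
    case "script":
      return "script data";
    case "noscript":
      return scriptingEnabled ? "RAWTEXT" : "data";
    case "plaintext":
      return "PLAINTEXT";
    default:
      return "data";
  }
}

// Walks up from the context element (inclusive) to the nearest form element.
// Returns null when there is none, matching the pointer's initial value.
function nearestForm(context) {
  for (let el = context; el; el = el.parent) {
    if (el.name === "form") return el;
  }
  return null;
}
```

In a browser, these steps run behind APIs such as the innerHTML setter, where the element being assigned to serves as the context element.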