Description
Summary
When parsing HTML with Dom\HTMLDocument::createFromString(), a </noscript> end tag inside <head> is not handled correctly in the HTML5 parser path.
As a result, subsequent head elements (for example <link>) are incorrectly inserted as children of <noscript>.
This is a behavior bug in the Lexbor HTML tree-construction path used by Dom\HTMLDocument, not in legacy DOMDocument::loadHTML().
Affected component
- PHP ext/dom modern HTML5 parser (Dom\HTMLDocument)
- Vendored Lexbor tree insertion mode implementation: ext/lexbor/lexbor/html/tree/insertion_mode/in_head_noscript.c
Environment
- PHP: 8.5.1 (also reproduced while inspecting 8.5.3 source tree)
- libxml runtime: 2.9.13
- API used: Dom\HTMLDocument::createFromString()
Reproducer
<?php
$html = '<!DOCTYPE html><html><head>
<noscript>
<style>body { margin: 0; }</style>
</noscript>
<link href="/style.css" rel="stylesheet">
</head><body></body></html>';
$doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
echo $doc->saveHTML(), PHP_EOL;
$link = $doc->getElementsByTagName('link')->item(0);
echo "Link parent: ", $link->parentNode->nodeName, PHP_EOL;
Actual result
- Serialized tree effectively moves </noscript> to after <link>.
- link->parentNode->nodeName is NOSCRIPT.
Example output:
<!DOCTYPE html><html><head>
<noscript>
<style>body { margin: 0; }</style>
<link href="/style.css" rel="stylesheet">
</noscript></head><body></body></html>
Link parent: NOSCRIPT
Expected result
- </noscript> should close the <noscript> element.
- <link> should be a direct child of <head>.
- link->parentNode->nodeName should be HEAD.
Control comparison
Using the legacy parser path:
$d = new DOMDocument();
@$d->loadHTML($html, LIBXML_NOERROR);
echo $d->getElementsByTagName('link')->item(0)->parentNode->nodeName;
The result is head (as expected), confirming that the issue is specific to the modern HTML5 parser path.
Root cause analysis
The closing-tag handler for in-head-noscript insertion mode does not implement handling for </noscript>.
Current code:
- File: ext/lexbor/lexbor/html/tree/insertion_mode/in_head_noscript.c:95
- Function: lxb_html_tree_insertion_mode_in_head_noscript_closed(...)
Behavior:
- If closing tag is </br>, it routes to anything_else.
- Otherwise it emits parse error (LXB_HTML_RULES_ERROR_UNTO) and returns true.
- It never handles LXB_TAG_NOSCRIPT, never pops <noscript>, and never restores tree->mode = in_head.
Because the open-elements stack still has <noscript> as current node, the next <link> token (delegated to in_head) is inserted under <noscript>.
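The effect of the missing pop can be illustrated with a toy model of the open-elements stack. This is a sketch only, not Lexbor code: the function name buildHead(), the token strings, and the $handleNoscriptEnd flag are all hypothetical and exist purely for illustration.

```php
<?php
// Toy model only: this is NOT Lexbor code. All names here are hypothetical.
function buildHead(array $tokens, bool $handleNoscriptEnd): array
{
    $stack = ['head'];   // open-elements stack; the last entry is the current node
    $parents = [];       // element name => name of the node it was inserted under

    foreach ($tokens as $token) {
        if ($token === '</noscript>') {
            if ($handleNoscriptEnd && end($stack) === 'noscript') {
                array_pop($stack);   // close <noscript>; <head> is current again
            }
            // Otherwise the end tag is dropped, as described in the issue.
            continue;
        }

        $name = trim($token, '<>');
        $parents[$name] = end($stack);   // insert under the current node
        if ($name === 'noscript') {
            $stack[] = $name;            // <noscript> stays open until popped
        }
    }

    return $parents;
}

$tokens = ['<noscript>', '</noscript>', '<link>'];
echo buildHead($tokens, false)['link'], PHP_EOL; // end tag ignored
echo buildHead($tokens, true)['link'], PHP_EOL;  // end tag pops <noscript>
```

With the end tag ignored, the model reports noscript as the parent of link; with the pop in place it reports head, which mirrors the actual and expected results above.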
Suggested fix direction
In lxb_html_tree_insertion_mode_in_head_noscript_closed(...), add explicit handling for LXB_TAG_NOSCRIPT:
- Verify current node is noscript (or report parse error if not in expected state).
- Pop current node from open-elements stack.
- Set tree->mode = lxb_html_tree_insertion_mode_in_head.
- Return true.
This should match intended HTML5 tree-construction behavior for closing noscript in this insertion mode.
Suggested regression test
Add a DOM test that parses:
<!doctype html><html><head><noscript></noscript><link rel="stylesheet" href="/x.css"></head><body></body></html>
And asserts:
- getElementsByTagName("link")[0]->parentNode->nodeName === "HEAD"
- serialization does not place <link> inside <noscript>.
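A minimal sketch of such a check as a standalone script follows, assuming PHP 8.4+ with ext/dom enabled. The variable names and the PASS/FAIL output are illustrative choices, not an existing php-src test, and the nesting check inspects the parsed tree rather than the serialized string. On an affected build it is expected to report FAIL.

```php
<?php
// Sketch of a regression check; assumes PHP 8.4+ with ext/dom enabled.
$html = '<!doctype html><html><head><noscript></noscript>'
      . '<link rel="stylesheet" href="/x.css"></head><body></body></html>';

$doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

$link = $doc->getElementsByTagName('link')->item(0);
$noscript = $doc->getElementsByTagName('noscript')->item(0);

// Name of the element <link> was inserted under; "HEAD" is the expected value.
$linkParent = $link->parentNode->nodeName;
// Number of <link> elements nested inside <noscript>; 0 is the expected value.
$nestedLinks = $noscript->getElementsByTagName('link')->length;

echo 'Link parent: ', $linkParent, PHP_EOL;
echo 'Links inside noscript: ', $nestedLinks, PHP_EOL;
echo ($linkParent === 'HEAD' && $nestedLinks === 0) ? 'PASS' : 'FAIL', PHP_EOL;
```

To land in php-src, the script body would still need to be wrapped in the usual .phpt sections with the expected output recorded.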
Notes
- This issue is independent from libxml2 legacy HTML parser behavior.
- It appears in the Lexbor-based parser path used by Dom\HTMLDocument.
PHP Version
PHP 8.5.1 (cli) (built: Dec 16 2025 15:59:07) (NTS)
Copyright (c) The PHP Group
Built by Homebrew
Zend Engine v4.5.1, Copyright (c) Zend Technologies
with Zend OPcache v8.5.1, Copyright (c), by Zend Technologies
Also in 8.5.3, compare https://3v4l.org/TmBjH#v8.5.3
Operating System
No response