DOM

A short note about parsing HTML and XML documents using the W3C’s DOM (Document Object Model), as implemented in Java. Learning to parse with the DOM is worthwhile because it is a widely implemented standard: once you know how it works, you can parse HTML (and XML) documents in JavaScript, Python, .NET, …

Valid HTML

This should successfully parse any document that can be transformed to a DOM according to the W3C standard, including valid HTML documents in XML syntax. It uses the bootstrapping approach (described in the DOM Level 3 Core Specification) and the LS feature (described in the DOM Level 3 Load and Save specification).

Obtain a DOM object from a valid HTML document in XML syntax using DOM Level 3 Load and Save
Document doc;
Path input = Path.of("input.xhtml");

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
doc = builder.parseURI(input.toUri().toString());

Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());
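When the markup is already in memory rather than in a file, the same LS parser can read it through an LSInput. Here is a minimal sketch, assuming a hypothetical helper class LsStringParser (not part of the snippets above); it wraps the checked bootstrapping exceptions so that it can be called from anywhere.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

/** Hypothetical helper: parses XML-syntax markup held in a String using DOM Level 3 LS. */
final class LsStringParser {
  static Document parse(String xml) {
    try {
      DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
      DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
      LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
      // An LSInput can wrap a string, a character stream, a byte stream or a system id.
      LSInput input = impl.createLSInput();
      input.setStringData(xml);
      return parser.parse(input);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not bootstrap a DOM implementation", e);
    }
  }

  public static void main(String[] args) {
    String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/><body/></html>";
    // Prints the tag name of the document element.
    System.out.println(parse(xhtml).getDocumentElement().getTagName());
  }
}
```

Malformed input makes parser.parse throw an unchecked LSException, so callers that accept untrusted markup may want to catch it.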

XML syntax

This approach uses the JAXP DocumentBuilder API (which relies on SAX types such as InputSource and SAXException) rather than the standard bootstrapping approach. Prefer the previous approach where applicable.

Obtain a DOM object from a valid HTML document in XML syntax using JAXP (SAX-based)
Document doc;
Path input = Path.of("input.xhtml");

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
doc = builder.parse(input.toUri().toString());

Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());

(To manipulate general XML, add factory.setNamespaceAware(true); before creating the builder.)
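To see what the namespace-aware flag changes, the sketch below parses the same string twice and reports the namespace URI of the root element. The class NamespaceAwareness and its method are hypothetical names used for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo of DocumentBuilderFactory.setNamespaceAware. */
final class NamespaceAwareness {
  /** Returns the namespace URI of the root element, or null if the parser recorded none. */
  static String rootNamespace(String xml, boolean namespaceAware) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Must be set before the builder is created.
      factory.setNamespaceAware(namespaceAware);
      DocumentBuilder builder = factory.newDocumentBuilder();
      Element root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
      return root.getNamespaceURI();
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/><body/></html>";
    System.out.println("aware: " + rootNamespace(xhtml, true));
    System.out.println("not aware: " + rootNamespace(xhtml, false));
  }
}
```

Without the flag, the xmlns declaration is kept as an ordinary attribute and methods such as getNamespaceURI and getLocalName return null, which is why namespace-based lookups silently find nothing.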

Real-life HTML

HTML documents in the wild are seldom valid. You may use the jsoup library for these cases.

Obtain a DOM object from a real-life HTML document
Document doc;
File inputFile = new File("input.html");

org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(inputFile, StandardCharsets.UTF_8.name());
doc = new W3CDom().fromJsoup(jsoupDoc);

Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());

Writing HTML

Here is how to programmatically write an HTML document (in XML syntax, with namespaces). The code assumes a private static final String XHTML_NAME_SPACE = "http://www.w3.org/1999/xhtml"; field.

Write HTML in XML syntax
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.newDocument();

Element html = document.createElementNS(XHTML_NAME_SPACE, "html");
html.setAttribute("lang", "en");
document.appendChild(html);
Element head = document.createElementNS(XHTML_NAME_SPACE, "head");
html.appendChild(head);
Element meta = document.createElementNS(XHTML_NAME_SPACE, "meta");
meta.setAttribute("http-equiv", "Content-type");
meta.setAttribute("content", "text/html; charset=utf-8");
head.appendChild(meta);
Element body = document.createElementNS(XHTML_NAME_SPACE, "body");
html.appendChild(body);
/** And so on. */
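The “and so on” part usually means adding text. Here is a self-contained sketch of one way to continue, assuming a hypothetical XhtmlBuilder class with a title and a single paragraph (these names and the chosen content are illustrative only).

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: builds a small XHTML document with a title and one paragraph. */
final class XhtmlBuilder {
  static final String XHTML_NAME_SPACE = "http://www.w3.org/1999/xhtml";

  static Document build(String titleText, String paragraphText) {
    Document document;
    try {
      document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element html = document.createElementNS(XHTML_NAME_SPACE, "html");
    html.setAttribute("lang", "en");
    document.appendChild(html);

    Element head = document.createElementNS(XHTML_NAME_SPACE, "head");
    html.appendChild(head);
    Element title = document.createElementNS(XHTML_NAME_SPACE, "title");
    // Text goes into a text node; the serializer escapes characters such as & and <.
    title.appendChild(document.createTextNode(titleText));
    head.appendChild(title);

    Element body = document.createElementNS(XHTML_NAME_SPACE, "body");
    html.appendChild(body);
    Element p = document.createElementNS(XHTML_NAME_SPACE, "p");
    p.appendChild(document.createTextNode(paragraphText));
    body.appendChild(p);
    return document;
  }

  public static void main(String[] args) {
    Document document = build("Demo", "Fish & chips");
    System.out.println(document.getElementsByTagNameNS(XHTML_NAME_SPACE, "p").item(0).getTextContent());
  }
}
```

Because the elements are created with createElementNS, they can later be found with getElementsByTagNameNS using the same namespace constant.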

Document to String

And here is how to print your document to a string.

Pretty print a Document into a String
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSSerializer ser = impl.createLSSerializer();
ser.getDomConfig().setParameter("format-pretty-print", true);
/** Do not use ser.writeToString: it uses UTF-16. */
LSOutput output = impl.createLSOutput();
StringWriter writer = new StringWriter();
output.setCharacterStream(writer);
ser.write(document, output);
return writer.toString();
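If the goal is a file or a network stream rather than a String, the LSOutput can wrap a byte stream and name the encoding explicitly. The following is a sketch under that assumption; DomWriter, sample and toUtf8Bytes are hypothetical names, and the tiny sample document only serves to make the example runnable.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

/** Hypothetical helper: serializes a Document to UTF-8 bytes with DOM Level 3 LS. */
final class DomWriter {
  private static DOMImplementation implementation() {
    try {
      return DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not bootstrap a DOM implementation", e);
    }
  }

  /** Creates a minimal document so that the example is self-contained. */
  static Document sample() {
    Document document = implementation().createDocument("http://www.w3.org/1999/xhtml", "html", null);
    document.getDocumentElement().setAttribute("lang", "en");
    return document;
  }

  static byte[] toUtf8Bytes(Document document) {
    DOMImplementationLS impl = (DOMImplementationLS) implementation();
    LSSerializer ser = impl.createLSSerializer();
    LSOutput output = impl.createLSOutput();
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    // With a byte stream, the serializer does the character encoding itself.
    output.setByteStream(bytes);
    output.setEncoding("UTF-8");
    ser.write(document, output);
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(new String(toUtf8Bytes(sample()), StandardCharsets.UTF_8));
  }
}
```

To write to a file, replace the ByteArrayOutputStream with an output stream obtained from Files.newOutputStream and close it afterwards.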

Validate your XML

Validating the XML on the fly while writing it using an LSSerializer seems unsupported. An alternative approach is to use the Java Validation API, as illustrated below (thanks).

Validate an HTML document in XML syntax
Schema schema = SchemaFactory.newDefaultInstance().newSchema(getClass().getResource("xhtml1-strict.xsd"));
DOMSource docAsSource = new DOMSource();
docAsSource.setNode(document);
schema.newValidator().validate(docAsSource);
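The same API can be tried without the XHTML schema file. The sketch below uses a made-up, one-element schema held in a string (the schema, the note and to element names, and the TinyValidation class are all assumptions for illustration) and reports validity as a boolean instead of letting the exception escape.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo: validates a DOM document against a tiny inline XML Schema. */
final class TinyValidation {
  private static final String SCHEMA =
      "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"note\"><xs:complexType><xs:sequence>"
          + "<xs:element name=\"to\" type=\"xs:string\"/>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:schema>";

  static boolean isValid(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Schema validation of a DOM tree needs namespace-aware nodes.
      factory.setNamespaceAware(true);
      Document document = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
      Schema schema = SchemaFactory.newDefaultInstance().newSchema(new StreamSource(new StringReader(SCHEMA)));
      try {
        schema.newValidator().validate(new DOMSource(document));
        return true;
      } catch (SAXException e) {
        // With no ErrorHandler set, the validator throws on the first validation error.
        return false;
      }
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isValid("<note><to>Ann</to></note>"));
    System.out.println(isValid("<note><from>Bob</from></note>"));
  }
}
```

To collect every problem rather than stop at the first one, set an ErrorHandler on the validator before calling validate.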

About dependencies

The above code samples run without any added dependencies. In particular, do not include xml-apis:xml-apis among your dependencies. The java.xml module of the JRE probably provides everything you might need from xml-apis, and conflicts will occur if both are reachable from your project. You might have to explicitly exclude xml-apis:xml-apis from the transitive dependencies of your dependencies.

If possible, do not include xerces:xercesImpl either, for a similar reason: the JRE also includes the Xerces2 code as part of the java.xml module. Note however that there’s a subtlety: the Xerces2 implementation included with the JRE is private, and its packages have been renamed. For example, the package org.apache.xerces becomes com.sun.org.apache.xerces.internal inside the JRE. Thus, one reason for including xercesImpl is that one of your dependencies (incorrectly, probably) itself wants access to the Xerces implementation classes (e.g., it tries to explicitly load org.apache.xerces.dom.DocumentImpl, as org.apache.odftoolkit:simple-odf:0.8.2-incubating does). This will (hopefully) not create problems, provided you exclude xml-apis:xml-apis from xercesImpl’s transitive dependencies, as indicated above.
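A quick way to check which implementation your project actually picks up is to print the class name of the factory that JAXP returns. This is only a diagnostic sketch; WhichParser is a hypothetical class name.

```java
import javax.xml.parsers.DocumentBuilderFactory;

/** Hypothetical diagnostic: reports which DocumentBuilderFactory implementation is in use. */
final class WhichParser {
  static String factoryClassName() {
    // JAXP looks up the implementation at run time, so the result depends on the class path.
    return DocumentBuilderFactory.newInstance().getClass().getName();
  }

  public static void main(String[] args) {
    System.out.println(factoryClassName());
  }
}
```

With only the JDK reachable, I would expect a name in the com.sun.org.apache.xerces.internal package; a name starting with org.apache.xerces suggests that xercesImpl is on the class path.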

Other libraries for XML and DOM parsing

If you want something more powerful than the W3C DOM API included in the JDK, check out xom or perhaps dom4j, which seems very popular (I have not used these libraries).

Refs

  • Parsing from DOM and related technologies in Java: see the JAXP tutorials here (focus on the parts related to the DOM) and there.

    • (JAXP (JSR 206) has been withdrawn as a standalone project following its inclusion in OpenJDK 7, see section 11.5 in the specification PDF of JAXP 1.6 (Maintenance Release 3, 4 March 2014).)

  • dom4j, a well-written library for simpler DOM manipulation

  • Comparison of HTML parsers (Wikipedia)

  • W3C DOM4 (Recommendation 19 November 2015), a snapshot of the DOM Living Standard

  • xom, seems to be high-quality