From e06a877be8642e3f13f0b57028a11774760ca49e Mon Sep 17 00:00:00 2001
From: Dominic Ricottone
Date: Thu, 27 Oct 2022 11:15:52 -0500
Subject: [PATCH] New post

---
 content/posts/html_and_xml.md | 110 ++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 content/posts/html_and_xml.md

diff --git a/content/posts/html_and_xml.md b/content/posts/html_and_xml.md
new file mode 100644
index 0000000..5be14c1
--- /dev/null
+++ b/content/posts/html_and_xml.md
@@ -0,0 +1,110 @@
+---
+title: HTML and XML
+date: "2022-10-27T11:08:41-05:00"
+draft: false
+---
+
+I'm confident that many new and self-taught programmers have *at least once*
+entered into a Google or StackOverflow search something along the lines of
+'minidom parse html' or 'parse html to xml'.
+Web scrapers are an excellent self-teaching project.
+And angle-bracketed markup languages look the same.
+So what would be *unusual* about using Python's highly popular minidom parser
+to scrape a web page?
+
+Searching those phrases will land you on a page enumerating all the reasons
+*why* it's not just unusual, but *bad*.
+A sample:
+
+ 1. Real-world HTML documents are poorly formed, with unclosed elements.
+ 2. The HTML spec includes a variety of self-closing elements, like `<br>`.
+    HTML5 has only made this more pervasive.
+ 3. Element attributes aren't always specified with quotes around the value.
+
+And these are all perfectly valid points (if you aren't using a SAX parser).
+I don't want to take away from the sound advice that HTML documents should be
+fed into an HTML parser.
+
+But let's consider the alternative: *doing it anyway.*
+
+----
+
+There are some bumps on this inadvisable road.
+
+While many HTML documents could technically validate as XHTML, there is almost
+always going to be one sticking point: the first line[^1].
+There are options.
+I've always taken the brute force approach of *scan past the first line before
+parsing.*
+Inelegant? Yes. But it works.
+
+And as mentioned above, some HTML elements are self-closing.
+While an XML parser *should* be able to handle `<br/>`, it will be confused by
+`<br>`.
+But there are two important counter-points.
+
+ 1. *Will it though?*
+    A SAX parser can just skip over the token.
+ 2. *Does it matter?*
+    Visual cues like line breaks are uninteresting outside of rendering a
+    complete web page.
+
+In summary, parsing a complete DOM from an HTML document using an XML parser is
+not possible.
+But rarely do you need the *complete* DOM.
+Figure out what data is actually necessary and ignore the rest.
+
+----
+
+On the other side of things, why might one *want* to use an XML parser on
+HTML documents?
+
+For one, I have yet to see an HTML parser expose a SAX-like API.
+
+In many circumstances, you *already* need an XML parser.
+Configuring and testing an HTML parser on top of that can only add challenges.
+Even if the runtime overhead is negligible, the source code is made more
+complex with another API.
+Code gets duplicated between the two.
+Function and class names get bikeshedded, and usually elongated with a prefix,
+just to indicate which format they work with.
+Code churn for the sake of code churn is not a great idea.
+
+----
+
+At the end of the day, most of the HTML scraping I do is supervised ingestion.
+My advice for designing an automated pipeline would be vastly different.
+But for my personal projects, a hack job is just as good.
+
+
+[^1]: Compare the XML declaration, document type declaration, and root
+    element for XHTML:
+
+    ```
+
+
+
+    ```
+
+    To the document type declaration, root element, and charset declaration for
+    HTML4:
+
+    ```
+
+
+
+    ```
+
+    To that for HTML5:
+
+    ```
+
+
+
+    ```
+
+    While all of this is fine and dandy for specifying the behavior of web browsers,
+    which must have comparable visual renderings of web pages across the specifications...
+    *absolutely none of this is useful for a script.*
+
+
-- 
2.45.2