~dricottone/blog

ref: 8764dd6baadd50e792ad108e8bed227c0bda654d blog/content/posts/html_and_xml.md -rw-r--r-- 3.7 KiB
8764dd6bDominic Ricottone Minor comment update 1 year, 8 months ago
                                                                                
---
title: HTML and XML
date: "2022-10-27T11:08:41-05:00"
draft: false
---

I'm confident that many new and self-taught programmers have *at least once*
entered into a Google or StackOverflow search something along the lines of
'minidom parse html' or 'parse html to xml'.
Web scrapers are an excellent self-teaching project.
And angle-bracketed markup languages look the same.
So what would be *unusual* about using Python's highly-popular minidom parser
to scrape a web page?

Searching those phrases will land you on a page enumerating all the reasons
*why* it's not just unusual, but *bad*.
A sample:

 1. Real-world HTML documents are poorly formed, with unclosed elements.
 2. The HTML spec includes a variety of self-closing elements, like `<br>`.
    HTML5 has only made this more pervasive.
 3. Element attributes aren't always specified with quotes around the value.
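
To make those objections concrete, here is a minimal sketch that feeds a few
HTML-ish fragments to minidom.
The helper name `parses_as_xml` and the fragments themselves are made up for
illustration; they are not taken from any real page.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses_as_xml(markup: str) -> bool:
    """Report whether minidom accepts the markup as well-formed XML."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True


# Hypothetical fragments: one per objection, plus a well-formed control.
fragments = {
    "unclosed element": "<ul><li>one<li>two</ul>",
    "bare <br>": "<p>line<br>break</p>",
    "unquoted attribute": "<a href=/home>home</a>",
    "control": '<p>line<br />break <a href="/home">home</a></p>',
}

for label, markup in fragments.items():
    print(label, parses_as_xml(markup))
```

Only the control fragment should come back as parseable; the other three each
trip a different well-formedness rule.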

And these are all perfectly valid points (if you aren't using a SAX parser).
I don't want to take away from the sound advice that HTML documents should be
fed into an HTML parser.

But let's consider the alternative: *doing it anyway.*

----

There are some bumps on this inadvisable road.

While many HTML documents could technically validate as XHTML, there is almost
always going to be one sticking point: the first line[^1].
There are options.
I've always taken the brute-force approach: *scan past the first line before
parsing.*
Inelegant? Yes. But it works.
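
A rough sketch of that brute-force approach, assuming the document is already
in memory as a string and that everything after the first line happens to be
well-formed.
The function name `parse_after_first_line` and the sample page are
hypothetical.

```python
from xml.dom import minidom


def parse_after_first_line(markup: str) -> minidom.Document:
    """Drop everything up to the first newline, then parse the rest as XML."""
    _first_line, _newline, rest = markup.partition("\n")
    return minidom.parseString(rest)


# Hypothetical page: an HTML5 first line followed by XML-friendly markup.
PAGE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8" /><title>Example</title></head>
<body><p>Hello</p></body>
</html>
"""

document = parse_after_first_line(PAGE)
title = document.getElementsByTagName("title")[0].firstChild.data
print(title)
```

This only discards one line, so a doctype that wraps onto a second line would
need a smarter skip.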

And as mentioned above, some HTML elements are self-closing.
While an XML parser *should* be able to handle `<br />`, it will be confused by
`<br>`.
But there are two important counterpoints.

 1. *Will it though?*
    A SAX parser can just skip over the token.
 2. *Does it matter?*
    Visual cues like line breaks are uninteresting outside of rendering a
    complete web page.

In summary, parsing a complete DOM from an HTML document using an XML parser is
not possible.
But rarely do you need the *complete* DOM.
Figure out what data is actually necessary and ignore the rest.
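
As a sketch of "take what you need", here is a SAX handler that records link
targets and ignores every other element.
The class name `LinkCollector`, the helper `collect_links`, and the sample
markup are all invented for this example.
One caveat I'm assuming rather than glossing over: Python's stock `xml.sax`
parser (backed by expat) still stops at the first well-formedness error, so
this sketch simply keeps whatever was collected before that point.

```python
import xml.sax


class LinkCollector(xml.sax.ContentHandler):
    """Record href attributes of <a> elements; ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        href = attrs.get("href")
        if name == "a" and href is not None:
            self.links.append(href)


def collect_links(markup: str) -> list[str]:
    """Return the links seen before the end of input or the first error."""
    handler = LinkCollector()
    try:
        xml.sax.parseString(markup, handler)
    except xml.sax.SAXParseException:
        # Events are delivered as the parser goes, so anything collected
        # before the error is still in handler.links.
        pass
    return handler.links


# Hypothetical page with a bare <br> in the middle of the first paragraph.
PAGE = (
    "<html><body>"
    '<p><a href="/one">one</a><br><a href="/two">two</a></p>'
    '<a href="/three">three</a>'
    "</body></html>"
)

links = collect_links(PAGE)
print(links)
```

The bare `<br>` only becomes an error when the parser reaches the `</p>` that
fails to match it, so the links inside that paragraph are collected and
anything after it is lost.
Whether that trade is acceptable depends on where the interesting data sits.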

----

On the other side of things, why might one *want* to use an XML parser on
HTML documents?

For one, I have yet to see an HTML parser expose a SAX-like API.

In many circumstances, you *already* need an XML parser.
Configuring and testing an HTML parser on top of that can only add challenges.
Even if the runtime overhead is negligible, the source code is made more
complex with another API.
Code gets duplicated between the two.
Function and class names get bikeshedded, and usually elongated with a prefix,
just to indicate which format they work with.
Code churn for the sake of code churn is not a great idea.
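
A small illustration of the single-API argument, assuming both inputs are
well-formed: the same helper pulls a title out of a feed and out of a page.
The name `first_text` and both sample documents are hypothetical.

```python
from xml.dom import minidom


def first_text(markup: str, tag: str) -> str:
    """Return the text directly inside the first <tag> element."""
    document = minidom.parseString(markup)
    element = document.getElementsByTagName(tag)[0]
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


# Hypothetical inputs: an RSS-like feed and an XML-friendly HTML page.
FEED = "<rss><channel><title>Example feed</title></channel></rss>"
PAGE = "<html><head><title>Example page</title></head><body /></html>"

feed_title = first_text(FEED, "title")
page_title = first_text(PAGE, "title")
print(feed_title, page_title)
```

No `xml_` or `html_` prefixes, and nothing to keep in sync between two
parsers.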

----

At the end of the day, most of the HTML scraping I do is supervised ingestion.
My advice for designing an automated pipeline would be vastly different.
But for my personal projects, a hack job is just as good.


[^1]: Compare the XML declaration, document type declaration, and root
      element for XHTML:

      ```
      <?xml version="1.0" encoding="UTF-8" ?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
      ```

      To the document type declaration, root element, and charset declaration for
      HTML4:

      ```
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
      <html lang="en">
      <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      ```

      To that for HTML5:

      ```
      <!DOCTYPE html>
      <html lang="en">
      <head><meta charset="utf-8">
      ```

      While all of this is fine and dandy for specifying how web browsers must
      behave to produce comparable visual renderings of web pages across the
      specifications... *absolutely none of this is useful for a script.*