I have an application that takes the “modern” approach to screen-scraping. Gone are the days of reading serial ports, and parsing VT100 control codes. The modern screen-scrape is far more unpleasant.
This application is a package holiday booking system, which includes its own web interface. “Scraping” should just be a case of traversing the HTML DOM to the information I need, sipping my coffee, and retrieving it. But it’s not.
This is not HTML4 compliant markup. It’s almost difficult to imagine how you could purposefully produce this kind of markup- it’s not just missing a DOCTYPE, or not closing a tag or two. It has new tags; tags no one has heard of before. It has nested comments. It has invalid entities.
Try #1 involved regular expressions. The Python 2.2 re module would choke on too much recursion. Try #2 piped the HTML through Tidy on “nazi” mode, and minidom with lots of:
Mmm. Fragile. Try #3 uses XPath with 4DOM. This is the currently live version. It’s hard to express what a productivity boost using a domain specific language like XPath is over coding long chains like the above. Yay XPath.
Anyway, the customer has mentioned that it’s quite unresponsive lately. In the ultimate case of ignoring “profile-first” I had assumed that it was either network latency (the application we scrape is remote), slowness on the origin side (it’s not a fast application at the best of times), or lots of CPU being used by Tidy. A check of our server load graphs showed hugely disproportionate amounts of CPU being used.
Tidy then. Well, just to check my findings, I ran the Python profiler. Tidy was taking about 250ms. It was taking >5 seconds to parse Tidy’s output. OMFG in what world is it acceptable to parse a 40k file, on a 1Ghz machine, in 5 seconds?
Acceptible is probably the wrong word; I didn’t pay for 4DOM, so I can hardly complain, but Uche Ogbuji rips into pyRXP in this article for not being valid as regards Unicode. By the way, I happen to completely agree that saying “XML, except for Unicode” is like saying “XML, except for angle brackets”. Now, as I understand it, pyRXPU is basically as fast as the 8-bit version, but it’s broken in CVS (for me, at least - Windows binaries work fine, Unix source from 2005/01/01 fails a few tests), so I did some tests using the 8-bit version.
I had to write a simple tree-walker to emulate the XPath expressions we use. I understand that this makes this comparison “apples and oranges” in some people’s eyes, but XPath was only using a few hundred milliseconds, parsing is the major CPU hog.
The tests used are available here, and require PyXML and pyRXP. The results? In wall-clock seconds, averaged over 10 runs, fastest to slowest:
- 0.06 = pyRXP and a custom tree-walker.
- 0.54 = minidom and XPath.
- ~3.0 = opening the file in Vim and typing ‘/name=”reselectApi”’.
- 5.3 = 4DOM and XPath.
For my purposes, pyRXP is ~90 times faster. 90. Ninety.
I understand that 4DOM is a completely compliant DOM implementation, and that this is a very, very hard target to meet, and quite important, but it ought to come with a huge warning label about the speed hit you take by using it.
I would never have expected something as innocent as an XML parser to be such a bottleneck.