When did web pages stop being documents?

Posted on September 17, 2015

Tags: software

It seems like these days the Web is all about “applications,” which makes some sense for actual applications that need to be interactive.

But something like this seems, to me, to clearly be a document. It certainly looks like a document when viewed in a modern web browser. But in Lynx, it’s unreadable. And looking at the source, there appears to be no actual HTML for the body text. It seems like JavaScript is rendering Markdown to HTML on the fly, or something like that. But why on earth does that need to be done client-side?

One of my arguments for why “data should be data,” and not code, is that data is usually much easier to manipulate programmatically. I ran into this problem because I wanted to screen-scrape this table, and I discovered that the table doesn’t exist as an HTML table in the document! Of course, the table data is in there, in a custom format which is presumably parsed by the JavaScript. So I could write my own custom parsing code to pull the table data out, but that feels very brittle and icky compared to using an off-the-shelf HTML parser.
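To show what “off-the-shelf” buys you, here is a minimal sketch of pulling rows out of a real HTML table using only the parser in Python’s standard library. The `TableExtractor` class, the `extract_rows` helper, and the sample markup are all made up for illustration; they are not taken from the page in question, and the sketch ignores nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table cells in `html` as a list of rows (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample markup, standing in for a page that ships its table as HTML.
SAMPLE = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>apples</td><td>3</td></tr>
  <tr><td>pears</td><td>5</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

None of this works, of course, when the table only exists after a script has run, which is exactly the complaint.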

It’s interesting that HTML and PostScript have gone in exactly opposite directions over the past couple of decades. PostScript turned into PDF, which is essentially PostScript but as data, not code. (Or, PDF could be thought of as the result of running a PostScript program.) HTML, on the other hand, started out as data, but with its increasing reliance on JavaScript, it’s becoming much more like PostScript used to be.

It’s fairly easy to use Ghostscript to turn PostScript into PDF. What’s currently lacking, AFAIK, is a similarly easy program for turning JavaScript into HTML. (Basically, run the code and then dump the DOM as HTML.) I know there are some headless web browsers out there, but I get the impression they are awkward and heavyweight. Maybe I’m just not familiar enough with them, though.
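For what it’s worth, here is a rough sketch of what “run the code and then dump the DOM” can look like with a headless browser. It assumes a Chromium-based browser that supports the `--headless` and `--dump-dom` command-line flags and is installed under the name `chromium`; headless mode arrived in Chrome after this post was written, and the binary name varies by system. The function names are my own, and I haven’t tried this against the page above.

```python
import subprocess


def dump_dom_command(url, browser="chromium"):
    """Build the command line that asks a headless browser to print the DOM.

    `browser` is an assumption: the executable may be called `chromium`,
    `chromium-browser`, `google-chrome`, or something else entirely.
    """
    return [browser, "--headless", "--dump-dom", url]


def dump_dom(url, browser="chromium", timeout=60):
    """Load `url`, let its scripts run, and return the serialized DOM as HTML."""
    result = subprocess.run(
        dump_dom_command(url, browser),
        capture_output=True,  # the serialized DOM arrives on stdout
        text=True,
        timeout=timeout,
        check=True,           # raise if the browser exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Needs a suitable browser on the PATH; example.com is only a placeholder.
    print(dump_dom("https://example.com"))
```

The output is plain HTML, so it can be fed straight into an ordinary HTML parser. It still means starting an entire browser to read a document, so the “heavyweight” impression probably stands.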