Rewriting froghat.ca

Thu Feb 24 '22

I rewrote my static site generator again. But only a little bit.

My goal was to make things simpler and easier to read and understand. And maybe make things a bit faster too.

This is the output of cloc (a program that counts lines of code) before …

Language      files    blank    comment    code
Python           17      517        169    1595
C                 1       57          9     167
SUM:             18      574        178    1762

… and after …

Language      files    blank    comment    code
Python           12      494        163    1332
C                 1       24          2      73
SUM:             13      518        165    1405

So that’s nice. Although maybe it’s dumber if you include dependencies from the standard library and from PyPI or wherever.

There were three fun little changes.

/proc/[pid]/cmdline

For reasons, the site generator uses ninja to execute build steps in order and rerun steps when dependencies are old or missing. But starting Python for each step is slow.

Instead, a C program starts, connects to the main Python process, and sends along its command line arguments and its stdin, stdout, and stderr file descriptors. The main Python process uses those file descriptors in place of its own to pretend like it’s this other program, and so ninja captures the output. It does the thing the arguments describe and then replies to the C program with the code to exit with.
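
Roughly what the Python side does with those file descriptors, as a sketch and not the actual code: recv_fds() has been in the socket module since Python 3.9, run_step() is a stand-in for the real work, and where argv comes from is the subject of the next few paragraphs.

    import os
    import socket
    import struct

    def run_step(argv):
        ...  # stand-in: actually perform the build step
        return 0

    def serve_one(conn, argv):
        # Receive the C program's stdin, stdout, and stderr via SCM_RIGHTS.
        _, fds, _, _ = socket.recv_fds(conn, 1024, 3)
        saved = [os.dup(n) for n in range(3)]
        try:
            for n, fd in enumerate(fds):
                os.dup2(fd, n)  # wear the C program's stdio like a mask
            code = run_step(argv)
        finally:
            for n, fd in enumerate(saved):
                os.dup2(fd, n)  # put our own stdio back
        conn.sendall(struct.pack("!i", code))  # reply with the exit code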

Previously, data was serialized over the socket as XDR because it’s compact, easy to implement in C, and Python has an implementation in the standard library, xdrlib.

I noticed it was possible to get the process id (pid) of a peer connected over a unix socket with getsockopt’s SO_PEERCRED option. From this, we can read the program’s command line arguments from the file at /proc/[pid]/cmdline.

This means we don’t need to send the arguments over the socket. So I managed to throw out the whole XDR thing because the only value sent over the socket now (other than the file descriptors) is the return code.
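
Reading the peer’s argv isn’t much code. A sketch, with the caveat that struct ucred (three ints: pid, uid, gid) is a Linux-specific layout:

    import socket
    import struct

    def peer_argv(conn):
        # SO_PEERCRED returns the peer's struct ucred: pid, uid, gid.
        ucred = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                                struct.calcsize("3i"))
        pid, _uid, _gid = struct.unpack("3i", ucred)
        # /proc/[pid]/cmdline is the peer's argv, NUL-separated.
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return [arg.decode() for arg in f.read().split(b"\0") if arg]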

So that’s why the C program is more smol.

null_publish()

The documents for my site are written with reStructuredText markup. docutils is the package that the site generator uses to parse that markup and render HTML. It’s relatively slow.

To go relatively fast, the site generator will fork and do this step in a child process so that the main process can continue handling requests for more build steps from ninja.
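
The shape of it, roughly, where render() stands in for the docutils step:

    import os

    def handle(request):
        pid = os.fork()
        if pid == 0:
            # Child: do the slow docutils work, then exit without coming back.
            os._exit(render(request))
        # Parent: return right away and keep serving ninja; the child gets
        # reaped later with os.waitpid().
        return pid

    def render(request):
        ...  # stand-in: parse reStructuredText, write HTML
        return 0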

One drawback of this is that caching done by child processes can’t benefit subsequent build steps, because the child just quits when it’s done. This is particularly bad in Python, where module imports can happen whenever, or with pygments, where syntax highlighting definitions are loaded on demand.

Aside from imports, a lot of the caching that benefits docutils is done by the regular expression package in Python’s standard library. Each regular expression pattern is compiled the first time it’s used. Subsequent uses are much faster because the regular expression package reuses the compiled pattern from the first time.

Previously, I had read through the docutils source code to find some regular expression pattern strings that seemed easy to reference. Before forking, I’d import docutils and pass those strings to re.compile() in order to cache them for when they’d be used after a fork.
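
So the warm-up looked something like this. The pattern strings here are made up; the real ones were fished out of the docutils source:

    import re
    import docutils.parsers.rst  # pay the import cost before forking

    PATTERNS = [r"^\.\.[ ]+", r"`[^`]+`_"]  # made-up stand-ins

    def warm_re_cache():
        for pattern in PATTERNS:
            re.compile(pattern)  # re keeps compiled patterns in an internal cache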

It occurred to me to ask docutils to parse and render an empty document instead. It was less hacky and a bit more effective than what I was doing before.
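
That’s about all null_publish() amounts to, something like:

    from docutils.core import publish_parts

    def null_publish():
        # Rendering an empty document touches most of the code paths a real
        # document would, warming imports and regex caches before we fork.
        publish_parts(source="", writer_name="html")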

Rendering an empty document takes about 15 milliseconds the first time and just one and a half milliseconds subsequently. Importing and configuring docutils and publishing the empty document takes about 70 milliseconds the first time and like two milliseconds subsequently.

And those tens of milliseconds add up to several tens of milliseconds over the course of the entire build. Since I do a full rebuild before publishing the site, and given the rate I publish new posts – twenty posts in about three years – it adds up to about one second saved every three years.

Which sounds not great for the hours spent on this. But, keep in mind that the time you spend on meaningless optimisation is time not spent being self-aware and vividly alert to your inadequacies and personal faults and the mistakes you’ve made in your past and present relationships and decisions and how it’s all led to the circumstances that will haunt you through your mortal life in this hell we call reality.

syntax highlighting with syntect

The forking approach still makes syntax highlighting slower overall.

Normally, syntax definitions are just loaded once and saved for when the same syntax is highlighted later.

For example, highlighting a Python code block requires loading some information about how to highlight Python in particular. But we only pay that cost the first time we highlight Python code because we don’t have to load it again next time we see some Python code.
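
In a long-lived process that caching just works. With pygments, for example, the expensive part of the first call is importing the module that defines the lexer:

    from pygments.lexers import get_lexer_by_name

    get_lexer_by_name("python")  # first call: imports and caches the lexer module
    get_lexer_by_name("python")  # later calls: cheap, the definition is cached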

Except in this forking model: after a child process loads the syntax highlighting definitions and uses them on a single document, it exits.

I looked into using a Rust library called syntect to do syntax highlighting. It’s pretty interesting in that it stores a lot of syntax highlighting definitions in a compact binary format. It takes my laptop a bit under 25 milliseconds to load them all. Preloading all the syntax definitions here is certainly more feasible than doing so in pygments.

I made some Python bindings for it real quick to try it out, and a docutils directive that uses syntect for highlighting instead of pygments. It performed better, but the built-in syntax definitions were disappointing.
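
The directive part is small. A sketch, where syntect_highlight() stands in for my bindings and isn’t a real API:

    from docutils import nodes
    from docutils.parsers.rst import Directive, directives

    def syntect_highlight(source, language):
        ...  # stand-in for the binding that calls into syntect
        return "<pre>highlighted</pre>"

    class SyntectCode(Directive):
        required_arguments = 1  # the language name, e.g. "python"
        has_content = True

        def run(self):
            html = syntect_highlight("\n".join(self.content), self.arguments[0])
            return [nodes.raw("", html, format="html")]

    directives.register_directive("code", SyntectCode)  # name is a guess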

Specifically, I was missing definitions for Elm, GraphQL, INI, nginx, and TOML. Also, it had definitions for Clojure, but they didn’t work, so I had to provide those too. This was not quite what I expected.

Also, the output it generates seems a little more verbose than what pygments will produce. The class names of the highlighting spans in the HTML it outputs are longer and it seems there’s a lot of duplication. It’s not a big deal but it’s a thing.

To some extent, I didn’t like adding a dependency that required Rust. Mostly, I was put off by having to provide my own syntax highlighting definitions. It’s kind of a trade-off between the convenience of pygments just working and a weird benchmarking kink I enjoy for my own amusement.

But syntect is quite popular and used in other projects like bat, which I know supports a bunch of the languages I mentioned were missing in syntect. So doing what they do might be worth looking into.