Fri Nov 22 '19
I started writing some posts a while ago but never finished them. That’s why my blag has been idle. I am taking a break from those to write about one of the things keeping me up at night lately.[1]
I was writing about GraphQL. I had produced a small pile of petty criticisms when I realized that it’s actually a really okay serialization format for a very specific kind of structure: a node (or a field) with a name, an optional alias, optional arguments, and optional child nodes/fields.[2] If we were to encode this sort of structure in a format like JSON, we’d end up with something either verbose or awkward, like this JMAP example. Either we encode the node as a list/tuple, which is not very human-readable, or as some sort of mapping/associative array, where recurring keys are a bit wasteful. Either way is more verbose than what GraphQL accomplishes.
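To make the comparison concrete, here’s a sketch (all the field names are invented). A GraphQL selection like:

{
  posts(first: 10) {
    title
    comments {
      body
    }
  }
}

…might be encoded in JSON as, say, one mapping per node:

{
  "name": "posts",
  "args": { "first": 10 },
  "children": [
    { "name": "title" },
    { "name": "comments", "children": [ { "name": "body" } ] }
  ]
}

Same structure, but every node now spells out "name" and "children", which is the recurring-keys waste I mean.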
I’m simplifying a lot here and I also haven’t used GraphQL in a long time so my impressions are probably dated. You should probably be careful about how seriously you take what I write here.
GraphQL is pretty generic and flexible and we can use it to define a variety of ways to operate over data. This is in contrast to something like REST which is really more suited to CRUD operations. Specifically, so I’m not just throwing buzzwords around, I think a good REST API is about having four verbs that you can use to operate over different structures in similar and predictable ways, while a GraphQL API can accommodate models that are special snowflakes that each have different verbs and little to nothing in common with each other.
Maybe I’m a silly person who smoked too much duck typing, but I get the feeling that GraphQL is about managing complexity and REST is about exposing something that resembles a connection to a database. Sometimes your domain models are complex. But, and I could be wrong here, it seems like most of the web is an HTML form in front of a database. Most of the web is not APIs for threading, creating parsers, leftpad, or drawing graphics. It’s reddit and dumb blags for people to complain about web technologies that they don’t fully understand. It’s mostly about moving content around or getting presentations of specific content.
In principle, REST should be great if all I want to do is some super simple CRUD stuff, right? Endpoints like /collection and /collection/item map closely to the collections/tables/objects and documents/rows/instances I have.
But that alone isn’t very powerful. If we want to make a REST API for some content, we probably want to do things like filtering and pagination. Let’s think about how we would make one and add features until it looks all fucked up.
We allow the client to filter for items by testing equality on some field. Let’s use the query string for this: /posts?title=Bananas. Great.
And now we need to filter on another field. But it’s not for equality; it’s a full text search: /posts?content=yellow+fruit.
It looks simple. Simplicity is good, right?
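Here’s a minimal sketch of how a server might implement those two implicit filters, using Flask and an in-memory list; all the names and data are made up:

from flask import Flask, jsonify, request

app = Flask(__name__)

# toy data standing in for a real database
POSTS = [
    {"id": 1, "title": "Bananas", "content": "a yellow fruit appreciation post"},
    {"id": 2, "title": "Kumquats", "content": "small, orange, underrated"},
]

@app.route("/posts")
def list_posts():
    results = POSTS
    title = request.args.get("title")
    if title is not None:
        # implicit exact-match comparison; nothing in the URL says so
        results = [p for p in results if p["title"] == title]
    content = request.args.get("content")
    if content is not None:
        # implicit (very fake) full text search: every term must appear
        terms = content.lower().split()
        results = [p for p in results if all(t in p["content"].lower() for t in terms)]
    return jsonify(results)

Two fields, two different comparison semantics, zero hints in the URL.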
This is a great design for a few reasons. First, things get awkward if the API needs to support some other expression on one of these fields, like a case-insensitive substring match on the title or something. Second, since our field comparisons are implicit, it’s not obvious to someone who isn’t very familiar with this particular endpoint which comparisons the filter uses. Fortunately, we can document our bespoke endpoint filters to solve that.
Not least of all, you can trick a clever user who is trying to make an inference about one endpoint’s filter behaviour based on the behaviour of another endpoint. Like if they know how /posts?content=your+blag+sucks works, they might assume that /comments?content=your+blag+still+sucks does the same thing. But it turns out this was a trap all along. Even though those filters were intended to be the same, they used to be substring matches before someone realized that a full text search was way better and changed it. But they only changed one endpoint, forgetting the filter for the other. Now the filters between the two endpoints are different, but not different enough that anybody will even really notice for a couple years, if ever. They might just have a feeling that searching one resource doesn’t quite work as well as the other for some reason. It’s perfect.
The next thing we can do is nest resources and endpoints.
Endpoints like /thread/<id>/comments are just filtered queries on /comments, something like /comments?thread__eq=<id>. You don’t even need them, but you can trick everyone into thinking you do by not having a /comments endpoint at all. And then users can’t even query for comments without a filter.
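Continuing the hypothetical Flask sketch from earlier, the nested endpoint can literally be the generic one with a canned filter; again, every name and value here is invented:

from flask import Flask, jsonify, request

app = Flask(__name__)

COMMENTS = [
    {"id": 1, "thread": 123, "author": 7, "content": "first"},
    {"id": 2, "thread": 123, "author": 8, "content": "second"},
]

def filter_comments(args):
    rows = COMMENTS
    thread = args.get("thread__eq")
    if thread is not None:
        rows = [c for c in rows if str(c["thread"]) == thread]
    return rows

@app.route("/comments")
def comments():
    return jsonify(filter_comments(request.args))

@app.route("/thread/<int:thread_id>/comments")
def thread_comments(thread_id):
    # the "nested" endpoint is the same query with a canned filter
    return jsonify(filter_comments({"thread__eq": str(thread_id)}))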
Later, you might add /user/<id>/comments that returns comments authored by a particular user. But since endpoints aren’t composable (we can’t logically conjoin endpoints together), there isn’t an intuitive way to filter comments on both their author and the thread they belong to.
To solve this, let’s allow query string filters like everywhere else. But make sure that /thread/<id>/comments and /user/<id>/comments accept different query parameters and different filters, either on purpose or by accident, because they use different implementations and there was a copy-paste mistake.
But we can solve this by getting someone to document the inconsistencies. It turns out you can cut corners and not bother with creating a self-describing interface if you can just refer your users to literature every time they can’t figure out what a button does.[3]
And while you’re adding documentation for this, make sure that the documentation we wrote for that other thing earlier isn’t out of date yet.
Now we have a REST API made out of bespoke snowflake endpoints that infuriates the living fuck out of everyone who uses it or supports it. The only thing that really happened is that we wanted to parameterize filtering, and we came up with something that was not expressive enough to work generally across each of our collections. So instead of dealing with that at the door, we made it the concern of every endpoint.
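Dealing with it at the door might look something like this: a minimal sketch of one filter grammar (field__op=value) interpreted in a single place and shared by every collection. The operator names and data are invented:

# one grammar, one interpreter, every collection
OPERATORS = {
    "eq": lambda value, arg: str(value) == arg,
    "contains": lambda value, arg: arg.lower() in str(value).lower(),
}

def apply_filters(rows, args):
    """Interpret query parameters shaped like field__op=value uniformly."""
    for key, arg in args.items():
        field, _, op = key.partition("__")
        test = OPERATORS[op or "eq"]  # a bare field name means equality
        rows = [row for row in rows if test(row.get(field), arg)]
    return rows

# the same code can back /posts, /comments, /users, ...
posts = [{"title": "Bananas", "content": "yellow fruit"}]
print(apply_filters(posts, {"title__contains": "ban"}))

Whether this particular grammar is expressive enough is a separate question, but at least the inconsistencies have nowhere to hide.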
tldr
The whole point of the above wall of text is only to say that expressing how to fetch data is kind of hard and there are a lot of bad solutions.[4]
SQL is amazing but it’s quite complicated. I’m pretty sure I learn something new every time I look at SQL database engine user documentation.
Most of the GraphQL APIs I’ve seen do something similar to the above. But it seems like you can get away with a lot more, because GraphQL APIs have schemas. Presumably, mistakes in clients can be caught early by programming against an explicit and typed interface. The consequences of having a complex interface are minimized by tooling.
Instead of encoding our filters into a URL query string, we can encode them as part of a GraphQL query and pass that either in the URL of a GET request or as data in a POST request, if our API does away with GET requests entirely.
Here’s an example from the GraphQL project website:[5]
{
human(id: "1000") {
name
height
}
}
… which looks up and returns a human with that id.
Some GraphQL APIs are creative with parameter names and include operators in them, like human(name__contains: ...) or something like that. And some REST APIs let you do this with query string filtering too, like /human?name__contains=.... It’s basically the same thing.
(Except GraphQL is typed, while I think x-www-form-urlencoded encodes the boolean true and the string "true" the same way, which can be a pain.)
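For example, with Python’s standard library, just to illustrate the untyped-ness:

from urllib.parse import parse_qs

# query strings are untyped: the boolean true and the string "true" are indistinguishable
print(parse_qs("archived=true"))   # {'archived': ['true']}  -- a string
print(parse_qs("archived=false"))  # {'archived': ['false']} -- a string, and truthy in Python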
It’s worth mentioning that GraphQL makes it natural for clients to specify the shape of the data in the selection set: which fields we want the API to return, and any related records as well.
PostgREST is a really fun and interesting program. Its syntax for returning related records was apparently inspired by GraphQL. It lets us write something like /users?select=*,comments(*)&comments.thread=123, which would return a set of users and their authored comments for a specific thread.
PostgREST was also designed with the awareness that nested endpoints are filters. When you think about it, even /table/<id> is just a filter. So write /table?id=eq.123 instead. This has the advantage of removing the question of how to build an item endpoint for a model with a composite primary key.
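For example, for a hypothetical order_items table keyed on (order_id, line_no), the “item endpoint” is just two filters ANDed together:

/order_items?order_id=eq.42&line_no=eq.3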
I also want to mention HTSQL. I haven’t heard anybody else talk about it and haven’t properly used it myself. So I don’t really have any insight to offer. But the design and literature seem super interesting. It looks expressive and powerful, but I’d also expect this to mean that it’s more difficult to implement and a bit harder to learn. Although, I think a goal of theirs was to create something easier to digest than SQL.
Admittedly, more expressiveness isn’t necessarily worth the complexity it may introduce.
Let’s suppose we want to find posts, but only ones made since their author’s last login. Using HTSQL, we can write something like /posts{title,author{name}}?published_at>author.last_login, which returns something like:
{
"posts": [
{
"title": "interesting post",
"author": {
"name": "Spongebob"
}
},
{
"title": "why starfish are the best",
"author": {
"name": "Patrick Star"
}
}
]
}
This is pretty cool. I’m not saying that this here exactly is the line, but there probably is some point where queries/requests become so esoteric that giving them first-class support might not be worth it.
PostgREST has a bit on computed columns which can be used to define interesting expressions on the server that are usable by clients. Which is better than nothing and preserves the purity/simplicity of the filter and selection syntax.
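For instance, if the server defined a word_count computed column on posts (a made-up example), a client could select it like any ordinary column:

/posts?select=title,word_count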
PostgREST also has a whole thing for executing stored procedures, for when your models have special verbs that don’t fit the CRUD model. So that’s a cool thing worth mentioning. Because even if 90% of the APIs you’re trying to create on the web are just for content that fits CRUD, the other 10% is important enough that, if you ignore it, it’s likely to take as much or more time to solve than the first 90%. And that sucks.
An important fact about PostgREST and HTSQL is that they operate using introspection. Since the features are written for database concepts, any kind of snowflakeyness must exist in the database that is being reflected. And we don’t have any stupid excuses like we didn’t copy-paste the filter handling code correctly between endpoints, because they share the same code in the first place.[6] Business rules are naturally idiosyncratic, and minimizing that idiosyncrasy so it doesn’t leak into our programming interfaces is valuable.
Reusing code to create patterns and simplify things for users is a whole thing that one could spend a lot of time going into. But, I’m pretty sure everyone is already aware of how this works. People don’t execute on this knowledge in part because politics.
But also it’s hard to find the appropriate pattern. Doing it wrong is often worse than not doing it at all.
Here’s a picture of someone else’s cat to break up the text.
shower thoughts
This is getting too long, but I wanted to include some garbage I’ve been thinking about off and on for quite some time.
Interacting with content follows similar patterns. You usually want to combine or compose the following:
Fetch content. Which might involve filtering, ordering, limits, or offsets.
Mutate something, either modifying fetched content or creating new content.
Present fetched content with some shape, maybe using some kind of pagination.
Performing this interaction using GraphQL or URL query strings is about encoding this information into a fairly human-readable and writable string.
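As a purely illustrative sketch, the fetch-and-present half of that could be a plain data structure instead of a string. Every key here is made up:

# a hypothetical generic query structure; nothing here is a real API
query = {
    "filter": {"author__eq": 7, "published__eq": True},
    "order": ["-published_at"],
    "limit": 20,
    "offset": 0,
    "select": {"title": True, "author": {"name": True}},
}

Serialize it however you like; the interesting part is that it’s made of native types rather than concatenated into a string.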
At the risk of getting much deeper into this, I’ll mention MongoDB real quick. I find it interesting because you can write an expression as a JavaScript object using some special notation for operators. In the request, the structure is serialized to BSON. Being a binary format, BSON is easier for machines to parse than text formats,[7] and it supports quite a few more types than JSON, so applications don’t have to guess whether something is a date or a string.
{ $or: [ { qty: { $lt: 30 } }, { item: /^p/ } ] }
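Here’s the same expression built as native Python dicts and handed to pymongo; the database and collection names are made up, and the JavaScript regex literal /^p/ becomes an explicit $regex operator:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client.inventory.items  # hypothetical database/collection

# native dicts all the way down; pymongo serializes this to BSON
query = {"$or": [{"qty": {"$lt": 30}}, {"item": {"$regex": "^p"}}]}
for doc in items.find(query):
    print(doc)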
On the other hand, suggesting that a binary query format will improve performance by reducing time spent parsing is probably naive. Your program probably isn’t spending enough time at the parsing step for a reduction to matter when handling a request. Also, you can cache the mapping from query strings to their internal representations. I’m pretty sure most GraphQL servers and database engines already do this.
Structured queries like this are interesting because the format has more to do with how to arrange native types than with how to arrange a string. (Assuming you can send your structure over the wire by using a suitable serialization format.)
On the other hand, strings are universal. Some programs may not have strong notions of objects or structures. Like maybe something written in bash (or Scratch?). Or maybe some weird domain-specific languages. I’m not sure.
Part of me believes in a future where we can compose and serialize generic query structures that describe what content to gather or what mutations to perform. And then, if needed, reference them from some remote-procedure-call (not CRUD-specific) format like GraphQL.
Because, for putting content on the web, the databases we make and the engines they use are important, if not the most important, things. But our time is spent, over and over, on writing protocols for moving that information around, and those protocols can’t even talk to each other.
while we’re on the topic of serialization formats and the efficiency of text-vs-binary
It’s been claimed that binary formats are cheaper because they don’t repeat keys. For instance, each element of a JSON list is self-describing and will contain all its keys as strings, even if every element has the same structure. Whereas something like protobuf doesn’t need to represent object keys as strings because you have a schema to determine how to map bytestreams to objects and their members.
However, this has nothing to do with whether the format is binary or text. MessagePack and CBOR are binary formats where objects are self-describing. Do you know what self-describing text format doesn’t repeat keys for each object in a sequence? Comma-separated values.[8]
Almost as well liked as XML.
What if our JSON had a header too?
[
[
"title",
"published_at",
{ "author": [ "id", "name" ] }
],
[
"interesting post",
"Fri Nov 22 11:56:49 PST 2019",
[ 1, "spongebob" ]
],
[
"why starfish are the best",
"Fri Nov 22 13:12:32 PST 2019",
[ 2, "Patrick Star" ]
]
]
I imagine this is how a lot of wire protocols for relational/tabular databases already work. So this is probably not a new thing. I just think it’s pretty memes that CSV has this over every other popular serialization format. I mean, I don’t really know if this is a meaningful improvement; maybe it’s something to check out. I think it’s fair to say that this impairs human-readability at least a little bit, especially as things nest. But that’s kind of irrelevant for binary formats anyway.
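If I wanted to poke at this, a quick sketch of producing that header-plus-rows shape from ordinary records might look like the following. It handles exactly one level of nesting, like the example above, and is purely illustrative:

def header_for(record):
    """Build a header from one record; nested objects get sub-headers."""
    header = []
    for key, value in record.items():
        if isinstance(value, dict):
            header.append({key: list(value)})
        else:
            header.append(key)
    return header

def row_for(record, header):
    """Flatten one record into a row ordered by the header."""
    row = []
    for column in header:
        if isinstance(column, dict):
            (key, subkeys), = column.items()
            row.append([record[key][sub] for sub in subkeys])
        else:
            row.append(record[column])
    return row

def to_header_form(records):
    header = header_for(records[0])
    return [header] + [row_for(r, header) for r in records]

posts = [
    {"title": "interesting post", "author": {"id": 1, "name": "spongebob"}},
    {"title": "why starfish are the best", "author": {"id": 2, "name": "Patrick Star"}},
]
print(to_header_form(posts))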