Impetuous

Thu Mar 12 '20

In keeping with the latest trends of rewriting perfectly docile & unsuspecting software in Rust, I have been working on a Rust implementation of Impetuous, some old time tracking software I wrote in Python some years ago.

The short summary is that there are a lot of things wrong with how the rewrite ended up, but it can do some pretty cool things and I learned a lot.

A lot of the badness seems to come from places where I abused macros because I couldn’t figure out how to use Rust traits to do the thing I wanted; and there are a few places where stuff makes no sense because an effort to follow paradigms like KISS/YAGNI[1] resulted in too much coupling, with things in the system having to care about everything else.

In all of this, one of the most responsible things I can say about Rust is that it seems to be relatively good at managing the complexity in how your program works at the expense of the complexity of how Rust works.

If you are writing something that must be thread-safe, is it easier to reason about threading or about Rust and its borrow checker and how the borrow checker works in different places like traits or closures or any other feature in Rust?

I could spend a lot of time writing stuff about Rust that nobody will read; but what I want to cover here are the goals I had in mind and how they influenced the decisions that led to Impetuous being the way that it is.

im doing

Briefly, Impetuous was written to help me track my time spent at work and post that time to Jira and Freshdesk.

$ im doing looking for my red stapler
 7:30:00 /        0s / --:--:--  looking for my red stapler
$ im doing plotting against whoever took my red stapler
 8:15:00 /        0s / --:--:--  plotting against whoever took my red stapler
 7:30:00 /    45m 0s /  8:15:00  looking for my red stapler
$ im doing
 8:15:00 /    15m 0s /  8:30:00  plotting against whoever took my red stapler

The Python implementation had two tables: one for these time entries, another for the results of posting time to Jira or Freshdesk. It’s pretty basic stuff.

Over-engineering pretty basic stuff

When writing the first version, I attempted to model interactions between a user agent and the data store through data structures that described a way to gather and change or present information in the database.

The hope was that composing those data structures would be an efficient & effective way to describe how to interact with Impetuous in terms of CRUD.

Using some context, like access control, these structures would be mapped into one or more SQL queries and run against a SQLite database, ultimately returning a document containing objects gathered in the shape described by the presentation structure.

In Python, I was spoiled by SQLAlchemy, which I found to be a very nice library for querying a SQLite database. In Rust, the only mature option I found was Diesel, which boasts compile-time correctness at the cost of being able to generate queries dynamically at runtime. When I evaluated it, the Diesel API was so strict that every query was required to know at compile-time exactly the number of columns it would return. Conditionally selecting columns or joining other tables based on some gather or shape in a request would require that Impetuous generate every permutation of a request at compile-time.

Over-engineering query building

I ended up writing my own thing for building SQL query strings and used a very nice driver called rusqlite. This went through about two and a half rewrites. It looks something like:

// Honestly writing "VALUES (?, ?, ?, ?, ?)" is more readable than this...
let expr = sql::Values::one(sql::ExprTuple::from((
    sql::bind::<()>(),
    sql::bind::<()>(),
    sql::bind::<()>(),
    sql::bind::<()>(),
    sql::bind::<()>(),
)));
sql::Insert::default_tuples()
    .table(&postings::table)
    .columns(sql::ident("doing_tag"))
    .columns(&postings::source)
    .columns(&postings::entity)
    .columns(&postings::status)
    .columns(&postings::doing)
    .expr(expr)
    .upsert(sql::lit::<()>("ON CONFLICT DO NOTHING"))
    .as_sql();
// INSERT INTO "postings" ("doing_tag", "source", "entity", "status", "doing")
//      VALUES (?, ?, ?, ?, ?) ON CONFLICT DO NOTHING

Many of the structures, including Select, Insert, Update, and Delete, use generics for almost every part of the statement, allowing users quite a bit of flexibility around what types they’re building queries out of. This also gives some control to users to determine the compile/run-time strictness of values permitted.[3]

In a statement like SELECT, you may have multiples of things like result columns, FROM tables, or JOIN clauses. We build up the statement by accumulating values for these parts into the statement structure.

If, for each collection, whether it’s result columns or joins or whatever, we accumulate elements into a vector, then each element must be of the same type. Or we can accumulate these into a tuple of differently typed elements: (T1, T2, T3, ...). But every time we grow the tuple, it becomes a new type. In Rust, this is done through a move, allowing us to create a new type by consuming our ownership of the old one.
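As a stripped-down sketch of that move-to-grow idea (this isn’t the actual Impetuous sql module, and the real code presumably covers each tuple size with a macro rather than a hand-written impl per arity):

// A toy builder whose columns type changes every time we add a column.
struct Insert<Cols> {
    table: &'static str,
    columns: Cols,
}

impl Insert<()> {
    fn new(table: &'static str) -> Self {
        Insert { table, columns: () }
    }

    // Consumes the Insert<()> and returns a new type with one column.
    fn column<C>(self, col: C) -> Insert<(C,)> {
        Insert { table: self.table, columns: (col,) }
    }
}

impl<A> Insert<(A,)> {
    // Consumes the Insert<(A,)> and returns an Insert<(A, C)>.
    fn column<C>(self, col: C) -> Insert<(A, C)> {
        let (a,) = self.columns;
        Insert { table: self.table, columns: (a, col) }
    }
}

fn main() {
    let insert = Insert::new("postings");                   // Insert<()>
    let insert = insert.column(String::from("doing_tag"));  // Insert<(String,)>
    let insert = insert.column("source");                   // Insert<(String, &str)>
    // Each call moved the old builder; the tuple records each element's type,
    // so mixing owned and borrowed values is fine.
    let _ = insert;
}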

Since each has its own trade-offs, I support both and use one or the other depending on the circumstances.

In the example earlier, the first call to columns moves the Insert in order to change its columns type from the unit type () to a single element tuple of an owned identifier (sql::Ident,). The second call moves it again to change the type to a (sql::Ident, &sql::Ident). The owned and borrowed identifiers are different types, so we couldn’t do this with a vector – unless we borrowed the first identifier, or copied the later ones, or used a clone-on-write or some kind of union thing with a variant for each type.

On the other hand, if you use tuples, then I guess you can’t really do anything conditional, since an INSERT of a tuple of two columns is a different type than an INSERT of a tuple of three columns and they cannot be assigned to the same thing. Allegedly, we can use a Box<dyn T> or &dyn T (where T is a particular trait), instead of a vector or a tuple, to point to some owned or borrowed place in memory that implements whatever trait the tuple or vector is implementing. But there are a lot of caveats and lore around this that make it difficult to employ reliably.
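For what it’s worth, the trait object version of a heterogeneous collection looks roughly like this (the SqlPart trait here is made up for illustration; it isn’t part of the real code):

// A made-up trait standing in for "something that can render itself as SQL".
trait SqlPart {
    fn as_sql(&self) -> String;
}

impl SqlPart for &str {
    fn as_sql(&self) -> String {
        format!("\"{}\"", self)
    }
}

impl SqlPart for String {
    fn as_sql(&self) -> String {
        format!("\"{}\"", self)
    }
}

fn main() {
    // Differently typed column identifiers in one Vec, via owned trait objects.
    let mut columns: Vec<Box<dyn SqlPart>> = vec![
        Box::new("doing_tag"),
        Box::new(String::from("source")),
    ];

    // Unlike a tuple, we can decide at runtime whether to add more columns.
    let want_status = true;
    if want_status {
        columns.push(Box::new("status"));
    }

    let rendered: Vec<String> = columns.iter().map(|c| c.as_sql()).collect();
    println!("{}", rendered.join(", "));
}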

Over-engineering object mapping

The place where everything really fell apart was with the structure and API for querying values from the database and loading them into structures in the program.

I started with defining a bit of schema, like well known table identifiers and column identifiers that can be used to build queries. I wrote a macro to help with this. Then I needed a few more things like filter expressions, column selection, joining parent tables, and loading rows from child tables. Ultimately, the macro ended up generating a lot of code responsible for this stuff because, every time I wanted to implement it generally, something went wrong – often to do with how I understood traits to work.
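As a very rough idea of the kind of thing the macro produced (this is a hypothetical, much smaller stand-in: the real macro generated identifiers for the query builder plus the loading code described below, not plain strings):

// Hypothetical, simplified stand-in for the schema macro: it only generates a
// module of table and column names. The real one generated much more.
macro_rules! schema {
    ($module:ident => $table:literal { $($column:ident),* $(,)? }) => {
        mod $module {
            #![allow(non_upper_case_globals)]
            pub const table: &str = $table;
            $(pub const $column: &str = stringify!($column);)*
        }
    };
}

schema!(postings => "postings" { doing_tag, source, entity, status, doing });

fn main() {
    println!("{} has columns {:?}", postings::table,
             [postings::doing_tag, postings::source, postings::entity,
              postings::status, postings::doing]);
}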

Also, the generated code can only load rows into a single structure/type per table. And the structure is the same thing used to serialize and deserialize objects to/from the body of an HTTP request or wherever they’re going. So that structure is well suited for that purpose but really hard to use everywhere else.

It does support loading related records. So, if you want to find articles and their authors, it will include the authors in the query with a SQL JOIN and load them. Or, if you want to load all the comments for those articles, it uses the primary keys of the articles you asked for to run another query that fetches their comments. So that’s pretty cool.
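Written out by hand against rusqlite, the two strategies look something like this sketch (the articles/authors/comments tables are just the example from above, not Impetuous’s actual schema):

use rusqlite::{params_from_iter, Connection, Result};

struct Article { id: i64, title: String, author: String }
struct Comment { article_id: i64, body: String }

fn load(conn: &Connection) -> Result<(Vec<Article>, Vec<Comment>)> {
    // Parent ("belongs to") records come along in the same query via a JOIN.
    let mut stmt = conn.prepare(
        "SELECT articles.id, articles.title, authors.name
           FROM articles JOIN authors ON authors.id = articles.author_id",
    )?;
    let articles = stmt
        .query_map([], |row| {
            Ok(Article { id: row.get(0)?, title: row.get(1)?, author: row.get(2)? })
        })?
        .collect::<Result<Vec<_>>>()?;

    // Child ("has many") records are fetched with a second query keyed on the
    // primary keys we just loaded. (A real implementation would skip this when
    // there are no articles, since "IN ()" isn't valid SQL.)
    let placeholders = vec!["?"; articles.len()].join(", ");
    let mut stmt = conn.prepare(&format!(
        "SELECT article_id, body FROM comments WHERE article_id IN ({})",
        placeholders,
    ))?;
    let comments = stmt
        .query_map(params_from_iter(articles.iter().map(|a| a.id)), |row| {
            Ok(Comment { article_id: row.get(0)?, body: row.get(1)? })
        })?
        .collect::<Result<Vec<_>>>()?;

    Ok((articles, comments))
}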

But, by the end, other features were missing, it was awkward to use, and hard to improve on soundly due to the things mentioned above. This really limited the library’s usefulness for future applications.

Over-engineering state transfer

So the gather and shape stuff from earlier ended up being something with its own stringy syntax. I think I was emboldened by interesting formats for filtering and fetching via an HTTP request URL, like those featured in HTSQL and PostgREST:

  • HTSQL filter: /program?school.code='bus'&degree!='bs'

  • HTSQL join: /course{department{school.name, name}, title}

  • PostgREST filter: /people?age=gte.18&student=is.true

  • PostgREST join: /films?select=title,director:directors(id,last_name)

I ended up with something in the middle; except I didn’t think it should support literals. Instead, if you want to write a filter expression that compares a field to a user-provided value, supply a parameter name and pass the value in an accompanying document, like the request body over HTTP.

http 'localhost:7193/doings{start,end,note?text.eq.$text}' \
        text="looking for my red stapler"
[
    {
        "end": "2020-03-09T15:15:00Z",
        "note": {
            "text": "looking for my red stapler"
        },
        "start": "2020-03-09T14:30:00Z"
    }
]

And filters may include and as well as or, like:

and(start.lt.$until, or(end.is.null, end.ge.$since))

Although, I just contradicted myself here because null is a literal value. To be honest, so are true and false. But those are the only other exceptions, I promise. The idea here is just that, if a lack of literals leads to queries that look like field.is.$null or field.is.$true and the parameter name doesn’t really mean or add anything, then that’s not good. So that’s the way it is I guess.

So the gather structure here is, roughly, some resource identifier followed by zero or more of:

  • filter: ? expires_on.le.$today

  • shape: {text, start, end}

  • order: sort(-start)

  • limit: ..$n or ..100[4]

The shape expression is a sequence of gather structures. So you can be a bit recursive, but only through fields that link to other tables.
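A guess at the parsed form (these are not the actual types in Impetuous, just a sketch of the shape of the thing):

// A sketch of what a parsed gather might look like; not the real Impetuous
// types. The earlier doings{start,end,note?text.eq.$text} example would parse
// into a Gather whose shape holds two plain fields plus a nested gather for
// note, and that nested gather carries the text.eq.$text filter.

struct Gather {
    // The resource or link field this gather starts from.
    resource: String,
    // Optional `? ...` filter.
    filter: Option<Filter>,
    // Optional `{ ... }` shape: plain fields, plus nested gathers for fields
    // that link to other tables.
    shape: Option<Vec<Field>>,
    // Optional `sort(...)`.
    order: Vec<(String, Direction)>,
    // Optional `..$n` or `..100` limit.
    limit: Option<Limit>,
}

enum Field {
    Column(String),
    Related(Gather),
}

enum Filter {
    And(Vec<Filter>),
    Or(Vec<Filter>),
    // e.g. `start.lt.$until`: field, operator, parameter name.
    Compare(String, String, String),
}

enum Direction { Ascending, Descending }

enum Limit {
    Parameter(String),
    Literal(u64),
}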

And there are a few other confusing things like:

  • If you ask for articles {author..0} you can limit the author but it doesn’t do anything and doesn’t make sense given every article has exactly one author.

  • Is articles {comments..10} supposed to show at most ten comments in total or per article?

Why not GraphQL?

I don’t think it would have helped. I spent a short amount of time on implementing the syntax. Most of the work was making the query building & object loading stuff, which I did because I couldn’t use Diesel.[5] I would have had to do that work with GraphQL too.[6]

Also, GraphQL is a fair bit more complicated than the sort of thing I needed. It has more features – if you want to look at it that way. There were two things I intentionally designed for.

  1. I want type inference for variables.

    Using variables in GraphQL requires a bunch of extra stuff like specifying the operation name and operation type, as well as the variable type for each variable you’ve defined.

    Can’t the type be inferred? It’s a bother to be explicit about it and I complain easily.

  2. I don’t want support for literals.

    Because I don’t want to compete with actual serialization formats.

    Except, as mentioned earlier, maybe in cases where that encourages the use of parameters named after their literal expression, like $true or $null.

Mutations

This complicated DSL thing only supports fetching data, not mutating it.

I had intended to include syntax for this using + or ! operators. There were a few iterations on this, but it ended up not making a lot of sense as a feature. I had two requirements:

  1. Mutations should be batchable. If I wanted to write a request that adds, updates, or removes one record, the same request should work to modify multiple records if I submit a sequence of items instead of a single item for the parameter values.

  2. It was important to me to have some sort of optimistic concurrency control. In this case, mutations must include, for each value they change, what they believe the current value is.

A consequence of the last point is that, if I try to delete a user with users ! $bob then $bob is the whole user and it must match the current bob. Similarly, I can insert with users + $bob. But what does an update look like? Is it the same as an insert but where $bob is an old-new pair of bobs instead of one bob? That seems a little weird to me for some reason.

Even if I disregard that, these operators struggle to be composed with each other or with any of the operators for fetching. Like, let’s try to figure out what the following mean:

  • articles ? title.eq.$title !

    This probably deletes articles with that title.

    But this kinda skips concurrency control, so you don’t really know what you’re deleting.

  • articles ? title.eq.$title {title + $new_title}

    This probably sets the title to $new_title for each article with the title $title. How do I do this in batch? If $new_title is a sequence, which element do I use?

    Maybe $title is also a sequence of the same length and I’m mapping between the sequences?

    Or we use $title.old and $title.new as parameters instead and $title is a sequence of objects with old and new keys?

    Or let $title be a map instead of a sequence that maps old to new titles. And come up with some syntax for using that. But some popular serialization formats only support maps with strings for keys.

  • articles ? title.eq.$title {comments + $comment}

    This reads that we should insert $comment for every article with the given title. But what if $comment includes a specific value for its article? Is it used, ignored, or an error?

    And this has the same problem as above in terms of being batchable.

At any rate, I realized that none of these examples are things I had any desire to be able to do.

And anyway, the point in the beginning was to use data structures to describe a request. They could be serialized in the same way that the parameter values would be. And you don’t even need parameter values because you don’t need parameters because you don’t need to avoid interpolating data into the query because it’s all one serializable data structure already. So any problems to do with parameters only exist in this weird DSL format.

As it is now, mutations don’t use this fancy query stuff at all. They take a sequence of objects, each specifying what collection to operate on, the old item, and the new item. For deletions, new is null. For insertions, old is null.
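In serde terms, one item in that sequence is more or less the following (the names are my guess, not necessarily what Impetuous actually calls them):

use serde::{Deserialize, Serialize};
use serde_json::Value;

// A guess at the shape of one mutation item; the field names are illustrative.
#[derive(Serialize, Deserialize)]
struct Mutation {
    // Which collection to operate on, e.g. "doings".
    collection: String,
    // What the client believes the record currently is; null for insertions.
    old: Option<Value>,
    // What the record should become; null for deletions.
    new: Option<Value>,
}

fn main() -> serde_json::Result<()> {
    let delete_bob = r#"{ "collection": "users", "old": { "name": "bob" }, "new": null }"#;
    let m: Mutation = serde_json::from_str(delete_bob)?;
    // old present and new null means "delete, but only if it still looks like
    // this", which is where the optimistic concurrency check comes from.
    assert!(m.old.is_some() && m.new.is_none());
    Ok(())
}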

It’s not very fancy. But some of the long term goals have to do with federation, and the way it works now makes a lot more sense for synchronization.

Also, in the middle of this project, I changed my views on some stuff after learning about interesting things like Datalog, Datomic, and Noria and some of the technologies surrounding those things. And now I wonder if user-defined materialized views interpreted from some kind of declarative language is a formidable design for communication between software modules.

I want to write about & play with that more, but this post is quite long already so that’ll have to be something for future me.