Don’t Woof
Mon May 25 '20
Some time last December I discovered a database called Datomic. Datomic has an interesting data model where each value exists as part of an entity, attribute, value, transaction, operation five-tuple known as a Datom.
The transaction and operation allow Datomic to record a history of database changes. While this is cool, it won’t be what I’ll be talking about here.
The entity, attribute, value structure allows describing objects by grouping multiple attributions to the same entity. Through this, a single entity can receive many attributions, even those belonging to different types or object categories.[1]
This is a (somewhat embellished) example of entity, attribute, value entries from the Datomic documentation:
E   A            V
7   :inv/id      "SKU-1234"
7   :inv/color   :green
8   :inv/id      "SKU-5678"
8   :inv/watts   60000
8   :doc/url     "http:…"
Datomic has a few important indexes around the entity, attribute, value structure. Again, quoting their documentation:
Datomic’s indexes automatically support multiple styles of data access: row-oriented, column-oriented, document-oriented, K/V, and graph:
The E-leading index EAVT supports efficient queries for details about particular entities, analogous to a traditional relational database.
The A-leading index AEVT supports efficient queries for a single attribute across all entities, analogous to a column store.
The V-leading index VAET supports efficient queries for references between entities, analogous to a graph database.
The combination of EAVT and VAET supports arbitrary nested navigation among entities. This is like a document database but vastly more flexible. You can navigate in any direction at any time, instead of being limited to the containment hierarchy you selected when storing the document.
For some reason those paragraphs do not mention the AVET index, but that’s also one of the four important indexes that make things work. To review:
Index   Sort Order
EAVT    entity / attribute / value / tx
AEVT    attribute / entity / value / tx
AVET    attribute / value / entity / tx
VAET    value / attribute / entity / tx
So, Datomic lets you throw a bunch of entity, attribute, value tuples into a big soup and index them, allowing efficient lookup of things like the attributes on a given entity or the entities with a given attribute.
What I think is very cool about the database and the entity, attribute, value model are the implications for querying. Datomic uses “an extended form of Datalog” to form queries. And what you get from this is a query language that makes a lot of use of pattern matching.
Pattern Matching
I’d roughly summarize pattern matching as being when your assertions resemble your interrogatives. That is, the thing that you write to ask a question would look like what you could write to state the fact of the answer if you knew it.
I think algebra is a really accessible reference point for this.
Suppose we solve for x in the expression 3x = 12. The fact of what x is can be demonstrated by “plugging in” its value and testing that the expression is true. In this example, the number 4 is the only substitution for x (out of all the numbers) where the expression holds.
Applying this to something structured, like our datoms, we might construct similar expressions to “solve for x” in an entity, attribute, value tuple such as:
x :pet/name "Garfield"
Here, we’re asking what values x could be such that this expression matches an entry in our database. Or, what entities are attributed the :pet/name “Garfield”.
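In terms of the sketch datoms table from earlier (pretending, as the simplified SQL later in this post also does, that attributes are stored by their identifier), that single pattern is just a filtered scan:

-- Which entities have the value 'Garfield' for :pet/name?
SELECT e FROM datoms WHERE a = ':pet/name' AND v = 'Garfield';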
The documentation for Datomic explains their syntax for this:
This query limits datoms to :artist/name “The Beatles”, and returns the entity ids for such results:

[:find ?e :where [?e :artist/name "The Beatles"]]
We can build up more interesting queries by adding more patterns and relating them through variables.
Unification occurs when a variable appears in more than one data pattern. In the following query, ?e appears twice:
;; which 42-year-olds like what?
[:find ?e ?x :where [?e :age 42] [?e :likes ?x]]
Matches for the variable ?e must unify, i.e. represent the same value in every clause in order to satisfy the set of clauses. So a matching ?e must have both :age 42 and :likes for some ?x:
[[fred pizza], [ethel sushi]]
This paragraph is intended to comfortably guide this section into the following one, giving the appearance of a cohesive thought structure.
Is the Man Who Is Tall Happy?
Warning: I am not a word scientist. I basically don’t cite anything and I don’t know what I am talking about and you should not listen to me.
I happened to watch a film titled Is the Man Who Is Tall Happy?, in which the French filmmaker Michel Gondry interviews Noam Chomsky about linguistics.
Gondry asks Chomsky to explain a point made in one of Chomsky’s books, and Chomsky replies:
Take the sentence that you gave me, “The man who is tall is happy”. If you want to form a question from that, you take the word “is” and put it in the front. So, “Is the man who is tall happy?” Right? That’s the question.
You don’t take the first occurrence of “is”. You don’t take the closest one to the front and say, “Is the man who tall is happy?” That’s gibberish.
Why doesn’t the child do the simple thing? Take the first occurrence of “is” and put it in the front? Computationally, that’s much easier than finding the main occurrence which requires knowing the phrases and so on.
It’s the same principle in all languages. So why?
A structure is shown to illustrate this point. It resembles the following:
[figure: a syntax tree of “The man who is tall is happy”, where the “is” of “is tall” sits deep inside the noun phrase and the “is” of “is happy” attaches near the root]
In the film, we trace a path from the beginning of the structure (at the top) down to each “is”. The point is made that more grammatical structure is traversed to get to the “first” “is” compared to the “second” “is”. In terms of the elements of the diagram, more nodes are visited and more lines are drawn in the first path.[2]
In this diagram, the words in the sentence are not forced into a row. Consequently, the nesting is emphasized.
Chomsky continues:
Structurally speaking, this [pointing to the “is” in “is happy”] is the closest to the front. Linearly, this [pointing to the “is” in “is tall”] is the closest to the front.
Now the question is: Why do you use structural proximity and not linear proximity?
The child picks structural closeness because that’s a property of language probably genetically determined.
Chomsky’s argument is part of a larger topic that I don’t understand called universal grammar or something.
The gist, as far as I can tell, is that people have intuitions about language that convey an advantage toward learning and mastering languages that employ the features those intuitions rely on.
In contrast to a behaviourist model, this proposition suggests that not all possible languages can be learned with equal fluency by a particular person given controlled stimuli.
Again, I don’t know what I’m talking about. This is not my domain and you shouldn’t believe anything I’ve written here.
The influence that genetics has on our intuition may be contentious. But that’s a stronger claim than is needed to ask, simply: if the utility of natural and constructed languages is subject to our intuitions and dispositions, might the effect extend to the languages we use in computing?
Broadly, why is language the way it is and not another way?
Does the answer to this question have implications for how we can design utility in either information languages or query languages in software?
How can we know or measure the effect?[3]
I’m probably not going to wake up tomorrow and read on the news that the real reason for the collapse of the Roman Empire was that nobody could remember how to use a window function in their SQL query and their bureaucracy crumbled under its own weight.
Or maybe I will. I guess weirder stuff has been news.
To me, pattern matching has an appeal that I can’t really explain. Maybe it has to do with how I happened to be taught algebra or symbolic logic or something, and it’s not shared with, or generally applicable to, other folks. On the other hand, maybe it has to do with the way my brain, and brains generally, deal with language. I think it would be interesting if that were the case.
My guess, in my capacity as someone who doesn’t know what they’re talking about, is that there is something about how our brains work and how language works that makes certain designs more effective than others, and that the difference between imperative and declarative programming styles is an example of this.
Owoof
Recently, I started making a program that could store entity, attribute, value tuples into a SQLite database and query them with a syntax resembling Datalog.
Like the Datomic examples from earlier, queries should resemble a sequence of entity, attribute, value tuples where each element can optionally be a variable binding.
?b :book/title "The Complete Calvin and Hobbes"
?r :review/book ?b
?r :review/user ?u
?r :review/score 1
There were a couple ways to think about these queries in terms of SQL but I will just talk about the one I implemented.
It’s easiest to explain this as a graph where the nodes are sets of datoms and edges are constraints that relate the sets of datoms together. In terms of SQL, we are selecting a datoms table one or more times and joining it with itself.
[figure: the query above drawn as a graph; four groups of e, a, v boxes, with lines between boxes for the join constraints]
There are four groups of boxes in the figure above, each corresponding to an aliased use of the datoms table in the FROM clause of a SELECT statement, and each edge is an expression in a conjunction in the WHERE clause. The lines between boxes are equality constraints between aliased selections of the datoms table. This is how the table is joined with itself.
Below is an example of a SQL query generated with this approach. For simplicity, what is shown here differs from the real query in that we omit some details to do with parameter binding and denormalizing things like attribute identifiers.
SELECT datoms2.v
FROM datoms datoms0
   , datoms datoms1
   , datoms datoms2
   , datoms datoms3
WHERE datoms0.a = ':book/title'
  AND datoms0.v = 'The Complete Calvin and Hobbes'
  AND datoms1.a = ':review/book'
  AND datoms1.v = datoms0.e
  AND datoms2.e = datoms1.e
  AND datoms2.a = ':review/user'
  AND datoms3.e = datoms1.e
  AND datoms3.a = ':review/score'
  AND datoms3.v = 1
In practice, the program takes a sequence of patterns and, optionally, parameters to control sorting, limiting, and what to output.
To find datoms for the :book/title attribute:
$ owoof '?b :book/title ?t' \
--desc '?b :book/avg-rating'
[
"The Complete Calvin and Hobbes",
"Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5)",
"Words of Radiance (The Stormlight Archive, #2)",
"ESV Study Bible",
"Mark of the Lion Trilogy",
"It's a Magical World: A Calvin and Hobbes Collection",
"Harry Potter Boxset (Harry Potter, #1-7)",
"There's Treasure Everywhere: A Calvin and Hobbes Collection",
"Harry Potter Collection (Harry Potter, #1-6)",
"The Authoritative Calvin and Hobbes: A Calvin and Hobbes Treasury"
]
Notice that we only saw the titles. If no parameters are passed to specify what data to show from the query, the program will look for a variable with no constraints and use that.[4]
In this case, ?t was unconstrained, so the values for ?t that matched entries in our database were shown.
In the diagram further above, showing the book review query as a graph, the ?u variable in the query corresponds to the v cell of the top-left triad of boxes. In the graph, we see there are no edges that end at that cell, showing that the ?u variable has no constraints.
The --show option in the program allows controlling the shape of the output a bit.
Either by specifying a variable with a discrete value …
$ owoof '?b :book/title "The Complete Calvin and Hobbes"' \
'?b :book/authors ?a' \
--show '?a'
[
"Bill Watterson"
]
… or a variable to an entity and one or more attributes to be shown for that entity.
$ owoof '?b :book/title "The Complete Calvin and Hobbes"' \
--show '?b :book/authors'
[
{
":book/authors": "Bill Watterson"
}
]
Next, we look for book ratings with a score of 1 for books where the author is “Dan Brown”. The --show parameter is used to ask for two objects back for each result: one showing the user of the rating, and another showing the book title and ISBN.
$ owoof '?r :rating/score 1' \
'?r :rating/book ?b' \
'?b :book/authors "Dan Brown"' \
--show '?r :rating/user' \
--show '?b :book/title :book/isbn' \
--desc '?b :book/avg-rating'
[
[
{
":rating/user": 9
},
{
":book/title": "Angels & Demons (Robert Langdon, #1)",
":book/isbn": "1416524797"
}
],
[
{
":rating/user": 126
},
{
":book/isbn": "1416524797",
":book/title": "Angels & Demons (Robert Langdon, #1)"
}
],
[
{
":rating/user": 406
},
{
":book/isbn": "1416524797",
":book/title": "Angels & Demons (Robert Langdon, #1)"
}
],
# (more output omitted for brevity...)
This is an example of a more complicated query that shows unpopular books that were rated highly by users who did not enjoy The Complete Calvin and Hobbes.
$ owoof '?calvin :book/title "The Complete Calvin and Hobbes"' \
'?rating :rating/book ?calvin' \
'?rating :rating/score 1' \
'?rating :rating/user ?u' \
'?more-great-takes :rating/user ?u' \
'?more-great-takes :rating/book ?b' \
'?more-great-takes :rating/score 5' \
--show '?b :book/title :book/avg-rating' \
--asc '?b :book/avg-rating' \
--limit 10
[
{
":book/title": "The Short Second Life of Bree Tanner: An Eclipse Novella (Twilight, #3.5)",
":book/avg-rating": 3.51
},
{
":book/avg-rating": 3.57,
":book/title": "Twilight (Twilight, #1)"
},
{
":book/avg-rating": 3.64,
":book/title": "The Memory Keeper's Daughter"
},
{
":book/avg-rating": 3.67,
":book/title": "The Edible Woman"
},
{
":book/title": "Breaking Dawn (Twilight, #4)",
":book/avg-rating": 3.7
},
{
":book/title": "Romeo and Juliet",
":book/avg-rating": 3.73
},
{
":book/title": "A Great and Terrible Beauty (Gemma Doyle, #1)",
":book/avg-rating": 3.79
},
{
":book/avg-rating": 3.8,
":book/title": "Northanger Abbey"
},
{
":book/avg-rating": 3.81,
":book/title": "The Taming of the Shrew"
},
{
":book/avg-rating": 3.84,
":book/title": "Of Mice and Men"
}
]
The SQL query generated for the previous command looks like the following. Unlike the example query shown earlier, this includes some gibberish that helps with value types (the t column), as well as denormalization: attribute entities are referenced by their identifier, and entities by a handle that happens to be a UUID. The only difference between what is shown below and what the program actually runs is that I have placed the bound parameter values in comments near each binding. (Also, note that attributes is not a table, but a view over datoms that match ?e :attr/ident ?v.)
SELECT datoms7.t, CASE datoms7.t WHEN -1 THEN (SELECT uuid FROM entities WHERE rowid = datoms7.v) ELSE datoms7.v END
, datoms8.t, CASE datoms8.t WHEN -1 THEN (SELECT uuid FROM entities WHERE rowid = datoms8.v) ELSE datoms8.v END
FROM datoms datoms0
, datoms datoms1
, datoms datoms2
, datoms datoms3
, datoms datoms4
, datoms datoms5
, datoms datoms6
, datoms datoms7
, datoms datoms8
-- :book/title
WHERE datoms0.a = (SELECT rowid FROM attributes WHERE ident = ?)
AND datoms0.v = ? -- "The Complete Calvin and Hobbes"
AND datoms0.t = ? -- 0
-- :rating/book
AND datoms1.a = (SELECT rowid FROM attributes WHERE ident = ?)
AND datoms1.v = datoms0.e
AND datoms2.e = datoms1.e
-- :rating/score
AND datoms2.a = (SELECT rowid FROM attributes WHERE ident = ?)
AND datoms2.v = ? -- 1
AND datoms2.t = ? -- 0
AND datoms3.e = datoms1.e
-- :rating/user
AND datoms3.a = (SELECT rowid FROM attributes WHERE ident = ?)
-- :rating/user
AND datoms4.a = (SELECT rowid FROM attributes WHERE ident = ?)
AND datoms4.v = datoms3.v
AND datoms5.e = datoms4.e
-- :rating/book
AND datoms5.a = (SELECT rowid FROM attributes WHERE ident = ?)
AND datoms6.e = datoms4.e
-- :rating/score
AND datoms6.a = (SELECT rowid FROM attributes WHERE ident = ?)
AND datoms6.v = ? -- 5
AND datoms6.t = ? -- 0
AND datoms7.e = datoms5.v
-- :book/title
AND datoms7.a = (SELECT rowid FROM attributes WHERE ident = ?)
AND datoms8.e = datoms5.v
-- :book/avg-rating
AND datoms8.a = (SELECT rowid FROM attributes WHERE ident = ?)
ORDER BY datoms8.v ASC
LIMIT ? -- 10
Scale & Performance
The point of the program was to be able to generate queries like this from a datalog-like syntax. Since it is a toy program/side-project without a real use case, feasibility wasn’t a huge concern, but the performance is still interesting. The query above takes about 400ms to run on my laptop when the database file is in my OS cache (or a bit under 4.5s from disk). The database file is 826MB with just over six million datoms.[5]
For comparison, the dataset used, goodbooks-10k, has 3.3MB of books and 72MB of ratings in a CSV format, and I think I only imported about a quarter of the ratings.
Query performance has been flaky at times: minor changes in the given patterns/constraints can change the SQL query plan significantly enough that some queries (that are otherwise sub-second) never finish. Specifically, SQLite may either select less-than-ideal indexes for some sets of datoms[6] or change the order in which sets of datoms are fetched in a way that is detrimental.
Recall that each “set of datoms” corresponds to a group of e, a, v boxes in the earlier diagram and to an aliased “datoms” table in the FROM clause of the SQL query above.
For each entry in the FROM clause, SQLite will typically use one index to populate it.
The dataset used happened to be good at pointing this out. I import ten thousand books and about one and a half million book ratings, so the cardinality of these “objects” is significantly unequal.
Normally, this would be stored in different tables and have different indexes. This would allow SQLite to have a good idea of how big indexes will be and which tables to start with in a query involving joins. But, in my case, everything is in one big table of datoms.
Take, for example, a query that looks for books by a specific author (using :book/authors) and joins that with ratings having a specific score (using :rating/score; a one-to-five scale). Starting by looking for ratings will be slower than starting with the books, since there are far more ratings than books. That is especially true with these filters, since the cardinality of :book/authors is greater than that of :rating/score. In other words, there are many more distinct values for :book/authors than for :rating/score, so filtering by a specific score doesn’t reduce the size of our selection by as much as filtering by specific authors.
In fact, because of the difference in the number of ratings in the database compared to the number of books, filtering for a specific value of :rating/score will yield more datoms than filtering for any/all values of :book/title, even though the former is a stronger constraint. Ultimately, since all the data is in one table and shares the same indexes, SQLite can have trouble estimating the cardinality of these lookups and may have a hard time producing efficient query plans.
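In the simplified SQL shape from earlier (again ignoring parameter binding and attribute denormalization), that author-and-score example looks something like the query below. Prefixing it with EXPLAIN QUERY PLAN asks SQLite which alias it scans first and which index it picks for each one.

EXPLAIN QUERY PLAN
SELECT datoms1.e
FROM datoms datoms0
   , datoms datoms1
   , datoms datoms2
WHERE datoms0.a = ':book/authors'
  AND datoms0.v = 'Dan Brown'
  AND datoms1.a = ':rating/book'
  AND datoms1.v = datoms0.e
  AND datoms2.e = datoms1.e
  AND datoms2.a = ':rating/score'
  AND datoms2.v = 1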
Running the ANALYZE command has solved every instance so far where a query could not complete due to a bad query plan. But I haven’t tested this with very many complicated queries or other data sets.
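For what it’s worth, that is all ANALYZE amounts to here: it gathers per-index statistics into an ordinary table that the query planner consults when estimating how selective each index is.

-- Gather statistics on tables and indexes.
ANALYZE;

-- The collected row estimates live in sqlite_stat1.
SELECT * FROM sqlite_stat1;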
Schema & Reflection
Aside from querying data, we can assert or retract datoms to add or remove information from the database. Datomic keeps a transaction and history of every modification so that retractions don’t actually remove information but just mark it as no-longer-being-a-fact-at-some-monotonically-increasing-transaction-level. This is cool for reasons but my toy project does not do this so I won’t talk about it.
Instead, assertions and retractions literally mean inserting and deleting datoms.
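So retracting a pet’s name, say, is (modulo the value-type and attribute-denormalization details from earlier) something like a plain DELETE:

-- Retracting a datom is literally deleting the row.
DELETE FROM datoms
WHERE e = ? -- the pet's entity
  AND a = ':pet/name'
  AND v = 'Garfield';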
Another feature that Datomic advertises is a self-referential or reflective interface for defining a schema. Datoms let you model books, book ratings, animals, pets, whatever your application domain is about. But database properties themselves are also datoms. Datomic has a rich interface in this respect, allowing attributes to receive attributes that specify things like documentation, uniqueness, and value type. To clarify, attributes can have attributes because they are themselves entities in the database.
My program implements this to a lesser and slightly broken extent.
Shown below is every entity, attribute, value in a new database.[7] The command line interface only knows how to group datoms into objects if you ask for specific attributes. So instead, we ask for the entity, attribute, value tuples and group them by their entity with jq.
$ owoof '?e ?a ?v' \
--show '?e' --show '?a' --show '?v' \
| jq 'group_by(.[0]) | .[] | map({(.[1]): .[2]}) | add'
{
":attr/unique": 1,
":attr/ident": ":entity/uuid",
":entity/uuid": "#26eb5b7a-f6cb-4079-8b48-738c32a4c954"
}
{
":attr/ident": ":attr/unique",
":entity/uuid": "#378f5726-9cf0-40ba-8987-7e2beef3920e"
}
{
":attr/unique": 1,
":attr/ident": ":attr/ident",
":entity/uuid": "#96e193ad-5f39-4820-9c3e-651e95c62ca4"
}
This shows that a new database has three entities.
Remember that the point of the database is that the attributes an entity has are what make it meaningful in a context. For example, giving some entity a value for the :attr/ident attribute (which stands for an attribute identifier) is what is required for that entity to be an attribute in the database.
Each of the three entities in the result above has an :attr/ident, showing that they are all attributes.
The first is the attribute for entity handles, which allow us to reference an entity and identify it uniquely in the system. We can match on :entity/uuid in our queries to search for specific entities, or include an :entity/uuid in an assertion (an insert) to add datoms about an existing entity instead of creating a new one.
The second is an attribute for specifying uniqueness. The presence of that attribute means that the attribute and value pair must be unique together for the entire database. For some attributes, like :book/authors or :pet/name, it should be valid to have the same value for different entities, like if you have books by the same authors or pets with the same name. But this is not valid in some contexts, like attribute identifiers. Having multiple attributes with the same identifier is not meaningful to the database. Setting this attribute on the attribute identifier entity enforces unique values for attribute identifiers.
The third entity is the entity for attribute identifiers. The way this works is a little weird but (spoiler alert) a clue is in the value for the “:attr/ident” key of this entity in the JSON object shown above.
Recall that attributes are just entities with values for the :attr/ident attribute. To create an attribute, we can assert a datom about a new or existing entity with the :attr/ident attribute and a value like “:pet/name” or something.
$ jq -n '{":attr/ident": ":pet/name"}' | owoof assert
"#1d1bb352-b157-4cd9-ba14-a900b4eb88d7"
$ owoof '1d1bb352-b157-4cd9-ba14-a900b4eb88d7 :attr/ident ?v' \
--show '?v'
[
":pet/name"
]
$ jq -n '{":pet/name": "Garfield"}' | owoof assert
"#1e0ca035-e824-4f08-b51b-09faa51c7ec9"
$ owoof --show '?_ :pet/name :entity/uuid'
[
{
":entity/uuid": "#1e0ca035-e824-4f08-b51b-09faa51c7ec9",
":pet/name": "Garfield"
}
]
The interesting bit is in how we create the :attr/ident attribute in the first place.
Generally speaking, given an entity e and a value v, for e to be an attribute, there must exist a datom e a v where a is an entity that is the :attr/ident attribute.
Unfortunately, that definition is a bit recursive. But we can apply it to our :attr/ident attribute, where v is :attr/ident.
Then, there must exist a datom e a :attr/ident where a is an attribute with the value :attr/ident. And it turns out that if a is equal to e then it works. And since the :attr/ident attribute has values that are unique for that attribute (there is at most one :attr/ident attribute), then e must equal a.
So the datom for the :attr/ident attribute references the same entity in its entity and attribute fields.
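In terms of the sketch table from much earlier, the bootstrap datom would look something like the following. The real program stores attributes by rowid and tags value types, so this is just the shape of it.

-- Entity 1 is the :attr/ident attribute, and says so using itself
-- as the attribute: e = a.
INSERT INTO datoms (e, a, v) VALUES (1, 1, ':attr/ident');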
The larger point is that the interface around information about your domain models is the same as for database models. You can query and assert things about database attributes in the same way as you can about pet names or recipes or whatever matters. And I think that’s super cool.
I have put the source code for Owoof on GitHub.