Memory Wheel

Mon May 5 '25

One time, I tried to make a concurrent data structure, but it didn’t work. But I had another go at it recently, it still probably doesn’t work but I learned some things. So this post is about why the queue is useful in computing and some of the things I’ve learned.

The code at github.com/sqwishy/memorywheel.

Blocking

The old queue was lock-free, thread-safe, and fixed-size, supporting concurrent reads and writes from multiple consumers and producers. It would allow a multi-threaded program to, without blocking, send pointers between threads. The pointers don’t have much information in themselves; they are just addresses to shared memory that can contain a longer variable sized message. But the queue ensures that pointers to these messages can be shared exactly one time without blocking.

Blocking refers to when a program pauses and waits for a resource that is not yet available. Often this is a resource provided by the operating system, like a file on disk or a networking socket.

In a lot of situations, having a program that does not block is important. But sometimes there isn’t anything better for the program to do, so then it is valuable to allow a program to wait for a bit. A very common way to do this is through an event loop. An event loop lets a program poll on multiple resources at the same time and wake up when at least one resource becomes available.

On Linux, each resource, like a socket or file, is accessed through a file descriptor. File descriptors can be registered with an event loop, allowing many resources to be waited on at the same time.

In the context of my queue, blocking could happen when trying to read the queue while empty. Or trying to write to it while it’s full. So working along side an event loop to avoid blocking was an important function of the queue.

My Thread-Unsafe Queue

I mentioned earlier that the queue I made was not reliable. Being a fixed-size queue, each write and read adjusted a file descriptor that counted the number of items readable or writable in the queue. The file descriptor was important as it meant the queue could be used with an event loop to avoid blocking. It was also a first-in first-out (FIFO) queue, so all elements must be read in the same order they were written.

It turned out that if two or more threads wrote to the queue at the same time, they could incorrectly signal that there was data to be read. This could happen if they started writing into the queue in one order, but completed in a different order. The fast later writer would signal that data can be read. Then, a reader would try to read from the queue and fail because the slow earlier writer hadn’t yet finished and, because it’s a FIFO queue, the slow earlier writer blocked reading what was written by the faster later writer.

In March this year, I was interested in a non-blocking queue that would work, not just across threads, but across processes with different virtual memory space.

Different virtual memory spaces means that even though processes might share memory, the address that each process has for that memory can be different.

The previous queue only shared pointers to messages in shared memory. This was fine in a multi-threaded programs as all threads will share the virtual memory space. The pointers in one thread will point to the same place in other threads. But to work with processes with different virtual memory spaces, the new queue stores and shares offsets to the start of the shared memory.

That queue is called Memory Wheel. I wrote it in C and you can find the source code at github github.com/sqwishy/memorywheel.

At the moment, it doesn’t support multiple readers or writers. It is (as far as I can tell) thread-safe to read and write concurrently, as long as there is only one reader and one writer. I had plans to at least support multiple readers, but I ran out of time.

Reinventing the Wheel

Because of the virtual memory space thing, entire variable length messages are written into the Memory Wheel instead of just pointers to shared memory.

In a multi-process environment, most of the memory is not shared, as it is for multi-threaded programs. The state of the Memory Wheel lives in shared memory. And since that memory needs to be shared between processes anyway, we may as well use the same region to read and write the messages contents we want to share.

So while the old queue shared pointers to messages elsewhere in memory, the Memory Wheel shares the messages themselves.

These are the steps to send a message with the queue:

A writer atomically reserves a segment of shared memory of a size of its choosing. I call that segment a slice.
That slice is exclusive to the writer and it can safely write to it knowing that nobody else can read or write to it.
The writer atomically signals that it is done using its slice and the slice becomes available for a reader to read.
A reader atomically gets a pointer (in its virtual memory space) to the next readable slice in the queue.
The reader has can safely read from the slice.
The slice is atomically returned, after the reader is done with it, so it can be reserved again in the first step.

Since the internal state of the Memory Wheel lives in shared memory, it uses atomics to ensure that different process can modify the state correctly. It stores offsets, rather then pointers, so that it can work in different virtual memory spaces. The offsets are 32-bit numbers that count along 64 byte sized chunks. Cache lines seem to be 64 bytes in size so keeping slices allocated along 64 byte chunks in this way avoids false sharing, hopefully. Even though the offsets are 32-bits, the 64 byte chunks allows for something like 274 GB of addressable memory. And, importantly, a pair of 32-bit offsets can be packed and read or written together in a single 64-bit atomic!

The shared memory that the Memory Wheel uses a single contiguous fixed size buffer. For the most part, each slice that a writer reserves (step 1 above) will follow the previous one, up until the end of the single contiguous buffer. As the writer signals that each slice is readable (3), the reader picks it up (4), reads it (5), and returns it (6). And when the writer gets to the end of this contiguous chunk of shared memory, it wraps around and starts reusing the slices that the reader has returned. So even though the memory is linear, the wrap-around makes it wheel shaped. And it just keeps going as long as each end can keep pace with the other.

Some folks have taken to referring to this as a circular buffer, but I don’t think that naming will catch on.

On Performance

The biggest takeaway, for me, when tweaking and profiling this was how the overhead of system calls was larger than I expected. This chart shows 🡅 time spent compared to 🡆 packet size.

A line chart showing a slope increasing to the top right. The minimum point at the bottom left represents an eight byte packet and is at 0.39 on the vertical axis for time cost. The maximum point at the top right represents a thirty two thousand byte packet positioned at 1.0 on the vertical axis for time cost.

On my machine, the time taken to send and receive a 32k packet, just over a unix stream socket pair, is only two and a half times as long as the same with just eight bytes. So, even though it’s four thousand times larger, it only takes two and half times as long to send and receive because of the overhead of just the system call in comparison to the cost of copying between kernel and user space.

In order to use the Memory Wheel with an event loop, it has two file descriptors. One is readable when the queue is not empty. Another, writable when the queue is not full.

In earlier versions, a read or write to the queue would need two system calls, one for each file descriptor. This performed about as slow as sockets for smaller sized messages.

I’d wanted an excuse to use bpftrace for a while. So I used it to run this program to measure time spent spent in which system calls.

tracepoint:raw_syscalls:sys_enter /comm == str($1)/
{
  @entry[tid] = nsecs;
}

tracepoint:raw_syscalls:sys_exit /comm == str($1) && @entry[tid]/
{
  @sumt[args.id] += nsecs - @entry[tid];
  delete(@entry[tid]);
}

And it outputs something like this.

@sumt[0]: 79066898
@sumt[1]: 168989018
@sumt[281]: 611894727

It shows a histogram of total time spent in each system call.

Then I looked on this Searchable Linux Syscall Table to figure out what syscall each histogram entry referred to. In this case, it spent a lot of time in epoll_pwait (281), read (1), and write (0). Those read and write calls are what modify the file descriptors that signal when the queue is readable or writable.

While writing the Memory Wheel, I looked at a few other libraries that did a similar thing and tried to copy them. After measuring the time spent in system calls on the eventfds, I remembered seeing a library somewhere that only modified these file descriptors when switching between full and non-full or between empty and non-empty – instead of on every read and write.

So the Memory Wheel has two flags, one for the readable state and another for writable, that are either true or false. These flags are in shared memory and updated with atomics. Instead of updating the eventfd on each read or write, it checks those flags and only updates the eventfd if they change.

A Memory Fondue

The performance isn’t terrible after that optimization. Even with small (16 byte) messages, the Memory Wheel using eventfds with uvloop seems to perform a little better than sockets, running in about 71% the time. And a comparison like that should favour sockets because the Memory Wheel’s benefit of not having to copy between kernel and user space won’t matter for small messages.

There are other benefits to a socket, other than that they’re gonna be less buggy. Like if one end disconnects because the process is killed by a signal and can’t shut down cleanly, then the peer of a stream socket has a way to know that. Whereas in my queue, getting killed at the wrong time could leave the queue in a funny state.

As I was wrapping up this blog post, I saw a mistake in the Memory Wheel. It ate a good chunk of my Sunday fixing it so that this post could be plausible. But surely that was the only bug. And there aren’t seven others lying in darkness ready to pounce – like a sneaky fox.

And it’s still pretty half-baked. There’s even all this extra stuff to support returning slices out of order after reading multiple at a time. But there isn’t a way to read more than one slice at a time anyway.

It would be interesting if, in another version, that cruft were cleaned up and it supported multiple readers and writers.

The cost of system calls was different than I expected. I had expected the copy between user and kernel space was the more impactful than it was when I measured it. I guess any system call is pretty significant just from the context switch or whatever wackiness is going on in there. Measuring is important. And it’s another thing that makes io_uring interesting. I’d like to find a reason to use that one day.

The spin loop version of the Memory Wheel, without eventfds or an event loop, can run in under 20% of the time of the eventfd version just by busy polling. But that is super variable depending on conditions; spin loops are a dangerous game.

Really, the moral of the story is that everything should run on ring zero so everything can be as fast as spin loops. I’ll be the first in line when Amazon puts TempleOS on their AWS Marketplace and brings forth the next era of computing.