# lsr: ls but with io_uring
As an exercise in syscall golf, I wrote an implementation of ls(1) which uses my IO library, ourio, to perform as much of the IO as possible. What I ended up with is something that is faster than any version of, or alternative to, `ls` that I tested, and it also performs an order of magnitude fewer syscalls. I'm calling it `lsr`. Let's start with the benchmarks, then we'll see how we got there.
## Benchmarks
### Time
Data gathered with `hyperfine` on a directory of n plain files.

| Program | n=10 | n=100 | n=1,000 | n=10,000 |
|---|---|---|---|---|
| `lsr -al` | 372.6 µs | 634.3 µs | 2.7 ms | 22.1 ms |
| `ls -al` | 1.4 ms | 1.7 ms | 4.7 ms | 38.0 ms |
| `eza -al` | 2.9 ms | 3.3 ms | 6.6 ms | 40.2 ms |
| `lsd -al` | 2.1 ms | 3.5 ms | 17.0 ms | 153.4 ms |
| `uutils ls -al` | 2.9 ms | 3.6 ms | 11.3 ms | 89.6 ms |
### Syscalls
Data gathered with `strace -c` on a directory of n plain files. (Lower is better)

| Program | n=10 | n=100 | n=1,000 | n=10,000 |
|---|---|---|---|---|
| `lsr -al` | 20 | 28 | 105 | 848 |
| `ls -al` | 405 | 675 | 3,377 | 30,396 |
| `eza -al` | 319 | 411 | 1,320 | 10,364 |
| `lsd -al` | 508 | 1,408 | 10,423 | 100,512 |
| `uutils ls -al` | 445 | 986 | 6,397 | 10,005 |
## How we got there
Let's start with how `lsr` works. To list directory contents, the program basically has three stages (sketched below):

- Parse args
- Gather data
- Print data
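In sketch form, the shape of the program is something like this. This is a schematic only, not `lsr`'s actual source: it uses plain `std.fs` where `lsr` uses io_uring, and it assumes a Zig 0.13/0.14-era stdlib.

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const alloc = gpa.allocator();

    // Stage 1: parse args (just a target directory here).
    var args = try std.process.argsWithAllocator(alloc);
    defer args.deinit();
    _ = args.next(); // skip argv[0]
    const path = args.next() orelse ".";

    // Stage 2: gather data. The only stage that performs IO; in lsr this
    // is where everything is funneled through io_uring instead of std.fs.
    var dir = try std.fs.cwd().openDir(path, .{ .iterate = true });
    defer dir.close();
    var names = std.ArrayList([]u8).init(alloc);
    defer {
        for (names.items) |n| alloc.free(n);
        names.deinit();
    }
    var it = dir.iterate();
    while (try it.next()) |entry| {
        try names.append(try alloc.dupe(u8, entry.name));
    }

    // Stage 3: print data. No further IO until the buffered flush.
    var buffered = std.io.bufferedWriter(std.io.getStdOut().writer());
    const w = buffered.writer();
    for (names.items) |n| try w.print("{s}\n", .{n});
    try buffered.flush();
}
```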
All of the IO involved happens in the second step. Wherever possible, `lsr` uses io_uring to pull in the data it needs. In practice that means we open the target directory with io_uring; if we need local time, user data, or group data, we open (and read) those files with io_uring too. We do all `stat` calls via io_uring, and as needed we do the equivalent of an `lstat` via io_uring. This means the number of syscalls we make should be drastically smaller than in equivalent programs, because we are able to batch the `stat` syscalls. The results clearly show this…`lsr` has at least an order of magnitude fewer syscalls than its closest equivalent, `uutils ls`.
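To make the batching concrete: instead of issuing one statx(2) syscall per file, you queue one statx SQE per directory entry and submit the whole batch with a single io_uring_enter(2). Here is a rough sketch of that pattern against the raw `std.os.linux.IoUring` wrapper. `lsr` actually drives this through ourio, whose API I won't reproduce here; the function below is hypothetical and assumes a Zig 0.13/0.14-era stdlib.

```zig
const std = @import("std");
const linux = std.os.linux;

/// Stat every name in `dir_fd` with one submission instead of one
/// syscall per file. Sketch only: a real implementation would submit
/// in chunks when `names.len` exceeds the ring size.
fn batchStatx(
    ring: *linux.IoUring,
    dir_fd: linux.fd_t,
    names: []const [:0]const u8,
    results: []linux.Statx,
) !void {
    std.debug.assert(names.len == results.len);
    for (names, 0..) |name, i| {
        // user_data carries the index so completions can be matched
        // back to directory entries (they may arrive out of order).
        _ = try ring.statx(
            @intCast(i),
            dir_fd,
            name,
            0,
            linux.STATX_BASIC_STATS,
            &results[i],
        );
    }
    // One io_uring_enter(2) submits every queued statx and waits for
    // all of them to complete.
    _ = try ring.submit_and_wait(@intCast(names.len));
    for (0..names.len) |_| {
        const cqe = try ring.copy_cqe();
        if (cqe.res < 0) return error.StatxFailed;
    }
}
```

The ring is set up once with something like `var ring = try linux.IoUring.init(256, 0);` and reused for every batch, so a 10,000-file directory costs a handful of submissions rather than 10,000 `stat` calls.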
We also use the Zig stdlib StackFallbackAllocator. This lets `lsr` allocate the memory it needs up front, but fall back to a different allocator once the fixed allocation is exhausted. We allocate 1MB up front, which is more than enough for typical usage. This further reduces syscalls by reducing mmap usage.
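The pattern looks roughly like this (a sketch of the stdlib API, not `lsr`'s exact setup; again assuming a 0.13/0.14-era stdlib):

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // 1MB fixed buffer, with the general purpose allocator as fallback.
    // Allocations are served from the buffer until it is exhausted, so
    // typical runs never reach the fallback (and its mmap calls) at all.
    var sfa = std.heap.stackFallback(1024 * 1024, gpa.allocator());
    const alloc = sfa.get();

    // Served from the fixed buffer; no syscall happens here.
    const scratch = try alloc.alloc(u8, 4096);
    defer alloc.free(scratch);
}
```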
As a result of working directly with io_uring, we also bypass several libc-related pitfalls. Namely, we have no dynamic linking: `ls` has considerable overhead in loading libc and related libraries…but it also gains locale support that way, a feature `lsr` does not boast. Despite being statically linked, `lsr` is still smaller than GNU `ls`: 79.3KB when built with ReleaseSmall, vs 138.7KB.
## Anomalies and Thoughts
I have no idea what `lsd` is doing. I haven't read the source code, but from viewing its `strace` output, it is calling `clock_gettime` around 5 times per file. Why? I don't know. Maybe it's doing internal timing of steps along the way?
Sorting ends up being a massive part of the workload. I suspect this is where `uutils ls` is getting slowed down, since it does pretty well on a syscall basis. `lsr` spends about 30% of its runtime sorting; the rest is the IO loop.
This ended up being a pretty fun project to write, and it didn't take too much time either. I am shocked at how much io_uring can be used to reduce syscalls…`ls` is a pretty basic example, but you can only imagine how much of an effect this would have on something like a server.
Also: I'm using tangled.sh for this project. They have a really cool idea, and I want to see how the PR workflow is, so…if you have any bugs or changes, please visit the repo. All you need is an atproto account + app password. I suspect more icons will be needed; feel free to open an issue for icon requests!