I was kind of bored for once and figured I could clean up an old microbenchmark and kick the results out the door.
This is gonna be another very technical post for programmers. Be prepared. ;)
A while ago I came across a paper about "exokernels", very light-weight operating system kernels that leave most of the work up to libraries (Exokernel: An Operating System Architecture for Application-Level Resource Management).
The paper is from 1995 and among other things they implemented high performance system calls.
I always heard system calls are expensive but I never saw any real numbers or measurements.
Only the stuff the C standard library was doing with fread() and fwrite(): batching I/O to minimize system calls.
But system calls have evolved a lot since the 32 bit x86 days and today we have a dedicated syscall instruction for them.
This got me wondering: How fast are system calls on Linux today compared to normal function calls?
In the paper they compared a function call with no arguments to the getpid() system call.
That system call just does the minimum of work: Switch into kernel mode, read one variable and return its value.
Sounds simple, so I wrote a set of small microbenchmarks and did the same on my Linux notebook.
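In case it helps, here's roughly what such a comparison boils down to (just a sketch for illustration, not the actual benchmark code; the plain_function name is made up):

    #include <stdio.h>
    #include <unistd.h>   /* getpid() */

    /* Baseline: a normal function call with no arguments.
       noinline keeps the compiler from inlining the call away. */
    __attribute__((noinline)) static int plain_function(void) {
        return 42;
    }

    int main(void) {
        int x = plain_function();   /* plain function call */
        pid_t pid = getpid();       /* the "minimal work" system call */
        printf("%d %d\n", x, (int)pid);
        return 0;
    }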
Back when reading the paper I just ran the microbenchmarks on my own notebook (an old Intel Core 2 Duo T6600 from 2009).
But right now I'm visiting my brother, who owns quite a few computers, so I asked him to run the microbenchmarks on some of them.
We also ran the benchmarks on some older CPUs (see the photo :)).
Anyway, it gave me a nice perspective on how the performance of system calls evolved over time.
Mind you, I just took whatever PC was close by (no idea how fast or slow) as well as a few older CPUs.
I was curious how system call performance changed over the years on x86_64 and don't really care about what CPU is faster or slower.
Don't take these numbers and say that CPU X is faster than CPU Y. It just doesn't make sense.
System calls are one of many factors that make up the performance of a CPU and software but by no means a defining one.
So don't take the numbers too seriously. This is just for fun after all. ;)
Before we get to the numbers I have to explain a bit about how Linux optimizes some system calls:
On 64 bit x86, system calls are usually made through the syscall instruction.
You put the arguments into the proper registers and the syscall instruction then transitions into kernel mode and calls the proper kernel function.
This incurs some overhead since the CPU has to switch into a different address space and this might need updates to the TLB, etc.
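Just to illustrate the mechanism (this is a sketch with GCC inline assembly, not the benchmark code from the repository): the system call number goes into rax, arguments would go into rdi, rsi, rdx and so on, the syscall instruction does the transition, and the result comes back in rax.

    #include <stdio.h>
    #include <sys/syscall.h>   /* SYS_getpid */

    int main(void) {
        long pid;
        /* getpid() takes no arguments, so only rax (the syscall number)
           needs to be set up. The kernel clobbers rcx and r11 as part of
           the syscall mechanism. */
        __asm__ volatile (
            "syscall"
            : "=a" (pid)               /* output: return value in rax */
            : "a" ((long)SYS_getpid)   /* input: syscall number in rax */
            : "rcx", "r11", "memory"
        );
        printf("pid via syscall instruction: %ld\n", pid);
        return 0;
    }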
For some system calls the Linux kernel provides optimized versions via the vDSO.
In some cases these optimized versions don't need to transition into kernel space and thus avoid the overhead.
By default the C runtime on Linux (glibc) uses such an optimized version automatically whenever one is available.
I'm interested in how both of these mechanisms perform.
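In C the two paths roughly look like this (again just a sketch): glibc's syscall() wrapper always goes through the syscall instruction, while the normal getpid() call lets glibc pick the optimized version if there is one.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>        /* getpid(), syscall() */
    #include <sys/syscall.h>   /* SYS_getpid */

    int main(void) {
        /* Always goes through the syscall instruction. */
        long pid_raw = syscall(SYS_getpid);

        /* glibc is free to use an optimized path here if one exists. */
        pid_t pid_libc = getpid();

        printf("raw: %ld, libc: %d\n", pid_raw, (int)pid_libc);
        return 0;
    }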
On to the numbers. Just for a bit of context: Reading a value from the main memory takes about 100ns, see Latency Numbers Every Programmer Should Know.
Syscalls incur quite a heavy overhead compared to normal function calls, but that has gotten better over time.
The vDSO implementation of the getpid() system call is pretty good at mitigating the system call overhead and is almost as fast as a normal function call.
On the Intel Celeron D 341 from 2004 a system call via the syscall instruction was about 25 times slower than a system call via the vDSO.
On the Intel Core i7-4790K from 2014 it's only about 12 times slower.
For myself I'll use 10 times slower as a rule of thumb for modern CPUs and 25 times for older CPUs.
In case you're wondering about the details:
- All benchmarks were run on Linux Mint booted via USB stick (Linux 4.4.0-21-generic)
- The benchmarks execute the function call or syscall 10 million times in a loop (a rough sketch of that loop follows after this list).
- The benchmark is run 10 times. The average of those runs is then divided by 10 million to get the time per function or system call.
- The function call benchmark is compiled with GCC (I also did it by hand in assembler but the result was almost the same).
- The syscall benchmark is programmed by hand in assembler using yasm.
- The vDSO benchmark is a small C program compiled with GCC. When you use the "normal" C functions for system calls glibc automatically uses the vDSO.
- I've put the source code and results of the microbenchmarks into a repository. Feel free to look around in case you have any questions. You can also find a lot more graphs there.
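If you just want the gist without digging through the repository, the timing loop is basically something like this (a simplified sketch; the real benchmarks also repeat the whole run 10 times and average the results):

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERATIONS 10000000UL   /* 10 million calls */

    int main(void) {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (unsigned long i = 0; i < ITERATIONS; i++)
            getpid();   /* the call being measured */
        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                          + (end.tv_nsec - start.tv_nsec);
        printf("%.2f ns per call\n", elapsed_ns / ITERATIONS);
        return 0;
    }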
Performance of fread() and fwrite() I/O batching
I was also wondering about the I/O batching the standard C library does with fread() and fwrite().
Is it really worth the effort? Or is all that batching wasted work, since system calls can be pretty fast, too?
A few microbenchmarks answered that question, too. This time I measured 1 million 4 byte reads and writes.
Again directly via the syscall instruction and via the vDSO. But this time also via fread() and fwrite() instead of using the system calls directly.
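The two write variants roughly look like this (a simplified sketch, file names are made up and error handling is omitted): one loop does a write() system call per 4 byte value, the other goes through fwrite() and lets the C library batch the data into bigger writes.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define COUNT 1000000   /* 1 million 4 byte writes */

    int main(void) {
        int value = 42;

        /* Variant 1: one write() system call per 4 byte value. */
        int fd = open("direct.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (int i = 0; i < COUNT; i++)
            write(fd, &value, sizeof(value));
        close(fd);

        /* Variant 2: fwrite() fills stdio's buffer and only issues a
           system call when the buffer is full. */
        FILE *file = fopen("buffered.bin", "wb");
        for (int i = 0; i < COUNT; i++)
            fwrite(&value, sizeof(value), 1, file);
        fclose(file);

        return 0;
    }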
Here's what I got:
So, yeah, in this scenario the I/O batching definitely is worth it. It's about 5 to 10 times faster.
But keep in mind, these were 4 byte reads and writes. I took a look with strace and they got batched into 4 KiByte reads and writes.
So when you're reading or writing a few KiBytes the speedup is likely not that large.
It's also interesting that the vDSO doesn't help much for the read() and write() system calls. But usually it doesn't hurt either.
So maybe these system calls can't be optimized in that way.
If you're still with me: Thanks for reading. I hope my strange little experiment was interesting. :)