Arkanis Development


Fast line iteration in PHP


It's been a long time since I wrote anything here. During the last 2 years of study I haven't had much time to finish anything. Well, now I have the time. One of the first things on the list is an optimized mail parser completely written in PHP (so you don't need to install any extensions). Since mails are a line-based format, the parser will have to iterate over lines quite often. So it's worth doing some microbenchmarks on that.

I'll focus on two use cases: iterating over the lines of a file, and iterating over the lines of a string that is already in memory (a memory buffer).

Please be aware that this is a microbenchmark! Results can vary quite a bit depending on the rest of your code. The values here are a starting point for your own function or library microbenchmarks, not a definitive answer.

Test setup

All tests were performed with PHP 5.4.6 on Linux. The test data was a 508 MiByte dump of the digitalmars.D newsgroup with CRLF ("\r\n") line breaks and 12 930 435 lines (on average 41.2 Bytes per line). The NNTP, IMAP and HTTP protocols also use these line breaks.

I measured the throughput and memory usage of different line iteration approaches. Most of them were already listed on Stack Overflow. Throughput is the size of the 508 MiByte dump divided by the time it took to crunch through it. The time was measured with the microtime() function. On the memory side I measured the peak memory usage with memory_get_peak_usage() after the iteration was done. Not perfect, but I also tested measuring the memory usage during iteration: practically the same values, but the measurement degraded the throughput significantly. So we stick with the peak memory usage after the iteration.
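As a rough sketch (the actual benchmark scripts are on GitHub and differ in details, and the file name here is just a placeholder), one measurement looked more or less like this:

$file  = 'mails.txt'; // placeholder, the real test data path differs
$size  = filesize($file);
$start = microtime(true);

$fd = fopen($file, 'rb');
while( ($line = fgets($fd)) !== false ) {
    // just iterate, the benchmark does nothing with $line
}
fclose($fd);

$duration = microtime(true) - $start;
printf("throughput: %.1f MiByte/s\n", $size / $duration / 1024 / 1024);
printf("peak memory: %.2f MiByte\n", memory_get_peak_usage() / 1024 / 1024);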

All tests were performed 11 times and the result is the average of test runs 2 to 11. The first run only served to load the entire file into the kernel read cache. Therefore the result of the first run is discarded. I tried to eliminate IO factors here since they can vary extremely if your data comes from a normal hard disk, an SSD or directly from a network connection.

All tests are online at GitHub. Take a look for any details not mentioned here.

Facts first

The results are grouped by use case: first the results of file iteration, then the results of memory buffer iteration.

Performance of iterating over the lines of a file

Details on the different approaches are in the source listings below.

I subtracted the buffer size from the peak memory usage in the memory buffer tests. This leaves only the memory usage of the iteration (that's what's interesting, right?).
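In code this adjustment is just a subtraction, roughly like this (a sketch, the actual test scripts may do it slightly differently):

// report only the memory overhead of the iteration itself,
// not the buffer that holds the test data
$iteration_memory = memory_get_peak_usage() - strlen($data);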

Performance of memory buffer iteration

Details on the different approaches are in the source listings below.

Source and details of different approaches
fgets() from file
$fd = fopen(…, 'rb');
while( ($line = fgets($fd)) !== false ) {
    // do something with $line
}
fclose($fd);
stream_get_line() from file
$fd = fopen(…, 'rb');
while( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line
}
fclose($fd);
fscanf() from file
$fd = fopen(..., 'rb');
$size = fstat($fd)['size'];
while (true) {
    $matches = fscanf($fd, "%[^\r\n]");
    if ($matches === false)
        break;
    $line = $matches[0];
    // do something with $line
}
fclose($fd);

Not really useful for raw line iteration, but fscanf() can be useful for simple line-based text data like configuration files. For example, take a simple config file:

key: "value"
key: "value" # comment
# comment

Kind of simplified YAML without nesting. fscanf() parsed it with 77 MiByte/s (see file_conf_fscanf.php). fgets() followed by a simple preg_match() got about 42 MiByte/s (see file_conf_regexpr.php). Both with a peak memory usage of 0.25 MiB.

For simple data fscanf() can be quite fast. Note that you won't find useful documentation in PHP's manual. The only useful documentation of fscanf() I know of is the Linux man page or the C language specification.
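To give an idea of how that config parsing might look, here is a rough sketch. The format string and file name are assumptions; the actual file_conf_fscanf.php script may well do it differently:

$fd = fopen('simple.conf', 'rb'); // hypothetical config file
while ( !feof($fd) ) {
    // each fscanf() call consumes one line of the file
    $matches = fscanf($fd, '%[^:]: "%[^"]"');
    if ( is_array($matches) && isset($matches[0], $matches[1]) ) {
        list($key, $value) = $matches;
        // do something with $key and $value
    }
    // comment and empty lines simply don't match the format and are skipped
}
fclose($fd);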

fgets() from memory buffer
$data = file_get_contents(...);
$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
rewind($fd);
while( ($line = fgets($fd)) !== false ) {
    // do something with $line
}
fclose($fd);
strtok() from memory buffer
$data = file_get_contents('../mails.txt');
$line = strtok($data, "\r\n");
while ($line !== false) {
    // do something with $line
    $line = strtok("\r\n");
}
stream_get_line() from memory buffer
$data = file_get_contents(...);
$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
rewind($fd);
while( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line
}
fclose($fd);
strpos() from memory buffer
$data = file_get_contents(...);
$pos = 0;
while ( ($line_end_pos = strpos($data, "\r\n", $pos)) !== false ) {
    $line = substr($data, $pos, $line_end_pos - $pos);
    // do something with $line
    $pos = $line_end_pos + 2;
}
preg_split() with foreach from memory buffer
$data = file_get_contents(...);
foreach( preg_split('/\r?\n/', $data) as $line ) {
    // do something with $line
}
Closing remarks

Again this is just a microbenchmark. Results can vary greatly depending on what you're actually doing. Most problems aren't as simple as pure line iteration… especially since the benchmark doesn't do anything with the lines. A very clever compiler would optimize the entire thing away.

Remember that the stream-based approaches (e.g. fgets() or stream_get_line()) can use the full power of PHP's streams. They can read data from pretty much everything and you can combine them with filters to shovel e.g. base64 decoding or decompression into internal PHP routines. I haven't tested that though.
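As an untested sketch (the file name is made up, but convert.base64-decode is one of PHP's built-in stream filters), decoding base64 data while iterating over its lines could look like this:

$fd = fopen('attachment.b64', 'rb'); // made-up file containing base64 encoded text
stream_filter_append($fd, 'convert.base64-decode', STREAM_FILTER_READ);
while( ($line = fgets($fd)) !== false ) {
    // $line is a line of the already decoded data
}
fclose($fd);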

Finally, just a small thing to put the throughput of PHP into perspective. The fastest value I measured was 261 MiByte/s. The line count command pv test-data.txt | wc -l got a throughput of 1.6 to 1.8 GiByte/s (pv is like cat but displays a progress bar and some statistics). So if you really need performance don't bother with PHP. Grab C, C++ or D and get going. :)

6 comments for this post


#1 by Chris

Thanks for these benchmarks & remarks! You seem to have fun in your "Winter of code".

What hardware are you on? On my AMD Athlon X2 64 bit PC with an internal 1 TB HDD I got only 82 MiB/s with an 8.4 GiB text file:

$ pv /media/bigone/headers-dupl2.txt | wc -l
8,35GB 0:01:50 [77,6MB/s] [==================================>] 100%

Average over 10 s in the middle of the operation:

$ sudo iotop -d 10 -a -b -o
Total DISK READ: 82.80 M/s | Total DISK WRITE: 31.27 K/s

A bit faster when throwing away the output:

$ pv headers-dupl2.txt > /dev/null
8,35GB 0:01:45 [81,2MB/s] [==================================>] 100%

Input file: I concatenated the Linux kernel headers n times, the file looks like this:

$ ll headers-dupl2.txt -h
-rwxrwxrwx 1 root root 8,4G Sep 27 11:32 headers-dupl2.txt

$ head -5 headers-dupl.txt
/usr/src/linux-headers-3.2.0-41/
/usr/src/linux-headers-3.2.0-41/block
/usr/src/linux-headers-3.2.0-41/block/Kconfig
/usr/src/linux-headers-3.2.0-41/block/Kconfig.iosched
/usr/src/linux-headers-3.2.0-41/block/Makefile

OK, it turns out it is very hardware dependent; on an Athlon Opteron 2427 server with 12 cores there is more throughput. The line counting has a lot of effect on the server:

# pv headers-dupl2.txt | wc -l
8.35GB 0:00:12 [ 677MB/s]

# pv headers-dupl2.txt > /dev/null
8.35GB 0:00:03 [2.28GB/s]

#2 by Stephan

Thanks. :)

The tests were run on a Core i5-3570K with 16 GiByte of RAM and a Western Digital WD10EADS-00L. In my case the kernel cached the entire 508 MiByte file after the first test run. This was just an easy way to eliminate IO interference from the benchmark. Otherwise I would have tested the read throughput of my disk instead of the PHP code. The effect is best seen when using the dump multiple times:

$ pv mails.txt | wc -l
508MB 0:00:06 [82.1MB/s] ...
$ pv mails.txt | wc -l
508MB 0:00:00 [1.73GB/s] ...
$ pv mails.txt | wc -l
508MB 0:00:00 [1.71GB/s] ...

The first run is limited by the read throughput of the disk. The following runs get the data straight from the kernel's read cache.

In your case the dump is probably too big for the kernel to cache it. To cache 8 GiByte of data you'll probably need 16 or 32 GiByte of RAM in the machine. Therefore your results pretty much mirror the throughput of your disk. Nevertheless, if the server is the server I think it is, its SAN has a nice throughput (no idea how fast a SAN should actually be). :)

To eliminate the IO component you can try using smaller text files (few hundred MiByte) and run the test multiple times.

#3 by Mark

How exactly did you code "fgets with rtrim"? Thanks!

#5 by mahdy

You load the whole content with "file_get_contents" and then use strpos. It consumes a lot of memory because the whole file will be in memory. fgets will load the file chunk by chunk, which causes no memory exhaustion.

#6 by Stephan

When the data is stored in a file fgets() is usually the winner (see Results in the article).

But when you already have the data in a memory buffer things look different. For example when you're reading a mail directly from a network connection. When you can't take the memory hit of duplicating your data strpos() is the way to go. Just as you implied. But when you have the memory and want more performance you can open the data as a stream and use fgets(), too. In my measurements this was more than twice as fast.

But be aware, the measurements are quite old (2013). The situation might have changed by now.
