With .NET Core 2.1 released I want to come back to an old Stack Overflow question of mine where I asked how fast one can parse a text file with C# and .NET 4.0. Now I want to look at how fast we can get today with .NET 4.7.2, .NET Core 2.1 and C++ on x86 and x64. Each line of the text file holds a double followed by an integer and a newline, like this:
1.1 0
2.1 1
3.1 2
and so on for 10 million lines. The result is a 179 MB text file which we want to parse as fast as possible.
In 2011 my home machine was a Core Duo 6600 @ 2.40 GHz with Windows 7 running .NET 4.0. Let's see how much the numbers from 2011 have improved now that I have a
- Better CPU – Haswell CPU Architecture I7 4770K 3.5 GHz
- Newer Windows – Windows 10 1803
- Newer .NET Framework – .NET 4.7.2
- New .NET Core 2.1
To put the managed code numbers into perspective, I thought it would be good to compare them with a C implementation which reads the ANSI text file with fgets, tokenizes the string with strtok and converts the numbers with atof and atoi. That should be pretty fast, right?
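A minimal sketch of such a C baseline could look like the following. It is not the code that was measured for this post; the function names parse_line and parse_file, the 256 byte line buffer and the running sums are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "double int" line in place. strtok writes '\0' into the
   buffer, so the line must be writable. Returns 1 on success and 0 if
   one of the two tokens is missing. */
static int parse_line(char *line, double *value, int *index)
{
    assert(line != NULL && value != NULL && index != NULL);

    char *token = strtok(line, " \r\n");
    if (token == NULL)
        return 0;
    *value = atof(token);

    token = strtok(NULL, " \r\n");
    if (token == NULL)
        return 0;
    *index = atoi(token);
    return 1;
}

/* Read the file line by line with fgets and parse every line.
   Returns the number of parsed lines, or -1 if the file cannot be
   opened. The sums only exist so that the parsed values are used. */
static long parse_file(const char *path, double *value_sum, long long *index_sum)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char line[256];   /* assumption: no line is longer than 255 chars */
    long count = 0;
    *value_sum = 0.0;
    *index_sum = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        double value;
        int index;
        if (parse_line(line, &value, &index)) {
            *value_sum += value;
            *index_sum += index;
            ++count;
        }
    }
    fclose(fp);
    return count;
}
```

Calling parse_file with the path of the test file from main and timing the call gives a rough baseline. Note that atof and atoi report no errors, so malformed numbers silently become 0.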
This test checks how fast we can read a file from disk into one huge buffer. If we execute it a second time, the data is served from the file system cache. Once warmed up, we measure the memory copy and soft page fault performance of the OS when it assigns memory to our application. It may look very simple, but it can tell you a lot about the disk performance and caching behavior of your OS.
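As a rough, portable illustration of reading a file into one buffer, here is a sketch that uses the C standard library instead of the Win32 ReadFile call. The helper names read_stream and read_whole_file are hypothetical, and because ftell returns a long the sketch assumes files below 2 GB, which is enough for the 179 MB test file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything from an open, seekable stream into one heap buffer.
   Returns NULL on failure; the caller has to free the buffer. */
static char *read_stream(FILE *fp, size_t *size)
{
    assert(fp != NULL && size != NULL);

    if (fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    long length = ftell(fp);
    if (length < 0)
        return NULL;
    rewind(fp);

    /* Allocate at least one byte so an empty file still yields a buffer. */
    char *buffer = malloc(length > 0 ? (size_t)length : 1);
    if (buffer == NULL)
        return NULL;

    size_t got = fread(buffer, 1, (size_t)length, fp);
    if (got != (size_t)length) {
        free(buffer);
        return NULL;
    }
    *size = got;
    return buffer;
}

/* Open the file in binary mode and read it in one go. */
static char *read_whole_file(const char *path, size_t *size)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;
    char *buffer = read_stream(fp, size);
    fclose(fp);
    return buffer;
}
```

Timing read_whole_file twice on the same file should show the difference between a cold read from disk and a warm read from the file system cache.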
With so many professional serializers at our disposal it would be a sin to dismiss one of the fastest ones. MessagePack is an efficient binary format for which a great library exists from Yoshifumi Kawai (aka neuecc), who wrote the fastest one for C#. At 133 MB the resulting file is the smallest one.
When using a text based format, JSON is also a must. Interestingly, at 169 MB the JSON file is even a bit smaller than the original text format. Here we also use the fastest JSON serializer, named Utf8Json. Credits for writing the fastest JSON serializer go to Yoshifumi Kawai as well.
This test reads the file line by line but does nothing with the data. Since we do not allocate much except some temporary strings, it shows how fast we can get when reading the UTF-8 file and expanding the data to UTF-16 strings.
One of the biggest improvements of .NET Core was the introduction of Span based APIs everywhere. They allow you to read characters from raw buffers without the need to copy the data around again. With that you can write nearly allocation free APIs which operate on unmanaged memory without the need to use unsafe code. To make that fast, the JIT compiler was enhanced to emit the most efficient code possible. For the regular .NET Framework there is the System.Memory NuGet package, which brings the type on board but without any help from the JIT compiler. This test is especially interesting since it shows how much you lose if you do not have a helping hand from the JIT compiler.
This is how one could write a single method parser with some hard coded assumptions to become as fast as possible. Interestingly, this one is slower than the Span based version, which is both safer and faster.
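The idea of a single method parser with hard coded assumptions can be sketched as follows. This is a C analogue for illustration, not the C# code measured here. The name parse_record is hypothetical, and the sketch assumes ASCII input, non-negative numbers without exponents, exactly one space between the two numbers, and few enough digits to fit into a long long.

```c
#include <assert.h>
#include <stddef.h>

static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Parse one "double int" record from the range [p, end) in one pass.
   Returns a pointer to the start of the next record, or NULL if the
   input does not match the assumed format. */
static const char *parse_record(const char *p, const char *end,
                                double *value, int *index)
{
    assert(p != NULL && end != NULL && value != NULL && index != NULL);

    if (p >= end || !is_digit(*p))
        return NULL;

    /* Collect all digits of the double as one integer and remember the
       power of ten contributed by the fractional digits. */
    long long mantissa = 0;
    double scale = 1.0;
    while (p < end && is_digit(*p))
        mantissa = mantissa * 10 + (*p++ - '0');
    if (p < end && *p == '.') {
        ++p;
        while (p < end && is_digit(*p)) {
            mantissa = mantissa * 10 + (*p++ - '0');
            scale *= 10.0;
        }
    }
    *value = (double)mantissa / scale;

    /* Exactly one space separates the two numbers. */
    if (p >= end || *p != ' ')
        return NULL;
    ++p;

    if (p >= end || !is_digit(*p))
        return NULL;
    int n = 0;
    while (p < end && is_digit(*p))
        n = n * 10 + (*p++ - '0');
    *index = n;

    /* Skip the line ending (\n or \r\n). */
    while (p < end && (*p == '\r' || *p == '\n'))
        ++p;
    return p;
}
```

Calling parse_record in a loop over a buffer that holds the whole file parses it without allocating or copying. The price is that any input outside the stated assumptions is rejected or parsed incorrectly.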
C++ Is Still A Thing!
C++ should be the fastest with its library routines, which have been carefully optimized and profiled for decades. The code looks like plain C from 20 years ago, but it is only slightly faster than the naïve Standard Parse C# version.
Standard Parse C#
That is how you would do it when writing idiomatic C# code.
One obvious way to become faster is to buy a newer CPU. That works and can give you a speedup of roughly a factor of two if you wait long enough between significant CPU generations. But the economic benefit of shrinking CPU dies has decreased a lot: with 7 nm processes at the door you need EUV lithography, and with all of its challenges further huge improvements are not expected soon. Since one atom is about 0.1 nm, there is not much room left to scale further. If you want to increase the frequency a lot, you also run into fundamental problems. The speed of light in conducting materials is ca. 200 000 km/s. During 1 ns (one clock cycle at 1 GHz) a signal can travel 20 cm. If you increase the frequency to 10 GHz, a signal can travel only 2 cm per cycle. If your chip is bigger than that, a signal cannot reach its destination in the necessary time. So there is a natural upper limit to how much faster current CPU designs can get if they rely on ack signals of some sort.
With the rise of AI and its huge power consumption we still can and will become much more energy efficient. Although Google is leading with its dedicated Tensor Processing Units (TPUs), the power needed to perform simple image recognition is at least 1000 times higher than that of the biological counterparts (that is only a guess, and this number is likely much too low). The next big thing will definitely be neuromorphic chips which can rewire themselves depending on the problem to solve. With these processors you could make Google search a commodity where no single corporation owns all of your data. Instead it could become a public service run by many providers where the search graph is shared and we as customers can sell our data on our own. Whether that ever becomes true depends on who creates such highly energy efficient chips first.
The plain ReadFile call is about 4 times faster. That performance increase is not only due to the better CPU: the soft page fault performance of the Windows 10 Fall Creators Update also became better compared to Windows 7, which I was using at that time. The better soft fault performance is also part of Windows Server 2019 (finally). More details about soft fault performance can be found here.
32 vs 64 Bit
The C version shows the biggest difference between x86 and x64. It turns out that the x86 version is about 40% slower, mainly due to much higher costs in atof, which seems to be implemented rather slowly on x86. In the screenshot below you can see that ucrtbase:common_atof needs 2.4 s on x86 vs 1.3 s in the x64 version, which is a pretty big perf hit. Although C/C++ can be very efficient, this is not always the case for the runtime libraries you use. By using standard libraries you are not guaranteed to get world class performance on your current hardware. Extended instruction sets in particular can only be used by binaries specifically crafted for each CPU. The standard approach in C/C++ is to create a library for each CPU architecture.
That quickly becomes pretty complicated to test. In a JIT compiled world we still need to test on different CPUs, but we get the best possible assembly code for our CPU for free if the runtime is up to date, as is the case with RyuJIT of .NET Core.
The next observation is that the new Span based version is 80% slower on x86 than on x64 (1.35 s vs 0.75 s with .NET Core 2.1). If you are after high throughput scenarios, it definitely makes sense to switch over to 64 bit and use mainly index based structures to alleviate the 100% increase in pointer size for the most significant cases.
.NET Core 2.1 vs .NET 4.7.2
For streaming data into memory, .NET 4.7.2 is between 2% (Unsafe Parse x86), 7% (Unsafe Parse x64) and 40% (Span Parse x64) slower than .NET Core. The 40% speed drop for the Span based APIs, caused by the lack of JIT support in the .NET Framework runtime, is especially unfortunate. But since the .NET Framework was declared legacy and most innovation in the years to come will happen in .NET Core, it is unlikely that the .NET Framework JIT compiler will get these new capabilities. The main reasons were compat issues, but we can already switch over to the legacy JIT if we do not want to use the new RyuJIT compiler. It should be easy to make the new JIT compiler of .NET Core optional, so that users of the .NET Framework can enable the new functionality once they have verified that their scenarios work.
Based on the numbers it makes sense to switch over to x64 and Span based APIs if you are running on .NET Core. But even if you are not using .NET Core, Span based APIs offer quite a lot. They make it possible to spare the copy operation from unmanaged locations into the managed heap. The result is nearly allocation free applications which can scale very well. Another lesson is that if you can change your problem a bit, e.g. by changing the data format, you can get even faster. The MessagePack serializer on x64 .NET Core 2.1 is 25% faster than the Span based text approach. That is a bit unfair since there is practically no parsing involved, but it shows that you can achieve near optimal performance with efficient binary formats which contain mainly primitive types. Even our pretty generic UTF-8 JSON file approach performed 34% better than the x64 C/C++ version. In the end you can become very fast with any language if you profile your application and fix the obvious bottlenecks. To be able to do that you need to use performance profiling tools for your platform. My favorite tool is ETW, as you might have guessed from my previous posts. Are you profiling your code before problems are reported? What is your favorite profiling tool?