Joel Verhagen

a computer programming blog

The fastest CSV parser in .NET

Latest update: 2021-08-09, with new versions. Sylvan.Data.Csv takes the lead with SIMD!

Specific purpose tested

My goal was to find the fastest low-level CSV parser. Essentially, all I wanted was a library that gave me a string[] for each line where each field in the line was an element in the array. This is about as simple as you can get with a CSV parser. I don’t care about parsing headings or dynamically mapping fields to class properties. I can do all of that myself faster than reflection with C# Source Generators, rather trivially.

So if you want a feature-rich library and don’t care as much about 1 millisecond vs 10 milliseconds, read no further and just use CsvHelper. It’s the “winner” from a popularity standpoint and has good developer ergonomics in my experience. Using an established, popular library is probably the best idea since it’s the most battle-tested and has the best examples and Q&A online.

CSV libraries tested

I tested the following CSV libraries.

And… I threw in three other implementations that don’t come from packages:

  • An implementation I called “HomeGrown” which is my first attempt at a CSV parser, without any optimization. 🤞
  • An implementation simply using string.Split. This is broken for CSV files containing escaped comma characters, but I figured it could be a baseline.
  • Microsoft.VisualBasic.FileIO.TextFieldParser, which is a built-in CSV parser.
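
To make the string.Split caveat concrete, here is a minimal sketch that is not taken from the benchmark repository. SplitDemo, NaiveSplit, and QuoteAwareSplit are hypothetical names. The quote-aware version assumes RFC 4180-style quoting, where a doubled quote inside a quoted field is a literal quote, and it only handles a single line (no line breaks inside fields).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of the benchmark code.
public static class SplitDemo
{
    // The baseline approach: every comma is treated as a delimiter.
    public static string[] NaiveSplit(string line) => line.Split(',');

    // A minimal quote-aware splitter for a single line.
    // Assumption: RFC 4180-style quoting, so "" inside a quoted field is a literal quote.
    public static string[] QuoteAwareSplit(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        var inQuotes = false;

        for (var i = 0; i < line.Length; i++)
        {
            var c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote inside a quoted field
                    i++;                 // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas are plain text while quoted
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // field boundary
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // the last field has no trailing comma
        return fields.ToArray();
    }
}
```

For the input a,"b,c",d the naive version returns four pieces because it cuts the quoted field in two, while the quote-aware version returns three. That extra bookkeeping per character is part of why a real parser cannot be as cheap as the Split baseline.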

Results

These are the parse times for a CSV file with 1,000,000 lines. The units are in seconds.

🏆 Congratulations Sylvan.Data.Csv! This library has taken first place by parsing a 1 million line file in 1.39 seconds. Mark employed a new strategy to pull ahead of the pack by using SIMD. His first attempt worked great on newer Intel processors, but it didn’t perform as well on the older AMD Zen 2 processor I use for the benchmarks. He was kind enough to enhance his implementation to work well even on my older hardware (more context on his PR).

Since I originally posted, Josh Close (author of CsvHelper, the most popular library) has put a lot of work into performance and has brought his implementation from 10th place to a close 4th place. HUGE improvement. I haven’t tested “higher level” data mapping scenarios (which are likely the most common CsvHelper usages) but it’s really exciting to see such a big performance improvement in the most popular CSV parsing library.

I also want to mention some incredible work by Aurélien Boudoux who reworked his library FluentCSV to improve the performance. I have to say this is the most impressive performance improvement I’ve seen in this project. In the last round, his 2.0.0 version clocked in at 57 seconds for 1 million lines (last place). With version 3.0.0, he brought the time down to 3.2 seconds (fifth place)! Read more about his own performance analysis. Awesome work, Aurélien! Great to see such a nice API with excellent performance.

RecordParser, a newly tested library, took third place from CsvHelper. I hadn’t heard of this library before, but the performance is excellent and the adapter code is extensive and leverages several OSS libraries (Ben.StringIntern and System.IO.Pipelines), so it may be useful to look through to learn tricks for your own performance adventures.

Finally, I wanted to mention that two of the libraries provide better performance via data mapping to a full POCO or only provide parsing via data mapping in their API, as opposed to most libraries which provide raw, low-level access to a string array per row. This means that the tests for these APIs are perhaps not as representative of their intrinsic parsing performance. But I chose to still include them for completeness. These libraries are ChoETL (reference) and FileHelpers (reference). Thanks Mark Pflug for this investigation.

Also, a previous version of this post was using .NET Core 3.1. .NET 5 gave a measurable improvement on all implementations, averaging about a 10% reduction in runtime. Then, when we moved the benchmarks from .NET 5 to .NET 6 preview, we got another 4% reduction in runtime on average. Nice work .NET team!

Most shockingly, my HomeGrown implementation is not the worst. And the code is beautiful 😭 (as a father says to his ugly kid). In fact, it looks to be a very average implementation. So proud.

I’m talking smack?

Am I defaming your library? Point out what I missed! I make mistakes all the time 😅 and I’m happy to adjust the report if you can point out a legitimate flaw in my test.

I did my best to use the lowest level (and presumably highest performance?) API in each library. If I can adjust my implementations to squeeze out more performance or be more truthful to the intended use of each library API, let me know or open a PR against my test repository.

Feel free to reach out to me however you can figure out (can’t make it too easy for the spammers).

My motivation

For one of my side projects, I was using CSV files as an intermediate data format. Essentially I have an Azure Function writing results to Azure Table Storage and another Function collecting the results into giant CSV files. These CSV files get gobbled up by Azure Data Explorer allowing easy slice and dice with Kusto query language. Kusto is awesome by the way.

To save money on the Azure Function compute time, I wanted to optimize all of the steps I could, including the CSV reading and writing. Therefore, I naturally installed a bunch of CSV parsing libraries and tested their performance 😁.

Methodology

I used BenchmarkDotNet to parse a CSV file I had lying around containing NuGet package asset information generated from NuGet.org. It has a Good Mixture™ of data types, empty fields, and string lengths. I ran several benchmarks for varying file sizes, anywhere from an empty file to one million lines.

I put each library in an implementation of some ICsvReader interface I made up that takes a TextReader and returns a list of my POCO instances.
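
As a rough sketch of what that contract could look like: the shapes below are assumptions based on the description above and on the record.Read(i => pieces[i]) call in the adapter sample further down, and the definitions in the repository may differ. PackageAsset and its two properties are hypothetical stand-ins for the real record type.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Assumed shape: a record that fills itself from a field accessor.
public interface ICsvReadable
{
    // getField(i) returns the raw string for column i of the current row.
    void Read(Func<int, string> getField);
}

// Assumed shape of the adapter contract that each library is wrapped in.
public interface ICsvReader
{
    List<T> GetRecords<T>(TextReader reader) where T : ICsvReadable;
}

// Hypothetical POCO; the real benchmark records have more columns.
public class PackageAsset : ICsvReadable
{
    public string Id { get; private set; } = "";
    public string Version { get; private set; } = "";

    // Hand-written mapping from column index to property, with no reflection.
    public void Read(Func<int, string> getField)
    {
        Id = getField(0);
        Version = getField(1);
    }
}
```

Because the record pulls its own fields by index, each adapter only has to expose "give me field i of the current row", which keeps the per-library code small and the mapping cost identical across libraries.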

I used IL Emit for activating (“newing up”/“constructing”) the POCO instances, partly because this is the fastest way to dynamically activate objects (given enough executions to amortize the initialization cost). Also, one of the libraries I tested hard codes this method for activation, so I wanted all of the libraries to have the same characteristics in this regard.
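
For readers who have not seen IL Emit activation before, here is a minimal sketch of the general technique using System.Reflection.Emit.DynamicMethod. EmitActivator is a hypothetical name; it is a stand-in for illustration and not necessarily how the ActivatorFactory in the repository is written.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical stand-in for illustration; the repository's factory may differ.
public static class EmitActivator
{
    // Builds a delegate equivalent to `() => new T()` by emitting IL directly.
    public static Func<T> Create<T>() where T : class
    {
        // Only public parameterless constructors are considered here.
        var ctor = typeof(T).GetConstructor(Type.EmptyTypes)
            ?? throw new InvalidOperationException(
                typeof(T) + " needs a public parameterless constructor.");

        var method = new DynamicMethod(
            "Create_" + typeof(T).Name,    // name, only useful for diagnostics
            typeof(T),                     // return type
            Type.EmptyTypes,               // no parameters
            typeof(EmitActivator).Module,  // module the method is associated with
            true);                         // skip JIT visibility checks

        var il = method.GetILGenerator();
        il.Emit(OpCodes.Newobj, ctor); // construct a new T and push it on the stack
        il.Emit(OpCodes.Ret);          // return the reference on top of the stack

        // The one-time cost is paid here; calling the delegate afterwards is cheap.
        return (Func<T>)method.CreateDelegate(typeof(Func<T>));
    }
}
```

The intended usage is to call Create<T>() once per record type and then invoke the returned delegate once per row, which is where the amortization mentioned above comes from.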

I tested execution time, not memory allocation. Maybe I’ll update this post later to talk about memory.

Library-specific adapters

Each library-specific implementation is available on GitHub.

All of the implementations look something like this:

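public List<T> GetRecords<T>(MemoryStream stream) where T : ICsvReadable
{
    // Build the record factory once so its setup cost is amortized over all rows.
    var activate = ActivatorFactory.Create<T>();
    var allRecords = new List<T>();

    using (var reader = new StreamReader(stream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // This sample is the string.Split baseline, which breaks on escaped commas.
            var pieces = line.Split(',');
            var record = activate();
            // The record pulls each field it needs by column index.
            record.Read(i => pieces[i]);
            allRecords.Add(record);
        }
    }

    return allRecords;
}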

Code and raw data

The code for this is stored on GitHub: joelverhagen/NCsvPerf

The BenchmarkDotNet output and Excel workbook (for the charts and tables above) are here: BenchmarkDotNet.Artifacts-5.0-6.zip

The test was run on my home desktop PC, which runs Windows 10 and .NET 5.0.1 on an AMD Ryzen 9 3950X CPU.

Update log

Update 2021-08-13 (commit 39dd976)

  • Switched from .NET 5 to .NET 6 (6.0.100-preview.7.21379.14)
  • Updated Angara.Statistics from 0.1.0 to 0.1.4
  • Updated BenchmarkDotNet from 0.13.0 to 0.13.1
  • Updated Open.Text.CSV from 2.3.3 to 2.4.0
  • Updated Sylvan.Data.Csv from 1.1.5 to 1.1.6

This entire update was done by Mark via a PR. Thanks!

Results - BenchmarkDotNet.Artifacts-6.0-7.zip

Update 2021-08-09 (commit 8fa7626)

  • Added Angara.Table via a PR.
  • Updated ChoETL from 1.2.1.18 to 1.2.1.22 and enhanced the adapter for 3x perf wins via a PR.
  • Added Cesil via a PR.
  • Added Dsv via a PR.
  • Added KBCsv via a PR.
  • Added Microsoft.Data.Analysis via a PR.
  • Added Microsoft.ML via a PR.
  • Added Open.Text.CSV via a PR.
  • Updated Sylvan.Data.Csv from 1.0.3 to 1.1.5 via a PR.
  • Updated TxtCsvHelper from 1.2.9 to 1.3.1 via a PR.

This entire update was done by Mark. Thanks so much ❤️!

Results - BenchmarkDotNet.Artifacts-5.0-6.zip

Update 2021-08-05 (commit a26df9c)

  • Updated CsvHelper from 27.1.0 to 27.1.1.
  • Updated FlatFiles from 4.15.0 to 4.16.0.
  • Added RecordParser via a PR. Thanks Leandro!
  • Updated TinyCsvParser from 2.6.0 to 2.6.1.
  • Added TxtCsvHelper via a PR. Thanks Cameron!

Results - BenchmarkDotNet.Artifacts-5.0-5.zip

Update 2021-06-16 (commit 514a037)

  • Added ChoETL via a PR. Thanks, Josh!
  • Added CommonLibrary.NET via a PR. Thanks, Josh!
  • Added CSVFile via a PR. Thanks, Josh!
  • Updated CsvHelper from 20.0.0 to 27.1.0.
  • Added CsvTools via a PR. Thanks, Josh!
  • Added FlatFiles via a PR. Thanks, Josh!
  • Updated FluentCSV from 2.0.0 to 3.0.0 via a PR. Thanks Aurélien!
  • Added LinqToCsv via a PR. Thanks, Josh!
  • Updated ServiceStack.Text from 5.10.4 to 5.11.0.
  • Updated SoftCircuits.CsvParser from 2.4.3 to 3.0.0.
  • Added Sky.Data.Csv via a PR. Thanks, Josh!
  • Updated Sylvan.Data.Csv from 0.9.0 to 1.0.3 via a PR and a subsequent change by me. Thanks, Mark!

Results - BenchmarkDotNet.Artifacts-5.0-3.zip

Update 2021-01-18 (commit 8005d5)

  • Added Ctl.Data by request.
  • Added Cursively via a PR from @airbreather. Thanks, Joe!
  • Added Microsoft.VisualBasic.FileIO.TextFieldParser by request.
  • Added SoftCircuits.CsvParser by request.
  • Updated CsvHelper from 19.0.0 to 20.0.0 via a PR from @JoshClose. Thanks, Josh!
  • Updated ServiceStack.Text from 5.10.2 to 5.10.4.
  • Updated Sylvan.Data.Csv from 0.8.2 to 0.9.0.
  • Switched to a fork of FastCsvParser to avoid duplicate DLL name.

Results - BenchmarkDotNet.Artifacts-5.0-2.zip

Update 2021-01-06 (commit 586f602)

  • Moved to .NET 5.0.1
  • Added FluentCSV by request.
  • Added Sylvan.Data.Csv via a PR from @MarkPflug. Thanks, Mark!
  • Updated Csv from 1.0.58 to 2.0.62.
  • Updated CsvHelper from 18.0.0 to 19.0.0.
  • Updated mgholam.fastCSV from 2.0.8 to 2.0.9.

Results - BenchmarkDotNet.Artifacts-5.0.zip

Initial release 2020-12-08 (commit 57c31a)

Results - BenchmarkDotNet.Artifacts.zip