Joel Verhagen

a computer programming blog

The fastest CSV parser in .NET

Latest update: 2024-01-13, added newcomer Addax.Formats.Tabular, updated several packages, and Sep takes first place with multithreading support. I moved to .NET 8. I also added server GC results (results until now had been workstation GC only).

Specific purpose tested

My goal was to find the fastest low-level CSV parser. Essentially, all I wanted was a library that gave me a string[] for each line, where each field in the line was an element in the array. This is about as simple as you can get with a CSV parser. I don’t care about parsing headings or dynamically mapping fields to class properties. I can do all of that myself rather trivially with C# Source Generators, faster than reflection.

So if you want a feature-rich library and don’t care as much about 1 millisecond vs 10 milliseconds, read no further and just use CsvHelper. It’s the “winner” from a popularity standpoint and has good developer ergonomics in my experience. Using an established, popular library is probably the best idea since it’s the most battle-tested and has the best examples and Q&A online.

Results

These are the parse times for a CSV file with 1,000,000 lines. The units are milliseconds. I’ve only shown the top 20 results in the chart because it was getting hard to read. You can see the other results on the 2nd chart below in the update log.

Here are the results for each implementation (see the CSV libraries tested section below for links).

Rank Implementation Workstation GC Server GC
1 Sep (parallel) 479 ms 170 ms
2 RecordParser (parallel) 868 ms 387 ms
3 Sep 1034 ms 720 ms
4 Sylvan.Data.Csv 1083 ms 769 ms
5 HomeGrown2 (HomeGrown with string pooling) 1519 ms 1227 ms
6 RecordParser 1545 ms 1171 ms
7 Cursively 1627 ms 1026 ms
8 CsvHelper 2243 ms 1959 ms
9 SoftCircuits.CsvParser 2734 ms 825 ms
10 string.Split (broken for escaped commas) 2936 ms 1042 ms
11 FluentCSV 3012 ms 1143 ms
12 TxtCsvHelper 3078 ms 1108 ms
13 NReco.Csv 3170 ms 1434 ms
14 HomeGrown (my own implementation) 3221 ms 1497 ms
15 mgholam.fastCSV 3260 ms 1553 ms
16 Addax.Formats.Tabular 3302 ms 1413 ms
17 GenericParsing 3462 ms 1653 ms
18 Sky.Data.Csv 3504 ms 1470 ms
19 KBCsv 3519 ms 1753 ms
20 Dsv 3542 ms 1549 ms
21 ServiceStack.Text 3624 ms 1634 ms
22 CSVFile 3800 ms 1819 ms
23 Microsoft.ML (parallel) 3885 ms 823 ms
24 FileHelpers 3953 ms 1855 ms
25 FastCsvParser 4152 ms 1715 ms
26 CsvTextFieldParser 4164 ms 2392 ms
27 Ctl.Data 4281 ms 1452 ms
28 Open.Text.CSV 4622 ms 3697 ms
29 Cesil 4623 ms 3399 ms
30 LumenWorksCsvReader 4723 ms 2879 ms
31 LinqToCsv 4983 ms 3194 ms
32 TinyCsvReader 5071 ms 3186 ms
33 CsvTools 5857 ms 3672 ms
34 StackOverflowRegex (a StackOverflow implementation) 6587 ms 5278 ms
35 FlatFiles 6597 ms 4881 ms
36 Csv 7255 ms 6591 ms
37 Angara.Table 7991 ms 5708 ms
38 Microsoft.Data.Analysis 11401 ms 8835 ms
39 Microsoft.VisualBasic.FileIO.TextFieldParser (built-in) 15266 ms 15693 ms
40 ChoETL 18915 ms 16719 ms
41 CommonLibrary.NET 27247 ms 15460 ms

🏆 Congratulations Sep! This library has taken first place by parsing a 1 million line file in 479 milliseconds on workstation GC and a blistering 170 ms on server GC. Do note that this library is one of only three tested using parallelism in its implementation (RecordParser and Microsoft.ML being the others). ML.NET has had a parallel TextLoader for some time; although poorly documented, it is enabled by default via TextLoader.Options.UseThreads. If you want to stick to a single-threaded parsing library, you can also use Sep in a single-threaded fashion and still get amazing performance.

One thing to consider with Sep (and others) is that escaped fields (double quotes) will not be unescaped automatically unless you set the Unescape = true property on the SepReaderOptions. For more information, see Sep’s README on unescaping. Most other libraries provide unescaped strings by default, but not all, and some libraries, like RecordParser, had unescaping and/or quote handling disabled in the benchmarks above. The performance impact of automatic unescaping for Sep is negligible in this benchmark, while RecordParser (parallel) works best without quote handling/unescaping.
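Based on the description above, enabling unescaping in Sep looks roughly like this (a sketch from my reading of Sep’s README; the column name "value" is a hypothetical placeholder, and you should check the README for the exact API):

```csharp
using nietras.SeparatedValues;

// Unescape = true makes Sep return unescaped/unquoted field values
// instead of the raw characters between separators.
using var reader = Sep.Reader(o => o with { Unescape = true })
    .FromFile("data.csv");

foreach (var row in reader)
{
    // "value" is a hypothetical column name for illustration.
    var value = row["value"].ToString();
}
```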

Caveats

Note that this benchmark has several characteristics which may not match your production environment. If you’re concerned about how these results will match up with your own application, I highly recommend running your own performance tests.

The data in this test benchmark has:

  • No double quotes in any field value
  • No commas in any field value
  • No new line characters (\r a.k.a. CR a.k.a. carriage return or \n a.k.a. LF a.k.a. line feed) in any field value
  • A lot of repeated field values

Having no double quotes (and therefore no escaped commas or new lines inside them) allows some libraries, like RecordParser (parallel), to skip looking for quotes entirely and only look for commas and new lines. It also allows skipping the work of tracking quotes and unescaping them. Sep’s default configuration still finds and handles quotes correctly regardless, unlike e.g. RecordParser. For the performance of automatic unescaping with Sep, see the detailed benchmarks in Sep’s README.

This means that parsers that aren’t particularly fast at handling escaping (e.g. a row like hello,"my good",friend) might do very well at this benchmark but not perform as well on your data containing a lot of double quotes. For a lot of repeated values, a parser can implement string pooling (sometimes called “interning”) to reduce allocations. If your data has few or no repeated values, then a string pool will be unnecessary overhead.
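As an illustration of that last point, a minimal string pool might look like the sketch below (my own illustrative code, not the pooling used by HomeGrown2 or any tested library). It hashes the candidate span without allocating and only allocates a new string the first time a value is seen:

```csharp
using System;
using System.Collections.Generic;

// Minimal string pool sketch: return the same string instance for every
// occurrence of the same character sequence, avoiding duplicate allocations.
sealed class StringPool
{
    private readonly Dictionary<int, List<string>> _buckets = new();

    public string GetOrAdd(ReadOnlySpan<char> chars)
    {
        var hash = string.GetHashCode(chars); // hash without allocating a string
        if (!_buckets.TryGetValue(hash, out var bucket))
        {
            _buckets[hash] = bucket = new List<string>();
        }
        foreach (var existing in bucket)
        {
            if (chars.SequenceEqual(existing.AsSpan())) return existing;
        }
        var str = new string(chars); // first sighting: allocate and remember
        bucket.Add(str);
        return str;
    }
}
```

The trade-off is exactly as described: with many repeated values the lookup is cheaper than an allocation; with mostly unique values every lookup is pure overhead.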

All of the historical results and the current column chart graphics use workstation GC as opposed to server GC. As the results table above shows, server GC can have a profound impact and changes the rankings significantly. For example, Microsoft.ML is in the top 5 with server GC but not even in the top 20 with workstation GC.

Finally, Mark Pflug (author of the Sylvan.Data.Csv library) did some great research on new line and quoted field handling in the various CSV parsers. See joelverhagen/NCsvPerf#52 for more information. Perhaps in the future I’ll capture this support in a column in the table above. The point here is that some CSV parsing libraries handle CSV edge cases better than others.

In general, given the specific characteristics of the data in this benchmark, it is entirely feasible to write a custom fast parallel parser that would beat all of the above since the problem becomes embarrassingly parallel with no quotes. Some libraries are already heavily customized for the data and have custom options that are specific to the benchmark. Please be aware of this and measure on your own data and consult documentation or READMEs for the different libraries.
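To make the “embarrassingly parallel” point concrete, here is what a naive chunked parallel parse might look like (a hypothetical sketch of my own, not code from the benchmark repo; it is only correct when no field contains quotes, escaped commas, or new lines):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class NaiveParallelCsv
{
    // Only valid for quote-free data: every '\n' is a real row boundary,
    // so chunks can be parsed completely independently.
    public static List<string[]>[] Parse(string text, int chunkCount)
    {
        // Chunk boundaries, snapped forward to the next newline so no row is split.
        var bounds = new List<int> { 0 };
        for (int i = 1; i < chunkCount; i++)
        {
            var nl = text.IndexOf('\n', (int)((long)text.Length * i / chunkCount));
            if (nl >= 0 && nl + 1 > bounds[^1]) bounds.Add(nl + 1);
        }
        bounds.Add(text.Length);

        var chunks = new List<string[]>[bounds.Count - 1];
        Parallel.For(0, chunks.Length, c =>
        {
            var rows = new List<string[]>();
            for (int pos = bounds[c]; pos < bounds[c + 1];)
            {
                int nl = text.IndexOf('\n', pos, bounds[c + 1] - pos);
                if (nl < 0) nl = bounds[c + 1];
                var line = text.Substring(pos, nl - pos).TrimEnd('\r');
                if (line.Length > 0) rows.Add(line.Split(','));
                pos = nl + 1;
            }
            chunks[c] = rows;
        });
        return chunks;
    }
}
```

The moment quoted fields are allowed, a '\n' may sit inside a field and chunk boundaries can no longer be found without scanning, which is why general-purpose parsers can’t cheat this way.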

CSV libraries tested

I tested the following CSV libraries.

And… I threw in a few other implementations that don’t come from packages:

  • An implementation I called “HomeGrown” which is my first attempt at a CSV parser, without any optimization. 🤞
  • “HomeGrown2” which is like “HomeGrown” but has string pooling. Thanks @leandromoh!
  • An implementation simply using string.Split. This is broken for CSV files containing escaped comma characters, but I figured it could be a baseline.
  • Microsoft.VisualBasic.FileIO.TextFieldParser, which is a built-in CSV parser.
  • A regex based CSV-parser from StackOverflow, as suggested by @diogenesdirkx.

I’m talking smack?

Am I defaming your library? Point out what I missed! I make mistakes all the time 😅 and I’m happy to adjust the report if you can point out a legitimate flaw in my test.

I did my best to use the lowest level (and presumably highest performance?) API in each library. If I can adjust my implementations to squeeze out more performance or be more truthful to the intended use of each library’s API, let me know or open a PR against my test repository.

Feel free to reach out to me however you can figure out. (can’t make it too easy for the spammers)

My motivation

For my side project NuGet Insights, I was using CSV files as an intermediate data format. Essentially I have an Azure Function writing results to Azure Table Storage and another Function collecting the results into giant CSV files. These CSV files get gobbled up by Azure Data Explorer allowing easy slice and dice with Kusto query language. Kusto is awesome by the way.

To save money on the Azure Function compute time, I wanted to optimize all of the steps I could, including the CSV reading and writing. Therefore, I naturally installed a bunch of CSV parsing libraries and tested their performance 😁.

Methodology

I used BenchmarkDotNet to parse a CSV file I had laying around containing NuGet package asset information generated from NuGet.org. It has a Good Mixture™ of data types, empty fields, and string lengths. I ran several benchmarks for varying file sizes – anywhere from an empty file to one million lines.

I put each library in an implementation of some ICsvReader interface I made up that takes a TextReader and returns a list of my POCO instances.
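The interface is roughly the following shape (a sketch; the exact definition lives in the repo and may differ):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public interface ICsvReadable
{
    // Populate this record's fields via the provided zero-based field accessor.
    void Read(Func<int, string> getField);
}

public interface ICsvReader
{
    List<T> GetRecords<T>(MemoryStream stream) where T : ICsvReadable;
}
```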

I tested execution time, not memory allocation. Maybe I’ll update this post later to talk about memory.

Library-specific adapters

Each library-specific implementation is available on GitHub.

All of the implementations look something like this:

public List<T> GetRecords<T>(MemoryStream stream) where T : ICsvReadable
{
    var activate = ActivatorFactory.Create<T>();
    var allRecords = new List<T>();

    using (var reader = new StreamReader(stream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Naive split: breaks on quoted fields that contain commas.
            var pieces = line.Split(',');
            var record = activate();
            record.Read(i => pieces[i]);
            allRecords.Add(record);
        }
    }

    return allRecords;
}

Code and raw data

The code for this is stored on GitHub: joelverhagen/NCsvPerf

The BenchmarkDotNet artifacts and the Excel workbook (for the charts and tables above) are here: BenchmarkDotNet.Artifacts-8.0-10.zip

The test was run on my home desktop PC which is Windows 11, .NET 8.0.0, and an AMD Ryzen 9 3950X CPU.

Update log

Update 2024-01-13 (commit 40ae38c)

  • Added server GC results
  • Switched from IL emit activation to new T() (thanks Mark for informing me about related perf improvements in .NET!)
  • Switched from .NET 7 to .NET 8 (8.0.0 runtime, 8.0.100 SDK)
  • Added Addax.Formats.Tabular 1.2.0
  • Updated BenchmarkDotNet from 0.13.9 to 0.13.12
  • Updated FSharp.Core from 7.0.401 to 8.0.101
  • Updated GenericParsing from 1.3.0 to 1.5.0
  • Updated Microsoft.Data.Analysis from 0.20.1 to 0.21.0
  • Updated Microsoft.ML from 2.0.1 to 3.0.0
  • Updated RecordParser from 2.1.0 to 2.3.0 (thanks Leandro!)
  • Updated Sep 0.2.7 to 0.4.0 and added multithreaded implementation (thanks nietras!)
  • Updated ServiceStack.Text from 6.11.0 to 8.0.0
  • Updated Sylvan.Common from 0.4.2 to 0.4.3 (thanks Mark!)
  • Updated Sylvan.Data.Csv from 1.3.3 to 1.3.5
  • Updated System.Collections.Immutable from 7.0.0 to 8.0.0

Results - BenchmarkDotNet.Artifacts-8.0-10.zip

Update 2023-10-26 (commit b8b7f1f)

  • Updated BenchmarkDotNet from 0.13.5 to 0.13.9
  • Updated ChoETL from 1.2.1.50 to 1.2.1.64
  • Updated CSVFile from 3.1.0 to 3.1.1
  • Updated FSharp.Core from 7.0.200 to 7.0.401
  • Added GenericParsing 1.3.0
  • Added HomeGrown2
  • Updated NReco.Csv from 1.0.1 to 1.0.2
  • Updated Open.Text.CSV from 3.3.3 to 3.4.0
  • Updated RecordParser from 1.3.0 to 2.1.0 and added a separate “parallel” benchmark
  • Added Sep 0.2.7
  • Updated ServiceStack.Text from 6.7.0 to 6.11.0
  • Updated SoftCircuits.CsvParser from 4.0.0 to 4.1.0 and switched to recommended API
  • Updated Sylvan.Common from 0.4.1 to 0.4.2
  • Updated Sylvan.Data.Csv from 1.2.7 to 1.3.3

Results - BenchmarkDotNet.Artifacts-7.0-9.zip

Update 2023-03-17 (commit d201f24)

  • Switched from .NET 6 to .NET 7 (7.0.4 runtime, 7.0.202 SDK)
  • Updated BenchmarkDotNet from 0.13.1 to 0.13.5
  • Updated ChoETL from 1.2.1.22 to 1.2.1.50
  • Updated Csv from 2.0.62 to 2.0.93
  • Updated CSVFile from 3.0.2 to 3.1.0
  • Updated CsvHelper from 27.1.1 to 30.0.1
  • Updated CsvTextFieldParser from 1.2.1 to 1.2.2
  • Updated FileHelpers from 3.5.0 to 3.5.2
  • Updated FlatFiles from 4.16.0 to 5.0.4
  • Updated Microsoft.Data.Analysis from 0.18.0 to 0.20.1
  • Updated Microsoft.ML from 1.6.0 to 2.0.1
  • Updated NReco.Csv from 1.0.0 to 1.0.1
  • Updated Open.Text.CSV from 2.4.0 to 3.3.0
  • Updated RecordParser from 1.2.0 to 1.3.0
  • Updated ServiceStack.Text from 5.11.0 to 6.7.0
  • Updated SoftCircuits.CsvParser from 3.0.0 to 4.0.0
  • Updated Sylvan.Common from 0.2.1 to 0.4.1
  • Updated Sylvan.Data.Csv from 1.1.6 to 1.2.7
  • Updated System.Collections.Immutable from 5.0.0 to 7.0.0
  • Updated TinyCsvParser from 2.6.1 to 2.7.0
  • Updated TxtCsvHelper from 1.3.1 to 1.3.3

Results - BenchmarkDotNet.Artifacts-7.0-8.zip

Update 2021-08-13 (commit 39dd976)

  • Switched from .NET 5 to .NET 6 (6.0.100-preview.7.21379.14)
  • Updated Angara.Statistics from 0.1.0 to 0.1.4
  • Updated BenchmarkDotNet from 0.13.0 to 0.13.1
  • Updated Open.Text.CSV from 2.3.3 to 2.4.0
  • Updated Sylvan.Data.Csv from 1.1.5 to 1.1.6

This entire update was done by Mark via a PR. Thanks!

Results - BenchmarkDotNet.Artifacts-6.0-7.zip

Update 2021-08-09 (commit 8fa7626)

  • Added Angara.Table via a PR.
  • Updated ChoETL from 1.2.1.18 to 1.2.1.22 and enhanced the adapter for 3x perf wins via a PR.
  • Added Cesil via a PR.
  • Added Dsv via a PR.
  • Added KBCsv via a PR.
  • Added Microsoft.Data.Analysis via a PR.
  • Added Microsoft.ML via a PR.
  • Added Open.Text.CSV via a PR.
  • Updated Sylvan.Data.Csv from 1.0.3 to 1.1.5 via a PR.
  • Updated TxtCsvHelper from 1.2.9 to 1.3.1 via a PR.

This entire update was done by Mark. Thanks so much ❤️!

Results - BenchmarkDotNet.Artifacts-5.0-6.zip

Update 2021-08-05 (commit a26df9c)

  • Updated CsvHelper from 27.1.0 to 27.1.1.
  • Updated FlatFiles from 4.15.0 to 4.16.0.
  • Added RecordParser via a PR. Thanks Leandro!
  • Updated TinyCsvParser from 2.6.0 to 2.6.1.
  • Added TxtCsvHelper via a PR. Thanks Cameron!

Results - BenchmarkDotNet.Artifacts-5.0-5.zip

Update 2021-06-16 (commit 514a037)

  • Added ChoETL via a PR. Thanks, Josh!
  • Added CommonLibrary.NET via a PR. Thanks, Josh!
  • Added CSVFile via a PR. Thanks, Josh!
  • Updated CsvHelper from 20.0.0 to 27.1.0.
  • Added CsvTools via a PR. Thanks, Josh!
  • Added FlatFiles via a PR. Thanks, Josh!
  • Updated FluentCSV from 2.0.0 to 3.0.0 via a PR. Thanks Aurélien!
  • Added LinqToCsv via a PR. Thanks, Josh!
  • Updated ServiceStack.Text from 5.10.4 to 5.11.0.
  • Updated SoftCircuits.CsvParser from 2.4.3 to 3.0.0.
  • Added Sky.Data.Csv via a PR. Thanks, Josh!
  • Updated Sylvan.Data.Csv from 0.9.0 to 1.0.3 via a PR and a subsequent change by me. Thanks, Mark!

Results - BenchmarkDotNet.Artifacts-5.0-3.zip

Update 2021-01-18 (commit 8005d5)

  • Added Ctl.Data by request.
  • Added Cursively via a PR from @airbreather. Thanks, Joe!
  • Added Microsoft.VisualBasic.FileIO.TextFieldParser by request.
  • Added SoftCircuits.CsvParser by request.
  • Updated CsvHelper from 19.0.0 to 20.0.0 via a PR from @JoshClose. Thanks, Josh!
  • Updated ServiceStack.Text from 5.10.2 to 5.10.4.
  • Updated Sylvan.Data.Csv from 0.8.2 to 0.9.0.
  • Switched to a fork of FastCsvParser to avoid duplicate DLL name.

Results - BenchmarkDotNet.Artifacts-5.0-2.zip

Update 2021-01-06 (commit 586f602)

  • Moved to .NET 5.0.1
  • Added FluentCSV by request.
  • Added Sylvan.Data.Csv via a PR from @MarkPflug. Thanks, Mark!
  • Updated Csv from 1.0.58 to 2.0.62.
  • Updated CsvHelper from 18.0.0 to 19.0.0.
  • Updated mgholam.fastCSV from 2.0.8 to 2.0.9.

Results - BenchmarkDotNet.Artifacts-5.0.zip

Initial release 2020-12-08 (commit 57c31a)

Results - BenchmarkDotNet.Artifacts.zip