This is a custom CSV parser written for our use case at SmartReach.io.
We ran into various issues with CSV parsing libraries. These were not necessarily faults of the libraries, but rather a consequence of the varied formats of the CSV files we receive, including incorrectly formatted files that our customers expected to "just" work, tab-separated files, and semicolon-separated files.
Finally, to reduce our error rate, we decided to roll our own parser.
We have seen a significant reduction in CSV errors since then.
Although the title says CSV parser, it handles semicolon-separated and tab-separated files as well.
To see that logic, have a look at common/driver.cpp.
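As a rough illustration of how one parser can serve all three formats, the sketch below guesses the delimiter from the first line of a file by counting commas, semicolons and tabs that fall outside double quotes. It is an assumption for illustration only: guess_delimiter is a hypothetical helper and is not taken from common/driver.cpp, which may choose the separator in a different way.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT the logic in common/driver.cpp: guess the field
// delimiter by counting candidate characters that appear outside double
// quotes in the first line of the input.
char guess_delimiter(const std::string& first_line) {
    const std::array<char, 3> candidates = {',', ';', '\t'};
    std::array<std::size_t, 3> counts = {0, 0, 0};
    bool in_quotes = false;

    for (char c : first_line) {
        if (c == '"') {
            // An escaped quote ("") toggles twice, so the state is preserved.
            in_quotes = !in_quotes;
            continue;
        }
        if (in_quotes) continue;  // delimiters inside quoted fields do not count
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            if (c == candidates[i]) ++counts[i];
        }
    }

    // Pick the most frequent candidate; ties and empty lines fall back to comma.
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        if (counts[i] > counts[best]) best = i;
    }
    return candidates[best];
}
```

Under this heuristic a header line such as abcd;efgh;1235 would be treated as semicolon-separated, while a line with no candidate characters at all is treated as comma-separated.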
- bison (GNU Bison) 3.5.1
- flex 2.6.4
- c++-11
- GNU make
- macOS
- Ubuntu Linux
Clone the repository:
git clone https://github.com/heaplabs/csv-parser-flex-bison.git
sudo apt-get install flex bison # install if missing
cd csv-parser-flex-bison
make
The build is tested on Ubuntu 20.04 with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0.
The integrated parser (for CSV, TSV and semicolon-separated files) is placed in the build folder; the generated executable is csv2_ubuntu.exe.
We tested with the GNU g++-11 C++ compiler. You can install it via brew.
brew install flex # install if missing
brew install bison # install if missing
cd csv-parser-flex-bison
make -f GNUmakefile.macos
The integrated parser (for CSV, TSV and semicolon-separated files) is placed in the build folder; the generated executable is csv2_macos.exe.
The directory comma-separated-values/csv-test-files contains some sample CSV files.
Run the program as illustrated below.
The result of the parse is printed as JSON to stdout:
❯ ./build/csv2_ubuntu.exe comma-separated-values/csv-test-files/inp1.csv
{
"expected_fields" : 3,
"n_iso_8859_1" : 0,
"n_utf8_longer_than_1byte" : 0,
"n_wincp1252" : 0,
"successfully_parsed" : 0,
"total_errors" : 0,
"total_records" : 0,
"header" : [
"abcd",
"efgh",
"1235"
]
,
"parsed_data":[
]}
❯ ./build/csv2_ubuntu.exe comma-separated-values/csv-test-files/inp2.csv
{
"expected_fields" : 3,
"n_iso_8859_1" : 0,
"n_utf8_longer_than_1byte" : 0,
"n_wincp1252" : 0,
"successfully_parsed" : 4,
"total_errors" : 0,
"total_records" : 4,
"header" : [
"abcd",
"efgh",
"1235"
]
,
"parsed_data":[
[
"abcd",
"efgh",
"1235"
]
, [
"abcd",
"efgh",
"1235"
]
, [
"abcd",
"efgh",
"12\t35"
]
, [
"abcd",
"efgh",
"1235"
] ]}
❯ ./build/csv2_macos.exe comma-separated-values/csv-test-files/inp1.csv
{
"expected_fields" : 3,
"n_iso_8859_1" : 0,
"n_utf8_longer_than_1byte" : 0,
"n_wincp1252" : 0,
"successfully_parsed" : 0,
"total_errors" : 0,
"total_records" : 0,
"header" : [
"abcd",
"efgh",
"1235"
]
,
"parsed_data":[
]}
❯ ./build/csv2_macos.exe comma-separated-values/csv-test-files/inp2.csv
{
"expected_fields" : 3,
"n_iso_8859_1" : 0,
"n_utf8_longer_than_1byte" : 0,
"n_wincp1252" : 0,
"successfully_parsed" : 4,
"total_errors" : 0,
"total_records" : 4,
"header" : [
"abcd",
"efgh",
"1235"
]
,
"parsed_data":[
[
"abcd",
"efgh",
"1235"
]
, [
"abcd",
"efgh",
"1235"
]
, [
"abcd",
"efgh",
"12\t35"
]
, [
"abcd",
"efgh",
"1235"
] ]}
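The counters in the output are not described here. Going only by its name, n_utf8_longer_than_1byte plausibly counts UTF-8 encoded characters that occupy more than one byte. The sketch below shows one way such a count could be computed; it is an assumption about the field's meaning, and count_multibyte_utf8 is a hypothetical helper, not a function from this repository.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT taken from this repository: count structurally
// well-formed UTF-8 sequences that are longer than one byte.
// It checks only the lead/continuation byte pattern and does not reject
// overlong encodings or surrogate code points.
std::size_t count_multibyte_utf8(const std::string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);

        // Determine the sequence length announced by the lead byte.
        std::size_t len = 1;
        if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;

        // ASCII, a stray byte, or a sequence truncated by end of input.
        if (len == 1 || i + len > s.size()) {
            ++i;
            continue;
        }

        // Every following byte must look like 10xxxxxx.
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) {
                ok = false;
                break;
            }
        }

        if (ok) {
            ++count;
            i += len;
        } else {
            ++i;  // not valid UTF-8 here, e.g. a lone ISO-8859-1 byte
        }
    }
    return count;
}
```

Bytes that fail this check, such as a lone 0xE9 from an ISO-8859-1 file, are skipped one at a time. A separate counter along the lines of n_iso_8859_1 could be incremented at that point, though that too is a guess based on the field name.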