Add comparisons

aswinkarthik · Apr 28, 2018 · 80c9ab9
1 parent c066986
Showing 3 changed files with 80 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -12,10 +12,10 @@ A Blazingly fast diff tool for comparing csv files.
 
 Csvdiff is a difftool to compute changes between two csv files.
 
-* It is not a traditional diff tool. It is most suitable for comparing csv files dumped from database tables.
+* It is not a traditional diff tool. It is most suitable for comparing csv files dumped from database tables. GNU diff is several times faster at plain line-by-line comparison.
 * Supports specifying group of columns as primary-key.
 * Supports selective comparison of fields in a row.
-* Process a million records csv in under 2 seconds
+* Compares csv files of a million records in about 2 seconds. Comparisons and benchmarks are [here](/benchmark).
 
 ## Demo
 

diff --git a/benchmark/README.md b/benchmark/README.md
@@ -1,4 +1,68 @@
-## Benchmark Results
+## Comparison with other tools
+
+
+### Setup
+
+* Uses the Majestic Million dataset (source in the credits section).
+* Both files have 998390 rows and 12 columns.
+* The two files differ by exactly one modified row.
+* Run on an Intel Core i7 (2.5 GHz, 4 cores) with 16 GB RAM.
+
+0. csvdiff (this tool) : *0m2.085s*
+
+```bash
+time csvdiff run -b majestic_million.csv -d majestic_million_diff.csv
+
+# Additions: 0
+# Modifications: 1
+
+real	0m2.085s
+user	0m3.861s
+sys	0m0.340s
+```
+
+1. [data.table](https://github.com/Rdatatable/data.table) : *0m4.284s*
+
+	* Joins both csvs on the `id` column.
+	* Checks each pair of joined columns for inequality.
+	* The R script is in [data-table.r](/benchmark/data-table.r) (suggestions for improving it are welcome; I am new to R).
+
+```bash
+time Rscript data-table.r
+
+real	0m4.284s
+user	0m3.887s
+sys	0m0.284s
+```
+
+2. [csvdiff](https://pypi.org/project/csvdiff/) written in Python : *0m48.115s*
+
+```bash
+time csvdiff --style=summary id majestic_million.csv majestic_million_diff.csv
+0 rows removed (0.0%)
+0 rows added (0.0%)
+1 rows changed (0.0%)
+
+real	0m48.115s
+user	0m42.895s
+sys	0m3.948s
+```
+
+3. GNU diff (Fastest) : *0m0.297s*
+
+	* By far the fastest here; csvdiff does not come close.
+	* However, it does a line-by-line diff and does not support compound keys or selective comparison of columns. Hence the disclaimer: csvdiff cannot be used as a generic diff tool.
+	* On another note, let's see if we can reach this speed.
+
+```bash
+time diff majestic_million.csv majestic_million_diff.csv
+
+real	0m0.297s
+user	0m0.144s
+sys	0m0.147s
+```
+
+## Go Benchmark Results
 
 Benchmark test can be found [here](https://github.com/aswinkarthik93/csvdiff/blob/master/pkg/digest/digest_benchmark_test.go).
 

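Reviewer note: the `real` timings quoted in the benchmark README above can be put on a common scale with a little arithmetic. The sketch below is not part of the commit; it only parses the wall-clock values shown in the diff and expresses each as a multiple of csvdiff's time. The function names `parse_real` and `relative_to` and the dictionary labels are made up for illustration.

```python
import re

# Wall-clock ("real") values copied from the benchmark README in this commit.
REAL_TIMINGS = {
    "csvdiff (this tool)": "0m2.085s",
    "data.table": "0m4.284s",
    "csvdiff (Python)": "0m48.115s",
    "GNU diff": "0m0.297s",
}


def parse_real(value):
    """Convert a `time` value such as '0m2.085s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def relative_to(baseline, timings):
    """Return each tool's time as a multiple of the baseline tool's time."""
    base_seconds = parse_real(timings[baseline])
    return {tool: parse_real(t) / base_seconds for tool, t in timings.items()}


if __name__ == "__main__":
    ratios = relative_to("csvdiff (this tool)", REAL_TIMINGS)
    for tool, ratio in ratios.items():
        print(f"{tool}: {ratio:.2f}x csvdiff's wall-clock time")
```

On these figures GNU diff takes roughly a seventh of csvdiff's time, while the Python csvdiff takes a bit over twenty times as long. These are single runs on one machine, so the ratios are indicative only.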
diff --git a/benchmark/data-table.r b/benchmark/data-table.r
@@ -0,0 +1,13 @@
+library(data.table)
+
+csv1 = fread('majestic_million.csv')
+csv2 = fread('majestic_million_diff.csv')
+
+setkey(csv1,id)
+setkey(csv2,id)
+
+result <- merge(csv2, csv1, all.x=TRUE)
+
+diff <- result[result$"col-1.x" != result$"col-1.y" | result$"col-2.x" != result$"col-2.y" | result$"col-3.x" != result$"col-3.y" | result$"col-4.x" != result$"col-4.y" | result$"col-5.x" != result$"col-5.y" | result$"col-6.x" != result$"col-6.y" | result$"col-7.x" != result$"col-7.y" | result$"col-8.x" != result$"col-8.y" | result$"col-9.x" != result$"col-9.y" | result$"col-10.x" != result$"col-10.y" | result$"col-11.x" != result$"col-11.y"]
+
+diff
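
Reviewer note: the R script above joins the two files on `id` and then compares the remaining columns, which is the key-based comparison that the benchmark README contrasts with GNU diff's line-by-line approach. The Python sketch below illustrates that idea with a possibly compound key and an optional subset of columns to compare. It is not csvdiff's implementation and not part of this commit; `keyed_diff`, `key_cols`, `value_cols`, and the sample rows are hypothetical.

```python
import csv
import io


def keyed_diff(base_text, delta_text, key_cols=(0,), value_cols=None):
    """Return (additions, modifications) of delta relative to base.

    key_cols:   column positions that together form the primary key.
    value_cols: column positions to compare; None compares every column.
    """
    def index(text):
        # Map each key tuple to (compared values, full row).
        rows = {}
        for row in csv.reader(io.StringIO(text)):
            key = tuple(row[i] for i in key_cols)
            cols = range(len(row)) if value_cols is None else value_cols
            rows[key] = (tuple(row[i] for i in cols), row)
        return rows

    base = index(base_text)
    additions, modifications = [], []
    for key, (values, row) in index(delta_text).items():
        if key not in base:
            additions.append(row)        # key only exists in delta
        elif base[key][0] != values:
            modifications.append(row)    # same key, different values
    return additions, modifications


if __name__ == "__main__":
    # Same keys in a different order, one changed value, one new row.
    base = "1,example.com,10\n2,example.org,20\n"
    delta = "2,example.org,25\n1,example.com,10\n3,example.net,30\n"
    additions, modifications = keyed_diff(base, delta)
    print("Additions:", len(additions))
    print("Modifications:", len(modifications))
```

Because rows are matched by key, reordering the rows does not register as a change, whereas a line-by-line diff would flag it. Passing `value_cols=(1,)` restricts the comparison to the second column, so the changed third column of row `2` would no longer count as a modification.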