qsv pro v1 is here! 🎉

If you've been using qsv for a while, even if you're a command-line ninja, you'll find a lot of new capabilities in qsv pro that can make your data wrangling experience even better!

Apart from making qsv easier to use, qsv pro has a multitude of features including: view interactive data tables; browse stats/frequency/metadata; run recipes and tools (scripts); run Polars SQL queries; use Natural Language queries (using Retrieval Augmented Generation (RAG) techniques); regular expression search; export to multiple file formats; download/upload from/to compatible CKAN instances; design custom node-based flows and data pipelines; interact with a local API from external programs including the qsv pro command; run various qsv commands in a graphical user interface; and the list goes on!

And that's just the beginning. There's more to come! You just have to try it!

Download qsv pro v1 now at qsvpro.dathere.com.

Other highlights include:

pro: new command to allow qsv to interact with the qsv pro API to tap into qsv pro exclusive features.
lens: new command to interactively view CSVs using the csvlens crate.
The ludicrously fast diff command is now easier to use with its --drop-equal-fields option. @janriemer continues to work on his csv-diff crate, and there are more diff UX improvements coming soon!
stats adds sum_length and avg_length "streaming" statistics in addition to the existing min_length and max_length metrics. These are especially useful for datasets with a lot of "free text" columns.
stats also got "smarter" and "faster" by dog-fooding its own statistics!
It's a little complicated, but the way stats works is that it first compiles the "streaming" statistics on the fly as it multiplexes loading the data across several threads, while the more expensive advanced statistics are "lazily" computed at the end.
Since we now compile "sort order" in a streaming manner, we use this info when deriving cardinality at the end to see if we can skip sorting. Sorting is otherwise a necessary step, because cardinality is computed by "scanning" all the sorted values of a column: every time two neighboring values differ, the cardinality count is incremented.
Apart from this "sort order" optimization, we also improved the "cardinality scan" algorithm - halving its memory footprint and making it faster still for larger datasets by parallelizing the computation!
This, in turn, makes the frequency command faster and more memory efficient!
We now also use our own fork of the csv crate, featuring SIMD-accelerated UTF-8 validation and other minor perf tweaks, making the entire qsv suite faster still!
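
The stats highlights above describe a streaming pass followed by a neighbor-comparison cardinality scan that can skip sorting. Here is a minimal Python sketch of that idea, not qsv's actual Rust implementation. The names StreamingStats, cardinality and profile are hypothetical, lengths are counted in characters (an assumption), and only ascending order is tracked.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class StreamingStats:
    """Statistics that can be updated one value at a time (hypothetical type)."""
    count: int = 0
    min_length: Optional[int] = None
    max_length: int = 0
    sum_length: int = 0
    ascending: bool = True  # stays True while values never decrease
    _previous: Optional[str] = None

    def add(self, value: str) -> None:
        n = len(value)  # assumption: length in characters
        self.count += 1
        self.sum_length += n
        self.max_length = max(self.max_length, n)
        self.min_length = n if self.min_length is None else min(self.min_length, n)
        if self._previous is not None and value < self._previous:
            self.ascending = False
        self._previous = value

    @property
    def avg_length(self) -> float:
        return self.sum_length / self.count if self.count else 0.0


def cardinality(values: List[str], stats: StreamingStats) -> int:
    """Count distinct values by comparing neighbors in sorted order."""
    # Skip the sort when the streaming pass already saw ascending order.
    ordered = values if stats.ascending else sorted(values)
    distinct = 0
    previous: Optional[str] = None
    for value in ordered:
        if previous is None or value != previous:
            distinct += 1
        previous = value
    return distinct


def profile(column: Iterable[str]) -> Tuple[StreamingStats, int]:
    """Run the streaming pass first, then the lazily computed cardinality."""
    values = list(column)
    stats = StreamingStats()
    for value in values:
        stats.add(value)
    return stats, cardinality(values, stats)
```

Calling profile(["apple", "apple", "banana", "cherry"]) never sorts, because the column is already in ascending order. An unsorted column falls back to sorted() before the scan.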

Added

pro: add qsv pro command to interact with qsv pro API by @rzmk in #2039
lens: new command to interactively view CSVs using the csvlens crate #2117
apply: add crc32 operation #2121
count: add --delimiter option #2120
diff: add flag --drop-equal-fields by @janriemer in #2114
stats: add sum_length and avg_length columns #2113
stats: smarter cardinality computation - added new parallel algorithm for large datasets (10,000+ rows) and updated sequential algorithm for smaller datasets 4e63fec
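
To illustrate what the new diff --drop-equal-fields flag listed above is for, here is a small Python sketch of the general idea: in a modified row, values that did not change are blanked so that only the differences stand out. This is a sketch under stated assumptions, not the csv-diff or qsv code. The function name drop_equal_fields is hypothetical, and keeping key fields intact is an assumption about the intended behavior.

```python
from typing import Dict, Sequence, Tuple

Row = Dict[str, str]


def drop_equal_fields(before: Row, after: Row, key_fields: Sequence[str]) -> Tuple[Row, Row]:
    """Blank out non-key fields whose value is identical in both versions of a row."""
    keys = set(key_fields)
    old: Row = {}
    new: Row = {}
    for name in before:
        # Assumption: key fields are always kept so the row stays identifiable.
        unchanged = name not in keys and before[name] == after.get(name)
        old[name] = "" if unchanged else before[name]
        new[name] = "" if unchanged else after.get(name, "")
    return old, new
```

For example, with before = {"id": "1", "name": "Ann", "city": "Oslo"} and after = {"id": "1", "name": "Ann", "city": "Bergen"}, calling drop_equal_fields(before, after, ["id"]) keeps id and city in both rows and blanks name.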

Changed

count: added comment to justify magic number 5241e39
stats: use simdjson for faster JSONL parsing; micro-optimize compute hot loop 0e8b734
stats: standardized OVERFLOW and UNDERFLOW messages 38c6128
sort: renamed symbol to eliminate devskim lint false positive warning 12db739
enable lens feature in GH workflows #2122
deps: bump polars 0.42.0 to latest upstream at time of release 3c17ed1
deps: use our own optimized fork of csv crate, with simdutf8 validation and other minor perf tweaks e4bcd71
build(deps): bump serde from 1.0.209 to 1.0.210 by @dependabot in #2111
build(deps): bump serde_json from 1.0.127 to 1.0.128 by @dependabot in #2106
build(deps): bump qsv-stats from 0.19.0 to 0.22.0 #2107 #2112 cb1eb60
apply select clippy lint suggestions
updated several indirect dependencies
made various doc and usage text improvements

Fixed

schema: Print an error if the qsv stats invocation fails by @abrauchli in #2110

New Contributors

@abrauchli made their first contribution in #2110

Full Changelog: 0.133.1...0.134.0

03 Sep 19:04

jqnatividad

0.133.1

e42f499

Highlights

This release doubles down on Polars' capabilities, as we now, as a matter of policy, track the latest polars upstream. If you think qsv has a torrid release schedule, you should see Polars. They're constantly fixing bugs and adding new features and optimizations!
To keep up, we've added Polars revision info to the --version output, and the --envlist option now includes Polars-relevant env vars. We've also added support for the POLARS_BACKTRACE_IN_ERR env var to control whether Polars backtraces are included in error messages.
We also removed the to parquet subcommand as it's redundant with the Polars-powered sqlp's ability to create parquet files. This removes the HUGE duckdb dependency, which should make compile times markedly shorter and binaries smaller.


Other highlights include:

New edit command that allows you to edit CSV files.
The count command's --width option now includes record width stats beyond max length (avg, median, min, variance, stddev & MAD).
The fixlengths command now has --quote and --escape options.
The stats command adds a sort_order streaming statistic.
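
For readers unfamiliar with the record width statistics mentioned for count --width, the following Python sketch shows how such metrics can be derived from a list of record widths. It is illustrative only and not qsv's code. The function name width_stats is hypothetical, MAD is taken to mean median absolute deviation, and population (rather than sample) variance is an assumption.

```python
import statistics
from typing import Dict, List


def width_stats(record_widths: List[int]) -> Dict[str, float]:
    """Summarize record widths (for example, the length of each record)."""
    median = statistics.median(record_widths)
    # MAD here is the median of absolute deviations from the median (assumption).
    mad = statistics.median([abs(w - median) for w in record_widths])
    return {
        "max": max(record_widths),
        "min": min(record_widths),
        "avg": statistics.fmean(record_widths),
        "median": median,
        "variance": statistics.pvariance(record_widths),  # population variance
        "stddev": statistics.pstdev(record_widths),       # population stddev
        "mad": mad,
    }
```

The input would typically be one width per record, such as width_stats([10, 12, 14, 20]).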

NOTE: 0.133.0 was skipped because of a dev dependency conflict with the csvs_convert crate, preventing us from publishing 0.133.0 to crates.io. This has been resolved in 0.133.1.