-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add binary option to stream / COPY #842
base: master
Are you sure you want to change the base?
Conversation
Thanks for working on this. Be aware that the binary format is neither documented nor guaranteed to be stable, which is one of the reasons I've never built in generic support for the format. I would also suggest working on the two directions (inserting data into the database and retrieving data out of the database) as two separate problems. It may help break the problem down into manageable chunks. I'm on my phone now, so hard to follow the code changes. But one really small and superficial note: adding a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to see this come to fruition but unfortunately it's a surprisingly complicated topic. That in fact is why I never built this myself (though I made a start a few times).
One small ray of hope for the future is that the docs for PostgreSQL 17 do document some aspects of binary mode. I hope to get there one day.
@@ -47,3 +47,9 @@ win32/common | |||
**/runner.log | |||
**/runner.trs | |||
**/test-suite.log | |||
/x64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid building directly in your source tree if you can. It's important to keep the out-of-tree build working, which is an easy feature to break by accident unless constantly exercised.
Also, building out-of-tree is just cleaner. It means you can always just discard your build tree without losing any work. And it means that you don't need to insert any files generated by your particular build environment into the .gitignore
.
@@ -102,9 +102,10 @@ public: | |||
* the stream will write all columns in the table, in schema order. | |||
*/ | |||
static stream_to raw_table( | |||
transaction_base &tx, std::string_view path, std::string_view columns = "") | |||
transaction_base &tx, std::string_view path, std::string_view columns = "", | |||
bool binary = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend against using bool
arguments for such things. They render mistakes easy to make and hard to spot. Use the pqxx::format
enum instead.
Since that's a class enum
, it ensures type-safety of the parameter. On top of that, it gives us more extensibility in the future — adding an enum vaue will have very limited consequences for backward compatibility.
transaction_base &tx, std::string_view table_name, Columns const &columns) : | ||
stream_to{tx, table_name, std::begin(columns), std::end(columns)} | ||
transaction_base &tx, std::string_view table_name, Columns const &columns, | ||
bool binary /*= false*/) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of adding that /*= false*/
documentation, but I have a practical problem with it: it creates a kind of "double bookkeeping" between the header and the source file, which we will have to keep consistent. And if we ever get that wrong, nothing will warn us about it. We'll just have a bit of misleading documentation in the source code.
transaction_focus{tx, class_name}, m_char_finder{get_finder(tx)} | ||
{ | ||
tx.exec0(internal::concat("COPY ("sv, query, ") TO STDOUT"sv)); | ||
tx.exec0(binary?internal::concat("COPY ("sv, query, ") TO STDOUT WITH BINARY"sv):internal::concat("COPY ("sv, query, ") TO STDOUT"sv)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No big deal, since I run automated formatting on the source tree, but please aim to stay below 80 columns.
{ | ||
tx.exec0( | ||
std::empty(columns) ? | ||
pqxx::internal::concat("COPY "sv, table, " FROM STDIN WITH BINARY"sv) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small note... To be honest I'm not even sure at this point that this form ("copy all columns") is worth keeping. Nowadays I feel that it's just way too sensitive to schema changes. Consider that an historical mistake on my part.
transaction_focus{tx, s_classname, path}, | ||
m_finder{pqxx::internal::get_char_finder< | ||
'\b', '\f', '\n', '\r', '\t', '\v', '\\'>( | ||
pqxx::internal::enc_group(tx.conn().encoding_id()))} | ||
{ | ||
if (binary) | ||
begin_binary_copy(tx, path, columns); | ||
begin_copy(tx, path, columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this not be in an else
?
inserter.complete(); | ||
|
||
auto r1{tx.exec1("SELECT * FROM stream_binary_to_test WHERE number0 = 1234")}; | ||
PQXX_CHECK_EQUAL(r1[0].as<int>(), 1234, "Read back wrong first int."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this going to work? AIUI the binary format has a very different, more complicated number format. You can't just take the address of an int
and send it across.
(This is part of the bigger problem that I brought up in the beginning: the binary formats are complex, undocumented, and not necessarily stable.)
Also, I see nothing to deal with byte orders. IIRC the binary format generally works in Little-Endian mode — what if the client system runs in Big-Endian mode?
And finally how do nulls work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test is incomplete (and not correct), yes.
htonl[s]/ntoh[l|s] etc. must be used.
But all in all would be better to provide a wrapper around the processing of the header, tuples and trailer, and provide a consistent api here.
for example the when reading data, open method would go something like this:
use get raw_line from stream
check signature
read header
make some assumptions on the header flag values (or ignore them, depending on server version)
get number of fields
Then get data tuples from stream and continue or get more data, read field sizes, ensure it's consistent with the provided types client specified, read data and convert from network format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, as of C++20 we're getting at least some support for byte-swapping. (Although it doesn't look complete until C++23.)
My impression is that things get really hard where it comes to getting the SQL-side types and the C++-side types matched up correctly. For example, if the database is sending an integer but we're trying to read it as a floating-point number, we'll still need to start by decoding the binary value as the right type — and then maybe we'd then want to convert the integer to a floating-point value after that, or maybe we shouldn't even allow that to avoid mistakes. So it's possible that we'll need an entirely different paradigm for working with the types.
std::empty(columns) ? | ||
pqxx::internal::concat("COPY "sv, table, " FROM STDIN WITH BINARY"sv) : | ||
pqxx::internal::concat( | ||
"COPY "sv, table, "("sv, columns, ") FROM STDIN WITH BINARY"sv)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't this supposed to say FORMAT
BINARY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I see this used older syntax but still supported.
Add a binary option to stream_to and stream_from.
This could increase performance for some use cases, but beware it requires to handle the raw encoding / decoding in your client code.