Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add binary option to stream / COPY #842

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Conversation

bj8sk
Copy link

@bj8sk bj8sk commented May 30, 2024

Add a binary option to stream_to and stream_from.
This could increase performance for some use cases, but beware it requires to handle the raw encoding / decoding in your client code.

@bj8sk bj8sk marked this pull request as draft May 30, 2024 20:31
@jtv
Copy link
Owner

jtv commented May 30, 2024

Thanks for working on this. Be aware that the binary format is neither documented nor guaranteed to be stable, which is one of the reasons I've never built in generic support for the format.

I would also suggest working on the two directions (inserting data into the database and retrieving data out of the database) as two separate problems. It may help break the problem down into manageable chunks.

I'm on my phone now, so hard to follow the code changes. But one really small and superficial note: adding a bool parameter for binary is probably not great for API clarity. I think there's a "format" enum somewhere - that would probably work better.

Copy link
Owner

@jtv jtv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see this come to fruition but unfortunately it's a surprisingly complicated topic. That in fact is why I never built this myself (though I made a start a few times).

One small ray of hope for the future is that the docs for PostgreSQL 17 do document some aspects of binary mode. I hope to get there one day.

@@ -47,3 +47,9 @@ win32/common
**/runner.log
**/runner.trs
**/test-suite.log
/x64
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid building directly in your source tree if you can. It's important to keep the out-of-tree build working, which is an easy feature to break by accident unless constantly exercised.

Also, building out-of-tree is just cleaner. It means you can always just discard your build tree without losing any work. And it means that you don't need to insert any files generated by your particular build environment into the .gitignore.

@@ -102,9 +102,10 @@ public:
* the stream will write all columns in the table, in schema order.
*/
static stream_to raw_table(
transaction_base &tx, std::string_view path, std::string_view columns = "")
transaction_base &tx, std::string_view path, std::string_view columns = "",
bool binary = false)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend against using bool arguments for such things. They render mistakes easy to make and hard to spot. Use the pqxx::format enum instead.

Since that's a class enum, it ensures type-safety of the parameter. On top of that, it gives us more extensibility in the future — adding an enum vaue will have very limited consequences for backward compatibility.

transaction_base &tx, std::string_view table_name, Columns const &columns) :
stream_to{tx, table_name, std::begin(columns), std::end(columns)}
transaction_base &tx, std::string_view table_name, Columns const &columns,
bool binary /*= false*/) :
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of adding that /*= false*/ documentation, but I have a practical problem with it: it creates a kind of "double bookkeeping" between the header and the source file, which we will have to keep consistent. And if we ever get that wrong, nothing will warn us about it. We'll just have a bit of misleading documentation in the source code.

transaction_focus{tx, class_name}, m_char_finder{get_finder(tx)}
{
tx.exec0(internal::concat("COPY ("sv, query, ") TO STDOUT"sv));
tx.exec0(binary?internal::concat("COPY ("sv, query, ") TO STDOUT WITH BINARY"sv):internal::concat("COPY ("sv, query, ") TO STDOUT"sv));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No big deal, since I run automated formatting on the source tree, but please aim to stay below 80 columns.

{
tx.exec0(
std::empty(columns) ?
pqxx::internal::concat("COPY "sv, table, " FROM STDIN WITH BINARY"sv) :
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small note... To be honest I'm not even sure at this point that this form ("copy all columns") is worth keeping. Nowadays I feel that it's just way too sensitive to schema changes. Consider that an historical mistake on my part.

transaction_focus{tx, s_classname, path},
m_finder{pqxx::internal::get_char_finder<
'\b', '\f', '\n', '\r', '\t', '\v', '\\'>(
pqxx::internal::enc_group(tx.conn().encoding_id()))}
{
if (binary)
begin_binary_copy(tx, path, columns);
begin_copy(tx, path, columns);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this not be in an else?

inserter.complete();

auto r1{tx.exec1("SELECT * FROM stream_binary_to_test WHERE number0 = 1234")};
PQXX_CHECK_EQUAL(r1[0].as<int>(), 1234, "Read back wrong first int.");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this going to work? AIUI the binary format has a very different, more complicated number format. You can't just take the address of an int and send it across.

(This is part of the bigger problem that I brought up in the beginning: the binary formats are complex, undocumented, and not necessarily stable.)

Also, I see nothing to deal with byte orders. IIRC the binary format generally works in Little-Endian mode — what if the client system runs in Big-Endian mode?

And finally how do nulls work?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test is incomplete (and not correct), yes.
htonl[s]/ntoh[l|s] etc. must be used.

For nulls:
https://www.postgresql.org/docs/16/sql-copy.html#:~:text=As%20a%20special%20case%2C%20%2D1%20indicates%20a%20NULL%20field%20value.%20No%20value%20bytes%20follow%20in%20the%20NULL%20case.

But all in all would be better to provide a wrapper around the processing of the header, tuples and trailer, and provide a consistent api here.

for example the when reading data, open method would go something like this:
use get raw_line from stream
check signature
read header
make some assumptions on the header flag values (or ignore them, depending on server version)
get number of fields

Then get data tuples from stream and continue or get more data, read field sizes, ensure it's consistent with the provided types client specified, read data and convert from network format.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, as of C++20 we're getting at least some support for byte-swapping. (Although it doesn't look complete until C++23.)

My impression is that things get really hard where it comes to getting the SQL-side types and the C++-side types matched up correctly. For example, if the database is sending an integer but we're trying to read it as a floating-point number, we'll still need to start by decoding the binary value as the right type — and then maybe we'd then want to convert the integer to a floating-point value after that, or maybe we shouldn't even allow that to avoid mistakes. So it's possible that we'll need an entirely different paradigm for working with the types.

std::empty(columns) ?
pqxx::internal::concat("COPY "sv, table, " FROM STDIN WITH BINARY"sv) :
pqxx::internal::concat(
"COPY "sv, table, "("sv, columns, ") FROM STDIN WITH BINARY"sv));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't this supposed to say FORMAT BINARY?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I see this used older syntax but still supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants