Skip to content

Commit

Permalink
process xor/rol optimised and fully implemented (thanks @webbnh)
Browse files Browse the repository at this point in the history
  • Loading branch information
arekbulski committed Apr 18, 2018
1 parent 14b9bd8 commit bc8baa5
Show file tree
Hide file tree
Showing 2 changed files with 87 additions and 13 deletions.
91 changes: 79 additions & 12 deletions kaitai/kaitaistream.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -432,41 +432,108 @@ std::string kaitai::kstream::bytes_terminate(std::string src, char term, bool in
// ========================================================================

std::string kaitai::kstream::process_xor_one(std::string data, uint8_t key) {
if (key == 0)
return data;

size_t len = data.length();
std::string result(len, ' ');

for (size_t i = 0; i < len; i++)
for (size_t i = 0; i < len; i++) {
result[i] = data[i] ^ key;
}

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 18, 2018

Author Member

@webbnh You also gave me an idea for a different optimisation: just did a benchmark and found that operator[] is slower than taking a char* and indexing it directly.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 19, 2018

That's not surprising, considering that a string is an abstract object and the [] operator presumably does things like bounds checks; using a char* would go around all that overhead.

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 19, 2018

Author Member

I am not sure about bounds check, the documentation says that wrong access results in undefined behaviour and not an exception.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 19, 2018

Re bounds checking, you are correct. (To get bounds checking one must use std::string::at().)

Nevertheless, "undefined behaviour" can include things like corrupting data and crashing your program (and, possibly even throwing exceptions).


return result;
}

std::string kaitai::kstream::process_xor_many(std::string data, std::string key) {
size_t len = data.length();
size_t kl = key.length();
if (len == 1)

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

Shouldn't you be checking the key length, not the data length, here?

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 18, 2018

Author Member

HUGE BUG(SPIDER) ALERT! Oh just kidding. Fixed and thanks! :)

return process_xor_one(data, key[0]);

std::string result(len, ' ');

size_t ki = 0;
size_t k = 0;
size_t keylen = key.length();
for (size_t i = 0; i < len; i++) {
result[i] = data[i] ^ key[ki];
ki++;
if (ki >= kl)
ki = 0;
result[i] = data[i] ^ key[k];
k++;
if (k == keylen)

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

Using >= is good defensive coding practice (and there shouldn't be a performance penalty for it).

(I once got a free trip to a customer site because of this exact problem. In that case, there were multiple threads active, which caused the iterator to be incremented twice, which put it beyond the end...and, without the greater-than test, the code was none the wiser, until it had overwritten all of memory. I realize that this is single-threaded code and not subject to that exposure, but it's still best-practice to do as the original code here did.)

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 18, 2018

Author Member

Mute point :) Even if such a race condition was possible, I would argue that it would be better for the code to run indefinately (and make a "visible error" that way) than to just return wrong data without an exception. But thats still a mute point.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 19, 2018

But...it doesn't cause the code to run indefinitely -- it causes the code to access memory outside the array boundaries, which at best will mysteriously return incorrect results and at worst cause the program to crash. Yes, it would be best if errors were made obvious so that they can be fixed, but I think that is secondary to correct results and safe (production) operation.

(And, by the way, I think you mean "moot point". :-) )

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 19, 2018

Author Member

Right! Thanks for correcting me! :)

k = 0;
}

return result;
}

std::string kaitai::kstream::process_rotate_left(std::string data, int amount) {
uint8_t precomputedSingleRotations[8][256];

// NOTE: static block of code, https://stackoverflow.com/a/34321324/2375119
computeSingleRotations {
for (int amount = 1; amount < 8; amount++) {
int anti_amount = 8 - amount;
for (uint8_t i = 0; i < 256; i++) {
precomputedSingleRotations[amount][i] = (uint8_t)((i << amount) | (i >> anti_amount));
}
}
}

std::string kaitai::kstream::process_rotate_left(std::string data, int amount, int groupSize = 1) {
if (groupSize < 1)
throw std::runtime_error("process_rotate_left: groupSize must be at least 1");

amount = mod(amount, groupSize * 8);
if (amount == 0)
return data;

int amount_bytes = amount / 8;
size_t len = data.length();
std::string result(len, ' ');

for (size_t i = 0; i < len; i++) {
uint8_t bits = data[i];
result[i] = (bits << amount) | (bits >> (8 - amount));
if (groupSize == 1) {
uint8_t *translate = &precomputedSingleRotations[amount][0];

for (size_t i = 0; i < len; i++) {
result[i] = translate[data[i]];
}

return result;
}

return result;
if (len % groupSize != 0)
throw std::runtime_error("process_rotate_left: data length must be a multiple of group size");

if (amount % 8 == 0) {
size_t indices[groupSize];
for (size_t i = 0; i < groupSize; i++) {
indices[i] = (size_t)((i + amount_bytes) % groupSize);
}

for (size_t i = 0; i < len; i += groupSize) {
for (size_t k = 0; k < groupSize; k++) {
result[i+k] = data[i + indices[k]];

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

Just to confirm: you're not using precomputedSingleRotations[] here, because we're rotating by an integral number of bytes, right? (So, we're just moving whole bytes from one position to another.)

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 18, 2018

Author Member

Thats right, we're just moving whole bytes from one position to another. :)

}
}

return result;
}

{
int amount1 = amount % 8;
int amount2 = 8 - amount1;
size_t indices1[groupSize];
size_t indices2[groupSize];
for (size_t i = 0; i < groupSize; i++) {
indices1[i] = (size_t)((i + amount_bytes) % groupSize);
indices2[i] = (size_t)((i + 1 + amount_bytes) % groupSize);
}

for (size_t i = 0; i < len; i += groupSize) {
for (size_t k = 0; k < groupSize; k++) {
result[i+k] = (uint8_t)((data[i + indices1[k]] << amount1) | (data[i + indices2[k]] >> amount2));

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

Rather than doing this byte by byte, you should seriously consider doing it in larger chunks. That is, if the platform supports 64-bit integers, then you should be doing it 8 bytes at a time. (At the very least, you should be able to use a portable type, such as int or size_t, load it with sizeof(int) or sizeof(size_t) bytes (respectively), and then do a single pair of shift operations on it, which worst case should save about 17% and best case about 75% or even 88%.)

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 18, 2018

Author Member

I agree that it would be a good idea performance-wise but this is just too much work for me. If you want to, add it yourself. I am done working on it.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 19, 2018

Fair enough. (As I said, I was simply recommending that you consider it.)

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 19, 2018

Author Member

Thank you for the recommendation, it was appreciated nonetheless. :)

}
}

return result;
}
}

#ifdef KS_ZLIB
Expand Down
9 changes: 8 additions & 1 deletion kaitai/kaitaistream.h
Original file line number Diff line number Diff line change
Expand Up @@ -172,6 +172,8 @@ class kstream {
/**
* Performs a XOR processing with given data, XORing every byte of input with a single
* given value.
* WARNING: May return same byte array if key is zero.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

I'm not sure what this warning is attempting to indicate. The contents of the array will be the same if the key is zero. (Hopefully that is self-evident....) The array object which is returned (i.e., the string) will not be the same, regardless of the key value.

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 18, 2018

Author Member

Sorry, I am a C# person and I copied this line from that runtime as-is. It means exactly what you said, that it returns "same array object". In C++, well not sure, but if benchmark is correct and passing by copy/reference is same then the conclusion could be that passing by copy actually "returns same array object" in sense that it shares data with another string variable. The "same" in "May return same byte array if key is zero." means object identity.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 19, 2018

In C++, it won't return the same object.

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 19, 2018

Author Member

Yeah, I meant it kinda figuratively: that it shares data between arrays.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 19, 2018

Then I would suggest replacing WARNING which sounds very important, with NOTE which means worth considering but not worth worrying about....

*
* @param data data to process
* @param key value to XOR with
* @return processed data
Expand All @@ -182,6 +184,8 @@ class kstream {
* Performs a XOR processing with given data, XORing every byte of input with a key
* array, repeating key array many times, if necessary (i.e. if data array is longer
* than key array).
* WARNING: May return same byte array if key is zero.

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

Ditto previous comment.

*
* @param data data to process
* @param key array of bytes to XOR with
* @return processed data
Expand All @@ -192,11 +196,14 @@ class kstream {
* Performs a circular left rotation shift for a given buffer by a given amount of bits,
* using groups of 1 bytes each time. Right circular rotation should be performed

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

The number of bytes is now specified by the third parameter, with a default of 1.

This comment has been minimized.

Copy link
@arekbulski

arekbulski Apr 18, 2018

Author Member

Fixed.

* using this procedure with corrected amount.
* WARNING: May return same byte array if amount is zero (modulo-wise).

This comment has been minimized.

Copy link
@webbnh

webbnh Apr 18, 2018

See previous comment above about "same"-ness.

*
* @param data source data to process
* @param amount number of bits to shift by
* @param groupSize number of bytes that make a group
* @return copy of source array with requested shift applied
*/
static std::string process_rotate_left(std::string data, int amount);
static std::string process_rotate_left(std::string data, int amount, int groupSize = 1);

#ifdef KS_ZLIB
/**
Expand Down

5 comments on commit bc8baa5

@webbnh
Copy link

@webbnh webbnh commented on bc8baa5 Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, to summarize, I've found the following issues which should be addressed before merging:

  • process_xor_many() checking the wrong length
  • The reset for the key iterator should not be changed from using >=.
  • And, the header file "warnings" should be clarified or down-played.

And, I recommend the following be considered, either before or subsequent to merging:

  • Perform the group rotations in larger chunks.
  • There is no need to allocate the first row of the translation array, since it is never initialized or read.

@arekbulski
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to allocate the first row of the translation array, since it is never initialized or read.

Yes but I already rejected this and also explained why. Saving 256 bytes of memory is not worth adding the little code to effectuate it. Secondary reason is that I would prefer to keep it consistent with C# and Python implementations, I made 3 PRs with almost identical changes. Same applies to "mod 8" instead of "and 7". The readability is worth to me more than the little speedup. Listen, if you want to take over this PR and change things on your own authority, I dont have a problem with it, fine with me. But if its up to me to make the call, then I reject those suggestions.

@webbnh
Copy link

@webbnh webbnh commented on bc8baa5 Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry -- I must have missed your explanation of why. As you say, it's only a quarter of a kilobyte, but it's also more than 10% of your overall allocation.... Anyway, it was just a recommendation.

By the way, I'm intrigued that you brought up "mod 8" vs. "and 7", since I think your performance assay demonstrated that the modulo operator is a fairly expensive one....

For what it's worth, I understand the trade-off between performance and readability, but, since your proposed changes are all about performance (and, indeed, they introduce a fair amount of code, some of which is pretty complicated), trying to claim that readability is paramount is an odd position.

@arekbulski
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I sounded harsh. I had a bad day, then looked at how much work it would take to implement those additional changes (your suggestions) and that just overfilled the bucket. Please dont think that I didnt appreciate the fact that you extensively reviewed my work...

I'm sorry -- I must have missed your explanation of why. As you say, it's only a quarter of a kilobyte, but it's also more than 10% of your overall allocation.... Anyway, it was just a recommendation.

Hmm I might have explained it @GreyCat. The overall justification is just that on modern desktops, and its not like servers are much different, you have something like 1GB of memory used just on startup. If you open a process monitor on Ubuntu, its about 100 processes and each has few MB of working set at least. That and even few year old computers have 4GB of RAM. The 2KB that are perma-allocated is a ridiculous amount that I would not even look at twice. If I was reviewing someone's code, I would probably not point it out unless the alternative was equally elegant (aka readable, clean).

By the way, I'm intrigued that you brought up "mod 8" vs. "and 7", since I think your performance assay demonstrated that the modulo operator is a fairly expensive one....

Yeah. My reasoning is that the modulo happens ONCE in the function. If it was in a loop, then it would matter because the difference gets multiplied.

For what it's worth, I understand the trade-off between performance and readability, but, since your proposed changes are all about performance (and, indeed, they introduce a fair amount of code, some of which is pretty complicated), trying to claim that readability is paramount is an odd position.

The runtime is "performance oriented" in the sense that its important, not that its critical. It should be cared for, not obtained at all costs.

@webbnh
Copy link

@webbnh webbnh commented on bc8baa5 Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall justification is just that on modern desktops, and its not like servers are much different, you have something like 1GB of memory used just on startup.

True...and the question quickly becomes, why? Hopefully the answer isn't because developers have wasted millions of small allocations.... ;-) I only brought it up because, as a reviewer, it was suspicious that the code never initialized that memory; the explanation for it was that it never accessed that memory; and that led to the question of why should we allocate it if it was never used? As for addressing it, all that is required is to add a - 1 in a few places; if that's burdensome, you can encapsulate it inside an accessor function....

My reasoning is that the modulo happens ONCE in the function.

Yes, but that function will be called (literally) millions of times in my application....

Please sign in to comment.