Add zip64 #211

Merged · 6 commits into adamhathcock:master · Mar 13, 2017

Conversation

@kenkendk (Contributor) commented Mar 9, 2017

This adds zip64 writing support.

Zip64 is implemented by appending a set of "extra" fields to the header. A zip file contains two headers for each file: a local header (before the stream) and a central directory header (at the end of the file).

The central header is simple enough; most implementations use it and mostly ignore the local header. By the time we write the central header, we have all the information required, so we can just write it.

For the local header, there is a tradeoff. The "extra" bytes take up 2+2+8+8 = 20 bytes per entry. This header is only required if either stream size (compressed or uncompressed) exceeds uint.MaxValue, but we do not know the length of the stream before writing.
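In sketch form, those 20 bytes look like this (illustrative only, not the actual SharpCompress code; assuming a BinaryWriter positioned where the extra field goes):

```csharp
using System.IO;

// The 2+2+8+8 = 20-byte zip64 "extended information" extra field.
static void WriteZip64ExtraField(BinaryWriter writer, ulong uncompressedSize, ulong compressedSize)
{
    writer.Write((ushort)0x0001);   // 2 bytes: header ID of the zip64 extra field
    writer.Write((ushort)16);       // 2 bytes: size of the data that follows (8 + 8)
    writer.Write(uncompressedSize); // 8 bytes: original size
    writer.Write(compressedSize);   // 8 bytes: compressed size
}
```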

The dilemma is: do we write it for all files, in case one is too long? Or do we not write it and risk overflowing the values?

Since the header is "mostly ignored", we could live with it being broken. On the other hand, if we can make it correct, we should.

I have added a hint value to ZipWriteEntryOptions that can force this header on or off, and a flag in ZipWriterOptions that enables the extra fields for all entries by default. If the caller sets no flags, as I assume most will not, I use a 2GiB threshold to toggle the zip64 headers: if the underlying stream is already at 2GiB or more, zip64 is automatically enabled in the local header.

This is not a perfect solution, but I figure most users write smaller zip files and will never notice, and larger zip files are not really impacted by 20 bytes here and there. You can of course defeat the scheme by writing a 1.9GiB file followed by a +4GiB file, hitting the limitations before the automatic upgrade kicks in.
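As a rough sketch of how a caller would opt in (the property names UseZip64 and EnableZip64 are placeholders here, not confirmed API names):

```csharp
using SharpCompress.Common;
using SharpCompress.Writers.Zip;

// Archive-wide default: reserve the zip64 extra fields for every entry.
var writerOptions = new ZipWriterOptions(CompressionType.Deflate)
{
    UseZip64 = true // placeholder name for the ZipWriterOptions flag
};

// Per-entry hint: force the local zip64 header on (or off) for one entry.
var entryOptions = new ZipWriteEntryOptions
{
    EnableZip64 = true // placeholder name for the per-entry hint
};
```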

If the stream is non-seekable, we have another issue: the writer would normally set a flag and then write the Crc, compressed size, and uncompressed size in a special trailing header. This trailing header has not been updated to use zip64, so we cannot write the correct values in it. Nor can we use both the trailing header and the "extra" data; this was clarified by PKWare: https://blogs.oracle.com/xuemingshen/entry/is_zipinput_outputstream_handling_of

In the streaming case, the local headers are written with the trailing data, which may overflow. The trailing headers do contain the crc32 value, and they contain correct sizes as long as those are less than uint.MaxValue. Again, the central directory header has the correct values.
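For context, the classic trailing header only has 4-byte size fields, which is exactly where the overflow happens (again illustrative, not the actual code):

```csharp
using System.IO;

// The classic (non-zip64) data descriptor, written after the entry data
// when bit 3 of the general purpose flags is set.
static void WriteDataDescriptor(BinaryWriter writer, uint crc32, long compressedSize, long uncompressedSize)
{
    writer.Write(0x08074b50u);            // optional descriptor signature
    writer.Write(crc32);                  // CRC-32 is always correct
    writer.Write((uint)compressedSize);   // truncated if > uint.MaxValue
    writer.Write((uint)uncompressedSize); // truncated if > uint.MaxValue
}
```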

I am not sure how to deal with testing, as it requires +4GiB files to hit the limitations.

@adamhathcock (Owner)

So if I understand, there's no post data descriptor for zip64 but the central directory has the values?

Then you can't use streaming reads on zip64?

This is pretty shitty. I guess they updated APPNOTE.txt:

      4.3.9.5 An extensible data descriptor will be released in a
      future version of this APPNOTE.  This new record is intended to
      resolve conflicts with the use of this record going forward,
      and to provide better support for streamed file processing.

      4.3.9.6 When the Central Directory Encryption method is used,
      the data descriptor record is not required, but MAY be used.
      If present, and bit 3 of the general purpose bit field is set to
      indicate its presence, the values in fields of the data descriptor
      record MUST be set to binary zeros.  See the section on the Strong
      Encryption Specification for information. Refer to the section in
      this document entitled "Incorporating PKWARE Proprietary Technology
      into Your Product" for more information.

So there's currently no way to forward-only read entries out of a zip64.

Would a good test just be forcing zip64 on a small file? The values getting overflowed shouldn't really matter.

@kenkendk (Contributor, Author) commented Mar 10, 2017

Streaming

Yes, that is my understanding: streaming cannot write 64-bit values in the post-data.

I know they previously said that the post-data header should have 8-byte values for zip64, but they forgot to specify how a forward-only reader can tell that the file is zip64. The update you mention probably means they intend to add some kind of identifier to the post-data region.

You can still use forward-only reading, but you cannot verify that the lengths are correct if the stream is larger than 4GiB.

Edit: This problem is worst if you write the archive forward-only AND read it forward-only. If the stream is seekable, it is possible to forgo the post-data and use the zip64 extras, so that forward-only reading is possible.

Testing

You can force zip64 to include the extra headers, but I would expect errors to show up only once values hit the uint.MaxValue limit.
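Something along these lines could at least exercise the forced-header path on a small file (a sketch; the UseZip64 flag and the Write overload are assumptions, not confirmed API):

```csharp
using System.IO;
using SharpCompress.Common;
using SharpCompress.Writers.Zip;
using Xunit;

public class Zip64Tests
{
    [Fact]
    public void Zip64ExtraFieldsOnSmallEntry()
    {
        using (var output = new MemoryStream())
        {
            var options = new ZipWriterOptions(CompressionType.Deflate)
            {
                UseZip64 = true // assumed flag name: force the extra fields
            };
            using (var writer = new ZipWriter(output, options))
            using (var source = new MemoryStream(new byte[1024]))
            {
                writer.Write("small.bin", source); // assumed overload
            }
            // Re-open the archive (or feed it to 7Zip/WinRAR) to verify the
            // headers parse; this cannot prove the >4GiB code paths.
        }
    }
}
```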

Misc

One detail I forgot to mention in the initial text: the list of files reports its Count as an int, so there is an overflow there if you have more than int.MaxValue files in the archive. SharpCompress does not use this value for anything, but other libraries may.
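To make the mismatch concrete (hypothetical numbers; zip64 itself stores the entry count in an 8-byte field):

```csharp
// The in-memory file list is a List<T>, whose Count property (and maximum
// size) is limited to int.MaxValue.
ulong entriesOnDisk = 3000000000UL; // a legal entry count for a zip64 archive
if (entriesOnDisk > int.MaxValue)
{
    // a List<T>-backed collection cannot hold this many entries, let alone
    // report them through its int-typed Count property
    throw new System.OverflowException("More entries than List<T> can represent.");
}
```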

@adamhathcock (Owner)

You could always just say upfront to write everything as zip64 and deal with streaming that way, then write post descriptors with 8 bytes. I guess that would be incompatible with other implementations?
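For what it's worth, APPNOTE does define 8-byte descriptor sizes for ZIP64 archives, so an always-zip64 descriptor would look like this sketch. The catch from above remains: a forward-only reader has no in-band marker telling it which variant to expect:

```csharp
using System.IO;

// Data descriptor with 8-byte sizes, as APPNOTE specifies for ZIP64 archives.
static void WriteZip64DataDescriptor(BinaryWriter writer, uint crc32, ulong compressedSize, ulong uncompressedSize)
{
    writer.Write(0x08074b50u);      // optional descriptor signature
    writer.Write(crc32);
    writer.Write(compressedSize);   // 8 bytes instead of 4
    writer.Write(uncompressedSize); // 8 bytes instead of 4
}
```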

I guess if the code allows forward-only reading but you'd have to read the central directory for the valid length, that's something. However, that's not how the reader works now: the central directory isn't used with readers, only archives. Ugh.

I guess the count should be 'ulong' like everything else.

This code really needs some refactor lovin' too, I think.

@kenkendk (Contributor, Author)

Streaming

I guess we can look at "what others do", but that is likely to cause errors. If the readers expect 4-byte values, there should at least be some indicator somewhere that the values are not 4-byte.

Count overflow

Yes, but that would mean replacing the system List<T>, which is a rather intrusive change.

@adamhathcock (Owner)

I usually test with WinRAR and 7Zip myself.

I guess I see the point about the collections. Let's save that for a rainy day.

@kenkendk (Contributor, Author)

Maybe a better approach is to simply throw an exception if that case is hit?

If the stream is non-seekable and the input data is larger than uint.MaxValue, we could throw an exception.
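Roughly this (a sketch; the names are illustrative, not actual members):

```csharp
using System;
using System.IO;

// Refuse to continue once a non-seekable output can no longer produce a
// valid trailing header for this entry.
static void GuardNonSeekableEntrySize(Stream outputStream, ulong bytesWrittenForEntry)
{
    if (!outputStream.CanSeek && bytesWrittenForEntry > uint.MaxValue)
    {
        throw new NotSupportedException(
            "Entries larger than 4GiB cannot be written to non-seekable streams, " +
            "because the trailing header cannot hold zip64 sizes.");
    }
}
```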

@adamhathcock (Owner)

Yes. Explicit exceptions are good if there's no fix, at least to let people know what's happening.

Thanks again!

@kenkendk (Contributor, Author)

Should we do the same for the case where we use a seekable stream and do not have space for the zip64 extra fields?

Maybe we could then choose not to make space for it by default (remove my 2GiB threshold) and make the caller set the zip64 flag when creating the archive, if the archive or a stream may exceed 4GiB?

@adamhathcock (Owner)

I would rather be explicit in these cases and force the user to know what they're doing. Especially when getting into the zip64 file size area.

I guess just measure, then throw exceptions if it's not zip64 mode.

@kenkendk (Contributor, Author) commented Mar 10, 2017

OK, I will update the code to require explicit zip64 activation when writing streams larger than 4GiB, and disallow writing streams larger than 4GiB in forward-only mode.

Archives larger than 4GiB can be transparently supported, as long as all streams inside are less than 4GiB individually.

Commits added to this PR:
…ngth (12 bytes) are not included in the reported length.
…ve limits, and zip64 is not enabled.

This changes the logic so that archives larger than 4GiB are still automatically written correctly (only the central header is special). Archives with individual streams larger than 4GiB must set the zip64 flag, either on the archive or on the individual streams.
@kenkendk (Contributor, Author) commented Mar 11, 2017

I have updated the logic to require setting the zip64 flag before writing a stream larger than 4GiB.
I have also disabled zip64 support for writing non-seekable streams.

I have added the check to the Write method of the ZipWritingStream to avoid throwing exceptions in the Dispose method.

Not sure how you prefer this to work, but we cannot know in advance whether a write will expand the file beyond 4GiB, so I have made a pre-check and a post-check. If we catch it in the pre-check, the file is correct (just with a shorter stream); if we catch it in the post-check, invalid data is already in the archive.
Maybe the post-check is sufficient? Is it acceptable to close the base stream if the write has produced an invalid file?
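For clarity, a sketch of the pre-check/post-check pair (member names are illustrative, not the actual ZipWritingStream fields):

```csharp
public override void Write(byte[] buffer, int offset, int count)
{
    // Pre-check: would this write push the uncompressed size past the
    // 4-byte limit without zip64? Fail before emitting anything, so the
    // archive on disk is still valid (the entry is just truncated).
    if (!zip64Enabled && uncompressedBytes + (ulong)count > uint.MaxValue)
    {
        throw new NotSupportedException("Entry exceeds 4GiB; set the zip64 flag.");
    }

    compressedStream.Write(buffer, offset, count);
    uncompressedBytes += (ulong)count;

    // Post-check: the compressed size is only known after the write; if it
    // has overflowed, invalid data has already reached the archive.
    if (!zip64Enabled && compressedBytesWritten > uint.MaxValue)
    {
        throw new NotSupportedException("Compressed data exceeded 4GiB mid-write.");
    }
}
```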

I have also added a unit test exercising various parts of the zip64 features.

@adamhathcock (Owner)

It looks like you throw an exception when the file produced would be invalid, so that seems sufficient.

@adamhathcock merged commit f853840 into adamhathcock:master on Mar 13, 2017