
Reader<R: BufRead> unnecessarily panics on 32-bit targets when the document size exceeds 4 GiB #751

Closed
tniessen opened this issue Jun 2, 2024 · 15 comments · Fixed by #769

Comments

tniessen commented Jun 2, 2024

quick_xml::reader::Reader<R: BufRead> allows reading arbitrarily large files without allocating much memory. I am using this in combination with flate2::read::GzDecoder in order to decompress and simultaneously parse very large XML files (several gigabytes per document) without having to store the decompressed file (either in memory or on disk).

Unfortunately, because offsets within the document are computed as usize internally, this causes quick-xml to panic when the document size exceeds 4 GiB on 32-bit targets.

Notably, buffer_position() returns usize, which is not necessarily sufficient for representing the correct position if the input document size exceeds 4 GiB.

Rust's built-in std::io types typically use u64 for file sizes (and i64 for relative offsets within files). Hence, I believe it would be appropriate to use u64 for internal computations (and to return it from buffer_position() etc.) as well.
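To illustrate the point, here is a minimal stdlib-only sketch (not quick-xml's actual code) of what u64-based position tracking looks like: a hypothetical BufRead wrapper whose byte counter is u64 regardless of the target's pointer width, so it keeps counting correctly past 4 GiB even on 32-bit systems.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical wrapper that tracks the absolute stream position in `u64`,
/// so the count cannot overflow on 32-bit targets the way `usize` does.
struct PositionTracking<R> {
    inner: R,
    /// Total bytes consumed so far; `u64` even when `usize` is 32 bits.
    position: u64,
}

impl<R: BufRead> PositionTracking<R> {
    fn new(inner: R) -> Self {
        Self { inner, position: 0 }
    }

    fn position(&self) -> u64 {
        self.position
    }
}

impl<R: BufRead> Read for PositionTracking<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.position += n as u64;
        Ok(n)
    }
}

impl<R: BufRead> BufRead for PositionTracking<R> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        self.inner.fill_buf()
    }

    fn consume(&mut self, amt: usize) {
        self.position += amt as u64;
        self.inner.consume(amt);
    }
}

fn main() {
    // `&[u8]` implements `BufRead`, so a byte slice works as a tiny demo input.
    let data: &[u8] = b"<note>hi</note>";
    let mut reader = PositionTracking::new(data);
    let mut contents = Vec::new();
    reader.read_to_end(&mut contents).unwrap();
    // 15 bytes consumed, tracked as u64 on every target.
    assert_eq!(reader.position(), 15);
    println!("position: {}", reader.position());
}
```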

Mingun (Collaborator) commented Jun 2, 2024

Seems reasonable. Would you mind creating a PR? I plan to release 0.32 in the next few days, so this change can be included there. error_position() should also be changed, and don't forget to add a changelog entry.

tniessen (Author) commented Jun 2, 2024

@Mingun I'll give it a try! :)

tniessen (Author)

While working on this, I noticed a broader problem: all internal offsets are computed using usize, which breaks on 32-bit systems when processing more than 4 GiB of data in a single document.


Here's a very short reproduction of the issue, beginning with main.rs (with only quick-xml as a dependency):

fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
    }
    println!("Done.");
}

You can create such a file large.xml, for example, like this:

(echo -n '<root>' ; perl -e 'print "<child>text</child>"x300000000' ; echo -n '</root>') > large.xml

Then run the code in 32-bit mode:

$ rustup target add i686-unknown-linux-gnu
$ cargo run --release --target i686-unknown-linux-gnu
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `release` profile [optimized] target(s) in 0.27s
     Running `target/i686-unknown-linux-gnu/release/test-quick-xml-4gb`
thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

As you can see, and as expected from the internal use of usize for file offsets, parsing fails once the offset within the file exceeds the range of usize.


Since quick_xml::reader::Reader<R: BufRead> should not be limited to files that fit into memory, I believe that fixing these related issues would require changing all internal offset computations to use u64.

@tniessen tniessen changed the title buffer_position() should return u64 Reader<R: BufRead> unnecessarily panics on 32-bit targets when the document size exceeds 4 GiB Jun 22, 2024
pronebird commented Jun 22, 2024

Looking at the same thing right now. usize is platform-dependent, and I have read that it could be 128-bit on RISC-V, but according to the Rust docs, usize currently cannot be larger than 64 bits. So I wonder if having a helper that simply converts usize to u64 internally would do the trick, i.e.:

fn to_u64(value: usize) -> u64 {
    u64::try_from(value).expect("failed to convert usize to u64")
}

Unless I missed something, it looks like we could get away with keeping everything memory-related in usize and only converting to u64 when storing last_error_offset and offset.
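For what it's worth, that expect should be unreachable in practice: on every target Rust currently supports, usize is at most 64 bits wide, so the conversion is lossless even for usize::MAX. A quick check:

```rust
fn to_u64(value: usize) -> u64 {
    // On all targets Rust currently supports, usize is at most 64 bits,
    // so this conversion can never actually fail.
    u64::try_from(value).expect("failed to convert usize to u64")
}

fn main() {
    assert_eq!(to_u64(42), 42);
    // Even the largest possible usize value converts losslessly.
    assert_eq!(to_u64(usize::MAX), usize::MAX as u64);
    println!("usize::MAX as u64 = {}", to_u64(usize::MAX));
}
```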

Mingun (Collaborator) commented Jun 22, 2024

Yes, internal offsets should be stored as u64. I'm not sure whether a Vec can store data whose length exceeds usize. If so, we should at least add special handling to break long Text / CData events into multiple events. In theory a tag name could also be that absurdly big, though not in real life. Such improvements can be done in separate PRs, but we should at least mention these limitations in the documentation.

I did the legwork of prototyping this in the u64-offsets branch.

tniessen (Author)

is it possible for a Vec to store data whose length exceeds usize?

If I'm understanding the question correctly, then the answer is no. A Vec<T> with capacity cap maintains a heap allocation of size mem::size_of::<T>() * cap, so if cap exceeded usize, this would be impossible. (In fact, Rust caps single allocations at isize::MAX bytes, which is even stricter.)
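The "capacity overflow" panic from the first reproduction can actually be triggered directly, without any file: asking Vec for more capacity than an allocation can ever represent aborts the reservation immediately. A stdlib-only demonstration:

```rust
use std::panic;

fn main() {
    // Suppress the default panic message so only our own output is printed.
    panic::set_hook(Box::new(|_| {}));

    // `Vec` computes `capacity * size_of::<T>()` in `usize` and panics with
    // "capacity overflow" if the allocation size cannot be represented
    // (Rust caps single allocations at `isize::MAX` bytes).
    let result = panic::catch_unwind(|| Vec::<u8>::with_capacity(usize::MAX));
    assert!(result.is_err());

    // A reasonable capacity succeeds, of course.
    let ok = panic::catch_unwind(|| Vec::<u8>::with_capacity(8));
    assert!(ok.is_ok());
    println!("Vec::with_capacity(usize::MAX) panicked as expected");
}
```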

pronebird
@Mingun you can cherry-pick the fixes for the tests from pronebird@a393ffb

tniessen (Author)

@pronebird @Mingun I apologize; my previous example did not properly clear the buffer between events. This is the correct reproduction:

fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
        buf.clear();  // This was missing above.
    }
    println!("Done.");
}

Now we see the expected panic:

$ cargo run --target i686-unknown-linux-gnu
   Compiling cfg-if v1.0.0
   Compiling memchr v2.7.4
   Compiling encoding_rs v0.8.34
   Compiling quick-xml v0.33.0
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.13s
     Running `target/i686-unknown-linux-gnu/debug/test-quick-xml-4gb`
thread 'main' panicked at /home/tniessen/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quick-xml-0.33.0/src/reader/buffered_reader.rs:286:5:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Mingun (Collaborator) commented Jun 23, 2024

@tniessen, could you test that with the u64-offsets branch?

tniessen (Author)

@Mingun Neither debug nor release mode panics with that branch (using the minimal repro from my last comment).

Mingun (Collaborator) commented Jun 24, 2024

Cool! It would be good to finish that work by thinking about how we can test it. Of course, having a 4 GB+ file in the repository is not an option 😄. We could create a special BufRead implementation that streams data until its size exceeds usize::MAX. Or find a smarter way to test it (set internal attributes to a high value and stream some data?). Maybe I will find time in the near future to do that myself, but if someone would like to implement it, that would be great. How long did it take to read the 4 GB document?
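The driving side of such a test needs no special source at all: BufReader::new(io::repeat(..)) is already an infinite BufRead. A stdlib-only sketch of the idea, with a deliberately small limit so it runs quickly (a real test would use a limit past u32::MAX):

```rust
use std::io::{self, BufRead, BufReader};

/// Drive an endless `BufRead` until `limit` bytes have been consumed,
/// counting in `u64` so the tally also works on 32-bit targets.
/// (A real test would feed the bytes to the XML parser here and use a
/// limit past `u32::MAX`; the small limit below keeps this sketch fast.)
fn stream_until(reader: &mut impl BufRead, limit: u64) -> io::Result<u64> {
    let mut total: u64 = 0;
    while total < limit {
        let available = reader.fill_buf()?.len();
        if available == 0 {
            break; // genuine EOF
        }
        // In a real test, the parser would inspect the bytes here.
        reader.consume(available);
        total += available as u64;
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // `io::repeat` yields the same byte forever -- an infinite stream
    // without a multi-gigabyte file on disk.
    let mut endless = BufReader::new(io::repeat(b'x'));
    let limit: u64 = 1 << 20; // 1 MiB stand-in for u32::MAX
    let total = stream_until(&mut endless, limit)?;
    assert!(total >= limit);
    println!("consumed {} bytes", total);
    Ok(())
}
```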

pronebird

We could create a special BufRead implementation to stream data until its size exceeded usize::MAX

You probably mean exceeds u32::MAX, because on 64-bit platforms usize::MAX is the same as u64::MAX, which would result in the same panic.

A graceful solution would be to use checked arithmetic everywhere, catch overflows, and return an Err, but I am not sure anyone is realistically ever going to hit that limit, so I suppose there is no need to complicate the code more than necessary.
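For reference, the checked-arithmetic variant is only a few lines. A hedged sketch (not quick-xml's actual code) of turning an offset overflow into an io::Error instead of a panic:

```rust
use std::io;

/// Advance a stream offset, reporting overflow as an error instead of
/// panicking. A sketch of the "graceful" option discussed above; in
/// practice a `u64` offset cannot realistically overflow for stream data.
fn advance(offset: u64, read: usize) -> io::Result<u64> {
    offset
        .checked_add(read as u64)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "stream offset overflowed u64"))
}

fn main() {
    assert_eq!(advance(10, 5).unwrap(), 15);
    // Near the top of the range, the same call reports an error
    // rather than panicking with "attempt to add with overflow".
    assert!(advance(u64::MAX, 1).is_err());
    println!("checked offset arithmetic works");
}
```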

Mingun (Collaborator) commented Jun 24, 2024

I ran some benchmarks, and here are the results:

| Test | Time |
|------|------|
| With checks + small text | 438.37s (7m 18.37s) |
| With checks + long text | 15.97s |
| Without checks + small text | 411.09s (6m 51.09s) |
| Without checks + long text | 15.40s |
| Without checks + small text + --release | 28.29s |
| Without checks + long text + --release | 0.45s |

"small text" is the

<child>text</child>

"long text" is the Lorem Ipsum with 100 paragraphs generated by https://ru.lipsum.com/feed/html, 61475 bytes long in the following frame:

<content description="Text generated by https://ru.lipsum.com/feed/html: 100 paragraphs">
Lorem Ipsum...
</content>

"With checks" is the assert_eq! of the names returned in the Start and End events, "without checks" -- without them.

The source of the XML:

struct Fountain<'a> {
    /// That piece of data repeated infinitely...
    chunk: &'a [u8],
    /// Part of `chunk` that was consumed by BufRead impl
    consumed: usize,
    /// The overall count of read bytes
    overall_read: u64,
}
impl<'a> io::Read for Fountain<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let available = &self.chunk[self.consumed..];
        let len = buf.len().min(available.len());

        // `copy_from_slice` panics unless both slices have the same length,
        // so copy into only the first `len` bytes of `buf`.
        buf[..len].copy_from_slice(&available[..len]);
        // Mark the bytes as consumed so the next read returns fresh data.
        io::BufRead::consume(self, len);
        Ok(len)
    }
}
impl<'a> io::BufRead for Fountain<'a> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        Ok(&self.chunk[self.consumed..])
    }

    fn consume(&mut self, amt: usize) {
        self.consumed += amt;
        if self.consumed == self.chunk.len() {
            self.consumed = 0;
        }
        self.overall_read += amt as u64;
    }
}

The test ends when overall_read >= u32::MAX.

Because I ran the tests on x86_64, I did not run into the problem, so the only remaining question is how to test this on a target where the error actually occurs. It should be possible to ensure that we have such a target in CI.

tniessen (Author)

Thank you @Mingun and @pronebird for your work on this!

Mingun (Collaborator) commented Jun 24, 2024

I also just released 0.34.0 with this fix.
