New DDS decoder #2258

Open · RunDevelopment wants to merge 27 commits into main
Conversation

@RunDevelopment commented Jun 9, 2024

I made a new DDS decoder because the old DXT-based decoder was very limited (only DXT1-5, and only dimensions divisible by 4) and incorrect (DXT1 colors were not rounded correctly, resulting in discolorations). While this PR is not finished yet, I have already implemented the following features:

  • Support for DXT1-5 and BC1-5 (including arbitrary image sizes).
  • Support for all (s)normalized and floating-point uncompressed DX10 RGB formats, i.e. most of DXGI_FORMAT.
  • Support for many DX9 pixel formats. This is done by mapping DDS pixel formats to DXGI_FORMAT (almost; I actually map to a supported-formats enum, but this enum closely follows DXGI_FORMAT). This means there are DX9 pixel formats the new DDS decoder does not support, but I couldn't produce/find any DDS files that use them.
  • Support for non-standard DDS files. Some old and current encoders use custom flags not mentioned in the official docs (e.g. compare the pixel format flags TexConv uses vs the officially documented ones). I used the same flags as TexConv and ignored unknown flags. This allows the new DDS decoder to support more old DDS files with uncommon and non-standard formats.
  • Support for cubemaps. Cubemaps are displayed in a 4x3 grid that shows the unfolded cube; a sketch of one possible face layout follows this list. (Example image omitted.)
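For reference, the "unfolded cube" can be described as a mapping from each cube face to a cell in the 4x3 grid. This is only a sketch of one common horizontal-cross layout, not necessarily the exact layout this PR uses:

/// Grid cell (column, row) of each cube face in a 4x3 horizontal-cross layout.
/// Faces are in DDS/DirectX order: +X, -X, +Y, -Y, +Z, -Z.
/// (Illustrative only; the PR's actual layout may differ.)
const FACE_CELLS: [(u32, u32); 6] = [
    (2, 1), // +X (right)
    (0, 1), // -X (left)
    (1, 0), // +Y (top)
    (1, 2), // -Y (bottom)
    (1, 1), // +Z (front)
    (3, 1), // -Z (back)
];

/// Top-left pixel position of face `face` in the (4*size)x(3*size) output image.
fn face_origin(face: usize, size: u32) -> (u32, u32) {
    let (col, row) = FACE_CELLS[face];
    (col * size, row * size)
}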

The only major formats still missing are BC6 and BC7. These two are quite complex, so they will likely take me some more time to implement. Once those two are in, this should be a pretty competent DDS decoder.

Some notes and technical decisions:

  • I use &mut dyn Read in all format decoders to reduce binary size; a sketch of this decoding pattern follows this list. Since there are a lot of DDS formats, there are a lot of functions to decode these formats. I think the binary size could be reduced even further, so I'm very open to feedback in that regard.
  • I put particular care into ensuring that the decoded colors are correctly rounded. DirectX internally uses float32 everywhere to read DDS files, so to output a 5-bit value as 8-bit, DX does (x as f32 / 31.0 * 255.0).round() as u8 (i.e. first convert to an f32 value in the range 0-1, then convert to u8 with rounding). Doing this with f32 is quite slow, so I did it with only integer operations that are around 3x faster than f32 and around 2x harder to understand. The tricks I use are explained in x5_to_x8 in convert.rs.
  • I already added the code to read arbitrary surfaces from a DDS file, so reading mip chains, volumes, and image arrays can easily be supported. There's just not much use for it right now aside from cubemaps.
  • I used Paint.NET to determine what DDS files should look like. I did this because I have not found a single DDS file it decodes incorrectly so far, so I used it as my source of truth. Paint.NET uses DirectXTex (the same library texconv uses) under the hood AFAIK, so everything should be correct.
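To make the first two notes concrete (the &mut dyn Read decoders and the integer rounding), here is a minimal sketch of how a simple uncompressed format like B5G6R5 could be decoded line by line. The helper names and the exact integer constants are illustrative; the PR's code in convert.rs may differ:

use std::io::{self, Read};

/// Correctly rounded 5-bit -> 8-bit using only integers:
/// (x * 255 + 15) / 31 == round(x / 31 * 255) for all x in 0..32.
fn x5_to_x8(x: u16) -> u8 {
    debug_assert!(x < 32);
    ((x as u32 * 255 + 15) / 31) as u8
}

/// Same idea for 6-bit values.
fn x6_to_x8(x: u16) -> u8 {
    debug_assert!(x < 64);
    ((x as u32 * 255 + 31) / 63) as u8
}

/// Decode a B5G6R5 image to RGBA8, reading one row at a time.
/// Taking `&mut dyn Read` avoids one monomorphized copy per reader type.
fn decode_b5g6r5(reader: &mut dyn Read, width: usize, height: usize) -> io::Result<Vec<u8>> {
    let mut out = Vec::with_capacity(width * height * 4);
    // The temporary buffer holds a single encoded row: O(width) extra memory.
    let mut row = vec![0u8; width * 2];
    for _ in 0..height {
        reader.read_exact(&mut row)?;
        for px in row.chunks_exact(2) {
            let v = u16::from_le_bytes([px[0], px[1]]);
            let (b, g, r) = (v & 0x1F, (v >> 5) & 0x3F, v >> 11);
            out.extend_from_slice(&[x5_to_x8(r), x6_to_x8(g), x5_to_x8(b), 255]);
        }
    }
    Ok(out)
}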

With what I did so far out of the way, I have some questions for the maintainers on how to integrate this into the image crate:

  1. How should I test this? I used around 20MB of images (>100 files) and render_images in tests/reference_images.rs to ensure that the decoder works correctly, but 20MB is a lot of image data to commit to a repo. I can go down to around 8MB, but not much lower. The issue is that I only have one file for some rare formats and no way to generate smaller images in those formats.
  2. Benchmarking. I have not benchmarked decoding DDS files yet (although it shouldn't be too slow). The current bench setup also seems ill-suited for DDS, as I need to benchmark each format individually. How should I go about this?
  3. Should this be in a separate crate? I initially wanted to make this a separate crate, but found it vastly easier to add it to the image crate.
  4. What should we do with the old DXT implementation? It's not used by the new DDS decoder, and it has been deprecated for some time now. Maybe it could be removed after the new DDS decoder is in?
  5. How careful do I have to be with resource limits? Most formats currently read line-by-line into a temporary buffer from which pixels are then decoded. The size of this buffer is proportional to the width of the image, which means we only need O(sqrt(N)) (N = number of pixels) additional memory for roughly square images. However, an attacker could supply an image with height=1, making the temporary buffer as large as (but no larger than) the output buffer. I already limited the width and height of DDS images to at most 2^24, so the temporary buffer can be at most 256MB (with any R32G32B32A32_* format); see the back-of-the-envelope check after this list.
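A quick back-of-the-envelope check of the worst case under these limits (my own numbers, not code from the PR):

fn main() {
    const MAX_DIM: usize = 1 << 24; // width and height are each limited to 2^24
    const BYTES_PER_PIXEL: usize = 4 * 4; // R32G32B32A32: 4 channels x 4 bytes each
    // A height=1 image maximizes the per-line buffer: 2^24 * 16 bytes = 256MB.
    assert_eq!(MAX_DIM * BYTES_PER_PIXEL, 256 * 1024 * 1024);
}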

Also (not a question): this is my first large-scale Rust PR, so please feel free to pick apart and suggest improvements to everything you see.

@HeroicKatora (Member) left a comment

Nothing short of amazing 🎉

This isn't fully comprehensive, but I do have the time for a bit of a review.

So this is probably a bit larger than we'd like within the main crate. Also, for BC decoding there's already a crate which looks interesting and doesn't even have dependencies: bcdec_rs. As far as possible, I'd see if the bc.rs and convert.rs contents and parts of decoder.rs might be upstreamed to that crate and then used via a dependency. It's important to reduce the amount of code to be maintained centrally. (Aside: the author also seems to have a file/encoding/decoding crate for it, image_dds, so they might be interested anyways.)

Doing this with f32 is quite slow, so I did it with only integer operations that are around 3x faster than f32 and around 2x harder to understand.

I'm not sure all the tricks are worth the unfamiliarity burden. Did you benchmark them? End-to-end numbers would be most convincing here; some floating-point operations are quite fast, unless the integer versions are smaller and can be vectorized better.

How should I test this?

8 MB isn't that much if the files don't change, and provided they are free of copyright risks etc. That would be my primary concern. Still, I'm wondering: why are the test images this large?

Should this be in a separate crate?

Well, see above: quite possibly parts of it. See image-webp. I think image_dds is owned by the author of the BC decoding crate mentioned above, but as you'll note with weezl and zune-jpeg, it's totally fine to choose a more unique name anyways.

What should we do with the old DXT implementation? Maybe it could be removed after the new DDS decoder is in?

Yes, thinking so as well.

How careful do I have to be with resource limits?

Not too careful, although adhering to resource limits is always nice. If you can set up fuzzing (see the infrastructure in png and webp), it might also yield some very early detection of whether the memory limits are upheld. If it's within some constant factor above the configured one, I wouldn't worry. In png, we've set the software limit of the decoder to half the fuzzer's own limit, IIRC.
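For reference, a minimal cargo-fuzz target in the style of the png/webp ones could look like the sketch below; the entry point shown is just the generic image API, and a real target may want to configure explicit decoder limits:

// fuzz/fuzz_targets/dds.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Decode failures are expected on random input; panics or blowing past
    // the fuzzer's memory limit are the bugs we're looking for.
    let _ = image::load_from_memory_with_format(data, image::ImageFormat::Dds);
});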

@RunDevelopment (Author) commented

Thanks for the quick response!

Also, for BC decoding there's already a crate which looks interesting and doesn't even have dependencies: bcdec_rs. As far as possible, I'd see if the bc.rs and convert.rs contents and parts of decoder.rs might be upstreamed to that crate and then used via a dependency. It's important to reduce the amount of code to be maintained centrally. (Aside: the author also seems to have a file/encoding/decoding crate for it, image_dds, so they might be interested anyways.)

bcdec_rs is certainly interesting, especially since they already implemented BC6/7 (although their code scares me). However, from looking at the code, they too seem to get rounding slightly wrong. I'll have to do some testing to be sure, though. Assuming it's all correct (or that we fix any issues), bcdec_rs should be able to replace bc.rs completely.

However, I don't think the things in decoder.rs fit the theme of bcdec_rs, so I don't think moving them into that crate makes sense. Of course, it's up to the author of bcdec_rs to decide what does and doesn't fit.

On the topic of image_dds: I might also add an encoder for uncompressed DDS files (= everything except DXTn/BCn) in a future PR. Block compression encoding might be a bit too difficult for me to get something good working, but all other formats are uncompressed and almost trivial to encode. I'm mentioning this now because it might change your decision on whether this should all be a separate crate.

I also want to mention that I don't intend to maintain this code, so if you want to make it a separate crate, I would ask to put it under the image-rs org. I don't have any experience with maintaining a crate and I don't do a lot of Rust in general.

How should I test this?

8 MB isn't that much if the files don't change, and provided they are free of copyright risks etc. That would be my primary concern. Still, I'm wondering: why are the test images this large?

Copyright won't be a concern for most images. I tested most formats with a 257x131 test image I created myself, and then used texconv to create DDS files in different formats. (The test image itself is omitted here.)

However, there are some images for which copyright will be an issue. For some less-common formats, I only have one image per format that I found in some games (e.g. Elden Ring and the Dark Souls games). Given that all of these formats are uncompressed (no fancy block compression), I could write some scripts to generate images in those formats. Non-standard flags could also be tested by just hex-editing the headers of similar DDS files.

As for the size: most DDS formats are uncompressed, so even my little test image produces a 174KB DDS file when saved with 8bpc RGBA color. Take that times 100 files and that's why. My main way of reducing the size would be to use an even smaller test image.

Doing this with f32 is quite slow, so I did it with only integer operations that are around 3x faster than f32 and around 2x harder to understand.

I'm not sure all the tricks are worth the unfamiliarity burden. Did you benchmark them? End-to-end numbers would be most convincing here; some floating-point operations are quite fast, unless the integer versions are smaller and can be vectorized better.

I did a quick benchmark with a DX10 B5G6R5_UNORM image. (This format basically just does a range conversion, the same one BC1 uses, and nothing else, so it's the ideal format to test this with.) I also found that the bcdec_rs crate uses a different trick for the range conversion. I don't understand how they derived it, but they found a version that uses a bit shift instead of a division at the end. And it's crazy fast:

my int trick
  time:   [84.337 µs 84.543 µs 84.747 µs]

f32
  time:   [112.58 µs 112.93 µs 113.36 µs]
  change: [+32.418% +33.158% +33.792%] (p = 0.00 < 0.05)

bcdec_rs int trick
  time:   [57.821 µs 58.017 µs 58.255 µs]
  change: [-31.977% -31.377% -30.646%] (p = 0.00 < 0.05)

For the f32 conversion I used this snippet (similar for x6_to_x8):

#[inline(always)]
pub(crate) fn x5_to_x8(x: u16) -> u8 {
    debug_assert!(x < 32);
    const FACTOR: f32 = 255.0 / 31.0;
    (x as f32 * FACTOR + 0.5) as u8
}

So I had already optimized it to multiply with a single constant and to use the truncating as u8 cast for rounding by adding 0.5 beforehand. The truncation trick in particular is a lot faster than calling round, which compiles to a literal call instruction.

So I think I'm going to switch to the trick bcdec_rs uses :)

And I think I can also find constants for the same trick to use in the other range conversion functions. I don't know how they derived their constants, but they are small enough that I can use brute force to find more.
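Such a brute-force search is small enough to sketch here (my own code, not from the PR); for the 5-bit case it recovers exactly the constants bcdec appears to use:

/// Reference: correctly rounded 5-bit -> 8-bit conversion.
fn x5_to_x8_ref(x: u32) -> u32 {
    (x as f64 / 31.0 * 255.0).round() as u32
}

/// Search for (mul, add) such that (x * mul + add) >> 6 matches the reference
/// for all 32 inputs. Since x = 0 forces add < 64, the add range stays small.
fn find_constants() -> Option<(u32, u32)> {
    for mul in 0u32..1024 {
        for add in 0u32..64 {
            if (0u32..32).all(|x| (x * mul + add) >> 6 == x5_to_x8_ref(x)) {
                return Some((mul, add));
            }
        }
    }
    None
}

fn main() {
    println!("{:?}", find_constants()); // prints Some((527, 23))
}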

@RunDevelopment (Author) commented

Alright, so it's pretty much ready now.

Updates:

  • I used bcdec_rs to add support for BC6 and BC7, so we support all commonly-used DDS formats now.
  • I also optimized how pixels are written into the output buffer. For some formats, this made decoding around 2x faster.
  • I removed the old DXT module because it was entirely unused and the compiler complained.

There are just a few things we have to talk about:

  1. bcdec_rs: As I suspected, bcdec_rs does not round correctly. This affects all block compression formats except BC6 and BC7. This isn't even a bug in bcdec_rs directly, because it inherited the behavior from the bcdec C library.

    Unfortunately, neither the author of bcdec_rs nor the author of bcdec seems convinced that this is a bug or that it is worth fixing. I'll talk to them more, but this means that we can't use bcdec_rs for BC1-5 for the time being.

    bcdec_rs also doesn't support signed BC4 and BC5.

  2. Clippy: DxgiFormat and PixelFormatFlags have a few unused associated constants. These constants are there for completeness, so removing them isn't really an option, and I'm not too keen on adding DXGI_FORMAT constants on demand.

    I could slap #[allow(unused)] on all these constants, but that doesn't seem like a nice solution. Is there a better way to fix this?

@RunDevelopment (Author) commented

I've also run into a problem with testing: How do I test RGB32F images? render_images doesn't output them.

@RunDevelopment marked this pull request as ready for review August 1, 2024 13:28
@RunDevelopment (Author) commented

Sorry for the delay!

I mostly finished up the PR now.

  1. I fixed all clippy problems.
  2. I tried to use the bcdec_rs crate for BC decompression as much as possible, but its BC4 implementation is incorrect, so I don't use it there. I've made a PR to fix this upstream in bcdec, so hopefully this won't be an issue for much longer.
  3. I added tests for all supported formats that I could test.

However, I need help with testing the floating-point formats. I don't know how to generate reference images for them, since render_images in reference_images.rs doesn't support f32 images. @HeroicKatora, could you please help me here?

Otherwise, this PR is ready as I see it. The new DDS decoder is correct, decently performant, and covers most DDS formats.

@HeroicKatora Please let me know what else needs to be done/changed.
