Sunday, May 13, 2018

Graphing BC7 with random codec options

I used a different set of data (and a different random seed) to compute this scattergraph vs. my previous post. This time, I've highlighted all random solutions that enable mode 1:


Modes 1 and 6 are the key modes for opaque textures. Mode 6 is used across the entire frontier, and mode 1 is used across all of the frontier above a threshold of about 49.4 dB PSNR. (This is why I open sourced bc7enc16.)

Here's the same graph but I've marked in red two key solutions. The left one is mode 1 only, and the upper right one is modes 1+6. There are several variants of each; I just chose the highest quality ones to mark. Mode 1+6 is at a perfect spot: very high quality and just at the BC7 "wall" where quality plateaus.

Mode 1 only is also at a great spot. It's fast and has excellent quality with no visible block artifacts (unlike exclusively mode 6 which will be full of them).


Here's a graph showing all solutions that use mode 6 in brown, with the rest in blue:


Finally, here's a graph generated using perceptual colorspace metrics. To speed up the test I reduced the test texture so it had 1/16th as many blocks (randomly chosen from the original test texture). One thing that appears in this graph is that, with perceptual metrics and Luma PSNR, mode 6 by itself is more effective than with RGB metrics.


Saturday, May 12, 2018

BC7 opaque encoding sweetspots

By running our non-RDO codec in an automated test thousands of times, I've identified 3-4 major encoding time vs. quality sweetspots. These regions are somewhat specific to our codec and how it's written and optimized, but I suspect rough regions like this will exist in all correctly written BC7 codecs (either CPU or GPU). Excluding ispc_texcomp, most codecs have key pbit handling flaws preventing them from exploiting some of these regions at all.

It's commonly believed that BC7 has this incredibly large search space (which it does) and is slow to encode, but only a few mode combinations and encoder search settings are actually valuable (i.e. highest quality for a given amount of encoding time). The valuable codec settings appear to fall into these four major regions (three if you ignore the last impractically slow region):

Note the max practical PSNR for this test corpus is ~51.39 dB RGB avg. PSNR. For reference, BC1 non-perceptual is around 10 dB lower than BC7 on average, so even real time BC7 is massively better than BC1's quality.
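
For readers who want to reproduce numbers like these, here is a minimal sketch of an RGB average PSNR calculation. It assumes 8-bit RGBA pixels and a single MSE taken over the R, G and B channels of the whole buffer; the function name `rgb_avg_psnr` is hypothetical, and the test tool used for these posts may average differently (for example per image).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: PSNR in dB over the R, G and B channels of two
// equally sized 8-bit RGBA buffers. Alpha is ignored.
double rgb_avg_psnr(const std::vector<uint8_t>& orig,
                    const std::vector<uint8_t>& decoded)
{
    assert(orig.size() == decoded.size());
    assert(!orig.empty() && orig.size() % 4 == 0);

    double sum_sq = 0.0;
    for (size_t i = 0; i < orig.size(); i += 4)
    {
        for (size_t c = 0; c < 3; ++c)
        {
            const double d = double(orig[i + c]) - double(decoded[i + c]);
            sum_sq += d * d;
        }
    }

    // Mean squared error per color sample (3 samples per pixel).
    const double mse = sum_sq / (double(orig.size() / 4) * 3.0);
    if (mse == 0.0)
        return std::numeric_limits<double>::infinity(); // identical buffers

    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```

Because PSNR is logarithmic in MSE, a codec that scores 10 dB lower has roughly 10x the mean squared error.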

Region 1: Real time


The first sweetspot is the "extremely fast" or real time region. Amazingly, it's possible to produce results in this region with minimal or no obvious block artifacts.

In this region, you only use mode 6 (which will have noticeable block artifacts), or you combine modes 0 and 6 but you limit the # of mode 0 partitions examined by your partition estimator to only a small amount, like 1-4. Only a single partition (the best one returned by the estimator) is evaluated.

You can increase the # of partitions examined in mode 0 to improve quality a little more, which will reduce or eliminate block artifacts. You must handle pbits correctly (as I've detailed in previous posts) to exploit this mode combination, otherwise mode 0 will be useless due to massive banding.

Encoding time: .5-3.5 secs, average quality: 48.2-49.09 dB RGB PSNR

Region 2: Fast


The second region uses modes 1, 5, 6, and possibly 3. The max # of partitions examined by the estimator ranges from 1-34, and the encoder uses the strongest pbit handling it supports (trying all combinations of pbits using correct rounding). The encoder can also try varying the selectors a bit to improve the endpoints. In mode 5 the encoder tries all component rotations. In this region there will be no visible block artifacts if the # of partitions examined in mode 1 is high enough (not sure of the threshold yet, probably at least 16).

Encoding time: 5.5-16.4 secs, average quality: 49.8-50.8 dB RGB PSNR

Region 3: Basic


Most of the third major region uses modes 1, 3 and 6. Mode 2 can be added for slightly more quality. The # of partitions examined by the estimator ranges from 42-55, and the # of partitions evaluated ranges from 1-9. This region uses exhaustive pbit evaluation with correct rounding, and the evaluator tries several different ways of varying the selectors. There are no visible block artifacts in this region.
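
The "examined" vs. "evaluated" distinction used in these regions can be sketched as a two-stage search. This is only an illustration under assumptions: `pick_partition`, `PartitionChoice` and the two cost callbacks are hypothetical names, and the real estimator and evaluator are not shown in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

struct PartitionChoice
{
    int partition;   // index of the winning partition pattern, -1 if none
    uint64_t error;  // error reported by the full evaluation
};

// Hypothetical two-stage search: a cheap estimator ranks the first
// max_examined partitions, then only the best max_evaluated of them get the
// expensive full encode. Both cost callbacks are supplied by the caller.
PartitionChoice pick_partition(int max_examined, int max_evaluated,
                               const std::function<uint64_t(int)>& estimate_cost,
                               const std::function<uint64_t(int)>& full_encode_error)
{
    assert(max_examined >= 1 && max_evaluated >= 1);

    // Stage 1: cheap estimate for every examined partition.
    std::vector<std::pair<uint64_t, int>> ranked;
    for (int p = 0; p < max_examined; ++p)
        ranked.emplace_back(estimate_cost(p), p);
    std::sort(ranked.begin(), ranked.end());

    // Stage 2: full evaluation of the top-ranked candidates only.
    PartitionChoice best{ -1, std::numeric_limits<uint64_t>::max() };
    const int n = std::min<int>(max_evaluated, int(ranked.size()));
    for (int i = 0; i < n; ++i)
    {
        const int p = ranked[i].second;
        const uint64_t err = full_encode_error(p);
        if (best.partition < 0 || err < best.error)
            best = PartitionChoice{ p, err };
    }
    return best;
}
```

With `max_evaluated` set to 1 the encoder trusts the estimator completely; raising it costs more full encodes but can recover from a mis-ranked estimate.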

Encoding time: 21-54.0 secs, average quality: 50.98-51.24 dB RGB PSNR

Region 4: Slow to Exhaustive


Beyond this you can enable more modes, more partitions, etc. to very slightly increase quality. I have breakdowns on some interesting configurations here, but the massive increase in encoding time just isn't worth the tiny imperceptible quality gains.

Encoding time: 132.8-401.97 secs, average quality: 51.36-51.39 dB RGB PSNR


Graphing our BC7 encoder's quality vs. encoding time for opaque textures

I encoded a 4k test texture (containing random blocks chosen from a few thousand input textures) to BC7 5000 times with our non-RDO encoder, using random codec settings for each trial encode. I recorded all the results to a CSV file.

The various stages of our codec can be configured in various ways. It's not immediately obvious or intuitive what settings are actually valuable, so we must run tests like this. Here are the various settings:

- Which modes to try
- Max # of partitions to examine for each mode (max of 16 or 64)
- Max # of partitions to evaluate after estimation, for each mode
- Various endpoint refinement settings ("uber" setting, iterative endpoint refinement trials)
- pbit refinement mode
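
A rough sketch of how one random trial configuration could be drawn from the settings listed above. Everything here is hypothetical: the struct, field names and value ranges are made up for illustration, modes 0-6 are assumed for the opaque test, and a single partition limit stands in for the per-mode limits the real codec has.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical settings struct. Field names and ranges are illustrative and
// do not mirror the real encoder's options.
struct TrialSettings
{
    uint32_t mode_mask;            // bit m set => BC7 mode m is tried
    int max_partitions_examined;   // partitions passed to the estimator
    int max_partitions_evaluated;  // partitions fully evaluated afterwards
    bool uber;                     // extra endpoint refinement on/off
    int refinement_trials;         // iterative endpoint refinement passes
    int pbit_mode;                 // 0 = cheapest ... 2 = exhaustive (made-up scale)
};

TrialSettings random_trial_settings(std::mt19937& rng)
{
    TrialSettings s;

    // Any non-empty subset of modes 0-6 (assumption: mode 7 is skipped
    // for opaque content).
    s.mode_mask = std::uniform_int_distribution<uint32_t>(1, 127)(rng);

    // One global limit for simplicity; the list above describes per-mode limits.
    s.max_partitions_examined = std::uniform_int_distribution<int>(1, 64)(rng);
    s.max_partitions_evaluated =
        std::uniform_int_distribution<int>(1, s.max_partitions_examined)(rng);

    s.uber = std::bernoulli_distribution(0.5)(rng);
    s.refinement_trials = std::uniform_int_distribution<int>(0, 4)(rng);
    s.pbit_mode = std::uniform_int_distribution<int>(0, 2)(rng);

    assert(s.max_partitions_evaluated <= s.max_partitions_examined);
    return s;
}
```

Each trial would then encode the test texture with the drawn settings and log the settings, the encode time and the PSNR as one CSV row.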

Here's the resulting time vs. quality (RGB avg. PSNR) graph:


I examined this graph to help come up with strong codec profiles. Some key takeaways after examining the Pareto frontier of this graph (the uppermost points on the convex hull from left to right):
  • Ironically BC7 mode 0 (the mostly ignored 3 subset mode) is at its highest value in some of our fastest codec settings. Our 2 fastest settings use just mode 6. After this, we add mode 0 but just examine the first 1 or 4 partitions in our partition estimator. This combo is strong! I intuitively knew that mode 0 had unique value if the pbits were handled correctly. 3 subsets with 3-bit indices is a powerful combination. (If our partition estimator were faster we could afford to check more partitions in this mode before another set of codec settings becomes better.)
  • Mode 0 is only valuable if you limit the # of partitions examined during estimation to the first 1 or 4 (and just evaluate the best). In this case, when combined with 6 it's a uniquely powerful combination. With every other practical set of encoder settings (anything below 219 secs), mode 0 is of no value. 
  • Mode 4 is totally useless for opaque textures (for all settings below 401 secs of compute). Note we currently always evaluate all the rotations and index settings for mode 4 when it's enabled. This is possibly a minor weakness in our encoder, so I'm going to fix this and regenerate the graph.
  • Mode 6 is always Pareto optimal, i.e. it's always enabled in every optimal setting of the codec across the entire frontier. It doesn't have subsets, but it does have large 777.1 endpoints and 4-bit selectors. Mode 1 is also very strong, and is used across the entire frontier beyond the very fastest settings.
  • There's a very steep quality barrier at around 25-30 secs of compute. Beyond this inflection point only minor quality gains (of around .2-.4 dB) can be made - but only with large increases in encoding CPU time. Every BC7 codec I've seen hits a wall like this, sooner or later.
  • The sweetspot for our codec, at the beginning of this steep wall, is around 21 seconds of compute (~42x slower than mode 6 only). This config uses modes 1,3,6, limits the max # of partitions examined by the estimator to 42 for 1/3, and only evaluates the single best partition for modes 1/3. This mode also enables a more exhaustive pbit refinement mode, and a deeper endpoint refinement mode. Note we use the same estimated partition in mode 1/3 (they're both 2 subset modes), which is probably why this combo stands out. 
  • I'm not getting as many data samples on the frontier as I would like. Most samples have no value. I either need a more efficient way of computing random parameters, or I need to just use more samples.
  • It's interesting to explore the non-Pareto optimal solutions. Mode 5 only with pbit searching is a really bad performer (fast but very low quality).
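
For anyone repeating this kind of experiment, a minimal sketch of pulling the frontier out of the recorded (time, quality) samples. It keeps every non-dominated sample, which is a slightly looser notion than the convex hull wording above and may include points that sit below the hull. `Sample` and `pareto_frontier` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Sample
{
    double secs;     // total encoding time for the trial
    double psnr_db;  // average quality for the trial
};

// Returns the samples for which no other sample is both faster (or equal)
// and higher quality, sorted by increasing time.
std::vector<Sample> pareto_frontier(std::vector<Sample> samples)
{
    std::sort(samples.begin(), samples.end(),
              [](const Sample& a, const Sample& b)
              {
                  if (a.secs != b.secs)
                      return a.secs < b.secs;
                  return a.psnr_db > b.psnr_db; // equal time: best quality first
              });

    std::vector<Sample> frontier;
    for (const Sample& s : samples)
    {
        assert(s.secs >= 0.0 && "encode time must be non-negative");

        // Walking left to right, a sample survives only if it beats the
        // quality of everything faster than it.
        if (frontier.empty() || s.psnr_db > frontier.back().psnr_db)
            frontier.push_back(s);
    }
    return frontier;
}
```

Samples that are dropped here are the "no value" trials: some faster configuration already matches or beats their quality.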

Friday, May 11, 2018

Basis non-RDO BC7 benchmark

This graph shows the performance of ispc_texcomp at each of its supported opaque profiles (from ultrafast to slow) vs. Basis's non-RDO BC7 encoder at various settings. I haven't decided on the settings to use for each of our profiles yet, which is why I'm generating graphs like this.


For reference, BC1 non-perceptual gets ~36.92 dB on this test set, so even just mode 6 BC7 (the first data points on the bottom left) is superior to BC1's quality.

Basis non-RDO BC7 also has an "ultra" profile which is ~.49 dB better than ispc_texcomp's slow profile (its highest quality mode), but you pay a steep price in encoding time (~953 secs vs. ~350 secs for ispc_texcomp's slow profile). In this mode Basis is even better than the reference BC7 encoder in NVidia's NVTT (but we're ridiculously faster):

Basis BC7 ultra:   953.2 secs   47.255552 dB
Basis BC7 slow:    365.5 secs   47.162389 dB
NVTT:              28061.9 secs 47.141580 dB
ispc_texcomp slow: 353.6 secs   46.769749 dB


The leftmost bottom samples are mode 6 only (ispc_texcomp's ultrafast profile). We are 19% faster at slightly higher quality here.

Basis non-RDO BC7 supports a powerful perceptual mode which I'll benchmark tomorrow. In this mode we kinda wreck ispc_texcomp at the same Luma PSNR (but to be fair ispc_texcomp doesn't support a perceptual mode at all, which is a serious deficiency).

BC7 mode 0-only encoding examples

BC7 mode 0 (3 subsets, 444.1 endpoints with a unique p-bit per endpoint, 3-bit indices, 16 partitions) is probably the most difficult mode to handle correctly. If you don't do the pbits right, the results look terrible (very banded). All BC7 encoders I've examined (from Intel, NVidia, Volition, and Microsoft) have weak mode 0 encoders (and most drop the ball with pbits).
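
To make the pbit issue concrete, here is a small sketch of quantizing a single mode 0 endpoint by trying both p-bit values and measuring the error against the actual decoded 8-bit values. It is a simplified illustration, not the encoder described here: a real encoder optimizes endpoints together with the selectors for a whole subset, and the names `expand_4_1`, `Endpoint441` and `quantize_endpoint_441` are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Expand 4 color bits plus a p-bit (as the LSB) to 8 bits by shifting the
// 5-bit value to the top and replicating its high bits into the low bits.
inline int expand_4_1(int q, int p)
{
    assert(q >= 0 && q < 16 && (p == 0 || p == 1));
    const int v = (q << 1) | p; // 5 significant bits
    return (v << 3) | (v >> 2);
}

struct Endpoint441
{
    int q[3];    // 4-bit R, G, B
    int pbit;    // p-bit shared by the three channels of this endpoint
    int sq_err;  // squared error vs. the desired 8-bit color
};

// Try both p-bits. For each one, pick the 4-bit value per channel whose
// decoded result is closest to the target, then keep the better p-bit.
Endpoint441 quantize_endpoint_441(const uint8_t rgb[3])
{
    Endpoint441 best{ {0, 0, 0}, 0, INT_MAX };
    for (int p = 0; p < 2; ++p)
    {
        Endpoint441 cur{ {0, 0, 0}, p, 0 };
        for (int c = 0; c < 3; ++c)
        {
            int best_q = 0, best_e = INT_MAX;
            for (int q = 0; q < 16; ++q)
            {
                const int d = expand_4_1(q, p) - int(rgb[c]);
                if (d * d < best_e) { best_e = d * d; best_q = q; }
            }
            cur.q[c] = best_q;
            cur.sq_err += best_e;
        }
        if (cur.sq_err < best.sq_err)
            best = cur;
    }
    return best;
}
```

Pure white shows why the p-bit matters: with p-bit 0 the largest decodable value is 247, so an encoder that always picks 0 (or rounds as if the p-bit were not there) can never reach 255.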

To put things into perspective, mode 0 is better than BC1 by approximately 5-6 dB RGB PSNR on average, when done correctly. All-mode BC7 is 10-12 dB better on average than BC1 for opaque textures.

Here are some example mode-0 only encodings created with the ispc vectorized non-RDO BC7 encoder in Basis. Notice there's no visible banding (there shouldn't be in a properly written encoder).











Tuesday, May 8, 2018

One last non-RDO BC7 benchmark: ispc_texcomp slow vs. my encoder in perceptual mode

"ISPC" is Intel's Fast ISPC Texture Compressor. (Both of our encoders use ispc.)

In perceptual mode, you basically trade off around 1 dB of RGB PSNR for a gain of 2.6 dB Luma PSNR, relative to ispc_texcomp. Our per-mode encoding time is actually slower in perceptual mode, but we don't use modes 4 and 5 in perceptual mode yet which helps compensate for the slowdown.
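
As a point of reference for the Luma PSNR column, here is a minimal sketch of one way to compute it. The Rec. 709 luma weights are an assumption (the post does not say which weights the test tool uses), and `luma709` and `luma_psnr` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Rec. 709 luma weights. This is an assumption made for the sketch.
inline double luma709(const uint8_t* px)
{
    return 0.2126 * px[0] + 0.7152 * px[1] + 0.0722 * px[2];
}

// PSNR in dB of the luma channel of two equally sized 8-bit RGBA buffers.
double luma_psnr(const std::vector<uint8_t>& orig,
                 const std::vector<uint8_t>& decoded)
{
    assert(orig.size() == decoded.size());
    assert(!orig.empty() && orig.size() % 4 == 0);

    double sum_sq = 0.0;
    for (size_t i = 0; i < orig.size(); i += 4)
    {
        const double d = luma709(&orig[i]) - luma709(&decoded[i]);
        sum_sq += d * d;
    }

    const double mse = sum_sq / double(orig.size() / 4);
    if (mse == 0.0)
        return std::numeric_limits<double>::infinity(); // identical luma

    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```

Under these weights an error in green costs about ten times as much as the same error in blue, which is the kind of imbalance a perceptual encoder can exploit when it trades RGB PSNR for Luma PSNR.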

Perceptual mode:
        Total encode CPU time (secs)   Avg. RGB PSNR (dB)   Avg. Luma PSNR (dB)
ISPC:   353.245527                     46.769749            48.568988
Ours:   216.838825                     45.782654            51.185091

Mode histograms (blocks per mode, modes 0-7):
        Mode 0   Mode 1   Mode 2   Mode 3   Mode 4   Mode 5   Mode 6    Mode 7
ISPC:   367473   370942   26227    633692   26789    116571   318478    0
Ours:   47882    409997   18025    185524   0        0        1198744   0

RGB mode:
        Total encode CPU time (secs)   Avg. RGB PSNR (dB)   Avg. Luma PSNR (dB)
ISPC:   352.133776                     46.769749            48.568988
Ours:   227.692341                     47.029635            48.903907

Mode histograms (blocks per mode, modes 0-7):
        Mode 0   Mode 1   Mode 2   Mode 3   Mode 4   Mode 5   Mode 6    Mode 7
ISPC:   367473   370942   26227    633692   26789    116571   318478    0
Ours:   33264    411398   21192    186552   25437    68297    1114032   0

I'm writing some APIs to expose this encoder in the Basis DLL/SO/dylib. It'll be exposed just like Intel's encoder (you call it with an array of blocks and you handle the multithreading).

Friday, April 27, 2018

BC7 showdown #2: Basis ispc vs. NVidia Texture Tools vs. ispc_texcomp slow

Got my BC7 encoder test app linking with NVTT. The BC7 encoder in NVTT is the granddaddy of them all, from what I understand. It's painfully slow but very high quality. I called it using the blob of code below. It supports weighted metrics, which is great.

Anyhow, here are the test results using linear RGB metrics (non-perceptual), comparing NVTT and ispc_texcomp vs. Basis non-RDO BC7 ispc. The test corpus was kodim01-24 plus a number of other images/textures I commonly test with. I turned up the Basis BC7 encoder options to the highest currently supported.

Basis BC7:         365.5 secs   47.162389 dB
NVTT:              28061.9 secs 47.141580 dB
ispc_texcomp slow: 353.6 secs   46.769749 dB

This was a multithreaded test (using OpenMP) on a dual Xeon workstation supporting AVX.

Here's the code snippet for calling NVTT's AVPCL encoder directly to pack BC7 blocks (bypassing the rest of NVTT because I don't want to pack entire textures, just blocks):

#include <cstdint>  // uint8_t
#include <cstring>  // memset
#include "../3rdparty/nvidia-texture-tools/src/bc7/avpcl.h"

// Packs one 4x4 block of RGBA pixels (64 bytes, row-major) into a 16-byte
// BC7 block by calling NVTT's AVPCL encoder directly.
void nvtt_bc7_compress(void *pBlock, const uint8_t *pPixels, bool perceptual)
{
    // Global AVPCL options.
    AVPCL::mode_rgb = false;
    AVPCL::flag_premult = false; //(alphaMode == AlphaMode_Premultiplied);
    AVPCL::flag_nonuniform = false;
    AVPCL::flag_nonuniform_ati = perceptual;

    // Fill AVPCL's tile struct from the raw RGBA pixels.
    AVPCL::Tile avpclTile(4, 4);
    memset(avpclTile.data, 0, sizeof(avpclTile.data));

    for (uint y = 0; y < 4; ++y)
    {
        for (uint x = 0; x < 4; ++x)
        {
            nv::Vector4 &p = avpclTile.data[y][x];

            p.x = pPixels[0];
            p.y = pPixels[1];
            p.z = pPixels[2];
            p.w = pPixels[3];

            pPixels += 4;

            // Uniform per-pixel weight.
            avpclTile.importance_map[y][x] = 1.0f; //weights[4*y+x];
        }
    }

    AVPCL::compress(avpclTile, (char *)pBlock);
}