Friday, April 20, 2018

ispc_texcomp BC7 issues

Been studying ispc_texcomp today to better understand why it's so slow compared to my encoder (currently by a factor of 2x at max quality). We do many of the same things, so why is it slower? Overall, there are many clever/smart things in there, but it's being held back by weak vectorization and some missing optimizations. Here's what I've found so far:

- The inner loops are bogged down with gathers and scatters. Definitely not good. The author even went so far as to wrap them in helper functions with a comment of "(perf warning expected)". (Umm - the compiler perf warnings are there for a reason!) For an example, check out block_quant().

The inner loops should not have gathers, period.

- The partition estimation code is using full PCA. I've found this to be unnecessary in my testing - just using Waveren's bounding box approximation works well and is trivially vectorizable. After all, we're not trying to compute the actual output, just a reasonable approximation.

So ispc_texcomp goes into overkill mode and computes PCA while estimating the partition. At least the way it computes each subset's PCA is smart: it first computes the overall block's statistics/covariance, then it subtracts out the statistics of each partition's active (masked) pixels to compute each subset's individual covar.

Also, it's only computing what looks like an upper bound on the error from the block statistics, not an approximation of the actual error. The approximation of the actual error (factoring in quantization to 4/8/16 selectors) is extremely fast to compute with SIMD code.

Overall, the author seems to be favoring cleverness vs. exploiting the properties of fast but simple SIMD code.

- It uses squish-style iterative refinement after choosing the partition: Basically, it computes the PCA, comes up with some initial selectors, uses least squares to optimize the endpoints, then it computes new selectors and tries all over again a few times. In my experience, the PSNR gain from this method is too low (fraction of a dB) to justify the repeated LS computation and selector selection costs. Instead, it can be far more valuable to just vary the selectors in simple ways (simplified cluster fit) in each trial.

- There's no support for perceptual colorspace metrics in there. This indirectly impacts performance (against other codecs that do support perceptual metrics) because it's stuck competing against RGB PSNR, and getting RGB PSNR up in BC7 is VERY computational intensive. You basically hit a steep quality wall, then it takes massively more compute to get it up above that wall even by a fraction of a dB.

If it supported perceptual metrics (where error in R,G,B is allowed to become a little unbalanced by approx. .25 - 1.5 dB, favoring G and R) it wouldn't have to try as hard because it would gain ~1.5 dB or more instantly before hitting the wall.

- First, the good news: The selector quantizer (see block_quant()) is using a clever algorithm: It dots the desired color by the subset's axis, converts that to a scaled int by rounding, clamps that to between [1,num_selectors-1], then it computes full squared euclidean error between the desired color and the subset's interpolated colors (s-1) and s. It only has to compute the full distance to 2 colors vs. all of them, which is cool.

I've compared this method vs. full distance to all colors and the results are super close (~1/1000th of a dB).

Now the bad news: The implementation is just bad. First, it recomputes the subset axis for every pixel (even though there are only 2 or 3 of them in BC7). And it uses multiple gather's to fetch the endpoints! This is done for all 16 pixels in the block - ouch! There's also a per-pixel divide in there.

Also, with good SIMD computing full distance to all subset colors isn't that expensive, at least for 4 and maybe 8 color blocks. I've implemented optimized forms of full search vs. ispc_texcomp's method. At least with AVX, all the fetches into the weighted_colors[] array (one for each lane) just slow the method down. Brute force leads to simpler code once vectorized and seems to slightly win out overall for 4 and 8 color blocks. With 16 color blocks the smarter method wins.

- After iterative refinement it doesn't have any more ways of improving quality. Trying to vary the selectors in keys ways (say by incrementing the lowest values and decrementing the highest values - to exploit extrapolation) and then LS optimizing the results helps a lot (.3-.5 dB) and is very fast if you SIMD optimize the trial solution evaluator function, yet it doesn't do that.

- Its mode 0 encoder suffers from a lot of quantization error - which is indicative of some weaknesses in its endpoint selection:

ispc_texcomp mode 0 only:

My encoder mode 0 only (no dithering - just stronger endpoint selection):

- ispc_texcomp is weak with grayscale images, by around .6-1.2 dB in my testing. Granted, once you're over ~60dB it doesn't matter much.

The "slow" profile is solidly in the quality "wall" region I described earlier. The basic and faster profiles are in much healthier regions.

A few Intel SPMD Compiler (ispc) C porting tips

I took notes as I was porting my new BC7 encoder from C to ispc. First, be sure to read and re-read the user guideperformance guide, and FAQ. This compiler tech kicks ass and I hope Intel keeps throwing resources at it. My initial port of 3k lines of C and initial stabs at vectorization was only ~2x faster, but after tuning the inner loops perf. shot up to over 5x vs. regular C code (on AVX). All without having to use a single ugly intrinsic instruction.

I'm new to ispc so hopefully I haven't made any mistakes below, but here's what I learned during the process:

If you're not starting from scratch, port your code to plain C with minimal dependencies and test that first. Make some simple helper functions like clamp(), min(), etc. that look like the ispc standard lib's, so when you do get to ispc you can easily switch them over to use the stdlib's (which is very important for performance).

Then port your C code to ispc, but put "uniform" on ALL variables and pointers. Test that and make sure it still works. In my experience so far you should have minimal problems at this stage assuming you put uniforms everywhere. Now you can dive into vectorizing the thing. I would first figure out how things should be laid out in memory and go from there. You may be able to just vectorize the hotspots, or you may have to vectorize the entire thing (like I did which was hours of messing around with uniform/varying keywords).

The mental model is like shaders but for the CPU. Conceptually, the entire program gang executes each instruction, but the results can be masked off on a per-lane basis. If you are comfortable with shaders you will get this model immediately. Just beware there's a lot of variability in the CPU cost of operations, and optimal code sequences can be dramatically faster than slower ones. Study the generated assembly of your hotspots in the debugger and experiment. CPU SIMD instruction sets seem more brittle than ones for GPU's (why?).

A single pointer deref can hide a super expensive gather or scatter. Don't ignore the compiler warnings. These warnings are critical and can help you understand what the compiler is actually doing with your code. Examine every gather and scatter and understand why the compiler is doing them. If these operations are in your hotspots/inner loops then rethink how your data is stored in memory. (I can't emphasize this enough - scatters and gathers kill perf. unless you are lucky enough to have a Xeon Phi.)

varying and uniform take on godlike properties in ispc. You must master them. A "varying struct" means the struct is internally expanded to contain X values for each member (one each for the size of the gang). sizeof(uniform struct) != sizeof(varying struct). While porting I had to check, recheck, and check again all uniform and varying keywords everywhere in my code.

You need to master pointers with ispc, which are definitely tricky at first. The pointee is uniform by default, but the pointer itself is varying by default which isn't always what you want. "varying struct *uniform ptr" is a uniform pointer to a varying struct (read it right to left). In most cases, I wanted varying struct's and uniform pointers to them.

Find all memset/memmove/memcpy's and examine them extremely closely. In many cases, they won't work as expected after vectorization. Check all sizeof()'s too. The compiler won't always give you warnings when you do something obviously dumb. In most cases I just replaced them with hand-rolled loops to copy/initialize the values, because once you switch to varying types all bets are off if a memset() will do the expected thing.

Sometimes, code sequences in vectorized code just don't work right. I had some code that inserted an element into a sorted list, that wouldn't work right until I rearranged it. Maybe it was something silly I did, but it pays to litter your code with assert()'s until you get things working.

assert()'s aren't automatically disabled in release, you must use "--opt=disable-assertions" to turn them off. assert()'s in vectorized code can be quite slow. The compiler should probably warn you about assert()'s when optimizations are enabled.

print("%", var); is how you print things (not "%u" or "%f" etc.). Double parentheses around the value means the lane was masked out. If using Visual Studio I wouldn't fully trust the locals window when debugging - use print().

Once you start vectorizing, either the compiler is going to crash, or the compiler is going to generate function prologs that immediately crash. Both events are unfortunately going to happen until it's more mature. For the func. prolog crashes, in most if not all cases this was due to a mismatch between the varying/uniform attributes of the passed in pointers to functions that didn't cause compiler errors or warnings. Check and double check your varying and uniform attributes on your pointers. Fix your function parameters until the crash goes away. These were quite painful early on. To help track them down, #if 0 out large sections of code until it works, then slowly bring code in until it fails.

The latest version of ispc (1.9.2) supports limited debugging with Visual Studio. Examining struct's with bool's doesn't seem to work, the locals window is very iffy but more or less works. Single stepping works. Profiling works but seems a little iffy.

If you start to really fight the compiler on a store somewhere, you've probably got something wrong with your varying/uniform keywords. Rethink your data and how your code manipulates it.

If you're just starting a port and are new to ispc, and you wind up with a "varying varying" pointer then it's ok to be paranoid. It's probably not really what you want.

Experienced obvious codegen issues with uniform shifts and logical or's of uint16 values. Once I casted them to uint32's the problems went away. Be wary of integer shifts, which I had issues with in a few spots.

Some very general/hand-wavy recommendations with vectorized code: Prefer SP math over DP. Prefer FP math over integer math. Prefer 32-bit integer math over 64-bit. Prefer signed vs. unsigned integers. Prefer FP math vs. looking stuff up from tables if using the tables requires gathering. Avoid uint64's. Prefer 32-bit int math intermediates vs. 8-bit. Prefer simpler algorithms that load from constant array entries in a table (so all lanes lookup at the same location in the table), vs. more complex algorithms that require table lookups with unique per-lane indices.

Study stdlib.ispc. Prefer stdlib's clamp() vs. custom functions, and prefer stdlib vs. your own stuff for min, max, etc. The compiler apparently will not divine that what you are doing is just a clamp, you should use the stdlib functions to get good SIMD code.

Use uniform's as much as you possibly can. Prefer to make loop iterators uniform by default. Make loop iterators uniform by default when you start iterating at 0, even if the high loop limit is varying.

Use cif() etc. on conditionals which will strongly be taken or not taken by the entire gang. Compilation can get noticeably slower as you switch to cif().

A few min's or max's and some boolean/bit twiddling ops can be much faster than the equivalent multiple if() statements. Study the SSE2 etc. instruction sets because there are some powerful things in there.

Things that usually make perfect sense in CPU code, like early outs, may actually just hurt you with SIMD code. If your early out checks have to check all lanes, and it's an uncommon early out, consider just removing or rethinking them.

Tuesday, April 17, 2018

BC7 encoding using weighted YCbCr colorspace metrics

I've written my second BC7 block encoder. My first was written in a straightforward way to gain experience with the format. My second was more focused on competing against the Fast ISPC Texture Compressor, but without using any SIMD, and was over 30x faster than my first attempt.

The BC7 encoders I've studied seem to be hyper focused on RGB PSNR metrics, which is just the wrong metric for many types of textures. Encoding authors that treat input textures as opaque arrays of 4x4 vectors are at a disadvantage in this domain. RGB PSNR tends to spread the error equally between the channels, which isn't what we want on sRGB textures. Instead, it's desirable to tradeoff a small amount of additional R/B error for less G error. This is what perceptual codecs like JPEG do: they transform the input into YCbCr space, then downsample and quantize the hell out of the CbCr coefficients because preserving chroma is a waste of bits.

Many other BC1 block compression codecs support weighted RGB metrics because in BC1 not doing so visually looks worse on sRGB photos/albedo textures/etc. Encoders using perceptual metrics look better on color gradients and with highly saturated blocks. Heavy usage of perceptual metrics dates back to at least NVidia's original nvdxt compressor, and it wasn't possible for crunch to compete against nvdxt without supporting perceptual metrics. The squish library recommends using perceptual metrics by default, because BC1 without perceptual metrics looks worse.

Anyhow, etc2comp by John Brooks takes things a step further and supports computing error metrics in weighted YCbCr space. Compared to vanilla RGB weighted metrics, this looks better in my experience writing Basis (especially with ETC1). I'm currently using weights (128,64,16).

Here's the REC 709 luma PSNR of 31 test textures encoded with ispc_texcomp (slow/highest quality - uses 7 modes) and my non-SIMD encoder in perceptual mode using just 4 modes:

The overall average PSNR for ispc_texcomp was 48.57, mine was 50.4. Even with ispc_texcomp's massive mode and SIMD advantages it does worse on this metric. ispc_texcomp doesn't support optimizing for perceptual metrics, which puts it at a huge disadvantage on many texture types.

I re-encoded the textures with linear metrics. My encoder used 6 modes: 0, 1, 3, 4, 5, and 6 (including all component rotations and the index flag).

ispc_texcomp's average PSNR was 46.77, mine was 46.50. My encoder can easily bridge this ~.25 dB gap (by using more modes and trying more partitions), but at a time penalty.

Note that ispc_texcomp in its best/slowest profile is pretty slow, and is much easier to compete against without SIMD code. It's just trying way too hard. It's faster in its lower quality "basic" profile, but it still doesn't support perceptual metrics so it'll continue to fight up a very steep hill.

For benchmarking, I ran each encoder in a single thread, and called ispc_texcomp with 64 blocks at a time.

Other findings: ispc_texcomp has a very weak mode 0 encoder, and it's weaker than it should be on grayscale textures. I'll blog examples soon.

Wednesday, April 4, 2018

Imaginary GPU formats

Every once in a while I wonder about alternative GPU texture format encodings. (Why not? It's fun.) There must be a sweet spot somewhere along the continuum between BC1 and BC7. Something that is more complex than BC1 but simpler than BC7. (I somewhat dislike ASTC, mostly because of its insanely complex encoding format.)

Here's one idea for an 128-bit per 4x4 block format (8 bits/texel) that mashes together ETC1+BC7. One thing I learned from ETC1 is that a lot of bits can be saved by forcing each subset's principle axis to always lie along the intensity direction. With a strong encoder, this constraint isn't as bad as one would think.

The format only has two modes: opaque and transparent. The opaque mode has 3 subsets, and the transparent mode has 2 subsets for RGB and 1 subset for alpha. Each color has 1 shared pbit, and each mode has 16 partitions for colors.

The color encoding is "RGB PBit IntensityTable". The intensity tables could be borrowed from ETC1 and expanded to 8 entries. For the transparent blocks, two 8-bit alpha values are specified (like BC4), and by borrowing degeneracy breaking from BC7 we can shave one bit from the alpha selectors. "CompRot" is a BC7-style component rotation, so any of the channels can be encoded into alpha.

Some things I like about this format: equal precision for all components, and there are only two simple modes. The opaque mode is powerful but simple: always 3 subsets, with color and selector precision better than BC1 and even better than BC7's 3 subset modes. The transparent mode is more powerful than BC3 for RGB (better color precision, and 2 subsets), but weaker for alpha (2 bit selectors vs. 3).

The main downside is that each subset's endpoints are constrained to lie along the intensity axis. I've seen commercial games ship with normal maps encoded into ETC1 and DXT1 so I know this isn't a total deal breaker.

Opaque block:
ModeBit    1 
Partition  4
Color0     777 1 3 
Color1     777 1 3 
Color2     777 1 3

Color selectors;
3 3 3 3
3 3 3 3
3 3 3 3
3 3 3 3

Total bits: 128

Transparent block:
ModeBit    1 
Partition  4
Color0     666 3 
Color1     666 3 
AlphaLoHi  8 8
CompRot    2

Color selectors:
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2

Alpha selectors:
1 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2

Total bits: 128

A strong encoder would adaptively choose between opaque blocks and transparent blocks using various component rotations, to minimize overall error. Transparent blocks can be used even on all-opaque textures.

I have no idea if this format is useful. On a rainy day I'll make a simple encoder and compare it against BC1 and BC7.

Tuesday, April 3, 2018

Basis feature support

Here's what we support right now:
  • .basis universal format, which is transcodable to BC1-5, ETC1, PVRTC1 4bpp (currently opaque only), and BC7 (currently opaque only). Alpha support for PVRTC1/BC7 is coming, and enhanced quality for BC7 and ETC2 are on the way. This is a universal solution with two quality modes (baseline and BC7), so by its very nature it trade offs max achievable quality for GPU format support.
    • For really small images (think icon-size), .basis can switch to using fixed selector codebooks to cut down on selector codebook overhead
    • Format supports arbitrary resolution texture arrays, all referring to a single set of compressed codebooks.
  • RDO BC1-5 - Creates more compressible files than crunch's (RDO in crunch was an afterthought and was pretty dumb/low quality), but slower compression. We put the most effort into optimizing BC1's output for LZ coding. Supports up to 32K entry codebooks (vs. crunch's 8K).
  • RDO ETC1 - Supports all ETC1 features, and very high quality levels (up to 32K entry codebooks), usable even on complex normal maps.
  • ETC1 intermediate format (supports all features of the ETC1 format, i.e. flips and both differential and individual colors). 10-20% smaller files at same SSIM vs. Unity crunch's the last I checked.
All of these codecs have been utilized by customers for different purposes.

We don't support an intermediate file format exclusively for BC1-5, only ETC1. Instead, we're focusing on universal solutions first, and then we'll focus on an intermediate format solution for BC7, BC6H, and ASTC.

I get asked all the time how these solutions compare to crunch's. I'll be working on extensive benchmarks soon. I've learned a lot since I designed and wrote crunch in 2009.

Sunday, April 1, 2018

Basis GPU format support update

Our goal is to support all the GPU formats (literally). Here's an update on our format support:

We just added PVRTC1 4bpp and BC7 support. PVRTC1 quality is approximately equal to PVRTexTool's middle setting ("good"), and significantly better than its lower two settings. Max quality in BC7 mode is currently limited to BC1/ETC1-grade quality levels (what we're calling "baseline" quality).

We've devised several ways of improving the max quality to near-BC7 grade by storing extra data in the .basis file. (You can't get something for nothing!) This high quality data would be optional, so users that don't care about super high quality levels can disable it and the codec will just transcode the baseline data to BC7 instead.

Here's what we support transcoding .basis into right now, in order of transcoding speed from fastest to slowest:
  • ETC1 
  • BC1 
  • BC3-5 
  • BC7: RGB 
  • PVRTC1 4bpp RGB 
Here are the formats we're going to eventually support in order of importance (with no changes to the .basis format needed):
  • PVRTC1 4bpp RGBA 
  • ETC2 RGBA 
  • BC7: RGBA 
  • PVRTC1 2bpp RGB/RGBA 
None of these formats require raw RGB/RGBA pixel processing during transcoding, i.e. we aren't just using real-time GPU format compressors here. Transcoding occurs at the level of GPU blocks, endpoints, and selector/modulation values.

At some point, we're going to boost quality above baseline, to better exploit BC7/ASTC. Most of our early users of this tech (which aren't native game apps) are happy with baseline quality, so the priority of doing this is relatively low. (Games will probably want BC7/ASTC specific codecs anyway.) We are designing the .basis format with this eventual goal, so when we add "enhanced quality" support we won't break compatibility with older baseline-only transcoders.

We'll be posting benchmarks comparing .basis to crunch (and Unity's crunch) and releasing WebAssembly (or asm.js) demos within the upcoming weeks.

Friday, March 30, 2018

Basis update - now with PVRTC support!

Basis (our new GPU texture compression product and the successor to our popular open source crunch lib) now supports PVRTC1, along with ETC1 and BC1-5 (DXTc). This means a .basis file can be utilized on pretty much every GPU in the universe that matters, independent of platform or API. A .basis file is conceptually like JPEG but for GPU texture data, and can be used on the web (using Emscripten and WebGL) or by native apps (using a small C++ transcoder library).

All textures are 1024x1024 (due to PVRTC1 limitations). Click on each one to see them at full-res (they are reduced in size on the page itself).

Each image below was transcoded directly to each GPU format from the .basis file, and then converted to 24bpp .PNG. On my desktop, ETC1 is fastest (~3ms), followed by BC1/4 (~7.9ms), then PVRTC (~37ms). The transcoders (particularly PVRTC) are not yet fully optimized, and are written in straightforward C++. (Update: PVRTC transcode at 1024x1024 is now ~15.4ms, without any SIMD or threading yet.)

The PVRTC transcoder really needs SIMD optimizations, which should give it a nice speed boost (probably around 2-3x). It would be trivial to thread the PVRTC transcoder too. The PVRTC's transcoder's quality is visually somewhere in between PVRTexTool's "Lower Quality" and "Good" settings. In many cases, it looks a little better than "Good", but it's a tossup.

Note that BC3 and BC5 formats are supported by calling the transcoder twice from different input image slices. So a RGBA GPU texture is encoded into two slices (sharing the same codebooks) in a single .basis file, and it transcodes to either two ETC1 textures, a ETC1 texture twice as high, or a single BC5 texture. PVRTC2 and ETC2 support will be very easy and transcode times will be comparable to ETC1 or BC1 (PVRTC1 will always be the most expensive). The PVRTC transcoder doesn't support alpha yet (it's next).

Image: laststarfighter_1024.basis, 133966 bytes, 1.022 bits/pixel






Image: map_1024.basis, 180603 bytes, 1.38 bits/pixel






Image: delorean_1024.png, 138894 bytes, 1.06 bits/pixel