Friday, April 20, 2018

ispc_texcomp BC7 issues

Been studying ispc_texcomp today to better understand why it's so slow compared to my encoder (currently by a factor of 2x at max quality). We do many of the same things, so why is it slower? Overall, there are many clever/smart things in there, but it's being held back by weak vectorization and some missing optimizations. Here's what I've found so far:

- The inner loops are bogged down with gathers and scatters. Definitely not good. The author even went so far as to wrap them in helper functions with a comment of "(perf warning expected)". (Umm - the compiler perf warnings are there for a reason!) For an example, check out block_quant().

The inner loops should not have gathers, period.

- The partition estimation code is using full PCA. I've found this to be unnecessary in my testing - just using Waveren's bounding box approximation works well and is trivially vectorizable. After all, we're not trying to compute the actual output, just a reasonable approximation.

So ispc_texcomp goes into overkill mode and computes PCA while estimating the partition. At least the way it computes each subset's PCA is smart: it first computes the overall block's statistics/covariance, then it subtracts out the statistics of each partition's active (masked) pixels to compute each subset's individual covar.

Also, it's only computing what looks like an upper bound on the error from the block statistics, not an approximation of the actual error. The approximation of the actual error (factoring in quantization to 4/8/16 selectors) is extremely fast to compute with SIMD code.

Overall, the author seems to be favoring cleverness vs. exploiting the properties of fast but simple SIMD code.

- It uses squish-style iterative refinement after choosing the partition: Basically, it computes the PCA, comes up with some initial selectors, uses least squares to optimize the endpoints, then it computes new selectors and tries all over again a few times. In my experience, the PSNR gain from this method is too low (fraction of a dB) to justify the repeated LS computation and selector selection costs. Instead, it can be far more valuable to just vary the selectors in simple ways (simplified cluster fit) in each trial.

- There's no support for perceptual colorspace metrics in there. This indirectly impacts performance (against other codecs that do support perceptual metrics) because it's stuck competing against RGB PSNR, and getting RGB PSNR up in BC7 is VERY computational intensive. You basically hit a steep quality wall, then it takes massively more compute to get it up above that wall even by a fraction of a dB.

If it supported perceptual metrics (where error in R,G,B is allowed to become a little unbalanced by approx. .25 - 1.5 dB, favoring G and R) it wouldn't have to try as hard because it would gain ~1.5 dB or more instantly before hitting the wall.

- First, the good news: The selector quantizer (see block_quant()) is using a clever algorithm: It dots the desired color by the subset's axis, converts that to a scaled int by rounding, clamps that to between [1,num_selectors-1], then it computes full squared euclidean error between the desired color and the subset's interpolated colors (s-1) and s. It only has to compute the full distance to 2 colors vs. all of them, which is cool.

I've compared this method vs. full distance to all colors and the results are super close (~1/1000th of a dB).

Now the bad news: The implementation is just bad. First, it recomputes the subset axis for every pixel (even though there are only 2 or 3 of them in BC7). And it uses multiple gather's to fetch the endpoints! This is done for all 16 pixels in the block - ouch! There's also a per-pixel divide in there.

Also, with good SIMD computing full distance to all subset colors isn't that expensive, at least for 4 and maybe 8 color blocks. I've implemented optimized forms of full search vs. ispc_texcomp's method. At least with AVX, all the fetches into the weighted_colors[] array (one for each lane) just slow the method down. Brute force leads to simpler code once vectorized and seems to slightly win out overall.

- After iterative refinement it doesn't have any more ways of improving quality. Trying to vary the selectors in keys ways (say by incrementing the lowest values and decrementing the highest values - to exploit extrapolation) and then LS optimizing the results helps a lot (.3-.5 dB) and is very fast if you SIMD optimize the trial solution evaluator function, yet it doesn't do that.

- Its mode 0 encoder suffers from a lot of quantization error - which is indicative of some weaknesses in its endpoint selection:

ispc_texcomp mode 0 only:

My encoder mode 0 only (no dithering - just stronger endpoint selection):

- ispc_texcomp is weak with grayscale images, by around .6-1.2 dB in my testing. Granted, once you're over ~60dB it doesn't matter much.

The "slow" profile is solidly in the quality "wall" region I described earlier. The basic and faster profiles are in much healthier regions.

A few Intel SPMD Compiler (ispc) C porting tips

I took notes as I was porting my new BC7 encoder from C to ispc. First, be sure to read and re-read the user guideperformance guide, and FAQ. This compiler tech kicks ass and I hope Intel keeps throwing resources at it. My initial port of 3k lines of C and initial stabs at vectorization was only ~2x faster, but after tuning the inner loops perf. shot up to over 5x vs. regular C code (on AVX). All without having to use a single ugly intrinsic instruction.

I'm new to ispc so hopefully I haven't made any mistakes below, but here's what I learned during the process:

If you're not starting from scratch, port your code to plain C with minimal dependencies and test that first. Make some simple helper functions like clamp(), min(), etc. that look like the ispc standard lib's, so when you do get to ispc you can easily switch them over to use the stdlib's (which is very important for performance).

Then port your C code to ispc, but put "uniform" on ALL variables and pointers. Test that and make sure it still works. In my experience so far you should have minimal problems at this stage assuming you put uniforms everywhere. Now you can dive into vectorizing the thing. I would first figure out how things should be laid out in memory and go from there. You may be able to just vectorize the hotspots, or you may have to vectorize the entire thing (like I did which was hours of messing around with uniform/varying keywords).

The mental model is like shaders but for the CPU. Conceptually, the entire program gang executes each instruction, but the results can be masked off on a per-lane basis. If you are comfortable with shaders you will get this model immediately. Just beware there's a lot of variability in the CPU cost of operations, and optimal code sequences can be dramatically faster than slower ones. Study the generated assembly of your hotspots in the debugger and experiment. CPU SIMD instruction sets seem more brittle than ones for GPU's (why?).

A single pointer deref can hide a super expensive gather or scatter. Don't ignore the compiler warnings. These warnings are critical and can help you understand what the compiler is actually doing with your code. Examine every gather and scatter and understand why the compiler is doing them. If these operations are in your hotspots/inner loops then rethink how your data is stored in memory. (I can't emphasize this enough - scatters and gathers kill perf. unless you are lucky enough to have a Xeon Phi.)

varying and uniform take on godlike properties in ispc. You must master them. A "varying struct" means the struct is internally expanded to contain X values for each member (one each for the size of the gang). sizeof(uniform struct) != sizeof(varying struct). While porting I had to check, recheck, and check again all uniform and varying keywords everywhere in my code.

You need to master pointers with ispc, which are definitely tricky at first. The pointee is uniform by default, but the pointer itself is varying by default which isn't always what you want. "varying struct *uniform ptr" is a uniform pointer to a varying struct (read it right to left). In most cases, I wanted varying struct's and uniform pointers to them.

Find all memset/memmove/memcpy's and examine them extremely closely. In many cases, they won't work as expected after vectorization. Check all sizeof()'s too. The compiler won't always give you warnings when you do something obviously dumb. In most cases I just replaced them with hand-rolled loops to copy/initialize the values, because once you switch to varying types all bets are off if a memset() will do the expected thing.

Sometimes, code sequences in vectorized code just don't work right. I had some code that inserted an element into a sorted list, that wouldn't work right until I rearranged it. Maybe it was something silly I did, but it pays to litter your code with assert()'s until you get things working.

assert()'s aren't automatically disabled in release, you must use "--opt=disable-assertions" to turn them off. assert()'s in vectorized code can be quite slow. The compiler should probably warn you about assert()'s when optimizations are enabled.

print("%", var); is how you print things (not "%u" or "%f" etc.). Double parentheses around the value means the lane was masked out. If using Visual Studio I wouldn't fully trust the locals window when debugging - use print().

Once you start vectorizing, either the compiler is going to crash, or the compiler is going to generate function prologs that immediately crash. Both events are unfortunately going to happen until it's more mature. For the func. prolog crashes, in most if not all cases this was due to a mismatch between the varying/uniform attributes of the passed in pointers to functions that didn't cause compiler errors or warnings. Check and double check your varying and uniform attributes on your pointers. Fix your function parameters until the crash goes away. These were quite painful early on. To help track them down, #if 0 out large sections of code until it works, then slowly bring code in until it fails.

The latest version of ispc (1.9.2) supports limited debugging with Visual Studio. Examining struct's with bool's doesn't seem to work, the locals window is very iffy but more or less works. Single stepping works. Profiling works but seems a little iffy.

If you start to really fight the compiler on a store somewhere, you've probably got something wrong with your varying/uniform keywords. Rethink your data and how your code manipulates it.

If you're just starting a port and are new to ispc, and you wind up with a "varying varying" pointer then it's ok to be paranoid. It's probably not really what you want.

Experienced obvious codegen issues with uniform shifts and logical or's of uint16 values. Once I casted them to uint32's the problems went away. Be wary of integer shifts, which I had issues with in a few spots.

Some very general/hand-wavy recommendations with vectorized code: Prefer SP math over DP. Prefer FP math over integer math. Prefer 32-bit integer math over 64-bit. Prefer signed vs. unsigned integers. Prefer FP math vs. looking stuff up from tables if using the tables requires gathering. Avoid uint64's. Prefer 32-bit int math intermediates vs. 8-bit.

Study stdlib.ispc. Prefer stdlib's clamp() vs. custom functions, and prefer stdlib vs. your own stuff for min, max, etc. The compiler apparently will not divine that what you are doing is just a clamp, you should use the stdlib functions to get good SIMD code.

Use uniform's as much as you possibly can. Prefer to make loop iterators uniform by default. Make loop iterators uniform by default when you start iterating at 0, even if the high loop limit is varying.

Use cif() etc. on conditionals which will strongly be taken or not taken by the entire gang. Compilation can get noticeably slower as you switch to cif().

A few min's or max's and some boolean/bit twiddling ops can be much faster than the equivalent multiple if() statements. Study the SSE2 etc. instruction sets because there are some powerful things in there.

Things that usually make perfect sense in CPU code, like early outs, may actually just hurt you with SIMD code. If your early out checks have to check all lanes, and it's an uncommon early out, consider just removing or rethinking them.

Tuesday, April 17, 2018

BC7 encoding using weighted YCbCr colorspace metrics

I've written my second BC7 block encoder. My first was written in a straightforward way to gain experience with the format. My second was more focused on competing against the Fast ISPC Texture Compressor, but without using any SIMD, and was over 30x faster than my first attempt.

The BC7 encoders I've studied seem to be hyper focused on RGB PSNR metrics, which is just the wrong metric for many types of textures. Encoding authors that treat input textures as opaque arrays of 4x4 vectors are at a disadvantage in this domain. RGB PSNR tends to spread the error equally between the channels, which isn't what we want on sRGB textures. Instead, it's desirable to tradeoff a small amount of additional R/B error for less G error. This is what perceptual codecs like JPEG do: they transform the input into YCbCr space, then downsample and quantize the hell out of the CbCr coefficients because preserving chroma is a waste of bits.

Many other BC1 block compression codecs support weighted RGB metrics because in BC1 not doing so visually looks worse on sRGB photos/albedo textures/etc. Encoders using perceptual metrics look better on color gradients and with highly saturated blocks. Heavy usage of perceptual metrics dates back to at least NVidia's original nvdxt compressor, and it wasn't possible for crunch to compete against nvdxt without supporting perceptual metrics. The squish library recommends using perceptual metrics by default, because BC1 without perceptual metrics looks worse.

Anyhow, etc2comp by John Brooks takes things a step further and supports computing error metrics in weighted YCbCr space. Compared to vanilla RGB weighted metrics, this looks better in my experience writing Basis (especially with ETC1). I'm currently using weights (128,64,16).

Here's the REC 709 luma PSNR of 31 test textures encoded with ispc_texcomp (slow/highest quality - uses 7 modes) and my non-SIMD encoder in perceptual mode using just 4 modes:

The overall average PSNR for ispc_texcomp was 48.57, mine was 50.4. Even with ispc_texcomp's massive mode and SIMD advantages it does worse on this metric. ispc_texcomp doesn't support optimizing for perceptual metrics, which puts it at a huge disadvantage on many texture types.

I re-encoded the textures with linear metrics. My encoder used 6 modes: 0, 1, 3, 4, 5, and 6 (including all component rotations and the index flag).

ispc_texcomp's average PSNR was 46.77, mine was 46.50. My encoder can easily bridge this ~.25 dB gap (by using more modes and trying more partitions), but at a time penalty.

Note that ispc_texcomp in its best/slowest profile is pretty slow, and is much easier to compete against without SIMD code. It's just trying way too hard. It's faster in its lower quality "basic" profile, but it still doesn't support perceptual metrics so it'll continue to fight up a very steep hill.

For benchmarking, I ran each encoder in a single thread, and called ispc_texcomp with 64 blocks at a time.

Other findings: ispc_texcomp has a very weak mode 0 encoder, and it's weaker than it should be on grayscale textures. I'll blog examples soon.

Wednesday, April 4, 2018

Imaginary GPU formats

Every once in a while I wonder about alternative GPU texture format encodings. (Why not? It's fun.) There must be a sweet spot somewhere along the continuum between BC1 and BC7. Something that is more complex than BC1 but simpler than BC7. (I somewhat dislike ASTC, mostly because of its insanely complex encoding format.)

Here's one idea for an 128-bit per 4x4 block format (8 bits/texel) that mashes together ETC1+BC7. One thing I learned from ETC1 is that a lot of bits can be saved by forcing each subset's principle axis to always lie along the intensity direction. With a strong encoder, this constraint isn't as bad as one would think.

The format only has two modes: opaque and transparent. The opaque mode has 3 subsets, and the transparent mode has 2 subsets for RGB and 1 subset for alpha. Each color has 1 shared pbit, and each mode has 16 partitions for colors.

The color encoding is "RGB PBit IntensityTable". The intensity tables could be borrowed from ETC1 and expanded to 8 entries. For the transparent blocks, two 8-bit alpha values are specified (like BC4), and by borrowing degeneracy breaking from BC7 we can shave one bit from the alpha selectors. "CompRot" is a BC7-style component rotation, so any of the channels can be encoded into alpha.

Some things I like about this format: equal precision for all components, and there are only two simple modes. The opaque mode is powerful but simple: always 3 subsets, with color and selector precision better than BC1 and even better than BC7's 3 subset modes. The transparent mode is more powerful than BC3 for RGB (better color precision, and 2 subsets), but weaker for alpha (2 bit selectors vs. 3).

The main downside is that each subset's endpoints are constrained to lie along the intensity axis. I've seen commercial games ship with normal maps encoded into ETC1 and DXT1 so I know this isn't a total deal breaker.

Opaque block:
ModeBit    1 
Partition  4
Color0     777 1 3 
Color1     777 1 3 
Color2     777 1 3

Color selectors;
3 3 3 3
3 3 3 3
3 3 3 3
3 3 3 3

Total bits: 128

Transparent block:
ModeBit    1 
Partition  4
Color0     666 3 
Color1     666 3 
AlphaLoHi  8 8
CompRot    2

Color selectors:
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2

Alpha selectors:
1 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2

Total bits: 128

A strong encoder would adaptively choose between opaque blocks and transparent blocks using various component rotations, to minimize overall error. Transparent blocks can be used even on all-opaque textures.

I have no idea if this format is useful. On a rainy day I'll make a simple encoder and compare it against BC1 and BC7.

Tuesday, April 3, 2018

Basis feature support

Here's what we support right now:
  • .basis universal format, which is transcodable to BC1-5, ETC1, PVRTC1 4bpp (currently opaque only), and BC7 (currently opaque only). Alpha support for PVRTC1/BC7 is coming, and enhanced quality for BC7 and ETC2 are on the way. This is a universal solution with two quality modes (baseline and BC7), so by its very nature it trade offs max achievable quality for GPU format support.
    • For really small images (think icon-size), .basis can switch to using fixed selector codebooks to cut down on selector codebook overhead
    • Format supports arbitrary resolution texture arrays, all referring to a single set of compressed codebooks.
  • RDO BC1-5 - Creates more compressible files than crunch's (RDO in crunch was an afterthought and was pretty dumb/low quality), but slower compression. We put the most effort into optimizing BC1's output for LZ coding. Supports up to 32K entry codebooks (vs. crunch's 8K).
  • RDO ETC1 - Supports all ETC1 features, and very high quality levels (up to 32K entry codebooks), usable even on complex normal maps.
  • ETC1 intermediate format (supports all features of the ETC1 format, i.e. flips and both differential and individual colors). 10-20% smaller files at same SSIM vs. Unity crunch's the last I checked.
All of these codecs have been utilized by customers for different purposes.

We don't support an intermediate file format exclusively for BC1-5, only ETC1. Instead, we're focusing on universal solutions first, and then we'll focus on an intermediate format solution for BC7, BC6H, and ASTC.

I get asked all the time how these solutions compare to crunch's. I'll be working on extensive benchmarks soon. I've learned a lot since I designed and wrote crunch in 2009.

Sunday, April 1, 2018

Basis GPU format support update

Our goal is to support all the GPU formats (literally). Here's an update on our format support:

We just added PVRTC1 4bpp and BC7 support. PVRTC1 quality is approximately equal to PVRTexTool's middle setting ("good"), and significantly better than its lower two settings. Max quality in BC7 mode is currently limited to BC1/ETC1-grade quality levels (what we're calling "baseline" quality).

We've devised several ways of improving the max quality to near-BC7 grade by storing extra data in the .basis file. (You can't get something for nothing!) This high quality data would be optional, so users that don't care about super high quality levels can disable it and the codec will just transcode the baseline data to BC7 instead.

Here's what we support transcoding .basis into right now, in order of transcoding speed from fastest to slowest:
  • ETC1 
  • BC1 
  • BC3-5 
  • BC7: RGB 
  • PVRTC1 4bpp RGB 
Here are the formats we're going to eventually support in order of importance (with no changes to the .basis format needed):
  • PVRTC1 4bpp RGBA 
  • ETC2 RGBA 
  • BC7: RGBA 
  • PVRTC1 2bpp RGB/RGBA 
None of these formats require raw RGB/RGBA pixel processing during transcoding, i.e. we aren't just using real-time GPU format compressors here. Transcoding occurs at the level of GPU blocks, endpoints, and selector/modulation values.

At some point, we're going to boost quality above baseline, to better exploit BC7/ASTC. Most of our early users of this tech (which aren't native game apps) are happy with baseline quality, so the priority of doing this is relatively low. (Games will probably want BC7/ASTC specific codecs anyway.) We are designing the .basis format with this eventual goal, so when we add "enhanced quality" support we won't break compatibility with older baseline-only transcoders.

We'll be posting benchmarks comparing .basis to crunch (and Unity's crunch) and releasing WebAssembly (or asm.js) demos within the upcoming weeks.

Friday, March 30, 2018

Basis update - now with PVRTC support!

Basis (our new GPU texture compression product and the successor to our popular open source crunch lib) now supports PVRTC1, along with ETC1 and BC1-5 (DXTc). This means a .basis file can be utilized on pretty much every GPU in the universe that matters, independent of platform or API. A .basis file is conceptually like JPEG but for GPU texture data, and can be used on the web (using Emscripten and WebGL) or by native apps (using a small C++ transcoder library).

All textures are 1024x1024 (due to PVRTC1 limitations). Click on each one to see them at full-res (they are reduced in size on the page itself).

Each image below was transcoded directly to each GPU format from the .basis file, and then converted to 24bpp .PNG. On my desktop, ETC1 is fastest (~3ms), followed by BC1/4 (~7.9ms), then PVRTC (~37ms). The transcoders (particularly PVRTC) are not yet fully optimized, and are written in straightforward C++. (Update: PVRTC transcode at 1024x1024 is now ~15.4ms, without any SIMD or threading yet.)

The PVRTC transcoder really needs SIMD optimizations, which should give it a nice speed boost (probably around 2-3x). It would be trivial to thread the PVRTC transcoder too. The PVRTC's transcoder's quality is visually somewhere in between PVRTexTool's "Lower Quality" and "Good" settings. In many cases, it looks a little better than "Good", but it's a tossup.

Note that BC3 and BC5 formats are supported by calling the transcoder twice from different input image slices. So a RGBA GPU texture is encoded into two slices (sharing the same codebooks) in a single .basis file, and it transcodes to either two ETC1 textures, a ETC1 texture twice as high, or a single BC5 texture. PVRTC2 and ETC2 support will be very easy and transcode times will be comparable to ETC1 or BC1 (PVRTC1 will always be the most expensive). The PVRTC transcoder doesn't support alpha yet (it's next).

Image: laststarfighter_1024.basis, 133966 bytes, 1.022 bits/pixel






Image: map_1024.basis, 180603 bytes, 1.38 bits/pixel






Image: delorean_1024.png, 138894 bytes, 1.06 bits/pixel






Monday, March 26, 2018

Basis v1.11 with universal GPU texture support has shipped

We've sent drops to two companies so far. This is the first version that supports fast block-level transcoding of .basis files to multiple formats: ETC1 (mobile) or BC1-5 (desktop). This is a major milestone for us, because Basis is the first system available to support efficient platform independent distribution of highly compressed GPU texture data. We've been working up to this release for over a year.

Here's a tiny demo (our first), using the Basis transcoder compiled to Javascript using Emscripten:

For some encoded example images created during development, see thisthis, or this post.

You encode your textures/images a single time, store a single set of .basis files (which are approximately the size of JPEG files), download the file on the remote device, and then transcode to the format you need for that device. Our transcoder converts the block-level data to DXT or ETC format GPU texture bits on the fly. The encoder is aware of all the formats and balances the quality levels of each.

.basis files consist of one or more 2D texture "slices", where each slice can be any dimension. Slices can be mipmap levels, tiles, cubemap faces, video frames, etc. - whatever you want.

We think the primary use case for .basis files are web apps of various types, or any kind of app that needs to distribute GPU texture data across a wide range of GPU devices. We've tested this solution on normal maps, diffuse maps, gloss maps, satellite photos, photographs, grayscale images, flight navigation maps, etc.

Anyhow, here's what you get with Basis:

  • bin, bin_linux, bin_osx: DLL/so/dylib's containing the precompiled encoder library (which is closed source) and several command line tools. The main tools are basiscomp (our new .basis file encoder, and our RDO ETC1 compressor) and rdodxt (uses our new RDO BC1-5 encoders that are around 10-25% better than crunch's).
  • basisexample: Shows how to use the encoder DLL to encode .basis or RDO .KTX files.
  • inc: Transcoder library source code/headers.
  • lib: static import library for encoder DLL
  • transcoding_demo: Sample that uses the included transcoder library (provided in source code form in the 
  • rdodxt: Sample that uses the encoder DLL to do RDO BC1-5 compression
I've tried to keep the few API's in the product as simple as possible, so not much documentation is needed for them. The readme file covers them. Encoding involves filling out a struct and calling a single C function in the DLL. Transcoding slices is similar, except you use a couple simple methods in inc/basis_decoder.h.

Wednesday, February 28, 2018

On Age DE's pathing/movement

Typical Age DE forum post:
The pathfinding in the game is terrible.
First off, Age of Empires Definitive Edition is a remaster of Age of Empires. It's not a rewrite, it's not a new engine, and that's what we've been saying for almost a year. Age of Empires 1's path finding was really bad, as most reviews point out:
This is the code we started with in DE. Not Age 2, not new code, but the original code which had a ton of flaws and quirks we had to learn about the hard way. This code was almost a quarter of a century old, and it showed. The original movement/pathing code was very weak to say the least. Yet entire complex systems above it (combat, AI, etc.) depended on this super quirky movement/pathing code. It took multiple Ensemble engineers several years of development to go from Age 1 to Age 2 level pathing.
We made a number of improvements to the path finding and unit movement code without breaking the original system. It must be emphasized that Age1's pather and movement code is extremely tricky and hard to change without breaking a hundred things about the game or AI (sometimes in subtle ways). It was a very tricky balance. The current system still has problems with chokepoints, which can be fixed with more work, but we instead had to focus on multiplayer which had to basically be 85% rewritten.
Here's a list of fixes made so far to DE's pathing and movement code in the time I had, which was only like 2 months:
  • DE's pathing system's findPath() function was speeded up by approx 3-4x faster vs. Age1's
    I performed around a dozen separate optimizations passes on the core pather. I implemented the A* early exploration optimization (eliminating 1 open list insertion/removal per iteration), and massively tuned the C++ code to generate reasonably efficient x64 assembly. I transformed the inner loops so much that they barely resemble the original code. We retested the pather and game thoroughly after each major optimization pass. 
  • Age1's pather's A* implementation was outright broken (the open list management was flawed, so the cheapest node wasn't always expanded upon during each iteration). DE's pather fixes all these bugs and is a proper implementation of A*. (I have no idea how the original code shipped, but it was 1997!)
  • DE's pather gives up if after many thousands of iterations it can't make forward progress towards the goal, to avoid spending CPU cycles on hopeless pathing unnecessarily. (It's more complex than this, but that's the gist of it.)
  • Added multiple lane support to villager pathing.
  • Villagers can use one of two collision sizes (either small or large), so if a villager bumps into another friendly villager we can immediately switch to the smaller radius to avoid stopping. So basically, villagers can get very close to each other, avoiding gathering slowdowns. Villager vs. military combat was preserved because vills use large radii by default, and if a vill vs. vill collision doesn't occur after a set period of time the villager returns back to the large radius. (An unfortunate side effect of this: It's possible to group together a ton of villagers - way more than the original game - and use them to attack other units. Villager Mobs are a game unbalancing issue in DE.) Note that another engineer (not associated with FE) attempted to "fix" villagers by allowing them to collide/overlap and totally broke the entire game, and that code did not and could not ship because it broke the engine's assumptions in multiple ways.
  • Movement of units through single-tile openings was greatly improved and tested with all unit types. Age 1's handling of single tile openings was so bad that players would exploit it:
  • The DE pather was modified to have a much higher max iteration count than Age1's, so longer and more complex routes can be found.
  • The per-turn pathing cap in Age1 was switched to short and long range pathing categories in DE. 8 short range paths can occur per turn, and for long range paths it supports up to 4 findPaths() per turn.
  • For short range paths, straight line paths are preferred vs. the tile path returned by findPath() if the straight line path is safe to traverse.
  • Boat movement was modified to have deceleration.
  • Waypoints along a path can be skipped if a unit can safely move from its current position to the next waypoint
  • Added support for 32-facing angles vs. Age's original 8. Also, the unit direction/facing angle is interpolated in DE, instead of "snapped" to like in Age1. The interpolation is purposely disabled when units switch angles during combat.
  • Added stuck unit detection logic to DE's movement code, to automatically detect and fix permanently stuck units (rare, but possible).
  • We ported Age2's entire obstruction manager into DE, replacing the old bitmap system. Units use circular obstructions, and buildings use square obstructions.
  • Added several new behaviors to the movement code to help with chokepoints: A "wait" behavior, that checks every second or so for up to 45 seconds to see if the unit can be moved to the destination, and a stuck unit "watchdog", which watches to see if a unit hasn't made forward progress and tries to switch behaviors to get the unit unstuck. Age1's code would just give up at the slightest problem.
  • The pather tries to path starting from the center of each tile, but this sometimes fails in tight spaces or with lots of units around. DE tries harder to find a good starting position, so movement through single tile openings isn't broken.
  • Path caching system: Villagers and boats can reuse previously found paths in DE, for efficiency.
  • In situations that Age1's pather would just outright give up and stop, DE's pather tries a lot harder to get the unit where it needs to go using several randomized fallback behaviors.
  • Age1's waypoint detection code was very janky and fixed in DE. Age1's code would sometimes cause units to oscillate around their current waypoint until it figured out it was ok to go to the next one.
  • The original Age1 devs used text log files to debug the pather/movement systems. I had to write a 2D/3D debug primitive system so I could see what was actually going on. Here are some development screenshots - notice how complex this stuff is:

Age1's pathing/movement systems implements a form of randomized, emergent behavior. The units are basically like dumb ants. It's imperfect in chokepoints, but it's a continuation of the essence of what made Age 1 what it was. If all the units are moving in the same direction it can usually handle chokepoints (I tested this over and over with a wide variety of units on a pathing torture test scenario from MS before release). The fundamental behaviors the AI and combat systems expected were accurately preserved in DE's pather, which was our goal.
Instead of people saying "the pathing in DE sucks!", I would much rather hear about the specific issues with movement/pathing, and actually constructive suggestions on how to improve the system without breaking the game or turning it into Age 2.

Wednesday, February 21, 2018

Age DE's latency matrix

Getting the netcode to work reliably in a peer to peer multiplayer title is tricky. Every peer must be able to quickly and reliably send and receive packets with every other peer, or the whole thing falls apart. Also, if any machine runs a turn slower than expected for any reason, the entire system will hitch while waiting for the slow peer to catch up. DE constantly monitors the pings and framerates of all connections and machines in a MP game, but it can only compensate so much for bad connections.

In DE's game lobby there's a 2D matrix of blocks that shows the systemwide roundtrip latencies between all players (circled in red):

Once you're in a lobby, the game's peer to peer multiplayer code (parts of which date back to the original game) is active, and your machine is actively communicating with all the other machines. Every 4 seconds your machine pings all the other clients, the results are sent to the host, and every few seconds the host then sends the entire matrix to all peers.

For each row of this matrix, the latency to all the other players is visualized. So the first block on row 2 represents the latency from player 2 to player 1, and the third block on row 2 is the latency from player 2 to player 3, etc. Grey means no response (yet), green is <200ms ping, yellow is <=400ms, and red is >400ms. The game won't start if there are any grey blocks (even in "dedicated server" mode). The ping matrix is not necessarily symmetrical, but usually is.

If a block has a thin blue rectangle around it, that means that client has to use a TURN server relay to get its packets to the other client due to NAT traversal issues. This means extra overhead.

The latencies visualized here are low pass filtered over approx. 8 pings.

The "Ping" column shows the local roundtrip latencies to the other clients. Each player will have its own unique column of values. Apart from maybe the host, I think this column is kind of useless, because it's only showing local latencies. It would have been better if it displayed each player's worst latency.

This matrix is used to compute the turntimes used during the actual game. I believe all peer to peer titles should display something like this, to help players quickly see at a glance how healthy the connections are between peers.

Wednesday, February 14, 2018

Lessons learned while developing Age of Empires 1 Definitive Edition

In late 2016 I began helping Forgotten Empires on Age 1 DE, a UWP app shipping in the Windows Store on Feb 20th. I only helped occasionally for the first couple months or so (because I was working on Basis and an aerospace project), but as the title got closer to shipping I spent more and more of my time working on Age problems. We started with the original 20 year old Age 1 codebase. Here are some of the things I've learned:

1. Get networking and multiplayer working early.
DE supports both traditional peer to peer (with optional TURN server relaying to handle problematic NAT routers), and a new client-server like mode ("host command forwarding") where all clients send their commands to the host which are then forwarded to the other clients. Age 1 uses a lockstep simulation model, except for most AI code which is only executed on the host (see here).

Do not underestimate the complexity of lockstep peer to peer RTS multiplayer games. If possible, choose an already debugged/shipped low-level networking library so you can focus on higher-level game-specific networking problems.

If you do use an off the shelf network library, test it thoroughly to help build a mental model of how it actually works (vs. how you think it works or how it's supposed to work). Develop a test app you can send the library developers to reproduce problems. If the library supports reliable in-order messaging then (at the minimum) put sequence numbers in all of your packets and assert if the library drops, reorders or duplicates packets in case there are bugs in the reliable layer.

For debugging purposes make sure all timeouts can be increased by a factor of 10x or whatever. Sometimes, debugging real-time network code is impossible in the debugger (because it inserts long pauses), so be prepared to do a lot of printf()-style debugging on multiple machines.

If you're taking an old codebase and changing it to use a new networking library or API, try to (at first) minimize the amount of changes you make to the original code. No matter how ugly it is, the original code worked, was bug fixed and shipped, and don't underestimate the value of this.

If you develop your own reliable messaging system, develop a network simulator testbed (which simulates packet loss, etc.) to automate the validation of this layer whenever it's modified and always keep it working.

Trust nothing and verify everything at multiple levels. CRC your packets, CRC the uncompressed data if you use packet compression, use session nonces in your connection-oriented layer to validate connections, validate that your reliable layer is actually reliable, etc. Make sure the initial connection process is well defined and completely understood. Everything needs timeouts of some sort and when sending unreliable messages any packet can get lost. Gaffer on Games is a great guide to this domain of problems.

Getting the game to run smoothly with X random machines across a variety of network conditions is difficult. Plan on spending a lot of time tuning the system which controls the game's turntime (command latency and sim tick rate). There are multiple sources of MP hitches (which players hate): Turntime too low (so one or more machines can't keep up with the faster ones), random CPU spikes caused by AI/pathing/etc., reliable messaging retransmit delays, random client latency spikes, AI's sending too much command data, etc. Develop strong tools to track these problems down when they occur in the field and not in your test lab.

Add cheat commands to the game to help simulate a wide range of various networking and framerate conditions.

If you send unreliable ping/pong packets to measure roundtrip client latency, filter the results because some routers are quite noisy. The statistics that go into computing the sim tick rate and turntimes should be well filtered.

Establishing the initial connections between two random machines behind NAT's is still a challenging problem - test this early.

Identify your most important packets and consider adding some form of forward error correction to them to help insulate the system from packet loss. In lockstep designs like Age, the ALL_DONE packets sent by each client to every other client to indicate end of turn are the most important and currently sent twice for redundancy. (Excluding AI's, there are no commands from the player on most turns!)

Internal testing doesn't mean much. You must have MP betas to discover the real problems. It seems virtually impossible to simulate network conditions as they occur in the wild, or the game running on customer machines.  Make sure you get valuable test data back from MP betas to help diagnose problems.

Age DE's reliable messaging system is based on Brownlow's "A Reliable Messaging Protocol" in GPG 5. This is an elegant and simple NACK-based reliable protocol, except the retransmit method described in the article is not powerful enough and is sensitive to network latency (supporting only 1 packet retransmit request per roundtrip). We had to modify the system to support retransmit packets containing 64-bit bitmasks indicating which speciific packets needed to be resent.

2. Develop strong out of sync (OOS) detection tools early, and learn how to use them.
As a lockstep RTS codebase is modified you will introduce many mysterious and horrifying OOS problems. Don't let them smolder in the codebase, fix them early and fix new ones ASAP.

Functions which are not safe to use in the lockstep simulation should be marked as much. We had an accessor function which returned true if the entire map was visible, which got accidentally used in some code to determine if a building could be placed at a location. This caused OOS's whenever the user resigned (which locally exposes the entire map) and another client built walls. This little OOS took 2 days to track down.

If you are getting mysterious OOS's, you need to identify the initial cause of divergence and fix that, then repeat the OOS debugging process until no more divergences remain. Don't waste time looking at downstream effects (such as out of sync random number generators) - identify and fix that first divergence.

In Age, the original developers logged virtually everything they could in the lockstep sim. Some important events (such as where objects were being created) were left out, so we had to add unique "origin" parameters to all object creations so we knew where in the code objects were being created.

3. Do not underestimate the complexity and depth of UWP and Xbox Live development.
Your team will need at least 1-2 developers who live and breathe these platforms. These individuals are rare so you'll just have to bite the bullet and make an investment into these technologies.

4. Develop clean and defensive coding practices early on. Use static analysis, use debug heaps, pay attention to warnings, etc. Being sloppy here will increase your OOS rate and cause player and developer pain. Be smart and use every tool at your disposal.

5. Do not disable or break "old" logging code. Make sure it always compiles.
This logging code is invaluable for tracking down mysterious/rare problems and OOS's. The original developers put all this logging code in there for a reason..

6. Add debug primitives if the engine doesn't have any
This is a basic quality of life thing: You need the ability to efficiently render 2D text, debug primitives in the world, etc. If the engine doesn't support them then get them in early.

7. Profile early and make major engine architectural decisions based off actual performance metrics.
If your new renderer design relies on a specific way of rendering the game in a non-mainstream manner, then verify that your design will actually work in a prototype before betting the farm on it. Be willing to pivot to an alternate renderer design with better performance if your initial design is too slow.

Get perf. up early: Lockstep RTS multiplayer games can only tick the simulation at the rate of the slowest machine in the game. So if one machine is a dog and can only handle 20Hz, the game will feel choppy for everyone. Other major sources of perf problems like pathing or AI spikes will be obscured if rendering is running slow.

8. Figure out early on how to split up a singled threaded engine to be multithreaded.
Constraining an RTS to live on only a single thread is a recipe for performance disaster, especially if you are massively increasing the max map size and pop caps vs. the original title.

9. Many RTS systems rely on emergent behavior and are interdependent.
If you modify one of these systems, you MUST test the hell out of it before committing, and then be prepared to deal with the unpredictable downstream effects.

For example, modifying the movement code in subtle ways can break the AI, or cause it to behave suboptimally. The movement code in Age1 DE is like Starcraft's: an unholy mess. To be successful modifying code like this you must deeply understand the game and the entire system's emergent behavior.

Carelessly hacking the movement or path finding code in an RTS is akin to hacking the kernel in an OS: expect chaos.

10. Automated regression testing
The more you automate and objectify testing of movement, AI, etc. the happier your life will be and the easier you will sleep at night.

11. Playtest constantly and with enough variety
It's not enough to just play against AI's on the same map over and over. Vary it up to exercise different codepaths. You MUST playtest constantly to understand the true state of the title.

12. Assume the original developers knew what they were doing.
The old code shipped and was successful. If you don't understand it, most likely the problem is you, not the code.

For example, Age 1's original movement system has some weird code to accelerate objects as they moved downhill. This code didn't have a max velocity cap, so on very long hills units could move very quickly. We resisted modifying this code because it turns out it's a subtle but important aspect of combat on hills.

13. Don't waste time developing new templated containers and switching the engine to use them, but do reformat and clean up the old code.
Nobody will have the time to figure out your new fancy custom container classes, they'll just use std because we all know how they work.

Instead, spend that time making the old code readable so it can be enhanced without the developers going crazy trying to understand it: fix its formatting, add "m_" prefixes, etc.

Sunday, February 4, 2018

10 abusive company types

These categories were originally about abusive men, but my friend Stephanie noticed these categories could be adapted to describe abusive companies, too. From the book "Why Does He Do That?":

1. Drill Sergeant: Micromanages you, wants to control everything.

2. Mr. Sensitive: Builds up a public image of being a great company so people think you're crazy if you criticize them.

3. The Water Torturer: Is an expert at not doing anything OBVIOUSLY wrong, you feel wronged but can't pinpoint why and wonder if you're crazy.

4. The Demand Man (or Company): Everything seems fine if you never ask for anything, like a raise. If you do that, you're suddenly painted as ungrateful and treated poorly.

5. Mr. Right: Everything is fine so long as you don't question the company's actions or say anything critical about them.

6. The Player: Never lets you feel like the job is stable. Acts interested in you only to hook you in, then you're neglected and treated poorly again.

7. Rambo: Treats everyone like shit, but tells you you're special and an exception.

8. The Victim: You caused the company so much trouble, you really messed up that one time, any mistreatment happening to you now is making up for that.

9. The Terrorist: Reminds you of the power they have to ruin your career or life, so you better not go against them.

10. Bipolar: The company oscillates between being angry and then happy with you depending on the state of your current project. They become angry when a problem is identified, and when you fix it they are temporarily happy.