Fixing and 99% improvement in voxel replay

Catch-up: For campaign mode, SnwScf needs bigger arenas/levels.  This means (1) much wider voxel snow coverage (e.g. across mountains) which mandates LOD levels (level of detail) and (2) removing/replacing snow as the player(s) move.  All in all, lots of optimizations, etc.  Some of the funnest work 😀

This one’s all about fixing then optimizing the replay of actions (dig, build, etc).  This is used when voxels are (a) tidied due to being further away than we wish to keep or (b) when we switch their LOD levels = a lot more common!

At the start, it didn’t work.  This was mostly due to other work I’d done on the codebase while not worrying about this functionality since I hadn’t been using it.  With small arenas I never let the snow be tidied — I just kept it all!  Obviously this meant larger load on CPU, Memory and GPU!  In past posts (12 & 3) I talked about improving with LOD, etc.  Now the voxels can be swapped-out to lower LOD levels so we definitely need it!

Here’s out test patch of snow.  To keep things consistent, I saved the snow so I can reload it every time.  This is a slight over-complication — it actually stores all of the generation settings and simply loads them then does generation and gets the same result every time!

 

screenshot-2016-09-16-10-47-22

Here are the numbers:

Total time (ms) Percentage improvement Num replay calls Num DoOperation() calls Comments
(didn’t work) Original (didn’t work)
4552 0% 486 31104 Original (fixed)
752 -83% 78 4992 (1) Switched to Chunk-per-call (rather than whole height each time)
(not-timed) 6 4144 (2) Optimized what was considered actually modifying a ChunkData.
53 -99% 6 1120 (3) Bounded start & end bounds by original sizes
8 Blocks to a side in a Chunk
8 Height of this operation (Chunks)
-3 MinY
1 MaxY
4 World height (Chunks)

And some comments on what I did:

  1. The original code found all ActionData instances that had affected a Chunk then re-ran it for every column… but it didn’t bound the height of the column so it was affecting each Chunk as many times as the number of vertical Chunks it had affected!! (e.g. ActionData touched 3 Chunks high, all 3 Chunks would be done 3 times!?)  This seems a bug in the original code.
  2. The original ActionData code considered a Chunk touched if it was within certain bounds even if it didn’t change the isovalue.  This might have been for painting?  Even that guess is a bit of a stretch — I can’t really see any reason for it so I ‘fixed’ it.  I don’t use it anyway so meh.  It’s easy enough to revert if I offer these changes back to the die-hard users of TerraVol (if there are any).
  3. So previously we’d operated on *every* Block within the Chunk.  For the edge ones, that was superfluous.  Instead, I bounded the Blocks affected by the original size affected.  Great improvement — albeit most useful on edge Chunks which this test-case has lots of.  A larger operation will be more inner ones but I have ideas for that!
    (care to guess along at home … or write in on a stamped self-addressed envelope 😉 )

Here are a couple of pictures of the operation.  The white boxes are Chunk boundaries and the purple boxes are the reduced bounds that are now operated upon 🙂  If you’re wondering why the purple boxes extend outside the green capsule, it’s because the isovalues are smoothed over that range from no-effect to full-effect to produce a smooth result at about the threshold where the green capsule is drawn.

screenshot-2016-09-16-10-44-54

screenshot-2016-09-16-10-45-37

53ms is actually still a long time and needs reducing / parallelizing but it’s a lot better.  To give you an idea — this is a tiny operation.  Most are much larger.  This one’s equivalent to rolling a small snowman 1 meter.  A carrocket hit at smallest level would be about 100 times larger.

Now this is all done off the main thread so it’s not so bad however it’s all done on a Generation thread (which has responsibility for getting the data ready before the Builder threads turn that into a mesh).  Sadly, the way things are structured at the moment, the Builder can time-out if the Generation takes too long and won’t notice a change until the camera moves sufficiently far.  Also generally I’ve tended to only need 1 or 2 Generation threads whereas what we’re really doing here is ActionData application — which I have tended to have 4 threads for since it’s done a lot!  It would feel a lot cleaner to move this re-application of ActionData to the ActionDataThread then get the BuilderThread to be informed when the Chunks are ready for meshing.  So that’s next!

Onwards and faster-wards 😉

Advertisements

Voxel LOD generation

LOD (Level of detail) now working for voxel *generation* as well as building.

Here’s a YouTube at full res for the LOD-lovers like me out there 😉

As discussed in the last post, the generation phase is deciding on the matrix (8x8x8 for LOD0) of numbers that comprises a voxel ‘ChunkData’ whereas the building phase is turning those numbers into a ‘Chunk’ with a ‘Mesh’ — a thing you can actually see and interact with.

This new work means:

  • LOD0 (white boxes) contain 512 floats and 128 triangles = unchanged.
  • LOD0i (yellow boxes) contain 512 floats but only 2 triangles = unchanged.
  • LOD1 (cyan boxes) contain 1 float and 2 triangles = the NEW bit!

You can see how that’s much less memory and much better performance — especially for covering *much* larger arenas!

Yep, open(er) world campaign mode, here we come!

p.s. As with most gamedev or even programming, this took a great deal of frustrating tweaking including upgrading Unity and several assets then creating a whole new way to investigate details about voxel spaces that are generated on threads.  I know, implementation details.  I’m happy to discuss and the answer will involve phrases like “Marching Cubes”, “Isovalues” and lots about “neighbours”.  Ask if you’re curious / suffering from insomnia 😛

Oh and there’s still plenty of work to do before someone asks “is it done yet”.  There’s a reason I don’t show the Snowmen moving around in it yet 😉

Looking further, TerraVol performance work, part 2

Hi!

This blog post is gonna be lots of deep code stuff to optimize the performance of TerraVol.  Why?  Well I’d like larger arenas for SnwScf!  Huh?  How do those relate?  Well the voxel snow system I use proecdurally generates snow.  A larger arena requires more snow coverage.  More snow coverage requires more system resources (CPU, memory, GPU).  However if we increase the efficiency of the system, we can reduce that increase and make the game run better!  The uber-goal here being a campaign mode that takes place on a wide open terrain with long view distances.  To do that, I’m introducing LOD (Level of Detail) generation — building fewer snow pieces that cover larger areas with fewer details (vertices) — they’ll be atop far mountains you never reach, etc so they just don’t need a gazillion vertices up there!  …well, until you visit them — then they’ll be all beautifully detailed!  This is a common optimization in games but isn’t part of the voxel engine I used — TerraVol — so I’m adding it piece at a time.

So, a quick introduction to voxels in TerraVol.

Chunks, Blocks and GameObjects — oh my!

A cubic section of the voxel space is called a Chunk which consists of a 8x8x8 grid of Blocks.  Each Chunk has a GameObject and, at the start of all of this, 4096 were created at level start and sat in the scene.  (Yes, they were originally all Active GameObjects!  I can’t imagine Unity thinks much of 4096 active GameObjects sitting in its scene when not actually needed!  I switched them to all be inactive to start with.)  When some snow is needed in an area, all the Chunks from the highest Chunk that’s ever been to the lowest that’s allowed are queued for ‘generation’ in a thread.  This does some mathematics to determine height, values are assigned to each Chunk’s Blocks (8x8x8 = 512) and left.  Another thread notices the values are ready but no mesh has been made so it goes through each column and builds a mesh for each Chunk therein.  These two phases “Generation” and “Building” conclude and you can see the Chunks — snow in my case — in the game.

First we’ll do some work on Building then we’ll do some work on Generating.  First however, as noted at the bottom of the last post, since we can’t apply the Unity Profiler here, we need to measure manually!

Measuring time

Chunks are built on threads (I use 4).  Before my introduction of LOD, the average time to build (not generate) was 0.6ms per Chunk (where 19 out of 20 Chunks really need no actual meshing since all Blocks within have the same value = all above or all below a surface!).  For LOD1 columns the average to build becomes 0.01ms per column — a 60 times improvement in load!  (since 1 Chunk at LOD0 needs 8×8 = 64 runs through Marching Cubes, etc whereas 1 Chunk at LOD1 needs 1 run through!)

If I introduce a LOD2 or LOD3 covering 2×2 and 3×3 Chunks respectively, the ‘1 Chunk’ call cost should drop to 0.0025 and 0.001!  At this point the overhead of queuing and locking will probably exceed my measurement ability!

Aside: Unfair?

An odd thing to note: BuilderThread-02 is taking about twices as long as the other 3 threads?!  I recall the dispatching is round-robin not single-pool.

To explain: when a Chunk needs building (turning from numbers into a mesh) the work is placed onto a queue-per-thread rather than all placed in a single queue that all threads take from.

I imagine the original author did this since it avoids lock contention at work-retrieval time and can reduce lock contention for the dispatching thread (since the lock it needs to take only has 1 thread vying for it vs. 4 in a single-pool approach).

It would seem ’02 is just getting more of the harder work!?  Odd.  Will this cause a problem?  Well, it will mean that 3 threads will be sitting idle when they could be doing work ’02 has been given.  This might mean things take slightly longer than they should.  What’s odd is that I’d expect it would average-out in the long run — but it’s always ’02!?  I ought to try a single-pool approach (where all work is taken from a single queue).

After introducing ChunkSimpleState the average for both LOD levels is 0.55ms per Chunk.  Now that looks bad since it’s an increase from 0.01ms however we’ve lost all extraenous queueings and meshings of Chunks that didn’t actually need them so this is pure building that needed to be done.  Although the per-Chunk time has gone up, the system as a whole ought to be faster!  It also paves the way for reducing the number of GameObjects the system uses.

Model vs. View

Now the original developer had obviously once thought to do this but hadn’t finished it.  The system does separate ‘data’ (a.k.a. Model) from ‘visualization’ (a.k.a. View) so a ‘ChunkData‘ represents that large block of space but has no visualization until meshed as a ‘Chunk’ — a MonoBehaviour with a static builder method that creates the GameObject and other associated Components (MeshFilter, MeshRenderer, MeshCollider).  As it stands, a pool of ChunkData are created and each associated with a Chunk at creation time — hence 4096.  I removed that immediate association and made it lazy, based on whether ChunkSimpleState indicates a given ChunkData actually needs a mesh!  Now a 20x20x20 grid of Chunks still needs 8000 ChunkData instances but likely only 400 Chunks!  (assuming no caves or overhangs.)

Of course when the pool is exhausted, each are created on-demand.  Given I wish to move all of this off to a separate thread shortly, I can’t be trying to create the Chunk (and its GameObject, etc) at use point (Unity doesn’t allow things affecting its internal systems to be done off its main thread).  Instead let’s add another pool and pull from that.  Now Chunks are returned to their pool and disassociated from the ChunkData.

Data efficiency and Model-View separation

Now recall that each Chunk consists of 8x8x8 Blocks?  To be more precise each ChunkData consists of 8x8x8 BlockData.  ChunkData and BlockData are classes — instances are associated when needed.

Aside: Packed?

It strikes me the BlockData might be better as a packed format, maybe a struct that would result in a cache-friendly array for ChunkData?  Obviously that ‘associating’ approach would need addressing.  I’ll probably come back to that.

For now, what concerns me — why we’re doing all of this in the first place — is increasing the draw distance for the voxel snow, largely by making the system more efficient, especially by introducing efficient LOD usage.  So at the moment, although we’ve now got LOD1 generating a greatly simplified mesh (e.g. 2×2 vertices for LOD1 rather than 8×8 for LOD0), that’s all in the Chunk (built mesh).  The ChunkData still has 8x8x8 BlockData.  For LOD1+, that’s not needed — we only need 1 BlockData at each ChunkData’s local coordinate (0,0,0).  So let’s see whether we can get generation time to only do Blocks that are needed!  Let’s measure first!

Actually when I went to measure and fix this, I found the generation code is neither optimal nor conducive to this new need.  It’s getting the ChunkData from the TerraMap (via a ThreadLocal cache I previously added) each inner step of a 3-layer deep for loop (loop y, loop x, loop z).  I previously fixed so we looped down y first in preparation for this (and in preparation for storing MaxY per Chunk2D — column).  Now we can clearly see each ChunkData only needs retrieving every 8 Blocks — the X & Z coords never affect which ChunkData we need since they’re only operating in a single column!  So I re-structured to only get a new ChunkData every 8 Blocks and then make the ChunkData’s direct methods to change Block values (rather than going through the TerraMap’s lookups each time).

After that change, times to generate in seconds/column with minHeight 20 becomes:

0.002568627 +
0.003196078 +
0.003450980 +
0.003176471 +
0.003803922 +
0.003578431 +
0.008921568 + <-- odd spike
0.003617647 +
0.003803922 +
0.003696078
====
Average: 0.0039813724

You may recall from the previous post that we’d got it down from 0.015 s/column to 0.005 s/column so this 0.001 s/column improvement (on my fast machine) is within measurement error bounds.  Still it’s not going the wrong direction; still might be something and, as highlighted above, it paves the way for the next big steps!

Generating LOD data

Next is generating at the LOD.  This means that:

LOD 0 columns: (minheight - minY) * blocks per Chunk * Y block per chunk * Y blocks per
 chunk
LOD 1 columns: (minheight - minY) * 1 * 1

For minHeight:20 and minY:-3, that’s:

LOD 0 columns: 23 * 8 * 8 * 8 = 11776
LOD 1 columns: 23 * 1 * 1 * 1 = 23

That’s 11753 fewer loops, psuedo-random number calls for perlin values, etc and 11753 fewer BlockData assigned!

0.002441176 + LOD0
0.003137255 + LOD0
0.002166667 + LOD0
0.000784313 + mixed
0.000970588 + LOD1
0.000862745 + LOD1
0.000745098 + LOD1
0.000941176 + LOD1
0.001000000 + LOD1
0.000941176   LOD1
====
Average: 0.0013990194 s/column for both

BUT just the LOD1 values:
Average: 0.0009101305 s/column at LOD1

 

That’s a 30x improvement!  Super!

Now admittedly there’s still some work to do since the boundary between LOD0 and LOD1 now looks a little odd — the LOD0 building code is probably looking for values along the LOD1 boundaries.  Unless I can contrive a cheat, I might need to generate the edge values for LOD1.  Hmm… perhaps a new LOD0i for columns on the interface between LOD0 and LOD1 which does this (or probably build at LOD1 but generate at LOD0 since it’ll make the code much cleaner).

Well, off to a BBQ this afternoon so I’ll schedule this to post later in case I forget.  Maybe inspiration will strike while I’m wolfing down hog roast and chit-chatting with people IRL!

p.s. Nope, decided to go with LOD0i 🙂

Voxel usage in Snowman Scuffle — for switching voxel platforms?

The voxel system I use for the snow in Snowman Scuffle is no longer maintained. In fact my customized version is probably the most advanced variant of it. This is good and bad. It’s also a pure Marching Cubes approach — a technique that has been somewhat obsoleted by more modern techniques. As such, I’d be quite keen to replace it if I could find a viable (performance, cost & integration time) way to do it. As such, I keep my eye on the forum for mentions of other systems that might fit.

One came up a few months ago by @PawfessorFluff. The tech certainly looks good so, at his/her request I posted my needs for consideration. Here’s a summary of the use of voxels in Snowman Scuffle — both existing and potential future. I’ll probably update this post with clarifications as @PF requests.

Why replace?

Un-maintained software requires the user to fix the bugs (which I’ve done) and extend with any desired new functionality (which I’ve also done).

However the problem is probably one of performance.  Here’s a table of performance results with different setups.  TerraVol has no LOD facility so these numbers are relatively consistent (depending on camera position).

https://docs.google.com/spreadsheets/d/18Fu1yoWm3hXfDkIign-FbKm8c14h7EEP-nRMoWGFcNM/pubhtml?gid=2030552354&single=true

A quick comment on viewpoint

@PawfessorPuff observed that the SnwScf camera is basically top-down. In case relevant to his understanding, I’ll clarify/correct slightly.

The current camera in SnwScf generally tries to be pointing up the Z-axis pointing down at around 60 degrees. It’s distance if determined by the distance between the camera’s targets (generally players but also certain power-ups, etc). When targets become closer and lie close to similar Z values (i.e. paralleling the X-axis), the camera moves down to present a more side-on view.

The probable single-player mode might use more of a follow-cam approach if it doesn’t confuse players moving between the two modes. 

Current voxel usage

The voxels in Snowman Scuffle are currently solely used to represent the snow (not the terrain or obstacles).

At level start the snow is built by one of 3 methods:

  1. Procedurally using callbacks to an interface I provide — currently giving a pseudo-random Perlin noise based using some parameters to control depth, amplitude, frequencies, etc.
  2. Loaded heightmap.
  3. Loaded previously-serialised data.

During gameplay both create and destroy operations are used at present. There are however two complexities:

The first complexity is that, to maintain the appearance of snow, all operations must produce smooth edges across their surface (e.g. a sphere cut from a plane should look like exactly that boolean operation). When I initially implemented, the default approach modified the isovalues to their ‘most outside’ value (-1 in TerraVol) which produced ridges in the sphere. To solve this I apply different isovalue modifications within the sphere (full removal) to on at the surface (linearly interpolated towards the outside). With a little tuning of the sphere overall size, this produced the smooth outlines I wanted.

The second complexity is wishing to know which player destroyed what snow. For some game modes, the amount of snow destroyed factors into scoring. For rolling (described below), it is also used to add health to the snowman. I implemented this as a passed delegate approach where some delegates cascade (if both score and health info are needed) and said delegates can be built at the start of the level to save GC load.

Destructive weapons

Rockets (called Carrockets in SnwScf), Mines, future flamethowers, human tanks, etc.
A simple one to understand, they currently all do one sphere destroy operation each.

Snowballs landing

Another simple one, this is one voxel create operation. The value added is tuned to approximate the size of the snowball that has landed.  Small snowballs solidify on landing whereas big, rolling snowballs solidify when their velocity drops too low.

Rolling

Rolling was trickier to implement. For this I wrote a capsule destroy operation. Both sufficiently large snowballs and snowmen with the correct button held can roll. As a snowball/snowman rolls, the snow that has been passed is periodically destroyed along the vector just passed with the radius of the snowball that’s rolling. It had to be the snow ‘just passed’ otherwise the roller would be immediately dropped into the trench it had just cut then stopped or launched into the air by the ramp right in-front of it! The tricky part is tuning this operation so it doesn’t lag too much. It had to be periodically done to make the operation not too heavy on processing. For now, I have it done every certain distance travelled (though janks and bugs can cause long furrows to be cut as the capsule code notices two distant points need connecting!).

Futures

Global warming

I considered a potential future use of ‘smooth’ operation for melting snow. My current voxel system (TerraVol) doesn’t support that so I’d have to write it myself — as such, its low priority meant I chose not to (yet).

Terrain

In many respects, I would *love* to have the landscape be voxel-based as well. I have destructive weapons (rocket launchers, mines, etc) that would benefit from a bit more effect and it would [hopefully] be easy to tie into the existing operations. Sadly, I believe the performance impact would be too great with my existing voxel approach due to causing too much load. The reason for this is a change I made to TerraVol to allow it to create sufficiently detailed snow models — specifically the size of my smallest-snowballs. To do that, I scale the whole TerraVol system by 0.1. This allows nice shaping of snow blobs but means 10 times as much geometry all round. I did a little optimizing knowing I’d be dealing with only snow (including disabling multiple materials and vertex colours disabling mud/grass/rock distinctions).

Were I to switch to voxel-based terrain, I’d probably want similar approaches as those mentioned by a recent poster saying things like heightmaps of importing from Unity Terrain / MapMagic / Gaia.

Streaming / roaming

I have a single-player mode in early development that allows players to roam. I haven’t determined how viable it is yet. TerraVol allows for on-the-fly generation / destruction at a radius from a specified point (the camera’s target area in my case). I have one nit in my modified version of TerraVol that causes the snow operations to not be played-back properly on the voxel space. They’re stored but it’s not replaying them in the proper place yet — something in the scaling is yet messing it up.

So, there you have a summary of voxel usage by SnwScf. Maybe not stereotypical voxel usage but it does make for some pleasant gameplay possibilities.

Here’s hoping @PawfessorFluff has some awesome voxel powers to make it even better!