Tuesday, July 12, 2016

A programming vignette

This doesn't rise to the level of a "war story", but it does illustrate the importance of testing, checking your assumptions, and not taking anything for granted.

The cave environment in Krag Vargenstone is implemented using a tiling system, where individual graphics are duplicated over and over to draw the floor, the walls, and most of the scenery the player will encounter.  (We are using the excellent SpriteTile framework to make our job easier; I highly recommend it if you are planning to make a similar game.)  Most of the level is drawn in a level editor, but the player, the enemies, and the objects are placed as Unity GameObjects.  Sometimes a tile will be used as a placeholder to represent an object's location, but is then replaced with a standard floor tile before the level loads.

I was investigating a graphical glitch where blank lines would occasionally appear in the gaps between tiles.  Having made a number of changes, I hit Play to test things, only for Unity to go completely unresponsive.  Not a single control would work, and I had to force-quit Unity from Windows.

A search of my code revealed no endless loops that could have caused the problem.  I narrowed down the changes I had made during that session -- often having to go through the force-quit and restart routine -- and eventually found that changing the size of the tiles in the level editor would reliably reproduce the hang.  This didn't smell right, but I had pursued that particular thread as far as it would go, so I fired off an email to the SpriteTile guy and turned in for the night.

One of the inconvenient realities of programming is that tracing a bug to a certain part of the code does not mean that you've found the source of the problem.

The following day, I was able to look at the problem with fresh eyes.  Not surprisingly, the SpriteTile guy was unable to reproduce the error.  This called for a more intensive diagnostic strategy: if isolating a specific code change -- or point in the code history -- is unfruitful, the next step is to isolate a specific code statement -- or point in the code flow.  In practice, this involves ripping out or disabling all the features, and then adding them back in one by one.  Rather messy.
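
As a rough illustration of the "add them back one by one" routine, here is a minimal Python sketch. It is not the project's code: the function name, the feature labels, and the `fails_with` callback are all hypothetical stand-ins. It also assumes the failure can be detected by a check that returns, which for a real hang would mean something like a timeout (or, in my case, a force-quit and restart).

```python
def first_failing_feature(features, fails_with):
    """Re-enable features one at a time; return the first one whose
    addition makes the failure appear, or None if it never appears.

    features   -- ordered list of feature labels (hypothetical names)
    fails_with -- callback taking the list of enabled features and
                  returning True if the bug reproduces (an assumption:
                  a real hang would need a timeout to detect)
    """
    enabled = []
    for feature in features:
        enabled.append(feature)
        if fails_with(list(enabled)):
            return feature
    return None


# Hypothetical usage: pretend the bug only shows up once the
# object placement code is switched back on.
culprit = first_failing_feature(
    ["tiling", "display", "object placement"],
    lambda enabled: "object placement" in enabled,
)
```

The sketch only blames the feature that was added last, so it works best when the features are reasonably independent; if the bug needs two features together, it will point at whichever of the pair was enabled second.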

This strategy traced the cause to an unlikely place: not the tiling or display code, which I had first suspected, but the object placement code.  And that's where I saw it: an endless loop that I had overlooked in my search the previous day.  It belonged to a hackish section of code that was only present to test a particular new feature.  It was something that I had planned to rewrite and replace with a more permanent solution at some point.

The root cause of the problem was that this temporary code was set up to search for a suitable tile on which to place a treasure chest.  One of the suitability requirements was that the tile had to be off-screen.  It just so happened that changing the size of the tiles shrank the test area just enough so that every candidate tile was within the screen viewport.  The tile picker degenerated into running an endless shell game with no possibility of guessing a valid tile.
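
To make the failure mode concrete, here is a minimal Python sketch of a guess-and-check tile picker. It is a reconstruction of the idea under stated assumptions, not the actual Unity code: the grid dimensions, the rectangular viewport expressed in tile units, and every function name are hypothetical. The guessing version is given a `max_tries` cap so that it returns; without such a cap, it would spin forever whenever the viewport covers the whole test area.

```python
import random


def is_offscreen(tile, view):
    """True if the tile lies outside the viewport rectangle.
    view is (x, y, width, height) in tile units (an assumed layout)."""
    x, y = tile
    vx, vy, vw, vh = view
    return not (vx <= x < vx + vw and vy <= y < vy + vh)


def pick_by_guessing(grid_w, grid_h, view, rng, max_tries):
    """Guess random tiles until one is off-screen.
    The max_tries cap is what keeps this from looping endlessly
    when no tile can ever qualify."""
    for _ in range(max_tries):
        tile = (rng.randrange(grid_w), rng.randrange(grid_h))
        if is_offscreen(tile, view):
            return tile
    return None  # gave up: possibly no valid tile exists


def pick_from_candidates(grid_w, grid_h, view, rng):
    """Collect every valid tile first, then choose among them.
    An empty candidate list is reported instead of searched forever."""
    candidates = [
        (x, y)
        for x in range(grid_w)
        for y in range(grid_h)
        if is_offscreen((x, y), view)
    ]
    return rng.choice(candidates) if candidates else None
```

Either variant turns "no valid tile" into a value the caller has to handle (here `None`) rather than a hang. Enumerating the candidates costs a full pass over the grid, which seems a fair trade for a small test area.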

Another of the inconvenient realities of programming is that sometimes in the course of tracking down and fixing one bug, you discover another bug that has to be found and fixed first.  After all this, I'm back to the point where I still need to fix the blank line glitch.
