Thursday, June 14, 2007

Tracking down the SSB Dynarec Bug - Part 2

On Monday I talked about the fragment simulator and how this could be used to help track down bugs in the dynarec implementation. In this post I'm going to talk a bit about a tool I use mostly for regression testing, but also to help determine the exact point at which the fragment simulator and the interpretative core go out of sync. It's a bit of a long post, so apologies in advance :)

Daedalus can be compiled with a flag which enables a special 'synchronisation' mode. This build configuration creates an instance of a synchronisation class which can be initialised in one of two modes - either as a producer or as a consumer. At various points during program execution I pass information about the internal state of the emulator to the synchroniser for processing. In the case of the producer, it simply writes this data out to a file on disk. The consumer is a bit more interesting; it reads data of the required size from disk, and compares this 'baseline' value against the value provided by the emulator. If these two values are found to be different, the synchroniser knows that things have drifted out of sync and it can trigger a breakpoint and drop out into the debugger.

This technique relies on the fact that the emulator is deterministic, i.e. running the emulator twice in a row with the same inputs generates exactly the same results. By 'inputs' this means not just the same rom image, but external inputs such as data from the controller must match exactly too. Obviously pressing buttons on the controller in exactly the same order with the same timings would be impossible to duplicate, so the other function the synchroniser performs is to record input from the pad in the case of the producer, or play input back in the case of the consumer. Other external input, such as calls to timer functions (e.g. time(), QueryPerformanceCounter() or rdtsc) can be synchronised in the same way.

The synchroniser works with as few or as many sync points as you provide. For debugging very simple problems, you can get away with just checking the value of the program counter as each instruction is executed. For more tricky problems you can end up adding many more sync points - for instance you can synchronise the entire register set after every instruction to ensure that the synchroniser catches any instruction which generates a different result from the baseline.

I add sync points to Daedalus using a set of macros. When synchronisation is enabled, the macros expand out to calls to a virtual method on a global instance of the synchroniser class. An example sync point in the code might look like this:

u32 pc = gCPUState.CurrentPC;


OpCode op;
if( CPU_FetchInstruction( pc, &op ) )
CPU_Execute( pc, op );

The interesting line here is the SYNC_POINT macro, which synchronises on the current program counter value. For producers, this just writes the value of 'pc' to disk. For consumers, it checks that the value we have for 'pc' matches the one read from disk.

The DAED_SYNC_REG_PC argument is simply a flag to describe what is being synchronised. Another global constant allows easy control of what is synchronised:

enum ESynchFlags
DAED_SYNC_NONE = 0x00000000,

DAED_SYNC_REG_GPR = 0x00000001,
DAED_SYNC_REG_CPU0 = 0x00000002,
DAED_SYNC_REG_CCR0 = 0x00000004,
DAED_SYNC_REG_CPU1 = 0x00000008,
DAED_SYNC_REG_CCR1 = 0x00000010,

DAED_SYNC_REG_PC = 0x00000020,

static const u32 DAED_SYNC_MASK(DAED_SYNC_REG_PC);

#define SYNCH_POINT( flags, x, msg ) \
if ( DAED_SYNC_MASK & (flags) ) \
CSynchroniser::SynchPoint( x, msg )

If I want to enable more thorough debugging, I can change DAED_SYNC_MASK and OR in more values:


Changing the mask value requires the emulator to be rebuilt from scratch and the baseline synch file to be recreated. This is a bit time consuming but doing it in this way means that the compiler can optimise out any synch points which we aren't interested in, keeping things running as quickly as possible.

One problem with this technique is that the synchroniser can quickly generate a massive amount of data, so much that most of the execution time is spent shifting this data to or from disk, slowing debugging to a crawl. In the example I gave on Monday, it can sometimes take over 500 million instructions before things go out of sync. Even when just synchronising on the program counter, that's over 2GiB of data that needs to be read/written to disk. When you throw in more sync points such as register sets (the GPR registers on their own are around 256 bytes) this can very quickly become impractical. To get around these limitations in Daedalus I gzip the stream of data on the fly which compresses the data significantly. Another trick I use is to hash each register set to a 32bit value and synchronise on this value instead. When using both these techniques the sync files typically end up around 100-200MiB, which is much more manageable.

One of the main uses of this synchronisation code is for regression testing optimisations I've made. I can take a 'known good' build of the emulator and initialise the synchronisation class as a producer to generate a baseline sync file. I can then take a modified version of Daedalus with the optimisations that I want to test, and initialise the synchroniser as a consumer. If the synchroniser detects that things have gone out of sync, then I know that my changes are buggy, and I can investigate why they're not working as planned. It's worth noting that even if everything stays in sync, this isn't a guarantee that my changes are bug-free, but it's a pretty good indication that they're ok.

I also use the synchronisation code to debug tricky dynarec issues. When debugging these types of problems I typically start off by disabling the dynarec engine and setting up the synchroniser to produce a baseline for testing. I'll then re-enable dynarec, but using the fragment simulator with precise interrupt handling (see the end of Monday's post for more on this) and run Daedalus with the synchroniser in consumer mode. Theoretically, as soon as the dynarec code gets out of sync with the interpretative core, the breakpoint triggers and I can investigate things more closely in the debugger.

This is exactly the process I used to track down the Super Smash Bros. bug. When I ran the emulator with the synchroniser in consumer mode, it detected that the program counter was different from the expected baseline value after exactly 387,939,387 instructions had been executed. I'd like to think that an error rate of 2.57e-7% wasn't all that bad, but apparently it is :)

Now that I knew the point at which the emulator was going out of synch, I set a few breakpoints in the emulator to see what exactly was happening. My usual trick is to disassemble the executed instructions just before and after things diverge, and see what's different. Here are snippets from the 'good' and 'bad' logs as things go out of sync:

Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: JAL 0x80131fb0 ?
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80131fb0: ADDIU sp = sp + 0xffd8
Count 171f7c3c: PC: 80131fb4: SW ra -> 0x0024(sp)
Count 171f7c3d: PC: 80131fb8: SW s0 -> 0x0020(sp)
Count 171f7c3e: PC: 80131fbc: CLEAR a0 = 0
Count 171f7c3f: PC: 80131fc0: CLEAR a1 = 0

Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: MTC1 at -> FP06
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80132af0: SWC1 FP06 -> 0x0018(a0)
Count 171f7c3c: PC: 80132af4: LBU v0 <- 0x4ad1(v0)
Count 171f7c3d: PC: 80132af8: ADDIU at = r0 + 0x0008
Count 171f7c3e: PC: 80132afc: BEQ v0 == at --> 0x80132b24
Count 171f7c3f: PC: 80132b00: ADDIU at = r0 + 0x0009

I've highlighted the instruction at which the synchroniser detected the PCs were out of sync. In the good trace (top) the PC is 0x80131fb0, but in the bad trace it's 0x80132af0. If you have particularly sharp eyes, you'll notice something else - two instructions before the code goes out of sync, the good trace executes a jump instruction to 0x80131fb0, but the bad trace is performing a MTC1 op (Move To Coprocessor 1)

This provides a particularly good example of one of the main weaknesses with the synchroniser - it's only as good as the synch points you set up. Because I was just synching on the program counter, it didn't detect the fact that the emulator executed an entirely different opcode two instructions previously. In this particular case I was fortunate in that the real source of the problem was very close to the location identified by the synchroniser, but sometimes the cause and effect can be separated by many thousands of instructions.

Fortunately it's easy enough to add new synch points in the code to detect issues like this, but adding too many synch points causes the emulator to slow to a crawl and makes debugging impractical. I've found the best approach is to start off with as few synch points defined as possible (ideally just the program counter) and slowly introduce more synchpoints as required. This is all very easy to do using the DAED_SYNC_MASK flag discussed above.

Getting back to SSB, it looked like I had found the root cause of the problem - somehow the rom was replacing the instructions in memory, essentially a form of self-modifying code (it's more likely it was just loading a new section of code into RAM from ROM, but it's still essentially self-modifying). The dynarec system was oblivious to these changes and so it ended up trying to execute stale instructions that it had cached when creating the fragment, potentially many thousands of cycles ago.

Dealing with self modifying code in dynamic code generators is generally very tricky. In Daedalus I've been relying on the fact that most roms are well-behaved and flush the instruction cache when they modify memory containing executable code. When I detect a instruction cache invalidate (through the MIPS CACHE opcode) I simply dump the entire contents of the fragment cache and start from scratch. This might sound a little heavy handed, but the way that I link fragments together makes it very hard to unlink small sections of code that has been invalided. Flushing the cache is very quick, safe and has a few advantages such as purging cold traces that are no longer being executed any more.

Ironically, the reason the dynarec was failing to cope with SSB wasn't due to a bug in Daedalus - it was due to a bug in SSB that just never happened to be a problem on a real N64. After updating memory with the new instructions SSB should have been invalidating the instruction cache to ensure that it didn't contain stale code, but for whatever reason it failed to do this. The only reason the rom runs correctly on a real N64 is that by the time it comes to execute the modified instructions, the instruction cache has been refilled a number of times and so the stale instructions are no longer cached.

Even though this isn't Daedalus's bug, it still needs to work around the problem. I'll leave this discussion for a future post though - this one is long enough as it is :)



mkbin said...

wow that sounds like alot of work from what i do with minor debugging but i have to ask if you looked at the z-fighting on most games with shadows and zelda oot the paths of the road are messed up this probily isn't that hard of a bug to fix

j said...

so when is it coming out?!? lol i'm so excited! great job so far!

Xabier said...

dang man, it must have taken for ever to type this up! :P i dont code so this is all confusing to me but sounds interesting.
are there any more optimizations you will be making on R12? i really appreciate your work man, best homebrew/emu ever and is one of the only reasons i have a psp :)


gg said...

That's funny. HAL left in a bug just to mess with your emulator. Classic.

Tatanaz said...

I want to congratulate you.

You have been employed very much at this project and I congratulate you from Chile.

With my friends we are waiting impatiently the date of release of the future version of the emulator, and I want to ask two things you:

1.-If you think about a future to include the way of game ad-hoc in the emulator;

2.-If about the following version you think that there will be more compatibility and performance with games as mario party 3 or kirby.

Thank you very much for this great contribution to the lovers of the retro.

icemaN said...

Glad to see that you found out that the problem was with the rom and not the emulator is there something that u can implement in the emulator that will be able to fix the problem let us know we love the work you are putting in for the community any ideas on when the next update will be ?

Brendan said...

Great work tracking down the bug, when can we see R12, even as a beta? I have a long trip in a couple of days, and would love to see this out! Keep up the good work!

Brian said...

I've been following your blog for about a year now, but haven't said hi, so I thought it was about time. I would venture to guess that I'm the only one here who neither codes nor owns a PSP. :) Although most of what you say flies 30,000 ft over my head, I find your technical expertise and problem-solving skills intriguing. Thanks for the ever-engaging posts.

Cueca said...

how can a bug in the game apper in a emulator, but in the real game, not? :(

Michael said...

Do you know whether it's actually self-modifying (as in, selectivelyish), reading a new routine from ROM (as in a loop from $b0000000), or a PI DMA?
How are you handling PI DMA at the moment when it comes to code fragments?

Dillon said...

Wow, thats really interestingthanks for an updated on the coolest thing on the psp, and for something to read at 1 am =)

Legooolas said...

I have been looking at ways to verify dynarec execution against an emulator, for similar reasons (but I'm just starting out in mine), so this is very helpful! :)

Have you considered running both the emulator and dynarec in a semi-lockstep mode? I am going to try this when I get far enough into getting my dynarec working, as I'm not too worried about performance (yet... it's more of a learning experience and for some researchy ideas I have) and this should avoid the need to have a temporary log and the disk access. Of course it will be a lot slower, due to having to run two complete emulations at once, but this may prove quicker than going via disk?

Kdo4971 said...

Pretty good post, which perfectly illustrates the sentence :
"your code is as good as your tester"

Exophase said...

Self-modifying code is perhaps the most annoying thing to deal with in a dynarec (that is, if you do in fact have to worry about it beyond cache invalidation). A good number of GBA games are very unfriendly in this regard. Actually, if it isn't checked for in general then a large number of games will fail outright.

Of course, to detect it for the general case you have to add code to either check for it or change state somewhere after every write to RAM. In your case, since this happens relatively infrequently, you can probably manage to special case it only for certain writes in certain games.

If you look at Project64's INI file you can see that there are several self-modifying code handling models (many of which probably don't really apply to PSP as it doesn't have an MMU). This alone indicates that it's a real problem for a number of N64 games, although perhaps not the most popular ones.

One thing I do recommend, if you're not already doing this, is to have separate translation caches for code that is in RAM and ROM (and not directly linking fragments between the two). That way you don't need to flush all of the generated code upon an icache invalidation or other detected modification, rather just the code in RAM. You can also go a step further and split things into dynamic and static regions to track modifications and try to minimize flushing overhead. This is only a concern if the game modifies code heavily - if any N64 games are like some GBA games then there will be ones that modify code many times a frame.

I talked about this issue in my blog, if you're interested:

Jeff Wilcox said...

In a word: whoa!

Great job StrmnNrmn. Here's a question for ya: can the SSB rom be 'patched' to work correctly? Maybe not the file itself, but during the loading you have game-specific patches that Daedalus adds on the fly?

Also, on an unrelated note:

Is it possible to implement some kind of down-sampling of the N64's textures in order to save ram? The PSP's screen is tiny (large by handheld standards sure, but still smaller than a TV) and I am wondering if people would notice or care if the textures were reduced.

Keep up the great work man. I gotta say, you're regular posts are motivating. I have a wierd urge to try an Atari Jaquar PSP emulator with JIT dynarec! (okay... it passed)

sheepsama said...

so do you have an estimate on when R12 is going to be released? And great find for spotting the bug ^-^ keep up the great work :]

max said...

thanks for such detailed posts. as an amature i appreicated what i learn from all your posts

Morgan said...

StrmnNrmn so you know I will be setting us a poll at DCEMU as you suggested for which titles are most wanted to be improved on for R13. I'm just taking the liberty in doing this for you so you can concentrate on Daedalus, I know you have enough on your plate as it is so. Anyway good work and I always enjoy reading your posts, keep it up.

StrmnNrmn said...

@michael: It took a short while to figure out exactly how it was modifying the code. I was worried that it was modifying the code via the cpu (i.e. manually updating instructions) but it turned out to be a DMA transfer. I'll post again with a bit more details about how I'm handling this.

StrmnNrmn said...

@Legooolas: Glad you found the post useful :)

I did consider the lockstep method (I believe Zilmar was using this on pj64) but I think the aproach I use in Daedalus is much simpler to get working. The lockstep approach would cause a few issues for the psp as you'd need to have enough RAM for two copies of the emulated RAM etc.

It probably would be quicker than going via disk though, and less hassle to get everything set up.

StrmnNrmn said...

@exophase: Hiya. Most n64 games seem to be pretty well behaved (they generally invalidate caches where required) which is why I've not spotted too many problems in the past.

They don't generally seem to modify code very frequently either - usually just on specific events (e.g. transfer from menus to game etc) rather than on a frame-by-frame basis.

Separate caches is a good idea though, I'll investigate that this weekend.

StrmnNrmn said...

@Jeff: In the past I've gone to the length of scanning the rom for recognisable library functions and patching them (mostly just to help debug what's going on with the rom, but also for optimising a few things like memcpy() etc). There's no reason why I couldn't use the same technique to patch game-specific code which is known to cause problems, but I think this would be pretty labour intensive.

In the end my solution was just to check each DMA transfer directly for an overlap with the fragment cache. This solution should work with all roms that modify code though ROM->RAM DMA, but still won't solve the problem of roms modifying individual instructions directly :(

The N64's textures tend to be pretty tiny anyway - it only had 4KiB of texture memory so 16x16 and 32x32 are pretty typical texture sizes. The ones that I have the most problems with are textures used for backgrounds in menus. Quite often these are really wide and thin - MarioKart's menus use lots of 320x3 textures for instance, which I have to round up to 512x4 for the psp. Down-sampling the textures might help a bit here - I'd have to experiment to see how it looked.

Amusingly I used to have code to support this in the PC build - I can't remember the details, but it was a compatibility fix for certain PowerVR cards which were limited to 256x256 textures IIRC :)

StrmnNrmn said...

@morgan: Thanks - I look forward to seeing the results :)

tyler said...

It bit more lengthy (?) than the last one but I do enjoy reading the posts. They really peak my interests, now only if other books/writings could be this interesting and useful further on in life.
Anywho, take your time with this one. I really want to play SSB, but I will wait for it to be better than barely playable but released often. Btw, do u have a planned release? I'm not looking for anything specific, like maybe a month, 2 weeks, etc.

Andrew said...

I was just thinking while reading that if you read some papers and applied their ideas to your debugging you could get a masters degree out of it haha.

Interesting. Thanks again for posting this.

god0fgod said...

How on earth do you do it? I can't believe your the only one working on daedalus (except some suggestions). Your one hell of a programmer.