Over the past week I've started making plans for what I want to do for R14.
To start with, R13 introduced a couple of issues which I want to fix. Firstly, a number of roms now no longer work with dynarec enabled, or show odd behaviour. For instance, Aerogauge now finishes the race as soon as the countdown completes. I've tracked this down to one of the dynarec optimisations I added in August, where I optimise fragments which jump back to themselves. This should be a 'safe' transformation, so it suggests there's a bug somewhere in my implementation. If I can't fix the bug in time for R14, I'll add a temporary setting to allow this optimisation to be disabled on a rom-by-rom basis (much like the 'dynarec stack optimisation' setting).
Secondly, it looks like something I changed for savestate support has broken the 'return to main menu' option. I added some logic to help ensure that when taking a snapshot for the savestate, the CPU is paused in a 'safe' state (i.e. no dynarec code is executing, nothing is running on the RSP, and nothing is executing in the branch delay slot.) It looks like I've messed something up which is causing the 'return to main menu' option to wait for a safe state before bailing out to the menu. Should be an easy one to fix.
Morgan suggested a nice idea in the comments, which is that I generate a thumbnail for the savestate as it is created to display alongside the slot in the UI. It's a little tricky to implement, as by the time the emulator is told to create a savestate, it has already obliterated the n64's framebuffer with the Daedalus UI. I'll have to do something quite clever, like speculatively copying the n64's framebuffer into system memory every time you enter the Pause Menu, or creating the screenshot on the first frame rendered after saving. Either way, I'd like to add this feature to R14.
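For what it's worth, the first approach wouldn't need much code at all - something along these lines, where the struct and function names are made up purely for illustration and aren't actual Daedalus code:

#include <vector>

typedef unsigned int u32;	// u32 as used throughout Daedalus

// Hypothetical sketch: grab a copy of the N64 framebuffer before the Daedalus
// UI overwrites it, so a thumbnail is available if a savestate is then created.
struct ThumbnailCapture
{
	std::vector<u32>	Pixels;		// pixels copied from the emulated framebuffer
	u32					Width;
	u32					Height;
	bool				Valid;
};

static ThumbnailCapture gPauseMenuThumbnail;

// Called on entering the Pause Menu, while the N64 framebuffer is still intact.
void CaptureThumbnailForSavestate( const u32 * n64_framebuffer, u32 width, u32 height )
{
	gPauseMenuThumbnail.Pixels.assign( n64_framebuffer, n64_framebuffer + width * height );
	gPauseMenuThumbnail.Width  = width;
	gPauseMenuThumbnail.Height = height;
	gPauseMenuThumbnail.Valid  = true;
}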
Next on my list for R14 is to look at making more significant performance improvements. Over the months many people have been asking when I'd get around to implementing audio on the PSP's Media Engine. I've talked about this before, but always kept putting it off in order to work on easier optimisations.
The Media Engine is a bit of unknown territory for me. Even though it's practically identical to the main CPU, you can't just change a setting and suddenly have your code running on it. There are a number of small hurdles I have to overcome before I can get audio working on the ME, but this is my big goal for R14 (I'll save the technical discussion for the next post.) If all goes to plan, this should mean that audio can always be enabled without a significant impact on framerate.
So in summary for R14: a few bug fixes, thumbnails for savestates, and audio without affecting framerate.
-StrmnNrmn
Wednesday, August 15, 2007
Interesting Dynarec Hack
I was playing around with the code generation a couple of evenings ago, and realised that if I made a certain assumption, I could drastically speed up specific types of memory accesses.
When I discussed load/store handling on Sunday, I presented the new code that is typically generated for handling a load such as 'lw $t0, 0x24($t1)' on the N64:
ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
(I'll ignore all the extra code which is generated, and just concentrate on the 5 instructions above which correspond to the expected path of execution.)
Of the 5 instructions that are generated, two - the SLT and BEQ - are just there for performing error handling in the case that the address is invalid, a hardware register (i.e. for memory-mapped I/O), or a virtual address. I'll call this error handling for short.
If we were generating code in an idealised environment where we didn't have to perform error handling, we could drop the SLT/BEQ instructions to get this:
ADDIU a0 = s1 + 0x0024 # add offset to base register
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
We could then optimise this even further and perform the offset calculation directly as a part of the LW instruction:
ADDU a1 = s1 + s7 # add offset to emulated ram
LW s0 <- 0x0024(a1) # load data
In this idealised situation we could reduce an emulated load to just two instructions, with no branches. That's a pretty good saving!
The problem is that the environment we're generating code from is not 'ideal', and it's hard to know ahead of time which memory accesses are going to directly access physical ram, and which are going to access hardware registers or require virtual address translation. For that reason, we have to place a guard around every memory access to make sure that it behaves correctly. At least, that was the way I was thinking until earlier in the week.
What I realised on Monday is that I can make an assumption that lets me remove the error handling code for certain types of load/stores. The assumption is that when the N64 accesses any memory through the stack pointer ($sp) register, the address is always going to be valid, physical memory.
The assumption relies on the fact that most roms don't do anything particularly clever with their stack pointers - the stack pointer gets set up for each thread to point at a valid region of memory, and then the game just runs along, pushing and popping values from it as the code executes. Of course, if the assumption is wrong then the emulator will just crash and grind to a halt in an unpredictable manner :)
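To make the idea concrete, here's a rough sketch of the decision the code generator makes under this assumption. It just prints the pseudo-assembly used in this post rather than emitting real opcodes, and the names are invented for illustration:

#include <cstdio>

// Sketch: choose between the full guarded load sequence and the short
// stack-relative form. 'Emitting' here just prints the pseudo-assembly used
// in this post; the real code generator writes MIPS opcodes.
enum EN64Reg { N64Reg_T1 = 9, N64Reg_SP = 29 };

void EmitLoadWord( EN64Reg base, int offset, bool stack_hack_enabled )
{
	if( stack_hack_enabled && base == N64Reg_SP )
	{
		// Assume $sp always points at valid physical ram: no range check needed.
		printf( "ADDU a1 = s1 + s7         # add offset to emulated ram\n" );
		printf( "LW   s0 <- 0x%04x(a1)     # load data\n", offset );
	}
	else
	{
		printf( "ADDIU a0 = s1 + 0x%04x    # add offset to base register\n", offset );
		printf( "SLT   t0 = (a0<s6)        # compare to upper limit\n" );
		printf( "BEQ   t0 != r0 --> _Trampoline_XYZ123\n" );
		printf( "ADDU  a1 = a0 + s7        # add offset to emulated ram\n" );
		printf( "LW    s0 <- 0x0000(a1)    # load data\n" );
	}
}

int main()
{
	EmitLoadWord( N64Reg_SP, 0x14, true );		// $sp-relative: just two instructions
	EmitLoadWord( N64Reg_T1, 0x24, false );		// anything else: full guarded sequence
}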
It was straightforward to add a hack to the code generation to exploit this kind of behaviour, and the results have been better than I expected - I'm seeing at least a 10% speed up, and the code expansion factor (the ratio of generated bytes of instructions to input bytes) has dropped from around 5.0x to 4.0x. Stability has been excellent too - I've run about 8 roms with the hack so far, and all of them have run perfectly.
I think one of the reasons the hack has such an impact is that a lot of the memory accesses in a typical C program are through the stack. Here's an example snippet from the entry to a function on the N64, where the compiler emitted code to store the return address and arguments:
SW ra -> 0x0014(sp)
SW a0 -> 0x0058(sp)
SW a1 -> 0x005c(sp)
SW a2 -> 0x0060(sp)
When I look through disassembly for the roms I'm working on, it's very common to see lots of sequential loads/stores relative to the stack pointer like this.
Previously Daedalus generated around 20 instructions (including 5 branches) for the above snippet. With the hack, the generated code now looks like this:
ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
ADDU t1 = s1 + s7
SW s3 -> 0x0058(t1)
ADDU t1 = s1 + s7
SW s2 -> 0x005c(t1)
ADDU t1 = s1 + s7
SW s5 -> 0x0060(t1)
8 instructions, 0 branches. What's more, it looks like with a little work, I could eliminate 3 redundant address calculations:
ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
SW s3 -> 0x0058(t1)
SW s2 -> 0x005c(t1)
SW s5 -> 0x0060(t1)
Now that would be efficient :)
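Eliminating those redundant calculations could be as simple as remembering which N64 register the temporary currently holds a translated address for. Here's a minimal sketch, again with invented names and printing the pseudo-assembly rather than emitting real code:

#include <cstdio>

// Sketch: remember which N64 base register t1 currently holds a translated
// address for, so consecutive $sp-relative accesses can reuse it.
struct STranslatedBaseCache
{
	int  N64BaseReg;	// which N64 register the temp was translated from
	bool Valid;
};

void EmitSpRelativeStore( STranslatedBaseCache & cache, int base_reg, int offset, const char * value_reg )
{
	if( !cache.Valid || cache.N64BaseReg != base_reg )
	{
		printf( "ADDU t1 = s1 + s7\n" );		// translate the base register once
		cache.N64BaseReg = base_reg;
		cache.Valid      = true;
	}
	printf( "SW   %s -> 0x%04x(t1)\n", value_reg, offset );
}

int main()
{
	STranslatedBaseCache cache = { 0, false };
	const int sp = 29;
	EmitSpRelativeStore( cache, sp, 0x14, "s4" );	// emits the ADDU
	EmitSpRelativeStore( cache, sp, 0x58, "s3" );	// reuses t1
	EmitSpRelativeStore( cache, sp, 0x5c, "s2" );
	EmitSpRelativeStore( cache, sp, 0x60, "s5" );
}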
I still want to do lots of testing with the hack. I want to find out if there are roms that don't work with the hack enabled, and how common a problem it is. It's such a significant optimisation though that I'm certain I'll be adding it as an option in Daedalus R13. The results of my testing will probably determine whether I default it to on or off though.
So far Daedalus R13 is shaping up to be significantly faster than R12. I'm still not sure when I'll be ready to release it, but you'll hear about it here first.
-StrmnNrmn
Labels:
daedalus,
dynarec,
hack,
optimisations,
performance,
R13
Monday, August 13, 2007
Dynarec Improvements
I've had a fairly productive week working on optimising the Dynarec Engine. It's been a few months since I worked on improving the code generation (as opposed to simply fixing bugs), so it's taken me a while to get back up to speed.
At the end of each fragment, I perform a little housekeeping to check whether it's necessary to exit from the dynarec system to handle various events. For instance, if a vertical blank is due this can result in me calling out to the graphics code to flip the current display buffers. The check simply involves updating the N64's COUNT register, and checking to see whether there are any time-dependent interrupts to process (namely vertical blank or COMPARE interrupts.)
I had an idea on the train into work on Monday - I realised that there were a couple of ways in which I could make this more efficient. Firstly, the mechanism I was using to keep track of pending events was relatively complex, involving maintaining doubly-linked lists of events. I realised that if I simplified this code it would make it much easier for the dynarec engine to update and check this structure directly rather than calling out to C code.
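The sort of simplification I have in mind looks something like this - a sketch only (the real structure in Daedalus is more involved), but it shows why a flat countdown is so much easier for generated code to test than a linked list:

#include <cstdint>

// Sketch: keep the next time-based event as a simple countdown that the
// generated epilogue can decrement and test directly, instead of walking a
// doubly-linked list of events from C code. Names are illustrative only.
struct SCpuEventState
{
	int32_t CyclesUntilNextEvent;	// decremented as each fragment retires instructions
};

static SCpuEventState gEventState = { 625000 };

// Conceptually what the epilogue of each fragment does (or inlines).
inline bool CheckForPendingEvents( int32_t cycles_executed )
{
	gEventState.CyclesUntilNextEvent -= cycles_executed;
	return gEventState.CyclesUntilNextEvent <= 0;	// true -> exit dynarec, handle VBL/COMPARE
}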
The other idea I had on the train was to split up the function I was calling to do this testing into two different versions. There are two ways that the dynarec engine can be exited - either through a normal instruction, or a branch delay instruction (i.e. an instruction immediately following a branch.) My handler function catered for both of these cases by taking a flag as an argument. I realised that by providing a separate version of this function for each type I could remove the need to pass this flag as an argument, which saved a couple of instructions from the epilogue of each fragment.
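The second change amounts to little more than this - two specialised entry points wrapping what used to be a single flag-taking handler (the names are invented for the example):

#include <cstdint>
#include <cstdio>

// What the single handler taking a flag might have looked like.
static void HandleFragmentExit( uint32_t exit_pc, bool from_branch_delay_slot )
{
	printf( "exiting dynarec at %08x (branch delay: %d)\n", (unsigned)exit_pc, (int)from_branch_delay_slot );
}

// Two specialised versions: the generated epilogue calls one or the other
// directly, so no flag argument needs to be set up in each fragment.
void HandleFragmentExit_Normal( uint32_t exit_pc )      { HandleFragmentExit( exit_pc, false ); }
void HandleFragmentExit_BranchDelay( uint32_t exit_pc ) { HandleFragmentExit( exit_pc, true ); }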
These two small changes only took a couple of hours to implement, but yielded a 3-5% speedup on the various roms I tested. They also slightly reduced the amount of memory needed for the dynarec system, improving cache usage along the way.
The next significant optimisation I made this week was to improve the way I was handling the code generation for load/stores. Here's what the generated code for 'lw $t0, 0x24($t1)' looks like in Daedalus R12 (assume t1 is cached in s1, and t0 is cached in s0 on the PSP):
ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
ADDU a1 = a0 + s7 # add offset to emulated ram
BNEL t0 != r0 --> cont # valid address?
LW s0 <- 0x0000(a1) # load data
J _HandleLoadStore_XYZ123 # handle vmem, illegal access etc
NOP
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception
There are a couple of things to note here. Firstly, I use s6 and s7 on the PSP to hold two constants throughout execution. s6 is either 0x80400000 or 0x80800000 depending on whether the N64 being emulated has the Expansion Pak installed. s7 is set to be (emulated_ram_base - 0x80000000). Keeping these values in registers prevents me from using them for caching N64 registers, but the cost is far outweighed by the more streamlined code. As it happens, I also use s8 to hold the base pointer for most of the N64 CPU state (registers, pc, branch delay flag etc) for the same reason.
So the code first adds on the required offset. It then checks that the resulting address is in the range 0x80000000..0x80400000, and sets t0 to 1 if this is the case, or clears it otherwise*. It then adds on the offset (emulated_ram_base - 0x80000000) which gives it the translated address on the PSP in a1. The use of BNEL 'Branch Not Equal Likely' is carefully chosen - the 'Likely' bit means that the following instruction is only executed if the branch is taken. If I had used a plain 'BNE', the emulator could often crash dereferencing memory with the following LW 'Load Word'.
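In plain C++ terms, the in-range path is doing something like this (just a sketch - ram_limit stands in for s6, ram_offset for s7, and the function name is made up):

#include <cstdint>

// Sketch of what the generated in-range path computes. ram_limit plays the
// role of s6 (0x80400000 or 0x80800000) and ram_offset the role of s7
// (emulated_ram_base - 0x80000000). Names are invented for illustration.
uint8_t * TranslateLoadStoreAddress( uint32_t n64_address, uint32_t ram_limit, intptr_t ram_offset )
{
	// SLT t0 = (a0<s6) is a signed compare, so only addresses in the range
	// [0x80000000, ram_limit) test true - exactly the check described in the
	// footnote at the end of this post.
	bool in_range = static_cast<int32_t>( n64_address ) < static_cast<int32_t>( ram_limit );
	if( !in_range )
		return nullptr;		// falls through to the trampoline / handler path instead

	// ADDU a1 = a0 + s7: add (emulated_ram_base - 0x80000000) to get the host pointer.
	return reinterpret_cast<uint8_t *>( n64_address + ram_offset );
}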
Assuming the address is out of range, the branch and load are skipped, and control is passed to a specially constructed handler function. I've called it _HandleLoadStore_XYZ123 for the benefit of discussion, but the name isn't actually generated, it's just meant to indicate that it's unique for this memory access. The handler function is too complex to describe here, but it's sufficient to say that it returns control to the label 'cont' if the memory access was performed ok (e.g. it might have been a virtual address), else it bails out of the dynarec engine and triggers an exception.
When I originally wrote the above code I didn't think it was possible to improve it any further. I didn't like the J/NOP pair, but I saw them as a necessary evil. All 'off trace' code is generated in a second dynarec buffer which is about 3MiB from the primary buffer - too far for a branch which has a maximum range of +/-128KiB. I used the BNEL to skip past the Jump 'J' instruction which can transfer control anywhere in memory.
What I realised over the weekend was that I could place a 'trampoline' with a jump to the handler function immediately following the code for the fragment. Fragments tend to be relatively short - short enough to be within the range of a branch instruction. With this in mind, I rewrote the code generation for load and store instructions to remove the J/NOP pair from the main flow of the trace:
ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception
#
# rest of fragment code follows
# ...
_Trampoline_XYZ123:
# handler returns control to 'cont'
J _HandleLoadStore_XYZ123
NOP
The end result is that this removes two instructions from the main path through the fragment. Although in the common case five instructions are executed in both snippets of code, the second example is much more instruction cache friendly as the 'cold' J/NOP instructions are moved to the end of the fragment. I've heard that there is a performance penalty for branch-likely instructions on modern MIPS implementations, so it's nice to get rid of the BNEL too.
As with the first optimisation, this change yielded a further 3-5% speedup.
The final optimisation I've made this weekend is to improve the way I deal with fragments that loop back to themselves as they exit. Here's a simple example:
8018e014 LB t8 <- 0x0000(a1)
8018e018 LB t9 <- 0x0000(a0)
8018e01c ADDIU a0 = a0 + 0x0001
8018e020 XOR a2 = t8 ^ t9
8018e024 SLTU a2 = (r0<a2)
8018e028 BEQ a2 == r0 --> 0x8018e038
8018e02c ADDIU a1 = a1 + 0x0001
8018e038 LB t0 <- 0x0000(a0)
8018e03c NOP
8018e040 BEQ t0 == r0 --> 0x8018e058
8018e044 NOP
8018e048 LB t1 <- 0x0000(a1)
8018e04c NOP
8018e050 BNE t1 != r0 --> 0x8018e014
8018e054 NOP
I'm not sure exactly what this code is doing - it looks like a loop implementing something like strcmp() - but it's one of the most executed fragments of code in the front end of Mario 64.
The key thing to notice about this fragment is that the last branch target loops back to the first instruction. In R12, I don't perform any specific optimisation for this scenario, so I flush any dirty registers that have been cached as I exit, and immediately reload them when I re-enter the fragment. Simplified pseudo-assembly for R12 looks something like this:
enter_8018e014:
	load n64 registers into cached regs
	perform various calculations on cached regs
	if some-condition
		flush dirty cached regs back to n64 regs
		goto enter_8018e038
	perform various calculations on cached regs
	flush dirty cached regs back to n64 regs
	if ok-to-continue
		goto enter_8018e014
exit_8018e014:
	...
enter_8018e038:
	...
The key thing to notice is that we load and flush the cached registers on every iteration through the loop. Ideally we'd just load them once, loop as much as possible, and then flush them back to memory before exiting. I've spent the day re-working the way the dynamic recompiler handles situations such as this. This is what the current code looks like:
enter_8018e014:
	load n64 registers into cached regs
	mark modified regs as dirty
loop:
	perform various calculations on cached regs
	if some-condition
		flush dirty cached regs back to n64 regs
		goto enter_8018e038
	perform various calculations on cached regs
	if ok-to-continue
		goto loop
	flush dirty cached regs back to n64 regs
exit_8018e014:
	...
enter_8018e038:
	...
In this version, the registers are loaded and stored outside of the inner loop. They may still be flushed during the loop, but only if we branch to another trace. Before we enter the inner loop, we need to mark all the cached registers as being dirty, so that they're correctly flushed whenever we finally exit the loop.
This new method is much more efficient when it comes to handling tight inner loops such as the assembly shown above. I still have some work to do in improving my register allocation, but the changes I've made today yield a 5-6% speedup. Combined with the other two optimisations I've described, I'm currently seeing an overall 10-15% speedup over R12.
I'm quite excited about the progress I've made so far with R13. I still have lots of ideas for other optimisations I want to implement for R13 which I'll talk about over the coming days. I don't have any release date in mind for R13 at the moment, so there's no point in asking me yet :)
-StrmnNrmn
*The SLT instruction is essentially doing 'bool inrange = address >= 0x80000000 && address < (0x80000000+ramsize)'. I think the fact that this can be expressed in a single instruction is both beautiful and extremely fortunate :)
Labels:
daedalus,
dynarec,
optimisations,
performance,
R13
Thursday, June 14, 2007
Tracking down the SSB Dynarec Bug - Part 2
On Monday I talked about the fragment simulator and how this could be used to help track down bugs in the dynarec implementation. In this post I'm going to talk a bit about a tool I use mostly for regression testing, but also to help determine the exact point at which the fragment simulator and the interpretative core go out of sync. It's a bit of a long post, so apologies in advance :)
Daedalus can be compiled with a flag which enables a special 'synchronisation' mode. This build configuration creates an instance of a synchronisation class which can be initialised in one of two modes - either as a producer or as a consumer. At various points during program execution I pass information about the internal state of the emulator to the synchroniser for processing. In the case of the producer, it simply writes this data out to a file on disk. The consumer is a bit more interesting; it reads data of the required size from disk, and compares this 'baseline' value against the value provided by the emulator. If these two values are found to be different, the synchroniser knows that things have drifted out of sync and it can trigger a breakpoint and drop out into the debugger.
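Stripped right down, the idea looks something like this - a simplified sketch only (the real CSynchroniser does rather more, and these names are just for illustration):

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// Simplified sketch of the producer/consumer idea. The producer streams
// emulator state to disk; the consumer reads the baseline back and traps as
// soon as the live value differs from it.
class CSimpleSynchroniser
{
public:
	enum EMode { PRODUCER, CONSUMER };

	CSimpleSynchroniser( EMode mode, const char * filename )
		:	mMode( mode )
		,	mFile( fopen( filename, mode == PRODUCER ? "wb" : "rb" ) )
	{
	}
	~CSimpleSynchroniser()	{ if( mFile ) fclose( mFile ); }

	void SynchPoint( const void * data, size_t length )
	{
		if( mMode == PRODUCER )
		{
			fwrite( data, 1, length, mFile );
		}
		else
		{
			std::vector<unsigned char> baseline( length );
			fread( &baseline[ 0 ], 1, length, mFile );
			if( memcmp( &baseline[ 0 ], data, length ) != 0 )
			{
				// Out of sync with the baseline run - this is where Daedalus
				// would trigger a breakpoint and drop into the debugger.
				abort();
			}
		}
	}

private:
	EMode	mMode;
	FILE *	mFile;
};

Synchronising on the program counter would then just be a call like sync.SynchPoint( &pc, sizeof( pc ) ) at each sync point.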
This technique relies on the fact that the emulator is deterministic, i.e. running the emulator twice in a row with the same inputs generates exactly the same results. By 'inputs' I mean not just the same rom image: external inputs such as data from the controller must match exactly too. Obviously pressing buttons on the controller in exactly the same order with the same timings would be impossible to duplicate, so the other function the synchroniser performs is to record input from the pad in the case of the producer, or play input back in the case of the consumer. Other external input, such as calls to timer functions (e.g. time(), QueryPerformanceCounter() or rdtsc) can be synchronised in the same way.
The synchroniser works with as few or as many sync points as you provide. For debugging very simple problems, you can get away with just checking the value of the program counter as each instruction is executed. For more tricky problems you can end up adding many more sync points - for instance you can synchronise the entire register set after every instruction to ensure that the synchroniser catches any instruction which generates a different result from the baseline.
I add sync points to Daedalus using a set of macros. When synchronisation is enabled, the macros expand out to calls to a virtual method on a global instance of the synchroniser class. An example sync point in the code might look like this:
u32 pc = gCPUState.CurrentPC;
SYNCH_POINT( DAED_SYNC_REG_PC, pc );
OpCode op;
if( CPU_FetchInstruction( pc, &op ) )
{
	CPU_Execute( pc, op );
}
The interesting line here is the SYNCH_POINT macro, which synchronises on the current program counter value. For producers, this just writes the value of 'pc' to disk. For consumers, it checks that the value we have for 'pc' matches the one read from disk.
The DAED_SYNC_REG_PC argument is simply a flag to describe what is being synchronised. Another global constant allows easy control of what is synchronised:
enum ESynchFlags
{
DAED_SYNC_NONE = 0x00000000,
DAED_SYNC_REG_GPR = 0x00000001,
DAED_SYNC_REG_CPU0 = 0x00000002,
DAED_SYNC_REG_CCR0 = 0x00000004,
DAED_SYNC_REG_CPU1 = 0x00000008,
DAED_SYNC_REG_CCR1 = 0x00000010,
DAED_SYNC_REG_PC = 0x00000020,
DAED_SYNC_FRAGMENT_PC = 0x00000040,
};
static const u32 DAED_SYNC_MASK(DAED_SYNC_REG_PC);
#define SYNCH_POINT( flags, x, msg ) \
if ( DAED_SYNC_MASK & (flags) ) \
CSynchroniser::SynchPoint( x, msg )
If I want to enable more thorough debugging, I can change DAED_SYNC_MASK and OR in more values:
static const u32 DAED_SYNC_MASK(DAED_SYNC_REG_PC|DAED_SYNC_REG_GPR);
Changing the mask value requires the emulator to be rebuilt from scratch and the baseline synch file to be recreated. This is a bit time consuming but doing it in this way means that the compiler can optimise out any synch points which we aren't interested in, keeping things running as quickly as possible.
One problem with this technique is that the synchroniser can quickly generate a massive amount of data, so much that most of the execution time is spent shifting this data to or from disk, slowing debugging to a crawl. In the example I gave on Monday, it can sometimes take over 500 million instructions before things go out of sync. Even when just synchronising on the program counter, that's over 2GiB of data that needs to be read/written to disk. When you throw in more sync points such as register sets (the GPR registers on their own are around 256 bytes) this can very quickly become impractical. To get around these limitations in Daedalus I gzip the stream of data on the fly which compresses the data significantly. Another trick I use is to hash each register set to a 32bit value and synchronise on this value instead. When using both these techniques the sync files typically end up around 100-200MiB, which is much more manageable.
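Hashing a register set down to 32 bits is straightforward - something along these lines would do (I'm using FNV-1a here purely as an example of a cheap 32-bit hash, and gGPR is just a stand-in name for the emulated register file):

#include <cstddef>
#include <cstdint>

// Sketch: collapse a register set to a single 32-bit value and synchronise on
// that instead of the raw bytes. FNV-1a is shown purely for illustration.
uint32_t HashRegisterSet( const void * regs, size_t length )
{
	const uint8_t * p    = static_cast<const uint8_t *>( regs );
	uint32_t        hash = 2166136261u;		// FNV offset basis

	for( size_t i = 0; i < length; ++i )
	{
		hash ^= p[ i ];
		hash *= 16777619u;					// FNV prime
	}
	return hash;
}

// e.g. SYNCH_POINT( DAED_SYNC_REG_GPR, HashRegisterSet( gGPR, sizeof( gGPR ) ) );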
One of the main uses of this synchronisation code is for regression testing optimisations I've made. I can take a 'known good' build of the emulator and initialise the synchronisation class as a producer to generate a baseline sync file. I can then take a modified version of Daedalus with the optimisations that I want to test, and initialise the synchroniser as a consumer. If the synchroniser detects that things have gone out of sync, then I know that my changes are buggy, and I can investigate why they're not working as planned. It's worth noting that even if everything stays in sync, this isn't a guarantee that my changes are bug-free, but it's a pretty good indication that they're ok.
I also use the synchronisation code to debug tricky dynarec issues. When debugging these types of problems I typically start off by disabling the dynarec engine and setting up the synchroniser to produce a baseline for testing. I'll then re-enable dynarec, but using the fragment simulator with precise interrupt handling (see the end of Monday's post for more on this) and run Daedalus with the synchroniser in consumer mode. Theoretically, as soon as the dynarec code gets out of sync with the interpretative core, the breakpoint triggers and I can investigate things more closely in the debugger.
This is exactly the process I used to track down the Super Smash Bros. bug. When I ran the emulator with the synchroniser in consumer mode, it detected that the program counter was different from the expected baseline value after exactly 387,939,387 instructions had been executed. I'd like to think that an error rate of 2.57e-7% wasn't all that bad, but apparently it is :)
Now that I knew the point at which the emulator was going out of synch, I set a few breakpoints in the emulator to see what exactly was happening. My usual trick is to disassemble the executed instructions just before and after things diverge, and see what's different. Here are snippets from the 'good' and 'bad' logs as things go out of sync:
Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: JAL 0x80131fb0 ?
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80131fb0: ADDIU sp = sp + 0xffd8
Count 171f7c3c: PC: 80131fb4: SW ra -> 0x0024(sp)
Count 171f7c3d: PC: 80131fb8: SW s0 -> 0x0020(sp)
Count 171f7c3e: PC: 80131fbc: CLEAR a0 = 0
Count 171f7c3f: PC: 80131fc0: CLEAR a1 = 0
Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: MTC1 at -> FP06
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80132af0: SWC1 FP06 -> 0x0018(a0)
Count 171f7c3c: PC: 80132af4: LBU v0 <- 0x4ad1(v0)
Count 171f7c3d: PC: 80132af8: ADDIU at = r0 + 0x0008
Count 171f7c3e: PC: 80132afc: BEQ v0 == at --> 0x80132b24
Count 171f7c3f: PC: 80132b00: ADDIU at = r0 + 0x0009
The instruction at which the synchroniser detected the PCs were out of sync is the one at count 171f7c3b in each trace. In the good trace (top) the PC is 0x80131fb0, but in the bad trace it's 0x80132af0. If you have particularly sharp eyes, you'll notice something else - two instructions before the code goes out of sync, the good trace executes a jump instruction to 0x80131fb0, but the bad trace is performing an MTC1 op (Move To Coprocessor 1).
This provides a particularly good example of one of the main weaknesses with the synchroniser - it's only as good as the synch points you set up. Because I was just synching on the program counter, it didn't detect the fact that the emulator executed an entirely different opcode two instructions previously. In this particular case I was fortunate in that the real source of the problem was very close to the location identified by the synchroniser, but sometimes the cause and effect can be separated by many thousands of instructions.
Fortunately it's easy enough to add new synch points in the code to detect issues like this, but adding too many synch points causes the emulator to slow to a crawl and makes debugging impractical. I've found the best approach is to start off with as few synch points defined as possible (ideally just the program counter) and slowly introduce more synchpoints as required. This is all very easy to do using the DAED_SYNC_MASK flag discussed above.
Getting back to SSB, it looked like I had found the root cause of the problem - somehow the rom was replacing the instructions in memory, essentially a form of self-modifying code (it's more likely it was just loading a new section of code into RAM from ROM, but it's still essentially self-modifying). The dynarec system was oblivious to these changes and so it ended up trying to execute stale instructions that it had cached when creating the fragment, potentially many thousands of cycles ago.
Dealing with self-modifying code in dynamic code generators is generally very tricky. In Daedalus I've been relying on the fact that most roms are well-behaved and flush the instruction cache when they modify memory containing executable code. When I detect an instruction cache invalidate (through the MIPS CACHE opcode) I simply dump the entire contents of the fragment cache and start from scratch. This might sound a little heavy handed, but the way that I link fragments together makes it very hard to unlink small sections of code that have been invalidated. Flushing the cache is very quick and safe, and has a few advantages such as purging cold traces that are no longer being executed.
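The flush itself really is trivial - conceptually no more than this (a sketch with invented types; the real fragment cache has rather more to it):

#include <cstdint>
#include <map>

// Sketch of the 'heavy handed' approach: on any instruction cache invalidate,
// throw the whole fragment cache away rather than trying to unlink individual
// fragments. Types and names are invented for illustration.
struct SFragment
{
	uint32_t EntryPC;
	// ... generated code, links to other fragments, hit counts etc ...
};

static std::map<uint32_t, SFragment> gFragmentCache;	// keyed on N64 entry PC

// Called when the MIPS CACHE opcode invalidates the instruction cache.
void OnInstructionCacheInvalidate()
{
	// Dropping everything guarantees no stale code can run; hot traces are
	// simply recompiled the next time they're executed.
	gFragmentCache.clear();
}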
Ironically, the reason the dynarec was failing to cope with SSB wasn't due to a bug in Daedalus - it was due to a bug in SSB that just never happened to be a problem on a real N64. After updating memory with the new instructions SSB should have been invalidating the instruction cache to ensure that it didn't contain stale code, but for whatever reason it failed to do this. The only reason the rom runs correctly on a real N64 is that by the time it comes to execute the modified instructions, the instruction cache has been refilled a number of times and so the stale instructions are no longer cached.
Even though this isn't Daedalus's bug, it still needs to work around the problem. I'll leave this discussion for a future post though - this one is long enough as it is :)
-StrmnNrmn
Labels:
daedalus,
debugging,
dynarec,
ssb,
synchroniser
Tuesday, June 12, 2007
Tracking down the SSB Dynarec Bug
Yesterday I said I'd provide some more details about the Super Smash Bros. dynarec fix. The actual fix is fairly straightforward, but I thought the process of tracking down the issue was quite interesting and worthy of a couple of blog posts.
When I first started looking at SSB I noted that although the game ran fine without dynarec, it would always hang when trying to enter the main entry with dynarec enabled.
I've been programming professionally for around 6 years now and I can safely say that debugging dynarec bugs is one of the hardest categories of problems I've ever had to work on. For a start, because the code is generated on the fly, you don't have the luxury of source level debugging, and without spending time reverse engineering the original rom image, you don't even know what the generated dynarec code is meant to be doing. It's very much like working blindfolded.
And it gets even worse. I've fixed dynarec problems in the past which were the result of generating incorrect code for a fragment over 500 million instructions into emulation. This would be bad enough, but it can be many thousands of instructions later before the emulation finally diverges from the correct path. Just identifying the exact point at which the emulation starts to diverge from the correct sequence of instructions can be like finding a needle in a particularly large haystack. While blindfolded :)
Over the years of trying to debug problems like these I've built up a set of tools and learned a few tricks along the way which you might find quite interesting. Although I'm going to talk about them in the context of tracking down this dynarec issue, I've found some of the techniques useful in solving other problems so you might find other ways of applying them too.
One of the first things I do when trying to identify a dynarec issue with Daedalus is to see if the problem is reproducible on the PC build of the emulator. Although it is possible to use GDB with PSPLink, I've never got this up and running and I'm much more comfortable debugging with Visual Studio. Also, working with the PC build is usually much faster than working with the PSP build (debug builds run around 10x faster on the PC, and build times are much quicker.)
Not all dynarec issues can be debugged in this way - the PSP and PC builds have different code generation back-ends (i.e. MIPS and x86 code generation respectively) so bugs in the MIPS code generation won't usually be reproducible in the PC build. The dynarec system in Daedalus shares a common frontend (trace selection and recording) between the two platforms, which means that if I can reproduce the problem on both platforms, I can narrow down the likely location of the bug to this area.
Fortunately this particular bug manifested itself in both the PC and the PSP builds, so I knew that if I fixed the bug on the PC build, it should fix the PSP build too. What I needed to find out next is what the emulator was doing differently when dynarec was enabled compared to when it was disabled.
If dynarec is running without errors, then the sequence of executed instructions should exactly match that executed with dynarec disabled. If I could log details about all the instructions executed with dynarec disabled, and again with dynarec enabled, I should be able to compare the two logs to figure out the exact point at which dynarec is going out of sync. This all relies on the fact that the emulator is totally deterministic, i.e. that running the emulator twice in succession with the same settings should give exactly the same results.
Unfortunately, for a variety of reasons my dynarec solution doesn't produce identical results to interpretation, the main one being that for performance I can only handle vertical blank and timer interrupts on the boundaries between fragments. For example, with dynarec disabled, the first vertical blank interrupt might occur exactly on the 625,000th instruction, but with dynarec enabled it might not occur until the 625,015th instruction. This means that the logs diverge at the instant the first VBL fires, and never regain synchronisation.
When I was originally developing the new dynarec system I put a lot of effort into writing a fragment simulator, the idea being that rather than executing the native assembly code for a given trace, I could keep track of the instructions making up the trace and interpret these individually instead. Theoretically fragment simulation is identical to dynarec code execution, even down to the way I handle VBLs and timer interrupts, and it's been very useful at identifying bugs in the dynarec code generation. What's particularly useful about fragment simulation, however, is that I can enable a setting which makes it handle interrupts in exactly the same way as the non-dynarec core, i.e. interrupts are handled precisely rather than on fragment boundaries.
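In outline, fragment simulation is no more complicated than this (a sketch with invented names - the real fragment structure records rather more than the raw ops, and the two extern functions stand in for the interpretative core):

#include <cstdint>
#include <vector>

// Sketch of the fragment simulation idea: a fragment remembers the ops it was
// built from, so they can be interpreted one by one instead of executing the
// generated native code. Names and types are invented for this illustration.
struct SRecordedOp
{
	uint32_t Address;	// N64 PC this op came from
	uint32_t OpCode;	// raw N64 instruction word
};

struct SSimFragment
{
	std::vector<SRecordedOp> Ops;
};

extern void InterpretOp( uint32_t address, uint32_t opcode );	// stand-in for the interpretative core's single-op handler
extern bool CheckInterrupts();									// stand-in: is a VBL or COMPARE interrupt due?

void SimulateFragment( const SSimFragment & fragment, bool precise_interrupts )
{
	for( const SRecordedOp & op : fragment.Ops )
	{
		InterpretOp( op.Address, op.OpCode );

		// Precise mode: handle interrupts after every op, exactly like the
		// interpretative core does.
		if( precise_interrupts && CheckInterrupts() )
			return;
	}

	// Imprecise mode: interrupts are only handled at the fragment boundary.
	if( !precise_interrupts )
		CheckInterrupts();
}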
Essentially Daedalus has four modes of operation:
- Dynarec + fragment execution
- Dynarec + fragment simulation (imprecise interrupt handling)
- Dynarec + fragment simulation (precise interrupt handling)
- Interpretative core
This tool is particularly powerful, because if I can ensure that dynarec+fragment execution is equivalent to dynarec+fragment simulation, and that dynarec+fragment simulation is equivalent to running the interpretative core, then I can use the transitive properties of these relations to ensure that dynarec+fragment execution is equivalent to running the interpretative core. Fragment simulation allows me to bridge the gap between these two modes of operation which would otherwise be very difficult to compare.
I think that's long enough for one post. Tomorrow I'll talk about how I used this technique to help track down the SSB dynarec bug.
-StrmnNrmn
When I first started looking at SSB I noted that although the game ran fine without dynarec, it would always hang when trying to enter the main entry with dynarec enabled.
I've been programming professionally for around 6 years now and I can safely say that debugging dynarec bugs is one of the hardest categories of problems I've ever had to work on. For a start, because the code is generated on the fly, you don't have the luxury of source level debugging, and without spending time reverse engineering the original rom image, you don't even know what the generated dynarec code is meant to be doing. It's very much like working blindfolded.
And it gets even worse. I've fixed dynarec problems in the past which were the result of generating incorrect code for a fragment over 500 million instructions into emulation. This would be bad enough, but it can be many thousands of instructions later before this causes emulation finally diverges from the correct path. Just identifying the exact point at which the emulation starts to diverge from the correct sequence of instructions can be like finding a needle in particularly large haystack. While blindfolded :)
Over the years of trying to debug problems like these I've built up a set of tools and learned a few tricks along the way which you might find quite interesting. Although I'm going to talk about them in the context of tracking down this dynarec issue, I've found some of the techniques useful in solving other problems so you might find other ways of applying them too.
One of the first things I do when trying to identify a dynarec issue with Daedalus is to see if the problem is reproducible on the PC build of the emulator. Although it is possible to use GDB with PSPLink, I've never got this up and running and I'm much more comfortable debugging with Visual Studio. Also, working with the PC build is usually much faster than working with the PSP build (debug builds run around 10x faster on the PC, and build times are much quicker.)
Not all dynarec issues can be debugged in this way - the PSP and PC builds have different code generation back-ends (i.e. MIPS and x86 code generation respectively) so bugs in the MIPS code generation won't usually be reproducible in the PC build. The dynarec system in Daedalus shares a common frontend (trace selection and recording) between the two platforms, which means that if I can reproduce the problem on both platforms, I can narrow down the likely location of the bug to this area.
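To give a rough idea of what that split looks like, here's a minimal sketch - the class and function names below are invented for illustration rather than taken from the real Daedalus source:

```cpp
// Hypothetical sketch of the frontend/backend split; the names are
// illustrative, not Daedalus's actual interfaces.
#include <cstdint>
#include <vector>

struct TraceOp { uint32_t address; uint32_t opcode; };   // one recorded N64 instruction

// Backend interface - implemented once for x86 (PC build) and once for
// MIPS/Allegrex (PSP build). Bugs here are platform-specific.
class CodeGenerator
{
public:
    virtual ~CodeGenerator() {}
    virtual void EmitOp( const TraceOp & op ) = 0;   // translate one op into native code
    virtual void EmitEpilogue() = 0;                 // return control to the dispatcher
};

// The frontend (trace selection and recording) is shared between platforms,
// so a bug that reproduces on both PC and PSP most likely lives here.
void CompileFragment( const std::vector<TraceOp> & trace, CodeGenerator & backend )
{
    for( const TraceOp & op : trace )
        backend.EmitOp( op );
    backend.EmitEpilogue();
}
```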
Fortunately this particular bug manifested itself in both the PC and the PSP builds, so I knew that if I fixed the bug on the PC build, it should fix the PSP build too. What I needed to find out next was what the emulator was doing differently with dynarec enabled compared to when it was disabled.
If dynarec is running without errors, then the sequence of executed instructions should exactly match that executed with dynarec disabled. If I could log details about all the instructions executed with dynarec disabled, and again with dynarec enabled, I should be able to compare the two logs to figure out the exact point at which dynarec is going out of sync. This all relies on the fact that the emulator is totally deterministic, i.e. that running the emulator twice in succession with the same settings should give exactly the same results.
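Roughly speaking, the logging and comparison boil down to something like this sketch, with a made-up log format and function names:

```cpp
// Hypothetical sketch of the logging side of this - not lifted from Daedalus.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

// Called once per emulated instruction (in both cores) when logging is enabled.
void LogInstruction( std::ofstream & log, uint32_t pc, uint32_t opcode )
{
    char line[ 32 ];
    std::snprintf( line, sizeof( line ), "%08x %08x\n", pc, opcode );
    log << line;
}

// Compare the interpreter and dynarec logs line by line; returns the 1-based
// line number of the first divergence, or 0 if the logs agree.
long FindFirstDivergence( const char * interpreter_log, const char * dynarec_log )
{
    std::ifstream a( interpreter_log );
    std::ifstream b( dynarec_log );
    std::string   line_a, line_b;
    long          line_no = 0;

    while( std::getline( a, line_a ) && std::getline( b, line_b ) )
    {
        ++line_no;
        if( line_a != line_b )
            return line_no;
    }
    return 0;   // identical, at least as far as the shorter log goes
}
```

Logging register values as well as the program counter would make a divergence easier to diagnose, at the cost of much larger log files.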
Unfortunately, for a variety of reasons my dynarec solution doesn't produce identical results to interpretation, the main one being that, for performance, I can only handle vertical blank and timer interrupts on the boundaries between fragments. For example, with dynarec disabled, the first vertical blank interrupt might occur exactly on the 625,000th instruction, but with dynarec enabled it might not occur until the 625,015th instruction. This means that the logs diverge at the instant the first VBL fires, and never regain synchronisation.
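To make the timing difference concrete, here's a hedged sketch of the two strategies - the counter and function names are invented, not Daedalus's own:

```cpp
// Hypothetical sketch of why the interrupt timing differs between the cores.
#include <cstdint>

static int32_t gCyclesUntilVBL = 625000;    // counts down to the next vertical blank

void InterpretOneInstruction();             // assumed to exist elsewhere
void ExecuteFragmentNative( int fragment_length );
void HandleVerticalBlank();

// Interpreter: the counter is checked after every single instruction, so the
// VBL fires on exactly the 625,000th instruction in every run.
void InterpreterStep()
{
    InterpretOneInstruction();
    if( --gCyclesUntilVBL <= 0 )
        HandleVerticalBlank();
}

// Dynarec: a whole fragment runs to completion before the counter is checked,
// so the VBL can slip a few instructions past the 'exact' point (e.g. 625,015).
void DynarecStep( int fragment_length )
{
    ExecuteFragmentNative( fragment_length );
    gCyclesUntilVBL -= fragment_length;
    if( gCyclesUntilVBL <= 0 )
        HandleVerticalBlank();
}
```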
When I was originally developing the new dynarec system I put a lot of effort into writing a fragment simulator, the idea being that rather than executing the native assembly code for a given trace, I could keep track of the instructions making up the trace and interpret these individually instead. Theoretically fragment simulation is identical to dynarec code execution, even down to the way I handle VBLs and timer interrupts, and it's been very useful at identifying bugs in the dynarec code generation. What's particularly useful about fragment simulation however is that I can enable a setting which makes it handle interrupts exactly in the same way as the non-dynarec core, i.e. interrupts are handled precisely rather than on fragment boundaries.
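The core idea of fragment simulation looks roughly like the sketch below, with hypothetical structures and handler names standing in for the real implementation:

```cpp
// Hypothetical sketch of fragment simulation.
#include <cstdint>
#include <vector>

struct TraceOp { uint32_t address; uint32_t opcode; };   // one recorded instruction

void InterpretOp( uint32_t address, uint32_t opcode );   // the existing interpreter handler
bool InterruptPending();
void HandlePendingInterrupt();

void SimulateFragment( const std::vector<TraceOp> & ops, bool precise_interrupts )
{
    for( const TraceOp & op : ops )
    {
        // Interpret the recorded op rather than executing the generated native code.
        InterpretOp( op.address, op.opcode );

        // Precise mode mimics the non-dynarec core: check after every instruction.
        if( precise_interrupts && InterruptPending() )
        {
            HandlePendingInterrupt();
            return;                           // bail out of the fragment early
        }
    }

    // Imprecise mode mimics real dynarec execution: only check at the boundary.
    if( !precise_interrupts && InterruptPending() )
        HandlePendingInterrupt();
}
```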
Essentially Daedalus has four modes of operation:
- Dynarec + fragment execution
- Dynarec + fragment simulation (imprecise interrupt handling)
- Dynarec + fragment simulation (precise interrupt handling)
- Interpretative core
This tool is particularly powerful, because if I can ensure that dynarec+fragment execution is equivalent to dynarec+fragment simulation, and that dynarec+fragment simulation is equivalent to running the interpretative core, then I can use the transitive properties of these relations to ensure that dynarec+fragment execution is equivalent to running the interpretative core. Fragment simulation allows me to bridge the gap between these two modes of operation which would otherwise be very difficult to compare.
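A rough sketch of how those pairwise checks fit together is below; the mode names and helper functions are invented for the example, and FindFirstDivergence is the log comparison sketched earlier:

```cpp
// Hypothetical sketch of the pairwise equivalence checks.
#include <string>

enum class CoreMode
{
    DynarecExecute,             // native fragment execution
    DynarecSimulateImprecise,   // fragment simulation, interrupts at fragment boundaries
    DynarecSimulatePrecise,     // fragment simulation, interrupts after every instruction
    Interpreter,                // the original interpretative core
};

std::string RunAndLog( CoreMode mode );     // run the rom in the given mode, return the log path
long FindFirstDivergence( const char * log_a, const char * log_b );

bool VerifyDynarec()
{
    // Native execution vs imprecise simulation: identical interrupt timing,
    // so any divergence points at the code generation back-end.
    bool codegen_ok = FindFirstDivergence( RunAndLog( CoreMode::DynarecExecute ).c_str(),
                                           RunAndLog( CoreMode::DynarecSimulateImprecise ).c_str() ) == 0;

    // Precise simulation vs the interpreter: identical interrupt timing again,
    // so any divergence points at trace selection and recording.
    bool frontend_ok = FindFirstDivergence( RunAndLog( CoreMode::DynarecSimulatePrecise ).c_str(),
                                            RunAndLog( CoreMode::Interpreter ).c_str() ) == 0;

    // The two simulation modes share all of their code apart from the interrupt
    // setting, which is the link that makes the transitivity argument work.
    return codegen_ok && frontend_ok;
}
```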
I think that's long enough for one post. Tomorrow I'll talk about how I used this technique to help track down the SSB dynarec bug.
-StrmnNrmn
Sunday, June 10, 2007
Super Smash Bros - Dynarec Update
This is just a quick update to let everyone know I've finally figured out why the dynarec wasn't working in Super Smash Bros. The problem has taken a lot longer to identify than I'd hoped - in part because it was a particularly tricky bug but also because I've not had as much time to work on Daedalus recently as I would have liked.
Anyway, I managed to spend a few hours this weekend isolating the problem, and after a little experimentation I've been able to come up with a temporary workaround. With the fix in place SSB is running at around 30-40fps in game on the PSP, which is very exciting.
Now that I've identified the problem my next job is to come up with a permanent, robust solution to help fix similar problems in other roms. I also want to add some improved checks in the debug build to help spot other situations where this problem arises.
For those that are interested I'll post an update shortly (within the next day or so) with some of the technical details.
-StrmnNrmn
Tuesday, May 15, 2007
Dynarec Fixes
In the previous update I mentioned that the dynarec wasn't working with Goldeneye:
...I can still only get the rom running with dynarec disabled on the PSP, so the framerate is currently just 6-7 fps. Interestingly it runs fine with dynarec on the PC with the same set of changes, so this indicates a bug somewhere in the PSP dynarec implementation.
I've spent a couple of evenings investigating this problem, and I've finally had a bit of luck in determining the root cause. As I suspected it was due to a bug in the PSP dynarec implementation - it turns out I was applying the address adjustment that compensates for the PSP's different endianness twice (which has essentially the same effect as not compensating at all).
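As a rough illustration of how compensating twice cancels out, consider the common trick of XOR-ing the low address bits on sub-word accesses when emulating a big-endian machine on a little-endian host - this is a generic example rather than the actual Daedalus code:

```cpp
// Generic illustration: on a little-endian host it's common to store the
// big-endian N64 RAM as native words and XOR the low address bits on
// sub-word accesses.
#include <cstdint>

static inline uint32_t SwizzleByteAddress( uint32_t address )
{
    return address ^ 3;     // endianness compensation for an 8-bit access
}

uint8_t ReadByte( const uint8_t * ram, uint32_t offset )
{
    return ram[ SwizzleByteAddress( offset ) ];     // correct: compensation applied once
}

uint8_t ReadByteBuggy( const uint8_t * ram, uint32_t offset )
{
    // Applying the same fix-up twice cancels out: (offset ^ 3) ^ 3 == offset,
    // which is exactly the same as not compensating at all.
    return ram[ SwizzleByteAddress( SwizzleByteAddress( offset ) ) ];
}
```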
The fix was very easy to implement (just a single line of code). It's quite a significant bug, so I'm hoping the fix will sort out dynarec issues with a few other roms too. Unfortunately it looks like the bug which is preventing dynarec from working with Super Smash Bros. is a different issue, so I'll have to spend some more time investigating that.
I'm pleased to say that Goldeneye is now running fine with dynarec - it boots and runs the intro sequence very quickly. Although the intro and menu run at around 30-60fps, in-game is still very slow (around 5-8fps), so I need to profile this and see what's going on.
As I've hinted above, the next job is to have a look why dynarec is causing Super Smash Bros to hang. A cursory look revealed that it's still broken even when I disable most of the native code generation and directly call the interpretive instruction handlers from the compiled fragment. Because of this I think the problem has something to do with how traces are selected and the resulting fragments are linked together. I'll know more later in the week.
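For anyone curious, here's a hedged sketch of what 'calling the interpretive handlers from the compiled fragment' amounts to - the handler table and emit helper are made up for illustration:

```cpp
// Hypothetical sketch of that debugging mode. If the bug survives this mode,
// the per-instruction code generation is probably innocent, and suspicion
// shifts to trace selection and fragment linking.
#include <cstdint>
#include <vector>

struct TraceOp { uint32_t address; uint32_t opcode; };

typedef void (*InterpretFn)( uint32_t opcode );
extern InterpretFn gInterpretHandlers[ 64 ];          // indexed by the primary op field

class CodeGenerator
{
public:
    void EmitCall( InterpretFn fn, uint32_t opcode ); // emit a native call passing the opcode
};

void CompileFragmentAsInterpreterCalls( const std::vector<TraceOp> & trace, CodeGenerator & gen )
{
    for( const TraceOp & op : trace )
    {
        uint32_t op_field = op.opcode >> 26;          // top 6 bits select the handler
        gen.EmitCall( gInterpretHandlers[ op_field ], op.opcode );
    }
}
```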
-StrmnNrmn