Wednesday, May 31, 2006

Some initial benchmarks

I've been really busy working on the new dynarec engine, so I've not been posting as frequently as I'd like. I've made a lot of progress in the following areas:

  • Most integer arithmetic and logical instructions now implemented (i.e I'm now generating optimised assembly for these instructions rather than calling a generic function to handle them

  • Regsiter caching implemented (although I'm only using a greedy allocation algorithm at the moment, as I've not yet fully implemented the fast linear scan algorithm I talked about in the previous post)

  • I'm directly linking all direct branches to compiled fragments

  • I'm linking to all indirect branch targets

So far I'd say I'm around 40-50% through the work on the dynarec engine.

Now for some stats :) The following table compares the framerates at various points (previous framerate is for the R4 release of Daedalus, current framerate is for my most recent development build):

ScenePrevious Framerate (Hz)Current Framerate (Hz)
Mario Head36
Mario Main Menu1425
Mario Peach Letter6-711
Mario Flyby (under bridge)610
Mario In Game5-69
Mario Kart Nintendo logo1023
Mario Kart Flag611
Mario Kart Menu711
Zelda Nintendo Logo2023
Zelda Start Menu2-34
Zelda Main Menu1013

Overall I'd say the dynarec is currently achieving up to a 100% speedup in the roms I've tested, which I'm very excited about. Mario is certainly starting to feel a lot more playable, and the Mario Kart menus are a lot more responsive now.

I specifically included Zelda in the results because I'm not seeing the same kind of results there, so I need to take a closer look at what's going on there (it's quite possible it's just using a few of the arithmetic and logical ops I've not spent time optimising yet).

A twofold improvement in framerate is pretty good, but I now think I can do a lot better. Here's the list of things I currently have on my 'TODO' list:

  • Fully implement all the remaining integer ops (including all the 64 bit instructions)

  • Finalise implementation of the fast linear scan register allocation algorithm

  • Keep track of 'known' values for specific registers and use this to optimise the generated code (e.g. most of the time the top half of the N64's 64 bit registers is just sign extended from the lower half)

  • Cache the memory location pointer to by the N64 stack pointer (SP) and optimise load/stores using this register as a base pointer

  • Optimise all memory access instructions (currently all the cached registers get flushed for all memory accesses other than LW/SW/LWC1 and SWC1)

  • Detect and optimise 'busy wait' loops (e.g. many roms sit in a tight loop waiting for the next vertical blank interrupt to fire which is just wasting cycles on the PSP)

  • Implement all the branching instructions (I've currently only implemented BNE, BEQ, BLEZ and BGTZ)

  • Implement instructions and register caching for all the cop1 (floating point coprocessor) instructions. (I think this will give a huge speedup.)

Although the list is quite short, there's quite a lot of work there. What I'm quite excited about is that I think these changes will start to provide significant speedups as they're implemented. I don't want to get too far ahead of myself, but I'm starting to feel that certain roms are going to be very playable in the not too distant future.

I'm going to try and release a new version of the emulator soon. Unfortunately it's probably not going to be this weekend (due to various social commitments); towards the end of the following week is more likely. I'd certainly like to get a version released before the World Cup starts and all my free time is taken up watching football :)


Tuesday, May 23, 2006

A few more details

Laxer3A (of SnesTYL fame) made an insightful point in the comments page of the previous post (I hope he doesn't mind me quoting him):

What you are generating actually seems more like "threaded code" to me than actually a real dynarec.

I entirely agree with his description "threaded code" for the way I'm approaching the new dynarec engine. I want to stress tha this is just the early stages. The intention is to develop a stable platform from which I can incrementally move towards what would be considered more conventional dynamic code generation. I want (and now have) a nice solid platform from which I can implement things in chunks of 2-3 hours, which I can easily fit into an evening after work.

So I'm currently in the process of replacing generic calls to handle individual opcodes with specialised code which avoids the function call overhead and can hardcode register locations and immediate values. I can also optimise certain things very well. As an example there is no 'mov' instruction on the MIPS architecture so you quite often see ops like 'or a0, a1, r0' (which just ORs the value in a1 with 0 and stores the result in a0). I can write the code to handle OR to simplify this particular situation by just directly copying the contents of register a1 to a0, and therefore avoid the logical operation altogether.

As another example, here's the code I wrote last night to handle LUI (the operation to load a 16 bit value into the upper half of the low-32 bits of the N64 register):

LUI( PspReg_T0, immediate );
SetVar( &gGPR[rt]._u32[0], PspReg_T0 );

if (immediate >= 0)
SetVar( &gGPR[rt]._u32[1], PspReg_R0 ); // Clear tops bits
SRA( PspReg_T1, PspReg_T0, 0x1f );
SetVar( &gGPR[rt]._u32[1], PspReg_T1 ); // Set top bits

In the code I talked about on Sunday, handling this instruction would involve 13 ops: 2 to load the value of the opcode into the argument register, and another to call the 10 op-long function 'R4300_LUI' (Daedalus's instruction handler for LUI).

With the code above this is reduced to 4 ops in the worst case (if the immediate value is negative), or just 3 ops if the value is positive. Also, there is no branching. To give a speicifc example, this N64 opcode:

LUI at, 0x8034

now causes this PSP code to be generated:

LUI t0, 0x8034
SW t0, 8(s0) ; s0 points to the emulated register set
SRA t1, t0, 0x1f
SW t1, 12(s0)

My intention is to spend the next few days reimplementing the most commonly used opcodes in this way. By that point I think the major overhead will shift from the cost of all of the function calls to the generic handlers to the cost of storing loading the emulated registers each time they're referenced (you'll notice in the snippet above I call SW twice - once for each half of the 64 bit N64 register.)

From previous experience, register caching is where the real speedups come from with dynamic recompilation. Memory accesses are typically an order of magnitude slower than register access so anything I can do to avoid them in the recompiled code will be a huge improvement.

If anyone is curious, I've been reading these two papers on fast register allocation for dynamic code generation:

Linear Scan Register Allocation - Poletto, Sarkar. 1999
A fast, memory-efficient register allocation framework for embedded systems - Thammanur, Pande. 2004


[Edit 23:40 - fix markup]

Sunday, May 21, 2006

DynaRec status

Just a quick update on the dynarec status, as I know a lot of people are more interested in this than the grizly details of branch delay instructions :)

Last weekend (13/14 May) I managed to assemble the fragment buffers into native x86 code, and execute this dynamically. I spent some time debating whether to target MIPS or Intel initially, but I decided that it would be a lot easier for me to debug the code generation on the PC than it would be to debug code gen on the PSP.

In the end I'm glad I started with the PC as it allowed me to fix a number of hairy problems without going down the torturous path of debugging self modifying code on the PSP with just a few printf() statements to help track down any problems.

To start with on the x86 code generation, all I did was convert my fragment simulator loop directly into assembly. So the generated code for each instruction in the fragment looked something like this:

set current pc
set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler

So a fragment with 100 instructions in would have this block repeated 100 times (with different pc, delay flag constants etc).

This generated a lot of assembly (i.e. 200KB or so of N64 instructions would generate over 4MB of x86 assembly, i.e. an expansion of around 2000%). At this point I wasn't interested in performance though - I wanted to make sure that I was preserving the behaviour of the fragment simulator as much as possible. The exception and branch handlers mentioned in the pseudo code above warrant more detailed description, but I'll leave that for another post.

At this point I spent a few hours debugging, but generally everything was working pretty well. The emulator was currently running around the same speed with 'dynarec' enabled as with it disabled (I use quotes because there isn't really much 'dynamic' about this code yet).

I spent the rest of the weekend and the early part of last week trying to optimise the generated assembly and see what I could get away with removing. One of the first things to go was setting the program counter before executing each instruction. The only instructions that need to know this tend to be branching instructions (which are relative and need to know the current PC to work out their target address) and other instructions that can potentially fire exceptions. The vast majority of instructions don't need to know the current PC though (e.g. arithmetic ops, logical ops etc).

Next I had a look at reworking things so I only needed to explicitly set the branch delay flag if a branch delay slot was actually active. I made the precondition that the branch delay slot was always clear, and explicitly set/cleared it when I knew the state needed to change.

Finally I removed exception handling from all the instructions I knew to be safe. For instance, I know ANDI (and immediate) can never throw an exception. As I only perform counter updates at the end of the block, an exception can never be fired when executing this instruction.

After all these changes I had an instruction execution block which looked something like this:

if ( pc needed ) set current pc
if ( branch delay instruction )set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( can throw exception and exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler

This meant that the vast majority of instructions looked as follows:

get op code in ECX
call handler for specified op code (from R4300.cpp)

So I had nice big fragments like this:

get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)

Essentially I removed the vast majority of the instruction fetch/decoding overhead from the emulation.

With this version of the dynarec, 200KB of N64 code was now generating just 2MB of x86 assembly (i.e. an expansion ratio of around 1000%). The PC version was running around 60% faster with dynarec enabled than with it disabled, which is a pretty significant speedup (although this is still very early in the process).

What's also important is that this is before I've done any real optimisation of the generated code. For each instruction I'm still calling the generic instruction handler which has the overhead of figuring out which source registers to use, which register is the destination etc. The *real* speedup comes from generating code to handle op codes explicitly, as you remove all this decoding overhead along with the overhead of jumping to another function. Once you've removed most of the generic instruction handling you can start looking at caching register values to minimise the amount of memory that's being moved around.

With the PC version up and running fairly successfully, I've spent this weekend getting the PSP code generation working. I don't want to go into too many details (as I want to go into more depth in future posts), but I know people are keen to hear some news about how this is going.

I got the basic code generation working on Saturday morning (thankfully I'd already resolved most of the tricky issues in developing the x86 version the previous weekend). I spent most of Saturday afternoon fixing some really horrible instruction cache related bugs. I'm still not 100% sure I've fixed them, but it seems very stable at the moment. At the moment I'm at the same stage with the PSP version of the dynarec that I was with the PC version last weekend - the code generation is running fine (and executing on the PSP without crashing more importantly :) but I've only just started looking at optimising things. It's still too early to speculate on numbers for the performance improvement it will give. Currently it's running around 10% faster with dynarec enabled, but it's still very early days.

More soon.


Branch delay instructions

Well it's been a little longer than I intended since my last post. I've been busy with work, and when I've had free time I've found it hard to drag myself away from the compiler for long enough to post an update :)

In my last post on the subject, I left saying that I was trying to iron out bugs in the fragment 'simulator' and was about to start work on native code generation. I want to pick up from that point, but first I wanted to talk about about branch delay instructions and how they effect the implementation of emulators and the new DynaRec work I've been doing.

The couple of bugs in the fragment simulator were indeed to do with exceptions and interrupts triggering in unexpected places. The problem occurred if an exception was triggered in the branch delay slot following a jump instruction.

In case you're not familiar with the MIPS architecture, after branch instruction has been executed (but before control arrives at the target instruction), the CPU executes the instruction immediately following the branch. So to give an example, this is a block of mips code that calls the function foo( 6, 3 ):

li a0, 6 # load the first argument
jal foo # call foo
li a1, 3 # load the second argument
# the result is returned in v0 here

jr ra # return
add v0, a0, a1 # add the arguments, store result in v0

If you've never read MIPS assembly before this will probably look a little strange.
When we call foo with 'jal foo' (Jump and Link), we don't set the second argument until after the jump to 'foo'! Notice also that we calculate the return value for foo after we return with the Jump Register (JR) instruction.

The reason this code works is because of the branch delay slots. When the call to foo is executed ('jal foo') the CPU keeps going for one more instruction and executes 'li a1, 3'. Control then jumps to the start of 'foo', where the CPU immediately executes 'jr ra', jumping back to where it just came from. Again, the CPU executes the following instruction, calculating the sum of the arguments and returning the result in v0.

Although this seems rather pointless and unnecessarily complicated, it serves a very good purpose. With most modern CPUs the instruction execution is pipelined, which means that the processor breaks down the work into discrete chunks (fetch instruction, decode instruction, execute, commit etc), and executes multiple instructions in parallel. Wikipedia (as always) explains this in a lot more detail.

With many architectures the CPU has to throw away the contents of the pipeline when a branch is executed, as the subsequent instructions may refer to data (register or memory contents) that is invalid until after the call has completed. With certain architectures (including MIPS), the engineers decided that it was wasteful to throw away all this work on every branch, and designed the processor to continue processing the pipeline until the subsequent instruction had completed.

What this means is that when writing code for MIPS processors, you have to be careful that your branch delay slot doesn't have any unintended side effects. For maximum efficiency you also have to be careful to try and do useful work in the branch delay slot (rather than just filling it with a NOP for instance). Normally this isn't an issue as the compiler generates all the code for you, but it's certainly an issue if you're writing assembly. It's also been a very important consideration when I've been writing the code generation for the new DynaRec (I'll cover this in a later post).

So, how do branch delay instructions effect emulators? Although they help pipelined CPUs to improve performance, they require emulators to do bit of extra bookeeping and this slows things down slightly and adds complexity. Every time daedalus interprets an instruction it has to check if a branch was taken, and if so set a flag to indicate that a branch delay instruction is due to be executed. When the branch delay instruction is executed, the flag is cleared and the emulator sets the program counter to the original target of the branch.

This is all fairly straight forward, but complications arise when exceptions fire as a branch delay slot is due to be executed. In the example above, what would happen if an interrupt fired immediately before the branch delay instruction 'li a1, 3' was executed? Normally once the interrupt has been handled, the operating system restores control by jumping to the instruction that was due to be executed, allowing the program to run from that point. If this happened in our example above, a1 would be loaded with the value of 3, but the jump to 'foo' that was originally delayed will never take place. The code would continue running without ever calling 'foo'!

In order to handle this situation, when an exception (or interrupt) is triggered, the CPU sets a flag in the 'cause' register. The operating system keeps track of this flag, along with the program counter where the exception fired and various other bits of information. When it's done handling the exception, the operating system uses this information to allow the CPU to correctly restore control to the code that was originally executing. In our example above, the CPU would see that the branch delay flag is set in the cause register, and restore control to the jal instruction immediately preceeding the instruction where the interrupt occured.

It should be fairly clear that there's quite a lot that must be taken care of by the emulator to make sure that the program executes as intended. This is all fairly simple to keep track of when processing instructions individually, but it becomes more complicated when you start to dynamically recompile the code.

The bugs that I mentioned at the start of this post were caused because I wasn't correctly setting the branch delay flag for instructions causing exceptions in branch delay slots. When I build fragments for the dynamic recompiler I avoid triggering certain types of interrupts (e.g. vertical blank and timer interrupts) in the middle of the fragment to reduce the overhead of having to add the code to handle these situations. Unfortunately there are many other types of exceptions that can occur in the middle of a fragment, such as page faults, I/O completion interrupts etc. It turns out these are very rare, but unfortunately not rare enough to save me from a full day of debugging :(

This was a little more detailed than I was originally planning. Branch delay instructions are quite a simple concept on the surface, but they can cause all sorts of complexities when it comes to dynamic recompilation. Fortunately I feel very confident that I've now fixed all the branch delay related bugs in the dynamic recompiler, so hopefully I shouldn't have to think too much about them again in the near future.

I'll follow up this post with a quick update on the state of the new dynarec engine.


Wikipedia entry on branch delay slots

Saturday, May 13, 2006

Monkey 64 v2.0

I turn my back for two days and PSMonkey has released v2.0 of his emulator! Congrats PSMonkey - keep up the good work!

Wednesday, May 10, 2006

Dynamic Recompilation Progress

kekpsp asked:

Is it possible to emulate a system like the N64 with the system limitations that the PSP poses?

When I first decided to port Daedalus over to the PSP I really didn't know the answer to this. I knew there were some substantial challenges - I'd ported Daedalus to the Xbox a couple of years earlier and quickly discovered that even with 64MB you really didn't have much room to manoeuvre. With the PSP you have even tighter memory constraints (24MB user memory + 2MB vram), a slower processor and gpu.

I think I've pretty much cracked the memory problem. When I added in the rom streaming code I reduced the memory usage for an 8MB rom image down to around 2MB. Larger roms only use fractionally more ram (i.e. a few 100KB or so), so I've managed to free up around another 6MB to use for textures, audio and most importantly the dynarec engine.

The next big challenge is speed. Currently Daedalus is unusably slow - typically 4-5fps max (although there are some roms that freakishly seem to run faster). Dynarec is going to bring about the biggest gains here, but it's too early for me to tell how much of an improvement it's going to bring on the PSP in the long run.

This is probably a good point to give a bit of a progress update on the new dynamic recompiler. I'm at the state where I'm successfully capturing 'hot-traces' from the rom as it runs. In order to work the bugs out of the system, I'm then simulating the execution of these traces to see whether everything is working as expected. It also lets me collect a few stats like how many instructions will end up being executed through the native fragment cache rather than being interpreted, and roughly how much memory is going to be consumed.

The results are looking very encouraging. Firstly, even though I'm not actually executing any native code yet, the emulator runs almost as quickly with the 'simulated' dynarec enabled as it does running entirely through the interpreter. Although this sounds a bit of a backwards step, it's actually quite significant because it means the dynarec engine itself isn't any substantial load to the CPU. I'm hoping this means that when I am actually executing native code, the dynarec engine will only be using a fractional part of the CPU.

The other significant result is that you don't actually need to recompile much code to get a sizable portion of the rom executing natively. In my tests with Mario, typically around 90% of the instructions executed are going through the fragment cache rather than the interpreter. Importantly this is with only around 64,000 instructions in 700-1000 fragments. I think this will mean I'll be able to get away with a 1-2MB code buffer on the PSP.

At the moment I'm still ironing out a couple of bugs with the fragment 'simulator' (mostly to do with exceptions and interrupts occuring in the middle of a fragment). Once that's complete I'm going to start taking a look at taking a few small steps towards generating native code. I'll go over this in more detail in my next few posts.

Sunday, May 07, 2006

Dynamic Recompilation

I'm in the proces of rewriting the dynamic recompilation engine for Daedalus. The previous version just about did the job on the PC, but the approach I originally used had a number of problems when it came to the PSP.

Firstly it was a huge memory hog. As memory is relatively cheap on the PC, the recompilation engine is fairly enthusiastic about recompiling blocks of code almost as soon as it comes across them. I'm not so sure this is such a good strategy now, as thousands of ops are recompiled never to be executed again. On the PSP I just can't afford this memory overhead.

Another big problem with the PC dynarec is that there's currently no way for me to easily flush old fragments of code from the cache to save memory. Again this isn't much of a problem on the PC, but on the PSP I can't afford to have old fragments of code lying around, chewing up valuable memory.

Finally the PC version of the dynarec is sufficiently old, poorly documented and overly complex that it's almost impossible for me to improve and properly debug. In general I'd rather refactor systems than rewrite them from scratch, but I feel there are so many drawbacks with maintaining the old system that it makes sense to look for an alternative solution.

Earlier in the week I started doing a bit of background reading on dynamic recompilation. In the past I've always struggled to find any useful papers on the subject (pehaps because 'dynamic recompilation' seems to be emulator-scene terminology). Googling 'dynamic code generation' turned up this paper:

Dynamo: A Transparent Dynamic Optimization System

Although this is talking about native recompilation (i.e. interpreting and recompiling a binary on the same platform it was developed for), it addresses the issues I raised above. What I find incredibly impressive is that they can improve on the speed of native execution, even with the overhead that they're generating by initially interpreting the code. I feel that if I can use their approach for a new version of Daedalus's dynamic recompiler I'll be able to obtain significant improvements in speed and memory usage over the old version.


Let's try again

I started a blog here a few months ago and, as expected, my enthusiasm shriveled up and died fairly quickly. Now that I've released Daedalus PSP I wanted to have somewhere I could easily post progress updates and other bits and pieces I thought might be interesting.

Anyway, I've written up a short piece about what I'm doing with the dynamic recompilation engine for Daedalus PSP, so I'll start with that.