Sunday, May 21, 2006

DynaRec status

Just a quick update on the dynarec status, as I know a lot of people are more interested in this than the grizly details of branch delay instructions :)

Last weekend (13/14 May) I managed to assemble the fragment buffers into native x86 code, and execute this dynamically. I spent some time debating whether to target MIPS or Intel initially, but I decided that it would be a lot easier for me to debug the code generation on the PC than it would be to debug code gen on the PSP.

In the end I'm glad I started with the PC as it allowed me to fix a number of hairy problems without going down the torturous path of debugging self modifying code on the PSP with just a few printf() statements to help track down any problems.

To start with on the x86 code generation, all I did was convert my fragment simulator loop directly into assembly. So the generated code for each instruction in the fragment looked something like this:


set current pc
set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler



So a fragment with 100 instructions in would have this block repeated 100 times (with different pc, delay flag constants etc).

This generated a lot of assembly (i.e. 200KB or so of N64 instructions would generate over 4MB of x86 assembly, i.e. an expansion of around 2000%). At this point I wasn't interested in performance though - I wanted to make sure that I was preserving the behaviour of the fragment simulator as much as possible. The exception and branch handlers mentioned in the pseudo code above warrant more detailed description, but I'll leave that for another post.

At this point I spent a few hours debugging, but generally everything was working pretty well. The emulator was currently running around the same speed with 'dynarec' enabled as with it disabled (I use quotes because there isn't really much 'dynamic' about this code yet).

I spent the rest of the weekend and the early part of last week trying to optimise the generated assembly and see what I could get away with removing. One of the first things to go was setting the program counter before executing each instruction. The only instructions that need to know this tend to be branching instructions (which are relative and need to know the current PC to work out their target address) and other instructions that can potentially fire exceptions. The vast majority of instructions don't need to know the current PC though (e.g. arithmetic ops, logical ops etc).

Next I had a look at reworking things so I only needed to explicitly set the branch delay flag if a branch delay slot was actually active. I made the precondition that the branch delay slot was always clear, and explicitly set/cleared it when I knew the state needed to change.

Finally I removed exception handling from all the instructions I knew to be safe. For instance, I know ANDI (and immediate) can never throw an exception. As I only perform counter updates at the end of the block, an exception can never be fired when executing this instruction.

After all these changes I had an instruction execution block which looked something like this:


if ( pc needed ) set current pc
if ( branch delay instruction )set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( can throw exception and exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler


This meant that the vast majority of instructions looked as follows:


get op code in ECX
call handler for specified op code (from R4300.cpp)


So I had nice big fragments like this:


...
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
...


Essentially I removed the vast majority of the instruction fetch/decoding overhead from the emulation.

With this version of the dynarec, 200KB of N64 code was now generating just 2MB of x86 assembly (i.e. an expansion ratio of around 1000%). The PC version was running around 60% faster with dynarec enabled than with it disabled, which is a pretty significant speedup (although this is still very early in the process).

What's also important is that this is before I've done any real optimisation of the generated code. For each instruction I'm still calling the generic instruction handler which has the overhead of figuring out which source registers to use, which register is the destination etc. The *real* speedup comes from generating code to handle op codes explicitly, as you remove all this decoding overhead along with the overhead of jumping to another function. Once you've removed most of the generic instruction handling you can start looking at caching register values to minimise the amount of memory that's being moved around.

With the PC version up and running fairly successfully, I've spent this weekend getting the PSP code generation working. I don't want to go into too many details (as I want to go into more depth in future posts), but I know people are keen to hear some news about how this is going.

I got the basic code generation working on Saturday morning (thankfully I'd already resolved most of the tricky issues in developing the x86 version the previous weekend). I spent most of Saturday afternoon fixing some really horrible instruction cache related bugs. I'm still not 100% sure I've fixed them, but it seems very stable at the moment. At the moment I'm at the same stage with the PSP version of the dynarec that I was with the PC version last weekend - the code generation is running fine (and executing on the PSP without crashing more importantly :) but I've only just started looking at optimising things. It's still too early to speculate on numbers for the performance improvement it will give. Currently it's running around 10% faster with dynarec enabled, but it's still very early days.

More soon.

-StrmnNrmn

5 comments:

Unknown said...

Thanks for the update.
Once again, some very interesting reading material :)

Though at times like these, I wish I had a better grasp of programming and the terminology :\

Although I couldn't quite follow all of what you wrote, your pseudo code (and some quick wikipedia lookups) helped out pretty well.

Anyway, you seem to be making good progress. Please do keep up this amazing work :)

I'm looking forward here to reading your next update (especially on the branch and exception handlers, which I still dont understand as much as I'd like to)

Keep it up mate!

-LaMa

Laxer3A said...

Hi from Laxer3A (SnesTYL).

Glad to see that you have a lot of energy to spend :-)

What you are generating actually seems more like "threaded code" to me than actually a real dynarec.

Anyway keep up the good work.

Its going to get tough when remapping each mips instruction into PSP mapped mips instruction.

The nice stuff would be to have all ROM/RAM read directly mapped to the buffer you have allocated in the emu, use the cpu register directly also.

And just backup/restore them when exiting/entering a "code chunk".

Anyway I will follow your blog with excitment.

_Psycho said...

As always very interesting, especially the branch delay post. Myself been reading about mips recently and I didn't knew that part yet.

It's very nice also to see how you going to do it and the progress so far. I have no doubt this is going to boost the speed a lots that way. Can't wait to see more results. (and the "going to explain this in next post").

StrmnNrmn said...

Hi Laxer3A - hope you don't mind me quoting you in the latest post, but I think you made a good point here that I wanted to expand upon.

Laxer3A said...

No, of course not ! No problem.

I am thinking that you probably want to go this way with a true dynarec anyway.

Which would go with the following steps :
- Remove the call and directly "copy paste" the called function code instead.
- Remove the emulated cpu register load and backup(except at the beginning and end of chunk).
- Analyze the instruction, adressing mode... One can assume that a LOAD that access a register will not be the same as a LOAD that access the memory,etc... Basically look at the type of the memory getter/setter and replace it with a nice LOAD/STORE directly mapped to the emulator memory.
- Once a whole chunk of code has been generated (=between two conditionnal instructions),
a global "optimizer" could analyze the code and make it more efficient : parse the whole sequ of instruction, make a tree/graph of the code (Data->Instruction->Data) and create filter to optimize such, then regenerate the assembly code.

Anyway, how far do you want to go ?