Wednesday, May 10, 2006

Dynamic Recompilation Progress

kekpsp asked:

Is it possible to emulate a system like the N64 with the system limitations that the PSP poses?


When I first decided to port Daedalus over to the PSP I really didn't know the answer to this. I knew there were some substantial challenges - I'd ported Daedalus to the Xbox a couple of years earlier and quickly discovered that even with 64MB you really didn't have much room to manoeuvre. With the PSP you have even tighter memory constraints (24MB user memory + 2MB vram), plus a slower processor and GPU.

I think I've pretty much cracked the memory problem. When I added in the rom streaming code I reduced the memory usage for an 8MB rom image down to around 2MB. Larger roms only use fractionally more ram (i.e. a few 100KB or so), so I've managed to free up around another 6MB to use for textures, audio and most importantly the dynarec engine.
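To make the streaming idea concrete, here is a minimal sketch of one way to keep only part of a rom resident. It is not the actual Daedalus code: the `RomChunkCache` name, the 16KB chunk size and the least-recently-used eviction policy are all assumptions picked for illustration (128 chunks of 16KB gives the ~2MB budget mentioned above).

```python
from collections import OrderedDict


class RomChunkCache:
    """Hypothetical chunked rom reader: holds at most max_chunks chunks in RAM."""

    def __init__(self, rom_file, chunk_size=16 * 1024, max_chunks=128):
        self.rom_file = rom_file          # any seekable binary file object
        self.chunk_size = chunk_size      # assumed value, not from the post
        self.max_chunks = max_chunks      # 128 * 16KB = 2MB resident
        self.chunks = OrderedDict()       # chunk index -> bytes, oldest first

    def _load_chunk(self, index):
        chunk = self.chunks.get(index)
        if chunk is not None:
            # Cache hit: mark the chunk as most recently used.
            self.chunks.move_to_end(index)
            return chunk
        # Cache miss: read the chunk from storage.
        self.rom_file.seek(index * self.chunk_size)
        chunk = self.rom_file.read(self.chunk_size)
        self.chunks[index] = chunk
        if len(self.chunks) > self.max_chunks:
            # Evict the least recently used chunk.
            self.chunks.popitem(last=False)
        return chunk

    def read(self, offset, length):
        """Read length bytes starting at offset, crossing chunks as needed."""
        out = bytearray()
        while length > 0:
            index, within = divmod(offset, self.chunk_size)
            piece = self._load_chunk(index)[within:within + length]
            if not piece:
                break  # ran off the end of the rom
            out += piece
            offset += len(piece)
            length -= len(piece)
        return bytes(out)
```

With a scheme like this the resident size depends on the cache budget rather than the rom size, which would fit with larger roms only costing a little extra for bookkeeping.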

The next big challenge is speed. Currently Daedalus is unusably slow - typically 4-5fps max (although there are some roms that freakishly seem to run faster). Dynarec is going to bring about the biggest gains here, but it's too early for me to tell how much of an improvement it's going to bring on the PSP in the long run.

This is probably a good point to give a bit of a progress update on the new dynamic recompiler. I'm at the stage where I'm successfully capturing 'hot-traces' from the rom as it runs. In order to work the bugs out of the system, I'm then simulating the execution of these traces to see whether everything is working as expected. It also lets me collect a few stats like how many instructions will end up being executed through the native fragment cache rather than being interpreted, and roughly how much memory is going to be consumed.

The results are looking very encouraging. Firstly, even though I'm not actually executing any native code yet, the emulator runs almost as quickly with the 'simulated' dynarec enabled as it does running entirely through the interpreter. Although this sounds like a bit of a backwards step, it's actually quite significant because it means the dynarec engine itself doesn't put any substantial load on the CPU. I'm hoping this means that when I am actually executing native code, the dynarec engine will only be using a small fraction of the CPU.

The other significant result is that you don't actually need to recompile much code to get a sizable portion of the rom executing natively. In my tests with Mario, typically around 90% of the instructions executed are going through the fragment cache rather than the interpreter. Importantly this is with only around 64,000 instructions in 700-1000 fragments. I think this will mean I'll be able to get away with a 1-2MB code buffer on the PSP.

At the moment I'm still ironing out a couple of bugs with the fragment 'simulator' (mostly to do with exceptions and interrupts occurring in the middle of a fragment). Once that's complete I'm going to start taking a few small steps towards generating native code. I'll go over this in more detail in my next few posts.

9 comments:

Unknown said...

Although console emulation/dynamic recompilation is far away from my field of expertise, it really interests me and I appreciate you sharing your developer perspective.

From what I gather, the results seem very promising.

It's great to hear that you have solved the memory limitation issue.
Hopefully that freed ~6MB will be sufficient for textures, audio and the dynarec.

Hope you don't mind me asking this...
But is there any chance you could speculate (based on your progress/todo list) and give an estimation on what fps you expect when you have a recompiler in place?
Obviously I'm not asking for promises :P
Just what you feel is a 'safe' estimation. 10-15 fps?

Keep up the great work and keep blogging :)

iainmacleod said...

These improvements you're making... will we see you feed these back for the xbox version - as I am sure you are aware, the xbox scene is still very alive.

Javk said...

StrmnNrmn will pass to history as ULTRAHLE did it. Nice work.
I'm an old programmer and I'd like to do some stuff with my PSP. I used to have a PB-1000 which I programmed in HD61700 assembler to work as a word processor. Now I wanna do the same, but as I'm a worker now I'm kind of lost and don't know where to get an SDK to use native code on this amazing piece of art.

_Psycho said...

Very interesting approach, always nice to get the full explanation. Do you also plan a "texture cache dump" in future, to speed up the bit swapping before displaying textures? A kind of sub directory with all textures dumped that you could preload into memory?

Been talking with PSMonkey about that and he told me one emu was already doing that, so you could use high-res textures (not the point here).

Been reading the R4400 Mips manual for a while, I hope to be able to do something in my free time (which is zero atm. ;)

Exophase said...

All sounds pretty good, I'm vaguely thinking of starting a dynarec for PSP as well, sometime in the future (big pipedream though). I've always kinda brushed aside this method of recompilation, that is, interpreting until a code block has been executed several times. From a performance standpoint you've practically proven that it's not worth the extra overhead (since 90% of the code ends up being recompiled), but from a memory usage standpoint it's probably a completely different story, and in this case it looks like a good thing. Too bad it requires a fully functional interpreter alongside the recompiler (I don't know many people who do an emulator without doing an interpreter first but if possible I wouldn't mind...) What are your recompilation stipulations exactly? Is it simply encountering a jump target n times? Keep in mind that whatever low performance overhead that adds right now may be much more significant when the CPU emulation itself takes much less time.

1-2MB though, that sounds pretty good for shoving on the ME's eDRAM. And of course having the code blocks run by the ME (but then this would be a 1.0/1.5 only feature; it probably wouldn't be that hard to support.. maybe). I don't know how much that'd improve performance by itself but maybe you'd be afforded the usage of more registers not reserved by the PSP's OS.

One question, you say that the traced blocks right now are dealing with issues regarding mid-block interrupts and exceptions, do you mean those on the N64 side or the PSP side? For the former I would hope that they can be deferred until the end of the block (I suppose that is the hope of anyone writing a dynarec), however, I suspect that with an interpreter you can begin interpreting once an interrupt draws near; your blocks sound somewhat large (a good thing, of course), but N64's clock speed is pretty high so hopefully interrupts won't be occurring too often, in terms of clock cycles.

Joshua said...

Nice work, although it looks like you have a lot of work ahead of you. Don't you find it annoying that you're writing a dynamic Mips R4300i recompiler for a Mips R4000 (R4400?) processor? I actually think it may be possible to just use some opcodes directly instead of recompiling them, although a few would have to be recompiled, jumps would have to be routed to a function that loads more code and frees the old block, and system calls would also have to be patched. Some addresses would have to be patched too, but it would probably do the job. Apparently the MIPS family's opcode format and opcode listing is almost identical across processors in the same generation. I know I posted this in the other comments, but I thought it could possibly help you.

StrmnNrmn said...

lama: It's still a bit too early for me to want to put any firm figures on expected performance improvements. As a benchmark, the previous dynamic recompiler achieved about a 3-4x speedup, and I feel that this approach has a lot more potential (i.e. it tends to generate much larger fragments, so I should be wasting much less time in 'context switches' as I jump between them).

macleod: I've not tried compiling these new PSP developments for the Xbox, but I'm sure it will work with minimal effort. Certainly I'd expect a few roms that I was struggling with (i.e. the 32MB roms) to work now.

mr jones: Check out http://ps2dev.org/ for everything you need to start writing your own homebrew for the PSP. There is an excellent guide to getting started here: http://www.scriptscribbler.com/psp/tutorials/lesson01.htm

_psycho: I think the problem with this approach is that reading the textures from memory stick would be far slower than decoding them on the fly from memory. Caching them on disk is probably worth consideration on the PC though - it's an interesting idea (certainly as it lets people play around with them :)

StrmnNrmn said...

exophase: From a performance standpoint I'm sure it is worth the effort of delaying recompilation until you've hit a block several times. Although you end up spending 90% of your time executing recompiled code, this is probably only 10-20% of the code that's run. To put it another way, it's the 80-20 rule (e.g. 80% of your time is spent executing 20% of your code). Recompiling the code is a relatively expensive process, so by deferring recompilation until a block has been executed multiple times, you avoid the overhead of recompiling code that never ends up being executed again.

Currently I'm using the same criteria for selecting hot traces as they use in Dynamo, i.e. when an instruction has been the target of a backwards branch 50 times, or any instructions that are exit points from another hot trace. This seems to be working really well so far.
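As a rough illustration of those criteria, here's a minimal sketch of a trace-head selector. The `HotTraceSelector` class and its method names are hypothetical, and it assumes exit targets of an existing trace become trace heads straight away, which is one reading of the description above rather than a confirmed detail of Dynamo or Daedalus.

```python
from collections import defaultdict

HOT_THRESHOLD = 50  # backwards-branch count from the text


class HotTraceSelector:
    """Hypothetical selector deciding where hot-trace recording should start."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counters = defaultdict(int)  # branch target -> times hit
        self.trace_heads = set()          # addresses already selected

    def on_branch(self, branch_pc, target_pc):
        """Called for each taken branch; True means start a trace at target_pc."""
        if target_pc in self.trace_heads:
            return False                  # already has (or is getting) a trace
        if target_pc > branch_pc:
            return False                  # forward branches are not counted
        self.counters[target_pc] += 1
        if self.counters[target_pc] >= self.threshold:
            self.trace_heads.add(target_pc)
            del self.counters[target_pc]  # the counter is no longer needed
            return True
        return False

    def on_trace_exit(self, exit_target_pc):
        """Called for each exit point of a recorded trace (assumed immediate)."""
        if exit_target_pc in self.trace_heads:
            return False
        self.trace_heads.add(exit_target_pc)
        return True
```

Counting only backwards-branch targets keeps the bookkeeping cheap: the interpreter only touches a counter at loop heads, not on every instruction.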

I'm really keen to explore how I can take advantage of the ME. My main thought at the moment is to use it to execute the RSP tasks on there (the RSP is the N64's coprocessor). What I like about this approach is that N64 roms have been developed to exploit the parallelism afforded by the RSP, so I think this would result in much better use of the PSP hardware. But the eDRAM does sound like a nice buffer to use for recompiled code :)

The issues I'm having with interrupts and exceptions firing are on the N64 side. I believe I can defer all of the interrupts until at least the end of the current block (most interrupts are triggered on writes to various hardware registers, e.g. for DMA transfers etc). The only other interrupt that can fire at any point is the Count/Compare interrupt (this fires when the Count register reaches the value in the Compare register, and it's the main mechanism for preemptive multitasking on the N64). As the Count register increments with every executed instruction, theoretically it can trigger at any point during the execution of a fragment. As you suggest, for performance reasons (and to avoid the annoying problems of interrupts occurring mid-fragment) I'm deferring incrementing the Count register until I exit the fragment.
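The deferred Count update can be sketched roughly like this. The `CpuState` class and `exit_fragment` function are made-up names for illustration, and the sketch assumes Count is a 32-bit register that wraps and advances by one per executed instruction, as described above.

```python
MASK32 = 0xFFFFFFFF


class CpuState:
    """Hypothetical slice of emulated CPU state relevant to the timer."""

    def __init__(self, count=0, compare=0):
        self.count = count & MASK32
        self.compare = compare & MASK32
        self.timer_interrupt_pending = False


def exit_fragment(cpu, instructions_executed):
    """Apply the whole fragment's Count increment in one step on exit."""
    # Number of increments until Count next equals Compare, modulo 2^32.
    distance = (cpu.compare - cpu.count) & MASK32
    # If Count would have passed through Compare somewhere inside the
    # fragment, raise the interrupt now instead of mid-fragment.
    if 1 <= distance <= instructions_executed:
        cpu.timer_interrupt_pending = True
    cpu.count = (cpu.count + instructions_executed) & MASK32
```

The interrupt is delivered late by at most one fragment's worth of instructions, which is the trade-off for not checking Count after every instruction.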

For exceptions, I have no option but to handle these immediately when they're fired, otherwise the rom wouldn't execute correctly. Unfortunately exceptions happen very frequently on the N64 (typically TLB refill exceptions etc). I should only have to check for exceptions following specific instructions however (e.g. loads/stores etc).

StrmnNrmn said...

71M: Thanks for that mate :)

fdisk: Theoretically yes but I think this would be a very difficult problem. The N64 shuffles code around as it's running (i.e. it copies chunks of code from rom to ram as it boots and executes). Apparently some roms have self-modifying code too. So although you might be able to 'pre-re-compile' very simple roms (e.g. some of the early n64 homebrew), I doubt you could do this for the vast majority of roms people would actually want to use.

shalted: I find it a little confusing more than annoying really :) I keep getting confused as to whether the opcode I'm looking at is an N64 opcode that I'm decoding or a PSP opcode that I'm generating.
Although I don't think I'll be able to reuse the original opcodes directly, I think some of the generated code will be very similar as I can try and pick the same registers to cache values in. One of the nice things about targeting a Mips processor is that you have plenty of free registers to shuffle things around in. For the original PC recompiler I had far fewer hardware registers to play with which made things much more fiddly.