Monday, August 13, 2007

Dynarec Improvements

I've had a fairly productive week working on optimising the Dynarec Engine. It's been a few months since I worked on improving the code generation (as opposed to simply fixing bugs), so it's taken me a while to get back up to speed.

At the end of each fragment, I perform a little housekeeping to check whether it's necessary to exit from the dynarec system to handle various events. For instance, if a vertical blank is due this can result in me calling out to the graphics code to flip the current display buffers. The check simply involves updating the N64's COUNT register, and checking to see whether there are any time-dependent interrupts to process (namely vertical blank or COMPARE interrupts.)

I had an idea on the train into work on Monday I realised that there were a couple of ways in which I could make this more efficient. Firstly, the mechanism I was using to keep track of pending events was relatively complex, involving maintaining doublely-linked lists of events. I realised that if I simplified this code it would make it much easier for the dynarec engine to update and check this structure directly rather than calling out to C code.

The other idea I had on the train was to split up the function I was calling to do this testing into two different versions. There are two ways that the dynarec engine can be exited - either through a normal instruction, or a branch delay instruction (i.e. an instruction immediately following a branch.) My handler function catered for both of these cases by taking a flag as an argument. I realised that by providing a separate version of this function for each type I could remove the need to pass this flag as an argument, which saved a couple of instructions from the epilogue of each fragment.

These two small changes only took a couple of hours to implement, but yielded a 3-5% speedup on the various roms I tested. They also slightly reduced the amount of memory needed for the dynarec system, improving cache usage along the way.

The next significant optimisation I made this week was to improve the way I was handling the code generation for load/stores. Here's what the generated code for 'lw $t0, 0x24($t1)' looks like in Daedalus R12 (assume t1 is cached in s1, and t0 is cached in s0 on the PSP):


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
ADDU a1 = a0 + s7 # add offset to emulated ram
BNEL t0 != r0 --> cont # valid address?
LW s0 <- 0x0000(a1) # load data
J _HandleLoadStore_XYZ123 # handle vmem, illegal access etc
NOP
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception


There are a couple of things to note here. Firstly, I use s6 and s7 on the PSP to hold two constants throughout execution. s6 is either 0x80400000 or 0x80800000 depending on whether the N64 being emulated has the Expansion Pak installed. s7 is set to be (emulated_ram_base - 0x80000000). Keeping these values in registers prevents me from using them for caching N64 registers, but the cost is far outweighed by the more streamlined code. As it happens, I also use s8 to hold the base pointer for most of the N64 CPU state (registers, pc, branch delay flag etc) for the same reason.

So the code first adds on the required offset. It then checks that the resulting address is in the range 0x80000000..0x80400000, and sets t0 to 1 if this is the case, or clears it otherwise*. It then adds on the offset (emulated_ram_base - 0x80000000) which gives it the translated address on the psp in a1. The use of BNEL 'Branch Not Equal Likely' is carefully chosen - the 'Likely' bit means that the following instruction is only executed if the branch is taken. If I had used a plain 'BNE', the emulator could often crash dereferencing memory with the following LW 'Load Word'.

Assuming the address is out of range, the branch and load are skipped, and control is passed to a specially constructed handler function. I've called it _HandleLoadStore_XYZ123 for the benefit of discussion, but the name isn't actually generated, it's just meant to indicate that it's unique for this memory access. The handler function is too complex to describe here, but it's sufficient to say that it returns control to the label 'cont' if the memory access was performed ok (e.g. it might have been a virtual address), else it bails out of the dynarec engine and triggers an exception.

When I originally wrote the above code I didn't think it was possible to improve it any further. I didn't like the J/NOP pair, but I saw them as a necessary evil. All 'off trace' code is generated in a second dynarec buffer which is about 3MiB from the primary buffer - too far for a branch which has a maximum range of +/-128KiB. I used the BNEL to skip past the Jump 'J' instruction which can transfer control anywhere in memory.

What I realised over the weekend was that I could place a 'trampoline' with a jump to the handler function immediately following the code for the fragment. Fragments tend to be relatively short - short enough to be within the range of a branch instruction. With this in mind, I rewrote the code generation for load and store instructions to remove the J/NOP pair from the main flow of the trace:


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception
#
# rest of fragment code follows
# ...


_Trampoline_XYZ123:
# handler returns control to 'cont'
J _HandleLoadStore_XYZ123
NOP


The end result is that this removes two instructions from the main path through the fragment. Although in the common case five instructions are executed in both snippets of code, the second example is much more instruction cache friendly as the 'cold' J/NOP instructions are moved to the end of the fragment. I've heard that there is a performance penalty for branch-likely instructions on modern MIPS implementations, so it's nice to get rid of the BNEL too.

As with the first optimisation, this change yielded a further 3-5% speedup.

The final optimisation I've made this weekend is to improve the way I deal with fragments that loop back to themselves as they exit. Here's a simple example:


8018e014 LB t8 <- 0x0000(a1)
8018e018 LB t9 <- 0x0000(a0)
8018e01c ADDIU a0 = a0 + 0x0001
8018e020 XOR a2 = t8 ^ t9
8018e024 SLTU a2 = (r0<a2)
8018e028 BEQ a2 == r0 --> 0x8018e038
8018e02c ADDIU a1 = a1 + 0x0001
8018e038 LB t0 <- 0x0000(a0)
8018e03c NOP
8018e040 BEQ t0 == r0 --> 0x8018e058
8018e044 NOP
8018e048 LB t1 <- 0x0000(a1)
8018e04c NOP
8018e050 BNE t1 != r0 --> 0x8018e014
8018e054 NOP


I'm not sure exactly what this code is doing - it looks like a loop implementing something like strcmp() - but it's one of the most executed fragments of code in the front end of Mario 64.

The key thing to notice about this fragment is that the last branch target loops back to the first instruction. In R12, I don't perform any specific optimisation for this scenario, so I flush any dirty registers that have been cached as I exit, and immediately reload them when I re-enter the fragment. Simplified pseudo-assembly for R12 looks something like this:


enter_8018e014:
load n64 registers into cached regs

perform various calculations on cached regs

if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038

perform various calculations on cached regs

flush dirty cached regs back to n64 regs

if ok-to-continue
goto enter_8018e014
exit_8018e014:
...

enter_8018e038:
...


The key thing to notice is that we load and flush the cached registers on every iteration through the loop. Ideally we'd just load them once, loop as much as possible, and then flush them back to memory before exiting. I've spent the day re-working the way the dynamic recompiler handles situations such as this. This is what the current code looks like:


enter_8018e014:
load n64 registers into cached regs
mark modified regs as dirty

loop:
perform various calculations on cached regs

if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038

perform various calculations on cached regs

if ok-to-continue
goto loop

flush dirty cached regs back to n64 regs
exit_8018e014:
...

enter_8018e038:
...


In this version, the registers are loaded and stored outside of the inner loop. They may still be flushed during the loop, but only if we branch to another trace. Before we enter the inner loop, we need to mark all the cached registers as being dirty, so that they're correctly flushed whenever we finally exit the loop.

This new method is much more efficient when it comes to handling tight-inner loops such as the assembly shown above. I still have some work to do in improving my register allocation, but the changes I've made today yield a 5-6% speedup. Combined with the other two optimisations I've described, I'm currently seeing an overall 10-15% speedup over R12.

I'm quite excited about the progress I've made so far with R13. I still have lots of ideas for other optimisations I want to implement for R13 which I'll talk about over the coming days. I don't have any release date in mind for R13 at the moment, so there's no point in asking me yet :)

-StrmnNrmn


*The SLT instruction is essentially doing 'bool inrange = address >= 0x80000000 && address < (0x80000000+ramsize)'. I think the fact that this can be expressed in a single instruction is both beautiful and extremely fortunate :)

28 comments:

Jody said...

So when's Daedalus R13 gonna be out..haha. kiddin'. I just read that whole blog and the only thing I understood was the % of speed-ups. lol. I'm the biggest noob for programming, but I can tell you're making a lot of progress for R13. good job, Stormin'.

Volvagia said...

I always read throught your blogposts even though I mostly don't understand anything at all
:-p.
As always, I get excited over the future of N64 on psp. Really fun to hear about your progress!

Will R13 maybe be the first release in which at least one game works 100% with sound (like mario 64)? I can't wait to try! :-D

Keep the good work up!

Anonymous said...

Great work StrmnNrmn, Hopefully on R13 Mario64 will be playable! (by playable.. i mean at at least 50% framerate)

XG917 said...

o man o man.. 10-15% !! and with further optimizations :D omg man you are the best!

mike03$$$ said...

cool whats going on with media engine and my wrestling games

McLovin said...

Wow, I'm really impressed that even after this project being so old you can still find such a huge speed up. Keep up the good work.

Yoshi Party said...

really great work!!!
But I've a question to u:
Have u ever been asked by Sony, to work for him? Cuz, it seems u now more about the psp then most of his developer^^
U may could improve the psx emulation....
Hope R13 will fix the last sound problem in Mario64^^
and make Paper Mario playable;)

Unknown said...

WOW! i must say im impressed by all your work. and i have been sceptical before.. when you write about 10-15 % speed up once more im beginning to understand that i should not doubt you anymore. i honestly think this emulator can be FULLSPEED one day, and you will really be the best coder for the psp if this project will get there.

YOU ARE FANTASTIC!! MANY THANX FROM SWEDEN :)

Kadnassio said...

Nice improvements. Will R13 mainly be a speed update? What I've been hoping for in the next release is more compatibility and stable sound. My Star Wars and Castlevania collection are ready just waiting to be played. Heh heh. Great work on what you've done so far.

Nico said...

It's great to see your improvements and see that you're still excited with the project.

As many of us, i don't know much about programming, but i still find interesting these "code posts".

Another 10%-15% speedup sounds great! I was wondering, do you feel like you'll find a dead end on the way to fullspeed (in all games)? I mean, is it a matter of improvement releases, or will there be hardware limits unachieveable?

Thanks for your time in the project and sharing it with us!

pj1115 said...

Holy smokes, Batman!

I never would've thought of any of that... I was impressed with your trampolining idea btw.

Good luck with R13, you're doing a great job! We're all behind you!

Alex said...

The more and more you post it appears that the task you so simply claim to be your hobby seems like an impossible job of packing a month's worth of clothing into a backpack; all the while you're folding very neatly and conserving space, packing efficiently and then unpacking and doing the whole job over again.

Props to you for driving forward into the challenge.

Hope you like your new Apple.

Exophase said...

You got two out of three of these ideas directly from an e-mail I sent you in February :/

"At any rate, in every single one of these epilogues you're setting the third parameter to 0. You'd may as well make a version of this function that has it fixed to 0."

"Of course, what you have now is good. A minor suggestion: instead of branching over the call to the generalized memory handler, you could do a conditional branch and link to a "trampoline" that sits somewhere in your translated blocks. This way you don't introduce a bubble inside your code, improving locality of reference, and you decrease overall block size, improving icache footprint. I put these trampolines at the beginning of my blocks, but the reality is that you can have one service many blocks, so long as they're close enough."

As far as event handling goes, you should only have to be dealing with time based events (ones that will happen N cycles from now). Any input based events can be coupled with the timed ones since low latency is not a requirement for them. For this reason, all you really should be doing at the end of a code block is decrementing a counter and checking to see if it has gone below zero (and if it has branch + link out to an event handler). This is only a couple instructions on MIPS (not counting the delay slot). The event handler will then calculate which timed event will happen soonest and set the counter to this.

Rodrigo Cardoso said...

@ice

did you take a look at R12?
Mario64 already runs at more than 50% of the framerate... :)

narutofan150 said...

Ok, first off, great progress :-D but i have a question about the speedup

Ok so say when the emulator gets released its total speedup is 10%(thats an example), If a rom is running at 40fps and another at 20fps in r12, does that mean the rom that was running at 40 would be at 44fps and the one at 20 would be 22fps?

Thanks very much for the great work your putting out and the best of luck with daedalus!

BrendanL said...

It's good to know that Daedalus is still going strong!

said...

you are the MAN, you're doing a greatjob, keep on it !! let's see how far can go the PSP !

cyclonmaster said...

Will you plan to include save state feature in Daedalus? I admire your project and dedication to make this happen. Homebrew scene really respecting you sir. Keep up the good work.

Josh said...

I have no idea what that means, but I'm excited to hear that you're making progress. Oh, and I don't think you should ever announce release dates. People will get it when they get it.

Unknown said...

Amazing... a 10-15% increase in speed already and further optimizations to go. Once Star Fox 64 is completely playable on the PSP you may never see me again in daylight. This is awesome, keep up the good work!

GotikO said...

yo' keep doing that great job!!! hey if u think is possible, try to prove pokemon stadium 2, and try to see if u can find a way to run it faster... it's mad slow on R12. see ya' man.

Kdo said...

The detailed explanation you give tell a lot about your skill level.
As said so many times already, i'm indeed very impressed, and tend to read carefully through all your post, admiring your brilliant ideas

narutofan150 said...

Strmnnrmn: Do you think you will ever get more than a few games running at full speed 60fps with sound?
please give your opinion, as your the dev, the magic guy :D

KingPepper said...

Keep It Going StrmnNrmn,
Can't Wait to See What other Things your Going to Find that can be Made Faster!!!

Thanks for the Continued Work On this Emulator,
Here's to R13's Progress!!!!

Anonymous said...

Hey, for the next update, could you try to get paper mario working? It crashes after picking your character's name.

That game would probably run really fast and it's fun too!

Anyway, I guess that overall speedups are more important.

GmDude66 said...

Dude, you rock.

Keep going, i'm sure you can find most code to optimize!

Thanks for making this great emulator!

StrmnNrmn said...

@exophase: Sorry mate, it looks like I didn't follow your email very closely :( I'm going to have to go back now and see what other tips I've missed!

StrmnNrmn said...

@narutofan150: Yup, that's it. It's all very imprecise though - the speedup will vary slightly between games (e.g. 10% vs 8% vs 15%) and even within different parts of the same game. Just take them as rough indication for now!