Sunday, August 06, 2006

More R7 Optimisations

It's been a while since my last post, but I've still been hard at work with various optimisations for Daedalus R7.

Although my main focus is on improving the dynamic recompiler, I've also been optimising a couple of other areas that I noticed were fairly expensive. The texture cache is one of the areas I spent time tuning this week. This cache is used to avoid converting textures from the native N64 formats to PSP formats every frame. I made a couple of fixes to the hashing function, which give much faster lookups in certain situations (such as tiled backdrops). I also added an option to change how often the texture cache checks for updates to the textures. Many roms look fine when this check is disabled entirely, and this can give quite a nice speed boost.
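To make the idea of a configurable check frequency concrete, here is a minimal sketch, not the actual Daedalus code. The names (TextureCache, check_interval, HashBytes) are hypothetical, the "conversion" is just a byte copy standing in for the real N64-to-PSP format conversion, and FNV-1a is used as a stand-in hash since the hash actually used isn't described here. An interval of 0 means the source data is never re-checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a, used purely as a stand-in hash for this sketch.
inline uint32_t HashBytes(const uint8_t* data, std::size_t len) {
    uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

struct CachedTexture {
    uint32_t source_hash = 0;         // hash of the N64 data at conversion time
    uint32_t last_checked_frame = 0;  // frame on which the hash was last verified
    int conversions = 0;              // how many times the texture was converted
    std::vector<uint8_t> converted;   // placeholder for the PSP-format texture
};

class TextureCache {
public:
    // check_interval: re-hash the source every N frames; 0 disables the check.
    explicit TextureCache(uint32_t check_interval)
        : check_interval_(check_interval) {}

    const CachedTexture& Get(uint32_t n64_address, const uint8_t* src,
                             std::size_t len, uint32_t frame) {
        assert(src != nullptr);
        auto it = entries_.find(n64_address);
        if (it == entries_.end()) {
            // Cache miss: convert once and remember the source hash.
            CachedTexture& fresh = entries_[n64_address];
            Convert(fresh, src, len, frame);
            return fresh;
        }
        CachedTexture& entry = it->second;
        const bool due = check_interval_ != 0 &&
                         frame - entry.last_checked_frame >= check_interval_;
        if (due) {
            // Only pay for the hash when a check is due.
            if (HashBytes(src, len) != entry.source_hash) {
                Convert(entry, src, len, frame);
            }
            entry.last_checked_frame = frame;
        }
        return entry;
    }

private:
    void Convert(CachedTexture& entry, const uint8_t* src, std::size_t len,
                 uint32_t frame) {
        entry.converted.assign(src, src + len);  // stand-in for real conversion
        entry.source_hash = HashBytes(src, len);
        entry.last_checked_frame = frame;
        ++entry.conversions;
    }

    uint32_t check_interval_;
    std::unordered_map<uint32_t, CachedTexture> entries_;
};
```

The trade-off is visible in the interval: a larger value means fewer hashes per frame but a longer delay before a changed texture is picked up, and 0 trades correctness for speed on roms that rewrite texture memory.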

My main focus has continued to be on the dynamic recompiler. I've made a couple more bugfixes in this area. One bugfix involved detecting when roms were using self-modifying code. The fix involved dumping the contents of the dynarec cache so that the code is correctly regenerated for the updated instructions. This fix solves a couple of issues I was seeing with Quest64, and I'm sure it will help improve compatibility with a number of other roms too.
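The "dump everything" approach can be sketched roughly as follows. This is an illustration under assumptions rather than the real implementation: FragmentCache, Fragment and OnGuestCodeInvalidated are hypothetical names, compiled code is represented as a plain vector of words, and how the emulator notices that the rom has changed its code is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// A compiled fragment: the guest address it starts at and the generated code.
struct Fragment {
    uint32_t entry_pc;
    std::vector<uint32_t> host_code;
};

class FragmentCache {
public:
    void Insert(uint32_t guest_pc, std::vector<uint32_t> host_code) {
        assert((guest_pc & 3u) == 0);  // MIPS instructions are word aligned
        fragments_[guest_pc] = Fragment{guest_pc, std::move(host_code)};
    }

    // Returns nullptr when no fragment exists, so the caller recompiles.
    const Fragment* Lookup(uint32_t guest_pc) const {
        auto it = fragments_.find(guest_pc);
        return it == fragments_.end() ? nullptr : &it->second;
    }

    // Called when the rom signals that its code has changed. Everything is
    // dropped, so stale translations can never be executed again.
    void OnGuestCodeInvalidated() {
        fragments_.clear();
        ++flush_count_;
    }

    std::size_t Size() const { return fragments_.size(); }
    unsigned FlushCount() const { return flush_count_; }

private:
    std::unordered_map<uint32_t, Fragment> fragments_;
    unsigned flush_count_ = 0;
};
```

Throwing away the whole cache is blunt, since unaffected fragments have to be recompiled too, but it avoids having to track which fragments overlap the modified memory.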

The other dynarec issue I fixed was related to the way I was handling certain types of branch instructions. The MIPS processor has a set of 'branch likely' instructions which work slightly differently to regular branches and so I handle them separately in the dynamic recompiler. It turned out that I had forgotten to link together code fragments when they exited through a branch likely instruction. This fix gives a nice little speedup.
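For anyone unfamiliar with the difference: a regular MIPS branch always executes the instruction in its delay slot, while a 'branch likely' only executes the delay slot if the branch is taken. A small sketch of that rule (ResolveBranch and BranchOutcome are names made up for this illustration, and this is not dynarec code):

```cpp
#include <cassert>
#include <cstdint>

struct BranchOutcome {
    bool run_delay_slot;  // does the instruction after the branch execute?
    uint32_t next_pc;     // where execution continues afterwards
};

// branch_pc: address of the branch instruction
// offset:    the signed 16-bit immediate from the instruction, in words
// condition: whether the branch condition holds
// likely:    true for the 'branch likely' variants (BEQL, BNEL, ...)
inline BranchOutcome ResolveBranch(uint32_t branch_pc, int16_t offset,
                                   bool condition, bool likely) {
    assert((branch_pc & 3u) == 0);
    // The target is relative to the delay slot, i.e. branch_pc + 4.
    const uint32_t target =
        branch_pc + 4u + static_cast<uint32_t>(static_cast<int32_t>(offset) * 4);
    const uint32_t fallthrough = branch_pc + 8u;
    if (condition) {
        return {true, target};  // taken: delay slot runs for both kinds
    }
    // Not taken: a regular branch still runs the delay slot,
    // a branch likely skips (nullifies) it.
    return {!likely, fallthrough};
}
```

Because the not-taken path skips an instruction, a recompiler has two distinct exits to handle for these branches, which is why they tend to get their own code path.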

The biggest bit of new development I've been doing on the dynarec is on optimising for various situations where I can determine the contents of a given register at the time I'm compiling the code. As an example, many roms use the following sequence to load an integer value from memory at a specific address:

LUI $t0, 0x8033 // Load Upper Immediate - i.e. load t0 with 0x80330000
LW $t0, 0x1234($t0) // Load Word - i.e. load t0 with the value at 0x80331234

Previously I'd generate code for both of these instructions on the PSP. The LUI instruction is easy (if t0 is cached on the PSP then this is just one instruction). The LW is a lot more tricky. I have to call a function to convert the address on the n64 (0x80331234 in this case) to the address in the emulated memory on the PSP. Then I have to read from that address, or trigger an exception in the emulator if the memory address is invalid.

With the changes I've just made, when I encounter the LUI instruction (or other instructions involving loading constant values into registers) I keep track of the fact that I've loaded t0 with 0x80330000. When I come to process the LW instruction, I can now determine that the desired address is 0x80331234. I can then map that address directly to the required location on the PSP, avoiding a function call in the generated code. By avoiding the function call I no longer need to flush cached registers back out to memory. Also, because I can tell in advance that the address lies in RAM (and isn't referencing a hardware register, for instance), I can omit the code testing for an exception. Finally, in situations like the example above, I don't need to generate any code for the initial LUI (as the register is immediately overwritten with the loaded value).
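The compile-time bookkeeping described above might look something like the sketch below. It is only an illustration under stated assumptions: KnownRegs, ResolveRamAccess and kRamSize are hypothetical names, RAM is assumed to be the standard 4MB at physical address 0, and only the directly mapped KSEG0/KSEG1 region (0x80000000-0xBFFFFFFF) is treated as resolvable.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kRamSize = 4u * 1024u * 1024u;  // assumed: 4MB, no expansion

// For each of the 32 MIPS registers, remember its value if it is known
// at the time the code is being compiled.
struct KnownRegs {
    bool known[32] = {};
    uint32_t value[32] = {};

    KnownRegs() { known[0] = true; }  // $zero always holds 0

    void OnLui(unsigned rt, uint16_t imm) {
        assert(rt < 32);
        if (rt == 0) return;  // writes to $zero are discarded
        known[rt] = true;
        value[rt] = static_cast<uint32_t>(imm) << 16;
    }

    // Call for any instruction whose result can't be computed in advance.
    void Invalidate(unsigned r) {
        assert(r < 32);
        if (r != 0) known[r] = false;
    }
};

struct RamAccess {
    bool direct;      // true if the address was resolved at compile time
    uint32_t offset;  // offset into emulated RAM (valid only if direct)
};

// Decide whether a load/store word of offset(base) can skip the runtime
// address translation and exception check.
inline RamAccess ResolveRamAccess(const KnownRegs& regs, unsigned base,
                                  int16_t offset) {
    assert(base < 32);
    if (!regs.known[base]) return {false, 0};
    const uint32_t vaddr =
        regs.value[base] + static_cast<uint32_t>(static_cast<int32_t>(offset));
    if ((vaddr & 0xC0000000u) != 0x80000000u) return {false, 0};  // not KSEG0/1
    if ((vaddr & 3u) != 0) return {false, 0};  // unaligned word access
    const uint32_t paddr = vaddr & 0x1FFFFFFFu;
    if (paddr + 4u > kRamSize) return {false, 0};  // hardware register, rom, etc.
    return {true, paddr};
}
```

For the LUI/LW pair above, the tracker records t0 = 0x80330000 on the LUI, ResolveRamAccess then yields RAM offset 0x00331234 for the LW, and t0 is invalidated afterwards because the loaded value isn't known at compile time. Anything that can't be resolved falls back to the general (slower) path.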

In summary this is a very nice optimisation - it generates fewer instructions (reducing the size of the dynarec code), it avoids unnecessarily flushing out cached registers, it avoids generating exception handling code, and it can eliminate redundant instructions (the initial LUI). In the best case, for 2 source instructions it will generate just 3 output instructions, compared to 12-13 for the unoptimised case.

Unfortunately this approach only works with load and store instructions where the address can be determined in advance, but from the roms I've examined so far around 10-15% of the load/store instructions can be optimised in this way, which is enough to give a measurable benefit.

I'm going to spend the rest of this week seeing which other parts of the dynarec engine can benefit from similar approaches. I have a couple of other features to implement (configurable controllers etc). If that all goes to plan I'll try and prepare R7 for a release next weekend.



  About how much faster have you gotten it since last time? Percentage wise.

  yes, you forgot to tell us the zelda speed and you must tell us how much faster it is now.

  I just wonder how fast Mario Karts going now!

  4. Hi !

    Self Modifying code on a MIPS ?!?!

    I believed that it was limited to old cpu without any cache or pipeline.

    Normally self-modifying code isn't possible on a RISC CPU.

    Do you have any example ?
    Keep up the good work.

  5. It can be implemented on any processor where either:

    1) You can flush the cache/pipeline, either by a special instruction or by utilising something like the x86 far jump (which admittedly only flushes the pipeline [ignoring prediction abilities], but you get the point)

    2) You can be sure the code you're modifying isn't currently in the cache or pipeline (this would be somewhere between hard and impossible)

    In other cases you can use self-modifying code, but your changes will be ignored on a regular basis, at least until that particular piece of code has left the pipeline and cache.

  10. Sorry for my quiet spell last week - I was so busy I didn't check the site all week and was quite surprised by all the comments I found!

    tsurumaru: Thanks for the heads up on the rom links - I'll go through and remove them now.

    kramer: I think I'll try and post brief updates a couple of times through the week. I can appreciate that people get a bit frustrated when they don't hear anything for a few days, and new posts help to keep everything a bit fresher.

    wally*won_kenobie: The compatibility does seem to have improved somewhat so it may well be worth waiting until R7 (not least because the mux/screenshot situation will be a bit less painful for you :)

    mario kart god: I've lost track where I'm at now. The problem is some roms are seeing much bigger speedups than others. For instance bits of Super Mario are running at 20fps now (in the castle for instance) whereas the opening sequence of Zelda is still stubbornly running at 4fps. I think you'll be quite pleased anyway :) MarioKart is around 12fps now - I'm not sure what it was before, but it's starting to feel quite playable.

    urkel: As I mention above the opening sequence in Zelda is still just 4fps. In game is a little faster now (5-6fps rather than 4), but it's still not running as quickly as I'd like.

    xiringu: Good question :) Dynarec should benefit all emulators. Obviously if the emulator runs at full speed through an interpreter then there's no point in going for the added complexity of implementing a dynarec engine too.
    The principles of the dynarec are pretty similar for all machines, but some are easier to handle than others (the N64 is fairly easy because it's based on a RISC chip - I imagine writing a dynarec engine for an x86 emulator would be a lot more difficult for instance).

  11. imtiaz: It's definitely looking a lot better, but I think sound is still a handful of releases away (maybe R10 or R11??)

    kersplatty: Just don't get caught playing games while you should be paying attention :)

    laxer3a: Hello! I haven't traced through what the roms are doing - in the Quest64 case it looked like a decompression routine or something like that. It's maybe not 'self-modifying' in the true sense of the word - I think they're actually streaming code in dynamically from the rom at runtime to save on RAM.
    I'm actually just trapping their calls to invalidate the cache (i.e. the mips CACHE instruction) and dumping the dynarec contents. It's not perfect, but it seems to work well enough for now.

    kemp: You're exactly right - the roms in this case are using 1) to flush the instruction cache.
    What's quite interesting is that in emulating the n64 on the psp, my dynarec is itself doing something similar on another MIPS processor (I have to regularly flush the instruction cache to avoid nasty hard-to-trace crashes).
    I wonder if in a few years someone will be emulating Mario 64 in Daedalus PSP on a PSP emulator on a new nintendo handheld? :D

