Wednesday, May 31, 2006

Some initial benchmarks

I've been really busy working on the new dynarec engine, so I've not been posting as frequently as I'd like. I've made a lot of progress in the following areas:


  • Most integer arithmetic and logical instructions now implemented (i.e. I'm now generating optimised assembly for these instructions rather than calling a generic function to handle them)

  • Register caching implemented (although I'm only using a greedy allocation algorithm at the moment, as I've not yet fully implemented the fast linear scan algorithm I talked about in the previous post)

  • I'm directly linking all direct branches to compiled fragments

  • I'm linking to all indirect branch targets



So far I'd say I'm around 40-50% through the work on the dynarec engine.

Now for some stats :) The following table compares the framerates at various points (previous framerate is for the R4 release of Daedalus, current framerate is for my most recent development build):














Scene                         Previous Framerate (Hz)   Current Framerate (Hz)
Mario Head                    3                         6
Mario Main Menu               14                        25
Mario Peach Letter            6-7                       11
Mario Flyby (under bridge)    6                         10
Mario In Game                 5-6                       9
Mario Kart Nintendo logo      10                        23
Mario Kart Flag               6                         11
Mario Kart Menu               7                         11
Zelda Nintendo Logo           20                        23
Zelda Start Menu              2-3                       4
Zelda Main Menu               10                        13


Overall I'd say the dynarec is currently achieving up to a 100% speedup in the roms I've tested, which I'm very excited about. Mario is certainly starting to feel a lot more playable, and the Mario Kart menus are a lot more responsive now.

I specifically included Zelda in the results because I'm not seeing the same kind of results there, so I need to take a closer look at what's going on there (it's quite possible it's just using a few of the arithmetic and logical ops I've not spent time optimising yet).

A twofold improvement in framerate is pretty good, but I now think I can do a lot better. Here's the list of things I currently have on my 'TODO' list:


  • Fully implement all the remaining integer ops (including all the 64 bit instructions)

  • Finalise implementation of the fast linear scan register allocation algorithm

  • Keep track of 'known' values for specific registers and use this to optimise the generated code (e.g. most of the time the top half of the N64's 64 bit registers is just sign extended from the lower half)

  • Cache the memory location pointed to by the N64 stack pointer (SP) and optimise load/stores using this register as a base pointer

  • Optimise all memory access instructions (currently all the cached registers get flushed for all memory accesses other than LW/SW/LWC1 and SWC1)

  • Detect and optimise 'busy wait' loops (e.g. many roms sit in a tight loop waiting for the next vertical blank interrupt to fire which is just wasting cycles on the PSP)

  • Implement all the branching instructions (I've currently only implemented BNE, BEQ, BLEZ and BGTZ)

  • Implement instructions and register caching for all the cop1 (floating point coprocessor) instructions. (I think this will give a huge speedup.)
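
As an illustration of the 'known values' item in the list above, here is a minimal sketch in C of how a recompiler might track whether the top half of a 64 bit register is just the sign extension of the lower half. It is not taken from the Daedalus source: the `KnownRegState` struct and all function names are hypothetical. The idea is that 32 bit operations such as ADDU always leave a sign extended result, so code reading that register afterwards does not need to touch the upper half.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-fragment state: one flag per N64 general purpose register.
 * A set flag means "the upper 32 bits are known to be the sign extension of
 * the lower 32 bits", so generated code can ignore the upper half. */
typedef struct {
    bool hi_is_sign_ext[32];
} KnownRegState;

/* True when the upper 32 bits of value are the sign extension of the lower 32. */
bool is_sign_extended(uint64_t value)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);
    return hi == ((lo & 0x80000000u) ? 0xFFFFFFFFu : 0u);
}

/* Call after translating a 32 bit op (e.g. ADDU): the result is sign extended. */
void note_32bit_result(KnownRegState *s, unsigned rd)
{
    s->hi_is_sign_ext[rd & 31] = true;
}

/* Call after translating a 64 bit op (e.g. DADDU): nothing is known any more. */
void note_64bit_result(KnownRegState *s, unsigned rd)
{
    s->hi_is_sign_ext[rd & 31] = false;
}

/* Does the generated code have to load/compute the upper half of rs? */
bool must_load_hi_half(const KnownRegState *s, unsigned rs)
{
    return !s->hi_is_sign_ext[rs & 31];
}
```

A translator built this way would start each fragment with all flags cleared (nothing known) and consult `must_load_hi_half` before emitting code for the upper 32 bits.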



Although the list is quite short, there's quite a lot of work there. What I'm quite excited about is that I think these changes will start to provide significant speedups as they're implemented. I don't want to get too far ahead of myself, but I'm starting to feel that certain roms are going to be very playable in the not too distant future.

I'm going to try and release a new version of the emulator soon. Unfortunately it's probably not going to be this weekend (due to various social commitments); towards the end of the following week is more likely. I'd certainly like to get a version released before the World Cup starts and all my free time is taken up watching football :)

-StrmnNrmn

10 comments:

_Psycho said...

Interesting as always :)

I was wondering: let's say you finish your dynarec and you are at around 60-90% of the original speed (like 22-27fps for Mario, for example, instead of 30). Do you have any plan to close the remaining gap? Like taking some of the texture and cache functions and rewriting them in MIPS asm instead of C++? Would that give you an extra boost, or does the way you wrote your dynarec already take care of that, so it would be useless to rewrite some parts in MIPS asm?

Anyway, I really enjoy following the technical notes. I can't wait to dig in the source code to see the changes.

Unknown said...

Mighty impressive update :)
Once again, very good work!

I'm looking forward to the next public release. But by all means, take your time and don't rush yourself.

Cheers,

LaMa

Exophase said...

Question about one of your optimization ideas. How can you cache the stack pointer's memory region when you don't know at compile time if stack relative accesses will be in the same page? Or is the stack usually in unpaged memory? If the latter is the case I would assume you're referring to the hardware memory region, and not the virtual memory space (but this should always be RDRAM, right?)

Laxer3A said...

Hi,

1/ In my previous post I hadn't considered that the N64 MIPS instruction set is different from the PSP's MIPS.
(64 bit reg and so on...)

So direct code "translation" seems to be a bit harder.

2/ I actually have a lot of other projects and also a very busy job.
I believe StrmnNrmn is very skilled too and does not need anybody like me. :-)
I was just making some comments about potential implementations.

I would just hope that StrmnNrmn would try to discuss more on this blog, so we could develop optimization ideas. That would be fun. But I bet he wants to reach his goal fast without losing too much time. (= when home == coding, not internet thingy)

StrmnNrmn said...

wally: There will be sound support, but I think speed and compatibility are more important at the moment (any audio will sound horrible until the emulator is running at close to full speed).

insert display name: Save is definitely a big priority. It shouldn't be too hard to get working, but again I don't think it's a priority until some of the compatibility and performance issues are addressed. A better GUI is definitely required too (if just to allow the controller to be reconfigured on a rom by rom basis)

_psycho: I think there is a lot of scope for optimising other parts of the emu specifically for the psp. Certainly the texture decompression etc could be heavily optimised for the PSP. At the moment the CPU emulation is taking the majority of the time, so that's what I'm focussing on. Hopefully once the dynarec work is finished it should be more obvious where to look at improving next.
PS- I'll take a look at committing my changes to the sf.net CVS repository today if you fancy looking through the code.

exophase: Usually the stack is in physical ram, so there are no issues with paging etc. Some games do (annoyingly) have the stack in virtual mem so this optimisation probably wouldn't work for them. It would have to be toggleable from the .ini file to work I think.

StrmnNrmn said...

laxer3a: Sorry I didn't get around to replying to your previous post - you raised a really interesting approach that I'd not given much thought before.

I think you spotted the same problem as I thought of, i.e. the psp has a slightly different instruction set than the n64 (64 bit instructions as you mention). The other problem is that the n64 is big endian whereas the psp is little endian, so all 1 and 2 byte load/stores need to be fiddled to get working. I think it might be possible to get a 'direct' translator working like you suggest, but I think there would end up being a lot of hacks and special cases etc. Ultimately I think that a full dynamic translator is going to be the easiest approach (plus I can also share most of the code for the 'front end' of the translator with the PC version :)
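
To illustrate the kind of 'fiddling' that 1 and 2 byte accesses need, here is a minimal sketch in C of one common approach in N64 emulators. It is an assumption for illustration, not confirmed as what Daedalus does: emulated RAM is kept as 32 bit words in the host's little endian order, so word accesses need no conversion, while byte and halfword addresses are XORed with 3 and 2 respectively. The function names are hypothetical, and the sketch assumes a little endian host and naturally aligned addresses.

```c
#include <stdint.h>
#include <string.h>

/* Emulated RAM is assumed to hold each big endian N64 word already converted
 * to the (little endian) host's native order. */

/* Word access: no address adjustment needed. addr must be 4 byte aligned. */
uint32_t read_u32(const uint8_t *ram, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, ram + addr, sizeof v);
    return v;
}

/* Byte access: flip the low two address bits to find the byte in the word. */
uint8_t read_u8(const uint8_t *ram, uint32_t addr)
{
    return ram[addr ^ 3u];
}

/* Halfword access: flip bit 1 to pick the other half of the word.
 * addr must be 2 byte aligned. */
uint16_t read_u16(const uint8_t *ram, uint32_t addr)
{
    uint16_t v;
    memcpy(&v, ram + (addr ^ 2u), sizeof v);
    return v;
}
```

With the word 0x11223344 stored at address 0, the N64 program expects the byte at address 0 to be 0x11 and the halfword at address 2 to be 0x3344, which is what the adjusted addresses return on a little endian host.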

You're right when you say that I'm trying to achieve my goal quickly without 'wasting' too much time. My job takes up quite a lot of my time, so I only get a few hours at home in the evenings during the week. It usually takes me a couple of hours to go through all my email and update the blog etc. I usually only ever end up doing it once a week so I can spend as much time developing as possible, but I'm aware that it's important to keep people informed as to what's going on. I also enjoy talking to you guys so I'll try and squeeze in a few smaller updates when I get the chance :)

Exophase said...

Other N64 emulators have used byteswapping before, presumably to address the endian issue, although I'm not sure how this actually helped anything since reading/writing bytes and halfwords would provide an inconsistent view of memory. I've used byteswapping but it was on a "platform" that only supported full word memory accesses.

I don't know if you're already using these or not, but for manual byteswapping MIPS32r2 processors have some additional instructions that should help: a two-instruction sequence can byteswap a full word. It's not as good as what's available on PPC, but it's a lot better than doing it the traditional way (the following is taken from the programmer's manual PDF):

lw t0, 0(a1) /* Read word value */
wsbh t0, t0 /* Convert endianness of the halfwords */
rotr t0, t0, 16 /* Swap the halfwords within the words */
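
For readers who don't know these two instructions, here is a sketch in C of what the sequence above computes. The helper names `wsbh_c`, `rotr32_c` and `byteswap32` are made up for this example; on a MIPS32r2 target the first two would each be a single instruction.

```c
#include <stdint.h>

/* What WSBH does: swap the two bytes inside each 16 bit halfword. */
uint32_t wsbh_c(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}

/* What ROTR does: rotate a 32 bit value right by n bits. */
uint32_t rotr32_c(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* WSBH followed by ROTR 16 reverses all four bytes of the word. */
uint32_t byteswap32(uint32_t x)
{
    return rotr32_c(wsbh_c(x), 16);
}
```

For example, 0x11223344 becomes 0x22114433 after the halfword swap and 0x44332211 after the rotate.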

_Psycho said...

You know, I just realised you were on sourceforge and that you were using the CVS. I thought you were only releasing the source code with every release ;) Good idea there, I'll give it a look later this week.

Mikeyd, if you really want to test badly, get the latest source code from the cvs, compile it and check the result ;)

Laxer3A said...

StrmnNrmn, thanks for your answer.

Basically the problem is that when you translate ONE n64 mips instruction, it results in MULTIPLE PSP mips instructions, with a lot of different subcases.

The "best" way actually would be to formalize the N64 instructions into a graph as used in compilers. Basically roll back the ASM instructions into chunks of "virtual" micro instructions...

Once the chunk of code has been completely "virtualized" you pass all your graphs/trees through various optimizing filters, which will reduce the size of the trees or make them more efficient to turn into mips instructions on the target device.

I know it is a bit of overkill (that's why I put best between "").
That's how compilers do their job, but we definitely agree that it is costly for real time stuff and a limited cpu platform.

MIPS -> MIPS still is close enough to do that in a more simple way.
Some "special cases" handled in a nice way can probably do as good at a cheaper cost.

Anyway there are always hundreds of ways to solve problems, depending on trade-offs so...

The problem with doing this in the Snes emu is that the architecture is SO different: the CISC rich addressing modes on each instruction are a real pain, getter/setter code generation is annoying, and finally you need to detect whether the code is in RAM or ROM to avoid self-modifying code issues.

In the case of TYL, it isn't worth it. (dev time vs benefit)
In your case, it is definitely a requirement.

If you translate your GPU call quite fast, the next bottleneck is audio and cpu... Definitely worth it then.

Anyway, if you have time, drop a mail : laxer3a@hotmail.com
I really enjoy discussing about this kind of stuff.

StrmnNrmn said...

_psycho: Just to let you know I've updated the CVS repository with all my recent changes- I'll post a small new entry about this so it's a bit more visible.

bigmace: 4k/16k eeprom support should only take a few minutes to get working - most of the logic is all there from Daedalus PC. I just need to sit down for 10 minutes and hook up the load/save to memorystick on the PSP. I'm holding off for a short while as I want to double check the 'fileformat' is compatible with various other emulators (i.e. so people can download and share their saves)

kramer: I will fix a few of the more obvious glitches that I've come across. R5 is primarily going to focus on performance though, so there's unlikely to be much in the way of graphics or compatibility fixes (maybe I'll spend a week concentrating on this for a quick R6 release).

wally: It does run very fast. The problem is there are some nasty graphical glitches which make it impossible to see what's going on when you get in game. I'll take a look at fixing this soon :)

laxer3a: Sounds like a really interesting idea. In the past I've had a thought about treating the n64 asm as an arbitrary program fragment, converting it into SSA form and then applying various optimisations from there (e.g. lots of peephole optimisations should become very easy at this point). Obviously if you did this you'd have to make sure that the overhead of your optimiser didn't slow everything down too much though. Will drop you a line sometime this week - would be good to chat about this in a bit more depth.
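
As a small illustration of the kind of peephole optimisation mentioned above, here is a hedged sketch in C. It works on a made-up linear instruction list rather than true SSA form, and the `Insn` type, the `OP_*` names and `fold_lui_ori` are all hypothetical. It folds the common MIPS pair `lui rd, hi` / `ori rd, rd, lo` into a single 'load 32 bit constant' pseudo-op, which mainly helps by exposing the constant to later passes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical intermediate form for a fragment of N64 code. */
typedef enum { OP_LUI, OP_ORI, OP_LI32, OP_OTHER } Op;

typedef struct {
    Op       op;
    unsigned rd;   /* destination register */
    unsigned rs;   /* source register (unused by LUI and LI32) */
    uint32_t imm;  /* 16 bit immediate, or full 32 bit constant for LI32 */
} Insn;

/* Fold "lui rd, hi; ori rd, rd, lo" into "li32 rd, (hi << 16) | lo".
 * Rewrites code[] in place and returns the new instruction count.
 * The constant is the 32 bit result; sign extension to 64 bits is left
 * to whatever consumes LI32. */
size_t fold_lui_ori(Insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n &&
            code[i].op == OP_LUI && code[i + 1].op == OP_ORI &&
            code[i + 1].rd == code[i].rd && code[i + 1].rs == code[i].rd) {
            Insn li = { OP_LI32, code[i].rd, 0,
                        ((code[i].imm & 0xFFFFu) << 16) |
                        (code[i + 1].imm & 0xFFFFu) };
            code[out++] = li;
            i++;                      /* skip the ORI we just consumed */
        } else {
            code[out++] = code[i];    /* keep everything else unchanged */
        }
    }
    return out;
}
```

The pass is a single linear scan, so its cost grows only with fragment length, which matters given the concern about optimiser overhead.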