When I discussed load/store handling on Sunday, I presented the new code that is typically generated for handling a load such as 'lw $t0, 0x24($t1)' on the N64:
ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
(I'll ignore all the extra code which is generated, and just concentrate on the 5 instructions above which correspond to the expected path of execution.)
Of the 5 instructions that are generated, two - the SLT and BEQ - are just there for performing error handling in the case that the address is invalid, a hardware register (i.e. for memory-mapped I/O), or a virtual address. I'll call this error handling for short.
If we were generating code in an idealised environment where we didn't have to perform error handling, we could drop the SLT/BEQ instructions to get this:
ADDIU a0 = s1 + 0x0024 # add offset to base register
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
We could then optimise this even further and perform the offset calculation directly as a part of the LW instruction:
ADDU a1 = s1 + s7 # add offset to emulated ram
LW s0 <- 0x0024(a1) # load data
In this idealised situation we could reduce an emulated load to just two instructions, with no branches. That's a pretty good saving!
The problem is that the environment we're generating code from is not 'ideal', and it's hard to know in advance of time which memory accesses are going to directly access physical ram, and which are going to access hardware registers or require virtual address translation. For that reason, we have to place a guard around every memory access to make sure that it behaves correctly. At least, that was the way I was thinking until earlier in the week.
What I realised on Monday is that I can make an assumption that lets me remove the error handling code for certain types of load/stores. The assumption is that when the N64 accesses any memory through the stack pointer ($sp) register, the address is always going to be valid, physical memory.
The assumption relies on the fact that most roms don't do anything particularly clever with their stack pointers - it gets set up for each thread to point at a valid region of memory then the game just runs along, pushing and popping values from it as the code executes. Of course, if the assumption is wrong then the emulator will just crash and grind to a halt in a unpredictable manner :)
It was straightforward to add a hack to the code generation to exploit this kind of behaviour, and the results have been better than I expected - I'm seeing at least a 10% speed up, and the code expansion factor (the ratio of generated bytes of instructions to input bytes) has dropped from around 5.0x to 4.0x. Stability has been excellent too - I've run about 8 roms with the hack so far, and all of them have run perfectly.
I think one of the reasons the hack has such an impact is that a lot of the memory accesses in a typical C program are through the stack. Here's an example snippet from the entry to a function on the N64, where the compiler emitted code to store the return address and arguments:
SW ra -> 0x0014(sp)
SW a0 -> 0x0058(sp)
SW a1 -> 0x005c(sp)
SW a2 -> 0x0060(sp)
When I look through disassembly for the roms I'm working on, it's very common to see lots of sequential loads/stores relative to the stack pointer like this.
Previously Daedalus generated around 20 instructions (including 5 branches) for the above snippet. With the hack, the generated code now looks like this:
ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
ADDU t1 = s1 + s7
SW s3 -> 0x0058(t1)
ADDU t1 = s1 + s7
SW s2 -> 0x005c(t1)
ADDU t1 = s1 + s7
SW s5 -> 0x0060(t1)
8 instructions, 0 branches. What's more, it looks like with a little work, I could eliminate 3 redundant address calculations:
ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
SW s3 -> 0x0058(t1)
SW s2 -> 0x005c(t1)
SW s5 -> 0x0060(t1)
Now that would be efficient :)
I still want to do lots of testing with the hack. I want to find out if there are roms that don't work with the hack enabled, and how common a problem it is. It's such a significant optimisation though that I'm certain I'll be adding it as an option in Daedalus R13. The results of my testing will probably determine whether I default it to on or off though.
So far Daedalus R13 is shaping up to be significantly faster than R12. I'm still not sure when I'll be ready to release it, but you'll hear about it here first.
-StrmnNrmn
17 comments:
All the game developers I've worked with were all about using the stack as much as possible. It's just a good defensive programming technique to avoid memory leaks and fragmentation. alloca() FTW
Is this a 10% speedup plus the 10-15% speedup from the previous post? Or is the total a 10-15% increase?
WOW!!!
Then... could we maybe see F-ZERO X running full speed in R13?? XD (if it got running again, of course :( )
another question: Do you think in move some code to ME in R13, or this will be a further optimization?
:)
Hi there,
about disabling the range checking on $sp - as you obviously can't always be sure that $sp points to something valid, have you tried validating $sp whenever a value gets written to it?
eg if after an add or subtract $sp no longer points to something valid, run all the following loads of stores that use that register with range checking. If the stack pointer is a valid address (+/- the maximum displacement allowed by the store) then roll with disabled protection on loads/stores...
Just a thought :-)
Btw I'm toying with writing some N64 emulation for the DS and I'd love to ask you some questions about your work! Any chance of mailing you about it?
Please fix Bomberman 64
It has a Black Overlay. Bomberman 64 is a very funny game and it should (for you) be no big problem to fix it and get it running on a decent speed as its imho not very reeource consuming.
If you fix Bomberman 64, Im pretty sure we would have two great games at a decent speed. Mario64 and Bomberman64.
Keep up the good work!
wow!
I wonder how much of the original code is still intact, Daedalus started as a port and with all the optimisations it's getting better and better
so plz continue your great work and release R13 when you have all the optimisations included, which come to your mind, and ensured they all work as perfect as all your code ^^
jope you're as happy with your MBP as I am :)
You Rock!
Thats all what I wanted to say ^^.
Regarding your dynarec hack, particularly the part where you stated '' if the assumption is wrong then the emulator will just crash and grind to a halt in a unpredictable manner''
Perhaps you should implement a way to disable this hack on a per ROM basis just in case it does cause any to crash. Of course it should be globally enabled by default if found to be stable enough.
All of this news is extremely exciting, I hope you plan to work on moving sound or processing to the Media Engine or just using it in any way you see fit. Good job Strmnrmn (and Exophase's dynarec optimisation suggestions). I Wonder if anything else Exo said has been missed, we could be missing another 10-15% increase in speed idea :(
wait when u said
"Of course, if the assumption is wrong then the emulator will just crash and grind to a halt in a unpredictable manner :)"
does this mean when R13 is released and i use it this can happen to me while playing it? cuz i really dont want my psp to crash.
sounds good nice work
You can do even one better on this. Most of the time, only four operations are done with respect to the stack pointer:
- load
- store
- increment by constant
- decrement by constant
Because of this, you really don't need to cache the stack pointer itself directly but can instead cache a "shadow sp" that you perform these operations on directly, each one instruction. Only when the stack pointer is modified in some other fashion or referenced will you need to bring in the real sp, which can be calculated from the shadow sp (add the appropriate offset - in your case, since you have this in a register to add with already, that would mean subtract this offset; cheap operation). If you've been seeing a 10% boost then this might give you another percent or two.
I'll be honest, the code discussions don't mean much to me, though they're always fascinating to read :)
thanks for your continued efforts on this, R13 sounds very promising, looking forward to it.
cheers, Rob
I gotta hurry and get a new PSP by the time R13 comes out or else I'll be jumping face plants everywhere. LOL! I'd glad your shaping this thing up. You're an awesome programmer....
duh
@taco: it's cumulative, so roughly 20% overall. It varies rom by rom though.
@cueca: Moving stuff to the ME is definitely a long term plan, but I think it's too much work to squeeze into R13. It's definitely on the horizon though.
@simon: That's definitely a possibility for improving stability, but to be honest I'm not sure that it'll be necessary as it seems to be working fine without additional checks. That technique might work well for other registers though. With the approach I'm taking I could collect information about each memory access at the time I'm tracing the code (i.e. in the phase where the code is interpreted immediately prior to generating the fragment).
Feel free to drop me an email - my address is in Daedalus's readme file.
@stefan paul: Well I a lot of it has been rewritten over the years but more of the core stuff is the same. The main things to change are the dynarec engine, graphics and various platform specific stuff.
@anonymous: The hack will definitely be configurable on a rom by rom basis. I'll add it as a setting in one of the configuration files and allow it to be overridden from the UI.
@metalsnake4: I wouldn't worry too much about this - the worst that will happen will be that you'll have to flip the battery out of your psp.
@exophase: The first dynarec code I ever wrote cached the sp like this and it worked pretty well, although I was still checking it for validity on each access (I was really paranoid about it being 'wrong' and introducing subtle bugs). I collect a lot of stats when I generate the fragment like which registers are used as bases and how many times they're modified, so I could extend this approach to other regs too, caching their translated address rather than recalculating it every time they're accessed. I think this would work really well for tight loops that operate on pointers (stuff like memcpy, memcmp etc)
simon; The problem with your suggestion is that every time $sp is written to in a "complex" way is that accesses that have already been recompiled will be subject to encountering the bad $sp too (even with the function inlining in Daedalus's recompiler). You'd have to flush the cache and start over with checks on all of them, which is probably preferable to having a config option, but you won't be able to do only some of them.
I think it's unlikely that a game would do something weird with $sp, although I guess it's possible, maybe if they're very strapped for registers.
Also have to wonder what you mean to accomplish with an N64 emulator on DS.. as DS's main CPU is arguably weaker than N64's and it doesn't have an FPU, actually it doesn't even have enough memory, how are you going to even get an emulator that works at all??
I'm not much of a coder, in fact I can barely do assembler at all yet I love reading your blog, it's always well written and nudges people along in a relaxed manner without the all too common chest beating of many a coder. If I ever get into coding again, I hope whatever literature I pick up is going to be written by you.
Post a Comment