Tuesday, May 23, 2006

A few more details

Laxer3A (of SnesTYL fame) made an insightful point in the comments page of the previous post (I hope he doesn't mind me quoting him):

What you are generating actually seems more like "threaded code" to me than actually a real dynarec.


I entirely agree with his description "threaded code" for the way I'm approaching the new dynarec engine. I want to stress tha this is just the early stages. The intention is to develop a stable platform from which I can incrementally move towards what would be considered more conventional dynamic code generation. I want (and now have) a nice solid platform from which I can implement things in chunks of 2-3 hours, which I can easily fit into an evening after work.

So I'm currently in the process of replacing generic calls to handle individual opcodes with specialised code which avoids the function call overhead and can hardcode register locations and immediate values. I can also optimise certain things very well. As an example there is no 'mov' instruction on the MIPS architecture so you quite often see ops like 'or a0, a1, r0' (which just ORs the value in a1 with 0 and stores the result in a0). I can write the code to handle OR to simplify this particular situation by just directly copying the contents of register a1 to a0, and therefore avoid the logical operation altogether.

As another example, here's the code I wrote last night to handle LUI (the operation to load a 16 bit value into the upper half of the low-32 bits of the N64 register):


LUI( PspReg_T0, immediate );
SetVar( &gGPR[rt]._u32[0], PspReg_T0 );

if (immediate >= 0)
{
SetVar( &gGPR[rt]._u32[1], PspReg_R0 ); // Clear tops bits
}
else
{
SRA( PspReg_T1, PspReg_T0, 0x1f );
SetVar( &gGPR[rt]._u32[1], PspReg_T1 ); // Set top bits
}


In the code I talked about on Sunday, handling this instruction would involve 13 ops: 2 to load the value of the opcode into the argument register, and another to call the 10 op-long function 'R4300_LUI' (Daedalus's instruction handler for LUI).

With the code above this is reduced to 4 ops in the worst case (if the immediate value is negative), or just 3 ops if the value is positive. Also, there is no branching. To give a speicifc example, this N64 opcode:

LUI at, 0x8034

now causes this PSP code to be generated:


LUI t0, 0x8034
SW t0, 8(s0) ; s0 points to the emulated register set
SRA t1, t0, 0x1f
SW t1, 12(s0)


My intention is to spend the next few days reimplementing the most commonly used opcodes in this way. By that point I think the major overhead will shift from the cost of all of the function calls to the generic handlers to the cost of storing loading the emulated registers each time they're referenced (you'll notice in the snippet above I call SW twice - once for each half of the 64 bit N64 register.)

From previous experience, register caching is where the real speedups come from with dynamic recompilation. Memory accesses are typically an order of magnitude slower than register access so anything I can do to avoid them in the recompiled code will be a huge improvement.

If anyone is curious, I've been reading these two papers on fast register allocation for dynamic code generation:

Linear Scan Register Allocation - Poletto, Sarkar. 1999
A fast, memory-efficient register allocation framework for embedded systems - Thammanur, Pande. 2004

-StrmnNrmn

[Edit 23:40 - fix markup]

3 comments:

_Psycho said...

Im going to read that at job tomorow with great interest ! It's very interesting stuff for my brain =P

Do you have any "pre result", like % of speed increase after xx opcodes changed ?

Laxer3A said...

Hi,

Dont know so much about Mips on PSP and N64.

But here is my idea (sorry to be long):
I was thinking actually that if you can create a system where you can have your full register set usable for the N64 during a chunk execution.
The problem to solve is the loss of your current SP (your PSP code) and switch to a SP of a N64 and be able to get back your PSP SP register without any problem.

it is like :
A/ Load all registers from N64 context, except PC.

B/ Execute code N64...

C/... End of emulation chunk code...
All mips registers are N64 registers.(except PC)

B/push all registers on the N64 stack.

C/
// Find a way here to give an adress
// ex. like hardcoded #imm assembled // to form an adress
Move #imm -> f_reg.
// Store SP (n64) at this adress.
Store SP at adress (@execution context of the N64, SP register)

// Use this technic to get again an adress.
// This adress store your latest PSP SP
move #imm -> reg_f
load SP, [reg_f]
or something like that.(sorry I am not fluent in MIPS ASM... I prefer ARM)

Basically we load FULLY the complete register set of the N64 in the cpu, EVEN SP (except PC)... Execute code.
At the end backup the N64 registers (except SP and PC of course) to the stack (the real N64 would not but who cares).
--> Use a register to get an adress hardcoded.
--> Use this adress to save the N64 SP register.
--> Use a register to get an adress hardcoded.
--> Use this register to load new SP from adress.(now we have back PSP SP)

Now you have all the N64 registers backup at a given adress...
And you are back fully in PSP state.

With a technic like this you dont need to remap the register dynamically, or work on a reduced set of registers. The N64 code work as it is.


Now it is easy to backup all N64 register before we erase them.

I will make a second post about other ideas...

Laxer3A said...

I continue :
Now at the beginning of a chunk you can do something like (once SP N64 is set)
- Get back registers from N64 SP
- Shift SP by xxx register to delete them.
- Start emulate code...
...


Thats pretty cool, you dont need any useless move. Clean and efficient.

Now what I wanted to talk about is the instruction set.
A/ The cool point with risc architecture is that you know when memory is going to be accessed :
STORE or LOAD instruction.

B/ You know that troublesome instructions are the one modify the PC. (switch with tbl, switch with offset, jumps)

C/ All other instruction should be problemless (need to think about it)

So in case of a LOAD :
- Adress is fixed : remap it to a LOAD with PSP Memory.
- Adress is unfixed but we can estimate : ex, load from RAM, this same load is most likely to always load from RAM... not access a register in a chipset the next time... But who knows.
- Adress with no estimate.

The first and second cases, can be remapped directly to PSP memory.

The third case need to be replaced with a function call.(getter)

Same with STORE :
- RAM ?
- Register ?

I am just wondering how much LOAD or STORE at one place are doing access at completly different places in memory. Most likely to be 0 but never know...

Regards.