What you are generating actually seems more like "threaded code" to me than actually a real dynarec.
I entirely agree with his description "threaded code" for the way I'm approaching the new dynarec engine. I want to stress that this is just the early stages. The intention is to develop a stable platform from which I can incrementally move towards what would be considered more conventional dynamic code generation. I want (and now have) a nice solid platform from which I can implement things in chunks of 2-3 hours, which I can easily fit into an evening after work.
So I'm currently in the process of replacing generic calls to handle individual opcodes with specialised code which avoids the function call overhead and can hardcode register locations and immediate values. I can also optimise certain things very well. As an example there is no 'mov' instruction on the MIPS architecture so you quite often see ops like 'or a0, a1, r0' (which just ORs the value in a1 with 0 and stores the result in a0). I can write the code to handle OR to simplify this particular situation by just directly copying the contents of register a1 to a0, and therefore avoid the logical operation altogether.
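To make the OR case concrete, here is a minimal sketch of that kind of specialisation. The Emitter type, the Reg helper and the textual "copy"/"or" ops are hypothetical stand-ins for illustration, not the emulator's real code generator; the only facts relied on are that MIPS register 0 always reads as zero and that writes to it are discarded.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a code emitter: it records host ops as text
// so the choice made for each N64 instruction is easy to inspect.
struct Emitter {
    std::vector<std::string> ops;
    void Emit(const std::string& op) { ops.push_back(op); }
};

// Names an emulated register for the listing (illustrative only).
static std::string Reg(unsigned r) { return "gpr" + std::to_string(r); }

// Specialised handling of 'OR rd, rs, rt'.
void GenerateOR(Emitter& e, unsigned rd, unsigned rs, unsigned rt) {
    if (rd == 0) return;  // writes to r0 are discarded, so emit nothing

    if (rs == 0 || rt == 0) {
        // OR with r0 is just a register copy (a copy of r0 itself if both
        // sources are r0, which clears rd).
        unsigned src = (rt == 0) ? rs : rt;
        if (src != rd) e.Emit("copy " + Reg(rd) + ", " + Reg(src));
        return;  // copying a register onto itself needs no code at all
    }

    // General case: a real logical OR is required.
    e.Emit("or " + Reg(rd) + ", " + Reg(rs) + ", " + Reg(rt));
}
```

With a0 as register 4 and a1 as register 5, GenerateOR(e, 4, 5, 0) records a single copy and no logical operation.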
As another example, here's the code I wrote last night to handle LUI (the operation to load a 16-bit value into the upper half of the low 32 bits of the N64 register):
LUI( PspReg_T0, immediate );
SetVar( &gGPR[rt]._u32, PspReg_T0 );
if (immediate >= 0) {
    SetVar( &gGPR[rt]._u32, PspReg_R0 ); // Clear top bits
} else {
    SRA( PspReg_T1, PspReg_T0, 0x1f );
    SetVar( &gGPR[rt]._u32, PspReg_T1 ); // Set top bits
}
In the code I talked about on Sunday, handling this instruction would involve 13 ops: 2 to load the value of the opcode into the argument register, and another to call the 10 op-long function 'R4300_LUI' (Daedalus's instruction handler for LUI).
With the code above this is reduced to 4 ops in the worst case (if the immediate value is negative), or just 3 ops if the value is zero or positive. Also, there is no branching. To give a specific example, this N64 opcode:
LUI at, 0x8034
now causes this PSP code to be generated:
LUI t0, 0x8034
SW t0, 8(s0) ; s0 points to the emulated register set
SRA t1, t0, 0x1f
SW t1, 12(s0)
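As a sanity check on the values those stores write, the following sketch computes the two 32-bit halves that LUI should leave in the emulated register. RegHalves, EvalLUI and CountLUIOps are hypothetical names used only for this illustration; the op counts simply restate the 3-or-4 figure given above.

```cpp
#include <cstdint>

// The two 32-bit halves of a 64-bit emulated register (illustrative layout).
struct RegHalves {
    uint32_t lo;
    uint32_t hi;
};

// Computes what LUI leaves in the register: the immediate placed in
// bits 16..31, with the result sign-extended to 64 bits.
RegHalves EvalLUI(int16_t immediate) {
    uint32_t lo = static_cast<uint32_t>(static_cast<uint16_t>(immediate)) << 16;
    // Same result as 'SRA t1, t0, 0x1f': every bit of the upper half is a
    // copy of bit 31 of the lower half.
    uint32_t hi = (lo & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    return {lo, hi};
}

// Number of host ops emitted for a given immediate: the SRA is only needed
// when the immediate is negative, otherwise r0 can be stored directly.
int CountLUIOps(int16_t immediate) {
    return immediate >= 0 ? 3 : 4;
}
```

For the immediate 0x8034 this gives a lower half of 0x80340000 and an upper half of 0xFFFFFFFF, which is why the SRA appears in the generated code for that opcode.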
My intention is to spend the next few days reimplementing the most commonly used opcodes in this way. By that point I think the major overhead will shift from the cost of all of the function calls to the generic handlers to the cost of storing and loading the emulated registers each time they're referenced (you'll notice in the snippet above I call SW twice - once for each half of the 64-bit N64 register.)
From previous experience, register caching is where the real speedups come from with dynamic recompilation. Memory accesses are typically an order of magnitude slower than register access so anything I can do to avoid them in the recompiled code will be a huge improvement.
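To illustrate why caching helps, here is a rough sketch of a register cache that only counts the loads and stores it would emit. Everything in it is an assumption made for the example: the RegisterCache class, the pool of four host registers, the round-robin eviction, and treating each emulated register as a single unit rather than two 32-bit halves.

```cpp
#include <array>

// Hypothetical register cache: tracks which emulated N64 registers are held
// in a small pool of host registers and counts the memory accesses needed.
class RegisterCache {
public:
    static constexpr int kNumHostRegs = 4;  // assumed pool size
    static constexpr int kNone = -1;

    // Returns the host slot holding 'n64_reg', loading it on a miss.
    int Read(int n64_reg) {
        int slot = Find(n64_reg);
        if (slot == kNone) {
            slot = Allocate(n64_reg);
            ++loads;  // a miss would emit a load from the register file
        }
        return slot;
    }

    // Marks 'n64_reg' as written; the store is deferred until it is evicted
    // or the cache is flushed.
    int Write(int n64_reg) {
        int slot = Find(n64_reg);
        if (slot == kNone) slot = Allocate(n64_reg);
        slots_[slot].dirty = true;
        return slot;
    }

    // Writes back every dirty register, e.g. at the end of a fragment.
    void Flush() {
        for (auto& s : slots_) Evict(s);
    }

    int loads = 0;
    int stores = 0;

private:
    struct Slot {
        int n64_reg = kNone;
        bool dirty = false;
    };

    // Finds the slot caching 'n64_reg'; Find(kNone) finds a free slot.
    int Find(int n64_reg) const {
        for (int i = 0; i < kNumHostRegs; ++i)
            if (slots_[i].n64_reg == n64_reg) return i;
        return kNone;
    }

    void Evict(Slot& s) {
        if (s.n64_reg != kNone && s.dirty) ++stores;  // write-back store
        s = Slot{};
    }

    int Allocate(int n64_reg) {
        int slot = Find(kNone);
        if (slot == kNone) {  // nothing free: evict in round-robin order
            slot = next_victim_;
            next_victim_ = (next_victim_ + 1) % kNumHostRegs;
            Evict(slots_[slot]);
        }
        slots_[slot].n64_reg = n64_reg;
        slots_[slot].dirty = false;
        return slot;
    }

    std::array<Slot, kNumHostRegs> slots_{};
    int next_victim_ = 0;
};
```

In this model a sequence that writes two registers and then reads them back three times costs two stores at flush time and no loads, where storing and loading on every reference would cost five memory accesses.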
If anyone is curious, I've been reading these two papers on fast register allocation for dynamic code generation:
Linear Scan Register Allocation - Poletto, Sarkar. 1999
A fast, memory-efficient register allocation framework for embedded systems - Thammanur, Pande. 2004
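For anyone who hasn't come across it, the core of linear scan is small enough to sketch. This is a simplified illustration of the general idea described in the first paper, not code from either paper or from the emulator: the Interval struct and LinearScan function are hypothetical names, live intervals are assumed to be already computed, and a spilled interval is simply marked with register -1.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A live interval: the value is needed from 'start' to 'end' inclusive.
struct Interval {
    int start;
    int end;
    int reg;  // assigned register, or -1 if the value is spilled to memory
};

// Assigns one of 'num_regs' registers to each interval in a single pass.
void LinearScan(std::vector<Interval>& intervals, int num_regs) {
    for (auto& iv : intervals) iv.reg = -1;
    if (num_regs <= 0) return;  // no registers: everything stays spilled

    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::vector<int> free_regs;
    for (int r = num_regs - 1; r >= 0; --r) free_regs.push_back(r);

    // Indices of intervals currently holding a register, by increasing end.
    std::vector<std::size_t> active;
    auto insert_active = [&](std::size_t idx) {
        auto pos = std::find_if(active.begin(), active.end(), [&](std::size_t j) {
            return intervals[j].end > intervals[idx].end;
        });
        active.insert(pos, idx);
    };

    for (std::size_t i = 0; i < intervals.size(); ++i) {
        // Expire intervals that ended before this one starts.
        while (!active.empty() && intervals[active.front()].end < intervals[i].start) {
            free_regs.push_back(intervals[active.front()].reg);
            active.erase(active.begin());
        }

        if (static_cast<int>(active.size()) == num_regs) {
            // No register free: spill whichever interval ends last.
            std::size_t last = active.back();
            if (intervals[last].end > intervals[i].end) {
                intervals[i].reg = intervals[last].reg;
                intervals[last].reg = -1;
                active.pop_back();
                insert_active(i);
            }
            // Otherwise the new interval itself stays spilled (reg == -1).
        } else {
            intervals[i].reg = free_regs.back();
            free_regs.pop_back();
            insert_active(i);
        }
    }
}
```

The appeal for dynamic code generation is that allocation is a single pass over the sorted intervals, rather than the graph colouring a conventional compiler would use.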
[Edit 23:40 - fix markup]