Wednesday, August 15, 2007

Whoops

It turns out that, in a conversation we had way back in February, exophase had already suggested two of the three optimisations I implemented this weekend.

Whoops! It's been such a long time, I'd totally forgotten we'd talked about it. I really should pay more attention sometimes - if I had been, you might have seen some of these improvements way back in R9. Sorry exophase - kudos for spotting this so long ago!

-StrmnNrmn

Interesting Dynarec Hack

I was playing around with the code generation a couple of evenings ago, and realised that if I made a certain assumption, I could drastically speed up specific types of memory accesses.

When I discussed load/store handling on Sunday, I presented the new code that is typically generated for handling a load such as 'lw $t0, 0x24($t1)' on the N64:


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data


(I'll ignore all the extra code which is generated, and just concentrate on the 5 instructions above which correspond to the expected path of execution.)

Of the 5 instructions that are generated, two - the SLT and BEQ - are there purely to handle the cases where the address is invalid, a hardware register (i.e. memory-mapped I/O), or a virtual address. I'll call all of this 'error handling' for short.

If we were generating code in an idealised environment where we didn't have to perform error handling, we could drop the SLT/BEQ instructions to get this:


ADDIU a0 = s1 + 0x0024 # add offset to base register
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data


We could then optimise this even further and perform the offset calculation directly as a part of the LW instruction:


ADDU a1 = s1 + s7 # add offset to emulated ram
LW s0 <- 0x0024(a1) # load data


In this idealised situation we could reduce an emulated load to just two instructions, with no branches. That's a pretty good saving!

The problem is that the environment we're generating code for is not 'ideal', and it's hard to know ahead of time which memory accesses are going to directly access physical ram, and which are going to access hardware registers or require virtual address translation. For that reason, we have to place a guard around every memory access to make sure that it behaves correctly. At least, that was the way I was thinking until earlier in the week.

What I realised on Monday is that I can make an assumption that lets me remove the error handling code for certain types of load/stores. The assumption is that when the N64 accesses any memory through the stack pointer ($sp) register, the address is always going to be valid, physical memory.

The assumption relies on the fact that most roms don't do anything particularly clever with the stack pointer - it gets set up for each thread to point at a valid region of memory, and the game just runs along, pushing and popping values as the code executes. Of course, if the assumption is wrong then the emulator will just crash and grind to a halt in an unpredictable manner :)

It was straightforward to add a hack to the code generation to exploit this kind of behaviour, and the results have been better than I expected - I'm seeing at least a 10% speed up, and the code expansion factor (the ratio of generated bytes of instructions to input bytes) has dropped from around 5.0x to 4.0x. Stability has been excellent too - I've run about 8 roms with the hack so far, and all of them have run perfectly.
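
To give a feel for how simple the hack is, here's a rough sketch of the decision in the code generator. The names here (GenerateLoad, EmitGuardedLoad and so on) are made up for illustration - they're not the actual names in the Daedalus source:

typedef unsigned int u32;
typedef signed short s16;

// Hypothetical emitter entry points, standing in for the real ones.
void EmitGuardedLoad( u32 base_reg, s16 offset, u32 dest_reg );  // SLT/BEQ/ADDU/LW
void EmitDirectLoad( u32 base_reg, s16 offset, u32 dest_reg );   // ADDU/LW only

static const u32 N64_REG_SP = 29;                // $sp is GPR 29 on the R4300
static bool      gStackPointerHackEnabled = true;

void GenerateLoad( u32 base_reg, s16 offset, u32 dest_reg )
{
    if( gStackPointerHackEnabled && base_reg == N64_REG_SP )
    {
        // Assume $sp-relative accesses always hit valid physical ram,
        // so skip the error handling entirely.
        EmitDirectLoad( base_reg, offset, dest_reg );
    }
    else
    {
        // General case: range check plus trampoline branch, as before.
        EmitGuardedLoad( base_reg, offset, dest_reg );
    }
}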

I think one of the reasons the hack has such an impact is that a lot of the memory accesses in a typical C program are through the stack. Here's an example snippet from the entry to a function on the N64, where the compiler emitted code to store the return address and arguments:


SW ra -> 0x0014(sp)
SW a0 -> 0x0058(sp)
SW a1 -> 0x005c(sp)
SW a2 -> 0x0060(sp)


When I look through disassembly for the roms I'm working on, it's very common to see lots of sequential loads/stores relative to the stack pointer like this.

Previously Daedalus generated around 20 instructions (including 5 branches) for the above snippet. With the hack, the generated code now looks like this:


ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
ADDU t1 = s1 + s7
SW s3 -> 0x0058(t1)
ADDU t1 = s1 + s7
SW s2 -> 0x005c(t1)
ADDU t1 = s1 + s7
SW s5 -> 0x0060(t1)


8 instructions, 0 branches. What's more, it looks like with a little work, I could eliminate 3 redundant address calculations:


ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
SW s3 -> 0x0058(t1)
SW s2 -> 0x005c(t1)
SW s5 -> 0x0060(t1)


Now that would be efficient :)
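
Eliminating those redundant calculations is essentially a tiny local common-subexpression optimisation: remember which base register's translated address is already sitting in the temp register, and reuse it until that register changes. A sketch of the bookkeeping, again with hypothetical names:

// Hypothetical helpers, in the same style as the sketch above.
typedef unsigned int u32;
enum { PSP_REG_T1 = 9, PSP_REG_S7 = 23 };        // MIPS register numbers
u32  MapN64RegToPsp( u32 n64_reg );
void EmitADDU( u32 dst, u32 lhs, u32 rhs );

// Which N64 base register's translated address (base + s7) is
// currently sitting in t1? -1 means none.
static int gCachedBaseReg = -1;

void EmitTranslatedBase( u32 base_reg )
{
    if( gCachedBaseReg != (int)base_reg )
    {
        EmitADDU( PSP_REG_T1, MapN64RegToPsp( base_reg ), PSP_REG_S7 );
        gCachedBaseReg = (int)base_reg;
    }
    // Otherwise t1 already holds (base + s7) - just reuse it.
}

// Invalidate whenever the base register is overwritten, or at any
// point that can be reached from more than one place.
void OnRegisterModified( u32 reg )
{
    if( (int)reg == gCachedBaseReg )
        gCachedBaseReg = -1;
}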

I still want to do lots of testing with the hack. I want to find out if there are roms that don't work with the hack enabled, and how common a problem it is. It's such a significant optimisation though that I'm certain I'll be adding it as an option in Daedalus R13. The results of my testing will probably determine whether I default it to on or off though.

So far Daedalus R13 is shaping up to be significantly faster than R12. I'm still not sure when I'll be ready to release it, but you'll hear about it here first.

-StrmnNrmn

Monday, August 13, 2007

Dynarec Improvements

I've had a fairly productive week working on optimising the Dynarec Engine. It's been a few months since I worked on improving the code generation (as opposed to simply fixing bugs), so it's taken me a while to get back up to speed.

At the end of each fragment, I perform a little housekeeping to check whether it's necessary to exit from the dynarec system to handle various events. For instance, if a vertical blank is due this can result in me calling out to the graphics code to flip the current display buffers. The check simply involves updating the N64's COUNT register, and checking to see whether there are any time-dependent interrupts to process (namely vertical blank or COMPARE interrupts.)

I had an idea on the train into work on Monday, when I realised that there were a couple of ways in which I could make this more efficient. Firstly, the mechanism I was using to keep track of pending events was relatively complex, involving maintaining doubly-linked lists of events. I realised that if I simplified this code it would make it much easier for the dynarec engine to update and check this structure directly rather than calling out to C code.
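
I won't describe the new structure in detail here, but the core of the idea is that a single 'cycles until the next event' counter is all the fast path needs - trivial for generated code to update and test directly. Something along these lines (a simplified sketch, not the actual implementation):

typedef signed int s32;

// All the fast path needs to know is whether any time-based event
// (vertical blank, COMPARE) is due yet.
static s32 gCyclesToNextEvent = 0;

// Conceptually called at the end of each fragment; simple enough for
// the dynarec to inline directly.
inline bool UpdateCountAndCheckEvents( s32 cycles_executed )
{
    gCyclesToNextEvent -= cycles_executed;
    return gCyclesToNextEvent <= 0;     // slow path processes the event
}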

The other idea I had on the train was to split up the function I was calling to do this testing into two different versions. There are two ways that the dynarec engine can be exited - either through a normal instruction, or a branch delay instruction (i.e. an instruction immediately following a branch.) My handler function catered for both of these cases by taking a flag as an argument. I realised that by providing a separate version of this function for each type I could remove the need to pass this flag as an argument, which saved a couple of instructions from the epilogue of each fragment.
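
In C++ terms it's the classic trick of turning a runtime flag into a compile-time one - two copies of the function get stamped out, and each fragment epilogue calls whichever one it needs. A sketch of the idea (not the actual Daedalus code):

// One copy of the handler per exit type - the branch-delay behaviour
// is baked in at compile time rather than passed as an argument.
template< bool FromBranchDelaySlot >
void CheckForInterrupts()
{
    if( FromBranchDelaySlot )
    {
        // ... adjust the PC for an exit from a branch delay slot
    }
    // ... common COUNT update and pending-event test
}

// Each fragment epilogue then calls whichever version it needs:
//   CheckForInterrupts< false >();   // exited via a normal instruction
//   CheckForInterrupts< true >();    // exited via a branch delay instruction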

These two small changes only took a couple of hours to implement, but yielded a 3-5% speedup on the various roms I tested. They also slightly reduced the amount of memory needed for the dynarec system, improving cache usage along the way.

The next significant optimisation I made this week was to improve the way I was handling the code generation for load/stores. Here's what the generated code for 'lw $t0, 0x24($t1)' looks like in Daedalus R12 (assume t1 is cached in s1, and t0 is cached in s0 on the PSP):


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
ADDU a1 = a0 + s7 # add offset to emulated ram
BNEL t0 != r0 --> cont # valid address?
LW s0 <- 0x0000(a1) # load data
J _HandleLoadStore_XYZ123 # handle vmem, illegal access etc
NOP
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception


There are a couple of things to note here. Firstly, I use s6 and s7 on the PSP to hold two constants throughout execution. s6 is either 0x80400000 or 0x80800000 depending on whether the N64 being emulated has the Expansion Pak installed. s7 is set to be (emulated_ram_base - 0x80000000). Keeping these values in registers prevents me from using them for caching N64 registers, but the cost is far outweighed by the more streamlined code. As it happens, I also use s8 to hold the base pointer for most of the N64 CPU state (registers, pc, branch delay flag etc) for the same reason.

So the code first adds on the required offset. It then checks that the resulting address is in the range 0x80000000..0x80400000, and sets t0 to 1 if this is the case, or clears it otherwise*. It then adds on the offset (emulated_ram_base - 0x80000000), which gives the translated address on the PSP in a1. The use of BNEL ('Branch Not Equal Likely') is carefully chosen - the 'Likely' bit means that the following instruction is only executed if the branch is taken. If I had used a plain BNE, the emulator could often crash dereferencing memory with the following LW ('Load Word').

Assuming the address is out of range, the branch and load are skipped, and control is passed to a specially constructed handler function. I've called it _HandleLoadStore_XYZ123 for the benefit of discussion, but the name isn't actually generated, it's just meant to indicate that it's unique for this memory access. The handler function is too complex to describe here, but it's sufficient to say that it returns control to the label 'cont' if the memory access was performed ok (e.g. it might have been a virtual address), else it bails out of the dynarec engine and triggers an exception.

When I originally wrote the above code I didn't think it was possible to improve it any further. I didn't like the J/NOP pair, but I saw them as a necessary evil. All 'off trace' code is generated in a second dynarec buffer which is about 3MiB from the primary buffer - too far for a branch which has a maximum range of +/-128KiB. I used the BNEL to skip past the Jump 'J' instruction which can transfer control anywhere in memory.

What I realised over the weekend was that I could place a 'trampoline' with a jump to the handler function immediately following the code for the fragment. Fragments tend to be relatively short - short enough to be within the range of a branch instruction. With this in mind, I rewrote the code generation for load and store instructions to remove the J/NOP pair from the main flow of the trace:


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception
#
# rest of fragment code follows
# ...


_Trampoline_XYZ123:
# handler returns control to 'cont'
J _HandleLoadStore_XYZ123
NOP


The end result is that this removes two instructions from the main path through the fragment. Although in the common case five instructions are executed in both snippets of code, the second example is much more instruction cache friendly as the 'cold' J/NOP instructions are moved to the end of the fragment. I've heard that there is a performance penalty for branch-likely instructions on modern MIPS implementations, so it's nice to get rid of the BNEL too.

As with the first optimisation, this change yielded a further 3-5% speedup.

The final optimisation I've made this weekend is to improve the way I deal with fragments that loop back to themselves as they exit. Here's a simple example:


8018e014 LB t8 <- 0x0000(a1)
8018e018 LB t9 <- 0x0000(a0)
8018e01c ADDIU a0 = a0 + 0x0001
8018e020 XOR a2 = t8 ^ t9
8018e024 SLTU a2 = (r0<a2)
8018e028 BEQ a2 == r0 --> 0x8018e038
8018e02c ADDIU a1 = a1 + 0x0001
8018e038 LB t0 <- 0x0000(a0)
8018e03c NOP
8018e040 BEQ t0 == r0 --> 0x8018e058
8018e044 NOP
8018e048 LB t1 <- 0x0000(a1)
8018e04c NOP
8018e050 BNE t1 != r0 --> 0x8018e014
8018e054 NOP


I'm not sure exactly what this code is doing - it looks like a loop implementing something like strcmp() - but it's one of the most executed fragments of code in the front end of Mario 64.

The key thing to notice about this fragment is that the last branch target loops back to the first instruction. In R12, I don't perform any specific optimisation for this scenario, so I flush any dirty registers that have been cached as I exit, and immediately reload them when I re-enter the fragment. Simplified pseudo-assembly for R12 looks something like this:


enter_8018e014:
load n64 registers into cached regs

perform various calculations on cached regs

if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038

perform various calculations on cached regs

flush dirty cached regs back to n64 regs

if ok-to-continue
goto enter_8018e014
exit_8018e014:
...

enter_8018e038:
...


The key thing to notice is that we load and flush the cached registers on every iteration through the loop. Ideally we'd just load them once, loop as much as possible, and then flush them back to memory before exiting. I've spent the day re-working the way the dynamic recompiler handles situations such as this. This is what the current code looks like:


enter_8018e014:
load n64 registers into cached regs
mark modified regs as dirty

loop:
perform various calculations on cached regs

if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038

perform various calculations on cached regs

if ok-to-continue
goto loop

flush dirty cached regs back to n64 regs
exit_8018e014:
...

enter_8018e038:
...


In this version, the registers are loaded and stored outside of the inner loop. They may still be flushed during the loop, but only if we branch to another trace. Before we enter the inner loop, we need to mark all the cached registers as being dirty, so that they're correctly flushed whenever we finally exit the loop.
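
The bookkeeping behind 'mark modified regs as dirty' is straightforward - a bitmask over the cached registers is enough. Roughly, in a simplified sketch with hypothetical helper names:

typedef unsigned int u32;

// Simplified view of the register cache while compiling a fragment
// that loops back to its own entry point.
struct RegisterCache
{
    u32 ValidMask;    // N64 regs currently cached in PSP regs
    u32 DirtyMask;    // cached values that no longer match memory
};

// Hypothetical emitter helpers.
u32  RegsWrittenByFragment();
void EmitLoadCachedRegisters( const RegisterCache & cache );
void EmitLoopBody( RegisterCache & cache );
void EmitFlushDirtyRegisters( const RegisterCache & cache );

void CompileSelfLoopingFragment( RegisterCache & cache )
{
    EmitLoadCachedRegisters( cache );      // once, before 'loop:'

    // After one iteration the cached copies of anything the loop writes
    // will differ from memory, so mark them dirty up front - every exit
    // path then flushes the right registers.
    cache.DirtyMask |= RegsWrittenByFragment();

    EmitLoopBody( cache );                 // exits to other traces flush first

    EmitFlushDirtyRegisters( cache );      // once, on the final exit
}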

This new method is much more efficient when it comes to handling tight inner loops such as the assembly shown above. I still have some work to do in improving my register allocation, but the changes I've made today yield a 5-6% speedup. Combined with the other two optimisations I've described, I'm currently seeing an overall 10-15% speedup over R12.

I'm quite excited about the progress I've made so far with R13. I still have lots of ideas for other optimisations I want to implement for R13 which I'll talk about over the coming days. I don't have any release date in mind for R13 at the moment, so there's no point in asking me yet :)

-StrmnNrmn


*The SLT instruction is essentially doing 'bool inrange = address >= 0x80000000 && address < (0x80000000+ramsize)'. It works because SLT is a signed comparison: addresses from 0x80000000 upwards are the most negative 32-bit values, so anything below 0x80000000 is positive and automatically compares greater than the limit. I think the fact that this can be expressed in a single instruction is both beautiful and extremely fortunate :)
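
If you want to convince yourself that the single signed compare really does cover both bounds, here's a tiny standalone test (nothing Daedalus-specific, just an illustration of the trick):

#include <cstdio>

typedef unsigned int u32;
typedef signed int   s32;

// The obvious two-sided range check...
static bool InRangeTwoTests( u32 address, u32 ramsize )
{
    return address >= 0x80000000u && address < (0x80000000u + ramsize);
}

// ...and the single signed compare that SLT performs. Addresses from
// 0x80000000 upwards are the most negative signed values, so anything
// below 0x80000000 compares as positive, i.e. greater than the limit.
static bool InRangeSlt( u32 address, u32 ramsize )
{
    return (s32)address < (s32)(0x80000000u + ramsize);
}

int main()
{
    const u32 ramsize = 0x00400000;    // 4MiB, no Expansion Pak
    const u32 tests[] = { 0x00000000, 0x7fffffff, 0x80000000,
                          0x803fffff, 0x80400000, 0xa0000000 };

    for( u32 i = 0; i < sizeof(tests)/sizeof(tests[0]); ++i )
    {
        std::printf( "%08x: %d %d\n", tests[i],
                     InRangeTwoTests( tests[i], ramsize ),
                     InRangeSlt( tests[i], ramsize ) );
    }
    return 0;
}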

Thursday, August 02, 2007

Daedalus PSP under OSX

I mentioned a few days ago that I've recently bought a MacBook Pro. I've never owned a Mac before so it's been a really interesting experience learning my way around. Initially I was planning on dual booting Windows XP via Boot Camp, but I quickly found that I could do almost everything I need to in OSX. Although I did install Parallels Desktop, I rarely find myself running any Windows apps.

One of the main things I need for day to day 'work' is the ability to compile and run Daedalus. The rest of this article describes the process of setting up the PSPSDK under OSX, and compiling Daedalus PSP for the first time. This post is really aimed at people who are interested in compiling Daedalus for themselves on OSX. Hopefully it will also be useful for other PSP homebrew developers who have been having problems getting the PSPSDK set up under OSX.

To install the PSPSDK I largely followed this guide on the ps2dev.org forums. I already had XCode and fink installed. Fink complained when I tried to install all the listed packages, and I had to remove one of them from the command line (I think it was autogen, but I can't really remember now.)

I've been using the most recent psptoolchain script, which was updated a few weeks ago, and I had to make a couple of modifications. In depends/check-ncurses.sh I had to change the check for ncurses to look for the OSX .dylib file:


## Check for a ncurses library.
ls /usr/lib/libncurses.a 1> /dev/null ||
ls /usr/lib/libncurses.dll.a ||
{ echo "ERROR: Install ncurses before continuing."; exit 1; }


became:


## Check for a ncurses library.
ls /usr/lib/libncurses.a 1> /dev/null ||
ls /usr/lib/libncurses.dll.a ||
ls /usr/lib/libncurses.dylib ||
{ echo "ERROR: Install ncurses before continuing."; exit 1; }


Secondly, I had to make a change to scripts/001-binutils-2.16.1.sh. As urchin mentions on the ps2dev.org forum, ".m" is the extension for Objective-C files on OSX, so make's built-in implicit rules end up trying to compile part of binutils as Objective-C. Passing '-r' tells make to ignore the built-in implicit rules, and everything works fine.

So:


## Compile and install.
make clean && make -j 2 && make install && make clean || { exit 1; }


became:


## Compile and install.
make clean && make -r -j 2 && make install && make clean || { exit 1; }


(note the '-r' flag on the second invocation of make.)

I left the psptoolchain script doing its stuff for a couple of hours, and when I came back to it everything seemed to have completed and installed correctly.

The next step was to get Daedalus PSP compiling. Somewhat naively I assumed Daedalus PSP would compile out of the box on OSX. As it turns out, I had to make a few small changes to get everything compiling nicely.

Firstly, I had to update the makefile so that it directly referenced psp-gcc and psp-g++. Normally, I build Daedalus PSP through a Visual Studio Makefile project, and I had used a couple of scripts from the ps2dev.org forums to format GCC's output into a format that Visual Studio understands, so that double clicking on an error in the output opens the corresponding file in the editor. I found a better way to handle this, so I changed the CC/CXX macros to refer to the original pspsdk tools.

The main problem I encountered was my arbitrary use of backslashes instead of forward slashes in #include directives, e.g.:


#include "Core\CPU.h"


should be:


#include "Core/CPU.h"


Another subtle error came from the way I was specialising static member functions of class templates defined inside a namespace. An example will probably help explain. Here's the basic outline of the class I use to implement the singleton pattern:


namespace daedalus
{
    template< class T >
    class CSingleton
    {
    public:
        static bool Create();

        static T * Get() { return mpInstance; }

    private:
        static T * mpInstance;
    };
}


The Create method for the singleton class is then implemented like this:


template<> bool CSingleton< CController >::Create()
{
    DAEDALUS_ASSERT_Q(mpInstance == NULL);

    mpInstance = new IController();

    return true;
}


For some reason this started failing when compiling Daedalus PSP under OSX:


Source/Core/PIF.cpp: At global scope:
Source/Core/PIF.cpp:250: error: specialization of 'static bool
daedalus::CSingleton<T>::Create() [with T = CController]' in different namespace
Source/Core/PIF.cpp:250: error: from definition of 'static bool
daedalus::CSingleton<T>::Create() [with T = CController]'


Rather than being an OSX issue, I suspect this error started occurring because the PSPSDK uses an updated version of GCC which is a bit stricter than the version I was using on Windows. Regardless, the fix was easy - the code just needed to be wrapped in the 'daedalus' namespace:


namespace daedalus
{
    template<> bool CSingleton< CController >::Create()
    {
        DAEDALUS_ASSERT_Q(mpInstance == NULL);

        mpInstance = new IController();

        return true;
    }
}


(I've recently been going off the singleton pattern, but that's another story :)

With these changes Daedalus PSP compiles perfectly under OSX. On my 2.4GHz MacBook Pro it takes just under 50 seconds. On my 2.4GHz Windows machine it takes over 2 minutes to compile, so I'm very impressed with the results.

I believe I've checked in all the required changes to the Daedalus SVN repository on SourceForge. If you decide to try compiling Daedalus PSP under OSX, let me know how you get on via the comments page. (I'll be rejecting any off-topic comments to try and keep the discussion constructive.)

-StrmnNrmn

Wednesday, August 01, 2007

333 MHz

On startup, Daedalus increases the default clock frequency of the PSP's CPU from 222MHz to 333MHz for a 'free' 50% speedup. I use 'free' in quotes because this comes at the expense of drawing more power, so the battery runs out of charge faster.
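
For anyone curious how this is done in homebrew: it's a single call into the pspsdk's power library, something like the snippet below. This is just a sketch of the usual pattern, not necessarily the exact call site in Daedalus.

#include <psppower.h>

void SetFullSpeed()
{
    // PLL, CPU and bus frequencies in MHz - 333/333/166 is the maximum.
    scePowerSetClockFrequency( 333, 333, 166 );
}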

I've been asked many times if I'm going to support 333MHz, so I wanted to put this question to rest once and for all. The answer is yes - I believe this has been the case since R1 :)

-StrmnNrmn