Retro Console Dev: R13

Showing posts with label R13. Show all posts

Sunday, November 04, 2007

Daedalus PSP R13 Released

I've just finished packing up R13 and uploading it to sourceforge:

Daedalus PSP R13 for v1.00 Firmware
Daedalus PSP R13 for v1.50+ Firmware
Daedalus PSP R13 Source

The most significant new feature is savestate support. You can now save your progress at any point, via the Pause Menu (accessed through hitting the Select button). Savestates are written out to the memory stick, and consume around a megabyte per slot. You can load up a savestate at any time from the Pause Menu, or via the front end (hit the right shoulder button to swap from the rom list to the savestate list.)

Other than savestates, the most significant change in R13 is a number of optimisations to the dynarec core which should give a 10-20% speedup depending on the title being played. I've tested these optimisations as much as I can, but if you find that roms which worked with R12 are now broken, try disabling 'dynamic recompilation' and/or 'dynarec stack optimisation' from the rom's preferences screen.

I haven't looked at compatibility at all in R13, so it's unlikely that any roms will have started working in R13.

I'm interested to hear your feedback on both of these features. Let me know if you have any problems with savestates, or if you've found roms are no longer working in R13. I'll try and keep on top of the comments pages over the next couple of weeks.

It's taken a LOT longer to get R13 out than I had hoped. I can't quite believe the last release was in June! I'm hoping now I've got this release out of the door I'll be able to get back to making small, more frequent releases.

-StrmnNrmn

Update A couple of things have cropped up in the comments. Firstly, when you create a savestate it can take a while - up to 10 seconds - during which time the screen is black. It will complete though - please be patient :). I'll have a look at adding a progress meter or something like that for R14.

There are a few titles that don't seem to be working well with some of the dynarec optimisations I added. If you're having problems, try turning off the 'dynarec stack optimisation' as discussed above.

Saturday, November 03, 2007

Daedalus R13 Soon

I've just about finished implementing the last couple of savestate features I talked about last week.

I've got a bit of polishing to do and a little bit more testing, but I'm hoping I should be able to release R13 sometime tomorrow (Sunday).

I'll post again then with a complete list of changes and links to the latest builds for you to download. See you then!

-StrmnNrmn

Thursday, October 25, 2007

R13 close, honestly!

Apologies again for the lack of updates. I know people get nervous when I don't update regularly, but I don't like to post when I have nothing new or exciting to talk about.

I'm very close to releasing R13, I've just been struggling to find the time to add the final finishing touches to the savestate support. I've found it hard to get into a regular working pattern since moving, so what should have been a 1 week job has ended up taking a month. Bioshock and Halo 3 haven't helped either.

Savestates are working very well. The current implementation provides 10 slots which are shared across all roms. You can save to any slot at any time from an option on the Pause menu, or choose to reload a previous savestate. In this way it works just like QuickSave/QuickLoad found in various PC games (I'm tempted to add a special slot for this to the top level of the Pause menu).

I use zlib to compress the savestates, so an 8MiB savestate compresses down to 1MiB (or even smaller if you're running games which don't use the Expansion Pak).

There are just a couple of things I need to finish off now before I can release. Firstly I need to check if the savestate you're loading is for a different rom than is currently loaded. If this is the case I need to scan through your available roms looking for the correct one, and load it. To make this efficient I need to get the RomDB which I use for the PC build working on the PSP.

The second thing I need to do is come up with a decent way of linking some metadata to the savestate so I can show this in the UI. Just simple stuff like the full name of the rom, time the savestate was created and the total time spent playing when it was created. I figure without this information it'll be quite difficult to figure out what it stored in each savestate slot. The alternative is to add a text entry system to the UI so you can name each savestate, but I think this will just delay things even more.

In summary R13 is really close and I've not forgotten about it. Just a few more things to sort out and it'll be ready for release.

-StrmnNrmn

Wednesday, September 19, 2007

Back Online

Apologies for the lack of updates. I've not had internet access at home since moving flat at the end of August. My ADSL was activated today so I'm finally up and running again. I can't believe how long it's taken!

Between the flat move and a busy few weeks at work (getting ready for the Tokyo Game Show) I've not had that much time to work on Daedalus, but I have made some progress on a really cool feature people have been asking for for some time, namely savestate support.

For those not familiar with savestates, the basic premise is very much like 'hibernate' or sleep modes on PCs and laptops. A savestate can be created at any point during emulation. At a later time the savestate can be reloaded to restore the emulator to the exact state it was in when the savestate was created.

What's really useful about savestates is that they work even if the underlying save mechanism used by a rom isn't supported properly. I'm hoping that by adding savestate support Daedalus will become significantly more usable for many roms. Savestates also make development a lot easier as it means I can jump straight into the middle of a game when debugging/profiling rather than sitting through the title sequence dozens of times.

I still have a little bit of work left to do on the savestate system, mostly on the UI side, but there are a number of additional optimisations I want to finish off before I release R13. I'm currently planning to get everything ready for release on either the weekend of the 29th Sept, or possibly the first weekend in October.

-StrmnNrmn

Wednesday, August 15, 2007

Interesting Dynarec Hack

I was playing around with the code generation a couple of evenings ago, and realised that if I made a certain assumption, I could drastically speed up specific types of memory accesses.

When I discussed load/store handling on Sunday, I presented the new code that is typically generated for handling a load such as 'lw $t0, 0x24($t1)' on the N64:


  ADDIU     a0 = s1 + 0x0024    # add offset to base register
  SLT       t0 = (a0  BEQ       t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
  ADDU      a1 = a0 + s7        # add offset to emulated ram
  LW        s0 <- 0x0000(a1)    # load data

(I'll ignore all the extra code which is generated, and just concentrate on the 5 instructions above which correspond to the expected path of execution.)

Of the 5 instructions that are generated, two - the SLT and BEQ - are just there for performing error handling in the case that the address is invalid, a hardware register (i.e. for memory-mapped I/O), or a virtual address. I'll call this error handling for short.

If we were generating code in an idealised environment where we didn't have to perform error handling, we could drop the SLT/BEQ instructions to get this:


  ADDIU     a0 = s1 + 0x0024    # add offset to base register
  ADDU      a1 = a0 + s7        # add offset to emulated ram
  LW        s0 <- 0x0000(a1)    # load data

We could then optimise this even further and perform the offset calculation directly as a part of the LW instruction:


  ADDU      a1 = s1 + s7        # add offset to emulated ram
  LW        s0 <- 0x0024(a1)    # load data

In this idealised situation we could reduce an emulated load to just two instructions, with no branches. That's a pretty good saving!

The problem is that the environment we're generating code from is not 'ideal', and it's hard to know in advance of time which memory accesses are going to directly access physical ram, and which are going to access hardware registers or require virtual address translation. For that reason, we have to place a guard around every memory access to make sure that it behaves correctly. At least, that was the way I was thinking until earlier in the week.

What I realised on Monday is that I can make an assumption that lets me remove the error handling code for certain types of load/stores. The assumption is that when the N64 accesses any memory through the stack pointer ($sp) register, the address is always going to be valid, physical memory.

The assumption relies on the fact that most roms don't do anything particularly clever with their stack pointers - it gets set up for each thread to point at a valid region of memory then the game just runs along, pushing and popping values from it as the code executes. Of course, if the assumption is wrong then the emulator will just crash and grind to a halt in a unpredictable manner :)

It was straightforward to add a hack to the code generation to exploit this kind of behaviour, and the results have been better than I expected - I'm seeing at least a 10% speed up, and the code expansion factor (the ratio of generated bytes of instructions to input bytes) has dropped from around 5.0x to 4.0x. Stability has been excellent too - I've run about 8 roms with the hack so far, and all of them have run perfectly.

I think one of the reasons the hack has such an impact is that a lot of the memory accesses in a typical C program are through the stack. Here's an example snippet from the entry to a function on the N64, where the compiler emitted code to store the return address and arguments:


SW        ra -> 0x0014(sp)
SW        a0 -> 0x0058(sp)
SW        a1 -> 0x005c(sp)
SW        a2 -> 0x0060(sp)

When I look through disassembly for the roms I'm working on, it's very common to see lots of sequential loads/stores relative to the stack pointer like this.

Previously Daedalus generated around 20 instructions (including 5 branches) for the above snippet. With the hack, the generated code now looks like this:


ADDU      t1 = s1 + s7
SW        s4 -> 0x0014(t1)
ADDU      t1 = s1 + s7
SW        s3 -> 0x0058(t1)
ADDU      t1 = s1 + s7
SW        s2 -> 0x005c(t1)
ADDU      t1 = s1 + s7
SW        s5 -> 0x0060(t1)

8 instructions, 0 branches. What's more, it looks like with a little work, I could eliminate 3 redundant address calculations:


ADDU      t1 = s1 + s7
SW        s4 -> 0x0014(t1)
SW        s3 -> 0x0058(t1)
SW        s2 -> 0x005c(t1)
SW        s5 -> 0x0060(t1)

Now that would be efficient :)

I still want to do lots of testing with the hack. I want to find out if there are roms that don't work with the hack enabled, and how common a problem it is. It's such a significant optimisation though that I'm certain I'll be adding it as an option in Daedalus R13. The results of my testing will probably determine whether I default it to on or off though.

So far Daedalus R13 is shaping up to be significantly faster than R12. I'm still not sure when I'll be ready to release it, but you'll hear about it here first.

-StrmnNrmn

Monday, August 13, 2007

Dynarec Improvements

I've had a fairly productive week working on optimising the Dynarec Engine. It's been a few months since I worked on improving the code generation (as opposed to simply fixing bugs), so it's taken me a while to get back up to speed.

At the end of each fragment, I perform a little housekeeping to check whether it's necessary to exit from the dynarec system to handle various events. For instance, if a vertical blank is due this can result in me calling out to the graphics code to flip the current display buffers. The check simply involves updating the N64's COUNT register, and checking to see whether there are any time-dependent interrupts to process (namely vertical blank or COMPARE interrupts.)

I had an idea on the train into work on Monday I realised that there were a couple of ways in which I could make this more efficient. Firstly, the mechanism I was using to keep track of pending events was relatively complex, involving maintaining doublely-linked lists of events. I realised that if I simplified this code it would make it much easier for the dynarec engine to update and check this structure directly rather than calling out to C code.

The other idea I had on the train was to split up the function I was calling to do this testing into two different versions. There are two ways that the dynarec engine can be exited - either through a normal instruction, or a branch delay instruction (i.e. an instruction immediately following a branch.) My handler function catered for both of these cases by taking a flag as an argument. I realised that by providing a separate version of this function for each type I could remove the need to pass this flag as an argument, which saved a couple of instructions from the epilogue of each fragment.

These two small changes only took a couple of hours to implement, but yielded a 3-5% speedup on the various roms I tested. They also slightly reduced the amount of memory needed for the dynarec system, improving cache usage along the way.

The next significant optimisation I made this week was to improve the way I was handling the code generation for load/stores. Here's what the generated code for 'lw $t0, 0x24($t1)' looks like in Daedalus R12 (assume t1 is cached in s1, and t0 is cached in s0 on the PSP):


  ADDIU     a0 = s1 + 0x0024     # add offset to base register
  SLT       t0 = (a0<s6)      # compare to upper limit
  ADDU      a1 = a0 + s7         # add offset to emulated ram
  BNEL      t0 != r0 --> cont # valid address?
  LW        s0 <- 0x0000(a1)  # load data
  J         _HandleLoadStore_XYZ123 # handle vmem, illegal access etc
  NOP
cont:
  # s0 now holds the loaded value,
  # or we've exited from dynarec with an exception

There are a couple of things to note here. Firstly, I use s6 and s7 on the PSP to hold two constants throughout execution. s6 is either 0x80400000 or 0x80800000 depending on whether the N64 being emulated has the Expansion Pak installed. s7 is set to be (emulated_ram_base - 0x80000000). Keeping these values in registers prevents me from using them for caching N64 registers, but the cost is far outweighed by the more streamlined code. As it happens, I also use s8 to hold the base pointer for most of the N64 CPU state (registers, pc, branch delay flag etc) for the same reason.

So the code first adds on the required offset. It then checks that the resulting address is in the range 0x80000000..0x80400000, and sets t0 to 1 if this is the case, or clears it otherwise*. It then adds on the offset (emulated_ram_base - 0x80000000) which gives it the translated address on the psp in a1. The use of BNEL 'Branch Not Equal Likely' is carefully chosen - the 'Likely' bit means that the following instruction is only executed if the branch is taken. If I had used a plain 'BNE', the emulator could often crash dereferencing memory with the following LW 'Load Word'.

Assuming the address is out of range, the branch and load are skipped, and control is passed to a specially constructed handler function. I've called it _HandleLoadStore_XYZ123 for the benefit of discussion, but the name isn't actually generated, it's just meant to indicate that it's unique for this memory access. The handler function is too complex to describe here, but it's sufficient to say that it returns control to the label 'cont' if the memory access was performed ok (e.g. it might have been a virtual address), else it bails out of the dynarec engine and triggers an exception.

When I originally wrote the above code I didn't think it was possible to improve it any further. I didn't like the J/NOP pair, but I saw them as a necessary evil. All 'off trace' code is generated in a second dynarec buffer which is about 3MiB from the primary buffer - too far for a branch which has a maximum range of +/-128KiB. I used the BNEL to skip past the Jump 'J' instruction which can transfer control anywhere in memory.

What I realised over the weekend was that I could place a 'trampoline' with a jump to the handler function immediately following the code for the fragment. Fragments tend to be relatively short - short enough to be within the range of a branch instruction. With this in mind, I rewrote the code generation for load and store instructions to remove the J/NOP pair from the main flow of the trace:


  ADDIU     a0 = s1 + 0x0024    # add offset to base register
  SLT       t0 = (a0<s6)     # compare to upper limit
  BEQ       t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
  ADDU      a1 = a0 + s7        # add offset to emulated ram
  LW        s0 <- 0x0000(a1) # load data
cont:
  # s0 now holds the loaded value,
  # or we've exited from dynarec with an exception
  #
  # rest of fragment code follows
  # ...
  
  
_Trampoline_XYZ123:
  # handler returns control to 'cont'
  J     _HandleLoadStore_XYZ123  
  NOP

The end result is that this removes two instructions from the main path through the fragment. Although in the common case five instructions are executed in both snippets of code, the second example is much more instruction cache friendly as the 'cold' J/NOP instructions are moved to the end of the fragment. I've heard that there is a performance penalty for branch-likely instructions on modern MIPS implementations, so it's nice to get rid of the BNEL too.

As with the first optimisation, this change yielded a further 3-5% speedup.

The final optimisation I've made this weekend is to improve the way I deal with fragments that loop back to themselves as they exit. Here's a simple example:


8018e014 LB        t8 <- 0x0000(a1)
8018e018 LB        t9 <- 0x0000(a0)
8018e01c ADDIU     a0 = a0 + 0x0001
8018e020 XOR       a2 = t8 ^ t9
8018e024 SLTU      a2 = (r0<a2)
8018e028 BEQ       a2 == r0 --> 0x8018e038
8018e02c ADDIU     a1 = a1 + 0x0001
8018e038 LB        t0 <- 0x0000(a0)
8018e03c NOP
8018e040 BEQ       t0 == r0 --> 0x8018e058
8018e044 NOP
8018e048 LB        t1 <- 0x0000(a1)
8018e04c NOP
8018e050 BNE       t1 != r0 --> 0x8018e014
8018e054 NOP

I'm not sure exactly what this code is doing - it looks like a loop implementing something like strcmp() - but it's one of the most executed fragments of code in the front end of Mario 64.

The key thing to notice about this fragment is that the last branch target loops back to the first instruction. In R12, I don't perform any specific optimisation for this scenario, so I flush any dirty registers that have been cached as I exit, and immediately reload them when I re-enter the fragment. Simplified pseudo-assembly for R12 looks something like this:


enter_8018e014:
  load n64 registers into cached regs
  
  perform various calculations on cached regs
  
  if some-condition
      flush dirty cached regs back to n64 regs
      goto enter_8018e038
  
  perform various calculations on cached regs
  
  flush dirty cached regs back to n64 regs
  
  if ok-to-continue
      goto enter_8018e014
exit_8018e014:
  ...
  
enter_8018e038:
 ...

The key thing to notice is that we load and flush the cached registers on every iteration through the loop. Ideally we'd just load them once, loop as much as possible, and then flush them back to memory before exiting. I've spent the day re-working the way the dynamic recompiler handles situations such as this. This is what the current code looks like:


enter_8018e014:
  load n64 registers into cached regs
  mark modified regs as dirty

loop: 
  perform various calculations on cached regs
  
  if some-condition
      flush dirty cached regs back to n64 regs
      goto enter_8018e038
  
  perform various calculations on cached regs
   
  if ok-to-continue
      goto loop

  flush dirty cached regs back to n64 regs
exit_8018e014:
  ...
  
enter_8018e038:
 ...

In this version, the registers are loaded and stored outside of the inner loop. They may still be flushed during the loop, but only if we branch to another trace. Before we enter the inner loop, we need to mark all the cached registers as being dirty, so that they're correctly flushed whenever we finally exit the loop.

This new method is much more efficient when it comes to handling tight-inner loops such as the assembly shown above. I still have some work to do in improving my register allocation, but the changes I've made today yield a 5-6% speedup. Combined with the other two optimisations I've described, I'm currently seeing an overall 10-15% speedup over R12.

I'm quite excited about the progress I've made so far with R13. I still have lots of ideas for other optimisations I want to implement for R13 which I'll talk about over the coming days. I don't have any release date in mind for R13 at the moment, so there's no point in asking me yet :)

-StrmnNrmn

*The SLT instruction is essentially doing 'bool inrange = address >= 0x80000000 && address < (0x80000000+ramsize)'. I think the fact that this can be expressed in a single instruction is both beautiful and extremely fortunate :)

Sunday, July 29, 2007

Recharged

It's been quite a while since my last update. I was starting to feel a little worn out from all the work I'd been putting into the emulator over the past few months so since releasing R12 I've been taking a bit of a break from Daedalus to unwind and recharge my batteries.

It's been really nice just taking a bit of a break to do a few different things with my spare time. I spent a short while in Spain with my sister, and since I've been back I've been catching up with a bit of reading, watched a load of TV that I'd queued up and played through a few games I'd had gathering dust for a while. Sadly my 360 succumbed to the Red Ring of Death last week so it seems I might have been pushing it a bit too hard :)

I found a new flat in Guildford which I'll be moving to at the end of August. I'm quite excited about the move as I'll save a lot of time commuting. It currently takes me little over an hour to travel into work and it'll just be 15 minutes once I move. I'm hoping that cutting back on the commute should not only free up a couple of hours each day, but I'll also be a bit less knackered once I finish for the day.

I also picked up a Macbook Pro during my 'time off' and I've really been enjoying my new-found sense of computing freedom (I've been writing this post on the train on the way home from work.) It's the 17" 'lapzilla' model, and it's an absolute beast. It actually compiles Daedalus faster than my year old desktop PC which I find pretty impressive. I'll have some details and tutorials about compiling Daedalus PSP under OSX in the near future.

Now that I've had a bit of time off, I'm feeling very excited about getting cracking with R13. My main feeling is that I'd like to continue working on speeding up the emulator to try and improve the framerate for titles that are already working. From your comments on this blog and on other sites such as the DCEmu forums (the site seems to be down right now - I'll update with a link later) it seems that this is mostly what people are interested in seeing for the next release.

There are a number of different areas I can investigate to help improve performance. The two main possibilities I want to investigate are working on further dynarec improvements, and looking at making use the Media Engine. To start with I'm going to explore both of these areas and try and figure out which would give the biggest 'bang for buck' for R13. I'll post and update on R13 when I have more details.

-StrmnNrmn

PS Thanks for all your comments while I've been away. I have about 100 left to approve which I'll start to go through now.

Retro Console Dev