Thursday, March 29, 2007

R10 Countdown

I'm in the process of tidying up my current build so that I can release R10 over the next few days.

There are a number of features/optimisations still on my TODO list, but as I promised last month, I'd like to release R10 by the end of March. I think frequent, small updates are better than keeping everyone waiting months between releases.

Although R10 will be a smaller update than R9, there are some great improvements (most of which I've already talked about):


  • A speedup of approximately 10-15%
  • Frameskip, framerate limiting and stick deadzone tuning
  • Various small bugfixes


I'm hoping to have everything ready by Sunday afternoon at the latest.

One feature which might not make it is the Expansion Pak support. I mentioned that I'd finally fixed the bug that was preventing this from working, but I've been having difficulty resolving the various memory issues that it causes when enabled. Rather than delay R10, I'd prefer to disable the Expansion Pak support for now and release R11 early - as soon as I've fixed the underlying problem. I'll keep you posted.

-StrmnNrmn

Wednesday, March 21, 2007

Frameskip

A couple of people have been commenting about the mysterious frameskip version of Daedalus R9 which appeared a short while ago. I'm not going to link to it because I can't verify where it came from. That said, I've not checked my email for a week so maybe the author has emailed me about it :)

Anyway, it just so happens that I implemented frameskip in R10 on Sunday, so expect this to be a supported feature in the next official release. I had been planning to add this to R9, but I forgot :) It's no big deal - it's about 20 lines of code.

It does give a slight speedup, but not always as much as you'd expect. For instance, skipping every other frame won't double the framerate, as not all the time is spent rendering. Paradoxically, it tends to have more of an effect on roms that are already running fairly fast. Hopefully for some roms it will make the difference between them being barely playable and playable though.

There are a few other things people have been asking about which I will implement for R10 too:


  • A configurable deadzone for the stick
  • A configurable framerate limiter
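
As a rough illustration of what these two options involve, here is a minimal sketch. It is not taken from Daedalus: `ApplyDeadzone` and `FrameDelayMicros` are hypothetical names, the stick axis is assumed to be normalised to [-1, 1], and the deadzone is assumed to lie in [0, 1).

```cpp
#include <cmath>

// Treats small stick deflections as zero and rescales the remainder so that
// full deflection still maps to +/-1. Assumes 0 <= deadzone < 1.
inline float ApplyDeadzone(float value, float deadzone) {
    float magnitude = std::fabs(value);
    if (magnitude <= deadzone) return 0.0f;
    float scaled = (magnitude - deadzone) / (1.0f - deadzone);
    return value < 0.0f ? -scaled : scaled;
}

// Returns how long (in microseconds) to wait after a frame that took
// elapsed_us, so that frames are not presented faster than target_fps.
inline long long FrameDelayMicros(long long elapsed_us, int target_fps) {
    long long budget_us = 1000000LL / target_fps;
    return elapsed_us < budget_us ? budget_us - elapsed_us : 0;
}
```

A frame that finishes in 10ms with a 50fps limit would wait another 10ms, while a frame that is already over budget would not wait at all.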


-StrmnNrmn

Tuesday, March 20, 2007

A look inside Daedalus

(This is quite a technical post, probably only of interest to other C++/PSP developers.)

Since I fixed the issue stopping the Expansion Pak support from working, I've been testing Daedalus with the feature enabled to see if the added pressure on available RAM causes any new issues. Most roms that I've tested have been running perfectly, but I've been experiencing occasional crashes through running out of memory. I believe this is due to a small leak somewhere, so in an effort to track it down I've been improving the tools I use to track memory usage.

In the improved tracker I override the global new and delete operators, which lets me perform some logging on every allocation/deallocation made during the course of executing the emulator (this isn't quite true, as I don't currently log calls to malloc etc., but it's good enough for my purposes). At the start of each overridden implementation, I keep track of the calling function's return address with the following snippet of code:


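u32 ra;
// Store the $ra register (the caller's return address) into a local variable.
// "=m" marks ra as a write-only memory operand, since it is never read here.
asm volatile
(
"sw $ra, %0\n"
: "=m"(ra) : : "memory"
);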


I log this return address along with the allocation size and the returned pointer for every call to new and new[]. For calls to delete and delete[], I just log the address of the memory being freed and the caller's address. This data is all logged to disk across the USB connection using PSPLink.
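
For anyone wanting to try the same idea, here is a minimal, portable sketch of such a tracker. It is not the Daedalus implementation: it records into a fixed in-memory array rather than streaming to disk, uses the GCC/Clang builtin `__builtin_return_address(0)` in place of the MIPS inline assembly, and the names `AllocRecord`, `g_log`, `Record`, `LoggedAlloc` and `LoggedFree` are all hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical in-memory log; a fixed array keeps the sketch self-contained.
struct AllocRecord {
    const void* ptr;     // pointer returned by new, or pointer being freed
    std::size_t size;    // requested size in bytes (0 for frees)
    const void* caller;  // return address of the code that called new/delete
    char kind;           // 'n' = new, 'a' = new[], 'd' = delete, 'D' = delete[]
};

static AllocRecord g_log[4096];
static std::size_t g_log_count = 0;

// Not thread-safe: assumes all allocations come from a single thread.
static void Record(const void* p, std::size_t size, const void* ra, char kind) {
    if (g_log_count < sizeof(g_log) / sizeof(g_log[0])) {
        g_log[g_log_count++] = AllocRecord{p, size, ra, kind};
    }
}

static void* LoggedAlloc(std::size_t size, const void* ra, char kind) {
    void* p = std::malloc(size ? size : 1);
    if (!p) throw std::bad_alloc();
    Record(p, size, ra, kind);
    return p;
}

static void LoggedFree(void* p, const void* ra, char kind) {
    if (!p) return;  // deleting a null pointer is a no-op
    Record(p, 0, ra, kind);
    std::free(p);
}

// __builtin_return_address(0) yields the return address of the current
// function, i.e. the place new/delete was called from.
void* operator new(std::size_t size) {
    return LoggedAlloc(size, __builtin_return_address(0), 'n');
}
void* operator new[](std::size_t size) {
    return LoggedAlloc(size, __builtin_return_address(0), 'a');
}
void operator delete(void* p) noexcept {
    LoggedFree(p, __builtin_return_address(0), 'd');
}
void operator delete[](void* p) noexcept {
    LoggedFree(p, __builtin_return_address(0), 'D');
}
// Sized forms, which C++14 compilers may call instead of the ones above.
void operator delete(void* p, std::size_t) noexcept {
    LoggedFree(p, __builtin_return_address(0), 'd');
}
void operator delete[](void* p, std::size_t) noexcept {
    LoggedFree(p, __builtin_return_address(0), 'D');
}
```

Allocating through malloc inside the replaced operators avoids recursing back into new, which is also why calls to malloc itself go unlogged in a scheme like this.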

The logfiles look a little like this:


Allocating 36 bytes - 09c40620 - RA is 08953a94
Allocating[] 8192 bytes - 09c40a00 - RA is 0895e8bc
Allocating 12 bytes - 09c3fce0 - RA is 08964898
Allocating 12 bytes - 09c40970 - RA is 089648b0
Allocating 12 bytes - 09c409d0 - RA is 089648c8
Allocating 12 bytes - 09c408e0 - RA is 089648e0
Allocating 20 bytes - 091ceb20 - RA is 089645e8
Freeing 09c3fce0 - RA is 08962374
Freeing 09c40970 - RA is 089621fc


In order to analyse the results I've written a PC tool which scans through the logfile one line at a time 'replaying' the allocations and deallocations in the same order that they occurred on the PSP. The analyser keeps track of the current state of allocations at any point in time, matching up any calls to delete with the corresponding call to new. This means that at any given time the tool has a complete list of all outstanding allocations.

I can discover any memory leaks by shutting down the emulator and continuing to record the logfile while it frees up all allocated resources. By replaying the logfile the analyser can identify any leaks, as these are (mostly) the only remaining allocations at the end of the logfile. I can then run the corresponding return addresses through psp-addr2line to discover where the leaked memory is being allocated from.

The other cool feature of the tool is that it builds up a graphical representation of the state of memory allocations at any point in time. This is really useful for figuring out where all the available RAM is going.

Here's a picture showing where most of the PSP's memory is being used while emulating Mario 64 (I've added the labels on by hand, the tool doesn't do that :)



Each pixel corresponds to 16 bytes of RAM. The smallest blocks are 64x64 pixels x 16 bytes = 64KB. 1MB chunks are formed from 4x4 64KB blocks. The black space corresponds to both unallocated memory and memory allocated outside of the tracker (e.g. calls to malloc, memory used by PSPLink, the CRT, static data areas etc.) You can see that the "Emulated RAM" accounts for just over 8MB - this is because I enabled the Expansion Pak while testing. Dynarec currently uses about 6MB - ideally I'd like to reduce this down to around 4MB soon.
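
The layout above boils down to a small address-to-pixel mapping. The following is only a sketch under assumptions: the sizes come from the description, but the row-major ordering of pixels within a block and of blocks within a chunk is a guess, and `PixelPos` and `OffsetToPixel` are hypothetical names.

```cpp
#include <cstdint>

struct PixelPos {
    std::uint32_t chunk;  // index of the 1MB chunk
    std::uint32_t x, y;   // pixel position within that chunk's 256x256 area
};

// Maps a byte offset from the start of the tracked heap to a pixel.
inline PixelPos OffsetToPixel(std::uint32_t offset) {
    const std::uint32_t kBytesPerPixel = 16;
    const std::uint32_t kBlockDim = 64;          // 64x64 pixels = 64KB per block
    const std::uint32_t kBlocksPerChunkRow = 4;  // 4x4 blocks = 1MB per chunk
    const std::uint32_t kPixelsPerBlock = kBlockDim * kBlockDim;
    const std::uint32_t kBlocksPerChunk = kBlocksPerChunkRow * kBlocksPerChunkRow;

    std::uint32_t pixel = offset / kBytesPerPixel;
    std::uint32_t block = pixel / kPixelsPerBlock;
    std::uint32_t pixel_in_block = pixel % kPixelsPerBlock;
    std::uint32_t block_in_chunk = block % kBlocksPerChunk;

    PixelPos pos;
    pos.chunk = block / kBlocksPerChunk;
    pos.x = (block_in_chunk % kBlocksPerChunkRow) * kBlockDim + pixel_in_block % kBlockDim;
    pos.y = (block_in_chunk / kBlocksPerChunkRow) * kBlockDim + pixel_in_block / kBlockDim;
    return pos;
}
```

Colouring every pixel covered by a live allocation then gives a picture of which chunks are in use.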

You'll notice that almost all the memory Daedalus uses is for these big fixed-size allocations. What little dynamic allocation it does at runtime is limited to:


  • Keeping track of hot-trace hit counts in the Dynamo implementation
  • Textures and texture caching


As it turns out, the out-of-memory issues I've been having have been due to the texture cache going crazy and chewing up around 3MB of memory (typically it just uses 200-300KB or so). I've not figured out the root cause yet, but the tool has helped point me in the right direction.

All in all I think this is a pretty nifty utility as it stands, but I've been thinking about a few features that would make it even better:


  • Log the time and current frame alongside each allocation/deallocation. The analyser can then use this information to see how much 'churn' there is over time. Minimising this should help improve performance and reduce fragmentation.
  • Every memory allocation has a small housekeeping overhead (for alignment, keeping track of the allocation size etc). The tool could generate a histogram of allocation sizes to demonstrate how much memory is being wasted through tiny allocations, and give some indication of where pooling or freelists might help.


These are probably features for some point down the line however :)
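
The size histogram idea could look something like the sketch below. This is purely illustrative: `SizeBucket` and `BuildHistogram` are hypothetical names, the power-of-two bucketing is just one reasonable choice, and the per-allocation overhead is an assumed figure that depends on the allocator.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct SizeBucket {
    std::size_t count = 0;     // number of allocations in this bucket
    std::size_t payload = 0;   // bytes actually requested
    std::size_t overhead = 0;  // estimated housekeeping bytes
};

// Groups allocation sizes into power-of-two buckets, keyed by each bucket's
// upper bound. per_alloc_overhead is an assumed, allocator-specific figure.
std::map<std::size_t, SizeBucket> BuildHistogram(
        const std::vector<std::size_t>& sizes, std::size_t per_alloc_overhead) {
    std::map<std::size_t, SizeBucket> histogram;
    for (std::size_t size : sizes) {
        std::size_t bound = 1;
        while (bound < size) bound <<= 1;  // smallest power of two >= size
        SizeBucket& bucket = histogram[bound];
        bucket.count += 1;
        bucket.payload += size;
        bucket.overhead += per_alloc_overhead;
    }
    return histogram;
}
```

With an assumed 8 bytes of overhead per allocation, the four 12-byte allocations from the log excerpt earlier would carry 32 bytes of housekeeping for 48 bytes of payload, which is the sort of ratio that suggests pooling.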

That pretty much sums up the tool. If this code (either for the tracker or the logfile analyser) would be useful to anyone, let me know and I'll add it to Subversion alongside the rest of the Daedalus code with R10.

-StrmnNrmn

Sunday, March 18, 2007

Daedalus R10 Optimisation Progress

I didn't mean to leave it quite so long since last weekend's update, but I've been working hard on a number of optimisations for R10. Oddly enough these are mostly new issues that I've found - most of them don't exist in the list of tasks I came up with a couple of weeks ago. I think that shows how much scope there is for optimising Daedalus!

Firstly, I finally managed to get Daedalus compiling with GCC's '-O3' setting. This flag enables GCC's most aggressive standard optimisation level. When I've tried to enable this flag in the past I've had numerous strange crashes and odd behaviour, so all releases of Daedalus to date have been compiled with -O1.

I updated my local installation of the PSPSDK last weekend and decided to try the -O3 setting again. I was pleased to find that Daedalus ran without crashing, but there was still some odd behaviour which I eventually tracked down to my use of the famous InvSqrt function. You can read a bit more about my findings on the pspdev forums.

Enabling -O3 tends to slightly increase the code size (the EBOOT.PBP has increased from around 850KB to 900KB), but the speedup is quite noticeable - my estimate is that Daedalus runs around 5% faster with -O3 over -O1.

As a result of the thread I started on the pspdev forums, hlide and Raphael both came up with some great suggestions for how I could optimise my use of the VFPU.

When I originally wrote the VFPU code for TnL and clipping there were still many undocumented/unsupported functions. A few months down the line and hlide and co have discovered a couple of instructions which are perfect for my needs - namely vuc2i and vc2i. These two instructions take a 32-bit value comprising 4 (un)signed 8-bit chars and unpack them into a vector of 4 32-bit fixed point numbers. It turns out that these instructions are perfect for converting the N64's packed colour and normal values into a format I can use in the VFPU code.

The various VFPU tweaks I've made have given Daedalus another 5% or so speedup.

The final set of changes I've been working on this week have been to do with how I handle certain blend modes. Some of the N64 blend modes are too complex for the PSP to deal with precisely, so I have a large table of 'override' blend modes which allow me to make as good an approximation of the N64 mode as possible. It turned out that looking up these blend modes was very expensive, so I've rewritten how this is handled to make it more efficient. The end result is another small speedup.

Overall these three changes give a combined 10-15% speedup on the various games I've tested, although there are roms that lie outside this range (some show an even greater speedup while others are more or less unaffected by the changes).

There's still quite a lot more in the way of optimisations that I want to get in for Daedalus R10 (mostly stuff I mentioned earlier) so hopefully these numbers will improve even further over the next couple of weeks.

-StrmnNrmn

Saturday, March 10, 2007

Weekend update

It's been a busy week at work, what with catching up after my week off and GDC, so I've not managed to post as many updates as I'd have liked.

On Daedalus I've been starting to look at the potential optimisations I listed and working out what to tackle first. To help with this, my first job is to do some work on Daedalus's profiler, to try and figure out where the biggest wins are going to come from. Hopefully I'll be able to report back with some interesting findings this weekend.

On a related note, I've spent the morning looking at converting the source control I'm using at SourceForge from CVS to Subversion. I've been meaning to do this for some time. I've never really been a fan of CVS, and as I'm using Subversion for other projects at work and at home I thought it made sense to migrate Daedalus over too.

So you can now access the latest Daedalus source* through Subversion:


svn co https://daedalus-n64.svn.sourceforge.net/svnroot/daedalus-n64/trunk daedalus-n64


With CVS I usually only updated the source alongside every release. Ideally the repository would contain an up-to-date copy of my local build, but I've had problems in the past where people have distributed 'intermediate' builds of Daedalus PSP, bugs and all. I only ever release new builds when I think there are enough new features and it's stable enough to make it worthwhile for people to download and install; updating the source a bit less frequently gives me a bit more control and helps prevent everyone's time being wasted with intermediate builds. I think that I'm going to continue with this policy for the time being. We'll see how it goes.

-StrmnNrmn

*This is still just the R8 source which I lifted from CVS today. I'm in the process of testing whether this compiles OK, then I'll refresh the repository with all the changes from R9. I'll update this post when the R9 source is available.

Edit: R9 source committed to Subversion, all seems to be compiling OK.

Tuesday, March 06, 2007

R10 Plan of Action

Before I went away on holiday I asked you what you thought I should look at working on for the next release of Daedalus. Over 200 of you replied, and I've greatly enjoyed reading what you've had to say. There were some brilliant suggestions, so many thanks for your contributions.

It seems pretty clear to me that speed is the single biggest issue that most people want to see addressed. Many people also mentioned compatibility and savestate or save game support, but in nowhere near the same kind of numbers as those wanting speed improvements.

Based on your feedback my current plan is to release Daedalus R10 at the end of March, focusing mostly on speed improvements. If I can fit in any easy compatibility fixes, I'll do this too*.

Several people have asked what possibilities remain for optimisation. Here's a short list of things I know need more work:


  • In many games, a lot of the time spent executing dynamically recompiled code is doing things which can potentially be emulated at a high level. For instance, over 5% of the time spent executing dynarec code in Mario64 is just converting matrices from floating point to fixed point format. Another 4-5% of the time is spent in a loop invalidating areas of the data cache (which is irrelevant in an emulator.)
  • Some of the most expensive fragments are those which branch to themselves (i.e. those doing many loops). I can optimise for this to avoid loading and flushing cached registers on each iteration through the loop.
  • I can implement a frameskip option (I had intended to implement this for R9, but forgot!)
  • I can make use of the Media Engine (Exophase suggested in conversation that, as the ME can't access VRAM, it might make more sense to execute Audio and Display Lists on the main CPU, and run the N64 CPU emulation on the PSP ME)
  • There are certain situations where I fail to create fragments in the dynamic recompiler - for instance if the code being recompiled writes to a hardware register, this triggers an interrupt and causes fragment generation to be aborted. I should be able to deal with situations such as this more gracefully.
  • The fragment generator can do a lot more to improve register caching and eliminate redundant 64-bit operations.
  • There are many situations where N64 roms busy wait. I detect very simple occurrences of this, but not all of them. If I manually identify more complex examples I can have the fragment generator optimise them away.
  • Some roms are causing the dynarec fragment cache to be repeatedly dumped and recreated (I think Banjo Kazooie is one example of this). Fixing this may just involve tweaking a couple of magic numbers.
  • I currently optimise memory accesses under the assumption that most accesses are in the range 0x80000000 - 0x80800000, which is incorrect in the case of roms that make heavy use of virtual memory, or access RAM through the mirrored range at 0xa0000000. I can improve the trace recorder to collect information on which range a memory access fell in, and generate code to speculatively optimise for this.
  • Now that the dynarec engine is producing much better code, the cost of display list processing is becoming more significant, and may finally be worth profiling and optimising.


That's quite a big list, so I doubt I'll be able to work on all of these things before the end of March, but I think it shows there's still a lot of scope for further optimisation.
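
To illustrate the memory access item in the list above, here is a minimal sketch of the address check involved. It is written as plain C++ rather than generated code and is not Daedalus's implementation: `TranslateToRam` and `kRamSize` are hypothetical names, and the 8MB size assumes the Expansion Pak is enabled.

```cpp
#include <cstdint>

// 8MB of emulated RAM, matching the 0x80000000 - 0x80800000 range.
const std::uint32_t kRamSize = 0x00800000u;

// Returns true and writes the RAM offset if vaddr falls in emulated RAM via
// either direct-mapped range; returns false if the slow path is needed.
inline bool TranslateToRam(std::uint32_t vaddr, std::uint32_t* offset) {
    // Common case: the range starting at 0x80000000.
    std::uint32_t candidate = vaddr - 0x80000000u;
    if (candidate < kRamSize) {
        *offset = candidate;
        return true;
    }
    // The same RAM seen through the mirrored range at 0xa0000000.
    candidate = vaddr - 0xa0000000u;
    if (candidate < kRamSize) {
        *offset = candidate;
        return true;
    }
    // Virtual memory or a hardware register: handled elsewhere.
    return false;
}
```

A trace recorder that notes which branch each access took could then let the fragment generator emit only the check that usually succeeds for that instruction, falling back to the general handler when the guess is wrong.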

-StrmnNrmn

*Just this morning, I figured out why the Expansion Pak support was broken, so Majora's Mask and a couple of other games relying on this are booting correctly now :)

Sunday, March 04, 2007

Back!

Just a quick post to say that I'm back safe and well from skiing. It was awesome! As an added bonus I made it through the week without breaking any bones :)

Lots of people posted with their thoughts on what I should work on for R10 - many thanks for all your feedback. I'll post an update with my plans shortly (probably sometime tomorrow, after I've recovered from the travelling.)

-StrmnNrmn