Wednesday, August 30, 2006

Display List Debugger

I've been working on a few new features which I'm eager to talk about, but I thought it would be interesting to give some details about some of the tools I use to help me debug various problems with the emulator.

Some of the hardest problems to identify and fix are various graphical issues that crop up when running certain roms. Sometimes it's unhandled combiner modes (this is what results in the purple-and-black textures seen in so many screenshots). Other times there are black or white polygons, or scrambled textures and so on. Sometimes the screen is just totally black :)

When I'm trying to figure out what's causing a particular problem my first step is to recompile Daedalus with the Display List Debugger enabled. If you've been playing with the source, this is done by setting CFLAGS = $(DEBUG_DLIST_CFLAGS) in the Makefile, and linking in Source/PSPGraphics/DisplayListDebugger.o.

The debugger is accessed by pausing emulation and hitting the right trigger button. You need to have PSPLink set up in order to use it, as I didn't want to clutter up the display with various debugging output. When the debugger is activated, it keeps the emulator paused and replays the current display list over and over. The first thing it does is dump out a list of all the commands in the display list to a logfile which looks something like this (I've edited it somewhat to create a simple example):


[00066] 0x00214290: 01060040 802143b0 G_GBI1_MTX
Command: ModelView Load Push Length 64 Address 0x002143b0
+1.00000 +0.00000 +0.00000 +0.00000
+0.00000 +1.00000 +0.00000 +0.00000
+0.00000 +0.00000 +1.00000 +0.00000
+0.00000 +160.00000 +0.00000 +1.00000

[00067] 0x001c1118: bb000001 ffffffff G_GBI1_TEXTURE
Level: 0 Tile: 0 enabled
ScaleS: 0.999985, ScaleT: 0.999985
[00068] 0x001c1120: 04f00100 0a000000 G_GBI0_VTX
Address 0x001c1000, v0: 0, Num: 16, Length: 0x0100
#00 Flags: 0x0000 Pos: { 0.000000, 60.000000,-1.000000}
#01 Flags: 0x0000 Pos: { 80.000000, 60.000000,-1.000000}
#02 Flags: 0x0000 Pos: { 80.000000, 80.000000,-1.000000}
[00069] 0x001c1130: bf000000 00000a14 G_GBI1_TRI1
Tri: 0,1,2


What this is showing is a matrix being loaded (command 0066), texturing being enabled (0067), a bunch of vertices being loaded (0068) and then a triangle being rendered (0069). (That's quite a lot of work to display a single triangle! In reality the n64 would load up batches of vertices and render multiple triangles at a time.)
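To make the log a little more concrete, here is a rough sketch of how the two 32-bit words of the vertex and triangle commands above could be unpacked. This is not the Daedalus source: the struct and function names are made up for illustration, and the bit layouts are inferred from the sample log (for instance, the triangle bytes 0x00, 0x0a, 0x14 decoding to 0, 1, 2 suggests the indices are stored multiplied by 10), so treat them as assumptions rather than a microcode reference.

```cpp
#include <cassert>
#include <cstdint>

// One display list command: the two 32-bit words printed in the log.
struct Command {
    std::uint32_t w0;
    std::uint32_t w1;
};

// The top byte of the first word selects the command (0x04, 0xbf, ...).
std::uint8_t opcode(const Command& c) {
    return static_cast<std::uint8_t>(c.w0 >> 24);
}

struct VertexLoad {
    std::uint32_t first;   // index of the first vertex slot to fill (v0)
    std::uint32_t count;   // number of vertices to load
    std::uint32_t length;  // size of the vertex data in bytes
};

// Assumed layout, inferred from "04f00100" -> v0: 0, Num: 16, Length: 0x0100.
VertexLoad decode_vtx(const Command& c) {
    assert(opcode(c) == 0x04);
    const std::uint32_t packed = (c.w0 >> 16) & 0xff;
    return {packed & 0x0f, (packed >> 4) + 1, c.w0 & 0xffff};
}

struct Triangle {
    std::uint32_t v0, v1, v2;
};

// Assumed layout, inferred from "00000a14" -> Tri: 0,1,2 (indices stored x10).
Triangle decode_tri1(const Command& c) {
    assert(opcode(c) == 0xbf);
    return {((c.w1 >> 16) & 0xff) / 10,
            ((c.w1 >> 8) & 0xff) / 10,
            (c.w1 & 0xff) / 10};
}
```

Feeding commands 0068 and 0069 from the log through these helpers reproduces the values printed beneath them.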

The display list debugger also provides a number of other useful features. The first is the Combiner Explorer. This is how it looks in PSPLink:



This displays all the different combiners used in a scene, and allows them to be enabled or disabled individually. When they're disabled, it replaces the triangle with a big green and black texture, like so:





In each of these cases a single combiner was disabled, showing how different combiners are used to achieve different effects in the scene.

This tool is invaluable in debugging combiners as I add them, to ensure that they're doing what I think they're doing.
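The bookkeeping behind a tool like this can be quite small. The sketch below is a guess at the general shape rather than the real implementation: it assumes each combiner can be identified by a 64-bit key, and `CombinerExplorer` and `placeholder_texel` are hypothetical names. Disabled combiners would be drawn with the two-colour placeholder instead of their real output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Tracks which combiners a scene uses and whether each one is enabled.
class CombinerExplorer {
public:
    // Record a combiner the first time it is seen; new ones start enabled.
    void note_used(std::uint64_t mux) { enabled_.insert({mux, true}); }

    // Flip a known combiner between enabled and disabled.
    void toggle(std::uint64_t mux) {
        auto it = enabled_.find(mux);
        assert(it != enabled_.end());
        it->second = !it->second;
    }

    // Combiners that were never recorded are treated as enabled.
    bool is_enabled(std::uint64_t mux) const {
        auto it = enabled_.find(mux);
        return it == enabled_.end() || it->second;
    }

    std::size_t count() const { return enabled_.size(); }

private:
    std::map<std::uint64_t, bool> enabled_;
};

// Green and black checkerboard with 4x4 texel squares, as 32-bit colours.
std::uint32_t placeholder_texel(unsigned x, unsigned y) {
    const std::uint32_t kGreen = 0xff00ff00u;
    const std::uint32_t kBlack = 0xff000000u;
    return (((x / 4) + (y / 4)) & 1u) ? kGreen : kBlack;
}
```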

The Texture Explorer displays all the textures used in the scene. It looks like this in PSPLink:



It allows individual textures to be displayed on the PSP so I can ensure that they're being decoded correctly:




The final function I want to demonstrate is the ability to terminate the display list at a given point. That is, if a display list has 1000 commands, I can stop it at any point during rendering to see what the scene looks like. The example pasted above shows a snippet of 4 commands, from 66 to 69. Typically there may be 3000-5000 commands in a display list - the Mario scene above is composed of 3614 commands.

Being able to stop the display list like this is really useful, because it allows me to see the exact command at which a given triangle in the scene is drawn. Let's say that we're trying to figure out why the triangles making up Mario's feet are coming out in the wrong colour. What we'd do is render the display list command by command until the specific triangle we were interested in appeared on-screen. At this point we'd be able to determine the command number, and cross-reference this with the display list log which was dumped at the beginning. Usually looking a few lines before this point in the log will reveal the source of the problem.
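The idea of replaying a display list up to a chosen command can be sketched in a few lines. This is only an illustration under assumed names (`Command`, `replay` and the `execute` callback are all hypothetical); the real debugger obviously has to drive the actual renderer.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// One display list command: two 32-bit words.
struct Command {
    std::uint32_t w0;
    std::uint32_t w1;
};

// Replays `list` from the start, stopping once `limit` commands have run.
// `execute` receives the command number so it can be matched against the log.
// Returns the number of commands actually executed.
std::size_t replay(const std::vector<Command>& list, std::size_t limit,
                   const std::function<void(std::size_t, const Command&)>& execute) {
    const std::size_t end = limit < list.size() ? limit : list.size();
    for (std::size_t i = 0; i < end; ++i) {
        execute(i, list[i]);
    }
    return end;
}
```

Stepping `limit` up by one on each replay until the triangle of interest appears gives the command number to look up in the dumped log.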

I hope this has provided a little insight into how I go about debugging some of Daedalus' graphical problems. I'll leave you with a sequence of shots showing how Mario 64 looks as it's rendered from start to finish:











-StrmnNrmn

Friday, August 25, 2006

Something for the weekend - R8

I got back from my trip yesterday and I've spent a few hours today putting together the final changes for Daedalus PSP R8.

You can grab the latest files from SourceForge.

Here's the changelist:



[^] Replaced all uses of sceCtrlReadBufferPositive with sceCtrlPeekBufferPositive.
[^] Various known value optimisations for the dynamic recompilation engine.
[^] Various texture cache optimisations and rendering optimisations.
[+] Implemented a new clipping method which is more efficient and gives better results.
[-] Removed 'tesselate large triangles' setting.
[+] Added option to reset emulator to the main menu.
[^] No longer use index buffers for rendering.
[^] Implement matrix multiplication using VFPU.
[^] Implement vertex transform and lighting code using VFPU.
[^] Implement clipping code using VFPU.
[^] Minor AddTri optimisations.
[^] Free background and font textures while emulator is running to free VRAM.
[!] Fixed bug in default controller config (c-down and dpad-down were broken)



It should be pretty obvious that most of the changes in R8 are optimisations. This build is significantly faster than R7, which is quite an achievement considering how much R7 had already improved in this area. Here's an updated view of the framerate table I introduced a couple of months ago and updated recently:














Scene                        R4 fps   R5 fps   R7-beta fps   R8 fps
Mario Head                   3        6        8             8
Mario Main Menu              14       25       30            38
Mario Peach Letter           6-7      11       13            18
Mario Flyby (under bridge)   6        10       12            17
Mario In Game                5-6      9        11            15
Mario Kart Nintendo logo     10       23       24            35
Mario Kart Flag              6        11       13            16
Mario Kart Menu              7        11       13            14
Zelda Nintendo Logo          20       23       ?             70
Zelda Start Menu             2-3      4        ?             8
Zelda Main Menu              10       13       ?             40


I think the numbers speak for themselves. What's particularly impressive is that the R8 results are so high, despite having the new clipping code enabled by default. By implementing the triangle clipping code and the transform and lighting code using the PSP's VFPU I've managed to keep the additional cost to a minimum.

With the previous release, I also mentioned a list of things I had planned for R8. I decided to put them on hold while I pursued the various optimisations in R8, so I'll be looking at working on them for the next release.

Have fun - I'm going to wade through the past couple of weeks' worth of emails that have built up while I've been putting this build together :)



-StrmnNrmn

Tuesday, August 22, 2006

Away for a couple of days

I've made some great progress on getting the new clipping code working with the PSP's VFPU. Actually, so far I've just been working on getting various matrix/vector routines and the transform and lighting (TnL) code working with the VFPU, and I'm seeing very good results. The TnL code is around 2-3 times as fast running through the VFPU compared to the CPU. This gives around a 0.5-1.0fps speedup in the various roms I've been testing.
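For anyone wondering what the "transform" half of TnL boils down to, here is a plain scalar sketch of a vertex being multiplied by a 4x4 matrix. It is not the VFPU code (that would do the same arithmetic with vector instructions), and the types, the function names and the row-vector convention with the translation in the fourth row are assumptions made for the example.

```cpp
#include <array>

struct Vec4 {
    float x, y, z, w;
};

// m[row][col]; the translation is assumed to live in the fourth row.
using Mat4 = std::array<std::array<float, 4>, 4>;

Mat4 identity() {
    Mat4 m{};  // value-initialised to all zeros
    for (int i = 0; i < 4; ++i) {
        m[i][i] = 1.0f;
    }
    return m;
}

// Row-vector convention: result = v * m.
Vec4 transform(const Vec4& v, const Mat4& m) {
    return {
        v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
        v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
        v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
        v.x * m[0][3] + v.y * m[1][3] + v.z * m[2][3] + v.w * m[3][3],
    };
}
```

Each output component is a four-element dot product, which is the kind of work a vector unit can do for several vertices at once.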

Unfortunately I have to put the clipping work on hold until the end of the week as I'm heading home for a few days to see my family. In the meantime, I've answered a few of your comments about clipping on the previous post.

Also, here are a few amusing 'outtakes' as I was trying to get the VFPU TnL code working:

Hmm, something wrong with the indexing I think


Something is definitely wrong with the transform matrix (or else Lakitu is drunk :)


M.C. Escher would be proud


Whoops.


-StrmnNrmn

Sunday, August 20, 2006

Triangle Clipping

After Wednesday's news I wanted to keep everyone up to date with what I've been working on over the past few days.

With Wednesday's changes incorporated, I reprofiled a few roms to see where most of the CPU time was going. Things have changed considerably since I initially talked about deciding what to optimise. Looking at the profiler for Mario 64 the time spent executing display lists is now a much more significant fraction of the total time spent on each frame. Back around R3/R4 only around 20% of the time was spent here. With the latest build display list processing now accounts for around 35-40% of the time. The display list processing hasn't become any slower, it's just becoming more significant as I've optimised the CPU emulation.

When I released R7, one of the settings I mentioned was worth disabling for a speed boost was the 'Tesselate Large Triangles' option. When this setting is enabled, it causes the display list processor to recursively break up large triangles into smaller pieces. This has been necessary to overcome the PSP's poor hardware clipping support; without breaking the triangles up into smaller pieces, the PSP will often fail to render large triangles as shown below:

Super Mario 64 without clipping


The large triangles that make up the floor that Mario is standing on are rejected by the PSP, leaving a large hole where the floor should be. By breaking the triangles into smaller pieces before attempting to render them, it reduces the chance that the PSP will decide to discard them.

There were a few problems with the 'Tesselate Large Triangles' setting which I've been working on overcoming this weekend. Firstly, it's not perfect - there were plenty of cases where visible triangles would still be culled even when they had been subdivided 3-4 times (which generates 27-81 triangles for each input triangle!). This was always quite noticeable in games with a relatively low camera, such as racing games. The other big problem with this setting was that it was very slow - often adding over 20ms per frame.

This setting was always intended as a quick fix rather than a long term solution, so I've been looking at fixing both of these problems over the past few days. I started by ripping out all the existing polygon clipping and tesselation code and starting from scratch. After a couple of days of hacking I've finally got a replacement system that seems to be clipping everything I've thrown at it perfectly. Here's a shot of the same location in Mario 64:

Super Mario 64 with new clipping code


Now that I have a working version of the code in place, I'm going to look at optimising it. At the moment the new clipping code is roughly as expensive as the tesselation code, but due to the way it's implemented I think it should be much easier to make work with the PSP's VFPU, as I can process batches of vertices in parallel. Ideally I'd like to get this change into the next release, so I'm going to hold off putting the R8 build together until it's ready. I'll let you know how I get on.
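The post doesn't say which clipping algorithm the new code uses, so as a general illustration here is one standard approach, Sutherland-Hodgman clipping, applied to a polygon against a single plane. The `Vec3` and `Plane` types and the function names are invented for this sketch; a full clipper would run the same routine once per frustum plane and would also interpolate texture coordinates and colours.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Points with dot(n, p) + d >= 0 are on the visible side of the plane.
struct Plane {
    Vec3 n;
    float d;
};

float signed_distance(const Plane& pl, const Vec3& p) {
    return pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d;
}

// Sutherland-Hodgman: walk each edge, keep visible vertices and add a new
// vertex wherever an edge crosses the plane.
std::vector<Vec3> clip_polygon(const std::vector<Vec3>& in, const Plane& pl) {
    std::vector<Vec3> out;
    const std::size_t n = in.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3& a = in[i];
        const Vec3& b = in[(i + 1) % n];
        const float da = signed_distance(pl, a);
        const float db = signed_distance(pl, b);
        if (da >= 0.0f) {
            out.push_back(a);
        }
        if ((da >= 0.0f) != (db >= 0.0f)) {
            const float t = da / (da - db);  // fraction along a->b of the crossing
            out.push_back({a.x + t * (b.x - a.x),
                           a.y + t * (b.y - a.y),
                           a.z + t * (b.z - a.z)});
        }
    }
    return out;
}
```

A triangle with one vertex behind the plane comes out as a quad, and one with two vertices behind comes out as a smaller triangle, so the output stays small compared with recursive subdivision.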

-StrmnNrmn

Wednesday, August 16, 2006

Unexpected optimisations

One of the things that I find most rewarding about programming is when you discover an unexpected improvement or optimisation by accident. You can spend weeks carefully tuning and optimising code, only to stumble across a glaring inefficiency in your code which you've never spotted before. One quick change and your application is suddenly noticeably faster.

In my daily job I rely heavily on debuggers and profilers to discover bottlenecks in the code. When I was working on the Xbox, Microsoft provided some excellent performance analysis tools (I see they've finally released PIX for Windows). These days I tend to use AQtime as I'm PC based (it's also one of the few profilers I've found that can handle the size of our libraries at work without grinding to a shuddering halt).

Without these kinds of tools it's a lot tougher profiling on the PSP. Over the past few months I've built a number of custom profiling tools into Daedalus to help me figure out where all the time is going, but the numbers I get out tend to be quite vague, and there's usually quite a large margin of error. I think this explains why the unexpected optimisation I've just found went undiscovered for so long.

A couple of days ago I was browsing the ps2dev forums and came across this post. I was about to back out after a quick scan, when I noticed this comment from Soatome:


PeterM wrote:
but one waits for the vblank

...and that's sceCtrlReadBufferPositive (which you're using)
you should use sceCtrlPeekBufferPositive instead.


That's when I realised that when Daedalus was emulating a rom, it was stalling for a frame every time the rom read the status of the pad*. In other words, by changing one line of code in Daedalus from


sceCtrlReadBufferPositive


to


sceCtrlPeekBufferPositive


I could get on average an instant 1fps speedup across all roms. What's more, I knew some roms read from the pad multiple times each frame, so they would see an even greater speedup.
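A toy timing model shows why a blocking pad read costs so much. This is purely an illustration, not a measurement of the PSP: it assumes that a blocking read waits until the next vblank (as the forum comment describes), that vblanks fall on exact multiples of a fixed period, and that a non-blocking read is free. The function names are made up.

```cpp
#include <cstdint>

// Time of the first vblank strictly after `now`, with vblanks at multiples
// of `period` (same time units as `now`).
std::uint64_t next_vblank(std::uint64_t now, std::uint64_t period) {
    return (now / period + 1) * period;
}

// Total time to run `frames` emulated frames, where each frame does `work`
// units of emulation followed by `pad_reads` controller reads.
std::uint64_t simulate(unsigned frames, std::uint64_t work, unsigned pad_reads,
                       std::uint64_t period, bool blocking) {
    std::uint64_t now = 0;
    for (unsigned f = 0; f < frames; ++f) {
        now += work;
        for (unsigned r = 0; r < pad_reads; ++r) {
            if (blocking) {
                now = next_vblank(now, period);  // stall until the next vblank
            }
        }
    }
    return now;
}
```

With a period of 10 and 25 units of work per frame, three frames take 75 units with non-blocking reads but 90 with one blocking read per frame, and 120 with two. Every extra read per frame adds up to a full vblank period of waiting.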

Frustratingly I had to wait a couple of days before I could try this out. As I mentioned earlier I'm in the process of moving over to a new PC, and I had just moved Perforce over but hadn't set up the pspsdk, which required Cygwin. Daedalus requires libpng and zlib so I had to download and build them too. Then I had to set up Psplink, PuTTY and a whole host of other tools. You get the picture...

Last night I finally managed to get a new build together with the updated code, and the results were every bit as good as I'd expected. In some cases I had to restart the rom just to make sure I wasn't mistaken. I know most of you just want to see some numbers, so here's a few of my observations:

Mario now runs at a steady 15fps in most places, and around 20fps indoors etc (it reaches over 35fps in the main menu, and close to 30 in some scenes.) Zelda now runs at around 8fps in game, and up to 20fps in certain places. The 'nintendo' logo at the start runs at over 90fps :D The MarioKart Nintendo logo now runs at 30fps, and the main menu (with the flag) runs at a solid 15fps. In game it's a comfortable 12fps. Starfox runs at around 15fps - the intro runs at 25-30fps. Quest64 runs at 20fps.

So all in all it's a pretty amazing improvement for a single-line change. Having said that, I think it would be a mistake to assume that this is an instant fix that will suddenly make everything fully-playable. Although some of the framerates I list above are excellent - faster than a native n64, even - not all roms show this improvement. Don't assume that all roms now run at 15+fps (because they don't.) There's still a lot more work to do to get from a sluggish 8fps to a more playable 15fps (in Zelda for instance). I still need to save a lot more cycles in order to support other features such as sound.

Because this change makes such a big improvement I'm going to try and get another release out sooner rather than later. I don't like releasing builds too often as I think each revision should offer something worthwhile, but I think this qualifies :) There are a couple of other optimisations I want to get in this build, so while it might be ready this weekend, sometime early next week is more likely. The new features I had planned for this build will have to wait until R9.

As always, I'll keep you posted.

-StrmnNrmn

*This actually reminds me of a funny story from one of the Xbox games I was working on. We were investigating a sudden slowdown that had appeared a few days previously. Somehow I realised that the framerate doubled when you unplugged all the controllers. As it turned out someone was accidentally reinitialising the USB hub every frame, and removing all the controllers prevented this from happening.

Sunday, August 13, 2006

R7 released!

I've just uploaded R7 to SourceForge. Here's the changelist:


[^] Avoid checking for interrupts in dynarec code in most situations.
[^] Optimise dynarec Load/Store instructions to avoid checking for interrupts directly.
[^] Implemented the remaining 32-bit integer instructions in the dynarec.
[^] Implemented the remaining common load/store instructions in the dynarec.
[^] Implemented JAL/JR in dynarec.
[^] Optimised various texture cache related features.
[^] Added various known value optimisations to the dynarec engine.
[^] Link together blocks even when they exit with branch likely instructions.
[+] Added option to allow frequency of texture update checks to be reduced.
[+] Added the ability to configure buttons
[!] Fixed a couple of compatibility issues caused by the dynarec.
[!] Fixed a couple of issues related to self-modifying code and the dynarec.
[!] Fixed issues with the framerate counter flickering.

 



Daedalus R7 for v1.00
Daedalus R7 for v1.50
Daedalus R7 Source

The main emphasis has been on improving the framerate of as many roms as possible, but I've also made some significant fixes to the dynarec engine which should improve compatibility for a few roms where this was causing problems before.

There are two settings you should be aware of if you're looking at getting the fastest possible framerate. The first is on the global settings page (that's the one you see on the main menu as soon as you boot up). You'll want to set 'Tesselate Large Triangles' to No here. The second is on the Rom Settings screen, where you'll want to set 'Texture Update Check' to Disabled. A combination of these two options gives a significant speedup in various roms.

In R7 I've also added the ability to define your own custom controller configurations. You can define a new controller mapping by adding a new .ini file to the Daedalus/ControllerConfigs directory. There are a few examples in there already, and I'll look at posting a brief tutorial up here sometime soon. If you come up with a new mapping you think would be useful then email me (my address is in the readme.txt) and I'll post it up here and add it to a later release.

I think R8 is going to continue to focus on improving the framerate. I still have a lot of optimisations I want to get in, and I think these will help improve the framerate even further. I also want to spend a little time improving the front end, as it's getting harder for me to add new settings and options in there. I also want to add an option for changing some of the settings while a rom is running (i.e. I think it's time we had an in-game menu.) Another thing I'll look at for R8 is saving settings between runs of the emulator - this way Daedalus will remember which controller setup you prefer for each rom. Finally I want to add an option to quit back to the main menu without having to restart Daedalus.

Phew! That's quite a big list. I can't guarantee I'll be able to add all that for the next release, but that's what I'm currently aiming for. I think it's more important to try and release regularly (i.e. every 3-4 weeks) rather than try and cram everything into one go, so some of these features might move back to R9 if I slip behind.

My first job though is to move my development environment over from my old PC to the new 'beast'. I've had to put this release together through a Remote Desktop Connection to the old PC and I can't bear to do that any longer. It'll probably take a few days to get everything set up on the new PC, but it should be a lot less painful in the long run :)

-StrmnNrmn

Nearly there

I've just about finished work on the custom controller configuration. Once I've finished testing that everything is working as expected I'll begin the process of putting the release together. This can be quite involved as it requires updating all the documentation, packaging all the necessary files and uploading to SourceForge etc. Hopefully it shouldn't take more than a couple of hours. Next post will be with full details of all the changes and the download links etc.

-StrmnNrmn

Thursday, August 10, 2006

R7 close

I'm getting close to finishing everything I want to get into R7. I've just spent a little time tidying up a few loose ends (little things like the way screenshots are handled). The one last substantial bit of work I want to get done is support for custom controller configs. I think this should be a few hours work, so I expect I'll finish this sometime on Saturday or Sunday, with a release following shortly afterwards.

I got a new PC on Tuesday and this has slowed things down a little bit as I've spent a couple of evenings this week seeing just how stupidly fast it is. This is what I went for in the end:



I've got it all hooked up to my Dell 2405fpw and it's awesome. It's nice to finally be able to play games at the 2405's native 1920x1200 resolution without it chugging along :D

-StrmnNrmn

Sunday, August 06, 2006

Trigonometry Wars

While you're waiting for R7, my good friend 71M has just released the first version of Trigonometry Wars. It's awesome, check it out!

More R7 Optimisations

It's been a while since my last post, but I've still been hard at work with various optimisations for Daedalus R7.

Although my main focus is on improving the dynamic recompiler, I've been looking at optimising a couple of other areas that I noticed were fairly expensive. The texture cache is one of the areas that I spent time tuning this week. This cache is used to avoid converting textures from the native n64 formats to psp formats every frame. I made a couple of fixes to improve the hashing function which gives much faster lookups in certain situations (such as tiled backdrops). I also provided an option to change the frequency at which the texture cache checks for updates to the textures. Many roms look fine when this check is entirely disabled, and this can give quite a nice speed boost.
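As a rough picture of how a cache like this can trade accuracy for speed, the sketch below keeps converted textures keyed by their source address and only re-hashes the source data every few frames. It is not the Daedalus texture cache: the class, the FNV-1a hash, keying by address alone and the meaning of `check_interval` (0 for never re-check) are all assumptions chosen to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a, used here only as a stand-in for whatever hash the emulator uses.
std::uint32_t hash_bytes(const std::uint8_t* data, std::size_t len) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

struct CachedTexture {
    std::uint32_t content_hash;         // hash of the source data when converted
    std::uint32_t last_check_frame;     // frame on which the source was last hashed
    std::vector<std::uint8_t> converted;  // stand-in for the PSP-format texture
};

class TextureCache {
public:
    // check_interval: 0 = never re-check, N = re-check at most every N frames.
    explicit TextureCache(std::uint32_t check_interval)
        : check_interval_(check_interval) {}

    const CachedTexture& get(std::uint32_t address, const std::uint8_t* src,
                             std::size_t len, std::uint32_t frame) {
        auto it = entries_.find(address);
        if (it == entries_.end()) {
            it = entries_.emplace(address, convert(src, len, frame)).first;
        } else if (check_interval_ != 0 &&
                   frame - it->second.last_check_frame >= check_interval_) {
            // Only pay for the hash when a check is due.
            if (hash_bytes(src, len) != it->second.content_hash) {
                it->second = convert(src, len, frame);
            }
            it->second.last_check_frame = frame;
        }
        return it->second;
    }

    int conversions() const { return conversions_; }

private:
    // The "conversion" here is just a copy; the real one would change format.
    CachedTexture convert(const std::uint8_t* src, std::size_t len,
                          std::uint32_t frame) {
        ++conversions_;
        return CachedTexture{hash_bytes(src, len), frame,
                             std::vector<std::uint8_t>(src, src + len)};
    }

    std::uint32_t check_interval_;
    int conversions_ = 0;
    std::unordered_map<std::uint32_t, CachedTexture> entries_;
};
```

A larger interval means fewer hashes per frame but a longer delay before a changed texture is noticed, and disabling the check entirely means a changed texture is never picked up.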

My main focus has continued to be on the dynamic recompiler. I've made a couple more bugfixes in this area. One bugfix involved detecting when roms were using self-modifying code. The fix involved dumping the contents of the dynarec cache so that the code is correctly regenerated for the updated instructions. This fix solves a couple of issues I was seeing with Quest64, and I'm sure it will help improve compatibility with a number of other roms too.

The other dynarec issue I fixed was related to the way I was handling certain types of branch instructions. The MIPS processor has a set of 'branch likely' instructions which work slightly differently to regular branches and so I handle them separately in the dynamic recompiler. It turned out that I had forgotten to link together code fragments when they exited through a branch likely instruction. This fix gives a nice little speedup.

The biggest bit of new development I've been doing on the dynarec is on optimising for various situations where I can determine the contents of a given register at the time I'm compiling the code. As an example, many roms use the following sequence to load an integer value from memory at a specific address:


LUI $t0, 0x8033 // Load Upper Immediate - i.e. load t0 with 0x80330000
LW $t0, 0x1234($t0) // Load Word - i.e. load t0 with the value at 0x80331234


Previously I'd generate code for both of these instructions on the PSP. The LUI instruction is easy (if t0 is cached on the PSP then this is just one instruction). The LW is a lot more tricky. I have to call a function to convert the address on the n64 (0x80331234 in this case) to the address in the emulated memory on the PSP. Then I have to read from that address, or trigger an exception in the emulator if the memory address is invalid.

With the changes I've just made, when I encounter the LUI instruction (or other instructions involving loading constant values into registers) I keep track of the fact that I've loaded t0 with 0x80330000. When I come to process the LW instruction, I can now determine that the desired address is 0x80331234. I can then map that address directly to the required location on the PSP, avoiding a function call in the generated code. By avoiding the function call I no longer need to flush cached registers back out to memory. Also, because I can tell in advance that the address lies in RAM (and isn't referencing a hardware register for instance) then I can also omit the code testing for an exception. Finally, in situations like the example above, I don't need to generate any code for the initial LUI (as the register is immediately overwritten with the loaded value.)

In summary this is a very nice optimisation - it generates fewer instructions (reducing the size of the dynarec code), it avoids unnecessarily flushing out cached registers, it avoids generating exception handling code, and it can eliminate redundant instructions (the initial LUI). In the best case, for 2 source instructions it will generate just 3 output instructions, compared to 12-13 for the unoptimised case.
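The bookkeeping side of this can be sketched roughly as follows. This is a simplified illustration and not the dynarec itself: it only tracks known register values and resolves the load address at compile time, without generating any PSP code. The struct and function names are invented, and the 4MB RAM size passed to `is_ram` in the usage note is an assumption.

```cpp
#include <cstdint>

// Compile-time knowledge about the 32 general purpose registers.
struct KnownValues {
    bool known[32] = {};
    std::uint32_t value[32] = {};

    void set(unsigned reg, std::uint32_t v) {
        known[reg] = true;
        value[reg] = v;
    }
    void clear(unsigned reg) { known[reg] = false; }
};

// LUI rt, imm: rt now holds a value that is known while compiling.
void compile_lui(KnownValues& kv, unsigned rt, std::uint16_t imm) {
    kv.set(rt, static_cast<std::uint32_t>(imm) << 16);
}

// LW rt, offset(base): if the base register is known, the address can be
// resolved now. Returns true and fills `address` in that case.
bool compile_lw(KnownValues& kv, unsigned rt, unsigned base,
                std::int16_t offset, std::uint32_t& address) {
    const bool resolved = kv.known[base];
    if (resolved) {
        // The 16-bit offset is sign-extended before being added.
        address = kv.value[base] +
                  static_cast<std::uint32_t>(static_cast<std::int32_t>(offset));
    }
    kv.clear(rt);  // the loaded value is only known at run time
    return resolved;
}

// KSEG0/KSEG1 addresses map to physical memory by dropping the top 3 bits.
std::uint32_t to_physical(std::uint32_t vaddr) {
    return vaddr & 0x1fffffffu;
}

// True if `vaddr` is a KSEG0/KSEG1 address that falls inside RAM.
bool is_ram(std::uint32_t vaddr, std::uint32_t ram_size) {
    return vaddr >= 0x80000000u && vaddr < 0xc0000000u &&
           to_physical(vaddr) < ram_size;
}
```

For the LUI/LW pair above (t0 is register 8), `compile_lui(kv, 8, 0x8033)` followed by `compile_lw(kv, 8, 8, 0x1234, addr)` resolves `addr` to 0x80331234, and `is_ram(addr, 0x00400000)` confirms it lies in RAM, which is the condition that lets the exception test be dropped.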

Unfortunately this approach only works with load and store instructions where the address can be determined in advance, but from the roms I've examined so far around 10-15% of the load/store instructions can be optimised in this way, which is enough to give a measurable benefit.

I'm going to spend the rest of this week seeing which other parts of the dynarec engine can benefit from similar approaches. I have a couple of other features to implement (configurable controllers etc). If that all goes to plan I'll try and prepare R7 for a release next weekend.

-StrmnNrmn