Monday, September 25, 2006

PIF fixes

It's about time I posted an update with news of some of the things I've been working on! Annoyingly the '.' key on my keyboard has decided to stop working so apologies in advance for any dodgy punctuation :)

I've been having a look at fixing a couple of issues related to the way that Daedalus handles the N64 PIF (peripheral interface) emulation. The PIF is responsible for a number of different functions, including:


  • Reading the controller status

  • Reading/writing to the eeprom

  • Reading/writing to the controller mempak



Access to the PIF is controlled through a 64 byte region of memory. Writing to this area triggers the PIF to interpret the block of data as a sequence of commands (such as those listed above), and the CPU can read back from this region to obtain the results.

Correctly handling the way the block is formatted is key to emuating the PIF accurately. Over the years I've rewritten this code several times as I've fixed bugs, each time fixing certain incorrect assumptions made during the previous version. Usually these assumptions were made on incredibly useful, but incomplete information (such as early docs from LaC (*waves*) ) which described how to use the PIF rather than how to emulate it. Unfortunately there was lots of gaps in this information and so I'd keep finding roms that used the PIF is various ways that I'd not taken into account.

A few years ago I stumbled across something rather amazing - a document which described the PIF hardware in great detail. Not only was it very thorough - including details on all the various formatting commands I'd spent hours reverse engineering - it also included an insane amount of detail on things such as the chip timings and the even method by which the PIF is used as a 'security system' to lock out pirated roms.

What was this document? Did I find it tucked away on some obscure N64 hacking site? No - it was US Patent 6,394,905, filed by Nintendo in September 2000. Genius!

There are actually a whole series of N64 patents (see 5,426,762 and 4,799,635 for a couple of other similar examples.) It's not all good news though because unfortunately they're all written in a horribly convulted form of Legalese:

There are six JoyChannels available in the present exemplary embodiment. Each Channel's transmit data and receive data byte sizes are all independently assignable by setting size parameters. In the exemplary embodiment, all six channels size parameter setups are required, whether they are used or not.


Ouch.

This is the statement I spotted a couple of weeks ago that explained why one of the previous assumptions I'd made wasn't working in all cases:

If the 64 bit CPU 100 wants to change JoyChannel's Tx/RxData assignment, a 32 bit format flag is used, where a certain bit(s) specify the desired format. For example, when Wr64B or Wr4B is issued when this flag is "1", PIF executes each JoyChannel's Tx/RxData assignment based on each channel's Tx/Rx Size. In other words, unless this flag is set to "1" with Wr64B or Wr4B, Tx/RxData area assignment does not change. After Tx/RxData assignment, this flag is reset to "0" automatically.


After painfully wading through the text, it's basically saying that if you write a '1' to the last byte of the PIF block, you need to reformat all the channels. Otherwise you just assume the same formatting as the previous command.

The bug that I've since fixed was actually caused by two problems. Firstly, I was reformatting the channels regardless of how this last byte was set. I'd assumed that it was always set to '1' by the CPU to indicate that a command was ready, and set to '0' by the PIF when it was finished executing. In a few roms I was seeing '0' being written, but I didn't know why. Quite often when I saw them do this they'd hang shortly aftwerwards.

The other problem that contributed to the bug was that I was accidentally overwritting the 'Tx/RxData assignment' area with 0 as I was processing the PIF block (actually, I was clearing the top 4 bits of each byte, rather than just the top 2). This meant that when I reformatted the Tx/RxData assigments the areas were smaller than expected, so the PIF handling code would abort and return with failure.

These two problems combined to cause certain roms to get confused and hang when trying to access the mempak. The total fix was a few dozen lines of code to separate out the PIF formatting from the command execution, and a single-line fix for the second issue. All the roms I found exhibiting this issue now make further progress (hooray!) There may well be other issues that have been resolved with this fix too :)

It's bedtime for me now (well, maybe a little Dead Rising on the 360 first :) I'll try and get back into the habit of updating a bit more often from this week on. Sorry if I've not replied to your email - it's not personal, it's just me being rubbish.

-StrmnNrmn

Sunday, September 17, 2006

I'm still alive

I just wanted to apologise for the lack of updates recently. In between commitments at work and at home I've not had much time to work on Daedalus over the past couple of weeks. Fortunately things are quietening down again so I should have some free time to work on the next release of the emulator this week. Rest assured I'll keep you up to date with all the new developments.

-StrmnNrmn

Wednesday, September 06, 2006

Viewport scaling

It's been a fairly slow week for progress on Daedalus. I've had a few busy days with work, and I was away over the weekend so I've not had much time to work on Daedalus since my last update.

One thing I have managed to implement is viewport scaling however. After the previous release I received an email from Chris Meyer-Rassow which had a couple of interesting points. He suggested that I should have a look at providing a setting to allow the emulator to run with an aspect ratio of 4:3, rather than strecthing everything to fit the PSP's 16:9 (or 15.88236:9 to be accurate :)

I had been meaning to add this option for some time, and as it was only a small change I decided to implement this feature for the next release. I've added 3 different modes, as shown the in screenshots below. The first mode simply scales the graphics up to fit the PSP's screen (this is no different from the current release of the emulator):



This renders at 480x272. Notice how the vertial lines for the 't' and 'f' characters look slightly too heavy. This is as a result of scaling the graphics up to fit the PSP's screen. (As an aside, this screenshot clearly shows another issue that I need to address - the zfighting which is clearly visible in the carpet. I have a couple of ideas for fixing this.)

The next shot shows the same scene with the new unscaled 4:3 viewport setting:



This renders at 320x240, and so avoids scaling up any of the graphics. You can see that the problems with the 'f' and 't characters has gone. On the downside, there's a lot of 'wasted' space around the screen.

The final settings shows the new scaled 4:3 viewport setting:



This maintains a 4:3 ratio (so we maintain the correct aspect ratio), but scales things up to 362x272 to make full use of the screen's height. We waste a little less space with this mode, but some of the original problems with the text are still visible.

In the end all three options are a trade off between making full use of the screen and quality. In the next release it will be possible to adjust this setting as the emulator is running so you can pick whichever one looks best for the rom you're currently running.

As a result of doing this work with the viewport, I also fixed a couple of bugs related to how I was emulating the N64's viewport. When I originally wrote the graphics code for the PSP version of Daedalus, I forgot to handle the N64's viewport scaling/translation, which causes problems rendering certain scenes, as demonstrated in Aerogauge below:



Notice how the car is being rendered off-centre (it should be in the top right hand corner). After my viewport fixes, the same scene now looks like this:



The same change also fixes various minor glitches in a number of other roms (including Mariokart and Waverace), so it's a nice bugfix to get in for the next release.

-StrmnNrmn

Wednesday, August 30, 2006

Display List Debugger

I've been working on a few new features which I'm eager to talk about, but I thought it would be interesting to give some details about some of the tools I use to help me debug various problems with the emulator.

Some of the hardest problems to identify and fix are various graphical issues that crop up when running certain roms. Sometimes it's unhandled combiner modes (this is what results in the purple-and-black textures seen in so many screenshots). Other times there are black or white polygons, or scrambled textures and so on. Sometimes the screen is just totally black :)

When I'm trying to figure out what's causing a particular problem my first step is to recompile Daedalus with the Display List Debugger enabled. If you've been playing with the source, this is done by setting CFLAGS = $(DEBUG_DLIST_CFLAGS) in the Makefile, and linking in Source/PSPGraphics/DisplayListDebugger.o.

The debugger is accessed by pausing emulation and hitting the right trigger button. You need to have PSPLink set up in order to use it, as I didn't want to clutter up the display with various debbuging output. When the debugger is activated, it keeps the emulator paused and replays the current display list over and over. The first thing is does is dump out a list of all the commands in the display list to a logfile which looks something like this (I've edited it somewhat to create a simple example):


[00066] 0x00214290: 01060040 802143b0 G_GBI1_MTX
Command: ModelView Load Push Length 64 Address 0x002143b0
+1.00000 +0.00000 +0.00000 +0.00000
+0.00000 +1.00000 +0.00000 +0.00000
+0.00000 +0.00000 +1.00000 +0.00000
+0.00000 +160.00000 +0.00000 +1.00000

[00067] 0x001c1118: bb000001 ffffffff G_GBI1_TEXTURE
Level: 0 Tile: 0 enabled
ScaleS: 0.999985, ScaleT: 0.999985
[00068] 0x001c1120: 04f00100 0a000000 G_GBI0_VTX
Address 0x001c1000, v0: 0, Num: 16, Length: 0x0100
#00 Flags: 0x0000 Pos: { 0.000000, 60.000000,-1.000000}
#01 Flags: 0x0000 Pos: { 80.000000, 60.000000,-1.000000}
#02 Flags: 0x0000 Pos: { 80.000000, 80.000000,-1.000000}
[00069] 0x001c1130: bf000000 00000a14 G_GBI1_TRI1
Tri: 0,1,2


What this is showing is a matrix being loaded (command 0066), texturing being enabled (0067), a bunch of vertices being loaded (0068) and then a triangle being rendered (0069). (That's quite a lot of work to display a single trinangle! In reality the n64 would load up batches of vertices and render multiple triangles at a time.)

The display list debugger also provides a number of other useful features. The first is the Combiner Explorer. This is how it looks in PSPLink:



This displays all the different combiners used in a scene, and allows them to be enabled or disabled individually. When they're disabled, it replaces the triangle with a big green and black texture, like so:





In each of these cases a single combiner was disabled, showing how different combiners are used to achieve different effects in the scene.

This tool is invaluable in debugging combiners as I add them, to ensure that they're doing what I think they're doing.

The Texture Explorer displays all the textures used in the scene. It looks like this in PSPLink:



It allows individual textures to be displayed on the PSP so I can ensure that they're being decoded correctly:




The final function I want to demonstrate is the ability to terminate the display list at a given point. That is, if a display list has 1000 commands, I can stop it after at any point during rendering to see what the scene looks like. The example pasted above shows a snippet of 4 commands from 66-69, typically there may be 3000-5000 commands in a display list - the Mario scene above is composed of 3614 commands.

Being able to stop the display list like this is really useful, because it allows me to see the exact command at which a given triangle in the scene is drawn. Let's say that we're trying to figure out why the triangles making up Mario's feet are coming out in the wrong colour. What we'd do is render the display list command by command until the specific triangle we were interested in appeared on-screen. At this point we'd be able to determine the command number, and cross-reference this with the display list log which was dumped at the beginning. Usually looking a few lines before this point in the log will reveal the source of the problem.

I hope this has provided a little insight into how I go about debugging some of Daedalus' graphical problems. I'll leave you with a sequence of shots showing how Mario 64 looks as it's rendered from start to finish:











-StrmnNrmn

Friday, August 25, 2006

Something for the weekend - R8

I got back from my trip yesterday and I've spent a few hours today putting together the final changes for Daedalus PSP R8.

You can grab the latest files from SourceForge.

Here's the changelist:



[^] Replaced all uses of sceCtrlReadBufferPositive with sceCtrlPeekBufferPositive.
[^] Various known value optimisations for the dynamic recompilation engine.
[^] Various texture cache optimisations and rendering optimisations.
[+] Implemented a new clipping method which is more efficient and gives better results.
[-] Removed 'tesselate large triangles' setting.
[+] Added option to reset emulator to the main menu.
[^] No longer use index buffers for rendering.
[^] Implement matrix multiplication using VFPU.
[^] Implement vertex transform and lighting code using VFPU.
[^] Implement clipping code using VFPU.
[^] Minor AddTri optimisations.
[^] Free background and font textures while emulator is running to free VRAM.
[!] Fixed bug in default controller config (c-down and dpad-down were broken)



It should be pretty obvious that most of the changes in R8 are optimisations. This build is significantly faster than R7, which is a significant achievement considering how much R7 had achieved in this area. Here's an updated view of the framerate table I introduced a couple of months ago and updated recently:














SceneR4 fpsR5 fpsR7-beta fpsR8 fps
Mario Head3688
Mario Main Menu14253038
Mario Peach Letter6-7111318
Mario Flyby (under bridge)6101217
Mario In Game5-691115
Mario Kart Nintendo logo10232435
Mario Kart Flag6111316
Mario Kart Menu7111314
Zelda Nintendo Logo2023?70
Zelda Start Menu2-34?8
Zelda Main Menu1013?40


I think the numbers speak for themselves. What's particularly impressive is that the R8 results are so high, despite having the new clipping code enabled by default. By implementing the triangle clipping code and the transform and lighting code using the PSP's VFPU I've managed to keep the additional cost to a minimum.

With the previous release, I also mentioned a list of things I had planned for R8. I decided to put them on hold while I pursued the various optimisations in R8, so I'll be looking at working on them for the next release.

Have fun - I'm going to wade through the past couple of weeks worth of emails that have built up while I've been putting this build together :)



-StrmnNrmn

Tuesday, August 22, 2006

Away for a couple of days

I've made some great progress on getting the new clipping code working with the PSP's VFPU. Actually, so far I've just been working on getting various matrix/vector routines and the transform and lighting (TnL) code working with the VFPU and I'm seeing very good results so far. The TnL code is around 2-3 times as fast running through the VFPU compared to the CPU. This gives around a 0.5-1.0fps speedup in the various roms I've been testing.

Unfortunately I have to put the clipping work on hold until the end of the week as I'm heading home for a few days to see my family. In the meantime, I've answered a few of your comments about clipping on the previous post.

Also, here are a few amusing 'outtakes' as I was trying to get the VFPU TnL code working:

Hmm, something wrong with the indexing I think


Something is definitely wrong with the transform matrix (or else Lakitu is drunk :)


M.C. Escher would be proud


Whoops.


-StrmnNrmn

Sunday, August 20, 2006

Triangle Clipping

After Wednesday's news I wanted to keep everyone up to date with what I've been working on over the past few days.

With Wednesday's changes incorporated, I reprofiled a few roms to see where most of the CPU time was going. Things have changed considerably since I initially talked about deciding what to optimise. Looking at the profiler for Mario 64 the time spent executing display lists is now a much more significant fraction of the total time spent on each frame. Back around R3/R4 only around 20% of the time was spent here. With the latest build display list processing now accounts for around 35-40% of the time. The display list processing hasn't become any slower, it's just becoming more significant as I've optimised the CPU emulation.

One of the settings I mentioned was worth disabling for a speed boost when I released R7 was the 'Tesselate Large Triangles' option. When this setting is enabled, it causes the display list processor to recursively break up large triangles into smaller pieces. This has been necessary to overcome the PSPs poor hardware clipping support; without breaking the triangles up into smaller pieces, the PSP will often fail to render large triangles as shown below:

Super Mario 64 without clipping
Super Mario 64 without clipping


The large triangles that make up the floor that Mario is standing on are rejected by the PSP, leaving a large hole where the floor should be. By breaking the triangles into smaller pieces before attempting to render them, it reduces the chance that the PSP will decide to discard them.

There were a few problems with the 'Tesselate Large Triangles' setting which I've been working on overcoming this weekend. Firstly, it's not perfect - there were plenty of cases where visible triangles would still be culled even when they had been subdivided 3-4 times (which generates 27-81 triangles for each input triangle!). This was always quite noticable in games with a relatively low camera, such as racing games. The other big problem with this setting was that it was very slow - often adding over 20ms per frame.

This setting was always intended as a quick fix rather than a long term solution, so I've been looking at fixing both of these problems over the past few days. I started by ripping out all the exisiting polygon clipping and tesselation code and starting from scratch. After a couple of days of hacking I've finally got a replacement system that seems to be clipping everything I've thrown at it perfectly. Here's a shot of the same location in Mario 64:

Super Mario 64 with new clipping code
Super Mario 64 with new clipping code


Now that I have a working version of the code in place, I'm going to look at optimising it. At the moment the new clipping code is roughly as expensive as the tesselation code, but due to the way it's implemented I think it should be much easier to make work with the PSP's VFPU, as I can process batches of vertices in parallel. Ideally I'd like to get this change into the next release, so I'm going to hold off putting the R8 build together until it's ready. I'll let you know how I get on.

-StrmnNrmn

Wednesday, August 16, 2006

Unexpected optmisations

One of the things that I find most rewarding about programming is when you discover an unexpected improvement or optimisation by accident. You can spend weeks carefully tuning and optimising code, only to stumble across a glaring inefficiency in your code which you've never spotted before. One quick change and your application is suddenly noticably faster.

In my daily job I rely heavily on debuggers and profilers to discover bottlenecks in the working on the Xbox, Microsoft provided some excellent performance analysis tools (I see they've finally released PIX for Windows). These days I tend to use AQtime as I'm PC based (it's also one of the few profilers I've found that can handle the size of our libraries at work without grinding to a shuddering halt.)

Without these kind of tools it's a lot tougher profiling on the PSP. Over the past few months I've built a number of custom profiling tools into Daedalus to help me figure out where all the time is going, but the numbers I get out tend to be quite vague, and there's usually quite a large margin of error. I think this explains why the unexpected optimisation I've just found went undiscovered for so long.

A couple of days ago I was browsing the ps2dev forums and came across this post. I was about to back out after a quick scan, when I noticed this comment from Soatome:


PeterM wrote:
but one waits for the vblank

...and that's sceCtrlReadBufferPositive (which you're using)
you should use sceCtrlPeekBufferPositive instead.


That's when I realised that when Daedalus was emulating a rom, it was stalling for a frame every time the rom read the status of the pad*. In other words by changing one line of code in Daedalus from


sceCtrlReadBufferPositive


to


sceCtrlPeekBufferPositive


I could get on average an instant 1fps speedup across all roms. What's more, I knew some roms read from the pad multiple times each frame, so they would see an even great speedup.

Frustratingly I had to wait a couple of days before I could try this out. As I mentioned earlier I'm in the process up moving over to a new PC, and I had just moved Perforce over but hadn't set up the pspsdk, which required Cygwin. Daedalus requires libpng and zlib so I had to download and build them too. Then I had to set up Psplink, PuTTY and a whole host of other tools. You get the picture...

Last night I finally managed to get a new build together with the updated code, and the results were every bit as good as I'd expected. In some cases I had to restart the rom just to make sure I wasn't mistaken. I know most of you just want to see some numbers, so here's a few of my observations:

Mario now runs at at steady 15fps in most places, and around 20fps indoors etc (it reaches over 35fps in the main menu, and close to 30 in some scenes.) Zelda now runs at around 8fps in game, and up to 20fps in certain places. The 'nintendo' logo at the start runs at over 90fps :D The MarioKart Nintendo logo now runs at 30fps, and the main menu (with the flag) runs at a solid 15fps. In game it's a comfortable 12fps. Starfox runs at around 15fps - the intro runs at 25-30fps. Quest64 runs at 20fps.

So all in all it's a pretty amazing improvement for a single-line change. Having said that, I think it would be a mistake to assume that this is an instant fix that will suddenly make everything fully-playable. Although some of the framerates I list above are excellent - faster than an native n64 even - not all roms show this improvement. Don't assume that all roms now run at 15+fps (because they don't.) There's still a lot more work to do to get from a sluggish 8fps to a more playable 15fps (in Zelda for instance). I still need to save a lot more cycles in order to support other features such as sound.

Because this change makes such a big improvement I'm going to try and get another release out sooner rather than later. I don't like releasing builds too often as I think each revision should something worthwhile, but I think this qualifies :) There are a couple of other optimisations I want to get in this build, so while it might be ready this weekend, sometime early next week is more likely. The new features I had planned for this build will have to wait until R9.

As always, I'll keep you posted.

-StrmnNrmn

*This actually reminds me of a funny story from one of the Xbox games I was working on. We were investigating a sudden slowdown that had been appeared a few days previously. Somehow I realised that the framerate doubled when you unplugged all the controllers. As it turned out someone was accidentally reinitialising the USB hub every frame, and removing the all controllers prevented this from happening.

Sunday, August 13, 2006

R7 released!

I've just uploaded R7 to SourceForge. Here's the changelist:


[^] Avoid checking for interrupts in dynarec code in most situations.
[^] Optimise dynarec Load/Store instructions to avoid checking for interrupts directly.
[^] Implemented the remaining 32-bit integer instructions in the dynarec.
[^] Implemented the remaining commong load/store instructions in the dynarec.
[^] Implemented JAL/JR in dynarec.
[^] Optimised various texture cache related features.
[^] Added various known value optimisations to the dynarec engine.
[^] Link together blocks even when they exit with branch likely instructions.
[+] Added option to allow frequency of texture update checks to be reduced.
[+] Added the ability to configure buttons
[!] Fixed a couple of compatibility issues caused by the dynarec.
[!] Fixed a couple of issues related to self-modifying code and the dynarec.
[!] Fixed issues with the framerate counter flickering.

 



Daedalus R7 for v1.00
Daedalus R7 for v1.50
Daedalus R7 Source

The main emphasis has been on improving the framerate of as many roms as possible, but I've also made some significant fixes to the dynarec engine which should improve compatibility for a few roms where this was causing problems before.

There are two settings you should be aware of if you're looking at getting the fastest possible framerate. The first is on the global settings page (that's the one you see on the main menu as soon as you boot up). You'll want to set 'Tesselate Large Triangles' to No here. The next option that helps boost the framerate is set 'Texture Update Check' to Disabled on the Rom Settings screen. A combination of these two options give a significant speedup in various roms.

In R7 I've also added the ability to define your own custom controller configurations. You can define a new controller mapping by adding a new .ini file to the Daedalus/ControllerConfigs directory. There are a few examples in there already, and I'll look at posting a brief tutorial up here sometime soon. If you come up with a new mapping you think would be useful then email me (my address is in the readme.txt) and I'll post it up here and add it to a later release.

I think R8 is going to continue to focus on improving the framerate. I still have a lot of optimisations I want to get in, and I think these will help improve the framerate even further. I also want to spend a little time improving the front end, as it's getting harder for me to add new settings and options in there. I also want to add an option for changing some of the settings while a rom is running (i.e. I think it's time we had an in-game menu.) Another thing I'll look at for R8 is saving settings between runs of the emulator - this way Daedalus will remember which controller setup you prefer for each rom. Finally I want to add an option to quit back to the main menu without having to restart Daedalus.

Phew! That's quite a big list. I can't guarantee I'll be able to add all that for the next release, but that's what I'm currently aiming for. I think it's more important to try and release regularly (i.e. every 3-4 weeks) rather than try and cram everything into one go, so some of these features might move back to R9 if I slip behind.

My first job though is to move my development environment over from my old PC to the new 'beast'. I've had to put this release together through a Remote Desktop Connection to the old PC and I can't bear to do that any longer. It'll probably take a few days to get everything set up on the new PC, but it should be a lot less painful in the long run :)

-StrmnNrmn

Nearly there

I've just about finished work on the custom controller configuration. Once I've finished testing everything is working as expected I'll begin the process of putting the release together. This can be quite involved as it requires updating all the documentation, packaging all the necessary files and uploading to SourceForge etc. Hopefully it shouldn't take more than a couple of hours. Next post will be with full details of all the changes and the download links etc.

-StrmnNrmn

Thursday, August 10, 2006

R7 close

I'm getting close to finishing everything I want to get into R7. I've just spent a little time tidying up a few loose ends (little things like the way screenshots are handled). The one last substantial bit of work I want to get done is support for custom controller configs. I think this should be a few hours work, so I expect I'll finish this sometime on Saturday or Sunday, with a release following shortly afterwards.

I got a new PC on Tuesday and this has slowed things down a little bit as I've spent a couple of evenings this week seeing just how stupidly fast it is. This is what I went for in the end:



I've got it all hooked up to my Dell 2405fpw and it's awesome. It's nice to finally be able to play games at the 2405's native 1920x1200 resolution without it chugging along :D

-StrmnNrmn

Sunday, August 06, 2006

Trigonometry Wars

While you're waiting for R7, my good friend 71M has just released the first version of Trigonometry Wars. It's awesome, check it out!

More R7 Optimisations

It's been a while since my last post, but I've still been hard at work with various optimisations for Daedalus R7.

Although my main focus is on improving the dynamic recompiler, I've been looking at optimising a couple of other areas that I noticed were fairly expensive. The texture cache is one of the areas that I spent time tuning this week. This cache is used to avoid converting textures from the native n64 formats to psp formats every frame. I made a couple of fixes to improve the hashing function which gives much faster lookups in certain situations (such as tiled backdrops). I also provided an option to change the frequency at which the texture cache checks for updates to the textures. Many roms look fine when this check is entirely disabled, and this can give quite a nice speed boost.

My main focus has continued to be on the dynamic recompiler. I've made a couple more bugfixes in this area. One bugfix involved detecting when roms were using self-modifying code. The fix involved dumping the contents of the dynarec cache so that the code is correctly regenerated for the updated instructions. This fix solves a couple of issues I was seeing with Quest64, and I'm sure it will help improve compatibility with a number of other roms too.

The other dynarec issue I fixed was related to the way I was handling certain types of branch instructions. The MIPS processor has a set of 'branch likely' instructions which work slightly differently to regular branches and so I handle them separately in the dynamic recompiler. It turned out that I had forgotten to link together code fragments when they exited through a branch likely instruction. This fix gives a nice little speedup.

The biggest bit of new development I've been doing on the dynarec is on optimising for various situations where I can determine the contents of a given register at the time I'm compiling the code. As an example, many roms use the following sequence to load an integer value from memory at a specific address:


LUI $t0, 0x8033 // Load Upper Immediate - i.e. load t0 with 0x80330000
LW $t0, 0x1234($t0) // Load Word - i.e. load t0 with the value at 0x80331234


Previously I'd generate code for both of these instructions on the PSP. The LUI instruction is easy (if t0 is cached on the PSP then this is just one instruction). The LW is a lot more tricky. I have to call a function to convert the address on the n64 (0x80331234 in this case) to the address in the emulated memory on the PSP. Then I have to read from that address, or trigger an exception in the emulator if the memory address is invalid.

With the changes I've just made, when I encounter the LUI instruction (or other instructions involving loading constant values into registers) I keep track of the fact that I've loaded t0 with 0x80330000. When I come to process the LW instruction, I can now determine that the desired address is 0x80331234. I can then map that address directly to the required location on the PSP, avoiding a function call in the generated code. By avoiding the function call I no longer need to flush cached registers back out to memory. Also, because I can tell in advance that the address lies in RAM (and isn't referencing a hardware register for instance) then I can also omit the code testing for an exception. Finally, in situations like the example above, I can don't need to generate any code for the initial LUI (as the register is immediately overwritten with the loaded value.)

In summary this is a very nice optimisation - it generates fewer instructions (reducing the size of the dynarec code), it avoids unnecessarily flushing out cached registers, it avoids generating exception handling code, and it can eliminate redundant instructions (the initial LUI). In the best case, for 2 source instructions it will generate just 3 output instructions, compared to 12-13 for the unoptimised case.

Unfortunately this approach only works with load and store instructions where the address can be determined in advance, but from the roms I've examined so far around 10-15% of the load/store instructions can be optimised in this way, which is enough to give a measurable benefit.

I'm going to spend the rest of this week seeing which other parts of the dynarec engine can benefit from similar approaches. I have a couple of other features to implement (configurable controllers etc), if that all goes to plan I'll try and prepare R7 for a release next weekend.

-StrmnNrmn

Saturday, July 29, 2006

Fixed!

I'm very pleased to be able to say that I've finally managed to fix the nasty bug I blogged about on Thursday.

I'll go into more details in a later post, but in essence the problem was due to very rare situations where the trace recorder would exit a trace when there was still a branch delay instruction pending. This caused the fragment generator to inadvertently skip the branch instruction, causing the odd behaviour I was seeing.

For reference, here are some updated figures for Super Mario 64 and Mario Kart (initial results are from a previous post). Generally the current changes seem to indicate an overall speedup of 20%-25%, which is great for a few days work. What's even better is that I've still not implemented all the optimisations that I have planned for R7, so hopefully these numbers will look even better soon.











SceneR4 Framerate (Hz)R5 Framerate (Hz)Current Framerate (Hz)
Mario Head368
Mario Main Menu142530
Mario Peach Letter6-71113
Mario Flyby (under bridge)61012
Mario In Game5-6911
Mario Kart Nintendo logo102324
Mario Kart Flag61113
Mario Kart Menu71113


(I'll update with results from Zelda shortly - I have to go to a BBQ now!)

-StrmnNrmn

Thursday, July 27, 2006

Captain Morgan's compatibility results

It looks like Captain Morgan has put a lot of effort into doing compatibility testing with Daedalus R6. Thanks Cap'n! (thanks for your email too - I promise I'll get back to you just as soon as sort this issue out!)

[Update]

Don't miss Wally*Won_Kenobie's R6 compatibility list, which is also excellent. Wally has been collecting missing_mux.txt files for me too, for which I am indebted :)

-StrmnNrmn

A great optimisation/bugfix..

..with a catch.

This week I've been posting about various speedups I've been making by implementing various opcodes in the new dynarec engine. Although I have implemented most of the commonly used opcodes now, after my previous post I decided to add some temporary logging to the emulator to see how often the remaining unhandled opcodes were being used. Two that immediately jumped out at me were JAL (Jump And Link, which is used to perform a function call) and JR (Jump Register, which is used to return from a function call.)

These two instructions are very heavily used, and I was surprised to realise that I'd not implemented them! They're pretty easy to code - in fact, due to the way I construct the instruction traces that are fed into the dynamic recompiler I could effectively ignore the JAL instruction and JR just required a couple of lines of code.

So far so good. I was expecting a modest speedup - maybe another 2-3% on top of all the previous changes I've made this week. After compiling and running the new code, I was amazed to see an improvement of over 10%! Surprised with the figures I was seeing, I did a full rebuild and checked the results again with several different roms. They all showed the same kind of speedup.

I've been programming (and more importantly debugging) long enough now when I should trust my instincts - call it my programming 'Spider-Sense' tingling if you will, but I knew something didn't quite add up :). In situations like this in the past, rather than taking an unexpected speedup for granted I've spent time investigating the root cause to find out exactly what's going on. At the very least I'll simply satisfy my own curiosity, but often I'll find some useful information along the way too (e.g. other related improvements and optimisations etc.)

So I started looking through the code and rerunning a few roms to try and get a handle on what was causing such a significant improvement. After a short while I realised that all the roms were now generating a lot more potential traces for the dynarec engine to consider for recompiling. This confused me even more, because this behaviour should slow the emulator down rather than speed it up. Another puzzling thing was that the only observable behaviour of my changes should be the speed of emulation - but it looked like my change was somehow changing the flow of execution in the rom.

After bit more head scratching and debugging, I finally realised what had happened. In making my changes, I had inadvertently fixed a bug in the dynarec engine that was causing the recompiled code to jump back out to the interpreter whenever a JAL instruction was encountered! This bug had been in the dynarec engine since the first day or so, but because its only side effect was to slow down the emulator rather than something more obvious (i.e. a crash!) it had remained undetected for a couple of months.

So I had figured out what the reason for the 10% speedup was, and I could finally get to bed safe in the knowledge that I had fixed a nasty, subtle bug along the way. Brilliant!

It was only then that I noticed a couple of new problems that I hadn't seen before: The emulator began hanging in places that had previously been fine - such as Peach's letter at the start of Mario 64. On the occasions the emulator managed to get past that point, I found out that Mario wouldn't move or jump (but strangely the c-buttons and pause menu worked fine)

:(

I've spent the last couple of evenings trying to figure out why fixing one bug is causing another. I've finally managed to find a way of reliably reproducing a hang within a few seconds of starting the emulator up. This is important because my best chance of identifying and fixing the problem is to be able to run the PC build of the emulator with my 'fragment simulator' enabled. The simulator is very slow however (about 100x slower than running the emulator normally), which is why it's important to find a way of reliably reproducing the bug very early on in the emulation.

So that's what I've been up to over the past couple of days, and why I haven't been able to reply to people's emails or comments on this blog. Now that I can reproduce the bug in the fragment simulator I'm confident that I can get to the bottom of it. I'll keep you posted with any developments and try to go through emails/comments just as soon as I've cracked it.

-StrmnNrmn

Wednesday, July 26, 2006

Further dynarec optimisation

I've spent the last couple of evenings working on adding support for additional instructions to the dynamic recompiler. With every instruction I add, the generated code becomes a bit more efficient as I can avoid various bookeeping work (such as flushing all the cached registers out to memory.)

I've added code to handle the following ops:


  • MULT, MULTU (multiply, multiply unsigned)

  • DIV, DIVU (divide, divide unsigned)

  • MFLO, MFHI (move from lo/hi)

  • MTLO, MTHI (move to lo/hi)

  • LB, LBU (load byte, load byte unsigned)

  • LH, LHU (load halfword, load halfword unsigned)



So far I'm seeing around a 5-6% speedup with these changes (on top of the 10-12% speedup I talked about on Sunday). I am generating slightly more code as a result of this work, but given the large savings I made over the weekend this isn't much of an issue.

My next job is to look at optimising the remaining load/store instructions - I just have LWU/SB/SH to do (ignoring the 64 bit instructions for now). Once that's done I'm going to have a look at optimising sequences of load/store operations by caching the base address between uses. I think that should give a significant speed up for memory intensive chunks of code.

-StrmnNrmn

Sunday, July 23, 2006

A productive weekend

I've had a fairly productive weekend. I've been working on a few improvements on the dynamic recompiler, and I've managed to both decrease the amount of memory it's using (by approximately 10-12%), and increase the emulator's speed (by around 7-10%).

I don't want to get into too many details just yet, as I'm planning on putting together a more detailed post with all the gory details (and a few graphs :) later in the week.

In a related change, I've also managed to identify and fix an issue (read 'bug') with the dynarec that may have been causing stability problems on v1.0 firmware PSPs. This may explain some of the differences in stability users of the different builds have been seeing.

-StrmnNrmn

Wednesday, July 19, 2006

Windows CE device emulator source

For anyone reading this blog with an interest in emulator development, Microsoft have just released their Windows CE device emulator as shared source.

As the post mentions, it includes the source for an ARM -> x86 JIT compiler which makes interesting reading, (take a peek at armcpu.cpp)

-StrmnNrmn

R6 Released

I've got to keep this short as blogger.com is going down in 5 minutes :(

Change log:

[+] Added over 50 new combiner modes
[+] Added support for c-buttons
[+] Load roms from ms0:\N64 in addition to local roms directory
[!] Fixed backface culling issues
[!] Correctly implemented flipping to avoid flickering with certain roms
[!] Plugged memory leak in texture handling code, fixing various crashes
[!] Fixed issue which caused screenshot function to hang the emulator

You can grab it here.

For R6 I've mostly been focusing on fixing a number of graphical issues (namely adding combiner modes to popular roms). I've also managed to add a couple of nice usability improvements (in particular mapping the c buttons to the dpad, using the circle button to toggle back to the n64 dpad)*. I've also been able to track down a couple of bugs that affected stability.

I've still not decided what to concentrate my attention on for R7. The main areas are:


  • Speed

  • Compatibility

  • Graphics

  • Usability



These are all quite broad areas, but it would be good to get a feeling for what people are most interested in seeing improved. Any comments would be most appreciated.

-StrmnNrmn

* I should point out that I had dozens of people suggest this to me via email and through comments on this blog, so I can't take any credit for this idea :)

R6 up within the hour..

I had to pull a long shift at work (14 hours!) so I've only just got in. I'm in the process of rebuilding a release build, updating docs, zipping things etc and should have a new build uploaded by 1am (BST). It looks like blogger.com is down at 5pm (PST) so if I don't post an update soon, check the SourceForge site for the update.

-StrmnNrmn

Monday, July 17, 2006

R6 Tomorrow (hopefully!)

I don't mean to tease, but I am hoping to release the next build of Daedalus PSP tomorrow. I've spent the last couple of days polishing a number of graphical issues and I need to set myself a target date otherwise I'll just keep tinkering for ages :) I'll have a longer post tomorrow with details of the changes that have made it into R6.

In the meantime, I've answered a few of the most recent questions on the previous comments page.

-StrmnNrmn

Tuesday, July 11, 2006

Graphical fixes

Here are a few of the significant graphical fixes I've made so far for R6.

Firstly, I managed to fix the horrible flickering that happened when running various roms (Paper Mario was a good example). It turned out that I was making an assumption that roms executed exactly one display list per frame. I assumed that each display list would clear the screen, render everything, and then wait for the screen to flip. As it turns out, some roms execute multiple display lists per frame. In the case of Paper Mario it executes 2 display lists per frame (one which clears the screen, then another which renders everything). By making sure that I only flip after the second display list executes, I avoid the flickering (the actual solution is a little more involved but this is the general idea).

The next significant glitch I've fixed was to do with backface culling of triangles. Basically, when I ported the graphics engine over from the PC version, I forgot to implement the two or three lines of code which handles this. It was a very small fix, but it corrects a number of significant graphical issues (notably all the walls getting in the way in Quest 64).

Finally, I've managed to track down and fix a significant memory leak in the texture handling code. I believe this was causing many of the random crashes that were occuring when leaving the emulator running for several minutes or more (basically through running out of memory). Before applying the fix I found that the Super Mario 64 would crash within 4-5 minutes. After applying the fix I've been able to run Mario with no problems for over 30 minutes.

I'll keep you posted as to when you can expect a new release. I'm quite excited about the memory leak fix, so I'd like to get a new release out as soon as I can implement some of the other things I promised for R6.

-StrmnNrmn

PS I know I've been crap at replying to emails :( I'm hoping to get this release out and then I'll spend a few hours sorting out my mailbox and replying to various comments here.

Sunday, July 09, 2006

Congratulations Italy!

Congratulations to Italy on their win tonight!

Just a quick note to apologise for the lack of updates recently. I've been busy with work for the past couple of weeks, but hopefully it should be business as usual now. I'll try to post a more interesting update tomorrow or on Tuesday.

-StrmnNrmn

Monday, June 26, 2006

Deciding what to optimise

Whenever I start to answer questions on the comment pages I always end up going into too much detail for a quick response and end up deciding to put up a new post instead. I hope this isn't too annoying :)

In response to Plans for R6 xiringu and ukcuf16 had a couple of interesting suggestions for performance improvements.

First up, from xiringu:

instead of working with a 300x200 screen, work with only half height 150x200 and then display an empty line every other line to get the final 300x200.


That's an interesting idea - it's a trick that's been used by demo coders for years to get a few extra fps. I'm not sure this is going to provide all that much of a speedup to Daedalus though :( The reason for this is that currently rendering only contributes a small amount to the overall cost of each frame, so even if rendering time was totally eliminated, the framerate wouldn't change much. As an example, let's take something like Zelda which currently runs at around 4 fps. At 4fps it means each frame takes 1000/4 = 250 milliseconds to render each frame, which is broken down something like this:

CPU emulation: 200 ms
Display list parsing: 40 ms
Rendering: 10 ms
Total: 200 + 40 + 10 = 250 ms (i.e. 1000/250 = 4fps)

Assuming that we could totally eliminate the rendering time, this would now look like:

CPU emulation: 200 ms
Display list parsing: 40 ms
Rendering: 0 ms (no cost)
Total: 200 + 40 = 240 ms (i.e. 1000/240 = 4.17fps)

So the very best we could hope for in this case would be a .17fps improvement in the framerate :(

ukcuf16 wrote:

Just wanted to ask if there is ever going to be frame skip in later versions :)


What ukcuf16 is suggesting is that the emulator renders one frame, then skips the next. Alternating frames like this should halve the cost of rendering, at the cost of making the framerate a little less smooth.

Again, this is an interesting idea, but I don't really see this having much impact on the framerate as things stand at the moment. Working out the potential speedup is a little more complicated, as we have to take the average time over two frames. The numbers look something like this:

Frame1 CPU emulation: 200 ms
Frame1 Display list parsing: 40 ms
Frame1 Rendering: 10 ms
Frame2 CPU emulation: 200 ms
Frame2 Display list parsing: 0 ms (skipped)
Frame2 Rendering: 0 ms (skipped)
Total: 200 + 40 + 10 + 200 = 450 ms
Average: 450 / 2 = 225 ms (i.e. 1000/225 = 4.44fps)

So even implementing a frame skip mechanism would only give a tiny 0.5fps speedup.

To take this example to its ultimate conclusion, let's assume that I could somehow eliminate the entire cost of display list parsing and rendering:

CPU emulation: 200 ms
Display list parsing: 0 ms (no cost)
Rendering: 0 ms (no cost)
Total: 200 ms (i.e. 1000/200 = 5fps)

Even if I could somehow (magically) reduce the cost of rendering to 0 milliseconds, we'd still only see a 1fps speedup. However, if I can halve the cost of CPU emulation (which is much more likely given the speedups already seen with the new dynarec engine) this is what the calculations look like:

CPU emulation: 100 ms (now twice as fast)
Display list parsing: 40 ms
Rendering: 10 ms
Total: 100 + 40 + 10 = 150 ms (i.e. 1000/150 = 6.66fps)

At the moment I feel that there are more gains to come from optimising the CPU emulation, which is why I've been concentrating on this area recently. As the cost of CPU emulation falls relative to rendering then the ideas suggested by xiringu and ukcug16 will start to become more attractive.

-StrmnNrmn

Thursday, June 22, 2006

Source code rant - update

I was going to post these responses on the comments page, but I was worried that they'd get buried and I have a few important points to make.

From laxer3a:
Now you start to see why the TYL emu source was released shifted by one version from the bin. :-)


I was thinking about doing this, but it's important that people are able to tinker with the source as they please. I usually only ever have the time to refresh the CVS depot when I release anyway, but if other people get involved in the project then this will need to happen more regularly.

gregnoid wrote:
Naa, this pre-release is just a fake !
Becaus PspMonkey had not compiled this.


I need to make it clear I was ranting about PSdonkey, not PSmonkey. I have a lot of respect for PSmonkey and I wouldn't want people to think I was criticising him. It's unfortunate that so many people confuse the two names. Remember: donkey = large four-legged member of the horse family (likes hay, sombreros etc). Monkey = amusing, cheeky primate (likes bananas, mischief, etc) :)

psdonkey said:
About the pre R5 build that I made. Yes I did change a couple of minor things in the source code and things seemed to run a tad bit better. However, after reviewing your updated R5 release, none of the changes that I made in the other build had any effect in this new R5 build that you released. In fact most of what I did was clean up some of the excess code and I can see that you already did this in your R5 release.


I appreciate you clearing up the situation and making your source available - thanks for doing this. I really wasn't expecting to see this happen, so accept my apologies for the criticism I levied in my previous post. I had a bit of a Hulk rage going on and it wasn't really justified.

You make a really good point about having a shared directory for roms between Daedalus and Monkey64 - I'll look at rolling this change into the next official release (R6). If you're really keen on helping contribute to Daedalus then I think you should drop me an email and we can talk about the possibility adding you as a contributor on sourceforge.

-StrmNrmn

Plans for R6

I'm back from Spain now. I had a great time in Barcelona, it's a bit of a shame to be back :)

I'm planning to have quite a quick turnaround on for the next release, i.e. hopefully I'll have something ready by the end of next week, or early July. I want to concentrate on fixing a bunch of graphical issues (adding new combine modes to fix various pink+black textures etc). I'm also going to look at reducing memory requirements - I think the stability problems associated with running the emulator with dynarec for an extended period of time are a result of running out of memory. This should also help to improve the Expansion Pak support. If I get enough time I'll look at adding support for configuring the controls on a rom by rom basis. Finally, I also want to look at improving savegame support.

-StrmnNrmn

PS Congrats Ghana + Australia :D

Friday, June 16, 2006

Away for a few days

I'm going to Spain for a few days to celebrate a friend's birthday. I'll be back on Monday sometime, so I'll try and respond to various questions/issues about the R5 release then.

Buenos nachos! :)

Thursday, June 15, 2006

PSdonkey's build

PSdonkey made this comment when releasing a pre-release of R5 last week:

I went ahead and compiled the new source for everyone and also added a couple of minor changes to the source for speed improvements.


As I have no way of getting in touch with him directly, I'd like to ask him publicly if he could forward me a set of diffs for the 'speed improvements' he applied. Not just because this is a requirement of the GPL (the license under which the Daedalus PSP source is made available), but because I think it would be beneficial to the project to apply these changes to the main source tree. Having said that, I suspect PSdonkey might simply have been attempting to take partial credit for several weeks development work. I'll let you know if I receive any diffs...

Daedalus PSP R5

I've just uploaded the R5 release to sourceforge. Here's the changelist:

[+] New DynaRec engine, resulting in significant performance improvements
[+] New front end - ability to toggle a couple of options (more to come)
[+] Save game first pass (eeprom4k, eeprom16k and mempak)
[^] Various interpreting engine optimisations
[~] Use .png fileformat for background images, save ~380KB
[~] Stripped out unnecessary code, save ~250KB


By far the most substantial change was the addition of the new dynarec engine. As detailed in previous posts, this is still some way from completion but is already providing significant benefits (around a 2x increase in speed in many roms).

I've also added the groundwork for a new front end and implemented a first-pass of the savegame system. The savegame support isn't fully tested so I'm expecting a few teething problems - please post any bug reports on the sourceforge site (preferably!), the comments page or email me (check the readme.txt..)

I'm not too sure what I want to concentrate on next. It's been over a month since the last release and I think that it would be a good idea to knock out the next few releases in quick succession, to help me pick up some momentum that I've lost. If I aim to do this I'll probably concentrate on some nice (but quick) improvements such as various graphical fixes, savegame support and per-rom control configuration. Any suggestions welcome :)

Linkage:

R5 Source
R5 for v1.00
R5 for v1.50

Tuesday, June 13, 2006

Brief Update

This is just a brief update as it's been a while since my last post. I'm hoping to release a new build tomorrow, possibly Thursday at latest. I was planning on making an official release last weekend, but I got sidetracked polishing a few things. I've got a slight bit more to finalise, but there should be a couple of nice little additions since the source drop last week.

Monday, June 05, 2006

Source updated

Earlier this evening I updated the project CVS repository with the latest version of the code. Normally I only do this when I release a new build, but I know people have been playing with the code and have expressed an interest in seeing the latest developments.

I'm not quite ready to release a new binary yet (still a few more optimisations I want to make and various bugs to fix first), but I'll try and do this within the coming week.

Incidentally, updating the source normally only takes 20 minutes or so, but it took a good couple of hours tonight. Sourceforge updated their CVS service recently (May 12th) and as a result I had to spend a couple of hours updating WinCVS, generating new SSH keys and the like. Hopefully it won't be so painful next time around, or I might just lose the will to live.

As a more general update, I cleared a couple of things from my TODO list sorted this weekend. I'm caching floating point registers for most of the single-precision Cop1 instructions, which are now implemented directly in the dynarec code. I've not timed this in depth yet, but it's shaving 10-20ms/frame off the intro to Mario 64 (Mario's Head), which is particularly FPU heavy (i.e. I'm getting ~160ms/frame rather than ~180ms)

Finally I need to have a good think about how to go about optimising the double-precision floating point performance. As the PSP doesn't have hardware support for double precision floating point this is currently very expensive (i.e. adding 2 doubles on the n64 takes just one instruction - on the psp this balloons to several hundred as it all has to be done in software).

Currently I cheat and cast all the double-precision floats to single-precision values before performing the calculations. Although this is much faster it obviously loses a lot of precision, so I need to be careful it's not going to break any roms. Also even the float->double/double->float conversions are pretty expensive so it's still not an ideal solution. Fortunately not many roms seem to use double-precision maths extensively (presumably because it was relatively expensive on the n64), and where they do use it they don't seem to be too sensitive to the fact that I'm throwing most of their mantissa away :)

Wednesday, May 31, 2006

Some initial benchmarks

I've been really busy working on the new dynarec engine, so I've not been posting as frequently as I'd like. I've made a lot of progress in the following areas:


  • Most integer arithmetic and logical instructions now implemented (i.e I'm now generating optimised assembly for these instructions rather than calling a generic function to handle them

  • Regsiter caching implemented (although I'm only using a greedy allocation algorithm at the moment, as I've not yet fully implemented the fast linear scan algorithm I talked about in the previous post)

  • I'm directly linking all direct branches to compiled fragments

  • I'm linking to all indirect branch targets



So far I'd say I'm around 40-50% through the work on the dynarec engine.

Now for some stats :) The following table compares the framerates at various points (previous framerate is for the R4 release of Daedalus, current framerate is for my most recent development build):














ScenePrevious Framerate (Hz)Current Framerate (Hz)
Mario Head36
Mario Main Menu1425
Mario Peach Letter6-711
Mario Flyby (under bridge)610
Mario In Game5-69
Mario Kart Nintendo logo1023
Mario Kart Flag611
Mario Kart Menu711
Zelda Nintendo Logo2023
Zelda Start Menu2-34
Zelda Main Menu1013


Overall I'd say the dynarec is currently achieving up to a 100% speedup in the roms I've tested, which I'm very excited about. Mario is certainly starting to feel a lot more playable, and the Mario Kart menus are a lot more responsive now.

I specifically included Zelda in the results because I'm not seeing the same kind of results there, so I need to take a closer look at what's going on there (it's quite possible it's just using a few of the arithmetic and logical ops I've not spent time optimising yet).

A twofold improvement in framerate is pretty good, but I now think I can do a lot better. Here's the list of things I currently have on my 'TODO' list:


  • Fully implement all the remaining integer ops (including all the 64 bit instructions)

  • Finalise implementation of the fast linear scan register allocation algorithm

  • Keep track of 'known' values for specific registers and use this to optimise the generated code (e.g. most of the time the top half of the N64's 64 bit registers is just sign extended from the lower half)

  • Cache the memory location pointer to by the N64 stack pointer (SP) and optimise load/stores using this register as a base pointer

  • Optimise all memory access instructions (currently all the cached registers get flushed for all memory accesses other than LW/SW/LWC1 and SWC1)

  • Detect and optimise 'busy wait' loops (e.g. many roms sit in a tight loop waiting for the next vertical blank interrupt to fire which is just wasting cycles on the PSP)

  • Implement all the branching instructions (I've currently only implemented BNE, BEQ, BLEZ and BGTZ)

  • Implement instructions and register caching for all the cop1 (floating point coprocessor) instructions. (I think this will give a huge speedup.)



Although the list is quite short, there's quite a lot of work there. What I'm quite excited about is that I think these changes will start to provide significant speedups as they're implemented. I don't want to get too far ahead of myself, but I'm starting to feel that certain roms are going to be very playable in the not too distant future.

I'm going to try and release a new version of the emulator soon. Unfortunately it's probably not going to be this weekend (due to various social commitments); towards the end of the following week is more likely. I'd certainly like to get a version released before the World Cup starts and all my free time is taken up watching football :)

-StrmnNrmn

Tuesday, May 23, 2006

A few more details

Laxer3A (of SnesTYL fame) made an insightful point in the comments page of the previous post (I hope he doesn't mind me quoting him):

What you are generating actually seems more like "threaded code" to me than actually a real dynarec.


I entirely agree with his description "threaded code" for the way I'm approaching the new dynarec engine. I want to stress tha this is just the early stages. The intention is to develop a stable platform from which I can incrementally move towards what would be considered more conventional dynamic code generation. I want (and now have) a nice solid platform from which I can implement things in chunks of 2-3 hours, which I can easily fit into an evening after work.

So I'm currently in the process of replacing generic calls to handle individual opcodes with specialised code which avoids the function call overhead and can hardcode register locations and immediate values. I can also optimise certain things very well. As an example there is no 'mov' instruction on the MIPS architecture so you quite often see ops like 'or a0, a1, r0' (which just ORs the value in a1 with 0 and stores the result in a0). I can write the code to handle OR to simplify this particular situation by just directly copying the contents of register a1 to a0, and therefore avoid the logical operation altogether.

As another example, here's the code I wrote last night to handle LUI (the operation to load a 16 bit value into the upper half of the low-32 bits of the N64 register):


LUI( PspReg_T0, immediate );
SetVar( &gGPR[rt]._u32[0], PspReg_T0 );

if (immediate >= 0)
{
SetVar( &gGPR[rt]._u32[1], PspReg_R0 ); // Clear tops bits
}
else
{
SRA( PspReg_T1, PspReg_T0, 0x1f );
SetVar( &gGPR[rt]._u32[1], PspReg_T1 ); // Set top bits
}


In the code I talked about on Sunday, handling this instruction would involve 13 ops: 2 to load the value of the opcode into the argument register, and another to call the 10 op-long function 'R4300_LUI' (Daedalus's instruction handler for LUI).

With the code above this is reduced to 4 ops in the worst case (if the immediate value is negative), or just 3 ops if the value is positive. Also, there is no branching. To give a speicifc example, this N64 opcode:

LUI at, 0x8034

now causes this PSP code to be generated:


LUI t0, 0x8034
SW t0, 8(s0) ; s0 points to the emulated register set
SRA t1, t0, 0x1f
SW t1, 12(s0)


My intention is to spend the next few days reimplementing the most commonly used opcodes in this way. By that point I think the major overhead will shift from the cost of all of the function calls to the generic handlers to the cost of storing loading the emulated registers each time they're referenced (you'll notice in the snippet above I call SW twice - once for each half of the 64 bit N64 register.)

From previous experience, register caching is where the real speedups come from with dynamic recompilation. Memory accesses are typically an order of magnitude slower than register access so anything I can do to avoid them in the recompiled code will be a huge improvement.

If anyone is curious, I've been reading these two papers on fast register allocation for dynamic code generation:

Linear Scan Register Allocation - Poletto, Sarkar. 1999
A fast, memory-efficient register allocation framework for embedded systems - Thammanur, Pande. 2004

-StrmnNrmn

[Edit 23:40 - fix markup]

Sunday, May 21, 2006

DynaRec status

Just a quick update on the dynarec status, as I know a lot of people are more interested in this than the grizly details of branch delay instructions :)

Last weekend (13/14 May) I managed to assemble the fragment buffers into native x86 code, and execute this dynamically. I spent some time debating whether to target MIPS or Intel initially, but I decided that it would be a lot easier for me to debug the code generation on the PC than it would be to debug code gen on the PSP.

In the end I'm glad I started with the PC as it allowed me to fix a number of hairy problems without going down the torturous path of debugging self modifying code on the PSP with just a few printf() statements to help track down any problems.

To start with on the x86 code generation, all I did was convert my fragment simulator loop directly into assembly. So the generated code for each instruction in the fragment looked something like this:


set current pc
set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler



So a fragment with 100 instructions in would have this block repeated 100 times (with different pc, delay flag constants etc).

This generated a lot of assembly (i.e. 200KB or so of N64 instructions would generate over 4MB of x86 assembly, i.e. an expansion of around 2000%). At this point I wasn't interested in performance though - I wanted to make sure that I was preserving the behaviour of the fragment simulator as much as possible. The exception and branch handlers mentioned in the pseudo code above warrant more detailed description, but I'll leave that for another post.

At this point I spent a few hours debugging, but generally everything was working pretty well. The emulator was currently running around the same speed with 'dynarec' enabled as with it disabled (I use quotes because there isn't really much 'dynamic' about this code yet).

I spent the rest of the weekend and the early part of last week trying to optimise the generated assembly and see what I could get away with removing. One of the first things to go was setting the program counter before executing each instruction. The only instructions that need to know this tend to be branching instructions (which are relative and need to know the current PC to work out their target address) and other instructions that can potentially fire exceptions. The vast majority of instructions don't need to know the current PC though (e.g. arithmetic ops, logical ops etc).

Next I had a look at reworking things so I only needed to explicitly set the branch delay flag if a branch delay slot was actually active. I made the precondition that the branch delay slot was always clear, and explicitly set/cleared it when I knew the state needed to change.

Finally I removed exception handling from all the instructions I knew to be safe. For instance, I know ANDI (and immediate) can never throw an exception. As I only perform counter updates at the end of the block, an exception can never be fired when executing this instruction.

After all these changes I had an instruction execution block which looked something like this:


if ( pc needed ) set current pc
if ( branch delay instruction )set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( can throw exception and exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler


This meant that the vast majority of instructions looked as follows:


get op code in ECX
call handler for specified op code (from R4300.cpp)


So I had nice big fragments like this:


...
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
...


Essentially I removed the vast majority of the instruction fetch/decoding overhead from the emulation.

With this version of the dynarec, 200KB of N64 code was now generating just 2MB of x86 assembly (i.e. an expansion ratio of around 1000%). The PC version was running around 60% faster with dynarec enabled than with it disabled, which is a pretty significant speedup (although this is still very early in the process).

What's also important is that this is before I've done any real optimisation of the generated code. For each instruction I'm still calling the generic instruction handler which has the overhead of figuring out which source registers to use, which register is the destination etc. The *real* speedup comes from generating code to handle op codes explicitly, as you remove all this decoding overhead along with the overhead of jumping to another function. Once you've removed most of the generic instruction handling you can start looking at caching register values to minimise the amount of memory that's being moved around.

With the PC version up and running fairly successfully, I've spent this weekend getting the PSP code generation working. I don't want to go into too many details (as I want to go into more depth in future posts), but I know people are keen to hear some news about how this is going.

I got the basic code generation working on Saturday morning (thankfully I'd already resolved most of the tricky issues in developing the x86 version the previous weekend). I spent most of Saturday afternoon fixing some really horrible instruction cache related bugs. I'm still not 100% sure I've fixed them, but it seems very stable at the moment. At the moment I'm at the same stage with the PSP version of the dynarec that I was with the PC version last weekend - the code generation is running fine (and executing on the PSP without crashing more importantly :) but I've only just started looking at optimising things. It's still too early to speculate on numbers for the performance improvement it will give. Currently it's running around 10% faster with dynarec enabled, but it's still very early days.

More soon.

-StrmnNrmn

Branch delay instructions

Well it's been a little longer than I intended since my last post. I've been busy with work, and when I've had free time I've found it hard to drag myself away from the compiler for long enough to post an update :)

In my last post on the subject, I left saying that I was trying to iron out bugs in the fragment 'simulator' and was about to start work on native code generation. I want to pick up from that point, but first I wanted to talk about about branch delay instructions and how they effect the implementation of emulators and the new DynaRec work I've been doing.

The couple of bugs in the fragment simulator were indeed to do with exceptions and interrupts triggering in unexpected places. The problem occurred if an exception was triggered in the branch delay slot following a jump instruction.

In case you're not familiar with the MIPS architecture, after branch instruction has been executed (but before control arrives at the target instruction), the CPU executes the instruction immediately following the branch. So to give an example, this is a block of mips code that calls the function foo( 6, 3 ):


li a0, 6 # load the first argument
jal foo # call foo
li a1, 3 # load the second argument
# the result is returned in v0 here
...

foo:
jr ra # return
add v0, a0, a1 # add the arguments, store result in v0




If you've never read MIPS assembly before this will probably look a little strange.
When we call foo with 'jal foo' (Jump and Link), we don't set the second argument until after the jump to 'foo'! Notice also that we calculate the return value for foo after we return with the Jump Register (JR) instruction.

The reason this code works is because of the branch delay slots. When the call to foo is executed ('jal foo') the CPU keeps going for one more instruction and executes 'li a1, 3'. Control then jumps to the start of 'foo', where the CPU immediately executes 'jr ra', jumping back to where it just came from. Again, the CPU executes the following instruction, calculating the sum of the arguments and returning the result in v0.

Although this seems rather pointless and unnecessarily complicated, it serves a very good purpose. With most modern CPUs the instruction execution is pipelined, which means that the processor breaks down the work into discrete chunks (fetch instruction, decode instruction, execute, commit etc), and executes multiple instructions in parallel. Wikipedia (as always) explains this in a lot more detail.

With many architectures the CPU has to throw away the contents of the pipeline when a branch is executed, as the subsequent instructions may refer to data (register or memory contents) that is invalid until after the call has completed. With certain architectures (including MIPS), the engineers decided that it was wasteful to throw away all this work on every branch, and designed the processor to continue processing the pipeline until the subsequent instruction had completed.

What this means is that when writing code for MIPS processors, you have to be careful that your branch delay slot doesn't have any unintended side effects. For maximum efficiency you also have to be careful to try and do useful work in the branch delay slot (rather than just filling it with a NOP for instance). Normally this isn't an issue as the compiler generates all the code for you, but it's certainly an issue if you're writing assembly. It's also been a very important consideration when I've been writing the code generation for the new DynaRec (I'll cover this in a later post).

So, how do branch delay instructions effect emulators? Although they help pipelined CPUs to improve performance, they require emulators to do bit of extra bookeeping and this slows things down slightly and adds complexity. Every time daedalus interprets an instruction it has to check if a branch was taken, and if so set a flag to indicate that a branch delay instruction is due to be executed. When the branch delay instruction is executed, the flag is cleared and the emulator sets the program counter to the original target of the branch.

This is all fairly straight forward, but complications arise when exceptions fire as a branch delay slot is due to be executed. In the example above, what would happen if an interrupt fired immediately before the branch delay instruction 'li a1, 3' was executed? Normally once the interrupt has been handled, the operating system restores control by jumping to the instruction that was due to be executed, allowing the program to run from that point. If this happened in our example above, a1 would be loaded with the value of 3, but the jump to 'foo' that was originally delayed will never take place. The code would continue running without ever calling 'foo'!

In order to handle this situation, when an exception (or interrupt) is triggered, the CPU sets a flag in the 'cause' register. The operating system keeps track of this flag, along with the program counter where the exception fired and various other bits of information. When it's done handling the exception, the operating system uses this information to allow the CPU to correctly restore control to the code that was originally executing. In our example above, the CPU would see that the branch delay flag is set in the cause register, and restore control to the jal instruction immediately preceeding the instruction where the interrupt occured.

It should be fairly clear that there's quite a lot that must be taken care of by the emulator to make sure that the program executes as intended. This is all fairly simple to keep track of when processing instructions individually, but it becomes more complicated when you start to dynamically recompile the code.

The bugs that I mentioned at the start of this post were caused because I wasn't correctly setting the branch delay flag for instructions causing exceptions in branch delay slots. When I build fragments for the dynamic recompiler I avoid triggering certain types of interrupts (e.g. vertical blank and timer interrupts) in the middle of the fragment to reduce the overhead of having to add the code to handle these situations. Unfortunately there are many other types of exceptions that can occur in the middle of a fragment, such as page faults, I/O completion interrupts etc. It turns out these are very rare, but unfortunately not rare enough to save me from a full day of debugging :(

This was a little more detailed than I was originally planning. Branch delay instructions are quite a simple concept on the surface, but they can cause all sorts of complexities when it comes to dynamic recompilation. Fortunately I feel very confident that I've now fixed all the branch delay related bugs in the dynamic recompiler, so hopefully I shouldn't have to think too much about them again in the near future.

I'll follow up this post with a quick update on the state of the new dynarec engine.

-StrmnNrmn

Wikipedia entry on branch delay slots

Saturday, May 13, 2006

Monkey 64 v2.0

I turn my back for two days and PSMonkey has released v2.0 of his emulator! Congrats PSMonkey - keep up the good work!

Wednesday, May 10, 2006

Dynamic Recompilation Progress

kekpsp asked:

Is it possible to emulate a system like the N64 with the system limitations that the PSP poses?


When I first decided to port Daedalus over to the PSP I really didn't know the answer to this. I knew there were some substantial challenges - I'd ported Daedalus to the Xbox a couple of years earlier and quickly discovered that even with 64MB you really didn't have much room to manoeuvre. With the PSP you have even tighter memory constraints (24MB user memory + 2MB vram), a slower processor and gpu.

I think I've pretty much cracked the memory problem. When I added in the rom streaming code I reduced the memory usage for an 8MB rom image down to around 2MB. Larger roms only use fractionally more ram (i.e. a few 100KB or so), so I've managed to free up around another 6MB to use for textures, audio and most importantly the dynarec engine.

The next big challenge is speed. Currently Daedalus is unusably slow - typically 4-5fps max (although there are some roms that freakishly seem to run faster). Dynarec is going to bring about the biggest gains here, but it's too early for me to tell how much of an improvement it's going to bring on the PSP in the long run.

This is probably a good point to give a bit of a progress update on the new dynamic recompiler. I'm at the state where I'm successfully capturing 'hot-traces' from the rom as it runs. In order to work the bugs out of the system, I'm then simulating the execution of these traces to see whether everything is working as expected. It also lets me collect a few stats like how many instructions will end up being executed through the native fragment cache rather than being interpreted, and roughly how much memory is going to be consumed.

The results are looking very encouraging. Firstly, even though I'm not actually executing any native code yet, the emulator runs almost as quickly with the 'simulated' dynarec enabled as it does running entirely through the interpreter. Although this sounds a bit of a backwards step, it's actually quite significant because it means the dynarec engine itself isn't any substantial load to the CPU. I'm hoping this means that when I am actually executing native code, the dynarec engine will only be using a fractional part of the CPU.

The other significant result is that you don't actually need to recompile much code to get a sizable portion of the rom executing natively. In my tests with Mario, typically around 90% of the instructions executed are going through the fragment cache rather than the interpreter. Importantly this is with only around 64,000 instructions in 700-1000 fragments. I think this will mean I'll be able to get away with a 1-2MB code buffer on the PSP.

At the moment I'm still ironing out a couple of bugs with the fragment 'simulator' (mostly to do with exceptions and interrupts occuring in the middle of a fragment). Once that's complete I'm going to start taking a look at taking a few small steps towards generating native code. I'll go over this in more detail in my next few posts.