Wednesday, June 27, 2007

Daedalus PSP R12 Released

I've just finished uploading the latest build of Daedalus PSP:

Daedalus PSP R12 for v1.00 Firmware
Daedalus PSP R12 for v1.50+ Firmware
Daedalus PSP R12 Source

As usual it will take 20-30 minutes for the files to propagate across the Sourceforge mirrors, so please be patient :)

Here is the list of changes:


  • [!] Fixed issue preventing Goldeneye from being loaded.
  • [!] Fixed dynarec for Goldeneye.
  • [!] Fixed dynarec for Super Smash Bros.
  • [!] Fixed various texturing issues with 4bpp and small or non power-of-2 textures.
  • [!] Fixed TexRect instructions with negative s/t components.
  • [!] Fixed the HUD in Mario 64 (broken in R11.)
  • [!] Fixed lights in F3DEX2 microcodes.
  • [+] Correctly implement instruction fetch exceptions, improving compatibility.
  • [+] Improved floating point compatibility.
  • [+] Correctly handle mask_s/mask_t tile values.
  • [+] Implemented a few custom blend modes.
  • [+] Screenshots just cover visible viewport.


If you've been following the updates on this blog over the past month, the most obvious change is that Super Smash Bros. is now running well with dynarec enabled, and many graphical glitches have now been resolved. The compatibility fixes were specifically aimed at Super Smash Bros. but may well fix issues with other roms too. Overall SSB is looking and playing much better than it was in R11, but even at 30-40fps it's still not running at full speed. There are a few graphical issues that still need resolving, but all in all it's starting to feel very playable with frameskip set to 1 or 2.

Goldeneye is also running in R12. Although the intro sequence is running very quickly and with few noticeable graphical issues, a lot more work is needed to get it running at a playable framerate in-game. I think it's a good start though, and something to get excited about for the future :)

Otherwise R12 just has a few minor compatibility and graphical fixes - there are no optimisations for this build.

As always, leave your feedback on the comments pages. I read all your comments and I'll do my best to reply to any questions you raise. I'm particularly interested to hear if any roms which were broken in previous releases are now running in R12.

Enjoy!

-StrmnNrmn

Saturday, June 16, 2007

Multiplayer Thoughts

In response to a recent post, Zeus asked an interesting question:


I know you've probably been bugged by people on this before, but how hard would multi-player be to implement? More importantly, do you think that there is enough bandwidth, and low enough lag, to allow a host-client multiplayer setup to be playable? (ie, with one psp "hosting" the game and doing the emulation, while the other(s) just receive screen-captures and send back user-input) Or do you think that distributing the computational workload would be a better approach? (at the very minimum, the audio processor shouldn't be too horrible to move to the client psp(s) )


I have thought about multiplayer a great deal, but I've never made any plans to work on it - there didn't seem to be much point getting multiplayer working before there were a few multiplayer games running quickly and glitch free. Now that MarioKart and Super Smash Bros. are both working reasonably well, there's obviously going to be a lot more demand for multiplayer support. Before I raise expectations and get anyone's hopes up, I should mention that this isn't likely to happen any time soon.

It would be possible to go down the route of having a host psp which performs emulation and broadcasts the screen to client psps. Sony's Remote Play between the PS3 and PSP shows that this kind of 'dumb terminal' approach can work in the right situations. That's with the PS3 doing the grunt work of compressing the framebuffer and sending it over to the PSP. For a PSP running Daedalus as a host, I'm not quite sure there would be enough spare horsepower to compress the framebuffer and audio and then send it to 1 or more connected clients.

Also, I don't think that it would be possible to decouple the audio processing from the main cpu thread so that this work could be distributed to one of the clients. Although the audio and graphics processing is notionally run in parallel on the RSP (the N64's coprocessor), access is still serialised between audio and graphics tasks so they have to be completed in order. Performing audio processing on a client PSP would just mean that graphics processing on the host would have to wait until the results of the audio processing were received.

The approach that I'd been considering was running Daedalus in lockstep across 2 or more connected PSPs. As I mentioned previously, Daedalus can run deterministically if external inputs such as pad input and timing sources are synchronised. This would mean that every time the rom queried the pad status, each connected psp would have to synchronise its view of the pad input with the host, sending just 64 bytes across the network from the client to the host and back. This information would have to be sent over a TCP connection rather than UDP as we have to ensure that every PSP sees the exact same input (actually, client->host communication is a bit less critical so it may be possible to transmit this information over UDP if it helps improve lag.)
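As a rough illustration of the lockstep idea (not Daedalus code), the sketch below replaces the network with plain function calls. PadState, LockstepInstance, MergeInputs and LockstepPadQuery are all hypothetical names, and the pad layout is an assumption rather than the real 64-byte format. The point it shows is that if every instance consumes the same merged snapshot at each pad query, their states cannot drift apart.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pad snapshot for the four N64 controller ports.
struct PadState
{
    std::array<std::uint32_t, 4> buttons{};
};

// Stand-in for a deterministic emulator: its state depends only on the
// sequence of pad snapshots it has been given.
class LockstepInstance
{
public:
    void OnPadQuery( const PadState & merged )
    {
        for( std::uint32_t b : merged.buttons )
            mState = mState * 31u + b;
    }
    std::uint32_t State() const { return mState; }

private:
    std::uint32_t mState = 1;
};

// Host side: combine each PSP's local input into one authoritative snapshot.
// locals[i] is the input read on PSP i, which drives controller port i.
inline PadState MergeInputs( const std::vector<std::uint32_t> & locals )
{
    PadState merged;
    for( std::size_t i = 0; i < locals.size() && i < merged.buttons.size(); ++i )
        merged.buttons[i] = locals[i];
    return merged;
}

// One pad query: clients send their local input to the host, the host merges
// it and sends the result back, and every instance consumes the same snapshot.
inline void LockstepPadQuery( std::vector<LockstepInstance> & psps,
                              const std::vector<std::uint32_t> & locals )
{
    const PadState merged = MergeInputs( locals );
    for( LockstepInstance & psp : psps )
        psp.OnPadQuery( merged );
}
```

In a real implementation the call to MergeInputs would sit behind a network round trip, which is where the lag concerns above come in.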

One really cool feature about this approach is that as each PSP would be responsible for rendering its own display, it would be possible to scale up each viewport to fill the display; rather than playing 4-player Mario Kart or Goldeneye at 160x120, you'd be able to play at 320x240 (or 480x272 if you wanted to scale it up to entirely fill the PSP's screen).

As I said at the start of the post, this isn't something that's likely to happen any time soon, but if and when it does happen, it will be amazing :)

-StrmnNrmn

R12 Release Date

A number of people have been asking in the comments when R12 is going to be released.

There are still a number of things I want to work on. Now that Super Smash Bros. is running nice and quickly with dynarec enabled, I want to spend a week or so polishing the graphics and trying to make it as playable as possible. Although Goldeneye is running with dynarec in R12, it still needs a lot of work before it's playable, so I'm not going to spend any more time on it for R12.

I'm going on holiday at the end of June, so I'd like to have R12 released before then. I'll aim for next weekend (23rd/24th June) but it may end up being as late as the 26th/27th.

-StrmnNrmn

Thursday, June 14, 2007

Tracking down the SSB Dynarec Bug - Part 2

On Monday I talked about the fragment simulator and how this could be used to help track down bugs in the dynarec implementation. In this post I'm going to talk a bit about a tool I use mostly for regression testing, but also to help determine the exact point at which the fragment simulator and the interpretative core go out of sync. It's a bit of a long post, so apologies in advance :)

Daedalus can be compiled with a flag which enables a special 'synchronisation' mode. This build configuration creates an instance of a synchronisation class which can be initialised in one of two modes - either as a producer or as a consumer. At various points during program execution I pass information about the internal state of the emulator to the synchroniser for processing. In the case of the producer, it simply writes this data out to a file on disk. The consumer is a bit more interesting; it reads data of the required size from disk, and compares this 'baseline' value against the value provided by the emulator. If these two values are found to be different, the synchroniser knows that things have drifted out of sync and it can trigger a breakpoint and drop out into the debugger.

This technique relies on the fact that the emulator is deterministic, i.e. running the emulator twice in a row with the same inputs generates exactly the same results. By 'inputs' this means not just the same rom image, but external inputs such as data from the controller must match exactly too. Obviously pressing buttons on the controller in exactly the same order with the same timings would be impossible to duplicate, so the other function the synchroniser performs is to record input from the pad in the case of the producer, or play input back in the case of the consumer. Other external input, such as calls to timer functions (e.g. time(), QueryPerformanceCounter() or rdtsc) can be synchronised in the same way.
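The producer/consumer behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the Daedalus class: the name Synchroniser, the Mode enum and the SyncInput method are hypothetical, and a std::iostream stands in for the file on disk so the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>

// Hypothetical stand-in for the synchroniser. A real build would stream to a
// file on disk; any std::iostream works for the purposes of this sketch.
class Synchroniser
{
public:
    enum class Mode { Producer, Consumer };

    Synchroniser( Mode mode, std::iostream & stream )
        : mMode( mode ), mStream( stream ) {}

    // Producer: append the value to the baseline and report success.
    // Consumer: read the baseline value and compare; false means out of sync.
    bool SynchPoint( std::uint32_t value )
    {
        if( mMode == Mode::Producer )
        {
            Write( value );
            return true;
        }
        std::uint32_t baseline = 0;
        if( !Read( baseline ) )
            return false;
        return baseline == value;
    }

    // External inputs (pad state, timers) are recorded by the producer and
    // played back by the consumer, which ignores the live value.
    std::uint32_t SyncInput( std::uint32_t live_value )
    {
        if( mMode == Mode::Producer )
        {
            Write( live_value );
            return live_value;
        }
        std::uint32_t recorded = 0;
        Read( recorded );
        return recorded;
    }

private:
    void Write( std::uint32_t v )
    {
        mStream.write( reinterpret_cast<const char *>( &v ), sizeof v );
    }
    bool Read( std::uint32_t & v )
    {
        mStream.read( reinterpret_cast<char *>( &v ), sizeof v );
        return static_cast<bool>( mStream );
    }

    Mode           mMode;
    std::iostream & mStream;
};
```

A caller would treat a false return from SynchPoint in consumer mode as the cue to trigger a breakpoint.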

The synchroniser works with as few or as many sync points as you provide. For debugging very simple problems, you can get away with just checking the value of the program counter as each instruction is executed. For more tricky problems you can end up adding many more sync points - for instance you can synchronise the entire register set after every instruction to ensure that the synchroniser catches any instruction which generates a different result from the baseline.

I add sync points to Daedalus using a set of macros. When synchronisation is enabled, the macros expand out to calls to a virtual method on a global instance of the synchroniser class. An example sync point in the code might look like this:


u32 pc = gCPUState.CurrentPC;

// Third argument is the message passed through to the synchroniser
SYNCH_POINT( DAED_SYNC_REG_PC, pc, "Program counter" );

OpCode op;
if( CPU_FetchInstruction( pc, &op ) )
{
    CPU_Execute( pc, op );
}


The interesting line here is the SYNCH_POINT macro, which synchronises on the current program counter value. For producers, this just writes the value of 'pc' to disk. For consumers, it checks that the value we have for 'pc' matches the one read from disk.

The DAED_SYNC_REG_PC argument is simply a flag to describe what is being synchronised. Another global constant allows easy control of what is synchronised:


enum ESynchFlags
{
    DAED_SYNC_NONE        = 0x00000000,

    DAED_SYNC_REG_GPR     = 0x00000001,
    DAED_SYNC_REG_CPU0    = 0x00000002,
    DAED_SYNC_REG_CCR0    = 0x00000004,
    DAED_SYNC_REG_CPU1    = 0x00000008,
    DAED_SYNC_REG_CCR1    = 0x00000010,

    DAED_SYNC_REG_PC      = 0x00000020,
    DAED_SYNC_FRAGMENT_PC = 0x00000040,
};

static const u32 DAED_SYNC_MASK(DAED_SYNC_REG_PC);

// Wrapped in do/while so the macro behaves as a single statement inside if/else
#define SYNCH_POINT( flags, x, msg ) \
    do { \
        if ( DAED_SYNC_MASK & (flags) ) \
            CSynchroniser::SynchPoint( x, msg ); \
    } while( 0 )


If I want to enable more thorough debugging, I can change DAED_SYNC_MASK and OR in more values:


static const u32 DAED_SYNC_MASK(DAED_SYNC_REG_PC|DAED_SYNC_REG_GPR);


Changing the mask value requires the emulator to be rebuilt from scratch and the baseline synch file to be recreated. This is a bit time consuming but doing it in this way means that the compiler can optimise out any synch points which we aren't interested in, keeping things running as quickly as possible.
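To make the compile-time filtering concrete, here is a small self-contained sketch. All the SKETCH_ names, RecordSynchPoint and ExecuteOneInstruction are hypothetical and exist only for this example. Because the mask is a constant, the test in the macro can be folded by the compiler, so a disabled synch point should cost nothing at runtime.

```cpp
#include <cassert>
#include <cstdint>

enum ESynchFlagsSketch : std::uint32_t
{
    SKETCH_SYNC_REG_GPR = 0x00000001,
    SKETCH_SYNC_REG_PC  = 0x00000020,
};

// Compile-time constant: only program counter synch points are enabled.
static constexpr std::uint32_t SKETCH_SYNC_MASK = SKETCH_SYNC_REG_PC;

// Hypothetical sink that simply counts how many synch points actually fire.
static int gSynchPointCount = 0;
inline void RecordSynchPoint( std::uint32_t /*value*/ ) { ++gSynchPointCount; }

// The condition only involves constants, so the compiler is free to remove
// the call entirely for flags that are not in the mask.
#define SKETCH_SYNCH_POINT( flags, x ) \
    do { if( SKETCH_SYNC_MASK & (flags) ) RecordSynchPoint( x ); } while( 0 )

inline void ExecuteOneInstruction( std::uint32_t pc, std::uint32_t gpr_hash )
{
    SKETCH_SYNCH_POINT( SKETCH_SYNC_REG_PC, pc );         // enabled by the mask
    SKETCH_SYNCH_POINT( SKETCH_SYNC_REG_GPR, gpr_hash );  // masked out
}
```

Adding SKETCH_SYNC_REG_GPR to the mask would make both points fire, at the cost of a rebuild and a new baseline.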

One problem with this technique is that the synchroniser can quickly generate a massive amount of data, so much that most of the execution time is spent shifting this data to or from disk, slowing debugging to a crawl. In the example I gave on Monday, it can sometimes take over 500 million instructions before things go out of sync. Even when just synchronising on the 32bit program counter, that's around 2GB of data that needs to be read/written to disk. When you throw in more sync points such as register sets (the GPR registers on their own are around 256 bytes) this can very quickly become impractical. To get around these limitations in Daedalus I gzip the stream of data on the fly which compresses the data significantly. Another trick I use is to hash each register set to a 32bit value and synchronise on this value instead. When using both these techniques the sync files typically end up around 100-200MiB, which is much more manageable.
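The arithmetic and the hashing trick can be sketched as below. The post doesn't say which hash function is used, so 32-bit FNV-1a is an assumption chosen purely for illustration; GprSet, HashRegisters and SyncFileBytes are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 32 general purpose registers of 64 bits each: 256 bytes per snapshot.
using GprSet = std::array<std::uint64_t, 32>;
static_assert( sizeof( GprSet ) == 256, "expected a 256 byte register set" );

// 32-bit FNV-1a over the register bytes (an illustrative choice of hash,
// not necessarily the one Daedalus uses).
inline std::uint32_t HashRegisters( const GprSet & gpr )
{
    std::uint32_t hash = 2166136261u;              // FNV offset basis
    for( std::uint64_t reg : gpr )
    {
        for( int byte = 0; byte < 8; ++byte )
        {
            hash ^= static_cast<std::uint32_t>( ( reg >> ( byte * 8 ) ) & 0xffu );
            hash *= 16777619u;                     // FNV prime
        }
    }
    return hash;
}

// Uncompressed size of a sync file with one synch point per instruction.
inline std::uint64_t SyncFileBytes( std::uint64_t instructions,
                                    std::uint64_t bytes_per_point )
{
    return instructions * bytes_per_point;
}
```

For 500 million instructions, a 4 byte value per instruction comes to 2,000,000,000 bytes, while writing the full 256 byte register set each time would come to 128,000,000,000 bytes before compression. The trade-off is that a hash tells you that the registers differ, but not which one.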

One of the main uses of this synchronisation code is for regression testing optimisations I've made. I can take a 'known good' build of the emulator and initialise the synchronisation class as a producer to generate a baseline sync file. I can then take a modified version of Daedalus with the optimisations that I want to test, and initialise the synchroniser as a consumer. If the synchroniser detects that things have gone out of sync, then I know that my changes are buggy, and I can investigate why they're not working as planned. It's worth noting that even if everything stays in sync, this isn't a guarantee that my changes are bug-free, but it's a pretty good indication that they're ok.

I also use the synchronisation code to debug tricky dynarec issues. When debugging these types of problems I typically start off by disabling the dynarec engine and setting up the synchroniser to produce a baseline for testing. I'll then re-enable dynarec, but using the fragment simulator with precise interrupt handling (see the end of Monday's post for more on this) and run Daedalus with the synchroniser in consumer mode. Theoretically, as soon as the dynarec code gets out of sync with the interpretative core, the breakpoint triggers and I can investigate things more closely in the debugger.

This is exactly the process I used to track down the Super Smash Bros. bug. When I ran the emulator with the synchroniser in consumer mode, it detected that the program counter was different from the expected baseline value after exactly 387,939,387 instructions had been executed. I'd like to think that an error rate of 2.57e-7% wasn't all that bad, but apparently it is :)

Now that I knew the point at which the emulator was going out of synch, I set a few breakpoints in the emulator to see what exactly was happening. My usual trick is to disassemble the executed instructions just before and after things diverge, and see what's different. Here are snippets from the 'good' and 'bad' logs as things go out of sync:


Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: JAL 0x80131fb0 ?
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80131fb0: ADDIU sp = sp + 0xffd8
Count 171f7c3c: PC: 80131fb4: SW ra -> 0x0024(sp)
Count 171f7c3d: PC: 80131fb8: SW s0 -> 0x0020(sp)
Count 171f7c3e: PC: 80131fbc: CLEAR a0 = 0
Count 171f7c3f: PC: 80131fc0: CLEAR a1 = 0



Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: MTC1 at -> FP06
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80132af0: SWC1 FP06 -> 0x0018(a0)
Count 171f7c3c: PC: 80132af4: LBU v0 <- 0x4ad1(v0)
Count 171f7c3d: PC: 80132af8: ADDIU at = r0 + 0x0008
Count 171f7c3e: PC: 80132afc: BEQ v0 == at --> 0x80132b24
Count 171f7c3f: PC: 80132b00: ADDIU at = r0 + 0x0009


The synchroniser detected that the PCs were out of sync at count 171f7c3b. In the good trace (top) the PC is 0x80131fb0, but in the bad trace it's 0x80132af0. If you have particularly sharp eyes, you'll notice something else - two instructions before the code goes out of sync, the good trace executes a jump instruction to 0x80131fb0, but the bad trace is performing an MTC1 op (Move To Coprocessor 1).

This provides a particularly good example of one of the main weaknesses with the synchroniser - it's only as good as the synch points you set up. Because I was just synching on the program counter, it didn't detect the fact that the emulator executed an entirely different opcode two instructions previously. In this particular case I was fortunate in that the real source of the problem was very close to the location identified by the synchroniser, but sometimes the cause and effect can be separated by many thousands of instructions.
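The effect of choosing different synch points can be shown with a small trace comparison. This is a sketch only: TraceEntry and FirstDivergence are hypothetical names, and the opcode field is assumed to hold the raw instruction word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TraceEntry
{
    std::uint32_t pc;
    std::uint32_t opcode;   // raw instruction word executed at 'pc'
};

// Returns the index of the first entry at which the two traces differ, or the
// length of the shorter trace if no difference is found. With compare_opcode
// set to false only the program counter is checked.
inline std::size_t FirstDivergence( const std::vector<TraceEntry> & good,
                                    const std::vector<TraceEntry> & bad,
                                    bool compare_opcode )
{
    const std::size_t count = good.size() < bad.size() ? good.size() : bad.size();
    for( std::size_t i = 0; i < count; ++i )
    {
        if( good[i].pc != bad[i].pc )
            return i;
        if( compare_opcode && good[i].opcode != bad[i].opcode )
            return i;
    }
    return count;
}
```

Fed the two traces above, a PC-only comparison reports the divergence two entries later than a comparison that also checks the opcode, which is the gap between effect and cause in this particular bug.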

Fortunately it's easy enough to add new synch points in the code to detect issues like this, but adding too many synch points causes the emulator to slow to a crawl and makes debugging impractical. I've found the best approach is to start off with as few synch points defined as possible (ideally just the program counter) and slowly introduce more synch points as required. This is all very easy to do using the DAED_SYNC_MASK flag discussed above.

Getting back to SSB, it looked like I had found the root cause of the problem - somehow the rom was replacing the instructions in memory, essentially a form of self-modifying code (it's more likely it was just loading a new section of code into RAM from ROM, but it's still essentially self-modifying). The dynarec system was oblivious to these changes and so it ended up trying to execute stale instructions that it had cached when creating the fragment, potentially many thousands of cycles ago.

Dealing with self modifying code in dynamic code generators is generally very tricky. In Daedalus I've been relying on the fact that most roms are well-behaved and flush the instruction cache when they modify memory containing executable code. When I detect an instruction cache invalidate (through the MIPS CACHE opcode) I simply dump the entire contents of the fragment cache and start from scratch. This might sound a little heavy handed, but the way that I link fragments together makes it very hard to unlink small sections of code that have been invalidated. Flushing the cache is very quick, safe and has a few advantages such as purging cold traces that are no longer being executed.
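The flush-everything policy can be sketched in a few lines. This is an illustration under assumptions rather than the real fragment cache: Fragment, FragmentCacheSketch and its methods are hypothetical, and fragment linking is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical fragment: a copy of the instructions recorded for one trace.
struct Fragment
{
    std::vector<std::uint32_t> cached_opcodes;
};

class FragmentCacheSketch
{
public:
    void Insert( std::uint32_t entry_pc, Fragment fragment )
    {
        mFragments[entry_pc] = std::move( fragment );
    }

    // Returns the fragment starting at 'pc', or nullptr if none is cached.
    const Fragment * Lookup( std::uint32_t pc ) const
    {
        auto it = mFragments.find( pc );
        return it != mFragments.end() ? &it->second : nullptr;
    }

    // Called when the emulated CPU executes an instruction cache invalidate.
    // Rather than unlinking individual fragments, everything is discarded.
    void OnInstructionCacheInvalidate()
    {
        mFragments.clear();
    }

    std::size_t Size() const { return mFragments.size(); }

private:
    std::unordered_map<std::uint32_t, Fragment> mFragments;
};
```

The weakness of this scheme is exactly the one described next: it only works if the rom actually issues the invalidate.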

Ironically, the reason the dynarec was failing to cope with SSB wasn't due to a bug in Daedalus - it was due to a bug in SSB that just never happened to be a problem on a real N64. After updating memory with the new instructions SSB should have been invalidating the instruction cache to ensure that it didn't contain stale code, but for whatever reason it failed to do this. The only reason the rom runs correctly on a real N64 is that by the time it comes to execute the modified instructions, the instruction cache has been refilled a number of times and so the stale instructions are no longer cached.

Even though this isn't Daedalus's bug, it still needs to work around the problem. I'll leave this discussion for a future post though - this one is long enough as it is :)

-StrmnNrmn

Tuesday, June 12, 2007

Tracking down the SSB Dynarec Bug

Yesterday I said I'd provide some more details about the Super Smash Bros. dynarec fix. The actual fix is fairly straightforward, but I thought the process of tracking down the issue was quite interesting and worthy of a couple of blog posts.

When I first started looking at SSB I noted that although the game ran fine without dynarec, it would always hang when trying to enter the main menu with dynarec enabled.

I've been programming professionally for around 6 years now and I can safely say that debugging dynarec bugs is one of the hardest categories of problems I've ever had to work on. For a start, because the code is generated on the fly, you don't have the luxury of source level debugging, and without spending time reverse engineering the original rom image, you don't even know what the generated dynarec code is meant to be doing. It's very much like working blindfolded.

And it gets even worse. I've fixed dynarec problems in the past which were the result of generating incorrect code for a fragment over 500 million instructions into emulation. This would be bad enough, but it can be many thousands of instructions later before this causes emulation to finally diverge from the correct path. Just identifying the exact point at which the emulation starts to diverge from the correct sequence of instructions can be like finding a needle in a particularly large haystack. While blindfolded :)

Over the years of trying to debug problems like these I've built up a set of tools and learned a few tricks along the way which you might find quite interesting. Although I'm going to talk about them in the context of tracking down this dynarec issue, I've found some of the techniques useful in solving other problems so you might find other ways of applying them too.

One of the first things I do when trying to identify a dynarec issue with Daedalus is to see if the problem is reproducible on the PC build of the emulator. Although it is possible to use GDB with PSPLink, I've never got this up and running and I'm much more comfortable debugging with Visual Studio. Also, working with the PC build is usually much faster than working with the PSP build (debug builds run around 10x faster on the PC, and build times are much quicker.)

Not all dynarec issues can be debugged in this way - the PSP and PC builds have different code generation back-ends (i.e. MIPS and x86 code generation respectively) so bugs in the MIPS code generation won't usually be reproducible in the PC build. The dynarec system in Daedalus shares a common frontend (trace selection and recording) between the two platforms, which means that if I can reproduce the problem on both platforms, I can narrow down the likely location of the bug to this area.

Fortunately this particular bug manifested itself in both the PC and the PSP builds, so I knew that if I fixed the bug on the PC build, it should fix the PSP build too. What I needed to find out next is what the emulator was doing differently when dynarec was enabled compared to when it was disabled.

If dynarec is running without errors, then the sequence of executed instructions should exactly match that executed with dynarec disabled. If I could log details about all the instructions executed with dynarec disabled, and again with dynarec enabled, I should be able to compare the two logs to figure out the exact point at which dynarec is going out of sync. This all relies on the fact that the emulator is totally deterministic, i.e. that running the emulator twice in succession with the same settings should give exactly the same results.

Unfortunately, for a variety of reasons my dynarec solution doesn't produce identical results to interpretation, the main reason being that for performance reasons I can only handle vertical blank and timer interrupts on the boundaries between fragments. For example, with dynarec disabled, the first vertical blank interrupt might occur exactly on the 625,000th instruction, but with dynarec enabled it might not occur until the 625,015th instruction. This means that the logs diverge at the instant the first VBL fires, and never regain synchronisation.
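A simplified model of this timing difference is sketched below. It assumes every fragment has the same fixed length and that the first one starts at instruction 0, which real traces do not; ServicedAt is a hypothetical helper written only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Instruction count at which an interrupt that becomes due at 'due' is
// actually serviced. With precise handling it is serviced immediately; with
// imprecise handling it waits for the next fragment boundary.
// Simplification: fragments are all 'fragment_length' instructions long.
inline std::uint64_t ServicedAt( std::uint64_t due,
                                 std::uint64_t fragment_length,
                                 bool precise )
{
    if( precise || fragment_length == 0 )
        return due;

    const std::uint64_t remainder = due % fragment_length;
    return remainder == 0 ? due : due + ( fragment_length - remainder );
}
```

With 35-instruction fragments, for example, an interrupt due at instruction 625,000 is not serviced until instruction 625,030, so the two instruction logs no longer line up from that point on.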

When I was originally developing the new dynarec system I put a lot of effort into writing a fragment simulator, the idea being that rather than executing the native assembly code for a given trace, I could keep track of the instructions making up the trace and interpret these individually instead. Theoretically fragment simulation is identical to dynarec code execution, even down to the way I handle VBLs and timer interrupts, and it's been very useful at identifying bugs in the dynarec code generation. What's particularly useful about fragment simulation however is that I can enable a setting which makes it handle interrupts exactly in the same way as the non-dynarec core, i.e. interrupts are handled precisely rather than on fragment boundaries.

Essentially Daedalus has four modes of operation:


  • Dynarec + fragment execution
  • Dynarec + fragment simulation (imprecise interrupt handling)
  • Dynarec + fragment simulation (precise interrupt handling)
  • Interpretative core


This tool is particularly powerful, because if I can ensure that dynarec+fragment execution is equivalent to dynarec+fragment simulation, and that dynarec+fragment simulation is equivalent to running the interpretative core, then I can use the transitive properties of these relations to ensure that dynarec+fragment execution is equivalent to running the interpretative core. Fragment simulation allows me to bridge the gap between these two modes of operation which would otherwise be very difficult to compare.

I think that's long enough for one post. Tomorrow I'll talk about how I used this technique to help track down the SSB dynarec bug.

-StrmnNrmn

Sunday, June 10, 2007

Super Smash Bros - Dynarec Update

This is just a quick update to let everyone know I've finally figured out why the dynarec wasn't working in Super Smash Bros. The problem has taken a lot longer to identify than I'd hoped - in part because it was a particularly tricky bug but also because I've not had as much time to work on Daedalus recently as I would have liked.

Anyway, I managed to spend a few hours this weekend isolating the problem, and after a little experimentation I've been able to come up with a temporary workaround. With the fix in place SSB is running at around 30-40fps in game on the PSP, which is very exciting.

Now that I've identified the problem my next job is to come up with a permanent, robust solution to help fix similar problems in other roms. I also want to add some improved checks in the debug build to help spot other situations where this problem arises.

For those that are interested I'll post an update shortly (within the next day or so) with some of the technical details.

-StrmnNrmn