Monday, December 24, 2007

Yuletide Update

It's been a while since the last update so I wanted to give some news on how things are going with work towards the next release.

I've spent a lot of time working on getting the HLE audio code working on the Media Engine. I've been making steady progress, but it's been taking longer than I initially expected. Fortunately, I'm very close to getting all of the audio processing moved over to the ME - in fact I believe I have just one significant bug left to fix.

The issue seems to be a very odd synchronisation bug which causes the emulator to lock up when running the audio processing asynchronously. As with many of these types of bugs, it's proving quite hard to track down because as soon as I change the code to debug the problem, the issue goes away. A true Heisenbug :(

What's particularly annoying is that the bug is stopping me from measuring how much of a difference running the audio code on the ME makes. Hopefully I'll be able to fix the bug over the Christmas break and be able to publish some timings over the new year.

As part of this work, I've also been writing a general-purpose 'job manager', which coordinates batches of work between the main CPU and the ME. The idea is to build on top of J.F.'s MediaEngine.prx to provide a simple interface for queuing up and dispatching work asynchronously. When a job is added to the queue, a flag indicates whether the job is suitable for running on the ME, or whether it should just be run asynchronously on the main CPU instead.
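To make the queueing idea a bit more concrete, here is a minimal Python sketch. It is not the Daedalus code (which is C++ sitting on top of MediaEngine.prx); the names `Job`, `JobManager` and `me_suitable` are hypothetical, and the sketch only records where each job would be sent rather than really running anything on a second processor.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    name: str
    run: Callable[[], None]
    me_suitable: bool  # hypothetical flag: may this job run on the ME?


class JobManager:
    """Toy model: decides where each queued job would be dispatched."""

    def __init__(self, me_available: bool = True):
        self.me_available = me_available
        self.pending = deque()

    def add_job(self, job: Job) -> None:
        self.pending.append(job)

    def dispatch_all(self):
        """Run pending jobs in FIFO order; return (job name, target) pairs."""
        log = []
        while self.pending:
            job = self.pending.popleft()
            on_me = job.me_suitable and self.me_available
            target = "ME" if on_me else "main CPU"
            job.run()  # the sketch simply runs the job inline
            log.append((job.name, target))
        return log
```

A job flagged as ME-suitable falls back to the main CPU when `me_available` is false, which is one plausible way to keep a single code path for both cases.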

Initially just the audio processing will run through the job manager on the ME, but eventually it should be possible to run other pieces of work asynchronously too. I'm hoping that it will eventually be possible to move parts of the HLE graphics processing over as well, but I need to investigate things a bit more first. That's a job for future releases, however.

Anyway, that's all for now. I'm off to eat mince pies and watch The Great Escape on tv. Merry Christmas everyone :)

-StrmnNrmn

Sunday, November 25, 2007

R14 Progress

It's been a while since I talked about R14 so I wanted to post a quick update on what I've been doing.

The Media Engine work has been going well. The job manager I talked about last week is now fairly functional and handles executing the audio upsampling code in 3 different modes: synchronously, asynchronously on the main processor, and asynchronously on the ME.

It's taken me a little longer to get the audio upsampling code working smoothly on the ME. I decided to focus on this initially (rather than Azimer's Audio HLE code) as it's a lot simpler and more self-contained, but getting it working on the ME without any glitches required a little bit of work. I had to rewrite the simple ring buffer I was using to be lock-free. This is straightforward when dealing with a single reader thread and a single writer thread on the same processor, but a little more care is required when the reader and writer are operating on separate cores without cache coherency. I think getting this running glitch-free has helped prepare me well for the bigger task of getting Azimer's HLE code running asynchronously on the ME. I'll be working on this next.
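For readers who haven't met one before, a single-reader/single-writer ring buffer can be sketched as below. This is an illustration in Python, not the Daedalus implementation: the class and field names are made up, and Python cannot model the PSP's caches, so the points where a real ME version would need a cache writeback or invalidate are only marked in comments.

```python
class RingBuffer:
    """Single-producer/single-consumer ring buffer.

    The writer only ever modifies write_idx and the reader only ever
    modifies read_idx, which is what lets the scheme work without a lock.
    One slot is left unused so that 'full' and 'empty' can be told apart.
    """

    def __init__(self, capacity: int):
        self.slots = [0] * (capacity + 1)
        self.read_idx = 0
        self.write_idx = 0

    def push(self, sample: int) -> bool:
        nxt = (self.write_idx + 1) % len(self.slots)
        if nxt == self.read_idx:
            return False  # buffer is full
        self.slots[self.write_idx] = sample
        # On separate cores the sample would need writing back to main
        # memory here, before the new index is published.
        self.write_idx = nxt
        return True

    def pop(self):
        if self.read_idx == self.write_idx:
            return None  # buffer is empty
        # On separate cores the reader's cached copy of the slot would
        # need invalidating here, before the sample is read.
        sample = self.slots[self.read_idx]
        self.read_idx = (self.read_idx + 1) % len(self.slots)
        return sample
```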

Besides the ME work, I've had an interesting diversion getting some new font rendering working in Daedalus. I saw on the ps2dev.org forums that BenHur had released a library for rendering text using the PSP's built-in fonts. I've always been a little unhappy with Daedalus's text rendering, and thought this would be a good opportunity to improve it. Here's a screenshot of the UI using BenHur's intraFont library (click through for a better-looking unscaled version):




I think this is looking a lot better than the previous font. The drop shadows really help make the text more readable. I also support multiple fonts for the first time, so the header text actually looks like header text :)

-StrmnNrmn

Monday, November 19, 2007

And the winner is...

This has been one of the hardest posts I've had to write, because I know whichever entry I pick as winner of the icon competition, a lot of people are going to be disappointed that I didn't pick their favourite.

Trying to select a winner has forced me to think about what I wanted to get out of this competition. On the surface, the main goal of the competition was just to replace the ugly default icon on the PSP's XMB. However as I've checked out all the entries over the past couple of weeks, I've come to realise that the main objective of the competition should be to define an identity for Daedalus.

In the end, one of the main criteria I've used to select the winner is how well the design works as a logo, or identity, for the project as a whole. For this reason I've picked Patrick Ahmann's design:




I think Patrick's design is wonderfully clean and stylish, and the colours look especially crisp and vibrant on the PSP's screen*. The typography, particularly for the 'daedalus' logo, is very simple and distinctive and I think it would work well as a logo on the web. What's more, I got quite excited about using the cog motif on the logo as a progress indicator on loading screens :)

So, well done Patrick!

I want to say a big 'thank you' to everyone who contributed. Many people put a lot of work into this and I'm really sorry that I could only pick one winner. I think the quality of the final 10 was excellent, so if you disagree with my choice hopefully there's something there for you :)

-StrmnNrmn

*annoyingly Blogger insists on downsampling the images above slightly, so they look a little fuzzy. Click through for the original versions...

Saturday, November 17, 2007

Icon Competition - Shortlist

I've just finished compiling a shortlist of what I think are the 10 best submissions for the icon competition.

I received just under 200 entries in total; I've been blown away by the response. Picking just 10 entries has been very difficult. In the end I had a 'longlist' of around 30 entries which I thought were superb, and it's taken me over an hour to whittle that down to just 10 entries.

Here is my shortlist, presented in the order they were submitted. Feel free to discuss these in the comments - I'll be reading your comments over the weekend, and I'll pick a final winner on Monday.

Thanks to everyone who entered - you're awesome!



-StrmnNrmn



Cladil




Pleaser




Pharonyk




Steven La




Victor Aguirre




Joël Dos Santos




Jay Mc




Patrick Ahmann




Hykypoo




juniorslick363


Edit 20/11/2007 12:00 Fix Pharonyk's name.

Friday, November 16, 2007

Last call for icons+backgrounds

Just a reminder that today is the last day for sending in your entries to the icon competition. I'll be making the shortlist sometime tomorrow morning, so provided you get your entry to me by 12pm (that's noon - GMT!) on Saturday it will probably be included.

Once I've made a shortlist of what I feel are the 10 best submissions, I'll post them up here so that people can give their thoughts, before I pick a final winner on Monday.

-StrmnNrmn

Tuesday, November 13, 2007

Media Engine progress

Over the weekend I described my plans for getting audio list processing working on the PSP's Media Engine. I'm making some decent progress so far. I've got Daedalus loading a kernel mode PRX to handle the ME nitty gritty, and I've managed to execute some test code on the ME successfully.

I've spent some time reviewing the audio code, trying to figure out if any bits are particularly amenable to running asynchronously, and trying to figure out if there is anything that is going to cause any problems when running this code on the ME. Fortunately it looks like all of Azimer's audio code is very straightforward C so there should be no problems getting it running on the ME once the synchronisation issues are dealt with. I've also realised that alongside the audio list processing there is also some expensive 44kHz upsampling code which will run very nicely on the ME too.

I have the feeling that debugging code on the ME is going to be particularly painful, so I want to try and catch as many of the obvious synchronisation bugs as early as possible. This evening I've started writing a job manager to 'simulate' executing code on the ME. The manager simply creates a thread which sits and waits for jobs to come in, mimicking the behaviour of the MediaEngine.prx. Once I've got the audio list processing running correctly through the job manager, I can easily switch things over to get these jobs running on the ME in parallel to the main core. That's the plan, anyway :)
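The 'thread which sits and waits for jobs' pattern looks roughly like this in Python. It is only a sketch of the general technique using the standard `queue` and `threading` modules; `SimulatedME` and its method names are hypothetical and say nothing about how the real job manager or MediaEngine.prx are structured.

```python
import queue
import threading


class SimulatedME:
    """Worker thread that stands in for the Media Engine during testing."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        while True:
            job = self._jobs.get()  # blocks until a job arrives
            try:
                if job is None:     # sentinel: stop the worker
                    return
                job()
            finally:
                self._jobs.task_done()

    def submit(self, job):
        self._jobs.put(job)

    def wait_idle(self):
        """Block until every submitted job has finished."""
        self._jobs.join()

    def shutdown(self):
        self._jobs.put(None)
        self._thread.join()
```

Because there is a single worker, jobs complete in the order they were submitted, which keeps the simulation easy to reason about while hunting for synchronisation bugs.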

-StrmnNrmn

Sunday, November 11, 2007

Icons update (part 2)!

..And the last update today (I promise!)

I just wanted to say that I've now received about 100 entries for the icon competition. I'm about halfway through the list so far; apologies if I've not responded to your entry yet.

I've had a lot more entries than I was expecting. Originally I was planning on posting all the entries so that people could comment on their favourites before I picked a winner. There are too many entries for me to be able to do that, so next weekend I'll pick my favourite 10 and show them instead.

I've had lots of entries from readers of PSP-Generation, so a big 'merci beaucoup' for that. It's certainly helping exercise my rusty French :)

There's just under a week left before the competition ends, so be sure to get your submission in shortly. Thanks for all the entries so far!

-StrmnNrmn

PS Remember it's 'daedalus' and not 'deadalus'. It's an easy mistake to make :)

Media Engine

Earlier I discussed my plans for getting Daedalus's audio processing working on the PSP's Media Engine.

As I mentioned in that post, it's not just a case of changing some compiler setting to get this working. I've not spent much time investigating the ME so I may be wrong on a few of these points, but here are the current issues that I think need solving.

Firstly, in order to access the ME I need to be running in kernel mode. This requires either running Daedalus in kernel mode, or (preferably) creating a kernel mode PRX that encapsulates the required functionality. I think kernel mode rules out anyone running with v1.50 firmware (hence my earlier post - please respond to the poll if you haven't already done so!) Maybe one of the more savvy PSP developers out there can correct me on this? If no-one is using v1.50 any more then maybe it isn't even an issue.

Another problem is that although the ME is essentially the same processor as the main core, it has a different memory map. This means that things like VRAM are invisible to the ME, so any code ported to run on the ME would have to be written to operate on main memory. This isn't an issue for Daedalus's audio list processing, but it would cause problems if I wanted to move display list processing to the ME too.

Touching on the memory map issue, another problem is the lack of cache coherency between the two cores. I need to be careful when accessing the same areas of memory with both cores to correctly flush and invalidate the data caches. Ideally any shared memory should be kept to a minimum, but this is easier said than done when porting existing code, rather than writing new code.
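A toy model shows why both the flush (writeback) and the invalidate matter. The Python below is purely illustrative: `Core` is a made-up class with a private write-back cache in front of a shared dictionary, and it does not correspond to any PSP SDK call.

```python
class Core:
    """Toy core with a private write-back cache in front of shared memory."""

    def __init__(self, memory: dict):
        self.memory = memory  # shared main memory: address -> value
        self.cache = {}       # private cache: address -> value
        self.dirty = set()    # addresses written but not yet written back

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]  # miss: fill from memory
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def writeback(self):
        """Flush dirty lines so the other core can see them in memory."""
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()

    def invalidate(self):
        """Discard cached lines so the next read refetches from memory.

        Any dirty data that has not been written back is lost.
        """
        self.cache.clear()
        self.dirty.clear()
```

In this model a value written by one core only becomes visible to the other after the writer calls `writeback()` and the reader calls `invalidate()`; skipping either step leaves the reader with stale data.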

For a similar reason, any code which needs to run on the ME should avoid making any calls to the runtime library, including doing any system memory allocation. System calls are also ruled out. This is fairly easy to guarantee if you're writing new code, but again, it's a lot harder if you're porting existing code.

I think that's most of the issues from the hardware side. There are also a number of issues to be solved to do with the way that Daedalus handles audio and display list processing.

On the N64, the audio and display lists are processed asynchronously by the RSP coprocessor. In Daedalus, I can identify when these tasks are queued up for the RSP, intercept them, and process them synchronously (using high-level emulation rather than simulating the RSP execution directly).

The key thing here is that as far as the emulated N64 is concerned, audio and display list processing currently happens instantaneously. As soon as it kicks off the RSP it gets an interrupt to inform it that processing has completed. The whole process is very deterministic, and I'm worried that by processing these display lists asynchronously on the ME a number of intermittent and hard-to-debug issues will crop up. On the other hand, processing these tasks asynchronously is much closer to the behaviour of a real N64, which may fix some timing-related issues. It will also allow Daedalus to exploit the inherent parallelism that N64 roms were designed to take advantage of.

My current plan for ME audio support in R14 is:


  1. Create a kernel mode PRX and get Daedalus successfully loading and invoking functions (under all supported firmwares). I've just about done this.
  2. Add the code to support initialising and running code on the ME to the PRX. Test invoking user mode functions from the main EBOOT.PBP. I'll probably be using J.F.'s great sample code as a reference for this. Thanks J.F.!
  3. Rewrite the audio list processing code so that it can be invoked synchronously or asynchronously as required (via some kind of configuration option). When running asynchronously it can just be run from a separate high-priority thread to start with. I can use this to test for various synchronisation issues without going through the pain of trying to do this on the ME first.
  4. Audit the audio list processing code to minimise any memory accesses or ensure that they are correctly synchronised with the main core/thread. Any crt or system calls need to be eliminated or abstracted away (e.g. printfs NOP when compiled to run on the ME).
  5. Invoke audio list processing code from the ME.
  6. Cross fingers.


So, that's the plan; I'll keep you updated on my progress. If anyone has any experience doing this kind of thing on the ME it would be great to hear your thoughts.

-StrmnNrmn

R13 Issues, R14 Plans

Over the past week I've started making plans for what I want to do for R14.

To start with, R13 introduced a couple of issues which I want to fix. Firstly, a number of roms now no longer work with dynarec enabled, or show odd behaviour. For instance, Aerogauge now finishes the race as soon as the countdown completes. I've tracked this down to one of the dynarec optimisations I added in August, where I optimise fragments which jump back to themselves. This should be a 'safe' transformation, so it suggests there's a bug somewhere in my implementation. If I can't fix the bug in time for R14, I'll add a temporary setting to allow this optimisation to be disabled on a rom-by-rom basis (much like the 'dynarec stack optimisation' setting).

Secondly, it looks like something I changed for savestate support has broken the 'return to main menu' option. I added some logic to help ensure that when taking a snapshot for the savestate, the CPU is paused in a 'safe' state (i.e. no dynarec code is executing, nothing is running on the RSP, and nothing is executing in the branch delay slot.) It looks like I've messed something up which is causing the 'return to main menu' option to wait for a safe state before bailing out to the menu. Should be an easy one to fix.

Morgan suggested a nice idea in the comments, which is that I generate a thumbnail for the savestate as it is created to display alongside the slot in the UI. It's a little tricky to implement, as by the time the emulator is told to create a savestate, it has already obliterated the n64's framebuffer with the Daedalus UI. I'll have to do something quite clever like speculatively copy the n64's framebuffer into system memory every time you enter the Pause Menu, or create the screenshot on the first frame rendered after saving. Either way, I'd like to add this simple feature to R14.

Next on my list for R14 is to look at making more significant performance improvements. Over the months many people have been asking when I'd get around to implementing audio on the PSP's Media Engine. I've talked about this before, but always kept putting it off in order to work on easier optimisations.

The Media Engine is a bit of unknown territory for me. Even though it's practically identical to the main CPU, you can't just change a setting and suddenly have your code running on it. There are a number of small hurdles I have to overcome before I can get audio working on the ME, but this is my big goal for R14 (I'll save the technical discussion for the next post.) If all goes to plan this should mean that audio can always be enabled without a significant impact on framerate.

So in summary for R14: a few bug fixes, thumbnails for savestates, and audio without affecting framerate.

-StrmnNrmn

Firmware poll

I'm interested in which firmware everyone is running on. It would be very helpful if you could reply to this post with a) which model of PSP you're using ('fat'/slim, revision # if known) and b) which firmware you're using.

I have two PSPs. I have an original Japanese PSP with v1.00 firmware which I use for development, and a UK PSP which usually runs the latest official firmware (v3.72 at the moment.)

I'm interested because I want to know if anyone still requires the v1.50 (kxploit) versions of Daedalus that I release. I also need to figure out if it's worth my time getting a slim PSP for developing with the v3.xx+ firmware.

Thanks!
-StrmnNrmn

Wednesday, November 07, 2007

Icons update!

Well it's only been a couple of days, but I've received over 50 submissions to date - It's going to take me some time to go through them!

A couple of people have asked about the icon size. 134x74 or 140x80 are both fine.

Thanks for all the entries so far - there have been some really great ones. Keep them coming!

-StrmnNrmn

Monday, November 05, 2007

Icons for Daedalus

I've never taken the time to add icons to the Daedalus EBOOT.PBP. Mostly it's just because I've been lazy (when I develop Daedalus I usually run it through psplink rather than the XMB, so I rarely see the frontend), but also because I've never found a suitable set of icons to use.

There are plenty of icon/background packs floating around for Daedalus, but I've been reluctant to use any of them; I need something that I can freely distribute, but also something that doesn't infringe on any trademarks.

I'd like to add icons and a background to Daedalus for the next release, so I'm opening a 'competition' to try and find the best design. The 'prize' will be full credit in the release notes and on the Daedalus About screen (including a link to your website if you wish). Here's what I'm looking for:


  • Background 480x272, preferably in .png format
  • Icon 134x74, preferably in .png format
  • No use of Nintendo's trademarks! This means no use of the "N64" or "Nintendo" logos. Sorry. Be inventive!
  • You must own the work you submit, and give me permission to use it with Daedalus PSP


Email your submissions to me (strmnnrmn@gmail.com). I'll post all the submissions I receive by 16th November on this site to get people's thoughts, and make a decision the following Monday (19th November - that's two weeks today!)

Of course if you don't like whatever decision I make, you're always welcome to repack the Daedalus EBOOT.PBP and use any of the other graphics people send in :)

Time to get mspaint fired up!

-StrmnNrmn

Sunday, November 04, 2007

Daedalus PSP R13 Released

I've just finished packing up R13 and uploading it to sourceforge:

Daedalus PSP R13 for v1.00 Firmware
Daedalus PSP R13 for v1.50+ Firmware
Daedalus PSP R13 Source

The most significant new feature is savestate support. You can now save your progress at any point, via the Pause Menu (accessed through hitting the Select button). Savestates are written out to the memory stick, and consume around a megabyte per slot. You can load up a savestate at any time from the Pause Menu, or via the front end (hit the right shoulder button to swap from the rom list to the savestate list.)

Other than savestates, the most significant change in R13 is a number of optimisations to the dynarec core which should give a 10-20% speedup depending on the title being played. I've tested these optimisations as much as I can, but if you find that roms which worked with R12 are now broken, try disabling 'dynamic recompilation' and/or 'dynarec stack optimisation' from the rom's preferences screen.

I haven't looked at compatibility at all in R13, so it's unlikely that any roms will have started working in R13.

I'm interested to hear your feedback on both of these features. Let me know if you have any problems with savestates, or if you've found roms are no longer working in R13. I'll try and keep on top of the comments pages over the next couple of weeks.

It's taken a LOT longer to get R13 out than I had hoped. I can't quite believe the last release was in June! I'm hoping now I've got this release out of the door I'll be able to get back to making small, more frequent releases.

-StrmnNrmn

Update A couple of things have cropped up in the comments. Firstly, when you create a savestate it can take a while - up to 10 seconds - during which time the screen is black. It will complete though - please be patient :). I'll have a look at adding a progress meter or something like that for R14.

There are a few titles that don't seem to be working well with some of the dynarec optimisations I added. If you're having problems, try turning off the 'dynarec stack optimisation' as discussed above.

Saturday, November 03, 2007

Daedalus R13 Soon

I've just about finished implementing the last couple of savestate features I talked about last week.

I've got a bit of polishing to do and a little bit more testing, but I'm hoping I should be able to release R13 sometime tomorrow (Sunday).

I'll post again then with a complete list of changes and links to the latest builds for you to download. See you then!

-StrmnNrmn

Thursday, October 25, 2007

R13 close, honestly!

Apologies again for the lack of updates. I know people get nervous when I don't update regularly, but I don't like to post when I have nothing new or exciting to talk about.

I'm very close to releasing R13; I've just been struggling to find the time to add the final finishing touches to the savestate support. I've found it hard to get into a regular working pattern since moving, so what should have been a 1 week job has ended up taking a month. Bioshock and Halo 3 haven't helped either.

Savestates are working very well. The current implementation provides 10 slots which are shared across all roms. You can save to any slot at any time from an option on the Pause menu, or choose to reload a previous savestate. In this way it works just like QuickSave/QuickLoad found in various PC games (I'm tempted to add a special slot for this to the top level of the Pause menu).

I use zlib to compress the savestates, so an 8MiB savestate compresses down to 1MiB (or even smaller if you're running games which don't use the Expansion Pak).
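As a quick illustration of the idea, Python's standard `zlib` module wraps the same library. The buffer below is a stand-in (8MiB that is almost entirely zeros), not real emulator RAM, so the ratio it achieves says nothing about what an actual savestate compresses to; the function names are made up for the example.

```python
import zlib


def compress_state(ram: bytes) -> bytes:
    """Compress a raw memory image with zlib at a middling level."""
    return zlib.compress(ram, 6)


def decompress_state(blob: bytes) -> bytes:
    """Recover the original memory image."""
    return zlib.decompress(blob)


# Stand-in for emulated RAM: 8MiB, mostly zero, with a little data at the start.
ram = bytearray(8 * 1024 * 1024)
ram[0:4096] = bytes(range(256)) * 16
blob = compress_state(bytes(ram))
```

Round-tripping through `decompress_state` should give back exactly the bytes that went in, which is the property a savestate loader depends on.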

There are just a couple of things I need to finish off now before I can release. Firstly I need to check if the savestate you're loading is for a different rom than is currently loaded. If this is the case I need to scan through your available roms looking for the correct one, and load it. To make this efficient I need to get the RomDB which I use for the PC build working on the PSP.

The second thing I need to do is come up with a decent way of linking some metadata to the savestate so I can show this in the UI. Just simple stuff like the full name of the rom, time the savestate was created and the total time spent playing when it was created. I figure without this information it'll be quite difficult to figure out what is stored in each savestate slot. The alternative is to add a text entry system to the UI so you can name each savestate, but I think this will just delay things even more.
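One simple way to attach that kind of metadata is a small fixed-size header in front of the compressed data. The layout below is entirely hypothetical (the magic value, field order and 64-byte name field are invented for the example and are not the Daedalus savestate format); it just shows the three pieces of information mentioned above being packed and read back with Python's `struct` module.

```python
import struct

# Hypothetical little-endian layout: magic, version, creation time
# (Unix seconds), play time in seconds, then a 64-byte UTF-8 ROM name.
HEADER = struct.Struct("<4sHQI64s")
MAGIC = b"DSAV"  # made-up identifier for this sketch


def pack_header(rom_name, created, play_seconds, version=1):
    name = rom_name.encode("utf-8")[:64]  # struct pads short names with NULs
    return HEADER.pack(MAGIC, version, created, play_seconds, name)


def unpack_header(data):
    magic, version, created, play_seconds, name = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("not a savestate header")
    return {
        "version": version,
        "created": created,
        "play_seconds": play_seconds,
        "rom_name": name.rstrip(b"\0").decode("utf-8", errors="ignore"),
    }
```

With a header like this the UI could list every slot by reading only the first few dozen bytes of each file, without decompressing anything.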

In summary R13 is really close and I've not forgotten about it. Just a few more things to sort out and it'll be ready for release.

-StrmnNrmn

Monday, October 01, 2007

Daedalus and the PSP Slim + Lite

A number of people have been asking if Daedalus will support the PSP Slim and Lite (PSP-2000), and if it does, if it will take advantage of the extra 32MiB RAM to improve the speed of emulated roms.

I am planning on supporting 3.xx firmware, but not until I have a PSP with suitable firmware which I can test with directly. People will probably end up compiling the Daedalus source for 3.xx (and they're welcome to), but until I can verify everything directly myself these will have to remain 'unofficial' builds. If there is strong demand for a 3.xx version of Daedalus then I'll consider setting up a donations page with the aim of buying a Slim and Lite for development.

As for the additional 32MiB RAM that the PSP-2000 provides, it's unlikely that this will provide for faster emulation. I've spent a lot of time reducing Daedalus's memory requirements so that it runs comfortably in the 32MiB that the PSP-1000 provides (it's actually a fair bit less than this when you take out OS overheads etc.)

There are only two main areas of the emulator that would benefit from an increase in cache size. The first is the texture cache, which is only used when video memory is exhausted. I fixed a few bugs with this months ago which means that very few roms ever have to place textures in system memory.

The second place that would benefit from a cache increase is the ROM cache. In order to support roms larger than the PSP's system memory, I dynamically map pages of the ROM into the PSP's RAM on demand (rather like virtual memory on PCs.) If a page of the ROM that is requested isn't in the cache, I have to pause emulation as I load it from memory stick. Currently this cache size is 2MiB. It could be increased up to 32MiB which would fit most N64 roms comfortably, and would entirely eliminate paging while the emulator is running. This might give a small speedup, but I don't believe this is currently a significant issue even with the tiny 2MiB cache that is currently in use.
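The demand-paging scheme can be sketched as a small page cache. Everything specific here is an assumption made for illustration: the 16KiB page size, the least-recently-used eviction policy and the `RomCache` name are not taken from Daedalus, and `load_page` stands in for the slow read from memory stick.

```python
from collections import OrderedDict

PAGE_SIZE = 16 * 1024  # hypothetical page size


class RomCache:
    """Keeps a bounded number of ROM pages in memory, loading on demand."""

    def __init__(self, load_page, max_pages):
        self.load_page = load_page  # callable: page index -> bytes
        self.max_pages = max_pages
        self.pages = OrderedDict()  # page index -> bytes, oldest first
        self.misses = 0

    def read_byte(self, address):
        index, offset = divmod(address, PAGE_SIZE)
        page = self.pages.get(index)
        if page is None:
            self.misses += 1  # emulation would stall here
            page = self.load_page(index)
            self.pages[index] = page
            if len(self.pages) > self.max_pages:
                self.pages.popitem(last=False)  # evict least recently used
        else:
            self.pages.move_to_end(index)  # mark as most recently used
        return page[offset]
```

With `max_pages` large enough to hold the whole ROM, every page is loaded at most once and `misses` stops growing, which is the 'no paging while running' case described above.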

Increasing the dynarec buffers would be very unlikely to provide a speedup, as I've yet to see a rom where they fill up before they are flushed for some other reason (typically instruction cache invalidation.)

So in summary, yes, I will support the PSP Slim and Lite at some point, but no, it's unlikely to offer much of an improvement over the PSP-1000s (other than being slimmer and lighter, obviously :)

-StrmnNrmn

Wednesday, September 19, 2007

Back Online

Apologies for the lack of updates. I've not had internet access at home since moving flat at the end of August. My ADSL was activated today so I'm finally up and running again. I can't believe how long it's taken!

Between the flat move and a busy few weeks at work (getting ready for the Tokyo Game Show) I've not had that much time to work on Daedalus, but I have made some progress on a really cool feature people have been asking for for some time, namely savestate support.

For those not familiar with savestates, the basic premise is very much like 'hibernate' or sleep modes on PCs and laptops. A savestate can be created at any point during emulation. At a later time the savestate can be reloaded to restore the emulator to the exact state it was in when the savestate was created.

What's really useful about savestates is that they work even if the underlying save mechanism used by a rom isn't supported properly. I'm hoping that by adding savestate support Daedalus will become significantly more usable for many roms. Savestates also make development a lot easier as it means I can jump straight into the middle of a game when debugging/profiling rather than sitting through the title sequence dozens of times.

I still have a little bit of work left to do on the savestate system, mostly on the UI side, but there are a number of additional optimisations I want to finish off before I release R13. I'm currently planning to get everything ready for release on either the weekend of the 29th Sept, or possibly the first weekend in October.

-StrmnNrmn

Wednesday, August 15, 2007

Whoops

It turns out that exophase had already suggested two of the three optimisations I implemented this weekend in a conversation we had way back in February.

Whoops! It's been such a long time, I'd totally forgotten we'd talked about it. I really should pay more attention sometimes - if I had been, you might have seen some of these improvements way back in R9. Sorry exophase - kudos for spotting this so long ago!

-StrmnNrmn

Interesting Dynarec Hack

I was playing around with the code generation a couple of evenings ago, and realised that if I made a certain assumption, I could drastically speed up specific types of memory accesses.

When I discussed load/store handling on Sunday, I presented the new code that is typically generated for handling a load such as 'lw $t0, 0x24($t1)' on the N64:


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0 < ...)
BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data


(I'll ignore all the extra code which is generated, and just concentrate on the 5 instructions above which correspond to the expected path of execution.)

Of the 5 instructions that are generated, two - the SLT and BEQ - are just there for performing error handling in the case that the address is invalid, a hardware register (i.e. for memory-mapped I/O), or a virtual address. I'll call this error handling for short.

If we were generating code in an idealised environment where we didn't have to perform error handling, we could drop the SLT/BEQ instructions to get this:


ADDIU a0 = s1 + 0x0024 # add offset to base register
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data


We could then optimise this even further and perform the offset calculation directly as a part of the LW instruction:


ADDU a1 = s1 + s7 # add offset to emulated ram
LW s0 <- 0x0024(a1) # load data


In this idealised situation we could reduce an emulated load to just two instructions, with no branches. That's a pretty good saving!

The problem is that the environment we're generating code from is not 'ideal', and it's hard to know ahead of time which memory accesses are going to directly access physical ram, and which are going to access hardware registers or require virtual address translation. For that reason, we have to place a guard around every memory access to make sure that it behaves correctly. At least, that was the way I was thinking until earlier in the week.

What I realised on Monday is that I can make an assumption that lets me remove the error handling code for certain types of load/stores. The assumption is that when the N64 accesses any memory through the stack pointer ($sp) register, the address is always going to be valid, physical memory.

The assumption relies on the fact that most roms don't do anything particularly clever with their stack pointers - it gets set up for each thread to point at a valid region of memory then the game just runs along, pushing and popping values from it as the code executes. Of course, if the assumption is wrong then the emulator will just crash and grind to a halt in an unpredictable manner :)

It was straightforward to add a hack to the code generation to exploit this kind of behaviour, and the results have been better than I expected - I'm seeing at least a 10% speed up, and the code expansion factor (the ratio of generated bytes of instructions to input bytes) has dropped from around 5.0x to 4.0x. Stability has been excellent too - I've run about 8 roms with the hack so far, and all of them have run perfectly.
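The shape of the hack can be shown with a toy code generator that prints instructions in the same notation as the listings in this post. It is a sketch under assumptions: `emit_load` and its parameters are invented for the example, `s7` is taken to hold the base of emulated RAM as in the listings, and the two-instruction SLT/BEQ check is represented by a single placeholder `GUARD` line rather than reproduced exactly.

```python
def emit_load(host_dest, host_base, offset, n64_base, sp_hack=True):
    """Return pseudo-instructions for an emulated 'lw dest, offset(base)'.

    host_dest/host_base are the host registers caching the N64 registers;
    n64_base is the name of the N64 base register (e.g. 'sp' or 't1').
    """
    imm = f"0x{offset & 0xFFFF:04x}"
    if sp_hack and n64_base == "sp":
        # Assume stack accesses always hit valid physical RAM: no guard,
        # and the offset folds into the load itself.
        return [
            f"ADDU a1 = {host_base} + s7",
            f"LW {host_dest} <- {imm}(a1)",
        ]
    return [
        f"ADDIU a0 = {host_base} + {imm}",
        "GUARD a0 --> trampoline",  # placeholder for the SLT/BEQ pair
        "ADDU a1 = a0 + s7",
        f"LW {host_dest} <- 0x0000(a1)",
    ]
```

Calling `emit_load("s0", "s1", 0x24, "sp")` yields the two-instruction form, while any other base register keeps the guarded sequence.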

I think one of the reasons the hack has such an impact is that a lot of the memory accesses in a typical C program are through the stack. Here's an example snippet from the entry to a function on the N64, where the compiler emitted code to store the return address and arguments:


SW ra -> 0x0014(sp)
SW a0 -> 0x0058(sp)
SW a1 -> 0x005c(sp)
SW a2 -> 0x0060(sp)


When I look through disassembly for the roms I'm working on, it's very common to see lots of sequential loads/stores relative to the stack pointer like this.

Previously Daedalus generated around 20 instructions (including 5 branches) for the above snippet. With the hack, the generated code now looks like this:


ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
ADDU t1 = s1 + s7
SW s3 -> 0x0058(t1)
ADDU t1 = s1 + s7
SW s2 -> 0x005c(t1)
ADDU t1 = s1 + s7
SW s5 -> 0x0060(t1)


8 instructions, 0 branches. What's more, it looks like with a little work, I could eliminate 3 redundant address calculations:


ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
SW s3 -> 0x0058(t1)
SW s2 -> 0x005c(t1)
SW s5 -> 0x0060(t1)


Now that would be efficient :)
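As a rough illustration of that last step, here is a minimal C++ sketch of a code generator that computes the translated stack base once and reuses it for a run of stores. The names StackStore and EmitStackStores are made up for this example and are not part of Daedalus, and the sketch assumes that $sp is not modified and t1 is not clobbered anywhere inside the run:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical description of one store through $sp.
struct StackStore
{
    std::string host_reg;   // PSP register caching the N64 source register
    unsigned    offset;     // offset from $sp
};

// Emits the unguarded sequence for consecutive stores through $sp.
// Assumption: $sp is unchanged and t1 stays live for the whole run,
// so the translated base only needs to be computed once.
std::vector<std::string> EmitStackStores(const std::vector<StackStore> & stores)
{
    std::vector<std::string> code;
    if (stores.empty())
        return code;

    code.push_back("ADDU t1 = s1 + s7");    // translate $sp once
    for (std::size_t i = 0; i < stores.size(); ++i)
    {
        char line[64];
        const int written = std::snprintf(line, sizeof(line), "SW %s -> 0x%04x(t1)",
                                          stores[i].host_reg.c_str(), stores[i].offset);
        assert(written > 0 && written < static_cast<int>(sizeof(line)));
        (void)written;
        code.push_back(line);
    }
    return code;
}
```

For the four stores in the example above this produces five lines of output rather than eight.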

I still want to do lots of testing with the hack. I want to find out if there are roms that don't work with the hack enabled, and how common a problem it is. It's such a significant optimisation though that I'm certain I'll be adding it as an option in Daedalus R13. The results of my testing will probably determine whether I default it to on or off though.

So far Daedalus R13 is shaping up to be significantly faster than R12. I'm still not sure when I'll be ready to release it, but you'll hear about it here first.

-StrmnNrmn

Monday, August 13, 2007

Dynarec Improvements

I've had a fairly productive week working on optimising the Dynarec Engine. It's been a few months since I worked on improving the code generation (as opposed to simply fixing bugs), so it's taken me a while to get back up to speed.

At the end of each fragment, I perform a little housekeeping to check whether it's necessary to exit from the dynarec system to handle various events. For instance, if a vertical blank is due this can result in me calling out to the graphics code to flip the current display buffers. The check simply involves updating the N64's COUNT register, and checking to see whether there are any time-dependent interrupts to process (namely vertical blank or COMPARE interrupts.)

On the train into work on Monday I realised that there were a couple of ways in which I could make this more efficient. Firstly, the mechanism I was using to keep track of pending events was relatively complex, involving maintaining doubly-linked lists of events. I realised that if I simplified this code it would make it much easier for the dynarec engine to update and check this structure directly rather than calling out to C code.

The other idea I had on the train was to split up the function I was calling to do this testing into two different versions. There are two ways that the dynarec engine can be exited - either through a normal instruction, or a branch delay instruction (i.e. an instruction immediately following a branch.) My handler function catered for both of these cases by taking a flag as an argument. I realised that by providing a separate version of this function for each type I could remove the need to pass this flag as an argument, which saved a couple of instructions from the epilogue of each fragment.

These two small changes only took a couple of hours to implement, but yielded a 3-5% speedup on the various roms I tested. They also slightly reduced the amount of memory needed for the dynarec system, improving cache usage along the way.
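To give a feel for the shape of these two changes, here is a minimal C++ sketch. It is not the actual Daedalus code: the CpuState layout, the single countdown field and the function names are all assumptions made for illustration. The idea is that the end-of-fragment check becomes an add, a subtract and a compare, and that each kind of exit gets its own handler so that no flag has to be passed:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified CPU state.
struct CpuState
{
    std::uint32_t count;                 // emulated COUNT register
    std::int32_t  cycles_to_next_event;  // countdown to the earliest pending event
    bool          in_branch_delay;       // set when we exit from a branch delay slot
};

// Simple enough that generated code could update the fields directly.
// Returns true when a time-dependent event (e.g. vertical blank) is due.
inline bool AdvanceTime(CpuState & cpu, std::uint32_t cycles)
{
    assert(cycles < 0x80000000u);
    cpu.count += cycles;
    cpu.cycles_to_next_event -= static_cast<std::int32_t>(cycles);
    return cpu.cycles_to_next_event <= 0;
}

// One handler per exit type: the flag is baked in at compile time
// instead of being loaded into an argument register by each fragment.
template< bool kBranchDelay >
bool HandleFragmentExit(CpuState & cpu, std::uint32_t cycles)
{
    cpu.in_branch_delay = kBranchDelay;
    return AdvanceTime(cpu, cycles);
}
```

A fragment that exits on a normal instruction would call HandleFragmentExit<false>, and one that exits in a branch delay slot would call HandleFragmentExit<true>.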

The next significant optimisation I made this week was to improve the way I was handling the code generation for load/stores. Here's what the generated code for 'lw $t0, 0x24($t1)' looks like in Daedalus R12 (assume t1 is cached in s1, and t0 is cached in s0 on the PSP):


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
ADDU a1 = a0 + s7 # add offset to emulated ram
BNEL t0 != r0 --> cont # valid address?
LW s0 <- 0x0000(a1) # load data
J _HandleLoadStore_XYZ123 # handle vmem, illegal access etc
NOP
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception


There are a couple of things to note here. Firstly, I use s6 and s7 on the PSP to hold two constants throughout execution. s6 is either 0x80400000 or 0x80800000 depending on whether the N64 being emulated has the Expansion Pak installed. s7 is set to be (emulated_ram_base - 0x80000000). Keeping these values in registers prevents me from using them for caching N64 registers, but the cost is far outweighed by the more streamlined code. As it happens, I also use s8 to hold the base pointer for most of the N64 CPU state (registers, pc, branch delay flag etc) for the same reason.

So the code first adds on the required offset. It then checks that the resulting address is in the range 0x80000000..0x80400000, and sets t0 to 1 if this is the case, or clears it otherwise*. It then adds on the offset (emulated_ram_base - 0x80000000) which gives it the translated address on the psp in a1. The use of BNEL 'Branch Not Equal Likely' is carefully chosen - the 'Likely' bit means that the following instruction is only executed if the branch is taken. If I had used a plain 'BNE', the emulator could often crash dereferencing memory with the following LW 'Load Word'.
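The range check works with a single SLT because SLT compares its operands as signed 32-bit values, and 0x80000000 is the most negative signed value. Here is a small C++ sketch of the same test; the function names are mine, not Daedalus's, and the casts assume the usual two's complement conversion:

```cpp
#include <cassert>
#include <cstdint>

// Equivalent of 'SLT t0 = (a0<s6)'. upper_limit plays the role of s6
// (0x80400000, or 0x80800000 with the Expansion Pak).
// As signed values, addresses below 0x80000000 are positive and so are
// never less than the (negative) limit; addresses at or above the limit
// fail the comparison directly.
inline bool IsDirectRamAddress(std::uint32_t address, std::uint32_t upper_limit)
{
    return static_cast<std::int32_t>(address) < static_cast<std::int32_t>(upper_limit);
}

// Equivalent of 'ADDU a1 = a0 + s7', written with an explicit base pointer.
inline std::uint8_t * TranslateToHost(std::uint8_t * emulated_ram_base,
                                      std::uint32_t address, std::uint32_t upper_limit)
{
    assert(IsDirectRamAddress(address, upper_limit));
    (void)upper_limit;
    return emulated_ram_base + (address - 0x80000000u);
}
```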

Assuming the address is out of range, the branch and load are skipped, and control is passed to a specially constructed handler function. I've called it _HandleLoadStore_XYZ123 for the benefit of discussion, but the name isn't actually generated, it's just meant to indicate that it's unique for this memory access. The handler function is too complex to describe here, but it's sufficient to say that it returns control to the label 'cont' if the memory access was performed ok (e.g. it might have been a virtual address), else it bails out of the dynarec engine and triggers an exception.

When I originally wrote the above code I didn't think it was possible to improve it any further. I didn't like the J/NOP pair, but I saw them as a necessary evil. All 'off trace' code is generated in a second dynarec buffer which is about 3MiB from the primary buffer - too far for a branch which has a maximum range of +/-128KiB. I used the BNEL to skip past the Jump 'J' instruction which can transfer control anywhere in memory.

What I realised over the weekend was that I could place a 'trampoline' with a jump to the handler function immediately following the code for the fragment. Fragments tend to be relatively short - short enough to be within the range of a branch instruction. With this in mind, I rewrote the code generation for load and store instructions to remove the J/NOP pair from the main flow of the trace:


ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
BEQ t0 == r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception
#
# rest of fragment code follows
# ...


_Trampoline_XYZ123:
# handler returns control to 'cont'
J _HandleLoadStore_XYZ123
NOP


The end result is that this removes two instructions from the main path through the fragment. Although in the common case five instructions are executed in both snippets of code, the second example is much more instruction cache friendly as the 'cold' J/NOP instructions are moved to the end of the fragment. I've heard that there is a performance penalty for branch-likely instructions on modern MIPS implementations, so it's nice to get rid of the BNEL too.
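Whether a trampoline (or any other target) is reachable from a given branch can be checked when the code is generated. The following C++ sketch shows the arithmetic, assuming the standard MIPS encoding where a branch holds a signed 16-bit word offset relative to the instruction after the branch. The function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Returns true if a MIPS branch at branch_address can reach target_address.
// The offset is counted in words from the delay slot (branch_address + 4)
// and must fit in a signed 16-bit field, giving roughly +/-128KiB.
inline bool IsWithinBranchRange(std::uint32_t branch_address, std::uint32_t target_address)
{
    assert((branch_address & 3) == 0);   // instructions are word aligned

    const std::int64_t delta = static_cast<std::int64_t>(target_address)
                             - (static_cast<std::int64_t>(branch_address) + 4);
    if (delta % 4 != 0)
        return false;

    const std::int64_t word_offset = delta / 4;
    return word_offset >= -32768 && word_offset <= 32767;
}
```

A trampoline a few hundred bytes after the branch passes this test, whereas a handler in a buffer 3MiB away does not, which is why the J is still needed inside the trampoline.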

As with the first optimisation, this change yielded a further 3-5% speedup.

The final optimisation I've made this weekend is to improve the way I deal with fragments that loop back to themselves as they exit. Here's a simple example:


8018e014 LB t8 <- 0x0000(a1)
8018e018 LB t9 <- 0x0000(a0)
8018e01c ADDIU a0 = a0 + 0x0001
8018e020 XOR a2 = t8 ^ t9
8018e024 SLTU a2 = (r0<a2)
8018e028 BEQ a2 == r0 --> 0x8018e038
8018e02c ADDIU a1 = a1 + 0x0001
8018e038 LB t0 <- 0x0000(a0)
8018e03c NOP
8018e040 BEQ t0 == r0 --> 0x8018e058
8018e044 NOP
8018e048 LB t1 <- 0x0000(a1)
8018e04c NOP
8018e050 BNE t1 != r0 --> 0x8018e014
8018e054 NOP


I'm not sure exactly what this code is doing - it looks like a loop implementing something like strcmp() - but it's one of the most executed fragments of code in the front end of Mario 64.

The key thing to notice about this fragment is that the last branch target loops back to the first instruction. In R12, I don't perform any specific optimisation for this scenario, so I flush any dirty registers that have been cached as I exit, and immediately reload them when I re-enter the fragment. Simplified pseudo-assembly for R12 looks something like this:


enter_8018e014:
load n64 registers into cached regs

perform various calculations on cached regs

if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038

perform various calculations on cached regs

flush dirty cached regs back to n64 regs

if ok-to-continue
goto enter_8018e014
exit_8018e014:
...

enter_8018e038:
...


The key thing to notice is that we load and flush the cached registers on every iteration through the loop. Ideally we'd just load them once, loop as much as possible, and then flush them back to memory before exiting. I've spent the day re-working the way the dynamic recompiler handles situations such as this. This is what the current code looks like:


enter_8018e014:
load n64 registers into cached regs
mark modified regs as dirty

loop:
perform various calculations on cached regs

if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038

perform various calculations on cached regs

if ok-to-continue
goto loop

flush dirty cached regs back to n64 regs
exit_8018e014:
...

enter_8018e038:
...


In this version, the registers are loaded and stored outside of the inner loop. They may still be flushed during the loop, but only if we branch to another trace. Before we enter the inner loop, we need to mark all the cached registers as being dirty, so that they're correctly flushed whenever we finally exit the loop.

This new method is much more efficient when it comes to handling tight inner loops such as the assembly shown above. I still have some work to do in improving my register allocation, but the changes I've made today yield a 5-6% speedup. Combined with the other two optimisations I've described, I'm currently seeing an overall 10-15% speedup over R12.

I'm quite excited about the progress I've made so far with R13. I still have lots of ideas for other optimisations I want to implement for R13 which I'll talk about over the coming days. I don't have any release date in mind for R13 at the moment, so there's no point in asking me yet :)

-StrmnNrmn


*The SLT instruction is essentially doing 'bool inrange = address >= 0x80000000 && address < (0x80000000+ramsize)'. I think the fact that this can be expressed in a single instruction is both beautiful and extremely fortunate :)

Thursday, August 02, 2007

Daedalus PSP under OSX

I mentioned a few days ago that I've recently bought a Macbook Pro. I've never owned a Mac before so it's been a really interesting experience learning my way around. Initially I was planning on dual booting Windows XP via Boot Camp, but I quickly found that I could do almost everything I need to in OSX. I did install Parallels Desktop, but I rarely find myself running any Windows apps.

One of the main things I need for day to day 'work' is the ability to compile and run Daedalus. The rest of this article describes the process of setting up the PSPSDK under OSX, and compiling Daedalus PSP for the first time. This post is really aimed at people who are interested in compiling Daedalus for themselves on OSX. Hopefully it will also be useful for other PSP homebrew developers who have been having problems getting the PSPSDK set up under OSX.

To install the PSPSDK I largely followed this guide on the ps2dev.org forums. I already had XCode and fink installed. Fink complained when I tried to install all the listed packages, and I had to remove one of them from the command line (I think it was autogen, but I can't really remember now.)

I've been using the most recent psptoolchain script, which was updated a few weeks ago, and I had to make a couple of modifications. In depends/check-ncurses.sh I had to change the check for ncurses to look for the OSX .dylib file:


## Check for a ncurses library.
ls /usr/lib/libncurses.a 1> /dev/null ||
ls /usr/lib/libncurses.dll.a ||
{ echo "ERROR: Install ncurses before continuing."; exit 1; }


became:


## Check for a ncurses library.
ls /usr/lib/libncurses.a 1> /dev/null ||
ls /usr/lib/libncurses.dll.a ||
ls /usr/lib/libncurses.dylib ||
{ echo "ERROR: Install ncurses before continuing."; exit 1; }


Secondly, I had to make a change to scripts/001-binutils-2.16.1.sh. As urchin mentions on the ps2dev.org forum, ".m" is the extension for Objective C files in OSX, which trips up make's built-in implicit rules. Passing '-r' tells make to ignore the built-in implicit rules, and everything works fine.

So:


## Compile and install.
make clean && make -j 2 && make install && make clean || { exit 1; }


became:


## Compile and install.
make clean && make -r -j 2 && make install && make clean || { exit 1; }


(note the '-r' flag on the second invocation of make.)

I left the psptoolchain script doing its stuff for a couple of hours, and when I came back to it everything seemed to have completed and installed correctly.

The next step was to get Daedalus PSP compiling. Somewhat naively I assumed Daedalus PSP would compile out of the box on OSX. As it turns out, I had to make a few small changes to get everything compiling nicely.

Firstly, I had to update the makefile so that it directly referenced psp-gcc and psp-g++. Normally, I build Daedalus PSP through a Visual Studio Makefile project, and I had used a couple of scripts from the ps2dev.org forums to format GCC's output into a format that Visual Studio understands, so that double clicking on an error in the output opens the corresponding file in the editor. I found a better way to handle this, so I changed the CC/CXX macros to refer to the original pspsdk tools.

The main problem I encountered was my arbitrary use of backslashes instead of forward slashes in #include directives, e.g.:


#include "Core\CPU.h"


should be:


#include "Core/CPU.h"


Another subtle error came from the way that I was instantiating static functions from templated classes which are defined in a namespace. An example will probably help explain. Here's the basic outline of the class I use to implement the singleton pattern:


namespace daedalus
{
    template< class T >
    class CSingleton
    {
    public:
        static bool Create();

        static T * Get() { return mpInstance; }

    private:
        static T * mpInstance;    // must be static to be used from Create() and Get()
    };
}


The Create method for the singleton class is then implemented like this:


template<> bool CSingleton< CController >::Create()
{
DAEDALUS_ASSERT_Q(mpInstance == NULL);

mpInstance = new IController();

return true;
}


For some reason this started failing when compiling Daedalus PSP under OSX:


Source/Core/PIF.cpp: At global scope:
Source/Core/PIF.cpp:250: error: specialization of 'static bool
daedalus::CSingleton<T>::Create() [with T = CController]' in different namespace
Source/Core/PIF.cpp:250: error: from definition of 'static bool
daedalus::CSingleton<T>::Create() [with T = CController]'


Rather than being an OSX issue, I suspect that the reason this error started occurring was actually due to the PSPSDK using an updated version of GCC which is a bit stricter than the version I was using on Windows. Regardless, the fix was easy - the code just needed to be wrapped in the 'daedalus' namespace:


namespace daedalus
{
template<> bool CSingleton< CController >::Create()
{
DAEDALUS_ASSERT_Q(mpInstance == NULL);

mpInstance = new IController();

return true;
}
}
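For anyone who wants to try this in isolation, here is a self-contained sketch of the same pattern that should compile with a current GCC or Clang. CController and IController are empty stand-ins for the real Daedalus classes, a plain assert replaces DAEDALUS_ASSERT_Q, and I've added the definition of the static member, which the outline above leaves out:

```cpp
#include <cassert>

namespace daedalus
{
    // Stand-in types for illustration only.
    class CController { public: virtual ~CController() {} };
    class IController : public CController {};

    template< class T >
    class CSingleton
    {
    public:
        static bool Create();

        static T * Get() { return mpInstance; }

    private:
        static T * mpInstance;
    };

    // Definition of the static member for every instantiation.
    template< class T > T * CSingleton< T >::mpInstance = nullptr;

    // The explicit specialisation lives in the same namespace as the
    // template, which is what the stricter compiler insists on.
    template<> bool CSingleton< CController >::Create()
    {
        assert(mpInstance == nullptr);

        mpInstance = new IController();

        return true;
    }
}
```

Callers then use daedalus::CSingleton< daedalus::CController >::Create() once at startup and Get() everywhere else.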


(I've recently been going off the singleton pattern, but that's another story :)

With these changes Daedalus PSP compiles perfectly under OSX. On my 2.4 GHz Macbook Pro it takes just under 50 seconds. On my 2.4GHz Windows machine it takes over 2 minutes to compile, so I'm very impressed with the results.

I believe I've checked in all the required changes to the Daedalus SVN repository on SourceForge. If you decide to try compiling Daedalus PSP under OSX, let me know how you get on via the comments page (I'll be rejecting any off-topic comments to try and keep the discussion constructive).

-StrmnNrmn

Wednesday, August 01, 2007

333 MHz

On startup Daedalus increases the default clock frequency of the PSP's cpu from 222MHz to 333MHz for a 'free' 50% speedup. I use 'free' in quotes because this comes at the expense of drawing more power so the battery runs out of charge faster.

I've been asked if I'm going to support 333MHz many times, so I wanted to put this question to rest once and for all. The answer is yes - I believe this has been the case since R1 :)

-StrmnNrmn

Monday, July 30, 2007

Custom Controller Configurations

A number of people have asked if I would be adding support for user-defined controller configurations in a future release of Daedalus. As it happens Daedalus has supported user-defined controller configurations from R7 onwards. From the readme file:


As of R7 Daedalus now allows user-configurable controls to be specified.
The desired controls can be chosen from the Rom Settings screen.

In order to define your own controller configuration you need to add a
new .ini file to the Daedalus/ControllerConfigs directory. There are a
few examples provided which should give an overview of what is possible.
I will look at providing a more thorough tutorial shortly.


The format really is quite simple, but it is very flexible and allows for a number of advanced configurations. Here is a simple configuration which is distributed with Daedalus (ControllerConfigs/dpad.ini):


Name=DPad
Description=By default the PSP DPad maps to the N64 DPad. Hold Circle to map to the CButtons.

[Buttons]
N64.Start = PSP.Start
N64.A = PSP.Cross
N64.B = PSP.Square
N64.Z = PSP.Triangle
N64.LTrigger = PSP.LTrigger
N64.RTrigger = PSP.RTrigger
N64.Up = !PSP.Circle & PSP.Up
N64.Down = !PSP.Circle & PSP.Down
N64.Left = !PSP.Circle & PSP.Left
N64.Right = !PSP.Circle & PSP.Right
N64.CUp = PSP.Circle & PSP.Up
N64.CDown = PSP.Circle & PSP.Down
N64.CLeft = PSP.Circle & PSP.Left
N64.CRight = PSP.Circle & PSP.Right


The file starts with two lines defining the name and description for the controller config. These strings are used in the UI when selecting configurations. The '[Buttons]' block defines the mapping from PSP controls to N64 controls. In this particular configuration the N64 d-pad is mapped to the PSP d-pad when the circle button is released. When circle is pressed, the PSP d-pad maps to the N64 c-buttons. This config is particularly useful for games which make heavy use of the d-pad.

In the [Buttons] section, the left hand side of each rule must consist of one of the following N64 control names:

Name            Description
N64.Start       The N64's start button
N64.A           The N64's A button
N64.B           The N64's B button
N64.Z           The N64's Z trigger
N64.LTrigger    The left trigger
N64.RTrigger    The right trigger
N64.Up          Up on the N64's d-pad
N64.Down        Down on the N64's d-pad
N64.Left        Left on the N64's d-pad
N64.Right       Right on the N64's d-pad
N64.CUp         The N64's C up button
N64.CDown       The N64's C down button
N64.CLeft       The N64's C left button
N64.CRight      The N64's C right button


N.B. There is currently no definition for the N64's analogue stick. By default this is always assumed to be bound to the PSP's analogue stick.

The right hand side of a rule consists of an expression defined from the following values:

Name            Description
PSP.Start       The PSP's start button
PSP.Cross       The PSP's cross button
PSP.Square      The PSP's square button
PSP.Triangle    The PSP's triangle button
PSP.Circle      The PSP's circle button
PSP.LTrigger    The PSP's left shoulder button
PSP.RTrigger    The PSP's right shoulder button
PSP.Up          Up on the PSP's d-pad
PSP.Down        Down on the PSP's d-pad
PSP.Left        Left on the PSP's d-pad
PSP.Right       Right on the PSP's d-pad


N.B. You cannot use the PSP's select button when defining controller configurations, as this is reserved for the emulator's use.

Values can be combined using a few simple operations to allow rules to be constructed with more complex behaviour. For instance, the following line:

N64.CUp = PSP.Circle & PSP.Up


Tells Daedalus to report that the N64 C up button is pressed when both the circle AND d-pad up buttons are pressed on the PSP. The NOT operator (!) can be used to invert a value, for instance:

N64.Up = !PSP.Circle & PSP.Up


This rule tells Daedalus to report that the N64 d-pad up button is pressed when the PSP's circle button is not pressed while the d-pad up button is being pressed.

The available logical operators are:

Operator       Description                                       Example
expr & expr    Returns the logical AND of the two expressions    N64.CUp = PSP.LTrigger & PSP.Up
expr | expr    Returns the logical OR of the two expressions     N64.A = PSP.Cross | PSP.Circle
!expr          Returns the logical NOT of the expression         N64.Up = !PSP.LTrigger & PSP.Up


You can also use parentheses to control the order of precedence, e.g.:

N64.CUp = PSP.Up | (PSP.LTrigger & PSP.Triangle)


This defines a rule stating that the N64 C up button is pressed when either up on the PSP d-pad is pressed, or when the left shoulder button and triangle are both pressed.
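To make the rule syntax a bit more concrete, here is a small C++ sketch of how the right hand side of a rule could be evaluated. This is not the parser Daedalus uses: the class and function names are invented for the example, and it assumes C-like precedence ('!' binds tightest, then '&', then '|'), which the readme doesn't spell out. The set passed in holds the names of the PSP controls that are currently pressed:

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical recursive-descent evaluator for expressions such as
// "PSP.Up | (PSP.LTrigger & PSP.Triangle)".
class RuleEvaluator
{
public:
    RuleEvaluator(const std::string & expr, const std::set<std::string> & pressed)
        : mExpr(expr), mPos(0), mPressed(pressed) {}

    bool Evaluate()
    {
        const bool result = ParseOr();
        SkipSpaces();
        if (mPos != mExpr.size())
            throw std::runtime_error("unexpected trailing input");
        return result;
    }

private:
    void SkipSpaces()
    {
        while (mPos < mExpr.size() && std::isspace(static_cast<unsigned char>(mExpr[mPos])))
            ++mPos;
    }

    // Consumes 'c' if it is the next non-space character.
    bool Accept(char c)
    {
        SkipSpaces();
        if (mPos < mExpr.size() && mExpr[mPos] == c) { ++mPos; return true; }
        return false;
    }

    bool ParseOr()      // lowest precedence: a | b
    {
        bool value = ParseAnd();
        while (Accept('|')) { const bool rhs = ParseAnd(); value = value || rhs; }
        return value;
    }

    bool ParseAnd()     // a & b
    {
        bool value = ParseUnary();
        while (Accept('&')) { const bool rhs = ParseUnary(); value = value && rhs; }
        return value;
    }

    bool ParseUnary()   // !a, (a) or a control name
    {
        if (Accept('!'))
            return !ParseUnary();
        if (Accept('('))
        {
            const bool value = ParseOr();
            if (!Accept(')'))
                throw std::runtime_error("expected ')'");
            return value;
        }
        return ParseName();
    }

    bool ParseName()    // e.g. PSP.Circle; true if that control is pressed
    {
        SkipSpaces();
        const std::size_t start = mPos;
        while (mPos < mExpr.size() &&
               (std::isalnum(static_cast<unsigned char>(mExpr[mPos])) || mExpr[mPos] == '.'))
            ++mPos;
        if (start == mPos)
            throw std::runtime_error("expected a control name");
        return mPressed.count(mExpr.substr(start, mPos - start)) != 0;
    }

    std::string           mExpr;
    std::size_t           mPos;
    std::set<std::string> mPressed;
};

bool EvaluateRule(const std::string & expr, const std::set<std::string> & pressed)
{
    return RuleEvaluator(expr, pressed).Evaluate();
}
```

With circle and d-pad up held, "PSP.Circle & PSP.Up" evaluates to true and "!PSP.Circle & PSP.Up" to false, which matches the behaviour of the dpad.ini rules described above.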

No single controller mapping scheme can be provided which works well across all games, but using custom controller configs it should be possible to create a mapping which works well for any given game.

Post any questions you might have in the comment pages, and I'll do my best to answer them. If you come up with a good controller config for a game, email me (my name @gmail.com) and I'll look at adding it for distribution with future releases of Daedalus.

-StrmnNrmn

(Apologies - for some reason my template in Blogger really screws up when I place tables in the post and inserts a ton of whitespace before the table. I've never quite figured out how to fix it, so we'll just have to live with it :)

(Fixed post date :)

Sunday, July 29, 2007

Recharged

It's been quite a while since my last update. I was starting to feel a little worn out from all the work I'd been putting into the emulator over the past few months so since releasing R12 I've been taking a bit of a break from Daedalus to unwind and recharge my batteries.

It's been really nice just taking a bit of a break to do a few different things with my spare time. I spent a short while in Spain with my sister, and since I've been back I've been catching up with a bit of reading, watched a load of TV that I'd queued up and played through a few games I'd had gathering dust for a while. Sadly my 360 succumbed to the Red Ring of Death last week so it seems I might have been pushing it a bit too hard :)

I found a new flat in Guildford which I'll be moving to at the end of August. I'm quite excited about the move as I'll save a lot of time commuting. It currently takes me a little over an hour to travel into work and it'll just be 15 minutes once I move. I'm hoping that cutting back on the commute should not only free up a couple of hours each day, but I'll also be a bit less knackered once I finish for the day.

I also picked up a Macbook Pro during my 'time off' and I've really been enjoying my new-found sense of computing freedom (I've been writing this post on the train on the way home from work.) It's the 17" 'lapzilla' model, and it's an absolute beast. It actually compiles Daedalus faster than my year old desktop PC which I find pretty impressive. I'll have some details and tutorials about compiling Daedalus PSP under OSX in the near future.

Now that I've had a bit of time off, I'm feeling very excited about getting cracking with R13. My main feeling is that I'd like to continue working on speeding up the emulator to try and improve the framerate for titles that are already working. From your comments on this blog and on other sites such as the DCEmu forums (the site seems to be down right now - I'll update with a link later) it seems that this is mostly what people are interested in seeing for the next release.

There are a number of different areas I can investigate to help improve performance. The two main possibilities I want to investigate are working on further dynarec improvements, and looking at making use of the Media Engine. To start with I'm going to explore both of these areas and try and figure out which would give the biggest 'bang for buck' for R13. I'll post an update on R13 when I have more details.

-StrmnNrmn

PS Thanks for all your comments while I've been away. I have about 100 left to approve which I'll start to go through now.

Wednesday, July 04, 2007

Daedalus disinformation

I was quite interested to read this article over at pspupdates.qj.net:


StrmnNrmn sent us an email recently that really got us thinking about how far Daedalus has come from its humble beginnings more than a year ago. Not only does he give us some prime information about the direction of Daedalus, but he also mentions just how long it might take him to actually get everything running properly.


I never emailed pspupdates.qj.net, so it looks like someone has been deliberately trying to mislead them. It goes on to say:

R13 might actually use Bios files now


I have no idea what this actually means! The information doesn't seem to be particularly malicious, but I didn't write it. Anything I have to say about Daedalus, I'll talk about here, on this blog. It'll be another week or so before I can update regularly, so any other 'information' you hear that's not directly from this blog is likely to be a hoax.

-StrmnNrmn

I can't seem to find any way of contacting the author of the story at QJ.net (Victor B), but if they read this I'd appreciate an update to the story linking to this response. Thanks!

Update: pspupdates.qj.net have posted an update to the story. I just wanted to express my thanks for clarifying the situation so quickly. Thanks Victor!

Wednesday, June 27, 2007

Daedalus PSP R12 Released

I've just finished uploading the latest build of Daedalus PSP:

Daedalus PSP R12 for v1.00 Firmware
Daedalus PSP R12 for v1.50+ Firmware
Daedalus PSP R12 Source

As usual it will take 20-30 minutes for the files to propagate across the Sourceforge mirrors, so please be patient :)

Here is the list of changes:


  • [!] Fixed issue preventing Goldeneye from being loaded.
  • [!] Fixed dynarec for Goldeneye.
  • [!] Fixed dynarec for Super Smash Bros.
  • [!] Fix various texturing issues with 4bpp and small or non power-of-2 textures.
  • [!] Fix TexRect instructions with negative s/t components.
  • [!] Fixed the HUD in Mario 64 (broken in R11.)
  • [!] Fixed lights in F3DEX2 microcodes.
  • [+] Correctly implement instruction fetch exceptions, improving compatibility.
  • [+] Improved floating point compatibility.
  • [+] Correctly handle mask_s/mask_t tile values.
  • [+] Implemented a few custom blend modes.
  • [+] Screenshots just cover visible viewport.


If you've been following the updates on this blog over the past month the most obvious change is that Super Smash Bros. is now running well with dynarec enabled, and many graphical glitches have now been resolved. The compatibility fixes were specifically aimed at Super Smash Bros. but may well fix issues with other roms too. Overall SSB is looking and playing much better than it was in R11, but even at 30-40fps it's still not running at full speed yet. There are a few graphical issues that still need resolving, but all in all it's starting to feel very playable with frameskip set to 1 or 2.

Goldeneye is also running in R12. Although the intro sequence is running very quickly and with few noticeable graphical issues, a lot more work is needed to get it running at a playable framerate in-game. I think it's a good start though, and something to get excited about for the future :)

Otherwise R12 just has a few minor compatibility and graphical fixes - there are no optimisations for this build.

As always, leave your feedback on the comments pages. I read all your comments and I'll do my best to reply to any questions you raise. I'm particularly interested to hear if any roms which were broken in previous releases are now running in R12.

Enjoy!

-StrmnNrmn

Saturday, June 16, 2007

Multiplayer Thoughts

In response to a recent post, Zeus asked an interesting question:


I know you've probably been bugged by people on this before, but how hard would multi-player be to implement? More importantly, do you think that there is enough bandwidth, and low enough lag, to allow a host-client multiplayer setup to be playable? (ie, with one psp "hosting" the game and doing the emulation, while the other(s) just receive screen-captures and send back user-input) Or do you think that distributing the computational workload would be a better approach? (at the very minimum, the audio processor shouldn't be too horrible to move to the client psp(s) )


I have thought about multiplayer a great deal, but I've never made any plans to work on it - there didn't seem to be much point getting multiplayer working before there were a few multiplayer games running quickly and glitch free. Now that MarioKart and Super Smash Bros. are both working reasonably well, there's obviously going to be a lot more demand for multiplayer support. Before I raise expectations and get anyone's hopes up, I should mention that this isn't likely to happen any time soon.

It would be possible to go down the route of having a host psp which performs emulation and broadcasts the screen to client psps. Sony's Remote Play between the PS3 and PSP shows that this kind of 'dumb terminal' approach can work in the right situations. That's with the PS3 doing the grunt work of compressing the framebuffer and sending it over to the PSP. For a PSP running Daedalus as a host, I'm not quite sure there would be enough spare horsepower to compress the framebuffer and audio and then send it to 1 or more connected clients.

Also, I don't think that it would be possible to decouple the audio processing from the main cpu thread so that this work could be distributed to one of the clients. Although the audio and graphics processing is notionally run in parallel on the RSP (the N64's coprocessor), access is still serialised between audio and graphics tasks so they have to be completed in order. Performing audio processing on a client PSP would just mean that graphics processing on the host would have to wait until the results of the audio processing were received.

The approach that I'd been considering was running Daedalus in lockstep across 2 or more connected PSPs. As I mentioned previously Daedalus can run deterministically if external inputs such as pad input and timing sources are synchronised. What this would mean would be that every time the rom queried the pad status, each connected psp would have to synchronise its view of the pad input with the host. This would mean sending just 64 bytes across the network from the client to the host and back. This information would have to be sent over a TCP connection rather than UDP as we have to ensure that every PSP sees the exact same input (actually, client->host communication is a bit less critical so it may be possible to transmit this information over UDP if it helps improve lag.)

One really cool feature about this approach is that as each PSP would be responsible for rendering its own display, it would be possible to scale up each viewport to fill the display; rather than playing 4-player Mario Kart or Goldeneye at 160x120, you'd be able to play at 320x240 (or 480x272 if you wanted to scale it up to entirely fill the PSP's screen).

As I said at the start of the post, this isn't something that's likely to happen any time soon, but if and when it does happen, it will be amazing :)

-StrmnNrmn

R12 Release Date

A number of people have been asking in the comments when R12 is going to be released.

There are still a number of things I want to work on. Now that Super Smash Bros. is running nice and quickly with dynarec enabled, I want to spend a week or so polishing the graphics and trying to make it as playable as possible. Although Goldeneye is running with dynarec in R12, it still needs a lot of work before it's playable, so I'm not going to spend any more time on it for R12.

I'm going on holiday at the end of June, so I'd like to have R12 released before then. I'll aim for next weekend (23rd/24th June) but it may end up being as late as the 26th/27th.

-StrmnNrmn

Thursday, June 14, 2007

Tracking down the SSB Dynarec Bug - Part 2

On Monday I talked about the fragment simulator and how this could be used to help track down bugs in the dynarec implementation. In this post I'm going to talk a bit about a tool I use mostly for regression testing, but also to help determine the exact point at which the fragment simulator and the interpretative core go out of sync. It's a bit of a long post, so apologies in advance :)

Daedalus can be compiled with a flag which enables a special 'synchronisation' mode. This build configuration creates an instance of a synchronisation class which can be initialised in one of two modes - either as a producer or as a consumer. At various points during program execution I pass information about the internal state of the emulator to the synchroniser for processing. In the case of the producer, it simply writes this data out to a file on disk. The consumer is a bit more interesting; it reads data of the required size from disk, and compares this 'baseline' value against the value provided by the emulator. If these two values are found to be different, the synchroniser knows that things have drifted out of sync and it can trigger a breakpoint and drop out into the debugger.
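To make the producer/consumer idea concrete, here is a minimal sketch rather than the actual Daedalus class. The `Synchroniser` name, its interface and the use of an in-memory byte vector in place of the file on disk are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical synchroniser; a byte vector stands in for the sync file on disk.
class Synchroniser
{
public:
    enum class Mode { Producer, Consumer };

    Synchroniser(Mode mode, std::vector<std::uint8_t>& file)
        : mMode(mode), mFile(file), mOffset(0) {}

    // Producer: append the value to the baseline and report success.
    // Consumer: compare the value with the baseline; false means out of sync
    // (the point where the real tool would trigger a breakpoint).
    template <typename T>
    bool SynchPoint(const T& value)
    {
        static_assert(std::is_trivially_copyable<T>::value, "raw bytes are compared");
        return SynchBytes(&value, sizeof(T));
    }

private:
    bool SynchBytes(const void* data, std::size_t size)
    {
        assert(data != nullptr && size > 0);
        const std::uint8_t* bytes = static_cast<const std::uint8_t*>(data);
        if (mMode == Mode::Producer)
        {
            mFile.insert(mFile.end(), bytes, bytes + size);
            return true;
        }
        if (mOffset + size > mFile.size())
            return false;  // baseline ran out: treat as out of sync
        const bool same = std::memcmp(mFile.data() + mOffset, bytes, size) == 0;
        mOffset += size;
        return same;
    }

    Mode mMode;
    std::vector<std::uint8_t>& mFile;
    std::size_t mOffset;
};
```

A first run constructs the object in `Producer` mode to build the baseline; a second run constructs it in `Consumer` mode over the same data and stops at the first `SynchPoint` call that returns false.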

This technique relies on the fact that the emulator is deterministic, i.e. running the emulator twice in a row with the same inputs generates exactly the same results. By 'inputs' I mean not just the same rom image; external inputs such as data from the controller must match exactly too. Obviously pressing buttons on the controller in exactly the same order with the same timings would be impossible to duplicate, so the other function the synchroniser performs is to record input from the pad in the case of the producer, or play input back in the case of the consumer. Other external input, such as calls to timer functions (e.g. time(), QueryPerformanceCounter() or rdtsc) can be synchronised in the same way.
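The record/playback half of this can be sketched in the same spirit. The `InputSynchroniser` class below is hypothetical and treats every external input as a single 32-bit value, which is a simplification made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical recorder for one external input source (pad state, timer, ...).
class InputSynchroniser
{
public:
    enum class Mode { Record, Playback };

    InputSynchroniser(Mode mode, std::vector<std::uint32_t>& log)
        : mMode(mode), mLog(log), mNext(0) {}

    // Record: store the live value and pass it through unchanged.
    // Playback: ignore the live value and return what the baseline run saw.
    std::uint32_t Synchronise(std::uint32_t liveValue)
    {
        if (mMode == Mode::Record)
        {
            mLog.push_back(liveValue);
            return liveValue;
        }
        assert(mNext < mLog.size() && "baseline has no more recorded input");
        return mLog[mNext++];
    }

private:
    Mode mMode;
    std::vector<std::uint32_t>& mLog;
    std::size_t mNext;
};
```

The emulator would route every pad read or timer query through `Synchronise()`, so that in playback mode the rom sees exactly the values recorded during the baseline run, whatever the hardware reports.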

The synchroniser works with as few or as many sync points as you provide. For debugging very simple problems, you can get away with just checking the value of the program counter as each instruction is executed. For more tricky problems you can end up adding many more sync points - for instance you can synchronise the entire register set after every instruction to ensure that the synchroniser catches any instruction which generates a different result from the baseline.

I add sync points to Daedalus using a set of macros. When synchronisation is enabled, the macros expand out to calls to a virtual method on a global instance of the synchroniser class. An example sync point in the code might look like this:


u32 pc = gCPUState.CurrentPC;

SYNCH_POINT( DAED_SYNC_REG_PC, pc, "PC doesn't match" );

OpCode op;
if( CPU_FetchInstruction( pc, &op ) )
{
    CPU_Execute( pc, op );
}


The interesting line here is the SYNCH_POINT macro, which synchronises on the current program counter value. For producers, this just writes the value of 'pc' to disk. For consumers, it checks that the value we have for 'pc' matches the one read from disk.

The DAED_SYNC_REG_PC argument is simply a flag to describe what is being synchronised. Another global constant allows easy control of what is synchronised:


enum ESynchFlags
{
    DAED_SYNC_NONE        = 0x00000000,

    DAED_SYNC_REG_GPR     = 0x00000001,
    DAED_SYNC_REG_CPU0    = 0x00000002,
    DAED_SYNC_REG_CCR0    = 0x00000004,
    DAED_SYNC_REG_CPU1    = 0x00000008,
    DAED_SYNC_REG_CCR1    = 0x00000010,

    DAED_SYNC_REG_PC      = 0x00000020,
    DAED_SYNC_FRAGMENT_PC = 0x00000040,
};

// Compile-time selection of which synch points are active
static const u32 DAED_SYNC_MASK(DAED_SYNC_REG_PC);

// The test is against a constant, so masked-out synch points can be compiled away
#define SYNCH_POINT( flags, x, msg ) \
    if ( DAED_SYNC_MASK & (flags) ) \
        CSynchroniser::SynchPoint( x, msg )


If I want to enable more thorough debugging, I can change DAED_SYNC_MASK and OR in more values:


static const u32 DAED_SYNC_MASK(DAED_SYNC_REG_PC|DAED_SYNC_REG_GPR);


Changing the mask value requires the emulator to be rebuilt from scratch and the baseline synch file to be recreated. This is a bit time consuming but doing it in this way means that the compiler can optimise out any synch points which we aren't interested in, keeping things running as quickly as possible.
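The same compile-time filtering can also be expressed without a macro. The sketch below is a template-based alternative, not how Daedalus does it; `kSyncMask`, `SynchLog()` and the templated `SynchPoint` are names invented for the example, and only two of the flags are reproduced.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum ESynchFlags : std::uint32_t
{
    DAED_SYNC_NONE    = 0x00000000,
    DAED_SYNC_REG_GPR = 0x00000001,
    DAED_SYNC_REG_PC  = 0x00000020,
};

// Compile-time mask: only program counter sync points are enabled.
constexpr std::uint32_t kSyncMask = DAED_SYNC_REG_PC;

// Hypothetical sink standing in for the synchroniser; it just collects values.
inline std::vector<std::uint32_t>& SynchLog()
{
    static std::vector<std::uint32_t> log;
    return log;
}

// Flags is a template argument, so (kSyncMask & Flags) is a constant and an
// optimising compiler can drop the body of any masked-out sync point.
template <std::uint32_t Flags>
inline void SynchPoint(std::uint32_t value)
{
    if ((kSyncMask & Flags) != 0)
    {
        SynchLog().push_back(value);
    }
}
```

With this mask, a call such as `SynchPoint<DAED_SYNC_REG_PC>(pc)` records the value, while `SynchPoint<DAED_SYNC_REG_GPR>(hash)` does nothing. As with the macro, changing the mask means rebuilding and regenerating the baseline.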

One problem with this technique is that the synchroniser can quickly generate a massive amount of data, so much that most of the execution time is spent shifting this data to or from disk, slowing debugging to a crawl. In the example I gave on Monday, it can sometimes take over 500 million instructions before things go out of sync. Even when just synchronising on the program counter (4 bytes per instruction), that's around 2GB of data that needs to be read/written to disk. When you throw in more sync points such as register sets (the GPR registers on their own are around 256 bytes) this can very quickly become impractical. To get around these limitations in Daedalus I gzip the stream of data on the fly which compresses the data significantly. Another trick I use is to hash each register set to a 32-bit value and synchronise on this value instead. When using both these techniques the sync files typically end up around 100-200MiB, which is much more manageable.
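To show the register-hashing trick and the arithmetic behind those sizes, here is a small sketch. The post doesn't say which hash Daedalus uses; 32-bit FNV-1a is swapped in here purely because it is short, and `GprSet`, `HashRegisters` and `SyncBytes` are names made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 32 general purpose registers of 64 bits each: 256 bytes per sync point.
using GprSet = std::array<std::uint64_t, 32>;
static_assert(sizeof(GprSet) == 256, "GPR set should be 256 bytes");

// 32-bit FNV-1a over the register bytes (an assumed choice of hash).
inline std::uint32_t HashRegisters(const GprSet& regs)
{
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (std::uint64_t reg : regs)
    {
        for (int shift = 0; shift < 64; shift += 8)
        {
            hash ^= static_cast<std::uint8_t>(reg >> shift);
            hash *= 16777619u;  // FNV prime
        }
    }
    return hash;
}

// Uncompressed size of a sync file with one sync point per instruction.
constexpr std::uint64_t SyncBytes(std::uint64_t instructions, std::uint64_t bytesPerPoint)
{
    return instructions * bytesPerPoint;
}
```

For 500 million instructions, `SyncBytes` gives 2,000,000,000 bytes when synchronising a 4-byte value (a PC or a register hash) and 128,000,000,000 bytes when writing the full 256-byte register set, which is why hashing matters before compression is even considered. The trade-off is that a hash mismatch says that some register differs, but not which one.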

One of the main uses of this synchronisation code is for regression testing optimisations I've made. I can take a 'known good' build of the emulator and initialise the synchronisation class as a producer to generate a baseline sync file. I can then take a modified version of Daedalus with the optimisations that I want to test, and initialise the synchroniser as a consumer. If the synchroniser detects that things have gone out of sync, then I know that my changes are buggy, and I can investigate why they're not working as planned. It's worth noting that even if everything stays in sync, this isn't a guarantee that my changes are bug-free, but it's a pretty good indication that they're ok.

I also use the synchronisation code to debug tricky dynarec issues. When debugging these types of problems I typically start off by disabling the dynarec engine and setting up the synchroniser to produce a baseline for testing. I'll then re-enable dynarec, but using the fragment simulator with precise interrupt handling (see the end of Monday's post for more on this) and run Daedalus with the synchroniser in consumer mode. Theoretically, as soon as the dynarec code gets out of sync with the interpretative core, the breakpoint triggers and I can investigate things more closely in the debugger.

This is exactly the process I used to track down the Super Smash Bros. bug. When I ran the emulator with the synchroniser in consumer mode, it detected that the program counter was different from the expected baseline value after exactly 387,939,387 instructions had been executed. I'd like to think that an error rate of 2.57e-7% wasn't all that bad, but apparently it is :)

Now that I knew the point at which the emulator was going out of synch, I set a few breakpoints in the emulator to see what exactly was happening. My usual trick is to disassemble the executed instructions just before and after things diverge, and see what's different. Here are snippets from the 'good' and 'bad' logs as things go out of sync:


Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: JAL 0x80131fb0 ?
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80131fb0: ADDIU sp = sp + 0xffd8
Count 171f7c3c: PC: 80131fb4: SW ra -> 0x0024(sp)
Count 171f7c3d: PC: 80131fb8: SW s0 -> 0x0020(sp)
Count 171f7c3e: PC: 80131fbc: CLEAR a0 = 0
Count 171f7c3f: PC: 80131fc0: CLEAR a1 = 0



Count 171f7c35: PC: 80132500: LW ra <- 0x0014(sp)
Count 171f7c36: PC: 80132504: ADDIU sp = sp + 0x0018
Count 171f7c37: PC: 80132508: JR ra
Count 171f7c38: PC: 8013250c: NOP
Count 171f7c39: PC: 80132ae8: MTC1 at -> FP06
Count 171f7c3a: PC: 80132aec: NOP
Count 171f7c3b: PC: 80132af0: SWC1 FP06 -> 0x0018(a0)
Count 171f7c3c: PC: 80132af4: LBU v0 <- 0x4ad1(v0)
Count 171f7c3d: PC: 80132af8: ADDIU at = r0 + 0x0008
Count 171f7c3e: PC: 80132afc: BEQ v0 == at --> 0x80132b24
Count 171f7c3f: PC: 80132b00: ADDIU at = r0 + 0x0009


The synchroniser detected that the PCs were out of sync at count 171f7c3b. In the good trace (top) the PC is 0x80131fb0, but in the bad trace it's 0x80132af0. If you have particularly sharp eyes, you'll notice something else - two instructions before the code goes out of sync (count 171f7c39), the good trace executes a jump instruction to 0x80131fb0, but the bad trace is performing a MTC1 op (Move To Coprocessor 1).

This provides a particularly good example of one of the main weaknesses with the synchroniser - it's only as good as the synch points you set up. Because I was just synching on the program counter, it didn't detect the fact that the emulator executed an entirely different opcode two instructions previously. In this particular case I was fortunate in that the real source of the problem was very close to the location identified by the synchroniser, but sometimes the cause and effect can be separated by many thousands of instructions.
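This weakness is easy to reproduce in miniature. The sketch below compares two traces using a selectable set of fields; `TraceEntry`, `SyncFields` and `FirstMismatch` are hypothetical names, not part of Daedalus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One executed instruction: where it was fetched from and what it was.
struct TraceEntry
{
    std::uint32_t pc;
    std::uint32_t opcode;
};

enum SyncFields : std::uint32_t
{
    kSyncPC     = 1,
    kSyncOpcode = 2,
};

// Returns the index of the first entry that differs in any selected field,
// or the length of the shorter trace if no selected field ever differs.
inline std::size_t FirstMismatch(const std::vector<TraceEntry>& good,
                                 const std::vector<TraceEntry>& bad,
                                 std::uint32_t fields)
{
    assert(fields != 0 && "need at least one sync point");
    const std::size_t count = std::min(good.size(), bad.size());
    for (std::size_t i = 0; i < count; ++i)
    {
        if ((fields & kSyncPC) && good[i].pc != bad[i].pc)
            return i;
        if ((fields & kSyncOpcode) && good[i].opcode != bad[i].opcode)
            return i;
    }
    return count;
}
```

Fed with traces shaped like the two logs above, checking `kSyncPC` alone reports the mismatch two entries later than checking `kSyncPC | kSyncOpcode`, because the differing opcode sits at a PC that both traces share.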

Fortunately it's easy enough to add new synch points in the code to detect issues like this, but adding too many synch points causes the emulator to slow to a crawl and makes debugging impractical. I've found the best approach is to start off with as few synch points defined as possible (ideally just the program counter) and slowly introduce more synch points as required. This is all very easy to do using the DAED_SYNC_MASK flag discussed above.

Getting back to SSB, it looked like I had found the root cause of the problem - somehow the rom was replacing the instructions in memory, essentially a form of self-modifying code (it's more likely it was just loading a new section of code into RAM from ROM, but it's still essentially self-modifying). The dynarec system was oblivious to these changes and so it ended up trying to execute stale instructions that it had cached when creating the fragment, potentially many thousands of cycles ago.

Dealing with self-modifying code in dynamic code generators is generally very tricky. In Daedalus I've been relying on the fact that most roms are well-behaved and flush the instruction cache when they modify memory containing executable code. When I detect an instruction cache invalidate (through the MIPS CACHE opcode) I simply dump the entire contents of the fragment cache and start from scratch. This might sound a little heavy handed, but the way that I link fragments together makes it very hard to unlink small sections of code that have been invalidated. Flushing the cache is very quick, safe and has a few advantages such as purging cold traces that are no longer being executed.
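The flush-everything policy can be sketched in a few lines. This is a simplified model under stated assumptions: `Fragment`, `FragmentCache` and `OnInstructionCacheInvalidate` are hypothetical names, and real fragments hold generated native code and links to other fragments rather than a list of cached opcodes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// A recorded trace: entry address plus the instructions captured at record time.
struct Fragment
{
    std::uint32_t entryPC;
    std::vector<std::uint32_t> cachedOps;
};

class FragmentCache
{
public:
    void Insert(Fragment fragment)
    {
        const std::uint32_t pc = fragment.entryPC;
        mFragments[pc] = std::move(fragment);
    }

    // Returns the fragment starting at pc, or nullptr if none is cached.
    const Fragment* Lookup(std::uint32_t pc) const
    {
        const auto it = mFragments.find(pc);
        return it == mFragments.end() ? nullptr : &it->second;
    }

    // Called when the emulated CPU invalidates its instruction cache. Nothing
    // is unlinked selectively; every fragment is dropped and re-recorded later.
    void OnInstructionCacheInvalidate() { mFragments.clear(); }

    std::size_t Size() const { return mFragments.size(); }

private:
    std::unordered_map<std::uint32_t, Fragment> mFragments;
};
```

The design relies entirely on the rom issuing the invalidate. If the rom rewrites code without one, `Lookup` keeps returning a fragment whose `cachedOps` no longer match memory.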

Ironically, the reason the dynarec was failing to cope with SSB wasn't due to a bug in Daedalus - it was due to a bug in SSB that just never happened to be a problem on a real N64. After updating memory with the new instructions SSB should have been invalidating the instruction cache to ensure that it didn't contain stale code, but for whatever reason it failed to do this. The only reason the rom runs correctly on a real N64 is that by the time it comes to execute the modified instructions, the instruction cache has been refilled a number of times and so the stale instructions are no longer cached.

Even though this isn't Daedalus's bug, it still needs to work around the problem. I'll leave this discussion for a future post though - this one is long enough as it is :)

-StrmnNrmn

Tuesday, June 12, 2007

Tracking down the SSB Dynarec Bug

Yesterday I said I'd provide some more details about the Super Smash Bros. dynarec fix. The actual fix is fairly straightforward, but I thought the process of tracking down the issue was quite interesting and worthy of a couple of blog posts.

When I first started looking at SSB I noted that although the game ran fine without dynarec, it would always hang when trying to enter the main menu with dynarec enabled.

I've been programming professionally for around 6 years now and I can safely say that debugging dynarec bugs is one of the hardest categories of problems I've ever had to work on. For a start, because the code is generated on the fly, you don't have the luxury of source level debugging, and without spending time reverse engineering the original rom image, you don't even know what the generated dynarec code is meant to be doing. It's very much like working blindfolded.

And it gets even worse. I've fixed dynarec problems in the past which were the result of generating incorrect code for a fragment over 500 million instructions into emulation. This would be bad enough, but it can be many thousands of instructions later before emulation finally diverges from the correct path. Just identifying the exact point at which the emulation starts to diverge from the correct sequence of instructions can be like finding a needle in a particularly large haystack. While blindfolded :)

Over the years of trying to debug problems like these I've built up a set of tools and learned a few tricks along the way which you might find quite interesting. Although I'm going to talk about them in the context of tracking down this dynarec issue, I've found some of the techniques useful in solving other problems so you might find other ways of applying them too.

One of the first things I do when trying to identify a dynarec issue with Daedalus is to see if the problem is reproducible on the PC build of the emulator. Although it is possible to use GDB with PSPLink, I've never got this up and running and I'm much more comfortable debugging with Visual Studio. Also, working with the PC build is usually much faster than working with the PSP build (debug builds run around 10x faster on the PC, and build times are much quicker.)

Not all dynarec issues can be debugged in this way - the PSP and PC builds have different code generation back-ends (i.e. MIPS and x86 code generation respectively) so bugs in the MIPS code generation won't usually be reproducible in the PC build. The dynarec system in Daedalus shares a common frontend (trace selection and recording) between the two platforms, which means that if I can reproduce the problem on both platforms, I can narrow down the likely location of the bug to this area.

Fortunately this particular bug manifested itself in both the PC and the PSP builds, so I knew that if I fixed the bug on the PC build, it should fix the PSP build too. What I needed to find out next is what the emulator was doing differently when dynarec was enabled compared to when it was disabled.

If dynarec is running without errors, then the sequence of executed instructions should exactly match that executed with dynarec disabled. If I could log details about all the instructions executed with dynarec disabled, and again with dynarec enabled, I should be able to compare the two logs to figure out the exact point at which dynarec is going out of sync. This all relies on the fact that the emulator is totally deterministic, i.e. that running the emulator twice in succession with the same settings should give exactly the same results.
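Comparing the two logs boils down to finding the first position at which they differ. A minimal sketch, assuming each log has been reduced to a list of program counter values (the `FirstDivergence` helper is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the index of the first instruction at which the two PC logs differ,
// or -1 if they agree over the length of the shorter log.
inline std::ptrdiff_t FirstDivergence(const std::vector<std::uint32_t>& interpreted,
                                      const std::vector<std::uint32_t>& dynarec)
{
    const std::size_t count = std::min(interpreted.size(), dynarec.size());
    const auto end = interpreted.begin() + static_cast<std::ptrdiff_t>(count);
    const auto where = std::mismatch(interpreted.begin(), end, dynarec.begin());
    return where.first == end ? -1 : where.first - interpreted.begin();
}
```

This only tells you where the logs first disagree; the real cause may be earlier if the logged values were not detailed enough to expose it.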

Unfortunately, for a variety of reasons my dynarec solution doesn't produce identical results to interpretation, the main reason being that for performance reasons I can only handle vertical blank and timer interrupts on the boundaries between fragments. For example, with dynarec disabled, the first vertical blank interrupt might occur exactly on the 625,000th instruction, but with dynarec enabled it might not occur until the 625,015th instruction. This means that the logs diverge at the instant the first VBL fires, and never regain synchronisation.
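The timing difference can be modelled with a short function. This is a simplified sketch, assuming execution is a plain sequence of fragments with known instruction counts; `InterruptHandledAt` and `kNotHandled` are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sentinel: the interrupt was not reached within the fragments supplied.
constexpr std::uint64_t kNotHandled = UINT64_MAX;

// Returns the instruction count at which an interrupt due at dueAt is serviced.
inline std::uint64_t InterruptHandledAt(std::uint64_t dueAt,
                                        const std::vector<std::uint32_t>& fragmentLengths,
                                        bool precise)
{
    if (precise)
        return dueAt;  // interpreter-style: handled on the exact instruction

    std::uint64_t executed = 0;
    for (std::uint32_t length : fragmentLengths)
    {
        assert(length > 0);
        executed += length;
        if (executed >= dueAt)
            return executed;  // first fragment boundary at or after dueAt
    }
    return kNotHandled;
}
```

For instance, with fragments of 624,990 and 25 instructions, an interrupt due at instruction 625,000 is serviced at 625,015 in the imprecise mode and at 625,000 in the precise mode, which mirrors the example above.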

When I was originally developing the new dynarec system I put a lot of effort into writing a fragment simulator, the idea being that rather than executing the native assembly code for a given trace, I could keep track of the instructions making up the trace and interpret these individually instead. Theoretically fragment simulation is identical to dynarec code execution, even down to the way I handle VBLs and timer interrupts, and it's been very useful at identifying bugs in the dynarec code generation. What's particularly useful about fragment simulation however is that I can enable a setting which makes it handle interrupts exactly in the same way as the non-dynarec core, i.e. interrupts are handled precisely rather than on fragment boundaries.

Essentially Daedalus has four modes of operation:


  • Dynarec + fragment execution
  • Dynarec + fragment simulation (imprecise interrupt handling)
  • Dynarec + fragment simulation (precise interrupt handling)
  • Interpretative core


This tool is particularly powerful, because if I can ensure that dynarec+fragment execution is equivalent to dynarec+fragment simulation, and that dynarec+fragment simulation is equivalent to running the interpretative core, then I can use the transitive properties of these relations to ensure that dynarec+fragment execution is equivalent to running the interpretative core. Fragment simulation allows me to bridge the gap between these two modes of operation which would otherwise be very difficult to compare.

I think that's long enough for one post. Tomorrow I'll talk about how I used this technique to help track down the SSB dynarec bug.

-StrmnNrmn