Sunday, May 21, 2006

Branch delay instructions

Well it's been a little longer than I intended since my last post. I've been busy with work, and when I've had free time I've found it hard to drag myself away from the compiler for long enough to post an update :)

In my last post on the subject, I left saying that I was trying to iron out bugs in the fragment 'simulator' and was about to start work on native code generation. I want to pick up from that point, but first I wanted to talk about about branch delay instructions and how they effect the implementation of emulators and the new DynaRec work I've been doing.

The couple of bugs in the fragment simulator were indeed to do with exceptions and interrupts triggering in unexpected places. The problem occurred if an exception was triggered in the branch delay slot following a jump instruction.

In case you're not familiar with the MIPS architecture, after branch instruction has been executed (but before control arrives at the target instruction), the CPU executes the instruction immediately following the branch. So to give an example, this is a block of mips code that calls the function foo( 6, 3 ):

li a0, 6 # load the first argument
jal foo # call foo
li a1, 3 # load the second argument
# the result is returned in v0 here

jr ra # return
add v0, a0, a1 # add the arguments, store result in v0

If you've never read MIPS assembly before this will probably look a little strange.
When we call foo with 'jal foo' (Jump and Link), we don't set the second argument until after the jump to 'foo'! Notice also that we calculate the return value for foo after we return with the Jump Register (JR) instruction.

The reason this code works is because of the branch delay slots. When the call to foo is executed ('jal foo') the CPU keeps going for one more instruction and executes 'li a1, 3'. Control then jumps to the start of 'foo', where the CPU immediately executes 'jr ra', jumping back to where it just came from. Again, the CPU executes the following instruction, calculating the sum of the arguments and returning the result in v0.

Although this seems rather pointless and unnecessarily complicated, it serves a very good purpose. With most modern CPUs the instruction execution is pipelined, which means that the processor breaks down the work into discrete chunks (fetch instruction, decode instruction, execute, commit etc), and executes multiple instructions in parallel. Wikipedia (as always) explains this in a lot more detail.

With many architectures the CPU has to throw away the contents of the pipeline when a branch is executed, as the subsequent instructions may refer to data (register or memory contents) that is invalid until after the call has completed. With certain architectures (including MIPS), the engineers decided that it was wasteful to throw away all this work on every branch, and designed the processor to continue processing the pipeline until the subsequent instruction had completed.

What this means is that when writing code for MIPS processors, you have to be careful that your branch delay slot doesn't have any unintended side effects. For maximum efficiency you also have to be careful to try and do useful work in the branch delay slot (rather than just filling it with a NOP for instance). Normally this isn't an issue as the compiler generates all the code for you, but it's certainly an issue if you're writing assembly. It's also been a very important consideration when I've been writing the code generation for the new DynaRec (I'll cover this in a later post).

So, how do branch delay instructions effect emulators? Although they help pipelined CPUs to improve performance, they require emulators to do bit of extra bookeeping and this slows things down slightly and adds complexity. Every time daedalus interprets an instruction it has to check if a branch was taken, and if so set a flag to indicate that a branch delay instruction is due to be executed. When the branch delay instruction is executed, the flag is cleared and the emulator sets the program counter to the original target of the branch.

This is all fairly straight forward, but complications arise when exceptions fire as a branch delay slot is due to be executed. In the example above, what would happen if an interrupt fired immediately before the branch delay instruction 'li a1, 3' was executed? Normally once the interrupt has been handled, the operating system restores control by jumping to the instruction that was due to be executed, allowing the program to run from that point. If this happened in our example above, a1 would be loaded with the value of 3, but the jump to 'foo' that was originally delayed will never take place. The code would continue running without ever calling 'foo'!

In order to handle this situation, when an exception (or interrupt) is triggered, the CPU sets a flag in the 'cause' register. The operating system keeps track of this flag, along with the program counter where the exception fired and various other bits of information. When it's done handling the exception, the operating system uses this information to allow the CPU to correctly restore control to the code that was originally executing. In our example above, the CPU would see that the branch delay flag is set in the cause register, and restore control to the jal instruction immediately preceeding the instruction where the interrupt occured.

It should be fairly clear that there's quite a lot that must be taken care of by the emulator to make sure that the program executes as intended. This is all fairly simple to keep track of when processing instructions individually, but it becomes more complicated when you start to dynamically recompile the code.

The bugs that I mentioned at the start of this post were caused because I wasn't correctly setting the branch delay flag for instructions causing exceptions in branch delay slots. When I build fragments for the dynamic recompiler I avoid triggering certain types of interrupts (e.g. vertical blank and timer interrupts) in the middle of the fragment to reduce the overhead of having to add the code to handle these situations. Unfortunately there are many other types of exceptions that can occur in the middle of a fragment, such as page faults, I/O completion interrupts etc. It turns out these are very rare, but unfortunately not rare enough to save me from a full day of debugging :(

This was a little more detailed than I was originally planning. Branch delay instructions are quite a simple concept on the surface, but they can cause all sorts of complexities when it comes to dynamic recompilation. Fortunately I feel very confident that I've now fixed all the branch delay related bugs in the dynamic recompiler, so hopefully I shouldn't have to think too much about them again in the near future.

I'll follow up this post with a quick update on the state of the new dynarec engine.


Wikipedia entry on branch delay slots

No comments: