Fault-Tolerance Mechanisms in the SB-PRAM Multiprocessor
The SB-PRAM is an experimental multiprocessor architecture with a shared address space and synchronously running threads, i.e. it gives the illusion of working on a PRAM. A 4-processor prototype has been completed, while a 64-processor prototype is under construction. We investigate the detection and handling of single-bit errors occurring during the transmission of packets in the interconnection network. We analyze the impact of an error on the different parts of a packet and derive several strategies to recover from such an error. The strategies range from single-bit correction codes to checkpointing the application and rolling back in case of an error. We find that the changes necessary in hardware and system software are small. In particular, none of the ASICs designed for the SB-PRAM has to be changed. The runtime overhead due to the fault-tolerance mechanisms is negligible. Finally, we sketch how these strategies can be extended to cover component failures.
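To illustrate what a single-bit correction code does, the following is a minimal sketch of a textbook Hamming(7,4) code. It is not the code used in the SB-PRAM; the abstract does not specify the packet format or the code chosen, so the function names `hamming74_encode` and `hamming74_decode` and the 4-bit payload are assumptions made purely for illustration.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) into a 7-bit codeword.

    Returns a list of 7 bits for codeword positions 1..7.
    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6, 7.
    """
    code = [0] * 8  # index 0 is unused so indices match positions 1..7
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Decode a 7-bit codeword, correcting at most one flipped bit.

    Returns (nibble, syndrome). The syndrome is 0 if no error was seen,
    otherwise the 1-based position of the bit that was corrected.
    """
    code = [0] + list(bits)
    # XOR of the positions of all set bits is 0 for a valid codeword
    # and equals the error position after a single bit flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # flip the faulty bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1  # simulate a single-bit transmission error at position 5
    print(hamming74_decode(word))
```

A code of this kind corrects the error at the receiver without retransmission, at the cost of extra bits per packet. Checkpointing and rollback, the other end of the range named in the abstract, avoids that per-packet cost but repeats work after an error.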
Use and reproduction:
All rights reserved