Year of publication: 1390 (Solar Hijri)

Published in: Fifth International Conference on Advances in Science and Technology

Number of pages: 9

Author(s):

H. Hamidi – Islamic Azad University, Doroud Branch

Abstract:

We present a new approach to algorithm-based fault tolerance (ABFT) for high-performance computing systems. The ABFT approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, into a system that provides a specific level of fault tolerance, namely recovery. We have implemented a systematic procedure for introducing structured redundancy into ABFT. ABFT has been recommended as a cost-effective concurrent error detection scheme, and it offers a novel computing paradigm for providing fault tolerance in numerical algorithms. To that end, a matrix-based model has been developed and, based on it, algorithms for both the design and analysis of ABFT systems are formulated.
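The abstract does not spell out the paper's matrix-based model, so the following is only an illustrative sketch of the general idea of structured redundancy for concurrent error detection. It uses the classic checksum-matrix form of ABFT for matrix multiplication, not the author's procedure. All function names (`column_checksum`, `row_checksum`, `locate_error`, `correct_error`) are hypothetical, and the sketch assumes at most one corrupted data element in the result.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row[:] + [sum(row)] for row in b]


def locate_error(c, tol=1e-9):
    """Return (row, col) of a single corrupted data element of the
    full-checksum matrix c, or None if all checksums are consistent."""
    n, m = len(c) - 1, len(c[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n)
                if abs(sum(c[i][:m]) - c[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c[i][j] for i in range(n)) - c[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    # Assumption of this sketch: anything else is out of scope.
    raise ValueError("error pattern is not a single data-element fault")


def correct_error(c, pos):
    """Recover the corrupted element from its row checksum, in place."""
    i, j = pos
    m = len(c[0]) - 1
    c[i][j] = c[i][m] - sum(c[i][k] for k in range(m) if k != j)


# Usage: multiply encoded operands, inject a fault, detect and repair it.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(column_checksum(A), row_checksum(B))  # 3x3 full-checksum matrix
C[0][1] += 5                # simulated fault in one data element
pos = locate_error(C)
print(pos)                  # (0, 1)
correct_error(C, pos)
print(C[0][1])              # 22
```

The redundancy here is one extra row and one extra column, so detection costs a handful of additions on top of the multiplication itself. Faults that hit a checksum entry, or more than one data element, are deliberately rejected by this sketch rather than handled.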