AbstractThe CRCW PRAM under dynamic fail-stop (no restart) processor behavior is a fault-prone multiprocessor model for which it is possible to both guarantee reliability and preserve efficiency. To handle dynamic faults some redundancy is necessary in the form of many processors concurrently performing a common read or write task. In this paper we show how to significantly decrease this concurrency by bounding it in terms of the number of actual processor faults. We describe a low concurrency, efficient and fault-tolerant algorithm for the Write-All primitive: ``using <= N processors, write 1's into N locations''. This primitive can serve as the basis for efficient fault-tolerant simulations of algorithms written for fault-free PRAMs on fault-prone PRAMs. For any dynamic failure pattern F, our algorithm has total write concurrency <= |F| and total read concurrency <= 7 |F| log N, where |F| is the number of processor faults (for example, there is no concurrency in a run without failures); note that, previous algorithms used Omega(N log N) concurrency even in the absence of faults. We also describe a technique for limiting the per step concurrency and present an optimal fault-tolerant EREW PRAM algorithm for Write-All, when all processor faults are initial.
Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); F.1.2 [Computation by Abstract Devices]: Modes of Computation; F.2.3 [Analysis of Algorithms and Problem Complexity]: Tradeoffs among Complexity Measures
Additional Key Words and Phrases: parallel computation, fault-tolerance, concurrency, robust algorithms
Selected references
- Miklos Ajtai, James Aspnes, Cynthia Dwork, and Orli Waarts. A theory of competitive analysis for distributed algorithms. In 35th Annual Symposium on Foundations of Computer Science, pages 401-411, Santa Fe, New Mexico, 20-22 November 1994. IEEE.
- Richard J. Anderson and Heather Woll. Wait-free parallel algorithms for the union-find problem. In Proceedings of the Twenty Third Annual ACM Symposium on Theory of Computing, pages 370-380, New Orleans, Louisiana, 6-8 May 1991.
- Y. Aumann, Z. M. Kedem, K. V. Palem, and M. O. Rabin. Highly efficient asynchronous execution of large-grained parallel programs. In 34th Annual Symposium on Foundations of Computer Science, pages 271-280, Palo Alto, California, 3-5 November 1993. IEEE.
- Yonatan Aumann and Michael O. Rabin. Clock construction in fully asynchronous parallel systems and PRAM simulation (extended abstract). In 33rd Annual Symposium on Foundations of Computer Science, pages 147-156, Pittsburgh, Pennsylvania, 24-27 October 1992. IEEE.
- Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, April 1974.
- Steven Fortune and James Wyllie. Parallelism in random access machines. In Conference Record of the Tenth Annual ACM Symposium on Theory of Computing, pages 114-118, San Diego, California, 1-3 May 1978.
- Z. M. Kedem, K. V. Palem, M. O. Rabin, and A. Raghunathan. Efficient program transformations for resilient parallel computation via randomization (preliminary version). In Proceedings of the Twenty-Fourth Annual ACM Symposium on the Theory of Computing, pages 306-317, Victoria, British Columbia, Canada, 4-6 May 1992.
- Z. M. Kedem, K. V. Palem, A. Raghunathan, and P. G. Spirakis. Combining tentative and definite executions for very fast dependable parallel computing (extended abstract). In Proceedings of the Twenty Third Annual ACM Symposium on Theory of Computing, pages 381-390, New Orleans, Louisiana, 6-8 May 1991.
- Zvi M. Kedem, Krishna V. Palem, and Paul G. Spirakis. Efficient robust parallel computations (extended abstract). In Proceedings of the Twenty Second Annual ACM Symposium on Theory of Computing, pages 138-148, Baltimore, Maryland, 14-16 May 1990.
- Clyde P. Kruskal, Larry Rudolph, and Marc Snir. Efficient synchronization on multiprocessors with shared memory. ACM Transactions on Programming Languages and Systems, 10(4):579-601, October 1988.
- Charles Martel, Ramesh Subramonian, and Arvin Park. Asynchronous PRAMs are (almost) as good as synchronous PRAMs. In 31st Annual Symposium on Foundations of Computer Science, volume II, pages 590-599, St. Louis, Missouri, 22-24 October 1990. IEEE.
- Richard D. Schlichting and Fred B. Schneider. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1(3):222-238, August 1983.
- Jacob T. Schwartz. Ultracomputers. ACM Transactions on Programming Languages and Systems, 2(4):484-521, October 1980.