dc.description.abstract | This paper discusses our experience with fine-grain synchronization for the preconditioned conjugate gradient method using the modified incomplete Cholesky factorization of the coefficient matrix as a preconditioner. This algorithm represents a large class of algorithms that have been widely used but have traditionally been difficult to implement efficiently on vector and parallel machines. Through a series of experiments conducted using a simulator of a distributed shared-memory multiprocessor, this paper addresses two major questions related to fine-grain synchronization in the context of this application. First, what is the overall impact of fine-grain synchronization on performance? Second, what are the individual contributions of the following three mechanisms typically provided to support fine-grain synchronization: language-level support, full-empty bits for compact storage and communication of synchronization state, and efficient processor operations on the state bits? The experiments indicate that fine-grain synchronization improves overall performance by a factor of 3.7 on 16 processors using the largest problem size we could simulate; the paper also projects that a significant performance advantage will be sustained for larger problem sizes. Preliminary experience shows that the bulk of the performance advantage for this application can be attributed to exposing increased parallelism through language-level expression of fine-grain synchronization. A smaller fraction relies on a compact implementation of synchronization state, while an even smaller fraction results from efficient full-empty bit operations. The paper also shows that the last two components are likely to have a greater impact on performance as mechanisms for latency tolerance are employed. | en_US |