We have used the Manta compiler to support divide-and-conquer parallelism. We have extended the Java language with Cilk-like primitives that make it convenient for the programmer to write divide-and-conquer programs. We have implemented the system, called Satin, on top of the Manta native Java compiler. The name is, of course, a tribute to Cilk, which inspired Satin. The ultimate goal of Satin is to run parallel divide-and-conquer applications on wide-area systems. We expect such systems to be hierarchical in nature (such as the DAS), and because divide-and-conquer programs are themselves structured hierarchically, we hope that such applications will map nicely onto hierarchical systems. The idea is that all complex wide-area optimizations live in the Satin runtime system, not in the Satin input programs, so the burden of hand-optimizing applications for wide-area systems is lifted from the programmer.
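To illustrate the programming model, the sketch below shows divide-and-conquer Fibonacci written as an ordinary recursive Java method. In real Satin the recursive calls would be marked as spawnable and executed in parallel, with a sync waiting for their results; here the spawn/sync points are only indicated in comments so the sketch compiles with a plain Java compiler.

```java
// Divide-and-conquer Fibonacci in the style of a Satin program.
// In actual Satin, the two recursive calls would be spawned (run in
// parallel, possibly stolen by another machine) and sync() would wait
// for both results before they are used.
class Fib {
    static long fib(int n) {
        if (n < 2) return n;      // base case: solve directly
        long x = fib(n - 1);      // in Satin: spawned call
        long y = fib(n - 2);      // in Satin: spawned call
        // in Satin: sync();  -- wait for the spawned children
        return x + y;             // combine the subresults
    }
}
```

The parallelism is entirely implicit in the recursion structure: the runtime system decides where each spawned job actually executes.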
In the EuroPar'2000 paper, we showed that the current Satin implementation achieves good performance on a cluster of workstations for twelve different applications. This performance is partly due to a mechanism we invented, called serialization on demand, which is described in detail in the paper.
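The intuition behind serialization on demand, sketched below under the assumption that most spawned jobs are executed on the machine that created them: a job's parameters are kept as plain object references, and are only serialized at the moment the job is actually stolen by a remote machine. The `InvocationRecord` class and its methods are illustrative names, not Satin's actual internals.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical sketch of "serialization on demand": parameters of a
// spawned job stay as live references while the job remains local;
// they are marshalled only when the job crosses a machine boundary.
class InvocationRecord {
    private final Object[] params;   // live references, no copying
    private byte[] serialized;       // filled lazily, on first steal

    InvocationRecord(Object... params) { this.params = params; }

    // Local execution: hand out the references directly, zero copy cost.
    Object[] localParams() { return params; }

    // Remote steal: only now pay the cost of Java object serialization.
    byte[] marshal() throws IOException {
        if (serialized == null) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(params);
            }
            serialized = bos.toByteArray();
        }
        return serialized;
    }
}
```

Since in a divide-and-conquer run the vast majority of jobs are never stolen, deferring serialization this way avoids the dominant overhead of a naive implementation.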
A paper describing the load-balancing algorithms and their performance was published in PPoPP 2001. It is also online here. In this paper, we experimentally compare Random Work Stealing (RS) with existing load-balancing strategies that are believed to be efficient for multi-cluster systems: Random Pushing and two variants of Hierarchical Stealing. We demonstrate that, in practice, these strategies obtain suboptimal results. We introduce a novel load-balancing algorithm, Cluster-aware Random Stealing (CRS), which is highly efficient and easy to implement. CRS adapts itself to network conditions and job granularities, and does not require manually tuned parameters. Although CRS sends more data across the WANs, it is faster than its competitors for 11 out of the 12 test applications under various WAN configurations. It has at most 4 percent run-time overhead compared to RS on a single, large cluster, even with high wide-area latencies and low wide-area bandwidths. These strong results suggest that divide-and-conquer parallelism is a useful model for writing distributed supercomputing applications on hierarchical wide-area systems.
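The core idea of CRS is that an idle node steals synchronously from random nodes within its own (low-latency) cluster, while keeping at most one asynchronous steal request outstanding to a random node in a remote cluster, so local stealing continues while the slow wide-area request is in flight. A sketch of that decision logic follows; the class and method names are illustrative, not Satin's actual API, and the communication itself is elided.

```java
import java.util.Random;

// Illustrative sketch of Cluster-aware Random Stealing (CRS):
// intra-cluster steal attempts are synchronous, and at most one
// wide-area steal request is outstanding at any time.
class CrsScheduler {
    private final Random rnd = new Random(42);
    private final int localNodes;    // nodes in our own cluster
    private final int remoteNodes;   // nodes in all other clusters
    private boolean wideAreaStealPending = false;

    CrsScheduler(int localNodes, int remoteNodes) {
        this.localNodes = localNodes;
        this.remoteNodes = remoteNodes;
    }

    // Called when this node runs out of work. Returns the victim for
    // a synchronous local steal attempt.
    int onIdle() {
        if (!wideAreaStealPending && remoteNodes > 0) {
            // Fire off one asynchronous wide-area steal; do not wait.
            wideAreaStealPending = true;
            int remoteVictim = rnd.nextInt(remoteNodes);
            // (send an asynchronous steal request to remoteVictim here)
        }
        // Meanwhile, keep stealing synchronously inside our own cluster.
        return rnd.nextInt(localNodes);
    }

    // Called when the wide-area reply (a job, or a failure) arrives.
    void onWideAreaReply() { wideAreaStealPending = false; }

    boolean hasPendingWideAreaSteal() { return wideAreaStealPending; }
}
```

Because only one request at a time crosses the WAN, the wide-area latency is overlapped with useful local steals rather than stalling the idle node, which is why CRS needs no manually tuned parameters.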