Prioritized Multiplicative Schwarz Procedures for Solving Linear Systems
David Wingate, Nathaniel Powell, Quinn Snell, Kevin Seppi
Computer Science Department
Brigham Young University
Provo, Utah 84602
{ wingated, nep8, snell, kseppi}@ cs. byu. edu
Abstract
We describe a new algorithm designed to quickly and ro-bustly
solve general linear problems of the form Ax = b. We
describe both serial and parallel versions of the algorithm,
which can be considered a prioritized version of an Alter-nating
Multiplicative Schwarz procedure. We also adopt a
general view of alternating Multiplicative Schwarz proce-dures
which motivates their use on arbitrary problems ( even
which may not have arisen from problems that are naturally
decomposable) by demonstrating that, even in a serial con-text,
algorithms should use many, many partitions to ac-celerate
convergence; having such an over- partitioned sys-tem
also allows easy parallelization of the algorithm, and
scales extremely well. We present extensive empirical evi-dence
which demonstrates that our algorithm, with a com-panion
subsolver, can often improve performance by sev-eral
orders of magnitude over the subsolver by itself and
over other algorithms.
1. Introduction
This work introduces the concept of prioritization ap-plied
to Alternating Multiplicative Schwarz ( AMS) proce-dures
[ 11] as an effective method for solving problems of
the form Ax = b, where A is a large, sparse matrix.
AMS procedures are a form of problem decomposition
that were originally motivated by trying to solve continu-ous
PDEs with irregular domains. The principle is to divide
the problem into smaller subproblems ( which are still con-tinuous),
then repeatedly sweep over the subproblems, solv-ing
each one and updating interface values between prob-lems.
These PDEs were not necessarily discretized, and in
some cases, the subproblems could be solved directly. The
technique was then generalized to discretized PDEs, where
a matrix A resulting from a discretization process is bro-ken
up into submatrices, each of which correspond directly
to some part of the original problem [ 10].
In contrast, we explore the use of AMS- like procedures
for arbitrary matrices, regardless of where they originated.
In that sense, this work reverses the traditional view of AMS
procedures: instead of starting with a known problem and
generating a matrix representing its discretization, we be-gin
with a matrix, and force a decomposition upon it, even
if the underlying problem is not naturally decomposable.
This means that the decomposition may or may not corre-spond
to anything recognizable in the underlying problem.
The results are excellent, in some cases accelerating solu-tion
times by many orders of magnitude.
The centerpiece of this paper is the development of pri-oritized
AMS procedures, in which subdomains are selected
according to some priority metric. The principle behind this
is the observation that variables with large error strongly in-fluence
the solution vector. Corrections of large errors prop-agate
additional error backwards through the graph defined
by the A matrix. Our algorithm processes variables in prior-ity
order by constructing a priority queue using a metric de-rived
from the residual. Such prioritized solution methods
have shown to provide tremendous speedups in other con-texts
[ 9][ 12] for two reasons: first, they focus computation
in regions of the problem which are expected to be max-imally
productive, and second, they can sometimes avoid
processing large parts of the problem.
It is well- known that AMS procedures are naturally par-allelizable.
However, most researchers using AMS proce-dures
as a domain decomposition method split the matrix
into a small number of partitions, typically equal to the
number of processors. One of the central points of this work
is the fact that, even on a single processor, dramatic im-provements
are produced if a large number of partitions are
used to solve a problem. We demonstrate that in a parallel
context, far more partitions than processors should be used.
This allows the same performance benefits to be gained on
each individual processor, and improves the load- balancing
and cache coherency of the algorithm.
The goal of this research is to speed up solution times by
building on top of existing algorithms. However, our modi-
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
fied AMS procedures are not accelerators ( in the sense de-fined
by Hageman and Young [ 7]), nor can they be con-sidered
preconditioners ( as an additive Schwarz procedure
can be). No simple linear algebra expression exists which
captures the essence of the algorithm or which facilitates
easy analysis. For those reasons, we present a more pro-cedural
exposition. Since we treat AMS procedures as an
algorithmic accelerator for other subsolvers, and not as a
preconditioner, we use lambda calculus notation to explic-itly
recognize the fact that our algorithm must be used in
conjunction with another subsolver. This subsolver may be
any solution method, including iterative methods, such as
GMRES, CGS, TFQMR, BiCGSTAB, or direct methods,
such as more traditional ( I) LU decomposition or Gaussian
elimination [ 2]. Thus, we have named the resulting algo-rithm
GPS( ?), which is short for General Prioritized Solver,
where ? is the companion subsolver. The parallel imple-mentation
is called PGPS( ?), which is short for Parallel
General Prioritized Solver.
Because we wish to think about AMS procedures with-out
regard to the underlying problem, we take a perspective
akin to partial differentiation on the matrix A: we force arbi-trary
subsolvers to solve for a certain set of variables, while
holding the rest of the variables constant. We make no as-sumptions
about the form of the matrix; for example, we
do not assume that the matrix has a diagonal structure ( with
any sized bandwidth). Instead of a traditional block decom-position,
we adopt more general partitioning notation, and
rely on general graph partitioners to decompose the problem
and to determine interface edges. Thus, the ideas of over-lap
and interface value communication are not treated in
the traditional way. However, if a naive partitioning is used
( in which each partition is composed of a contiguous block
of variables), and the matrix is already in a strongly diago-nal
form, then the result is mathematically equivalent to tra-ditional
AMS procedures, which is why we consider our al-gorithm
a slight generalization of them.
Empirical results demonstrate that the serial version of
the final algorithm almost always improves performance
for a variety matrix types and sizes, and for a variety of
subsolvers. Performance is often improved by several or-ders
of magnitude, but for different reasons with different
matrix/ solver combinations. Improved performance is pre-served
in the parallel version, and excellent parallel scala-bility
is demonstrated.
2. The GPS( ?) Algorithm
Here, we present the serial version of the GPS( ?) algo-rithm.
The next section discusses parallel extensions. The
core ideas underlying the GPS( ?) algorithm are quite sim-ple:
we partition the variables, we select a subset of vari-ables
based on priority, we solve the subset while hold-ing
all other variables constant ( using any linear system
solver, denoted ?), and then we update any quantities that
depended upon variables in the subset. The entire process
is repeated until convergence. There are four details that
merit attention: 1) How can an arbitrary subsolver be made
to solve for some variables, while holding others constant?
2) How should partitions be created? 3) How should parti-tions
be selected in order to maximize the efficiency of the
algorithm? 4) To what tolerance should the subproblem be
solved? The following subsections address each question.
A few pieces of notation will be used throughout the pa-per.
Let P be a partitioning of variable indices, with each
p ? P being a set of variable indices. Let r = Ax- b be the
global residual, and rp = Apxp - bp be residual of a sub-problem.
.
Folding/ Extracting One idea motivating the GPS( ?) al-gorithm
is the idea of solving for a subset of all variables,
while holding the rest of the variables constant. We term
our method folding and extracting, and is an approach com-patible
with any sub- solver. The key observation is the fact
that if certain variables are held constant, their values may
be temporarily folded into the right- hand side vector, and a
new sub- problem created.
As an example, consider the system Ax = b where A is
a 2x2 matrix. Suppose that we wish to solve only for vari-able
0 while holding variable 1 constant. We may expand
this system ( using array subscript notation to index vectors
and matrices) and, since x[ 1] is constant, rewrite it as
where b is the new vector that results from the folding oper-ation.
This yields a set of two equations with one unknown.
Either may be used to solve for variable 0, although since
our estimate for x[ 1] is not guaranteed to be correct, differ-ent
equations will result in different answers ( see Section 7
for more discussion on this point).
In general, given an n×n matrix A and a partition p, the
equation for variable i ? p is:
This yields n equations in | p| unknowns, which is an over
constrained system. We will select a subset of the equations
with which to work; we adopt the convention that we will
use the equations corresponding to the variable indices in
the partition. We define a selector matrix as
Kp[ i, j] = 1 i ? p ? i = j
0 otherwise
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
Then, to select a subset of equations from A, x and b, we let
A = KpAKT
p , x = Kpx, and b = Kpb.
Note that A is still an n×n matrix, but with many empty
rows and columns. If row i in A is empty, column i will also
be empty, as will the corresponding entries in both both x
and b . The entire system may therefore be compacted by
eliminating such zero entries, and mapping variable indices
from the original system to new variable indices in the com-pacted
system: Ap = compact( A ), xp = compact( x ),
and bp = compact( b ). The entire system Apxp = bp may
now be passed to an arbitrary subsolver.
Other authors define similar reduction procedures in
terms of restriction and prolongation operators [ 6][ 10]. We
use our notation for simplicity of exposition.
Partitioning There are several principles governing the
creation of partitions. Partitioning variables is different than
a traditional block decomposition on the matrix A, because
a partition may contain any subset of variables. Block de-compositions
effectively stipulate that variables be parti-tioned
contiguously.
There are still some constraints on the partitioning. First,
partitions must ensure that no sub- problem is under con-strained.
This is equivalent to ensuring that every sub-matrix
has no empty rows or columns ( unless both are
empty), which is equivalent to stipulating that the binary
relation defined by the sub- matrix be total and surjective.
General graph partitioning algorithms, such as multi- level
k- way partitioners [ 8], or k- way partitioners using spectral
bisection, do not necessarily generate partitions that satisfy
this criterion. However, both conditions are satisfied for any
partitioning when the matrix has a non- zero diagonal. In this
situation, it is possible to create “ naive” partitions where
each partition simply contains n/| P| contiguous variables.
Although in theory an arbitrary partitioning of the vari-ables
could be used, the principle of partitioning is to group
related variables together. The idea is that if one variable
has a high priority, the related variables in the partition will
also have a high priority, and we can solve them all simulta-neously
without incurring the overhead of prioritizing each
variable individually. A good way to accomplish is to use a
partitioning which minimizes the edge cut, which is a com-mon
capability of existing k- way partitioning algorithms.
Prioritization Once partitions are defined, the next ques-tion
is determining the order in which partitions should
be processed. There are two options: first, a naive sequen-tial
method may be used. Although simple, this method is
surprisingly effective for well- ordered matrices, as demon-strated
in Section 6. A second option is to process partitions
in priority order.
For many of the experiments, we therefore define a pri-ority
queue which orders partitions. The queue has a pri-ority
metric associated with it, which we denote H( p). For
reasons that will be explained shortly, we use the 2- norm-squared
of the residual vector of each sub- problem as the
priority of that partition: H( p)
. In princi-ple,
many different priority metrics could be used, but this
is left to future research.
For the purposes of the algorithm, the priority queue
must support three operations: insert, pop, and
reprioritize. The third operation is necessary be-cause
the priority of a partition will often change when it
is deep in the queue, as other partitions, upon which it de-pends,
are solved.
Subproblem tolerance The final issue involves determin-ing
the tolerance to which subproblems should be solved.
There are three possibilities: specify a maximum number of
iterations, specify a subproblem tolerance, or both.
Specifying a subproblem tolerance merits some discus-sion.
Assume that we wish to solve Ax = b to tolerance of
using a 2- norm measure on the residual vector. Clearly,
it is not sufficient to solve each subproblem to within , be-cause
the combination of the residual norms of all of the
subproblems may yield an overall residual norm which is
greater than the global tolerance.
Instead, it is necessary to allocate error to different sub-problems.
If we assume that error is allocated uniformly, for
instance, then
, and by the uniformity
assumption, (| P| ?p) 1/ 2 < , meaning that ? 1/ 2
p <
| P| 1/ 2 .
This states that if error is allocated uniformly, the subprob-lem
tolerance must be less than to ensure that the global
tolerance will be achieved.
However, this is only one way of allocating error to par-titions.
Other methods, including methods which dynami-cally
allocate error as the algorithm progresses, are left to
future research. In our algorithm, we specify both a subdo-main
tolerance, as well as a maximum number of subdo-main
iterations.
2.1. Algorithm Details
Here, we present the final algorithm which incorporates
the issues discussed in this section. We will briefly discuss
an optimization related to incremental error computations,
and then present the final algorithm.
Incremental error computations As previously noted,
many different stopping criteria could be used with the
GPS( ?) algorithm. We have presented the most common,
which is the 2- norm of the residual vector ( this is also the
criteria that was used in all of our experiments). The exten-sion
to other stopping criteria is straightforward, but messy,
and has been omitted. An advantage of the 2- norm of the
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
Algorithm 1 GPS( ?)
Initialization
1: r := b - Ax
8: PQ. insert( p)
Main Loop
1: repeat
2: // Solve the highest priority partition
3: p := PQ. pop()
4: ( xp, bp) := fold( p, A, x, b)
5: xp := ?( Ap, xp, bp)
6: ( x, r) := extract( p, x, xp, r, rp)
7: rp := bp - Apxp
8:
9: // Update the global residual
21: ?G := ?G + r[ i] 2
22:
23: // Reprioritize partitions
24: for all p ? PartDep( p) do
25: PQ. reprioritize( p )
26: until ?G <
residual vector is that efficient incremental computation can
be derived, which is compatible with the fact that, due to
the prioritization aspects of GPS( ?), variables do not always
need to be processed. Practically, squaring and square root
operations can be avoided if the 2- norm squared is com-pared
to the tolerance squared; our algorithm employs this.
We use this technique for computing global error estimates,
as well as maintaining error estimates for the sub- problems.
Empirically, the technique is stable, but suffers from round-off
errors, which is remedied by periodically performing a
full residual norm calculation ( not shown in Algorithm 1).
Full algorithm Algorithm 1 shows the full pseudocode
for the serial GPS( ?) algorithm. Here, VarDep( p) returns
Algorithm 2 fold( p, A, x, b)
1: for all i ? p do
5: return x, r
the variable dependents of a partition p, which is the set
of all variables with an edge to any variable in p. Simi-larly,
PartDep( p) returns the partition dependents of a par-tition
p, which is the set of all partitions which contain at
least one variable in VarDep( p). The gv to lv function
(“ global- variable- to- local- variable”) maps variable indices
from A to submatrix indices in Ap, and the var to part
function maps variables to their owning partitions.
3. Parallel Implementation
The enhancements of partitioning and prioritization nat-urally
facilitate an efficient parallel implementation. This
section discusses one way in which the GPS( ?) algorithm
can be parallelized, and discusses some of additional issues
encountered when moving to a parallel architecture.
In our algorithm, each processor owns ( or is responsible
for) a set of partitions. These partitions are termed “ local”
partitions, which contain “ local” variables. All other parti-tions
and variables are “ foreign.” In a corresponding gener-alization
of notation, we will use ?Pr to denote the sum of
all ?p for p local to processor Pr.
To create PGPS( ?), we adopted an architecture with
some asynchronous aspects as well as some synchronous
aspects. The basic algorithm is substantially the same as the
serial algorithm, except for two major changes which merit
detailed discussion. First, since partitions are assigned to
processors, there is the issue of communicating the values
of variables between processors. Secondly, there is the is-sue
of termination.
3.1. Variable Communication
The PGPS( ?) algorithm embodies the same core con-cepts
as the serial version, except that the data and com-putation
has been distributed. First, each processor selects
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
the local partition p with the highest priority and solves it.
Then, the priorities and residuals of any other local parti-tions
which depend on p are recomputed. Now, however,
PGPS( ?) must execute steps that GPS( ?) did not need to:
it must communicate the values in p to all other processors
which depend upon p, and it must receive and process up-dates
from other processors.
The values in a partition do not necessarily need to be
communicated to every other processor. In fact, the proces-sors
which need the new values for partition p are partitions
which have some local variable that depends upon a value
from a variable in p. We term this set ProcDep( p), and is
typically very small in sparse matrices.
Communicating new values can be accomplished with a
single message sent to each processor in ProcDep( p). To
avoid starvation of processors, a processor receives a maxi-mum
of n messages, where n is the number of processors.
Receiving an update message from a foreign processor
is conceptually the same as the solving a local subprob-lem
associated with a partition p. The various quantities that
depend on variables within p must be recomputed: parti-tion
priorities, partition error estimates, the global error es-timate,
and variable residuals.
3.2. Termination Detection
Since a processor is only responsible for a subset of the
variables, it can only compute a subset of the overall error.
Each processor’s error estimate must be combined to form a
global error estimate, which can be used to determine when
to terminate.
Although there are many ways to compute such a global
error estimate, an effective way is a periodic reduction. In
our implementation, all processors reduce the local ?Pr er-ror
values every 100 iterations. The number of iterations
between synchronizations is a tunable parameter. Theoreti-cally,
if it is set too high, the algorithm performs unneces-sary
work, and if it is set too low, there can be significant
overhead. Empirically, however, it had very little impact on
the solution times.
3.3. Discussion
There are several advantages of this quasi- synchronous
architecture. First, since processors communicate changed
values only to the processors that need it, and since
ProcDep( p) is typically very different for different parti-tions,
processors can work at their own rate. In addition, the
overall volume of communication is fairly small, since par-titions
are usually selected to contain no more than
about 100 variables. Naturally, all master/ slave bottle-necks
are avoided.
The primary disadvantage is that many small messages
are generated, which can incur high latency penalties. In
addition, the synchronous computation of error means that
processors cannot work completely at their own rate. This
creates some inefficiencies that could be remedied with a
more sophisticated termination detection.
3.4. Full Algorithm
Algorithm 4 shows the full PGPS( ?) algorithm. Note
that processors now operate only on local partitions. The
PQ contains only local partitions, and whenever a partition
changes ( either as a result of popping off the local PQ and
solving, or as a result of a message from a foreign proces-sor),
only local errors and residuals are recomputed.
Here, LocalVarDep( p) returns the local variable de-pendents
of a partition p, which is the same as VarDep( p),
except that it contains only variables local to Pr.
LocalPartDep( p) returns the local partition depen-dents
of a partition p, and only contains local partitions.
The ProcDep( p) function returns a set of processor de-pendents
of a partition p, which is the set of all processors
that own one variable which depends upon one vari-able
in p.
Lines 2- 5 show pseudocode for the simple synchronous
termination test. Lines 18- 38 constitute the bulk of the dif-ference
between PGPS( ?) and GPS( ?), but is conceptually
simple: each partition update message received triggers pro-cessing
that is the same as the processing which happens
which a local partition changes.
4. Algorithm Design Details
There are many issues surrounding the design of an algo-rithm
such as GPS( ?). This section briefly examines some
of the more theoretical aspects, but we stress that space pre-cludes
a full analysis. Here, we only briefly consider con-vergence
and miscellaneous design details.
4.1. Convergence
Technically, the introduction of prioritization into the so-lution
process moves the GPS( ?) algorithm into the class of
chaotic ( or asynchronous) relaxation algorithms. Proofs of
convergence of such algorithms have traditionally imposed
strict requirements on the problems to be solved; a typical
condition is that ?( A) < 1 [ 3][ 5][ 4].
In principle, the convergence of GPS( ?) should there-fore
be governed by two factors: first, A must satisfy any
constraints that the subsolver requires. If CGS is selected as
a subsolver, for instance, then A must be symmetric posi-tive
definite. Second, it seems that A should satisfy any ad-ditional
constraints that asynchronous relaxations impose.
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
Algorithm 4 PGPS( ?)
Main Loop
1: repeat
2: // Synchronously compute global error
3: if time to resync then
4: ensure all messages have been received
5: ?G = Reduce( ?Pr)
6:
7: // Solve the highest priority partition
8: // and update the global residual
9: same as serial
10:
11: // Update partition and variable residuals
12: // and reprioritize partitions
13: for all i ? LocalVarDep( p) do
14: same as serial
15: for all p ? LocalPartDep( p) do
16: same as serial
17:
18: // Send new partition values
19: for all i ? ProcDep( p) do
20: Send( i, p, xp)
21:
22: // Receive new values
23: for i = 1 to nproc do
24: if no messages waiting then
25: break
Empirically however, the GPS( ?) algorithm is unique in
that neither constraint appears to be fully in force. In our ex-periments,
GPS( ?) robustly converged for almost all matri-ces
tested ( assuming that preconditioner succeeded on the
sub- problems), each of which was non- convergent ( that is,
?( A) = 1). In addition, at least in theory, the GPS( ?) algo-rithm
actually relaxes some of the constraints imposed on
solvers and matrices. The issues of convergence merit de-tailed
attention, so they are largely left for future research.
However, the idea that GPS( ?) relaxes some requirements
merits an example. Consider the matrix
A = ? ? ? ?
3 2 1 0
2 3 0 4
0 0 3 2
0 0 2 3
? ? ? ?
and a right- hand side of b = [ 1, 1, 1, 1] T . Although A is
positive definite, it is not symmetric. A solver that requires
that the matrix be SPD, such as CG, diverges when attempt-ing
to solve Ax = b. However, if we define two partitions
p1 = { 1, 2} and p2 = { 3, 4}, then A1 = A2 = 3 2
2 3
both of which are positive definite. CG can therefore solve
both sub- problems, meaning that GPS( CG) can be used to
solve the entire system.
The rate of convergence of GPS( ?) is unknown. Since it
is technically a chaotic relaxation algorithm, all known con-vergence
proofs simply state that if certain conditions re-lating
to the matrix are met, and all variables are relaxed
an infinite number of times, the system will converge. No
guaranteed rates are known except in very specific circum-stances,
but empirically, we have observed excellent rates.
Additional material and analysis on convergence of
AMS procedures can be found in [ 6].
4.2. Miscellaneous Details
Currently, determining when to stop the subsolver is
largely heuristic. As noted, GPS( ?) uses a combination of
a subproblem tolerance and a small maximum number of
subproblem iterations. Although sufficient for convergence,
both methods leave something to be desired, because both
numbers are static. Some initial experimentation has indi-cated
that dynamically changing the subproblem tolerance
( or the maximum number of subdomain iterations) can yield
improved performance. This is consistent with basic princi-ples:
it is good to work quickly across many different parts
of a problem, to quickly reduce large errors. However, as
time progresses, it becomes more important to solve sub-problems
to a higher accuracy.
5. Experimental Setup
A number of experiments were run to isolate and quan-tify
all relevant behaviors of the GPS( ?) and PGPS( ?) algo-rithms.
The bulk of the experiments were designed to quan-tify
the benefits of different enhancements, and across sev-eral
different subsolvers. It examined 1) the folding / ex-tracting
technique; 2) the benefits of prioritization; and 3)
the efficiency of the parallel version of the code.
Experiments were run across a wide variety of matrices,
which are described in Figure 1. An effort was made to se-lect
matrices from different discplines, with different sizes,
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
Matrix name Description Size NNZ RHS SPD
bcsstk08 BCS – TV studio 1,074 7,017 ( made) Yes
e20r0000 2D fluid flow in a driven cavity 4,241 131,556 MM No
e40r5000 2D fluid flow in a driven cavity 17,281 553,956 MM No
fidap003 Matrix from the FIDAP package 1,821 52,659 MM No
fidap006 Matrix from the FIDAP package 1,651 49,479 MM No
fidap010 Matrix from the FIDAP package 2,410 54,816 MM No
fidap011 Matrix from the FIDAP package 16,614 1,091,362 MM No
fidap014 Matrix from the FIDAP package 3,251 66,647 MM No
fidap035 Matrix from the FIDAP package 19,716 218,308 MM No
fidapm33 Matrix from the FIDAP package 2,353 23,765 MM No
fs 760 3 Mixed kinetics diffusion problem 760 5,976 ( made) No
hp 400 5 Thermal convection; 5- pt stencil 160,000 798,400 ( in- house) No
mcar- 160000 Reinforcement learning problem 160,000 637,831 ( in- house) No
nnc1374 Advanced gas- cooled nuclear reactor core 1,374 8,606 ( made) No
rdb2048l 2D reaction- diffusion model 2,048 12,032 ( made) No
sap- 160000 Reinforcement learning problem 160,000 639,996 ( in- house) No
s3dkt3m2 FEA of cylindrical shell 90,449 1,921,955 ( made) Yes
sherman2 Thermal simulation with steam injection 1,080 23,094 MM No
sherman3 IMPES simulation of a black oil model 5,005 20,033 MM No
Figure 1. Descriptions of the matrices used.
and with different numerical properties. Most of the matri-ces
used can be found at NIST’s MatrixMarket. 1 Excep-tions
are labeled with an “ in- house” label in the RHS col-umn
of Figure 1.
A fair comparison of GPS( ?) to other solvers is not easy.
We selected unpreconditionedGMRES as the general base-line,
because of its popularity and robustness across many
different matrix types. As we will show, GPS( ?) often out-performs
GMRES. On the one hand, it is not surprising that
GPS( ?) outperforms an unpreconditioned solver, but on the
other hand, GPS( ?) is not preconditioned either ( this is be-cause
it is theoretically possible to develop a preconditioned
version of GPS( ?) which operates on a preconditioned ma-trix,
instead of directly on the matrix A). The most natural
direct competitor is GMRES, preconditioned with an Ad-ditive
Schwarz Method, and using a standard ILU subdo-main
preconditioner. With these issues in mind, most of
the experiments were run using GMRES ( labeled as “ P-GMRES”),
GMRES with an ASM+ ILU preconditioner ( la-beled
as “ P- GMRES w/ ASM+ ILU”), GPS( ?) ( labeled as
“ GPS- PQ” or “ GPS- SEQ”), and GMRES with a standard
MAT preconditioner ( labeled as “ P- GMRES w/ MAT”).
For most of the experiments, the parameters for GPS( ?)
( such as number of partitions, maximum number of sub-domain
iterations, etc.) were aggressively hand- tuned to
yield the best performance. When solving the problems, tol-erances
were selected primarily to elicit interesting behav-
1 http:// math. nist. gov/ MatrixMarket/
ior. For GPS( ?), when solving to a specified tolerance, sub-problem
tolerances were always set to be | P|.
For all of the experiments the subsolvers incorporated
into the GPS( ?) code were taken from the “ Portable, Ex-tensible
Toolkit for Scientific Computation” ( PETSc) li-brary
[ 1]. Experimentation with Sandia National Labora-tory’s
“ Aztec” package yielded worse performance, and did
not have the preconditioners needed, so it was dropped.
To help isolate the benefits of the fold/ extract approach
from the benefits of prioritization, we will also reference
an algorithm named “ GPS- SEQ.” This is a variant of the
GPS( ?) algorithm which selects partitions sequentially, in-stead
of in priority order. If clarification is needed, we will
reference the prioritized version of GPS( ?) as “ GPS- PQ.”
Many of the matrices from the MatrixMarket have a RHS
that is also downloadable, but some do not. For those that
do not, one was manufactured by selecting a random x vec-tor
with each x[ i] ? ( 0, 1), then multiplying it through A to
generate b. These are indicated in Figure 1 with “( made)” in
the RHS column. Regardless of where the RHS came from,
an initial guess of x = 0 was always used.
Both the GPS( ?) and PGPS( ?) depend on having a par-tition
of the variables. Since almost all of the matrices are
strongly diagonal, a naive partitioning was used. Each par-tition
was assigned n/| P| contiguous variables.
Serial results were obtained on a dual processor 2.4GHz
P4 machine with 2G of RAM. Parallel results were obtained
on a fully connected cluster of dual processor 2.4GHz
P4 machines, each with 2G of RAM and equipped with
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
Myrinet interconnects. All code was written in C with MPI.
6. Results
The results of our experiments were very positive. We
begin by presenting the general serial results, followed by
the parallel scalability results. To be fair, we point out cases
in which the algorithm fails, or yields worse performance
than a baseline. The primary criterion measuring success is
the wall- clock time needed to reduce the ( unpreconditioned)
2- norm of the residual to a specific tolerance.
6.1. Serial results
Figure 2, top three rows, shows a representative sample
of results for the GPS( ?) algorithm on a variety of matri-ces.
The story is complicated: the various algorithms out-perform
each other on different problems, and even on the
same problem, different algorithms may win or lose based
on what tolerance was selected. For this reason, we opted to
present the majority of results in graphical form. However,
several general patterns are clearly evident, and the winner
on any given matrix is usually clear in the limit.
There are three broad types of result:
• GPS( ?) dramatically improves performance ( by up to
several orders of magnitude) [ bcsstk08, e20r0000, fi-dap006,
fidap014, sap- 160000, sherman2]
• GPS( ?) barely affects performance [ e40r5000, fi-dap011,
fs 760 3, mcar- 160000]
• GPS( ?) dramatically worsens performance ( some-times
making the problem impossible to solve) [ fi-dap003,
fidap010, nnc1374, fidap035, fidapm33,
hp 400 5, s3dkt3m2, rdb2048l, sherman3]
Themost exciting results in our experiments indicate that
GPS( ?) yielded excellent performance on a wide variety of
matrices. On the bcsstk08 and sap- 160000 problems ( see
Figure 2), GPS( ?) solves it instantly while other algorithms
gradually reduce the residual norm. For the e20r0000 ( Fig-ure
2) and fidap014 problems, GPS( ?) reliably reduces the
residual norm better than all other solvers. On the fidap006
problem, GPS( ?) bottoms out at a residual norm of 0.118
after about 3 seconds, while P- GMRES achieves a residual
norm of 0.135 after 40 seconds. Of course, not every posi-tive
result is as compelling: on the sherman2 matrix, for ex-ample,
GPS( ?) can outperform P- GMRES by two orders of
magnitude, but only with a small number of partitions.
Naturally, GPS( ?) is not a panacea. For some prob-lems
( such as e40r5000), no algorithm worked. On the
fs 760 3 problem, GPS( ?) stabilizes at a residual norm
lower than that of P- GMRES, but GPS- SEQ achieves the
lowest residual. The fidap011 problem was an interesting
case: P- GMRES and GPS- PQ flip- flopped twice, although
GPS- PQ eventually won out. For themcar- 160000 problem,
GPS- PQ almost instantly solves it, but the GPS- SEQ also
solves it almost instantly, and to a better tolerance. This case
illustrates two important points: first, since GPS- PQ and
GPS- SEQ performed about the same, it appears that prior-itization
is not the defining performance factor, but the fact
that a multiplicative Schwarz procedure was being used.
Second, it is clear that for a problem like this, a hybrid al-gorithm,
in which some iterations of GPS( ?) are run until
stabilization, followed by some iterations of another solver,
would probably best either algorithm alone.
We also observed situations where GPS( ?) degraded
performance substantially. Often, it was the case that al-though
GPS( ?) could reduce the tolerance reliably, other
algorithms managed to solve a problem instantly. In other
cases, GPS( ?) drives down residual norm down quickly
and reliably, but another algorithm achieves a lower norm
even more quickly. The fidap003, fidap010, nnc1374 and fi-dap035
problems all showed an interesting general pattern:
P- GMRES was able to reduce the residual norm quickly, but
GPS( ?) could not. If the number of partitions used was ad-justed
downward, however, GPS( ?) approached closer and
closer to the performance of P- GMRES.
6.2. Parallel Results
Figure 2 ( bottow two rows) show representative paral-lel
results for PGPS( ?). The first two graphs show the wall-clock
time of P- GMRES, P- GMRES/ ASM+ ILU, PGPS- PQ
and PGPS- SEQ, while the second two graphs show the
speedup of the same algorithms. The parallel results are pre-sented
for the s3dkt3m2 and hp 400 5 matrices.
Here, we are benchmarking two distinct parallel imple-mentations:
the P- GMRES and P- GMRES/ ASM+ ILU algo-rithms
are the standard PETSc implementations, using their
MPI- based parallelization. Recall that although PGPS( ?)
uses PETSc as a subsolver, it is not a parallel subsolver.
The parallel code for PGPS( ?) was developed in- house.
The interpretation of these results requires some care.
Specifically, the results of different algorithms should not be
compared to each other. This is because the point of these
figures is to show the parallel scalability and speedup of
each algorithm on two different matrices. To demonstrate
this in a clear way, we used different tolerances for differ-ent
algorithms, meaning that each algorithm is effectively
solving a different problem.
The results demonstrate excellent overall scalabil-ity.
PGPS( ?) consistently scales almost linearly with the
number of processors, and although some parallel over-head
was seen, it did not appear to be as pronounced
as some of the overhead that the PETSc library in-curred.
This is also an important validation of our assertion
that cross- processor communication overhead is not pro-
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
hibitive in PGPS( ?), even with the quasi- synchronous
architecture we adopted.
The speedup graphs indicate that, for both PGPS( ?)
and PETSc, there is room for improvement. Neither pack-age
bests the other consistently, although GPS- SEQ always
shows the best speedup. Of course, experiments on more
than two matrices would need to be run to truly determine
how the packages compare.
The competition with PETSc is significant not only be-cause
PGPS( ?) appears to scale better ( demonstrat-ing
that an un- optimized prototype code is similar to a
well- developed library relied on by many researchers), but
also because it was easier to implement.
6.3. Miscellaneous Results
One meta result that does not fall neatly into any cat-egory
is the fact that throughout our experimentation, very
few cases of divergence were observed while using GPS( ?),
assuming that the preconditioner suceeded on the subprob-lems.
Considering the highly asynchronous nature of the al-gorithm,
and the discussion of convergence in Section 4.1,
this result is surprising, and merits further study. Finally,
we note that tuning the parameters of the algorithm is diffi-cult.
For example, performance varies significantly as func-tion
of the number of partitions used, which is another issue
for future research.
7. Conclusions and Future Research
Perhaps the best way to describe our final conclusions
is this: initial experimentation using GPS( ?) and PGPS( ?)
generated many promising results, and indicates significant
potential for the idea of prioritization. Empirically, we have
shown that even by themselves, the AMS procedures em-bodied
in GPS( ?) and PGPS( ?) can provide dramatic per-formance
benefits for many problems, regardless of the ori-gin
of the underlying problem. The addition of prioritiza-tion
to these procedures can often further improve perfor-mance.
Of course, there is no reason to suppose that priori-tizing
subdomains in an AMS procedure is the only way that
prioritization can be integrated into an algorithm. Studying
other ways to accomplish this ( perhaps by deriving directly
prioritized versions of algorithms) is a significant direction
for future research. We have also shown that GPS( ?) is ef-fectively
parallelizable, allowing many different subsolvers
to be used in a general parallel framework.
The GPS( ?) algorithm is not perfect, of course. Use of
the GPS( ?) sometimes worsened performance over baseline
algorithms, and sometimes lead to divergence ( especially if
the partitioning was not done carefully). Perhaps this re-sult
simply means that GPS( ?) fits squarely with other lin-ear
system solvers: no single algorithm always outperforms
every other algorithm, and very few algorithms are guaran-teed
to converge on every problem.
Overall, our results indicate that AMS procedures can be
effectively used to accelerate many problems. This is true
even in the serial case, and even for matrices which came
from an unknown underlying problem. Perhaps more gen-erally,
the results indicate that there is still room for im-provements
to the current generation of best solvers. This
motivates continued research into advances such as prioriti-zation
with partitioning, in an effort to one day find the op-timal
way of solving large sparse linear systems.
Acknowledgements
David Wingate is supported under a National Science
Foundation Graduate Research Fellowship.
References
[ 1] S. Balay, K. Buschelman, V. Eijkhout, W. D. Gropp,
D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and
H. Zhang. PETSc users manual. Technical Report ANL-
95/ 11 - Revision 2.1.5, Argonne National Laboratory, 2004.
[ 2] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato,
J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. V. der
Vorst. Templates for the Solution of Linear Systems: Building
Blocks for Iterative Methods, 2nd Edition. SIAM, Philadel-phia,
PA, 1994.
[ 3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed
Computation: Numerical Methods. Prentice- Hall, Engle-wood
Cliffs, NJ, 1989.
[ 4] D. Chazan and W. Miranker. Chaotic relaxation. In Lin-ear
Algebra and its Applications, volume 2, pages 199– 222,
1969.
[ 5] V. Gullapalli and A. G. Barto. Convergence of indirect adap-tive
asynchronous value iteration algorithms. In Advances
in Neural Information Processing Systems, volume 6, pages
695– 702, 1994.
[ 6] W. Hackbusch. Iterative Solution of Large Sparse Systems of
Equations. Springer- Verlag, New York, 1994.
[ 7] L. A. Hageman and D. M. Young. Applied Iterative Meth-ods.
Academic Press, New York, NY, 1981.
[ 8] G. Karypis and V. Kumar. Multilevel k- way partitioning
scheme for irregular graphs. Journal of Parallel and Dis-tributed
Computing, 48( 1): 96– 129, 1998.
[ 9] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Re-inforcement
learning with less data and less time. Machine
Learning, 13: 103– 130, 1993.
[ 10] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS
Publishing, Boston, 1996.
[ 11] H. A. Schwarz. Gesammelte Mathematische Abhandlungen,
volume 2. Springer- Verlag, 1890.
[ 12] D. Wingate and K. Seppi. P3VI: A partitioned, prioritized,
parallel value iterator. In Proceedings of the Twenty- First
International Conference on Machine Learning, pages 863–
870, 2004.
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE
1e- 05
1e- 04
0.001
0.01
0.1
1
10
100
0 4 8 12 16 20 24 28 32 36 40
Residual norm
Time ( seconds)
P- GMRES
P- GMRES w/ MAT
P- GMRES w/ ASM+ ILU
GPS- SEQ
GPS- PQ
1e- 09
1e- 08
1e- 07
1e- 06
1e- 05
1e- 04
0.001
0.01
0.1
1
0 4 8 12 16 20 24 28 32 36 40
Residual norm
Time ( seconds)
10000
100000
1e+ 06
1e+ 07
0 20 40 60 80 100
Residual norm
Time ( seconds)
1e- 08
1e- 07
1e- 06
1e- 05
1e- 04
0.001
0.01
0.1
1
10
0 4 8 12 16 20 24 28 32 36 40
Residual norm
Time ( seconds)
1e- 08
1e- 06
1e- 04
0.01
1
100
10000
1e+ 06
1e+ 08
1e+ 10
1e+ 12
0 4 8 12 16 20 24 28 32 36 40
Residual norm
Time ( seconds)
1e- 08
1e- 06
1e- 04
0.01
1
100
10000
1e+ 06
0 4 8 12 16 20 24 28 32 36 40
Residual norm
Time ( seconds)
0.1
1
10
100
1000
1 2 4 8 16 32 64
Time ( seconds)
Number of processors
0.1
1
10
100
1 2 4 8 16 32 64
Time ( seconds)
Number of processors
0
5
10
15
20
25
30
35
40
1 2 4 8 16 32 64
Speedup
Number of processors
0
5
10
15
20
25
30
35
40
1 2 4 8 16 32 64
Speedup
Number of processors
Figure 2. Performance results ( top three rows). The matrices shown are ( left to right, top to bot-tom)
e20r0000, SAP- 160000, fidap011, MCAR- 160000, bcsstk08, and s3dkt3m2. Parallel performance
results ( fourth row) and speedup results ( bottom row) for the s3dkt3m2 ( left) and hp 400 5 ( right) ma-trices.
The same legend ( shown in the upper- left graph) is used for all graphs.
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium ( IPDPS’ 05)
1530- 2075/ 05 $ 20.00 IEEE