P-RAM Algorithms and Data Structures for Sparse Linear Systems


We have studied the variations in the matrix elements during Gauss elimination. We know that in Gauss elimination a nonzero in position (j, k) implies that there is also a nonzero in position (k, j), which causes fill to occur in position (i, j) when row k is used to eliminate the element in position (i, k). This process may be performed in general by creating zeros in the first column, then the second, and so forth. For k = 1, 2, ..., n-1 we use the formulae

a_ij^(k+1) = a_ij^(k) - (a_ik^(k) / a_kk^(k)) a_kj^(k)     and     b_i^(k+1) = b_i^(k) - (a_ik^(k) / a_kk^(k)) b_k^(k)

where i, j = 1, 2, ..., n. The only assumption required is that the inequalities a_kk^(k) ≠ 0, k = 1, 2, ..., n hold. These entries are called the pivots in Gaussian elimination. It is convenient to use the notation A^(k) x = b^(k)

for the system obtained after (k-1) steps, k = 1, 2, ..., n, with A^(1) = A and b^(1) = b. The final matrix A^(n) is an upper triangular matrix. Here we present the following analysis when the matrix is symmetric and positive definite.

Case 1 – When a_ij^(k) - a_ij^(k+1) = a_ik^(k) a_jk^(k) / a_kk^(k), where i ≠ j

We note that (i) when the above difference is less than 0, one of the elements a_ik^(k), a_jk^(k) is negative before elimination; (ii) when this difference is greater than 0, a_ik^(k) and a_jk^(k) are both negative or both positive, and the pivot element a_kk^(k) is always positive; and (iii) when this difference is equal to zero, no elimination will take place and one of the following conditions will be satisfied:

(a) a_ik^(k) = 0, or (b) a_jk^(k) = 0, or both (a) and (b) hold.

Case 2 – When a_ii^(k) - a_ii^(k+1) = (a_ik^(k))² / a_kk^(k), where i = j

In this case it is found that (i) when the above difference is less than 0, the pivot element will be eliminated; (ii) when the difference is greater than 0, the element to be eliminated will be greater than zero; and (iii) when this difference is equal to zero, elimination will not take place. The parallel complexities are θ_S(A) = O(log² n) for storage, θ_M(A) = O(log³ n) for multiplication/division and θ_A(A) = O(log³ n) for addition/subtraction, and the index of the processing element (PE) assigned to a_ij is i(i+1)/2 + j for a lower triangular matrix, (i-1)(n - i/2) + j for an upper triangular matrix and 2 + 2(i-2) + j for a tri-diagonal matrix. Based on the above concepts we develop the following algorithms:

(i) row-oriented dense U^T D U factorization PRAM-CREW algorithm, (ii) row-oriented sparse U^T D U factorization (with zero testing) PRAM-CREW algorithm, (iii) row-oriented sparse U^T D U factorization (with preprocessing) PRAM-CREW algorithm, and (iv) O(θ_A) symbolic factorization PRAM-CREW algorithm.

 

These algorithms are derived from algorithms used for symmetric Gaussian elimination on sparse matrices. The dense matrix algorithm may be implemented efficiently, since it stores and operates on just the upper triangles of A, D and U. However, when A is sparse this algorithm fails to exploit the zeros in A and U to reduce storage or work. It is possible to avoid arithmetic operations on zeros by testing the operands prior to using them, but there are two serious problems with this approach. First, all of the entries of the upper triangle may have to be tested in some algorithms; second, there are more testing operations than arithmetic operations, so that the running time of the algorithm can be proportional to the amount of testing rather than the amount of arithmetic. To avoid these problems A. H. Sherman showed that the preprocessing required to compute the three sets {ra_k} (the set of column indices j ≥ k for which a_kj ≠ 0), {ru_k} (the set of column indices j > k for which u_kj ≠ 0) and {cu_k} (the set of row indices i < k for which u_ik ≠ 0) can be performed in time proportional to the total storage. For these algorithms it has already been shown that the total storage required is proportional to the number of nonzeros in A and U, and that the total running time is proportional to the number of arithmetic operations on nonzeros. The symbolic factorization algorithm is based directly on the numerical decomposition algorithm. Since a symbolic factorization produces a data structure that exploits the sparsity of the triangular factors, the numerical decomposition can be performed using static storage; there is no need to perform storage allocation for the fill-in during the numerical computations. This reduces both storage and execution time overheads for the numerical phase, and from the symbolic factorization we obtain a bound on the amount of storage we need in order to solve the linear system, which immediately tells us whether the numerical computation is feasible.

The above discussion gives an idea of the values of the elements of the matrix. Gauss elimination for a symmetric, positive definite matrix on shared memory has been studied. The classical sequential algorithm for the U^T D U factorization of a matrix A was analyzed by A. Sherman, with the following results: θ_S(A) ≈ O(n²), θ_M(A) ≈ O(n³) and θ_A(A) ≈ O(n³), where θ_S, θ_M and θ_A denote the storage, the number of multiplications/divisions and the number of additions/subtractions respectively. The matrix multiplication (SIMD-hypercube) example of Dekel, Nassimi and Sahni (1981) is extended to sparse linear systems. Now consider a cube-connected computer with n³ PEs. Conceptually, these PEs may be regarded as arranged in an n × n × n array pattern. If it is assumed that the PEs are indexed in row-major order, the PE in position (i, j, k) of this array, PE(i, j, k), has index in² + jn + k (note that array indices are in the range [0, n-1]). Hence, if r_(3q-1) ... r_0 is the binary representation of the PE position (i, j, k), then i = r_(3q-1) ... r_(2q), j = r_(2q-1) ... r_q and k = r_(q-1) ... r_0. In mapping data onto the hypercube, each datum is mapped to all of its neighbor processors in the n-cube at Hamming distance of exactly one bit, which yields a tree-like structure whose leaves cover all possible dimensions (i.e., for an n-cube the tree has n leaves). The complexity of these parallel algorithms is analyzed below.

In this paper it is shown that (i) the routing or access function in² + jn + k has topological properties, and (ii) the cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors.

 

1. Introduction: In this paper we explain a method of representing a sparse matrix in parallel using the P-RAM model. The P-RAM model is a shared memory model [5]. To develop these algorithms we use the method of symbolic factorization for sparse symmetric factorization. Symbolic factorization produces a data structure that exploits the sparsity of the triangular factors. We represent an array pattern of processing elements (PEs) for a sparse matrix on a hypercube; we have already developed P-RAM algorithms for linear systems (dense case) [7, 8, 9]. These algorithms are implemented on the hypercube architecture.

 

2. Background: We consider symmetric Gaussian elimination for the solution of the system of linear equations.

            A x = b

where A is an n × n symmetric, positive definite matrix. A. Sherman [13] developed a row-oriented U^T D U factorization algorithm for dense matrices A; we consider two modifications of it which take advantage of sparseness in A and U. We also discuss the type of information about A and U which is required to allow sparse symmetric factorization to be implemented efficiently.
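To make the factorization concrete, here is a minimal dense sketch in Python (our illustration, not Sherman's code; the name utdu and the dense storage are assumptions):

    import numpy as np

    def utdu(A):
        # Row-oriented U^T D U factorization of a symmetric positive
        # definite matrix A. Returns (U, d) with A = U.T @ diag(d) @ U,
        # where U is unit upper triangular and d holds the pivots.
        n = A.shape[0]
        U = np.eye(n)
        d = np.zeros(n)
        for k in range(n):
            # t_j = a_kj - sum_{i<k} u_ik * d_i * u_ij  for j = k, ..., n-1
            t = A[k, k:].astype(float).copy()
            for i in range(k):
                t -= U[i, k] * d[i] * U[i, k:]
            d[k] = t[0]                  # pivot; positive when A is SPD
            U[k, k + 1:] = t[1:] / d[k]  # u_kk = 1 is implicit in np.eye
        return U, d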

3. Analysis of Symmetric Gauss Elimination: In this section we examine the testing operations that will be performed during symmetric Gaussian elimination. Gaussian elimination is applied to transform the matrix of a linear system to triangular form. This process may be performed in general by creating zeros in the first column, then the second, and so forth. For k = 1, 2, ..., n-1 we use the formulae

                  a_ij^(k+1) = a_ij^(k) - (a_ik^(k) / a_kk^(k)) a_kj^(k)     ...... (1)

and               b_i^(k+1) = b_i^(k) - (a_ik^(k) / a_kk^(k)) b_k^(k)     ...... (2)

where i, j = 1, 2, ..., n. The only assumption required is that the inequalities a_kk^(k) ≠ 0, k = 1, 2, ..., n hold. These entries are called the pivots in Gaussian elimination. It is convenient to use the notation

                  A^(k) x = b^(k)     .......... (3)

for the system obtained after (k-1) steps, k = 1, 2, ..., n, with A^(1) = A and b^(1) = b. The final matrix A^(n) is an upper triangular matrix [3].
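As a minimal illustration of equations (1) and (2), the following Python sketch (ours; no pivoting, assuming all pivots a_kk^(k) are nonzero) performs the elimination:

    import numpy as np

    def gauss_eliminate(A, b):
        # Transform A x = b to upper triangular form using equations
        # (1) and (2); assumes all pivots a_kk^(k) are nonzero, which
        # holds when A is symmetric positive definite.
        A = A.astype(float).copy()
        b = b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):                # create zeros in column k
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]         # multiplier a_ik^(k) / a_kk^(k)
                A[i, k:] -= m * A[k, k:]      # equation (1); zeros A[i, k]
                b[i] -= m * b[k]              # equation (2)
        return A, b                           # A is now A^(n), upper triangular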

In Gauss elimination, a nonzero in position (j, k) implies that there is also a nonzero in position (k, j), which causes fill to occur in position (i, j) when row k is used to eliminate the element in position (i, k) [4].

Here we present the following analysis when the matrix is symmetric and positive definite.

When i ≠ j, equation (1) gives the difference a_ij^(k) - a_ij^(k+1) = a_ik^(k) a_jk^(k) / a_kk^(k). When this difference is less than 0, one of the elements a_ik^(k), a_jk^(k) is negative before elimination; when it is greater than 0, a_ik^(k) and a_jk^(k) are both negative or both positive, and the pivot element a_kk^(k) is always positive; and when it is equal to zero, no elimination will take place and one of the following conditions will be satisfied:

            (a) a_ik^(k) = 0,     or     (b) a_jk^(k) = 0,     or     both (a) and (b) hold.

When i = j in the above analysis of equation (1), in the first case the pivot element will be eliminated, in the second case the element to be eliminated will be greater than zero, and in the third case elimination will not take place.

We have suggested testing operations for the variations occurring in the elements during symmetric Gaussian elimination. There are more testing operations than arithmetic operations, so the running time of the algorithms could be proportional to the amount of testing rather than the amount of arithmetic, which motivates symbolic factorization. To avoid this problem Sherman precomputed the sets ra_k, ru_k, cu_k and showed that in an implementation of this scheme the total storage required is proportional to the number of nonzeros in A and U, and that the total running time is proportional to the number of arithmetic operations on nonzeros. In addition, the preprocessing required to compute the sets {ra_k}, {ru_k} and {cu_k} can be performed in time proportional to the total storage. To see how to avoid these problems, let us assume that for each k, 1 ≤ k ≤ n, we have precomputed:

(i) the set ra_k of columns j ≥ k for which a_kj ≠ 0;

(ii) the set ru_k of columns j > k for which u_kj ≠ 0;

(iii) the set cu_k of rows i < k for which u_ik ≠ 0.

A. H. Sherman [13] showed that the total storage required is proportional to the number of nonzeros in A and U and that the total running time is proportional to the number of arithmetic operations on nonzeros. In addition, the preprocessing required to compute the sets {ra_k}, {ru_k}, {cu_k} can be performed in time proportional to the total storage.
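To illustrate this preprocessing, the following Python sketch (ours; the (row, column) input format and 0-based indexing are assumptions) builds {ra_k} from the nonzero structure of A and obtains {cu_k} by transposing the row structures {ru_k}, each in a single pass over the stored nonzeros, i.e., in time proportional to the total storage:

    def build_ra(nonzeros, n):
        # nonzeros: iterable of (i, j) index pairs of A (0-based).
        # ra[k] = {j >= k : a_kj != 0}; keeping only the upper triangle
        # suffices because A is symmetric.
        ra = [set() for _ in range(n)]
        for i, j in nonzeros:
            r, c = (i, j) if i <= j else (j, i)
            ra[r].add(c)
        return ra

    def build_cu(ru):
        # cu[k] = {i < k : u_ik != 0}, the transpose of the row
        # structures ru[i] = {j > i : u_ij != 0}.
        cu = [set() for _ in range(len(ru))]
        for i, row in enumerate(ru):
            for j in row:
                cu[j].add(i)
        return cu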

4. Symbolic Factorization: The sets {ra_k}, which describe the structure of A, are input parameters, and the symbolic factorization algorithm computes the sets {ru_k} from them. The sets {cu_k} can be computed from the sets {ru_k}. At the k-th step of the symbolic factorization algorithm, ru_k is computed from ra_k and the sets ru_i for i < k. An examination of Algorithm 3.1 shows that for j > k, u_kj ≠ 0 if and only if either

(i) a_kj ≠ 0, or

(ii) u_ik ≠ 0 for some i ∈ cu_k.

Thus letting

ru_k^i = {j ∈ ru_i : j > k},

we have j ∈ ru_k if and only if either

(iii) j ∈ ra_k, or

(iv) j ∈ ru_k^i for some i ∈ cu_k.

 

At the k-th step, ru_k is formed by combining ra_k with the sets {ru_k^i} for i ∈ cu_k. However, it is not necessary to examine ru_k^i for all rows i ∈ cu_k. Let l_k be the set of rows i ∈ cu_k for which k is the minimum column index in ru_i. Then we have the following result, which expresses a type of transitivity condition for the fill-in in symmetric Gaussian elimination. The import of the theorem is that in forming ru_k, it suffices to examine ru_k^i for i ∈ l_k.
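A minimal Python sketch of this scheme (our illustration of the rule just described, not Sherman's code; 0-based indices) computes the sets ru_k while merging only the rows in l_k:

    from collections import defaultdict

    def symbolic_factorization(ra):
        # ra[k]: set of columns j >= k with a_kj != 0 (input structure).
        # Computes ru[k] = {j > k : u_kj != 0}; only rows i in l_k
        # (those whose minimum column index in ru[i] is k) are merged.
        n = len(ra)
        ru = [set() for _ in range(n)]
        l = defaultdict(list)              # l[k]: rows to merge at step k
        for k in range(n):
            s = {j for j in ra[k] if j > k}
            for i in l[k]:
                s |= {j for j in ru[i] if j > k}
            ru[k] = s
            if s:
                l[min(s)].append(k)        # row k joins l_m for m = min ru_k
        return ru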

There are some important reasons why it is desirable to perform such a symbolic factorization.

i. Since a symbolic factorization produces a data structure that exploits the sparsity of the triangular factors, the numerical decomposition can be performed using a static storage scheme. There is no need to perform storage allocation for the fill-in during the numerical computations. This reduces both storage and execution time overheads for the numerical phase [4] (a concrete sketch follows this list).

ii. From the symbolic factorization we obtain a bound on the amount of storage we need in order to solve the linear system. This immediately tells us whether the numerical computation is feasible. (Of course, this is important only if the symbolic factorization can be performed cheaply, both in terms of storage and execution time, and if the bound on space is reasonably tight.) [4]
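For instance, the row structures produced by the symbolic factorization above can be frozen into a static compressed-row layout before any numerical work begins (allocate_static_storage is a hypothetical helper, shown only to make the static-storage point concrete):

    def allocate_static_storage(ru):
        # Freeze the symbolic structure of U into row pointers, column
        # indices and a value array sized once, so the numerical phase
        # performs no dynamic allocation for fill-in.
        ptr, cols = [0], []
        for row in ru:
            cols.extend(sorted(row))
            ptr.append(len(cols))
        vals = [0.0] * len(cols)      # space for the nonzeros of U
        return ptr, cols, vals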

 

5. Parallel Matrix Algorithm: By an efficient parallel algorithm we mean one that takes polylogarithmic time using a polynomial number of processors. In practical terms, at most a polynomial number of processors is reckoned to be feasible [6]. A polylogarithmic time algorithm takes O(log^k n) parallel time for some constant integer k, where n is the problem size. Problems which can be solved within these constraints are universally regarded as having efficient parallel solutions and are said to belong to the class NC (Nick Pippenger's class).

5.1 Representation of Array Pattern of Processing Elements (PEs): Consider the case of a three-dimensional array pattern with n³ = 2^(3q) processing elements (PEs) [2, 10].

Conceptually these PEs may be regarded as arranged in an n × n × n array pattern. If we assume that the PEs are indexed in row-major order, the PE in position (i, j, k) of this array, PE(i, j, k), has index in² + jn + k (note that array indices are in the range [0, n-1]). Hence, if r_(3q-1) ... r_0 is the binary representation of the PE position (i, j, k), then i = r_(3q-1) ... r_(2q), j = r_(2q-1) ... r_q and k = r_(q-1) ... r_0. Using A(i, j, k), B(i, j, k) and C(i, j, k) to represent memory locations in PE(i, j, k), we can describe the initial condition for matrix multiplication as

                 A(0, j, k) = A_jk

                 B(0, j, k) = B_jk,   0 ≤ j < n, 0 ≤ k < n

A_jk and B_jk are the elements of the two matrices to be multiplied. The desired final configuration is

                 C(0, j, k) = C_jk,   0 ≤ j < n, 0 ≤ k < n

where

                 C_jk = Σ_(l=0)^(n-1) A_jl B_lk                    (1)

This algorithm computes the product matrix C by directly making use of (1). The algorithm has three distinct phases. In the first, the elements of A and B are distributed over the n³ PEs so that we have A(l, j, k) = A_jl and B(l, j, k) = B_lk. In the second phase the products C(l, j, k) = A(l, j, k) · B(l, j, k) = A_jl B_lk are computed. Finally, in the third phase the sums C(0, j, k) = Σ_l C(l, j, k) are computed.

The details are spelled out in Dekel, Nassimi and Sahni 1981 [2]. In this procedure all PE references are by PE index (recall that the index of PE(i, j, k) is in² + jn + k). The key to the algorithm of Dekel, Nassimi and Sahni [1] is the data routing strategy: 5q = 5 log n routing steps are sufficient to broadcast the initial values through the processor array and to combine the results.
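The following Python sketch is a sequential simulation of the three phases (illustration only; on the hypercube the broadcasts of phase 1 and the reduction of phase 3 are each done in O(log n) routing steps):

    import numpy as np

    def dns_multiply(A, B):
        # Simulate the three-phase scheme on a conceptual n x n x n
        # PE array, where cell (l, j, k) stands for one PE.
        n = A.shape[0]
        a = np.empty((n, n, n))
        b = np.empty((n, n, n))
        # Phase 1: distribute so that A(l, j, k) = A_jl and B(l, j, k) = B_lk.
        for l in range(n):
            a[l] = A[:, [l]]           # column l of A, replicated over k
            b[l] = B[[l], :]           # row l of B, replicated over j
        # Phase 2: pointwise products C(l, j, k) = A_jl * B_lk.
        c = a * b
        # Phase 3: summing over l gives C(0, j, k) = C_jk, as in (1).
        return c.sum(axis=0)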

The array pattern of processing elements (PEs) in row-major order for specialized sparse (symmetric) matrices [9] can be formulated in the following manner; a small code sketch follows the list.

1. Representation of lower-triangular matrix: index of a_ij = total number of elements in the first i-1 rows + number of elements up to the j-th column in the i-th row

= i(i+1)/2 + j   (0 ≤ j ≤ i ≤ n-1, as in Fig. 5.2)

2. Representation of upper-triangular matrix: index of a_ij = number of elements up to the a_ij element

= (i-1)(n - i/2) + j   (1 ≤ i ≤ j ≤ n)

3. Representation of tri-diagonal matrix: index of a_ij = total number of elements in the first (i-1) rows + number of elements up to the j-th column in the i-th row

= 2 + 2(i-2) + j   (1 ≤ i, j ≤ n, |i-j| ≤ 1)
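A small Python sketch of these index mappings (ours; note that the lower-triangular formula is 0-based, matching Fig. 5.2, while the other two are 1-based as written):

    def lower_index(i, j):
        # Lower triangular, 0-based, valid for 0 <= j <= i.
        return i * (i + 1) // 2 + j

    def upper_index(i, j, n):
        # Upper triangular, 1-based, valid for 1 <= i <= j <= n.
        # (i-1)(n - i/2) + j, written in integer arithmetic.
        return (i - 1) * (2 * n - i) // 2 + j

    def tri_index(i, j):
        # Tri-diagonal, 1-based, valid for |i - j| <= 1.
        return 2 + 2 * (i - 2) + j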

The representation of the array pattern of processing elements (PEs) for a lower triangular matrix is presented on a hypercube model. Hypercubes are loosely coupled parallel processors based on the binary n-cube network. An n-cube parallel processor consists of 2^n identical processors, each provided with its own sizable memory and interconnected with n neighbors [12]. This architecture consists of a large number of identical processors interconnected according to some convenient pattern. In a shared memory system, processors operate on data from the common memory: each processor reads the data it needs, performs some processing and writes the results back to memory. In a distributed memory system, interprocessor communication is achieved by message passing and computation is data driven (although some designs incorporate a global bus, this does not constitute the main means of intercommunication). By message passing it is meant that data, or possibly code, are transferred from processor A to processor B by traveling across a sequence of nearest-neighbor nodes starting with node A and ending with node B; synchronization is driven by data in the sense that computation in a node is performed only when its necessary data are available. The main advantage of such architectures, often referred to as ensemble architectures, is the simplicity of their design. The nodes are identical, or are of a few different kinds, and can therefore be fabricated at relatively low cost. The model can easily be made fault tolerant by shutting down failing nodes. The most important advantage of this class of design is the ability to exploit particular topologies of problems or algorithms in order to minimize communication costs. A hypercube is a multidimensional mesh of nodes with exactly two nodes in each dimension. A d-dimensional hypercube consists of K nodes, where K = 2^d.

1. A hypercube has n spatial dimensions, where n can be any positive integer (including zero).

2. A hypercube has 2^n vertices.

3. There are n connections (lines) that meet at each vertex of a hypercube.

4.            All connections at a hypercube vertex meet at right angles with respect to each other [1], [14].

5.            The Hypercube can be constructed recursively from lower dimensional cubes.

6. An architecture in which the degree and the diameter of the graph are the same achieves a good balance between communication speed and the complexity of the topology network. Hypercubes achieve this equality, which explains why they are one of today's most popular designs (e.g., the iPSC of Intel Corp., the T-series of FPS, the nCUBE, and the Connection Machine of Thinking Machines Corp.) [15].

When the lower triangular matrix is presented in three dimensions, the PEs are indexed in the manner of Fig. 5.2, i.e., by index of a_ij = i(i+1)/2 + j.

For different values of i and j we can map the hypercube. Here we present the mapping of the hypercube for a single value of i and j.

 

A.  i = 0, j = 0, a(i, j) = 0, k = 0

BIT(0000, 0) = 0,    BIT-COMPLEMENT(0000, 0) = 0001 = (1)
BIT(0000, 1) = 0,    BIT-COMPLEMENT(0000, 1) = 0010 = (2)
BIT(0000, 2) = 0,    BIT-COMPLEMENT(0000, 2) = 0100 = (4)
BIT(0000, 3) = 0,    BIT-COMPLEMENT(0000, 3) = 1000 = (8)

Fig 5.1: Mapping of a lower triangular matrix onto the hypercube.
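A minimal Python sketch of these bit primitives (the names bit and neighbors are ours):

    def bit(x, b):
        # BIT(x, b): value of bit b of PE index x.
        return (x >> b) & 1

    def neighbors(x, dims=4):
        # BIT-COMPLEMENT of each bit in turn gives the hypercube
        # neighbors of PE x (Hamming distance exactly one).
        return [x ^ (1 << b) for b in range(dims)]

    # PE 0 (binary 0000) in a 4-cube:
    # neighbors(0) -> [1, 2, 4, 8], i.e. 0001, 0010, 0100, 1000,
    # matching the BIT-COMPLEMENT table above.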

        j      0           1           2           3
   i
   0        0 (0000)
   1        1 (0001)   2 (0010)
   2        3 (0011)   4 (0100)   5 (0101)
   3        6 (0110)   7 (0111)   8 (1000)   9 (1001)

Figure 5.2: Array view of a 16-PE hypercube for a lower triangular matrix. Each square represents a PE. The number in a square is the PE index (both decimal and binary representations are provided).

We now conclude with the following results.

Lemma 1: The routing or access function in² + jn + k has topological properties.

Proof: This access function is a polynomial over a 3-dimensional discrete space, where i, j, k are the three dimensions and n is fixed. It gives a relationship among the processing elements (i.e., the 3q connections (lines) that meet at each vertex of the hypercube), meaning that the algorithm can be evaluated in polylogarithmic time using the polynomial in² + jn + k of three-dimensional discrete space. We can construct hypercubes recursively from lower dimensional cubes using the polynomial in² + jn + k. A discrete space is a topological space in which all points are isolated. We conclude that the access function by which we map matrix elements is pairwise continuous. It is shown in the implementation that, because of the Hamming distance between the processors, the hypercube is modeled as a discrete space with discrete time. This access function is also used to map a three-dimensional matrix into RAM sequentially.
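A minimal sketch of the access function and its inverse (function names are ours):

    def pe_index(i, j, k, n):
        # Routing/access function: position (i, j, k) -> PE index
        # in row-major order.
        return i * n * n + j * n + k

    def pe_position(idx, n):
        # Inverse mapping: recover the array position (i, j, k).
        return idx // (n * n), (idx // n) % n, idx % n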

Lemma 2: The cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors.

Proof: For storage and retrieval of a matrix we use parallel iteration. Parallel iteration has the property of converging in log n steps in parallel. It converges geometrically, i.e., at the rate of a geometric progression; therefore the two costs are proportional to each other for a single value. From the above fact we can write that the cost of retrieval is proportional to the cost of storage:

⇒ cost of retrieval = k × cost of storage (where k is a constant);

⇒ if k ≥ 1 then the matrix is dense, and if k < 1 then the matrix is sparse.

Here we present the PRAM-CREW algorithm.

5.2 Row-oriented sparse U^T D U factorization (with pre-processing) PRAM-CREW Algorithm

begin
  repeat log n times do
    for all (ordered) pairs (i, j, k), 0 < k ≤ n, 0 < i = j = k ≤ n,
    and q = log n in parallel do
      m(2^(2q) i + 2^q j + k) = m(k, k)
      d(2^(2q) i + 2^q j + k) = d(k, k)
      a(2^(2q) i + 2^q j + k) = a(k, k)
      m(k, k) = 1
      d(k, k) ← a(k, k)

e