P-RAM Algorithms and Data Structures for Sparse Linear Systems


We have studied the variations in the matrix elements during Gauss elimination. We know that in Gauss elimination a nonzero in position (j, k) implies that there is also a nonzero in position (k, j), which causes fill to occur in position (i, j) when row k is used to eliminate the element in position (i, k). This process may be performed in general by creating zeros in the first column, then the second, and so forth. For k = 1, 2, ..., n-1 we use the formulae

a_ij^(k+1) = a_ij^(k) - (a_ik^(k) / a_kk^(k)) a_kj^(k)     and     b_i^(k+1) = b_i^(k) - (a_ik^(k) / a_kk^(k)) b_k^(k)

where i, j = 1, 2, ..., n. The only assumption required is that the inequalities a_kk^(k) ≠ 0, k = 1, 2, ..., n hold. These entries are called the pivots in Gaussian elimination. It is convenient to use the notation A^(k) x = b^(k)

for the system obtained after (k-1) steps, k = 1, 2, ..., n, with A^(1) = A and b^(1) = b. The final matrix A^(n) is an upper triangular matrix. Here we present the following analysis when the matrix is symmetric and positive definite.

Case 1 – When a_ij^(k) - a_ij^(k+1) = a_ik^(k) a_jk^(k) / a_kk^(k), where i ≠ j

We note that (i) when the above difference is less than 0, one of the elements a_ik^(k), a_jk^(k) is negative before elimination; (ii) when this difference is greater than 0, a_ik^(k) and a_jk^(k) are both negative or both positive, and the pivot element a_kk^(k) is always positive; and (iii) when this difference is equal to zero, no elimination will take place and one of the following conditions will be satisfied:

(a) a_ik^(k) = 0, or (b) a_jk^(k) = 0, or both (a) and (b) hold.

Case 2 – When a_ii^(k) - a_ii^(k+1) = (a_ik^(k))² / a_kk^(k), where i = j

In this case it is found that (i) when the above difference is less than 0, the pivot element will be eliminated; (ii) when the difference is greater than 0, the element to be eliminated will be greater than zero; and (iii) when this difference is equal to zero, elimination will not take place. The parallel complexities are θ_S(A) = O(log² n) for storage, θ_M(A) = O(log³ n) for multiplication/division and θ_A(A) = O(log³ n) for addition/subtraction, and the index of the processing element (PE) assigned to a_ij is i(i+1)/2 + j for a lower triangular matrix, (i-1)(n - i/2) + j for an upper triangular matrix and 2 + 2(i-2) + j for a tri-diagonal matrix. Based on the above concepts we develop the following algorithms:

(i) row-oriented dense U^T D U factorization PRAM-CREW algorithm, (ii) row-oriented sparse U^T D U factorization (with zero testing) PRAM-CREW algorithm, (iii) row-oriented sparse U^T D U factorization (with preprocessing) PRAM-CREW algorithm, and (iv) O(θ_A) symbolic factorization PRAM-CREW algorithm.

 

These algorithms are derived from algorithms used for symmetric Gaussian elimination on sparse matrices. The dense matrix algorithm may be implemented efficiently, since it stores and operates on just the upper triangles of A, D and U. However, when A is sparse this algorithm fails to exploit the zeros in A and U to reduce storage or work. It is possible to avoid arithmetic operations on zeros by testing the operands prior to using them, but there are two serious problems with this approach. First, all of the entries of the upper triangle may have to be tested in some algorithms; second, there are more testing operations than arithmetic operations, so that the running time of the algorithm can be proportional to the amount of testing rather than the amount of arithmetic. To avoid these problems A. H. Sherman showed that the preprocessing required to compute the three sets {ra_k} (the set of column indices j ≥ k for which a_kj ≠ 0), {ru_k} (the set of column indices j > k for which u_kj ≠ 0) and {cu_k} (the set of row indices i < k for which u_ik ≠ 0) can be performed in time proportional to the total storage. For these algorithms it has already been shown that the total storage required is proportional to the number of nonzeros in A and U, and that the total running time is proportional to the number of arithmetic operations on nonzeros. The symbolic factorization algorithm is based directly on the numerical decomposition algorithm. Since a symbolic factorization produces a data structure that exploits the sparsity of the triangular factors, the numerical decomposition can be performed using static storage; there is no need to perform storage allocation for the fill-in during the numerical computations. This reduces both storage and execution time overheads for the numerical phase, and from the symbolic factorization we obtain a bound on the amount of storage we need in order to solve the linear system, which immediately tells us whether the numerical computation is feasible.

The above discussion gives an idea of the values of the elements of the matrix. Gauss elimination for a symmetric, positive definite matrix on shared memory has been studied. The classical sequential algorithm for the U^T D U factorization of a matrix A was analyzed by A. Sherman, with the following results: θ_S(A) ≈ O(n²), θ_M(A) ≈ O(n³) and θ_A(A) ≈ O(n³), where θ_S, θ_M and θ_A denote the storage, the number of multiplications/divisions and the number of additions/subtractions respectively. The matrix multiplication (SIMD-hypercube) example of Dekel, Nassimi and Sahni (1981) is extended to sparse linear systems. Now consider a cube-connected computer with n³ PEs. Conceptually, these PEs may be regarded as arranged in an n × n × n array pattern. If it is assumed that the PEs are indexed in row-major order, the PE in position (i, j, k) of this array, PE(i, j, k), has index in² + jn + k (note that array indices are in the range [0, n-1]). Hence, if r_(3q-1) ... r_0 is the binary representation of the PE position (i, j, k), then i = r_(3q-1) ... r_(2q), j = r_(2q-1) ... r_q and k = r_(q-1) ... r_0. In mapping data onto the hypercube, each datum is mapped to all of its neighbor processors in the n-cube at Hamming distance of exactly one bit, which yields a tree-like structure whose leaves cover all possible dimensions (i.e., for an n-cube the tree has n leaves). The complexity of these parallel algorithms is analyzed below.

In this paper it is shown that (i) the routing or access function in² + jn + k has topological properties, and (ii) the cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors.

 

1. Introduction: In this paper we explain a method of representing a sparse matrix in parallel using the P-RAM model. The P-RAM model is a shared memory model [5]. To develop these algorithms we use the method of symbolic factorization for sparse symmetric factorization. Symbolic factorization produces a data structure that exploits the sparsity of the triangular factors. We represent an array pattern of processing elements (PEs) for a sparse matrix on a hypercube; we have already developed P-RAM algorithms for linear systems (dense case) [7, 8, 9]. These algorithms are implemented on the hypercube architecture.

 

2. Background: We consider symmetric Gaussian elimination for the solution of the system of linear equations.

            A x = b

where A is an n × n symmetric, positive definite matrix. A. Sherman [13] developed a row-oriented U^T D U factorization algorithm for dense matrices A; we consider two modifications of it which take advantage of sparseness in A and U. We also discuss the type of information about A and U which is required to allow sparse symmetric factorization to be implemented efficiently.
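To make the factorization concrete, here is a minimal dense sketch in Python (our illustration, not Sherman's code; the name utdu and the dense storage are assumptions):

    import numpy as np

    def utdu(A):
        # Row-oriented U^T D U factorization of a symmetric positive
        # definite matrix A. Returns (U, d) with A = U.T @ diag(d) @ U,
        # where U is unit upper triangular and d holds the pivots.
        n = A.shape[0]
        U = np.eye(n)
        d = np.zeros(n)
        for k in range(n):
            # t_j = a_kj - sum_{i<k} u_ik * d_i * u_ij  for j = k, ..., n-1
            t = A[k, k:].astype(float).copy()
            for i in range(k):
                t -= U[i, k] * d[i] * U[i, k:]
            d[k] = t[0]                  # pivot; positive when A is SPD
            U[k, k + 1:] = t[1:] / d[k]  # u_kk = 1 is implicit in np.eye
        return U, d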

3. Analysis of Symmetric Gauss Elimination: In this section we examine the testing operations that will be performed during symmetric Gaussian elimination. Gaussian elimination is applied to transform the matrix of a linear system to triangular form. This process may be performed in general by creating zeros in the first column, then the second, and so forth. For k = 1, 2, ..., n-1 we use the formulae

                  a_ij^(k+1) = a_ij^(k) - (a_ik^(k) / a_kk^(k)) a_kj^(k)     ...... (1)

and               b_i^(k+1) = b_i^(k) - (a_ik^(k) / a_kk^(k)) b_k^(k)     ...... (2)

where i, j = 1, 2, ..., n. The only assumption required is that the inequalities a_kk^(k) ≠ 0, k = 1, 2, ..., n hold. These entries are called the pivots in Gaussian elimination. It is convenient to use the notation

                  A^(k) x = b^(k)     .......... (3)

for the system obtained after (k-1) steps, k = 1, 2, ..., n, with A^(1) = A and b^(1) = b. The final matrix A^(n) is an upper triangular matrix [3].
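As a minimal illustration of equations (1) and (2), the following Python sketch (ours; no pivoting, assuming all pivots a_kk^(k) are nonzero) performs the elimination:

    import numpy as np

    def gauss_eliminate(A, b):
        # Transform A x = b to upper triangular form using equations
        # (1) and (2); assumes all pivots a_kk^(k) are nonzero, which
        # holds when A is symmetric positive definite.
        A = A.astype(float).copy()
        b = b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):                # create zeros in column k
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]         # multiplier a_ik^(k) / a_kk^(k)
                A[i, k:] -= m * A[k, k:]      # equation (1); zeros A[i, k]
                b[i] -= m * b[k]              # equation (2)
        return A, b                           # A is now A^(n), upper triangular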

In Gauss elimination, a nonzero in position (j, k) implies that there is also a nonzero in position (k, j), which causes fill to occur in position (i, j) when row k is used to eliminate the element in position (i, k) [4].

Here we present the following analysis when the matrix is symmetric and positive definite.

When i ≠ j, equation (1) gives the difference a_ij^(k) - a_ij^(k+1) = a_ik^(k) a_jk^(k) / a_kk^(k). When this difference is less than 0, one of the elements a_ik^(k), a_jk^(k) is negative before elimination; when it is greater than 0, a_ik^(k) and a_jk^(k) are both negative or both positive, and the pivot element a_kk^(k) is always positive; and when it is equal to zero, no elimination will take place and one of the following conditions will be satisfied:

            (a) a_ik^(k) = 0,     or     (b) a_jk^(k) = 0,     or     both (a) and (b) hold.

When i = j in the above analysis of equation (1), in the first case the pivot element will be eliminated, in the second case the element to be eliminated will be greater than zero, and in the third case elimination will not take place.

We have suggested testing operations for the variations occurring in the elements during symmetric Gaussian elimination. There are more testing operations than arithmetic operations, so the running time of the algorithms could be proportional to the amount of testing rather than the amount of arithmetic, which motivates symbolic factorization. To avoid this problem Sherman precomputed the sets ra_k, ru_k, cu_k and showed that in an implementation of this scheme the total storage required is proportional to the number of nonzeros in A and U, and that the total running time is proportional to the number of arithmetic operations on nonzeros. In addition, the preprocessing required to compute the sets {ra_k}, {ru_k} and {cu_k} can be performed in time proportional to the total storage. To see how to avoid these problems, let us assume that for each k, 1 ≤ k ≤ n, we have precomputed:

(i) the set ra_k of columns j ≥ k for which a_kj ≠ 0;

(ii) the set ru_k of columns j > k for which u_kj ≠ 0;

(iii) the set cu_k of rows i < k for which u_ik ≠ 0.

A. H. Sherman [13] showed that the total storage required is proportional to the number of nonzeros in A and U and that the total running time is proportional to the number of arithmetic operations on nonzeros. In addition, the preprocessing required to compute the sets {ra_k}, {ru_k}, {cu_k} can be performed in time proportional to the total storage.
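To illustrate this preprocessing, the following Python sketch (ours; the (row, column) input format and 0-based indexing are assumptions) builds {ra_k} from the nonzero structure of A and obtains {cu_k} by transposing the row structures {ru_k}, each in a single pass over the stored nonzeros, i.e., in time proportional to the total storage:

    def build_ra(nonzeros, n):
        # nonzeros: iterable of (i, j) index pairs of A (0-based).
        # ra[k] = {j >= k : a_kj != 0}; keeping only the upper triangle
        # suffices because A is symmetric.
        ra = [set() for _ in range(n)]
        for i, j in nonzeros:
            r, c = (i, j) if i <= j else (j, i)
            ra[r].add(c)
        return ra

    def build_cu(ru):
        # cu[k] = {i < k : u_ik != 0}, the transpose of the row
        # structures ru[i] = {j > i : u_ij != 0}.
        cu = [set() for _ in range(len(ru))]
        for i, row in enumerate(ru):
            for j in row:
                cu[j].add(i)
        return cu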

4. Symbolic Factorization: The sets {ra_k}, which describe the structure of A, are input parameters, and the symbolic factorization algorithm computes the sets {ru_k} from them. The sets {cu_k} can be computed from the sets {ru_k}. At the k-th step of the symbolic factorization algorithm, ru_k is computed from ra_k and the sets ru_i for i < k. An examination of Algorithm 3.1 shows that for j > k, u_kj ≠ 0 if and only if either

(i) a_kj ≠ 0, or

(ii) u_ik ≠ 0 for some i ∈ cu_k.

Thus letting

ru_k^i = {j ∈ ru_i : j > k},

we have j ∈ ru_k if and only if either

(iii) j ∈ ra_k, or

(iv) j ∈ ru_k^i for some i ∈ cu_k.

 

At the k-th step, ru_k is formed by combining ra_k with the sets {ru_k^i} for i ∈ cu_k. However, it is not necessary to examine ru_k^i for all rows i ∈ cu_k. Let l_k be the set of rows i ∈ cu_k for which k is the minimum column index in ru_i. Then we have the following result, which expresses a type of transitivity condition for the fill-in in symmetric Gaussian elimination. The import of the theorem is that in forming ru_k, it suffices to examine ru_k^i for i ∈ l_k.
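A minimal Python sketch of this scheme (our illustration of the rule just described, not Sherman's code; 0-based indices) computes the sets ru_k while merging only the rows in l_k:

    from collections import defaultdict

    def symbolic_factorization(ra):
        # ra[k]: set of columns j >= k with a_kj != 0 (input structure).
        # Computes ru[k] = {j > k : u_kj != 0}; only rows i in l_k
        # (those whose minimum column index in ru[i] is k) are merged.
        n = len(ra)
        ru = [set() for _ in range(n)]
        l = defaultdict(list)              # l[k]: rows to merge at step k
        for k in range(n):
            s = {j for j in ra[k] if j > k}
            for i in l[k]:
                s |= {j for j in ru[i] if j > k}
            ru[k] = s
            if s:
                l[min(s)].append(k)        # row k joins l_m for m = min ru_k
        return ru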

There are some important reasons why it is desirable to perform such a symbolic factorization.

i. Since a symbolic factorization produces a data structure that exploits the sparsity of the triangular factors, the numerical decomposition can be performed using a static storage scheme. There is no need to perform storage allocation for the fill-in during the numerical computations. This reduces both storage and execution time overheads for the numerical phase [4] (a concrete sketch follows this list).

ii. From the symbolic factorization we obtain a bound on the amount of storage we need in order to solve the linear system. This immediately tells us whether the numerical computation is feasible. (Of course, this is important only if the symbolic factorization can be performed cheaply, both in terms of storage and execution time, and if the bound on space is reasonably tight.) [4]
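For instance, the row structures produced by the symbolic factorization above can be frozen into a static compressed-row layout before any numerical work begins (allocate_static_storage is a hypothetical helper, shown only to make the static-storage point concrete):

    def allocate_static_storage(ru):
        # Freeze the symbolic structure of U into row pointers, column
        # indices and a value array sized once, so the numerical phase
        # performs no dynamic allocation for fill-in.
        ptr, cols = [0], []
        for row in ru:
            cols.extend(sorted(row))
            ptr.append(len(cols))
        vals = [0.0] * len(cols)      # space for the nonzeros of U
        return ptr, cols, vals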

 

5. Parallel Matrix Algorithm: By an efficient parallel algorithm we mean one that takes polylogarithmic time using a polynomial number of processors. In practical terms, at most a polynomial number of processors is reckoned to be feasible [6]. A polylogarithmic time algorithm takes O(log^k n) parallel time for some constant integer k, where n is the problem size. Problems which can be solved within these constraints are universally regarded as having efficient parallel solutions and are said to belong to the class NC (Nick Pippenger's class).

5.1 Representation of Array Pattern of Processing Elements (PEs): Consider the case of a three-dimensional array pattern with n³ = 2^(3q) processing elements (PEs) [2, 10].

Conceptually these PEs may be regarded as arranged in an n × n × n array pattern. If we assume that the PEs are indexed in row-major order, the PE in position (i, j, k) of this array, PE(i, j, k), has index in² + jn + k (note that array indices are in the range [0, n-1]). Hence, if r_(3q-1) ... r_0 is the binary representation of the PE position (i, j, k), then i = r_(3q-1) ... r_(2q), j = r_(2q-1) ... r_q and k = r_(q-1) ... r_0. Using A(i, j, k), B(i, j, k) and C(i, j, k) to represent memory locations in PE(i, j, k), we can describe the initial condition for matrix multiplication as

                 A(0, j, k) = A_jk

                 B(0, j, k) = B_jk,   0 ≤ j < n, 0 ≤ k < n

A_jk and B_jk are the elements of the two matrices to be multiplied. The desired final configuration is

                 C(0, j, k) = C_jk,   0 ≤ j < n, 0 ≤ k < n

where

                 C_jk = Σ_(l=0)^(n-1) A_jl B_lk                    (1)

This algorithm computes the product matrix C by directly making use of (1). The algorithm has three distinct phases. In the first, the elements of A and B are distributed over the n³ PEs so that we have A(l, j, k) = A_jl and B(l, j, k) = B_lk. In the second phase the products C(l, j, k) = A(l, j, k) · B(l, j, k) = A_jl B_lk are computed. Finally, in the third phase the sums C(0, j, k) = Σ_l C(l, j, k) are computed.

The details are spelled out in Dekel, Nassimi and Sahni 1981 [2]. In this procedure all PE references are by PE index (recall that the index of PE(i, j, k) is in² + jn + k). The key to the algorithm of Dekel, Nassimi and Sahni [1] is the data routing strategy: 5q = 5 log n routing steps are sufficient to broadcast the initial values through the processor array and to combine the results.
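The following Python sketch is a sequential simulation of the three phases (illustration only; on the hypercube the broadcasts of phase 1 and the reduction of phase 3 are each done in O(log n) routing steps):

    import numpy as np

    def dns_multiply(A, B):
        # Simulate the three-phase scheme on a conceptual n x n x n
        # PE array, where cell (l, j, k) stands for one PE.
        n = A.shape[0]
        a = np.empty((n, n, n))
        b = np.empty((n, n, n))
        # Phase 1: distribute so that A(l, j, k) = A_jl and B(l, j, k) = B_lk.
        for l in range(n):
            a[l] = A[:, [l]]           # column l of A, replicated over k
            b[l] = B[[l], :]           # row l of B, replicated over j
        # Phase 2: pointwise products C(l, j, k) = A_jl * B_lk.
        c = a * b
        # Phase 3: summing over l gives C(0, j, k) = C_jk, as in (1).
        return c.sum(axis=0)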

The array pattern of processing elements (PEs) in row-major order for specialized sparse (symmetric) matrices [9] can be formulated in the following manner; a small code sketch follows the list.

1. Representation of lower-triangular matrix: index of a_ij = total number of elements in the first i-1 rows + number of elements up to the j-th column in the i-th row

= i(i+1)/2 + j   (0 ≤ j ≤ i ≤ n-1, as in Fig. 5.2)

2. Representation of upper-triangular matrix: index of a_ij = number of elements up to the a_ij element

= (i-1)(n - i/2) + j   (1 ≤ i ≤ j ≤ n)

3. Representation of tri-diagonal matrix: index of a_ij = total number of elements in the first (i-1) rows + number of elements up to the j-th column in the i-th row

= 2 + 2(i-2) + j   (1 ≤ i, j ≤ n, |i-j| ≤ 1)
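A small Python sketch of these index mappings (ours; note that the lower-triangular formula is 0-based, matching Fig. 5.2, while the other two are 1-based as written):

    def lower_index(i, j):
        # Lower triangular, 0-based, valid for 0 <= j <= i.
        return i * (i + 1) // 2 + j

    def upper_index(i, j, n):
        # Upper triangular, 1-based, valid for 1 <= i <= j <= n.
        # (i-1)(n - i/2) + j, written in integer arithmetic.
        return (i - 1) * (2 * n - i) // 2 + j

    def tri_index(i, j):
        # Tri-diagonal, 1-based, valid for |i - j| <= 1.
        return 2 + 2 * (i - 2) + j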

The representation of the array pattern of processing elements (PEs) for a lower triangular matrix is presented on a hypercube model. Hypercubes are loosely coupled parallel processors based on the binary n-cube network. An n-cube parallel processor consists of 2^n identical processors, each provided with its own sizable memory and interconnected with n neighbors [12]. This architecture consists of a large number of identical processors interconnected according to some convenient pattern. In a shared memory system, processors operate on data from the common memory: each processor reads the data it needs, performs some processing and writes the results back to memory. In a distributed memory system, interprocessor communication is achieved by message passing and computation is data driven (although some designs incorporate a global bus, this does not constitute the main means of intercommunication). By message passing it is meant that data, or possibly code, are transferred from processor A to processor B by traveling across a sequence of nearest-neighbor nodes starting with node A and ending with node B; synchronization is driven by data in the sense that computation in a node is performed only when its necessary data are available. The main advantage of such architectures, often referred to as ensemble architectures, is the simplicity of their design. The nodes are identical, or are of a few different kinds, and can therefore be fabricated at relatively low cost. The model can easily be made fault tolerant by shutting down failing nodes. The most important advantage of this class of design is the ability to exploit particular topologies of problems or algorithms in order to minimize communication costs. A hypercube is a multidimensional mesh of nodes with exactly two nodes in each dimension. A d-dimensional hypercube consists of K nodes, where K = 2^d.

1. A hypercube has n spatial dimensions, where n can be any positive integer (including zero).

2. A hypercube has 2^n vertices.

3. There are n connections (lines) that meet at each vertex of a hypercube.

4.            All connections at a hypercube vertex meet at right angles with respect to each other [1], [14].

5.            The Hypercube can be constructed recursively from lower dimensional cubes.

6. An architecture in which the degree and the diameter of the graph are the same achieves a good balance between communication speed and the complexity of the topology network. Hypercubes achieve this equality, which explains why they are one of today's most popular designs (e.g., the iPSC of Intel Corp., the T-series of FPS, the nCUBE, and the Connection Machine of Thinking Machines Corp.) [15].

When the lower triangular matrix is presented in three dimensions, the PEs are indexed in the manner of Fig. 5.2, i.e., by index of a_ij = i(i+1)/2 + j.

For different values of i and j we can map the hypercube. Here we present the mapping of the hypercube for a single value of i and j.

 

A.  i = 0, j = 0, a(i, j) = 0, k = 0

BIT(0000, 0) = 0,    BIT-COMPLEMENT(0000, 0) = 0001 = (1)
BIT(0000, 1) = 0,    BIT-COMPLEMENT(0000, 1) = 0010 = (2)
BIT(0000, 2) = 0,    BIT-COMPLEMENT(0000, 2) = 0100 = (4)
BIT(0000, 3) = 0,    BIT-COMPLEMENT(0000, 3) = 1000 = (8)

Fig 5.1: Mapping of a lower triangular matrix onto the hypercube.
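A minimal Python sketch of these bit primitives (the names bit and neighbors are ours):

    def bit(x, b):
        # BIT(x, b): value of bit b of PE index x.
        return (x >> b) & 1

    def neighbors(x, dims=4):
        # BIT-COMPLEMENT of each bit in turn gives the hypercube
        # neighbors of PE x (Hamming distance exactly one).
        return [x ^ (1 << b) for b in range(dims)]

    # PE 0 (binary 0000) in a 4-cube:
    # neighbors(0) -> [1, 2, 4, 8], i.e. 0001, 0010, 0100, 1000,
    # matching the BIT-COMPLEMENT table above.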

        j      0           1           2           3
   i
   0        0 (0000)
   1        1 (0001)   2 (0010)
   2        3 (0011)   4 (0100)   5 (0101)
   3        6 (0110)   7 (0111)   8 (1000)   9 (1001)

Figure 5.2: Array view of a 16-PE hypercube for a lower triangular matrix. Each square represents a PE. The number in a square is the PE index (both decimal and binary representations are provided).

We now conclude with the following results.

Lemma 1: The routing or access function in² + jn + k has topological properties.

Proof: This access function is a polynomial over a 3-dimensional discrete space, where i, j, k are the three dimensions and n is fixed. It gives a relationship among the processing elements (i.e., the 3q connections (lines) that meet at each vertex of the hypercube), meaning that the algorithm can be evaluated in polylogarithmic time using the polynomial in² + jn + k of three-dimensional discrete space. We can construct hypercubes recursively from lower dimensional cubes using the polynomial in² + jn + k. A discrete space is a topological space in which all points are isolated. We conclude that the access function by which we map matrix elements is pairwise continuous. It is shown in the implementation that, because of the Hamming distance between the processors, the hypercube is modeled as a discrete space with discrete time. This access function is also used to map a three-dimensional matrix into RAM sequentially.
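A minimal sketch of the access function and its inverse (function names are ours):

    def pe_index(i, j, k, n):
        # Routing/access function: position (i, j, k) -> PE index
        # in row-major order.
        return i * n * n + j * n + k

    def pe_position(idx, n):
        # Inverse mapping: recover the array position (i, j, k).
        return idx // (n * n), (idx // n) % n, idx % n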

Lemma 2: The cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors.

Proof: For storage and retrieval of a matrix we use parallel iteration. Parallel iteration has the property of converging in log n steps in parallel. It converges geometrically, i.e., at the rate of a geometric progression; therefore the two costs are proportional to each other for a single value. From the above fact we can write that the cost of retrieval is proportional to the cost of storage:

⇒ cost of retrieval = k × cost of storage (where k is a constant);

⇒ if k ≥ 1 then the matrix is dense, and if k < 1 then the matrix is sparse.

Here we present the PRAM-CREW algorithm.

5.2 Row-oriented sparse U^T D U factorization (with pre-processing) PRAM-CREW Algorithm

begin
  repeat log n times do
    for all (ordered) pairs (i, j, k), 0 < k ≤ n, 0 < i = j = k ≤ n,
    and q = log n in parallel do
      m(2^(2q) i + 2^q j + k) = m(k, k)
      d(2^(2q) i + 2^q j + k) = d(k, k)
      a(2^(2q) i + 2^q j + k) = a(k, k)
      m(k, k) = 1
      d(k, k) ← a(k, k)

e