koorio.com

海量文库 文档专家

海量文库 文档专家

- Numerical_methods_for_semiconductor_device_simulation
- Modelling and simulation of moving interfaces in gas-assisted injection moulding process
- CFD–DEM simulation of material motion in air-and-screen cleaning device
- 5093069_Process_and_device_for_the_produ (1)
- Simulation Analysis of the Working Process of the Metering Device with Combination Inner-Cell
- Thermal stress simulation and interface destabilisation in indium phosphide grown by LEC process
- Modeling and Simulation of Strained Quantum Wells in Semiconductor Lasers
- WHEEL LOADER REALIZATION-a model and an architecture that enables simulation in a process context
- Using Computer Simulation in Operating Room Management Impacts on Process Engineering and P
- Integrated Process Simulation and Die-Design in Sheet Metal Forming

Eidgenossische ¨ Technische Hochschule Zurich ¨

Ecole polytechnique federale de Zurich ? ? Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich

Institut fur Integrierte Systeme

Integrated Systems Laboratory

Application of Parallel Sparse Direct Methods in Semiconductor Device and Process Simulation1

Olaf Schenk, Klaus Gartner, and Wolfgang Fichtner Technical Report 99/5

Abstract

We present PARDISO, a new scalable parallel sparse direct linear solver on shared memory multiprocessors. In this paper, we describe the parallel factorization algorithm which utilizes the supernode structure of the matrix to reduce the number of memory references with Level 3 BLAS. We also propose enhancements that signi cantly reduce the communication rate for pipelining parallelism. The result is a greatly increased factorization performance. Furthermore, we have investigated popular shared memory multiprocessors and the most popular numerical algorithms commonly used for the solution of the classical drift-di usion and the di usion-reaction equations in semiconductor device and process simulation. The study includes a preconditioned iterative linear solver package and our parallel direct linear solver. Moreover, we have investigated the e ciency and the limits of our parallel approach. Results of several simulations of up to 300'000 unknowns for three-dimensional simulations are presented to illustrate our approach towards robust, parallel semiconductor device and process simulation.

1 This work will be presented at the International Symposium on High Performance Computing '99 (ISHPC99, Kyoto, Japan, May 26-28, 1999) and will appear in the conference proceedings.

Application of Parallel Sparse Direct Methods in Semiconductor Device and Process Simulation

Olaf Schenk? 1 , Klaus Gartner2, and Wolfgang Fichtner1

1

Integrated Systems Laboratory, Swiss Federal Institute of Technology Zurich, ETH Zurich, 8092 Zurich, Switzerland

2

Weierstrass Institute for Applied Analysis and Stochastics, Mohrenstr. 39, 10117 Berlin, Germany

gaertner@wias-berlin.de

foschenk, fwg@iis.ee.ethz.ch http://www.iis.ee.ethz.ch

Abstract. We present PARDISO, a new scalable parallel sparse direct

linear solver on shared memory multiprocessors. In this paper, we describe the parallel factorization algorithm which utilizes the supernode structure of the matrix to reduce the number of memory references with Level 3 BLAS. We also propose enhancements that signi cantly reduce the communication rate for pipelining parallelism. The result is a greatly increased factorization performance. Furthermore, we have investigated popular shared memory multiprocessors and the most popular numerical algorithms commonly used for the solution of the classical drift-di usion and the di usion-reaction equations in semiconductor device and process simulation. The study includes a preconditioned iterative linear solver package and our parallel direct linear solver. Moreover, we have investigated the e ciency and the limits of our parallel approach. Results of several simulations of up to 300'000 unknowns for three-dimensional simulations are presented to illustrate our approach towards robust, parallel semiconductor device and process simulation.

1 Introduction: Semiconductor Device Simulation and Algorithms

Numerical semiconductor device and process simulation is based on the solution of a coupled system of non-linear partial di erential equations (PDEs) that can be either stationary, time dependent, or even complex depending on the problem considered 1, 2]. The nonlinear nature of the semiconductor transport equations, with exponential relations between variables, leads to discretized equations and the sparse linear systems are typically ill-conditioned. These systems are solved by iterative methods or direct methods.

?

The work of O. Schenk was supported by a grant from the Cray Research and Development Grant Program and the Swiss Commission of Technology and Innovation under contract number 3975.1.

2

Olaf Schenk et al.

Generally, the memory requirements of an iterative method are xed, known in advance, and these methods require less main memory than direct methods. Heiser et al. 3] compared sparse direct and iterative algorithms for semiconductor device simulations and expected main memory requirements for sparse direct methods in the 10-100 Gbyte range for a simulation with 100'000 vertices and 300'000 unknowns. However, they used only local ll-in reduction methods like minimum-degree based algorithms 4] and the sparse direct solver did not exploit the memory hierarchy of the computing architecture. The rapid and widespread acceptance of shared memory multiprocessors, from the desktop to the high-end server, has now created a demand for parallel device and process simulation on such shared memory multiprocessors. In this paper, we present a study of large-scale semiconductor device and process simulations and compare the time requirements of complete simulations using iterative methods and our parallel direct solver. PARDISO exploits the memory hierarchy of the architecture using the clique structure of the elimination graph by supernode algorithms, thus improving memory and processor locality. The reordering features state-of-the-art techniques, e.g. multilevel recursive bisection 5], for the ll-in reduction. The combination of block techniques, parallel processing, and global ll-in reduction methods for three-dimensional semiconductor devices results in a signi cant improvement in computational performance. This paper is organized as follows. In section 2, we describe the algorithms and the performance of the sparse direct linear solver. Section 3 provides the comparison of iterative methods and sparse direct methods for several semiconductor device simulation examples on di erent architectures. The scalability is demonstrated for the simulator DESSIS?ISE with PARDISO for a realistic 3-D example with 105'237 vertices and 315'711 unknowns. Finally, in section 4 we give our conclusions.

2 PARDISO: A High-Performance Serial and Parallel Sparse Linear Solver

The problem we wish to solve is the direct solution of equations Ax = b, where A is a symmetric or structurally symmetric sparse matrix. This system is solved by performing a Cholesky or LU factorization of the matrix A, yielding factor

matrices L and U, and then performing a pair of triangular system solves to arrive at the value x. The linear systems arising in the discretization of the semiconductor device partial di erential equations, the so-called drift di usion equations, are unsymmetric linear systems and they are reputed to be very ill-conditioned. However, the unsymmetric matrix A can be transformed in a structurally symmetric matrix if a few non-zeros are added to A. Hence, in this paper we look at the problem of factoring large sparse, structurally symmetric linear systems.

PARDISO: A Scalable Parallel Sparse Direct Solver

3

2.1 Sequential Numerical Factorization

A fast sequential factorization code can be created by taking advantage of the supernode structure of the matrix 6]. Consider the matrix A of Figure 1, and its factors L and U. Although the columns of A appear quite dissimilar in structure, many of the columns in L and U appear nearly identical. Signi cant bene ts can be obtained by taking advantage of the structure and treating sets of columns with nearly identical structures as groups, or supernodes. A supernode is de ned as a set of contiguous columns whose structure in the factor consists of a dense triangular block on the diagonal and an identical set of non-zeros for each column below this diagonal block. In the example matrix, the supernodes are fA, Bg, fCg, fD, Eg, fFg, fG, H, Ig, fJg, fK, Lg, fM, Ng, fOg, fP, Q, Rg. By making use of the properties of supernodes, two performance enhancements are possible. The rst one is generally exploited in all supernode codes, the second one, instead, is handled very di erently. The rst enhancement has to do with reducing the number of references made to index vectors. Since the columns of a supernode all have the same structure, they all factorize the same set of destination columns. In supernode factorization, a destination column is factorized by all the columns in a supernode as a single step. It is easy to see that after all the columns of the supernodes have been added together in a temporary array, the index vectors are accessed to determine the entries in the destination column. This phase is called assembling step. The second source of performance improvements deals with the organization of the numerical factorization. Instead of only focusing on a single destination column, the factorization update can be extended to a supernode-supernode update. The supernode-supernode update operates on complete supernodes at the same time. As stated earlier, the second source of performance improvements is handled very di erently in sparse direct solver packages. The supernodesupernode factorization is mathematically expressed as a matrix-matrix operation. However, in many factorization codes 7{10] the factors L and U are separately stored and the matrix-matrix multiplications are internally implemented in either a DAXPY or a DAXPI loops. A DAXPY loop is a loop of the form y = a x + y, where x and y are vectors and a is a constant. A DAXPI loop is of the form y = a x(index) + y, where index is a vector of indices into x. In these packages, almost all of the factorization oating points operations occur in either DAXPY and DAXPI loops. The second source of higher performance in these packages comes from loop unrolling of these operations 7{10]. A serious issue for this class of algorithms is the time-ratio of one oating point operation to one access to the main memory. Without taking advantage of possible data reuse the memory access time will dominate. The DAXPY operation is the prototype of Level 1 BLAS operations - it can only reuse a, and unrolling these operations will not result in Level 3 BLAS performance. We describe a factorization algorithm which

4

Olaf Schenk et al.

A B C D E F G H I J K L MN O P Q R A B C D E F G H I J K L M N O P Q R A B C D E F G H I J K L MN O P Q R A B C D E F G H I J K L M N O P Q R Q P O L K I H G C E D B A F N M R J

Fig. 1. The sparse matrix A, the non-zero structure of the factors L and U with the

rectangular supernode storage scheme, and the elimination tree.

G H I J K L MN O P Q R A B C D E F G H I Q P O L K Proc. 1 I H G C E D B A F N M R J

ABC DEF

Fig. 2. The left-looking numerical factorization of supernode S(G, H, I).

GHI AB GI A B G H I K GHI C G I K Q DE

GI

D E G H I K L Q

*

*

C

*

* F

F G I

DGEMM L = L - X*Y

G I K Q

DGEMM TP = TP - X*Y

GI G H I K L N Q G H I K L Q

DGEMM TP = TP - X*Y

G H I K L N Q G I

DGEMM TP = TP - X*Y

GI G H I K L N Q

ASSEMBLING

ASSEMBLING

ASSEMBLING

Fig. 3. The external numerical factorization of the factor L and the upper triangular

dense diagonal block of the factor U.

PARDISO: A Scalable Parallel Sparse Direct Solver

5

GHI AB

GI A B K

GHI C K Q DE

*

*

C

*

D E K L Q

DGEMM U = U - X*Y

DGEMM TP = TP - X*Y

GI K Q GHI K L N Q K L Q

DGEMM TP = TP - X*Y

KL Q GHI K L N Q

ASSEMBLING

ASSEMBLING

Fig. 4. The external numerical factorization of the remaining part of the factor U.

GHI GHI GHI 1 G G G H H * H = 1 1 I I I

DGETRF

GHI G H I K L Q GHI

1 1 1 I

G H

K

L Q

DTRSM

DTRSM

Fig. 5. The internal numerical factorization of the supernode dense diagonal block with

Level 3 BLAS.

R Q P O G H I J K L MN O P Q R G H I J K L M N O P Q R L K I Proc. 1 H G C E D B A F N M J

KL MN

QR

Fig. 6. The right-looking phase of the parallel algorithm. All supernodes that will be

factorized by supernode S(G, H, I) are marked to exploit the pipelining parallelism.

6

Olaf Schenk et al.

Proc. 3 J K L MN O P Q R A B C D E F G H I J K L M N O P Q R Q P O L N M Proc. 1 R J

Proc. 2 K I H G C E D B A F

ABCDE F GHI

Fig. 7. Pipelining parallelism is essential to achieve higher parallel e ciency. Three processors can start with the external numerical factorization of the supernodes S(K, L), S(M, N), and S(Q, R) by supernode S(G, H, I).

utilizes the supernode structure of the matrix to reduce the number of memory references with Level 3 BLAS. In other words, it bene ts from the O(n3 ) operations and O(n2 ) memory accesses of a classical matrix multiplication algorithm. Consequently, all supernodes have to be stored as rectangular dense matrices; the dense triangular blocks of L and U on the diagonal are merged to a rectangular dense matrix. Figure 1 depicts the storage scheme for the factor L and U and all supernodes. The rectangular storage scheme exploits the performance of the Level 3 BLAS routines. Consider in Figures 2, 3, and 4, the external numerical factorization of supernode fG, H, Ig with a left-looking supernode-supernode algorithm. In the left-looking approach, the factorization of supernode fG, H, Ig is performed by gathering all contributions to fG, H, Ig from previously computed supernodes: fA, Bg, fCg, fD, Eg, fFg. If we assume that the non-zeros are stored contiguously in increasing order by column on L and row on U, then the multiplication can be done without a index vector inside the DGEMM loop. Once the external supernode-supernode multiplication has been performed, the result is assembled in the destination supernode. Note that the oating point operation phase is completely separated from the assembling phase and no indirect operations are required to perform the external supernode factorization. We now study the internal factorization of supernode fG, H, Ig in Figure 5. In the example matrix, the three columns and rows in supernode fG, H, Ig are factorized with the LAPACK subroutines DGETRF and DTRSM for dense matrices.

2.2 Numerical Factorization Benchmarking

In order to provide concrete performance comparisons for the two numerical factorization schemes which we discussed in this paper, we will present performance numbers for the factorization of a number of benchmark matrices on

PARDISO: A Scalable Parallel Sparse Direct Solver

7

Table 1. Clock speed, on-chip and o -chip cache sizes, peak oating point and LINPACK performance of the processors.

Processor on-chip o -chip Peak LINPACK Clock speed Cache Cache M op/s M op/s 96 KB 4 MB 32 KB 4 MB 16 KB 4 MB 1224 390 672 200 2000 764 344 461 202 1929 Alpha EV5.6 21164 612 MHz MIPS R10000 195 MHz UltraSPARC-II 336 MHz J90 100 MHz SX-4 125 MHz

commercially-available shared memory multiprocessors (SMPs). In this subsection, we describe both the multiprocessors on which we performed all experiments and the benchmark matrices which we factor on this machines. We used two vector supercomputers, a 16-CPU Cray J90 and an 8-CPU NEC SX-4, and three multiprocessor servers: an 8-CPU DEC AlphaServer 8400 a 32-CPU SGI Origin 2000, and an 8-CPU Sun Enterprise 4000. Table 1 summarizes the characteristics of the individual processors. The matrices which we use to evaluate sparse factorization performance are selected from the Harwell-Boeing sparse matrix test set 11] and from commercial semiconductor device and process simulation packages 1, 2]. The Harwell-Boeing matrices are symmetric, whereas all other matrices are structurally symmetric. We have tried to choose a variety of sizes and sparsities. All sparse matrices are reordered with a recursive nested dissection algorithm 5]. However, any ll-in reducing algorithm is suitable for our factorization approach. Neither the serial nor the parallel formulation requires a special ordering method. The purpose of the collection was to demonstrate that the approach can deliver close to peak oating point performance for practical problems. Table 3 gives the run times of the two supernodes schemes,

Table 2. Benchmark matrices.

Matrix n nnz(A) nnz(L) M ops 1 bcsstk31 35588 608'502 4'386'990 1'182 2 bcsstk32 44'609 1'029'655 5'605'579 1'235 3 eth-3D-eeprom 12'002 630'002 6'602'828 3'105 4 ise-3D-igbt-coupled 18'668 412'674 7'614'492 3'866 5 eth-3D-eclt 25'170 1'236'312 17'914'354 12'079 6 ise-3D-soir-coupled 29'907 2'004'323 28'342'865 23'430 7 eth-3D-mosfet 31'789 1'633'499 26'732'577 22'339 8 eth-3D-eclt-big 59'648 3'225'942 63'785'130 75'102

the unrolled DAXPY approach, as employed in SUPER 10, 12], and the block

8

Olaf Schenk et al.

Table 3. Factorization run times on one processor of an DEC AlphaServer 8400.

PARDISO SUPER Matrix Time (s) M op/s Time (s) M op/s 1 bcsstk31 3.4 347.6 10.4 114.7 2 bcsstk32 3.8 329.8 10.4 118.7 3 eth-3D-eeprom 7.1 437.3 31.8 97.6 4 ise-3D-igbt-coupled 8.7 444.4 47.2 81.9 5 eth-3D-eclt 24.4 495.1 143.1 84.4 6 ise-3D-soir-coupled 43.5 538.6 315.3 74.2 7 eth-3D-mosfet 42.5 525.6 305.3 73.1 8 eth-3D-eclt-big 136.1 551.8 1130.7 66.41

method with DGEMM, as integrated into PARDISO, for the eight benchmark matrices. Note that although the two schemes for sparse LU factorization which have been discussed perform essentially the same set of oating point operations, they exhibit very di erent cache behavior. As can be seen from the table, the PARDISO code performs sparse factorization at more than six times the speed of SUPER. The block method with Level 3 BLAS forms the basis for our parallel factorization in the next section.

2.3 Parallel Numerical Factorization

There are three levels of parallelism that can be exploited by parallel direct methods. The rst one, elimination tree parallelism 13] is generally exploited in all parallel sparse direct packages. In practice, however, a large percentage of the computation occurs at the top levels of the tree and the sequential part of the elimination tree usually limits the e ciency. It is therefore important to consider more levels of parallelism. Another type of parallelism, called node level parallelism, solves in a hybrid approach at the top level of the tree the factorization of each supernode in parallel 14] on a loop level basis. However, the size of the supernodes and the numbers of processors determine the e ciency. A third type of parallelism, the pipelining parallelism, instead, is suitable for a larger number of processors. In this case, a processor can start with the external factorization contributions of all supernodes that are already factorized. Although a pipelining approach is di cult to realize in sparse direct solver packages, the pipelining parallelism is essential to achieve higher concurrency. In the PARDISO package, fractions of supernodes are owned by processors and spawned for OpenMP threads. Following Ng and Peyton 9], we build a pool of tasks containing the list of tasks that can be performed by available processors. This pool is initialized with all leaves of the supernode elimination tree. One important consequence of the left-looking supernode algorithm is that the factorization of a supernode S consists of two steps, the external factorization with supernodes below the current position of S and the internal factorization. The

PARDISO: A Scalable Parallel Sparse Direct Solver

9

left-looking external and internal factorization of supernode S(G, H, I) is shown in Figures 2 to 5. The external factorization of supernode S(G, H, I) dependents on four other supernodes, whereas the inner factorization is independent of other processes. To understand and exploit the pipelining parallelism, we consider the factorization of the example supernode S again. Once it is completed, all other supernodes that have a ancestor/descendent relationship with S can potentially start with the external factorization. Since the factorizations can be done in parallel, it is important to exploit this level of concurrency to achieve higher parallel e ciency. Therefore, when a processor has nished a supernode, e.g. S(G, H, I), it immediately informs all other supernodes that require a update from S(G, H, I). This is schematically depicted in Figure 6. The supernodes S(K, L), S(M, N) and S(Q, R) have a ancestor/descendent relationship with supernode S(G, H, I). The information that S(G, H, I) is already factorized is distributed to all these nodes. In terms of supernode technology this can be called a right-looking step, since a right row traversal of the current supernode S(G, H, I) is performed. As stated earlier, pipelining parallelism increases the concurrency and due to the left-right looking method three processors can start in Figure 7 with the factorization. Another important consequence of the left-looking approach is that it substantially increases the task grain size during factorization. Because the tasks are dependent and the largest are the last, load balancing is a problem. Assume the extreme case of a dense matrix. The factor L forms a single supernode. It therefore presents no possibility for concurrent work if all of the work associated with a supernode is assigned to a single processor. Hence, it is necessary to decrease the task grain size by splitting large supernodes into smaller pieces, called panels. How many panels are necessary? One major consideration is that we would like to keep the supernodes as large as possible to get the most bene t from the supernode technique. On the other hand, load balancing is necessary to keep the processors busy. We tried a number of splitting techniques and the most e ective of these depends on the number of processors. First, if (1 < nproc 4) then each panel contains less than 96 columns of the factor, if (4 < nproc) then the maximum panel size is reduced to 72 columns. Portability should be a key issue in implementing a parallel algorithm. A small subset of OpenMP 15], a new portable alternative to message passing, was able to solve all thread creation and synchronization tasks. Hence, the application is easily portable to multiple platforms | the e ciency will certainly depend on hardware and operation system features. The main goal of the OpenMP standard is to promote portable high performance parallel programming. OpenMP o ers compiler directives, which look like Fortran comments and are added to the source code, to specify the parallelism of the algorithm and where multi-processor synchronization is needed. In the Figures 8 to 13 we present the speedups and M op/s with our parallel left-right looking supernode algorithm on several commercial symmetric multiprocessing architectures.

10

Olaf Schenk et al.

2000 1800

1600

Mflop rate on a CRAY J90

eth?3d?eeprom ise?igbt?coupled eth?3d?eclt ise?soir?coupled eth?3d?mosfet eth?3d?eclt?big

1400

1200

1000

800

600

400

200

0

2

4

6

8

10

12

14

16

Fig. 8. M op/s on a 16-CPU Cray J90.

8000 7000

# processors

Mflop rate on a NEC SX?4/8

6000

eth?3d?eeprom ise?igbt?coupled eth?3d?eclt ise?soir?coupled eth?3d?mosfet eth?3d?eclt?big

5000

4000

3000

2000

1000

0

1

2

3

4

5

6

7

8

Fig. 9. M op/s on an 8-CPU NEC SX-4.

1800 1600

# processors

Mflop rate on a SGI ORIGIN 2000

1400

eth?3d?eeprom ise?igbt?coupled eth?3d?eclt ise?soir?coupled eth?3d?mosfet eth?3d?eclt?big

1200

1000

800

600

400

200

0

1

2

3

4

5

6

7

8

Fig. 10. M op/s on an 8-CPU SGI Origin 2000.

# processors

PARDISO: A Scalable Parallel Sparse Direct Solver

1200

11

Mflop rate on a Sun Enterprise 4000/5000

1000

eth?3d?eeprom ise?igbt?coupled eth?3d?eclt ise?soir?coupled eth?3d?mosfet eth?3d?eclt?big

800

600

400

200

0

1

2

3

4

5

6

7

8

Fig. 11. M op/s on an 8-CPU SUN Enterprise Server.

2500

# processors

Mflop rate on a DEC AlphaServer 8400

2000

eth?3d?eeprom ise?igbt?coupled eth?3d?eclt ise?soir?coupled eth?3d?mosfet eth?3d?eclt?big

1500

1000

500

0

1

2

3

4

5

6

7

8

Fig. 12. M op/s on an 8-CPU DEC AlphaServer 8400.

7000

# processors

Mflop/s on various platforms for sparse matrix 8

6000

5000

DEC AlphaServer EV56 (612 MHz) NEC SX?4 (125 MHz) Sun Enterprise 4000/5000 (336 MHz) SGI ORIGIN 2000 (195 MHz) Cray J90 (100 MHz)

4000

3000

2000

1000

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

Fig. 13. The computational performance in M op/s on all SMPs for sparse matrix 8.

#processors

12

Olaf Schenk et al.

3 Comparing Algorithms and Machines for Complete Semiconductor Device Simulations

For our comparison in Table 4, we used ve typical semiconductor device simulation cases: one 2-D transient simulation with a hydrodynamic model and 14'662 vertices (87'732 unknowns), and four 3-D quasistationary device simulations, with 10'870, 12'699, 26'855, and 105'237 vertices (32'610, 38'097, 80'565, and 315'711 unknowns), respectively. All grids are highly irregular and the linear systems are very sparse. The irregularity of the grids leads to more complicated structures of the linear systems that must be solved during the simulation. The machines are shared memory multiprocessors and the processor characteristics are summarized in Table 1. We used an 32-processor SGI Origin 2000 with sixteen Gbytes of main memory, and an four-processor DEC AlphaServer with two Gbytes of main memory. Comparing the performance of two sparse direct solvers on the same machine with one single processor, we see that in all cases the new direct solver PARDISO 16] is signi cantly faster than SUPER?ISE 1, 12]. The mayor bottleneck of SUPER?ISE is not the number of oating point operations necessary for the factorization, but rather the cost of fetching data from main memory into cache memory. Although both packages take advantage from supernode techniques, PARDISO exploits the memory hierarchy between cache and main memory signi cantly better. Furthermore, a nested dissection approach 5] has been integrated into PARDISO. This nested dissection ll-reducing ordering is substantially better than the multiple minimum degree algorithm for large problem sizes (up to a factor up 5 in oating point operations for larger 3-D grids). Table 4 clearly re ects the performance di erence due to algorithmic improvements. On the other side, preconditioned sparse iterative methods play an important role in the area of semiconductor device and process simulation problems. Of the wide variety of iterative solvers available in the literature 17], it has been found that only a few are capable of solving the equations arising in semiconductor device and process simulations. This is mainly due to to the ill-conditioned matrices of the coupled solution scheme. So far, BiCGstab(2) 18] with a fast and powerful incomplete LU decomposition preconditioning and reverse Cuthill-McKee ordering appears to be the best choice for solving large sparse linear systems 1, 19]. The default preconditioning in the package SLIP90?ISE is ILU(5,1E-2). For increasingly di cult linear systems, the ll-in level and the threshold value for dropping a ll-in entry in the ILU decomposition is changed in ve steps to ILU(20,1E-5). However, the results for the linear solver have to be seen in the context of the complete simulation task. One often observed problem of the iterative methods used is the strong reduction of the average step size during a quasistationary or transient simulation due to convergence problems. The typical time step size for non trivial examples is in the range of 0.01 to 0.001 times that found for a direct method. Furthermore, looking at the quasistationary 3-D simulation with 105'237 vertices in Table 4 we see that the iterative solver failed during

PARDISO: A Scalable Parallel Sparse Direct Solver

13

Table 4. Comparison of di erent algorithms and packages on high-end servers. Wall clock times of the solution time in hours for one complete semiconductor device simulation with the parallel DESSIS?ISE .

SLIP90?ISE SUPER?ISE PARDISO iterative direct direct 1 CPU 4 CPUs

2-D 14'662 vertices, six nonlinear PDE's

DEC Alpha SGI Origin

14.60 22.30 1.80 1.41 0.90 0.90 0.92 1.04

23.70 46.70 2.30 3.37 6.68 5.72 45.79 55.04

5.30 9.70 0.45 0.85 0.72 1.00 1.33 1.61

1.62 2.81 0.12 0.23 0.20 0.23 0.41 0.51

3-D 10'870 vertices three nonlinear PDE's

DEC Alpha SGI Origin

3-D 12'699 vertices three nonlinear PDE's

DEC Alpha SGI Origin

3-D 26'859 vertices three nonlinear PDE's

DEC Alpha SGI Origin

3-D 105'237 vertices three nonlinear PDE's

DEC Alpha SGI Origin

failed no memory no memory no memory failed no memory 401.1 16.0 on 30 CPUs

the simulation. The main memory requirement of DESSIS?ISE with the parallel direct solver was about 4 Gbytes on the SGI Origin 2000 for the largest example with 105'237 vertices and 315'711 unknowns. To complete the global view, we present in Table 5 the contributions of the right-hand side, Jacobian evaluation and solution times for linear systems. Note that neither the calculation of the right-hand side nor the calculation of the Jacobian is parallelized in the semiconductor device simulator. Although the sequential solution step dominates the execution time, the in uence of the remaining sequential part can not be neglected having introduced a very e cient parallel direct solver.

14

Olaf Schenk et al.

Table 5. Decomposition of the Execution Time of the Serial DESSIS?ISE simulation

with PARDISO into assembling time and solution time on 1 processor. Rhs-time Jacobian-time Solve-time

2-D 14'662 vertices, six nonlinear PDE's

13 % 15 % 32 % 0.5 %

21 % 3% 2% 0.5 %

64 % 82 % 64 % 99 %

3-D 12'699 vertices three nonlinear PDE's

3-D 26'859 vertices three nonlinear PDE's

3-D 105'237 vertices three nonlinear PDE's

4 Conclusion

We considered the problem of performing the sparse LU factorization on di erent shared memory architectures. We started with two sequential factorization methods and demonstrated the advantages of utilizing the supernode structure of the matrix to reduce the number of memory references with Level 3 BLAS on hierarchical memory machines. This algorithm forms the basis for the parallel version. We discussed and implemented enhancements that signi cantly reduce the communication rate for pipelining parallelism in a left-right looking supernode-panel algorithm - resulting in greatly increased factorization performance. Tangible bene ts can be achieved on today's eight-processor work group servers. As a result, PARDISO is already in use in several commercial applications, e.g. the solver has been successfully integrated into complex semiconductor process and device simulation packages 1, 2]. The parallel e ciency in the direct linear solution algorithm has now reached, for larger problem sizes, the 32 processor machines. The gap in-between the 104 and 106 vertices may be bridged by multigrid methods.

Acknowledgment

The authors appreciate the support of following high performance benchmark groups: Silicon Graphics/Cray Research Inc., Digital Equipment Corporation, Sun Microsystems Inc and NEC. Also, we thank Esmond Ng (Oak Ridge National Laboratory) for a number of interesting discussions and hints during two visits at Zurich. Finally, we want to thank for the support within the CRAYETHZ SuperCluster Cooperation and the grant from the Swiss Commission of Technology and Innovation.

PARDISO: A Scalable Parallel Sparse Direct Solver

15

References

1. Integrated Systems Engineering AG. ?ISE Reference Manual. ISE Integrated Systems Engineering AG, 1998. 2. Integrated Systems Engineering AG. ?ISE Reference Manual. ISE Integrated Systems Engineering AG, 1998. 3. G. Heiser, C. Pommerell, J. Weis, and W. Fichtner. Three-dimensional numerical semiconductor device simulation: Algorithms, architectures, results. IEEE Transactions on Computer-Aided Design, 10(10):1218{1230, 1991. 4. J.W.H. Liu. Modi cation of the Minimum-Degree algorithm by multiple elimination. ACM Transactions on Mathematical Software, 11(2):141{153, 1985. 5. G. Karypis and V. Kumar. Analysis of multilevel graph algorithms. Technical Report MN 95 - 037, University of Minnesota, Department of Computer Science, Minneapolis, MN 55455, 1995. 6. C.C. Ashcraft, R.R. Grimes, J.G. Lewis, B.W. Peyton, and H.D. Simon. Progress in sparse matrix methods for large linear systems on vector supercomputers. The International Journal of Supercomputer Applications, 1(4):10{30, 1987. 7. E. Rothberg and A. Gupta. An evaluation of left-looking, right-looking, and multifrontal approaches to sparse Cholesky factorization on hierarchical-memory machines. International Journal of High Speed Computing, 5(4):537{593, 1993. 8. E. Rothberg. Exploiting the memory hierarchy in sequential and parallel sparse Cholesky factorization. PhD thesis, Stanford University, 1992. STAN-CS-92-1459. 9. E.G. Ng and B.W. Peyton. A supernodal Cholesky factorization algorithm for shared-memory multiprocessors. SIAM Journal on Scienti c Computing, 14(4):761{769, 1993. 10. Integrated Systems Engineering AG. ?ISE Reference Manual. ISE Integrated Systems Engineering AG, 1998. 11. I.S. Du , R.G. Grimes, and J.G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection (release 1). Technical Report RAL-92-086, Rutherford Appleton Laboratory, 1992. 12. A. Liegmann. E cient Solution of Large Sparse Linear Systems. PhD thesis, ETH Zurich, 1995. 13. J.W.H. Liu. The role of elimination trees in sparse factorization. SIAM Journal on Matrix Analysis & Applications, 11(1):134{172, January 1990. 14. P. Matstoms. Parallel sparse QR factorization on shared memory architectures. Parallel Computing, 21:473{486, 1995. 15. R. Menon L. Dagnum. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science & Engineering, 1:46{55, 1998. 16. O. Schenk, K. Gartner, and W. Fichtner. E cient sparse LU factorization with left-right looking strategy on shared memory multiprocessors. Technical Report 98/40, Integrated Systems Laboratory, ETH Zurich, Swiss Fed. Inst. of Technology (ETH), Zurich, Switzerland, Submitted to BIT Numerical Mathematics, 1998. 17. Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, 1996. 18. G.L.G. Sleijpen, H.A. van der Vorst, and D.R. Fokkema. BiCGSTAB(l) and other hybrid Bi-CG methods. Technical Report TR Nr. 831, Department of Mathematics, University Utrecht, 1993. 19. D.R. Fokkema. Subspace methods for linear, nonlinear, and eigen problems. PhD thesis, Utrecht University, 1996.

DESSIS DIOS SUPER