The simple alignment algorithm requires additional storage and communication bandwidth. We therefore introduce a new tile-based method that simplifies the computational model and handles very long sequences with less data transmission. Our tile-based method can be described as follows: we introduce tile reduction, a tile-aware parallelization technique for OpenMP and MPI that applies reduction to multidimensional arrays.
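A hedged illustration of what reduction over an array looks like in OpenMP (this uses the OpenMP 4.5 array-section reduction clause as a generic stand-in, not the authors' tile-reduction code; the array name and sizes are assumptions):

#include <omp.h>
#include <stdio.h>

#define ROWS 4
#define COLS 8

int main(void)
{
    /* Per-column totals accumulated across all rows of a 2D array. */
    int data[ROWS][COLS], col_sum[COLS] = {0};

    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            data[i][j] = i + j;                    /* illustrative fill */

    /* OpenMP 4.5+ allows reduction over an array section: each thread keeps
     * a private copy of col_sum that is combined when the loop finishes. */
    #pragma omp parallel for reduction(+ : col_sum[0:COLS])
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            col_sum[j] += data[i][j];

    for (int j = 0; j < COLS; ++j)
        printf("column %d sum = %d\n", j, col_sum[j]);
    return 0;
}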

The elements of each submatrix correspond to one tile. For each diagonal of tiles, one file is created in which the computed values are stored and later deleted synchronously. This new parallel implementation is shown as pseudocode in Listing 3.

The pseudocode corresponds to Figure 2. The wave-front algorithm is an important method used in many scientific applications. The sequence alignment algorithm computes each element in a wave-front manner, and our new tiling transformation also executes in a wave-front manner, both within a chunk and between chunks. Figure 2a shows the wave-front structure used to parallelize the algorithm: because the value of each block in the matrix depends on the blocks to its left, above, and upper left, blocks of the same color are placed in the same parallel computing round, as sketched below.
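A minimal sketch of this diagonal-by-diagonal parallelization, assuming the matrix has been partitioned into NTILES x NTILES tiles and that a hypothetical helper compute_tile() applies the recurrence within one tile (the helper name and tile count are illustrative assumptions, not the paper's Listing 3):

#include <omp.h>

#define NTILES 8                           /* illustrative number of tiles per dimension */

extern void compute_tile(int ti, int tj);  /* assumed per-tile alignment kernel */

/* Tiles on the same anti-diagonal are independent of one another, so each
 * diagonal is processed as one parallel round; the implicit barrier at the
 * end of the parallel for enforces the left/upper/upper-left dependencies. */
void wavefront_over_tiles(void)
{
    for (int d = 0; d < 2 * NTILES - 1; ++d) {
        int lo = (d < NTILES) ? 0 : d - NTILES + 1;
        int hi = (d < NTILES) ? d : NTILES - 1;
        #pragma omp parallel for schedule(static)
        for (int ti = lo; ti <= hi; ++ti) {
            compute_tile(ti, d - ti);      /* tile (ti, d - ti) lies on diagonal d */
        }
    }
}

The number of tiles per diagonal first grows and then shrinks, so static scheduling is only one of several reasonable choices here.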

Figure 2b shows the data-skewing strategy used to implement this pattern. This memory layout is useful because blocks in the same parallel group (the same color) are adjacent to each other. In our implementation, the values of all tiles of the same color are stored in the same file: for the blocks of one color (one diagonal), one file is created, and it is needed for the calculation of the next-colored blocks (the next diagonal), for which another file is created.

When the third file is created, the first file (the data of the first-colored blocks) is deleted because it is no longer needed. Thus, at every third step, the oldest file is deleted; this deletion is performed synchronously. A sketch of this file rotation follows.
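As a hedged sketch of the per-diagonal file handling described above (the file names and the layout of the stored values are assumptions for illustration, not the paper's code):

#include <stdio.h>
#include <stdlib.h>

/* Write the results of one anti-diagonal to its own file and drop the file
 * from two diagonals back, which the recurrence no longer needs. */
void store_diagonal(int d, const double *values, size_t count)
{
    char name[64];
    snprintf(name, sizeof(name), "diag_%d.bin", d);   /* illustrative file name */
    FILE *f = fopen(name, "wb");
    if (!f) { perror("fopen"); exit(EXIT_FAILURE); }
    fwrite(values, sizeof(double), count, f);
    fclose(f);

    if (d >= 2) {                      /* only the two previous diagonals are kept */
        snprintf(name, sizeof(name), "diag_%d.bin", d - 2);
        remove(name);                  /* synchronous deletion of the stale file */
    }
}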

We study these paradigms for various input sequence sizes, various tile sizes, different numbers of threads for OpenMP, different numbers of processes for MPI, and combinations of threads and processes for the hybrid model. In our experimental analysis of the OpenMP programming model, we observe that as the input sequence size increases, the optimum performance with respect to time and speedup is obtained with a decreasing number of threads (working cores).

To analyze this result, this section presents a simple method for estimating the communication overhead of OpenMP loops. The observation is probably explained as follows. The system has two processors, each containing six cores; the layout is illustrated in Figure 3. Exactly how the data are placed on each core is not known, but assume that the tile-size matrix is located as depicted in Figure 3, where the shaded memory block contains the data of the master working core.

Now, when the algorithm uses 12 OpenMP threads, all the cores will access the memory block in all iterations of the solver.

To retrieve this memory, the cores farthest from the data may have to sit idle for several cycles before the data arrive. This happens even if the data are not distributed as in Figure 3, because of the nature of OpenMP. The total overhead, O_t, of an OpenMP loop executed on these processors by a number of threads can be expressed as shown below. For an OpenMP for-loop, a loop of i iterations is divided into chunks of size k, which determines the number of messages sent. Summarizing these formulas gives a general expression for the communication overhead of an OpenMP for-loop.
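A minimal sketch of one plausible form of this overhead model, assuming a fixed per-message startup cost t_s and a per-element transfer cost t_w (these symbols and the shape of the expression are illustrative assumptions, not the paper's own formulas):

\[
N_{\mathrm{msg}} = \left\lceil \frac{i}{k} \right\rceil,
\qquad
O_t = N_{\mathrm{msg}} \left( t_s + t_w\, k \right)
    = \left\lceil \frac{i}{k} \right\rceil \left( t_s + t_w\, k \right).
\]

Under such a model, a larger chunk size k reduces the number of messages but increases the data moved per message, which is broadly consistent with the tile-size trade-off discussed later.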

Further, the speedup graphs indicate that this strategy yields little speedup for our parallel algorithm. If a thread migrates to another CPU for some reason, the migration takes extra time and must be included in the overhead calculation. To test whether this was a significant source of overhead, we used the bindprocess function call to bind the OpenMP threads to given CPUs; however, this did not give any notable performance improvement. In the MPI implementation, multiple instances of the same code run in parallel on different processors. Each processor communicates with the others through the message-passing interface and stores its data in its own memory.
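The exact bindprocess call is not reproduced here; as a hedged illustration only, a comparable pinning of OpenMP threads can be expressed on Linux with sched_setaffinity (or simply with the OMP_PROC_BIND environment variable). The one-thread-per-CPU mapping below is an assumption:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Pin each OpenMP thread to the CPU whose index equals its thread
         * number (assumes at least omp_get_num_threads() CPUs are available). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        #pragma omp critical
        printf("thread %d pinned to CPU %d\n",
               omp_get_thread_num(), omp_get_thread_num());
    }
    return 0;
}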

To distinguish between the processes, each is given a unique id (rank), which is used for explicit communication between them. The data are distributed to all working MPI nodes, where each individual core runs the code in parallel; a minimal sketch of this structure is given below.
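A minimal sketch, assuming a hypothetical row-block distribution of the dynamic programming matrix in which each rank computes its own block of rows and forwards its last row to the next rank (the matrix width, block height, and the stand-in recurrence in compute_row() are illustrative assumptions, not the authors' implementation):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define COLS          1024   /* illustrative matrix width             */
#define ROWS_PER_RANK 256    /* illustrative rows owned by each rank  */

/* Placeholder for the actual dynamic-programming recurrence. */
static void compute_row(int *cur, const int *prev, int ncols)
{
    cur[0] = prev[0];
    for (int j = 1; j < ncols; ++j)
        cur[j] = prev[j] + prev[j - 1] + cur[j - 1];   /* stand-in recurrence */
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *prev = calloc(COLS, sizeof(int));
    int *cur  = calloc(COLS, sizeof(int));

    /* Receive the boundary row from the rank above; rank 0 starts from zeros. */
    if (rank > 0)
        MPI_Recv(prev, COLS, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Each rank computes its own block of rows, one row at a time. */
    for (int r = 0; r < ROWS_PER_RANK; ++r) {
        compute_row(cur, prev, COLS);
        memcpy(prev, cur, COLS * sizeof(int));
    }

    /* Forward the last computed row to the next rank. */
    if (rank < size - 1)
        MPI_Send(prev, COLS, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

    free(prev);
    free(cur);
    MPI_Finalize();
    return 0;
}

A real wave-front implementation would exchange boundary data per tile rather than once per rank, so that ranks can overlap their work; this sketch only shows the rank identification and explicit message-passing structure.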

Each core locates these data in its own nearby memory bank. This avoids idle time spent waiting for memory, because the data are in close proximity. A significant amount of communication overhead is also saved because several cores on each node share data. Hence, MPI gives improved performance for large files, whereas for small sequence sizes the time required to distribute the data to all nodes is comparatively larger. The multicore platforms we used for testing our parallel algorithms on shared- and distributed-memory structures are listed below.

Here, the experiments were performed on a quad-core Intel 2. We employed an automatic sequence-generating program to produce different test cases; hence, different data set sizes were used for the results on the various architectures. The general test was executed on four different implementations of the DNA sequence alignment algorithm: the first is the serial implementation, the second is the simple wave-front implementation using OpenMP, the third is the traditional tile-based implementation, and the fourth is our new tile-based implementation.

For long sequences, the simple wave-front version cannot load the entire matrix into processor memory. Therefore, we selected groups of sequences of variable lengths to test on the Linux version. The experimental results are shown in Table 2. Because a multicore processor has a hierarchical memory structure, the access times of the different levels of the hierarchy vary greatly.

For example, global memory has an access time of about cycles, whereas on-chip memory such as a register takes only one cycle. This is because the multicore processor does not provide a good mechanism for optimizing memory accesses at the compiler level. The results also show that, among the four versions of the algorithm, our tile-based method gives the best results. Here, the experiments were performed on the platform that has a core Intel 2. In this test, we focus on how the number of threads, created to match the number of cores of the system, affects the final performance.

The test results are shown in Figure 4, in which we see that as the average sequence length grows, the performance improves markedly for all the methods. When the average sequence length exceeds 32, characters, the sequence alignment time increases much faster. This is probably because the main memory of this core processor is 32 GB.

When the sequence size exceeds 32, characters, accessing data from main memory requires more CPU cycles. The parallelization of the algorithm with the three different methods, shown in Table 3, indicates that our new tiling method gives better performance than the other two. This is emphasized in Figure 5, which shows a pairwise comparison of the overall speedups of all the parallel methods.

The comparison is done for each pair of sequences. Figure 5 shows that the tile-based implementation is an order of magnitude faster than the simple wave-front implementation, because it only needs to calculate a portion of the dynamic programming matrix. In addition, the computation time of the tile-based implementation grows more slowly: as the sequence length increases, the growth in the calculation time needed by our tile-based method is very low.

Comparison of speedup for the different versions of the parallel algorithm on the core processor. Here, the experiments were performed on the platform that has a core Intel 2. Figure 6 shows the performance evaluation on the core multithreaded processor corresponding to Table 4, for various sequence lengths, with computation time in seconds.

Here, we also observed that the time required increases once the sequence file exceeds a length of 32, characters. The experiments were performed on two Linux-based workstations with Intel Xeon processors at 2. In MPI programming, because each process works on a different piece of the data with the same code, independently and simultaneously, it gives more speedup on a distributed architecture than on a shared-memory system. Because shared memory uses a fork-join model, its speedup stops improving after some point, mainly because of limited memory.

Both implementations were tested on the publicly available GenBank database [20] over sequences of various lengths; four sample input files with benchmark alignments were taken. The figures show that the MPI implementation gives improved speedup for large sequences compared with the OpenMP implementation.

We also obtain acceptable speedup for small and medium sequences. The highlighted entries in Table 5 show that, as the sequence size increases, the optimum performance is obtained with a decreasing number of threads (working cores) for both tile sizes. The highlighted entries in Table 6 show that the optimum performance is obtained with an increasing number of processes and a larger tile size as the input data size increases. This is because communication between the processes decreases as the size of the data sent to each slave process increases, whereas it would otherwise be greater for small data sizes.
