CS3211 Project 2 OthelloX


Contents

Section I. Terminology
Section II. Experimental Methodology
Section III. Distribution Method
Section IV. Granularity
Section V. Job Pooling
Section VI. Speedup
Section VII. Further Improvements
Section VIII. Conclusion

Section I. Terminology

Job
A job is a board position that has to be evaluated for its value, so that the move leading to it can be judged a good or bad move.

Algorithms
1) SERIAL_MINIMAX: Serial version of the minimax algorithm.
2) SERIAL_ALPHABETA: Serial version of the minimax algorithm with alpha-beta pruning.
3) BATCH_MINIMAX: Parallel algorithm where the boards to be evaluated are split into more boards. The boards are then sent to the Slave processors as a single batch, to be evaluated using the minimax algorithm.
4) BATCH_ALPHABETA: Similar to the above, but the boards are evaluated using minimax with alpha-beta pruning.
5) JOBPOOL_MINIMAX: The Master maintains a pool of jobs to be evaluated, and Slave processors request jobs to work on. Boards are sent to the Slave processors in small mini-batches (of the job pool send size), to be evaluated using the minimax algorithm.
6) JOBPOOL_ALPHABETA: Similar to the above, but the boards are evaluated using minimax with alpha-beta pruning.

Job distribution method by the Master
This is the method the Master uses to choose boards to send to the Slave processors. Suppose there are N boards in total and K boards have to be sent to each Slave processor:
1) SEQUENTIAL: Sends the first K of the N boards to each Slave processor.
2) RANDOM: Randomly chooses K of the N boards to send.

Number of jobs per processor
This is the minimum number of jobs that each processor should work on. If there are fewer jobs than this, the Master processor splits the current jobs into more granular jobs to divide among the processors.

Job pool send size
This is the number of boards to send per job request by a Slave processor. For example, if the job pool send size is n and there are 23 jobs waiting in the job pool, a Slave processor that requests a job will be sent n jobs, provided n does not exceed 23.
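The serial minimax recursion (SERIAL_MINIMAX) can be sketched as follows; the Board type, the leaf evaluation, and the child generation here are illustrative stand-ins for the real structures in othello.cpp:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Illustrative job/board representation; the real othello.cpp types differ.
struct Board {
    int value;                    // static evaluation at a leaf position
    std::vector<Board> children;  // positions reachable in one move
};

// SERIAL_MINIMAX sketch: evaluate a board position to full depth,
// taking the max at the current player's levels and the min at the
// opponent's levels.
int minimax(const Board& b, bool maximizing) {
    if (b.children.empty()) return b.value;  // leaf: static evaluation
    int best = maximizing ? std::numeric_limits<int>::min()
                          : std::numeric_limits<int>::max();
    for (const Board& child : b.children) {
        int v = minimax(child, !maximizing);
        best = maximizing ? std::max(best, v) : std::min(best, v);
    }
    return best;
}
```

A job in the sense above is simply one such Board handed to a processor for evaluation.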

Section II. Experimental Methodology

The experiments in the following sections are performed on two boards with different numbers of empty slots. Board A is a 6x6 board with only 13 empty slots left on the board. Board B is a 6x6 board with 19 empty slots left, and its entire search space is therefore much harder to evaluate. In each experiment, the minimax or alpha-beta pruning algorithm is executed to full depth (13 on Board A and 19 on Board B) in order to have a fixed problem size for comparison.

We treat the minimax algorithm and the alpha-beta pruning algorithm as two different algorithms even though they compute the same results. The justification is that it is difficult to obtain a fair comparison between two algorithms with very different properties. For example, in the evaluation of Board A, we can observe huge differences (in Table 1) between the time taken by the serial implementations of minimax and alpha-beta pruning, due to the ability of alpha-beta pruning to prune away large portions of the search tree.

Time taken for Serial Minimax (s) | Time taken for Serial Alpha-Beta Pruning (s)
Table 1. Difference in time taken for evaluation of Board A

For each experiment, there are several variables (discussed in Section I) in othello.cpp that can be adjusted to change the behavior of the program and obtain experimental results. These variables are:
a) ALGORITHM (algorithm to use)
b) JOB_DISTRIBUTION (job distribution method by the Master)
c) NUM_JOBS_PER_PROC (number of jobs per processor)
d) JOBPOOL_SEND_SIZE (number of jobs to send per job request)
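The four variables listed above might be declared in othello.cpp along these lines; the identifier names come from the report, while the enum shapes and the default values shown are assumptions for illustration:

```cpp
// Illustrative declarations of the tunable experiment parameters.
// Names are taken from the report; values and enum layout are assumed.
enum Algorithm { SERIAL_MINIMAX, SERIAL_ALPHABETA,
                 BATCH_MINIMAX, BATCH_ALPHABETA,
                 JOBPOOL_MINIMAX, JOBPOOL_ALPHABETA };
enum JobDistribution { SEQUENTIAL, RANDOM };

const Algorithm       ALGORITHM         = BATCH_ALPHABETA;  // algorithm to use
const JobDistribution JOB_DISTRIBUTION  = RANDOM;           // Master's send policy
const int             NUM_JOBS_PER_PROC = 10;  // minimum jobs per processor
const int             JOBPOOL_SEND_SIZE = 1;   // jobs sent per Slave request
```

Each experiment below is then a run of the program with one combination of these constants.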

Section III. Distribution Method

The Master has different ways to choose jobs to send to the Slaves. A simple method is the SEQUENTIAL one: if K jobs are to be sent to a Slave processor, the first K jobs in the jobs queue are given to that processor. The alternative is the RANDOM method: the K jobs are chosen at random from the jobs in the jobs queue.

Minimax

Evaluation Board: A
Algorithm: BATCH_MINIMAX
Number of jobs per processor: 10

Graph 1a. Total time taken (s) for the RANDOM and SEQUENTIAL job distribution methods.

As we can see from the graph above, the RANDOM method of choosing jobs to send to the Slaves always results in a lower total time to evaluate the board than the SEQUENTIAL method. The SEQUENTIAL method can be problematic because the execution time of each job varies significantly, depending on the board position and the current player. If a job has a large branching factor because of the number of possible moves, its execution takes much longer. When jobs are split by the Master, they re-enter the jobs queue as one group, so jobs with long execution times are likely to be lumped together. This means the sequential approach will likely cause some Slave processors to receive jobs that are particularly long compared to those of the other Slave processors.

The RANDOM method is likely to be more effective because any job in the jobs queue is equally likely to be chosen, as opposed to jobs in each other's vicinity having a higher chance of being chosen together. Even if the long jobs are lumped together, the chance of choosing all of them and sending them to the same Slave processor is extremely low, which allows a better distribution of workload across the Slave processors.

Alpha-Beta Pruning

Evaluation Board: B
Algorithm: BATCH_ALPHABETA
Number of jobs per processor: 10

Graph 1b. Total time taken (s) for the RANDOM and SEQUENTIAL job distribution methods.

As we can see from the graph above, the RANDOM method also improves the time taken for alpha-beta pruning compared to the SEQUENTIAL method: the total time taken with the RANDOM method is always lower than that of the SEQUENTIAL method. The rationale is the same as for the minimax algorithm.
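The two selection methods can be sketched as below, assuming the jobs queue is held in a std::vector; the function names and the use of int job IDs are illustrative, not the real othello.cpp interface:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// SEQUENTIAL: hand the first K jobs in the queue to the next Slave.
// Caller must ensure k does not exceed the queue size.
std::vector<int> pick_sequential(std::vector<int>& queue, std::size_t k) {
    std::vector<int> picked(queue.begin(), queue.begin() + k);
    queue.erase(queue.begin(), queue.begin() + k);
    return picked;
}

// RANDOM: every job in the queue is equally likely to be chosen, so
// long jobs that entered the queue together are unlikely to all be
// sent to the same Slave.
std::vector<int> pick_random(std::vector<int>& queue, std::size_t k,
                             std::mt19937& rng) {
    std::shuffle(queue.begin(), queue.end(), rng);  // randomise order first
    return pick_sequential(queue, k);               // then take the front K
}
```

Shuffling before taking the front K is one simple way to sample K jobs without replacement.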

Section IV. Granularity

Granularity refers to how fine-grained the jobs are: the workload is granular if there are many small jobs (short execution time) to be done, as opposed to a few big jobs (long execution time). To change the granularity of the jobs before they are sent to the Slave processors, we use the variable NUM_JOBS_PER_PROC (number of jobs per processor). The Master process ensures that there are at least (NUM_JOBS_PER_PROC x number of processors) jobs in the jobs queue before sending any jobs. It does this by splitting the original jobs into multiple jobs and adding them back into the jobs queue.

Minimax

Evaluation Board: A
Algorithm: BATCH_MINIMAX
Job distribution: RANDOM

Graph 2a. Speedup achieved for different numbers of jobs per processor.

As we observe from the graph above, increasing the number of jobs per processor from 5 to 10 to 20 generally results in an increased overall speedup (with the exception of one point). Speedup is measured as the time taken by the serial algorithm to evaluate Board A divided by the time taken by the parallel algorithm to evaluate the same board. Increasing the number of jobs per processor (and hence the granularity) improves speedup because the minimax algorithm works the same way whether it is given small jobs or large jobs: it goes down the search tree and performs a min or max at every level depending on the current player, so breaking the large original jobs into smaller jobs does not affect the algorithm. Breaking large jobs into smaller ones does, however, have an advantage: the pool of jobs to be sent to the Slave processors becomes much larger. Consequently, with a RANDOM manner of selecting jobs, each processor is more likely to receive a similar mix of large and small jobs. This makes it less likely that any Slave processor is delayed for a significant time by being given more workload, and the overall speedup benefits.

Alpha-Beta Pruning

Evaluation Board: B
Algorithm: BATCH_ALPHABETA
Job distribution: RANDOM

Graph 2b. Speedup achieved for different numbers of jobs per processor.

As we observe from the graph above, increasing the number of jobs per processor from 1 to 5 to 10 actually results in a decreased overall speedup, the opposite of the minimax case. This is because, unlike the minimax algorithm, the alpha-beta pruning algorithm is affected by the splitting of large jobs into smaller jobs. The algorithm works by tightening its alpha and beta bounds as it travels through the search tree. When a job is broken into smaller jobs, the bounds attained in one job are not passed on to the next job. Without these bounds, the key advantage of alpha-beta pruning, namely its ability to prune off large portions of the search tree, is compromised. This means that with more processors (and hence more granular jobs), speedup may actually decrease (we can observe this in Graph 2b from the decrease in speedup of the grey line).
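The bounds just described can be made concrete with a minimal alpha-beta sketch (Node and its static values are illustrative stand-ins for real board evaluation). A subtree shipped off as a separate job must restart from the full window, losing any cutoffs the accumulated bounds would have enabled:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Node {
    int value;                  // static evaluation at a leaf
    std::vector<Node> children;
};

// Alpha-beta sketch: the [alpha, beta] window tightens as siblings are
// evaluated. When a job is split off to another processor, it starts
// again from (-inf, +inf), so cutoffs the serial search would have
// made are lost.
int alphabeta(const Node& n, int alpha, int beta, bool maximizing) {
    if (n.children.empty()) return n.value;
    if (maximizing) {
        int best = std::numeric_limits<int>::min();
        for (const Node& c : n.children) {
            best = std::max(best, alphabeta(c, alpha, beta, false));
            alpha = std::max(alpha, best);
            if (alpha >= beta) break;  // beta cutoff: prune remaining siblings
        }
        return best;
    } else {
        int best = std::numeric_limits<int>::max();
        for (const Node& c : n.children) {
            best = std::min(best, alphabeta(c, alpha, beta, true));
            beta = std::min(beta, best);
            if (alpha >= beta) break;  // alpha cutoff: prune remaining siblings
        }
        return best;
    }
}
```

Starting every small job with the widest possible window is exactly why finer granularity forces the parallel alpha-beta version to assess more boards than the serial one.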

Section V. Job Pooling

For the BATCH_MINIMAX and BATCH_ALPHABETA algorithms, the jobs to be executed are sent out in one shot (at the start), after which the Master process also begins to work on the jobs. However, it is difficult to ensure that all the processes receive jobs of approximately equal execution times. The result is a wide distribution of the time taken for each process to finish its jobs: some processes finish very early and remain idle while others take much longer. Since the Master process has to wait for all Slave processes to finish, this effectively slows down the entire program.

Job pooling is the idea that the jobs to be done are not sent to the Slave processes in one shot at the start. Instead, they are released in mini-batches (of JOBPOOL_SEND_SIZE jobs) upon request by a Slave. The disadvantage is that the Master process no longer works on any computational tasks; instead, it constantly sends jobs to the Slave processes and receives completed jobs from them. Moreover, because of the constant to-and-fro communication between the Master and Slave processes, communication costs increase significantly. The advantage, however, is that this maximizes the resource usage of the Slave processes by giving them jobs whenever they are idle.

Minimax

Evaluation Board: A
Algorithm: JOBPOOL_MINIMAX
Job distribution: RANDOM
Number of jobs per processor: 10

Graph 3a. Speedup achieved for different job pool send sizes.
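Ignoring the MPI plumbing, the Master's side of the job pool amounts to handing out a mini-batch per Slave request until the pool runs dry, as in this sketch (function name and int job IDs are illustrative):

```cpp
#include <deque>
#include <vector>

// Job-pool sketch (MPI send/receive omitted): the Master releases jobs
// in mini-batches of `send_size` per Slave request instead of one big
// batch, so an idle Slave always gets fresh work while any remains.
std::vector<int> serve_request(std::deque<int>& pool, int send_size) {
    std::vector<int> batch;
    while (send_size-- > 0 && !pool.empty()) {
        batch.push_back(pool.front());  // hand out the next queued job
        pool.pop_front();
    }
    return batch;                       // empty batch signals "no work left"
}
```

In the real program, the empty-batch case would correspond to the Master telling the requesting Slave to stop.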

We observe in Graph 3a that with an increase in the number of processors, the speedup increases significantly (almost linearly). The rationale is similar to that explained for Graph 2a: since the minimax algorithm is not affected by splitting the current jobs into smaller jobs, more processors are able to work on the existing problem and reach a solution faster.

We also see in Graph 3a that increasing the job pool send size from 1 to 5 to 10 decreases the overall speedup. This is because, with a large job pool send size, there is a higher chance that the last few jobs will take a Slave processor a long time to finish, leaving the other Slave processors idle. This defeats the purpose of implementing a job pool, as it reintroduces the very problem of the batch-sending algorithm.

In general, when we compare the speedup achieved in Graph 3a (which ranges between 11.4 and 15.0) with the speedup achieved in Graph 2a, we see that there is almost no advantage of the job pooling minimax algorithm over the batch minimax algorithm. This might be due to several factors mentioned previously, such as communication costs and the fact that the Master process is no longer able to work on the jobs.

Alpha-Beta Pruning

Evaluation Board: B
Algorithm: JOBPOOL_ALPHABETA
Job distribution: RANDOM
Number of jobs per processor: 10
Job pool send size: 1

Graph 3b. Speedup achieved (non-granularity-adjusted).

We observe in Graph 3b that the speedup decreases with an increasing number of processors for the JOBPOOL_ALPHABETA algorithm. This counterintuitive result is due to the fact that the number of jobs per processor remained constant at 10. This means that, in order to split the jobs amongst the increased number of processors, more of the jobs had to be split into more granular ones, and the increase in granularity resulted in the poor performance of the algorithm. We therefore apply a tweak to prevent such problems due to the increase in granularity.

Evaluation Board: B
Algorithm: JOBPOOL_ALPHABETA
Job distribution: RANDOM
Number of jobs per processor: Adjusted to maintain granularity
Job pool send size:

Graph 3c. Speedup achieved (granularity-adjusted).

To obtain Graph 3c, we change the number of jobs per processor every time we increase the number of processors, so that (number of jobs per processor x number of processors) stays approximately the same and hence granularity stays approximately the same. As we can see in Graph 3c, this results in the expected speedup as the number of processors increases, because the adverse granularity effects on alpha-beta pruning discussed in Section IV have now been countered.

In general, the speedup achieved in Graph 3c by the granularity-adjusted job pooling alpha-beta pruning algorithm is still worse than the speedup achieved in Graph 2b by the batch alpha-beta pruning algorithm. As before, this could be due to the increased communication costs and the fact that the Master process no longer works on the jobs.
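The granularity adjustment above keeps (number of jobs per processor x number of processors) roughly constant. One way to sketch it, where the total-job target is an assumed tuning constant rather than a value from the report:

```cpp
// Keep the total job count roughly constant as processors are added,
// so jobs do not become more granular. The target total is illustrative.
int jobs_per_proc(int total_jobs_target, int num_procs) {
    // Ceiling division so the product never falls below the target.
    return (total_jobs_target + num_procs - 1) / num_procs;
}
```

With a target of 120 total jobs, 12 processors get 10 jobs each, and 8 processors get 15 each, so the overall split stays about as coarse in both runs.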

Section VI. Speedup

Minimax

Time taken for Serial Minimax to evaluate Board A (s): 33.49
Time taken for Batch Minimax with 12 processors to evaluate Board A (s): 5.19
Table 2a. Fixed problem size speedup for minimax algorithm

Fixed problem size speedup achieved is given by (33.49 / 5.19) = 6.5x.

Boards assessed on Board B in a fixed time by Serial Minimax: 22,339,377
Boards assessed on Board B in a fixed time by Batch Minimax with 12 processors: 216,691,957
Table 2b. Fixed time speedup for minimax algorithm

The work done by each algorithm can be defined as the number of boards assessed. Hence, the fixed time speedup achieved is (216,691,957 / 22,339,377) = 9.7x.

Alpha-Beta Pruning

Time taken for Serial Alpha-Beta Pruning to evaluate Board B (s):
Time taken for Batch Alpha-Beta Pruning with 12 processors to evaluate Board B (s): 76.5
Table 3a. Fixed problem size speedup for alpha-beta pruning algorithm

Fixed problem size speedup achieved is given by ( / 76.5) = 2.2x.

Boards assessed on Board B in a fixed time by Serial Alpha-Beta Pruning: 22,098,881
Boards assessed on Board B in a fixed time by Batch Alpha-Beta Pruning with 12 processors: 192,260,265
Table 3b. Fixed time speedup for alpha-beta pruning algorithm

The work done by each algorithm can be defined as the number of boards assessed. Hence, the fixed time speedup achieved is (192,260,265 / 22,098,881) = 8.7x.
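The two speedup definitions used above can be written out directly; a sketch using the figures quoted in the tables:

```cpp
// Fixed problem size speedup: same board evaluated to full depth,
// serial time divided by parallel time.
double fixed_size_speedup(double serial_seconds, double parallel_seconds) {
    return serial_seconds / parallel_seconds;
}

// Fixed time speedup: same wall-clock budget, boards assessed by the
// parallel run divided by boards assessed by the serial run.
double fixed_time_speedup(double parallel_boards, double serial_boards) {
    return parallel_boards / serial_boards;
}
```

For example, fixed_size_speedup(33.49, 5.19) gives about 6.5, and fixed_time_speedup(216691957, 22339377) gives about 9.7, matching the figures above.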

Comparison

From Tables 2a and 3a, we observe that the fixed problem size speedup achieved by the minimax algorithm is significantly larger than that of the alpha-beta pruning algorithm (6.5x vs 2.2x). The reason is the granularity issue discussed earlier: the minimax algorithm is easily parallelized, and splitting large jobs into smaller jobs does not affect the algorithm, whereas the alpha-beta pruning algorithm is different, and splitting its large jobs into smaller ones forces it to assess more boards to get the same answer.

On the other hand, from Tables 2b and 3b, we observe that the fixed time speedups achieved by the minimax and alpha-beta pruning algorithms are similar (9.7x vs 8.7x). This is because both algorithms are essentially searching through the state tree. While the smaller jobs cause alpha-beta pruning to take longer to terminate, they do not affect the speed at which it traverses the state tree, so the number of boards it assesses is not very different from that of the minimax algorithm.

Section VII. Further Improvements

The following are suggestions for further improving the performance of the parallel Othello solver:

1) Unroll the recursive minimax and alpha-beta pruning function calls into iterative loops to reduce the cost incurred by recursion. This might result in a significant improvement in performance, as the bulk of the computational work uses these functions.
2) Store board data in a 1D array instead of a 2D array. This might also result in a significant improvement in performance, since the computational work involves repeatedly accessing and duplicating board elements to perform the search. With a 1D array, the elements can be cached more easily and cache misses will be reduced.
3) Send board data as a contiguous block instead of sending the n^2 elements one by one. This might result in only a small improvement in performance, since the current communication time is not significant.
4) Use smaller classes and structs by removing elements that are not needed, in order to reduce initialization time and communication time. This might result in only a small improvement in performance, since the current communication time is not significant.
5) Currently, when the processes work on several different jobs using alpha-beta pruning, the results from previous jobs are not used in the evaluation of subsequent jobs. They could be used to update the alpha value and hence help prune away more of the search tree. This might result in a moderate improvement in performance, since information from previous jobs would now be useful in the execution of subsequent jobs.

Section VIII. Conclusion

Overall, we observe that the minimax algorithm for solving Othello scales easily to more processors, whereas the alpha-beta pruning algorithm is much harder to scale, requiring the tuning of other parameters to make sure that granularity does not adversely affect the speedup. However, this does not mean that alpha-beta pruning should not be used in parallel algorithms: the pruning of the search tree results in tremendous performance improvements, and the suggested improvements might further increase the speedup of a parallel alpha-beta algorithm.