Data mining is the extraction of hidden predictive information from large databases. Emerging data-mining applications are an important factor driving the architecture of future microprocessors. This paper analyzes the performance scalability of such applications on parallel architectures to understand how best to architect the next generation of microprocessors that will have many CPU cores on chip. Bioinformatics is one of the most active research areas in computer science, and it relies heavily on many types of data-mining techniques. In this paper, we report on the performance scalability analysis of six bioinformatics applications on a 16-way SMP system based on Intel Xeon™ microprocessors.
These applications are very compute intensive, and they manipulate very large data sets; many of them are freely accessible. Bioinformatics is a good proxy for workload analysis of general data-mining applications. Our experiments show that these applications exhibit good parallel behavior after some algorithm-level reformulation or careful selection of parallelism. Most of them scale well as the number of processors increases, with a speed-up of up to 14.4X on 16 processors.
We start with an introduction to data mining. The data-mining techniques studied are briefly described, and the selected workloads using these techniques are listed. We then provide a brief description of the methodology used for the studies. We present the scalability analysis of three workloads related to Bayesian Network (BN) structure, two workloads related to recognition, and one workload related to optimization.
We conclude with the key lessons of the study. These workloads are compute intensive and data parallel. They manipulate large amounts of data that stress the cache hierarchy. Techniques that optimize the use of caches are key to ensuring the performance scalability of these workloads on parallel architectures.
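To put the reported best-case speed-up of 14.4X on 16 processors in perspective, the short sketch below computes two standard scalability figures from it: parallel efficiency (speed-up divided by processor count) and the Karp-Flatt experimentally determined serial fraction. Neither metric is taken from the study itself; the function names are hypothetical, and the only input from the paper is the pair (14.4, 16).

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speed-up achieved (1.0 means perfect scaling)."""
    if processors < 1 or speedup <= 0:
        raise ValueError("processors must be >= 1 and speedup must be > 0")
    return speedup / processors


def karp_flatt_serial_fraction(speedup: float, processors: int) -> float:
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p).

    This is an illustrative estimate, not a figure reported in the study.
    It lumps true serial work together with parallel overheads.
    """
    if processors < 2 or speedup <= 0:
        raise ValueError("processors must be >= 2 and speedup must be > 0")
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


if __name__ == "__main__":
    # The only values taken from the text: 14.4X speed-up on 16 processors.
    s, p = 14.4, 16
    print(f"efficiency:      {parallel_efficiency(s, p):.3f}")
    print(f"serial fraction: {karp_flatt_serial_fraction(s, p):.4f}")
```

Under these definitions, the best-scaling workload runs at about 90% parallel efficiency, and the implied serial fraction is below 1%. That is consistent with the description of the workloads as data parallel, although the abstract alone does not say how the remaining overhead is divided between serial code and cache or memory effects.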