NSF PIRE Project Proposal: Data Mining and Machine Learning Cross Validation over Distributed Networks

March 26, 2008

Student Name: Juan Bernal  (from FAU)

Supervisor’s Name and Title at FIU/FAU: Andres Folleco, Assistant Professor 

Name of the PIRE International Partner’s Institution: Barcelona Supercomputing Center

Supervisor’s Name and Title at the PIRE International Partner’s Institution: Jordi Torres (Autonomic Systems and eBusiness Platform Manager); Ricard Gavalda (Assistant Professor – LARCA Research Group)

Project Title: Data Mining and Machine Learning Cross Validation over Distributed Networks

Problem Statement: One common problem our empirical research faces in every experiment is the serialization of fold processing during cross-validation (CV) operations: the folds are processed one after another. This situation arises in any data mining or machine learning experiment that uses CV, regardless of the type of algorithm used or the application domain.
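
To make the serialization issue concrete, here is a minimal sketch in plain Java, assuming a toy majority-label "classifier" and a local thread pool standing in for grid or distributed nodes. Every name in it (ParallelFoldSketch, evaluateFold, runSerial, runParallel) is hypothetical and is not part of Weka or of any existing solution. It only illustrates that each fold depends on nothing but the data and its own fold index, so the folds do not have to wait for one another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: illustrative names only, not Weka's API.
class ParallelFoldSketch {

    // Toy "model": predicts the majority label (0 or 1) of the training rows.
    static int majorityLabel(int[] labels, List<Integer> trainIdx) {
        int ones = 0;
        for (int i : trainIdx) ones += labels[i];
        return (2 * ones >= trainIdx.size()) ? 1 : 0;
    }

    // One fold: train on rows outside the fold, test on rows inside it.
    // Row i belongs to fold (i % k). Nothing here depends on another fold.
    static double evaluateFold(int[] labels, int fold, int k) {
        List<Integer> train = new ArrayList<>();
        List<Integer> test = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            if (i % k == fold) test.add(i); else train.add(i);
        }
        int predicted = majorityLabel(labels, train);
        int correct = 0;
        for (int i : test) if (labels[i] == predicted) correct++;
        return test.isEmpty() ? 0.0 : (double) correct / test.size();
    }

    // Serialized CV: fold f + 1 cannot start until fold f has finished.
    static double[] runSerial(int[] labels, int k) {
        double[] accuracy = new double[k];
        for (int f = 0; f < k; f++) accuracy[f] = evaluateFold(labels, f, k);
        return accuracy;
    }

    // Concurrent CV: every fold is submitted up front, then results are collected.
    static double[] runParallel(int[] labels, int k, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int f = 0; f < k; f++) {
                final int fold = f;
                Callable<Double> task = () -> evaluateFold(labels, fold, k);
                futures.add(pool.submit(task));
            }
            double[] accuracy = new double[k];
            for (int f = 0; f < k; f++) accuracy[f] = futures.get(f).get();
            return accuracy;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for folds", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fold failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] labels = {1, 1, 1, 0, 1, 1, 0, 1, 1, 1}; // made-up toy data
        System.out.println("serial   = " + Arrays.toString(runSerial(labels, 5)));
        System.out.println("parallel = " + Arrays.toString(runParallel(labels, 5, 4)));
    }
}
```

Both methods return the same per-fold scores; only the scheduling differs. On a real grid the thread pool would be replaced by remote workers, which adds data transfer and failure handling that this sketch ignores.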

Motivation and Impact: CV operations are a serious bottleneck in any data mining and machine learning initiative. Wasted time and inefficient use of resources are constant problems in these settings. The plan is to use high-performance supercomputing resources that are Grid-enabled and/or distributed over heterogeneous networks to alleviate serialization and the significant delays it causes during CV operations.

Current Status: At least two other initiatives directly related to the GRID initiative may address the problem stated above, directly or indirectly. Initially, we plan to conduct a technical survey of these studies to evaluate and verify their value. We are not aware of any FIU/FAU group currently conducting such a study.

Research Roadmap:
1-) Technical survey of “solutions” directly and/or indirectly related to the goals stated above.
2-) Based on the results of the survey, harness any reusable code or solutions that enhance CV operations, and/or
3-) Design a solution for Weka, a commonly used open source data mining and machine learning tool, that minimizes the changes and additions required to its architecture. The idea is to create a “pluggable” solution that enables CV operations to be conducted over a GRID-enabled and/or distributed heterogeneous network (not necessarily LA-Grid enabled systems).
4-) Implementation of the “pluggable” modules in the Weka tool architecture, all in Java.
5-) Test and verification phase.
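
As a rough illustration of what “pluggable” could mean here, the sketch below separates the CV driver from the decision of where fold tasks run. It is only an assumption about one possible design: the interface and class names (FoldExecutionStrategy, SerialStrategy, ThreadPoolStrategy, CrossValidationDriver) are invented for this example and do not exist in Weka, and the actual integration points would depend on the outcome of the survey.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical plug-in point: decides where and how fold tasks execute.
interface FoldExecutionStrategy {
    // Returns one score per task, in the same order as the input list.
    List<Double> runAll(List<Callable<Double>> foldTasks);
}

// Default behaviour: folds run one after another in the calling thread.
class SerialStrategy implements FoldExecutionStrategy {
    public List<Double> runAll(List<Callable<Double>> foldTasks) {
        List<Double> scores = new ArrayList<>();
        for (Callable<Double> task : foldTasks) {
            try {
                scores.add(task.call());
            } catch (Exception e) {
                throw new IllegalStateException("a fold failed", e);
            }
        }
        return scores;
    }
}

// Local stand-in for a distributed back end: folds run on a thread pool.
// A grid-backed strategy would implement the same interface.
class ThreadPoolStrategy implements FoldExecutionStrategy {
    private final int threads;

    ThreadPoolStrategy(int threads) {
        this.threads = threads;
    }

    public List<Double> runAll(List<Callable<Double>> foldTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Double> scores = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<Double> f : pool.invokeAll(foldTasks)) scores.add(f.get());
            return scores;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for folds", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fold failed", e);
        } finally {
            pool.shutdown();
        }
    }
}

// The driver only knows the interface, so strategies can be swapped freely.
class CrossValidationDriver {
    private final FoldExecutionStrategy strategy;

    CrossValidationDriver(FoldExecutionStrategy strategy) {
        this.strategy = strategy;
    }

    double meanScore(List<Callable<Double>> foldTasks) {
        List<Double> scores = strategy.runAll(foldTasks);
        double sum = 0.0;
        for (double s : scores) sum += s;
        return scores.isEmpty() ? 0.0 : sum / scores.size();
    }

    public static void main(String[] args) {
        // Placeholder fold tasks returning fixed, made-up scores.
        List<Callable<Double>> tasks = new ArrayList<>();
        tasks.add(() -> 0.5);
        tasks.add(() -> 1.0);
        System.out.println(new CrossValidationDriver(new SerialStrategy()).meanScore(tasks));
        System.out.println(new CrossValidationDriver(new ThreadPoolStrategy(2)).meanScore(tasks));
    }
}
```

The point of the design is that the host tool would need only one change, namely accepting a strategy object, while each execution back end lives in its own module.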

Relation to PIRE Core Research Projects: This project fits the CI Integration Layer, specifically the “Data Mining Software Tools” box within this layer. We foresee similar or spin-off projects in closely related areas of at least one other layer, the CI Enablement Layer. In addition, this project can significantly benefit any application domain area under the CI Application Layer that uses Data Mining and Machine Learning technology directly.

Keywords: cross validation, data mining, distributed systems, grid, machine learning

Posted by Juan Bernal


  1. This is the proposed project on which I will focus my research. Data Mining and Machine Learning are new topics that I find interesting, and, as mentioned in the proposal, there is an identified bottleneck in cross-validation that can hopefully be overcome by running cross-validation over grid-enabled and/or distributed networks.

    As preparation for this research experience, I took an introductory class in data mining, which has enabled me to understand and apply many important concepts needed to generate positive results in this field.

    I am looking forward, with high expectations, to a great experience doing research at the Barcelona Supercomputing Center in Spain.

    Juan Bernal on Wednesday, 26 March 2008, 18:35 EDT