Selim Kalayci :: Blog :: Project Proposal

April 01, 2008

Student Name:

Selim Kalayci

Supervisor’s Name and Title at FIU/FAU:

Masoud Sadjadi, Ph.D. – Assistant Professor, School of Computing and Information Sciences, FIU 

Name of the PIRE International Partner’s Institution:

IBM India Research Lab 

Supervisor’s Name and Title at the PIRE International Partner’s Institution:

Gargi B Dasgupta, IBM India Research Lab 

Project Title:

Pattern Based Fault-Tolerance at Workflow Management Systems

Problem Statement:

Workflows are becoming increasingly popular for capturing the business logic of complex scientific applications in domains such as Bioinformatics, Meteorology, Astronomy, and High-Energy Physics. Scientists compose complex workflows that are typically computation-intensive, data-intensive, or both. Grid computing techniques are increasingly employed to pool the resources of multiple institutions for executing such workflows. However, the large size of these workflows and the heterogeneous, dynamic behavior of Grid environments raise challenging problems. One such problem is the faults that may occur during the execution of a workflow. Because the individual tasks in a typical workflow depend on each other, the failure of even a single task may cause the whole workflow to fail, wasting time and resources.
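
To make the dependency problem concrete, here is a minimal sketch in Python. It assumes a toy workflow whose task names and edges are hypothetical (they are not taken from Montage or from our prototype) and computes which tasks become blocked when one task fails.

```python
from collections import deque

# Hypothetical workflow: each task maps to the tasks that consume its output.
DOWNSTREAM = {
    "project_a": ["diff"],
    "project_b": ["diff"],
    "diff": ["fit"],
    "fit": ["background"],
    "background": ["add"],
    "add": [],
}


def affected_by_failure(failed_task, downstream):
    """Return every task that can no longer run once failed_task fails."""
    blocked = set()
    queue = deque([failed_task])
    while queue:
        task = queue.popleft()
        for child in downstream[task]:
            if child not in blocked:
                blocked.add(child)
                queue.append(child)
    return blocked


# A failure in one early task blocks every task that transitively depends on it.
blocked = affected_by_failure("project_a", DOWNSTREAM)
```

In this toy graph, a failure in a single early task blocks all later stages even though the sibling task "project_b" completes normally, which is the wasted work that fault handling is meant to avoid.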

Motivation and Impact:

For these reasons, fault handling is a necessity for any workflow management system in a Grid computing environment that aims to offer its users (scientists) reliable and efficient service. Such reliability and efficiency are especially important for long-running scientific workflows with real-time constraints, such as Hurricane Prediction Ensemble workflows, which cannot afford to miss deadlines or compromise the accuracy of their results.

Current Status:

Currently, our team at FIU is collaborating with IBM T.J. Watson and IBM IRL to design and develop a prototype workflow management system as part of the Latin American Grid project. We use BPEL version 2.0 to specify the workflows and JSDL to describe the jobs. BPEL's fault-handling constructs are very primitive and must be improved upon to achieve the system described in the Motivation section. For this part, we plan to leverage the work previously done in TRAP/BPEL. So far, we have been studying job-flow patterns and their associated failure patterns in typical workflows, but this study is still in a preliminary phase. As our case study, we have chosen the Montage astronomy application.

Research Roadmap:

First, we will finish the ongoing prototype development of our workflow management system.

           This is a precursor task, and we hope to complete it by April.


A.        Next, we will investigate how to incorporate the failure patterns that we have identified into this prototype system.

a.       What are the common failure patterns and the common job-flow patterns?

b.      How do we identify correlations among them?

c.       How do we apply them?

d.      Implement the proxy components for all of the above.

B.         Next, we will evaluate the feasibility and efficiency of the system by running tests on the Montage application.

a.       We can inject various artificial failures to test the correctness and effectiveness of the failure patterns. Based on the results, we can verify whether incorporating the failure patterns produces the expected behavior.

b.      For performance measurements, we can measure the delay in reacting to a failure with a failure pattern.

c.       For efficiency measurements, we can count how many times we select the right pattern for failure correction.

C.        Finally, we plan to write a paper based on the findings of this research. We will identify an appropriate venue for this publication, most probably a workshop, conference, or journal focusing on grid computing, scientific computing, distributed computing, autonomic computing, fault tolerance, or service-oriented computing.
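
As a rough illustration of steps A and B, the following Python sketch shows one way a pattern lookup and a fault-injection evaluation might fit together. Everything in it is an assumption made for illustration: the pattern names, the recovery actions, and the helper functions are hypothetical, not the design of our prototype or of TRAP/BPEL. The delay measured here covers only the lookup itself, whereas in a real system it would span the time from the failure occurring to the recovery action starting.

```python
import time

# Hypothetical mapping from (job-flow pattern, failure pattern) to a recovery action.
RECOVERY_POLICIES = {
    ("sequence", "transient_error"): "retry",
    ("sequence", "resource_unavailable"): "resubmit_elsewhere",
    ("parallel_split", "transient_error"): "retry",
    ("parallel_split", "resource_unavailable"): "resubmit_elsewhere",
}

# Fallback when no policy is registered for the observed combination.
DEFAULT_ACTION = "abort_workflow"


def select_recovery(flow_pattern, failure_pattern):
    """Pick the recovery action registered for this pattern combination."""
    return RECOVERY_POLICIES.get((flow_pattern, failure_pattern), DEFAULT_ACTION)


def run_injection_trials(trials):
    """Evaluate selection accuracy and reaction delay over injected failures.

    Each trial is (flow_pattern, injected_failure, expected_action).
    """
    correct = 0
    delays = []
    for flow_pattern, failure, expected in trials:
        start = time.perf_counter()
        action = select_recovery(flow_pattern, failure)
        delays.append(time.perf_counter() - start)
        if action == expected:
            correct += 1
    return {
        "accuracy": correct / len(trials),
        "mean_delay_s": sum(delays) / len(delays),
    }


# Hypothetical injected failures; the third has no registered policy, so it is a miss.
TRIALS = [
    ("sequence", "transient_error", "retry"),
    ("parallel_split", "resource_unavailable", "resubmit_elsewhere"),
    ("sequence", "data_corruption", "rerun_upstream"),
    ("parallel_split", "transient_error", "retry"),
]
report = run_injection_trials(TRIALS)
```

The "accuracy" value corresponds to the efficiency measurement in B.c (how often the right pattern is selected), and "mean_delay_s" stands in for the reaction delay in B.b.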


Relation to PIRE Core Research Projects: My research project proposal falls directly into the “MetaScheduling & Jobflow Mgmt” box displayed in the “CI Enablement” level, because I will be working on the Jobflow management component and adding fault tolerance to it. I will also interact closely with the Metascheduler component, which communicates with the Jobflow manager through an interface.


Keywords: fault-tolerant, Montage, workflow

Posted by Selim Kalayci
