WEEKLY STATUS REPORT
Camilo A. Silva
July 21-27
ACTIVITIES:
Time has gone fast. This was my last week in Guadalajara, and what a great time I had! I am happy to have had such a great opportunity. Going to another country and becoming familiar with it is a memorable experience. In my case, I was able to excel in my studies, and I was also able to meet new Mexican friends.
Going back to my activities: this past week I met with my team members on Monday, and we discussed the progress of our work. And guess what? We completed it. All the parallelization of our project is done. Only one thing was left: the error handling of the program. Gary and I decided to work together on this task.
My friend Gary did a great job of getting his part running promptly, while it took me a bit longer to complete mine. With that done, we were able to run the first set of data during the weekend (Michael kindly asked GCB users to let us use the cluster). The first set of data has completed with no errors so far. There was one complication, though, which is discussed below in the “ISSUES/PROBLEMS” section.
All in all, I was able to complete my program. I was able to fulfill my goal. My program is a parallel program that is capable of managing MPI communication errors only.
On Wednesday, I had my last meeting with Dr. Duran. During that meeting I shared with him the progress of my work and the steps to follow afterwards. He was of great help throughout my visit.
Another important thought that I want to emphasize is that I never expected to have such a great working experience with my friends from the Bioinformatics group. I was amazed at how well we were able to “virtually” work together from different parts of the globe: China, Mexico, and USA. I was also truly glad of the great leadership and companionship of Mr. Michael Robinson. He was always there to help me, with plenty of patience and good guidance. My partner Gary is a great team member as well: a hard worker and very knowledgeable about programming. I sincerely feel that I am with the best team members.
Looking back on the wonderful time that I have spent in Guadalajara, I am proud to say that I do not regret anything at all. I am sincerely grateful to FIU CIS and the PIRE program for giving me this prestigious opportunity.
One important thing: throughout my PIRE experience I had the chance to work with many important people who helped me solve issues. I never had the opportunity to thank them publicly, so I want to take the time to thank them all:
- I want to thank God for giving me the opportunity to participate in this research experience
- All the FIU students that helped me: David, Juan Carlos, both Javier Delgado and Javier Figueroa, and Michael Robinson
- Big thanks to Michael Robinson because he was a great team leader; I am both proud and happy to have partnered with him and Gary
- Many thanks to Gary because he gave me tons of insight into my programs
- Special thanks to all Professors and Administrators in charge of PIRE: Dr. Sadjadi, Dr. Graham, Ms. Carbajo, Dr. Hector Duran, Dean Yi Deng, and every single person that helped out with the PIRE program
- Lastly, special thanks to both of my Professors and Advisors in Mexico and USA: Dr. Duran, and Dr. S. Masoud Sadjadi
ACCOMPLISHMENTS:
I am proud to say that I completed my proposed plan for the summer. I completed my parallelized program, which is capable of self-healing whenever an MPI error is detected at the time the master node sends a message to the slave nodes.
ISSUES/PROBLEMS:
As I mentioned in the “ACTIVITIES” section, we had a small problem during the runtime of our project. Last night during our meeting, Michael told us that the problem had to do with something known as “memory leakage.” To be honest, I am not quite sure what the cause of the problem is or how it should be resolved. This is something that we will find out as a group, and Michael has decided to look into it.
PLANS:
The big plan now is to write the technical paper and complete my PIRE DVD on time.
SUMMARIES/CRITIQUES OF PAPERS:
FIRST READINGS: “MPI Error Handling”
REFERENCES:
http://beige.ucs.indiana.edu/I590/node85.html
http://www.mpi-forum.org/docs/mpi-11-html/node148.html
http://www.dartmouth.edu/~rc/classes/intro_mpi/mpi_error_functio
http://www.hlrs.de/people/gabriel/benchmarking/ft-vsuite-s
http://www.netlib.org/utk/people/JackDongarra/PAPERS/ft-mpi-l
The basic theory that I learned from all those articles about error handling is that an MPI communicator is more than just the group of processes that belong to it. Among the items that the communicator carries internally is the error handler. It is important to point out that whether an error message is printed or not depends on the MPI implementation.
MPI errors arise whenever messages are incorrectly constructed, addressed, sent, or received. Please note that MPI does not provide mechanisms for dealing with failures in the communication system. What MPI does provide is a mechanism for handling recoverable errors: the default error handler, which aborts the MPI program, can be replaced with a more appropriate error handler. To help the application identify an error, the MPI_Error_class routine converts any error code into one of a small set of standard error codes known as error classes. Furthermore, MPI provides only two predefined error handlers. The first is MPI_ERRORS_ARE_FATAL, the default, which causes the MPI program to abort whenever an error is found. The other is MPI_ERRORS_RETURN, which causes MPI to return an error value instead of aborting.
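As a minimal sketch of the two predefined handlers, the snippet below uses the third-party mpi4py bindings. This is an assumption for illustration only; the report does not say which language or MPI library the project used, and the names try_send and format_report are hypothetical. It switches a communicator to MPI_ERRORS_RETURN, attempts a send, and reports the error class and message if the call fails.

```python
def format_report(rank, error_class, message):
    """Build a one-line, rank-tagged description of an MPI error."""
    return f"[rank {rank}] MPI error class {error_class}: {message}"


def try_send(comm, payload, dest, tag=0):
    """Send payload to dest; return None on success or a report string.

    Assumes comm is an mpi4py communicator such as MPI.COMM_WORLD.
    """
    # Imported lazily so format_report works without mpi4py installed.
    from mpi4py import MPI

    # Select MPI_ERRORS_RETURN so a failed call reports an error
    # instead of aborting the whole job (MPI_ERRORS_ARE_FATAL).
    comm.Set_errhandler(MPI.ERRORS_RETURN)
    try:
        comm.send(payload, dest=dest, tag=tag)
    except MPI.Exception as exc:
        # mpi4py raises MPI.Exception when a call returns an error code;
        # Get_error_class() maps that code to a standard error class.
        return format_report(comm.Get_rank(),
                             exc.Get_error_class(),
                             exc.Get_error_string())
    return None
```

Under an MPI launcher, calling try_send(MPI.COMM_WORLD, data, dest) with an out-of-range destination rank would be expected to produce a report rather than abort, although the exact class and message depend on the MPI implementation.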
Since I read quite a few different papers and documents, I can only comment in general that all of them were helpful. Some of the documents were heavy on theory, which was a bit monotonous, while others were very simple in their definitions and provided great examples. All of them were easy to comprehend.
This topic was extremely important for the last part of my project because it helped me a lot in establishing a self-healing system in my project.
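The report does not describe how the self-healing is implemented, so the following is only a sketch of one possible approach: retrying a send when it returns an error code. The function send_with_retry and the send callable it receives are hypothetical; send is assumed to return 0 on success and a nonzero error code on failure, in the spirit of MPI_ERRORS_RETURN.

```python
import time


def send_with_retry(send, payload, dest, retries=3, delay=0.0):
    """Call send(payload, dest) until it succeeds or retries run out.

    `send` is a hypothetical callable returning 0 on success and a
    nonzero error code on failure.
    Returns a tuple (ok, attempts, last_code).
    """
    code = None
    for attempt in range(1, retries + 1):
        code = send(payload, dest)
        if code == 0:
            return True, attempt, code
        time.sleep(delay)  # brief pause before trying again
    return False, retries, code
```

If every attempt fails, the caller still receives the last error code and can decide what to do next, for example handing the work to a different node.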
SECOND READINGS: “MPI debugging”
RESOURCES:
http://cw.squyres.com/columns/2004-12-CW-MPI-Mechanic.pdf
http://www.clustermonkey.net/index.php?option=com_search&searchw
http://www.hlrs.de/organization/amt/services/tools/debugge
http://www.cs.utah.edu/research/techreports/2007/pdf/UUCS-07-0
http://www.nacad.ufrj.br/sgi/007-3687-010/sgi_html/ch04.html
I wanted to learn what strategies exist for debugging MPI programs, and I was exposed to the techniques that are used nowadays to debug parallel programs. The first technique that I learned was “printf()” debugging. This technique is not very effective in parallel programs because of the multiplicative effect: many nodes will print the same thing unless something in the output identifies them. Also, “printf()” techniques can only display a limited subset of the process state.
Another technique is to use serial debuggers in parallel. Although serial debuggers were not developed for parallel programs, they may provide some insight when hunting for certain bugs.
Memory-checking debuggers look for erroneous patterns such as accessing memory outside of an array or the local stack, or using heap memory that was already freed. One of their advantages is that they report all errors in a file, with line numbers. The downside is that they cannot be used interactively and cannot be attached to already-running processes.
The last category of debugging techniques is the parallel debuggers. In addition to the functionality common to all debuggers, such as setting breakpoints, examining variables, and stepping through code, these debuggers can individually monitor and control all processes in a running MPI job.
This topic was very interesting, and all the authors did a great job of explaining it, although most of the information could be found by reading only one of the five articles that I read.
This information was important to me because it helped me understand the different types of debugging techniques that can be used in a parallel MPI program.
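One common way to soften the multiplicative effect is to prefix every debug line with the rank of the process that printed it. The sketch below is illustrative only: tag_line and debug_print are hypothetical helper names, and the rank and size values are assumed to come from the MPI library in use (with mpi4py, for example, from the communicator's Get_rank() and Get_size() methods).

```python
import sys


def tag_line(rank, size, message):
    """Prefix a debug message with the process rank so it can be attributed."""
    return f"[rank {rank}/{size}] {message}"


def debug_print(rank, size, message, stream=sys.stderr):
    """Write a rank-tagged line and flush it immediately."""
    print(tag_line(rank, size, message), file=stream, flush=True)
```

Even with tags, lines from different processes can still arrive interleaved in any order, so this only identifies who printed a line, not when.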












