Sunday, March 7, 2010

What really happened on Mars?

I remember that when I was a kid, probably in 6th or 7th grade, our principal ma'am gave us an example in our daily morning assembly. Since I was very tall compared to my classmates, I was standing at the end of the line and was really struggling to hear her. She was talking about some billion-dollar project that had failed just because of an error of 1 and 0. She further told us that NASA had sent a spacecraft to Mars and that the robot was malfunctioning just because of some mistake in a 1 and a 0. At that time I wondered how a single swap of a 1 and a 0 could cause a whole project to crash. Although I didn't understand anything at that time, I was fascinated by the whole idea of sending a bot to Mars. I read each and every article I found in the newspaper about the Mars project, but most of them didn't satisfy my curiosity.
After a very long time, when I was in 12th standard and my physics teacher taught us digital electronics, I understood something of what my principal ma'am had been talking about: I knew that it had something to do with a Boolean condition. Now I am in the fourth year of my graduation at IIT Bombay, and it took me almost 12 years to understand what really happened on Mars.

So here we go-

The Mars Pathfinder mission was widely proclaimed as “flawless” in the early days after its July 4th, 1997 landing on the Martian surface. The spacecraft had to gather meteorological data and send it back to Earth. It had one processor and had to run many processes at a time. The engineers decided to run them in a manner called "preemptive priority scheduling of tasks". By tasks I mean processes like gathering data or communicating with Earth.
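The operating system was designed so that every task has a priority. At every scheduling point (for example a timer tick or an interrupt) the OS looks at all the tasks that are ready to run and executes the one with the highest priority. If a higher-priority task becomes ready while a lower-priority one is running, the lower-priority task is preempted.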
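
To make the scheduling rule concrete, here is a minimal sketch in Python. It is only an illustration, not the spacecraft's actual code: the function pick_next, the dictionary layout and the priority numbers are my own assumptions, and the task names are the Pathfinder tasks discussed in this post.

```python
def pick_next(tasks):
    """Return the name of the highest-priority ready task, or None."""
    ready = [t for t in tasks if t["ready"]]
    if not ready:
        return None
    return max(ready, key=lambda t: t["priority"])["name"]

# Hypothetical task table: a bigger number means a higher priority.
tasks = [
    {"name": "meteorological", "priority": 1, "ready": True},
    {"name": "communication", "priority": 2, "ready": False},
    {"name": "bus_management", "priority": 3, "ready": False},
]

first = pick_next(tasks)    # only the low-priority task is ready
tasks[2]["ready"] = True    # an interrupt makes the bus manager ready
second = pick_next(tasks)   # the scheduler now prefers the bus manager
print(first, "->", second)
```

The second call shows preemption: as soon as a higher-priority task becomes ready, the scheduler chooses it over the task that was running.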

Pathfinder had three main tasks: the low-priority "meteorological data gathering" task, which ran infrequently; the medium-priority "communication task"; and the high-priority "information bus management task", which ran very frequently. The information bus can be thought of as a common memory area used by all the tasks to exchange information, which is why the bus management task had to run very frequently. Access to the bus was synchronized with mutual exclusion locks called mutexes, which ensure that while the bus is locked by one task, other tasks can't access it.

How it works-

The meteorological task runs and collects data, and when it publishes that data it accesses the information bus through a mutex, which locks the bus. Now if an interrupt occurs and the bus management task needs to access the information bus, it waits until the meteorological task releases the mutex. Once the bus is released, the bus management task uses it.
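
The normal lock, wait and release sequence can be sketched with ordinary Python threads. This is just an analogy: Python threads have no priorities, and the names bus_mutex, info_bus and met_has_lock, as well as the "sample reading" string, are made up for the illustration.

```python
import threading

bus_mutex = threading.Lock()        # stands in for the information bus mutex
info_bus = []                       # stands in for the shared information bus
met_has_lock = threading.Event()    # lets the bus manager arrive while the bus is locked
log = []

def meteorological_task():
    with bus_mutex:                 # lock the bus
        met_has_lock.set()
        info_bus.append("sample reading")   # hypothetical data
        log.append("met: published")
    # leaving the with-block releases the mutex

def bus_management_task():
    met_has_lock.wait()             # start only once the bus is already locked
    with bus_mutex:                 # waits here until the bus is released
        log.append(f"bus: read {len(info_bus)} item(s)")

threads = [
    threading.Thread(target=bus_management_task),
    threading.Thread(target=meteorological_task),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)
```

The bus management thread can only enter its with-block after the meteorological thread has left its own, so it always sees the published data.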

The bug-

Suppose the bus is locked by the meteorological data task and the bus management task is waiting for the bus to be released. At this point, suppose an interrupt occurs that causes the "communication task" to be scheduled. It has medium priority, which is higher than that of the meteorological task, so the OS puts the "meteorological task" on hold (waiting) and starts running the slow "communication task". Now the communication task is running, the "meteorological task", which unfortunately still holds the lock on the information bus, is waiting, and the highest-priority "information bus management task" is still waiting to access the bus. If the bus management task doesn't get access to the bus for some time, the "watchdog timer" concludes that something is terribly wrong with the system and calls for a total system reset. This situation is known as priority inversion.
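
The sequence of events above can be replayed with a tiny tick-by-tick simulation. Everything numeric here is an assumption chosen for the illustration (priorities 1 to 3, the release ticks, the amounts of work and the watchdog limit of 5 ticks); it is not taken from the real flight software.

```python
WATCHDOG_LIMIT = 5  # hypothetical: ticks the bus manager may wait before a reset

def simulate():
    """Replay the inversion scenario and return a list of (tick, what ran)."""
    tasks = {
        "bus_mgmt": {"prio": 3, "release": 1, "work": 1, "needs_bus": True},
        "comm": {"prio": 2, "release": 2, "work": 10, "needs_bus": False},
        "met": {"prio": 1, "release": 0, "work": 3, "needs_bus": True},
    }
    holder = None  # name of the task that currently holds the bus mutex
    trace = []
    for tick in range(20):
        bus = tasks["bus_mgmt"]
        # Watchdog: the bus manager has waited too long without finishing.
        if bus["work"] > 0 and tick - bus["release"] >= WATCHDOG_LIMIT:
            trace.append((tick, "RESET"))
            return trace
        ready = [n for n, t in tasks.items()
                 if t["release"] <= tick and t["work"] > 0]
        # A task that needs the bus is blocked while another task holds it.
        runnable = [n for n in ready
                    if not (tasks[n]["needs_bus"] and holder not in (None, n))]
        if not runnable:
            trace.append((tick, "idle"))
            continue
        current = max(runnable, key=lambda n: tasks[n]["prio"])
        if tasks[current]["needs_bus"]:
            holder = current            # take (or keep) the bus mutex
        tasks[current]["work"] -= 1
        if tasks[current]["work"] == 0 and holder == current:
            holder = None               # finished: release the bus mutex
        trace.append((tick, current))
    return trace

for tick, name in simulate():
    print(tick, name)
```

In this run the meteorological task gets two ticks, then the communication task takes over the processor, and the bus manager never runs before the watchdog fires at tick 6.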

So where are the 1's and 0's-

Now the mutex object accepts a boolean parameter that indicates whether priority inheritance should be performed by the mutex. The basic idea of the priority inheritance protocol is that when a job blocks one or more high-priority jobs, it ignores its original priority assignment and executes its critical section at the highest priority level of all the jobs it blocks. After executing its critical section, the job returns to its original priority level. Before launching Pathfinder, the mutex was initialized with this parameter OFF. So here comes the 1 and 0 mistake: had it been initialized with this parameter ON, the meteorological task would have inherited the priority of the bus management task while it held the bus, so the communication task could not have preempted it. It would have finished its critical section quickly and released the information bus to the bus management task, which in turn would have prevented the watchdog timer from invoking a total system reset.
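
Here is a small sketch of what that boolean changes at the critical moment, when the meteorological task holds the bus and the communication task becomes ready. The functions effective_priority and pick and the priority numbers are hypothetical names and values for this illustration, not the real OS interface.

```python
def effective_priority(task, blocked_tasks, inherit):
    """Priority the scheduler uses for `task`.

    With inheritance ON, a task that blocks others runs at the highest
    priority among itself and the tasks it blocks.
    """
    if inherit and blocked_tasks:
        return max([task["priority"]] + [b["priority"] for b in blocked_tasks])
    return task["priority"]

def pick(inherit):
    """Who gets the processor while the meteorological task holds the bus?"""
    met = {"name": "meteorological", "priority": 1}
    comm = {"name": "communication", "priority": 2}
    bus = {"name": "bus_management", "priority": 3}
    # The meteorological task holds the mutex, so it blocks the bus manager.
    candidates = {
        met["name"]: effective_priority(met, [bus], inherit),
        comm["name"]: effective_priority(comm, [], inherit),
    }
    return max(candidates, key=candidates.get)

print(pick(inherit=False))  # communication
print(pick(inherit=True))   # meteorological
```

With the flag OFF the communication task wins and the bus stays locked; with the flag ON the meteorological task runs at the bus manager's priority, finishes its critical section and releases the bus.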

So was it debugged?

During the development stage of Pathfinder, the OS had a C language interpreter intended to allow developers to type in C expressions and functions to be executed on the fly during system debugging. Fortunately, Pathfinder was launched with this feature still enabled. A C program was uploaded to the spacecraft which, when interpreted, changed the value of the mutex initialization variable from FALSE to TRUE. There were no system resets after that.
