Monday, 2 January 2012

Launching and monitoring multiple parallel processes in a foolproof way

Sometimes we need to launch multiple processes in parallel and wait for all of them to complete.

For example, in Maven we may need to run multiple profiles (e.g. test profiles) in parallel and wait for them to complete.

The logic is as follows:
- Launch all the parallel processes
- Capture the PID of each parallel process (using captured_pid=$! )
- Loop until a pre-defined maximum monitoring time has elapsed; in each iteration
  - run something like ps -ef | grep to see whether each PID is still alive
- Continue until all the PIDs launched in parallel have completed.


semantic_no=1

for i in ratingsBAT pollsBAT mbBAT abuseBAT
do
  echo "Launcher: Launching BAT for semantic : $i started at `date` "
  mvn -Dmaven.test.apiHost.value="$1" -Dmaven.test.apiPort.value="$2" test -P "$i" &

  # $! holds the PID of the most recently launched background command
  pidlist[$semantic_no]=$!
  semanticlist[$semantic_no]=$i
  pidrunning[$semantic_no]=1

  semantic_no=$((semantic_no + 1))
done

for((i1=0;i1<30;i1++))
do
  sleep 60
  for((j=1;j<semantic_no;j++))
  do
    if [ "${pidrunning[$j]}" -eq 1 ]
    then
      ct=`ps -ef | grep ${pidlist[$j]} | grep -v 'grep' | wc -l`
      if [ "$ct" -eq 0 ]
      then
        # the check runs after the sleep, so (i1 + 1) minutes have elapsed
        seconds_for_exec=$(( (i1 + 1) * 60 ))
        echo "Launcher:  Process with PID : ${pidlist[$j]} completed execution in $seconds_for_exec seconds -- TIME of COMPLETION : `date`"
        pidrunning[$j]=0
      fi
    fi
  done

  # completed=1 means at least one process is still running
  completed=0
  for((j=1;j<semantic_no;j++))
  do
    if [ "${pidrunning[$j]}" -eq 1 ]
    then
      echo "Launcher:  PID : ${pidlist[$j]} of semantic ${semanticlist[$j]} is still running ::::"
      completed=1
    fi
  done

  if [ "$completed" -eq 0 ]
  then
    echo "Completed all the Tests.. TIME : `date`"
    exit
  fi
done

for((j=1;j<semantic_no;j++))
do
  if [ "${pidrunning[$j]}" -eq 1 ]
  then
    echo "Launcher:  PID : ${pidlist[$j]} of semantic ${semanticlist[$j]} did not complete at TIME `date`"
  fi
done

The above script monitors the parallel mvn test -P commands for up to 30 minutes (at 1-minute intervals), after which it terminates and lists the PIDs that are still running.
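
A side note on the liveness check itself: ps -ef | grep matches the PID number anywhere on the ps line, so searching for 123 also matches PID 1234 or a process whose parent is 123. A minimal sketch of a stricter check, assuming bash and using kill -0 (which sends no signal and only tests whether that exact PID exists), could look like this. The helper name is_alive is made up for illustration, and sleep stands in for the mvn command.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if a process with exactly this PID
# exists (and we are allowed to signal it). No signal is actually sent.
is_alive() {
  kill -0 "$1" 2>/dev/null
}

sleep 1 &            # stand-in for a long-running mvn command
pid=$!

if is_alive "$pid"; then
  echo "PID $pid is still running"
fi

wait "$pid"          # block until the stand-in finishes
if ! is_alive "$pid"; then
  echo "PID $pid has completed"
fi
```

This only makes the match exact. It still asks the operating system about a bare PID number, so it does not protect against that number being handed to a different process later.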

The problem with this approach is PID reuse. Say three mvn commands are running in parallel with PIDs 123, 234 and 345. Suppose PID 123 completes within a minute and the operating system re-allocates that PID to some other process.

If that other process is a long-lived daemon, the script will keep running for the full 30 minutes even if PIDs 234 and 345 complete within, say, 10 minutes.

To avoid this pitfall, I changed the logic as follows. Every parallel process launched creates a flat file (named after the PID of the wrapper process that runs mvn) and deletes it when mvn finishes; the monitoring loop checks whether that file still exists. This way we catch PIDs that complete within the monitoring interval (one minute), and the monitoring script is not fooled when a PID is re-allocated to some other process.

To do this, we write another script, ./mvnWrapper.sh, which takes its arguments from the main monitoring script:
#---- Contents of mvnWrapper.sh -----
# $$ is the PID of this wrapper, which is what the launcher sees as $!
currentPID=$$
echo "Creating flat file with $currentPID"
touch "$currentPID"
echo "mvn $* "
# "$@" passes every argument through to mvn unchanged
mvn "$@"
echo "Deleting flat file with $currentPID"
rm "$currentPID"

#---- End of mvnWrapper.sh -----
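
One weakness of a wrapper like this is that the flat file stays behind if the wrapper stops before reaching its rm line, and the monitor would then report that job as running until it times out. A possible hardening, sketched here under the assumption that bash is used, is to remove the file from an EXIT trap. The function name run_with_marker and the explicit marker-path argument are made up for this sketch (the wrapper above uses its own PID as the file name), and sleep stands in for mvn.

```shell
#!/bin/bash
# run_with_marker MARKER CMD [ARGS...]   (hypothetical helper)
# Runs in a subshell: creates MARKER, runs the command, and removes
# MARKER when the subshell exits, whether the command passed or failed.
run_with_marker() (
  marker=$1
  shift
  trap 'rm -f "$marker"' EXIT
  touch "$marker"
  "$@"
)

workdir=$(mktemp -d)                       # scratch directory for the demo
run_with_marker "$workdir/job1" sleep 2 &  # sleep stands in for mvn
sleep 1
[ -f "$workdir/job1" ] && echo "job1 is still running"
wait
[ -f "$workdir/job1" ] || echo "job1 finished, marker removed"
```

In a standalone script such as mvnWrapper.sh the same idea would be a single trap line placed before the touch, with the PID-named file as the thing to remove.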

# Main Monitoring script

profileToRun="profile1 profile2"
semantic_no=1

# MINUS_D_OPTIONS: any extra -D options, expected to be set before this point
for i in $profileToRun
do
  echo "Launcher: Launching BAT for semantic : $i started at `date` "
  ./mvnWrapper.sh -f pom.xml -Dmaven.test.apiHost.value="$1" -Dmaven.test.apiPort.value="$2" $MINUS_D_OPTIONS test -P "$i" &

  # $! is the PID of the wrapper, which is also the name of its flat file
  captured_pid=$!

  pidlist[$semantic_no]=$captured_pid
  echo "Captured PID : $captured_pid"
  semanticlist[$semantic_no]=$i
  pidrunning[$semantic_no]=1

  semantic_no=$((semantic_no + 1))
done

# This monitoring loop runs for up to 90 mins, checking all the PIDs in the array at 1-minute intervals

for((i1=0;i1<90;i1++))
do
  sleep 60
  echo "Checking PID status for $((i1 + 1)) time(s).."
  for((j=1;j<semantic_no;j++))
  do
    if [ "${pidrunning[$j]}" -eq 1 ]
    then
      #ct=`ps -ef | grep ${pidlist[$j]} |grep -v 'grep' | wc -l`
      fileName=${pidlist[$j]}
      echo "FileName checked is $fileName"
      # the wrapper deletes this file when mvn finishes
      if [ -f "$fileName" ]
      then
        ct=1
      else
        ct=0
      fi
      if [ "$ct" -eq 0 ]
      then
        # the check runs after the sleep, so (i1 + 1) minutes have elapsed
        seconds_for_exec=$(( (i1 + 1) * 60 ))
        echo "Launcher:  Process with PID : ${pidlist[$j]} completed execution in $seconds_for_exec seconds -- TIME of COMPLETION : `date`"
        pidrunning[$j]=0
      fi
    fi
  done

  # completed=1 means at least one process is still running
  completed=0
  for((j=1;j<semantic_no;j++))
  do
    if [ "${pidrunning[$j]}" -eq 1 ]
    then
      echo "Launcher:  PID : ${pidlist[$j]} of semantic ${semanticlist[$j]} is still running ::::"
      completed=1
    fi
  done

  if [ "$completed" -eq 0 ]
  then
    echo "Launcher: Completed all the Tests.. TIME : `date`"
    exit
  fi
done

for((j=1;j<semantic_no;j++))
do
  if [ "${pidrunning[$j]}" -eq 1 ]
  then
    echo "Launcher:  PID : ${pidlist[$j]} of semantic ${semanticlist[$j]} did not complete at TIME `date`"
  fi
done
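
For comparison, and not what the scripts above do: when the launcher and the jobs live in the same shell, the wait builtin is another way to avoid being fooled by PID reuse, because the shell keeps track of its own children and wait returns each child's exit status. The sketch below assumes bash; the helper name launch is made up for illustration and the sleep commands stand in for the mvn invocations.

```shell
#!/bin/bash
declare -a pids names

# launch NAME CMD [ARGS...]   (hypothetical helper)
# Starts the command in the background and records its PID and label.
launch() {
  local name=$1
  shift
  "$@" &
  pids+=("$!")
  names+=("$name")
}

launch profile1 sleep 1                   # stand-in for a passing run
launch profile2 sh -c 'sleep 1; exit 3'   # stand-in for a failing run

failed=0
for idx in "${!pids[@]}"; do
  wait "${pids[$idx]}"    # blocks until that child exits
  status=$?
  echo "Launcher: ${names[$idx]} (PID ${pids[$idx]}) exited with status $status"
  [ "$status" -eq 0 ] || failed=$((failed + 1))
done
echo "Launcher: $failed job(s) failed"
```

The trade-off is that wait blocks with no overall time limit, so the 90-minute cap and the once-a-minute status messages of the monitoring loop above are lost unless a timeout is added separately.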
 