Many-Bugs progress:

High level take-away:

Using Eric's version of GENPROG at the assembly level seems to be as effective as the original version of GENPROG. I tried one or two bugs from each of the programs targeted in the many-bugs paper for which a fix had been found. In all cases, Eric's version was also able to find a fix. However, for each of the bugs that GENPROG was not able to fix, Eric's version also failed to find a (reliable) fix.

Lessons learned:

Use kvm to make the VM reasonably fast.

In the early stages, it was excruciatingly slow to run a repair on the VM. Even the fastest of the test suites took several minutes to run, and some of the bugs' test suites took a couple of hours to run all of the tests. I was able to overcome this setback by using the qemu argument: -machine accel=kvm

Note: this will result in an error if the user is not in the kvm group.
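A small pre-flight check can catch the group problem before qemu fails. The sketch below assumes a Linux host where KVM is exposed as the device node /dev/kvm; the function names are made up for illustration, and only the -machine accel=kvm argument comes from the runs described here.

```shell
#!/bin/bash
# Sketch: decide whether KVM can be used before launching the VM.
# Assumption: KVM is exposed as a device node (normally /dev/kvm) and access
# is granted through that node's permissions (typically the kvm group).

kvm_usable() {
  # An optional argument overrides the device path (handy for testing).
  local dev=${1:-/dev/kvm}
  [[ -r $dev && -w $dev ]]
}

accel_args() {
  # Print the qemu acceleration argument, or only a warning on stderr
  # if the device cannot be opened by the current user.
  local dev=${1:-/dev/kvm}
  if kvm_usable "$dev"; then
    echo "-machine accel=kvm"
  else
    echo "warning: $(id -un) cannot open $dev; check kvm group membership" >&2
  fi
}
```

The output of accel_args can be spliced into the qemu command line, so a missing group membership shows up as a clear warning instead of a failed VM start.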

Enabling KVM made the test suites run orders of magnitude faster.

Run the repair from the host machine.

I originally tried to run repairs entirely within the VM (since this was how GENPROG did it) but this turned out to be a bad strategy for a few reasons.
- The VMs are Fedora installs, and there are known incompatibilities between SBCL and Fedora.
- The VMs are 32-bit and are therefore limited to 4GB of RAM. This turns out to be a severe limitation, since Eric's version of GENPROG is very memory intensive. I tried switching to CCL since it has a more aggressive garbage collector, but while this helped some, 4GB was simply not enough memory. There seems to be a memory leak somewhere in software-evolution, since the memory usage continually creeps up on a long run. A lot of time was spent on this problem, because the memory errors would only crash a run after it had already been going for quite a while.

I therefore decided to move the repair runs to the host machine. Since the bugs were from 32-bit builds with specific library version numbers etc., I still used the VMs to evaluate candidate mutations. For each bug I wrote a script on the host and on the guest OS to evaluate fitness. Once I had manually verified that everything worked correctly, I made several copies of the VM (each thread is associated with its own VM) and started the repair run. Since the machine (I used 'prime') had plenty of RAM and cores, the repair would not run out of resources (although after ~100,000 evals it got very sluggish).

I found that for bugs which GENPROG had found repairs for in the past, Eric's version also found a repair, and it found them pretty quickly.
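The one-VM-per-thread layout can be sketched as follows. The port numbering (8000 plus the thread number) matches the host-side fitness script; the image names, the memory size, and the qemu-system-i386 binary are assumptions. The command is printed rather than executed so that it can be inspected first.

```shell
#!/bin/bash
# Sketch: one VM copy per repair thread, each reachable over SSH on its own
# host port. BASE_IMAGE, the vm-N.img naming, the memory size, and the qemu
# binary name are hypothetical; only the 8000+thread port scheme comes from
# the notes.

BASE_IMAGE=${BASE_IMAGE:-manybugs.img}

vm_port() {
  # Host port that forwards to the SSH daemon of the VM for this thread.
  echo $((8000 + $1))
}

vm_command() {
  local thread=$1
  local image="vm-$thread.img"
  # Forward host port 8000+thread to the guest's SSH port (22).
  echo "qemu-system-i386 -machine accel=kvm -m 2048" \
       "-hda $image" \
       "-netdev user,id=net0,hostfwd=tcp::$(vm_port "$thread")-:22" \
       "-device e1000,netdev=net0 -display none"
}

# Usage (commented out so that sourcing this file starts nothing):
# for t in 0 1 2 3; do
#   cp "$BASE_IMAGE" "vm-$t.img"
#   $(vm_command "$t") &
# done
```

Giving each thread its own image copy keeps concurrent fitness evaluations from overwriting each other's files inside the guest.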

Setup Details

For each run, I selected a bug to repair. In the bug tarball there is a file called bugged-program.txt. This identifies the source file that contains the bug. To generate the assembly file, I would use the touch command to update the timestamp of the file and run make. By observing the output of make I found the compilation command used by the Makefile. I used the same command to manually compile the file with the addition of the --save-temps flag so that I could get an assembly file version. The VM fitness script used the test.sh script in the bug directory to determine the fitness score to return to the host OS. For example, if there were 31 positive test cases and 5 negative test cases, the evaluation phase of the script would look something like:

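cd "$TESTDIR"

# Start from zero; each passing test adds one to the fitness.
FIT=0
# Positive tests p1..p31.
for i in {1..31}; do
  "$TESTDIR/test.sh" "p$i" >/dev/null 2>&1 && FIT=$((FIT+1))
done
# Negative tests n1..n5.
for i in {1..5}; do
  "$TESTDIR/test.sh" "n$i" >/dev/null 2>&1 && FIT=$((FIT+1))
done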

The developer fix can be found in the fixed directory of the bug tarball. I verified that the fitness script would return the target fitness by copying the developer fix to the appropriate location and generating an assembly representation as described above. I verified that the buggy version passed all of the positive test cases and failed all of the negative test cases and that the developer fix version passed both the positive and the negative tests.
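The assembly-generation step (touch the file, run make, reuse the compile command with --save-temps) could be scripted roughly as below. This is a sketch under assumptions: the Makefile echoes its commands, the compiler is invoked as gcc or cc, and the script runs in the directory where make issues the compile command. The function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: recover the compile command for one source file from make's
# output, then rerun it with --save-temps to keep the intermediate .s file.
# Assumptions: make echoes its commands and the compiler is gcc or cc.

extract_compile_cmd() {
  # stdin: make output; $1: source file name.
  # Prints the first compiler line that mentions the source file.
  local src=$1
  grep -E '(^|[[:space:]/])(gcc|cc)[[:space:]]' | grep -F -- "$src" | head -n 1
}

regen_asm() {
  local src=$1 cmd
  touch "$src"    # update the timestamp so make recompiles this file
  cmd=$(make 2>&1 | extract_compile_cmd "$(basename "$src")")
  if [[ -z $cmd ]]; then
    echo "no compile command found for $src" >&2
    return 1
  fi
  eval "$cmd --save-temps"    # same command, plus the flag that keeps the .s
}
```

Builds that wrap the compiler (for example through libtool) may print the command in a different form, in which case the pattern in extract_compile_cmd would need adjusting.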

Once this script was complete, I made sure that the host side fitness script communicated correctly and returned the correct fitness scores for both the buggy and developer-fix versions of the assembly. The host side script looks something like:

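#!/bin/bash

ASM=$1      # path to the candidate assembly file
THREAD=$2   # repair thread number; selects which VM to use

# No thread argument, or the "main" thread, maps to VM 0.
if [[ -z $THREAD || $THREAD == "main" ]]; then
  THREAD=0
fi

# Each VM's SSH port is forwarded to host port 8000 + thread number.
PORT=$((8000 + THREAD))

echo "GOING TO MACHINE ON PORT # $PORT"

# Kill any fitness evaluation left over from a previous candidate.
ssh -p "$PORT" root@localhost "killall fit.sh"

# Copy the candidate to the same path inside the guest.
scp -P "$PORT" "$ASM" root@localhost:"$ASM"

# Run the guest-side script; its exit status is the fitness score.
ssh -p "$PORT" root@localhost "/root/fit.sh $ASM"

FIT=$?

ssh -p "$PORT" root@localhost "rm $ASM"

exit $FIT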
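The check on the buggy and developer-fix versions can itself be scripted. In this sketch HOST_FIT, buggy.s, and fixed.s are hypothetical names, and 31 and 36 are the example counts used above (31 positive and 5 negative test cases).

```shell
#!/bin/bash
# Sketch: sanity-check the fitness pipeline before starting a long repair run.
# HOST_FIT is a hypothetical path to the host-side fitness script.

HOST_FIT=${HOST_FIT:-./host-fit.sh}

expect_fitness() {
  # $1: assembly file, $2: expected fitness (the script's exit status).
  local asm=$1 expected=$2 actual=0
  "$HOST_FIT" "$asm" 0 >/dev/null 2>&1 || actual=$?
  if [[ $actual -ne $expected ]]; then
    echo "$asm: expected fitness $expected, got $actual" >&2
    return 1
  fi
}

# Usage:
# expect_fitness buggy.s 31 && expect_fitness fixed.s 36
```

Because the fitness travels through the exit status, this scheme only works while the total number of test cases stays below 255: exit statuses are limited to 0..255, and ssh itself exits with 255 when the connection fails.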

Some bugs (e.g. gzip-bug-3eb6091d69a-884ef6d16c6) behaved inconsistently: a candidate repair would sometimes reach the target fitness but then fail manual verification. In these cases I changed the fitness script so that it would only return the target fitness if the candidate repair passed all of the test cases twice in a row.
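A minimal sketch of the twice-in-a-row rule follows. It assumes a run_suite function that prints one fitness value per run; TESTDIR, NPOS, NNEG, and TARGET are illustrative names using the example counts from above. The notes do not say what was returned when the second run falls short, so reporting the second run's score is an assumption.

```shell
#!/bin/bash
# Sketch: report the target fitness only if two consecutive runs of the
# whole suite both reach it.
# Assumption: if the second run falls short, its (lower) score is reported.

NPOS=${NPOS:-31}
NNEG=${NNEG:-5}
TARGET=${TARGET:-$((NPOS + NNEG))}

run_suite() {
  # Run every positive and negative test once; print the number that pass.
  local fit=0 i
  for i in $(seq 1 "$NPOS"); do
    "$TESTDIR/test.sh" "p$i" >/dev/null 2>&1 && fit=$((fit + 1))
  done
  for i in $(seq 1 "$NNEG"); do
    "$TESTDIR/test.sh" "n$i" >/dev/null 2>&1 && fit=$((fit + 1))
  done
  echo "$fit"
}

stable_fitness() {
  local first second
  first=$(run_suite)
  if [[ $first -lt $TARGET ]]; then
    echo "$first"      # not a full pass; no need for a second run
    return
  fi
  second=$(run_suite)  # full pass once; confirm it
  if [[ $second -ge $TARGET ]]; then
    echo "$TARGET"
  else
    echo "$second"
  fi
}
```

Only candidates that already reach the target pay for the second run, so the extra cost is limited to the rare full passes.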

The failure to find repairs for the bugs that GENPROG had also been unable to repair led to a closer examination of those bugs.

For most of these bugs, the reason that no repairs were found is obvious. The developer fix of the bug often consisted of the development of entirely new functionality. As a result, I shifted my focus to the subset of bugs which seemed (based on Zak & Clair's analysis) to be most amenable to repair. Specifically, I focused on the bugs which were identified as not fixable by GENPROG because of the limitation (which Eric's version does not have) that GENPROG could not mutate previous mutations.

Unfortunately, most of these bugs had other issues that made repair unlikely.

The following table lists the subset of bugs under consideration. The column labeled 'Other Problem' gives the reason a repair seemed unlikely, except for the bugs marked 'Promising'.

BUG	Other Problem
libtiff-bug-0860361d-1ba75257	Promising
libtiff-bug-a2f7abf-ce76d31	Missing fn/var
php-bug-307146-307147	Missing fn/var
php-bug-307563-307571	Missing fn/var
php-bug-308020-308035	Missing fn/var
php-bug-308046-308051	Large Diff
php-bug-308734-308761	Promising
php-bug-309111-309159	Promising
php-bug-309453-309456	Missing fn/var
python-bug-69223-69224	Promising
python-bug-69368-69372	Missing fn/var
python-bug-69831-69833	Missing fn/var
python-bug-70019-70023	Missing fn/var
python-bug-70098-70101	Missing fn/var

Each of the ones marked 'Promising' was run (or re-run if it had been run early in this adventure) but no repairs were found.

For the ones marked 'Promising', the developer fix involved a function call or the insertion of a conditional statement. Clearly a repair at the assembly level would require multiple edits in just the right order to make this happen. Since the fitness function is based on the number of test cases that the candidate repair passes, it does not encourage the kind of intermediate edits needed to build up a function call or a conditional statement. Perhaps if the fitness function were modified in some way that encouraged these kinds of edits, a repair could be generated.

Shortly before I left NM, I started re-running some of these at the CIL level, since a function call or a conditional statement could be a single line fix there. I tested my CIL implementation of the repair algorithm on the GCD bug and it fixed it in a matter of seconds. For reasons that I have not figured out yet, the repair runs crash after several hundred fitness evaluations.
