Using Eric's version of GENPROG at the assembly level seems to be as effective as the original version of GENPROG. I tried one or two bugs from each of the programs targeted in the many-bugs paper for which a fix was found. In all cases, Eric's version was able to find a fix. However, for each of the bugs that GENPROG was not able to fix, Eric's version also failed to find a (reliable) fix.
In the beginning stages, it was excruciatingly slow to run a repair
on the VM. Even the fastest of the test suites took several
minutes to run, and some of the bugs' test suites took a couple of
hours to run all of the tests. I was able to overcome this setback
by using the qemu argument -machine accel=kvm. Note: this will
result in an error if the user is not in the kvm group.
This change made the test suites run orders of magnitude faster.
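A minimal sketch of a pre-flight check for the KVM requirement noted above. The function names `in_kvm_group` and `pick_accel` are hypothetical, and falling back to `accel=tcg` (QEMU's software emulation) is an assumption about what one would want when KVM is not usable; the original runs simply used `-machine accel=kvm`.

```shell
# Succeeds if the space-separated group list in $1 contains "kvm".
in_kvm_group() {
    case " $1 " in
        *" kvm "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the accelerator setting to pass to -machine.
# $1 = group list (e.g. the output of `id -nG`)
# $2 = path to the KVM device node (defaults to /dev/kvm)
pick_accel() {
    if in_kvm_group "$1" && [ -e "${2:-/dev/kvm}" ]; then
        echo "accel=kvm"
    else
        # Software emulation: works without KVM but is far slower.
        echo "accel=tcg"
    fi
}

# Example use (not executed here):
#   qemu-system-i386 -machine "$(pick_accel "$(id -nG)")" ...
```

Checking up front avoids discovering the missing group membership only when the guest fails to start.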
I originally tried to run repairs entirely within the VM (since this was how GENPROG did it) but this turned out to be a bad strategy for a few reasons:
The VMs are Fedora installs and there are known incompatibilities between SBCL and Fedora.
The VMs are 32-bit and as such they are limited to 4 GB of RAM. This turns out to be a severe limitation since Eric's version of GENPROG is very memory intensive. I tried switching to CCL since it has a more aggressive garbage collector, but while this helped some, 4 GB was simply not enough memory. There seems to be a memory leak somewhere in software-evolution since the memory usage continually creeps up on a long run. A lot of time was spent dealing with this problem because the memory errors would only crash a run after it had already been going for quite a while.
I decided to move the repair runs to the host machine. Since the bugs were from 32-bit builds with specific library version numbers etc., I still used the VMs to evaluate candidate mutations. For each bug I wrote a script on the host and the guest OS to evaluate fitness. Once I manually verified that everything worked correctly I made several copies of the VM (each thread is associated with its own VM) and started the repair run. Since the machine (I used 'prime') had a ton of RAM and cores, the repair would not run out of resources (although after ~100,000 evals it got very sluggish).
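The one-VM-per-thread setup could be planned along these lines. This is a sketch only: the image naming scheme, the qcow2 format, the 2048 MB guest memory, and the use of QEMU user-mode networking with `hostfwd` to expose each guest's sshd are all assumptions. The only details taken from these notes are the `-machine accel=kvm` argument and the port convention of 8000 plus the thread number.

```shell
BASE_PORT=8000

# Host port that forwards to the guest's sshd for thread $1.
vm_port() { echo $((BASE_PORT + $1)); }

# Name of the per-thread copy of image $1 for thread $2
# (bug.qcow2, 2 -> bug-thread2.qcow2). Hypothetical naming scheme.
vm_image() { echo "${1%.*}-thread$2.${1##*.}"; }

# Print (do not run) the qemu command for image $1 and thread $2.
vm_launch_cmd() {
    echo "qemu-system-i386 -machine accel=kvm -m 2048" \
         "-drive file=$1,format=qcow2" \
         "-netdev user,id=net0,hostfwd=tcp::$(vm_port "$2")-:22" \
         "-device e1000,netdev=net0 -display none -daemonize"
}

# Dry run: print the copy and launch commands for $2 threads of base image $1.
plan_vms() {
    t=0
    while [ "$t" -lt "$2" ]; do
        img=$(vm_image "$1" "$t")
        echo "cp $1 $img"
        vm_launch_cmd "$img" "$t"
        t=$((t + 1))
    done
}
```

Running `plan_vms bug.qcow2 4` prints the commands for four guests on ports 8000 through 8003, which can be reviewed before piping them to a shell.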
I found that for bugs which GENPROG had found repairs for in the past, Eric's version also found a repair, and it found them pretty quickly.
For each run, I selected a bug to repair. In the bug tarball there
is a file called bugged-program.txt. This identifies the source
file that contains the bug. To generate the assembly file, I would
use the touch command to update the timestamp of the file and run
make. By observing the output of make I found the compilation
command used by the Makefile. I used the same command to manually
compile the file with the addition of the --save-temps flag so
that I could get an assembly file version. The VM fitness script
used the test.sh script in the bug directory to determine the
fitness score to return to the host OS. For example, if there were
31 positive test cases and 5 negative test cases, the evaluation
phase of the script would look something like:
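cd "$TESTDIR"
FIT=0
# Positive tests: behavior that must be preserved.
for i in {1..31}; do
    "$TESTDIR/test.sh" "p$i" >/dev/null 2>&1 && FIT=$((FIT + 1))
done
# Negative tests: these fail on the buggy version and pass once it is repaired.
for i in {1..5}; do
    "$TESTDIR/test.sh" "n$i" >/dev/null 2>&1 && FIT=$((FIT + 1))
done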
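The assembly-generation step described above (touch the source file, run make, reuse the compile command with --save-temps) can be partly scripted. The following is a rough sketch, not the procedure actually used: the helper names are hypothetical, and taking the first make output line that mentions the source file's name is a heuristic that may pick the wrong line in builds that wrap the compiler.

```shell
# Print the first line of saved make output ($1) that mentions
# the base name of the source file ($2).
find_compile_cmd() {
    grep -F -- "$(basename "$2")" "$1" | head -n 1
}

# Append --save-temps so the compiler keeps its intermediate files,
# including the assembly (.s) file.
with_save_temps() {
    echo "$1 --save-temps"
}

# Typical use inside the build tree, with SRC set to the file named
# in bugged-program.txt (not executed here):
#   touch "$SRC"
#   make 2>&1 | tee make.log
#   eval "$(with_save_temps "$(find_compile_cmd make.log "$SRC")")"
```

It is worth inspecting the recovered command by eye before running it, since the compile line may depend on the directory make was in at the time.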
The developer fix can be found in the fixed directory of the bug
tarball. I verified that the fitness script would return the target
fitness by copying the developer fix to the appropriate location and
generating an assembly representation as described above. I verified
that the buggy version passed all of the positive test cases and
failed all of the negative test cases and that the developer fix
version passed both the positive and the negative tests.
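For the running example of 31 positive and 5 negative tests, this check amounts to expecting a fitness of 31 from the buggy version and 31 + 5 = 36 from the developer fix. A small sketch of that arithmetic, with hypothetical helper names:

```shell
# Expected fitness of the buggy version: all positive tests pass,
# all negative tests fail. $1 = positive count, $2 = negative count.
expected_buggy() { echo "$1"; }

# Expected (target) fitness of the developer fix: every test passes.
expected_fixed() { echo $(($1 + $2)); }

# Compare a measured fitness ($2) with the expected one ($3) for a
# labelled version ($1); returns non-zero on a mismatch.
check_fitness() {
    if [ "$2" -eq "$3" ]; then
        echo "$1: ok ($2)"
    else
        echo "$1: expected $3, got $2"
        return 1
    fi
}
```

For instance, `check_fitness fixed "$measured" "$(expected_fixed 31 5)"` fails loudly if the developer fix scores anything other than 36.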
Once this script was complete, I made sure that the host-side fitness script communicated correctly with the guest and returned the correct fitness scores for both the buggy and developer-fix versions of the assembly. The host-side script looks something like:
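#!/bin/bash
# Copies a candidate assembly file to the guest VM for this thread,
# runs the guest fitness script, and exits with the fitness score.
ASM=$1
THREAD=$2
# Use VM 0 when no thread is given or when called from the main thread.
if [[ -z $THREAD || $THREAD == "main" ]]; then
    THREAD=0
fi
# Each VM is reached over ssh on its own port.
PORT=$((8000 + THREAD))
echo "GOING TO MACHINE ON PORT # $PORT"
# Kill any fitness run left over from a previous evaluation.
ssh -p "$PORT" root@localhost "killall fit.sh"
scp -P "$PORT" "$ASM" "root@localhost:$ASM"
ssh -p "$PORT" root@localhost "/root/fit.sh $ASM"
# The guest script reports the fitness score as its exit status.
FIT=$?
ssh -p "$PORT" root@localhost "rm $ASM"
exit $FIT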
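One caveat with passing fitness through the exit status: exit statuses are 8 bits wide, so a score above 255 would wrap, and ssh itself exits with 255 when the connection fails. Neither is a problem for a 36-test suite, but for larger suites a variant that prints the score and parses it on the host may be safer. The sketch below illustrates both points; the helper names are hypothetical, and it assumes the guest script prints the score as the last all-digit line of its output.

```shell
# Show what the parent sees when a child exits with status $1.
wrapped_status() {
    status=0
    ( exit "$1" ) || status=$?
    echo "$status"
}

# Extract the last line of $1 that consists only of digits,
# ignoring any other output the guest script produced.
last_int() {
    printf '%s\n' "$1" | grep -E '^[0-9]+$' | tail -n 1
}

# Example use on the host (not executed here):
#   OUT=$(ssh -p "$PORT" root@localhost "/root/fit.sh $ASM")
#   FIT=$(last_int "$OUT")
```

With this scheme an empty `FIT` signals that the evaluation itself failed, which is otherwise indistinguishable from a low score.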
Some bugs (e.g. gzip-bug-3eb6091d69a-884ef6d16c6) presented
interesting behavior. Bugs like this would sometimes reach target
fitness but would fail manual verification. I changed the fitness
script in these cases so that the fitness script would only return
the target fitness if the candidate repair passed all of the test
cases twice in a row.
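The twice-in-a-row rule can be sketched as a wrapper around whatever command runs the test suite and prints a score. This is an illustration rather than the script that was actually used: `stable_fitness` is a hypothetical name, and reporting the lower of the two scores when the runs disagree is an assumption.

```shell
# Usage: stable_fitness TARGET CMD [ARGS...]
# CMD must print an integer fitness score on stdout.
# The target is reported only if two consecutive runs both reach it.
stable_fitness() {
    target=$1
    shift
    first=$("$@")
    # Below target: no need for a confirmation run.
    if [ "$first" -lt "$target" ]; then
        echo "$first"
        return 0
    fi
    # Target reached once: run again and report the lower score.
    second=$("$@")
    if [ "$second" -lt "$first" ]; then
        echo "$second"
    else
        echo "$first"
    fi
}
```

The confirmation run only happens for candidates that already look like repairs, so the extra cost is paid rarely.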
The failure to find repairs for the bugs that GENPROG had not been able to repair in the past led to a closer examination of those bugs.
For most of these bugs, the reason that no repairs were found is obvious. The developer fix of the bug often consisted of the development of entirely new functionality. As a result, I shifted my focus to the subset of bugs which seemed (based on Zak & Clair's analysis) to be most amenable to repair. Specifically, I focused on the bugs which were identified as not fixable by GENPROG because of the limitation (which Eric's version does not have) that GENPROG could not mutate previous mutations.
Unfortunately, most of these bugs had other issues that made repair unlikely.
The following table lists the subset of bugs under consideration. The column labeled 'Other Problem' indicates the reason a repair seemed unlikely, with the exception of the ones marked 'Promising'.
BUG | Other Problem |
---|---|
libtiff-bug-0860361d-1ba75257 | Promising |
libtiff-bug-a2f7abf-ce76d31 | Missing fn/var |
php-bug-307146-307147 | Missing fn/var |
php-bug-307563-307571 | Missing fn/var |
php-bug-308020-308035 | Missing fn/var |
php-bug-308046-308051 | Large Diff |
php-bug-308734-308761 | Promising |
php-bug-309111-309159 | Promising |
php-bug-309453-309456 | Missing fn/var |
python-bug-69223-69224 | Promising |
python-bug-69368-69372 | Missing fn/var |
python-bug-69831-69833 | Missing fn/var |
python-bug-70019-70023 | Missing fn/var |
python-bug-70098-70101 | Missing fn/var |
Each of the ones marked 'Promising' was run (or re-run if it had been run early in this adventure) but no repairs were found.
For the ones marked 'Promising', the developer fix involved a function call or the insertion of a conditional statement. Clearly a repair at the assembly level would require multiple edits in just the right order to make this happen. Since the fitness function is based on the number of test cases that the candidate repair passes, it does not encourage the kind of intermediate edits needed to build up a function call or a conditional statement. Perhaps if the fitness function were modified in some way that encouraged these kinds of edits, a repair could be generated.
Shortly before I left NM, I started re-running some of these at the CIL level since a function call or a conditional statement could be a single line fix. I tested my CIL implementation of the repair algorithm on the GCD bug and it fixes it in a matter of seconds. For reasons that I have not figured out yet, the repair runs crash after several hundred fitness evaluations.