Arlen Cox

Ramblings on my hobbies

Problems with Papers


Writing papers for peer-reviewed journals is not a trivial task. I’ve written three of them now and none of them have yet been accepted, though hopefully the third and most recent one will be. The difficulty with preparing these papers is twofold:

  1. Implementing the idea
  2. Evaluating the idea

The Implementation

In theory, implementing the idea is the easy part: it’s just engineering. I tell myself this frequently, I suppose as reassurance that I can do it, because I can do engineering. In reality, though, it is something beyond engineering.
Implementing a research prototype is a delicate balance between polishing a turd and making something genuinely well engineered. I made two prototypes for my most recent research paper. The first, which took very little time to build, actually worked fairly well, but it had numerous extensibility problems. When I wanted to add some new features (such as new filter types), I realized that I’d done everything wrong: the intermediate forms I had chosen did not support the kinds of abstractions I required.
So I reengineered it. This process took almost two months, which is completely ridiculous. What I had created in my first prototype was almost magical in the way it worked. My reengineered system had far more bugs than the original (see Things You Should Never Do) and, for now, far fewer features.
Along the way, I had to add numerous features that I didn’t need before, just so I could debug the monstrosity I had created. I use SMT solvers extensively in my work. Most SMT solvers provide both a text-based interface and a software API. Unfortunately, I know of no software API that writes a useful text-based representation for debugging. As a result, I moved my code from the software API to the text-based interface, which uses SMTLIB2. This introduced a whole host of new problems. For one, I now had to parse the output from the SMT solver rather than simply request data from the API. Second, I had to correctly handle structure sharing, which occurs frequently in the systems I was analyzing. Both of these problems were non-issues with the API.
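To illustrate the structure-sharing issue: in SMTLIB2 text output, shared subterms are typically expressed with `let`-bindings, which an API hands you for free as a shared graph but which a text consumer must parse and resolve itself. This is not the author's code, just a minimal sketch of what that entails: a tiny s-expression parser plus a `let`-expander.

```python
# Minimal sketch (not the author's tool): parse SMTLIB2-style s-expressions
# and expand let-bindings, which SMT solvers use for structure sharing.

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return expr
    return tok

def expand_lets(expr, env=None):
    """Replace let-bound names with their definitions (undoing sharing)."""
    env = env or {}
    if isinstance(expr, str):
        return env.get(expr, expr)
    if expr and expr[0] == "let":
        bindings = {name: expand_lets(val, env) for name, val in expr[1]}
        return expand_lets(expr[2], {**env, **bindings})
    return [expand_lets(e, env) for e in expr]

term = parse(tokenize("(let ((a (f x))) (g a a))"))
print(expand_lets(term))  # the shared subterm (f x) appears twice
```

Note that naive expansion can blow up exponentially on deeply shared terms, which is presumably why handling this correctly (rather than just expanding) matters in practice.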
The upshot is that after all of this, I was actually able to debug my problem. As an added bonus, it turns out you get a different solver if you use the text-based interface instead of the API, and currently the text-based solver is many times faster.
This started the feature-creep process. Along the way I gained a ton of extra features, all of which had bugs and thus needed debugging. This took perhaps another month of my time and got me no closer to evaluating my idea. Why did I do this? Well, the features were useful: they helped me validate my work, and I was now able to sanity-check my program. That turned out to be good, because there were some subtle problems, and the problem got refined along the way. Perhaps this wasn’t just a huge waste of time.
The thesis of this section is that it’s really difficult to predict what will get you answers fastest. Answers are useless if they are wrong, so you have to be confident in them. I clearly added more features than were absolutely necessary, but that is what I get for attempting to make a useful tool.

The Evaluation

Now that I had an implementation, it was time to evaluate it. We had already made several mistakes at this point. The first was not knowing what we were looking for. We knew that we had something cool on our hands, but we didn’t know how best to convince others it was a cool widget. Had we established this up front, the implementation might have gone more smoothly. Who knows.

The bigger problem was that we had decided to perform a comparison of techniques. Comparison papers are a ton of work; I’ve done one before. The amount of data that needed to be collected was prohibitively large given the one week we had left before the deadline. I didn’t know this, of course, so I queued up a ton of benchmarks and let them run over the weekend.

The bugs started rolling in when I checked on the status over the weekend. First, I found out that problems that timed out weren’t cleaning up after themselves. The SMT solver, launched as an executable underneath my program and consuming all of the time and memory, wasn’t being killed when the program timed out. Some processes had run for over 10,000 seconds under a 600-second timeout.

Killing programs is hard

It turns out that killing these programs is really hard. I tried Z3’s built-in timeout mechanism; it didn’t work reliably. I tried ulimit; it also didn’t work reliably. I tried a script called timeout3, and it didn’t work either. I even modified timeout3 to kill process groups instead of just processes, and still could not successfully kill both my program and Z3 underneath it.

What’s curious is that hitting Ctrl-C in bash works: it kills both my program and all child processes. Unfortunately, it seems I cannot mimic this behavior programmatically. After a lot of hunting I found a script that appears to work: timeout. There’s a blog post about the development of the script here. They tried many techniques beyond what I tried, and those failed as well.
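For what it's worth, the usual POSIX recipe for this, running the child in its own session so the entire tree can be signalled at once, can be sketched in a few lines of Python. This is an assumption about one workable approach, not what the timeout script or my original harness actually did:

```python
# POSIX-only sketch: run a benchmark in a fresh process group so that the
# whole tree (the benchmark program plus any solver it spawns) can be
# killed together when the timeout expires.
import os
import signal
import subprocess

def run_with_timeout(cmd, timeout_s):
    # start_new_session=True gives the child its own process group, so
    # killpg reaches grandchildren (e.g. an SMT solver executable) too.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()  # reap the child
        return None  # None signals a timeout to the caller

# A shell that spawns a long-running child gets killed after 1 second.
print(run_with_timeout(["sh", "-c", "sleep 60"], 1))
```

The catch, and plausibly what bit the scripts above, is that a child which calls setsid itself escapes the process group, at which point group-based killing no longer reaches it.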

Timeout is not a great script, though. It enumerates the process tree by parsing the output of ps, and it determines the time and memory used by a tree of processes by parsing the /proc/pid/stat entry for each pid in the tree. This involves launching one program and potentially opening many, many files on each poll. In practice this seems to consume 2-3% of a CPU.
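The stat-parsing step is fiddly in its own right. Here is my reconstruction of what it involves (not the script's actual code), using the field layout documented in proc(5); the one trap is that the command name is parenthesized and may contain spaces, so you have to split after the last ")":

```python
# Sketch of parsing a Linux /proc/<pid>/stat line (field numbers per
# proc(5), 1-based): utime is field 14, stime field 15, rss field 24.
# The tick rate and page size below are typical defaults, not universal.

def parse_stat(line, ticks_per_sec=100, page_size=4096):
    head, _, rest = line.rpartition(")")  # comm may contain spaces/parens
    pid = int(head.split("(")[0])
    fields = rest.split()  # fields[0] is field 3 (state)
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15
    rss_pages = int(fields[21])                      # field 24
    return {
        "pid": pid,
        "cpu_seconds": (utime + stime) / ticks_per_sec,
        "rss_bytes": rss_pages * page_size,
    }
```

Summing these values over every pid in the tree, once per poll, is exactly the "launch one program and open many files" cost described above.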

Managing benchmarks

What happens when you screw up the benchmarks? Being the human that I am, I of course didn’t get the program completely right, which meant that many results were invalid. When I discovered that these results were incorrect, I needed to rerun those benchmarks. The problem was that the results from the incorrect benchmarks were intertwined with the results from the correct benchmarks.

What I wanted to do was: 1) throw away the results from the bad benchmarks; 2) prevent any pending bad benchmarks from running in the future; and 3) let the correct benchmarks keep running. Then I wanted to fix the program and, after it was fixed, 4) queue up the previously bad benchmarks to run with the fixed version of the program.

Some folks use make to manage their benchmarks. This works for some problems. If each benchmark writes its output to a specific file, those files can be deleted, solving goal 1. The makefile has to be modified or regenerated to leave out the broken benchmarks, solving goal 2. Goal 3 is where the make approach breaks down: to stop a single benchmark in make, you have to kill all benchmarks. When benchmarks may take a long time to run, killing them all is a waste of time.

Needless to say, I didn’t have a good solution to this problem before running the benchmarks, so I attempted to develop an ad-hoc solution in Python to run the benchmarks in parallel. This took time and yet more debugging.
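The core of such a runner can be small. This is an illustrative sketch of the four goals above (the names and structure are mine, not my actual ad-hoc script): parallel submission, discarding and cancelling bad benchmarks without disturbing good ones, and requeueing after a fix.

```python
# Illustrative benchmark queue (not the author's actual harness):
# goal 1/2 = invalidate(), goal 3 = untouched futures keep running,
# goal 4 = resubmit the benchmark under the fixed program.
from concurrent.futures import ThreadPoolExecutor

class BenchmarkQueue:
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.futures = {}  # benchmark name -> Future

    def submit(self, name, fn, *args):
        self.futures[name] = self.pool.submit(fn, *args)

    def invalidate(self, bad_names):
        """Drop results of bad benchmarks and cancel any that have not
        started; running/finished good benchmarks are untouched."""
        for name in bad_names:
            fut = self.futures.pop(name, None)
            if fut is not None:
                fut.cancel()  # no-op if already running or done

    def results(self):
        return {name: fut.result() for name, fut in self.futures.items()}

q = BenchmarkQueue(workers=2)
q.submit("good", lambda: "sat")
q.submit("bad", lambda: "garbage")
q.invalidate(["bad"])           # goals 1 and 2
q.submit("bad", lambda: "sat")  # goal 4: requeue with the fixed program
print(q.results())
```

A real version would also persist results to disk keyed by benchmark name and program version, so that invalidation survives a restart, which the in-memory dictionary here does not.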

I will not start another project without having a system that allows me to automate running buggy benchmarks.  I will be documenting the development process of this system in future posts.


Written by arlencox

October 24, 2011 at 9:03 pm

Posted in grad school
