A framework for applying deep learning to software testing

Hi, everyone,

I must say I've enjoyed the online course and these forums very much. Almost everything I know about modern AI and its applications, I've learned here. I never expected to get drawn in this deep when I first watched the lecture videos.

But the reason I wanted to study deep learning was to try to improve the software development and testing process. It always seemed to me that this area would lend itself well to AI involvement, though of course that remains to be proven empirically. So I decided to create a framework to facilitate research into it. It took a lot of hard work, but it is finally good enough to allow experimentation with real-life software: it can parse C++ code, generate randomized tests for it, and feed the test results directly to deep-learning models. Using it, I managed to teach a neural network to distinguish between failing and passing unit tests. Please take a look at the paper I wrote describing it:

Everyone I spoke to seemed impressed by this and excited about what it could eventually accomplish. But I'd really appreciate feedback from you guys; you're the experts I trust. Please let me know what you think!
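
To give a concrete sense of the last stage of the pipeline, here is a minimal sketch of the kind of model the run logs could feed into. The encoding (a log as a sequence of (location, value) pairs) matches what I described above, but the architecture, sizes, and names below are illustrative assumptions, not the exact ones from the paper:

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- the real framework's encoding and
# architecture may differ. A run log is modeled as a sequence of
# (location_id, value) pairs; the label is pass (1) or fail (0).

NUM_LOCATIONS = 512  # assumed size of the instrumented-location vocabulary

class RunLogClassifier(nn.Module):
    def __init__(self, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.loc_embed = nn.Embedding(NUM_LOCATIONS, embed_dim)
        # +1 input feature for the (normalized) value logged at each location
        self.rnn = nn.LSTM(embed_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, locations, values):
        # locations: (batch, seq_len) int64; values: (batch, seq_len) float32
        x = torch.cat([self.loc_embed(locations), values.unsqueeze(-1)], dim=-1)
        _, (h, _) = self.rnn(x)
        return self.head(h[-1]).squeeze(-1)  # logit: > 0 predicts "pass"

# Toy usage: one log of 5 steps, values already normalized to [0, 1].
model = RunLogClassifier()
locs = torch.randint(0, NUM_LOCATIONS, (1, 5))
vals = torch.rand(1, 5)
logit = model(locs, vals)
loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones(1))
```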

Interesting topic! I'm thinking about automating UI regression testing using deep learning, so we have similar interests. I'm not an expert, but I'll be glad to have a look at your approach.

So I spent some time analyzing how exactly the model manages to predict test success/failure from the run logs, and I've come to some interesting conclusions. To wit: the model mostly ignores the generated random values and bases its predictions predominantly on the logged code locations. It simply notices where the program goes and derives its verdict from that.
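
For anyone curious how one can probe this, a simple permutation test is enough: scramble one input group at a time on held-out logs and see which group actually hurts accuracy. This sketch reuses the hypothetical RunLogClassifier from my first post; it is a simplification of the analysis, not the exact procedure:

```python
import torch

def permutation_importance(model, locations, values, labels):
    """Compare accuracy when shuffling values vs. shuffling locations.

    A feature group the model relies on should hurt accuracy badly
    when scrambled across examples; an ignored group barely matters.
    """
    def accuracy(locs, vals):
        model.eval()
        with torch.no_grad():
            preds = (model(locs, vals) > 0).float()
        return (preds == labels).float().mean().item()

    base = accuracy(locations, values)
    perm = torch.randperm(values.shape[0])          # shuffle across the batch
    acc_shuffled_values = accuracy(locations, values[perm])
    acc_shuffled_locations = accuracy(locations[perm], values)
    return base, acc_shuffled_values, acc_shuffled_locations
```

In my runs, the pattern was the one described above: scrambling the values barely moves the accuracy, while scrambling the location sequences destroys it.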

For instance, if the last log entry is the location that decides how many more methods to invoke, the model predicts success. Well, sure: if that's the last log entry, it most likely means the program decided to invoke zero more methods and end right there. But the model's weights are such that the zero in the log carries almost no impact, while the fact that this location came last carries a lot. The same holds for all the other values. Even when an extreme value is obviously the cause of an assertion failure, normalization and the inner-layer weights shrink that value to a minuscule number in the calculations, while the sequence of program locations is amplified into a correct prediction of success/failure.
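
Here is the scale argument with purely made-up numbers (not the actual learned weights), just to show the mechanism:

```python
# Purely illustrative numbers, not taken from the trained model.

extreme_value = 2**31 - 1                  # value that actually triggered the assert
normalized = extreme_value / float(2**31)  # input scaling squashes it to ~1.0
value_weight = 0.002                       # near-zero learned weight on the value feature
print(value_weight * normalized)           # ~0.002: negligible contribution

last_loc_is_loop_header = 1.0              # indicator: the "how many more methods" location came last
location_weight = 3.5                      # large learned weight on that location feature
print(location_weight * last_loc_is_loop_header)  # 3.5: dominates the output logit
```

So no matter how wild the generated value is, after normalization it can contribute at most a few thousandths to the logit, while a single location indicator swings it by whole units.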

While this is a fascinating illustration of the power of deep-learning networks, it’s not very promising for autonomous test generation. I’m trying a different approach now; I’ll write about it in the next post.