Vulnerability Finding Using Machine Learning
Microsoft is training a machine-learning system to find software bugs:
At Microsoft, 47,000 developers generate nearly 30 thousand bugs a month. These items get stored across over 100 AzureDevOps and GitHub repositories. To better label and prioritize bugs at that scale, we couldn’t just apply more people to the problem. However, large volumes of semi-curated data are perfect for machine learning. Since 2001 Microsoft has collected 13 million work items and bugs. We used that data to develop a process and machine learning model that correctly distinguishes between security and non-security bugs 99 percent of the time and accurately identifies the critical, high priority security bugs, 97 percent of the time.
News article.
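Microsoft's excerpt doesn't describe the model itself, so the following is only a toy sketch of the general idea: learn from labeled historical bug titles, then label new ones as security or non-security. It assumes a simple naive Bayes classifier with add-one smoothing, which is a stand-in technique and not Microsoft's method. The training titles, labels, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy training data: (bug title, label). Invented for this
# sketch; it is not Microsoft's data.
TRAINING = [
    ("SQL injection in login form", "security"),
    ("Buffer overflow when parsing image header", "security"),
    ("Cross-site scripting in comment field", "security"),
    ("Password stored in plaintext log", "security"),
    ("Button misaligned on settings page", "non-security"),
    ("Typo in welcome email", "non-security"),
    ("Slow page load on dashboard", "non-security"),
    ("Crash when exporting empty report", "non-security"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label, examples per label, and the overall vocabulary."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training titles
    vocab = set()
    for title, label in examples:
        label_counts[label] += 1
        counter = word_counts.setdefault(label, Counter())
        for token in tokenize(title):
            counter[token] += 1
            vocab.add(token)
    return word_counts, label_counts, vocab


def classify(title, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_titles = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the log prior for this label.
        score = math.log(label_counts[label] / total_titles)
        words_in_label = sum(word_counts[label].values())
        for token in tokenize(title):
            if token not in vocab:
                continue  # words never seen in training carry no signal
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (words_in_label + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)

if __name__ == "__main__":
    for title in ("SQL injection in search box", "Typo on dashboard page"):
        print(title, "->", classify(title, model))
```

A real system at this scale would need far more data, a held-out test set to measure accuracy figures like the ones quoted above, and some handling of mislabeled training examples. This sketch shows only the shape of the problem.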
I wrote about this in 2018:
The problem of finding software vulnerabilities seems well-suited for ML systems. Going through code line by line is just the sort of tedious problem that computers excel at, if we can only teach them what a vulnerability looks like. There are challenges with that, of course, but there is already a healthy amount of academic literature on the topic—and research is continuing. There’s every reason to expect ML systems to get better at this as time goes on, and some reason to expect them to eventually become very good at it.
Finding vulnerabilities can benefit both attackers and defenders, but it’s not a fair fight. When an attacker’s ML system finds a vulnerability in software, the attacker can use it to compromise systems. When a defender’s ML system finds the same vulnerability, he or she can try to patch the system or program network defenses to watch for and block code that tries to exploit it.
But when the same system is in the hands of a software developer who uses it to find the vulnerability before the software is ever released, the developer fixes it so it can never be used in the first place. The ML system will probably be part of his or her software design tools and will automatically find and fix vulnerabilities while the code is still in development.
Fast-forward a decade or so into the future. We might say to each other, “Remember those years when software vulnerabilities were a thing, before ML vulnerability finders were built into every compiler and fixed them before the software was ever released? Wow, those were crazy years.” Not only is this future possible, but I would bet on it.
Getting from here to there will be a dangerous ride, though. Those vulnerability finders will first be unleashed on existing software, giving attackers hundreds if not thousands of vulnerabilities to exploit in real-world attacks. Sure, defenders can use the same systems, but many of today’s Internet of Things (IoT) systems have no engineering teams to write patches and no ability to download and install patches. The result will be hundreds of vulnerabilities that attackers can find and use.
Rj • April 20, 2020 8:10 AM
This is an interesting subject. I suspect that a deep learning neural network approach will be tried first, but will meet with some pushback because these systems are not very good at explaining how and why they reached a given decision.
Attackers probably don’t care, so the deep learning neural network approach will work for them. Once trained, it has the advantage of finding bugs faster than, say, an ID3-style decision tree induction system. The advantage of decision tree systems is that they can explain how they reached a decision in a way that humans can follow.
Therefore, I predict that the attackers will still have the upper hand until these systems get perfected and used to find bugs in new code before it is released.
Still, experience tells me that some bugs will remain after the code is released, and the attackers’ neural nets are likely to find them sooner than the defenders’ decision trees.