Data-driven software security: Models and methods
Abstract
For computer software, our security models, policies,
mechanisms, and means of assurance were primarily conceived
and developed before the end of the 1970s. However,
since that time, software has changed radically: it is thousands
of times larger, comprises countless libraries, layers, and services,
and is used for more purposes, in far more complex ways. It is
worthwhile to revisit our core computer security concepts. For
example, it is unclear whether the Principle of Least Privilege
can help dictate security policy, when software is too complex for
either its developers or its users to explain its intended behavior.
One possibility is to take an empirical, data-driven approach
to modern software, and determine its exact, concrete behavior
via comprehensive, online monitoring. Such an approach can be
a practical, effective basis for security—as demonstrated by its
success in spam and abuse fighting—but its use to constrain
software behavior raises many questions. In particular, three
questions seem critical. First, can we efficiently monitor the
details of how software is behaving, in the large? Second, is it
possible to learn those details without intruding on users’ privacy?
Third, are those details a good foundation for security policies
that constrain how software should behave?
This paper outlines what a data-driven model for software
security could look like, and describes how the above three
questions can be answered affirmatively. Specifically, this paper
briefly describes methods for efficient, detailed software monitoring,
as well as methods for learning detailed software statistics
while providing differential privacy for its users, and, finally, how
machine learning methods can help discover users’ expectations
for intended software behavior, and thereby help set security
policy. Those methods can be adopted in practice, even at very
large scales, and demonstrate that data-driven software security
models can provide real-world benefits.
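
To make the second question concrete: one standard way to learn aggregate software statistics while giving each user a differential-privacy guarantee is randomized response, the classic local mechanism underlying deployed telemetry systems. The sketch below is illustrative only, not the paper’s actual implementation; the function names, the choice of epsilon, and the simulated population are all assumptions made for the example.

```python
import math
import random

def randomized_response(truth: bool, epsilon: float) -> bool:
    """Report a boolean with epsilon-local differential privacy.

    With probability p = e^eps / (e^eps + 1) the true value is sent;
    otherwise its negation is sent. The likelihood ratio between the
    two possible truths is then p / (1 - p) = e^eps, as DP requires.
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if random.random() < p else not truth

def estimate_rate(reports: list[bool], epsilon: float) -> float:
    """Unbiased estimate of the true fraction of 'True' answers.

    The observed rate is a mixture: p * t + (1 - p) * (1 - t),
    where t is the true rate. Inverting that mixture corrects for
    the known flip probability.
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed + p - 1.0) / (2.0 * p - 1.0)

# Hypothetical population: 100,000 clients, 30% of whom exhibit
# some monitored behavior. No single report reveals any client's truth.
random.seed(1)
true_rate = 0.30
eps = math.log(3)  # each client answers truthfully with prob 0.75
reports = [randomized_response(random.random() < true_rate, eps)
           for _ in range(100_000)]
print(f"estimated rate: {estimate_rate(reports, eps):.3f}")  # close to 0.30
```

The design point the example illustrates is the one the abstract relies on: each individual report is deniable, yet across a large population the aggregate statistic can still be recovered accurately, which is what makes detailed, privacy-preserving monitoring feasible at scale.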