Static analysis of tainted variables and sanitation in Datalog

Static analysis and sanitation of tainted variables; implementation in Datalog. A group project in Program Analysis for System Security and Reliability (PASS).

Creating this static analyzer was a 2018 group project in Program Analysis for System Security and Reliability (PASS) together with Ivan Tishchenko (TIPS group). Here, I document some of my own experiences from this project and our working analyzer, including some code snippets. Writing code for such a project is still the best way to learn a language.

Datalog

Datalog is a declarative language similar to Prolog. Simplified, the program is written in the form of logic predicates that can deduce the output from a provided input in form of facts (data tables). Complex problems can be solved by composition of the predicates.

Project

The goal of the project was to build a sanitizer for tainted variables in a simple toy language. Explicit and implicit tainting had to be considered. The analyzer shall report sanitize(l, x) locations (sanitize a variable x at location/label l) for provided programs with the goals:

  • no under-sanitation, i.e. no path from a tainted source to a vulnerable sink
  • no over-sanitation, i.e. no sanitize shall be found on a program trace without any tainted variable in this trace
  • minimum number of sanitize()

The project description is no longer online. Thus, I do not go into more details of the given rule set.

Design overview

The analyzer consists of the following 5 layers, each providing different forms of abstractions. Induction is a typical/natural form to define the predicates: base cases and induction steps. Despite these multiple layers but thanks to the design of the logic rules, our analyzer was fast and worked with only low memory requirements (compared with other analyzers of the class).

LayerNameDescription
5Multiple ModelsCreate up to 8 different models for possible solutions: all sound; pick version with lowest sanitize() count
4ModelEach model finds sanitize candidates based on results of the lower layers following its configuration; uses ReachingDef AvailableUse, and min-cut == max-flow theorem among other things
3Risky pathsDefine risky paths and remove forbidden sanitize candidates, i.e. (label, variable) tuples
2Basic toolsBuild traces, DefUse chains, ReachingDef, and tainted variants; add helper assign_() (see phi() functions in SSA forms).
1Low-levelMaps the provided control-flow graph (CFG) by follows() to a modified CFG by mapped_follows() more suitable to our analysis

A more detailed description may follow at a later date on my github page.

Few code snippets

The implementation made use of learned knowledge of former classes like Compiler Design (building a Javali compiler in Java in group projects) and Software Engineering and Architecture (static analyzers in Alloy and Java using polyhedra for abstract interpretation in group projects). Thus, a few code snippets to illustrate the syntax of datalog and also some of the basic code analysis tools (level 2 in our analyzer).

As mentioned, induction is one key to writing datalog code: define base case(s) and induction rules. Then the language completes the rest.

The code samples were encoded in a simple syntax: Each line of the code has a label. The toy language had conditionals but no loops.

Here a very simple program and its mapping to datalog predicates. Note how the follows() predicates build a control flow graph (CFG). else is encoded as a second if condition with different target label. Following the given rules, x must be sanitized either at label l4 or l5. Missing the need for sanitation would be under-sanitation; sanitizing x e.g. at L2 or L9would be over-sanitation.

// control test case 1:
// a simple control test with direct DefUse from source to sink
// modified from provided test case 1

//L1: x:=source();
//L2: y:=x+z;
//L3: if(z+2>0){
//L4:   y:=5;
//L5: }
//L6: else{
//L7:   x:=2;
//L8: }
//L9: sink(x);

source("l1","x").
sink("l9","x").
follows("l2","l1").
follows("l3","l2").
follows("l4","l3").
follows("l5","l4").
follows("l6","l2").
follows("l7","l6").
follows("l8","l7").
follows("l9","l5").
follows("l9","l8").
join("l9","l5","l8").
assign("l2","y","x","z").
assign("l4","y","5","0").
assign("l7","x","2","0").
if("l3","l4","z","2").
if("l6","l7","z","2").

//expected output: ("l4","x") or ("l5","x").

The predicates shown in this example are formally defined in the declarations code in the table, which then illustrates various building blocks used in our datalog analyzer.

FileDescription
DeclarationsBasic declarations used by all of the following examples
Transitive closurelater() is the transitive closure of follows() in the CFG
TracesGenerates all possible program traces; illustrating count() operator
ReachingDefExtended ReachingDef that includes trace information
DefUseDefUse and DefUse chains
BasicBlockstested but not needed in the final analyzer; combines forward and backward analysis

These are very basic examples. The entire analyzer is much more complex in its complete form. However, the clear structure in the designed abstraction layers allows the reader to understand well what is going on in which part of the analyzer. As so often, good design is key to a successful project.