Static analysis of tainted variables and sanitation in Datalog

Jan 4, 2019

Static analysis and sanitation of tainted variables; implementation in Datalog. A group project in Program Analysis for System Security and Reliability (PASS).

Creating this static analyzer was a 2018 group project in Program Analysis for System Security and Reliability (PASS) together with Ivan Tishchenko (TIPS group). Here, I document some of my own experiences from this project and our working analyzer, including some code snippets. Writing code for such a project is still the best way to learn a language.

Datalog

Datalog is a declarative language similar to Prolog. Simplified, the program is written in the form of logic predicates that can deduce the output from a provided input in form of facts (data tables). Complex problems can be solved by composition of the predicates.

Project

The goal of the project was to build a sanitizer for tainted variables in a simple toy language. Explicit and implicit tainting had to be considered. The analyzer shall report sanitize(l, x) locations (sanitize a variable x at location/label l) for provided programs with the goals:

no under-sanitation, i.e. no path from a tainted source to a vulnerable sink
no over-sanitation, i.e. no sanitize shall be found on a program trace without any tainted variable in this trace
minimum number of sanitize()

The project description is no longer online. Thus, I do not go into more details of the given rule set.

Design overview

The analyzer consists of the following 5 layers, each providing different forms of abstractions. Induction is a typical/natural form to define the predicates: base cases and induction steps. Despite these multiple layers but thanks to the design of the logic rules, our analyzer was fast and worked with only low memory requirements (compared with other analyzers of the class).

Layer	Name	Description
5	Multiple Models	Create up to 8 different models for possible solutions: all sound; pick version with lowest `sanitize()` count
4	Model	Each model finds sanitize candidates based on results of the lower layers following its configuration; uses ReachingDef AvailableUse, and `min-cut == max-flow` theorem among other things
3	Risky paths	Define risky paths and remove forbidden sanitize candidates, i.e. (label, variable) tuples
2	Basic tools	Build traces, DefUse chains, ReachingDef, and tainted variants; add helper `assign_()` (see `phi()` functions in SSA forms).
1	Low-level	Maps the provided control-flow graph (CFG) by `follows()` to a modified CFG by `mapped_follows()` more suitable to our analysis

A more detailed description may follow at a later date on my github page.

Few code snippets

The implementation made use of learned knowledge of former classes like Compiler Design (building a Javali compiler in Java in group projects) and Software Engineering and Architecture (static analyzers in Alloy and Java using polyhedra for abstract interpretation in group projects). Thus, a few code snippets to illustrate the syntax of datalog and also some of the basic code analysis tools (level 2 in our analyzer).

As mentioned, induction is one key to writing datalog code: define base case(s) and induction rules. Then the language completes the rest.

The code samples were encoded in a simple syntax: Each line of the code has a label. The toy language had conditionals but no loops.

Here a very simple program and its mapping to datalog predicates. Note how the follows() predicates build a control flow graph (CFG). else is encoded as a second if condition with different target label. Following the given rules, x must be sanitized either at label l4 or l5. Missing the need for sanitation would be under-sanitation; sanitizing x e.g. at L2 or L9would be over-sanitation.

// control test case 1:
// a simple control test with direct DefUse from source to sink
// modified from provided test case 1

//L1: x:=source();
//L2: y:=x+z;
//L3: if(z+2>0){
//L4:   y:=5;
//L5: }
//L6: else{
//L7:   x:=2;
//L8: }
//L9: sink(x);

source("l1","x").
sink("l9","x").
follows("l2","l1").
follows("l3","l2").
follows("l4","l3").
follows("l5","l4").
follows("l6","l2").
follows("l7","l6").
follows("l8","l7").
follows("l9","l5").
follows("l9","l8").
join("l9","l5","l8").
assign("l2","y","x","z").
assign("l4","y","5","0").
assign("l7","x","2","0").
if("l3","l4","z","2").
if("l6","l7","z","2").

//expected output: ("l4","x") or ("l5","x").

The predicates shown in this example are formally defined in the declarations code in the table, which then illustrates various building blocks used in our datalog analyzer.

File	Description
Declarations	Basic declarations used by all of the following examples
Transitive closure	`later()` is the transitive closure of `follows()` in the CFG
Traces	Generates all possible program `traces`; illustrating `count()` operator
ReachingDef	Extended `ReachingDef` that includes trace information
DefUse	`DefUse` and DefUse chains
BasicBlocks	tested but not needed in the final analyzer; combines forward and backward analysis

These are very basic examples. The entire analyzer is much more complex in its complete form. However, the clear structure in the designed abstraction layers allows the reader to understand well what is going on in which part of the analyzer. As so often, good design is key to a successful project.

Pirmin Schmid

Static analysis of tainted variables and sanitation in Datalog

Datalog

Project

Design overview

Few code snippets