Combinatoric Input Set Generation

GeneratorI’ve been studying combinatoric methods of generating test cases for quite some time now. Most publicly available fuzz testing packages implement fairly crude techniques for passing input values to applications–although recent research is becoming more creative in attacking the issue because of insufficient path coverage metrics by orthodox methods. Generating input sets combinatorially is a much more holistic approach to the black-box software testing paradigm.

In this article, I’ll be providing a brief overview of how set operations taken from the field of discrete mathematics can be applied to fuzz testing. Explicit definitions of set operations can be found elsewhere and links to Wikipedia are provided where appropriate. Brief explanations on set theory will be given here but I will mainly be focusing on how it relates to software testing.

Let’s start off with permutations; just about any computer science bachelor is familiar with these. They’re the possible orderings of a sequence. A “sequence” may consist of members that are non-unique (which is in contrast to a “set” whose members are unordered and unique.) For example, the permutations of sequence {1,2,3} are:

{{1,2,3},{1,3,2}, {2,1,3},{2,3,1},{3,1,2},{3,2,1}}

Easy enough. Permutations can be used when testing protocols where the order of commands is significant. For example, if testing the FTP protocol with a sequence of {“STAT .”,”CWD ..”,”MKD foo”} depending on the order in which the commands are executed, the client could be retrieving the status for and/or making the “foo” directory in the parent/child directories.

Sets of subsets (also known as combinations) are also quite useful and are expressed using what’s called “n-choose-r” notation because r elements are being chosen from a set that has n elements in total (the set is said to have cardinality n.) Combinations can be utilized to enumerate command-line argument possibilities since the order of argv values is usually irrelevant. Say an executable has the command line flags “-n”, “-i”, “-q”, and “-v”. We want to generate all subsets of cardinality 2 (4-choose-2.) The answer is:

{{“-n”,”-i”},{“-n”,”-q”},{“-n”,”-v”},{“-i”,”-q”},

{“-i”,”-v”},{“-q”,”-v”}}

r-permutations are similar to subsets in that a given amount of objects are chosen from the list but since permutations are represented as sequences the order of the elements matters. Using the same baseline sequence as the previously described n-permutations, the 2-permutations of {1,2,3} are as follows:

{{1,2},{2,1},{1,3},{3,1},{2,3},{3,2}}

Again, observe how these permutations would affect the logic of network protocol commands processed by a server daemon..

Moving on, a power set is the set of all subsets including the empty set. Using the example above, the power set of {“-n”,”-i”,”-q”,”-v”} looks like this:

{{},{“-n”},{“-i”},{“-q”},{“-v”},{“-n”,”-i”},{“-n”,”-q”},

{“-n”,”-v”},{“-i”,”-q”}{“-i”,”-v”},{“-q”,”-v”},

{“-n”,”-i”,”-q”},{“-n”,”-q”,”-v”},{“-n”,”-i”,”-v”},

{“-i”,”-q”,”-v”},{“-n”,”-i”,”-q”,”-v”}}

The accuracy of the power set calculation can be checked because power sets have a cardinality of 2**n. In this case, 2**4=16. Other set operations can be checked with similar formulas.

Taking it another step further, consider giving that executable whose command line flags are being generated environment variable input as well. Suppose that the power set for the command line options is:

{{},{“-n”},{“-i”},{“-n”,”-i”}}

and the power set for the environment variable values is:

{{},{“TERM=vt100”},{“TERM=%n%n%n”}},{“LOGIN=root”},

{“LOGIN=%n%n%n”},{“TERM=vt100″,”LOGIN=root”},

{“TERM=%n%n%n”,”LOGIN=root”},

{“TERM=%n%n%n”,”LOGIN=%n%n%n”}}

The environment variable names and values were paired up using a set operation known as the Cartesian product. i.e. the Cartesian product of the sets {{“TERM”}} and {“vt100″,”%n%n%n”} is:

{{“TERM”,”vt100″},{“TERM”,”%n%n%n”}}

Taking further advantage of Cartesian products, all possible pairings of command line flags and environment values are generated. I won’t be typing the whole thing out here as it is excruciatingly long but it would start out something like this:

{{{},{}},{{},”TERM=vt100″},{{},”TERM=%n%n%n”}},

{{},”LOGIN=root”}, … , {{“-n”,”-i”},{“TERM=%n%n%n”,”LOGIN=root”}},

{{“-n”,”-i”},{{“TERM=%n%n%n”,”LOGIN=%n%n%n”}}}

I put the ellipsis in there because, well, you get the idea! This continues on until all possible subsets of command line flags have been paired with all possible subsets of pairings of environment variable names and values. The final Cartesian product for two power sets of equal size can be represented visually by Pascal’s triangle.

The aforementioned set operations can be used to systematically prove the correctness of simple computer programs via deterministic testing. Modern computer programs are so complex that attempting to calculate all possible input scenarios would be infeasible with a silicon-based machine. In the future, I expect that it will be commonplace for quantum molecular systems or perhaps even DNA computers to solve software assurance problems (and many others) in constant time or O(1), but I digress.

Since silicon will be prevalent for the foreseeable future, the input space used for a real-world software test has to be reduced. The goal is to minimize test executions while maximizing path coverage so disparate input sets must be chosen. This is where “pseudo exhaustive”, “n-way”, or more recently “6-way” testing comes into practice. More information about input space reduction and pseudo exhaustive testing is in Rick Kuhn’s research at NIST.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: