The blog post explores different approaches to finding segmentation faults in binary machine code. It compares static analysis, fuzzing, and concolic analysis. While static analyzers quickly scan code for potential issues, they often produce false positives. Fuzzers dynamically test program behavior with varied inputs but may take a long time and lack reliability. Concolic analyzers explore program behavior comprehensively, offering higher accuracy but requiring more resources. Experiments demonstrate the performance and limitations of each approach, with concolic analysis showing promise despite its resource intensity.

Segmentation faults (segfaults) are a common error that occurs when a program tries to access a restricted area of memory. Segfaults can occur for a wide variety of reasons: use of uninitialized pointers, out-of-bounds memory accesses, use-after-free errors, buffer overflows, etc. Segfaults almost always result in unintended program behavior, crashes, and/or security vulnerabilities. Most software developers can remember a time when their program produced a pesky segfault that was both difficult and time-consuming to track down and fix.

Common methods to track down and eliminate segfaults typically rely on having a deep understanding of the target program’s behavior. A programmer can use a debugger like GDB to reproduce the conditions needed to cause a segfault, and then fix said segfault in their source code. A programmer can also write a suite of automated tests that probe their software, ensuring no edge cases result in a segmentation fault.

Even with these measures in place, segfaults still sometimes make their way into release builds of software. Users do not have the same knowledge about the software’s behavior as its programmer does. The user doesn’t know where in the program to look for a segfault. For this reason, I want to explore how to detect segmentation faults when you lack intimate knowledge about a program’s behavior and design.

What approaches are possible?

Without intimate knowledge about a program, its implementation details are often obscure. A lack of documentation, source code, and other factors makes it difficult for the uninformed to begin finding vulnerabilities. There exist a few kinds of automated tools that help address this problem. We will explore each of these tools, comparing how they function, their pros, their cons, and their performance. These tools consist of the following:

Static analyzers assess both binary machine code and source code without executing the program under analysis. Often, these tools are used to enforce security, coding style, and other best practices. Static analyzers can help us find vulnerabilities by pointing out problematic code segments (either in binary or in source code). However, static analyzers are often considered “noisy”: they will report anything and everything. This includes the vulnerabilities we are looking for, albeit in addition to numerous other false positives.

Fuzzers allow you to inject semi-random data into a program to find bugs and security vulnerabilities. They do this by manipulating untrusted input until unintended program behavior occurs. When a fuzzer finds a vulnerability, you know exactly what input is needed to trigger the vulnerability. However, fuzzers are often limited by performance issues: they may run for days without finding the input needed to break an insecure program.

Concolic analyzers utilize a combination of symbolic execution (treating program variables as manipulatable symbols) and concrete testing (testing on particular inputs). They trace the control flow of the target program, making notes of the input conditions needed to reach a certain state. Concolic analysis lets us explore the entirety of a program’s potential behavior, including when/where vulnerabilities may exist. However, concolic analyzers are quite resource intensive. They must keep track of each potential program state within memory.

A Primer on Weak Pointer Analysis

Weak Pointer Analysis is a form of concolic analysis that focuses on evaluating the behavior of pointers under various simulated program conditions. Its aim is to detect memory pointers that can be manipulated by untrusted input to point outside of their intended boundaries. Such pointers can result in buffer overflows, segfaults, and other memory-related vulnerabilities.

Weak Pointer Analysis improves upon pre-existing approaches to concolic analysis by accelerating performance and avoiding the state explosion problem. This is done using a Bounded Model Checker (BMC), which reduces the total number of program states necessary to explore before the analysis is considered complete. Unreachable states are avoided, as they do not reflect true instances of program behavior.

Other resource-saving measures are implemented by Weak Pointer Analysis, including a round-robin/sliding window mechanism for assessing program states and a state caching mechanism. For the purposes of this blog post, consider Weak Pointer Analysis as an enhanced form of concolic analysis that is tailored for our segfault-finding use-case. Weak Pointer Analysis will act as a stand in for concolic analysis in general as we compare it to the other segfault-finding tools.

To compare each of the tools, we have designed three experiments. The first experiment compares the accuracy of static analysis to concolic/Weak Pointer Analysis. The second and third experiments compare the performance and runtime of fuzzing to concolic/Weak Pointer Analysis.

Static Analysis: Fast, but Noisy

For the first experiment, we have written and compiled a set of programs that allow us to measure and compare the accuracy of flawfinder, cwe_checker, cppcheck, and WPA. Each binary program contains various function definitions, all of which use the memcpy function. The memcpy function is potentially dangerous because it can result in a buffer overflow/segfault. This can occur if the memory in a large buffer is copied into a smaller buffer. If memcpy is used properly, it will not result in a buffer overflow/segfault. It is up to the developer to ensure proper usage of memcpy. Developer error, however, can easily result in a vulnerable usage of memcpy.

The source code presented below illustrates one of these programs. In it, one function is unsafe and results in a segfault, while the other two are safe.

In each sample binary program, all the functions except one are safe (they do not result in memory corruption via a buffer overflow). The size of each program is defined by the number of safe-to-use functions it defines. For example, size 10 indicates that the program contains 10 safe functions and 1 unsafe function (12 functions in total, when including main()). Each program calls all its defined functions in order, with the unsafe function being called last.

Although each program defines several potentially dangerous functions, each has only a single use of memcpy that can result in a vulnerability/segfault. A surface-level analysis would be incapable of distinguishing between a safe use of memcpy and a dangerous one.

For each of these programs, we ran flawfinder, cwe_checker, cppcheck, and WPA on either the binary or source code representation of the file (depending on the kind of input required by the tool). We recorded runtimes, total reported vulnerabilities/weaknesses, and accuracy rates. This data is summarized in the tables below.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int LARGE = 131072;
const int SMALL = 4;

// causes a heap overflow and segfault
void heap_overflow() {
    printf("Performing unsafe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * LARGE);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, LARGE);
    free(src);
    free(dst);
}

// performs a heap safe operation
void heap_safe_1() {
    printf("Performing safe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * SMALL);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, SMALL);
    free(src);
    free(dst);
}

// performs a heap safe operation
void heap_safe_2() {
    printf("Performing safe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * SMALL);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, SMALL);
    free(src);
    free(dst);
}

int main() {
    heap_safe_1();
    heap_safe_2();
    heap_overflow();
}

Size (# of functions) Runtime (sec) Total Reported Items Vulnerable Reported Items Accuracy
5 0.031559 6 1 16.67%
10 0.030592 11 1 9.09%
20 0.034237 21 1 4.76%
50 0.0363287 51 1 1.96%
100 0.044598 101 1 0.99%
500 0.098342 501 1 0.20%
721 0.12650418 722 1 0.14%

Table 1 – flawfinder Accuracy Results

Size (# of functions) Runtime (sec) Total Reported Items Vulnerable Reported Items Accuracy
5 6.8047869 19 2 10.53%
10 7.09194 34 2 5.88%
20 7.124346 64 2 3.13%
50 7.5739443 154 2 1.30%
100 8.4504086 304 2 0.66%
500 14.143265 1504 2 0.13%
721 17.14055538 2167 2 0.09%

Table 2 – cwe_checker Accuracy Results

Size (# of functions) Runtime (sec) Total Reported Items Vulnerable Reported Items Accuracy
5 0.0074203 9 2 22.22%
10 0.009376 14 2 14.29%
20 0.0119152 24 2 8.33%
50 0.02241158 54 2 3.70%
100 0.0396542 104 2 1.92%
500 0.261377 504 2 0.40%
721 0.476597 725 2 0.28%

Table 3 – cppcheck Accuracy Results

Size (# of functions) Runtime (sec) Total Reported Items Vulnerable Reported Items Accuracy
5 56.37603 1 1 100%
10 63.9488 1 1 100%
20 74.829524 1 1 100%
50 89.03786 1 1 100%
100 108.076774 1 1 100%
500 292.4567799 1 1 100%
721 312.5943696 1 1 100%

Table 4 – Weak Pointers Accuracy Results

Accuracy is measured as the number of vulnerable reported items divided by the total number of reported items. You can think of this as a measure of the “noisiness” of each tool: out of everything the tool reports, how much of it reflects a truly troublesome vulnerability?

As an example of what the static analyzers output, consider the output shown on the right from flawfinder when run against the size-500 program.

Flawfinder, as well as each of the other static analyzers, reported all uses of memcpy as unsafe, even though only one of them truly was. Some of the static analyzers also reported the same unsafe usage of memcpy twice, hence the 2 in some of the Vulnerable Reported Items columns in the tables above.

This data implies that static analysis, both when using binary and source code as input, is rather noisy. Although static analysis tools are fast, they report anything and everything. This makes it quite difficult to distinguish between what is valuable from what is unimportant. This is further illustrated in the diagram below:

Accuracy of Each Tool per Binary Size

No matter the size of the target program, concolic analysis only reported the single truly vulnerable usage of memcpy that results in a segfault. This is shown in the output from the Weak Pointers analysis to the right, which indicates that the usage of memcpy found at program address 0x1200 is exploitable.

The reason concolic analysis is capable of this is because it models runtime behavior, whereas static analysis does not. This capability lets us differentiate the exploitable from the benign, ultimately resulting in the significantly higher accuracy rating shown in the data, albeit at a cost of performance time.

Blackbox Fuzzing: Sometimes Reliable

As is demonstrated, static analyzers often have a low accuracy because they fail to account for runtime behavior. Fuzzers do not have this problem: they dynamically analyze a program’s behavior at runtime by generating malformed inputs. However, fuzzers often have severe performance limitations. They must explore a wide variety of potential inputs before landing on one that results in unintended program behavior.

We designed two experiments which compare the performance of fuzzing to concolic/Weak Pointer Analysis. For each experiment, we compiled a set of target binary samples. The first set of binaries contains broad CFGs, whereas the second contains deep CFGs; we will explain what “broad” and “deep” mean in this context shortly. The first experiment uses the broad-CFG binaries, and the second uses the deep-CFG binaries.

NOTE: A control-flow graph (CFG) is a representation of the flow of control within a program during its execution. Each node in the CFG represents a block of code in the program. Each edge represents a potential transfer of control/branch.

Each binary sample in both sets contains a single bad usage of memcpy that results in a segfault in one of the basic blocks of its CFG. What this means will become clearer below when we discuss the source code used to compile the samples.

Each tool will be run against each binary. The time each tool takes to discover the vulnerable usage of memcpy/vulnerability will be recorded. This will serve as a general measure of each tool’s performance. It will also illustrate how differently shaped CFGs affect the performance of each tool.

Broad CFG

For this experiment, we are defining CFG breadth as the maximum number of edges which extend from a single node in the CFG. This is best understood visually:

As is shown in the figure, most nodes have only one outward edge, whereas a single node has numerous outward edges. This CFG is considered to have a breadth of 10, as the single node with the maximum number of outward edges has 10 in total.

An easy way to produce a very broad CFG is to write a function with a very long switch-case statement in C. Once compiled, a large switch-case statement results in at least one node with many outward edges.

To this end, we generated various binaries of differing breadth sizes according to the definition of breadth size mentioned above. For each of these binaries, one of the cases in the switch statement results in a heap buffer overflow vulnerability/segfault. When the binary follows this case, it will produce a buffer overflow, segmentation fault, and program crash. The specific case is randomized, and different for each program.

The source code for a program with a CFG breadth of 5 is shown below. For this binary, the input 3 results in a segfault.

Each binary reads data from stdin, before selecting the case to follow. This means that user input determines whether a segfault will occur. This makes these binaries prime candidates for fuzzing and concolic/Weak Pointer analysis.

For each program, we ran AFL++, Radamsa, and Weak Pointers until they each found a segmentation fault vulnerability. We recorded the results of this effort, which are summarized in the tables below.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int LARGE = 131072;
const int SMALL = 4;

// causes a heap overflow and segfault
void heap_overflow() {
    printf("Performing unsafe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * LARGE);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, LARGE);
    free(src);
    free(dst);
}

int main() {
    char in_str[8];
    fgets(in_str, 8, stdin);
    switch(atoi(in_str)) {
        case 1:
            printf("1");
            break;
        case 2:
            printf("2");
            break;
        case 3:
            printf("3");
            heap_overflow();
            break;
        case 4:
            printf("4");
            break;
        case 5:
            printf("5");
            break;
    }
}

Size (CFG Breadth) Mean Time to Find Segfault (sec) Median Time to Find Segfault (sec) Number of Samples
5 1.22705 1.229158 10
10 1.539695 1.640244 10
20 1.228709 1.229786 10
50 4.550136 1.633169 10
100 2.42164056 2.050192 10
500 19.4467412 18.128518 10
1000 112.776517 85.526875 10
5000 11.9792385 11.9792385 1
10000 82.006280 82.006280 1

Table 5 – AFL++ Broad Performance Results

Size (CFG Breadth) Mean Time to Find Segfault (sec) Median Time to Find Segfault (sec) Number of Samples
5 0.2211975 0.12884569 10
10 1.67807545 1.3554393 10
20 0.04569835 0.020627021 10
50 16.7256742 11.99373 10
100 13.124415 7.375611 10
500 2.6779618 2.110273 10
1000 653.57073388 353.841504 10
5000 4819.5401723 4819.5401723 1
10000 629.6727004 629.6727004 1

Table 6 – Radamsa Broad Performance Results

Size (CFG Breadth) Mean Time to Find Segfault (sec) Median Time to Find Segfault (sec) Number of Samples
5 30.939618 29.4254662 10
10 29.5910989 29.461279 10
20 28.9873888 28.85682845 10
50 29.38255059 29.1132495 10
100 29.59618673 29.2723473 10
500 33.6372462 33.7710034 10
1000 31.0504938 31.0986369 10
5000 38.09562659 38.09562659 1
10000 52.149712324 52.149712324 1

Table 7 – Weak Pointers Broad Performance Results

For the most part, we ran the tools 10 times per binary to reduce the effect of randomness on the resulting data. You can see that we ran the tools only once for some of the larger binaries to save time.

For the smaller program sizes, fuzzing performed better than concolic analysis. However, these performance differences begin to invert towards the larger sizes. This is especially true in the case of Radamsa, which took over an hour for the size 5,000 program.

The times taken by the fuzzers also appear more spread out than those taken by concolic analysis. The standard deviation for the AFL++ mean times was 41.45 and for Radamsa the standard deviation was 1575.87. For Weak Pointers, the standard deviation was just 7.49. This is most likely because of the highly random nature of fuzzing. Fuzzing is closer in approximation to a random guess than it is to a programmatic approach like concolic analysis.

Fuzzing is essentially an improved form of random guessing. It does not guarantee that every branch of the CFG is explored. In fact, a fuzzer’s performance often has less to do with the shape or size of the CFG than with the relationship between input and output. For example, Radamsa took only 0.046 seconds to find the segfault in the size-20 program, faster than for the smaller size-5 and size-10 programs. Why? Because the input that causes the segfault in the size-20 program happens to be just the character 1, which a fuzzer is far more likely to guess than a more arbitrary sequence like 4243. In fact, 4243 was the input required to segfault the size-5,000 program, which explains why Radamsa took so long to find it.

For blackbox fuzzers, the size and shape of the program’s CFG is somewhat meaningless. In the experiment, there is a positive correlation between CFG breadth and fuzzer runtime not because the CFG is any broader, but because bigger programs are more likely to contain a hard to guess sequence of bytes. For concolic analysis, the size and shape of the CFG matters much more, as that approach explores the CFG directly using symbolic execution.

Mean Runtime of Each Tool per Binary CFG Breadth Size

There is a notable difference in performance between AFL++ and Radamsa shown in the data. This is likely caused by multiple factors. For one, AFL++ makes use of parallelization. Additionally, the input mutation algorithms used by the two tools are different. In this case, AFL++’s mutation algorithm seems to perform better, but this might not be true in every case.

Both AFL++ and Radamsa ask for an example input to mutate when fuzzing. We supplied the value 123 for each. This certainly had a significant impact on the resulting performance data. With this configuration, input sequences similar to 123 are favored when fuzzing. If a target binary segfaults on an input dissimilar to 123, the fuzzers have a much harder time finding it.

Deep CFG

The final experiment is much like the previous one, this time focusing on CFG depth instead of breadth. Once again, this is best understood visually. Below is an example of a “deep” CFG:

An easy way to produce a binary with a very deep CFG is to write a function with a very long if-else-if statement in C. When compiled, a long if-else-if statement will produce at least one branch of the CFG that is very long. This branch corresponds to the conditions required to end up in the last else-if statement, after all other conditions were not met.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int LARGE = 131072;
const int SMALL = 4;

// causes a heap overflow and segfault
void heap_overflow() {
    printf("Performing unsafe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * LARGE);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, LARGE);
    free(src);
    free(dst);
}

int main() {
    char in_str[8];
    fgets(in_str, 8, stdin);
    in_str[strcspn(in_str, "\n")] = '\0';
    if(strcmp(in_str, "1") == 0) {
        printf("1");
    }
    else if(strcmp(in_str, "2") == 0) {
        printf("2");
        heap_overflow();
    }
    else if(strcmp(in_str, "3") == 0) {
        printf("3");
    }
    else if(strcmp(in_str, "4") == 0) {
        printf("4");
    }
    else if(strcmp(in_str, "5") == 0) {
        printf("5");
    }
}

We generated various binaries of differing CFG depths by compiling source code with different sized if-else-if statements. These program samples are labeled according to the size of their if-else-if statements. For example, size 10 contains an if-else-if statement that is 10 else-ifs long, size 100 contains one that is 100 else-ifs long, and so on. The source code above defines a program of this nature with a CFG depth of 5.

Just like in the breadth experiment, a single, randomly selected block of this if-else-if statement will result in a heap overflow, segfault, and program crash. In the example above, this input is 2. Each binary reads from stdin before selecting the if-else-if block to execute. User input determines whether a segfault occurs.

For each binary program sample, we ran AFL++, Radamsa, and Weak Pointers until they each found a segmentation fault. We recorded the results of this effort, which are summarized in the tables below.

Size (CFG Depth) Mean Time to Find Segfault (sec) Median Time to Find Segfault (sec) Number of Samples
5 1.229713 1.229758 10
10 2.9786429 1.9524197 10
20 2.7966087 2.366075 10
50 1.8744916 1.551688 10
100 7.1974167 4.50936579 10
500 12.8920267 10.11059761 10
1000 17.2364633 7.702806949 10
5000 1069.2453219 1069.2453219 1
10000 1195.153937 1195.153937 1

Table 8 – AFL++ Deep Performance Results

Size (CFG Depth) Mean Time to Find Segfault (sec) Median Time to Find Segfault (sec) Number of Samples
5 0.33727478 0.33034837 10
10 12.111697 10.371287 10
20 3.8423036 2.107616 10
50 3.00970213 3.0342566 10
100 16.722832 7.03891468 10
500 232.0650088 170.8253835 10
1000 20.4071374 19.534172 10
5000 599.9072213 599.9072213 1
10000 1511.0234968 1511.02349686 1

Table 9 – Radamsa Deep Performance Results

Size (CFG Depth) Mean Time to Find Segfault (sec) Median Time to Find Segfault (sec) Number of Samples
5 31.23626985 31.74081552 10
10 31.67405188 31.46334922 10
20 31.74081552 31.606569886 10
50 34.288318991 34.20289182 10
100 36.043357849 34.86434209 10
500 41.88138661 41.89520418 10
1000 50.7756350517 50.51226115 10
5000 152.95043087 152.95043087 1
10000 227.660034418 227.660034418 1

Table 10 – Weak Pointers Deep Performance Results

Mean Runtime of Each Tool per Binary CFG Depth Size

The fuzzers performed with much the same degree of randomness as in the breadth experiment. Once again, the size and shape of the CFG had a less significant impact on fuzzer performance than the relationship between input and output did. The larger binaries typically required more arbitrary byte sequences to trigger their segfaults. For example, the size-10,000 binary segfaults on input 8976. Both Radamsa and AFL++ had quite a difficult time guessing this input, with AFL++ taking about 20 minutes and Radamsa about 25.

In the case of concolic analysis, CFG depth appears to have a more significant impact on performance than CFG breadth does. In the breadth experiment, the largest binary took only 22 seconds longer to analyze than the smallest. In the depth experiment, this difference was 196 seconds. More research is required to determine why this is the case, and if it applies in more general circumstances. In any case, concolic analysis seemed to handle the deeper sized CFGs in a more reasonable time than the fuzzers did.

Conclusion

Static analysis is a fast but noisy approach to black-box segfault detection. Static analysis typically can find segfaults, albeit intermixed with a slew of false positives and unremarkable findings. Fuzzing addresses these accuracy concerns: when a segfault is found, you know exactly where it is and how to reproduce it. Fuzzing finds exactly what you want, without any noise.

However, fuzzing can be extremely time consuming, and its reliability varies widely. Sometimes it will find what you need in seconds; other times it may take days, or find nothing at all.

Although resource intensive, concolic/weak pointer analysis offers unique benefits compared to the other approaches. It is more reliable at exploring the potential states a binary program can be in. When the Weak Pointer analysis finishes its assessment, you can feel confident that most of what the target program is capable of has been explored. This is often not the case for fuzzing, where you are leaving vulnerability detection up to random chance.