This blog post explores different approaches to finding segmentation faults in binary machine code. It compares static analysis, fuzzing, and concolic analysis. While static analyzers quickly scan code for potential issues, they often produce false positives. Fuzzers dynamically test program behavior with varied inputs, but they can take a long time and lack reliability. Concolic analyzers explore program behavior comprehensively, offering higher accuracy but requiring more resources. Experiments demonstrate the performance and limitations of each approach, with concolic analysis showing promise despite its resource intensity.
Segmentation faults (segfaults) are a common error that occurs when a program tries to access a restricted area of memory. Segfaults can occur for a wide variety of reasons: dereferencing uninitialized or dangling pointers, out-of-bounds memory accesses, buffer overflows, and more. Segfaults almost always result in unintended program behavior, crashes, and/or security vulnerabilities. Most software developers can remember a time when their program produced a pesky segfault that was both difficult and time-consuming to track down and fix.
Common methods to track down and eliminate segfaults typically rely on having a deep understanding of the target program’s behavior. A programmer can use a debugger like GDB to reproduce the conditions needed to cause a segfault, and then fix said segfault in their source code. A programmer can also write a suite of automated tests that probe their software, ensuring no edge cases result in a segmentation fault.
Even with these measures in place, segfaults still sometimes make their way into release builds of software. Users do not have the same knowledge about the software’s behavior as its programmer does. The user doesn’t know where in the program to look for a segfault. For this reason, I want to explore how to detect segmentation faults when you lack intimate knowledge about a program’s behavior and design.
What approaches are possible?
Without intimate knowledge about a program, its implementation details are often obscure. A lack of documentation, source code, and other resources makes it difficult for the uninformed to begin finding vulnerabilities. There exist a few kinds of automated tools that help address this problem. We will explore each of these tools, comparing how they function, their pros, their cons, and their performance. These tools consist of the following:
- Static Analyzers:
- flawfinder (https://github.com/david-a-wheeler/flawfinder)
- cwe_checker (https://github.com/fkie-cad/cwe_checker)
- cppcheck (https://github.com/danmar/cppcheck)
- Fuzzers:
- AFL++ (https://github.com/AFLplusplus/AFLplusplus)
- Radamsa (https://gitlab.com/akihe/radamsa)
- Concolic Analyzers:
- Weak Pointer Analysis (WPA)
Static analyzers assess both binary machine code and source code without executing the program under analysis. Often, these tools are used to enforce security, coding style, and other best practices. Static analyzers can help us find vulnerabilities by pointing out problematic code segments (either in binary or in source code). However, static analyzers are often considered “noisy”: they will report anything and everything, including the vulnerabilities we are looking for, buried among numerous false positives.
Fuzzers allow you to inject semi-random data into a program to find bugs and security vulnerabilities. They do this by manipulating untrusted input until unintended program behavior occurs. When a fuzzer finds a vulnerability, you know exactly what input is needed to trigger the vulnerability. However, fuzzers are often limited by performance issues: they may run for days without finding the input needed to break an insecure program.
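The core fuzzing loop is simple to sketch. The toy Python below mutates a seed input until a simulated target misbehaves; the target function and its crash condition are hypothetical stand-ins for a real binary under test, not how AFL++ or Radamsa are actually implemented:

```python
import random

def vulnerable_parse(data: bytes) -> None:
    # Hypothetical target: simulate a segfault (as an exception)
    # whenever the input begins with the byte '3'.
    if data[:1] == b"3":
        raise MemoryError("simulated segfault")

def fuzz(target, seed: bytes, rng: random.Random, max_iters: int = 100_000):
    """Randomly mutate one byte of the seed per attempt until the
    target crashes; return the crashing input, or None on give-up."""
    for _ in range(max_iters):
        data = bytearray(seed)
        data[rng.randrange(len(data))] = rng.randrange(256)
        try:
            target(bytes(data))
        except MemoryError:
            return bytes(data)  # a reproducer: rerun the target on this input
    return None

crash = fuzz(vulnerable_parse, b"123", random.Random(0))
print(crash)
```

Note that if the crashing prefix were a longer, more arbitrary byte sequence, the expected number of attempts would grow sharply, which mirrors the runtime variance we observe in the experiments later.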
Concolic analyzers utilize a combination of symbolic execution (treating program variables as manipulatable symbols) and concrete testing (testing on particular inputs). They trace the control flow of the target program, making notes of the input conditions needed to reach a certain state. Concolic analysis lets us explore the entirety of a program’s potential behavior, including when/where vulnerabilities may exist. However, concolic analyzers are quite resource intensive. They must keep track of each potential program state within memory.
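That explore-and-constrain loop can be sketched in miniature. This is a minimal illustration, not how any production engine works: the instrumented `program`, its branch labels, and the brute-force `solve` (standing in for an SMT solver such as Z3) are all our own hypothetical constructions.

```python
def program(x, trace):
    # Instrumented target: records each branch decision in `trace`.
    if x > 100:
        trace.append(("x > 100", True))
        if x % 7 == 0:
            trace.append(("x % 7 == 0", True))
            return "unsafe memcpy reached"   # the state we want to find
        trace.append(("x % 7 == 0", False))
    else:
        trace.append(("x > 100", False))
    return "safe"

def solve(constraints):
    # Stand-in for an SMT solver: brute-force a small input domain.
    preds = {"x > 100": lambda x: x > 100, "x % 7 == 0": lambda x: x % 7 == 0}
    for cand in range(1000):
        if all(preds[name](cand) == want for name, want in constraints):
            return cand
    return None

# Start from an arbitrary concrete input, then flip recorded branch
# conditions to steer execution down unexplored paths.
worklist, seen, found = [0], set(), None
while worklist and found is None:
    x, trace = worklist.pop(), []
    if program(x, trace) == "unsafe memcpy reached":
        found = x
        break
    for i in range(len(trace)):
        flipped = trace[:i] + [(trace[i][0], not trace[i][1])]
        key = tuple(flipped)
        if key not in seen:
            seen.add(key)
            cand = solve(flipped)
            if cand is not None:
                worklist.append(cand)

print(found)  # 105: the smallest input > 100 divisible by 7
```

Each concrete run contributes a path, and negating path constraints yields inputs that reach new states, which is why the approach explores behavior far more systematically than random input mutation.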
A Primer on Weak Pointer Analysis
Weak Pointer Analysis is a form of concolic analysis that focuses on evaluating the behavior of pointers under various simulated program conditions. Its aim is to detect memory pointers that can be manipulated by untrusted input to point outside of their intended boundaries. Such pointers can result in buffer overflows, segfaults, and other memory-related vulnerabilities.
Weak Pointer Analysis improves upon pre-existing approaches to concolic analysis by accelerating performance and avoiding the state explosion problem. This is done using a Bounded Model Checker (BMC), which reduces the total number of program states necessary to explore before the analysis is considered complete. Unreachable states are avoided, as they do not reflect true instances of program behavior.
Other resource-saving measures are implemented by Weak Pointer Analysis, including a round-robin/sliding window mechanism for assessing program states and a state caching mechanism. For the purposes of this blog post, consider Weak Pointer Analysis an enhanced form of concolic analysis tailored to our segfault-finding use case. Weak Pointer Analysis will act as a stand-in for concolic analysis in general as we compare it to the other segfault-finding tools.
To compare the tools, we designed three experiments. The first compares the accuracy of static analysis to concolic/Weak Pointer Analysis. The second and third compare the performance and runtime of fuzzing to concolic/Weak Pointer Analysis.
Static Analysis: Fast, but Noisy
For the first experiment, we have written and compiled a set of programs that allow us to measure and compare the accuracy of flawfinder, cwe_checker, cppcheck, and WPA. Each binary program contains various function definitions, all of which use the memcpy function. The memcpy function is potentially dangerous because it can result in a buffer overflow/segfault. This can occur if the memory in a large buffer is copied into a smaller buffer. If memcpy is used properly, it will not result in a buffer overflow/segfault. It is up to the developer to ensure proper usage of memcpy. Developer error, however, can easily result in a vulnerable usage of memcpy.
The source code presented to the right illustrates one of these programs. In it, one function is unsafe and results in a segfault, while the other two are safe.
In each sample binary program, all the functions except one are safe (they do not result in memory corruption via a buffer overflow). The size of each program is defined by the number of safe-to-use functions it defines. For example, size 10 indicates that the program contains 10 safe functions and 1 unsafe function (12 functions in total, when including main()). Each program calls all its defined functions in order, with the unsafe function being called last. Although each program has some number of potentially dangerous functions, each has only a single use of memcpy that can result in a vulnerability/segfault. A surface-level analysis would be incapable of distinguishing between a safe use of memcpy and a dangerous one.
For each of these programs, we ran flawfinder, cwe_checker, cppcheck, and WPA on either the binary or source code representation of the file (depending on the kind of input required by the tool). We recorded runtimes, total reported vulnerabilities/weaknesses, and accuracy rates. This data is summarized in the tables below.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int LARGE = 131072;
const int SMALL = 4;

// causes a heap overflow and segfault
void heap_overflow() {
    printf("Performing unsafe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * LARGE);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, LARGE);
    free(src);
    free(dst);
}

// performs a safe heap operation
void heap_safe_1() {
    printf("Performing safe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * SMALL);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, SMALL);
    free(src);
    free(dst);
}

// performs a safe heap operation
void heap_safe_2() {
    printf("Performing safe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * SMALL);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, SMALL);
    free(src);
    free(dst);
}

int main() {
    heap_safe_1();
    heap_safe_2();
    heap_overflow();
}
Size (# of functions) | Runtime (sec) | Total Reported Items | Vulnerable Reported Items | Accuracy
--- | --- | --- | --- | ---
5 | 0.031559 | 6 | 1 | 16.67%
10 | 0.030592 | 11 | 1 | 9.09%
20 | 0.034237 | 21 | 1 | 4.76%
50 | 0.0363287 | 51 | 1 | 1.96%
100 | 0.044598 | 101 | 1 | 0.99%
500 | 0.098342 | 501 | 1 | 0.20%
721 | 0.12650418 | 722 | 1 | 0.14%
Table 1 – flawfinder Accuracy Results
Size (# of functions) | Runtime (sec) | Total Reported Items | Vulnerable Reported Items | Accuracy
--- | --- | --- | --- | ---
5 | 0.0074203 | 9 | 2 | 22.22%
10 | 0.009376 | 14 | 2 | 14.29%
20 | 0.0119152 | 24 | 2 | 8.33%
50 | 0.02241158 | 54 | 2 | 3.70%
100 | 0.0396542 | 104 | 2 | 1.92%
500 | 0.261377 | 504 | 2 | 0.40%
721 | 0.476597 | 725 | 2 | 0.28%
Table 3 – cppcheck Accuracy Results
Size (# of functions) | Runtime (sec) | Total Reported Items | Vulnerable Reported Items | Accuracy
--- | --- | --- | --- | ---
5 | 6.8047869 | 19 | 2 | 10.53%
10 | 7.09194 | 34 | 2 | 5.88%
20 | 7.124346 | 64 | 2 | 3.13%
50 | 7.5739443 | 154 | 2 | 1.30%
100 | 8.4504086 | 304 | 2 | 0.66%
500 | 14.143265 | 1504 | 2 | 0.13%
721 | 17.14055538 | 2167 | 2 | 0.09%
Table 2 – cwe_checker Accuracy Results
Size (# of functions) | Runtime (sec) | Total Reported Items | Vulnerable Reported Items | Accuracy
--- | --- | --- | --- | ---
5 | 56.37603 | 1 | 1 | 100%
10 | 63.9488 | 1 | 1 | 100%
20 | 74.829524 | 1 | 1 | 100%
50 | 89.03786 | 1 | 1 | 100%
100 | 108.076774 | 1 | 1 | 100%
500 | 292.4567799 | 1 | 1 | 100%
721 | 312.5943696 | 1 | 1 | 100%
Table 4 – Weak Pointers Accuracy Results
Accuracy is measured as the number of vulnerable reported items divided by the total number of reported items. You can think of this as a measure of the “noisiness” of each tool: out of everything the tool reports, how much reflects a truly troublesome vulnerability?
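Concretely, the accuracy figures in the tables follow directly from this ratio. A quick Python check using two rows from the data above:

```python
def accuracy(vulnerable: int, total: int) -> float:
    # Accuracy = vulnerable reported items / total reported items, as a percentage.
    return 100.0 * vulnerable / total

# flawfinder on the size-500 sample: 1 real finding among 501 reports.
print(f"flawfinder: {accuracy(1, 501):.2f}%")  # 0.20%
# Weak Pointer Analysis: its single report was the real finding.
print(f"WPA: {accuracy(1, 1):.2f}%")           # 100.00%
```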
As an example of what the static analyzers output, consider the output shown on the right from flawfinder when run against the size 500 program.
Flawfinder, as well as each of the other static analyzers, reported all uses of memcpy as unsafe, even though only one of them was truly unsafe. Some of the static analyzers also reported the same unsafe usage of memcpy twice, hence the 2 in some of the Vulnerable Reported Items columns in the tables above.
This data implies that static analysis, whether using binary or source code as input, is rather noisy. Although static analysis tools are fast, they report anything and everything, which makes it quite difficult to distinguish the valuable findings from the unimportant ones. This is further illustrated in the diagram below:
Accuracy of Each Tool per Binary Size
No matter the size of the target program, concolic analysis only reported the single truly vulnerable usage of memcpy that results in a segfault. This is shown in the output from the Weak Pointers analysis to the right, which indicates that the usage of memcpy found at program address 0x1200 is exploitable.
Concolic analysis is capable of this because it models runtime behavior, whereas static analysis does not. This capability lets us differentiate the exploitable from the benign, ultimately resulting in the significantly higher accuracy shown in the data, albeit at a cost in runtime.
Blackbox Fuzzing: Sometimes Reliable
As demonstrated above, static analyzers often have low accuracy because they fail to account for runtime behavior. Fuzzers do not have this problem: they dynamically analyze a program’s behavior at runtime by generating malformed inputs. However, fuzzers often have severe performance limitations. They must explore a wide variety of potential inputs before landing on one that results in unintended program behavior.
We designed two experiments which compare the performance of fuzzing to concolic/Weak Pointer Analysis. For each experiment, we compiled a set of target binary samples. The first set of binaries contains broad CFGs, whereas the second contains deep CFGs. We will explain what “broad” and “deep” mean in this context shortly. The first experiment uses the broad CFG binaries, and the second uses the deep CFG binaries.
NOTE: A control-flow graph (CFG) is a representation of the flow of control within a program during its execution. Each node in the CFG represents a block of code in the program. Each edge represents a potential transfer of control/branch.
All binary samples in both sets contain a single poor usage of memcpy that results in a segfault in one of the basic blocks of its CFG. What this means will become clearer below when we discuss the source code used to compile the samples. Each tool will be run against each binary, and the time each tool takes to discover the vulnerable usage of memcpy will be recorded. This will serve as a general measure of each tool’s performance. It will also illustrate how differently shaped CFGs affect the performance of each tool.
Broad CFG
For this experiment, we are defining CFG breadth as the maximum number of edges which extend from a single node in the CFG. This is best understood visually:
As is shown in the figure, most nodes have only one outward edge, whereas a single node has numerous outward edges. This CFG is considered to have a breadth of 10, as the single node with the maximum number of outward edges has 10 in total.
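In code terms, breadth is simply the maximum out-degree over the CFG’s nodes. A small Python sketch over a hypothetical adjacency list (the node names are illustrative, not taken from any real tool):

```python
# CFG for a 5-case switch: a dispatch node fans out to each case block.
cfg = {
    "entry":    ["dispatch"],
    "dispatch": ["case1", "case2", "case3", "case4", "case5"],
    "case1":    ["exit"], "case2": ["exit"], "case3": ["exit"],
    "case4":    ["exit"], "case5": ["exit"],
    "exit":     [],
}

# Breadth = maximum number of outward edges from any single node.
breadth = max(len(successors) for successors in cfg.values())
print(breadth)  # 5
```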
An easy way to produce a very broad CFG is to write a function with a very long switch-case in C. Once compiled, a large switch-case statement results in at least one node with many outward edges.
To this end, we generated various binaries of differing breadth sizes according to the definition of breadth size mentioned above. For each of these binaries, one of the cases in the switch statement results in a heap buffer overflow vulnerability/segfault. When the binary follows this case, it will produce a buffer overflow, segmentation fault, and program crash. The specific case is randomized, and different for each program.
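For a sense of how such samples could be produced, here is a sketch of a generator. This is our own illustrative script, not the exact one used for the experiments; it emits the main() of a sample with the overflowing call placed in one randomly chosen switch arm:

```python
import random

def make_broad_source(n_cases: int, rng: random.Random) -> str:
    """Emit the main() of a C sample whose switch has n_cases arms,
    one of which (chosen at random) calls heap_overflow()."""
    bad = rng.randrange(1, n_cases + 1)
    arms = []
    for i in range(1, n_cases + 1):
        call = "\n        heap_overflow();" if i == bad else ""
        arms.append(f'    case {i}:\n        printf("{i}");{call}\n        break;')
    return (
        "int main() {\n"
        "    char in_str[8];\n"
        "    fgets(in_str, 8, stdin);\n"
        "    switch(atoi(in_str)) {\n"
        + "\n".join(arms) + "\n"
        "    }\n"
        "}\n"
    )

src = make_broad_source(5, random.Random(42))
print(src)
```

Compiling the emitted source (together with the heap_overflow() definition) against a range of n_cases values yields the differently sized binaries used below.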
The source code for a program with size 5 CFG breadth is shown to the right. For this binary, the input 3 results in a segfault.
Each binary reads data from stdin, before selecting the case to follow. This means that user input determines whether a segfault will occur. This makes these binaries prime candidates for fuzzing and concolic/Weak Pointer analysis.
For each program, we ran AFL++, Radamsa, and Weak Pointers until they each found a segmentation fault vulnerability. We recorded the results of this effort, which are summarized in the tables below.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int LARGE = 131072;
const int SMALL = 4;

// causes a heap overflow and segfault
void heap_overflow() {
    printf("Performing unsafe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * LARGE);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, LARGE);
    free(src);
    free(dst);
}

int main() {
    char in_str[8];
    fgets(in_str, 8, stdin);
    switch(atoi(in_str)) {
        case 1:
            printf("1");
            break;
        case 2:
            printf("2");
            break;
        case 3:
            printf("3");
            heap_overflow();
            break;
        case 4:
            printf("4");
            break;
        case 5:
            printf("5");
            break;
    }
}
Size (CFG Breadth) | Mean Time to Find Segfault (sec) | Median Time to Find Segfault (sec) | Number of Samples
--- | --- | --- | ---
5 | 1.22705 | 1.229158 | 10
10 | 1.539695 | 1.640244 | 10
20 | 1.228709 | 1.229786 | 10
50 | 4.550136 | 1.633169 | 10
100 | 2.42164056 | 2.050192 | 10
500 | 19.4467412 | 18.128518 | 10
1000 | 112.776517 | 85.526875 | 10
5000 | 11.9792385 | 11.9792385 | 1
10000 | 82.006280 | 82.006280 | 1
Table 5 – AFL++ Broad Performance Results
Size (CFG Breadth) | Mean Time to Find Segfault (sec) | Median Time to Find Segfault (sec) | Number of Samples
--- | --- | --- | ---
5 | 0.2211975 | 0.12884569 | 10
10 | 1.67807545 | 1.3554393 | 10
20 | 0.04569835 | 0.020627021 | 10
50 | 16.7256742 | 11.99373 | 10
100 | 13.124415 | 7.375611 | 10
500 | 2.6779618 | 2.110273 | 10
1000 | 653.57073388 | 353.841504 | 10
5000 | 4819.5401723 | 4819.5401723 | 1
10000 | 629.6727004 | 629.6727004 | 1
Table 6 – Radamsa Broad Performance Results
Size (CFG Breadth) | Mean Time to Find Segfault (sec) | Median Time to Find Segfault (sec) | Number of Samples
--- | --- | --- | ---
5 | 30.939618 | 29.4254662 | 10
10 | 29.5910989 | 29.461279 | 10
20 | 28.9873888 | 28.85682845 | 10
50 | 29.38255059 | 29.1132495 | 10
100 | 29.59618673 | 29.2723473 | 10
500 | 33.6372462 | 33.7710034 | 10
1000 | 31.0504938 | 31.0986369 | 10
5000 | 38.09562659 | 38.09562659 | 1
10000 | 52.149712324 | 52.149712324 | 1
Table 7 – Weak Pointers Broad Performance Results
For the most part, we ran the tools 10 times per binary to reduce the effect of randomness on the resulting data. You can see that we ran the tools only once for some of the larger binaries to save time.
For the smaller program sizes, fuzzing performed better than concolic analysis. However, these performance differences begin to invert towards the larger sizes. This is especially true in the case of Radamsa, which took over an hour for the size 5,000 program.
The times taken by the fuzzers also appear more spread out than those taken by concolic analysis. The standard deviation of the AFL++ mean times was 41.45 seconds, and for Radamsa it was 1575.87. For Weak Pointers, the standard deviation was just 7.49. This is most likely a consequence of the highly random nature of fuzzing.
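Those standard deviations can be reproduced from the mean-time columns of Tables 5–7 using Python’s statistics module (sample standard deviation):

```python
from statistics import stdev

# Mean times-to-find (seconds) from Tables 5-7, across the broad-CFG sizes.
afl     = [1.22705, 1.539695, 1.228709, 4.550136, 2.42164056,
           19.4467412, 112.776517, 11.9792385, 82.006280]
radamsa = [0.2211975, 1.67807545, 0.04569835, 16.7256742, 13.124415,
           2.6779618, 653.57073388, 4819.5401723, 629.6727004]
wpa     = [30.939618, 29.5910989, 28.9873888, 29.38255059, 29.59618673,
           33.6372462, 31.0504938, 38.09562659, 52.149712324]

for name, times in [("AFL++", afl), ("Radamsa", radamsa), ("Weak Pointers", wpa)]:
    print(f"{name}: {stdev(times):.2f}")
```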
Fuzzing is essentially an improved form of random guessing. It does not guarantee that every branch of the CFG is explored. In fact, the performance of a fuzzer often has less to do with the shape/size of the CFG than with the relationship between input and output. For example, Radamsa took only 0.046 seconds to find the segfault in the size 20 program, faster than for the smaller size 5 and size 10 programs. Why? Because the input which causes the segfault in the size 20 program happens to be just the character 1, which is far more likely to be guessed by a fuzzer than a more arbitrary sequence like 4243. In fact, 4243 was the input required to segfault the size 5,000 program, which explains why Radamsa took so long to find it.
For blackbox fuzzers, the size and shape of the program’s CFG is somewhat meaningless. In the experiment, there is a positive correlation between CFG breadth and fuzzer runtime not because the CFG is any broader, but because bigger programs are more likely to contain a hard-to-guess sequence of bytes. For concolic analysis, the size and shape of the CFG matter much more, as that approach explores the CFG directly using symbolic execution.
Mean Runtime of Each Tool per Binary CFG Breadth Size
There is a notable difference in performance between AFL++ and Radamsa shown in the data. This is likely caused by multiple factors. For one, AFL++ makes use of parallelization. Additionally, the input mutation algorithms used by the two tools are different. In this case, AFL++’s mutation algorithm seems to perform better, but this might not be true in every case.
Both AFL++ and Radamsa accept an example input to mutate when fuzzing. We supplied the value 123 for each. This certainly had a significant impact on the resulting performance data: with this configuration, input sequences similar to 123 are favored during fuzzing. If a target binary segfaults on an input dissimilar to 123, the fuzzers have a much harder time finding it.
Deep CFG
The final experiment is quite like the previous one, this time focusing on CFG depth instead of breadth. Once again, this is best understood visually. Below is an example of a “deep” CFG:
An easy way to produce a binary with a very deep CFG is to write a function with a very long if-else-if statement in C. When compiled, a long if-else-if statement will produce at least one branch of the CFG that is very long. This branch corresponds to the conditions required to end up in the last else-if statement, after all other conditions were not met.
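Depth, in turn, corresponds to the longest path through the CFG. A small Python sketch over a hypothetical if-else-if ladder CFG (node names are illustrative; real CFGs are recovered from the binary):

```python
from functools import lru_cache

# Chain-shaped CFG from an if/else-if ladder: each failed comparison
# falls through to the next test.
cfg = {
    "entry": ["test1"],
    "test1": ["body1", "test2"],
    "test2": ["body2", "test3"],
    "test3": ["body3", "exit"],
    "body1": ["exit"], "body2": ["exit"], "body3": ["exit"],
    "exit":  [],
}

@lru_cache(maxsize=None)
def longest_path(node: str) -> int:
    # Depth = number of nodes on the longest path from this node (CFG is a DAG).
    return 1 + max((longest_path(s) for s in cfg[node]), default=0)

print(longest_path("entry"))  # 6: entry -> test1 -> test2 -> test3 -> body3 -> exit
```

Adding another else-if arm extends the chain of test nodes, so depth grows linearly with the number of arms.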
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int LARGE = 131072;
const int SMALL = 4;

// causes a heap overflow and segfault
void heap_overflow() {
    printf("Performing unsafe heap operation...\n");
    char* src = (char*)malloc(sizeof(char) * LARGE);
    char* dst = (char*)malloc(sizeof(char) * SMALL);
    memcpy(dst, src, LARGE);
    free(src);
    free(dst);
}

int main() {
    char in_str[8];
    fgets(in_str, 8, stdin);
    in_str[strcspn(in_str, "\n")] = '\0';
    if(strcmp(in_str, "1") == 0) {
        printf("1");
    }
    else if(strcmp(in_str, "2") == 0) {
        printf("2");
        heap_overflow();
    }
    else if(strcmp(in_str, "3") == 0) {
        printf("3");
    }
    else if(strcmp(in_str, "4") == 0) {
        printf("4");
    }
    else if(strcmp(in_str, "5") == 0) {
        printf("5");
    }
}
We generated various binaries of differing CFG depths by compiling source code with different-sized if-else-if statements. These program samples are labeled according to the size of their if-else-if statements. For example, size 10 contains an if-else-if statement that is 10 else-ifs long, size 100 contains one that is 100 else-ifs long, etc. The source code above defines a program of this nature with a CFG depth size of 5.
Just like in the breadth experiment, a single, randomly selected block of this if-else-if statement will result in a heap overflow, segfault, and program crash. In the example above, this input is 2. Each binary reads from stdin before selecting the if-else-if block to execute. User input determines whether a segfault occurs.
For each binary program sample, we ran AFL++, Radamsa, and Weak Pointers until they each found a segmentation fault. We recorded the results of this effort, which are summarized in the tables below.
Size (CFG Depth) | Mean Time to Find Segfault (sec) | Median Time to Find Segfault (sec) | Number of Samples
--- | --- | --- | ---
5 | 1.229713 | 1.229758 | 10
10 | 2.9786429 | 1.9524197 | 10
20 | 2.7966087 | 2.366075 | 10
50 | 1.8744916 | 1.551688 | 10
100 | 7.1974167 | 4.50936579 | 10
500 | 12.8920267 | 10.11059761 | 10
1000 | 17.2364633 | 7.702806949 | 10
5000 | 1069.2453219 | 1069.2453219 | 1
10000 | 1195.153937 | 1195.153937 | 1
Table 8 – AFL++ Deep Performance Results
Size (CFG Depth) | Mean Time to Find Segfault (sec) | Median Time to Find Segfault (sec) | Number of Samples
--- | --- | --- | ---
5 | 0.33727478 | 0.33034837 | 10
10 | 12.111697 | 10.371287 | 10
20 | 3.8423036 | 2.107616 | 10
50 | 3.00970213 | 3.0342566 | 10
100 | 16.722832 | 7.03891468 | 10
500 | 232.0650088 | 170.8253835 | 10
1000 | 20.4071374 | 19.534172 | 10
5000 | 599.9072213 | 599.9072213 | 1
10000 | 1511.0234968 | 1511.0234968 | 1
Table 9 – Radamsa Deep Performance Results
Size (CFG Depth) | Mean Time to Find Segfault (sec) | Median Time to Find Segfault (sec) | Number of Samples
--- | --- | --- | ---
5 | 31.23626985 | 31.74081552 | 10
10 | 31.67405188 | 31.46334922 | 10
20 | 31.74081552 | 31.606569886 | 10
50 | 34.288318991 | 34.20289182 | 10
100 | 36.043357849 | 34.86434209 | 10
500 | 41.88138661 | 41.89520418 | 10
1000 | 50.7756350517 | 50.51226115 | 10
5000 | 152.95043087 | 152.95043087 | 1
10000 | 227.660034418 | 227.660034418 | 1
Table 10 – Weak Pointers Deep Performance Results
Mean Runtime of Each Tool per Binary CFG Depth Size
The fuzzers performed with much the same degree of randomness as in the breadth experiment. Once again, the size/shape of the CFG had less impact on the fuzzers’ performance than the relationship between input and output did. The bigger binaries typically had more arbitrary byte sequences that resulted in segfaults. For example, the size 10,000 binary segfaults on input 8976. Both Radamsa and AFL++ had quite a difficult time guessing this input, with each taking about 20 minutes.
In the case of concolic analysis, CFG depth appears to have a more significant impact on performance than CFG breadth does. In the breadth experiment, the largest binary took only about 22 seconds longer to analyze than the smallest; in the depth experiment, this difference was about 196 seconds. More research is required to determine why this is the case, and whether it applies in more general circumstances. In any case, concolic analysis handled the deeper CFGs in a more reasonable time than the fuzzers did.
Conclusion
Static analysis is a fast but noisy approach to blackbox segfault detection. It typically can find segfaults, albeit intermixed with a slew of false positives and unnoteworthy findings. Fuzzing addresses these accuracy concerns: when a segfault is found, you know exactly where it is and how to reproduce it. Fuzzing finds exactly what you want, without any noise.
However, fuzzing can be extremely time consuming, and it varies widely in reliability. Sometimes it finds what you need in seconds; sometimes it takes days; other times, it may find nothing at all.
Although resource intensive, concolic/Weak Pointer Analysis offers unique benefits compared to the other approaches. It is more reliable at exploring the potential states a binary program can reach. When Weak Pointer Analysis finishes its assessment, you can feel confident that most of what the target program is capable of has been explored. This is often not the case for fuzzing, where you are leaving vulnerability detection up to random chance.