SUNS-cfr

Created: February 2025

Any questions can be directed to [email protected].

Environment Setup

System Requirements

We recommend using Ubuntu 24.04. Other operating systems may work, but this tutorial is built on Ubuntu 24.04.

Install Docker

SUNS-Dataset requires Docker to run its analysis tools. Follow the Docker installation instructions for your operating system.

If you are using Ubuntu 24.04, you can use the commands below.

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# Install Docker
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Test Docker
sudo docker run hello-world

Add Self to Docker Group

Since we want to use Docker without root, add yourself to the docker group.

Make sure you are not root!

sudo usermod -aG docker `whoami`

Refresh Groups

Log out and log back in so that the new group membership takes effect. Running groups should now list docker, and docker run hello-world should work without sudo.
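The group check can also be scripted. The following is a minimal sketch; the helper name in_docker_group is made up for illustration and is not part of suns-dataset.

```shell
# Hypothetical helper: succeeds if the space-separated group list in $1
# contains an entry that is exactly "docker".
in_docker_group() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -qx docker
}

# id -nG prints the names of all groups of the current user.
if in_docker_group "$(id -nG)"; then
    echo "docker group is active: try 'docker run hello-world' without sudo"
else
    echo "docker group is not active yet: log out and log back in"
fi
```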

Clone suns-dataset from GitHub

If you already have git installed, clone suns-dataset from the GitHub repository.

git clone https://github.com/Program-Understanding/suns-dataset.git

Programs

What is Control Flow Recovery?

In a truly simple program, a disassembler should be able to start from the entry point and trace the code all the way to the end of the program. Unfortunately, real programs contain constructs whose targets are only computed at run time, such as calls through function pointers, and these make tracing difficult. Such constructs are called Indirect Control Flow (ICF).

For our work, we want to create examples that help us discover exactly where the boundaries of current analysis tools lie.

Function Pointer Array

One of the most direct ways of introducing indirection is by creating a function pointer.

Take a look at the following source code.

#include <stdio.h>

int target1(int a, int b) { return a + b; }
int target2(int a, int b) { return a - b; }
int target3(int a, int b) { return a * b; }

typedef int (*fptr)(int, int);

int main() {
    fptr fptr_arr[3] = { target1, target2, target3 };

    char buf[64];
    printf("Input two integers: ");

    if (fgets(buf, sizeof(buf), stdin) == NULL) {
        printf("Error reading input.\n");
        return 1;
    }

    int a, b;
    if (sscanf(buf, "%d %d", &a, &b) != 2) {
        printf("Invalid input. Please enter two integers.\n");
        return 1;
    }

    for (int i = 0; i < 3; i++) {
        int val = fptr_arr[i](a, b);
        printf("%d\t", val);
    }

    printf("\n");
    return 0;
}

The three functions are intentionally named target1, target2, and target3 to make later work easier. We also see an array of pointers! Instead of calling target1, target2, or target3 directly, main places the addresses of these functions in an array and then calls each one through the array with the integers a and b.

Directory Layout

There are various software constructs that produce indirection. A function pointer array is one of the most direct ways of producing it, but other constructs include callbacks, switch statements, inheritance, et cetera.

Since we created an example that creates a function pointer array, we will save it in suns-dataset/control_flow_recovery/common_challenges/arrays_of_function_pointers.

For better organization, it might be useful to create a directory here to distinguish it from other challenges with the same construct. I will name mine enya-fptr_array.

Compilation

Follow this command sequence to name and compile the files.

gcc -g -O2 enya-fptr_array.c -o enya-fptr_array
cp enya-fptr_array enya-fptr_array-stripped
strip enya-fptr_array-stripped
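To confirm that the stripped copy really lost its symbols, the description printed by the file utility can be inspected. This is a small sketch; the helper name classify is made up for illustration, and it assumes file is installed.

```shell
# Hypothetical helper: classify the description printed by `file -b`.
# "not stripped" must be tested first because it also contains "stripped".
classify() {
    case "$1" in
        *"not stripped"*) echo "not stripped" ;;
        *stripped*)       echo "stripped" ;;
        *)                echo "unknown" ;;
    esac
}

# Report on both binaries if they exist in the current directory.
for bin in enya-fptr_array enya-fptr_array-stripped; do
    if [ -f "$bin" ]; then
        echo "$bin: $(classify "$(file -b "$bin")")"
    fi
done
```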

Challenge

  1. Create a directory in suns-dataset/control_flow_recovery/common_challenges/arrays_of_function_pointers/ named YOUR_NAME-fptr_array.
  2. Create a C file that implements a function pointer array named YOUR_NAME-fptr_array.c. Try different parameters or different functions!
  3. Compile with -O2 -g and test your program to make sure it works.

CFR File

To create a research question, you first need a CFR file.

CFR.json

To work with the tools in control_flow_recovery/analysis, you will need a proper -cfr.json file. Create one as follows:

touch enya-fptr_array-stripped-cfr.json

Notice the -cfr.json at the end of the file name. The parser in the tools depends on this suffix. Start from the following template:

{
    "program":"enya-fptr_array-stripped",
    "question":"What are the file offsets for the instructions that are the targets of the '$INSTRUCTION' instruction at file offset '$OFFSET' ?",
    "groundtruth": [],
    "evaluation":"set"
}

The four required fields are program, question, groundtruth, and evaluation. The program field is the path to the binary, relative to the -cfr.json file. The question must use the exact wording shown, because the tools currently match it with a regex. The groundtruth field contains the answers to the question, and the evaluation field gives the format of those answers.

Other helpful fields can be added as an aid to a human viewer.
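Before running the repository's own validator, a quick text check can catch a forgotten field. The sketch below only greps for the four key names; the helper name missing_cfr_fields is made up for illustration, and the check does not verify that the file is valid JSON.

```shell
# Hypothetical helper: print each required field whose key is absent
# from the file given as $1. Prints nothing when all four are present.
missing_cfr_fields() {
    for field in program question groundtruth evaluation; do
        grep -q "\"$field\"[[:space:]]*:" "$1" || echo "$field"
    done
}

cfr_file=enya-fptr_array-stripped-cfr.json
if [ -f "$cfr_file" ]; then
    missing_cfr_fields "$cfr_file"
fi
```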

Formatting Questions and Finding Ground Truth

Formulating the question is easy when the program is simple, and fortunately we are mostly interested in minimal examples.

Objdump is a great tool for this. The good news is that, since we are working with ELF binaries compiled by GCC, the addresses objdump prints for the code already match the file offsets. Even better, stripping a binary does not change the code sections, so the offsets found in the non-stripped binary also apply to the stripped one.

objdump -d enya-fptr_array
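Whether the printed addresses equal file offsets for a particular build can be checked against the section headers. The sketch below assumes readelf from binutils is available; the helper name text_addr_vs_offset is made up for illustration. It compares the Address and Off columns of the .text section.

```shell
# Hypothetical helper: read `readelf -W -S` output on stdin and print
# "match" if the .text address equals its file offset, else "differ".
text_addr_vs_offset() {
    awk '{
        for (i = 1; i <= NF; i++) if ($i == ".text") {
            addr = $(i + 2)            # Address column
            off = $(i + 3)             # Off column
            sub(/^0+/, "", addr)       # drop leading zeros before comparing
            sub(/^0+/, "", off)
            print ((addr == off) ? "match" : "differ")
        }
    }'
}

if [ -f enya-fptr_array ]; then
    readelf -W -S enya-fptr_array | text_addr_vs_offset
fi
```

If this prints differ, the addresses from objdump would need to be converted before they are used as file offsets.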

Since the function calls are in main, look there for a call through a pointer. In objdump's default AT&T syntax, an indirect call has an asterisk in front of its operand, which makes it easy to spot. In our case, it was call *(%rbx) at offset 0x11b5.
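In a longer listing, the indirect calls can be filtered out instead of being found by eye. A minimal sketch, assuming objdump's default AT&T syntax; the helper name indirect_calls is made up for illustration.

```shell
# Hypothetical helper: keep only the lines of a disassembly (on stdin)
# whose call operand starts with "*", i.e. the indirect calls.
# "call[a-z]*" also covers the "callq" spelling of older binutils.
indirect_calls() {
    grep -E 'call[a-z]*[[:space:]]+\*'
}

if [ -f enya-fptr_array ]; then
    objdump -d enya-fptr_array | indirect_calls
fi
```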

The targets of these calls are similarly easy to find: in the same objdump output, the three target functions start at 0x1310, 0x1320, and 0x1330.
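Because the functions were named target1 through target3, their start offsets can be pulled out of the non-stripped disassembly automatically. This is a sketch under that naming assumption; the helper name target_offsets is made up for illustration.

```shell
# Hypothetical helper: from a disassembly on stdin, print the start address
# of every function named target<N> as a 0x-prefixed hex number.
# Function headers look like "0000000000001310 <target1>:".
target_offsets() {
    sed -nE 's/^0*([0-9a-f]+) <target[0-9]+>:$/0x\1/p'
}

if [ -f enya-fptr_array ]; then
    objdump -d enya-fptr_array | target_offsets
fi
```

The printed values are what goes into the groundtruth list.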

We can now update our -cfr.json file with these offsets.

{
    "program":"enya-fptr_array-stripped",
    "question":"What are the file offsets for the instructions that are the targets of the 'call rbx' instruction at file offset '0x11b5' ?",
    "groundtruth": ["0x1310", "0x1320", "0x1330"],
    "evaluation":"set"
}

Run the validator script to confirm that the analysis tools will be able to find your CFR file.

cd ~/suns-dataset/control_flow_recovery/analysis
./validate_cfr_files

Then use ./show_me_all_the_questions to verify that at least one tool can process your question.

./show_me_all_the_questions

Challenge

  1. Create your own CFR file
  2. Validate your CFR file with the two helper scripts (make sure it works!)
  3. Once you have obtained good results with the two validators, push your new program directory to GitHub

Analysis

Building Tools

In control_flow_recovery/analysis/tools, you will see various tools, each of which supports certain questions. The questions a tool can handle are listed in the questions.json file in that tool's directory.

Building the tools is as simple as running the ./build script in each directory.

For example:

cd ~/suns-dataset/control_flow_recovery/analysis/tools/ghidra-xrefs
./build
cd ~/suns-dataset/control_flow_recovery/analysis/tools/angr-cfg
./build
cd ~/suns-dataset/control_flow_recovery/analysis/tools/jakstab-cfg
./build
cd ~/suns-dataset/control_flow_recovery/analysis/tools/sja
./build
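The same builds can be run in a loop. This is a sketch, not a script shipped with the repository; the helper name build_all_tools is made up for illustration, and it assumes every tool directory contains an executable build script as described above.

```shell
# Hypothetical helper: run ./build in every subdirectory of $1 that has an
# executable build script, and report failures without stopping the loop.
build_all_tools() {
    for dir in "$1"/*/; do
        if [ -x "$dir/build" ]; then
            echo "Building $dir"
            (cd "$dir" && ./build) || echo "build failed in $dir" >&2
        fi
    done
}

tools_dir="$HOME/suns-dataset/control_flow_recovery/analysis/tools"
if [ -d "$tools_dir" ]; then
    build_all_tools "$tools_dir"
fi
```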

Running Analysis

After you've built the tools, you can run ./analyze in suns-dataset/control_flow_recovery/analysis. You can use the --help option to see the options available.

In general, you should run analyses on programs you have built. If you are following the example, this is as simple as ./analyze --only_cfr enya-fptr_array-stripped. Note that the argument is the name of the CFR file with the -cfr.json suffix removed.
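The name to pass can be derived from the CFR file name with shell parameter expansion. A small sketch; the helper name cfr_name is made up for illustration.

```shell
# Hypothetical helper: turn a path to a -cfr.json file into the name
# expected by --only_cfr.
cfr_name() {
    file_name="${1##*/}"                      # drop any leading directories
    printf '%s\n' "${file_name%-cfr.json}"    # drop the -cfr.json suffix
}

cfr_name enya-fptr_array-stripped-cfr.json
```

The result can then be used as ./analyze --only_cfr "$(cfr_name enya-fptr_array-stripped-cfr.json)".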

The results can be found in results/ in the same directory.

Examples

Finding Difficult Call and Target Addresses Using GDB

x