x86-sok

What is x86-sok

This ground truth generation tool is available at https://github.com/junxzm1990/x86-sok. It is essential for reproducing Pang et al.'s papers "Ground Truth for Binary Disassembly is Not Easy" and "SoK: All You Ever Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask".

Installing x86-sok on Docker

The easiest way to run x86-sok is to pull their Docker image, compile your programs inside that environment, and then run the analysis on the host. You can pull their image with: docker pull bin2415/x86_gt:0.1.

To run their Python scripts, you'll need to install the following pip packages: capstone, protobuf, pyelftools, setuptools, and SQLAlchemy. As with any pip installation, I recommend using a virtual environment.

Check out my pip freeze:

beautifulsoup4==4.12.3
capstone==5.0.3
greenlet==3.1.1
protobuf==3.20.0
pyelftools==0.31
setuptools==75.4.0
soupsieve==2.6
SQLAlchemy==2.0.36
typing_extensions==4.12.2
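
To confirm that the required packages are present in the active environment, a quick check along these lines may help. This is a minimal sketch using only the standard library; the function name installed_versions is hypothetical, and the package list is simply the one given above.

```python
from importlib import metadata

# pip distribution names listed in the text above.
REQUIRED = ["capstone", "protobuf", "pyelftools", "setuptools", "SQLAlchemy"]


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'MISSING'}")
```

Any package reported as MISSING can then be installed with pip inside the virtual environment.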

You can mount a volume into the ephemeral Docker container as follows. In my case, I am mounting the host's ~/suns-dataset directory at /suns-dataset inside the container.

docker run --rm -it -v ~/suns-dataset:/suns-dataset bin2415/x86_gt:0.1 /bin/bash
# In the interactive docker container
source ./gcc64.rc
export CFLAGS="-g $CFLAGS" && export CXXFLAGS="-g $CXXFLAGS"

Generating Ground Truth

Say you have a C source file /suns-dataset/icf/src/switchdispatch_fptr.c. You should compile the program with $CC instead of gcc.

root@6ff852a93d92:/suns-dataset/icf/src# $CC -o switchdispatch_fptr switchdispatch_fptr.c
[bbinfo]: DEBUG, the target binary format is: size 64, is big endian 0
Update shuffleInfo Done!
Successfully wrote the ShuffleInfo to the .rand section!

Extract the .rand section from the binary and decompress it to produce switchdispatch_fptr.gt.

objcopy --dump-section .rand=switchdispatch_fptr.gt.gz switchdispatch_fptr
gzip -d switchdispatch_fptr.gt.gz

The final step is to run x86-sok/extract_gt/extractBB.py to produce the final protobuf output. By default the output is written to the /tmp directory; use the -o option to choose another path.

python3 extractBB.py -b ~/suns-dataset/icf/src/switchdispatch_fptr -m ~/suns-dataset/icf/src/switchdispatch_fptr.gt -o ~/suns-dataset/icf/src/gtBlock_switchdispatch_fptr.pb
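
The three steps above (objcopy, gzip, extractBB.py) could be scripted roughly as follows. This is a minimal sketch and not part of x86-sok: the function names ground_truth_commands and generate_ground_truth are hypothetical, and writing the output next to the binary as gtBlock_<name>.pb only mirrors the example command rather than the tool's /tmp default. It also assumes extractBB.py can be invoked by path from the current directory.

```python
import subprocess
from pathlib import Path


def ground_truth_commands(binary, extract_script, output=None):
    """Return the three commands from the text as argv lists."""
    binary = Path(binary)
    gt = binary.with_name(binary.name + ".gt")      # decompressed .rand section
    gz = binary.with_name(binary.name + ".gt.gz")   # raw dump of the .rand section
    if output is None:
        # Assumption: place the result next to the binary, as in the example.
        output = binary.with_name(f"gtBlock_{binary.name}.pb")
    return [
        ["objcopy", "--dump-section", f".rand={gz}", str(binary)],
        ["gzip", "-d", str(gz)],
        ["python3", str(extract_script),
         "-b", str(binary), "-m", str(gt), "-o", str(output)],
    ]


def generate_ground_truth(binary, extract_script, output=None):
    """Run the commands in order, stopping at the first failure."""
    for argv in ground_truth_commands(binary, extract_script, output):
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it easy to print the commands for a dry run before touching any files.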

Ground Truth Format

The project uses protobuf to store information about each binary. The message definitions are in x86-sok/protobuf_def/blocks.proto. The tables below summarize each protobuf message:

module

| Field Name | Type | Label | Tag | Default Value | Description |
|---|---|---|---|---|---|
| fuc | Function | repeated | 1 | N/A | A list of Function messages. |
| text_start | uint64 | optional | 2 | 0 | Starting address of the text section. |
| text_end | uint64 | optional | 3 | 0 | Ending address of the text section. |
| split_block | bool | optional | 4 | false | Indicates if basic blocks should be split by call instructions. |

Function

| Field Name | Type | Label | Tag | Default Value | Description |
|---|---|---|---|---|---|
| va | uint64 | required | 1 | N/A | The virtual address of the function. |
| bb | BasicBlock | repeated | 2 | N/A | A list of basic blocks within this function. |
| calledFunction | CalledFunction | repeated | 3 | N/A | A list of called functions from this function. |
| type | uint32 | optional | 4 | 0 | Type indicator of the function. 0 represents a normal function; 1 represents a dummy. |

Child

| Field Name | Type | Label | Tag | Default Value | Description |
|---|---|---|---|---|---|
| va | uint64 | required | 1 | N/A | The virtual address value. |

Instruction

| Field Name | Type | Label | Tag | Default Value | Description |
|---|---|---|---|---|---|
| va | uint64 | required | 1 | N/A | The virtual address of the instruction. |
| size | uint32 | optional | 2 | 0 | The size of the instruction in bytes. |
| call_type | uint32 | optional | 3 | 0 | Indicates the call type: 1 (direct/indirect), 2 (indirect), 3 (direct). |
| callee | uint64 | optional | 4 | 0 | The virtual address of the callee if it is a call instruction. |
| callee_name | string | optional | 5 | "" | The name of the callee if available. |

CalledFunction

| Field Name | Type | Label | Tag | Default Value | Description |
|---|---|---|---|---|---|
| va | uint64 | required | 1 | N/A | The virtual address of the called function. |

BasicBlock

| Field Name | Type | Label | Tag | Default Value | Description |
|---|---|---|---|---|---|
| va | uint64 | required | 1 | N/A | The virtual address of the basic block. |
| parent | uint64 | required | 2 | N/A | The virtual address of the parent block or parent function reference. |
| child | Child | repeated | 3 | N/A | A list of children blocks. |
| instructions | Instruction | repeated | 4 | N/A | A list of instructions contained in this basic block. |
| size | uint32 | optional | 5 | 0 | The actual size of the basic block, excluding any padding bytes. |
| padding | uint32 | optional | 6 | 0 | The number of padding bytes after the block’s instructions. |
| type | uint32 | optional | 7 | 0 | Indicates the type or classification of the block. Possible values represent different types of control flow instructions or states (calls, jumps, returns, etc.). |
| terminate | bool | optional | 8 | false | Indicates if the block contains a terminating instruction (for example, an illegal instruction that causes execution to stop). |
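
To make the message layout concrete, here is a minimal sketch of loading a ground truth file and counting what it contains. It assumes you have compiled blocks.proto with protoc into a Python module; the module name blocks_pb2 and the class name blocks_pb2.module are assumptions based on the message names above, so the message class is passed in as a parameter rather than imported. The helper names load_ground_truth and summarize are hypothetical.

```python
def load_ground_truth(path, module_cls):
    """Parse a serialized ground truth file into a new `module_cls` message.

    `module_cls` is expected to be the protobuf class generated for the
    `module` message (assumed to be `blocks_pb2.module`).
    """
    message = module_cls()
    with open(path, "rb") as handle:
        message.ParseFromString(handle.read())
    return message


def summarize(module):
    """Count functions, basic blocks, and instructions in a parsed module."""
    functions = list(module.fuc)
    blocks = [bb for func in functions for bb in func.bb]
    instructions = sum(len(bb.instructions) for bb in blocks)
    return {
        "functions": len(functions),
        "blocks": len(blocks),
        "instructions": instructions,
    }


# Example usage, assuming the generated module exists:
#   import blocks_pb2
#   gt = load_ground_truth("/tmp/gtBlock_fptr.pb", blocks_pb2.module)
#   print(summarize(gt))
```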

Using the Ground Truth Format

As an example, let's use this ground truth format to compare how well various tools recover jump tables, as is done in x86-sok/compare/compareJmpTable.py.

In the ground truth file, you can iterate over each function and, within each function, over each basic block. The BasicBlock.type field can classify a block as a JUMP_TABLE block. The last instruction of such a block is considered the terminator. The successors are then extracted from the block's child fields, whose VAs correspond to the jump targets.
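
The traversal described above can be sketched as follows. This is not the code from compareJmpTable.py, only an illustration of the idea. The function name jump_table_targets is hypothetical, and the numeric value of the JUMP_TABLE block type is not given here, so it is passed in as the jump_table_type parameter; look it up in the x86-sok sources before using this.

```python
def jump_table_targets(module, jump_table_type):
    """Map each jump-table terminator address to its sorted successor addresses.

    `module` is a parsed ground truth message (or any object with the same
    attributes). `jump_table_type` is the BasicBlock.type value that marks a
    JUMP_TABLE block; its actual value is an assumption left to the caller.
    """
    tables = {}
    for func in module.fuc:
        for bb in func.bb:
            if bb.type != jump_table_type or not bb.instructions:
                continue
            terminator = bb.instructions[-1]  # last instruction ends the block
            tables[terminator.va] = sorted({child.va for child in bb.child})
    return tables
```

Running the same extraction over a tool's output gives two dictionaries keyed by terminator address, which can then be compared entry by entry.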

If the binary is a PIE, the addresses reported by a tool (as opposed to the ground truth) are normalized by subtracting the disassembler's base address so that they match the ground truth's address space.
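
As a small illustration of that normalization, consider the sketch below. The function name normalize_addresses and the base address used in the example are hypothetical; the real base depends on the disassembler in question.

```python
def normalize_addresses(addresses, base_address, is_pie):
    """Shift tool-reported addresses into the ground truth's address space.

    For a PIE, the tool's load base is subtracted from every address.
    For a non-PIE binary the addresses are returned unchanged.
    """
    if not is_pie:
        return list(addresses)
    return [addr - base_address for addr in addresses]


# Example with a hypothetical load base of 0x400000:
tool_targets = [0x401136, 0x401150]
print([hex(a) for a in normalize_addresses(tool_targets, 0x400000, is_pie=True)])
# ['0x1136', '0x1150']
```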

Limitations

One limitation is the rigid treatment of function and basic block boundaries. LLVM and GCC both tend towards similar definitions, but the ground truth inherits whatever definition the compiler uses, so it is biased towards the compilers' view of where functions and basic blocks begin and end.

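Another limitation is that the ground truth does not have perfect code coverage, especially when it comes to linker-emitted code. When OracleGT encounters such a region, it skips it from consideration. In the compareInsts.py run below, for example, __libc_csu_init is not in the ground truth, so its instructions are skipped.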

(x86-sok) enya@reverse-gpu-2:~/x86-sok/compare$ python3 compareInsts.py -b ~/suns-dataset/icf/fptr -g /tmp/gtBlock_fptr.pb -c /tmp/objdump_inst.pb2

WARNING:root:[check ground truth function:] function __libc_csu_init in address 0x400610 not in ground truth

DEBUG:root:[Index 0x0]: It seems that we don't have the instruction's 0x400610 ground truth, let's skip it!

[Result]:The total instruction number is 108
[Result]:Instruction false positive number is 0, rate is 0.000000
[Result]:Instruction false negative number is 0, rate is 0.000000
[Result]:Padding byte instructions number is 9, rate is 0.076923
[Result]:Precision 1.000000
[Result]:Recall 1.000000
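
The figures in this output can be sanity-checked by hand. The sketch below is only an interpretation: how compareInsts.py computes its rates is not shown here, so the standard definitions of precision and recall are assumed, and the padding rate is assumed to be padding / (padding + total) because that reproduces the reported 0.076923. All function names are hypothetical.

```python
def precision(true_pos, false_pos):
    """Fraction of reported instructions that are correct."""
    denom = true_pos + false_pos
    return true_pos / denom if denom else 0.0


def recall(true_pos, false_neg):
    """Fraction of ground truth instructions that were reported."""
    denom = true_pos + false_neg
    return true_pos / denom if denom else 0.0


def padding_share(padding, total):
    """Assumed formula: padding instructions over padding plus real instructions."""
    return padding / (padding + total)


# Numbers taken from the output above.
total, false_pos, false_neg, padding = 108, 0, 0, 9
true_pos = total - false_pos  # with no errors, every instruction is a true positive

print(f"precision {precision(true_pos, false_pos):.6f}")  # precision 1.000000
print(f"recall    {recall(true_pos, false_neg):.6f}")     # recall    1.000000
print(f"padding   {padding_share(padding, total):.6f}")   # padding   0.076923
```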