x86-sok
What is x86-sok
This ground truth generation tool is available at https://github.com/junxzm1990/x86-sok. It is essential for reproducing Pang et al.'s "Ground Truth for Binary Disassembly is Not Easy" and "SoK: All You Ever Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask".
Installing x86-sok on Docker
The easiest way to use x86-sok is to pull the authors' Docker image, compile inside that environment, and then run the analysis scripts on the host. You can pull the image with:
docker pull bin2415/x86_gt:0.1
To run their Python scripts, you'll need to install the following pip packages: capstone, protobuf, pyelftools, setuptools, and SQLAlchemy. As with any pip installation, I recommend using a virtual environment.
For reference, here is my pip freeze output:
beautifulsoup4==4.12.3
capstone==5.0.3
greenlet==3.1.1
protobuf==3.20.0
pyelftools==0.31
setuptools==75.4.0
soupsieve==2.6
SQLAlchemy==2.0.36
typing_extensions==4.12.2
You can mount a host directory as a volume in the ephemeral Docker container as follows. In my case, I am mounting the ~/suns-dataset directory at /suns-dataset inside the container.
docker run --rm -it -v ~/suns-dataset:/suns-dataset bin2415/x86_gt:0.1 /bin/bash
# In the interactive docker container
source ./gcc64.rc
export CFLAGS="-g $CFLAGS" && export CXXFLAGS="-g $CXXFLAGS"
Generating Ground Truth
Say you have a C source file /suns-dataset/icf/src/switchdispatch_fptr.c. You should compile the program with $CC instead of gcc.
root@6ff852a93d92:/suns-dataset/icf/src# $CC -o switchdispatch_fptr switchdispatch_fptr.c
[bbinfo]: DEBUG, the target binary format is: size 64, is big endian 0
Update shuffleInfo Done!
Successfully wrote the ShuffleInfo to the .rand section!
Extract the .rand section from the binary and decompress it to produce switchdispatch_fptr.gt.
objcopy --dump-section .rand=switchdispatch_fptr.gt.gz switchdispatch_fptr
gzip -d switchdispatch_fptr.gt.gz
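The same two steps can also be scripted from Python, which is convenient when processing many binaries. The following is a minimal sketch and not part of x86-sok: the function names are my own, it uses pyelftools (already in the package list above), and it assumes the .rand section holds exactly one gzip stream with no trailing bytes.

```python
import gzip


def decompress_rand_payload(raw: bytes) -> bytes:
    """Decompress the gzip-compressed contents of the .rand section."""
    return gzip.decompress(raw)


def extract_rand_section(binary_path: str, out_path: str) -> None:
    """Dump .rand from an ELF binary and write the decompressed bytes.

    Hypothetical helper that mirrors the objcopy + gzip -d steps.
    """
    # Imported lazily so decompress_rand_payload works without pyelftools.
    from elftools.elf.elffile import ELFFile

    with open(binary_path, "rb") as f:
        section = ELFFile(f).get_section_by_name(".rand")
        if section is None:
            raise ValueError(f"{binary_path} has no .rand section")
        # Read the section bytes while the file is still open.
        raw = section.data()

    with open(out_path, "wb") as out:
        out.write(decompress_rand_payload(raw))
```

For the example above, this would be called as extract_rand_section("switchdispatch_fptr", "switchdispatch_fptr.gt").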
The final step is to run x86-sok/extract_gt/extractBB.py to create the final protobuf output, which is written to the /tmp directory by default or to the path given with the -o option.
python3 extractBB.py -b ~/suns-dataset/icf/src/switchdispatch_fptr -m ~/suns-dataset/icf/src/switchdispatch_fptr.gt -o ~/suns-dataset/icf/src/gtBlock_switchdispatch_fptr.pb
Ground Truth Format
The project uses protobuf to store information about each binary. The message definitions are in x86-sok/protobuf_def/blocks.proto. The following tables summarize the protobuf messages:
module
Field Name | Type | Label | Tag | Default Value | Description |
---|---|---|---|---|---|
fuc | Function | repeated | 1 | N/A | A list of Function messages. |
text_start | uint64 | optional | 2 | 0 | Starting address of the text section. |
text_end | uint64 | optional | 3 | 0 | Ending address of the text section. |
split_block | bool | optional | 4 | false | Indicates if basic blocks should be split by call instructions. |
Function
Field Name | Type | Label | Tag | Default Value | Description |
---|---|---|---|---|---|
va | uint64 | required | 1 | N/A | The virtual address of the function. |
bb | BasicBlock | repeated | 2 | N/A | A list of basic blocks within this function. |
calledFunction | CalledFunction | repeated | 3 | N/A | A list of called functions from this function. |
type | uint32 | optional | 4 | 0 | Type indicator of the function. 0 represents a normal function; 1 represents a dummy. |
Child
Field Name | Type | Label | Tag | Default Value | Description |
---|---|---|---|---|---|
va | uint64 | required | 1 | N/A | The virtual address value. |
Instruction
Field Name | Type | Label | Tag | Default Value | Description |
---|---|---|---|---|---|
va | uint64 | required | 1 | N/A | The virtual address of the instruction. |
size | uint32 | optional | 2 | 0 | The size of the instruction in bytes. |
call_type | uint32 | optional | 3 | 0 | Indicates the call type: 1 (direct/indirect), 2 (indirect), 3 (direct). |
callee | uint64 | optional | 4 | 0 | The virtual address of the callee if it is a call instruction. |
callee_name | string | optional | 5 | "" | The name of the callee if available. |
CalledFunction
Field Name | Type | Label | Tag | Default Value | Description |
---|---|---|---|---|---|
va | uint64 | required | 1 | N/A | The virtual address of the called function. |
BasicBlock
Field Name | Type | Label | Tag | Default Value | Description |
---|---|---|---|---|---|
va | uint64 | required | 1 | N/A | The virtual address of the basic block. |
parent | uint64 | required | 2 | N/A | The virtual address of the parent block or parent function reference. |
child | Child | repeated | 3 | N/A | A list of children blocks. |
instructions | Instruction | repeated | 4 | N/A | A list of instructions contained in this basic block. |
size | uint32 | optional | 5 | 0 | The actual size of the basic block, excluding any padding bytes. |
padding | uint32 | optional | 6 | 0 | The number of padding bytes after the block’s instructions. |
type | uint32 | optional | 7 | 0 | Indicates the type or classification of the block. Possible values represent different types of control flow instructions or states (calls, jumps, returns, etc.). |
terminate | bool | optional | 8 | false | Indicates if the block contains a terminating instruction (for example, an illegal instruction that causes execution to stop). |
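To make the nesting of these messages concrete, here is a rough sketch of loading a ground truth file and walking module, Function, BasicBlock, and Instruction. It assumes a blocks_pb2 Python module generated from blocks.proto is importable from the directory you pass in; load_module and summarize are names I made up for illustration. The walk itself only relies on the field names from the tables, so the stand-in object at the bottom lets you try it without a .pb file.

```python
from types import SimpleNamespace


def load_module(pb_path, protobuf_def_dir):
    """Parse a ground truth file into a blocks_pb2.module message."""
    import sys
    sys.path.append(protobuf_def_dir)
    import blocks_pb2  # assumption: generated from blocks.proto with protoc

    module = blocks_pb2.module()
    with open(pb_path, "rb") as f:
        module.ParseFromString(f.read())
    return module


def summarize(module):
    """Count functions, basic blocks, and instructions in a module message."""
    n_funcs = n_blocks = n_insts = 0
    for func in module.fuc:          # repeated Function
        n_funcs += 1
        for bb in func.bb:           # repeated BasicBlock
            n_blocks += 1
            n_insts += len(bb.instructions)  # repeated Instruction
    return n_funcs, n_blocks, n_insts


# Stand-in with the same attribute names as the protobuf messages.
demo = SimpleNamespace(fuc=[
    SimpleNamespace(va=0x401000, bb=[
        SimpleNamespace(va=0x401000, instructions=[0x401000, 0x401004, 0x401008]),
        SimpleNamespace(va=0x40100c, instructions=[0x40100c, 0x401010]),
    ]),
])
print(summarize(demo))
```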
Using the Ground Truth Format
As an example, let's use the ground truth to compare how well various tools recover jump tables, as is done in x86-sok/compare/compareJmpTable.py.
In the ground truth file, you can iterate over each function and, within each function, over each basic block. The BasicBlock.type field can classify a block as a JUMP_TABLE block. The last instruction of such a block is considered the terminator. The successors are then extracted from the block's child fields, whose VAs correspond to the jump targets.
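The steps above can be sketched roughly as follows. This is my own illustration, not the code in compareJmpTable.py. The numeric value that marks a JUMP_TABLE block is not listed in this document, so the caller has to pass it in; the value 99 in the demo is a placeholder, not the real constant.

```python
from types import SimpleNamespace


def collect_jump_tables(module, jump_table_type):
    """Map each jump-table terminator address to its sorted target addresses."""
    tables = {}
    for func in module.fuc:
        for bb in func.bb:
            # Keep only non-empty blocks classified as jump-table blocks.
            if bb.type != jump_table_type or len(bb.instructions) == 0:
                continue
            terminator = bb.instructions[-1]  # last instruction in the block
            tables[terminator.va] = sorted({child.va for child in bb.child})
    return tables


# Stand-in data with the protobuf field names; 99 is a placeholder type value.
PLACEHOLDER_JUMP_TABLE = 99
demo = SimpleNamespace(fuc=[SimpleNamespace(va=0x1000, bb=[
    SimpleNamespace(
        va=0x1000, type=PLACEHOLDER_JUMP_TABLE,
        instructions=[SimpleNamespace(va=0x1000), SimpleNamespace(va=0x1007)],
        child=[SimpleNamespace(va=0x1040), SimpleNamespace(va=0x1020)],
    ),
    SimpleNamespace(va=0x1020, type=0,
                    instructions=[SimpleNamespace(va=0x1020)], child=[]),
])])
print(collect_jump_tables(demo, PLACEHOLDER_JUMP_TABLE))
```

Keying the result by the terminator's address makes it easy to line up the ground truth with a tool's output, since both describe the same indirect jump instruction.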
If the binary is a PIE and the results come from a tool rather than from the ground truth, the tool's addresses are normalized by subtracting the disassembler's base address so that they match the ground truth's address space.
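A minimal sketch of that normalization is shown below. The helper name and the base address in the demo are hypothetical; the real base depends on the disassembler being compared.

```python
def normalize_jump_tables(tables, base_address):
    """Rebase a {terminator_va: [target_va, ...]} mapping by base_address."""
    return {
        terminator - base_address: sorted(t - base_address for t in targets)
        for terminator, targets in tables.items()
    }


# Hypothetical tool output for a PIE loaded at base 0x100000.
tool_tables = {0x101007: [0x101020, 0x101040]}
print(normalize_jump_tables(tool_tables, 0x100000))
```

After rebasing, the tool's mapping can be compared key by key against the mapping built from the ground truth.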
Limitations
One limitation is the rigid treatment of function and basic block boundaries. LLVM and GCC both tend towards similar definitions, but the implementation of these definitions is biased towards the compilers' own definitions.
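Another limitation is that the tool does not have perfect code coverage, especially when it comes to linker-emitted code. When OracleGT encounters such a region, it skips it from consideration. In the run below, for example, __libc_csu_init is not in the ground truth, so the comparison skips it: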
(x86-sok) enya@reverse-gpu-2:~/x86-sok/compare$ python3 compareInsts.py -b ~/suns-dataset/icf/fptr -g /tmp/gtBlock_fptr.pb -c /tmp/objdump_inst.pb2
WARNING:root:[check ground truth function:] function __libc_csu_init in address 0x400610 not in ground truth
DEBUG:root:[Index 0x0]: It seems that we don't have the instruction's 0x400610 ground truth, let's skip it!
[Result]:The total instruction number is 108
[Result]:Instruction false positive number is 0, rate is 0.000000
[Result]:Instruction false negative number is 0, rate is 0.000000
[Result]:Padding byte instructions number is 9, rate is 0.076923
[Result]:Precision 1.000000
[Result]:Recall 1.000000
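To illustrate why skipped regions do not hurt the scores, here is a small sketch that computes precision and recall over instruction addresses while ignoring anything the ground truth does not cover. It is not the logic of compareInsts.py; the function name, the range representation, and the addresses in the demo are all assumptions made for illustration.

```python
def compare_instructions(gt_insts, tool_insts, gt_ranges):
    """Score tool-reported instruction addresses against ground truth.

    gt_ranges lists half-open (start, end) address ranges covered by the
    ground truth; tool addresses outside them are skipped, not penalized.
    """
    gt_insts = set(gt_insts)
    tool_insts = set(tool_insts)

    def covered(addr):
        return any(start <= addr < end for start, end in gt_ranges)

    considered = {a for a in tool_insts if covered(a)}
    tp = len(considered & gt_insts)   # reported and in ground truth
    fp = len(considered - gt_insts)   # reported but not in ground truth
    fn = len(gt_insts - considered)   # in ground truth but not reported
    return {
        "skipped": sorted(tool_insts - considered),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up addresses: 0x2000 lies outside the covered range and is skipped.
demo_result = compare_instructions(
    gt_insts={0x1000, 0x1004, 0x1008},
    tool_insts={0x1000, 0x1004, 0x1008, 0x2000},
    gt_ranges=[(0x1000, 0x100C)],
)
print(demo_result)
```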