One major task in reverse engineering binary code is identifying library code. Because what library code does is usually already known, it is of little interest to an analyst. Hex-Rays developed the IDA F.L.I.R.T. signatures to tackle this problem; Function ID is Ghidra’s equivalent function signature system. Unfortunately, very few Function ID datasets exist for Ghidra: out of the box, it only ships function identification for the libraries supplied with Visual Studio.

Ghidra already comes with everything you need to automate Function ID dataset generation. In this article I outline how I use Ghidra’s headless analyzer and the Function ID pre- and post-analysis scripts to automatically generate Function ID datasets for the static libraries in the CentOS repositories.

My code: (research code, so please read this article before trying to use it)

Function ID

Ghidra’s Function ID can automatically identify functions by hashing the masked1 function bytes.
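The core idea behind the masking can be sketched in a few lines of shell. This is illustrative only: Function ID’s real hashing scheme is more involved, and the byte layout below (a 3-byte opcode followed by a 4-byte address operand) is a made-up example.

```shell
#!/bin/sh
# Hash a function's bytes after masking out the address operand.
# Hypothetical layout for this demo: 3 opcode bytes, 4 address bytes, 1 ret byte.
mask_and_hash() {
    # Zero the 8 hex digits of the address operand before hashing.
    printf '%s' "$1" | sed -E 's/^(.{6}).{8}/\100000000/' | sha256sum | cut -d' ' -f1
}

# The "same" function linked at two different addresses:
h1=$(mask_and_hash "488b0511223344c3")
h2=$(mask_and_hash "488b05deadbeefc3")
echo "$h1"
echo "$h2"
```

Because the differing address bytes are zeroed before hashing, both invocations produce the same hash, which is what lets Function ID recognize the same function across binaries linked at different addresses.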

Analyzing a binary that statically links OpenSSL may look like this:

Decompilation without Function ID

An analyst who wants to know what this code does would have to figure out, by manual analysis, that some of these functions belong to OpenSSL.

If Ghidra knows the function hashes of the statically linked OpenSSL library, it can identify the OpenSSL functions:

Decompilation with Function ID

A match for a function shows what library matched the function’s hash. In this case it was openssl-static with version 1.0.2k from the 19.el7.x86_64 release:

Function ID match

This match is based on a Function ID dataset I automatically generated.

Getting static library data

Linux distribution repositories are a very quick way to get a large amount of static library code. To this end, I started by downloading all RPM packages with -static- in their name from the CentOS repositories.

This step is fully automated by a download script in my repository.

After running it, we have a folder rpms with:

|-- atlas-static-3.10.1-12.el7.i686.rpm
|-- atlas-static-3.10.1-12.el7.x86_64.rpm
|-- audit-libs-static-2.4.5-6.el6.i686.rpm
|-- audit-libs-static-2.4.5-6.el6.x86_64.rpm
|-- audit-libs-static-2.8.4-4.el7.i686.rpm
|-- zlib-static-1.2.7-18.el7.i686.rpm
`-- zlib-static-1.2.7-18.el7.x86_64.rpm

0 directories, 462 files
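The selection logic boils down to filtering a repository package listing for -static- names and turning the matches into download URLs. Here is a sketch; the package listing is a made-up excerpt (normally obtained with repoquery or from the mirror’s Packages index), and the mirror URL is invented for illustration.

```shell
#!/bin/sh
# Hypothetical excerpt of a repository package listing.
cat > packages.txt <<'EOF'
audit-libs-2.8.4-4.el7.x86_64.rpm
audit-libs-static-2.8.4-4.el7.x86_64.rpm
zlib-1.2.7-18.el7.x86_64.rpm
zlib-static-1.2.7-18.el7.x86_64.rpm
EOF

# Keep only the -static- packages and build download URLs from them.
mirror="http://mirror.example.org/centos/7/os/x86_64/Packages"  # made-up mirror
grep -- '-static-' packages.txt | sed "s|^|$mirror/|" > urls.txt
cat urls.txt
# The actual download would then be something like: wget -P rpms -i urls.txt
```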

Next, we must unpack the RPMs and extract the object files from the library archives (.a) they contain.

This step is fully automated by extraction scripts in my repository.

After running them, we have a folder el with:

|-- el6.i686
|   |-- audit-libs-static
|   |   `-- 2.4.5
|   |       `-- 6.el6.i686
|   |           |-- auditd-config.o
|   |           |-- audit_logging.o
|   |           |-- auparse.o
`-- el7.x86_64
    |-- uwsgi-router-static
    |   `-- 2.0.16
    |       `-- 1.el7.x86_64
    `-- zlib-static
        `-- 1.2.7
            `-- 18.el7.x86_64
                |-- adler32.o
                |-- compress.o
                |-- uncompr.o
                `-- zutil.o

1231 directories, 154850 files

More specifically, el has the folder structure el{6,7}.{i686,x86_64}/libname/version/release/*.o.
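The extraction step can be sketched like this. Unpacking the RPM itself would use rpm2cpio and cpio; since no RPMs are at hand here, a synthetic .a archive with dummy members stands in, because the interesting part is extracting the archive members into the el layout with ar. The library name and paths below are illustrative.

```shell
#!/bin/sh
set -e
# Extract all object files from a static library archive into the
# el/<dist>/<libname>/<version>/<release>/ layout used later on.
# (Unpacking the RPM first would be: rpm2cpio foo.rpm | cpio -idm)
extract_objects() {
    archive=$1; dest=$2
    mkdir -p "$dest"
    # ar x extracts into the current directory, so cd into the target first.
    (cd "$dest" && ar x "$archive")
}

# Synthetic stand-in: build a dummy libz.a from two fake "object files".
workdir=$(mktemp -d)
printf 'dummy' > "$workdir/adler32.o"
printf 'dummy' > "$workdir/compress.o"
ar rcs "$workdir/libz.a" "$workdir/adler32.o" "$workdir/compress.o"

extract_objects "$workdir/libz.a" "el/el7.x86_64/zlib-static/1.2.7/18.el7.x86_64"
ls el/el7.x86_64/zlib-static/1.2.7/18.el7.x86_64
```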

This is the folder structure expected by Ghidra’s CreateMultipleLibraries.java. This script automatically collects the function hashes from a directory structure with the above layout and populates a Function ID dataset with them.

You could now follow $GHIDRA_HOME/Ghidra/Features/FunctionID/data/building_fid.txt and import the libraries manually. However, the large number of object files calls for full automation.

Automating import with Ghidra headless

The full help for the Ghidra headless analyzer is at $GHIDRA_HOME/support/analyzeHeadlessREADME.html. My import script automates the import of the above directory structure.

The command is basically:

analyzeHeadless "$GHIDRA_PROJ" el-fidb -import el/el7.x86_64 -recursive \
    -preScript FunctionIDHeadlessPrescript.java \
    -postScript FunctionIDHeadlessPostscript.java \
    -processor x86:LE:64:default -cspec gcc

FunctionIDHeadlessPrescript.java and FunctionIDHeadlessPostscript.java are scripts included with Ghidra that configure the analysis options (for example, disabling Function ID matching itself) so they are suitable for headless Function ID generation.

After this you should have the el directory structure with all the .o files in the Ghidra el-fidb project.
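To cover all four architecture trees, a small wrapper can map each directory to its processor specification and drive analyzeHeadless. A sketch, shown as a dry run that only prints the commands; FunctionIDHeadlessPrescript.java and FunctionIDHeadlessPostscript.java are the Function ID scripts shipped with Ghidra, and $GHIDRA_PROJ is your project directory.

```shell
#!/bin/sh
# Dry run: print one analyzeHeadless invocation per architecture tree.
# Remove the echo (and the redirect) to actually run the imports.
GHIDRA_PROJ=${GHIDRA_PROJ:-/tmp/ghidra_proj}
for tree in el6.i686 el6.x86_64 el7.i686 el7.x86_64; do
    case $tree in
        *.i686)   proc="x86:LE:32:default" ;;
        *.x86_64) proc="x86:LE:64:default" ;;
    esac
    echo analyzeHeadless "$GHIDRA_PROJ" el-fidb -import "el/$tree" -recursive \
        -preScript FunctionIDHeadlessPrescript.java \
        -postScript FunctionIDHeadlessPostscript.java \
        -processor "$proc" -cspec gcc
done > cmds.txt
cat cmds.txt
```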

WARNING: This will take forever!

WARNING: You must use Ghidra 9.1-DEV or later. Earlier versions have a bug with x86_64 relocations.

To speed up the process, it is advisable to delete the libraries from el that you do not need Function ID datasets for. My repository contains a script that moves all files to el7.{i686,x86_64}.full so you can copy only a selected set of libraries back to el7.{i686,x86_64}.
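The pruning trick amounts to moving the full tree aside and copying back only what you need. A sketch with a synthetic tree (the two library directories are created only for the demonstration; substitute the libraries you actually care about):

```shell
#!/bin/sh
set -e
# Synthetic stand-in for an already-extracted el7.x86_64 tree.
mkdir -p el/el7.x86_64/openssl-static/1.0.2k/19.el7.x86_64
mkdir -p el/el7.x86_64/zlib-static/1.2.7/18.el7.x86_64

# Move everything aside, then copy back only the selected libraries.
mv el/el7.x86_64 el/el7.x86_64.full
mkdir el/el7.x86_64
cp -r el/el7.x86_64.full/openssl-static el/el7.x86_64/

ls el/el7.x86_64
```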

For starters, I would recommend you make a folder under el where you only have one library, e.g., a directory structure like:

`-- el7.x86_64.test
    `-- openssl-static
        `-- 1.0.2k
            `-- 19.el7.x86_64
                |-- a_bitstr.o
                |-- a_bool.o
                |-- a_bytes.o
                |-- x_spki.o
                |-- xts128.o
                |-- x_val.o
                |-- x_x509a.o
                `-- x_x509.o

Then run the import script on el/el7.x86_64.test to import the openssl-static 1.0.2k 19.el7.x86_64 library.

Generating Function ID dataset headless

Once you have imported a subdirectory of the el directory, we can use the script in ghidra_scripts/ to generate the Function ID dataset. It is a modification of Ghidra’s included CreateMultipleLibraries.java that makes it work headless, and a wrapper script automates the Function ID dataset generation.

The script basically calls:

analyzeHeadless "$GHIDRA_PROJ" el-fidb -noanalysis -scriptPath ghidra_scripts \
    -preScript <modified CreateMultipleLibraries.java> \
    log/duplicate_results.txt true fidb "/el7.x86_64.test" log/common.txt x86:LE:64:default

When you run the generation script on el/el7.x86_64.test (and you followed the previous import instructions for el7.x86_64.test), Ghidra will generate a Function ID dataset as fidb/el7.x86_64.test.fidb that contains the function hashes of the libraries you imported from the el/el7.x86_64.test folder.

For the openssl-static 1.0.2k 19.el7.x86_64 library the generated Function ID dataset is openssl-static-1.0.2k-19.el7.x86_64.fidb.

I will upload more Function ID datasets to Ghidra once the analysis has completed.

Summary and Future work

While running these four scripts will (in theory) generate Function ID datasets for all static libraries available in the CentOS repositories, the generated datasets are likely not perfect.

$GHIDRA_HOME/Ghidra/Features/FunctionID/data/building_fid.txt mentions using a cleanup script to remove wrapper functions and other dirt from the generated Function ID dataset. However, that script is specific to the Visual Studio libraries. For these CentOS libraries, a hand-tuned equivalent would be needed to clean up the generated hashes to perfection.

Before distributing a .fidb, you should repack it to defragment it.

The next step is a full run over all the CentOS static libraries. (I previously imported lots of libraries; however, I then stumbled on a Ghidra bug that rendered the imported data useless.)

The scripts can be modified to work with many other libraries.

If you have a good list of widely used static libraries for Windows or in firmware please contact me, so I can tackle that list next. The CentOS repo case was just a test to familiarize myself with Ghidra headless processing and Function ID dataset generation.

Another worthwhile effort is to run such automated analysis on a server via Ghidra server.

  1. Ghidra masks addresses so the hash is independent of the address of the function.