In this article I walk you through why Ghidra needs better Version Tracking Correlators for usable patch diffing and how I implemented a new Version Tracing Correlator as a Ghidra Extension.

Update 1: Seriously, just skip this article and go right to the new active project on https://github.com/threatrack/ghidra-patchdiff-correlator.

The project is moving so fast that most of the info in this article is now obsolete.

Problem

Ghidra’s Version Tracking only has Correlators that return perfect1 matches, i.e., matches with a similarity score of 1.000. This makes it hard to find changed functions. Especially, when you compare binaries without symbol names.

Current Best Workflow

Currently, the best workflow (I found) for patch diffing in Ghidra is:

Update 1: This workflow is rubbish! Just run the Automatic Version Tracking Command. It will automagically run all (and only those that are) necessary correlators in the correct order. The problem in, however, still persists.

  1. Run the Exact Function * Match Correlators.
  2. Accept the found matches (because they are perfect matches we are certain these are the same functions).
  3. Exclude the accepted functions from the list (via the Match Table Filters).
  4. Run the Duplicate Function Instructions Match Correlator. This is needed because the previous Correlators only return matches with only 1 destination function. Here we get 1:N mappings.
  5. Find which of the functions best match and accept the matches. These functions are mostly unimportant wrapper or very short functions.
  6. Exclude the accepted functions from the list (via the Match Table Filters).
  7. Run the Exact Symbol Name Match Correlator. If symbols are available!
  8. Compare the functions that weren’t matched via the Exact Function * Match Correlators for changes. If symbols are available this is a 1:1 comparison.

The problem: If in step 7 no symbols are available in step 8 we have a N:M comparison, with N and M being the unmatched functions in the Source and Destination Programs. This can quickly get out of hand.

Ghidra is lacking a correlator that would assign these remaining functions a similarity score.

Idea

To this end, I’ll implement some Program Correlators that facilitate this non-perfect similarity matching.

So far I will start with two basic ideas:

Bulk Instruction Program Correlator

The Bulk Instruction Program Correlator will make an unordered bulk list of Instructions occurring in a function.

Let’s say we have the function:

PUSH       EBP
MOV        EBP,ESP
SUB        ESP,0x8
MOV        ESP,EBP
POP        EBP
RET

Then the Correlator would “bulk” this to the following list of features:

If we now have a function:

PUSH       EBP
MOV        EBP,ESP
MOV        ESP,EBP
POP        EBP
RET

With features:

It would match 5 out of 6 features of the earlier function.

The matching is unordered - hence the notion of “bulk”.

So a function of (warning: doesn’t make sense):

POP        EBP
MOV        EBP,ESP
MOV        ESP,EBP
SUB        ESP,0x8
PUSH       EBP
RET

Would still match 6 of 6 with the original function, because of the unordered bulk comparison logic.

Eventually, proper state-of-the-art CFG based algorithms should be implemented, but as a start this is already an improvement over the binary-only Correlators.

Bulk Basic Block Program Correlator

The Bulk Basic Block Program Correlator isn’t implemented yet, but it will be the same as the Bulk Instruction Correlator, but the features in the bulk comparison will be basic block hashes.

Implementation

Obligatory development picture.

Obligatory development picture.

Unfortunately, the GhidraDev Eclipse plugin does not have a template for a ProgramCorrelator implementation. So we have to figure the Extension point out ourselves.

The extension points are:

ProgramCorrelator
AddressCorrelator
ProgramCorrelatorFactory
AddressCorrelatorFactory
Stringable
Validator

ProgramCorrelator correlates functions. AddressCorrelator correlates data. A Validator is used in the Pre-Check during starting of a Version Tracking Session.

To get Ghidra to recognize your Correlator you need to implement a ProgramCorrelatorFactory.

From the existing Program Correlators I will use the ExactMatchBytesProgramCorrelatorFactory.java and the returned FunctionMatchProgramCorrelator.java as a template.

So we need to implement a ProgramCorrelatorFactory that returns a ProgramCorrelator which in its doCorrelate(VTMatchSet matchSet, TaskMonitor monitor) function adds found matches to the delivered VTMatchSet matchSet.

You can find the implementation at https://github.com/threatrack/ghidra-patchdiff-correlator/.

Implementation Details

The BulkInstructionProgramCorrelatorFactory returns a BulkProgramCorrelator that gets a InstructionFunctionBulker as a parameter (this is similar to how the Ghidra included “Exact” type Correlators are organized).

The InstructionFunctionBulker does the main work. It iterates over the instructions and returns a list of hashes. The hash lists for each function are then compared and corresponding VT Matches added by the BulkProgramCorrelator.

To add the basic block Correlator only a new BasicBlockFunctionBulker and corresponding Factory must be implemented.

The implementation isn’t optimized yet and makes some odd decisions, such as hashing the instructions via calling hashCode() on the instruction strings.

But it works!

Results

You can install a first pre-release of the Ghidra Extension via: File -> Install Extensions and then select the downloaded ghidra_9.1-BETA_DEV_*_PatchDiffCorrelator.zip.

Then in a Version Tracking Session you can select the Bulk Instructions Match in the Version Tracking Wizard:

Bulk Instructions Match in the Version Tracking Wizard

Bulk Instructions Match in the Version Tracking Wizard

It will produce matches between 0.000 and 1.000 Similarity depending on how many instructions are found in both the Source and the Destination Function:

Two functions with a small change.

Two functions with a small change.

Please note this is a first prototype and likely contains bugs.

New Improved Workflow

We can now use the following improved workflow:

Update 1: This workflow is rubbish! Just run the Automatic Version Tracking Command. It will automagically run all (and only those that are) necessary correlators in the correct order. The problem in, however, still persists.

  1. Run the Exact Function * Match Correlators.
  2. Accept the found matches (because they are perfect matches we are certain these are the same functions).
  3. Exclude the accepted functions from the list (via the Match Table Filters).
  4. Run the Duplicate Function Instructions Match Correlator. This is needed because the previous Correlators only return matches with only 1 destination function. Here we get 1:N mappings.
  5. Find which of the functions best match and accept the matches. These functions are mostly unimportant wrapper or very short functions.
  6. Exclude the accepted functions from the list (via the Match Table Filters).
  7. Run the Exact Symbol Name Match Correlator. If symbols are available.
  8. Run the Bulk Instructions Match Correlator.
  9. Sort the results based on Similarity. Start searching from similar to non-similar functions.

Depending on the patch diffed binary this already speeds up the process somewhat.

However, smarter correlators will be needed!

Update 1: No. There are already smart enough correlators in Ghidra (the Reference Correlators). What was needed was a correlator that shows how different functions are. And this was fixed by this project. But non the less the following still applies:

I hope someone finds this useful and feels inspired to start their own kick-ass correlators!

Future Work






  1. They are actually near perfect matches as Ghidra masks out changes in references addresses to make matching position independent.