In this article I walk you through why Ghidra needs better Version Tracking Correlators for usable patch diffing and how I implemented a new Version Tracing Correlator as a Ghidra Extension.
Update 1: Seriously, just skip this article and go right to the new active project on https://github.com/threatrack/ghidra-patchdiff-correlator.
The project is moving so fast that most of the info in this article is now obsolete.
Ghidra’s Version Tracking only has Correlators that return perfect1 matches, i.e., matches with a similarity score of 1.000
. This makes it hard to find changed functions. Especially, when you compare binaries without symbol names.
Currently, the best workflow (I found) for patch diffing in Ghidra is:
Update 1: This workflow is rubbish! Just run the Automatic Version Tracking Command. It will automagically run all (and only those that are) necessary correlators in the correct order. The problem in, however, still persists.
Exact Function * Match
Correlators.Match Table Filters
).Duplicate Function Instructions Match
Correlator. This is needed because the previous Correlators only return matches with only 1 destination function. Here we get 1:N mappings.Match Table Filters
).Exact Symbol Name Match
Correlator. If symbols are available!Exact Function * Match
Correlators for changes. If symbols are available this is a 1:1 comparison.The problem: If in step 7 no symbols are available in step 8 we have a N:M comparison, with N and M being the unmatched functions in the Source and Destination Programs. This can quickly get out of hand.
Ghidra is lacking a correlator that would assign these remaining functions a similarity score.
To this end, I’ll implement some Program Correlators that facilitate this non-perfect similarity matching.
So far I will start with two basic ideas:
The Bulk Instruction Program Correlator
will make an unordered bulk list of Instructions occurring in a function.
Let’s say we have the function:
PUSH EBP
MOV EBP,ESP
SUB ESP,0x8
MOV ESP,EBP
POP EBP
RET
Then the Correlator would “bulk” this to the following list of features:
PUSH EBP
MOV EBP,ESP
SUB ESP,0x8
MOV ESP,EBP
POP EBP
RET
If we now have a function:
PUSH EBP
MOV EBP,ESP
MOV ESP,EBP
POP EBP
RET
With features:
PUSH EBP
MOV EBP,ESP
MOV ESP,EBP
POP EBP
RET
It would match 5 out of 6 features of the earlier function.
The matching is unordered - hence the notion of “bulk”.
So a function of (warning: doesn’t make sense):
POP EBP
MOV EBP,ESP
MOV ESP,EBP
SUB ESP,0x8
PUSH EBP
RET
Would still match 6 of 6 with the original function, because of the unordered bulk comparison logic.
Eventually, proper state-of-the-art CFG based algorithms should be implemented, but as a start this is already an improvement over the binary-only Correlators.
The Bulk Basic Block Program Correlator
isn’t implemented yet, but it will be the same as the Bulk Instruction Correlator, but the features in the bulk comparison will be basic block hashes.
Unfortunately, the GhidraDev Eclipse plugin does not have a template for a ProgramCorrelator
implementation. So we have to figure the Extension point out ourselves.
The extension points are:
ProgramCorrelator
AddressCorrelator
ProgramCorrelatorFactory
AddressCorrelatorFactory
Stringable
Validator
ProgramCorrelator
correlates functions. AddressCorrelator
correlates data. A Validator
is used in the Pre-Check during starting of a Version Tracking Session.
To get Ghidra to recognize your Correlator you need to implement a ProgramCorrelatorFactory
.
From the existing Program Correlators I will use the ExactMatchBytesProgramCorrelatorFactory.java and the returned FunctionMatchProgramCorrelator.java as a template.
So we need to implement a ProgramCorrelatorFactory
that returns a ProgramCorrelator
which in its doCorrelate(VTMatchSet matchSet, TaskMonitor monitor)
function adds found matches to the delivered VTMatchSet matchSet
.
You can find the implementation at https://github.com/threatrack/ghidra-patchdiff-correlator/.
The BulkInstructionProgramCorrelatorFactory returns a BulkProgramCorrelator that gets a InstructionFunctionBulker as a parameter (this is similar to how the Ghidra included “Exact” type Correlators are organized).
The InstructionFunctionBulker does the main work. It iterates over the instructions and returns a list of hashes. The hash lists for each function are then compared and corresponding VT Matches added by the BulkProgramCorrelator.
To add the basic block Correlator only a new BasicBlockFunctionBulker
and corresponding Factory
must be implemented.
The implementation isn’t optimized yet and makes some odd decisions, such as hashing the instructions via calling hashCode()
on the instruction strings.
But it works!
You can install a first pre-release of the Ghidra Extension via: File
-> Install Extensions
and then select the downloaded ghidra_9.1-BETA_DEV_*_PatchDiffCorrelator.zip.
Then in a Version Tracking Session you can select the Bulk Instructions Match
in the Version Tracking Wizard:
It will produce matches between 0.000
and 1.000
Similarity depending on how many instructions are found in both the Source and the Destination Function:
Please note this is a first prototype and likely contains bugs.
We can now use the following improved workflow:
Update 1: This workflow is rubbish! Just run the Automatic Version Tracking Command. It will automagically run all (and only those that are) necessary correlators in the correct order. The problem in, however, still persists.
Exact Function * Match
Correlators.Match Table Filters
).Duplicate Function Instructions Match
Correlator. This is needed because the previous Correlators only return matches with only 1 destination function. Here we get 1:N mappings.Match Table Filters
).Exact Symbol Name Match
Correlator. If symbols are available.Bulk Instructions Match
Correlator.Depending on the patch diffed binary this already speeds up the process somewhat.
However, smarter correlators will be needed!
Update 1: No. There are already smart enough correlators in Ghidra (the Reference Correlators). What was needed was a correlator that shows how different functions are. And this was fixed by this project. But non the less the following still applies:
I hope someone finds this useful and feels inspired to start their own kick-ass correlators!
hashCode()
on the instruction string.ghidra/app/plugin/core/searchmem/mask/MaskGenerator.java
or some other Ghidra included feature.Implement BasicBlockFunctionBulker.java
for more reliable basic block based matching.
Write a Auto Patch Diff Ghidra script (similar to the Auto Version Tracking action) that automates the steps outlined in New Improved Workflow Update 1: runs our new correlator after the Automatic Version Tracking Command has run :).
They are actually near perfect matches as Ghidra masks out changes in references addresses to make matching position independent.↩