EVM unravelled: recovering ABI from bytecode
The year-over-year growth in the use of decentralized applications and smart contracts brings an increasing prominence of security audits in this domain. Such audits are vital in maintaining the robustness and trustworthiness of platforms built on blockchain technologies like the Ethereum Virtual Machine (EVM). In a full black-box assessment—a methodology where the auditor has no knowledge of the system's inner workings—smart contracts can often appear more opaque compared to traditional centralized applications.
This article delves deep into the intricacies of EVM smart contracts, focusing on ABI (Application Binary Interface) recovery from a black-box perspective, unraveling insights into function signatures, attributes, and parameter analysis derived directly from the bytecode.
What's the problem?
To interact with an EVM (Ethereum Virtual Machine) smart contract using a standard Web3 library (for example web3.js or ethers.js), the smart contract's ABI (Application Binary Interface) is necessary. The ABI is a standardized JSON file that defines the smart contract's functions, input parameters and return values.
When compiling Solidity code, for example with solcjs, the --abi parameter is used to generate the ABI corresponding to the bytecode:
$ solcjs test.sol --bin --abi
$ cat test_sol.bin
60806040526000805534801561001457600080fd5b50610108806100246000396000f3fe6080604052348015600f57600080fd5b506004361060285760003560e01c8063b29f083514602d575b600080fd5b60336035565b005b6001600054604291906083565b600081905550565b6000819050919050565b7f4e487b7100000000000000000000000000000000000000000000000000000000600052601160045260246000fd5b6000608c82604a565b9150609583604a565b9250827fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff0382111560c75760c66054565b5b82820190509291505056fea26469706673582212204d516bbfd9b64a055d787ae6c5ea3d1db012cbf1e60017fd2a993eaa40db88cc64736f6c634300080f0033
$ cat test_sol.abi
[{"inputs":[],"name":"doIt","outputs":[],"stateMutability":"nonpayable","type":"function"}]
Due to the costliness of storage space in a blockchain, the Ethereum blockchain only stores the minimum information required for the smart contract to be used, that is: the bytecode. The ABI should be provided by the developers by other means.
Without the ABI, a standard client library cannot interact with a smart contract. In order for a client (the entity interacting with the smart contract) to successfully execute a transaction, it must know the specific “entry points”: the name of the function, the type of inputs required, and what kind of output, if any, to expect. This crucial information is provided by the ABI.
So, where does one typically find the ABI of an already deployed contract? The most common approach is to search for the contract on Etherscan. Most developers upload the ABI there so that people can interact with the contract. Most of the time, the source code is also uploaded to the smart contract's page. The objective is to make the web3 ecosystem more transparent.
However, our quick analysis of the mainnet blockchain on March 15, 2023 revealed some concerning statistics. We focused on contracts holding at least €100 of Ether in their balances – an arbitrary criterion, yet one that likely identifies active contracts. Of the 101,547 contracts meeting this criterion, approximately 50% (51,465) neither expose their ABIs nor their source code on Etherscan. While this does not mean the source code and the ABI are not published elsewhere, it does highlight a transparency issue, given that Etherscan is the primary tool for searching contract information.
In this article, we will try to recover the ABI description of smart contracts by only looking at their bytecode. Keep in mind that we will mostly be interested in contracts compiled with solc, which is the most used compiler for EVM contracts. Other compilers or other languages might give slightly different results.
Disclaimer: to completely understand this article, some prior knowledge of smart contracts and of the EVM is needed. We also strongly recommend that readers open evm.codes in another tab.
How to interact with a contract: a closer look
The transaction
The process of interacting with a smart contract is fairly straightforward but requires certain critical elements to be successful. Let's first recall how one interacts with a contract.
A transaction needs to be sent to the recipient address and can contain data, value, or both. The data contains the function call and has the form Function Signature + Input Parameters. The value represents the number of wei (the smallest denomination of ether) sent alongside the transaction.
Function signature
The function signature is the first four bytes of the result of keccak256("function(parameter1_type,parameter2_type,...)"), where the parameter types are written in their canonical form and separated by commas without spaces.
For example, consider the following function defined in Solidity:
function setCount(uint32 input) public {
count = input;
}
This function's signature will be 0x4ff3eaa4, obtained as follows:
$ Web3.utils.keccak256("setCount(uint32)")
'0x4ff3eaa4476aff8d1766fed7b1db1c9680cce689b76f84bf3b80e7eb81dafb8e'
Note that while libraries like Web3.js require the function name and parameter types to construct the transaction data, it is possible to manually create this data if you have the function signature. This is because the actual function name is not present in the final transaction; instead, it is represented by its unique signature.
Input Parameters
The input parameters are appended to the function signature with the necessary number of leading zeros to encode them on 32 bytes. Simply put, this means the inputs are represented in a standardized format, filling up any unused space with zeros to maintain a consistent length. The encoding is actually a bit more complex than that, but we will not dive into it in this article as the Solidity documentation is already crystal clear.
What should be remembered is that, for the transaction not to be reverted, the ABI must contain the number of parameters the function expects. Also, distinguishing static and dynamic parameters is crucial, as they are encoded differently. Dynamic types include bytes, string, dynamically sized arrays (T[]), and fixed-size arrays or tuples that contain a dynamic type, while everything else is static.
Example
Consider the following contract:
pragma solidity ^0.8.11;
contract HighlyComplexContract {
uint256 count = 0;
function setCount(uint32 input) public {
count = input;
}
}
The ABI will look like this:
[
{
"inputs": [
{
"internalType": "uint32",
"name": "input",
"type": "uint32"
}
],
"name": "setCount",
"outputs": [],
"stateMutability": "nonpayable",
"type": "function"
}
]
The transaction data will be the function signature (0x4ff3eaa4) followed by the parameter encoded on 32 bytes:
const fs = require("fs") // needed for readFileSync below
const Web3 = require("web3")
web3 = new Web3("[...]")
var abi = JSON.parse(fs.readFileSync("test_sol_HighlyComplexContract.abi"));
contract = new web3.eth.Contract(abi)
var methodToCall = contract.methods.setCount(0xdeadbeef)
methodToCall.encodeABI()
'0x4ff3eaa400000000000000000000000000000000000000000000000000000000deadbeef'
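Because the function name never appears in the transaction, the same data can also be forged by hand from the 4-byte signature alone. Below is a minimal Python sketch of that idea. It only handles static unsigned integer parameters, and the helper name forge_calldata is ours, not part of any library:

```python
def forge_calldata(signature_hex: str, *static_values: int) -> str:
    """Build transaction data: 4-byte signature, then each value on 32 bytes."""
    if signature_hex.startswith("0x"):
        signature_hex = signature_hex[2:]
    signature = bytes.fromhex(signature_hex)
    if len(signature) != 4:
        raise ValueError("a function signature is exactly 4 bytes long")
    # to_bytes(32, "big") left-pads with zeros, as expected for unsigned integers
    words = b"".join(value.to_bytes(32, "big") for value in static_values)
    return "0x" + (signature + words).hex()


# setCount(uint32) with 0xdeadbeef, using only the signature computed earlier
print(forge_calldata("0x4ff3eaa4", 0xDEADBEEF))
```

The printed string is identical to the one returned by encodeABI() above, which is why a recovered ABI only needs the signature, the number of parameters and their static or dynamic nature to be usable.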
What do we need in the ABI?
In order to interact with a smart contract's function, the following are necessary: its signature, the number of inputs and their data types (differentiating static and dynamic).
Beyond these essentials, what else is necessary to be sure that our transactions won't be reverted? Let's analyze the ABI specifications.
In the ABI, three types of objects can be defined: a function, an event or an error. For the scope of this discussion, we will focus on functions, as events and errors mainly matter on the client side, which might not be relevant in a black-box audit scenario.
Below are the specs for a function ABI:
- type: function, constructor, receive (the “receive Ether” function) or fallback (the “default” function)
- name: the name of the function
- inputs: an array of objects, each of which contains:
  - name: the name of the parameter
  - type: the canonical type of the parameter
  - components: used for tuple types
- outputs: an array of objects similar to inputs
- stateMutability: a string with one of the following values: pure, view, nonpayable, and payable
The type will be set to function, and to fallback or receive in some cases. The constructor type is not used when interacting with an already deployed contract.
The outputs, which should normally be defined in the ABI, are not required when performing external transaction calls. This value can be omitted from the recovered ABI. (However, the presence of outputs is quite easy to detect: the RETURN opcode will be used at the end of the function, instead of a typical STOP.)
The function has a stateMutability parameter. The value may be:
- pure (does not read or modify blockchain state)
- view (does not modify the blockchain state)
- nonpayable (function does not accept Ether; this is the default)
- payable (accepts Ether)
The fact that a function is pure or view does not modify the transaction data. However, sending a transaction containing Ether (meaning with a positive value) to a nonpayable function will revert the transaction.
So, the payable characteristic needs to be identified in the code and the stateMutability parameter must be defined in the ABI.
Digging in the smart contract
Let's recall the elements to identify in the bytecode:
- Function signature, then its corresponding name
- Function payability
- Number of parameters
- Types of parameters
The function signature
The smart contract code should first extract the first four bytes of the transaction data, which represent the function signature, and then forward execution to the correct code segment depending on the signature. As the EVM has no native opcode or standard way to do that, each compiler may do it in its own way. In this article, we will look at the currently most popular compiler: solc, the official Solidity compiler by the Ethereum Foundation.
To perform the operation mentioned earlier, solc creates what we call a function selector, which is roughly a switch ... case.
The code creating the function selector can be seen at line 326 of ContractCompiler.cpp. The comment summarizes what the method does:
// Code for selecting from n functions without split:
// n times: dup1, push4 <id_i>, eq, push2/3 <tag_i>, jumpi
// push2/3 <notfound> jump
// (called SELECT[n])
The compiler will write the following opcode sequence for each function:
DUP1
PUSH4 <function signature>
EQ
PUSH2 <address>
JUMPI
// Regex: 8063[0-9a-fA-F]{8}1461[0-9a-fA-F]{4}57
This is the sequence we should look for at the beginning of the code. But it is not sufficient to extract the PUSH4 operand as is. The compiler might have added what we call a splitter to improve the smart contract's performance. In fact, solc adds a splitter if there are more than four functions, which is quite common:
// Code for selecting from n functions with split:
// dup1, push4 <pivot>, gt, push2/3<tag_less>, jumpi
// SELECT[n/2]
// tag_less:
// SELECT[n/2]
[...]
if (_ids.size() <= 4)
split = false;
The splitter works like a binary search. The signature is compared with a pivot value: if it is lower than the pivot, the execution flow jumps to the part of the selector handling the lower signatures (tag_less); otherwise, it continues with the part handling the higher ones. Each part is either a small selector, which should match the signature, or another splitter.
The following opcode sequence will then be found:
DUP1
PUSH4 <pivot_value>
GT
PUSHN <selector_address>
JUMPI
So the only difference is the use of GT instead of EQ. The value after the PUSH4 here is not a function signature but an arbitrary pivot value. So every PUSH4 followed by a GT should be discarded from the analysis.
One other exception happens when the code is optimized and a function signature begins with 0x00, for example 0x00fa21d5. In that case, the PUSH4 0x00fa21d5 will be replaced by a PUSH3 0xfa21d5.
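The rules above (keep the PUSH followed by EQ, discard the pivots followed by GT, accept PUSH3 for signatures starting with 0x00) can be sketched in Python with a small linear disassembler instead of a regular expression. This is only an illustration under our own assumptions: the names disassemble and find_signatures are hypothetical, the walker accepts any PUSH size for the jump target (the first bytecode of this article uses a PUSH1 there), and applied to a whole contract it may report false positives from ordinary comparisons located outside the selector.

```python
from typing import Iterator, List, Tuple

PUSH0, PUSH1, PUSH3, PUSH4, PUSH32 = 0x5F, 0x60, 0x62, 0x63, 0x7F
DUP1, EQ, JUMPI = 0x80, 0x14, 0x57


def disassemble(code: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (offset, opcode, immediate data), skipping over PUSH operands."""
    pc = 0
    while pc < len(code):
        op = code[pc]
        size = op - PUSH0 if PUSH1 <= op <= PUSH32 else 0
        yield pc, op, code[pc + 1 : pc + 1 + size]
        pc += 1 + size


def find_signatures(code: bytes) -> List[Tuple[str, int]]:
    """Return (signature, entry point) pairs for DUP1 PUSH3/4 EQ PUSHn JUMPI."""
    ins = list(disassemble(code))
    found = []
    for i in range(1, len(ins) - 3):
        prev_op = ins[i - 1][1]
        _, op, sig = ins[i]
        cmp_op, (_, push_op, target), jump_op = ins[i + 1][1], ins[i + 2], ins[i + 3][1]
        if (
            prev_op == DUP1
            and op in (PUSH3, PUSH4)        # PUSH3 when the signature starts with 0x00
            and cmp_op == EQ                # a GT here would be a splitter pivot
            and PUSH1 <= push_op <= PUSH32
            and jump_op == JUMPI
        ):
            found.append((sig.rjust(4, b"\x00").hex(), int.from_bytes(target, "big")))
    return found


# Beginning of the runtime code of the contract analysed later in this article
selector_code = bytes.fromhex(
    "60806040526004361061002c575f3560e01c"
    "80635428cfc51461003757"
    "8063bfb4ebcf1461005357"
    "61003356"
)
print(find_signatures(selector_code))
```

Walking the opcodes rather than matching raw hex avoids matching a pattern that starts in the middle of a PUSH operand, at the cost of a few more lines.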
The fallback exception
Solidity allows defining a fallback function, which will be called when the transaction data does not match any function signature in the contract. This materializes in the code as a default jump at the end of the selector (instead of a REVERT):
JUMPI // last conditional jump of the selector
PUSH2 <fallback_address>
JUMP
If there is no fallback function, the compiler will make the flow jump to a revert sequence after the selector.
In the ABI, the function type will then be fallback, not function, and the function name will be omitted. However, the stateMutability still needs to be stated. For example, the ABI may look like:
{
"stateMutability": "nonpayable",
"type": "fallback"
},
Keep in mind that a fallback function may accept input parameters and return objects. In this case, it will always accept bytes as inputs and bytes as outputs. However, even if the fallback function accepts inputs, the contract does not check whether calldata contains parameters. Thus, the function will never revert because of missing parameters.
Discovering the function name
Once every function signature is extracted, it is possible to interact with the smart contracts manually by forging external transactions. However, if we want to use a high-level client, we require a comprehensive ABI that includes the actual function name. This could also help in comprehending the purpose of the function within the contract.
There are primarily two strategies for discovering the function name:
- Database Search: One convenient approach is to leverage databases holding information on Ethereum function signatures. The Ethereum Signature Database is one such resource, storing function signatures used in other contracts that have disclosed their sources. Simply input the function signature into the search bar. If the signature has been used before and is available in the database, you will not only find the function name but also obtain detailed information about the number and types of parameters required. However, this approach is contingent on the availability of the function signature in the database.
- Bruteforce: If the database search comes up empty, another option is to brute force the function name. But keep in mind that if you do not know the parameter types, you will have to bruteforce the entire string function(parameter1_type,parameter2_type,...). This process can become exceedingly complex and time-consuming, as it involves guessing not only the function name but also the type and number of parameters.
The function payability
The function payability attribute tells the client whether a transaction sent with a positive Ether value can be handled by the function. In Solidity, that characteristic is defined with the payable keyword. By default, a function is non-payable. So how does that translate into EVM bytecode?
Solc always writes the same sequence to enforce the non-payability of a function: a simple conditional revert after loading the CALLVALUE:
void ContractCompiler::appendCallValueCheck()
{
// Throw if function is not payable but call contained ether.
m_context << Instruction::CALLVALUE;
m_context.appendConditionalRevert(false, "Ether sent to non-payable function");
}
If the transaction value (the number of wei sent), obtained with the CALLVALUE opcode, is equal to 0, the execution continues. If not, the REVERT instruction is reached, reverting the transaction. The bytecode sequence is the following:
CALLVALUE
DUP1
ISZERO
PUSH2 <function_start>
JUMPI
PUSH0 // Sometimes, it might be PUSH1 00
DUP1
REVERT
// Regex: 801561[0-9a-fA-F]{4}575f80fd
This sequence may appear in two cases:
- If at least one of the smart contract's functions is payable, this check is made at the beginning of every non-payable function, so right after its JUMPDEST instruction.
- If every function is non-payable, this check is made once at the beginning of the function selector to avoid redundancy.
The below code defines that behavior:
void ContractCompiler::appendFunctionSelector(ContractDefinition const& _contract)
{
[...]
bool needToAddCallvalueCheck = true;
if (!hasPayableFunctions(_contract) && !interfaceFunctions.empty() && !_contract.isLibrary())
{
appendCallValueCheck(); //First case, the smart contract has no payable functions
needToAddCallvalueCheck = false;
}
[...]
for (auto const& it: interfaceFunctions)
{
[...]
if (!functionType->isPayable() && !_contract.isLibrary() && needToAddCallvalueCheck)
appendCallValueCheck(); // Second case, the check is set on every non-payable function
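To make the check concrete, here is a hedged Python sketch that tests whether the code located at a given offset starts with the call value check. The pattern accepts both PUSH0 and PUSH1 00 before the REVERT, and both PUSH1 and PUSH2 jump targets; the name is_non_payable and the three byte strings are our own illustration, the latter being cut from the two binaries shown in this article.

```python
import re

# CALLVALUE DUP1 ISZERO PUSH1/PUSH2 <dest> JUMPI (PUSH0 | PUSH1 00) DUP1 REVERT
NON_PAYABLE_CHECK = re.compile(
    rb"\x34\x80\x15(?:\x60.|\x61..)\x57(?:\x5f|\x60\x00)\x80\xfd", re.DOTALL
)
JUMPDEST = 0x5B


def is_non_payable(code: bytes, entry: int) -> bool:
    """True if the code at `entry` (after an optional JUMPDEST) rejects Ether."""
    start = entry + 1 if code[entry] == JUMPDEST else entry
    return NON_PAYABLE_CHECK.match(code, start) is not None


# Function entry with the check (PUSH0 variant)
foo_entry = bytes.fromhex("5b34801561005e575f80fd5b50")
# Function entry without the check: it goes straight to parameter decoding
bar_entry = bytes.fromhex("5b6100516004803603")
# Check placed once at the start of the runtime code (PUSH1 00 variant)
global_check = bytes.fromhex("6080604052348015600f57600080fd5b50")

print(is_non_payable(foo_entry, 0))      # True
print(is_non_payable(bar_entry, 0))      # False
print(is_non_payable(global_check, 5))   # True
```

When the check is found once before the selector, as in the third case, every function of the contract can be marked nonpayable in the recovered ABI.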
The receive exception
Another important exception: Solidity allows defining a receive function, which will be called by default if the transaction data is empty. That case arises when the sole purpose of the transaction is to send Ether. Receive functions then handle the Ether value (in wei) sent along with the transaction and perform operations if necessary. This function does not have a name, cannot have arguments, cannot return anything, and must have external visibility and payable state mutability.
In Solidity 0.6.0+, this is a receive function which only accepts Ether without doing anything else:
receive() external payable {}
Therefore, the ABI is always:
{
"stateMutability": "payable",
"type": "receive"
}
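In the bytecode, a receive function is handled at the beginning of the function selector, in the same place as the call value check of the second case mentioned in the previous section (when all functions are non-payable). Here, the sequence checks if the size of the data (thanks to the CALLDATASIZE opcode) is less than 4 (which means it does not contain a function signature) and, if so, it jumps to the receive function. Note that solc also emits this size check when there is no receive function (it is present in the first bytecode of this article), in which case the jump target simply reverts, so the code at the target must be inspected: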
PUSH1 04
CALLDATASIZE
LT // if calldata size < 4, so if it does not contain a function signature...
PUSH2 <receive_address>
JUMPI // ... It jumps to receiver
// Regex: 6004361061[0-9a-fA-F]{4}57
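The size check alone does not prove that a receive function exists, so the code at the jump target has to be looked at. The Python sketch below is a heuristic inferred only from the two binaries shown in this article, not a rule taken from the compiler: with a receive function, the target starts with JUMPDEST CALLDATASIZE (the data must be empty), whereas without one it starts with a plain revert. The function names are hypothetical.

```python
import re
from typing import Optional

# PUSH1 04 CALLDATASIZE LT PUSH1/PUSH2 <dest> JUMPI
SHORT_CALLDATA_JUMP = re.compile(rb"\x60\x04\x36\x10(\x60.|\x61..)\x57", re.DOTALL)
JUMPDEST, CALLDATASIZE = 0x5B, 0x36


def short_calldata_target(code: bytes) -> Optional[int]:
    """Offset reached when the calldata is shorter than 4 bytes, if any."""
    match = SHORT_CALLDATA_JUMP.search(code)
    if match is None:
        return None
    return int.from_bytes(match.group(1)[1:], "big")  # drop the PUSH opcode


def looks_like_receive(code: bytes) -> bool:
    """Heuristic: the short-calldata target re-checks CALLDATASIZE."""
    target = short_calldata_target(code)
    if target is None or target + 1 >= len(code):
        return False
    return code[target] == JUMPDEST and code[target + 1] == CALLDATASIZE


# Start of the runtime code of the contract analysed later (has a receive function)
WITH_RECEIVE = bytes.fromhex(
    "60806040526004361061002c575f3560e01c"
    "80635428cfc51461003757" "8063bfb4ebcf1461005357" "61003356"
    "5b3661003357005b5f80fd"
)
# Start of the runtime code of the first contract of this article (no receive)
WITHOUT_RECEIVE = bytes.fromhex(
    "6080604052348015600f57600080fd5b50"
    "600436106028576000" "3560e01c8063b29f083514602d57"
    "5b600080fd5b"
)

print(looks_like_receive(WITH_RECEIVE), looks_like_receive(WITHOUT_RECEIVE))
```

A contract with both a receive and a fallback function may be laid out differently, so the result should be confirmed by reading the disassembly.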
The number and types of parameters
If parameters are defined in your Solidity function, the solc compiler will perform the following steps for each function:
if (!functionType->parameterTypes().empty())
{
// Parameter for calldataUnpacker
m_context << CompilerUtils::dataStartOffset;
// Append DUP1, CALLDATASIZE, SUB
m_context << Instruction::DUP1 << Instruction::CALLDATASIZE << Instruction::SUB;
CompilerUtils(m_context).abiDecode(functionType->parameterTypes());
}
void CompilerUtils::abiDecode(TypePointers const& _typeParameters, bool _fromMemory)
{
[...]
size_t encodedSize = 0;
for (auto const& t: _typeParameters)
// Calculate the total size of all parameters, after being encoded
encodedSize += t->decodingType()->calldataHeadSize();
// Create a conditional revert if the transaction CALLDATASIZE
// is lower than the previously calculated encoded size
Whiskers templ(R"({
if lt(len, <encodedSize>) { <revertString> }
})");
templ("encodedSize", to_string(encodedSize));
templ("revertString", m_context.revertReasonIfDebug("Calldata too short"));
m_context.appendInlineAssembly(templ.render(), {"len"});
The template defined by abiDecode results in the following instruction set, preceded by the DUP1, CALLDATASIZE, SUB instructions from the ContractCompiler class:
DUP1
CALLDATASIZE // get the data size
SUB // subtract the first 4 bytes (the function signature)
// Regex: 803603
// The logic after that depends on the compiler, below is an example
PUSH1 <minimum_data_size_required>
DUP2
LT // if it's lower than the target...
ISZERO
PUSH2 <function_start>
JUMPI // ... do not jump to the function start ...
PUSH1 00
DUP1
REVERT // ... and revert
So what we should watch for here is the presence of DUP1, CALLDATASIZE and SUB at the beginning of a function. The bytecode will revert the transaction if the data size is lower than expected.
If no CALLDATASIZE opcode is present, it probably means the function does not take any input parameter. In that case, it does not matter whether you send parameters in your transaction: they will be completely ignored by the smart contract.
So how can we determine the number and types of parameters required by a function, given that we know the minimum data size it needs?
Let's take a look at calldataHeadSize(), called in the previous code snippet:
/// If @a _padded then it is assumed that each element is padded to a multiple of 32 bytes.
virtual unsigned calldataEncodedSize([[maybe_unused]] bool _padded) const { solAssert(false, ""); }
/// Convenience version of @see calldataEncodedSize(bool)
unsigned calldataEncodedSize() const { return calldataEncodedSize(true); }
/// @returns the distance between two elements of this type in a calldata array, tuple or struct.
/// For statically encoded types this is the same as calldataEncodedSize(true).
/// For dynamically encoded types this is the distance between two tail pointers, i.e. 32.
/// Always returns a value greater than zero and throws if the type cannot be encoded in calldata.
unsigned calldataHeadSize() const { return isDynamicallyEncoded() ? 32 : calldataEncodedSize(true); }
For statically encoded types, calldataHeadSize() calls calldataEncodedSize(true), which returns the size of the type in bytes. This size is always a multiple of 32 bytes due to padding. For dynamically encoded types, that is when isDynamicallyEncoded() is true, calldataHeadSize() returns 32.
In practice, if a dynamic array is sent in a transaction, the first 32 bytes hold the offset to the start of the actual data. At that offset, we find the number of elements in the array (padded to 32 bytes), then each element (padded to 32 bytes). For a dynamic array, the code therefore only checks that at least 32 bytes are present (that the offset exists). An upper bound check is not performed, precisely because of arrays (see the ABI specs).
For example, if a parameter type is uint[] and the parameter is [7,8,9], the encoding will be:
0000000000000000000000000000000000000000000000000000000000000020 // start of the data, 32 bytes after the start, so right after that
0000000000000000000000000000000000000000000000000000000000000003 // The length of the array
0000000000000000000000000000000000000000000000000000000000000007 // First value: 7
0000000000000000000000000000000000000000000000000000000000000008 // Second value: 8
0000000000000000000000000000000000000000000000000000000000000009 // Third value: 9
So, if you send a single static value where a dynamic array is expected, the size check will pass and the value you sent will be treated as the array offset. The contract will then probably try to read the elements at that offset and revert if it does not find them.
In the case of a static array of static types, the encoding is straightforward. For example, if a parameter type is uint[3] and the parameter is [1,2,3], the encoding will be:
0000000000000000000000000000000000000000000000000000000000000001 // first value: 1
0000000000000000000000000000000000000000000000000000000000000002
0000000000000000000000000000000000000000000000000000000000000003
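Both encodings can be reproduced with a few lines of Python. This is a minimal sketch assuming the array is the only parameter of the function (so the offset of the dynamic array is 0x20) and that all values are unsigned integers; the helper names are ours.

```python
from typing import List, Sequence


def word(value: int) -> bytes:
    """Encode an unsigned integer on 32 bytes, left-padded with zeros."""
    return value.to_bytes(32, "big")


def encode_static_uint_array(values: Sequence[int]) -> bytes:
    """uint[k]: the elements one after the other, nothing else."""
    return b"".join(word(v) for v in values)


def encode_dynamic_uint_array(values: Sequence[int]) -> bytes:
    """uint[] as sole parameter: offset, then length, then the elements."""
    head = word(32)  # the data starts right after this single head slot
    tail = word(len(values)) + b"".join(word(v) for v in values)
    return head + tail


def dump(data: bytes) -> List[str]:
    """Split the encoding into 32-byte words for display."""
    return [data[i : i + 32].hex() for i in range(0, len(data), 32)]


print("\n".join(dump(encode_dynamic_uint_array([7, 8, 9]))))
print("\n".join(dump(encode_static_uint_array([1, 2, 3]))))
```

The first call prints the five words listed for uint[] above and the second the three words listed for uint[3].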
Because everything is always padded to 32 bytes, it is easy to get the number of expected elements.
If the PUSH1 right after CALLDATASIZE, SUB is pushing:
- 32 (0x20) bytes → 1 element is needed
- 64 (0x40) bytes → 2 elements
- 96 (0x60) bytes → 3 elements
- etc.
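The mapping is a plain division by 32, as this small Python sketch shows (the name head_slots is ours). It raises an error when the value is not a multiple of 32, which would suggest the PUSH1 found is not the size check we are looking for.

```python
WORD = 32  # every head element occupies one 32-byte slot


def head_slots(min_data_size: int) -> int:
    """Number of 32-byte elements implied by the minimum calldata size."""
    if min_data_size <= 0 or min_data_size % WORD != 0:
        raise ValueError("expected a positive multiple of 32 bytes")
    return min_data_size // WORD


for size in (0x20, 0x40, 0x60, 0x80):
    print(hex(size), "->", head_slots(size), "element(s)")
```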
What we call element can be:
- A static parameter
- A static value part of a larger static array
- A dynamic array
Indeed, it is important to remember that several elements can mean the function expects several parameters or a static array containing multiple static values. For example, the two functions below will require the exact same minimum calldata size:
function foo(uint32 x, bool y) public {}
function bar(bytes3[2] memory z) public {}
However, precisely because everything is always padded to 32 bytes, it is quite hard to ascertain the right data type. To do so, one must dive into the EVM bytecode logic to understand how the data is handled and deduce its type (this will not be covered in this article). Fortunately, knowing the exact types is rarely necessary: it is possible to write bytes32 in the ABI, and the parameter will be encoded by the client in a way that fits most static data types.
A final exception arises for bytes and string. Because their length cannot be known in advance, the data is right-padded to a multiple of 32 bytes and placed after its length, which is itself left-padded to 32 bytes. So if the data is sent as though it was a bytes32, the encoding will be incorrect and the transaction will most likely revert.
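The difference with a bytes32 is easier to see on an example. The Python sketch below encodes a single bytes (or string) argument, assuming it is the only parameter of the function so that its offset is 0x20; the helper name is hypothetical.

```python
def encode_bytes_argument(data: bytes) -> bytes:
    """Encode one `bytes` parameter: offset, length, then right-padded data."""
    offset = (32).to_bytes(32, "big")          # data starts after the head slot
    length = len(data).to_bytes(32, "big")     # length is left-padded
    padded = data + b"\x00" * (-len(data) % 32)  # content is right-padded
    return offset + length + padded


encoded = encode_bytes_argument(b"abc")
for i in range(0, len(encoded), 32):
    print(encoded[i : i + 32].hex())
```

Three words are produced for a 3-byte value, against a single word for a bytes32, which is why the two types cannot be swapped in the recovered ABI.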
Concrete example
Let's try to apply what we discussed in this article by finding the ABI of a simple compiled contract. Let's compile the following code with solc:
pragma solidity ^0.8.11;
contract HighlyComplexContract {
uint256 count = 0;
function Foo() public {}
function Bar(uint[3] memory a, uint b) public payable {}
receive() external payable {}
}
The compiled binary is the following:
60806040526004361061002c575f3560e01c80635428cfc514610037578063bfb4ebcf1461005357610033565b3661003357005b5f80fd5b610051600480360381019061004c91906101ed565b610069565b005b34801561005e575f80fd5b5061006761006d565b005b5050565b565b5f604051905090565b5f80fd5b5f80fd5b5f601f19601f8301169050919050565b7f4e487b71000000000000000000000000000000000000000000000000000000005f52604160045260245ffd5b6100c682610080565b810181811067ffffffffffffffff821117156100e5576100e4610090565b5b80604052505050565b5f6100f761006f565b905061010382826100bd565b919050565b5f67ffffffffffffffff82111561012257610121610090565b5b602082029050919050565b5f80fd5b5f819050919050565b61014381610131565b811461014d575f80fd5b50565b5f8135905061015e8161013a565b92915050565b5f61017661017184610108565b6100ee565b905080602084028301858111156101905761018f61012d565b5b835b818110156101b957806101a58882610150565b845260208401935050602081019050610192565b5050509392505050565b5f82601f8301126101d7576101d661007c565b5b60036101e4848285610164565b91505092915050565b5f806080838503121561020357610202610078565b5b5f610210858286016101c3565b925050606061022185828601610150565b915050925092905056fea2646970667358221220a86a0dd0f0c2a353297ae9fd8ed73b4032cf9ea9d399258054ae11aa790d3ba764736f6c63430008150033
First, let's identify the number of functions:
$ egrep -o '8063[0-9a-fA-F]{8}1461[0-9a-fA-F]{4}57' HighlyComplexContract.bin
80635428cfc51461003757
8063bfb4ebcf1461005357
There are two functions and their signatures are 5428cfc5 and bfb4ebcf.
Also, note that the first function starts at 0x37 and the second at 0x53. This will be important later.
The name of the second function, Foo(), is easily found with 4byte.directory.
The first one, however, is not publicly known, as the parameters it accepts are uncommon. We will have to keep the signature as it is in the ABI as a placeholder, to use it directly in future transactions. We will call that function 5428cfc5().
Secondly, are the functions non-payable?
$ egrep -ob '801561[0-9a-fA-F]{4}575f80fd' HighlyComplexContract.bin
170:801561005e575f80fd
Only one non-payability check is present, so there is only one non-payable function, but which one? egrep tells us that the match starts at character 170, so at byte 85 (0x55). Because Foo() starts at 0x53, it is the non-payable one.
Thirdly, how many inputs do the functions accept?
$ egrep -ob '803603' HighlyComplexContract.bin
122:803603
Only one function performs a check on CALLDATASIZE: 5428cfc5() (hex(122/2) = 0x3d, which lies between its entry point at 0x37 and the start of Foo() at 0x53). It probably also means that Foo() does not accept any argument.
It is a bit trickier here to find the value of the PUSH1, as the bytecode performs the following sequence:
003D 80 DUP1
003E 36 CALLDATASIZE
003F 03 SUB // Start of the function, CALLDATASIZE retrieves the data size, meaning it will be checked somewhere
0040 81 DUP2
0041 01 ADD
0042 90 SWAP1
0043 61 PUSH2 0x004c
0046 91 SWAP2
0047 90 SWAP1
0048 61 PUSH2 0x01ed
004B 56 JUMP
[...]
01ED 5B JUMPDEST
01EE 5F PUSH0
01EF 80 DUP1
01F0 60 PUSH1 0x80 // Here is the PUSH1
01F2 83 DUP4
01F3 85 DUP6
01F4 03 SUB
01F5 12 SLT
01F6 15 ISZERO
01F7 61 PUSH2 0x0203
01FA 57 JUMPI
This sequence may be found sometimes when the compiler optimizes the bytecode, making the flow jump to the end in order to verify the number of parameters.
What is important to find here is the instruction at 0x1f0: PUSH1 0x80, which means the function expects 4 elements of 32 bytes each (0x80 = 128 = 4 × 32). This value is correct, as the function Bar(uint[3] memory a, uint b) expects a static array of 3 static uint and 1 more uint.
Finally, does the smart contract have fallback or receive functions?
A revert is found at the end of the selector, so it does not contain a fallback function:
0013 63 PUSH4 0x5428cfc5 // First function selector
0018 14 EQ
0019 61 PUSH2 0x0037
001C 57 JUMPI
001D 80 DUP1
001E 63 PUSH4 0xbfb4ebcf // second function selector
0023 14 EQ
0024 61 PUSH2 0x0053
0027 57 JUMPI
0028 61 PUSH2 0x0033
002B 56 JUMP
[...]
0033 5B JUMPDEST
0034 5F PUSH0
0035 80 DUP1
0036 FD REVERT // A JUMP would have been found here in case of a fallback function
However, the initial check on CALLDATASIZE jumps to 0x2c, where the execution stops successfully instead of reverting when the data is empty. This means a receive function is found at 0x2c:
$ egrep -o '6004361061[0-9a-fA-F]{4}57' HighlyComplexContract.bin
6004361061002c57
We can deduce the following functional ABI:
[
{
"inputs": [
{
"name": "whatever",
"type": "bytes32[4]"
}
],
"name": "NOT_FOUND",
"selector": "0x5428cfc5",
"outputs": [],
"stateMutability": "payable",
"type": "function"
},
{
"inputs": [],
"name": "Foo",
"outputs": [],
"stateMutability": "nonpayable",
"type": "function"
},
{
"stateMutability": "payable",
"type": "receive"
}
]
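The egrep searches used in this example can be chained in a single script. The Python sketch below is only an illustration under our own assumptions: it works on the first 105 bytes of the runtime code shown above (enough to contain the selector and both function entries), it attributes each match to the function whose entry point is the closest one before it, and the helper names are hypothetical. Like egrep, it matches raw hexadecimal text and can therefore be fooled by PUSH operands.

```python
import re
from typing import Dict, List, Optional

# First 105 bytes of the runtime code of HighlyComplexContract
RUNTIME_HEX = (
    "60806040526004361061002c575f3560e01c"
    "80635428cfc51461003757" "8063bfb4ebcf1461005357" "61003356"
    "5b3661003357005b5f80fd"
    "5b610051600480360381019061004c91906101ed56"
    "5b61006956" "5b00"
    "5b34801561005e575f80fd"
    "5b5061006761006d56" "5b00"
)


def byte_offsets(pattern: str, hex_code: str) -> List[int]:
    """Like `egrep -ob`, but in bytes, keeping only byte-aligned matches."""
    return [m.start() // 2 for m in re.finditer(pattern, hex_code) if m.start() % 2 == 0]


def owner(offset: int, entries: Dict[str, int]) -> Optional[str]:
    """Signature whose entry point is the closest one at or before `offset`."""
    candidates = [(entry, sig) for sig, entry in entries.items() if entry <= offset]
    return max(candidates)[1] if candidates else None


entries = {
    m.group(1): int(m.group(2), 16)
    for m in re.finditer(r"8063([0-9a-f]{8})1461([0-9a-f]{4})57", RUNTIME_HEX)
}
non_payable = {owner(o, entries) for o in byte_offsets(r"801561[0-9a-f]{4}575f80fd", RUNTIME_HEX)}
with_inputs = {owner(o, entries) for o in byte_offsets(r"803603", RUNTIME_HEX)}

for sig, entry in entries.items():
    print(
        sig,
        hex(entry),
        "nonpayable" if sig in non_payable else "payable",
        "has inputs" if sig in with_inputs else "no inputs",
    )
```

The script reports 5428cfc5 as payable with inputs and bfb4ebcf as nonpayable without inputs, which matches the ABI deduced by hand. The number of inputs and the receive function still have to be read from the disassembly.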
Existing tools
A number of tools exist that can help to extract the Application Binary Interface (ABI) from EVM bytecode. These tools vary in their methodologies, user interface, and feature sets. We included here the ones returning good results during our tests. Feel free to contact us to enhance this list.
Whatsabi
Whatsabi is the most complete free tool to specifically perform this task. It is an open-source command-line tool that can generate a working ABI from an Ethereum smart contract bytecode. Whatsabi works by analyzing the EVM bytecode mostly the way we explained in this article and extracting information about the functions. We strongly encourage people reading this article to try this tool and contribute to its development.
Below is the output Whatsabi v0.8.6 gives for the example above. Note that Whatsabi reports neither the exact number of inputs nor the presence of the receive function, as these are hard to determine without advanced decompilation methods. The rest is very similar:
> code = '6080[...]50033'
> whatsabi.abiFromBytecode(code);
[
{
"type": "function",
"selector": "0x5428cfc5",
"payable": true,
"stateMutability": "payable",
"inputs": [
{
"type": "bytes"
}
]
},
{
"type": "function",
"selector": "0xbfb4ebcf",
"payable": false,
"stateMutability": "view",
"outputs": [
{
"type": "bytes"
}
]
},
{
"type": "event",
"hash": "0x4e487b7100000000000000000000000000000000000000000000000000000000"
}
]
Whatsabi may also try to retrieve the function name:
> const signatureLookup = new whatsabi.loaders.OpenChainSignatureLookup();
> await signatureLookup.loadFunctions("0xbfb4ebcf");
[ 'Foo()' ]
> await signatureLookup.loadFunctions("0x5428cfc5");
[]
Ethereum-dasm
ethereum-dasm also tries to recover the ABI from a static analysis of the bytecode. This functionality has less logic than Whatsabi, but ethereum-dasm also disassembles and tries to decompile the contract.
JEB Decompiler
JEB Decompiler is a robust software analysis tool supporting decompilation for a multitude of languages, including Ethereum EVM bytecode. Its versatile analysis features, along with the EVM plugin, make it possible to interpret and reverse-engineer Ethereum smart contracts to a more comprehensible form, aiding in the extraction of ABIs. However, JEB is not free.
Conclusion and further works
Ethereum smart contracts, while appearing opaque at first glance, can yield a wealth of information if we know how to dissect their underlying bytecode. The function signatures, types, and parameters can be discerned through careful analysis. However, this field still presents limitations that need to be addressed and opportunities for future exploration.
Several challenges lie in accurately recovering an ABI. For example, we can infer the number of parameters from the required data size, but pinpointing the exact data type remains elusive in many cases. Additionally, the tools currently available, although useful, do not fully automate the process. Some exceptions may arise from the use of different compilers and their own optimization processes. We stated some of them but many more exist: initialization code, proxies, interfaces, dynamic jumps, making it hard to identify a whole function code…
In order to further enhance the accuracy and reliability of ABI extraction, dynamic analysis could be incorporated into our approach. This would provide a more comprehensive view of a contract's behavior in response to varying inputs. Additionally, leveraging frameworks like Hardhat could be instrumental in testing and verifying the correctness of the extracted ABIs.