EVM unravelled: recovering ABI from bytecode

Written by Adrien Peter - 10/10/2023 - in Pentest

The year-over-year growth in the use of decentralized applications and smart contracts makes security audits increasingly prominent in this domain. Such audits are vital in maintaining the robustness and trustworthiness of platforms built on blockchain technologies like the Ethereum Virtual Machine (EVM). In a full black-box assessment (a methodology where the auditor has no knowledge of the system's inner workings), smart contracts can often appear more opaque than traditional centralized applications.

This article delves deep into the intricacies of EVM smart contracts, focusing on ABI (Application Binary Interface) recovery from a black-box perspective, unraveling insights into function signatures, attributes, and parameter analysis derived directly from the bytecode.

What's the problem?

To interact with an EVM (Ethereum Virtual Machine) smart contract using a standard Web3 library (for example web3.js or ethers.js), the smart contract's ABI (Application Binary Interface) is necessary. The ABI is a standardized JSON file that defines the smart contract's functions, input parameters and return values.

When compiling Solidity code, for example with solcjs, the --abi parameter is used to generate the ABI corresponding to the bytecode:

$ solcjs test.sol --bin --abi

$ cat test_sol.bin 
60806040526000805534801561001457600080fd5b50610108806100246000396000f3fe6080604052348015600f57600080fd5b506004361060285760003560e01c8063b29f083514602d575b600080fd5b60336035565b005b6001600054604291906083565b600081905550565b6000819050919050565b7f4e487b7100000000000000000000000000000000000000000000000000000000600052601160045260246000fd5b6000608c82604a565b9150609583604a565b9250827fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff0382111560c75760c66054565b5b82820190509291505056fea26469706673582212204d516bbfd9b64a055d787ae6c5ea3d1db012cbf1e60017fd2a993eaa40db88cc64736f6c634300080f0033

$ cat test_sol.abi
[{"inputs":[],"name":"doIt","outputs":[],"stateMutability":"nonpayable","type":"function"}]  

Because storage space on a blockchain is costly, the Ethereum blockchain only stores the minimum information required for the smart contract to be used, that is, the bytecode. The ABI has to be provided by the developers by other means.

Without the ABI, a standard client cannot interact with a smart contract. In order for a client (the entity interacting with the smart contract) to successfully execute a transaction, it must know the specific “entry points”: the name of the function, the type of inputs required, and what kind of output, if any, to expect. This crucial information is provided by the ABI.

So, where does one typically find the ABI of an already deployed contract? The most common approach is to search for the contract on Etherscan. Most developers upload the ABI there so that people can interact with the contract. Most of the time, the source code is also uploaded to the smart contract's page. The objective is to make the web3 ecosystem more transparent.

FTT Token's Etherscan page. The ABI is found on this page.

However, our quick analysis of the mainnet blockchain on March 15, 2023 revealed some concerning statistics. We focused on contracts holding at least €100 of Ether in their balances, an arbitrary criterion, yet one that likely identifies active contracts. Of the 101,547 contracts meeting this criterion, approximately 50% (51,465) expose neither their ABI nor their source code on Etherscan. While this does not mean the source code and the ABI are not published elsewhere, it does highlight a transparency issue, given that Etherscan is the primary tool for searching contract information.

We will try in this article to recover the ABI description of smart contracts by only looking at their bytecodes. Keep in mind that we will mostly be interested in contracts compiled with solc, which is the most used compiler for EVM contracts. Other compilers or other languages might give slightly different results.

Disclaimer: to completely understand this article, some prior knowledge of what a smart contract and the EVM are is needed. We also strongly recommend that readers open evm.codes in another tab.

How to interact with a contract: a closer look

The transaction

The process of interacting with a smart contract is fairly straightforward but requires certain critical elements to be successful. Let's first recall how one interacts with a contract.

A transaction needs to be sent to the recipient address and can contain data or value or both. The data will contain the function call and has the form Function Signature + Input Parameters. The value represents the number of wei (smallest denomination of ether) sent alongside the transaction.

Function signature

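The function signature is the first four bytes of the result of keccak256("function(parameter1_type,parameter2_type,etc)"), where the hashed string contains the function name and the canonical parameter types, without any space.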

For example, consider the following function defined in Solidity:

function setCount(uint32 input) public {
        count = input;
}

This function signature will be 0x4ff3eaa4, obtained as follows:

> Web3.utils.keccak256("setCount(uint32)")
'0x4ff3eaa4476aff8d1766fed7b1db1c9680cce689b76f84bf3b80e7eb81dafb8e'

Note that while libraries like Web3.js require the function name and parameter types to construct the transaction data, it is possible to manually create this data if you have the function signature. This is because the actual function name is not present in the final transaction; instead, it is represented by its unique signature.

Input Parameters

The input parameters are appended to the function signature with the necessary number of leading zeros to encode them on 32 bytes. Simply put, this means the inputs are represented in a standardized format, filling up any unused space with zeros to maintain a consistent length. The encoding is actually a bit more complex than that, but we will not dive into it in this article as the Solidity documentation is already crystal clear.

What should be remembered is that, for the transaction not to be reverted, the ABI must contain the number of parameters the function expects. Also, distinguishing static and dynamic parameters is crucial, as they are encoded differently. Dynamic types include bytes, string, dynamically sized arrays (T[]), and fixed-size arrays or tuples that contain a dynamic type, while everything else is static.

Example

Consider the following contract:

pragma solidity ^0.8.11;

contract HighlyComplexContract {
    uint256 count = 0;

    function setCount(uint32 input) public {
        count = input;
    }
}

The ABI will look like this:

[
  {
    "inputs": [
      {
        "internalType": "uint32",
        "name": "input",
        "type": "uint32"
      }
    ],
    "name": "setCount",
    "outputs": [],
    "stateMutability": "nonpayable",
    "type": "function"
  }
]

The transaction data will be the function signature (0x4ff3eaa4) followed by the parameter encoded on 32 bytes:

const fs = require("fs")
const Web3 = require("web3")
web3 = new Web3("[...]")
var abi = JSON.parse(fs.readFileSync("test_sol_HighlyComplexContract.abi"));
contract = new web3.eth.Contract(abi)
var methodToCall =  contract.methods.setCount(0xdeadbeef)
methodToCall.encodeABI()
'0x4ff3eaa400000000000000000000000000000000000000000000000000000000deadbeef'
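As noted above, the same data can be forged by hand once the function signature is known. The following Python sketch illustrates this for the simplest case only: static unsigned integer parameters, each left-padded to 32 bytes. The helper name forge_calldata is ours and does not belong to any library.

```python
def forge_calldata(selector_hex, *uint_args):
    """Concatenate a 4-byte function signature with 32-byte encoded integers.

    Illustrative helper: only static unsigned integer parameters are handled.
    """
    if selector_hex.startswith("0x"):
        selector_hex = selector_hex[2:]
    selector = bytes.fromhex(selector_hex)
    if len(selector) != 4:
        raise ValueError("a function signature is exactly 4 bytes long")
    # Each static value is left-padded with zeros to fill a 32-byte word
    words = b"".join(arg.to_bytes(32, "big") for arg in uint_args)
    return "0x" + (selector + words).hex()


data = forge_calldata("0x4ff3eaa4", 0xDEADBEEF)
print(data)
```

The resulting string is the one returned by encodeABI() above, built without any ABI file.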

What do we need in the ABI?

In order to interact with a smart contract's function, the following are necessary: its signature, the number of inputs and their data types (differentiating static and dynamic).

Beyond these essentials, what else is necessary to be sure that our transactions won't be reverted? Let's analyze the ABI specifications.

In the ABI, three types of objects can be defined: a function, an event or an error. For the scope of this discussion, we will focus on functions, as events and errors mainly drive client-side actions, which might not be relevant in a black-box audit scenario.

Below are the specs for a function ABI:

  • type: function, constructor, receive (the “receive Ether” function) or fallback (the “default” function)

  • name: the name of the function

  • inputs: an array of objects, each of which contains:

    • name: the name of the parameter

    • type: the canonical type of the parameter

    • components: used for tuple types

  • outputs: an array of objects similar to inputs

  • stateMutability: a string with one of the following values: pure, view, nonpayable, and payable

The type will be set to function, and to fallback or receive in some cases. The constructor type is not used when interacting with an already deployed contract.

The outputs, which should normally be defined in the ABI, are not required when performing external transaction calls, so this value can be omitted from the recovered ABI. The presence of outputs is nevertheless quite easy to detect: the RETURN opcode is used at the end of the function instead of the usual STOP.

The function has a stateMutability parameter. The value may be:

  • pure (does not read or modify blockchain state)
  • view (does not modify the blockchain state)
  • nonpayable (function does not accept Ether – the default)
  • payable (accepts Ether)

The fact that a function is pure or view does not modify the transaction data. However, sending a transaction containing Ether (meaning with a positive value) to a nonpayable function will revert the transaction.

So, the payable characteristic needs to be identified in the code and the stateMutability parameter must be defined in the ABI.

Digging in the smart contract

Let's recall the elements to identify in the bytecode:

  • Function signature, then its corresponding name
  • Function payability
  • Number of parameters
  • Types of parameters

The function signature

The smart contract code should first extract the first four bytes of the transaction data, which represent the function signature, and then forward execution to the correct code segment depending on that signature. As the EVM has no native opcode or standard way to do that, each compiler may have its own way of doing it. In this article, we will look at the currently most popular compiler: solc, the official Solidity compiler by the Ethereum Foundation.

To perform the operation mentioned earlier, solc creates what we call a function selector, which is roughly a switch ... case.

The code creating the function selector can be seen at line 326 of ContractCompiler.cpp. The comment summarizes what the method does:

// Code for selecting from n functions without split:
//   n times: dup1, push4 <id_i>, eq, push2/3 <tag_i>, jumpi
//   push2/3 <notfound> jump
// (called SELECT[n])

The compiler will write the following opcode sequence for each function:

DUP1
PUSH4 <function signature>
EQ
PUSH2 <address>
JUMPI
// Regex: 8063[0-9a-fA-F]{8}1461[0-9a-fA-F]{4}57

This is the sequence we should look for at the beginning of the code. But it is not sufficient to extract the PUSH4 opcode value as is. The compiler might have added what we call a splitter to enhance the smart contract's performance. In fact, solc will add a splitter if there are more than four functions, which is quite common:

// Code for selecting from n functions with split:
//   dup1, push4 <pivot>, gt, push2/3<tag_less>, jumpi
//     SELECT[n/2]
//   tag_less:
//     SELECT[n/2]
[...]
if (_ids.size() <= 4)
	split = false; 

The splitter works like a binary search. The signature is compared with a pivot value and, depending on the result, the execution flow is redirected either to a small selector which should match the signature, or to another splitter.

The following opcode sequence will then be found:

DUP1
PUSH4 <pivot_value>
GT
PUSHN <selector_address>
JUMPI

So the only difference is the use of GT instead of EQ. The value after the PUSH4 here is not a function signature but an arbitrary pivot value. So all PUSH4 should be discarded from the analysis if they are followed by a GT.

Another exception happens when the code is optimized and a function signature begins with 0x00, for example 0x00fa21d5. In that case, the PUSH4 0x00fa21d5 will be replaced by a PUSH3 0xfa21d5.
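The rules above can be sketched as a small Python scanner. This is only an illustration under several assumptions of ours: the bytecode is given as a hexadecimal string, the jump address may be pushed with PUSH1 as well as PUSH2 (very small contracts, such as the test.sol bytecode shown earlier, use PUSH1), and a plain regular expression is enough to locate the selector. Pivots are ignored because they are followed by GT (0x11) instead of EQ (0x14). The last sequence used below is hypothetical and was built by hand to show the pivot and PUSH3 cases.

```python
import re

# DUP1, PUSH4 <signature> (or PUSH3 when a leading 0x00 was optimized away),
# EQ, PUSH2 or PUSH1 <address>, JUMPI
SELECTOR_RE = re.compile(
    r"80(?:63([0-9a-f]{8})|62([0-9a-f]{6}))14(?:61([0-9a-f]{4})|60([0-9a-f]{2}))57"
)


def find_selectors(code_hex):
    """Return (signature, entry_address) pairs found in hex-encoded bytecode."""
    found = []
    for match in SELECTOR_RE.finditer(code_hex.lower()):
        if match.start() % 2:  # not aligned on a byte boundary: ignore
            continue
        # Restore the leading zero byte dropped by the PUSH3 optimization
        signature = (match.group(1) or match.group(2)).zfill(8)
        address = int(match.group(3) or match.group(4), 16)
        found.append((signature, address))
    return found


# Excerpt of the selector of the contract compiled in the concrete example
excerpt = "5f3560e01c80635428cfc514610037578063bfb4ebcf1461005357610033565b"
# Hypothetical sequence: a pivot (GT) followed by a PUSH3 signature
hypothetical = "80638000000011610045578062fa21d51461006057"

print(find_selectors(excerpt))
print(find_selectors(hypothetical))
```

A regular expression can still match in the middle of PUSH data, so a real tool would rather disassemble the code and walk the instructions.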

The fallback exception

Solidity allows defining a fallback function which will be called in case the transaction data does not match any function signature in the contract. This will materialize in the code as a default jump at the end of the selector (instead of a REVERT):

JUMPI // last conditional jump of the selector
PUSH2 <fallback_address>
JUMP

If there is no fallback function, the compiler will make the flow jump to a revert sequence after the selector.

In the ABI, the function type will then be fallback instead of function, and the function name will be omitted. However, the stateMutability still needs to be stated. For example, the ABI may look like:

{
    "stateMutability": "nonpayable",
    "type": "fallback"
},

Keep in mind that a fallback function may accept input parameters and return objects. In this case, it will always accept bytes as inputs and bytes as outputs. However, even if the fallback function accepts inputs, the contract does not check whether the calldata contains parameters. Thus, the call will not revert because of missing parameters.

Discovering the function name

Once every function signature is extracted, it is possible to interact with the smart contract manually by forging external transactions. However, if we want to use a high-level client, we require a comprehensive ABI that includes the actual function name. The name could also help in understanding the purpose of the function within the contract.

There are primarily two strategies for discovering the function name:

  • Database Search: One convenient approach is to leverage databases holding information on Ethereum function signatures. The Ethereum Signature Database is one such resource where function signatures used in other contracts that have disclosed their sources are stored. Simply input the function signature into the search bar. If the signature has been used before and is available in the database, you will not only find the function name but also obtain detailed information about the number and types of parameters required. However, this approach is contingent on the availability of the function signature in the database.

  • Bruteforce: If the database search comes up empty, another option is to brute force the function name. But keep in mind that if you do not know which input is accepted by the function yet, you will have to bruteforce the entire string function(parameter1_type, parameter2_type, etc). This process can become exceedingly complex and time-consuming, as it involves guessing not only the function name but also the type and number of parameters.

The function payability

The function payability attribute tells the client whether a transaction sent with a positive Ether value can be handled by the function. In Solidity, that characteristic is defined with the payable keyword. By default, a function is non-payable. So how does that translate into EVM bytecode?

Solc always writes the same sequence to enforce the non-payability of a function, a simple conditional revert after loading the CALLVALUE:

void ContractCompiler::appendCallValueCheck()
{
    // Throw if function is not payable but call contained ether.
    m_context << Instruction::CALLVALUE;
    m_context.appendConditionalRevert(false, "Ether sent to non-payable function");
}

If the transaction value (the number of wei sent), obtained with the CALLVALUE opcode, is equal to 0, the execution continues. If not, the REVERT instruction is reached, reverting the transaction. The opcode sequence is the following:

CALLVALUE
DUP1
ISZERO
PUSH2 <function_start>
JUMPI  
PUSH0 // Sometimes, it might be PUSH1 00
DUP1
REVERT
// Regex: 801561[0-9a-fA-F]{4}575f80fd

This sequence may appear in two cases:

  • If at least one of the smart contract's functions is payable, this check will be made at the beginning of every non-payable function.
  • If every function is non-payable, this check will be made once at the beginning of the function selector to avoid redundancy, so right after a JUMPDEST instruction.

The code below defines that behavior:

void ContractCompiler::appendFunctionSelector(ContractDefinition const& _contract)
{
[...]
     bool needToAddCallvalueCheck = true;
     if (!hasPayableFunctions(_contract) && !interfaceFunctions.empty() && !_contract.isLibrary())
     {
         appendCallValueCheck(); //First case, the smart contract has no payable functions
         needToAddCallvalueCheck = false;
     }
[...]
     for (auto const& it: interfaceFunctions)
     {
[...]
         if (!functionType->isPayable() && !_contract.isLibrary() && needToAddCallvalueCheck)
             appendCallValueCheck(); // Second case, the check is set on every non-payable function
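
Both cases can be told apart with a short Python sketch. It relies on assumptions of ours: the input is the runtime bytecode as a hexadecimal string (not the creation code, whose constructor has its own check), the function entry addresses are already known from the selector, a per-function check sits right after the JUMPDEST of the function, and a check located before every function entry is the global one. The jump address is accepted as PUSH1 or PUSH2 and the zero as PUSH0 or PUSH1 00. The two excerpts are taken from the bytecodes shown in this article.

```python
import re

# CALLVALUE, DUP1, ISZERO, PUSH2 or PUSH1 <function_start>, JUMPI,
# PUSH0 (or PUSH1 00), DUP1, REVERT
CHECK_RE = re.compile(r"348015(?:61[0-9a-f]{4}|60[0-9a-f]{2})57(?:5f|6000)80fd")


def callvalue_checks(code_hex):
    """Byte offsets of every byte-aligned non-payability check."""
    return [
        m.start() // 2
        for m in CHECK_RE.finditer(code_hex.lower())
        if m.start() % 2 == 0
    ]


def payability(code_hex, entries):
    """Guess the payability of each function from its entry address."""
    checks = callvalue_checks(code_hex)
    # Assumption: a check placed before every entry is the global one
    if any(offset < min(entries) for offset in checks):
        return {entry: "nonpayable" for entry in entries}
    # Otherwise the check is expected right after the function's JUMPDEST
    return {
        entry: ("nonpayable" if entry + 1 in checks else "payable")
        for entry in entries
    }


# Start of the runtime code of the contract used in the concrete example
mixed = (
    "60806040526004361061002c575f3560e01c80635428cfc514610037578063bfb4ebcf"
    "1461005357610033565b3661003357005b5f80fd5b610051600480360381019061004c"
    "91906101ed565b610069565b005b34801561005e575f80fd5b5061006761006d565b00"
)
# Start of the runtime code of test.sol (a single non-payable function)
all_nonpayable = (
    "6080604052348015600f57600080fd5b506004361060285760003560e01c8063b29f08"
    "3514602d575b600080fd5b60336035565b00"
)

print(payability(mixed, [0x37, 0x53]))
print(payability(all_nonpayable, [0x2D]))
```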

The receive exception

Another important exception: Solidity allows defining a receive function, which is called by default if the transaction data is empty. That case arises when the sole purpose of the transaction is to send Ether. The receive function then handles the Ether value (in wei) sent along with the transaction and performs operations if necessary. This function does not have a name, cannot have arguments, cannot return anything and must have external visibility and payable state mutability.

In Solidity 0.6.0+, this is a receive function which only accepts Ether without doing anything else:

receive() external payable {}

Therefore, the ABI is always:

{
    "stateMutability": "payable",
    "type": "receive"
}

In the bytecode, the existence of a receive function shows at the beginning of the function selector, at the same place as the check of the second case mentioned in the previous section, when all functions are non-payable. However, in this case, the sequence checks whether the size of the data (obtained with the CALLDATASIZE opcode) is less than 4 (which means it does not contain a function signature) and, if so, it jumps to the receive function:

PUSH1 04
CALLDATASIZE    
LT // if calldata size < 4, so if it does not contain a function signature...
PUSH2 <receive_address>    
JUMPI // ... It jumps to receiver 
// Regex: 6004361061[0-9a-fA-F]{4}57
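
The Python sketch below locates this check and looks at where it jumps. It is only a heuristic of ours: the test.sol bytecode shown at the beginning of this article contains a similar check (60043610602857, with a PUSH1 address) although doIt() is its only function, and in that case the jump target is an immediate revert. The sketch therefore reports a handler only when the target is not a plain JUMPDEST followed by a revert. It cannot tell a receive function from a fallback function.

```python
import re

# PUSH1 04, CALLDATASIZE, LT, PUSH2 or PUSH1 <address>, JUMPI
SHORT_CALLDATA_RE = re.compile(r"60043610(?:61([0-9a-f]{4})|60([0-9a-f]{2}))57")
# JUMPDEST, PUSH0 (or PUSH1 00), DUP1, REVERT
REVERT_RE = re.compile(r"5b(?:5f|6000)80fd")


def short_calldata_target(code_hex):
    """Address reached when the calldata is shorter than 4 bytes, or None."""
    match = SHORT_CALLDATA_RE.search(code_hex.lower())
    if match is None:
        return None
    return int(match.group(1) or match.group(2), 16)


def handles_short_calldata(code_hex):
    """True when the target of the check does something else than revert."""
    target = short_calldata_target(code_hex)
    if target is None:
        return False
    # Two hexadecimal characters per byte, hence the multiplication
    return REVERT_RE.match(code_hex.lower(), target * 2) is None


# Start of the runtime code of a contract defining receive() (concrete example)
with_receive = (
    "60806040526004361061002c575f3560e01c80635428cfc514610037578063bfb4ebcf"
    "1461005357610033565b3661003357005b5f80fd"
)
# Start of the runtime code of test.sol, which defines no receive function
without_receive = (
    "6080604052348015600f57600080fd5b506004361060285760003560e01c8063b29f08"
    "3514602d575b600080fd5b60336035565b00"
)

print(hex(short_calldata_target(with_receive)), handles_short_calldata(with_receive))
print(hex(short_calldata_target(without_receive)), handles_short_calldata(without_receive))
```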

The number and types of parameters

If parameters are defined in your Solidity function, the solc compiler will perform the following steps, for each function:

if (!functionType->parameterTypes().empty())
{
    // Parameter for calldataUnpacker
    m_context << CompilerUtils::dataStartOffset;
    // Append DUP1, CALLDATASIZE, SUB
    m_context << Instruction::DUP1 << Instruction::CALLDATASIZE << Instruction::SUB; 
    CompilerUtils(m_context).abiDecode(functionType->parameterTypes());
}
void CompilerUtils::abiDecode(TypePointers const& _typeParameters, bool _fromMemory)
{
[...]
    size_t encodedSize = 0;
    for (auto const& t: _typeParameters)
        // Calculate the total size of all parameters, after being encoded 
        encodedSize += t->decodingType()->calldataHeadSize();
 
    // Create a conditional revert if the transaction CALLDATASIZE 
    // is lower than the previously calculated encoded size 
    Whiskers templ(R"({
        if lt(len, <encodedSize>) { <revertString> }
    })");
    templ("encodedSize", to_string(encodedSize));
    templ("revertString", m_context.revertReasonIfDebug("Calldata too short"));
    m_context.appendInlineAssembly(templ.render(), {"len"});

The template defined by abiDecode results in the following instruction set, prepended by the DUP1, CALLDATASIZE, SUB instructions from the ContractCompiler class:

DUP1
CALLDATASIZE // get the data size
SUB // subtract the first 4 bytes (the function signature)
// Regex: 803603
// The logic after that depends on the compiler, below is an example
PUSH1 <minimum_data_size_required>
DUP2
LT // if it's lower than the target...
ISZERO 
PUSH2 <function_start>
JUMPI // ... do not jump to the function start ...
PUSH1 00
DUP1
REVERT // ... and revert

So what we should watch here is the presence of DUP1, CALLDATASIZE and SUB at the beginning of a function. The bytecode will revert the transaction if the data size is lower than expected.

If no CALLDATASIZE opcode is present, it probably means the function does not take any input parameter. In that case, it does not matter whether you send parameters in your transaction: they will be completely ignored by the smart contract.
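
As a rough Python sketch of that check, the function below looks for the DUP1, CALLDATASIZE, SUB sequence and attributes each occurrence to the function whose entry address precedes it. The attribution rule is an assumption of ours (the body of a function is taken to lie between its entry and the next one), and the function entry addresses are expected to be already known from the selector.

```python
import re
from bisect import bisect_right


def functions_reading_calldatasize(code_hex, entries):
    """Map each entry address to True when DUP1, CALLDATASIZE, SUB follows it."""
    ordered = sorted(entries)
    result = {entry: False for entry in ordered}
    for match in re.finditer("803603", code_hex.lower()):
        if match.start() % 2:  # not aligned on a byte boundary: ignore
            continue
        offset = match.start() // 2
        # Closest entry address located at or before the sequence
        index = bisect_right(ordered, offset) - 1
        if index >= 0:
            result[ordered[index]] = True
    return result


# Start of the runtime code of the contract used in the concrete example
code = (
    "60806040526004361061002c575f3560e01c80635428cfc514610037578063bfb4ebcf"
    "1461005357610033565b3661003357005b5f80fd5b610051600480360381019061004c"
    "91906101ed565b610069565b005b34801561005e575f80fd5b5061006761006d565b00"
)

print(functions_reading_calldatasize(code, [0x37, 0x53]))
```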

So how can we determine the number and types of parameters required by a function, given that we know the minimum data size it needs?

Let's take a look at the result of calldataHeadSize(), called in the previous code snippet:

/// If @a _padded then it is assumed that each element is padded to a multiple of 32 bytes.
virtual unsigned calldataEncodedSize([[maybe_unused]] bool _padded) const { solAssert(false, ""); }
/// Convenience version of @see calldataEncodedSize(bool)
unsigned calldataEncodedSize() const { return calldataEncodedSize(true); }
/// @returns the distance between two elements of this type in a calldata array, tuple or struct.
/// For statically encoded types this is the same as calldataEncodedSize(true).
/// For dynamically encoded types this is the distance between two tail pointers, i.e. 32.
/// Always returns a value greater than zero and throws if the type cannot be encoded in calldata.
unsigned calldataHeadSize() const { return isDynamicallyEncoded() ? 32 : calldataEncodedSize(true); }

For statically sized types, calldataHeadSize() calls calldataEncodedSize(true), which returns the size of the encoded type in bytes. This size is always a multiple of 32 bytes due to padding. For dynamically sized types, that is when isDynamicallyEncoded() is true, calldataHeadSize() returns 32, the size of the offset stored in the head.

The fact is, if a dynamic array is sent in a transaction, the first bytes will be the offset to the start of the real data (padded to 32 bytes). At that offset, we will find the number of elements in the array (padded to 32 bytes), then each element (padded to 32 bytes). In the case of a dynamic array, the code will only check that there are at least 32 bytes (that the offset exists). An upper bound check is not performed, specifically because of arrays (see the ABI specs).

For example, if a parameter type is uint[] and the parameter is [7,8,9], the encoding will be:

0000000000000000000000000000000000000000000000000000000000000020 // start of the data, 32 bytes after the start, so right after that
0000000000000000000000000000000000000000000000000000000000000003 // The length of the array
0000000000000000000000000000000000000000000000000000000000000007 // First value: 7
0000000000000000000000000000000000000000000000000000000000000008 // Second value: 8
0000000000000000000000000000000000000000000000000000000000000009 // Third value: 9

So, if you happen to send a single static value to a parameter which expects a dynamic array, the calldata size check will not revert the transaction, and the value you sent will be treated as the array offset. The contract will then probably try to retrieve the elements and revert if it does not find them.

In the case of a static array of static types, the encoding is straightforward. For example, if a parameter type is uint[3] and the parameter is [1,2,3], the encoding will be:

0000000000000000000000000000000000000000000000000000000000000001 // first value: 1
0000000000000000000000000000000000000000000000000000000000000002
0000000000000000000000000000000000000000000000000000000000000003

Because everything is always padded to 32 bytes, it is easy to get the number of expected elements. If the PUSH1 right after CALLDATASIZE, SUB is pushing:

  • 32 (0x20) bytes: 1 element is needed
  • 64 (0x40) bytes: 2 elements are needed
  • 96 (0x60) bytes: 3 elements are needed
  • etc.
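
In Python, this rule boils down to a division. The helper name is ours, and a value that is not a multiple of 32 is rejected because it would not be the minimum data size described here.

```python
WORD_SIZE = 32  # every head element is encoded on 32 bytes


def element_count(min_data_size):
    """Number of 32-byte elements implied by the minimum data size."""
    if min_data_size < 0 or min_data_size % WORD_SIZE:
        raise ValueError("the minimum data size should be a multiple of 32")
    return min_data_size // WORD_SIZE


print(element_count(0x20), element_count(0x40), element_count(0x60))
```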

What we call element can be:

  • A static parameter
  • A static value part of a larger static array
  • A dynamic array

Indeed, it is important to remember that several elements can mean the function expects several parameters or a static array containing multiple static values. For example, the two functions below will expect the exact same CALLDATASIZE:

function foo(uint32 x, bool y) public {}
function bar(bytes3[2] memory z) public {}

However, precisely because everything is always padded to 32 bytes, it is quite hard to ascertain the right data type. To do so, one must dive into the EVM bytecode logic to understand how the data is handled and deduce its type (this will not be covered in this article). The parameter types are not easy to retrieve from the bytecode.

Fortunately, knowing the exact types is not really necessary, since it is possible to write bytes32 in the ABI, and the parameter will be encoded by the client in a way that fits most data types.

A final exception arises for bytes and string. Because their length cannot be known in advance, the data bytes are right-padded to a multiple of 32 bytes and placed after the data length, itself left-padded to 32 bytes. So if the data is sent as though it were a bytes32, the encoding will be incorrect and the transaction will revert.
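
The layout of a bytes parameter can be sketched in Python as follows, for a function taking a single bytes argument (the helper name is ours). A string parameter is encoded the same way, from its UTF-8 bytes.

```python
def encode_bytes_argument(data):
    """Single `bytes` argument: offset, length, then right-padded payload."""

    def word(value):
        return value.to_bytes(32, "big").hex()

    # Round the payload length up to the next multiple of 32
    padded_length = -(-len(data) // 32) * 32
    payload = data.ljust(padded_length, b"\x00").hex()
    return word(32) + word(len(data)) + payload


encoded = encode_bytes_argument(b"abc")
# Print one 32-byte word per line
for i in range(0, len(encoded), 64):
    print(encoded[i:i + 64])
```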

Concrete example

Let's try to apply what we discussed in this article by finding the ABI of a simple compiled contract. Let's compile the following code with solc:

pragma solidity ^0.8.11;

contract HighlyComplexContract {
    uint256 count = 0;

    function Foo() public {}

    function Bar(uint[3] memory a, uint b) public payable {}

    receive() external payable {}
}

The compiled binary is the following:

60806040526004361061002c575f3560e01c80635428cfc514610037578063bfb4ebcf1461005357610033565b3661003357005b5f80fd5b610051600480360381019061004c91906101ed565b610069565b005b34801561005e575f80fd5b5061006761006d565b005b5050565b565b5f604051905090565b5f80fd5b5f80fd5b5f601f19601f8301169050919050565b7f4e487b71000000000000000000000000000000000000000000000000000000005f52604160045260245ffd5b6100c682610080565b810181811067ffffffffffffffff821117156100e5576100e4610090565b5b80604052505050565b5f6100f761006f565b905061010382826100bd565b919050565b5f67ffffffffffffffff82111561012257610121610090565b5b602082029050919050565b5f80fd5b5f819050919050565b61014381610131565b811461014d575f80fd5b50565b5f8135905061015e8161013a565b92915050565b5f61017661017184610108565b6100ee565b905080602084028301858111156101905761018f61012d565b5b835b818110156101b957806101a58882610150565b845260208401935050602081019050610192565b5050509392505050565b5f82601f8301126101d7576101d661007c565b5b60036101e4848285610164565b91505092915050565b5f806080838503121561020357610202610078565b5b5f610210858286016101c3565b925050606061022185828601610150565b915050925092905056fea2646970667358221220a86a0dd0f0c2a353297ae9fd8ed73b4032cf9ea9d399258054ae11aa790d3ba764736f6c63430008150033

First, let's identify the number of functions:

$ egrep -o '8063[0-9a-fA-F]{8}1461[0-9a-fA-F]{4}57' HighlyComplexContract.bin
80635428cfc51461003757
8063bfb4ebcf1461005357

There are two functions and their signatures are: 5428cfc5 and bfb4ebcf.

Also, note that the first function starts at 0x37 and the second at 0x53. This will matter in the next steps.

The name of the second function is easily found with 4byte.directory:

The signature is already known.

The first one however is not publicly known as the parameters it accepts are uncommon. We will have to keep the signature as it is in the ABI as a placeholder to use it directly in future transactions. We will call that function 5428cfc5().

Secondly, are the functions non-payable?

$ egrep -ob '801561[0-9a-fA-F]{4}575f80fd' HighlyComplexContract.bin
170:801561005e575f80fd

Only one non-payability check is present so there is only one non-payable function, but which one?

egrep tells us that it is found at the 170th character, so at the 85th (0x55) byte. Because Foo() starts at 0x53, it means it is the non-payable one.

Thirdly, how many inputs do the functions accept?

$ egrep -ob '803603' HighlyComplexContract.bin                      
122:803603

Only one function performs a check on CALLDATASIZE: 5428cfc5() (hex(122/2) = 0x3d). It probably also means that Foo() does not accept any argument.

It is a bit trickier here to find the value of the PUSH1, as the bytecode performs the following sequence:

003D    80    DUP1
003E    36    CALLDATASIZE
003F    03    SUB        // Start of the function, CALLDATASIZE retrieves the data size, meaning it will be checked somewhere
0040    81    DUP2
0041    01    ADD
0042    90    SWAP1
0043    61    PUSH2 0x004c
0046    91    SWAP2
0047    90    SWAP1
0048    61    PUSH2 0x01ed
004B    56    JUMP
[...]
01ED    5B    JUMPDEST
01EE    5F    PUSH0
01EF    80    DUP1
01F0    60    PUSH1 0x80   // Here is the PUSH1
01F2    83    DUP4
01F3    85    DUP6
01F4    03    SUB
01F5    12    SLT
01F6    15    ISZERO
01F7    61    PUSH2 0x0203
01FA    57    JUMPI

This sequence may be found sometimes when the compiler optimizes the bytecode, making the flow jump to the end in order to verify the number of parameters.

What is important to find here is the instruction at 0x1f0: PUSH1 0x80 which means the function expects 4 elements encoded on 32 bytes. This value is correct as the function Bar(uint[3] memory a, uint b) expects a static array of 3 static uint and 1 more uint.

Finally, does the smart contract have fallback or receive functions?

A revert is found at the end of the selector, so it does not contain a fallback function:

0013    63    PUSH4 0x5428cfc5 // First function selector
0018    14    EQ
0019    61    PUSH2 0x0037
001C    57    JUMPI
001D    80    DUP1
001E    63    PUSH4 0xbfb4ebcf // second function selector
0023    14    EQ
0024    61    PUSH2 0x0053
0027    57    JUMPI
0028    61    PUSH2 0x0033
002B    56    JUMP
[...]
0033    5B    JUMPDEST
0034    5F    PUSH0
0035    80    DUP1
0036    FD    REVERT // A JUMP would have been found here in case of a fallback function

However, the initial check for CALLDATASIZE means a receive function will be found at 0x2c:

$ egrep -o '6004361061[0-9a-fA-F]{4}57' HighlyComplexContract.bin 
6004361061002c57

We can deduce the following functional ABI:

[
  {
    "inputs": [
      {
        "name": "whatever",
        "type": "bytes32[4]"
      }
    ],
    "name": "NOT_FOUND",
    "selector": "0x5428cfc5",
    "outputs": [],
    "stateMutability": "payable",
    "type": "function"
  },
  {
    "inputs": [],
    "name": "Foo",
    "outputs": [],
    "stateMutability": "nonpayable",
    "type": "function"
  },
  {
    "stateMutability": "payable",
    "type": "receive"
  }
]
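
The deduction above can be scripted. The Python sketch below assembles the same ABI from the recovered facts (signature, payability, minimum data size, receive function). The helper function_entry is ours, and the NOT_FOUND placeholder name and the non-standard selector field simply mirror the ABI written above.

```python
import json


def function_entry(selector, payable, head_size, name=None):
    """Build one ABI entry from facts recovered from the bytecode."""
    inputs = []
    if head_size:
        # One bytes32 per 32-byte element of the minimum data size
        inputs.append({"name": "whatever", "type": f"bytes32[{head_size // 32}]"})
    entry = {
        "inputs": inputs,
        "name": name or "NOT_FOUND",
        "outputs": [],
        "stateMutability": "payable" if payable else "nonpayable",
        "type": "function",
    }
    if name is None:
        # Keep the signature as a placeholder when the name is unknown
        entry["selector"] = selector
    return entry


abi = [
    function_entry("0x5428cfc5", payable=True, head_size=0x80),
    function_entry("0xbfb4ebcf", payable=False, head_size=0, name="Foo"),
    {"stateMutability": "payable", "type": "receive"},
]
print(json.dumps(abi, indent=2))
```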

Existing tools

A number of tools can help to extract the Application Binary Interface (ABI) from EVM bytecode. These tools vary in their methodologies, user interface, and feature sets. We list here the ones that returned good results during our tests. Feel free to contact us to extend this list.

Whatsabi

Whatsabi is the most complete free tool to specifically perform this task. It is an open-source command-line tool that can generate a working ABI from an Ethereum smart contract bytecode. Whatsabi works by analyzing the EVM bytecode mostly the way we explained in this article and extracting information about the functions. We strongly encourage people reading this article to try this tool and contribute to its development.

Below is the output Whatsabi v0.8.6 gives on our above example. You will notice that Whatsabi shows neither the exact number of inputs nor the presence of the receive function, as these are not easy to analyze without advanced decompiler methods. The rest is mostly similar:

> code = '6080[...]50033'
> whatsabi.abiFromBytecode(code);
[
  {
    "type": "function",
    "selector": "0x5428cfc5",
    "payable": true,
    "stateMutability": "payable",
    "inputs": [
      {
        "type": "bytes"
      }
    ]
  },
  {
    "type": "function",
    "selector": "0xbfb4ebcf",
    "payable": false,
    "stateMutability": "view",
    "outputs": [
      {
        "type": "bytes"
      }
    ]
  },
  {
    "type": "event",
    "hash": "0x4e487b7100000000000000000000000000000000000000000000000000000000"
  }
]

Whatsabi may also try to retrieve the function name:

> const signatureLookup = new whatsabi.loaders.OpenChainSignatureLookup();
> await signatureLookup.loadFunctions("0xbfb4ebcf");
[ 'Foo()' ]
> await signatureLookup.loadFunctions("0x5428cfc5");
[]

Ethereum-dasm

ethereum-dasm also tries to recover the ABI from a static analysis of the bytecode. This functionality has less logic than Whatsabi, but ethereum-dasm also disassembles and tries to decompile the contract.

JEB Decompiler

JEB Decompiler is a robust software analysis tool supporting decompilation for a multitude of languages, including Ethereum EVM bytecode. Its versatile analysis features, along with the EVM plugin, make it possible to interpret and reverse-engineer Ethereum smart contracts to a more comprehensible form, aiding in the extraction of ABIs. However, JEB is not free.

Conclusion and further works

Ethereum smart contracts, while appearing opaque at first glance, can yield a wealth of information if we know how to dissect their underlying bytecode. The function signatures, types, and parameters can be discerned through careful analysis. However, this field still presents limitations that need to be addressed and opportunities for future exploration.

Several challenges lie in accurately recovering an ABI. For example, we can infer the number of parameters from the required data size, but pinpointing the exact data type remains elusive in many cases. Additionally, the tools currently available, although useful, do not fully automate the process. Some exceptions may arise from the use of different compilers and their own optimization processes. We stated some of them but many more exist: initialization code, proxies, interfaces, dynamic jumps, making it hard to identify a whole function code…

In order to further enhance the accuracy and reliability of ABI extraction, dynamic analysis could be incorporated into our approach. This would provide a more comprehensive view of a contract's behavior in response to varying inputs. Additionally, leveraging frameworks like Hardhat could be instrumental in testing and verifying the correctness of the extracted ABIs.