Skip to content
Lexi Brent edited this page Nov 28, 2018 · 15 revisions

Contents


Overview of Vandal

Vandal is an analysis pipeline for analysing Ethereum virtual machine smart contract bytecode. The analysis pipeline consists of several stages:

  • Scraper
  • Disassembler
  • Decompiler
  • Extractor
  • Security analysis as logic specifications using Souffle

All of the above are part of Vandal. With the exception of the logic specifications for security analyses, Vandal is written entirely in Python.

The following figure is a high-level overview of our pipeline:

High-level overview of Vandal analysis pipeline

Publications

  • Vandal: A Scalable Security Analysis Framework for Smart Contracts, Lexi Brent, Anton Jurisevic, Michael Kong, Eric Liu, Francois Gauthier, Vincent Gramoli, Ralph Holz, Bernhard Scholz, Technical Report, School of Computer Science, The University of Sydney, Sydney, Australia, September 2018. [pdf] [BibTeX]

  • MadMax: Surviving Out-of-Gas Conditions in Ethereum Smart Contracts, Neville Grech, Michael Kong, Anton Jurisevic, Lexi Brent, Bernhard Scholz, Yannis Smaragdakis, SPLASH 2018 OOPSLA, Boston, November 2018. [pdf] [BibTeX]

  • A Scalable Method to Analyze Gas Costs, Loops and Related Security Vulnerabilities on the Ethereum Virtual Machine, Michael Kong, Honours Thesis, November 2017, School of Computer Science, The University of Sydney. [pdf] [BibTeX]

History and Future of Vandal

When Vandal was first written, there was very little information available about Ethereum's execution model and smart contract security vulnerabilities relative to now, and it was authored under time pressure in a short span of about 10 weeks. We never intended to build a decompiler -- at first, we attempted a peephole analysis to build the control flow graph, but the problem quickly grew in scope as we realized this was insufficient in most cases due to the complexity of how jump addresses are propagated within Ethereum's stack-based execution model.

Given our extensive experience and hindsight, in future we intend to re-write Vandal from scratch, conforming to traditional compiler design.

Scraper

The scraper is a tool which accesses the JSON-RPC API of a running Parity Ethereum client and makes scrapes all smart contracts from the blockchain via API requests. It allows us to perform empirical measurements of the prevalence of potential vulnerabilities in the wild. Unfortunately, it has not been cleared for public release, so it is not included as part of Vandal at this stage. For more information, please contact us directly.

Disassembler

Our disassembler is similar to the official Ethereum disassembler but provides extra functionality such as basic block splitting and syntax highlighting. It leverages the parsing logic already implemented in our decompiler. Originally, our decompiler only accepted disassembly as input, but now accepts bytecode directly by default.

There is a command-line interface for executing the disassembler at /bin/disassemble. To see the available command line arguments, type bin/disassemble --help.

Decompiler

This is the core of Vandal. It converts EVM bytecode to an intermediate representation in which the execution stack is replaced with register variables. The compact stack-based execution model of Ethereum's virtual machine obfuscates jump addresses, which we need in order to build a contract's control flow graph (CFG) for vulnerability analysis.

The decompiler contains a lot of functionality, including:

  • Construction of a contract's control flow graph
  • Data flow fixed point analysis with basic block splitting to resolve CFG edges
  • Solidity smart contract function identification
  • Interactive HTML output for exploring a given contract's CFG and IR code
  • Memory abstraction (limited)
  • TSV output which forms the EDB for our Datalog analysis, including relations for:
    • Definition sites
    • Basic blocks
    • CFG edges and entry/exit nodes
    • Dominators and postdominators
    • Extracted public and private functions
    • Constant definitions
    • Opcode usage

Overview

When reading our code, it's helpful to understand some naming conventions and decisions we made early on in development. In particular:

Term Description
TAC Three Address Code. We now refer to this as our "Intermediate Representation (IR)" or "Intermediate Language (IL)", since it is not technically true three-address code.
Function As in the class function.Function, and references to "functions" in our source code and comments. We are referring to a function defined in a Solidity smart contract. Note that the EVM does not have a native concept of functions.
EVM Ethereum Virtual Machine code --- raw bytecode (or its disassembly) as intended to run on the EVM
CFG Control Flow Graph --- a graph containing basic blocks as nodes, and JUMP/JUMPI instructions as edges.
OpCode As in src/opcodes.py / opcodes.OpCode: a valid EVM bytecode operation, as defined in Appendix H of the Ethereum Yellowpaper.
EVMOp A concrete EVM opcode existing in some EVM bytecode.
TACOp A concrete operation existing in some intermediate representation code.

Our decompiler works with three different representations:

  1. Input representation --- either EVM bytecode or EVM bytecode disassembly. Raw input is parsed and represented internally as a sequence of evm_cfg.EVMBasicBlocks, each of which contains a sequence of evm_cfg.EVMOps.

  2. Intermediate representation (IR) --- corresponds to EVM bytecode, but the execution stack is replaced with register variables, allowing values to be traced with ease, and jump destination addresses (CFG edges) to be constructed with ease. We used to call this a "three-address code" representation, so you will see references to "TAC" in the source code. Technically it is not three-address code, so we now refer to it as an "intermediate representation (IR)" or "intermediate language (IL)".

  3. Output representation --- either text-based (as printed to stdout), TSV files which form the EDB for our Datalog analyses, Graphviz dot file output, or HTML output for generating an interactive CFG for the input contract, which shows all information extracted by the decompiler through data flow analysis (including extracted functions, jump addresses, EVM disassembly and its corresponding intermediate representation).

The idea is that our decompiler transforms EVM bytecode into the TAC/IR form through the following process:

  1. Parse bytecode/disassembly to construct a sequence of EVM basic blocks, containing EVM operations.
  2. Convert each EVM basic block to a corresponding TAC/IR basic block.
  3. Build a CFG for the TAC/IR blocks by performing iterative data flow analysis to resolve jump addresses (graph edges).

Once this process is complete, output is generated using one or more Exporter classes defined in src/exporter.py.

Implementation

Most of the decompiler implementation is contained in src. Each of these libraries is described below:

  • src/blockparse.py --- contains classes for parsing raw input formats, including EVM bytecode and disassembly such as the output from our disassembler. The parsers convert each bytecode to a corresponding evm_cfg.EVMOp object, and return a list of basic blocks (evm_cfg.EVMBasicBlock).

  • src/cfg.py --- defines base classes for ControlFlowGraph and BasicBlock.

  • src/dataflow.py --- contains functions, each of which accepts a ControlFlowGraph, and performs a different type of data flow analysis over the graph, including fixed point, stack size, and stack state analyses.

  • src/evm_cfg.py --- specializes the cfg.ControlFlowGraph and cfg.BasicBlock for representing a control flow graph for EVM bytecode.

  • src/tac_cfg.py --- specializes the evm_cfg.* classes for representing a contract in our intermediate representation. These classes also contain specialised helper methods for performing data flow analyses, such as finding dominators, post-dominators, block splitting, duplicate block removal, etc. It also defines tac_cfg.Destackifier, a class which transforms individual evm_cfg.EVMBasicBlock instances to tac_cfg.TACBasicBlock instances; and helper classes TACOp, TACAssignOp, TACArg, and TACLocRef for our intermediate representation.

  • src/exporter.py --- defines classes which produce Vandal's output representations from a given control flow graph. Each class inherits from abstract base class exporter.Exporter, and accepts a CFG instance in its constructor, and contains an export() function, which produces the output.

  • src/function.py --- classes for extracting Solidity functions from a contract's control flow graph, based on identification of function signatures used in the function dispatcher created by a compiler.

  • src/lattice.py --- classes implementing a lattice, which is used for fixed point data flow analysis.

  • src/memtypes.py --- classes for representing and working with register variables as used in our IR. Variables may be constant, set-valued, or unknown.

  • src/opcodes.py --- defines the EVM instruction set, using the information from Appendix H of the Ethereum Yellowpaper.

  • src/patterns.py --- object-oriented patterns (e.g. Visitor) used in other classes.

  • src/settings.py --- defines valid settings keys for config.ini, and contains logic for reading the INI file using Python's configparser.

Bulk Analysis Scripts

Our bulk analysis scripts are designed to take a directory full of raw smart contract bytecode in hexadecimal format (e.g. the output of our scraper), one contract per file, and run each contract through our pipeline. It collates output into a single JSON file and uses multiprocessing for parallelisation. It requires that Souffle be installed and available in $PATH.

For more information, see the files in tools/bulk_analyser. In particular, see README.md and the Python source code comments.

There is an alternative bulk analysis system available in tools/analyser, which uses a Ruby glue script and parallelised bash jobs.

Datalog Specifications

We detect the potential presence of potential security vulnerabilities using Datalog relations which we wrote for use with the Souffle Datalog engine. All of the relevant datalog is currently contained in the datalog/ directory.

For example, a very simple vulnerability specification is for detecting the use of the ORIGIN opcode. This is implemented in datalog/demo_analyses.dl as follows:

.decl originUsed(stmt:Statement)
.output originUsed

originUsed(useStmt) :-
  op(originStmt, "ORIGIN"),
  def(originVar, originStmt),
  depends(useVar, originVar),
  usedInStateOrCond(useVar, useStmt).

We have written a simple demo tutorial showing how to implement a new Datalog analysis specification with Vandal.