Performance Simulator

Pipeline modeling

In the functional simulator, all actions are encapsulated into 5 stages:

Fetch
Decode, read sources
Execute, calculate address
Memory access
Writeback, PC update, information dump

In the performance simulator, each stage is encapsulated in a module, and the modules are connected with ports.

Modules and ports can be visualized with the dedicated tool.

Ports

We use ports for two purposes:

A data port transfers data from one stage to the next one
A stall port tells the previous stages that the pipeline is stalled

Currently we don't use a complicated port topology, so use the constants PORT_BW, PORT_FANOUT and PORT_LATENCY everywhere. They are all defined as 1.

Data port

Data ports have the following syntax:

    class PerfMIPS {
        std::unique_ptr<ReadPort</*Type*/>>  rp_/*source_module*/_2_/*dest_module*/;
        std::unique_ptr<WritePort</*Type*/>> wp_/*source_module*/_2_/*dest_module*/;
        // examples
        std::unique_ptr<ReadPort<FuncInstr>>  rp_decode_2_execute;
        std::unique_ptr<ReadPort<FuncInstr>>  rp_execute_2_memory;
        std::unique_ptr<WritePort<uint32>>    wp_fetch_2_decode;
        std::unique_ptr<WritePort<FuncInstr>> wp_decode_2_execute;
    };

and are initialized in the following way:

    PerfMIPS::PerfMIPS() {
        // example
        // latency is a property of the read port; bandwidth and fanout belong to the write port
        rp_decode_2_execute = make_read_port<FuncInstr>("DECODE_2_EXECUTE", PORT_LATENCY);
        wp_decode_2_execute = make_write_port<FuncInstr>("DECODE_2_EXECUTE", PORT_BW, PORT_FANOUT);
    }

Each pair of data ports transmits a FuncInstr object. The only exception is the fetch->decode port, which transmits a raw uint32.
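
To make the port semantics concrete, here is a minimal, self-contained sketch of a channel with latency. It is not the simulator's port implementation: `ToyPort` and its methods are hypothetical names invented for illustration. The sketch assumes only that a value written in cycle N becomes readable in cycle N + latency, and that a value not read in that cycle is lost.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical stand-in for a connected write/read port pair.
// Assumption: data written at cycle N is visible to the reader at cycle N + latency.
template <typename T>
class ToyPort {
    std::deque<std::pair<uint64_t, T>> queue; // (cycle when data is ready, data)
    uint64_t latency;
public:
    explicit ToyPort(uint64_t lat) : latency(lat) {}

    // Writer side: schedule the value for delivery.
    void write(const T& value, uint64_t cycle) {
        queue.emplace_back(cycle + latency, value);
    }

    // Reader side: returns true and fills *dst only if data is ready at this cycle.
    bool read(T* dst, uint64_t cycle) {
        assert(dst != nullptr);
        // In this sketch, data that was not read in its cycle is dropped.
        while (!queue.empty() && queue.front().first < cycle)
            queue.pop_front();
        if (queue.empty() || queue.front().first > cycle)
            return false;
        *dst = queue.front().second;
        queue.pop_front();
        return true;
    }
};
```

With a latency of 1, a raw instruction word written by one stage in cycle 5 can be read by the next stage in cycle 6, which is what lets each stage work on its own instruction within a cycle.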

Stall ports

A stall port is used to stop the previous stages when the current instruction cannot pass through this stage and the stage has to be restarted. These ports transmit only 1 bit of data, represented by the bool type.

    std::unique_ptr<ReadPort<bool>>  rp_decode_2_fetch_stall;
    std::unique_ptr<WritePort<bool>> wp_decode_2_fetch_stall;

    rp_decode_2_fetch_stall = make_read_port<bool>("DECODE_2_FETCH_STALL", /**/);

Modules

Module names

We have the following modules:

fetch
decode
execute
memory
writeback

Module objects

Each module consists of the following objects:

read port from the previous stage*
write port to the next stage**
stall read port from the next stage**
stall write port to the previous stage*
internal value on the latch — FuncInstr object or data bytes*
void clock_module(int cycle) function (where module is one of the names above)

* Not needed in the fetch module.

** Not needed in the writeback module.

Module behavior skeleton

    void clock_module( int cycle) {
        bool is_stall = false;
        /* If the next module tells us to stall, we stop
           and send a stall signal to the previous module */
        rp_next_2_me_stall->read( &is_stall, cycle);
        if ( is_stall) {
             wp_me_2_previous_stall->write( true, cycle);
             return;
        }

        /* If nothing comes from the previous stage, the
           execute, memory and writeback modules have to jump out here */
        if ( !rp_previous_2_me->read( &module_data, cycle))
            return;

        /* But the decode stage doesn't jump out.
           It takes the non-updated bytes from module_data
           and re-decodes them */
        // rp_previous_2_me->read( &module_data, cycle)

        // Here we process data.

        if (...) {
             /* This branch is chosen if everything is OK and
                the instruction may proceed to the next pipeline stage */
             wp_me_2_next->write( module_data, cycle);
        }
        else {
             // Otherwise, nothing is done and we have to stall the pipeline
             wp_me_2_previous_stall->write( true, cycle);
        }
    }

Note: The decode stage behaves slightly differently from the other modules; pay attention to the alternatives marked in the code comments.
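
The decode-specific behavior can be sketched in isolation. The example below is an illustration under assumptions, not the simulator's code: `ToyDecode`, its fields, and the `sources_ready` flag (standing in for the dependency check) are hypothetical. It shows only the latch idea: when no new bytes arrive, decode retries with the bytes it already holds.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decode latch: keeps the last raw instruction bytes so that a
// stalled instruction can be re-decoded on a later cycle.
struct ToyDecode {
    uint32_t latched_bytes = 0;
    bool     latch_valid   = false;

    // incoming:      bytes arriving from fetch this cycle, or nullptr if none
    // sources_ready: result of the (elided) dependency check
    // Returns true and fills *out when an instruction leaves the stage;
    // returns false on a bubble or a stall.
    bool clock(const uint32_t* incoming, bool sources_ready, uint32_t* out) {
        assert(out != nullptr);
        if (incoming != nullptr) {   // new data overwrites the latch
            latched_bytes = *incoming;
            latch_valid = true;
        }
        if (!latch_valid)            // nothing to decode at all: bubble
            return false;
        if (!sources_ready)          // stall: keep the bytes for re-decoding
            return false;
        *out = latched_bytes;        // instruction proceeds to the next stage
        latch_valid = false;
        return true;
    }
};
```

A stalled instruction stays in the latch, so a later call with `incoming == nullptr` and `sources_ready == true` still produces it. The other stages have no such latch and simply return when nothing arrives.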

Stall generation

For now, we assume that every instruction is executed in 1 cycle, so the only possible stalls are caused by data dependencies and control dependencies.

Data dependency tracker

Our goal is to stop an instruction if its sources are not ready. This can be checked with the following extension of the RF: each register gets 1 validity bit. For an instruction's destination register, this bit is set to false at the decode stage and set back to true at the writeback stage. Subsequent instructions must check the bits of their source registers. An instruction can continue execution if and only if all of these bits are true; otherwise it is stalled.

Note: Because the `$zero` register is never overwritten, its validity bit is always `true`!

The code changes should look like:

    class RF {
        struct Reg {
            uint32 value;
            bool   is_valid;
            Reg() : value(0), is_valid(true) { }
        } array[REG_MAX_NUM];
    public:
        uint32 read( Reg_Num);
        bool check( Reg_Num num) const { return array[(size_t)num].is_valid; }
        void invalidate( Reg_Num num) { array[(size_t)num].is_valid = false; }
        void write ( Reg_Num num, uint32 val) {
             // ...
             assert( array[(size_t)num].is_valid == false);
             array[(size_t)num].is_valid = true;
        }
    };
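
As a usage illustration, the sketch below runs the decode-side check against a simplified register file. `ToyRF`, `try_issue`, and the use of plain indices instead of `Reg_Num` are assumptions made to keep the example self-contained; only the validity-bit protocol itself (check sources, invalidate destination, revalidate on write) comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified register file with one validity bit per register.
class ToyRF {
    struct Reg {
        uint32_t value = 0;
        bool is_valid = true;
    };
    static const std::size_t REG_NUM = 32;
    Reg array[REG_NUM];
public:
    bool check(std::size_t num) const { return array[num].is_valid; }
    uint32_t read(std::size_t num) const { return array[num].value; }
    void invalidate(std::size_t num) {
        if (num != 0)                    // $zero always stays valid
            array[num].is_valid = false;
    }
    void write(std::size_t num, uint32_t val) {
        if (num == 0)                    // writes to $zero are ignored
            return;
        assert(!array[num].is_valid);    // must have been invalidated at decode
        array[num].value = val;
        array[num].is_valid = true;
    }
};

// Decode-stage check: returns true if the instruction may proceed and, in that
// case, marks its destination register as pending. Returns false on a stall.
bool try_issue(ToyRF& rf, std::size_t src1, std::size_t src2, std::size_t dst) {
    if (!rf.check(src1) || !rf.check(src2))
        return false;
    rf.invalidate(dst);
    return true;
}
```

For example, after `try_issue(rf, 9, 10, 8)` succeeds, a following instruction that reads register 8 is refused until `rf.write(8, ...)` is called at writeback.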

Control dependency tracker

A control dependency can be represented as a data dependency through the PC register. You have to add a validity bit for the PC register that is set to false by jumps and branches, which must be detected with the FuncInstr::is_jump() const method. However, this bit has to be checked at the fetch stage, not at the decode stage.

Note: Non-branch instructions must advance the PC by 4 at the `decoding` stage so that fetching of the next instructions can continue!
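
The PC validity bit can be sketched the same way as the register bits. All names below (`ToyPC`, `can_fetch`, `on_decode`, `on_jump_resolved`) are hypothetical. The point at which the bit becomes true again is an assumption made by analogy with the register file, where the bit is restored when the new value is written.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical PC register extended with a validity bit.
struct ToyPC {
    uint32_t value = 0;
    bool     is_valid = true;
};

// Fetch stage: an instruction may be fetched only while the PC is valid.
bool can_fetch(const ToyPC& pc) { return pc.is_valid; }

// Decode stage: a jump or branch invalidates the PC; any other instruction
// advances it by 4 so that fetch can continue.
void on_decode(ToyPC& pc, bool is_jump) {
    if (is_jump)
        pc.is_valid = false;
    else
        pc.value += 4;
}

// Assumed to be called once the new PC is known and written back.
void on_jump_resolved(ToyPC& pc, uint32_t target) {
    assert(!pc.is_valid);  // only a pending jump may rewrite the PC here
    pc.value = target;
    pc.is_valid = true;
}
```

While the bit is false, fetch produces nothing, so the pipeline fills with bubbles until the jump target is known.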

User interfaces

Master output

At each stage, the instruction disassembly (if it exists) and its result (if it exists) should be printed to std::cout in a way similar to the functional simulator, but preceded by the stage name and the current clock number, separated by a "\t" character.

Sometimes it is very useful to see what happens inside the machine. One of the simplest ways is per-stage output: the simulator shows the instruction being processed at each stage, like this:

    fetch   cycle 5:  0x43adcb90
    decode  cycle 5:  ori $t2, $t1, 0xAA00
    execute cycle 5:  add $t1, $t2, $t3
    memory  cycle 5:  bubble

IPC is printed at the end of the simulation.
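
A possible shape for the output helpers is sketched below. `format_stage_line` and `compute_ipc` are hypothetical names, and the exact placement of the "\t" separators is an assumption based on the sample above. IPC is taken as the number of executed instructions divided by the number of elapsed cycles.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: builds one line of per-stage output, e.g.
// "memory<TAB>cycle 5:<TAB>bubble".
std::string format_stage_line(const std::string& stage, uint64_t cycle,
                              const std::string& text) {
    assert(!stage.empty());  // every line must name its stage
    std::ostringstream oss;
    oss << stage << "\tcycle " << cycle << ":\t" << text;
    return oss.str();
}

// Instructions per cycle; returns 0 if no cycles have elapsed.
double compute_ipc(uint64_t executed_instrs, uint64_t cycles) {
    if (cycles == 0)
        return 0.0;
    return static_cast<double>(executed_instrs) / static_cast<double>(cycles);
}
```

Returning a string rather than printing directly keeps the formatting testable; the caller can send it to std::cout or drop it in silent mode.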

Method `PerfMIPS::run`

As in the functional simulator, run has 2 parameters:

const std::string& tr with the file system path to the trace to execute
int instrs_to_run with the number of instructions to execute

and one extra parameter

bool silent — see above

The code inside is very simple:

    PerfMIPS::run(...) {
        // .. init
        executed_instrs = 0; // this variable is stored inside PerfMIPS class
        cycle = 0;
        while (executed_instrs < instrs_to_run) {
              clock_fetch(cycle);
              clock_decode(cycle);
              clock_execute(cycle);
              clock_memory(cycle);
              clock_writeback(cycle); // each instruction writeback increases executed_instrs variable
              ++cycle;
        }
        // ..
    }

Question: Can calls of `clock_fetch` and `clock_decode` be swapped? What about `clock_writeback` and `clock_fetch`?

int main

The entry point has to be very similar to FuncSim's.

MIPT-V / MIPT-MIPS — Cycle-accurate pre-silicon simulation.
