This repository contains a cycle-accurate RISC-V processor simulator developed as part of the CS209P Computer Architecture course project. The simulator models a multi-compute-unit architecture with realistic pipeline behavior, a two-level cache hierarchy, synchronization primitives, and programmer-managed scratchpad memory.
The goal of this project is to study memory system behavior, parallel execution, and performance trade-offs in modern processor designs.
Four Compute Units (CUs)
- Shared instruction fetch unit
- Independent Decode, Execute, Memory, and Writeback stages per CU
- Shared instruction and data memory
- Each CU has a unique CID (Compute ID) register
Single Fetch Unit
- All compute units fetch the same instruction
- Execution is selectively enabled or disabled based on CID
- Enables SIMD-like execution with control divergence
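A minimal sketch of this predication model, assuming hypothetical ComputeUnit and broadcast names (the simulator's actual internals may differ): every CU receives the same fetched instruction, but only CUs whose CID satisfies the current condition execute it.

```python
# Hypothetical sketch: class and field names are illustrative, not the simulator's actual API.

class ComputeUnit:
    def __init__(self, cid, num_registers=32):
        self.cid = cid               # unique Compute ID (CID) register
        self.regs = [0] * num_registers
        self.enabled = True          # set per instruction by CID-based predication

    def execute(self, operation):
        if self.enabled:             # disabled CUs treat the instruction as a no-op
            operation(self)

def broadcast(cus, operation, predicate=lambda cu: True):
    """The shared fetch unit issues the same instruction to every CU;
    only CUs whose CID satisfies the predicate actually execute it."""
    for cu in cus:
        cu.enabled = predicate(cu)
        cu.execute(operation)

def write_cid_times_ten(cu):
    cu.regs[1] = cu.cid * 10         # e.g. x1 <- CID * 10

cus = [ComputeUnit(cid) for cid in range(4)]
broadcast(cus, write_cid_times_ten, predicate=lambda cu: cu.cid % 2 == 0)
print([cu.regs[1] for cu in cus])    # [0, 0, 20, 0] -- only CU0 and CU2 executed
```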
L1 Caches
- L1 Instruction Cache (L1I)
- L1 Data Cache (L1D)
- Configurable parameters:
  - Cache size
  - Block size
  - Associativity
  - Access latency
- Instruction fetch is treated as a cacheable memory access
- Cache blocks of 64 bytes hold up to 16 instructions
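The configurable parameters above can be pictured as a small configuration record per cache. The sketch below is illustrative only; the dataclass name, field names, and values are assumptions, not the repository's actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class CacheConfig:
    size_bytes: int       # total cache size
    block_size: int       # bytes per block; a 64 B block holds 16 4-byte instructions
    associativity: int    # 1 = direct-mapped; size/block = fully associative
    latency_cycles: int   # access latency in cycles

# Placeholder values, not the repository's defaults.
l1i = CacheConfig(size_bytes=4096, block_size=64, associativity=2, latency_cycles=1)
l1d = CacheConfig(size_bytes=4096, block_size=64, associativity=2, latency_cycles=1)
l2  = CacheConfig(size_bytes=32768, block_size=64, associativity=4, latency_cycles=8)
```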
L2 Cache
- Unified L2 cache shared by instructions and data
- Configurable size, associativity, and latency
- Accessed on L1 cache misses
Main Memory
- Accessed on L2 cache misses
- Configurable main memory latency
- Variable-latency memory operations introduce pipeline stalls
Replacement Policies
- LRU (Least Recently Used)
- One additional configurable replacement policy

The simulator tracks cache hits, misses, and stall cycles introduced by memory access delays.
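The sketch below illustrates set-level LRU bookkeeping, with FIFO shown as one possible choice for the additional configurable policy. The README does not name the second policy, so FIFO (and all class names) are assumptions for illustration.

```python
from collections import OrderedDict, deque

class LRUSet:
    """One cache set with LRU replacement: a hit refreshes the tag's recency;
    a miss into a full set evicts the least-recently-used tag."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()           # insertion order == recency order

    def access(self, tag):
        if tag in self.tags:                # hit: move to most-recently-used
            self.tags.move_to_end(tag)
            return True
        if len(self.tags) >= self.ways:     # miss with a full set: evict LRU
            self.tags.popitem(last=False)
        self.tags[tag] = True
        return False

class FIFOSet:
    """Possible second policy (an assumption): evict the oldest-inserted tag,
    ignoring hit recency."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = deque()

    def access(self, tag):
        if tag in self.tags:
            return True
        if len(self.tags) >= self.ways:
            self.tags.popleft()
        self.tags.append(tag)
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in (0x1, 0x2, 0x1, 0x3, 0x2)])
# [False, False, True, False, False] -- 0x2 was evicted as LRU before its re-access
```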
In addition to the cache hierarchy, the simulator includes a programmer-managed scratchpad memory:
- Same size and access latency as L1D cache
- No automatic replacement or tag lookup
- Entirely controlled by software
The SPM is accessed through two dedicated instructions:
- lw_spm rd, offset(rs1): loads a word from scratchpad memory into register rd
- sw_spm rs2, offset(rs1): stores a word from register rs2 into scratchpad memory
The SPM is used to compare cache-based and software-managed memory systems for strided access patterns.
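A minimal sketch of the SPM model, with assumed class and method names: because there is no tag lookup, every access is a deterministic, fixed-latency read or write into a flat word array.

```python
class Scratchpad:
    """Software-managed memory: a flat word array with a fixed access latency,
    no tag lookup, and no automatic replacement."""
    def __init__(self, size_bytes, latency_cycles):
        self.words = [0] * (size_bytes // 4)
        self.latency = latency_cycles

    def lw_spm(self, addr):
        # lw_spm rd, offset(rs1): addr = regs[rs1] + offset, word-aligned
        return self.words[addr // 4], self.latency

    def sw_spm(self, addr, value):
        # sw_spm rs2, offset(rs1)
        self.words[addr // 4] = value
        return self.latency

spm = Scratchpad(size_bytes=4096, latency_cycles=1)
spm.sw_spm(addr=8, value=42)            # every access costs the same latency
value, cycles = spm.lw_spm(addr=8)      # (42, 1) -- no hit/miss distinction
```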
SYNC Instruction
- Acts as a barrier synchronization primitive
- A compute unit stalls at SYNC until all compute units reach the same instruction
- Implemented as a hardware-modeled no-op
- Ensures correctness for parallel workloads such as reductions

This mechanism prevents premature reads of shared data before all compute units have completed their updates.
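A minimal sketch of the barrier semantics in a cycle-driven model, using assumed names (the simulator's implementation may differ): a CU that reaches SYNC registers its arrival and then stalls, polling each cycle until all CUs have arrived.

```python
class SyncBarrier:
    """Models SYNC: a CU that reaches the barrier stalls until every CU
    has arrived, then all are released together."""
    def __init__(self, num_cus=4):
        self.num_cus = num_cus
        self.waiting = set()
        self.generation = 0                     # bumps each time the barrier opens

    def arrive(self, cid):
        """Called once when a CU executes SYNC; returns the generation to wait on."""
        self.waiting.add(cid)
        gen = self.generation
        if len(self.waiting) == self.num_cus:   # last CU has reached SYNC
            self.waiting.clear()
            self.generation += 1                # release every waiting CU
        return gen

    def released(self, gen):
        """Polled each cycle by a stalled CU; True once its generation has passed."""
        return self.generation > gen

barrier = SyncBarrier()
tickets = {cid: barrier.arrive(cid) for cid in range(3)}    # CUs 0-2 stall here
print(barrier.released(tickets[0]))                         # False: CU 3 not yet at SYNC
tickets[3] = barrier.arrive(3)                              # last CU arrives, barrier opens
print(all(barrier.released(t) for t in tickets.values()))   # True: all CUs proceed
```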
At the end of execution, the simulator reports:
- Total number of stall cycles
- Cache miss rate
- IPC (Instructions Per Cycle)
These metrics are used to evaluate different cache configurations and memory access strategies.
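A minimal sketch of how these metrics can be computed from raw counters; the counter names and the numbers in the example are illustrative, not measured results.

```python
def report(total_instructions, total_cycles, cache_hits, cache_misses, stall_cycles):
    """End-of-run statistics: IPC, overall cache miss rate, and stall cycles."""
    ipc = total_instructions / total_cycles if total_cycles else 0.0
    accesses = cache_hits + cache_misses
    miss_rate = cache_misses / accesses if accesses else 0.0
    return {
        "IPC": ipc,                     # instructions per cycle
        "Cache miss rate": miss_rate,   # misses / (hits + misses)
        "Stall cycles": stall_cycles,   # cycles lost to memory delays
    }

print(report(total_instructions=1200, total_cycles=2000,
             cache_hits=450, cache_misses=50, stall_cycles=600))
# IPC = 0.6, miss rate = 0.1 (illustrative numbers only)
```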
Supported Workloads
- Parallel array addition using per-CU partial sums (see the sketch below)
- Strided array access benchmarks
- Cache vs scratchpad memory performance comparison
- Barrier synchronization using SYNC
The simulator supports evaluating both direct-mapped and fully associative cache configurations.
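A functional, Python-level sketch of the per-CU partial-sum workload (a stand-in for the actual assembly benchmark): each CU sums the elements whose index maps to its CID, and the final reduction is only valid after the SYNC barrier.

```python
def parallel_array_sum(arr, num_cus=4):
    """Each CU handles indices i with i % num_cus == cid and writes a partial
    sum; after the barrier (SYNC), one CU adds the partial sums together."""
    partial = [0] * num_cus
    for cid in range(num_cus):                  # conceptually runs in parallel
        partial[cid] = sum(arr[i] for i in range(cid, len(arr), num_cus))
    # --- SYNC barrier: all partial sums are guaranteed written past this point ---
    return sum(partial)                         # final reduction by a single CU

print(parallel_array_sum(list(range(1, 17))))   # 136
```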
Project Log
- Date: 10-03-2025
- Members: Anirudh A, Raghavendra P
- Decision: Raghavendra implemented the GUI and Anirudh connected it to the backend using Flask.
-
- Date: 08-03-2025
- Members: Anirudh A, Raghavendra P
- Decision: Anirudh completed the shared IF unit and shared memory implementation and worked on special-purpose registers.
-
- Date: 06-03-2025
- Members: Anirudh A, Raghavendra P
- Decision: Raghavendra started the GUI; Anirudh completed the latency implementation and worked on the shared IF unit.
-
- Date: 04-03-2025
- Members: Anirudh A, Raghavendra P
- Decision: Anirudh implemented pipelining with data forwarding, and Raghavendra experimented with latencies.
-
- Date: 02-03-2025
- Members: Anirudh A, Raghavendra P
- Decision: Raghavendra and Anirudh worked on stall detection and the correctness of the stall count, and completed the stall-count implementation.
-
- Date: 28-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: Raghavendra prototyped pipelining without forwarding, while Anirudh worked on RAW hazard detection and completed the code for it, along with forwarding.
-
- Date: 25-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: Raghavendra and Anirudh discussed how to implement pipelining and decided on an architecture.
-
- Date: 19-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: Raghavendra completed the GUI using HTML, CSS, and JavaScript, while Anirudh worked on integrating the GUI with the Python backend. Anirudh decided to use Flask for this integration.
-
- Date: 17-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: The team decided to implement a GUI for the simulator. Initially, Raghavendra developed a basic GUI using Tkinter (import tkinter as tk; from tkinter import messagebox). However, it was not visually appealing, so we decided to build the GUI using HTML, CSS, and JavaScript instead.
-
- Date: 15-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: Anirudh tested the code with various programs and fixed several bugs involving the data segment format. We verified correct addressing of arrays and obtained the correct output for sum-of-elements problems.
-
- Date: 13-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: The team collaboratively implemented the Bubble Sort algorithm. We also added a data segment (.word) to the code by creating an array to store input data in the format: arr: .word 0x4 ...
-
- Date: 11-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: Anirudh realised that array indexing starts at 1, but since addi can also perform general arithmetic, logical and pointer arithmetic cannot be distinguished. Anirudh therefore decided to make memory allocations in multiples of 4 (4*x), with each index belonging to its modulo-4 core ID.
-
- Date: 09-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: The team divided responsibilities: 1. Raghavendra was assigned to implement arithmetic operations. 2. Anirudh was responsible for memory operations. 3. We discussed defining unique instructions that differ from the RISC-V instruction set.
-
- Date: 07-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: 1. Anirudh was assigned to complete the Software Design by 10-02-2025. 2. Raghavendra was tasked with reviewing relevant topics and enhancing his Python knowledge.
-
- Date: 06-02-2025
- Members: Anirudh A, Raghavendra P
- Decision: Decided to build the GPU simulator in Python because: 1. Python has a simpler syntax compared to C/C++, making it easier to implement and understand complex GPU architectures. 2. Python has great visualization tools like Matplotlib and Seaborn, which help analyze performance metrics.
Implementation Notes
- Special register: x31
- Instructions implemented: add addi sub la lw sw bne ble beq jal jr slt j li
- The .word directive is supported in the data segment
- Code must contain a .data and a .text segment to work
- Every label should have a corresponding instruction; for ease of parsing, labels should be written as standalone statements
- Memory is used starting from the end for storing .data segment values
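A minimal example of a program that follows these format rules, written to assembly.asm (the file the simulator reads). The program itself is an illustrative assumption and is not taken from the repository; it only uses instructions from the implemented list above.

```python
# Illustrative only: a tiny program with a .data segment, a .text segment,
# and standalone labels, written to assembly.asm for the simulator to read.
example_program = """\
.data
arr: .word 0x4
brr: .word 0x7

.text
main:
la x5, arr
lw x6, 0(x5)
la x7, brr
lw x8, 0(x7)
add x10, x6, x8
"""

with open("assembly.asm", "w") as f:
    f.write(example_program)
```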
- GUI: launch the web interface with:
cd Codes
cd Simulator/Phase2
pip install -r requirements.txt
python main.py
Open 127.0.0.1:5000 in your browser.
- File Reading: edit assembly.asm with your program, then run:
cd Codes
cd Simulator/Phase2
pip install -r requirements.txt
python main.py