November is almost over, so what should I write about? Final exams are right around the corner, so why not put together a final review~

Preface

In undergrad I took a course called Computer Organization, whose content overlaps somewhat with Computer Architecture. The textbook back then was "Computer Organization and Design: The Hardware / Software Interface", and it turns out the supplementary textbook the Computer Architecture professor listed is the same book. My impression after this semester is that Computer Architecture covers the material in a more focused and deeper way than Computer Organization. Take pipelining, for example: in undergrad I remember it was finished in one lecture, and a lot of the material only scratched the surface and stayed at the level of concepts, but in Computer Architecture it is covered much more thoroughly, so I got a lot out of this course. I have put my course notes in the repository https://github.com/NicoleMayer/ECE668-Notes.

Topics

  1. Pipeline
  2. Hazard
  3. Dynamic Scheduling
  4. Dynamic Branch Prediction
  5. Exceptions
  6. Tomasulo with ROB
  7. Superscalar
  8. Loop Unrolling
  9. VLIW
  10. Memory
  11. Cache
  12. Virtual Memory
  13. I/O

The professor said the final mainly covers the material after the midterm, so instead of following the order listed above, I will focus on reviewing everything from Dynamic Branch Prediction onward and only briefly summarize the earlier topics.

Dynamic Branch Prediction

(VS) Static Branch Prediction

  • Different types
    • predict taken / not taken
    • choose backward-going branches to be taken (loop); forward-going branches to be not taken (if)
    • based on profile information collected from earlier runs
  • pros: simple, easy to implement
  • cons: does not work well in practice (prediction accuracy is low)
  • difference: Dynamic Branch Prediction can only be done by the hardware, while the static one can be done by the Compiler

Besides the static methods above and the dynamic methods discussed below, there is another method called Delayed Branch, which uses the compiler to statically schedule instructions (moving them into the delay slot). MIPS uses a delay slot of only one instruction. This method is a bit like predict-not-taken, except that the instruction in the slot is never squashed, and the instruction placed in the delay slot does not have to be instruction i+1. Personally I feel this method should also count as Static Branch Prediction.

Branch History Table (BHT)

Shown here is a 1-bit Branch History Table.

Note: I was too lazy to figure out how to shrink the image, so please bear with it 😭

Type1. 1-bit Branch-Prediction

  • the diagram was already shown in the BHT figure above
  • design idea: take advantage of current branch’s previous taken / not taken choice
  • cons
    • needs a huge table if indexed by the full address; in that case most entries are empty (only a few instructions are branch instructions)
    • no full-address check if indexed only by the lower bits of the PC -> false positives (aliasing)

Type2. 2-bit Branch-Prediction

2 cases:

Scheme 1: change the prediction only after getting a misprediction twice, but return in one step

Scheme 2: change the prediction (in either direction) only after getting a misprediction twice

  • a 4-state finite state machine (2 bits per entry)
  • the 2 schemes have similar accuracy (a minimal sketch of scheme 2 follows below)
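
To make the scheme-2 behavior concrete, here is a minimal C sketch of a 2-bit saturating-counter BHT. The table size and indexing by low PC bits are my own assumptions for illustration, not the exact design from the slides.

    /* Minimal sketch of a 2-bit saturating-counter BHT (scheme 2 above).
       Table size and indexing by low PC bits are assumptions. */
    #include <stdint.h>

    #define BHT_ENTRIES 1024                 /* 2^10-entry table       */
    static uint8_t bht[BHT_ENTRIES];         /* states 0..3, init to 0 */

    /* predict: states 2,3 => taken; states 0,1 => not taken */
    int predict(uint32_t pc) {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    /* update: move one step toward the actual outcome (saturating) */
    void update(uint32_t pc, int taken) {
        uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken && *c < 3) (*c)++;
        else if (!taken && *c > 0) (*c)--;
    }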

Type3. Correlating Branch Prediction

  • also called “The Two-Level Branch Predictor” (wiki)
  • design idea: consider the dependency of different branches
  • uses a two-dimensional table of counters, also called a “Pattern History Table”
  • the table entries are two-bit counters

(k,2) predictor: k-bit global, 2-bit local

My rough understanding of the figure on the left: 1. the hanger-looking thing at the bottom is the Global BHR entry (implemented with a shift register), which records whether the 2 branches the program has just resolved were taken (2^2 = 4 entries, horizontally); 2. the last 4 bits of the branch address are used as an index into the corresponding entry of the local predictors (2^4 = 16 entries, vertically). A small sketch of this indexing follows.
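
A minimal sketch of how such a (2,2) predictor could be indexed, following my reading of the figure (2 bits of global history pick the column, the low 4 bits of the branch address pick the row); the sizes come from the figure, everything else is illustrative.

    /* Sketch of a (k,2) = (2,2) correlating predictor: a 16x4 Pattern
       History Table of 2-bit counters, rows picked by the low 4 PC bits,
       columns picked by a 2-bit global history shift register. */
    #include <stdint.h>

    static uint8_t pht[16][4];   /* Pattern History Table of 2-bit counters */
    static uint8_t ghr;          /* 2-bit global history (shift register)   */

    int predict(uint32_t pc) {
        return pht[pc & 0xF][ghr & 0x3] >= 2;          /* taken if counter >= 2 */
    }

    void update(uint32_t pc, int taken) {
        uint8_t *c = &pht[pc & 0xF][ghr & 0x3];
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = ((ghr << 1) | (taken ? 1 : 0)) & 0x3;    /* shift in latest outcome */
    }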

GShare Predictor

The (k,2) predictor above feels more like a design scheme, whereas the GShare predictor is an actual real-world implementation. A tiny sketch of the gshare indexing follows.
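
A gshare predictor forms a single index by XORing global history bits with PC bits and looks it up in one table of 2-bit counters. The sketch below is my own illustration with an assumed 4K-entry table, not the exact hardware design.

    /* Sketch of a gshare predictor: one table of 2-bit counters indexed by
       (PC bits XOR global history).  Table size is an assumption. */
    #include <stdint.h>

    #define ENTRIES 4096                  /* 12-bit index          */
    static uint8_t  counters[ENTRIES];
    static uint16_t ghr;                  /* 12-bit global history */

    static unsigned index_of(uint32_t pc) {
        return ((pc >> 2) ^ ghr) & (ENTRIES - 1);
    }

    int gshare_predict(uint32_t pc) { return counters[index_of(pc)] >= 2; }

    void gshare_update(uint32_t pc, int taken) {
        uint8_t *c = &counters[index_of(pc)];
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (ENTRIES - 1);
    }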

Type4. Tournament Branch Predictor

  • design idea: keep a balance of local and global information
  • use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector

Alpha 21264

  • implements the Tournament Branch Predictor

Selector

  • uses one of 4096 finite state machines (FSM) each with 4 states (4K entries)
  • the current state indicates which (global or local) predictor to use

The Selector’s Finite State Machines

Green states (a, b) mean predictor 1 is used; red states (c, d) mean predictor 2 is used.

The transition input has 2 bits (call them i1 and i2); each bit is 1 if the corresponding predictor predicted correctly and 0 otherwise.

  • i1,i2 = 11 or 00: no state change is needed
  • i1,i2 = 10: if predictor 1 is currently in use, the prediction was correct and no change is needed; otherwise, the state changes
  • i1,i2 = 01: if predictor 2 is currently in use, the prediction was correct and no change is needed; otherwise, the state changes (a small sketch of this update logic follows)
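
One plausible way to encode the selector described above is a 2-bit saturating counter per entry (values 0-1 choose predictor 1, values 2-3 choose predictor 2); the exact Alpha FSM may differ, so treat this as an illustrative sketch.

    /* Sketch of the tournament selector update described above.
       Each entry is a 2-bit counter: 0,1 -> use predictor 1; 2,3 -> use predictor 2.
       p1_correct / p2_correct are the bits i1, i2 from the text. */
    #include <stdint.h>

    static uint8_t sel[4096];                       /* 4K selector entries */

    int use_predictor2(unsigned idx) { return sel[idx] >= 2; }

    void selector_update(unsigned idx, int p1_correct, int p2_correct) {
        if (p1_correct == p2_correct) return;           /* 11 or 00: no change    */
        if (p1_correct && sel[idx] > 0) sel[idx]--;     /* 10: move toward pred 1 */
        if (p2_correct && sel[idx] < 3) sel[idx]++;     /* 01: move toward pred 2 */
    }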

Global Predictor

  • 12-bit pattern
    • ith bit is 0 => ith prior branch not taken
    • ith bit is 1 => ith prior branch taken
  • used as an index into the 4K-entry global predictor, where each entry is a standard 2-bit predictor
  • also used as an index into the 4K-entry table of FSMs (the selector)

Local Predictor

  • uses the least significant 10 bits of the PC for the branch instruction as index (1K table)
  • each entry has the most recent 10 branch outcomes for that particular branch (10 bits per entry)
  • next, this 10-bit history pattern is used to index into another 1K table
  • each entry in this table has a 3-bit saturating counter which provides the local prediction

Summary

Total size: 4K*2 + 4K*2 + 1K*10 +1K*3 = 29K bits! (~180K transistors)

Type5. Branch Target Buffer

Branch Target Buffer

  • abbreviated as BTB
  • design idea: determine the destination address before the instruction is fully decoded (borrowing the idea of a buffer)

Branch Target “Cache”

  • design idea: the utilization of the BTB is low, so the BTC stores only branches that are taken
  • “Cache” - Content Addressable Memory (CAM) or Associative Memory

5-Stage MIPS

Type6. Conditionally Executed Instructions

  • design idea: avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP
  • Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move
  • cons
    • Still takes a clock even if “annulled”
    • Stall if condition evaluated late

Type7. Return Address Predictors

  • design idea: the BTB would be useless for indirect branches such as procedure returns (JR r31)
  • use a stack discipline for procedures: save return addresses in a small buffer that acts like a stack for nested subroutines; 8 to 16 entries give a small miss rate

Concepts

🌟🌟🌟 1. Static VS Dynamic Branch Prediction 2. 7 types of Dynamic Branch Prediction Handling Schemes 3. Branch history table (BHT)

🌟🌟 1. Global branch history 2. (k,2) predictor 3. Branch Target Buffer (BTB) VS Branch Target Cache 4. Branch Target Address 5. Dynamic Branch Prediction for the 5-Stage MIPS

🌟 1. Predicated Execution 2. Return Addresses issue 3. Saturating counters 4. Aliasing

Exercises

  1. [Branch History Table, 1-bit Branch-Prediction, 2-bit Branch-Prediction] Consider the following sequence of actual outcomes for a single static branch, use one-bit counters / 2-bit counters (scheme 1 or 2) to get the prediction, then calculate the predict accuracy.
  2. [Correlating / Two-Level Branch Prediction] similar to question 1. Note that the Predictor State has 2^m different states for (m, n) predictor, each with n bits
    • Pay attention to how many branches the problem has in total, and list the correct History Register contents
    • Also note how many bits the counter for each history pattern (i.e., each state) has
    • In the mock exam, someone came up with a question about “Global branch predictor” vs. “Local branch predictor”; the difference is that the history register contents are shared among different branches in the former, while the latter does not share them
  3. Tournament Branch Predictor (use the Selector)
  4. [Branch Target Buffer] Branch CPI Penalty Calculation
    • If the problem does not mention that updating the BTB takes an extra cycle, that part is assumed to be pipelined and does not need to be counted
    • Branch.CPI.Penalty = Buffer.Hit.Rate * P(Incorrect.Prediction) * Penalty.Cycles + Buffer.Miss.Rate * P(Branch.Taken) * Penalty.Cycles
    • If the problem gives the percentage of branch instructions, multiply by that percentage at the end (a small worked example follows this list)
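
A small worked example of the penalty formula above; all the numbers (90% BTB hit rate, 10% misprediction among hits, 60% of missed branches taken, 2 penalty cycles, 20% branch frequency) are made up.

    /* Worked example of the branch CPI penalty formula above.
       All numbers are invented for illustration. */
    #include <stdio.h>

    int main(void) {
        double hit_rate   = 0.90;   /* BTB hit rate                         */
        double mispredict = 0.10;   /* P(incorrect prediction) given a hit  */
        double taken      = 0.60;   /* P(branch taken) given a BTB miss     */
        double penalty    = 2.0;    /* penalty cycles                       */
        double branch_pct = 0.20;   /* fraction of instructions that branch */

        double per_branch = hit_rate * mispredict * penalty
                          + (1.0 - hit_rate) * taken * penalty;
        printf("CPI penalty per branch: %.3f\n", per_branch);             /* 0.300 */
        printf("CPI penalty overall:    %.3f\n", per_branch * branch_pct); /* 0.060 */
        return 0;
    }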

Trivia

Trivia about dynamic branch prediction (will not be on the exam, just good to know)

  • 47% MIPS branches not taken on average, 53% MIPS branches taken on average
  • squash the instruction: means flush or clear the bad instructions
  • What would be the size of the ideal Branch History Table (BHT) for MIPS processor and why?
    • 2^30 since byte addressing
    • In practice, the BHT is usually indexed by 10-14 bits => a 1K-16K entry BHT
  • 3-bit predictors do not offer higher accuracy to justify their higher complexity
  • Alpha 21264’s Tournament Branch Predictor: the 4096-entry table is not addressed by 12 bits from the PC but by the 12 most recent branch outcomes. Experience has shown that such input works better than bits of the PC. The underlying idea is probably that the taken/not-taken history reveals correlations between branches; it sounds like a stretch, but it was found empirically, so count it as an implementation fact.
  • In practice, designers will use a big Branch History Table & a small Branch Target Cache to dynamically predict branches
  • RAM VS CAM
  • In SPEC, 85% of indirect branches are procedure returns

Exceptions

  • Definition: Unprogrammed control transfer
  • When an exception happens:
    • first go to the System Exception handler
    • then record:
    • the address of the “offending” instruction
    • any other information necessary to return afterwards
    • finally return control to user and restore the user states if possible
  • 2 Types (Traps & Interrupts)
    • Interrupts
      • caused by external events, e.g. Network, Keyboard, Disk I/O, Timer
      • asynchronous to program execution
      • may be handled between instructions
      • simply suspend and resume the user program
    • Traps
      • caused by internal events, e.g. exceptional conditions (overflow), errors (parity), page faults (non-resident page)
      • synchronous to program execution
      • the condition must be remedied by the handler
      • the instruction may be retried and the program continued, or the program may be aborted
  • In MIPS Pipeline

    • Possible exceptions in each stage
      • IF: page fault on instruction fetch; misaligned memory access; memory-protection violation
      • ID: undefined or illegal opcode
      • EX: arithmetic exception
      • MEM: page fault on data fetch; misaligned memory access; memory-protection violation; memory error
    • I do not quite get why MEM has one more exception type (memory error) than IF; it sounds like a very broad concept?

    • Multiple Exceptions

    • may occur in the same clock cycle

    • may even occur out of order

  • How to realize the Precise Exceptions? - 4 solutions

Concepts

🌟🌟🌟 1. Basics of exceptions 2. Interrupts vs. traps 3. Reorder buffer

🌟🌟 1. Exceptions in MIPS 2. Precise Interrupts

🌟 1. Future (Speculative) register file

Exercises

There is not much to test here; at most some conceptual questions.

Trivia

Trivia about exceptions (will not be on the exam, just good to know)

  • The exception handler performs one of 3 actions:
    1. notify the user of the exception (e.g., divide-by-zero or arithmetic-overflow) then terminate the program;
    2. try to correct or mitigate the exception then restart the offending instruction;
    3. if the exception is a benign interrupt (e.g., an I/O request), then save the program/pipeline state, service the interrupt request, then restart the program at the instruction pointed to by EPC + 4.
  • How to handle out of order exceptions? (based on https://web.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/restartExec.html)
    1. the hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction;
    2. the status vector is carried along as the instruction goes down the pipeline;
    3. once the exception indicator is set in the exception status vector, any control signals that may cause a data value to be written are turned off (this includes both register and memory writes);
    4. when an instruction enters WB, the exception status vector is checked;
    5. if any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine.

Tomasulo with ROB

Reorder Buffer

  • a queue that stores all the requests for register writes in the proper execution order
  • allows a register to be saved only when the corresponding instruction is committed

Smith and Plezskun, “Implementing Precise Interrupts in Pipelined Processors” IEEE Trans on Computers 1988 and ISCA 1985.

Result Shift Register (RSR)

  • used for instructions to reserve write back cycles to the register file
  • the number of entries in the RSR is determined by the number of cycles of the operation with the longest execution time

In Tomasulo’s algorithm

  • an instruction commits when it completes its execution and all its predecessors have already committed
  • data written to a reservation station will be lost if the pipeline is flushed
  • before changing any register the system asks if there are any Exceptions that occurred
    • if not => go ahead and write to the register
    • if so => see if the register should be changed

4 Steps of Speculative Tomasulo Algorithm

Issue — get instruction from FP Op Queue

If reservation station, reorder buffer slot, and result shift register slot free, issue instr & send operands & reorder buffer no. for destination. (this stage also called “dispatch”)

Execution — operate on operands (EX)

When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; this takes care of RAW. (sometimes called “issue”)

Write result — finish execution (WB)

Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available

Commit — update register with result from reorder buffer

When instr. at head of reorder buffer & result valid (present), update register with result (or store to memory) and remove instr from reorder buffer.

Mispredicted branch flushes entries in reorder buffer. (this stage sometimes called “graduation”)

Examples

Hardware

Tomasulo with Reorder Buffer

Code example:

1. LD F0, 10(R2) -> 9 cycles
2. ADDD F10, F4, F0 
3. DIVD F2, F10, F6 
4. BNE F2, <...>
5. LD F4, 0(R3) -> 3 cycles
6. ADDD F0, F4, F6 
7. ADDD F0, F4, F6

Register Alias Table (RAT) - for register renaming

Concepts

🌟🌟🌟 1. Reorder Buffer (ROB) 2. Result Shift Register (RSR) 3. Speculation

🌟 1. dispatch / issue / graduation

Exercises

  1. Tomasulo with Reorder Buffer (Some assumptions are similar to Tomasulo algorithm)

Superscalar & VLIW

How to reduce the CPI below 1

Similar questions: How Do We Design Faster CPUs?

  • Higher frequency
  • Larger dies (SOC - System On a Chip)
  • n-issue superscalar microprocessor
  • Parallel processing - use n independent processors
    • Limiting factors: programming for performance / load balancing / optimizing communication and synchronization
  • Simultaneous multi-threading (SMT)
  • Pipelining

For CPI < 1, multiple instructions should be issued in one cycle.

  • Vector Processing
    • Single Instruction Multiple Data (SIMD)
  • Superscalar
    • varying no. instructions/cycle (1 to 8), scheduled by HW (Tomasulo)
  • (Very) Long Instruction Words (V)LIW
    • fixed number of instructions (3-6) scheduled by the compiler; put ops into wide templates
  • Parallel processing
    • used mostly to run separate programs
  • Multi-threading
    • execute (2 or more) independent programs on the same core (processor)
    • when thread1 does not have 4 independent instructions that can issue in the same cycle, instructions from a 2nd thread can be executed
    • Processor must have separate PC and RF for each thread. (Extra hardware support)

Superscalar vs. VLIW

  • HW (superscalar) advantages
    • HW better at memory disambiguation since knows actual addresses
    • HW better at branch prediction with low overhead
    • HW maintains precise exception model
    • Same software works across multiple implementations
    • Smaller code size (not as many NOPs filling blank instructions)
  • SW (VLIW) advantages
    • Window of instructions that is examined for parallelism much bigger
    • Speculation can be based on large-scale program behavior, not just local information
    • More involved types of speculation can be done
    • Much less hardware involved in VLIW (for issuing instructions)

Superscalar

  • MIPS Superscalar: 2 instructions, 1 FP & 1 of anything else (any non-FP operation)
  • Fetch 64-bits/clock cycle; Integer/Load on left, FP on right
  • More ports for register files

4-way Alpha 21164 Superscalar:

For more information, see http://studies.ac.upc.es/ETSETB/SEGPAR/microprocessors/21164%20(micro).pdf. I disagree with the number of ports shown in the figure, but the professor insists he is not wrong...

Multiple-Issue Challenges

  1. Integer/FP split
  2. Higher branch penalties
  3. Design Complexity
    • Issue packet: group of instructions from fetch unit that could potentially issue in 1 clock
      • The Issue stage is usually split into two parts: the first decides how many instructions in the current packet can issue, and the second checks for hazards between the instructions selected in the first part and the instructions already executing
    • Multiple ports support
    • Rename logic
    • Need multiple buses with associated matching logic at every reservation station. Or, need multiple forwarding paths

Operand dependence comparisons needed for issue checking: each instruction in an n-instruction packet must compare its 2 source operands against the destinations of all earlier instructions in the packet, which gives 2*(n-1)*n/2 = n^2 - n comparisons. (This is just my own reasoning; a similar question appeared on a quiz but the reference answer was wrong, and I am not sure how the professor eventually corrected it; the slides vaguely say approximately n^2.)

Dynamic Scheduling in Superscalar

Load & Store queue

  • Loads/stores might cause dependency between integer and FP issue:
    • Replace load reservation station with a load queue; operands must be read in the order the Loads are fetched
    • Load queue checks addresses in Store Queue to avoid RAW violation
    • Store queue checks addresses in Load & Store Queues to avoid WAR & WAW

Register Renaming, Virtual Registers vs. Reorder Buffers

  • Alternative to Reorder Buffer is a larger virtual set of registers and register renaming
  • Virtual registers
    • hold both architecturally visible registers + temporary values
  • Renaming process maps names of architectural registers to registers in virtual register set
  • Simplifies instruction commit: mark register as no longer speculative, free register with old value
  • Number of registers limits no. instructions in execution (used until commit)

Very Long Instruction Word (VLIW)

  • Simple issuing (pro) but wasted instruction space (con)
    • All the operations in the long instruction word are independent => HW can issue and execute in parallel
    • The long instruction word has room for many operations and not all will be used – NOPs
    • Need compiling technique that schedules across several branches

Concepts

🌟🌟🌟 1. Instructions Per Clock cycle (IPC) 2. 5 ways to increase IPC 3. Superscalar

🌟🌟 1. Instruction-Level Parallelism (ILP) 2. System On a Chip (SOC) 3. Parallel processing 4. Simultaneous multi-threading (SMT) 5. Very Long Instruction Words (VLIW) 6. Pipelining 7. Register Renaming 8. Data Dependence (True dependence), Name Dependence (Anti-dependence, Output dependence), Control Dependencies

🌟 1. Load & Store queue 2. Virtual Registers 3. Issue packet

Exercises

  1. Very Long Instruction Word (VLIW)
    • Usually do loop unrolling first, then fill the operations into the corresponding slots
    • Make sure to work out the stalls between instructions beforehand
    • Register renaming can be done freely
    • Instruction Count = Iterations (before unrolling) * operation count in one VLIW entry; that is how the given answer computes it, but since we unrolled, why is it not the iteration count after unrolling?
    • As for whether MEM / BNE have a WB stage, better ask the professor during the exam; he previously told me BNE does not and SW depends on your assumptions, but the actual answers differ every time...
  2. Everything else is conceptual knowledge.

Loop Unrolling

Example

Optimization - 3 ways

  • Original
    • 9 cycles per loop
  • Reschedule
    • 6 cycles per loop
  • Unroll Loop 3 Times (straightforward way)
    • 27 cycles per 4 loops
  • Unroll Loop 3 Times (rescheduled one)
    • 14 cycles per 4 loops

Loop Unrolling

  • Do not usually know upper bound of loop (at compile time)
  • Suppose it is n, and we would like to unroll the loop to make k copies of the body
  • The 1st loop executes (n mod k) times and its body is the original loop body
  • The 2nd loop contains the unrolled body (equivalent to k of the original iterations) and iterates (n/k) times (see the sketch below)
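
A C sketch of the two-loop structure described above, assuming we unroll by k = 4 and that the body is a simple array update; the point is just to make the (n mod k) and (n/k) split concrete.

    /* Sketch of unrolling by k = 4 when n is not known at compile time:
       a remainder loop runs (n mod k) original iterations, then the
       unrolled loop runs n/k times with 4 copies of the body. */
    void saxpy(float *y, const float *x, float s, int n) {
        int i = 0;
        for (; i < n % 4; i++)          /* 1st loop: n mod k original iterations */
            y[i] += s * x[i];
        for (; i < n; i += 4) {         /* 2nd loop: unrolled body, n/k times    */
            y[i]   += s * x[i];
            y[i+1] += s * x[i+1];
            y[i+2] += s * x[i+2];
            y[i+3] += s * x[i+3];
        }
    }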

When Safe to Unroll Loop?

Greatest Common Divisor (GCD) test: determine whether unrolling the loop is safe by checking that the loop iterations are independent (a small example follows).
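
A tiny illustration of the GCD test as I understand it: if a loop writes a[x*i + k] and reads a[y*i + m], a loop-carried dependence can exist only if GCD(x, y) divides (m - k). The helper below is hypothetical, just to show the check.

    /* Hypothetical helper for the GCD dependence test:
       a[x*i + k] is written and a[y*i + m] is read in the same loop.
       A loop-carried dependence MAY exist only if GCD(x, y) divides (m - k). */
    #include <stdio.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    int may_depend(int x, int k, int y, int m) {
        return (m - k) % gcd(x, y) == 0;   /* 0 => independent, safe to unroll */
    }

    int main(void) {
        /* Example: a[2*i+3] is written and a[2*i] is read.
           GCD(2,2) = 2 does not divide (0 - 3), so iterations are independent. */
        printf("%d\n", may_depend(2, 3, 2, 0));   /* prints 0 */
        return 0;
    }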

Software pipelining

  • Definition: reorganize loops so that each “iteration” is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW)
  • “Symbolic” Loop Unrolling
    • Maximize result-use distance
    • Less code space than unrolling
    • Avoid structural hazards

Concepts

🌟🌟🌟 1. 3 ways to optimize loop codes

🌟🌟 1. “Symbolic” Loop Unrolling

🌟 1. Greatest Common Divisor (GCD test) 2. loop-carried dependence 3. explicitly parallel instruction computer (EPIC) 4. Register Window Mechanism

Exercises

  1. GCD test
  2. Identify the stalls in the instructions (usually a loop)
  3. Reschedule / Reorder the instructions to avoid stalls
  4. Loop unrolling
  5. “Symbolic” Loop Unrolling, maybe?

Here is the dividing line before the memory material~

Memory

Memory Hierarchy

  • Why memory hierarchy?
    • balance between memory size and access time
  • How to evaluate it?
    • Cost per byte
    • Unit of transfer
    • Transfer rate (bandwidth)
    • Access time (Note the difference between it and Cycle Time)
    • Access Time: time between request and word arriving
    • Cycle Time: time between requests > access time
    • Total capacity

Levels

(Upper Level) Registers -> Cache -> Main Memory -> Disk (Lower Level)

  • Registers: capacity 100s of bytes; access time <1 ns
  • Cache: capacity 10s-1000s KB; access time 1-10 ns; cost $10/MB
  • Main Memory: capacity 10s-1000s MB; access time 60-200 ns; cost $1/MB
  • Disk: capacity 100s-1000s GB; access time 10 ms (10,000,000 ns); cost $0.0031/MB
  • Unit of transfer
    • Registers - Cache: Instruction operands
    • Controlled by program or compiler, usually 4-8 bytes
    • Cache - Main Memory: Block
    • Block (also called line): unit of copying
    • Controlled by HW cache, usually 16-128 bytes
    • Main memory - Disk: Page
    • Controlled by OS, usually 512-4K bytes
  • How to determine Block (Page) Size?
    • 3 figures: miss penalty (linear increase), miss rate (down and up), miss penalty*miss rate (down and up) with the block size
    • Why does the miss rate go down at the beginning and then go up later on as the block size grows?
    • Although larger blocks seem to lower the miss rate thanks to spatial locality, for a cache of fixed size a larger block size means fewer blocks, so competition for them becomes fiercer and the miss rate eventually increases
    • The initial drop is because compulsory misses are reduced; the later rise is because conflict misses increase

Three Levels of Hierarchy

  • CPU generates virtual rather than physical addresses
  • VA to PA translation commonly done by TLB
  • Understanding this figure is very important!!!

Technology

  • Random Access (RAM)
    • Dynamic Random Access Memory (DRAM)
    • High density, low power, cheap, but slow
    • Dynamic: need to be refreshed regularly (8 ms, 1% time)
    • Addresses divided into 2 halves (Memory as a 2D matrix)
    • Two step operation:
      • RAS or Row Access Strobe
      • CAS or Column Access Strobe
    • only 1 transistor and 1 capacitor per bit
    • Static Random Access Memory (SRAM)
    • Low density, high power, expensive, fast
    • Static: content will last forever (until lose power)
    • 6 transistors/bit vs. 1 transistor for DRAM cell
    • Main Memory uses DRAM; Cache uses SRAM
  • “Non-so-random” Access
    • Access time varies from location to location and from time to time
    • Examples: Disk, CDROM

Major Properties

  • The Principle of Locality
    • Temporal Locality
    • Spatial Locality
  • Coherence property
    • copies in successive memory levels must be consistent
  • Inclusion Property
    • M_i (upper level) ⊆ M_{i+1} (lower level)

Memory Performance

  • Hit / Miss
  • Hit rate / Miss rate
  • Average Memory Access time (AMAT)
    • AMAT = Hit Rate * T_upper + (1 - Hit Rate) * T_lower
    • AMAT = Hit Rate * Hit Time + Miss Rate * Miss Penalty
    • Hit time = T_upper + Time to determine hit/miss
    • Miss penalty = Time to determine hit/miss + Time to replace a block in the upper level + Time to deliver the data to processor
    • (Update) AMAT = Hit Time + Miss Rate * Miss Penalty
    • Miss penalty = Time to replace a block in the upper level + Time to deliver the data to processor
    • Access time: time to lower level = f(latency to lower level)
    • Transfer time: time to transfer block =f(BW between upper & lower levels)
    • AMAT = (Hit Time_inst + Miss Rate_inst * Miss Penalty_inst) * Freq_inst + (Hit Time_data + Miss Rate_data * Miss Penalty_data) * Freq_data
  • CPU time
    • CPU time = Instruction Count * (ALUOps / Inst * CPI_ALUOps + MemAccess / Inst * AMAT) * Cycle Time
    • CPI_ALUOps does not include Memory instructions
    • CPU time = Instruction Count * (CPI_Execution + MemAccess / Inst * Miss Rate * Miss Penalty) * Cycle Time
    • CPI_Execution includes ALU and Memory instructions (a small worked example follows this list)
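
A small worked example of the AMAT and CPU-time formulas above; every parameter (1-cycle hit time, 5% miss rate, 50-cycle miss penalty, and so on) is invented for illustration.

    /* Worked example of AMAT = Hit Time + Miss Rate * Miss Penalty and of the
       CPU-time formula above.  All parameters are invented. */
    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;    /* cycles */
        double miss_rate    = 0.05;
        double miss_penalty = 50.0;   /* cycles */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);               /* 3.5 cycles */

        /* CPU time = IC * (CPI_exec + MemAccess/Inst * MissRate * MissPenalty) * Tc */
        double cpi_exec      = 1.2;
        double mem_per_inst  = 0.3;
        double cycle_time_ns = 0.5;
        double ic            = 1e9;
        double cpu_time_ns = ic * (cpi_exec + mem_per_inst * miss_rate * miss_penalty)
                                * cycle_time_ns;
        printf("CPU time = %.2f s\n", cpu_time_ns / 1e9);   /* ~0.975 s */
        return 0;
    }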

Memory Interleaving

  • Faster Memory Through Organization
    • Simple
    • CPU, Cache, Bus, Memory same width (32 or 64 bits)
    • Wide
    • CPU-to-Cache/Mux 1 word; Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
    • Disadvantages
      • Wide system bus
      • MUX between CPU & cache on processor’s critical path
      • Not useful for multiple units independently accessing memory
      • Multiprocessor/multi-core
      • I/O
      • CPU with Non-blocking Cache
    • Interleaved
    • CPU, Cache, Bus 1 word; Memory N Modules (4 Modules)
  • Simple timing model
    • 1 CPU cycle to send address
    • 6 CPU cycles access time, 1 cycle to send data
    • Cache Block is 4 words
    • Word size is 32 bits
    • Simple M = 4*(1+6+1) = 32
    • Wide M = 1+6+1 = 8
    • Interleaved M = (1+6+1)+3*1 = 11

Interleaved Memory

  • How many banks?
    • Ideally: number banks ≥ number clocks to access word in bank
    • Ex 4-way interleaving, 8 words/cache_block; M_cycle=10 CPU cycles
  • 8-way Memory Interleaving
    • The high-order bits are used to find the row (the location within a bank) and the low-order bits to find the column (which bank)
    • However, from what I found online, both kinds of designs exist; it depends on the designer's choice
  • Bandwidth (a pile of formulas here; I would count it as trivia, and I do not really understand it)
  • Avoiding Bank Conflicts

Concepts

🌟🌟🌟 1. Levels of the Memory Hierarchy * Registers -> Cache -> Main Memory -> Disk 2. Hit rate / Miss rate 3. Average memory-access time (AMAT)

🌟🌟 1. Simple vs. Wide vs. Interleaved 2. Random Access (RAM) vs. “Non-so-random” Access 3. Dynamic Random Access Memory (DRAM) 4. Static Random Access Memory (SRAM)

🌟 1. Temporal Locality & Spatial Locality 2. Coherence property 3. Inclusion Property 4. RAS or Row Access Strobe vs. CAS or Column Access Strobe

Exercises

  1. Calculate the AMAT (several cases)

Trivia

  1. How many transistors does a single bit of static ram have, not including decoding logic?
    • 2 transistors per gate (NOT gates)
    • 2 gates per latch/flip-flop
    • 2 for control
    • => 6 transistors
    • According to Static random-access memory - Wikipedia, “Each bit in an SRAM is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. Two additional access transistors serve to control the access to a storage cell during read and write operations.”
  2. Hit_Time << Miss_Penalty (50-100 cycles upon miss in cache; tens of thousands cycles upon miss in main memory)
  3. DRAM Architecture
    • Dual Inline Memory Module (DIMM) contains multiple chips with clock/control/address signals connected in parallel
    • 4-8 logical banks/chip; each physically implemented as several smaller arrays
  4. DRAM Operation
    • Two steps in read/write access to a given bank
      • Column access (CAS)
        • decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
        • on read, send latched bits out to chip pins
        • on write, change latches which then charge storage cells to required value
        • can perform multiple column accesses on same row without another row access (burst mode)
      • Row access (RAS)
      • decode row address, enable addressed row (often multiple Kb in row)
      • small change in voltage detected by sense amplifiers which latch whole row of bits
    • Each step has a latency of around 15-20ns in modern DRAMs
    • Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture

Cache

3 types

Fully Associative Cache

  • Place block anywhere in caches
  • Most flexible
  • Most expensive
    • I think it is because we can place a block anywhere, so search time becomes crucial; to get good performance, the hardware cost has to be higher.

  • the tag entry is 13 bits, corresponding to the number of main memory entries, 2^13 = 8192 (0-8191)
  • offset = 3 => 2^3 = 8, which is the size of a cache line and also the size of one row of data in main memory

Direct Mapped Cache

  • Question: based on the figure above, how does a Direct Mapped Cache operate?

  • Here main memory is divided into a grid that can be indexed by tag + index
  • 255 = 2^8 - 1, so first use the index to locate the cache entry, then check whether the tag stored in that entry matches, and finally check the valid bit (a small sketch of this address breakdown follows)
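
A minimal sketch of the tag/index/offset breakdown for a direct-mapped cache like the one in the figure, assuming a 3-bit offset and an 8-bit index (256 entries); whatever is left of the address becomes the tag.

    /* Sketch of a direct-mapped cache lookup: 3-bit block offset,
       8-bit index (256 entries), remaining address bits form the tag. */
    #include <stdint.h>

    #define OFFSET_BITS 3
    #define INDEX_BITS  8
    #define NUM_SETS    (1u << INDEX_BITS)

    struct line { int valid; uint32_t tag; };
    static struct line cache[NUM_SETS];

    int lookup(uint32_t addr) {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
        uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);
        return cache[index].valid && cache[index].tag == tag;   /* hit? */
    }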

n-way Set Associative Cache

  • n entries for each Cache Index
    • A block can be placed in one out of n frames (n power of 2)
    • Direct mapping is 1-way set associative
    • Fully associative is j-way set associative (j is no. of block frames in cache)

Two-way set associative cache

I think this figure has a small problem: the 5-bit Tag field is marked over the wrong range; the left half should be 5 bits and the right half another 5 bits, for 10 bits in total.

Hardware

  • Two direct-mapped caches operate in parallel
  • Cache Index selects a “set” from the cache (set includes 2 blocks)
  • The two tags in the set are compared in parallel
  • Data is selected based on the tag result

4 choices to consider

Write strategy

What happens on a write (Hit)?

  • Write through
    • The information is written to both the block in the cache and to the block in the memory
    • Add a Write Buffer to get better performance
    • just a FIFO
    • Typical number of entries: 4
    • Works fine if: Store frequency << 1 / DRAM_write_cycle
  • Write back
    • The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced
  • Pros and Cons
    • WT: Can always discard cached data - up-to-date data is in memory, while WB: Can’t discard it - may have to write it back
    • WT: need only a valid bit, while WB: need valid and dirty bits
    • WT: Read misses (in cache) do not result in writes (in memory); Memory (or other processors) always have latest data; Simpler management of cache
    • WB: Much lower bandwidth, since data often overwritten multiple times; Better tolerance to long-latency memory

What happens on a write miss?

  • Write allocate
    • Allocate new cache line in cache (do a “read miss”)
  • Write non-allocate (or write-around)
    • Simply send write data through to underlying memory/(level 2 cache) – don’t allocate new cache line

Block placement

Where can a block be placed in the upper level?

  • Direct Mapped
    • use the index to find the location; there is only one choice, and if it is occupied by another block, replace it
  • Set Associative
    • use index to find the location, n choices, if all of them are full and the current block is not in them (compare the tag), use replacement policy to replace one entry
  • Fully Associative
    • cache hit -> do nothing
    • cache miss -> cache not full -> place it in one of the available entries
    • cache miss -> cache full -> use replacement policy to replace one entry

Block replacement

Which block should be replaced on a miss?

  • Easy for Direct Mapped
  • Set Associative or Fully Associative:
    • Random
    • LRU (Least Recently Used) only approximated

Block identification

How is a block found if it is in the upper level?

  • Tag for each block
  • Increasing associativity shrinks the index and expands the tag; that is, if associativity increases, the number of index bits decreases while the number of tag bits increases

How to Improve Cache Performance?

One figure says it all

1. Reduce the miss rate

Where do Misses Come From (4 types)?

  • Compulsory
    • Also called cold start misses or first reference misses
    • The first access to a block is not in the cache
    • Misses in even an Infinite Cache
  • Capacity
    • If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved
    • Misses in Fully Associative Size X Cache
  • Conflict
    • Also called collision misses or interference misses
    • Occurs when a block can be discarded and later retrieved if too many blocks map to its set
    • Misses in set associative or direct mapped
  • Coherence
    • Misses caused by cache coherence
    • The first three types together are called the “3Cs”, while the Coherence miss is the poor little one left out, becoming the 4th “C”

Approach 1: Change Associativity

For cache sizes up to 4KB, higher associativity results in lower AMAT. But for larger caches, AMAT does not improve with increased associativity. Why?

Although increasing associativity does lower the miss rate, note that it also affects the hit time, which can end up making the final AMAT worse.

Approach 2: Fast Hit Time + Low Conflict => Victim Cache

  • How to combine fast hit time of direct mapped yet still avoid conflict misses?
    • Add buffer to place data discarded from cache
    • 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
    • Used in Alpha, HP machines

Approach 3: Reducing Misses via “Pseudo-Associativity”

  • How to combine fast hit time of Direct Mapped and have the lower conflict misses of 2-way SA cache?
  • Disadvantages
    • Saves time if the item is found in the first “way”, but wastes time if it is found in the last “way”
    • CPU pipeline is complicated if hit takes 1 or 2 cycles
  • Better for caches not tied directly to processor (i.e., L2)
  • Used in MIPS R10000 L2 cache, similar in UltraSPARC

Approach 4: Hardware Prefetching of Instructions & Data

  • Instruction Prefetching
    • Alpha 21064 fetches 2 blocks on a miss, i and (i+1)
    • Extra block placed in “stream buffer” – check upon miss
    • If hit in stream buffer, move stream buffer block into cache and prefetch next block (i+2)
  • Works with data blocks too:
    • a single data stream buffer removed about 25% of the misses from a 4KB cache; 4 streams removed 43%; 8 streams removed 50% to 70% of the misses from a 64KB, 4-way set associative cache

  • Strided prefetch

    • If observe sequence of accesses to block b, b+N, b+2N, then prefetch b+3N etc.
    • IBM Power 5 (2003) supports eight independent streams of strided prefetch per processor
  • Intel processors use

    • Adjacent cache line prefetch from L2 to L1
    • Instructions from L2 to L1 based on Branch prediction unit (BTB)
    • Data and instruction from memory to L2 – triggered by successive cache misses and detection of a stride in accesses
  • Prefetching relies on having extra memory bandwidth

    • Intel processors allow disabling of HW prefetch

Approach 5: Software Prefetching Data

  • Data Prefetching comes in two flavors:

    • Load data into register (HP PA-RISC loads)
    • Binding prefetch: Must be correct address and register
    • Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
    • Non-Binding prefetch: Can be incorrect

      for (i = 0; i < N; i++) {
          prefetch(&a[i+1]);
          prefetch(&b[i+1]);
          SUM = SUM + a[i] * b[i];
      }
      
  • Special prefetching instructions cannot cause faults; a form of speculative execution

  • Issuing Prefetch Instructions takes time

    • Is cost of issuing prefetch < savings in reduced misses ?
    • Higher superscalar reduces difficulty of issue bandwidth
  • Cache misses reduced by 75% on 8KB direct mapped cache using software Prefetch

Approach 6: Reducing Misses by Compiler Optimizations

  • Instructions
    • Reorder procedures in memory so as to reduce conflict misses
    • Through profiling to detect conflicts
  • Data
    • Merging Arrays
    • improve spatial locality by single array of compound elements vs. 2 arrays
    • Loop Interchange
    • change the nesting of loops to access data in the order it is stored in memory (see the sketch after this list)
    • Loop Fusion
    • Combine 2 independent loops that have same looping and some variables overlap
    • See examples on slide 22 and 23 from lesson 23
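
A tiny example of the loop interchange idea above, assuming a row-major C array: swapping the loop order makes the inner loop walk memory sequentially.

    /* Loop interchange sketch: x is row-major, so the second version walks
       memory with stride 1 in the inner loop and has much better spatial locality. */
    #define N 1024
    static double x[N][N];

    void before(void) {                 /* strides through memory by N doubles */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] = 2 * x[i][j];
    }

    void after(void) {                  /* accesses x sequentially */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2 * x[i][j];
    }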

2. Reduce the miss penalty

Approach 1: Read Priority over Write on Miss

  • Write-through w/ write buffers => RAW hazards with main memory reads on cache misses
    • If simply wait for write buffer to empty, might increase read miss penalty (MIPS 1000 by 50%)
    • Check write buffer contents before read; if no conflicts, let the memory access continue
  • Write-back => add buffer to hold displaced blocks
    • Suppose a read miss replaces a dirty block
    • Normal: Write dirty block to memory, and then do the read
    • Instead, copy the dirty block to a write buffer, then do the read, and finally do the write
    • CPU stall less since restarts as soon as do read

Approach 2: Early Restart and Critical Word First

  • Don’t wait for full block to be loaded before restarting CPU
  • Early restart
    • As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  • Critical Word First
    • Also called wrapped fetch and requested word first
    • Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
  • Generally useful only in large blocks

Approach 3: Non-Blocking Caches to Reduce Stalls on Misses

  • Non-blocking cache or lockup-free cache
    • allow data cache to continue to supply cache hits during a miss
    • requires out-of-order execution
  • hit under miss
    • reduces the effective miss penalty by working during miss
  • hit under multiple miss or miss under miss

    • may further lower the effective miss penalty by overlapping multiple misses
  • Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses

  • Requires multiple memory banks (otherwise cannot support)

  • Pentium Pro allows 4 outstanding memory misses

Approach 4: Add a Second-Level Cache

  • AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  • Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  • AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

  • Local miss rate

    • misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
  • Global miss rate

    • misses in this cache divided by the total number of memory accesses generated by the CPU
  • Global Miss Rate is what matters (a small worked example follows)
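
A worked example of the two-level AMAT formula, also showing the local vs. global L2 miss rate; all the numbers are made up.

    /* Worked example of AMAT with an L2 cache, plus local vs. global miss rate.
       All numbers are invented for illustration. */
    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1, miss_rate_l1 = 0.04;          /* L1: 1 cycle, 4% misses   */
        double hit_l2 = 10, miss_rate_l2_local = 0.25;   /* L2: 10 cycles, 25% local */
        double mem_penalty = 100;                        /* main memory: 100 cycles  */

        double miss_penalty_l1 = hit_l2 + miss_rate_l2_local * mem_penalty;
        double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;
        double global_l2 = miss_rate_l1 * miss_rate_l2_local;

        printf("AMAT = %.1f cycles\n", amat);                    /* 1 + 0.04*35 = 2.4 */
        printf("Global L2 miss rate = %.0f%%\n", global_l2 * 100); /* 1% */
        return 0;
    }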

3. Reduce the time to hit in the cache

Not mentioned in the lectures.

Separate instruction and data cache

Cache Reloads

  • Cache Misses vs. Time shows a periodic pattern
  • Reload transients can be time consuming
  • Each process (when run in isolation) has its own footprint in the cache
  • Compare the cache footprints of: process A alone, A then B, and B then A

Concepts

🌟🌟🌟 1. Basic Cache Optimizations (remember that table of 11 techniques) 2. Compulsory, Capacity, Conflict, Coherence 3. Fully Associative Cache, Direct Mapped Cache, n-way Set Associative Cache 4. Write strategy, Block identification, Block placement, Block replacement 5. Write through vs. Write back

🌟🌟 1. Write allocate vs. No-write allocate 2. Write Buffer 3. Local miss rate vs. Global miss rate

🌟 1. Victim Cache 2. Pseudo-Associativity 3. Prefetching (hardware & software) 4. Strided prefetch 5. Reorder procedures 6. Merging Arrays, Loop Interchange, Loop Fusion 7. Read Priority over Write on Miss 8. Early restart 9. Critical Word First (aka wrapped fetch, requested word first) 10. Non-blocking cache (aka lockup-free cache) 11. Cache Reloads 12. Reload transients 13. footprint

Exercises

  1. Compare Unified and Split Caches, the AMAT of both cases

Virtual Memory

Address Mapping / Translation

  • Virtual Address <-> Physical Address
  • Issue1: Address Translation Overhead
  • Issue2: Page Table Size

Page Table

  • The orange parts represent page tables; the blue parts represent the actual program data
  • OS sets the Page Table Base Register whenever active user process changes

  • Page Table Entry (PTE)
    • A bit to indicate if a page exists (V)
    • PPN (physical page number) for a memory-resident page
    • DPN (address on disk) for a page on the disk
    • Access rights bits for protection

Page Protection

Translation Lookaside Buffer (TLB)

  • Definition
    • a cache on the page table mappings
    • TLB access time comparable to cache access time
    • can be organized as fully associative, set associative, or direct mapped
    • usually small
    • usually no more than 128-256 entries even on high-end machines
    • this allows them to use an expensive fully associative lookup, though most mid-range machines use n-way set associative
  • Components
    • Virtual address
    • Physical address
    • Dirty
    • Ref
    • Valid
    • Access (No_access, Read_only, Execute_only, RW & Execute)
  • Note that many designers build separate TLBs for instructions and data to avoid the conflicts caused by multiple simultaneous accesses

However, it takes an extra TLB access to translate a VA to a PA, which makes cache access expensive.

Solutions:

  1. Access cache with VA
    • 但这种方法存在 synonym / alias problem:
      • Two different VAs map to same PA => two different cache entries holding data for the same physical address!
      • Two processes map the same VA to two different PAs with only one entry in cache - solution: flush cache upon context switch
  2. Overlapped Cache & TLB Access
    • The page offset bits are available (will not change); Use high order bits of VA for TLB access while low order bits index the cache
    • The Tag can be understood as the leading bits of the physical address (the part of the VA that needs translation)
    • The Index is the middle part of the physical address and also the cache index (this part of the VA needs no translation)
    • Disadvantages
      • Small cache as it is limited by page size
      • Increase page size, or
      • Use n-way set associative cache
        • Cache_size = n × Page_size

Page Table Size Problem

  • Although the TLB speeds up address translation, it does not solve the problem of the page table taking up too much space in memory
  • Solution 1: Hashing (reduce from 2^{21} to 2^{16})
    • Hashing function takes 21 bits and outputs 16 bits. It performs some arithmetic operation (ignoring overflows) and then a mod k operation (k=2^{16}). The result is an integer in [0,k-1]
    • The resulting table (called inverted page table) of 64K entries is sometimes divided into several smaller tables
    • Possible aliasing
    • The table entry must include the virtual page number
  • Solution 2: Segmented Memory
    • No table has more than 2048 entries
    • Only active page tables reside in memory
    • Penalty?

Page Replacement

  • Goal: minimize number of page faults
  • Effectiveness depends on
    • Page size
    • No. of available frames
    • Program behavior
  • Policies
    • Optimal algorithm (upper bound)
    • Random replacement (lower bound)
    • First-In First-Out (FIFO) - the page resident for the longest time, e.g., 12131415
    • Least Recently Used (LRU) - was not used recently for the longest time
    • Least Frequently Used (LFU) - least referenced
  • Implementation
    • FIFO - linked list
    • Add new page at tail, remove from head
    • Improvement: Put pages on the free-page list, scan it upon page fault and re-establish (soft page fault - no disk access)
    • LRU - most commonly used policy
    • Use (or Reference) bit - set when page accessed; OS periodically sorts and moves the referenced pages to the top & resets all Use bits
    • A more accurate implementation - a (software) Age Counter per page: +1 if not referenced, reset if referenced (see the sketch after this list)
    • Must provide timer interrupt to update LRU bits
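
A minimal sketch of the age-counter LRU approximation mentioned above, with the counters updated on a periodic timer tick; the frame count and field names are my own.

    /* Sketch of the age-counter LRU approximation described above:
       on every periodic tick, referenced pages reset their age and clear
       the use bit; unreferenced pages age by 1.  On a fault, evict the
       frame with the largest age.  Sizes and fields are illustrative. */
    #define NFRAMES 8

    struct frame { int used; int age; int page; };
    static struct frame frames[NFRAMES];

    void timer_tick(void) {
        for (int i = 0; i < NFRAMES; i++) {
            if (frames[i].used) { frames[i].age = 0; frames[i].used = 0; }
            else                  frames[i].age++;
        }
    }

    int pick_victim(void) {             /* frame not referenced for the longest */
        int victim = 0;
        for (int i = 1; i < NFRAMES; i++)
            if (frames[i].age > frames[victim].age) victim = i;
        return victim;
    }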

Concepts

🌟🌟🌟 1. Address Mapping / Translation 2. Virtual Address vs. Physical Address 3. Page Table 4. Page Table Entry (PTE) 5. Translation Look-Aside Buffer (TLB) 6. Page Replacement 7. Optimal algorithm, Random replacement, First-In First-Out (FIFO), Least Recently Used (LRU), Least Frequently Used (LFU)

🌟🌟 1. Page Protection 2. Use (or Reference) bit

🌟 1. Hashing (inverted page table), Segmented Memory (to solve the page size problem) 2. Access cache with VA, Overlapped Cache & TLB Access (to solve the overhead problem) 3. synonym / alias problem

Exercises

  1. LRU / OPT / FIFO

I/O

Generic I/O System

Disk

  • Long-term, non-volatile storage, large, inexpensive
  • also serve as slow level in the storage hierarchy
  • Hard drives or Solid-state drives

Hard Drive Terminology

  • Several platters, with information recorded magnetically on both surfaces (usually)
  • Bits recorded sequentially in tracks, divided into sectors (e.g., 512 Bytes)
  • Actuator moves the head (at the end of an arm, 1 per surface) over the track (“seek”), selects the surface, waits for the sector to rotate under the head, then reads or writes
  • “Cylinder”: all tracks under heads

Disk Performance

  • Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Overhead (a small worked example follows this list)
  • Seek Time
    • depends on number of tracks arm moves
    • = Average number of tracks arm moves
    • = (sum of all possible seek distances from all possible tracks) / Total_#_ops
      • Assumes random seek distance
    • Disk industry standard benchmark
    • Typical: ~8 ms
  • Rotation Time
    • depends on speed disk rotates, how far sector is from head’s current position
    • = Average distance to sector from head
    • 7200 Revolutions Per Minute ⇒ 120 Rev/sec
    • 1 revolution = 1/ 120 sec ⇒ 8.33 milliseconds
    • 1/2 rotation (revolution) on average ⇒ 4.17 ms
  • Transfer Time
    • depends on data rate (bandwidth) of disk (bit density), size of request
    • 10-40 MByte/sec
  • Capacity
    • Quadruples every 2 years
    • 100s Gbytes to TeraBytes
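
A small worked example of the disk latency formula using the typical numbers quoted above (8 ms seek, 7200 RPM, a 512-byte sector); the 20 MB/s transfer rate is within the range above and the controller overhead is made up.

    /* Worked example of Disk Latency = Seek + Rotation + Transfer + Controller,
       using the typical numbers quoted above (controller overhead is invented). */
    #include <stdio.h>

    int main(void) {
        double seek_ms     = 8.0;
        double rotation_ms = 0.5 * (60.0 / 7200.0) * 1000.0;  /* half a revolution */
        double transfer_ms = 512.0 / 20e6 * 1000.0;           /* 512 B at 20 MB/s  */
        double ctrl_ms     = 0.1;

        printf("rotation = %.2f ms\n", rotation_ms);   /* ~4.17 ms  */
        printf("transfer = %.3f ms\n", transfer_ms);   /* ~0.026 ms */
        printf("latency  = %.2f ms\n", seek_ms + rotation_ms + transfer_ms + ctrl_ms);
        return 0;
    }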

Disk Reliability

  • Servers and data-centers require large storage capacity but large arrays of disks have low reliability.
  • Reliability_disk = exp(-λt) where λ is a constant failure rate
  • MTTF = Mean_Time_to_Failure = 1 / λ
  • A single disk has MTTF = 50,000 hours ≈ 6 years
  • Reliability of N disks = {exp(- λt)}^N = exp(-Nλt)
  • MTTF_array = 1 / N λ
  • For N = 70 disks: 50,000/70 = 700 hours ≈ 1 month
  • Arrays (without redundancy) too unreliable

How to deal with this problem? - Redundant Arrays of Independent Disks (RAID)

  • Files are “striped” across multiple disks with redundant data added
  • Upon failure: Contents reconstructed from data redundantly stored in the array
  • Redundancy yields high data availability
  • Service still provided to user, even if some components fail but

    • Capacity penalty to store redundant info
    • Performance penalty to update redundant info
  • RAID 1: Disk Mirroring/Shadowing

    • Each disk is fully duplicated onto its “mirror”: very high availability
    • Bandwidth sacrifice on write: Logical write = two physical writes
    • Reads may be optimized (choose the copy closest to the head)
    • Most expensive solution: 100% capacity overhead
  • RAID 4: Parity Disk

    • P = sum mod 2 of other disks (parity)
    • If a disk fails, subtract P from the sum mod 2 of the other disks to find the missing information (see the sketch after this list)
    • Each sector has an error check field (cyclic redundancy check - CRC) that detects single bit faults and consecutive bit faults
  • RAID 5: High I/O Rate Interleaved Parity
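
A minimal sketch of the RAID-4 parity idea: the parity block is the XOR (sum mod 2) of the data blocks, and a failed disk is rebuilt by XORing the parity with the surviving disks. The disk count and stripe size are illustrative.

    /* Sketch of RAID-4 parity: parity = XOR of the data disks; a failed
       disk is rebuilt by XORing the parity with the surviving disks. */
    #include <stdint.h>
    #include <string.h>

    #define NDISKS 4
    #define STRIPE 8

    void compute_parity(uint8_t data[NDISKS][STRIPE], uint8_t parity[STRIPE]) {
        memset(parity, 0, STRIPE);
        for (int d = 0; d < NDISKS; d++)
            for (int i = 0; i < STRIPE; i++)
                parity[i] ^= data[d][i];
    }

    void rebuild(uint8_t data[NDISKS][STRIPE], const uint8_t parity[STRIPE], int failed) {
        memcpy(data[failed], parity, STRIPE);       /* start from the parity block */
        for (int d = 0; d < NDISKS; d++)
            if (d != failed)
                for (int i = 0; i < STRIPE; i++)
                    data[failed][i] ^= data[d][i];  /* XOR in the surviving disks  */
    }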

Concepts

🌟🌟🌟 1. Disk Performance 2. …

🌟🌟 1. Redundant Arrays of Inexpensive Disks (RAID) 2. …

🌟 1. Reliability (MTTF, MTTR, FIT, MTBF) 2. …

Exercises


Here is the dividing line for the material before the midterm~ Since time is limited, I will just list the relevant concepts and problem types.

Basics

Concepts

🌟🌟🌟 1. Speedup, Amdahl’s Law 2. CPU time, CPI, Instruction Count, Cycle Time 3. MIPS Architecture & Instruction Format

🌟🌟 1. Power (Dynamic Power vs. Static Power) 2. Manufacturing Cost, Die Yield, Wafer Yield, Defect Density, Die cost, Wafer cost, Dies per Wafer, Testing cost, Packaging cost 3. Performance: Latency (response time, CPU time), Throughput, MIPS / MFLOPS

🌟 1. What is Computer Architecture? (2 components) 2. Instruction Set Architecture (ISA) 3. Machine Organization (Pipelining, Memory Hierarchy, Storage systems, …) 4. IBM 360 (mainframe) 1964 5. 4 Approaches to reduce Power (Do nothing well, Dynamic Voltage-Frequency Scaling (DVFS), Low power state for DRAM and disks, Overclocking) 6. Turbo mode from Intel 7. Benchmarks (SPEC CPU2000, Synthetic Benchmarks, Whetstone Benchmark, Dhrystone Benchmark) 8. RISC vs. CISC 9. Processor State (General-Purpose Registers, Floating-Point Registers, Program Status Word (PSW) Registers) 10. Data Formats (bytes, half-words, words, double-words) 11. Instruction Formats (R-type, I-type, J-type) 12. Control signals (Microinstructions) 13. Little Endian vs. Big Endian 14. Byte addressing, Word alignment

Exercises

  1. Draw the pipeline execution figure (stalls)
  2. Calculate CPI, speedup, throughput, CPU time
  3. Amdahl’s Law
  4. Calculate the power & energy consumption (usually dynamic power)

Different Types of CPI calculation 1. CPI = Base CPI + average stalls per instruction (memory access: load & store (part of) + instruction fetch (100%)) 2. …

Pipeline & Hazard

Does not include the handling of Control hazards.

Concepts

🌟🌟🌟 1. Pipelining (Concepts, Characteristics, 5 steps) 2. Structural hazards, Data hazards and Control hazards 3. Data Hazards: RAW, WAR, WAW 4. How to deal with 3 kinds of Hazards? 5. Reschedule instructions (To solve Data hazard, Static vs. Dynamic rescheduling) 6. Performance Issues in Pipelining

🌟🌟 1. 8-Stage MIPS: MIPS R4000

🌟

  1. Latch / Bubbles / Stalls
  2. Freeze / Flush
  3. Dependency graph
  4. Interlock Control Logic (how to detect?)
  5. Why WAR and WAW not happened in 5-stage MIPS?

Exercises

  1. Fill in the pipeline diagram
    • Things to note: w/ vs. w/o data forwarding; only one execution unit vs. separate adder and multiplier; an X-stage pipeline rather than 5 stages; the register file supports simultaneous reads but only a single write per cycle; the adder and multiplier can operate in parallel and are fully pipelined (pipelined means more than 1 instruction can be in the adder or multiplier at a time); the Write back stage often has a structural hazard
  2. Calculate Pipeline throughput

Dynamic Scheduling

Does not include Speculation or Tomasulo with ROB.

Concepts

🌟🌟🌟 1. Scoreboard 2. Tomasulo

🌟🌟 1. Common Data Bus (CDB) 2. Reservation Stations

🌟 1. FP Op Queue 2. Load/Store Buffers 3. Instruction status 4. Register result status 5. Functional unit status

Exercises

  1. Scoreboard
  2. Tomasulo
    • Things to note (just from my experience doing problems; different assumptions give different results)
    • no multiple writes in one cycle
    • for RAW, the dependent instruction starts in the cycle right after the previous instruction's write cycle (not its complete cycle)
    • For loop cases, calculate the total cycles for one iteration and then the difference between consecutive iterations' issue cycles (remember to use the number of iterations minus 1); in most cases, listing one iteration's behavior is enough (I mean when counting cycles, but sometimes the answer lists several iterations and really does use all of that data, e.g., HW5 Question 2(b); it feels like black magic, and the lack of a standard bothers me); actually, when counting cycles, assume multiple writes per cycle are possible, otherwise we cannot get a precise number;
    • I found another school's homework for this course, https://courses.physics.illinois.edu/cs433/fa2020/assignments/cs433-fa20-hw3.pdf; its first question states its assumptions perfectly clearly with no ambiguity, and I hope the professor writes the final exam that clearly, sigh
    • For unlimited reservation stations with a single functional unit of each type, where functional units are not pipelined, the next instruction whose functional unit is occupied can still issue normally, but it starts executing in the cycle when the previous instruction is in its write stage. Also note that when filling in the table, do not go from the first instruction to the last; go from the first cycle to the last cycle, because a later instruction with no data dependence may well execute before an earlier one.
    • Check the machine is single-issue or N-issue

Afterword

The exam took place at 5:30 AM Beijing time, and by then there were still many parts I had not finished organizing, so this review is incomplete. For example, I never got around to the topics after Memory, yet most of the final exam questions came from there. The final was much harder than the midterm and included some unconventional questions, such as covering up part of a hardware structure and asking you to guess what it is, which really tests how well you can integrate everything you have learned. The Tomasulo with ROB questions I had expected to be the centerpiece turned out to be unusually easy, while the conceptual knowledge I had rather neglected showed up again and again in the big questions. In short, do not take the review lightly; anything can become an exam question.