## IT16301 - <br> Computer Organization and Architecture Prepared by <br> N.Uma \& V.Ranjith Assistant professor/IT

## Course Outcomes

CO1 - Build the basic structure of computer, operations and instructions
CO2 - Design arithmetic and logic unit
CO3 - Discuss the pipelined execution and design control unit
CO4 - Evaluate performance of memory systems
CO5 - Construct the parallel processing
architectures

## Text Books:

- 1. M. Moris Mano, " Computer System Architecture", 3rd Edition, Pearson/ PHI, 2007
- 2. David A. Patterson and John L. Hennessey, "Computer organization and design", Morgan kauffman / elsevier, Fifth edition, 2014.


## Unit 1

## BASIC COMPUTER ORGANIZATION AND DESIGN

" Instruction codes,
" Computer registers,
" computer instructions,
" Timing and Control,
" Instruction cycle,
" Memory-Reference Instructions,
" Input-output and interrupt,
" Complete computer description,
" Design of Basic computer,

## COMPUTER ORGANISATION AND ARCHITECTURE

- The components from which computers are built, i.e., computer organization.
- In contrast, computer architecture is the science of integrating those components to achieve a level of functionality and performance.
- It is as if computer organization examines the lumber, bricks, nails, and other building material
- While computer architecture looks at the design of the house.


## Basic Terminology

- Input - Whatever is put into a computer system.
- Data - Refers to the symbols that represent facts, objects, or ideas.


## Basic Terminology

Assembly language program (ALP) -Programs are written using mnemonics

- Mnemonic -Instruction will be in the form of English like form
- Assembler -is a software which converts ALP to MLL (Machine Level Language)


## Basic Terminology

- Interpreter - Converts HLL to MLL, does this job statement by statement
- System software - Program routines which aid the user in the execution of programs eg: Assemblers, Compilers
- Operating system - Collection of routines responsible for controlling and coordinating all


## Computers have two kinds of components:

- Hardware,
- consisting of its physical devices (CPU, memory, bus, storage devices, ...)
- Software,
- consisting of the programs it has (Operating system, applications, utilities, ...)

All computer functions are:

- Data PROCESSING
- Data STORAGE
- Data MOVEMENT
- Data $=$ Information
- CONTROL--- Coordinates How Information is Used
- INPUT UNIT:
-Converts the external world data to a binary format, which can be understood by CPU
- Mouse, Joystick etc
- OUTPUT UNIT:
-Converts the binary format data to a format that a common man can understand


## CPU

- The Brain of the machine
- carrying out computational task
- ALU, CU, Registers
- ALU Arithmetic and logical operations


## MEMORY

- Stores data, results, programs
- Two classes of storage
(i) Primary
(ii) Secondary
${ }^{\text {(.) }}$ Two types are RAM or R/W memory and ROM read only memory
(.) ROM is used to store data and program which is not going to change.


## Instruction Codes

- The Internal organization of a digital system is defined by the sequence of microoperations it performs on data stored in its registers
- The user of a computer can control the process by means of a program
- A program is a set of instructions that specify the operations, operands, and the processing sequence


## Instruction Codes

- A computer instruction is a binary code that specifies a sequence of micro-operations for the computer. Each computer has its unique instruction set
- Instruction codes and data are stored in memory
- The computer reads each instruction from memory and places it in a control register
- The control unit interprets the binary code of the instruction and proceeds to execute it by issuing a sequence of microoperations


## Instruction Codes

- An Instruction code is a group of bits that instructs the computer to perform a specific operation (sequence of microoperations). It is divided into parts (basic part is the operation part)
- The operation code of an instruction is a group of bits that defines certain operations such as add, subtract, shift, and complement


## Instruction Codes

- The number of bits required for the operation code depends on the total number of operations available in the computer
- 2 n (or little less) distinct operations $\rightarrow \mathrm{n}$ bit operation code


## Instruction Codes



## Instruction Codes

- An operation must be performed on some data stored in processor registers or in memory
- An instruction code must therefore specify not only the operation, but also the location of the operands (in registers or in the memory), and where the result will be stored (registers/memory)


## Instruction Codes

- Memory words can be specified in instruction codes by their address
- Processor registers can be specified by assigning to the instruction another binary code of $k$ bits that specifies one of $2 k$ registers
- Each computer has its own particular instruction code format
- Instruction code formats are conceived by computer designers who specify the architecture of the computer


## Instruction Codes Stored Program Organization

- An instruction code is usually divided into operation code, operand address, addressing mode, etc.
- The simplest way to organize a computer is to have one processor register (accumulator AC) and an instruction code format with two parts (op code, address)


## Instruction Codes Stored Program Organization



Instruction Format


Memory 4096x16

Instructions (program)

## Instruction Codes Indirect Address

- There are three Addressing Modes used for address portion of the instruction code:
- Immediate: the operand is given in the address portion (constant)
- Direct: the address points to the operand stored in the memory
- Indirect: the address points to the pointer (another address) stored in the memory that references the operand in memory
- One bit of the instruction code can be used to distinguish between direct \& indirect addresses


## Instruction Codes Indirect Address



## Instruction Codes - Indirect Address

- Effective address: the address of the operand in a computationtype instruction or the target address in a branch-type instruction

The pointer can be placed in a processor register instead of memory as done in commercial computers

## Computer Registers

- Computer instructions are normally stored in consecutive memory locations and executed sequentially one at a time
- The control reads an instruction from a specific address in memory and executes it, and so on
- This type of sequencing needs a counter to calculate the address of the next instruction after execution of the current instruction is completed


## Computer Registers

- It is also necessary to provide a register in the control unit for storing the instruction code after it is read from memory
- The computer needs processor registers for manipulating data and a register for holding a memory address


## Registers in the Basic Computer




## List of BC Registers

| DR | 16 | Data Register | Holds memory operand |
| :--- | ---: | :---: | :---: |
| AR | 12 | Address Register | Holds address for memory |
| AC | 16 | Accumulator | Processor register |
| IR | $16 \quad$ Instruction Register |  | Holds instruction code |
| PC | 12 | Program Counter | Holds address of instruction |
| TR | 16 | Temporary Register | Holds temporary data |
| INPR | 8 | Input Register | Holds input character |
| OUTR | 8 | Output Register | Holds output character |



## Computer Registers Common Bus System

## Computer Registers- Common Bus System

- S2S1S0: Selects the register/memory that would use the bus
- LD (load): When enabled, the particular register receives the data from the bus during the next clock pulse transition
- E (extended AC bit): flip-flop holds the carry
- DR, AC, IR, and TR: have 16 bits each
- AR and PC: have 12 bits each since they hold a memory address


## Computer Registers-Common Bus System

- When the contents of AR or PC are applied to the $16-$ bit common bus, the four most significant bits are set to zeros
- When AR or PC receives information from the bus, only the 12 least significant bits are transferred into the register
- INPR and OUTR: communicate with the eight least significant bits in the bus


## Computer Registers-Common Bus

 System- INPR: Receives a character from the input device (keyboard,...etc) which is then transferred to AC
- OUTR: Receives a character from AC and delivers it to an output device (say a Monitor)
- Five registers have three control inputs: LD (load), INR (increment), and CLR (clear)
- Register $\equiv$ binary counter with parallel load and


## Computer Registers-Memory Address

- The input data and output data of the memory are connected to the common bus
- But the memory address is connected to AR
- Therefore, AR must always be used to specify a memory address
- By using a single register for the address, we eliminate the need for an addresc hus that would have


## Computer Registers- Memory Address

- Register $\rightarrow$ Memory: Write operation
- Memory $\rightarrow$ Register: Read operation (note that AC cannot directly read from memory!!)
- Note that the content of any register can be applied onto the bus and an operation can be performed in the adder and logic circuit during the same clock cycle


## Computer Registers-Memory Address

- The transition at the end of the cycle transfers the content of the bus into the destination register, and the output of the adder and logic circuit into the AC
- For example, the two microoperations

$$
\mathrm{DR} \leftarrow \mathrm{AC} \text { and } \mathrm{AC} \leftarrow \mathrm{DR} \quad \text { (Exchange) }
$$

can be executed at the same time

## Computer Registers Memory Address

- 1- place the contents of AC on the bus (S2S1S0=100)
- 2- enabling the LD (load) input of DR
- 3- Transferring the contents of the DR through the adder and logic circuit into AC
- 4- enabling the LD (load) input of AC


## Computer Instructions

Basic Computer Instruction code format
Memory-Reference Instructions (OP-code = 000 ~ 110)

| 15 | 14 | 12 |
| :---: | :---: | :---: |
| 1 | Opcode | Address |

Register-Reference Instructions (OP-code =111, I=0)

| 15 | 1211 |  |  |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1 | 1 | 1 |  | Register operation |

Input-Output Instructions
(OP-code =111, I = 1 )

| 15 | 1211 |  |  |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 1 | 1 | 1 | $1 / O$ operation |  |

## BASIC COMPUTER INSTRUCTIONS

| Hex Code |  |  | Description |
| :---: | :---: | :---: | :---: |
| Symbol | $\boldsymbol{I}=0 \quad \mathrm{I}$ | $1=1$ |  |
| AND | 0 xxx 8 | 8 xxx | AND memory word to AC |
| ADD | 1 xxx 9 | 9 kxx | Add memory word to AC |
| LDA | 2xxx A | Axxx | Load AC from memory |
| STA | $3 x x x$ B | Bxxx | Store content of AC into memory |
| BUN | 4xxx | Cxxx | * Branch unconditionally |
| B\$A | 5xxx | Dxxx | - Branch and save return address |
| ISZ | 6xxx | Exxx | Increment and skip if zero |
| CLA | 7800 |  | Clear AC |
| CLE | 7400 |  | Clear E |
| CIMA | 7200 |  | Complement AC |
| CME | 7100 |  | Complement E |
| CIR | 7080 |  | Circulate right AC and E |
| CIL | 7040 |  | Circulate left AC and E |
| INC | 7020 |  | Increment AC |
| SPA | 7010 |  | Skip next instr. if AC is positive |
| SNA | 7008 |  | Skip next instr. if AC is negative |
| SZA | 7004 |  | Skip next instr. if AC is zero |
| SZE | 7002 |  | Skip next instr. if E is zero |
| HLT | 7001 |  | Halt computer |
| INP | F800 |  | Input character to AC |
| OUT | F400 |  | Output character from AC |
| SKI | F200 |  | Skip on input flag |
| SkO | F100 |  | Skip on output flag |
| ION | F080 |  | Interrupt on |
| 10 F | F040 |  | Interrupt off |

## Computer Instructions

## Instruction Set Completeness

- The set of instructions are said to be complete if the computer includes a sufficient number of instructions in each of the following categories:
- Arithmetic, logical, and shift instructions
- Instructions for moving information to and from memory and processor registers


## Timing \& Control

- The timing for all registers in the basic computer is controlled by a master clock generator
- The clock pulses are applied to all flip-flops and registers in the system, including the flipflops and registers in the control unit
- The clock pulses do not change the state of a register unless the register is enabled by a


## Timing \& Control

- The control signals are generated in the control unit and provide control inputs for the multiplexers in the common bus, control inputs in processor registers, and microoperations for the accumulator
- There are two major types of control organization:
- Hardwired control
- Microprogrammed control


## Timing \& Control

- In the hardwired organization, the control logic is implemented with gates, flip-flops, decoders, and other digital circuits.
- In the microprogrammed organization, the control information is stored in a control memory (if the design is modified, the microprogram in control memory has to be updated)
- D3T4: SC $\leftarrow 0$


## The Control Unit for the basic computer



Hardwired Control Organization

- Generated by 4-bit sequence counter and 4x16 decoder
- The SC can be incremented or cleared.
- Example: T0, T1, T2, T3, T4, T0, T1, . . .

Assume: At time T4, SC is cleared to 0 if decoder output D3 is active.
D3T4: SC $\leftarrow 0$


## Timing \& Control

- A memory read or write cycle will be initiated with the rising edge of a timing signal
- Assume: memory cycle time < clock cycle time!
- So, a memory read or write cycle initiated by a timing signal will be completed by the time the next clock goes through its positive edge
- The clock transition will then be used to load the


## Timing \& Control

- T0: AR $\leftarrow \mathrm{PC}$
- Transfers the content of PC into AR if timing signal T0 is active
- T0 is active during an entire clock cycle interval
- During this time, the content of PC is placed onto the bus (with $\mathrm{S} 2 \mathrm{~S} 1 \mathrm{~S} 0=010$ ) and the LD (load) input of AR is enabled
- The actual transfer does not occur until the end of the clock


## Instruction Cycle

- A program is a sequence of instructions stored in memory
- The program is executed in the computer by going through a cycle for each instruction (in most cases)
- Each instruction in turn is subdivided into a sequence of sub-cycles or phases


## Instruction Cycle

- Instruction Cycle Phases:
- 1- Fetch an instruction from memory
- 2- Decode the instruction
- 3- Read the effective address from memory if the instruction has an indirect address
- 4- Execute the instruction
- This cycle repeats indefinitely unless a HALT


## Instruction Cycle Fetch and Decode

- Initially, the Program Counter (PC) is loaded with the address of the first instruction in the program
- The sequence counter SC is cleared to 0 , providing a decoded timing signal T0
- After each clock pulse, SC is incremented by one, so that the timing signals go through a


## Instruction Cycle

## Fetch and Decode

- T0: AR $\leftarrow \mathrm{PC} \quad$ (this is essential!!)

The address of the instruction is moved to AR.
$-\mathrm{T} 1: \mathrm{IR} \leftarrow \mathrm{M}[\mathrm{AR}], \mathrm{PC} \leftarrow \mathrm{PC}+1$
The instruction is fetched from the memory to IR , and the PC is incremented.

- T2: D0,..., D7 $\leftarrow$ Decode $\operatorname{IR}(12-14), \mathrm{AR} \leftarrow \operatorname{IR}(0-11)$, $\mathrm{I} \leftarrow \mathrm{IR}(15)$


## BC Instruction cycle: [Fetch Decode [Indirect] Execute]*

- Fetch and Decode

T0: AR $\leftarrow \mathrm{PC}(S 0 S 1 S 2=010, T 0=1)$
$T 1: I R \leftarrow M[A R], P C \leftarrow P C+1 \quad(S 0 S 1 S 2=111, T 1=1)$
T2: DO, . . . , D7 $\leftarrow$ Decode IR(12-14), AR $\leftarrow \operatorname{IR}(0-11), I \leftarrow \operatorname{IR}(15)$



## REGISTER REFERENCE INSTRUCTIONS

Register Reference Instructions are identified when

- D7 = 1, I = 0
- Register Ref. Instr. is specified in B0 ~ B11 of IR
- Execution starts with timing signal T3
$r=D 7$ I' T3 $=>$ Register Reference Instruction
$B i=\operatorname{IR}(i), i=0,1,2, \ldots, 11$, the ith bit of $\operatorname{IR}$.

|  |  | 1: <br> 0: <br> rB9: | ```\(\mathbf{S C} \leftarrow 0\) \(A C \leftarrow 0\) \(\mathrm{E} \leftarrow \mathbf{0}\) \(A C \leftarrow A C^{\prime}\) \(E \leftarrow E^{\prime}\) \(\mathrm{AC} \leftarrow \operatorname{shr} \mathrm{AC}, \mathrm{AC}(15) \leftarrow \mathrm{E}, \mathrm{E} \leftarrow \mathrm{AC}(0)\) \(\mathrm{AC} \leftarrow \mathrm{shl} \mathrm{AC}, \mathrm{AC}(0) \leftarrow \mathrm{E}, \mathrm{E} \leftarrow \mathrm{AC}(15)\) \(A C \leftarrow A C+1\) if \((A C(15)=0)\) then \((P C \leftarrow P C+1)\) if \((\mathrm{AC}(15)=1)\) then \((P C \leftarrow P C+1)\) if \((A C=0)\) then ( \(P C \leftarrow P C+1\) ) if \((E=0)\) then ( \(P C \leftarrow P C+1\) ) \(S \leftarrow 0\) ( S is a start-stop flip-flop)``` |
| :---: | :---: | :---: | :---: |

### 5.6 MEMORY REFERENCE INSTRUCTIONS

| Symbol | Operation Decoder | Symbolic Description |
| :---: | :---: | :---: |
|  | AND DO | $A C \leftarrow A C \wedge M[A R]$ |
|  | ADD D1 | $A C \leftarrow A C+M[A R], E \leftarrow C$ Cout |
|  | LDA D2 | $A C \leftarrow M[A R]$ |
|  | STA D3 | $\mathrm{M}[\mathrm{AR}] \leftarrow \mathrm{AC}$ |
|  | BUN | D4 $\mathrm{PC} \leftarrow \mathrm{AR}$ |
|  | BSA D5 | $\mathrm{M}[\mathrm{AR}] \leftarrow \mathrm{PC}, \mathrm{PC} \leftarrow \mathrm{AR}+1$ |
|  | ISZ D6 | $\mathrm{M}[\mathrm{AR}] \leftarrow \mathrm{M}[A R]+1$, if $\mathrm{M}[\mathrm{AR}]+1=0$ then PC $\leftarrow \mathrm{PC}+1$ |

- The effective address of the instruction is in AR and was placed there during timing signal T 2 when $\mathrm{I}=0$, or during timing signal T 3 when $\mathrm{I}=1$
- Memory cycle is assumed to be short enough to be completed in a CPU cycle
- The execution of MR Instruction starts with T4

AND to AC
DOT4: $\quad \mathrm{DR} \leftarrow \mathrm{M}[\mathrm{AR}] \quad$ Read operand
DOT5: $\quad A C \leftarrow A C \wedge D R, S C \leftarrow 0 \quad$ AND with $A C$
ADD to AC
D1T4: $\quad \mathrm{DR} \leftarrow M[A R] \quad$ Read operand
D1T5: $\quad \mathrm{AC} \leftarrow \mathrm{AC}+\mathrm{DR}, \mathrm{E} \leftarrow$ Cout, $\mathrm{SC} \leftarrow 0$ Add to AC and store carry in E

## MEMORY REFERENCE INSTRUCTIONS

LDA: Load to AC
D2T4: $\quad \mathrm{DR} \leftarrow \mathrm{M}[\mathrm{AR}]$
D2T5: $\quad A C \leftarrow D R, S C \leftarrow 0$
STA: Store AC
D3T4: $\quad M[A R] \leftarrow A C, S C \leftarrow 0$
BUN: Branch Unconditionally
D4T4: $\quad \mathrm{PC} \leftarrow A R, S C \leftarrow 0$
BSA: Branch and Save Return Address
$M[A R] \leftarrow P C, P C \leftarrow A R+1$

Memory, PC, AR at time T4


Memory, PC after execution


## Memory Reference Instructions

BSA: executed in a sequence of two micro-operations:
D5T4: $\quad M[A R] \leftarrow P C, A R \leftarrow A R+1$
D5T5: $\quad P C \leftarrow A R, S C \leftarrow 0$
ISZ: Increment and Skip-if-Zero
D6T4: $\quad \mathrm{DR} \leftarrow \mathrm{M}[\mathrm{AR}]$
D6T5: $\quad \mathrm{DR} \leftarrow \mathrm{DR}+1$
D6T6: $\quad M[A R] \leftarrow D R$, if $(D R=0)$ then $(P C \leftarrow P C+1), S C \leftarrow 0$


## Input-Output and Interrupt

- Instructions and data stored in memory must come from some input device
- Computational results must be transmitted to the user through some output device
- For the system to communicate with an input device, serial information is shifted into the input register INPR


## Input-Output and Interrupt



## Input-Output and Interrupt

- INPR and OUTR communicate with a communication interface serially and with the AC in parallel. They hold an 8-bit alphanumeric information
- I/O devices are slower than a computer system $\rightarrow$ we need to synchronize the timing rate difference between the input/output device and the computer.


## Input-Output and Interrupt

- FGI is set to 1 when a new information is available in the input device and is cleared to 0 when the information is accepted by the computer
- FGO: 1-bit output flag used as a control flipflop to control the output operation
- If FGO is set to 1 , then this means that the computer can send out the information from


## Input-Output and Interrupt

- The process of input information transfer:
- Initially, FGI is cleared to 0
- An 8-bit alphanumeric code is shifted into INPR (Keyboard key strike) and the input flag FGI is set to 1
- As long as the flag is set, the information in INPR cannot be changed by another data entry


## Input-Output and Interrupt

- Once the flag is cleared, new information can be shifted into INPR by the input device (striking another key)
- The process of outputting information:
- Initially, the output flag FGO is set to 1
- The computer checks the flag bit; if it is 1 , the information from AC is transferred in parallel to OUTR and FGO is cleared to 0


## Input-Output and Interrupt

- When the operation is completed, the output device sets FGO back to 1
- The computer does not load a new data information into OUTR when FGO is 0 because this condition indicates that the output device is busy to receive another information at the moment!!


## Input-Output Instructions

- Needed for:
- Transferring information to and from AC register
- Checking the flag bits
- Controlling the interrupt facility
- The control unit recognize it when $\mathrm{D} 7=1$ and $\mathrm{I}=1$
- The remaining bits of the instruction specify the particular operation


## Input-Output Instructions

```
D7IT3 = p
IR(i)= Bi, i= 6, ..., 11
```

| INP pB11: | $\mathrm{AC}(0-7) \leftarrow$ INPR, FGI $\leftarrow 0$ | Input char. to AC |
| :--- | :--- | :--- |
| OUT pB10: | $\mathrm{OUTR} \leftarrow \mathrm{AC}(0-7), \mathrm{FGO} \leftarrow 0$ | Output char. from AC |
| SKI pB9: if $(\mathrm{FGI}=1)$ then $(\mathrm{PC} \leftarrow \mathrm{PC}+1)$ | Skip on input flag |  |
| SKO pB8: $\mathrm{if}(\mathrm{FGO}=1)$ then $(\mathrm{PC} \leftarrow \mathrm{PC}+1)$ | Skip on output flag |  |
| ION pB7:IEN $\leftarrow 1$ | Interrupt enable on |  |
| IOF pB6:IEN $\leftarrow 0$ | Interrupt enable off |  |

## Program Interrupt

- The process of communication just described is referred to as Programmed Control Transfer
- The computer keeps checking the flag bit, and when it finds it set, it initiates an information transform (this is sometimes called Polling)
- This type of transfer is in-efficient due to the difference of information flow rate between the computer and the I/O device


## Program Interrupt

- The computer is wasting time while checking the flag instead of doing some other useful processing task
- An alternative to the programmed controlled procedure is to let the external device inform the computer when it is ready for the transfer
- This type of transfer uses the interrupt facility


## Program Interrupt

- While the computer is running a program, it does not check the flags
- Instead:
- When a flag is set, the computer is immediately interrupted from proceeding with the current program
- The computer stops what it is doing to take care of the input or output transfer
- Then. it returns to the current program to continue what it


## Program Interrupt

- The interrupt facility can be enabled or disabled via a flip-flop called IEN
- The interrupt enable flip-flop IEN can be set and cleared with two instructions (IOF, ION):
- IOF: IEN $\leftarrow 0$ (the computer cannot be interrupted)
- ION: IEN $\leftarrow 1$ (the computer can be interrupted)


## Program Interrupt

- Another flip-flop (called the interrupt flip-flop $\mathbf{R}$ ) is used in the computer's interrupt facility to decide when to go through the interrupt cycle
- FGI and FGO are different here compared to the way they acted in an earlier discussion!!


## Program Interrupt

- The interrupt cycle is a hardware implementation of a branch and save return address operation (BSA)
- The return address available in PC is stored in a specific location where it can be found later when the program returns to the instruction at which it was interrupted
- This location may be a processor register, a


## Program Interrupt

- For our computer, we choose the memory location at address 0 as a place for storing the return address
- Control then inserts address 1 into PC:
- this means that the first instruction of the interrupt service routine should be stored in memory at address 1 ,
- or, the programmer must store a branch instruction that sends the control to an interrupt


## Program Interrupt



Flowchart for interrupt cycle

## Program Interrupt

- IEN, $\mathrm{R} \leftarrow 0$ : no more interruptions can occur until the interrupt request from the flag has been serviced
- The service routine must end with an instruction that re-enables the interrupt (IEN $\leftarrow 1)$ and an instruction to return to the instruction at which the interrupt occurred
- The instruction that returns the control to the original program is "indirect BUN 0 "


## Program Interrupt

- Example: the computer is interrupted during execution of the instruction at address 255



## Interrupt Cycle

- The fetch and decode phases of the instruction cycle must be :
(Replace T0, T1, T2 $\rightarrow$ R'T0, R'T1, R'T2 (fetch and decode phases occur at the instruction cycle when $\mathrm{R}=0$ )
- Interrupt Cycle:
$-\mathrm{RT} 0: \mathrm{AR} \leftarrow 0, \mathrm{TR} \leftarrow \mathrm{PC}$


Register transfers for the Interrupt Cycle

## Interrupt

- Further Questions:
- How can the CPU recognize the device requesting an interrupt?
- Since different devices are likely to require different interrupt service routines, how can the CPU obtain the starting address of the appropriate routine in each case?
- Should any device be allowed to interrupt the CPU while another interrupt is being serviced?
- How can the situation be handled when two or more interrupt requests occur simultaneously?


## Complete Computer Description $S C \leftarrow 0$, IEN $\leftarrow 0, R \leftarrow 0$

Fig 5-15


## Complete Computer Description



## Complete Computer Description

## Register-Reference:

D71'T3 $=r$
$\operatorname{IR}(i)=B i$
(Common to all register-reference instructions)
( $\mathrm{i}=\mathbf{0 , 1 , 2}, \ldots, 11$ )
$\mathrm{SC} \leftarrow 0$
CLA
CLE
CMA
CME
CIR
CIL
INC
SPA
SNA
SZA
SZE
HLT
Input-Output:
D7IT3 $=p$
IR(i) $=$ Bi
p:
pB11:
pB10:
pB9:
pB8:
pB7:
pB6:
(Common to all input-output instructions)
( $i=6,7,8,9,10,11$ )
SC $\leftarrow 0$
INP
pB11:
OUT
pB10:
AC $(0-7) \leftarrow$ INPR, $\mathrm{FGI} \leftarrow 0$
SKI
SKO
pB8:
ION
pB6:

OUTR $\leftarrow \mathrm{AC}(0-7)$, FGO $\leftarrow 0$
If(FGI=1) then ( $\mathrm{PC} \leftarrow \mathrm{PC}+1$ )
If( $\mathrm{FGO}=1$ ) then $(\mathrm{PC} \leftarrow \mathrm{PC}+1)$
IEN $\leftarrow 1$
IEN $\leftarrow 0$

## Design of Basic Computer

1. A memory unit: $\mathbf{4 0 9 6} \times 16$.
2. Registers: AR, PC, DR, AC, IR, TR, OUTR, INPR, and SC
3. Flip-Flops (Status): I, S, E, R, IEN, FGI, and FGO
4. Decoders:
5. a 3x8 Opcode decoder
6. a 4x16 timing decoder
7. Common bus: $\mathbf{1 6}$ bits
8. Control logic gates

Adder and I oqic circuit• Connected to $\Delta C$

## Design of Basic Computer

The control logic gates are used to control:

- Inputs of the nine registers
- Read and Write inputs of memory
- Set, Clear, or Complement inputs of the flip-flops
- $\mathrm{S} 2, \mathrm{~S} 1, \mathrm{~S} 0$ that select a register for the bus
- AC Adder and Logic circuit


## Design of Basic Computer

- Control of registers and memory
- The control inputs of the registers are LD (load), INR (increment), and CLR (clear)
- To control AR We scan table to find out all the statements that change the content of AR:
- R'T0: AR $\leftarrow$ PC LD(AR)
- R'T2: $\quad$ AR $\leftarrow \operatorname{IR}(0-11) \quad$ LD(AR)
- D'7IT3: AR $\leftarrow M[A R] \quad$ LD(AR)
- RT0: AR $\leftarrow 0 \quad$ CLR(AR)
- D5T4: $\quad$ AR $\leftarrow$ AR + 1 INR(AR)


# Design of Basic Computer 

## Control Gates associated with AR



## Design of Basic Computer

- To control the Read input of the memory we scan the table again to get these:
- D0T4: DR $\leftarrow \mathrm{M}[A R]$
- D1T4: DR $\leftarrow \mathrm{M}[A R]$
- D2T4: DR $\leftarrow M[A R]$
- D6T4: DR $\leftarrow \mathrm{M}[A R]$
- D7'IT3: AR $\leftarrow M[A R]$
- $\mathrm{R}^{\prime} \mathrm{T} 1: \mathrm{IR} \leftarrow \mathrm{M}[\mathrm{AR}]$
$-\quad \rightarrow$ Read $=\mathrm{R}^{\prime} \mathrm{T} 1+\mathrm{D} 7$ 'IT3 $+(\mathrm{D} 0+\mathrm{D} 1+\mathrm{D} 2+\mathrm{D} 6)$
T4


## Design of Basic Computer

Control of Single Flip-flops (IEN for example)

- pB7: IEN $\leqslant 1$ (I/O Instruction)
- pB6: IEN $\leftarrow 0$ (I/O Instruction)
- RT2: IEN $\leftarrow 0$ (Interrupt)
- where $\mathbf{p}=$ D7IT3 (Input/Output Instruction)
- If we use a JK flip-flop for IEN, the control gate logic will be as shown in the following slide:


## Design of Basic Computer



JK FF Characteristic Table

## Design of Basic Computer

Control of Common bus is accomplished by placing an encoder at the inputs of the bus selection logic and implementing the logic for each encoder input


## Design of Basic Computer

To select AR on the bus then x 1 must be 1 . This is happen when:

- D4T4: PC $\leftarrow$ AR
- D5T5: PC $\leftarrow \mathbf{A R}$
$\Rightarrow \mathrm{x} 1=\mathrm{D} 4 \mathrm{~T} 4+\mathrm{D} 5 \mathrm{~T} 5$

| $\times 1 \times 2$ | $\times 3 \times 4 \times 5 \times 6 \times 7$ |  | S2 51 | S0 | selected <br> register |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | none |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | AR |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | PC |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | DR |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | AC |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | IR |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | TR |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | Memory |

## Design of Basic Computer

For $x 7$ :

- $\quad X 7=R^{\prime} T 1+D 7^{\prime} I T 3+(D 0+D 1+D 2+D 6) T 4$ where it is also applied to the read input


## Design of Accumulator Logic

Circuits associated with AC


All the statements that change the content of AC


## Design of Accumulator Logic

Gate structures for controlling the LD, INR, and CLR of AC


## Adder and Logic Circuit



# REGISTER TRANSFER AND MICROOPERATIONS 

- Register Transfer Language
- Register Transfer
- Bus and Memory Transfers
- Arithmetic Microoperations
- Logic Microoperations


## SIMPLE DIGITAL SYSTEMS

Combinational and sequential circuits can be used to create simple digital systems.

These are the low-level building blocks of a digital computer.

- Simple digital systems are frequently characterized in terms of
- the registers they contain, and
- the operations that they perform.

Typically,

- What operations are performed on the data in the registers
- What information is passed between registers
- The operations on the data in registers are called microoperations.
- The functions built into registers are examples of microoperations
- Shift
- An elementary operation performed (during one clock pulse), on the information stored in one or more registers
- 1 clock cycle:

- $\mathrm{R} \leftarrow \mathrm{f}(\mathrm{R}, \mathrm{R})$
- f: shift, load, clear, increment, add, subtract, complement, and, or, xor, ...


## ORGANIZATION OF A DIGITAL SYSTEM

- Definition of the (internal) organization of a computer
- Set of registers and their functions
- Micro operations set

Set of allowable micro operations provided by the organization of the computer

- Control signals that initiate the sequence of micro operations (to perform the functions)


## REGISTER TRANSFER LEVEL

- Viewing a computer, or any digital system, in this way is called the register transfer level
- This is because we're focusing on
- The system's registers
- The data transformations in them, and
- The data transfers between them.


## REGISTER TRANSFER LANGUAGE

- Rather than specifying a digital system in words, a specific notation is used, register transfer language
- For any function of the computer, the register transfer language can be used to describe the (sequence of) microoperations
- Register transfer language
- A symbolic language
- A convenient tool for describing the internal organization of digital computers
- Can also be used to facilitate the design process of digital


## DESIGNATION OF REGISTERS

- Registers are designated by capital letters, sometimes followed by numbers (e.g., A, R13, IR)
- Often the names indicate function:
- MAR - memory address register
- PC - program counter
- IR - instruction register
- Registers and their contents can be viewed and represented in various ways
- A register can be viewed as a single entity:

MAR

- Regicterc mav alco he renrecented chowing the hitc of data they contain


## - Designation of a register

- a register
- portion of a register
- a bit of a register
- Common ways of drawing the block diagram of a register


Numbering of bits

Showing individual bits

| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |


| 15 | 7 | 0 |
| :--- | :--- | :--- |
|  | PC(H) | PC(L) |

Subfields

## REGISTER TRANSFER

- Copying the contents of one register to another is a register transfer
- A register transfer is indicated as
$\mathrm{R} 2 \leftarrow \mathrm{R} 1$
- In this case the contents of register R1 are copied (loaded) into register R2
- A simultaneous transfer of all bits from the source R1 to the destination register R2, during one clock pulse
- Note that this is a non-destructive; i.e. the contents of R1 are not altered by copying (loading) them to R2
- A register transfer such as
$\mathrm{R} 3 \leftarrow \mathrm{R} 5$

Implies that the digital system has

- the data lines from the source register (R5) to the destination register (R3)
- Parallel load in the destination register (R3)
- Control lines to perform the action


## CONTROL FUNCTIONS

- Often actions need to only occur if a certain condition is true
- This is similar to an "if" statement in a programming language
- In digital systems, this is often done via a control signal, called a control function
- If the signal is 1 , the action takes place
- This is represented as:
$\mathrm{P}: \mathrm{R} 2 \leftarrow \mathrm{R} 1$

Which means "if $\mathrm{P}=1$, then load the contents of register R 1 into register R2", i.e., if $(P=1)$ then $(R 2 \leftarrow R 1)$

## HARDWARE IMPLEMENTATION OF CONTROLLED TRANSFERS

- Implementation of controlled transfer

Block diag Raß $\leftarrow \mathrm{R} 1$


Timing diagram


- The same clock controls the circuits that generate the control function and the destination register
- Registers are assumed to use positive-edge-triggered flipflops


## SIMULTANEOUS OPERATIONS

- If two or more operations are to occur simultaneously, they are separated with commas
$\mathrm{P}: \mathrm{R} 3 \leftarrow \mathrm{R} 5, \mathrm{MAR} \leftarrow \mathrm{IR}$
- Here, if the control function $\mathrm{P}=1$, load the contents of R 5 into R3, and at the same time (clock), load the contents of register IR into register MAR


## BASIC SYMBOLS FOR REGISTER TRANSFERS

| Symbols | Description | Examples |
| :--- | :--- | :--- |
| Capital letters | Denotes a register | MAR, R2 |
| \& numerals |  |  |
| Parentheses () | Denotes a part of a register | R2 $(\phi-7), R 2(L)$ |
| Arrow $\leftarrow$ | Denotes transfer of information | $R 2 \leftarrow R 1$ |
| Colon $:$ | Denotes termination of control function | $\mathrm{P}:$ |
| Comma, | Separates two micro-operations | $\mathrm{A} \leftarrow \mathrm{B}, \mathrm{B} \leftarrow \mathrm{A}$ |

## CONNECTING REGISTRS

- In a digital system with many registers, it is impractical to have data and control lines to directly allow each register to be loaded with the contents of every possible other registers
- To completely connect n registers $\rightarrow \mathrm{n}(\mathrm{n}-1)$ lines
- $\mathrm{O}(\mathrm{n} 2)$ cost
- This is not a realistic approach to use in a large digital system
- Instead, take a different approach
- Have one centralized set of circuits for data transfer - the bus
- Have control circuits to select which register is the source, and which is the destination


## BUS AND BUS TRANSFER

Bus is a path(of a group of wires) over which information is transferred, from any of several sources to any of several destinations.

From a register to bus: BUS $\leftarrow R$


## TRANSFER FROM BUS TO A DESTINATION REGISTER



Three-State Bus Buffers
Normal input A
Control input C


Output $\mathrm{Y}=\mathrm{A}$ if $\mathrm{C}=1$
High-impedence if $\mathrm{C}=0$

Bus line with three-state buffers


## BUS TRANSFER IN RTL

- Depending on whether the bus is to be mentioned explicitly or not, register transfer can be indicated as either

$$
R 2 \leftarrow R 1
$$

or

$$
\text { BUS } \leftarrow \mathrm{R} 1, \mathrm{R} 2 \leftarrow \operatorname{BUS}
$$

- In the former case the bus is implicit, but in the latter, it is explicitly indicated


## MEMORY (RAM)

- Memory (RAM) can be thought as a sequential circuits containing some number of registers
- These registers hold the words of memory
- Each of the r registers is indicated by an address
- These addresses range from 0 to $\mathrm{r}-1$
- Each register (word) can hold n bits of data
- Assume the RAM contains $\mathrm{r}=2 \mathrm{k}$ words. It needs the followingdata input lines
- n data input lines
- n data output lines
- $k$ address lines
- A Read control line
- A Write control line



## MEMORY TRANSFER

- Collectively, the memory is viewed at the register level as a device, M.
- Since it contains multiple locations, we must specify which address in memory we will be using
- This is done by indexing memory references
- Memory is usually accessed in computer systems by putting the desired address in a special register, the Memory Address Register (MAR, or AR)



## MEMORY READ

- To read a value from a location in memory and load it into a register, the register transfer language notation looks like this:

$$
\mathrm{R} 1 \leftarrow \mathrm{M}[\mathrm{MAR}]
$$

- This causes the following to occur
- The contents of the MAR get sent to the memory address lines
- $\quad \mathrm{A} \operatorname{Read}(=1)$ gets sent to the memory unit
- The contents of the specified address are put on the memory's output data lines
- These get sent over the bus to be loaded into register R1


## MEMORY WRITE

- To write a value from a register to a location in memory looks like this in register transfer language:

$$
M[M A R] \leftarrow R 1
$$

- This causes the following to occur
- The contents of the MAR get sent to the memory address lines
- A Write $(=1)$ gets sent to the memory unit
- The values in register R1 get sent over the bus to the data input lines of the memory
- The values get loaded into the specified address in the memory


## SUMMARY OF R. TRANSFER MICROOPERATIONS

| $A \leftarrow B$ | Transfer content of reg. B into reg. A |
| :---: | :---: |
| $A R \leftarrow D R(A D)$ Transfer content of AD portion of reg. DR into reg. AR |  |
| $\mathrm{A} \leftarrow$ constant | Transfer a binary constant into reg. A |
| ABUS $\leftarrow$ R1, | Transfer content of R1 into bus A and, at the same time, |
| $\mathrm{R} 2 \leftarrow \mathrm{ABUS}$ | transfer content of bus A into R2 |
| AR | Address register |
| DR | Data register |
| $\mathrm{M}[\mathrm{R}]$ | Memory word specified by reg. R |
| M | Equivalent to M[AR] |
| $\mathrm{DR} \leftarrow \mathrm{M}$ | Memory read operation: transfers content of memory word specified by AR into DR |
| $\mathrm{M} \leftarrow \mathrm{DR}$ | Memory write operation: transfers content of DR into memory word specified by AR |

## MICROOPERATIONS

- Computer system microoperations are of four types:
- Register transfer microoperations
- Arithmetic microoperations
- Logic microoperations
- Shift microoperations


## ARITHMETIC MICROOPERATIONS

- The basic arithmetic microoperations are
- Addition
- Subtraction
- Increment
- Decrement
- The additional arithmetic microoperations are
- Add with carry
- Subtract with borrow
- Transfer/Load Summary of Typical Arithmetic Micro-Operations

| $\begin{aligned} \text { etc. .. } \mathrm{R} 3 & \leftarrow R 1+\mathrm{R} 2 \\ \mathrm{R} 3 & \leftarrow R 1-\mathrm{R} 2 \\ \mathrm{R} 2 & \leftarrow R 2^{\prime} \\ \mathrm{R} 2 & \leftarrow R 2^{\prime}+1 \\ \mathrm{R} 3 & \leftarrow R 1+\mathrm{R} 2^{\prime}+ \\ \mathrm{R} 1 & \leftarrow R 1+1 \\ \mathrm{R} 1 & \leftarrow R 1-1 \text { Ded } \end{aligned}$ | Contents of R1 plus R2 transferred to R3 <br> Contents of R1 minus R2 transferred to R3 <br> Complement the contents of R2 <br> 2 's complement the contents of R2 (negate) <br> 1 subtraction <br> Increment <br> rement |
| :---: | :---: |

## BINARY ADDER / SUBTRACTOR / INCREMENTER



Binary Adder-Subtractor


Binary Incrementer


## ARITHMETIC CIRCUIT



| S1 | S0 | Cin | Y | Output | Microoperation |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | $B$ | $D=A+B$ | Add |
| 0 | 0 | 1 | $B$ | $D=A+B+1$ | Add with carry |
| 0 | 1 | 0 | $B^{\prime}$ | $D=A+B^{\prime}$ | Subtract with borrow |
| 0 | 1 | 1 | $B^{\prime}$ | $D=A+B^{\prime}+1$ | Subtract |
| 1 | 0 | 0 | 0 | $D=A$ | Transfer A |
| 1 | 0 | 1 | 0 | $D=A+1$ | Increment A |
| 1 | 1 | 0 | 1 | $D=A-1$ | Decrement A |
| 1 | 1 | 1 | 1 | D $=$ A |  |

## LOGIC MICROOPERATIONS

- Specify binary operations on the strings of bits in registers
- Logic micro operations are bit-wise operations, i.e., they work on the individual bits of data
- useful for bit manipulations on binary data
- useful for making logical decisions based on the bit value
 over two binaly igpub ariables $0 \ldots \ldots 111$

| 0 | 1 | 0 | 0 | 0 | $\ldots$ | 1 | 1 | 1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 0 | 0 | 0 | 1 | $\ldots$ | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | $\ldots$ | 1 | 0 | 1 |

- However, most systems only implement four of these


## LIST OF LOGIC MICROOPERATIONS

- List of Logic Microoperations
- 16 different logic operations with 2 binary vars.
- $n$ binary vars $\rightarrow 2{ }^{2}$ functions
- Truth tables for 16 functions of 2 variables and the corresponding 16 logic micro-operations

| x 0 0 1 1 <br> $y$ 0 1 0 1 | Boolean Function | MicroOperations | Name |
| :---: | :---: | :---: | :---: |
| 000 | $\mathrm{FO}=0$ | $\mathrm{F} \leftarrow 0$ | Clear |
| 0001 | F1 $=x y$ | $F \leftarrow A \wedge B$ | AND |
| 001 | F2 $=x y^{\prime}$ | $F \leftarrow A \wedge B^{\prime}$ |  |
| 0011 | F3 $=x$ | $F \leftarrow A$ | Transfer A |
| 010 | F4 $=x^{\prime} \mathrm{y}$ | $F \leftarrow A^{\prime} \wedge B$ |  |
| 0101 | F5 = y | $F \leftarrow B$ | Transfer B |
| 011 | F6 $=x \oplus y$ | $F \leftarrow A \oplus B$ | Ekclusive-OR |
| 0111 | F7 $=x+y$ | $F \leftarrow A \vee B$ | OR |
| 1000 | F8 $=(x+y)^{\prime}$ | $F \leftarrow(A \vee B)^{\prime}$ | NOR |
| 1001 | F9 $=(x \oplus y)^{\prime}$ | $F \leftarrow(A \oplus B)^{\prime}$ | Exclusive-NOR |
| 101 | F10 $=\mathrm{y}^{\prime}$ | $F \leftarrow B^{\prime}$ | Complement B |
| 1017 | F11 $=x+y^{\prime}$ | $\mathrm{F} \leftarrow \mathrm{A} \vee \mathrm{B}$ |  |
| 1100 | $\mathrm{F} 12=\mathrm{x}^{\prime}$ | $F \leftarrow A^{\prime}$ | Complement A |
| 1101 | F13 $=x^{\prime}+\mathrm{y}$ | $F \leftarrow A^{\prime} \vee B$ |  |
| 111 | F14 $=(x y)^{\prime}$ | $\mathrm{F} \leftarrow(\mathrm{A} \wedge \mathrm{B})^{\prime}$ | NAND |
| 1111 | $\mathrm{F} 15=1$ | $\mathrm{F} \leftarrow$ all 1 's | Set to all 1's |

## HARDWARE IMPLEMENTATION OF LOGIC MICROOPERATIONS



Function table

| S1 S0 | Output | $\mu$-operation |
| :---: | :--- | :--- |
| 0 | 0 | $F=A \wedge B$ |
| 0 | 1 | $F=A \vee B$ |
| 1 | 0 | $F=A \oplus B$ |
| 1 | 1 | $F=A^{\prime}$ | XOR | Complement |
| :---: |

## APPLICATIONS OF LOGIC MICROOPERATIONS

Logic micro operations can be used to manipulate individual bits or a portions of a word in a register

- Consider the data in a register A. In another register, B, is bit data that will be used to modify the contents of A
- Selective-set
$\mathrm{A} \leftarrow \mathrm{A}+\mathrm{B}$
- Selective-complement $\quad \mathrm{A} \leftarrow \mathrm{A} \oplus \mathrm{B}$
- Selective-clear $\quad \mathrm{A} \leftarrow \mathrm{A} \cdot \mathrm{B}^{\prime}$
- Mask (Delete) $\quad \mathrm{A} \leftarrow \mathrm{A} \cdot \mathrm{B}$
- Clear
$\mathrm{A} \leftarrow \mathrm{A} \oplus \mathrm{B}$
- Insert
$\mathrm{A} \leftarrow(\mathrm{A} \cdot \mathrm{B})+\mathrm{C}$


## SELECTIVE SET

- In a selective set operation, the bit pattern in B is used to set certain bits in A

| 1100 | At |  |
| :--- | :--- | :--- |
| 1010 | B |  |
| 1110 | $\mathrm{At}+1$ | $(\mathrm{~A} \leftarrow \mathrm{~A}+\mathrm{B})$ |

- If a bit in B is set to 1, that same position in A gets set to 1, otherwise that bit in A keeps its previous value


## SELECTIVE COMPLEMENT

- In a selective complement operation, the bit pattern in B is used to complement certain bits in A

$$
\begin{array}{ll}
\frac{1100}{} \mathrm{At} \\
\hline 1010 & \mathrm{~B} \\
0110 & \mathrm{At}+1
\end{array}(\mathrm{~A} \leftarrow \mathrm{~A} \oplus \mathrm{~B})
$$

- If a bit in B is set to 1 , that same position in A gets complemented from its original value, otherwise it is unchanged


## SELECTIVE CLEAR

- In a selective clear operation, the bit pattern in B is used to clear certain bits in A

$$
\begin{aligned}
& \frac{1100}{} \begin{array}{l}
\text { At } \\
1010 \\
0100 \\
\mathrm{Bt}+1
\end{array}\left(\mathrm{~A} \leftarrow \mathrm{~A} \cdot \mathrm{~B}^{\prime}\right)
\end{aligned}
$$

- If a bit in $B$ is set to 1 , that same position in $A$ gets set to 0 , otherwise it is unchanged


## MASK OPERATION

- In a mask operation, the bit pattern in B is used to clear certain bits in A

$$
\begin{array}{ll}
1000 \mathrm{At} & \\
1010 \mathrm{~B} & \\
1000 & \mathrm{At}+1
\end{array}(\mathrm{~A} \leftarrow \mathrm{~A} \cdot \mathrm{~B})
$$

- If a bit in B is set to 0 , that same position in A gets set to 0 , otherwise it is unchanged


## CLEAR OPERATION

- In a clear operation, if the bits in the same position in $A$ and $B$ are the same, they are cleared in A, otherwise they are set in A

```
1100 At
4-10-B
0110 At+1 (A\leftarrowA }\leftarrow\textrm{B}
```


## INSERT OPERATION

- An insert operation is used to introduce a specific bit pattern into A register, leaving the other bit positions unchanged
- This is done as
- A mask operation to clear the desired bit positions, followed by
- An OR operation to introduce the new bits into the desired positions
- Example
- Suppose you wanted to introduce 1010 into the low order four bits of A: 1101100010110001 A (Original)

1101 100010111010 A (Desired)

- 1101100010110001 A (Original)

1111111111110000 Mask
1101100010110000 A (Intermediate)
0000000000001010 Added bits

## SHIFT MICROOPERATIONS

- There are three types of shifts
- Logical shift
- Circular shift
- Arithmetic shift
- What differentiates them is the information that goes into the serial input
- A right shift operation

- A left shift operation



## LOGICAL SHIFT

- In a logical shift the serial input to the shift is a 0 .

A right $\operatorname{logical}_{0}$ shift operation:

. A left logical shiftoperation: $\square \longleftarrow \square \square \square$

- In a Register Transfer Language, the following notation is used
- shl for a logical shift left


## CIRCULAR SHIFT

- In a circular shift the serial input is the bit that is shifted out of the other end of the register.
- A right circular shift operation:

- In a RTL, the following notation is used
- cil for a circular shift left
- cir for a circular shift right


## ARITHMETIC SHIFT

- An arithmetic shift is meant for signed binary numbers (integer)
- An arithmetic left shift multiplies a signed number by two
- An arithmetic right shift divides a signed number by two
- The main distinction of an arithmetic shift is that it must keep the sign of the number the same as it performs the multiplication or division
- A right arithmetic shift operation:

- A left arithmetic shift operation:


## ARITHMETIC SHIFT

- An left arithmetic shift operation must be checked for the overflow

- In a RTL, the following notation is used
- ashl for an arithmetic shift left
- ashr for an arithmetic shift right
- Examples:
" R2 $\leftarrow \operatorname{ashr}$ R2
» R3 $\leftarrow$ ashl R3


## HARDWARE IMPLEMENTATION OF SHIFT MICROOPERATIONS



## ARITHMETIC LOGIC SHIFT UNIT



| S3 | S2 | S1 | S0 | Cin | Operation | Function |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 0 | $\mathrm{F}=\mathrm{A}$ | Transfer A |
| 0 | 0 | 0 | 0 | 1 | $F=A+1$ | Increment A |
| 0 | 0 | 0 | 1 | 0 | $F=A+B$ | Addition |
| 0 | 0 | 0 | 1 | 1 | $F=A+B+1$ | Add with carry |
| 0 | 0 | 1 | 0 | 0 | $F=A+B^{\prime}$ | Subtract with borrow |
| 0 | 0 | 1 | 0 | 1 | $F=A+B^{\prime}+1$ | Subtraction |
| 0 | 0 | 1 | 1 | 0 | $F=A-1$ | Decrement A |
| 0 | 0 | 1 | 1 | 1 | $\mathrm{F}=\mathrm{A}$ | TransferA |
| 0 | 1 | 0 | 0 | X | $F=A \cdots B$ | AND |
| 0 | 1 | 0 | 1 | X | $\mathrm{F}=\mathrm{A} \psi \mathrm{B}$ | OR |
| 0 | 1 | 1 | 0 | X | $F=A \oplus B$ | XOR |
| 0 | 1 | 1 | 1 | X | $F=A^{\prime}$ | Complement A |
| 1 | 0 | X | X | X | $F=\operatorname{shi} A$ | Shift rioht A into F |
| 1 | 1 | X | X | X | $F=$ shl $A$ | Shift left A into F |

- UNIT II ARITHMETIC OPERATIONS ALU (Arthimetic Logic Unit)


## Reference

- Appendix B: The Basics of Logic Design
- Book - David A. Patterson and John L. Hennessey, "Computer organization and design", Morgan kauffman / Elsevier, Fifth edition, 2014.


## Logic Gates

## AND



| Input A | Input B | Output Q |
| :---: | :---: | :---: |
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 1 | 1 |

$Q=A \cdot B$

## OR



| Input A | Input B | Output Q |
| :---: | :---: | :---: |
| 0 | 0 | $\mathbf{0}$ |
| 1 | 0 | $\mathbf{1}$ |
| 0 | 1 | $\mathbf{1}$ |
| 1 | 1 | $\mathbf{1}$ |

$$
Q=A+B
$$

## OR



| Input A | Input B | Output Q |
| :---: | :---: | :---: |
| 0 | 0 | $\mathbf{0}$ |
| 1 | 0 | $\mathbf{1}$ |
| 0 | 1 | $\mathbf{1}$ |
| 1 | 1 | $\mathbf{1}$ |

$$
Q=A+B
$$

## NOT (Inversion)



## Multiplexor



If $\mathrm{d}==0$
Then $\mathrm{c}=\mathrm{a}$
Else if $d==1$
Then $\mathrm{c}=\mathrm{b}$
Used to select an operation

ALU

## ALU definition

- Arithmetic Logic Unit (ALU) - Hardware
that performs addition, subtraction, and usually logical operations such as AND and OR.


## ALU

- Operation selector - output from multiplexor



## ALU symbol

a, b-inputs
ALU operation - operation selector
Carry out - add, sub
Zero - slt,beq,bne Result - add, sub, and, or ....
Overflow - Exception


## Agenda

- Signed Numbers
- 1s complement
- 2 s complement
- Binary Addition
- Binary Subtraction
- Multiplication
- Flow chart - algorithm
- Hardware design
- Problem


## Reference

- Chapter 2 : Instructions: Language of the Computer
- 2.4 Signed and Unsigned Numbers
- Chapter 3 : Arithmetic for Computers
- Book - David A. Patterson and John L. Hennessey, "Computer organization and design", Morgan kauffman / Elsevier, Fifth edition, 2014.


## Signed Numbers

- Unsigned numbers - Ex: 0, 100, 999
- Signed numbers - Ex: -100, 0 , +100
- How computers represent + or - in 0's and 1's?
- 1s complement
- 2 s complement


## 1s complement

- Question - find the 1 s complement of $10001111_{2}$ -
- Solution - inverting the bits $10001111_{2}$
$01110000_{2}$


## $2 s$ complement

- Question - find the 2 s complement of $10001111_{2}$ -
- Solution - inverting the bits and adding 1 $10001111_{2}$
$0111 \mathbf{0 0 0 0}_{2} \rightarrow$ 1s complement

$$
1 \rightarrow \text { add } 1
$$

$01110001_{2}$

## Another problem

- Question : find 1 s and 2 s complement for the following
$-00000110_{2}$
$-00011000_{2}$


## Binary Addition

| A | B | A+B | Carry Out |
| :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 |

## Binary Addition Problem

$$
\begin{array}{ll} 
& 0111 \text { two }=7 \text { ten } \\
+ & 0110 \text { two }=6 \text { ten } \\
-----------------------13 t e n ~ \\
= & 1101 \text { two }=13 \text { ten }
\end{array}
$$

## Binary Subtraction Problem

- Subtraction via addition using the two's complement representation of 6 :

0000000000000000000000000000 0111two $=7$ ten

+ 11111111111111111111111111111010 two $=-6 t e n$
$=00000000000000000000000000000001$ two $=1$ ten


## Two's Complement Addition: Verifying Carry/Borrow method

- Two ( $\mathrm{n}+1$ )-bit integers: $\mathrm{X}=\mathrm{x}_{\mathrm{n}} \mathrm{X}^{\prime}, \mathrm{Y}=\mathrm{y}_{\mathrm{n}} \mathrm{Y}^{\prime}$

| Carry/borrow <br> add $X+Y$ | $0 \leq \mathrm{X}^{\prime}+\mathrm{Y}^{\prime}<2^{n}$ <br> (no CarryIn to last bit) | $2^{n} \leq \mathrm{X}^{\prime}+\mathrm{Y}^{\prime}<2^{n+1}-1$ <br> (CarryIn to last bit) |
| :--- | :--- | :--- |
| $\mathrm{x}_{\mathrm{n}}=0, y_{n}=0$ | ok | not ok(overflow!) |
| $\mathrm{x}_{\mathrm{n}}=1, y_{\mathrm{n}}=0$ | ok | ok |
| $\mathrm{x}_{\mathrm{n}}=0, y_{\mathrm{n}}=1$ | ok | ok |
| $\mathrm{x}_{\mathrm{n}}=1, y_{n}=1$ | not ok(overflow!) | ok |

- Prove the cases above!
- Prove if there is one more bit (total $\mathrm{n}+2$ then) available for the result then there is no problem with overflow in add!


## Two's Complement Operations

- Now verify the negation shortcut!
- consider $X+\bar{X}+1)=(X+\bar{X})+1$ : associative law - but what if there is overflow in one of the adds on either side, i.e., the result is wrong...!
- think minint!
- Examples:
- $-0101=1010+1=1011$
- $-1100=0011+1=0100$
- $-1000=0111+1=1000$


## Detecting Overflow

- No overflow when adding a positive and a negative number
- No overflow when subtracting numbers with the same sign
- Overflow occurs when the result has "wrong" sign :

| Operation | Operand A | Operand B | Result Indicating Overflow | Actual without Overflow |
| :---: | :---: | :---: | :---: | :---: |
| $A+B$ | $\geq 0$ (+ve) | $\geq 0$ (+ve) | $<0$ (-ve) | +ve |
| $A+B$ | $<0$ (-ve) | $<0$ (-ve) | $\geq 0$ (+ve) | -ve |
| $A-B$ | $\geq 0$ (+ve) | $<0$ (-ve) | $<0$ (-ve) | +ve |
| $A-B$ | $<0$ (-ve) | $\geq 0$ (+ve) | $\geq 0$ (+ve) | -ve |

- Consider the operations A + B, and A - B
- can overflow occur if $B$ is 0 ?
- can overflow occur if $A$ is 0 ?


## Effects of Overflow

- If an exception (interrupt) occurs
- control jumps to predefined address for exception
- interrupted address is saved for possible resumption
- Don't always want to cause exception on overflow
- add, addi, sub cause exceptions on overflow
- addu, addiu, subu do not cause exceptions on overflow


## Multiply

- Binary multiple of 2 ten $x 3$ ten $=6$ ten
- Multiplicand - 2
- Multiplier - 3
- Product - 6


## Multiply

- Grade school shift-add method:

| Multiplicand | 1000 |
| :--- | :---: |
| Multiplier | $\mathbf{x 1 0 0 1}$ |
|  | 1000 |
|  | 0000 |
|  | 0000 |
| Product | $\mathbf{0 1 0 0 1 0 0 0}$ |

- $m$ bits $\times n$ bits $=m+n$ bit product
- Binary makes it easy:
- multiplier bit $1=>$ copy multiplicand ( $1 \times$ multiplicand)
- multiplier bit 0 => place 0
( $0 \times$ multiplicand)


# Multiplication - <br> Sequential Refined Version Algorithm 

## Sequential(Refined) version of Multiplication Algorithm



## Sequential(Refined) Version of Multiplication Problem

$$
2 \times 3=6 ; \quad 0010 \times 0011=0110
$$

| Iteration | Steps | Multiplicand | Product |
| :--- | :--- | :--- | :--- |
| 0 | Initialize | 0010 | 00000011 |
| 1 | 1a. 1 => Prod = Prod + Mcand | 0010 | 00100011 |
| 2. Shift Right Product | 0010 | 00010001 |  |
| 2 | 1a. $=>$ Prod = Prod + Mcand | 0010 | 00110001 |
| 2. Shift Right Product | 0010 | 00011000 |  |
| 3 | 1a. $0=>$ No operation | 0010 | 00011000 |
| 2. Shift Right Product | 0010 | 00001100 |  |
| 4 | 1a. $0=>$ No operation | 0010 | 00001100 |
|  | 2. Shift Right Product | 0010 | 00000110 |

## Last Class Summary

- Signed Numbers Representation
- 1s complement
- 2 s complement
- Binary Addition
- Binary Subtraction
- Multiplication
- Flow chart - algorithm
- Hardware design - pending
- Problem


## Multiplication

- Two versions of multiplication:
- Sequential refined version algorithm
- Sequential first version algorithm
- Booth's algorithm
- Signed and Unsigned multiplication


## Sequential(Refined) Version of Multiplication Hardware



Product register is initialized with multiplier on right

## Multiplication -

## Sequential First Version Algorithm

## Sequential First Version of Multiplication Algorithm



## Sequential First Version of Multiplication Problem

$$
2 \times 3=6 ; \quad 0010 \times 0011=0110
$$

| Iteration | Step | Multiplicr | Multiplieand | Product |
| :---: | :---: | :---: | :---: | :---: |
| 0 | Initial values | 0011 | 00000010 | 00000000 |
| 1 | 1a: $1 \Rightarrow$ Prod = Prod + Mcand | 0011 | 00000010 | 00000010 |
|  | 2: Shift left Multiplicand | 0011 | 00000100 | 00000010 |
|  | 3: Shift right Multiplier | 0009 | 00000100 | 00000010 |
| 2 | 1a: $1 \Rightarrow$ Prod = Prod + Mcand | 0001 | 00000100 | 00000110 |
|  | 2: Shift left Multiplicand | 0001 | 00001000 | 00000110 |
|  | 3: Shift right Multiplier | 0000 | 00001000 | 00000110 |
| 3 | 1:0 0 No operation | 0000 | 00001000 | 00000110 |
|  | 2: Shift left Multiplicand | 0000 | 00010000 | 00000110 |
|  | 3: Shift right Multiplier | 0000 | 00010000 | 00000110 |
| 4 | 1:0 $0 \Rightarrow$ No operation | 0000 | 00010000 | 00000110 |
|  | 2: Shift left Multiplicand | 0000 | 00100000 | 00000110 |
|  | 3: Shift right Multiplier | 0000 | 00100000 | 00000110 |

Sequential First version of Multiplication Hardware


## Multiply in MIPS

- 2 Registers:
- separate pair of 32-bit registers to contain the 64-bit product called Hi and Lo.
- 4 Instructions:
- multiply (mult)
- multiply unsigned (multu)
- move from lo (mflo) - to place the product into registers
- Move from hi (mfhi) - to place the product into registers


# Booth's Algorithm Both Signed and Unsigned Multiplication 

## Reference

- Chapter 6 : Arithmetic
- 6.4 Signed - operand multiplication
- Booth algorithm
- Book - Carl Hamacher, "Computer organization", Fifth edition, Mc Graw Hill.


## Booth's algorithm

- $1^{\text {st }}$ step - Find the 2 s complement of negative multiplier
- $2^{\text {nd }}$ step - recode the 2 s complement of negative multiplier
- $3^{\text {rd }}$ step - multiply multiplicand and recoded multiplier using long hand method


## Basic concept - 2s complement

3 bits - Unsigned numbers - 8 - (0-7)
3 bits - Signed numbers - $8-(-4,3,-2,-1,0,1,2,3)$

| 3 bits - unsigned numbers |  | 3 bits - signed numbers |  |
| :---: | :---: | :---: | :---: |
|  |  | 100 | -4 |
| 000 | 0 | 101 | -3 |
| 001 | 1 |  |  |
| 010 | 2 | 110 | -2 |
| 011 | 3 | 111 | -1 |
| 100 | 4 | 000 | 0 |
| 101 | 5 | ob1 | +1 |
| 101 | 5 | 1 |  |
| 110 | 6 | 010 | +2 |
| 111 | 7 | O11 | +3 |

## Example problem

- Multiply +13 x-6
- Where +13 -> multiplicand -> 5bits
- $\quad-6$-> multiplier -> 5 bits
$1^{\text {st }}$ step $->$ find the $2 s$ complement of negative multiplier
Assume : 5bit multiplier

$$
\text { +6 -> } 00110 \text { (sign extension) }
$$

1s complement -> 11001
Add $1 \quad$-> 1

2 s complement -> $11010=-6$

## Why 5 bits for +13 ?

- +13 is in which range
- -13 ... $0 \ldots+13$ = total 27 numbers
- Nearest power of 2 number is $32=2^{5}$
- So 5 bits to represent +13 .


## How 5 bits for -6 ?

- Find binary number of -6 ?
- Range of -6 is $-6 . .0 . .+6=$ total 13 numbers
- Nearest power of 2 is $2^{4}=16$
- 6 -> 0110
- Invert 1001
- +1-> 1
- -6 -> 1010 11010 -> sign extension (Make -6 extend to 5 bits)


## Problem solving

- $2^{\text {nd }}$ step $->$ recode the 2 s complement of negative multiplier Booth multiplier recoding table:

| Multiplier |  | Version of multiplicand selected by bit i |
| :---: | :---: | :---: |
| Bit I | Bit i-1 |  |
| 0 | 0 | $0 \times \mathrm{M}$ |
| 0 | 1 | +1 $\times \mathrm{M}$ |
| 1 | 0 | $-1 \times \mathrm{M}$ |
| 1 | 1 | $0 \times \mathrm{M}$ |
| $-6 \quad \Rightarrow \quad 1 \begin{array}{llllll} -6 & 1 & 0 & 1 & 0 & (0) \end{array}$ |  |  |

## Booth's Algorithm Problem solving

- $3^{\text {rd }}$ step -> multiply multiplicand and recoded multiplier using long hand method.

| (+13) | 011101 |
| :---: | :---: |
| X (-6) | 0-1+1-1 0 |

0000000000
111110011
00001101
1110011
000000
$(-78) \quad 1110110010 \quad$-> how to verify this result

## Booth's multiplication features

- Handles both signed and unsigned multipliers uniformly.
- Faster multiplication - fewer additions


## How Speed affected by multiplier??

- Worst case multiplier
$\begin{array}{llllllllllllll}0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1\end{array}$
$+1-1+1-1+1-1+1-1+1-1+1-1+1-1$
- Ordinary multiplier

$$
\begin{array}{cccccccccccccccc}
1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & -1 & 0 & 0 & +1 & -1 & +1 & 0 & -1 & +1 & 0 & 0 & 0 & -1 & 0 & 0
\end{array}
$$

- Good multiplier

0000111110000111
$000+10000-1000+100-1$

## Computer Architecture Division

## Last Class Summary

- Booth's Algorithm


## Agenda

- Booth's algorithm example
- Why 5 bits?
- How to verify the result?
- Division
- Hardware
- Algorithm
- Problem


## Basic concept - 2s complement

3 bits - Unsigned numbers - 8 - (0-7)
3 bits - Signed numbers - $8-(-4,3,-2,-1,0,1,2,3)$

| 3 bits - unsigned numbers |  | 3 bits - signed numbers |  |
| :---: | :---: | :---: | :---: |
|  |  | 100 | -4 |
| 000 | 0 | 101 | -3 |
| 001 | 1 |  |  |
| 010 | 2 | 110 | -2 |
| 011 | 3 | 111 | -1 |
| 100 | 4 | 000 | 0 |
| 101 | 5 | ob1 | +1 |
| 101 | 5 | 1 |  |
| 110 | 6 | 010 | +2 |
| 111 | 7 | O11 | +3 |

## Example problem

- Multiply +13 x-6
- Where +13 -> multiplicand -> 5bits
- $\quad-6$-> multiplier -> 5 bits
$1^{\text {st }}$ step $->$ find the $2 s$ complement of negative multiplier
Assume : 5bit multiplier

$$
\text { +6 -> } 00110 \text { (sign extension) }
$$

1s complement -> 11001
Add $1 \quad$-> 1

2 s complement -> $11010=-6$

## Why 5 bits for +13 ?

- +13 is in which range
- -13 ... $0 \ldots+13$ = total 27 numbers
- Nearest power of 2 number is $32=2^{5}$
- So 5 bits to represent +13 .


## How 5 bits for -6 ?

- Find binary number of -6 ?
- Range of -6 is $-6 . .0 . .+6=$ total 13 numbers
- Nearest power of 2 is $2^{4}=16$
- 6 -> 0110
- Invert 1001
- +1-> 1
- -6 -> 1010 11010 -> sign extension (Make -6 extend to 5 bits)


## Booth's Algorithm Problem solving

- $3^{\text {rd }}$ step -> multiply multiplicand and recoded multiplier using long hand method.

| (+13) | 011101 |
| :---: | :---: |
| X (-6) | 0-1+1-1 0 |

0000000000
111110011
00001101
1110011
000000
$(-78) \quad 1110110010 \quad$-> how to verify this result

## DIVISION

## - UNIT II ARITHMETIC OPERATIONS

 Division
## Reference

- Chapter 3 : Arithmetic for Computers
- Book - David A. Patterson and John L. Hennessey, "Computer organization and design", Morgan kauffman / Elsevier, Fifth edition, 2014.


## Long Hand Divide



- Dividend = (Quotient * Divisor) + Remainder


## Division algorithms

1. Restoring

- $1^{\text {st }}$ version
- Improved version

2. Non-Restoring

## Restoring Divide version 1 - Algorithm



## Restoring Divide Version 1 - Problem

Divide $7_{\text {ten }}$ by $2_{\text {ten }}$,
$00000111_{\text {two }}$ by $0010_{\text {two }}$

| Iteration | Step | Quotient | Divisor | Remainder |
| :---: | :---: | :---: | :---: | :---: |
| 0 | Initial values | 0000 | 00100000 | 00000111 |
| 1 | 1: Rem = Rem - Div | 0000 | 00100000 | (1)110 0111 |
|  | 2b: Rem $<0 \Rightarrow+$ Div, sll $Q, Q 0=0$ | 0000 | 00100000 | 00000111 |
|  | 3: Shift Div right | 0000 | 00010000 | 00000111 |
| 2 | 1: Rem = Rem - Div | 0000 | 00010000 | (9)1110111 |
|  | 2b: Rem $<0 \Rightarrow+$ Div, sll $\mathrm{Q}, \mathrm{QO}=0$ | 0000 | 00010000 | 00000111 |
|  | 3: Shift Div right | 0000 | 00001000 | 00000111 |
| 3 | 1: Rem = Rem - Div | 0000 | 00001000 | (0)1111111 |
|  | 2b: Rem $<0 \Rightarrow+$ Div, sll $\mathrm{Q}, \mathrm{QO}=0$ | 0000 | 00001000 | 000000111 |
|  | 3: Shift Div right | 0000 | 00000100 | 00000111 |
| 4 | 1: Rem = Rem - Div | 0000 | 00000100 | (0000 0011 |
|  | 2a: Rem $\geq 0 \Rightarrow$ sll $\mathrm{Q}, \mathrm{Q} 0=1$ | 0001 | 00000100 | 00000011 |
|  | 3: Shift Div right | 0001 | 00000010 | 00000011 |
| 5 | 1: Rem = Rem - Div | 0001 | 00000010 | (0000 0001 |
|  | 2a: Rem $\geq 0 \Rightarrow$ sll $Q, Q 0=1$ | 0011 | 00000010 | 00000001 |
|  | 3: Shift Div right | 0011 | 00000001 | 00000001 |

## Restoring Divide version 1 - Hardware



## Observations on Divide Version 1

- Half the bits in divisor always 0
$-\Rightarrow 1 / 2$ of 64 -bit adder is wasted
$-\Rightarrow 1 / 2$ of divisor register is wasted
- Intuition: instead of shifting divisor to right, shift remainder to left...
- Step 1 cannot produce a 1 in quotient bit - as all bits corresponding to the divisor in the remainder register are 0 (remember all operands are 32-bit)
- Intuition: switch order to shift first and then subtract - can save 1 iteration...


## Restoring Divide - Improved Version

 (saves space)$\longrightarrow$ Data lines
$\longrightarrow$ Control lines


Hi - Remainder

# Restoring Divide Improved Version Algorithm 

## 1. Shift the remainder left 1 bit



## Observations on Divide Improved Version

- Signed divide:
- make both divisor and dividend positive and perform division
- negate the quotient if divisor and dividend were of opposite signs
- make the sign of the remainder match that of the dividend
- this ensures always
- dividend $=$ (quotient * divisor) + remainder
- -quotient $(x / y)=$ quotient $(-x / y)\left(e . g .7=3^{*} 2+1 \&-7=-3^{*} 2\right.$ -1)
- Faster divide:
- Based on Prediction and Moore's Law.
- SRT division


## MIPS instructions

- div (signed), divu (unsigned) -
- with two 32-bit register operands,
- put remainder in Hi register and quotient in Lo;
- overflow is ignored in both cases
- mflo - place the quotient from Lo to general purpose register
- mfhi - place the remainder from Hi to general purpose register
- Example; div \$s2, \$s1 \#\$s2/Ss1
- 

mflo \$s3

$$
\text { - mfhi } \$ s 4
$$

$$
\begin{aligned}
\text { \# Lo -quotient } \\
\text { \# Hi - Remainder }
\end{aligned} \quad \begin{aligned}
\text { \# } \$ 33= & \text { Lo }=\text { quotient } \\
\# \$ S 4= & \text { Hi }=\text { remainder }
\end{aligned}
$$

## MIPS instructions

| Arithmetic | buw |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  | muttiply | mult | \$s2,\$s3 | Hi, Lo = \$ $2 \times$ \$ ${ }^{\text {S }}$ 3 | 64-bit signed product in Hi, Lo |
|  | multiply unsigned | mult | 402.402 |  | 64-bit unsigned product in Hi , Lo |
|  | divide | div | \$s2,\$s3 | $\begin{aligned} & L 0=\$ \$ 2 / \$ s 3, \\ & H i=\$ \$ 2 \mathrm{mod} \$ \mathrm{~s} 3 \end{aligned}$ | Lo = quotient, $\mathrm{Hi}=$ remalinder |
|  | divide unsigned | divu | \$s2,\$s3 | $\begin{aligned} & L 0=\$ 2 / \$ 33, \\ & H i=\$ 22 \text { mod } \end{aligned}$ | Unsigned quotient and remain |
|  | move from Hi | mfhi | \$ ${ }^{1}$ | \$S1 = Hi | Used to get copy of Hi |
|  | move from Lo | mflo | \$\$1 | \$ $\$ 1=$ Lo | Used to get copy of Lo |

## Non-Restoring Division

Reference:
Chapter 6 : Arithmetic
6.6 Integer Division - Nonrestoring Division

Book - Carl Hamacher, "Computer organization", Fifth edition, Mc Graw Hill.

## Non-Restoring Divide - Algorithm

- Do the following n times:

1. If the sign of $A$ is 0 , shift $A$ and $Q$ left one bit position and subtract $M$ from $A$; otherwise, shift $A$ and $Q$ left and add $M$ to $A$.
2. Now, if the sign of $A$ is 0 , set $q 0$ to 1 ; otherwise, set q0 to 0 .

## Non restoring divide - problem

- Divide 8 by 3;
- Q-8-Dividend - 1000
- M-3-Divisor - 00011
- A- 00000 initially
- Where,
- A, Q, M are registers
- $A$ and $M$ are $n+1$ bits
- Q is n bits

Solution:
$M=00011$

| Iteration | Steps | A | Q |
| :---: | :---: | :---: | :---: |
| 0 | Intial | 00000 | 1000 |
| 1 | 1. $\operatorname{sign}(A)=0=>a)$ shift $A$ and $Q$ left. | 00001 | 0000 |
|  | b) $A=A-M$ | 11110 | 0000 |
|  | 2. $\operatorname{Sign}(\mathrm{A})=1 \Rightarrow \mathrm{QO}=0$ | 11110 | 0000 |
| 2 | 1. $\operatorname{sign}(A)=1=>a)$ shift $A$ and $Q$ left. | 11100 | 0000 |
|  | b) $A=A+M$ | 11111 | 0000 |
|  | 2. $\operatorname{Sign}(\mathrm{A})=1 \Rightarrow \mathrm{QO}=0$ | 11111 | 0000 |
| 3 | 1. $\operatorname{sign}(A)=1=>a)$ shift $A$ and $Q$ left. | 11110 | 0000 |
|  | b) $A=A+M$ | 00001 | 0000 |
|  | 2. $\operatorname{Sign}(\mathrm{A})=0=>\mathrm{QO}=1$ | 00001 | 0001 |
| 4 | 1. $\operatorname{sign}(A)=0=>a)$ shift $A$ and $Q$ left. | 00010 | 0010 |
|  | b) $A=A-M$ | 11111 | 0010 |
|  | 2. $\operatorname{Sign}(A)=1=>Q 0=0$ | 11111 | 0010 |

Quotient - Q - 0010
Remainder $=\mathrm{A}+\mathrm{M}=11111+00011=00010$

## Sub Word Parallelism

## SubWord Parallelism

- A subword is a lower precision unit of data contained within a word.
- In subword parallelism, multiple subwords are packed into a word and then process whole words.
- Since the same instruction is applied to all subwords within the word, This is a form of SIMD(Single Instruction Multiple Data) processing.
- Sub word parallelism or data level parallelism or vector


## SubWord Parallelism

- A 32 bit processor simultaneously execute operations on 4 eight bit operands or 2 sixteen bit operands.
- Example: $5+5=10$ and $9+9=18$ parallel operation A1 - 16 bits A2-16 bits B1-16 bits B2-16 bits



## Application of Sub-Word parallelism

- Used in multimedia operations
- which has many sub-word arithmetic operations (8bit , 16bit and so on)
- ARMv7,ARMv8 - processors have NEON instructions that support sub-word parallelism
- Example NEON instructions: (128 bit registers)

VADD.F32 - adds 4 32-bit data simultaneously
VMULL.S8 - multiplies 16 8-bit data simultaneously

## Floating Point Operations

## Floating Point

We need a way to represent

- numbers with fractions, e.g., 3.1416
- very small numbers (in absolute value),
e.g., 0.00000000023
- very large numbers (in absolute value),

$$
\text { e.g., }-3.15576 * 10^{46}
$$

## Floating point

- Eaxmple: -3.15576 * $10^{46}$
- Sign = negative -0 bit
- Fraction = 15576
- Significand $=3.15576$
- Exponent = 46


## Floating Point - Representation

- Scientific Notation:

A notation that renders numbers with a single digit to the left of the decimal point.

Convert to scientific notation $00.001_{\text {two }} \times 2^{-2}$
$=>0.0001_{\mathrm{two}} \times 2^{-2+1} \quad=>0.0001_{\mathrm{two}} \times 2^{-1}$

- Move $n$ bits to right - add $n$ to exponent


## Floating Point - Representation

- Normalized scientific representation:

A number in floating-point notation that has no leading 0s.

Convert to normalized scientific notation $0.0001_{\text {two }} \times 2^{-1}$

$$
=>1.0_{\mathrm{two}} \times 2^{-1-4} \quad \Rightarrow 1.0_{\mathrm{two}} \times 2^{-5}
$$

- Move $\boldsymbol{n}$ bits to left - subtract $\boldsymbol{n}$ from exponent


## IEEE 754 Floating-point Standard

- IEEE 754 floating point standard:
- single precision: one word

| 31 | bits 30 to 23 | bits 22 to 0 |
| :---: | :--- | :---: |
| sign | 8 -bit exponent | 23-bit significand |

Range of single precision:
Smallest number is $->2.0_{\text {ten }} \times 10^{-38}$
Largest number is $->2.0_{\text {ten }} \times 10^{38}$
Overflow - when positive exponent becomes too large to fit in exponent field

Underflow - when negative exponent becomes too large to fit in exponent field

## IEEE 754 Floating-point Standard

- double precision: two words

| 31 bits 30 to 20 bits 19 to 0  <br> sign 11-bit exponent upper 20 bits of 52 -bit significand  <br> bits 31 to 0    |
| :--- |

Range of double precision:
Smallest is -> $2.0_{\text {ten }} \times 10^{-308}$
Largest is $->2.0_{\text {ten }} \times 10^{308}$

## General representation of IEEE 754 Floating-point Standard

```
    equals biased exponent value
(-1) sign * (1 + fraction) * 2 (exponent-bias)
```

- bias $=127$ for single precision and
- bias $=1023$ for double precision
- Biased exponent $=$ exponent $\boldsymbol{-}$ bias


## IEEE 754 Floating-point Standard

- Sign bit is 0 for positive numbers, 1 for negative numbers
- Significand:

Number is assumed normalized and leading 1 bit of significand left of binary point (for non-zero numbers) is assumed and not shown

- e.g., significand $1.1001 \ldots$ is represented as $1001 \ldots$,
- value $=(-1)^{\text {sign }} *(1+$ fraction $) * 2^{\text {exponent value }}$


## IEEE 754 Floating-point Standard

- Exponent is biased
- bias of 127 for single precision and 1023 for double precision
equals biased exponent value
- value $=(-1)^{\text {sign }} *(1+$ significand $) * 2^{(\text {exponent }- \text { bias })}$
- Biased exponent $=$ exponent $\boldsymbol{-}$ bias


## Example problem

- Represent $\mathbf{- 0 . 7 5}$ ten in IEEE 754 single precision

Solution:
Step1 : Convert decimal floating to binary normalized scientific floating point
Step2 : Write the general representation of IEEE 754
Step3 : Find the sign, fraction, biased exponent
Stpe4 : Draw the IEEE 754 single precision format

## Example problem

Step1 : Convert decimal floating to binary normalized scientific floating point

```
decimal: -0.75 = -3/4 = -3/2 2
``` binary: \(-11 \times 2^{-2}=-.11=-1.1 \times 2^{-1}\)

Step2 : Write the general representation of IEEE 754
\((-1)^{\text {sign }} *(1+\) fraction \() * 2^{\text {(exponent-bias) }}\)
Step3 : Find the sign, fraction, biased exponent
Sign = 1
Fraction \(=1_{\text {two }}=\)
Significand \(=1.1_{\text {two }}\)
Biased exponent \(=-1_{\text {ten }}\)
Biased exponent \(=\) exponent - bias
Exponent = biased exponent + bias
\[
=(-1)+127=126_{\mathrm{ten}}=01111110_{\mathrm{two}}
\]

\section*{Example problem}

\section*{Stpe4 : Draw the IEEE 754 single precision format}


\section*{Floating Point Addition Algorithm}


\section*{Floating point addition problem}
- Try adding the numbers 0.5 ten and 0.4375 ten in binary using the algorithm
\(1^{\text {st }}\) - Convert the decimal numbers to binary numbers.
- \(0.5_{\text {ten }}=1 / 2_{\text {ten }}=01_{\text {two }} \times 2^{-1}=0.1_{\text {two }}=1.0_{\text {two }} \times 2^{-1}\)
- \(0.4375_{\text {ten }}=7 / 16_{\text {ten }}=7 / 2^{4}{ }_{\text {ten }}\)
\[
\begin{aligned}
& =111_{\mathrm{two}} \times 2^{-4}=0.0111_{\mathrm{two}} \\
& =1.110_{\mathrm{two}} \times 2^{-2}
\end{aligned}
\]

\section*{Floating Point Addition Hardware}



\section*{Floating Point Complexities}
- In addition to overflow we can have underflow (number too small)
\(2.0_{\text {ten }} \times 10^{-308}-\) smallest
2. \(0_{\text {ten }} \times 10^{308}\) - largest
- Accuracy is the problem with both overflow and underflow because we have only a finite number of bits to represent numbers that may actually require arbitrarily many bits
- limited precision \(\Rightarrow\) rounding \(\Rightarrow\) rounding error
- IEEE 754 keeps two extra bits, guard and round

\section*{Usage of guard and round bits}
- \(2.56_{\text {ten }} \times 10^{0}+2.34_{\text {ten }} \times 10^{2}\)
- Assume only 3 significand bits
- \(1^{\text {st }}\) step : \(2.56_{\text {ten }} \times 10^{0}=0.0256_{\text {ten }} \times 10^{2}\)
- \(2^{\text {nd }}\) step :
\(0.0256_{\text {ten }} \times 10^{2}\)
\(2.3400_{\text {ten }} \times 10^{2}\)
\(2.3656_{\text {ten }} \times 10^{2}\)
- 2 rd sten • \(237 \times 10^{2}\)

\section*{Without guard and round bits}
- \(2.56_{\text {ten }} \times 10^{0}+2.34_{\text {ten }} \times 10^{2}\)
- Assume only 3 significand bits
- \(1^{\text {st }}\) step : \(2.56_{\text {ten }} \times 10^{0}=0.02_{\text {ten }} \times 10^{2}\)
- \(2^{\text {nd }}\) step :
\(0.02_{\text {ten }} \times 10^{2}\)
\(2.34_{\text {ten }} \times 10^{2}\)
\(2.36_{\text {ten }} \times 10^{2}\)
- 2 rd \(\operatorname{sten} \cdot 236 \times 10^{2}\)

\section*{Rounding error}
- units in the last place (ulp) - The number of bits in error in the least significant bits of the significand between the actual number and the number that can be represented.

With guard and round bits \(-2.37_{\text {ten }} \times 10^{2}\)
Without guard and round bits \(-2.36_{\text {ten }} \times 10^{2}\)
so here - it is off by 1 ulps

\section*{UNIT III PROCESSOR AND CONTROL UNIT}

Basic MIPS implementation - Building datapath Control Implementation scheme - Pipelining Pipelined datapath and control - Handling Data hazards \& Control hazards - Exceptions

\section*{Reference:}
- Chapter 4 - The Processor
- Appendix B - The Basics of Logic Design
- B. 7 Clock
- B. 8 Memory Elements
- Book - David A. Patterson and John L. Hennessey, "Computer organization and design", Morgan Kauffman / Elsevier, Fifth edition, 2014.

\section*{Five Classic Components of a Computer}


\section*{Basic MIPS implementation}

\section*{Implementing MIPS}
- We're ready to look at an implementation of the MIPS instruction set
- Simplified to contain only
- arithmetic-logic instructions: add, sub, and, or, slt
- memory-reference instructions: lw, Sw
- control-flow instructions: beq, j


\section*{Overview implementing MIPS: Fetch/Execute Cycle}
- High-level abstract view of fetch/execute implementation
- use the program counter (PC) to read instruction address
- fetch the instruction from memory and increment PC
- use fields of the instruction to select registers to read
- execute depending on the instruction
- repeat.


\section*{Overview: Processor Implementation Styles}
- Single Cycle
- perform each instruction in 1 clock cycle
- clock cycle must be long enough for slowest instruction; therefore,
- disadvantage: only as fast as slowest instruction
- Multi-Cycle
- break fetch/execute cycle into multiple steps
- perform 1 step in each clock cycle
- advantage: each instruction uses only as many cycles as it needs
- Pipelined
- execute each instruction in multiple steps
- perform 1 step / instruction in each clock cycle
- process multiple instructions in parallel - assembly line

\section*{Basic Implementation of MIPS - with Data Path}


\section*{Processor}
- Two components of processor
- Datapath
- Control

\section*{Functional Elements}
- Two types of functional elements:
- elements that operate on data - combinational elements
- elements that contain data - state or sequential elements

\section*{Building Data Path}

\section*{Data path}
- A unit used to operate on or hold data within a processor.
- The datapath elements include
- instruction and data memories,
- the register file,
- the ALU,
- and adders.

\section*{Elements used to fetch instructions and increment the PC}

3 elements are used to fetch instructions and increment PC:
- Instruction memory
- A state element where instruction is stored.
- PC - program counter
- A state element or register containing the address of the next instruction to be executed.
- Adder
- an ALU wired to always add its two 32-bit inputs and place the sum on its output.

\section*{Element used to store/fetch instructions and increment the PC}

a. Instruction memory
b. Program counter
c. Adder

\title{
Datapath: Instruction Store/Fetch \& PC Increment
}


\section*{Animating the Datapath}


Implementing Datapath for each instruction classes
- Arithmetic and logical instructions - R-type
- Load store instructions
- Control - conditional
- Control - unconditional instructions - J-Type

\section*{Elements used in R-Type instructions}

2 elements are used in R-Type instructions
- Register file
- A state element that consists of a set of registers that can be read and written by supplying a register number to be accessed.
- ALU
- A combinational element that performs arithmetic and logical operations on the input data.

\section*{Elements used in R-Type instructions}

a. Registers
b. ALU

\section*{Datapath: R-Type Instruction}


\section*{Animating the R-Type Instruction Datapath}

Instruction


Implementing Datapath for each instruction classes
- Arithmetic and logical instructions - R-type
- Load store instructions
- Control - conditional
- Control - unconditional instructions - J-Type

\section*{Elements used in Load/Store instructions}

4 elements are used in Load/Store instructions
- Data memory
- The memory unit is a state element with inputs for the address and the write data, and a single output for the read result
- Sign Extend
- a 16-bit input that is sign-extended into a 32-bit result appearing on the output
- Register file
- ALU

\section*{Elements used in Load/Store instructions}

a. Data memory unit
b. Sign extension unit

\section*{Datapath: \\ Load/Store Instruction}


MemRead
a. Data memory unit


Datapath

\section*{Animating the Load Instruction Datapath}


\section*{Animating the Store Instruction Datapath}


Implementing Datapath for each instruction classes
- Arithmetic and logical instructions - R-type
- Load store instructions
- Control - conditional

I-Type
- Control - unconditional - J - Type

\section*{Datapath: Branch Instruction}


Datapath

\section*{Animating the Branch Instruction Datapath}


\section*{Combining the datapaths for R-type instructions and load/stores using two multiplexors}


\section*{Animating the Datapath: R-type Instruction}


\section*{Animating the Datapath: Load Instruction}


\section*{Animating the Datapath: Store Instruction}


IVIPS Datapath - Adding instruction fetch

\section*{(R-type \& load/store \& instruction fetch)}


\section*{Branch taken \& not taken}

Branch taken
\(X=1 ; Y=2\)
If \(X==Y\)
then
equal
else
not equal

Assign PC:
\(\mathrm{PC}=\mathrm{BTA}\)

\section*{Branch not taken}
\(X=1 ; Y=1\)
If \(X==Y\)
then
equal
else
not equal

Assign PC:
\(P C=P C+4\)

\section*{MIPS Datapath : Adding Branch capability (R-type, Load/Store, Instruction fetch, Branch)}


\section*{Datapath Executing add}


\section*{Datapath Executing lw}


\section*{Datapath Executing sw}


\section*{Datapath Executing beq (Branch Taken : PC = BTA)}


\section*{Control Implementation Scheme}

\section*{Processor}
- Two components of processor
- Datapath
- Control
- Control:
- Sends control signals to all other units

\section*{Control}
- Input to control unit
- the instruction opcode bits
- Control unit generates (Output)
- ALU control input
- write enable (possibly, read enable also) signals for each storage element
- selector controls for each multiplexor

ALUOp generation

\section*{ALU Control} by main control
\begin{tabular}{|l|l|}
\hline Instruction & ALUOp \\
\hline Load/Store & 00 \\
\hline Branch & 01 \\
\hline R-Type & 10 \\
\hline
\end{tabular}


\section*{Setting ALU Control Bits}
\begin{tabular}{|c|c|c|c|c|c|}
\hline Instruction opcode & AluOp & Instruction operation & Funct Field & Desired ALU action & ALU control input \\
\hline LW & 00 & load word & xxyxxx & add & 010 \\
\hline SW & 00 & store word & xxyxxx & add & 010 \\
\hline Branch eq & 01 & branch eq & xxxxxx & subtract & 110 \\
\hline R-type & 10 & add & 100000 & add & 010 \\
\hline R-type & 10 & subtract & 100010 & subtract & 110 \\
\hline R-type & 10 & AND & 100100 & and & 000 \\
\hline R-type & 10 & OR & 100101 & or & 001 \\
\hline R-type & 10 & set on less & 101010 & set on less & 111 \\
\hline
\end{tabular}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline \multicolumn{3}{|c|}{ ALUOp } & \multicolumn{5}{c|}{ Funct field } & \multirow{2}{c|}{ Operation } \\
\cline { 1 - 7 } ALUOp1 & ALUOp0 & F5 & F4 & F3 & F2 & F1 & F0 & \\
\hline 0 & 0 & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & 010 \\
\hline 0 & 1 & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & 110 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 0 & 0 & 0 & 010 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 0 & 1 & 0 & 110 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 1 & 0 & 0 & 000 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 1 & 0 & 1 & 001 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 1 & 0 & 1 & 0 & 111 \\
\hline
\end{tabular}

Truth table for ALU control bits

\section*{MIPS instruction formats (Observe the bit positions)}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline Field & 0 & rs & rt & rd & shamt & funct \\
\hline Bit positions & 31:26 & 25:21 & 20:16 & 15:11 & 10:6 & 5:0 \\
\hline \multicolumn{7}{|l|}{a. R-type instruction} \\
\hline Field & 35 or 43 & rs & rt & & address & \\
\hline Bit positions & 31:26 & 25:21 & 20:16 & & 15:0 & \\
\hline \multicolumn{7}{|l|}{b. Load or store instruction} \\
\hline Field & 4 & rs & rt & & address & \\
\hline Bit positions & 31:26 & 25:21 & 20:16 & & 15:0 & \\
\hline
\end{tabular}

\section*{Designing the Main Control}

- Observations about MIPS instruction format
- opcode is always in bits 31-26
- two registers to be read are always rs (bits 25-21) and rt (bits 20-16)
- base register for load/stores is always rs (bits 25-21)
- 16-bit offset for branch equal and load/store is always bits 15-0
- destination register for loads is in bits 20-16 (rt) while for Rtype instructions it is in bits 15-11 (rd) (will require multiplexor to select)

\section*{Other Control signal values}
- Asserted - value is 0
- Deasserted -value is 1

\section*{Data path with control lines}


What are the functions of each control signals??

\section*{Effects of seven control signals}
\begin{tabular}{|l|l|l|}
\hline \multicolumn{1}{|c|}{\begin{tabular}{c} 
Signal \\
name
\end{tabular}} & \multicolumn{1}{|c|}{ Effect when deasserted 0} & \multicolumn{1}{c|}{\(\quad\) Effect when asserted }
\end{tabular}\(|\)\begin{tabular}{|ll|l|}
\hline RegDst & \begin{tabular}{l} 
The register destination number for the \\
Write register comes from the rt field \\
(bits 20:16).
\end{tabular} & \begin{tabular}{l} 
The register destination number for the Write \\
register comes from the rd field (bits 15:11).
\end{tabular} \\
\hline RegWrite & None. & \begin{tabular}{l} 
The register on the Write register input is \\
written with the value on the Write data input.
\end{tabular} \\
\hline ALUSrc & \begin{tabular}{l} 
The second ALU operand comes from the \\
second register file output (Read data 2).
\end{tabular} & \begin{tabular}{l} 
The second ALU operand is the sign- \\
extended, lower 16 bits of the instruction.
\end{tabular} \\
\hline PCSrc & \begin{tabular}{l} 
The PC is replaced by the output of the \\
adder that computes the value of PC + 4.
\end{tabular} & \begin{tabular}{l} 
The PC is replaced by the output of the adder \\
that computes the branch target.
\end{tabular} \\
\hline MemRead & None. & \begin{tabular}{l} 
Data memory contents designated by the \\
address input are put on the Read data output.
\end{tabular} \\
\hline MemWrite & None. & \begin{tabular}{l} 
Data memory contents designated by the \\
address input are replaced by the value on \\
the Write data input.
\end{tabular} \\
\hline MemtoReg & \begin{tabular}{l} 
The value fed to the register Write data \\
input comes from the ALU.
\end{tabular} & \begin{tabular}{l} 
The value fed to the register Write data input \\
comes from the data memory.
\end{tabular} \\
\hline
\end{tabular}

\section*{Data path with control unit}


\section*{Branch condition evaluation}
\begin{tabular}{|l|l|l|}
\hline instruction & Condition true & Condition false \\
\hline beq & Zero \(=1\) & Zero \(=0\) \\
\hline Bnq & Zero \(=1\) & Zero \(=0\) \\
\hline Zero & \begin{tabular}{l} 
Branch \\
instruction
\end{tabular} & Zero AND Branch \\
\hline 1 & 1 & 1 \\
\hline 0 & 1 & 0 \\
\hline 0 & 0 & 0 \\
\hline 1 & 0 & 0 \\
\hline
\end{tabular}

\section*{Datapath with Control II}


MIPS datapath with the control unit: input to control is the 6-bit instruction opcode field, output is seven 1-bit signals and the 2-bit ALUOp signal

\section*{Determining control signals for the MIPS datapath based on instruction opcode}
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|}
\hline Instruction & RegDst & ALUSrc & \begin{tabular}{c} 
Memto- \\
Reg
\end{tabular} & \begin{tabular}{c} 
Reg \\
Write
\end{tabular} & \begin{tabular}{c} 
Mem \\
Read
\end{tabular} & \begin{tabular}{c} 
Mem \\
Write
\end{tabular} & Branch & ALUOp1 & ALUp0 \\
\hline R-format & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
\hline lw & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
\hline Sw & \(X\) & 1 & \(X\) & 0 & 0 & 1 & 0 & 0 & 0 \\
\hline beq & \(X\) & 0 & \(X\) & 0 & 0 & 0 & 1 & 0 & 1 \\
\hline
\end{tabular}

Control unit-7-1-bit signals
1-2-bit signals

\section*{Datapath with Control II (contd)}


\section*{Control Signals: R-Type Instruction}


\section*{Control Signals:}

\section*{lw Instruction}


\section*{Control Signals:} sw Instruction


\section*{Control Signals:}

\section*{beq Instruction}


\section*{Datapath with Control III}


MIPS datapath extended to jumps: control unit generates new Jump control bit

\section*{Datapath Executing j}


\section*{R-type Instruction: Step 1 add \$t1, \$t2, \$t3 (active = bold)}


Fetch instruction and increment PC count

\section*{R-type Instruction: Step 2 add \$t1, \$t2, \$t3 (active = bold)}


Read two source registers from the register file

\title{
R-type Instruction: Step 3 add \$t1, \$t2, \$t3 (active = bold)
}


ALU operates on the two register operands

\section*{R-type Instruction: Step 4} add \$t1, \$t2, \$t3 (active = bold)


Write result to register

\section*{Single-cycle Implementation Notes}
- The steps are not really distinct as each instruction completes in exactly one clock cycle - they simply indicate the sequence of data flowing through the datapath
- The operation of the datapath during a cycle is purely combinational - nothing is stored during a clock cycle
- Therefore, the machine is stable in a particular state at the start of a cycle and reaches a new stable state only at the end of the cycle
- Very important for understanding single-cycle computing: See our simple Verilog single-cycle computer in the folder SimpleSingleCycleComputer in Verilog/Examples

\section*{Load Instruction Steps Iw \$t1, offset(\$t2)}
1. Fetch instruction and increment PC
2. Read base register from the register file: the base register (\$t2) is given by bits 25-21 of the instruction
3. ALU computes sum of value read from the register file and the sign-extended lower 16 bits (offset) of the instruction
4. The sum from the ALU is used as the address for the data memory
5. The data from the memory unit is written into the register file: the destination register ( \(\$ \mathrm{t} 1\) ) is given by bits 20-16 of the instruction

\section*{Load Instruction lw \$t1, offset(\$t2)}


\section*{Branch Instruction Steps beq \$t1, \$t2, offset}
1. Fetch instruction and increment PC
2. Read two register (\$t1 and \$t2) from the register file
3. ALU performs a subtract on the data values from the register file; the value of PC+4 is added to the signextended lower 16 bits (offset) of the instruction shifted left by two to give the branch target address
4. The Zero result from the ALU is used to decide which adder result (from step 1 or 3) to store in the PC

\section*{Branch Instruction beq \$t1, \$t2, offset}


\section*{Implementation: ALU Control Block}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline \multicolumn{3}{|c|}{ ALUOp } & \multicolumn{5}{c|}{ Funct field } & Operation \\
\cline { 1 - 7 } ALUOp1 & ALUOp0 & F5 & F4 & F3 & F2 & F1 & F0 & \\
\hline 0 & 0 & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & 010 \\
\hline \(0 *\) & 1 & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & 110 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 0 & 0 & 0 & 010 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 0 & 1 & 0 & 110 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 1 & 0 & 0 & 000 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 1 & 0 & 1 & 001 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 1 & 0 & 1 & 0 & 111 \\
\hline
\end{tabular}
* Typo in text

Fig. 5.15: if it is X then there is potential conflict between
line 2 and lines 3-7!

Truth table for ALU control bits


ALU control logic

\section*{Implementation: Main Control Block}

Inputs
\begin{tabular}{|c|c|c|c|c|c|}
\hline & Signal name & \begin{tabular}{l}
R- \\
format
\end{tabular} & IW & SW & beq \\
\hline \multirow[b]{6}{*}{} & (Op5 & 0 & 1 & 1 & 0 \\
\hline & Op 4 & 0 & 0 & 0 & 0 \\
\hline & Op3 & 0 & 0 & 1 & 0 \\
\hline & Op2 & 0 & 0 & 0 & 1 \\
\hline & Op1 & 0 & 1 & 1 & 0 \\
\hline & Op0 & 0 & 1 & 1 & 0 \\
\hline \multirow[t]{9}{*}{} & (RegDst & 1 & 0 & x & x \\
\hline & ALUSrc & 0 & 1 & 1 & 0 \\
\hline & MemtoReg & 0 & 1 & x & x \\
\hline & RegWrite & 1 & 1 & 0 & 0 \\
\hline & MemRead & 0 & 1 & 0 & 0 \\
\hline & MemWrite & 0 & 0 & 1 & 0 \\
\hline & Branch & 0 & 0 & 0 & 1 \\
\hline & ALUOp1 & 1 & 0 & 0 & 0 \\
\hline & ALUOP2 & 0 & 0 & 0 & 1 \\
\hline
\end{tabular}

Truth table for main control signals


Main control PLA (programmable logic array): principle underlying PLAs is that any logical expression can be written as a sum-of-products

\section*{Single-Cycle Design Problems}
- Assuming fixed-period clock every instruction datapath uses one clock cycle implies:
- \(\mathrm{CPI}=1\)
- cycle time determined by length of the longest instruction path (load)
- but several instructions could run in a shorter clock cycle: waste of time
- consider if we have more complicated instructions like floating point!
- resources used more than once in the same cycle need to be duplicated
- waste of hardware and chip area

\section*{Example: Fixed-period clock vs. variable-period clock in a single-cycle implementation}
- Consider a machine with an additional floating point unit. Assume functional unit delays as follows
- memory: 2 ns., ALU and adders: 2 ns., FPU add: 8 ns., FPU multiply: 16 ns., register file access (read or write): 1 ns .
- multiplexors, control unit, PC accesses, sign extension, wires: no delay
- Assume instruction mix as follows
- all loads take same time and comprise \(31 \%\)
- all stores take same time and comprise \(21 \%\)
- R-format instructions comprise 27\%
- branches comprise 5\%
- jumps comprise 2\%
- FP adds and subtracts take the same time and totally comprise 7\%
- FP multiplys and divides take the same time and totally comprise 7\%
- Compare the performance of (a) a single-cycle implementation using a fixedperiod clock with (b) one using a variable-period clock where each instruction executes in one clock cycle that is only as long as it needs to be (not really practical but pretend it's possible!)

\section*{Solution}
\begin{tabular}{|llllllllc|}
\hline \begin{tabular}{l} 
Instruction \\
class
\end{tabular} & \begin{tabular}{l} 
Instr. Register \\
mem.
\end{tabular} & \begin{tabular}{l} 
ALU \\
read
\end{tabular} & \begin{tabular}{l} 
Data \\
mem.
\end{tabular} & \begin{tabular}{l} 
Register FPU \\
write
\end{tabular} & \begin{tabular}{l} 
FPU \\
add/ \\
sub
\end{tabular} & \begin{tabular}{l} 
Total \\
mul/ \\
dime
\end{tabular} & \begin{tabular}{c} 
ns.
\end{tabular} \\
Load word & 2 & 1 & 2 & 2 & 1 & & & 8 \\
Store word & 2 & 1 & 2 & 2 & & & & 7 \\
R-format & 2 & 1 & 2 & 0 & 1 & & & 6 \\
Branch & 2 & 1 & 2 & & & & & 5 \\
Jump & 2 & & & & & & & 2 \\
FP mul/div & 2 & 1 & & & 1 & & 16 & 20 \\
FP add/sub & 2 & 1 & & & 1 & 8 & & 12 \\
\hline
\end{tabular}
- Clock period for fixed-period clock \(=\) longest instruction time \(=20 \mathrm{~ns}\).
- Average clock period for variable-period clock \(=8 \times 31 \%+\) \(7 \times 21 \%+6 \times 27 \%+5 \times 5 \%+2 \times 2 \%+20 \times 7 \%+12 \times 7 \%\)
\(=7.0 \mathrm{~ns}\).
- Therefore, performance var-period \(/\) performance \(_{\text {fixed-period }}=20 / 7=2.9\)

\section*{Fixing the problem with singlecycle designs}
- One solution: a variable-period clock with different cycle times for each instruction class
- unfeasible, as implementing a variable-speed clock is technically difficult
- Another solution:
- use a smaller cycle time...
- ...have different instructions take different numbers of cycles by breaking instructions into steps and fitting each step into one cycle
- feasible: multicyle approach!

\section*{Summary}
- Basic implementation of MIPS
- ALU control

\section*{Summary}
- Techniques described in this chapter to design datapaths and control are at the core of all modern computer architecture
- Multicycle datapaths offer two great advantages over single-cycle
- functional units can be reused within a single instruction if they are accessed in different cycles - reducing the need to replicate expensive logic
- instructions with shorter execution paths can complete quicker by consuming fewer cycles
- Modern computers, in fact, take the multicycle paradigm to a higher level to achieve greater instruction throughput:
- pipelining (next topic) where multiple instructions execute simultaneously by having cycles of different instructions overlap in the datapath
- the MIPS architecture was designed to be pipelined

\section*{UNIT III PROCESSOR AND CONTROL UNIT}

Pipelining - Pipelined datapath and control Handling Data hazards \& Control hazards Exceptions.

Reference:
- Chapter 4 - The Processor
- Book - David A. Patterson and John L. Hennessey, "Computer organization and design", Morgan Kauffman / Elsevier, Fifth edition, 2014.

\section*{Overview Of Pipeline}

\section*{Pipelining}
- Start work ASAP!! Do not waste time!


Assume 30 min . each task - wash, dry, fold, store - and that separate tasks use separate hardware and so can be overlapped


\section*{Pipelined vs. Single-Cycle Instruction Execution: the Plan}


Assume 2 ns for memory access, ALU operation; 1 ns for register access: therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns .


\section*{Computer Architecture Pipeline}

\section*{Pipelining: Keep in Mind}
- Pipelining does not reduce latency of a single task, it increases throughput of entire workload
- Pipeline rate limited by longest stage
- potential speedup = number pipe stages
- unbalanced lengths of pipe stages reduces speedup
- Time to fill pipeline and time to drain it - when there is slack in the pipeline - reduces speedup

\section*{Example Problem}
- Problem: for the laundry fill in the following table when
1. the stage lengths are \(30,30,3030\) min., resp.
2. the stage lengths are 20, 20, 60, 20 min., resp.
\begin{tabular}{|c|c|c|c|c|c|}
\hline Person & Unpipelined finish time & Pipeline 1 finish time & Ratio unpipelined to pipeline 1 & Pipeline 2 finish time & Ratio unpiplelined to pipeline 2 \\
\hline 1 & & & & & \\
\hline 2 & & & & & \\
\hline 3 & & & & & \\
\hline 4 & & & & & \\
\hline n & & & & & \\
\hline
\end{tabular}
- Come up with a formula for pipeline speed-up!

\section*{Pipelining MIPS}
- What makes it easy with MIPS?
- all instructions are same length
- so fetch and decode stages are similar for all instructions
- just a few instruction formats
- simplifies instruction decode and makes it possible in one stage
- memory operands appear only in load/stores
- so memory access can be deferred to exactly one later stage
- operands are aligned in memory
- one data transfer instruction requires one memory access stage

\section*{Pipelining MIPS}
- What makes it hard?
- structural hazards: different instructions, at different stages, in the pipeline want to use the same hardware resource
- control hazards: succeeding instruction, to put into pipeline, depends on the outcome of a previous branch instruction, already in pipeline
- data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline
- Before actually building the pipelined datapath and control we first briefly examine these potential hazards individually...

Pipeline Hazards

\section*{Structural Hazards}
- Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle
- E.g., suppose single - not separate - instruction and data memory in pipeline below with one read port
- then a structural hazard between first and fourth lw instructions

- MIPS was designed to be pipelined: structural hazards are easy to avoid!

\section*{Data Hazards}
- Data hazard: instruction needs data from the result of a previous instruction still executing in pipeline
- Solution Forward data if possible...


Instruction pipeline diagram: shade indicates use left=write, right=read


Without forwarding - blue line data has to go back in time; with forwarding - red line data is available in time

\section*{Data Hazards}
- Forwarding may not be enough
- e.g., if an R-type instruction following a load uses the result of the load called load-use data hazard


\section*{Reordering Code to Avoid Pipeline Stall (Software Solution)}
- Example:
```

lW $t0, O($t1)
lW $t2, 4($t1)
sw $t2, O($t1)
SW $t0, 4($t1)

```
- Reordered code:
```

lw $t0, O($t1)
lw $t2, 4($t1)
SW $t0, 4($t1)
sw $t2, 0($t1)

```


\section*{Control Hazards}
- Control hazard: need to make a decision based on the result of a previous instruction still executing in pipeline
- Solution 1 Stall the pipeline


Pipeline stall

\section*{Control Hazards}
- Solution 2 Predict branch outcome
- e.g., predict branch-not-taken :

Program


\section*{Control Hazards}
- Solution 3 Delayed branch: always execute the sequentially next statement with the branch executing after one instruction delay compiler's job to find a statement that can be put in the slot that is independent of branch outcome
- MIPS does this - but it is an option in SPIM (Simulator -> Settings)


Delayed branch beq is followed by add that is independent of branch outcome

\section*{Dynamic Branch Prediction}
- Prediction of branches at runtime using runtime information.
- Ex: Restaurant
- Two schemes:
- 1-bit scheme
- 2-bit scheme

\section*{1-bit scheme}
- branch prediction buffer - Also called branch history table.
- A small memory that is indexed by the lower portion of the address of the branch instruction and that contains one or more bits indicating whether the branch was recently taken or not.
- The memory contains a bit that says whether the branch was recently taken or not

\section*{Branch History Table}
\begin{tabular}{l|l|} 
& \begin{tabular}{l} 
Bit indicating \\
branch taken(1) \\
or not taken(0)
\end{tabular} \\
\hline 1000 & 0 \\
\hline 1012 & 1 \\
\hline 2046 & 0 \\
\hline 2116 & 1 \\
\hline
\end{tabular}
\[
\begin{aligned}
& 1000 \text { Beq } \$ s 2, \$ s 1, \text { exit } \\
& 1012 \text { Beq } \$ s 2, \$ s 1, \text { for } \\
& 2046 \text { Beq } \$ s 2, \$ s 1, \text { else }
\end{aligned}
\]

Assume 0 - bit indicating branch not taken
1 - bit indicating branch taken

\section*{Disadvantage of 1-bit scheme}
- What if the branch is taken and not taken the alternatively?
- Then the prediction table will be wrong always.
- (ie) prediction will be always a failure

\section*{Disadvantage example}
\(1000 \mathrm{i}=1\);
1004 while(i<=9)\{
1008
i++;
\}
Initially - bit is set to 1
\(1^{\text {st }}\) iteration - misprediction(since branch is not taken)
Next 8 iteration - prediction is success
\(10^{\text {th }}\) - exit iteration - misprediction ( since branch is taken)
Disadvantage:
2 Mispredictions out of 10 . so \(80 \%\) accuracy is achieved

\section*{2-bit scheme}
- A prediction must be wrong twice before the prediction is changed.
- A branch prediction buffer has 2 bits to store the history.
\begin{tabular}{|l|l|}
\hline \begin{tabular}{l} 
Bits indicating branch taken \\
or not taken
\end{tabular} & Meaning \\
\hline 00 & Strongly Branch not taken \\
\hline 01 & Weakly branch not taken \\
\hline 10 & Weakly branch taken \\
\hline 11 & Strongly branch taken \\
\hline
\end{tabular}

\section*{Finite state diagram}


\section*{Exception}

\section*{Exception}
- Exception - (interrupt) An unscheduled event that disrupts program execution; used to detect overflow.
- Interrupt - An exception that comes from outside of the processor. (Some architectures use the term interrupt for all exceptions.)
- vectored interrupt - An interrupt for which the address to which control is transferred is determined by the cause of the exception.

\section*{Difference between exception and interrupt}
Exception Interrupt

Unscheduled event
Invoked Internal (within processor)
Ex: Arithmetic overflow

Unscheduled event
Invoked External (outside processor)

\author{
Ex: IO device request
}

\section*{Exception}
\begin{tabular}{|l|l|l|}
\hline \multicolumn{1}{|c|}{ Type of event } & \multicolumn{1}{c|}{ From where? } & \multicolumn{1}{c|}{ Milss terminology } \\
\hline I/O device request & External & Interrupt \\
\hline Invoke the operating system from user program & Internal & Exception \\
\hline Arithmetic overflow & Internal & Exception \\
\hline Using an undefined instruction & Internal & Exception \\
\hline Hardware malfunctions & Either & Exception or interrupt \\
\hline
\end{tabular}

\section*{Pipelined datapath and control}

\section*{Pipelined Datapath}
- We now move to actually building a pipelined datapath
- First recall the 5 steps in instruction execution
1. Instruction Fetch \& PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
- Review: single-cycle processor
- all 5 steps done in a single clock cycle
- dedicated hardware required for each step
- What happens if we break the execution into multiple cycles, but keep the extra hardware?

\section*{Review - Single-Cycle Datapath "Steps"}


\section*{Pipelined Datapath - Key Idea}
- What happens if we break the execution into multiple cycles, but keep the extra hardware?
- Answer: We may be able to start executing a new instruction at each clock cycle - pipelining
- ...but we shall need extra registers to hold data between cycles - pipeline registers

\section*{Pipelined Datapath}


\section*{Pipelined Datapath}


\section*{Bug in the Datapath}


Write register number comes from another later instruction!

\section*{Corrected Datapath}


Destination register number is also passed through ID/EX, EX/MEM and MEM/WB registers, which are now wider by 5 bits

\section*{Pipelined Example}
- Consider the following instruction sequence:
```

lw $t0, 10($t1)
sw $t3, 20($t4)
add \$t5, \$t6, \$t7
sub \$t8, \$t9, \$t10

```

\section*{Single-Clock-Cycle Diagram:} Clock Cycle 1


\section*{Single-Clock-Cycle Diagram:} Clock Cycle 2


\section*{Single-Clock-Cycle Diagram:} Clock Cycle 3


\section*{Single-Clock-Cycle Diagram:} Clock Cycle 4


Single-Clock-Cycle Diagram: Clock Cycle 5


\section*{Single-Clock-Cycle Diagram: Clock Cycle 6}


\section*{Single-Clock-Cycle Diagram: Clock Cycle 7}


\section*{Single-Clock-Cycle Diagram: Clock Cycle 8}


\section*{Alternative View -Multiple-Clock-Cycle Diagram}


\section*{Notes}
- One significant difference in the execution of an R-type instruction between multicycle and pipelined implementations:
- register write-back for the R-type instruction is the \(5^{\text {th }}\) (the last writeback) pipeline stage vs. the \(4^{\text {th }}\) stage for the multicycle implementation. Why?
- think of structural hazards when writing to the register file...
- Worth repeating: the essential difference between the pipeline and multicycle implementations is the insertion of pipeline registers to decouple the 5 stages
- The CPI of an ideal pipeline (no stalls) is 1. Why?
- The RaVi Architecture Visualization Project of Dortmund U. has pipeline simulations - see link in our Additional Resources page
- As we develop control for the pipeline keep in mind that the text does not consider jump - should not be too hard to implement!

\section*{Recall Single-Cycle Control - the Datapath}


\section*{Recall Single-Cycle - ALU Control}

Instruction AluOp Instruction Funct Field Desired ALU control
\begin{tabular}{ll} 
opcode & \\
LW & 00 \\
SW & 00 \\
Branch eq & 01 \\
R-type & 10 \\
R-type & 10 \\
R-type & 10 \\
R-type & 10 \\
R-type & 10
\end{tabular}
operation
\begin{tabular}{ll} 
load word & xxxxxx \\
store word & xxxxxx \\
branch eq & xxxxxx \\
add & 100000 \\
subtract & 100010 \\
AND & 100100 \\
OR & 100101 \\
set on less & 101010
\end{tabular}
\begin{tabular}{ll} 
add & 010 \\
add & 010 \\
subtract & 110 \\
add & 010 \\
subtract & 110 \\
and & 000 \\
or & 001 \\
set on less & 111
\end{tabular}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline \multicolumn{2}{|c|}{ ALUOp } & \multicolumn{5}{c|}{ Funct field } & \multirow{2}{*}{ Operation } \\
\cline { 1 - 8 } ALUOp1 & ALUOp0 & F5 & F4 & F3 & F2 & F1 & F0 & \\
\hline 0 & 0 & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & 010 \\
\hline 0 & 1 & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & \(X\) & 110 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 0 & 0 & 0 & 010 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 0 & 1 & 0 & 110 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 1 & 0 & 0 & 000 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 0 & 1 & 0 & 1 & 001 \\
\hline 1 & \(X\) & \(X\) & \(X\) & 1 & 0 & 1 & 0 & 111 \\
\hline
\end{tabular}

Truth table for ALU control bits

\title{
Recall Single-Cycle - Control Signals
}

\section*{Effect of control bits}
\begin{tabular}{|l|l|l|}
\hline \multicolumn{1}{|c|}{\begin{tabular}{c} 
Signal \\
name
\end{tabular}} & \multicolumn{1}{|c|}{ Effect when deasserted } & \multicolumn{1}{c|}{ Effect when asserted }
\end{tabular}\(|\)\begin{tabular}{|lll|}
\hline RegDst & \begin{tabular}{l} 
The register destination number for the \\
Write register comes from the rt field \\
(bits 20:16).
\end{tabular} & \begin{tabular}{l} 
The register destination number for the Write \\
register comes from the rd field (bits 15:11).
\end{tabular} \\
\hline RegWrite & None. & \begin{tabular}{l} 
The register on the Write register input is \\
written with the value on the Write data input.
\end{tabular} \\
\hline ALUSrc & \begin{tabular}{l} 
The second ALU operand comes from the \\
second register file output (Read data 2).
\end{tabular} & \begin{tabular}{l} 
The second ALU operand is the sign- \\
extended, lower 16 bits of the instruction.
\end{tabular} \\
\hline PCSrc & \begin{tabular}{l} 
The PC is replaced by the output of the \\
adder that computes the value of PC + 4.
\end{tabular} & \begin{tabular}{l} 
The PC is replaced by the output of the adder \\
that computes the branch target.
\end{tabular} \\
\hline MemRead & None. & \begin{tabular}{l} 
Data memory contents designated by the \\
address input are put on the Read data output.
\end{tabular} \\
\hline MemWrite & None. & \begin{tabular}{l} 
Data memory contents designated by the \\
address input are replaced by the value on \\
the Write data input.
\end{tabular} \\
\hline MemtoReg & \begin{tabular}{l} 
The value fed to the register Write data \\
input comes from the ALU.
\end{tabular} & \begin{tabular}{l} 
The value fed to the register Write data input \\
comes from the data memory.
\end{tabular} \\
\hline
\end{tabular}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|}
\hline Deter- & Instruction & RegDst & ALUSrc & MemtoReg & \begin{tabular}{l}
Reg \\
Write
\end{tabular} & \begin{tabular}{l}
Mem \\
Read
\end{tabular} & \begin{tabular}{l}
Mem \\
Write
\end{tabular} & Branch & ALUOp1 & ALUp0 \\
\hline mining & R-format & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
\hline & lw & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
\hline bits & sw & X & 1 & X & 0 & 0 & 1 & 0 & 0 & 0 \\
\hline & beq & X & 0 & X & 0 & 0 & 0 & 1 & 0 & 1 \\
\hline
\end{tabular}

\section*{Pipeline Control}
- Initial design - motivated by single-cycle datapath control - use the same control signals
- Observe:
- No separate write signal for the PC as it is written every cycle
- No separate write signals for the pipeline registers as they are written every cycle

Will be modified by hazard detection unit!!
- No separate read signal for instruction memory as it is read every clock cycle
- No separate read signal for register file as it is read every clock cycle
- Need to set control signals during each pipeline stage
- Since control signals are associated with components active during a single pipeline stage, can group control lines into five groups according to pipeline stage

\section*{Pipelined Datapath with Control I}


\section*{Pipeline Control Signals}
- There are five stages in the pipeline
- instruction fetch / PC increment
- instruction decode / register fetch
- execution / address calculation
- memory access
- write back

\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|}
\hline \multirow{4}{*}{} & \multicolumn{4}{|c|}{\begin{tabular}{c} 
Execution/Address Calculation \\
stage control lines
\end{tabular}} & \multicolumn{4}{c|}{\begin{tabular}{c} 
Memory access stage \\
control lines
\end{tabular}} & \multicolumn{2}{c|}{\begin{tabular}{c} 
stage control \\
lines
\end{tabular}} \\
\cline { 2 - 11 } & \begin{tabular}{c} 
Reg \\
Dst
\end{tabular} & \begin{tabular}{c} 
ALU \\
Op1
\end{tabular} & \begin{tabular}{c} 
ALU \\
Op0
\end{tabular} & \begin{tabular}{c} 
ALU \\
Src
\end{tabular} & Branch & \begin{tabular}{c} 
Mem \\
Read
\end{tabular} & \begin{tabular}{c} 
Mem \\
Write
\end{tabular} & \begin{tabular}{c} 
Reg \\
write
\end{tabular} & \begin{tabular}{c} 
Mem to \\
Reg
\end{tabular} \\
\hline R-format & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
\hline lw & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\
\hline SW & X & 0 & 0 & 1 & 0 & 0 & 1 & 0 & X \\
\hline beq & X & 0 & 1 & 0 & 1 & 0 & 0 & 0 & X \\
\hline
\end{tabular}

\section*{Pipeline Control Implementation}
- Pass control signals along just like the data - extend each pipeline register to hold needed control bits for succeeding stages

- Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6 least significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register

\section*{Pipelined Datapath with Control II}


\section*{Pipelined Execution and Control}

Instruction sequence:
```

lw \$10, 20(\$1)
sub \$11, \$2, \$3
and \$12, \$4, \$7
or \$13, \$6, \$7
add \$14, \$8, \$9

```

Label "before<i>" means ith instruction before 1w


\section*{Pipelined Execution and Control}

Instruction sequence:
```

1w \$10, 20(\$1)
sub \$11, \$2, \$3
and \$12, \$4, \$7
or \$13, \$6, \$7
add \$14, \$8, \$9

```


\section*{Pipelined Execution and Control}

Instruction sequence:
```

lw \$10, $20(\$ 1)$
sub \$11, \$2, \$3
and \$12, \$4, \$7
or \$13, \$6, \$7
add \$14, \$8, \$9

```

Label "after<i>" means i th instruction after add


\section*{Pipelined Execution and Control}

Instruction sequence:
1w \(\$ 10,20(\$ 1)\)
sub \$11, \$2, \$3
and \$12, \$4, \$7
or \(\$ 13, \$ 6, \$ 7\)
add \$14, \$8, \$9


\section*{Pipelined Execution and Control}
- Instruction sequence:

1w \(\$ 10,20(\$ 1)\) sub \$11, \$2, \$3 and \$12, \$4, \$7 or \(\$ 13, \$ 6, \$ 7\) add \$14, \$8, \$9


\section*{Data Hazards}

\section*{Revisiting Hazards}
- So far our datapath and control have ignored hazards
- We shall revisit data hazards and control hazards and enhance our datapath and control to handle them in hardware...

\section*{Data Hazards and Forwarding}
- Problem with starting an instruction before previous are finished:
- data dependencies that go backward in time - called data hazards
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|}
\hline \multirow[t]{3}{*}{\[
\begin{aligned}
& \$ 2=10 \text { before sub; } \\
& \$ 2=-20 \text { after sub }
\end{aligned}
\]} & \multicolumn{3}{|l|}{Time (in clock cycles)} & & & & & & & \\
\hline & \multirow[t]{2}{*}{Value of
register \$2:} & CC 1 & CC 2 & CC 3 & CC 4 & CC 5 & CC 6 & CC 7 & CC 8 & CC 9 \\
\hline & & 10 & 10 & 10 & 10 & 10/-20 & -20 & -20 & -20 & -20 \\
\hline
\end{tabular}
```

sub \$2, \$1, \$3
and \$12, \$2, \$5
or \$13,\$6,\$2
add \$14, \$2, \$2
SW \$15, 100(\$2)

```

Program
execution
order
(in instructions)
sub \(\$ 2, \$ 1, \$ 3\)
and \(\$ 12, \$ 2, \$ 5\)
or \(\$ 13, \$ 6, \$ 2\)

-

add \(\$ 14, \$ 2, \$ 2\)


\section*{Software Solution}
- Have compiler guarantee never any data hazards!
- by rearranging instructions to insert independent instructions between instructions that would otherwise have a data hazard between them,
- or, if such rearrangement is not possible, insert nops
\begin{tabular}{llllll} 
sub & \(\$ 2, \quad \$ 1, \$ 3\) & sub & \(\$ 2, \$ 1, \$ 3\) \\
lw & \(\$ 10,40(\$ 3)\) & nop & & \\
slt & \(\$ 5, \$ 6, \$ 7\) & nop & & \\
and & \(\$ 12, \$ 2, \$ 5\) & or & and & \(\$ 12, \$ 2, \$ 5\) \\
or & \(\$ 13, \$ 6, \$ 2\) & & or & \(\$ 13, \$ 6, \$ 2\) \\
add & \(\$ 14, \$ 2, \$ 2\) & add & \(\$ 14, \$ 2, \$ 2\) \\
sw & \(\$ 15,100(\$ 2)\) & sw & \(\$ 15,100(\$ 2)\)
\end{tabular}
- Such compiler solutions may not always be possible, and nops slow the machine down
\[
\text { MIPS: nop }=\text { "no operation" }=00 \ldots 0(32 \text { bits })=\text { sll } \$ 0, \$ 0,0
\]

\section*{Hardware Solution: Forwarding}
- Idea: use intermediate data, do not wait for result to be finally written to the destination register. Two steps:
1. Detect data hazard
2. Forward intermediate data to resolve hazard

\section*{Pipelined Datapath with Control II (as before)}


\section*{Hazard Detection}
- Hazard conditions:

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
- Eg., in the earlier example, first hazard between sub \$2, \$1, \$3 and and \(\$ 12, \$ 2, \$ 5\) is detected when the and is in EX stage and the sub is in MEM stage because
- EX/MEM.RegisterRd = ID/EX.RegisterRs = \$2 (1a)
- Whether to forward also depends on:
- if the later instruction is going to write a register - if not, no need to forward, even if there is register number match as in conditions above
- if the destination register of the later instruction is \(\$ 0\) - in which case there is no need to forward value ( \(\$ 0\) is always 0 and never overwritten)

\section*{Data Forwarding}
- Plan:
- allow inputs to the ALU not just from ID/EX, but also later pipeline registers, and
- use multiplexors and control signals to choose appropriate inputs to ALU
```

sub \$2, \$1, \$3
and \$12, \$2, \$5
or \$13, \$6, \$2
add \$14, \$2, \$2
sw \$15, 100(\$2)

```


\section*{Forwarding Hardware \\ }
a. № fomarDatapath before adding forwarding hardware

b. with forwardin:Datapath after adding forwarding hardware

\section*{Forwarding Hardware: Multiplexor Control}

Mux control Source Explanation
ForwardA \(=00 \quad\) ID/EX \(\quad\) The first ALU operand comes from the register file
ForwardA \(=10 \quad\) EX/MEM The first ALU operand is forwarded from prior ALU result
ForwardA \(=01\) MEM/WB The first ALU operand is forwarded from data memory or an earlier ALU result

ForwardB \(=00 \quad\) ID/EX \(\quad\) The second ALU operand comes from the register file
ForwardB \(=10 \quad\) EX/MEM The second ALU operand is forwarded from prior ALU result
ForwardB \(=01\) MEM/WB The second ALU operand is forwarded from data memory


Depending on the selection in the rightmost multiplexor (see datapath with control diagram)

\section*{Data Hazard: Detection and Forwarding}
- Forwarding unit determines multiplexor control according to the following rules:
1. EX hazard
if \(\left(\begin{array}{l}\text { EX/MEM.RegWrite } \\ \text { and }(E X / M E M . R e g i s t e r R d ~\end{array}=0\right) \quad\) if there is a write...
and (EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches, then...

ForwardA \(=10\)
```

if ( EX/MEM.RegWrite // if there is a write...
and (EX/MEM.RegisterRd = 0 ) // to a non-\$0 register...
and (EX/MEM.RegisterRd = ID/EX.RegisterRt ) ) // which matches, then...
ForwardB = 10

```

\section*{Data Hazard: Detection and Forwarding}

\section*{2.} MEM hazard
if ( MEM/WB.RegWrite
// if there is a write... and (MEM/WB.RegisterRd \(\neq 0\) )
// to a non-\$0 register...
and (EX/MEM.RegisterRd \(\neq\) ID/EX.RegisterRs ) // and not already a register match // with earlier pipeline register... and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) ) // but match with later pipeline register, then...

ForwardA = 01
if ( MEM/WB.RegWrite // if there is a write... and (MEM/WB.RegisterRd \(\neq 0\) ) // to a non-\$0 register... and (EX/MEM.RegisterRd \(\neq\) ID/EX.RegisterRt ) // and not already a register match // with earlier pipeline register... and (MEM/WB.RegisterRd = ID/EX.RegisterRt ) ) // but match with later pipeline register, then...

ForwardB \(=01\)

This check is necessary, e.g., for sequences such as add \(\$ 1, \$ 1, \$ 2\); add \(\$ 1, \$ 1, \$ 3\); add \(\$ 1, \$ 1, \$ 4\); (array summing?), where an earlier pipeline (EX/MEM) register has more recent data

\section*{Forwarding Hardware with Control}

Called forwarding unit, not hazard detection unit, because once data is forwarded there is no hazard!


Datapath with forwarding hardware and control wires - certain details, e.g., branching hardware, are omitted to simplify the drawing Note: so far we have only handled forwarding to R-type instructions...!

\section*{Forwarding}

Execution example:
```

sub \$2, \$1, \$3
and \$4, \$2, \$5
or \$4, \$4, \$2
add \$9, \$4, \$2

| sub | $\$ 2$, | $\$ 1$, | $\$ 3$ |
| :--- | :--- | :--- | :--- |
| and | $\$ 4$, | $\$ 2$, | $\$ 5$ |
| or | $\$ 4$, | $\$ 4$, | $\$ 2$ |
| add | $\$ 9$, | $\$ 4$, | $\$ 2$ |

```


\section*{Forwarding}

Execution example (cont.):
```

sub \$2, \$1, \$3
and \$4, \$2, \$5
or \$4, \$4, \$2
add \$9, \$4, \$2

```

Clockacycle 5
-

\section*{after<2>}

\[
\text { after<2> } \mid \text { after<1> } \mid \text { add } \$ 9, \$ 4, \$ 2
\]

Clock eycle 6

\section*{Data Hazards and Stalls}
- Load word can still cause a hazard:
- an instruction tries to read a register following a load instruction that writes to the same register
- therefore, we need a hazard detection unit to stall the pipeline after the load instruction
```

lw \$2, 20(\$1)
and \$4, \$2, \$5
or \$8, \$2, \$6
add \$9, \$4, \$2
Slt \$1, \$6, \$7

```

As even a pipeline dependency goes backward in time forwarding will not solve the hazard


\section*{Pipelined Datapath with Control II (as before)}


\section*{Hazard Detection Logic to Stall}
- Hazard detection unit implements the following check if to stall
```

if ( ID/EX.MemRead // if the instruction in the EX stage is a load...
and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs ) // and the destination register
or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register of the
//instruction in the ID stage, then... stall the pipeline

```

\section*{Mechanics of Stalling}
- If the check to stall verifies, then the pipeline needs to stall only 1 clock cycle after the load as after that the forwarding unit can resolve the dependency
- What the hardware does to stall the pipeline 1 cycle:
- does not let the IF/ID register change (disable write!) - this will cause the instruction in the ID stage to repeat, i.e., stall
- therefore, the instruction, just behind, in the IF stage must be stalled as well - so hardware does not let the PC change (disable write!) - this will cause the instruction in the IF stage to repeat, i.e., stall
- changes all the EX, MEM and WB control fields in the ID/EX pipeline register to 0 , so effectively the instruction just behind the load becomes a nop - a bubble is said to have been inserted into the pipeline
- note that we cannot turn that instruction into an nop by Oing all the bits in the instruction itself - recall nop \(=00 \ldots 0\) ( 32 bits) - because it has already been decoded and control signals generated

\section*{Hazard Detection Unit}


Datapath with forwarding hardware, the hazard detection unit and controls wires - certain details, e.g., branching hardware are omitted to simplify the drawing

\section*{Stalling Resolves a Hazard}
- Same instruction sequence as before for which forwarding by itself could not resolve the hazard:


Hazard detection unit inserts a 1-cycle bubble in the pipeline, after which all pipeline register dependencies go forward so then the forwarding unit can handle them and there are no more hazards


\section*{Stalling}
- Execution example (cont.):
lw \(\$ 2,20(\$ 1)\)
and \$4, \$2, \$5
or \(\$ 4, \$ 4, \$ 2\)
add \$9, \$4, \$2

Clock cycle 4


Clock cyele 5
and \(\$ 4, \$ 2, \$ 5\)
or \(\$ 4, \$ 4, \$ 2\)
Iw \$2,

\section*{Stalling}
- Execution example (contd).
lw \(\$ 2\), \(20(\$ 1)\) and \$4, \$2, \$5 or \(\$ 4, \$ 4, \$ 2\) add \$9, \$4, \$2

Clock cyclete 6

\section*{after<2>}

Clock cycle 7

\section*{Control Hazards}

\section*{Control (or Branch) Hazards}
- Problem with branches in the pipeline we have so far is that the branch decision is not made till the MEM stage - so what instructions, if at all, should we insert into the pipeline following the branch instructions?
- Possible solution: stall the pipeline till branch decision is known
- not efficient, slow the pipeline significantly!
- Another solution: predict the branch outcome
- e.g., always predict branch-not-taken - continue with next sequential instructions
- if the prediction is wrong have to flush the pipeline behind the branch - discard instructions already fetched or decoded - and continue execution at the branch target

\section*{Predicting Branch-not-taken: Misprediction delay}


The outcome of branch taken (prediction wrong) is decided only when beq is in the MEM stage, so the following three sequential instructions already in the pipeline have to be flushed and execution resumes at 1 w

\section*{Optimizing the Pipeline to Reduce Branch Delay}
- Move the branch decision from the MEM stage (as in our current pipeline) earlier to the ID stage
- calculating the branch target address involves moving the branch adder from the MEM stage to the ID stage - inputs to this adder, the PC value and the immediate fields are already available in the IF/ID pipeline register
- calculating the branch decision is efficiently done, e.g., for equality test, by XORing respective bits and then ORing all the results and inverting, rather than using the ALU to subtract and then test for zero (when there is a carry delay)
- with the more efficient equality test we can put it in the ID stage without significantly lengthening this stage - remember an objective of pipeline design is to keep pipeline stages balanced
- we must correspondingly make additions to the forwarding and hazard detection units to forward to or stall the branch at the ID stage in case the branch decision depends on an earlier result

\section*{Flushing on Misprediction}
- Same strategy as for stalling on load-use data hazard...
- Zero out all the control values (or the instruction itself) in pipeline registers for the instructions following the branch that are already in the pipeline - effectively turning them into nops - so they are flushed
- in the optimized pipeline, with branch decision made in the ID stage, we have to flush only one instruction in the IF stage - the branch delay penalty is then only one clock cycle

\section*{Optimized Datapath for Branch}


Branch decision is moved from the MEM stage to the ID stage - simplified drawing not showing enhancements to the forwarding and hazard detection units

\section*{Pipelined Branch}
- Execution example:
\[
72 \mathrm{lw} \quad \$ 4, \quad 50(\$ 7)
\]

\section*{Optimized pipeline with} only one bubble as a result of the taken branch

Clock cyele 3

\[
\begin{aligned}
& 36 \text { sub \$10, \$4, \$8 } \\
& 40 \text { beq \$1, \$3, } 7 \\
& 44 \text { and } \$ 12 \text { \$2, \$5 } \\
& 48 \text { or } \$ 13 \text { \$2, } \$ 6 \\
& 52 \text { add \$14, \$4, \$2 } \\
& 56 \text { slt \$15, \$6, \$7 }
\end{aligned}
\]

\section*{Simple Example: Comparing Performance}
- Compare performance for single-cycle, multicycle, and pipelined datapaths using the gcc instruction mix
- assume 2 ns for memory access, 2 ns for ALU operation, 1 ns for register read or write
- assume gcc instruction mix 23\% loads, \(13 \%\) stores, \(19 \%\) branches, \(2 \%\) jumps, 43\% ALU
- for pipelined execution assume
- \(50 \%\) of the loads are followed immediately by an instruction that uses the result of the load
- \(25 \%\) of branches are mispredicted
- branch delay on misprediction is 1 clock cycle
- jumps always incur 1 clock cycle delay so their average time is 2 clock cycles

\section*{Simple Example: Comparing Performance}
- Single-cycle (p. 373): average instruction time 8 ns
- Multicycle (p. 397): average instruction time 8.04 ns
- Pipelined:
- loads use 1 cc (clock cycle) when no load-use dependency and 2 cc when there is dependency - given \(50 \%\) of loads are followed by dependency the average cc per load is 1.5
- stores use 1 cc each
- branches use 1 cc when predicted correctly and 2 cc when not - given \(25 \%\) misprediction average cc per branch is 1.25
- jumps use 2 cc each
- ALU instructions use 1 cc each
- therefore, average CPI is
\(1.5 \times 23 \%+1 \times 13 \%+1.25 \times 19 \%+2 \times 2 \%+1 \times 43 \%=1.18\)
- therefore, average instruction time is \(1.18 \times 2=2.36 \mathrm{~ns}\)

\section*{Dynamic Branch Prediction}
- Prediction of branches at runtime using runtime information.
- Ex: Restaurant
- Two schemes:
- 1-bit scheme
- 2-bit scheme

\section*{1-bit scheme}
- branch prediction buffer - Also called branch history table.
- A small memory that is indexed by the lower portion of the address of the branch instruction and that contains one or more bits indicating whether the branch was recently taken or not.
- The memory contains a bit that says whether the branch was recently taken or not

\section*{Branch History Table}
\begin{tabular}{l|l|} 
& \begin{tabular}{l} 
Bit indicating \\
branch taken(1) \\
or not taken(0)
\end{tabular} \\
\hline 1000 & 0 \\
\hline 1012 & 1 \\
\hline 2046 & 0 \\
\hline 2116 & 1 \\
\hline
\end{tabular}
\[
\begin{aligned}
& 1000 \text { Beq } \$ s 2, \$ s 1, \text { exit } \\
& 1012 \text { Beq } \$ s 2, \$ s 1, \text { for } \\
& 2046 \text { Beq } \$ s 2, \$ s 1, \text { else }
\end{aligned}
\]

Assume 0 - bit indicating branch not taken
1 - bit indicating branch taken

\section*{Disadvantage of 1-bit scheme}
- What if the branch is taken and not taken the alternatively?
- Then the prediction table will be wrong always.
- (ie) prediction will be always a failure

\section*{Disadvantage example}
\(1000 \mathrm{i}=1\);
1004 while(i<=9)\{
1008
i++;
\}
Initially - bit is set to 1
\(1^{\text {st }}\) iteration - misprediction(since branch is not taken)
Next 8 iteration - prediction is success
\(10^{\text {th }}\) - exit iteration - misprediction ( since branch is taken)
Disadvantage:
2 Mispredictions out of 10 . so \(80 \%\) accuracy is achieved

\section*{2-bit scheme}
- A prediction must be wrong twice before the prediction is changed.
- A branch prediction buffer has 2 bits to store the history.
\begin{tabular}{|l|l|}
\hline \begin{tabular}{l} 
Bits indicating branch taken \\
or not taken
\end{tabular} & Meaning \\
\hline 00 & Strongly Branch not taken \\
\hline 01 & Weakly branch not taken \\
\hline 10 & Weakly branch taken \\
\hline 11 & Strongly branch taken \\
\hline
\end{tabular}

\section*{Finite state diagram}


\section*{Exception}

\section*{Exception}
- Exception - (interrupt) An unscheduled event that disrupts program execution; used to detect overflow.
- Interrupt - An exception that comes from outside of the processor. (Some architectures use the term interrupt for all exceptions.)
- vectored interrupt - An interrupt for which the address to which control is transferred is determined by the cause of the exception.

\section*{Difference between exception and interrupt}
Exception Interrupt

Unscheduled event
Invoked Internal (within processor)
Ex: Arithmetic overflow

Unscheduled event
Invoked External (outside processor)

\author{
Ex: IO device request
}

\section*{Exception}
\begin{tabular}{|l|l|l|}
\hline \multicolumn{1}{|c|}{ Type of event } & \multicolumn{1}{c|}{ From where? } & \multicolumn{1}{c|}{ Milss terminology } \\
\hline I/O device request & External & Interrupt \\
\hline Invoke the operating system from user program & Internal & Exception \\
\hline Arithmetic overflow & Internal & Exception \\
\hline Using an undefined instruction & Internal & Exception \\
\hline Hardware malfunctions & Either & Exception or interrupt \\
\hline
\end{tabular}

What happens in a processor when exception occur?
1. Processor save the address of the offending instruction in the exception program counter (EPC)
2. Determine the cause of exception
3. Transfer control to the operating system at some specified address(based on the cause of exception)

\section*{How to find the cause of exception}
- Two methods to detect the cause of exception:
- Use cause register
- Vectored interrupts

\section*{Cause register}
- Cause register - stores the reason for exception
- Ex: arithmetic overflow

I/O device request

\section*{Exception handling using cause register}
1. EPC = address of offending instruction
2. Cause register = cause of exception
3. Single entry point for exception handling code. ( say 1000 )
- Then decode the cause register to move to the specific exception handling code


\section*{2. Vectored Interrupt}
- An interrupt for which the address to which control is transferred is determined by the cause of the exception.

\section*{Exception type \\ Exception vector address (in hex)}
\begin{tabular}{|c|c|}
\hline Undefined instruction & \(80000000_{\operatorname{mex}}\) \\
\hline Arithmetic overflow & \(80000180_{\operatorname{mex}}\) \\
\hline
\end{tabular}

\section*{Exception handling using Vectored Interrupt}
1. \(E P C=\) address of offending instruction
2. Refer the vectored interrupt table to find the address of the specific exception handling code.

Exeppion type Exeeption veator aditiess (in hex)
\begin{tabular}{|c|c|c|c|c|c|}
\hline Undefined instruction & \(80000000{ }_{\text {nax }}\) & \multicolumn{2}{|l|}{\multirow[t]{2}{*}{\begin{tabular}{l}
8000 0000hex \\
Undefined \\
Instruction \(\qquad\) \\
Exception \\
Handling code \\
8000 0180hex \\
Arithmetic overflw \\
Exception \\
Handling code
\end{tabular}}} & & \\
\hline Arithmetic overfow & 80000180 & & & & \(\rightarrow\) \\
\hline
\end{tabular}

\section*{Implementing the exception system - using cause register}
- Elements added to implement exception system:
- Two registers
- Cause register
- EPC register (Exception Program Counter)
- Single entry point - exception handling code starting address

\section*{Exception - Datapath}


\section*{Arithmetic overflow exception detected clock cycle 6}


Exception handling of Arthimetic Overflow
2. All the instruction after
- clock cycle 7 The offending instruction are flushed

1. Stored EPC and cause register

\section*{Summary}
- Pipeline
- Pipeline Hazards
- Pipeline Datapath and Control
- Data and Control Hazards

\section*{UNIT IV MEMORY AND I/O SYSTEMS}

Memory hierarchy - Memory technologies - Cache basics - Measuring and improving cache performance - Virtual memory, TLBs

\section*{Reference:}
- Chapter 5 - Large and Fast: Exploiting Memory Hierarchy
- Book - David A. Patterson and John L. Hennessey, "Computer organization and design", Morgan Kauffman / Elsevier, Fifth edition, 2014.

Memory Hierarchy

\section*{Principle of Locality}
- Temporal locality - (locality in time):
\(\checkmark\) if an item is referenced, it will tend to be referenced again soon.
- Spatial locality - (locality in space):
\(\checkmark\) if an item is referenced, items whose addresses are close by will tend to be referenced soon

\section*{Memory Hierarchy}


\section*{Hit and Miss}
- two adjacent levels - called, upper (closer to CPU) and lower (farther from CPU)
- Terminology:
- block: minimum unit of data to move between levels
- hit: data requested is in upper level
- miss: data requested is not in upper level


\section*{Memory Access}
- hit rate - Hit memory access / Total memory access
- miss rate - Miss memory access / Total memory access
- Total memory access = Hit memory access + Miss memory access
- Miss rate \(=1\) - hit rate
- hit time -time to determine if the access is indeed a hit + time to access and deliver the data from the upper level to the CPU
- miss penalty: time to determine if the access is a miss + time to replace block at upper level with corresponding block at lower level + time to deliver the block to the CPU

\section*{Memory hierarchy}


\section*{Memory Technology}
- SRAM
- DRAM
- Flash memory
- Disk memory
\begin{tabular}{|c|c|c|}
\hline \multicolumn{1}{|c|}{ Memory technology } & Typical access time & Sper cis in 2012 \\
\hline SRAM semiconductor memory & \(0.5-2.5 \mathrm{~ns}\) & \(\$ 500-\$ 1000\) \\
\hline DRAM semiconductor memory & \(50-70 \mathrm{~ns}\) & \(\$ 10-\$ 20\) \\
\hline Flash semiconductor memory & \(5,000-50,000 \mathrm{~ns}\) & \(\$ 0.75-\$ 1.00\) \\
\hline Magnetic disk & \(5,000,000-20,000,000 \mathrm{~ns}\) & \(\$ 0.05-\$ 0.10\) \\
\hline
\end{tabular}

\section*{SRAM}
- Static random access memory
- Volatile - loses data when there is no power
- Used to make caches
- Very fast
- Very costly

\section*{DRAM}
- Dynamic random access memory
- Volatile - loses data when there is no power
- Used to make main memory
- Slower than SRAM
- Cheaper than SRAM

\section*{Difference between SRAM and DRAM}
\begin{tabular}{|l|l|l|}
\hline S.No & SRAM & DRAM \\
\hline 1 & \begin{tabular}{l} 
Static Random Access \\
Memory
\end{tabular} & \begin{tabular}{l} 
Dynamic Random Access \\
Memory
\end{tabular} \\
\hline 2 & Volatile & Volatile \\
\hline 3 & Faster & Slower \\
\hline 4 & Costlier & Cheaper than SRAM \\
\hline 5 & Less denser & Denser than SRAM \\
\hline 6 & Used as Cache memory & Used as Main memory \\
\hline
\end{tabular}

\section*{DRAM Internal Organisation}


Each DRAM - has 4 banks
Each bank - has many rows
Commands :
Pre - Precharge command - opens/close a bank
Act - Activate command - transfer row from bank to buffer
Each buffer - has many columns
Command:
Rd - Read command - read data from buffer column
Wr - Write command - write data to buffer column

\section*{DDR - DRAM}
- DDR - Double Data Rate
- Where data is transferred during both rising and falling edge of the clock. \((20+20=40 \mathrm{Mbps})\)


20Mbps

\section*{DRAM Growth}
\begin{tabular}{|c|c|c|c|c|}
\hline Year Introduced & Chlp slre & \$ per cli & \begin{tabular}{c} 
Total access time to \\
a new row/column
\end{tabular} & \begin{tabular}{c} 
Average column \\
access time to \\
exlsting row
\end{tabular} \\
\hline 1980 & 64 Kibibit & \(\$ 1,500,000\) & 250 ns & 150 ns \\
\hline 1983 & 256 Kibibit & \(\$ 500,000\) & 185 ns & 100 ns \\
\hline 1985 & 1 Mebibit & \(\$ 200,000\) & 135 ns & 40 ns \\
\hline 1989 & 4 Mebibit & \(\$ 50,000\) & 110 ns & 40 ns \\
\hline 1992 & 16 Mebibit & \(\$ 15,000\) & 90 ns & 30 ns \\
\hline 1996 & 64 Mebibit & \(\$ 10,000\) & 60 ns & 12 ns \\
\hline 1998 & 128 Mebibit & \(\$ 4,000\) & 60 ns & 10 ns \\
\hline 2000 & 256 Mebibit & \(\$ 1,000\) & 55 ns & 7 ns \\
\hline 2004 & 512 Mebibit & \(\$ 250\) & 50 ns & 5 ns \\
\hline 2007 & 1 Gibibit & \(\$ 50\) & 45 ns & 1.25 ns \\
\hline 2010 & 2 Gibibit & \(\$ 30\) & 40 ns & 1 ns \\
\hline 2012 & 4 Gibibit & \(\$ 1\) & 35 ns & 0.8 ns \\
\hline
\end{tabular}

\section*{Flash memory}
- EEPROM - Electrically erasable programmable read-only memory
- Disadvantage: Writes can wear out flash memory bits. (10,000 writes)
- Solution: (wear levelling)
- Controller - spread the writes by remapping blocks that have been written many times to less trodden blocks.

\section*{Disk Memory}


\section*{Disk Memory(2)}
- Read-write head - To read and write information on a hard disk, a movable arm containing a small electromagnetic coil called a read-write head is located just above each surface.
- Track - One of thousands of concentric circles that makes up the surface of a magnetic disk.
- Sector - One of the segments that make up a track on a magnetic disk; a sector is the smallest amount of information that is read or written on a disk.
- seek - The process of positioning a read/write head over the proper track on a disk

\section*{3 computational times}
- Seek time - the time to move the head to the desired track.
- Rotational latency - Also called rotational delay. The time required for the desired sector of a disk to rotate under the read/write head; usually assumed to be half the rotation time.
- Transfer time - is the time to transfer a block of bits.
- The transfer time is a function of the sector size, the rotation speed, and the recording density of a track.
- Transfer rates in 2012 were between 100 and \(200 \mathrm{MB} / \mathrm{sec}\).

\section*{Problem}
\[
\begin{aligned}
\text { Average rotational latency } & =\frac{0.5 \text { rotation }}{5400 \mathrm{RPM}}=\frac{0.5 \text { rotation }}{5400 \mathrm{RPM} /\left(60 \frac{\text { seconds }}{\text { minute }}\right)} \\
& =0.0056 \text { seconds }=5.6 \mathrm{~ms}
\end{aligned}
\]

\section*{Cache Basics}

\section*{Caches}
- By simple example
- assume block size = one word of data
\begin{tabular}{|c|}
\hline\(\times 4\) \\
\hline\(\times 1\) \\
\hline\(\times n-2\) \\
\hline\(\times n-1\) \\
\hline\(\times 2\) \\
\hline\(\times 3\) \\
\hline
\end{tabular}
a. Before the reference to Xn

b. After the reference to \(\times n\)
- Issues:
- how do we know if a data item is in the cache?
- if it is, how do we find it?
- if not, what do we do?
- Solution depends on cache addressing scheme...

\section*{Direct Mapped Cache}


\section*{Direct Mapped Cache}
- Addressing scheme in direct mapped cache:
- cache block address = memory block address mod (no of blocks in cache )
- 3 fields
- Index
- Tag
- valid

\section*{Accessing Cache}
- Example:
(0) Initial state:
\begin{tabular}{llll} 
Index & V & Tag & Data \\
000 & N & & \\
001 & N & \\
010 & N & \\
011 & N & \\
100 & N & \\
101 & N & \\
110 & N & \\
111 & N &
\end{tabular}
(1) Address referred 10110 (miss):

Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
\(110 \quad \mathrm{Y} \quad 10 \mathrm{Mem}(10110)\)
111 N
(2) Address referred 11010 (miss):
\begin{tabular}{llll} 
Index & V & Tag & Data \\
000 & N & & \\
001 & N & & \\
010 & Y & 11 & Mem \((11010)\) \\
011 & N & & \\
100 & N & & \\
101 & N & & \\
110 & Y & 10 & Mem \((10110)\) \\
111 & N & &
\end{tabular}
(3) Address referred 10110 (hit):
000 N
001 N
\(010 \quad \mathrm{Y} \quad 11\) Mem (11010)
011 N
100 N
101 N
\(110 \quad \mathrm{Y} \quad 10 \mathrm{Mem}(10110)\)
111 N
(4) Address referred 10010 (miss): replace
```

Index V Tag Data
000 N
0 0 1 ~ N
010 Y 10 Mem(10010)
011 N
100 N
101 N
110 Y 10 Mem(10110)
111 N

```

\section*{Direct Mapped Cache}

Address showing bit positions


Cache with 1024 1-word blocks: byte offset (least 2 significant bits) is ignored and
next 10 bits used to index into cach

What kind of locality are we taking advantage of?

\section*{Formula}
- The size of the block above was one word
- For the following situation:
- 32-bit addresses
- A direct-mapped cache
- The cache size is \(2^{n}\) blocks, so \(n\) bits are used for the index
- The block size is \(2^{m}\) words ( \(2^{m+2}\) bytes), so \(m\) bits are used for the word within the block, and two bits are used for the byte part of the address
- size of the tag field = 32-( \(\boldsymbol{n}+\boldsymbol{m}+2)\)
- The total number of bits in a direct-mapped cache
- For one block = block size(data) + tag size + valid field size
- The total number of bits in a direct-mapped cache is
- \(2^{\text {n }}\) (block size + tag size + valid field size)

\section*{Example Problem}
- How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word block size, assuming a 32-bit address?
- Cache data \(=16 \mathrm{~KB}=2^{14}\) bytes \(=2^{12}\) words \(=\)
\[
=2^{12} \text { words } / 4 \text { words }=2{ }^{10} \text { blocks }
\]
- Each cache entry size = block data bits + tag bits + valid bit
\[
=(4 * 32)+(32-10-2-2)+1=128+18+1=147 \text { bits }
\]
- Therefore, Entire cache size \(=2^{10} \times 147\) bits \(=2^{10} \times(147)\) bits \(=147\) Kbits
- Total cache size -> 147Kbits / 8 = 18.4 Kbytes
- Actual cache size(only data) -> 16Kbytes
- total cache size/actual cache data \(=18.4 / 16=1.15\)
- So \(15 \%\) increase in the cache size to include tag and valid bits.

\section*{Example problem}
- Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte address 1200 map?
- Use the formula:
(Block address) modulo (Number of blocks in the cache)
- Block address = byte address/ bytes per block
- \(\quad=1200 / 16=75\)
-75 modulo \(64=11^{\text {th }}\) block in cache

\section*{Example Problem}
- How many total bits are required for a direct-mapped cache with 128 KB of data and 1-word block size, assuming a 32-bit address?
- Cache data \(=128 \mathrm{~KB}=2{ }^{17}\) bytes \(=2^{15}\) words \(=2^{15}\) blocks
- Cache entry size = block data bits + tag bits + valid bit
\[
=32+(32-15-2)+1=48 \text { bits }
\]
- Therefore, cache size \(=2^{15} \times 48\) bits \(=2{ }^{15} \times(1.5 \times 32)\) bits \(=1.5 \times 2^{20}\) bits \(=1.5 \mathrm{Mbits}\)
- data bits in cache \(=128 \mathrm{~KB} \times 8=1 \mathrm{Mbits}\)
- total cache size/actual cache data \(=1.5\)

\section*{Example Problem}
- How many total bits are required for a direct-mapped cache with 128 KB of data and 4-word block size, assuming a 32-bit address?
- Cache size \(=128 \mathrm{~KB}=2^{17}\) bytes \(=2^{15}\) words \(=2{ }^{13}\) blocks
- Cache entry size = block data bits + tag bits + valid bit
\[
=128+(32-13-2-2)+1=144 \text { bits }
\]
- Therefore, cache size \(=2^{13} \times 144\) bits \(=2^{13} \times(1.25 \times 128)\) bits \(=\) \(1.25 \times 2^{20}\) bits \(=1.25 \mathrm{Mbits}\)
- data bits in cache \(=128 \mathrm{~KB} \times 8=1 \mathrm{Mbits}\)
- total cache size/actual cache data \(=1.25\)

\section*{Cache Read Hit/Miss}
- Cache read hit: no action needed
- Instruction cache read miss:
1. Send original PC value (current PC - 4, as PC has already been incremented in first step of instruction cycle) to memory
2. Instruct main memory to perform read and wait for memory to complete access - stall on read
3. After read completes write cache entry
4. Restart instruction execution at first step to refetch instruction
- Data cache read miss:
- Similar to instruction cache miss
- To reduce data miss penalty allow processor to execute instructions while waiting for the read to complete until the word is required - stall on use (why won't this work for instruction misses?)

\section*{cache Mrite Hit/Miss}
- Write-through scheme
- on write hit: replace data in cache and memory with every write hit to avoid inconsistency
- on write miss: write the word into cache and memory - obviously no need to read missed word from memory!
- Write-through is slow because of always required memory write
- performance is improved with a write buffer where words are stored while waiting to be written to memory - processor can continue execution until write buffer is full
- when a word in the write buffer completes writing into main that buffer slot is freed and becomes available for future writes
- DEC 3100 write buffer has 4 words
- Write-back scheme
- write the data block only into the cache and write-back the block to main only when it is replaced in cache
- more efficient than write-through, more complex to implement

Measuring and Improving cache performance

\section*{Measuring Cache Performance}
- Simplified model assuming equal read and write miss penalties:
- CPU time \(=\) (execution cycles + memory stall cycles) \(\times\) cycle time
- Memory stall cycles = reads stall cycles + write stall cycles
- Read stall cycles \(=\) reads/program + read miss rate + read miss penalty
- Write stall cycles = writes/program + write miss rate + write miss penalty
- memory stall cycles \(=\) memory accesses/program \(\times\) miss rate \(\times\) miss penalty
- memory stall cycles \(=\) instructions/program \(\times\) miss/instruction \(\times\) miss penalty

Where, memory accesses = read memory access + write memory access miss rate \(=\) read miss rate + write miss rate miss penalty \(=\) read miss penalty \(=\) write miss penalty

\section*{Example Problems}
- Assume for a given machine and program:
- instruction cache miss rate \(2 \%\)
- data cache miss rate \(4 \%\)
- miss penalty always 100 cycles
- CPI of 2 without memory stalls
- frequency of load/stores \(36 \%\) of instructions
1. How much faster is a machine with a perfect cache that never misses?
2. What happens if we speed up the machine by reducing its CPI to 1 without changing the clock rate?

\section*{Solution-1}
- Assume instruction count = 1
- memory stall cycles \(=\) instructions/program \(\times\) miss/instruction \(\times\) miss penalty
- Instruction miss cycles \(=1 \times 2 \% \times 100=2.0 \times 1\)
- Data miss cycles \(=1 \times 36 \% \times 4 \% \times 100=1.44 \times 1\)
- total memory-stall cycles = Instruction miss cycles + Data miss cycles \(=2 \times I+1.44 \times I=3.44 \times I\)
- CPI \(_{\text {memory stalls }}=\) Basic CPI + memory stall CPI
\[
=2+3.44=5.44
\]
- CPU time = IC x CPI x clock cycles
- CPU time with stalls / CPU time with perfect cache \(=1 \times 5.44 \times\) clock cycles \(/ \mathrm{I} \times 2 \times\) clock cycles \(=5.44 / 2=2.72\)
- Performance with a perfect cache is better by a factor of 2.72

\section*{Solution - 2}
- CPI without stall = 1
- CPI with stall \(=\) Basic CPI + memory stall CPI
- \(\quad=1+3.44=4.44\)
- CPU time with stalls / CPU time with perfect cache
= CPI with stall / CPI without stall
= 4.44/1
- Performance with a perfect cache is better by a factor of 4.44
- Conclusion: with higher CPI cache misses "hurt more" than with lower CPI

\section*{Average Memory Access Time}
- AMAT = Time for a hit + Miss rate \(\times\) Miss penalty

\section*{AMAT - Problem}
- Given:
- Clock cycle time \(=1 \mathrm{~ns}\)
- Miss penalty \(=20\) clock cycles
- Miss rate \(=0.05\) misses per instruction
- Cache access(hit) time = 1 clock cycle

\section*{Solution}
- AMAT = Time for a hit + Miss rate \(\times\) Miss penalty
\[
\begin{aligned}
& =1+0.05 \times 20 \\
& =1+1 \\
& =2 \text { clock cycles }
\end{aligned}
\]

Or clock cycle time \(=2 \times 1 n s=2 n s\)

\section*{Cache Read Hit/Miss}
- Cache read hit: no action needed
- Instruction cache read miss:
1. Send original PC value to memory
2. Instruct main memory to perform read and wait for memory to complete access - stall on read
3. After read completes write cache entry
4. Restart instruction execution at first step to refetch instruction

\section*{Cache Write Hit/Miss}
- Write-through scheme - write data both in cache and main memory
- on write hit: replace data in cache and memory with every write hit to avoid inconsistency
- on write miss: write the word into cache and memory - obviously no need to read missed word from memory!
- Write-through is slow because of always required memory write
- performance is improved with a write buffer
- Write-back scheme
- write the data block only into the cache and write-back the block to main only when it is replaced in cache
- more efficient than write-through, more complex to implement

\section*{Decreasing Miss Rates with Associative Block Placment}
- Direct mapped: one unique cache location for each memory block
- cache block address = memory block address mod cache size
- Fully associative: each memory block can locate anywhere in cache
- all cache entries are searched (in parallel) to locate block
- Set associative: each memory block can place in a unique set of cache locations - if the set is of size n it is n -way set-associative
- cache set address = memory block address mod number of sets in cache
- all cache entries in the corresponding set are searched (in parallel) to locate block
- Increasing degree of associativity
- reduces miss rate
- increases hit time because of the parallel search and then fetch

\section*{Decreasing Miss Rates with Associative Block Placment}


Location of a memory block with address 12 in a cache with 8 blocks with different degrees of associativity

\section*{Adv and Disadv of mapping}
- Direct mapped :
- Adv - easy to search
- Disadv - only one block for multiple data
- Set-Associativity mapped :
- Adv - more than one block for multiple data
- Disadv - search more than one block
- Associativity mapped :
- Adv - all blocks available for data
- Disadv - search all the blocks

\section*{Decreasing Miss Rates with Associative Block Placment}


Four-way set associative


Eight-way set associative (fully associative)
Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data

Configurations of an 8-block cache with different degrees of associativity

\section*{Example Problems}
- Find the number of misses for a cache with four 1-word blocks given the following sequence of memory block accesses:
\(0,8,0,6,8\),
for each of the following cache configurations
1. direct mapped
2. 2-way set associative (use LRU replacement policy)
3. fully associative
- Note about LRU replacement
- in a 2-way set associative cache LRU replacement can be implemented with one bit at each set whose value indicates the mostly recently referenced block

\section*{Solution}
- 1 (direct-mapped)
\begin{tabular}{cc} 
Block address & Cache block \\
0 & \(0(=0 \bmod 4)\) \\
6 & \(2(=6 \bmod 4)\) \\
8 & \(0(=8 \bmod 4)\)
\end{tabular}

Block address translation in direct-mapped cache
\begin{tabular}{cccccc} 
Address of memory & Hit or & \multicolumn{3}{c}{ Contents of cache blocks after reference } \\
block accessed & miss & \(\mathbf{0}\) & \(\mathbf{1}\) & \(\mathbf{2}\) & \(\mathbf{3}\) \\
0 & miss & Memory[0] & & \\
8 & miss & Memory[8] & \\
0 & miss & Memory[0] & \\
6 & miss & Memory[0] & Memory[6] \\
8 & miss & Memory[8] & Memory[6]
\end{tabular}

Cache contents after each reference - red indicates new entry added
- 5 misses

\section*{Solution (cont.)}
- 2 (two-way set-associative)
\begin{tabular}{cc} 
Block address & Cache set \\
0 & \(0(=0 \bmod 2)\) \\
6 & \(0(=6 \bmod 2)\) \\
8 & \(0(=8 \bmod 2)\)
\end{tabular}

Block address translation in a two-way set-associative cache
\begin{tabular}{cccccc} 
Address of memory & Hit or & \multicolumn{4}{c}{ Contents of cache blocks after reference } \\
block accessed & miss & Set 0 & Set 0 & Set 1 & Set 1 \\
0 & miss & Memory[0] & & & \\
8 & miss & Memory[0] & Memory[8] & & \\
0 & hit & Memory[0] & Memory[8] & \\
6 & miss & Memory[0] & Memory[6] & \\
8 & miss & Memory[8] & Memory[6] & & \\
8 & & & &
\end{tabular}

Cache contents after each reference - red indicates new entry added
- Four misses

\section*{Solution (cont.)}
- 3 (fully associative)
\begin{tabular}{cccccc} 
Address of memory & Hit or & \multicolumn{3}{c}{ Contents of cache blocks after reference } \\
block accessed & miss & Block 0 & Block 1 & Block 2 & Block 3 \\
0 & miss & Memory[0] & & & \\
8 & miss & Memory[0] & Memory[8] & \\
0 & hit & Memory[0] & Memory[8] & \\
6 & miss & Memory[0] & Memory[8] & Memory[6] & \\
8 & hit & Memory[0] & Memory[8] & Memory[6]
\end{tabular}

Cache contents after each reference - red indicates new entry added
- 3 misses

\section*{Implementation of a 4-way - SetAssociative Cache}


4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor: size of cache is 1 K blocks = 256 sets \(\boldsymbol{*}\) 4-block set size

\section*{Decreasing Miss Penalty with Multilevel Caches}
- Add a second-level cache
- primary cache is on the same chip as the processor
- use SRAMs to add a second-level cache, sometimes off-chip, between main memory and the first-level cache
- if miss occurs in primary cache second-level cache is accessed
- if data is found in second-level cache miss penalty is access time of second-level cache which is much less than main memory access time
- if miss occurs again at second-level then main memory access is required and large miss penalty is incurred

\section*{Virtual Memory}

\section*{Virtual Memory}
- Motivation: main memory acts as cache for secondary storage, e.g., magnetic disk
- main memory size \(\leq\) disk size \(\leq\) virtual address space size
- Page table transparently converts a virtual memory address to a physical memory address,
- if the data is already in main; if not, it issues call to OS to fetch the data from disk into main
- Virtual memory is organized in fixed-size (power of 2, typically at least 4 KB ) blocks, called pages. Physical memory is also considered a collection of pages of the same size.
- the unit of data transfer between disk and physical memory is a page

\section*{Virtual Memory}


Mapping of pages from a virtual address to a physical address or disk address

\title{
Page Table Implements Virtual to Physical Address Translation
}


Physical address
Page table: page size 4 KB , virtual address space 4 GB, physical memory 1 GB

\section*{Example Problem}
- Assume:
- 32-bit virtual address
- 4 KB page size
- 4 bytes per page table entry
- What is the total page table size is we want to be able to access all of the virtual memory?

\section*{Solution}
- No. of page table entries = address space size / page size
\[
=2^{32} / 2^{12}=2^{20}
\]
- Size of page table \(=\) No. of entries \(\times\) entry size
\[
=2^{20} \times 4 \text { bytes }=4 \mathrm{MB} \text { (huge !) }
\]
- Note, to avoid large page table size:
- each program has its own page table
- page table register points to start of program's page table
- to reduce storage required per program page table
- page table for a program covers the span of virtual memory containing its own code and data
- other techniques, e.g., multiple-level page tables, hashing virtual address, etc.

\section*{Page Faults}
- Page fault: page is not in memory, must retrieve it from disk
- enormous miss penalty = millions of cycles
- therefore, page size should be large (e.g., 32 or 64 KB )
- to make one trip to disk worth a lot
- reducing page faults is critical
- LRU replacement policy - implemented approximately by setting a use bit each time a page is accessed, and then periodically clearing all these bits so that pages accessed in a fixed time period are known
- fully associative page placement - consequence of page table
- handle faults in software instead of hardware
- as software overhead is still small compared to disk access time
- using write-through is too expensive, so always use write-back

\section*{Resolving Page Faults using the Page}


Page table maps virtual page to either physical page or disk page

\title{
Making Address Translation Fast with the Translation-lookaside Buffer
}


On a page reference, first look up the virtual page number in the TLB; if there is a TLB miss look up the page table; if miss again then true page fault

\section*{Summary}
- Memory hierarchy
- Cache basics
- Measuring cache perfromance
- Improving cache performance
- Virtual memory

\section*{UNIT IV MEMORY AND I/O SYSTEMS}

Input/output system, programmed I/O, DMA and interrupts, I/O processors.

\section*{Reference:}
- Chapter 7 - Input Output System
- Book - William Stallings , "Computer Organization and Architecture", 8 \({ }^{\text {th }}\) Edition.

\section*{Input Output System}

\section*{Need for Input/Output Module}
- Wide variety of peripherals
- Delivering different amounts of data
- At different speeds
- In different formats
- All slower than CPU and RAM
- Need I/O modules

\section*{Input/Output Module}
- Interface to CPU and Memory
- Interface to one or more peripherals


\section*{Generic Model of I/O Module}


\section*{External Devices}
- Monitor Screen
- Printer
- Keyboard
- Modem

\section*{External Device Block Diagram}


\section*{I/O Module Function}
- Control \& Timing
- CPU Communication
- Device Communication
- Data Buffering
- Error Detection

\section*{I/O Steps}
- CPU checks I/O module device status
- I/O module returns status
- If ready, CPU requests data transfer
- I/O module gets data from device
- I/O module transfers data to CPU
- Variations for output, DMA, etc.

\section*{I/O Function and Steps}

1. Request status
4.Return status
5. Request data
8.Return data
\(\overrightarrow{2 .}\) Request status
3.Return status
6. Request data
7.Return data

\section*{I/O Module Diagram}


\section*{I/O Module Decisions}
- Hide or reveal device properties to CPU
- Support multiple or single device
- Control device functions or leave for CPU
- Also O/S decisions
- e.g. Unix treats everything it can as a file

\section*{Programmed I/O}

\section*{Input Output Techniques}
- Programmed I/O
- Interrupt driven I/O
- Direct Memory Access (DMA)

\section*{Three Techniques for Input of a Block of Data}

(a) Programmed I/O


Next instruction
(b) Interrupt-driven I/O

(c) Direct memory access


\section*{Programmed I/O}
- CPU has direct control over I/O
- Sensing status
- Read/write commands
- Transferring data
- CPU waits for I/O module to complete operation
- Wastes CPU time

\section*{Programmed I/O - detail}
- CPU requests I/O operation
- I/O module performs operation
- I/O module sets status bits
- CPU checks status bits periodically
- I/O module does not inform CPU directly
- I/O module does not interrupt CPU
- CPU may wait or come back later

\section*{I/O Commands}
- CPU issues address
- Identifies module (\& device if >1 per module)
- CPU issues command
- Control - telling module what to do
- e.g. spin up disk
- Test - check status
- e.g. power? Error?
- Read/Write
- Module transfers data via buffer from/to device

\section*{Addressing I/O Devices}
- Under programmed I/O data transfer is very like memory access (CPU viewpoint)
- Each device given unique identifier
- CPU commands contain identifier (address)

\section*{I/O Mapping}
- Memory mapped I/O
- Devices and memory share an address space
- I/O looks just like memory read/write
- No special commands for I/O
- Large selection of memory access commands available
- Isolated I/O
- Separate address spaces
- Need I/O or memory select lines
- Special commands for I/O
- Limited set

\section*{Memory Mapped and Isolated I/O}
\begin{tabular}{clc} 
ADDRESS & INSTRUCTION & OPERAND \\
200 & Load AC & "1" \\
& Store AC & 517 \\
202 & Load AC & 517 \\
& Branch if Sign = 0 & 202 \\
& Load AC & 516
\end{tabular}

OMMENT
Load accumulator
Initiate keyboard read
Get status byte
Loop until ready
Load data byte

\begin{tabular}{clc} 
ADDRESS & INSTRUCTION & OPERAND \\
200 & Load I/O & 5 \\
201 & Test I/O & 5 \\
& Branch Not Ready & 201 \\
& In & 5
\end{tabular}

COMMENT Initiate keyboard read Check for completion Loop until complete Load data byte
(b) Isolated I/O

\section*{Interrupt Driven I/O}


\section*{Interrupt Driven I/O}
- Overcomes CPU waiting
- No repeated CPU checking of device
- I/O module interrupts when ready

\section*{Interrupt Driven I/O Basic Operation}
- CPU issues read command
- I/O module gets data from peripheral whilst CPU does other work
- I/O module interrupts CPU
- CPU requests data
- I/O module transfers data

Hardware


Software


\section*{Simple Interrupt} Processing

\section*{CPU Viewpoint}
- Issue read command
- Do other work
- Check for interrupt at end of each instruction cycle
- If interrupted:-
- Save context (registers)
- Process interrupt
- Fetch data \& store
- See Operating Systems notes

\section*{Changes in Memory and Registers}

(a) Interrupt occurs after instruction at location N

(b) Return from interrupt

\section*{Design Issues}
- How do you identify the module issuing the interrupt?
- How do you deal with multiple interrupts?
- i.e. an interrupt handler being interrupted

\section*{Identifying Interrupting Module (1)}
- Different line for each module
- PC
- Limits number of devices
- Software poll
- CPU asks each module in turn
- Slow

\section*{Identifying Interrupting Module (2)}
- Daisy Chain or Hardware poll
- Interrupt Acknowledge sent down a chain
- Module responsible places vector on bus
- CPU uses vector to identify handler routine
- Bus Master
- Module must claim the bus before it can raise interrupt

\section*{Multiple Interrupts}
1) Implementation of interrupt priority using individual interrupt-request and acknowledge lines.
- Each line has different priority levels.
- Interrupt acknowledgement will be sent only if it has highest priority than the one assigned currently.
- Adv - accept interrupt request from some devices based on priority.
2) Daisy chain
- Adv - fewer lines
3) Arrangement of priority groups
- Adv - combine the advantage of both the above schemes

\section*{Example - PC Bus}
- \(80 \times 86\) has one interrupt line
- 8086 based systems use one 8259A interrupt controller
- 8259A has 8 interrupt lines

\section*{Sequence of Events}
- 8259A accepts interrupts
- 8259A determines priority
- 8259A signals 8086 (raises INTR line)
- CPU Acknowledges
- 8259A puts correct vector on data bus
- CPU processes interrupt

\section*{ISA Bus Interrupt System}
- ISA bus chains two 8259As together
- Link is via interrupt 2
- Gives 15 lines
- 16 lines less one for link
- IRQ 9 is used to re-route anything trying to use IRQ 2
- Backwards compatibility
- Incorporated in chip set

Slave

\section*{82C59A Interrupt Controller}


\section*{Intel 82C55A}

\section*{Programmable Peripheral Interface}

(a) Block diagram

(b) Pin layout

\section*{Keyboard/Display Interfaces to 82C55A}


\section*{DMA}

\section*{DMA working}


\section*{Direct Memory Access}
- DMA - used to transfer large block of data
- Interrupt driven and programmed I/O require active CPU intervention
- Transfer rate is limited
- CPU is tied up

\section*{DMA Function}
- Additional Module (hardware) on bus
- DMA controller takes over from CPU for I/O

\section*{Tvpical DMA Module Diagram}

\section*{DMA Operation}
- CPU tells DMA controller:-
- Read/Write
- Device address
- Starting address of memory block for data
- Amount of data to be transferred
- CPU carries on with other work
- DMA controller deals with transfer
- DMA controller sends interrupt when finished

\section*{Memory Cycle Stealing}
- DMA controller has higher priority over the system bus than the processor.
- This causes the DMA controller to steal the memory clock cycles of the processor.

\section*{DMA and Interrupt Breakpoints}

\section*{During an Instruction Cycle \\ Time}


\section*{Aside}
- What effect does caching memory have on DMA?
- What about on board cache?
- Hint: how much are the system buses available?

\section*{DMA Configurations (1)}

- Single Bus, Detached DMA controller
- Each transfer uses bus twice
- I/O to DMA then DMA to memory
- CPU is suspended twice

\section*{DMA Configurations (2)}

(b) Shgle-bus, Integrated DMA-I/O
- Single Bus, Integrated DMA controller
- Controller may support >1 device
- Each transfer uses bus once
- DMA to memory
- CPU is suspended once

\section*{DMA Configurations (3)}

- Separate I/O Bus
- Bus supports all DMA enabled devices
- Each transfer uses bus once
- DMA to memory
- CPU is suspended once

\section*{Intel 8237A DMA Controller}
- Interfaces to \(80 \times 86\) family and DRAM
- When DMA module needs buses it sends HOLD signal to processor
- CPU responds HLDA (hold acknowledge)
- DMA module can use buses
- E.g. transfer data from memory to disk
1. Device requests service of DMA by pulling DREQ (DMA request) high
2. DMA puts high on HRQ (hold request),
3. CPU finishes present bus cycle (not necessarily present instruction) and puts high on HDLA (hold acknowledge). HOLD remains active for duration of DMA
4. DMA activates DACK (DMA acknowledge), telling device to start transfer
5. DMA starts transfer by putting address of first byte on address bus and activating MEMR; it then activates IOW to write to peripheral. DMA decrements counter and increments address pointer. Repeat until count reaches zero
6. DMA deactivates \(H R Q\), giving bus back to CPU

\section*{8237 DMA Usage of Systems Bus}

CPU


DACK = DMA acknowledge
DREQ \(=\) DMA request
HLDA = HOLD acknowledge
HRQ = HOLD request

\section*{Fly-By}
- While DMA using buses processor idle
- Processor using bus, DMA idle
- Known as fly-by DMA controller
- Data does not pass through and is not stored in DMA chip
- DMA only between I/O port and memory
- Not between two I/O ports or two memory locations
- Can do memory to memory via register
- 8237 contains four DMA channels
- Programmed independently
- Any one active
- Numbered 0, 1, 2, and 3

\section*{I/O Channels}
- I/O devices getting more sophisticated
- e.g. 3D graphics cards
- CPU instructs I/O controller to do transfer
- I/O controller does entire transfer
- Improves speed
- Takes load off CPU
- Dedicated processor is faster

\section*{I/O Channel Architecture}

(a) Selector

(b) Multtplexor

\section*{Interfacing}
- Connecting devices together
- Bit of wire?
- Dedicated processor/memory/buses?
- E.g. FireWire, InfiniBand

\section*{IEEE 1394 FireWire}
- High performance serial bus
- Fast
- Low cost
- Easy to implement
- Also being used in digital cameras, VCRs and TV

\section*{FireWire Configuration}
- Daisy chain
- Up to 63 devices on single port
- Really 64 of which one is the interface itself
- Up to 1022 buses can be connected with bridges
- Automatic configuration
- No bus terminators
- May be tree structure

\section*{Simple FireWire Configuration}


\section*{FireWire 3 Layer Stack}
- Physical
- Transmission medium, electrical and signaling characteristics
- Link
- Transmission of data in packets
- Transaction
- Request-response protocol


\section*{FireWire - Physical Layer}
- Data rates from 25 to 400 Mbps
- Two forms of arbitration
- Based on tree structure
- Root acts as arbiter
- First come first served
- Natural priority controls simultaneous requests
- i.e. who is nearest to root
- Fair arbitration
- Urgent arbitration

\section*{FireWire - Link Layer}
- Two transmission types
- Asynchronous
- Variable amount of data and several bytes of transaction data transferred as a packet
- To explicit address
- Acknowledgement returned
- Isochronous
- Variable amount of data in sequence of fixed size packets at regular intervals
- Simplified addressing
- No acknowledgement

(b) Concatenated asynchronous subactions

(c) Example isochronous subactions

\section*{InfiniBand}
- I/O specification aimed at high end servers
- Merger of Future I/O (Cisco, HP, Compaq, IBM) and Next Generation I/O (Intel)
- Version 1 released early 2001
- Architecture and spec. for data flow between processor and intelligent I/O devices
- Intended to replace PCl in servers
- Increased capacity, expandability, flexibility

\section*{InfiniBand Architecture}
- Remote storage, networking and connection between servers
- Attach servers, remote storage, network devices to central fabric of switches and links
- Greater server density
- Scalable data centre
- Independent nodes added as required
- I/O distance from server up to
- 17m using copper
- 300 m multimode fibre optic
- 10km single mode fibre
- Up to 30Gbps

\section*{InfiniBand Switch Fabric}


\section*{InfiniBand Operation}
- 16 logical channels (virtual lanes) per physical link
- One lane for management, rest for data
- Data in stream of packets
- Virtual lane dedicated temporarily to end to end transfer
- Switch maps traffic from incoming to outgoing lane

\section*{InfiniBand Protocol Stack}


\section*{Foreground Reading}
- Check out Universal Serial Bus (USB)
- Compare with other communication standards e.g. Ethernet```

