Masters in Computer Science and Engineering

FACULTAT D'INFORMÀTICA
DE BARCELONA

Cutting-Edge Computer Architectures (ACA)

Credits	Dept.
7.5 (6.0 ECTS)	AC

Instructors

Person in charge:	(-)
Others:	(-)

General goals

Upon finishing this subject, students should be able to:
-  Understand how a simple pipelined processor works and implement one at block level with a cache hierarchy (instructions and data) and support for paged virtual memory.
-  Understand how a simple pipelined processor works and implement one at block level with precise exceptions and support for external interrupts.
-  Understand and implement, at block level, cutting-edge performance-improvement techniques for pipelined processors, including superscalar execution, out-of-order execution and multithreading.
-  Understand and implement, at block level, vector processing techniques, applied in particular to supporting multimedia programs.
-  Understand how special-purpose processors such as DSPs or 3D graphics accelerator cards work.

Specific goals

Knowledge

Memory hierarchy, store instruction ordering, write-to-cache techniques, write buffer, merge buffer, cache flush.

Store-load dependencies in a pipeline with two write stages.

TLB (translation look-aside buffer) concept, operating system support for paged virtual memory, and TLB miss handling.

Precise (and imprecise) exceptions. Interrupts. The impact of exceptions and interrupts on the pipeline. Precise state recovery. Ordering exceptions. Implementation of vectored interrupts.

Branch prediction: concept. Two-bit predictors with history registers. Integration of the predictor in the pipeline. Predictor training. Misprediction recovery.

Out-of-order execution ("complete out of order" and "initiate and complete out of order" variants). Re-order Buffer and Future File techniques (Smith & Pleszkun, 1985). Recovering precise state in cases of exception and interrupt.

Register renaming: concept. Integrating renaming into the pipeline. Reorder Buffer integration. Recovering the rename table in cases of exception.

Selection (pick & wake-up) of instructions within an instruction window. Tracking dependences among registers. Tracking dependences among memory instructions.

Superscalar execution. Fetching multiple instructions within a clock cycle. Prerequisites for simultaneously executing two instructions.

Multithreaded execution. Types of multithreading: fine-grain, switch-on-event, simultaneous multithreading. Chip multiprocessors. Supporting two threads in the renaming logic of an out-of-order pipeline.

Vector execution. Advantages and drawbacks of vector execution. Specific memory access instructions (strides and gather/scatter). Datapath partitioning in lanes. Memory organisation for vector access.

The concept behind special-purpose processors: embedded, DSP, and 3D graphics processors. Example of the workings of a 3D graphics processor.

Abilities

Mastery of LogicWorks 4 and LogicWorks 5 simulators. Ability to create a model of the processors presented and simulate their operation.

Use Verilog as a hardware modelling language in order to model the processors presented in class.

Competences

Ability to solve problems through the application of scientific and engineering methods.

Ability to create and use models of reality.

Ability to design and carry out experiments and analyse the results.

Knowing how to apply the solution cycle to common scientific and engineering problems: specification, coming up with ideas and alternatives, designing solution strategies, carrying out the strategy, validation, interpretation and evaluation of results. Ability to analyse the process on completion.

Ability to take decisions when faced with uncertainty or contradictory requirements.

Initiative: Resolution, knowing how to take decisions and how to act in order to solve a problem.

Ability to work effectively in large groups to solve complex problems.

Ability to set up and organise either a uni- or multi-disciplinary group to tackle a complex project.

Leadership.

Ability to understand and constructively criticise presentations given by others.

Assume responsibility for one's own work.

Estimated time (hours):

T	P	L	Alt	Ext. L	Stu	A. time
Theory	Problems	Laboratory	Other activities	External Laboratory	Study	Additional time

1.	Introduction to the architecture of current processors.

T: 2,0	Total: 2,0

Introduction to the architecture concepts currently employed in processors. Summary of the way processors are manufactured and the constraints this places on processor architecture. Course presentation.

2.	Review of computer architecture concepts

T: 2,0	Total: 2,0

Review of the Performance Law and Amdahl's Law. Register and memory dependences. Review of the pipelining concept. Review of forwarding (bypass) paths.
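As a reminder of what Amdahl's Law states, the following minimal sketch computes the overall speedup when only a fraction of the execution time benefits from an improvement. The function name, parameter names and the numbers in the example are illustrative assumptions, not course material.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is improved.

    fraction_enhanced: share of the original execution time that benefits (0..1).
    speedup_enhanced:  speedup obtained on that share alone (> 0).
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    remaining = 1.0 - fraction_enhanced  # part of the time that is not improved
    return 1.0 / (remaining + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Hypothetical case: 80% of the time benefits from a 4x improvement.
    print(round(amdahl_speedup(0.8, 4.0), 2))  # prints 2.5
```

However large the improvement on the enhanced part, the overall speedup in this hypothetical case cannot exceed 1 / (1 - 0.8) = 5, which is why the unimproved fraction dominates design decisions.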

3.	Base processor

T: 6,0	P: 2,0	Stu: 6,0	Total: 14,0

Review of the 5-stage pipelined processor. Review of the concept of, and reasons for, a memory hierarchy. Adding instruction and data caches to the processor. Store pipeline: difficulty of a single-cycle implementation, and solutions. Review of virtual memory. Introduction to the TLB (translation look-aside buffer). Connecting the TLB to the processor. Handling TLB misses. Precise exceptions: introduction and problems. Propagating the exception vector. Communicating with the operating system.

4.	Multicycle Operations

T: 4,0	P: 2,0	Stu: 4,0	Total: 10,0

Floating-point pipeline. Out-of-order instruction completion. Problem: write-back conflicts. Resolving WAW hazards. Problem: precise exceptions; reorder buffer and future file solutions (Smith & Pleszkun, 1985). Problems: store pipeline, load-store dependences, and bypasses.

5.	Branch prediction

T: 2,0	Stu: 2,0	Total: 4,0

Concept. Integration of the predictor in the pipeline. Techniques for implementing predictors: 2-bit predictor with history register.
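To make the idea of a 2-bit predictor with a history register concrete, here is a small behavioural sketch. It assumes one particular organisation (a table of 2-bit saturating counters indexed by the branch address XORed with a global history register); the class name, table sizes and indexing scheme are illustrative assumptions, and the course may present a different variant.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by PC XOR global history.

    Counter values: 0, 1 predict not taken; 2, 3 predict taken.
    The indexing scheme is an assumption made for this sketch.
    """

    def __init__(self, index_bits: int = 4, history_bits: int = 4):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                               # global history register
        self.counters = [1] * (1 << index_bits)        # start weakly not taken

    def _index(self, pc: int) -> int:
        # Drop the two low address bits (word-aligned instructions assumed).
        return ((pc >> 2) ^ self.history) & self.index_mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Train the predictor with the real outcome of the branch."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

With the history register enabled, a branch that alternates between taken and not taken is predicted correctly after a short warm-up, because each history pattern selects its own counter. With `history_bits=0` the sketch degenerates into a plain 2-bit counter table, where a single not-taken outcome does not flip a strongly-taken prediction.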

6.	Processors employing out-of-order execution

T: 4,0	P: 2,0	Stu: 4,0	Total: 10,0

The dynamic scheduling problem. Centralised renaming of the kind used in the MIPS R10000 architecture. New structures: instruction window, rename table, Re-order Buffer (ROB). Pick & wake-up implementation. R10000 register renaming algorithm. Exceptions: rename table recovery. Integration with the predictor: copies of the rename table. Free list implementation. Implementing selective instruction kill.
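The interplay between the rename table and the free list can be sketched in a few lines. This is a simplified behavioural model written for illustration only: the class and method names, the register counts and the decision to free the previous mapping at retirement are assumptions, and the sketch omits the checkpointing and recovery mechanisms listed in this topic.

```python
from collections import deque


class Renamer:
    """Minimal rename table plus free list (illustrative, not the R10000 design)."""

    def __init__(self, num_arch: int = 4, num_phys: int = 8):
        # Initially architectural register i maps to physical register i.
        self.table = list(range(num_arch))
        # The remaining physical registers are free.
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dst: int, srcs: list):
        """Rename one instruction.

        Returns (new physical destination, physical sources, old mapping of dst).
        """
        # Sources are read before the destination mapping is changed.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("stall: no free physical register")
        new_phys = self.free.popleft()
        old_phys = self.table[dst]
        self.table[dst] = new_phys
        return new_phys, phys_srcs, old_phys

    def retire(self, old_phys: int) -> None:
        """At retirement the previous mapping of the destination is recycled."""
        self.free.append(old_phys)
```

For example, renaming two consecutive writes to the same architectural register gives them different physical destinations, which removes the WAW dependence between them; the physical register that held the older value returns to the free list only when the instruction that replaced it retires.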

7.	Processors with superscalar execution

T: 2,0	P: 1,0	Stu: 2,0	Total: 5,0

Goal: IPC > 1. Fetch problem: using the cache line to obtain two instructions per clock cycle. Decode changes: cascaded renaming. Window changes: new picker. Changes to instruction retirement.

8.	Processors with multi-threaded execution

T: 4,0	P: 1,0	Stu: 4,0	Total: 9,0

The threading concept. Relationship to the OS. Goals of multithreading: increasing throughput and improving efficiency. Multithreading techniques: fine-grain, switch-on-event, simultaneous multithreading. Incorporating multithreading in an out-of-order pipeline. Example: Pentium 4. Multi-core processors.

9.	Vector processors

T: 6,0	P: 2,0	Stu: 6,0	Total: 14,0

Parallel execution with multiple functional units: problems and solutions. The concept of "Single Instruction Multiple Data". Multimedia variants. Specific vector instructions: memory access using strides and gather/scatter, masked execution. Implementation of vector register files. Access to memory banks.
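The difference between strided and gather/scatter memory accesses can be illustrated with a short behavioural sketch. Memory is modelled here as a plain Python list of words; the function names and the example values are assumptions made for illustration and do not correspond to any particular instruction set.

```python
def strided_load(memory, base, stride, vector_length):
    """Load vector_length elements starting at base, stride words apart."""
    return [memory[base + i * stride] for i in range(vector_length)]


def gather(memory, base, indices):
    """Load one element per entry of an index vector (indexed load)."""
    return [memory[base + idx] for idx in indices]


def scatter(memory, base, indices, values):
    """Store values to the locations selected by an index vector."""
    for idx, value in zip(indices, values):
        memory[base + idx] = value


if __name__ == "__main__":
    memory = list(range(100, 116))              # 16 words holding 100..115
    print(strided_load(memory, 1, 4, 3))        # prints [101, 105, 109]
    print(gather(memory, 0, [3, 0, 7]))         # prints [103, 100, 107]
    scatter(memory, 8, [0, 2], [-1, -2])
    print(memory[8], memory[10])                # prints -1 -2
```

A strided access needs only a base and a constant step, so its addresses are known in advance and can be spread across memory banks; gather and scatter take an arbitrary index vector, which is what makes them suitable for sparse or indirect data.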

10.	3D

T: 5,0	P: 2,0	Stu: 5,0	Total: 12,0

Graphics pipeline. Processing stages: geometry, transform, rasterise, fragment, raster ops. Vertex shaders, pixel shaders, memory controller. Fixed-function pipeline stages. Multithreading organisation. Synchronisation. Texture cache.

11.	Implementation of a pipelined, multicycle, 2-way superscalar processor with a branch predictor.

L: 14,0	Ext. L: 30,0	Total: 44,0

Lab development will use LogicWorks. The processor will fetch and retire two instructions per cycle. Students will implement a two-way processor with just a single access to the data cache per clock cycle. The system will include main memory and a DMA disk controller that interrupts the processor. Optionally, some students may choose a version with 2 threads.

Additional laboratory activities:
Individual practical work to be carried out by students: implementation of schematics, test vectors, stage assembly, etc.

12.	Modern Processor

T: 4,0	Stu: 4,0	Total: 8,0

This topic draws together all the concepts covered so far, presents them in the context of the pipeline of a real processor (Pentium 4 or Pentium M), and rounds off those aspects covered in simplified form in previous topics.

Total per kind	T	P	L	Alt	Ext. L	Stu	A. time	Total
Total per kind	41,0	12,0	14,0	0	30,0	37,0	0	134,0
Additional evaluation hours								0
Total work hours for student								134,0

Teaching Methodology

Theory and Problems: Common problems (lecture)

Laboratory:

Students will collectively undertake a practical project: implementing a simple 5-, 6-, or 7-stage multicycle superscalar processor incorporating branch prediction. The processor must include a cache hierarchy (instructions and data), and must share data with a disk controller through a bus and main memory.

It should be stressed that all students will work on the same processor. In other words, students will work in groups and by stages but must assemble all of the processor components to produce a working whole. This form of learning is both effective and motivating: students tackle a real project in which the various groups have to communicate with one another and agree on an implementation that is straightforward for all concerned. Furthermore, the teacher will prevent over-simplified implementations from being adopted, so that the whole process is as realistic as possible.

The course stresses putting theory into practice. The first four weeks concentrate on theory classes and include some tasks, so that students acquire the knowledge needed to make an early start on the practical work.

Lab sessions take place in either the PC classroom or in a conventional classroom. In the former case, students implement specific aspects of the practical work and debug circuits. These sessions are particularly useful when it comes to drawing together the various processor stages.

In the latter case, the sessions cover processor design: the various groups discuss the precise processor implementation under the teacher's supervision.

Evaluation Methodology

NP1 = Practice Grade - First Submission
NP2 = Practice Grade - Second Submission
NE = Exam grade

Final Grade = 0.4 NE + 0.2 NP1 + 0.4 NP2

The first piece of practical work will be submitted before the course halfway point to ensure that the basic processor components work properly and to check that students grasp the theoretical concepts and are able to put them into practice.
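As a worked example of the weighting above, the following sketch computes the final grade from the three components. The function name and the sample grades are hypothetical, and the 0 to 10 scale used in the example is an assumption; the syllabus does not state the scale.

```python
def final_grade(ne: float, np1: float, np2: float) -> float:
    """Final Grade = 0.4 NE + 0.2 NP1 + 0.4 NP2 (weights taken from the syllabus)."""
    return 0.4 * ne + 0.2 * np1 + 0.4 * np2


if __name__ == "__main__":
    # Hypothetical grades on an assumed 0-10 scale:
    # exam 6.0, first submission 8.0, second submission 7.0.
    print(round(final_grade(6.0, 8.0, 7.0), 2))  # prints 6.8
```

The two practical submissions together carry 60% of the final grade, with the second submission weighted as heavily as the exam.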

Basic Bibliography

David A. Patterson, John L. Hennessy. Computer Organization and Design, 3rd Edition. Morgan Kaufmann, 1997.

Complementary Bibliography

Mike L. Johnson. Superscalar Microprocessor Design. Prentice Hall, 1991.
Bruce D. Shriver, Bennett Smith. The Anatomy of a High-Performance Microprocessor: A Systems Perspective. IEEE Computer Society Press, 1998.
Capilano Computing Systems, Ltd. LogicWorks 5: Interactive Circuit Design Software. Prentice Hall, 2004.

Web links

(no information available)

Prerequisites

(-)
