Person in charge: | (-) |
Others: | (-) |
Credits | Dept. |
---|---|
7.5 (6.0 ECTS) | AC |
Upon finishing this subject, students should be able to:
- Understand how a simple pipelined processor works and implement one at block level with a cache hierarchy (instructions and data) and support for paged virtual memory.
- Understand how a simple pipelined processor works and implement one at block level with precise exceptions and support for external interrupts.
- Understand and implement, at block level, cutting-edge performance-improvement techniques for pipelined processors, including superscalar execution, out-of-order execution and multithreading.
- Understand and implement, at block level, vector processing techniques, applied in particular to supporting multimedia programs.
- Understand how special-purpose processors such as DSPs or 3D graphics accelerator cards work.
Estimated time (hours):
T | P | L | Alt | Ext. L | Stu | A. time |
---|---|---|---|---|---|---|
Theory | Problems | Laboratory | Other activities | External Laboratory | Study | Additional time |
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0 |
Review of the performance law and Amdahl's Law. Register and memory dependences. Review of the pipelining concept. Review of forwarding (short-circuit) paths.
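As a quick refresher on the two laws reviewed in this topic, the following is a minimal sketch. It assumes that the "performance law" refers to the usual execution-time equation (instruction count × CPI × clock period); the function and parameter names are illustrative and not part of the course material.

```python
def cpu_time(instruction_count: float, cpi: float, clock_period_ns: float) -> float:
    """Execution time in ns, assuming time = instructions x CPI x clock period."""
    return instruction_count * cpi * clock_period_ns


def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the execution time is improved."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # The unimproved part keeps its time; the improved part shrinks.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Speeding up half of the execution time by 2x gives about 1.33x overall.
    print(amdahl_speedup(0.5, 2.0))
```

The example shows why the unimproved fraction bounds the achievable gain: even an arbitrarily large `speedup_enhanced` cannot push the result above `1 / (1 - fraction_enhanced)`.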
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
6.0 | 2.0 | 0 | 0 | 0 | 6.0 | 0 | 14.0 |
Review of the 5-stage pipelined processor. Review of the concept of, and reasons for, the memory hierarchy. Integrating the instruction and data caches into the processor. Store pipeline: difficulty of a single-cycle implementation, and solutions. Review of virtual memory. Introduction to the TLB (translation look-aside buffer). Connecting the TLB to the processor. Handling TLB misses. Precise exceptions: introduction and problems. Propagating the exception vector. Communicating with the operating system.
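The role of the TLB in address translation can be illustrated with a small software model. This is only a sketch under stated assumptions: the 4 KiB page size, the LRU replacement policy, the dictionary-based page table and all names (`TLB`, `PageFault`, `translate`) are hypothetical and do not describe the design used in the course.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when the page table has no mapping for a virtual page."""


class TLB:
    def __init__(self, num_entries: int = 4):
        self.num_entries = num_entries
        # virtual page number -> physical frame number, kept in LRU order
        self.entries = OrderedDict()

    def translate(self, vaddr: int, page_table: dict):
        """Return (physical address, hit flag) for a virtual address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        hit = vpn in self.entries
        if hit:
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            # TLB miss: consult the page table and refill the TLB.
            if vpn not in page_table:
                raise PageFault(f"no mapping for virtual page {vpn}")
            if len(self.entries) >= self.num_entries:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset, hit
```

A miss here is resolved immediately by a dictionary lookup; in a real pipeline the refill takes several cycles and the faulting access must be stalled or replayed, which is the problem this topic addresses.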
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
4.0 | 2.0 | 0 | 0 | 0 | 4.0 | 0 | 10.0 |
Floating-point pipeline. Out-of-order instruction completion. Problem: write-back conflicts. Resolving WAW hazards. Problem: precise exceptions; buffer and future file solutions (Smith & Pleszkun, 1985). Problems: store pipeline, load-store dependences, and bypasses.
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
2.0 | 0 | 0 | 0 | 0 | 2.0 | 0 | 4.0 |
Concept. Integrating the predictor into the pipeline. Techniques for implementing predictors: 2-bit predictor with history register.
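A 2-bit predictor with a history register can be modelled in a few lines. The sketch below is one possible organisation, not necessarily the one taught: it assumes a single table of 2-bit saturating counters indexed by the branch address XORed with a global history register. The class name, table sizes and indexing scheme are assumptions.

```python
class TwoBitPredictor:
    def __init__(self, index_bits: int = 4, history_bits: int = 2):
        self.index_bits = index_bits
        self.history_bits = history_bits
        self.history = 0  # global history register, newest outcome in bit 0
        # 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Assumed indexing: word-aligned PC bits XORed with the history.
        return ((pc >> 2) ^ self.history) & ((1 << self.index_bits) - 1)

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

With `history_bits=0` the model degenerates to a plain 2-bit counter per entry; with history enabled it can learn a strictly alternating taken/not-taken branch, which a plain 2-bit counter mispredicts at least half of the time.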
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
4.0 | 2.0 | 0 | 0 | 0 | 4.0 | 0 | 10.0 |
The dynamic scheduling problem. Centralised renaming of the kind used in the MIPS R10000 architecture. New structures: instruction window, rename table, reorder buffer (ROB). Wake-up and pick implementation. Register renaming algorithm of the R10000. Exceptions: rename table recovery. Integration with the predictor: copies of the rename table. Free list implementation. Implementing selective instruction kill.
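How the rename table, free list and ROB interact can be sketched as follows. This is a simplified illustration in the spirit of centralised renaming, not a description of the actual R10000 hardware: register counts, the class name `Renamer` and the recovery method (undoing the youngest ROB entry) are assumptions made for the example.

```python
from collections import deque


class Renamer:
    def __init__(self, num_logical: int = 4, num_physical: int = 8):
        # Rename table: logical register -> current physical register.
        self.rename_table = list(range(num_logical))
        # Free list: physical registers not mapped to any logical register.
        self.free_list = deque(range(num_logical, num_physical))
        # ROB entries in program order: (logical dest, new phys, old phys).
        self.rob = deque()

    def rename(self, dest: int, srcs: list):
        """Rename one instruction; return (new dest phys, source phys list)."""
        phys_srcs = [self.rename_table[s] for s in srcs]  # read before write
        if not self.free_list:
            raise RuntimeError("stall: no free physical register")
        new = self.free_list.popleft()
        old = self.rename_table[dest]
        self.rename_table[dest] = new
        self.rob.append((dest, new, old))
        return new, phys_srcs

    def retire(self) -> None:
        """Retire the oldest instruction; its previous mapping becomes free."""
        _dest, _new, old = self.rob.popleft()
        self.free_list.append(old)

    def squash_youngest(self) -> None:
        """Undo the youngest instruction (exception or misprediction)."""
        dest, new, old = self.rob.pop()
        self.rename_table[dest] = old
        self.free_list.appendleft(new)
```

Note that the old physical register is released only at retirement, since earlier in-flight instructions may still read it. Walking the ROB backwards is one way to recover the rename table; keeping copies of the table per predicted branch, as listed in this topic, trades storage for faster recovery.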
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
2.0 | 1.0 | 0 | 0 | 0 | 2.0 | 0 | 5.0 |
Objective: IPC > 1. Fetch problem: using the cache line to obtain two instructions per clock cycle. Decode changes: cascaded renaming. Window changes: new picker. Modifications to instruction retirement.
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
4.0 | 1.0 | 0 | 0 | 0 | 4.0 | 0 | 9.0 |
The thread concept. Relationship to the OS. Goals of multithreading: increasing throughput and improving efficiency. Multithreading techniques: fine-grain, switch-on-event, simultaneous multithreading. Incorporating multithreading into an out-of-order pipeline. Example: Pentium 4. Multi-core processors.
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
6.0 | 2.0 | 0 | 0 | 0 | 6.0 | 0 | 14.0 |
Parallel execution with multiple functional units: problems and solutions. The concept of "Single Instruction, Multiple Data" (SIMD). Multimedia variants. Specific vector instructions: strided memory access, gather/scatter, masked execution. Implementation of vector register files. Access to memory banks.
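The semantics of the vector memory and mask operations listed above can be sketched in scalar code. The functions below are hypothetical reference models over a Python list standing in for memory; names, argument order and the element-indexed addressing are assumptions and do not correspond to any particular instruction set.

```python
def strided_load(memory: list, base: int, stride: int, vl: int) -> list:
    """Load vl elements starting at base, stepping by stride elements."""
    return [memory[base + i * stride] for i in range(vl)]


def gather(memory: list, base: int, indices: list) -> list:
    """Load one element per index: memory[base + index]."""
    return [memory[base + i] for i in indices]


def scatter(memory: list, base: int, indices: list, values: list) -> None:
    """Store values[k] to memory[base + indices[k]]."""
    for i, v in zip(indices, values):
        memory[base + i] = v


def masked_add(a: list, b: list, mask: list, old: list) -> list:
    """Element-wise add where the mask is set; keep the old value elsewhere."""
    return [x + y if m else o for x, y, m, o in zip(a, b, mask, old)]


if __name__ == "__main__":
    mem = list(range(0, 100, 10))        # [0, 10, 20, ..., 90]
    print(strided_load(mem, 1, 3, 3))    # elements 1, 4 and 7
    print(gather(mem, 2, [0, 5, 1]))     # elements 2, 7 and 3
```

In hardware each of these loops is a single instruction whose elements are spread across lanes and memory banks, which is why stride and index patterns matter for bank conflicts.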
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
5.0 | 2.0 | 0 | 0 | 0 | 5.0 | 0 | 12.0 |
Graphics pipeline. Processing stages: geometry, transform, rasterise, fragment, raster ops. Vertex shaders, pixel shaders, memory controller. Fixed-function pipeline stages. Multithreading organisation. Synchronisation. Texture cache.
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
0 | 0 | 14.0 | 0 | 30.0 | 0 | 0 | 44.0 |
The lab work will be developed in LogicWorks. The processor will fetch and retire two instructions per cycle. Students will implement a two-way processor that performs just a single data-cache access per clock cycle. The system will include main memory and a DMA disk controller that interrupts the processor. Optionally, some students may choose a version with 2 threads.
|
|
T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|
4.0 | 0 | 0 | 0 | 0 | 4.0 | 0 | 8.0 |
This theme draws together all the concepts dealt with to date and is presented in the context of a pipeline and a real processor (Pentium 4 or Pentium M), and rounds off those aspects covered in a simplified form in previous themes.
|
| T | P | L | Alt | Ext. L | Stu | A. time | Total |
---|---|---|---|---|---|---|---|---|
Total per kind | 41.0 | 12.0 | 14.0 | 0 | 30.0 | 37.0 | 0 | 134.0 |
Additional evaluation hours | | | | | | | | 0 |
Total student work hours | | | | | | | | 134.0 |
Theory and Problems: Common problems (lecture)
Laboratory:
Students will collectively carry out a practical assignment: implementing a simple 5-, 6-, or 7-stage multi-cycle, superscalar processor with branch prediction. The processor must include a cache hierarchy (instructions and data), and must share data with a disk controller through a bus and main memory.
It should be stressed that all students will work on the same processor. In other words, students will work in groups, each responsible for particular stages, but must assemble all of the processor components to produce a working whole. This form of learning is both effective and motivating: students tackle a real project in which the various groups have to communicate with one another and agree on an implementation that is straightforward for all concerned. Furthermore, the teacher will prevent over-simplified implementations from being adopted, so that the whole process is as realistic as possible.
The course emphasises putting theory into practice. The first four weeks concentrate on theory classes and include some tasks, so that students acquire the knowledge needed to make an early start on the practical work.
Lab sessions take place in either the PC classroom or in a conventional classroom. In the former case, students implement specific aspects of the practical work and debug circuits. These sessions are particularly useful when it comes to drawing together the various processor stages.
In the latter case, the sessions cover processor design: the various groups discuss the precise processor implementation under the teacher's supervision.
NP1 = Practice Grade - First Submission
NP2 = Practice Grade - Second Submission
NE = Exam grade
Final Grade = 0.4 NE + 0.2 NP1 + 0.4 NP2
The first piece of practical work will be submitted before the course halfway point to ensure that the basic processor components work properly and to check that students grasp the theoretical concepts and are able to put them into practice.
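As a worked example of the grade weighting (Final Grade = 0.4 NE + 0.2 NP1 + 0.4 NP2), the short sketch below computes the final grade from the three components. The function name is illustrative, and the 0 to 10 scale used in the example values is an assumption, since the grading scale is not stated here.

```python
def final_grade(ne: float, np1: float, np2: float) -> float:
    """Weighted final grade: 40% exam, 20% first submission, 40% second."""
    return 0.4 * ne + 0.2 * np1 + 0.4 * np2


if __name__ == "__main__":
    # Assumed 0-10 scale: exam 7, first submission 6, second submission 8.
    print(round(final_grade(7.0, 6.0, 8.0), 2))
```

The practical work accounts for 60% of the grade in total, with the second submission weighted twice as heavily as the first.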