High Performance Computing

Credits

Types

Compulsory

Requirements

This subject has no formal requirements, but it does assume previous capacities.

Department

Web

https://docencia.ac.upc.edu/gia-cap

Mail

josep.ll.berral@upc.edu, jordi.torres@upc.edu

The aim of this subject is to understand the operation and applications of high-performance computing systems, in order to deploy artificial intelligence applications that require large amounts of resources, optimize processes, apply accelerators, and leverage and orchestrate cloud resources. The course covers virtualization and containerization concepts, as well as distributed file systems and distributed computing systems. It also addresses scalability in machine learning and artificial intelligence algorithms, using state-of-the-art technologies for both middleware and accelerators. We will work with the C, Python and Scala languages.

Teachers

Person in charge

Josep Lluís Berral García ( berral@ac.upc.edu )

Others

Jordi Torres Viñals ( torres@ac.upc.edu )

Weekly hours

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Competences

Transversal Competences

Transversals

CT2 - Sustainability and Social Commitment. To know and understand the complexity of the economic and social phenomena typical of the welfare society; to be able to relate well-being to globalization and sustainability; to acquire the skills to use technique, technology, economics and sustainability in a balanced and compatible way.

CT3 - Efficient oral and written communication. To communicate orally and in writing with other people about the results of learning, thinking and decision making; to participate in debates on topics within the specialty.

CT6 [Assessable] - Autonomous Learning. Detect deficiencies in one's own knowledge and overcome them through critical reflection and the choice of the best action to extend this knowledge.

Basic

CB2 - That students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.

Technical Competences

Specific

CE05 - To be able to analyze and evaluate the structure and architecture of computers, as well as the basic components that make them up.

CE06 - To be able to identify the features, functionalities and structure of Operating Systems and to design and implement applications based on their services.

CE07 - To interpret the characteristics, functionalities and structure of Distributed Systems, Computer Networks and the Internet and design and implement applications based on them.

CE08 - To detect the characteristics, functionalities and components of data managers, which allow the adequate use of them in information flows, and the design, analysis and implementation of applications based on them.

CE11 - To identify and apply the fundamental principles and basic techniques of parallel, concurrent, distributed and real-time programming.

CE19 - To use current computer systems, including high-performance systems, for the processing of large volumes of data from the knowledge of its structure, operation and particularities.

Generic Technical Competences

Generic

CG1 - To ideate, draft, organize, plan and develop projects in the field of artificial intelligence.

CG3 - To define, evaluate and select hardware and software platforms for the development and execution of computer systems, services and applications in the field of artificial intelligence.

CG9 - To face new challenges with a broad vision of the possibilities of a professional career in the field of Artificial Intelligence. Develop the activity applying quality criteria and continuous improvement, and act rigorously in professional development. Adapt to organizational or technological changes. Work in situations of lack of information and / or with time and / or resource restrictions.

Objectives

Understand the use of high-performance computing and middleware for artificial intelligence
Related competences: CG1, CG9, CT3, CT6, CE19
Know the basic components of hardware and middleware in high-performance platforms
Related competences: CG9, CT2, CE05, CE08, CE19
Learn about the use of accelerators (e.g. GPUs) and the tools for their exploitation
Related competences: CG3, CT6, CE08, CE19
Learn about virtualization concepts and the usage of virtual machines
Related competences: CG3, CT2, CB2, CE05, CE06
Become familiar with the basic tools for exploiting distributed systems, with programming models oriented to distribution
Related competences: CG3, CT6, CE07, CE08, CE11
Know the basic concepts of distributed systems, interconnection and communication among systems
Related competences: CG3, CT3, CT6, CE07, CE11
Learn about file systems: basic usage of file systems, disk redundancy, logical volumes and fault tolerance
Related competences: CG3, CT6, CB2, CE06, CE07, CE08
Discover the challenges of high-performance computing for artificial intelligence
Related competences: CG1, CG9, CT2, CT3

Contents

Introduction to High-Performance Computing Systems
Introduction to large-scale computing systems, specialized systems and the Cloud.
Accelerators and high-performance devices
Incorporation of accelerators (e.g. GPUs) and the tools for their exploitation. Matrix operations accelerated through specialized devices.
Middleware and high-performance platforms for artificial intelligence
Basic components of hardware and middleware in high-performance platforms. Use of state-of-the-art and commodity tools (e.g. TensorFlow, PyTorch, etc.) combined with specialized devices.
Parallelism applied to artificial intelligence
Parallelism in high-performance computing through the most common middleware for artificial intelligence, deep learning and transformers, and their associated techniques.
Introduction to distributed programming models for Big Data
Introduction to MapReduce programming models over distributed data systems, and to the Scala language.
Virtualization concepts and containerization
Introduction to the use of virtual machines and containerization for isolated execution and customized environments, as well as load migration and resource management in shared systems.
Local and distributed file systems, redundancy and availability
Basic usage of file systems, distributed file systems, logical volumes, redundancy, fault tolerance and high availability.
Distributed systems for computing
Basic concepts of distributed systems (e.g. Hadoop and Spark), interconnection and communications, distributed systems paradigms and protocols, and fault tolerance. Basic tools for exploiting concurrency in distributed systems, and their programming models oriented towards artificial intelligence and Big Data processing.
Challenges for high-performance computing for artificial intelligence
Present and future challenges of high-performance computing applied to artificial intelligence. Current tools and environments in the industry, the Cloud, academia and society.

Activities

Activity Evaluation act

Virtualization and containerization concepts

Introduction to the use of virtual machines and containerization for isolated execution and customized environments, as well as load migration and resource management in shared systems.
Objectives: 4
Contents:

6 . Virtualization concepts and containerization

Theory

Problems

Laboratory

Guided learning

Autonomous learning

12h

Service and application architecture

Introduction to Client-Server models, execution management systems, and launch of applications in cluster and Cloud systems.

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Supercomputers and High-performance Computing

Supercomputers and High Performance Computing systems, tools and environments. Familiarization with HPC facilities, hands-on use of HPC systems and C language.
Objectives: 2
Contents:

1 . Introduction to High-Performance Computing Systems

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Accelerators, supercomputers and high-performance devices

Accelerators and high-performance devices. GPUs and accelerator devices. Matrix multiplication using GPUs. Introduction to Python on a supercomputer.
Objectives: 3
Contents:

2 . Accelerators and high-performance devices

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Computing in distributed systems

Basic concepts of distributed systems (e.g. Hadoop and Spark), interconnection and communications, distributed systems paradigms and protocols, and fault tolerance. Basic tools for the exploitation of concurrency in distributed systems, and their programming models oriented to artificial intelligence and massive data processing.
Objectives: 6, 5
Contents:

8 . Distributed systems for computing

Theory

Problems

Laboratory

Guided learning

Autonomous learning

12h

Current tools and environments in industry, the cloud, academia and society.

Current tools and environments in industry, the cloud, academia and society.
Objectives: 8
Contents:

9 . Challenges for high-performance computing for artificial intelligence

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Local and distributed file systems, redundancy and availability

Basic uses of file systems, as well as distributed data storage systems, logical volumes, redundancy, fault tolerance, and high availability.
Objectives: 7
Contents:

7 . Local and distributed file systems, redundancy and availability

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Parallelism applied to artificial intelligence

Parallelism applied to artificial intelligence. Scalability, advanced deep learning techniques, transformers and the future of Deep Learning.
Objectives: 2, 1
Contents:

4 . Parallelism applied to artificial intelligence
9 . Challenges for high-performance computing for artificial intelligence

Theory

Problems

Laboratory

Guided learning

Autonomous learning

12h

Middleware and high-performance platforms for artificial intelligence

Middleware and high-performance platforms for artificial intelligence. TensorFlow/PyTorch, Deep Learning, LLMs and HPC.
Objectives: 1
Contents:

3 . Middleware and high-performance platforms for artificial intelligence

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Present and future challenges of high-performance computing applied to artificial intelligence. Seminars on HPC

Seminars by experts in the field. Presentation of work.
Objectives: 5, 1, 8
Contents:

9 . Challenges for high-performance computing for artificial intelligence

Theory

Problems

Laboratory

Guided learning

Autonomous learning

12h

Teaching methodology

The course is based on face-to-face theory and laboratory sessions. The theory sessions combine lectures and seminars by experts in the field, following the program set out in this study plan and using the lecturers' own material. During the sessions, dialogue and discussion are encouraged in order to anticipate and consolidate the learning outcomes of the subject.

The laboratory sessions cover the technologies presented in class and follow the same topics as the syllabus. These are hands-on sessions that use computational resources at the Department of Computer Architecture and the Barcelona Supercomputing Center.

Evaluation methodology

Evaluation is based mainly on continuous work completed during the course sessions. Attendance and participation are mandatory and will therefore also be assessed, by taking attendance and requiring participation in the interactive sessions. Finally, students will carry out a research project throughout the course and present it to their peers.

The distribution of weights for each activity is as follows:
- AS: attendance at theory and laboratory classes (10%), which will be used to evaluate transversal competence CT3.
- PR: class participation (15%).
- EX: laboratory and class deliverables (55%), computed as the arithmetic average of the different assignments.
- RE: presentation of a research paper (20%), which will be used to evaluate transversal competences CT2, CT3 and CT6.

The Final Grade (NF) of the subject is computed as:
NF = 0.10 x AS + 0.15 x PR + 0.55 x EX + 0.20 x RE
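
As an illustration of the formula above, the following is a minimal Python sketch of the NF calculation. The function name `final_grade`, the 0-10 marking scale and the sample marks are assumptions made for this example only; they are not part of the official evaluation procedure.

```python
from statistics import mean

# Weights taken from the syllabus: AS 10%, PR 15%, EX 55%, RE 20%.
WEIGHTS = {"AS": 0.10, "PR": 0.15, "EX": 0.55, "RE": 0.20}


def final_grade(attendance, participation, deliverables, research):
    """Return NF. A 0-10 scale for every mark is assumed here."""
    # EX is the arithmetic average of the individual assignments.
    ex = mean(deliverables)
    marks = {"AS": attendance, "PR": participation, "EX": ex, "RE": research}
    return sum(WEIGHTS[key] * marks[key] for key in WEIGHTS)


# Hypothetical student: these marks are invented for illustration.
nf = final_grade(attendance=10.0, participation=8.0,
                 deliverables=[6.0, 7.0, 8.0], research=9.0)
print(round(nf, 2))  # 0.10*10 + 0.15*8 + 0.55*7 + 0.20*9 = 7.85
```

Because EX carries more than half of the weight, the average of the deliverables dominates the result in this sketch.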

Re-evaluation
a) Re-evaluation is available only to students who submitted all EX and RE work and failed the NF. (That is, students who want to improve a passing mark, or who are graded NP, are excluded.)
b) Maximum mark in re-evaluation is 7.

Bibliography

Basic

First contact with Deep learning : practical introduction with Keras - Torres, Jordi, Kindle Direct Publishing, [2018]. ISBN: 9781983211553
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004153269706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Dive into Deep Learning - Zhang, Aston and Lipton, Zachary C. and Li, Mu and Smola, Alexander J., The authors, 2020.
High performance computing : modern systems and practices - Sterling, Thomas; Anderson, Matthew; Brodowicz, Maciej, Morgan Kaufmann, [2018]. ISBN: 9780124201583
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004173809706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Spark: the definitive guide: big data processing made simple - Chambers, B.; Zaharia, M, O'Reilly, 2018. ISBN: 9781491912300
Hadoop : the definitive guide - White, Tom, O'Reilly, 2015. ISBN: 9781491901632
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004054859706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
La inteligencia artificial explicada a los humanos - Torres Viñals, Jordi, Plataforma Editorial, 2023. ISBN: 9788419655561
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991005151879806711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Supercomputing for Artificial Intelligence: Foundations, Architectures and Scaling Deep Learning - Torres Viñals, Jordi, Watch This Space, 2025. ISBN: 9798319328359
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991005476510706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Complementary

BSC documentation about MareNostrum 5 - Barcelona Supercomputing Center

Web links

MareNostrum 5 documentation: https://www.bsc.es/supportkc/docs/MareNostrum5/intro/

Previous capacities

Having studied the subjects of Computer Fundamentals, as well as Parallelism and Distributed Systems.