High Performance Computing

Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it does assume previous capacities
Department
AC
Web
https://docencia.ac.upc.edu/gia-cap
Mail
josep.ll.berral@upc.edu, jordi.torres@upc.edu
The aim of this subject is to understand how high-performance computing systems operate and what they are used for, in order to deploy artificial intelligence applications that require large amounts of resources, optimize processes, apply accelerators, and leverage and orchestrate cloud resources. The course covers virtualization and containerization concepts, as well as distributed file systems and distributed computing systems. It also addresses the scalability of machine learning and artificial intelligence algorithms using state-of-the-art technologies, for both middleware and accelerators. We will work with the C, Python and Scala languages.

Teachers

Person in charge

Others

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
6

Competences

Transversals

  • CT2 - Sustainability and Social Commitment. To know and understand the complexity of the economic and social phenomena typical of the welfare society; to be able to relate well-being to globalization and sustainability; to acquire the skills to use technique, technology, economics and sustainability in a balanced and compatible way.
  • CT3 - Efficient oral and written communication. To communicate orally and in writing with other people about the results of learning, thinking and decision making; to participate in debates on topics within one's own specialty.
  • CT6 [Assessable] - Autonomous Learning. To detect deficiencies in one's own knowledge and overcome them through critical reflection and by choosing the best course of action to extend that knowledge.
Basic

  • CB2 - That the students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.
Specific

  • CE05 - To be able to analyze and evaluate the structure and architecture of computers, as well as the basic components that make them up.
  • CE06 - To be able to identify the features, functionalities and structure of Operating Systems and to design and implement applications based on their services.
  • CE07 - To interpret the characteristics, functionalities and structure of Distributed Systems, Computer Networks and the Internet and design and implement applications based on them.
  • CE08 - To detect the characteristics, functionalities and components of data managers, which allow the adequate use of them in information flows, and the design, analysis and implementation of applications based on them.
  • CE11 - To identify and apply the fundamental principles and basic techniques of parallel, concurrent, distributed and real-time programming.
  • CE19 - To use current computer systems, including high-performance systems, for the processing of large volumes of data, based on knowledge of their structure, operation and particularities.
Generic

  • CG1 - To ideate, draft, organize, plan and develop projects in the field of artificial intelligence.
  • CG3 - To define, evaluate and select hardware and software platforms for the development and execution of computer systems, services and applications in the field of artificial intelligence.
  • CG9 - To face new challenges with a broad vision of the possibilities of a professional career in the field of artificial intelligence. To carry out the activity applying quality criteria and continuous improvement, and to act rigorously in professional development. To adapt to organizational or technological changes. To work in situations with a lack of information and/or with time and/or resource restrictions.
Objectives

    1. Understand the use of high-performance computing and middleware for artificial intelligence
      Related competences: CG1, CG9, CT3, CT6, CE19
    2. Know the basic components of hardware and middleware in high-performance platforms
      Related competences: CG9, CT2, CE05, CE08, CE19
    3. Learn about the use of accelerators (e.g. GPUs) and the tools for their exploitation
      Related competences: CG3, CT6, CE08, CE19
    4. Learn about virtualization concepts and the usage of virtual machines
      Related competences: CG3, CT2, CB2, CE05, CE06
    5. Become familiar with the basic tools for exploiting distributed systems, with programming models oriented to distribution
      Related competences: CG3, CT6, CE07, CE08, CE11
    6. Know the basic concepts of distributed systems, interconnection and connection among systems
      Related competences: CG3, CT3, CT6, CE07, CE11
    7. Learn about file systems: basic usage of file systems, disk redundancy, logical volumes and fault tolerance
      Related competences: CG3, CT6, CB2, CE06, CE07, CE08
    8. Discover the challenges of high-performance computing for artificial intelligence
      Related competences: CG1, CG9, CT2, CT3

    Contents

    1. Introduction to High-Performance Computing Systems
      Introduction to large-scale computing systems, specialized systems and the Cloud.
    2. Accelerators and high-performance devices
      Incorporation of accelerators (e.g. GPUs) and the tools for their exploitation. Matrix operations accelerated through specialized devices.
    3. Middleware and high-performance platforms for artificial intelligence
      Basic components of hardware and middleware in high-performance platforms. Use of state-of-the-art and commodity tools (e.g. TensorFlow, PyTorch) combined with specialized devices.
    4. Parallelism applied to artificial intelligence
      Parallelism in high-performance computing through the most common middleware for artificial intelligence, deep learning and transformers, and their associated techniques.
    5. Introduction to distributed programming models for Big Data
      Introduction to Map-Reduce programming models over distributed data systems, and to the Scala language.
    6. Virtualization concepts and containerization
      Introduction to the use of virtual machines and containerization for isolated execution and customized environments, as well as for load migration and resource management in shared systems.
    7. Local and distributed file systems, redundancy and availability
      Basic usage of file systems, distributed file systems, logical volumes, redundancy, fault tolerance and high availability.
    8. Distributed systems for computing
      Basic concepts on distributed systems (e.g. Hadoop and Spark), interconnection and communications, paradigms of distributed systems and protocols, and fault tolerance. Basic tools for exploiting concurrency on distributed systems, and their programming models oriented towards artificial intelligence and Big Data processing.
    9. Challenges for high-performance computing for artificial intelligence
      Present and future challenges of high-performance computing applied to artificial intelligence. Current tools and environments in industry, the Cloud, academia and society.
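
To make the Map-Reduce programming model named in content item 5 more concrete, the following is a minimal single-machine sketch in plain Python. It is only an illustration under stated assumptions: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, word counting is chosen as an example task, and nothing here uses Hadoop, Spark or Scala, which is where the course applies the model on distributed data.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single count."""
    return key, reduce(add, values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially (hypothetical driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big compute", "data moves"]))
```

In a distributed framework the map and reduce steps would run in parallel on different nodes and the shuffle would move data across the network; here all three run in one process purely to show the data flow.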

    Activities

    Virtualization and containerization concepts

    Introduction to the use of virtual machines and containerization for isolated and customized execution environments, as well as for load migration and resource management in shared systems.
    Objectives: 4
    Contents:
    Theory
    4h
    Problems
    0h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    12h

    Service and application architecture

    Introduction to Client-Server models, execution management systems, and launch of applications in cluster and Cloud systems.

    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    6h

    Supercomputers and High-performance Computing

    Supercomputers and high-performance computing systems, tools and environments. Familiarization with HPC facilities, and hands-on use of HPC systems and the C language.
    Objectives: 2
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    9h

    Accelerators, supercomputers and high-performance devices

    Accelerators and high-performance devices. GPUs and other accelerator devices. Matrix multiplication using GPUs. Introduction to Python on a supercomputer.
    Objectives: 3
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    9h

    Computing in distributed systems

    Basic concepts of distributed systems (e.g. Hadoop and Spark), interconnection and communications, distributed systems paradigms and protocols, and fault tolerance. Basic tools for the exploitation of concurrency in distributed systems, and their programming models oriented to artificial intelligence and massive data processing.
    Objectives: 6, 5
    Contents:
    Theory
    4h
    Problems
    0h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    12h

    Current tools and environments in industry, the cloud, academia and society.

    Current tools and environments in industry, the cloud, academia and society.
    Objectives: 8
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    6h

    Local and distributed file systems, redundancy and availability

    Basic uses of file systems, as well as distributed data storage systems, logical volumes, redundancy, fault tolerance, and high availability.
    Objectives: 7
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    2h
    Guided learning
    0h
    Autonomous learning
    6h

    Parallelism applied to artificial intelligence

    Parallelism applied to artificial intelligence. Scalability, advanced deep learning techniques, transformers and the future of Deep Learning.
    Objectives: 2, 1
    Contents:
    Theory
    4h
    Problems
    0h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    12h

    Middleware and high-performance platforms for artificial intelligence

    Middleware and high-performance platforms for artificial intelligence. TensorFlow/PyTorch, Deep Learning, LLMs and HPC.
    Objectives: 1
    Contents:
    Theory
    2h
    Problems
    0h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    6h

    Present and future challenges of high-performance computing applied to artificial intelligence. Seminars on HPC

    Seminars by experts in the field. Presentation of work.
    Objectives: 5, 1, 8
    Contents:
    Theory
    6h
    Problems
    0h
    Laboratory
    4h
    Guided learning
    0h
    Autonomous learning
    12h

    Teaching methodology

    The course is based on face-to-face theory and laboratory sessions. The theory sessions combine lectures and seminars by experts in the field, follow the program set out in this study plan, and use the course's own material. During the sessions, dialogue and discussion are encouraged in order to anticipate and consolidate the learning outcomes of the subject.

    The laboratory sessions cover the technologies presented in class and follow the same topics as the syllabus. They are hands-on sessions that use different computational resources of the Department of Computer Architecture and the Barcelona Supercomputing Center.

    Evaluation methodology

    Evaluation is based mainly on continuous work carried out during the course sessions. Attendance and participation are mandatory and will therefore also be assessed, by taking attendance and requiring participation in the interactive sessions. Finally, there will be a research project running throughout the course, which students will have to present to their peers.

    The distribution of weights for each activity is as follows:
    - AS: attendance in class, theory and laboratories (10%), which will be used to evaluate transversal competence CT3.
    - PR: class participation (15%)
    - EX: laboratory and class deliverables (55%), as an arithmetic average of the different assignments.
    - RE: presentation of a research paper (20%), which will be used to evaluate transversal skills CT2, CT3 and CT6.

    The Final Grade (NF) of the subject is obtained from
    NF = 0.10 x AS + 0.15 x PR + 0.55 x EX + 0.20 x RE
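
As a worked illustration of the formula above, the following minimal Python sketch computes NF from the four components. The function and parameter names are hypothetical, marks are assumed to be on a 0 to 10 scale, and EX is taken as the arithmetic average of the individual deliverables, as stated in the weight list. The weights are kept as integer percentages to limit floating-point rounding.

```python
from statistics import mean

# Weights from the syllabus, expressed as integer percentages.
WEIGHTS = {"AS": 10, "PR": 15, "EX": 55, "RE": 20}


def final_grade(attendance, participation, deliverables, research):
    """Return NF, assuming every mark is on a 0-10 scale.

    `deliverables` is a non-empty list of assignment marks; EX is their
    arithmetic average.
    """
    ex = mean(deliverables)
    weighted_sum = (
        WEIGHTS["AS"] * attendance
        + WEIGHTS["PR"] * participation
        + WEIGHTS["EX"] * ex
        + WEIGHTS["RE"] * research
    )
    return weighted_sum / 100


if __name__ == "__main__":
    # Illustrative marks only, not real data.
    print(final_grade(attendance=8, participation=7, deliverables=[5, 6, 7], research=9))
```

Because EX carries 55% of the weight, the laboratory and class deliverables dominate the result.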

    Re-evaluation
    a) Re-evaluation is only available to students who submitted all EX and RE exercises and failed NF. (That is, students who want to improve a passing mark, or who are graded NP, are excluded.)
    b) The maximum mark in re-evaluation is 7.

    Bibliography

    Basic

    Complementary

    • BSC documentation about MareNostrum 5 - Barcelona Supercomputing Center

    Web links

    Previous capacities

    Students are expected to have taken the subjects Computer Fundamentals, as well as Parallelism and Distributed Systems.