Parallelism and Distributed Systems

Credits
6
Types
Compulsory
Requirements
This subject has no formal requirements, but it does assume previous capacities
Department
AC
The course aims to make students aware that the data analysis most companies carry out today requires advanced platforms providing large-scale, high-performance computing based on parallel, available and distributed systems, accessible either within the companies themselves or through the wide range of Cloud Computing service providers.

This course gives students the basics of these parallel and distributed computing systems, and introduces them to their use, in support of the data analysis environments that data scientists and data engineers require. Students will understand the continuous development of these systems, which allows advanced analysis algorithms and the related computing technologies to converge.

Classes are complemented with programming exercises based on typical data science problems, in which students evaluate solutions using parallel and distributed systems that are available to everyone thanks to high-performance computing and analysis services. Students can thus design realistic experiments.

One of the final objectives of the course is to encourage students to be actors in, rather than spectators of, the profound transformation that high-performance analytics is undergoing, and to stimulate their desire to explore this exciting world of technology further, beyond the subject.

Teachers

Person in charge

  • Julita Corbalan Gonzalez

Others

  • Yolanda Becerra Fontal

Weekly hours

Theory
2
Problems
0
Laboratory
2
Guided learning
0
Autonomous learning
4

Competences

Technical Competences

  • CE4 - Use current computer systems, including high performance systems, to process large volumes of data, based on knowledge of their structure, operation and particularities.

Transversal Competences

Transversals

  • CT4 - Teamwork. Be able to work as a member of an interdisciplinary team, either as a regular member or carrying out management tasks, in order to help develop projects with pragmatism and a sense of responsibility, making commitments that take the available resources into account.
  • CT5 - Sound use of information resources. Manage the acquisition, structuring, analysis and visualization of data and information in the field of specialty and critically evaluate the results of such management.
  • CT6 - Autonomous Learning. Detect deficiencies in one's own knowledge and overcome them through critical reflection and the choice of the best action to extend this knowledge.
  • CT7 - Third language. Know a third language, preferably English, with an adequate oral and written level and in line with the needs of graduates.

Basic

  • CB1 - That students have demonstrated that they possess and understand knowledge in an area of study that builds on general secondary education and is usually found at a level that, although supported by advanced textbooks, also includes some aspects involving knowledge from the forefront of their field of study.
  • CB2 - That students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and through problem solving within their area of study.
  • CB5 - That students have developed the learning skills necessary to undertake later studies with a high degree of autonomy.

Generic Technical Competences

Generic

  • CG1 - Design computer systems that integrate data from very diverse sources and in very diverse forms, build mathematical models from them, reason about these models and act accordingly, learning from experience.
  • CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.
  • CG4 - Identify opportunities for innovative data-driven applications in evolving technological environments.

Objectives

  1. Know the fundamentals of current parallel and distributed systems
    Related competences: CG1, CB1
  2. Know and be able to use the basic elements that make up parallel and distributed systems
    Related competences: CT4, CT6, CT7, CB2
  3. Become familiar with the most common programming models for parallel and distributed systems
    Related competences: CE4, CT5, CB5
  4. Know the advanced analytics environments that use distributed and parallel systems, and be able to choose appropriately among them
    Related competences: CE4, CG2, CG4
  5. Make practical use, for the different problems posed, of the cloud environments and the parallel and distributed systems currently available to a data engineer and data scientist
    Related competences: CE4, CT4, CT6, CG1, CB2

Contents

  1. Foundations of parallel and distributed supercomputing
    In this topic, students will learn basic concepts of parallel computing as well as metrics that will help them evaluate both the performance of their programs and the limits derived from the application structure itself.
  2. Parallel and distributed architectures
    In this topic, students will learn the main characteristics of parallel and distributed architectures that most influence the design of their data analysis programs and that help explain the performance (or loss of performance) of those programs.
  3. Execution environments for parallel computing and data analytics
    In this topic, students will learn about the different environments they are most likely to encounter when running both applications that generate data and applications that store or analyze it. Emphasis will be placed on the differences between the three environments and their impact on the efficiency of the students' applications.
  4. Programming models for supercomputers
    In this topic, students will see the basic principles of the most widely used programming models in HPC environments: MPI, OpenMP and hybrid MPI + OpenMP models. They will be given the tools to detect and manage the main details that may affect both the robustness and the efficiency of their programs.
    Coprocessor-oriented models, which offer a good trade-off between performance and energy consumption and are widely used in data analysis, will also be introduced.
  5. Software and execution environment specific for advanced analytics
    In this topic, students will see in more detail the characteristics of the programming models and execution environments for data storage and data analysis. Apache Spark / Hadoop will be used as the reference programming model, Cassandra as the reference for data storage, and TensorFlow / Keras as the analysis tools.
  6. Powering Machine Learning with supercomputers: Case Study with Spark/Cassandra/TensorFlow
    In this topic, students will work in a machine learning environment that uses the Apache Spark model, with Cassandra as the key/value database and TensorFlow as the analysis tool. The most important elements of these three components will be explained, focusing on those that most affect the design of machine learning applications together with data storage and analysis.
  7. Lab sessions
    The laboratory sessions will be grouped into two projects that will be carried out both in the laboratory sessions and as autonomous work. The two projects will deal with the programming, analysis and optimization of a case as realistic as possible in two environments: parallel execution environments (MPI, OpenMP, queue systems, etc.) used to generate and post-process data, and specific data management and analysis environments such as Apache Spark, Cassandra and TensorFlow.

Activities


Course introduction

During this activity, the objectives, contents, and operation of the subject will be explained.

Theory
1h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Development of the topic "Fundamentals of parallel and distributed supercomputing"

In this topic, students will learn basic concepts of parallel computing as well as metrics that will help them assess both the performance of their programs and the limits derived from the structure of the application.
Objectives: 1
Contents:
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
2h

Development of the topic "Parallel and Distributed Architectures"

In this topic, students will learn the main features of parallel and distributed architectures that can influence the design of their data analysis programs and help them understand the performance (or loss of performance) of those programs. Examples include features of systems with multi-core architectures, hyperthreading, shared and distributed memory, temporal and spatial data locality, types of storage (local, remote), network topologies, etc.
Objectives: 1 2
Contents:
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
2h

Development of the topic "Execution environments for parallel computation and data analysis"

In this topic, students will learn about the different environments they are most likely to encounter when running both applications that generate data and applications that store or analyze it. Emphasis will be placed on the differences between the three environments and their impact on the efficiency of the students' applications. The topic is divided into HPC environments (execution environments with queue systems) and data analysis (DA) environments (cloud computing). Problems will also be worked through during the theory classes.
Objectives: 2 4
Contents:
Theory
6h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
6h

Development of the topic "Programming models for supercomputers"

In this topic, students will see the basic principles of the most widely used programming models in HPC environments: MPI, OpenMP and hybrid MPI + OpenMP models. They will be given the tools to detect and manage the main details that can affect both the robustness and the efficiency of their programs. Coprocessor-oriented models, which offer a good trade-off between performance and energy consumption and are widely used in data analysis, will also be introduced.
Objectives: 3
Contents:
Theory
6h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
6h

Development of the topic "New software for data analysis"

In this topic, students will see in more detail the characteristics of the programming models and execution environments for data storage and data analysis. Apache Spark / Hadoop will be used as the reference programming model, Cassandra as the reference for data storage, and TensorFlow / Keras as the analysis tools.
Objectives: 4
Contents:
Theory
7h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
7h

Development of the topic "Machine Learning in Supercomputers: Case Based on Spark / Cassandra / TensorFlow"

In this topic, students will work in a machine learning environment that uses the Apache Spark model, with Cassandra as the key/value database and TensorFlow as the analysis tool. The most important elements of these three components will be explained, focusing on those that most affect the design of machine learning applications as well as data storage and analysis.
Objectives: 5
Contents:
Theory
4h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
4h

Laboratory sessions and deliverables: application execution and data generation in HPC environments, and data storage and data analysis in the context of DA (Data Analytics)

During the lab sessions, exercises will be proposed, most of which will be done in class. Some of these exercises aim at practicing specific aspects of both the more traditional HPC environments and the data analytics environments. Others will be part of a larger exercise that spans several sessions. There will be two such larger exercises: one focused on HPC and one focused on data analysis environments. The first will be delivered just after the end of the sessions dedicated to HPC environments and applications, and the second just after the data analysis environment sessions are over.
Objectives: 2 3 4 5
Contents:
Theory
0h
Problems
0h
Laboratory
28h
Guided learning
0h
Autonomous learning
28h

Teaching methodology

During the course there will be four types of activities:

a) Activities aimed at acquiring theoretical knowledge. Theoretical activities include participatory classes, which explain the basic contents of the course.

b) Activities focused on acquiring knowledge through experimentation, using a "learning by doing" approach in guided, practice-based laboratory sessions (with a final report). Some sessions may include pre-work or post-session work, depending on the use of the laboratories.

c) A few sessions within the theory classes in which practical exercises will be carried out, involving numerical evaluation and performance analysis.

d) Two reports on the exercises carried out in the laboratories, one related to HPC environments and applications and one related to data analysis environments.

Evaluation methodology

The evaluation of the subject is based on three components:

- Partial exam: 40%
- Final exam: 40%
- Laboratory: 20%.

The lab grade will come from the evaluation of the lab deliverables.
The partial exam releases its material from the final exam if the partial exam mark is >= 5.0.
The final grade is computed as: 0.2 * Laboratory + 0.4 * Final exam + 0.4 * Partial exam

If the final grade is below 5.0, the student will be allowed to take the re-evaluation exam. In that case, the grade will be computed as:

Final grade: max(Re-evaluation exam * 0.8, Partial exam * 0.4 + Final exam * 0.4) + Laboratory * 0.2
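
As an illustration only, the two grading rules above can be written as a short Python sketch. The function names and the sample marks are hypothetical, and marks are assumed to be on a 0-10 scale, which the 5.0 threshold suggests but the text does not state.

```python
def final_grade(lab: float, partial: float, final: float) -> float:
    """Ordinary grade: 0.2 * lab + 0.4 * final exam + 0.4 * partial exam."""
    return 0.2 * lab + 0.4 * final + 0.4 * partial


def reevaluated_grade(lab: float, partial: float, final: float,
                      reevaluation: float) -> float:
    """Grade after the re-evaluation exam.

    The exam part is the better of the re-evaluation exam (weight 0.8)
    and the original partial + final exams (weight 0.4 each); the lab
    still counts for 0.2.
    """
    exams = max(0.8 * reevaluation, 0.4 * partial + 0.4 * final)
    return exams + 0.2 * lab


if __name__ == "__main__":
    # Hypothetical marks, assumed to be on a 0-10 scale.
    lab, partial, final = 8.0, 4.0, 4.0
    grade = final_grade(lab, partial, final)
    print(f"ordinary grade: {grade:.2f}")
    if grade < 5.0:
        # Below 5.0 the student may take the re-evaluation exam.
        print(f"after re-evaluation: "
              f"{reevaluated_grade(lab, partial, final, reevaluation=6.0):.2f}")
```

Because of the max, a poor re-evaluation exam cannot lower the exam part below what the partial and final exams already earned.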

Bibliography

Basic:

  • Hands-on sessions at GitHub - Torres, J.
  • Slides of the course - Torres, J, UPC.
  • Understanding supercomputing: with Marenostrum Supercomputer in Barcelona - Torres, J, Universitat Politècnica de Catalunya, Barcelona Supercomputing Center, 2016. ISBN: 9781365376825
    http://cataleg.upc.edu/record=b1490214~S1*cat
  • Hello world en TensorFlow - Torres, J, Universitat Politècnica de Catalunya, Barcelona Supercomputing Center, 2016. ISBN: 9781326532383
    http://cataleg.upc.edu/record=b1472879~S1*cat
  • Introducción a Apache Spark: para empezar a programar el big data - Macias, M.; Gómez, M.; Tous, R.; Torres, J, UOC, 2015. ISBN: 9788491160373
    http://cataleg.upc.edu/record=b1467894~S1*cat
  • Articles from technical journals in the area

Complementary:

Previous capacities

C and Python are the programming languages used in the lab sessions of this course. Students are assumed to have a basic knowledge of Python and C before classes start.