Parallelism and Distributed Systems

Credits

Types

Compulsory

Requirements

This subject has no formal requirements, but it does assume some previous capacities.

Department

Web

http://docencia.ac.upc.edu/FIB/GCED/PSD

Mail

julita.corbalan@upc.edu

The course aims to make students aware that the data analysis most companies carry out today requires advanced platforms providing large-scale, high-performance computing based on parallel and distributed systems, available either within the companies themselves or through the wide range of cloud computing service providers.

This course will provide students with the basics of these parallel and distributed computing systems, and introduce them to their use, in support of the data analysis environments that data scientists and data engineers require. Students will come to understand the continuous development of these systems, which enables the convergence of advanced analysis algorithms and related computing technologies.

Classes are complemented with programming exercises based on typical data science problems, in which students develop and evaluate solutions using parallel and distributed systems that are available to everyone. Thanks to these high-performance computing and analysis services, students can design realistic experiments.

One of the ultimate objectives of the course is to encourage students to be actors, not spectators, in the profound transformation that high-performance analytics is undergoing, and to stimulate their desire to explore this exciting world of technology further, beyond the course itself.

Teachers

Person in charge

Julita Corbalan Gonzalez (julita.corbalan@upc.edu)

Others

Yolanda Becerra Fontal (yolandab@ac.upc.edu)

Weekly hours

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Competences

Technical Competences

Technical competencies

CE4 - Use current computer systems, including high-performance systems, to process large volumes of data, based on knowledge of their structure, operation and particularities.

Transversal Competences

Transversals

CT4 [Assessable] - Teamwork. Be able to work as a member of an interdisciplinary team, either as a member or conducting management tasks, with the aim of contributing to develop projects with pragmatism and a sense of responsibility, making commitments that take available resources into account.

CT5 [Assessable] - Solvent use of information resources. Manage the acquisition, structuring, analysis and visualization of data and information in the field of specialty and critically evaluate the results of such management.

CT6 - Autonomous Learning. Detect deficiencies in one's own knowledge and overcome them through critical reflection and the choice of the best action to extend this knowledge.

CT7 - Third language. Know a third language, preferably English, with an adequate oral and written level and in line with the needs of graduates.

Basic

CB1 - That students have demonstrated that they possess and understand knowledge in an area of study that builds on general secondary education, and is usually found at a level that, although supported by advanced textbooks, also includes some aspects that involve knowledge from the forefront of their field of study.

CB2 - That students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated through the elaboration and defense of arguments and problem solving within their area of study.

CB5 - That students have developed the learning skills necessary to undertake later studies with a high degree of autonomy.

Generic Technical Competences

Generic

CG1 - Design computer systems that integrate data of very diverse origins and forms, build mathematical models from them, reason about these models and act accordingly, learning from experience.

CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.

CG4 - Identify opportunities for innovative data-driven applications in evolving technological environments.

Objectives

Know the fundamentals of current parallel and distributed systems
Related competences: CG1, CB1
Know and be able to use the basic elements that make up parallel and distributed systems
Related competences: CT4, CT6, CT7, CB2
Know the advanced analytics environments that use distributed and parallel systems, and be able to choose appropriately among them
Related competences: CE4, CG2, CG4
Make practical use, on a range of given problems, of the cloud environments and parallel and distributed systems currently available to a data engineer and data scientist
Related competences: CE4, CT4, CT6, CG1, CB2
Become familiar with the most common programming models for parallel and distributed systems
Related competences: CE4, CT5, CB5

Foundations of parallel and distributed supercomputing
In this topic, students will learn basic concepts of parallel computing as well as metrics that will help them evaluate both the performance of their programs and the limits derived from the application structure itself.
Parallel and distributed architectures
In this topic, students will learn the main characteristics of parallel and distributed architectures that most influence the design of their data analysis programs and that help explain their performance (or loss of performance).
Execution environments for parallel computing and data analytics
In this topic, students will learn about the different environments they are most likely to encounter when running both applications that generate data and applications that store or analyze it. Emphasis will be placed on the differences between the three environments and their impact on application efficiency.
Programming models for supercomputers
In this topic, students will see the basic principles of the most widely used programming models in HPC environments: MPI, OpenMP and hybrid MPI + OpenMP models. They will be given the tools to detect and manage the main details that may affect both the robustness and the efficiency of their programs.
Coprocessor-oriented models, which offer a good trade-off between performance and energy consumption and are widely used in data analysis, will also be introduced.
Software and execution environment specific for advanced analytics
In this topic, students will see in more detail the characteristics of the programming models and execution environments for data storage and analysis. The reference technologies will be the Apache Spark / Hadoop model, Cassandra for data storage, and TensorFlow / Keras as analysis tools.
Powering Machine Learning with supercomputers: Case Study with Spark/Cassandra/TensorFlow
In this topic, students will work with a machine learning environment that uses the Apache Spark model, with Cassandra as the key/value database and TensorFlow as the analysis tool. The most important elements of these three components will be explained, in particular those that most affect the design of machine learning applications, including data storage and analysis.
Lab sessions
The laboratory sessions will be grouped into two projects, carried out both during the laboratory sessions and as autonomous work. The two projects will involve programming, analyzing and optimizing a case that is as realistic as possible in two environments: parallel execution environments (MPI, OpenMP, queue systems, etc.) used to generate and post-process data, and specific data management and analysis environments such as Apache Spark, Cassandra and TensorFlow.

Activities

Activity Evaluation act

Course introduction

During this activity, the objectives, contents, and operation of the subject will be explained.

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Development of the theme "Fundamentals of parallel and distributed supercomputing"

In this topic, students will learn basic concepts of parallel computing as well as metrics that will help them assess both the performance of their programs and the limits derived from the structure of the application.
Objectives: 1
Contents:

1. Foundations of parallel and distributed supercomputing
2. Parallel and distributed architectures

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Development of the theme "Parallel and Distributed Architectures"

In this topic, students will learn the main features of parallel and distributed architectures that can influence the design of their data analysis programs and help explain their performance (or loss of performance). Examples include multi-core architectures, hyperthreading, shared and distributed memory, temporal and spatial data locality, type of storage (local, remote), network topologies, etc.
Objectives: 1 2
Contents:

1. Foundations of parallel and distributed supercomputing
2. Parallel and distributed architectures
3. Execution environments for parallel computing and data analytics

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Development of the theme "Execution environments for parallel computation and data analysis"

In this topic, students will learn about the different environments they are most likely to encounter when running both applications that generate data and applications that store or analyze it. Emphasis will be placed on the differences between the three environments and their impact on application efficiency. The environments covered include queue-based execution environments for HPC and cloud computing for data analysis (DA). The topic will be divided into HPC environments and data analysis environments. Problems will also be worked through during the theory classes.
Objectives: 2 3
Contents:

2. Parallel and distributed architectures
3. Execution environments for parallel computing and data analytics

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Development of the theme "Programming models for supercomputers"

In this topic, students will see the basic principles of the most widely used programming models in HPC environments: MPI, OpenMP and hybrid MPI + OpenMP models. They will be given the tools to detect and manage the main details that can affect both the robustness and the efficiency of their programs. Coprocessor-oriented models, which offer a good trade-off between performance and energy consumption and are widely used in data analysis, will also be covered.
Objectives: 5
Contents:

3. Execution environments for parallel computing and data analytics
4. Programming models for supercomputers
5. Software and execution environment specific for advanced analytics

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Development of the theme "New software for data analysis"

In this topic, students will see in more detail the characteristics of the programming models and execution environments for data storage and analysis. The reference technologies will be the Apache Spark / Hadoop model, Cassandra for data storage, and TensorFlow / Keras as analysis tools.
Objectives: 3
Contents:

5. Software and execution environment specific for advanced analytics

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Development of the theme "Machine Learning in Supercomputers: Case Based on Spark / Cassandra / TensorFlow"

In this topic we will study a machine learning environment that uses the Apache Spark model, with Cassandra as the key/value database and TensorFlow as the analysis tool. The most important elements of these three components will be explained, in particular those that most affect the design of machine learning applications as well as data storage and analysis.
Objectives: 4
Contents:

5. Software and execution environment specific for advanced analytics
6. Powering Machine Learning with supercomputers: Case Study with Spark/Cassandra/TensorFlow

Theory

Problems

Laboratory

Guided learning

Autonomous learning

Laboratory sessions and deliverables: application execution in HPC environments, data generation in HPC environments, and data storage and data analysis in the context of DA (Data Analytics)

During the lab sessions, exercises will be proposed, most of which will be done in class. Some of these exercises will aim at practicing specific aspects of both more traditional HPC environments and data analytics environments. Others will be part of a larger exercise developed over the course of several sessions. There will be two such exercises: one focused on HPC and one more specific to data analysis environments. The first will be delivered just after the end of the sessions dedicated to HPC environments and applications, and the second just after the data analysis environment sessions are over.
Objectives: 2 5 3 4
Contents:

3. Execution environments for parallel computing and data analytics
4. Programming models for supercomputers
5. Software and execution environment specific for advanced analytics
6. Powering Machine Learning with supercomputers: Case Study with Spark/Cassandra/TensorFlow
7. Lab sessions

Theory

Problems

Laboratory

28h

Guided learning

Autonomous learning

28h

Teaching methodology

During the course there will be four types of activities:

a) Activities aimed at acquiring theoretical knowledge. Theoretical activities include participatory classes, which explain the basic contents of the course.

b) Activities focused on acquiring knowledge through experimentation, using the "learn to do" approach in guided, practice-based laboratory sessions (with a final report). Some sessions may include pre-session or post-session work, depending on the use of laboratories.

c) A few sessions within the theory classes in which practical exercises will be carried out, involving numerical calculations and analysis for performance evaluation.

d) Two reports on the exercises performed in the laboratories, one on HPC environments and applications and one on data analysis environments.

This semester, as lab classes will be held in theory classrooms, students will be required to bring their own laptop. Students will also need their own laptop to take the exams, both theoretical and laboratory, because they will be delivered in digital format. All theory classes held online will take place over Meet at the officially scheduled times. For laboratory classes, students who are confined will be able to connect via Meet to follow the classes.

Evaluation methodology

- Partial exam: 35% (first part of the course, HPC): theoryHPC
- Final exam: 35% (second part of the course, data analysis, AD): theoryAD
- Laboratory: 30%.

The Laboratory grade will come from the evaluation of the laboratory deliverables: 15% labHPC and 15% labAD
The Theory grade will be: (theoryHPC + theoryAD)/2

The Final grade will be 0.3*Laboratory grade + 0.7*Theory grade

Reassessment:

If theoryHPC != NP and theoryAD != NP and labHPC != NP and labAD != NP and NotaFinal < 5 ==> the student may take the re-evaluation exam (NP: not presented)

There is a single re-evaluation exam covering both HPC theory and AD theory (labs are not re-evaluated)

NotaFinalPSD = max(NotaFinal, notaReevaluation)
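
The grading rules above can be sketched in Python, one of the lab languages of the course. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, `None` stands in for NP, grades are assumed to be on a 0-10 scale, and the Laboratory grade is assumed to be the mean of labHPC and labAD, since each deliverable counts for 15% of the final grade.

```python
from typing import Optional


def theory_grade(theory_hpc: float, theory_ad: float) -> float:
    """Theory grade: (theoryHPC + theoryAD) / 2."""
    return (theory_hpc + theory_ad) / 2


def laboratory_grade(lab_hpc: float, lab_ad: float) -> float:
    """Laboratory grade.

    Assumption: the two deliverables weigh the same (15% each of the
    final grade), so the laboratory grade is taken as their mean.
    """
    return (lab_hpc + lab_ad) / 2


def final_grade(theory_hpc: float, theory_ad: float,
                lab_hpc: float, lab_ad: float) -> float:
    """Final grade: 0.3 * Laboratory grade + 0.7 * Theory grade."""
    return (0.3 * laboratory_grade(lab_hpc, lab_ad)
            + 0.7 * theory_grade(theory_hpc, theory_ad))


def can_take_reevaluation(theory_hpc: Optional[float],
                          theory_ad: Optional[float],
                          lab_hpc: Optional[float],
                          lab_ad: Optional[float]) -> bool:
    """Re-evaluation is open only if no item is NP (None here)
    and the final grade is below 5."""
    marks = (theory_hpc, theory_ad, lab_hpc, lab_ad)
    if any(mark is None for mark in marks):
        return False
    return final_grade(*marks) < 5


def course_grade(final: float, reevaluation: Optional[float] = None) -> float:
    """NotaFinalPSD = max(final grade, re-evaluation grade)."""
    if reevaluation is None:
        return final
    return max(final, reevaluation)
```

For example, a student with theoryHPC = 4, theoryAD = 4, labHPC = 5 and labAD = 5 would obtain 0.3*5 + 0.7*4 = 4.3 under these assumptions, and could therefore take the re-evaluation exam.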

Bibliography

Basic

Hands-on sessions at GitHub - Torres, J.
Slides of the course - Torres, J., UPC.
Understanding supercomputing: with Marenostrum Supercomputer in Barcelona - Torres, J., Universitat Politècnica de Catalunya, Barcelona Supercomputing Center, 2016. ISBN: 9781365376825
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004105469706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Hello world en TensorFlow - Torres, J., Universitat Politècnica de Catalunya, Barcelona Supercomputing Center, 2016. ISBN: 9781326532383
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004074709706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Introducción a Apache Spark: para empezar a programar el big data - Macias, M.; Gómez, M.; Tous, R.; Torres, J, UOC, 2015. ISBN: 9788491160373
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991004068679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Articles from technical journals in the area.

Complementary

Empresas en la nube: ventajas y retos del cloud computing - Torres, J, Libros de Cabecera, 2011. ISBN: 9788493908225
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003890319706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Previous capacities

C and Python are the programming languages of choice for the lab sessions of this course. Students are assumed to have a basic knowledge of Python and C before classes start.