The course aims to make students aware that the data analysis most companies carry out today requires advanced platforms offering large-scale, high-performance computing based on parallel and distributed systems, available either within the companies themselves or through the wide range of cloud computing service providers.
This course provides students with the fundamentals of these parallel and distributed computing systems, and introduces them to their use, in support of the data analysis environments that data scientists and data engineers require. Students will understand the continuous development of these systems, which enables the convergence of advanced analysis algorithms and related computing technologies.
Classes are complemented with programming exercises based on typical data science problems, in which students evaluate solutions using parallel and distributed systems that are available to everyone through high-performance computing and analytics services. Students can thus design realistic experiments.
One of the final objectives of the course is to encourage students to be actors rather than spectators in the profound transformation that high-performance analytics is undergoing, and to stimulate their desire to explore this exciting world of technology further, beyond the course itself.
Person in charge
Julita Corbalan Gonzalez
Yolanda Becerra Fontal
CE4 - Use current computer systems, including high-performance systems, to process large volumes of data, based on knowledge of their structure, operation and particularities.
CT4 - Teamwork. Be able to work in an interdisciplinary team, either as a regular member or in a management role, with the aim of contributing to the development of projects with pragmatism and a sense of responsibility, making commitments that take the available resources into account.
CT5 - Sound use of information resources. Manage the acquisition, structuring, analysis and visualization of data and information in the field of specialty and critically evaluate the results of such management.
CT6 - Autonomous Learning. Detect deficiencies in one's own knowledge and overcome them through critical reflection and the choice of the best action to extend this knowledge.
CT7 - Third language. Know a third language, preferably English, with an adequate oral and written level and in line with the needs of graduates.
CB1 - Students have demonstrated that they possess and understand knowledge in an area of study that builds on general secondary education and is usually at a level that, while supported by advanced textbooks, also includes some aspects involving knowledge from the forefront of their field of study.
CB2 - Students know how to apply their knowledge to their work or vocation in a professional way and possess the skills that are usually demonstrated by developing and defending arguments and by solving problems within their area of study.
CB5 - Students have developed the learning skills necessary to undertake later studies with a high degree of autonomy.
Generic Technical Competences
CG1 - Design computer systems that integrate data of very diverse origins and forms, build mathematical models from them, reason about these models and act accordingly, learning from experience.
CG2 - Choose and apply the most appropriate methods and techniques to a problem defined by data that represents a challenge for its volume, speed, variety or heterogeneity, including computer, mathematical, statistical and signal processing methods.
CG4 - Identify opportunities for innovative data-driven applications in evolving technological environments.
Understand the foundations of current parallel and distributed systems
Know and be able to use the basic elements that make up parallel and distributed systems
Become familiar with the most common programming models for parallel and distributed systems
Know, and be able to choose appropriately among, the advanced analytics environments that use parallel and distributed systems
Make practical use, on a variety of given problems, of the cloud environments and the parallel and distributed systems currently available to a data engineer and data scientist
Foundations of parallel and distributed supercomputing
In this topic, students will learn basic concepts of parallel computing as well as metrics that will help them evaluate both the performance of their programs and the limits derived from the application structure itself.
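As an illustration of the kind of metrics referred to here, the sketch below computes speedup, parallel efficiency, and the upper bound on speedup given by Amdahl's law. The choice of these three metrics, the function names, and the sample timings are assumptions made for illustration; the course description does not list the specific metrics covered.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Ratio of serial execution time to parallel execution time."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Speedup per worker; 1.0 would mean perfect scaling."""
    return speedup(t_serial, t_parallel) / workers


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: best possible speedup when only a fraction of the
    program can run in parallel and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical timings: 120 s serially, 20 s on 8 workers.
    print("speedup:   ", speedup(120.0, 20.0))
    print("efficiency:", efficiency(120.0, 20.0, 8))
    # Hypothetical program in which 90% of the work is parallelizable.
    for p in (2, 8, 64, 1024):
        print(f"Amdahl bound with {p} workers:", round(amdahl_speedup(0.9, p), 2))
```

With 90% of the work parallelizable, the bound approaches but never reaches 10 however many workers are added, which is one concrete example of a limit imposed by the structure of the application rather than by the hardware.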
Parallel and distributed architectures
In this topic, students will learn the main characteristics of parallel and distributed architectures that most influence the design of their data analysis programs and that help explain the performance (or loss of performance) of those programs.
Execution environments for parallel computing and data analytics
In this topic, students will learn about the main environments they will encounter when running applications that generate, store, or analyze data. Emphasis will be placed on the differences between these three kinds of environment and their impact on the efficiency of applications.
Programming models for supercomputers
In this topic, students will see the basic principles of the most widely used programming models in HPC environments: MPI, OpenMP and hybrid MPI + OpenMP models. They will be given the tools to detect and manage the main details that may affect both the robustness and the efficiency of their programs.
Coprocessor-oriented models, which offer good performance relative to their energy consumption and are widely used in data analysis, will also be introduced.
Software and execution environment specific for advanced analytics
In this topic, students will see in more detail the characteristics of the programming models and execution environments for data storage and analysis. Apache Spark / Hadoop will be used as the reference model, Cassandra as the reference for data storage, and TensorFlow / Keras as the analysis tools.
Powering Machine Learning with supercomputers: Case Study with Spark/Cassandra/TensorFlow
In this topic, students will work in a machine learning environment that uses the Apache Spark model, with Cassandra as the key/value database and TensorFlow as the analysis tool. The most important elements of these three components will be explained, focusing on those that most affect the design of machine learning applications involving data storage and analysis.
The laboratory sessions will be grouped into two projects, to be carried out both during the laboratory sessions and as autonomous work. The two projects involve programming, analyzing and optimizing a case that is as realistic as possible in two environments: parallel execution environments (MPI, OpenMP, queue systems, etc.) used to generate and post-process data, and environments specific to data management and analysis, such as Apache Spark, Cassandra and TensorFlow.
During this activity, the objectives, contents, and operation of the subject will be explained.
Development of the theme "Fundamentals of parallel and distributed supercomputing"
In this topic, students will learn basic concepts of parallel computing as well as metrics that will help them assess both the performance of their programs and the limits derived from the structure of the application.
Objectives: 1
Contents:
Development of the theme "Parallel and Distributed Architectures"
In this topic, students learn the main features of parallel and distributed architectures that can influence the design of their data analysis programs and that help explain the performance (or loss of performance) of those programs. Examples include the features of systems with multi-core architectures, hyperthreading, shared and distributed memory, temporal and spatial data locality, storage type (local, remote), network topologies, etc.
Objectives: 1, 2
Contents:
Development of the theme "Execution environments for parallel computation and data analysis"
In this topic, students will learn about the main environments they will encounter when running applications that generate, store, or analyze data. Emphasis will be placed on the differences between these three kinds of environment and their impact on the efficiency of applications. Examples are queue-based execution environments for HPC and cloud computing for data analysis (DA). The topic is divided into HPC environments and DA environments. Problem-solving exercises will also be done during theory classes.
Objectives: 2, 4
Contents:
Development of the subject "Models of programming for supercomputers"
In this topic, students will see the basic principles of the most widely used programming models in HPC environments: MPI, OpenMP and hybrid MPI + OpenMP models. They will be given the tools to detect and manage the main details that can affect both the robustness and the efficiency of their programs. Coprocessor-oriented models, which offer good performance relative to their energy consumption and are widely used in data analysis, will also be introduced.
Objectives: 3
Contents:
Development of the subject "New software for data analysis"
In this topic, students will see in more detail the characteristics of the programming models and execution environments for data storage and analysis. Apache Spark / Hadoop will be used as the reference model, Cassandra as the reference for data storage, and TensorFlow / Keras as the analysis tools.
Objectives: 4
Contents:
Development of the subject "Machine Learning in Supercomputers: Case Based on Spark / Cassandra / TensorFlow"
In this topic, students will work in a machine learning environment that uses the Apache Spark model, with Cassandra as the key/value database and TensorFlow as the analysis tool. The most important elements of these three components will be explained, focusing on those that most affect the design of machine learning applications involving data storage and analysis.
Objectives: 5
Contents:
Laboratory sessions and deliverables: application execution in HPC environments, data generation in HPC environments, and data storage and data analysis in the context of DA (Data Analytics)
During the lab sessions, exercises will be proposed, most of which will be done in class. Some of these exercises aim at practicing specific aspects of both the more traditional HPC environments and the data analytics environments. Others will be part of a larger exercise that spans several sessions. There will be two such larger exercises: one focused on HPC and one more specific to data analysis environments. The first will be delivered just after the end of the sessions dedicated to HPC environments and applications, and the second just after the data analysis environment sessions are over.
Objectives: 2, 3, 4, 5
Contents:
During the course there will be four types of activities:
a) Activities aimed at acquiring theoretical knowledge. Theoretical activities include participatory classes, which explain the basic contents of the course.
b) Activities focused on acquiring knowledge through experimentation, using a "learning by doing" approach in guided laboratory sessions (with a final report). Some sessions may include pre-session or post-session work, depending on the use of laboratories.
c) A few sessions within the theory classes in which practical exercises involving numerical calculations and analysis for performance evaluation will be carried out.
d) Two reports on the exercises performed in the laboratories, one related to HPC environments and applications and one related to data analysis environments.
The course grade is made up of three components: the lab grade, the partial exam and the final exam.
The lab grade will come from the evaluation of the lab deliverables.
A mark of 5.0 or higher in the first partial exam exempts the student from that material in the final exam.
The final grade is computed as: 0.2 * Lab + 0.4 * Final exam + 0.4 * Partial exam
If the final grade is less than 5.0, the student will be allowed to take the reevaluation exam. In that case, the grade will be computed as:
Final grade: Max(Reevaluation exam * 0.8, Partial exam * 0.4 + Final exam * 0.4) + Lab * 0.2
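The two grading formulas above can be checked with a short sketch such as the following. The function names and the sample marks are hypothetical and only illustrate the arithmetic; they are not part of the official grading procedure.

```python
def final_grade(lab: float, partial_exam: float, final_exam: float) -> float:
    """Regular grade: 20% lab, 40% partial exam, 40% final exam."""
    return 0.2 * lab + 0.4 * final_exam + 0.4 * partial_exam


def reevaluated_grade(lab: float, partial_exam: float, final_exam: float,
                      reevaluation_exam: float) -> float:
    """Grade after reevaluation: the exam part is the better of the
    reevaluation exam (80%) and the original exams (40% + 40%);
    the lab keeps its 20% weight."""
    exam_part = max(0.8 * reevaluation_exam,
                    0.4 * partial_exam + 0.4 * final_exam)
    return exam_part + 0.2 * lab


if __name__ == "__main__":
    # Hypothetical student: lab 8.0, partial exam 4.0, final exam 4.0.
    regular = final_grade(8.0, 4.0, 4.0)
    print("regular grade:", round(regular, 2))
    if regular < 5.0:
        # Hypothetical reevaluation exam mark of 6.0.
        print("after reevaluation:", round(reevaluated_grade(8.0, 4.0, 4.0, 6.0), 2))
```

Because of the Max term, a poor reevaluation exam cannot lower the grade below the one obtained from the original partial and final exams.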
Hands-on sessions at GitHub -
Slides of the course -
Understanding supercomputing: with Marenostrum Supercomputer in Barcelona -
Universitat Politècnica de Catalunya, Barcelona Supercomputing Center, 2016. ISBN: 9781365376825 http://cataleg.upc.edu/record=b1490214~S1*cat