Machine Learning (ML) has taken the world by storm and has become a fundamental pillar of engineering. As a result, the last decade has witnessed explosive growth in the use of deep neural networks (DNNs) to exploit the advantages of ML in virtually every aspect of our lives: computer vision, natural language processing, medicine, and economics are just a few examples. However, NOT all DNNs fit all problems: convolutional NNs are good for computer vision, recurrent NNs are good for temporal analysis, and so on. In this context, the main focus of N3Cat and BNN-UPC is to explore the possibilities of the newer and less explored variant called Graph Neural Networks (GNNs), whose aim is to learn and model graph-structured data. This has huge implications in fields such as quantum chemistry, computer networks, or social networks, among others.

OBJECTIVES
===========

N3Cat and BNN-UPC are looking for students wanting to work in the area of Graph Neural Networks, studying their uses, processing architectures, and algorithms. To this end, the candidate will work on ONE of the following areas:

- Investigating the state of the art in this area, surveying the different works done in terms of applications, processing frameworks, algorithms, benchmarks, and datasets. This can be approached from a hardware or software perspective.
- Helping to build a testbed formed by a cluster of GPUs running PyTorch or TensorFlow. We will instrument the testbed to measure the computation workload and the communication flows between GPUs.
- Analyzing the communication workload of running a GNN, either in the testbed or by means of architectural simulations.
- Developing means of accelerating GNN processing in software (e.g., improving the scheduling of the message passing) or hardware (e.g., designing a domain-specific architecture).
Companies and scientists working in areas such as finance or genomics are generating enormously large datasets (in the order of petabytes), commonly referred to as Big Data. How to efficiently and effectively process such large amounts of data is an open research problem. Since communication is involved in Big Data processing at many levels, at the NaNoNetworking Center in Catalunya (N3Cat) we are currently investigating the potential role of wireless communications in the Big Data scenario. The main focus of the project is to evaluate the impact of applying wireless communications and networking methods to processors and data centers oriented to the management of Big Data.

OBJECTIVES
===========

N3Cat is looking for students wanting to work in the area of wireless communications for Big Data. To this end, the candidate will work on one of the following areas:

- Traffic analysis of Big Data frameworks and applications, as well as of smaller manycore systems.
- Channel characterization in Big Data environments: indoor, within the racks of a data center, within the package of a CPU, within a chip.
- Design of wireless communication protocols for computing systems, from the processor level to the data center level.
In many state-of-the-art simulation codes, the discretization is so closely tied to the data layout and solver that switching discretizations within the same code is not possible. Not only does this preclude the kind of comparison that is necessary for scientific investigation, but it also makes library development impossible. This project consists of implementing and verifying different topology strategies that treat all the different pieces of a 3D mesh (e.g., cells, faces, edges, and vertices) in exactly the same way. This allows the mesh interface to be very small and simple while remaining flexible and general. It also allows "dimension independent programming", which means that the same algorithm can be used unchanged for meshes of different shapes and dimensions. The project will use an existing parallel Python prototype and explore alternatives to improve its robustness and extend it without sacrificing flexibility. It will investigate various ways to optimize and parallelize Python programs for large-scale simulations on real-life production clusters. This project will be developed in the context of the PIXIL project (Interreg POCTEFA), which is coordinated by the geosciences applications group of the Barcelona Supercomputing Center.
To carry out the experiments and tests, the MareNostrum supercomputer will be used (https://www.bsc.es/marenostrum).
More information at:
Robotic Process Automation is receiving significant attention due to the promise of improving the performance of the main processes of an organization by incorporating robots that partially perform repetitive tasks. In this project, we will consider how Process Mining can help in finding opportunities to apply Robotic Process Automation for a real case study.
Recently, one of the leaders in Robotic Process Automation acquired one of the main process mining tools (https://www.uipath.com/newsroom/uipath-acquires-process-gold-unparalleled-process-understanding). This confirms the potential link between the field of process mining and the field of robotic process automation.
In this project we will try to find out how strong this link is. Using real data from a company that is trying to automate its processes, the student will dig into the field of process mining to propose a methodology that unleashes the application of RPA.
In this project, there is a possibility to have a grant that covers the time invested.
Starting from a snow avalanche model developed at the UPC, which simulates the dynamics of this phenomenon, we want to perform a full validation of the model so that avalanche specialists at the ICGC can use it as a tool in their decision-making process.
The Validation, Verification and Accreditation of a model is essential for effectively using it in production for decision making. The project aims to validate both the model and its implementation so that the end result reproduces the natural dynamics of the phenomenon. Once validated, the model will be used as a support tool by the avalanche team of the ICGC. During the development of the project, specialists from the ICGC who are taking part in this validation process will provide help. Advanced DOE (Design of Experiments) techniques will be applied during the project.
Languages follow many statistical regularities called laws. Perhaps the most popular example is Zipf's law for word frequencies, which relates the frequency of a word to its rank, but other laws have been formulated, such as the law of abbreviation, the law of meaning distribution, the meaning-frequency law, and so on (Zipf 1949). About 15 years ago, a family of optimization models was introduced to shed light on the origins of Zipf's law for word frequencies (Ferrer-i-Cancho & Solé 2003, Ferrer-i-Cancho 2005). In that family, language is modelled as a bipartite graph where words connect to meanings, and a cost function is defined based on the structure of that graph. A simple Monte Carlo algorithm was used to minimize the cost function while the structure of the graph was allowed to vary. Recently, it has been shown how these models shed light on how children learn words (Ferrer-i-Cancho 2017). The aim of this project is to investigate new versions of these models (e.g., Ferrer-i-Cancho & Vitevitch 2018) in two directions: (1) providing an efficient implementation of the optimization algorithm, and (2) comparing the statistical properties of the model against the statistical properties of natural communication systems.
In greater detail, the two directions consist of:
(1) Providing an efficient implementation of the optimization algorithm. See Ferrer-i-Cancho and Solé (2003) and Ferrer-i-Cancho (2005) for further details about the algorithm. Evaluating the cost for a given bipartite graph from scratch has a cost of the order of nm, where n is the number of words and m is the number of meanings. Deciding when to stop the optimization algorithm requires (nm)^2 evaluations of the cost function (in practice it had to be cut down to about nm due to computational costs). For these reasons, n and m have been kept small in previous studies compared to real values in fully fledged human language (e.g., n = m = 150 in Ferrer-i-Cancho and Solé 2003). This computational challenge would be addressed by applying different techniques, e.g., (a) parallelization, (b) dynamic calculation (when changing a few cells of the adjacency matrix, the cost function should not be computed from scratch) and (c) heuristics to speed up the Monte Carlo scheme.
(2) Comparing the statistical properties of the model against the real statistical properties of human language (e.g., linguistics laws) and animal communication, including properties that have not been tested in previous research on these models. See Ferrer-i-Cancho (2018) for an overview of some of the statistical properties of real language that could be tested.
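To make the dynamic-calculation idea of direction (1) concrete, here is a minimal sketch, assuming a toy stand-in for the real cost function of the cited papers (the actual cost combines speaker and hearer effort): word and meaning degrees are maintained incrementally, so each candidate flip of the adjacency matrix is evaluated in O(n + m) instead of O(nm).

```python
import random

def toy_cost(row_deg, col_deg):
    # Illustrative stand-in for the real cost function of
    # Ferrer-i-Cancho & Solé (2003): penalize unlinked meanings
    # and highly ambiguous words. NOT the published cost.
    unlinked = sum(1 for d in col_deg if d == 0)
    ambiguity = sum(d * d for d in row_deg)
    return unlinked + ambiguity / len(row_deg)

def minimize(n, m, steps=2000, seed=0):
    rng = random.Random(seed)
    adj = [[0] * m for _ in range(n)]   # word-meaning adjacency matrix
    row_deg = [0] * n                   # word degrees, kept incrementally
    col_deg = [0] * m                   # meaning degrees, kept incrementally
    cost = toy_cost(row_deg, col_deg)
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(m)
        delta = 1 - 2 * adj[i][j]       # +1 if adding the edge, -1 if removing
        row_deg[i] += delta
        col_deg[j] += delta
        new_cost = toy_cost(row_deg, col_deg)   # O(n + m), not O(nm)
        if new_cost <= cost:            # greedy acceptance, as in the original scheme
            adj[i][j] ^= 1
            cost = new_cost
        else:                           # reject: undo the degree update
            row_deg[i] -= delta
            col_deg[j] -= delta
    return adj, cost
```

For a degree-based cost like this one, the update per flip could even be O(1); the real cost functions of the papers would require their own incremental bookkeeping.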
Depending on the personal interests of the student, the project can focus on one of the two directions.
It is possible to publish the results of the project in a research journal.
Ferrer-i-Cancho, R. & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences USA 100, 788-791.
Ferrer-i-Cancho, R. (2005). Zipf's law from a communicative phase transition. European Physical Journal B 47, 449-457.
Ferrer-i-Cancho, R. (2017). The optimality of attaching unlinked labels to unlinked meanings. Glottometrics 36, 1-16.
Ferrer-i-Cancho, R. & Vitevitch, M. S. (2018). The origins of Zipf's meaning-frequency law. Journal of the American Society for Information Science and Technology 69 (11), 1369-1379.
Ferrer-i-Cancho, R. (2018). Optimization models of natural communication. Journal of Quantitative Linguistics 25 (3), 207-237.
Zipf, G.K. (1949). Human behaviour and the principle of least effort. Cambridge (MA), USA: Addison-Wesley.
FHIR (Fast Healthcare Interoperability Resources) is a set of standards developed by HL7 International to facilitate eHealth information interoperability and use. In parallel, different efforts are in place to improve the representation (more compression and security) of genomic information, such as those from the GA4GH (Global Alliance for Genomics and Health) and the MPEG standardization committee. The DMAG (Distributed Multimedia Applications Group) of the Computer Architecture Department of the UPC is involved in the specification of some of these new standards. The objective of this project is to integrate genomic information into EHRs (Electronic Health Records). For this purpose, the different standards for the representation of medical and genomic information will be analysed, and FHIR will be used to facilitate that integration. Finally, a small prototype will be developed, probably making use of existing open source software. The results of this work could be contributed to one of the different standardization organizations for its consideration.
In recent years, the volume of information available electronically has increased exponentially, coining the term Big Data to refer to this phenomenon. The medical domain is an area in which the number of documents generated by primary care centers constantly increases. However, a bottleneck arises because processing these documents requires specialized personnel performing tasks by hand. In the framework of the TAIDAMED research project, we are developing a set of processors that allow automatic analysis of medical texts taking into account criteria of robustness, high precision and coverage. In particular, this thesis would aim at the acquisition of patterns of clinical behavior from medical reports represented as semantic graphs using the Neo4j database.
The medical records of each patient contain textual information about the clinical evolution of the patient (including drugs, chemicals, diseases, symptoms and body parts). This corpus has already been represented in a structured format as a set of semantic graphs, using the Neo4j graph database.
This thesis would aim at the acquisition of patterns of clinical behavior from these graphs. These patterns would be specifically devoted to provide help in diagnosis to the medical community in primary care. For example, we can get to automatically infer that a certain drug has a previously unknown side effect, or that patients suffering from a certain disease develop certain symptoms in a certain period of time. A simple example of pattern might be "patient has fever and is prescribed ibuprofen -(after x days) - patient has fever and soreness - (after y days) - patient has fever and breathing difficulty -> patient diagnosed with COVID".
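A toy, purely illustrative way to represent and match such a temporal pattern in plain Python (the real system would query the Neo4j semantic graphs; the pattern encoding and all names below are made up for illustration):

```python
def matches(pattern, events):
    """Greedy left-to-right matching of a temporal clinical pattern.

    pattern: list of (required_symptoms, max_gap_days) steps.
    events:  list of (day, observed_symptoms) in chronological order.
    Returns True if every step is found in order, each within the
    allowed number of days after the previous matched step."""
    idx = 0
    last_day = None
    for day, symptoms in events:
        wanted, max_gap = pattern[idx]
        if wanted <= symptoms and (last_day is None or day - last_day <= max_gap):
            last_day = day
            idx += 1
            if idx == len(pattern):
                return True
    return False
```

For example, the COVID pattern from the text becomes three steps, each a symptom set plus an allowed gap in days; real pattern acquisition would mine such sequences from the graphs rather than write them by hand.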
Since we have an annotated corpus of medical records, the project might use supervised and semi-supervised machine learning techniques as well as more standard data mining techniques.
Mesh networking with LoRa nodes

Meshtastic (https://www.meshtastic.org/) is an open source project which builds a mesh network between LoRa nodes. The LoRa nodes are coupled via Bluetooth to an Android application which implements a messaging service. Text messages are spread over the LoRa network to the other Meshtastic nodes. This is a good place to start reading: https://meshtastic.letstalkthis.com/

We have a couple of TTGO ESP32 LoRa nodes (e.g. http://www.lilygo.cn/prod_view.aspx?TypeId=50003&Id=1271&FId=t3:50003:3) on which Meshtastic can be flashed. Code and further information can be found here: https://github.com/meshtastic/Meshtastic-device
More advanced topics: https://github.com/meshtastic/Meshtastic-device/blob/master/docs/software/mesh-alg.md

Depending on the interest, the initial work could focus on getting familiar with microcontrollers and on installing and deploying a testbed of a few Meshtastic nodes. Some evaluations and an assessment could be carried out. More advanced work could look at the design of the mesh protocol in Meshtastic, analyze design parameters, and propose and evaluate changes or alternative options.

If there is a strong interest in the topic, the project could be connected to the work on LoRa mesh networking by Roger Pueyo, a member of our research group (https://futur.upc.edu/RogerPueyoCentelles): https://www.sciencedirect.com/science/article/abs/pii/S0167739X20306063

Note: We also have a gateway connected to The Things Network: https://www.thethingsnetwork.org/
We consider network traffic, which may come from existing datasets or be obtained from the Guifi.net community network (e.g. https://gitlab.com/rbaig/dipet-nids-dev/-/blob/master/netflow/README.md), and we aim to detect anomalies by means of neural network models. The first part of the work consists of determining and evaluating a suitable neural network design, understanding its performance and the design space. Then the deployment of the model should be addressed. Targets may include dockerized components for low-capacity devices in Guifi.net as well as deployments in a Kubernetes cluster. The resource consumption of the anomaly detection system should be analyzed to determine the trade-off between the required application performance and the observed resource consumption. Based on the results, an anomaly detection system able to adapt to different constraints of the context can be proposed.
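As a minimal sketch of the reconstruction-error approach such a design could start from (the actual model is part of the work; the architecture and hyperparameters below are made-up illustrations), a tiny linear autoencoder trained with NumPy can flag flows that it fails to reconstruct:

```python
import numpy as np

def train_autoencoder(X, hidden=2, epochs=2000, lr=0.01, seed=0):
    # Tiny linear autoencoder trained with plain gradient descent.
    # X: one row per flow, columns are numeric traffic features.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden))   # encoder weights
    W2 = rng.normal(0, 0.1, (hidden, d))   # decoder weights
    for _ in range(epochs):
        H = X @ W1
        E = H @ W2 - X                     # reconstruction error
        gW2 = H.T @ E / n                  # gradients of the mean squared error
        gW1 = X.T @ (E @ W2.T) / n
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2

def anomaly_scores(X, W1, W2):
    # Flows the autoencoder cannot reconstruct well get high scores.
    R = (X @ W1) @ W2
    return np.mean((R - X) ** 2, axis=1)
```

Training on presumed-normal flows and thresholding the score (e.g. at a high percentile of the training scores) gives a first detector against which the neural designs and their resource consumption can be compared.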
In this project, the aim is to implement and evaluate agile optimization methods for city logistics that meet real-time and large-scale requirements.
City logistics is benefiting significantly from big data analytics (based on IoT data) used to improve performance and sustainability in modern large cities. However, smart city platforms are distinguished by their dynamics and large scale, which makes it difficult to take real-time decisions. Therefore, agile optimization methods have emerged as a way to cope with such demanding requirements.
The project will seek large scale distributed implementations using real life data sets.
Study and implement deep learning based recommendation methods, such as neural collaborative filtering or neural matrix factorization (DeepFM), and apply them to a real dataset of sales from an online retail shoe company (Camper) to recommend products to clients.
There are several papers on deep learning based recommender systems that I will provide, in particular on Multilayer Perceptron (MLP) based recommendation via collaborative filtering and content-based methods. The goal is to build a (deep) neural network recommendation system based on users' purchase history. I have a large dataset of the sales history of the company Camper through its digital platform, which will serve as a real test scenario.
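The classical matrix-factorization core that neural recommenders generalize can be sketched in a few lines (a toy NumPy version with made-up hyperparameters, not the system to be built; the neural variants replace the dot product with an MLP):

```python
import numpy as np

def train_mf(ratings, n_users, n_items, k=8, epochs=500, lr=0.05, reg=0.01, seed=0):
    # ratings: iterable of (user, item, rating) triples.
    # Learns latent factors so that P[u] @ Q[i] approximates the rating.
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, k))   # user latent factors
    Q = rng.normal(0, 0.1, (n_items, k))   # item latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # prediction error for this rating
            pu = P[u].copy()               # cache old factors for a correct SGD step
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

On purchase data, the "ratings" would be implicit feedback derived from the buying history rather than explicit scores; handling that is part of the project.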
Web tracking technologies are extensively used to collect large amounts of personal information (PI), including the things we search, the sites we visit, the people we contact, or the products we buy. Although it is commonly believed that this data is mainly used for targeted advertising, some recent works revealed that it is exploited for many other purposes, such as price discrimination, financial credibility, insurance coverage, government surveillance, background scanning or identity theft. The main objective of this project is to apply network traffic monitoring and analysis technologies to uncover the particular methods used to track Internet users and collect PI. This project will be useful for both Internet users and the research community, and will produce open source tools, real data sets, and publications revealing the most privacy-threatening practices. Some preliminary results of our work in this area were recently published in Proceedings of the IEEE (IF: 9.237) and featured in a Wall Street Journal article.
More info at:
A large part of the open source code for big data is written for the JVM, and currently much of this code consists of data mining algorithms and other techniques that fall within the Artificial Intelligence specialization. Besides Java, Python, another interpreted language, is increasingly used in every programming environment, in particular also in artificial intelligence and for machine learning algorithms. Each of the two languages has its adepts, based on its learning curve, its portability, and so on. It is open source, portable, and supports standard data mining tasks such as data pre-processing, classification, clustering, visualization, regression and feature selection. The main objective of this project is: starting from a set of algorithms that characterize the work done in data mining, such as data pre-processing, classification, clustering, visualization, regression and feature selection, to compare the performance of these two languages and see the advantages or disadvantages of using one or the other, depending on the underlying hardware platform. Specifically, x86 and ARM.
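On the Python side, such a comparison could start from a small harness that times a representative mining kernel with the standard timeit module (the kernel and sizes below are illustrative; the Java counterpart and the x86/ARM runs are part of the project):

```python
import timeit

def kmeans_step(points, centers):
    # One assignment + update step of k-means, a representative data mining kernel.
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    assign = [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
              for p in points]
    new_centers = []
    for k in range(len(centers)):
        members = [p for p, a in zip(points, assign) if a == k]
        # Keep the old center if a cluster ends up empty.
        new_centers.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[k])
    return new_centers

# Made-up workload: 200 two-dimensional points, two initial centers.
points = [[float(i % 7), float(i % 3)] for i in range(200)]
centers = [[0.0, 0.0], [5.0, 2.0]]
elapsed = timeit.timeit(lambda: kmeans_step(points, centers), number=20)
```

The same kernel would be implemented in Java and timed with an equivalent harness (e.g. JMH) on each hardware platform.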
Improvement of OpenMP library for HPC on many core systems
Machine Learning with TinyML

TinyML aims to do machine learning on microcontrollers. Microcontrollers are sometimes the only hardware choice when the power supply is limited, e.g. in battery-operated applications. One application area is "wildlife" observation, such as: https://mybirdbuddy.com/ https://opencollar.io/

Arduino Uno and Mega boards are well known for all kinds of hobbyist microcontroller projects, but there is another kind of more powerful 32-bit microcontrollers and development boards which are able to run machine learning applications. To get a first overview of this topic, you can have a look at TensorFlow Lite for Microcontrollers: https://www.tensorflow.org/lite/microcontrollers

We have a couple of the mentioned boards, such as the Arduino Nano 33 BLE Sense, the STM32F746 Discovery kit and Espressif ESP32 microcontrollers, which can be used for this project.

In a first phase, the project will explore the topic with practical example applications and do some reading to get a basic understanding of the background. The second phase can then be shaped according to interest: the project could either develop and deploy a specific application of interest, or focus on analyzing and experimenting with a specific step of the machine learning (ML) pipeline, which starts at data acquisition and building a machine learning model and ends with deploying and evaluating the application. Other ideas can be suggested; integrating an ML component running on a microcontroller with network connectivity into a distributed application can also be discussed.

You can find several TinyML examples on the TensorFlow, Medium or Towards Data Science sites, which people have already tried, with code in GitHub repositories, e.g.:
https://codelabs.developers.google.com/codelabs/ai-magicwand/#0
https://www.digikey.es/en/maker/projects/intro-to-tinyml-part-1-training-a-model-for-arduino-in-tensorflow/8f1fc8c0b83d417ab521c48864d2a8ec
https://towardsdatascience.com/tensorflow-meet-the-esp32-3ac36d7f32c7
The Barcelona Neural Networking Center (BNN-UPC) is offering two positions to develop the Master Thesis in the field of Graph Neural Networks (GNN) applied to computer networking. This TFM will be fully funded and will be carried out in the context of a large industrial project with a major multinational technology company.
Graph Neural Networks (GNN) have been recently proposed to learn, model and generalize over graph-structured data. Computer networks are fundamentally graphs, and many of their relevant characteristics, such as topology and routing, are represented as graph-structured data.
GNNs are a central tool for applying ML techniques to computer networks. GNNs can learn the relationships among complex network characteristics and build relevant models that can be useful to plan and manage a network. In combination with Deep Reinforcement Learning (DRL) techniques, GNNs can help develop autonomous network optimization mechanisms that will result in unprecedented performance, achieving the ultimate vision of self-driving networks.
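As a minimal, illustrative sketch of the message-passing principle behind GNNs (not the models used in the project; the weight matrix W stands in for learned parameters), one layer over a network topology given as an adjacency matrix can be written in NumPy:

```python
import numpy as np

def gnn_layer(H, A, W):
    """One round of message passing.

    H: node feature matrix (one row per node, e.g. per router).
    A: adjacency matrix of the topology (A[i, j] = 1 if i and j are linked).
    W: learned weight matrix (here just a placeholder parameter).

    Each node averages its neighbours' features, mixes them through W,
    and applies a ReLU nonlinearity."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1                 # avoid division by zero for isolated nodes
    M = (A @ H) / deg                 # aggregate neighbour features
    return np.maximum(0, M @ W)       # transform + nonlinearity
```

Stacking several such layers lets information propagate over multi-hop paths, which is what allows a GNN to relate topology and routing to end-to-end behaviour.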
The Barcelona Neural Networking Center (https://bnn.upc.edu) is a new research initiative of UPC with the main goal of carrying out fundamental research in the field of Graph Neural Networks applied to Computer Networks, and providing education and training to the new generation of Computer Networking students.
The main goal of this project is to develop a network monitoring system that can be used by network operators to detect bitcoin miners (or miners of other blockchain technologies) in their network. The system will rely only on network measurements obtained by standard network measurement tools and will estimate interesting characteristics of the detected miners, such as power consumption.

How to apply: Please send an email to with your CV and academic file (a pdf can be generated from the Raco).
UPC is offering a new position to develop the TFG/TFM in the field of Machine Learning and Cybersecurity. This TFM will be fully funded (internship) and carried out in collaboration with the Global Security Operations Center of Nestlé and UPC.
Cybersecurity is becoming an increasingly important challenge for all companies and individuals alike. While big names used to be the main targets in the past, as people's lives move online, anyone is nowadays a potential target for any kind of cyber-attack, ranging from phishing to ransomware or serious privacy issues. In order to fight against these ever-evolving threats, Machine Learning is increasingly being used behind the scenes to design better systems that are capable of self-learning, boosting detection rates and overall resilience to unknown attacks. As AI-based solutions penetrate products across the industry, a new kind of threat that is often overlooked is becoming more and more prominent and dangerous: adversarial machine learning (AML).
AML focuses on designing specific inputs that deceive a previously trained Machine Learning model into misclassifying them for a specific purpose. One of the main flaws of state-of-the-art Machine Learning and Deep Learning algorithms is that they assume that the data they receive is systematically benign, which is generally the case but does not hold when an adversarial input is received. The motivation behind fooling an ML model into thinking that, for example, a new sample is benign when in fact it is malicious can range from pure research to more serious real-life issues, such as an autonomous car wrongly classifying a stop sign (and thus provoking a fatal accident) or a disease wrongly diagnosed because of a slightly manipulated magnetic resonance image.
This problem is no exception for Cybersecurity where companies wrongly assume that once the last AI-based product is deployed in their network, their employees are safe...
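To make the attack concrete, here is a minimal sketch of one well-known AML technique, the Fast Gradient Sign Method (FGSM), applied to a toy logistic-regression "detector" (the weights, bias and input below are made-up values; this is an illustration, not a method prescribed by the project):

```python
import numpy as np

def fgsm(x, w, b, y, eps):
    """One FGSM step against a logistic-regression classifier.

    Moves each feature of x by eps in the direction that increases the
    cross-entropy loss for the true label y (0 or 1), i.e. toward
    misclassification."""
    p = 1 / (1 + np.exp(-(w @ x + b)))   # model's probability of class 1
    grad = (p - y) * w                   # gradient of the loss w.r.t. the input x
    return x + eps * np.sign(grad)

# Toy "detector": a positive score means class 1 (e.g. flagged as malicious).
w, b = np.array([1.0, -2.0]), 0.0
x = np.array([1.0, 0.0])                 # originally classified as malicious
x_adv = fgsm(x, w, b, y=1, eps=0.5)      # adversarial variant now scores benign
```

The same idea scales to deep models, where the input gradient is obtained by backpropagation; defending detectors against such perturbations is exactly the kind of problem this project addresses.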
The identification of the applications behind network traffic (i.e., traffic classification) is crucial for ISPs and network operators to better manage and control their networks. However, the increasing use of encryption and web-based applications makes this identification very challenging. This problem is exacerbated by the widespread deployment of content distribution networks (e.g., Akamai) and cloud-based services (e.g., Amazon AWS). The goal of this project is to develop a traffic monitoring tool to accurately identify web services from HTTPS traffic, including Google, YouTube, Facebook and Twitter, among others. The tool will combine the information from IP addresses and DNS with novel classification methods inspired by the Google PageRank algorithm to identify encrypted traffic, even if served from Akamai, AWS or Google infrastructures. This project will be carried out in collaboration with the tech-based company Talaia Networks (https://www.talaia.io), which develops cloud-based network monitoring solutions.
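For reference, the classic PageRank iteration that inspires those classification methods can be sketched in a few lines of pure Python (a toy graph; the project's actual classifier over IP/DNS relationships is not specified here):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank.

    links: dict mapping each node to the list of nodes it points to
           (e.g. domains linked by shared infrastructure).
    d:     damping factor."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = d * rank[v] / len(outs)
                for w in outs:
                    new[w] += share          # each node splits its rank among its links
            else:
                for w in nodes:              # dangling node: spread its rank evenly
                    new[w] += d * rank[v] / n
        rank = new
    return rank
```

In a traffic-classification setting, the graph nodes would be entities observed in the traffic (IPs, domains) and the ranks would propagate identification confidence across related entities.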
How to apply: Please send an email to firstname.lastname@example.org with your CV and academic file (pdf can be generated from the Raco).