Mining Unstructured Data

Credits: 6
Type: Compulsory
Requirements: This subject has no formal requirements, but it does assume previous capacities.
Department: CS
Web: https://www.cs.upc.edu/~turmo/mud/plan0a6/MUD.html
The goal of this course is to give students the fundamentals of Natural Language Processing (NLP). Concretely, the course introduces the most relevant problems in NLP, the main techniques and resources used to tackle them, and the theories they are based on. Brief descriptions of the most relevant NLP applications are also included.
The course is organized along two main axes: (1) computational formalisms for describing natural language processes, and (2) statistical and machine learning methods for acquiring linguistic models from large data collections and solving specific linguistic tasks.

Teachers

Person in charge

Others

Weekly hours

Theory: 1.5
Problems: 0.5
Laboratory: 2
Guided learning: 0
Autonomous learning: 7.11

Competences

Information literacy

  • CT4 - Capacity for managing the acquisition, the structuring, analysis and visualization of data and information in the field of specialisation, and for critically assessing the results of this management.
Third language

  • CT5 - Achieving a level of spoken and written proficiency in a foreign language, preferably English, that meets the needs of the profession and the labour market.
Entrepreneurship and innovation

  • CT1 - Know and understand the organization of a company and the sciences that govern its activity; have the ability to understand labor standards and the relationships between planning, industrial and commercial strategies, quality and profit. Being aware of and understanding the mechanisms on which scientific research is based, as well as the mechanisms and instruments for transferring results among socio-economic agents involved in research, development and innovation processes.
Basic

  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
  • CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
  • CB8 - Capability to communicate their conclusions, and the knowledge and rationale underpinning these, to both skilled and unskilled public in a clear and unambiguous way.
  • CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.
  • CB10 - Possess and understand knowledge that provides a basis or opportunity to be original in the development and/or application of ideas, often in a research context.
Generic

  • CG2 - Identify and apply methods of data analysis, knowledge extraction and visualization for data collected in disparate formats
Specific

  • CE6 - Design the Data Science process and apply scientific methodologies to obtain conclusions about populations and make decisions accordingly, from both structured and unstructured data and potentially stored in heterogeneous formats.
  • CE7 - Identify the limitations imposed by data quality in a data science problem and apply techniques to smooth their impact
  • CE11 - Analyze and extract knowledge from unstructured information using natural language processing techniques, text and image mining
  • CE12 - Apply data science in multidisciplinary projects to solve problems in new or poorly explored domains from a data science perspective that are economically viable, socially acceptable, and in accordance with current legislation
  • CE13 - Identify the main threats related to ethics and data privacy in a data science project (both in terms of data management and analysis) and develop and implement appropriate measures to mitigate these threats
    Objectives

    1. Know and understand basic NLP tasks and their application to text analysis.
      Related competences: CT4, CT1, CG2, CE6, CE7, CE11, CB6, CB7, CB10
    2. Know, understand, and apply text mining techniques, including entity recognition, sentiment analysis, and document retrieval.
      Related competences: CT4, CT5, CE11, CE12, CB6, CB7, CB8, CB9
    3. Know, understand, and apply basic principles of deep learning in unstructured data tasks, such as natural language processing or computer vision.
      Related competences: CT4, CT5, CG2, CE6, CE7, CE11, CE13, CB6, CB7, CB8, CB9, CB10

    Contents

    1. Natural language processing and its application to text analysis
      Introduction: What is NLP and its applications
    2. Natural language processing stages
      Text segmentation: sentence splitting, tokenization; morphological analysis, PoS tagging, syntactic parsing
    3. Text classification, text similarity
      Similarity measures for text. String edit based distances. Vector and set distance measures, distributional semantics. Document retrieval.
      Text classification: Sentiment analysis
    4. Information extraction: Entity recognition, relation extraction
    5. Deep learning techniques for the analysis of non-structured data
      Word embeddings, neural language processing
    6. Main deep learning architectures for non-structured data
      Recurrent NN, Convolutional NN, Transformers
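
As a rough illustration of the similarity measures named in item 3 (string edit distances, set measures, and vector measures), the following minimal sketch may help. It is not taken from the course materials: the function names, the example strings, and the choice of Levenshtein distance, Jaccard similarity, and cosine similarity as representatives are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def jaccard(x: set, y: set) -> float:
    """Set overlap |x & y| / |x | y|, taken as 1.0 for two empty sets."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, split on whitespace only.
doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the rug".split()

print(levenshtein("kitten", "sitting"))        # 3
print(jaccard(set(doc_a), set(doc_b)))         # 3 shared words out of 7 distinct
print(cosine(Counter(doc_a), Counter(doc_b)))  # approximately 0.75
```

The edit distance compares character sequences, whereas the set and vector measures compare word-level representations, which is why the last two calls first turn each document into a set or a count vector.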

    Activities

    Activity Evaluation act


    Lab project


    Objectives: 3
    Week: 16 (Outside class hours)
    Theory: 0h
    Problems: 0h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    Final exam


    Objectives: 1, 2
    Week: 16 (Outside class hours)
    Theory: 0h
    Problems: 0h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    NLP and its applications

    Introduction. What is NLP, tasks, components, and applications.
    Objectives: 1
    Contents:
    Theory: 2h
    Problems: 0h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    Natural language processing stages

    Text segmentation: sentence splitting/tokenization; morphological analysis; PoS tagging; syntactic parsing.
    Objectives: 1
    Contents:
    Theory: 7.3h
    Problems: 2.5h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h
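
To make the text segmentation stage listed above more concrete, here is a deliberately naive sketch of sentence splitting and tokenization using regular expressions. The rules, function names, and example text are assumptions for illustration only and are not the course's method; real splitters must also handle abbreviations, decimals, quotations, and language-specific conventions.

```python
import re

# Naive rules (assumptions for this sketch):
# a sentence ends at '.', '!' or '?' followed by whitespace;
# a token is either a run of word characters or a single punctuation mark.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences at terminal punctuation plus whitespace."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


text = "NLP is fun. Is tokenization hard? Sometimes!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
# ['NLP', 'is', 'fun', '.']
# ['Is', 'tokenization', 'hard', '?']
# ['Sometimes', '!']
```

Later stages such as morphological analysis, PoS tagging, and syntactic parsing typically consume the token lists produced at this step.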

    Text classification, text similarity

    Similarity measures for text. String edit based distances. Vector and set distance measures, distributional semantics. Document retrieval. Text classification: Sentiment analysis
    Objectives: 2
    Contents:
    Theory: 1.5h
    Problems: 0.5h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    Information extraction: Entity recognition, relation extraction


    Objectives: 1, 2
    Contents:
    Theory: 1.5h
    Problems: 0.5h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    Deep learning techniques for the analysis of non-structured data

    Word embeddings, neural language processing
    Objectives: 3
    Contents:
    Theory: 4.5h
    Problems: 2h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    Main deep learning architectures for non-structured data

    Recurrent NN, Convolutional NN, Transformers
    Objectives: 3
    Contents:
    Theory: 3.5h
    Problems: 1.5h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    Theory: 0h
    Problems: 0h
    Laboratory: 6h
    Guided learning: 0h
    Autonomous learning: 0h

    Theory: 0h
    Problems: 0h
    Laboratory: 6h
    Guided learning: 0h
    Autonomous learning: 0h

    Theory: 0h
    Problems: 0h
    Laboratory: 6h
    Guided learning: 0h
    Autonomous learning: 0h

    Partial exam


    Objectives: 3
    Week: 8 (Outside class hours)
    Theory: 0h
    Problems: 0h
    Laboratory: 0h
    Guided learning: 0h
    Autonomous learning: 0h

    Teaching methodology

    Participative lectures with theoretical and practical content
    Practical sessions with student participation for the resolution of exercises related to the course contents
    Lab project (team work)
    Consulting sessions

    Evaluation methodology

    Lab projects 40% + partial exam 30% + final exam 30%
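
Read as a weighted average, the weights above can be written as the formula below. This is only a sketch: it assumes that all three components are marked on the same scale and that the lab projects are combined into a single mark, neither of which is stated here. The symbols are introduced for illustration.

```latex
\[
G = 0.40 \cdot G_{\text{lab}} + 0.30 \cdot G_{\text{partial}} + 0.30 \cdot G_{\text{final}}
\]
```

For example, hypothetical marks of 8 (lab), 6 (partial exam), and 7 (final exam) would give 3.2 + 1.8 + 2.1 = 7.1.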

    Previous capacities

    Advanced Python programming skills
    Math and statistics skills at the level of an engineering/tech/science university degree
    Fundamentals of machine learning