Statistical Natural Language Processing

Credits: 6
Types: Complementary specialization (Data Science)
Requirements: This subject has no requirements
Department: CS
This course is an introduction to the most relevant tasks, applications, techniques and resources involved in empirical Natural Language Processing (NLP), that is, NLP based on statistical and Machine Learning (ML) methods.

Weekly hours

Theory: 3
Problems: 0
Laboratory: 0
Guided learning: 0.6
Autonomous learning: 6.4

Competences

Generic Technical Competences

Generic

  • CG1 - Capability to apply the scientific method to the study and analysis of phenomena and systems in any area of Computer Science, and in the conception, design and implementation of innovative and original solutions.
  • CG3 - Capacity for mathematical modeling, calculation and experimental design in technology centers and company engineering centers, particularly in research and innovation in all areas of Computer Science.

Transversal Competences

Information literacy

  • CTR4 - Capability to manage the acquisition, structuring, analysis and visualization of data and information in the area of informatics engineering, and critically assess the results of this effort.

Reasoning

  • CTR6 - Capacity for critical, logical and mathematical reasoning. Capability to solve problems in their area of study. Capacity for abstraction: the capability to create and use models that reflect real situations. Capability to design and implement simple experiments, and analyze and interpret their results. Capacity for analysis, synthesis and evaluation.

Basic

  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
  • CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
  • CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.

Technical Competences of each Specialization

Specific

  • CEC1 - Ability to apply scientific methodologies in the study and analysis of phenomena and systems in any field of Information Technology as well as in the conception, design and implementation of innovative and original computing solutions.
  • CEC2 - Capacity for mathematical modelling, calculation and experimental design in engineering technology centres and business, particularly in research and innovation in all areas of Computer Science.
  • CEC3 - Ability to apply innovative solutions and make progress in knowledge that exploits the new paradigms of Informatics, particularly in distributed environments.

Objectives

  1. Justify the appropriateness of specific statistical techniques for addressing specific NLP tasks.
    Related competences: CG1, CG3, CB6, CB7, CEC1, CEC2, CEC3, CTR4, CTR6
  2. Evaluate the usefulness of statistical components to be included in NLP applications for carrying out NLP tasks.
    Related competences: CG3, CEC3
  3. Search for and select statistical NLP resources and tools that can be used in NLP tasks and applications.
    Related competences: CG3, CB7, CB9, CEC1, CEC2, CEC3, CTR4
  4. Design and implement new NLP components, tune existing components, and integrate them into an NLP application.
    Related competences: CG3, CB6, CB9, CEC2, CEC3, CTR4

Contents

  1. Introduction & basics
    NLP vs Computational Linguistics vs HLT
    Knowledge-based vs Empirical methods
    Resources
      Lexical resources
      Corpora
      Grammars
      Ontologies
  2. Language Models
    Basics
    {word, class, phrase}-based models
    Information content
      Entropy
      Mutual information
      Joint and conditional entropy
      Pointwise mutual information
      Kullback-Leibler (KL) divergence
    Application to NLP tasks
    Noisy channel models
    Alignment models
    Application to NLP tasks
  3. Finite State Models
    Finite State Automata (FSA) and Regular grammars
    Finite State Transducers (FST)
    Finite State Probabilistic models
    Application to NLP tasks
  4. Log linear & Maximum Entropy Models
    Classification problems – MLE vs MEM
    Generative and conditional (discriminative) models
    Markov Models (MM) and Hidden Markov Models (HMM)
    Conditional Random Fields (CRF)
    Building ME models
    Maximum Entropy Markov Models (MEMM)
    Applications to NLP
  5. Models for parsing
    Constituent parsing
    Stochastic Context Free Grammars (SCFG)
    Richer probabilistic models
    Applications to NLP
    Syntactic parsing
    Semantic parsing
    Dependency parsing
  6. Supervised Machine Learning for NLP
    Classification problems
    Margin-based classifiers: Perceptron, SVM, AdaBoost
    Kernel-based methods
  7. Semi-supervised Learning
    Bootstrapping
  8. Unsupervised Learning (Clustering)
    Similarity
    Hierarchical clustering
    Non-hierarchical clustering
    Clustering evaluation
  9. Using statistical techniques for NLP applications
    Machine Translation (MT) in detail
    Other NLP tasks (Part of Speech (POS) tagging, Named Entity Recognition and Classification (NERC), Mention detection & tracking, Coreference resolution, Text Alignment, Lexical Acquisition, Relation Extraction, Semantic Role Labeling (SRL), Word Sense Disambiguation (WSD)) and applications (Information Extraction (IE), Information Retrieval (IR), Question Answering (Q&A), Automatic Summarization, Sentiment Analysis, and Text Classification), covered only in outline.

Activities


Introduction & basics

Attending the theory class; homework discussion and tutoring.
Objectives: 2
Contents:
Theory: 3h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 4h

Language Models

Attending the theory class; homework discussion and tutoring.
Objectives: 1 3
Contents:
Theory: 6h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 10h

Finite State Models

Attending the theory class; homework discussion and tutoring.
Objectives: 1 2 3
Contents:
Theory: 3h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 4h

Log linear & Maximum Entropy Models

Attending the theory class; homework discussion and tutoring.
Objectives: 1 2 3 4
Contents:
Theory: 9h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 12h

Models for parsing

Attending the theory class; homework discussion and tutoring.
Objectives: 1 2 4
Contents:
Theory: 6h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 10h

Supervised Machine Learning for NLP

Attending the theory class; homework discussion and tutoring.
Objectives: 1 2 4
Contents:
Theory: 3h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 5h

Semi-supervised Learning

Attending the theory class; homework discussion and tutoring.
Objectives: 1 2 3 4
Contents:
Theory: 3h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 5h

Unsupervised Learning (Clustering)

Attending the theory class; homework discussion and tutoring.
Objectives: 1 2 3 4
Contents:
Theory: 3h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 4h

Using statistical techniques for NLP applications

Attending the theory class; homework discussion and tutoring.
Objectives: 1 2 3 4
Contents:
Theory: 9h
Problems: 0h
Laboratory: 0h
Guided learning: 1h
Autonomous learning: 9h

Homeworks

Students will complete the 5 homeworks at home, with advice available from the teachers. Each homework is due two weeks after it is proposed. The evaluation will include comments on the student's work.
Objectives: 4
Contents:
Theory: 0h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 30h

Final Exam

Final exam of the course. The exam will be held in the classroom.
Objectives: 1 2 3
Week: 16
Type: final exam
Theory: 3h
Problems: 0h
Laboratory: 0h
Guided learning: 0h
Autonomous learning: 0h

Teaching methodology

The teaching methodology is as follows:

Each of the 9 topics is covered in one or more theory classes (one class being the most frequent case). The material (slides, readings, etc.) is made available in advance.

Additionally, a set of homeworks tied directly to the different topics will be proposed to the students during the course (usually 5 homeworks). Some of these can be solved by hand; others require writing a short program.

Evaluation methodology

The evaluation is based on two components:

1) The final exam
2) The grades of the 5 homeworks

The final grade is computed from the grades of these two components.

The two components carry equal weight (50% each).
Within the homework component, the five homeworks carry equal weight (20% each).
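
As a rough illustration of the weighting described above, the following Python sketch computes a final grade. The function name `final_grade` and the assumption that the exam and all homeworks are graded on the same numeric scale (for example 0 to 10) are illustrative only; they are not stated in the course description.

```python
def final_grade(exam: float, homeworks: list[float]) -> float:
    """Combine the exam and homework grades using the stated weights.

    Assumption (not from the course page): every grade uses the same
    numeric scale, so the weighted sum stays on that scale.
    """
    if len(homeworks) != 5:
        raise ValueError("expected exactly 5 homework grades")

    # Each homework counts for 20% of the homework component.
    homework_component = sum(0.20 * grade for grade in homeworks)

    # The exam and the homework component each count for 50%.
    return 0.50 * exam + 0.50 * homework_component


if __name__ == "__main__":
    print(final_grade(8.0, [6.0, 7.0, 8.0, 9.0, 10.0]))
```

Under these assumptions, each individual homework contributes 10% of the final grade (20% of a component worth 50%).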

Bibliography

Basic: