Quantitative linguistics is a branch of linguistics primarily concerned with the statistical patterns of language (the so-called linguistic laws), their explanation, and theory construction. The course is relevant to anybody interested in what languages (and animal communication systems) are like and why.
This course covers a myriad of statistical laws of language (beyond the scope of traditional courses on information retrieval or natural language processing), how to analyze them, and where they come from.
A fundamental working hypothesis is that these laws emerge from the need to reduce the cognitive effort of speakers or listeners. This course emphasizes potential explanations in terms of general principles of cognition in humans and other species. It covers the mathematical and computational models that have been developed to explain these regularities. Along the way, students will enrich their current knowledge with concepts and tools from linguistics, biology, cognitive science, information theory and multidisciplinary physics, under the hawk-eyed perspective of philosophy of science.
The course is relevant to researchers interested in getting the most out of linguistic data, as well as in evaluating or adapting algorithms, machine learning methods and similar tools based on the real statistical properties of language and the underlying theory. As these regularities are often the result of reducing the cognitive effort of language users, the course is also relevant to researchers interested in developing resources or systems that are easier for humans to use or understand, or in developing language processing tools that exploit the real constraints of the human brain.
Teachers
Person in charge
Ramon Ferrer Cancho
Weekly hours
Theory: 2.5
Problems: 0.5
Laboratory: 1
Guided learning: 0
Autonomous learning: 7.11
Objectives
Know the foundations of science and the scientific method. Understand the difference between hypothesis and theory, between modeling and understanding, between describing and explaining, between manifestation and principle. Understand the value of prediction and the types of prediction.
Related competences: CTR6
Learn about the statistical laws of language and their origins.
Related competences: CTR4, CTR6, CTR7
Know and understand the principles of organization of languages and other communication systems.
Related competences: CTR6, CTR7
Know the mathematical foundations of quantitative linguistics. Know basic probability theory and information theory.
Related competences: CTR6, CTR7
Know the statistical analysis methods of quantitative linguistics.
Related competences: CTR4
Learn how to write a scientific article. Know how to distinguish between a laboratory report and a research paper.
Related competences: CTR3, CTR4, CTR6, CTR7, CTR9
Contents
Introduction to Quantitative Linguistics
What is quantitative linguistics? Overview of linguistic laws, key concepts and research problems in quantitative linguistics.
Law of abbreviation and problem of compression
The law of abbreviation in humans and other species. Methods of analysis of the law of abbreviation. Introduction to information theory. Predictions of optimal coding.
Information theory
Classic information theory and extensions for natural communication systems.
Theory of power laws
Relationships between power laws. Inference of power laws. Power-law analysis methods.
Models of Zipf's law for word frequencies
Debowski's bounds. Classic models. Zipfian optimization models of communication.
The statistical structure of symbolic sequences
Word returns. Correlations in symbolic sequences. Persistence and antipersistence. n-gram models. Generative models.
Dependency syntax
Introduction to dependency syntax. Formal constraints on syntactic dependency structures.
Word order theory
Word order principles. Predictions. The order of subject (S), direct object (O) and verb (V).
Theory construction
The scientific method. A general theory. Closing.
Activities
Activity / Evaluation act
Introduction
Introduction to quantitative linguistics. Introduction to the course. Objectives: 1, 2 Contents:
The theory sessions will be delivered primarily by the professor using either the blackboard or projected slides.
The lab work will be done at the computer. Students are expected to work on their assignment during the session, and the professor will explain everything needed to follow the class at the beginning of the session. Each lab session will be accompanied by a thorough guide describing the work that needs to be done.
The research project will be carried out under the supervision of the professor.
All the material relevant to the course will be available from Racó or the course's website.
Evaluation methodology
Grading is based on exams and on reports on various tasks (labs and a research project) throughout the course.
There will be two partial exams, which count toward 30% of the final grade. Students are expected to hand in 4 lab reports, each about two weeks after the corresponding lab session, which together count toward 30% of the final grade. Finally, students will have to deliver a research project by the end of the course that accounts for 40% of the final grade. The research project is the most important activity and must be understood as a course project (not as one more lab). Labs must be understood as training for the research project.
The formula to compute the final grade is therefore
where P1 is the score of the first partial exam, P2 is the score of the second partial exam, Li stands for the grade of the i-th lab and RP is the grade of the research project.
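The weights stated above (30% partial exams, 30% labs, 40% research project) can be sketched in code. This is only a minimal sketch: it assumes that the two partial exams are weighted equally and that the four labs are weighted equally, which the text does not state explicitly. The function name `final_grade` and the sample grades are hypothetical.

```python
from statistics import mean


def final_grade(p1, p2, labs, rp):
    """Combine P1, P2, the lab grades Li and RP with weights 30% / 30% / 40%."""
    if len(labs) != 4:
        raise ValueError("expected exactly 4 lab grades")
    exams = mean([p1, p2])  # assumption: both partial exams weigh the same
    lab_avg = mean(labs)    # assumption: all four labs weigh the same
    return 0.3 * exams + 0.3 * lab_avg + 0.4 * rp


# Hypothetical grades, for illustration only.
print(round(final_grade(6.0, 8.0, [7.0, 7.0, 9.0, 9.0], 8.0), 2))
```

Under these assumptions each partial exam contributes 15% and each lab 7.5% of the final grade, so the research project alone outweighs both exams combined.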
Information theory meets power laws: Stochastic processes and language models - Debowski, Lukasz, Wiley, 2021.
Web links
Laws of language outside human language. Statistical laws of language in the behavior of other species, genomes and beyond https://cqllab.upc.edu/biblio/laws/