Topic Overview
(*) indicates that you should be familiar with the basic concept, but details will not be tested on the exam.
General Linguistics Concepts (Parts of J&M 2nd ed. Ch. 1, but mostly split over different chapters in J&M; Ch. 1 is not yet available in the 3rd ed.)
- Levels of linguistic representation: phonetics/phonology, morphology, syntax, semantics, pragmatics
- Ambiguity: know some examples in syntax and semantics, including PP attachment and noun-noun compounds.
- Garden-path sentences.
- Type/Token distinction.
- Know the following terms: word form, stem, lemma.
- Parts of speech:
  - Know the 9 traditional POS and some of the Penn Treebank tags (but no need to remember the complete tagset).
- Types of Linguistic Theories (Prescriptive, Descriptive, Explanatory)
- Syntax:
  - Constituency and Recursion. Constituency tests (*).
  - Dependency.
  - Grammatical Relations.
  - Subcategorization.
  - Long-distance dependencies. (*)
  - Syntactic heads (connection between dependency and constituency structure).
  - Center embeddings.
- Dependency syntax:
  - Head, dependent
  - Dependency relations and labels
  - Projectivity
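The dependency-syntax items above (head, dependent, labeled relations, projectivity) can be made concrete with a small sketch: a tree stored as (dependent, head, label) triples plus a check for crossing arcs. This is illustrative Python only; the sentence, indices, and function name are my own, not from the readings.

```python
# A dependency tree as (dependent_index, head_index, label) triples.
# Indices are 1-based; head 0 is the artificial ROOT.
# Invented example analysis of "She saw a cat":
arcs = [
    (1, 2, "nsubj"),   # She  <- saw
    (2, 0, "root"),    # saw  <- ROOT
    (3, 4, "det"),     # a    <- cat
    (4, 2, "obj"),     # cat  <- saw
]

def is_projective(arcs):
    """A tree is projective if no two arcs cross when drawn above the sentence."""
    spans = [(min(d, h), max(d, h)) for d, h, _ in arcs]
    for i, (l1, r1) in enumerate(spans):
        for l2, r2 in spans[i + 1:]:
            # Two arcs cross if exactly one endpoint of one lies strictly inside the other.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective(arcs))  # True: no crossing arcs in this tree
```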
Text Processing (Split over different chapters in J&M. Parts of J&M 3rd ed. Ch. 6)
- Tokenization (word segmentation).
- Sentence splitting.
- Lemmatization.
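As a rough illustration of tokenization and sentence splitting (not the method of any particular chapter), here is a naive regex-based sketch; real tools also handle abbreviations, clitics, hyphenation, URLs, and more. The example text and the regular expressions are my own.

```python
import re

text = "Mr. Smith bought 300 shares. He didn't sell them."

# Naive sentence splitting on sentence-final punctuation followed by whitespace.
# (This wrongly splits after "Mr.", which is exactly why abbreviations make
# sentence splitting harder than it looks.)
sentences = re.split(r'(?<=[.!?])\s+', text)

# Naive tokenization: keep runs of word characters together, split off punctuation.
def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

for s in sentences:
    print(tokenize(s))
```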
Probability Background
- Prior vs. conditional probability.
- Sample space, basic outcomes
- Probability distribution
- Events
- Random variables
- Bayes' rule
- conditional independence
- discriminative vs. generative models
- Noisy channel model.
- Calculating with probabilities in log space.
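A tiny worked example of Bayes' rule and of calculating with probabilities in log space; all the numbers are invented for illustration.

```python
import math

# Bayes' rule with invented numbers:
# P(spam | word) = P(word | spam) * P(spam) / P(word),
# where P(word) is obtained by marginalizing over both classes.
p_spam = 0.4
p_word_given_spam = 0.05
p_word_given_ham = 0.01
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))           # 0.769

# Calculating with probabilities in log space: multiplying many small
# probabilities underflows to 0.0, so we add their logs instead.
probs = [1e-5] * 100
print(math.prod(probs))                      # 0.0 (underflow)
print(sum(math.log(p) for p in probs))       # about -1151.3, easily representable
```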
Text Classification (J&M 3rd ed. Ch. 6)
- Task definition and applications
- Document representation: Set/Bag-of-words, vector space model
- Naive Bayes and its independence assumptions in text classification.
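A minimal bag-of-words Naive Bayes sketch with add-one smoothing. The toy corpus, labels, and helper names are invented; the point is only the independence assumption: class prior times a product of per-word likelihoods.

```python
import math
from collections import Counter

# Toy training data (invented): bag-of-words documents with class labels.
train = [
    ("pos", "great fun fun film".split()),
    ("pos", "great acting".split()),
    ("neg", "boring film".split()),
]

classes = {"pos", "neg"}
vocab = {w for _, doc in train for w in doc}
prior = {c: sum(1 for lab, _ in train if lab == c) / len(train) for c in classes}
counts = {c: Counter(w for lab, doc in train if lab == c for w in doc) for c in classes}

def log_likelihood(word, c):
    # Add-one (Laplace) smoothed estimate of P(word | class).
    return math.log((counts[c][word] + 1) /
                    (sum(counts[c].values()) + len(vocab)))

def classify(doc):
    # Naive Bayes: argmax_c  log P(c) + sum_w log P(w | c), ignoring unseen words.
    scores = {c: math.log(prior[c]) +
                 sum(log_likelihood(w, c) for w in doc if w in vocab)
              for c in classes}
    return max(scores, key=scores.get)

print(classify("fun film".split()))  # 'pos' on this toy data
```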
Language Models (J&M 2nd ed. Ch 4.1-4.8, J&M 3rd ed. Ch 4)
- Task definition and applications
- Probability of the next word and probability of a sentence
- Markov independence assumption.
- n-gram language models.
- Role of the START and END markers.
- Estimating n-gram probabilities from a corpus:
  - Maximum Likelihood Estimates
  - Dealing with unseen tokens
- Smoothing and Back-off:
  - Additive Smoothing
  - Discounting
  - Linear Interpolation
  - Katz' Backoff
- Perplexity
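A minimal bigram language model sketch: maximum likelihood counts, add-one smoothing, START/END markers, and perplexity on a two-sentence toy corpus (invented). Conventions differ on exactly which tokens count toward the normalization, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Toy corpus (invented); <s> and </s> are the START and END markers.
corpus = [
    "<s> the cat sat </s>".split(),
    "<s> the dog sat </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
vocab_size = len(unigram_counts)

def p_mle(w2, w1):
    # Maximum likelihood estimate: count(w1 w2) / count(w1). Zero for unseen bigrams.
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def p_addone(w2, w1):
    # Additive (add-one) smoothing reserves probability mass for unseen bigrams.
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

def perplexity(sent, p):
    # Perplexity = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the N bigram events.
    log_prob = sum(math.log(p(w2, w1)) for w1, w2 in zip(sent, sent[1:]))
    return math.exp(-log_prob / (len(sent) - 1))

test = "<s> the cat sat </s>".split()
print(perplexity(test, p_mle))      # lower: every bigram was seen in training
print(perplexity(test, p_addone))   # higher: smoothing takes mass from seen bigrams
```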
Sequence Labeling (POS tagging) (J&M 2nd ed Ch 5.1-5.5, J&M 3rd ed. Ch 10.1-10.4)
- Linguistic tests for part of speech (*).
- Hidden Markov Model:
  - Observations (sequence of tokens)
  - Hidden states (sequence of part of speech tags)
  - Transition probabilities, emission probabilities
  - Markov chain
- Three tasks on HMMs: Decoding, Evaluation, Training
  - Decoding: Find the most likely sequence of tags:
    - Viterbi algorithm (dynamic programming, know the algorithm and data structures involved; see the sketch at the end of this section)
  - Evaluation: Find the probability of a sequence of words:
    - Spurious ambiguity: multiple hidden state sequences can lead to the same observation sequence.
    - Forward algorithm (difference to Viterbi).
  - Training: We only discussed maximum likelihood estimates.
- Extending HMMs to trigrams. (*)
- Applying HMMs to other sequence labeling tasks, for example Named Entity Recognition
- B.I.O. tags for NER.
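The Viterbi sketch referenced above: decoding with a bigram HMM tagger, using a dynamic-programming table and backpointers. The tagset, transition, and emission probabilities are invented toy values, not estimates from any corpus.

```python
import math

# Toy HMM (probabilities invented for illustration).
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {                      # P(tag_i | tag_{i-1})
    "DET":  {"DET": 0.05, "NOUN": 0.85, "VERB": 0.10},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10},
}
emit_p = {                       # P(word | tag)
    "DET":  {"the": 0.7, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.4, "barks": 0.1},
    "VERB": {"the": 0.0, "dog": 0.05, "barks": 0.3},
}

def viterbi(words):
    # table[t][s]: log probability of the best tag sequence ending in state s at position t.
    # backptr[t][s]: the previous tag on that best path (used to recover the sequence).
    def logp(x):
        return math.log(x) if x > 0 else float("-inf")

    table = [{s: logp(start_p[s]) + logp(emit_p[s].get(words[0], 0)) for s in states}]
    backptr = [{}]
    for t in range(1, len(words)):
        table.append({})
        backptr.append({})
        for s in states:
            best_prev = max(states, key=lambda q: table[t - 1][q] + logp(trans_p[q][s]))
            table[t][s] = (table[t - 1][best_prev] + logp(trans_p[best_prev][s])
                           + logp(emit_p[s].get(words[t], 0)))
            backptr[t][s] = best_prev
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: table[-1][s])
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(backptr[t][tags[-1]])
    return list(reversed(tags))

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```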
Parsing with Context-Free Grammars (J&M 2nd ed. Ch. 12, 13.1-13.4, 14.1-14.4, and Ch. 16; J&M 3rd ed. Ch. 11.1-11.5 and Ch. 13.1-13.4)
- Tree representation of constituency structure. Phrase labels.
- CFG definition: terminals, nonterminals, start symbol, productions/rules
- derivations and language of a CFG.
- derivation trees (vs. derived string), parse tree.
- Regular grammars / languages (*)
- Complexity classes.
- "Chomsky hierarchy"
- Center embeddings as an example of a non-regular phenomenon.
- Cross-serial dependencies as an example of a non-context-free phenomenon. (*)
- Probabilistic context-free grammars (PCFG):
  - Probability of a tree vs. probability of a sentence.
  - Maximum likelihood estimates from treebanks.
  - Computing the most likely parse tree using CKY.
- Recognition (membership checking) problem vs. parsing
- Top-down vs. bottom-up parsing.
- CKY parser:
  - Bottom-up approach.
  - Chomsky normal form.
  - Dynamic programming algorithm (know the algorithm and the required data structure: the CKY parse table). Split position.
  - Backpointers.
- Parsing with PCFGs
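A sketch of CKY recognition (membership checking) for a toy grammar in Chomsky normal form, showing the parse table and the split position. Storing backpointers in each cell, or rule probabilities for a PCFG, is what turns this recognizer into a parser. The grammar, sentence, and function name are invented for illustration.

```python
# Toy grammar in Chomsky normal form (invented).
# Binary rules A -> B C and lexical rules A -> word.
binary_rules = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("NP", "DET", "N"),
]
lexical_rules = {
    "the": {"DET"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}

def cky_recognize(words):
    n = len(words)
    # table[i][j]: set of nonterminals that derive the span words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexical_rules.get(w, set()))
    for span in range(2, n + 1):              # fill shorter spans first
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # k is the split position
                for a, b, c in binary_rules:
                    if b in table[i][k] and c in table[k][j]:
                        table[i][j].add(a)
    return "S" in table[0][n]

print(cky_recognize("the dog chased the cat".split()))  # True
```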
Dependency Parsing (J&M 3rd ed. Ch. 14.1-14.5; supplementary material: Kübler, McDonald, and Nivre (2009): Dependency Parsing, Ch. 1 to 4.2)
- Transition-based dependency parsing:
  - States (configuration): Stack, Buffer, Partial dependency tree A
  - Transitions (Arc-standard system): Shift, Left-Arc, Right-Arc (see the sketch at the end of this section)
  - Predicting the next transition using discriminative classifiers.
  - Feature definition (address + attribute)
  - Training the parser from a treebank:
    - Oracle transitions from the annotated dependency tree.
  - Arc-eager system (*)
- Graph-based dependency parsing (* know the general idea: each word needs to get a head; start with a completely connected graph, score each edge, and keep the highest-scoring edges)
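The arc-standard sketch referenced above: the Shift, Left-Arc, and Right-Arc transitions on a configuration (stack, buffer, arcs), applied here with a hand-written transition sequence. In a trained parser, a discriminative classifier would predict each transition from features of the configuration. The sentence, labels, and helper function names are my own.

```python
# Arc-standard transitions on a configuration (stack, buffer, arcs).

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    dep = stack.pop(-2)              # second-from-top becomes a dependent of the top
    arcs.append((stack[-1], dep, label))

def right_arc(stack, buffer, arcs, label):
    dep = stack.pop()                # top becomes a dependent of the new top
    arcs.append((stack[-1], dep, label))

# "ROOT She saw cats": word indices 0..3, with 0 the artificial ROOT.
words = ["ROOT", "She", "saw", "cats"]
stack, buffer, arcs = [0], [1, 2, 3], []

shift(stack, buffer, arcs)                   # stack [0, 1]
shift(stack, buffer, arcs)                   # stack [0, 1, 2]
left_arc(stack, buffer, arcs, "nsubj")       # She  <- saw
shift(stack, buffer, arcs)                   # stack [0, 2, 3]
right_arc(stack, buffer, arcs, "obj")        # cats <- saw
right_arc(stack, buffer, arcs, "root")       # saw  <- ROOT

print([(words[h], words[d], l) for h, d, l in arcs])
# [('saw', 'She', 'nsubj'), ('saw', 'cats', 'obj'), ('ROOT', 'saw', 'root')]
```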
Machine Learning (Some textbook references below)
- Generative vs. discriminative algorithms
- Supervised learning. Classification vs. regression problems, multi-class classification
- Loss functions: Least squares error (*). Classification error.
- Training error (empirical risk) vs. test error (validation risk). Overfitting.
- Linear Models.
  - Activation function (step function)
  - Bias weight
  - Perceptron learning algorithm (*)
  - Linear separability and the XOR problem. (*)
- Feature functions
- Log-linear models
  - Model formulation
  - Logistic regression and logits (*)
  - Log likelihood as a function of the weight vector.
  - Gradient ascent
  - Derivative of the log likelihood function (*)
  - Regularization (*)
- Maximum Entropy Markov Models (for POS tagging)
- Feed-forward neural nets (J&M 3rd ed. Ch. 8, as well as Goldberg Ch. 3 & 5.1)
  - Multilayer neural nets.
  - Different activation functions (sigmoid, ReLU, tanh)
  - Computation Graph.
  - Softmax.
  - Backpropagation (*) (know the idea of propagating back partial derivatives of the loss with respect to each weight; you won't have to do calculus on the final).
- Neural language model (Goldberg Ch. 9.4, J&M SLP 3rd ed. Ch. 6.2-6.5)
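A minimal forward pass for a feed-forward net of the shape used in a neural language model: concatenated context-word embeddings in, a ReLU hidden layer, and a softmax distribution over the vocabulary out. All weights are random and the vocabulary is invented, so this sketch only shows the shapes and operations, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny feed-forward net, shaped like a neural trigram LM:
# input  = concatenated embeddings of the context words,
# output = softmax distribution over the vocabulary.
vocab = ["the", "dog", "barks", "</s>"]
emb_dim, context, hidden = 4, 2, 8

E = rng.normal(size=(len(vocab), emb_dim))            # embedding matrix (random here)
W1 = rng.normal(size=(context * emb_dim, hidden))     # input  -> hidden weights
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, len(vocab)))            # hidden -> output scores
b2 = np.zeros(len(vocab))

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(context_words):
    x = np.concatenate([E[vocab.index(w)] for w in context_words])
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    return softmax(h @ W2 + b2)     # probability distribution over the vocabulary

p = next_word_distribution(["the", "dog"])
print(dict(zip(vocab, p.round(3))))  # arbitrary values, since the weights are random
```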