Statistical Machine Translation, Language Modeling and Natural Language Processing

This site contains materials on the subjects of Statistical Machine Translation, Language Modelling and Natural Language Processing, compiled by Simon Carter, primarily for Simon Carter.

Contact Information

Simon Carter (CV, LinkedIn profile)
PhD Candidate, ILPS Group, University of Amsterdam
Supervisor: Christof Monz
Email: s.c.carter@uva.nl

NLP, SMT & Computing Books

Speech and Language Processing
Foundations of Statistical Natural Language Processing
Introduction to Algorithms
Statistical Machine Translation
The Elements of Statistical Learning

Graduate Studies and Research

Because I don't yet have a PhD, these people are better placed than I am to talk about life as a researcher or graduate student. Many of these links come from here, where many more can be found.
Does one have to be a genius to do maths?
Ten Lessons I Wish I Had Been Taught
Survival tips
Career advice from a Fields Medalist
On getting used to be stupid... for once
Tips on writing, and the research process (pdf)
general research articles
some grad school tips

Online Lectures

Stanford CS229 Machine Learning Lectures
Stanford CS224N Natural Language Processing Lectures
University of Heidelberg Introduction to Statistical Parsing
MIT 6.042J Mathematics for Computer Science
Data Mining tutorials by Andrew Moore

Data

Europarl Corpus
Linguistic Data Consortium
UCI Machine Learning Repository
Glasgow's stop word list

SMT Articles

Grammatical Machine Translation Stefan Riezler, 2006
The Alignment Template Approach to Statistical Machine Translation. Franz Josef Och, Hermann Ney, 2004
Statistical Phrase-Based Translation Philipp Koehn, Franz Josef Och, and Daniel Marcu, 2003
BLEU: a method for automatic evaluation of machine translation Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, 2001
Re-evaluating the Role of Bleu in Machine Translation Research Chris Callison-Burch, Miles Osborne and Philipp Koehn, 2006
The Mathematics of Statistical Machine Translation: Parameter Estimation Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer, 1993
Statistical Machine Translation Adam Lopez, 2008
What Can Syntax-based MT Learn from Phrase-based MT? Steve DeNeefe, Kevin Knight, Wei Wang, Daniel Marcu, 2007
Hierarchical Phrase-Based Translation David Chiang, 2007

LM Articles

A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model Libin Shen, Jinxi Xu and Ralph Weischedel, 2008
Parsers as language models for statistical machine translation Matt Post and Daniel Gildea, 2008
Refining Generative Language Models using Discriminative Learning Ben Sandbank, 2008
Large and diverse language models for statistical machine translation Holger Schwenk and Philipp Koehn, 2008
A discriminative Language Model with Pseudo-Negative Samples Daisuke Okanohara, Jun'ichi Tsujii, 2007
Better N-best Translations through Generative n-gram Language Models Boxing Chen, Marcello Federico and Mauro Cettolo, 2007
Supertagged Phrase-Based Statistical Machine Translation Hany Hassan, Khalil Sima'an, and Andy Way, 2007
CCG Supertags in Factored Statistical Machine Translation Alexandra Birch, Miles Osborne, and Philipp Koehn, 2007
Large language models in machine translation Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean, 2007
Discriminative n-gram language modeling Brian Roark, Murat Saraclar, Michael Collins, 2006
A Neural Syntactic Language Model Ahmad Emami, Frederick Jelinek, 2005
Discriminative Syntactic Language Modeling for Speech Recognition Michael Collins, Brian Roark and Murat Saraclar, 2005
Minimum Sample Risk Methods for Language Modeling Jianfeng Gao, Hao Yu, Wei Yuan, Peng Xu, 2005 (slides)
The Use of a Structural N-gram Language Model in Generation-Heavy Hybrid Machine Translation Nizar Habash, 2004
Discriminative language modeling with conditional random fields and the perceptron algorithm B Roark, M Saraclar, M Collins, M Johnson, 2004
Syntax-based Language Models for Statistical Machine Translation Eugene Charniak, Kevin Knight and Kenji Yamada, 2003
Factored Language Models and Generalized Parallel Backoff J Bilmes and K Kirchhoff, 2003
Discriminative training of language models for speech recognition Hong-Kwang Jeff Kuo, Eric Fosler-Lussier, Hui Jiang, Chin-Hui Lee, 2002
Immediate-head parsing for language models Eugene Charniak, 2001
Two decades of statistical language modeling: Where do we go from here Ronald Rosenfeld, 2000
On the Use of Grammar Based Language Models for Statistical Machine Translation Hassan Sawaf, Kai Schutz, Hermann Ney, 1999
Exploiting syntactic structure for language modeling Ciprian Chelba, Frederick Jelinek, 1998
Class-Based n-Gram Models of Natural Language Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, Jenifer C. Lai, 1992
An Empirical Study of Smoothing Techniques for Language Modeling Stanley F. Chen and Joshua T. Goodman, 1998
A Maximum Entropy Approach to Adaptive Statistical Language Modeling Ronald Rosenfeld, 1996
Building Probabilistic Models for Natural Language Stanley F. Chen, 1996
On Structuring Probabilistic Dependencies in Stochastic Language Modeling H. Ney, U. Essen, and R. Kneser, 1994
The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression I. Witten and T. Bell, 1991
Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer S. Katz, 1987

Research Interests

I am interested in the statistics of natural language and the application of machine learning techniques to NLP problems. Currently, I work on language models in the context of Statistical Machine Translation. I am interested in going beyond standard n-gram models, improving their abilities within the generative and discriminative contexts.

Publications

Simon Carter, Christof Monz. Parsing Statistical Machine Translation Output. Language & Technology Conference, 2009 (pdf)
Simon Carter, Christof Monz, Sirvan Yahyaei. The QMUL System Description for IWSLT 2008. International Workshop on Spoken Language Translation, pp. 104-107, 2008 (pdf)

ILPS/UvA stuff

ILPS Twiki
ILPS PhD blog
ILPS Machines
UvA Staff Website (self-service)
UvA bibliotheek
Picarta

Summer Schools

Machine Learning Summer School 2009
21st European Summer School in Logic, Language and Information (ESSLLI)

Useful Links

ACL Wiki
MT-Eval 08 Results
MT-Eval 06 Results
Call For Papers Wiki
2009 Computational Linguistics Conferences compiled by Joel Tetreault
Reading list on Bayesian modeling for language compiled by Sharon Goldwater
Collection of resources for cross-lingual information retrieval
Online learning articles collection of articles on online learning
Useful links page compiled by SIGNLL
Collection of various software tools
Latex Packages for tree drawing
Maths Refresher
UCL English Grammar guide for the majority of Brits who get no grammar tuition at all
ILP for NLP
Bib on Adaptive Language Modelling
Hal Daumé's What To See

Free/Open Source Software

R for Statistical Computations
Octave for Numerical Computations
OpenNLP for Java NLP tools
OpenNLP MAXENT for Java Maximum Entropy tools
gnuplot free plotting software and here's a guide on using it, and another.
Moses an Open Source SMT System
Evalb a bracket scoring program
Carmel a finite-state transducer package
JabRef a Java based reference manager
TreeForm software for Syntax Tree Drawing
Weka an ML toolkit

Online Translation

Google Translate
Babel Fish
sampark system (experimental)

SMT/LM Toolkits

SRILM
IRSTLM
CMU-Cambridge
GIZA++
tercom for calculating translation error rate

Parsers and Taggers

Collins Parser
Charniak Parser
Dan Bikel's Parser an implementation of Collins Model 2 which can be re-trained and is more robust than Collins' own parser.
Stanford Parser
MXPOST a POS tagger
tree tagger a POS tagger

Companies doing SMT & ASR

Language Weaver
BBN
SRI

Journals/Proceedings/Transactions in no specific order

ACM Computing Surveys
ACM Transactions on Speech and Language Processing
Computational Linguistics
Computer Speech and Language
Cryptologia
International Journal of Translation
Journal of Artificial Intelligence Research
Journal of Artificial Intelligence
Journal of Logic and Computation
Journal of Machine Learning Research
Journal of Natural Language Engineering
Machine Learning
Machine Translation

Groups in no specific order

ILPS Group, University of Amsterdam
Edinburgh SMT Group
Stanford NLP Group
Microsoft Machine Learning and Applied Statistics Group
Carnegie Mellon Machine Learning Department
Johns Hopkins LSP Center
Rochester Machine Translation Group
Oxford CL Group
Cambridge Natural Language and Information Processing Group
Cambridge SMT Group
FBK

Academics in no specific order

Christof Monz
Adam Lopez
Philipp Koehn
Franz Josef Och
Kevin Knight
Frederick Jelinek
John D. Lafferty
Chris Callison-Burch
Joshua Goodman
Stanley Chen
David Chiang
Rens Bod
Philip Resnik
Roni Rosenfeld
Daniel Gildea
Stefan Riezler
Yue Zhang
Michael Collins
Christopher Manning
Eugene Charniak
Andy Way
Hany Hassan
Khalil Sima'an
Stephen Clark
Hal Daumé III
Antal van den Bosch
Andreas Zollmann
Nizar Habash
Brian Roark
Michel Galley
David McClosky
Yik-Cheung Tam
Zhifei Li
Natasha Singh-Miller
Bill Byrne
Jörg Tiedemann
Percy Liang
Chris Dyer
Dan Klein
Richard Zens
Holger Schwenk

Blogs in no specific order

Machine Learning (Theory)
Readings in Machine Learning
Natural Language Processing
Language Log
Mathematics and Computation
Tasty Research
Babel's Dawn
Jurgen Van Gael Blog on ML topics
on recommender systems