The billions of lines of source code that have been written contain implicit knowledge about how to write good code, code that is easy to read and to debug. A recent line of research aims to find statistical patterns in large corpora of code to drive new software development tools and program analyses.

This website and the accompanying article survey the work in this emerging area.

Like writing and speaking, software development is an act of human communication.

At its core, the naturalness of software employs statistical modeling over big code to reason about the rich variety of programs that developers write. This new line of research is inherently interdisciplinary, uniting the machine learning and natural language processing communities with the software engineering and programming languages communities.

This site is an experiment: a living literature review that allows you to explore and navigate the literature in this area by following a taxonomy based on the underlying design principles of each model.

The full survey is available as a research paper. Please cite as

  title={A Survey of Machine Learning for Big Code and Naturalness},
  author={Allamanis, Miltiadis and Barr, Earl T. and Devanbu, Premkumar and Sutton, Charles},
  journal={arXiv preprint arXiv:1709.06182},


This research area is evolving so fast that a static review cannot keep up. But a website can! We hope to make this site a living document. Anyone can add a paper to this website, essentially by creating one Markdown file. To contribute, open a pull request on GitHub by following these instructions for contributing.

Datasets and Other Resources

Some resources about Big Code and Naturalness can be found at A list of datasets used in this area can be found in the appendix of the survey and at