|2018||Learning to Represent Programs with Graphs||M. Allamanis, M. Brockscmidt, M. Khademi||ICLR||
Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code’s known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures.
In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.
|2018||A General Path-Based Representation for Predicting Program Properties||U. Alon, M. Zilberstein, O. Levy, E. Yahav||PLDI||
|2018||Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks||N. Bui, L. Jiang, Y. Yu||NLSE||
Towards the vision of translating code that implements an algorithm from one programming language into another, this paper proposes an approach for automated program classification using bilateral tree-based convolutional neural networks (BiTBCNNs). It is layered on top of two tree-based convolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual programming language. The combination layer of the networks recognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained using the source code in different languages but known to implement the same algorithms and/or functionalities. For a preliminary evaluation, we use 3591 Java and 3534 C++ code snippets from 6 algorithms we crawled systematically from GitHub. We obtained over 90% accuracy in the cross-language binary classification task to tell whether any given two code snippets implement a same algorithm. Also, for the algorithm classification task, i.e., to predict which one of the six algorithm labels is implemented by an arbitrary C++ code snippet, we achieved over 80% precision.
|2018||Path-Based Function Embedding and its Application to Specification Mining||D. DeFreez, A. V. Thakur, C. Rubio-González||ICSE||
Identifying the relationships among program elements is useful
for program understanding, debugging, and analysis. One such
relationship is synonymy. Function synonyms are functions that
play a similar role in code, e.g. functions that perform initialization
for different device drivers, or functions that implement different
symmetric-key encryption schemes. Function synonyms are not
necessarily semantically equivalent and can be syntactically dissimilar; consequently, approaches for identifying code clones or
functional equivalence cannot be used to identify them. This paper presents
|2018||Deep code search||X. Gu, H. Zhang, S. Kim||ICSE||
To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code.
In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled.
As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.
|2018||Deep Reinforcement Learning for Programming Language Correction||R. Gupta, A. Kanade, S. Shevade||Novice programmers often struggle with the formal syntax of programming languages. To assist them, we design a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We demonstrate that the agent can be trained through self-exploration directly from the raw input, that is, program text itself, without any knowledge of the formal syntax of the programming language. We leverage expert demonstrations for one tenth of the training data to accelerate training. The proposed technique is evaluated on 6975 erroneous C programs with typographic errors, written by students during an introductory programming course. Our technique fixes 14% more programs and 29% more compiler error messages relative to those fixed by a state-of-the-art tool, DeepFix, which uses a fully supervised neural machine translation approach.|
|2018||Learning to Repair Software Vulnerabilities with Generative Adversarial Networks||J. A. Harer, O. Ozdemir, T. Lazovich, C. P. Reale, R. L. Russell, L. Y. Kim||arXiv:1805.07475||
Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.
|2018||Exploring the Naturalness of Buggy Code with Recurrent Neural Network||J. Lanchantin, J. Gao||
Statistical language models are powerful tools which have been used for many tasks within natural language processing. Recently, they have been used for other sequential data such as source code. (Ray et al., 2015) showed that it is possible train an n-gram source code language mode, and use it to predict buggy lines in code by determining “unnatural” lines via entropy with respect to the language model. In this work, we propose using a more advanced language modeling technique, Long Short-term Memory recurrent neural networks, to model source code and classify buggy lines based on entropy. We show that our method slightly outperforms an n-gram model in the buggy line classification task using AUC
|2018||NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System.||X.V. Lin, C. Wang, L. Zettlemoyer and M.D. Ernst||LREC||
We present new data and semantic parsing methods for the problem of mapping english sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to easily solve otherwise repetitive tasks (such as file manipulation, search, and application-specific scripting) by simply stating their intents in English. We take a first step in this domain, by providing a large new dataset of challenging but commonly used commands paired with their English descriptions, along with the baseline methods to establish performance levels on this task.
|2018||Deep Learning to Detect Redundant Method Comments||A. Louis, S. K. Dash, E. T. Barr, C. Sutton||
Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from code, high entailment indicating that a comment’s natural language semantics can be inferred directly from the code. Although not all entailed comments are low quality, comments that are too easily inferred, for example, comments that restate the code, are widely discouraged by authorities on software style. Based on this, we develop a tool called CRAIC which scores method-level comments for redundancy. Highly redundant comments can then be expanded or alternately removed by the developer. CRAIC uses deep language models to exploit large software corpora without requiring expensive manual annotations of entailment. We show that CRAIC can perform the comment entailment task with good agreement with human judgements. Our findings also have implications for documentation tools. For example, we find that common tags in Javadoc are at least two times more predictable from code than non-Javadoc sentences, suggesting that Javadoc tags are less informative than more free-form comments
|2018||Public Git Archive: a Big Code dataset for all||V. Markovtsev, W. Long||MSR||
The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive – dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel data retrieval pipeline to reproduce it. We also elaborate on the strategy for performing dataset updates and legal issues. The Public Git Archive occupies 3.0 TB on disk and is an order of magnitude larger than the current source code datasets. The dataset is made available through HTTP and provides the source code of the projects, the related metadata, and development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format to efficiently store forked Git repositories, reducing the amount of data to download and persist. Public Git Archive aims to open a myriad of new opportunities for Big Code research.
|2018||A Deep Learning Approach to Identifying Source Code in Images and Video||J. Ott, A. Atchison, P. Harnack, A. Bergh, E. Linstead||MSR||
While substantial progress has been made in mining code on an Internet scale, efforts to date have been overwhelmingly focused on data sets where source code is represented natively as text. Large volumes of source code available online and embedded in technical videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing approaches to code extraction and indexing in this environment rely heavily on computationally intense optical character recognition. To improve the ease and efficiency of identifying this embedded code, as well as identifying similar code examples, we develop a deep learning solution based on convolutional neural networks and autoencoders. Focusing on Java for proof of concept, our technique is able to identify the presence of typeset and handwritten source code in thousands of video images with 85.6%-98.6% accuracy based on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides a more scalable basis for video indexing that can be incorporated into existing software search and mining tools.
|2018||Building Language Models for Text with Named Entities||M.R. Parvez, S. Chakraborty, B. Ray, KW Chang||ACL||
Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging for a language model as they appear less frequent on the training corpus. In this paper, we propose a novel and effective approach to building a discriminative language model which can learn the entity names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java programming codes, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% better perplexity in recipe generation and 22.06% on code generation than the state-of-the-art language models.
|2018||Polyglot Semantic Parsing in APIs||Kyle Richardson, Jonathan Berant, Jonas Kuhn||NAACL||
Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017a,b). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks.
|2018||Syntax and Sensibility: Using language models to detect and correct syntax errors||E. A. Santos, J. C. Campbell, D. Patel, A. Hindle, J. N. Amaral||SANER||
Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically resolving syntax errors and their precise location poorly. We propose a methodology that locates where syntax errors occur, and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine what tokens are more likely at the estimated error location. We compare n-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not rely that the problem source code comes from the same domain as the training data. We evaluated against a repository of real student mistakes. Our tools are able to find a syntactically-valid fix within its top-2 suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.
|2018||Evaluation of Type Inference with Textual Cues||A. Shirani, A. P. Lopez-Monroy, F. Gonzalez, T. Solorio, M.A. Alipour||NLSE||
Type information plays an important role in the success of information retrieval and recommendation systems in software engineering. Thus, the absence of types in dynamically-typed languages poses a challenge to adapt these systems to support dynamic languages.
In this paper, we explore the viability of type inference using textual cues. That is, we formulate the type inference problem as a classification problem which uses the textual features in the source code to predict the type of variables. In this approach, a classifier learns a model to distinguish between types of variables in a program. The model is subsequently used to (approximately) infer the types of other variables.
We evaluate the feasibility of this approach on four Java projects wherein type information is already available in the source code and can be used to train and test a classifier. Our experiments show this approach can predict the type of new variables with relatively high accuracy (80% F-measure). These results suggest that textual cues can be complementary tools in inferring types for dynamic languages.
|2018||Deep Learning Similarities from Different Representations of Source Code||M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, D. Poshyvanyk||MSR||
Assessing the similarity between code components plays a pivotal role in a number of Software Engineering (SE) tasks, such as clone detection, impact analysis, refactoring, etc. Code similarity is generally measured by relying on manually defined or hand-crafted features, e.g., by analyzing the overlap among identifiers or comparing the Abstract Syntax Trees of two code components. These features represent a best guess at what SE researchers can utilize to exploit and reliably assess code similarity for a given task. Recent work has shown, when using a stream of identifiers to represent the code, that Deep Learning (DL) can effectively replace manual feature engineering for the task of clone detection. However, source code can be represented at different levels of abstraction: identifiers, Abstract Syntax Trees, Control Flow Graphs, and Bytecode. We conjecture that each code representation can provide a different, yet orthogonal view of the same code fragment, thus, enabling a more reliable detection of similarities in code. In this paper, we demonstrate how SE tasks can benefit from a DL-based approach, which can automatically learn code similarities from different representations.
|2018||Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow||P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig||MSR||
For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models require parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
|2018||Generating Regular Expressions from Natural Language Specifications: Are We There Yet?||Z. Zhong, J. Guo, W. Yang, T. Xie, JG Lou, Y. Liu, D. Zhang||NLSE||
Recent state-of-the-art approaches automatically generate regular expressions from natural language specifications. Given that these approaches use only synthetic data in both training datasets and validation/test datasets, a natural question arises: are these approaches effective to address various real-world situations? To explore this question, in this paper, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world dataset collected from the Internet, and conduct an experimental study on applying a state-of-the-art approach on the real-world dataset. Our study results suggest the existence of distinct characteristics between the synthetic datasets and the real-world dataset, and the state-of-the-art approach (based on a model trained from a synthetic dataset) achieves extremely low effectiveness when evaluated on real-world data, much lower than the effectiveness when evaluated on the synthetic dataset. We also provide initial analysis on some of those challenging cases and discuss future directions.
|2017||Mining Semantic Loop Idioms from Big Code||M. Allamanis, E. T. Barr, C. Bird, M. Marron, C. Sutton||
During maintenance, developers spend a lot of time transforming existing code: refactoring, optimizing, and adding checks to make it more robust. Much of this work is the drudgery of identifying and replacing specific patterns, yet it resists automation, because of meaningful patterns are hard to automatically find. We present a technique for mining loop idioms, surprisingly probable semantic patterns that occur in loops, from big code to find meaningful patterns. First, we show that automatically identifiable patterns exist, in great numbers, with a large scale empirical study of loop over 25 MLOC. We find that loops in this corpus are simple and predictable: 90% of them have fewer than 15LOC and 90% have no nesting and very simple control structure. Encouraged by this result, we coil loops to abstract away syntactic diversity to define information rich loop idioms. We show that only 50 loop idioms cover 50% of the concrete loops. We show how loop idioms can help a tool developers identify and prioritize refactorings. We also show how our framework opens the door to data-driven tool and language design discovering opportunities to introduce new API calls and language constructs: loop idioms show that LINQ would benefit from an Enumerate operator, a result confirmed by the fact that precisely this feature is one of the most requested features on StackOverflow with 197 votes and 95k views.
|2017||SmartPaste: Learning to Adapt Source Code||M. Allamanis, M. Brockscmidt||
Deep Neural Networks have been shown to succeed at a range of natural language tasks such as machine translation and text summarization. While tasks on source code (ie, formal languages) have been considered recently, most work in this area does not attempt to capitalize on the unique opportunities offered by its known syntax and structure. In this work, we introduce SmartPaste, a first task that requires to use such information. The task is a variant of the program repair problem that requires to adapt a given (pasted) snippet of code to surrounding, existing source code. As first solutions, we design a set of deep neural models that learn to represent the context of each variable location and variable usage in a data flow-sensitive way. Our evaluation suggests that our models can learn to solve the SmartPaste task in many cases, achieving 58.6% accuracy, while learning meaningful representation of variable usages.
|2017||Neural Attribute Machines for Program Generation||M. Amodio, S. Chaudhuri, T. Reps||
Recurrent neural networks have achieved remarkable success at generating sequences with complex structures, thanks to advances that include richer embeddings of input and cures for vanishing gradients. Trained only on sequences from a known grammar, though, they can still struggle to learn rules and constraints of the grammar. Neural Attribute Machines (NAMs) are equipped with a logical machine that represents the underlying grammar, which is used to teach the constraints to the neural machine by (i) augmenting the input sequence, and (ii) optimizing a custom loss function. Unlike traditional RNNs, NAMs are exposed to the grammar, as well as samples from the language of the grammar. During generation, NAMs make significantly fewer violations of the constraints of the underlying grammar than RNNs trained only on samples from the language of the grammar.
|2017||A parallel corpus of Python functions and documentation strings for automated code documentation and code generation||A.V.M. Barone, R. Sennrich||ArXiV 1707.02275||
Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains.
In this work we introduce a large and diverse parallel corpus of a hundred thousands Python functions with their documentation strings (“docstrings”) generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data.
We release our datasets and processing scripts in order to stimulate research in these areas.
|2017||Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts||R. Bavishi, M. Pradel, K. Sen||
|2017||pix2code: Generating Code from a Graphical User Interface Screenshot||T. Beltramelli||ArXiV 1705.07962||
Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites and mobile applications. In this paper, we show that Deep Learning techniques can be leveraged to automatically generate code given a graphical user interface screenshot as input. Our model is able to generate code targeting three different platforms (i.e. iOS, Android and web-based technologies) from a single input image with over 77% of accuracy.
|2017||End-to-end Deep Learning of Optimization Heuristics||C. Cummins, P. Petoumenos, Z. Wang, H. Leather||
Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect.
Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts.
We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.
|2017||Synthesizing benchmarks for predictive modeling||C. Cummin, P. Petoumenos, Z. Wang, H. Leather||CGO||
Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces. What is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space. We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from hand-written code. We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state of the art predictive model by 1.27x. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30x.
|2017||Autofolding for Source Code Summarization||J. Fowkes, R. Ranca, M. Allamanis, M. Lapata, C. Sutton||TSE||
Developers spend much of their time reading and browsing source code, raising new opportunities for summarization methods. Indeed, modern code editors provide code folding, which allows one to selectively hide blocks of code. However this is impractical to use as folding decisions must be made manually or based on simple rules. We introduce the autofolding problem, which is to automatically create a code summary by folding less informative code regions. We present a novel solution by formulating the problem as a sequence of AST folding decisions, leveraging a scoped topic model for code tokens. On an annotated set of popular open source projects, we show that our summarizer outperforms simpler baselines, yielding a 28% error reduction. Furthermore, we find through a case study that our summarizer is strongly preferred by experienced developers. More broadly, we hope this work will aid program comprehension by turning code folding into a usable and valuable tool.
|2017||Semantically enhanced software traceability using deep learning techniques||J. Guo, J. Cheng, J. Cleland-Huang||ICSE||
In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus and RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly out-performed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.
|2017||DeepFix: Fixing Common C Language Errors by Deep Learning||R. Gupta, S. Pal, A. Kanade, S. Shevade||AAAI||
The problem of automatically fixing programming errors is a very active research topic in software engineering. This is a challenging problem as fixing even a single error may require analysis of the entire program. In practice, a number of errors arise due to programmer’s inexperience with the programming language or lack of attention to detail. We call these common programming errors. These are analogous to grammatical errors in natural languages. Compilers detect such errors, but their error messages are usually inaccurate. In this work, we present an end-to-end solution, called DeepFix, that can fix multiple such errors in a program without relying on any external tool to locate or fix them. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network with attention which is trained to predict erroneous program locations along with the required correct statements. On a set of 6971 erroneous C programs written by students for 93 programming tasks, DeepFix could fix 1881 (27%) programs completely and 1338 (19%) programs partially.
|2017||Are Deep Neural Networks the Best Choice for Modeling Source Code?||V. J. Hellendoorn, P. Devanbu||FSE||
Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source code. We argue here that the special properties of source code can be exploited for further improvements. In this work, we enhance established language modeling approaches to handle the special challenges of modeling source code, such as: frequent changes, larger, changing vocabularies, deeply nested scopes, etc. We present a fast, nested language modeling toolkit specifically designed for software, with the ability to add & remove text, and mix & swap out many models. Specifically, we improve upon prior cache-modeling work and present a model with a much more expansive, multi-level notion of locality that we show to be well-suited for modeling software. We present results on varying corpora in comparison with traditional N -gram, as well as RNN, and LSTM deep-learning language models, and release all our source code for public use. Our evaluations suggest that carefully adapting N-gram models for source code can yield performance that surpasses even RNN and LSTM based deep-learning models.
|2017||CodeSum: Translate Program Language to Natural Language||X. Hu, Y. Wei, G. Li, Z. Jin||ArXiV 1708.01837||
During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task in software engineering, code summarization aims to generate brief natural language descriptions for source code. In this paper, we propose a new code summarization model named CodeSum. CodeSum exploits the attention-based sequence-to-sequence (Seq2Seq) neural network with Structure-based Traversal (SBT) of Abstract Syntax Trees (AST). The AST sequences generated by SBT can better present the structure of ASTs and keep unambiguous. We conduct experiments on three large-scale corpora in different program languages, i.e., Java, C#, and SQL, in which Java corpus is our new proposed industry code extracted from Github. Experimental results show that our method CodeSum outperforms the state-of-the-art significantly.
|2017||Automatically Generating Commit Messages from Diffs using Neural Machine Translation||S. Jiang, A. Armaly, C. McMillan||ArXiV 1708.09492||
Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These techniques are effective at describing what changed, but are often verbose and lack context for understanding the rationale behind a change. In contrast, humans write messages that are short and summarize the high level rationale. In this paper, we adapt Neural Machine Translation (NMT) to automatically “translate” diffs into commit messages. We trained an NMT algorithm using a corpus of diffs and human-written commit messages from the top 1k Github projects. We designed a filter to help ensure that we only trained the algorithm on higher-quality commit messages. Our evaluation uncovered a pattern in which the messages we generate tend to be either very high or very low quality. Therefore, we created a quality-assurance filter to detect cases in which we are unable to produce good messages, and return a warning instead.
|2017||Learning a Classifier for False Positive Error Reports Emitted by Static Code Analysis Tools||U. Koc, P. Saadatpanah, J. S. Foster, A. A. Porter||MAPL||
The large scale and high complexity of modern software systems make perfectly precise static code analysis (SCA) infeasible. Therefore SCA tools often over-approximate, so not to miss any real problems. This, however, comes at the expense of raising false alarms, which, in practice, reduces the usability of these tools.
To partially address this problem, we propose a novel learning process whose goal is to discover program structures that cause a given SCA tool to emit false error reports, and then to use this information to predict whether a new error report is likely to be a false positive as well. To do this, we first preprocess code to isolate the locations that are related to the error report. Then, we apply machine learning techniques to the preprocessed code to discover correlations and to learn a classifier.
We evaluated this approach in an initial case study of a widely-used SCA tool for Java. Our results showed that for our dataset we could accurately classify a large majority of false positive error reports. Moreover, we identified some common coding patterns that led to false positive errors. We believe that SCA developers may be able to redesign their methods to address these patterns and reduce false positive error reports.
|2017||Learning to Align the Source Code to the Compiled Object Code||D. Levy, L. Wolf||ICML||
We propose a new neural network architecture and use it for the task of statement-by-statement alignment of source code and its compiled object code. Our architecture learns the alignment between the two sequences – one being the translation of the other – by mapping each statement to a context-dependent representation vector and aligning such vectors using a grid of the two sequence domains. Our experiments include short C functions, both artificial and human-written, and show that our neural network architecture is able to predict the alignment with high accuracy, outperforming known baselines. We also demonstrate that our model is general and can learn to solve graph problems such as the Traveling Salesman Problem.
|2017||Code Completion with Neural Attention and Pointer Networks||J. Li, Y. Wang, I. King, M. R. Lyu||
Intelligent code completion has become an essential tool to accelerate modern software development. To facilitate effective code completion for dynamically-typed programming languages, we apply neural language models by learning from large codebases, and investigate the effectiveness of attention mechanism on the code completion task. However, standard neural language models even with attention mechanism cannot correctly predict out-of-vocabulary (OoV) words thus restrict the code completion performance. In this paper, inspired by the prevalence of locally repeated terms in program source code, and the recently proposed pointer networks which can reproduce words from local context, we propose a pointer mixture network for better predicting OoV words in code completion. Based on the context, the pointer mixture network learns to either generate a within-vocabulary word through an RNN component, or copy an OoV word from local context through a pointer component. Experiments on two benchmarked datasets demonstrate the effectiveness of our attention mechanism and pointer mixture network on the code completion task.
|2017||Software Defect Prediction via Convolutional Neural Network||J. Li, P. He, J. Zhu, and M. R. Lyu||QRS||
To improve software reliability, software defect prediction is utilized to assist developers in finding potential bugs and allocating their testing efforts. Traditional defect prediction studies mainly focus on designing hand-crafted features, which are input into machine learning classifiers to identify defective code. However, these hand-crafted features often fail to capture the semantic and structural information of programs. Such information is important in modeling program functionality and can lead to more accurate defect prediction. In this paper, we propose a framework called Defect Prediction via Convolutional Neural Network (DP-CNN), which leverages deep learning for effective feature generation. Specifically, based on the programs’ Abstract Syntax Trees (ASTs), we first extract token vectors, which are then encoded as numerical vectors via mapping and word embedding. We feed the numerical vectors into Convolutional Neural Network to automatically learn semantic and structural features of programs. After that, we combine the learned features with traditional hand-crafted features, for accurate software defect prediction. We evaluate our method on seven open source projects in terms of F-measure in defect prediction. The experimental results show that in average, DP-CNN improves the state-of-the-art method by 12%.
|2017||Program Synthesis from Natural Language Using Recurrent Neural Networks||X.V. Lin, C. Wang, D. Pang, K. Vu, L. Zettlemoyer, M.D. Ernst||Technical Report UW-CSE-17-03-01, University of Washington Department of Computer Science and Engineering||
Oftentimes, a programmer may have difficulty implementing a desired operation. Even when the programmer can describe her goal in English, it can be difficult to translate into code. Existing resources, such as question-and-answer websites, tabulate specific operations that someone has wanted to perform in the past, but they are not effective in generalizing to new tasks, to compound tasks that require combining previous questions, or sometimes even to variations of listed tasks.
Our goal is to make programming easier and more productive by letting programmers use their own words and concepts to express the intended operation, rather than forcing them to accommodate the machine by memorizing its grammar. We have built a system that lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language for review and approval by the programmer. Our system, Tellina, does the translation using recurrent neural networks (RNNs), a state-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements.
We evaluated Tellina in the context of shell scripting. We trained Tellina’s RNNs on textual descriptions of file system operations and bash one-liners, scraped from the web. Although recovering completely correct commands is challenging, Tellina achieves top-3 accuracy of 80% for producing the correct command structure. In a controlled study, programmers who had access to Tellina outperformed those who did not, even when Tellina’s predictions were not completely correct, to a statistically significant degree.
|2017||A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes||P. Loyola, E. Marrese-Taylor, Y. Matsuo||ArXiV 1704.04856||
We propose a model to automatically describe changes introduced in the source code of a program using natural language. Our method receives as input a set of code commits, which contains both the modifications and message introduced by an user. These two modalities are used to train an encoder-decoder architecture. We evaluated our approach on twelve real world open source projects from four different programming languages. Quantitative and qualitative results showed that the proposed approach can generate feasible and semantically sound descriptions not only in standard in-project settings, but also in a cross-project setting.
|2017||Topic modeling of public repositories at scale using names in source code||V. Markovtsev, E. Kant||ArXiV 1704.00135||
Programming languages themselves have a limited number of reserved keywords and character based tokens that define the language specification. However, programmers have a rich use of natural language within their code through comments, text literals and naming entities. The programmer defined names that can be found in source code are a rich source of information to build a high level understanding of the project. The goal of this paper is to apply topic modeling to names used in over 13.6 million repositories and perceive the inferred topics. One of the problems in such a study is the occurrence of duplicate repositories not officially marked as forks (obscure forks). We show how to address it using the same identifiers which are extracted for topic modeling.
We open with a discussion on naming in source code, we then elaborate on our approach to remove exact duplicate and fuzzy duplicate repositories using Locality Sensitive Hashing on the bag-of-words model and then discuss our work on topic modeling; and finally present the results from our data analysis together with open-access to the source code, tools and datasets.
|2017||Bayesian Sketch Learning for Program Synthesis||V. Murali, S. Chaudhuri, C. Jermaine||arXiv preprint 1703.05698||
We present a Bayesian statistical approach to the problem of automatic program synthesis. Our synthesizer starts by learning, offline and from an existing corpus, a probabilistic model of real-world programs. During synthesis, it is provided some ambiguous and incomplete evidence about the nature of the programming task that the user wants automated, for example sets of API calls or data types that are relevant for the task. Given this input, the synthesizer infers a posterior distribution over type-safe programs that assigns higher likelihood to programs that, according to the learned model, are more likely to match the evidence.
We realize this approach using two key ideas. First, our learning techniques operate not over code but syntactic abstractions, or sketches, of programs. During synthesis, we infer a posterior distribution over sketches, then concretize samples from this distribution into type-safe programs using combinatorial techniques. Second, our statistical model explicitly models the full intent behind a synthesis task as a latent variable. To infer sketches, we first estimate a posterior distribution on the intent, then use samples from this posterior to generate a distribution over possible sketches. We show that our model can be implemented effectively using the new neural architecture of Bayesian encoder-decoders, which can be trained with stochastic gradient descent and yields a simple inference procedure.
We implement our ideas in a system, called BAYOU , for the synthesis of API-heavy Java methods. We train BAYOU on a large corpus of Android apps, and find that the trained system can often synthesize complex methods given just a few API method names or data types as evidence. The experiments also justify the design choice of using a latent intent variable and the levels of abstraction at which sketches and evidence are defined.
|2017||Finding Likely Errors with Bayesian Specifications||V. Murali, S. Chaudhuri, C. Jermaine||arXiv preprint 1703.01370||
We present a Bayesian framework for learning probabilistic specifications from large, unstructured code corpora, and a method to use this framework to statically detect anomalous, hence likely buggy, program behavior. The distinctive insight here is to build a statistical model that correlates all specifications hidden inside a corpus with the syntax and observed behavior of programs that implement these specifications. During the analysis of a particular program, this model is conditioned into a posterior distribution that prioritizes specifications that are relevant to this program. This allows accurate program analysis even if the corpus is highly heterogeneous. The problem of finding anomalies is now framed quantitatively, as a problem of computing a distance between a “reference distribution” over program behaviors that our model expects from the program, and the distribution over behaviors that the program actually produces.
We present a concrete embodiment of our framework that combines a topic model and a neural network model to learn specifications, and queries the learned models to compute anomaly scores. We evaluate this implementation on the task of detecting anomalous usage of Android APIs. Our encouraging experimental results show that the method can automatically discover subtle errors in Android applications in the wild, and has high precision and recall compared to competing probabilistic approaches.
|2017||Exploring API Embedding for API Usages and Applications||T.D. Nguyen, A.T. Nguyen, H.D. Phan, T.N. Nguyen||ICSE||
Word2Vec is a class of neural network models that as being trained from a large corpus of texts, they can produce for each unique word a corresponding vector in a continuous space in which linguistic contexts of words can be observed. In this work, we study the characteristics of Word2Vec vectors, called API 2 VEC or API embeddings, for the API elements within the API sequences in source code. Our empirical study shows that the close proximity of the API 2 VEC vectors for API elements reflects the similar usage contexts containing the surrounding APIs of those API elements. Moreover, API 2 VEC can capture several similar semantic relations between API elements in API usages via vector offsets. We demonstrate the usefulness of API 2 VEC vectors for API elements in three applications. First, we build a tool that mines the pairs of API elements that share the same usage relations among them. The other applications are in the code migration domain. We develop API 2 API , a tool to automatically learn the API mappings between Java and C# using a characteristic of the API 2 VEC vectors for API elements in the two languages: semantic relations among API elements in their usages are observed in the two vector spaces for the two languages as similar geometric arrangements among their API 2 VEC vectors. Our empirical evaluation shows that API 2 API relatively improves 22.6% and 40.1% top-1 and top-5 accuracy over a state-of-the-art mining approach for API mappings. Finally, as another application in code migration, we are able to migrate equivalent API usages from Java to C# with up to 90.6% recall and 87.2% precision.
|2017||Deep Learning to Find Bugs||M. Pradel, K. Sen||
|2017||Abstract Syntax Networks for Code Generation and Semantic Parsing||M. Rabinovich, M. Stern, D. Klein||ACL||
Tasks like code generation and semantic parsing require mapping unstructured (or partially structured) inputs to well-formed, executable outputs. We introduce abstract syntax networks, a modeling framework for these problems. The outputs are represented as abstract syntax trees (ASTs) and constructed by a decoder with a dynamically-determined modular structure paralleling the structure of the output tree. On the benchmark Hearthstone dataset for code generation, our model obtains 79.2 BLEU and 22.7% exact match accuracy, compared to previous state-of-the-art values of 67.1 and 6.1%. Furthermore, we perform competitively on the Atis, Jobs, and Geo semantic parsing datasets with no task-specific engineering.
|2017||The Code2Text Challenge: Text Generation in Source Code Libraries||Kyle Richardson, Sina Zarrieß, Jonas Kuhn||INLG||
We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction (Richardson and Kuhn, 2017b; Richardson and Kuhn, 2017a), and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets.
|2017||Function Assistant: A Tool for NL Querying of APIs||Kyle Richardson, Jonas Kuhn||EMNLP||
In this paper, we describe Function Assistant, a lightweight Python-based toolkit for querying and exploring source code repositories using natural language. The toolkit is designed to help end-users of a target API quickly find information about functions through high-level natural language queries and descriptions. For a given text query and background API, the tool finds candidate functions by performing a translation from the text to known representations in the API using the semantic parsing approach of Richardson and Kuhn (2017). Translations are automatically learned from example text-code pairs in example APIs. The toolkit includes features for building translation pipelines and query engines for arbitrary source code projects. To explore this last feature, we perform new experiments on 27 well-known Python projects hosted on Github.
|2017||Learning Technical Correspondences in Technical Documentation||Kyle Richardson, Jonas Kuhn||ACL||
We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning translational correspondences between text descriptions and grounded representations in the target documentation, such as formal representation of functions or code templates. Our approach exploits the parallel nature of such documentation, or the tight coupling between high-level text and the low-level representations we aim to learn. Data is collected by mining technical documents for such parallel text-representation pairs, which we use to train a simple semantic parsing model. We report new baseline results on sixteen novel datasets, including the standard library documentation for nine popular programming languages across seven natural languages, and a small collection of Unix utility manuals.
|2017||Recovering Clear, Natural Identifiers from Obfuscated JS Names||B. Vasilescu, C. Casalnuovo, P. Devanbu||FSE||
|2017||Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities||M. White, M. Tufano, M. Martínez, M. Monperrus, D. Poshyvanyk||
In the field of automated program repair, the redundancy assumption claims large programs contain the seeds of their own repair. However, most redundancy-based program repair techniques do not reason about the repair ingredients—the code that is reused to craft a patch. We aim to reason about the repair ingredients by using code similarities to prioritize and transform statements in a codebase for patch generation. Our approach, DeepRepair, relies on deep learning to reason about code similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity to suspicious elements (i.e., code elements that contain suspicious statements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined these new search strategies for patch generation with respect to effectiveness from the viewpoint of a software maintainer. Our comparative experiments were executed on six open-source Java projects including 374 buggy program revisions and consisted of 19,949 trials spanning 2,616 days of computation time. DeepRepair’s search strategy using code similarities generally found compilable ingredients faster than the baseline, jGenProg, but this improvement neither yielded test-adequate patches in fewer attempts (on average) nor found significantly more patches than the baseline. Although the patch counts were not statistically different, there were notable differences between the nature of DeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot be found by existing redundancy-based repair techniques
|2017||A Language Model for Statements of Software Code||Y. Yang, Y. Jiang, M. Gu, J. Sun, J. Gao, H. Liu||ASE||
Building language models for source code enables a large set of improvements on traditional software engineering tasks. One promising application is automatic code completion. State-of-the-art techniques capture code regularities at token level with lexical information. Such language models are more suitable for predicting short token sequences, but become less effective with respect to long statement level predictions. In this paper, we have proposed PCC to optimize the token level based language modeling. Specifically, PCC introduced an intermediate representation (IR) for source code, which puts tokens into groups using lexeme and variable relative order. In this way, PCC is able to handle long token sequences, i.e., group sequences, to suggest a complete statement with the precise synthesizer. Further more, PCC employed a fuzzy matching technique which combined genetic and longest common sub-sequence algorithms to make the prediction more accurate. We have implemented a code completion plugin for Eclipse and evaluated it on open-source Java projects. The results have demonstrated the potential of PCC in generating precise long statement level predictions. In 30%-60% of the cases, it can correctly suggest the complete statement with only six candidates, and 40%-90% of the cases with ten candidates.
|2017||A Syntactic Neural Model for General-Purpose Code Generation||P. Yin, G. Neubig||ACL||
We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.
|2017||Abridging Source Code||B. Yuan, V. Murali, C. Jermain||OOPSLA||
In this paper, we consider the problem of source code abridgment, where the goal is to remove statements from a source code in order to display the source code in a small space, while at the same time leaving the ``important’’ parts of the source code intact, so that an engineer can read the code and quickly understand purpose of the code. To this end, we develop an algorithm that looks at a number of examples, human-created source code abridgments, and learns how to remove lines from the code in order to mimic the human abridger. The learning algorithm takes into account syntactic features of the code, as well as semantic features such as control flow and data dependencies. Through a comprehensive user study, we show that the abridgments that our system produces can decrease the time that a user must look at code in order to understand its functionality, as well as increase the accuracy of the assessment, while displaying the code in a greatly reduced area.
|2016||A Convolutional Attention Network for Extreme Summarization of Source Code||M. Allamanis, H. Peng, C. Sutton||ICML||
Attention mechanisms in neural networks have proved useful for problems in which the input and output do not have fixed dimension. Often there exist features that are locally translation invariant and would be valuable for directing the model’s attention, but previous attentional architectures are not constructed to learn such features specifically. We introduce an attentional neural network that employs convolution on the input tokens to detect local time-invariant and long-range topical attention features in a context-dependent way. We apply this architecture to the problem of extreme summarization of source code snippets into short, descriptive function name-like summaries. Using those features, the model sequentially generates a summary by marginalizing over two attention mechanisms: one that predicts the next summary token based n the attention weights of the input tokens and another that is able to copy a code token as-is directly into the summary. We demonstrate our convolutional attention neural network’s performance on 10 popular Java projects showing that it achieves better performance compared to previous attentional mechanisms.
|2016||Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks||S. Bhatia, R. Singh||
We present a method for automatically generating repair feedback for syntax errors for introductory programming problems. Syntax errors constitute one of the largest classes of errors (34%) in our dataset of student submissions obtained from a MOOC course on edX. The previous techniques for generating automated feedback on programming assignments have focused on functional correctness and style considerations of student programs. These techniques analyze the program AST of the program and then perform some dynamic and symbolic analyses to compute repair feedback. Unfortunately, it is not possible to generate ASTs for student programs with syntax errors and therefore the previous feedback techniques are not applicable in repairing syntax errors. We present a technique for providing feedback on syntax errors that uses Recurrent neural networks (RNNs) to model syntactically valid token sequences. Our approach is inspired from the recent work on learning language models from Big Code (large code corpus). For a given programming assignment, we first learn an RNN to model all valid token sequences using the set of syntactically correct student submissions. Then, for a student submission with syntax errors, we query the learnt RNN model with the prefix token sequence to predict token sequences that can fix the error by either replacing or inserting the predicted token sequence at the error location. We evaluate our technique on over 14, 000 student submissions with syntax errors. Our technique can completely repair 31.69% (4501/14203) of submissions with syntax errors and in addition partially correct 6.39% (908/14203) of the submissions.
|2016||Learning Python Code Suggestion with a Sparse Pointer Network||A. Bhoopchand, T. Rocktäschel, E.T. Barr, S. Riedel||ArXiV 1611.08307||
To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. While current IDEs work well for statically-typed languages, their reliance on type annotations means that they do not provide the same level of support for dynamic programming languages as for statically-typed languages. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long-range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we found standard neural language models to perform well at suggesting local phenomena, but struggle to refer to identifiers that are introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a 5 percentage points increase in accuracy for code suggestion compared to an LSTM baseline. In fact, this increase in code suggestion accuracy is due to a 13 times more accurate prediction of identifiers. Furthermore, a qualitative analysis shows this model indeed captures interesting long-range dependencies, like referring to a class member defined over 60 tokens in the past.
|2016||Statistical Deobfuscation of Android Applications||B. Bichsel, V. Raychev, P. Tsankov, M. Vechev||CCS||
This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed “Big Code”). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions.
We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.
|2016||PHOG: Probabilistic Model for Code||P. Bielik, V. Raychev, M. Vechev||ICML||
|2016||Automatically generating features for learning program analysis heuristics||K. Chae, H. Oh, K. Heo, H. Yang||ArXiV 1612.09394||
We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been proposed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the highly nontrivial task of designing so called features to the hands of the designers. Our technique automates this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world program analyses. Our experimental evaluation shows that these analyses with automatically-generated features perform comparably to those with manually crafted features.
|2016||A deep language model for software code||H. K. Dam, T. Tran, T. Pham||ArXiV 1608.02715||
Existing language models such as n-grams for software code often fail to capture a long context where dependent code elements scatter far apart. In this paper, we propose a novel approach to build a language model for software code to address this particular issue. Our language model, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term dependencies which occur frequently in software code. Results from our intrinsic evaluation on a corpus of Java projects have demonstrated the effectiveness of our language model. This work contributes to realizing our vision for DeepSoft, an end-to-end, generic deep learning-based framework for modeling software and its development process.
|2016||Parameter-Free Probabilistic API Mining across GitHub||J. Fowkes, C. Sutton||FSE||
Existing API mining algorithms can be difficult to use as they require expensive parameter tuning and the returned set of API calls can be large, highly redundant and difficult to understand. To address this, we present PAM (Probabilistic API Miner), a near parameter-free probabilistic algorithm for mining the most interesting API call patterns. We show that PAM significantly outperforms both MAPO and UPMiner, achieving 69% test-set precision, at retrieving relevant API call sequences from GitHub. Moreover, we focus on libraries for which the developers have explicitly provided code examples, yielding over 300,000 LOC of hand-written API example code from the 967 client projects in the data set. This evaluation suggests that the hand-written examples actually have limited coverage of real API usages.
|2016||Deep API Learning||X. Gu, H. Zhang, D. Zhang, S. Kim||FSE||
Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and APIs as bag-of-words (i.e., keyword matching or word-to-word alignment) and lack a deep understanding of the semantics of the query.
We propose DeepAPI, a deep learning based approach to generate API usage sequences for a given natural language query. Instead of a bags-of-words assumption, it learns the sequence of words in a query and the sequence of associated APIs. DeepAPI adapts a neural language model named RNN Encoder-Decoder. It encodes a word sequence (user query) into a fixed-length context vector, and generates an API sequence based on the context vector. We also augment the RNN Encoder-Decoder by considering the importance of individual APIs. We empirically evaluate our approach with more than 7 million annotated code snippets collected from GitHub. The results show that our approach generates largely accurate API sequences and outperforms the related approaches.
|2016||Summarizing Source Code using a Neural Attention Model||S. Iyer, I. Konstas, A. Cheung, L. Zettlemoyer||ACL||
High quality source code is often paired with high level summaries of the computation it performs, for example in code documentation or in descriptions posted in online forums. Such summaries are extremely useful for applications such as code search but are expensive to manually author, hence only done for a small fraction of all code that is produced. In this paper, we present the first completely data-driven approach for generating high level summaries of source code. Our model, CODE-NN , uses Long Short Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries. CODE-NN is trained on a new corpus that is automatically collected from StackOverflow, which we release. Experiments demonstrate strong performance on two tasks: (1) code summarization, where we establish the first end-to-end learning results and outperform strong baselines, and (2) code retrieval, where our learned model improves the state of the art on a recently introduced C# benchmark by a large margin.
|2016||Gated Graph Sequence Neural Networks||Y. Li, R. Zemel, M. Brockschmidt, D. Tarlow||ICLR||
Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases. In this work, we study feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify to use gated recurrent units and modern optimization techniques and then extend to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We then show it achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be described as abstract data structures.
|2016||Latent Predictor Networks for Code Generation||W. Ling, E. Grefenstette, K. M. Hermann, T. Kocisky, A. Senior, F. Wang, P. Blunsom||ACL||
Many language generation tasks require the production of text conditioned on both structured and unstructured inputs. We present a novel neural network architecture which generates an output sequence conditioned on an arbitrary number of input functions. Crucially, our approach allows both the choice of conditioning context and the granularity of generation, for example characters or tokens, to be marginalised, thus permitting scalable and effective training. Using this framework, we address the problem of generating programming code from a mixed natural language and structured specification. We create two new data sets for this paradigm derived from the collectible trading card games Magic the Gathering and Hearthstone. On these, and a third preexisting corpus, we demonstrate that marginalising multiple predictors allows our model to outperform strong benchmarks.
|2016||Towards Better Program Obfuscation: Optimization via Language Models||H. Liu||ICSE||
|2016||Convolutional Neural Networks over Tree Structures for Programming Language Processing||L. Mou, G. Li, L. Zhang, T. Wang, Z. Jin||AAAI||
Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the artificial intelligence community. However, different from a natural language sentence, a program contains rich, explicit, and complicated structural information. Hence, traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.
|2016||Mapping API Elements for Code Migration with Vector Representations||T.D. Nguyen, A.T. Nguyen, T.N. Nguyen||ICSE|
|2016||Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data||J. Patra, M. Pradel||
|2016||Learning API usages from bytecode: a statistical approach.||H.V. Pham, T.T. Nguyen, P.M. Vu, T.T. Nguyen||ICSE||
Mobile app developers rely heavily on standard API frameworks and libraries. However, learning API usages is often challenging due to the fast-changing nature of API frameworks for mobile systems and the insufficiency of API documentation and source code examples. In this paper, we propose a novel approach to learn API usages from bytecode of Android mobile apps. Our core contributions include HAPI, a statistical model of API usages and three algorithms to extract method call sequences from apps’ bytecode, to train HAPI based on those sequences, and to recommend method calls in code completion using the trained HAPIs. Our empirical evaluation shows that our prototype tool can effectively learn API usages from 200 thousand apps containing 350 million method sequences. It recommends next method calls with top-3 accuracy of 90% and outperforms baseline approaches on average 10-20%.
|2016||sk_p: a neural program corrector for MOOCs||Y. Pu, K. Narasimhan, A. Solar-Lezama, R. Barzilay||SPLASH||
We present a novel technique for automatic program correction in MOOCs, capable of fixing both syntactic and semantic errors without manual, problem specific correction strategies. Given an incorrect student program, it generates candidate programs from a distribution of likely corrections, and checks each candidate for correctness against a test suite.
The key observation is that in MOOCs many programs share similar code fragments, and the seq2seq neural network model, used in the natural-language processing task of machine translation, can be modified and trained to recover these fragments.
Experiment shows our scheme can correct 29% of all incorrect submissions and out-performs state of the art approach which requires manual, problem specific correction strategies.
|2016||Learning Programs from Noisy Data||V. Raychev, P. Bielik, M. Vechev, A. Krause||POPL||
We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator which produces a candidate program based on a small sample of the entire dataset while avoiding overfitting, and a dataset sampler which carefully samples the dataset by leveraging the candidate program’s score on that dataset. The two components are connected in a continuous feedback-directed loop.
We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypotheses classes formed by a discrete search space.
We then present two new kinds of program synthesizers which target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting – a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human).
Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as “Big Code”. The key idea is to introduce a domain specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique which generalizes several existing works as we no longer need to decide a priori on what the prediction should be conditioned (another benefit is that the learned functions are a natural mechanism for explaining the prediction). As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems.
|2016||Question Independent Grading using Machine Learning: The Case of Computer Program Grading||G. Singh, S. Srikant, V. Aggarwal||KDD||
Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained for every prompt/question separately, which in turn requires graded samples. In automatic programming evaluation specifically, the focus of this work, this issue is amplified. The models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability and time taken by experts to create a labeled set of programs for each question is a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs which requires no manually assigned labeled samples for grading responses to a new, unseen question. We extend our previous work (by Srikant, Aggarwal; KDD 2014) wherein we introduced a grammar of features to learn question specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features we learn one supervised model, across questions for a given language, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both, question specific models and the consensus among human experts while substantially outperforming extant ways of evaluating codes. We demonstrate the system single s value by deploying it to grade programs in a high stakes assessment. The learning from this work is transferable to other grading tasks such as math question grading and also provides a new variation to the supervised learning approach.
|2016||Automatically Learning Semantic Features for Defect Prediction||S. Wang, T. Liu, L. Tan||ICSE||
Software defect prediction, which predicts defective code regions, can help developers find bugs and prioritize their testing efforts. To build accurate prediction models, previous studies focus on manually designing features that encode the characteristics of programs and exploring different machine learning algorithms. Existing traditional features often fail to capture the semantic differences of programs, and such a capability is needed for building accurate prediction models.
To bridge the gap between programs’ semantics and defect prediction features, this paper proposes to leverage a powerful representation-learning algorithm, deep learning, to learn semantic representation of programs automatically from source code. Specifically, we leverage Deep Belief Network (DBN) to automatically learn semantic features from token vectors extracted from programs’ Abstract Syntax Trees (ASTs).
Our evaluation on ten open source projects shows that our automatically learned semantic features significantly improve both within-project defect prediction (WPDP) and cross-project defect prediction (CPDP) compared to traditional features. Our semantic features improve WPDP on average by 14.7% in precision, 11.5% in recall, and 14.2% in F1. For CPDP, our semantic features based approach outperforms the state-of-the-art technique TCA+ with traditional features by 8.9% in F1.
|2016||Bugram: bug detection with n-gram language models||S. Wang, D. Chollak, D. Movshovitz-Attias, L. Tan||ASE||
To improve software reliability, many rule-based techniques have been proposed to infer programming rules and detect violations of these rules as bugs. These rule-based approaches often rely on the highly frequent appearances of certain patterns in a project to infer rules. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs.
In this paper, we propose a new approach—Bugram—that leverages n-gram language models instead of rules to detect bugs. Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware.
We evaluate Bugram in two ways. First, we apply Bugram on the latest versions of 16 open source Java projects. Results show that Bugram detects 59 bugs, 42 of which are manually verified as correct, 25 of which are true bugs and 17 are code snippets that should be refactored. Among the 25 true bugs, 23 cannot be detected by PR-Miner. We have reported these bugs to developers, 7 of which have already been confirmed by developers (4 of them have already been fixed), while the rest await confirmation. Second, we further compare Bugram with three additional graph- and rule-based bug detection tools, i.e., JADET, Tikanga, and GrouMiner. We apply Bugram on 14 Java projects evaluated in these three studies. Bugram detects 21 true bugs, at least 10 of which cannot be detected by these three tools. Our results suggest that Bugram is complementary to existing rule-based bug detection approaches.
|2016||Neural Code Completion||C. Liu, X. Wang, R. Shin, J.E. Gonzalez, D. Song||
|2016||Deep Learning Code Fragments for Code Clone Detection||M. White, M. Tufano, C. Vendome, D. Poshyvanyk||ASE||
Code clone detection is an important problem for software maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These techniques also depend on generic, handcrafted features to represent code fragments. We introduce learning-based detection techniques where everything for representing terms and fragments in source code is mined from the repository. Our code analysis supports a framework, which relies on deep learning, for automatically linking patterns mined at the lexical level with patterns mined at the syntactic level. We evaluated our novel learning-based approach for code clone detection with respect to feasibility from the point of view of software maintainers. We sampled and manually evaluated 398 file- and 480 method-level pairs across eight real-world Java systems; 93% of the file- and method-level samples were evaluated to be true positives. Among the true positives, we found pairs mapping to all four clone types. We compared our approach to a traditional structure-oriented technique and found that our learning-based approach detected clones that were either undetected or suboptimally reported by the prominent tool Deckard. Our results affirm that our learning-based approach is suitable for clone detection and a tenable technique for researchers.
|2016||Extracting Code from Programming Tutorial Videos||S. Yadid, E. Yahav||Onward!||
The number of programming tutorial videos on the web increases daily. Video hosting sites such as YouTube host millions of video lectures, with many programming tutorials for various languages and platforms. These videos contain a wealth of valuable information, including code that may be of interest. However, two main challenges have so far prevented the effective indexing of programming tutorial videos: (i) code in tutorials is typically written on-the-fly, with only parts of the code visible in each frame, and (ii) optical character recognition (OCR) is not precise enough to produce quality results from videos.
We present a novel approach for extracting code from videos that is based on: (i) consolidating code across frames, and (ii) statistical language models for applying corrections at different levels, allowing us to make corrections by choosing the most likely token, combination of tokens that form a likely line structure, and combination of lines that lead to a likely code fragment in a particular language. We implemented our approach in a tool called ACE , and used it to extract code from 40 Android video tutorials on YouTube . Our evaluation shows that ACE extracts code with high accuracy, enabling deep indexing of video tutorials.
|2015||Using Machine Translation for Converting Python 2 to Python 3 Code||K. Aggarwal, M. Salameh, and A. Hindle||
In this paper, we have tried to use Statistical machine translation in order to convert Python 2 code to Python 3 code. We use data from two projects and achieve a high BLEU score. We also investigate the cross-project training and testing to analyze the errors so as to ascertain differences with previous case. We have described a pilot study on modeling programming languages as natural language to build translation models on the lines of natural languages. This can be further worked on to translate between versions of a programming language or cross-programming-languages code translation.
|2015||A Bimodal Modelling of Source Code and Natural Language||M. Allamanis, D. Tarlow, A. D. Gordon, Y. Wei||ICML||
We consider the problem of building probabilistic models that jointly model short natural language utterances and source code snippets. The aim is to bring together recent work on statistical modelling of source code and work on bimodal models of images and natural language. The resulting models are useful for a variety of tasks that involve natural language and source code. We demonstrate their performance on two retrieval tasks: retrieving source code snippets given a natural language query, and retrieving natural language descriptions given a source code query (i.e., source code captioning). Experiments show there to be promise in this direction, and that modelling the structure of source code improves performance.
|2015||Suggesting Accurate Method and Class Names||M. Allamanis, E. T. Barr, C. Bird, C. Sutton||FSE||
Descriptive names are a vital part of readable, and hence maintainable, code. Recent progress on automatically suggesting names for local variables tantalizes with the prospect of replicating that success with method and class names. However, suggesting names for methods and classes is much more difficult. This is because good method and class names need to be functionally descriptive, but suggesting such names requires that the model goes beyond local context. We introduce a neural probabilistic language model for source code that is specifically designed for the method naming problem. Our model learns which names are semantically similar by assigning them to locations, called embeddings, in a high-dimensional continuous space, in such a way that names with similar embeddings tend to be used in similar contexts. These embeddings seem to contain semantic information about tokens, even though they are learned only from statistical co-occurrences of tokens. Furthermore, we introduce a variant of our model that is, to our knowledge, the first that can propose neologisms, names that have not appeared in the training corpus. We obtain state of the art results on the method, class, and even the simpler variable naming tasks. More broadly, the continuous embeddings that are learned by our model have the potential for wide application within software engineering.
|2015||Irish: A Hidden Markov Model to detect coded information islands in free text||L. Cerulo, M. Di Penta, A. Bacchelli, M, Ceccarelli, G. Canfora||Science of Computer Programming||
Developers’ communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers’ communication can be useful to support several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts.
We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, with respect to the state of art approaches based on island parsing or island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of Irish between 74% and 99%; this is in line with existing approaches which, differently from Irish, require specific expertise for the definition of regular expressions or grammars.
|2015||Exploring the Use of Deep Learning for Feature Location||C.S. Corley, K. Damevski, N.A. Kraft||
Deep learning models are a class of neural networks. Relative to n-gram models, deep learning models can capture more complex statistical patterns based on smaller training corpora. In this paper we explore the use of a particular deep learning model, document vectors (DVs), for feature location. DVs seem well suited to use with source code, because they both capture the influence of context on each term in a corpus and map terms into a continuous semantic space that encodes semantic relationships such as synonymy. We present preliminary results that show that a feature location technique (FLT) based on DVs can outperform an analogous FLT based on latent Dirichlet allocation (LDA) and then suggest several directions for future work on the use of deep learning models to improve developer effectiveness in feature location.
|2015||CACHECA: A Cache Language Model Based Code Suggestion Tool||C. Franks, Z. Tu, P. Devanbu, V. Hellendoorn||ICSE||
Nearly every Integrated Development Environment includes a form of code completion. The suggested completions (“suggestions”) are typically based on information available at compile time, such as type signatures and variables in scope. A statistical approach, based on estimated models of code patterns in large code corpora, has been demonstrated to be effective at predicting tokens given a context. In this demo, we present CACHECA, an Eclipse plugin that combines the native suggestions with a statistical suggestion regime. We demonstrate that a combination of the two approaches more than doubles Eclipse’s suggestion accuracy. A video demonstration is available at https://www.youtube.com/watch?v=3INk0N3JNtc.
|2015||OverCode: visualizing variation in student solutions to programming problems at scale||E.L. Glassman, J. Scott, R. Singh, P. Guo, and R.C. Miller||
In MOOCs, a single programming exercise may produce thousands of solutions from learners. Understanding solution variation is important for providing appropriate feedback to students at scale. The wide variation among these solutions can be a source of pedagogically valuable examples and can be used to refine the autograder for the exercise by exposing corner cases. We present OverCode, a system for visualizing and exploring thousands of programming solutions. OverCode uses both static and dynamic analysis to cluster similar solutions, and lets teachers further filter and cluster solutions based on different criteria. We evaluated OverCode against a nonclustering baseline in a within-subjects study with 24 teaching assistants and found that the OverCode interface allows teachers to more quickly develop a high-level view of students’ understanding and misconceptions, and to provide feedback that is relevant to more students’ solutions.
|2015||Synthesizing Java expressions from free-form queries||T. Gvero, V. Kuncak||OOPSLA||
We present a new code assistance tool for integrated development environments. Our system accepts as input free-form queries containing a mixture of English and Java, and produces Java code expressions that take the query into account and respect syntax, types, and scoping rules of Java, as well as statistical usage patterns. In contrast to solutions based on code search, the results returned by our tool need not directly correspond to any previously seen code fragment. As part of our system we have constructed a probabilistic context free grammar for Java constructs and library invocations, as well as an algorithm that uses a customized natural language processing tool chain to extract information from free-form text queries. We present the results on a number of examples showing that our technique (1) often produces the expected code fragments, (2) tolerates much of the flexibility of natural language, and (3) can repair incorrect Java expressions that use, for example, the wrong syntax or missing arguments.
|2015||Will they like this? Evaluating Code Contributions With Language Models||V.J. Hellendoorn, P. Devanbu, A. Bacchelli||MSR||
Popular open-source software projects receive and review contributions from a diverse array of developers, many of whom have little to no prior involvement with the project. A recent survey reported that reviewers consider conformance to the project’s code style to be one of the top priorities when evaluating code contributions on Github. We propose to quantitatively evaluate the existence and effects of this phenomenon. To this aim we use language models, which were shown to accurately capture stylistic aspects of code. We find that rejected changesets do contain code significantly less similar to the project than accepted ones; furthermore, the less similar changesets are more likely to be subject to thorough review. Armed with these results we further investigate whether new contributors learn to conform to the project style and find that experience is positively correlated with conformance to the project’s code style.
|2015||Visualizing and Understanding Recurrent Networks||A. Karpathy, J. Johnson, L. Fei-Fei||arXiv preprint arXiv:1506.02078||
Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful applications in a wide range of machine learning problems that involve sequential data. However, while LSTMs provide exceptional results in practice, the source of their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge this gap by providing an analysis of their representations, predictions and error types. In particular, our experiments reveal the existence of interpretable cells that keep track of long-range dependencies such as line lengths, quotes and brackets. Moreover, our comparative analysis with finite horizon n-gram models traces the source of the LSTM improvements to long-range structural dependencies. Finally, we provide analysis of the remaining errors and suggests areas for further study.
|2015||Gated Graph Sequence Neural Networks||Y. Li, D. Tarlow, M. Brockschmidt, R. Zemel||ICLR||
Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases. In this work, we study feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify to use gated recurrent units and modern optimization techniques and then extend to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We then show it achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.
|2015||A User-Guided Approach to Program Analysis||R. Mangal, X. Zhang, A. V. Nori, M. Naik||FSE||
Program analysis tools often produce undesirable output due to various approximations. We present an approach and a system Eugene that allows user feedback to guide such approximations towards producing the desired output. We formulate the problem of user-guided program analysis in terms of solving a combination of hard rules and soft rules: hard rules capture soundness while soft rules capture degrees of approximations and preferences of users. Our technique solves the rules using an off-the-shelf solver in a manner that is sound (satisfies all hard rules), optimal (maximally satisfies soft rules), and scales to real-world analy- ses and programs. We evaluate Eugene on two different analyses with labeled output on a suite of seven Java pro- grams of size 131–198 KLOC. We also report upon a user study involving nine users who employ Eugene to guide an information-flow analysis on three Java micro-benchmarks. In our experiments, Eugene significantly reduces misclassified reports upon providing limited amounts of feedback.
|2015||Convolutional Neural Networks over Tree Structures for Programming Language Processing||L. Mou, G. Li, L. Zhang, T. Wang, Z. Jin||AAAI||
Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the artificial intelligence community. However, different from a natural language sentence, a program contains rich, explicit, and complicated structural information. Hence, traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.
|2015||KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts||D. Movshovitz-Attias, W. W. Cohen||ACL||
Many existing knowledge bases (KBs), including Freebase, Yago, and NELL, rely on a fixed ontology, given as an input to the system, which defines the data to be cataloged in the KB, i.e., a hierarchy of categories and relations between them. The system then extracts facts that match the predefined ontology. We propose an unsupervised model that jointly learns a latent ontological structure of an input corpus, and identifies facts from the corpus that match the learned structure. Our approach combines mixed membership stochastic block models and topic models to infer a structure by jointly modeling text, a latent concept hierarchy, and latent semantic relationships among the entities mentioned in the text. As a case study, we apply the model to a corpus of Web documents from the software domain, and evaluate the accuracy of the various components of the learned ontology.
|2015||Graph-based Statistical Language Model for Code||A.T. Nguyen, T.N. Nguyen||ICSE||
n-gram statistical language model has been successfully applied to capture programming patterns to support code completion and suggestion. However, the approaches using n-gram face challenges in capturing the patterns at higher levels of abstraction due to the mismatch between the sequence nature in n-grams and the structure nature of syntax and semantics in source code. This paper presents GraLan, a graph-based statistical language model and its application in code suggestion. GraLan can learn from a source code corpus and compute the appearance probabilities of any graphs given the observed (sub)graphs. We use GraLan to develop an API suggestion engine and an AST-based language model, ASTLan. ASTLan supports the suggestion of the next valid syntactic template and the detection of common syntactic templates. Our empirical evaluation on a large corpus of open-source projects has shown that our engine is more accurate in API code suggestion than the state-of-the-art approaches, and in 75% of the cases, it can correctly suggest the API with only five candidates. ASTLan also has high accuracy in suggesting the next syntactic template and is able to detect many useful and common syntactic templates.
|2015||Learning API Usages from Bytecode: A Statistical Approach||T.T. Nguyen, H.V. Pham, P.M. Vu, T.T. Nguyen||ICSE||
When developing mobile apps, programmers rely heavily on standard API frameworks and libraries. However, learning and using those APIs is often challenging due to the fast-changing nature of API frameworks for mobile systems, the complexity of API usages, the insufficiency of documentation, and the unavailability of source code examples. In this paper, we propose a novel approach to learn API usages from bytecode of Android mobile apps. Our core contributions include: i) ARUS, a graph-based representation of API usage scenarios; ii) HAPI, a statistical, generative model of API usages; and iii) three algorithms to extract ARUS from apps’ bytecode, to train HAPI based on method call sequences extracted from ARUS, and to recommend method calls in code completion engines using the trained HAPI. Our empirical evaluation suggests that our approach can learn useful API usage models which can provide recommendations with higher levels of accuracy than the baseline n-gram model.
|2015||Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation||Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura||ASE||
Pseudo-code written in natural language can aid the comprehension of source code in unfamiliar programming languages. However, the great majority of source code has no corresponding pseudo-code, because pseudo-code is redundant and laborious to create. If pseudo-code could be generated automatically and instantly from given source code, we could allow for on-demand production of pseudo-code without human effort. In this paper, we propose a method to automatically generate pseudo-code from source code, specifically adopting the statistical machine translation (SMT) framework. SMT, which was originally designed to translate between two natural languages, allows us to automatically learn the relationship between source code/pseudo-code pairs, making it possible to create a pseudo-code generator with less human effort. In experiments, we generated English or Japanese pseudo-code from Python statements using SMT, and find that the generated pseudo-code is largely accurate, and aids code understanding.
|2015||Learning a Strategy for Adapting a Program Analysis via Bayesian Optimisation||H. Oh, H. Yang, K, Yi||OOPSLA||
Building a cost-effective static analyser for real-world programs is still regarded an art. One key contributor to this grim reputation is the difficulty in balancing the cost and the precision of an analyser. An ideal analyser should be adap- tive to a given analysis task, and avoid using techniques that unnecessarily improve precision and increase analysis cost. However, achieving this ideal is highly nontrivial, and it requires a large amount of engineering efforts.
In this paper we present a new approach for building an adaptive static analyser. In our approach, the analyser includes a sophisticated parameterised strategy that decides, for each part of a given program, whether to apply a precision-improving technique to that part or not. We present a method for learning a good parameter for such a strategy from an existing codebase via Bayesian optimisation. The learnt strategy is then used for new, unseen programs. Using our approach, we developed partially flow- and context-sensitive variants of a realistic C static analyser. The experimental results demonstrate that using Bayesian optimisation is crucial for learning from an existing codebase. Also, they show that among all program queries that require flow- or context-sensitivity, our partially flow- and context-sensitive analysis answers the 75% of them, while increasing the analysis cost only by 3.3x of the baseline flow- and context-insensitive analysis, rather than 40x or more of the fully sensitive version.
|2015||Learning Program Embeddings to Propagate Feedback on Student Code||C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, M, Sahami, L. Guibas||ICML||
Providing feedback, both assessing final work and giving hints to stuck students, is difficult for open-ended assignments in massive online classes which can range from thousands to millions of students. We introduce a neural network method to encode programs as a linear mapping from an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. We apply our algorithm to assessments from the Code.org Hour of Code and Stanford University’s CS1 course, where we propagate human comments on student assignments to orders of magnitude more submissions.
|2015||Intelligent Code Completion with Bayesian Networks||S. Proksch, J. Lerch, M. Mezini||TSE||
Code completion is an integral part of modern Integrated Development Environments (IDEs). Developers often use it to explore Application Programming Interfaces (APIs). It is also useful to reduce the required amount of typing and to help avoid typos. Traditional code completion systems propose all type-correct methods to the developer. Such a list is often very long with many irrelevant items. More intelligent code completion systems have been proposed in prior work to reduce the list of proposed methods to relevant items.
This work extends one of these existing approaches, the Best Matching Neighbor (BMN) algorithm. We introduce Bayesian networks as an alternative underlying model, use additional context information for more precise recommendations, and apply clustering techniques to improve model sizes. We compare our new approach, Pattern-based Bayesian Networks (PBN), to the existing BMN algorithm. We extend previously used evaluation methodologies and, in addition to prediction quality, we also evaluate model size and inference speed.
Our results show that the additional context information we collect improves prediction quality, especially for queries that do not contain method calls. We also show that PBN can obtain comparable prediction quality to BMN, while model size and inference speed scale better with large input sizes.
|2015||On the “Naturalness” of Buggy Code||B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, P. Devanbu||ICSE||
Real software, the kind working programmers produce by the kLOC to solve real-world problems, tends to be “natural”, like speech or natural language; it tends to be highly repetitive and predictable. Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, porting tools, coding standards checkers, and idiom miners. This suggests that code that appears improbable, or surprising, to a good statistical language model is “unnatural” in some sense, and thus possibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 8,296), from 10 different Java projects, and we focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic (i.e. unnatural), becoming less so as bugs are fixed. Focusing on highly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs) and ordering warnings from these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings. This suggests that entropy may be a valid language-independent and simple way to complement the effectiveness of PMD or FindBugs, and that search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes.
|2015||Predicting Program Properties from “Big Code”||V. Raychev, M. Vechev, A. Krause||POPL||
We present a new approach for predicting program properties from massive codebases (aka “Big Code”). Our approach first learns a probabilistic model from existing data and then uses this model to predict properties of new, unseen programs.
The key idea of our work is to transform the input program into a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic graphical models such as conditional random fields (CRFs) in order to perform joint prediction of program properties.
By formulating the problem of inferring program properties as structured prediction and showing how to perform both learning and inference in this context, our work opens up new possibilities for attacking a wide range of difficult problems in the context of “Big Code” including invariant generation, de-compilation, synthesis and others.
|2015||Products, Developers, and Milestones: How Should I Build My N-Gram Language Model||C. Saraiva, C. Bird, T. Zimmermann||FSE||
Recent work has shown that although programming languages en- able source code to be rich and complex, most code tends to be repetitive and predictable. The use of natural language processing (NLP) techniques applied to source code such as n-gram language models show great promise in areas such as code completion, aiding impaired developers, and code search. In this paper, we address three questions related to different methods of constructing lan- guage models in an industrial context. Specifically, we ask: (1) Do application specific, but smaller language models perform better than language models across applications? (2) Are developer specific language models effective and do they differ depending on what parts of the codebase a developer is working in? (3) Finally, do language models change over time, i.e., does a language model from early development model change later on in development? The answers to these questions enable techniques that make use of programming language models in development to choose the model training corpus more effectively.
We evaluate these questions by building 28 language models across developers, time periods, and applications within Microsoft Office and present the results in this paper. We find that developer and application specific language models perform better than models from the entire codebase, but that temporality has little to no effect on language model performance.
|2015||NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model||A. Sharma, Y. Tian, D. Lo||SANER||
Twitter is one of the most widely used social media platforms today. It enables users to share and view short 140-character messages called “tweets”. About 284 million active users generate close to 500 million tweets per day. Such rapid generation of user generated content in large magnitudes results in the problem of information overload. Users who are interested in information related to a particular domain have limited means to filter out irrelevant tweets and tend to get lost in the huge amount of data they encounter. A recent study by Singer et al. found that software developers use Twitter to stay aware of industry trends, to learn from others, and to network with other developers. However, Singer et al. also reported that developers often find Twitter streams to contain too much noise which is a barrier to the adoption of Twitter. In this paper, to help developers cope with noise, we propose a novel approach named NIRMAL, which automatically identifies software relevant tweets from a collection or stream of tweets. Our approach is based on language modeling which learns a statistical model based on a training corpus (i.e., set of documents). We make use of a subset of posts from StackOverflow, a programming question and answer site, as a training corpus to learn a language model. A corpus of tweets was then used to test the effectiveness of the trained language model. The tweets were sorted based on the rank the model assigned to each of the individual tweets. The top 200 tweets were then manually analyzed to verify whether they are software related or not, and then an accuracy score was calculated. The results show that decent accuracy scores can be achieved by various variants of NIRMAL, which indicates that NIRMAL can effectively identify software related tweets from a huge corpus of tweets.
|2015||Toward Deep Learning Software Repositories||M. White, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk||MSR||
Deep learning subsumes algorithms that automatically learn compositional representations. The ability of these models to generalize well has ushered in tremendous advances in many fields such as natural language processing (NLP). Recent research in the software engineering (SE) community has demonstrated the usefulness of applying NLP techniques to software corpora. Hence, we motivate deep learning for software language modeling, highlighting fundamental differences between state-of-the-practice software language models and connectionist models. Our deep learning models are applicable to source code files (since they only require lexically analyzed source code written in any programming language) and other types of artifacts. We show how a particular deep learning model can remember its state to effectively model sequential data, e.g., streaming software tokens, and the state is shown to be much more expressive than discrete tokens in a prefix. Then we instantiate deep learning models and show that deep learning induces high-quality models compared to n-grams and cache-based n-grams on a corpus of Java projects. We experiment with two of the models’ hyperparameters, which govern their capacity and the amount of context they use to inform predictions, before building several committees of software language models to aid generalization. Then we apply the deep learning models to code suggestion and demonstrate their effectiveness at a real SE task compared to state-of-the-practice models. Finally, we propose avenues for future work, where deep learning can be brought to bear to support model-based testing, improve software lexicons, and conceptualize software artifacts. Thus, our work serves as the first step toward deep learning software repositories.
|2014||Learning Natural Coding Conventions||M. Allamanis, E. T. Barr, C. Bird, C. Sutton||FSE||
Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project’s coding conventions. However, one third of reviews of changes contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present Naturalize, a framework that learns the style of a codebase, and suggests revisions to improve stylistic consistency. Naturalize builds on recent work in applying statistical natural language processing to source code. We apply Naturalize to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. Naturalize achieves 94% accuracy in its top suggestions for identifier names. We used Naturalize to generate 18 patches for 5 open source projects: 14 were accepted.
|2014||Mining Idioms from Source Code||M. Allamanis, C. Sutton||FSE||
We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic purpose. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present Haggis, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply Haggis to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management.
|2014||Syntax Errors Just Aren’t Natural: Improving Error Reporting with Language Models||J. C. Campbell, A. Hindle, J. N. Amaral||MSR||
A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.
|2014||NLyze: Interactive Programming by Natural Language for SpreadSheet Data Analysis and Manipulation||S. Gulwani, M. Marron||SIGMOD||
Millions of computer end users need to perform tasks over tabular spreadsheet data, yet lack the programming knowledge to do such tasks automatically. This paper describes the design and implementation of a robust natural language based interface to spreadsheet programming. Our methodology involves designing a typed domain-specific language (DSL) that supports an expressive algebra of map, filter, reduce, join, and formatting capabilities at a level of abstraction appropriate for non-expert users. The key algorithmic component of our methodology is a translation algorithm for converting a natural language specification in the context of a given spreadsheet to a ranked set of likely programs in the DSL. The translation algorithm leverages the spreadsheet spatial and temporal context to assign interpretations to specifications with implicit references, and is thus robust to a variety of ways in which end users can express the same task. The translation algorithm builds over ideas from keyword programming and semantic parsing to achieve both high precision and high recall. We implemented the system as an Excel add-in called NLyze that supports a rich user interaction model including annotating the user’s natural language specification and explaining the synthesized DSL programs by paraphrasing them into structured English. We collected a total of 3570 English descriptions for 40 spreadsheet tasks and our system was able to generate the intended interpretation as the top candidate for 94% (97% for the top 3) of those instances.
|2014||Using Web Corpus Statistics for Program Analysis||C. Hsiao, M. Cafarella, S. Narayanasamy||OOPSLA||
Several program analysis tools—such as plagiarism detection and bug finding—rely on knowing a piece of code’s relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common a line of code is. Natural language processing researchers have encountered a similar problem and addressed it using an n-gram model of text frequency, derived from statistics computed over text corpora.
|2014||Phrase-Based Statistical Translation of Programming Languages||S. Karaivanov, V. Raychev, M. Vechev||Onward||
Phrase-based statistical machine translation approaches have been highly successful in translating between natural languages and are heavily used by commercial systems (e.g. Google Translate).
The main objective of this work is to investigate the applicability of these approaches for translating between programming languages. Towards that, we investigated several variants of the phrase-based translation approach: i) a direct application of the approach to programming languages, ii) a novel modification of the approach to incorporate the grammatical structure of the target programming language (so to avoid generating target programs which do not parse), and iii) a combination of ii) with custom rules added to improve the quality of the translation.
To experiment with the above systems, we investigated machine translation from C# to Java. For the training, which takes about 60 hours, we used a parallel corpus of 20, 499 C#-to-Java method translations. We then evaluated each of the three systems above by translating 1,000 C# methods. Our experimental results indicate that with the most advanced system, about 60% of the translated methods compile (the top ranked) and out of a random sample of 50 correctly compiled methods, 68% (34 methods) were semantically equivalent to the reference solution.
|2014||Structured Generative Models of Natural Source Code||C.J. Maddison, D. Tarlow||ICML||
We study the problem of building generative models of natural source code (NSC); that is, source code written by humans and meant to be understood by humans. Our primary con- tribution is to describe new generative models that are tailored to NSC. The models are based on probabilistic context free grammars (PCFGs) and neuro-probabilistic language models (Mnih & Teh, 2012), which are extended to incorporate additional source code-specific structure. These models can be efficiently trained on a corpus of source code and outperform a variety of less structured baselines in terms of predictive log likelihoods on held-out data.
|2014||Building Program Vector Representations for Deep Learning||L. Mou, G. Li, Y. Liu, H. Peng, Z. Jin, Y. Xu, L. Zhang||International Conference on Knowledge Science, Engineering and Management||
Deep learning has made significant breakthroughs in various fields of artificial intelligence. Advantages of deep learning include the ability to capture highly complicated features, weak involvement of human engineering, etc. However, it is still virtually impossible to use deep learning to analyze programs since deep architectures cannot be trained effectively with pure back propagation. In this pioneering paper, we propose the “coding criterion” to build program vector representations, which are the premise of deep learning for program analysis. Our representation learning approach directly makes deep learning a reality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based on the experiments, the coding criterion is successful in building program representations. To evaluate whether deep learning is beneficial for program analysis, we feed the representations to deep neural networks, and achieve higher accuracy in the program classification task than “shallow” methods, such as logistic regression and the support vector machine. This result confirms the feasibility of deep learning to analyze programs. It also gives primary evidence of its success in this new field. We believe deep learning will become an outstanding technique for program analysis in the near future.
|2014||Statistical Learning Approach for Mining API Usage Mappings for Code Migration||A.T. Nguyen, H.A. Nguyen, T.T. Nguyen, T.N. Nguyen||ASE||
The same software product nowadays could appear in multiple platforms and devices. To address business needs, software companies develop a software product in a programming language and then migrate it to another one. To support that process, semi-automatic migration tools have been proposed. However, they require users to manually define the mappings between the respective APIs of the libraries used in two languages. To reduce such manual effort, we introduce StaMiner, a novel data-driven approach that statistically learns the mappings between APIs from the corpus of the corresponding client code of the APIs in two languages Java and C#. Instead of using heuristics on the textual or structural similarity between APIs in two languages to map API methods and classes as in existing mining approaches, StaMiner is based on a statistical model that learns the mappings in such a corpus and provides mappings for APIs with all possible arities. Our empirical evaluation on several projects shows that StaMiner can detect API usage mappings with higher accuracy than a state-of-the-art approach. With the resulting API mappings mined by StaMiner, Java2CSharp, an existing migration tool, could achieve a higher level of accuracy.
|2014||Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code||A.T. Nguyen, T.T. Nguyen, T.N. Nguyen||ASE||
Prior research shows that directly applying phrase-based SMT on lexical tokens to migrate Java to C# produces much semantically incorrect code. A key limitation is the use of sequences in phrase-based SMT to model and translate source code with well-formed structures. We propose mppSMT, a divideand-conquer technique to address that with novel training and migration algorithms using phrase-based SMT in three phases. First, mppSMT treats a program as a sequence of syntactic units and maps/translates such sequences in two languages to one another. Second, with a syntax-directed fashion, it deals with the tokens within syntactic units by encoding them with semantic symbols to represent their data and token types. This encoding via semantic symbols helps better migration of API usages. Third, the lexical tokens corresponding to each sememe are mapped or migrated. The resulting sequences of tokens are merged together to form the final migrated code. Such divide-and-conquer and syntax-direction strategies enable phrase-based SMT to adapt well to syntactical structures in source code, thus, improving migration accuracy. Our empirical evaluation on several real-world systems shows that 84.8–97.9% and 70–83% of the migrated methods are syntactically and semantically correct, respectively. 26.3–51.2% of total migrated methods are exactly matched to the human-written C# code in the oracle. Compared to Java2CSharp, a rule-based migration tool, it achieves higher semantic accuracy from 6.6–57.7% relatively. Importantly, it does not require manual labeling for training data or manual definition of rules.
|2014||Code Completion with Statistical Language Models||V. Raychev, M. Vechev, E. Yahav||PLDI||
We address the problem of synthesizing code completions for programs using APIs. Given a program with holes, we synthesize completions for holes with the most likely sequences of method calls.
Our main idea is to reduce the problem of code completion to a natural-language processing problem of predicting probabilities of sentences. We design a simple and scalable static analysis that extracts sequences of method calls from a large codebase, and index these into a statistical language model. We then employ the language model to find the highest ranked sentences, and use them to synthesize a code completion. Our approach is able to synthesize sequences of calls across multiple objects together with their arguments.
Experiments show that our approach is fast and effective. Virtually all computed completions typecheck, and the desired completion appears in the top 3 results in 90% of the cases.
|2014||A system to grade computer programming skills using machine learning||S. Srikant, V. Aggarwal||KDD||
The automatic evaluation of computer programs is a nascent area of research with a potential for large-scale impact. Extant program assessment systems score mostly based on the number of test-cases passed, providing no insight into the competency of the programmer. In this paper, we present a system to grade computer programs automatically. In addition to grading a program on its programming practices and complexity, the key kernel of the system is a machine-learning based algorithm which determines closeness of the logic of the given program to a correct program. This algorithm uses a set of highly-informative features, derived from the abstract representations of a given program, that capture the program’s functionality. These features are then used to learn a model to grade the programs, which are built against evaluations done by experts. We show that the regression models provide much better grading than the ubiquitous test-case-pass based grading and rivals the grading accuracy of other open-response problems such as essay grading . We also show that our novel features add significant value over and above basic keyword/expression count features. In addition to this, we propose a novel way of posing computer-program grading as a one-class modeling problem and report encouraging preliminary results. We show the value of the system through a case study in a real-world industrial deployment. To the best of the authors’ knowledge, this is the first time a system using machine learning has been developed and used for grading programs. The work is timely with regard to the recent boom in Massively Online Open Courseware (MOOCs), which promises to produce a significant amount of hand-graded digitized data.
|2014||On the Localness of Software||Z. Tu, Z. Su, P. Devanbu||FSE||
The n-gram language model, which has its roots in statistical natural language processing, has been shown to successfully capture the repetitive and predictable regularities (“naturalness”) of source code, and help with tasks such as code suggestion, porting, and designing assistive coding devices. However, we show in this paper that this natural-language-based model fails to exploit a special property of source code: localness. We find that human-written programs are localized: they have useful local regularities that can be captured and exploited. We introduce a novel cache language model that consists of both an n-gram and an added “cache” component to exploit localness. We show empirically that the additional cache component greatly improves the n-gram approach by capturing the localness of software, as measured by both cross-entropy and suggestion accuracy. Our model’s suggestion accuracy is actually comparable to a state-of-the-art, semantically augmented language model; but it is simpler and easier to implement. Our cache language model requires nothing beyond lexicalization, and thus is applicable to all programming languages.
|2014||Learning to Execute||W. Zaremba, I. Sutskever||ArXiV 1410.4615||
Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.
|2013||Mining Source Code Repositories at Massive Scale Using Language Modeling||M. Allamanis, C. Sutton||MSR||
The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model is significantly better at the code suggestion task than previous models. More broadly, our approach provides a new “lens” for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora. We call these metrics data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module to a software project. In particular, it is possible to distinguish reusable utility classes from classes that are part of a program’s core logic based solely on general information theoretic criteria.
|2013||A Hidden Markov Model to Detect Coded Information Islands in Free Text||L. Cerulo, M. Ceccarelli, M. Di Penta, G. Canfora||SCAM||
Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such a content is challenging, due to the mix-up of source code and natural language, unstructured text.
In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train a HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens—e.g., words, language keywords, numbers, parentheses, punctuation marks, etc.—observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could be easily extended to include in principle any text-interleaved language.
We evaluated our approach with respect to the state of art on a set of development emails and bug reports drawn from the software repositories of well known open source systems. Results indicate an accuracy between 82% and 99%, which is in line with existing approaches which, differently from ours, require the manual definition of regular expressions or parsers.
|2013||Using Semantic Unification to Generate Regular Expressions from Natural Language||N. Kushman, R. Barzilay||NAACL||
We consider the problem of translating natural language text queries into regular expressions which represent their meaning. The mismatch in the level of abstraction between the natural language representation and the regular expression representation make this a novel and challenging problem. However, a given regular expression can be written in many semantically equivalent forms, and we exploit this flexibility to facilitate translation by finding a form which more directly corresponds to the natural language. We evaluate our technique on a set of natural language queries and their associated regular expressions which we gathered from Amazon Mechanical Turk. Our model substantially outperforms a state-of-the-art semantic parsing baseline, yielding a 29% absolute improvement in accuracy.
|2013||A Machine Learning Framework for Programming by Example||A. K. Menon, O. Tamuz, S. Gulwani, B. Lampson, A.T. Kalai||ICML||
Learning programs is a timely and interesting challenge. In Programming by Example (PBE), a system attempts to infer a program from input and output examples alone, by searching for a composition of some set of base functions. We show how machine learning can be used to speed up this seemingly hopeless search problem, by learning weights that relate textual features describing the provided input-output examples to plausible sub-components of a program. This generic learning framework lets us address problems beyond the scope of earlier PBE systems. Experiments on a prototype implementation show that learning improves search and ranking on a variety of text processing tasks found on help forums.
|2013||Natural Language Models for Predicting Programming Comments||D. Movshovitz-Attias, W.W. Cohen||ACL||
Statistical language models have successfully been used to describe and analyze natural language documents. Recent work applying language models to programming languages is focused on the task of predicting code, while mainly ignoring the prediction of programmer comments. In this work, we predict comments from JAVA source files of open source projects, using topic models and n-grams, and we analyze the performance of the models given varying amounts of background data on the project being predicted. We evaluate models on their comment-completion capability in a setting similar to codecompletion tools built into standard code editors, and show that using a comment completion tool can save up to 47% of the comment typing.
|2013||Lexical Statistical Machine Translation for Language Migration||A. T. Nguyen, T. T. Nguyen, T. N. Nguyen||FSE||
Prior research has shown that source code also exhibits naturalness, i.e. it is written by humans and is likely to be repetitive. The researchers also showed that the n-gram language model is useful in predicting the next token in a source file given a large corpus of existing source code. In this paper, we investigate how well statistical machine translation (SMT) models for natural languages could help in migrating source code from one programming language to another. We treat source code as a sequence of lexical tokens and apply a phrase-based SMT model on the lexemes of those tokens. Our empirical evaluation on migrating two Java projects into C# showed that lexical, phrase-based SMT could achieve high lexical translation accuracy ( BLEU from 81.3-82.6%). Users would have to manually edit only 11.9-15.8% of the total number of tokens in the resulting code to correct it. However, a high percentage of total translation methods (49.5-58.6%) is syntactically incorrect. Therefore, our result calls for a more program-oriented SMT model that is capable of better integrating the syntactic and semantic information of a program to support language migration.
|2013||A Statistical Semantic Language Model for Source Code||T.T. Nguyen, A.T. Nguyen, H.A. Nguyen, T.N. Nguyen||FSE||
Recent research has successfully applied the statistical n-gram language model to show that source code exhibits a good level of repetition. The n-gram model is shown to have good predictability in supporting code suggestion and completion. However, the state-of-the-art n-gram approach to capture source code regularities/patterns is based only on the lexical information in a local context of the code units. To improve predictability, we introduce SLAMC, a novel statistical semantic language model for source code. It incorporates semantic information into code tokens and models the regularities/patterns of such semantic annotations, called sememes, rather than their lexemes. It combines the local context in semantic n-grams with the global technical concerns/functionality into an n-gram topic model, together with pairwise associations of program elements. Based on SLAMC, we developed a new code suggestion method, which is empirically evaluated on several projects to have relatively 18–68% higher accuracy than the state-of-the-art approach.
|2013||A Study of Repetitiveness of Code Changes in Software Evolution||H.A. Nguyen, A.T. Nguyen, T.T. Nguyen, T.N. Nguyen, H. Rajan||ASE||
In this paper, we present a large-scale study of repetitiveness of code changes in software evolution. We collected a large data set of 2,841 Java projects, with 1.7 billion source lines of code (SLOC) at the latest revisions, 1.8 million code change revisions (0.4 million fixes), 6.2 million changed files, and 2.5 billion changed SLOCs. A change is considered repeated within or cross-project if it matches another change having occurred in the history of the project or another project, respectively. We report the following important findings. First, repetitiveness of changes could be as high as 70–100% at small sizes and decreases exponentially as size increases. Second, repetitiveness is higher and more stable in the cross-project setting than in the project-within one. Third, fixing changes repeat similarly to general changes. Importantly, learning code changes and recommending them in software evolution is beneficial with accuracy for top-1 recommendation of over 30% and top-3 of nearly 35%. Repeated fixing changes could also be useful for automatic program repair.
|2013||Structured Statistical Syntax Tree Prediction||C. Omar||SPLASH||
Statistical models of source code can be used to improve code completion systems, assistive interfaces, and code compression engines. We are developing a statistical model where programs are represented as syntax trees, rather than simply a stream of tokens. Our model, initially for the Java language, combines corpus data with information about syntax, types and the program context. We tested this model using open source code corpuses and find that our model is significantly more accurate than the current state of the art, providing initial evidence for our claim that combining structural and statistical information is a fruitful strategy.
|2012||On the Naturalness of Software||A. Hindle, E. T. Barr, Z. Su, M. Gabel, P. Devanbu||ICSE||
Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension.
We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations—and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse’s built-in completion capability. We conclude the paper by laying out a vision for future research in this area.
|2009||Learning from Examples to Improve Code Completion Systems||M. Bruch, M. Monperrus, and M. Mezini||ESEC/FSE||
The suggestions made by current IDE’s code completion features are based exclusively on static type system of the programming language. As a result, often proposals are made which are irrelevant for a particular working context. Also, these suggestions are ordered alphabetically rather than by their relevance in a particular context. In this paper, we present intelligent code completion systems that learn from existing code repositories. We have implemented three such systems, each using the information contained in repositories in a different way. We perform a large-scale quantitative evaluation of these systems, integrate the best performing one into Eclipse, and evaluate the latter also by a user study. Our experiments give evidence that intelligent code completion systems which learn from examples significantly outperform mainstream code completion systems in terms of the relevance of their suggestions and thus have the potential to enhance developers’ productivity.
|2007||A Factor Graph Model for Software Bug Finding||T. Kremenek, A.Y. Ng, D. Engler||IJCAI||
Automatic tools for finding software errors require knowledge of the rules a program must obey, or “specifications,” before they can identify bugs. We present a method that combines factor graphs and static program analysis to automatically infer specifications directly from programs. We illustrate the approach on inferring functions in C programs that allocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and the OS kernel for Mac OS X (XNU). The inferred specifications are highly accurate and with them we have discovered numerous bugs.