Learning and Evaluating Contextual Embedding of Source Code

Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi
International Conference on Machine Learning (ICML), Vienna, Austria (2020)

Abstract

Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. However, there is as yet no attempt to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; this paper aims to fill that gap. Specifically,
first, we curate a massive, deduplicated corpus
of 6M Python files from GitHub, which we
use to pre-train CuBERT, an open-sourced code-understanding
BERT model; and, second, we create
an open-sourced benchmark that comprises
five classification tasks and one program-repair
task, akin to code-understanding tasks previously proposed in the literature. We fine-tune CuBERT on
our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, to BiLSTM and Transformer models, and to published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark and from comparing against CuBERT models as a strong baseline.
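
To make the fine-tuning recipe above concrete, the following is a minimal sketch, not the paper's released implementation: it fine-tunes a BERT-style encoder on a binary code-classification task, assuming the Hugging Face transformers API, the generic bert-base-uncased checkpoint standing in for CuBERT, and two hypothetical labeled snippets.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Generic BERT checkpoint as a stand-in for CuBERT (an assumption; the
# paper releases its own pre-trained model). Two labels, e.g. buggy vs. correct.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Hypothetical labeled Python snippets (label 1 = defect present).
snippets = ["def inc(x):\n    return x + 1",
            "def inc(x):\n    return x - 1"]
labels = torch.tensor([0, 1])

batch = tokenizer(snippets, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps, for illustration only
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()  # cross-entropy loss over the two labels
    optimizer.step()

The same fine-tuning pattern applies to each benchmark task separately; the paper's point is that code-specific pre-training makes such fine-tuning effective even with shorter training and fewer labeled examples.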