CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation

Pardis Pashakhanloo
Aaditya Naik
Yuepeng Wang
Mayur Naik
ICLR (2022)

Abstract

Designing a suitable representation for code-reasoning tasks is challenging in
several respects: which kinds of program information to model, how to combine
them, and how much context to consider. We propose CodeTrek, a deep learning approach
that addresses these challenges by representing codebases as databases that conform
to rich relational schemas. The relational representation not only allows CodeTrek
to uniformly represent diverse kinds of program information, but also to leverage
program-analysis queries to derive new semantic relations, which can be readily
incorporated without further architectural engineering. CodeTrek embeds this
relational representation using a set of walks that can traverse different relations
in an unconstrained fashion, and incorporates all relevant attributes along the way.
We evaluate CodeTrek on four diverse and challenging Python tasks: variable
misuse, exception prediction, unused definition, and variable shadowing. CodeTrek
achieves an accuracy of 91%, 63%, 98%, and 94% on these tasks respectively, and
outperforms state-of-the-art neural models by 2--19 percentage points.
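To make the abstract's idea concrete, the following is a minimal sketch, not CodeTrek's actual schema or walk procedure: it represents a toy code snippet as relational tables (all relation and attribute names here are hypothetical), joins those relations into a graph, and samples walks that hop between different relations in an unconstrained fashion, as the abstract describes.

```python
import random

# Hypothetical relational tables for the snippet `x = 1; y = x + z`.
# Each relation is a list of tuples; this schema is illustrative only.
variables = [("v1", "x"), ("v2", "y"), ("v3", "z")]   # (var_id, name)
statements = [("s1", "assign"), ("s2", "assign")]     # (stmt_id, kind)
defines = [("s1", "v1"), ("s2", "v2")]                # statement defines variable
uses = [("s2", "v1"), ("s2", "v3")]                   # statement uses variable

# Join the relations into an undirected graph so that a walk can
# traverse any relation in either direction.
graph = {}

def add_edge(a, b, label):
    graph.setdefault(a, []).append((b, label))
    graph.setdefault(b, []).append((a, label))

for s, v in defines:
    add_edge(s, v, "defines")
for s, v in uses:
    add_edge(s, v, "uses")

def sample_walk(start, length, rng):
    """Sample one walk: a list of (node_id, incoming-relation-label) steps."""
    walk = [(start, None)]
    node = start
    for _ in range(length):
        node, label = rng.choice(graph[node])
        walk.append((node, label))
    return walk

# A set of walks starting from variable `x`; in CodeTrek these walks
# (with node attributes attached) would be embedded by a neural model.
rng = random.Random(0)
walks = [sample_walk("v1", 4, rng) for _ in range(3)]
```

A derived semantic relation (say, one produced by a program-analysis query) would simply be another table added alongside `defines` and `uses`, with no change to the walk sampler, which is the extensibility the abstract emphasizes.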