Learning to Answer Semantic Queries over Code

Surya Prakash Sahu
Madhurima Mandal
Shikhar Bharadwaj
Aditya Kanade
Shirish Shevade
Google Research (2022)


During software development, developers need answers to queries about semantic
aspects of code. Even though extractive question-answering using neural
approaches has been studied widely for natural language, the problem of answering
semantic queries over code using neural networks has not yet been explored. This
is mainly because there is no existing dataset with extractive question and
answer pairs over code involving complex concepts and long chains of reasoning.
We bridge this gap by building a new, curated dataset called CodeQueries, and
proposing a neural question-answering methodology over code.

We build upon state-of-the-art pre-trained models of code to predict answer and
supporting-fact spans. Given a query and code, only some of the code may be
relevant to answer the query. We first experiment under an ideal setting where
only the relevant code is given to the model and show that our models do well. We
then experiment under three pragmatic considerations: (1) scaling to large-size
code, (2) learning from a limited number of examples and (3) robustness to minor
syntax errors in code. Our results show that while a neural model can be resilient
to minor syntax errors in code, its performance is limited by larger code size,
the presence of code that is not relevant to the query, and fewer training
examples. We are releasing our data and models to facilitate future work on
the proposed problem of answering semantic queries over code.