DeepSomatic: A SNV and Small-Indel Somatic Caller Using Deep Neural Networks

Nicolas Robine
Benedict Paten
Kishwar Shafin
Jimin Park
Lucas Brambrink
Daniel Cook
Mikhail Kolmogorov
Andrew Carroll
Nature Biotechnology (2025)

Abstract

DeepVariant is a highly accurate germline variant caller that applies deep learning to classify germline variants. Here, we present DeepSomatic, which applies DeepVariant’s convolutional neural network (CNN) to call somatic mutations from paired tumor-normal sequencing. To develop DeepSomatic, we adapted the input pipeline to accommodate tumor-normal pairs. DeepVariant model inputs consist of a set of pileup images, also referred to as “channels”, that represent features extracted from sequence data at candidate sites. Features include base, base quality, mapping quality, and haplotype. We modified this input for DeepSomatic by stacking the channels for tumor-normal pairs into a single input. In this way, the model can compare tumor and normal reads and learn to distinguish germline variants from somatic mutations.
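
The stacking step described above can be illustrated with a minimal NumPy sketch. It is not the DeepSomatic implementation: the array sizes, the function name `stack_tumor_normal`, and the choice of stacking along the read axis are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
# The four channels stand for base, base quality, mapping quality, and haplotype.
N_READS, WIDTH, N_CHANNELS = 4, 7, 4


def stack_tumor_normal(normal, tumor, axis=0):
    """Join a normal and a tumor pileup into a single array.

    Both inputs are (reads, width, channels) arrays.
    axis=0 places the tumor reads below the normal reads (assumed layout);
    axis=-1 would instead append the tumor channels to the normal channels.
    """
    normal = np.asarray(normal)
    tumor = np.asarray(tumor)
    if normal.ndim != 3 or tumor.ndim != 3:
        raise ValueError("expected (reads, width, channels) arrays")
    return np.concatenate([normal, tumor], axis=axis)


# Placeholder pileups: zeros for normal reads, ones for tumor reads.
normal = np.zeros((N_READS, WIDTH, N_CHANNELS), dtype=np.uint8)
tumor = np.ones((N_READS, WIDTH, N_CHANNELS), dtype=np.uint8)

stacked = stack_tumor_normal(normal, tumor)
print(stacked.shape)  # (8, 7, 4)
```

Either layout gives the network both samples in one input, so it can compare them at the same candidate site.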

For training, DeepSomatic uses sequence data from the SEQC2 consortium, which has established a benchmarking dataset for HCC1395 (a triple-negative breast cancer cell line) across a variety of sequencing technologies. We supplemented the SEQC2 data with sequencing from additional cell lines (H2009, H1437, HCC1954, HCC1937, Hs578T) to further improve our training datasets and enable benchmarking across different cancer types. Using these data, we have trained models on whole-genome data from Illumina, PacBio, and Oxford Nanopore (ONT).

We report that DeepSomatic outperforms existing somatic callers across sequencing technologies. For example, our DeepSomatic Illumina model achieves a SNP F1 score of 0.983 on held-out chr1 HCC1395 data, outperforming ClairS (F1=0.969) and Strelka2 (F1=0.952). Similarly, our PacBio model achieves a SNP F1 of 0.95, compared to 0.935 for ClairS. Finally, our ONT model achieves an F1 of 0.869, an improvement over ClairS (F1=0.863). DeepSomatic similarly achieves higher F1 scores for indels. We plan to investigate tumor-only models and additional training approaches to further improve DeepSomatic.
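
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative call counts. The function name and the counts are hypothetical; they are not taken from the benchmarks reported here.

```python
def f1_score(tp, fp, fn):
    """Return the F1 score from counts of true positives, false positives,
    and false negatives. Returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # fraction of calls that are real variants
    recall = tp / (tp + fn)     # fraction of real variants that were called
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 90 correct calls, 10 false calls, 10 missed variants.
print(round(f1_score(90, 10, 10), 3))  # 0.9
```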