
A community effort to optimize sequence-based deep learning models of gene regulation

Designing the test sequences

High-expression and low-expression sequences were designed using DEAP58, with the mutation probability and the two-point crossover probability set to 0.1, a selection tournament size of 3 and an initial population size of 100,000, running the genetic algorithm for ten generations and using the predictions of a CNN trained on random yeast promoter sequences as the fitness function22. Native test subset sequences were designed by sectioning native yeast promoters into 80-bp fragments13. Random sequences were sampled from a previous experiment in which the tested DNA was synthesized randomly (as in the training data) and quantified13. Challenging sequences were designed by maximizing the difference between the expression predicted by a CNN model22 and that predicted by a biochemical model (a type of physics-informed NN)13; these sequences represented the Pareto front of the differences in expression between models when optimizing populations of 100 sequences at a time for 100 generations using a genetic algorithm with a per-base mutation rate of 0.02 and a recombination rate of 0.5, using DEAP58 and a custom script (GASeqDesign.py59). Most of the SNVs represented sequence trajectories from Vaishnav et al.22 but also included random mutations added to random, designed and native promoter sequences. Motif perturbation sequences included Reb1 and Hsf1 perturbations. Sequences with perturbed Reb1 binding sites were created by inserting Reb1 consensus binding sites (strong or medium affinity; sense and reverse complement orientations), adding 1–3 SNVs at each possible location of each motif occurrence and inserting the canonical and mutated motif occurrences into ten randomly generated sequences at position 20 or 80. Sequences with Hsf1 motif occurrences were designed by tiling random background sequences with 1–10 Hsf1 monomeric consensus sites (ATGGAACA), added sequentially from both the right and the left of the random starting sequences or added individually at each of the eight possible positions, or by similarly tiling or inserting 1–5 trimeric Hsf1 consensus sites (TTCTAGAANNTTCT). The motif tiling test subset sequences were designed by embedding a single consensus site for each motif (poly(A), AAAAA; Skn7, GTCTGGCCC; Mga1, TTCT; Ume6, AGCCGCC; Mot3, GCAGGCACG; Azf1, TAAAAGAAA) at every possible position (with the motif contained completely within the 80-bp variable region) and in both orientations in three randomly generated background sequences13.
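
As an illustration of this design procedure, the sketch below sets up a comparable genetic algorithm in DEAP; the fitness function predict_expression is a hypothetical stand-in for the trained CNN predictor, and the small population size is chosen only for illustration, so the actual design script (GASeqDesign.py) may differ in its details.

```python
import random
from deap import algorithms, base, creator, tools

BASES = "ACGT"
SEQ_LEN = 80  # length of the variable promoter region

def predict_expression(seq: str) -> float:
    """Hypothetical stand-in for the trained CNN expression predictor."""
    return seq.count("G") + seq.count("C")  # placeholder fitness only

creator.create("FitnessMax", base.Fitness, weights=(1.0,))   # maximize predicted expression
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_base", random.choice, BASES)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_base, n=SEQ_LEN)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def mutate_dna(individual, indpb=0.05):
    """Substitute random bases with per-position probability indpb."""
    for i in range(len(individual)):
        if random.random() < indpb:
            individual[i] = random.choice(BASES)
    return (individual,)

toolbox.register("evaluate", lambda ind: (predict_expression("".join(ind)),))
toolbox.register("mate", tools.cxTwoPoint)                    # two-point crossover
toolbox.register("mutate", mutate_dna)
toolbox.register("select", tools.selTournament, tournsize=3)  # tournament size of 3

pop = toolbox.population(n=1000)  # the challenge design used an initial population of 100,000
# crossover and mutation probabilities of 0.1 each, run for ten generations
pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.1, mutpb=0.1, ngen=10, verbose=False)
best = tools.selBest(pop, k=1)[0]
print("".join(best), best.fitness.values[0])
```

Low-expression sequences can be designed analogously by minimizing the fitness (for example, weights=(-1.0,)).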

Quantifying promoter expression

High-complexity random DNA libraries comprising the training data were created using Gibson assembly to assemble a double-stranded random DNA insert into the dual reporter vector yeast_DualReporter (AddGene, 127546). The random DNA insert was created by annealing a complementary primer sequence, extending to double-stranded DNA using Phusion polymerase master mix (New England Biolabs) and gel purifying before cloning. The random DNA was inserted between distal (GCTAGCAGGAATGATGCAAAAGGTTCCCGATTCGAACTGCATTTTTTTCACATC) and proximal (GGTTACGGCTGTTTCTTAATTAAAAAAAGATAGAAAACATTAGGAGTGTAACACAAGACTTTCGGATCCTGAGCAGGCAAGATAAACGA) promoter regions. The random promoter library in Escherichia coli theoretically contained about 74 million random promoters (estimated by dilution and plating) and was transformed into S288c (ΔURA3) yeast, yielding 200 million transformants, which were selected in SD-Ura medium. Then, 1 L of filtered Chardonnay grape must was inoculated with the pool to an initial optical density at 600 nm (OD600) of 0.05 and grown at room temperature without continual shaking; the culture was diluted as needed with fresh Chardonnay grape must to maintain the OD600 below 0.4, for a total growth time of 48 h (>5 generations). Before each OD measurement, the culture was gently agitated to decarbonate it, waiting for the resulting foam to die down before agitating again and continuing until no more bubbles were released. Before sorting, yeasts were spun down, washed once in ice-cold PBS, resuspended in ice-cold PBS and kept on ice, and then sorted by log2(red fluorescent protein (RFP)/yellow fluorescent protein (YFP)) signal (using mCherry and green fluorescent protein absorption and emission) on a Beckman Coulter MoFlo Astrios, using the constitutive RFP under pTEF2 regulation to control for extrinsic noise13. Cells were sorted into 18 uniform bins, in three batches of six bins each. After sorting, cells from each bin were spun down, resuspended in SC-Ura and grown for 2–3 days with shaking at 30 °C. Plasmids were isolated, the promoter region was amplified, Nextera adaptors and multiplexing indices were added by PCR and the resulting libraries were pooled and sequenced on an Illumina NextSeq using 2 × 76-bp paired-end reads with 150-cycle kits. The designed (test) experiment was performed similarly, except that the library was amplified by PCR from a Twist oligo pool and the E. coli transformation complexity was only 10⁵, over 10× coverage of the library.

To obtain sequence–expression pairs for random promoter sequences, the paired-end reads representing both sides of the promoter sequence were aligned using the overlapping sequence in the middle, constrained to have 40 ± 15 bp of overlap, and any reads that failed to align well within these constraints were discarded13. To collapse related promoters into a single representative sequence, we aligned the sequences observed in each library to themselves using Bowtie2 (ref. 60), creating a Bowtie database containing all unique sequences observed in the experiment (default parameters) and aligning these same sequences to it, allowing for multimapping reads (parameters ‘-N 1 -L 18 -a -f -no-sq -no-head -5 17 -3 13’). Any sequences that aligned to each other were assigned to the same cluster; clusters were merged, with the sequence with the most reads taken as the ‘true’ promoter sequence for each cluster. Expression levels for each promoter sequence were estimated as the weighted average of the bins in which the promoter was observed13. For the designed (test) library, we instead aligned reads directly to a Bowtie database of the ordered sequences and estimated their expression levels using MAUDE61, with the read abundance in each sorting bin as input and the initial abundance of each sequence estimated as the average relative abundance of that sequence across all bins.
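
As a sketch of the expression estimate for the random library, the hypothetical function below computes the read-weighted average bin for a single promoter; normalization of read counts across bins is omitted for simplicity.

```python
import numpy as np

def weighted_average_bin(read_counts):
    """Expression estimate for one promoter: the read-weighted mean of the
    sorting-bin indices (18 bins) in which the promoter was observed."""
    counts = np.asarray(read_counts, dtype=float)
    bins = np.arange(len(counts))                  # bin indices 0..17
    return float((bins * counts).sum() / counts.sum())

# example: a promoter observed mostly in the higher-expression bins
print(weighted_average_bin([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 5, 9, 4, 1, 0, 0]))
```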

Competition rules

1. Only the provided training data could be used to train models. Models had to be trained from scratch, without any pretraining on external datasets, to avoid overfitting to sequences present in the test data (for example, some sequences in the test data were derived from extant yeast promoters).

2. Reproducibility was a prerequisite for all submissions. The participants had to provide the code and instructions to reproduce their models. We retrained the top-performing solutions to validate their performance.

3. Augmenting the provided training data was allowed. Pseudolabeling the provided test data was not allowed. Using the test data for any purpose during training was not allowed.

4. Ensembles were not allowed.

Detailed information on the competition and its guidelines can be found on the DREAM Challenge webpage (https://www.synapse.org/#!Synapse:syn28469146/wiki/617075).

Performance evaluation metric

We calculated Pearson’s r² and Spearman’s ρ between predictions and measurements for each sequence subset. The weighted sum of each performance metric across promoter types yielded our two final performance measurements, which we called the Pearson score and Spearman score.

$${\rm{Pearson}}\;{\rm{score}}=\frac{\sum _{i\in {\rm{subsets}}}{w}_{i}\times {\rm{Pearson}}\;{r}_{i}^{2}}{\sum _{i\in {\rm{subsets}}}{w}_{i}}$$

$${\rm{Spearman}}\;{\rm{score}}=\frac{\sum _{i\in {\rm{subsets}}}{w}_{i}\times {\rm{Spearman}}\;{\rho }_{i}}{\sum _{i\in {\rm{subsets}}}{w}_{i}}$$

Here, \({w}_{i}\) is the weight used for the \(i\)th test subset (Table 1); \({\rm{Pearson}}\;{r}_{i}^{2}\) and \({\rm{Spearman}}\;{\rho }_{i}\) are, respectively, the squared Pearson correlation coefficient and the Spearman correlation coefficient for sequences in the \(i\)th subset.
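
The two scores can be computed with a short function like the one below (a sketch; subset names, array layout and the Table 1 weights are assumed to be supplied by the caller).

```python
from scipy.stats import pearsonr, spearmanr

def challenge_scores(subsets, weights):
    """subsets: dict of subset name -> (predictions, measurements) arrays.
    weights: dict of subset name -> weight w_i from Table 1."""
    pearson_num = spearman_num = denom = 0.0
    for name, (pred, meas) in subsets.items():
        w = weights[name]
        pearson_num += w * pearsonr(pred, meas)[0] ** 2   # Pearson r^2 for subset i
        spearman_num += w * spearmanr(pred, meas)[0]      # Spearman rho for subset i
        denom += w
    return pearson_num / denom, spearman_num / denom      # Pearson score, Spearman score
```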

Bootstrapping analysis of model performance

To determine the relative performance of the models, we performed a bootstrapping analysis. Here, we sampled 10% of the test data 10,000 times and, for each sample, calculated the performance of each model and the rankings of the models for both Pearson and Spearman scores. We averaged the ranks from both metrics to decide their final ranks.
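
A simplified sketch of this procedure is shown below; for brevity it ranks models by a single overall Pearson r² rather than the subset-weighted scores, and sampling with replacement is an assumption.

```python
import numpy as np

def bootstrap_mean_rank(predictions, measurements, n_boot=10_000, frac=0.1, seed=0):
    """predictions: dict of model name -> array aligned with measurements.
    Returns each model's rank (1 = best) averaged over bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(measurements)
    models = list(predictions)
    rank_sums = {m: 0.0 for m in models}
    for _ in range(n_boot):
        idx = rng.choice(n, size=int(frac * n), replace=True)   # sample 10% of the test data
        score = {m: np.corrcoef(predictions[m][idx], measurements[idx])[0, 1] ** 2
                 for m in models}
        for rank, m in enumerate(sorted(models, key=lambda m: -score[m]), start=1):
            rank_sums[m] += rank
    return {m: rank_sums[m] / n_boot for m in models}
```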

Description of the approaches used by the participants

An overview of the approaches used by the participants in the challenge is provided in the Supplementary Information.

Prix Fixe framework

The Prix Fixe framework, implemented in Python and PyTorch, facilitated the design and training of NNs by modularizing the entire process, from data preprocessing to prediction, and by enforcing specific formats for module inputs and outputs to allow integration of components from different approaches. The individual modules of the Prix Fixe framework are described below.

Data processor and trainer

The data processor class is dedicated to transforming raw DNA sequence data into a format usable for subsequent NN training. The data processor can produce an iterable object delivering a dictionary that contains a feature matrix ‘x’ (input to the NN) and a target vector ‘y’ (expected output); additional keys can be included to support extended functionality. Moreover, the data processor can provide essential parameters needed to initialize NN blocks, such as the number of channels in the first layer.

The trainer class manages the training of the NN. It processes batches of data from the data processor and feeds them into the NN. It computes auxiliary losses, if necessary, alongside the main losses from the NN, facilitating complex loss calculation during training.
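
The two interfaces can be sketched as abstract classes roughly as follows; the class and method names are illustrative and may not match the released Prix Fixe code exactly.

```python
from abc import ABC, abstractmethod
import torch

class DataProcessor(ABC):
    """Transforms raw sequences into batches of {'x': ..., 'y': ...} dictionaries."""

    @abstractmethod
    def __iter__(self):
        """Yield dicts with at least 'x' of shape (batch, channels, seqLen) and 'y'."""

    @abstractmethod
    def n_channels(self) -> int:
        """Number of input channels, used to initialize the first layer block."""

class Trainer(ABC):
    """Feeds batches from a DataProcessor into the NN and combines losses."""

    def __init__(self, model: torch.nn.Module, data: DataProcessor):
        self.model, self.data = model, data

    @abstractmethod
    def train_epoch(self):
        """One pass over the data; adds any auxiliary losses to the main NN loss."""
```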

Prix Fixe net

This module embodies the entirety of the NN architecture:

(i) First layer block: This constitutes the initial layers of the network, which may include initial convolutional layers or facilitate specific encoding mechanisms such as k-mer encoding of the input.

(ii) Core layer block: This represents the central architecture components, housing elements such as residual connections, LSTM mechanisms and self-attention. The modular construction of this block also allows for versatile combinations, such as stacking a residual CNN block with a self-attention block.

(iii) Final layer block: This phase narrows the latent space to produce the final prediction, using layers such as pooling, flattening and dense layers. It computes the prediction and outputs it alongside the loss.

For all three blocks, the standard input format is (batch, channels, seqLen). The first two blocks yield an output in a consistent format (batch, channels, seqLen), whereas the last block delivers the predicted expression values. Each block can propagate its own loss. The whole framework is implemented in PyTorch.
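
In PyTorch terms, the composition of the three blocks amounts to something like the sketch below (illustrative only; the framework's actual classes also handle losses propagated from the first two blocks).

```python
import torch
from torch import nn

class PrixFixeNet(nn.Module):
    """Chains first, core and final layer blocks into one model."""

    def __init__(self, first: nn.Module, core: nn.Module, final: nn.Module):
        super().__init__()
        self.first, self.core, self.final = first, core, final

    def forward(self, x, y=None):
        # x: (batch, channels, seqLen); the first two blocks preserve this layout
        x = self.first(x)
        x = self.core(x)
        pred, loss = self.final(x, y)   # the final block returns predictions and a loss
        return pred, loss
```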

To ensure fair comparison across solutions in the Prix Fixe framework, we removed test-time processing steps that were unique to each solution. We divided the DREAM Challenge dataset into two segments, allocating 90% of the sequences for training and 10% for validation. Using these data, we retrained all combinations of blocks that were compatible within the framework. Of the 81 potential combinations, we identified 45 as compatible and 41 of these successfully converged during training. Because of graphics processing unit (GPU) memory constraints, we adjusted the batch sizes for certain combinations.

DREAM-optimized models from Prix Fixe runs

Data processor and trainer

Promoter sequences were extended at the 5′ end using constant segments from the plasmids to standardize their length to 150 bp. These sequences were one-hot encoded (OHE) into four-dimensional vectors. ‘Singleton’ promoters, observed only once across all bins, were assigned integer expression estimates. To account for the potential variability in these singleton expression estimates, a binary ‘is_singleton’ channel was incorporated, set to 1 for singletons and 0 otherwise. To account for the diverse behavior of regulatory elements depending on their strand orientation relative to transcription start sites, each sequence in the training set was provided in both its original and reverse complementary form, identified using the ‘is_reverse’ channel (0 for original and 1 for reverse complementary). Consequently, the input dimensions were set at (batch, 6, 150).
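
The resulting six-channel encoding can be sketched as follows; padding with ‘N’ rather than the actual plasmid sequence is a simplification for illustration.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq, is_singleton, is_reverse, length=150):
    """Return a (6, length) matrix: four one-hot channels plus is_singleton and is_reverse."""
    x = np.zeros((6, length), dtype=np.float32)
    seq = seq[-length:].rjust(length, "N")        # pad at the 5' end (plasmid sequence in practice)
    for pos, base in enumerate(seq):
        if base in BASE_INDEX:
            x[BASE_INDEX[base], pos] = 1.0
    x[4, :] = float(is_singleton)                 # 1 for singleton promoters, 0 otherwise
    x[5, :] = float(is_reverse)                   # 1 for the reverse-complement copy
    return x

batch = np.stack([encode("ATGC" * 20, is_singleton=0, is_reverse=0)])
print(batch.shape)  # (1, 6, 150)
```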

The model’s training used the AdamW optimizer, set with a weight_decay of 0.01. The maximum LR of 0.005 was chosen for most blocks, while a conservative rate of 0.001 was applied to attention blocks because of the inherent sensitivity of self-attention mechanisms to higher rates. This LR was scheduled by the one-cycle LR policy62, which featured two phases and used the cosine annealing strategy. Training data were segmented into batches of size 1,024, with the entire training procedure spanning 80 epochs. Model performance and selection were based on the highest Pearson’s r value observed in the validation dataset.
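
In PyTorch, this optimizer and schedule can be set up roughly as below; identifying attention parameters by the substring 'attn' in their names is an illustrative shortcut.

```python
from torch import nn, optim

def build_optimizer(model: nn.Module, steps_per_epoch: int, epochs: int = 80):
    # give attention blocks their own parameter group with a lower maximum LR
    attn, other = [], []
    for name, param in model.named_parameters():
        (attn if "attn" in name else other).append(param)
    optimizer = optim.AdamW([{"params": other}, {"params": attn}], weight_decay=0.01)
    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=[0.005, 0.001],            # per parameter group
        steps_per_epoch=steps_per_epoch,  # e.g. len(train_loader) with batch size 1,024
        epochs=epochs,
        anneal_strategy="cos")            # cosine annealing over the two phases
    return optimizer, scheduler
```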

During prediction, data processing mirrored that used during training, apart from setting ‘is_singleton’ to 0. Predictions for both the original and reverse complementary sequences were then averaged.

Prix Fixe net

DREAM-CNN

First layer block: The OHE input was processed through a one-dimensional (1D) CNN. Drawing inspiration from DeepFam63, convolutional layers with kernel sizes of 9 and 15 were used, mirroring common motif lengths as identified by ProSampler64. Each layer had a channel size of 256, used rectified linear unit activation and incorporated a dropout rate of 0.2. The outputs of the two layers were concatenated along the channel dimension.
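
A sketch of this dual-kernel first layer block in PyTorch (the padding choices are assumptions made to keep the sequence length unchanged):

```python
import torch
from torch import nn

class FirstLayerBlock(nn.Module):
    """Two parallel 1D convolutions (kernel sizes 9 and 15), concatenated channel-wise."""

    def __init__(self, in_channels=6, channels=256, dropout=0.2):
        super().__init__()
        self.conv9 = nn.Sequential(
            nn.Conv1d(in_channels, channels, kernel_size=9, padding=4),
            nn.ReLU(), nn.Dropout(dropout))
        self.conv15 = nn.Sequential(
            nn.Conv1d(in_channels, channels, kernel_size=15, padding=7),
            nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):                                          # x: (batch, 6, 150)
        return torch.cat([self.conv9(x), self.conv15(x)], dim=1)   # (batch, 512, 150)
```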

Core layer block: This segment contained six convolution blocks whose structure was influenced by the EfficientNet architecture, with modifications such as replacing depth-wise convolution with grouped convolution, using squeeze-and-excitation (SE) blocks65 and adopting channel-wise concatenation for residual connections. The channel configuration started with 256 channels for the initial block, followed by 128, 128, 64, 64, 64 and 64 channels66.
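
As one illustrative ingredient, a generic 1D squeeze-and-excitation block (not the exact DREAM-CNN implementation) looks like this:

```python
import torch
from torch import nn

class SqueezeExcite1d(nn.Module):
    """Channel-wise gating: squeeze over sequence positions, then rescale each channel."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                 # x: (batch, channels, seqLen)
        s = x.mean(dim=-1)                # squeeze: global average over positions
        s = torch.relu(self.fc1(s))
        s = torch.sigmoid(self.fc2(s))    # per-channel gates in (0, 1)
        return x * s.unsqueeze(-1)        # excite: rescale channels
```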

Final layer block: The final block consisted of a single point-wise convolutional layer followed by channel-wise global average pooling and SoftMax activation.

DREAM-RNN

First layer block: Same as DREAM-CNN.

Core layer block: The core used a Bi-LSTM, designed to capture motif dependencies. The LSTM’s hidden states had dimensions of 320 each, resulting in 640 dimensions after concatenation. A subsequent CNN block, similar to the first layer block, was incorporated.
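
A sketch of this Bi-LSTM core in PyTorch; the input channel count (512, from the first layer block) and the abbreviated trailing CNN block are assumptions.

```python
import torch
from torch import nn

class BiLSTMCore(nn.Module):
    """Bidirectional LSTM over the sequence axis, followed by a small CNN block."""

    def __init__(self, in_channels=512, hidden=320):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_channels, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(        # abbreviated stand-in for the subsequent CNN block
            nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=9, padding=4),
            nn.ReLU(), nn.Dropout(0.2))

    def forward(self, x):                        # x: (batch, channels, seqLen)
        h, _ = self.lstm(x.transpose(1, 2))      # (batch, seqLen, 640) after concatenation
        return self.conv(h.transpose(1, 2))      # back to (batch, 640, seqLen)
```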

Final layer block: Same as DREAM-CNN.

DREAM-Attn

First layer block: This segment was a standard convolution with kernel size 7, followed by BatchNorm67 and sigmoid linear unit activation68.

Core layer block: This block used the Proformer69, a Macaron-like transformer encoder architecture, which uses two half-step feed-forward network (FFN) layers at the start and end of the encoder block. Additionally, a separable 1D convolution layer was integrated after the initial FFN layer and before the multihead attention layer.

Final layer block: Same as DREAM-CNN and DREAM-RNN.

Human MPRA models

Within each of the three large-scale MPRA libraries, every sequence and its corresponding reverse complement were grouped together and these pairs were then distributed into ten distinct cross-validation folds to ensure that both the forward and the reverse sequences resided within the same fold. DREAM-CNN, DREAM-RNN, DREAM-Attn and MPRAnn were trained using nine of these ten folds, reserving one fold to evaluate the model’s performance. For every held-out test fold, nine models were trained, with one fold being dedicated for validation purposes while the remaining eight acted as training folds. Subsequent predictions from these nine models were aggregated, with the average being used as the final prediction for the held-out test data.
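
The fold assignment and prediction averaging can be sketched as follows; the pairing key and the predict method are illustrative.

```python
import numpy as np

def assign_folds(pair_ids, n_folds=10, seed=0):
    """Assign each forward/reverse-complement pair to one of n_folds folds.

    pair_ids: one identifier per sequence, shared by a sequence and its reverse
    complement, so both members of a pair always land in the same fold."""
    rng = np.random.default_rng(seed)
    unique_ids = sorted(set(pair_ids))
    fold_of = dict(zip(unique_ids, rng.integers(0, n_folds, size=len(unique_ids))))
    return np.array([fold_of[p] for p in pair_ids])

def averaged_prediction(models, x_test):
    """Average the predictions of the nine models trained for one held-out test fold."""
    return np.mean([m.predict(x_test) for m in models], axis=0)
```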

The MPRAnn architecture45 was trained with an LR of 0.001, an early stopping criterion with patience of 10 on 100 epochs, a batch size of 32 and the Adam optimizer with a mean squared error loss function. For DREAM-CNN, DREAM-RNN and DREAM-Attn, components that could not accommodate Agarwal et al.’s data were discarded. For instance, the ‘is_singleton’ channel is not relevant for MPRA data, and loss calculation was performed using the mean squared error (as in MPRAnn) in place of the Kullback–Leibler divergence because of the infeasibility of transitioning the problem to soft classification. MPRAnn used a much smaller batch size than our DREAM-optimized trainer (32 versus 1,024); thus, we reduced our batch size to match MPRAnn. No other alterations were made to either the models’ structure or the training paradigm.

Drosophila UMI-STARR-seq models

The DeepSTARR architecture42 was trained with an LR of 0.002, an early stopping criterion with patience of 10 on 100 epochs, a batch size of 128 and the Adam optimizer with a mean squared error loss function. For DREAM-CNN, DREAM-RNN and DREAM-Attn, we used the exact setting as used for human MPRA datasets but with the output layer modified to predict two values corresponding to expression with housekeeping and developmental promoters and the loss calculated as the sum of the mean squared errors for each output (as in DeepSTARR).

Only the five largest Drosophila chromosomes (Chr2L, Chr2R, Chr3L, Chr3R and ChrX) were used as test data. For every held-out test chromosome, the remaining sequences were distributed into ten distinct train–validation folds and DREAM-CNN, DREAM-RNN, DREAM-Attn and DeepSTARR models (ten of each) were trained. Subsequent predictions from these ten models were aggregated, with the average being used as the final prediction for the held-out test chromosome.

Human chromatin accessibility models

We used five separate train–validation–test splits as proposed in a previous study46 for ATAC-seq experiments on the human cell line K56249. For each of these partitions, we first trained five different bias models, one per fold, which were designed to capture enzyme-driven biases present in ATAC-seq profiles. Subsequently, ChromBPNet, DREAM-CNN and DREAM-RNN models were trained for each fold, using the same bias models. For DREAM-CNN and DREAM-RNN, the prediction head from ChromBPNet (1D convolution layer, cropping layer, average pooling layer and a dense layer) was used in the final layer block to predict the accessibility profile and read counts. Input-encoding channels for is_singleton and is_rev were omitted. We modified the DREAM-optimized trainer in this case to use the same batch size as ChromBPNet (from 1,024 to 64) and a reduced maximum LR (from 5 × 10⁻³ to 5 × 10⁻⁴). No other alterations were made to either the model’s structure or the training paradigm.

For this task, we reimplemented the DREAM-CNN and DREAM-RNN architectures in TensorFlow to ensure that all models used the same bias models. This methodological choice came at the cost of having to leave some components (input encoding, the AdamW optimizer and so on) out of the DREAM-optimized data processor and trainer. However, it ensured uniformity across models, leading to an unbiased comparison across the different architectures.

ISM

The ISM scores for DNA sequences were obtained by creating all possible single-nucleotide mutations of each sequence and calculating the change in predicted expression relative to the original sequence. A single ISM score for each position was then determined by averaging the mutagenesis scores across nucleotides at that position.
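
A minimal sketch of this computation, assuming a predict function that maps a sequence string to a scalar predicted expression and averaging over the three alternative nucleotides at each position:

```python
import numpy as np

BASES = "ACGT"

def ism_scores(seq, predict):
    """Per-position ISM score: mean change in predicted expression over all
    single-nucleotide substitutions at that position."""
    ref = predict(seq)
    scores = np.zeros(len(seq))
    for i, base in enumerate(seq):
        deltas = [predict(seq[:i] + alt + seq[i + 1:]) - ref
                  for alt in BASES if alt != base]
        scores[i] = np.mean(deltas)
    return scores
```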

Average training time per batch and throughput

We measured the training time per batch for the human ATAC-seq dataset with a batch size of 64 and for the two other datasets with a batch size of 32. Throughput was determined by measuring how many data points each model could predict per second (without backpropagation). Starting with a batch size of 32, we doubled the batch size incrementally (64, 128, 256 and so on), recording the throughput at each stage, until we reached the maximum batch size supportable by the GPU, which was then used to calculate throughput. We processed 5,000 batches for each model to calculate the average training time per batch and 100 batches to calculate throughput. The calculations for both training time per batch and throughput were repeated 50 times to ensure reliability and the distribution of these measurements is presented as a box plot in Extended Data Fig. 10. All tests were conducted using an NVIDIA V100 16-GB GPU, ensuring consistency in computational resources across all experiments.
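
The throughput measurement can be sketched as follows (PyTorch); the batch-probing loop and the make_batch helper are schematic simplifications.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, make_batch, n_batches=100, start_batch=32):
    """Find the largest batch size that fits on the GPU, then time forward passes."""
    model.eval()
    batch_size = start_batch
    while True:
        try:
            model(make_batch(batch_size * 2).cuda())   # probe the next (doubled) batch size
            batch_size *= 2
            torch.cuda.empty_cache()
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            break
    x = make_batch(batch_size).cuda()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_batches):
        model(x)
    torch.cuda.synchronize()
    return n_batches * batch_size / (time.perf_counter() - start)   # data points per second
```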

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


