Home NATURALEZA A comprehensive proteogenomic pipeline for neoantigen discovery to advance personalized cancer immunotherapy

A comprehensive proteogenomic pipeline for neoantigen discovery to advance personalized cancer immunotherapy

17
0


Installation and configuration of NeoDisc

NeoDisc is available upon registration (https://neodisc.unil.ch/). A manual file, describing NeoDisc requirements and outputs, is provided with NeoDisc to guide the user through the installation and configuration process.

DNA read alignment

FastQ reads were aligned and processed following the Genome Analysis Toolkit (GATK)’s best practice workflow for data preprocessing for variant discovery (https://doi.org/10.1002/0471250953.bi1110s43). Briefly, FastQ reads were aligned to the reference human genome (GRCh37) (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/) with bwa (0.7.17)45. Duplicate reads were marked and base quality scores were recalibrated with GATK (4.4.0.0)15 tools (MarkDuplicatesSpark, BaseRecalibrator and ApplyBQSR). Alignment and read quality were assessed with picard tools version 3.0.0 (https://broadinstitute.github.io/picard/) (CollectAlignmentSummaryMetrics and CollectHsMetrics) and FastQC 0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

CNV, tumor content and variant calling

CNV and tumor content were estimated from the recalibrated BAM files with the Sequenza (3.0.0)36 workflow, using matched-cancer-type CNV frequencies derived from the Pan-Cancer Analysis of Whole Genomes (PCAWG)46 as weights to improve estimations. Four variant-calling algorithms (Mutect (1.1.7)13, VarScan (2.4.6)14 and GATK (4.4.0.0)15 tools HaplotypeCaller13 and Mutect2) were applied to the recalibrated BAM files.

The GATK HaplotypeCaller algorithm reduces the global false-positive variant call rate by relying on haplotype de novo assembly in variable regions47,48. HaplotypeCaller was run in GVCF mode on tumor and germline recalibrated BAM files separately to detect SNV and indel variants. Tumor and germline GVCFs were combined using GATK GenotypeGVCF to produce a single VCF. GATK’s best practices were followed for subsequent variant quality score recalibration and variant quality assessment. Participant-specific SNPs were defined as variants present in both tumor and germline, while tumor-specific variants were defined as SMs.

The Mutect 1.1.7 variant-calling algorithm predicts SMs on the basis of log odds scores of two Bayesian classifiers. One classifier identifies nonreference variants in the tumor sample and the second classifier determines the tumor specificity of those variants. Read support is used for the filtering of candidate SMs, reducing next-generation sequencing (NGS) artifacts. Mutect 1.1.7 was run with default parameters and selected SMs were exported in VCF format.

The Mutect2 variant-calling algorithm also relies on a Bayesian somatic genotyping model but this model is different from Mutect1 and it uses de novo assembly of haplotypes in variable regions. Mutect2 was run with default parameters and identified variants (SNPs and SMs) were exported in VCF format.

The VarScan2 algorithm14 relies on hard filtering of calls, making it less sensitive to bias such as extreme read coverage or sample contamination. Variants are filtered using parameters that include read quality, strand bias, coverage and frequency. VarScan2 was run in somatic mode with custom parameters (–normal-purity 1.00, –min-normal-coverage 6, –min-tumor-coverage 8, –min-tumor-coverage 8, –somatic-p-value 0.05, –strand-filter 0) and the –tumor-purity and –min-var-freq flags were set on the basis of the tumor content estimation (0.4 × tumor content and 0.2), respectively. Resulting variant (SMs and SNPs) calls were further filtered using VarScan2 fpfilter (–dream3-settings, –min-var-basequal 25, –min-ref-avgrl 70, –min-var-avgrl 70, –keep-failures) and exported in VCF format. Finally, only variants passing fpfilter were retained.

Variant calls from GATK, Mutect1, Mutect2 and VarScan2 were combined into a single VCF that contains the union of the variants of all four callers. Ambiguous calls (that is, different calls at the same genomic coordinate) were resolved by a majority rule. In the absence of majority, calls were rejected. High-confidence variants were defined as variants identified by two or more algorithms and low-confidence variants were defined as identified by a single algorithm. Alternatively, when NeoDisc was operated in ‘sensitive’ mode, the union of all variants identified by all variant-calling algorithms was considered and annotated as high confidence. All identified variants were phased, using Whatshap (1.7)49 to determine their detection on the same chromosome. This information was critical in determining whether neighboring variants affected the same transcript and necessitated their inclusion within the same peptide sequence. When GP variants were used, a generic VCF was created and used for downstream analysis.

Clonality analysis

Only high-confidence SM were considered. The VAF was calculated for all high-confidence SMs detected in all samples, from the BAM files, using SAMtools (1.17) mpileup50. CCF, clonal and subclonal assessments were calculated from the VAF, CNVs and tumor content, as described previously51. Zygosity annotation was derived from Sequenza analysis, where segments with one loss were annotated as LOH and other segments were annotated as heterozygous. We used LICHeE (1.0)52 for phylogenetic reconstruction of cancer evolution, using all SMs identified. Trees were annotated with clonality information and with mutated genes annotated as drivers using Intogen (2016.5)39.

RNA read alignment and gene expression quantification

FastQ reads were aligned and counted per gene with STAR (2.7.10b)53 (–runMode alignReads, –twopassMode Basic, –outReadsUnmapped Fastx, –quantMode GeneCounts, –outSAMtype BAM unsorted) on the reference human genome (GRCh37) and the GENCODE version 43 annotation. Resulting BAM files were sorted and indexed with SAMtools (1.17)50 (sort, index). RSeQC (5.0.1)54 (infer_experiment.py) was used to determine RNAseq data strandness if the sequencing kit used was not known or not available in the list of predefined kits. TPM, reads per kilobase per million mapped reads (RPKM), counts per million mapped reads (CPM) and upper quartile normalization (UQN) values were calculated from raw gene counts.

HLA typing and HLA LOH analysis

HLA typing was performed on both DNA and RNA data with HLA-HD (1.7.0)37, which required paired-end sequencing data. HLA-specific read alignments derived from HLA-HD were used to derive HLA segment-specific coverage. RNA coverage of each segment was calculated as the number of reads mapping uniquely to the segment. HLA allele-specific CNs were estimated following equations described in Van Loo et al.55 and McGranaham et al.38. The log R value was calculated as follows:

$$\log R={\log }_{2}\left(\frac{{{\rm{Tum}}}_{\mathrm{cov}}}{{{\rm{Gl}}}_{\mathrm{cov}}}\times M\right),$$

where Tumcov and Glcov are the tumor and germline segment-specific coverages, respectively, and M is the multiplication factor, defined as the genome-wide tumor–germline ratio of uniquely mapped reads.

For each HLA allele pair, the segment baf was calculated as the coverage of the first allele segment divided by the sum of the coverage of the two alleles.

The segment CN was calculated as follows:

$${{\rm{CN}}}_{{\rm{allele}}1}=\frac{\left(\;\rho -1\right)+{baf}\times {2}^{\log R}\times (2\left(1-\rho \right)+\rho \varphi )}{\rho },$$

$${{\rm{CN}}}_{{\rm{allele}}2}=\frac{\left(\;\rho -1\right)-({baf}-1)\times {2}^{\log R}\times (2\left(1-\rho \right)+\rho \varphi )}{\rho },$$

where ρ is the estimated tumor content and φ is the estimated tumor ploidy.

For heterozygous HLA alleles, P values supporting loss events were calculated using a paired t-test on CN, log R and RNAseq coverage of segments with the matched allele and an unpaired t-test on CN with segments having equal numbers of major and minor alleles, as determined in the Sequenza segment file. A final P value supporting minor allele loss LOHpval was calculated by combining P values supporting loss events using the Fisher method. P values supporting allele presence were calculated using a paired t-test on CN, log R and RNAseq coverage of segments with the matched allele and an unpaired t-test on CN with segments having lost the minor allele, as determined in the Sequenza segment file. A final P value supporting allele presence presencepval was calculated by combining P values supporting presence events using the Fisher method.

A sample-specific naive Bayes classifier (model A) was trained to predict CNallele, as defined from Sequenza’s sample-specific segments, from the tumor–germline coverage depth ratio and allele frequency at heterozygous sites. Model A was applied to the sample-specific heterozygous HLA allele segments and the \(\rm{{CN}}_{{allel}{e}_{{predicted}}}\) was defined as the segment median predicted CNs. Lastly, LOH events on heterozygous alleles were called if

$$\begin{array}{c}\left(\left({{\rm{CN}}}_{{\rm{allele}}}\le 0.5\left[{\rm{and}}\right]{{\rm{CN}}}_{{\rm{allele}}_{{\rm{predicted}}}}\le 1\right)\left[{\rm{or}}\right]\right.\\\left.\left({{\rm{CN}}}_{{\rm{allele}}} < 1\left[{\rm{and}}\right]{{\rm{CN}}}_{{\rm{allele}}_{{\rm{predicted}}}} < 1\right)\right)\\\left[{\rm{and}}\right]\\{{\rm{LOH}}}_{{\rm{pval}}} < 0.01\\\left[{\rm{and}}\right]\\{{\rm{LOH}}}_{{\rm{pval}}} < {{\rm{presence}}}_{{\rm{pval}}}.\end{array}$$

A second sample-specific naive Bayes classifier (model B) was trained to predict CNtotal, as defined from Sequenza’s sample-specific segments, from the tumor–germline coverage depth ratio at all sites. Model B was applied to the sample-specific homozygous HLA alleles segments and CNallele was defined as the median predicted CN of its segments. The \({{\rm{CN}}}_{{\rm{allel}{e}}_{{\rm{confidence}}}}\) score was calculated as the product of individual CNallele segment probabilities. LOH events were defined for alleles for which \({{\rm{CN}}}_{{\rm{allele}}}=0\left[{\rm{and}}\right]{{\rm{CN}}}_{{\rm{allel}{e}}_{{\rm{confidence}}}} < 0.01\). The \({{\rm{CN}}}_{{\rm{allel}{e}}_{{\rm{final}}}}\) was set to 0 for LOH events and to \(\max (1,{\rm{round}}({{\rm{CN}}}_{{\rm{allele}}}))\). Models A and B were trained using 80% of the data for training and 20% for testing; random_state was set to 312, priors were set to none and the sklearn GridSearchCV algorithm was used to define the var_smoothing parameter using the following grid search parameters: param_grid = [‘var_smoothing’: np.logspace(0, −9, num = 50)], scoring = ‘f1_weighted’ and refit = true. A consensus HLA haplotype was built from the results of the germline and tumor typing from DNA and RNAseq data. Lost HLA-I and HLA-II alleles are flagged by NeoDisc. By default, lost alleles are kept for downstream TSA prediction and prioritization. However, users have the option to choose whether to keep or discard these alleles using the command line. It is important to note that this choice does not impact immunopeptidomic analysis. In this study, lost alleles were discarded from the predictions, with the exception of Fig. 4h, where the effect of keeping or discarding lost HLA-I alleles on the prioritization was assessed.

Noncanonical genes

Noncanonical genes were defined as expressed genes annotated as long intergenic noncoding RNA, lncRNA, microRNA, miscellaneous RNA, mitochondrial ribosomal RNA (rRNA), mitochondrial transfer RNA, processed pseudogene, processed transcript, pseudogene, retained intron, rRNA, rRNA pseudogene, small conditional RNA, sense intronic, sense overlapping, small nucleolar RNA, small nuclear RNA, translated processed pseudogene, unprocessed pseudogene and vault RNA in GENCODE version 38.

TAAs from canonical and noncanonical sources and HC-TSAs

In general, a gene expression level of >0.0 TPM was considered expressed in a given sample. Canonical and noncanonical genes (as defined above) with expression across all GTEx tissues, except testis and sun-exposed skin, with <1.0 TPM at the 90th percentile were considered as TAAs. In addition, we curated a list of immunogenic HC-TSAs by mining the literature, IEDB (https://www.iedb.org/ (ref. 9), as of 12 January 2022) and Cancer Antigenic Peptide Database (CAPED; https://caped.icp.ucl.ac.be/Peptide/list, as of 10 March 2022). HC-TSA source genes were defined on the basis of GTEx expression (<1.0 TPM at the 90th percentile in all tissues, except testis and sun-exposed skin), having expression > 1.0 TPM in the sample, except for CTAG1B, which was considered expressed if it was >0.0 TPM. All 8–19-mer peptide sequences derived from expressed HC-TSAs and canonical TAAs were predicted for binding to the participant HLA-I and HLA-II alleles using PRIME 2.0 (ref. 56) and MixMHC2pred 2.2 (ref. 57), respectively. HC-TSA and canonical TAAs derived peptides matching other sequences in the personalized healthy proteome reference were removed (with the exception of sequences derived from other HC-TSAs and canonical TAAs). Only HC-TSA and canonical TAAs derived peptides with a predicted percentage rank ≤ 0.5 or present in the curated database of immunogenic HC-TSA peptides were retained for further prioritization (see below). This restriction aims to limit the number of prioritized HC-TSA and canonical TAAs but it does not impact the detection of HC-TSA-derived and TAA-derived peptides in the immunopeptidomic data.

Detection of viruses

The NCBI Virus Genomes Resource database58 was used to retrieve all human-specific (filtering the database by host = human) and fetch their annotated genomes and proteomes. Nonhuman viruses genomes and proteomes were retrieved from the list of all available viruses, selecting entries associated with nonhuman hosts, complete reference sequence available and available accession identifier. ART (2.5.8)59 was used to simulate Illumina reads (art_illumina -ss HSXt -p -l 150 -f 4 -m 200 -s 10) from the human-specific and nonhuman viral transcript sequences. STAR (2.7.8a)53 was used for the alignment and counting of the simulated reads on the human-specific viruses genomes. We calculated the viral score of each virus as follows:

$${{\rm{RPKM}}}_{{\rm{vg}}}=\frac{{R}_{{\rm{vg}}}\times {10}^{9}}{{R}_{{\rm{Tot}}}\times {L}_{{\rm{vg}}}},$$

$${\rm{Viral}\;{score}}=\frac{\sum {\rm{R{{PKM}}}}_{{\rm{vg}}}}{{N}_{{\rm{vg}}}},$$

where Rvg is the number of reads mapped to a viral gene, RTot is the total number of mapped reads, Lvg is the length of the viral gene in base pairs and Nvg is the number of viral genes.

The specificity and sensitivity of the identifications were calculated by comparing viral scores of human-specific and nonhuman viral reads. We selected a threshold on the viral score (>97% sensitivity and specificity) to separate high-confidence from low-confidence viral identifications.

Reads not mapped to the human reference genome were aligned and counted using STAR (2.7.10b)53 to the human-specific viruses genomes database. Viral score, TPM, RPKM, CPM and UQN values for each viral gene were calculated from reads counts.

Prediction of viral peptides

We created a list of high-confidence immunogenic viral peptides from the literature and IEDB (https://www.iedb.org/ (ref. 9)). High-confidence peptides were defined as peptides not mapping to the human proteome. All 8–19-mers derived from high-confidence identified virus sequences, as determined from RNAseq, were predicted for binding to the participant HLA-I and HLA-II alleles using PRIME 2.0 (ref. 56) and MixMHC2pred 2.0 (ref. 57), respectively. NeoDisc screened all possible binding peptides across the entire protein sequence and excluded any peptide sequence matching with a personalized healthy proteome, resulting in a list of tumor-specific viral peptides.

Creation of sample-specific proteome references

The GENCODE version 43 reference proteome was combined with the three open reading frame translation products from methionine to stop of TAA noncanonical transcripts and the identified viral proteomes. It was then personalized by the addition of all VCF variants (SNPs and SMs) and annotated with gene expression, CNVs and variant-related features. This sample-specific proteome was subsequently used for the prediction of neoantigens, by including all high-confidence nonsynonymous SMs and phased high-confidence SNPs, and for the MS search, by including all high-confidence and low-confidence variants (SNPs and SMs). Additionally, a personalized healthy proteome reference was created by including all high-confidence nonsynonymous SNPs and it was used to filter out sequences that are not tumor specific.

Neoantigen prediction

All 8–19-mers predicted or MS-detected peptides covering the SM were first filtered to remove any sequences mapping to the healthy personalized proteome to assure tumor specificity. If multiple transcripts were affected by the SM, all unique sequences were considered. The 8–12-mer peptides were predicted for binding to the participant HLA-I using PRIME 2.0 (ref. 56) and all 12–19-mer peptides were predicted for binding to the participant HLA-II using MixMHC2pred 2.0 (ref. 57). Predicted peptides were annotated with CScape60 prediction of oncogenicity, median expression in the matched healthy tissue in GTEx, median expression of the matched tumor type in TCGA, Intogen (2016.5)39 gene and mutation driver information, annotated protein location from ProteinAtlas61 and peptide presentation information from ipMSDB31. For each mutation, gene expression was derived from RNAseq analysis and the mutation expression was determined as the percentage of reads carrying the mutation in the BAM file, using SAMtools mpileup50. Long peptide sequences were designed by sorting annotated short peptides sequences on the basis of their predicted binding to HLA-I and HLA-II, their predicted binding rank (low to high), the number of different alleles predicted to bind the short peptide (high to low), the ipMSDB presentation value on HLA-I and on HLA-II (high to low) and the length of the peptide sequence (low to high). Optimal long peptide sequence design started with the best short peptide sequence. The sequence was sequentially expanded with the next short peptide sequences until reaching the maximum specified length. Short peptide sequences leading to a length longer than the requested length were ignored. To prevent synthesis issues, when the addition of short peptides led to problematic amino acids at the N terminus or C at the C terminus, the long peptide sequence was further extended until reaching another amino acid compatible with peptide synthesis.

Prioritization of predicted TSAs

When RNAseq sequencing data were available, HLA-I SNV neoantigens were prioritized with the ML-based prioritization algorithm10. Other predicted neoantigens (including frameshifts), HC-TSAs, TAAs, viral peptides and HLA-II-bound neoantigens were prioritized using specific rule-based algorithms.

For the ML-based prioritization, HLA-I SNVs neoantigens were further annotated with NetMHCpan (4.1)62 for HLA-I binding affinity prediction, NetMHCstabpan (1.0a)63 for HLA-I binding stability, NetChop (3.1)64 for C-terminal proteasomal cleavage, NetCTLpan (1.1)65 for recognition by the TAP transporter complex, differential agretopicity indices calculated as the log(neoantigen percentage rank) − log(wild type percentage rank) and HLA-binding anchor position derived from MixMHCpred sequence motifs. Those features, in combination with features derived from MixMHCpred, PRIME, CScape60, GTEx66, TCGA, RNAseq and ipMSDB, were used to apply the ML model and prioritize predicted HLA-I SNV neoantigens.

The rule-based approach was used for HLA-II neoantigens. First, long and short minimal epitope neoantigen sequences were sorted separately on the basis of a series of features (Extended Data Fig. 2). Second, the sorted neoantigens were classified into five categories (‘A’, ‘B’, ‘C’, ‘D’ and ‘reject’) while maintaining the sequence order from the initial sorting step. Similar rule-based approaches were implemented for the prioritization of antigens from viruses and TAAs (including both HC-TSAs and canonical TSAs) but they relied on different features (Extended Data Fig. 2). Only positive T cell assays in the IEDB were considered for immunogenicity.

APPM analysis

A list of genes involved in HLA-I and HLA-II presentation machinery pathways was selected from the Kyoto Encyclopedia of Genes and Genomes Pathway Database67. All gene expression was quantified in TCGA and GTEx. Genes with sample RNAseq expression lower than the expression in the matched GTEx tissue at the 5th percentile were considered downregulated and upregulated if their expression was greater than the 95th percentile expression in the matched GTEx tissue. In addition, genes were annotated with LOH and SM events identified from DNA data.

Inflammation score calculation

From RNAseq quantification data, NeoDisc derived the T cell inflammation score using the T cell inflammation signature as described in Danaher et al.68,69; nevertheless, users may use alternative tools, as the immunogenicity score has no impact on the prioritization of the antigen. This score was used to determine the T cell inflammation status, which could be noninflamed, intermediate inflamed or T cell inflamed. Furthermore, the sample inflammation signature was represented within the context of the matched cancer type from TCGA.

Tandem MS (MS/MS) spectra conversion

Currently, only raw files from the Orbitrap technology are supported for immunopeptidomic analysis. The Proteowizard (3.0.22132)70 msconvert tool was used to convert immunopeptidomic DDA and DIA MS/MS spectra acquired on Thermo Fisher Scientific instruments (.raw) to MGF and mzML.

Peptides identification with Comet and NewAnce

DDA MS/MS data were converted into MGF format and were searched against the personalized MS searchable proteome with Comet (2022.01)71. Default Comet search high–high parameters were modified as follows: decoy_search = 1, peff_format = 5, precursor_tolerance_type = 0, isotope_error = 1, isotope_error = 1, max_variable_mods_in_peptide = 2, max_variable_mods_in_peptide = 2, sample_enzyme_number = 0, activation_method = HCD, digest_mass_range = 500.0 3,000.0, max_precursor_charge = 4 and spectrum_batch_size = 30,000. The following variable modifications were included in the search: methionine oxidation (Δmass = 15.9949), N-terminal acetylation (Δmass = 42.010565), cysteine carbamyl (Δmass = 43.005814), carbamidomethylation of cysteine (Δmass = 57.021464), cysteinylation (Δmass = 119.004099), glutamine to pyroglutamatic acid (Δmass = −17.026549) and glutamic acid to pyroglutamic acid (Δmass = −18.010565). No fixed modifications were considered. Peptides lengths were adjusted to 8–15 aa and to 8–22 aa for HLA-I and HLA-II searches, respectively. The ‘allowed_missed_cleavage’ variable was set to 15 and 22 for HLA-I and HLA-II, respectively.

HLA-I and HLA-II Comet identifications were filtered at a 3% group-specific false discovery rate (FDR) using NewAnce (1.7.5)72, where protein-coding, noncanonical and virus-derived peptides were separated for FDR calculation. Frameshift-derived peptides were FDR-filtered together with noncanonical peptides. Peptide spectral matches (PSMs) were merged and grouped into unique peptide sequences, mapped to the most likely protein group (protein-coding > frameshift > noncanonical > virus) and neoantigens were flagged.

Peptide identification with FragPipe

DDA and DIA mzML files were searched against the personalized MS searchable proteome with the FragPipe 20.0 (https://fragpipe.nesvilab.org/) pipeline, which contains DIA Umpire73, Philosopher (5.0.0)74, MSFragger (3.8)75 and IonQuant (1.9.8)76. The default nonspecific HLA and nonspecific HLA DIA parameters available in FragPipe were edited as follows: addition of variable modifications (see above), removal of fixed modifications, setting Philosopher ion, PSM and peptide FDR to 0.03 and activation of group-specific FDR by setting msfragger.group_variable = 2 and adding to the Philosopher report filter the flag ‘–group’. IonQuant was turned on for peptide quantification when running in DDA only mode. Peptide lengths were adjusted to 8–15 aa and to 8–22 aa for HLA-I and HLA-II searches, respectively. HLA-I and HLA-II mzML files were searched separately. PSMs were merged and grouped into unique peptide sequences, mapped to the most likely protein group (protein-coding > frameshift > noncanonical > virus) and neoantigens were flagged.

Combining Comet-NewAnce and FragPipe identifications

Comet-NewAnce and FragPipe peptide identifications were combined by retaining the union of the identification. Conflicting PSMs were not reported. All peptides mapped to non-protein-coding groups were searched with L or I replacement against the basic local alignment search tool (BLAST)77 nonredundant human database (downloaded using BLAST-plus (2.11.0)78 on 6 April 2023). All peptides were annotated with available genomic and transcriptomic information and protein-coding peptides mapping to our list of high-confidence immunogenic HC-TSAs were annotated with literature-derived immunogenicity information. All identified peptides were predicted for binding to the participant’s HLA-I and HLA-II HLA alleles (including alleles subjects to LOH) with PRIME 2.0 (ref. 56) and MixMHC2pred 2.0 (ref. 57), respectively. MoDec (1.2)79 was used for HLA-I and HLA-II motif deconvolution of peptides mapping to protein-coding, other (noncanonical) and viral groups. The number of motifs detected was based on the Akaike information criterion value reported by MoDec. Each motif was assigned to the participant’s HLA with the closest motif, determined as the allele with the smallest Kullback–Leibler divergence (KLDa) between the deconvoluted motif position weight matrices \({{\rm{PWM}}}_{p,i}^{m}\) and \({{\rm{PWM}}}_{p,i}^{b}\) (m, MoDec value; i, amino acid; b, binding predictor (MixMHCpred56 or MixMHC2pred57); p, position; l, peptide length for HLA-I and core length for HLA-II):

$${{\rm{KLD}}}_{p}^{a}=-\mathop{\sum }\limits_{i=1}^{20}{{\rm{PWM}}}_{p,i}^{m}\times \log \left(\frac{{{\rm{PWM}}}_{p,i}^{b}}{{{\rm{PWM}}}_{p,i}^{m}}\right)$$

$${{\rm{KLD}}}^{a}=\frac{{\sum }_{p=1}^{l}{{\rm{KLD}}}_{p}^{a}}{l}$$

Tumor-specific peptides were defined as peptides not assigned to the protein-coding group predicted to bind with a rank ≤ 2. PSM images were generated with PDV (1.8.0)80 and the best fragmentation pattern as determined by NewAnce (1.7.5)72 was reported. Identified tumor-specific peptides were sorted on the basis of their protein group, evidence of immunogenicity in the literature (only for HC-TSA), source gene expression and predicted binding affinity.

Tools used for NeoDisc benchmarking

pTuneos (1.0.0)19 was run in WES mode, using matched germline and tumor WES and tumor RNAseq data. Default settings and hg19 genome assembly were used to ensure a fair comparison. Immunogenic peptide ranks were retrieved from the final_neo_model.tsv file.

pVACtools20,21 starts with a VCF file. To allow a fair comparison on the same set of mutations, a pVACtools-compatible VCF containing all the high-confidence SMs used for NeoDisc predictions and annotated with mutation and gene expression values was created. We followed the official documentation for the VCF preparation and ran pVACtools v4.0.8 on participants’ HLA-I alleles with the default settings (all prediction algorithms, -e1 8, 9, 10, 11 and 12). Immunogenic peptide ranks were retrieved from the all_epitopes.aggregated.tsv file.

Gartner et al.’s immunogenic mutation and peptide ranks were retrieved from the original publication18.

For LOHHLA (1.1.7)38, germline and tumor WES FastQ files, HLA-I alleles, tumor purity and ploidy as inferred by NeoDisc were used as input. Default settings were used, HLA CN estimations were retrieved from columns ‘HLA_type1copyNum_withBAFBin’ and ‘HLA_type2copyNum_withBAFBin’ in the DNA.HLAlossPrediction_CI.xls file. HLA loss events were called for the loss allele if copyNumwithBAFBin < 0.5 [and] PValunique < 0.01.

Sample collection

Samples used in this study were collected from participants (Supplementary Table 3) screened for inclusion in phase 1 trials approved by the institutional regulatory committee at Lausanne University Hospital (Ethics Committee, University Hospital of Lausanne (CHUV)): NSCLC-1 (NCT05195619), MEL-1 (ref. 64) and MEL-2 (ref. 64) (NCT03475134) and CESC-1, NPC-1, MEL-3 and MEL-4 (NCT04643574). Data included in this study comprised both published and unpublished data.

Tumor cell line generation

Tumor cell lines were derived from fresh tumor fragments, which underwent digestion using a combination of collagenase type I (Sigma-Aldrich) and DNAse (Roche) in R10 medium (RPMI-1640 GlutaMAX (Gibco) supplemented with 10% FBS (Gibco), 100 IU per ml of penicillin and 100 μg·ml−1 streptomycin (Bio-Concept)) at 37 °C for 90 min. The resulting cell suspension was harvested, passed through a cell strainer to eliminate any residual tissue fragments and then seeded in R10 medium for incubation at 37 °C with 5% CO2. The culture medium was replenished every 2–3 days and cells were subcultured upon reaching 80% confluence. For subculturing, tumor cells were gently detached using Accutase (Thermo Fisher Scientific) and split and R10 medium was completely refreshed. Cells were harvested and washed in PBS and pellets of 20 million cells were stored at −80 °C for downstream assays.

Sample preparation, library preparation and sequencing parameters

DNA was extracted with the commercially available DNeasy Blood and Tissue Kit (Qiagen) according to the manufacturer’s protocols. RNA was extracted by phase separation using Trizol and chloroform and centrifugation at 12,000g for 15 min. Then, the aqueous phase containing the RNA was collected, mixed with 70% ethanol and loaded on an RNeasy mini spin column from the total RNA isolation RNeasy mini kit (Qiagen). The rest of the protocol was performed according to the manufacturer’s instructions (including DNase I (Qiagen) on-column digestion). WES and RNAseq were performed at Microsynth (CESC-1 and MEL-4 tissues) or at the Health 2030 Genome Center (NSLC-1, NPC-1, MEL-1, MEL-2, MEL-3 and MEL-4 cell lines). WES libraries were prepared using the Agilent SureSelect XT Human All exome V7 kit (Microsynth), the Twist Human Core Exome kit in combination with the Twist Human Refseq kit (Genome Center, MEL-1 and MEL-2 cell lines) or the Twist comprehensive exome panel kit (Genome Center; NSLC-1, NPC-1, MEL-3 and MEL-4 cell lines). RNAseq libraries were prepared using the Illumina Truseq stranded mRNA reagents. WES and RNAseq samples were sequenced at Microsynth on the NextSeq 500/550 system or at the Genome Center on the Illumina HiSeq 4000 (MEL-1 and MEL-2cell lines) or Novaseq 6000 (NSLC-1, NPC-1, MEL-3 and MEL-4 cell lines) systems.

Flow cytometry analysis and isolation of tumor cells by fluorescence-activated cell sorting (FACS)

Tumor cells from participant MEL-4 were stained with anti-HLA-ABC (AF700) antibody (Biolegend, 311438; dilution of 3:100), washed and resuspended in PBS 10% FBS for FACS acquisition. Samples were acquired on a five-laser LSR-II flow cytometer (BD Biosciences). The BD FACSDiva software was used for data acquisition. FlowJo 10.9.0 (BD Biosciences) was used for data analysis.

The isolation of tumor cells from participant MEL-4 was performed using a five-laser MoFlo Astrios EQ flow cytometer (Beckman Coulter), following cell staining with HLA-ABC AF700 (BioLegend, 311438; dilution of 3:100) and viability dye DAPI 422801 (BioLegend, dilution of 1:500). HLA-ABClow and HLA-ABChigh tumor cells were sorted and then directly expanded in RPMI-1640 medium supplemented with 10% FBS and 1% penicillin–streptomycin for further analysis.

Peptide purification

The HLA-I and HLA-II peptides were purified according to our established protocols81. Specifically, anti-HLA-I W6/32 and anti-HLA-II HB145 monoclonal antibodies were purified from the supernatants of HB95 (American Type Culture Collection (ATCC), HB95) and HB145 cells (ATCC, HB145) using protein A Sepharose 4B (Pro-A) beads (Invitrogen) and antibodies were then crosslinked to Pro-A beads at a ratio of 5 mg of antibodies per ml of beads for HB95 and 1 mg of antibodies per ml of beads for HB145. Cell pellets and snap-frozen tissue samples were lysed in lysis buffer containing PBS with 0.25% sodium deoxycholate (Sigma-Aldrich), 0.2 mM iodoacetamide (Sigma-Aldrich), 1 mM EDTA, 1:200 protease inhibitor cocktail (Sigma-Aldrich), 1 mM PMSF (Roche) and 1% octyl-β-d-glucopyranoside (Sigma-Aldrich). Tissues were homogenized on ice in 3–5 short intervals of 5 s each using an Ultra Turrax homogenizer (IKA). Samples were left for 1 h at 4 °C and then centrifuged for 50 min at 4 °C at 75,600g in a high-speed centrifuge (Beckman Coulter, Avanti JXN-26 Series, JA-25.50 rotor) for tissues or in a table-top centrifuge (Eppendorf) at 21,000g for cell lysates. HLA immunopurification was performed using the Waters Positive Pressure-96 Processor (Waters) and 96-well single-use filter microplates with 3-µm glass fibers and 25-µm polyethylene membranes (Agilent, 204495-100). The cleared lysates were first loaded on a plate with wells containing Pro-A beads for endogenous antibody depletion, then on a second plate with wells containing anti-pan HLA-I antibody-crosslinked beads and finally through the third plate containing anti-pan HLA-II antibody-crosslinked beads. The HLA-I and HLA-II crosslinked beads were then washed separately in their plates with varying concentrations of salts using the processor. Lastly, the beads were washed twice with 2 ml of 20 mM Tris-HCl pH 8.

Sep-Pak tC18 100 mg Sorbent 96-well plates (Waters, 186002321) were used for the purification and concentration of HLA-I and HLA-II peptides. The C18 sorbents were conditioned with 1 ml of 80% ACN (Chemie Brunschwig) in 0.1% TFA (Sigma-Aldrich) and then twice with 1 ml of 0.1% TFA and the HLA complexes and bound peptides were directly eluted from the filter plates with 1% TFA. After washing the C18 sorbents with 2 ml of 0.1% TFA, HLA-I peptides were eluted with 25% or 28% ACN in 0.1% TFA and HLA-II peptides were eluted with 32% ACN in 0.1% TFA. Recovered HLA-I and HLA-II peptides were dried using vacuum centrifugation (Concentrator Plus, Eppendorf) and stored at −20 °C.

Liquid chromatography (LC)–MS

Immunopeptides were resuspended in 2% ACN and 0.1% formic acid. iRT peptides (Biognosis) were spiked into in the samples (Biognosis) as recommended by the vendor and analyzed by LC–MS/MS.

The LC–MS system consisted of an Easy-nLC 1200 (Thermo Fisher Scientific) coupled to a Q Exactive HFX MS instrument (Thermo Fisher Scientific) and/or to an Eclipse Tribrid MS instrument (Thermo Fisher Scientific). HLAs were eluted on a 450–500-mm analytical column (~8-μm tip, 75-μm inner diameter) packed with ReproSil-Pur C18 (1.9-μm particle size, 120-Å pore size; Dr Maisch) and separated at a flow rate of 250 nl min−1 as described previously81.

For DDA measurements, selection of the top 20 most abundant precursor ions was performed on the Q Exactive HFX as described81. On the Eclipse, full MS spectra were acquired in the Orbitrap (m/z = 300–1,650) with a resolution of 120,000 (m/z = 200), maximum ion accumulation time and automatic gain control (AGC) set to auto. MS/MS spectra were acquired on the 20 most abundant precursor ions with a resolution of 30,000 (m/z = 200) with an isolation window of 1.2 m/z and ion accumulation time of 54 ms. The AGC was set to 5 × 104 ions, the dynamic exclusion was set to 20 s and a normalized collision energy of 30 was used for fragmentation. No fragmentation was performed for HLA-I peptides with assigned precursor ion charge states of four and above and the peptide match option was enabled.

The DIA cycle of acquisition was performed on the HFX as described previously82. On the Eclipse Tribrid MS, the cycle of acquisitions consisted of a full MS scan from 300 to 1,650 m/z with a resolution of 120,000, ion accumulation time of 60 ms, normalized AGC of 250% and 22 DIA MS/MS scans in the orbitrap. For each DIA MS/MS scan, a resolution of 30,000, a normalized AGC of 2,000% and a stepped normalized collision energy (27, 30 and 32) were used. The maximum ion accumulation time was set to auto, the fixed first mass was set to 200 m/z and the overlap between consecutive MS/MS scans was 1 m/z.

For parallel reaction monitoring, synthetic peptides were ordered from Thermo Fisher Scientific as crude. After resuspension in 2% ACN in 0.1% formic acid, peptides were analyzed by pools of peptides (three in total) for a charge state of 2+ and 3+. The cycle of acquisition consisted of a full MS scan from 300 to 1,650 m/z with a resolution of 60,000, ion accumulation time of 100 ms and consecutive acquisition of targeted peptide MS/MS scans (isolation window of 1.2 m/z resolution of 30,000 and accumulation time of 54 ms) in the Q Exactive HFX.

Immunogenicity assessment of TIL cultures and isolation of tumor antigen-specific TILs

Peptides were synthetized in-house (Peptide and Tetramer Core Facility, University of Lausanne (UNIL) and CHUV) or ordered externally (Thermo Fisher Scientific). The set of peptides was then tested for immunogenicity by interrogating both clinical TILs (NCT03475134 and NCT04643574) and in-house generated NeoScreen TILs using IFNγ ELISpot. The NeoScreen method for TIL expansion and cultures25 involves the exposure of small tumor fragments to engineered autologous B cells and the neoantigenic peptides in the presence of interleukin 2 (IL-2) for a period of a 12–21 days (preREP). This is followed by further REP of TILs with OKT3 (anti-CD3, Miltenyi, 30 ng ml−1) in the presence of an excess of irradiated feeder cells and IL-2 for 14 days. TILs were challenged with specific peptides at 1 µM concentration in precoated 96-well ELISpot plates (Mabtech). Following 16–20 h of coculture, cells were gently harvested from ELISpot plates, which were then developed according to the manufacturer’s instructions and counted with a Bioreader-6000-E (BioSys). Positive conditions were defined as those with an average number of spots higher than the counts of the negative control (TILs alone) plus three times the s.d. of the negative control. The background level of IFNγ spot-forming units (SFU) per 106 cells by the negative control was subtracted from that of antigen-rechallenged TILs in cumulative figures. TILs retrieved from plates were centrifuged and stained to assess the upregulation of the activation marker CD137 on CD8+ T cells by flow cytometry (Extended Data Fig. 3b). The background level of CD137 expression by the negative controls (TILs alone) was subtracted from that of antigen-rechallenged TILs in cumulative figures.

For the isolation of tumor antigen-specific cells for participants MEL-1 and MEL-4, IL-2-deprived TILs were stimulated with tumor-specific peptides at 1 µM for 24 h at 37 °C, 5% CO2 and reactive TILs were sorted on the basis of CD137 upregulation. After incubation, TILs were collected, washed and stained with anti-human CD3 (Beckman Coulter; dilution of 2:100), anti-CD8 (BD Biosciences; dilution of 3:100), anti-CD4 (BioLegend; dilution of 0.5:100), anti-CD137 (Miltenyi; dilution of 2:100) and Aqua viability dye (Thermo Fisher Scientific; dilution of 0.3:100) for 30 min at 4 °C. Antigen-specific CD8+ and CD4+ TILs were isolated by FACS using a BD FACS Melody cell sorter (BD Biosciences) and subsequently subjected to bulk TCR sequencing83. Briefly, mRNA was extracted and amplified by in vitro transcription. The resulting coding RNA was converted to single-stranded DNA using multiplex reverse transcription with a collection of TRAV-specific and TRBV-specific primers, which included unique molecular identifiers (UMIs) and Illumina adaptors. TCR sequences were then amplified by PCR (20 cycles) using a single primer pair targeting the constant region and the Illumina adaptor added during reverse transcription. A second PCR (25 cycles) was performed to add the multiplexing indices. The TCR products were purified, quantified and loaded onto MiSeq or NextSeq platforms (Illumina) for deep sequencing of the TCRα and TCRβ chains. The resulting TCR sequences were processed using custom Perl scripts to (1) group all TCR sequences coding for the same protein sequence; (2) correct for amplification and sequencing errors using 9-mer UMIs; (3) filter out out-of-frame sequences; and (4) determine the abundance of each TCR sequence. TCR sequences with a single read were excluded from further analysis.

Antigen and tumor reactivity validation by TCR cloning

Reactivities of TCRα–TCRβ pairs from samples MEL-1 and MEL-4 were validated by cloning into recipient T cells. The Jurkat cell line (TCR/CD3 Jurkat-luc cells (NFAT), Promega; stably transduced in-house with human CD8α–CD8β and TCRα–TCRβ knockout by clustered regularly interspaced short palindromic repeats) and activated peripheral T cells were used to validate antigen and tumor specificity, respectively64.

Jurkat cells were electroporated using the Neon electroporation system (Thermo Fisher Scientific) and cocultured with peptide-pulsed HLA-matched APCs followed by a bioluminescence assay (an example is shown in Extended Data Fig. 5g). For the luciferase assay, 2 × 104 Jurkat cells were cocultured with 4 × 104 PHA-activated CD4+ T cells (CD4 blastocysts) in the presence of 1 µM specific peptides in 96-well plates. After overnight incubation, the assay was performed using the Bio-Glo luciferase assay system (Promega). Luminescence was measured with a Spark multimode microplate reader (Tecan). To assess the antitumor reactivity of TCRs, 105 TCR RNA-electroporated cells were cocultured in a 1:1 ratio with 200 ng ml−1 IFNγ-pretreated tumor cell lines (an example is shown in Extended Data Fig. 3e). After overnight incubation, T cells were recovered and the upregulation of CD137 was evaluated by staining with anti-CD137 (Miltenyi; dilution of 2:100), anti-CD3 (Biolegend or BD Biosciences; dilution of 1:100), anti-CD4 (BD Biosciences; dilution of 3:100), anti-CD8 (Biolegend; dilution of 1:100) and anti-mouse TCRβ-constant (Thermo Fisher Scientific dilution of 2:100) antibodies and with Aqua viability dye (Thermo Fisher Scientific). Flow cytometer LSRFortessaTM (BD Bioscience) or IntelliCyt iQue Screener PLUS (Bucher Biotec) were used for the fluorescence readout and the results were analyzed with FlowJo 10 (TreeStar). For both Jurkat and activated PBMCs, the following experimental controls of TCR transfection were used: mock (transfection with water) and a control TCR (irrelevant crossmatch of a TCRα and TCRβ chain from a private TCR library).

mIF staining

mIF staining was performed on 4-μm formalin-fixed paraffin-embedded tissue sections on automated Ventana Discovery Ultra staining module (Ventana, Roche). Slides were placed on the staining module for deparaffinization, epitope retrieval (64 min at 98 °C) and endogenous peroxidase quenching for 8 min (Discovery Inhibitor, Roche, 760-4840, RTU). Multiplex staining consists in multiple rounds of staining. Each round includes nonspecific sites blocking (Discovery Goat IgG, Roche, 760-6008, RTU; Discovery Inhibitor, Roche, 760-4840, RTU), primary antibody incubation, secondary horseradish peroxidase (HRP)-labeled antibody incubation for 16 min (Discovery OmniMap anti-rabbit HRP, Ventana, 760-4311, RTU; anti-mouse HRP, Ventana, 760-4310, RTU; OPAL reactive fluorophore detection (Akoya Biosciences)) to covalently label the primary epitope (incubation for 12 min) and then antibody heat denaturation. The panel was optimized to look at the HLA expression on tumor and immune cells. The sequence of antibodies used in the multiplex with the associated OPAL was as follows: (1) rabbit anti-Sox10 antibody (1.2 µg ml−1, clone EP268, Cellmarque; 1 h, 37 °C), OPAL690; (2) mouse anti-human HLA-DR antibody (1 µg ml−1, DAKO; 1 h, room temperature), OPAL570; (3) mouse anti-CD11c antibody (1.5 µg ml−1, clone 5D11, Cellmarque, DAKO; 1 h, 37 °C), OPAL520; (4) mouse anti-human HLA-ABC antibody (0.1 µg ml−1, Clone EMR8-5, Abcam; 1 h, room temperature), OPAL480; (5) rabbit anti-CD8 antibody (1 µg ml−1, clone SP16, Cellmarque; 1 h, 37 °C), OPAL620; (6) rabbit anti-CD3 antibody (1.5 µg ml−1, DAKO; 1 h, 37 °C), OPAL780. Nuclei were visualized by a final incubation with Spectral DAPI (1/10, FP1490, Akoya Biosciences) for 12 min. Multiplex IF images were acquired on PhenoImager allowing whole-slide multispectral imaging acquisition (Akoya Biosciences). IF signal extractions from qptiff were performed, enabling a per-cell analysis of IF markers of multiplex stained tissue sections and counting of every cell population.

CD8+ T cell infiltration assessment

To characterize the tissue heterogeneity for a specific phenotype, we segmented the entire multispectral image into regions of interest (ROIs). ROIs were determined by dividing the image into tiles, each evaluated for cellular metrics. The size of each tile was set by an area-based criterion, typically 640,000 μm2. A shift of 2 (where the ROIs partially overlap at their centers) was applied to increase the granularity of tile positioning.

By iteratively shifting the tile, we calculated the tissue area and CD8+ cell count per tile. These densities were plotted on a graph, displaying the density of CD8+ cells in tumor islets on the x axis and in the tumor microenvironment (TME) on the y axis. Each graph point reflects the CD8+ density in a particular tile, illustrating the spatial distribution of CD8+ cells within the samples. A threshold density of 21 CD8+ cells per mm² (in either tumor islets or the TME) was set to determine infiltrated, excluded or desert tumor islets.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



Source link

LEAVE A REPLY

Please enter your comment!
Please enter your name here