Home NATURALEZA Droplet Hi-C enables scalable, single-cell profiling of chromatin architecture in heterogeneous tissues

Droplet Hi-C enables scalable, single-cell profiling of chromatin architecture in heterogeneous tissues

18
0


Experimental protocol for Droplet Hi-C

A brief description of the Droplet Hi-C experimental procedure is provided below.

Fixation

Cells were cross-linked in 1% formaldehyde, which was diluted from 37% formaldehyde with 1× PBS (pH = 7.4), and incubated at room temperature for 10 min. After cross-linking, the reactions were quenched in 200 mM glycine and incubated at room temperature for 5 min. Quenched reactions were spun down at 1,000 × g for 5 min at 4 °C and resuspended using 1% bovine serum albumin (BSA) in 1× PBS (pH = 7.4) to wash twice. One million cells were aliquoted into each tube. The cells were spun once again at 1,000 × g for 5 min, the supernatant was removed, and the pellet was flash frozen in liquid nitrogen and finally stored indefinitely at −80 °C.

In situ Hi-C

Cell pellets were lysed with pre-cold 300 μl lysis buffer (10 mM Tris–HCl, pH 8.0 (Thermo Fisher Scientific, 15568025), 10 mM NaCl (Sigma, S5150), 0.2% IGEPAL CA-630 (Sigma, I8896) and 1× protease inhibitor (Roche, 5056489001)) on ice for 45 min, centrifuged at 1,000 × g for 5 min at 4 °C to collect nuclei, and washed once with 200 μl lysis buffer. The nuclei were resuspended with 50 μl 0.5% SDS and incubated at 62 °C for 10 min on Thermomixer. Then, we added 145 μl of nuclease-free H2O and 25 μl 10% Triton X-100 (Sigma, 93443) to quench SDS and incubated samples at 37 °C for 15 min on Thermomixer at 300 rpm. To the samples were added 27 μl of 10× CutSmart Buffer and three restriction enzymes, including 50 U DpnII (NEB, R0543L), 62.5 U MboII and 7.5 U NlaIII (NEB, R0125L); then, the samples were incubated at 37 °C for 90 min on Thermomixer at 550 rpm. The enzymes were deactivated at 65 °C for 20 min and then cooled down to room temperature. The nuclei were collected by centrifugation at 1,000 × g for 5 min at 4 °C and washed once with 200 μl ligation buffer (100 μl 10× T4 DNA ligase buffer (NEB, B0202S), 5 μl 20 mg ml−1 BSA (NEB, B9000S), 865 μl H2O); the ligation reaction was performed with 200 μl ligation buffer and 20 μl T4 DNA ligase (NEB, M0202L) at 37 °C for 40 min on Thermomixer at 300 rpm.

Single-cell Hi-C sequencing library construction

The ligated nuclei pellets were resuspended in 1 ml 1% BSA in PBS (diluted from 10% BSA in PBS (Sigma, A1595) using 1× PBS (pH = 7.4)) in each tube, 1 μl 1,000× 7-AAD (Invitrogen, A1310) was added and the nuclei were sorted by fluorescence-activated nuclei sorting with an SH800 cell sorter (Sony) for the isolation of single nuclei (Supplementary Fig. 3). The nuclei were collected in collection buffer (5% BSA in PBS) at 5 °C and immediately centrifuged for 5 min at 1,000 × g and 4 °C to collect the nuclei. The nuclei were washed once with 1× Nuclei Buffer (diluted from 20× Nuclei Buffer (10x Genomics, PN-2000207)), resuspended in 10 μl 1× Nuclei Buffer and counted on an RWD C100-Pro fluorescence cell counter with DAPI staining. According to the desired targeted nuclei recovery number, 3,000–15,000 nuclei were aliquoted into PCR tubes, to which was added Transposition Mix following the user’s guide for the Chromium Next GEM Single Cell ATAC Reagent Kits v1.1 or the Chromium Next GEM Single Cell ATAC Reagent Kits v2 (10x Genomics). The incubation time for tagmentation is 60 min for both the v1.1 and v2 kits. We also extended the index PCR elongation time from 20 s to 1 min. The double-sided size selection was changed to 1.14× SPRIselect to remove only small fragments.

All sequencing experiments were performed with an Illumina NextSeq 2000 or NovaSeq 6000 sequencer.

Experimental protocol for Paired Hi-C

A brief description of the Paired Hi-C experimental procedure is provided below.

Fixation

Cells were cross-linked in 0.6% formaldehyde, which was diluted from 37% formaldehyde with 1× PBS (pH = 7.4), and incubated at room temperature for 10 min. After cross-linking, the reactions were quenched in 200 mM glycine and incubated at room temperature for 5 min. The quenched reactions were spun down at 1,000 g for 5 min at 4 °C and resuspended using 1% BSA in 1× PBS (pH = 7.4) to wash twice, and then 1 million cells were aliquoted into each tube. The cells were spun once again at 1,000 × g for 5 min, the supernatant was removed, and the pellet was flash frozen in liquid nitrogen and finally stored indefinitely at −80 °C.

In situ Hi-C

Cell pellets were lysed with pre-cold 300 μl lysis buffer (10 mM Tris–HCl (pH 8.0; Thermo Fisher Scientific, 15568025), 10 mM NaCl (Sigma, S5150), 0.2% IGEPAL CA-630 (Sigma, I8896), 1× protease inhibitor (Roche, 5056489001), 1 U μl−1 RNaseOUT (Invitrogen, 10777-019) and 1 U μl−1 SUPERaseIn inhibitor (Invitrogen, AM2694)) on ice for 45 min, then centrifuged at 1,000 × g for 5 min at 4 °C to collect nuclei and washed once with 200 μl lysis buffer. The nuclei were resuspended with 50 μl 0.5% SDS, which was diluted from 10% SDS using 1× PBS (pH = 7.4) with 1 U μl−1 RNaseOUT and 1 U μl−1 SUPERaseIn inhibitor, and incubated at 37 °C for 60 min on Thermomixer at 300 rpm. Then, we added 197 μl quenching buffer (137.59 μl of nuclease-free H2O, 25 μl 10% Triton X-100, 27 μl 10× CutSmart, 1 U μl−1 RNaseOUT and 1 U μl−1 SUPERaseIn inhibitor) and incubated samples at 37 °C for 15 min on Thermomixer at 300 rpm. The samples were centrifuged at 1,000 × g for 5 min at room temperature, the supernatant was removed and the pellets were resuspended with 250 μl digestion buffer (1× CutSmart Buffer, 1 U μl−1 RNaseOUT, 1 U μl−1 SUPERaseIn inhibitor, 50 U DpnII, 62.5 U MboII and 5 U NlaIII), and then incubated at 37 °C for 90 min on Thermomixer at 550 rpm. The enzymes were deactivated at 65 °C for 20 min and then cooled to room temperature. The nuclei were collected by centrifugation at 1,000 × g for 5 min at 4 °C and washed once with 200 μl ligation buffer (827.5 μl H2O, 100 μl 10× T4 DNA ligase buffer, 5 μl 20 mg ml−1 BSA, 25 μl SUPERaseIn inhibitor and 12.5 μl RNaseOUT), and the ligation reaction was performed with 200 μl ligation buffer and 20 μl T4 DNA ligase at 37 °C for 40 min on Thermomixer at 300 rpm.

PBMCs (Allcells, LP, CR, MNC, 100 M) were cross-linked in 1% formaldehyde. Then, 1 million cell pellets were lysed with 200 μl of high-salt lysis buffer 1 (50 mM HEPES (pH 7.4), 2 mM EDTA (pH 8.0), 140 mM NaCl, 0.25% Triton X-100, 0.5% IGEPAL CA-630, 10% glycerol and 1× proteinase inhibitor cocktail) and incubated on ice for 10 min. After this, the cells were pelleted at 500 g for 2 min at 4 °C and then resuspended in 200 μl of high-salt lysis buffer 2 (10 mM Tris–HCl (pH 8), 3 mM EDTA, 200 mM NaCl, 1× proteinase inhibitor cocktail). The solution was incubated on ice for 10 min. Following this, the cells were then pelleted at 500 g for 2 min at 4 °C and then resuspended in 200 μl of 1× T4 DNA ligase buffer containing 0.2% SDS. The cells were then incubated at room temperature for 1 h. To quench the reaction, 200 μl of 1× T4 DNA ligase buffer and 10 μl 10% Triton X-100 were added to the tube. Finally, the cells were spun at 500 g for 4 min at 22 °C. The pellets were resuspended with 250 μl digestion buffer (1× T4 DNA ligase buffer, 1 U μl−1 RNaseOUT, 1 U μl−1 SUPERaseIn inhibitor, 80 U DpnII) and then incubated at room temperature for 2 h. The nuclei were collected by centrifugation at 500 g for 4 min at 4 °C and washed once with 200 μl ligation buffer (857.5 μl H2O, 100 μl 10× T4 DNA ligase buffer, 5 μl 20 mg ml−1 BSA, 25 μl SUPERaseIn inhibitor and 12.5 μl RNaseOUT), and the ligation reaction was performed with 200 μl ligation buffer and 20 μl T4 DNA ligase at room temperature for 40 min.

Single-cell sequencing library construction

The ligated nuclei pellets were resuspended in 1 ml 1% BSA in PBS (diluted from 10% BSA in PBS using 1× PBS (pH = 7.4)) with 1 U μl−1 RNaseOUT and 1 U μl−1 SUPERaseIn inhibitor in each tube, 1 μl 1,000× 7-AAD (Invitrogen, A1310) was added, and the nuclei were sorted by fluorescence-activated nuclei sorting with an SH800 cell sorter (Sony) for the isolation of single nuclei. The nuclei were collected in collection buffer (5% BSA in PBS with 5 U μl−1 Recombinant RNasin (Promega, PAN2515)) at 5 °C and immediately centrifuged for 5 min at 1,000 × g and 4 °C to collect the nuclei. The nuclei were washed once with 1× Nuclei Buffer (diluted from 20× Nuclei Buffer (10x Genomics, PN-2000207) with 1 mM DTT and 1 U μl−1 Recombinant RNasin), resuspended in 10 μl 1× Nuclei Buffer and counted on an RWD C100-Pro fluorescence cell counter with DAPI staining. According to the desired targeted nuclei recovery number, 3,000–15,000 nuclei were aliquoted into PCR tubes, to which was added Transposition Mix following the user’s guide of Chromium Next GEM Single Cell Multiome ATAC + Gene Expression (10x Genomics). The tagmented nuclei were loaded onto a Chromium Next GEM Chip J for droplet generation with a Chromium X microfluidic system (10x Genomics). The gel bead-in-emulsions (GEMs) were collected, reverse transcription and cell barcoding were performed using a thermal cycler and the reaction was quenched following the user’s guide of the Chromium Next GEM Single Cell Multiome ATAC + Gene Expression (10x Genomics). After GEM cleanup and pre-amplification, the SPRIselect products were eluted into 100 μl elution buffer (Qiagen, 19086) instead of 160 μl. For scHi-C library construction, we extended the index PCR elongation time from 20 s to 1 min, and the double-sided size selection was changed to 1.55× SPRIselect to remove only small fragments. For scRNA-seq library construction, we used 0.8× SPRIselect to purify cDNA twice instead of 0.6× SPRIselect to keep short cDNA fragments.

All sequencing experiments were performed with an Illumina NextSeq 2000 or NovaSeq 6000 sequencer.

Processing of Droplet Hi-C data

Custom scripts for demultiplexing, mapping and extracting single-cell contacts from Droplet Hi-C data were publicly available at https://github.com/Xieeeee/Droplet-Hi-C. First, depending on the indexing primers used, Droplet Hi-C fastq files were demultiplexed using Illumina bcl2fastq (v2.19.0.316) or 10x Genomics cellranger-atac mkfastq (v2.0.0). After demultiplexing, a custom script was used to extract the cellular barcode sequence from Read2, and barcodes were aligned to the white list provided by 10x Genomics using Bowtie (v1.3.0)72. Aligned barcode sequences were appended to the beginning of the read name to record the cellular identity of each read. Next, sequencing adapters were detected and trimmed with Trim-Garole (v0.6.10)73. Cleaned reads were then mapped to the human (hg38) or mouse (mm10) reference genome using BWA-MEM (v0.7.17)74, with arguments ‘-SP5M’ specified. Finally, after mapping, valid contacts were parsed, sorted and deduplicated by Pairtools (v0.3.0)75, with barcode information stored in a separated column in the pairs file. Contacts were balanced and stored in cool format using Cooler (v0.8.10) for visualization and downstream analysis76. High-quality cells were selected based on the number of total unique contacts per cell in each library.

Analysis of Droplet Hi-C data

Here we describe the general analysis strategy and workflows for our Droplet Hi-C data. Dataset-specific manipulations in the paper, if any, will be indicated.

Visualization of Hi-C contact matrices

Bulk, pseudo-bulk or single-cell contact matrices in cool format were visualized and plotted using Cooler along with Matplotlib (v3.5.1)77.

Signal enrichment calculation

Density plots and signal enrichment over Center for Epigenomics of the Mouse Brain Atlas candidate cis-regulatory elements (cCREs) were generated using bamCoverage from deepTools (v3.5.3). Peaks that overlapped with ENCODE blacklist (v2) regions were removed during the calculation of enrichment78.

Embedding and clustering of single-cell Droplet Hi-C dataset on cultured cells

We used Higashi to infer low-dimensional cell embeddings for all our Droplet Hi-C datasets without imputations, including a human cell line mix (HeLa–GM12878–K562; Extended Data Fig. 1b), GM12878–WTC-11 (Extended Data Fig. 1i), GBM39–GBM39-ER (Fig. 4b) and the sample from the patient with GBM (Fig. 5b). For visualization, the L2 norm of cell embeddings was projected to a two-dimensional space with uniform manifold approximation and projection (UMAP). To identify cells with similar identity, we performed Leiden clustering using igraph (v0.9.9) and leidenalg (v0.8.8)79,80.

Contact frequency by distance analysis

We used the ‘expected_cis’ module from cooltools (v0.5.1) to calculate the average contact frequency as a function of genomic separation81. To plot the curve, we performed smoothing and aggregation among chromosome regions.

Imputation of single-cell chromatin contacts on the adult mouse cortex Droplet Hi-C dataset

For the adult mouse cortex Droplet Hi-C data, we used the scHiCluster (v1.3.4) to perform contact matrix imputation for individual cells at three different resolutions: 100 kb (for cell embedding and visualization), 25 kb (for domain boundary calling) and 10 kb (for cell embedding and loop calling)82. In brief, contacts from individual cells were first masked for ENCODE blacklist regions78. scHiCluster then performed linear convolution and random walk with restart to impute the sparse single-cell contact matrices for each chromosome. Considering storage efficiency, we output the imputed results for the whole chromosome at 100 kb, for the 10.05 Mb window at 25 kb and for the 5.05 Mb window at 10 kb resolution for each cell. For cancer cells or tumor samples, we did not perform imputation owing to CNVs and SVs.

Embedding and annotation of single-cell Droplet Hi-C dataset on the adult mouse cortex

For the imputed adult mouse cortex datasets and PBMC datasets, we adopted a previously described method to co-embed Hi-C data with gene expression data (snRNA-seq) generated from the same tissue and to assign cell-type identities39. Briefly, single-cell GAD (scGAD) score Rij is calculated as the raw number of interactions at the gene body region for gene i in cell j with the 10 kb imputed matrix. This method has been implemented in scHiCluster as the ‘gene_score’ module. By this, the single-cell chromatin interaction profiles can then be represented as a cell-by-gene scGAD score matrix, which serves as the input for integration with snRNA-seq data.

To perform integration, we began by normalizing and selecting the top 2,000 variable genes in reference snRNA-seq data. These genes were subjected to dimension reduction by principal component analysis (PCA) using Scanpy (v1.7.2), retaining 30 principal components83. The derived PCA model was then applied to transform the scGAD score matrix. Subsequent integration was conducted with a tailored version of Seurat integration using canonical correlation analysis84. For the visualization of co-embedded data, we calculated the nearest neighbor graph (k = 25) and UMAP embedding with Scanpy.

To annotate cell identities within the Droplet Hi-C dataset, we identified the 15 nearest snRNA-seq neighbors for each cell via PyNNDescent (v0.5.6) using Euclidean distances85. These distances were scaled and converted into standardized scores so that the total score is added up to 1, and the neighbor with the minimal distance would have the highest score. The identity of each Droplet Hi-C cell was inferred by nominating the cell type label that garnered the highest standardized score among its top nearest neighbors.

To validate the cell type annotations, co-embedding of Droplet Hi-C and sn-m3C-seq datasets was performed using scHiCluster, which facilitates the low-dimensional representation of chromatin contact information. For each cell, imputed matrices at 100 kb resolution were binarized and flattened. The binarized data from individual cells in both datasets were then concatenated into a larger matrix. PCA was applied to this matrix on a per-chromosome basis. The PCA results for each chromosome were concatenated for a second round of PCA to generate the final cell embeddings. Batch effects were corrected using harmonypy (v0.0.9)86. After batch correction, Leiden clustering was used to identify co-embedding clusters. The validity of the original annotation was verified by calculating the overlap coefficients (Oi,j) between the clusters and the original annotations from Droplet Hi-C (A) and sn-m3C-seq dataset (B): \({O}_{i,j}=\min ([\frac{{A}_{i}\cap {C}_{k}}{{A}_{i}}],\,\max [\frac{{B}_{j}\cap {C}_{k}}{{B}_{j}}])\), where i indicates the query cell type from Droplet Hi-C, j indicates the reference cell type from sn-m3C-seq and k indicates the co-embedding cluster.

Analysis of A and B compartments

On sample levels, we used the ‘eigs_cis’ module from cooltools to perform eigen decomposition on balanced contact matrices and calculate the compartment score at 100 kb resolution81. The orientation of the resulting eigenvectors was adjusted to correlate positively with the CpG density of the corresponding 100 kb genomic bins, thereby determining the sign of the compartments (A and B). Similarity between samples is calculated as the Spearman’s correlation coefficient (SCC) of compartment scores among all autosomes (Extended Data Fig. 1c,j). To illustrate chromatin interactions within and between compartments, we used the ‘saddle’ module from cooltools to calculate the average observed versus expected contact frequency, categorized by compartment scores.

To generate pseudo-bulk profiles for distinct cell types or clusters within a sample, we adopted a uniform approach outlined before to minimize bias and facilitate direct comparisons. Initially, the raw contact matrices at the sample level were normalized for distance effects. Next, Pearson correlation coefficients were computed on the distance-normalized matrices, and PCA was then performed on the correlation matrix. The model fitted at the sample level was used to transform the raw contact matrices of various cell types or clusters. To compare the similarity between cell types or clusters, we calculated the SCC of the compartment score using the compartment that intersected with variable gene regions (Fig. 1e and Extended Data Fig. 7e). To identify differential compartments, we used dchic (v2.1)87, selecting only those with an adjusted P value < 0.01 and a Manhattan distance greater than the 2.5th percentile of the standard normal distribution for further downstream analysis.

For joint analysis of compartment and cell-type-specific histone modifications, we used the recently published adult mouse cortex Droplet Paired-Tag dataset40. We retained only cell types that contain >100 cells and were shared between datasets to ensure comparability among the analyses. For all differential compartments, we calculated the Pearson correlation coefficients between the compartment score and the signal of H3K27ac or H3K27me3 (counts per million mapped reads (CPM)) within the same 100 kb genomic bins among various cell types.

Analysis of chromatin domains

On sample levels, we used the ‘insulation’ module from cooltools to compute the insulation score for raw contact matrices at 25 kb resolution. To delineate cell-type-specific variable domain boundaries in mouse cortex data, we used TopDom (v0.0.2) on 25 kb resolution imputed contact matrices for each individual cell88. We defined the ‘boundary probability’ for a given bin as the proportion of cells within a particular cell type that designate the bin as a domain boundary. The presence or absence of a domain boundary is summarized by n cell types into an n × 2 contingency tables. We then computed the chi-square statistics and P value for each bin, and the domain boundaries were classified as variable if they showed a false discovery rate (FDR) <0.001 and a boundary probability difference >0.05.

For correlation analysis of gene expression surrounding variable domain boundaries, we first identified genes within a 100 kb window centered on the variable domain boundaries. We then calculated Pearson correlation coefficients to quantify the relationship between boundary probabilities and expression levels (reads per kilobase per million mapped reads (RPKM)) of these genes from Droplet Paired-Tag data among shared cell types. The genes were further categorized into ‘housekeeping’, ‘constant’ and ‘variable’. In brief, the housekeeping gene list was taken from a previous report89. The top 2,500 variable genes from Droplet Paired-Tag RNA modality were selected as the variable group. Genes with a similar expression level as the top 2,500 variable genes but were not within the ‘housekeeping’ or ‘variable’ group were treated as ‘constant’.

Analysis of chromatin loops

For loops calling on sample levels, we adopted the ‘dots’ module in cooltools, which implements the principle of the commonly used HiCCUPS loop calling strategy, to call loops on the 10 kb raw contact matrices90.

A modified version of the SnapHiC workflow implemented in scHiCluster is used to perform loops calling on the 10 kb imputed contact matrices for mouse cortex cell types91. To compare histone modification enrichment at loop anchors, we calculate H3K27ac or H3K27me3 CPM among all cell types at all loop anchors identified. When histone modification profiles and loop anchors are from the same cell types, the enrichment is classified as ‘matched’; otherwise, it is classified as ‘unmatched’.

Gene Ontology (GO) annotation of loop anchors was performed using rGREAT (v1.26.0) in R92. The GO biological process was selected for annotations. The result is used to generate plots in Fig. 2g.

Multi-way chromatin interaction analysis

Multi-way interactions are extracted with ‘pairtools parse2’ function in Pairtools, which is designed to rescue complex chromatin ligation events. After deduplication, only trans interactions and cis interactions with genomic distance >10 kb were retained. Read pairs overlapping with the ENCODE mm10 blacklist regions were removed. For each pair-end read on autosomes supporting multi-way contacts, we collected the 10 kb genomic bins containing their anchors. Finally, we defined one pair-end read to support multi-way contacts, if it contacts ≥3 unique 10 kb bins (can be both intrachromosomal and interchromosomal).

To define contact hubs for multi-way interactions in each mouse brain cell type, we followed the below strategy: first, the reference genome (mm10 for mouse cortex) was segmented into consecutive 10 kb bins. Next, a binary indicator was assigned to each 10 kb bin in each single cell, depending on whether it is involved in a multi-way contact. Then, for each 10 kb bin, we calculated the frequency of single cells involved in multi-way contacts among all cells belonging to the same cell type. We excluded the top 1% of bins with the highest frequency or bins with sequence mappability scores <0.8 when calculating the mean and standard deviation for frequency distribution, then converted the frequency distribution into a Z-score. Finally, a 10 kb bin is defined as in multi-way contact hub if its Z-score is >1.96. Custom scripts to analyze the multi-way interaction hub are available at https://github.com/HuMingLab/Multiway.hub.

Enrichment of chromatin features at multi-way chromatin hubs

After identifying the multi-way chromatin hubs in each mouse cortex cell type, we performed enrichment analysis against cCREs, super-enhancers and cell-type specifically expressed genes. Specifically, cCREs are calculated and converted into 10 kb genomic bins from single-nucleus ATAC using sequencing (snATAC-seq) data71. Super-enhancers are called using HOMER ‘findPeaks -style super’ on the identified cCREs within each cell type with default parameters93, and converted into 10 kb genomic bins. For each cell type A involving a multi-way contact hub, we calculated its overlap with cCREs or super-enhancers from another cell type (B) and created a 2 × 2 contingency table. We then performed Fisher’s exact test to evaluate the significance of enrichment and reported the log2 odds ratio as the enrichment score. We enumerated all pairs of cell types and compared the log2 odds ratio in matched (AA) versus unmatched (AB, where B ≠ A) cell types. We repeated the same enrichment analysis for cell-type-specific marker genes from published datasets40, except that we used the 10 kb bin overlap with the marker gene transcription start site for enrichment analysis.

Infer CNV with Droplet Hi-C data

Copy number is inferred from Hi-C data using the ‘calculate-cnv’ module in NeoLoopFinder (v0.4.3)94. For each single cell, the output residuals from the generalized additive model in ‘calculate-cnv’ were directly used to estimate the copy number ratio. On bulk or pseudo-bulk level, an additional hidden Markov model-based segmentation is performed using the ‘segment-cnv’ module on the copy number ratio to determine the boundaries of copy number ratio segments. We assume all samples used in this paper for CNV calculation are diploid; therefore, the inferred copy number is equal to 2 × copy number ratio. A genome-wide CNV heatmap is plotted at 10 Mb resolution, where chromosome level CNV is plotted at 1 Mb resolution, and the regional CNV profile is plotted at 10 kb resolution in R using pheatmap (v1.0.12). The inferred copy number is further used to assist in identifying ecDNA candidates and the associated significant trans-interactions, and to correct the bias in contact matrices for SV identification.

Identify SV from Droplet Hi-C data

SV was identified using the ‘predictSV’ module in EagleC (v0.1.9)95. In short, CNV effects on contact matrices at 5 kb, 10 kb and 50 kb resolutions were removed using the ‘correct-cnv’ module in NeoLoopFinder. ‘predictSV’ then uses a deep learning model to predict SV at each resolution on the corrected matrices and combines all results to obtain a uniform, high-resolution SV list at 5 kb resolution. The ‘annotate-gene-fusion’ module is applied on the final SV list to annotate gene fusion events.

Metrics to define chromatin interactions

We compiled a suite of metrics to assess the cis- and trans-interaction patterns at specific genomic loci. These metrics include the following: (1) contact evenness or ‘hub index’, quantified as the Gini coefficient for trans-interactions among all chromosomes excluding chromosome Y given a 1 Mb genomic bin, and was derived using ineq (v0.2-13) in R; (2) trans-to-cis contacting bin ratio (Ri), which compares the quantity of interacting bins within the same chromosome (NC) to those among different chromosomes (NT) for a given bin i: \({R}_{i}=\frac{{N}_{T}}{{N}_{C}}\). This ratio represents the trans-interaction tendency while it is not confounded by CNV; and (3) copy-number-adjusted trans-chromosomal interaction (adjnTIF), which is also used to measure a genomic locus’s interaction activity, was calculated as described before49.

Develop ecDNA callers for identifying ecDNA candidates

We derived two ecDNA callers to identify ecDNA candidates; one is based on the logistic regression model, and the other one is based on the convolutional neural networks. Both models predict genome-wide 1 Mb bins that contain ecDNA, as well as cells with presence of ecDNA population-wide. Detailed methodologies for both algorithms are delineated as follows.

Logistic regression model-based ecDNA caller

We trained a multivariate logistic regression model using the glm function in R with inferred copy number, hub index and trans-to-cis contacting bin ratio as predictive variables. The training dataset comprised the well-defined ecMYC in COLO320DM (chr8: 127–128 Mb) and EGFR ecDNA in GBM39 (chr7: 55–56 Mb) as positive data, with the same loci in COLO320HSR and GBM39-ER as negative control. We established a probability threshold for classifying ecDNA presence. Specifically, in COLO320DM/COLO320HSR data, this threshold was determined to be 0.5. In GBM39/GBM39-ER data, the threshold was determined to be 0.95.

Deep-learning-model-based ecDNA caller

Data preprocessing

To train the deep-learning-based ecDNA caller, we also selected the well-defined ecMYC in COLO320DM (chr8: 127–128 Mb) and EGFR ecDNA in GBM39 (chr7: 55–56 Mb) as positive data, with the same loci in COLO320HSR and GBM39-ER as negative control. We used all autosomes and chromosome X to create a Hi-C contact matrix at 1 Mb resolution (3,044 × 3,044) for each cell. We then randomly selected 90% of cells as the training data and kept the remaining 10% of cells as the validation data. For each 1 Mb bin, we retained its local 5 Mb neighborhood region (including the center bin itself), and used both cis- and trans-contacts (that is, a 5 × 3,044 matrix) as its feature in the neural network model.

Model architecture

Our proposed model consists of two sequentially placed convolutional modules that extract features from the binarized Hi-C contact matrices. Each convolutional module consists of a multi-channel convolutional layer (8 channels in the first module, 16 channels in the second module), a batch normalization layer, a rectified linear unit activation function96 and a max-pooling97 layer sequentially. Convolution kernels scan along the direction of rows in each layer. Large convolutional kernel sizes (5 × 45 in the first module and 1 × 45 in the second layer) are set to improve pattern capture because of sparsity. Max-pooling reduces the size of the matrix to half on its second dimension to keep the most important features and thus improves learning efficiency and propagation speed. Strides and paddings of each layer were designed to balance computational efficiency and information retention and must be compatible with the matrix shapes and max-pooling layers. To enhance robustness and prevent overfitting, a dropout layer98 with probability of 0.5 is inserted between convolutional modules.

Subsequently, a two-layer fully connected (dense) network integrates information from multiple sources and makes ternary predictions. The first fully connected layer with a 223 hidden size receives (1) the flattened output of the convolutional modules, (2) the flattened 5 × 5 small contact matrix of the corresponding 5 Mb region (whose diagonal entries denote intra-bin contacts) from binarized 5 × 3,044 Hi-C contact matrices, (3) the hub index (used to measure inequality and heterogeneity of interactions with different chromosomes for certain bins and calculated by Python package ‘pygini’) of the center bin calculated from interaction counts aggregated per chromosome and (4) L1 normalized row means of binarized 5 × 3,044 Hi-C contact matrices. The output passes through a batch normalization layer, a Gaussian error linear unit activation function99 and a dropout layer with probability of 0.5. Finally, the second fully connected layer with hidden size 64 produces predictions with a subsequent softmax activation function, in which each prediction contains the three probabilities of ‘none’, ‘ecDNA’ and ‘HSR’ sequentially.

Model training and validation

With the package PyTorch100, the training process spanned 40 epochs using the mini-batch training strategy (batch size = 32). To ensure robust optimization, we applied the AdamW101,102 optimization algorithm with 0.001 weight decay and 0.001 learning rate, and implemented the ‘hard’ bootstrapped cross-entropy loss with parameter β equal to 0.99, which is calculated from one-hot-encoded class labels and predictive softmax103 probabilities. To efficiently reduce false positives, we compensated the difference in quantities of different labels and in addition forced the model to favor negative prediction (‘none’) using biased class weights, which is incorporated into the loss calculation. Thus, the model receives a greater loss as a penalty when generating any positive prediction (‘ecDNA’ or ‘HSR’). We mainly used confusion matrix and subsequently derived binary and ternary precision, sensitivity, specificity and accuracy as our evaluation metrics, and binary results (‘none’ or ‘ecDNA’ only) from logistic regression as the baseline performance.

Prediction of ecDNA

The model scans each Hi-C contact matrix from beginning to end without strides to generate SoftMax probabilities on each 1 Mb bin (except the first two bins of chromosome 1 and the last two bins of chromosome X as no padding to the Hi-C matrix is applied) on each single cell. Argmax (the arguments of the maxima) function is used to determine the final predicted class, which is either 0 (‘none’), 1 (‘ecDNA’) or 2 (‘HSR’). Subsequently, results are aggregated to calculate the proportion of cells with ecDNA and/or HSR among the entire cell population.

Identify significant trans-interactions of ecDNA candidate loci

To identify genomic regions that preferentially interact with ecDNA, we first quantify the number of interactions (Ni) of 500 kb genomic interval i on different chromosomes with ecDNA candidate loci, treating each chromosome separately. The observed interaction frequency (Pi) is the proportion of Ni relative to all contacts on the same chromosome: \({P}_{i}=\,\frac{{N}_{i}}{{\sum }_{g=1}^{n}{N}_{g}}\). The expected interaction frequency for interval i was calculated as the ratio of the inferred copy number (CNi) to the total copy number on the same chromosome: \({E}_{i}=\,\frac{{\rm{CN}}_{i}}{{\sum }_{g=1}^{n}{\rm{CN}}_{g}}\), based on the null hypothesis that the interaction frequency for genomic regions with ecDNA is weighted only by its underlying copy number. To identify significant interactions, observed-versus-expected P value was calculated based on the binomial distribution model. Multiple testing correction was done by Bonferroni adjustment. Regions with adjusted P value < 0.05 were selected as significant interacting regions.

ecDNA hub analysis

To generate a background distribution for ecDNA hub analysis, we shuffled chromosome identities for all trans interactions in each cell and calculated the hub index using R package ineq based on the shuffled contact matrices. P value was calculated using the Wilcoxon signed-rank test.

Analysis of GBM cellular states

To classify single cells within GBM 10x Genomics Multiome or Paired Hi-C RNA datasets into one of four predefined malignant cellular states, we adopted a previously described two-dimensional visualization technique53. Briefly, we quantified the gene enrichment score (SCj(i)) for each cell (i) against one of the four gene sets (Gj) associated with the particular cellular state. This score was calculated as the relative averaged expression (Exp) of Gj in cell i compared with a group of genes (Gjcont) with a similar level of expression as the control: \({\rm{SC}}_{j(i)}=\frac{{\sum }_{g=1}^{{N}_{j}}{\rm{Exp}}_{g(i)}}{{N}_{j}}-\frac{{\sum }_{g=1}^{{N}_{j}\rm{cont}}{\rm{Exp}}_{g(i)}}{{N}_{{j}\rm{cont}}}\), where Nj and Njcont are the number of genes in Gj and Gjcont. After obtaining scores for all four cellular states, the cells were stratified into OPC/NPC versus AC/MES categories using the differential score \(D=\max \left({\rm{SC}}_{\rm{OPC}},\,{\rm{SC}}_{\rm{NPC}}\right)-\max \left({\rm{SC}}_{\rm{AC}},\,{\rm{SC}}_{\rm{MES}}\right)\). For further refinement, OPC/NPC cells were assigned an identity value \(C={\log }_{2}(\left|{\rm{SC}}_{\rm{OPC}}-{\rm{SC}}_{\rm{NPC}}\right|+1)\), and AC/MES cells were similarly categorized with \(C={\log }_{2}(\left|{\rm{SC}}_{\rm{AC}}-{\rm{SC}}_{\rm{MES}}\right|+1)\). The distribution of cellular states was then plotted in the two-dimensional representation with D on the y-axis and C on the x-axis.

For cellular state identification in Droplet Hi-C data, co-embedding with the reference snRNA-seq was performed using the scGAD score. After determining the top 15 nearest snRNA-seq neighbors and their scaled distance-based similarity score (Dm) for each Droplet Hi-C cell, the Hi-C gene enrichment score (HSCj(i)) was computed as the sum of the neighbor-weighted enrichment scores: \({\rm{HSC}}_{j\left(i\right)}=\,{\sum }_{g=1}^{15}{\rm{SC}}_{j(g)}x{D}_{m(g)}\).

Processing and analysis of Paired Hi-C data

Preprocessing and analysis of Hi-C modality in Paired Hi-C are identical as described for Droplet Hi-C. For RNA modality, fastq files were demultiplexed by cellranger-arc, but preprocessed using cellranger (v6.1.2). After obtaining the cell-by-gene matrix, clustering and visualization were performed as described for the 10x Genomics Multiome RNA dataset. As barcodes in the same gel bead for RNA and Hi-C modalities are different, we performed manual pairing to match cell barcodes based on the 10x Genomics Multiome barcodes white list provided in cellranger-arc (10x Genomics).

Integration of the Paired Hi-C RNA dataset with reference datasets was performed by Seurat. First, normalization of gene counts was performed, and the top 2,000 shared variable genes among datasets were selected as integration features. Subsequent canonical correlation analysis allowed projecting all nuclei into a unified embedding space. Anchors (pairs of corresponding cells from distinct datasets) were then discerned through mutual nearest neighbor searching. Anchors with low confidence were excluded, and the shared neighbor between anchors and query cells are computed. Louvain clustering was applied to the shared neighbor graph to discern co-embedded clusters. We calculated overlap coefficients as described in a previous section to compare clustering results from different datasets.

Analysis of gene expression levels and copy numbers of ecDNA

To calculate the correlation between gene expression level and inferred copy number of genes on ecDNA, we first refined the range of ecDNA at 10 kb resolution. This is based on the observation of the increased intra-ecDNA interaction frequency than interactions with regions on linear genome, irrespective of their genomic distance. Specifically, we enumerated the local interactions within each 10 kb segment of the 1 Mb ecDNA candidates. Subsequently, the contact numbers were smoothed among adjacent bins. By calculating the differential contact numbers between neighboring bins, we determined the changing points of interaction with an average difference cutoff among the entire 1 Mb region. The outermost local maxima and minima were designated as ecDNA boundaries.

With the ecDNA boundaries established, we categorized genes with gene bodies residing within ecDNA as ‘ecDNA genes’. In the case of ecMYC, we further classified genes on GBM39 ecMYC as ‘shared genes’ and genes on non-overlapping regions between GBM39 and GBM39-ER as ‘ecMYC variable genes’, considering the greater ecDNA size in GBM39-ER at the pseudo-bulk level. The SCC between gene expression levels (RPKM) and the average inferred copy number over all 10 kb segments at gene body among all cells was calculated, for a comprehensive representation of the interplay between gene expression and copy number.

Ethical approval

The GBM specimen collection was approved by the Institutional Review Board (IRB) at the University of Minnesota (IRB number STUDY00012599). The AML and myelodysplastic syndrome specimen collection was approved by the IRB at the University of California, San Diego (IRB number 131550). Before sample collection, a dedicated research coordinator obtained informed consent from each patient, in accordance with the Declaration of Helsinki and appropriate Ethics Committee approval from each partner institution.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



Source link

LEAVE A REPLY

Please enter your comment!
Please enter your name here