Experimental method
All reagents used are enumerated in Supplementary Table 2. All oligonucleotides were ordered from Integrated DNA Technologies and are enumerated in Supplementary Table 3. The detailed workflow of the ‘end-to-end’ experiment is shown in Supplementary Figs. 1c and 3; all steps from the preparation of embryos through IVT occurred in situ and all other steps were performed ex situ.
Preparation of zebrafish embryos
AB-wildtype zebrafish were kept and crossed in accordance with the approved protocols and ethical guidelines of the University of Chicago Institutional Animal Care and Use Committee. Embryos were collected at 24 hpf and dechorionated with 1 mg ml−1 pronase for 5 min at 28 °C. The dechorionated embryos were fixed in 4% paraformaldehyde in 1× PBS at 4 °C overnight. After dehydration in 100% methanol for 15–30 min at room temperature, the embryos were stored at −80 °C for at least 2 h before use. Embryos were successively rehydrated with 75%, 50% and 25% methanol in 1× PBS for 5 min each and washed four times with PBST (1× PBS + 0.1% Tween-20) at 5 min per wash at room temperature. The embryos were then permeabilized with 5 × 10−5 U per μl thermolabile proteinase K for 12 min at room temperature, followed by inactivation at 55 °C for 15 min. Samples were then washed four times in PBST at 5 min per wash.
In situ RT
After permeabilization, embryos were incubated at 4 °C for 1 h under slow rotation (10 rpm), followed by 10 min of incubation at 65 °C, in a pre-RT buffer comprising 20% formamide, 0.5 U per μl Superase-In, 4.4 mM DTT, 0.5 μg μl−1 recombinant albumin and 1× PBS and were then immediately cooled to 4 °C. After one water rinse, RT mix (4.4 mM DTT, 400 μM deoxynucleotide triphosphate (dNTP), 32 μM aminoallyl-dUTP, 0.5 μg μl−1 recombinant albumin, 1 U per μl Superase-In, 10 U per μl Superscript III and 1 μM 21.068C-8N RT primer in 1× FS buffer) was added and samples were incubated at 4 °C for 1 h, 60 °C for 3 min and 37 °C overnight under slow horizontal orbital rotation, followed by incubation for 1 h at 50 °C. After RT, embryos were washed three times in PBST and once with water and then incubated in Exo I mix (1.43 U per μl Exo I in 1× Exo I buffer) at 4 °C for 1 h under slow rotation, followed by 1 h at 37 °C, to remove the RT primer and displaced cDNA. Embryos were again washed three times in PBST and once with water.
Tagmentation
Transposomes were assembled according to the manufacturer’s protocol (Supplementary Fig. 1a). Briefly, oligos 22tn5.003A and 22tn5.MOS-3p were individually resuspended to 100 μM in annealing buffer (40 mM Tris-HCl pH 8 and 50 mM NaCl) and annealed at an equimolar ratio at 95 °C for 2 min, followed by 1 °C decrements per min for 70 min. Transposome assembly mix (25 μM of the annealed oligo-duplex and 1 μg μl−1 tagmentase) was incubated at 23 °C for 30 min. Glycerol was then added to a final concentration of 50% and the stock was stored at −20 °C until use. Following RT, the transposome glycerol stock was diluted 3:10 in tagmentase dilution buffer. This diluted solution was then added at a further 1:50 dilution to transposome reaction mix containing 5 mM MgCl2, 10 mM Tris-HCl pH 7.5, 10% N,N-dimethylformamide, 9% PEG8000 and 850 μM adenosine triphosphate (ATP)24. This mix was added to samples and incubated at 4 °C for 1 h under slow rotation, followed by 1 h at 55 °C. Samples were then washed twice in PBST and once in 1× PBS.
Crosslinking and transposome denaturation
The 5 mM BS(PEG)5 solution was prepared in 1× PBS and added to samples for incubation at room temperature for 1 h under slow rotation. Samples were then rinsed with 1 M Tris pH 8 and quenched in this buffer for 30 min at room temperature. Samples were then incubated in 4× saline–sodium citrate (SSC) at 4 °C for 10 min under slow rotation, followed by 15 min at 70 °C. Afterward, samples were incubated in 10% formamide and 2× SSC at 4 °C under slow rotation for 10 min and then at 50 °C for 10 min. Samples were then washed three times in PBST for 5 min each.
3′ adaptor ligation and circDNA annealing
After rinsing with water, 3′ adaptor ligation mix (500 nM 22tn5.005, 500 nM 22tn5.006, 1.25 U per μl SplintR ligase in 1× SplintR ligase reaction buffer containing ATP) was added to samples, which were incubated at 4 °C for 1 h followed by 23 °C overnight. After ligation, samples were washed three times in PBST for 5 min each wash and then rinsed with water; the 3′ phosphates on the ligated oligos were removed with 0.5 U per μl Quick CIP in 1× rCutSmart buffer by incubation at 4 °C for 1 h under slow rotation, followed by 37 °C for 1 h. Samples were then again washed three times in 2× SSCT (SSC + Tween-20) at 5 min per wash. circDNAs (circ6G1 and circ7G1) were prepared using T4 DNA ligase and short splint oligos25 (splint6F5 and splint7F5, respectively) (Supplementary Figs. 1b and 2) and purified using a Zymo Oligo Concentrator spin column. Products were checked for size and purity using TBE–urea gels. circDNA annealing mix (100 nM circ6G1 and 100 nM circ7G1 in 1× hybridization buffer containing 2× SSC, 10% formamide and 0.1% Tween-20) was added to samples and incubated overnight at 40 °C under slow rotation. Samples were then washed in hybridization buffer at 40 °C for 30 min under slow rotation3 and then washed in 2× SSCT, 1× SSCT and finally PBST at 5 min per wash.
Generation of cDNA-anchored DNA nanoballs
After rinsing with water, 3′ adaptors were initially extended along the circDNAs under two different conditions: ‘normal’ pre-RCA extension (100 μM dNTP, 0.04 U per μl T4 DNA polymerase in 1× NEBuffer2.1) and ‘enhanced’ pre-RCA extension (25 ng μl−1 T4 gene 32, 1 mM dNTP, 0.04 U per μl T4 DNA polymerase in 1× NEBuffer2.1). Following the initial extension of the 3′ adaptors, DNA nanoballs were generated by adding the RCA mix (25 ng μl−1 T4 gene 32, 0.5 μg μl−1 recombinant albumin, 250 μM d(AUGC)TP and 0.2 U per μl phi29 polymerase in 1× phi29 reaction buffer), before incubation at 4 °C for 1 h under slow rotation and then 30 °C overnight. For fluorescence observation, 20 μM fluorescein-12–dUTP was added to the RCA mix. Samples were washed three times in 2× SSCT.
UEI oligo annealing and T4 DNA extension and ligation
UEI annealing mix (100 nM 21.004G1/2-BC oligo mix, 100 nM 21.073pt, 100 nM 21.074B, 2× SSC, 5% formamide and 0.1% Tween-20) was added to samples, before incubation at 4 °C for 1 h under slow rotation and then 50 °C for 2 h. After being brought to room temperature, samples were washed in UEI hybridization buffer (2× SSC, 5% formamide and 0.1% Tween-20) for 1 h at 50 °C, followed by washes in 2× SSCT, 1× SSCT and then PBST. After a water rinse, extension–ligation mix (1 mM dNTP, 0.15 U per μl T4 DNA polymerase and 20 U per μl T4 DNA ligase in 1× T4 ligase reaction buffer including ATP) was added, before incubation for 1 h under slow rotation at 4 °C, followed by 40 min at room temperature. Samples were then washed three times in PBST, followed by a water rinse.
IVT
IVT mix (Supplementary Fig. 4) was prepared by combining the following components: 100 nM 21.075, 7.5 mM rNTP mix, 100 ng μl−1 T4 gene 32, 0.05 U per μl USER enzyme, 10% T7 Enzyme Mix, 1× reaction buffer, 6.4 μg μl−1 THIOCURE ETTMP 1300 and 73.6 μg μl−1 4arm-PEG20K-vinylsulfone (PEG reagents were thawed from −80 °C immediately before reaction). The mixture was added to individual zebrafish embryos at a total volume of 30 μl. Hydrogels were allowed to form around samples for 2 h at room temperature. Reactions were then incubated at 37 °C for 20 h. Afterward, the hydrogels were denatured by adding 12 μl of denaturation solution (457.5 mM KOH, 100 mM EDTA and 42.5 mM DTT) for 2 h at 4 °C. Denaturation was stopped by adding 12 μl of stop solution (600 mM Tris-HCl pH 7.5 and 0.4 N HCl). After mixing, 30 μl of proteinase K mix (0.28% Tween-20, 0.09 U per μl proteinase K and 8.6 mM Tris-HCl pH 7.5) was added to the 54-μl samples for a total of 84 μl. This mixture was incubated at 50 °C for 1 h.
IVT RNA isolation and cDNA synthesis
IVT RNA was purified by addition of 1.2× RNAClean XP beads, following the manufacturer’s protocol, and eluted into water. DNase I digestion was performed (final concentrations of 0.8 U per μl Superase-In, 0.1 U per μl DNase I and 1× DNase I reaction buffer) at 37 °C for 30 min. RNA was again purified using 1.2× RNAClean XP beads and eluted into water. RT was carried out with final concentrations of 500 nM each RT primer (21.077 and 21.085), 500 μM dNTP, 1× FS buffer, 5 mM DTT, 1 U per μl Superase-In and 10 U per μl Superscript III. Primers and dNTP were added first to the RNA and water eluent and incubated at 65 °C for 5 min, after which the mixture was placed promptly on ice. The remainder of the reaction mixture was then added and samples were incubated for 1 h at 50 °C, followed by inactivation for 15 min at 70 °C, before being kept at 4 °C. Exo I was then added directly to the product to a final concentration of 3.3 U per μl and incubated at 37 °C for 30 min, followed by heat inactivation at 80 °C for 20 min.
Library preparation
cDNA products (reverse-transcribed from the IVT RNA) were then amplified in two separate sets of PCRs. cDNA amplicons were amplified by adding Exo I-digested product at a final 1:80 dilution into four separate reactions (per embryo) containing final concentrations of 300 nM 21.046G1-BC primer, 300 nM 21.081b primer, 1× HiFi PCR buffer, 200 μM dNTP, 2 mM MgSO4 and 0.02 U per μl Platinum Taq HiFi. These reactions were thermocycled as follows: 95 °C for 2 min, 5× (95 °C for 30 s, 56 °C for 30 s and 68 °C for 2 min), 20× (95 °C for 30 s and 68 °C for 2 min), 68 °C for 5 min and 4 °C. UEI amplicons were amplified by adding Exo I-digested product at a final 1:40 dilution into two separate reactions (per embryo) containing final concentrations of 300 nM 21.077-G1 primer, 300 nM 21.076BB primer, 3.3 μM each of 4E4.interf1 and 4E.interf2 (3′ phosphate-capped oligos to interfere with PCR recombination6,26), 5% DMSO, 1× HiFi PCR buffer, 200 μM dNTP, 2 mM MgSO4 and 0.02 U per μl Platinum Taq HiFi. These reactions were thermocycled as follows: 95 °C for 2 min, 1× (95 °C for 30 s, 66 °C for 30 s and 68 °C for 2 min), 18× (95 °C for 30 s and 68 °C for 2 min), 68 °C for 5 min and 4 °C. PCR products were then purified using a 0.75× volume of AMPure XP beads, following the manufacturer’s protocol. Products were quantified and sequenced on an Illumina NextSeq 500 instrument using 150-cycle kits (112 nt for read 1 and 44 nt for read 2), including the sequencing primer SBS3B as a custom spike-in, according to the manufacturer’s protocol. Illumina’s Local Run Manager version 4.0.0.649 was used on the instrument for data collection.
Summary of protocol optimization
Initial experiments in cultured cells optimized key chemistries to enhance the DNA microscopy protocol’s efficiency and specificity. Oligonucleotides were redesigned to reduce hairpin structures and minimize nonspecific annealing. An Exo I treatment after RT effectively removed residual RT primers. SplintR ligase was selected over other ligases for 3′ adaptor ligation because of its higher efficiency and specificity for RNA-splinted DNA ligation. Introducing Tn5 transposase-mediated tagmentation increased cDNA yield and created more 3′ ends for adaptor ligation; a transposome dilution of 1:50 balanced DNA fragmentation with sample uniformity. Adjustments to PEG8000 concentration and ATP levels improved tagmentation efficiency. Including 32 μM aminoallyl-dUTP during in situ RT and crosslinking with BS(PEG)5 immobilized fragmented RNA and cDNA. The phi29 DNA polymerase was chosen over Bst DNA polymerase for RCA because of its superior performance. UEI oligo annealing conditions were optimized by adjusting formamide concentration, incubation time and temperature. Incorporating IVT for linear amplification of cDNA and UEIs reduced material loss before PCR and minimized amplification errors; inclusion of T4 gene 32 protein during IVT further enhanced UEI formation. RNA and DNA purification steps were optimized using RNAClean XP and AMPure XP beads to improve recovery and reduce noise.
Applying the protocol to 24-hpf zebrafish embryos required additional optimization because densely packed cells hinder reagent diffusion. Effective permeabilization was achieved using 5 × 10−5 U per μl thermolabile proteinase K under the conditions described above. Including 20% formamide before in situ RT enhanced RNA denaturation and cDNA synthesis. To mitigate crowding from DNA nanoballs formed during RCA, the amplification strategy was modified: T4 DNA polymerase (lacking strand-displacement activity) and dNTPs were used to extend 3′ adaptors along circDNAs, copying UMIs only once. Subsequent RCA with phi29 DNA polymerase, substituting dUTP for deoxythymidine triphosphate, produced DNA nanoballs digestible by USER enzyme after UEI formation, reducing crowding and improving reagent diffusion. Including T4 gene 32 protein in RCA further enhanced UEI formation. Following UEI oligo annealing, a protocol combining T4 DNA ligation and extension with a high dNTP concentration greatly improved UEI yield. PCR conditions were adjusted by lowering template concentration to minimize nonspecific products and adding 5% DMSO to UEI amplification reactions to enhance specificity.
Sequence analysis
Sequence analysis was performed using the pipeline previously described6, with code updated for the larger scale of data (https://github.com/wlab-bio/vdnamic).
Briefly, sequencing reads were demultiplexed using the barcodes depicted in Supplementary Fig. 1. Subsequently, for each amplicon type, sequence elements (UMI type I, UMI type II, UEIs and cDNA inserts) were separately clustered with a 1-bp difference criterion using the EASL algorithm6.
For UEI datasets, each UEI was assigned a UMI pair by plurality (relevant only if a specific UEI appeared to show two different pairings of UMIs, a signature of PCR recombination). The resulting ‘consensus pairings’ were then pruned, with each UMI required to be associated with at least two UEIs and each association (unique UMI–UMI pair) required to be supported by at least two reads. The largest contiguous matrix (found by single-linkage clustering, with rarefaction depicted in Fig. 2o) was retained for image inference.
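For illustration, the largest-component step can be sketched as follows (a minimal sketch, assuming the pruned UEI matrix is held as a SciPy sparse matrix; the released vdnamic pipeline implements this at scale and the function name is ours):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def largest_contiguous_matrix(uei_matrix):
    """Retain the largest connected component of the symmetric UEI graph,
    whose nodes are UMIs and whose edge weights are UEI counts."""
    n_comp, labels = connected_components(uei_matrix, directed=False)
    keep = np.flatnonzero(labels == np.bincount(labels).argmax())
    return uei_matrix[keep][:, keep], keep
```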
Although the reprocessed6 data for Fig. 4y were subjected only to the above thresholding, volumetric DNA microscopy datasets were additionally subjected to a more stringent threshold: every UMI–UMI association was required to possess at least two UEIs.
For cDNA-insert sequence data, reads grouped by the same UMI had a sequence consensus generated by majority vote. These sequences were then trimmed to eliminate sequence adaptors. Those inserts retaining at least 25 bp of nonartificial sequence (that is, sequence not matching known artificial sequences) were then counted toward the cDNA-insert UMIs (depicted as rarefaction in Fig. 2i,j). These consensus inserts were then aligned with STAR27 version 2.7.9a using the Danio rerio genome assembly GRCz11. Gene annotations were obtained from Ensembl release 109 (February 2023) for D. rerio using the chromosomal assembly GTF file (Danio_rerio.GRCz11.109.chr.gtf) based on the GRCz11 reference genome. Ribosomal RNA (rRNA)-annotated loci were extended by 1 kb upstream and downstream of their respective annotated regions to ensure comprehensive labeling.
Alignments were filtered on the basis of alignment score and match percentage to retain high-quality matches. Mismatches and mutations were identified by analyzing compact idiosyncratic gapped alignment report (CIGAR) strings and mismatch description tags, constructing mutation strings detailing each mismatch. Gene annotations were obtained by mapping genomic coordinates to features in the D. rerio GTF file, associating each sequence with gene names and biotypes. To avoid false-positive identification of mRNAs when sequences were actually rRNAs, the script checked for rRNA or Mt_rRNA biotypes and explicitly set the gene name accordingly, assigning an attention rank to prioritize accurate classification of rRNA sequences. Those not mapped to rRNA, Mt_rRNA, mRNA or long intergenic noncoding RNA loci were assigned the label ‘genomic DNA or other’ in Supplementary Table 1. Lastly, the data from alignments and annotations were merged and sorted to produce the final annotated output, consolidating information for each UMI and ensuring that each sequence was associated with its corresponding genomic annotations and read counts, thereby facilitating detailed analysis while minimizing the risk of misclassification.
The UMIs from genome-mapped cDNA-insert libraries (Supplementary Table 1) were then matched back to the UMIs in the UEI amplicon libraries of the corresponding specimen, allowing for single-nucleotide mismatches. The gene calls were then applied to label UMIs in the UEI-inferred image.
Simulations
All simulations were performed by taking the raw coordinates depicted in Fig. 4 and calculating Gaussian PSFs. For UMI i and UMI j at ground-truth positions \({\overrightarrow{x}}_{i}\) and \({\overrightarrow{x}}_{j}\), respectively, with N being the sum total of all UEI counts in the simulated dataset, we assigned a raw count nij ← NegativeBinomial(mean = μij, p) where \({\mu }_{ij}\leftarrow N{e}^{-| |\;{\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}}/{\sum }_{ij}{e}^{-| |\; {\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}}\) and p ← 0.8.
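This sampling scheme can be sketched as follows (a minimal illustration, not the released simulation code; the conversion from the (mean, p) parametrization above to NumPy’s (n, p) parametrization, with mean n(1 − p)/p, is our assumption about conventions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_uei_counts(x_i, x_j, N_total, p=0.8):
    """Draw simulated UEI counts n_ij from Gaussian PSFs at ground-truth
    positions x_i (m_I x d) and x_j (m_II x d), with total count N_total."""
    d2 = ((x_i[:, None, :] - x_j[None, :, :]) ** 2).sum(-1)  # ||x_i - x_j||^2
    kernel = np.exp(-d2)
    mu = N_total * kernel / kernel.sum()       # mean mu_ij as defined above
    n_param = mu * p / (1.0 - p)               # NumPy mean = n(1 - p)/p
    return rng.negative_binomial(np.maximum(n_param, 1e-12), p)
```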
Image inference
Review of DNA microscopy inference as a form of dimensional reduction
The DNA microscopy dataset comprised UEI counts nij, corresponding to type I UMI index i and type II UMI index j. In discussing image inference, we use the below-described notation that effectively describes the dataset as a bipartite graph, whose weights are UEI counts and whose nodes are UMIs. First, the ‘raw’ and symmetric UEI matrix, with elements (UEI counts) nij, is written as follows:
$${\bf{n}}\equiv \left(\begin{array}{cc}0&{{\bf{n}}}_{{m}_{{\bf{I}}}\times {m}_{{\bf{II}}}}\\ {{\bf{n}}}_{{m}_{{\bf{II}}}\times {m}_{{\bf{I}}}}&0\end{array}\right)$$
On the other hand, the row-normalized matrix is written as follows:
$${\bf{N}}\equiv \,\text{diag}\,{({n}_{i\cdot })}^{-1}{\bf{n}}$$
Here and elsewhere in this paper, subscript ‘⋅’ denotes index summation, such that ni⋅ ≡ ∑jnij. The top eigenvectors of N are preliminary estimates of basis vectors to compute the final dimensional embedding. The reason for this is discussed below. Take the Kullback–Leibler minimization problem,
$$\mathop{{\rm{arg}}\,{\rm{min}}}\limits_{\overrightarrow{x}}{D}_{\rm{KL}}(p\parallel q)=\mathop{{\rm{arg}}\,{\rm{min}}}\limits_{\overrightarrow{x}}\sum _{ij}{p}_{ij}\log {p}_{ij}/{q}_{ij}(\overrightarrow{x})$$
where pij ≡ nij/n⋅⋅ and \({q}_{ij}(\overrightarrow{x})\equiv {e}^{-| |\;{\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}}/{\sum }_{ij}{e}^{-| |\;{\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}}\). \({q}_{ij}(\overrightarrow{x})\) here denotes reaction probabilities given the coordinates \(({\overrightarrow{x}}_{i},{\overrightarrow{x}}_{j})\).
Writing this out, we want to solve
$$\begin{array}{rcl}\mathop{{\rm{arg}}\,{\rm{min}}}\limits_{\overrightarrow{x}}{D}_{\rm{KL}}(\;p\parallel q)&=&\mathop{{\rm{arg}}\,{\rm{min}}}\limits_{\overrightarrow{x}}\left(-\sum _{ij}{p}_{ij}\log {q}_{ij}(\overrightarrow{x})\right)\\ &=&\mathop{{\rm{arg}}\,{\rm{min}}}\limits_{\overrightarrow{x}}\left(\sum _{ij}{n}_{ij}| |\;{\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}+{n}_{\cdot \cdot }\log \sum _{ij}{e}^{-| |\;{\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}}\right)\end{array}$$
We can rewrite this as a least squares problem plus additional terms that we deal with later:
$$\mathop{{\rm{arg}}\,{\rm{min}}}\limits_{\overrightarrow{x}}{D}_{\rm{KL}}(\;p\parallel q)=\mathop{{\rm{arg}}\,{\rm{min}}}\limits_{\overrightarrow{x}}\left(\sum _{ij}{n}_{ij}| |\;{\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}+\,\text{nonlinear terms}\,\right)$$
(1)
Taking the derivative of the sum of squares with respect to \(\overrightarrow{x}\) and setting to 0 translates in matrix algebra to \({\bf{N}}\;\overrightarrow{x}\approx \overrightarrow{x}\) (that is, we want \(\overrightarrow{x}\) to be eigenvectors of N with eigenvalues closest to 1). Instead of allowing each point to be moved independently (previously called ptMLE6), we can prioritize finding putative solutions to the convex eigenvector problem; despite approximating the original problem, it has a unique solution. These putative solutions are used to explore the wider solution space to minimize the full objective function.
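In code, these putative solutions are simply the leading eigenvectors of the row-normalized matrix N; a minimal sketch using SciPy (function name ours):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

def top_eigenvectors(n_mat, E):
    """Row-normalize the sparse UEI count matrix n and return the top E
    eigenvectors of N = diag(n_i.)^{-1} n (eigenvalues closest to 1)."""
    row_sums = np.asarray(n_mat.sum(axis=1)).ravel()
    N = sp.diags(1.0 / np.maximum(row_sums, 1)) @ n_mat
    vals, vecs = eigs(N, k=E, which='LR')  # N is row-stochastic: |lambda| <= 1
    order = np.argsort(-vals.real)
    return vals.real[order], vecs.real[:, order]
```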
In sMLE, these putative solutions (that is, eigenvectors) are spectrally ordered (by eigenvalue) and are used incrementally to construct a solution. The basic procedure is that we start with the first d eigenvectors (where d is the number of dimensions in the embedding) and perform gradient descent on those initial \({d}^{2}\) linear coefficients (because any dimension can draw on any of the starting d eigenvectors). The gradient used is that of the nonlinear objective above projected onto the (orthogonalized) set of eigenvectors under consideration. After each projected gradient descent is complete, a new eigenvector is introduced with linear coefficient zero and allowed to change to further minimize the objective.
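A simplified sketch of this incremental procedure (assuming a callback grad_fn that returns the (m, d) gradient of the nonlinear objective at a given embedding; step sizes and stopping rules are placeholders, not the released implementation):

```python
import numpy as np

def incremental_projected_descent(eigvecs, grad_fn, d, n_steps=100, lr=1e-3):
    """Start from the first d (orthogonalized) eigenvectors, descend on the
    d x e matrix of linear coefficients, then admit one new eigenvector
    (with coefficient zero) at a time."""
    Q, _ = np.linalg.qr(eigvecs)              # orthogonalized eigenvector set
    m, E = Q.shape
    coeffs = np.eye(d)                        # initial d^2 linear coefficients
    for e in range(d, E + 1):
        basis = Q[:, :e]                      # eigenvectors under consideration
        for _ in range(n_steps):
            x = basis @ coeffs.T              # (m, d) current embedding
            coeffs -= lr * (grad_fn(x).T @ basis)   # projected gradient step
        if e < E:
            coeffs = np.hstack([coeffs, np.zeros((d, 1))])  # new eigenvector
    return basis @ coeffs.T
```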
The purpose of GSE
In GSE, we extend this idea by identifying a preconditioning operation on N so that the eigenvectors of the preconditioned matrix (what we call the GSE matrix) are individually closer approximations of the optimization problem; therefore, the solution to the projected gradient descent operation is closer to the UMIs or point coordinates that gave rise to the data.
In broad strokes, GSE examines the curvature of the d-dimensional manifold swept out by the E eigenvectors from earlier. However, to evaluate the curvature of this manifold, it is important for us to define some notion of distance within this E-dimensional subspace.
Rescaling the E-dimensional subspace by an ‘as-uniform-as-possible’ criterion
We begin with our row-normalized UEI matrix N and note that the quadratic portion of the objective in Eq. (1) is minimized when the matrix product \({\overrightarrow{x}}^{T}({\bf{I}}-{\bf{N}})\;\overrightarrow{x}\) is minimized. If we apply this to a scenario in which each eigenvector \({\overrightarrow{z}}_{i}\) of N occupies its own dimension and is rescaled by some factor αi, then the contribution to the objective of each dimension i goes as \({\alpha }_{i}\;{\overrightarrow{{z}_{i}}}^{T}({\bf{I}}-{\bf{N}})\;\overrightarrow{{z}_{i}}{\alpha }_{i}={\alpha }_{i}^{2}(1-{\lambda }_{i})\), where λi is the corresponding eigenvalue. Consequently, letting \({\alpha }_{i}=1/\sqrt{1-{\lambda }_{i}}\) will equalize the contribution of each dimension to the objective.
To perform this rescaling on the data, we construct an orthonormal subspace U by singular value decomposition of the matrix of N’s top E eigenvectors in the m × E matrix Zraw. Specifically, we use the left singular vectors of Zraw, U, to project N into an E × E matrix Nreduced ≡ UTNU. We then perform a full eigenvector decomposition on Nreduced:
$${{\bf{N}}}_{{\rm{reduced}}}{\overrightarrow{v}}_{i}={\lambda }_{i}\;{\overrightarrow{v}}_{i}$$
where \({\overrightarrow{v}}_{i}\) denotes the E eigenvectors and λi denotes the eigenvalues in the subspace U. The Ritz approximation of the eigenvectors in the original space is given by \({\overrightarrow{x}}_{i}\approx {\bf{U}}\;{\overrightarrow{v}}_{i}\). The rescaled subspace is then set to the rescaled Ritz-vectors:
$${\overrightarrow{z}}_{i}^{{\rm{rescaled}}}={\bf{U}}\;{\overrightarrow{v}}_{i}/\sqrt{1-{\lambda }_{i}}$$
(2)
Each of these vectors then forms columns of the matrix subspace Z.
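A compact sketch of this construction, combining the Ritz approximation with the rescaling of Eq. (2) (real parts are taken because the reduced matrix is not symmetric; function name ours):

```python
import numpy as np

def rescaled_subspace(N, Z_raw):
    """Project N into the orthonormal span U of its top-E eigenvectors,
    diagonalize the E x E reduced matrix and rescale each Ritz vector
    by 1/sqrt(1 - lambda_i)."""
    U, _, _ = np.linalg.svd(Z_raw, full_matrices=False)  # left singular vectors
    N_red = U.T @ (N @ U)                                # E x E reduced matrix
    lam, V = np.linalg.eig(N_red)
    order = np.argsort(-lam.real)
    lam, V = lam.real[order], V.real[:, order]
    return (U @ V) / np.sqrt(np.maximum(1.0 - lam, 1e-12))  # columns of Z
```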
Next, a k-nearest neighbors (kNN) search is conducted, with k set to 2E, to identify the local neighborhoods for each UMI in Z, with k here chosen so that the neighbors generally span all E dimensions. For each UMI i, the set of its k-nearest neighbors \({\mathcal{N}}(i)\) is determined on the basis of the Euclidean distances \(\parallel {\overrightarrow{z}}_{i}-{\overrightarrow{z}}_{j}\parallel\), where \({\overrightarrow{z}}_{i}\) and \({\overrightarrow{z}}_{j}\) represent the ith and jth rows of Z, respectively. For kNN searches in low dimensions (≤3), we used scikit-learn’s KDTree; otherwise, we used the Annoy library. Both operate with \(O(m\log m)\) time complexity.
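For example (a sketch of the low-dimensional case; the released code switches to Annoy above three dimensions):

```python
from sklearn.neighbors import KDTree

def knn_neighborhoods(Z, E):
    """Return the k = 2E nearest neighbors of each UMI in the rescaled
    subspace Z, excluding the query point itself."""
    tree = KDTree(Z)
    dist, idx = tree.query(Z, k=2 * E + 1)  # first neighbor is the point itself
    return dist[:, 1:], idx[:, 1:]
```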
Piecewise linear approximation of pairwise distances (‘geodesic smoothing’)
Following the identification of nearest neighbors, singular vectors (representing local tangent spaces) are computed for each point and then geodesic distances gij between each UMI i and its neighbors \(j\in {\mathcal{N}}(i)\) are estimated. Before computing the singular vectors, we define a weight wi for each point i as the reciprocal of its degree in the kNN graph, with a small constant to avoid division by zero:
$${w}_{i}=\frac{1}{| \{\;j:i\in {\mathcal{N}}(j)\}| +\epsilon }$$
where \(| \{\;j:i\in {\mathcal{N}}(j)\}|\) denotes the number of times point i appears in the neighborhood of other points.
For each point i, the local tangent space Ui is estimated by performing a singular value decomposition (SVD) on the weighted difference matrix:
$${{\bf{D}}}_{i}=\,\text{diag}\,{\left(\sqrt{{w}_{j}}\right)}_{j\in {\mathcal{N}}(i)}\left({{\bf{Z}}}_{{\mathcal{N}}(i),:}-{\overrightarrow{1}}_{| {\mathcal{N}}(i)| }\;{\overrightarrow{z}}_{i}^{T}\right)={{\bf{U}}}_{i}{{\mathbf{\Sigma }}}_{i}{{\bf{V}}}_{i}^{T}$$
where \({{\bf{Z}}}_{{\mathcal{N}}(i),:}\) is a matrix whose rows are the coordinate vectors (in subspace Z) of the neighbors of point i, \({\overrightarrow{z}}_{i}\) is the coordinate vector (in subspace Z) of point i, \({\overrightarrow{1}}_{| {\mathcal{N}}(i)| }\) is a column vector of ones with length equal to the number of neighbors of point i and \(\,\text{diag}\,{(\sqrt{{w}_{j}})}_{j\in {\mathcal{N}}(i)}\) is a diagonal matrix of the square roots of the weights for the neighbors of point i. The tangent space Ui is then given by the first d left singular vectors of Di, where d is the dimensionality of the tangent space.
To estimate the geodesic distance gij between point i and its neighbor j, we define \({\overrightarrow{z}}_{ij}={\overrightarrow{z}}_{j}-{\overrightarrow{z}}_{i}\) and use the following formula:
$${g}_{ij}=(1-{\beta }_{ij})\cdot \left(\left\Vert {{\bf{U}}}_{i}^{T}{\overrightarrow{z}}_{ij}\right\Vert +\left\Vert {{\bf{U}}}_{j}^{T}{\overrightarrow{z}}_{ij}\right\Vert \right)+{\beta }_{ij}\cdot \parallel {\overrightarrow{z}}_{ij}\parallel$$
where
$${{\bf{M}}}_{ij}={{\bf{U}}}_{i}^{T}{{\bf{U}}}_{j}={\tilde{{\bf{U}}}}_{ij}{\tilde{{\mathbf{\Sigma }}}}_{ij}{\tilde{{\bf{V}}}}_{ij}^{T}$$
and
$${\beta }_{ij}={\text{GeomMean}}_{l}({\tilde{\sigma }}_{ij,l})$$
where \({\tilde{\sigma }}_{ij,l}\) denotes the diagonal values of \({\tilde{{\mathbf{\Sigma }}}}_{ij}\). βij, as defined, represents the ‘tangent closeness metric’ calculated as the geometric mean of these singular values. Note that Mij is a d × d matrix, reflecting the inner product of the d-dimensional tangent spaces. This formulation adjusts the Euclidean distance to more accurately reflect the underlying manifold structure.
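A sketch of both steps follows; for \({{\bf{U}}}_{i}^{T}{\overrightarrow{z}}_{ij}\) to be well defined, U_i is taken here as the top-d principal directions of the weighted neighbor cloud in the E-dimensional subspace (function names ours):

```python
import numpy as np

def tangent_basis(Z_nbrs, z_i, w_nbrs, d):
    """Local tangent basis at point i from its weighted, centered neighbors
    (the SVD of D_i in the text); returns an E x d orthonormal basis."""
    D_i = np.sqrt(w_nbrs)[:, None] * (Z_nbrs - z_i)      # (k, E)
    _, _, Vt = np.linalg.svd(D_i, full_matrices=False)
    return Vt[:d].T

def geodesic_distance(z_i, z_j, U_i, U_j):
    """Blend tangent-projected lengths with the Euclidean distance using
    beta_ij, the geometric mean of the singular values of U_i^T U_j."""
    z_ij = z_j - z_i
    sv = np.linalg.svd(U_i.T @ U_j, compute_uv=False)
    beta = float(np.exp(np.log(np.maximum(sv, 1e-12)).mean()))
    tangent = np.linalg.norm(U_i.T @ z_ij) + np.linalg.norm(U_j.T @ z_ij)
    return (1.0 - beta) * tangent + beta * np.linalg.norm(z_ij)
```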
Constructing the GSE matrix
The estimated geodesic distances gij, defined for all i and j where \(j\in {\mathcal{N}}(i)\), are subsequently transformed into similarity weights using a Gaussian (radial basis function) kernel:
$${W}_{ij}\leftarrow \frac{{\left({\sigma }_{i}^{2}+{\sigma }_{j}^{2}\right)}^{-d/2}{e}^{-{g}_{ij}^{2}/\left({\sigma }_{i}^{2}+{\sigma }_{j}^{2}\right)}}{{\sum }_{{j}^{{\prime} }\in {\mathcal{N}}(i)}{\left({\sigma }_{i}^{2}+{\sigma }_{{j}^{{\prime} }}^{2}\right)}^{-d/2}{e}^{-{g}_{i{j}^{{\prime} }}^{2}/\left({\sigma }_{i}^{2}+{\sigma }_{{j}^{{\prime} }}^{2}\right)}}$$
where σi is the median value of gij over all \(j\in {\mathcal{N}}(i)\). The resulting sparse matrix W is then used in the final determination of the GSE eigenvector basis. The matrix product can now be defined as follows:
$${{\bf{W}}}_{{\rm{GSE}}}\equiv {\bf{W}}\;{\bf{N}}$$
(3)
GSE computes the top eigenvectors of the normalized and symmetrized version of this matrix, \({\tilde{{\bf{W}}}}_{{\rm{GSE}}}\equiv {{\mathbf{\Lambda }}}^{-1}({{\bf{W}}}_{{\rm{GSE}}}+{{\bf{W}}}_{{\rm{GSE}}}^{T})\), where Λ is a diagonal matrix whose elements are the row sums of \({{\bf{W}}}_{{\rm{GSE}}}+{{\bf{W}}}_{{\rm{GSE}}}^{T}\).
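Expressed as a sketch (W and N as sparse, row-normalized matrices; eigs handles the nonsymmetric product and function name is ours):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

def gse_eigenbasis(W, N, E):
    """Form W_GSE = W N (Eq. (3)), symmetrize and row-normalize, then
    return the top E eigenpairs used as putative embedding solutions."""
    W_gse = W @ N
    S = W_gse + W_gse.T
    lam = np.asarray(S.sum(axis=1)).ravel()          # row sums (Lambda)
    A = sp.diags(1.0 / np.maximum(lam, 1e-12)) @ S   # Lambda^{-1}(W_GSE + W_GSE^T)
    vals, vecs = eigs(A, k=E, which='LR')
    return vals.real, vecs.real
```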
The top eigenvectors of this matrix are considered to be putative solutions to the d-dimensional embedding problem, taking into account the curvature of the data manifold. Because we consider the top eigenvectors of \({\tilde{{\bf{W}}}}_{{\rm{GSE}}}\) to fit the data to the global curvature of the dataset, we now apply an incremental projected gradient descent on a global objective function (Fig. 3). The objective function we use here is the Kullback–Leibler divergence:
$${D}_{{\rm{KL}}}\equiv \sum _{ij}\left(\frac{{n}_{ij}}{{n}_{\cdot \cdot }}\right)\log \left(\frac{{n}_{ij}/{n}_{\cdot \cdot }}{{w}_{ij}/{w}_{\cdot \cdot }}\right)$$
This is a statistical distance between the UEI counts nij and the proximities in the embedding space \({w}_{ij}\equiv {e}^{-| |\;{\overrightarrow{x}}_{i}-{\overrightarrow{x}}_{j}| {| }^{2}+{A}_{i}+{A}_{j}}\), where \({\overrightarrow{x}}_{i}\) is the d-dimensional embedding coordinate of point i and where the amplitude \({A}_{i}\leftarrow \log {\sum }_{j}{n}_{ij}/{n}_{\cdot j}\).
Note that minimizing DKL is equivalent to maximizing the log likelihood \({\mathcal{L}}\) of the multinomial that uses UEI counts as ‘independent trials’ on the space of all possible UMI pairings (the framing in earlier DNA microscopy work)6. This is because the following statements are true:
$${\mathcal{L}}=\log \left({\prod }_{ij}{\left(\frac{{w}_{ij}}{{w}_{\cdot \cdot }}\right)}^{{n}_{ij}}\right)+{\rm{constant}}$$
(4)
$$=\sum _{ij}{n}_{ij}\log \frac{{w}_{ij}}{{w}_{\cdot \cdot }}+{\rm{constant}}$$
(5)
which is the same as the negative expression for DKL once constant coefficients are factored out.
Projected gradient descent
Writing out the objective gradient yields
$${\partial }_{{x}_{k}}{D}_{{\rm{KL}}}=-\sum _{ij}\left(\frac{{n}_{ij}}{{n}_{\cdot \cdot }}\right){\partial }_{{x}_{k}}\log \left(\frac{{w}_{ij}}{{w}_{\cdot \cdot }}\right)=-\frac{1}{{n}_{\cdot \cdot }}\sum _{j}\frac{{n}_{kj}}{{w}_{kj}}{\partial }_{{x}_{k}}{w}_{kj}+\frac{1}{{w}_{\cdot \cdot }}{\partial }_{{x}_{k}}{w}_{k\cdot }$$
(6)
Although the first term is a sparse sum and can be calculated in O(m) time, the second term summing all pairs of molecules presents unique challenges. In earlier work6, the N-body algorithm known as the fast Gauss transform (FGT) was implemented as part of the gradient calculation. Although FGT provided a way to scalably calculate the gradient contributions of large numbers of points within a small spatial volume in 2D, performing a similar calculation on still larger numbers of points and larger volumes in 3D presents a crucial challenge of scale.
Because sMLE and GSE both involve projected gradient descent on the eigenvectors of a matrix, a much faster numerical approximation becomes possible. Specifically, we can select representative pairings for each point at each iteration of GSE’s projected gradient descent.
At each iteration of the projected gradient descent, for each point, we first find its k = 2(d + 1)-nearest neighbors across all points in the d-dimensional embedding; we refer to this as pairing group SI. For each point, we next randomly select k points from the nearest \(1/{2}^{d}\) quantile of points to that original point (pairing group SII) and k points from the remaining \(1-1/{2}^{d}\) fraction (pairing group SIII). Then, we assign the corresponding fractions:
$${f}_{{\bf{I}}}\leftarrow \frac{k}{m}\,\,;\,\,{f}_{{\bf{II}}}\leftarrow \frac{1}{{2}^{d}}-\frac{k}{m}\,\,;\,\,{f}_{{\bf{III}}}\leftarrow 1-\frac{1}{{2}^{d}}$$
where fI + fII + fIII = 1, such that the three groups together account for the full ensemble of point-to-point (or UMI-to-UMI) comparisons. We can then calculate full approximations of the low-dimensional gradient by multiplying the gradient contribution of randomly selected UMI–UMI pairings by their expected representation within the ensemble of all UMI–UMI pairings.
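A sketch of this reweighted estimate of the dense gradient term for one point k (amplitudes omitted for brevity; group_idx and group_frac hold the sampled indices and the fractions fI, fII and fIII defined above):

```python
import numpy as np

def approx_dense_gradient(x, k, group_idx, group_frac):
    """Approximate sum_j d/dx_k w_kj, with w_kj = exp(-||x_k - x_j||^2),
    by scaling each sampled pairing group to its expected share of all
    m UMI-UMI pairings."""
    m = x.shape[0]
    grad = np.zeros(x.shape[1])
    for idx, frac in zip(group_idx, group_frac):
        diff = x[k] - x[idx]                          # (n_sampled, d)
        w = np.exp(-(diff ** 2).sum(axis=1))          # Gaussian proximities
        contrib = (-2.0 * w[:, None] * diff).sum(0)   # d/dx_k of sampled w_kj
        grad += contrib * (frac * m / len(idx))       # rescale to full ensemble
    return grad
```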
Review of resolution limits in DNA microscopy
Here, we review the first-order approximation of resolution limits in DNA microscopy experiments and image reconstruction6. When a UMI k accumulates nk⋅ total UEIs with neighboring UMIs, its positional uncertainty can be estimated by a Gaussian approximation to the likelihood encountered earlier in Eq. (4). Under slowly varying UMI density conditions on the scale of the characteristic diffusion length Ldiff, each interaction contributes a term in the log likelihood that is approximately quadratic in the UMI’s position. Denoting by \({\overrightarrow{x}}_{k}\) the maximum-likelihood position of UMI k, one finds that
$$\log p(\;{\overrightarrow{X}}_{k})\approx -\frac{1}{{L}_{{\rm{diff}}}^{2}}\sum _{j}{n}_{kj}\,\parallel {\overrightarrow{X}}_{k}-{\overrightarrow{x}}_{j}{\parallel }^{2}+\,\text{constant}\,$$
where nkj is the number of UEIs between UMIs k and j. Expanding about \({\overrightarrow{x}}_{k}\) gives a Gaussian form,
$$p(\;{\overrightarrow{X}}_{k})\,\propto \,\exp \left(-\frac{\parallel {\overrightarrow{X}}_{k}-{\overrightarrow{x}}_{k}{\parallel }^{2}}{2\,{\sigma }^{2}}\right),$$
with σ determined by the curvature of \(\log p({\overrightarrow{X}}_{k})\) at the peak. This curvature is proportional to the total UEI count nk⋅ = ∑jnkj, leading to
$$\sigma ={\left(\frac{2{n}_{k\cdot }}{{L}_{{\rm{diff}}}^{2}}\right)}^{-\frac{1}{2}}=\frac{{L}_{{\rm{diff}}}/\sqrt{2}}{\sqrt{{n}_{k\cdot }}}.$$
Each such σ should be viewed as defining a characteristic linear dimension within which UMI k is likely to reside; therefore, when extended to 3D, one interprets \({\sigma }^{3}\) as the ‘voxel’ volume of uncertainty.
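As a simple numerical check of the formula above (with hypothetical numbers), a UMI with 100 total UEIs has σ ≈ 0.07 Ldiff:

```python
import numpy as np

def positional_uncertainty(n_k_total, L_diff):
    """sigma = (L_diff / sqrt(2)) / sqrt(n_k.), the characteristic linear
    uncertainty of a UMI; sigma**3 gives the 3D 'voxel' volume."""
    return (L_diff / np.sqrt(2.0)) / np.sqrt(n_k_total)

# positional_uncertainty(100, 1.0) -> 0.0707...
```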
Hierarchical GSE
In the main text, we introduced hierarchical GSE as an implementation of GSE that uses sparse sampling of the global UEI graph to generate a collection of S subgraphs, each containing \({m}_{S}\) UMIs. Initial reconstructions from these subgraphs are then performed and a linear interpolation extrapolates these subsolutions to the remaining UMIs. These putative global solutions are then input as the starting E-dimensional subspace into GSE.
Subgraph selection
Subgraph selection for hierarchical GSE was performed through a series of iterations to generate S subgraphs with as close to \({m}_{S}\) UMIs each as possible. Points were randomly selected for analysis, alternating between uniform and nonuniform (local) selections. For uniform distributions, subgraphs were assembled from UMIs that were sorted purely on the basis of their coverage by prior generated subgraphs, prioritizing those with lower inclusion counts to ensure uniform sampling across the dataset. For nonuniform (local) selections, the subspace Z, derived from the rescaled eigenvectors of the entire dataset, was used to identify regions that were undersampled.
In nonuniform (local) selections, the process began by selecting a seed point that maximized its distance from UMIs included in prior subgraphs. Starting from this seed, UMIs were added to the subgraph by selecting the top \(1/{2}^{d}\) fraction of points (with d being the embedding dimension) closest to that seed, in order of lowest prior coverage by other subgraphs. UMIs that did not have at least two UEI links were pruned to ensure robust connectivity within the subgraph. Subsequently, the largest connected component within the pruned subset was identified and retained for subset GSE inference.
In all simulations, S = 25 subgraphs were generated before a consensus was formed in the manner described below. In all real datasets, S = 50 was used.
Linear interpolation of subsolutions
Let the desired subsample size be \({m}_{S}\) and the number of subsamples be S. For each of the S subsample iterations, the algorithm sorts the points on the basis of their cumulative coverage scores from all previous subsamplings, prioritizing those that have not yet been sampled and randomizing those that are tied. The algorithm constructs a subgraph of size \({m}_{S}\) by agglomerating points in that order. The selected subsample is stored for further processing. Eigenvectors are computed from each subsample’s UEI matrix and the coverage scores are updated on the basis of which points were included, ensuring diversity across iterations. GSE is then applied to these subsets and the resulting embedding is extrapolated back to the original dataset as described below.
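The uniform-selection alternation can be sketched as follows (the nonuniform, seed-based selection follows the previous section; function name ours):

```python
import numpy as np

def next_uniform_subgraph(coverage, m_S, rng):
    """Select m_S UMIs with the lowest cumulative coverage by prior
    subgraphs, randomizing ties, and update the coverage scores."""
    order = np.lexsort((rng.random(len(coverage)), coverage))  # coverage, then random
    chosen = order[:m_S]
    coverage[chosen] += 1
    return chosen
```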
Following on from Eq. (1) and its linear approximation, let N represent the row-normalized UEI matrix. We can formulate our objective as driving the normalized vector gradient of the objective in Eq. (1) to zero:
$$({\bf{N}}-{\bf{I}})\;\overrightarrow{x}\to 0$$
(7)
Let S be a subset of UMIs with known values XS and \(\bar{{\bf{S}}}\) be the complement set such that \(| {\bf{S}}| +| \bar{{\bf{S}}}| =m\). We partition the matrices and vectors as follows:
$${\bf{N}}\equiv \left(\begin{array}{cc}{{\bf{N}}}_{{\bf{S}}\times {\bf{S}}}&{{\bf{N}}}_{{\bf{S}}\times \bar{{\bf{S}}}}\\ {{\bf{N}}}_{\bar{{\bf{S}}}\times {\bf{S}}}&{{\bf{N}}}_{\bar{{\bf{S}}}\times \bar{{\bf{S}}}}\end{array}\right),\quad \overrightarrow{x}\equiv \left(\begin{array}{c}{\overrightarrow{x}}_{{\bf{S}}}\\ {\overrightarrow{x}}_{\bar{{\bf{S}}}}\end{array}\right)$$
The gradient condition in Eq. (7), restricted to the \(\bar{{\bf{S}}}\) rows, is
$${{\bf{M}}}_{\bar{{\bf{S}}}\times {\bf{S}}}\;{\overrightarrow{x}}_{{\bf{S}}}={{\bf{M}}}_{\bar{{\bf{S}}}\times \bar{{\bf{S}}}}\;{\overrightarrow{x}}_{\bar{{\bf{S}}}}$$
(8)
where \({{\bf{M}}}_{\bar{{\bf{S}}}\times {\bf{S}}}\equiv {{\bf{N}}}_{\bar{{\bf{S}}}\times {\bf{S}}}\) and \({{\bf{M}}}_{\bar{{\bf{S}}}\times \bar{{\bf{S}}}}\equiv ({\bf{I}}-{{\bf{N}}}_{\bar{{\bf{S}}}\times \bar{{\bf{S}}}})\).
We solve this system using the conjugate gradient method with regularization. The conjugate gradient method is an iterative algorithm that minimizes the quadratic form \(\frac{1}{2}{\overrightarrow{x}}^{T}{\bf{A}}\;\overrightarrow{x}-{\overrightarrow{b}}^{T}\overrightarrow{x}\) where \({\bf{A}}={{\bf{M}}}_{\bar{{\bf{S}}}\times \bar{{\bf{S}}}}\) and \(\overrightarrow{b}={{\bf{M}}}_{\bar{{\bf{S}}}\times {\bf{S}}}\;{\overrightarrow{x}}_{{\bf{S}}}\). The algorithm converges when the residual \(\parallel {\overrightarrow{r}}_{k}\parallel =\parallel \overrightarrow{b}-{\bf{A}}\;{\overrightarrow{x}}_{k}\parallel\) is below a specified tolerance.
The regularization parameter λ is calculated as
$$\lambda ={\sigma }^{2}| \bar{{\bf{S}}}|$$
where \({\sigma }^{2}\equiv \parallel {\overrightarrow{x}}_{{\bf{S}}}{\parallel }^{2}/| {\bf{S}}|\) is the empirical mean squared value of the known coordinates. This regularization helps to stabilize the solution and prevent overfitting, especially when the UEI matrix is not well connected.
At each iteration of the regularized conjugate gradient, we solve
$$({{\bf{M}}}_{\bar{{\bf{S}}}\times \bar{{\bf{S}}}}+\lambda {\bf{I}})\;{\overrightarrow{x}}_{\bar{{\bf{S}}}}={{\bf{M}}}_{\bar{{\bf{S}}}\times {\bf{S}}}\;{\overrightarrow{x}}_{{\bf{S}}}$$
To maintain consistency with the known values, we iteratively rescale the intermediate result:
$${\overrightarrow{x}}_{\bar{{\bf{S}}}}\leftarrow {\overrightarrow{x}}_{\bar{{\bf{S}}}}\frac{\sqrt{\parallel {\overrightarrow{x}}_{{\bf{S}}}{\parallel }^{2}/| {\bf{S}}| }}{\sqrt{\parallel {\overrightarrow{x}}_{\bar{{\bf{S}}}}{\parallel }^{2}/| \bar{{\bf{S}}}| }}$$
(9)
This rescaling ensures that the variance of the interpolated values matches that of the known values, preserving the overall scale of the solution.
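A sketch of the regularized solve and rescaling, following the text literally (solver details and tolerances may differ in the released code; the system as written is not symmetric, so the conjugate gradient call here is only illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def interpolate_subsolution(N_bb, N_bs, x_s):
    """Solve (M_bb + lambda I) x_b = M_bs x_s with M_bb = I - N_bb,
    M_bs = N_bs and lambda = sigma^2 |S_bar|, then rescale per Eq. (9)."""
    m_bar = N_bb.shape[0]
    sigma2 = np.linalg.norm(x_s) ** 2 / len(x_s)   # sigma^2 from known values
    lam = sigma2 * m_bar
    A = (1.0 + lam) * sp.eye(m_bar) - N_bb         # M_bb + lambda I
    x_b, _ = cg(A, N_bs @ x_s)
    x_b *= np.sqrt(sigma2 / (np.linalg.norm(x_b) ** 2 / m_bar))  # Eq. (9)
    return x_b
```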
After solving the regularized conjugate gradient, we apply a final empirical quantile transformation to \({\overrightarrow{x}}_{\bar{{\bf{S}}}}\). This transformation is designed to ensure that the distribution of interpolated values closely matches the distribution of known values, while preserving the relative ordering obtained from the interpolation. The process begins by computing the empirical cumulative distribution function (ECDF) for both \({\overrightarrow{x}}_{{\bf{S}}}\) and \({\overrightarrow{x}}_{\bar{{\bf{S}}}}\). These ECDFs represent the proportion of values in each vector that are less than or equal to a given value. For each value in \({\overrightarrow{x}}_{\bar{{\bf{S}}}}\), we then determine its quantile (that is, its position in the sorted list of values) according to the ECDF of \({\overrightarrow{x}}_{\bar{{\bf{S}}}}\). This quantile is then used to find the corresponding value in the ECDF of \({\overrightarrow{x}}_{{\bf{S}}}\). By mapping the quantiles in this way, we effectively transform the distribution of \({\overrightarrow{x}}_{\bar{{\bf{S}}}}\) to match that of \({\overrightarrow{x}}_{{\bf{S}}}\), while maintaining the order relationships among the interpolated values.
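A sketch of this quantile mapping (one coordinate vector at a time; function name ours):

```python
import numpy as np

def quantile_match(x_unknown, x_known):
    """Map interpolated values onto the empirical distribution of the known
    values by matching quantiles, preserving rank order."""
    ranks = np.argsort(np.argsort(x_unknown))   # rank of each interpolated value
    q = (ranks + 0.5) / len(x_unknown)          # empirical quantiles in (0, 1)
    return np.quantile(x_known, q)              # look up in the known ECDF
```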
Calculating a rescaled E-dimensional subspace from interpolated solutions
After each of the S d-dimensional interpolated solutions is generated, several random rotations are applied to each to generate an ‘inflated’ interpolated solution space of size \({E}_{{\rm{final}}}\). Each vector in this inflated solution space is rank-transformed to stabilize variance. The resulting m × (S ⋅ d) concatenated vector matrix is then used to construct a new m × E-dimensional rescaled subspace to be fed into GSE.
This is achieved by first using SVD to orthogonalize the m × S ⋅ d concatenated vector matrix into a new matrix \({{\bf{U}}}_{{\rm{interp}}}{{\bf{S}}}_{{\rm{interp}}}{{\bf{V}}}_{{\rm{interp}}}^{T}\). We then simply rewrite Eq. (2):
$${\overrightarrow{z}}_{i}^{{\rm{rescaled}}}={{\bf{U}}}_{{\rm{interp}}}{{\bf{S}}}_{{\rm{interp}}}{\overrightarrow{v}}_{i}/\sqrt{1-{\lambda }_{i}}$$
(10)
Using hierarchical GSE for noise filtering
As indicated for the simulated datasets in Supplementary Fig. 8e,j, the 2D DNA microscopy data in Fig. 4y and Supplementary Fig. 9d and all 3D DNA microscopy data in Fig. 5, a 0.2% UMI filter was applied to eliminate the top fringe of overdispersed UEIs. To do this, over two iterations, S = 6 subgraphs were generated (three alternations between uniform and nonuniform selections) to produce an estimated eigenvector basis, which was used to remove the 0.1% of UMIs with the greatest RMS dispersion of UEIs between neighboring UMIs. We incorporated this filter, although it affected the datasets minimally (Supplementary Fig. 9b,d), to reduce the plausible influence of contaminating PCR products from other experiments. Further experimentation will better clarify its utility in such experiments going forward.
Decoherence curve plots
Decoherence plots in Fig. 4v–x and Supplementary Figs. 5b and 6 quantify the quality of local registration at different length scales. To do this, n = 500 reference points were sampled from the ground-truth coordinates. For each sampled point, neighborhoods of size k (with k drawn from the values 100, 200, 400, 800, 2,000, 4,000, 5,000, 10,000, 15,000, 20,000, 25,000, 27,000, 30,000 and 33,000 for 50,000-point 3D simulations and the values 100, 200, 400, 800, 1,600, 3,200 and 6,400 for 10,000-point 2D simulations) were identified through kNN analysis of the ground-truth coordinates. The corresponding points were then extracted in the embedding. Each embedded neighborhood was then rigidly aligned to its ground-truth counterpart through Procrustes alignment and the r.m.s.d. was computed. The average distance of the points from the reference point in the ground truth defined the ‘spread’. This procedure yielded (spread, r.m.s.d.) pairs, which were binned according to the spread value using 20 equally sized bins between the minimum and maximum values. Shaded bands indicate ±1 s.d. about the mean r.m.s.d. within each bin.
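The per-neighborhood alignment step can be sketched as a rigid (Kabsch) alignment followed by the r.m.s.d. computation (binning omitted; a minimal sketch matching the ‘rigidly aligned’ Procrustes step described above):

```python
import numpy as np

def rigid_rmsd(emb, truth):
    """r.m.s.d. between an embedded neighborhood and its ground-truth
    counterpart after optimal rigid alignment; emb, truth are (k, d)."""
    P = emb - emb.mean(axis=0)
    Q = truth - truth.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    D = np.eye(P.shape[1])
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
    R = Vt.T @ D @ U.T                           # optimal rotation
    return np.sqrt(((P @ R.T - Q) ** 2).sum(axis=1).mean())
```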
Statistics and reproducibility
The paired figure panels in Fig. 2a–f are individually representative images of single zebrafish embryos from each chemistry, shown at different locations within the embryo. Figure 4y, drawn from an earlier dataset6, corresponds to a single randomly deposited cell population that underwent both fluorescence imaging and unanchored diffusion DNA microscopy. Figures 5a–d,m–t and 6k–o show the same four embryos, as indicated in the figure captions, corresponding to the biological replicates in the current experiment.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.