Mar 7, 2025
The ancestral recombination graph (ARG) describes the evolutionary relationship between
genetic materials in the presence of recombination and drift
The full probabilistic process is complicated
In this work, we condition on the realized ARG, resulting in a sequence of local trees
What is the conditional distribution of a trait given the trees?
Since the genealogy is fixed, the only randomness that remains is mutation
\[ \text{Trait} \mid \text{Local trees} \quad \sim \quad ? \]
Linear mixed models are popular in quantitative genetics \[ \mathbf{y} = \underbrace{\mathbf{Z}\mathbf{u}}_{\text{random effects}} + \underbrace{\mathbf{X}\mathbf{b}}_{\text{fixed effects}} + \boldsymbol{\varepsilon} \] where \(\mathbf{Z}\) includes genotyped variants and \(\mathbf{X}\) is the covariate matrix
In particular, the SNP effects \(\mathbf{u} \sim p(\cdot)\) are random
Some questions …
We answer these questions from a genealogical perspective
The trait \(\mathbf{y}\) is a linear function of the genotype \(\mathbf{G}\) \[ \mathbf{y} = \mathbf{G}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \] \(\mathbf{y} \in \mathbb{R}^N\), \(\mathbf{G} \in \mathbb{R}^{N \times P}\), \(\boldsymbol{\beta} \in \mathbb{R}^P\), and \(\boldsymbol{\varepsilon} \in \mathbb{R}^N\)
\(\mathbf{G}\) contains all positions in the genome, including genotyped ones
\(N\): number of samples, \(P\): length of the genome
\[
\mathbf{y}_1=\boldsymbol{\beta}_1+\boldsymbol{\beta}_2, \;
\mathbf{y}_2 = \boldsymbol{\beta}_2, \;
\mathbf{y}_3 = \boldsymbol{\beta}_2
\]
\[
\mathbf{y}_1=\boldsymbol{\beta}_4, \;
\mathbf{y}_2=\boldsymbol{\beta}_4, \;
\mathbf{y}_3 = \boldsymbol{\beta}_3
\]
Consider a local tree that spans a region of the genome
We get trait values by adding up effect sizes (\(\boldsymbol{\beta}\))
A sample inherits a branch first, then the mutations on it
Branch’s effect \(=\) sum of its mutations’ effects
Branch effect is a random variable!
\[ \text{Trait} = \sum_p \text{Variant$_p$ effect size} \quad \Rightarrow \quad \text{Trait} = \sum_e \text{Branch$_e$ effect size} \]
\[ \boldsymbol{\upsilon}_e = \text{Branch$_e$ effect size} = \sum_p \text{effect size of Variant$_p$ on Branch$_e$} \]
\[ \mathbf{y} = \sum_p \mathbf{G}_p\boldsymbol{\beta}_p + \boldsymbol{\varepsilon} \quad \Rightarrow \quad \mathbf{y} = \sum_e \mathbf{Z}_e\boldsymbol{\upsilon}_e + \boldsymbol{\varepsilon} \] where \(\mathbf{Z}_{ne}=\) the number of haplotypes of \(n\) that inherit \(e\)
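To make the branch-level model concrete, here is a minimal NumPy sketch; the tree, branch labels, and all numbers are illustrative stand-ins (not taken from the slides' figures), and haploid samples are used so that \(\mathbf{Z}\) is 0/1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy local tree: leaves 0, 1, 2; internal node 3 = parent of {0, 1}; root 4 = parent of {3, 2}.
# Branches (edges), in order: 0->3, 1->3, 3->4, 2->4.
# Z[n, e] = number of haplotypes of sample n that inherit branch e (0/1 for haploids).
Z = np.array([
    [1, 0, 1, 0],   # sample 0 sits below branches 0->3 and 3->4
    [0, 1, 1, 0],   # sample 1 sits below branches 1->3 and 3->4
    [0, 0, 0, 1],   # sample 2 sits below branch 2->4 only
])

# Each branch effect is the sum of the effect sizes of the mutations it carries;
# here it is drawn directly as a random effect with variance proportional to
# the branch's time-length l_e times its base-pair span s_e (its "area").
l = np.array([1.0, 1.0, 2.0, 3.0])   # time-lengths (illustrative)
s = 1e4                              # span of the local tree in base pairs (illustrative)
sigma2_per_area = 1e-4               # illustrative per-unit-area effect variance
upsilon = rng.normal(0.0, np.sqrt(sigma2_per_area * l * s))

eps = rng.normal(0.0, 0.1, size=Z.shape[0])  # environmental noise
y = Z @ upsilon + eps
print(y)
```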
Split \(\boldsymbol{\upsilon}\) into \(\mathbf{u} = \boldsymbol{\upsilon} - \mathrm{E}\boldsymbol{\upsilon}\) and \(\mathbf{f}= \mathrm{E}\boldsymbol{\upsilon}\)
\[ \mathbf{y} = \underbrace{\mathbf{Z} \mathbf{u}}_{\text{Random effects}} + \underbrace{\mathbf{Z} \mathbf{f}}_{\text{Fixed effects}} + \boldsymbol{\varepsilon} \]
This is the ancestral recombination graph linear mixed model (ARG-LMM) and \(\mathbf{Z} \mathrm{Cov}(\mathbf{u}) \mathbf{Z}^T\) is the expected genetic relatedness matrix (eGRM) (Fan, Mancuso, and Chiang 2022; Zhang et al. 2023)
\[
\small l_e:\text{ length in time} \quad s_e:\text{ span in base pairs}
\]
\[ \small \mathrm{Var}(\mathbf{u}_e) \quad \propto \quad \text{Number of mutations} \quad \propto \quad \text{Area}=l_es_e \]
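As a rough illustration of how branch "areas" \(l_e s_e\) enter the relatedness matrix, the sketch below (assuming msprime and tskit are available) works tree by tree at the haplotype level and accumulates \(\sum_e l_e s_e \mathbf{Z}_e\mathbf{Z}_e^T\). It deliberately skips the centering and normalization used in the published eGRM definitions (Fan, Mancuso, and Chiang 2022; Zhang et al. 2023), so treat it as a sketch of the idea rather than the estimator itself.

```python
import msprime
import numpy as np

# Simulate a small ARG (tree sequence); all parameters are illustrative.
ts = msprime.sim_ancestry(
    samples=20, sequence_length=1e5, recombination_rate=1e-8,
    population_size=1e4, random_seed=1,
)

n = ts.num_samples
K = np.zeros((n, n))                  # haplotype-level relatedness, ∝ Z Cov(u) Z^T
sample_index = {node: i for i, node in enumerate(ts.samples())}

for tree in ts.trees():
    span = tree.interval.right - tree.interval.left    # base-pair span of this local tree
    for node in tree.nodes():
        length = tree.branch_length(node)              # time-length of the branch above node
        if length == 0:                                # skip roots
            continue
        # Indicator of haplotypes below this branch in this tree.
        z = np.zeros(n)
        for s_node in tree.samples(node):
            z[sample_index[s_node]] = 1.0
        K += length * span * np.outer(z, z)            # accumulate l_e * s_e * Z_e Z_e^T

print(K.shape, K[:3, :3])
```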
\[ \text{What does ARG-LMM tell us about complex trait analysis?} \]
\(\mathbf{y}_1 = \mathbf{u}_{13} + \mathbf{u}_{34} \quad \text{and} \quad \mathbf{y}_2 = \mathbf{u}_{23} + \mathbf{u}_{34}\)
\[ \mathrm{Cov}(\mathbf{y}_1, \mathbf{y}_2) = \mathrm{Cov}(\mathbf{u}_{13} + \textcolor{red}{\mathbf{u}_{34}}, \mathbf{u}_{23} + \textcolor{red}{\mathbf{u}_{34}}) = \mathrm{Cov}(\textcolor{red}{\mathbf{u}_{34}}, \textcolor{red}{\mathbf{u}_{34}}) = \mathrm{Var}(\textcolor{red}{\mathbf{u}_{34}}) \propto t_3 - t_2 \]
\[ \small \text{Heritability: }h_g^2 = \frac{\mathrm{Var}(\mathbf{g}_n)}{\mathrm{Var}(\mathbf{y}_n)} =\frac{\mathrm{Var}(\mathbf{g}_n)}{\mathrm{Var}(\mathbf{g}_n)+\mathrm{Var}(\boldsymbol{\varepsilon}_n)} \] This applies to all individuals \(\small n \in \{1,\ldots,N\}\)
However, all individuals have a different amount of genetic variance (except haploids) \[ \small \mathrm{Var}(\mathbf{g}_n) = \mathrm{Var}(\mathbf{h}_{n1}+\mathbf{h}_{n2}) = \mathrm{Var}(\mathbf{h}_{n1}) + \mathrm{Var}(\mathbf{h}_{n2}) + 2\textcolor{red}{\underbrace{\mathrm{Cov}(\mathbf{h}_{n1},\mathbf{h}_{n2})}_{\text{Self-relatedness}}} \]
We can’t define a single quantity \(h^2_g=\frac{\textcolor{red}{\mathrm{Var}(\mathbf{g}_n)}}{\textcolor{red}{\mathrm{Var}(\mathbf{g}_n)} + \mathrm{Var}(\boldsymbol{\varepsilon}_n)}\) for everyone
Some people are less genetically variable than others
Some people are harder to predict genetically than others
Some populations are inherently harder to predict!
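A small numerical illustration of why \(\mathrm{Var}(\mathbf{g}_n)\) differs between individuals: given any haplotype-level relatedness matrix \(K\) (here a made-up positive semi-definite matrix, purely for illustration), a diploid with haplotypes \(h_{n1}, h_{n2}\) has genetic variance \(K_{h_{n1}h_{n1}} + K_{h_{n2}h_{n2}} + 2K_{h_{n1}h_{n2}}\), and the self-relatedness cross term varies from person to person.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up haplotype-level relatedness matrix for 4 diploid individuals (8 haplotypes).
A = rng.normal(size=(8, 8))
K = A @ A.T / 8.0                      # symmetric positive semi-definite, illustrative only

# Haplotypes 2n and 2n+1 belong to individual n.
for ind in range(4):
    h1, h2 = 2 * ind, 2 * ind + 1
    var_g = K[h1, h1] + K[h2, h2] + 2 * K[h1, h2]   # Var(h1) + Var(h2) + 2 Cov(h1, h2)
    print(f"individual {ind}: Var(g) = {var_g:.3f}, self-relatedness = {K[h1, h2]:.3f}")
```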
Demographic model from (Browning et al. 2018)
\(\textsf{tslmm}\) uses an efficient genetic relatedness matrix–vector product to fit the restricted maximum likelihood (REML) objective
It can estimate variance components and compute polygenic scores by best linear unbiased prediction (BLUP)
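The \(\textsf{tslmm}\) preprint is not yet out, so the following is not its API; it is a generic sketch of the underlying idea that prediction (and, with more work, variance-component estimation) only needs matrix-vector products \(v \mapsto \mathbf{Z}\,\mathrm{Cov}(\mathbf{u})\,\mathbf{Z}^T v\), never the relatedness matrix itself. Here \(\mathbf{Z}\) and the per-branch variances are random stand-ins, and the variance components are taken as known rather than estimated by REML.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(2)
n, e = 500, 2000                                     # samples, branches (illustrative sizes)

Z = rng.integers(0, 2, size=(n, e)).astype(float)    # stand-in branch design matrix
d = rng.gamma(2.0, 1.0 / e, size=e)                  # stand-in per-branch variances (l_e * s_e scale)
sigma2_e = 1.0                                       # residual variance (assumed known here)

u = rng.normal(0.0, np.sqrt(d))                      # branch effects
y = Z @ u + rng.normal(0.0, np.sqrt(sigma2_e), size=n)

# V v = Z diag(d) Z^T v + sigma2_e v, applied without materializing the n x n matrix.
def V_matvec(v):
    return Z @ (d * (Z.T @ v)) + sigma2_e * v

V = LinearOperator((n, n), matvec=V_matvec)

# BLUP: u_hat = diag(d) Z^T V^{-1} y, with V^{-1} y computed by conjugate gradients.
alpha, info = cg(V, y)
u_hat = d * (Z.T @ alpha)

print("CG converged:", info == 0,
      "corr(u, u_hat) =", round(np.corrcoef(u, u_hat)[0, 1], 3))
```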
The algorithm needs to pass mutations to the correct samples
A naive approach pushes the mutations down to the leaves in every tree
Instead, wait until the subtree’s topology changes due to edge insertion/deletion
Waiting any longer would send the mutations to the wrong recipients
Fitting REML \(\mathcal{O}(n_s^3) \; \Rightarrow \; \mathcal{O}(n_s+n_t\log n_s)\)
\(n_s\): number of samples, \(n_t\): number of trees
Modified figures by Nathaniel Pope (Oregon)
The runtime scales linearly with respect to the number of individuals (genome length = \(10^8\))
We measured the accuracy of polygenic scores computed from \(\textsf{tslmm}\)
Training and testing on two non-overlapping groups embedded in the same tree sequence
True trees are better, but inferred trees are not far behind!
ARG-LMM establishes an explicit connection between population and quantitative genetics
Pseudoreplication due to shared ancestry (Rosenberg and VanLiere 2009)
Missing heritability, Mutations vs Mendelian segregation
A powerful trait simulator based on ARG-LMM (Cranmer, Brehmer, and Louppe 2020)
Super interesting technical details and proofs (10+ backup slides prepared)
Predicting polygenic scores of internal nodes (Edge and Coop 2018; Peng, Mulder, and Edge 2024)
Time conditioned analysis (random vs fixed effects) (Fan, Mancuso, and Chiang 2022)
Link to (Lehmann et al. 2025), \(\textsf{tslmm}\) preprint coming soon
Collaborators: Nathaniel Pope (Oregon), Jerome Kelleher (Oxford), Gregor Gorjanc (Edinburgh), and Peter Ralph (Oregon)
An individual carries variant \(p\) when it is a descendant of an edge (\(\mathbf{Z}_{ne}=1\)) and that edge carries a mutation at \(p\) (\(\mathbf{1}_{ep}=1\))
\[ \mathbf{G}_{np} = \sum_{e: p \in e} \mathbf{Z}_{ne} \mathbf{1}_{ep} \quad \Leftrightarrow \quad \mathbf{G}_p = \sum_{e:p \in e} \mathbf{Z}_{e} \mathbf{1}_{ep} \]
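To convince yourself of this identity numerically, the sketch below (assuming msprime and tskit, and restricting to sites carrying a single mutation, i.e., effectively infinite sites) rebuilds each genotype column from "which haplotypes descend from the mutation's branch" and compares it with tskit's genotype matrix.

```python
import msprime
import numpy as np

# Simulate a tree sequence with mutations; all parameters are illustrative.
ts = msprime.sim_ancestry(samples=10, sequence_length=1e5,
                          recombination_rate=1e-8, population_size=1e4,
                          random_seed=3)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4)

sample_index = {node: i for i, node in enumerate(ts.samples())}
G = ts.genotype_matrix()              # shape (num_sites, num_samples), 0/1 at biallelic sites

for site in ts.sites():
    if len(site.mutations) != 1:      # keep the sketch to the single-mutation case
        continue
    mut = site.mutations[0]
    tree = ts.at(site.position)       # the local tree spanning this position
    # Z_e 1_{ep}: haplotypes descending from the branch above mut.node carry the variant.
    col = np.zeros(ts.num_samples, dtype=int)
    for s_node in tree.samples(mut.node):
        col[sample_index[s_node]] = 1
    assert np.array_equal(col, G[site.id]), f"mismatch at site {site.id}"

print("Genotype columns match the branch-descendant reconstruction.")
```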
Recall \(\mathbf{G}_p = \sum_{e:p \in e} \mathbf{Z}_{e} \mathbf{1}_{ep}\) and \(\mathbf{y} = \sum_{p=1}^P \mathbf{G}_p \boldsymbol{\beta}_p + \boldsymbol{\varepsilon}\)
Substitute \(\mathbf{G}_p\) \[ \textcolor{red}{\sum_{p=1}^P \sum_{e:p \in e}} \mathbf{Z}_e \boldsymbol{\beta}_p \mathbf{1}_{ep} + \boldsymbol{\varepsilon} \]
Exchange the inner and the outer summation \[ \textcolor{red}{\sum_{e=1}^E \sum_{p:p \in e}} \mathbf{Z}_e \boldsymbol{\beta}_p \mathbf{1}_{ep} + \boldsymbol{\varepsilon} \]
Pull out \(\mathbf{Z}_e\) and group the positions nested in \(p: p \in e\) \[ \begin{aligned} & \sum_{e=1}^E \mathbf{Z}_e \textcolor{blue}{\left(\sum_{p:p\in e} \boldsymbol{\beta}_p \mathbf{1}_{ep} \right)} + \boldsymbol{\varepsilon} \\ &= \sum_{e=1}^E \mathbf{Z}_e \textcolor{blue}{\boldsymbol{\upsilon}_e} + \boldsymbol{\varepsilon} \\ &= \mathbf{Z} \boldsymbol{\upsilon} + \boldsymbol{\varepsilon} \end{aligned} \]
\(\boldsymbol{\upsilon}\) is a random variable made up of mutation-driven random variables \(\mathbf{1}_{ep}\)!
Independence of the random-effect entries is a key assumption of linear mixed models
This can be proved in ARG-LMM, instead of assuming it \[ \mathrm{Cov}( \mathbf{u}_e, \mathbf{u}_{e'} ) = \sum_{p \in e,e'} \boldsymbol{\beta}_p^2 \mathrm{Cov}(\mathbf{1}_{ep}, \mathbf{1}_{e'p}) \]
The covariance between the indicators involves only higher-order terms in the mutation rate, so we ignore it (Wakeley 2008): under infinite sites a position \(p\) mutates on at most one edge, so \(\mathrm{E}[\mathbf{1}_{ep}\mathbf{1}_{e'p}] = 0\) \[ \begin{aligned} \mathrm{Cov}(\mathbf{1}_{ep}, \mathbf{1}_{e'p}) &= \mathrm{E}[\mathbf{1}_{ep}\mathbf{1}_{e'p}] - \mathrm{E}[\mathbf{1}_{e'p}]\mathrm{E}[\mathbf{1}_{ep}] \\ &= 0 -l_eu_{ep}l_{e'}u_{e'p} \approx 0 \end{aligned} \] where \(l_e\) is the (time-)length of edge \(e\).
The Gaussian prior on random effects is a popular choice
One might be tempted to apply the central limit theorem to \(\mathbf{u}_e\) (a sum of indicators) \[ \mathbf{u}_e \bigg/ \sqrt{l_e s_e \cdot \frac{1}{s_e} \sum_{p:p \in e} \boldsymbol{\beta}_p^2 u_{ep} } \rightarrow N(0,1^2) \text{ as } s_e \rightarrow \infty \] where \(s_e\) is the span (in base pairs) of edge \(e\)
The convergence is unlikely to be fast enough given the small value of \(\mathrm{E}[\mathbf{1}_{ep}]\) (Berry–Esseen).
Fortunately, the variance is computable and is \[ \mathrm{Var}(\mathbf{u}_e) = l_e s_e \cdot \frac{1}{s_e} \sum_{p:p \in e} \boldsymbol{\beta}_p^2 u_{ep} \]
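A quick Monte Carlo check of this variance formula (all numbers illustrative): approximate each \(\mathbf{1}_{ep}\) as an independent Bernoulli with success probability \(l_e u_{ep}\), form the centered branch effect \(\mathbf{u}_e = \sum_p \boldsymbol{\beta}_p(\mathbf{1}_{ep} - l_e u_{ep})\), and compare its empirical variance with \(l_e s_e \cdot \frac{1}{s_e}\sum_p \boldsymbol{\beta}_p^2 u_{ep}\); they agree up to the \(O((l_e u_{ep})^2)\) terms dropped above.

```python
import numpy as np

rng = np.random.default_rng(5)

l_e = 500.0                 # branch length in generations (illustrative)
s_e = 1_000                 # branch span in base pairs (illustrative)
u_rate = 2e-5               # per-base, per-generation mutation rate (illustrative)
beta = rng.normal(0.0, 0.05, size=s_e)        # effect size at each position under the branch

p_mut = l_e * u_rate                          # P(mutation at a position on this branch)
reps = 10_000
ones = rng.random((reps, s_e)) < p_mut        # Bernoulli approximations of 1_{ep}
u_e = ones @ beta - p_mut * beta.sum()        # centered branch effects across replicates

empirical = u_e.var()
predicted = l_e * s_e * np.mean(beta**2 * u_rate)   # l_e * s_e * (1/s_e) * sum(beta_p^2 * u_ep)
print(f"empirical Var(u_e) = {empirical:.4e}, predicted = {predicted:.4e}")
```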
\[ \left[ \mathbf{Z} \mathbf{f} \right]_n = \sum_{e=1}^E \mathbf{Z}_{ne} \mathrm{E} \left[ \sum_{p: p \in e} \boldsymbol{\beta}_p\mathbf{1}_{ep} \right] \]
With \(\mathrm{E}[\mathbf{1}_{ep}] = l_e u_p\), this becomes \[ \sum_{p=1}^P \boldsymbol{\beta}_p u_p \left( \sum_{e: p \in e} \mathbf{Z}_{ne} l_e \right) = \sum_{p=1}^P \boldsymbol{\beta}_p u_p \cdot 2t_{\mathrm{root},p} = \text{constant with respect to } n \]
An intercept is enough to account for the fixed effects \(\mathbf{Z}\mathbf{f}\) under this condition
The assumption is standard in neutral settings
\[ \text{\textcolor{red}{Conjecture:} selection $\Rightarrow$ fixed effects?} \]
This is also the very reason why BLUP works
We are all correlated!
ARG-LMM variance component only reflects mutational variability
Figure from Whalebone Magazine
Figure panels: simulated trees and inferred (tsinfer+tsdate) trees