Chapter 2 Random drift in the idealized population

2.1 Introduction

This exercise makes use of equation 3.1 to produce a series of figures depicting random drift. The code is written such that key parameters can be changed and the results observed. Only the tidyverse library is used for the exercise. flextable was used for generating the tables in Rmarkdown.

library(flextable)
library(tidyverse)

2.2 Flexible parameters

If we look closely at Figure 3.1 and Equation 3.1 we can devise a set of parameters whose effects will be interesting to explore. Both axis of the figure, generation and q frequency, are obvious choices. Population size and the number of lines might also prove interesting. These parameters with general starting values are defined as variables below.

generations <- 20
q_0 <- 0.5
pop_size <- 100
lines <- 10

The challenge is a little more complex than simply substituting values into equation 3.1. One must remember that we can only predict the magnitude of the change in gene frequency and not the direction. As such, we need to step the lines forward through each generation using the results of the previous generation. A for loop will help us do this.

2.3 Making a For Loop

Before starting the for loop we need to make a data frame to store the data. For each line, a data frame of 7x columns and as many rows as generations defined in the parameters (NB: initially 20). The gen column provides the generation number. q_0 and p_0 provide the gene frequencies. N equates to the population size of each line. The variance, var, standard deviation, sd, and the change in gene frequency, diff, columns are calculated fields that will be filled in by the for loop. The first few columns of the data frame are shown for reference. Zero values have yet to be determined.

gen	q_0	p_0	N
0	0.5	0.5	100
1	0.5	0.5	100
2	0.5	0.5	100
3	0.5	0.5	100
4	0.5	0.5	100
5	0.5	0.5	100

Data frame setup and the for loop are nested inside a function names random_drift. The function accepts a vector of population sizes, N, as input, and produces the complete data frame described earlier. For reference, the code is provided below.

The data frame, df, is first written. The for loop is then specified to run from 1 to the number of generations + 1. The extra generation is added to accommodate the base, or 0, generation.

The for loop begins with a few if functions. The first if function updates the frequency of q_0 for every generation other than the 0 generation. If generation is greater than 0 then q_0 is adjusted based on the change in gene frequency determined in the preceding generation. The second and third if functions handle the loss or fixation of the q allele should this occur through random drift. Once the calculated frequency of q either goes below 0 or above 1, the frequency will remain at either 0 or 1 in the line.

After the if functions, p_0 is calculated as (1 - q_0) in line 4. This is important since q_0 is re-written for every generation above 0. The variance of the change in gene frequency using equation 3.2 is then calculated in line 5. Next, the standard deviation is derived from the variance in line 6. Finally, the predicted change in gene frequency is modeled in line 7. Since dispersive processes have no direction, and the calculated variance is not a measure of magnitude for a single line, the prediction takes advantage of the base R function, rnorm. The function pulls a defined number of samples from a normal distribution with a specified mean and standard deviation. For our purposes, a single sample is drawn from a normal distribution with a mean of 0 and a standard deviation equating to the variance of gene frequencies in the current generation. Because a mean of 0 is used, the predicted change in frequency can be negative or positive. The prediction is written to the diff column where it can be used to adjust the frequency of q in the following generation.

The return(df) function specifies the desired output from the random_drift function.

random_drift <- function(pop_size){

df <- data.frame(gen = 0:generations,
                 q_0 = q_0,
                 p_0 = 0,
                 N = pop_size,
                 var = 0,
                 sd = 0,
                 diff = 0)

  for(i in 1:(generations+1)){
    
    if(df$gen[i] > 0){df$q_0[i] <- df$q_0[i-1] + df$diff[i-1]} # line 1
    if(df$q_0[i] <= 0) {df$q_0[i] <- 0} # line 2
    if(df$q_0[i] >= 1) {df$q_0[i] <- 1} # line 3
    
    df$p_0[i] <- 1 - df$q_0[i] # line 4
    df$var[i] <- df$q_0[i] * df$p_0[i] / (2 * df$N[i]) # line 5
    df$sd[i] <- (df$var[i])^0.5 # line 6
    df$diff[i] <- rnorm(1, mean = 0, sd = df$sd[i]) # line 7
    
  }
 return(df)
}

The following code runs the function. An lapply function is used to run random_drift for the population size and number of lines specified. The rbind within a do.call function simply takes all the data frame outputs and binds them into a single data frame. That last bit of code simply adds a line identification number for graphing purposes.

df <- do.call(rbind,
              lapply(rep(pop_size, lines),
                     FUN = random_drift)) %>%
  mutate(line = rep(1:lines, each = generations + 1))

Data pertaining to the first line is shown below. One can see how the for loop builds out the data from generation 0 onward. Starting at the specified gene frequency, the variance is calculated and the change in gene frequency is predicted. The prediction is then applied to the next generation and the calculations repeated.

gen	q_0	p_0	N	var	sd	diff	line
0	0.5000000	0.5000000	100	0.001250000	0.03535534	0.0339284308	1
1	0.5339284	0.4660716	100	0.001244244	0.03527385	-0.0142333932	1
2	0.5196950	0.4803050	100	0.001248061	0.03532790	0.0221009687	1
3	0.5417960	0.4582040	100	0.001241265	0.03523160	-0.0011202113	1
4	0.5406758	0.4593242	100	0.001241727	0.03523815	0.0363186676	1
5	0.5769945	0.4230055	100	0.001220359	0.03493364	-0.0216567100	1
6	0.5553378	0.4446622	100	0.001234689	0.03513814	-0.0772769044	1
7	0.4780608	0.5219392	100	0.001247593	0.03532129	-0.0026910197	1
8	0.4753698	0.5246302	100	0.001246967	0.03531242	-0.0633035347	1
9	0.4120663	0.5879337	100	0.001211338	0.03480429	0.0151344487	1
10	0.4272007	0.5727993	100	0.001223501	0.03497858	-0.0257358670	1
11	0.4014649	0.5985351	100	0.001201454	0.03466200	-0.0064263586	1
12	0.3950385	0.6049615	100	0.001194915	0.03456755	-0.0348319107	1
13	0.3602066	0.6397934	100	0.001152289	0.03394538	0.0353046530	1
14	0.3955113	0.6044887	100	0.001195411	0.03457471	0.0173856739	1
15	0.4128969	0.5871031	100	0.001212065	0.03481473	0.0240856612	1
16	0.4369826	0.5630174	100	0.001230144	0.03507341	-0.0470940279	1
17	0.3898886	0.6101114	100	0.001189377	0.03448735	0.0379642055	1
18	0.4278528	0.5721472	100	0.001223974	0.03498534	0.0562896673	1
19	0.4841424	0.5158576	100	0.001248743	0.03533755	0.0122932726	1
20	0.4964357	0.5035643	100	0.001249936	0.03535444	0.0007437296	1

2.4 Plotting the results

The following code produces a plot similar to Figure 3.1. The base population q frequency is shown as a dashed line. Each line is given a different color. The biggest difference between the plot and Figure 3.1 is the lack of selection in the simulation.

df %>%
  mutate(line = factor(line)) %>%
  ggplot() +
  geom_line(aes(x = gen, y = q_0, group = line, color = line),
            linewidth = 0.75) +
  scale_y_continuous(limits = c(0, 1),
                     breaks = seq(0, 1, by = 0.1)) +
  scale_x_continuous(limits = c(0, generations),
                     breaks = seq(0, generations, by = 1)) +
  geom_hline(yintercept = 0.5, linewidth = 0.75,
             color = 'black', linetype = 'dashed') +
  xlab('generation') +
  ylab('frequency of q allele') +
  theme_bw() +
  theme(legend.position = "none")

2.5 Conclusion

With an understanding of how the simulation works, alter the starting patterns and observe the change in results. If you were really ambitious, the code is a good starting point for an interactive Rshiny app. We’ll be revisiting all these topics in fuutre exercises.

gen	q_0	p_0	N
0	0.5	0.5	100
1	0.5	0.5	100
2	0.5	0.5	100
3	0.5	0.5	100
4	0.5	0.5	100
5	0.5	0.5	100

gen	q_0	p_0	N
0	0.5	0.5	100
1	0.5	0.5	100
2	0.5	0.5	100
3	0.5	0.5	100
4	0.5	0.5	100
5	0.5	0.5	100

gen	q_0	p_0	N
0	0.5	0.5	100
1	0.5	0.5	100
2	0.5	0.5	100
3	0.5	0.5	100
4	0.5	0.5	100
5	0.5	0.5	100