## Highlights

- Most high-throughput time series studies do not contain technical repeats
- Given a limited budget, should we profile more repeat experiments or more time points?
- We develop a theoretical framework to address this question
- Under reasonable assumptions, more time points are the correct choice

## Summary

An important experimental design question for high-throughput time series studies is the number of replicates required for accurate reconstruction of the profiles. Due to budget and sample availability constraints, more replicates imply fewer time points and vice versa. We analyze the performance of dense and replicate sampling by developing a theoretical framework that focuses on a restricted yet expressive set of possible curves over a wide range of noise levels and by analyzing real expression data. For both the theoretical analysis and experimental data, we observe that, under reasonable noise levels, autocorrelations in the time series data allow dense sampling to better determine the correct levels of non-sampled points when compared to replicate sampling. A Java implementation of our framework can be used to determine the best replicate strategy given the expected noise. These results provide theoretical support to the large number of high-throughput time series experiments that do not use replicates.


## Introduction

High-throughput time series experiments have been used to study several biological systems and processes and to measure readouts, such as mRNA levels, using RNA sequencing (RNA-seq) (Trapnell et al., 2012) and protein-DNA interactions using chromatin immunoprecipitation sequencing (ChIP-seq) (Chang et al., 2013). Such studies are often designed with a defined start and end point and a selected number of time points to be sampled in between. The more points that can be profiled between the start and end points, the more likely it is that the reconstructed trajectory for the data type being studied is accurate. However, in practice, the number of time points used in a study is usually very small (Zinman et al., 2013). The main limiting factor for most experiments is budget. Although technology has greatly improved over the last two decades, high-throughput sequencing studies still cost hundreds of dollars for a single experiment. This is a major issue for time series studies, especially those that need to profile multiple types of biological data (for example, mRNA, miRNAs, and methylation levels) at each selected point. Another issue that can limit the number of experiments performed (and therefore the total number of time points that can be used) is biological sample availability. Thus, when designing such experiments, researchers often need to balance the overall goal of reconstructing the most accurate temporal representation of the data types being studied against the need to limit the number of experiments due to scarcity of resources.

Given these constraints, researchers designing high-throughput time series studies must carefully consider the need for replicate experiments. On one hand, replicates are a hallmark of biological experiments (Cumming et al., 2007), providing valuable information about measurement noise and the reliability of the measured values. On the other hand, replicates further reduce the number of time points that can be profiled, leading to the possibility of missing key events between sampled points. When resource scarcity is an issue, even one replicate for each time point cuts the total number of points that can be profiled in half, and this can have a large impact on our ability to accurately reconstruct the trajectories of the biological data being profiled. Indeed, when examining the time series datasets deposited in GEO, we observe that, in many cases, replicates have not been used in these studies (Zinman et al., 2013).

We aim to analyze the trade-offs between dense sampling (profiling more time points using one experiment per point) and replicate sampling (profiling fewer points, with more than one experiment for each point) in high-throughput biological time series studies and to determine which strategy works best and under what circumstances.

The relative merit of replicates and their impact have been investigated previously for high-throughput biological experiments, but this work has been limited to static datasets (where no relationship is assumed between consecutive experiments that are not replicates) rather than time series data. For example, Mongan et al. (2008) analyzed the variation in a large number of replicate experiments of the same samples collected on different dates and determined that overall correlations between these experiments were high. Others have used replicate experiments for follow-up analysis, for example, to identify differentially expressed genes (Tu et al., 2002) and to improve the performance of clustering methods (Tjaden, 2006). However, whereas most methods for identifying differentially expressed genes in static experiments rely on replicates, most methods for the identification of differentially expressed genes in time series studies do not assume replicates and instead rely on the overall trajectory of the genes (Kim et al., 2013, Bar-Joseph et al., 2003b, Ma et al., 2009).

Outside the realm of high-throughput biological datasets, the issue of replicate experiments in time series studies has been the focus of several statistical papers. For example, for epidemiological studies, trade-offs were established between frequent measurements of a small number of patients and more infrequent measurements of a larger number of patients (Schmidt et al., 2010). Other examples include the analysis of sampling versus replicates in speech processing (Listgarten et al., 2004) and early theoretical work on reconstructing curves using parametric methods (Åström, 1969). However, the major difference between high-throughput biological datasets and most prior work that studied these trade-offs is that, in the biological experiments, all genes must be sampled at the same time in each experiment. In other words, rather than trying to infer a single curve or profile per experiment, we are actually inferring tens of thousands of curves simultaneously. Thus, methods for the analysis of such data should consider a much larger set of possible outcomes and examine the impact of the two possible strategies (dense or replicate sampling) in the context of such a large number of potential curves.

To compare dense versus replicate sampling in the context of the complexity of high-throughput biological data, we establish a framework for both theoretical analysis and analysis of experimental (i.e., real) gene expression data. Several methods have been suggested and used for functional representation of high-throughput time series data, such as gene expression profiles. These representations include splines (Bar-Joseph et al., 2003a), sinusoidal functions (Whitfield et al., 2002), Gaussian processes (Kalaitzis and Lawrence, 2011), and impulse models (Chechik et al., 2008), among others. Here, we focus on piecewise linear curves (lines connecting the values at the measured points), which are by far the most popular representation for such profiles. Whereas expression and other profiles are usually not piecewise linear, these curves can represent important types of biological responses (for example, gradual or single activation, cyclic behavior, increase and then return to baseline, etc.). Several popular analysis methods for time series expression data also assume such a piecewise linear model (Ernst and Bar-Joseph, 2006, Ernst et al., 2005). Thus, we believe that a theoretical analysis focused on piecewise linear functions provides a good balance between a realistic model for time series gene expression and our ability to rigorously compare different sampling strategies (which is easier to perform with these simple functions).

Overall, our results support the commonly used (though so far not justified) practice of reducing or eliminating replicate experiments in time series high-throughput studies. For both the theoretical analysis under reasonable noise levels and the biological data we analyzed, we find that profiles reconstructed using dense sampling are more accurate than those reconstructed based on replicate sampling (i.e., they can better predict the levels of genes at intermediate time points that were not experimentally profiled). This indicates that autocorrelation can indeed be a useful feature when trying to reduce the impact of noise on the reconstructed curves. Our results can be used to determine the best strategy for using replicate experiments given the noise expected in the data.

## Results

We perform both theoretical analysis and analysis of real data to test the impact of replicate versus dense sampling on the ability to accurately characterize high-throughput time series biological data. The main goal of our theoretical analysis (Experimental Procedures) is to develop a framework for computing the expected difference in the resulting error (defined as the difference between the true underlying curves and the estimated curves) between the two possible strategies we are considering. Unlike the analysis of real data, which is necessarily restricted to a few sample datasets, the theoretical analysis allows us to compute such errors for a very large, and generally representative, set of possible curves. Whereas we constrain our analysis to piecewise linear profiles, these often represent the outcomes that researchers care about, as mentioned above. Indeed, clustering methods based on such piecewise linear representations of time series data have been used in the past (Ernst and Bar-Joseph, 2006), indicating that they can represent an important subset of the possible trajectories.

Our comparisons focus on two possible strategies for sampling in time series data: dense sampling, which performs a single expression experiment at each time point, and replicate sampling, which performs two or more (depending on the setting) such experiments at each of its time points. Because we assume a fixed budget (so both strategies perform the same number of experiments), dense sampling can query more time points (using uniform sampling) but pays a price in accuracy at each point because no replicates are available.

To analyze the impact of replicates on the ability to accurately reconstruct the gene expression signal, we use a probabilistic model that computes the likelihood of reconstructing a specific profile under each of the strategies given a specific error level (Experimental Procedures). We use this model to compare and evaluate the performance of the two strategies by computing the expected reconstruction error: the difference between the true profile, which is based on sampled and non-sampled points, and the reconstructed profile, which is based only on the sampled points, for each of the two strategies. The lower the error, the better the sampled points represent the full profile, which is the goal of the experiment. We repeat this process for different noise levels and different numbers of overall experiments.
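As a concrete illustration of this evaluation scheme, the following is a minimal Monte Carlo sketch in Python (an illustrative reimplementation, not the Java program distributed with the paper; the function names and the least-squares form of the ML estimator are our own assumptions): it draws random step profiles, simulates Gaussian measurement noise, estimates the transition point from the sampled points under a fixed experiment budget, and reports the mean error in the recovered transition time.

```python
import numpy as np

def ml_step_time(times, obs):
    """ML estimate of a 0 -> 1 step time over the measured grid.
    Under Gaussian noise with fixed sigma, maximizing the likelihood is
    equivalent to minimizing squared error against each candidate step
    curve. obs has shape (len(times), n_replicates)."""
    best_t, best_err = times[0], np.inf
    for t in times:
        pred = (times >= t).astype(float)  # 0 before t, 1 from t onward
        err = ((obs - pred[:, None]) ** 2).sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def expected_step_error(budget, n_rep, sigma, t_max=50.0, trials=2000, seed=0):
    """Mean |estimated - true| transition time over random step profiles.
    A fixed budget buys budget // n_rep uniformly spaced time points with
    n_rep replicates each."""
    rng = np.random.default_rng(seed)
    times = np.linspace(0.0, t_max, budget // n_rep)
    errs = []
    for _ in range(trials):
        s_true = rng.uniform(0.0, t_max)          # true transition time
        truth = (times >= s_true).astype(float)
        obs = truth[:, None] + rng.normal(0.0, sigma, size=(len(times), n_rep))
        errs.append(abs(ml_step_time(times, obs) - s_true))
    return float(np.mean(errs))

dense = expected_step_error(budget=12, n_rep=1, sigma=0.3)  # Dense
rep2 = expected_step_error(budget=12, n_rep=2, sigma=0.3)   # Repeat_2
print(dense, rep2)
```

With these settings the denser grid usually yields the lower mean error, consistent with the step-function results; raising sigma toward 1 shrinks the gap.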

For the theoretical analysis, we assume that the time series is studied over [0, *T*_{max}] and that there are *k* transition points in each expression profile (Figure 1). The value at each transition point can change (up or down) by 1, where 1 represents a unit change in our model (for example, a log fold change of 2). Note that, whereas this assumption restricts the set of potential expression profiles, the possible set of resulting curves is still rich enough to define an important subset of expression trajectories. The transition points themselves are not restricted in terms of their temporal occurrence and so do not need to coincide with the measured time points. More importantly, by varying *k*, we can model (using a piecewise linear model) several realistic trajectories. Additionally, in many cases, researchers are primarily interested in the transition points themselves (for example, the first time a gene becomes differentially expressed), and so such a model captures an important aspect of the goals of time series gene expression analysis.

To test the difference between dense sampling (more time points profiled) and replicate sampling (fewer points but the same number of experiments), we first used the theoretical framework discussed above to evaluate the expected performance of the two strategies and then compared them using real gene expression profiles. For the theoretical analysis, we assumed that gene expression was measured between 0 and 50 hr (similar to real experiments, for example, Whitfield et al., 2002). We use the term "Repeat_{v}" throughout the rest of the paper to represent a setting where we use *v* replicates for each time point and "Dense" to represent a setting where we do not use any replicates. In such settings, when we have a budget for *x* experiments, Dense performs *x* RNA-seq (or microarray) experiments uniformly between 0 and 50 hr, whereas Repeat_{2} and Repeat_{3} perform two and three experiments at $x/2$ and $x/3$ uniformly sampled points, respectively. As mentioned above, we assume that the noise in each measurement for each gene is Gaussian (mean 0 and SD σ ranging between 0.1 and 1.5).

### Detecting Transition Time for Step Functions

We first evaluated the performance for step functions. These functions can represent genes that start as inactive (0) and become active after a certain time point (1), where the goal is to determine the transition time (Experimental Procedures). For each of the noise levels we consider, we randomly selected 100 transition points and evaluated the performance of the two strategies for the resulting curves. For such profiles, Dense performs better for noise levels lower than 0.9 for both 12 and 24 experiments (Figures 2A and 2B). Note that, because we assume that the difference between an active and a non-active gene is 1, a SD close to 1 is unlikely, and so values less than 0.9 are more likely in practice. Indeed, for most gene expression experiments, σ is much lower than 1 when analyzing log scale values (for example, close to 0.3 for Blake et al., 2003). For such values, Dense leads to lower reconstruction errors and is clearly much better than Repeat_{2} and Repeat_{3}. We also tested a larger number of experiments, fixing the noise SD at 0.3. As can be seen in Figure 2C, when the number of experiments increases beyond 24, the improvement seen for the dense strategy decreases. However, even for a very large number of experiments (40 over 50 hr with a single transition), Dense still outperforms Repeat_{2} and Repeat_{3} when using σ = 0.3.

### Analysis of More-Complicated Transition Functions

Following the analysis of the step function scenario, we analyzed more-complex transition profiles, including monotonically increasing and non-monotonic transitions (Figures 2D and 2E), with 12 experiments. Specifically, we looked at a monotonically increasing function (0, 1, 2, 3), representing a gene that is continuously upregulated during the course of the study (common in response experiments, for example, immune response; Teschendorff et al., 2007), and at a fluctuating function (0, 1, 0, 1), representing, for example, cases of cyclic activity such as the cell cycle and circadian rhythms (Rund et al., 2011). For these functions, we use the theoretical analysis above to compute the expected area difference between the true profile and the estimated profile for each of the methods (because the directions of the transitions are known, the differences are a function of inaccurate estimation of the transition time points). Dense outperforms Repeat_{2} and Repeat_{3} when the noise is low to moderate (Figures 2D and 2E). However, even for high noise values, we see that Repeat_{2} and Repeat_{3} do not improve upon Dense, indicating that, even when the noise levels cannot be completely determined, Dense is at least going to lead to results comparable to Repeat_{2} and Repeat_{3} and, in most cases, would outperform them.

For the most general setting of our theoretical framework, Dense outperforms Repeat_{2} and Repeat_{3} (Figure 2F) at noise levels up to 0.6 (which, as mentioned above, is much higher than often observed in practice). For this analysis, we fixed the number of transitions (in this case to 3) but did not assume that the directions are known. Thus, the analysis considers all 2^{3} possible transition profiles (Experimental Procedures). Results are more mixed for higher noise levels, though there does not seem to be a noise level at which Repeat_{2} and Repeat_{3} strongly dominate Dense.

### Analysis of Real Biological Data

The analysis above used our theoretical framework to compare the Dense and Repeat strategies for various profiles and noise levels. Whereas such analysis is informative because it applies to any measurements resulting from the setting being considered, it is also important to analyze real biological expression data to compare the two strategies. For this, we used a gene expression dataset that profiled 22,769 genes in *Anopheles gambiae* for 48 hr. The study had two settings, both with 13 experiments over the duration being studied: 12 hr light/12 hr dark (LD) switching and constant dark (DD). Experiments were performed every 4 hr with two replicates for each time point. As usual, we computed the values at each time point as log fold changes relative to the values at time point 0. For both strategies, we performed the following analysis: given a specific number of experiments (upper bounded by 13, the total number of points sampled), we sample time points uniformly between 12 and 60 hr for each strategy. We use the value of the closest time point if a time point is not measured in the original dataset. For Dense, we randomly select one of the replicates at each of the time points used, whereas both measurements are used for Repeat_{2} (though the total number of time points used by Repeat_{2} is half that used by Dense). Next, we fit interpolating splines for each gene and estimate the mean squared error (MSE) by comparing to the median values obtained when using all sampled points. Note that, in all experiments, at least half the experiments are not used (even when sampling 13 points, Dense only uses one experiment per time point), and so the test data are not fully used in the reconstruction, even when using the largest number of points. We repeat this procedure 10,000 times for Dense and Repeat_{2} and report the mean error.

We find that Dense is better than Repeat_{2}, improving our ability to determine the correct expression levels at non-sampled points in nine of the ten comparisons in which both strategies use the same number of points. Moreover, in some cases, Repeat_{2} leads to errors that are 50% higher, on average, when compared to Dense, for example, when the budget only allows for six to eight experiments in the LD setting (Figures 3A and 3B). The performance difference between them decreases as the number of experiments increases. However, Dense is at least as good across all settings.

The measured expression levels and the reconstructed profiles using both Dense and Repeat_{2} for three circadian genes are presented in Figures 3C–3E. In this figure, we allow both Dense and Repeat_{2} to use eight experiments. This figure helps explain the differences between the performances of the two methods. Whereas Dense correctly reconstructs the circadian profile of these genes, Repeat_{2} is unable to do so because it only measures four of the time points. For example, for AGAP010658 and AGAP000987, which were identified in this study as cycling with LD (the key goal of this experiment), Dense indeed recovers the correct two-cycle profile, whereas Repeat_{2} completely misses it.

### Comparisons Using a Subset of the Genes

The above analysis examined performance over all genes profiled in the experiment. However, in most cases, researchers tend to focus on a much smaller subset of genes (often the most varying ones), and any strategy for designing experiments should be able to recover an accurate representation for these genes. To study the difference between the Dense and Repeat strategies for such key genes, we used 536 genes identified as rhythmic using a cosine wave-fitting algorithm in both the LD and DD conditions (Rund et al., 2011). We use the spline-fitting procedure discussed above. Dense performs better than Repeat_{2} for this important subset, both quantitatively, leading to overall lower errors for non-sampled points, and qualitatively, enabling us to correctly identify cycling genes (Figures 4A and 4B).

Analysis of the gene-specific performance differences for this smaller set of genes (as opposed to the average differences presented above) shows that Dense performs better than Repeat_{2} for 468 (87%) and 523 (98%) of the 536 rhythmic genes in the LD and DD datasets, respectively (Figures 4C and 4D). Even when increasing the number of experiments to ten, Dense still performs significantly better than Repeat_{2} (p < 0.01; Wilcoxon rank-sum test). See also Figure S1, where we show similar results when using a piecewise linear fit rather than a spline fit for these data.

## Discussion

Whereas replicate experiments have been widely used in high-throughput analysis studies, they have been utilized to a much lesser extent when using the same technology to study time series data (Zinman et al., 2013). Whereas it is hard to determine the exact causes of this practice, it is very likely that budget and sample quantity constraints have played a role. However, no systematic study has examined the trade-offs between more time points and more replicates for such studies.

Here, we have addressed this issue using a combined theoretical and real data analysis framework. Our theoretical models consider the impact of various noise levels on the ability of each of these strategies to correctly infer the underlying profile. As we show by analyzing a restricted yet expressive set of piecewise linear curves, for reasonable noise levels, dense sampling leads to better results than a strategy that profiles a smaller number of time points but a larger number of replicates per sample. We obtain similar results when analyzing real biological gene expression data, for both the full set of genes being studied and a subset of the key genes identified in a specific study.

Whereas we conclude that a dense sampling is beneficial when the number of experiments is limited by external constraints, we do not claim that replicates do not provide additional and valuable information.

If resource constraints do not exist, or if it is possible to increase the number of experiments performed, replicates are an important and useful strategy for identifying differentially expressed genes, for clustering and modeling their behavior, and for understanding measurement noise and reliability. However, whereas replicates can be very useful in dealing with measurement noise, if we assume that the data being studied can indeed be represented by a (smooth) continuous curve, which is often the case (Bar-Joseph et al., 2003b), then the autocorrelation between successive points can also provide information about the noise in the data (we do not expect large variations between these points). In such cases, more time points, even at the expense of fewer or no replicates, may prove to be a better strategy for reconstructing the dynamics of the type of data being studied.

Whereas this study is mainly focused on developing a theoretical framework for considering the trade-offs between dense and replicate sampling of gene expression data, and on re-analysis of existing experimental data, the methods we developed can also be used by experimentalists when designing other high-throughput time course experiments. Specifically, the Java program (provided as Data S1 and on the supporting website http://sb.cs.cmu.edu/repeats) allows researchers to input the total duration of their experiment, the total number of experiments that they can perform (given budget/sample constraints), and the expected noise (which, if unknown, can be determined by performing a few additional experiments at time point 0; Bar-Joseph et al., 2003a). Given these inputs, we use the piecewise linear simulation framework to evaluate the expected curve reconstruction error under dense and replicate sampling, and the results are returned to the user.

## Experimental Procedures

### Likelihood-Based Framework

We use a probabilistic model for our theoretical analysis. Such a model allows us to capture the uncertainty in the measurement replicates and the noise associated with high-throughput biological data. We assume that the time series is studied over [0, *T*_{max}] and that there are *k* transition points in each expression profile (Figure 1). The value at each transition point can change (up or down) by 1, where 1 represents a unit change in our model (for example, a log fold change of 2). The transition points are not restricted in terms of their temporal occurrence and so do not need to coincide with the measured time points. By varying *k*, we can model several realistic trajectories.

More formally, we use Dense to denote a dense sampling strategy and Repeat_{v} to denote a sampling strategy that uses $v$ replicates at each time point. Denote the observed data using Dense for a gene $g$ by ${D}_{g}^{d}$ and for Repeat by ${D}_{g}^{r}$. Let ${T}^{d}$ and ${T}^{r}$ be the sets of measured time points for each method, respectively, and ${n}_{r}$ be the number of experiments used for each time point in Repeat_{v}. For a given budget $B$, we assume $\left|{T}^{d}\right|=B$ and $\left|{T}^{r}\right|=\left\lfloor B/{n}_{r}\right\rfloor$. We assume that the true expression profile for a gene $g$ is defined by transition times ${S}_{g}=\left\{{s}_{g}^{1},\dots,{s}_{g}^{k}\right\}$ and corresponding transition directions ${C}_{g}=\left\{{c}_{g}^{1},\dots,{c}_{g}^{k}\right\}$, where each ${c}_{g}^{i}\in\left\{-1,1\right\}$. The goal of an experiment (using either of the sampling methods) is to detect, as accurately as possible, these transition times and directions. Let ${S}_{r}=\left\{{s}_{r}^{1},\dots,{s}_{r}^{k}\right\}$ and ${C}_{r}=\left\{{c}_{r}^{1},\dots,{c}_{r}^{k}\right\}$ denote the points and directions estimated by Repeat_{v}. Similarly, let ${S}_{d}$ and ${C}_{d}$ denote the points and directions estimated by Dense. We assume ${S}_{g},{S}_{r},{S}_{d}$ to be sorted in increasing order and define ${f}_{\text{mis}}\left({S}_{g},{C}_{g},{S}_{r},{C}_{r}\right)$ to be the area difference between the true gene profile curve defined by ${S}_{g}$ and ${C}_{g}$ and the estimated curve defined by ${S}_{r}$ and ${C}_{r}$. We compare the two strategies by ${f}_{\text{mis}}$.

### General Likelihood Function

Given the experiment values and $k$ (the required number of transitions), we next need to select the set of transition points and directions for each method: ${S}_{d}$, ${C}_{d}$ and ${S}_{r}$, ${C}_{r}$. For this, we use the maximum likelihood (ML) criterion. Let ${A}^{r}$ be the set of all $k$-point subsets of ${T}^{r}$ that are candidates for ${S}_{g}$ and $Q={\left\{1,-1\right\}}^{k}$ be the set of all possible transition directions for these points that are candidates for ${C}_{g}$. Each $k$-point subset ${T}^{\prime}=\left\{{T}_{i}^{\prime},i\in 1,\dots,k\right\}\in{A}^{r}$ and transition function $C=\left\{{c}_{i},i\in 1,\dots,k\right\}\in Q$ partitions $\left[0,{T}_{\mathrm{max}}\right]$ into $k+1$ intervals ${I}^{\prime}=\left\{{I}_{i}^{\prime}=\left[{T}_{i}^{\prime},{T}_{i+1}^{\prime}\right],i\in 0,\dots,k\right\}$ with corresponding values $\left\{{v}_{i},i\in 0,\dots,k\right\}$, where ${T}_{0}^{\prime}=0$, ${T}_{k+1}^{\prime}={T}_{\mathrm{max}}$, and ${v}_{i+1}={v}_{i}+{c}_{i+1}$. Let $\mathsf{L}\left({T}^{\prime},C|{D}_{g}^{r}\right)$ denote the probability of the observed values for Repeat_{v} conditioned on transition times ${T}^{\prime}$ and directions $C$. Assuming independent Gaussian measurement noise, this likelihood can be formulated as

$\begin{array}{c}\mathsf{L}\left({T}^{\prime},C|{D}_{g}^{r}\right)=p\left({D}_{g}^{r}|{T}^{\prime},C\right)=\prod _{i=0}^{k}\prod _{{t}_{a}\le{t}_{j}^{r}<{t}_{b},{I}_{i}^{\prime}=\left[{t}_{a},{t}_{b}\right)}\prod _{z=1}^{{n}_{r}}p\left({d}_{j:z}^{r}|{v}_{i},\sigma \right)\\ =\prod _{i=0}^{k}\prod _{{t}_{a}\le{t}_{j}^{r}<{t}_{b},{I}_{i}^{\prime}=\left[{t}_{a},{t}_{b}\right)}\prod _{z=1}^{{n}_{r}}\frac{1}{\sigma \sqrt{2\pi}}\exp\left(-\frac{{\left({d}_{j:z}^{r}-{v}_{i}\right)}^{2}}{2{\sigma}^{2}}\right)\end{array},$

(Equation 1)

where ${d}_{j:z}^{r}\in{D}_{g}^{r}$ is the $z$'th replicate of the $j$'th measured value, ${t}_{j}^{r}\in{T}^{r}$ ranges over the time points measured by Repeat_{v}, and $p\left({d}_{j:z}^{r}|{v}_{i},\sigma \right)$ is the Gaussian probability of observing ${d}_{j:z}^{r}$ given mean ${v}_{i}$ and SD $\sigma$. To find the ML estimate for ${S}_{r}$ and ${C}_{r}$, we set ${S}_{r},{C}_{r}=\mathrm{argmax}_{{T}^{\prime}\in{A}^{r},\overline{C}\in Q}\,p\left({D}_{g}^{r}|{T}^{\prime},\overline{C}\right)$. A similar analysis can be carried out to determine the ML estimate for ${S}_{d}$ and ${C}_{d}$, where we condition on the observed values for each point in ${T}^{d}$ and set ${n}_{r}=1$.

### Analyzing a Restricted Set of Profiles

Whereas our goal is to evaluate the general likelihood function presented above, because of the combinatorial nature of the computation (over all selections of points and directions), it is impossible to compare the methods for completely unrestricted cases. We thus continue by discussing restrictions of the general framework that allow us to compute a closed-form solution for the expected differences between the two methods in a reasonable (polynomial) time while still capturing a relevant and biologically important subset of the potential expression profiles.

We start by considering step functions. Such functions allow only a single transition (for example, a gene that is up- or downregulated at some point during the experiment and stays at that level until the end). Whereas step functions are clearly highly restricted, there are many cases where genes with step-function-like behavior are of interest, for example, when looking for differentially expressed genes in a response experiment. For such genes, the key question is to determine the timing of the step event (the time of activation). We next discuss analysis that allows functions with a larger number of possible transitions, assuming that we know the direction of each of these transitions. Finally, we consider the most general case, where both the locations and directions of transitions are unknown.
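For the most general case just described, the ML selection over all k-point subsets of the measured times and all 2^k direction assignments can be sketched by brute force. This is an illustrative Python sketch with hypothetical helper names, not the paper's implementation, and is feasible only for the small k and point counts considered here:

```python
import itertools
import numpy as np

def profile_values(times, trans_times, directions):
    """Piecewise-constant profile that starts at 0 and changes by +/-1
    at each transition time, evaluated at the measured times."""
    vals = np.zeros(len(times))
    for s, c in zip(trans_times, directions):
        vals += c * (times >= s)
    return vals

def ml_transitions(times, obs, k):
    """Exhaustive ML over all k-point subsets of the measured times and
    all 2^k direction assignments; under Gaussian noise with fixed sigma
    this reduces to a least-squares search. obs: (len(times), n_rep)."""
    best, best_err = None, np.inf
    for subset in itertools.combinations(times, k):
        for dirs in itertools.product((-1, 1), repeat=k):
            pred = profile_values(times, subset, dirs)
            err = ((obs - pred[:, None]) ** 2).sum()
            if err < best_err:
                best, best_err = (subset, dirs), err
    return best

# Example: recover a profile that steps up near t=20 and down near t=35.
rng = np.random.default_rng(1)
times = np.linspace(0.0, 50.0, 12)
truth = profile_values(times, (20.0, 35.0), (1, -1))
obs = truth[:, None] + rng.normal(0.0, 0.3, size=(len(times), 1))
est_times, est_dirs = ml_transitions(times, obs, k=2)
```

The search space has $\binom{|T|}{k}2^{k}$ candidates, which is why the paper restricts the analysis to small $k$ and derives closed forms where possible.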

For a step function, we only need to determine a single time point that leads to ${s}_{r}={s}_{r}^{1}$, ${s}_{d}={s}_{d}^{1}$, ${s}_{g}={s}_{g}^{1}$, ${c}_{r}={c}_{r}^{1}$, ${c}_{d}={c}_{d}^{1}$, and ${c}_{g}={c}_{g}^{1}$. The likelihood function (Equation 1) becomes


$\mathsf{L}\left({s}_{r},{c}_{r}|{D}_{g}^{r}\right)=\prod _{{t}_{j}\mathrm{<}{s}_{r}}\prod _{z=1}^{{n}_{r}}p\left({d}_{j:z}^{r}|0\right)\prod _{{t}_{j}\mathrm{\ge}{s}_{r}}\prod _{z=1}^{{n}_{r}}p\left({d}_{j:z}^{r}|{c}_{r}\right),$

(Equation 2)

where ${c}_{r}$ is the direction change (here, an activation, so ${c}_{r}=1$). For a step function that transitions from $0$ to $1$ at time ${s}_{g}$, the expected error is

$E\left({f}_{\text{mis}}\right)=\sum _{{t}_{i}^{r}\mathrm{\in}{T}^{r}}p\left({s}_{r}={t}_{i}^{r}|{c}_{r}=1,{s}_{g},{c}_{g},{\sigma}^{2}\right)\underset{{f}_{\text{mis}}\left({s}_{g},{c}_{g},{t}_{i}^{r},{c}_{r}\right)}{\underbrace{\left(\left({T}_{\mathrm{max}}-{t}_{i}^{r}\right)\left|{c}_{g}-{c}_{r}\right|+\left({t}_{i}^{r}-{s}_{g}\right)\left|{c}_{r}\right|\right)}},$

(Equation 3)

where $p\left({s}_{r}={t}_{i}^{r}|{c}_{r},{s}_{g},{c}_{g},{\sigma}^{2}\right)$ is the probability of selecting the $i$’th time point that transitions into the value ${c}_{r}=1$ conditioned on the actual step time, actual transition direction, and the noise in the measured data.
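To make this concrete, the following minimal Python sketch (the paper's own implementation is in Java; the helper names `ml_step_time` and `step_error` are ours) selects the maximum-likelihood step time and computes the resulting area error. Under Gaussian noise with a shared SD, maximizing Equation 2 is equivalent to minimizing the squared error of the fitted step curve, so the noise level drops out of the argmax:

```python
import numpy as np

def ml_step_time(d):
    """Maximum-likelihood step time for a 0 -> 1 step function.

    d is a (T, n_r) array: n_r noisy replicates at each of T time points.
    Maximizing the step-function likelihood (Equation 2) under shared
    Gaussian noise is equivalent to minimizing the squared error of the
    fitted step curve.
    """
    T = d.shape[0]
    best_i, best_sse = 0, float("inf")
    for i in range(T):           # candidate: 0 before t_i, 1 from t_i onward
        fit = np.zeros(T)
        fit[i:] = 1.0
        sse = float(((d - fit[:, None]) ** 2).sum())
        if sse < best_sse:
            best_i, best_sse = i, sse
    return best_i

def step_error(true_i, est_i, times):
    """Area between the true and estimated unit-step curves, i.e., f_mis of
    Equation 3 for the case c_g = c_r = 1."""
    return abs(times[est_i] - times[true_i])
```

With noiseless data the true step index is recovered exactly and the area error is zero; as noise grows, the selected index drifts and the error grows with the time offset of the chosen step.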

In order to select ${t}_{i}$ as the step point, we need the likelihood defined by it and ${c}_{r}=1$ to be higher than that defined by any other point and direction. From here on, we drop the superscript $r$ when referring to the sampled time points and values and use the shorthand notation $\mathsf{L}\left({t}_{i},1\right)$ to denote $\mathsf{L}\left({t}_{i}^{r},1|{D}^{r}\right)$. Because the transition direction is known:


$p\left({s}_{r}={t}_{i}^{r}|{c}_{r}=1,{s}_{g},{c}_{g},{\sigma}^{2}\right)=p\left(\mathsf{L}\left({t}_{i},1\right)\mathrm{>}\mathsf{L}\left({t}_{j},1\right),\mathrm{\forall}j\mathrm{\ne}i\right),$

(Equation 4)

where $\mathsf{L}\left({t}_{i},1\right)$ is defined in Equation 2. Computing $p\left(\mathsf{L}\left({t}_{i},1\right)\mathrm{>}\mathsf{L}\left({t}_{j},1\right),\mathrm{\forall}j\mathrm{\ne}i\right)$ involves nested integrals over pairwise probabilities. Let ${S}_{i}=\left\{{t}_{1},\mathrm{\dots},{t}_{i-1}\right\}$ and ${M}_{i}=\left\{{t}_{i+1},\mathrm{\dots},{t}_{T}\right\}$ be the sets of sorted time points that are smaller or larger than ${t}_{i}$, respectively, and let $p\left(\mathsf{L}\left({t}_{i},{c}_{x}\right)\mathrm{>}\mathsf{L}\left({t}_{j},{c}_{y}\right)\right)$ be the probability of the likelihood defined by ${t}_{i}$ and direction ${c}_{x}$ being larger than the likelihood defined by ${t}_{j}$ and ${c}_{y}$. For ${t}_{j}\mathrm{\in}{S}_{i}$, both predicted curves have the same value up to ${t}_{j}$ $\left(0\right)$ as well as at and after ${t}_{i}$ $\left(1\right)$ because ${c}_{x}={c}_{y}=1$. This pairwise probability $p\left(\mathsf{L}\left({t}_{i},1\right)\mathrm{>}\mathsf{L}\left({t}_{j},1\right)\right)$ can then be expressed, as in Equation 5, in terms of the log-likelihood comparison:

$p\left(\mathsf{L}\left({t}_{i},1\right)\mathrm{>}\mathsf{L}\left({t}_{j},1\right)\right)=p\left(\mathrm{log}\left(\mathsf{L}\left({t}_{i},1\right)\right)-\mathrm{log}\left(\mathsf{L}\left({t}_{j},1\right)\right)\mathrm{>}0\right)=p\left(-\frac{1}{2{\sigma}^{2}}\left(\sum _{m=j}^{i-1}\sum _{z=1}^{{n}_{r}}{\left({d}_{m:z}\right)}^{2}-\sum _{m=j}^{i-1}\sum _{z=1}^{{n}_{r}}{\left({d}_{m:z}-1\right)}^{2}\right)\mathrm{>}0\right)=p\left(\sum _{m=j}^{i-1}\sum _{z=1}^{{n}_{r}}{d}_{m:z}\mathrm{\le}{n}_{r}\frac{h}{2}\right)=\mathrm{\Phi}\left({n}_{r}\frac{h}{2},{m}_{j}^{i},{\sigma}_{j}^{i}\right),$

(Equation 5)

where $h$ is the number of measurements between ${t}_{j}$ and ${t}_{i-1}$, including both time points, and $\mathrm{\Phi}\left({n}_{r}\left(h/2\right),{m}_{j}^{i},{\sigma}_{j}^{i}\right)$ is the cumulative distribution function (cdf) of a Gaussian with mean ${m}_{j}^{i}$ and SD ${\sigma}_{j}^{i}$. Because we know ${s}_{g}$ and ${c}_{g}$ (the computation is conditioned on them) and we are dealing with Gaussians, the sum of the observations is also Gaussian, with mean ${m}_{j}^{i}={n}_{r}{\sum}_{m=j,{t}_{m}\mathrm{\ge}{s}_{g}}^{i-1}1$ and SD ${\sigma}_{j}^{i}=\sqrt{{n}_{r}{\sum}_{m=j}^{i-1}{\sigma}^{2}}$. Note that this analysis takes into account the number of replicates when computing the noise for each time point.
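The closed form of Equation 5 is cheap to evaluate. Below is an illustrative Python sketch (not the paper's Java tool; `pairwise_step_prob` is a hypothetical helper, with candidates indexed on the sampled time grid so that $h=i-j$):

```python
from math import sqrt
from statistics import NormalDist

def pairwise_step_prob(i, j, s_g, n_r, sigma):
    """Equation 5 for j < i: probability that the likelihood of candidate
    step index i beats that of an earlier candidate index j, given the
    true 0 -> 1 step at index s_g.  h = i - j measurements lie between
    t_j and t_{i-1} inclusive."""
    h = i - j
    # Mean of the summed observations: only points in [t_j, t_{i-1}] that
    # are at or past the true step time contribute a mean of 1 each.
    m = n_r * sum(1 for k in range(j, i) if k >= s_g)
    s = sigma * sqrt(n_r * h)    # SD of a sum of n_r * h iid Gaussians
    return NormalDist(mu=m, sigma=s).cdf(n_r * h / 2)
```

As a sanity check: when the true step lies at or after ${t}_{i}$, the summed mean is $0$ and the probability approaches $1$ for small noise; when the true step lies at or before ${t}_{j}$, the mean is ${n}_{r}h$ and the probability approaches $0$.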

Repeating the pairwise comparison in Equation 5 for all points in ${S}_{i}$ and ${M}_{i}$ returns a set of distributions that needs to be satisfied. For a step function, the distributions returned by ${S}_{i}$ and ${M}_{i}$ are independent of each other, so the nested integral for Equation 4 can be separated into two integrals, each of which can be efficiently estimated by Gaussian quadrature or by Markov chain Monte Carlo (MCMC) (Press et al., 2007; see Supplemental Experimental Procedures for details).

### Analyzing Profiles with Multiple Transitions

Following the analysis of step functions, where we focused on identifying a single change point, we now consider the more general (though still not the most general) case, where we know the number of transition points and their directions (for example, $0,1,2,1$) but not the specific time points at which they occur. In this case, we estimate $p\left({S}_{r}={T}^{i}|{C}_{r},{S}_{g},{C}_{g},{\sigma}^{2}\right)$ for ${T}^{i}\mathrm{\in}{A}^{d}$ in Equation 3, which is defined as the probability of the likelihood defined by ${T}^{i}$ being higher than the likelihood defined by any other $k$-subset.

In order to estimate this probability, we follow the approach used for the step function and define the pairwise probability $p\left(\mathsf{L}\left({T}^{i},{C}^{r}\right)\mathrm{>}\mathsf{L}\left({T}^{j},{C}^{r}\right)\right)$ as in


$p\left(\mathsf{L}\left({T}^{i},{C}^{r}\right)\mathrm{>}\mathsf{L}\left({T}^{j},{C}^{r}\right)\right)=p\left(\sum _{m=0}^{k}\sum _{\underset{{I}_{m}^{i}=\left[{t}_{a},{t}_{b}\right)}{{t}_{a}\mathrm{\le}t\mathrm{<}{t}_{b},}}\sum _{z=1}^{{n}_{r}}{\left({d}_{t:z}-{v}_{m}^{r}\right)}^{2}-\sum _{n=0}^{k}\sum _{\underset{{I}_{n}^{j}=\left[{t}_{a},{t}_{b}\right)}{{t}_{a}\mathrm{\le}t\mathrm{<}{t}_{b}}}\sum _{z=1}^{{n}_{r}}{\left({d}_{t:z}-{v}_{n}^{r}\right)}^{2}\mathrm{>}0\right),$

(Equation 6)

where ${I}^{i}$ and ${I}^{j}$ are the intervals defined by ${T}^{i}$ and ${T}^{j}$, and ${v}_{m+1}^{r}={v}_{m}^{r}+{c}_{m+1}^{r}$ is the corresponding value of the $m+2$’th interval defined by the transition directions ${C}^{r}$. For every $k$-subset ${T}^{i}$, there are $\left(\begin{array}{c}T\\ k\end{array}\right)-1$ comparisons, the intersection of which defines the integral boundaries for estimating Equation 4. In contrast to the step-function case above, we cannot separate the estimation of the nested integral in Equation 4 into two parts because there is no total ordering of, or independence between, the variables. In this case, we estimate the integral by sampling over the domain. As with step functions, we evaluate the success of the Dense and Repeat strategies by determining the area between the true and estimated curves for each gene.
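The sampling-based estimate can be sketched as a small simulation. This is our own illustrative Python variant (the name `expected_error_mc` is hypothetical): instead of integrating the pairwise boundaries, it draws noisy data, selects the maximum-likelihood $k$-subset by least squares (equivalent under Gaussian noise, as in Equation 6), and averages the area between the true and estimated curves:

```python
import numpy as np
from itertools import combinations

def expected_error_mc(times, S_g, C, n_r, sigma, n_sim=500, seed=0):
    """Monte Carlo estimate of E(f_mis) when the k transition directions
    C (e.g., [1, -1]) are known but the transition times are not."""
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    k, T = len(C), len(times)

    def curve(subset):
        # Piecewise-constant values: start at 0, add c_m at each transition.
        v, it, out = 0.0, 0, np.empty(T)
        for m, t in enumerate(times):
            while it < k and t >= subset[it]:
                v += C[it]
                it += 1
            out[m] = v
        return out

    true = curve(sorted(S_g))
    cands = [curve(s) for s in combinations(times, k)]
    widths = np.diff(times)
    total = 0.0
    for _ in range(n_sim):
        d = true[:, None] + rng.normal(0.0, sigma, size=(T, n_r))
        sse = [((d - c[:, None]) ** 2).sum() for c in cands]
        est = cands[int(np.argmin(sse))]
        # Area between two piecewise-constant curves on the sampled grid.
        total += float(np.sum(np.abs(true - est)[:-1] * widths))
    return total / n_sim
```

At low noise, the correct subset nearly always wins and the expected error is close to zero; raising `sigma` increases the error, mirroring the behavior of the analytical estimate.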

### General Transition Functions

Finally, we arrive at the most general case, where both the location and direction of transitions are unknown. Note that the number of transitions $k$ is an input for this computation, but because the goal of the modeling here is to determine how well the Dense and Repeat strategies do, we can easily perform the computation for all relevant values of $k$ to reach the conclusions we are interested in for a specific noise model (it is unlikely that genes would have more than five to six transitions in most time series studies; in fact, in most cases they have far fewer). For the case of $k$ possible transitions, the expected error becomes

$E\left({f}_{\text{mis}}\right)=\sum _{{T}^{i}\mathrm{\in}{A}^{r}}\sum _{{C}^{x}\mathrm{\in}Q}p\left({S}_{r}={T}^{i},{C}_{r}={C}^{x}|{S}_{g},{C}_{g},{\sigma}^{2}\right){f}_{\text{mis}}\left({S}_{g},{C}_{g},{T}^{i},{C}^{x}\right),$

(Equation 7)

where $p\left({S}_{r}={T}^{i},{C}_{r}={C}^{x}|{S}_{g},{C}_{g},{\sigma}^{2}\right)$ is the probability of selecting the $k$-point subset ${T}^{i}$ and the set of transition directions ${C}^{x}=\left\{{c}_{1}^{x},\mathrm{\dots},{c}_{k}^{x}\right\}$ conditioned on the actual transition times, the actual transition directions, and the noise in the measured data. In Equation 7, the expectation is taken over all possible $k$-point subsets of ${T}^{r}$ and all possible transition directions ${C}^{x}$ of length $k$ because we also do not know the transition directions. When estimating $p\left({S}_{r}={T}^{i},{C}_{r}={C}^{x}|{S}_{g},{C}_{g},{\sigma}^{2}\right)$, we want the likelihood defined by ${T}^{i}$ and ${C}^{x}$ to be higher than the likelihood defined by any other pair of a $k$-point subset ${T}^{j}$ and $k$-length transition directions ${C}^{y}$. We follow the approach used for the step function and define the pairwise probability $p\left(\mathsf{L}\left({T}^{i},{C}^{x}\right)\mathrm{>}\mathsf{L}\left({T}^{j},{C}^{y}\right)\right)$ as in

$p\left(\mathsf{L}\left({T}^{i},{C}^{x}\right)\mathrm{>}\mathsf{L}\left({T}^{j},{C}^{y}\right)\right)=p\left(\sum _{m=0}^{k}\sum _{\underset{{I}_{m}^{i}=\left[{t}_{a},{t}_{b}\right)}{{t}_{a}\mathrm{\le}t\mathrm{<}{t}_{b}}}\sum _{z=1}^{{n}_{r}}{\left({d}_{t:z}-{v}_{m}^{x}\right)}^{2}-\sum _{n=0}^{k}\sum _{\underset{{I}_{n}^{j}=\left[{t}_{a},{t}_{b}\right)}{{t}_{a}\mathrm{\le}t\mathrm{<}{t}_{b}}}\sum _{z=1}^{{n}_{r}}{\left({d}_{t:z}-{v}_{n}^{y}\right)}^{2}\mathrm{>}0\right),$

(Equation 8)

where ${I}^{i}$ and ${I}^{j}$ are the intervals defined by ${T}^{i}$ and ${T}^{j}$, and ${v}_{m+1}^{x}={v}_{m}^{x}+{c}_{m+1}^{x}$ and ${v}_{m+1}^{y}$ are the corresponding values of the $m+2$’th interval defined by the transition directions ${C}^{x}$ and ${C}^{y}$, respectively. For every $k$-subset ${T}^{i}$ and ${C}^{x}$, there are $\left(\begin{array}{c}T\\ k\end{array}\right){2}^{k}-1$ comparisons defining the integral boundaries for estimating Equation 4. As in the discussion of known transition directions above, a full ordering is not guaranteed between the variables, so the nested integral in Equation 4 can again be estimated via sampling.
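For intuition, the exhaustive search over all $k$-subsets and all ${2}^{k}$ direction vectors can be sketched directly. This is again an illustrative Python sketch rather than the paper's Java implementation (`ml_profile` is a hypothetical name, and we assume the direction set $Q$ is $\left\{-1,+1\right\}^{k}$ and that no transition occurs at the first sampled point):

```python
import numpy as np
from itertools import combinations, product

def ml_profile(d, k):
    """Exhaustive ML fit for the general case: try every k-subset of
    time-point indices and every length-k direction vector in {-1, +1}^k
    (Equation 8, by least squares under Gaussian noise) and return the
    best-fitting piecewise-constant curve.

    d is a (T, n_r) array of noisy replicate measurements.
    """
    T = d.shape[0]
    best_fit, best_sse = None, float("inf")
    for subset in combinations(range(1, T), k):  # no transition at index 0
        for dirs in product((1.0, -1.0), repeat=k):
            v, it, fit = 0.0, 0, np.empty(T)
            for m in range(T):
                while it < k and m >= subset[it]:
                    v += dirs[it]
                    it += 1
                fit[m] = v
            sse = ((d - fit[:, None]) ** 2).sum()
            if sse < best_sse:
                best_fit, best_sse = fit, sse
    return best_fit
```

The search visits $\left(\begin{array}{c}T-1\\ k\end{array}\right){2}^{k}$ candidates, which is tractable for the small $k$ (at most five to six transitions) relevant in practice.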

## Author Contributions

Z.-B.J. conceived the idea for this study and supervised the work. E.S. developed the theoretical models and analyzed both theoretical and real biological data. M.K. analyzed the theoretical models and real data and wrote the software. All authors wrote the paper.

## Acknowledgments

The work was supported in part by NIH (grant number U01 HL122626 to Z.-B.J.), by the National Science Foundation (grant number DBI-1356505 to Z.-B.J.), and by the James S. McDonnell Foundation Scholars Award in Studying Complex Systems. An early version of this paper was submitted to and peer reviewed at the 2016 Annual International Conference on Research in Computational Molecular Biology (RECOMB). The manuscript was revised and then independently further reviewed at *Cell Systems*. Supporting code and datasets are at http://sb.cs.cmu.edu/repeats and in Data S1.

## Supplemental Information

- Document S1. Supplemental Experimental Procedures and Figure S1

- Data S1. Software Source Code, Related to Experimental Procedures

## References

- On the choice of sampling rates in parametric identification of time series. *Inf. Sci.* 1969; 1: 273-278
- Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes. *Proc. Natl. Acad. Sci. USA.* 2003; 100: 10146-10151
- Continuous representations of time-series gene expression data. *J. Comput. Biol.* 2003; 10: 341-356
- Noise in eukaryotic gene expression. *Nature.* 2003; 422: 633-637
- Temporal transcriptional response to ethylene gas drives growth hormone cross-regulation in Arabidopsis. *eLife.* 2013; 2: e00675
- Activity motifs reveal principles of timing in transcriptional control of the yeast metabolic network. *Nat. Biotechnol.* 2008; 26: 1251-1259
- Error bars in experimental biology. *J. Cell Biol.* 2007; 177: 7-11
- STEM: a tool for the analysis of short time series gene expression data. *BMC Bioinformatics.* 2006; 7: 191
- Clustering short time series gene expression data. *Bioinformatics.* 2005; 21: i159-i168
- A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. *BMC Bioinformatics.* 2011; 12: 180
- A method to identify differential expression profiles of time-course gene data with Fourier transformation. *BMC Bioinformatics.* 2013; 14: 310
- Multiple alignment of continuous time series. In: Advances in Neural Information Processing Systems 17. MIT Press, 2004: 817-824
- Identifying differentially expressed genes in time course microarray data. *Stat. Biosci.* 2009; 1: 144-159
- Assessment of repeated microarray experiments using mixed tissue RNA reference samples. *Biotechniques.* 2008; 45: 283-292
- Numerical Recipes: The Art of Scientific Computing. Third Edition. Cambridge University Press, 2007
- Genome-wide profiling of diel and circadian gene expression in the malaria vector Anopheles gambiae. *Proc. Natl. Acad. Sci. USA.* 2011; 108: E421-E430
- Sampling strategies to measure the prevalence of common recurrent infections in longitudinal studies. *Emerg. Themes Epidemiol.* 2010; 7: 5
- An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. *Genome Biol.* 2007; 8: R157
- An approach for clustering gene expression data with error information. *BMC Bioinformatics.* 2006; 7: 17
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. *Nat. Protoc.* 2012; 7: 562-578
- Quantitative noise analysis for gene expression microarray experiments. *Proc. Natl. Acad. Sci. USA.* 2002; 99: 14031-14036
- Identification of genes periodically expressed in the human cell cycle and their expression in tumors. *Mol. Biol. Cell.* 2002; 13: 1977-2000
- ExpressionBlast: mining large, unstructured expression databases. *Nat. Methods.* 2013; 10: 925-926

## Article Info

### Publication History

Published: July 21, 2016

Accepted: June 14, 2016

Received in revised form: April 22, 2016

Received: February 25, 2016

### Copyright

© 2016 Elsevier Inc.
