# Taking the baseline measurement into account

###### constrained LDA in R

In Randomized Controlled Trials (RCTs), a “Pre” measurement is often taken at baseline (before randomization), and treatment effects are measured at one or more time point(s) after randomization (“Post” measurements). There are many ways to take the baseline measurement into account when comparing 2 groups in a classic pre-post design with one post measurement. In __this__ R-Bloggers post several strategies were discussed, including:

- The delta approach: a comparison of the “delta’s” (post measurement minus baseline measurement) using a independent samples t-test.
- The ANCOVA: A regression model with the pre measurement and group as predictors, and the post measurement as outcome variabele.

In this tutorial, we will apply the constrained longitudinal data analysis (cLDA) model in R.

**LDA** (as defined in Liu, 2009) is a mixed model approach where both baseline measurements and post measurements are modeled as dependent variables. The model includes at least *Group*, *Time* and their interaction *Group:Time* as predictors. We’re specifically interested in the interaction term, telling us whether the changes in mean response over time differ accross groups.

In **cLDA**, the group means are assumed to be equal at baseline (hence the “constrained” in cLDA), which is a reasonable assumption in a randomized trial. As described below, this strategy is an important exception of the rule that an interaction should not be included in a model without its main effects. In Liu, 2009 several advantages of this method are listed:

- In contrast to the ANCOVA and delta approach, all subjects with at least one measurement (baseline or post-randomization) are included in the analysis.
- The models can easily be extended to 3 or more repeated measures.
- It is possible to model the correlation between baseline and post measures differently for different treatment groups.

I’ll share with you how to perform this analysis in R. At the end of the post you’ll find a short section about non-randomized experiments/studies.

### Simulating the data

Data were created using the simstudy package. Each subject has equal chance to be assigned to either the Control group or the Experimental group. Both Pre and Post measurements are normally distributed. The assignment has no influence on the pre measurement. The post measurement depends on *Group* (Exp or Con) and on the baseline measurement (*Pre*).

```
library(simstudy)
tdef <- defData(varname = "Group", dist = "binary",
formula = 0.5, id = "Id")
tdef <- defData(tdef, varname = "Pre", dist = "normal",
formula = 7, variance = 10)
tdef <- defData(tdef, varname = "Post", dist = "normal",
formula = "Pre + 0.1 + 1.05* Group", variance =3 )
set.seed(123)
dtTrial <- genData(150, tdef)
dtTrial$Group <- factor(dtTrial$Group, levels=c(1,0), labels=c("Exp", "Con"))
library(pander)
pander(head(dtTrial))
```

Group | Pre | Post |
---|---|---|

Con | 10.24 | 10.45 |

Exp | 6.099 | 6.029 |

Con | 3.14 | 1.997 |

Exp | 7.573 | 10.26 |

Exp | 6.561 | 5.952 |

Con | 7.018 | 10.5 |

A dot plot of Post versus Pre visualizes the correlation between both time points:

```
library(ggplot2)
fav.col=c("#1A425C", "#8AB63F")
ggplot(dtTrial, aes (x=Pre,y=Post, col=Group)) +
geom_point() + geom_smooth(method="lm", se=FALSE) +
scale_color_manual(values=fav.col) + theme_bw()
```

### From wide to long format

When applying mixed models, the baseline measurement is seen as part of the outcome vector. All measurements (pre and post) should appear in one column (the variable *Outcome* ). Tidyverse’s gather() function is one of the many ways to do the job. In this example the columns Pre and Post are “gathered” in a column called *Outcome* (=value). The new variable called *Time* (=key) will show us whether it concerns a pre or post measurement.

```
library (tidyverse)
dtTrial_long <- gather(data=dtTrial, key=Time, value=Outcome,
Pre:Post, factor_key=TRUE)
pander(head(arrange(dtTrial_long, Id), 10))
```

Id | Group | Time | Outcome |
---|---|---|---|

1 | Con | Pre | 10.24 |

1 | Con | Post | 10.45 |

2 | Exp | Pre | 6.099 |

2 | Exp | Post | 6.029 |

3 | Con | Pre | 3.14 |

3 | Con | Post | 1.997 |

4 | Exp | Pre | 7.573 |

4 | Exp | Post | 10.26 |

5 | Exp | Pre | 6.561 |

5 | Exp | Post | 5.952 |

Once the data are structured in a “long” format the creation of a boxplot to visualize the data is simple:

```
library(ggplot2)
boxp <- ggplot (dtTrial_long , aes(x=Time,y=Outcome, col=Group))+
geom_boxplot(outlier.size = 0 ) +
geom_point(aes(fill=Group, col=NULL),shape=21, alpha=0.5,
position = position_jitterdodge(jitter.width = 0.2)) +
scale_fill_manual(values=fav.col) +
scale_color_manual(values=fav.col) +
theme_bw() + xlab("")
boxp
```

### Applying constrained longitudinal data analysis (cLDA)

The gls() function of the nlme package can be used to apply LDA modelling.

**Fixed part:**
The terms of the most basic cLDA model are *Time* and *Time * Group* . By omitting *Group* , there is no term allowing for a difference in means at baseline between the groups. Consequently, the baseline means are assumed to be equal. However, a general rule in regression modelling is that when adding an interaction to the model, the lower ranked effects should also be included in the model. For this reason, R will automatically include *Time* and *Group* as main effects in the model when including the interaction term _Time*Group_.
To avoid to have *Group* as a main effect in our model, we will create an alternative model matrix: Xalt.

```
X <- model.matrix(~ Time * Group, data = dtTrial_long)
colnames(X)
```

```
[1] "(Intercept)" "TimePost" "GroupCon"
[4] "TimePost:GroupCon"
```

```
Xalt <- X[, c("TimePost", "TimePost:GroupCon")]
colnames(Xalt)
```

`[1] "TimePost" "TimePost:GroupCon"`

**Random part:** The standard deviations and correlations that should be estimated are defined by the constant variance function varIdent(). By setting *weights = varIdent(form = ~ 1 | Time)* a separate standard deviation will be estimated for each time point and a seperate correlation will be estimated for each pair of time points (= unstructured variance covariance matrix). Applied to our example, we expect an estimated standard deviation for *Time*=Pre, an estimated standard deviation for *Time*=Post and one estimated correlation between the pre- and post measurements of the same subject.

*Remark*: By setting weights = varIdent(form = ~ 1 | Time:Group), a separate variance is estimated for each combination of Group and Time (Pre-Exp Post-Exp Pre-Con Post-Con ).

The argument *correlation=corSymm (form = ~ 1 | Id)* defines the subject levels. The correlation structure is assumed to apply only to observations within the same subject (in our example: *Id*); observations from different subjects (a different value for *Id*) are assumed to be uncorrelated.

```
library(nlme)
clda_gls <- gls(Outcome ~ Xalt,
weights = varIdent(form = ~ 1 | Time),
correlation=corSymm (form = ~ 1 | Id),
data = dtTrial_long)
summary(clda_gls)
```

```
Generalized least squares fit by REML
Model: Outcome ~ Xalt
Data: dtTrial_long
AIC BIC logLik
1358.307 1380.47 -673.1537
Correlation Structure: General
Formula: ~1 | Id
Parameter estimate(s):
Correlation:
1
2 0.842
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | Time
Parameter estimates:
Pre Post
1.000000 1.059605
Coefficients:
Value Std.Error t-value p-value
(Intercept) 6.978858 0.2461488 28.352190 0e+00
XaltTimePost 1.240246 0.2047301 6.057956 0e+00
XaltTimePost:GroupCon -0.958945 0.2815211 -3.406297 7e-04
Correlation:
(Intr) XltTmP
XaltTimePost -0.129
XaltTimePost:GroupCon 0.000 -0.715
Standardized residuals:
Min Q1 Med Q3 Max
-2.20538393 -0.65214666 -0.09411877 0.63654736 3.40671592
Residual standard error: 3.014695
Degrees of freedom: 300 total; 297 residual
```

`predictions<- cbind( dtTrial_long,clda_gls$fitted)`

The p-value for the interaction effect (the effect of interest here) is 0.0007: the change in mean over time differs significantly between groups. The estimated difference in change is -0.95 (SE: 0.28) (Con increases 0.95 less then Exp). The estimated correlation between Pre and Post is 0.842. The estimated standard deviation for the Pre measurement is 3.01 and the estimated standard devaition for the Post measurement is 1.059 * 3.01.

We can obtain predicted means using the gls() function. As expected, the predicted value is the same for both groups at baseline ( *Time* == Pre).

Id | Group | Time | Outcome | clda_gls$fitted |
---|---|---|---|---|

1 | Con | Pre | 10.24 | 6.979 |

1 | Con | Post | 10.45 | 7.26 |

2 | Exp | Pre | 6.099 | 6.979 |

2 | Exp | Post | 6.029 | 8.219 |

Preferably we also report the confidence intervals. That’s where the Gls() function of the rms package comes in. It uses exactly the same arguments as the gls() function, leading to exactly the same predictions. But the Gls() function allows us to create confidence intervals.

```
library(rms)
clda_Gls <- Gls(Outcome ~ Xalt,
weights = varIdent(form = ~ 1 | Time),
correlation=corSymm (form = ~ 1 | Id),
data = dtTrial_long)
predictions <- cbind(dtTrial_long, predict (clda_Gls, dtTrial_long, conf.int=0.95))
```

Now we do obtain 95% confidence intervals:

Id | Group | Time | Outcome | linear.predictors | lower | upper |
---|---|---|---|---|---|---|

1 | Con | Pre | 10.24 | 6.979 | 6.496 | 7.461 |

1 | Con | Post | 10.45 | 7.26 | 6.684 | 7.836 |

2 | Exp | Pre | 6.099 | 6.979 | 6.496 | 7.461 |

2 | Exp | Post | 6.029 | 8.219 | 7.633 | 8.805 |

#### Plotting the confidence intervals

First attempt for the visualization of the estimated means and their 95% confidence intervals:

```
pd <- position_dodge(.1)
limits <- aes(ymax = upper , ymin=lower, shape=Group)
pCI1 <- ggplot(predictions, aes( y=linear.predictors, x=Time)) +
geom_errorbar(limits, width= 0.1 , position=pd) +
geom_line(aes(group=Group, y=linear.predictors), position=pd) +
geom_point(position=pd, aes( fill=Group), shape=21, size=3) +
scale_fill_manual(values=c( "black", "white")) + theme_bw() +
theme(panel.grid.major = element_blank(), legend.title=element_blank(),
legend.position="bottom") +
xlab(NULL) + ylab("Estimated mean and 95% CI")
pCI1
```

To my opinion, this plot can be misleading. One estimated mean is presented by 2 different dots. I prefer to create the factor *Group2* and remake the plot.

```
predictions$Group2<- factor(predictions$Group, levels= c("All", "Exp", "Con"))
predictions$Group2 [predictions$Time=="Pre"] <- "All"
pd <- position_dodge(.08)
limits <- aes(ymax = upper , ymin=lower, shape=Group2)
pCI <- ggplot(predictions, aes( y=linear.predictors, x=Time)) +
geom_errorbar(limits, width=0.09,position=pd) +
geom_line(aes(group=Group, y=linear.predictors), position=pd)+
geom_point(position=pd, aes( fill=Group2), shape=21, size=3) +
scale_fill_manual(values=c("grey", "black", "white")) + theme_bw() +
theme(panel.grid.major = element_blank(), legend.title=element_blank(),
legend.position="bottom") +
xlab (NULL) + ylab("Estimated mean and 95% CI")
pCI
```

### In the case of non-randomized experiments

In the case of non-randomized experiments or studies an equal mean at baseline cannot be assumed. Consequently, applying constrained LDA is not appropriate. It is however possible to perform an LDA (without the constraint). By including *Group* in the model (next to *Time* and the interaction term *Group:Time* ), the baseline means are allowed to differ between groups. There’s no need to create the alternative model matrix Xalt in that situation. (Xalt can thus be replaced by *Group:Time* ) The remaining arguments (random part of the model) doesn’t differ from the model we presented for cLDA.

### References

References LDA and cLDA:

– *Fitzmaurice, Garrett M., et al. Applied Longitudinal Analysis. John Wiley & Sons, 2004.*

– *Liu GF, Lu K, Mogg R, et al. Should baseline be a covariate or dependent variable in analyses of change from baseline in clinical trials? Stat Med 2009; 28: 2509-2530.*

Reference Gls() function of the rms package:

– *Pinheiro J, Bates D (2000): Mixed effects models in S and S-Plus. New York: Springer-Verlag.*