Tag Archives: statistics

New paper: ‘Avoiding the misuse of BLUP in behavioural ecology’

March 6, 2017 tomhouslay Leave a comment

I have a new paper (with Alastair Wilson) out in the Behavioural Ecology journal, entitled ‘Avoiding the misuse of BLUP in behavioural ecology‘. Our paper is aimed at researchers working on individual variation in behaviour (e.g., personality, behavioural plasticity, behavioural syndromes), particularly those wishing to investigate associations between that behavioural variation and some other trait or variable (e.g., another behaviour, a physiological response, or even some external environmental variable). The thrust of the paper is really a call to ensure that we are using the proper statistical tools to test our hypotheses, rather than using other approaches that are known to give spurious results. The paper is quite brief, and can be found here (or drop me an email if you require a reprint).

Of course, pointing out problems is in itself not hugely useful without solutions being to hand, and so we have provided these in the form of tutorials for multivariate models in the programming language R. We are still working up more tutorials to cover more of the kinds of issues in which people are interested, so do keep checking back for updates – and let me know if there are any other relevant topics that you’d like to see covered! As has been pointed out on twitter, while we focused on animal behaviour because we work in that field, these models are applicable to many other fields in which researchers are interested in the causes and consequences of variation in labile traits.

NOTE

It has been pointed out to me (post-publication) that the Adriaenssens et al. (2016) paper in Physiology & Behavior, ‘Telomere length covaries with personality in a wild brown trout‘, did not use BLUPs extracted from mixed models in a secondary analysis, and is therefore incorrectly included in Table 1 of our Behavioural Ecology paper. I have apologised to Bart for my error, and contacted the publishers to see whether this reference can be removed from the paper.

Ponderings, Stats

Of carts and horses

December 18, 2015 tomhouslay Leave a comment

I have been working on a post for some time now, in which I was planning to use web-scraping in R to gather sports-related data from webpages and then run some fancy analysis on it. But when I say ‘working on’, I mean that I’ve playing around with the data, staring at a whole bunch of exploratory plots, and trying to come up with an angle for the analysis.

And, so far, I’ve come up with: nothing.

But perhaps some good can come out of this. The process of trying to come up with an idea to fit the data reminded me of this quote by Sir Ronald Fisher (I should admit that I know of the quote because a guy on the R mixed models mailing list has it in his signature):

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.

In the past couple of years, I’ve started teaching week-long statistics workshops (*cough*), which are incredibly satisfying (if also incredibly draining!). I have also gone through that enlightening period where learning a little more makes you realise how much you don’t know, which opened up a whole slew of new things to learn (also, I have started listening to statistics-related podcasts, which possibly means that I need professional help). I’m lucky in that I’ve managed to figure out some of this stuff, which leads to other people now coming to me for help with their analysis. Questions range from ‘does my model look ok?’ all the way to ‘I have this data, how should I analyse it?’. The latter always brings that quote into sharper focus.

It’s for this reason, the idea that statistical analysis should be an intrinsic part of the planning of any project, that I was interested to read articles recently on the idea of registering studies with journals prior to gathering the data. This idea stems mainly from the fact that too much of whether something gets published depends on it being a positive result, and the ‘spin’ of the results – with hypotheses often dreamed up post-hoc – affects which journal the study gets published in (obviously there’s more nuance than that, but maybe not that much more). By registering the design beforehand, you can go to a journal and say: this is the question, here are our hypotheses, here’s how we’re going to tackle it with an experiment, and here’s how we will analyse the data. The journal would then decide beforehand if they think that’s worth publishing – whatever the result.

This is a little simplistic, of course – there would have to be the usual review process, and there would obviously be leeway for further analysis of interesting trends on a post-hoc basis – but it would enforce greater thinking about an analysis strategy prior to embarking on a study. Even the simple task of drawing out the potential figures that would come out of the data collection is crucial to the process, as they help to clarify what is actually being tested.

So – that post I was originally setting out to write? I have the data, but I still haven’t had any good ideas for how to use it. And maybe it’s that kind of backwards approach that we all need to stay away from.

—

Understanding 3-way interactions between continuous and categorical variables, part ii: 2 cons, 1 cat

June 2, 2015 tomhouslay 5 Comments

I posted recently (well… not that recently, now that I remember that time is linear) about how to visualise 3-way interactions between continuous and categorical variables (using 1 continuous and 2 categorical variables), which was a follow-up to my extraordinarily successful post on 3-way interactions between 3 continuous variables (by ‘extraordinarily successful’, I mean some people read it on purpose and not because they were misdirected by poorly-thought google search terms, which is what happens with the majority of my graphic insect-sex posts). I used ‘small multiples‘, and also predicting model fits when holding particular variables at distinct values.

ANYWAY…

I just had a comment on the recent post, and it got me thinking about combining these approaches:

There are a number of approaches we can use here, so I’ll run through a couple of examples. First, however, we need to make up some fake data! I don’t know anything about bone / muscle stuff (let’s not delve too far into what a PhD in biology really means), so I’ve taken the liberty of just making up some crap that I thought might vaguely make sense. You can see here that I’ve also pretended we have a weirdly complete and non-overlapping set of data, with one observation of bone for every combination of muscle (continuous predictor), age (continuous covariate), and group (categorical covariate). Note that the libraries you’ll need for this script include {dplyr}, {broom}, and {ggplot2}.

#### Create fake data ####
 
bone_dat <- data.frame(expand.grid(muscle = seq(50,99),
                                   age = seq(18, 65),
                                   groupA = c(0, 1)))
 
## Set up our coefficients to make the fake bone data
coef_int <- 250
coef_muscle <- 4.5
coef_age <- -1.3
coef_groupA <- -150
coef_muscle_age <- -0.07
coef_groupA_age <- -0.05
coef_groupA_muscle <- 0.3
coef_groupA_age_muscle <- 0.093
 
bone_dat <- bone_dat %>% 
  mutate(bone = coef_int +
  (muscle * coef_muscle) +
  (age * coef_age) +
  (groupA * coef_groupA) +
  (muscle * age * coef_muscle_age) +
  (groupA * age * coef_groupA_age) +
  (groupA * muscle * coef_groupA_muscle) +
  (groupA * muscle * age * coef_groupA_age_muscle))
 
ggplot(bone_dat,
       aes(x = bone)) +
  geom_histogram(color = 'black',
                 fill = 'white') +
  theme_classic() +
  facet_grid(. ~ groupA)
 
## Add some random noise
noise <- rnorm(nrow(bone_dat), 0, 20)
bone_dat$bone <- bone_dat$bone + noise
 
#### Analyse ####
 
mod_bone <- lm(bone ~ muscle * age * groupA,
               data = bone_dat)
 
plot(mod_bone)

summary(mod_bone)

While I’ve added some noise to the fake data, it should be no surprise that our analysis shows some extremely strong effects of interactions… (!)

Call: lm(formula = bone ~ muscle * age * groupA, data = bone_dat)

Residuals:
Min 1Q Median 3Q Max
-71.824 -13.632 0.114 13.760 70.821

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.382e+02 6.730e+00 35.402 < 2e-16 ***
muscle 4.636e+00 8.868e-02 52.272 < 2e-16 ***
age -9.350e-01 1.538e-01 -6.079 1.31e-09 ***
groupA -1.417e+02 9.517e+00 -14.888 < 2e-16 ***
muscle:age -7.444e-02 2.027e-03 -36.722 < 2e-16 ***
muscle:groupA 2.213e-01 1.254e-01 1.765 0.0777 .
age:groupA -3.594e-01 2.175e-01 -1.652 0.0985 .
muscle:age:groupA 9.632e-02 2.867e-03 33.599 < 2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.85 on 4792 degrees of freedom
Multiple R-squared: 0.9728, Adjusted R-squared: 0.9728
F-statistic: 2.451e+04 on 7 and 4792 DF, p-value: < 2.2e-16

(EDIT: Note that this post is only on how to visualise the results of your analysis; it is based on the assumption that you have done the initial data exploration and analysis steps yourself already, and are satisfied that you have the correct final model… I may write a post on this at a later date, but for now I’d recommend Zuur et al’s 2010 paper, ‘A protocol for data exploration to avoid common statistical problems‘. Or you should come on the stats course that Luc Bussière and I run).

Checking the residuals etc indicates (to nobody’s surprise) that everything is looking pretty glorious from our analysis. But how to actually interpret these interactions?

We shall definitely have to use small multiples, because otherwise we shall quickly become overwhelmed. One method is to use a ‘heatmap’ style approach; this lets us plot in the style of a 3D surface, where our predictors / covariates are on the axes, and different colour regions within parameter space represent higher or lower values. If this sounds like gibberish, it’s really quite simple to get when you see the plot:

Here, higher values of bone are in lighter shades of blue, while lower values of bone are in darker shades. Moving horizontally, vertically or diagonally through combinations of muscle and age show you how bone changes; moreover, you can see how the relationships are different in different groups (i.e., the distinct facets).

To make this plot, I used one of my favourite new packages, ‘{broom}‘, in conjunction with the ever-glorious {ggplot2}. The code is amazingly simple, using broom’s ‘augment’ function to get predicted values from our linear regression model:

mod_bone %>% augment() %>% 
  ggplot(., aes(x = muscle,
                y = age,
                fill = .fitted)) +
  geom_tile() +
  facet_grid(. ~ groupA) +
  theme_classic()

But note that one aspect of broom is that augment just adds predicted values (and other cool stuff, like standard errors around the prediction) to your original data frame. That means that if you didn’t have such a complete data set, you would be missing predicted values because you didn’t have those original combinations of variables in your data frame. For example, if we sample 50% of the fake data, modelled it in the same way and plotted it, we would get this:

Not quite so pretty. There are ways around this (e.g. using ‘predict’ to fill all the gaps), but let’s move onto some different ways of visualising the data – not least because I still feel like it’s a little hard to get a handle on what’s really going on with these interactions.

A trick that we’ve seen before for looking at interactions between continuous variables is to look at only high/low values of one, across the whole range of another: in this case, we would show how bone changes with muscle in younger and older people separately. We could then use small multiples to view these relationships in distinct panels for each group (ethnic groups, in the example provided by the commenter above).

Here, I create a fake data set to use for predictions, where I have the full range of muscle (50:99), the full range of groups (0,1), and then age is at 1 standard deviation above or below the mean. The ‘expand.grid’ function simply creates every combination of these values for us! I use ‘predict’ to create predicted values from our linear model, and then add an additional variable to tell us whether the row is for a ‘young’ or ‘old’ person (this is really just for the sake of the legend):

#### Plot high/low values of age covariate ####
 
bone_pred <- data.frame(expand.grid(muscle = seq(50, 99),
                      age = c(mean(bone_dat$age) +
                                sd(bone_dat$age),
                              mean(bone_dat$age) -
                                sd(bone_dat$age)),
                      groupA = c(0, 1)))
 
bone_pred <- cbind(bone_pred,
                   predict(mod_bone,
                     newdata = bone_pred,
                     interval = "confidence"))
 
bone_pred <- bone_pred %>% 
  mutate(ageGroup = ifelse(age > mean(bone_dat$age), "Old", "Young"))
 
 
ggplot(bone_pred, 
       aes(x = muscle,
           y = fit)) +
  geom_line(aes(colour = ageGroup)) +
#   geom_point(data = bone_dat,
#              aes(x = muscle,
#                  y = bone)) +
  facet_grid(. ~ groupA) +
  theme_classic()

This gives us the following figure:

Here, we can quite clearly see how the relationship between muscle and bone depends on age, but that this dependency is different across groups. Cool! This is, of course, likely to be more extreme than you would find in your real data, but let’s not worry about subtlety here…

You’ll also note that I’ve commented out some lines in the specification of the plot. These show you how you would plot your raw data points onto this figure if you wanted to, but it doesn’t make a whole lot of sense here (as it would include all ages), and also our fake data set is so dense that it just obscures meaning. Good to have in your back pocket though!

Finally, what if we were more concerned with comparing the bone:muscle relationship of different groups against each other, and doing this at distinct ages? We could just switch things around, with each group a line on a single panel, with separate panels for ages. Just to make it interesting, let’s have three age groups this time: young (mean – 1SD), average (mean), old (mean + 1SD):

#### Groups on a single plot, with facets for different age values ####
 
avAge <- round(mean(bone_dat$age))
sdAge <- round(sd(bone_dat$age))
youngAge <- avAge - sdAge
oldAge <- avAge + sdAge
 
bone_pred2 <- data.frame(expand.grid(muscle = seq(50, 99),
                                      age = c(youngAge,
                                              avAge,
                                              oldAge),
                                      groupA = c(0, 1)))
 
bone_pred2 <- cbind(bone_pred2,
                   predict(mod_bone,
                           newdata = bone_pred2,
                           interval = "confidence"))
 
ggplot(bone_pred2, 
       aes(x = muscle,
           y = fit,
           colour = factor(groupA))) +
  geom_line() +
  facet_grid(. ~ age) +
  theme_classic()

Created by Pretty R at inside-R.org

The code above gives us:

Interestingly, I think this gives us the most insightful version yet. Bone increases with muscle, and does so at a higher rate for those in group A (i.e., group A == 1). The positive relationship between bone and muscle diminishes at higher ages, but this is only really evident in non-A individuals.

Taking a look at our table of coefficients again, this makes sense:

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.382e+02 6.730e+00 35.402 < 2e-16 *** muscle 4.636e+00 8.868e-02 52.272 < 2e-16 *** age -9.350e-01 1.538e-01 -6.079 1.31e-09 *** groupA -1.417e+02 9.517e+00 -14.888 < 2e-16 *** muscle:age -7.444e-02 2.027e-03 -36.722 < 2e-16 *** muscle:groupA 2.213e-01 1.254e-01 1.765 0.0777 . age:groupA -3.594e-01 2.175e-01 -1.652 0.0985 . muscle:age:groupA 9.632e-02 2.867e-03 33.599 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is a positive interaction between group A x muscle x bone, which – in group A individuals – overrides the negative muscle x age interaction. The main effect of muscle is to increase bone mass (positive slope), while the main effect of age is to decrease it (in this particular visualisation, you can see this because there is essentially an age-related intercept that decreases along the panels).

These are just a few of the potential solutions, but I hope they also serve to indicate how taking the time to explore options can really help you figure out what’s going on in your analysis. Of course, you shouldn’t really believe these patterns if you can’t see them in your data in the first place though!

Unfortunately, I can’t help our poor reader with her decision to use Stata, but these things happen…

Note: if you like this sort of thing, why not sign up for the ‘Advancing in statistical modelling using R‘ workshop that I teach with Luc Bussière? Not only will you learn lots of cool stuff about regression (from straightforward linear models up to GLMMs), you’ll also learn tricks for manipulating and tidying data, plotting, and visualising your model fits! Also, it’s held on the bonny banks of Loch Lomond. It is delightful.

—-

Want to know more about understanding and visualising interactions in multiple linear regression? Check out my previous posts:

Understanding three-way interactions between continuous variables

Using small multiples to visualise three-way interactions between 1 continuous and 2 categorical variables

Stats

Understanding 3-way interactions between continuous and categorical variables: small multiples

September 6, 2014 tomhouslay 8 Comments

It can be pretty tricky to interpret the results of statistical analysis sometimes, and particularly so when just gazing at a table of regression coefficients that include multiple interactions. I wrote a post recently on visualising these interactions when variables are continuous; in that instance, I computed the effect of X on Y when the moderator variables (Z, W) are held constant at high and low values (which I set as +/- 1 standard deviation from the mean). This gave 4 different slopes to visualise: low Z & low W, low Z & high W, high Z & low W, and high Z & high W. It was simple enough to plot these on a single figure and see what effect the interaction of Z and W had on the relationship between Y and X.

I had a comment underneath from someone who had a similar problem, but where the interacting variables consisted of a single continuous and 2 categorical variables (rather than 3 continuous variables). Given that we have distinct levels for the modifier variables Z and W then our job is made a little easier, as the analysis provides coefficients for every main effect and interaction. However, each of the categorical modifier variables here has 3 levels, giving 9 different combinations of Y ~ X. While we could take the same approach as last time (creating predicted slopes and plotting them on a single figure), that wouldn’t produce a very intuitive figure. Instead, let’s use ggplot2’s excellent ‘facets‘ function to produce multiple figures within a larger one.

This approach is termed ‘small multiples’, a term popularised by statistician and data visualisation guru Edward Tufte. I’ll hand it over to him to describe the thinking behind it:

“Illustrations of postage-stamp size are indexed by category or a label, sequenced over time like the frames of a movie, or ordered by a quantitative variable not used in the single image itself. Information slices are positioned within the eyespan, so that viewers make comparisons at a glance — uninterrupted visual reasoning. Constancy of design puts the emphasis on changes in data, not changes in data frames.”

–Edward Tufte, ‘Envisioning Information‘

Each small figure should have the same measures and scale; once you’ve understood the basis of the first figure, you can then move across the others and see how it responds to changes in a third variable (and then a fourth…). The technique is particularly useful for multivariate data because you can compare and contrast changes between the main relationship of interest (Y ~ X) as the values of other variables (Z, W) change. Sounds complex, but it’s really very simple and intuitive, as you’ll see below!

Ok, so the commenter on my previous post included the regression coefficients from her mixed model analysis that was specified as lmer(Y ~ X*Z*W+(1|PPX),data=Matrix). W and Z each have 3 levels: low, medium, and high, but these are to be treated as categorical rather than continuous – i.e., we get coefficients for each level. We’ll also disregard the random effects here, as we’re interested in plotting only the main effects.

I don’t have the raw data, so I’m simply going to plot the predicted slopes on values of X from 0 to 100; I’m first going to make this sequence, and enter all my coefficients as variables:

x <- seq(-100, 100)
 
int_global <- 	-0.0293
X_coef	<- 0.0007
WHigh <-	-0.0357
WMedium <-	0.0092
ZHigh <-	-0.0491
ZMedium <-	-0.0314
X_WHigh <-	0.0007
X_WMedium <-	0.0002
X_ZHigh <-	0.0009
X_ZMedium <-	0.0007
WHigh_ZHigh <-	0.0004
WMedium_ZHigh <-	-0.0021
WHigh_ZMedium <-	-0.0955
WMedium_ZMedium <-	0.0143
X_WHigh_ZHigh <- -0.0002
X_WMedium_ZHigh <-	-0.0004
X_WHigh_ZMedium <-	0.0013
X_WMedium_ZMedium <-	-0.0004

The reference values are Low Z and Low W; you can see that we only have coefficients for Medium/High values of these variables, so they are offset from the reference slope. The underscore in my variable names denotes an interaction; when X is involved then it’s an effect on the slope of Y on X, whereas otherwise it affects the intercept.

Let’s go ahead and predict Y values for each value of X when Z and W are both Low (i.e., using the global intercept and the coefficient for the effect of X on Y):

y.WL_ZL <- int_global + (x * X_coef)
 
df.WL_ZL <- data.frame(x = x,
                       W = "W:Low",
                       Z = "Z:Low",
                       y = y.WL_ZL)

Now, let’s change a single variable, and plot Y on X when Z is at ‘Medium’ level (still holding W at ‘Low’):

# Change Z
 
y.WL_ZM <- int_global + ZMedium + (x * X_coef) + (x * X_ZMedium)
 
df.WL_ZM <- data.frame(x = x,
                       W = "W:Low",
                       Z = "Z:Medium",
                       y = y.WL_ZM)

Remember, because the coefficients of Z/W are offsets, we add these effects on top of the reference levels. You’ll notice that I specified the level of Z and W as ‘Z:Low’ and ‘W:Low’; while putting the variable name in the value is redundant, you’ll see later why I’ve done this.

We can go ahead and make mini data frames for each of the different interactions:

y.WL_ZH <- int_global + ZHigh + (x * X_coef) + (x * X_ZHigh)
 
df.WL_ZH <- data.frame(x = x,
                       W = "W:Low",
                       Z = "Z:High",
                       y = y.WL_ZM)
 
# Change W
 
y.WM_ZL <- int_global + WMedium + (x * X_coef) + (x * X_WMedium)
 
df.WM_ZL <- data.frame(x = x,
                       W = "W:Medium",
                       Z = "Z:Low",
                       y = y.WM_ZL)
 
y.WH_ZL <- int_global + WHigh + (x * X_coef) + (x * X_WHigh)
 
df.WH_ZL <- data.frame(x = x,
                       W = "W:High",
                       Z = "Z:Low",
                       y = y.WM_ZL)
 
# Change both
 
y.WM_ZM <- int_global + WMedium + ZMedium + WMedium_ZMedium + 
  (x * X_coef) + 
  (x * X_ZMedium) +
  (x * X_WMedium) +
  (x * X_WMedium_ZMedium)
 
 
df.WM_ZM <- data.frame(x = x,
                       W = "W:Medium",
                       Z = "Z:Medium",
                       y = y.WM_ZM)
 
 
y.WM_ZH <- int_global + WMedium + ZHigh + WMedium_ZHigh + 
  (x * X_coef) + 
  (x * X_ZHigh) +
  (x * X_WMedium) +
  (x * X_WMedium_ZHigh)
 
 
df.WM_ZH <- data.frame(x = x,
                       W = "W:Medium",
                       Z = "Z:High",
                       y = y.WM_ZH)
 
 
y.WH_ZM <- int_global + WHigh + ZMedium + WHigh_ZMedium + 
  (x * X_coef) + 
  (x * X_ZMedium) +
  (x * X_WHigh) +
  (x * X_WHigh_ZMedium)
 
 
df.WH_ZM <- data.frame(x = x,
                       W = "W:High",
                       Z = "Z:Medium",
                       y = y.WM_ZH)
 
 
y.WH_ZH <- int_global + WHigh + ZHigh + WHigh_ZHigh + 
  (x * X_coef) + 
  (x * X_ZHigh) +
  (x * X_WHigh) +
  (x * X_WHigh_ZHigh)
 
 
df.WH_ZH <- data.frame(x = x,
                       W = "W:High",
                       Z = "Z:High",
                       y = y.WH_ZH)

Ok, so we now have a mini data frame giving predicted values of Y on X for each combination of Z and W. Not the most elegant solution, but fine for our purposes! Let’s go ahead and concatenate these data frames into one large one, and throw away the individual mini frames:

# Concatenate data frames
df.XWZ <- rbind(df.WL_ZL,
                df.WM_ZL,
                df.WH_ZL,
                df.WL_ZM,
                df.WL_ZH,
                df.WM_ZM,
                df.WM_ZH,
                df.WH_ZM,
                df.WH_ZH)
 
# Remove individual frames
rm(df.WL_ZL,
   df.WM_ZL,
   df.WH_ZL,
   df.WL_ZM,
   df.WL_ZH,
   df.WM_ZM,
   df.WM_ZH,
   df.WH_ZM,
   df.WH_ZH)

Now all that’s left to do is to plot the predicted regression slopes from the analysis!

# Plot
library(ggplot2)
 
ggplot(df.XWZ, aes(x,y)) + geom_line() + 
  facet_grid(W~Z, as.table = FALSE) +
  theme_bw()

First, I call the ggplot2 library. The main function call to ggplot specifies our data frame (df.XWZ), and the variables to be plotted (x,y). Then, ‘geom_line()‘ indicates that I want the observations to be connected by lines. Next, ‘facet_grid‘ asks for this figure to be laid out in a grid of panels; ‘W~Z’ specifies that these panels are to be laid out in rows for values of W and columns for values of Z. Setting ‘as.table’ to FALSE simply means that the highest level of W is at the top, rather than the bottom (as would be the case if it were laid out in table format, but as a figure I find it more intuitive to have the highest level at the top). Finally, ‘theme_bw()’ just gives a nice plain theme that I prefer to ggplot2’s default settings.

There we have it! A small multiples plot to show the results of a multiple regression analysis, with each panel showing the relationship between our response variable (Y) and continuous predictor (X) for a unique combination of our 3-level moderators (Z, W).

You’ll notice here that the facet labels have things like ‘Z:Low’ etc in them, which I coded into the data frames. This is purely because ggplot2 doesn’t automatically label the outer axes (i.e., as Z, W), and I find this easier than remembering which variable I’ve defined as rows/columns…

Hopefully this is clear – any questions, please leave a comment. The layout of R script was created by Pretty R at inside-R.org. Thanks again to Rebekah for her question on my previous post, and letting me use her analysis as an example here.

—-

Want to know more about understanding and visualising interactions in multiple linear regression? Check out my related posts:

Understanding three-way interactions between continuous variables

Three-way interactions between 2 continuous and 1 categorical variable

—-

Stats

Statistics workshop

August 26, 2014 tomhouslay Leave a comment

A quick note here to say that I shall be helping my PhD supervisor, Luc Bussière, teach a week-long workshop entitled ‘Advancing in R’ at the SCENE field station (on the bonny, bonny banks of Loch Lomond) in December. The course is aimed at biologists who have a basic to moderate knowledge of using R for statistics programming, and is designed to bridge the gap between basic R coding and advanced statistical modelling. Here is a list of the modules to be covered during the week:

Module 1: Introduction & data visualization using (graphics) and (ggplot2)
Module 2: Univariate regression, diagnostics & plotting fits
Module 3: Adding additional continuous predictors (multiple regression); scaling & collinearity
Module 4: Adding factorial (categorical) predictors & incorporating interactions (ANCOVA)
Module 5: Model selection & simplification (likelihood ratio tests, AIC)
Module 6: Mixed effects models in theory & practice
Module 7: Generalised Linear Models (binomial and count data)
Module 8: Nonlinear models (polynomial & mechanistic models)
Module 9: Combining methods (e.g., nonlinear mixed effect (NLME) models & generalised linear mixed effect (GLMM) models)
Module 10: One-on-one consultations/other advanced topics

Full details of the course can be found at the PR~Statistics site.

We are also open to running the course at different locations in the future; please contact myself or Luc to discuss options.

Stats

Push notifications in R!

April 3, 2014 tomhouslay 3 Comments

Despite its great litany of flaws, I love R. I love it super-hard. One of the best things about it is that people are always making weird packages that solve your problems; one of the annoying things is that you tend to find them just after they’d have been really useful, and that maybe they solve only a part of the problem. But it’s the beauty of open-source software: people make solutions to particular problems they have, and then they put in the work to make that available to the rest of us.

Anyway, one of my gripes over the course of my studies has been that you can’t really run anything in the background in R while you continue to work on another section of code: a lot of the statistical models I’ve been running use MCMCglmm, which is an amazing package but can be pretty time-intensive. Furthermore, my computer at work wasn’t really up to much, so quite often I couldn’t even work on any writing while my analysis was running. In all honesty, I should have looked harder at the options for running R in the cloud (e.g. on Amazon or Domino), but what I really wanted was just something that would give me a little notification when the analysis was done: that way, I could go off and work on something else until the results were ready for me to look at. As it was, there was a lot of clicking back and forth between windows and getting massively frustrated (or going away for half an hour, only to come back and find that I’d specified something incorrectly and it had failed within 10 seconds and needed to be restarted!).

Enter Brian Connelly‘s new package, PushoverR!

PushoverR takes advantage of the Pushover API, a service that receives messages from an application and sends them out to specified devices. Connelly has created this package so that Pushover receives a message from R, and communicates this to your iPhone or Android device. It’s really easy to set up, so here’s how to get started:

First, you need to sign up for a Pushover account. Head on over to Pushover, and register your email address. You’ll then get your user key (important information number 1!). After confirmation, you need to register your application – in this case, R! Here’s how I set up mine:

Once this is done, you’ll get an application token (important information number 2!). Now, you need to install ‘Pushover’ on your client device – this costs about £3, and is available from the App Store or Google Play:

Now, we’re ready to go! Fire up your favourite R development environment (which should be RStudio, unless you’re a MORON), and get the relevant library installed and called up:

############## Libraries ############
 
install.packages("pushoverr")
library(pushoverr)

Now, it’s a good idea to set up variables to hold your user ID and application token so that you don’t have to keep pasting them for any message you want to send. To protect myself from anyone here sending me messages to my phone, I’ve replaced the actual tokens with some nonsense:

############## My settings ###########
 
userID <- "u5schwarzenegger"
appToken <- "a1dontdisturbmyfriendhesdeadtired"

Anyway, we’re now ready to run some analysis – given that my time-consuming code tends to run via MCMCglmm, I’m just going to copy one of the examples given in Jarrod Hadfield’s excellent documentation. Then I’m going to send a simple message to myself that tells me when the analysis is complete; you could, of course, send some of the results in the message, but for now let’s stick to a very basic notification:

############ Run something ##########
 
library(MCMCglmm)
 
data(BTdata)
 
prior.m3a.4 = list(R = list(V = diag(1), nu = 0.002), 
                   G = list(G1 = list(V = diag(3) * 0.02, 
                                      nu = 4), 
                            G2 = list(V = 1, nu = 0.002)))
 
m3a.4 <- MCMCglmm(tarsus ~ sex, 
                  random = ~us(sex):dam + 
                    fosternest, 
                  data = BTdata, 
                  verbose = FALSE, 
                  prior = prior.m3a.4)
summary(m3a.4)
 
msg <- "Analysis complete"
 
## Send message
pushover(message = msg, 
         user = userID, 
         token = appToken)

We can set this to execute, and – while it runs – go and do something better, like watch youtube vidoes of Arnie violencing up some guys and then zinging their corpses. BLAM

Anyway. What do you know – my phone popped up with a little message!

SWEET.

This is the very basic functionality of PushoverR – check out Brian’s ‘Readme‘ file for some more details of little extra bits. All of the info here was basically taken from that document, but I’ve added some pictures to make it more fun.

Let’s have another Arnie video as well.

Stats

Understanding 3-way interactions between continuous variables

March 21, 2014 tomhouslay 17 Comments

A recurrent problem I’ve found when analysing my data is that of trying to interpret 3-way interactions in multiple regression models. As always, the mantra of PLOT YOUR DATA* holds true: ggplot2 is particularly helpful for this type of visualisation, especially using facets (I will cover this in a later post). Easier said than done, though, when all three predictor variables are continuous. How do interactions between continuous variables work, and how can we plot them?

First, let’s think of a simple example: consider a response such as the time a person takes to run 100m. We might expect that height would have a significant effect, with taller people covering the ground more quickly. A continuous by continuous interaction (two-way) would mean that the effect of height on 100m time depends on another continuous variable – for example, weight. A significant height x weight interaction would mean that the slope of height on 100m time changes as weight changes: the effect of height might be mitigated at heavier weights. Already, this is getting a little tricky to visualise in your head, so what about when we add in another predictor (e.g. age, resting heart rate, volume of beer** consumed the night before, that kind of thing)?

Some quick googling brings up a nice resource from the Institute for Digital Research and Education at UCLA, showing how to visualise three-way interactions between continuous variables using the statistical software Stata. However, I don’t want to use Stata. I want to use R. Just like you probably do. And if you don’t probably do, you should definitely do, and if you don’t now definitely do, you should probably leave. I think. Anyway, I decided that I would take their example and convert it to R so that we can follow along. Hooray for me!

I said, HOORAY FOR ME.

Ok, so we’re going to use a simple approach to explain a three-way interaction: computing the slopes of the dependent variable (e.g. 100m time) on the independent variable (e.g. height) when moderator variables (e.g. weight and last night’s beer volume) are held constant at different combinations of high and low values. Basically, this would show the effect of height on 100m time at these combinations:

Heavy weight, much beer
Heavy weight, little beer
Light weight, much beer
Light weight, little beer

Simple, huh? Let’s take a look at the model formula, where Y is the response variable (100m time), X the predictor (height), and Z and W being the moderators (weight, last night’s beer volume):

Y = b0 + b1X + b2Z + b3W + b4XZ + b5XW + b6ZW + b7XZW

This can be reordered into two groups: the first defines the intercept (terms that do not contain X), and the second defines the simple slope (terms that do contain X):

Y = (b0 + b2Z + b3W + b6ZW) + (b1 + b4Z + b5W + b7ZW)X

We can define high values of Z and W as being one standard deviation above their respective means and will denote them as zH and wH; the low values are one standard deviation below their means (zL and wL). As in the example I gave above, this gives 4 possible combinations: zHwH, zHwL, zLwH, zLwL. As an example, here is the formula when both Z and W are at high values:

simple slope 1 = b1 + b4zH + b5wH + b7zHwH

intercept 1 = b0 + b2zH + b3wH + b6zHwH

I’m going to use the same data set given in the example that I’m converting, and you can find the Stata file here. Don’t worry, we can convert this to R data format easily enough with the R package ‘foreign’:

library(foreign)
hsb2 <- read.dta("hsb2.dta")

As we’re not really concerned with the innards of this data set (yes, I’m afraid it’s not really the effect on running time of height, weight and beer consumption), let’s rename the variables that we’ll be using to Y, X, Z and W:

colnames(hsb2)[c(8,7,9,10)] <- c("y","x","z","w")

IT’S TIME TO REGRESS!

lm.hsb2 <- lm(y ~ x*z*w, data = hsb2)
summary(lm.hsb2)

…giving the following output:

Residuals:
     Min       1Q   Median       3Q      Max 
-20.0432  -4.8576   0.6129   4.0728  17.6322 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  1.665e+02  9.313e+01   1.788   0.0753 .
x           -3.014e+00  1.803e+00  -1.672   0.0962 .
z           -3.228e+00  1.926e+00  -1.676   0.0954 .
w           -2.435e+00  1.757e+00  -1.386   0.1674  
x:z          7.375e-02  3.590e-02   2.054   0.0413 *
x:w          5.531e-02  3.252e-02   1.701   0.0906 .
z:w          6.111e-02  3.503e-02   1.745   0.0827 .
x:z:w       -1.256e-03  6.277e-04  -2.001   0.0468 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 6.896 on 192 degrees of freedom
Multiple R-squared:  0.4893,	Adjusted R-squared:  0.4707 
F-statistic: 26.28 on 7 and 192 DF,  p-value: < 2.2e-16

You can see that the three-way interaction between the continuous variables (i.e., X:Z:W) is ‘significant’, just about. Let’s save any grousing about p-values for another day, shall we? Also, I would normally centre and standardise these types of continuous predictor variables before regression (see Gelman 2007, Gelman & Hill 2007, or Schielzeth 2010 for more details on why), but in this case I’m just converting code from a given example so we’ll take things at face value…

The next thing I want to do is to create the slope and intercept values for our 4 combinations of values (high Z + high W, high Z + low W, etc). I use the original data to create high and low values for Z and W (mean plus/minus one standard deviation), then use coefficients from the linear regression to compute the intercepts and slopes:

## High / low for w and z
zH <- mean(hsb2$z) + sd(hsb2$z)
zL <- mean(hsb2$z) - sd(hsb2$z)
wH <- mean(hsb2$w) + sd(hsb2$w)
wL <- mean(hsb2$w) - sd(hsb2$w)
 
## Get coefficients from regression
coefs.hsb2 <- coef(lm.hsb2)
coef.int <- coefs.hsb2["(Intercept)"]
coef.x <- coefs.hsb2["x"]
coef.z <- coefs.hsb2["z"]
coef.w <- coefs.hsb2["w"]
coef.xz <- coefs.hsb2["x:z"]
coef.xw <- coefs.hsb2["x:w"]
coef.zw <- coefs.hsb2["z:w"]
coef.xzw <- coefs.hsb2["x:z:w"]
 
## Create slopes
zHwH <- coef.x + zH*coef.xz + wH*coef.xw + zH*wH*coef.xzw
zHwL <- coef.x + zH*coef.xz + wL*coef.xw + zH*wL*coef.xzw
zLwH <- coef.x + zL*coef.xz + wH*coef.xw + zL*wH*coef.xzw
zLwL <- coef.x + zL*coef.xz + wL*coef.xw + zL*wL*coef.xzw
 
## Create intercepts
i.zHwH <- coef.int + zH*coef.z + wH*coef.w + zH*wH*coef.zw
i.zHwL <- coef.int + zH*coef.z + wL*coef.w + zH*wL*coef.zw
i.zLwH <- coef.int + zL*coef.z + wH*coef.w + zL*wH*coef.zw
i.zLwL <- coef.int + zL*coef.z + wL*coef.w + zL*wL*coef.zw

Now, I want to create a data frame that uses these intercepts and slopes to ‘predict’ lines: because I know that the lines are going to be straight, I simply compute the value of Y for the minimum and maximum values of X for each Z:W combination. I’m pretty sure it can be done with far less code than I’ve used here, but there you go:

## a set of values of x
x0 <- seq(min(hsb2$x), max(hsb2$x), length.out = 2)
 
df.HH <- data.frame(x0 = x0)
df.HH$y0 <- i.zHwH + df.HH$x0*zHwH
df.HH$type <- rep("zHwH", nrow(df.HH))
 
df.HL <- data.frame(x0 = x0)
df.HL$y0 <- i.zHwL + df.HL$x0*zHwL
df.HL$type <- rep("zHwL", nrow(df.HL))
 
df.LH <- data.frame(x0 = x0)
df.LH$y0 <- i.zLwH + df.LH$x0*zLwH
df.LH$type <- rep("zLwH", nrow(df.LH))
 
df.LL <- data.frame(x0 = x0)
df.LL$y0 <- i.zLwL + df.LL$x0*zLwL
df.LL$type <- rep("zLwL", nrow(df.LL))
 
## Create final data frame
df.pred <- rbind(df.HH, df.HL, df.LH, df.LL)
 
## Remove unnecessary data frames
rm(df.HH, df.HL, df.LH, df.LL)

Now, it’s time to plot the data! I’ve used ggplot2, and specified different colours and line types for each of our 4 combinations: red denotes high Z-value, and solid line denotes high W-value. I’ve also plotted the original data over the top:

## Call library
library(ggplot2)
 
## Convert 'type' to factor
df.pred$type <- factor(df.pred$type)
 
## Draw plot
ggplot(df.pred, aes(x = x0,
                    y = y0)) +
  geom_line(aes(colour = type,
                linetype = type)) +
  geom_jitter(data = hsb2, 
            aes(x = x,
                y = y),
             size = 3,
             alpha = 0.7) + 
  theme_bw() +
  theme(panel.background = element_rect(fill="ivory")) +
  theme(legend.key = element_blank()) +
  theme(text = element_text(size = 15)) +
  scale_colour_manual(name = "Partial effect",
                      labels = c("High z High w", 
                                 "High z Low w", 
                                 "Low z High w", 
                                 "Low z Low w"),
                      values = c("#E41A1C", 
                                 "#E41A1C", 
                                 "#377EB8", 
                                 "#377EB8")) +   
  scale_linetype_manual(name = "Partial effect",
                     labels = c("High z High w", 
                                "High z Low w", 
                                "Low z High w", 
                                "Low z Low w"),
                     values = c("solid", 
                                "longdash", 
                                "solid", 
                                "longdash"))

The finished article:

Clearly, three of the slopes are very similar (differing mostly in terms of intercept), while one is very different to the rest. The relationship between Y and X is changed dramatically when there are high values of Z in combination with low values of W. It can still be hard to grasp these in abstract terms, so let’s refer back to my initial example (just for ‘fun’): this would mean that when weight is high [Z] and beer consumption is low [W], there is a steep relationship between height [X] and running time [Y]: short, heavy teetotallers are slow, but tall, heavy teetotallers are really fast.

Remember, these lines are theoretical, and drawn from the regression coefficients rather than the data itself; it’s up to you to make sure that short, heavy teetotallers are represented within your data set. I, for one, drink quite a lot of beer.

—-

Want to know more about understanding and visualising interactions in multiple linear regression? Check out my follow-up posts:

Using ‘small multiples’ to visualise 3-way interactions between 1 continuous and 2 categorical variables

Three-way interactions between 2 continuous and 1 categorical variable

—-

* My supervisor calls this the ‘ADF method’, as in ‘ANY DAMN FOOL (can see what’s happening if you just actually plot your data)’

** I originally had beer count, but my esteemed colleague Lilly Herridge (who knows what she is talking about much more than I do) pointed out that counts are discrete, not continuous, hence the change to volume. SMARTS.

Code in this post was adapted from the UCLA Statistical Consulting Group’s page on three-way continuous interactions, and highlighted using inside-R’s ‘Pretty-R‘ tool.

Stats, Twitter

Luis Apiolaza’s tips for a good regression course

July 20, 2013 tomhouslay Leave a comment

On Twitter yesterday, Luis Apiolaza shared some tips that he’d given a colleague on what students should learn in a regression course. These are pretty great, so I thought I’d include them as a post here (mostly because that’s useful for me, but also because it’s as if I’ve written a blog post when all I’ve done is screen-grabbed some guy’s twitter feed). Luis is a quantitative geneticist and lecturer at the School of Forestry in Christchurch, New Zealand, wrote the ASReml (and ASReml-R) cookbook, runs the excellent Quantum Forest blog that has a general theme of data analysis, is good at the twitter, and – to quote Justin Bieber’s most recent analysis of Bill Clinton – is a ‘#greatguy’*.

* Coincidentally, I also had a turnaround of my views on Apiolaza after video emerged of me exiting a computing lab with a group of unruly S+ users, urinating into a paper recycling bin, spraying cleaning fluid onto a print-out of the ASReml cookbook, and shouting ‘F*ck Luis Apiolaza!’.

Tom Houslay

Tag Archives: statistics

New paper: ‘Avoiding the misuse of BLUP in behavioural ecology’

Of carts and horses

Understanding 3-way interactions between continuous and categorical variables, part ii: 2 cons, 1 cat

Understanding 3-way interactions between continuous and categorical variables: small multiples

Statistics workshop

Push notifications in R!

Understanding 3-way interactions between continuous variables

Luis Apiolaza’s tips for a good regression course

Evolution // Behaviour // Life History