Collider Bias As A Tool

RProgramming
code
analysis
statistics
topTips
sport
politics
music
history
Author: The usual

Published: August 25, 2025

Reading Advice: If you have already heard of collider bias, then you can skip the examples and go straight to the “Goodhart’s Law” section

Disclaimer: I use “song”/“track” roughly interchangeably in this post, even though they are not exactly the same

Quality Rating: 7/10

Collider bias is when a statistical analysis is corrupted by selection on a “collider” that influences which data are available for the analysis (a Wikipedia article about it under a different name1 can be found here). That’s a bit long-winded and unclear, so in my experience it’s better to “feel” what it is by way of example(s).

Examples

Here are some common examples of collider bias in action:

  • In entertainment, you look at the ratings for books-that-have-been-made-into-films on Goodreads and RottenTomatoes (other review sites available), and find that being a well-rated book predicts being a worse-rated film2
  • In sport, you look at a dataset of Major League Baseball (MLB) players, and spot that fielding ability is inversely correlated with batting ability3
  • In politics, you predict what the vote share for big-blue-party-A is in a collection of constituencies/states, using the vote share of big-red-party-B, but find that the negative relationship is not as strong as you expected going in, or is even positive4
  • In medicine, you look at people in hospital and discover a negative correlation between diseases and some risk factors for diseases (this is the example originally used by Berkson5)
  • In relationships, handsome men are often reported to be “jerks”6
  • In media, you look at the content being recommended to you and find that the content contains a mix of very popular content, and less popular content that is specific to your interests/past-history7

In all of the above, I believe that the data to confirm/disconfirm is publicly available or obtainable with some effort. I will not hesitate to update the whole post if one/more are disconfirmed. I believe them all to be legitimate cases of collider bias.
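To get that feel concretely, here is a minimal simulated sketch loosely based on the first example (all numbers are invented; the 1.5 threshold simply stands in for “prominent enough to get made and widely reviewed”):

# Minimal simulated sketch of collider bias (made-up data, loosely based on the book/film example)
set.seed(1)
nBooks = 10000
bookQuality = rnorm(nBooks)   # standardised book quality
filmQuality = rnorm(nBooks)   # independent of book quality by construction
# Selection on a 'collider': only sufficiently appealing projects get made and widely reviewed
observed = (bookQuality + filmQuality + rnorm(nBooks, sd=0.5)) > 1.5
cor(bookQuality, filmQuality)                       # roughly zero in the full population
cor(bookQuality[observed], filmQuality[observed])   # clearly negative among the observed adaptations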

Collider Bias - Twinned with Goodhart’s Law

One thing I haven’t often found discussed regarding collider bias is its sinister side - it often brings Goodhart’s Law along for the ride. Goodhart’s Law (and similar such ‘laws’ and critiques) observes that when selection pressure is put upon a metric of sorts, then it ceases to be a useful metric.

In the first of the above examples, I can picture some Hollywood executive with data showing that films based on a book get better reviews on average (or possibly earn more money). That nudges their marginal decisions towards making films based on books, and those marginal extra films are likely to exacerbate the collider bias effect: the marginal choice is probably still an above-average book, but the resulting film is (in expectation) likely to be an average (or even below-average) one.

In the second of the above examples, good metrics on batting ability became available before good metrics on fielding ability in MLB - so I would hypothesise that the magnitude of the correlation grew over that period (and it may still be larger today than it was before data drove many teams’ recruitment/playing decisions - for example, this could occur if it is ‘easier’ to teach a good batter to be OK at fielding than vice versa).

I encourage you, dear reader, to think about whether Goodhart’s Law can apply to the other examples. The rest of this post will deal with what I suspect to be a Goodhart’s Law consequence related to the last example.

My Recommendation

Other than “keep selection biases in mind always”, my main recommendation in this post is that collider bias itself can be used as a tool. For example, if a large or influential organisation/person has leant on an algorithm to decide what to share/show/promote, you can potentially use any resulting collider bias to determine what you should choose.

The example that is clearest in my day-to-day life is the last one. I don’t think it’s too controversial to suggest that large technology companies have optimised their algorithms towards both engagement and being able to serve more adverts, and that this is more prevalent when they have a near-monopoly. In the realm of music, I have a strong suspicion (not confirmed with comprehensive data) that Spotify have tweaked their algorithm over the years towards favouring shorter songs. I often look at a particular music artist’s page (particularly for artists who aren’t “current”), where their top 10 tracks are displayed, and find that their longest tracks sit lower down the list than their total play counts would suggest8. Of course, it is possible that listeners just generally prefer shorter songs nowadays, but at the moment the algorithm-knob-twiddling is my favoured theory (maybe because it allows me to feel like a low-stakes conspiracy theorist, given I have no smoking-gun evidence).

If Spotify do tweak their algorithm like this9, and we also believe that play count is predictive of song quality (which is hopefully reasonable in expectation), then you can use collider bias to your advantage. Look at an artist’s top 10 hits, and mentally (or with statistically fitted parameters if you’re feeling ambitious) choose the tracks that combine a medium-to-high number of listens with a longer length. Voilà! You should get an improved listening experience compared to listening to shorter tracks with equivalent play counts. If the artist is still releasing new music, then you could compare play count and track length within each album, to skirt around recent tracks simply having had less time to accumulate listens. Feel free to play about with the R code of the toy model below to see how much this improves the music you listen to (and/or how many fewer adverts you might listen to!).
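As a concrete sketch of that mental adjustment, the snippet below re-ranks an entirely invented top-10 list (hypothetical track names, play counts and lengths) by combining standardised play count with track length; the 0.5 weight on length is an arbitrary choice expressing “give longer tracks some benefit of the doubt”:

# Hypothetical top-10 list (invented numbers) re-ranked to slightly favour longer tracks
top10 = data.frame(
  track      = paste("Track", 1:10),
  playsM     = c(120, 95, 90, 80, 72, 65, 60, 55, 50, 48),  # plays in millions (made up)
  lengthSecs = c(190, 200, 185, 210, 340, 180, 400, 195, 230, 520)
)
# Standardise both quantities, then nudge the ranking towards longer tracks (0.5 is an arbitrary weight)
top10$adjustedScore = as.numeric(scale(log(top10$playsM))) + 0.5 * as.numeric(scale(top10$lengthSecs))
top10[order(-top10$adjustedScore), ]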

Warning! Information Hazard

Of course, if all users of Spotify respond in the way I’m suggesting, then they might get back to the assumed optimal variety of tracks being played. But the incentives for Spotify to continue to tweak their algorithm to maximise revenue remain, and so if they change their algorithm again we might end back at square one, except that Spotify have a less useful algorithm. And so on: this could be a negative-sum game, which could even hit the limit of Spotify becoming a glorified radio where there is no consumer choice. Thoughts of how to end up in a better situation are welcome, but for now, maybe don’t share this post extremely widely, and if you take my listening advice let’s hope this blog doesn’t become too popular!

Toy Model

Below is some code that implements a toy model of the above. It uses the free, open-source R programming language. At some point I aim to host interactive shiny apps on this blog, or even allow readers to run the R code themselves in their browser, and this model is an excellent candidate. If any readers are familiar with R (or want to learn!), then feel free to take this code, run it yourself, and change the seed or any of the initial parameters while I work on making it genuinely interactive on the blog. If anyone wants to swap the simulated data for real listening habits, feel free - I’d be very interested to know how and where this toy model differs from reality (as it must: all models are wrong), and whether it’s still useful. Below I’ve listed some bullet points on what behaviour I think can be achieved with suitable choices of parameters, but maybe you can think of extensions or refinements to the model that bring new insights to bear? Do you disagree with any of the simulated-data or modelling assumptions?

  • By changing correlationBetweenTrackLengthAndSongQuality, you can make the effect of algorithmic choice more/less extreme
  • By changing some of the simulation parameters, you can change which artists are affected the most, and/or the overall size of the effect
  • By changing the algorithm preference parameters, you can change the magnitude of the effect
library(ggplot2)
library(viridis)
Loading required package: viridisLite
library(data.table)
library(mgcv)
Loading required package: nlme
This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
set.seed(91)

# Simulating data of albums/tracks/artists
totalNumberOfArtists = 10
nSuccessfulTrialsPerArtist = 2 # this might have interesting statistical/musical meaning in a negative binomial model
meanSuccessfulAlbumsPerArtist = 4
goodAlbumMinimumQuality = 4
meanAlbumLength = 40*60
sdAlbumLength = 5*60
globalMeanTrackLength = 200
globalSDTrackLength = 60
minimumTrackLength = 30
# Now the actual modelling parameters:
correlationBetweenTrackLengthAndSongQuality = 0 # Feel free to play about with this parameter
# quality is on the 1-10 scale
algorithmPreferenceForShorterSongs = 0.01 # the units of this could be viewed as something like expected millions of listens extra per second less duration
algorithmPreferenceForBetterSongs = 0.8 # the units of this could be viewed as something like expected millions of listens extra per one unit of quality
betaParForTypicalSongQuality = 3 # exercise: does this parameter have any meaning musically?
qualityCutOff = 7

# generate the fake listening data - but could substitute for actual data if desired

simulateAlbumsAndTracks = function (totalNumberOfArtists, nSuccessfulTrialsPerArtist, meanSuccessfulAlbumsPerArtist, goodAlbumMinimumQuality, meanAlbumLength, sdAlbumLength, globalMeanTrackLength, globalSDTrackLength, minimumTrackLength) {
  artistIds = seq_len(totalNumberOfArtists)
  artistAlbumNumbers = rnbinom(totalNumberOfArtists, size=nSuccessfulTrialsPerArtist, mu=meanSuccessfulAlbumsPerArtist)
  artistGoodAlbumNumbers = round(runif(totalNumberOfArtists, min=0, max=1) * artistAlbumNumbers)
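  # Moment-match a negative binomial for track length above the minimum: rnbinom has
  # variance mu + mu^2/size, so size = mu^2/(variance - mu) targets a spread of roughly globalSDTrackLength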
  artistMeanTrackLength = minimumTrackLength + rnbinom(totalNumberOfArtists, size=(globalMeanTrackLength-minimumTrackLength)^2/(globalSDTrackLength^2 - (globalMeanTrackLength-minimumTrackLength)), mu=globalMeanTrackLength-minimumTrackLength)

  albums = data.table(artistId=rep(artistIds, artistAlbumNumbers))
  albums[, albumId := .I]
  albums[, isGoodAlbum := sample(c(rep(1, artistGoodAlbumNumbers[as.integer(.BY[[1]])]), rep(0, .N-artistGoodAlbumNumbers[as.integer(.BY[[1]])])), size=.N), by=artistId]
  # Assign album ratings on a 1-10 scale according to some model that includes the total number of albums produced by that artist - could potentially add some autocorrelation here if it seems realistic
  albums[, albumScore := fifelse(isGoodAlbum==1, runif(.N, min=goodAlbumMinimumQuality + 6*(1-exp(-.N/20)), max=10), runif(.N, min=1, max=1 + goodAlbumMinimumQuality*(1-exp(-.N)))), by=artistId]
  # Assume that the best albums contain the best hits
  albums[, artistMeanTrackLength := artistMeanTrackLength[artistId]]
  albums[, approxAlbumLength := minimumTrackLength + rnbinom(.N, size=(meanAlbumLength-minimumTrackLength)^2/(sdAlbumLength^2 - (meanAlbumLength-minimumTrackLength)), mu=meanAlbumLength-minimumTrackLength)]

  # Instead of working out, just simulate a lot of songs and then pick the first N that is closest to the claimed approxAlbumLength
  allTracks = rbindlist(lapply(seq_len(nrow(albums)), function (albumId) {
    # Generate 1000 songs - far more than is typically needed, but by chance it's technically possible for lots of songs to have very low length
    artistMeanTrackLength = albums$artistMeanTrackLength[albumId]
    allPlausibleSongs = data.frame(trackNumber=1:1000, trackLength=minimumTrackLength + rnbinom(1000, size=(artistMeanTrackLength-minimumTrackLength)^2/(globalSDTrackLength^2 - (artistMeanTrackLength-minimumTrackLength)), mu=artistMeanTrackLength-minimumTrackLength))
    allPlausibleSongs$albumLengthSoFar = cumsum(allPlausibleSongs$trackLength)
    curAlbumApproxLength = albums$approxAlbumLength[albumId]
    albumCrossThresholdPoint = first(which(allPlausibleSongs$albumLengthSoFar > curAlbumApproxLength))
    if (length(albumCrossThresholdPoint) == 0) {warning("Unable to generate a reliable album, so returning all plausible simulated songs"); return(allPlausibleSongs[1:1000, ])}
    if (abs(allPlausibleSongs$albumLengthSoFar[albumCrossThresholdPoint]-curAlbumApproxLength) < abs(allPlausibleSongs$albumLengthSoFar[pmax(1, albumCrossThresholdPoint-1)]-curAlbumApproxLength)) {
      returnDF = allPlausibleSongs[seq_len(albumCrossThresholdPoint), ]
    } else {
      returnDF = allPlausibleSongs[seq_len(pmax(1, albumCrossThresholdPoint-1)), ]
    }

    return(returnDF)
  }), idcol="albumId")
  allTracks[albums, on="albumId", `:=`(artistId = i.artistId, isGoodAlbum = i.isGoodAlbum, albumScore = i.albumScore)]
  albums[allTracks[, .(observedAlbumLength=sum(trackLength), nSongs=.N), by=albumId], on="albumId", `:=`(albumLength = i.observedAlbumLength, nSongs = i.nSongs)]

  return(list(albums=albums, allTracks=allTracks))
}
simulatedMusicData = simulateAlbumsAndTracks(totalNumberOfArtists=totalNumberOfArtists, nSuccessfulTrialsPerArtist=nSuccessfulTrialsPerArtist, meanSuccessfulAlbumsPerArtist=meanSuccessfulAlbumsPerArtist, goodAlbumMinimumQuality=goodAlbumMinimumQuality, meanAlbumLength=meanAlbumLength, sdAlbumLength=sdAlbumLength, globalMeanTrackLength=globalMeanTrackLength, globalSDTrackLength=globalSDTrackLength, minimumTrackLength=minimumTrackLength)
albums = simulatedMusicData$albums
allTracks = simulatedMusicData$allTracks

# Could maybe extend this so that better albums have a 'better flow' somehow
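# trackScore is 1 + 9*Beta(shape1, shape2): a higher albumScore (and, if correlationBetweenTrackLengthAndSongQuality
# is positive, a longer trackLength) increases shape1 and so pushes the expected track score towards 10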
allTracks[, trackScore := 1 + 9*rbeta(.N, shape1=0.8*albumScore + correlationBetweenTrackLengthAndSongQuality*trackLength, shape2=betaParForTypicalSongQuality)]

# Now the algorithm modelling - assume it isn't a perfect predictor of the 'true' track score, and also correlates with the album score (a "halo effect" of sorts)
allTracks[, neutralAlgorithmScore := pmax(1, pmin(10, rnorm(.N, mean=algorithmPreferenceForBetterSongs*trackScore + 0.15*albumScore, sd=1)))]
allTracks[, weightedAlgorithmScore := pmax(1, pmin(10, rnorm(.N, mean=algorithmPreferenceForBetterSongs*trackScore + 0.15*albumScore - algorithmPreferenceForShorterSongs*(trackLength-globalMeanTrackLength), sd=1)))]
# Calculate summary statistics of average quality of listened-to song
neutralSummaryStatsListenedTo = allTracks[neutralAlgorithmScore > qualityCutOff, c(lapply(.SD, mean), nTracks=.N), .SDcols=c("trackScore", "trackLength")]
weightedSummaryStatsListenedTo = allTracks[weightedAlgorithmScore > qualityCutOff, c(lapply(.SD, mean), nTracks=.N), .SDcols=c("trackScore", "trackLength")]
# Note: are these LMs statistically valid, given that we know the track and algorithm scores are between 1 and 10?
neutralListenedToLengthLM = allTracks[neutralAlgorithmScore > qualityCutOff, lm(trackScore ~ neutralAlgorithmScore + trackLength)]; print(summary(neutralListenedToLengthLM))

Call:
lm(formula = trackScore ~ neutralAlgorithmScore + trackLength)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5592 -0.4879  0.0754  0.6339  1.6749 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           2.7770439  0.9369151   2.964  0.00375 ** 
neutralAlgorithmScore 0.6105659  0.1174861   5.197  9.9e-07 ***
trackLength           0.0009472  0.0011031   0.859  0.39246    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8968 on 106 degrees of freedom
Multiple R-squared:  0.2196,    Adjusted R-squared:  0.2049 
F-statistic: 14.92 on 2 and 106 DF,  p-value: 1.956e-06
weightedListenedToLengthLM = allTracks[weightedAlgorithmScore > qualityCutOff, lm(trackScore ~ weightedAlgorithmScore + trackLength)]; print(summary(weightedListenedToLengthLM))

Call:
lm(formula = trackScore ~ weightedAlgorithmScore + trackLength)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.03316 -0.48879  0.06138  0.61045  1.99495 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.391501   0.796920   1.746   0.0833 .  
weightedAlgorithmScore 0.709423   0.092407   7.677 4.34e-12 ***
trackLength            0.003734   0.001246   2.997   0.0033 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8425 on 123 degrees of freedom
Multiple R-squared:  0.3364,    Adjusted R-squared:  0.3256 
F-statistic: 31.17 on 2 and 123 DF,  p-value: 1.118e-11
# Note the R-squared values on the above LMs - would you have forecast that the weightedAlgorithmScore can explain more of the 'true' track score? Would it still increase if algorithmPreferenceForBetterSongs was changed too?
# Generate approximate contours
contoursGAM = allTracks[, mgcv::bam(weightedAlgorithmScore ~ te(trackLength, trackScore), discrete=TRUE)]
# The weightedAlgorithmScore = qualityCutOff contour is interesting to plot
allTracks[, qualityCutOffContourVal := predict(contoursGAM)]

# Plot all the songs with quality on the y-axis and length on the x-axis, with a default algorithm that does not care about length
trackXAxisLims = c(30, 60*ceiling(max(allTracks$trackLength/60) + 30))
defaultPlot = ggplot(allTracks, aes(x=trackLength, y=trackScore)) + scale_x_continuous(breaks=seq(trackXAxisLims[1]+30, trackXAxisLims[2]-30, 60), minor_breaks=seq(30, trackXAxisLims[2], 60)) + scale_y_continuous(breaks=1:10) + scale_colour_viridis(limits=c(0.99, 10.01))
# Add on the effect of the algorithm on key metrics by colouring the points - report average time between songs and decrease in song quality
neutralPlot = defaultPlot + geom_point(alpha=0.7, aes(col=neutralAlgorithmScore)) + ggtitle(paste0("Mean track score listened to: ",round(neutralSummaryStatsListenedTo$trackScore, digits=1)," / 10"),
subtitle=paste0("Mean track length listened to: ",round(neutralSummaryStatsListenedTo$trackLength, digits=0),"s (",neutralSummaryStatsListenedTo$nTracks," tracks)"))
weightedPlot = defaultPlot + geom_point(alpha=0.7, aes(col=weightedAlgorithmScore)) + ggtitle(paste0("Mean track score listened to: ",round(weightedSummaryStatsListenedTo$trackScore, digits=1)," / 10"),
subtitle=paste0("Mean track length listened to: ",round(weightedSummaryStatsListenedTo$trackLength, digits=0),"s (",weightedSummaryStatsListenedTo$nTracks," tracks)"))
# The below is some form of density line of roughly where the cut-off lives in the trackLength-trackScore space
titlePlot = weightedPlot + geom_path(col="red", data=data.table::CJ(trackLength=seq(min(allTracks$trackLength), max(allTracks$trackLength), length.out=500), trackScore=seq(min(allTracks$trackScore), max(allTracks$trackScore), length.out=500))[, qualityCutOffContourVal := predict(contoursGAM, newdata=.SD)][qualityCutOffContourVal %between% (qualityCutOff + c(-0.03, 0.03))][order(trackLength)])
neutralPlot

weightedPlot

titlePlot

ggsave(filename="colliderBiasExample.jpg", plot=titlePlot)
Saving 7 x 5 in image
# Exercise: Outline a strategy within-artist that 'unhacks' the algorithm and leads to good average song quality once again

# You may like to view it as looking 'above' the red curve, and moving along the x-axis (increasing track length) and tracking what the track score typically is
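
# One rough sketch of such an 'unhacking' strategy, using the simulated objects above and
# arbitrary thresholds: within each artist, keep tracks that the length-weighted algorithm
# still rates reasonably well, but among those prefer the longer-than-artist-median ones
unhackedTracks = allTracks[weightedAlgorithmScore > qualityCutOff - 1]
unhackedTracks = unhackedTracks[, .SD[trackLength > median(trackLength)], by=artistId]
unhackedSummaryStats = unhackedTracks[, c(lapply(.SD, mean), nTracks=.N), .SDcols=c("trackScore", "trackLength")]
# Compare against naively listening to whatever the length-weighted algorithm surfaces
print(rbindlist(list(weightedAlgorithm=weightedSummaryStatsListenedTo, unhacked=unhackedSummaryStats), idcol="selection"))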

Post-script: On Berkson

Joseph Berkson was (to my knowledge) the first person to outline the effect discussed in this post. He was also, for several years, a prominent opponent of the idea that cigarette smoking causes lung cancer. That takes the count of prominent statisticians10 I am aware of who attacked the evidence for the harms attributed to smoking to more than one (another prominent example being Ronald Fisher). As a practising statistician today, living in the 21st century where it’s “obvious” that smoking increases your risk of multiple types of cancer, I find it rather fascinating to see my predecessors take on this cause. It seems like an area ripe for some historical analysis, with questions including:

  • Are statisticians more protective of their tools being used ‘incorrectly’ than practitioners in other fields?
  • Are statisticians more willing to ‘deviate from a consensus’ than other academics?
  • Are statisticians more likely to be led by ‘motivated reasoning’ towards a predetermined conclusion in their analysis than practitioners in other fields?
  • Have statisticians, on average, been on the ‘wrong’ or ‘right’ side of history?

Having done no research of my own so far, all I can provide are interesting data points/anecdotes that may shine some light on the above questions. Plenty of the early statisticians were also proponents of eugenics. Plenty of statisticians have taken up the cases of convicted murderers such as Sally Clark, who was later found to be the victim of a miscarriage of justice. There is at least one other case I am aware of: Simon Wood, the author of the mgcv R package, has been critical of the UK’s Covid-19 pandemic response (https://rss.org.uk/RSS/media/File-library/Events/Discussion%20meetings/covid-final-preprint.pdf).

Footnotes

  1. Collider bias is perhaps more commonly known as “Berkson’s Paradox”, but I will not use that here as I personally do not consider it to be a paradox at all.↩︎

  2. There is an excellent Hannah Fry Numberphile YouTube video about this book/film example: https://www.youtube.com/watch?v=FUD8h9JpEVQ↩︎

  3. I believed this to come from an R package called ShinyBaseball, which includes a baseball example: https://github.com/bayesball/ShinyBaseball/blob/main/inst/shiny-examples/BerksonBA/app.R , with summaries here and here, but those links do not contain my example above. I don’t know whether I have misrecalled the precise collider bias I remembered about baseball, and/or whether I’ve accidentally come up with a new theory myself!↩︎

  4. I predict that what I am hinting at may be easier to recognise for readers in different countries; in the UK there is a long history of other parties earning small shares of the vote, often losing their deposits, and indeed of novelty candidates↩︎

  5. https://en.wikipedia.org/wiki/Berkson%27s_paradox#Original_illustration↩︎

  6. https://en.wikipedia.org/wiki/Berkson%27s_paradox#Dating_pool_example↩︎

  7. Since this is more of a subjective theory of mine, I am not aware of an exact example, but https://arxiv.org/pdf/2203.00376 superficially appears to be pointing at something similar↩︎

  8. Here are a couple of anecdotal examples from the time of writing. The Gorillaz have over 100 million more listens for the 5:40 “Clint Eastwood” than the 3:53 “On Melancholy Hill”, but they are at positions 3 and 2 respectively. John Coltrane has over 47 million listens for the 13:44 “My Favorite Things”, but it is only at position 10 below the much shorter “Its Easy To Remember” and “Nancy (with the Laughing Face)” with 39 million and 28 million listens respectively.↩︎

  9. A brief google allows me to feel comfortable with some confirmation bias that song length has decreased over the streaming era: this article seems to be based on some reasonable data in my opinion.↩︎

  10. “Prominent statistician” is loosely defined as being those who have statistical techniques/ideas named after them, or being the first to outline a statistical technique I have used, although there are definitely problems with this definition!↩︎