The goal of this analysis is to assess the degree to which female versus male authors are overrepresented in prestigious authorship positions. In medical literature, both first and last author positions carry prestige – the first author position because this person is understood to be responsible for the publication's core ideas and most of the work, and the last author position because this person is seen as responsible for most of the intellectual and financial support of the project.

Bias, overrepresentation, parity

There are at least two possible manifestations of gender inequity in published medical literature. First, disparities in opportunities to conduct and publish research may result in differences between the proportion of a field's practitioners who are female and the proportion of its authors who are female. However, regardless of the proportion of female authors out of the total number of authors, disparities might also result in differences between female and male authors in the probability of occupying prestigious authorship positions.

On the first point, there is evidence in these data that diversity is increasing across fields, where diversity is measured (inversely) by the difference between the proportions of male and female authors and practitioners. This is what the analyses in the current draft of the paper show.

However, it is possible that certain fields and subfields may persistently overrepresent males or females for reasons that are not related to structural inequalities or inequalities in opportunity. In these cases, as well as in balanced cases, it is important to evaluate whether, given the gender composition that is present, prestige is distributed without regard for gender. That is, regardless of the overall practitioner or author gender composition, would a male author's probability of being a first or last author rather than a middle author be the same as a female author's? If there is true parity, the probability of being a first or last author versus a middle author should be the same for both genders and should depend only on the average length of author lists (that is, on the average number of middle authors).

Examples of (dis)parity

This may be easiest to understand with a toy example. Imagine that in the year 2049, the subfield of cybernetic surgery publishes 10 papers. This field sometimes has long author lists and sometimes short author lists. Also, strangely, all of the female author names begin with X and all of the male author names begin with Y (and all end in an integer). The whole field comprises 20 people, 14 of whom are men. Each author is on two papers this year, so we know a priori that the overall representation of male authors on papers does not deviate from the proportion of male authors in the field (although this isn't so important, as ultimately the population we care about is just the authors of papers). The ten author lists, in two sets of five, are:

X1,  Y1,  Y2 
X2,  X3,  Y3 
X4,  Y4,  Y5,  Y6,  X5,  Y7 
Y8,  X6,  Y9,  Y10 
Y11, Y12, Y13, Y14 
Y2,  Y1,  X1 
Y3,  X3,  X2 
Y7,  Y4,  Y5,  Y6,  X5,  X4 
Y8,  X6,  Y9,  Y10 
Y11, Y12, Y13, Y14 

Notice that the two sets of five author lists are identical except for swapping the first and last author positions for the authors of the first three papers in each group. This is by design, as will become apparent shortly.

Position does(n’t) depend on gender

To calculate the probability of any author being in any position (first, middle, last), we simply need to calculate the proportion of position-k authors out of the total number of authorships. The probability of an author being first author is simply the number of papers (N = 10) divided by the number of authorships (N = 40), or \(P(\text{first}) = 10\div 40 = 0.25\). This proportion is the same for the last author position, so half of all authorships are middle-author slots and \(P(\text{middle}) = 20\div 40 = 0.5\).

Now we can investigate these probabilities by gender. It's important to realize that we're no longer interested in the proportion of, say, female first authors out of all authors, but in the proportion of female first authors out of all female authors – that is, the conditional probability of being first author, given that one is female. The first proportion depends both on the total representation of women on author lists relative to men and on position, while the second depends only on authorship position. If we think of drawing author position like a fair lottery, it is easy to see that a female author's chance of winning the lottery, all else being equal, shouldn't deviate from the overall probability. We can see that this is the case in this example. Since we know that there are 6 women in the field, and each is on 2 papers, we know there are 12 female authorships. By counting the number of X's in the first column, we see that we have 3 female first authors, so \(P(\text{first}|F) = 3\div 12 = 0.25\). This is the same for last author, and so for middle author we have \(P(\text{middle}|F) = 6\div 12 = 0.5\). For males we can follow the same procedure, yielding \(P(\text{first}|M) = 7\div 28 = 0.25\) and \(P(\text{middle}|M) = 14\div 28 = 0.5\). So even in a field with unbalanced gender representation among authors, we get equal probability of authorship position within each gender. Technically, \(P(\text{first})=P(\text{first}|M)=P(\text{first}|F)\) – in other words, author position is not conditional on gender.
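
To make the counting concrete, here is a small R sketch (using the tidyverse, which the rest of this document loads) that encodes the ten toy author lists above and computes the conditional probability of each position given gender. The names papers and toyDat are just illustrative.

library(tidyverse)

# The ten toy author lists from the cybernetic surgery example; names beginning
# with X are female and names beginning with Y are male
papers <- list(
  c('X1','Y1','Y2'), c('X2','X3','Y3'), c('X4','Y4','Y5','Y6','X5','Y7'),
  c('Y8','X6','Y9','Y10'), c('Y11','Y12','Y13','Y14'),
  c('Y2','Y1','X1'), c('Y3','X3','X2'), c('Y7','Y4','Y5','Y6','X5','X4'),
  c('Y8','X6','Y9','Y10'), c('Y11','Y12','Y13','Y14'))

toyDat <- map_dfr(seq_along(papers), function(i){
  a <- papers[[i]]
  tibble(paper = i,
         author = a,
         sex = ifelse(grepl('^X', a), 'female', 'male'),
         position = case_when(
           seq_along(a) == 1         ~ 'first',
           seq_along(a) == length(a) ~ 'last',
           TRUE                      ~ 'middle'))
})

# P(position | gender): within each gender, the proportion of that gender's
# authorships falling in each position
toyDat %>%
  count(sex, position) %>%
  group_by(sex) %>%
  mutate(p_position_given_sex = n / sum(n))

Both genders come out at 0.25 / 0.5 / 0.25 for first / middle / last, matching the hand counts above.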

But we can make it so. With a little rearranging of the author lists above – say, swapping the three female first authors into middle positions – we get \(P(\text{first}|F) = 0\div 12 = 0\) and \(P(\text{first}|M) = 10\div 28 = 0.36\). In this case, the probability of being first, middle, or last author is now conditional on gender. This is consistent with some kind of process that prevents female authors from being awarded prestigious authorship positions, though of course there may be other explanations. Statistically, however, with a data set consisting only of author gender and position, this is all the evidence we can muster to try to answer this question.

Gender does(n’t) depend on position

There is another way one might assess the same disparity described above, and that is by flipping the question from "is the probability of being in a particular position conditional on gender?" to "is gender conditional on author position?". Although this may seem strange at first, it allows us to statistically evaluate the statement "the proportion of males and females in the position of first author is not different from the proportion of males and females in middle author positions." Again, consider the author lists above. Calculating the proportion of male versus female authors in the first author position, we see that \(P(M|\text{first}) = 7\div 10 = 0.7\), and for middle authorship, \(P(M|\text{middle}) = 14\div 20 = 0.7\). These are the same, meaning the probability of being male is not conditional on authorship position. Stated from the perspective of a logistic regression model, we wouldn't be able to gain any information about gender just by knowing author position. And intuitively, thinking back to the idea of a lottery, we see that this should be the case under perfect parity: if men and women are equally likely to win the lottery, and all we know about someone is that they won, our best guess as to the probability of that person being male would just be the population proportion of males.
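
In code, flipping the conditioning just means grouping the same counts the other way (this reuses the toyDat tibble from the sketch above):

# P(gender | position): within each authorship position, the proportion of
# authorships held by each gender
toyDat %>%
  count(position, sex) %>%
  group_by(position) %>%
  mutate(p_sex_given_position = n / sum(n))

Every position comes out at 0.7 male / 0.3 female, matching the overall author-list proportions.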

Examining this in the case of non-parity, we will see that suddenly author position does give us information about the gender of the author. With the female first authors moved to the middle, \(P(M|\text{first}) = 10\div 10 = 1\), which is not the same as \(P(M|\text{middle}) = 11\div 20 = 0.55\). Suddenly, knowing an author's position tells us a great deal about their gender, which should not be the case under perfect parity.

(As a bit of fun math, you can use Bayes' theorem to flip back and forth between \(P(\text{Gender}|\text{Position})\) and \(P(\text{Position}|\text{Gender})\).)
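
For example, plugging in the parity numbers from the toy field:

\[P(\text{first}|F) = \frac{P(F|\text{first})\,P(\text{first})}{P(F)} = \frac{(3/10)(10/40)}{12/40} = \frac{3}{12} = 0.25\]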

Again, I should stress the important limitation that such an analysis tells us nothing about the causes for this deviation from perfect parity. If we see that the proportion of men in middle author positions is increased beyond parity, perhaps we could conclude that they just want the free ride that comes from a low-effort author position. If we see more women in middle author positions than would be expected, we might conclude that women are being kept away from the prestigious ends of the author list. Both of these conclusions would clearly be reaching well beyond the data.

That is a good reminder that we actually have some data. Though the questions of overall gender representation in a field, on author lists, and in certain positions are good ones, they are only a few pieces of the picture.

Data description and analysis

First, I will provide descriptive graphs that illustrate, for each subfield, the relations between the kinds of probabilities described above. Then, I will estimate a hierarchical logistic regression model and plot the model estimates for each field.

Plotting the data

The easiest probabilities described above to understand graphically are those dealing with the conditional probability of authorship position given gender, compared across genders. We can describe the differences between those conditional probabilities for the two genders visually, which I'll do in the first set of plots, and we can also calculate a ratio between those probabilities, which should be equal to 1 in the case of perfect parity and either less than or greater than 1 as the data deviate from parity. Because ratios are asymmetric, with a range of \([0,\infty)\), I'll actually plot \(\log\Big(\frac{P(\text{Position}|F)}{P(\text{Position}|M)}\Big)\) so we get back to \((-\infty,\infty)\), with 0 as the parity point.
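
As a quick check of the metric using the toy field's balanced arrangement:

# Log of the female/male ratio of P(first | gender) in the balanced toy example;
# equal conditional probabilities give log(1) = 0, the parity point
log((3/12) / (7/28))  # = 0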

# These chunks use data.table (fread) and the tidyverse (dplyr, tidyr, ggplot2)
require(data.table)
require(tidyverse)

dataDir <- '/home/jflournoy/Documents/UO/UCLA_Gender_Pub/output/output/'
inputFiles <- dir(dataDir, 
                  pattern='author.*csv',
                  full.names = TRUE)

authorColClasses <- c(
        'logical',
        'character',
        'character',
        'character',
        'character',
        'character',
        'integer',
        'character',
        'character')

# fileDT <- rbindlist(lapply(inputFiles, fread, colClasses = authorColClasses)) %>%

if(!file.exists('./RData/summaries.RDS')){
  # For each subfield's file: read it, classify each authorship as first, last,
  # or middle, then summarise counts and proportions by gender and year
  summaries <- data.table(filename = inputFiles, stringsAsFactors = F) %>%
    group_by(filename) %>%
    do({
      aDT <- fread(as.character(.$filename[[1]]), colClasses=authorColClasses)
      result<-aDT %>%
        mutate(simple_sex=sub('.*?((fe)*male)','\\1',Author_Sex)) %>%
        filter(simple_sex %in% c('male','female'), pubDate >= 2002) %>%
        group_by(PubMed_ID) %>%
        mutate(last=Author_Rank_in_Paper == max(Author_Rank_in_Paper) & Author_Rank_in_Paper != 1) %>%
        group_by(simple_sex,pubDate) %>%
        summarise(
          n_tot=n(),
          first_author=sum(Author_Rank_in_Paper %in% 1),
          last_author=sum(last %in% TRUE),
          middle_author=sum(!(Author_Rank_in_Paper %in% 1) & !(last %in% TRUE))) %>%
        mutate(
          first_author_prop=first_author/n_tot,
          last_author_prop=last_author/n_tot,
          middle_author_prop=middle_author/n_tot,
          n_check=first_author+last_author+middle_author)
      result
    }) %>%
    mutate(term=sub('.*/author_centric_(.*)\\.csv','\\1',filename))
  saveRDS(summaries, './RData/summaries.RDS')
} else {
  summaries <- readRDS('./RData/summaries.RDS')
}

In this plot, the important comparison in each cell is the difference between the line for males and the line for females. As described above, in a situation with perfect parity, the lines should overlap. Across cells within a subfield, it is interesting to note that as author lists become longer over time, as in genetics, the probability of being a middle author increases.

propByGenderPlot <- summaries %>%
  filter(!(term %in% c('podiatry', 'plastic surgery', 'optometry'))) %>%
  select(filename:n_tot, matches('.*_prop'), term) %>%
  gather(stat, value, first_author_prop:middle_author_prop, convert=T) %>%
  mutate(stat_fact = factor(
    stat,
    levels = c('first_author_prop', 'middle_author_prop', 'last_author_prop'),
    labels = c('P(First|Gender)', 'P(Middle|Gender)', 'P(Last|Gender)')),
    term_fact = factor(term,
                       levels = unique(term),
                       labels = gsub(' ', '\n', unique(term)))) %>%
  ggplot(aes(x = as.numeric(pubDate), y = value)) +
  geom_point(aes(size = n_tot, color = simple_sex), alpha = .2) +
  geom_smooth(method = 'loess', aes(color = simple_sex), se=F) +
  facet_grid(term_fact ~ stat_fact) +
  guides(size = guide_legend('N authors'),
         color = guide_legend('Gender')) +
  theme_classic() +
  theme(strip.text = element_text(size = 8, margin = margin(0, 0, 0, 0, "in")),
        strip.background = element_rect(fill = '#efefef', linetype = 0),
        panel.spacing = unit(0, 'in'),
        panel.background = element_rect(fill = '#ffffff'),
        panel.border = element_rect(fill = NA, color = '#efefef', size = 1, linetype = 1)) + 
labs(x = 'Publication year', y = 'P(position | gender)')
print(propByGenderPlot)

In this plot we will see the log of the ratio of the two probabilities, which we can compare to a line at y = 0. If the line is above 0, female names are more likely to appear in that position; if it is below 0, male names are more likely to appear. Pay attention to the panels with the biggest dots, as dot size indexes the number of authors within a field. For example, notice that in education, female names are disproportionately likely to appear in the first position, increasingly so over time, and less likely to appear in the last position, with that ratio decreasing over time.

Be careful not to read the dots as confidence intervals – dot size indexes the number of authors, so the bigger the dot, the more sure you can be that the black line is a precise estimate.

propByGenderRatioPlot <- summaries %>%
  filter(!(term %in% c('podiatry', 'plastic surgery', 'optometry'))) %>%
  select(filename:n_tot, matches('.*_prop'), term) %>%
  gather(stat, value, first_author_prop:middle_author_prop, n_tot, convert=T) %>%
  unite(stat_sex, stat, simple_sex) %>%
  spread(stat_sex, value) %>%
  ungroup() %>%
  transmute(
    filename = filename,
    term = term,
    pubDate = pubDate,
    first_au_fm_ratio = first_author_prop_female / first_author_prop_male,
    middle_au_fm_ratio = middle_author_prop_female / middle_author_prop_male,
    last_au_fm_ratio = last_author_prop_female / last_author_prop_male,
    n_tot = n_tot_female + n_tot_male) %>%
  gather(stat, value, first_au_fm_ratio:last_au_fm_ratio) %>%
  mutate(stat_fact=factor(
    stat,
    levels = c('first_au_fm_ratio','middle_au_fm_ratio','last_au_fm_ratio'),
    labels = c('First', 
               'Middle',
               'Last')),
    term_fact = factor(term,
                       levels = unique(term),
                       labels = gsub(' ', '\n', unique(term)))) %>%
  ggplot(aes(x = as.numeric(pubDate), y = log(value))) +
  geom_point(aes(size = n_tot), alpha = .2) +
  geom_point(alpha = .3) +
  geom_line(stat = 'smooth', method = 'loess', color = 'black') +
  facet_grid(term_fact ~ stat_fact) +
  geom_hline(yintercept = 0, color = 'black', alpha = .75) +
  scale_y_continuous(breaks = c(-.25, 0, .25)) + 
  guides(size = guide_legend('N authors')) + 
  coord_cartesian(ylim = c(-.5, .5)) + 
  theme_classic() +
  theme(strip.text = element_text(size = 8, margin = margin(0, 0, 0, 0, "in")),
        strip.background = element_rect(fill = '#efefef', linetype = 0),
        panel.spacing = unit(0, 'in'),
        panel.background = element_rect(fill = '#ffffff'),
        panel.border = element_rect(fill = NA, color = '#efefef', size = 1, linetype = 1)) + 
labs(x = 'Publication year', y =expression(paste(log, bgroup("(", frac("P(position | F)", "P(position | M)"), ")"))))
print(propByGenderRatioPlot)

Analysis

Though the above descriptives are nice, and will help us interpret an analysis, we would like to control the error rate of our conclusions using p-values. We can model these data using the lme4 package, which I do below. The basic model is going to capture the second set of conditional probabilities I mention above – that is, the probability that an author is male, conditional on the author position.

In our model, we'll use dummy variables for first and last author position, with middle author as the reference. This will tell us if each of the prestigious positions differs from the non-prestigious position in terms of male over- or under-representation. Because most of the trends above are linear on average, I'll also include the linear effect of year of publication.

The level 1 formula is

\[\text{logit}\big(P(\text{male}_{ij} = 1)\big) = \beta_{0j} + \beta_{1j}\text{First}_{ij} + \beta_{2j}\text{Last}_{ij} + \beta_{3j}\text{Year}_{ij} + \beta_{4j}\text{First}_{ij}\text{Year}_{ij} + \beta_{5j}\text{Last}_{ij}\text{Year}_{ij}\] and the grouping factor will be the MeSH term, with random effects estimated for all coefficients. This will allow us to fit a partially pooled estimate for each subfield, while retaining the ability to perform significance tests on the field as a whole.
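
For concreteness, this is roughly how that specification maps onto lme4's formula syntax (a sketch; the actual fitting calls, using the centered and standardized year variable pubyear_z created below, appear in the model sections):

# Sketch: the lme4 formula implied by the level-1/level-2 specification above
male ~ 1 + first_author*pubyear_z + last_author*pubyear_z +
  (1 + first_author*pubyear_z + last_author*pubyear_z | term)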

Load Data

This requires individual level data, so let’s load it.

require(tidyverse)
require(data.table)

if(!file.exists('./RData/indivLevelDat.RDS')){
  # Read each subfield's file and flag each authorship as first, last, or middle,
  # keeping one row per authorship (the individual-level data)
  indivLevelDat <- data.table(filename = inputFiles, stringsAsFactors = F) %>%
    group_by(filename) %>%
    do({
      aDT<-fread(as.character(.$filename[[1]]), colClasses = authorColClasses)
      result<-aDT %>%
        mutate(simple_sex=sub('.*?((fe)*male)','\\1',Author_Sex)) %>%
        filter(simple_sex %in% c('male','female'),pubDate >= 2002) %>%
        group_by(PubMed_ID) %>%
        mutate(last=Author_Rank_in_Paper == max(Author_Rank_in_Paper) &
                 Author_Rank_in_Paper != 1,
               first_author=Author_Rank_in_Paper %in% 1,
               last_author=last %in% TRUE,
               middle_author=!(Author_Rank_in_Paper %in% 1) & !(last %in% TRUE),
               male=as.numeric(simple_sex=='male'))
      result
    }) %>%
    mutate(term=sub('author_centric_(.*)\\.csv','\\1',filename))
  saveRDS(indivLevelDat, './RData/indivLevelDat.RDS')
} else {
  indivLevelDat <- readRDS('./RData/indivLevelDat.RDS')
}
indivLevelDat$pubyear_c <- as.numeric(indivLevelDat$pubDate)-2014
indivLevelDat$pubyear_z <- indivLevelDat$pubyear_c / sd(indivLevelDat$pubyear_c)

set.seed(171114)
indivLevDatSample <- sample_frac(group_by(indivLevelDat, term),
                                                 size=.1)

Now fit the models. I’ll center the publication year at 2014 so that our intercept reflects the current (well, as of 2014) state of the medical literature. This means that significant coefficients on the author position terms indicate a difference as of the 2014 publication date. I will also standardize the publication date variable to make model estimation easier (logistic regression cares about that sort of thing). Finally, because the full data set is roughly 4 million rows, I will sample 10% of the data so that I have a chance of estimating the models within a day. I will estimate the full model on the full data set as well, but seeing that 10% of this data set is still over 400,000 rows, we should be pretty okay in terms of precision. Of note is that this 10% is sampled from each term to ensure that we don’t randomly end up with an oddly stratified set.

Null Model

require(lme4)

# Inverse logit: convert log-odds from the logistic models back to probabilities
inv.logit <- function(alpha){
  1/(1+exp(-alpha))
}

nullModel <- glmer(male ~ 1 + (1|term), 
                   data=indivLevDatSample, 
                   family='binomial')
summary(nullModel)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: male ~ 1 + (1 | term)
##    Data: indivLevDatSample
## 
##       AIC       BIC    logLik  deviance  df.resid 
##  570074.2  570096.1 -285035.1  570070.2    431201 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8255 -1.0366  0.6437  0.8568  1.1136 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  term   (Intercept) 0.1499   0.3871  
## Number of obs: 431203, groups:  term, 47
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4860     0.0508   9.567   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using the inverse logit function, we can turn this intercept back into a probability of being male overall (or a proportion of males, as estimated by the model). Below you can see the expected proportion of males across all subfields (black vertical line) at about \(P = 0.62\). The dashed gray line highlights \(P = 0.50\), which would be equal numbers of male and female authors. Obviously this is not the model we care about, because we're actually trying to assess whether, given these gender proportions, authorship is evenly or unevenly distributed across positions of prestige.
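
As a quick check, the intercept can be converted directly (this assumes the inv.logit() helper and the fitted nullModel object from above):

# Fixed-effect intercept on the log-odds scale, converted to a probability
inv.logit(fixef(nullModel)[[1]])  # inv.logit(0.486) = 1/(1 + exp(-0.486)) ~ 0.62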

rr1 <- ranef(nullModel, condVar = TRUE)
op <- options(digits = 4)
options(op)

# Per-subfield estimated proportion male: conditional modes of the random
# intercepts, shifted by the fixed intercept, converted to probabilities,
# with +/- 2 conditional SD intervals
as.data.frame(rr1) %>%
  mutate(grp = sub('.*/output//(.*)', '\\1', grp),
         grp = factor(grp, levels = grp[order(condval)]),
         condval = condval + fixef(nullModel)[[1]]) %>%
  ggplot(aes(y = grp, x = inv.logit(condval))) +
  geom_point() + 
  facet_wrap(~ term, scales="free_x") +
  geom_errorbarh(aes(xmin = inv.logit(condval - 2*condsd), 
                     xmax = inv.logit(condval + 2*condsd)),
                   height=0) +
  geom_vline(xintercept = inv.logit(0.4860)) + 
  geom_vline(xintercept = inv.logit(0), color = 'gray', linetype = 2) +
  theme_classic()

Full model, random intercepts

This model is close to the one we care about, but it assumes that all fields change in the same way over time (that is, it doesn't estimate the publication year slope separately for each subfield). This model is being estimated mainly to check whether we need to allow change to vary across subfields. Looking at the descriptive plots above, I bet we do, but it's nice to have a statistical argument.

require(lme4)

intModelFName <- './RData/lme4IntModel.RDS'

if(!file.exists(intModelFName)){
  intModelEquation <- formula(male ~ 1 + first_author*pubyear_z + last_author*pubyear_z + 
                                    (1 + first_author + last_author | term))
  
  intModelFormula <- lFormula(intModelEquation, data = indivLevDatSample)
  
  # Count the fixed effects and the random-effect variance/covariance parameters,
  # then scale the optimizer's maximum number of function evaluations accordingly
  intModelFormulaNumFx <- length(dimnames(intModelFormula$X)[[2]])
  
  intModelFormulaNumRx <- sum(as.numeric(lapply(intModelFormula$reTrms$cnms, function(x) {
    l <- length(x)
    (l*(l-1))/2+l
  })))
  
  intModelFormulaMaxfun <- 10*(intModelFormulaNumFx+intModelFormulaNumRx+1)^2

  intModel <- glmer(intModelEquation,
                    data = indivLevDatSample, 
                    verbose = 2,
                    family='binomial',
                    control=glmerControl(optCtrl=list(maxfun=intModelFormulaMaxfun), 
                                         optimizer = "nloptwrap", 
                                         calc.derivs = FALSE))
  saveRDS(intModel, intModelFName)
} else {
  intModel <- readRDS(intModelFName)
}
summary(intModel)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: male ~ 1 + first_author * pubyear_z + last_author * pubyear_z +  
##     (1 + first_author + last_author | term)
##    Data: indivLevDatSample
## Control: glmerControl(optCtrl = list(maxfun = intModelFormulaMaxfun),  
##     optimizer = "nloptwrap", calc.derivs = FALSE)
## 
##       AIC       BIC    logLik  deviance  df.resid 
##  567442.1  567573.8 -283709.0  567418.1    431191 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2089 -1.0144  0.6359  0.8418  1.3592 
## 
## Random effects:
##  Groups Name             Variance Std.Dev. Corr       
##  term   (Intercept)      0.133332 0.36515             
##         first_authorTRUE 0.023844 0.15441   0.50      
##         last_authorTRUE  0.002384 0.04882  -0.02  0.34
## Number of obs: 431203, groups:  term, 47
## 
## Fixed effects:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 0.273129   0.054878   4.977 6.46e-07 ***
## first_authorTRUE            0.066666   0.029981   2.224  0.02618 *  
## pubyear_z                  -0.075836   0.004511 -16.811  < 2e-16 ***
## last_authorTRUE             0.263065   0.017007  15.468  < 2e-16 ***
## first_authorTRUE:pubyear_z -0.038497   0.007890  -4.879 1.06e-06 ***
## pubyear_z:last_authorTRUE  -0.025416   0.007759  -3.276  0.00105 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) fr_TRUE pbyr_z l_TRUE f_TRUE:
## frst_thTRUE  0.292                              
## pubyear_z    0.104 -0.190                       
## lst_thrTRUE -0.096  0.293  -0.335               
## frst_TRUE:_ -0.059  0.368  -0.571  0.192        
## pbyr_:_TRUE -0.060  0.110  -0.580  0.602  0.332

Interpreting these coefficients, we can see that in 2014 an author is slightly more likely to be male if they are in the first position, and more likely to be male if they are in the last position, than in the middle position, and that there are more male than female authors in the middle position (that's the intercept). We also see that, across the three authorship positions, an author is less and less likely to be male over time. Note that this is on average, partially pooled across all subfields, and doesn't allow the effect of publication year to vary across subfields.
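
To put those log-odds coefficients back on the probability scale, we can run them through the inverse logit (a quick sketch, assuming the fitted intModel and the inv.logit() helper from above):

# Predicted P(male) in 2014 (pubyear_z = 0) for each authorship position,
# using the random-intercepts model's fixed effects
b <- fixef(intModel)
inv.logit(b[['(Intercept)']])                            # middle author, ~0.57
inv.logit(b[['(Intercept)']] + b[['first_authorTRUE']])  # first author,  ~0.58
inv.logit(b[['(Intercept)']] + b[['last_authorTRUE']])   # last author,   ~0.63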

Full model, random slopes

Now we estimate the full model across every subfield, with estimates partially pooled. This means each subfield can get a slope and intercept that deviate from the fixed effects.

require(lme4)

slopeModelFName <- './RData/lme4SlopeModel.RDS'

if(!file.exists(slopeModelFName)){
  slopeModelEquation <- formula(male ~ 1 + first_author*pubyear_z + last_author*pubyear_z + 
                        (1 + first_author*pubyear_z + last_author*pubyear_z | term))
  
  slopeModelFormula <- lFormula(slopeModelEquation, data = indivLevDatSample)
  
  slopeModelFormulaNumFx <- length(dimnames(slopeModelFormula$X)[[2]])
  
  slopeModelFormulaNumRx <- sum(as.numeric(lapply(slopeModelFormula$reTrms$cnms, function(x) {
    l <- length(x)
    (l*(l-1))/2+l
  })))
  
  slopeModelFormulaMaxfun <- 10*(slopeModelFormulaNumFx+slopeModelFormulaNumRx+1)^2

  slopeModel <- glmer(slopeModelEquation,
                    data = indivLevDatSample, 
                    verbose = 2,
                    family='binomial',
                    control=glmerControl(optCtrl=list(maxfun=slopeModelFormulaMaxfun), 
                                         optimizer = "nloptwrap", 
                                         calc.derivs = FALSE))
  saveRDS(slopeModel, slopeModelFName)
} else {
  slopeModel <- readRDS(slopeModelFName)
}
summary(slopeModel)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: male ~ 1 + first_author * pubyear_z + last_author * pubyear_z +  
##     (1 + first_author * pubyear_z + last_author * pubyear_z |  
##         term)
##    Data: indivLevDatSample
## Control: glmerControl(optCtrl = list(maxfun = slopeModelFormulaMaxfun),  
##     optimizer = "nloptwrap", calc.derivs = FALSE)
## 
##       AIC       BIC    logLik  deviance  df.resid 
##  567242.3  567538.6 -283594.2  567188.3    431176 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2257 -1.0039  0.6323  0.8409  1.3810 
## 
## Random effects:
##  Groups Name                       Variance Std.Dev. Corr                         
##  term   (Intercept)                0.141786 0.37654                               
##         first_authorTRUE           0.019859 0.14092   0.66                        
##         pubyear_z                  0.001029 0.03208   0.32  0.18                  
##         last_authorTRUE            0.001483 0.03851  -0.06  0.41  0.21            
##         first_authorTRUE:pubyear_z 0.001719 0.04147   0.28  0.02  0.68  0.31      
##         pubyear_z:last_authorTRUE  0.000931 0.03051   0.09  0.33  0.69  0.38  0.72
## Number of obs: 431203, groups:  term, 47
## 
## Fixed effects:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 0.269160   0.056843   4.735 2.19e-06 ***
## first_authorTRUE            0.040992   0.027869   1.471  0.14133    
## pubyear_z                  -0.079370   0.007798 -10.178  < 2e-16 ***
## last_authorTRUE             0.254393   0.015726  16.177  < 2e-16 ***
## first_authorTRUE:pubyear_z -0.056198   0.011346  -4.953 7.30e-07 ***
## pubyear_z:last_authorTRUE  -0.029126   0.009984  -2.917  0.00353 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) fr_TRUE pbyr_z l_TRUE f_TRUE:
## frst_thTRUE  0.420                              
## pubyear_z    0.311 -0.028                       
## lst_thrTRUE -0.101  0.368  -0.147               
## frst_TRUE:_  0.133  0.330   0.046  0.308        
## pbyr_:_TRUE  0.011  0.321   0.010  0.635  0.508

Again, interpreting these coefficients, we can see that in 2014 an author is more likely to be male if they are in the last position than in the middle position, and that there are more male than female authors in the middle position. We also again see that, across the three authorship positions, an author is less and less likely to be male over time. This is consistent with the overall finding that more women seem to be publishing in these fields over time.

Model comparison

Which of these models should we trust more? Since both were fit to identical data (checked below), we can compare them directly with a likelihood ratio test.

identical(unlist(intModel@frame), unlist(slopeModel@frame))
## [1] TRUE
knitr::kable(broom::tidy(anova(intModel, slopeModel)))
term        df  AIC       BIC       logLik     deviance  statistic  Chi.Df  p.value
intModel    12  567442.1  567573.8  -283709.0  567418.1  NA         NA      NA
slopeModel  27  567242.3  567538.6  -283594.2  567188.3  229.7491   15      0

Subfield model estimate plots

What we really want to see is the model-estimated trend for each of the subfields. That will take a little bit more work; one possible approach is sketched below.
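
One way to do it – a sketch rather than a finished analysis – is to pull the combined (fixed + random) coefficients for each subfield out of the slope model with coef(), convert the linear predictor to a probability with the inv.logit() helper from above, and plot the predicted P(male | position, year) by subfield. The object names below (subfieldCoefs, subfieldPreds, year_z) are just illustrative.

require(tidyverse)
require(lme4)

# Per-subfield coefficients (fixed effects plus that subfield's random effects)
subfieldCoefs <- coef(slopeModel)$term %>%
  rownames_to_column('term') %>%
  # strip the directory prefix from the subfield labels, as above
  mutate(term = sub('.*/output//(.*)', '\\1', term)) %>%
  rename(b0 = `(Intercept)`,
         b_first = first_authorTRUE,
         b_year = pubyear_z,
         b_last = last_authorTRUE,
         b_first_year = `first_authorTRUE:pubyear_z`,
         b_last_year = `pubyear_z:last_authorTRUE`)

# Predicted probability of being male for each position across the observed
# range of (standardized) publication years
subfieldPreds <- subfieldCoefs %>%
  crossing(year_z = seq(min(indivLevDatSample$pubyear_z),
                        max(indivLevDatSample$pubyear_z),
                        length.out = 25)) %>%
  transmute(term, year_z,
            Middle = inv.logit(b0 + b_year*year_z),
            First  = inv.logit(b0 + b_first + (b_year + b_first_year)*year_z),
            Last   = inv.logit(b0 + b_last + (b_year + b_last_year)*year_z)) %>%
  gather(position, p_male, First, Middle, Last)

ggplot(subfieldPreds, aes(x = year_z, y = p_male, color = position)) +
  geom_line() +
  geom_hline(yintercept = .5, linetype = 2, color = 'gray') +
  facet_wrap(~ term) +
  theme_classic() +
  labs(x = 'Publication year (standardized)', y = 'P(male | position)')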

Conclusion

Our final model showed that, though the distribution of authorship positions seems to be moving toward parity, as of 2014 the last author position is held by more male authors than would be expected based on the underlying gender distribution of authors. In other words, on average, and regardless of the overall proportion of males and females who author papers, there is not a significant difference between the probability that a middle versus a first author is male; however, as of the 2014 publication date, a last author is significantly more likely to be male than a middle author. Over the period covered by this analysis, the probability of being male has become more similar between first and middle, and between last and middle, authorship positions, indicating that while parity has improved, it was indeed worse in the past.