Ten simple rules for structuring papers, bioRxiv, 2016-12-15

Thursday, Dec 15, 2016


Paper Info
LDA topic modeling analysis
Network homophily analysis
- Plot UMAP homophily embedding
Cosine similarity analysis
- UMAP embedding by cosine similarity
Retweet timeline analysis
- Timeline plot
Missing data analysis
- Plot missingness distributions by group
- Correlation between missing data and WN homophily

Paper Info

Report generated on 2019-07-23.

doi: 10.1101/088278

LDA topic modeling analysis

We obtained a list of tweets/RTs referencing the specified article by querying the Crossref Event Data API. For each unique user that has (re-)tweeted the article, we then collected the user names and bios of their followers using the Twitter API (limited to the 10,000 most recent followers). We compiled these bios into a single “document” per account.

We next generated a document term matrix, enumerating the frequencies of every term that occurs 10 or more times in each document, excluding common stop words (e.g., “a”, “of”, “is”). Note that because emoji and hashtags are commonly used to convey semantic meaning in a user’s bio, we included these as unique “words”.
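As a rough illustration of the document term matrix step, the sketch below builds one in base R from toy bios. The bios, the short stop word list, the `tokenize` helper, and the lowered `min_count` threshold are all assumptions made for the example; the report itself uses a threshold of 10 and a full stop word list.

```r
# Toy follower bios, one "document" per account (hypothetical data).
docs <- c(
  acct_a = "PhD student science 🔬 #rstats science of data",
  acct_b = "writer of books #resist books and music"
)

stop_words <- c("a", "of", "is", "in", "and", "the")  # tiny illustrative list
min_count <- 2  # the report uses 10; lowered so the toy data survives

# Split on whitespace only, so hashtags and emoji stay intact as tokens.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "\\s+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

tokens_by_doc <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens_by_doc)))

# Count each vocabulary term per document (terms x documents), then transpose.
counts <- vapply(tokens_by_doc,
                 function(tk) as.integer(table(factor(tk, levels = vocab))),
                 integer(length(vocab)))
dtm <- t(counts)
colnames(dtm) <- vocab

# Keep terms that reach the minimum count in at least one document.
dtm <- dtm[, apply(dtm, 2, max) >= min_count, drop = FALSE]
```

Splitting on whitespace rather than stripping punctuation is what keeps tokens such as `#rstats` in the vocabulary as distinct “words”.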

Inference of academic audience sectors

The table below lists the audience topics inferred by the LDA model, their top 30 keywords, and the fraction of users associated with that topic.

Topics that are associated with academic audiences (having at least one keyword in each of the following 3 keyword sets: ["phd", "md", "dr"], ["university", "institute", "universidad", "lab"], and ["student", "estudiante", "postdoc", "professor", "profesor", "prof"]) are indicated with a “🎓” emoji in the topic column. For each topic, we calculate the cosine similarity between the top 30 keywords for that topic and the top 100 most common words found in the Wikipedia article for scientific-communication-and-education (the paper’s category) and in each of the Wikipedia articles for 1179 academic disciplines. The discipline found to have the highest cosine similarity with a given topic is indicated in the best_match column and the corresponding topic x discipline cosine similarity score is indicated in the td_score column.
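The rule for flagging academic topics (at least one keyword from each of the three sets) can be sketched as below. The function name `is_academic` and the set labels `degree`, `institution`, and `role` are hypothetical names chosen for this example; the keyword sets themselves are the ones listed above.

```r
keyword_sets <- list(
  degree      = c("phd", "md", "dr"),
  institution = c("university", "institute", "universidad", "lab"),
  role        = c("student", "estudiante", "postdoc", "professor", "profesor", "prof")
)

# TRUE only if the topic's top terms hit every one of the keyword sets.
is_academic <- function(top_terms, sets = keyword_sets) {
  all(vapply(sets, function(s) any(top_terms %in% s), logical(1)))
}

# Abbreviated term lists taken from topic7 and topic9 in the table.
topic7_terms <- c("health", "md", "medical", "university", "professor")
topic9_terms <- c("writer", "author", "books", "music", "teacher")

is_academic(topic7_terms)  # TRUE
is_academic(topic9_terms)  # FALSE
```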

For each of the matching disciplines, we then calculate a discipline x discipline cosine similarity score between that discipline and the paper’s main topical area, indicated in the cos() column.

Among the \(D\) academic topics (each assigned to discipline \(d\)), we calculate an aggregate interdisciplinary score as a weighted average of the similarity scores between each topic and the paper’s category, where the weights, \(w_d\) (indicated in the pct_acad column) are the fraction of the academic audience associated with that topic:

\(ID_{score} = 1 - \sum_{d \in D} w_d \times \cos(\vec{d}, \vec{d}_{home}) = 0.9809842\).
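As a check on the arithmetic, the score can be approximately recomputed from the rounded values in the table. This sketch assumes that the “Fraction of academic audience” column supplies the weights and that the cos(t, d_target) column supplies the similarities; topics with NA in those columns are left out. Because the table values are rounded to three decimals, the result matches the reported score only to about four decimal places.

```r
# Weights (pct_acad) and similarities (cos(t, d_target)) for the nine topics
# with non-missing values: topics 1-7, 11, and 12, copied from the table.
w   <- c(0.153, 0.078, 0.127, 0.269, 0.038, 0.126, 0.075, 0.074, 0.060)
sim <- c(0.000, 0.072, 0.058, 0.000, 0.000, 0.019, 0.000, 0.000, 0.061)

# One minus the weighted average similarity to the paper's category.
id_score <- 1 - sum(w * sim)
id_score  # approximately 0.981
```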

tf_table <- full_join( 
                      lda_gammas_count %>% 
                        mutate(topic=paste0("topic", gsub(":.*", "", topic))) %>%
                        dplyr::select(topic, top_terms=top_10, n_users=n, pct_total=pct),
                      top_fields,
                      by="topic") %>%
  ungroup() %>%
  # full_join(acad_topics2, by="topic") %>%
  mutate(topic_lab=topic) %>%
  # mutate(topic=factor(topic, levels=unique(lda_gammas_count$topic))) %>%
  # arrange(topic_lab) %>%
  # mutate(topic=factor(topic, levels=paste0("topic", 1:12))) %>%
  full_join(match_scores) %>%
  mutate(topic=cell_spec(topic, "html", 
                         color="black", align = "c",
                         background=c(cols[as.numeric(gsub("topic", "", topic))]))) %>%
  mutate(topic=ifelse(topic_lab %in% topic_ids, paste0(topic, "🎓"), topic)) %>%
  # dplyr::select(-c(topic_lab, pct, target_cat, score_wt, score_wt2)) %>%
  dplyr::select(topic, top_terms, n_users, pct_total, pct_acad, tc_score, best_match, td_score) %>%
  mutate(n_users=round(n_users),
         pct_total=round(pct_total, 3),
         pct_acad=round(pct_acad, 3),
         td_score=round(td_score, 3),
         tc_score=round(tc_score, 3)) %>%
  dplyr::rename("Number of users (estimated)" = "n_users",
                "Top 30 Terms" = "top_terms", 
                "Fraction of total audience" = "pct_total",
                "Fraction of academic audience" = "pct_acad",
                "Best matching discipline" = "best_match",
                "cos(t, d<sub>best</sub>)" = "td_score",
                "cos(t, d<sub>target</sub>)" = "tc_score")

## Joining, by = c("topic", "top_terms", "n_users", "best_match", "td_score")

knitr::kable(tf_table, format="html", escape=F) %>%
  column_spec(2, width_max = "200em; display: inline-block;") %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "600px")

topic	Top 30 Terms	Number of users (estimated)	Fraction of total audience	Fraction of academic audience	cos(t, d_target)	Best matching discipline	cos(t, d_best)
topic1🎓	neuroscience, phd, neuroscientist, science, research, student, cognitive, brain, psychology, university, lab, professor, studying, postdoc, researcher, scientist, learning, psychologist, health, dr, memory, assistant, candidate, fellow, clinical, cognition, human, prof, computational, mental	378	0.127	0.153	0.000	Psychology	0.225
topic2🎓	research, health, university, education, dr, lecturer, phd, researcher, learning, student, uk, director, teacher, public, people, librarian, school, academic, science, mental, centre, passionate, community, senior, teaching, psychology, manager, development, food, support	192	0.065	0.078	0.072	Community_psychology	0.200
topic3🎓	science, ecology, phd, ecologist, conservation, university, research, student, marine, climate, scientist, environmental, professor, change, biologist, dr, studying, biology, postdoc, evolutionary, wildlife, evolution, nature, candidate, researcher, plant, fellow, enthusiast, biodiversity, water	312	0.105	0.127	0.058	Ecology	0.225
topic4🎓	phd, science, biology, research, lab, university, student, professor, scientist, biologist, postdoc, plant, 🔬, molecular, cell, studying, evolution, genomics, dr, assistant, genetics, fellow, researcher, cancer, microbiology, candidate, prof, evolutionary, stem, enthusiast	663	0.223	0.269	0.000	Bioinformatics	0.182
topic5🎓	sports, sport, phd, exercise, coach, university, science, performance, research, health, nutrition, student, physical, msc, physiology, pathology, lecturer, scientist, researcher, strength, pathologist, professor, diabetes, conditioning, physiotherapist, fitness, training, medicine, biomechanics, bsc	94	0.032	0.038	0.000	Kinesiology	0.212
topic6🎓	phd, data, science, student, university, research, learning, language, scientist, professor, researcher, machine, computer, psychology, engineer, assistant, software, linguistics, #rstats, candidate, dr, computational, technology, prof, education, enthusiast, statistics, engineering, lecturer, cognitive	311	0.105	0.126	0.019	Artificial_intelligence	0.217
topic7🎓	health, md, medical, medicine, care, research, clinical, healthcare, physician, fellow, dr, cancer, director, hospital, university, resident, doctor, professor, patient, emergency, researcher, phd, student, education, cardiology, nurse, mom, husband, advice, advocate	184	0.062	0.075	0.000	Medicine	0.245
topic8🎓	enfermera, salud, vida, ms, universidad, ciencia, mdico, mundo, enfermero, hospital, medicina, investigacin, especialista, phd, amante, madrid, ser, oficial, siempre, profesor, estudiante, mster, ana, educacin, familia, ciencias, mejor, gestin, informacin, psicologa	150	0.050	NA	NA	NA	NA
topic9	#, writer, author, ❤, #fbpe, #_, ✨, books, book, music, world, artist, art, people, 🌈, writing, time, ♥, proud, mom, 🇬🇧, 🇺🇸, teacher, 🌊, free, live, 2, #resist, wife, brexit	174	0.059	NA	NA	NA	NA
topic10	science, business, marketing, data, digital, tech, founder, technology, world, design, news, media, engineer, author, writer, space, people, director, software, entrepreneur, ceo, music, enthusiast, physics, father, cofounder, free, husband, consultant, speaker	184	0.062	NA	NA	NA	NA
topic11🎓	science, research, phd, genomics, bioinformatics, data, genetics, scientist, cancer, biology, computational, health, student, university, medicine, researcher, human, institute, professor, biologist, sciences, genetic, molecular, bioinformatician, fellow, medical, chemistry, biotech, clinical, learning	182	0.061	0.074	0.000	Biostatistics	0.214
topic12🎓	phd, student, professor, university, research, science, politics, candidate, political, policy, health, economics, public, researcher, history, writer, dr, assistant, philosophy, studies, prof, education, fellow, development, human, director, editor, law, 🌈, school	147	0.049	0.060	0.061	Social_science	0.290

Paper topics in field space

To visualize the interdisciplinarity of the article, we calculate the cosine similarity between each pair of academic discipline keyword sets, producing an NxN matrix. We then apply PCA + UMAP to this matrix, producing a two-dimensional embedding of the relationship between academic disciplines.

The inferred academic audience disciplines for this paper are highlighted and labeled. If the paper has a more interdisciplinary audience, the highlighted points will tend to be further apart from each other.

tf2 <- top_fields %>%
  dplyr::select(topic, flag_title=best_match)

ms_df <- data.frame(ms_umap$layout, title=lword_counts$article_title) %>%
  mutate(flag_title=ifelse(title %in% top_fields$best_match, title, NA)) %>%
  left_join(tf2) %>%
  mutate(topic=factor(topic, levels=unique(tf2$topic)))

## Joining, by = "flag_title"

p_fields <- ggplot()+
  geom_point(data=ms_df[is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title), colour="grey80", alpha=0.4)+
  geom_point(data=ms_df[!is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title, colour=topic), size=3, alpha=0.8)+
  scale_colour_manual(values=c(cols[as.numeric(gsub("topic", "", top_fields$topic))]))+
  theme_classic()+
  theme(legend.position="none")

ggplotly(p_fields) %>%
  add_annotations(x = ms_df[!is.na(ms_df$flag_title),]$X1,
                  y = ms_df[!is.na(ms_df$flag_title),]$X2,
                  text = ms_df[!is.na(ms_df$flag_title),]$flag_title)

# in development—custom data layer to link to wikipedia articles when clicking points
# plot_embedding_wiki <- function(ms_df){
#   
#   p_fields <- ggplot()+
#     geom_point(data=ms_df[is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title), colour="grey80", alpha=0.4)+
#     geom_point(data=ms_df[!is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title, colour=topic), size=3, alpha=0.8)+
#     scale_colour_manual(values=c(cols[as.numeric(gsub("topic", "", top_fields$topic))]))+
#     theme_classic()+
#     theme(legend.position="none")
#   
#   ply <- ggplotly(p_fields) %>%
#     add_annotations(x = ms_df[!is.na(ms_df$flag_title),]$X1,
#                     y = ms_df[!is.na(ms_df$flag_title),]$X2,
#                     text = ms_df[!is.na(ms_df$flag_title),]$flag_title)
#   
#   # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
#   for(i in 1:12){
#     ply$x$data[[i]]$customdata <- plotdat[grepl(paste0("^", i, ": "), plotdat$topic),]$urls
#   }
#     #pp  <- add_markers(pp, customdata = ~url)
#   plyout <- onRender(ply, "
#                      function(el, x) {
#                      el.on('plotly_click', function(d) {
#                      var url = d.points[0].customdata;
#                      //url
#                      window.open(url);
#                      });
#                      }
#                      ")
# 
#   plyout
# }

# htmlwidgets::saveWidget(plot_embedding(umap_plotdat), 
#                         file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
#                         title=paste0("homophily_ratio_", article_id))

# plot_embedding2(dmp_df2)

Plot topic breakdown by user

This plot shows the topic probabilities (gammas) for each user account according to the frequencies of each of the K topics inferred from the bios of their followers. Each stack of bars indicates a unique user that (re-)tweeted the article, and the height of the bar segment indicates the fraction of that user document that is associated with a given topic. Topics inferred to be associated with academic audiences are indicated with a “🎓” emoji in the legend. Click on a user to open their Twitter profile in a new window. Click on a topic in the legend to toggle it off/on in the plot.

plot_embedding_bars <- function(plotdat, docs_order){
  
  plotdat <- plotdat %>%
    mutate(topic_lab=paste0("topic", topic)) %>%
    ungroup() %>%
    mutate(document=factor(document, levels=docs_order$document)) %>%
    left_join(topics_terms, by="topic") %>%
    mutate(topic=paste0(topic, ": ", top_10)) %>%
    mutate(topic=ifelse(topic_lab %in% topic_ids, paste0("🎓", topic), topic)) #%>%
    
    # mutate(urls=paste0("https://twitter.com/", document))
  
  p <- plotdat %>%
    mutate(topic=factor(topic, levels=unique(plotdat$topic))) %>%
    ggplot(aes(x=document, y=gamma, fill=topic))+
    geom_bar(stat="identity", position="stack")+
    scale_fill_manual(values=cols)+
    scale_y_continuous(expand = c(0,0))+
    scale_x_discrete(position = "top")+ 
    xlab("Account")+
    theme(legend.position="bottom",
          axis.title.y=element_blank(),
          axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())+
    guides(fill=guide_legend(ncol=1))

  ply <- ggplotly(p) %>%
    layout(legend = list(orientation = "v",   # stack legend entries vertically
                       xanchor = "center",  # use center of legend as anchor
                       yanchor = "bottom",
                       x = 0, y=-1))
  # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
  for(i in seq_along(ply$x$data)){  # one trace per topic; avoids hard-coding 12
    ply$x$data[[i]]$customdata <- paste0("https://twitter.com/", docs_order$document)
  }
    #pp  <- add_markers(pp, customdata = ~url)
  plyout <- onRender(ply, "
                     function(el, x) {
                     el.on('plotly_click', function(d) {
                     var url = d.points[0].customdata;
                     //url
                     window.open(url);
                     });
                     }
                     ")

  plyout
}

# htmlwidgets::saveWidget(plot_embedding(umap_plotdat), 
#                         file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
#                         title=paste0("homophily_ratio_", article_id))

plot_embedding_bars(bios_lda_gamma, docs_order)

# p2

# wrapper <- function(x, ...) paste(strwrap(x, ...), collapse = "\n")
# 
# p_thumb <- p +
#   ggplot2::annotate("text", x=0, y=0.5, hjust=0, label=wrapper(article_df$title, width = 30), size=6)+
#   theme(legend.position="none", axis.ticks.x=element_blank(), axis.title.x=element_blank())
# 
# ggsave(paste0(datadir, "/output/figures/", gsub(".html", ".png", nb_file)), 
#        plot=p_thumb, width = 4, height=2, dpi=125)

# htmlwidgets::saveWidget(p2, 
#                         file=paste0(datadir, "/figs/topic_breakdown_by_user_", article_id, ".html"),
#                         title=paste0("topic_breakdown_by_user_", article_id))

Network homophily analysis

Many of the papers we analyzed were inferred to have audience topics that are strongly suggestive of affiliation with white nationalism and other right-wing ideologies. According to the principle of network homophily, we would expect these users’ followers to substantially overlap the follower bases of prominent white nationalists.

These results show that most papers exhibit a continuous gradient between their affiliation with academic communities and their affiliation with white nationalist communities. Some users have up to 40% of their followers who also follow prominent white nationalist accounts and <1% who follow prominent scientist accounts, corresponding to a ~100-fold enrichment of white nationalists among their follower base.

Using a curated set of 20 white nationalist accounts and 20 scientist accounts, we calculated the network homophily between each of these 40 accounts and each of the N users that have tweeted the paper, producing an Nx40 similarity matrix. We then applied PCA+UMAP to this matrix to reduce the dimensionality to Nx2.
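The shape of this similarity matrix can be illustrated with a small base R sketch. It assumes, as the percentages quoted above suggest, that homophily is measured as the share of a user's followers who also follow a given reference account; the exact definition used by the framework is not stated here. The follower ID sets and the `homophily` helper are hypothetical.

```r
# Hypothetical follower ID sets; real data would come from the Twitter API.
user_followers <- list(
  user1 = c(1, 2, 3, 4, 5),
  user2 = c(4, 5, 6, 7)
)
reference_followers <- list(
  ref_a = c(1, 2, 9),
  ref_b = c(5, 6, 7, 8)
)

# Assumed definition: share of a user's followers who also follow the reference.
homophily <- function(followers, ref) {
  length(intersect(followers, ref)) / length(followers)
}

# Rows are users, columns are reference accounts (N x 40 in the report).
sim_matrix <- sapply(reference_followers, function(ref) {
  vapply(user_followers, homophily, numeric(1), ref = ref)
})
sim_matrix
```

With real data, this matrix would then be passed to a dimensionality reduction step such as `prcomp()` followed by UMAP.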

Plot UMAP homophily embedding

This plot shows the 2D embedding of accounts according to their homophily with the two reference groups. We typically see a gradient from strong scientist homophily (blue) to strong white nationalist homophily (red), but the extent of these differences can vary. A paper that is exposed primarily to academic audiences will have mostly blue points, whereas a paper exposed to white nationalist audiences will have more red points.
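The colour scale is driven by the affiliation score that the report's code computes as `log10(wn_mean/(sc_mean+0.001))`. The sketch below wraps that expression in a hypothetical helper, `affiliation_score`, and evaluates it on made-up inputs to show how values map onto the legend labels.

```r
# Same expression as in the report's code; the function name and the
# `pseudocount` argument name are illustrative.
affiliation_score <- function(wn_mean, sc_mean, pseudocount = 0.001) {
  log10(wn_mean / (sc_mean + pseudocount))
}

# Made-up inputs: 40% mean WN homophily vs 0.3% mean scientist homophily.
red_end <- affiliation_score(0.40, 0.003)   # about 2, the "100:1" label
# Made-up inputs: 1% mean WN homophily vs 9.9% mean scientist homophily.
blue_end <- affiliation_score(0.01, 0.099)  # about -1, the "1:10" label
```

The small pseudocount keeps the ratio finite when a user has almost no overlap with the scientist reference accounts.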

umap_plotdat <- bind_cols(sim_matrix_pca[[3]][[1]], data.frame(sim_matrix_umap$layout)) %>%
  left_join(user_data %>% dplyr::rename(account=screen_name),
            by="account") %>%
  mutate(wn_mean=rowMeans(dplyr::select(.,vdare:NewRightAmerica), na.rm = TRUE),
         sc_mean=rowMeans(dplyr::select(.,pastramimachine:girlscientist), na.rm = TRUE)) %>%
  mutate(affiliation=log10(wn_mean/(sc_mean+0.001))) %>%
  dplyr::filter(sc_mean != 0 & wn_mean != 0) %>%
  mutate(urls=paste0("https://twitter.com/", account))

# plotdat2 <- plotdat
hdb_clust <- umap_plotdat %>%
  dplyr::select(X1:X2) %>%
  as.matrix() %>%
  hdbscan(x=., minPts=10)

umap_plotdat$cluster <- as.character(hdb_clust$cluster)

plot_embedding <- function(plotdat){
  
  p <- plotdat %>% # merge with user_data to get followers_count + other info
    ggplot(aes(x=X1, y=X2, label=account, colour=affiliation))+
    geom_point(aes(size=log(followers_count)), alpha=0.8)+
    scale_colour_gradientn("WN:Scientist Follower Ratio", 
                           colors=rev(brewer.pal(9, "RdBu")), 
                           breaks=seq(-3,3),
                           labels=c("1:1000", "1:100", "1:10","1:1","10:1","100:1","1000:1"),
                           limits=c(-3,3))+
    xlab("UMAP1")+
    ylab("UMAP2")+
    theme_classic()
  
  ply <- ggplotly(p)
  
  # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
  ply$x$data[[1]]$customdata <- plotdat$urls
  #pp  <- add_markers(pp, customdata = ~url)
  plyout <- onRender(ply, "
                     function(el, x) {
                     el.on('plotly_click', function(d) {
                     var url = d.points[0].customdata;
                     //url
                     window.open(url);
                     });
                     }
                     ")

  plyout
}

# htmlwidgets::saveWidget(plot_embedding(umap_plotdat), 
#                         file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
#                         title=paste0("homophily_ratio_", article_id))

plot_embedding(umap_plotdat)

Cosine similarity analysis

As a sanity check for the LDA model, we can also examine how users cluster in other ways. Here we calculate the cosine similarity between the follower bios of each pair of users and apply hierarchical clustering and PCA+UMAP to explore these relationships. Using the document term matrix, we calculate a pairwise matrix M (named distMatrix in the code), where entry M_i,j indicates the cosine similarity score (ranging from 0 to 1) between the follower bios of user i and user j.
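A minimal base R sketch of the pairwise cosine similarity calculation follows. The toy term counts and the `cosine_similarity` helper are assumptions for illustration; the framework may compute this matrix differently.

```r
# Cosine similarity between every pair of rows of a document term matrix.
cosine_similarity <- function(dtm) {
  norms <- sqrt(rowSums(dtm^2))
  unit_rows <- dtm / norms   # scale each row to unit length
  tcrossprod(unit_rows)      # dot products between all pairs of rows
}

# Toy term counts: rows are users, columns are terms.
dtm <- rbind(
  user_a = c(2, 1, 0),
  user_b = c(4, 2, 0),
  user_c = c(0, 0, 3)
)

M <- cosine_similarity(dtm)
M
```

Here `user_a` and `user_b` use the same terms in the same proportions, so their similarity is 1, while `user_c` shares no terms with either and scores 0. Because term counts are non-negative, all entries fall between 0 and 1.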

UMAP embedding by cosine similarity

This is analogous to performing PCA on SNPs within a population—it tells us how closely “related” different groups of users are, according to pairwise similarity between their followers’ bios.

dmp <- prcomp(as.matrix(distMatrix), center=TRUE, scale=TRUE)

dmp_df <- dmp$x %>%
  as_tibble(rownames="account") %>%
  inner_join(lda_gammas, by="account")

dmp_umap <- dmp$x %>% as.data.frame() %>%
    umap(n_neighbors=20, random_state=36262643)

dmp_df2 <- dmp_umap$layout %>%
  as_tibble(rownames="account") %>%
  inner_join(lda_gammas, by="account") %>%
  left_join(umap_plotdat, by="account") %>%
  arrange(topic) %>%
  mutate(topic=factor(topic, levels=topics_terms_levels)) %>%
  mutate(urls=paste0("https://twitter.com/", account))



# htmlwidgets::saveWidget(ggplotly(p2), 
#                         file=paste0(datadir, "/figs/cosine_umap_", article_id, ".html"),
#                         title=paste0("cosine_umap_", article_id))

# ggplotly(p2)

plot_embedding2 <- function(plotdat){
  
  p <- plotdat %>% 
    ggplot(aes(x=V1, y=V2, label=account, colour=topic))+
    geom_point(aes(size=wn_mean), alpha=0.8)+
    scale_colour_manual(values=cols)+
    scale_size(limits=c(0,0.5))+
    xlab("UMAP1")+
    ylab("UMAP2")+
    theme_classic()+
    theme(legend.position="none")
  
  ply <- ggplotly(p)
  
  # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
  # if(length(ply$x$data==12)){
    # for(i in 1:12){
  for(i in 1:length(ply$x$data)){
    query_topic <- unique(gsub(".*topic: ", "", ply$x$data[[i]]$text))
    
    ply$x$data[[i]]$customdata <- plotdat[plotdat$topic==query_topic,]$urls
  }
    
  # }
    #pp  <- add_markers(pp, customdata = ~url)
  plyout <- onRender(ply, "
                     function(el, x) {
                     el.on('plotly_click', function(d) {
                     var url = d.points[0].customdata;
                     //url
                     window.open(url);
                     });
                     }
                     ")

  plyout
}

# htmlwidgets::saveWidget(plot_embedding(umap_plotdat), 
#                         file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
#                         title=paste0("homophily_ratio_", article_id))

plot_embedding2(dmp_df2)

# plot PCA
# p5 <- dmp_df %>%
#   # mutate(topic_num=gsub(":.*", "", topic)) %>%
#   ggplot(aes(x=PC1, y=PC2, colour=topic, label=account))+
#   geom_point()+
#   scale_colour_manual(values=cols)+
#   theme(legend.position="none")+
#   guides(colour=guide_legend(ncol=1))
# 
# p5_ply <- ggplotly(p5) %>%
#   layout(legend = list(orientation = "v",   # show entries horizontally
#                      xanchor = "center",  # use center of legend as anchor
#                      yanchor = "bottom",
#                      x = 0, y=-1))
# 
# htmlwidgets::saveWidget(p5_ply, 
#                         file=paste0(datadir, "/figs/cosine_pca_", article_id, ".html"),
#                         title=paste0("cosine_pca_", article_id))
# 
# p5_ply

Retweet timeline analysis

The following plot shows the accumulation of (re-)tweets referencing the article over time.

Each point indicates a unique tweet referencing the article, with its timestamp on the x-axis. Subsequent retweets of each tweet are connected by a line, with the cumulative number of retweets at time T indicated on the y-axis (log scale). Points are colored and sized as before, indicating the predominant topic inferred by the LDA model and the level of homophily with white nationalists, respectively.
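The cumulative retweet number plotted on the y-axis amounts to a within-tweet running count ordered by time. The sketch below shows that step in base R on hypothetical events; the column names are illustrative and simpler than those in the report's `events` table.

```r
# Hypothetical events: "t1" and "t2" identify the original tweets.
events_toy <- data.frame(
  tweet = c("t1", "t1", "t2", "t1", "t2"),
  timestamp = as.POSIXct(c("2019-07-01 10:00:00", "2019-07-01 10:05:00",
                           "2019-07-01 11:00:00", "2019-07-01 12:00:00",
                           "2019-07-01 11:30:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort by time, then number the events within each original tweet.
events_toy <- events_toy[order(events_toy$timestamp), ]
idx <- seq_len(nrow(events_toy))
events_toy$order <- ave(idx, events_toy$tweet, FUN = seq_along)  # running count
events_toy$n     <- ave(idx, events_toy$tweet, FUN = length)     # total per tweet
events_toy
```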

Timeline plot

# p_times <- events %>% 
rt_dat <- events %>% 
  rename(account=names, rt=retweet_screen_name) %>% 
  left_join(dmp_df2 %>% dplyr::select(account, rt_topic=topic, wn_mean), by="account") %>% 
  mutate(rt=ifelse(is.na(rt), account, rt)) %>%
  left_join(dmp_df2 %>% dplyr::select(rt=account, source_topic=topic), by="rt") %>%
  mutate(tweets=paste0(rt, ": ", tweets)) %>%
  group_by(tweets) %>% 
  arrange(timestamps) %>% 
  mutate(order=row_number(), n=n()) %>% 
  # dplyr::filter(n>3) %>%
  ungroup() #%>%

# rt_dat %>% dplyr::select(account, rt, tweets, timestamps, source_topic, rt_topic, wn_mean) %>% group_by(tweets, source_topic) %>% count(rt_topic)

rt_dat_plot <- rt_dat %>%
  ggplot(aes(x=timestamps, y=order, group=tweets, label=account))+
    geom_line(colour="grey80")+
    geom_point(aes(colour=rt_topic, size=wn_mean), alpha=0.5)+
    scale_size(limits=c(0,0.5))+
    scale_colour_manual(values=cols)+
    scale_y_log10()+
    scale_x_discrete(breaks=events$timestamps[seq(1, nrow(events), 10)])+
    ylab("Retweet Number")+
    theme_classic()+
    theme(axis.title.x=element_blank(),
      axis.text.x=element_text(size=6, angle=45, hjust=1),
          legend.position="none")

# htmlwidgets::saveWidget(ggplotly(p_times), 
#                         file=paste0(datadir, "/figs/timeline_", article_id, ".html"),
#                         title=paste0("timeline_", article_id))

ggplotly(rt_dat_plot)

Missing data analysis

The LDA topic model described above is applied to the Twitter biographies of each user’s followers. However, having a biography is not mandatory and many users opt to leave their bio blank. Here we explore whether these patterns of missingness differ systematically among the topic groups.

Plot missingness distributions by group

This plot shows, for each of the K topics, the distribution of the fraction of followers with missing bios.
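The per-account missingness fraction can be sketched as follows, assuming a bio counts as missing when it is NA or contains only whitespace (the framework's exact rule is not stated here). The toy data frame and its column names are hypothetical.

```r
# Hypothetical follower records: one row per follower of each account.
followers_toy <- data.frame(
  account = c("u1", "u1", "u1", "u1", "u2", "u2"),
  bio = c("phd student", "", NA, "writer", "", "  "),
  stringsAsFactors = FALSE
)

# Assumed rule: a bio is missing if it is NA or blank after trimming.
is_missing <- is.na(followers_toy$bio) | !nzchar(trimws(followers_toy$bio))

# Fraction of followers with a missing bio, per account.
pct_missing <- tapply(is_missing, followers_toy$account, mean)
pct_missing
```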

p4 <- bios_m %>%
  ungroup() %>%
  mutate(topic_num=gsub(":.*", "", topic)) %>%
  mutate(topic=factor(topic, levels=topics_terms_levels)) %>%
  ggplot(aes(x=topic, y=pct, colour=topic, label=account))+
  geom_jitter(size=3, alpha=0.6)+
  geom_boxplot(outlier.shape=NA, fill=NA)+
  scale_colour_manual(values=cols)+
  theme(legend.position="bottom",
        axis.title.y=element_blank(),
        axis.text.x=element_blank())+
  guides(colour=guide_legend(ncol=1))

p4_ply <- ggplotly(p4) %>%
  layout(legend = list(orientation = "v",   # stack legend entries vertically
                     xanchor = "center",  # use center of legend as anchor
                     yanchor = "bottom",
                     x = 0, y=-1))

# htmlwidgets::saveWidget(p4_ply, 
#                         file=paste0(datadir, "/figs/missing_dist_", article_id, ".html"),
#                         title=paste0("missing_dist_", article_id))

p4_ply

Correlation between missing data and WN homophily

This plot investigates how patterns of missingness among follower bios are associated with the patterns of white nationalist homophily described above. In many of the papers analyzed, we see a positive correlation between the proportion of followers with missing bios and homophily with prominent white nationalists, but only within a subset of the topical groups inferred by the LDA model. This suggests that missingness within bios is itself a common feature of WN communities or WN-adjacent communities on Twitter. This also explains why some users have strong network homophily with known white nationalists, but do not show a strong topical association in the LDA model. Essentially, the followers that drive WN network homophily are systematically contributing less information to the LDA model and skewing some users to look more like other topics.
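The within-topic trend lines in this plot correspond to one linear fit per topic. A base R sketch of that idea on made-up data is shown below; the column names mirror those used in the report's plotting code (`topic`, `pct`, `wn_mean`), but the values and topic labels are hypothetical.

```r
# Made-up data: missingness (pct) and WN homophily (wn_mean) for two topics.
bios_toy <- data.frame(
  topic   = rep(c("topicA", "topicB"), each = 4),
  pct     = c(0.10, 0.20, 0.30, 0.40, 0.10, 0.20, 0.30, 0.40),
  wn_mean = c(0.05, 0.10, 0.15, 0.20, 0.03, 0.02, 0.03, 0.02)
)

# Mirror the plot's filter, then fit wn_mean ~ pct separately per topic.
bios_toy <- bios_toy[bios_toy$pct < 0.5, ]
slopes <- vapply(
  split(bios_toy, bios_toy$topic),
  function(d) unname(coef(lm(wn_mean ~ pct, data = d))["pct"]),
  numeric(1)
)
slopes
```

In this toy data only `topicA` shows a clear positive slope, which is the kind of topic-specific association the paragraph above describes.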

p4a <- bios_m %>%
  # mutate(topic_num=gsub(":.*", "", topic)) %>%
  mutate(topic=factor(topic, levels=topics_terms_levels)) %>%
  dplyr::filter(pct<0.5) %>%
  ggplot(aes(x=pct, y=wn_mean, group=topic, colour=topic, label=account))+
  geom_point()+
  geom_smooth(method="lm", se=F)+
  scale_colour_manual(values=cols)+
  # facet_wrap(~topic_num, scales="free")+
  xlab("Fraction of followers with missing bios")+
  ylab("WN Homophily")+
  theme(legend.position="bottom")+
  guides(colour=guide_legend(ncol=1))

p4a_ply <- ggplotly(p4a) %>%
  layout(legend = list(orientation = "v",   # stack legend entries vertically
                     xanchor = "center",  # use center of legend as anchor
                     yanchor = "bottom",
                     x = 0, y=-1))

# htmlwidgets::saveWidget(p4a_ply, 
#                         file=paste0(datadir, "/figs/missing_homophily_cor_", article_id, ".html"),
#                         title=paste0("missing_homophily_cor_", article_id))

p4a_ply

Created with the audiences framework by Jedidiah Carlson