Deep learning detects virus presence in cancer histology, bioRxiv, 2019-07-06

Saturday, Jul 6, 2019

Paper Info

Report generated on 2020-02-10.

doi: 10.1101/690206

LDA topic modeling analysis

We obtained a list of tweets/RTs referencing the specified article by querying the Crossref Event Data API. For each unique user that has (re-)tweeted the article, we then collected the user names and bios of their followers using the Twitter API (limited to the 10,000 most recent followers). We compiled these bios into a single “document” per account.

We next generated a document term matrix, enumerating the frequencies of every term that occurs 10 or more times in each document, excluding common stop words (e.g., “a”, “of”, “is”). Note that because emoji and hashtags are commonly used to convey semantic meaning in a user’s bio, we included these as unique “words”.
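As a rough illustration of the step above, the following base R sketch builds a small document term matrix. The bios, the stop-word list, and the helper name `tokenize` are hypothetical, the threshold is lowered from 10 to 2 so the toy data keeps some terms, and applying the threshold per document is an assumption about how the report counts terms.

```r
# Toy follower bios, one "document" per account (hypothetical data).
docs <- c(
  alice = "phd student cancer biology lab cancer #genomics",
  bob   = "founder of a startup, data and #ai enthusiast, data"
)

stop_words <- c("a", "of", "is", "and", "the")  # tiny illustrative stop list
min_count  <- 2  # the report uses 10; lowered here for the toy example

# Split on whitespace and strip only common punctuation,
# so hashtags and emoji survive as their own "words".
tokenize <- function(txt) {
  tokens <- strsplit(tolower(txt), "\\s+")[[1]]
  tokens <- gsub("[.,;:!?()]", "", tokens)
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

# Count terms per document, then align the counts on a shared vocabulary.
term_counts <- lapply(docs, function(d) table(tokenize(d)))
vocab <- sort(unique(unlist(lapply(term_counts, names))))

dtm <- t(sapply(term_counts, function(tc) {
  counts <- setNames(integer(length(vocab)), vocab)
  counts[names(tc)] <- as.integer(tc)
  counts
}))

# Zero out terms below the threshold (assumed to apply per document),
# then drop terms that no longer occur anywhere.
dtm[dtm < min_count] <- 0L
dtm <- dtm[, colSums(dtm) > 0, drop = FALSE]
print(dtm)
```

Only "cancer" (for alice) and "data" (for bob) reach the lowered threshold, so the resulting matrix has two rows and two columns.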

Inference of academic audience sectors

The table below lists the audience topics inferred by the LDA model, their top 30 keywords, and the fraction of users associated with that topic.

Topics that are associated with academic audiences (having at least one keyword in each of the following 3 keyword sets: ["phd", "md", "dr"], ["university", "institute", "universidad", "lab"], and ["student", "estudiante", "postdoc", "professor", "profesor", "prof"]) are indicated with a “🎓” emoji in the topic column.

For each topic, we calculate the cosine similarity between the top 30 keywords for that topic and the top 100 most common words found in the Wikipedia article for the paper’s main topical area (cancer-biology), as well as in each of the Wikipedia articles for 1179 academic disciplines. The discipline found to have the highest cosine similarity with a given topic is indicated in the best_match column and the corresponding topic x discipline cosine similarity score is indicated in the td_score column.
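The academic-audience rule described above can be sketched in a few lines of base R. The list names (`titles`, `institutions`, `roles`) and the function name `is_academic` are hypothetical labels chosen for this illustration; the keyword sets themselves are the ones quoted in the text, and the example terms are the first ten keywords of two topics from the table.

```r
# The three keyword sets quoted in the report.
keyword_sets <- list(
  titles       = c("phd", "md", "dr"),
  institutions = c("university", "institute", "universidad", "lab"),
  roles        = c("student", "estudiante", "postdoc",
                   "professor", "profesor", "prof")
)

# A topic is flagged as academic only if its top terms hit every set.
is_academic <- function(top_terms) {
  all(vapply(keyword_sets,
             function(ks) any(top_terms %in% ks),
             logical(1)))
}

# First ten keywords of topic1 and topic9 from the table.
topic1 <- c("cancer", "research", "health", "medical", "oncology",
            "md", "care", "center", "university", "medicine")
topic9 <- c("phd", "lab", "biology", "scientist", "genomics",
            "biologist", "genetics", "student", "candidate", "postdoc")

is_academic(topic1)  # FALSE: no keyword from the third set
is_academic(topic9)  # TRUE
```

This is consistent with the table, where topic1 carries no “🎓” marker even though it contains "md" and "university".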

For each of the matching disciplines, we then calculate a discipline x discipline cosine similarity score between that discipline and the paper’s main topical area, indicated in the cos() column.
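The report does not spell out how keyword lists are turned into vectors before the cosine similarity is taken. A minimal sketch, assuming each keyword list is represented as term weights over the union vocabulary (the function name `cosine_sim` and the weights are hypothetical):

```r
# Cosine similarity between two named term-weight vectors.
# Terms missing from one vector are treated as weight 0.
cosine_sim <- function(a, b) {
  vocab <- union(names(a), names(b))
  va <- setNames(numeric(length(vocab)), vocab)
  vb <- va
  va[names(a)] <- a
  vb[names(b)] <- b
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

# Hypothetical weights: topic keywords as presence/absence,
# discipline keywords as word counts from an article.
topic_terms      <- c(cancer = 1, oncology = 1, phd = 1, lab = 1)
discipline_terms <- c(cancer = 5, tumor = 3, cell = 2, oncology = 1)

cosine_sim(topic_terms, discipline_terms)
```

Two keyword lists with no shared terms score 0 and identical lists score 1, which is why a low similarity to the paper's home discipline indicates a more distant audience.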

Among the \(D\) academic topics (each assigned to discipline \(d\)), we calculate an aggregate interdisciplinary score as a weighted average of the similarity scores between each topic and the paper’s category, where the weights, \(w_d\) (indicated in the pct_acad column) are the fraction of the academic audience associated with that topic:

\(ID_{score} = 1 - \sum_{d \in D} w_d \times \cos(\vec{d}, \vec{d}_{home}) = 0.9053724\).
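The score can be approximately reproduced from the rounded values in the table below, taking the "Fraction of academic audience" column as the weights and the cos(t, dtarget) column as the similarity to the paper's home discipline. The variable names are chosen for this sketch only.

```r
# Weights: fraction of the academic audience per academic topic (from the table).
w_d <- c(topic2 = 0.196, topic3 = 0.270, topic4 = 0.083, topic6 = 0.031,
         topic7 = 0.175, topic9 = 0.033, topic12 = 0.212)

# Similarity of each topic to the paper's home discipline (from the table).
cos_home <- c(topic2 = 0.146, topic3 = 0.093, topic4 = 0.070, topic6 = 0.000,
              topic7 = 0.126, topic9 = 0.054, topic12 = 0.054)

# One minus the weighted average similarity.
id_score <- 1 - sum(w_d * cos_home)
id_score  # about 0.9052
```

The small gap between this value and the reported 0.9053724 is presumably due to the table values being rounded to three decimal places.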

topic | Top 30 terms | Number of users (estimated) | Fraction of total audience | Fraction of academic audience | cos(t, d_target) | Best matching discipline | cos(t, d_best)
--- | --- | --- | --- | --- | --- | --- | ---
topic1 | cancer, research, health, medical, oncology, md, care, center, university, medicine, clinical, patients, oncologist, science, phd, breast, healthcare, chicago, foundation, director, patient, advocate, news, support, dr, community, education, survivor, treatment, dedicated | 2 | 0.028 | NA | NA | NA | NA
topic2🎓 | cancer, md, oncology, medical, oncologist, research, radiation, health, clinical, phd, surgeon, professor, surgery, university, fellow, director, medicine, researcher, head, resident, care, center, dr, neck, hospital, husband, urology, scientist, physician, father | 9 | 0.147 | 0.196 | 0.146 | Ophthalmology | 0.196
topic3🎓 | cancer, research, phd, science, biology, university, data, professor, scientist, computational, student, systems, medical, biologist, lab, mathematical, oncology, medicine, md, evolution, health, researcher, genomics, learning, fellow, cell, institute, prof, assistant, director | 12 | 0.202 | 0.270 | 0.093 | Bioinformatics | 0.211
topic4🎓 | health, md, medical, research, medicine, healthcare, phd, student, care, science, university, data, clinical, director, fellow, cancer, researcher, professor, digital, learning, physician, scientist, cardiologist, dr, enthusiast, innovation, technology, tech, world, husband | 4 | 0.062 | 0.083 | 0.070 | Medicine | 0.211
topic5 | #digital, digital, #innovation, marketing, #socialmedia, manager, communication, #communication, web, media, #marketing, sant, france, 🇫🇷, #ia, chez, consultant, #startup, paris, numrique, journaliste, #ai, agence, chef, innovation, compte, data, #sant, digitale, #seo | 1 | 0.025 | NA | NA | NA | NA
topic6🎓 | data, science, learning, engineer, software, digital, technology, tech, scientist, #ai, enthusiast, developer, phd, business, machine, marketing, student, analytics, computer, solutions, #iot, research, passionate, 💻, university, researcher, web, security, manager, #machinelearning | 1 | 0.023 | 0.031 | 0.000 | Data_mining | 0.160
topic7🎓 | pathology, md, pathologist, medical, research, phd, professor, health, university, science, cancer, medicine, fellow, dr, 🔬, resident, director, clinical, student, advice, #pathology, philosophy, scientist, ethics, assistant, hospital, surgical, pancreatic, researcher, prof | 8 | 0.131 | 0.175 | 0.126 | Medicine | 0.193
topic8 | business, author, news, marketing, data, digital, technology, health, science, founder, speaker, world, media, tech, #ai, research, people, software, writer, global, ceo, free, community, online, entrepreneur, director, development, learning, consultant, services | 9 | 0.143 | NA | NA | NA | NA
topic9🎓 | phd, lab, biology, scientist, genomics, biologist, genetics, student, candidate, postdoc, cell, studying, bioinformatics, assistant, researcher, computational, science, professor, mom, stem, immunology, fellow, graduate, 👩, molecular, human, university, center, postdoctoral, biotech | 1 | 0.024 | 0.033 | 0.054 | Bioinformatics | 0.159
topic10 | #ai, data, business, #iot, digital, technology, #bigdata, tech, #machinelearning, marketing, learning, intelligence, world, software, #fintech, #datascience, science, solutions, engineer, #innovation, global, machine, #analytics, analytics, #, #blockchain, founder, media, #tech, entrepreneur | 2 | 0.041 | NA | NA | NA | NA
topic11 | founder, ceo, entrepreneur, tech, technology, cofounder, investor, data, learning, business, product, marketing, digital, space, media, software, people, engineer, director, partner, world, science, startups, startup, machine, innovation, web, scientist, global, intelligence | 1 | 0.015 | NA | NA | NA | NA
topic12🎓 | phd, student, research, university, lab, science, biology, postdoc, scientist, 🔬, cancer, fellow, cell, researcher, biologist, studying, professor, molecular, candidate, neuroscience, bioinformatics, institute, genomics, data, enthusiast, health, evolution, computational, neuroscientist, postdoctoral | 10 | 0.159 | 0.212 | 0.054 | Bioinformatics | 0.204

Paper topics in field space

To visualize the interdisciplinarity of the article, we calculate the cosine similarity between each pair of academic discipline keyword sets, producing an NxN matrix. We then apply PCA + UMAP to this matrix, producing a two-dimensional embedding of the relationship between academic disciplines.
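A minimal base R sketch of the first two steps (pairwise cosine similarity, then PCA) is shown here. The term counts are hypothetical, only four disciplines are used instead of 1179, and the final UMAP step is left out to keep the example free of package dependencies.

```r
# Hypothetical discipline x term count matrix (rows: disciplines).
disc_terms <- rbind(
  Medicine       = c(4, 3, 0, 0, 1),
  Oncology       = c(5, 2, 0, 1, 0),
  Bioinformatics = c(1, 0, 4, 3, 0),
  Data_mining    = c(0, 0, 3, 5, 1)
)
colnames(disc_terms) <- c("clinical", "patient", "genome", "algorithm", "ethics")

# NxN cosine similarity: dot products divided by the product of row norms.
norms <- sqrt(rowSums(disc_terms^2))
sim   <- tcrossprod(disc_terms) / outer(norms, norms)

# PCA on the similarity matrix; the first two components give a 2D layout.
pca    <- prcomp(sim, center = TRUE, scale. = TRUE)
coords <- pca$x[, 1:2]
round(sim, 2)
```

In the report the principal components are additionally passed through UMAP, which tends to preserve local neighborhoods better than the first two principal components alone.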

The inferred academic audience disciplines for this paper are highlighted and labeled. If the paper has a more interdisciplinary audience, the highlighted points will tend to be further apart from each other.


Plot topic breakdown by user

This plot shows the topic probabilities (gammas) for each user account according to the frequencies of each of the K topics inferred from the bios of their followers. Each stack of bars indicates a unique user that (re-)tweeted the article, and the height of the bar segment indicates the fraction of that user document that is associated with a given topic. Topics inferred to be associated with academic audiences are indicated with a “🎓” emoji in the legend. Click on a user to open their Twitter profile in a new window. Click on a topic in the legend to toggle it off/on in the plot.
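To make the gamma values concrete, here is a small base R sketch with hypothetical numbers. It checks that each account's topic probabilities sum to 1 (so every stacked bar has the same total height) and picks out the predominant topic per account. The column names mirror those used in the plotting code, but the data are invented.

```r
# Hypothetical per-account topic probabilities in long format.
gammas <- data.frame(
  document = rep(c("user_a", "user_b"), each = 3),
  topic    = rep(1:3, times = 2),
  gamma    = c(0.7, 0.2, 0.1,
               0.1, 0.3, 0.6)
)

# Each account's gammas should sum to 1.
totals <- tapply(gammas$gamma, gammas$document, sum)

# Predominant topic per account: the topic with the largest gamma.
by_doc   <- split(gammas, gammas$document)
dominant <- vapply(by_doc,
                   function(d) d$topic[which.max(d$gamma)],
                   integer(1))
dominant
```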

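# Packages used below: dplyr (pipes, mutate, joins), ggplot2, plotly (ggplotly, layout),
# and htmlwidgets (onRender).
library(dplyr)
library(ggplot2)
library(plotly)
library(htmlwidgets)

# Stacked bar chart of per-account topic probabilities (gamma).
# Relies on topics_terms, topic_ids, and cols being defined earlier in the report.
plot_embedding_bars <- function(plotdat, docs_order){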
  
  plotdat <- plotdat %>%
    mutate(topic_lab=paste0("topic", topic)) %>%
    ungroup() %>%
    mutate(document=factor(document, levels=docs_order$document)) %>%
    left_join(topics_terms, by="topic") %>%
    mutate(topic=paste0(topic, ": ", top_10)) %>%
    mutate(topic=ifelse(topic_lab %in% topic_ids, paste0("🎓", topic), topic)) #%>%
    
    # mutate(urls=paste0("https://twitter.com/", document))
  
  p <- plotdat %>%
    mutate(topic=factor(topic, levels=unique(plotdat$topic))) %>%
    ggplot(aes(x=document, y=gamma, fill=topic))+
    geom_bar(stat="identity", position="stack")+
    scale_fill_manual(values=cols)+
    scale_y_continuous(expand = c(0,0))+
    scale_x_discrete(position = "top")+ 
    xlab("Account")+
    theme(legend.position="bottom",
          axis.title.y=element_blank(),
          axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())+
    guides(fill=guide_legend(ncol=1))

  ply <- ggplotly(p) %>%
    layout(legend = list(orientation = "v",   # stack legend entries vertically
                       xanchor = "center",  # use center of legend as anchor
                       yanchor = "bottom",
                       x = 0, y=-1))
  # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
  for(i in seq_along(ply$x$data)){  # one trace per topic; avoids hard-coding 12
    ply$x$data[[i]]$customdata <- paste0("https://twitter.com/", docs_order$document)
  }
    #pp  <- add_markers(pp, customdata = ~url)
  plyout <- onRender(ply, "
                     function(el, x) {
                     el.on('plotly_click', function(d) {
                     var url = d.points[0].customdata;
                     //url
                     window.open(url);
                     });
                     }
                     ")

  plyout
}

# htmlwidgets::saveWidget(plot_embedding(umap_plotdat), 
#                         file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
#                         title=paste0("homophily_ratio_", article_id))

plot_embedding_bars(bios_lda_gamma, docs_order)

Network homophily analysis

Many of the papers we analyzed were inferred to have audience topics that are strongly suggestive of affiliation with white nationalism and other right-wing ideologies. According to the principle of network homophily, we would expect these users’ followers to substantially overlap the follower bases of prominent white nationalists.

These results show that most papers exhibit a continuous gradient between their affiliation with academic communities and their affiliation with white nationalist communities. For some users, up to 40% of followers also follow prominent white nationalist accounts while <1% follow prominent scientist accounts, corresponding to a ~100-fold enrichment of white nationalists among their follower base.

Using a curated set of 20 white nationalist accounts and 20 scientist accounts, we calculated the network homophily between each of these 40 accounts and each of the N users that have tweeted the paper, producing an Nx40 similarity matrix. We then applied PCA+UMAP to this matrix to reduce the dimensionality to Nx2.
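The report does not give the exact homophily formula. One plausible reading, used here purely as an assumption, is the fraction of a user's followers who also follow a given reference account. The follower IDs, account names, and the function name `homophily` below are hypothetical, and two reference accounts stand in for the 40 used in the report.

```r
# Hypothetical follower ID sets for users who tweeted the paper.
followers <- list(
  user_a = c(1, 2, 3, 4, 5),
  user_b = c(4, 5, 6, 7)
)

# Hypothetical follower ID sets for reference accounts.
reference <- list(
  ref_wn  = c(1, 2, 9),
  ref_sci = c(4, 5, 6, 10)
)

# Assumed measure: share of the user's followers that also follow the reference.
homophily <- function(user_f, ref_f) {
  length(intersect(user_f, ref_f)) / length(user_f)
}

# N x (number of reference accounts) similarity matrix.
sim_matrix <- sapply(reference, function(r) {
  sapply(followers, homophily, ref_f = r)
})
sim_matrix
```

Each row of the resulting matrix describes one user, so row-wise means over the white nationalist and scientist columns give summary scores of the kind computed as wn_mean and sc_mean in the report's code.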

Plot UMAP homophily embedding

This plot shows the 2D embedding of accounts according to their homophily with the two reference groups. We typically see a gradient from strong scientist homophily (blue) to strong white nationalist homophily (red), but the extent of these differences can vary. A paper that is exposed primarily to academic audiences will have mostly blue points, whereas a paper exposed to white nationalist audiences will have more red points.
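The colour scale is driven by the affiliation score defined in the report's code as log10(wn_mean / (sc_mean + 0.001)). A short sketch of how that score maps onto the legend labels follows; the wrapper function and the input values are illustrative only.

```r
# Affiliation score as defined in the report's code: log10 ratio of mean
# white nationalist homophily to mean scientist homophily, with a small
# pseudocount in the denominator.
affiliation <- function(wn_mean, sc_mean, pseudocount = 0.001) {
  log10(wn_mean / (sc_mean + pseudocount))
}

affiliation(0.4, 0.003)     # approximately 2, i.e. the "100:1" end of the scale
affiliation(0.1, 0.099)     # approximately 0, i.e. "1:1"
affiliation(0.0005, 0.499)  # approximately -3, i.e. "1:1000"
```

Because the score is on a log10 scale, each unit step along the legend corresponds to a tenfold change in the ratio of white nationalist to scientist followers.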

umap_plotdat <- bind_cols(sim_matrix_pca[[3]][[1]], data.frame(sim_matrix_umap$layout)) %>%
  left_join(user_data %>% dplyr::rename(account=screen_name),
            by="account") %>%
  mutate(wn_mean=rowMeans(dplyr::select(.,vdare:NewRightAmerica), na.rm = TRUE),
         sc_mean=rowMeans(dplyr::select(.,pastramimachine:girlscientist), na.rm = TRUE)) %>%
  mutate(affiliation=log10(wn_mean/(sc_mean+0.001))) %>%
  dplyr::filter(sc_mean != 0 & wn_mean != 0) %>%
  mutate(urls=paste0("https://twitter.com/", account))

# plotdat2 <- plotdat
hdb_clust <- umap_plotdat %>%
  dplyr::select(X1:X2) %>%
  as.matrix() %>%
  hdbscan(x=., minPts=10)

umap_plotdat$cluster <- as.character(hdb_clust$cluster)

plot_embedding <- function(plotdat){
  
  p <- plotdat %>% # merge with user_data to get followers_count + other info
    ggplot(aes(x=X1, y=X2, label=account, colour=affiliation))+
    geom_point(aes(size=log(followers_count)), alpha=0.8)+
    scale_colour_gradientn("WN:Scientist Follower Ratio", 
                           colors=rev(brewer.pal(9, "RdBu")), 
                           breaks=seq(-3,3),
                           labels=c("1:1000", "1:100", "1:10","1:1","10:1","100:1","1000:1"),
                           limits=c(-3,3))+
    xlab("UMAP1")+
    ylab("UMAP2")+
    theme_classic()
  
  ply <- ggplotly(p)
  
  # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
  ply$x$data[[1]]$customdata <- plotdat$urls
  #pp  <- add_markers(pp, customdata = ~url)
  plyout <- onRender(ply, "
                     function(el, x) {
                     el.on('plotly_click', function(d) {
                     var url = d.points[0].customdata;
                     //url
                     window.open(url);
                     });
                     }
                     ")

  plyout
}

# htmlwidgets::saveWidget(plot_embedding(umap_plotdat), 
#                         file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
#                         title=paste0("homophily_ratio_", article_id))

plot_embedding(umap_plotdat)

Cosine similarity analysis

As a sanity check for the LDA model, we can also examine how users cluster in other ways. Here we calculate the cosine similarity between the follower bios of each pair of users and apply hierarchical clustering and PCA+UMAP to explore these relationships. Using the document term matrix, we calculate a pairwise similarity matrix M (stored as distMatrix in the code), where entry M_i,j is the cosine similarity score (ranging from 0 to 1) between the follower bios of user i and user j.
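A small base R sketch of this check is given here, using a hypothetical four-user document term matrix. Converting similarity to distance as 1 minus similarity and using average linkage are assumptions made for the illustration, not details stated in the report.

```r
# Hypothetical user x term count matrix built from follower bios.
dtm <- rbind(
  user_a = c(3, 2, 0, 0),
  user_b = c(2, 3, 0, 1),
  user_c = c(0, 0, 4, 2),
  user_d = c(0, 1, 3, 3)
)
colnames(dtm) <- c("phd", "cancer", "#ai", "startup")

# Pairwise cosine similarity between users (0 to 1 for non-negative counts).
norms <- sqrt(rowSums(dtm^2))
sim   <- tcrossprod(dtm) / outer(norms, norms)

# Hierarchical clustering on the corresponding distance (1 - similarity).
hc       <- hclust(as.dist(1 - sim), method = "average")
clusters <- cutree(hc, k = 2)
clusters
```

Users whose followers describe themselves with similar vocabulary end up in the same cluster, which can then be compared against the predominant LDA topic for each user.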

UMAP embedding by cosine similarity

This is analogous to performing PCA on SNPs within a population—it tells us how closely “related” different groups of users are, according to pairwise similarity between their followers’ bios.

# PCA on the pairwise cosine similarity matrix; note the argument is `scale.`, not `scale`
dmp <- prcomp(as.matrix(distMatrix), center=TRUE, scale.=TRUE)

dmp_df <- dmp$x %>%
  as_tibble(rownames="account") %>%
  inner_join(lda_gammas, by="account")

dmp_umap <- dmp$x %>% as.data.frame() %>%
    umap(n_neighbors=20, random_state=36262643)

dmp_df2 <- dmp_umap$layout %>%
  as_tibble(rownames="account") %>%
  inner_join(lda_gammas, by="account") %>%
  left_join(umap_plotdat, by="account") %>%
  arrange(topic) %>%
  mutate(topic=factor(topic, levels=topics_terms_levels)) %>%
  mutate(urls=paste0("https://twitter.com/", account))



# htmlwidgets::saveWidget(ggplotly(p2), 
#                         file=paste0(datadir, "/figs/cosine_umap_", article_id, ".html"),
#                         title=paste0("cosine_umap_", article_id))

# ggplotly(p2)

plot_embedding2 <- function(plotdat){
  
  p <- plotdat %>% 
    ggplot(aes(x=V1, y=V2, label=account, colour=topic))+
    geom_point(aes(size=wn_mean), alpha=0.8)+
    scale_colour_manual(values=cols)+
    scale_size(limits=c(0,0.5))+
    xlab("UMAP1")+
    ylab("UMAP2")+
    theme_classic()+
    theme(legend.position="none")
  
  ply <- ggplotly(p)
  
  # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
  # ggplotly() typically creates one trace per topic, so attach each trace's URLs separately.
  for(i in seq_along(ply$x$data)){
    query_topic <- unique(gsub(".*topic: ", "", ply$x$data[[i]]$text))
    
    ply$x$data[[i]]$customdata <- plotdat[plotdat$topic==query_topic,]$urls
  }
  plyout <- onRender(ply, "
                     function(el, x) {
                     el.on('plotly_click', function(d) {
                     var url = d.points[0].customdata;
                     //url
                     window.open(url);
                     });
                     }
                     ")

  plyout
}


plot_embedding2(dmp_df2)

Retweet timeline analysis

The following plot shows the accumulation of (re-)tweets referencing the article over time.

Each point indicates a unique tweet referencing the article, positioned along the x-axis by its timestamp. Subsequent retweets of each tweet are connected by a line, with the cumulative number of retweets at time T indicated on the y-axis. Points are colored and sized as before, indicating the predominant topic inferred by the LDA model and the level of homophily with white nationalists, respectively.
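The cumulative counts on the y-axis can be derived as in the following base R sketch. The tweet IDs, timestamps, and column names are hypothetical, and the first row of each group is assumed to be the original tweet, which therefore has zero retweets.

```r
# Hypothetical (re-)tweet events; original_id groups retweets with their source tweet.
tweets <- data.frame(
  original_id = c("t1", "t1", "t1", "t2", "t2"),
  timestamp   = as.POSIXct(c("2019-07-06 10:00:00", "2019-07-06 12:30:00",
                             "2019-07-06 11:00:00", "2019-07-07 09:00:00",
                             "2019-07-07 09:45:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Order events in time within each original tweet.
tweets <- tweets[order(tweets$original_id, tweets$timestamp), ]

# Running count within each group, minus one so the original tweet starts at 0.
tweets$cumulative <- ave(seq_along(tweets$timestamp),
                         tweets$original_id,
                         FUN = seq_along) - 1L
tweets
```

Plotting timestamp against cumulative, with one line per original_id, gives the kind of staircase described above.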

Timeline plot

Missing data analysis

The LDA topic model described above is applied to the Twitter biographies of each user’s followers. However, having a biography is not mandatory and many users opt to leave their bio blank. Here we explore whether these patterns of missingness differ systematically among the topic groups.

Plot missingness distributions by group

This plot shows the distribution of the fraction of followers with missing bios for each of the K topics.
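A minimal sketch of the underlying calculation, with invented follower data and topic assignments: a bio counts as missing when it is NA or contains only whitespace (an assumption about how blank bios are detected), and the per-account fractions are then grouped by each account's predominant topic.

```r
# Hypothetical follower bios per account.
followers <- data.frame(
  account = c("user_a", "user_a", "user_a", "user_a", "user_b", "user_b"),
  bio     = c("phd student", "", NA, "oncologist", "", "  "),
  stringsAsFactors = FALSE
)

# A bio is treated as missing if it is NA or blank.
is_missing <- is.na(followers$bio) | !nzchar(trimws(followers$bio))

# Fraction of followers with a missing bio, per account.
missing_frac <- vapply(split(is_missing, followers$account), mean, numeric(1))

# Hypothetical predominant topic per account, used to group the fractions.
user_topic <- c(user_a = "topic3", user_b = "topic8")
split(missing_frac, user_topic[names(missing_frac)])
```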

Correlation between missing data and WN homophily

This plot investigates how patterns of missingness among follower bios are associated with the patterns of white nationalist homophily described above. In many of the papers analyzed, we see a positive correlation between the proportion of followers with missing bios and homophily with prominent white nationalists, but only within a subset of the topical groups inferred by the LDA model. This suggests that missingness within bios is itself a common feature of WN or WN-adjacent communities on Twitter. It also explains why some users have strong network homophily with known white nationalists but do not show a strong topical association in the LDA model: the followers that drive WN network homophily systematically contribute less information to the LDA model, skewing some users to look more like other topics.
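The within-group comparison can be sketched as a per-topic correlation. All values below are invented to illustrate the pattern described (a strong positive correlation in one topic group and none in another); they are not results from the report, and Pearson correlation is an assumed choice of measure.

```r
# Hypothetical per-user summaries: predominant topic, fraction of followers
# with missing bios, and mean homophily with white nationalist accounts.
users <- data.frame(
  topic        = rep(c("topicA", "topicB"), each = 4),
  missing_frac = c(0.10, 0.20, 0.30, 0.40,
                   0.15, 0.25, 0.20, 0.30),
  wn_mean      = c(0.02, 0.05, 0.09, 0.12,
                   0.03, 0.02, 0.02, 0.03)
)

# Pearson correlation between missingness and WN homophily within each topic.
cor_by_topic <- vapply(split(users, users$topic),
                       function(d) cor(d$missing_frac, d$wn_mean),
                       numeric(1))
round(cor_by_topic, 2)
```

Computing the correlation separately per topic matters because pooling all users could hide an association that exists only within certain groups.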

 

Created with the audiences framework by Jedidiah Carlson