Paper Info
Report generated on 2019-07-23.
doi: 10.1101/088278
LDA topic modeling analysis
We obtained a list of tweets/RTs referencing the specified article by querying the Crossref Event Data API. For each unique user that has (re-)tweeted the article, we then collected the user names and bios of their followers using the Twitter API (limited to the 10,000 most recent followers). We compiled these bios into a single “document” per account.
We next generated a document term matrix, enumerating the frequencies of every term that occurs 10 or more times in each document, excluding common stop words (e.g., “a”, “of”, “is”). Note that because emoji and hashtags are commonly used to convey semantic meaning in a user’s bio, we included these as unique “words”.
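As a rough illustration of the document term matrix step, the base R sketch below tokenizes two made-up "documents", drops a tiny stop list, and keeps terms above a count threshold. The account names, bios, stop list, and `tokenize` helper are all hypothetical, and the threshold is applied to corpus-wide counts as one possible reading of the text; the report's actual pipeline is not shown here.

```r
# Toy "documents": one string of concatenated follower bios per account (made up)
docs <- c(
  acct_a = "PhD student in neuroscience, #rstats user, PhD candidate",
  acct_b = "Writer and author of books. Proud mom, #rstats fan"
)
stop_words <- c("a", "of", "is", "in", "and")  # tiny illustrative stop list

# Split on whitespace so hashtags survive as single tokens, lowercase,
# strip common sentence punctuation (but not '#'), and drop stop words
tokenize <- function(x) {
  tok <- tolower(strsplit(x, "\\s+")[[1]])
  tok <- gsub("[,.;:!?()]", "", tok)
  tok[nzchar(tok) & !(tok %in% stop_words)]
}

tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Rows are documents (accounts), columns are terms, cells are counts
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# The report uses a threshold of 10; lowered to 2 for the toy data.
# Assumption: the threshold is applied to counts summed over all documents.
min_count <- 2
dtm <- dtm[, colSums(dtm) >= min_count, drop = FALSE]
print(dtm)
```

Only "phd" and "#rstats" survive the filter in this toy corpus, which shows how a hashtag is carried through as its own term.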
Inference of academic audience sectors
The table below lists the audience topics inferred by the LDA model, their top 30 keywords, and the fraction of users associated with that topic.
Topics that are associated with academic audiences (having at least one keyword in each of the following 3 keyword sets: ["phd", "md", "dr"], ["university", "institute", "universidad", "lab"], and ["student", "estudiante", "postdoc", "professor", "profesor", "prof"]) are indicated with a “🎓” emoji in the topic column. For each topic, we calculate the cosine similarity between the top 30 keywords for that topic and the top 100 most common words found in the Wikipedia article for the paper’s category (scientific-communication-and-education) and in each of the Wikipedia articles for 1179 academic disciplines. The discipline found to have the highest cosine similarity with a given topic is indicated in the best_match column and the corresponding topic x discipline cosine similarity score is indicated in the td_score column.
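The academic-audience rule described above can be sketched in a few lines of base R. The three keyword sets are copied from the text; the `acad_sets` and `is_academic` names are hypothetical, and the two keyword vectors are truncated excerpts of topic1 and topic9 from the table in this report.

```r
# Keyword sets copied from the text above
acad_sets <- list(
  titles       = c("phd", "md", "dr"),
  institutions = c("university", "institute", "universidad", "lab"),
  roles        = c("student", "estudiante", "postdoc", "professor", "profesor", "prof")
)

# A topic is flagged as academic only if its keywords hit every one of the sets
is_academic <- function(top_terms) {
  all(vapply(acad_sets, function(s) any(s %in% top_terms), logical(1)))
}

# Leading keywords of two topics from the table (truncated for brevity)
topic1 <- c("neuroscience", "phd", "neuroscientist", "science", "research",
            "student", "cognitive", "brain", "psychology", "university")
topic9 <- c("writer", "author", "books", "book", "music", "world", "artist")

is_academic(topic1)  # TRUE: matches "phd", "university", and "student"
is_academic(topic9)  # FALSE: matches none of the three sets
```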
For each of the matching disciplines, we then calculate a discipline x discipline cosine similarity score between that discipline and the paper’s main topical area, indicated in the cos(t, dtarget) column.
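The sketch below shows one way to compute a cosine similarity between two keyword sets and pick the best-matching discipline. It assumes each set is a named term-frequency vector aligned on the union of terms; the `cosine_sim` helper, the weights, and the two toy disciplines are made up for illustration, since the report does not show how terms are weighted.

```r
# Cosine similarity between two named term-frequency vectors,
# aligned on the union of their terms (missing terms count as 0)
cosine_sim <- function(a, b) {
  terms <- union(names(a), names(b))
  va <- setNames(numeric(length(terms)), terms)
  vb <- va
  va[names(a)] <- a
  vb[names(b)] <- b
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

# Made-up weights: a topic's keywords and two hypothetical discipline word counts
topic <- c(ecology = 1, phd = 1, marine = 1, climate = 1)
disciplines <- list(
  Ecology  = c(ecology = 3, species = 2, climate = 1),
  Medicine = c(patient = 3, clinical = 2, phd = 1)
)

# Score the topic against every discipline and keep the highest-scoring one
scores <- vapply(disciplines, cosine_sim, numeric(1), a = topic)
best_match <- names(which.max(scores))
print(round(scores, 3))
print(best_match)
```

The same helper applies to a discipline x discipline comparison by passing two discipline vectors instead of a topic and a discipline.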
Among the \(D\) academic topics (each assigned to discipline \(d\)), we calculate an aggregate interdisciplinarity score as one minus the weighted average of the similarity scores between each topic and the paper’s category, where the weights, \(w_d\) (indicated in the pct_acad column), are the fraction of the academic audience associated with that topic:
\(ID_{score} = 1 - \sum_{d \in D} w_d \times \cos(\vec{d}, \vec{d}_{home}) = 0.9809842\).
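As a worked check of the formula, the snippet below recomputes the score from the rounded values printed in the table in this report (the "Fraction of academic audience" column as the weights, and the similarity to the paper's category as the cosine term). Because the table values are rounded to three decimals, the result agrees with the reported score only to about three decimal places; the variable names are illustrative.

```r
# Values transcribed from the table in this report, for the topics with a
# non-NA academic fraction, in the order topic1-7, topic11, topic12
w_d   <- c(0.153, 0.078, 0.127, 0.269, 0.038, 0.126, 0.075, 0.074, 0.060)
cos_d <- c(0.000, 0.072, 0.058, 0.000, 0.000, 0.019, 0.000, 0.000, 0.061)

# The weights are fractions of the academic audience, so they should sum to 1
stopifnot(abs(sum(w_d) - 1) < 1e-9)

# One minus the weighted average similarity to the paper's category
id_score <- 1 - sum(w_d * cos_d)
print(round(id_score, 3))  # 0.981
```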
# Build the display table: per-topic user counts joined with the
# best-matching discipline and the similarity scores
tf_table <- full_join(
    lda_gammas_count %>%
      mutate(topic=paste0("topic", gsub(":.*", "", topic))) %>%
      dplyr::select(topic, top_terms=top_10, n_users=n, pct_total=pct),
    top_fields,
    by="topic") %>%
  ungroup() %>%
  # full_join(acad_topics2, by="topic") %>%
  mutate(topic_lab=topic) %>%
  # mutate(topic=factor(topic, levels=unique(lda_gammas_count$topic))) %>%
  # arrange(topic_lab) %>%
  # mutate(topic=factor(topic, levels=paste0("topic", 1:12))) %>%
  full_join(match_scores) %>%
  # colour each topic cell to match the plot palette
  mutate(topic=cell_spec(topic, "html",
                         color="black", align = "c",
                         background=c(cols[as.numeric(gsub("topic", "", topic))]))) %>%
  # append the academic marker to topics flagged as academic
  mutate(topic=ifelse(topic_lab %in% topic_ids, paste0(topic, "🎓"), topic)) %>%
  # dplyr::select(-c(topic_lab, pct, target_cat, score_wt, score_wt2)) %>%
  dplyr::select(topic, top_terms, n_users, pct_total, pct_acad, tc_score, best_match, td_score) %>%
  # round for display
  mutate(n_users=round(n_users),
         pct_total=round(pct_total, 3),
         pct_acad=round(pct_acad, 3),
         td_score=round(td_score, 3),
         tc_score=round(tc_score, 3)) %>%
  # human-readable column headers
  dplyr::rename("Number of users (estimated)" = "n_users",
                "Top 30 Terms" = "top_terms",
                "Fraction of total audience" = "pct_total",
                "Fraction of academic audience" = "pct_acad",
                "Best matching discipline" = "best_match",
                "cos(t, d<sub>best</sub>)" = "td_score",
                "cos(t, d<sub>target</sub>)" = "tc_score")
## Joining, by = c("topic", "top_terms", "n_users", "best_match", "td_score")
knitr::kable(tf_table, format="html", escape=F) %>%
  column_spec(2, width_max = "200em; display: inline-block;") %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "600px")
topic | Top 30 Terms | Number of users (estimated) | Fraction of total audience | Fraction of academic audience | cos(t, dtarget) | Best matching discipline | cos(t, dbest) |
---|---|---|---|---|---|---|---|
topic1🎓 | neuroscience, phd, neuroscientist, science, research, student, cognitive, brain, psychology, university, lab, professor, studying, postdoc, researcher, scientist, learning, psychologist, health, dr, memory, assistant, candidate, fellow, clinical, cognition, human, prof, computational, mental | 378 | 0.127 | 0.153 | 0.000 | Psychology | 0.225 |
topic2🎓 | research, health, university, education, dr, lecturer, phd, researcher, learning, student, uk, director, teacher, public, people, librarian, school, academic, science, mental, centre, passionate, community, senior, teaching, psychology, manager, development, food, support | 192 | 0.065 | 0.078 | 0.072 | Community_psychology | 0.200 |
topic3🎓 | science, ecology, phd, ecologist, conservation, university, research, student, marine, climate, scientist, environmental, professor, change, biologist, dr, studying, biology, postdoc, evolutionary, wildlife, evolution, nature, candidate, researcher, plant, fellow, enthusiast, biodiversity, water | 312 | 0.105 | 0.127 | 0.058 | Ecology | 0.225 |
topic4🎓 | phd, science, biology, research, lab, university, student, professor, scientist, biologist, postdoc, plant, 🔬, molecular, cell, studying, evolution, genomics, dr, assistant, genetics, fellow, researcher, cancer, microbiology, candidate, prof, evolutionary, stem, enthusiast | 663 | 0.223 | 0.269 | 0.000 | Bioinformatics | 0.182 |
topic5🎓 | sports, sport, phd, exercise, coach, university, science, performance, research, health, nutrition, student, physical, msc, physiology, pathology, lecturer, scientist, researcher, strength, pathologist, professor, diabetes, conditioning, physiotherapist, fitness, training, medicine, biomechanics, bsc | 94 | 0.032 | 0.038 | 0.000 | Kinesiology | 0.212 |
topic6🎓 | phd, data, science, student, university, research, learning, language, scientist, professor, researcher, machine, computer, psychology, engineer, assistant, software, linguistics, #rstats, candidate, dr, computational, technology, prof, education, enthusiast, statistics, engineering, lecturer, cognitive | 311 | 0.105 | 0.126 | 0.019 | Artificial_intelligence | 0.217 |
topic7🎓 | health, md, medical, medicine, care, research, clinical, healthcare, physician, fellow, dr, cancer, director, hospital, university, resident, doctor, professor, patient, emergency, researcher, phd, student, education, cardiology, nurse, mom, husband, advice, advocate | 184 | 0.062 | 0.075 | 0.000 | Medicine | 0.245 |
topic8🎓 | enfermera, salud, vida, ms, universidad, ciencia, mdico, mundo, enfermero, hospital, medicina, investigacin, especialista, phd, amante, madrid, ser, oficial, siempre, profesor, estudiante, mster, ana, educacin, familia, ciencias, mejor, gestin, informacin, psicologa | 150 | 0.050 | NA | NA | NA | NA |
topic9 | #, writer, author, ❤, #fbpe, #_, ✨, books, book, music, world, artist, art, people, 🌈, writing, time, ♥, proud, mom, 🇬🇧, 🇺🇸, teacher, 🌊, free, live, 2, #resist, wife, brexit | 174 | 0.059 | NA | NA | NA | NA |
topic10 | science, business, marketing, data, digital, tech, founder, technology, world, design, news, media, engineer, author, writer, space, people, director, software, entrepreneur, ceo, music, enthusiast, physics, father, cofounder, free, husband, consultant, speaker | 184 | 0.062 | NA | NA | NA | NA |
topic11🎓 | science, research, phd, genomics, bioinformatics, data, genetics, scientist, cancer, biology, computational, health, student, university, medicine, researcher, human, institute, professor, biologist, sciences, genetic, molecular, bioinformatician, fellow, medical, chemistry, biotech, clinical, learning | 182 | 0.061 | 0.074 | 0.000 | Biostatistics | 0.214 |
topic12🎓 | phd, student, professor, university, research, science, politics, candidate, political, policy, health, economics, public, researcher, history, writer, dr, assistant, philosophy, studies, prof, education, fellow, development, human, director, editor, law, 🌈, school | 147 | 0.049 | 0.060 | 0.061 | Social_science | 0.290 |
Paper topics in field space
To visualize the interdisciplinarity of the article, we calculate the cosine similarity between each pair of academic discipline keyword sets, producing an NxN matrix. We then apply PCA + UMAP to this matrix, producing a two-dimensional embedding of the relationship between academic disciplines.
The inferred academic audience disciplines for this paper are highlighted and labeled. If the paper has a more interdisciplinary audience, the highlighted points will tend to be further apart from each other.
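A minimal base R sketch of the first two steps (pairwise cosine similarity, then PCA) is given below. The discipline-by-term count matrix is randomly generated and purely hypothetical, and the UMAP step is left out because it needs an add-on package; how the report chains PCA into UMAP is not shown in the text, so this is only an outline of the idea.

```r
set.seed(1)

# Hypothetical discipline-by-term count matrix (6 disciplines, 20 terms);
# adding 1 guarantees that no row is all zeros
n_disc <- 6
counts <- matrix(rpois(n_disc * 20, lambda = 2) + 1, nrow = n_disc,
                 dimnames = list(paste0("discipline", seq_len(n_disc)), NULL))

# N x N cosine similarity: scale each row to unit length, then cross-multiply
unit <- counts / sqrt(rowSums(counts^2))
sim <- tcrossprod(unit)

# PCA on the similarity matrix, keeping the first two components as coordinates
pca <- prcomp(sim, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]
print(round(coords, 3))
```

Each row of `coords` is one discipline; disciplines with similar keyword profiles land close together, which is what makes spread among the highlighted points a visual cue for interdisciplinarity.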
tf2 <- top_fields %>%
  dplyr::select(topic, flag_title=best_match)
ms_df <- data.frame(ms_umap$layout, title=lword_counts$article_title) %>%
  mutate(flag_title=ifelse(title %in% top_fields$best_match, title, NA)) %>%
  left_join(tf2) %>%
  mutate(topic=factor(topic, levels=unique(tf2$topic)))
## Joining, by = "flag_title"
p_fields <- ggplot()+
  geom_point(data=ms_df[is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title), colour="grey80", alpha=0.4)+
  geom_point(data=ms_df[!is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title, colour=topic), size=3, alpha=0.8)+
  scale_colour_manual(values=c(cols[as.numeric(gsub("topic", "", top_fields$topic))]))+
  theme_classic()+
  theme(legend.position="none")
ggplotly(p_fields) %>%
  add_annotations(x = ms_df[!is.na(ms_df$flag_title),]$X1,
                  y = ms_df[!is.na(ms_df$flag_title),]$X2,
                  text = ms_df[!is.na(ms_df$flag_title),]$flag_title)
# in development—custom data layer to link to wikipedia articles when clicking points
# plot_embedding_wiki <- function(ms_df){
#
# p_fields <- ggplot()+
# geom_point(data=ms_df[is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title), colour="grey80", alpha=0.4)+
# geom_point(data=ms_df[!is.na(ms_df$flag_title),], aes(x=X1, y=X2, label=title, colour=topic), size=3, alpha=0.8)+
# scale_colour_manual(values=c(cols[as.numeric(gsub("topic", "", top_fields$topic))]))+
# theme_classic()+
# theme(legend.position="none")
#
# ply <- ggplotly(p_fields) %>%
# add_annotations(x = ms_df[!is.na(ms_df$flag_title),]$X1,
# y = ms_df[!is.na(ms_df$flag_title),]$X2,
# text = ms_df[!is.na(ms_df$flag_title),]$flag_title)
#
# # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
# for(i in 1:12){
# ply$x$data[[i]]$customdata <- plotdat[grepl(paste0("^", i, ": "), plotdat$topic),]$urls
# }
# #pp <- add_markers(pp, customdata = ~url)
# plyout <- onRender(ply, "
# function(el, x) {
# el.on('plotly_click', function(d) {
# var url = d.points[0].customdata;
# //url
# window.open(url);
# });
# }
# ")
#
# plyout
# }
# htmlwidgets::saveWidget(plot_embedding(umap_plotdat),
# file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
# title=paste0("homophily_ratio_", article_id))
# plot_embedding2(dmp_df2)
Plot topic breakdown by user
This plot shows the topic probabilities (gammas) for each user account according to the frequencies of each of the K topics inferred from the bios of their followers. Each stack of bars indicates a unique user that (re-)tweeted the article, and the height of the bar segment indicates the fraction of that user document that is associated with a given topic. Topics inferred to be associated with academic audiences are indicated with a “🎓” emoji in the legend. Click on a user to open their Twitter profile in a new window. Click on a topic in the legend to toggle it off/on in the plot.
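To make the structure of the plotted data concrete, the sketch below builds a made-up document-topic probability matrix and reshapes it into the long format (one row per account and topic) that a stacked bar chart consumes. The account names and gamma values are invented; only the column names `document`, `topic`, and `gamma` follow the plotting code in this report.

```r
# Made-up document-topic probabilities: one row per account, one column per topic
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("acct_a", "acct_b", "acct_c"),
                                paste0("topic", 1:3)))

# Long format: one row per (document, topic) pair, i.e. one bar segment each
long <- as.data.frame(as.table(gamma), stringsAsFactors = FALSE)
names(long) <- c("document", "topic", "gamma")

# The segments for each account stack to a total height of 1
heights <- tapply(long$gamma, long$document, sum)
print(heights)

# The tallest segment in an account's stack is its dominant topic
dominant <- colnames(gamma)[apply(gamma, 1, which.max)]
print(setNames(dominant, rownames(gamma)))
```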
plot_embedding_bars <- function(plotdat, docs_order){
  # Label each topic with its top terms and flag academic topics
  plotdat <- plotdat %>%
    mutate(topic_lab=paste0("topic", topic)) %>%
    ungroup() %>%
    mutate(document=factor(document, levels=docs_order$document)) %>%
    left_join(topics_terms, by="topic") %>%
    mutate(topic=paste0(topic, ": ", top_10)) %>%
    mutate(topic=ifelse(topic_lab %in% topic_ids, paste0("🎓", topic), topic)) #%>%
  # mutate(urls=paste0("https://twitter.com/", document))
  # One stacked bar per account, one segment per topic
  p <- plotdat %>%
    mutate(topic=factor(topic, levels=unique(plotdat$topic))) %>%
    ggplot(aes(x=document, y=gamma, fill=topic))+
    geom_bar(stat="identity", position="stack")+
    scale_fill_manual(values=cols)+
    scale_y_continuous(expand = c(0,0))+
    scale_x_discrete(position = "top")+
    xlab("Account")+
    theme(legend.position="bottom",
          axis.title.y=element_blank(),
          axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())+
    guides(fill=guide_legend(ncol=1))
  ply <- ggplotly(p) %>%
    layout(legend = list(orientation = "v", # stack legend entries vertically
                         xanchor = "center", # use center of legend as anchor
                         yanchor = "bottom",
                         x = 0, y=-1))
  # Clickable points link to profile URL using onRender: https://stackoverflow.com/questions/51681079
  # Attach the profile URLs to every trace (one trace per topic) rather than a hard-coded 1:12
  for(i in seq_along(ply$x$data)){
    ply$x$data[[i]]$customdata <- paste0("https://twitter.com/", docs_order$document)
  }
  #pp <- add_markers(pp, customdata = ~url)
  plyout <- onRender(ply, "
    function(el, x) {
      el.on('plotly_click', function(d) {
        var url = d.points[0].customdata;
        //url
        window.open(url);
      });
    }
  ")
  plyout
}
# htmlwidgets::saveWidget(plot_embedding(umap_plotdat),
# file=paste0(datadir, "/figs/homophily_ratio_", article_id, ".html"),
# title=paste0("homophily_ratio_", article_id))
plot_embedding_bars(bios_lda_gamma, docs_order)