Supporting Information Identifying translational science through embeddings of controlled vocabularies

Size: px
Start display at page:

Download "Supporting Information Identifying translational science through embeddings of controlled vocabularies"

Transcription

1 Supporting Information Identifying translational science through embeddings of controlled vocabularies Qing Ke Center for Complex Network Research, Northeastern University, Boston, MA 5, USA (Dated: September 9, 8) Branch Name Terms Occurrences A Anatomy, 68 7, 78, 594 B Organisms, 656 4, 48, 64 C Diseases 4, , 58, 46 D Chemicals and Drugs 9, 6 8, 77, 87 E Analytical, Diagnostic and Therapeutic Techniques and Equipment, 7 58, 66, 9 F Psychiatry and Psychology 95 8, 85, 99 G Phenomena and Processes, , 6, 68 H Disciplines and Occupations 75 5, 55, 49 I Anthropology, Education, Sociology and Social Phenomena 54 4, 785, 9 J Technology, Industry, Agriculture 5,, 64 K Humanities 89,, 479 L Information Science 47 5, 595, 96 M Named Groups 4 7, 95, 4 N Health Care, 565 9, 66, 57 V Publication Characteristics 5 Z Geographicals 86 5, 4, 664 TABLE S. Total occurrences of all terms in each branch of the MeSH tree, counted based on all terms in all MEDLINE papers. Branches marked in red are included in our analysis. Terms in branch V ( Publication Characteristics ) are not used because they are only for annotating publication type.

2 Category Unique ID MeSH Term Tree Number D477 Cells A D5 Archaea B Cell and molecular (C) D49 Bacteria B D478 Viruses B4 D594 Molecular Structure G..57 D55599 Chemical Processes G.49 Animal (A) D5689 Eukaryota B Human (H) D68 Humans B D97 Persons M TABLE S. Root nodes used for defining basic and applied terms. We consider basic terms as those located in the subtrees rooted at the nodes in () the cell and molecular (C) and () animal (A) category. Applied terms are those in the subtrees rooted at the nodes in the human (H) category.

3 Unique ID MeSH term D564 Freeze Fracturing -.64 D57 Cytochalasins -.67 D57 Cytochalasin B -.68 D6 Porifera -.67 D4 Ruthenium Red -.6 D95 Receptors, Concanavalin A -.6 D554 Pseudopodia -.6 D46 Cell Membrane -.6 D475 Endocytosis -.6 D45 Cell Communication -.6 TABLE S. Top most basic MeSH terms in 98. Unique ID MeSH term D4 Outcome and Process Assessment (Health Care).877 D58 Patient Participation.876 D69 Professional-Patient Relations.875 D7 Referral and Consultation.87 D87 Physician-Patient Relations.87 D9 Attitude of Health Personnel.87 D949 Social Work, Psychiatric.87 D657 Self-Help Groups.868 D8 Physicians, Family.867 D4 Patient Admission.865 TABLE S4. Top most applied MeSH terms in 98.

4 . A CA-CA B CA-H D H-H Density C CA-All.5 E H-All.5 F All-All Cosine similarity. FIG. S. All pairwise cosine similarity between MeSH terms at year 98. 4

5 A Cell B Animal Density 4 C Human. D All of MeSH term FIG. S. Histogram of level score of MeSH terms at year 98. Red dashed lines mark median values.. of MeSH term Human All C+A Year FIG. S. of MeSH terms over years. The lines show the median values, and the shaded regions cover one standard deviation. 5

6 Density A All 4 B Phase I 5 4 C Phase II 5 4 D Phase III 4 E Phase IV FIG. S4. Histogram of level score of clinical trial papers. Dash lines mark median values. 6

7 Sample median Sample median Difference Sig. 8 6 Phase II.96 Phase I.4.49 Frequency 4 Observed Difference of medians 8 Phase III.49 Phase II.96.9 Frequency 6 4 Observed Difference of medians 8 6 Phase IV.455 Phase III.49. Frequency 4 Observed Difference of medians TABLE S5. Statistical significance test of the difference between median level score of papers belonging to consecutive stages of clinical trials. We use permutation test, where 5 permutes are performed to obtain the null distribution showed in the last column. The p-values for all the three tests are < 5. 7

8 Density 4 CA C CAH A CH AH H FIG. S5. Histogram of level score of papers in each category. The categorization is based on whether their MeSH terms contain the ones related to cell and molecular (C), animal (A), and human (H). Dash lines mark median values. 8

9 4 Density FIG. S6. Histogram of level score of papers that contain both Humans and Magnetic Resonance Imaging terms. 9

10 S. NULL MODEL We observed from Fig. 4 in the main text that there is a homophilous pairing of direct citations papers with similar level scores tend to cite each other. Can this observation be accounted for by the bimodal distribution of LS of papers? After all, a list of cited papers that are randomly selected would include some that have a score similar to the citing paper, especially if the LS of the citing paper is near the two peaks of the distribution. To address this question, we consider a null model where we randomize the citation network. In particular, we preserve the publication year and MeSH terms associated with each paper, therefore preserving their level scores, since the co-occurrence matrices M t are preserved. However, we reshuffle citation pairs by randomizing the underlying citation network, preserving the yearly number of citations received by each paper through a series of link switches. Specifically, for each switch, we first randomly choose a pair of links where the two citing papers and the two cited papers were respectively published in the same year, and then switch the end-points (cited papers) of the two links. We perform 5 E times of link switches to allow for good mixing, where E =, 59, 6 is the number of links in the citation network, resulting in more than billion switches. Based on this null model, the newly selected cited paper can be any other one that was published in the same year as the original cited paper, and thus can be of any level score. Through this way, we destroy the pairing of level scores of citing-cited paper pairs. Fig. S6A shows that four regions of high density emerge after randomization and LS of cited papers concentrates in the two regions corresponding to the two peaks. This is readily implied from the null model: For a given citation pair, the randomly picked cited paper can be any other one published in the same year as the original paper, therefore it is more likely to be from the two peaks, since there are more papers there.

11 A of cited paper of citing paper 5 4 Number of citation pairs B of citing paper 4 Number of papers FIG. S7. The same as Fig. 4 in the main text, but the results are obtained from randomized citation network by preserving citation dynamics of each paper (Section S). The results are averaged over realizations.

12 S. ALTERNATIVE CALCULATION OF LEVEL SCORE OF PAPERS In the main text, we calculated the level score of a paper as the average of the cosine similarities between the Translational Axis (TA) vector and the MeSH term vectors. Here we provide another way to do this. Specifically, we first obtain the TA vector as before, and get the vector of a paper by averaging the vectors of its MeSH terms. The LS of the paper is then the cosine similarity between the TA vector and the paper vector. Figs. S8 S demonstrate that our results remain similar.

13 A Density..8.4 B JBC C Cell Biology Biochemistry Cell Molecular Biology Allergy and Immunology PNAS Chemistry Biology JCI Chemistry Techniques, Analytical Pharmacology Cancer Cell Physiology Brain Nature Endocrinology Density Science Nature Medicine Chemistry, Clinical Psychopharmacology Neurology Drug Therapy Neuropsychopharmacology Neoplasms Vascular Diseases Nat. Rev. Drug Discov. Dentistry Medicine NEJM Cardiology General Surgery Lancet Pediatrics Epidemiology JAMA Health Services Research Nursing FIG. S8. The same as Fig. in the main text, but the level score of a paper is the cosine similarity between the Translational Axis (TA) vector and the vector of the paper, obtained by averaging the vectors of its MeSH terms. Sec. S describes this in details. MeSH term vectors are obtained using LINE.

14 A All Density B Phase I 4 C Phase II D Phase III 4 E Phase IV FIG. S9. The same as Fig. S4, but level score is obtained using the procedure described in Sec. S. 4

15 Density CA C CAH A CH AH H FIG. S. The same as Fig. S5, but level score is obtained using the procedure described in Sec. S. 5

16 S. ROBUSTNESS TEST Figs. S S8 show the same set of results we showed before, but using the GloVe embedding method []. [] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proceedings of the 4 Conference on Empirical Methods in Natural Language Processing, pages 5 54, 4. 6

17 A Density..8.4 B JBC C Cell Biology Biochemistry Cell Molecular Biology Chemistry PNAS Allergy and Immunology Biology Cancer Cell Pharmacology Chemistry Techniques, Analytical JCI Physiology Brain Nature Endocrinology Psychopharmacology Density Science Nature Medicine Neurology Chemistry, Clinical Drug Therapy Neuropsychopharmacology Neoplasms Dentistry Nat. Rev. Drug Discov. Vascular Diseases General Surgery NEJM Medicine Cardiology Lancet Pediatrics Epidemiology JAMA Nursing Health Services Research FIG. S. The same as Fig. in the main text, but level score is obtained based on the GloVe embedding method. 7

18 A of cited paper of citing paper 5 4 Number of citation pairs B of citing paper 5 4 Number of papers FIG. S. The same as Fig. 4 in the main text, but based on the GloVe embedding method. A B of cited paper of citing paper 5 4 Number of citation pairs of citing paper 4 Number of papers FIG. S. The same as Fig. S7, but based on the GloVe embedding method. 8

19 . A CA-CA Density B CA-H D H-H... C CA-All. E H-All. F All-All Cosine similarity FIG. S4. The same as Fig. S, but based on the GloVe embedding method. 9

20 A Cell B Animal Density C Human..5. D All of MeSH term FIG. S5. The same as Fig. S, but based on the GloVe embedding method.. of MeSH term Human All C+A Year FIG. S6. The same as Fig. S, but based on the GloVe embedding method.

21 Density A All B Phase I 4 C Phase II 4 D Phase III 4 E Phase IV FIG. S7. The same as Fig. S4, but based on the GloVe embedding method.

22 Density CA C CAH A CH AH H FIG. S8. The same as Fig. S5, but based on the GloVe embedding method.