Network Imputation in Predicting Researcher Collaboration

Yun Huang1, Chuang Zhang2, Maryam FazelZarandi3, Hugh Devlin1, Alina Lungeanu1, Stanley Wasserman4, Noshir Contractor1
1 Science of Networks in Communities (SONIC), Northwestern University
2 Beijing University of Posts and Telecommunications
3 Nuance Communications, Montreal, Canada
4 Indiana University

Supported by NSF grants CNS-1010904, OCI-0904356, IIS-0838564 and NIH CTSA award UL1RR025741.

SONIC: advancing the science of networks in communities

Social Science Guided Development of Tools to Recommend Collaborations
• VIVO meets SciTS
• Big data on scientific collaboration
• Social science insights on effective team collaboration
• Can theoretical models be used to guide development of tools to predict/recommend collaborations?
• Implement algorithms based on advanced network analytic methodologies to make recommendations

Outline
• Motivation and application
  • Expert recommender for collaboration
  • Multi-theoretical Multilevel Model (MTML) for effective collaboration
• Modeling NUCATS co-proposal networks
  • Exponential Random Graph Models (ERGM/p*) as a recommender
  • Social theories and hypotheses
• Experiment and evaluation
  • Benchmarks of data mining algorithms
  • Result comparison

Expert Recommender for Teams
• Traditional use of link prediction models
  • To predict links that are present but not easily observed (e.g., in terrorist networks)
• Novel use of link prediction models
  • To predict links that are not present but ought to be present to enable effective collaboration
• Technically, network imputation: estimating missing nodes and missing links based on statistical modeling of networks (Wasserman, Robins, & Steinley, 2007)

Multi-theoretical Multilevel Models (MTML) for Effective Collaboration
• MTML models provide an integrated explanatory framework to understand collaboration at multiple levels:
  • Actor level (e.g., individual attributes such as age, gender, tenure, and H-index)
  • Dyad level:
    • Attributes (e.g., shared university or disciplinary affiliation)
    • Relations (e.g., prior collaborations or citation ties)
  • Higher order, relational (e.g., friend of a friend, coauthor of a coauthor, star collaborators in the network)

Using ERGM Network Analytic Methodologies to Estimate MTML Models for Effective Collaboration
• Exponential Random Graph Models (ERGM) …
  • estimate the extent to which each hypothesized structure (actor attribute, dyadic shared attribute, dyadic relational, and higher-order relational variable) included in the MTML model explains the presence of effective collaboration ties observed in the network;
  • estimate the degree to which the theoretically hypothesized structures are likely to occur in observed networks;
  • consider an observed network x as a realization of an underlying random network X characterized by a set of network features and parameters:

$$P(X = x) = \frac{\exp\{\theta^{\top} g(x)\}}{\kappa(\theta)}$$

• Where:
  • θ is the vector of estimated MTML model parameters,
  • g(x) is a vector of the extent to which the hypothesized network configurations occur in the observed collaboration network,
  • κ(θ) is a normalizing quantity.
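
The ERGM form P(X = x) = exp(θ'g(x)) / κ(θ) can be made concrete on a toy network. The sketch below is illustrative only: it assumes an undirected network of four nodes and two statistics (edge count and triangle count) that are not the NUCATS model, and every function name and parameter value is hypothetical. With four nodes there are only 64 possible graphs, so κ(θ) can be summed by brute force; that is not feasible for a network of 147 researchers, where estimation software has to rely on simulation instead.

```python
import itertools
import math

NODES = range(4)
DYADS = list(itertools.combinations(NODES, 2))  # the 6 possible undirected ties


def stats(edges):
    """g(x): (number of edges, number of triangles) of an undirected graph."""
    edge_set = set(edges)
    triangles = sum(
        1
        for a, b, c in itertools.combinations(NODES, 3)
        if {(a, b), (a, c), (b, c)} <= edge_set
    )
    return (len(edge_set), triangles)


def weight(theta, edges):
    """Unnormalized weight exp(theta . g(x))."""
    return math.exp(sum(t * g for t, g in zip(theta, stats(edges))))


def kappa(theta):
    """Normalizing quantity: sum of weights over all 2^6 = 64 graphs."""
    return sum(
        weight(theta, subset)
        for r in range(len(DYADS) + 1)
        for subset in itertools.combinations(DYADS, r)
    )


def probability(theta, edges):
    """P(X = x) for the graph whose ties are listed in edges."""
    return weight(theta, edges) / kappa(theta)


def conditional_tie_probability(theta, edges, dyad):
    """P(X_ij = 1 | all other ties), from the change in g(x) when the tie is added."""
    rest = set(edges) - {dyad}
    with_tie = stats(rest | {dyad})
    without_tie = stats(rest)
    logit = sum(t * (a - b) for t, a, b in zip(theta, with_tie, without_tie))
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    theta = (-1.0, 0.5)      # hypothetical parameters: sparse ties, some closure
    x = [(0, 2), (1, 2)]     # ties are written as (smaller node, larger node)
    print(probability(theta, x))
    print(conditional_tie_probability(theta, x, (0, 1)))
```

The conditional probability does not need κ(θ) at all, because the normalizing quantity cancels when the graphs with and without the tie are compared. That is what makes it practical to score every absent link when the model is used as a recommender.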

Proposal Collaboration
• NUCATS co-proposal network
  • Northwestern University Clinical and Translational Sciences
  • 63 proposal teams with 147 researchers and 100 co-proposal relations
• MTML factors influencing collaboration:
  • Impacts of gender, tenure, professional experience, coauthorship, citations, and network structures on collaboration relations

Co-proposal Relations
[Figure: co-proposal relations network]

ERGM Model for Team Assembly

Levels              Variables                            Odds ratio
Actor (attributes)  Gender (1 = "female")                1.82*
                    Tenure (log years since PhD)         0.51*
                    Experience (ln publications)         1.20*
Dyad (attributes)   Gender difference                    0.55*
                    Tenure difference                    0.99
                    Experience difference                0.72*
Dyad (relations)    Co-authorship                        9.03*
                    Citation relationship                0.65*
Higher order        Edge (co-proposal)                   0.00*
                    Weighted node degree                 323.76*
                    Weighted number of shared neighbors  70.11*

Log likelihood: -356.36
Significance codes: * p<0.001
Estimated using Statnet (Handcock et al 2008)

• Actor level: Female researchers and researchers with more publications are more likely to collaborate, but tenure has a negative effect.
• Dyad (attributes): Gender and experience homophily have a positive impact on collaboration. Tenure similarity has no effect.
• Dyad (relations): Researchers are more likely to collaborate with co-authors and less likely to collaborate with researchers with whom they have a citation relationship.
• Higher order: Researchers are unlikely to collaborate at random; they tend to have a similar number of collaborators and a high level of transitivity.

Re-purposing Link Prediction Models for Making Link Recommendations
• Link prediction models are used to predict links that are present but were not observed (as in covert networks).
• A key contribution of this study is to repurpose link prediction models to predict links that are not present but ought to be present: a recommendation.

Comparing MTML Link Prediction Approaches to Traditional Link Prediction Approaches
1. Node-wise similarity approaches
  • Define or learn a measure of similarity between two nodes to determine link existence.
2. Probabilistic model based approaches
  • Abstract the underlying structure from the observed data network into a compact probabilistic model, then regenerate the unobserved part of the network using the learned model.
3. Network topology based approaches
  • Exploit topological patterns, ranging from local patterns around the nodes to global patterns covering the entire social network.

Comparing MTML Link Prediction with …
• Three benchmark data mining approaches
  1. Node-wise similarity-based approach
  2. Relational Bayesian Networks (Jaeger 1997)
  3. The Katz Method (Katz 1953)
• Remove exactly one link that is known to exist in the collaboration network and assess how well (i.e., at how high a rank) each approach recommends that the link be created
• Evaluate the efficacy of the four approaches using the Average Rank of the Correct Recommendation (ARC) (Burke 2005)

Example: Predicting Link b to d
[Figure: observed network x on nodes a, b, c, d; training network x* with the link from b to d removed; probabilities on the links absent from x*]
1. Remove the link from b to d from the observed network x to obtain the training network x*. (Ideally the model should recommend adding a link from b to d with the highest probability, i.e., ranking Top 1.)
2. Build a model and calculate the probability for all links which are not in the training network x*.
3. Rank all recommended links by their probabilities:
   1. P(Xbc=1|X=x*) = 0.4
   2. P(Xbd=1|X=x*) = 0.3
   3. P(Xcd=1|X=x*) = 0.2
   4. P(Xac=1|X=x*) = 0.1
The rank of the correct prediction (link xbd) is 2.

Ranks of the removed link in each test

Test     Node i  Node j  Node-wise  RBN     Katz    ERGM
1        1       26      5443       7414    5323.5  4522
2        3       126     951        149     5325    695
3        5       54      3710       2       5325    83
4        6       60      2272       198.5   5325    1154
5        7       110     3710       7414    15      100
6        7       127     951        7414    15      96
7        8       137     10333.5    2       5325    67
8        10      114     5443       72      15      63
9        10      141     3710       54.5    15      59
10       11      77      2272       198.5   5325    733
11       12      15      6573       2630    3       74
12       12      27      8598       113.5   3       63
13       12      34      5443       7414    3       96
…
100      133     142     8598       33      1.5     61
Average                  5155       3381    1657    603

ARC: the average rank with which the correct (missing) link was recommended by each of the four approaches
As a baseline: the ARC for a random guess is 5316

Average Rank of the Correct Recommendation (std. dev.)

Variables      Node-wise similarity  RBN          Katz         ERGM
Actor level    5155 (3243)           -            -            -
Dyad level     -                     3381 (3217)  -            -
High order     -                     -            1657 (2471)  -
All variables  -                     -            -            603 (1148)

• Not that impressive, but the ARC for a random guess is 5,316, because ranks range from 1 to 10,632 (all possible links in a network of 147 researchers).
• Similar to the findings in Liben-Nowell and Kleinberg, 2007, the Katz method has the best performance among the benchmark models. High-order relational structures provide critical information for the predictions, and the dyad level makes only a small marginal contribution.
• The final ERGM model utilizes all variables and achieves the best performance.

Average Rank of the Correct Recommendation (std. dev.), with ERGM at each level

Variables      Node-wise similarity  RBN          Katz         ERGM
Actor level    5155 (3243)           -            -            4587 (3231)
Dyad level     -                     3381 (3217)  -            3751 (2855)
High order     -                     -            1657 (2471)  803 (1370)
All variables  -                     -            -            603 (1148)

• ERGM models have better performance, both in average rank and in consistency of predictions, compared to the benchmark models using similar variables.

Best rank among the three benchmarks (Node-wise, RBN, Katz) in each test

Test     Node i  Node j  Best
1        1       26      5323.5
2        3       126     149
3        5       54      2
4        6       60      198.5
5        7       110     15
6        7       127     15
7        8       137     2
8        10      114     15
9        10      141     15
10       11      77      198.5
11       12      15      3
12       12      27      3
13       12      34      3
…
100      133     142     1.5
Average                  414

• Best (Node-wise + RBN + Katz): ARC of 414 (1082), compared with 603 (1148) for the ERGM with all variables.

Summary
• p* models provide an analytic methodology to implement insights from social science theory-driven models of effective collaboration into recommender systems
• Recommendations made using social science driven ERGMs outperform traditional link prediction models in making recommendations …
• Illustrating the potential of theory-driven over purely data-driven recommender systems for collaboration

Thank you. Questions?
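
The leave-one-out evaluation used in these slides (remove one known link, score every link absent from the training network, record the rank of the removed link, then average over tests) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scorer is a truncated Katz sum with hypothetical `beta` and `max_len` values, tied candidates share their average rank (an assumption suggested by fractional ranks such as 198.5 in the slides), and all function names are made up for the example. The random-guess helper assumes the candidates are all pairs absent from the training network, which is consistent with the 10,632 and 5,316 figures quoted in the slides.

```python
import itertools
import math


def katz_scores(nodes, edges, beta=0.05, max_len=4):
    """Truncated Katz score per pair: sum over l = 1..max_len of beta^l * (walks of length l)."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[index[u]][index[v]] = adj[index[v]][index[u]] = 1
    power = [row[:] for row in adj]  # holds A^l, starting at l = 1
    total = [[0.0] * n for _ in range(n)]
    for length in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                total[i][j] += (beta ** length) * power[i][j]
        # Multiply by A once more to count walks that are one step longer.
        power = [
            [sum(power[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return {(u, v): total[index[u]][index[v]] for u, v in itertools.combinations(nodes, 2)}


def average_rank(scores, target):
    """Rank of target among candidates (1 = highest score); ties share their average rank."""
    s = scores[target]
    higher = sum(1 for value in scores.values() if value > s)
    tied = sum(1 for value in scores.values() if value == s)
    return higher + (tied + 1) / 2


def arc(nodes, edges, scorer):
    """Average Rank of the Correct recommendation over leave-one-out tests."""
    ranks = []
    for held_out in edges:
        training = [e for e in edges if e != held_out]
        scores = scorer(nodes, training)
        # Candidates are the pairs with no link in the training network.
        candidates = {pair: sc for pair, sc in scores.items() if pair not in training}
        ranks.append(average_rank(candidates, held_out))
    return sum(ranks) / len(ranks)


def random_guess_arc(n_nodes, n_training_links):
    """Expected rank of the removed link when candidates are ordered at random."""
    candidates = math.comb(n_nodes, 2) - n_training_links
    return (candidates + 1) / 2


if __name__ == "__main__":
    # The b-to-d example: the correct link has the second-highest probability.
    print(average_rank({"bc": 0.4, "bd": 0.3, "cd": 0.2, "ac": 0.1}, "bd"))
    # 147 researchers, 100 links with one removed: 10,632 candidate links.
    print(math.comb(147, 2) - 99, random_guess_arc(147, 99))
    # Toy network; each link is written in the same order as the nodes list.
    nodes = ["a", "b", "c", "d"]
    edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    print(arc(nodes, edges, katz_scores))
```

Any other scorer with the same `(nodes, training_edges)` signature, for example one returning conditional tie probabilities from a fitted ERGM, could be passed to `arc` in place of `katz_scores`, so that different approaches are compared on identical tests.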
