Tools and Data
Data-sets used in CCNSD
- Cancer Project
- Cryptocurrency Project
- Twitter Project
It currently contains 257 datasets totaling 3,916 individual networks. The data are downloadable in several formats, including gt, graphml, gml, and csv. The site also has a JSON API that lets your programs interact with it, and you can do so directly from graph-tool.
The Colorado Index of Complex Networks (ICON)
ICON is a comprehensive index of research-quality network data sets from all domains of network science, including social, web, information, biological, ecological, connectome, transportation, and technological networks.
Each network record in the index is annotated with its graph properties, description, size, etc., and can be searched or browsed by them; many records include links to multiple networks. The contents of ICON are curated by volunteer experts from Prof. Aaron Clauset’s research group at the University of Colorado Boulder.
Below is a list with hyperlinks to external resources, as well as pointers to the resources people host at IUNI (e.g., WoS, OSoMe, Hoaxy).
- Web of Science data: 63,590,916 papers (SCIE, SSCI, AHCI, Books and Proceedings) covering 1900 through 2016
- The NaN data repository (http://carl.cs.indiana.edu/data/) contains several large datasets, including the massive click collection with over 50 billion Web requests.
- The Scholarometer API and Linked Open Data (http://scholarometer.indiana.edu/data.html) provide programmatic access to data about authors and disciplines, crowdsourced through the Scholarometer tool.
- The Kinsey Reporter API (http://kinseyreporter.org/data) provides programmatic access to crowdsourced data about co-occurrence of tags describing anonymous sexual behaviors.
- The Observatory on Social Media (osome.iuni.iu.edu), Indiana University’s warehouse for social media data, provides access to a large sample of the Twitter stream as well as derived analytics data about meme diffusion in social media.
- Tennis Prestige (http://tennisprestige.soic.indiana.edu/) uses publicly available data about tennis matches to generate a weighted and directed network of contacts among players, and then measures their performance with Prestige Score, a variant of the well-known PageRank centrality.
- Network Workbench (http://nwb.cns.iu.edu/) is a large-scale network analysis, modeling and visualization toolkit for biomedical, social science and physics research.
- The Cyberinfrastructure Shell (CIShell) (http://cishell.org) supports the plug-and-play of datasets and algorithms and their bundling into custom tools that serve the specific needs of a user group or research community.
- Science of Science Tool (Sci2) (http://sci2.cns.iu.edu/user/index.php) supports the temporal, geospatial, topical, and network analysis and visualization of scholarly datasets at the micro (individual), meso (local), and macro (global) levels.
- Epidemics Tool (EpiC) (http://epic.cns.iu.edu//) supports the custom analysis, modeling, and visualization of data streams such as diffusion patterns of the H1N1 virus over geographic space.
- Brain Connectivity Toolbox (BCT) (https://sites.google.com/site/bctnet/) is an open-source MATLAB toolbox for brain network analysis and visualization.
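The Tennis Prestige entry above ranks players with a variant of PageRank on a weighted, directed network. As a rough illustration of that idea, here is a minimal sketch of standard weighted PageRank computed by power iteration. It is not the Prestige Score itself. The players `A`, `B`, and `C`, the edge weights, and the convention that edges point from loser to winner are all assumptions made up for this example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Standard weighted PageRank by power iteration.

    edges maps (source, target) pairs to positive weights.
    """
    nodes = sorted({node for pair in edges for node in pair})
    n = len(nodes)

    # Total outgoing weight per node, used to normalize transitions.
    out_weight = {node: 0.0 for node in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


# Hypothetical match results: an edge (loser, winner) weighted by the
# number of matches the winner took from the loser.
matches = {("B", "A"): 2, ("C", "A"): 1, ("C", "B"): 1}
scores = pagerank(matches)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these toy results the undefeated player `A` collects rank from both opponents, so it comes out on top.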
This page contains links to some network data sets I’ve compiled over the years. All of these are free for scientific use to the best of my knowledge, meaning that the original authors have already made the data freely available, or that I have consulted the authors and received permission to post the data here, or that the data are mine. If you make use of any of these data, please cite the original sources.
The data sets are in GML format. For a description of GML see here. GML can be read by many network analysis packages, including Gephi and Cytoscape. I’ve written a simple parser in C that will read the files into a data structure. It’s available here. There are many features of GML not supported by this parser, but it will read the files in this repository just fine. There is a Python parser for GML available as part of the NetworkX package here and another in the igraph package, which can be used from C, Python, or R. If you know of or develop other software (Java, C++, Perl, R, Matlab, etc.) that reads GML, let me know.
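As a minimal sketch of the NetworkX route mentioned above, the following parses a small GML document with `networkx.parse_gml`. The sample graph is made up for illustration and is not one of the data sets on this page. For a downloaded file, `networkx.read_gml(path)` takes the same `label` argument; passing `label="id"` may be needed for files whose nodes carry no `label` attribute, which is an assumption to check against each file.

```python
import networkx as nx

# A tiny, made-up GML document used only to demonstrate parsing.
SAMPLE_GML = """graph [
  directed 0
  node [
    id 0
    label "a"
  ]
  node [
    id 1
    label "b"
  ]
  node [
    id 2
    label "c"
  ]
  edge [
    source 0
    target 1
  ]
  edge [
    source 1
    target 2
  ]
]"""


def load_gml_text(text, label="label"):
    """Parse GML text into a NetworkX graph, naming nodes by `label`."""
    return nx.parse_gml(text, label=label)


graph = load_gml_text(SAMPLE_GML)
print(graph.number_of_nodes(), graph.number_of_edges())
```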
- Zachary’s karate club: social network of friendships between 34 members of a karate club at a US university in the 1970s. Please cite W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977).
- Les Miserables: coappearance network of characters in the novel Les Miserables. Please cite D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA (1993).
- Word adjacencies: adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens. Please cite M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
- American College football: network of American football games between Division IA colleges during regular season Fall 2000. Please cite M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
- Dolphin social network: an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand. Please cite D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396-405 (2003). Thanks to David Lusseau for permission to post these data on this web site.
- Political blogs: A directed network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance. Please cite L. A. Adamic and N. Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). Thanks to Lada Adamic for permission to post these data on this web site.
- Books about US politics: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent copurchasing of books by the same buyers. The network was compiled by V. Krebs and is unpublished, but can be found on Krebs’ web site. Thanks to Valdis Krebs for permission to post these data on this web site.
- Neural network: A directed, weighted network representing the neural network of C. elegans. Data compiled by D. Watts and S. Strogatz and made available on the web here. Please cite D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998). Original experimental data taken from J. G. White, E. Southgate, J. N. Thompson, and S. Brenner, Phil. Trans. R. Soc. London 314, 1-340 (1986).
- Power grid: An undirected, unweighted network representing the topology of the Western States Power Grid of the United States. Data compiled by D. Watts and S. Strogatz and made available on the web here. Please cite D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998).
- Condensed matter collaborations 1999: weighted network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive between Jan 1, 1995 and December 31, 1999. Please cite M. E. J. Newman, The structure of scientific collaboration networks, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
- Condensed matter collaborations 2003: updated network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive. This version includes all preprints posted between Jan 1, 1995 and June 30, 2003. The largest component of this network, which contains 27519 scientists, has been used by several authors as a test-bed for community-finding algorithms for large networks; see for example J. Duch and A. Arenas, Phys. Rev. E 72, 027104 (2005). These data can be cited as M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
- Condensed matter collaborations 2005: updated network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive. This version includes all preprints posted between Jan 1, 1995 and March 31, 2005. Please cite M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
- Astrophysics collaborations: weighted network of coauthorships between scientists posting preprints on the Astrophysics E-Print Archive between Jan 1, 1995 and December 31, 1999. Please cite M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
- High-energy theory collaborations: weighted network of coauthorships between scientists posting preprints on the High-Energy Theory E-Print Archive between Jan 1, 1995 and December 31, 1999. Please cite M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001).
- Coauthorships in network science: coauthorship network of scientists working on network theory and experiment, as compiled by M. Newman in May 2006. A figure depicting the largest component of this network can be found here. These data can be cited as M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
- Internet: a symmetrized snapshot of the structure of the Internet at the level of autonomous systems, reconstructed from BGP tables posted by the University of Oregon Route Views Project. This snapshot was created by Mark Newman from data for July 22, 2006 and is not previously published.
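Several entries above refer to the largest component of a network or to a symmetrized version of it. One way to obtain both from a plain edge list is sketched below, using breadth-first search over an adjacency map that treats every edge as undirected. The function name and the toy edge list are hypothetical and chosen only for illustration.

```python
from collections import deque


def largest_component(edges):
    """Return the node set of the largest connected component.

    Each (u, v) pair is treated as undirected, which symmetrizes
    any directed input.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    visited = set()
    best = set()
    for start in adjacency:
        if start in visited:
            continue
        # Breadth-first search collects everything reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        visited |= component
        if len(component) > len(best):
            best = component
    return best


# Toy edge list with two components: {1, 2, 3} and {4, 5}.
toy_edges = [(1, 2), (2, 3), (4, 5)]
giant = largest_component(toy_edges)
print(sorted(giant))
```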
Other sources of network data
There are a number of other pages on the web from which you can download network data. Here are a few that I am aware of:
- UCINet data sets: Social network data sets released with the UCINet software by Steve Borgatti et al.
- Pajek data sets: Example data sets released with the Pajek software by Vladimir Batagelj and Andrej Mrvar.
- Indiana University data sets: A set of very large data sets, including some non-network data sets, compiled by the School of Library and Information Science at Indiana University. Network data sets include the NBER data set of US patent citations and a data set of links between articles in the on-line encyclopedia Wikipedia.
- Duncan Watts’ data sets: Data compiled by Prof. Duncan Watts and collaborators at Columbia University, including data on the structure of the Western States Power Grid and the neural network of the worm C. elegans.
- Laszlo Barabasi’s data sets: Data compiled by Prof. Albert-Laszlo Barabasi and collaborators at the University of Notre Dame, including web data and biochemical networks.
- Alex Arenas’s data sets: Data compiled by Prof. Alexandre Arenas and collaborators at Universidad Rovira i Virgili, including metabolic network data and the network from their study of the collaboration patterns of jazz musicians.
- Stanford Large Network Dataset Collection: A substantial collection of data sets describing very large networks, including social networks, communications networks, and transportation networks.