MIB Datasets
This page describes the datasets of Twitter accounts we have analysed within MIB.
Note on the datasets: Because of several constraints (e.g., users' privacy and data protection), we do not post the datasets openly; they are available on request to researchers. Please contact us to obtain access to the data for research purposes.
Two terms of use apply:
- The appropriate paper is cited in any research product whose findings are based on these datasets;
- The datasets cannot be redistributed.
First dataset
Description: genuine and spambot Twitter accounts, annotated by CrowdFlower contributors, as described in our paper:
The Paradigm-Shift of Social Spambots: Evidence, Theories, and Tools for the Arms Race, S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, M. Tesconi. WWW '17 Proceedings of the 26th International Conference on World Wide Web Companion, 963-972, 2017
@inproceedings{Cresci:2017:PSS:3041021.3055135,
  author = {Cresci, Stefano and Di Pietro, Roberto and Petrocchi, Marinella and Spognardi, Angelo and Tesconi, Maurizio},
  title = {The Paradigm-Shift of Social Spambots: Evidence, Theories, and Tools for the Arms Race},
  booktitle = {Proceedings of the 26th International Conference on World Wide Web Companion},
  series = {WWW '17 Companion},
  year = {2017},
  isbn = {978-1-4503-4914-7},
  location = {Perth, Australia},
  pages = {963--972},
  numpages = {10},
  url = {https://doi.org/10.1145/3041021.3055135},
  doi = {10.1145/3041021.3055135},
  acmid = {3055135},
  publisher = {International World Wide Web Conferences Steering Committee}
}
To give a flavour of the collected information, we make available a sample of 100 genuine accounts and the labels they received from CrowdFlower contributors. You can download the sample from these links: sample SQL version - sample CSV version
The full dataset, available on request and only for research purposes, has the following composition:
group name | description | accounts | tweets | year |
---|---|---|---|---|
genuine accounts | verified accounts that are human-operated | 3,474 | 8,377,522 | 2011 |
social spambots #1 | retweeters of an Italian political candidate | 991 | 1,610,176 | 2012 |
social spambots #2 | spammers of paid apps for mobile devices | 3,457 | 428,542 | 2014 |
social spambots #3 | spammers of products on sale at Amazon.com | 464 | 1,418,626 | 2011 |
traditional spambots #1 | training set of spammers used by C. Yang, R. Harkreader, and G. Gu. | 1,000 | 145,094 | 2009 |
traditional spambots #2 | spammers of scam URLs | 100 | 74,957 | 2014 |
traditional spambots #3 | automated accounts spamming job offers | 433 | 5,794,931 | 2013 |
traditional spambots #4 | another group of automated accounts spamming job offers | 1,128 | 133,311 | 2009 |
fake followers | simple accounts that inflate the number of followers of another account | 3,351 | 196,027 | 2012 |
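As a quick sanity check on the composition above, the following minimal Python sketch totals the accounts and tweets and computes the average number of tweets per account in each group. The figures are copied from the table; the names `GROUPS` and `summarize` are hypothetical and are not part of the dataset or of any tooling distributed with it.

```python
# Group composition copied from the table above: (group name, accounts, tweets).
GROUPS = [
    ("genuine accounts", 3474, 8377522),
    ("social spambots #1", 991, 1610176),
    ("social spambots #2", 3457, 428542),
    ("social spambots #3", 464, 1418626),
    ("traditional spambots #1", 1000, 145094),
    ("traditional spambots #2", 100, 74957),
    ("traditional spambots #3", 433, 5794931),
    ("traditional spambots #4", 1128, 133311),
    ("fake followers", 3351, 196027),
]


def summarize(groups):
    """Return total accounts, total tweets, and mean tweets per account by group."""
    total_accounts = sum(accounts for _, accounts, _ in groups)
    total_tweets = sum(tweets for _, _, tweets in groups)
    tweets_per_account = {name: tweets / accounts for name, accounts, tweets in groups}
    return total_accounts, total_tweets, tweets_per_account


total_accounts, total_tweets, tweets_per_account = summarize(GROUPS)
print(f"accounts: {total_accounts:,}  tweets: {total_tweets:,}")
for name, ratio in tweets_per_account.items():
    print(f"{name}: {ratio:.1f} tweets per account")
```

The groups differ widely in tweets per account, so anyone planning a tweet-level analysis may want to account for that imbalance when sampling.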
Second dataset
Description: the accounts have been split into multiple datasets, as described in our paper: Fame for sale: efficient detection of fake Twitter followers.
- TFP (the fake project): 100% humans
- E13 (elections 2013): 100% humans
- INT (intertwitter): 100% fake followers
- FSF (fastfollowerz): 100% fake followers
- TWT (twittertechnology): 100% fake followers
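Since each of the five datasets above contains a single class, a binary label can be derived from the dataset code alone. The sketch below illustrates one way to do this in Python; the mapping follows the list above, while the names `DATASET_CLASS` and `label_for` and the 0/1 encoding are assumptions made for illustration, not a description of how the data files are organised.

```python
# Class of each dataset, taken from the list above.
DATASET_CLASS = {
    "TFP": "human",
    "E13": "human",
    "INT": "fake follower",
    "FSF": "fake follower",
    "TWT": "fake follower",
}


def label_for(dataset_code: str) -> int:
    """Return 0 for a human dataset and 1 for a fake-follower dataset.

    The 0/1 encoding is an illustrative choice, not part of the datasets.
    """
    try:
        dataset_class = DATASET_CLASS[dataset_code.upper()]
    except KeyError:
        raise ValueError(f"unknown dataset code: {dataset_code!r}") from None
    return 0 if dataset_class == "human" else 1


labels = {code: label_for(code) for code in DATASET_CLASS}
print(labels)
```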
Fame for sale: efficient detection of fake Twitter followers, S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, M. Tesconi. arXiv:1509.04098 09/2015. Elsevier Decision Support Systems, Volume 80, December 2015, Pages 56–71
@article{fameforsale2015,
  author = {Cresci, Stefano and {Di Pietro}, Roberto and Petrocchi, Marinella and Spognardi, Angelo and Tesconi, Maurizio},
  title = {Fame for sale: efficient detection of fake Twitter followers},
  journal = {Decision Support Systems},
  publisher = {Elsevier},
  volume = {80},
  month = {December},
  issn = {0167-9236},
  doi = {http://dx.doi.org/10.1016/j.dss.2015.09.003},
  year = {2015},
  pages = {56-71}
}