How to Get Historical Twitter Data for Academic Research?
Author: Kate Finch | Published On: 26 Jul 2021
Twitter is a virtual hub for discussing or debating almost every topic, political or otherwise. This makes Twitter data valuable and often sought after by students, faculty members, and researchers. However, they quickly run into problems when trying to download it. In this piece, we discuss the various options for accessing Twitter data.
Let’s get started.
As the purpose of research varies from person to person, so does the historical Twitter data required for it. Before choosing a method, ask yourself:
Do you need historical Twitter data? What about current tweets?
How many tweets do you need?
Do you need a complete dataset? Or will a sample of the total tweets be enough?
A few more factors can also affect the decision:
Does the researcher have the required funds to purchase the data?
Which technical skills does the researcher have?
Is there a need to share the data as part of the publication?
How will the researcher perform the analysis?
These factors play a major role in acquiring Historical Twitter datasets for academic research.
There are four ways of accessing Historical Twitter data for academic research (definitely not talking about copying and pasting tweets).
Twitter Public API
Find an existing dataset
Purchase directly from Twitter
Purchase from a Twitter service provider
Let’s discuss each method of accessing Twitter data in detail.
Use Twitter Public API
An API (Application Programming Interface) is a set of defined rules and functions that lets two software products exchange data. The Twitter API supports many functions, and many products and software solutions use it to access or exchange information. You can also use the Twitter API to pull data from the platform.
The API functions best suited to downloading historical Twitter data include:
Extracting tweets related to any targeted public Twitter account
Searching tweets mentioning a specific keyword
Pulling real-time tweets posted by users
Using the Twitter API to pull historical Twitter data requires some technical skill. You can write your own software against the API, but many existing tools already cover a wide range of features, and they demand different levels of technical skill and infrastructure. Some of them are:
Command-line tools (Twurl)
Software libraries (rtweet for R)
Web applications (DMI-TCAT)
Plugins for popular analytic packages (NVIVO, NodeXL for Excel, and TAGS for Google Sheets)
While some of these tools are built to collect historical Twitter data (tweets), others are built to analyze the collected data. A more comprehensive list of social media research toolkits is maintained by the Social Media Lab at the Ted Rogers School of Management, Ryerson University.
Types of Twitter APIs
There are different Twitter APIs, each with its own set of functions and limitations. They are as follows:
Get Twitter data with Search API:
The Twitter Search API can help you pull Twitter data related to any username, keyword, location, etc. It gives you access to tweets that have already been posted. However, there are limits on how many tweets you can pull.
The Search API allows you to pull the most recent 3200 tweets for any public Twitter profile. For keywords, you can pull up to 5000 tweets per query. The API also limits the number of requests you can make: the current limit is 180 requests per 15 minutes.
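To get a feel for what these limits mean in practice, here is a rough Python sketch that turns the figures above into request pacing. It assumes 100 tweets per request, which is not stated in this guide, and that requests are spaced evenly across the window. All function and constant names are illustrative.

```python
import math

WINDOW_SECONDS = 15 * 60     # rate-limit window quoted above
REQUESTS_PER_WINDOW = 180    # request limit quoted above
TWEETS_PER_REQUEST = 100     # assumption: tweets returned per request


def min_delay_seconds(requests_per_window=REQUESTS_PER_WINDOW,
                      window_seconds=WINDOW_SECONDS):
    """Smallest even spacing between requests that stays within the limit."""
    return window_seconds / requests_per_window


def requests_needed(tweet_count, tweets_per_request=TWEETS_PER_REQUEST):
    """Number of requests required to page through tweet_count tweets."""
    return math.ceil(tweet_count / tweets_per_request)


def estimated_seconds(tweet_count):
    """Time between the first and last request when evenly spaced."""
    return max(requests_needed(tweet_count) - 1, 0) * min_delay_seconds()


print(min_delay_seconds())       # 5.0
print(requests_needed(3200))     # 32
print(estimated_seconds(3200))   # 155.0
```

Under these assumptions, the request limit is rarely the bottleneck for a single profile or query. The caps on how many tweets you can reach are the tighter constraint.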
Get Twitter data with Streaming API:
Unlike the Search API, the Streaming API pulls Twitter data in near real-time. You can filter by keywords, usernames, locations, named places, hashtags, etc., and the API returns tweets that match the search criteria as they are posted.
The major drawback of this API is that it returns only a sample of the tweets matching the search query. The sample size can be between 1% and 40% of all matching tweets.
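The idea of matching incoming tweets against search criteria can be illustrated locally. The sketch below does not connect to Twitter. It filters an in-memory list of made-up tweets by a simple case-insensitive keyword match, which only approximates whatever matching rules the real Streaming API applies. The function names and sample data are hypothetical.

```python
def matches(tweet, keywords):
    """Return True if any keyword appears in the tweet text (case-insensitive)."""
    text = tweet.get("text", "").lower()
    return any(keyword.lower() in text for keyword in keywords)


def filter_stream(tweets, keywords):
    """Yield tweets one at a time as they arrive, keeping only the matches."""
    for tweet in tweets:
        if matches(tweet, keywords):
            yield tweet


# Made-up tweets standing in for a live stream.
sample_stream = [
    {"text": "Voting opens today #Election2020"},
    {"text": "Just had a great coffee"},
    {"text": "New poll on the ELECTION is out"},
]

matched = list(filter_stream(sample_stream, ["election"]))
print(len(matched))  # 2
```

Because the function is a generator, it handles tweets one by one instead of waiting for the whole collection, which is the same pattern you would use when reading from a live stream.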
Get Twitter data with Firehose API:
The Firehose API works much like the Streaming API. The difference is that it can pull all the tweets related to a search query rather than just a sample.
The Firehose API is managed by two data providers, GNIP and DataSift, both of which have friendly relations with Twitter.
Find an Existing Dataset
One of the best ways to overcome the limitations of the Twitter API is to find existing Twitter datasets shared by users. We at TrackMyHashtag are building a collection of Twitter datasets that are available for free all over the internet. Click here to access our mega-compilation of Twitter datasets.
Twitter’s developer policy restricts the sharing of Twitter data (especially tweets) collected through the API, and these limits apply as soon as you get access to the API keys. If you want to share a dataset, you cannot share the tweets themselves; they have to be replaced with their respective tweet ids. The users with whom the dataset is shared then use the tweet ids to retrieve the exact tweets through the Twitter API. However, if the dataset has been cleaned, there is no way to retrieve the removed tweets, and recipients will have to make do with what is left.
There are several tools that you can use to retrieve tweets with the Twitter API based on the tweet ids. Hydrator is one of the most popular and widely used. However, as with the other API functions, Twitter limits how many tweets you can hydrate with the help of tweet ids. You might have to spend considerable time retrieving tweets if you are working with large Twitter datasets.
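Replacing tweets with their ids is simple to do in code. The sketch below assumes the collected tweets are stored one JSON object per line and that each object carries its id in an id_str field. Both are assumptions about your own storage format, and the sample lines are made up.

```python
import json


def dehydrate(jsonl_lines):
    """Reduce stored tweets to a list of tweet ids that can be shared."""
    tweet_ids = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the file
        tweet = json.loads(line)
        tweet_ids.append(tweet["id_str"])  # assumed field name
    return tweet_ids


# Made-up stored tweets, one JSON object per line.
sample_lines = [
    '{"id_str": "1001", "text": "first made-up tweet"}',
    '',
    '{"id_str": "1002", "text": "second made-up tweet"}',
]

print(dehydrate(sample_lines))  # ['1001', '1002']
```

The resulting list can be written to a plain text file, one id per line, which is the usual shape of the shared datasets described here.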
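If you write your own hydration script instead of using a tool, the ids are normally sent in batches. The sketch below assumes the lookup endpoint accepts up to 100 ids per request as a comma-separated string, so check the current API documentation before relying on that number. It only prepares the batches and makes no network calls, and the function names are illustrative.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Split a list of tweet ids into consecutive batches."""
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]


def lookup_parameters(tweet_ids, batch_size=100):
    """Build one comma-separated id string per lookup request."""
    return [",".join(batch) for batch in batch_ids(tweet_ids, batch_size)]


# 250 made-up tweet ids.
ids = [str(n) for n in range(1, 251)]
params = lookup_parameters(ids)

print(len(params))                # 3
print(params[2].count(",") + 1)   # 50
```

The number of batches, combined with the request limit of the endpoint you use, gives a quick estimate of how long a large dataset will take to hydrate.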
Working with existing Twitter datasets can be a bit easier compared to raw Twitter data. As these datasets were gathered to perform research, they usually offer a cleaned version of the dataset. You can find a clean and noise-free version of the dataset with just the original tweets. Some datasets even include both the raw and cleaned version. Depending on the nature of your research, you can choose the type of dataset that best suits your requirements.
Purchase Directly from Twitter
You can also purchase historical Twitter datasets for academic research directly from Twitter using the Historical PowerTrack enterprise product.
This service was previously provided and managed by GNIP, a social media API aggregation company. But in 2014, GNIP was purchased and folded into Twitter. To buy specific or custom datasets, you simply have to state your data requirements with search terms and limiters. After specifying your data requirements, a GNIP sales executive would provide the cost estimates for the required data.
The Historical PowerTrack enterprise product also offers filtering options and enhancements beyond those of the public Twitter API, including more filter operators and tweet enhancements such as profile locations and shortened URLs.
Purchasing Twitter data through the Historical PowerTrack enterprise product is expensive. The price is decided mainly by the time period required to compile the dataset, although the number of tweets is also a factor. The shorter the period, the lower the cost. The service also promises a complete dataset for the search criteria rather than just a sample.
Purchase from Twitter Service Provider
Various commercial organizations and academic institutions act as Twitter service providers in exchange for a fee. Just as when purchasing data from Twitter, you specify your requirements and provide the search criteria and limiters for the data extraction. Their services include:
Access to Historical Twitter data related to any search term (keyword, username, location, hashtag, etc).
Value-added services for analyzing Twitter data. If you don’t have effective analysis tools of your own, you can use these services to perform the analysis.
The data extraction options that Twitter service providers offer are as follows:
Data extraction using public Twitter API: The public Twitter APIs, as discussed before, have certain limitations that restrict how much data you can access. However, while these methods of data-retrieval require more time to gather sufficient data, they are also less costly compared to the other options.
Data extraction using Enterprise Twitter API: Using this API, you purchase data directly from Twitter. You get unrestricted access to all tweets related to your search criteria. The cost of the service is determined by the time required to collect the data and the number of tweets. This service, however, is more costly and not suitable for personal or academic research unless you can write it into a grant.
Build datasets using existing sets of historical tweets: In this case, the Twitter service provider usually has an arrangement with Twitter. The service provider then gets access to the ‘firehose’ of all tweets to build the collection.
Twitter service providers offer reliable and uninterrupted access to the Twitter APIs. These services also account for redundancy and backfill, thus ensuring that you don’t miss a single tweet related to your search criteria. Some service providers even offer access to the data as well as analytical insights and social media metrics. These tools are designed for gathering business intelligence and monitoring Twitter performance metrics.
TrackMyHashtag is an advanced AI-driven paid Twitter analytics platform capable of tracking any hashtag or event on Twitter. You can also track tweets related to any hashtag, keyword, or @mention. The tool provides real-time engagement metrics related to any hashtag or targeted Twitter account.
TrackMyHashtag can also extract Historical Twitter data related to any keyword, hashtag, or @mention. You just have to provide the search criteria and limiters as per your historical Twitter data requirements and a sales rep will get back to you with the pricing details.
Key features of TrackMyHashtag:
Download historical Twitter data
Perform real-time hashtag tracking
Track engagement metrics in real-time
Identify resonating content and trending hashtags
Find social media influencers
It is one of the most feature-packed Twitter hashtag tracking tools, but you can try and test the tool yourself: TrackMyHashtag offers a 5-day free trial. Best of all, you don’t have to provide payment details to start the free trial.
TrackMyHashtag’s historical Twitter datasets provide the following metadata:
Tweet ID, URL, and tweet posting time
Tweet type and Tweet source
Retweets and Likes received
Tweet language and location
User ID, name, username, bio, profile URL, followers, following, and account creation date of the users posting the tweets
Verification and protected status of the Twitter accounts
While there are various ways to access Historical Twitter data, there are also restrictions on how much data you can access and how much you can share. This makes it difficult to acquire sufficient Twitter data to perform academic research. I only hope this guide helps you better understand all the current ways to access Historical Twitter data so you can move on with your research.
If you have something to add to this guide, do let me know in the comments.