The Interfaces Twitter Elections Dataset: Data from the 2022 presidential elections in Brazil, as referenced in the related article regarding its construction and characteristics.

Interfaces-UFSCAR/ITED-Br

ITED-Br

The Interfaces Twitter Elections Dataset (ITED-Br) encompasses data from the 2022 Brazilian presidential elections, as detailed in the related article discussing its construction and characteristics. The repository includes tweet IDs collected from Twitter (X) during the election period, the post-election phase, and the attack on the executive, legislative, and judiciary buildings in January 2023. Our dataset comprises 282 million dehydrated tweets, collected from June 20, 2023, to January 31, 2024. It includes over 280 million account IDs and more than 30 million media IDs.

Data Collection and Organization

Data Collection

The data was collected using the Twitter API, which provided academic/research access at the time. For this repository, all data other than the IDs of the collected objects (tweets, users, media, replies, and quotes) was stripped, while the relationships between objects were preserved (e.g., tweets and their authors, retweets and their referenced tweets).

Measures of the dataset per search context

The following table reports, for each search context used to build the dataset, information about the data collection and the resulting number of tweets. It gives an idea of the volume of data available under each category in this repository:

Dataset measures

Data Organization

The Tweet IDs are organized as follows:

1. Tweet ID files are stored in a folder tree expressed by this path:

data/query/year/month/day/type/
  • query: name of the search string used to obtain the data;*
  • year: year of collected data;
  • month: month of collected data;
  • day: day of collected data;
  • type: type of collected data.*

*Detailed in the article.

2. Each leaf folder has parquet files containing the IDs.

3. Each file has the following name structure:

type-query-year-month-day.parquet

or

type-subtype-query-year-month-day.parquet
  • type: type of collected data (tweets, users, media, replies, quotes);*
  • subtype: type of interaction collected;*
  • query: name of the search string used to obtain the data;*
  • year: year of collected data;
  • month: month of collected data;
  • day: day of collected data.

*Detailed in the article.

Obs: This repository's folder structure is for logical organization of the data only and may not be the ideal structure for your particular use case or application. Our team also found that keeping file sizes small made it easier to implement optimized methods for processing the data at scale.
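The folder layout described above can be sketched in Python as follows. This is a minimal illustration rather than code from the dataset's authors: the helper names `leaf_folder` and `leaf_folders_between` are hypothetical, `example_query` is a placeholder for a real search-string name, and zero-padded month and day folder names are an assumption that should be checked against the actual tree.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath
from typing import List


def leaf_folder(query: str, day: date, kind: str, root: str = "data") -> PurePosixPath:
    """Build the data/query/year/month/day/type/ folder for a single day.

    Assumption: month and day folders are zero-padded (e.g. "02", not "2").
    """
    return PurePosixPath(
        root, query, f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}", kind
    )


def leaf_folders_between(
    query: str, start: date, end: date, kind: str, root: str = "data"
) -> List[PurePosixPath]:
    """List the leaf folders for every day from start to end, both inclusive."""
    n_days = (end - start).days
    return [
        leaf_folder(query, start + timedelta(days=offset), kind, root)
        for offset in range(n_days + 1)
    ]


# "example_query" is a placeholder, not a real query name from the dataset.
folders = leaf_folders_between(
    "example_query", date(2022, 10, 1), date(2022, 10, 3), "tweets"
)
for folder in folders:
    print(folder)
```

Building paths from the folder components, instead of parsing file names, avoids any ambiguity that could arise if a query name itself contains a hyphen.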

How to Hydrate

What is Hydration?

Hydration is the process of re-fetching the full data of tweets (and other objects such as users) from Twitter (X) using their IDs. This is necessary because, due to restrictions on data sharing included in Twitter's terms of service, we can only freely share IDs, not the full tweet data. To obtain the full data, users need to rehydrate the objects using the X API (previously Twitter API).

The data can be hydrated through any tool that interacts with the API, including, but not restricted to, libraries for Python or other programming languages, command line interfaces, and apps with a graphical user interface. We list some options below, but they might not be the best alternative for your use case.
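For readers who prefer to call the API directly, the sketch below shows one way to prepare lookup requests in Python. It only builds the query parameters and makes no network calls. The endpoint URL, the limit of 100 IDs per request, and the `tweet.fields` names reflect the X API v2 tweet lookup as we understand it at the time of writing and should be checked against the current API documentation. The function names `batched` and `lookup_params` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Sequence

# Assumptions: v2 tweet lookup endpoint and its per-request ID limit.
LOOKUP_URL = "https://api.twitter.com/2/tweets"
MAX_IDS_PER_REQUEST = 100


def batched(ids: Iterable[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def lookup_params(
    batch: Sequence[str], fields: Sequence[str] = ("created_at", "author_id")
) -> Dict[str, str]:
    """Build the query parameters for one lookup request."""
    return {"ids": ",".join(batch), "tweet.fields": ",".join(fields)}


# Example with made-up IDs: 250 IDs are split into three requests.
example_ids = [str(n) for n in range(250)]
requests_to_send = [lookup_params(batch) for batch in batched(example_ids)]
print(len(requests_to_send))
```

Each parameter dictionary can then be sent to `LOOKUP_URL` with any HTTP client, using a bearer token for authentication and respecting the rate limits of your API access tier.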

Hydrating using Hydrator (GUI)

Navigate to the Hydrator GitHub repository and follow the instructions provided in the README. Due to the large number of separate Tweet ID files in this repository, it is advisable to merge files from timeframes of interest into a larger file before hydrating the tweets through the GUI.
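Merging can be sketched in Python as below. This is a minimal example under stated assumptions, not the authors' tooling: the ID column name `"id"` is a guess (inspect one parquet file to find the real name), the function names are hypothetical, and reading parquet requires pandas with pyarrow or fastparquet installed. The output is a plain text file with one ID per line, a format that ID-hydration tools commonly accept.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def read_ids_with_pandas(path: Path, column: str = "id") -> List[str]:
    """Read one parquet file and return its ID column as strings.

    Assumption: the IDs live in a column named "id".
    """
    import pandas as pd  # imported lazily; needs pyarrow or fastparquet

    frame = pd.read_parquet(path, columns=[column])
    return frame[column].astype(str).tolist()


def merge_id_files(
    paths: Iterable[Path],
    out_path: Path,
    read_ids: Callable[[Path], List[str]] = read_ids_with_pandas,
) -> int:
    """Write the unique IDs from all input files to out_path, one per line.

    Returns the number of unique IDs written. IDs are kept in memory for
    de-duplication, so this suits a timeframe of interest, not the full dataset.
    """
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            for tweet_id in read_ids(path):
                if tweet_id not in seen:
                    seen.add(tweet_id)
                    out.write(f"{tweet_id}\n")
    return len(seen)
```

A possible call, assuming a placeholder query folder named `example_query` and a type folder named `tweets`, is `merge_id_files(sorted(Path("data/example_query/2022/10").glob("*/tweets/*.parquet")), Path("ids.txt"))`.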

Hydrating using Twarc (CLI)

Twarc works similarly to Hydrator but uses a command line interface. See the Twarc GitHub repository for installation and usage instructions.

Structure

The following tables detail some of the most important properties (organized as columns in the data) of each type of hydrated object that our team used during our research. Since this dataset contains only the IDs of the objects, the tables serve only to illustrate the data that might be available through rehydration.

Tweets

Tweet properties

Users

User properties

Media

Tweet properties

Interactions (e.g. Quotes, Replies)

Tweet properties

Obs: Each object's available properties are defined and controlled by the X API (formerly Twitter API), and different properties could be available when you rehydrate the data. The properties displayed here are those our team collected while constructing the hydrated dataset from which this ID-only dataset is derived.

More Information

For more information about the dataset, please read our article, cited in the Data Usage Agreement below.

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, comply with Twitter’s Terms of Service, and cite the following manuscript:

  • Iasulaitis, S.; Valejo, A.D.; Greco, B.C.; Perillo, V.G.; Messias, G.H.; Vicari, I. The Interfaces Twitter Elections Dataset: Construction Process and Characteristics of Big Social Data During the 2022 Presidential Elections in Brazil. PLOS ONE (2024). http://dx.doi.org/10.1371/journal.pone.0316626.
