
Twitter200M

public
8 stars
1 fork
0 issues

Commits

List of commits on branch main.
Verified
41437b3ae9c0edbd5403ac05f46eb7d972f87d50

Upgrade dependencies

elliotwutingfeng committed 2 months ago
Verified
4e60b6cae8721356788f51ee064ec844417706d1

Upgrade dependencies

elliotwutingfeng committed 7 months ago
Verified
85603cebbf3dce7e7a4c8fbe1cc338f82736524a

Add low memory warning.

elliotwutingfeng committed 8 months ago
Verified
370e08e16505d8101d0c3c0d08ba9e2afd4cf1cc

Add venv to instructions.

elliotwutingfeng committed 8 months ago
Verified
4fcc0a1c461e220fddbc9190d7ea534bca3724da

Fix regressions due to polars and seaborn API changes

elliotwutingfeng committed 10 months ago
Verified
254ef9294973da40660d86c07ffbf9e4e523ecb4

Add Code of Conduct

elliotwutingfeng committed a year ago

README


Twitter200M


Simple analysis of the Twitter 200M Data Dump of January 2023.

Download links for the data dump are not included in this repository.

Background

Quote from haveibeenpwned.com:

In early 2023, over 200M records scraped from Twitter appeared on a popular hacking forum. The data was obtained sometime in 2021 by abusing an API that enabled email addresses to be resolved to Twitter profiles. The subsequent results were then composed into a corpus of data containing email addresses alongside public Twitter profile information including names, usernames and follower counts.

The data dump analysed in this repository is a "cleaned-up" version prepared by a user of the aforementioned forum.

Findings

Caveats

  • Not all user accounts have been leaked; Twitter has many more than 200 million accounts.
  • It is impossible to verify that the leaked dataset has not been tampered with or padded with falsified data.

The following findings are made on the assumption that this dataset is representative of Twitter's actual userbase.

Most popular email providers

┌────────────────┬─────────────────┐
│ Email Provider ┆ Number of Users │
│ ---            ┆ ---             │
│ str            ┆ i64             │
╞════════════════╪═════════════════╡
│ gmail.com      ┆ 73314131        │
│ hotmail.com    ┆ 40509492        │
│ yahoo.com      ┆ 33051713        │
│ aol.com        ┆ 4025882         │
│ hotmail.co.uk  ┆ 3298152         │
│ mail.ru        ┆ 3289923         │
│ hotmail.fr     ┆ 3128568         │
│ live.com       ┆ 1945940         │
│ msn.com        ┆ 1321923         │
│ yahoo.co.uk    ┆ 1313553         │
│ yahoo.fr       ┆ 1245996         │
│ ymail.com      ┆ 1142144         │
│ yandex.ru      ┆ 1125810         │
│ icloud.com     ┆ 1093533         │
│ comcast.net    ┆ 1091726         │
└────────────────┴─────────────────┘

Over 75% of Twitter users use Google, Microsoft, or Yahoo email addresses.
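
As a rough check of this figure, the counts in the table above can be grouped by provider family. The sketch below is not taken from the repository's notebook. The domain-to-family mapping and the denominator of 200 million records (the dump's advertised size, not an exact row count) are assumptions made for illustration.

```python
# Counts copied from the table above (top 15 domains only).
DOMAIN_COUNTS = {
    "gmail.com": 73314131,
    "hotmail.com": 40509492,
    "yahoo.com": 33051713,
    "aol.com": 4025882,
    "hotmail.co.uk": 3298152,
    "mail.ru": 3289923,
    "hotmail.fr": 3128568,
    "live.com": 1945940,
    "msn.com": 1321923,
    "yahoo.co.uk": 1313553,
    "yahoo.fr": 1245996,
    "ymail.com": 1142144,
    "yandex.ru": 1125810,
    "icloud.com": 1093533,
    "comcast.net": 1091726,
}

# Assumed mapping of domain prefixes to provider families (not from the repository).
FAMILY_PREFIXES = {
    "Google": ("gmail.",),
    "Microsoft": ("hotmail.", "live.", "msn."),
    "Yahoo": ("yahoo.", "ymail."),
}

# Assumption: the dump holds roughly 200 million records in total.
ASSUMED_TOTAL_RECORDS = 200_000_000


def family_totals(domain_counts):
    """Sum user counts per provider family; unmatched domains are ignored."""
    totals = {family: 0 for family in FAMILY_PREFIXES}
    for domain, count in domain_counts.items():
        for family, prefixes in FAMILY_PREFIXES.items():
            if domain.startswith(prefixes):
                totals[family] += count
                break
    return totals


totals = family_totals(DOMAIN_COUNTS)
combined = sum(totals.values())
share = combined / ASSUMED_TOTAL_RECORDS
print(totals)
print(f"{combined:,} records, about {share:.0%} of the assumed total")
```

Under these assumptions the three families cover roughly 160 million records, or about 80% of 200 million, which is consistent with the statement above. Domains outside the top 15 are not counted, and the true number of rows in the dump may differ from the assumed total.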

Account creation times

Twitter first experienced rapid user growth in 2009, with its highest new account signup rates from 2011 to 2013.

From 2016 onwards, new account signups dipped below 2009 levels and have declined steadily since.
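
The yearly trend described above comes down to counting accounts by the year of their creation timestamp. The repository's notebook is not reproduced here (its commit history mentions polars and seaborn). The sketch below uses only the Python standard library, and the timestamp format and sample values are hypothetical placeholders, not rows from the dump.

```python
from collections import Counter
from datetime import datetime


def signups_per_year(created_at_values, fmt="%Y-%m-%d %H:%M:%S"):
    """Count accounts by the calendar year of their creation timestamp.

    `fmt` is an assumed timestamp layout; adjust it to the actual data.
    """
    years = Counter()
    for value in created_at_values:
        years[datetime.strptime(value, fmt).year] += 1
    # Return the counts ordered by year so trends are easy to read.
    return dict(sorted(years.items()))


# Hypothetical sample timestamps, for illustration only.
sample = [
    "2009-03-14 08:15:00",
    "2011-07-02 19:40:12",
    "2011-11-23 05:02:47",
    "2012-01-09 13:30:00",
    "2016-06-30 22:10:05",
]
counts = signups_per_year(sample)
print(counts)
```

For a dataset of this size, a columnar library such as the one the notebook uses would likely be far faster than a pure-Python loop; the loop is shown only to make the aggregation explicit.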

Requirements

Tested on Linux x64

  • Fast multicore CPU
  • At least 16 GB available RAM
  • Python 3.11.7
  • 7zip
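
The requirements above could be checked before launching the notebook with something like the sketch below. It is not part of the repository. The function names, the reading of "16 GB" as 16 GiB, and the candidate 7zip executable names (`7z`, `7za`, `7zz`) are assumptions, and the memory probe relies on `os.sysconf` names that are available on Linux.

```python
import os
import shutil
import sys

# Assumption: "16 GB" is interpreted as 16 GiB.
REQUIRED_RAM_BYTES = 16 * 1024**3


def available_ram_bytes():
    """Return available physical memory in bytes, or None if unknown."""
    try:
        return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are missing on some platforms.
        return None


def find_7zip():
    """Return the path of the first 7zip executable found on PATH, if any."""
    for name in ("7z", "7za", "7zz"):  # assumed candidate names
        path = shutil.which(name)
        if path:
            return path
    return None


def requirement_warnings(ram_bytes, python_version, sevenzip_path):
    """Return human-readable warnings for requirements that appear unmet."""
    warnings = []
    if ram_bytes is None:
        warnings.append("Could not determine available RAM.")
    elif ram_bytes < REQUIRED_RAM_BYTES:
        warnings.append("Less than 16 GB of RAM available.")
    if tuple(python_version[:2]) != (3, 11):
        warnings.append("Python version differs from the tested 3.11 series.")
    if sevenzip_path is None:
        warnings.append("No 7zip executable found on PATH.")
    return warnings


if __name__ == "__main__":
    found = requirement_warnings(
        available_ram_bytes(), sys.version_info, find_7zip()
    )
    for message in found:
        print("WARNING:", message)
```

The check only warns and never exits, since other Python versions or slightly less memory may still work; the README states only what was tested.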

Setup

python3 -m venv venv
venv/bin/python3 -m pip install --upgrade pip
venv/bin/python3 -m pip install -r requirements.txt

Run Jupyter Notebook

venv/bin/python3 -m jupyter notebook main.ipynb

Formatting

venv/bin/python3 -m black .