
When do I listen to Bad Bunny?

Description

This repo contains all the resources I used for my experiment "When do I listen to Bad Bunny?"

Bad Bunny is a Puerto Rican urban music singer who has become a global star over the last few years. As the title indicates, in this project I'm using Spotify data I've been collecting since late 2017 to find out when I listen to his music. To solve this mystery, I analyzed the data with Python and R, covering data exploration, visualization, time series modeling, and anomaly detection.

Structure

The repo has the following structure:

  • bqWriter/: the script I used to read my Spotify account's recently played tracks and write the results to BigQuery (see the Spotify-to-BigQuery sketch after this list).
  • data/: the datasets I used for the experiment.
    • df.csv: the dataset containing all the Bad Bunny data. Each row represents a song I listened to.
    • results.csv: the outputs from the anomaly detection algorithm (isolation forest). The first column, predictions, is the prediction (-1 = anomaly, 1 = non-anomalous), and scores is the anomaly score (the lower the score, the more anomalous).
    • timeseries_data.csv: the data I used to train the time series model. The first column, ds, is the date, and the second column, y, is the number of Bad Bunny songs I listened to on that day.
    • weekdays_hours.csv: the data I used to train the anomaly detection model. It has two features: hour, the hour (in 24-hour format) when I listened to a song, and weekday, the day of the week. The training script one-hot encodes the weekday column into seven columns (one per day).
  • notebooks/: the notebooks I wrote during the analysis.
    • analysis.Rmd: the data analysis (written in R). It contains the code that summarizes, explores, and visualizes the dataset.
    • analysis_ES.Rmd: a copy of the previous script with the visualizations translated to Spanish.
    • isolation_forest.ipynb: the notebook that trains the isolation forest model.
    • timeseries.ipynb: the notebook that trains the time series model.
  • plots/: all the plots I created as part of the analysis. There are English and Spanish versions of each chart (the Spanish files are suffixed with "_ES").
  • service/: incomplete skeleton code for a service I want to build with FastAPI to serve the model's predictions (see the FastAPI sketch after this list). It also contains an exported copy (model.joblib) of the isolation forest model I trained.
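
For context, here is a minimal sketch of the read-and-write loop that bqWriter/ is responsible for, using spotipy and the BigQuery client library. The table name and row fields are illustrative assumptions, not necessarily what the actual script uses.

```python
# Hypothetical sketch: pull recently played tracks from Spotify and append
# them to a BigQuery table. Table and column names are assumptions.
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from google.cloud import bigquery

# Spotify credentials are read from the SPOTIPY_* environment variables.
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-read-recently-played"))
recent = sp.current_user_recently_played(limit=50)

rows = [
    {
        "played_at": item["played_at"],
        "track": item["track"]["name"],
        "artist": item["track"]["artists"][0]["name"],
    }
    for item in recent["items"]
]

client = bigquery.Client()
errors = client.insert_rows_json("my_project.spotify.recently_played", rows)
if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```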
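
Below is also a rough sketch of what the FastAPI service could look like once finished. The endpoint name, payload, and one-hot column names are assumptions; the skeleton in service/ may be organized differently.

```python
# Hypothetical FastAPI service: load the exported isolation forest and score
# one (hour, weekday) observation. Assumes the feature columns match the ones
# used when model.joblib was exported.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

app = FastAPI()
model = joblib.load("model.joblib")


class Listen(BaseModel):
    hour: int     # 0-23
    weekday: str  # e.g. "Friday"


@app.post("/predict")
def predict(listen: Listen):
    # Rebuild the training feature layout: hour plus seven one-hot weekday columns.
    features = {"hour": listen.hour}
    features.update({f"weekday_{d}": int(listen.weekday == d) for d in WEEKDAYS})
    X = pd.DataFrame([features])
    return {
        "prediction": int(model.predict(X)[0]),     # -1 = anomaly, 1 = non-anomalous
        "score": float(model.score_samples(X)[0]),  # lower = more anomalous
    }
```

If the file were saved as, say, app.py, you could run it locally with uvicorn app:app --reload and POST a body like {"hour": 23, "weekday": "Friday"} to /predict.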

Libraries used

  • scikit-learn: to train the isolation forest model (see the training sketch after this list).
  • Prophet: to train the time series model (see the forecasting sketch after this list).
  • ggplot2: to create the visualizations in R.
  • skimr: an R package that provides summary statistics.
  • tidyr and dplyr: to manipulate my data.
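
As a reference for the scikit-learn piece, here is a minimal sketch of the one-hot encoding and isolation forest training described above. isolation_forest.ipynb has the real code; the contamination value and output path below are placeholders.

```python
# Sketch of the anomaly detection step: one-hot encode the weekday column and
# fit scikit-learn's IsolationForest. Hyperparameters are placeholders.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("data/weekdays_hours.csv")   # columns: hour, weekday
X = pd.get_dummies(df, columns=["weekday"])   # hour + seven weekday_* columns

model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X)

results = pd.DataFrame({
    "predictions": model.predict(X),    # -1 = anomaly, 1 = non-anomalous
    "scores": model.score_samples(X),   # lower = more anomalous
})
results.to_csv("data/results.csv", index=False)
```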
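
And a similarly small sketch of the Prophet workflow from timeseries.ipynb: fit a model on the ds/y columns and forecast a few weeks ahead. The 30-day horizon is arbitrary.

```python
# Sketch of the time series step: Prophet expects a dataframe with ds (date)
# and y (value) columns, the same layout as timeseries_data.csv.
import pandas as pd
from prophet import Prophet  # older releases used: from fbprophet import Prophet

ts = pd.read_csv("data/timeseries_data.csv")   # columns: ds, y

m = Prophet()
m.fit(ts)

future = m.make_future_dataframe(periods=30)   # extend 30 days past the data
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```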

Notes

I presented this work (in Spanish) at PyCon Latam 2022. For details, see: https://pylatam.org/#schedule.