For this project, I have used the Twitter_volume_FB dataset from the NAB collection on Kaggle.
Dataset link: https://www.kaggle.com/boltzmannbrain/nab
Steps to execute:
- Download the files from the GitHub repository.
- Unzip the downloaded .zip file to obtain Twitter_volume_FB.csv.
- Place the CSV file in a datasets folder, and place the datasets folder inside the notebooks folder alongside the .ipynb file (a quick path check is sketched after this list).
- Open a terminal and run "jupyter notebook".
- Navigate to the folder where the notebook is placed.
- From the Cell menu, click Run All, which runs the whole notebook from the first cell, then verify the results.
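
As an optional sanity check before running the notebook, the minimal sketch below (not part of the original steps) confirms the CSV sits where the notebook expects it, assuming the layout described above and that it is run from inside the notebooks folder:

```python
from pathlib import Path

# Expected layout, per the steps above: notebooks/datasets/Twitter_volume_FB.csv
csv_path = Path("datasets") / "Twitter_volume_FB.csv"

# Should print True when run from inside the notebooks folder.
print(csv_path.exists())
```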
The goal of the project is to detect the anomalies in the dataset and show how they can be identified.
Steps to follow:
- Set up a data science project structure in a new git repository in your GitHub account
- Download the benchmark data set from https://www.kaggle.com/boltzmannbrain/nab or https://github.com/numenta/NAB/tree/master/data
- Load one of the data sets into a pandas DataFrame (see the loading sketch after this list)
- Using exploratory data analysis, formulate one or two ideas on how feature engineering could add value to the data set (a feature-engineering sketch follows this list)
- Build one or more anomaly detection models to detect the anomalies, using the other columns as features (a minimal model sketch follows this list)
- Document your process and results
- Commit your notebook, source code, visualizations and other supporting files to the git repository in GitHub
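
For the loading step, a minimal sketch assuming the CSV is placed under datasets/ as described above; NAB data files contain a timestamp column and a value column:

```python
import pandas as pd

# Load the NAB Twitter_volume_FB series; NAB files have "timestamp" and "value" columns.
df = pd.read_csv(
    "datasets/Twitter_volume_FB.csv",
    parse_dates=["timestamp"],
    index_col="timestamp",
)

print(df.head())
print(df["value"].describe())
```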
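One possible direction for the feature-engineering step, sketched below, is to derive calendar and rolling-window features from the raw value column; the feature names and window size are illustrative assumptions, not part of the original project:

```python
import pandas as pd

# df is the DataFrame loaded above, indexed by timestamp with a "value" column.
features = pd.DataFrame(index=df.index)
features["value"] = df["value"]

# Calendar features: tweet volume typically has hour-of-day / day-of-week patterns.
features["hour"] = df.index.hour
features["dayofweek"] = df.index.dayofweek

# Rolling statistics capture the local level and variability of the series.
features["rolling_mean"] = df["value"].rolling(window=12, min_periods=1).mean()
features["rolling_std"] = df["value"].rolling(window=12, min_periods=1).std().fillna(0)

# Deviation from the local mean is a simple anomaly-oriented feature.
features["diff_from_mean"] = features["value"] - features["rolling_mean"]
```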
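For the modelling step, a minimal sketch using scikit-learn's IsolationForest on the engineered features above; the contamination value is an assumption and would need tuning against the known NAB anomaly windows:

```python
from sklearn.ensemble import IsolationForest

# Fit an Isolation Forest on the engineered features; contamination=0.01 is a rough guess.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
features["anomaly"] = model.fit_predict(features)  # -1 = anomaly, 1 = normal

anomalies = features[features["anomaly"] == -1]
print(f"Flagged {len(anomalies)} potential anomalies out of {len(features)} points")
```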