Welcome to the R workshop for developers!
In this workshop, you will learn the basics of the R programming language, and then explore multiple interesting data sets that you can slice and dice, visualize, and even use to train machine learning models.
Note that the instructions below are by no means mandatory -- if there is something completely different you would like to explore in the provided datasets, feel free to experiment!
To perform this workshop, you will need to install R and RStudio. Both work on Linux, macOS, and Windows and are completely free.
First, let's get acquainted with R. The easiest way is to do a step-by-step walkthrough. Clone this repository (or download it) and find the tutorial.R file. Run RStudio, open the file using the File > Open File menu item, and follow the instructions.
In this exercise, you will analyze flight delay information from a set of 50,000 US domestic flights. The dataset contains the airport and airline details, departure and arrival delays, cancellations, and many additional fields.
This dataset was generated from the United States Department of Transportation Bureau of Transportation Statistics data.
First, read the data into your R session by running:
flights = read.csv("flights.csv")
To explore this data, you can either use the dplyr package (which has a variety of useful methods to filter, aggregate, and sort data frames), or you can try the sqldf package, which lets you run SQL queries on top of data frames, for even greater flexibility -- if SQL is your thing, that is (install the package first with install.packages("sqldf")):
library(sqldf)
sqldf("select carrier, count(*) cnt from flights group by carrier order by cnt")
Explore the data using the tools you have. You can use the summary function, plot some of the data, look at histograms, and so on. Here are some interesting questions to answer (a starter sketch for one of them follows the list):
- Which airline has the most flights departing from Boston, MA?
- Overall, which airline has the worst average delay? How bad is it?
- If you live in Chicago, IL, what are the farthest 10 domestic destinations that you could fly to?
- Which airline has the best delay record on direct flights from New York, NY to San Francisco, CA?
- Is there an airline that performs mostly shorter flights (distance or time)? Is there an airline that prefers longer flights?
- Suppose you live in San Jose, CA and there don't seem to be many direct flights to Boston, MA. Of all the 1-stop flight combinations, which would be the best option in terms of average arrival delay? (You can assume that every pair of flights San Jose > N > Boston is a pair that you could use, regardless of carrier.)
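As a starter for the Chicago question above, here is a hedged dplyr sketch. The origin, destination, and distance column names used here (origin_city, dest_city, distance) are assumptions -- check names(flights) and adjust:
library(dplyr)
flights %>%
  filter(origin_city == "Chicago, IL") %>%  # assumed column name and city format
  distinct(dest_city, distance) %>%         # one row per destination
  arrange(desc(distance)) %>%
  head(10)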
Try to answer some questions using plots (a minimal plotting sketch follows the list):
- What is the distribution of arrival delays? Departure delays?
- Of all the flights that weren't delayed more than 2 hours (arrival delay), is there a close correlation between the departure delay and the arrival delay?
- Is there a correlation between the flight time and the arrival delay?
- Are some months more delay-prone than others? Perhaps something seasonal?
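As a starting point for the first two plotting questions, here is a minimal base-R sketch. The delay column names used here (arr_delay and dep_delay) are assumptions -- check names(flights) and adjust:
hist(flights$arr_delay, breaks = 100, main = "Arrival delays", xlab = "Minutes")
on_time = flights[!is.na(flights$arr_delay) & flights$arr_delay <= 120, ]  # arrival delay of 2 hours or less
plot(on_time$dep_delay, on_time$arr_delay, pch = ".",
     xlab = "Departure delay (min)", ylab = "Arrival delay (min)")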
In this exercise, you will experiment with air pollution data monitored in the city of Kfar Saba, Israel during the year 2016. The data contains concentrations and levels of various pollutants in addition to wind speed and direction.
This dataset was generated from the Israel Ministry of Environmental Protection air pollution data website.
Read the data into your R session by running:
library(readr)
pollution <- read_csv("pollution-kfar-saba.csv",
                      col_types = cols(H2S = col_double(),
                                       PM10 = col_double(), SO2 = col_double(),
                                       WD = col_double(), WS = col_double(),
                                       date = col_datetime(format = "%m/%d/%Y %H:%M")))
Note that we use the readr package here instead of read.csv because it allows easier customization of the column types. In fact, the code above was generated by the File > Import Dataset > From CSV dialog in RStudio.
The various pollutants have different permitted levels. In the environment-values.pdf file (in Hebrew) you can find a table of levels. For example, SO2 is a pollutant for which 134 ppb (parts per billion) is an exceptional level when measured for less than an hour; when measured for a full day, 19 ppb is considered exceptional; and so on.
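For example, to count the SO2 readings that exceed the 134 ppb level mentioned above (this assumes the file holds hourly averages and that the SO2 values are in ppb -- adjust if not):
max(pollution$SO2, na.rm = TRUE)         # highest recorded SO2 value
sum(pollution$SO2 > 134, na.rm = TRUE)   # number of readings above the hourly level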
Explore the data using the tools you have. You can use the summary function, plot some of the data, look at histograms, and so on. Here are some interesting questions to answer (a small sketch for the last one appears after the list):
- What is the highest recorded value for each pollutant?
- How many exceptional values were recorded for each pollutant?
- What is the distribution of values for each pollutant? In other words, how exceptional are the exceptions?
- The NO, NO2, and NOx pollutants are often associated with internal combustion engines, i.e. traffic. Do exceptional levels of these pollutants occur more frequently on weekdays or weekends? Do they occur more frequently in morning hours, when there is more traffic?
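For the last question, here is a rough sketch using base R date functions. The NOx column name and the threshold are placeholders -- check names(pollution) for the exact spelling and take the real level from environment-values.pdf:
pollution$weekday = weekdays(pollution$date)               # day of week from the parsed timestamp
pollution$hour = as.integer(format(pollution$date, "%H"))  # hour of day
threshold = 500  # placeholder -- substitute the actual NOx level
aggregate(NOx ~ weekday, data = pollution, FUN = function(x) mean(x > threshold, na.rm = TRUE))
aggregate(NOx ~ hour, data = pollution, FUN = function(x) mean(x > threshold, na.rm = TRUE))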
Next, you will use the powerful openair package for analyzing the pollution data and wind statistics. Install and load it first:
install.packages("openair")
library(openair)
Now, let's build a wind rose to visualize wind speeds and directions measured by our monitoring station:
windRose(pollution, ws = "WS", wd = "WD")
The wind rose shows how frequent each combination of wind direction and speed is. The most frequent wind directions seem to be west, north-west, and south-west. But it looks like the strongest winds might be coming from the south-east. Let's confirm -- it's a bit hard to tell while the weaker winds obscure most of the chart:
windRose(pollution[pollution$WS > 5,], ws = "WS", wd = "WD")
Now that we have a basic reading on the winds, let's analyze the pollutants with respect to the wind. For example, here's the NO2 (nitrogen dioxide) data:
pollutionRose(pollution, pollutant = "No2", ws = "WS", wd = "WD")
You can repeat the same rose for other pollutants: SO2, H2S, NO, and so on. Note that the size of the slices doesn't indicate pollution severity -- rather, it indicates the frequency of wind coming from that direction. So again, if we're interested only in exceptional values, we need to filter the pollution data first:
pollutionRose(pollution[pollution$No2 > 21,], pollutant = "No2", ws = "WS", wd = "WD")
Note that the pollution rose doesn't easily show where the most severe pollution is coming from. We can generate a polar chart that takes into account the wind speed and direction, and uses a heatmap-like visualization for pollutant concentrations:
polarPlot(pollution, pollutant = "No2", x = "WS", wd = "WD", type = "year")
Again, you should try this for different pollutants. You can also experiment with the type parameter -- for example, if you specify type = "weekday", you will get seven charts, one for each day of the week. This can help you recognize pollution patterns that recur on a daily, weekly, or seasonal basis. You can find even more examples and things to try on the openair project website.
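For example, keeping the same column names as in the previous calls:
polarPlot(pollution, pollutant = "No2", x = "WS", wd = "WD", type = "weekday")  # one panel per day of the week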
Lastly, it may be interesting to overlay the polar plot data on a map. Although we can't easily estimate the distance from the source based only on the wind speed and direction, putting a high-level map under the plot may help identify potential pollution sources. The maps package (available from CRAN), as well as numerous other solutions, can help with this.
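As a very rough starting point (not a true overlay), you could draw the surrounding region with the maps package and mark the monitoring station; the Kfar Saba coordinates below are approximate:
install.packages("maps")
library(maps)
map("world", regions = "Israel")
points(34.91, 32.18, pch = 19, col = "red")  # approximate longitude/latitude of Kfar Saba
text(34.91, 32.18, "Kfar Saba", pos = 4, cex = 0.8)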
In this exercise, you will use data on RMS Titanic passengers to predict which passengers survived and which died when the ship sank. The information available includes the passenger's sex, fare class, port of embarkation, and additional fields.
This dataset was obtained from Kaggle's Machine Learning from Disaster tutorial competition.
Read the data into your R session by running:
titanic = read.csv("titanic.csv")
Explore the data using the tools you have. You can use the summary function, plot some of the data, look at histograms, and so on. Here are some interesting questions to answer (a quick sketch for the survival questions follows the list):
- How many passengers survived and how many died?
- What is the distribution of passenger ages?
- What is the distribution of passengers across fare classes?
- Which proportion of the males died? Which proportion of the females?
- What were the fares like in 1st, 2nd, and 3rd class?
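For the survival counts and the survival-by-sex question, a quick sketch using the Survived and Sex columns (the same column names the model below relies on):
table(titanic$Survived)                                        # overall survived vs. died counts
prop.table(table(titanic$Sex, titanic$Survived), margin = 1)   # survival proportions within each sex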
Now, let's try to learn from the data by fitting a decision tree to it. You will need to install a few packages:
install.packages("rparty")
install.packages("caret")
install.packages("e1071")
First, we split the data into a training set and a test set. We'll go with 70% for training, and the rest for testing:
library(caret)
intrain = createDataPartition(titanic$Survived, p = 0.7, list = FALSE)
train = titanic[intrain,]
test = titanic[-intrain,]
Now, grow a decision tree under the assumption that the passenger's sex, fare class, and age affected their survival:
library(rpart)
model = rpart(Survived ~ Sex + Pclass + Age, data = train, method = "class")
model
plot(model)
text(model, use.n=TRUE, all=TRUE, cex=.8)
This is a decision tree: for example, the root of the tree might evaluate the passenger's sex. Males go left, females go right. Next, males aged less than 3.5 are predicted to survive, whereas males aged over 3.5 are predicted to die -- and so on.
Now let's test the model on our test set:
pred = predict(model, newdata = test, type = "class")
summary(pred)
How good are these predictions?
library(e1071)
confusionMatrix(pred, factor(test$Survived))  # factor() keeps this working even if Survived was read as 0/1 integers
What is the accuracy you got? In some of our testing, we got values around 80%. While this isn't bad, you are definitely welcome to try and improve this result by using more training data, training on additional columns, or using a different model (e.g. a random forest).
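If you want to try the random forest route, here is a hedged sketch using the randomForest package on the same three predictors. It reuses the train/test split from above; rows with a missing Age are dropped for simplicity, and Survived and Sex are converted to factors because randomForest expects factors for classification:
install.packages("randomForest")
library(randomForest)
keep = c("Survived", "Sex", "Pclass", "Age")
train_rf = train[complete.cases(train[, keep]), keep]
test_rf = test[complete.cases(test[, keep]), keep]
train_rf$Survived = factor(train_rf$Survived)
test_rf$Survived = factor(test_rf$Survived, levels = levels(train_rf$Survived))
train_rf$Sex = factor(train_rf$Sex)
test_rf$Sex = factor(test_rf$Sex, levels = levels(train_rf$Sex))
rf = randomForest(Survived ~ Sex + Pclass + Age, data = train_rf, ntree = 500)
rf_pred = predict(rf, newdata = test_rf)
confusionMatrix(rf_pred, test_rf$Survived)
Whether this beats the single decision tree depends on the split and the features, so treat it as a starting point rather than a guaranteed improvement.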
In this exercise, you will construct a machine learning model that performs optical recognition of handwritten digits. This is a classic machine learning experiment. The training set was generated by breaking each digit image down into an 8x8 matrix and computing a grayscale value from 0 to 16 for each cell by averaging its pixels. The data therefore has 64 columns holding the grayscale values, and a final 65th column with the actual digit.
This dataset was obtained from the UCI Machine Learning Repository Optical Recognition of Handwritten Digits dataset.
First, as always, you will need to get the dataset into your RStudio session:
train = read.csv("optdigits.tra", header = FALSE)
names(train)[65] = "digit"
test = read.csv("optdigits.tes", header = FALSE)
names(test)[65] = "digit"
Take a look at the general shape of the data. For example, try hist(train$digit) to see what kind of distribution the training set has. If some digit is grossly under- or over-represented, we would have to take that into account when training and testing our model.
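For a more precise count than the histogram, and to sanity-check the data itself, you can tabulate the labels and render one training example as an 8x8 image. This sketch assumes the 64 feature columns are laid out row by row (left to right, top to bottom); if the rendered digit looks scrambled, adjust the byrow argument or the orientation:
table(train$digit)  # how many examples of each digit
digit_matrix = matrix(unlist(train[1, 1:64]), nrow = 8, byrow = TRUE)
image(t(digit_matrix)[, 8:1], col = gray(16:0 / 16), axes = FALSE,
      main = paste("Label:", train$digit[1]))  # darker cells mean more ink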
Next, let's train a support vector machine with our data. We will need the kernlab package, and our data as matrices, because that's what the ksvm function from that package expects:
install.packages("kernlab")
library(kernlab)
features_train = data.matrix(subset(train, select = -digit))
labels_train = data.matrix(subset(train, select = digit))
features_test = data.matrix(subset(test, select = -digit))
labels_test = data.matrix(subset(test, select = digit))
model = ksvm(features_train, labels_train, type = "C-svc") # classifier SVM
Now, armed with the model, we can start making predictions on the test set:
test_pred = predict(model, features_test)
table(labels_test, test_pred) # confusion matrix
sum(test_pred == labels_test) / length(labels_test) # accuracy
If you followed the same steps as we did, you should have gotten an accuracy of 0.9771842, which is a really good result. The confusion matrix also suggests there are no obvious failures -- there is no specific digit where our recognition fails badly.
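One way to double-check the "no obvious failures" observation is to compute per-digit recall from the same confusion matrix:
cm = table(labels_test, test_pred)
round(diag(cm) / rowSums(cm), 3)  # fraction of each actual digit that was recognized correctly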