This is an introduction to the R package dplyr, written by Hadley Wickham. Personally, I found it incredibly easy and useful to use dplyr when I did data manipulation. It’s usually the first package I call when I start a data analysis project. The functions in the package work similar to SQL syntax. You almost can find any database function you want in dplyr. We will start by introducing the class tbl, the basic functions (verbs) to other special functions. We will also talk about how the amazing syntax %>% (like the pipe in Unix) makes your code more elegant.
library(dplyr)
library(hflights)
library(rbenchmark)
We use hflight dataset for demo.
data(hflights)
names(hflights)
[1] "Year" "Month" "DayofMonth"
[4] "DayOfWeek" "DepTime" "ArrTime"
[7] "UniqueCarrier" "FlightNum" "TailNum"
[10] "ActualElapsedTime" "AirTime" "ArrDelay"
[13] "DepDelay" "Origin" "Dest"
[16] "Distance" "TaxiIn" "TaxiOut"
[19] "Cancelled" "CancellationCode" "Diverted"
First of all, we introduce the class tbl. It will be better if we change the class to tbl. When you print the data.frame, R will dump everything to the console, even the data.frame is too big to fit in your console. However, if you convert the class to tbl and print it, it will give you a better output format. Note that the function in dplyr can be used to data.frame as well. We will compare the difference between tbl and data.frame later.
# hflights_df is data.frame
hflights_df <- hflights
class(hflights_df)
[1] "data.frame"
# Transform data.frame to table
hflights <- tbl_df(hflights)
class(hflights)
[1] "tbl_df" "tbl" "data.frame"
# Nicer output format
hflights
Source: local data frame [227,496 x 21]
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
(int) (int) (int) (int) (int) (int) (chr)
1 2011 1 1 6 1400 1500 AA
2 2011 1 2 7 1401 1501 AA
3 2011 1 3 1 1352 1502 AA
4 2011 1 4 2 1403 1513 AA
5 2011 1 5 3 1405 1507 AA
6 2011 1 6 4 1359 1503 AA
7 2011 1 7 5 1359 1509 AA
8 2011 1 8 6 1355 1454 AA
9 2011 1 9 7 1443 1554 AA
10 2011 1 10 1 1443 1553 AA
.. ... ... ... ... ... ... ...
Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
(int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
(chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
CancellationCode (chr), Diverted (int)
Subset is one of the most important things in data manipulation. In base R, we will use subset with conditions or just boolean vector with square brackets [ to retrieve the rows we want. In dplyr, we use filter(data, conditions) to subset the data.
Let’s say that we want to get the flights in March.
# In dplyr
filter(hflights, Month == 3)
Source: local data frame [19,470 x 21]
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
(int) (int) (int) (int) (int) (int) (chr)
1 2011 3 1 2 1633 1734 AA
2 2011 3 2 3 1632 1750 AA
3 2011 3 3 4 1635 1734 AA
4 2011 3 4 5 1630 1747 AA
5 2011 3 5 6 1633 1745 AA
6 2011 3 6 7 1630 1730 AA
7 2011 3 7 1 1627 1737 AA
8 2011 3 8 2 1650 1749 AA
9 2011 3 9 3 1635 1744 AA
10 2011 3 10 4 1636 1749 AA
.. ... ... ... ... ... ... ...
Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
(int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
(chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
CancellationCode (chr), Diverted (int)
##### The following codes will give your the same result #####
# subset(hflights, Month == 3)
# hflights[hflights$Month == 3, ]
If you have multiple condition, you can use commas to separate several conditions. For example, filter(hflights, Month == 3, DayofMonth == 16) will pull out the filght on March 16. Also, you can use & and | operation in the arguments like filter(hflights, Month == 3 & DayofMonth == 16).
We use the benchmark function in package rbenchmark to compare four methods. We replicate the procedure 100 times and calculate the time. You can also use the system.time to compare method while its result varies. benchmark gives us a more robust comparison.
From the results, you can find that filter with class tbl is the fastest. As we mentioned earlier, the class tbl will be better than data.frame for large dataset.
# compare four methods
benchmark(
method1=filter(hflights, Month == 3),
method1_df=filter(hflights_df, Month == 3),
method2=subset(hflights, Month == 3),
method3=hflights[hflights$Month == 3, ],
replications=100,
columns=c("test", "elapsed", "relative", "user.self", "sys.self")
)
test elapsed relative user.self sys.self
1 method1 0.45 1.000 0.43 0.03
2 method1_df 0.63 1.400 0.53 0.08
3 method2 2.86 6.356 2.24 0.58
4 method3 2.80 6.222 2.03 0.73
We can use select function to select the column. We just put the column name in the arguments.
# In dplyr
select(hflights, Month, DayofMonth, FlightNum)
Source: local data frame [227,496 x 3]
Month DayofMonth FlightNum
(int) (int) (int)
1 1 1 428
2 1 2 428
3 1 3 428
4 1 4 428
5 1 5 428
6 1 6 428
7 1 7 428
8 1 8 428
9 1 9 428
10 1 10 428
.. ... ... ...
If you have a lot of column names shared some characters, without typing the explicit name you can use contains, starts_with and ends_with to pull out those columns. (Note that we do not show the result here. You can try the code.)
# contains
select(hflights, contains("delay"))
# starts_with
select(hflights, starts_with("M"))
# use ends_with
select(hflights, ends_with("th"))
Sometimes we will combine the filter and select function together. You may want to find out the flight number and Origin of those American Airlines (AA) flights. You will write code like the following:
filter(select(hflights, FlightNum, Origin, UniqueCarrier), UniqueCarrier == "AA")
When you see the above codes, it may take a while to get what the codes is doing. However, in dplyr, it gives you a more elegant way to organize their code. It works like the pipe in Unix. We can rewrite the code as following:
# Clearer way to write code
hflights %>%
select(FlightNum, Origin, UniqueCarrier) %>%
filter(UniqueCarrier == "AA")
Source: local data frame [3,244 x 3]
FlightNum Origin UniqueCarrier
(int) (chr) (chr)
1 428 IAH AA
2 428 IAH AA
3 428 IAH AA
4 428 IAH AA
5 428 IAH AA
6 428 IAH AA
7 428 IAH AA
8 428 IAH AA
9 428 IAH AA
10 428 IAH AA
.. ... ... ...
You can interpret the %>% syntax as “then”. The above code will be like that we first get the hflights data, select FlightNum, Origin, UniqueCarrier, and then filter the outcome to find those UniqueCarrier is AA. It throws the output in front of it to the argument of next function. In the following demo, I will use %>%.
Suppose we want to find out which airlines had the longest flights delayed in the data. We have to sort our data by the delay time. To do it in dplyr, we can use arrange function.
# To sort decreasingly, use desc
hflights %>%
select(DepDelay, UniqueCarrier) %>%
arrange(desc(DepDelay))
Source: local data frame [227,496 x 2]
DepDelay UniqueCarrier
(int) (chr)
1 981 CO
2 970 AA
3 931 MQ
4 869 UA
5 814 MQ
6 803 MQ
7 780 CO
8 758 CO
9 730 DL
10 691 MQ
.. ... ...
Sometimes, you may want to create a new variable. For instance, if we want to create the date variable in flight data, use mutate function.
# Create a new variable, date
hflights <- hflights %>%
mutate(date=paste(Year, Month, DayofMonth, sep="-"))
hflights %>% select(date)
Source: local data frame [227,496 x 1]
date
(chr)
1 2011-1-1
2 2011-1-2
3 2011-1-3
4 2011-1-4
5 2011-1-5
6 2011-1-6
7 2011-1-7
8 2011-1-8
9 2011-1-9
10 2011-1-10
.. ...
Often, we will want to do something based on several groups such as what is the average delayed time for each airline. In base R, we use tapply(hflights$DepDelay, hflights$UniqueCarrier, mean, na.rm=TRUE) or aggregate(DepDelay ~ UniqueCarrier, hflights, mean, na.rm=TRUE).
# Use group_by + summarise
hflights %>%
group_by(UniqueCarrier) %>%
summarise(average_delay=mean(DepDelay, na.rm=T))
Source: local data frame [15 x 2]
UniqueCarrier average_delay
(chr) (dbl)
1 AA 6.390144
2 AS 3.712329
3 B6 13.320532
4 CO 9.261313
5 DL 9.370627
6 EV 12.482193
7 F9 5.093637
8 FL 4.716376
9 MQ 11.071745
10 OO 8.885482
11 UA 12.918707
12 US 1.622926
13 WN 13.488241
14 XE 7.713728
15 YV 1.538462
Sometimes, we may want to apply the function to several columns or apply different functions to column.
#Apply function to multiple columns
hflights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(mean(., na.rm=T)), DepDelay, ArrDelay)
Source: local data frame [15 x 3]
UniqueCarrier DepDelay ArrDelay
(chr) (dbl) (dbl)
1 AA 6.390144 0.8917558
2 AS 3.712329 3.1923077
3 B6 13.320532 9.8588410
4 CO 9.261313 6.0986983
5 DL 9.370627 6.0841374
6 EV 12.482193 7.2569543
7 F9 5.093637 7.6682692
8 FL 4.716376 1.8536239
9 MQ 11.071745 7.1529751
10 OO 8.885482 8.6934922
11 UA 12.918707 10.4628628
12 US 1.622926 -0.6307692
13 WN 13.488241 7.5871430
14 XE 7.713728 8.1865242
15 YV 1.538462 4.0128205
#Apply multiple functions to one column
hflights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(mean(., na.rm=T), min(., na.rm=T), max(., na.rm=T)), DepDelay)
Source: local data frame [15 x 4]
UniqueCarrier mean min max
(chr) (dbl) (int) (int)
1 AA 6.390144 -15 970
2 AS 3.712329 -15 172
3 B6 13.320532 -14 310
4 CO 9.261313 -18 981
5 DL 9.370627 -17 730
6 EV 12.482193 -18 479
7 F9 5.093637 -15 275
8 FL 4.716376 -14 507
9 MQ 11.071745 -23 931
10 OO 8.885482 -33 360
11 UA 12.918707 -11 869
12 US 1.622926 -17 425
13 WN 13.488241 -10 548
14 XE 7.713728 -19 628
15 YV 1.538462 -11 54
In this section, we are going to introduce some useful functions that can facilitate our analysis. The first one is n(). It can help us count the number in the group. For example, we want to calculate the number of flights for each airline.
#In dplyr
hflights %>%
group_by(UniqueCarrier) %>%
summarise(flight_count=n())
Source: local data frame [15 x 2]
UniqueCarrier flight_count
(chr) (int)
1 AA 3244
2 AS 365
3 B6 695
4 CO 70032
5 DL 2641
6 EV 2204
7 F9 838
8 FL 2139
9 MQ 4648
10 OO 16061
11 UA 2072
12 US 4082
13 WN 45343
14 XE 73053
15 YV 79
Suppose we also want to sort the airlines by the number of flights. We can add arrange(fligh_count) after summarise. Alternative is to use count function with argument sort=TRUE or, another useful function can count and sort at the same time:
#Another easier way to write it
hflights %>%
group_by(UniqueCarrier) %>%
tally(sort=TRUE)
Source: local data frame [15 x 2]
UniqueCarrier n
(chr) (int)
1 XE 73053
2 CO 70032
3 WN 45343
4 OO 16061
5 MQ 4648
6 US 4082
7 AA 3244
8 DL 2641
9 EV 2204
10 FL 2139
11 UA 2072
12 F9 838
13 B6 695
14 AS 365
15 YV 79
n_distinct funtion provides the number of unique subjects in a column.
hflights %>%
group_by(Dest) %>%
summarise(flight_count=n(), plane_count=n_distinct(TailNum))
Source: local data frame [116 x 3]
Dest flight_count plane_count
(chr) (int) (int)
1 ABQ 2812 716
2 AEX 724 215
3 AGS 1 1
4 AMA 1297 158
5 ANC 125 38
6 ASE 125 60
7 ATL 7886 983
8 AUS 5022 1015
9 AVL 350 142
10 BFL 504 70
.. ... ... ...
To check the data types of each column, use glimpse.
#Like the str function in base R
glimpse(hflights)
Observations: 227,496
Variables: 22
$ Year (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
$ Month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ DayofMonth (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ DayOfWeek (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
$ DepTime (int) 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
$ ArrTime (int) 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
$ UniqueCarrier (chr) "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
$ FlightNum (int) 428, 428, 428, 428, 428, 428, 428, 428, 428,...
$ TailNum (chr) "N576AA", "N557AA", "N541AA", "N403AA", "N49...
$ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
$ AirTime (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
$ ArrDelay (int) -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
$ DepDelay (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
$ Origin (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
$ Dest (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
$ Distance (int) 224, 224, 224, 224, 224, 224, 224, 224, 224,...
$ TaxiIn (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
$ TaxiOut (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
$ Cancelled (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ CancellationCode (chr) "", "", "", "", "", "", "", "", "", "", "", ...
$ Diverted (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ date (chr) "2011-1-1", "2011-1-2", "2011-1-3", "2011-1-...
We can also do sampling easily either by number or fraction.
# Sample by number
hflights %>% sample_n(10)
# Or by fraction
hflights %>% sample_frac(0.25, replace=T)
# connect to an SQLite database
my_db <- src_sqlite("my_db.sqlite3")
# connect to the "hflights" table in that database
flights_tbl <- tbl(my_db, "hflights")
# identical query using the database
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
# ask dplyr for the SQL commands
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay)) %>%
explain()