dplyr do: Some Tips for Using and Programming

This post explores the basic concepts of do() and offers some advice on using it, both interactively and inside your own functions.

do() is a verb (function) of dplyr. dplyr is a powerful R package for data manipulation, written and maintained by Hadley Wickham. It lets you perform common data manipulation tasks on data frames, such as filtering rows, selecting specific columns, reordering rows, adding new columns, summarizing data and computing arbitrary operations.

First of all, you have to install the dplyr package:

install.packages("dplyr")

and load it:

require(dplyr)

We will analyze the use of do() with the following dataset, created with random data:

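set.seed(100)
# Definition reconstructed; the y parameters (mean = 2, sd = 2) are an assumption
ds <- data.frame(group = factor(rep(c("a", "b", "c"), each = 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2), y = rnorm(n = 300, mean = 2, sd = 2))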

We first convert it to a tbl_df object, which has a more compact print method. The data themselves are not changed.

ds <- tbl_df(ds)
ds

Source: local data frame [300 x 3]

    group        x           y
   (fctr)    (dbl)       (dbl)
1       a 1.995615 -1.71089045
2       a 3.263062 -0.03712943
3       a 2.842166 -0.09022217
4       a 4.773570  0.69742469
5       a 3.233943  2.76536531
6       a 3.637260  4.06379942
7       a 1.836419  2.26214995
8       a 4.429065  2.75438347
9       a 1.349481 -1.77539016
10      a 2.280276  3.04043881
..    ...      ...         ...

Base Concepts of do() (Non Standard Evaluation Version)

As mentioned above, do() performs arbitrary operations on a data frame, and the result may contain more than one value.

To use do(), you must know that:

it always returns a data frame

unlike the other data manipulation verbs of dplyr, do() needs the . placeholder inside the function call to refer to the data it has to work with:

# Head of ds
ds %>% do(head(.))

Source: local data frame [6 x 3]

   group        x           y
  (fctr)    (dbl)       (dbl)
1      a 1.995615 -1.71089045
2      a 3.263062 -0.03712943
3      a 2.842166 -0.09022217
4      a 4.773570  0.69742469
5      a 3.233943  2.76536531
6      a 3.637260  4.06379942

it is designed to be used with dplyr's group_by() to compute operations within groups:

# Head of ds by group
ds %>% group_by(group) %>% do(head(.))

Source: local data frame [18 x 3]
Groups: group [3]

    group          x           y
   (fctr)      (dbl)       (dbl)
1       a 1.99561530 -1.71089045
2       a 3.26306233 -0.03712943
3       a 2.84216582 -0.09022217
4       a 4.77356962  0.69742469
5       a 3.23394254  2.76536531
6       a 3.63726018  4.06379942
7       b 2.33415330 -0.56965729
8       b 5.72622741  1.71643653
9       b 2.06170532  4.87756954
10      b 4.68575126 -0.08011508
11      b 0.08401255 -0.04767590
12      b 2.19938816  4.18954758
13      c 3.05634353 -0.89257491
14      c 2.28659319  2.63171152
15      c 4.70525275  1.31450497
16      c 4.02673050 -1.86270620
17      c 5.03640599  2.48564201
18      c 0.95704183  1.27446410
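The per-group behaviour can be checked on a much smaller input. The following is only a sketch: the toy data frame and its values are hypothetical and are not the ds dataset used in this post. It shows that the expression inside do() is evaluated once per group, with . standing for the rows of that group.

```r
library(dplyr)

# Hypothetical toy data (not the ds dataset of this post)
toy <- data.frame(group = c("a", "a", "b", "b", "b"),
                  x = c(1, 2, 3, 4, 5))

# The expression is evaluated once per group; "." holds that group's rows
res <- toy %>%
  group_by(group) %>%
  do(data.frame(n = nrow(.), max_x = max(.$x)))

res
```

The result has one row per group: the number of rows in the group and the largest x found in it.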

the argument of do() can be named or unnamed:

named arguments (more than one can be supplied) become list-columns, with one element for each group:

# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(out=tail(.$x, 3))

Source: local data frame [3 x 2]
Groups: 

   group      out
  (fctr)    (chr)
1      a <dbl[3]>
2      b <dbl[3]>
3      c <dbl[3]>

an unnamed argument (only one can be supplied) must evaluate to a data frame; the group labels are repeated for each row it returns:

# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(data.frame(out=tail(.$x, 3)))

Source: local data frame [9 x 2]
Groups: group [3]

   group       out
  (fctr)     (dbl)
1      a 3.8270397
2      a 0.6426337
3      a 0.6519305
4      b 3.3238824
5      b 0.8290942
6      b 4.1538746
7      c 6.5861213
8      c 4.6280643
9      c 0.3599512
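To make the difference between the two forms concrete, here is a small sketch on hypothetical toy data (again, not the ds dataset). With a named argument the values stay packed in a list-column and have to be extracted with [[ ]]; with an unnamed argument they are stacked into ordinary rows.

```r
library(dplyr)

# Hypothetical toy data (not the ds dataset of this post)
toy <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                  x = c(1, 2, 3, 4, 5, 6))

# Named argument: "out" is a list-column with one element per group
res_named <- toy %>% group_by(group) %>% do(out = tail(.$x, 2))
res_named$out[[1]]   # the last two x values of group "a"

# Unnamed argument: the per-group data frames are bound into regular rows
res_unnamed <- toy %>% group_by(group) %>% do(data.frame(out = tail(.$x, 2)))
nrow(res_unnamed)    # two rows for each of the two groups
```

The named form is handy when the per-group result is not a data frame at all (a vector, a model object), while the unnamed form is the one to use when the results should end up in a regular table.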

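do() works in the same way with user-defined functions.

Let us define the following function, which performs two simple operations and returns a data frame:

# Body reconstructed; the two operations are assumptions
my_fun <- function(x, y) data.frame(res_x = mean(x) + 2, res_y = mean(y) * 5)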

If the argument is named the result is:

# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(out=my_fun(x=.$x, y=.$y))

Source: local data frame [3 x 2]
Groups: 

   group                out
  (fctr)              (chr)
1      a <data.frame [1,2]>
2      b <data.frame [1,2]>
3      c <data.frame [1,2]>

Otherwise, if the argument is unnamed, the result is:

# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(my_fun(x=.$x, y=.$y))

Source: local data frame [3 x 3]
Groups: group [3]

   group    res_x     res_y
  (fctr)    (dbl)     (dbl)
1      a 5.005825  9.167546
2      b 5.022282  8.683619
3      c 5.025586 11.240558

Programming with do_() (Standard Evaluation Version)

How can we wrap the previous operations in a function? Simple: use do_() (the SE version of do()) together with the interp() function of the lazyeval package.

lazyeval is an R package, written and maintained by Hadley Wickham, that offers a new approach to Non Standard Evaluation (NSE) in R. The difference between the SE and NSE approaches lies in the quoting of input variable names: NSE takes them unquoted, while SE takes them quoted. NSE is convenient for interactive use (see the previous section), but not for programming, for which the SE approach is recommended.

Install and load lazyeval, if you haven't already done so.

install.packages("lazyeval")

require(lazyeval)

interp() helps to build an expression from a mixture of constants and variables, so that it can be passed to the .dots argument of dplyr verbs. For more details, see the Non Standard Evaluation vignette.
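What interp() does can be seen in isolation, outside any dplyr verb. The sketch below is only illustrative: var_name, the placeholder v and the small data frame are hypothetical names introduced here. A column name held in a string is turned into a symbol and spliced into a formula, and the resulting expression is then evaluated against a data frame with base R's eval().

```r
library(lazyeval)

# Hypothetical column name held in a string, as it would be inside a function
var_name <- "x"

# interp() replaces the placeholder symbol v with the name x
expr <- interp(~ mean(v), v = as.name(var_name))

# The right-hand side of the one-sided formula is now the call mean(x)
deparse(expr[[2]])

# It can be evaluated against any data frame that has a column x
eval(expr[[2]], data.frame(x = c(1, 2, 3, 4)))
```

The same mechanism lets a function receive column names as strings and still build the call that a dplyr verb needs.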

In the following example, interp() builds the expression passed to the .dots argument of group_by_() (the SE version of group_by()), which consists of the grouping variable name. It also builds the expression passed to the .dots argument of do_(), which consists of the function name followed by its arguments in brackets.

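# Reconstructed definition; the two interp() calls are an assumption
fun <- function(data, x_var_name, y_var_name, group_var_name){
  group_dots <- interp(~ group_var_name, group_var_name = as.name(group_var_name))
  do_dots <- interp(~ my_fun(x = .[[x_var_name]], y = .[[y_var_name]]),
                    x_var_name = x_var_name, y_var_name = y_var_name)
  out <- data %>% 
    group_by_(.dots = group_dots) %>%
    do_(.dots = do_dots)
  return(out)
}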

Let us apply the previous function to the ds dataset:

fun(data=ds, x_var_name="x", y_var_name="y", group_var_name="group")

Source: local data frame [3 x 3]
Groups: group [3]

   group    res_x     res_y
  (fctr)    (dbl)     (dbl)
1      a 5.005825  9.167546
2      b 5.022282  8.683619
3      c 5.025586 11.240558

Other Examples

do() is often used to fit models and display their results.
Look at the following functions!

Let us define a function that fits a linear model and returns its coefficients as a data frame, and apply it to ds by group:

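# Function that fits linear model and returns coefficients as a data frame
# Body reconstructed and assumed: x and y are looked up in data, with x as the response
my_fun_2 <- function(x, y, data){
  x <- eval(substitute(x), data)
  y <- eval(substitute(y), data)
  mod <- lm(x ~ y)
  data.frame(intercept = coef(mod)[[1]], slope = coef(mod)[[2]])
}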

# Apply my_fun_2() function (unnamed elements and nse version) to ds by group
ds %>% group_by(group) %>% do(my_fun_2(x=x, y=y, data=.))

Source: local data frame [3 x 3]
Groups: group [3]

   group intercept       slope
  (fctr)     (dbl)       (dbl)
1      a  2.939123  0.03637955
2      b  3.149110 -0.07302733
3      c  3.249187 -0.09946141

Let us wrap the previous operations in a function and apply it to ds by group:

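# Enclose the previous operations inside a function
# Reconstructed definition; the two interp() calls are an assumption
fun_2 <- function(data, x_var_name, y_var_name, group_var_name){
  group_dots <- interp(~ group_var_name, group_var_name = as.name(group_var_name))
  do_dots <- interp(~ my_fun_2(x = x_var, y = y_var, data = .),
                    x_var = as.name(x_var_name), y_var = as.name(y_var_name))
  res <- data %>% 
    group_by_(.dots = group_dots)  %>% 
    do_(.dots = do_dots)
  return(res)
}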

# Apply fun_2() function (se version) to ds by group
fun_2(data=ds, x_var_name="x", y_var_name="y", group_var_name="group")

Source: local data frame [3 x 3]
Groups: group [3]

   group intercept       slope
  (fctr)     (dbl)       (dbl)
1      a  2.939123  0.03637955
2      b  3.149110 -0.07302733
3      c  3.249187 -0.09946141
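A common variation on the model-fitting examples above is to keep the fitted models themselves rather than only their coefficients, by giving do() a named argument. The sketch below uses hypothetical simulated data (toy, with a true slope of 2), not the ds dataset, and the column name mod is likewise just an illustrative choice.

```r
library(dplyr)

# Hypothetical simulated data: y depends linearly on x within each group
set.seed(1)
toy <- data.frame(group = rep(c("a", "b"), each = 20),
                  x = rnorm(40))
toy$y <- 1 + 2 * toy$x + rnorm(40, sd = 0.1)

# Named argument: each element of the list-column "mod" is a fitted lm object
models <- toy %>% group_by(group) %>% do(mod = lm(y ~ x, data = .))

# Any accessor can be applied to the stored models afterwards
coef(models$mod[[1]])
```

Storing the whole model keeps every option open (residuals, summaries, predictions), at the cost of a result that is less convenient to print than a plain table of coefficients.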
