## R as a functional programming language

`R`

can be considered as a functional programming language as it focusses on the creation and manipulation of functions and has what’s known as first class functions.

In computer science, functional programming is a programming paradigm, a style of building the structure and elements of computer programs, that treats computation as the evaluation of mathematical functions and avoids state and mutable data.

Functional programming emphasizes functions that produce results that depend only on their inputs and not on the program state

In functional code, the output value of a function depends only on the arguments that are input to the function, so calling a function `f()`

twice with the same value for an argument `x`

will produce the same result `f(x)`

both times. `R`

is clearly a functional programming language.

Understanding the functional nature of `R`

may help to improve clarity and avoid redundancy.

We will examine:

- First Class Functions
- Functions Closures
- Functions Factories
- Anonymous Functions
- Lists of Functions
- Functionals

## First class functions

First-class functions are a key component of functional programming style.

A programming language is said to have first-class functions when the language supports:

- passing functions as arguments to other functions
- creating anonymous functions
- returning functions as the values from other functions
- storing functions in data structures.

and `R`

has has first-class functions indeed.

In this example we pass function: `identity()`

as argument to function `lapply()`

1 |
lapply(0, identity) |

1 2 |
## [[1]] ## [1] 0 |

Here we make use of an anonymous function:

1 |
(function(x) sd(x)/mean(x))(x = 1:5) |

1 |
## [1] 0.527 |

We can easily define a function that return a function

1 2 3 |
f <- function(){ function(x) sd(x)/mean(x) } |

Finaly we store functions within a list:

1 |
function_list <- list(mean , sd) |

## Functions closures

A **function closure** or **closure** is a function together with a referencing environment.

Almost all functions in `R`

are closures as they remember the environment where they were created. Generally, but not always, the global environment:

1 2 |
f <- function(x) 0 environment(f) |

1 |
## <environment: R_GlobalEnv> |

or the package environment

1 |
environment(mean) |

1 |
## <environment: namespace:base> |

Functions that cannot be classified as *closures*, and therefore do not have a referencing environment, are know as *primitives*. These are internal `R`

function calling the underlying `C`

code. `sum()`

and `c()`

are good cases in point:

1 |
environment(sum) |

1 |
## NULL |

As functions remember the environments where they were created, the following example does not return any error:

1 2 3 |
y <- 1 f <- function(x){x+y} f(1) |

1 |
## [1] 2 |

This is possible as `f()`

is declared within the global environment and therefore `f()`

remembers all objects bounded to that environment (the referencing environment), `y`

included.

When we call a function, a new environment is created to hold the function’s execution and, normally, that environment is destroyed when the function exits. But, if we define a function `g()`

that returns a function `f()`

, the environment where `f()`

is created is the execution environment of `g()`

, that is, the execution environment of `g()`

is the referencing environment of `f()`

. As a consequence, the execution environment of `g()`

is not destroyed as `g()`

exits but it persists as long as `f()`

exists. Finally, as `f()`

remembers all objects bounded to its referencing environment, `f()`

remembers all objects bounded to the execution environment of `g()`

With this idea in mind, we can use the referencing environment of `f()`

, that is the execution environment of `g()`

, to hold any object and these objects will be available to `f()`

.

1 2 3 4 5 6 |
g <- function(){ y <- 1 function(x) {x+y} } f1 <- g() f1(3) |

1 |
## [1] 4 |

Moreover, as `f()`

is created within `g()`

any argument passed to `g()`

will be available to `f()`

in its later executions.

1 2 3 4 5 |
g <- function (y) { function(x) x+y } f1 <- g(1) f1(3) |

1 |
## [1] 4 |

As a proof of concept, we may temporaly modify function `g()`

in order to print the execution environment of `g()`

1 2 3 4 |
g_tmp <- function(y){ print(environment()) function(x) {x+y} } |

Than use `g()`

to produce `f()`

1 |
f1 <- g_tmp(1) |

1 |
## <environment: 0x1474198> |

and finaly ask `R`

for the environment associated with `f()`

1 |
environment(f1) |

1 |
## <environment: 0x1474198> |

As we can see, the execution environment of `g_tmp()`

corresponds to the environment associated to `f()`

.

Finally,

1 |
get("y" , env = environment(f1)) |

1 |
## [1] 1 |

shows where `y`

is stored.

Notice that each call to `g()`

returns a function with its own referencing environment:

1 2 |
f2 <- g(1) environment(f1) |

1 |
## <environment: 0x1474198> |

1 |
environment(f2) |

1 |
## <environment: 0xe42be8> |

The referencing environments for `f1()`

and `f2()`

are different, despite that `f1()`

and `f2()`

are both returned by `g()`

.

## Functions Factories

In practice, we can use closures to write specific functions that, in turn, can be used to generate new functions. This allows us to have a double layer of development: a first layer that is used to do all the complex work in common to all functions and a second layer that defines the details of each function.

### Example: A basic case in point

We can think of a simple function `add(x,i)`

that add the value `i`

to the value `x`

. We could define this function as:

1 2 3 |
add <- function(x, i){ x+i } |

Alternatively, we may consider a set of functions, say `f1(x)`

, `f2(x)`

, `...`

, `fi(x)`

, `...`

, `fn{x}`

that add `1,2,...,i,...,n`

to the `x`

argument. Clearly, we do not want to define all these functions but we want to define a unique function `f(i)`

:

1 2 3 |
f <- function(i){ function(x) {x+i} } |

capable of generating all `fi(x)`

functions:

1 2 |
f1 <- f(1) f1(3) |

1 |
## [1] 4 |

1 2 |
f2 <- f(2) f2(4) |

1 |
## [1] 6 |

In this simple example, this approach shows no benefit and possibly increases the complexity of our codes but, for more structured cases, it is definitely worth.

### Example: MLE functions

As a more structured example, we may consider the development of a set of functions: `lnorm(x)`

, `lweibull(x)`

, `...`

that compute max likelihood estimates for those distributions given a vector of data `x`

:

1 2 3 4 5 6 7 8 9 10 11 |
new_estimate <- function(dist){ estimate <- function(x, theta){ neglik <- function(theta = theta , x = x, log = T){ args <- c(list(x), as.list(theta), as.list(log)) neglik <- -sum(do.call(dist, args)) neglik } optim(par = theta, fn = neglik , x = x)$par } estimate } |

`new_estimate`

returns a second function: `estimate()`

whose body depends on the argument `dist`

passed to `new_estimate()`

.

Within `estimate()`

we first define a third function `neglik()`

and secondly, we minimize it within `optim()`

.

The returned function: `estimate()`

can be used as a generator of maximum likelihood estimation functions for any distribution as long as its corresponding `ddist()`

exists in `R`

.

Once we have `new_estimate()`

, we can use it to define any MLE estimation function as long as its density function is defined. That is, we can now write a `llnorm()`

that computes log-normal maximum likelihood estimates as simply as:

1 2 3 |
llnorm <- new_estimate("dlnorm") x <- rlnorm(100, 7 , 1) llnorm(x, theta = c(mean(log(x)), sd(log(x)))) |

1 |
## [1] 6.9848 0.9436 |

and similarly:

1 2 3 |
lweibull <- new_estimate("dweibull") w <- rweibull(100, 2 , 1) lweibull(w, theta = c(mean(w), sd(w))) |

1 |
## [1] 1.793 1.021 |

### Example: moving statistics

As a further example of functions factories we may consider a function `moving(f)`

that returns `moving_f()`

where `f()`

could be `mean()`

, `median()`

or any other statistical function as long it returns a single value.

As a first step we may consider a simple function `g()`

that returns the value of `f()`

for any backward window of length `n`

starting at `i`

:

1 2 |
g <- function(i , x, n , f, ...) f(x[(i-n+1):i], ...) g(i = 5 , x = 1:10,n = 3 , f= mean) |

1 |
## [1] 4 |

1 |
g(i = 5 , x = 1:10,n = 3 , f= sd) |

1 |
## [1] 1 |

Note that `g()`

takes, among its inputs, a second function `f()`

and apply it to the window `[(i-n+1):i]`

of `x`

.

As a second step we define a function `moving(f)`

that takes any function `f()`

as an input and define function `g()`

within its body.

1 2 3 4 5 6 7 8 |
moving <- function(f){ g <- function(i , x, n , f, ...) f(x[(i-n+1):i], ...) h <- function(x, n, ...) { N <- length(x) vapply(n:N, g, x , n , f, FUN.VALUE = numeric(1), ...) } return(h) } |

Function `moving()`

returns function `h()`

that, in turn can be used to generate any `moving_f()`

functions:

Function `vapply()`

within `h()`

is a *functional* used as a for loop replacement that will be fully explored when discussing *functionals*.

1 2 |
moving_average <- moving(mean) moving_average(x = rpois(10, 10), n = 3) |

1 |
## [1] 9.000 8.667 9.000 10.333 9.667 10.000 8.667 10.667 |

Eventually, argument ‘’`...`

’’ can be used to pass extra arguments to `f()`

.

1 |
moving_average(x = rpois(10, 10), n = 3, trim = .5) |

1 |
## [1] 13 10 11 11 11 11 11 11 |

If necessary, function `moving()`

ca be used in the form of anonymous function:

1 |
moving(sd)(rpois(10, 10), n = 5) |

1 |
## [1] 2.490 3.317 3.782 2.302 2.280 2.000 |

Finally:

1 2 3 4 5 |
x <- 1:100 y <- seq(along = x, from = 0 , to = 1)+rnorm(length(x), 0 , .1) plot(x, y) lines(x[10:length(x)], moving(mean)(y, 10), col = "red", lwd = 2) lines(x[10:length(x)], moving(median)(y, 10), col = "green", lwd = 2) |

### Example: Truncated density function

Density function in R are usually specified by the prefixes `d`

followed by a standard suffix for each ditribution. `dnorm()`

, `dlnorm()`

, `dweibull()`

, etc …

Therefore, we use to write:

1 2 |
x <- c(7,10,13) dlnorm(x , meanlog = 2, sdlog = 1) |

1 |
## [1] 0.05691 0.03811 0.02616 |

in order to get density values at `x`

from a lognormal distribution with parameters `2`

and `1`

.

In case we need value from a truncated distribution, as far as we know, we need to load an extra package such as `truncdist`

. The package itself works perfectly. In fact, assuming that a `dlnorm()`

function exists, we can get density values from a left truncated lognormal distribution with parameters `meanlog = 2`

and `sdlog = 1`

by simply writing:

1 |
require(truncdist) |

1 2 3 |
## Loading required package: truncdist ## Loading required package: stats4 ## Loading required package: evd |

1 |
dtrunc(x, spec = "lnorm", a = 5) |

1 |
## [1] 0.15963 0.05238 0.02128 |

where `a = 5`

represents the left threshold for truncation

Nevertheless, the above command require a change in our programming style.

In principle, we would like to be able to write:

1 |
tdlnorm(x, meanlog = 2, sdlog = 1, L = 5) |

where `L = 5`

represents the left threshold for truncation

so that we could have the same programming style, just with different parameters, for both truncated and not truncated distribution.

Within this frame, when `tdlnorm()`

is called with `L`

and `U`

set to their default values it behaves as `stats::dlnorm()`

1 |
tdlnorm(x, meanlog = 2, sdlog = 1) |

but when called with different settings for `L`

and `U`

; such as:

`tdlnorm(x, meanlog = 2, sdlog = 1, L = 5, U = 20)`

it behaves as a lognormal density left truncated at `L=5`

and right truncated at `U=20`

.

This goal could be achieved by writing a `tdlnorm()`

as:

1 2 3 4 5 6 7 8 9 10 11 12 |
tdlnorm <- function (x, meanlog = 0, sdlog = 1, L = 0, H = Inf) { density <- stats::dlnorm(x, meanlog=meanlog, sdlog=sdlog)/ ( stats::plnorm(H, meanlog=meanlog, sdlog=sdlog)- stats::plnorm(L, meanlog=meanlog, sdlog=sdlog) ) return(density) } |

That returns the same results as function `truncdist::dtrunc()`

1 |
tdlnorm(x, 1, 2, L= 5, H = 20) |

1 |
## [1] 0.11524 0.07297 0.05109 |

1 |
dtrunc(x, spec = "lnorm", a = 5, b = 20, meanlog = 1, sdlog = 2) |

1 |
## [1] 0.11524 0.07297 0.05109 |

As this function clearly works, next step could be to write something similar for other distributions such as `weibull`

, `gumbel`

or `gamma`

. We have to admit that all of this may become as quite time consuming.

A different approach could be to define a different function, `dtruncate()`

, taking the name of a density distribution as an argument and returning a second function that computes density values for the truncated distribution:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
dtruncate <- function (dist, pkg = stats){ dist <- deparse(substitute(dist)) envir <- as.environment(paste("package", as.character(substitute(pkg)), sep = ":")) ddist=paste("d", dist, sep = "") pdist=paste("p", dist, sep = "") #gets density function ddist <- get(ddist, mode = "function", envir = envir) #gets argument of density function dargs <- formals(ddist) #gets probability function pdist <- get(pdist, mode = "function", envir = envir) #gets argument of probability function pargs <- formals(pdist) #Output function starts here density <- function () { #check L U if (L > U) stop("U must be greater than or equal to L") #gets density arguments call <- as.list(match.call())[-1] #all unique arguments belonging to density and ddist dargs <- c(dargs[!is.element(names(dargs), names(call))], call[is.element(names(call), names(dargs))]) #all unique arguments belonging to probability and pdist pargs <- c(pargs[!is.element(names(pargs), names(call))], call[is.element(names(call), names(pargs))]) #select x only where defined by L and U dargs$x <- x[x > L & x <= U] #define arguments for pdist in L and U pUargs <- pLargs <- pargs pUargs$q <- U pLargs$q <- L #initialize output density <- numeric(length(x)) #standard method for computing density values for truncated distributions density[x > L & x <= U] <- do.call("ddist", as.list(dargs)) / (do.call("pdist", as.list(pUargs)) - do.call("pdist", as.list(pLargs))) #returns density values for truncated distributions return(density) } #add to density function formals L and U with values as passed with dtruncate formals(density) <- c(formals(ddist), eval(substitute(alist(L=-Inf, U=Inf)))) #return density function return(density) } |

with:

- envir: the environment
`plnorm()`

and`dlnorm()`

belong to

We can now define a new `tdlnorm()`

as:

1 |
tdlnorm <- dtruncate(dist = lnorm) |

and use it as:

1 2 3 4 5 6 7 8 9 |
p <- ppoints(1000) x <- qlnorm(p, meanlog = 1, sdlog = 1) d <- tdlnorm(x, meanlog = 1, sdlog = 1) dt <- tdlnorm(x, meanlog = 1, sdlog = 1, L= 5, U = 10) plot(x, dt, type = "n", xlab = "Quantile", ylab = "Density") points(x, dt, type = "s", col = "red", lwd = 2) points(x, d, type = "s", col = "darkblue", lwd = 2) title("Truncated and not-truncated log-normal") grid() |

clearly, `tdlnorm()`

returns the same results as `truncdist::dtrunc()`

:

1 |
dtrunc(x = 5:8, spec = "lnorm", a = 5, b = 10, meanlog = 1, sdlog = 1) |

1 |
## [1] 0.3792 0.2781 0.2085 0.1594 |

1 |
tdlnorm(x = 5:8, meanlog = 1, sdlog = 1, L = 5, U = 10) |

1 |
## [1] 0.0000 0.2781 0.2085 0.1594 |

Moreover, our newly created `tdlnorm()`

function takes as argument `meanlog`

and `sdlog`

, as well as `lower.tail = TRUE`

, `log.p = FALSE`

, `as stats::plnorm()`

does, despite these arguments were not mentioned when calling `dtruncate()`

.

Now that we have `dtruncate()`

, the same exercise can be replicate, at no extra programming effort, to any density function:

1 2 |
dweibull <- dtruncate(dist = weibull) dgpd <- dtruncate(gpd, pkg = evd) |

## Functions with memory

When talking about clousures, we used the referencing environment of `f()`

to hold any value passed by `g()`

. Similarly, we can use the same environment to keep a state across multiple executions of `f()`

.

### Example: Track how many times a function is called

We may consider a function that simply returns the current date but tracks how many times it has ben called:

1 2 3 4 5 6 7 8 9 10 11 |
g <- function(){ i <- 0 f <- function(){ i <<- i+1 cat("this function has been called ", i, " times", "\n") date() }} f <- g() #first call f() |

1 |
## this function has been called 1 times |

1 |
## [1] "Mon Jul 7 19:45:46 2014" |

1 2 |
#second call f() |

1 |
## this function has been called 2 times |

1 |
## [1] "Mon Jul 7 19:45:46 2014" |

1 2 |
#third call f() |

1 |
## this function has been called 3 times |

1 |
## [1] "Mon Jul 7 19:45:46 2014" |

Note that, we used the `<<-`

operator that assigns in the parent environment. This is equivalent to:

1 |
assign("i", i+1, envir = parent.env(environment())): |

### Example: Avoid re-calculate previous results

We can use the referencing environment of a function to keep previous returned values of the same function. By using this idea, we could try to avoid re-calculating previously computed values.

Supose we want a function that takes `n`

as argument and returns all primes less or equal to `n`

. This function already exists within library `pracma`

:

1 2 |
library(pracma) primes(n = 9) |

1 |
## [1] 2 3 5 7 |

In order to keep previous results we can define a function `makefprime()`

that, when called, returns a second function with an environment `.env`

attached:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
makefprime = function () { .env = new.env() f = function(n) { symbol = paste("p", n, sep = ".") if (exists(symbol, envir = .env)){ prime = get(symbol, envir = .env) } else {prime = primes(n = n) assign(symbol , prime, envir = .env) } prime } f } |

We can now create a function named for instance `fprimes()`

by calling function `makefprime()`

which returns identical results when compared with `primes()`

.

1 2 |
fprimes = makefprime() fprimes(10) |

1 |
## [1] 2 3 5 7 |

Now suppose we need to compute prime numbers several time within a working session or a for loop. When `n`

is large, this computation may require a substancial ammount of time.

1 |
system.time({p1 = fprimes(n = 10^7)}) |

1 2 |
## user system elapsed ## 1.023 0.138 1.165 |

Nevertheless, because of the way we defined `fprimes()`

, second time this function is called with `n = 10^7`

computing time is practicaly zero as the function reuse previously computed results as stored in environment `.env`

.

1 |
system.time({p2 = fprimes(n = 10^7)}) |

1 2 |
## user system elapsed ## 0 0 0 |

### Example: Add to an existing plot

As a last example, we may want to have a function that add to an existing plot any time a new observation becomes available, using the same mechanism, we can define a `new_plot()`

function that instances a new plot the first time it is called:

1 2 3 4 5 6 7 8 9 10 |
new_plot = function(){ xx = NULL yy = NULL function(x, y, ...) { xx <<- c(xx, x) yy <<- c(yy, y) plot(xx, yy, ...) }} this_plot <- new_plot() |

and add to the same plot at each next call:

1 |
this_plot (1:4, c(2, 3, 1, 5), type = "b") |

1 |
this_plot(5, 3, type = "b") |

1 |
this_plot(6, 3, type = "b", col = "red") |

## Functionals

*Functionals* are functions that take a function as input and return a data object as output.

`R`

incorporates many examples of functionals. Among many, `Reduce()`

and `Filter()`

are two good cases in point.

`Reduce(f, x)`

tries to fold the element of `x`

according to function `f()`

. As a result we may use this function for binding the elements of a list into a matrix:

1 2 |
l <- list(x = 1:4, y = 4:1) Reduce(rbind, l) |

1 2 3 |
## [,1] [,2] [,3] [,4] ## init 1 2 3 4 ## 4 3 2 1 |

`Filter(f, x)`

applies function `f()`

to each element of `x`

, and returns the subset of x for which this gives `TRUE`

. In order to subset even number from any vector `x`

we could write

1 |
Filter(f = function(x) x %% 2 == 0 , x = 1:5) |

1 |
## [1] 2 4 |

As a very interesting example of functional we may define:

1 |
fun <- function(f, ...) f(...) |

That is a function `fun()`

that takes any function `f()`

as input along with any other argument and compute `f(...)`

where `...`

represents the set of arguments.

As a result we may write:

1 |
fun(mean, x = 1:10, trim = .1) |

1 |
## [1] 5.5 |

that is equivalent to:

1 |
mean(x = 1:10, trim = .1) |

1 |
## [1] 5.5 |

Functionals are very often excellent substitutes to for loops as they allow to communicate the objective of our code in a more clear and concise manner as the code will be cleaner and it will more closely adhere to `R`

’s idioms.

Functionals may even perform a little better than the equivalent for loop nevertheless but, in a first instance, our focus must always be on clarity rather than performances.

`lapply()`

is, possibly, the most used functional:

1 |
lapply(list(one = 1, a = "a"), FUN = is.numeric) |

1 2 3 4 5 |
## $one ## [1] TRUE ## ## $a ## [1] FALSE |

`lapply()`

can be consider as the main functional. `sapply()`

and `vapply()`

perform as `lapply()`

but return a simplified output. `mapply()`

and `Map()`

are extension of `lapply()`

that allow for multiple inputs.

`lapply`

`lapply()`

takes a function and applies it to each element of a list, saving the results back into a result list. `lapply()`

is the building block for many other functionals. In principle, `lapply()`

is a wrapper around a standard for loop. The wrapper is written in `C`

to increase performance.

`lapply()`

takes three arguments:

- a list
`X`

, or anything that can be coerced to a list by`as.list()`

- a function
`FUN`

that takes, as first argument, each element of`X`

- the
`''...''`

argument that can be any argument to be passed to`FUN`

Suppose we want to gain the maximum of each column for the `airquality`

data frame. By using a for loop we could write:.

1 2 3 4 5 6 |
n <- ncol(airquality) out <- numeric(n) for (i in 1:n){ out[i] <- max(airquality[,i], na.rm = TRUE) } out |

1 |
## [1] 168.0 334.0 20.7 97.0 9.0 31.0 |

alternatively, as a data frame is a list:

1 |
lapply(X=airquality, FUN = max, na.rm = TRUE) |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
## $Ozone ## [1] 168 ## ## $Solar.R ## [1] 334 ## ## $Wind ## [1] 20.7 ## ## $Temp ## [1] 97 ## ## $Month ## [1] 9 ## ## $Day ## [1] 31 |

The second chunk of code is by far more clear and concise than the first one even though a vector would be preferable than a list as output.

Moreover, `lapply()`

as opposite to `for`

loops does not produce any intermediate result when running. In the above `for`

loop, the value of the result of the loop, vector `out`

, changes at each iteration. The result of `lapply()`

, instead, can be assigned to a variable but does not produce any intermediate result.

By default `lapply()`

takes each element of list `X`

as the first argument of function `FUN`

. This works perfectly, as long as each element of `X`

is the first of `FUN`

. This is true in case we want to compute the mean of each column of a data frame as each column is passed as first argument to function `mean()`

.

But, suppose we want to compute various trimmed means of the same vector, `trim`

is the second parameter of `mean()`

, so we want to vary `trim`

, keeping the first argument `x`

fixed.

This can be easily achieved by observing that the following two calls are equivalent:

1 |
mean(1:100, trim = 0.1) |

1 |
## [1] 50.5 |

1 |
mean(0.1, x = 1:100) |

1 |
## [1] 50.5 |

This is possible because `R`

first matches formals by name and afterword by position.

As a result, in order to use `lapply()`

with the second argument of function `FUN`

, we just need to name the first argument of `FUN`

and pass it to `lapply()`

as part as the `''...''`

argument:

1 2 |
x <- rnorm(100) lapply(X = c(0.1, 0.2, 0.5), mean, x = x) |

1 2 3 4 5 6 7 8 |
## [[1]] ## [1] -0.08333 ## ## [[2]] ## [1] -0.09282 ## ## [[3]] ## [1] -0.1762 |

`sapply`

and `vapply`

,

`sapply()`

and `vapply()`

, variants of `lapply()`

that produce vectors, matrices and arrays as output, instead of lists.

`lapply()`

returns a list as output, in order to get a vector, the previous examples could be written by using `sapply()`

, a simple variant of `lapply()`

, that tries to return an atomic vector instead of a list.

1 |
sapply(X=airquality, FUN = max, na.rm = TRUE) |

1 2 |
## Ozone Solar.R Wind Temp Month Day ## 168.0 334.0 20.7 97.0 9.0 31.0 |

Note that `sapply()`

is a simple wrapper around `lapply()`

that uses `simplify2array()`

.

Unfortunately, `simplify2array()`

and therefore `sapply()`

offer very little control on the type of output that is returned:

1 |
sapply(list(), is.numeric) |

1 |
## list() |

In this case we would expect `sapply()`

to return a empty logical vector, not an empty list.

As a result, `sapply()`

may represent a excellent shortcut when working with R in interactive mode but not a good function to be used when developing serious R code.

A better alternative to `sapply()`

is provided by `vapply()`

a second variant of `lapply()`

that allows to specify, by mean of argument `FUN.VALUE`

, the kind of output we want the functional to return. In fact,

1 |
vapply(list(), is.numeric, FUN.VALUE = logical(1)) |

1 |
## logical(0) |

1 |
vapply(X=airquality, FUN = max, na.rm = TRUE, FUN.VALUE = numeric(1)) |

1 2 |
## Ozone Solar.R Wind Temp Month Day ## 168.0 334.0 20.7 97.0 9.0 31.0 |

In this case the `FUN.VALUE`

argument specify what kind of output each element of the result should be. `vapply()`

improves consistency by providing either the return type we were expecting or error. This is a clear advantage, as it helps catch errors before they happen and leads to more robust code.

As an example, suppose we have a list of data frames:

1 |
df_list <- list(cars, airquality, trees) |

and that we need to know the number of columns of each data frame returned in a vector. We can easily achieve this goal by:

1 |
sapply(df_list, ncol) |

1 |
## [1] 2 6 3 |

Nevertheless, suppose we want to apply the same function to a very large list of data frame and, by chance, one of them happens to be `NULL`

.

1 |
df_list <- list(df1 = cars, df2 = NULL, df3 = trees) |

When using `sapply()`

a `list`

, instead of a `vector`

is returned

1 |
sapply(df_list, ncol) |

1 2 3 4 5 6 7 8 |
## $df1 ## [1] 2 ## ## $df2 ## NULL ## ## $df3 ## [1] 3 |

If we use `vapply()`

instead:

1 |
vapply(df_list, ncol, FUN.VALUE = numeric(1)) |

1 2 |
## Error: values must be length 1, ## but FUN(X[[2]]) result is length 0 |

`R`

returns an error.

Clearly this second behavior is more coherent and, within the frame of a large project, possibly helps to avoid annoying hours of debugging

Finally, `vapply()`

is faster that `sapply()`

as `R`

does not have to *guess* the kind of output `sapply()`

needs to return.

Suppose we have a list made of `10^6`

vectors of variable length between one and five.

1 2 |
n <- sample(1:5, 10^6 , rep = T) vector_list <- lapply(n, sample , x = 0:9) |

We can appreciate the difference in speed between `sapply()`

and `vapply()`

by the following example:

1 2 3 |
system.time( sapply(vector_list, length ) ) |

1 2 |
## user system elapsed ## 3.493 0.033 3.529 |

1 2 3 |
system.time( vapply(vector_list, length, FUN.VALUE = numeric(1)) ) |

1 2 |
## user system elapsed ## 0.366 0.014 0.380 |

`lapply`

patterns

When using `lapply()`

we can loop at least in two different ways: on the `xs`

or on an index `i`

. As a example, we may consider the following code:

1 2 |
x = 1:3 vapply(x, function(x) x*x, numeric(1)) |

1 |
## [1] 1 4 9 |

and compare it with the next chunk:

1 |
vapply(1:length(x), function(i , x) x[i]*x[i], x=x , FUN.VALUE=numeric(1)) |

1 |
## [1] 1 4 9 |

The second chunk of code looks a little more complicated as it introduces the `i`

index that in this case is simply redundant.

Suppose, instead, we want to compute the mean the three variables of the `trees`

dataset but using different trims values for all columns and setting `na.rm = TRUE`

.

By using a for loop, beside clarity and the number of lines required, the code is quite simple:

1 2 3 4 5 6 7 |
n = ncol(trees) out = numeric(n) trim = c(0.1, 0.2, 0.3) for ( i in 1:n){ out[i] = mean(trees[,i], trim[i], na.rm = TRUE) } out |

1 |
## [1] 13.14 76.58 25.43 |

When translating this code with `lapply()`

, we may use the index strategy result as mandatory because, `lapply(X, FUN, ...)`

allows only argument `X`

of function `FUN`

to vary. All other arguments to function `FUN`

can be passed via ‘’`...`

’’ but without varying.

1 2 |
lapply(1:ncol(trees), function(i, x, trim, ...) mean(x[, i], trim[i], na.rm = TRUE), x = trees, trim = c(0.1, 0.2, 0.3)) |

1 2 3 4 5 6 7 8 |
## [[1]] ## [1] 13.14 ## ## [[2]] ## [1] 76.58 ## ## [[3]] ## [1] 25.43 |

The same approach we used to solve `for()`

loops into `lapply()`

, can be generalized to nested loop.

As an example consider this apparently messy loop:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
ni = 4 nj = 2 nk = ni*nj k = numeric(nk) for (j in 1:nj){ if ( j %% 2 == 0){ for ( i in 1:ni){ if ( i %% 2 == 0 ) next_k = i+j else next_k = i-j cat("first ij " , i , j , next_k, "\n") k[i+(j-1)*ni] = next_k } } else { for (i in 1:ni){ next_k <- 99 cat("second ij " , i , j , next_k, "\n") k[i+(j-1)*ni] = next_k } } } |

1 2 3 4 5 6 7 8 |
## second ij 1 1 99 ## second ij 2 1 99 ## second ij 3 1 99 ## second ij 4 1 99 ## first ij 1 2 -1 ## first ij 2 2 4 ## first ij 3 2 1 ## first ij 4 2 6 |

We have at least two alternatives to transform this loop.

We can use `lapply()`

with the index strategy. We have to rewrite the innermost part of the whole loop as:

1 2 3 4 5 6 7 8 9 |
f = function(k , i , j) { i <- i[k] j <- j[k] result <- 99 if( j %% 2 == 0){ result <- ifelse(i %% 2 == 0 , i+j , i-j) } result } |

secondly, with a little help from `expand.grid`

:

1 2 |
grid <- expand.grid(i = 1:ni, j = 1:nj ) with(grid , vapply(1:nrow(grid), f, i=i , j=j, FUN.VALUE = numeric(1))) |

1 |
## [1] 99 99 99 99 -1 4 1 6 |

The use of `expand.grid()`

allows to transform a nested loop into a matrix of all possible combinations over which we can easily loop by using `lapply()`

. This approach may help a lot in simplifying our code but, is quite memory hungry as it requires to explode all possible combinations in a single matrix.

Alternatively, we can make use of a nested `lapply()`

. We first define function `f()`

as:

1 2 3 4 5 |
f = function(i , j) { ifelse( j %% 2 == 0, ifelse(i %% 2 == 0 , i+j , i-j), 99) } |

Afterword, we nest two functionals as following:

1 |
unlist(lapply(1:2, function(j, i = ni) vapply(1:4, FUN = f, j, FUN.VALUE = numeric(1)))) |

1 |
## [1] 99 99 99 99 -1 4 1 6 |

This case clearly illustrate how much we gain in clarity and efficiency when using functionals instead of loops.

As an alternative to `lapply()`

with the index strategy we may consider `mapply()`

and `Map()`

that naturally iterate over multiple input data structures in parallel.

`mapply`

and `Map`

A first alternative to `lapply()`

, along with the index strategy, is represented by `mapply()`

:

1 2 3 |
mapply(mean ,trees, trim = c(0.1 , 0.2, 0.3), MoreArgs = list(na.rm = TRUE), SIMPLIFY = FALSE) |

1 2 3 4 5 6 7 8 |
## $Girth ## [1] 13.14 ## ## $Height ## [1] 76.58 ## ## $Volume ## [1] 25.43 |

The structure of `mapply()`

is:

1 |
mapply(FUN, ..., MoreArgs=NULL, SIMPLIFY = TRUE) |

where, as opposite to `lapply()`

, the `FUN`

argument takes the first position and the ‘’`...`

’’ argument specifies any list of arguments to be passed to `FUN`

during the iteration.

The `MoreArgs`

argument takes a list of parameters to be kept as fixed during the iteration. Note that this breaks R’s usual lazy evaluation semantics, and is inconsistent with other functions.

The `SIMPLIFY`

as set to `TRUE`

by default, allows output simplification in the `sapply()`

fashion. Clearly, this options gives as little control over the output as `sapply()`

does.

An alternative to `mapply()`

is represented by `Map()`

that returns identical results to `mapply()`

with `SIMPLIFY`

set to `FALSE`

and uses anonymous or external function to pass fixed parameters to `FUN`

.

1 2 |
Map(function(...) mean(..., na.rm = TRUE), x = trees , trim = c(0.1 , 0.2, 0.3)) |

1 2 3 4 5 6 7 8 |
## $Girth ## [1] 13.14 ## ## $Height ## [1] 76.58 ## ## $Volume ## [1] 25.43 |

The choice between using `mapply()`

or `Map()`

is surely a personal one.

Both `Map()`

and `mapply()`

can be used to substitute nested loops.

We can consider the previous nested loop and, with a little help from `expand.grid()`

and function `f()`

1 2 3 4 5 6 7 8 9 |
f = function(i , j) { result = 99 if( j %% 2 == 0){ result <- ifelse(i %% 2 == 0 , i+j , i-j) } result } grid <- expand.grid(i = 1:4, j = 1:2) |

And, similarly to the `lappy()`

case, we could write:

1 |
with(grid , mapply(f, i=i , j=j, SIMPLIFY = TRUE)) |

1 |
## [1] 99 99 99 99 -1 4 1 6 |

or:

1 |
unlist(with(grid , Map(f, i=i , j=j))) |

1 |
## [1] 99 99 99 99 -1 4 1 6 |

### eapply

Sometimes we may find our self using environments as data structure because of many reasons including: hash tables and copy on modify semantic that does not apply to environments.

When objects are stored within an environment, we can use `eapply()`

as a functional to the environment:

1 2 3 |
env <- new.env() env$x = 3 ; env$y = -2 eapply(env, function(x) ifelse(x>0 , 1 , -1)) |

1 2 3 4 5 |
## $x ## [1] 1 ## ## $y ## [1] -1 |

## Anonymous functions

In R, we usually assign functions to variable names. Nevertheless, functions can exists without been assigned to symbol. Functions that don’t have a name are called **anonymous functions**.

We can call anonymous functions directly, as we do with named functions, but the code is a little unusual as we have to use brackets both to include the whole function definition and to pass arguments to the function:

1 |
(function(x) x + 3)(10) |

1 |
## [1] 13 |

Note that this is exactly the same as:

1 2 |
f <- function(x) x + 3 f(10) |

1 |
## [1] 13 |

We use anonymous functions when it’s not worth the effort of assigning functions to a name. We could plot a function `s(x)`

by:

1 2 |
s <- function(x) sin(x)/sqrt(x) integrate(s, 0, 4) |

1 |
## 1.61 with absolute error < 4.5e-11 |

or alternatively by:

1 |
integrate(function(x) sin(x)/sqrt(x), 0, 4) |

1 |
## 1.61 with absolute error < 4.5e-11 |

in this case `function(x) sin(x)/sqrt(x)`

is an example of anonymous function.

Finally, anonymous functions are, by all rights, normal R functions as they have `formals()`

, a `body()`

, and a parent `environment()`

:

1 |
formals(function(x) x+1) |

1 |
## $x |

1 |
body(function(x) x+1) |

1 |
## x + 1 |

1 |
environment(function(x) x+1) |

1 |
## <environment: R_GlobalEnv> |

## Lists of functions

Functions, as any type of `R`

object, can be stored in a list.

1 |
fun_list <- list(m = mean , s = sd) |

This makes it easier to work with groups of related functions.

Functions defined within a list are still accessible at least in three different ways:

using function `with()`

1 |
with (fun_list, m(x = 1:10)) |

1 |
## [1] 5.5 |

by using the `$`

operator

1 |
fun_list$m( x = 1:10) |

1 |
## [1] 5.5 |

by attaching the list:

1 2 |
attach(fun_list) m( x = 1:10) |

1 |
## [1] 5.5 |

1 |
detach(fun_list) |

Lists of functions can be most useful when we want to apply all functions of the list to the same set of data.

We can achieve this goal in two logical steps.

We first define a function

1 |
fun <- function(f, ...){f(...)} |

that takes a function `f()`

as argument along with any other arguments ‘’`...`

’’ and returns `f(...)`

. In practice:

1 |
fun(mean, x = 1:10, na.rm = TRUE) |

1 |
## [1] 5.5 |

Secondly, we apply function `fun()`

to the list of functions. Arguments required by the functions stored in the list are passed by the ‘’`...`

’’ argument:

1 |
lapply(fun_list, fun, x = 1:10) |

1 2 3 4 5 |
## $m ## [1] 5.5 ## ## $s ## [1] 3.028 |

Under almost all circumstances, equivalent results can be achieved by using function `do.call()`

within a call to `lapply()`

:

1 |
lapply(fun_list, do.call, list(x = 1:10, na.rm = T)) |

1 2 3 4 5 |
## $m ## [1] 5.5 ## ## $s ## [1] 3.028 |

the only difference being that arguments to functions within the list must be enclosed in a list too.

### Example: Multiple *Anderson-Darling* tests

As a simple example we may want to compare the results of four **Anderson-Darling** type tests from the `truncgof`

package applied to the same data.

We can define a list that holds these four functions and store it in the global environment:

1 |
require(truncgof, quietly = TRUE) |

1 2 3 4 5 6 |
## ## Attaching package: 'truncgof' ## ## The following object is masked from 'package:stats': ## ## ks.test |

1 2 |
nor_test <- list(ad2.test = ad2.test, ad2up.test = ad2up.test, ad.test = ad.test, adup.test = adup.test) |

and, afterword, apply function `fun()`

to each element of this list:

1 2 3 4 |
x <- rnorm(100, 10, 1) m <- mean(x) s <- sd(x) lapply(nor_test, fun, x , distn = "pnorm", list(mean = m, sd = s), sim = 100) |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
## $ad2.test ## ## Quadratic Class Anderson-Darling Test ## ## data: x ## AD2 = 0.3352, p-value = 0.55 ## ## treshold = -Inf, simulations: 100 ## ## ## $ad2up.test ## ## Quadratic Class Anderson-Darling Upper Tail Test ## ## data: x ## AD2up = 1.527, p-value = 0.93 ## ## treshold = -Inf, simulations: 100 ## ## ## $ad.test ## ## Supremum Class Anderson-Darling Test ## ## data: x ## AD = 2.807, p-value = 0.24 ## alternative hypothesis: two.sided ## ## treshold = -Inf, simulations: 100 ## ## ## $adup.test ## ## Anderson-Darling Upper Tail Test ## ## data: x ## ADup = 10.71, p-value = 0.54 ## alternative hypothesis: two.sided ## ## treshold = -Inf, simulations: 100 |

### Example: Summary statistics

We may want to define a function that returns some specific statistics for a given set of variables in the form of a `data.frame`

.

1 2 3 4 5 6 7 |
this_summary <- as.data.frame(rbind(vapply(trees, mean, FUN.VALUE = numeric(1)), vapply(trees, sd, FUN.VALUE = numeric(1)), vapply(trees, function(x, ...) { diff(range(x)) }, FUN.VALUE = numeric(1)))) row.names(this_summary) <- c("mean", "sd", "range") this_summary |

1 2 3 4 |
## Girth Height Volume ## mean 13.248 76.000 30.17 ## sd 3.138 6.372 16.44 ## range 12.300 24.000 66.80 |

We may achieve the same result by writing a more general function that will work with any kind of statistics as long as they return a single value:

1 2 3 4 5 6 7 8 9 10 11 12 13 |
my_summary <- function(x, flist) { f <- function(f, ...) f(...) g <- function(x, flist) { vapply(flist, f, x, FUN.VALUE = numeric(1)) } df <- as.data.frame(lapply(x, g, flist)) row.names(df) <- names(flist) df } my_summary(cars, flist = list(mean = mean, stdev = sd, cv = function(x, ...) { sd(x, ...)/mean(x, ...) })) |

1 2 3 4 |
## speed dist ## mean 15.4000 42.9800 ## stdev 5.2876 25.7694 ## cv 0.3434 0.5996 |

### fapply

Working with this mind set we may even define a function `fapply()`

that applies all functions of a list to the same set of arguments

1 2 3 |
fapply <- function(X, FUN, ...){ lapply(FUN, function(f, ...){f(...)}, X, ...) } |

and use it as:

1 2 |
basic_stat <- list(mean = mean, median = median, sd = sd) fapply(1:10, basic_stat) |

1 2 3 4 5 6 7 8 |
## $mean ## [1] 5.5 ## ## $median ## [1] 5.5 ## ## $sd ## [1] 3.028 |