R as a functional programming language

R can be considered as a functional programming language as it focusses on the creation and manipulation of functions and has what’s known as first class functions.

In computer science, functional programming is a programming paradigm, a style of building the structure and elements of computer programs, that treats computation as the evaluation of mathematical functions and avoids state and mutable data.

Functional programming emphasizes functions that produce results that depend only on their inputs and not on the program state.

In functional code, the output value of a function depends only on the arguments that are input to the function, so calling a function f() twice with the same value for an argument x will produce the same result f(x) both times. R supports this style well, although it does not enforce it: R functions can also read and modify state outside their arguments.

Understanding the functional nature of R may help to improve clarity and avoid redundancy.

We will examine:

  • First Class Functions
  • Function Closures
  • Function Factories
  • Anonymous Functions
  • Lists of Functions
  • Functionals

First class functions

First-class functions are a key component of functional programming style.

A programming language is said to have first-class functions when the language supports:

  • passing functions as arguments to other functions
  • creating anonymous functions
  • returning functions as the values from other functions
  • storing functions in data structures.

R does indeed have first-class functions.

In this example we pass the function identity() as an argument to the function lapply():

Here we make use of an anonymous function:

We can easily define a function that returns a function:

Finally, we store functions within a list:
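The four properties listed above can be illustrated with a minimal sketch. The names make_cv, cv and function_list are hypothetical and chosen only for illustration.

```r
# 1. Pass a function as an argument to another function
lapply(0:2, identity)

# 2. Define and call an anonymous function (coefficient of variation)
(function(x) sd(x) / mean(x))(1:5)

# 3. Define a function that returns a function
make_cv <- function() {
  function(x) sd(x) / mean(x)
}
cv <- make_cv()
cv(1:5)

# 4. Store functions in a list and call one of them
function_list <- list(mean = mean, sd = sd)
function_list$mean(1:5)
```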

Function closures

A function closure or closure is a function together with a referencing environment.

Almost all functions in R are closures, as they remember the environment where they were created. This is usually, but not always, the global environment:

or a package environment:

Functions that cannot be classified as closures, and therefore do not have a referencing environment, are known as primitives. These are internal R functions that call the underlying C code directly. sum() and c() are good cases in point:
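As a small sketch of the distinction between closures and primitives, the calls below inspect the environment of a user-defined function, of a package function and of a primitive. The function f() is a hypothetical example; sd() is used only as a convenient function from the stats package.

```r
f <- function(x) x + 1

# A function defined at top level remembers the environment it was created in
environment(f)

# A function defined in a package remembers the package namespace
environment(sd)

# Primitives have no referencing environment
environment(sum)
is.primitive(sum)
is.primitive(c)

# The internal type also differs
typeof(f)
typeof(sum)
```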

As functions remember the environments where they were created, the following example does not return any error:
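A minimal sketch of such an example, assuming a variable y defined at top level and a hypothetical function f() that uses y without receiving it as an argument:

```r
y <- 1

# y is neither an argument of f() nor defined inside its body
f <- function(x) x + y

# No error: y is looked up in the environment where f() was created
f(1)
```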

This is possible as f() is declared within the global environment and therefore f() remembers all objects bound to that environment (the referencing environment), y included.

When we call a function, a new environment is created to hold the function’s execution and, normally, that environment is destroyed when the function exits. But, if we define a function g() that returns a function f(), the environment where f() is created is the execution environment of g(), that is, the execution environment of g() is the referencing environment of f(). As a consequence, the execution environment of g() is not destroyed as g() exits but it persists as long as f() exists. Finally, as f() remembers all objects bound to its referencing environment, f() remembers all objects bound to the execution environment of g().

With this idea in mind, we can use the referencing environment of f(), that is the execution environment of g(), to hold any object and these objects will be available to f().

Moreover, as f() is created within g() any argument passed to g() will be available to f() in its later executions.
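The mechanism described above can be sketched as follows. The bodies of g() and f() are assumptions chosen for illustration; only their roles come from the text.

```r
g <- function(y) {
  # f() is created inside the execution environment of g()
  f <- function(x) x + y
  f
}

f <- g(y = 1)

# g() has already exited, but the argument y = 1 is still available to f()
f(2)
```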

As a proof of concept, we may temporarily modify function g() into a version g_tmp() that prints its own execution environment:

Then use g_tmp() to produce f():

and finally ask R for the environment associated with f():

As we can see, the execution environment of g_tmp() corresponds to the environment associated with f().

Finally,

shows where y is stored.

Notice that each call to g() returns a function with its own referencing environment:

The referencing environments of f1() and f2() are different, even though f1() and f2() are both returned by g().
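The proof of concept and the comparison between f1() and f2() can be sketched as below. The function bodies are illustrative assumptions; the environment labels printed by R differ from session to session.

```r
g_tmp <- function(y) {
  print(environment())   # execution environment of this call
  function(x) x + y
}

f <- g_tmp(y = 1)        # prints an environment label
environment(f)           # shows the same label

# y is stored in the referencing environment of f()
get("y", envir = environment(f))

# Each call to g() creates a new execution environment
g <- function(y) function(x) x + y
f1 <- g(1)
f2 <- g(2)
identical(environment(f1), environment(f2))
```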

Function factories

In practice, we can use closures to write specific functions that, in turn, can be used to generate new functions. This allows us to have a double layer of development: a first layer that is used to do all the complex work in common to all functions and a second layer that defines the details of each function.

Example: A basic case in point

We can think of a simple function add(x, i) that adds the value i to the value x. We could define this function as:

Alternatively, we may consider a set of functions, say f1(x), f2(x), ..., fi(x), ..., fn(x) that add 1, 2, ..., i, ..., n to the x argument. Clearly, we do not want to define all these functions by hand; instead we want to define a single function f(i):

capable of generating all fi(x) functions:
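A minimal sketch of both approaches follows. The call to force() is an addition not mentioned in the text; it makes sure i is evaluated when the factory is called rather than later.

```r
# Direct approach: one function with two arguments
add <- function(x, i) x + i

# Factory approach: f(i) returns the function fi(x)
f <- function(i) {
  force(i)             # evaluate i now, not at first use
  function(x) x + i
}

f1 <- f(1)
f2 <- f(2)
f1(10)
f2(10)

# Any number of fi() functions can be generated in one step
adders <- lapply(1:3, f)
adders[[3]](10)
```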

In this simple example, this approach shows no benefit and possibly increases the complexity of our code but, for more structured cases, it is definitely worthwhile.

Example: MLE functions

As a more structured example, we may consider the development of a set of functions: lnorm(x), lweibull(x), ... that compute maximum likelihood estimates for those distributions given a vector of data x:

new_estimate() returns a second function, estimate(), whose body depends on the argument dist passed to new_estimate().

Within estimate() we first define a third function, neglik(), and then minimize it with optim().

As a result, new_estimate() works as a generator of maximum likelihood estimation functions for any distribution whose density function ddist() exists in R.

Once we have new_estimate(), we can use it to define any MLE function as long as the corresponding density function is defined. That is, we can now write a llnorm() that computes log-normal maximum likelihood estimates as simply as:

and similarly:
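One possible sketch of new_estimate() is given below. It is not necessarily the original implementation: it assumes that dist is the distribution suffix (for example "lnorm"), that the density accepts log = TRUE, and that the user supplies starting values through a hypothetical theta argument.

```r
new_estimate <- function(dist) {
  # Look up the density function, e.g. "lnorm" -> dlnorm()
  ddist <- get(paste0("d", dist), mode = "function")

  estimate <- function(x, theta) {
    # Negative log-likelihood of the data x for parameters theta
    neglik <- function(theta) {
      args <- c(list(x), as.list(unname(theta)), list(log = TRUE))
      -sum(do.call(ddist, args))
    }
    # theta holds the starting values for the optimizer
    optim(par = theta, fn = neglik)$par
  }
  estimate
}

llnorm   <- new_estimate("lnorm")
lweibull <- new_estimate("weibull")

set.seed(1)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.5)
llnorm(x, theta = c(1, 1))
```

The parameters in theta are passed to the density by position, so their order must match the order of the density's arguments (meanlog, sdlog for dlnorm(); shape, scale for dweibull()).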

Example: moving statistics

As a further example of function factories, we may consider a function moving(f) that returns moving_f(), where f() could be mean(), median() or any other statistical function, as long as it returns a single value.

As a first step we may consider a simple function g() that returns the value of f() for any backward window of length n starting at i:

Note that g() takes, among its inputs, a second function f() and applies it to the window [(i-n+1):i] of x.

As a second step we define a function moving(f) that takes any function f() as an input and defines function g() within its body.

Function moving() returns a function h() which, in turn, plays the role of moving_f() for the chosen f():

Function vapply() within h() is a functional used as a for loop replacement that will be fully explored when discussing functionals.

If needed, the ... argument can be used to pass extra arguments to f().
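The two steps above can be sketched as follows. The argument names and the exact body of h() are assumptions consistent with the description, not a verbatim copy of the original code.

```r
moving <- function(f) {
  # g() applies f() to the backward window of length n ending at i
  g <- function(i, x, n, f, ...) f(x[(i - n + 1):i], ...)

  # h() evaluates g() at every position where a full window is available
  h <- function(x, n, ...) {
    N <- length(x)
    vapply(n:N, g, numeric(1), x = x, n = n, f = f, ...)
  }
  h
}

moving_average <- moving(mean)
moving_median  <- moving(median)

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
moving_average(x, n = 3)
moving_median(x, n = 3)

# Extra arguments are forwarded to f() through ...
moving_average(c(1, NA, 3, 4), n = 2, na.rm = TRUE)
```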

If necessary, function moving() can be used with an anonymous function:

Finally:

Plot of moving average and median

Example: Truncated density function

Density functions in R are usually named with the prefix d followed by a standard suffix for each distribution: dnorm(), dlnorm(), dweibull(), etc.

Therefore, we usually write:

in order to get density values at x from a lognormal distribution with parameters 2 and 1.

If we need values from a truncated distribution then, as far as we know, we need to load an extra package such as truncdist. The package itself works perfectly. In fact, given that a dlnorm() function exists, we can get density values from a left-truncated lognormal distribution with parameters meanlog = 2 and sdlog = 1 by simply writing:

where a = 5 represents the left threshold for truncation.

Nevertheless, the above command requires a change in our programming style.

In principle, we would like to be able to write:

where L = 5 represents the left threshold for truncation

so that we could have the same programming style, just with different parameters, for both truncated and non-truncated distributions.

Within this frame, when tdlnorm() is called with L and U set to their default values it behaves as stats::dlnorm()

but when called with different settings for L and U, such as:

tdlnorm(x, meanlog = 2, sdlog = 1, L = 5, U = 20)

it behaves as a lognormal density left truncated at L=5 and right truncated at U=20.

This goal could be achieved by writing a tdlnorm() as:

which returns the same results as function truncdist::dtrunc():

As this function clearly works, the next step could be to write something similar for other distributions such as Weibull, Gumbel or gamma. We have to admit that all of this may become quite time consuming.

A different approach could be to define a different function, dtruncate(), taking the name of a density distribution as an argument and returning a second function that computes density values for the truncated distribution:

with:

  • envir: the environment plnorm() and dlnorm() belong to

We can now define a new tdlnorm() as:

and use it as:


Clearly, tdlnorm() returns the same results as truncdist::dtrunc():

Moreover, our newly created tdlnorm() function takes meanlog and sdlog as arguments, as well as lower.tail = TRUE and log.p = FALSE, as stats::plnorm() does, even though these arguments were not mentioned when calling dtruncate().

Now that we have dtruncate(), the same exercise can be replicated, at no extra programming effort, for any density function:
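A simplified sketch of such a dtruncate() factory is shown below. It is an assumption-laden variant, not the original code: distribution parameters are forwarded through ... instead of being copied into the formals of the returned function, and envir defaults to the stats namespace.

```r
dtruncate <- function(dist, envir = asNamespace("stats")) {
  # Look up the density and distribution functions, e.g. dlnorm() and plnorm()
  ddist <- get(paste0("d", dist), mode = "function", envir = envir)
  pdist <- get(paste0("p", dist), mode = "function", envir = envir)

  function(x, ..., L = -Inf, U = Inf) {
    # Probability mass between the truncation points L and U
    mass <- pdist(U, ...) - pdist(L, ...)
    # Rescaled density inside [L, U], zero outside
    ifelse(x >= L & x <= U, ddist(x, ...) / mass, 0)
  }
}

tdlnorm   <- dtruncate("lnorm")
tdweibull <- dtruncate("weibull")

x <- c(1, 6, 10, 25)
tdlnorm(x, meanlog = 2, sdlog = 1)                  # behaves as dlnorm()
tdlnorm(x, meanlog = 2, sdlog = 1, L = 5, U = 20)   # truncated to [5, 20]
```

Because L and U come after ... in the argument list, they must always be passed by name.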

Functions with memory

When talking about closures, we used the referencing environment of f() to hold any value passed by g(). Similarly, we can use the same environment to keep a state across multiple executions of f().

Example: Track how many times a function is called

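We may consider a function that simply returns the current date but tracks how many times it has been called:

Note that we used the <<- operator, which assigns to an existing variable in an enclosing environment (here, the referencing environment of the function) instead of creating a new local variable. This is equivalent to: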
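A minimal sketch of both forms follows. The names new_today(), new_today2() and count are hypothetical; the second factory spells out the assignment that <<- performs in this situation.

```r
new_today <- function() {
  count <- 0                        # state kept in the referencing environment
  function() {
    count <<- count + 1             # update count in the enclosing environment
    message("Called ", count, " time(s)")
    Sys.Date()
  }
}

today <- new_today()
today()
today()

# The same update written with an explicit assign()
new_today2 <- function() {
  count <- 0
  function() {
    env <- parent.env(environment())
    assign("count", get("count", envir = env) + 1, envir = env)
    Sys.Date()
  }
}
```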

Example: Avoid re-calculating previous results

We can use the referencing environment of a function to keep previous returned values of the same function. By using this idea, we could try to avoid re-calculating previously computed values.

Suppose we want a function that takes n as argument and returns all primes less than or equal to n. This function already exists within the pracma package:

In order to keep previous results we can define a function makefprime() that, when called, returns a second function with an environment .env attached:

We can now create a function named for instance fprimes() by calling function makefprime() which returns identical results when compared with primes().

Now suppose we need to compute prime numbers several times within a working session or a for loop. When n is large, this computation may require a substantial amount of time.

Nevertheless, because of the way we defined fprimes(), the second time this function is called with n = 10^7 the computing time is practically zero, as the function reuses the previously computed results stored in environment .env.
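The caching idea can be sketched without any extra package. In this sketch sieve() is a hypothetical base R helper standing in for pracma::primes(), and the fields n_max and primes stored in .env are assumptions about how the cache might be organised.

```r
# Hypothetical helper: sieve of Eratosthenes, all primes <= n
sieve <- function(n) {
  if (n < 2) return(integer(0))
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE
  i <- 2
  while (i * i <= n) {
    if (is_prime[i]) is_prime[seq(i * i, n, by = i)] <- FALSE
    i <- i + 1
  }
  which(is_prime)
}

makefprime <- function() {
  .env <- new.env()
  .env$n_max  <- 1            # largest n computed so far
  .env$primes <- integer(0)   # primes up to n_max

  function(n) {
    if (n > .env$n_max) {     # recompute only when the cache is too small
      .env$primes <- sieve(n)
      .env$n_max  <- n
    }
    .env$primes[.env$primes <= n]
  }
}

fprimes <- makefprime()
fprimes(30)   # computed
fprimes(10)   # served from the cache
```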

Example: Add to an existing plot

As a last example, we may want a function that adds to an existing plot any time a new observation becomes available. Using the same mechanism, we can define a new_plot() function that creates a new plot the first time it is called:

and adds to the same plot at each subsequent call:

first call

second call

third call


Functionals

Functionals are functions that take a function as input and return a data object as output.

R incorporates many examples of functionals. Among many, Reduce() and Filter() are two good cases in point.

Reduce(f, x) folds the elements of x by applying function f() to successive pairs. As a result we may use this function for binding the elements of a list into a matrix:

Filter(f, x) applies function f() to each element of x, and returns the subset of x for which this gives TRUE. In order to select the even numbers from any vector x we could write:
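Both functionals can be illustrated with a short sketch; the input data below are arbitrary examples.

```r
# Reduce(): bind the elements of a list, row by row, into a matrix
rows <- list(1:3, 4:6, 7:9)
m <- Reduce(rbind, rows)
m

# Filter(): keep only the even numbers of a vector
evens <- Filter(function(v) v %% 2 == 0, 1:10)
evens
```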

As a very interesting example of functional we may define:

That is, a function fun() that takes any function f() as input along with any other arguments and computes f(...), where ... represents the set of arguments.

As a result we may write:

that is equivalent to:
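A minimal sketch of fun() and of the equivalence just described, using mean() as an arbitrary example:

```r
fun <- function(f, ...) f(...)

# Calling mean() through fun() ...
fun(mean, x = c(1:10, NA), na.rm = TRUE)

# ... gives the same result as calling mean() directly
mean(x = c(1:10, NA), na.rm = TRUE)
```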

Functionals are very often excellent substitutes for for loops, as they communicate the objective of our code more clearly and concisely, and the resulting code adheres more closely to R’s idioms.

Functionals may even perform a little better than the equivalent for loop but, in the first instance, our focus should be on clarity rather than performance.

lapply() is, possibly, the most used functional:

lapply() can be considered the main functional. sapply() and vapply() perform as lapply() but return a simplified output. mapply() and Map() are extensions of lapply() that allow for multiple inputs.

lapply

lapply() takes a function and applies it to each element of a list, saving the results back into a result list. lapply() is the building block for many other functionals. In principle, lapply() is a wrapper around a standard for loop. The wrapper is written in C to increase performance.

lapply() takes three arguments:

  • a list X, or anything that can be coerced to a list by as.list()
  • a function FUN that takes, as first argument, each element of X
  • the ''...'' argument that can be any argument to be passed to FUN

Suppose we want to find the maximum of each column of the airquality data frame. By using a for loop we could write:

Alternatively, as a data frame is a list:

The second chunk of code is far clearer and more concise than the first one, even though a vector would be preferable to a list as output.

Moreover, lapply(), as opposed to a for loop, does not produce any intermediate result when running. In the above for loop, the value of the result of the loop, vector out, changes at each iteration. The result of lapply(), instead, can be assigned to a variable but does not produce any intermediate result.
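The comparison discussed above can be sketched as follows. The use of na.rm = TRUE is an assumption added here because some airquality columns contain missing values.

```r
# for loop: the vector out is modified at every iteration
out <- numeric(ncol(airquality))
names(out) <- names(airquality)
for (i in seq_along(airquality)) {
  out[i] <- max(airquality[[i]], na.rm = TRUE)
}
out

# lapply(): a data frame is a list of columns, so no index is needed
res <- lapply(airquality, max, na.rm = TRUE)
str(res)
```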

By default lapply() takes each element of list X as the first argument of function FUN. This works perfectly, as long as each element of X is meant to be the first argument of FUN. This is true in case we want to compute the mean of each column of a data frame, as each column is passed as first argument to function mean().

But suppose we want to compute various trimmed means of the same vector. As trim is the second parameter of mean(), we want to vary trim while keeping the first argument x fixed.

This can be easily achieved by observing that the following two calls are equivalent:

This is possible because R first matches formals by name and afterwards by position.

As a result, in order to use lapply() with the second argument of function FUN, we just need to name the first argument of FUN and pass it to lapply() as part of the ... argument:
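A short sketch of this argument-matching trick, with an arbitrary vector x and arbitrary trim values:

```r
x <- c(1, 2, 3, 4, 100)

# Two equivalent calls: x is matched by name, 0.2 then fills trim by position
mean(x, 0.2)
mean(0.2, x = x)

# Vary trim while keeping x fixed
trims <- c(0, 0.1, 0.2, 0.5)
by_name <- lapply(trims, mean, x = x)

# The same computation with an explicit anonymous function
explicit <- lapply(trims, function(trim) mean(x, trim = trim))
identical(by_name, explicit)
```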

sapply and vapply

sapply() and vapply() are variants of lapply() that produce vectors, matrices and arrays as output, instead of lists.

lapply() returns a list as output. In order to get a vector, the previous examples could be written by using sapply(), a simple variant of lapply() that tries to return an atomic vector instead of a list.

Note that sapply() is a simple wrapper around lapply() that uses simplify2array().

Unfortunately, simplify2array() and therefore sapply() offer very little control on the type of output that is returned:

In this case we would expect sapply() to return an empty logical vector, not an empty list.

As a result, sapply() may represent an excellent shortcut when working with R in interactive mode but not a good function to be used when developing serious R code.

A better alternative to sapply() is provided by vapply(), a second variant of lapply() that allows us to specify, by means of the FUN.VALUE argument, the kind of output we want the functional to return. In fact,

In this case the FUN.VALUE argument specifies what kind of output each element of the result should be. vapply() improves consistency by providing either the return type we were expecting or an error. This is a clear advantage, as it helps catch errors early and leads to more robust code.
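The difference can be seen on an empty input, sketched here with is.numeric() as an arbitrary predicate:

```r
# sapply() cannot guess the output type and falls back to an empty list
sapply(list(), is.numeric)

# vapply() returns an empty vector of the declared type
vapply(list(), is.numeric, FUN.VALUE = logical(1))
```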

As an example, suppose we have a list of data frames:

and that we need to know the number of columns of each data frame returned in a vector. We can easily achieve this goal by:

Nevertheless, suppose we want to apply the same function to a very large list of data frames and, by chance, one of them happens to be NULL.

When using sapply(), a list is returned instead of a vector:

If we use vapply() instead:

R returns an error.
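This behavior can be reproduced with a small sketch. The list below uses three built-in data frames as stand-ins for the original data, and the element name broken is hypothetical.

```r
df_list <- list(iris = iris, mtcars = mtcars, cars = cars)

# With well-formed input both functionals return an integer vector
sapply(df_list, ncol)
vapply(df_list, ncol, FUN.VALUE = integer(1))

# Add a NULL element to the list
df_list2 <- c(df_list, list(broken = NULL))

# sapply() silently switches to a list ...
res_s <- sapply(df_list2, ncol)
class(res_s)

# ... while vapply() stops with an error
res_v <- tryCatch(
  vapply(df_list2, ncol, FUN.VALUE = integer(1)),
  error = function(e) conditionMessage(e)
)
res_v
```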

Clearly this second behavior is more coherent and, within the frame of a large project, possibly helps to avoid annoying hours of debugging.

Finally, vapply() can be faster than sapply(), as R does not have to guess the kind of output sapply() needs to return.

Suppose we have a list made of 10^6 vectors of variable length between one and five.

We can appreciate the difference in speed between sapply() and vapply() by the following example:

lapply patterns

When using lapply() we can loop in at least two different ways: over the elements xs or over an index i. As an example, we may consider the following code:

and compare it with the next chunk:

The second chunk of code looks a little more complicated as it introduces the i index that in this case is simply redundant.

Suppose, instead, we want to compute the mean of the three variables of the trees dataset, but using a different trim value for each column and setting na.rm = TRUE.

By using a for loop, leaving aside clarity and the number of lines required, the code is quite simple:

When translating this code with lapply(), the index strategy becomes mandatory because lapply(X, FUN, ...) allows only one argument of function FUN, the one fed by X, to vary. All other arguments to function FUN can be passed via ... but cannot vary.
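A sketch of the for loop and of its translation with the index strategy; the trim values are arbitrary assumptions.

```r
trims <- c(0.1, 0.2, 0.3)

# for loop version
out <- numeric(ncol(trees))
for (i in seq_along(trees)) {
  out[i] <- mean(trees[[i]], trim = trims[i], na.rm = TRUE)
}

# lapply() over an index: i selects both the column and its trim value
res <- lapply(seq_along(trees), function(i) {
  mean(trees[[i]], trim = trims[i], na.rm = TRUE)
})
unlist(res)
```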

The same approach we used to translate for loops into lapply() can be generalized to nested loops.

As an example consider this apparently messy loop:

We have at least two alternatives to transform this loop.

We can use lapply() with the index strategy. We have to rewrite the innermost part of the whole loop as:

Secondly, with a little help from expand.grid():

The use of expand.grid() allows us to transform a nested loop into a table of all possible combinations, over which we can easily loop by using lapply(). This approach may help a lot in simplifying our code but is quite memory hungry, as it requires expanding all possible combinations into a single object.

Alternatively, we can make use of a nested lapply(). We first define function f() as:

Afterwards, we nest two functionals as follows:

This case clearly illustrates how much we gain in clarity when using functionals instead of loops.

As an alternative to lapply() with the index strategy we may consider mapply() and Map() that naturally iterate over multiple input data structures in parallel.

mapply and Map

A first alternative to lapply(), along with the index strategy, is represented by mapply():

The structure of mapply() is:

where, as opposed to lapply(), the FUN argument takes the first position and the ... argument specifies the lists of arguments to be passed to FUN during the iteration.

The MoreArgs argument takes a list of parameters to be kept fixed during the iteration. Note that this breaks R’s usual lazy evaluation semantics, and is inconsistent with other functions.

The SIMPLIFY argument, set to TRUE by default, allows output simplification in the sapply() fashion. Clearly, this option gives as little control over the output as sapply() does.

An alternative to mapply() is represented by Map(), which returns results identical to mapply() with SIMPLIFY set to FALSE and relies on an anonymous or externally defined function to pass fixed parameters to FUN.
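As a sketch, the trimmed means of the trees columns can be computed with both functions. The signature in the first comment is the one documented for base R; the trim values are arbitrary assumptions.

```r
# mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

trims <- c(0.1, 0.2, 0.3)

# mapply(): columns and trims vary in parallel, na.rm is fixed via MoreArgs
m1 <- mapply(mean, trees, trim = trims, MoreArgs = list(na.rm = TRUE))
m1

# Map(): the fixed argument is set inside an anonymous function
m2 <- Map(function(x, trim) mean(x, trim = trim, na.rm = TRUE), trees, trims)
str(m2)
```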

The choice between using mapply() or Map() is surely a personal one.

Both Map() and mapply() can be used to substitute nested loops.

We can consider the previous nested loop and, with a little help from expand.grid() and function f():

And, similarly to the lapply() case, we could write:

or:

eapply

Sometimes we may find ourselves using environments as data structures, for reasons that include their use as hash tables and the fact that copy-on-modify semantics do not apply to environments.

When objects are stored within an environment, we can use eapply() as a functional to the environment:
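A minimal sketch of eapply() on a small environment; the object names a and b are arbitrary. Note that the order of the returned list is not guaranteed, so results are best accessed by name.

```r
e <- new.env()
e$a <- 1:10
e$b <- c(2, 4, 6)

# Apply mean() to every object stored in the environment
res <- eapply(e, mean)
res[order(names(res))]
```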


Anonymous functions

In R, we usually assign functions to variable names. Nevertheless, functions can exist without being assigned to a symbol. Functions that don’t have a name are called anonymous functions.

We can call anonymous functions directly, as we do with named functions, but the code is a little unusual as we have to use parentheses both to enclose the whole function definition and to pass arguments to the function:

Note that this is exactly the same as:

We use anonymous functions when it’s not worth the effort of assigning functions to a name. We could plot a function s(x) by:

or alternatively by:

In this case function(x) sin(x)/sqrt(x) is an example of an anonymous function.

Finally, anonymous functions are, by all rights, normal R functions as they have formals(), a body(), and a parent environment():
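The points made in this section can be sketched with a trivial anonymous function; the body x + 1 is an arbitrary example.

```r
# Calling an anonymous function directly
(function(x) x + 1)(2)

# Exactly the same as naming the function first
f <- function(x) x + 1
f(2)

# Anonymous functions have formals, a body and an environment
formals(function(x) x + 1)
body(function(x) x + 1)
environment(function(x) x + 1)
```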


Lists of functions

Functions, like any other type of R object, can be stored in a list.

This makes it easier to work with groups of related functions.

Functions defined within a list are still accessible in at least three different ways:

by using the with() function:

by using the $ operator:

by attaching the list:
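The three access styles can be sketched as follows; the list fun_list and its element names m and s are hypothetical.

```r
fun_list <- list(m = mean, s = sd)
x <- c(1, 2, 3, 4, 5)

# 1. with(): the list elements are visible inside the expression
with(fun_list, m(x))

# 2. The $ operator
fun_list$m(x)

# 3. attach(): the list is placed on the search path until detached
attach(fun_list)
m(x)
detach(fun_list)
```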

Lists of functions can be most useful when we want to apply all functions of the list to the same set of data.

We can achieve this goal in two logical steps.

We first define a function

that takes a function f() as argument along with any other arguments ... and returns f(...). In practice:

Secondly, we apply function fun() to the list of functions. Arguments required by the functions stored in the list are passed through the ... argument:

Under almost all circumstances, equivalent results can be achieved by using function do.call() within a call to lapply():

the only difference being that arguments to functions within the list must be enclosed in a list too.
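Both routes can be sketched on a small list of summary functions; the list contents and the data are arbitrary examples.

```r
fun <- function(f, ...) f(...)

fun_list <- list(mean = mean, median = median, sd = sd)
x <- c(1, 2, 3, 4, 100)

# Apply every function in the list to the same data through fun()
r1 <- lapply(fun_list, fun, x)

# The same with do.call(): the arguments must be wrapped in a list
r2 <- lapply(fun_list, do.call, list(x))

identical(r1, r2)
```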

Example: Multiple Anderson-Darling tests

As a simple example we may want to compare the results of four Anderson-Darling type tests from the truncgof package applied to the same data.

We can define a list that holds these four functions and store it in the global environment:

and, afterwards, apply function fun() to each element of this list:

Example: Summary statistics

We may want to define a function that returns some specific statistics for a given set of variables in the form of a data.frame.

We may achieve the same result by writing a more general function that will work with any kind of statistics as long as they return a single value:

fapply

Working with this mindset we may even define a function fapply() that applies all functions of a list to the same set of arguments:

and use it as:
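One possible sketch of fapply() is given below. The argument name FUN_LIST and the argument order are assumptions, since the original signature is not shown here.

```r
fapply <- function(FUN_LIST, ...) {
  # Call each function in FUN_LIST with the same arguments
  lapply(FUN_LIST, function(f, ...) f(...), ...)
}

res <- fapply(list(mean = mean, median = median, max = max),
              c(1, 2, 3, 4, 100))
str(res)
```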