Functions structure

When working with R we constantly use functions and, when developing, we create new ones, so functions feel like very familiar R objects. Nevertheless, understanding the theory and the rationale underlying R functions may help us write much more efficient, and possibly more elegant, code.

We can create a function and assign it to a variable name as we do with any other object:

When it is no longer needed, we can delete any function with the usual call to rm() or remove().
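
A minimal sketch of both steps, where add_one is a hypothetical name chosen only for illustration:

```r
# Create a function and bind it to a name, like any other object
add_one <- function(x) x + 1
add_one(2)        # 3
class(add_one)    # "function"

# Remove it with rm(), exactly as we would remove a vector
rm(add_one)
exists("add_one", inherits = FALSE)   # FALSE
```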

Functions are objects with three basic components:

  • a formal arguments list
  • a body
  • an environment.

Formals

Formals are the formal arguments of a function, returned as an object of class pairlist. A pairlist can be thought of as something similar to a list, with an important difference:

that is: a pairlist of length zero is NULL while a list is not.
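
The difference can be checked directly. The functions f and g below are hypothetical and serve only as a sketch:

```r
f <- function() NULL       # a function with no arguments
formals(f)                 # NULL: a pairlist of length zero
is.null(pairlist())        # TRUE
is.null(list())            # FALSE: an empty list is still a list

g <- function(x, y = 2) x + y
class(formals(g))          # "pairlist"
```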

When we call a function, formal arguments can be specified by position or by name, and we can mix positional matching with matching by name, so that the following are equivalent:

Along with position and name, we can also specify formals by partial matching so that:

would work anyway.
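
As a sketch, assuming a hypothetical function power() with two formals, all of the following calls return the same value:

```r
power <- function(base, exponent) base^exponent

power(2, 3)                      # positional matching
power(base = 2, exponent = 3)    # matching by name
power(exponent = 3, 2)           # mixed: name first, then position
power(b = 2, e = 3)              # partial matching of both names
```

Each call returns 8. Partial matching is convenient interactively, but full names are usually safer inside programs.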

Function formals may also use the construct symbol = default which, unless specified otherwise, makes the argument take its default value.

Specifically, mean.default(), the default method of mean(), has a third argument na.rm that defaults to FALSE and, as a result, passing a vector with NA values to mean() returns NA.

By specifying na.rm = TRUE, instead, we get the mean of all non-missing elements of vector x.
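
For instance, with an illustrative vector x:

```r
x <- c(1, 2, NA, 4)
mean(x)                  # NA: the default na.rm = FALSE is used
mean(x, na.rm = TRUE)    # 2.333333: mean of 1, 2 and 4
```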

The order R uses for matching formals against supplied values is:

  1. Check for exact match for a named argument
  2. Check for a partial match
  3. Check for a positional match

Formals are normally used by the internal R evaluator, but we can use function formals() to expose them explicitly.

args() is another function that displays the formals in a more user-friendly fashion. Actually, args(fun) returns a function with the same arguments as fun but with a NULL body.

Surely, for programming purposes, formals() is a better choice as it returns a simple pairlist that can be handled as a list:

As a replacement method, formals<-, exists for function formals(), the formals of a function can be manipulated by using function alist(): a list()-type function that handles unevaluated arguments.
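
A small sketch of these tools on a hypothetical function f():

```r
f <- function(x, y = 2) x + y

fl <- formals(f)     # a pairlist that can be handled as a list
names(fl)            # "x" "y"
fl$y                 # 2
args(f)              # function (x, y = 2) NULL

# alist() allows an argument without a default (x = ) to be kept
formals(f) <- alist(x = , y = 10)
f(1)                 # 11
```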

As an example of practical use of formals(), we may decide to re-define function mean() so that na.rm defaults to TRUE by simply:

Clearly, we now have a copy of mean.default() in our globalenv:

Finally, let’s notice that:

remains the environment of the base package (namespace:base): the environment where the function was created.
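
A minimal sketch of the same idea, applied to a copy with the hypothetical name my_mean so that the base function is not masked:

```r
# Copy the default method, then change the default of na.rm on the copy
my_mean <- base::mean.default
formals(my_mean)$na.rm <- TRUE

my_mean(c(1, NA, 3))     # 2: missing values are now dropped by default

# The modified copy keeps the environment of the original function
environment(my_mean)     # <environment: namespace:base>
```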

The “...” argument of a function is a special argument and can contain any number of symbol = value arguments. The “...” argument is transformed by R into a list that is simply added to the formals list:

The “...” argument can be used when the number of arguments is unknown. Suppose we want to define a function that counts the number of rows of any number of data frames; we could write:
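
One possible sketch, where count_rows is a hypothetical name and the data frames come from the datasets package shipped with R:

```r
count_rows <- function(...) {
  dfs <- list(...)                 # collect all supplied data frames
  vapply(dfs, nrow, integer(1))    # one row count per data frame
}

count_rows(cars, iris, mtcars)     # 50 150 32
```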

Similarly, the “...” argument becomes very handy when its contents are passed on to another function, as often happens when calling plot() from within another function. The following example shows a basic plot function used for depths plotting where additional graphics parameters are passed via “...”:

Body of a function

The body of a function is a parsed R statement. In practice, this implies that the body needs to be syntactically correct, but it is not evaluated when the function is defined.

As a result, trying to define this function would return an error:

as its body is not a correct R statement.

While this function:

is accepted by R as it is formally correct even though, except under specific circumstances, it will always return an error when called:

The body of a function is usually a collection of statements in braces, but it can be a single statement, a symbol or even a constant.

The body of a function is usually a call object (is.call() returns TRUE for a body written in braces, although class() reports "{" for it):

and as a call object, the body of a function can be manipulated as a list:

and, as function body() has a replacement method, body<-, the body of a function can be easily modified:

This technique can be used to test small changes to a function on the fly, without rewriting its full body.
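
A short sketch of inspecting and modifying a body, using a hypothetical function f():

```r
f <- function(x) {
  y <- x + 1
  y * 2
}

b <- body(f)
is.call(b)       # TRUE
as.list(b)       # `{`, then the two statements
b[[3]]           # y * 2

# Replace only the last statement of the body
body(f)[[3]] <- quote(y * 10)
f(1)             # 20
```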

Environment of a function

The environment of a function is the environment that was active at the time the function was created. Generally, for user-defined functions, this is the global environment:

or, when a function is defined within a package, the environment associated with that package:

The environment of a function is a structural component of the function and belongs to the function itself.

As an example, we can define a function f() that simply returns zero

the environment of f() is clearly the globalenv()

we can modify the environment of a function and assign to f() a newly created environment

if we then delete environment env

f() will keep working

All of this happens because env and the environment of f() are two references to the same environment in memory, but they exist as separate objects: removing the name env does not remove the environment itself.
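
The whole sequence can be sketched as follows:

```r
f <- function() 0
env <- new.env()
environment(f) <- env     # f() now carries its own reference to env

rm(env)                   # removes the name env, not the environment
f()                       # 0
is.environment(environment(f))   # TRUE: still reachable through f()
```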

As an example we may consider a function defined in a dedicated environment along with some other objects in the same environment.

As we can see, g() clearly knows that x = 1 as it was passed to the function as an argument, but g() also knows that y = 1, as y belongs to env: the environment of g().

The same behavior occurs many times when we develop R functions and may lead to errors when calling them. Suppose we simply write:

The above example works as the environment of g() is now the global environment. But, as soon as we do:

clearly, g() will stop working as object y no longer exists in the global environment
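
A sketch of this situation, with the error captured by tryCatch() so that the script keeps running:

```r
g <- function(x) x + y    # y is a free variable, looked up at call time
y <- 1
g(1)                      # 2

rm(y)
res <- tryCatch(g(1), error = function(e) conditionMessage(e))
res                       # an error message: object 'y' not found
```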

Notice that, if we define this odd function

this function works if it finds variable x in its chain of searchable environments. As a result, if we define

now f() returns zero as it finds x within its environment

if we now delete env

f() will keep working

as a pointer to the same memory address exists as part of f() itself

Along with the environment where the function was created, functions usually interact with, at least, two more environments:

  • The evaluation environment
  • The calling environment

The evaluation environment is created any time the function is called. Within this environment, the formal arguments of the function are matched with the supplied arguments and the body of the function is evaluated.

The evaluation environment, as any other environment, has a parent. The parent of the evaluation environment of a function is the environment of the function. In other words, the function environment is the enclosure, the parent, of the evaluation environment.

As a proof of concept, we can write a simple function that returns its evaluation environment along with the symbols that are created within this environment:

As we can see, object x is bound in the evaluation environment of f().
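
One way to sketch such a function (the names of the list elements are illustrative):

```r
f <- function() {
  x <- 1
  list(env     = environment(),               # the evaluation environment
       objects = ls(),                        # symbols bound in it
       parent  = parent.env(environment()))   # its enclosure
}

res <- f()
res$objects                               # "x"
get("x", envir = res$env)                 # 1
identical(res$parent, environment(f))     # TRUE
```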

The calling environment is the environment the function is called from. When using R interactively, the calling environment of a function is usually the global environment but, this is not always the case.

When we call a function, R first looks for any variable in the evaluation environment and then in its enclosure: usually, for user-defined functions, the global environment. If the variable is not found there, R keeps searching along the chain of enclosing environments until it reaches the empty environment. As we can see, this process does not take the calling environment into account.

When using R interactively, the environment of a function and the calling environment of that function often coincide: functions are defined in the global environment and called from the same environment.

In order to better understand the difference between the environment of a function and the calling environment of a function, we may consider a new environment, with a function f() defined in it, whose enclosure is forced to the base environment:

Function f() takes a single argument and returns TRUE in case it is a function, FALSE otherwise.

If we call this function with argument x = c:

f() returns TRUE as it is considering function c() from the base environment.

if we define an object c within environment env:

and we call it:

now f() returns FALSE as it is considering variable c within the env environment and does not find function c() in the base environment.

If we now remove c from env:

and we re-define c within our global environment:

when now calling f(x = c),

we can see that f() now returns TRUE despite the c <- 0 assignment in the global environment.

Basically, f() starts searching from its environment, env, and, if necessary, keeps searching along the chain of enclosing environments where, in this case, the base environment is searched before the globalenv.
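
The lookup can be reproduced with the sketch below. It assumes, as a simplification, that c is the default value of argument x, so that the symbol is resolved starting from the evaluation environment of f() rather than from the calling environment:

```r
env <- new.env(parent = baseenv())
f <- function(x = c) is.function(x)   # assumption: c is a default value
environment(f) <- env

f()                          # TRUE: c() is found in the base environment

assign("c", 0, envir = env)
f()                          # FALSE: the c stored in env is found first

rm("c", envir = env)
c <- 0                       # defined where f() is called from
f()                          # TRUE: base is reached before this c
rm(c)
```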

R provides at least two useful functions to deal with the environments of a function:

  • parent.env()
  • parent.frame()

parent.env(), applied to the evaluation environment of a function, returns the environment in which the function was defined, while parent.frame(n = 1) identifies the environment from which the function was invoked.

In order to illustrate these concepts, we can define:

This function was defined in the global environment and called from the global environment.

Suppose we now define a new environment env and we move env_of_fun() into it:

when we now call env_of_fun()

we can see that the calling environment is now different from the definition environment.
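
A sketch of env_of_fun() under the assumption that it simply reports both environments, and that "moving" it means changing its enclosure to env:

```r
env_of_fun <- function() {
  list(defined_in  = parent.env(environment()),  # enclosure
       called_from = parent.frame())             # calling environment
}

res <- env_of_fun()
identical(res$defined_in, res$called_from)   # TRUE when run at top level

env <- new.env()
environment(env_of_fun) <- env
res <- env_of_fun()
identical(res$defined_in, env)               # TRUE
identical(res$defined_in, res$called_from)   # FALSE
```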

Understanding this idea can help to improve clarity and avoid annoying conflicts.

As an example, we can define function f() within a newly created environment env and use function parent.frame() within the newly created function:

and observe that:

that is, function parent.frame() forced f() to look for c first inside the calling environment rather than the creation environment: env or its parent:

Similarly, in order to avoid conflicts between objects passed as arguments to a function and objects stored in any other environment, such as a package, we could define f() within env as:

in this case we can be sure that whenever we call f() it first looks for the value of x as stored either in env or its parent:

Suppose, in fact, we call:

we can observe that f(x = pi) always returns the correct value for pi.

Example: Remove all objects from the workspace

As an example of use of the environment of a function, we can consider several strategies to write a function capable of removing all objects from the globalenv. We can initially write a simple function:

Function clear() removes all objects from a specified environment and seems to work correctly:

At this point, the drawback of this solution should be obvious: function clear() also deletes itself and, as a result, it cannot be reused without redefining it.

This function can be improved to keep function clear() when all other objects are deleted.

Now the function can be used more than once.

Unfortunately, this function also has a drawback: it stops working when reassigned.

As defined above, a copy of the function assigned to a new name, such as clean(), also removes itself: only the object named clear is preserved.

To dynamically keep the function name, we may modify function clear() as follows.

Nevertheless, besides the above solution, a smart way to obtain the same result is the following:

Through function assign(), function clear() is created in a new environment called myenv. In this way, all objects in the global environment can be removed without deleting function clear().
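
A cautious sketch of this last strategy. The envir argument and the scratch environment are assumptions added here so that the example can be run without wiping the real workspace:

```r
myenv <- new.env()
assign("clear", function(envir = globalenv()) {
  # remove every object, hidden ones included, from the target environment
  rm(list = ls(envir = envir, all.names = TRUE), envir = envir)
}, envir = myenv)

# Demonstration on a scratch environment instead of the globalenv
scratch <- new.env()
assign("a", 1, envir = scratch)
assign("b", 2, envir = scratch)
myenv$clear(scratch)
ls(scratch)                        # character(0)
exists("clear", envir = myenv)     # TRUE: clear() survives
```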

search()

Return Value

The value of the last expression evaluated within a function is returned by the function and is therefore available for assignment. Functions can return only a single value but, in practice, this is not a limitation as a list containing any number of objects can be returned.

Objects can be returned visible or invisible. This option has no effect on the assignment side but affects the way results are displayed when the function is called.

Sometimes, we may want a function that does some job but returns nothing. In this case, the return value will be set to NULL and returned as invisible.

Suppose we need a function that prints a message with cat(); we can write:

and use it as:

with neither an assignment nor a printed return value.
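
A minimal sketch, where say() is a hypothetical name:

```r
say <- function(msg) {
  cat(msg, "\n", sep = "")
  invisible(NULL)          # nothing is printed as a return value
}

say("hello")               # prints: hello
res <- say("hello")        # the value is still there if we ask for it
is.null(res)               # TRUE
```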

Operators

Operators in R are simply functions. Specifically, operators are infix functions, as opposed to standard functions, which are prefix: the name of the function comes before its arguments. New operators can be defined as functions with the only constraint that their name must be surrounded by % signs. As a result, a simple operator that concatenates strings can be defined as:
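
For instance, with the hypothetical operator name %+%:

```r
# Backticks are needed because %+% is not a syntactic name
`%+%` <- function(a, b) paste0(a, b)

"foo" %+% "bar"            # "foobar"
`%+%`("foo", "bar")        # same call written in prefix form
```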

A more complex approach, based on R capabilities as an object-oriented programming language, takes advantage of + being a generic function:

As a result, different methods for generic function + can be defined for different classes of objects.

As an example, we may define a class of objects named string:

with a + method that concatenates strings:

and as a result:
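
A sketch of this S3 approach. The constructor string() is an assumption made for illustration:

```r
# Constructor for objects of class "string"
string <- function(x) structure(as.character(x), class = "string")

# Method for the generic + dispatched on class "string"
`+.string` <- function(e1, e2) {
  string(paste0(unclass(e1), unclass(e2)))
}

a <- string("foo")
b <- string("bar")
unclass(a + b)             # "foobar"
```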

Lazy evaluation

Function arguments, with few exceptions, are lazy by default; that is, they are not evaluated when the function is called but only when they are actually used.

Let’s take as an example this simple function where the y argument is never evaluated within the function body:

We can call f() and pass a non-existing object z to argument y. Clearly, evaluating z on its own would result in an error, as z is not defined, but it works as a function argument:

As we can see, y is assigned to z and z does not exist, but R does not return any error. This is because y = z is never evaluated within the function body.
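
In code, assuming no object named z exists:

```r
f <- function(x, y) x + 1   # y is never used in the body
f(x = 0, y = z)             # 1: z is never looked up
```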

As a second example we can consider this basic function that simply prints its arguments:

If we call h() without passing any value to b we see that:

that is: h() returns an error only when the evaluation of b is required. Prior to that, this function works perfectly.

Usually, when a function returns an error as soon as an argument is not provided, even before that argument is evaluated, this is because a check has been programmed within the function body:
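
Both behaviors can be sketched as follows. h_checked() is a hypothetical variant that uses missing() as the control mechanism:

```r
h <- function(a, b) {
  cat("a is:", a, "\n")     # runs fine
  cat("b is:", b, "\n")     # error raised only here, when b is needed
}
res <- tryCatch(h(a = 1), error = function(e) conditionMessage(e))

h_checked <- function(a, b) {
  if (missing(b)) stop("argument 'b' must be supplied")
  cat("a is:", a, "\n")
  cat("b is:", b, "\n")
}
res2 <- tryCatch(h_checked(a = 1), error = function(e) conditionMessage(e))
res2                        # "argument 'b' must be supplied"
```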

More formally, an unevaluated argument is called a promise. A promise is an object made of three slots:

  • a value
  • an expression
  • an environment

Practically, when a function is called, each argument is bound to a promise object that holds the expression supplied for that argument and a pointer to the environment where the expression will eventually be evaluated; once evaluated, the value is stored in the promise and used for the argument symbol.

Evaluation of an argument is required when:

  • Interfacing with foreign language
  • Selecting a method for a generic function
  • An argument needs to be assigned within a function

There is generally no way within R to check whether an object is a promise or not, nor is there a way to determine the environment of a promise.

Lazy evaluation permits flexible handling of missing arguments and computations depending on the expression for the argument rather than its value.

The following example is a good case in point:

This function scales any vector, by default, to the [0,1] range.
The default for argument scale depends on y, which is not defined when the function is called; y is defined as y = x - location before scale is evaluated in y/scale.
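
A sketch consistent with this description. The function name rescale() and the exact defaults are assumptions:

```r
rescale <- function(x, location = min(x), scale = max(y)) {
  y <- x - location     # y is created before scale is ever needed
  y / scale             # scale = max(y) is evaluated only here
}

rescale(c(2, 4, 6))     # 0.0 0.5 1.0
```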

Function delayedAssign() offers a direct mechanism for accessing the promise mechanism outside a function.
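
For example, with a hypothetical variable named lazy:

```r
# The expression is stored as a promise and not evaluated yet
delayedAssign("lazy", {cat("evaluating now\n"); 42})

lazy    # prints "evaluating now", then returns 42
lazy    # returns 42 only: the value has been cached
```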

Function calls

Functions in R can be called directly or by means of a second function such as do.call(), for example by passing a string corresponding to the function name.

do.call()

Function do.call() takes as input two arguments:

  • either a function or a non-empty character string naming the function to be called.
  • a list of arguments to the function call

Basically:

corresponds to:
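
As an illustration, the following calls are equivalent:

```r
do.call("sum", list(1, 2, 3))   # function given by name: 6
do.call(sum, list(1, 2, 3))     # function given as an object: 6
sum(1, 2, 3)                    # the direct call: 6

# Named list elements become named arguments
do.call(paste, list("a", "b", sep = "-"))   # "a-b"
```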

Example: Maximum Likelihood Estimates

As an example, we may consider a maximum likelihood estimator for normal distributions:

We can re-implement the estimator by using do.call():

The distribution name can be passed as an argument to mle() and, as a consequence, to do.call(), at the cost of a minor modification inside mle().

Now it works with most two-parameter distributions, assuming that a suitable initial theta is provided.
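
A rough sketch of such an estimator, not necessarily the original implementation. It assumes that the density function is named "d" followed by the distribution name, as in the stats package, and it uses optim() with its default settings:

```r
mle <- function(x, theta, dist = "norm") {
  dens <- paste0("d", dist)            # e.g. "dnorm", "dgamma"
  neg_loglik <- function(theta) {
    -sum(do.call(dens, list(x, theta[1], theta[2], log = TRUE)))
  }
  # warnings from invalid trial parameters (e.g. negative sd) are muted
  suppressWarnings(optim(theta, neg_loglik)$par)
}

set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)
est <- mle(x, theta = c(1, 1))
est     # estimates should be close to 5 and 2
```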

Clearly, this is a good generalization given the programming effort required.

match.call()

Function match.call() is used within functions and simply returns the call that was passed to the function, with all arguments specified by their full names.

Any call is an object of class call that can be explored as a list object:

Call objects can also be manipulated as lists.
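
A short sketch with a hypothetical function f():

```r
f <- function(x, y, ...) match.call()

cl <- f(1, y = 2, z = 3)
cl                  # f(x = 1, y = 2, z = 3)
class(cl)           # "call"
as.list(cl)         # the function name followed by its arguments

# Modify the call as a list, then evaluate it
cl$y <- 10
cl[[1]] <- as.name("sum")
eval(cl)            # 14, i.e. sum(x = 1, y = 10, z = 3)
```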

Example: Function anyway()

As an example, we consider a function with two arguments a, b that returns, in case both arguments are numeric, the sum of the arguments; the character variable "a+b" otherwise.

Example: Function write.csv() revisited

As a second application of do.call() we consider write.csv(). This function is a wrapper to write.table() forcing sep = "," and dec = ".".

Such a function could be easily written as:

Nevertheless, if we try to pass any of the sep or dec arguments via the “...” argument, the function returns an error:

Basically, the “...” argument may take any argument to be passed to write.table() except sep and dec. If either of these arguments is explicitly passed to the function, it has to be forced to the desired default value: "," or ".".

A simplified version of write.csv() can be re-written as:

Now, sep = ";" is simply ignored.
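
One way to sketch this with do.call(). The name my_write_csv() is hypothetical, and the real write.csv() is implemented differently:

```r
my_write_csv <- function(...) {
  args <- list(...)
  # force the two separators, overriding anything passed by the user
  args[c("sep", "dec")] <- list(",", ".")
  do.call(write.table, args)
}

tmp <- tempfile(fileext = ".csv")
my_write_csv(head(cars, 2), file = tmp, sep = ";", row.names = FALSE)
out <- readLines(tmp)
out[2]              # "4,2": the comma is used despite sep = ";"
unlink(tmp)
```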

Recursive functions

A recursive function uses recursion: it calls itself until a certain condition is met.

As an example, we may consider a function that takes x as an argument and keeps dividing it by 2 as long as the result is greater than 2. This idea can be implemented by a simple while() loop:

or alternatively by the use of function Recall(): a placeholder for the name of the function in which it is called. It allows the definition of recursive functions which still work after being renamed:

in this case Recall(x) is equivalent to one_r(x).
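
Both versions can be sketched as below. The name one_w() for the loop version is an assumption:

```r
# Iterative version
one_w <- function(x) {
  while (x > 2) x <- x / 2
  x
}

# Recursive version: Recall() stands for the current function
one_r <- function(x) {
  if (x > 2) Recall(x / 2) else x
}

one_w(20)       # 1.25  (20 -> 10 -> 5 -> 2.5 -> 1.25)
one_r(20)       # 1.25

# Thanks to Recall(), the function keeps working after renaming
two <- one_r
rm(one_r)
two(20)         # 1.25
```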

The use of recursion may look redundant in this simple example, but it gives an idea of how much a function can change by the simple introduction of this concept.

When dealing with more complex problems, the use of recursion may indeed help to simplify our code.

Example: Quicksort

A good example of the advantages, and possibly disadvantages, of using recursive functions is the implementation of quicksort: a divide and conquer algorithm that first divides a large list into two smaller sub-lists, the low elements and the high elements. Quicksort can then recursively sort the sub-lists.

As a simple implementation we may consider:

that results in:
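
A compact recursive sketch, taking the first element as pivot (a simplifying assumption, not necessarily the original code):

```r
quicksort <- function(x) {
  if (length(x) <= 1) return(x)       # nothing left to sort
  pivot <- x[1]
  rest  <- x[-1]
  c(quicksort(rest[rest < pivot]),    # low elements
    pivot,
    quicksort(rest[rest >= pivot]))   # high elements
}

quicksort(c(3, 6, 1, 8, 2, 9, 2))     # 1 2 2 3 6 8 9
```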

Note that its non recursive implementation could be:

that keeps working

but does not express the same level of clarity.

Moreover, when looking at performance, the recursive version may also compare favorably in this case:

Example: Left join

Suppose we want to implement a left join between three data frames:

we will have to achieve this goal in two steps:

If we have to repeat this task several times, especially with a variable number of data frames, we could define function left_join() as:

and use it as:
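
A recursive sketch based on merge(). The three small data frames are invented for illustration and share a single key column named id:

```r
left_join <- function(...) {
  dfs <- list(...)
  if (length(dfs) == 1) return(dfs[[1]])      # stopping condition
  merged <- merge(dfs[[1]], dfs[[2]], all.x = TRUE)
  # join the result with the remaining data frames
  do.call(left_join, c(list(merged), dfs[-(1:2)]))
}

a <- data.frame(id = 1:3, x = c("a", "b", "c"))
b <- data.frame(id = c(1, 3), y = c(10, 30))
d <- data.frame(id = 2:3, z = c(TRUE, FALSE))

res <- left_join(a, b, d)
names(res)      # "id" "x" "y" "z"
nrow(res)       # 3: every row of a is kept
```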

Replacement functions

Given any f() function sometimes we are allowed to write expressions like: f(x) <- y. For example, given any data.frame:

we can query for the names of the variables within the data.frame by:

in order to replace variables names, we often use:

This is possible because a function `names<-`() exists; it is known as the replacement method for names().
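
With a small illustrative data frame:

```r
df <- data.frame(a = 1:2, b = 3:4)
names(df)                          # "a" "b"

names(df) <- c("x", "y")           # the usual replacement syntax
names(df)                          # "x" "y"

# The same operation written as an explicit function call
df <- `names<-`(df, c("u", "v"))
names(df)                          # "u" "v"
```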

In principle any replacement function takes the general form of: "f<-"(x, value) with value being the replacement argument.

Example: Trim and replace

As an example, we may consider function trim() that trims any vector at the quantile corresponding to the p (probability) argument:

A simple replacement method for this function can be written as:

and can be used as:
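
A sketch of both functions, assuming that trimming means keeping the values up to the p-th quantile and that the replacement method overwrites the values above it:

```r
trim <- function(x, p) x[x <= quantile(x, p)]

# Replacement method: the last argument must be called value
`trim<-` <- function(x, p, value) {
  x[x > quantile(x, p)] <- value
  x
}

x <- 1:10
trim(x, p = 0.5)         # 1 2 3 4 5
trim(x, p = 0.5) <- 0
x                        # 1 2 3 4 5 0 0 0 0 0
```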

Replacing non assigned objects

Note that using replacement functions requires that the object passed as argument x exists in the calling environment of the function. As a proof of concept we can see that:

works normally, while

does not work as the data.frame to be modified does not exist.