Introduction
Assuming that we are all familiar with classic R objects such as vectors, matrices, lists, data.frames, etc., this chapter considers a critical type of object: environments.
Within the R computation mechanism, environments play a crucial role, as R constantly uses them behind the scenes of interactive computation.
An environment is an object that maps variable names to values. Each mapping is called a binding.
Being able to understand and manage environments is a key step in the R programming learning curve.
Environments in R
The definition of an environment is stated in the R Language Definition manual:
Environments can be thought of as consisting of two things.
- A frame, consisting of a set of symbol-value pairs,
- an enclosure, a pointer to an enclosing environment.
A frame is a set of objects, each associated with a name, where a name is a simple character string. In practice, we can therefore think of an environment as a self-contained portion of memory containing a frame. Each environment points to one and only one other environment, known as its parent (or enclosing) environment.
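As a minimal sketch of these two components, the snippet below builds an environment by hand and inspects its frame and its enclosure. The names e, frame_names and enclosure are illustrative only, and the parent is set explicitly so that the result does not depend on where the code is run.

```r
# Create an environment whose enclosure is explicitly the global environment
e <- new.env(parent = globalenv())

# Add one binding (symbol "a" -> value 1) to its frame
assign("a", 1, envir = e)

frame_names <- ls(e)          # the symbols stored in the frame
enclosure   <- parent.env(e)  # the enclosing (parent) environment
```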
Environments in R are created, and eventually destroyed, under many circumstances.
Any R session has an associated environment known as the global environment, as returned by functions globalenv() and environment():
```r
globalenv()
## <environment: R_GlobalEnv>
environment()
## <environment: R_GlobalEnv>
```
When we work with R in interactive mode, we use the frame of the global environment as a container for our objects:
```r
x <- 0
ls.str(globalenv())
## x :  num 0
```
Any package has at least one environment:
```r
as.environment("package:stats")
## <environment: package:stats>
## attr(,"name")
## [1] "package:stats"
## attr(,"path")
## [1] "/usr/lib/R/library/stats"
```
Almost all functions have an environment as part of their definition:
```r
environment(mean)
## <environment: namespace:base>
```
User-defined functions have an environment too:
```r
f <- function() NULL
environment(f)
## <environment: R_GlobalEnv>
```
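The word "almost" matters here: primitive functions, which are implemented directly in C, carry no environment. The following sketch, which uses sum() purely as an example of a primitive, illustrates the difference.

```r
# sum() is a primitive: it has no R-level environment
sum_is_primitive <- is.primitive(sum)
sum_env <- environment(sum)        # NULL for primitives

# mean() is an ordinary closure and therefore has an environment
mean_env <- environment(mean)
```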
Function environmentName() returns the name of an environment. As a result, we may query R for the name of the environment of function f():
```r
environmentName(environment(f))
## [1] "R_GlobalEnv"
```
or for the name of the environment associated with a package:
```r
require("scuba", quietly = TRUE)
## scuba 1.7-0
## Type 'help(scuba)' for an introduction
## Read the warnings in 'help(scuba.disclaimer)'
## For news about changes to the package, type 'news(package='scuba')'
environmentName(as.environment("package:scuba"))
## [1] "package:scuba"
```
Unfortunately, function environmentName() does not always return the expected result:
```r
env <- new.env()
environmentName(env)
## [1] ""
```
Environment names for packages and namespaces are assigned internally, at the C level. User-created environments, on the other hand, have no name by default, so environmentName() returns an empty string for them. Despite its possibly misleading name, environmentName() only reads a name: it has no replacement form, so a call such as environmentName(env) <- "my_env" is not available. The function is mainly meant for packages and namespaces; for other environments it reports a name only when a "name" attribute has been set on them, like the one visible in the package:stats output above.
The environment tree structure
The definition of environment also states that an environment has an enclosure: a pointer to an enclosing environment. As a consequence, any environment has a parent environment which, being an environment itself, has a parent of its own. This chain of parent environments, known as the environment tree structure, ends at a special environment called the empty environment which, as its name suggests, contains no objects and has no parent.
R has a very useful function, parent.env(), that returns the parent of any given environment:
```r
parent.env(globalenv())
## <environment: package:scuba>
## attr(,"name")
## [1] "package:scuba"
## attr(,"path")
## [1] "/home/andrea/R/x86_64-pc-linux-gnu-library/3.1/scuba"
```
To visualize the environment tree structure, we can define a function that prints this structure starting from any given environment:
```r
tree <- function(env) {
  # Print the name of the current environment
  cat("+ ", environmentName(env), "\n")
  # Move up to the parent until the empty environment is reached
  if (environmentName(env) != environmentName(emptyenv())) {
    env <- parent.env(env)
    Recall(env)
  }
  invisible(NULL)
}
```
The above function makes use of function Recall(), which will be examined in the chapter dedicated to functions.
We can test tree() with globalenv() as its argument:
```r
tree(env = globalenv())
## + R_GlobalEnv
## + package:scuba
## + package:knitr
## + package:stats
## + package:graphics
## + package:grDevices
## + package:utils
## + package:datasets
## + Autoloads
## + base
## + R_EmptyEnv
```
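The same walk can also be written without recursion. The sketch below is an illustrative alternative, not part of the original tree() function: the helper name env_chain() is hypothetical, and it returns the environment names as a character vector instead of printing them.

```r
env_chain <- function(env) {
  names_found <- character(0)
  repeat {
    # Record the name of the current environment ("" if it has none)
    names_found <- c(names_found, environmentName(env))
    # Stop once the empty environment, which has no parent, is reached
    if (identical(env, emptyenv())) break
    env <- parent.env(env)
  }
  names_found
}

chain <- env_chain(globalenv())
```

Returning a vector rather than printing makes the result easy to test or to process further, for instance with length(chain).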
Alternatively, we may use the built-in function search(), which returns similar results. Note that search() labels the global environment .GlobalEnv and the base environment package:base, and does not list the empty environment:
```r
search()
##  [1] ".GlobalEnv"        "package:scuba"     "package:knitr"
##  [4] "package:stats"     "package:graphics"  "package:grDevices"
##  [7] "package:utils"     "package:datasets"  "Autoloads"
## [10] "package:base"
```
When we attach a list, usually a data.frame, we actually insert an entry in the environment tree structure at the position given by the pos argument of function attach(). As this argument defaults to pos = 2L, most of the time we attach just underneath the global environment:
```r
attach(data.frame(NULL))
search()
##  [1] ".GlobalEnv"        "data.frame(NULL)"  "package:scuba"
##  [4] "package:knitr"     "package:stats"     "package:graphics"
##  [7] "package:grDevices" "package:utils"     "package:datasets"
## [10] "Autoloads"         "package:base"
```
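Since an attached object stays on the search path until it is removed, it is usually worth pairing attach() with detach(). The following sketch uses a hypothetical list and the label demo_list, passed through the name argument of attach(), to show the full round trip.

```r
# Attach a small list at the default position (pos = 2L) under a chosen label
attach(list(demo_value = 42), name = "demo_list")

position <- match("demo_list", search())  # where the entry sits on the search path
found    <- get("demo_value")             # the attached value is now reachable by name

# Remove the entry again so that it no longer affects later look-ups
detach("demo_list", character.only = TRUE)
still_attached <- "demo_list" %in% search()
```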
When loading packages, functions library() and require() work on a similar basis and use the same default, pos = 2L:
```r
library(MASS)
search()
##  [1] ".GlobalEnv"        "package:MASS"      "data.frame(NULL)"
##  [4] "package:scuba"     "package:knitr"     "package:stats"
##  [7] "package:graphics"  "package:grDevices" "package:utils"
## [10] "package:datasets"  "Autoloads"         "package:base"
```
How R looks for objects
When R looks for an object, that is, a symbol-value pair, it looks by default for a matching symbol in the current environment and, if a matching symbol is found, returns the corresponding value.
If we want the search to start from a different environment, we are usually able to specify it directly. As an example, consider the well-known function get(), whose envir argument specifies which environment to search, at least as a starting point.
As a result, we can create an object named Formaldehyde in the current environment:
```r
Formaldehyde <- data.frame()
```
and use get() to find it, specifying the environment in which to look:
```r
get("Formaldehyde", envir = globalenv())
## data frame with 0 columns and 0 rows
```
Note that an object with the same name exists in the environment of package:datasets, and we can find it by specifying the right environment:
```r
get("Formaldehyde", envir = as.environment("package:datasets"))
##   carb optden
## 1  0.1  0.086
## 2  0.3  0.269
## 3  0.5  0.446
## 4  0.6  0.538
## 5  0.7  0.626
## 6  0.9  0.782
```
When R does not find the required symbol in the current environment, it looks in the parent environment, then in the parent of the parent, and so on until it either finds the symbol or reaches the empty environment. In the latter case, since the empty environment by definition contains no objects, R returns an error.
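The two outcomes described above can be observed directly. In the sketch below, child is a hypothetical, empty environment whose parent is the global environment; exists() reports whether a symbol can be found with and without following the chain of parents, and tryCatch() captures the error raised when the chain is exhausted.

```r
child <- new.env(parent = globalenv())

# Look only in the frame of `child`: pi is not there
in_child <- exists("pi", envir = child, inherits = FALSE)

# Follow the parents as well: pi is eventually found further up the chain
via_parents <- exists("pi", envir = child, inherits = TRUE)

# A symbol that exists nowhere makes get() fail once the empty environment is reached
lookup_error <- tryCatch(
  get("a_symbol_that_does_not_exist", envir = child),
  error = function(e) conditionMessage(e)
)
```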
Given this search mechanism, R stops searching as soon as it finds an object with the corresponding name, ignoring any object with the same name in environments further up the environment tree structure.
This effect, known as masking, may lead to quite embarrassing situations.
As a very simple example, suppose we define a function that computes the length of a circumference given its radius:
```r
circumference <- function(radius) 2*pi*radius
```
and that, at some point in our working session, we define:
```r
pi <- 0
```
The result we get is quite embarrassing:
```r
circumference(1)
## [1] 0
```
In this case the object pi in the global environment:
```r
get("pi", envir = globalenv())
## [1] 0
```
masks the same symbol in the base environment:
```r
get("pi", envir = baseenv())
## [1] 3.142
```
A robust way to reduce the risk of masking is to specify the package we are calling objects from. We can achieve this with the :: operator:
```r
circumference <- function(radius) 2*base::pi*radius
circumference(1)
## [1] 6.283
```
Finally, all conflicts are listed by:
```r
conflicts()
## [1] "Formaldehyde" "npk"          "pi"
```
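Once a masking object is no longer needed, removing it makes the original binding visible again. The sketch below repeats the pi example under that assumption; the variable names masked and restored are illustrative.

```r
pi <- 0                 # masks base::pi in the current environment
masked <- 2 * pi * 1    # uses the masking value

rm(pi)                  # removes only the masking object; base::pi is untouched
restored <- 2 * pi * 1  # the look-up now reaches base::pi again
```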
Computing with Environments
As we have seen, environments are an essential component of the R working mechanism. It should therefore come as no surprise that environments are themselves R objects.
As R objects, environments can be created:
```r
env <- new.env()
```
and eventually deleted:
```r
rm(env)
```
The frame component of an environment can be used as a placeholder for objects, almost as we do with lists. After creating the environment again, we can place objects within it in at least three different ways:
by using the $ operator:
```r
env <- new.env()  # env was removed above, so we create it again
env$zero <- 0
```
by using function with():
```r
with(env, one <- 1)
```
by using function assign():
```r
assign("three", 3, envir = env)
```
Finally, we can browse environment env with the standard functions ls() or ls.str() to check our result:
```r
ls(env)
## [1] "one"   "three" "zero"
ls.str(env)
## one :  num 1
## three :  num 3
## zero :  num 0
```
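Reading objects back is symmetrical to storing them. As a brief sketch, using a hypothetical environment e rather than env, values can be retrieved with $, [[, get() or mget(), and removed with rm():

```r
e <- new.env()
e$zero <- 0
assign("three", 3, envir = e)

by_dollar  <- e$zero                               # access by symbol
by_bracket <- e[["three"]]                         # access by character name
by_get     <- get("three", envir = e)              # same look-up through get()
several    <- mget(c("zero", "three"), envir = e)  # several objects, as a named list

rm("zero", envir = e)   # delete a single binding from the frame
remaining <- ls(e)      # names still present in e
```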
Suppose we want to store several objects at once in an environment. We may define a function fill_envir() that saves any series of named objects within an environment:
```r
fill_envir <- function(..., envir = globalenv()) {
  # Collect the named arguments into a list
  this_list <- list(...)
  # Assign each value under its name in the target environment
  Map(function(...) assign(..., envir = envir), names(this_list), this_list)
  invisible(NULL)
}
```
The above function takes ... as its argument and internally uses function Map() with an anonymous function as first argument. All these concepts will be explained in detail in the next chapters.
Using function fill_envir(), we can create a new environment and subsequently fill it with objects:
```r
env1 <- new.env()
fill_envir(one = 1, seven = 7, envir = env1)
ls.str(env1)
## one :  num 1
## seven :  num 7
```
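For comparison, base R already ships a function with a similar purpose, list2env(), which copies the components of a named list into an environment. A minimal sketch, with an illustrative environment e:

```r
e <- new.env()
list2env(list(one = 1, seven = 7), envir = e)  # store both objects at once

stored <- ls(e)        # names now present in e
back   <- as.list(e)   # and back to a list (component order is not guaranteed)
```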
As we do with list(), we may also want a function envir() that directly creates an environment with named objects inside:
```r
envir <- function(..., hash = TRUE, parent = parent.frame(), size = 29L) {
  # Create the environment, then fill it with the named arguments
  envir <- new.env(hash = hash, parent = parent, size = size)
  fill_envir(..., envir = envir)
  return(envir)
}
```
Note that we have used the newly created function fill_envir() within the body of envir(): writing modular functions that can be reused within new functions is key to producing efficient R code.
Function envir() is now ready to be used for creating new environments:
```r
env2 = envir(six = 6, seven = 7)
ls.str(env2)
## seven :  num 7
## six :  num 6
```
Up to now we have seen that environments behave very similarly to lists but, at this point, we must point out at least three differences between environments and lists.
First of all, within an environment all objects must have a name, while lists do not impose this restriction. In fact, we can create a list with unnamed components:
```r
list(0, 1)
## [[1]]
## [1] 0
##
## [[2]]
## [1] 1
```
but we cannot do the same with environments, either by using function envir():
```r
envir(0, 1)
## Error: zero-length inputs cannot be mixed with those of non-zero length
```
or by any other approach that attempts to create nameless objects within an environment.
This sounds quite logical, as the definition of environment states that environments consist of a frame, or collection of named objects.
Secondly, we may have lists with duplicated component names:
```r
l <- list(x = 0, x = 1)
l$x
## [1] 0
```
This idea may look strange, but it is a basic example of masking within R.
When we repeat the same experiment with environments, we observe a different behavior:
```r
env <- envir(x = 0, x = 1)
ls.str(env)
## x :  num 1
```
In this case, the second argument, x = 1, simply reassigns a different value to x.
Finally, as opposed to lists, within environments the order in which objects were placed does not matter. The frame is a collection of named objects and only names matter. As a consequence, ls() and ls.str() display the objects of an environment in alphabetical order:
```r
env <- envir(b = 2, a = 1)
ls.str(env)
## a :  num 1
## b :  num 2
```
The second part of the definition of environment states that an environment has an enclosure: a pointer to another environment.
As a consequence, any newly created environment has a parent environment.
Unless otherwise specified, the parent of a newly created environment is the environment in which it was created.
As a result, if we create an environment, say env0, within the global environment, the latter becomes the parent of env0:
```r
env0 <- new.env()
parent.env(env0)
## <environment: R_GlobalEnv>
```
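The same rule applies inside functions. In the sketch below, the hypothetical helper parent_is_caller() creates an environment in its body and checks that the parent is the function's own execution environment rather than the global one:

```r
parent_is_caller <- function() {
  here <- environment()  # the execution environment of this call
  e <- new.env()         # parent defaults to the calling frame, i.e. `here`
  identical(parent.env(e), here)
}

result <- parent_is_caller()
```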
We can pass env0 as an argument to the tree() function:
```r
tree(env0)
## +
## + R_GlobalEnv
## + package:MASS
## + data.frame(NULL)
## + package:scuba
## + package:knitr
## + package:stats
## + package:graphics
## + package:grDevices
## + package:utils
## + package:datasets
## + Autoloads
## + base
## + R_EmptyEnv
```
Note that the name of environment env0 is not returned by function environmentName(). This happens because env0 was created without a name: environment names for packages and namespaces are set internally at the C level, and no assignment or replacement form of environmentName() exists, at the moment, for environments.
We may even create an environment, say env1, and explicitly declare its parent environment:
```r
env1 <- new.env(parent = baseenv())
```
Again, we can use function tree() to see the effect of the previous statement:
```r
tree(env1)
## +
## + base
## + R_EmptyEnv
```
It should be clear at this point that the structure of the environment tree is a key element of the R programming mechanism.
These concepts will be recalled very often in the chapter dedicated to functions.
Copy on modify
When we make an assignment, R reserves a portion of memory for that object. We can display the memory address of an object by using a short function:
```r
# Extract the address and type tag from the internal inspect() output
mem_add <- function(x) substring(capture.output(.Internal(inspect(x))), 2, 17)
```
Despite its cryptic output, mem_add() allows us to verify that, given two different objects:
```r
x <- 0
y <- 0
```
they have different memory addresses:
```r
mem_add(x)
## [1] "2108658 14 REALS"
mem_add(y)
## [1] "e15a78 14 REALSX"
```
and that, given:
```r
x <- 0
```
if we assign:
```r
y <- x
```
they share the same memory address:
```r
identical(mem_add(x), mem_add(y))
## [1] TRUE
```
That is, when vector x is assigned to y, no copy is made: both names refer to the same memory address.
When existing objects are modified, R objects usually follow copy-on-modify semantics; that is, the modified object is written to a different memory address.
In practice, given an object:
```r
x <- 1:5
```
with its address:
```r
mem_add(x)
## [1] "11d9830 13 INTSX"
```
if we modify it:
```r
x[3] <- 0L
```
R changes its address too:
```r
mem_add(x)
## [1] "14a8960 13 INTSX"
```
If we apply the same concept to lists, given a list:
```r
list0 <- list(x = 0)
```
and its copy:
```r
list1 <- list0
```
both lists share the same address:
```r
identical(mem_add(list0), mem_add(list1))
## [1] TRUE
```
but, if we modify list1:
```r
list1$x <- 1
```
list0 and list1 now have different addresses:
```r
identical(mem_add(list0), mem_add(list1))
## [1] FALSE
```
This mechanism, copy on modify, preserves the value of list0 even if list1 is modified.
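The practical consequence can be checked without looking at memory addresses at all. In this small sketch, with illustrative names original and modified, changing the second object leaves the first one untouched:

```r
original <- list(x = 0, y = c(1, 2, 3))
modified <- original       # no copy is needed at this point

modified$y[1] <- 100       # only `modified` is changed

original_y <- original$y   # the original values are preserved
modified_y <- modified$y   # the new values live in `modified` only
```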
Prior to R 3.1, modifying a list caused the entire list to be copied. Version 3.1 brought a welcome change that clearly helps keep memory usage under control.
Suppose we have two copies of a list made of more than one element:
```r
list0 <- list(x = 1:100, y = rpois(100, 100))
list1 <- list0
```
and we modify only the second vector, y, of the second list:
```r
list1$y[1] <- 0L
```
We can now observe that the memory address changes only for the second element of the list, while list0 and list1 keep sharing the same address for the first vector, x:
```r
lapply(list0, mem_add)
## $x
## [1] "209b9f0 13 INTSX"
##
## $y
## [1] "1e0be70 13 INTSX"
lapply(list1, mem_add)
## $x
## [1] "209b9f0 13 INTSX"
##
## $y
## [1] "d6b9d0 13 INTSXP"
```
In conclusion, we could say that R, at least in its newest versions, uses partial copy-on-modify semantics.
The same semantics do not apply to environments; that is, environments do not copy on modify.
As a proof of concept, we can create an environment with an object in it:
```r
env0 <- new.env()
env0$x <- 0
```
Afterward, we assign our newly created environment env0 to a second name, say env1:
```r
env1 <- env0
```
As env1 refers to the same environment as env0, both names give access to the same symbols with the same values:
```r
env0$x
## [1] 0
env1$x
## [1] 0
```
As environments do not copy on modify, if we now modify x within env1:
```r
env1$x <- 1
```
we can easily observe that the value of x within env0 is modified too:
```r
env0$x
## [1] 1
```
The previous example clearly shows that any modification of env1 also affects env0. This is possible because env0 and env1 share the same memory address even after the modification env1$x <- 1:
```r
identical(mem_add(env0), mem_add(env1))
## [1] TRUE
```
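This reference behavior has two practical consequences, sketched below with the hypothetical helpers bump_list() and bump_env(): a function can modify an environment passed to it in place, whereas a list is modified only inside the function; and an independent copy of an environment has to be built explicitly, here by going through as.list() and list2env().

```r
bump_list <- function(l) { l$count <- l$count + 1; l }
bump_env  <- function(e) { e$count <- e$count + 1; invisible(NULL) }

counter_list <- list(count = 0)
invisible(bump_list(counter_list))  # result discarded: the caller's list is unchanged
list_count <- counter_list$count

counter_env <- new.env()
counter_env$count <- 0
bump_env(counter_env)               # the caller's environment is modified in place
env_count <- counter_env$count

# An explicit, independent copy of the frame
env_copy <- list2env(as.list(counter_env, all.names = TRUE), envir = new.env())
env_copy$count <- 99                # changes the copy only
after_copy <- counter_env$count
```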
Hashed environments
When we create a new environment with hash = TRUE, the default value, we create a hashed environment.
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys, to associated values. Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
Hashed environments allow values to be looked up by symbol faster than traditional methods, at the price of building the hash table.
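As a small illustration of an environment used as an associative array, the sketch below stores three key-value pairs and retrieves two of them by key. The keys and values are made up for the example.

```r
lookup <- new.env(hash = TRUE)

keys   <- c("alpha", "beta", "gamma")
values <- c(1, 2, 3)

# Bind each key to its value in the hashed environment
for (i in seq_along(keys)) assign(keys[i], values[i], envir = lookup)

# Retrieve several values by key, in the order requested
picked <- unlist(mget(c("gamma", "alpha"), envir = lookup))
```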
As a proof of concept, consider the following example.
First, we create a simple data frame whose rows represent name-value pairs:
```r
options(stringsAsFactors = FALSE)
n = 10^6
df = data.frame(name = paste("p", 1:n, sep = "."), value = 1:n)
head(df, 3)
##   name value
## 1  p.1     1
## 2  p.2     2
## 3  p.3     3
```
Secondly, we create a new environment and fill it with the name-value pairs so that, for each i from 1 to n, the environment contains an object named p.i with value i:
```r
env = new.env(hash = TRUE)
system.time(
  Map(function(...) assign(..., envir = env), x = df$name, value = df$value)
)
##    user  system elapsed
##  27.435   0.116  27.554
```
As we can see, building the hash table requires a certain amount of computing time.
We now define a random sample of names:
```r
k = 100
what = paste("p", sample(1:n, k), sep = ".")
```
and finally, we want to create a vector out containing the value corresponding to each name. In practice, if we selected what <- c("p.1", "p.2", "p.3") we would like R to return c(1, 2, 3).
To achieve this result we may use either a naive for loop approach:
```r
out <- numeric(k)
system.time({
  for (i in 1:k) {
    out[i] <- df$value[df$name == what[i]]
  }
})
##    user  system elapsed
##   4.703   0.073   4.782
```
or the common R vectorized approach:
```r
system.time({df$value[is.element(df$name, what)]})
##    user  system elapsed
##   0.071   0.001   0.073
```
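One detail of the vectorized version is worth noting: is.element() returns the matching values in the order in which they appear in the data frame, not in the order of what. If the order of the requested names matters, match() is a possible alternative. The sketch below uses a tiny, made-up data frame (small_df) so that the difference is easy to see.

```r
small_df <- data.frame(name = paste("p", 1:10, sep = "."), value = 1:10,
                       stringsAsFactors = FALSE)
wanted <- c("p.7", "p.2", "p.9")

# Values in the order of the data frame rows
in_df_order <- small_df$value[is.element(small_df$name, wanted)]

# Values in the order of the requested names
in_wanted_order <- small_df$value[match(wanted, small_df$name)]
```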
or, finally, the hash approach:
```r
system.time({
  unlist(mget(what, envir = env))
})
##    user  system elapsed
##  22.507   0.064  22.574
```
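In principle, the hash approach offers a clear advantage once we have paid the computational price required for building the hash table, since each look-up goes straight to the bucket of the requested name. Note, however, that the elapsed time reported above (about 22.5 seconds) is far slower than that of the vectorized approach (about 0.07 seconds), so this particular run does not show that advantage.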