Introduction

Extensive documentation on the technical steps needed to build R packages is available online, starting with the manual Writing R Extensions on the official R site.

This section provides a general overview of R packages. The following sections offer some insight into the mechanisms R uses when dealing with packages.

Why create an R package?

According to Rossi1, there are at least three good reasons to create an R package.

  1. Creating an R package forces the user to document the code and provide test examples to ensure that it actually works. The code also becomes much easier to use: documentation is only a ? command away, and all functions and shared libraries are available for use. This is a good reason to create a package even for personal use.

  2. If the goal is to disseminate research, a package is an ideal way of making sure others have access to the work. It also increases the probability that the work will eventually be correct. This is a good reason to create a package for (private) team use.

  3. Giving back something to this amazing community of volunteers! This is a good reason to make the package available for the whole world (e.g. through CRAN).

Structure of a package

The sources of an R package consist of a subdirectory containing files and directories in a well-organized structure.

Files ‘DESCRIPTION’ and ‘NAMESPACE’ and subdirectories ‘R’ and ‘man’ are found in almost every package; a ‘data’ subdirectory is added when the package ships data sets. The package.skeleton() function, from the utils package, creates this basic structure.

The ‘DESCRIPTION’ file contains basic information about the package. The ‘Package’, ‘Version’, ‘License’, ‘Description’, ‘Title’, ‘Author’, and ‘Maintainer’ fields are mandatory; all other fields are optional.
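
As a small illustration, the sketch below reads the ‘DESCRIPTION’ file of an installed package with packageDescription() and reports which of the mandatory fields it finds. The choice of package stats is only an assumption made for the example; any installed package would do.

```r
# Mandatory fields, as listed in the text above
mandatory <- c("Package", "Version", "License", "Description",
               "Title", "Author", "Maintainer")

# Read the DESCRIPTION file of an installed package (stats is just an example)
desc <- utils::packageDescription("stats")

# Which mandatory fields are present in this DESCRIPTION file?
found <- mandatory %in% names(desc)
print(setNames(found, mandatory))

# Show the values of the fields that were found
print(unlist(unclass(desc)[mandatory[found]]))
```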

R has a namespace management system for code in packages. This system allows the package writer to specify which variables in the package should be exported to make them available to package users, and which variables should be imported from other packages. The mechanism for specifying a namespace for a package is through the ‘NAMESPACE’ file in the top level package directory.
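
The content of a ‘NAMESPACE’ file can be inspected from R itself. The following sketch, which assumes package utils as an arbitrary example, parses the file of an installed package with parseNamespaceFile() and prints a few exported names together with the import directives.

```r
# Library directory that holds the installed package (utils is just an example)
lib_dir <- dirname(system.file(package = "utils"))

# Parse the package's NAMESPACE file into a list of directives
ns_info <- parseNamespaceFile("utils", package.lib = lib_dir)

# A few of the exported names, and the content of the import directives
print(head(ns_info$exports))
print(ns_info$imports)
```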

The ‘R’ subdirectory contains only R code files. The package.skeleton() function writes one .R file for each function.

The ‘man’ subdirectory should contain only documentation files for the objects in the package, in R documentation (Rd) format.

The ‘data’ subdirectory contains data files in R format.
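
To see this structure on disk, we can let package.skeleton() generate a throwaway package in a temporary directory and list what it wrote. This is only a sketch: the names demoPkg, add_one and toy_data are hypothetical and exist just for this example.

```r
# Work in a temporary directory so that nothing is left behind
work_dir <- tempfile("pkg-demo-")
dir.create(work_dir)

# An environment holding the objects to be packaged (hypothetical names)
pkg_env <- new.env()
assign("add_one", function(x) x + 1, envir = pkg_env)
assign("toy_data", data.frame(x = 1:3), envir = pkg_env)

# Generate the skeleton: all objects found in pkg_env are used
utils::package.skeleton(name = "demoPkg", environment = pkg_env, path = work_dir)

# List the files and directories that were created
pkg_files <- list.files(file.path(work_dir, "demoPkg"), recursive = TRUE)
print(pkg_files)
```

The generated ‘DESCRIPTION’ and Rd files contain placeholders only and still have to be edited by hand before the package can be built.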

A package may also contain files ‘INDEX’, ‘configure’, ‘cleanup’, ‘LICENSE’, ‘LICENCE’, ‘COPYING’ and ‘NEWS’ and directories ‘demo’, ‘exec’, ‘inst’, ‘po’, ‘src’, and ‘tests’. These subdirectories can be missing, but should not be empty when present.

The sources and headers for the compiled code are in ‘src’. The ‘demo’ subdirectory is for R scripts (for running via demo()) that demonstrate some of the functionality of the package. The contents of the ‘inst’ subdirectory will be copied recursively to the installation directory. Subdirectories of ‘inst’ should not interfere with those used by R. Subdirectory ‘exec’ could contain additional executables the package needs, typically scripts for interpreters such as the shell, Perl, or Tcl. This mechanism is currently used by only a few packages and is still experimental. Subdirectory ‘tests’ is for additional package-specific test code, similar to the specific tests that come with the R distribution. Subdirectory ‘po’ is used for files related to localization.

Creating a package

When everything is well organized, the following steps are required before a package can be created:

  1. the ‘DESCRIPTION’ file must be filled in with the required information,
  2. the ‘NAMESPACE’ file must be filled in with the required information,
  3. the R documentation files must be written,
  4. the sources and headers for the compiled code, if any, must be placed in the ‘src’ directory.

Then, a package can be created. This requires three steps:

  1. build: the shell command R CMD build builds an R source tarball. Temporary files are removed from the source tree of the package and everything is packed into a single file.
  2. check: the shell command R CMD check runs a wide variety of diagnostic checks on the package. Checks may be run before or after the build step.
  3. install: the shell command R CMD INSTALL installs the package into a library and makes it available for use in R. The R function install.packages() can be used instead.

The devtools R package provides many useful functions that help developers build their own packages. Moreover, the RStudio IDE integrates devtools and provides a user interface for building packages.

Writing R documentation files

R objects are documented in files written in “R documentation” (Rd) format, a simple markup language, much of which closely resembles LaTeX, that can be processed into a variety of formats, including LaTeX, HTML and plain text.

An ‘Rd’ file consists of three parts. The header gives basic information about the name of the file, the topics documented, a title, a short textual description and R usage information for the objects documented. The body gives further information, for example on the function’s arguments and return value. Finally, there is an optional footer with keyword information. The header is mandatory. Information is given within a series of sections with standard names (user-defined sections are also allowed). Unless otherwise specified, these should occur only once in an ‘Rd’ file (in any order).

The roxygen2 R package makes in-source documentation possible. According to the package vignettes, Roxygen2 provides a number of advantages over writing .Rd files by hand:

  • Code and documentation are adjacent so when you modify your code, it’s easy to remember that you need to update the documentation.
  • Roxygen2 dynamically inspects the objects that it’s documenting, so it can automatically add data that you’d otherwise have to write by hand.
  • It abstracts over the differences in documenting S3 and S4 methods, generics and classes so you need to learn fewer details.
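
As a minimal sketch of in-source documentation, the function below carries its documentation in roxygen comments (lines starting with #') placed directly above the code. The function add_one() is hypothetical; the tags @param, @return, @examples and @export are standard roxygen2 tags, from which roxygen2 would generate the corresponding Rd file and ‘NAMESPACE’ entry.

```r
#' Add one to a numeric vector
#'
#' @param x A numeric vector.
#' @return A numeric vector of the same length as `x`.
#' @examples
#' add_one(1:3)
#' @export
add_one <- function(x) {
  x + 1
}

# The roxygen comments are plain R comments, so the function works as usual
print(add_one(1:3))
```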

The package structure behind R

Packages provide a mechanism for loading optional code, data and documentation as needed.

An R package can be thought of as the software equivalent of a scientific article: articles are the de facto standard to communicate scientific results, and readers expect them to be in a certain format. R packages are a comfortable way to maintain collections of R functions and data sets (Leisch, 2009)2.

The R distribution itself includes about 30 packages. With regard to the “importance” of a package, packages can be split into three categories.

  • Base packages: part of the R source tree, maintained by R Core.
  • Recommended packages: part of every R installation, but not necessarily maintained by R Core.
  • Contributed packages: all the rest. This does not mean that these packages are necessarily of lesser quality than the above, e.g., many contributed packages on CRAN are written and maintained by R Core members. The goal is simply to try to keep the base distribution as lean as possible.

The installed.packages() function returns a matrix with information about the installed packages. The “Priority” column reports the category each package belongs to: base, recommended, or NA for contributed packages.
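
For instance, the following sketch tabulates the “Priority” column and extracts the names of the base packages. The result depends on the packages installed on the machine where it is run.

```r
# Matrix with one row per installed package
pkgs <- installed.packages()

# Count packages by priority; contributed packages show up as NA
priority <- pkgs[, "Priority"]
print(table(priority, useNA = "ifany"))

# Names of the packages maintained as part of the R source tree
base_pkgs <- rownames(pkgs)[!is.na(priority) & priority == "base"]
print(base_pkgs)
```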

Terms about R packages are often confused. This may help to clarify:

  • Package: a collection of R functions, data, and compiled code in a well-defined format.
  • Library: the directory where packages are installed.
  • Repository: a website providing packages for installation.
  • Source: the original version of a package, with human-readable text and code.
  • Binary: a compiled version of a package, with computer-readable text and code, which may work only on a specific platform.

Package environments

Every R package has three associated environments:

  1. package environment
  2. namespace environment
  3. imports environment

The package environment contains all functions of the package exposed to the end user.

The namespace environment contains all functions of the package, including those that also appear in the package environment. The functions are not duplicated: a function that appears in both environments is a single object, so the two bindings share the same memory address.

As a simple proof of concept, consider a first environment env1 containing a simple function f(), and a second environment env2 with another function f() that is a copy of function f() from env1. With the help of function mem_add(), we can see that env1$f and env2$f share the same memory address.
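
A minimal sketch of this experiment follows. The definition of mem_add() used here is an assumption: it is a stand-in that extracts the address from the output of R's internal inspect() facility, and it may differ from the original helper (obj_addr() from package lobstr offers similar functionality).

```r
# Hypothetical stand-in for mem_add(): return the address R reports for an
# object, taken from the first line of the internal inspect() output
mem_add <- function(x) {
  header <- utils::capture.output(.Internal(inspect(x)))[1]
  sub("^@(\\S+) .*$", "\\1", header)
}

# First environment with a simple function f()
env1 <- new.env()
env1$f <- function() "hello"

# Second environment whose f() is a copy of f() from env1
env2 <- new.env()
env2$f <- env1$f

# Both bindings point to the same object in memory
print(mem_add(env1$f))
print(mem_add(env2$f))
same_address <- identical(mem_add(env1$f), mem_add(env2$f))
print(same_address)
```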

Similarly, with the help of function getAnywhere() we can see that function mean() is located both in the package environment and in the namespace environment of package base.
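
The check on mean() can be sketched as follows, using only base R functions: getAnywhere() reports where the object is found, and identical() confirms that both locations hold the same function.

```r
# Locate every place where an object called "mean" is found
where_mean <- getAnywhere("mean")
print(where_mean$where)

# The function bound in the package environment and the one bound in the
# namespace environment of base are the same object
same_mean <- identical(
  get("mean", envir = baseenv()),
  get("mean", envir = .BaseNamespaceEnv)
)
print(same_mean)
```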

Keeping functions in namespaces rather than in package environments allows the developer to expose to the end user only those functions that are meant to be called directly, and to hide all functions that are meant to be called internally by the exposed ones.

In practice, namespace environments may hold quite a large number of functions. As an example, consider package stats: it exposes 452 objects to the end user, while the corresponding namespace holds 1114 objects (the exact counts vary with the R version).
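
The two counts can be obtained along these lines. This is a sketch that counts the exports declared by the namespace rather than listing the attached package environment, so that it does not depend on stats being attached; the numbers printed depend on the R version in use.

```r
# Objects exposed to the end user: the exports of the stats namespace
n_exported <- length(getNamespaceExports("stats"))

# All objects held by the stats namespace, exported or not
n_namespace <- length(ls(asNamespace("stats")))

print(c(exported = n_exported, namespace = n_namespace))
```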

The imports environment of a package contains objects from other packages that are explicitly stated requirements for a package to work properly. Most packages published on CRAN are not islands; they build on functionality provided in other packages.

We can get the names of the packages that any package requires by using a small variant of function packageDescription(), and test it on package ggplot2.
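
One possible form of such a variant is sketched below. The helper pkg_imports() is hypothetical: it reads the ‘Imports’ field of the ‘DESCRIPTION’ file and strips any version constraints. Since ggplot2 may not be installed everywhere, the sketch first tries the helper on stats and calls it on ggplot2 only when that package is available.

```r
# Hypothetical helper: names of the packages listed in the Imports field
pkg_imports <- function(pkg) {
  field <- utils::packageDescription(pkg, fields = "Imports")
  if (is.na(field)) return(character(0))
  entries <- trimws(strsplit(field, ",")[[1]])
  entries <- sub("\\s*\\(.*$", "", entries)  # drop version constraints
  entries[nzchar(entries)]
}

# A package that ships with every R installation
print(pkg_imports("stats"))

# The package used in the text, if it is installed
if (nzchar(system.file(package = "ggplot2"))) {
  print(pkg_imports("ggplot2"))
}
```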

We can also count the functions within the imports of ggplot2.

Notice that getNamespaceImports() also shows an entry for the base package. This entry is not a function but a simple logical set to TRUE.
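
A sketch of this count is given below. The helper count_imports() is hypothetical; it counts the imported names per entry returned by getNamespaceImports() and skips the logical entry for base. It is demonstrated on stats and applied to ggplot2 only if that package is installed.

```r
# Hypothetical helper: number of imported objects per import entry
count_imports <- function(pkg) {
  imports <- getNamespaceImports(pkg)
  is_names <- vapply(imports, is.character, logical(1))
  vapply(imports[is_names], length, integer(1))
}

# The entry for base is a plain logical, not a set of names
print(getNamespaceImports("stats")[["base"]])
print(count_imports("stats"))

# The package used in the text, if it is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(count_imports("ggplot2"))
}
```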

Calling a function from a package

As the package environment is included in the search path, when we call a function from a package, R looks for that function along the search path until it finds it in the package environment. Nevertheless, if we ask a function from a package environment which environment it belongs to, we get a small surprise.
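
For function sd() of package stats, the query can be sketched as:

```r
# Ask the exported function sd() which environment it belongs to
sd_env <- environment(stats::sd)
print(sd_env)

# It is the namespace environment of stats, not the package environment
is_namespace <- identical(sd_env, asNamespace("stats"))
print(is_namespace)
```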

That is, function sd(), found in the stats package environment, has the stats namespace environment as its environment.

As a consequence, when function sd() runs, a new environment is created whose enclosure is the stats namespace environment and all hidden functions within the namespace become available to function sd().

The namespace environment of any package, like any environment, has a parent. We can query R for the parent of any namespace.

The parent environment of a namespace is a further environment, imports:packagename, whose parent is the namespace of package base.

Finally, the parent of namespace:base happens to be our R_GlobalEnv.
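
The chain just described can be walked step by step with parent.env(). The sketch below uses package stats as an arbitrary example.

```r
# Start from the namespace of a package (stats is just an example)
stats_ns <- asNamespace("stats")

# Its parent is the imports environment of the package
imports_env <- parent.env(stats_ns)
print(environmentName(imports_env))

# The parent of the imports environment is the namespace of base
base_ns <- parent.env(imports_env)
print(base_ns)

# The parent of the base namespace is the global environment
after_base <- parent.env(base_ns)
print(after_base)
```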

The following picture, borrowed from obeautifulcode.com, illustrates the whole chain of environments.

In practice, when we look for a function f() in a package pkg, we find f() in the package environment of pkg. When we call f(), it runs with the namespace of pkg as its environment: the enclosure of the execution environment of f() is therefore the namespace of pkg. Whenever f() calls a second function g(), g() is searched for first in the execution environment and, as g() is reasonably not defined there, R then looks for g() in the namespace; if g() does not belong to package pkg, R looks in the imports environment of pkg.

This search structure makes perfect sense, as it increases the probability of finding any function g() in the shortest possible time.

If g() is not found in the imports of pkg, R looks for g() in the namespace of base. Again, this is very reasonable, as almost all packages rely on the base package.

If we assume that the dependency structure of the package has been built properly, the search should end at the imports environment or, at worst, at the base namespace. In fact, if g() is not found within the namespace of base, the next step in the search mechanism points to our R_GlobalEnv, and after that it moves along the search path until it either finds g() or reaches the empty environment.
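
The lookup described above can be imitated by hand. The sketch below defines a hypothetical helper find_from() that follows the chain of enclosing environments from a starting point and reports the first environment in which a name is bound; it is meant to illustrate the search order, not to reproduce R's internal implementation.

```r
# Hypothetical helper: follow the chain of enclosing environments from
# 'start' and report the name of the first environment that binds 'name'
find_from <- function(name, start) {
  env <- start
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(environmentName(env))
    }
    env <- parent.env(env)
  }
  NA_character_
}

stats_ns <- asNamespace("stats")

# var() is defined by stats itself, so it is found in the namespace
print(find_from("var", stats_ns))

# paste() is not part of stats; it is found later in the chain
print(find_from("paste", stats_ns))

# A name bound nowhere makes the search reach the empty environment
print(find_from("no_such_object_anywhere", stats_ns))
```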

Older versions of R did not implement this idea of namespaces and, as a consequence, the imports environment did not exist. Dependencies between packages were implemented through depends. If a package pkg1 depended on package pkg2, pkg2 was attached to the search path just after package pkg1. Nowadays, a few packages still use this idea of depends. As an example, consider package abc and compare the search path returned by search() before and after attaching it.

If we load package abc version 1.8 and run search() again, we observe how the search path has changed.

Packages nnet, quantreg and MASS are attached because package abc depends on them, while package SparseM is attached because package quantreg depends on it.

Clearly, imports is to be preferred to depends, as it offers a neater structure for R’s search mechanism.

Loading packages

We load a package with a call to either library() or require(). These functions behave very similarly, but they have a few important differences that are worth mentioning.

As a first difference, we can notice the different messages the two functions return when called with a non-existing package.

Beside the message, a more important side effect comes up when we assign the results of these calls to an object.

In case of error, library() does not assign anything, while require() returns FALSE. This behavior of require() allows us to make package loading more robust, especially within the body of a function.

Despite this difference, it’s bad practice to use library() or require() inside a function, because it makes it hard to understand code dependencies. They should either be outside functions or, even better, in the package ‘DESCRIPTION’ file.
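
The difference can be reproduced with a package name that is assumed not to be installed. In the sketch below, the name notAnInstalledPackage and the helper use_pkg() are hypothetical.

```r
# A package name that is assumed not to be installed (hypothetical)
missing_pkg <- "notAnInstalledPackage"

# library() signals an error, so the assignment never takes place
try(lib_value <- library(missing_pkg, character.only = TRUE), silent = TRUE)
lib_assigned <- exists("lib_value", inherits = FALSE)

# require() only warns and returns FALSE, which can be stored and tested
req_value <- suppressWarnings(require(missing_pkg, character.only = TRUE))

print(c(lib_assigned = lib_assigned, req_value = req_value))

# Using the logical result of require() to fail with a clear message
use_pkg <- function(pkg) {
  if (!suppressWarnings(require(pkg, character.only = TRUE))) {
    stop("package '", pkg, "' is needed but is not installed", call. = FALSE)
  }
  invisible(TRUE)
}

use_result <- tryCatch(use_pkg(missing_pkg),
                       error = function(e) conditionMessage(e))
print(use_result)
```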

When loading a package these four actions occur:

  1. The namespace environment is loaded.
  2. A new environment is created: the package environment.
  3. Only exported functions are copied from the namespace to the package environment.
  4. The package environment is then attached to the search path.
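
These steps can be observed separately, since loadNamespace() performs only the first one while library() performs all four. The sketch below assumes package tools, which ships with R but is normally not attached at startup.

```r
# Step 1 only: load the namespace; nothing is added to the search path
tools_ns <- loadNamespace("tools")
attached_before <- "package:tools" %in% search()

# All steps: library() also creates the package environment, copies the
# exported objects into it and attaches it to the search path
library(tools)
attached_after <- "package:tools" %in% search()

# An exported function is the same object in both environments
same_object <- identical(
  get("file_ext", envir = as.environment("package:tools")),
  get("file_ext", envir = tools_ns)
)

print(c(before = attached_before, after = attached_after, same = same_object))
```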

Packages are usually loaded by means of a lazy-loading mechanism. Lazy loading is always used for code in packages but is optional, at the choice of the package maintainer, for datasets in packages. When a package namespace is loaded, the namespace environment is populated with promises for all the named objects, and the objects exported through the ‘NAMESPACE’ file of the package are copied into the package environment; when these promises are evaluated, they load the actual code from a database.

There are separate databases for code and data, stored respectively in the R and data subdirectories. Each database consists of two files, name.rdb and name.rdx. The .rdb file is a concatenation of serialized objects, and the .rdx file contains an index. The objects are stored in a gzip-compressed format.

The loader for a lazy-load database of code or data is function lazyLoad() in the base package.

As an example, we can write a function that loads an existing object as a promise from a .rdb file that is part of an R package, without attaching the package.

We can then use this function to get the promises into our local environment.

Once we have the promise, we can evaluate it to get the value associated with it.

Finally, we can easily load all data and functions from the package .rdb files.
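
The whole sequence can be sketched as follows. The helper load_promise() is hypothetical, and the packages tools and datasets are arbitrary examples chosen because they ship with R. Note that forcing a promise for a function may cause R to load the namespace of the package behind the scenes, although the package is never attached.

```r
# Hypothetical helper: bind one object from a package's code database as a
# promise in 'envir', without calling library() on the package
load_promise <- function(pkg, name, envir = parent.frame()) {
  # Path to <pkg>.rdb / <pkg>.rdx, given without extension
  filebase <- file.path(system.file("R", package = pkg), pkg)
  lazyLoad(filebase, envir = envir, filter = function(nm) nm == name)
  invisible(envir)
}

# Get the promise into a local environment
local_env <- new.env()
load_promise("tools", "file_ext", envir = local_env)

# Accessing the binding forces the promise and fetches the object
file_ext_fun <- local_env$file_ext
print(file_ext_fun("report.tar.gz"))

# Without a filter, every object in the code database is bound
all_code <- new.env()
lazyLoad(file.path(system.file("R", package = "tools"), "tools"),
         envir = all_code)
print(length(ls(all_code)))

# Data sets live in a separate database under the 'data' subdirectory
data_env <- new.env()
lazyLoad(file.path(system.file("data", package = "datasets"), "Rdata"),
         envir = data_env, filter = function(nm) nm == "iris")
print(dim(data_env$iris))
```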