Chapter 10 | Rcpp: Seamless R and C++ Integration - Quantide

Introduction

Sometimes R code just isn’t fast enough and there’s no simply way to make the code any faster.

The Rcpp package provides a consistent API for seamlessly accessing, extending or modifying R objects at the C++ level. The API is a rewritten and extented version of an earlier API which we refer to as the ‘classic Rcpp API’.

Rcpp, R and C++

All examples in this chapter need at least version 0.10.1 of the Rcpp package. This version includes cppFunction and sourceCpp, which makes it very easy to connect C++ to R. You’ll also need a working C++ compiler.

Before using any of Rccp facilities, we need at least:

require(Rcpp)

1	require(Rcpp)

## Loading required package: Rcpp

1	## Loading required package: Rcpp

Moreover, in order to perfom some comparisons, we also load:

require(rbenchmark)

1	require(rbenchmark)

## Loading required package: rbenchmark

1	## Loading required package: rbenchmark

require(microbenchmark)

1	require(microbenchmark)

## Loading required package: microbenchmark

1	## Loading required package: microbenchmark

A simple example with scalars

cppFunction allows developer to write C++ functions in R like this:

cppFunction('
  int add(int x, int y, int z) {
    int sum = x + y + z;
    return sum;
  }'
)

cppFunction('

int add(int x, int y, int z) {

int sum = x + y + z;

return sum;

)

Once the function is defined, its print method displays info about the function

add

add

## function (x, y, z) 
## .Primitive(".Call")(<pointer: 0x2b7921338c10>, x, y, z)

1 2	## function (x, y, z) ## .Primitive(".Call")(<pointer: 0x2b7921338c10>, x, y, z)

and we can use it as any other R function:

add(1, 2, 3)

1	add(1, 2, 3)

## [1] 6

## [1] 6

Some notes about R and C++ that may be worth to recall before moving forward might be:

In C++ we must declare the type of output the function returns.
The classes for the most common types of R vectors are: NumericVector, IntegerVector, CharacterVector and LogicalVector.
Scalars and vectors in C++ are different.
The scalar equivalents of numeric, integer, character and logical R vectors are, respectively, double, int, String and bool in C++.

Brief introduction to vectors

Now let us try a slightly more complex example.

cppFunction('
  double sumC(NumericVector x) {
    int n = x.size();
    double total = 0;
    for(int i = 0; i < n; ++i) {
      total += x[i];
    }
    return total;
  }
')

cppFunction('

double sumC(NumericVector x) {

int n = x.size();

double total = 0;

for(int i = 0; i < n; ++i) {

total += x[i];

}

return total;

}

The above function simply returns the sum of elements of a vector passed as a parameter to function sumC().

Note the use of NumericVector to “say” to sumC() that the argument x is a vector containing floating point (double precision) numeric values.

Note also that the C++ code performs an explicit loop on all the vector elements to calculate the sum.

sumC() function is available in .Global environment, as all R functions created by user:

ls()

ls()

## [1] "add"  "sumC"

1	## [1] "add" "sumC"

And the speed of sumC() is comparable to speed of native R sum() function:

y=rnorm(1000000)
microbenchmark(
  sumC(y),
  sum(y)
)

y=rnorm(1000000)

microbenchmark(

sumC(y),

sum(y)

)

## Unit: milliseconds
##     expr   min    lq median    uq   max neval
##  sumC(y) 1.002 1.012  1.036 1.057 1.198   100
##   sum(y) 1.021 1.047  1.055 1.074 1.201   100

## Unit: milliseconds

## expr min lq median uq max neval

## sumC(y) 1.002 1.012 1.036 1.057 1.198 100

## sum(y) 1.021 1.047 1.055 1.074 1.201 100

Now let us try to produce a yet more complex example: a function that calculates the inner product between two equal-length vectors:

cppFunction('
  double sumP(NumericVector x, NumericVector y) {
    int nx = x.size();
    int ny = y.size();
    double sumup=0;

    if(nx==ny)
    {
      for(int i=0; i < nx; i++)
          sumup += x[i] * y[i];
    }
    else
        sumup = NA_REAL; // NA_REAL: constant of NA value for numeric (double) values
  
    return sumup;
    }'
)

cppFunction('

double sumP(NumericVector x, NumericVector y) {

int nx = x.size();

int ny = y.size();

double sumup=0;

if(nx==ny)

{

for(int i=0; i < nx; i++)

sumup += x[i] * y[i];

}

else

sumup = NA_REAL; // NA_REAL: constant of NA value for numeric (double) values

return sumup;

)

We than compare our newly created function with two equivalent R native approaches:

microbenchmark(
  sumP(rnorm(1E5),rnorm(1E5)),
  sum(rnorm(1E5)*rnorm(1E5)),
  t(rnorm(1E5)) %*% rnorm(1E5)
)

microbenchmark(

sumP(rnorm(1E5),rnorm(1E5)),

sum(rnorm(1E5)*rnorm(1E5)),

t(rnorm(1E5)) %*% rnorm(1E5)

)

## Unit: milliseconds
##                              expr   min    lq median    uq   max neval
##  sumP(rnorm(1e+05), rnorm(1e+05)) 16.19 16.66  16.78 17.18 17.97   100
##  sum(rnorm(1e+05) * rnorm(1e+05)) 16.35 16.64  16.86 17.17 44.39   100
##  t(rnorm(1e+05)) %*% rnorm(1e+05) 17.22 17.51  17.72 18.24 46.14   100

## Unit: milliseconds

## expr min lq median uq max neval

## sumP(rnorm(1e+05), rnorm(1e+05)) 16.19 16.66 16.78 17.18 17.97 100

## sum(rnorm(1e+05) * rnorm(1e+05)) 16.35 16.64 16.86 17.17 44.39 100

## t(rnorm(1e+05)) %*% rnorm(1e+05) 17.22 17.51 17.72 18.24 46.14 100

In above example, the function sumP() seems a just about faster than both sum() and the linear algebra-based calculation. Moreover, some small changes to the function may reveal quite sensible advantages either in terms of memory usage, or in terms of performance:

x <- rnorm(1E8)
y <- rnorm(1E8)

#Test C function
mmC1=gc(reset=TRUE)
tmC=system.time(sumP(x,y))
mmC2=gc()

#Test standard R 
mmR1=gc(reset=TRUE)
tmR=system.time(sum(x*y))
mmR2=gc()

#Test Matrix calculation
mmR3=gc(reset=TRUE)
tmR1=system.time(t(x) %*% y)
mmR4=gc()

x <- rnorm(1E8)

y <- rnorm(1E8)

#Test C function

mmC1=gc(reset=TRUE)

tmC=system.time(sumP(x,y))

mmC2=gc()

#Test standard R

mmR1=gc(reset=TRUE)

tmR=system.time(sum(x*y))

mmR2=gc()

#Test Matrix calculation

mmR3=gc(reset=TRUE)

tmR1=system.time(t(x) %*% y)

mmR4=gc()

The amount of memory requested by sumP() to perform the computations is close to 0: mmC2[2,6]–mmC1[2,6] = 0.1 MB, whereas sum() needed mmR2[2,6]–mmR1[2,6] = 763 MB.

This because the calculation performed by sum() is made of two steps:

Calculate the vector containing the element by element product
Sum the values of products

This produced another side effct: the time requested to perform the inner product alone with sumP() is about 40% of the time requested by sum(), since some time was needed to create the vector and the two calculations “steps” (instead of only one of sumP()) in sum().

The same calculations performed by linear algebra operators return the worst performances both in terms of execution speed and in terms of memory usage.

That is, by a simple function written in C++, we gain sensibily faster and less memory hungry code.

Finally, we could object that the crossprod() R function does the same computations performed by sumP(). This is true, but as an example, this could be equally useful.

Now, let’s try a new problem: we need a function that “generates” (“simulates”) an n-length realization from an AR(1) process with parameter phi, starting from start and having a white noise error with standard deviation equal to sigma.

In this case the return value is not a scalar but a vector. However, the solution is simple:

cppFunction('
  NumericVector arC(double start, int n, double phi, double sigma) {
    RNGScope scope;
    NumericVector out(n);
  
    out[0]=start;
    for(int i = 1; i < n; i++) {
      out[i] = phi*out[i-1]+rnorm(1,0,sigma)[0];
    }
    return out;
  }
')

cppFunction('

NumericVector arC(double start, int n, double phi, double sigma) {

RNGScope scope;

NumericVector out(n);

out[0]=start;

for(int i = 1; i < n; i++) {

out[i] = phi*out[i-1]+rnorm(1,0,sigma)[0];

}

return out;

}

The arC() function will return a n-length vector directly manageable by R.
Notice in above code the use of rnorm() function, that behave exactly as the R rnorm() function.
In Rcpp sugar (an abstraction layer of Rcpp that provide some facilities to the developer) all the main r/p/q/d R “vector” distribution functions are directly available to programmer.
Notice also the RNGScope scope; line of code, which sets the random number generatator on entry to a block and resets it on exit.

Let’s try the function:

set.seed(1000)
phi=.7
ts=arC(10,500,phi,2)
plot(ts,type="l",
     main=paste("Plot of AR(1) Rcpp realization with parameter",phi))
abline(h=0,col="blue",lty=2)
grid()

set.seed(1000)

phi=.7

ts=arC(10,500,phi,2)

plot(ts,type="l",

main=paste("Plot of AR(1) Rcpp realization with parameter",phi))

abline(h=0,col="blue",lty=2)

grid()

Realization of an AR(1) process with Rcpp

op=par(mfrow=c(2,1))
acf(x=ts)
pacf(x=ts)

op=par(mfrow=c(2,1))

acf(x=ts)

pacf(x=ts)

ACF and PACF of AR(1) process instance made with Rcpp

par(op)

par(op)

Obviously, to avoid many calls to rnorm() in above function, a unique call like rnorm(n,0,sigma) can be made, but this requires more RAM:

cppFunction('
  NumericVector arC2(double start, int n, double phi, double sigma) {
    RNGScope scope;
    NumericVector out(n);
    NumericVector rn=rnorm(n,0,sigma);
  
    out[0]=start;
    for(int i = 1; i < n; i++) {
      out[i] = phi*out[i-1]+rn[i];
    }
    return out;
  }
')

cppFunction('

NumericVector arC2(double start, int n, double phi, double sigma) {

RNGScope scope;

NumericVector out(n);

NumericVector rn=rnorm(n,0,sigma);

out[0]=start;

for(int i = 1; i < n; i++) {

out[i] = phi*out[i-1]+rn[i];

}

return out;

}

Assuming, we want to write a similar function in R, we must write more complicated or much slower code. The simplest, and perhaps slowest, solution is:

arR=function(start, n , phi, sigma){
  out=numeric(n)
  out[1]=start
  
  for( i in 2:n){
    out[i]=phi*out[i-1]+rnorm(1,sd=sigma)
  }
  return(out)
}

arR=function(start, n , phi, sigma){

out=numeric(n)

out[1]=start

for( i in 2:n){

out[i]=phi*out[i-1]+rnorm(1,sd=sigma)

}

return(out)

}

And this is the plot:

set.seed(1000)
phi=.7
tsR=arR(10,500,phi,2)
plot(ts,type="l",
     main=paste("Plot of AR(1) R realization with parameter",phi))
abline(h=0,col="blue",lty=2)
grid()

set.seed(1000)

phi=.7

tsR=arR(10,500,phi,2)

plot(ts,type="l",

main=paste("Plot of AR(1) R realization with parameter",phi))

abline(h=0,col="blue",lty=2)

grid()

Realization of an AR(1) process with R

Note that, given the assigned seed, the Rcpp and R functions produce exactly the same numeric output. That means that the compiled C++ code produced by Rcpp is totally integrated with R libraries:

sum(abs(ts-tsR))

1	sum(abs(ts-tsR))

## [1] 0

## [1] 0

A slightly better R implementation of the same function could be:

arR2=function(start, n , phi, sigma){
  out=numeric(n)
  out[1]=start
  err=rnorm(n,sd=sigma)
  
  for( i in 2:n){
    out[i]=phi*out[i-1]+err[i]
  }
  return(out)
}

arR2=function(start, n , phi, sigma){

out=numeric(n)

out[1]=start

err=rnorm(n,sd=sigma)

for( i in 2:n){

out[i]=phi*out[i-1]+err[i]

}

return(out)

}

The four above functions give the following performance results:

(mb=microbenchmark(
  arC(10,500,phi,2),
  arC2(10,500,phi,2),
  arR(10,500,phi,2),
  arR2(10,500,phi,2)
))

(mb=microbenchmark(

arC(10,500,phi,2),

arC2(10,500,phi,2),

arR(10,500,phi,2),

arR2(10,500,phi,2)

))

## Unit: microseconds
##                   expr     min      lq  median      uq     max neval
##   arC(10, 500, phi, 2)   50.39   54.11   56.86   59.74   71.03   100
##  arC2(10, 500, phi, 2)   33.12   35.13   36.51   38.93   59.90   100
##   arR(10, 500, phi, 2) 1940.79 2002.79 2029.68 2060.74 3306.61   100
##  arR2(10, 500, phi, 2)  625.46  651.32  665.47  694.64 2106.98   100

## Unit: microseconds

## expr min lq median uq max neval

## arC(10, 500, phi, 2) 50.39 54.11 56.86 59.74 71.03 100

## arC2(10, 500, phi, 2) 33.12 35.13 36.51 38.93 59.90 100

## arR(10, 500, phi, 2) 1940.79 2002.79 2029.68 2060.74 3306.61 100

## arR2(10, 500, phi, 2) 625.46 651.32 665.47 694.64 2106.98 100

with average performances from 57 to 13 times better for C++ code.

Using sourceCpp

To simplify the initial presentation, the examples in this chapter have used inline C++ via cppFunction(). For real problems, it’s usually easier to use standalone C++ files and then source them into R using the sourceCpp() function.
This will enable us to take advantage of text editor support for C++ files (e.g. syntax highlighting) as well as make it easier to identify the line numbers of compilation errors.
Standalone C++ files can also contain embedded R code in special C++ comment blocks. This is really convenient if you want to run some R test code.

The standalone C++ file should have extension .cpp, and needs to start with:

#include <Rcpp.h> 
using namespace Rcpp;

1 2	#include <Rcpp.h> using namespace Rcpp;

And for each function that we want available within R, we need to prefix it with:

// [[Rcpp::export]]

1	// [[Rcpp::export]]

where the space after // is mandatory

Then using sourceCpp("path/to/file.cpp") will compile the C++ code, create the matching R functions and add them to current session.
Note that these functions, and the ones created using cppFunction(), will not persist across save() and load(), such as when you restore your workspace.

Let’s try to copy/paste following code into a cppExample1.cpp file in current directory, and compile/run the script as below:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double meanC(NumericVector x) {
  int n = x.size();
  double total = 0;

  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total / n;
}

/*** R
  library(microbenchmark)
  x <- runif(1e5)
  microbenchmark(
    mean(x),
    meanC(x))
*/

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]

double meanC(NumericVector x) {

int n = x.size();

double total = 0;

for(int i = 0; i < n; ++i) {

total += x[i];

}

return total / n;

}

/*** R

library(microbenchmark)

x <- runif(1e5)

microbenchmark(

mean(x),

meanC(x))

Note the R code block embedded between /*** R and */.
This code will be executed after the compilation to test the functions.

sourceCpp("cppExample1.cpp")

1	sourceCpp("cppExample1.cpp")

## 
## > library(microbenchmark)
## 
## > x <- runif(1e+05)
## 
## > microbenchmark(mean(x), meanC(x))
## Unit: microseconds
##      expr    min     lq median     uq   max neval
##   mean(x) 189.34 190.27 195.52 197.23 254.4   100
##  meanC(x)  93.32  93.98  96.38  97.18 119.4   100

## > library(microbenchmark)

## > x <- runif(1e+05)

## > microbenchmark(mean(x), meanC(x))

## Unit: microseconds

## expr min lq median uq max neval

## mean(x) 189.34 190.27 195.52 197.23 254.4 100

## meanC(x) 93.32 93.98 96.38 97.18 119.4 100

And now meanC() is available in current workspace:

"meanC" %in% ls()

1	"meanC" %in% ls()

## [1] TRUE

1	## [1] TRUE

meanC

meanC

## function (x) 
## .Primitive(".Call")(<pointer: 0x2b79220f2c50>, x)

1 2	## function (x) ## .Primitive(".Call")(<pointer: 0x2b79220f2c50>, x)

Matrix computations example

Now we want to write a very simple function that calculates a correlation matrix from a data matrix.
For sake of simplicity, we suppose that the input matrix is numeric and that it does not contain NAs.

In Rcpp, the data types for integer, double, logical, and character matrices are IntegerMatrix, NumericMatrix, LogicalMatrix, CharacterMatrix, respectively.

Of course, the return value of our example function shall be an object of NumericMatrix type.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix corrC(NumericMatrix X) {

  Environment stats("package:stats");
  Function corr=stats["cor"];
  NumericMatrix out=corr(X);
  
  return out;
}

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]

NumericMatrix corrC(NumericMatrix X) {

Environment stats("package:stats");

Function corr=stats["cor"];

NumericMatrix out=corr(X);

return out;

}

The above code is saved in cppExample2.cpp source C++ file.
Notice that the R cor() function of stats package has been used within the code. To use the function, an object of class Environment “containing” the stats package has been created, and then from this object a class Function object named corr has been created, where the cor() function of stats package has been assigned.
With these two simple lines of code, the R cor() function has became available to C++ code.
Now we can launch the compilation of code:

sourceCpp("cppExample2.cpp")

1	sourceCpp("cppExample2.cpp")

And now we can compare the performances of R and Rcpp implementations:

X=matrix(rnorm(100000),nrow=1000,ncol=100)
microbenchmark(
  corrC(X),
  as.matrix(cor(X))
)

X=matrix(rnorm(100000),nrow=1000,ncol=100)

microbenchmark(

corrC(X),

as.matrix(cor(X))

)

## Unit: milliseconds
##               expr   min    lq median    uq   max neval
##           corrC(X) 5.214 5.281  5.364 5.425 5.809   100
##  as.matrix(cor(X)) 5.193 5.271  5.344 5.406 5.853   100

## Unit: milliseconds

## expr min lq median uq max neval

## corrC(X) 5.214 5.281 5.364 5.425 5.809 100

## as.matrix(cor(X)) 5.193 5.271 5.344 5.406 5.853 100

In this specific case, the differences are very small, since the two codes actually implement the same function.

A last more complex example

The truncgof library contains functions that allow one to calculate GoF statistics, and related p-values, for truncated distributions. The main drawback of these functions is that they use Montecarlo simulations to calculate p-values.
Since the estimates are often obtained by maximization of likelihood, the p-value calculation become a very time-consuming exercize, since for left-truncated distributions, as in this case, the maximum of likelihood is seldom obtainable in closed form, and then it requires a numeric optimization as provided by optim() or similar functions.

We would then like to speed-up the optimization process, and then the p-value calculation too.

Let us try an example: suppose that we have to estimate via MLE the parameters of a left-truncated lognormal distribution. The left truncated distribution function is:

dtrlnorm=function(x,pars,L=0,log=FALSE){
  if(L<=0){
    L=0
  }
  if(log){
    ret=log(x>=L) + 
      dlnorm(x,meanlog=pars[1],sdlog=pars[2],log=log)-
      log(1-plnorm(L,meanlog=pars[1],sdlog=pars[2]))
  }
  else{
    ret=(x>=L)*
      dlnorm(x,meanlog=pars[1],sdlog=pars[2],log=log)/
      (1-plnorm(L,meanlog=pars[1],sdlog=pars[2]))
  }
  return(ret)
}

dtrlnorm=function(x,pars,L=0,log=FALSE){

if(L<=0){

L=0

}

if(log){

ret=log(x>=L) +

dlnorm(x,meanlog=pars[1],sdlog=pars[2],log=log)-

log(1-plnorm(L,meanlog=pars[1],sdlog=pars[2]))

}

else{

ret=(x>=L)*

dlnorm(x,meanlog=pars[1],sdlog=pars[2],log=log)/

(1-plnorm(L,meanlog=pars[1],sdlog=pars[2]))

}

return(ret)

}

and then the neg-log-likelihood:

nlglik=function(pars,x,L){
  return(
    -sum(dtrlnorm(x=x,pars=pars,L=L,log=TRUE))
    )
}

nlglik=function(pars,x,L){

return(

-sum(dtrlnorm(x=x,pars=pars,L=L,log=TRUE))

)

}

The Rcpp code that calculates the truncated distribution and log-likelihood is below, and is saved as cppExample3.cpp file:

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector dtrlnormC(NumericVector x, NumericVector pars, double L, int lg) {
  
  NumericVector out;
  if(lg==1){
    double tmp=log(1-R::plnorm(L,pars[0],pars[1],1,0)); 
    out= ifelse(x>=L,dlnorm(x,pars[0], pars[1],lg)-tmp,-INFINITY);
  }
  else{
    double tmp=1-R::plnorm(L,pars[0],pars[1],1,0); 
    out= ifelse(x>=L,dlnorm(x,pars[0], pars[1],lg)/tmp,0);
  }
  return out;
}

// [[Rcpp::export]]
NumericVector dtrlnormC1(NumericVector x, NumericVector pars, double L, int lg) {
  int n=x.size();
  NumericVector tmp(n);
  NumericVector out(n);
  
  tmp=dlnorm(x,pars[0], pars[1],lg);
  
  if(lg==1){
    double cnst=log(1-R::plnorm(L,pars[0],pars[1],1,0));
    for(int i=0;i<n;i++){
      if(x[i]>=L){
        out[i]=tmp[i]-cnst;
      }
      else{
        out[i]=-INFINITY;
      }
    }
  }
  else{
    double cnst=1-R::plnorm(L,pars[0],pars[1],1,0);
    for(int i=0;i<n;i++){
      if(x[i]>=L){
        out[i]=tmp[i]/cnst;
      }
      else{
        out[i]=0;
      }
    }
  }
  return out;
}


// [[Rcpp::export]]
double nlglikC(NumericVector pars, NumericVector x, double L){

  return -sum(dtrlnormC(x,pars, L,1));
}

// [[Rcpp::export]]
double nlglikC1(NumericVector pars, NumericVector x, double L){

  return -sum(dtrlnormC1(x,pars, L,1));
}

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]

NumericVector dtrlnormC(NumericVector x, NumericVector pars, double L, int lg) {

NumericVector out;

if(lg==1){

double tmp=log(1-R::plnorm(L,pars[0],pars[1],1,0));

out= ifelse(x>=L,dlnorm(x,pars[0], pars[1],lg)-tmp,-INFINITY);

}

else{

double tmp=1-R::plnorm(L,pars[0],pars[1],1,0);

out= ifelse(x>=L,dlnorm(x,pars[0], pars[1],lg)/tmp,0);

}

return out;

}

// [[Rcpp::export]]

NumericVector dtrlnormC1(NumericVector x, NumericVector pars, double L, int lg) {

int n=x.size();

NumericVector tmp(n);

NumericVector out(n);

tmp=dlnorm(x,pars[0], pars[1],lg);

if(lg==1){

double cnst=log(1-R::plnorm(L,pars[0],pars[1],1,0));

for(int i=0;i<n;i++){

if(x[i]>=L){

out[i]=tmp[i]-cnst;

}

else{

out[i]=-INFINITY;

}

else{

double cnst=1-R::plnorm(L,pars[0],pars[1],1,0);

for(int i=0;i<n;i++){

if(x[i]>=L){

out[i]=tmp[i]/cnst;

}

else{

out[i]=0;

}

return out;

}

// [[Rcpp::export]]

double nlglikC(NumericVector pars, NumericVector x, double L){

return -sum(dtrlnormC(x,pars, L,1));

}

// [[Rcpp::export]]

double nlglikC1(NumericVector pars, NumericVector x, double L){

return -sum(dtrlnormC1(x,pars, L,1));

}

Note that two implementations of the same dtrlnorm* procedure have been reported: the first is slightly less performant, but more “concise”, since it uses some facilities of Rcpp Sugar that allows the developer to use vector functions (see the ifelse() function). The only difference between the two nlglik* functions, instead, is on the call to dtrlnormC or dtrlnormC1.

Now let’s compile the above code:

sourceCpp('cppExample3.cpp')

1	sourceCpp('cppExample3.cpp')

Now, suppose that a sample from a left truncated lognormal distribution has been produced:

set.seed(1000)
dt=rlnorm(n=1000,meanlog=9,sdlog=2)
L=4000
dt=dt[dt>L]

set.seed(1000)

dt=rlnorm(n=1000,meanlog=9,sdlog=2)

L=4000

dt=dt[dt>L]

We can produce some comparison bechmarks between R and C++ implementations

# Start values for optimization process:
init=c(mean(log(dt)),sd(log(dt)))

microbenchmark(
  dtrlnorm(init,x=dt,L=L,log=FALSE),
  dtrlnormC(init,x=dt,L=L,lg=FALSE),
  dtrlnormC1(init,x=dt,L=L,lg=FALSE),
  dtrlnorm(init,x=dt,L=L,log=TRUE),
  dtrlnormC(init,x=dt,L=L,lg=TRUE),
  dtrlnormC1(init,x=dt,L=L,lg=TRUE)
)

# Start values for optimization process:

init=c(mean(log(dt)),sd(log(dt)))

microbenchmark(

dtrlnorm(init,x=dt,L=L,log=FALSE),

dtrlnormC(init,x=dt,L=L,lg=FALSE),

dtrlnormC1(init,x=dt,L=L,lg=FALSE),

dtrlnorm(init,x=dt,L=L,log=TRUE),

dtrlnormC(init,x=dt,L=L,lg=TRUE),

dtrlnormC1(init,x=dt,L=L,lg=TRUE)

)

## Unit: microseconds
##                                         expr   min     lq median     uq
##   dtrlnorm(init, x = dt, L = L, log = FALSE) 84.52  86.06  88.40  90.73
##   dtrlnormC(init, x = dt, L = L, lg = FALSE) 65.21  66.28  67.72  68.72
##  dtrlnormC1(init, x = dt, L = L, lg = FALSE) 57.78  59.21  60.58  61.82
##    dtrlnorm(init, x = dt, L = L, log = TRUE) 98.00 100.92 102.91 105.77
##    dtrlnormC(init, x = dt, L = L, lg = TRUE) 61.53  63.08  64.52  65.60
##   dtrlnormC1(init, x = dt, L = L, lg = TRUE) 54.25  56.06  57.28  58.87
##     max neval
##  228.91   100
##   97.59   100
##   77.67   100
##  121.96   100
##   93.06   100
##   71.54   100

## Unit: microseconds

## expr min lq median uq

## dtrlnorm(init, x = dt, L = L, log = FALSE) 84.52 86.06 88.40 90.73

## dtrlnormC(init, x = dt, L = L, lg = FALSE) 65.21 66.28 67.72 68.72

## dtrlnormC1(init, x = dt, L = L, lg = FALSE) 57.78 59.21 60.58 61.82

## dtrlnorm(init, x = dt, L = L, log = TRUE) 98.00 100.92 102.91 105.77

## dtrlnormC(init, x = dt, L = L, lg = TRUE) 61.53 63.08 64.52 65.60

## dtrlnormC1(init, x = dt, L = L, lg = TRUE) 54.25 56.06 57.28 58.87

## max neval

## 228.91 100

## 97.59 100

## 77.67 100

## 121.96 100

## 93.06 100

## 71.54 100

microbenchmark(
  nlglik(init,x=dt,L=L),
  nlglikC(init,x=dt,L=L),
  nlglikC1(init,x=dt,L=L)
)

microbenchmark(

nlglik(init,x=dt,L=L),

nlglikC(init,x=dt,L=L),

nlglikC1(init,x=dt,L=L)

)

## Unit: microseconds
##                           expr    min     lq median     uq    max neval
##    nlglik(init, x = dt, L = L) 100.62 103.30 106.29 111.64 196.46   100
##   nlglikC(init, x = dt, L = L)  61.57  62.81  64.42  66.17  91.51   100
##  nlglikC1(init, x = dt, L = L)  54.58  56.30  57.27  58.77  76.16   100

## Unit: microseconds

## expr min lq median uq max neval

## nlglik(init, x = dt, L = L) 100.62 103.30 106.29 111.64 196.46 100

## nlglikC(init, x = dt, L = L) 61.57 62.81 64.42 66.17 91.51 100

## nlglikC1(init, x = dt, L = L) 54.58 56.30 57.27 58.77 76.16 100

Note that, at the best, the C++ implementation requires about one-half of time required by the “pure R” code.

An estimate of lognormal parameters may be obtained via maximum likelihood. Below an implementation of a function that performs such as calculation using optim():

lnMLE=function(x,nllik,L=0,method="Nelder-Mead",
               init=c(mean(log(x)),sd(log(x)))){
  invisible(
    optim(par=init,fn=nllik,method=method,x=x,L=L))
}

lnMLE=function(x,nllik,L=0,method="Nelder-Mead",

init=c(mean(log(x)),sd(log(x)))){

invisible(

optim(par=init,fn=nllik,method=method,x=x,L=L))

}

Note that the function takes as a parameter the specific instance of neg-log-likelihood function to be used, and that the C++ compiled Rcpp functions can be passed to optim() like all standard R functions.

And now the estimates:

ests=lnMLE(x=dt,nllik=nlglik,L=L)
estsC=lnMLE(x=dt,nllik=nlglikC,L=L)
estsC1=lnMLE(x=dt,nllik=nlglikC1,L=L)

ests

ests=lnMLE(x=dt,nllik=nlglik,L=L)

estsC=lnMLE(x=dt,nllik=nlglikC,L=L)

estsC1=lnMLE(x=dt,nllik=nlglikC1,L=L)

ests

## $par
## [1] 9.418 1.718
## 
## $value
## [1] 7373
## 
## $counts
## function gradient 
##       51       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

## $par

## [1] 9.418 1.718

## $value

## [1] 7373

## $counts

## function gradient

## 51 NA

## $convergence

## [1] 0

## $message

## NULL

ests$par-estsC$par

1	ests$par-estsC$par

## [1] 0 0

1	## [1] 0 0

ests$par-estsC1$par

1	ests$par-estsC1$par

## [1] 0 0

1	## [1] 0 0

estsC$par-estsC1$par

1	estsC$par-estsC1$par

## [1] 0 0

1	## [1] 0 0

ests$value-estsC$value

1	ests$value-estsC$value

## [1] -5.457e-12

1	## [1] -5.457e-12

ests$value-estsC1$value

1	ests$value-estsC1$value

## [1] -5.457e-12

1	## [1] -5.457e-12

estsC$value-estsC1$value

1	estsC$value-estsC1$value

## [1] 0

## [1] 0

Only very small differences appear between the different implementations.

Now a last benchmark on MLE procedures:

microbenchmark(
  lnMLE(x=dt,nllik=nlglik,L=L,init=init),
  lnMLE(x=dt,nllik=nlglikC,L=L,init=init),
  lnMLE(x=dt,nllik=nlglikC1,L=L,init=init),times=1000
)

microbenchmark(

lnMLE(x=dt,nllik=nlglik,L=L,init=init),

lnMLE(x=dt,nllik=nlglikC,L=L,init=init),

lnMLE(x=dt,nllik=nlglikC1,L=L,init=init),times=1000

)

## Unit: milliseconds
##                                                 expr   min    lq median
##    lnMLE(x = dt, nllik = nlglik, L = L, init = init) 5.156 5.284  5.347
##   lnMLE(x = dt, nllik = nlglikC, L = L, init = init) 3.230 3.324  3.354
##  lnMLE(x = dt, nllik = nlglikC1, L = L, init = init) 2.905 2.985  3.016
##     uq   max neval
##  5.402 8.243  1000
##  3.389 6.859  1000
##  3.054 5.200  1000

## Unit: milliseconds

## expr min lq median

## lnMLE(x = dt, nllik = nlglik, L = L, init = init) 5.156 5.284 5.347

## lnMLE(x = dt, nllik = nlglikC, L = L, init = init) 3.230 3.324 3.354

## lnMLE(x = dt, nllik = nlglikC1, L = L, init = init) 2.905 2.985 3.016

## uq max neval

## 5.402 8.243 1000

## 3.389 6.859 1000

## 3.054 5.200 1000

The C++ implementation requires about one half of time required by the R code.

And now, we will use the truncgof package to apply a right-tailed AD2 test on simulated data (see the ad2up.test() help for info about the function):

require(truncgof)

1	require(truncgof)

## Loading required package: truncgof
## Loading required package: MASS
## 
## Attaching package: 'truncgof'
## 
## The following object is masked from 'package:stats':
## 
##     ks.test

## Loading required package: truncgof

## Loading required package: MASS

## Attaching package: 'truncgof'

## The following object is masked from 'package:stats':

## ks.test

set.seed(100000)
system.time(print(ad2up.test(x=dt,distn="plnorm",fit=ests$par,H=L,
                       sim=1000,tol=0,
                       estfun="as.list(lnMLE(x=dt, nllik=nlglik, L = H)$par)")))

set.seed(100000)

system.time(print(ad2up.test(x=dt,distn="plnorm",fit=ests$par,H=L,

sim=1000,tol=0,

estfun="as.list(lnMLE(x=dt, nllik=nlglik, L = H)$par)")))

## 
##  Quadratic Class Anderson-Darling Upper Tail Test
## 
## data:  dt
## AD2up = 5.373, p-value = 0.58
## 
## treshold = 4000, simulations: 1000

## Quadratic Class Anderson-Darling Upper Tail Test

## data: dt

## AD2up = 5.373, p-value = 0.58

## treshold = 4000, simulations: 1000

##    user  system elapsed 
##   5.804   0.008   5.817

1 2	## user system elapsed ## 5.804 0.008 5.817

set.seed(100000)
system.time(print(ad2up.test(x=dt,distn="plnorm",fit=estsC$par,H=L,
                       sim=1000,tol=0,
                       estfun="as.list(lnMLE(x=dt, nllik=nlglikC, L = H)$par)")))

set.seed(100000)

system.time(print(ad2up.test(x=dt,distn="plnorm",fit=estsC$par,H=L,

sim=1000,tol=0,

estfun="as.list(lnMLE(x=dt, nllik=nlglikC, L = H)$par)")))

## 
##  Quadratic Class Anderson-Darling Upper Tail Test
## 
## data:  dt
## AD2up = 5.373, p-value = 0.58
## 
## treshold = 4000, simulations: 1000

## Quadratic Class Anderson-Darling Upper Tail Test

## data: dt

## AD2up = 5.373, p-value = 0.58

## treshold = 4000, simulations: 1000

##    user  system elapsed 
##   3.910   0.006   3.919

1 2	## user system elapsed ## 3.910 0.006 3.919

set.seed(100000)
system.time(print(ad2up.test(x=dt,distn="plnorm",fit=estsC1$par,H=L,
                       sim=1000,tol=0,
                       estfun="as.list(lnMLE(x=dt, nllik=nlglikC1, L = H)$par)")))

set.seed(100000)

system.time(print(ad2up.test(x=dt,distn="plnorm",fit=estsC1$par,H=L,

sim=1000,tol=0,

estfun="as.list(lnMLE(x=dt, nllik=nlglikC1, L = H)$par)")))

## 
##  Quadratic Class Anderson-Darling Upper Tail Test
## 
## data:  dt
## AD2up = 5.373, p-value = 0.58
## 
## treshold = 4000, simulations: 1000

## Quadratic Class Anderson-Darling Upper Tail Test

## data: dt

## AD2up = 5.373, p-value = 0.58

## treshold = 4000, simulations: 1000

##    user  system elapsed 
##   3.530   0.007   3.541

1 2	## user system elapsed ## 3.530 0.007 3.541

All the approaches produce the same results, but the best Rcpp implemetation requires a few more than one half of the “pure R” computational time.

Chapter 10 | Rcpp: Seamless R and C++ Integration

Introduction

Rcpp, R and C++

A simple example with scalars

Brief introduction to vectors

Using sourceCpp

Matrix computations example

A last more complex example

Join us!

Courses calendar

We are part of

Categories

Archives

Chapter 10 | Rcpp: Seamless R and C++ Integration

Introduction

Rcpp, R and C++

A simple example with scalars

Brief introduction to vectors

Using sourceCpp

Matrix computations example

A last more complex example

Join us!

Courses calendar

We are part of

Categories

Archives

Tags