R and the R Project
The R-Project: a bit of history
R is a programming environment for data analysis, graphics and statistical computing. The R language is widely used among statisticians for developing statistical software and data analysis.
R was initially developed in the early 1990s by Robert Gentleman and Ross Ihaka at the Department of Statistics of the University of Auckland as a dialect of the S language.
The R name is partly based on the (first) names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of S.
What is S and a bit of history
S is a statistical programming language developed by John Chambers and others in Bell Laboratories.
A bit of history:
- 1976: the first version of S was developed as an internal statistical analysis environment. It was originally implemented as Fortran libraries.
- 1980: the first version of S was distributed outside Bell Laboratories. In 1981, source versions were made available.
- 1984: Richard A. Becker and John M. Chambers, “S. An Interactive Environment for Data Analysis and Graphics”. (Brown Book). Historical interest only.
- 1988: Richard A. Becker, John M. Chambers and Allan R. Wilks, “The New S Language”. London: Chapman & Hall. (Blue Book). It introduced what is now known as S version 2. The system was rewritten in C and began to resemble the system that we have today.
- 1992: John M. Chambers and Trevor J. Hastie, “Statistical Models in S”. (White Book). It introduced S version 3, often abbreviated S3, which added structures to facilitate statistical modeling in S.
- 1998: John M. Chambers, “Programming with Data”. (Green Book). It introduced S version 4, often abbreviated S4, which provided advanced object-oriented features. S4 classes differ markedly from S3 classes.
The S language itself has not changed dramatically since 1998.
What is S-PLUS and a bit of history
S-PLUS is a commercial implementation of the S programming language.
S-PLUS provides a number of fancy features (GUIs, mostly) on top of it, hence the “PLUS”.
A bit of history:
- 1993: Statistical Sciences, Inc. acquires the exclusive license to distribute S and merges with MathSoft.
- 2001: MathSoft sells its Cambridge-based Engineering and Education Products Division (EEPD) and changes its name to Insightful Corporation.
- 2004: Insightful purchases the S language from Lucent Technologies for $2 million.
- 2008: TIBCO acquires Insightful Corporation.
R: a bit of history
- 1993: First announcement of R to the public.
- 1995: Martin Maechler convinces Ross Ihaka and Robert Gentleman to use the GNU General Public License to make R free software.
- 1997: The R Development Core Team is formed. The team controls the source code for R.
- 2000: R version 1.0.0 released. Developers considered R stable enough for production use.
- 2004: R version 2.0.0 released. Introduced lazy loading, which enables fast loading of data with minimal expense of system memory.
- 2013: R version 3.0.0 released. Introduced long vectors.
The R-project and R licence
R is supported by a wide community of academic users, professors, companies and developers. This community makes up the so-called “R-project”. The “R-project” is supported by the “R Foundation”, a not-for-profit organisation.
R is an official part of the Free Software Foundation’s GNU project. The R Foundation has similar goals to other open source software foundations like the Apache Foundation or the GNOME Foundation. R is free and open source software. It is released under the GPL (version 2) licence.
R is free:
- you can have R without paying for it (freeware);
- you can copy and re-use the software (free software);
- you can access source code and modify it (open source).
R Commercial Support
Revolution Analytics (www.revolutionanalytics.com) was founded in 2007 to provide commercial support for Revolution R. Revolution R is the distribution of R developed by Revolution Analytics which also includes components developed by the company.
Revolution R Enterprise includes all of R’s advanced data analysis and graphics capabilities, plus additional components. Major additional components include: ParallelR (for parallel computing), the R Productivity Environment IDE, RevoScaleR (for big data analysis), RevoDeployR (a web services framework), and the ability to read and write data in the SAS file format.
What does R do?
R provides a suite of software facilities for:
- matrix algebra;
- hash tables and regular expressions;
- reading and manipulating data;
- programming language: loops, subroutines, functions, etc.;
- conducting statistical analyses;
- graphics and tables;
- displaying the results.
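Several of the facilities listed above can be illustrated with a few lines of base R. The following is only a minimal sketch: the object names (A, b, counts, ids, cumulative_sum) and the sample values are made up for illustration, and the built-in cars dataset is used only because it ships with R.

```r
# Matrix algebra: solve the linear system A %*% x = b.
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(3, 5)
x <- solve(A, b)

# Hash table: an environment maps string keys to values.
counts <- new.env(hash = TRUE)
assign("apple", 3, envir = counts)
assign("pear", 5, envir = counts)
n_pear <- get("pear", envir = counts)

# Regular expressions: keep the strings with a numeric prefix, then strip it.
ids <- c("12-alpha", "beta", "7-gamma")
has_prefix <- grepl("^[0-9]+-", ids)
labels <- sub("^[0-9]+-", "", ids[has_prefix])

# Programming constructs: a function containing a loop.
cumulative_sum <- function(v) {
  total <- 0
  out <- numeric(length(v))
  for (i in seq_along(v)) {
    total <- total + v[i]
    out[i] <- total
  }
  out
}
running <- cumulative_sum(c(1, 2, 3, 4))

# Statistical analysis and a table of results on the built-in cars dataset.
fit <- lm(dist ~ speed, data = cars)
print(summary(fit)$coefficients)

# Graphics: uncomment to draw the data together with the fitted line.
# plot(dist ~ speed, data = cars); abline(fit)
```

In practice the loop would usually be replaced by the vectorised cumsum() function; it is written out here only to show the control structures.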
On the other hand, R:
- is not a database, but it connects to databases;
- does not provide a graphical interface of its own, but it can use Java, Tcl/Tk and, under Windows, COM to provide graphical interfaces;
- is not a spreadsheet, but it connects to spreadsheets;
- does not come with commercial support. Revolution R is a commercially supported distribution of R.
In conclusion, R is an interpreted computer language. R provides a platform for the development and implementation of new algorithms and for technology transfer. Most user-visible functions are written in R itself, calling upon a smaller set of internal primitives. For efficiency, it is possible to interface procedures written in C, C++, or Fortran, and to write additional primitives. System commands can be called from within R.
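The distinction between internal primitives and functions written in R, and the ability to call system commands, can be checked from the R prompt. This is a small sketch rather than a complete tour. The choice of sum and mean as examples is ours, and the system command is simply a second R process started through Rscript, which is assumed to be installed alongside R in the directory returned by R.home("bin").

```r
# sum is one of the internal primitives; mean is an ordinary R function.
sum_is_primitive <- is.primitive(sum)
mean_is_primitive <- is.primitive(mean)
print(c(sum = sum_is_primitive, mean = mean_is_primitive))

# Typing or printing a function written in R shows its R source.
print(mean)

# Call a system command from within R and capture its output.
# Assumption: the Rscript executable lives in R.home("bin").
rscript <- file.path(R.home("bin"), "Rscript")
out <- system2(rscript,
               c("--vanilla", "-e", shQuote("cat(1 + 1)")),
               stdout = TRUE)
print(out)
```

Any other program on the search path can be called in the same way; Rscript is used here only because it is available wherever R is installed.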
R advantages and disadvantages
Main R advantages are:
- Fast and free.
- State of the art: Statistical researchers provide their methods as R packages. SPSS and SAS are years behind R!
- Excellent for graphics.
- Mx, WinBUGS, and other programs use or will use R.
- Active user community.
- Excellent for simulation, programming, computer intensive analyses, etc.
- Forces you to think about your analysis.
- Interfaces with database storage software (SQL).
Main R disadvantages are:
- Not user friendly at start: steep learning curve, minimal GUI.
- Sometimes, figuring out correct methods or how to use a function on your own can be frustrating.
- It is easy to make mistakes without noticing them.
- Working with large datasets is limited by RAM.
- Data preparation and cleaning can be messier and more error-prone in R than in SPSS or SAS.
The R-project website (www.r-project.org) is the starting point for R materials.
The website contains:
- the software and packages;
- the search engine interface (the same queries can be submitted with the RSiteSearch("query") function within R);
- the on-line documentation both in HTML and in PDF format. The HTML version can be accessed with the help.start() function within R;
- the R Journal. The R Journal is the open access, refereed journal of the R project. It features short to medium length articles covering topics that might be of interest to users or developers of R;
- the interface to the mailing list;
- the wiki, suggested books and many others.
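The two functions mentioned in the list above can be tried as follows. This is a hedged sketch: the query string is only an example, both calls open a web browser (RSiteSearch also needs an Internet connection), so they are guarded by interactive(). The apropos() call is our addition, showing a lookup that works without a browser.

```r
# Open the HTML documentation and submit a site search. Both calls launch
# a web browser, so they run only in an interactive session.
if (interactive()) {
  help.start()
  RSiteSearch("linear mixed models")  # example query, not from the text
}

# Without a browser: list the visible objects whose names start with "read.".
topics <- apropos("^read\\.")
print(topics)
```

The help page of any function returned by apropos() can then be opened with help(), for example help("read.csv").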
The on-line documentation includes the following manuals. These manuals were written by the R Development Core Team itself and contain valuable information.
- An Introduction to R gives an introduction to the language and how to use R for doing statistical analysis and graphics.
- Writing R Extensions covers how to create your own packages, write R help files, and the foreign language (C, C++, Fortran, …) interfaces.
- R Data Import/Export describes the import and export facilities available either in R itself or via packages which are available from CRAN.
- R Installation and Administration.
Other manuals and tutorials provided by R users can be downloaded from the R-project website (cran.r-project.org/other-docs.html).
Mailing lists are the most important tool for contacting the R community. They can be accessed from the R-project website (www.r-project.org/mail.html).
There are four general mailing lists devoted to R:
- R-announce: This list is for major announcements about the development of R and the availability of new code.
- R-packages: This list is for announcements as well, usually on the availability of new or enhanced contributed packages (on CRAN, typically).
- R-help: The “main” R mailing list, for discussion about problems and solutions using R, announcements about the availability of new functionality for R and documentation of R, comparison and compatibility with S-PLUS, and for the posting of nice examples and benchmarks.
- R-devel: This list is intended for questions and discussion about code development in R.
Other on-line resources
It is very difficult to estimate how many sites about R are on-line. However, a Google search for “R stat blog” returns 224,000,000 results. Even if only 0.1% of these sites talk about R, that still means roughly 224,000 sites about R.
R-bloggers (www.r-bloggers.com) is a blog aggregator of content collected from bloggers who write about R. R-bloggers contains R news and tutorials contributed by hundreds of R bloggers.
Other useful websites about R are:
A partially annotated list of books related to S or R can be found on the R-project website (www.r-project.org/doc/bib/R-books.html).
The following book may be considered the milestone book about R:
- William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York, 2002. ISBN 0-387-95457-0.
Other suggested books are:
- Everitt and Hothorn (2009). A handbook of statistical analyses using R. Chapman & Hall/CRC.
- Chambers (2008). Software for Data Analysis, Springer.
- Chambers (1998). Programming with Data, Springer.
- Murrell (2005). R Graphics, Chapman & Hall/CRC Press.
- Dalgaard (2002). Introductory Statistics with R. Springer.
- Kabacoff (2011). R in Action. Manning.
- Braun and Murdoch (2007). A First Course in Statistical Programming with R. Cambridge University Press.
Springer is developing a series of books called Use R!.