R Data Objects

There are different types of objects in R. The most used are: vectors, matrices, lists, factors, data frames.

Vectors

In R the simplest data structure is a vector. A vector is defined as an ordered sequence of elements of the same kind.

A vector can be defined according to the data type it contains. Therefore, there are:

  • numeric vectors;
  • logical (or Boolean) vectors;
  • character (or string) vectors.

The most common method to define a vector is the c() function.
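As a minimal sketch (the names x and y and their values are only illustrative), c() can build numeric and character vectors:

```r
# Numeric vector
x <- c(2, 5, 7, 10)

# Character vector
y <- c("red", "green", "blue")

x
y
```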

Logical vectors are often defined as the result of control actions on numerical or character vectors.

Of course, a logical vector can be created using the c() function.
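A short illustration of both routes, using made-up values: a comparison on a numeric vector yields a logical vector, and c() can build one directly.

```r
x <- c(2, 5, 7, 10)

# Logical vector obtained from a comparison
big <- x > 5
big                      # FALSE FALSE TRUE TRUE

# Logical vector created directly with c()
flags <- c(TRUE, FALSE, TRUE)
flags
```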

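If a vector mixes different data types, R coerces all its elements to the most general type involved (logical, then numeric, then character). For example, mixing numbers and strings gives a character vector.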

The c() function can be used to create a vector combining several vectors.
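The sketch below, with illustrative names, shows what happens when types are mixed and how c() joins existing vectors. Note that mixing only numbers and logical values gives a numeric vector, while adding a string makes everything character.

```r
# Mixing a number, a string and a logical value gives a character vector
mixed <- c(1, "a", TRUE)
mixed                    # "1" "a" "TRUE"
class(mixed)             # "character"

# Mixing numbers and logical values gives a numeric vector
num_log <- c(1, TRUE, FALSE)
num_log                  # 1 1 0

# Combining two vectors into one
a <- c(1, 2, 3)
b <- c(10, 20)
ab <- c(a, b)
ab                       # 1 2 3 10 20
```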

A vector can be created using sequences. The easiest method is the : operator, which takes the first number on its left and the last number on its right. The resulting vector contains the numbers from the first to the last, in steps of one.

Other sequences can be created using the seq() function.
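A few illustrative sequences built with : and seq() (the by and length.out arguments are standard seq() arguments; the values are arbitrary):

```r
s1 <- 1:5                          # 1 2 3 4 5
s2 <- 5:1                          # 5 4 3 2 1

# Sequence with a step different from one
s3 <- seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00

# Sequence with a fixed number of equally spaced values
s4 <- seq(1, 10, length.out = 4)   # 1 4 7 10
```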

Finally, vectors can be created with the rep() function, which repeats the elements of a vector. The first parameter of the rep() function is the value or vector to be repeated and the second parameter, times, is the number of repetitions of the whole vector. Alternatively, the each parameter repeats every single element the given number of times before moving on to the next one.
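The difference between times and each can be seen in this small sketch (values chosen only for illustration):

```r
# Repeat the whole vector twice
r1 <- rep(c(1, 2, 3), times = 2)
r1                       # 1 2 3 1 2 3

# Repeat each element twice
r2 <- rep(c(1, 2, 3), each = 2)
r2                       # 1 1 2 2 3 3
```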

A subset of the vector x can be extracted with x[subscripts]. The selection can be done in three ways.

  1. A vector of positive integers indicating the elements of the vector to be extracted.
  2. A vector of negative integers indicating the elements which must not be extracted.
  3. A Boolean vector indicating the elements to be extracted (TRUE) or to be left (FALSE).
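The three kinds of subscripts listed above can be sketched as follows, on an illustrative vector x:

```r
x <- c(10, 20, 30, 40, 50)

# 1. Positive integers: keep the first and third elements
pos <- x[c(1, 3)]
pos                      # 10 30

# 2. Negative integers: drop the first and third elements
neg <- x[-c(1, 3)]
neg                      # 20 40 50

# 3. Logical vector: keep the elements flagged TRUE
lgl <- x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
lgl                      # 10 30
```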

The ! symbol identifies the “not” logical operator. The logical operator “not” reverses the logical value of a condition on which it operates.

In this way elements satisfying a logical condition can be extracted. Usually, the logical vectors are obtained as the result of logical expressions.

The logical expression can be defined inside the square brackets, directly.
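A small sketch of these ideas, with an illustrative vector x and a helper name cond that is not part of the original text:

```r
x <- c(1, 4, 7, 9, 3)

# Logical vector produced by a logical expression
cond <- x > 3
cond                     # FALSE TRUE TRUE TRUE FALSE

keep <- x[cond]          # elements satisfying the condition: 4 7 9
drop <- x[!cond]         # "not" reverses the condition: 1 3

# The same extraction with the condition written inside the brackets
direct <- x[x > 3]       # 4 7 9
```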

The & symbol identifies the “and” logical operator. The “and” logical operator compares two (or more) logical expressions and returns TRUE only if all of them are TRUE. For example, x[x > 2 & x < 8] returns the x values greater than 2 but less than 8.

The | symbol identifies the “or” logical operator. The “or” logical operator compares two (or more) logical expressions and returns TRUE if at least one of them is TRUE. For example, x[x < 2 | x > 8] returns the x values less than 2 or greater than 8.
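Both operators can be tried on an illustrative vector x holding the integers from 1 to 10:

```r
x <- 1:10

# "and": values greater than 2 but less than 8
both <- x[x > 2 & x < 8]
both                     # 3 4 5 6 7

# "or": values less than 2 or greater than 8
either <- x[x < 2 | x > 8]
either                   # 1 9 10
```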

Extracting unique values contained in a vector can sometimes be useful. This can be done with the unique() function.
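For instance, with an arbitrary vector containing repeated values:

```r
x <- c(3, 1, 3, 2, 1)

# Distinct values, in order of first appearance
u <- unique(x)
u                        # 3 1 2
```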

Matrices

Matrices are generalizations of vectors. Like vectors, matrices need to contain elements of the same kind. This section introduces numeric matrices.

A matrix can be created using the matrix() function.

By default, a matrix is filled by columns. Setting the byrow = TRUE argument of the matrix() function fills the matrix by rows instead.
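A minimal sketch of the two filling orders (the names m_col and m_row are illustrative):

```r
# Filled by columns (default)
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Filled by rows
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```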

Alternatively, a matrix can be created by assigning dimensions to a vector with the dim() function.
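A sketch of this approach, where the vector v is only an example; the dimensions are given as c(rows, columns) and the values are arranged by columns:

```r
v <- 1:6

# Assigning dimensions turns the vector into a 2 x 3 matrix
dim(v) <- c(2, 3)
v
is.matrix(v)             # TRUE
```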

Finally, a matrix can be created by joining two or more vectors, both as column vectors (cbind() function) and row vectors (rbind() function).

The cbind() and rbind() functions can also be used to join two (or more) matrices, or vectors and matrices.
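The following sketch, with illustrative vectors a and b, joins vectors as columns and as rows and then attaches a vector to an existing matrix:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

cb <- cbind(a, b)        # 3 x 2 matrix, a and b as columns
rb <- rbind(a, b)        # 2 x 3 matrix, a and b as rows

# Joining a matrix and a vector
m <- matrix(1:6, nrow = 3)
m2 <- cbind(m, a)        # 3 x 3 matrix
```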

Like vectors, a subset of the matrix x can be extracted with x[subscripts]. Subscripts can be:

  1. A pair [rows, cols], where rows is a vector of row numbers and cols is a vector of column numbers. Numbers are negative when they indicate a row or column to be excluded.
  2. A number, a vector of numbers or a logical condition. In this case, the matrix is treated as if it were a single vector created by stacking the matrix columns.
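Both kinds of subscripts can be sketched on an illustrative 3 x 4 matrix m:

```r
m <- matrix(1:12, nrow = 3)   # filled by columns

# [rows, cols] subscripts
e1 <- m[2, 3]            # element in row 2, column 3: 8
e2 <- m[c(1, 3), 2]      # rows 1 and 3 of column 2: 4 6
e3 <- m[-1, ]            # all columns, first row excluded

# Single subscript: the matrix is treated as one stacked vector
e4 <- m[5]               # fifth element counting down the columns: 5
e5 <- m[m > 10]          # elements satisfying a condition: 11 12
```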

Lists

A list is an ordered collection of objects. Each object is a component of the list. Each element of the list can have a different structure. It can be a list itself, a vector, a matrix, an array, a factor or a data frame. A list allows you to gather a variety of (possibly unrelated) objects under one name.

Lists are not usually created by users but are the result of statistical procedures in R.

For example, in its simplest form, the lsfit() function estimates a least-squares regression.

The output of the function is a list made of four objects called “coefficients”, “residuals”, “intercept” and “qr”. The first element of the list is a vector with intercept and slope. The second element is a vector with the residuals of the model. The third element is a Boolean vector of length one indicating if the model contains the intercept. The fourth element is a list containing the QR matrix decomposition of the independent variables.
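A minimal sketch with made-up data (the x and y values are invented for illustration and are not from the original text):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Simple least-squares fit of y on x
fit <- lsfit(x, y)

names(fit)               # "coefficients" "residuals" "intercept" "qr"
fit$coefficients         # intercept and slope
fit$intercept            # TRUE: the model contains the intercept
```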

Even if rarely used, list() is the basic function to create a list. Its arguments are the elements of the list.
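As a sketch, a list named myList can gather objects of different structure (the component names numbers, label and m are illustrative):

```r
myList <- list(
  numbers = c(1, 2, 3),             # a numeric vector
  label   = "abc",                  # a character string
  m       = matrix(1:4, nrow = 2)   # a matrix
)

names(myList)            # "numbers" "label" "m"
length(myList)           # 3
```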

The elements of a list can be extracted in three different ways:

  1. with square brackets;
  2. with double square brackets;
  3. with the name of the object inside the list.

In the first case, square brackets can be used to extract a list made of one or more objects. As for vectors, the position of the elements to be included or excluded ought to be specified.

In the second case, double square brackets can be used to extract one object (only) from the list.

Please note the difference between myList[1] and myList[[1]]. The first expression extracts a list with only the first object contained in myList; in our case a vector. On the other hand, the second expression extracts the vector itself, which is the content of the first object of the list.

The third way enables the extraction of the content of an object in the list. The use of the object position in the list, as for double square brackets, is replaced by the use of the object name preceded by the symbol $.

Indices for the selection can be combined to extract elements in an object of the list using the above-mentioned methods.
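The three extraction methods, and their combination with vector subscripts, can be sketched on an illustrative list whose first component is a vector:

```r
myList <- list(numbers = c(10, 20, 30), label = "abc")

sub_list <- myList[1]        # a list containing only the first object
content  <- myList[[1]]      # the vector itself
by_name  <- myList$numbers   # the same vector, extracted by name

# Combining indices: second element of the first object
second_a <- myList[[1]][2]       # 20
second_b <- myList$numbers[2]    # 20
```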

Factors

A factor is a vector-like object used to specify a discrete classification (grouping) of the components of other vectors of the same length. R provides both ordered and unordered factors.

Factor variables are categorical variables that can be either numeric or string variables. There are a number of advantages to converting categorical variables to factor variables. Perhaps the most important advantage is that they can be used in statistical modeling where they will be implemented correctly, e.g., they will then be assigned the correct number of degrees of freedom. Factor variables are also very useful in many different types of graphics. Furthermore, storing string variables as factor variables is a more efficient use of memory.

Vectors, matrices and lists contain numerical data, characters or logical values and are basic objects in R. Factors, on the other hand, are a more complex structure, as they contain both a vector of integer codes and the labels associated with each level.

To create a factor variable the factor() function is used. The only required argument is a vector of values which can be either string or numeric. Optional arguments include the levels argument, which determines the categories of the factor variable, and the default is the sorted list of all the distinct values of the data vector. The labels argument is another optional argument which is a vector of values that will be the labels of the categories in the levels argument. The exclude argument is also optional; it defines which levels will be classified as NA in any output using the factor variable.
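A minimal sketch of the levels and labels arguments, using an invented numeric coding (1 and 2) for an illustrative variable sex:

```r
sex <- c(1, 2, 2, 1, 2)

# levels gives the categories found in the data, labels their names
f <- factor(sex, levels = c(1, 2), labels = c("male", "female"))

f
levels(f)                # "male" "female"
table(f)                 # frequency of each level
```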

Once a factor has been defined, it is always possible to modify the labels of its levels. This can be done with the levels() function.
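For example, assigning a new character vector to levels() relabels an illustrative factor f:

```r
f <- factor(c("a", "b", "a"))
levels(f)                # "a" "b"

# Replace the labels of the levels, in order
levels(f) <- c("low", "high")
f                        # low high low
```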

The ordered parameter of the factor() function creates a factor with ordered levels.

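The as.numeric() and as.character() functions transform a factor into a numeric vector or into a character vector, respectively. Note that as.numeric() returns the integer codes of the levels, whereas as.character() returns the levels’ labels.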
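A short sketch of an ordered factor and of the two conversions (the variable size and its levels are illustrative; the full argument name in factor() is ordered):

```r
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

size                     # Levels: small < medium < large
below <- size < "large"  # comparisons follow the level order

codes  <- as.numeric(size)    # integer codes of the levels: 1 3 2 1
labels <- as.character(size)  # labels: "small" "large" "medium" "small"
```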

The elements of a factor can be extracted in the same way as the elements of a vector. Logical conditions on the elements of a factor refer to the factor’s levels, which can be obtained with the levels() function.
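A brief sketch on an illustrative factor f; note that the extracted subset keeps all the original levels:

```r
f <- factor(c("a", "b", "a", "c"))
levels(f)                # "a" "b" "c"

first_two <- f[1:2]      # extraction by position
only_a    <- f[f == "a"] # extraction by a condition on the levels

levels(only_a)           # still "a" "b" "c"
```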

Data Frames

A data frame in R can be thought of as:

  • a generalization of a matrix;
  • a list of a particular kind.

In the first case a data frame can be thought of as a matrix whose columns can be both factors and vectors of the same length but (possibly) of different types (numeric, character, Boolean).

In the second case, the data frame is a list completely made of either vectors (of any kind) or factors, all with the same length.

From a formal point of view, data frames are not a new type of object. They are objects of list type and data frame class. From a practical point of view, a data frame is a very well-known structure in statistics. Its different kinds of information are organized in columns, whereas rows represent the different observational units.

Furthermore, data imported in R from external sources, such as text files, Excel files or databases, is saved in R as data frame-like objects.

To sum up, unless there are specific needs, data frames in R are the ideal tool for data storage and management.

Data frames are usually imported from external sources but the creation of a data frame object in R might sometimes be needed. The most widespread method to define a data frame is the data.frame() function. Its inputs are a series of vectors or factors of the same length. The generated object is made of as many columns as input elements.

Alternatively, the vectors composing the data frame can be defined inside the data.frame() function itself.
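Both approaches can be sketched as follows (the column names and values are invented for illustration):

```r
# Vectors defined beforehand
id   <- 1:3
name <- c("Ann", "Bob", "Carl")
df <- data.frame(id, name)
df

# Vectors defined inside the call to data.frame()
df2 <- data.frame(id = 1:3, score = c(7.5, 8, 6))
df2
```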

The management of character vectors in R requires a detailed explanation. In versions of R before 4.0.0, numeric vectors become part of data frames as such, whereas character vectors are by default transformed into factors whose levels correspond to the vector’s unique values. Since R 4.0.0, character vectors are left unchanged by default.

Converting to factors is certainly effective when character vectors represent categorical variables with a definite number of categories, such as educational qualification or job.

A character vector can also represent a set of strings which cannot be reduced to a definite number of categories (e.g. people’s proper names) or which have numerous and/or unique values (e.g. there are about 8,000 Italian municipalities, each with a different name). In this case the conversion becomes a nuisance, because it turns variables of a different nature into factors.

Unfortunately, there is not an optimal solution to this problem. Much depends on the most used types of variables dealt with by a single user.

This behaviour is managed by the stringsAsFactors logical parameter.

With stringsAsFactors = TRUE (the default before R 4.0.0), R transforms character vectors into factors inside a data frame; with stringsAsFactors = FALSE (the default since R 4.0.0), character vectors are left unchanged. This parameter can be set locally with the stringsAsFactors argument inside the data.frame() function. In older versions of R it can also be set globally with stringsAsFactors inside the options() function; the global setting lasts for the whole work session but can be overridden at any moment in a single call to a function.
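The effect of the local argument can be checked with a small sketch (the data frames df_chr and df_fct are illustrative). Setting the argument explicitly gives the same result whatever the default of the installed R version is.

```r
# Character column kept as character
df_chr <- data.frame(name = c("Ann", "Bob"), stringsAsFactors = FALSE)
class(df_chr$name)       # "character"

# Character column converted to a factor
df_fct <- data.frame(name = c("Ann", "Bob"), stringsAsFactors = TRUE)
class(df_fct$name)       # "factor"
levels(df_fct$name)      # "Ann" "Bob"
```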

There are several ways to extract a subset from a data frame; a full treatment of data management is beyond the scope of this introductory manual. As with matrices, data can be extracted using x[subscripts], where subscripts is a pair [rows, cols], rows is a vector of row numbers and cols is a vector of column numbers. Numbers are negative when they indicate a row or column to be excluded.

For example, df[c(1, 3, 7), ] extracts the first, the third and the seventh row from the df data frame.

Similarly, df[, c(1, 2)] extracts the first and the second column from the df data frame.
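A sketch of both extractions on an illustrative data frame df with eight rows (its columns id, x and g are invented for the example):

```r
df <- data.frame(id = 1:8,
                 x  = seq(10, 80, by = 10),
                 g  = letters[1:8])

# First, third and seventh row (all columns)
rows_137 <- df[c(1, 3, 7), ]
rows_137

# First and second column (all rows)
cols_12 <- df[, c(1, 2)]
head(cols_12)

# Negative numbers exclude: everything except the first row
no_first <- df[-1, ]
```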

Like lists, data can be extracted using x$column where column is the column name.

Extraction with $ returns a vector. To get a data frame instead, the data.frame() function can be applied to the extracted column.
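For instance, on a small illustrative data frame df:

```r
df <- data.frame(id = 1:4, x = c(10, 20, 30, 40))

# Extraction by column name returns a vector
v <- df$x
v                        # 10 20 30 40
is.data.frame(v)         # FALSE

# Wrapping the column in data.frame() gives a one-column data frame
d <- data.frame(x = df$x)
is.data.frame(d)         # TRUE
```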

The dim() function can be used to know the number of rows and of columns of a data frame. The same information can be obtained using nrow() and ncol() functions, respectively.

The str() function displays the structure of a data frame. For each variable (column), it shows the name, the type (numeric, character, factor, etc.) and the first few values.

The head() function shows the first rows of a data frame. Its use is particularly convenient when data sets are long. iris is a built-in R data set. It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.
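These functions can be tried on the built-in iris data set; the n argument of head(), used in the last line, sets how many rows are shown.

```r
d <- dim(iris)           # number of rows and columns: 150 5
nr <- nrow(iris)         # 150
nc <- ncol(iris)         # 5

str(iris)                # name, type and first values of each column

head(iris)               # first six rows
top3 <- head(iris, n = 3)
```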

Missing Values, Null Objects and Infinite

In R missing values are represented by the symbol NA (Not Available).

For example, suppose a vector x is created with three values: 4, 1 and the character string “a” (vectors are described in the Vectors section above). The elements of the vector are then transformed into integer values, for instance with as.integer().

The character string “a” cannot be transformed, so an NA value is returned.

To check if data is missing, the function is.na() can be used.
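A minimal sketch of the situation described above, assuming the conversion is done with as.integer(); R also issues a warning that NAs were introduced by coercion.

```r
# Mixing numbers and a string gives a character vector
x <- c(4, 1, "a")

# "a" cannot be converted, so it becomes NA (with a warning)
x_int <- as.integer(x)
x_int                    # 4 1 NA

# Check which elements are missing
miss <- is.na(x_int)
miss                     # FALSE FALSE TRUE
```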

The NaN symbol (Not a Number) represents a missing value obtained as a result of an impossible numerical operation, such as 0/0. NaN can be detected with the is.nan() function.
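As a quick illustration, dividing zero by zero produces NaN; note that is.na() is also TRUE for NaN, whereas is.nan() is FALSE for NA:

```r
z <- 0 / 0
z                        # NaN

is.nan(z)                # TRUE
is.na(z)                 # TRUE
is.nan(NA)               # FALSE
```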

The NULL symbol represents the null object in R. NULL is often returned by expressions and functions whose value is undefined. The is.null() function returns TRUE if its argument is NULL and FALSE otherwise.
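A brief sketch (the names e and lst are illustrative): a NULL object has length zero, and asking a list for a component that does not exist returns NULL.

```r
e <- NULL
is.null(e)               # TRUE
length(e)                # 0

lst <- list(a = 1)
is.null(lst$b)           # TRUE: there is no component called b
is.null(lst$a)           # FALSE
```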

Infinite values are represented by the +Inf and -Inf symbols in R.

For example, log(0) returns -Inf, which is the limit of the function \(\log(x)\) as \(x\) approaches zero from the right.
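Infinite values can be produced and tested as in this short sketch (is.infinite() and is.finite() are base R functions):

```r
neg_inf <- log(0)
neg_inf                  # -Inf

pos_inf <- 1 / 0
pos_inf                  # Inf

is.infinite(neg_inf)     # TRUE
is.finite(pos_inf)       # FALSE
```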

Summary

In this chapter, we showed the most common object types within R. We introduced the main features of each object type and how to get a subset of data. The vector is the “basic” object. You can have a numeric, character or logical vector. Matrices are generalisations of vectors. Lists are a more complex data type: they collect several other objects in a single object. A factor is a vector-like object used to specify a discrete classification (grouping) of the components of other vectors of the same length. The data frame is a very well-known structure in statistics. It looks like a matrix, but each column can contain a different type of data. Its different kinds of information are organised in columns, whereas rows represent the different observational units. Now that you know the main object types of R, it’s time to get your data into the mix. In the next chapter, we’ll look at how to import data into and export data from R using text files, other programs, and database management systems.