Chapter 2 Basics

I was introduced to R on the backseat of a van during a four-week field excursion through the southwestern United States. With a laptop on my knees (and sufficient power from a 230 V plug) I reproduced the sparse but elegant examples from the R introduction PDF file that was installed along with R. Seven years later, I use R almost daily for an extremely wide range of purposes and consider it one of the most versatile, flexible and elegant tools I have ever encountered. Well, this might have to do with where my research interests have drifted: from the geographer's early playground, fieldwork, to the abstract analysis of natural phenomena, mainly driven by numeric data of varying quantity, quality and, of course, intended application.

This ongoing transition is certainly reflected in the overall quality of the book and specifically in the degree of elaboration of individual chapters. Whenever I read code I wrote some years or even only months ago, I feel somewhere between depressed, amused and disgusted, for different reasons (correctness, clarity, style, and so on), but mostly simply because of “why did I not do it like I would do it now?” Well, the answer is clear. But again, R does not demand a steep learning curve. There is really little overhead and the formal requirements are low compared to other languages. So this chapter just gives some very brief and mainly unjustified definitions of the essential foundations and phrases of typical R jargon that you might want to be familiar with to have fun with this book.

Useful references that discuss the following items in a bit more detail and with official binding to R are:

2.1 Data types

R supports the usual data types known from many other programs but also further, more specialised ones. Data types define how values are handled and which operations are possible with them. Data types can be converted into each other to some degree, provided that certain constraints are respected. These constraints mainly arise from the position of a data type in a hierarchy. Conversion is always possible in the following direction:

logical to integer to double to complex to character

Conversion in the other direction usually results in loss of information or is only possible in exceptional situations. This is because the hierarchy level defines the allowed values and operations: each level can represent everything the levels below it can, but not vice versa. While, for example, logical data can only be TRUE or FALSE, character data can hold any alphanumeric value. Arithmetic operations are only possible up to the level of complex data, but not for character. Comparisons also work for character data (strings are compared by their sort order), whereas complex values can only be tested for equality and cannot be ordered. It is essential to know what is possible with which data type and how to convert between data types (and, of course, to be aware of the consequences of conversion).
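The hierarchy can be seen in action with a minimal sketch. The object names (x, y, z, down_int and so on) are made up for illustration; the functions are all part of base R.

```r
# Combining values of different types in c() coerces all of them
# to the highest type present (upward conversion, always possible).
x <- c(TRUE, 2L)            # logical + integer
y <- c(TRUE, 2L, 3.5)       # ... + double
z <- c(TRUE, 2L, 3.5, "a")  # ... + character
typeof(x)  # "integer"
typeof(y)  # "double"
typeof(z)  # "character"

# Downward conversion may lose information or fail.
down_int <- as.integer(3.9)                      # 3, the fraction is dropped
down_num <- suppressWarnings(as.numeric("abc"))  # NA, no numeric meaning
down_ok  <- as.numeric("3.9")                    # 3.9, works only because the
                                                 # string happens to be a number
```

Without suppressWarnings() the failed conversion additionally raises a warning, which is usually the first hint that a column was imported with the wrong type.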

The following list contains the commonly encountered data types in R (the ones shown above) as well as some further data types that also occur and deserve a brief discussion:

  • Integer and Double, i.e., numeric. Integer values are numeric values without decimal extension. Double values are numeric values with a decimal extension. In practice, R rarely requires you to distinguish between the two; it handles both as numeric data and converts integers to double whenever needed. To convert a double object to an integer object, use as.integer(). This will truncate a fractional number rather than round it. Hence, it may be necessary to round first (as.integer(round(x, digits = 0))). A double object can be generated by numeric() and other data types can be converted to double with as.numeric(). Whether an object’s data type is double or integer can be queried with is.double() or is.integer(); is.numeric() returns TRUE for both. More generally, the data type can be queried with typeof().

  • Complex. Complex objects contain a real and an imaginary part, both of which are numeric. Complex data can be generated by complex(). To extract the real part use Re(), to extract the imaginary part use Im(). Whether an object’s data type is complex can be queried with is.complex(). Objects can be converted to complex with as.complex().

  • Logical. Logical data can take two values, TRUE (1) or FALSE (0), and are usually the result of a logical operation using Boolean operators. It is possible to do calculations with them, i.e., with their representations as 1 and 0. Actually, data of type logical can also be NA, though this will rarely be of practical use. For the lazy ones, R supports specifying T instead of TRUE and F instead of FALSE. Note, however, that T and F are ordinary variables that can be overwritten, whereas TRUE and FALSE are reserved words. To create logical objects use logical(). To convert any data type to logical use as.logical(). Note, however, that all numeric values will be converted to TRUE except for 0 (FALSE) or NA (NA), and that character strings other than spellings such as "TRUE", "T", "FALSE" or "F" become NA. To query if an object is logical use is.logical(); the return value is itself a logical value.

  • Character. Literally, a character is only a single letter, digit, symbol and so on, i.e., the smallest addressable unit possible. In R (and many other languages) the data type character means a string of alphanumeric characters (even if it contains only one character). The data type character is the most open data type in R; nearly everything can be converted to character. However, this comes at the cost of reduced possibilities of data treatment. You cannot do any calculations with characters (e.g., "1" + "1" will give an error). Objects of type character are denoted by double quotation marks ("") or single quotation marks (''). The latter are less common but may be needed when using paste() to generate long text strings that contain quotation marks themselves. Character objects can be created using character() and queried by is.character(), and objects can be converted to characters using as.character().

  • Time/Date. Handling dates and times is a wide field (and accordingly deserves a chapter of its own). In R, these data types are represented by the POSIX (Portable Operating System Interface) format. There are two different classes: POSIXct (ct stands for calendar time) and POSIXlt (lt stands for local time). POSIXct values are nothing else than a numeric description of the number of seconds that have passed since a given reference date (in the case of R 1970-01-01 00:00:00 UTC). Hence, you can do arithmetic and any other calculation as with other data of type numeric. However, specifying and displaying dates requires a long string, denoting year, month, day, hour, minute and second (maybe with fractions) in a predefined format. Conversion to POSIXct is done by as.POSIXct(). Base R has no dedicated query function for this class; whether an object is of class POSIXct can be tested with inherits(x, "POSIXct"). Defining POSIXct formats is more complicated. To convert a character representation of a date into a POSIX object (of class POSIXlt) use the function strptime(). To extract parts of a POSIX date and return them as a character string use strftime() (a wrapper for format.POSIXlt()). For more information see chapter Time series.

  • NA, NaN, NULL. Missing values in the statistical sense, that is, variables whose value is not known, have the value NA. The default type of NA is logical, unless coerced to some other type. Testing for NA is done using is.na(). To specify an explicit string NA, use as.character(NA) rather than "NA". Numeric calculations whose result is undefined, such as 0/0, produce the value NaN. This exists only in the double type and for real or imaginary components of the complex type. The function is.nan() is provided to check specifically for NaN; is.na() also returns TRUE for NaN. Coercing NaN to logical or integer type gives an NA of the appropriate type, but coercion to character gives the string "NaN". NaN values are incomparable, so tests of equality or collation involving NaN will result in NA. They are regarded as matching any NaN value (and no other value, not even NA) by match(). There is a special object called NULL. It is used whenever there is a need to indicate or specify that an object is absent. It should not be confused with a vector or list of zero length. The NULL object has no type and no modifiable properties. There is only one NULL object in R, to which all instances refer. To test for NULL use is.null(). You cannot set attributes on NULL. NULL can be used, for example, to remove list elements by assigning NULL to them.

  • Further types. There are, for example, the raw type (the sixth atomic data type of R, which holds raw bytes) and user-defined types, which are derivatives of the atomic data types. They are not described here.
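The behaviour of the data types listed above can be checked with a few lines of base R. This is only a sketch; all object names (a, b, flags, lst, t0 and so on) are invented for illustration.

```r
# Integer and double: both count as numeric, typeof() tells them apart
a <- 5L
b <- 5
typeof(a)                             # "integer"
typeof(b)                             # "double"
both_numeric <- c(is.numeric(a), is.numeric(b))  # TRUE TRUE
trunc_pos <- as.integer(2.7)          # 2, truncated rather than rounded
trunc_neg <- as.integer(-2.7)         # -2, truncation is toward zero
rounded   <- as.integer(round(2.7))   # 3, round first if that is intended

# Logical: TRUE and FALSE behave like 1 and 0 in calculations
flags    <- c(TRUE, FALSE, TRUE)
n_true   <- sum(flags)                   # 2
from_num <- as.logical(c(0, 1, -3, NA))  # FALSE TRUE TRUE NA

# Character: arithmetic fails until the values are converted
char_sum <- tryCatch("1" + "1", error = function(e) "error")
num_sum  <- as.numeric("1") + as.numeric("1")  # 2

# NA, NaN and NULL
typeof(NA)                 # "logical"
nan_is_na <- is.na(NaN)    # TRUE
na_is_nan <- is.nan(NA)    # FALSE
nan_equal <- NaN == NaN    # NA, NaN values are incomparable
lst <- list(first = 1, second = 2)
lst$first <- NULL          # assigning NULL removes the list element
names(lst)                 # "second"

# POSIXct: seconds since 1970-01-01 00:00:00 UTC
t0    <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
secs  <- as.numeric(t0)                  # 86400, i.e., one day
is_ct <- inherits(t0, "POSIXct")         # TRUE
later <- format(t0 + 3600, "%Y-%m-%d %H:%M", tz = "UTC")  # one hour later
```

The time zone is set explicitly to UTC here so that the result does not depend on the local settings of the computer.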

2.2 Data structures

Data structures are perhaps unknown to users of spreadsheet software, simply because spreadsheet software does not care about data structures; it uses spaghetti data. The structure of data defines how the values are organised and, thus, how they can be accessed (or indexed, see next chapter). Obviously, data structures are only relevant if the data contain (significantly) more than one value.

There is one paramount function that gives access to how data is structured: str(). It returns not just the name of the specific data structure but also information about the contained object names, included data types, dimensions and the first few values. Thus, knowing and using str() is essential to avoid pitfalls and wrong assumptions when writing scripts. Somewhat similar to str() is the function dim(), which returns (or sets) the dimensions of objects. Setting the dimensions can convert some data structures to other data structures.
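A short sketch of both functions, using a small made-up data frame (obs) and vector (v):

```r
# A small example object to inspect
obs <- data.frame(id = 1:3, site = c("a", "b", "c"))

str(obs)   # prints the structure: class, dimensions, column names and types
dim(obs)   # 3 2, i.e., three rows and two columns

# dim() can also set dimensions and thereby turn a vector into a matrix
v <- 1:6
dim(v)              # NULL, a plain vector has no dimensions
dim(v) <- c(2, 3)   # two rows, three columns, filled column by column
is.matrix(v)        # TRUE
v[2, 3]             # 6, the last element of the original vector
```

The exact text printed by str() differs slightly between R versions, so it is not reproduced here.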

Data structures can be converted into each other to a certain degree, i.e., whenever the conversion does not violate any of the constraints of the target data structure. Conversion is performed similarly to the conversion of data types, with the function as.DATASTRUCTURE(), where DATASTRUCTURE denotes the target data structure. To query the current data structure use is.DATASTRUCTURE(), which returns a logical value. Alternatively, and more universally, class() can be used.
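As a sketch of this naming pattern, the following lines convert a made-up matrix (m) and data frame (mixed) with functions from base R:

```r
m <- matrix(1:4, nrow = 2)
is.matrix(m)                  # TRUE

# matrix to data frame: allowed, a data frame is less constrained
m_df <- as.data.frame(m)
is.data.frame(m_df)           # TRUE

# matrix to vector: the dimensions are dropped, the values are kept
m_vec <- as.vector(m)         # 1 2 3 4

# data frame with mixed types to matrix: possible, but a matrix allows
# only one data type, so everything is coerced to character
mixed <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
mixed_m <- as.matrix(mixed)
typeof(mixed_m)               # "character"

class(m_df)                   # "data.frame"
```

The last conversion shows what violating a constraint of the target structure means in practice: R does not refuse the conversion but silently changes the data type.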

Data structures follow a more or less logical order from simple but constrained to complex but lax:

vector to matrix (to array) to list to data frame.

There are further data structures commonly encountered in R. The most common and important ones are S4-objects. Actually, users can define their own data structure simply by assigning a new class name to an object using class(). However, all these “derivative” data structures, including S4-objects, consist of the five structures listed above.

  • Vector. Vectors are the most primitive data structure in R. They are one-dimensional objects (m:1) of length m, i.e., they contain m elements organised as m rows in one column. There are no zero-dimensional data structures (scalars) in R; an object of length 1 is still a vector. Vectors can either be atomic vectors or lists. This is a bit confusing, because lists are a data structure in themselves. Throughout this book the term vector is used to refer to atomic vectors, not lists. To test if an object is a vector, use the function is.vector(). However, to be sure, it is more appropriate to use either is.atomic() or is.list(). Vectors can be of any data type but this type must be used consistently. This means that a vector cannot contain a mix of numeric and character values. Accordingly, if a data structure with mixed data types is converted into a vector, the data type is coerced. Vectors are created with functions like c() (combine or concatenate) or seq() (sequence). The number of elements of a vector (its length) can be queried and changed with length(). Each element of a vector can be labelled with names. These names can be queried with names().

  • Factor. Factors denote vectors with nominal values. Each element of a factor is stored as an integer value. Additionally, a factor comprises a further internal vector that denotes the names associated with the integer values. Thus, factors are actually integer vectors with the values referring to the set of provided names. Factors can be useful when dealing with large amounts of categorical data but they can also be a pain in the back when importing, for example, ASCII data without explicitly specifying whether the imported text should be converted to factors (for example via the stringsAsFactors argument of read.table()). The usual consequences are unexpected behaviours of functions like c(), nchar() or sum().

  • Matrix. Adding one more dimension leads to a matrix, i.e., a two-dimensional data structure (m:n). Matrices consist of values organised in m rows and n columns. Thus, their length is the product of m and n. Matrices are a frequently used data structure for many algebraic operations with accordingly optimised algorithms (like t(), transpose, and diag(), matrix diagonal). To query the number of rows use nrow(), the number of columns is returned by ncol() and length() provides the total number of elements. To test if an object is a matrix use is.matrix(). Like vectors, matrices can only be of one consistent data type. Matrices are created from scratch with matrix(). They can also be created by binding vectors, either row-wise (rbind()) or column-wise (cbind()). Likewise, it is possible to use dim() to create matrices from other data structures. Names of matrix rows are queried and set by rownames() and likewise for columns with colnames().

  • Data frame.

  • List.

  • S4-object.
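The properties of vectors, factors and matrices described in this section can be sketched with base R as follows. All object names are invented for illustration.

```r
# Vector: one consistent data type, optional element names
v <- c(a = 1, b = 2, c = 3)
length(v)                     # 3
names(v)                      # "a" "b" "c"
is.vector(5)                  # TRUE, a single value is a vector of length 1
mixed_type <- typeof(c(1, "x"))  # "character", mixed input is coerced

# Factor: integer codes that point to a set of level names
f <- factor(c("low", "high", "low"))
levels(f)                     # "high" "low", sorted alphabetically
f_codes <- as.integer(f)      # 2 1 2

# Typical pitfall: numbers that were imported as a factor
g <- factor(c("10", "20", "10"))
wrong <- as.numeric(g)                  # 1 2 1, the codes and not the values
right <- as.numeric(as.character(g))    # 10 20 10

# Matrix: m rows, n columns, length m * n
m <- matrix(1:6, nrow = 2, ncol = 3)
m_size <- c(nrow(m), ncol(m), length(m))  # 2 3 6
t_dim <- dim(t(m))                        # 3 2 after transposing
bound <- rbind(1:3, 4:6)                  # 2 x 3 matrix built row-wise
colnames(m) <- c("x", "y", "z")
m[, "y"]                                  # second column: 3 4
```

Converting a factor to character before converting it to numeric is the usual way around the pitfall shown above.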

2.3 Indexing R-objects

2.4 Importing/reading data

2.5 Building blocks

2.5.1 Logical (Boolean) operators

Logical operations (“Is x greater than y?”) are possible with nearly all data types. They are performed with Boolean operators and will usually return logical values (TRUE or FALSE). In R, Boolean operators are written as symbols rather than as words, unlike their usual implementation in spreadsheet software:

Logical operator        Implementation in R    Example
greater than            >                      1 > 2
smaller than            <                      2 < 1
greater or equal        >=                     FALSE >= 0
smaller or equal        <=                     "a" <= "b"
equals                  ==                     0 == round(0.1, 0)
AND                     & or &&                1:5 > 1 & 1:5 <= 3
OR                      | or ||                x == TRUE | x > 0
negation (not equal)    ! (!=)                 "a" != "b"
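Evaluating the examples from the table gives the following results. This is a plain sketch in base R; the vector x is made up, because the table leaves its value open.

```r
gt <- 1 > 2                 # FALSE
lt <- 2 < 1                 # FALSE
ge <- FALSE >= 0            # TRUE, FALSE is treated as 0
le <- "a" <= "b"            # TRUE, strings are compared by sort order
eq <- 0 == round(0.1, 0)    # TRUE, 0.1 rounds to 0
ne <- "a" != "b"            # TRUE
not_true <- !TRUE           # FALSE, plain negation

# Operators work element by element on vectors
x <- c(-1, 0, 2)            # hypothetical values for x
and_vec <- 1:5 > 1 & 1:5 <= 3   # FALSE TRUE TRUE FALSE FALSE
or_vec  <- x == TRUE | x > 0    # FALSE FALSE TRUE
```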

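Note that AND and OR each come in two different forms: & and &&, | and ||. The operators with a single symbol are the vectorised versions. Thus, they evaluate the operation for each element of the input vectors and return a corresponding output vector. The operators with two symbols are meant for single values, as in the condition of an if statement, and evaluate the right-hand side only if the left-hand side does not already decide the result. Older versions of R silently used only the first element when these operators were given longer vectors; since R 4.3.0 this is an error. The first elements are therefore selected explicitly here:

1:5 < 4 & 1:5 > 2
## [1] FALSE FALSE  TRUE FALSE FALSE
(1:5 < 4)[1] && (1:5 > 2)[1]
## [1] FALSE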
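The difference between the two forms can be sketched as follows. The example deliberately passes only single values to && and ||, because R 4.3.0 and later raise an error for longer operands; the object names are made up.

```r
x <- 1:5

# Single symbol: element-wise, returns one value per element
vec_result <- x < 4 & x > 2     # FALSE FALSE TRUE FALSE FALSE

# Reduce a logical vector to a single value with any() or all()
any_hit <- any(x < 4 & x > 2)   # TRUE, at least one element fulfils both
all_hit <- all(x < 4 & x > 2)   # FALSE

# Double symbol: single values only, evaluated from left to right.
# The right-hand side is skipped if the left-hand side decides the result.
y <- NULL
or_result  <- is.null(y) || y > 0    # TRUE, y > 0 is never evaluated
and_result <- !is.null(y) && y > 0   # FALSE, y > 0 is never evaluated
```

This left-to-right evaluation is why && and || are the usual choice inside if conditions, where a first check often guards a second one that would otherwise fail.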

loops, BOOLEAN algebra, conditionals

2.6 Good practice

2.6.1 A clean start

2.6.2 Structuring code

2.6.3 Commenting

2.6.4 Clearing the workbench

2.7 Useful functions

quote Hadley that it is good to read R code (your own, from books and from forums) like you would read a journal or newspaper article: to stay awake, learn new functions and discover other people’s approaches.

There is a fundamentally basic body of functions that occur in almost every R script or function. These are functions you should have in your mind, along with their arguments and the way they work, without needing to look at their documentation. This must be your basic vocabulary, your everyday communication toolbox.