Chapter 3 Tweaking and optimising

3.1 Converting data structures

Sometimes it becomes necessary to convert one data structure into another. While this can always be done in a loop that fills a predefined output object, there are more elegant ways to do it.

3.1.1 Vector to matrix to vector

To convert a vector to a matrix, simply bind it with rbind() or cbind(). The other way around depends on what is desired. To convert the entire matrix to a vector, use as.numeric() (or the conversion function of any other data type, see data types). This converts the matrix column-wise (i.e., it combines all values from column 1 to column m). To convert the matrix row-wise, transpose it first. To convert only selected rows of the matrix, index them (see data structures).

Since a data frame can be very similar to a matrix and can always be created from one, conversions of the style vector to data frame to vector are analogous to the above.

## create example data set
X <- rbind(1:5, 6:10)

## convert matrix to vector (column-wise)
x <- as.numeric(X)

## convert matrix to vector (row-wise)
x <- as.numeric(t(X))

## convert only the first row
x <- X[1,]
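
The opposite direction (vector to matrix) and the data frame analogue mentioned above can be sketched as follows. This is only an illustration; the object names (v, V_row, V_col, V_mat, D, d_col, d_row) are made up here and do not appear elsewhere in this chapter.

```r
## create example vector
v <- 1:6

## convert vector to a matrix with one row or one column
V_row <- rbind(v)
V_col <- cbind(v)

## convert vector to a 2 x 3 matrix (filled column-wise)
V_mat <- matrix(data = v, nrow = 2)

## convert matrix to data frame and back to a vector (column-wise)
D <- as.data.frame(V_mat)
d_col <- unlist(D, use.names = FALSE)

## convert data frame to a vector (row-wise)
d_row <- as.numeric(t(as.matrix(D)))
```

Note that unlist() works on a data frame because a data frame is internally a list of columns.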

3.1.2 Vector to list to vector

Vectors can be converted to lists either with list() or as.list(), which give fundamentally different results. While list() returns a list of length one with the input vector being the one and only list element, as.list() returns a list of the same length as the input vector, in which the ith list element is the ith element of the vector.

To collapse all elements of a list to one vector, use unlist(). To transfer the list content element-wise use do.call() (see below).

## create example data set
x <- 1:3

## convert vector to list with one element
list(x)
## [[1]]
## [1] 1 2 3
## convert vector to list by elements
as.list(x)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## collapse list content to one vector
unlist(list(x))
## [1] 1 2 3
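
The back-conversion of an element-wise list can be sketched with both unlist() and do.call(). The named vector and the object names used here (x_named, x_list, x_unlist, x_docall, x_matrix) are assumptions made for this illustration only.

```r
## create example data set, a named vector
x_named <- c(a = 1, b = 2, c = 3)

## convert vector to list by elements, names are kept
x_list <- as.list(x_named)

## collapse the list to one vector again
x_unlist <- unlist(x_list)

## the same, passing the list elements as arguments to c()
x_docall <- do.call(c, x_list)

## passing the list elements to rbind() gives a 3 x 1 matrix instead
x_matrix <- do.call(rbind, x_list)

## both vector versions are identical to the input
identical(x_unlist, x_named)
identical(x_docall, x_named)
```

Keep in mind that unlist() coerces all elements to one common data type, so a list mixing numbers and characters collapses to a character vector.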

3.1.3 Matrix to list to matrix

To convert the rows or columns of a matrix to a list, it is easiest (and fastest) to convert the matrix to a data frame first and then the data frame to a list. If the matrix is supposed to be converted row-wise, it can simply be transposed before converting it to a data frame.

The corresponding back-conversion uses the function do.call(), provided with the operation to be performed (in this case either rbind or cbind) and the object to which the action shall be applied.

## create example data set
X <- matrix(data = 1:10, 
            nrow = 5)

## convert matrix col-wise to list
X_col <- as.list(as.data.frame(X))

## convert matrix row-wise to list
X_row <- as.list(as.data.frame(t(X)))

## convert list to matrix, column-wise
X <- do.call(cbind, X_col)

## convert list to matrix, row-wise
X <- do.call(rbind, X_row)
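
A small round-trip check may help to see that both conversions restore the original values. It is a sketch under the assumption that the dimension names added by the data frame step (V1, V2, ...) are not wanted and can be removed with unname(). The object names (X_orig, X_from_col, X_from_row, X_row_alt) are chosen for this illustration.

```r
## create example data set
X_orig <- matrix(data = 1:10, 
                 nrow = 5)

## convert matrix to lists, column-wise and row-wise
X_col <- as.list(as.data.frame(X_orig))
X_row <- as.list(as.data.frame(t(X_orig)))

## convert back and drop the names V1, V2, ... added on the way
X_from_col <- unname(do.call(cbind, X_col))
X_from_row <- unname(do.call(rbind, X_row))

## both round trips restore the original matrix
identical(X_from_col, X_orig)
identical(X_from_row, X_orig)

## alternative row-wise split without the data frame step
X_row_alt <- lapply(X = seq_len(nrow(X_orig)), 
                    FUN = function(i) X_orig[i, ])
```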

3.2 Fighting the loops

3.2.1 Vectorisation

One of the most frequently repeated pieces of advice in the R community is to vectorise your code instead of using loops. Although for()-loops appear reasonable from a user’s point of view, they are pretty slow and, as Hadley Wickham puts it, not very expressive. R has specialised alternatives to for()-loops, each designed for a specific task and data structure.

Some operations are already vectorised without explicitly thinking of them. To add, for example, the elements of two vectors a <- 1:10 and b <- 1:10 one could write c <- numeric(10); for(i in 1:10) {c[i] <- a[i] + b[i]} but intuitively one would simply write a + b, the vectorised form.
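
That both forms give the same result can be checked directly. This is only a small sketch; the names c_loop and c_vec are chosen here to avoid masking the function c().

```r
## create example data set
a <- 1:10
b <- 1:10

## loop version, filling a predefined output vector
c_loop <- numeric(length(a))
for(i in seq_along(a)) {
  c_loop[i] <- a[i] + b[i]
}

## vectorised version
c_vec <- a + b

## compare both results
all.equal(c_loop, c_vec)
## [1] TRUE
```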

For most structures more complex than vectors, there is the apply()-family. Sadly, apply() itself is not really a vectorised solution but a wrapper for for()-loops (Don’t believe it? Type apply in the R console to inspect the code). The really vectorised alternatives to apply() are introduced in the chapter The apply family, ordered by the task one wishes to carry out with them rather than by their name.

For some of the most commonly encountered tasks, like calculating row-wise or column-wise sums or averages, there are prepared vectorised functions that should or could be used instead of loops. They are usually comparably fast to correctly used apply()-variants and most certainly faster than apply() itself or loops. Some of these specialised functions are discussed in the chapter Some vectorised functions for specific purpose.

Here is a brief example comparing the evaluation times of calculating the row-wise sums using different approaches. It uses the function microbenchmark from the microbenchmark package to infer the computation time:

## create example data set
X <- matrix(data = runif(10^4), 
            nrow = 100)

## define the for()-loop approach
f1 <- function(X) {
  x <- numeric(length = nrow(X))
  for(i in seq_along(x)) {
    x[i] <- sum(X[i,])
  }
  return(x)
}

## define the apply function
f2 <- function(X) {
  x <- apply(X = X, 
             MARGIN = 1, 
             FUN = sum)
  return(x)
}

## define the rowSums function
f3 <- function(X) {
  x <- rowSums(X)
  return(x)
}

## define the lapply function (list of rows, hence the transpose)
X_list <- as.list(as.data.frame(t(X)))
f4 <- function(X_list) {
  x <- lapply(X = X_list, FUN = sum)
  return(x)
}

## perform benchmark of all four approaches
## (result not named t, to avoid masking the transpose function)
timing <- microbenchmark::microbenchmark(f1(X), 
                                         f2(X), 
                                         f3(X), 
                                         f4(X_list))

print(timing)
## Unit: microseconds
##        expr     min       lq      mean  median       uq      max neval
##       f1(X) 163.980 169.2255 210.75180 176.687 194.2000 1428.500   100
##       f2(X) 192.709 196.8010 260.98566 203.464 214.5385 1571.255   100
##       f3(X)  33.403  34.2920  36.49589  34.912  35.9850  149.852   100
##  f4(X_list)  50.061  52.0490  56.76670  54.046  59.3450   87.303   100

Note that for the last approach the matrix first needs to be converted to a data frame and then to a list, and that lapply() returns a list rather than a vector. If this conversion has to be done each time the row-wise sums are evaluated, this approach will probably be the slowest of all! However, when the data is already in a suitable structure, lapply() is almost as fast as rowSums(), and judging by the median times above both are roughly three to six times faster than the for()-loop and apply(). Admittedly, this example is close to nonsense because calculating row-wise sums is easy. But if you need to do more sophisticated evaluations, rowSums() will not be enough and you need to consider one of the other three approaches.
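
Before comparing speed it is worth confirming that the approaches return the same values. The following sketch assumes a small random matrix and uses vapply() as one possible way to get a plain vector out of the list-based approach; the object names (X_small, r_loop, r_apply, r_rowsums, r_list) are made up for this illustration.

```r
## create a small example data set
set.seed(1)
X_small <- matrix(data = runif(20), 
                  nrow = 4)

## for()-loop approach
r_loop <- numeric(length = nrow(X_small))
for(i in seq_along(r_loop)) {
  r_loop[i] <- sum(X_small[i,])
}

## apply approach
r_apply <- apply(X = X_small, MARGIN = 1, FUN = sum)

## rowSums approach
r_rowsums <- rowSums(X_small)

## list-based approach, the list must contain the rows of the matrix
X_small_list <- as.list(as.data.frame(t(X_small)))
r_list <- unname(vapply(X = X_small_list, 
                        FUN = sum, 
                        FUN.VALUE = numeric(1)))

## all four results agree
all.equal(r_loop, r_apply)
all.equal(r_loop, r_rowsums)
all.equal(r_loop, r_list)
```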

3.2.2 The apply family

As noted above, the apply family includes a series of functions that are handy for real vectorisation of tasks, but apply() itself is not one of them. However, apply() can still be handy when you work with matrices and do not want to convert them into other, more suitable data structures. Beyond apply() there are the following members:

rapply(): I have not found a useful application for this function yet.

tapply(): applies a function to subsets of a vector, defined by a second vector.
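
What tapply() does can be sketched with a short example. The data and the object names (values, groups, group_means) are invented for this illustration.

```r
## create example data set, values and their group membership
values <- c(10, 20, 30, 40, 50, 60)
groups <- c("a", "b", "a", "b", "a", "c")

## calculate the mean of the values for each group
group_means <- tapply(X = values, 
                      INDEX = groups, 
                      FUN = mean)

group_means
##  a  b  c 
## 30 30 60
```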

Some of the examples are ones I found on Stack Overflow.

3.2.2.1 matrix – manipulation – vector

Usually this task is intended to be performed row-wise or column-wise. For simple tasks there are predefined functions such as rowSums() and colSums(), rowMeans() and colMeans(), and so on. A more generic way is to use the apply()-function. It requires the input matrix, the MARGIN (i.e., whether the matrix shall be manipulated row-wise, 1, or column-wise, 2) and the function FUN to be applied, as well as optional further function arguments.

## create example data set
X <- matrix(data = 1:10, 
            nrow = 5)

## calculate row-wise means while ignoring NA-values
apply(X = X,
      MARGIN = 1,
      FUN = mean,
      na.rm = TRUE)
## [1] 3.5 4.5 5.5 6.5 7.5
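
Since the example matrix contains no missing values, na.rm has no visible effect there. The following sketch introduces one NA value (an assumption made for illustration, as are the object names X_na, m_apply and m_rowmeans) and compares apply() with the predefined rowMeans().

```r
## create example data set with one missing value
X_na <- matrix(data = 1:10, 
               nrow = 5)
X_na[2, 1] <- NA

## calculate row-wise means while ignoring NA-values
m_apply <- apply(X = X_na,
                 MARGIN = 1,
                 FUN = mean,
                 na.rm = TRUE)

## the same with the predefined function
m_rowmeans <- rowMeans(X_na, na.rm = TRUE)

m_apply
## [1] 3.5 7.0 5.5 6.5 7.5
```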

3.2.2.2 matrix – manipulation – data frame

There is no function to do this job. If a matrix shall be manipulated and returned as a data frame, this has to be done in two steps: i) manipulation and output as matrix and ii) conversion of the matrix to a data frame.

Open question: would it be faster to first convert the matrix to a data frame and then use an appropriate apply-function?

## create example data set
X <- matrix(letters[1:10], 
            nrow = 5, 
            byrow = TRUE)

## sort data set column-wise
X_sort <- apply(X = X, 
                MARGIN = 2, 
                FUN = sort)

## convert the sorted matrix to data frame
X_sort <- as.data.frame(x = X_sort)
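
The alternative order, converting to a data frame first and then using lapply(), can be sketched as follows. The sketch only shows that both routes give the same result; which one is faster is not measured here. The reversed letters and the object names (X_rev, X_sort_a, X_sort_b) are assumptions made for this illustration.

```r
## create example data set, unsorted this time
X_rev <- matrix(letters[10:1], 
                nrow = 5, 
                byrow = TRUE)

## route 1: sort the matrix column-wise, then convert to data frame
X_sort_a <- as.data.frame(apply(X = X_rev, 
                                MARGIN = 2, 
                                FUN = sort), 
                          stringsAsFactors = FALSE)

## route 2: convert to data frame first, then sort each column
X_df <- as.data.frame(X_rev, stringsAsFactors = FALSE)
X_sort_b <- as.data.frame(lapply(X = X_df, 
                                 FUN = sort), 
                          stringsAsFactors = FALSE)

## both routes give the same data frame
all.equal(X_sort_a, X_sort_b)
```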

3.2.2.3 matrix – manipulation – list

Similar to

3.2.2.4 data frame – manipulation – vector

3.2.2.5 data frame – manipulation – matrix

3.2.2.6 data frame – manipulation – data frame

3.2.2.7 data frame – manipulation – list

3.2.2.8 list – manipulation – vector

sapply() or, more clumsily, unlist(lapply())

vapply() is similar to sapply() but requires the type and length of each result to be declared in advance (argument FUN.VALUE). This makes it safer and can also make it faster.
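
The three variants can be compared in a short sketch. The example list and the object names (L, s1, s2, s3) are made up for this illustration.

```r
## create example data set, a list of integer vectors
L <- list(a = 1:3, 
          b = 4:6, 
          c = 7:9)

## calculate the sum of each list element, returned as vector
s1 <- sapply(X = L, FUN = sum)
s2 <- unlist(lapply(X = L, FUN = sum))

## the same with vapply, each result must be one integer value
s3 <- vapply(X = L, FUN = sum, FUN.VALUE = integer(1))

s3
##  a  b  c 
##  6 15 24
```

If a result does not match FUN.VALUE, vapply() stops with an error instead of silently returning a different structure, as sapply() may do.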

3.2.2.9 list – manipulation – matrix

3.2.2.10 list – manipulation – data frame

3.2.2.11 list – manipulation – list

lapply()

A special case is using two lists as input and returning a list object.

## input list A, just a list of vectors
A <- list(c(1, 2, 3),
          c(4, 5, 6))

## input list B, scalars used for multiplication
B <- list(1,
          2)

## list B is applied element-wise to list A,
## SIMPLIFY = FALSE makes sure that a list is returned
C <- mapply(FUN = function(X, Y) {
              X * Y}, 
            X = A, 
            Y = B, 
            SIMPLIFY = FALSE)
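
Both cases, one input list and two input lists, can be sketched in a few lines. Map() is used here as a base R shortcut for mapply() without simplification; the object names (A_in, B_in, A_squared, C_map) are chosen for this illustration.

```r
## input lists, as in the example above
A_in <- list(c(1, 2, 3),
             c(4, 5, 6))
B_in <- list(1,
             2)

## one list as input, square each element
A_squared <- lapply(X = A_in, 
                    FUN = function(v) v^2)

## two lists as input, multiply them element-wise
C_map <- Map(function(X, Y) X * Y, A_in, B_in)

C_map
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1]  8 10 12
```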

3.2.3 Some vectorised functions for specific purpose

rowSums, rowMeans, rowMedians, all the matrixStats functions

running or rolling functions
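
A few of these functions are sketched below. The matrixStats call is only run if that package happens to be installed. Using cumsum() and stats::filter() for running and rolling calculations is one possible choice made for this illustration, not necessarily the set of functions this section will eventually cover. All object names are invented.

```r
## create example data set
X_num <- matrix(data = 1:10, 
                nrow = 5)

## row-wise and column-wise sums and means
rs <- rowSums(X_num)
rm <- rowMeans(X_num)
cs <- colSums(X_num)
cm <- colMeans(X_num)

## row-wise medians, only if matrixStats is available
if(requireNamespace("matrixStats", quietly = TRUE)) {
  rmed <- matrixStats::rowMedians(X_num)
}

## running (cumulative) sum of a vector
v <- c(1, 3, 5, 7, 9)
v_cumsum <- cumsum(v)

## rolling mean over a centred window of three values
v_roll <- as.numeric(stats::filter(x = v, 
                                   filter = rep(1 / 3, 3), 
                                   sides = 2))
```

The rolling mean is NA at both ends of the vector because the window of three values is incomplete there.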

3.3 Multi-core environment

3.4 Writing a function

3.5 Creating a package