Beginning with R — The uncharted territory Part 1

Coming from a non-programming background, with Python as my first exposure to programming and data analysis, getting my hands dirty in R seemed pretty daunting at first. R can feel a bit peculiar at times, since it was designed around data analysis and statistics rather than general-purpose software development, as Python was. But as I push myself to learn R's many quirks and its advantages over Python, it gives me a different perspective on doing data analysis. Plus, R has a strong edge over Python: a vast, up-to-date collection of libraries implementing statistical methods, written by statisticians the world over.

Quirks aside, RStudio, the most interesting IDE developed so far for R, makes data analysis feel like a fun activity. Other RStudio features, like writing reports with \(\LaTeX\) and HTML support and building static websites with Hugo, make life so much easier.

So let's start the journey with R.


Introduction to R

R is a dynamic language developed largely for statistical computing.

R data types

The data type of a variable created in the workspace is assigned automatically, just like in Python.

a <- 4.2; b <- 'Hello!'
print(a); print(b)
## [1] 4.2
## [1] "Hello!"

To check the type of a variable, use typeof().

print(typeof(a)); print(typeof(b))
## [1] "double"
## [1] "character"

Basic data types are:

  • String/Character
  • Number
    • Integer
    • Double
    • Complex
  • Boolean/Logical

A numeric literal, whether it looks like an integer or a float, is stored as a double by default.

a <- 20; print(typeof(a))
## [1] "double"

To explicitly request an integer, add the suffix L.

b <- 20L; print(typeof(b))
## [1] "integer"
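The difference between the two numeric types can be checked directly. The following is a small sketch; the variable names are only illustrative.

```r
# A literal without a suffix is a double; the L suffix requests an integer.
x <- 20
y <- 20L
print(is.integer(x))     # FALSE
print(is.integer(y))     # TRUE

# as.integer() converts explicitly and drops the fractional part.
z <- as.integer(4.9)
print(z)                 # 4

# Arithmetic that mixes integer and double yields a double.
print(typeof(y + 1.5))   # "double"
print(typeof(y + 1L))    # "integer"
```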

Handling undefined values

Handling undefined/missing values is somewhat different from Python, where missing data is typically represented by None or, in NumPy and pandas, by NaN. In R, undefined values are represented using

  • NULL
  • NA
  • NaN

All three behave differently.

NULL is the null object and is used when no value is present at all. If a slot in a vector or matrix exists but its value is not usable (like a fill value), we use NA or NaN.

NA is R's missing-value indicator, while NaN ("not a number") marks the result of an undefined numeric operation.

print(class(NULL)); print(class(NA)); print(class(NaN))
## [1] "NULL"
## [1] "logical"
## [1] "numeric"

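NA is logical by default, representing a value that is neither TRUE nor FALSE, which is why class(NA) reports "logical". It is coerced to the type of the vector it sits in, so it can mark a missing value in any vector.

NaN results from undefined arithmetic such as 0/0.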
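A short sketch of how the three behave in practice may help here. The variable names are illustrative and not from the original post.

```r
# NULL has length zero and disappears when combined into a vector.
v <- c(1, NULL, 3)
print(length(NULL))           # 0
print(length(v))              # 2

# NA occupies a slot and propagates through arithmetic
# unless it is explicitly removed.
w <- c(1, NA, 3)
print(length(w))              # 3
print(sum(w))                 # NA
print(sum(w, na.rm = TRUE))   # 4

# NaN comes from undefined arithmetic such as 0/0.
n <- 0 / 0
print(is.nan(n))              # TRUE
print(is.na(n))               # TRUE: is.na() also flags NaN
print(is.nan(NA))             # FALSE: NA is not NaN
```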

Operators

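Mathematical operators are much like Python's, although a few symbols differ (Python writes **, % and // for exponent, modulus and integer division).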

  • Multiplication *
  • Division /
  • Addition +
  • Subtraction -
  • Exponent ^
  • Modulus %%
  • Integer Division %/%

Relational operators are the same as in Python.

Logical operators are as follows:

  • Not !
  • Element-wise AND &
  • Scalar AND && (short-circuiting)
  • Element-wise OR |
  • Scalar OR || (short-circuiting)
  • Membership in a set %in%
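The operators that differ most from Python can be tried out as follows. This is a minimal sketch with made-up values.

```r
# Modulus, integer division and exponent.
print(7 %% 3)     # 1
print(7 %/% 3)    # 2
print(2 ^ 3)      # 8

# & and | work element by element on whole vectors.
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)
print(x & y)      # TRUE FALSE FALSE
print(x | y)      # TRUE TRUE TRUE

# && and || are meant for single values and short-circuit:
# the right-hand side is never evaluated here, so stop() does not fire.
flag <- FALSE && stop("never evaluated")
print(flag)       # FALSE

# %in% tests membership, element by element on its left-hand side.
hits <- c(1, 5) %in% c(1, 2, 3)
print(hits)       # TRUE FALSE
```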

Data structures

R has six basic data structures:

  • Lists
  • Vectors (or Atomic vectors)
  • Matrices
  • Arrays
  • Factors
  • Data Frames

Lists

A list in R can hold elements of different types without any coercion. A list can contain numbers, characters, logicals, matrices, vectors, arrays, other lists, etc.

To create a list, use the list() function.

list_data <- list('green', 'yellow', 1, 2, c(4,5,6))
print(list_data)
## [[1]]
## [1] "green"
## 
## [[2]]
## [1] "yellow"
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 4 5 6

To name the entries of a list, use the names() function.

names(list_data) <- c("A", "B", "C", "D", "E")

To access a particular entry in a list by name, use $.

print(list_data$A); print(list_data$B) ### Access values of entries with name A and B
## [1] "green"
## [1] "yellow"
print(list_data[1]) ### Access first label and its value
## $A
## [1] "green"
print(list_data[[1]]) ### Access first value
## [1] "green"
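The difference between single and double brackets is easier to see by checking the class of what each returns. Here is a small sketch using a hypothetical list called colours.

```r
# A small named list (hypothetical data, for illustration only).
colours <- list(A = "green", B = "yellow", E = c(4, 5, 6))

# Single brackets return a sub-list; double brackets return the element itself.
sub <- colours[1]
val <- colours[[1]]
print(class(sub))          # "list"
print(class(val))          # "character"

# Names work inside double brackets too, and the result can be indexed further.
print(colours[["E"]][2])   # 5
print(colours$E[2])        # 5
```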

To merge two or more lists, use c()

a <- list(1,2,3,4); b <- list(5,6,7,8); merged <- c(a,b); print(merged)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 8

Some predefined constants in R (these are character vectors rather than lists):

print(letters); print(LETTERS); print(month.abb); print(month.name)
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"

Vectors

To create a vector, we use the c() function. It concatenates its arguments, much like building a list in Python, except that every element of an atomic vector must share one type.

x <- c(1,2,3,4,5.4,'hello',TRUE,FALSE); print(x)
## [1] "1"     "2"     "3"     "4"     "5.4"   "hello" "TRUE"  "FALSE"

As we can see, c() accepts numbers, characters and logicals as input. But notice that all the elements of the resulting vector have been coerced to character type, because the vector contains the string "hello". This is the effect of implicit coercion.
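Implicit coercion always moves towards the more general type: logical, then integer, then double, then character. A quick sketch to confirm this:

```r
# Each vector is coerced to the most general type among its elements.
print(typeof(c(TRUE, 2L)))    # "integer"
print(typeof(c(2L, 3.5)))     # "double"
print(typeof(c(3.5, "a")))    # "character"

# Logicals become 1 and 0 when mixed with numbers.
mixed <- c(TRUE, FALSE, 2)
print(mixed)                  # 1 0 2
```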

To create a numeric vector of a given length, use the vector() function.

x <- vector("numeric", length = 20); print(x)
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Such a vector is useful for preallocation: filling its slots inside a for loop is faster than appending to an empty vector, because each append makes R copy the vector, slowing the whole process.
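The two approaches can be sketched side by side. The names n, squares and grown are only illustrative, and the loop is kept tiny, so no timing difference would be visible at this size.

```r
n <- 10

# Preallocate once, then fill each slot in place.
squares <- vector("numeric", length = n)
for (i in seq_len(n)) {
  squares[i] <- i^2
}
print(squares)

# Growing an empty vector gives the same values,
# but builds a new, longer vector on every iteration.
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}
print(all.equal(squares, grown))   # TRUE
```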

Coercion: objects like vectors and data frames can be coerced to a different class using the as.* family of functions, such as as.character() and as.logical().

x <- c(1,2,3,4); print(class(x))
## [1] "numeric"
y <- as.character(x); print(class(y))
## [1] "character"
y <- as.logical(x); print(class(y))
## [1] "logical"
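It can help to look at the values, not just the class, after explicit coercion. In this small sketch, suppressWarnings() is used only to keep the output quiet.

```r
# Zero becomes FALSE, any other number becomes TRUE.
lg <- as.logical(c(0, 1, 2))
print(lg)      # FALSE TRUE TRUE

# Strings that look like numbers convert cleanly.
num <- as.numeric(c("1", "2.5"))
print(num)     # 1.0 2.5

# Strings that do not look like numbers become NA (R warns about this).
bad <- suppressWarnings(as.numeric(c("3", "three")))
print(bad)     # 3 NA

# Logicals convert to 1 and 0.
int <- as.integer(c(TRUE, FALSE))
print(int)     # 1 0
```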

Matrices

A matrix is the same as a vector except that it has an additional dimension attribute. It is a two-dimensional data structure.

a <- matrix(c(6,2,6,8,3,2,6,8,0), nrow = 3, ncol = 3); print(a); print(attributes(a))
##      [,1] [,2] [,3]
## [1,]    6    8    6
## [2,]    2    3    8
## [3,]    6    2    0
## $dim
## [1] 3 3

Matrices in R are filled column-wise by default, as the output above shows. In Python, by contrast, a NumPy array is filled row-wise by default.
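The fill order of matrix() is controlled by its byrow argument, which defaults to FALSE (column by column). A short sketch with the same values as above:

```r
vals <- c(6, 2, 6, 8, 3, 2, 6, 8, 0)

# Default: values go down the first column, then the second, and so on.
by_col <- matrix(vals, nrow = 3, ncol = 3)

# byrow = TRUE fills across the first row, then the second, and so on.
by_row <- matrix(vals, nrow = 3, ncol = 3, byrow = TRUE)
print(by_row)

# Filling by row is the same as transposing the column-filled matrix.
print(identical(by_row, t(by_col)))   # TRUE
```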

In R, we can pass the names of rows and columns.

a <- matrix(c(6,2,6,8,3,2,6,8,0), nrow = 3, ncol = 3,
            dimnames = list(c('a','b','c'), c('x','y','z'))); print(a)
##   x y z
## a 6 8 6
## b 2 3 8
## c 6 2 0
print(colnames(a)); print(rownames(a))
## [1] "x" "y" "z"
## [1] "a" "b" "c"

To access the elements of a matrix, use square brackets.

a1 <- matrix(c(6,2,6,8,3,2,6,8,0), nrow = 3, ncol = 3,
            dimnames = list(c('a','b','c'), c('x','y','z'))); print(a1)
##   x y z
## a 6 8 6
## b 2 3 8
## c 6 2 0
print(a1[2,2])              ### select 2nd row and 2nd column element 
## [1] 3
print(a1[c(2,3),c(1,2)])    ### select rows 2 and 3 and columns 1 and 2
##   x y
## b 2 3
## c 6 2

But a[2,] (2nd row) or a[,2] (2nd column) returns a vector. To get a matrix instead, use drop = FALSE.

print(a[2,]); print(dim(a[2,]))
## x y z 
## 2 3 8
## NULL
print(a[2,,drop = FALSE]); print(dim(a[2,,drop = FALSE]))
##   x y z
## b 2 3 8
## [1] 1 3

A matrix can also be indexed with a single vector of positions, which counts elements in column-major order.

print(a[c(1,2,4,6)])
## [1] 6 2 8 2

You can also do indexing using logical vectors.

print(a[c(TRUE,FALSE,TRUE), c(TRUE,TRUE,FALSE)])
##   x y
## a 6 8
## c 6 2

To transpose a matrix use t(a)

To combine vectors or matrices, use rbind or cbind
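Here is a brief sketch of all three functions on a small matrix. The values are made up for illustration.

```r
# A 2 x 3 matrix, filled column by column.
m <- matrix(1:6, nrow = 2)

# t() swaps rows and columns.
tm <- t(m)
print(dim(tm))        # 3 2

# rbind() adds a row; the new row needs one value per column.
with_row <- rbind(m, c(10, 20, 30))
print(dim(with_row))  # 3 3

# cbind() adds a column; the new column needs one value per row.
with_col <- cbind(m, c(100, 200))
print(dim(with_col))  # 2 4
print(with_col)
```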

The dimensions of a matrix can also be changed (reshaped) by assigning to dim().

dim(a) <- c(1,9); print(dim(a))
## [1] 1 9
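Reshaping through dim() does not reorder the data. The elements stay in column-major order and only the shape changes, as this small sketch on a hypothetical vector v shows.

```r
# Start from a plain vector and give it a 2 x 3 shape.
v <- 1:6
dim(v) <- c(2, 3)
print(v)

# Reshape to 3 x 2: the first column is now 1, 2, 3 and the second 4, 5, 6.
dim(v) <- c(3, 2)
print(v)
print(v[3, 1])   # 3
print(v[1, 2])   # 4
```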

Arrays, factors and data frames will be covered in the next post.

Puneet Sharma
Research Scholar

My research interests include cloud & aerosol modeling and statistics.
