R SOFTWARE IN HYDROLOGY
Online tutorial
Author: Nejc Bezak
Reviewers of the Slovene version: Mojca Šraj, Lovrenc Pavlin
Design: Nejc Bezak
Publisher: University of Ljubljana, Faculty of Civil Engineering and Geodesy, UNESCO Chair on Water-related Disaster Risk Reduction
Ljubljana, 2024
The content of this publication may be used under the terms of the Creative Commons CC-BY-NC 4.0 Creative Commons Attribution-NonCommercial-ShareAlike 4.0 licence.
This online material is aimed at those who want to gain an in-depth knowledge of data analysis using the R software tool for the field of hydrology and learn how to use the R software tool effectively as a tool for data processing, analysis and interpretation. In particular, the textbook is aimed at students taking the course R software in water management and other courses where R is frequently used, such as hydrology and hydrological modelling.
R is an open-source and freely available statistical tool that provides a wide range of functions for analysing data, visualising results, and developing statistical and other models. Through practical examples and graphical demonstrations, this online material will explore how to use R to solve a variety of problems in hydrology and beyond. We will enrich your knowledge with concrete examples featuring analyses of water data, hydrological models, spatial analyses, and other key aspects of this important field.
The purpose of this online material is not only to provide technical knowledge of the R software and programming language, but also to encourage thinking about complex issues related to water management and developing the ability to take a critical approach to the analysis and interpretation of data and models. The material also includes short practical exercises to further enhance your knowledge.
I believe that this material will be a useful tool for students, researchers, professionals, and all those who want to delve deeper into the field of hydrology with the help of the R software tool. I wish you successful work and a lot of satisfaction in exploring the challenges of hydrology!
Author
R is a variant of the programming language S. S is a programming language developed by John Chambers and colleagues at Bell Telephone Laboratories. The S programming language began to be used in 1976 as an internal environment for statistical analysis – it was originally implemented as Fortran libraries.
In 1988, the system was rewritten in the C language and began to resemble the S system as it is known today. Version 4 of S was released in 1998 and remains the version in use today. The book Programming with Data by John Chambers [1] describes this version of the language.
The S programming language has evolved since the book was published, but its basics have not changed significantly. In 1998, the S programming language was awarded the Association for Computing Machinery’s Software System Award, a very prestigious award in computer science.
The philosophy of S is somewhat different from that of conventional programming languages. The developers’ aim was to create a language that would be suitable for both interactive command-line data analysis and for writing more extensive software applications, making it more similar to traditional programming languages.
One of the key limitations of S was that it was only available in the commercial S-PLUS package. Therefore, in 1991, Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland created R, which was first made public in 1993. The experience of developing R is documented in a 1996 paper in the Journal of Computational and Graphical Statistics [2].
In 1995, Martin Mächler was instrumental in persuading the original authors to use the GNU General Public Licence and make R freely available. This was crucial, as it made the source code of the whole R system available to anyone who wanted to work with it.
In 1997, the R Core Group was set up, bringing together a number of experts in S and S-PLUS. This group currently controls the source code of the R programming language. In 2000, R version 1.0.0 was released.
Today, R runs on almost all standard computer platforms and operating systems. Its open-source character means that anyone can adapt the program on any platform of their choice. R also works on modern tablets and phones, among other devices.
A positive feature of R is its frequent updates. Every year, usually in spring (April or May), a major update is released that includes important new features. During the year, minor bug-fix releases are issued as necessary. Frequent updates and a regular release cycle reflect the software’s active development and ensure that potential problems are fixed in a timely manner. Core developers control the primary source code for R, and many people around the world contribute to development in the form of new features, bug fixes, or both.
The main advantage of R over many other statistical software applications is that it is free in the sense of free software. The copyright for the primary source code of R is held by the R Foundation [3], and the code is published under the GNU General Public License version 2.0 [4]. Under the license, the user is free to: (i) use the program for any purpose, (ii) adapt the program and access the source code, and (iii) improve the program and redistribute the improvements.
Another advantage that R has over many other statistical software applications is its graphical capabilities. The ability to produce high-quality graphics has been one of R’s strengths since the very beginning, and in this respect it is often considered superior to competing tools. This remains true today, even though many more visualisation tools are available than in the past. R’s basic graphics system allows for very precise control over all the elements of a graph. Other more recent graphical systems and packages, such as lattice and ggplot2, make it possible to produce complex and visually appealing data visualisations.
Another positive aspect of using R is not related to the language itself, but to its active community of users. In many ways, a language can be considered successful if it generates a platform for a sufficient number of people to create new features. R is such a platform, and thousands of people around the world contribute to its development. R has extensive support on the website Stack Overflow [5]. It is also worth mentioning the excellent documentation (e.g. a standardised structure of the help pages) and the central repository of packages.
It should also be pointed out that R has certain shortcomings. For example, objects in R must be stored in the computer’s physical memory (RAM), which can be a problem when analysing large amounts of data (e.g. global analyses, climate scenarios). Furthermore, R’s functionality depends on user demand and voluntary contributions from users. If a particular method is not yet implemented, users must either implement it themselves or pay someone else to do so. The learning curve is also steeper than for some other software (e.g. Excel).
There is a lot of literature available where users can gain useful insight into how to use R:
To start working with the R software tool, you must first install it on your computer. R works on almost all available operating systems, including the widely available Windows, Mac OS, and Linux. Installation files are available on the R Project’s website [13].
There are also several graphical interfaces available for R, one of which is RStudio [14]. It has a convenient editor with syntax highlighting of individual code elements (e.g. comments, commands, objects), an object viewer, and many other features that make working with R easier. RStudio is also free to use and runs on various operating systems such as Windows, Mac OS, and Linux.
Figure 1: Example of a basic R programme.
Figure 2: Example of the RStudio GUI.
Now that you’ve installed R and RStudio, you’re probably wondering, “How do I use R?” First of all, it is important to point out that, unlike other programs such as Excel or SPSS, which offer point-and-click graphical interfaces, R is a language: you type commands written in R code into the Console. In other words, you have to code in R. Although you don’t need to be an experienced programmer to use R, you still need to understand a set of basic concepts. You can use R as a simple calculator by typing commands into the Console and pressing the Enter key:
3 + 4 # addition
## [1] 7
3*4 # multiplication
## [1] 12
8/2 # division
## [1] 4
R also allows for comments (the part of the code that is not executed) to be written in the code. The character # is used for this purpose:
3 + 4 # this is a comment
## [1] 7
Of course, more complex calculations are also possible:
2*(3+8)/4+5*1.5/(2/3)^3 # a slightly more complex calculation
## [1] 30.8125
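R displays messages when there are issues in the code’s execution. There are three different types of messages: errors, which stop the execution of the code; warnings, which point to a potential problem but do not stop the execution; and messages, which are purely informative.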
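As a minimal sketch of how these can be triggered and inspected, the example below assumes that the three types are errors, warnings, and messages (the usual R terminology). The object names are illustrative, and tryCatch is used only to capture the text of each condition instead of printing it:

```r
# An error stops execution: log() cannot handle text input
err_text <- tryCatch(log("a"), error = function(e) conditionMessage(e))

# A warning signals a possible problem, but a result would still be returned
warn_text <- tryCatch(as.numeric("abc"), warning = function(w) conditionMessage(w))

# A message is purely informative
msg_text <- tryCatch(message("Reading data ..."), message = function(m) conditionMessage(m))

print(err_text)
print(warn_text)
print(msg_text)
```

The exact wording of the captured texts depends on the R version and language settings.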
Another of R’s advantages is the large number of functions included in its packages. Using these functions is relatively simple, but you need to know what input data each function requires and what it returns as a result. The documentation or help on how to use a function can be accessed using one of the following commands:
help(mean) # help for function mean
?min # help for function min
The help for almost all functions (including those in packages) is designed in a similar way and typically includes sections such as Description, Usage, Arguments, Details, Value, References, See Also, and Examples.
An example of how to use the mean function:
mean(c(3,5,3,5,3)) # calculate the average of five numbers
## [1] 3.8
The mean function used above calculates the average of the numbers 3, 5, 3, 5, 3. The c function is used to combine the individual numbers into a vector (its help page, ?c, includes examples of its use). Another example is the use of the round function to round numbers:
round(x=3.54453445,digits=2) # rounded according to input data
## [1] 3.54
Compared to the example above, you can see that here we have also precisely defined the two parameters or arguments used in the round function, namely the first argument x, which defines the input data to the function, and the argument digits, which defines the number of decimal places to be shown in the result. If you know the order of the arguments, you can omit their names:
round(3.54453445,2) # rounded according to input data
## [1] 3.54
If you are not sure about the order of the arguments or parameters, it is better to define each one separately, otherwise you may get an incorrect result:
round(2, 3.54453445) # arguments swapped: the number 2 is rounded, not 3.54453445
## [1] 2
So, in most cases, it makes sense to name the arguments explicitly; named arguments can then be given in any order:
round(digits=2, x=3.54453445) # rounded according to input data
## [1] 3.54
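If you are unsure about the names and order of a function's arguments, you can also check them directly in the Console. The following sketch uses the base functions args and formals on round; the object names are illustrative and this is only one possible way of checking:

```r
args(round)                                 # prints the function header with its arguments
arg_names <- names(formals(args(round)))    # argument names as a character vector
arg_names

# named arguments give the same result in any order
r1 <- round(x = 3.54453445, digits = 2)
r2 <- round(digits = 2, x = 3.54453445)
identical(r1, r2)
```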
Of course, you need to know the name of a function to use it. The best way to find it is to search online for a function suited to the problem you want to solve. There are also various websites that give an overview of the most commonly used functions [15].
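R itself can also help to find function names. As a small sketch, the base function apropos lists objects whose names match a pattern, and ?? searches the installed help pages; the search patterns used here are only examples:

```r
# all visible objects whose name starts with "round"
round_like <- apropos("^round")
round_like

# objects whose name contains "mean" (upper or lower case)
mean_like <- apropos("mean")
head(mean_like)

# full-text search of the installed help pages (shows a list of results):
# ??"linear model"
```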
There is almost no statistical or mathematical problem for which a function does not exist within the R environment, either in the R base package (base) or in one of the more than 10,000 extension packages [16].
Below are the first practical tasks you can solve to build upon your basic knowledge of R.
Task 1: Find a function that can be used to calculate the standard deviation in R and use the function.
Task 2: Find a function to plot a graph (any graph) in R and use the function.
Task 3: Find a function with which you can generate a sequence of numbers, say 2,4,6,8,10, etc., and use this function to calculate a sequence that has 50 elements, whose initial value is 2, and whose step is also 2.
Task 4: Find a function to round numbers and round the number 2.464646 to 2 decimal places.
In R, objects are defined using the <- character combination (the assignment operator). In RStudio, it can also be inserted by pressing the ALT and - keys at the same time.
x <- 1 # define an object x with one element
When an R object is defined, it is not displayed. To print it, use the print function or just the object name:
print(x) # print the contents of the object named x
## [1] 1
x # same as above only without using the function
## [1] 1
In the R programming language, there are often several different ways to solve a problem or perform an operation. For example, we can use the assign function to define objects:
assign("x", c(10,4,5)) # define an object using the assign function
c(4,5,3) -> x # objects can also be defined this way
In most cases, objects can also be defined using the = character.
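The difference between <- and = shows up inside function calls. In the sketch below (the object names a, b, med1, and med2 are illustrative), = inside a call only names an argument, whereas <- also creates an object:

```r
a = 5                            # at the top level, = assigns just like <-
med1 <- median(x = c(4, 5, 6))   # x is only the argument name; no object x is created here
med2 <- median(b <- c(1, 2, 3))  # b is created first and then passed to median
a
b
```

For this reason, many R users reserve = for naming arguments and use <- for defining objects.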
When you print out the vector, you will notice that the vector’s index is printed in square brackets [] on the side. For example, look at this numeric sequence of length 20, where the numbers in square brackets are not part of the vector itself, but only part of the printout:
y <- 41:60 # generate integers between 41 and 60
y # check the contents of the object y
## [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
R knows five basic types of objects: character, numeric (real numbers), integer, complex, and logical (TRUE/FALSE).
At first sight, there is no significant difference between integer and numeric. Numbers in R are generally treated as numeric objects (i.e. real numbers with double precision). This means that even if you see a number like “5” or “10” in R, which could be considered an integer, it is probably represented as a numeric object in the background (e.g. something like “5.00” or “10.00”). If you want to define an integer, you need to specify the L extension. Thus, typing 1 into R will give you a numeric object, and typing 1L will give you an object of type integer.
z1 <- 10 # define object z1
z2 <- 10L # define object z2
str(z1) # check the structure of object z1
## num 10
str(z2) # check the structure of object z2
## int 10
is(z2) # a slightly different way of checking the structure of object z2
## [1] "integer" "double" "numeric"
## [4] "vector" "data.frameRowLabels"
For the average user, there is no significant difference between the two types of objects, but the difference is important for very large objects, where objects of type integer take up about half as much space as objects of type numeric:
# check the size of the vector, type integer
object.size(as.integer(seq(from=1,by=2,length.out=100000)))
## 400048 bytes
# check vector size, type numeric
object.size(seq(from=1,by=2,length.out=100000))
## 800048 bytes
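The distinction also matters in arithmetic. As a brief sketch (the object names are illustrative): operations between integers stay integer where possible, division always returns a numeric value, and integers have a limited range:

```r
typeof(5L * 2L)    # product of two integers is an integer
typeof(5L / 2L)    # division always returns a numeric (double) value
typeof(5L + 2)     # mixing integer and numeric gives numeric

int_max <- .Machine$integer.max   # largest representable integer
int_max

# exceeding the integer range gives NA with a warning (suppressed here)
overflow <- suppressWarnings(int_max + 1L)
overflow
```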
There is also a special number Inf, representing infinity. This allows us to define expressions such as 1/0. Thus, Inf can be used in ordinary calculations; e.g. 1/Inf is 0. The value NaN represents “not a number”; e.g. 0/0. In the R programming language, the expression NA is also often used to indicate a missing value (Not Available). An example of its use is shown below:
z3 <- c(4,5,Inf,1/0,NA,10) # define vector
# divide all elements of the vector by 10
z3/10
## [1] 0.4 0.5 Inf Inf NA 1.0
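The special values behave differently when tested. A short sketch using the base functions is.finite, is.nan, and is.na (the vector s is made up for illustration):

```r
s <- c(1, Inf, -Inf, NaN, NA)
fin <- is.finite(s)   # TRUE only for ordinary numbers
nan <- is.nan(s)      # TRUE only for NaN
mis <- is.na(s)       # TRUE for both NA and NaN
rbind(fin, nan, mis)  # compare the three tests side by side
```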
R also supports complex numbers (e.g. 2+0i), as shown in one of the examples below. More sophisticated definitions of complex numbers can be obtained with the complex function. Some other potentially useful functions related to complex numbers are Re, Im, Mod, Arg, and Conj. However, since complex numbers are not widely used in hydrology, we will not go into details here.
The most basic type of object in the R programming environment is a vector. You can create an empty vector using the vector function. The rule for vectors in R is that they can only contain elements of the same type, the exception being the list object type, which can combine different types.
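A minimal sketch of the vector function, and of a list that holds elements of different types (the object names and values are illustrative):

```r
v_num <- vector(mode = "numeric", length = 3)     # three zeros
v_chr <- vector(mode = "character", length = 2)   # two empty strings
v_lgl <- vector(mode = "logical", length = 2)     # two FALSE values

# a list may combine elements of different types
station <- list(name = "Station A", elevation = 295, active = TRUE)
str(station)
```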
You can also define different types of vectors in R:
x <- c(0.4, 0.7) # numeric
# Boolean, could use T instead of TRUE or F instead of FALSE
x <- c(TRUE, FALSE)
x <- c("Marko", "Jana", "Vid") # character
x <- 2:10 # integer
x <- c(2+0i, 4+2i) # complex number
Occasionally, different types of objects in the R programming environment become mixed up. Sometimes this happens by accident, but it can also be intentional. Example:
c(3.4, "Zvone") # character
## [1] "3.4" "Zvone"
c(FALSE,3) # numeric value
## [1] 0 3
c("Stanko", TRUE) # character
## [1] "Stanko" "TRUE"
In each of the above examples, we are mixing objects of two different types in a vector, even though the rule is that all elements of a vector must be of the same type. When data types are mixed in a vector, implicit coercion occurs: lower types are converted to higher types (for example, logical to numeric, and numeric to character) so that every element in the vector is of the same class. R tries to find a way to represent all the objects in a common vector in a reasonable way. Sometimes it does exactly what you want, and sometimes it doesn’t. For example, if you combine a numeric value with a character value, you will end up with a character vector, since numbers can always be represented as text. In any case, it is wise to avoid such combinations. You can also change the type of an object explicitly using functions such as as.numeric, as.integer, as.logical, or as.character:
z2 <- 1:10 # generate a vector of integers between 1 and 10
z3 <- as.character(z2) # change the object type to character
str(z3) # structure of object z3
## chr [1:10] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
z3 # content of object z3
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
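Explicit conversion does not always succeed. The sketch below, with made-up values, shows what happens when an element cannot be converted and how decimals and numbers are treated by as.integer and as.logical:

```r
raw <- c("1.5", "2", "n/a")
# "n/a" cannot be interpreted as a number, so it becomes NA (with a warning)
num <- suppressWarnings(as.numeric(raw))
num

int <- as.integer(3.9)          # the decimal part is dropped, not rounded
int
lgl <- as.logical(c(0, 1, 2))   # 0 becomes FALSE, any other number TRUE
lgl
```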
The individual elements of an object can be accessed using [], where the count always starts with 1:
z6 <- 1:10 # generate a vector of integers between 1 and 10
z6[3:4] # we are only interested in the 3rd and 4th element
## [1] 3 4
z6[c(2,8,10)] # we are interested in the 2nd, 8th, and 10th element
## [1] 2 8 10
z6[c(1, 3:5, 9)] # we are interested in certain elements
## [1] 1 3 4 5 9
z6[-2] # we are interested in all elements except the second one
## [1] 1 3 4 5 6 7 8 9 10
z6[5] <- 1000 # the selected elements can also be replaced
z6[c(3,8)] <- c(100,200) # we can also replace several elements at the same time
The following operators and functions are also useful for indexing or selecting specific items:
1 %in% z6 # check whether 1 is contained in object z6
## [1] TRUE
2 == 2 # check whether the two elements are equal
## [1] TRUE
2 == 4 # check whether two elements are equal
## [1] FALSE
3 != 4 # check whether two elements are not equal
## [1] TRUE
# more complex combinations are possible; & indicates the condition "and"
10 <= 20 & 22 >= 22
## [1] TRUE
10 <= 20 | 22 >= 22 # | indicates an "or" condition
## [1] TRUE
v7 <- 15:21 # generate integers between 15 and 21
v7 > 18 # check which elements are greater than 18
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
v7 == 17 # check which element is equal to 17
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
v7 != 20 # check which elements are different from 20
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
v7[v7 > 18] # elements that satisfy the selected condition
## [1] 19 20 21
v7[v7 > 18 & v7 < 22] # applying 2 conditions at the same time
## [1] 19 20 21
v7[v7 > 18 | v7 < 22] # "or" instead of "and": here every element meets at least one condition
## [1] 15 16 17 18 19 20 21
which(v7 == 17) # position (index) of the element equal to 17
## [1] 3
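Logical vectors can also be counted and summarised, which is handy for threshold analyses. A short sketch with made-up daily discharge values (the vector q and the thresholds are assumptions for illustration only):

```r
q <- c(12.1, 35.4, 18.0, 22.7, 9.5, 41.2, 20.0)   # hypothetical discharges in m3/s

n_above <- sum(q > 20)       # number of days above the threshold (TRUE counts as 1)
share_above <- mean(q > 20)  # share of days above the threshold
days_above <- which(q > 20)  # positions of those days
day_max <- which.max(q)      # position of the largest value
any_low <- any(q < 10)       # is at least one value below 10?
all_pos <- all(q > 5)        # are all values above 5?

n_above
days_above
day_max
```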
For large objects, the head and tail functions can also be useful, showing the first n and the last n elements of the object, respectively:
v8 <- 1:10000 # generate a large integer object
head(x=v8, n=3) # check the first three elements of the v8 object
## [1] 1 2 3
tail(x=v8, n=4) # check the last four elements of the v8 object
## [1] 9997 9998 9999 10000
Functions such as subset, is.na, is.nan, and summary, together with the na.rm argument that many functions accept, often come in handy:
v9 <- c(4,NA,2,3,NA,10) # define vector
is.na(v9) # check which elements are not defined
## [1] FALSE TRUE FALSE FALSE TRUE FALSE
summary(v9) # check the basic statistics and which elements are equal to NA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 2.75 3.50 4.75 5.50 10.00 2
max(v9) # the result is NA because the vector contains NA values
## [1] NA
max(v9, na.rm=TRUE) # NA elements are ignored in the calculation
## [1] 10
v10 <- subset(v9, !is.na(v9)) # store elements that are different from NA
v11 <- subset(v10, v10 > 2 & v10 < 10) # select only certain elements
v11
## [1] 4 3
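A few further base R idioms for missing values, sketched on the same kind of vector (the object names are illustrative, and replacing NA with the mean is shown only as a simple example, not as a recommended gap-filling method):

```r
v <- c(4, NA, 2, 3, NA, 10)

n_missing <- sum(is.na(v))        # number of missing values
v_clean <- v[!is.na(v)]           # logical indexing as an alternative to subset
v_mean <- mean(v, na.rm = TRUE)   # mean of the available values

v_filled <- v
v_filled[is.na(v_filled)] <- v_mean   # replace the NA values with the mean
v_filled
```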
You can see the list of defined objects in the top right window of RStudio (Environment tab). You can also use the ls() function to view all defined objects. In case you want to remove an object, you can do so using the rm or remove functions:
z6 <- 1:10 # generate a vector of integers between 1 and 10
rm(z6) # remove object z6
If you want to remove all objects, you can use the following combination of functions:
rm(list=ls(all=TRUE))
Objects in R can have attributes, which are like metadata for the object. This metadata can be very useful, as it helps to describe the object. For example, the column names of a data frame explain what data each column contains. Some examples of an R object’s attributes are names, column names, dimensions, class, and length. The attributes of an object can be accessed using the attributes function (the example below uses a dataset that is distributed with R and can be loaded using the data function):
data("airquality") # import data named airquality
attributes(airquality) # check the attributes of this object
## $names
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153
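Attributes can also be set by the user. A minimal sketch using names and attr on a made-up vector of water levels (the attribute name units is an arbitrary choice for illustration):

```r
h <- c(112, 118, 131)                  # hypothetical water levels
names(h) <- c("Mon", "Tue", "Wed")     # set the names attribute
attr(h, "units") <- "cm"               # add a user-defined attribute
attributes(h)
h["Tue"]                               # elements can now be accessed by name
```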
Task 5: Suppose you measure rainfall every day of the week (Monday to Sunday) at 2 rainfall stations; the first station measured 10, 0, 0, 20, 15, 10, 5 mm of rainfall and the second station measured 5, 0, 0, 25, 10, 5, 10 mm of rainfall. Store the data in 2 separate objects (station1 and station2). Then, using these objects, calculate: the average precipitation of the two stations for each of the 7 days, the average precipitation over the whole week (for all days combined), the maximum and minimum daily precipitation given the measurements from both stations, the range of the measured values (range), and the median and the standard deviation for each precipitation station.
Task 6: Find a function to generate random numbers (based on a uniform distribution) and generate 10 random numbers between 1 and 35. Store the results in a new object and round to 1 decimal place, calculate the sum of all the generated numbers, and calculate the product of all the generated values, then sort the generated values from smallest to largest.
Task 7: Calculate the sum of all integers between 1 and 10 000.
Task 8: Using the rainfall example from Task 5, check on which days at station 1 the average rainfall was more than 10 mm and less than 20 mm.
Task 9: It was subsequently discovered that the measured rainfall for station 1 on Thursday was incorrect and should have been 30 mm. Check whether this change affects the result of task 8.
Matrices are vectors with a dimension attribute. The dimension attribute is an integer vector of length 2 (number of rows, number of columns).
m1 <- matrix(nrow = 2, ncol = 3) # define a matrix, with NA values
dim(m1) # matrix dimensions
## [1] 2 3
m1 # content
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
attributes(m1) # attributes
## $dim
## [1] 2 3
Matrices are filled by columns, so the entries can be visualised as starting in the “top left” corner (element 1,1) and running down the columns.
# define a matrix containing integers between 1 and 6
m2 <- matrix(1:6, nrow = 3, ncol = 2)
m2 # content
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
dim(m2) <- c(2,3) # change dimensions
m2 # content of the transformed matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
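If the values should instead be filled row by row, the byrow argument of the matrix function can be used. A short sketch (the object names are illustrative):

```r
m_col <- matrix(1:6, nrow = 2)                 # default: filled by columns
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled by rows
m_col
m_row
```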
You can also define matrices by combining columns or rows with the cbind and rbind functions. In some cases, you can also use the array function:
m4 <- cbind (4:6,1:3) # combine two columns of vectors with three elements
m4 # content
## [,1] [,2]
## [1,] 4 1
## [2,] 5 2
## [3,] 6 3
m5 <- rbind(10:11, 20:21) # combine two rows into a 2 x 2 matrix
m5 # content
## [,1] [,2]
## [1,] 10 11
## [2,] 20 21
m6 <- array(1:12,dim=c(3,4)) # define an array using the array function
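The array function is mainly useful for more than two dimensions. Below is a sketch of a three-dimensional array, which could for instance be read as rows x columns x time steps (this interpretation is only an assumption for illustration):

```r
a3 <- array(1:24, dim = c(2, 3, 4))
dim(a3)
el <- a3[1, 2, 3]     # single element: row 1, column 2, layer 3
el
layer1 <- a3[, , 1]   # first layer, a 2 x 3 matrix
layer1
```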
We can use indexing to access specific elements of a matrix:
m4 <- cbind (4:6,1:3) # combine two columns of vectors with three elements
m4[,2] # second column
## [1] 1 2 3
m4[1,] # first row
## [1] 4 1
m4[2,2] # element in the second column and second row
## [1] 2
m4[c(1,3),1] # elements in the first and third rows of the first column
## [1] 4 6
m4[5] # the fifth element of the matrix, where the elements are arranged by columns
## [1] 2
c(m4) # the matrix can also be transformed into a vector, ordered by columns
## [1] 4 5 6 1 2 3
as.vector(m4) # same as before only using the as.vector function
## [1] 4 5 6 1 2 3
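Note that selecting a single row or column returns a plain vector. If the matrix structure should be kept, the drop argument can be set to FALSE, as in this sketch (the object names are illustrative):

```r
m <- cbind(4:6, 1:3)
r_vec <- m[1, ]                 # a vector of length 2
r_mat <- m[1, , drop = FALSE]   # still a matrix with 1 row and 2 columns
dim(r_vec)                      # NULL: vectors have no dimensions
dim(r_mat)
```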
But matrices can also be used to make calculations:
m4 <- cbind (4:6,1:3) # combine two columns of vectors with three elements
m5 <- m4*10 # multiply all elements of the matrix m4 by 10
m5 <- log(m5) # calculate the logarithm of all elements
m6 <- m4*m5 # element-wise multiplication of two matrices of the same size
print(m6) # result of multiplication
## [,1] [,2]
## [1,] 14.75552 2.302585
## [2,] 19.56012 5.991465
## [3,] 24.56607 10.203592
crossprod(m4,m5) # matrix product t(m4) %*% m5
## [,1] [,2]
## [1,] 58.88170 44.59619
## [2,] 23.79596 18.49764
m4 %o% m5 # tensor product, alternative is the function outer
## , , 1, 1
##
## [,1] [,2]
## [1,] 14.75552 3.688879
## [2,] 18.44440 7.377759
## [3,] 22.13328 11.066638
##
## , , 2, 1
##
## [,1] [,2]
## [1,] 15.64809 3.912023
## [2,] 19.56012 7.824046
## [3,] 23.47214 11.736069
##
## , , 3, 1
##
## [,1] [,2]
## [1,] 16.37738 4.094345
## [2,] 20.47172 8.188689
## [3,] 24.56607 12.283034
##
## , , 1, 2
##
## [,1] [,2]
## [1,] 9.21034 2.302585
## [2,] 11.51293 4.605170
## [3,] 13.81551 6.907755
##
## , , 2, 2
##
## [,1] [,2]
## [1,] 11.98293 2.995732
## [2,] 14.97866 5.991465
## [3,] 17.97439 8.987197
##
## , , 3, 2
##
## [,1] [,2]
## [1,] 13.60479 3.401197
## [2,] 17.00599 6.802395
## [3,] 20.40718 10.203592
t(m6) # matrix transpose
## [,1] [,2] [,3]
## [1,] 14.755518 19.560115 24.56607
## [2,] 2.302585 5.991465 10.20359
m7 <- cbind(1:2,4:5) # define a square matrix
det(m7) # calculate the determinant
## [1] -3
# eigenvalues and eigenvectors of the matrix
eigen(m7)
## eigen() decomposition
## $values
## [1] 6.4641016 -0.4641016
##
## $vectors
## [,1] [,2]
## [1,] -0.5906905 -0.9390708
## [2,] -0.8068982 0.3437238
dim(m6) # dimensions of the matrix
## [1] 3 2
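The difference between element-wise and matrix multiplication is easy to overlook. The sketch below, with small made-up matrices, contrasts the operators * and %*% with the crossprod function, and uses solve to invert a square matrix:

```r
A <- cbind(4:6, 1:3)            # 3 x 2 matrix
B <- cbind(1:3, c(2, 0, 1))     # 3 x 2 matrix

E <- A * B                      # element-wise product, again 3 x 2
P <- t(A) %*% B                 # matrix product, 2 x 2
all.equal(P, crossprod(A, B))   # crossprod(A, B) computes t(A) %*% B

S <- cbind(1:2, 4:5)            # square matrix with a non-zero determinant
S_inv <- solve(S)               # inverse of S
round(S %*% S_inv, 10)          # identity matrix (up to rounding error)
```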
The apply, sapply, mapply, etc. functions are also very useful for calculations targeting multi-dimensional objects. Let’s look at one example:
m8 <- cbind (10:20,20:30) # combine two vectors
# calculate the average over the columns; the MARGIN argument defines
# whether the function is applied over rows (1) or columns (2)
apply(X = m8, MARGIN = 2,FUN = mean)
## [1] 15 25
apply(X = m8, MARGIN = 1,FUN = mean) # calculate average by rows
## [1] 15 16 17 18 19 20 21 22 23 24 25
apply(m8, 2, summary) # calculate the main descriptive statistics by column
## [,1] [,2]
## Min. 10.0 20.0
## 1st Qu. 12.5 22.5
## Median 15.0 25.0
## Mean 15.0 25.0
## 3rd Qu. 17.5 27.5
## Max. 20.0 30.0
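For simple row and column statistics there are also dedicated shortcuts, and sapply applies a function to each element of a vector or list. A brief sketch (the list rain and its values are made up for illustration):

```r
m <- cbind(10:20, 20:30)
col_avg <- colMeans(m)   # same result as apply(m, 2, mean)
row_sum <- rowSums(m)    # same result as apply(m, 1, sum)

rain <- list(st1 = c(3, 0, 12), st2 = c(5, 1, 0, 7))   # hypothetical rainfall data
rain_max <- sapply(rain, max)      # named vector with one value per list element
rain_n <- sapply(rain, length)     # number of measurements per station
rain_max
rain_n
```

Unlike a matrix, a list can hold stations with different numbers of measurements, which is why sapply is used here rather than apply.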
Task 10: Combine the precipitation data from the two stations given in Task 5 into a matrix. Add a third column to show the average rainfall values at the two stations (by day). Add row labels (names of days of the week) and column labels (station 1, station 2, average) using the colnames and rownames functions.
Task 11: Try to use the apply function on the previously defined matrix (Task 10) to calculate the mean and median values at station 1 and station 2.
Factors are used to categorise data and can be unordered or ordered. A factor can be thought of as an integer vector in which each integer has a label. Factors are important in statistical analysis and are treated specially by certain functions such as lm and glm. The use of labelled factors is preferable to the use of integers because factors are self-describing: a variable taking the values “male” and “female” is easier to interpret than a variable taking the values 1 and 2. In versions of R before 4.0.0, functions such as read.table created factors by default whenever they detected character data; since R 4.0.0, character columns are kept as characters unless the argument stringsAsFactors = TRUE is set. The order of the levels of a factor can be specified with the levels argument of the factor function:
opa <- c("M", "F", "F", "M", "M") # define a vector of observations
opafak <- factor(opa) # transform the object into a factor
levels(opafak) # check which types are included in our object
## [1] "F" "M"
table(opafak) # check how many elements of a given type we have
## opafak
## F M
## 2 3
summary(opafak) # similar to above
## F M
## 2 3
unclass(opafak) # show the underlying integer codes
## [1] 2 1 1 2 2
## attr(,"levels")
## [1] "F" "M"
# define the order of the levels
f1 <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
table(f1) # structure
## f1
## yes no
## 3 2
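For categories with a natural order, an ordered factor can be defined. A minimal sketch (the flow classes are invented for illustration):

```r
flow <- c("low", "high", "medium", "low")
flow_f <- factor(flow, levels = c("low", "medium", "high"), ordered = TRUE)
flow_f

below_high <- flow_f < "high"   # comparisons respect the order of the levels
codes <- as.integer(flow_f)     # underlying integer codes
top <- max(flow_f)              # highest category present in the data
below_high
codes
top
```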
Data frames are used to store tabular data in R. They are an important type of object in R and are used for a variety of purposes. For example, the dplyr package has an optimised set of functions designed to work efficiently with data frames. A data frame is a special type of list in which every element must have the same length. Each list element can be thought of as a column, and the length of each element is the number of rows. Unlike matrices, in which every element must be of the same class (e.g. numeric values), data frames can store a different class of object in each column. In addition to column names, which denote the names of the variables, data frames have a special attribute called row.names, which holds information about each row of the data frame. Data frames are usually created by reading data using the read.table or read.csv function. However, they can also be created directly using the data.frame function or converted from other existing objects such as matrices. Data frames can be converted to a matrix using the data.matrix function. Although it may sometimes seem natural to use the as.matrix function for this conversion, in most cases the data.matrix function is the better choice.
# structure of the data frame included in R
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
nrow(mtcars) # number of rows
## [1] 32
ncol(mtcars) # number of columns
## [1] 11
summary(mtcars) # basic statistics
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
colnames(mtcars) # column names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
mtcars$hp # check the contents of the column named hp
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
## [20] 65 97 150 150 245 175 66 91 113 264 175 335 109
mtcars$hp[2:5] # some elements of this column
## [1] 110 93 110 175
mtcars[3:5,5] # we can also use a matrix reference
## [1] 3.85 3.08 3.15
mtcars[1:4, "hp"] # a slightly different way
## [1] 110 110 93 110
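Rows can also be selected with a logical condition rather than by position. The short sketch below uses the same mtcars data; the threshold of 200 hp and the object names (strong, i_max, sorted) are chosen only for illustration.

```r
# rows where horsepower exceeds 200, keeping only two columns
strong <- mtcars[mtcars$hp > 200, c("mpg", "hp")]
nrow(strong)                      # number of cars that meet the condition

# row with the largest horsepower
i_max <- which.max(mtcars$hp)     # position of the maximum
rownames(mtcars)[i_max]           # name of that car

# sort the data frame by horsepower (order returns row positions)
sorted <- mtcars[order(mtcars$hp), ]
head(sorted$hp, 3)                # three smallest values
```

The order function returns the row positions that would sort a column, whereas sort returns the sorted values themselves, which is why order is the one used inside the square brackets.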
You can also define your own data frames:
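# define a data frame (Slovene names: temp = temperature, pretok = discharge,
# motnost = turbidity, vodostaj = water level)
example <- data.frame(temp=c("visoka","nizka","visoka"),
                      pretok=c("velik","majhen","velik"),
                      motnost=c("velika","srednja","velika"),
                      vodostaj=c(30,10,28))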
print(example) # look at the structure
## temp pretok motnost vodostaj
## 1 visoka velik velika 30
## 2 nizka majhen srednja 10
## 3 visoka velik velika 28
example$pretok # let's just look at the column named pretok
## [1] "velik" "majhen" "velik"
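The difference between the as.matrix and data.matrix functions mentioned at the start of this section is easiest to see on a data frame with mixed column types. The following is a minimal sketch; the data frame df and its values are made up for illustration.

```r
# small data frame with one character and one numeric column
df <- data.frame(level = c("high", "low", "high"),
                 stage = c(30, 10, 28))

m_chr <- as.matrix(df)    # mixed columns: everything becomes character
m_num <- data.matrix(df)  # character column is converted to integer codes

is.character(m_chr)       # check the type of the first matrix
is.numeric(m_num)         # check the type of the second matrix
m_num                     # look at the numeric matrix
```

Note that data.matrix replaces character values with integer codes (assigned in alphabetical order of the values), so the result is numeric but the original labels are lost.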
Task 12: Using the airquality object (you need the data(airquality) function to use the data), check on which days the wind speed was greater than 10 mph (the units in which the wind speed is given), check on which days the air temperature was between 60 and 70 F (the units in which the air temperature is given), and check on which day the maximum and minimum ozone concentrations were measured.
Task 13: Sort the airquality data according to the measured air temperature. Check the functioning of the order and sort functions.
Task 14: Using the apply function, calculate the average of all columns in the airquality object.
Lists (list) are a special kind of vector that can contain elements of different classes. Lists are a very important data type in R, because you can store many different kinds of data in them. Lists, in combination with various functions such as apply, sapply, or lapply, allow fast and easy computations on large amounts of data. Lists can be defined with the list function, which accepts any number of arguments:
# define any list (clovek is Slovene for "person")
clovek <- list(name="John", age=35, spouse="Mary",
               age_child=c(15, 13, 2))
clovek$age_child # let's look at the contents of the element named age_child
## [1] 15 13 2
clovek[["name"]] # we can also use this way
## [1] "John"
clovek[c("name", "spouse")] # or multiple elements
## $name
## [1] "John"
##
## $spouse
## [1] "Mary"
names(clovek) # names can be checked using the names function
## [1] "name" "age" "spouse" "age_child"
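As a rough illustration of how lists work together with sapply and lapply, the sketch below stores vectors of different lengths in one list and applies a function to each element. The station names and discharge values are made up.

```r
# list of discharge values (made-up numbers) for three stations
flows <- list(station_a = c(12.1, 15.4, 9.8),
              station_b = c(3.2, 2.9),
              station_c = c(40.0, 38.5, 42.3, 39.2))

means <- sapply(flows, mean)     # named numeric vector, one value per element
ranges <- lapply(flows, range)   # list, each element holds min and max
lengths(flows)                   # number of values in each element
```

sapply tries to simplify the result to a vector, while lapply always returns a list, which is the safer choice when the results can differ in length.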
Task 15: Define a new object in the form of a list where you combine two columns of the airquality dataset and average precipitation objects that you considered in Task 5.
There are a few main functions for importing data into R, such as read.table and read.csv for tabular data and load for previously saved R objects.
There are many R packages that have been developed for importing different types of data, e.g. readxl for importing Excel spreadsheets or read_sav (haven package) for reading SPSS databases.
For saving data to files outside the R programming environment, there are analogous functions such as write.table, write.csv, and save.
The read.table function is one of the most commonly used functions for importing data into R. The help for read.table is worth reading in full, because the function offers a number of settings that ensure your data is read in the correct format. You can also import your data directly via RStudio (via the GUI, using the functions in the Environment tab):
Figure 3: Example of importing data via the RStudio GUI.
As an example, we will show the process of importing data from the water gauging station Veliko Širje on the Savinja River, measured in 2005. The data were obtained from the website of the Slovenian Environment Agency (ARSO)17. The data are stored on OneDrive18. To import the data, we define the following arguments of the read.table function: the location of the file (file; change this to the folder where you store the data), the column delimiter (sep), the decimal symbol (dec), and whether the first line is read as the header of the file (header).
podatki <- read.table(file="C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Savinja-Veliko SirjeI-2005.txt",header=TRUE,sep=";",dec=".")
head(podatki) # check the first few lines
## Datum vodostaj.cm. pretok.m3.s. temp.vode.C.
## 1 01.01.2005 234 37.982 3.6
## 2 02.01.2005 231 35.515 3.5
## 3 03.01.2005 227 32.395 4.4
## 4 04.01.2005 221 28.073 3.7
## 5 05.01.2005 218 26.068 3.7
## 6 06.01.2005 215 24.165 3.4
## transport_suspendiranega_materiala.kg.s.
## 1 0.068
## 2 0.053
## 3 0.055
## 4 0.345
## 5 0.201
## 6 0.039
## vsebnost_suspendiranega_materiala.g.m3.
## 1 2
## 2 2
## 3 2
## 4 12
## 5 8
## 6 2
str(podatki) # check the structure of the read data
## 'data.frame': 365 obs. of 6 variables:
## $ Datum : chr "01.01.2005" "02.01.2005" "03.01.2005" "04.01.2005" ...
## $ vodostaj.cm. : int 234 231 227 221 218 215 213 211 210 208 ...
## $ pretok.m3.s. : num 38 35.5 32.4 28.1 26.1 ...
## $ temp.vode.C. : num 3.6 3.5 4.4 3.7 3.7 3.4 3.6 3.5 4.1 4.1 ...
## $ transport_suspendiranega_materiala.kg.s.: num 0.068 0.053 0.055 0.345 0.201 0.039 0.112 0.068 0.095 0.213 ...
## $ vsebnost_suspendiranega_materiala.g.m3. : int 2 2 2 12 8 2 5 3 5 11 ...
# we can see that the dates were read as character strings (chr)
names(podatki) # check column names
## [1] "Datum"
## [2] "vodostaj.cm."
## [3] "pretok.m3.s."
## [4] "temp.vode.C."
## [5] "transport_suspendiranega_materiala.kg.s."
## [6] "vsebnost_suspendiranega_materiala.g.m3."
names(podatki) <- c("Datum", "Vodostaj", "Pretok", "Temperatura", "Transport", "Vsebnost") # change column names
names(podatki) # check the changed names again
## [1] "Datum" "Vodostaj" "Pretok" "Temperatura" "Transport"
## [6] "Vsebnost"
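Because the file path above is specific to the author's computer, the same read.table arguments can be tried out on a small temporary file. The sketch below is only an illustration: it writes the first three rows of two columns shown above to a temporary file and reads them back; the object names tab, tmp, and tab2 are arbitrary.

```r
# small example table (first three rows of two columns shown above)
tab <- data.frame(Datum = c("01.01.2005", "02.01.2005", "03.01.2005"),
                  Pretok = c(37.982, 35.515, 32.395))

# write to a temporary file, using ";" as the column delimiter
tmp <- tempfile(fileext = ".txt")
write.table(tab, file = tmp, sep = ";", dec = ".",
            row.names = FALSE, quote = FALSE)

# read it back with the same settings as in the example above
tab2 <- read.table(file = tmp, header = TRUE, sep = ";", dec = ".")
str(tab2)           # check the structure of the read data
mean(tab2$Pretok)   # average of the Pretok column
```

If sep or dec does not match the file, read.table typically reads everything into a single column or treats numbers as text, so str is a quick way to confirm that the import worked.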
Task 16: Calculate the average values of all 5 variables found in the data from the Veliko Širje water gauging station on the Savinja River (for 2005).
Task 17: Import any data you have ever used into the R software environment.
Task 18: Save any object in .Rdata format and then load it back into R using the load function.
Additional R packages expand the programme’s functionality by providing additional features, data, and documentation. The packages are produced by a worldwide community of R users and can be downloaded free of charge online. R packages are similar to applications on your mobile phone. In order to use the features available in a package, the following two steps must be followed:
Installing a package: This is similar to installing an app on your phone. Most packages are not installed by default when you install R and RStudio. So if you want to use a package for the first time, you’ll need to install it first. Once you have installed a package, you are unlikely to install it again unless you want to update it to a newer version (or if you accidentally uninstall it in the meantime).
Activating a package: Activating a package is similar to opening an app on your phone. Packages are not activated by default when you start RStudio. You must activate each package you want to use each time you start RStudio.
The R environment provides two ways of installing packages: through the graphical interface in RStudio, or directly with the install.packages function. To install via the GUI, the following window can be used. A tick in the white square in front of a package name shows that the package is activated, and clicking on a package name opens the help page for that package (available once the package has been installed):
Figure 4: Example of installing a package via the RStudio GUI. The red circles show the individual installation steps and the blue circle shows whether a particular package is activated or not.
An alternative procedure for installing (and activating) the package:
install.packages("airGR") # install a package named "airGR"
library(airGR, quietly=TRUE) # package activation
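Since a package only needs to be installed once but activated in every session, scripts often check whether the package is already available before installing it. The following is one possible pattern, not the only one; the variable pkg is a placeholder, set here to the stats package (which ships with R) so that the sketch runs without downloading anything.

```r
pkg <- "stats"   # placeholder; replace with e.g. "airGR"

# install only if the package is not yet available on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE is needed because the name is stored in a variable
library(pkg, character.only = TRUE)
```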
Packages often also contain example data that serve as a test case to better understand how the package works. For example, the airGR package contains hydrological data that can be used to calibrate, validate, and run the rainfall-runoff hydrological model included in this package, which we will see below. A description of these data can be found in the help for the BasinObs object (use ?BasinObs).
library(airGR, quietly=TRUE) # package activation
data(L0123001) # load data named L0123001
str(BasinObs) # overview of basic characteristics
## 'data.frame': 10593 obs. of 6 variables:
## $ DatesR: POSIXct, format: "1984-01-01" "1984-01-02" ...
## $ P : num 4.1 15.9 0.8 0 0 0 0 0 2.9 0 ...
## $ T : num 0.5 0.2 0.9 0.5 -1.6 0.9 3.5 4.4 7 6.4 ...
## $ E : num 0.2 0.2 0.3 0.3 0.1 0.3 0.4 0.4 0.5 0.5 ...
## $ Qls : int 2640 3440 12200 7600 6250 5650 5300 4700 3940 5300 ...
## $ Qmm : num 0.634 0.826 2.928 1.824 1.5 ...
View(BasinObs) # view data in a separate window
Package authors will release new versions of packages with bug fixes and new features over time, so it is usually a good idea to keep them up to date. However, keep in mind that new versions of packages will occasionally contain bugs or have slightly changed behaviour (e.g. features that work differently), which may mean that your code will no longer work as it did before the update. You can use the update.packages() function to update packages, or you can update via the RStudio GUI.
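Before and after updating, it can be useful to note which version of a package is installed, so that a change in behaviour can be traced back to an update. A minimal sketch, using the stats package shipped with R as a stand-in for any installed package:

```r
# version of an installed package (here the stats package shipped with R)
v <- packageVersion("stats")
v

# versions can be compared, e.g. before and after an update
v >= "3.0.0"

# R version currently in use
R.version.string
```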
R has a centralised package repository called CRAN19 (The Comprehensive R Archive Network). Packages stored there must meet quality requirements and must be regularly updated and documented. You can install any package from CRAN directly from the R console or via the RStudio GUI, as shown above. Additional packages not included in CRAN can be found at:
Lists of hydrology-related packages can be found here:
As a point of interest, let’s also show an example of how R can be used to download files directly from https connections and how they can be imported into R using the readr package. We download a file from a web address and save it to the working directory, which you can check using the getwd() function:
download.file("https://monashdatafluency.github.io/r-intro-2/r-intro-2-files.zip", destfile="r-intro-2-files.zip")
# the file is in .zip format, so we extract it
unzip("r-intro-2-files.zip")
# install.packages("readr") # install the readr package
library(readr, quietly=TRUE) # package activation
## Warning: package 'readr' was built under R version 4.1.3
# read a file named geo.csv
geo <- read_csv("r-intro-2-files/geo.csv")
## Rows: 196 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): name, region, income2017
## dbl (2): lat, long
## lgl (2): oecd, g77
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(geo) # let's look at the first few lines of these data
## # A tibble: 6 x 7
## name region oecd g77 lat long income2017
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <chr>
## 1 Afghanistan asia FALSE TRUE 33 66 low
## 2 Albania europe FALSE FALSE 41 20 upper_mid
## 3 Algeria africa FALSE TRUE 28 3 upper_mid
## 4 Andorra europe FALSE FALSE 42.5 1.52 high
## 5 Angola africa FALSE TRUE -12.5 18.5 lower_mid
## 6 Antigua and Barbuda americas FALSE TRUE 17.0 -61.8 high
Unless an exact (other) location is given, R reads and stores files in the working directory. The working directory can be viewed using getwd() and changed using setwd(). In conjunction with these two functions, it is also worth mentioning the list.files() function, which returns the names of all files in the working directory. This function is very useful for, say, the automatic reading of several files.
Task 19: Find a package for calculating L-moments, install and activate it, and calculate the L-moments of the flow data you worked on in Task 16 (L-moments are similar to ordinary statistical moments: mean, variance, skewness, kurtosis).
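The automatic reading of several files with list.files can be sketched as follows. To keep the example self-contained, it first creates two small files in a temporary folder; the folder name, file names, and values are made up.

```r
# create a temporary folder with two small files (made-up values)
dir_tmp <- file.path(tempdir(), "stations")
dir.create(dir_tmp, showWarnings = FALSE)
write.csv(data.frame(Q = c(1.2, 3.4)), file.path(dir_tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(Q = c(5.6, 7.8)), file.path(dir_tmp, "b.csv"), row.names = FALSE)

# list all .csv files in that folder, with full paths
files <- list.files(dir_tmp, pattern = "\\.csv$", full.names = TRUE)

# read every file into one element of a list
tables <- lapply(files, read.csv)
names(tables) <- basename(files)
sapply(tables, function(d) mean(d$Q))   # mean of column Q in each file
```

Using full.names = TRUE avoids having to change the working directory with setwd() just to read the files.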
Task 20: For the data included in the airGR package, calculate the basic statistics using the apply function and the basic statistics for columns 2, 3, and 4 using the summary function, and identify which hydrological variables are involved and what their units are.
In hydrological analyses, we often deal with time series data representing hydrological measurements or other observations. The R program uses the following representation of dates and times: dates are represented by the Date class and times by the POSIXct and POSIXlt classes. Dates are stored internally as the number of days since 1970-01-01 and times as the number of seconds since 1970-01-01. From a character string, a date can be defined using the function as.Date():
x <- as.Date("1970-01-01") # define date
x # let's see the content
## [1] "1970-01-01"
# the format argument defines the format of the input date string
x1 <- as.Date("1.1.1970", format="%d.%m.%Y")
unclass(x1) # show the internal representation (days since 1970-01-01)
## [1] 0
unclass(as.Date("1971-01-01")) # another example with a different date
## [1] 365
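The same approach works on a whole vector of date strings, such as the Datum column imported earlier, which was read as character. The sketch below uses three example strings in that day.month.year format; the object names d_chr, d, and t1 are arbitrary.

```r
# date strings in the day.month.year format
d_chr <- c("01.01.2005", "02.01.2005", "31.12.2005")
d <- as.Date(d_chr, format = "%d.%m.%Y")
class(d)             # the result is of class Date

d[3] - d[1]          # difference between two dates, in days
format(d, "%Y-%m")   # dates written in another format
as.numeric(d[1])     # days since 1970-01-01

# date and time together, stored as seconds since 1970-01-01 (UTC)
t1 <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
as.numeric(t1)
```

Once the strings are converted to the Date class, they can be subtracted, sorted, and compared like numbers, which is not possible with plain character strings.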
You can choose between several format types for dates.