R SOFTWARE IN HYDROLOGY

Online tutorial

Author: Nejc Bezak

Reviewers of the Slovene version: Mojca Šraj, Lovrenc Pavlin

Design: Nejc Bezak

Publisher: University of Ljubljana, Faculty of Civil Engineering and Geodesy, UNESCO Chair on Water-related Disaster Risk Reduction

Ljubljana, 2024

The content of this publication may be used under the terms of the Creative Commons CC-BY-NC 4.0 Creative Commons Attribution-NonCommercial-ShareAlike 4.0 licence.

1 Preface

This online material is aimed at those who want to gain an in-depth knowledge of data analysis using the R software tool for the field of hydrology and learn how to use the R software tool effectively as a tool for data processing, analysis and interpretation. In particular, the textbook is aimed at students taking the course R software in water management and other courses where R is frequently used, such as hydrology and hydrological modelling.

R is an open-source and freely available statistical tool that provides a wide range of functions for analysing data, visualising results, and developing statistical and other models. Through practical examples and graphical demonstrations, this online material will explore how to use R to solve a variety of problems in hydrology and beyond. We will enrich your knowledge with concrete examples featuring analyses of water data, hydrological models, spatial analyses, and other key aspects of this important field.

The purpose of this online material is not only to provide technical knowledge of the R software and programming language, but also to encourage thinking about complex issues related to water management and developing the ability to take a critical approach to the analysis and interpretation of data and models. The material also includes short practical exercises to further enhance your knowledge.

I believe that this material will be a useful tool for students, researchers, professionals, and all those who want to delve deeper into the field of hydrology with the help of the R software tool. I wish you successful work and a lot of satisfaction in exploring the challenges of hydrology!

Author

2 The R programming language and its history

2.1 The S programming language

R is a variant of the programming language S. S is a programming language developed by John Chambers and colleagues at Bell Telephone Laboratories. The S programming language began to be used in 1976 as an internal environment for statistical analysis – it was originally implemented as Fortran libraries.

In 1988, the system was rewritten in the C language and it began to resemble today’s S programming language system. Version 4 of S was produced in 1998 and remains the version in use today. The book Programming with Data by John Chambers¹ describes this version of the language.

The S programming language has evolved since the book was published, but its basics have not changed significantly. In 1998, the S programming language was awarded the Association for Computing Machinery’s Software System Award, a very prestigious award in computer science.

The philosophy of S is somewhat different from that of conventional programming languages. The developers’ aim was to create a language that would be suitable for both interactive command-line data analysis and for writing more extensive software applications, making it more similar to traditional programming languages.

2.2 The R programming language

One of the key limitations of S was that it was only available in the commercial S-PLUS package. Therefore, in 1991, Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland created R, which was first made public in 1993. The experience of developing R is documented in a 1996 paper in the Journal of Computational and Graphical Statistics².

In 1995, Martin Mächler was instrumental in persuading the original authors to use the GNU General Public Licence and make R freely available. This was crucial, as it made the source code of the whole R system available to anyone who wanted to work with it.

However, in 1997, the R Core Group was set up with a large number of experts in S and S-PLUS. Currently, this group has control over the source code of the R programming language. In 2000, R version 1.0.0 was released.

Today, R runs on almost all standard computer platforms and operating systems. Its open-source character means that anyone can adapt the program on any platform of their choice. R also works on modern tablets and phones, among other devices.

A positive feature of R is its frequent updates. Every year, usually in October, a major update is released to include important new features. During the year, minor bug fixes are issued if necessary. Frequent updates and a regular release cycle reflect the software’s active development and ensure that potential problems are fixed in a timely manner. Core developers control the primary source code for R, and many people around the world contribute to development in the form of new features, bug fixes, or both.

The main advantage of R over many other statistical software applications is that it is free in the sense of free software. The copyright for the primary source code of R is held by the R Foundation³ and is published under the GNU General Public License version 2.0⁴. Under the license, the user is free to: (i) use the program for any purpose, (ii) adapt the program and access the source code, and (iii) improve the program and redistribute the improvements.

Another advantage that R has over many other statistical software applications is its graphical capabilities. R’s ability to produce high-quality graphics has been one of its features since the very beginning and it is generally superior to competing tools. Today, despite many more visualisation software programs being available than in the past, this trend continues. R’s basic graphics system allows for very precise control over all the elements of a graph. Other more recent graphical systems and packages, such as lattice and ggplot2, allow complex and graphically more beautiful data visualisations.

Another positive aspect of using R is not related to the language itself, but to its active community of users. In many ways, a language can be considered successful if it generates a platform for a sufficient number of people to create new features. R is such a platform, and thousands of people around the world contribute to its development. R has extensive support on the website Stack Overflow⁵. It is also worth mentioning the excellent documentation (e.g. a standardised description of help functions) and the central repository of packages.

It should also be pointed out that R has certain shortcomings. For example, objects in R must be stored in PC physical memory, which can be a problem when analysing large amounts of data (e.g. global analyses, climate scenarios). Furthermore, R’s functionality is based on consumer demand and voluntary contributions from users. If a particular method is not yet implemented, the user must either implement it themselves or pay someone else to do so. The learning curve is also higher than for other software (e.g. Excel).

2.3 Further reading

There is a lot of literature available where users can gain useful insight into how to use R:

Introduction to R on the CRAN website⁶.
Resources on R’s website ⁷.
Advanced R⁸.
Material on bookdown.org⁹.
Handbook by Prof. Blejc¹⁰.
Basic Statistical Analysis book by Prof. Žiberna¹¹.
Rpubs website and practical examples¹².

3 How to start using R

3.1 Installation

To start working with the R software tool, you must first install it on your computer. R works on almost all available operating systems, including the widely available Windows, Mac OS, and Linux. Installation files are available on the R Project’s website¹³.

There are also several graphical interfaces available for R, one of which is RStudio¹⁴, which has a nice editor with highlighting of individual code blocks (e.g. comments, commands, objects), an object viewer, and many other features that make working with R easier. RStudio is also free to use and runs on various operating systems such as Windows, Mac OS, and Linux.

Figure 1: Example of a basic R programme.

Figure 2: Example of the RStudio GUI.

3.2 First steps

Now that you’ve installed R and RStudio, you’re probably wondering, “How do I use R?” First of all, it is important to point out that, unlike other statistical programs such as Excel or SPSS, which offer point-and-click graphical interfaces, R is a language in which you have to type commands written in R code into the Console. In other words, you have to code in R. Although you don’t need to be an experienced programmer to use R, you still need to understand a set of basic concepts. You can use R as a simple calculator by typing commands into the Console and pressing the Enter button:

3 + 4 # addition

## [1] 7

3*4 # multiplication

## [1] 12

8/2 # division

## [1] 4

R also allows for comments (the part of the code that is not executed) to be written in the code. The character # is used for this purpose:

3 + 4 # this is a comment

## [1] 7

Of course, more complex calculations are also possible:

2*(3+8)/4+5*1.5/(2/3)^3 # a slightly more complex calculation

## [1] 30.8125

R displays messages when there are issues in the code’s execution. There are three different types of messages:

Errors: If the red text is an error, it is prefaced by “Error in…”, which tries to explain what went wrong. Normally, if an error occurs, the code execution stops immediately.
Warnings: When a warning text is displayed, it is prefaced by “Warning:”, and R attempts to explain why the warning appears. In most cases, the code will still work, but with some limitations, or the calculations may not be correct or in line with what you have asked R to do. Caution is therefore advised with warnings and a close eye when checking the calculations.
Messages: If the displayed text does not start with “Error” or “Warning”, it is just a message. You will see these messages when you load R packages and when you import data with certain functions. These are useful diagnostic messages and do not prevent the code from running.

3.3 Using the functions

Another of R’s advantages is the large number of features included in its packages. Using these functions is relatively simple, but it is necessary to be aware of what input data each function requires, or what the function specifies as the result. The documentation or help on how to use a function can be accessed using one of the following commands:

help(mean) # help for function mean
?min # help for function min

The help for almost all functions (including those in packages) is designed in a similar way and includes the following points:

Description: description of the function’s purpose.
Usage: an example of the function’s typical use with the general description of all input arguments.
Arguments: list of arguments accepted by the function and the meaning of each argument.
Details: a detailed description of the function and the arguments.
Value: information about the type of a particular function’s result.
Author(s): the authors of each function.
References: additional useful literature related to the function.
Examples: typical examples of a particular function’s use with specific use of arguments.

An example of how to use the mean function:

mean(c(3,5,3,5,3)) # calculate the average of five numbers

## [1] 3.8

The mean function mentioned above calculates the average of the numbers 3, 5, 3, 5, 3. Additionally, the c function is used to combine individual numbers into a vector (the help for the ?c function explains an example of its use). Another additional example is the use of the round function to round numbers:

round(x=3.54453445,digits=2) # rounded according to input data

## [1] 3.54

Compared to the example above, you can see that here we have also precisely defined the two parameters or arguments used in the round function, namely the first argument x, which defines the input data to the function, and the argument digits, which defines the number of decimal places to be shown in the result. If you know the order of the arguments, you can omit their names:

round(3.54453445,2) # rounded according to input data

## [1] 3.54

If you are not sure about the order of the arguments or parameters, it is better to define each one separately, otherwise you may get an incorrect result:

round(2, 3.54453445) # rounded according to input data

## [1] 2

So, in most cases, the definition of arguments or parameters makes sense:

round(digits=2, x=3.54453445) # rounded according to input data

## [1] 3.54

Of course, you need to know the name of the function to use it. The best way to do this is to search online for the function that makes the most sense for the case or problem you want to solve. There are also various websites that give an overview of the most commonly used functions¹⁵.

There is almost no statistical mathematical problem for which a function does not exist within the R programming language’s environment, either in the R base package (base) or in one of the more than 10,000 extension packages¹⁶.

Below are the first practical tasks you can solve to build upon your basic knowledge of R.

Task 1: Find a function that can be used to calculate the standard deviation in R and use the function.

Task 2: Find a function to plot a graph (any graph) in R and use the function.

Task 3: Find a function with which you can generate a sequence of numbers, say 2,4,6,8,10, etc., and use this function to calculate a sequence that has 50 elements, whose initial value is 2, and whose step is also 2.

Task 4: Find a function to round numbers and round the number 2.464646 to 2 decimal places.

3.4 Objects

In R, objects are defined using character the <-. In RStudio, this character can also be accessed by pressing the ALT and - keys at the same time.

x <- 1 # define an object x with one element

When an R object is defined, it is not displayed. To print it, use the print function or just the object name:

print(x) # print the contents of the object named x

## [1] 1

x # same as above only without using the function

## [1] 1

In the R programming language, there are often several different ways to solve a problem or perform an operation. For example, we can use the assign function to define objects:

assign("x", c(10,4,5)) # define an object using the assign function
c(4,5,3) -> x # objects can also be defined this way

In most cases when defining objects, we can also use the =.

When you print out the vector, you will notice that the vector’s index is printed in square brackets [] on the side. For example, look at this numeric sequence of length 20, where the numbers in square brackets are not part of the vector itself, but only part of the printout:

y <- 41:60 # generate integers between 41 and 60
y # check the contents of the object y

##  [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

R knows five basic types of objects:

character: elements consisting of strings of characters or words, e.g. “neighbour”, “Marko”.
numeric: real numbers.
integer: integers.
complex: imaginary or complex numbers.
logical: data with logical values such as FALSE and TRUE.

At first sight, there is no significant difference between integer and numeric. Numbers in R are generally treated as numeric objects (i.e. real numbers with double precision). This means that even if you see a number like “5” or “10” in R, which could be considered an integer, it is probably represented as a numeric object in the background (e.g. something like “5.00” or “10.00”). If you want to define an integer, you need to specify the L extension. Thus, typing 1 into R will give you a numeric object, and typing 1L will give you an object of type integer.

z1 <- 10 # define object z1
z2 <- 10L # define object z2
str(z1) # check the structure of object z1

##  num 10

str(z2) # check the structure of object z2

##  int 10

is(z2) # a slightly different way of checking the structure of object z2

## [1] "integer"             "double"              "numeric"            
## [4] "vector"              "data.frameRowLabels"

For the average user, there is no significant difference between the two types of objects, but the difference is important for very large objects, where objects of type integer take up much less space than objects of type numeric:

# check the size of the vector, type integer
object.size(as.integer(seq(from=1,by=2,length.out=100000)))

## 400048 bytes

# check vector size, type numeric
object.size(seq(from=1,by=2,length.out=100000))

## 800048 bytes

There is also a special number Inf, representing infinity. This allows us to define expressions such as 1/0. Thus, Inf can be used in ordinary calculations; e.g. 1/Inf is 0. The value NaN represents “not a number”; e.g. 0/0. In the R programming language, the expression NA is also often used to indicate a missing value (Not Available). An example of its use is shown below:

z3 <- c(4,5,Inf,1/0,NA,10) # define vector
# divide all elements of the vector by 10
z3/10

## [1] 0.4 0.5 Inf Inf  NA 1.0

R knows complex numbers as shown in the example above. More sophisticated definitions of complex numbers are obtained with the complex command. It is worth mentioning some other potentially useful functions related to complex numbers: Re, Im, Mod, Arg, Conj. However, since complex numbers are not widely used in hydrology, we will not go into details here.

The most basic type of object in the R programming environment is a vector. You can create an empty vector using the vector function. The rule for vectors in R is that they can only contain elements of the same type, the exception being the list object type, which can combine different types.

You can also define different types of vectors in R:

x <- c(0.4, 0.7) # numeric
# Boolean, could use T instead of TRUE or F instead of FALSE
x <- c(TRUE, FALSE) 
x <- c("Marko", "Jana", "Vid") # character
x <- 2:10 # integer
x <- c(2+0i, 4+2i) # complex number

Occasionally, different types of objects in the R programming environment become mixed up. Sometimes this happens by accident, but it can also be intentional. Example:

c(3.4, "Zvone") # character

## [1] "3.4"   "Zvone"

c(FALSE,3) # numeric value

## [1] 0 3

c("Stanko", TRUE) # character

## [1] "Stanko" "TRUE"

In each of the above examples, we are mixing objects of two different types in the vector. But remember that the only rule about vectors is that this is not allowed. When we mix data types in a vector, a change occurs (lower types are converted to higher types) such that every element in the vector is of the same class. In the example above, we see the effect of a single change: R is trying to find a way to represent all objects in a common vector in a reasonable way. Sometimes it does exactly what you want, and sometimes it doesn’t. For example, if you combine a numeric value with a character type, you will end up with a character vector, since numbers can usually be represented in this form. In any case, it is wise to avoid these combinations. You can also change the type of an object using functions such as as.numeric, as.integer, as.logical, or as.character:

z2 <- 1:10 # generate a vector of integers between 1 and 10
z3 <- as.character(z2) # change the object type to character
str(z3) # structure of object z3

##  chr [1:10] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

z3 # content of object z3

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

The individual elements of an object can be accessed using [], where the count always starts with 1:

z6 <- 1:10 # generate a vector of integers between 1 and 10
z6[3:4] # we are only interested in the 3rd and 4th element

## [1] 3 4

z6[c(2,8,10)] # we are interested in the 2nd, 8th, and 10th element

## [1]  2  8 10

z6[c(1, 3:5, 9)] # we are interested in certain elements

## [1] 1 3 4 5 9

z6[-2] # we are interested in all elements except the second one

## [1]  1  3  4  5  6  7  8  9 10

z6[5] <- 1000 # the selected elements can also be replaced
z6[c(3,8)] <- c(100,200) # we can also replace several elements at the same time

The following functions are also useful for indexing or selecting specific items:

1 %in% z6 # check whether 1 is contained in object z6

## [1] TRUE

2 == 2 # check whether the two elements are equal

## [1] TRUE

2 == 4 # check whether two elements are equal

## [1] FALSE

3 != 4 # check whether two elements are not equal

## [1] TRUE

# more complex combinations are possible; & indicates the condition "and"
10 <= 20 & 22 >= 22

## [1] TRUE

10 <= 20 | 22 >= 22 # | indicates an "or" condition

## [1] TRUE

v7 <- 15:21 # generate integers between 15 and 21
v7 > 18 # check which elements are greater than 18

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

v7 == 17 # check which element is equal to 17

## [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

v7 != 20 # check which elements are different from 20

## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

v7[v7 > 18] # elements that satisfy the selected condition

## [1] 19 20 21

v7[v7 > 18 & v7 < 22] # applying 2 conditions at the same time

## [1] 19 20 21

v7[v7 > 18 | v7 < 22] # slightly different

## [1] 15 16 17 18 19 20 21

which(v7 == 17) # element order

## [1] 3

For large objects, the head and tail functions can also be useful, showing the n first and last elements of the object, respectively:

v8 <- 1:10000 # generate a large integer object
head(x=v8, n=3) # check the first three elements of the v8 object

## [1] 1 2 3

tail(x=v8, n=4) # check the last four elements of the v8 object

## [1]  9997  9998  9999 10000

Functions such as subset, is.na, is.nan, summary, and na.rm often come in handy:

v9 <- c(4,NA,2,3,NA,10) # define vector
is.na(v9) # check which elements are not defined

## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

summary(v9) # check the basic statistics and which elements are equal to NA

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00    2.75    3.50    4.75    5.50   10.00       2

max(v9) # the function does not work because of the NA value

## [1] NA

max(v9, na.rm=TRUE) # NA elements are ignored in the calculation

## [1] 10

v10 <- subset(v9, is.na(v9)!=TRUE) # store elements that are different from NA
v11 <- subset(v10, v10 > 2 & v10 < 10) # select only certain elements
v11

## [1] 4 3

You can see the list of defined objects in the top right window of RStudio (Environment tab). You can also use the ls() function to view all defined objects. In case you want to remove an object, you can do so using the rm or remove functions:

z6 <- 1:10 # generate a vector of integers between 1 and 10
rm(z6) # remove object z6

If you want to remove all objects, you can use the following combination of functions:

rm(list=ls(all=TRUE))

Objects in R can have attributes, which are like metadata for the object. This metadata can be very useful, as it helps to describe the object. For example, the names of the columns in a data frame help us to explain what data are contained in each of the columns. Some examples of an R object’s attributes include names, column names, dimensions, object type, length, etc. Certain object types contain attributes that can be accessed using the attributes function (in the example above, this is the use of a dataset that is associated with an R program and can be accessed using the data function):

data("airquality") # import data named airquality
attributes(airquality) # check the attributes of this object

## $names
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"    
## 
## $class
## [1] "data.frame"
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153

Task 5: Suppose you measure rainfall every day of the week (Monday to Sunday) at 2 rainfall stations; the first station measured 10,0,0,20,15,10,5 mm of rainfall and the second station measured 5,0,0,25,10,5,10 mm of rainfall. Store the data in 2 separate objects. Then, using the objects, calculate the rainfall values for all 7 days: Calculate the average precipitation taking into account both stations (station1 and station2), calculate the average precipitation over the whole week (for all days combined), calculate the maximum daily and minimum daily precipitation given the measurements from both stations, calculate the range of the measured values (range), and calculate the median for each precipitation station and the standard deviation for each precipitation station.

Task 6: Find a function to generate random numbers (based on a uniform distribution) and generate 10 random numbers between 1 and 35. Store the results in a new object and round to 1 decimal place, calculate the sum of all the generated numbers, and calculate the product of all the generated values, then sort the generated values from smallest to largest.

Task 7: Calculate the sum of all integers between 1 and 10 000.

Task 8: Using the rainfall example from Task 5, check on which days at station 1 the average rainfall was more than 10 mm and less than 20 mm.

Task 9: It was subsequently discovered that the measured rainfall for station 1 on Thursday was incorrect and should have been 30 mm. Check whether this change affects the result of task 8.

3.5 Matrix

Matrices are vectors with a dimension attribute. The dimension attribute is an integer vector of length 2 (number of rows, number of columns).

m1 <- matrix(nrow = 2, ncol = 3) # define a matrix, with NA values
dim(m1) # matrix dimensions

## [1] 2 3

m1 # content

##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA

attributes(m1) # attributes

## $dim
## [1] 2 3

Matrices are defined by columns, so entries can be visualised as starting in the “top left” corner (element 1,1) and filling down the columns.

# define a matrix containing integers between 1 and 6
m2 <- matrix(1:6, nrow = 3, ncol = 2) 
m2 # content

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

dim(m2) <- c(2,3) # change dimensions
m2 # content of the transformed matrix

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

You can also define matrices by combining columns or rows with the cbind and rbind functions. In some cases, you can also use the array function:

m4 <- cbind (4:6,1:3) # combine two columns of vectors with three elements
m4 # content

##      [,1] [,2]
## [1,]    4    1
## [2,]    5    2
## [3,]    6    3

m5 <- rbind(10:11, 20:21) # combine two rows into a 2 x 2 matrix
m5 # content

##      [,1] [,2]
## [1,]   10   11
## [2,]   20   21

m6 <- array(1:12,dim=c(3,4)) # define an array using the array function

We can use matrix referencing to use specific elements of a matrix:

m4 <- cbind (4:6,1:3) # combine two columns of vectors with three elements
m4[,2] # second column

## [1] 1 2 3

m4[1,] # first row

## [1] 4 1

m4[2,2] # element in the second column and second row

## [1] 2

m4[c(1,3),1] # elements in the first and third rows of the first column

## [1] 4 6

m4[5] # the fifth element of the matrix, where the elements are arranged by columns

## [1] 2

c(m4) # the matrix can also be transformed into a vector, ordered by columns

## [1] 4 5 6 1 2 3

as.vector(m4) # same as before only using the as.vector function

## [1] 4 5 6 1 2 3

But matrices can also be used to make calculations:

m4 <- cbind (4:6,1:3) # combine two columns of vectors with three elements
m5 <- m4*10 # multiply all elements of the matrix m4 by 10
m5 <- log(m5) # calculate the logarithm of all elements
m6 <- m4*m5 # the matrices can be multiplied
print(m6) # result of multiplication

##          [,1]      [,2]
## [1,] 14.75552  2.302585
## [2,] 19.56012  5.991465
## [3,] 24.56607 10.203592

crossprod(m4,m5) # matrix multiplication

##          [,1]     [,2]
## [1,] 58.88170 44.59619
## [2,] 23.79596 18.49764

m4 %o% m5 # tensor product, alternative is the function outer

## , , 1, 1
## 
##          [,1]      [,2]
## [1,] 14.75552  3.688879
## [2,] 18.44440  7.377759
## [3,] 22.13328 11.066638
## 
## , , 2, 1
## 
##          [,1]      [,2]
## [1,] 15.64809  3.912023
## [2,] 19.56012  7.824046
## [3,] 23.47214 11.736069
## 
## , , 3, 1
## 
##          [,1]      [,2]
## [1,] 16.37738  4.094345
## [2,] 20.47172  8.188689
## [3,] 24.56607 12.283034
## 
## , , 1, 2
## 
##          [,1]     [,2]
## [1,]  9.21034 2.302585
## [2,] 11.51293 4.605170
## [3,] 13.81551 6.907755
## 
## , , 2, 2
## 
##          [,1]     [,2]
## [1,] 11.98293 2.995732
## [2,] 14.97866 5.991465
## [3,] 17.97439 8.987197
## 
## , , 3, 2
## 
##          [,1]      [,2]
## [1,] 13.60479  3.401197
## [2,] 17.00599  6.802395
## [3,] 20.40718 10.203592

t(m6) # matrix transpose

##           [,1]      [,2]     [,3]
## [1,] 14.755518 19.560115 24.56607
## [2,]  2.302585  5.991465 10.20359

m7 <- cbind(1:2,4:5) # define a square matrix
det(m7) # calculate the determinant

## [1] -3

# eigenvalues and vectors in case the matrix can be diagonalised
eigen(m7)

## eigen() decomposition
## $values
## [1]  6.4641016 -0.4641016
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.5906905 -0.9390708
## [2,] -0.8068982  0.3437238

dim(m6) # dimensions of the matrix

## [1] 3 2

The apply, sapply, mapply, etc. functions are also very useful for calculations targeting multi-dimensional objects. Let’s look at one example:

m8 <- cbind (10:20,20:30) # combine two vectors
# calculate the average over the columns, the MARGIN argument defines 
# whether to use columns or rows
apply(X = m8, MARGIN = 2,FUN = mean)

## [1] 15 25

apply(X = m8, MARGIN = 1,FUN = mean) # calculate average by rows

##  [1] 15 16 17 18 19 20 21 22 23 24 25

apply(m8, 2, summary) # calculate the main descriptive statistics by column

##         [,1] [,2]
## Min.    10.0 20.0
## 1st Qu. 12.5 22.5
## Median  15.0 25.0
## Mean    15.0 25.0
## 3rd Qu. 17.5 27.5
## Max.    20.0 30.0

Task 10: Combine the precipitation data from the two stations given in Task 5 into a matrix. Add a third column to show the average rainfall values at the two stations (by day). Add row labels (names of days of the week) and column labels (station 1, station 2, average) using the colnames and rownames functions.

Task 11: Try to use the apply function on the previously defined matrix (task 10) to calculate the average and mean values at station 1 and station 2.

3.6 Factors

Factors are used to categorize data and can be unordered or ordered. A factor can be thought of as an integer vector, where each integer has a label. Factors are important in statistical analysis and are also specifically considered in certain functions such as lm and glm. The use of labelled factors is preferable to the use of integers because the factors are self-describing. In some cases, a variable taking the values “male” and “female” is better than a variable taking the values 1 and 2. Often, factors are automatically defined when you read a set of data with a function such as read.table. These functions often create factors by default when they detect data that look like characters or strings. The order of the levels of a factor can be specified by the levels argument in the factor function:

opa <- c("M", "F", "F", "M", "M") # define a vector of observations
opafak <- factor(opa) # transform the object into a factor
levels(opafak) # check which types are included in our object

## [1] "F" "M"

table(opafak) # check how many elements of a given type we have

## opafak
## F M 
## 2 3

summary(opafak) # similar to above

## F M 
## 2 3

unclass(opafak) # transform into numeric values

## [1] 2 1 1 2 2
## attr(,"levels")
## [1] "F" "M"

# define the order of the levels
f1 <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no")) 
table(f1) # structure

## f1
## yes  no 
##   3   2

3.7 Data frames

Data frames are used to store tabular data in R. They are an important type of object in R and are used for a variety of purposes. For example, the dplyr package has an optimised set of functions designed to work efficiently with data frames. Data frames are a special type of list or matrix where each element of the list must be the same length. Each list element can be thought of as a column, and the length of each list element is the number of rows. Unlike matrices, data frames can store different classes of objects in each column. In the case of matrices, each element must be of the same class (e.g. numeric values). In addition to column names, which denote the names of variables, data frames have a special attribute called row.names, which indicates information about each row of the data frame. Data frames are usually defined by reading the data using the read.table or read.csv function. However, data frames can also be created directly using the data.frame function or transformed from other existing objects such as matrices. Data frames can be converted to a matrix using the data.matrix function.* Although it may sometimes seem necessary to use the as.matrix function to convert a data frame to a matrix, in most cases the data.matrix function is the better choice.

# structure of the data frame included in R
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

nrow(mtcars) # number of rows

## [1] 32

ncol(mtcars) # number of columns

## [1] 11

summary(mtcars) # basic statistics

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

colnames(mtcars) # column names

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

mtcars$hp # check the contents of the column named hp

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
## [20]  65  97 150 150 245 175  66  91 113 264 175 335 109

mtcars$hp[2:5] # some elements of this column

## [1] 110  93 110 175

mtcars[3:5,5] # we can also use a matrix reference

## [1] 3.85 3.08 3.15

mtcars[1:4, "hp"] # a slightly different way

## [1] 110 110  93 110

You can also define your own data frames:

# define a data frame
example <- data.frame(temp=c("visoka","nizka","visoka"),
pretok=c("velik","majhen","velik"),
motnost=c("velika","srednja","velika"),
vodostaj=c(30,10,28)) 
print(example) # look at the structure

##     temp pretok motnost vodostaj
## 1 visoka  velik  velika       30
## 2  nizka majhen srednja       10
## 3 visoka  velik  velika       28

example$pretok # let's just look at the column named pretok

## [1] "velik"  "majhen" "velik"

Task 12: Using the airquality object (you need the data(airquality) function to use the data), check on which days the wind speed was greater than 10 mph (the units in which the wind speed is given), check on which days the air temperature was between 60 and 70 F (the units in which the air temperature is given), and check on which day the maximum and minimum ozone concentrations were measured.

Task 13: Sort the airquality data according to the measured air temperature. Check the functioning of the order and sort functions.

Task 14: Using the apply function, calculate the average of all columns in the airquality object.

3.8 Lists

Lists (list) are a special kind of vector that can contain elements of different classes. Lists are a very important data type in R, because you can store many different kinds of data in them. Lists, in combination with various functions such as apply, sapply, or lapply, allow fast and easy computations on large amounts of data. Lists can be defined with the list function, which accepts any number of arguments:

# define any list
clovek<- list(name="John", age=35, spouse="Mary",
age_child=c(15, 13, 2)) 
clovek$age_child # let's look at the contents of the element named age_child

## [1] 15 13  2

clovek[["name"]] # we can also use this way

## [1] "John"

clovek[c("name", "spouse")] # or multiple elements

## $name
## [1] "John"
## 
## $spouse
## [1] "Mary"

names(clovek) # names can be checked using the names function

## [1] "name"      "age"       "spouse"    "age_child"

Task 15: Define a new object in the form of a list where you combine two columns of the airquality dataset and average precipitation objects that you considered in Task 5.

3.9 Importing and storing data

There are a few main functions for importing data into R:

read.table, read.csv for importing tabular data
readLines to read lines in a text file
readChar to read a specified number of characters of the form character
read_excel to read Excel files
source to read R code files (the reverse of dump)
dget to read R code files (inverse of dput)
load to read saved workspaces
scan to read files, the result of the object is a vector

There are many R packages that have been developed for importing different types of data, e.g. readxl for importing Excel spreadsheets or read_sav (haven package) for reading SPSS databases.

For saving data to files outside the R programming environment, there are analogous functions such as:

write.table, write.csv to save tabular data to text files (e.g. csv)
writeLines to store data line by line in a file
dump to store textual data in conjunction with various R objects
dput to store a textual representation of an R object
save to store any number of R objects (in binary compressed form) in a file

The read.table function is one of the most commonly used functions for importing data into R. The help for read.table is worth reading in full because it is a commonly used function and because it allows you to make a number of settings to ensure that your data is read in the correct format (e.g. you can also import your data directly via RStudio (via the GUI using the functions in the Environment tab):

Figure 3: Example of importing data via the RStudio GUI.

As an example, we will show the process of importing data from the water gauging station Veliko Širje on the Savinja River, measured in 2005. The data were obtained from the website of the Slovenian Environment Agency (ARSO)¹⁷. The data are stored on OneDrive¹⁸. We will define the arguments of the read.table function to import the data, we will define the location of the file (in your case change this depending on the folder where you will store the data), the column delimiter, the decimal symbol used and that the first line is read as the header of the file (header).

podatki <- read.table(file="C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Savinja-Veliko SirjeI-2005.txt",header=TRUE,sep=";",dec=".") 
head(podatki) # check the first few lines

##        Datum vodostaj.cm. pretok.m3.s. temp.vode.C.
## 1 01.01.2005          234       37.982          3.6
## 2 02.01.2005          231       35.515          3.5
## 3 03.01.2005          227       32.395          4.4
## 4 04.01.2005          221       28.073          3.7
## 5 05.01.2005          218       26.068          3.7
## 6 06.01.2005          215       24.165          3.4
##   transport_suspendiranega_materiala.kg.s.
## 1                                    0.068
## 2                                    0.053
## 3                                    0.055
## 4                                    0.345
## 5                                    0.201
## 6                                    0.039
##   vsebnost_suspendiranega_materiala.g.m3.
## 1                                       2
## 2                                       2
## 3                                       2
## 4                                      12
## 5                                       8
## 6                                       2

str(podatki) # check the structure of the read data

## 'data.frame':    365 obs. of  6 variables:
##  $ Datum                                   : chr  "01.01.2005" "02.01.2005" "03.01.2005" "04.01.2005" ...
##  $ vodostaj.cm.                            : int  234 231 227 221 218 215 213 211 210 208 ...
##  $ pretok.m3.s.                            : num  38 35.5 32.4 28.1 26.1 ...
##  $ temp.vode.C.                            : num  3.6 3.5 4.4 3.7 3.7 3.4 3.6 3.5 4.1 4.1 ...
##  $ transport_suspendiranega_materiala.kg.s.: num  0.068 0.053 0.055 0.345 0.201 0.039 0.112 0.068 0.095 0.213 ...
##  $ vsebnost_suspendiranega_materiala.g.m3. : int  2 2 2 12 8 2 5 3 5 11 ...

# we can see that the dates were read as character-character
names(podatki) # check column names

## [1] "Datum"                                   
## [2] "vodostaj.cm."                            
## [3] "pretok.m3.s."                            
## [4] "temp.vode.C."                            
## [5] "transport_suspendiranega_materiala.kg.s."
## [6] "vsebnost_suspendiranega_materiala.g.m3."

names(podatki) <- c("Datum", "Vodostaj", "Pretok", "Temperatura", "Transport", "Vsebnost")  # change column names
names(podatki) # check the changed names again

## [1] "Datum"       "Vodostaj"    "Pretok"      "Temperatura" "Transport"  
## [6] "Vsebnost"

Task 16: Calculate the average values of all 5 variables found in the data from the Veliko Širje water gauging station on the Savinja River (for 2005).

Task 17: Import any data you have ever used into the R software environment.

Task 18: Save an any object in .Rdata format and then load it back into R using the load function.

3.10 Packages

Additional R package expand the programme’s functionality by providing additional features, data, and documentation. The packages are produced by a worldwide community of R users and can be downloaded free of charge online. R packages are a kind of application on your mobile phone. In order to use the features available in a package, the following two steps must be followed:

Installing a package: This is similar to installing an app on your phone. Most packages are not installed by default when you install R and RStudio. So if you want to use a package for the first time, you’ll need to install it first. Once you have installed a package, you are unlikely to install it again unless you want to update it to a newer version (or if you accidentally uninstall it in the meantime).
Activating a package: Activating a package is similar to opening an app on your phone. Packages are not activated by default when you start RStudio. You must activate each package you want to use each time you start RStudio.

The R environment provides two ways of installing packages. The first is by using the graphical interface in RStudio, and the second is directly by using the install.packages function. To install via the GUI, the following window can be used (whether a particular package is activated can be seen by a tick in the white square in front of the package name, and clicking on a package will take you to the help page for that particular package, which is available after the package has been installed):

Figure 4: Example of installing a package via the RStudio GUI. The red circles show the individual installation steps and the blue circle shows whether a particular package is activated or not.

An alternative procedure for installing (and activating) the package:

install.packages("airGR") # install a package named "airGR"
library(airGR, quietly=TRUE) # package activation

Packages often also contain certain information that serves as a test case to better understand how the package works. For example, the airGR package also contains hydrological data that can be used as an example to calibrate, validate, and run the rainfall-runoff hydrological model included in this package, which we will see below. A description of these data can be found in the help for the BasinObs function (use ?BasinObs).

library(airGR, quietly=TRUE) # package activation  
data(L0123001) # load data named L0123001
str(BasinObs) # overview of basic characteristics

## 'data.frame':    10593 obs. of  6 variables:
##  $ DatesR: POSIXct, format: "1984-01-01" "1984-01-02" ...
##  $ P     : num  4.1 15.9 0.8 0 0 0 0 0 2.9 0 ...
##  $ T     : num  0.5 0.2 0.9 0.5 -1.6 0.9 3.5 4.4 7 6.4 ...
##  $ E     : num  0.2 0.2 0.3 0.3 0.1 0.3 0.4 0.4 0.5 0.5 ...
##  $ Qls   : int  2640 3440 12200 7600 6250 5650 5300 4700 3940 5300 ...
##  $ Qmm   : num  0.634 0.826 2.928 1.824 1.5 ...

View(BasinObs) # view data in a separate window

Package authors will release new versions of packages with bug fixes and new features over time, so it is usually a good idea to keep them up to date. However, keep in mind that new versions of packages will occasionally contain bugs or have slightly changed behaviour (e.g. features that work differently), which may mean that your code will no longer work as it did before the update. You can use the update.packages() function to update packages, or you can update via the RStudio GUI.

R has a centralised packet repository called CRAN¹⁹ (The Comprehensive R Archive Network). All packages stored there have high quality requirements. These packages must be regularly updated and documented. You can install any package from CRAN directly from the R console or via the RStudio GUI as shown above. Additional packages not included in CRAN can be found at:

GitHub²⁰;
Bioconductor²¹;
R-Forge²².

Lists of hydrology-related packages can be found here:

AboutHydrology²³.
Overview of packages on CRAN: R-resources for hydrologists²⁴.

As a point of interest, let’s also show an example of how R can also be used to download files directly from https connections and how they can be imported into R using the readr package, downloading a file from a web connection, and saving the file to a working directory, which you can see using the getwd() function:

download.file( "https://monashdatafluency.github.io/r-intro-2/r-intro-2-files.zip",  destfile="r-intro-2-files.zip")
# the file is in .zip format, the extension of this file
unzip("r-intro-2-files.zip") 
# install.packages("readr") # install the readr package
library(readr, quietly=TRUE) # package activation

## Warning: package 'readr' was built under R version 4.1.3

# read a file named geo.csv
geo <- read_csv("r-intro-2-files/geo.csv")

## Rows: 196 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): name, region, income2017
## dbl (2): lat, long
## lgl (2): oecd, g77
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(geo) # let's look at the first few lines of these data

## # A tibble: 6 x 7
##   name                region   oecd  g77     lat   long income2017
##   <chr>               <chr>    <lgl> <lgl> <dbl>  <dbl> <chr>     
## 1 Afghanistan         asia     FALSE TRUE   33    66    low       
## 2 Albania             europe   FALSE FALSE  41    20    upper_mid 
## 3 Algeria             africa   FALSE TRUE   28     3    upper_mid 
## 4 Andorra             europe   FALSE FALSE  42.5   1.52 high      
## 5 Angola              africa   FALSE TRUE  -12.5  18.5  lower_mid 
## 6 Antigua and Barbuda americas FALSE TRUE   17.0 -61.8  high

R uses a working directory to store files unless an exact (other) location is defined, which can be viewed using getwd(), or changed using setwd(). In conjunction with these two functions, it is also worth mentioning the list.files() function, which returns a list of all files in the working directory. This command is very useful for, say, the automatic reading of certain files.

Task 19: Find a package to use and calculate L-moments, install and use the package, and calculate the L-moments of the flow data you worked on in Task 16 (L-moments are similar to ordinary statistical moments, mean, variance, skewness, kurtosis).

Task 20: For the data included in the airGR package, calculate the basic statistics using the apply function and the basic statistics for columns 2, 3, and 4 using the summary function, and identify which hydrological variables are involved and what their units are.

3.11 Date and time in R

In hydrological analyses, we often deal with time series data representing hydrological measurements or other observations. The R program uses the following representation of dates and times: dates are represented by the Date class and times by the POSIXct and POSIXlt classes. Dates are stored internally as the number of days since 1970-01-01 and times as the number of seconds since 1970-01-01. From a character string, a date can be defined using the function as.Date():

x <- as.Date("1970-01-01") # define date
x # let's see the content

## [1] "1970-01-01"

# define the format argument, which defines the format of the data
x1 <- as.Date("1.1.1970", format="%d.%m.%Y") 
unclass(x1) # convert back to character format

## [1] 0

unclass(as.Date("1971-01-01")) # another example with a different date

## [1] 365

You can choose between several format types for dates.

Figure 5: An example of some useful date and time abbreviations that can be used when defining arguments.

In R, the time data are represented by the POSIXct or POSIXlt class. POSIXct is useful when you want to store times in a format similar to data frames. POSIXlt contains additional useful information such as day of the week, day of the year, month, and day of the month. This may be useful in some cases. It is worth mentioning some other useful functions such as weekdays, months, quarters.

x <- Sys.time() # define object x according to the current time
x # content

## [1] "2024-07-23 12:38:09 CEST"

class(x) # check time format

## [1] "POSIXct" "POSIXt"

p <- as.POSIXlt(x) # transform time into POSIXlt format
names(unclass(p)) # look at the additional data contained in this format

##  [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"  
##  [9] "isdst"  "zone"   "gmtoff"

p$yday # check the Julian day

## [1] 204

If your dates are written in a different format, there is the strptime function (the inverse of this function is the strftime function). This function uses a character vector of dates and times and converts them into a POSIXlt object. You can choose two dates of your choice and the month format depends on the local language settings:

datum <- c("Januar 23, 2022 11:42", "Junij 10, 2018 12:10") 
x2 <- strptime(datum, "%B %d, %Y %H:%M") # transformed to POSIXlt format
x2 # see the object

## [1] "2022-01-23 11:42:00 CET"  "2018-06-10 12:10:00 CEST"

class(x2) # format of this object

## [1] "POSIXlt" "POSIXt"

You can also perform some mathematical operations with dates and times (e.g. addition and subtraction). You can also perform comparisons (e.g. ==, <=):

x3 <- as.Date("2017-01-01-01") # define a date
# define a second object
y3 <- strptime("21 Marec 2015 13:44:51", "%d %B %Y %H:%M:%S") 
# x3-y3 # does not work because the format of the two objects is not the same
x3 <- as.POSIXlt(x3) # transform the first object
x3-y3 # try again

## Time difference of 651.4689 days

The good thing about these object types in R is that they take into account all the potentially confusing things related to dates and times, such as leap years, daylight saving time, and time zones. And there’s also the example with two different time zones, where the base time zone is set according to the computer’s settings:

x4 <- as.POSIXct("2012-10-25 01:00:00")      
y4 <- as.POSIXct("2012-10-25 01:00:00", tz = "GMT") # random tz
z4 <- as.POSIXct("2012-10-25 01:00:00", tz = "America/Chicago")
y4-x4 # difference

## Time difference of 2 hours

z4-x4 # difference

## Time difference of 7 hours

as.double(z4-x4, units = "days") # conversion to number of days

## [1] 0.2916667

To work with time series, which hydrological data often include, R provides specific packages such as zoo, which include functions to combine date and time data and different variables into a single object. This approach makes it relatively easy to compile certain parts of the data, such as determining maximum annual discharges or calculating monthly rainfall. Let’s take a look at some examples based on data from the Veliko Širje water gauging station on the Savinja River:

#install.packages("zoo") # install the package
library(zoo, quietly=TRUE)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

podatki <- read.table(file="C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Savinja-Veliko SirjeI-2005.txt",header=TRUE,sep=";",dec=".") 
# change column names
names(podatki) <- c("Datum", "Vodostaj", "Pretok",
"Temperatura", "Transport", "Vsebnost") 
# dates are transformed into the as.POSIXct format
podatki[,1] <- as.POSIXct(strptime(podatki[,1],format="%d.%m.%Y")) 
Qzoo <- zoo(podatki[,3],podatki[,1]) # define the zoo object
head(Qzoo) # let's take a look at the structure of this object

## 2005-01-01 2005-01-02 2005-01-03 2005-01-04 2005-01-05 2005-01-06 
##     37.982     35.515     32.395     28.073     26.068     24.165

mes1 <- as.yearmon(time(Qzoo)+3600) # set monthly values
head(mes1) # let's look at the structure of the data

## [1] "jan. 2005" "jan. 2005" "jan. 2005" "jan. 2005" "jan. 2005" "jan. 2005"

leto1 <- as.numeric(floor(as.yearmon(time(Qzoo)+3600))) # annual values
head(leto1) # data structure

## [1] 2005 2005 2005 2005 2005 2005

# the aggregate function can be used to calculate the annual maximum flows
letneVsote <- aggregate(Qzoo, leto1, max) 
letneVsote # results

##  2005 
## 382.1

mesVsote <- aggregate(Qzoo,mes1,sum) # total monthly sums
head(mesVsote) # content

## jan. 2005 feb. 2005 mar. 2005 apr. 2005  maj 2005 jun. 2005 
##   578.223   360.518   973.600  1588.382  1124.756   455.644

# and another slightly more complex example
mesVsotePovp <- aggregate(mesVsote,months(time(mesVsote)),sum)

As an interesting feature, the merge function can be used to combine continuous data and data with missing values, which is particularly useful for data with hourly (or more precise) time steps:

# the first column is transformed into the Date format
podatki[,1] <- as.Date(podatki[,1],format="%d.%m.%Y") 
# assume missing data for February
podatki <- podatki[-(32:59),] 
# generate a continuous vector of dates
time.seq <- seq(as.Date(podatki[,1],format="%d.%m.%Y")[1], as.Date(podatki[,1],format="%d.%m.%Y")[length(podatki[,1])], by = "1 day") 
# data merging
zdr <- merge(x=podatki, y=as.data.frame(time.seq), by.x = "Datum", by.y = "time.seq", all.y = T) 
zdr[25:35,1:3] # we see that the missing data are marked NA

##         Datum Vodostaj Pretok
## 25 2005-01-24      193 13.126
## 26 2005-01-25      190 11.994
## 27 2005-01-26      190 11.994
## 28 2005-01-27      189 11.654
## 29 2005-01-28      189 11.654
## 30 2005-01-29      183  9.836
## 31 2005-01-30      187 11.006
## 32 2005-01-31       NA     NA
## 33 2005-02-01       NA     NA
## 34 2005-02-02       NA     NA
## 35 2005-02-03       NA     NA

Task 21: Calculate the maximum, minimum, and average values of the flows in each period (Jan–Mar, Apr– Jun, etc., i.e. ¼ of the year). Hint you are looking for a function similar to as.yearmon.

3.12 Plotting basic graphs

In this section, we will demonstrate the use of the most common graphs. R offers many ways of presenting data. The basic function in R for plotting graphs is plot, which has a number of arguments that can be modified to specify what will be plotted on a given graph. Let’s look at one very simplified example where we will use the zoo object we defined in the previous steps to plot. In this case the structure of the object is such that R recognises the time data and the actual data, and plots them accordingly on the x-axis and y-axis; this is one of the advantages of using packages like zoo, as well as the objects in these packages:

plot(Qzoo)

Figure 6: Line graph plot.

 # settings changed, the xlim and ylim arguments are also worth mentioning
plot(Qzoo,xlab="Date",ylab="Discharge [m3/s]",main="Savinja-Veliko Širje", 
col="blue",lty=2,lwd=2)

Figure 7: Line graph plot.

# drawing a simple scatter plot
plot(x=podatki[,3],y=podatki[,2],xlab="Discharge [m3/s]",
ylab="Water level [cm]", main="Rating curve")

Figure 8: Rating curve.

The arguments associated with the par settings allow you to change a number of settings²⁵. However, there are many packages that can be used to draw slightly more visually attractive graphs, for example ggplot2 is often used:

#install.packages("ggplot2") # install the package first
library(ggplot2, quietly=TRUE) # activate

## Warning: package 'ggplot2' was built under R version 4.1.3

ggplot(podatki, aes(x = Pretok, y = Vodostaj)) + # plot graph
  geom_point() # and points

Figure 9: Q-H curve with the ggplot2 package.

The usual way to use the ggplot function is to use data frame (data.frame) as input and then tell it which columns to use for the x and y values as shown in the example above. The ggplot2 package is designed to work with data frames as a data source rather than individual vectors, so you will only be able to use a limited part of its capabilities in this way (with vectors). The ggplot commands are often spread over several lines, so you may often see code in the form as:

# define what data you want to plot
# temperature is used for the colour scale, discharge for the magnitude
ggplot(podatki, aes(x=Pretok, y=Vodostaj)) + 
  geom_point(aes(col=Temperatura, size=Pretok)) + 
  geom_smooth(method="loess", se=F) + # add a function describing the data
  xlim(c(0, 400)) + # define the display area on the x-axis
  ylim(c(170, 500)) + # define the display area on the y-axis
  labs(subtitle="Savinja-Veliko Širje", # define headings and other
    y="Water level", x="Discharge", title="Rating curve", 
    caption = "Year 2005")

## `geom_smooth()` using formula = 'y ~ x'

Figure 10: Q-H curve in a third way.

R includes a large number of functions for plotting a wide variety of different graphs, a nice overview of which can be found on the R Graph Gallery website²⁶. Also of potential interest is the datacamp website²⁷, which also offers courses on how to use R.

Let’s look at some other examples. To create a line graph with the plot function, we use the x and y vectors as input and use the argument type as “l”.

# plotting a line graph based on water T data
plot(x=podatki[,1],y=podatki[,4],type="l",xlab="Date",ylab="Temperature [°C]", main="Temperature")

Figure 11: Line graph for water temperature.

You can also add points (points), lines (lines), text labels (text), and legends (legend) to an existing graph. For points and lines, you need to define the x and y coordinate and make sure it is the same as the base graph you are plotting the data on, otherwise there may be a shift in the plotting of points and lines. The text function requires the x and y coordinate where the text is plotted, similarly for the legend function. Alternatives to coordinates are top, bottomleft, right, etc. The legend function has a number of arguments that can be used to specify the legend description²⁸.

plot(x=podatki[,1],y=podatki[,4],type="l",xlab="Date",ylab="Temperature [°C]", main="Temperature") 
points(x=podatki[,1],y=podatki[,4],col="red") # adding points
# additional line (1/2 of the measured temperature value)
lines(x=podatki[,1],y=podatki[,4]/2,col="blue") 
# add a horizontal line at value 15
abline(h=15,col="red",lty=2) 
# random text
text(x=podatki[50,1],y=15, labels="A note") 
# legend
legend("bottom",legend="Temperature",col="black",lty=1,bty="n",cex=0.9)

Figure 12: Line graph with additional data.

In ggplot2 you can get a similar result using geom_line to add line graphs and geom_points to add points. The ggplot2 package allows a number of settings, some of which are described here²⁹.

ggplot(podatki[,], aes(x = Datum, y = Temperatura)) +
  geom_line() + # adding line
  geom_point() + # and points
  labs(subtitle="Savinja-Veliko Širje", # title and other
       y="Temperature", 
       x="Date", 
       title="Variation in T", 
       caption = "Change of T")

Figure 13: Line graph with ggplot2.

To draw a bar plot, you can use the barplot function, where you can use as input a vector of values for the height of each row and (optionally) a vector of labels for each row. If there are element names in the vector, these names are automatically used as labels. Sometimes a bar plot refers to a graph where the columns represent the number of elements in each category. It is similar to a histogram but has a discrete rather than a continuous x-axis. To generate a count of each unique value in a given vector, you can use the table function. The ggplot2 package also allows you to plot similar graphs using either the geom_col or geom_bar function depending on the type of data you want to display. Often we also want to plot a histogram using the hist function, which plots the number of data in defined classes. In the ggplot2 package, there is an analogue function geom_histogram. To plot the bar graph, we will use the monthly totals of the flows we calculated above, applying the round function to round the data.

round(mesVsote,0)

## jan. 2005 feb. 2005 mar. 2005 apr. 2005  maj 2005 jun. 2005 jul. 2005 avg. 2005 
##       578       361       974      1588      1125       456      1591      2800 
## sep. 2005 okt. 2005 nov. 2005 dec. 2005 
##      1625      1881       742      1795

barplot(mesVsote,xlab="Date", ylab="Discharge [m3/s]") # barplot

Figure 14: Bar plot.

head(table(podatki$Vodostaj)) # let's see the result of such a graph

## 
## 182 183 184 185 186 187 
##   1   3   4   3   1   9

table(mtcars$cyl) # for a better understanding, let's look at the result of the table function on the mtcars data and the column showing the number of cylinders, so the table function determines how many elements have 4, 6, and 8 cylinders

## 
##  4  6  8 
## 11  7 14

# plotting the bar graph
barplot(table(podatki$Vodostaj),xlab="Water level",ylab="Number of elements")

Figure 15: Bar plot for water levels.

 #  histogram
hist(podatki$Vodostaj,xlab="Water level",ylab="Number of elements",main="Histogram")

Figure 16: Histogram.

# change the number of classes on the x-axis
hist(podatki$Vodostaj,breaks=12,xlab="Water level",ylab="Number of elements in the class",main="Histogram")

Figure 17: Histogram with a different number of classes.

One of the last graphs we will mention is the boxplot, which can be drawn using the plot function if we are dealing with factors. In the case of numerical data, we can use the boxplot function, or geom_boxplot in the case of the ggplot2 package. In connection with the boxplot, it is also worth mentioning the so-called violin graphs, which allow for slightly different data display from that of the boxplot³⁰. geom_violin is a function that can be used to plot this type of graph using the ggplot2 package.

# show discharges for 3 different periods
boxplot(podatki$Pretok[20:40],podatki$Pretok[60:80],podatki$Pretok[120:140],
        names=c("Q1","Q2","Q3"),xlab="Discharge") 
abline(h=75,col="red",lty=2) # add a horizontal line at 75

Figure 18: Boxplot.

Additionally, the curve function (for drawing curves) can be useful in some cases, as well as the mtext function (for adding text to the edges of graphs) and the segments function (for drawing line segments on the graph). Let’s look at another example of how graphs can be drawn over each other. An alternative for drawing such a graph is the ggplot2 package. First, we will plot the first histogram, where the data for the 1/2 year will be displayed. The breaks argument defines the number of histogram classes and this argument also allows other settings (see the help from the ?hist function for more information). It is important that the units of display on the x-axis are identical for the first and the second graph, so that there is no data offset. We will then add a second histogram to the existing histogram (this is defined using the arguments add=T and add=TRUE respectively), where we show the data for 1/2 year, with the colour and transparency defined using the function rgb. In addition, we will draw the legend.

hist(podatki[1:180,3], breaks=30, xlim=c(0,500), col=rgb(1,0,0,0.5), 
xlab="Discharge", ylab="Number of discharge values", main="Discharge Savinje" ) 
hist(podatki[181:365,3], breaks=30, xlim=c(0,500),
col=rgb(0,0,1,0.5), add=T) 
legend("topright", legend=c("1/2","2/2"), col=c(rgb(1,0,0,0.5), 
  rgb(0,0,1,0.5)), pt.cex=2, pch=15 )

Figure 19: Plotting two histograms on the same graph.

In addition, the drawing area can also be divided into smaller parts, where several smaller graphs can be drawn.

par(mfrow=c(1,2)) # dividing the drawing area into 2 parts
# first histogram, graph settings similar to above
hist(podatki[1:180,3], breaks=30, xlim=c(0,500), col=rgb(1,0,0,0.5), 
xlab="Discharge",ylab="Number", main="Discharge Savinja 1/2" ) 
# second histogram, graph settings are similar to above
hist(podatki[181:365,3], breaks=30, xlim=c(0,500), col=rgb(0,0,1,0.5), 
xlab="Discharge", ylab="Number", main="Discharge Savinja 2/2" )

Figure 20: Two histograms side by side.

Graphs can be saved directly via the Rstudio graphical interface, where you have the option in the Plots tab to save the plotted graph either as an image in several formats (e.g. png, jpeg, tiff, eps, etc.) or as a pdf document, or to copy and paste the graph directly using the Copy to Clipboard function. An alternative method is to use functions such as bmp, jpeg, png, tiff, or pdf. These functions can be used to save graphs even during calculations using loops or similar complex calculations where saving the graphs manually would be quite time-consuming. The procedure for using these functions is as follows, where an example of saving a graph using the png function is shown.

png(file = "myplot.png", bg = "transparent") 
plot(1:10)
rect(1, 5, 3, 7, col = "white")
dev.off()

The ggplot2 package also allows you to draw coloured graphs or areas (geam_area function).

ggplot(podatki, aes(x = Datum, y = Vodostaj)) +
  geom_area(colour = "black", fill = "blue", alpha = .2)

Figure 21: Example of a line graph with a coloured area below the line.

The range of packages and options for graphing in R is quite extensive, and it is often possible to find a suitable solution to a particular problem online. As an example, we can mention the openair package, which is designed for displaying air quality data³¹. The R Graphics Cookbook is an excellent example of a graphical package and also gives a great overview of graphical settings ³².

Task 22: For the data from the Savinja watercourse (water level, discharge, temperature, and suspended solids transport), draw a boxplot and label the graph with title, axis labels, etc. The graph’s y-axis should be drawn in log scale.

Task 23: From a set of more advanced graphs (the R graph gallery³³ website gives a good overview), choose one type of graph to draw and fit the selected graph accordingly.

3.13 Writing your own functions and using loops

R gives users the option of writing their own functions, too. Writing functions is an important step in transitioning from an ordinary user to a developer. Functions are often used for operations that must be performed repeatedly, perhaps under slightly different conditions.

Writing functions allows developers to create an interface to code that is defined by a set of parameters. Writing code in the form of functions simplifies its use by other users, as they do not need to know all the details of how the code works and can use it in the same way as any other function. Additionally, functions can also be used as arguments in other functions, which is particularly useful for operations such as apply, lapply, or sapply. Functions are defined by the function directive and are stored as R objects. Let’s look at a simple example of a function to calculate a square in R:

# kvad will be the name of the function, x is the only argument used by the function
kvad<- function(x){ 
rez<- x*x # calculate the square and store it in the rez object
return(rez) # print the result
}
kvad(10) # check the function

## [1] 100

kvad(3:6) # the function can also be used on the whole vector

## [1]  9 16 25 36

# use as one of the arguments to the apply function
apply(X = podatki[1:5,2:3],MARGIN = 2,FUN = kvad)

##   Vodostaj    Pretok
## 1    54756 1442.6323
## 2    53361 1261.3152
## 3    51529 1049.4360
## 4    48841  788.0933
## 5    47524  679.5406

Functions can also have multiple arguments, which can have pre-selected values. However, in some cases it makes sense to leave open a set of arguments that can be used for a particular function. For example, suppose we want to create a function that basically plots quadratic values on the x- and y-axes using the plot function, and we want to have a line graph where the points are plotted at the same time. The graph is the name of a function that has 3 arguments, where x and y are not defined, type already defines the type of the graph, and the label … indicates a variable number of arguments that are usually passed to other functions. The … argument is often used when extending to other functions, when you don’t want to copy the entire argument list from the original function.

graf <- function(x, y, type="b", ...){ 
  plot(x*x,y*y, type=type, ...) # our function
}
graf(x=5:10,y=25:30) # how to use it

Figure 22: First plot using our own function.

# more arguments
graf(x=5:10,y=25:30,main="Graph with function",col="red")

Figure 23: Using additional arguments.

Control structures or loops in R allow us to control the execution of a set of computations in the R programming environment. Control structures allow us to introduce some “logic” into the R code, instead of always executing the same R code, like changing the calculations depending on the specific condition (e.g. if function). Control structures can be used to respond to input data or data characteristics, and to execute different operations in the R programming environment accordingly. Some commonly used control structures are:

if and else functions check a condition and act on the condition
for function executes the loop multiple times
while function executes the loop until the condition is satisfied
repeat function executes infinite loops (to stop, the loop must be terminated)
break function terminates the execution of the loop
next function skips a (specified) part of the loop

Most control structures are used when writing custom functions or longer combinations of expressions. Let’s look at one typical example of an if loop³⁴:

value <- 25 # define any object
if (value > 20) { # define a condition
    print('This is a big number') # if the condition is met
} else {
    print('The number is not that big') # if the condition is not satisfied
}

## [1] "This is a big number"

# let's look at the case when the value is smaller
value <- 5 # define an arbitrary object
if (value > 20) { # define the condition
    print('This is a big number') # if the condition is satisfied
} else {
    print('The number is not that big') # if the condition is not satisfied
}

## [1] "The number is not that big"

Let’s look at another example of a for loop:

for(k in 1:5) { # define a loop
print(k) # what happens in the loop
} # end of loop

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Let’s look at another example of a double for loop:

x <- cbind(c("water", "air"),10:11) # define an arbitrary object
for(i in 1:dim(x)[1]) { # define the first loop
        for(j in 1:dim(x)[2]) { # define the second loop
                print(x[i,j]) # define what is executed in the loops
        } # end the second loop
} # end the first loop

## [1] "water"
## [1] "10"
## [1] "air"
## [1] "11"

It should be noted that many computational operations in R can also be performed using the apply, lapply, and sapply functions (instead of using loops). Let’s look at an example of using the sapply function.

x <- list(a=1:10,b=4:9,log=c(TRUE,FALSE,FALSE,TRUE))
sapply(X=x, FUN=quantile)

##          a    b log
## 0%    1.00 4.00 0.0
## 25%   3.25 5.25 0.0
## 50%   5.50 6.50 0.5
## 75%   7.75 7.75 1.0
## 100% 10.00 9.00 1.0

In cases where we want to repeat the code until a condition is met, the while function comes in handy. Loops using the while function start by testing the condition. If the condition is met, the main part of the loop is executed. Once the commands are executed, the condition is re-tested and so on until the condition is no longer satisfied, and then the loop is terminated³⁵.

stejemo <- 0 # define object
while(stejemo< 5) { # define the while condition of the loop
    print(stejemo) # loop statements
     stejemo<- stejemo+ 1 
 } # end

## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Control structures such as if, while, and for allow you to control the flow of calculations in R. The use of loops should be as simple as possible.

Task 24: Write a function that allows you to normalise data.

Task 25: Define a function that takes two arguments (an x vector and a y vector) and plots a scatter plot of these two vectors, where the x and y axes are in log scale.

Task 26: Define and use a function that calculates the mean and variance of your sample.

Task 27: Use an if loop to check whether all the daily flow data from the Savinja River for 2005 are available. The result of the if loop should be in descriptive form (character) and should tell you whether there are enough data or whether there are some missing data.

Task 28: Using a for loop, randomly generate the vector x and the vector y 5 times (both vectors should contain 10 elements, given a normal distribution, mean 0, and standard deviation 1), and plot both vectors on a scatter plot using the plot function.

3.14 Statistical tests and distributions

Each distribution included in R has four functions. There is a root name for the function, for example the root name for the normal distribution is norm. This root name is preceded by one of the letters:

p for the cumulative distribution function P(X less than or equal to x)
q for the quantile function or the inverse of the distribution function
d for the point probability P(X = x) of discrete or density pX(x) of continuous distributions
r for the random value following a given distribution

For a normal distribution, these functions are pnorm, qnorm, dnorm, and rnorm. For a binomial distribution, these functions are pbinom, qbinom, dbinom, and rbinom. It is similar for the other distribution functions available in R:

beta distribution, beta
exponential distribution, exp
gamma distribution, gamma
log-normal distribution, lnorm
logistic distribution; logis
Poisson distribution, pois
uniform distribution, uniform
Student’s t distribution, t
Weibull distribution, weibull
f distribution, f
etc.

Many of the distributions used in hydrology are also available in a variety of additional packages, for example the lmom and lmomco packages are designed to estimate parameters using the L-moment method and contain a number of additional distribution functions. Some distribution parameters have prescribed values (e.g. mean and standard deviation in normal distributions) but can be changed, while others are dependent (e.g. scale parameter in gamma distributions; 1/rate).

For a continuous distribution such as the normal distribution, the most useful functions for solving problems involving the calculation of probabilities are the prnorm and qnorm functions, since the probability density calculated with dnorm can only be used to calculate probabilities using integrals. For a discrete distribution such as a binomial distribution, dbinom calculates the probability density, which in this case is the probability f(x) = P(X = x) and is therefore often useful in calculations. Consider the following example, where we are interested in P(X = 3) and where the random variable X is the distributed binomial Bi(10, 0.4):

dbinom(3, size=10, prob=0.4)

## [1] 0.2149908

Let us assume that we are dealing with an air temperature that can be adequately described by a normal distribution, with a mean of 5 degrees and a standard deviation of 8 degrees. What is the 90th percentile of the air temperature data? We can also generate a vector with 5 elements that follow this distribution.

qnorm(0.9, mean=5, sd=8) # calculation performed

## [1] 15.25241

rnorm(5, mean=5, sd=8) # generate 5 random numbers

## [1] 1.548072 3.529935 3.685294 2.038504 4.846742

rnorm(5, mean=5, sd=8) # repeat again

## [1]  3.427018 19.110134 11.167608 10.830124 -4.996389

set.seed(30) # in case we want to generate the same numbers each time
rnorm(2, mean=5, sd=8) # repeat again

## [1] -5.308146  2.218485

Sampling is also done using the sample function, which returns a random permutation of the vector x:

sample(x=1:5, size=2) # sample without repeating the corresponding size from x

## [1] 5 4

sample(x=1:5, size=4, replace=TRUE) # sample from x with repetition

## [1] 2 3 2 5

However, we often use various statistical tests, and the list of tests included in the basic R software is relatively extensive, to mention just a few of the commonly used tests:

cor.test;
chisq.test;
kruskal.test;
ks.test;
poisson.test;
t.test;
wilcox.test;
var.test.

For example, let’s look at the commonly used t.test. In this case, we test the hypothesis that two variables defined on different populations have the same mean (the null hypothesis states that mu1 = mu2, where mu1 and mu2 are means), assuming that they are at least approximately normally distributed. Here, alternative = two.sided means that the alternative hypothesis states that mu1 is different from mu2; the choice null means that the alternative hypothesis states that mu1 is less than mu2 and the choice greater means that the alternative hypothesis states that mu1 is greater than mu2.

t.test(rnorm(15,10,2), rnorm(15,100,20), alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  rnorm(15, 10, 2) and rnorm(15, 100, 20)
## t = -17.659, df = 14.206, p-value = 4.576e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -98.79112 -77.41894
## sample estimates:
## mean of x mean of y 
##  8.831213 96.936245

We can see that the calculated p-value is less than 0.05, which means that the null hypothesis can be rejected (H0: the mean is equal to the sample used). If the p-value is not less than 0.05, we could conclude that the calculations show that we are not able to reject the null hypothesis, as in this case:

t.test(rnorm(15,10,2), rnorm(15,16,4), alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  rnorm(15, 10, 2) and rnorm(15, 16, 4)
## t = -4.9492, df = 23.55, p-value = 4.97e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.426010 -3.874035
## sample estimates:
## mean of x mean of y 
##  9.711689 16.361711

We can also mention the non-parametric Mann-Whitney test, where the null hypothesis is that the variables belong to the same distribution and the alternative hypothesis is that one group’s variable has greater values than the other group’s variable³⁶.

# Mann-Whitney test
wilcox.test(x=rnorm(100,10,2), y=runif(200,20,30), paired = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rnorm(100, 10, 2) and runif(200, 20, 30)
## W = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Alternatively, we can use test for dependence applying the cor.test function, where the alternative argument tells you what the alternative hypothesis is, the two.sided choice means that the alternative hypothesis states that the variables are dependent, the less choice means that the alternative hypothesis states that the variables are positively correlated, and the greater choice means that the alternative hypothesis states that the variables are negatively correlated:

cor.test(x=runif(15,2,10), y=runif(15,20,30), alternative =  "two.sided")

## 
##  Pearson's product-moment correlation
## 
## data:  runif(15, 2, 10) and runif(15, 20, 30)
## t = 0.6245, df = 13, p-value = 0.5431
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3743235  0.6280225
## sample estimates:
##       cor 
## 0.1706647

Basically, the cor.test function calculates the Pearson correlation coefficient. However, using the method argument, we can also calculate the Spearman or Kendall correlation coefficient. Additionally, using the conf.level argument, we can calculate the confidence interval for the correlation coefficient at the selected confidence level. Example of such data dependency is where the discharge values depend on the water level values, as they are calculated from the flow curve of the gauging station. An example for the Kendall correlation coefficient is also shown. The calculation of the p-value for the Kendall correlation coefficient is relatively complex in some cases and the cor.test function thus determines in which cases it calculates the exact value and in which cases an approximation. It is advised to heed the Warning message. The calculation process can be modified by using the exact argument as shown in the following example.

cor.test(x=podatki[1:20,2], y=podatki[1:20,3], method="pearson", conf.level = 0.05)

## 
##  Pearson's product-moment correlation
## 
## data:  podatki[1:20, 2] and podatki[1:20, 3]
## t = 41.216, df = 18, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 5 percent confidence interval:
##  0.9945819 0.9949009
## sample estimates:
##       cor 
## 0.9947438

cor.test(x=podatki[1:20,2], y=podatki[1:20,3], method="kendall")

## Warning in cor.test.default(x = podatki[1:20, 2], y = podatki[1:20, 3], : Cannot
## compute exact p-value with ties

## 
##  Kendall's rank correlation tau
## 
## data:  podatki[1:20, 2] and podatki[1:20, 3]
## z = 6.0862, p-value = 1.156e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau 
##   1

cor.test(x=podatki[1:20,2], y=podatki[1:20,3], method="kendall", exact=FALSE)

## 
##  Kendall's rank correlation tau
## 
## data:  podatki[1:20, 2] and podatki[1:20, 3]
## z = 6.0862, p-value = 1.156e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau 
##   1

Many additional statistical tests are also available in a number of packages.

Task 29: Plot a graph of two randomly generated samples. Generate both samples using a normal distribution: once with mean 0 and standard deviation 2, then with mean 3 and standard deviation 5. Label the graph accordingly with axis labels, legend, etc.

Task 30: Find an appropriate test that can be used to calculate whether or not a trend in the sample is statistically significant. Download the appropriate package and use the test with one randomly generated sample based on a normal distribution with mean 10 and standard deviation 5 (generate 20 numbers). Repeat the procedure for the sequence of numbers from 1:20.

3.15 Simple models

In water management, we often encounter linear and non-linear models. For example, discharge data is mostly calculated from water level data using a flow curve.

# linear model between discharge and water level
mod1 <- lm(Pretok ~ Vodostaj, podatki) 
# add the water temperature
mod2 <- lm(Pretok ~ Vodostaj + Temperatura, podatki) 
plot(podatki$Vodostaj,podatki$Pretok, xlab="Water level [cm]", ylab="Discharge [m3/s]", main="Models 1 and 2") 
# add a link based on the first linear model
abline(mod1, col="red") 
# check whether the water temperature information helps
abline(mod2, col="blue")

## Warning in abline(mod2, col = "blue"): only using the first two of 3 regression
## coefficients

Figure 24: Illustration of a linear model with a rating curve.

The linear model we have defined is a separate type of object in R. We can directly use various practical functions such as anova, coef, deviance, formula, predict, plot, print, residuals, spring, summary, etc., in conjunction with these objects. Let’s take a look at some examples:

print(mod1) # model coefficients

## 
## Call:
## lm(formula = Pretok ~ Vodostaj, data = podatki)
## 
## Coefficients:
## (Intercept)     Vodostaj  
##    -201.515        1.058

summary(mod1) # main features of the model

## 
## Call:
## lm(formula = Pretok ~ Vodostaj, data = podatki)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.292  -8.891  -2.693   5.978  76.735 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -201.51498    3.06039  -65.85   <2e-16 ***
## Vodostaj       1.05821    0.01288   82.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.04 on 335 degrees of freedom
## Multiple R-squared:  0.9527, Adjusted R-squared:  0.9526 
## F-statistic:  6747 on 1 and 335 DF,  p-value: < 2.2e-16

formula(mod1) # equation used

## Pretok ~ Vodostaj

head(residuals(mod1)) # residuals of the model used

##         1         2         3         4         5         6 
## -8.122995 -7.415380 -6.302560 -4.275330 -3.105715 -1.834100

head(predict(mod1)) # predictions of the model used 1

##        1        2        3        4        5        6 
## 46.10499 42.93038 38.69756 32.34833 29.17371 25.99910

plot(mod1) # some automatically determined plots

Figure 25: Graphs showing the fit of the linear model.

In the same way, more complex (non-linear) models can also be defined. In the parameter estimation the initial values of the parameters are very important:

mod3 <- nls(Pretok ~ a*(Vodostaj)^b, data = podatki, 
    start = list(a=0.0000005,b=3)) # model 3
mod4 <- nls(Pretok ~ a*(Vodostaj)^2+b*(Vodostaj)+c, data = podatki, 
    start = list(a=0.005,b=3,c=2)) # model 4
# simple graph
plot(podatki$Vodostaj,podatki$Pretok, xlab="Water level [cm]", 
  ylab="Discharge [m3/s]", main="Models 3 in 4") 
lines(150:500,coef(mod3)[1]*(150:500)^(coef(mod3)[2]),col="red") 
# graphical model matching 3
lines(150:500,coef(mod4)[1]*(150:500)^2+coef(mod4)[2]*(150:500)+coef(mod4)[3],col="blue")

Figure 26: Testing additional linear models.

# graphical model matching 4
# let's calculate some more statistics, R2
RSS.p <- sum(residuals(mod3)^2)
TSS <- sum((podatki$Pretok - mean(podatki$Pretok))^2)
r2 <- 1 - (RSS.p/TSS)
r2 # r2 for model 3

## [1] 0.9695111

RSS.p <- sum(residuals(mod4)^2)
TSS <- sum((podatki$Pretok - mean(podatki$Pretok))^2)
r2 <- 1 - (RSS.p/TSS)
r2 # r2 for model 4

## [1] 0.998363

It is clear that this last model describes the actual data very well, and it is likely that the chosen function is very similar to the rating curve that was used to determine the flows from the measured water levels.

There are also more complex functions, such as gls (nlme package), which can be used to define various linear and non-linear models. Some interesting examples are also available online³⁷.

Task 31: Find a suitable function or model that describes the relationship between flow and suspended solids such that r2 is greater than 0.3. Use data from the Veliko Širje water gauging station on the Savinja River.

4 Hydrological analysis and modelling

The following sections outline some of the methods and analyses commonly used in hydrology. Methods for both low and high flow analyses, mutlivariate analyses, and trend and seasonality analyses will be presented. In addition, emphasis will also be placed on rainfall data analysis, stochastic rainfall modelling and rainfall erosivity, as well as surface runoff modelling using a range of approaches. In addition, the use of spatial data will be demonstrated, as well as climate change analysis.

4.1 Low flow analysis

Low-flow analyses in hydrology are one of the key aspects of water resources management³⁸, where we focus on understanding and quantifying rivers’ water characteristics. This is particularly important in regions where water scarcity and hydrological drought are a problem. Low flows usually occur during dry periods or in regions with limited water availability. The analysis of low flows is important for several reasons such as: (i) water resources management (assessment of available water), (ii) planning of hydraulic infrastructure (dams, reservoirs), (iii) environmental protection (impact of low flows on aquatic ecosystems), (iv) risk assessment (planning of measures in case of low flows). There is a lot of literature available for understanding and performing hydrological drought analyses, with a special focus on the book Hydrological Drought³⁹. In this chapter, we will focus specifically on some of the methods included in the lfstat package⁴⁰. The lfstat package is based on the World Meteorological Organisation (WMO) manual⁴¹.

For the analyses, we will use data from the Veliko Širje water gauging station on the Savinja River, which we have already used in previous chapters. Before using the lfstat package, remember to install the package using the install.packages function or via the Rstudio GUI and activate it (e.g. using the library function).

library(lfstat, quietly=TRUE); library(zoo, quietly=TRUE)

## Warning: package 'lfstat' was built under R version 4.1.3

podatki <- read.table(file="C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Savinja-Veliko SirjeI-2005.txt",header=TRUE,sep=";",dec=".") 
# import the data and change the column names
names(podatki) <- c("Datum", "Vodostaj", "Pretok", "Temperatura", "Transport", "Vsebnost") 
# dates are transformed into the as.POSIXct format
podatki[,1] <- as.POSIXct(strptime(podatki[,1],format="%d.%m.%Y")) 
# define zoo object 
Qzoo <- zoo(podatki[,3],podatki[,1]) 
# object lfobj, argument hyeastart determines the start of the hydrological year
nizkiQ <- createlfobj(ts(Qzoo), startdate = "01/01/2005", hyearstart = 1) 
setlfunit("m^3/s") # define our data’s units 
head(nizkiQ) # let's look at the first few lines of the object

##   day month year   flow hyear baseflow
## 1   1     1 2005 37.982  2005       NA
## 2   2     1 2005 35.515  2005       NA
## 3   3     1 2005 32.395  2005       NA
## 4   4     1 2005 28.073  2005       NA
## 5   5     1 2005 26.068  2005       NA
## 6   6     1 2005 24.165  2005       NA

summary(nizkiQ) # let's look at the basic statistics

## Startdate:  2005-01-01     (calendar year)
## Enddate:    2005-12-31     (calendar year)
## 
## Time series covers 1 years.
## The hydrological year is set to start on January 1st.
## 
## Meanflow     MAM7      Q95      BFI 
##  42.5100   9.8630  10.4600   0.4687

# we see some characteristics such as the BFI or MAM7 index and the Q95 flow
# the package contains a hydrograph function that can be used to plot a hydrograph
hydrograph(nizkiQ,startdate = "01/02/2005", enddate = "31/10/2005")

Figure 27: Hydrograph Savinja-Veliko Širje.

# we can also show the results of baseflow separation (red line)
plot(nizkiQ$flow,type="l",xlab="Day in year",ylab="Discharge [m3/s]") 
lines(nizkiQ$baseflow,col="red")

Figure 28: Hydrograph and baseflow discharge.

# or directly using a function in the lfstat package
bfplot(nizkiQ)

Figure 29: Baseflow discharge slightly different.

## NULL

# plot the flow duration curve of our discharge values
# the fdc function outputs all the percentiles of the discahrges
fdc(nizkiQ)

Figure 30: Flow duration curve.

##                100%      99%      98%      97%     96%     95%      94%     93%
## FDC-quantiles 382.1 258.0889 198.5345 176.4337 143.076 133.593 117.7452 101.205
##                    92%     91%     90%    89%     88%      87%     86%     85%
## FDC-quantiles 95.33348 91.5466 89.0422 82.374 81.6566 78.89384 74.1346 72.4836
##                  84%      83%    82%  81%     80%      79%     78%      77%
## FDC-quantiles 67.155 64.38472 60.765 58.5 56.5106 55.81448 55.1224 52.33104
##                    76%    75%    74%    73%    72%     71%   70%    69%   68%
## FDC-quantiles 50.64424 48.992 47.027 47.027 45.191 42.7454 41.27 39.689 38.83
##                    67%    66%     65%    64%      63%      62%    61%    60%
## FDC-quantiles 38.72824 37.982 36.8182 35.515 34.97236 34.46548 33.157 32.395
##                  59%    58%    57%    56%    55%    54%      53%      52%
## FDC-quantiles 31.646 30.908 30.182 29.467 28.764 28.764 28.01868 26.91232
##                    51%    50%     49%    48%      47%    46%    45%    44%
## FDC-quantiles 26.48848 25.423 25.0166 24.165 23.60196 23.553 22.952 22.362
##                  43%    42%    41%    40%    39%    38%    37%    36%   35%
## FDC-quantiles 21.782 21.214 21.214 20.656 20.108 20.108 19.572 19.045 18.53
##                 34%   33%    32%    31%    30%    29%    28%      27%    26%
## FDC-quantiles 18.53 18.53 17.529 17.044 17.044 16.569 16.104 15.77712 15.205
##                  25%   24%    23%    22%    21%    20%      19%    18%    17%
## FDC-quantiles 15.205 14.77 14.344 14.344 13.523 13.523 13.18952 13.126 13.126
##                  16%     15%    14%    13%    12%    11%    10%     9%     8%
## FDC-quantiles 12.739 12.5882 11.994 11.994 11.654 11.654 11.654 11.006 11.006
##                   7%     6%      5%   4%     3%      2%      1%    0%
## FDC-quantiles 11.006 11.006 10.4596 10.4 10.113 9.91356 9.47476 9.309

# calculate the quantile values
# for annual values (yearly) and for monthly values, argument monthly
Qxx(nizkiQ, c(50,90,95,99),monthly=TRUE)

##    month flow.50% flow.10%  flow.5%  flow.1%
## 1      1 16.56900 11.65400 11.33000 10.18700
## 2      2 11.65400  9.30900  9.30900  9.30900
## 3      3 27.39400 11.65400 11.65400 10.38140
## 4      4 43.26200 19.46780 18.53000 18.53000
## 5      5 28.76400 19.04500 18.02950 17.52900
## 6      6 11.99400 10.11300  9.96065  9.64572
## 7      7 38.83000 20.10800 17.78700 17.04400
## 8      8 67.15500 20.65600 18.85000 15.75670
## 9      9 35.93200 21.21400 20.31090 18.83218
## 10    10 37.14700 17.52900 16.80650 16.56900
## 11    11 13.93350 11.00600 10.67270 10.40000
## 12    12 30.18200 20.10800 19.05100 17.48980

The lfstat package also contains a number of other functions for low flow analysis, which can be very useful:

BFI to calculate the BFI index (ratio of base flow to total flow);
MAM to calculate minimum annual flows;
meanflow to calculate the average annual flow;
Q95, Q90, Q70, Qxx to calculate specific quantile flows;
recession to analyse recession constants (decreasing parts of hydrographs);
seasindex to calculate the seasonality index;
seasratio to calculate the seasonal ratio (the ratio of Q95 over two periods, possibly summer and winter).

However, slightly more advanced calculations or analyses⁴² can also be carried out, for example frequency analysis of low flows. Using the data in the lfstat package, since we need more than 1 year of data, we first determine the minimum flows for all years of data, then, given the data, we define a Weibull distribution based on the L moments. From here, it is possible to choose other distribution functions included in the lmom package, for instance, the event parameter determines the chosen return period, and then we can estimate the low flow with a 100-year return period.

data(ngaruroro)  
nletni <- MAM(ngaruroro,n=1,yearly=TRUE)   
head(nletni)

##   hyear   MAn
## 1  1964 3.344
## 2  1965 4.805
## 3  1966 5.019
## 4  1967 4.819
## 5  1968 3.210
## 6  1969 3.832

weibullpor <- tyears(ngaruroro, dist = "wei", event = 100, plot = TRUE)

Figure 31: Results of frequency analyses of low flows.

weibullpor

##              distribution
## return period      wei
##           100 2.494827

# estimated values of Weiboll distribution parameters and sample L-moments
summary(weibullpor)

## Values: num [1:38(1d)] 3.34 4.8 5.02 4.82 3.21 ...
## 
## 
## L-Moments:
##        l_1        l_2        t_3        t_4 
## 4.13686842 0.50741892 0.08370583 0.16018348 
## 
## Fitted Parameters of the Distribution:
## wei    zeta:   2.20731,  beta:   2.17832,  delta:  2.27166

# we can check several distributions
vecpor <- tyears(ngaruroro, dist = c("wei", "pe3"), event = 100, plot = TRUE)

## Warning: For fitting minima, a Weibull distribution with parameter 'zeta = 0'
## may be best.

Figure 32: Using a larger number of distributions.

# let's do some calculations for certain values of return periods 
povratna <- c(2, 5, 10, 25) 
# calculate the values of the quantiles with respect to the return periods
evquantile(weibullpor, return.period = povratna)

##              distribution
## return period      wei
##            2  4.061063
##            5  3.332857
##            10 3.016214
##            25 2.740186

plot(weibullpor) # plot the graph again
# add labels for the selected return period values to the graph
rpline(weibullpor, return.period = povratna)

Figure 33: Additionally label selected values of return periods.

The package can also be used to analyse different aspects of hydrological drought. For example, we search for drought periods according to a selected criterion, in the example below this is the flow Q90 (argument probs) or Q95.

susa <- find_droughts(nizkiQ, threshold = quantile, probs = 0.1, na.rm = TRUE) 
summary(susa) # define 4 major drought periods according to the criterion

## Summary of droughts
## River:  at 
## Units: volumes in m3, duration in days
## 
## Filtered 5 minor events of 9 total. 
## Only droughts with volume >= 7320.672 m3 and duration >= 5 days are reported.
## 
##   event.no      start       time        end    volume duration dbt       vbt
## 1        1 2005-01-28 2005-02-12 2005-02-12 1464134.4       16  16 1464134.4
## 3        3 2005-03-07 2005-03-11 2005-03-11  157075.2        5   5  157075.2
## 5        5 2005-06-18 2005-06-22 2005-06-22  459129.6        5   5  459129.6
## 9        9 2005-11-18 2005-11-25 2005-11-25  496627.2        8   8  496627.2
##     qmin      tqmin
## 1  9.309 2005-02-06
## 3  9.836 2005-03-11
## 5 10.113 2005-06-21
## 9 10.400 2005-11-23

# use the Q95 flow (argument probs)
susa1 <- find_droughts(nizkiQ, threshold = quantile, probs = 0.05, na.rm = TRUE) 
summary(susa1, drop_minor=0) # show all events

## Summary of droughts
## River:  at 
## Units: volumes in m3, duration in days
## 
##   event.no      start       time        end    volume duration dbt       vbt
## 1        1 2005-01-30 2005-01-30 2005-01-30  53879.04        1   1  53879.04
## 2        2 2005-02-05 2005-02-09 2005-02-09 402796.80        5   5 402796.80
## 3        3 2005-02-11 2005-02-11 2005-02-11   5149.44        1   1   5149.44
## 4        4 2005-02-19 2005-02-20 2005-02-20  10298.88        2   2  10298.88
## 5        5 2005-03-11 2005-03-11 2005-03-11  53879.04        1   1  53879.04
## 6        6 2005-06-20 2005-06-22 2005-06-22  65041.92        3   3  65041.92
## 7        7 2005-06-24 2005-06-26 2005-06-26 160859.52        3   3 160859.52
## 8        8 2005-06-29 2005-06-29 2005-06-29  29946.24        1   1  29946.24
## 9        9 2005-11-23 2005-11-24 2005-11-24  10298.88        2   2  10298.88
##     qmin      tqmin
## 1  9.836 2005-01-30
## 2  9.309 2005-02-06
## 3 10.400 2005-02-11
## 4 10.400 2005-02-19
## 5  9.836 2005-03-11
## 6 10.113 2005-06-21
## 7  9.568 2005-06-26
## 8 10.113 2005-06-29
## 9 10.400 2005-11-23

# we can plot an interactive graph of these results using plot(susa1)
# we can use a threshold value that varies according to the period
# say monthly, weekly, daily values
susa2 <- find_droughts(nizkiQ, threshold = vary_threshold(nizkiQ, 
varying = "monthly", fun = function(x) quantile(x, probs = 0.05, na.rm = T))) 
# we check the main characteristics of the droughts according to the above procedure
summary(susa2)

## Summary of droughts
## River:  at 
## Units: volumes in m3, duration in days
## 
## Filtered 10 minor events of 11 total. 
## Only droughts with volume >= 2996.784 m3 and duration >= 5 days are reported.
## 
##   event.no      start       time        end   volume duration dbt      vbt
## 3        3 2005-03-07 2005-03-11 2005-03-11 157075.2        5   5 157075.2
##    qmin      tqmin
## 3 9.836 2005-03-11

# we analyse the dependence between the drought events we have identified 
summary(pool_sp(susa))

## Summary of pooled droughts
## River:  at 
## Pooling: Sequent Peak, 2 were pooled
## Units: volumes in m3, duration in days
## 
## Filtered 4 minor events of 7 total. 
## Only droughts with volume >= 7320.672 m3 and duration >= 5 days are reported.
## 
##   event.no      start       time        end    volume duration dbt       vbt
## 1        1 2005-01-30 2005-02-12 2005-02-12 1464134.4       14  14 1464134.4
## 5        5 2005-06-18 2005-06-29 2005-06-30 1086566.4       12  10 1145318.4
## 7        7 2005-11-19 2005-11-25 2005-11-26  496627.2        7   7  496627.2
##     qmin      tqmin
## 1  9.309 2005-02-06
## 5  9.568 2005-06-26
## 7 10.400 2005-11-23

# graph showing periods when the flow is below the threshold value
streamdefplot(nizkiQ,year=2005,threslevel = 95)

Figure 34: Showing the periods when the flow is below the threshold value.

## NULL

Similar to the previous example, probabilistic analyses could also be done for other aspects of drought, such as the number of days with drought conditions, the maximum deficit, or the maximum volume deficit. Let us consider the case for the maximum duration of a hydrological drought. We use the data in the lfstat package, then we choose a Weibull distribution. We are interested in the duration (variable=“d”, the alternative is volume “v”) and we look for the maximum values (aggr=“max”), then we do the calculations for the selected return periods, and we also take into account the dependent events according to the parameter pooling=“IC”.

maksTrajanje <- tyearsS(ngaruroro, dist = "wei", variable = "d", 
aggr = "max", event = povratna, hyearstart = 1, pooling = "IC")

Figure 35: Probabilistic analysis for the durations of the most severe droughts.

Determining the base flow is important for understanding the dynamics of the basin response, as well as for determining the volumes of high-water waves. There are other packages and functions that can be used to estimate baseflow (besides lfstat), one of which is BaseflowSeparation, included in the EcoHydRology package, which is no longer available in the CRAN repository, but older versions of the packages are still available online⁴³. These can be installed using tar. gz files via the Install option in the Rstudio GUI (instead of the CRAN repository, the option to install using the archive files of the package is selected). The presented function uses one of the automatic baseflow estimation methods⁴⁴.

The baseflow separation is also often necessary in analysing hydrograph volumes and durations⁴⁵,⁴⁶. Let us consider a simple example, where we first find the maximum discharge value of the year, then determine all the days of the year when the baseflow is equal to the surface discharge, i.e. the beginning and the end of the hydrographs, and then calculate the volume of the high-water hydrographs.

nizkiQ <- createlfobj(ts(Qzoo), startdate = "01/01/2005", hyearstart = 1) 
letnM <- which.max(nizkiQ$flow) 
mejneT <- which(nizkiQ$flow==nizkiQ$baseflow)  
# peak flow occurred on day 278
# from the mejneT object we see that the wave started on day 270 and ended on day 309
plot(nizkiQ$flow[270:309],type="l",xlab="Day from start",ylab="Disharge")
lines(nizkiQ$baseflow[270:309],col="red")

Figure 36: Maximum annual peak flow including baseflow.

# sum of all discharge values
vsota <- sum(nizkiQ$flow[270:309]-nizkiQ$baseflow[270:309]) 
vsota*24*3600 # we can also calculate the volume of the high-water wave in m3

## [1] 134212464

# or the duration of the high-water wave in days (because we have daily data)
length(nizkiQ$flow[270:309])

## [1] 40

Task 32: Download daily flow data from the Sava Šentjakob water gauging station for the period 1990–2021 from the website of the Slovenian Environment Agency. Using the lfstat package, analyse the maximum drought events (minimum annual flows) in each year. Perform both frequency analyses of the low flows (Pearson distribution of type 3) and analyses of the duration of these drought events.

4.2 Flood frequency analyses

Peak flow analyses or flood frequency analyses are an important area of hydrological science. Their aim is to estimate the probability of flood events (high water peaks) occurring at a given location. These analyses help in the planning of hydraulic infrastructure, the management of water resources, and the development of flood risk reduction strategies. Data on design flows are one of the most important outputs of hydrological studies and, for example, are also one of the inputs for hydraulic and flood risk analyses. A wide variety of methods are available for performing flood frequency analyses, ranging from the peaks over threshold method(POT) to the annual maximum (AM) method⁴⁷. In the context of this tutorial, we will focus on the use of the AM method, which is very widely used in hydrological practice⁴⁸.

The basic steps for performing (univariate) flood frequency analyses of high-water peaks are:

Selection of the sample for the flood frequency analyses (e.g. the annual maximum (AM) method if a sufficiently long set of measured flows is available, i.e. if from a long set of measured flow data the largest high-water peak in a particular year is included in the sample; if shorter sets of measured flow data are available, it is preferable to use the peaks over threshold method (POT)).
Parameter estimation of several selected distribution functions (e.g. log-normal, Pearson III, log-Pearson III, Gumbel, generalised extreme value (GEV) distribution) based on the method of moments, the method of L-moments, or any other method).
Evaluate the goodness of fit of each distribution to the measured flow data, using either graphical representations or statistical tests. It is certainly advisable to use any additional qualitative and/or quantitative information available to assess the adequacy of the distribution and to interpret the results (especially in the case of longer return periods), and to decide on the adequacy of the distribution and the interpretation of the results on the basis of professional experience.

In addition, the probabilistic analysis process can be complemented by the determination of a design hydrograph according to the typical high-water hydrograph method. In this case, the following two steps are required:

Selection of appropriate representative hydrographs depending on the intended use of the results (e.g. shorter events, longer events, several different combinations).
Taking into account the results of the flood frequency analyses (together with the associated confidence intervals that capture the impact of uncertainty) and the selected representative hydrographs to determine the design hydrographs for different values of return periods.

It is estimated that a reasonable amount of data for flood frequency analyses is at least 15–20 years of continuous flow measurements from the most recent period (due to the more modern measurement methodology and climatic variability), which allows us to select the appropriate sample (AM or POT). The flow data must of course be of adequate quality, in terms of a stable gauge profile, an adequate rating curve, etc. In any case, it should be kept in mind that, in the case of such short data sets, the uncertainty of the estimated design flows increases sharply with increasing return period. The definition and consideration of confidence intervals and caution in determining return periods longer than twice the length of the measured data set are therefore necessary in such cases⁴⁹.

In Slovenia, the water level and discharge are measured by the Slovenian Environment Agency (ARSO), which also publishes peak flow data on its website⁵⁰. Peak flow data are available for most of the gauging stations and are indicated by Qvk, while the maximum flow in a given year is indicated by vQvk.

In our case, we will use data from the Litija water gauging station on the Sava River to perform the flood frequency analyses. The procedure proposed by Meylan et al.⁵¹ will be used to calculate the confidence intervals. The data on the annual maximum peaks for the period 1895–2021 are available at the following link⁵². It should be noted that the method of water level measurements at this gauging station has changed several times during the period 1895–2021. Thus, from 1895 to 1952, observations were made once a day, from 1953 to 2014, measurements were made using a limnigraph, and from 2015 onwards, measurements are made using a pressure probe and a radar gauge. This also results in slightly different data characteristics, as in most cases data based on daily observations underestimate peak flows. To estimate the parameters of the distribution functions, we will first use the lmom and lmomco packages, which contain functions to estimate the parameters based on the method of L-moments⁵³.

# import the annual peak data into R
vQvk <- read.table("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Litija-vQvk.txt",header=TRUE) 
head(vQvk) # check the appearance of the data

##   Leto   vQvk
## 1 1895 1586.6
## 2 1896 1993.2
## 3 1897  673.0
## 4 1898 1289.0
## 5 1899  973.8
## 6 1900 1213.0

str(vQvk) # check data structure

## 'data.frame':    127 obs. of  2 variables:
##  $ Leto: int  1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 ...
##  $ vQvk: num  1587 1993 673 1289 974 ...

summary(vQvk) # calculate the basic statistics of our data

##       Leto           vQvk       
##  Min.   :1895   Min.   : 434.0  
##  1st Qu.:1926   1st Qu.: 945.5  
##  Median :1958   Median :1199.0  
##  Mean   :1958   Mean   :1228.7  
##  3rd Qu.:1990   3rd Qu.:1507.0  
##  Max.   :2021   Max.   :2326.0

# draw a line graph
plot(vQvk,xlab="Year",ylab="vQvk [m3/s]",main="Sava-Litija",type="b")

Figure 37: Maximum annual peaks.

library(lmom, quietly=TRUE)
library(lmomco, quietly=TRUE) # activate both packages

## 
## Attaching package: 'lmomco'

## The following objects are masked from 'package:lmom':
## 
##     cdfexp, cdfgam, cdfgev, cdfglo, cdfgno, cdfgpa, cdfgum, cdfkap,
##     cdfln3, cdfnor, cdfpe3, cdfwak, cdfwei, quaexp, quagam, quagev,
##     quaglo, quagno, quagpa, quagum, quakap, qualn3, quanor, quape3,
##     quawak, quawei

# calculate the L-moments based on the 1953–2021 sample
lmomenti <- lmoms(vQvk[60:127,2]) 
lmomenti$lambdas # L-moments

## [1] 1233.514412  214.272722   27.838481   19.071841    5.891533

# estimate the parameters of the chosen distribution function
pe3par <- lmom2par(lmomenti,type="pe3")   
pe3par # check calculations

## $type
## [1] "pe3"
## 
## $para
##           mu        sigma        gamma 
## 1233.5144118  387.2970013    0.7919035 
## 
## $source
## [1] "parpe3"

# the GEV distribution (generalised extreme value distribution)
gevpar <- lmom2par(lmomenti,type="gev") 
# calculate design flows for Fx using Pearson's 3 distributions
quape3(c(0.9,0.99),pe3par)

## [1] 1751.005 2351.094

# using GEV distribution
quagev(c(0.9,0.99),gevpar)

## [1] 1749.472 2368.118

# make calculations for different values of return periods
Pdobe <- c(2,5,10,20,30,50,100,300,500,1000) 
verj <- 1-1/Pdobe # values of the Fx distribution function
rezpe3 <- quape3(verj,pe3par) # Pearson type 3 distribution
rezgev <- quagev(verj,gevpar) # GEV distrubiton
# check results for return periods of 100, 300, 500, and 1000 years
rezpe3[7:10]

## [1] 2351.094 2606.243 2720.772 2872.840

rezgev[7:10] # using the GEV distribution

## [1] 2368.118 2628.055 2742.690 2892.341

# we will also show the measured data using the Weibull distribution 
# the function pp also has a parameter a, which allows for the use of different equations
weibull <- pp(vQvk[60:127,2],a=0) 
Pdobe1 <- 1/(1-weibull) # calculate the corresponding return period values
# plot the results using a Pearson distribution of type 3
plot(Pdobe,rezpe3,type="l",log="xy",xlab="Return period [years]",
ylab = "Discharge [m3/s]", main = "Sava-Litija",ylim=c(1000,3000)) 
# also add measured flow data to the graph
points(Pdobe1,sort(vQvk[60:127,2])) 
# add the results to the graph according to the GEV distribution
lines(Pdobe,rezgev,col="red")

Figure 38: Results of flood frequency analyses of high-water peaks.

When performing flood frequency analyses, it is also useful to evaluate the associated confidence intervals. Confidence intervals will be generated according to the procedure proposed by Meylan et al.⁵⁴, where in each iteration a new sample of peak annual flows will be generated based on the estimated parameters of a selected distribution function (in this case Pearson type 3). The parameters will be estimated again based on the sample, and then the design discharge values will be calculated based on the parameters for the selected return periods. We will select 10% empirical confidence intervals and set lower and upper confidence limits for the individual return period values. When generating data, it is reasonable to use a large number of replicates, e.g. 10 000.

Nrep <- 100 # number of repetitions of random parameter generation
# empty matrix to store results
rezultati <- matrix(numeric(0), Nrep, length(verj)) 
for(i in 1:Nrep){ 
sample_dummy <- rlmomco(length(vQvk[60:127,2]),
lmom2par(lmoms(vQvk[60:127,2]),type="pe3"))  
parPE3_dummy <- lmom2par(lmoms(sample_dummy),type="pe3") 
rezultati[i,] <- quape3(verj, parPE3_dummy)
} 
alpha <- 0.1  
# values of confidence intervals for the lower limit for the 2-year return period
rez2l <- sort(rezultati[,1])[alpha*100/2] 
# similar for the upper limit
rez2u <- sort(rezultati[,1])[Nrep-alpha*100/2] 
# and for other values of the return periods
rez5l <- sort(rezultati[,2])[alpha*100/2] 
rez5u <- sort(rezultati[,2])[Nrep-alpha*100/2]
rez10l <- sort(rezultati[,3])[alpha*100/2]
rez10u <- sort(rezultati[,3])[Nrep-alpha*100/2]
rez20l <- sort(rezultati[,4])[alpha*100/2]
rez20u <- sort(rezultati[,4])[Nrep-alpha*100/2]
rez30l <- sort(rezultati[,5])[alpha*100/2]
rez30u <- sort(rezultati[,5])[Nrep-alpha*100/2]
rez50l <- sort(rezultati[,6])[alpha*100/2]
rez50u <- sort(rezultati[,6])[Nrep-alpha*100/2]
rez100l <- sort(rezultati[,7])[alpha*100/2]
rez100u <- sort(rezultati[,7])[Nrep-alpha*100/2]
rez300l <- sort(rezultati[,8])[alpha*100/2]
rez300u <- sort(rezultati[,8])[Nrep-alpha*100/2]
rez500l <- sort(rezultati[,9])[alpha*100/2]
rez500u <- sort(rezultati[,9])[Nrep-alpha*100/2]
rez1000l <- sort(rezultati[,10])[alpha*100/2]
rez1000u <- sort(rezultati[,10])[Nrep-alpha*100/2]
# combine the data into a data frame
genrez <- data.frame(Povratne=Pdobe,spodnja=c(rez2l,rez5l,rez10l,rez20l,
  rez30l,rez50l,rez100l,rez300l,rez500l,rez1000l),
zgornja=c(rez2u,rez5u,rez10u,rez20u,rez30u,rez50u,rez100u,rez300u,
  rez500u,rez1000u))
# check the design discharge with a return period of 100 years
rezpe3[7]

## [1] 2351.094

rez100l # we look at the value of the confidence interval

## [1] 2032.378

rez100u

## [1] 2660.641

# plot
plot(Pdobe,rezpe3,type="l",log="xy",xlab="Return period [years]",
ylab="Discharge [m3/s]",main="Sava-Litija",ylim=c(1000,3000)) 
# also add measured flow data to the graph
points(Pdobe1,sort(vQvk[60:127,2])) 
# plot the empirical confidence intervals
lines(genrez[,1],genrez[,2],col="brown",lty=2) 
lines(genrez[,1],genrez[,3],col="brown",lty=2)

Figure 39: Results of flood frequency analyses with confidence intervals.

It is clear that, despite the relatively long dataset, there is still a relatively large amount of uncertainty in the estimated design discharges due to the flood frequency analysis procedure alone (e.g. no uncertainty in the flow measurements is taken into account). In addition to the method of L-moments, other methods can also be used to estimate the parameters, such as the method of moments or the maximum likelihood method[^54]. The procedure using the method of moments (and the Gumbel distribution function) is shown below, followed by the maximum likelihood method (using the extRemes package). This package also allows for the calculation of some goodness of fit criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which can be used to select the most appropriate distribution function. Other distributions can also be selected within the package, such as GEV, Generalized Pareto distribution (type=“GP”), or exponential. The package also allows for the plotting of a number of graphs from which the suitability of a particular distribution function can be determined, in case you want to test the use of a different distribution function. [^54]: https://doi.org/10.1080/02626667.2013.831174.

# first, an example of the application of the method of moments is shown
povp <- mean(vQvk[60:127,2]) # calculate mean
standev <- sd(vQvk[60:127,2]) # standard deviation
sigma <- (sqrt(6)/pi)*standev # scale parameter of Gumbel distrubtion
mu <- povp - 0.5772*sigma # location parameter
# function which can be used to calculate the distribution function
pGumbel <- function (pret, mu, sigma) { 
  y <- (pret - mu)/sigma   
  p <- exp(-exp(-y)) 
  return(p)
}
# distrubution function calculation for selected discharge values, from 500 to 2500 m3/s (step 100)
Fxgum <- pGumbel(seq(500,2500,100),mu,sigma) 
plot(Pdobe1, sort(vQvk[60:127,2]), log="xy", xlab="Return period [years]", 
     ylab=" Q (m3/s)") 
lines(1/(1-Fxgum),seq(500,2500,100),col="red",lwd=3) 
legend("bottomright",legend=c("Empirical","Gumbel"), lty=c(NA,1),lwd=c(1,3),
col=c("black","red"),pch=c(21,NA))

Figure 40: Flood frequency analysis with Gumbel distribution (method of moments).

# calculate the design discharge with a 100-year return period
mu-sigma*log(-log(1-1/100))

## [1] 2420.529

# compare the estimate with the calculation based on the method of moments L
quagum(1-1/100,lmom2par(lmomenti,type="gum"))

## [1] 2477.125

library(extRemes, quietly=TRUE)

## Warning: package 'extRemes' was built under R version 4.1.3

## 
## Attaching package: 'extRemes'

## The following objects are masked from 'package:stats':
## 
##     qqnorm, qqplot

# estimate the parameters of the Gumbel distribution using the maximum likelihood method
Gumbelmle <- fevd(vQvk[60:127,2],type="Gumbel", method="MLE")  
summary(Gumbelmle) # AIC in BIC

## 
## fevd(x = vQvk[60:127, 2], type = "Gumbel", method = "MLE")
## 
## [1] "Estimation Method used: MLE"
## 
## 
##  Negative Log-Likelihood Value:  496.4416 
## 
## 
##  Estimated parameters:
##  location     scale 
## 1057.7568  308.0549 
## 
##  Standard Error Estimates:
## location    scale 
## 39.38974 29.75105 
## 
##  Estimated parameter covariance matrix.
##           location    scale
## location 1551.5516 371.2417
## scale     371.2417 885.1248
## 
##  AIC = 996.8832 
## 
##  BIC = 1001.322

Gumbelmle$results$par # parameters of the Gumbel distribution function

##  location     scale 
## 1057.7568  308.0549

# diagnostic graphs
plot(Gumbelmle, main="Gumbel-MLE")

Figure 41: Diagnostic graphs in extRemes package.

# individual graphs using the additional argument type, 
# choose from probprob, qq, density, etc.  
plot(Gumbelmle, type="rl")

Figure 42: One of the diagnostic graphs in extRemes package.

# design discahrge calculation
Gumbelmle$results$par[1]-Gumbelmle$results$par[2]*log(-log(1-1/100))

## location 
## 2474.855

However, to choose the most appropriate distribution function, it is necessary to perform various statistical tests or calculate goodness of fit criteria. A large number of tests are available in various packages to check the appropriateness of a particular distribution function. We have already seen the AIC and BIC criteria above. One of these is the Kolmogorov-Smirnov test, which checks whether the estimated flows from a chosen distribution fit the measured data well enough. If the p-value is greater than 0.05, this means that the null hypothesis (distribution is the same) cannot be rejected at the chosen level of significance. It is worth mentioning the hydroGOF package, which includes many functions for calculating goodness of fit criteria such as RMSE, MAE, etc.

# likelihood test for two distributions
lr.test(fevd(vQvk[60:127,2],type="Gumbel", method="MLE"), 
    fevd(vQvk[60:127,2],type="GEV", method="MLE"))

## 
##  Likelihood-ratio Test
## 
## data:  vQvk[60:127, 2]vQvk[60:127, 2]
## Likelihood-ratio = 0.14202, chi-square critical value = 3.8415, alpha =
## 0.0500, Degrees of Freedom = 1.0000, p-value = 0.7063
## alternative hypothesis: greater

# Kolmogorov-Smirnov test
ks.test(sort(vQvk[60:127,2]),quape3(weibull,pe3par))

## Warning in ks.test(sort(vQvk[60:127, 2]), quape3(weibull, pe3par)): cannot
## compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  sort(vQvk[60:127, 2]) and quape3(weibull, pe3par)
## D = 0.058824, p-value = 0.9998
## alternative hypothesis: two-sided

ks.test(sort(vQvk[60:127,2]),quagev(weibull,gevpar))

## Warning in ks.test(sort(vQvk[60:127, 2]), quagev(weibull, gevpar)): cannot
## compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  sort(vQvk[60:127, 2]) and quagev(weibull, gevpar)
## D = 0.058824, p-value = 0.9998
## alternative hypothesis: two-sided

library(hydroGOF, quietly=TRUE)

## Warning: package 'hydroGOF' was built under R version 4.1.3

# RMSE for different distributions
rmse(quagev(weibull,gevpar),sort(vQvk[60:127,2]))

## [1] 34.38188

rmse(quape3(weibull,pe3par),sort(vQvk[60:127,2]))

## [1] 32.59287

# ME for different distributions
me(quagev(weibull,gevpar),sort(vQvk[60:127,2]))

## [1] -5.015486

me(quape3(weibull,pe3par),sort(vQvk[60:127,2]))

## [1] -5.177761

# MAE for different distributions
mae(quagev(weibull,gevpar),sort(vQvk[60:127,2]))

## [1] 27.97724

mae(quape3(weibull,pe3par),sort(vQvk[60:127,2]))

## [1] 26.5385

Task 33: Download from the website of the Slovenian Environment Agency the annual peak flow data from the Sava Šentjakob water gauging station for the period 1990–2021 (as an alternative, you can use the data from Task 32). Using the packages lmom* and lmomco, analyse the peak flows. Perform flood frequency analyses of peak flows (GEV and Gumbel distribution).*

4.3 Trend analyses

Climate change affects hydrological processes by causing changes in the water cycle⁵⁵. Climate change (or variability) can be manifested in altered precipitation patterns, temperature changes, changes in the frequency and intensity of extreme hydrological events, and other changes in hydrological processes. As a consequence, trend analysis and detecting changes in data form a frequent part of water management and hydrology⁵⁶. Trend analysis in hydrology involves analysing data to identify changes in variables such as precipitation, discharge, air temperature, or groundwater level. Such analyses are key to understanding the hydrological cycle and identifying changes, which is the information we need to manage water resources, ensure flood safety, etc. A wide variety of methods are available to determine whether there is a positive or negative trend in our data and whether this trend is statistically significant. Many of the commonly used methods are included in the trend package⁵⁷. Often used for analyses is the Mann-Kendall statistical test, which is a non-parametric test where the null hypothesis is that there is no trend in the data, and the alternative hypothesis is that there is a trend in the data⁵⁸. An alternative to the Mann-Kendall test is, for example, the Sen’s slope test⁵⁹. In the following example, we will show some examples of the use of these statistical tests.

# plot line graph 
plot(vQvk[,1],vQvk[,2],type="l",xlab="Year", ylab="Discharge [m3/s]",
    main="Litija-Sava") 
# add a linear trend line
abline(lm(vQvk[,2] ~ vQvk[,1]),lty=2,col="red")

Figure 43: Hydrograph with the trend indication.

# checking the slope coefficient of the linear function, we see that it is negative 
# vQvk should decrease by about 40 m3/s per 100 years according to the linear function
coef(lm(vQvk[,2] ~ vQvk[,1]))

##  (Intercept)    vQvk[, 1] 
## 2031.2582216   -0.4099017

# for trend analyses we will use the trend package
library(trend, quietly=TRUE) 
# perform trend analysis using the Mann-Kendall statistical test
# we see that there is a negative trend in the data (z value)
# but it is not statistically significant (p-value is higher than the significance level, e.g. 0.05)
mk.test(vQvk[,2])

## 
##  Mann-Kendall trend test
## 
## data:  vQvk[, 2]
## z = -0.53352, n = 127, p-value = 0.5937
## alternative hypothesis: true S is not equal to 0
## sample estimates:
##             S          varS           tau 
## -2.570000e+02  2.302423e+05 -3.213706e-02

# in case we are interested in the trend for only one part of the data
mk.test(vQvk[1:65,2])

## 
##  Mann-Kendall trend test
## 
## data:  vQvk[1:65, 2]
## z = -2.2873, n = 65, p-value = 0.02218
## alternative hypothesis: true S is not equal to 0
## sample estimates:
##             S          varS           tau 
##  -405.0000000 31197.0000000    -0.1948521

# we also use Sen's slope test 
sens.slope(vQvk[,2])

## 
##  Sen's slope
## 
## data:  vQvk[, 2]
## z = -0.53352, n = 127, p-value = 0.5937
## alternative hypothesis: true z is not equal to 0
## 95 percent confidence interval:
##  -2.5250  1.5625
## sample estimates:
## Sen's slope 
##  -0.6064634

# for only part of the data
sens.slope(vQvk[1:65,2])

## 
##  Sen's slope
## 
## data:  vQvk[1:65, 2]
## z = -2.2873, n = 65, p-value = 0.02218
## alternative hypothesis: true z is not equal to 0
## 95 percent confidence interval:
##  -12.414286  -1.208333
## sample estimates:
## Sen's slope 
##   -6.717045

The trend package also includes a number of variants of the Mann-Kendall test, such as the seasonal test or the multivariate test (also for Sen’s slope test). The package additionally includes a number of tests for analysing abrupt changes in data, such as the Pettitt test, which has been used, among other things, to detect changes in hydrological data of karst springs⁶⁰. First, an analysis of the abrupt change point (breakpoint) in the data will be presented by applying the Pettitt test on the actual data and on the generated data. Then some other functions from the trend package will be shown.

pettitt.test(vQvk[,2])

## 
##  Pettitt's test for single change-point detection
## 
## data:  vQvk[, 2]
## U* = 718, p-value = 0.447
## alternative hypothesis: two.sided
## sample estimates:
## probable change point at time K 
##                              46

# data # 46, the result is statically insignificant (significance level 0.05)
# in case we generate random data based on a normal distribution 
# (10 values with mean 0 and 10 values with mean 20) 
pettitt.test(c(rnorm(n=10,mean=0),rnorm(n=10,mean=20)))

## 
##  Pettitt's test for single change-point detection
## 
## data:  c(rnorm(n = 10, mean = 0), rnorm(n = 10, mean = 20))
## U* = 100, p-value = 0.001581
## alternative hypothesis: two.sided
## sample estimates:
## probable change point at time K 
##                              10

# test detects a statistically significant breakpoint between two sets of data
# similar result as above
# similar analyses can also be done using the br.test and bu.test functions
lanzante.test(vQvk[,2])

## 
##  Lanzante's procedure for single change-point detection
##  with Wilcoxon-Mann-Whitney Test
## 
## data:  vQvk[, 2]
## W = 2222, p-value = 0.07213
## alternative hypothesis: two.sided
## sample estimates:
## probable change point at time K 
##                              46

plot(vQvk[,1],vQvk[,2],type="l",xlab="Year", ylab="Discharge [m3/s]",
    main="Litija-Sava") 
# in case we want to draw a trend line on only one part of the data
clip(1800,1940,200,3000) 
abline(lm(vQvk[1:46,2] ~ vQvk[1:46,1]),lty=2,col="red") # first part
clip(1940,2022,200,3000) # second part
abline(lm(vQvk[47:127,2] ~ vQvk[47:127,1]),lty=2,col="blue")  
do.call("clip", as.list(par("usr"))) # simplify the drawing area
# add a breakpoint to the graph
abline(v=vQvk[pettitt.test(vQvk[,2])$estimate,1],col="green",lty=1)

Figure 44: Line graph and breakpoint in the data.

There are also slightly more sophisticated trend detection methods, such as those included in the trendchange package⁶¹. Additionally, it is possible to perform a regional Mann-Kendall test as demonstrated in the case of rainfall station data in Slovenia⁶². More information on the regional Mann-Kendall test can be found in the scientific paper⁶³ and in the manual of the rkt package⁶⁴.

library(trendchange, quietly=TRUE) # activate the trendchange package

## Warning: package 'trendchange' was built under R version 4.1.3

## 
## Attaching package: 'trendchange'

## The following object is masked _by_ '.GlobalEnv':
## 
##     x

# graphical representation of the detection of breakpoints in the data
dfcusum(vQvk[,2],startyear = 1895)

Figure 45: Trend detection using CUMSUM.

## $`CUMSUM Values`
##   [1]  1  2  1  2  1  2  3  4  5  6  7  6  7  6  7  6  5  4  3  2  3  4  3  2  3
##  [26]  2  2  1  2  1  0  1  2  1  0  1  2  1  2  3  4  5  6  7  6  7  6  5  6  5
##  [51]  4  3  4  5  6  5  4  5  4  5  4  3  2  1  0 -1  0  1  0  1  2  3  4  5  6
##  [76]  5  4  5  6  7  8  7  6  5  6  7  6  7  6  5  6  5  4  3  2  3  2  3  4  3
## [101]  2  3  2  3  2  3  2  1  0  1  0 -1  0 -1  0  1  0  1  0  1  0 -1  0  1  2
## [126]  1  0
## 
## $`Maximum CUMSUM value`
## [1] 8
## 
## $`Critical value at 90% CI`
## [1] 13.7487
## 
## $`Critical value at 95% CI`
## [1] 15.32642
## 
## $`Critical value at 99% CI`
## [1] 18.36917

# breakpoints are shown as minimums on the graph
# this is clearly visible in the case where we use two data subsets 
dfcusum(c(rnorm(n=10,mean=0),rnorm(n=10,mean=20)),startyear = 1895)

Figure 46: Trend detection using CUMSUM (Example 2).

## $`CUMSUM Values`
##  [1]  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1
## [20]   0
## 
## $`Maximum CUMSUM value`
## [1] 10
## 
## $`Critical value at 90% CI`
## [1] 5.456006
## 
## $`Critical value at 95% CI`
## [1] 6.082105
## 
## $`Critical value at 99% CI`
## [1] 7.289582

# plus a graphical representation of trend detection
innovtrend(vQvk[,2],ci=95)

Figure 47: Innovative way to detect trends.

##                        Trend Slope                    Trend Indicator 
##                         0.89026622                         0.47028305 
##           Slope Standard deviation                        Correlation 
##                         0.09467647                         0.98515248 
## Lower Cofidence Limit at 90percent Upper Cofidence Limit at 90percent 
##                        -0.15574280                         0.15574280 
## Lower Cofidence Limit at 95percent Upper Cofidence Limit at 95percent 
##                        -0.18556589                         0.18556589 
## Lower Cofidence Limit at 99percent Upper Cofidence Limit at 99percent 
##                        -0.24388660                         0.24388660

# if there were a clear trend in the data, the points would lie 
# either above the confidence intervals (positive) or below the confidence intervals (negative)
# we generate a sequence of numbers from 1:127, which represents a positive trend
# in the same way, we would check the result in the case of a decreasing series (127:1)
innovtrend(1:127,ci=95)

Figure 48: Innovative way to detect trends (Example 2).

##                        Trend Slope                    Trend Indicator 
##                           0.992126                          19.687500 
##           Slope Standard deviation                        Correlation 
##                           0.000000                           1.000000 
## Lower Cofidence Limit at 90percent Upper Cofidence Limit at 90percent 
##                           0.000000                           0.000000 
## Lower Cofidence Limit at 95percent Upper Cofidence Limit at 95percent 
##                           0.000000                           0.000000 
## Lower Cofidence Limit at 99percent Upper Cofidence Limit at 99percent 
##                           0.000000                           0.000000

Let’s demonstrate an example of the use of the regional Mann-Kendall test. The regional Mann-Kendall test is performed in the case where, in addition to the Litija station, there is another downstream station where the vQvk data are 200 m3/s higher than the Litija station (defined by the y argument), in which case we have 2 sets of annual data (the date argument) and the block argument defines which data belong to which station. We can see that, in this case, the regional Mann-Kendall test is not statistically significant (with a chosen confidence level of 0.05), so there is no trend in the data. An example of generated data with a trend (1:127 and 1001:1127) will be shown in addition, but the results are different. A larger number of stations can be included in the rkt function.

library(rkt, quietly=TRUE)  
rkt(date=rep(1895:2021,2),y=c(vQvk[,2],vQvk[,2]+200),
block=c(rep(1,length(vQvk[,2])),rep(2,length(vQvk[,2]))))

## 
## Standard model
## Tau = -0.03212098
## Score =  -514
## var(Score) =  460484.7
## 2-sided p-value =  0.4496617
## Theil-Sen's (MK) or seasonal/regional Kendall (SKT/RKT) slope=  -0.6064634

rkt(date=rep(1895:2021,2),y=c(1:127,1001:1127),
block=c(rep(1,length(vQvk[,2])),rep(2,length(vQvk[,2]))))

## 
## Standard model
## Tau = 1
## Score =  16002
## var(Score) =  460502
## 2-sided p-value =  0
## Theil-Sen's (MK) or seasonal/regional Kendall (SKT/RKT) slope=  1

Task 34: Download from the website of the Slovenian Environment Agency the data about annual peaks from the water gauging stations on the Sava River: Radovljica I, Šentjakob, Litija (and Litija I), Hrastnik, and Čatež I for the period 2000–2021. Perform trend analyses for individual water gauging stations (choose any test), as well as a regional Mann-Kendall test.

4.4 Non-stationary flood frequency analyses

Non-stationary flood frequency analyses can be performed using the extRemes package, which has already been used in the chapter on (conventional) flood frequency analyses of peak flows. In this case, we incorporate the time dependence of the selected variables (e.g. precipitation) into the parameters of the selected theoretical distribution functions⁶⁵. Various types of non-stationary models can be defined, either as a function of time or as a function of external variables such as precipitation⁶⁶. In the previous examples, we found that a part of our data have a statistically significant negative trend, so we will only use this part of the data. We will use the data to estimate the parameters of the non-stationary model by considering the maximum likelihood method and using a generalised extreme value (GEV) distribution, where the location parameter of the GEV distribution will be a function of time. Additionally, we will check the statistics of the tested model, which could also be compared with, for example, a stationary model (e.g. based on the BIC and AIC criteria), and the function plot(model1) allows us to plot a graphical fit of the selected model to the annual peak data.

library(extRemes, quietly=TRUE)  
vQvk1 <- vQvk[1:65,] # use only part of the data
cas <- 1:length(vQvk1[,2]) # define a time vector
# define the model
model1 <- fevd(x=vQvk,data=vQvk1,method="MLE",type="GEV",location.fun=~cas) 
summary(model1) # basic features

## 
## fevd(x = vQvk, data = vQvk1, location.fun = ~cas, type = "GEV", 
##     method = "MLE")
## 
## [1] "Estimation Method used: MLE"
## 
## 
##  Negative Log-Likelihood Value:  477.1598 
## 
## 
##  Estimated parameters:
##          mu0          mu1        scale        shape 
## 1292.9295226   -6.9322010  370.3522898   -0.2875139 
## 
##  Standard Error Estimates:
##         mu0         mu1       scale       shape 
## 102.1463941   2.5010305  39.8920436   0.1172437 
## 
##  Estimated parameter covariance matrix.
##                mu0           mu1       scale       shape
## mu0   10433.885836 -218.18402981  623.374966 -5.05417391
## mu1    -218.184030    6.25515372  -10.331939  0.06341077
## scale   623.374966  -10.33193937 1591.375143 -3.06325528
## shape    -5.054174    0.06341077   -3.063255  0.01374609
## 
##  AIC = 962.3196 
## 
##  BIC = 971.0172

# plot results for different values of return periods
plot(model1, type="rl",rperiods=c(10,100,500))

Figure 49: Results of non-stationary flood frequency analyses.

# values of design discharges decrease as a function of time
return.level(model1,return.period=100) # selected return period value

## fevd(x = vQvk, data = vQvk1, location.fun = ~cas, type = "GEV", 
##     method = "MLE")
## get(paste("return.level.fevd.", newcl, sep = ""))(x = x, return.period = return.period)
## 
##  GEV model fitted to  vQvk vQvk1 
## Data are assumed to be  non-stationary 
## [1] "Covariate data = vQvk1"
## [1] "Return Levels for period units in years"
##       100-year level
##  [1,]       2230.911
##  [2,]       2223.978
##  [3,]       2217.046
##  [4,]       2210.114
##  [5,]       2203.182
##  [6,]       2196.250
##  [7,]       2189.317
##  [8,]       2182.385
##  [9,]       2175.453
## [10,]       2168.521
## [11,]       2161.589
## [12,]       2154.656
## [13,]       2147.724
## [14,]       2140.792
## [15,]       2133.860
## [16,]       2126.928
## [17,]       2119.995
## [18,]       2113.063
## [19,]       2106.131
## [20,]       2099.199
## [21,]       2092.267
## [22,]       2085.334
## [23,]       2078.402
## [24,]       2071.470
## [25,]       2064.538
## [26,]       2057.606
## [27,]       2050.673
## [28,]       2043.741
## [29,]       2036.809
## [30,]       2029.877
## [31,]       2022.945
## [32,]       2016.012
## [33,]       2009.080
## [34,]       2002.148
## [35,]       1995.216
## [36,]       1988.284
## [37,]       1981.351
## [38,]       1974.419
## [39,]       1967.487
## [40,]       1960.555
## [41,]       1953.622
## [42,]       1946.690
## [43,]       1939.758
## [44,]       1932.826
## [45,]       1925.894
## [46,]       1918.961
## [47,]       1912.029
## [48,]       1905.097
## [49,]       1898.165
## [50,]       1891.233
## [51,]       1884.300
## [52,]       1877.368
## [53,]       1870.436
## [54,]       1863.504
## [55,]       1856.572
## [56,]       1849.639
## [57,]       1842.707
## [58,]       1835.775
## [59,]       1828.843
## [60,]       1821.911
## [61,]       1814.978
## [62,]       1808.046
## [63,]       1801.114
## [64,]       1794.182
## [65,]       1787.250

return.level(model1,return.period=100,qcov=make.qcov(model1, 
  vals = list(mu1 = (65+50))),do.ci=TRUE)

## fevd(x = vQvk, data = vQvk1, location.fun = ~cas, type = "GEV", 
##     method = "MLE")
## 
## [1] "Normal Approx."
## 
##      95% lower CI Estimate 95% upper CI Standard Error
## [1,]     916.5903  1440.64     1964.689        267.377

In addition, we can also calculate an estimate of the design discharge with a 100-year return period (argument return.period) by taking into account confidence intervals (do.ci argument) for our model (model1) 50 years after the end of our data set (argument qcov, where for the time-dependent parameter (location) we specify that we are interested in the state 50 years after the end of our data). We can see that the results are quite different from the actual state that can be seen in the second part of our data (i.e. the part that we did not use to specify the non-stationary model). When working with non-stationary models, care is therefore needed in interpreting the results. In a similar way to the time example, the location parameter could be defined as a function of other variables such as precipitation; in this case, we would also need precipitation data (for each year), which we would use to define the location.fun. We can then assess how an increase or decrease in annual rainfall (or monthly rainfall, or any other amount) affects the design discharges with a given return period, and the individual models can then also be compared with each other using the lr.test function included in the extRemes package.

4.5 Multivariate flood frequency analyses

As an alternative to the commonly used univariate flood frequency analyses of high-discharge peaks, multivariate flood frequency analyses can also be used, where additional variables (e.g. volume and duration of the high-water hydrograph) are included in the analyses. Copula functions⁶⁷ are one of the methods that can be used to carry out multivariate flood frequency analyses⁶⁸. As hydrological processes are often multi-dimensional or defined by several variables, more than one variable should be considered in the analyses. Copulas allow the construction of a multivariate model, where two or more dependent variables are considered simultaneously. A multivariate d-dimensional distribution can be written as a combination of a copula function C and marginal distribution functions. The main advantage of copula functions over conventional multivariate distribution functions (e.g. multivariate normal distribution) is the separation of the marginal distributions and the copula functions. Different distribution functions can be chosen as the marginal distributions of the different variables. Thus, in flood frequency analyses, volumes and durations of hydrographs or other variables can be considered simultaneously along with peak flows. Additional variables can be described by different parametric and non-parametric functions. The individual variables are then combined using the copula function. This concept can be applied to any multivariate problem where there are a number (more than 1) of more or less dependent variables⁶⁹. Thus, copulas have been used to perform multivariate flood frequency analyses of hydrographs, frequency analysis of rainfall, drought analysis, and geostatistical interpolation as an alternative to conventional kriging, as well as to test the adequacy of an overflow structure at a dam, and for many other hydrological problems⁷⁰.

In the example below, we will demonstrate the procedure for performing mutlivariate flood frequency analyses of high-water waves. As a starting point, we will use the data included in the lfstat package. The data are from the Ngaruroro River Basin (New Zealand), with a catchment area of 164 square kilometres. The first step in such analyses is to determine the sample; in our case we will use data on peak flows (annual maximum), hydrograph volumes, and hydrograph duration. To calculate the hydrograph volumes durations, we will also use the baseflow estimation data (based on the functions in the lfstat package), assuming that the volume and duration are defined by the point at which the surface part of the flow is equal to the baseflow. The code shows how the peaks, volumes, and durations of hydrographs can be calculated. In the following step, we will first extract our sample, i.e. the peak flow, the hydrograph volume, and the duration of the hydrograph. To calculate the volume of the surface part of the hydrograph, we will first extract the baseflow and then, given the points (one before the peak and one after the peak) where the surfaceflow and baseflow are equal, determine the start and duration of each hydrograph.

library(lfstat, quietly=TRUE)
data(ngaruroro) # use the flow data included in the lfstat package
# transform the flow data into zoo format
pret <- zoo(ngaruroro$flow,as.Date(paste0(ngaruroro$year,"-",ngaruroro$month,"-",ngaruroro$day))) 
# do not use the first year of data as the data are incomplete
pret <- pret[104:length(pret)] 
# similarly, in this case we delete the first part of the data
ngaruroro <- ngaruroro[-(1:103),] 
# determine the annual data from the daily data
leto1 <- as.numeric(floor(as.yearmon(time(pret)))) 
# check on which day of the year our peak flow occurs (for all years)
letQkat <- aggregate(pret,leto1,which.max) 
# peak flow, some years have missing values
# therefore the argument na.rm=TRUE must be used
letQkon <- aggregate(pret,leto1,max,na.rm=TRUE)
# is used to calculate the actual order of peak flows
letoQkat1 <- (cumsum(aggregate(pret,leto1,length))-365)+letQkat 
# prepare an empty vector to store the data
volumni <- rep(NA,length(letoQkat1)) 
# same for hydrograph durations
trajanje <- rep(NA,length(letoQkat1)) 
# for each of the largest annual events, calculate the hydrograph volume and duration
for(i in 1:length(letoQkat1)){ 
# find the start of the hydrograph, this is the point, 
# when the volume of surface runoff equals the volume of baseflow
# the search in this case takes place 40 days before the occurrence of the peak flow
# for large watercourses, it is reasonable to extend this time
zac <- last(which((ngaruroro$flow-ngaruroro$baseflow)[(letoQkat1[i]-40):(letoQkat1[i])]==0)) 
# find the location of the starting point given all the flow data
zac <- letoQkat1[i]-41+zac 
# find the end of a hydrograph wave in a similar way
# search over a period of 60 days
kon <- first(which((ngaruroro$flow-ngaruroro$baseflow)[(letoQkat1[i]):(letoQkat1[i]+60)]==0)) 
kon <- letoQkat1[i]+kon-1 
# we can also plot all hydrographs
# plot(ngaruroro$flow[(zac):(kon)],type="l",main=paste("Event",i))
# lines(ngaruroro$baseflow[(zac):(kon)],col="red") 
# calculate the volume (surface, excluding base flow) of the hydrograph
# the flow data are in m3/s, the result is the volume in m3
volumni[i] <- sum((ngaruroro$flow-ngaruroro$baseflow)[zac:kon])*24*3600 
# the duration of hydrograph
trajanje[i] <- length((ngaruroro$flow-ngaruroro$baseflow)[zac:kon])   
}
# combine data
vzorec <- cbind(letQkon,volumni/10^6,trajanje) 
vzorec <- vzorec[-c(15,20,25),] 
# the graphical overview shows that we have missing data
# delete this data (events marked NA)

Once we have a sample, we can start analysing. The CRAN repository has a number of packages that include functions for working with copula functions, such as svines, igcop, copBasic, CondCopulas, riskSimul, acopula, VineCopula, lcopola, copula, etc. Particularly useful is the copula package, which includes a large number of functions to perform such analyses. The following example will demonstrate the use of Chi and K plots. The Chi-plot can be used to estimate the dependence between pairs of variables (e.g. for peaks and volumes), and similarly for the other two pairs of variables (peaks-duration and volumes-duration). On the graph, the lambda shows the distance between the pairs of data (Xi,Yi) and the centre of the scatter plot. If the points are above Chi=0, there is a positive dependence; if the points are below Chi=0, there is a negative dependence. Confidence intervals indicate the boundary between significant and non-significant dependence. The Chi=0 line indicates independence. The K-plot can also be used to estimate the dependence. In the example below, we will plot the volumes and durations of the hydrographs. In the k-plot, the line x=y indicates independence and the curve indicates a perfect positive dependence between the data. Negative dependence would be shown below the x=y line. The greater a point’s distance from the x=y line, the greater the dependence. If there is a perfect positive dependence, the points are located on the curve, and if there is a perfect negative dependence, the points are located on the x-axis.

library(copula, quietly=TRUE)

## Warning: package 'copula' was built under R version 4.1.3

library(VineCopula, quietly=TRUE)

## 
## Attaching package: 'VineCopula'

## The following object is masked from 'package:copula':
## 
##     pobs

vzorec <- as.data.frame(vzorec) # transform data into a data frame
colnames(vzorec) <- c("Q","V","T") # change colnames
# calculate pseudo values, transform data to [0,1]
u <- pobs(vzorec) 
BiCopChiPlot(u[,1],u[,2],xlim=c(-1,1),ylim=c(-1,1),main="Chi-plot",
  col="black",cex=1.5)

Figure 50: Chi-plot for the volume-peaks of the hydrographs.

BiCopKPlot(u[,2],u[,3], PLOT=TRUE,main="K plot",col="black")

Figure 51: K-plot for the volume-peaks of the hydrographs.

Once we have estimated the dependence between pairs of variables, we can start estimating the copula parameters. In the example below, we will estimate the parameters based on the Kendall correlation coefficient. In the copula package, we can choose between different functions such as Joe, Frank, Clayton, Gumbel, etc. Similarly, we can choose various different parameter estimation methods, such as the pseudo maximum likelihood method (mpl) or the method based on Kendall’s correlation coefficient (itau), which will be used in the example below. The procedure is similar in both the bivariate and the trivariate case.

# values of Kendall correlation coefficients between pairs of variables 
cor(u,method="kendall")

##           Q         V         T
## Q 1.0000000 0.4973262 0.1303322
## V 0.4973262 1.0000000 0.4634033
## T 0.1303322 0.4634033 1.0000000

# estimate the parameter(s) of the selected copula function
param <- fitCopula(copula=joeCopula(),data=u[,c(1:2)],method="mpl") 
# define copula
kopul2 <- joeCopula(param@estimate, dim=2) 
# estimate the trivariate copula parameter in a similar way
param1 <- fitCopula(copula=frankCopula(dim=3),data=u,method="mpl")
# define trivariate copula
kopul3 <- frankCopula(param1@estimate, dim=3)

As there are several functions to choose from, it is often necessary to find which copulas adequately describe the input data used. The copula package also contains statistical tests that can be used to evaluate the suitability of the selected copula functions. In the example shown below, we can use the Cramer-von Mises test; more details on the chosen test can be found in the description of the gofCopula function. It can be seen that the calculated (below) p-value is greater than 0.05, which means that the chosen Joe copula is appropriate for the analyses. Similarly, we can also check the Gumbel copula, where it makes sense to increase the number of N, which allows for a more accurate calculation of the test statistic, although the calculation is slightly slower. Similarly, for the trivariate function, the test statistic shows that the chosen function is appropriate (given the calculated p-value).

# statistical tests of agreement for different copulas
gofCopula(copula=joeCopula(),u[,c(1:2)],optim.method = "L-BFGS-B",
    N=500,estim.method="mpl",simulation="mult",method="Sn")

## 
##  Multiplier bootstrap-based goodness-of-fit test of Joe copula, dim. d
##  = 2, with 'method'="Sn", 'estim.method'="mpl":
## 
## data:  x
## statistic = 0.045758, parameter = 2.3213, p-value = 0.07685

gofCopula(copula=gumbelCopula(),u[,c(1:2)],optim.method = "L-BFGS-B",
    N=500,estim.method="mpl",simulation="mult",method="Sn")

## 
##  Multiplier bootstrap-based goodness-of-fit test of Gumbel copula, dim.
##  d = 2, with 'method'="Sn", 'estim.method'="mpl":
## 
## data:  x
## statistic = 0.025186, parameter = 1.969, p-value = 0.2884

# trivariate case
gofCopula(copula=joeCopula(dim=3),u,optim.method = "L-BFGS-B",
    N=500,estim.method="mpl",simulation="mult",method="Sn")

## 
##  Multiplier bootstrap-based goodness-of-fit test of Joe copula, dim. d
##  = 3, with 'method'="Sn", 'estim.method'="mpl":
## 
## data:  x
## statistic = 0.060341, parameter = 1.8385, p-value = 0.1347

A graphical comparison with the measured data can also be checked. We can also generate a random sample of pseudo values based on the bivariate Joe copula, which can be compared with the actual values to graphically assess the goodness of fit of the chosen copula function. Since we will add the actual data to the graph, we will see that the chosen copula puts more emphasis on the upper tail, which is not a characteristic of our data, hence the Joe copula may not be the most suitable for bivariate analyses and we should check the suitability of some other copula. This can be confirmed by calculating the lower and upper tail dependence index (example below). However, there are additional graphical visualizations that can be used to select the most appropriate copula function, such as the Lambda diagram shown below, where the blue line shows the data and the grey line shows the estimate based on the Joe copula.

gen2 <- rCopula(10000,kopul2) # generate data
plot(gen2,xlab="Discharge",ylab="Volumes") # plot
points(u[,c(1,2)],col="red",lty=2,lwd=2) # add points

Figure 52: Graphical comparison of measured and generated data.

lambda(joeCopula(param@estimate)) # upper/lower tail index

##    lower    upper 
## 0.000000 0.652018

# lambda plot
BiCopLambda(u1=u[,1],u2=u[,2],family=6,par=param1@estimate,col="blue")

Figure 53: Lambda diagram for estimating the appropriate copula.

Symmetric copulas can only be used if the variables are “interchangeable” and consequently there is a test in the copula package to check whether the variables are interchangeable. In the example below, we see that the calculated p-value is greater than 0, which means that we can use symmetric copula functions such as Joe, Frank, Clayton, Gumbel-Hougaard, etc. We can do the same for other pairs of variables. The copula package also includes the function xvCopula, which performs a cross-validation and which can be used to compare the goodness of fit of different copula functions (the higher the value of the test statistic, the better the goodness of fit of a particular function). There are, however, a large variety of different copula functions, and for the multivariate case it is worth mentioning the asymmetric Khoudraji-Liebscher copulas, which are often used in hydrological analyses.

exchTest(u[,c(1,2)], N =100) # interchangeability test

## 
##  Test of exchangeability for bivariate copulas with argument 'm' set to
##  0
## 
## data:  u[, c(1, 2)]
## statistic = 0.019031, p-value = 0.4604

xvCopula(copula=joeCopula(),x=u[,c(1,2)]) # xvCopula test

## [1] 6.159292

xvCopula(copula=frankCopula(),x=u[,c(1,2)])

## [1] 9.039083

# we see that Frank's copula is more appropriate than Joe's copula
xvCopula(copula=gumbelCopula(),x=u[,c(1,2)])

## [1] 8.454546

# Gumbel-Hougaard is slightly less appropriate than Frank
fitCopula(khoudrajiCopula(copula1 = gumbelCopula(dim=3), 
copula2=joeCopula(dim=3)),start = c(1.7,3.1, 0.5,0.6,0.05), 
u,optim.method = "Nelder-Mead")

## Call: fitCopula(khoudrajiCopula(copula1 = gumbelCopula(dim = 3), copula2 = joeCopula(dim = 3)), 
##     data = u, ... = pairlist(start = c(1.7, 3.1, 0.5, 0.6, 0.05), optim.method = "Nelder-Mead"))
## Fit based on "maximum pseudo-likelihood" and 34 3-dimensional observations.
## Copula: khoudrajiExplicitCopula 
## c1.alpha c2.alpha   shape1   shape2   shape3 
##   2.1028   3.6646   0.6765   0.2631   0.0183 
## The maximized loglikelihood is 21.07 
## Optimization converged

# asymetric Khoudraji-Liebscher copula

Once the appropriate copula function has been selected based on appropriate graphical and statistical tests, and the marginal distributions of the individual variables have been defined, we can also determine the various relationships between the return periods (e.g. OR or AND return period)⁷¹ and the design values of the variables. In the case of multivariate analyses, there are also many definitions of return periods such as conditional or secondary⁷². In the example below, we will use the Frank copula and the GEV distribution as the marginal distribution for volumes and peak flows. We will calculate the OR and AND return period values for the maximum peak flow in the sample and for the maximum volume of the hydrograph, noting that the maximum peak and volume did not actually occur as part of the same event. The OR return period (i.e., the case that either the peak flow or the hydrograph volume occurs) is approximately 21 years (example below). The AND return period (i.e., the case that both the peak flow and the hydrograph volume occur simultaneously) the return period is much larger. The probability of both the peak flow and the hydrograph volume occurring simultaneously is of course smaller.

# copula parameters
paramf <- fitCopula(copula=frankCopula(),u[,c(1,2)],method="mpl") 
# define copula
kopul2 <- frankCopula(paramf@estimate, dim=2) 
# marginal distributions
parGEVQ <- lmom2par(lmoms(vzorec[,1]),type="gev") 
parGEVV <- lmom2par(lmoms(vzorec[,2]),type="gev") 
# generate data
gen <- rCopula(10000,kopul2) 
# graphical comparison
plot(quagev(gen[,1],parGEVQ),quagev(gen[,2],parGEVV),xlab="Discharge [m3/s]",
     ylab="Volumne [10^6 m3]") 
points(vzorec[,c(1,2)],col="red",lty=2,lwd=2)

Figure 54: Example of a graphical comparison of measured and generated data for the Frank copula.

# return periods OR and AND 
dl1 <- cdfgev(max(vzorec[,1]),parGEVQ) 
dl2 <- cdfgev(max(vzorec[,2]),parGEVV) 
vmes <- pCopula(c(dl1,dl2),kopul2) 
# return period OR
povratnaOR <- 1/(1-vmes)
# return period AND
povratnaAND <- 1/(1-dl1-dl2+vmes) 
povratnaOR

## [1] 21.34922

povratnaAND

## [1] 423.8661

The calculation of the OR return periods can be done for different combinations of flows and volumes. In the example below, we will define an empty matrix where we will store the results. The calculations will be done in 2 loops, first for the flows and then for the volumes. In the first step, we will calculate marginal distributions and then the OR and AND return periods. We will also plot a contour plot for different values of the return periods.

# OR return period 
matrikaOR <- matrix(NA,nrow=length(seq(from=150,to=500,by=10)),
  ncol=length(seq(from=20,to=200,by=5)))  
for(i in 1:length(seq(from=150,to=500,by=10))){ 
  for(j in 1:length(seq(from=20,to=200,by=5))){ 
    dl1 <- cdfgev(seq(from=150,to=500,by=10)[i],parGEVQ) 
    dl2 <- cdfgev(seq(from=20,to=200,by=5)[j],parGEVV) 
    vmes <- pCopula(c(dl1,dl2),kopul2) 
    matrikaOR[i,j] <- 1/(1-vmes) # OR return period     
  }
}
# contour plot for different values of return periods
contour(seq(from=150,to=500,by=10),seq(from=20,to=200,by=5),
matrikaOR,levels=c(2,5,10,20,50,100),xlab="Discharge [m3/s]",
ylab="Volumne [10^6 m3]")

Figure 55: Contour plot for OR return period.

# AND  return period
matrikaAND <- matrix(NA,nrow=length(seq(from=150,to=500,by=10)),
  ncol=length(seq(from=20,to=200,by=5))) 
for(i in 1:length(seq(from=150,to=500,by=10))){
  for(j in 1:length(seq(from=20,to=200,by=5))){
    dl1 <- cdfgev(seq(from=150,to=500,by=10)[i],parGEVQ)
    dl2 <- cdfgev(seq(from=20,to=200,by=5)[j],parGEVV)
    vmes <- pCopula(c(dl1,dl2),kopul2)
    matrikaAND[i,j] <- 1/(1-dl1-dl2+vmes)    
  }
}
contour(seq(from=150,to=500,by=10),seq(from=20,to=200,by=5),
matrikaAND,levels=c(2,5,10,20,50,100,500,1000),xlab="Discharge [m3/s]",
ylab="Volumne [10^6 m3]")

Figure 56: Contour plot for AND return period.

Similarly, bivariate analyses can be performed for other variables⁷³, and multivariate probability analyses⁷⁴ or copula functions can be used to estimate unknowns⁷⁵ or to determine confidence intervals⁷⁶.

Task 35: Based on the Joe copula with parameter value 2, generate a bivariate sample with 40 elements. Based on the generated sample, estimate the parameters of the Frank copula and check the suitability of the Frank copula to describe this sample. Additionally, graphically compare the generated sample based on the Frank copula (N=500) and the Joe copula.

4.6 Analyses of seasonality

Data about the seasonal characteristics of peak (or low) flows can inform diagnoses of climate change⁷⁷. The hydrological conditions of watercourses are the result of the influence of precipitation, soil moisture, and snow conditions in the catchment. Consequently, it is often useful to analyse the seasonal characteristics of these variables and to examine whether or not this seasonality is changing. The seasonality of floods can be graphically represented in the form of circular statistics using the equations given by Burn (1997)⁷⁸ and summarised in some other works⁷⁹. Also for circular statistics, a large number of various tests are available. In this context, it is worth mentioning the CircStats package⁸⁰, which contains a large number of such functions. In the following example, we will show an analysis of the seasonality of the annual peak for the data included in the lfstat package, which we also used in the previous section. We will first calculate the angles of the time of occurrence of these events in radians and then the average time of the occurrence of all the peaks. A value close to 1 means that the seasonality is very pronounced and practically all events occur in a similar part of the year, while a value close to 0 means that there is no pronounced seasonality. We will also show the seasonal occurrence of annual maximum peaks (vQvk), where one circle represents one annual peak and the size of the circle is the magnitude of the event.

dan <- letQkat 
# angles in radians 
kot <- (dan-0.5)*(2*pi/365) 
# mean time of occurance
xpov <- mean(cos(kot));x <- cos(kot)  
ypov <- mean(sin(kot));y <- sin(kot)  
r <- sqrt(xpov*xpov+ypov*ypov)
# average day of occurance based on r
if(xpov>=0 && ypov>=0) dan.rez <- (atan(ypov/xpov))*(365/(2*pi)) else
  if(xpov<0) dan.rez <- (atan(ypov/xpov) + pi)*(365/(2*pi)) else
    if (ypov<0 && xpov >= 0) dan.rez <- (atan(ypov/xpov) + 2*pi)*(365/(2*pi)) else 
      end 
# calculated day
dan.rez

## [1] 186.2764

symbols(x,y,letQkon,inches=0.25,bg="lightblue",xlab="",ylab="",
xlim=c(-1.1,1.1),ylim=c(-1.1,1.1),fg="white",
main="Seasonal occurance of vQvk");abline(0,0);abline(v=0) 
par(new=TRUE)
symbols(xpov,ypov,mean(letQkon),inches=0.25,bg="red",
xlab="",ylab="",xlim=c(-1.1,1.1),ylim=c(-1.1,1.1),fg="white") 
# adding text to the plot
text(xpov,ypov,"Mean value") 
text(1,0.1,labels="January")
text(0.2,1,labels="April")
text(-1,0.1,labels="July")
text(0.2,-1,labels="October")

Figure 57: Seasonal display of annual peaks.

A similar display to the one we made manually above can also be made with the built-in functions in the CircStats package. The rose diagram is used to plot the wind rose; we will plot all the events on the circle using the circle plot method and calculate the basic statistics, where the result rho is the same as the seasonality index r. A test is also shown to check whether there are changes in the direction and magnitude of the trend in the sample. More information about the function and the results can be found in the description of the function change.pt.

library(CircStats, quietly=TRUE)

## Warning: package 'CircStats' was built under R version 4.1.3

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

rose.diag(kot,bins=20)

Figure 58: Rose diagram.

circ.plot(kot,stack=TRUE,bins=360)

Figure 59: Circular diagram of all maximum annual peak flows.

circ.summary(kot) #  rho

##    n  mean.dir       rho
## 1 37 -3.076585 0.1566884

change.pt(kot) # change test

##    n       rho     rmax k.r     rave      tmax k.t       tave
## 1 37 0.1566884 4.547054  26 1.152155 0.1167458  35 0.04501709

Task 36: For data from one of the ARSO stations, perform seasonality analyses: i) calculate the seasonality coefficient and ii) draw a circular graph showing the timing of the selected events. Repeat the analyses for the annual peak (AM) sample and for the POT sample, selecting on average 3 events above the selected threshold per year (POT3).

4.7 Analyses of recession parts of hydrographs

The gradual draining of water stored in the basin during periods of little or no rainfall is reflected in the form of the recession part of the hydrograph (i.e. the declining part of the hydrograph from the flow’s peak to the end of the surface part of the runoff). The recession curve describes in a comprehensive way how the various water-rich parts of the basin and the processes within it regulate the inflow of water into the watercourse. Rivers with a slow rate of decline in the recession part of the hydrograph tend to be closely associated with groundwater or lakes, while rapid decreases in flow (and also increases) are characteristic of torrents. Analyses of the recession part of the hydrograph have proved useful in many aspects of water resources management, including low flow forecasting and the assessment of low flow variables at gauging sites. Such data are also often useful for regional flood frequency analyses and rainfall-runoff modelling⁸¹. An overview of the different methods and the basic equations can be found in an article by Tallaksen (1995)⁸². Some functions for analysing the recession part of the hydrograph are also included in the lfstat package. Additionally, an interesting paper⁸³ shows the influence that choosing different criteria has on the value of the low-flow recession constants, namely on the choice of the calculation method, the choice of the length of the recession curve segment, and the choice of the period used to determine the initial flow, which sets the limit for the inclusion of the decreasing parts of the hydrograph in the recession analysis.

library(lfstat, quietly=TRUE)
# duration of the decreasing part of the hydrograph with respect to the threshold value
seglenplot(ngaruroro,threslevel = 70)

Figure 60: Analysis of the recesion part of the hydrographs.

# example of a main recession curve (MRC) with a segment length of 5 days 
# and Q70 as a threshold value
recession(ngaruroro,method = "MRC",seglen = 7,threshold = 70)

Figure 61: Example of the main recession curve.

## [1] 19.77262

# function for seasonality analysis of low flows
seasindex(ngaruroro)

## $theta
## [1] 1.088009
## 
## $D
## [1] 63.20411
## 
## $r
## [1] 0.8554912

4.8 Determination of the sample according to the peak over threshold method (POT)

Two different approaches are commonly used to perform flood frequency analyses of high-water peaks⁸⁴. In the annual maximum (AM) method, the sample consists of the peak flows in each year (maximum values). The sample therefore contains as many elements as the number of years of data used for the analysis. An alternative way of forming the sample is to select all events above a certain threshold (the POT method). In the POT method, the sample is formed by selecting all events are above a certain threshold⁸⁵. There are several different packages available in the R programming environment that allow you to define a POT sample, one of which is the hydrostats package (also available on GitHub). The example below shows the identification of all events over a selected threshold. We will use the function partial.series, where the argument ari defines the number of events above the selected threshold (or the average return period) and the argument ind.days defines the independence criterion between two consecutive events. If we want to define an average of 5 events above the selected threshold, the average return period is equal to (1/5=0.2), and if we have calculated that there must be at least 7 days between two events (ind.days), we can also define a pattern like the one shown below. The hydrostats package also contains other useful functions that can be used to analyse peak flows (including low flows), such as the high.spells function that can be used to analyse the occurrence time, frequency, and duration of different events. In addition, some seasonal characteristics can be determined, such as average monthly flows or the percentage of annual runoff in the 6 driest months, etc. For more information, see the description of the seasonality function.

library(hydrostats, quietly=TRUE)

## Warning: package 'hydrostats' was built under R version 4.1.3

# data frame, this is the format required by the hydrostats package
preob <- data.frame(as.character(time(pret)),as.numeric(coredata(pret))) 
colnames(preob) <- c("Date","Q") 
# annual maximum
partial.series(ts.format(preob,format="%Y-%m-%d"), ari=1, plot=TRUE, ind.days=1)

Figure 62: Hydrograph and annual maxima (red dots).

##   ari n.years n.events flow.threshold avg.duration max.duration
## 1   1      33       33        156.517     1.176471            3
##   med.spell.volume
## 1          28.2795

# average of 5 events above the threshold and 7 days between events
partial.series(ts.format(preob,format="%Y-%m-%d"), ari=0.2, plot=TRUE, ind.days=7)

Figure 63: Average of 5 events above threshold.

##   ari n.years n.events flow.threshold avg.duration max.duration
## 1 0.2      33      165         61.647     1.875648           10
##   med.spell.volume
## 1           30.362

# printout of all events, or only the first 6 of all events
head(partial.series(ts.format(preob,format="%Y-%m-%d"), ari=0.2, plot=FALSE,
  ind.days=7,series = TRUE)$p.series)

##             Date       Q hydro.year event.rank
## 4636  1976-09-09 301.535       1976          1
## 12602 1998-07-02 290.657       1998          2
## 11270 1994-11-08 280.164       1994          3
## 7745  1985-03-15 252.869       1985          4
## 11573 1995-09-07 251.044       1995          5
## 71    1964-03-11 248.107       1964          6

# an example of using high.spells function
high.spells(ts.format(preob,format="%Y-%m-%d"),quant=0.99)

Figure 64: Using high.spells function.

##   high.spell.threshold n.events spell.freq       ari min.high.spell.duration
## 1               92.009       85   2.575758 0.3882353                       1
##   avg.high.spell.duration med.high.spell.duration max.high.spell.duration
## 1                1.729412                       1                       9
##   avg.spell.volume avg.spell.peak sd.spell.peak avg.rise avg.fall avg.max.ann
## 1         66.13016       30.36957        34.344 84.83268 48.86572     172.669
##   cv.max.ann flood.skewness ann.max.timing ann.max.timing.sd ann.max.min.dur
## 1     37.263       10.02056            200               103               1
##   ann.max.avg.dur ann.max.max.dur ann.max.cv.dur
## 1         3.30303               9       59.82942

high.spell.lengths(ts.format(preob,format="%Y-%m-%d"), threshold=200)

##    start.date spell.length
## 2  1964-03-11            1
## 4  1965-08-15            1
## 6  1967-02-03            1
## 8  1967-08-12            1
## 10 1976-09-09            1
## 12 1985-03-15            1
## 14 1991-08-09            1
## 16 1994-11-08            1
## 18 1995-09-07            1
## 20 1998-07-02            1

# seasonality function
seasonality(ts.format(preob,format="%Y-%m-%d"), monthly.range=TRUE)

## $seasonality
## [1] 40.65079
## 
## [[2]]
##       01       02       03       04       05       06       07       08 
## 386.1272 289.4735 356.9346 384.5619 494.7906 674.8001 811.1845 840.7488 
##       09       10       11       12 
## 735.8962 582.4253 447.5639 413.2396 
## 
## $avg.ann.month.range
## [1] 1007.311
## 
## $max.min.time.dif
## [1] 5

4.9 Analysis of precipitation data

Precipitation analysis in water management is crucial for understanding and managing water resources. As the main input in the hydrological cycle, precipitation plays a fundamental role in shaping the availability and movement of water. Thus, we are often interested in the spatial and temporal variability of precipitation and its characteristics⁸⁶. Analysing the spatial distribution of precipitation involves understanding how precipitation varies in different locations. Analysing the temporal distribution involves studying how precipitation changes over time, such as seasonal patterns, monthly trends, and diurnal cycles. Engineering practice, however, makes frequent use of intensity-duration-frequency (IDF) curves ⁸⁷. These curves are helpful in designing built infrastructure, such as stormwater drainage systems (IDF curves are the input to the rational equation), and are essential for flood risk assessment. However, rainfall analyses can also be an indirect indicator of drought conditions through the analysis of various rainfall indices, such as the Standardised Precipitation Index (SPI)⁸⁸. We will show some examples of analyses that can be useful for water management. The first example will be the calculation of the SPI⁸⁹. First, data import and data preparation with a time step of 30 minutes will be shown. The data are structured in 3 columns; the first one contains the date, the second one the time, and the last one the amount of precipitation in 30 minutes (in mm).

# 30-minute precipitation data from station Lj.-Bežigrad
data <- read.table("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Ljubljana-padavine.txt",header=FALSE, sep=" ") 
head(data)

##           V1       V2 V3
## 1 2010-01-01 00:00:00  0
## 2 2010-01-01 00:30:00  0
## 3 2010-01-01 01:00:00  0
## 4 2010-01-01 01:30:00  0
## 5 2010-01-01 02:00:00  0
## 6 2010-01-01 02:30:00  0

# combine data into a data frame
dataP <- data.frame(Datum=as.POSIXct(paste0(data[,1]," ",data[,2]),
  format="%Y-%m-%d %H:%M:%S"),P=((data[,3])))
# check whether we might have any duplicate records in the data
head(dataP[duplicated(dataP[,1]),])

##                     Datum P
## 14549 2010-10-31 02:00:00 0
## 14550 2010-10-31 02:30:00 0
## 32021 2011-10-30 02:00:00 0
## 32022 2011-10-30 02:30:00 0
## 49493 2012-10-28 02:00:00 0
## 49494 2012-10-28 02:30:00 0

# dates related to the clock change in autumn
# in case we have part of the missing data,
# we could generate a continuous time vector
time.seq <- seq(as.POSIXct("1.1.2010 00:00:00",format="%d.%m.%Y %H:%M:%S"), 
as.POSIXct("31.12.2020 23:00:00",format="%d.%m.%Y %H:%M:%S"), by = "30 mins") 
length(time.seq)==dim(dataP)[1]

## [1] TRUE

# data dataP contains all the data we need

If you have duplicate data and want to delete the redundancies, you could use the procedure below. This way, the generated time vector could be compared with the actual data and the data could be merged according to the time information.

dataP <- dataP[-which(duplicated(dataP[,1])==TRUE),] 
dataPsk <- merge(dataP,as.data.frame(time.seq),by.x="Datum",by.y="time.seq")

We will transform the precipitation data into a zoo object.

# basic features of 30-minute rainfall data
summary(dataP[,2])

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07252  0.00000 26.00000

# plot the basic graph of the data used
plot(dataP,type="h",ylab="P [mm]",main="Ljubljana")

Figure 65: 30-min precipitation data.

# transform data into a zoo object
Pzoo <- zoo(dataP[,2],dataP[,1])

## Warning in zoo(dataP[, 2], dataP[, 1]): some methods for "zoo" objects do not
## work if the index entries in 'order.by' are not unique

# evaluate the monthly values
mes1 <- as.yearmon(time(Pzoo)+3600) 
head(mes1) # data structure

## [1] "jan. 2010" "jan. 2010" "jan. 2010" "jan. 2010" "jan. 2010" "jan. 2010"

tail(mes1) # last part of data

## [1] "dec. 2020" "dec. 2020" "dec. 2020" "dec. 2020" "dec. 2020" "dec. 2020"

mesVsote <- aggregate(Pzoo, mes1, sum)
plot(mesVsote, xlab="Year",ylab="Monthly precipitation [mm]")

Figure 66: Monthly precipitation totals.

# descriptive statistics
summary(mesVsote)[,2]

##                                                                         
## "Min.   :  0.0  " "1st Qu.: 53.9  " "Median :101.8  " "Mean   :106.0  " 
##                                     
## "3rd Qu.:145.2  " "Max.   :419.9  "

To calculate the SPI, the SPEI package can be used. The calculation of the SPI with a 3-month step is shown below; the scale argument specifies the accumulation period and there are various distribution functions to choose from (distribution argument). SPI values less than -2 indicate extreme drought, values between -1.5 and -2 indicate severe drought, values between -1 and -1.5 indicate moderate drought, and values between 1 and -1 can be considered normal; wet periods (very wet, quite wet and moderately wet) are defined in a similar way. The SPI index calculated for shorter periods can serve as an indicator of changes over short time intervals (e.g. lower soil moisture, reduced water in smaller watercourses), while the SPI calculated for longer durations can be used as an indicator of a reduction in groundwater recharge. It should be emphasised that it is useful to include long periods of data in such analyses (e.g. at least 30 years of data); in our case, the SPI will be calculated considering only 11 years. The SPEI package contains a function to calculate the SPEI with the water balance (precipitation-evapotranspiration) taken into account, and the package also contains functions to calculate the potential and reference evapotranspiration based on different methods.

library(SPEI, quietly=TRUE)

## Warning: package 'SPEI' was built under R version 4.1.3

## # Package SPEI (1.8.1) loaded [try SPEINews()].

spiLJ3 <- spi(data=as.numeric(mesVsote), scale=3, distribution = "Gamma")

## [1] "Calculating the Standardized Precipitation Evapotranspiration Index (SPEI) at a time scale of 3. Using kernel type 'rectangular', with 0 shift. Fitting the data to a Gamma distribution. Using the ub-pwm parameter fitting method. Checking for missing values (`NA`): all the data must be complete. Using the whole time series as reference period. Input type is vector. No time information provided, assuming a monthly time series."

# check the SPI-3 values
spiLJ3

##   [1]           NA           NA -0.320274607 -0.194664298 -0.937433956
##   [6] -0.997291452 -0.532102209  0.694251435  1.956356663  1.848429557
##  [11]  1.534704427  0.330837079  0.400682870 -0.214182042 -0.527328861
##  [16] -1.244062574 -0.846891962 -1.118370879  0.887495674  0.004630253
##  [21] -1.043273171 -1.018107148 -1.484780898 -0.503925501 -1.198912813
##  [26] -0.649866162 -1.965812005 -0.964039477  0.013230781  1.153018972
##  [31]  0.221749097  0.061847109  0.093577654  0.530403106  0.740821193
##  [36]  0.796258538  0.009794913 -0.430663652  0.848850167  0.952027128
##  [41]  2.099860410  0.859576635 -0.783250374 -1.775407088 -0.484887868
##  [46] -0.534458983  0.141015585 -0.369850775  1.099274994  2.195683553
##  [51]  1.802138774  1.629224496 -0.882588862 -0.241162386 -0.171914258
##  [56]  1.458401392  0.967692769  0.993198025  1.087638631  1.347840830
##  [61]  0.708802357 -0.873485850 -0.277725207 -0.544605200 -0.241767869
##  [66] -0.533794532 -0.259785861 -0.134785235 -0.409735107 -0.128388531
##  [71] -0.675362384 -1.929543286 -1.753847072  0.645290230  1.155997095
##  [76]  1.054679110  0.137503564  1.046102783  1.210122917 -0.013273574
##  [81] -1.498376646 -1.432291985 -0.717582302 -0.498851772 -0.134448961
##  [86] -0.486447345 -0.111003443  0.582909529 -0.382092446  1.065977865
##  [91] -1.803891703 -0.938563641  0.368208342  0.236068571  0.936565757
##  [96]  0.976126447  1.345421186  0.922216408  0.077795125 -0.048752377
## [101]  0.535018388 -0.913392064 -0.783250374  1.030650745  0.548587794
## [106]  0.384780677 -0.668509843 -1.117839390 -0.668058412 -0.403802362
## [111] -0.013587585 -0.071196777  1.216234170  0.833255835  0.980794035
## [116] -1.060219411 -0.166013931 -0.750160220 -0.166819485  0.334108691
## [121]  0.392053300 -0.493317893 -0.498100846 -0.902303156 -0.569267491
## [126] -1.089178576  1.101313441  0.792667623 -0.115795328  0.094460418
## [131] -0.504844354  0.750705868

plot(spiLJ3)

## Warning: Removed 130 rows containing missing values (`geom_point()`).

Figure 67: SPI-3 index.

spiLJ12 <- spi(data=as.numeric(mesVsote), scale=12, distribution = "Gamma")

## [1] "Calculating the Standardized Precipitation Evapotranspiration Index (SPEI) at a time scale of 12. Using kernel type 'rectangular', with 0 shift. Fitting the data to a Gamma distribution. Using the ub-pwm parameter fitting method. Checking for missing values (`NA`): all the data must be complete. Using the whole time series as reference period. Input type is vector. No time information provided, assuming a monthly time series."

plot(spiLJ12)

## Warning: Removed 121 rows containing missing values (`geom_point()`).

Figure 68: SPI-12 index.

In addition to the SPI, there are a number of other indices⁹⁰. Most indices use monthly rainfall (or water balance) as an input, but only a few indices can be calculated from daily data, one of which is the effective drought index (EDI). Consequently, such an index is very useful for operational monitoring of meteorological and agricultural drought⁹¹. For example, the EDI has been used to analyse the coincidence of drought and extreme heat in Europe for the period 1961–2018⁹².

# from 30-min data to daily values
dan1 <- as.Date(time(Pzoo)+3600,format="%Y-%m-%d") 
# calculate daily precipitation totals
dnevneVsote <- aggregate(Pzoo, dan1, sum) 
# check the basic statistics of the daily data
summary(dnevneVsote)[,2]

##                                                                                 
## "Min.   :  0.000  " "1st Qu.:  0.000  " "Median :  0.000  " "Mean   :  3.481  " 
##                                         
## "3rd Qu.:  1.600  " "Max.   :130.500  "

plot(dnevneVsote,xlab="Year",ylab="Daily precipitation [mm]")

Figure 69: Daily precipitation values.

# define the calculation period before the calculation starts, 
# so we will calculate the EDI based on the previous 365 days
# we will store the results in the EP object
obd <- 365;EP <- NA 
vekt <- as.numeric(dnevneVsote)
# calculate the EP values 
for(i in (obd+1):length(as.numeric(dnevneVsote))){ 
dum <- 0
  for(j in 1:obd){
    dum <- dum + sum(vekt[((i-j+1):i)])/j
  }
  EP[i] <- dum 
}
MEP <- mean(EP,na.rm=TRUE)
DEP <- EP-MEP
# EDI index calculation
EDI <- DEP/sd(DEP,na.rm=TRUE) 
susa <- which(EDI<(-1.5))
plot(time(dnevneVsote),EDI,type="l",xlab="Year")
abline(h=-2,col="red",lty=2) # extreme drought
abline(h=-1.5,col="orange",lty=2) # severe drought
abline(h=-1,col="yellow",lty=2) # moderate drought

Figure 70: EDI index.

plot(time(mesVsote),spiLJ3$fitted,type="l",col="blue",lty=3,xlab="Year",
     ylab="SPI3") 
abline(h=-2,col="red",lty=2); abline(h=-1.5,col="orange",lty=2) 
abline(h=-1,col="yellow",lty=2)

Figure 71: SPI index.

Comparing the results (above) with the SPI3 index shows a match for certain drought events (e.g. 2012) and a worse match for some other events.

Task 37: From the ARSO website (meteo.si, measurement archive), obtain monthly precipitation data for the period 1961-2020 for the Murska Sobota station and calculate the standardised precipitation indices (SPI1, SPI3, SPI6, and SPI12). Graphically show the calculated indices.

4.10 Intensity-duration-frequency (IDF) curves

IDF curves can also be determined from rainfall data and are used as one of the inputs in the determination of design rainfall events⁹³. The following example illustrates the process of determining IDF curves from the rainfall data used. It should be emphasised that, even in the determination of IDF curves, the longest possible data sets are needed to calculate reliable results. The procedure shown follows the methodology shown in the IDF package⁹⁴, which uses the methodology developed by Koutsoyiannis et al.⁹⁵. However, other distribution functions, such as the Gumbel distribution⁹⁶, are often used to determine ITP curves. In this case, the procedure for producing ITP curves is very similar to that shown for the development of flood frequency analyses of high-water peaks. In the following example, we will calculate the pattern of annual maxima (in mm/h) for selected durations (in hours; argument ds); it is also possible to use the argument which.mon to select only certain months in which to look for annual maxima. Annual maxima could also be determined using the zoo package and the aggregate function.

library(IDF, quietly=TRUE)

## Warning: package 'IDF' was built under R version 4.1.3

## 
## Attaching package: 'IDF'

## The following object is masked _by_ '.GlobalEnv':
## 
##     example

# rename the data according to the instructions in the IDF package
colnames(dataP) <- c("date","RR") 
# define sample
vzorec <- IDF.agg(data=list(dataP),ds=c(1,6,12,24),na.accept = 1) 
# check data structure
head(vzorec)

##      xdat ds year  mon station
## 2010 23.9  1 2010 0:11       1
## 2011 24.2  1 2011 0:11       1
## 2012 27.4  1 2012 0:11       1
## 2013 23.2  1 2013 0:11       1
## 2014 40.3  1 2014 0:11       1
## 2015 27.5  1 2015 0:11       1

# xdat are the annual maxima (mm/h) for different durations of ds (h)
# station indicates the station
aggregate(zoo(rollapply(dataP[,2],12,sum),dataP[,1]), 
floor(as.yearmon(time(zoo(rollapply(dataP[,2],2,sum),
dataP[,1]+3600)))), max)

## Warning in zoo(rollapply(dataP[, 2], 12, sum), dataP[, 1]): some methods for
## "zoo" objects do not work if the index entries in 'order.by' are not unique

## Warning in zoo(rollapply(dataP[, 2], 2, sum), dataP[, 1] + 3600): some methods
## for "zoo" objects do not work if the index entries in 'order.by' are not unique

## jan. 2010 jan. 2011 jan. 2012 jan. 2013 jan. 2014 jan. 2015 jan. 2016 jan. 2017 
##      61.5      54.1      52.7      37.0     112.9      46.8      36.5      48.2 
## jan. 2018 jan. 2019 jan. 2020 
##      63.9      37.0      51.8

# in this case, 6h values are calculated
# the second argument in the rollapply function defines how much data is being added
# the maximum values in mm (not in mm/h) are obtained as a result
# fit the GEV distribution using the maximum likelihood method (MLE)
test <- gev.d.fit(xdat=vzorec$xdat,ds=vzorec$ds)

## Warning in sqrt(diag(z$cov)): NaNs produced

## $conv
## [1] 0
## 
## $nllh
## [1] 90.75036
## 
## $mle
## [1] 6.43018911064108 3.89991482372726 0.23503591461929 0.00000002183837
## [5] 0.67972665428536
## 
## $se
## [1] 0.17277214        NaN 0.10158694        NaN 0.02330413

# check the graphical correspondence between the sample and the estimated values
gev.d.diag(test)

Figure 72: Comparison of GEV distributions to the sample.

# calculate the parameters of the GEV distribution
parametri <- gev.d.params(test, ydat = NULL) 
# determine for which values of return periods the IDF curves will be calculated
povratne <- c(2,10,50,100) 
# plot IDF curves
IDF.plot(durations=1:24,fitparams = parametri,probs = 1-1/povratne,cols=4:1) 
# add a legend showing the return period in years
legend("bottomleft",legend=povratne,col=4:1,lty=1,bty="n",cex=0.9)

Figure 73: IDF curves.

qgev.d(p=1-1/povratne,mut=parametri[1],sigma0=parametri[2],xi=parametri[3],
theta=parametri[4],eta=parametri[5],eta2=parametri[6],tau=parametri[7],d=24)

## [1] 3.063544 4.225090 5.765042 6.618706

In the last step, we also calculated the rainfall intensities (in mm/h) for the selected durations, which are given by the d argument; the p argument gives the selected return period values for which we are interested in the calculations, and the other arguments define the parameters of the qgev.d function.

4.11 Rainfall erosivity

Rainfall erosivity is one of the main drivers of soil erosion⁹⁷. Rainfall erosivity is characterised by high spatial and temporal variability⁹⁸. To obtain reliable estimates of rainfall erosivity, high temporal resolution rainfall data are needed, for example data with a time step of 5 minutes. The availability of stations with such high-frequency data that can be used to calculate erosivity is relatively low in many parts of the world. Rainfall erosivity R is one of the inputs to the RUSLE equation, which is used to estimate the potential erosion of soil by water impact⁹⁹. The RUSLE methodology thus defines the procedure for calculating rainfall erosivity R and the criteria for determining erosive rainfall events¹⁰⁰. According to the methodology, two independent rainfall events are separated in the case of less than 1.27 mm of rain in 6 hours (between two events). Only erosive rainfall events with more than 12.7 mm of rain in the whole event, or those with more than 6.35 mm of rainfall within 15 minutes, are considered in the rainfall erosivity calculations. The erosivity R is calculated as the product of the maximum 30-minute rainfall intensity and the kinetic energy of the rainfall event. The annual erosivity is the sum of the erosivities for all erosive events in a given year. A number of equations have been developed to calculate the specific kinetic energy of rainfall based on measurements of raindrop properties (velocity and size)¹⁰¹. The methodology (and R code) for calculating the erosivity of rainfall shown below is based on a paper by Pidoto et al. (2022)¹⁰². The erosivity function also allows the criteria for determining independent precipitation events to be modified, as in some cases only a 6 h (or 4 h) period without precipitation is considered as a criterion for distinguishing between precipitation events. The function to calculate the erosivity of rainfall for independent events, which are also determined by the function, is described below. The Pcp argument is the continuous input rainfall data (in vector form), the Factor argument specifies whether the units of the data are mm (or other units; a value of 1 means that the units are in mm), the StartDate argument specifies the start date of the data, the Timestep argument specifies the time step of the data (in minutes), the SeperationMin argument specifies the criterion when the two events are combined (in mm; 0 can be used), the PcpMin argument specifies the criterion as to which event is still considered as erosive (in mm), and the Pcp15Min argument also specifies the erosive event (in mm), except that in this case it is a 15-min rainfall event (in mm).

GetErosiveEvents <- function(Pcp, Factor=1, StartDate=as.POSIXct("1900-01-01",
tz="UTC"),Timestep=5, SeparationMin=1.27, PcpMin=12.7, Pcp15Min=6.35 ) {
  print(paste("Factor:",Factor)) # printout of the basic features of the function
  print(paste("Start date:",StartDate))
  print(paste("Total rainfall:",round(sum(Pcp/Factor, na.rm = TRUE)),"[mm]"))
  # any negative values are replaced by 0
  Pcp[Pcp<0] <- as.integer(0) 
  Pcp[is.nan(Pcp)] <- as.integer(0) # the same for NaN
  Pcp[is.na(Pcp)] <- as.integer(0) # and for missing values
  IsWet <- (Pcp>0) # whether or not there was precipitation at a particular step
  # the rolling sum function is used to separate two independent events 
  # k after 6 h = 1 + 6*60/Timestep
  PcpRollSum <- rollsum(Pcp, k = 1 + 6*60/Timestep, align = "left", fill = 0) - Pcp 
  if(Timestep<30) { # function was basically prepared for data 
  # with a time step of 5 min and later adjusted for a duration of 30 min 
  # in this case it is impossible to calculate the 15 min intensity
  PcpRollI15 <- rollsum(Pcp, k = 15/Timestep, align = "center", fill = 0 )  
  } else if (Timestep==30) {
  PcpRollI15 <- rollsum(Pcp, k = 30/Timestep, align = "center", fill = 0 )  
  } else {
  return(print("Duration must be less than or equal to 30 min"))
    }
  # prolongation of wet periods between events if the criteria are not met
  IsWet <- ( IsWet | (PcpRollSum>=SeparationMin*Factor)  )
  # breaking down events and checking which events meet the criteria
  calcEventIMax <- function(a,b,Pcp,Dur=15) return( max(rollsum(x = Pcp[a:b], k = min(b-a+1,Dur/Timestep) )) ) 
  # calculation of the maximum intensity [mm/hr] for a given duration [min]
  calcEventI15Max <- function(a,b,PcpMax) return( max(PcpMax[a:b]) ) 
  # calculation of the maximum intensity for a duration of 15 min
  calcEventVol <- function(a,b,Pcp) return(sum(Pcp[a:b])) 
  # rainfall calculation for the event
  # calculate the length and value of series of equal values in a vector 
  Events <- rle(IsWet) 
  # transformation into data frame
  Events <- data.frame(Length=Events[[1]],IsWet=Events[[2]]) 
  Events$endIndex <- cumsum(Events$Length) # end of event
  # start of event
  Events$startIndex <- as.integer(Events$endIndex - Events$Length + 1)
  Events <- Events[Events$IsWet,] # selection of only erosive events
  Events$Vol <- unlist(mapply(calcEventVol, Events$startIndex, Events$endIndex, MoreArgs=list(Pcp=Pcp)))
  Events$I15 <- unlist(mapply(calcEventI15Max, Events$startIndex, Events$endIndex, MoreArgs=list(PcpMax=PcpRollI15))) 
  # removing events that do not meet the criteria
  Events <- Events[ (Events$Vol>PcpMin*Factor | Events$I15>Pcp15Min*Factor ) , ]
  # calculation of max 30-minute intensity
  Events$I30 <- unlist(mapply(calcEventIMax, Events$startIndex, Events$endIndex, MoreArgs=list(Pcp=Pcp,Dur=30)))
  Events$I30 <- (Events$I30/Factor) / (30/60) # intensitiy calculation [mm/h]
  # calculation of the actual duration of precipitation events 
  CalcEventDur <- function(a,b,Pcp) return( max(which(Pcp[a:b]!=0)) - min(which(Pcp[a:b]!=0)) + 1 )
  CalcEventDate <- function(a,b,Pcp) return( min(which(Pcp[a:b]!=0))+a-2  )
  # calculating the specific kinetic energy according to the equation 
  # Brown and Foster (1987). i: rainfall intensity [mm/hr]
  CalcBrownFoster <- function(i) return( 0.29*(1-0.72*exp(-0.05*i)) ) 
  # kinetic energy calculation for selected events
  CalcEventE <- function(a,b,Pcp) return( sum(CalcBrownFoster((Pcp[a:b]/Factor)/(Timestep/60)) * (Pcp[a:b]/Factor) ) ) 
  Events$E <- unlist(mapply(CalcEventE, Events$startIndex, Events$endIndex, MoreArgs=list(Pcp=Pcp)))
  Events$R <- Events$E * Events$I30
  Events$Date <- StartDate + (unlist(mapply(CalcEventDate, Events$startIndex, Events$endIndex, MoreArgs=list(Pcp=Pcp)))*Timestep*60)
  Events$D_h <- unlist(mapply(CalcEventDur, Events$startIndex, Events$endIndex, MoreArgs=list(Pcp=Pcp))) * Timestep/60 
  # event intensity [mm/hr]
  Events$I_mmh <- (Events$Vol/Factor) / Events$D_h 
  Events$P_mm <- (Events$Vol/Factor) # rename column
  # remove data that is not needed
  Events[,c("IsWet","Length","Vol","I15","startIndex")] <- NULL
  # spremenimo vrstni red
  Events <- Events[,c("Date","E","I30","R","P_mm","D_h","I_mmh","endIndex")]
  return(Events) # output
}

Using the function defined above, we determine the erosive events according to the selected criteria and also calculate the rainfall erosivity using the function for energy calculation proposed by Brown and Foster (1987). In the following example, we will also plot a graph of all erosive events. Since the erosivity data are rather asymmetric (a few large events, most events have smaller erosivity), we can also calculate different coefficients of non-uniformity as shown in the paper by Bezak et al.¹⁰³. We will plot the Lorenz curve along the calculated Gini coefficients (G; G* is the normalized Gini coefficient). The range of the Gini coefficient is between 0 and 1, where a value close to 0 means that all erosive events have roughly the same erosivity and a value close to 1 means that there is a large variation between erosive events. In addition, we will also calculate the proportion of events that contribute 50% of the total erosivity.

dogodki <- GetErosiveEvents(dataP[,2],Factor=1,StartDate=as.POSIXct("2010-01-01", tz="CET"),Timestep = 30, SeparationMin=1.27, PcpMin=12.7, Pcp15Min=6.35)

## [1] "Factor: 1"
## [1] "Start date: 2010-01-01"
## [1] "Total rainfall: 13987 [mm]"

head(dogodki) # results

##                    Date        E  I30         R P_mm  D_h     I_mmh endIndex
## 18  2010-02-19 11:30:00 5.900608  6.0 35.403649 55.0 28.5 1.9298246     2432
## 28  2010-02-26 03:00:00 3.304734  5.8 19.167460 29.5 19.0 1.5526316     2732
## 58  2010-03-31 07:30:00 2.743929  6.0 16.463575 24.1 15.5 1.5548387     4316
## 62  2010-04-01 15:30:00 1.547226  4.0  6.188902 14.8 14.5 1.0206897     4378
## 116 2010-04-26 21:00:00 1.529229 13.4 20.491667  9.2  2.0 4.6000000     5564
## 118 2010-05-03 01:30:00 1.405858  3.2  4.498746 14.8 18.0 0.8222222     5893

plot(dogodki[,1],dogodki[,4],xlab="Year",
  ylab="Rainfall erosivity [(MJ*mm)/(ha*h)")

Figure 74: All erosive rainfall events.

boxplot(dogodki[,4], ylab="Rainfall erosivity [(MJ*mm)/(ha*h)")

Figure 75: Boxplot of all events.

library(ineq, quietly=TRUE) 
library(REAT, quietly=TRUE)

## 
## Attaching package: 'REAT'

## The following object is masked from 'package:ineq':
## 
##     conc

## The following object is masked from 'package:readr':
## 
##     spec

# Lorenz curve
lorenz(dogodki$R, lcg = TRUE, lcgn = TRUE, 
lcx = "% of all erosive events (smaller->larger)" ,
lcy = "% of total erosivity", lctitle = "Lorenz curve")

Figure 76: Lorenz curve.

# Gini coefficient
gini(dogodki$R)

## [1] 0.634825

# total erosivity
vsota <- sum(sort(dogodki$R)) 
d50 <- which.min(abs(cumsum(sort(dogodki$R)) - vsota*0.5))/length(dogodki$R) 
d50 # we see that about 10% of erosive events (the largest ones)

## [1] 0.9030471

# contribute 50% of the total erosivity
vsota/11 # and we can also calculate the average annual rainfall erosivity R

## [1] 1931.834

# which is one of the inputs to the RUSLE equation 
# (11 is the number of years of data considered, the period 2010–2020)

4.12 Huff curves

Data on (independent) precipitation events can also be used to determine Huff curves. Huff curves are dimensionless normalised precipitation curves that describe the characteristics of a precipitation event (in which part of the precipitation event we can expect more precipitation)¹⁰⁴. Huff curves can be used as one of the inputs in determining synthetic precipitation events¹⁰⁵. In the following example, Huff curves will be constructed based on events with durations shorter than 12 h and longer than 6 h. In the following example, we will use the Huff curves for events with durations shorter than 12 h and longer than 6 h.

# select events
izbira <- which(dogodki$D_h<12 & dogodki$D_h>6) 
# generate a sequence of normalised values
zaporedje <- seq(from=0,to=1,by=0.05) 
# prepare an empty matrix where we will store the results
# for each event, we will determine the values of the Huff curves
huffvmes <- matrix(NA,length(izbira),length(zaporedje)) 
# calculations will be made as a loop for all events 
for(i in 1:length(izbira)){
# precipitation according to the start and end of the event
padizb <- dataP[(dogodki$endIndex[izbira[i]]-2*dogodki$D_h[izbira[i]]):dogodki$endIndex[izbira[i]],2] 
# sum of precipitation during the event
vsotapad <- sum(padizb) 
for(k in 0:(length(zaporedje))){
# linear interpolation for selected time steps 
# according to normalised rainfall (between two points, 
# for a selected point defined by the xout argument)
huffvmes[i,k] <- approx(seq(0,1,length=length(cumsum(padizb))),
  cumsum(padizb)/vsotapad,xout=zaporedje[k])$y 
}
}
huffvmes[,1] <- 0 # all initial values must be equal to 0
huffvmes[,21] <- 1 # all final values equal to 1 
# graph all normalised curves for all events
plot(zaporedje,huffvmes[1,],type="l",xlab="Normalized time",
  ylab="Normalized rainfall",main="Duration 6–12 h") 
# add all Huff curves for events to the graph
for(g in 2:dim(huffvmes)[1]){ 
lines(zaporedje,huffvmes[g,]) 
}

Figure 77: Huff curves for all events.

The Huff curves can also be used to determine confidence intervals, including 50% empirical confidence intervals. The median, which can be considered as the most probable rainfall distribution for selected rainfall events (of duration between 6 and 12 h), will in turn represent the result of temporal rainfall distribution.

spmeja <- apply(huffvmes,2,quantile,probs=0.25) # lower curve
med <- apply(huffvmes,2,quantile,probs=0.5) # median
zgmeja <- apply(huffvmes,2,quantile,probs=0.75) # upper curve
plot(zaporedje,med,col="red",lty=1,lwd=1,xlab="Normalized time",
  ylab="Normalized rainfall",main="Duration 6–12 h") 
lines(zaporedje,spmeja,col="red",lty=2,lwd=2) # lower curve
lines(zaporedje,zgmeja,col="red",lty=2,lwd=2) # upper curve

Figure 78: Huff curve (median) and confidence intervals.

Task 38: Produce Huff curves for rainfall events with a duration between 12 and 24 h and determine both the median Huff curves and the confidence intervals (10% and 90% percentile curves).

4.13 Stochastic rainfall (and air temperature) simulator

Stochastic generators of weather variables such as precipitation and air temperature, called weather generators, have been developed in recent decades for various purposes, such as hydrological and agronomic applications¹⁰⁶. Such weather generators have been used in many practical applications related to crop phenology and agriculture¹⁰⁷, as well as hydro-climatic studies¹⁰⁸. Weather variable generators can be used to reproduce or generate the time series of selected variables, such as precipitation or air temperature. In some cases it is also necessary to preserve the spatial characteristics of precipitation (and air temperature), in which case it is necessary to ensure the consistency of the weather generator for different locations in space¹⁰⁹. A number of packages for precipitation and air temperature simulations are available as part of the R software environment. Let us show the procedure for using the GWEX package¹¹⁰. First, we will import data from 3 precipitation stations for the period 2015 to 2022 in the area of Slovenia. The data will be transformed into a matrix as required by the GWEX package and the model parameters will be estimated. It is useful to increase the nChainFit parameter argument (according to the example below); this increases the calculation time, while the value of the th argument determines the amount of precipitation used to determine so-called wet days. The simGwexModel function can be used to simulate rainfall for a certain number of days in the future; in our example we simulated 2 realisations of the data for each site (argument nb.rep). More information on the additional parameters can be found in the description of the fitGwexModel function.

postaje3 <- read.table("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Padavine.txt",header=TRUE)  
library(GWEX, quietly=TRUE) 
skupaj <- as.matrix(postaje3[,2:4]) # transform data 
# define the input data according to the requirements of the GWEX package
myObsPrec <- GwexObs(variable='Prec',date=as.Date(postaje3[,1],
  format="%d.%m.%Y"),obs=skupaj) 
# parameter estimation
myparPrec <- fitGwexModel(myObsPrec, listOption=list(th=0.5,nChainFit=1000))

## [1] "Fit generator"

# simulate rainfall
rez <- simGwexModel(myparPrec,nb.rep=2,d.start = as.Date("01012025","%d%m%Y"),
    d.end = as.Date("31122030","%d%m%Y"))

## [1] "Generate scenarios"

str(rez@sim) # look at the structure of the simulated data

##  num [1:2191, 1:3, 1:2] 0 3.42 13.51 0 0 ...

# check the basic statistics of the simulated data 
# in this case realisation number 1 for station 1 (Topol)
summary(rez@sim[,1,1])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.722   2.817 199.176

# realisation number 2 for station 1 (Topol)
summary(rez@sim[,1,2])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.913   2.565 251.864

# realisation number 2 for station 3 (Grosuplje)
summary(rez@sim[,3,2])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.635   1.786 140.426

# plot realisation number 2 for Ljubljana station
plot(rez@date,rez@sim[,2,][,2],type="l",xlab="Days",ylab="Rainfall [mm]",
    main="Station Ljubljana") 
# add realisation number 1 for the same station to the same graph
lines(rez@date,rez@sim[,2,][,1],col="red")

Figure 79: Simulated precipitation for Ljubljana station.

In the same way, we can define a model for air temperature data, where we first read the data (same stations as in the case of precipitation), transform it, estimate the parameters, and then use a stochastic model to generate the data.

postaje3T <- read.table("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/Temperatura.txt",header=TRUE) 
skupajT <- as.matrix(postaje3T[,2:4]) 
myObsTemp <- GwexObs(variable='Temp',date=as.Date(postaje3T[,1],
  format="%d.%m.%Y"),obs=skupajT) 
myparTemp <- fitGwexModel(myObsTemp, listOption=list(hasTrend=TRUE,
  typeMargin='Gaussian',depStation='Gaussian',nChainFit=1000))

## [1] "Fit generator"
## ================================================================================

# generate air temperature
rezT <- simGwexModel(myparTemp,nb.rep=2,d.start = as.Date("01012025",
  "%d%m%Y"),d.end = as.Date("31122030","%d%m%Y"))

## [1] "Generate scenarios"

# plot realisation number 2 for Grosuplje station
plot(rezT@date,rezT@sim[,3,][,2],type="l",xlab="Days",
  ylab="Air temperature[°C]",main="Station Grosuplje") 
# add realisation number 1 for the same station to the same graph
lines(rezT@date,rezT@sim[,3,][,1],col="red")

Figure 80: Simulated air temperature for Grosuplje station.

# proportion of wet days
wet.day.frequency(skupaj,0.5)

##     Topol Ljubljana Grosuplje 
## 0.3186174 0.2997947 0.3042437

It is clear that, compared to stochastically generated precipitation, in the case of air temperature there is a clear seasonal pattern of variation for this variable and that even between the two random data realisations that were generated there is less difference between the generated air temperature data than in the case of precipitation. The GWEX package also includes a number of other useful functions, such as a function to calculate the proportion of wet days according to a selected threshold (wet.day.frequency).

Other packages are also available within the R software framework that allow stochastic simulations of rainfall (and other variables), in particular the RMAWGEN package¹¹¹. Both the GWEX and RMAWGEN packages focus on daily rainfall data. However, if simulations with hourly or 30-minute time steps are required, the HyetosMinute package¹¹² can be used, which is not included in the CRAN repository, though the files for installing the package are freely available¹¹³.

Task 39: Download daily precipitation data for 5 stations in eastern Slovenia (for the period 2020–2023) from meteo.si, define a stochastic precipitation model, and generate 20 random realisations of 3 years of precipitation data for each station.

4.14 Regression trees and clustering

The R programming environment also contains a large number of packages for data mining¹¹⁴. For example, packages such as data.table, dplyr, caret, e1071, xgboost, and randomForest¹¹⁵ are worth mentioning. Let’s first look at an example of the use of hierarchical clustering and the K-means algorithm. The hierarchical clustering method allows for hierarchies in the data to be defined and visualised via a dendrogram¹¹⁶. The same applies for the K-means algorithm¹¹⁷. To apply the hierarchical clustering methods, we will use the dplyr package. We will first normalise the data using the normalize function in the BBmisc package. This will be followed by the calculation of the distances between the data; in this case we will do the analyses on the mtcars data itself. There are several methods to choose from when calculating the distances. For more information, see the description of the dist function that we will use. The calculation of hierarchical clustering will be done with the hclust function. The R programming environment also contains functions to determine the clustering based on the K-means algorithm (kmeans). To visualise the results of the K-means algorithm, we will use the factoextra package.

library(dplyr, quietly=TRUE)

## Warning: package 'dplyr' was built under R version 4.1.3

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:hydrostats':
## 
##     recode

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:xts':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(BBmisc, quietly=TRUE)

## Warning: package 'BBmisc' was built under R version 4.1.3

## 
## Attaching package: 'BBmisc'

## The following objects are masked from 'package:dplyr':
## 
##     coalesce, collapse, symdiff

## The following object is masked from 'package:base':
## 
##     isFALSE

mtcars1 <- normalize(mtcars)
razdalje <- dist(mtcars1, method = 'euclidean')    
gručenje <- hclust(razdalje, method = "average") 
plot(gručenje) # display of results

Figure 81: Hierarchical clustering of mtcars data.

# if you want to group the data into 3 (k=3) groups,
# we can see which elements are grouped in the same groups
cutree(gručenje, k = 3)

##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   1 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   2                   1                   2                   1 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   2                   2                   2                   2 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   2                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   1                   2                   2                   2 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   2                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   3                   1                   3                   1

kmeansrez <- kmeans(mtcars1,centers=3) # k-means
# clustering results are different from the case of hierarchical clustering
kmeansrez$cluster

##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   3 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   2                   3                   2                   3 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   3                   3                   3                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   2                   2                   2                   2 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   2                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   3                   2                   2                   2 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   2                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   2                   1                   2                   1

library(factoextra, quietly=TRUE)

## Warning: package 'factoextra' was built under R version 4.1.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

# plot the 3 groups we have determined based on the K-means algorithm
fviz_cluster(kmeansrez, data = mtcars, 
  palette = c("#2E9FDF", "#00AFBB", "#E7B800"), 
  geom = "point",ellipse.type = "convex", 
  ggtheme = theme_bw())

Figure 82: 3 groups determined by the K-means algorithm.

It is also worth mentioning regression trees and generalized boosted regression trees, which have been used in analysing the influence of meteorological and vegetation parameters on rainfall interception¹¹⁸. Regression trees are one of the machine learning methods that allow you to specify a model to predict the selected variable given the input data. Regression tree models are determined by repeated divisions of the input data with some adjustment of the prediction model to the target variable. The result of the regression tree process can be represented graphically by a decision tree or regression tree¹¹⁹. The rpart package provides the means to build a decision tree within the R software environment. On the other hand, boosted regression trees improve the results of conventional regression trees by testing a large number of models and analysing the results in terms of their performance ¹²⁰. Boosted regression trees can be defined in the R software environment using the gbm package. To demonstrate the use of decision trees, we will use airquality data. We will use those data that include ozone data and only columns 1 to 4. We will transform the data according to ozone thresholds, which indicate different values according to the measured values. The ozone data make up the variable that we are predicting with respect to the other variables. More information on the additional options of the rpart function can be found in the function’s description.

data("airquality") 
vzorec <- airquality 
# select only data where ozone measurements are available
vzorec <- vzorec[-which(is.na(vzorec$Ozone)),1:4] 
# categorisation
vzorec$Ozone <- cut(vzorec$Ozone,breaks=c(0,50,100,150,200),
  labels=c("Green","Yellow","Orange","Red")) # 
library(rpart, quietly=TRUE) 
# package rpart.plot provides several options for plotting regression trees
library(rpart.plot, quietly=TRUE)

## Warning: package 'rpart.plot' was built under R version 4.1.3

# define model
drevo <- rpart(Ozone ~ Solar.R + Wind + Temp,data=vzorec) 
rpart.plot(drevo,type=5)

Figure 83: Decision tree.

We can see that the first element of the decision is the air temperature; if the temperature is below 83 F (about 28 degrees Celsius), the ozone values were low (green), and if the temperature is above 83 F, we can have ozone values in the yellow range. Depending on the wind speed (high wind speed->low ozone values), we can see that in our sample there were too few orange and red values, and the chosen regression tree model was unable to predict these values. This is due to data imbalance, where we have too few data in the orange and red categories. As a consequence, it would have been better to use longer data sets or to use one of the alternatives to draw an appropriate sample. We will now proceed to check the adequacy of the model defined above (despite some shortcomings). We will split the data into two parts, about 70% for calibration and the remainder for validation. We will define the model, only on the calibration part of the data. We will estimate the model’s goodness of fit according to the equation (TP + TN)/ (TP + TN + FP + FN), where TP is true positive, TN is true negative, FP is false positive, and FN is false negative, i.e. TP and TN are the correctly predicted values, FP and FN are the incorrectly predicted values, and the sum of TP + TN + FP + FN is the sum of all predictions.

# to ensure that the randomly generated numbers are always the same
set.seed(1) 
# out of 116 data, use 80 data to calibrate the selected model
izbira <- sample.int(116,80) 
# select a fraction of the data to calibrate, about 70%
kalib <- vzorec[izbira,] 
# remainder for validation
valid <- vzorec[-izbira,] 
# define model
drevokalib <- rpart(Ozone ~ Solar.R + Wind + Temp,data=kalib) 
# for the validation part, use the model defined earlier and predict the values
napoved <- predict(drevokalib, valid, type = 'class')
# combine data
tabela <- table(valid$Ozone,napoved) 
# check results, diagonal values are correctly predicted values
tabela

##         napoved
##          Green Yellow Orange Red
##   Green     26      1      0   0
##   Yellow     2      6      0   0
##   Orange     0      1      0   0
##   Red        0      0      0   0

# calculate statistics
accuracy_Test <- sum(diag(tabela)) / sum(tabela)  
accuracy_Test

## [1] 0.8888889

We can see that our regression tree model was able to correctly predict about 89% of all the days used to validate our model, and it might have been possible to improve the results by using different regression tree settings (rpart). Let us also show an example of the use of boosted regression trees using the gbm package.

library(gbm, quietly=TRUE)

## Warning: package 'gbm' was built under R version 4.1.3

## Loaded gbm 2.1.8.1

# define the generalized boosted regression tree model
gbm1 <- gbm(formula = Ozone ~ Solar.R + Wind + Temp, 
  distribution = "gaussian", data = vzorec, n.trees = 1500, 
  shrinkage = .005, cv.folds=5) 
# check best iteration
best.iter <- gbm.perf(gbm1, method="OOB")

## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.

Figure 84: Best iteration of the boosted regression tree model.

# see results
summary(gbm1)

Figure 85: Relative influence of individual variables on ozone values.

##             var  rel.inf
## Temp       Temp 41.39562
## Wind       Wind 40.57956
## Solar.R Solar.R 18.02482

It can be seen that the influence (on ozone values) of air temperature and wind speed is much larger than that of solar radiation, and even in the case of generalized boosted regression trees it is possible to split the sample into two parts and then predict the data for the independent part of the data using the predict function included in the gbm package.

As a point of interest, let us also show an example of the application of the k-nearest neighbour (KNN) algorithm¹²¹ to numerical flow data included in the airGR package. The KNN algorithm is included in the class package. We will calibrate the KNN model on a part of the data and then use the KNN algorithm to predict the data values for the other part of the sample and compare the estimated values with the measured values. Various machine learning methods with classical hydrological models were made by Sezen et al. (2019)¹²².

library(airGR, quietly=TRUE) 
data(L0123001) # use of airGR data 
# select part of the flow data for model calibration
kalib2 <- BasinObs[5000:6000,c(2,3,4,6)] 
# for validation
valid2 <- BasinObs[6001:6500,c(2,3,4,6)] 
library(class, quietly=TRUE) 
# using the KNN algorithm to predict discharge, 
# use the 3 closest values (k=3)
knnnapoved <- knn(train=kalib2,test=valid2,cl=kalib2$Qmm,k=3) 
# comparison of measured and predicted values
plot(valid2$Qmm,as.numeric(as.character(knnnapoved)),
    xlab="Q measured [mm]",ylab="Q predicted [mm]")

Figure 86: Comparison of KNN algorithm results with measurements.

# and calculating the value of the coefficient of determination R^2
cor(valid2$Qmm,as.numeric(as.character(knnnapoved)))^2

## [1] 0.8210452

In a similar way, some other machine learning algorithms could be used, such as the random forest algorithm (randomForest package).

Task 40: Using data from the Veliko Širje water gauging station (Savinja, year 2005), evaluate the performance of the regression tree algorithm for predicting suspended solids concentrations (based on flow and water temperature data), using part of the data (approx. 80%) for model calibration and the remainder for validation.

4.15 Spatial data

The R software environment includes a large number of packages that can be used to perform various analyses using spatial data. There is also a lot of useful material covering this area, such as the book Applied Spatial Data Analysis with R¹²³, or the website rspatial.org¹²⁴, where many practical examples are also shown. For a deeper theoretical understanding of the field, it is also worth mentioning the book Introduction to Geostatistics¹²⁵ or the textbook Spatial Statistics¹²⁶ by Prof. Goran Turk. It is worth mentioning, however, that the data imported into R is stored in memory, which consequently limits very large calculations (either spatially or temporally), and that R was not originally developed for analysis and visualisation of maps (such as GIS programs), but some packages nonetheless include functions that (partially) allow this. To name a few packages that include useful functions for data analysis and geostatistical operations such as sp, sf, rgdal, raster, terra, netcdf4, rgeos, maptools, gstat, geoR, Fields, spatialExtremes, spacetime, and spBayes. The raster package is no longer in development and has been replaced by the terra package, and also the rgdal, rgeos and maptools packages are no longer available on the CRAN repository. Some demonstrations of useful functions will be made using data for the Sora River basin up to the Suha water gauging station¹²⁷. First, some basic functions for importing and storing data and displaying spatial data will be shown. Most of the functions shown below are derived from the terra package.

library(terra, quietly=TRUE) 
library(raster, quietly=TRUE) 
# data import
porecje <- vect("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/4200_Sora_Suha.shp")
# check basic properties of catchment area
porecje

##  class       : SpatVector 
##  geometry    : polygons 
##  dimensions  : 1, 2  (geometries, attributes)
##  extent      : 421172, 449307.4, 92142.4, 127056.8  (xmin, xmax, ymin, ymax)
##  source      : 4200_Sora_Suha.shp
##  coord. ref. : gauss_krueger_SLO 
##  names       : opomba      AREA
##  type        :  <chr>     <num>
##  values      :     NA 5.689e+08

# to store vector data in the terra package 
# we can use the writeVector function
# read also the rainfall station data
postaje <- vect("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/Stations2.shp") 
# check coordinated of stations
crspost <- crs(postaje) 
# change the coordinate system
porecje2 <- project(porecje,crspost) 
# we can also use epsg codes: https://epsg.io/3794, 
# 3787 is the old coordinate system in Slovenia
porecje3 <- project(porecje, "+init=epsg:3794") 
# check the difference between two systems
plot(porecje); lines(porecje3,col="red")

Figure 87: The watershed of the Sora (Suha) river basin.

plot(porecje,lwd=3, lty=2,) # draw a watershed
# add the locations of the selected stations to the graph
points(postaje, col="blue", pch=16, cex=2) 
# an alternative option is plot(stations, add=TRUE)
# add a north arrow to the plot 
# the function is available in newer versions of terra
north(type=2,cex=1.5,"topleft") 
# adding legend as well
sbar(10000, c(435000, 95000), type="bar", divs=2, below="km",label=c(0,5,10))

Figure 88: The Sora river basin and selected rainfall stations.

Let’s also show how to check the attributes of the rainfall stations. The individual elements of the attribute table can be accessed similarly to the data frames, or matrix referencing also works. We will add a new column to the data we have imported, representing the average annual rainfall for the selected rainfall stations for the period 2016–2021 according to the data available in the ARSO archive.

as.data.frame(postaje) # attribute table

##          Station    GKX    GKY
## 1          Topol 451758 105735
## 2    _kofja Loka 445736 114445
## 3      _elezniki 436375 119269
## 4   Martinji Vrh 433179 115743
## 5        Poljane 437385 108787
## 6           _iri 432255 100816
## 7 Zgornja Sorica 425430 120227
## 8          Dav_a 428474 117475

postaje$Station # individual attribute

## [1] "Topol"          "_kofja Loka"    "_elezniki"      "Martinji Vrh"  
## [5] "Poljane"        "_iri"           "Zgornja Sorica" "Dav_a"

head(geom(porecje)) # geometric properties

##      geom part        x        y hole
## [1,]    1    1 439632.6 105203.2    0
## [2,]    1    1 439616.5 105162.4    0
## [3,]    1    1 439610.6 105115.3    0
## [4,]    1    1 439594.4 105077.0    0
## [5,]    1    1 439571.0 105002.4    0
## [6,]    1    1 439538.2 104969.0    0

# add a new column to the attribute table
postaje$padavine = c(1684, 1575, 1803, 2347, 1620, 1917, 2224, 1950)  
# stations$GKX = NULL # in case you want to delete a specific column
expanse(porecje, unit="km") # area of the catchment in selected units

## [1] 569.0504

perim(porecje) # the perimeter of the river basin

## [1] 146261.2

# aggregate(porecje, dissolve=T) # if you have several river basins 
# they could be merged, the expanse function is also useful
postbuf <- buffer(postaje, 1500) # of areas around stations (1500 m perimeter)
# combine different data into one layer 
# mention the intersect and symdif functions
plot(union(porecje,postbuf))

Figure 89: The Sora river basin and rainfall stations with a 1500 m perimeter around the station.

Raster data can be imported into the R environment in a similar way. We will import digital elevation model data for the same river basin. First, some functions for plotting raster data will be shown.

# data import
teren <- rast("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/SoraSuha.sdat")  
# to store raster data in terra 
# we can use the writeRaster function
print(teren) # basic properties of the raster layer used

## class       : SpatRaster 
## dimensions  : 349, 280, 1  (nrow, ncol, nlyr)
## resolution  : 100, 100  (x, y)
## extent      : 421212, 449212, 92172, 127072  (xmin, xmax, ymin, ymax)
## coord. ref. : gauss_krueger_SLO 
## source      : SoraSuha.sdat 
## name        :  SoraSuha 
## min value   :  333.4742 
## max value   : 1650.1703

# plot a digital elevation model, the plot function also 
# allows many other settings 
plot(teren, axes=T, grid=T)

Figure 90: Digital elevation model.

# a slightly more advanced graph
plot(teren, axes=T, breaks=c(300,600,900,1200,1500,1800),
   col = colorRampPalette(c("blue", "green", "red"))(12))
north("topleft") # add an arrow for North

Figure 91: Digital elevation model in classes.

# altitude information for each cell 
head(as.data.frame(teren))

##     SoraSuha
## 113 1328.688
## 114 1322.766
## 115 1328.819
## 392 1327.994
## 393 1325.541
## 394 1312.359

# we can also use the values(teren) function
hist(teren) # you can also print a histogram of the data

Figure 92: Histogram of altitude data.

You can also plot a hypsometric watershed curve and perform many useful operations such as aggregate or disaggregate. Other functions worth mentioning are mosaic (combining a large number of raster data into a single layer), focal, crop and merge. The following example will also show the classification of data into classes. We will change the raster values of the cells according to the boundaries (lower boundary and upper boundary) and according to the new values (3 numbers for the lower and upper boundary); in our example, we will have 5 classes.

# hypsometric curve
plot(ecdf(as.numeric(values(teren))),xlab="Elevation",
    ylab="Proportion of area",main="Hypsometric curve")

Figure 93: Hypsometric curve of the river basin.

# calculate the individual quartile values
quantile(as.numeric(values(teren)),probs=c(0.25,0.5,0.75),na.rm=TRUE)

##      25%      50%      75% 
## 563.6654 692.8478 830.6627

# check which cell is located at certain coordinates
cellFromXY(teren, data.frame(x=436000,y=110000))

## [1] 47748

# reverse as above
xyFromCell(teren,47748)

##           x      y
## [1,] 435962 110022

# change the resolution of the data (resampling)
teren2 <- aggregate(teren, fact=4, fun=mean) 
# reverse of above, data disaggregation
teren3 <- disagg(teren2, fact=4, method="near")
# data classification
terenklas <- classify(teren, rbind(c(300,600,1),c(600,900,2),c(900,1200,3),
  c(1200,1500,4),c(1500,1800,5)))  
plot(terenklas) # display the modified data

Figure 94: Raster data divided into 5 classes.

Interactions and operations between raster and vector data are also often of interest. A polygon can be transformed into raster form with the rasterize function; the inverse function is as.polygons, which allows for transformation from raster to vector data. The mask, crop, extract, etc. functions are also useful.

# from vector to raster data
porecje_rast <- rasterize(porecje, project(teren, crs(porecje))) 
# mask function, coordinate system must be the same
plot(mask(teren,postbuf))

Figure 95: Elevation around rainfall stations.

# evaluate the values for each location 
# the first station is not located in the Sora river basin
extract(teren,postaje)

##   ID  SoraSuha
## 1  1        NA
## 2  2  379.1530
## 3  3  784.2018
## 4  4 1051.6125
## 5  5  420.5624
## 6  6  529.8202
## 7  7  806.4953
## 8  8  842.2570

# the same for larger areas around rainfall stations, 
# where an average is then calculated from all the cells
extract(teren,postbuf,mean)

##   ID  SoraSuha
## 1  1       NaN
## 2  2  419.3978
## 3  3  676.0768
## 4  4 1011.1084
## 5  5  478.5840
## 6  6  554.3724
## 7  7  841.8691
## 8  8  858.5206

# import the lines of the river network
vodotoki <- vect("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/Vodotoki.shp") 
# applying the crop function to specific sections of river reaches (2:5)
plot(crop(teren,vodotoki[2:5,]))

Figure 96: Using the crop function.

In some cases, however, we also need to make analyses based on satellite data, as in the example presented by Parajka et al.¹²⁸. MODIS data related to snow cover can be obtained from NASA’s website¹²⁹. The following example will show the use of MODIS snow cover data. The MODIS snow cover data is in the following format: cells with a value of 0-40 are snow-free areas, cells with a value of 41-100 are snow-covered areas, and cells with a value of 101-300 are cloud-covered areas. We will make a cross-section with the Sora river basin and calculate the number of cells with clouds, as well as with snow and without snow.

MODISsneg <- rast("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/MOD10A1.A2000057.Snow_Cover_Daily_Tile.tif") 
# MODIS data structure
plot(MODISsneg,breaks=c(0,40,100,300),col=c("green","white","blue"))

Figure 97: MODIS snow cover data.

# change the coordinate system of the MODIS data
MODISsneg2 <- project(MODISsneg,crspost) 
# make an intersection with the basin boundary
presek <- crop(MODISsneg2,porecje) 
# intersection
plot(presek,breaks=c(0,40,100,300),col=c("green","white","blue"),legend=F) 
# adding legend
legend("bottom",legend = c("tla", "sneg", "oblaki"), 
  fill = c("green","white","blue"),cex=0.5)
# catchment boundary
lines(porecje,col="red")

Figure 98: MODIS data for the Sora river basin.

# we see that, on this day, almost the entire basin was covered with snow 
# calculate the number of cloud cells
noblak <- freq(classify(presek,c(101,500,1),others=NA),value=1)[3] 
# calculate the number of snow cells
nsneg <- freq(classify(presek,c(40,100,1),others=NA),value=1)[3] 
# calculate the number of cells without snow
ntla <- freq(classify(presek,c(0,40,1),others=NA),value=1)[3] 
# calculate snow cover
nsneg*100/(nsneg+ntla)

##      count
## 1 99.33425

There are also many other MODIS products, some of which can be imported directly into the R programming environment using packages such as MODISTools. The following example will show how to use the MODISTools package to import one of the products (surface temperature) for a selected location given the coordinates. The data will then be used to calculate the average surface temperature for the Sora basin up to the Suha water gauging station.

library(MODISTools, quietly=TRUE) 
# list of all products that can be downloaded
head(mt_products())

##        product
## 1       Daymet
## 2 ECO4ESIPTJPL
## 3      ECO4WUE
## 4       GEDI03
## 5     GEDI04_B
## 6      MCD12Q1
##                                                                          description
## 1 Daily Surface Weather Data (Daymet) on a 1-km Grid for North America, Version 4 R1
## 2               ECOSTRESS Evaporative Stress Index PT-JPL (ESI) Daily L4 Global 70 m
## 3                          ECOSTRESS Water Use Efficiency (WUE) Daily L4 Global 70 m
## 4                GEDI Gridded Land Surface Metrics (LSM) L3 1km EASE-Grid, Version 2
## 5     GEDI Gridded Aboveground Biomass Density (AGBD) L4B 1km EASE-Grid, Version 2.1
## 6              MODIS/Terra+Aqua Land Cover Type (LC) Yearly L3 Global 500 m SIN Grid
##   frequency resolution_meters
## 1     1 day              1000
## 2    Varies                70
## 3    Varies                70
## 4  One time              1000
## 5  One time              1000
## 6    1 year               500

# check what data are available in a product
head(mt_bands("MYD11A2"))

##               band                          description valid_range fill_value
## 1   Clear_sky_days               Day clear-sky coverage    1 to 255          0
## 2 Clear_sky_nights             Night clear-sky coverage    1 to 255          0
## 3    Day_view_angl View zenith angle of day observation    0 to 130        255
## 4    Day_view_time        Local time of day observation    0 to 240        255
## 5          Emis_31                   Band 31 emissivity    1 to 255          0
## 6          Emis_32                   Band 32 emissivity    1 to 255          0
##    units scale_factor add_offset
## 1   <NA>         <NA>       <NA>
## 2   <NA>         <NA>       <NA>
## 3 degree            1        -65
## 4    hrs          0.1          0
## 5   <NA>        0.002       0.49
## 6   <NA>        0.002       0.49

# the selected product and the specified data can now be imported
# according to the selected coordinate (lat, long) and according to the +- distance
# from this coordinate (km_lr and km_ab arguments) and for the selected dates
LC <- mt_subset(product = "MYD11A2", 
 lat = 46.12,
 lon = 14.21,
 band = "LST_Day_1km",
 start = "2023-07-01",
 end = "2023-07-31",
 km_lr = 20,
 km_ab = 20,
 site_name = "Sora",
 internal = TRUE,
 progress = FALSE) 
# transform data into terra packet format
LC_r <- mt_to_terra(df = LC) 
# is for surface temperature data in Kelvin
plot(LC_r)

Figure 99: MODIS surface temperature data.

# select only the first date, 4th of July 2023, change the coordinate system
izbran <- project(LC_r[[1]],crspost) 
# plot a graph, surface temperature in Kelvin
plot(izbran) 
lines(porecje,col="red") # add a river basin boundary

Figure 100: MODIS surface temperature data on a given day.

# calculate the average surface temperature (in K)
extract(izbran,porecje,mean)

##   ID 2023-07-04
## 1  1   298.0038

# we see that the average temperature is about 25 degrees Celsius

Similarly, some other spatial data can be retrieved within the R software environment, and the pRecipe package can also be used to retrieve precipitation data (historical, measured precipitation). A large number of different products are also available on the Climate Copernicus website¹³⁰. In the following example, we will download the data of the product era5, namely the annual values and only the land values (argument domain). The files will be saved in a temporary working directory, whose contents can be checked with the function list.files(tempdir()). We will use data in .nc format, which combines a large number of raster data; it is annual rainfall for the period 1959 to 2021.

library(pRecipe, quietly=TRUE) 
# data download
# download_data("era5", tempdir(), timestep = "yearly",domain="land")  
# data import
padera5 <- rast("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/era5_tp_mm_land_195901_202112_025_yearly.nc")  
# let's look at the basic properties of the read data
padera5

## class       : SpatRaster 
## dimensions  : 720, 1440, 63  (nrow, ncol, nlyr)
## resolution  : 0.25, 0.25  (x, y)
## extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 
## source      : era5_tp_mm_land_195901_202112_025_yearly.nc 
## varname     : tp (Total monthly precipitation) 
## names       : tp_1, tp_2, tp_3, tp_4, tp_5, tp_6, ... 
## unit        :   mm,   mm,   mm,   mm,   mm,   mm, ... 
## time (days) : 1959-01-01 to 2021-01-01

# plot a graph for 1959 (the first year of the data)
plot(padera5[[1]])

Figure 101: Annual precipitation 1959, ERA5.

# change the coordinate system for our watershed, 
# so that it is the same as the ERA5 data, 
# use the geographical coordinate system
porecje4 <- project(porecje,crs(padera5)) 
# only extract a small area from our data
plot(padera5[["tp_1"]],ext=c(13,15,45,47)) 
# add a river basin boundary to the graph 
lines(porecje4,col="red")

Figure 102: Annual precipitation, Sora basin, ERA5.

# we see that the data resolution is very coarse and 
# that we have practically only one cell in the area
# let's check which cells we have in our basin
rez <- extract(padera5[[1]],porecje4,xy=TRUE,touches=TRUE,weights=TRUE) 
rez

##   ID     tp_1      x      y      weight
## 1  1 1415.586 14.125 46.375 0.043502964
## 2  1 1839.068 13.875 46.125 0.007647287
## 3  1 1649.994 14.125 46.125 0.799024283
## 4  1 1319.189 14.375 46.125 0.170440718
## 5  1 1696.560 14.125 45.875 0.039453870

# calculate the precipitation by taking the weights into account
sum(rez$tp_1*rez$weight)

## [1] 1685.811

# empty object where to store the data
dummy <- rep(NA,dim(padera5)[3]) 
# check the average rainfall in the area
# consider the cell with the largest share in the Sora basin  
for(i in 1:dim(padera5)[3]){
dummy[i] <- as.numeric(extract(padera5[[i]],porecje4)[2]) 
}
plot(as.Date(time(padera5)),dummy,type="l",xlab="Year",
  ylab="Annual precipitation",main="Sora-ERA5")

Figure 103: Annual precipitation in the Sora basin over a longer period of time.

When plotting the graph, we used the time function to get the time data, and the names, ext, crs, nlyr, values, etc. functions are also useful.

As a point of interest, let us also show an example of determining a catchment area from digital elevation model data, where we will use some of the functions included in the whitebox package. The catchment determination will include the following steps: filling in the local depressions, flow accumulation in the D8 computational grid, calculating the flow direction according to the D8 algorithm, generating the river grid, and generating the catchment area according to the selected coordinate, where the coordinates of the Železniki water gauging station on the Selška Sora River will be used. The whitebox package works not by loading the data into the R environment, but by reading and storing the files directly on disk, i.e. in the selected working directory.

library(whitebox, quietly=TRUE) 
# also install the whitebox package without using the # sign
# whitebox::install_whitebox() 
# set the working file 
setwd("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS") 
# fill local dpressions
wbt_breach_depressions_least_cost(dem = "SoraSuha2.tif",
  output = "dem_breached.tif",dist = 500,fill = TRUE) 
# fill depressions based on selected algorithm
wbt_fill_depressions_wang_and_liu(dem = "dem_breached.tif",
  output = "terenzap.tif") 
# flow accumulaction
# based on D8 algorithm for all cells
wbt_d8_flow_accumulation(input = "terenzap.tif",output = "terenD8FA.tif") 
# calculate the direction of the flow according to algorithm D8
wbt_d8_pointer(dem = "terenzap.tif", output = "terenD8smer.tif") 
# based on terrain (and D8 algorithm and flow accumulation) 
# generate a river network
wbt_extract_streams(flow_accum = "terenD8FA.tif", 
  output = "teren_vodotoki.tif",threshold = 20) 
# read data into the R environment using the terra package
teren_vodotoki <- rast("teren_vodotoki.tif") 
# transform into a polygon
teren_vodotoki_pol <- as.polygons(teren_vodotoki)
# compare the generated and the actual network of rivers
plot(teren) 
lines(teren_vodotoki_pol) # generated
lines(vodotoki,col="red")  # actual

Figure 104: Comparison of generated and actual river network.

# coordinates of Železniki station
koordinata <- data.frame(x=435710,y=120100) 
# transform to vector
tocka <- SpatialPoints(coords = koordinata, proj4string = CRS(crspost))

## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum hermannskogel in Proj4 definition

# store the coordinate in a folder in the workspace
shapefile(tocka,"tocka.shp",overwrite=TRUE)
# find the nearest point given the station coordinate 
# on the generated river grid, snap_dist is the distance 
# in units of the map (in our case, metres)
wbt_jenson_snap_pour_points(pour_pts = "tocka.shp",
  streams = "teren_vodotoki.tif",output = "snaprez.shp", snap_dist = 100) 
# determine the catchment area based on the point defined earlier
wbt_watershed(d8_pntr = "terenD8smer.tif",pour_pts = "snaprez.shp",
  output = "rezultat-por.tif") 
# data import in R
zelezniki <- rast("rezultat-por.tif") 
# transform to polygon
zelezniki_pol <- as.polygons(zelezniki) 
plot(teren) # plot terrain
# add the catchment area to Železniki station
plot(zelezniki_pol,add=TRUE)

Figure 105: Catchment area to Železniki water gauging station.

# map the Sora river basin to Železniki station
plot(mask(teren,zelezniki_pol))

Figure 106: Digital model upstream of Železniki station.

The whitebox package also contains a number of other useful GIS functions, such as functions to calculate various indices (e.g. shape index or sediment transport index) and other characteristics. A list of these functions can be found in the description of the ?whitebox package. Also worth mentioning is the RSAGA package, which allows the use of some of the functions included in the SAGA GIS software¹³¹. Various methods of spatial interpolation of data can also be demonstrated. More theoretical information is available, for example, in the book An Introduction to Applied Geostatistics¹³². Some interpolation methods will be demonstrated using the functions included in the sp and raster packages, and the terra package also includes some interpolation functions, but these differ slightly from those shown below. Interpolation using different methods (e.g. Krigin, Thiesson polygons, etc.) will be shown.

library(sp, quietly=TRUE) 
library(gstat, quietly=TRUE) 
library(raster, quietly=TRUE) 
# import the terrain data again using the raster package 
teren1 <- raster("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/SoraSuha.sdat")

## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Militar-Geographische Institut in Proj4 definition

# add altitude to the attribute data
postaje$teren1 <- teren[cellFromXY(teren, data.frame(postaje[,2:3]))][,1] 
# transform the terrain into the shape to be used for interpolation
grid <- as(teren1, 'SpatialPixelsDataFrame') 
# create data frame
postajexy <- data.frame(pad=postaje$padavine,SoraSuha=postaje$teren1,x=geom(postaje)[,3],y=geom(postaje)[,4]) 
# transform to format SpatialPointsDataFrame
coordinates(postajexy) <- ~x+y 
# define coordinate system
crs(postajexy) <- crs(teren1) 
# using the Thiessen polygons method (idp=1 and nmax=1), 
# the use of the other arguments idp and nmax also allows 
# interpolation using the nearest neighbour method 
prec_nn <- idw(pad ~ 1, postajexy, grid, idp=1, nmax=1)

## [inverse distance weighted interpolation]

# transform into raster data format
prec_nn <- as(prec_nn, 'RasterLayer') 
# plot
spplot(prec_nn, main = "Thiesson polygons")

Figure 107: Thiesson polygons.

# IDW method, idp argument is the weight
prec_idw <- idw(pad ~ 1, postajexy, grid, idp = 2)

## [inverse distance weighted interpolation]

# transform into raster
prec_idw <- as(prec_idw, 'RasterLayer') 
# plot
spplot(prec_idw, main = "Metoda IDW")

Figure 108: IDW method.

# demonstrate the use of kriging
# define the empirical variogram
emperical.var <- variogram(pad~1,postajexy) 
# estimate the theoretical variogram where 
# check the exponential and spherical functions
theoretical.var <- fit.variogram(emperical.var, model=vgm("Exp", "Sph"), 
  fit.sills = T, fit.ranges = T)

## Warning in fit.variogram(emperical.var, model = vgm("Exp", "Sph"), fit.sills =
## T, : No convergence after 200 iterations: try different initial values?

# plot both variograms
plot(emperical.var,theoretical.var)

Figure 109: Variogram.

# we see that the results are not the best, which is the result of 
# the relatively high spatial variability of precipitation 
# we nevertheless show an example of the use of ordinary kriging
prec_ok <- krige(pad ~ 1, postajexy, grid, model = theoretical.var, 
  nmax = 40, nmin = 2, maxdist = 100e3)

## [using ordinary kriging]

# transform into raster
prec_ok <- as(prec_ok, 'RasterLayer') 
# plot
spplot(prec_ok, main = "Običajni kriging")

Figure 110: Normal kriging.

# External drift krigin (elevation)
prec_edk <- krige(pad ~ SoraSuha, postajexy, grid, model = theoretical.var, 
  nmax = 40, nmin = 2, maxdist = 100e3)

## [using universal kriging]

# transform into raster
prec_edk <- as(prec_edk, 'RasterLayer') 
 # plot
spplot(prec_edk, main = "External drift kriging")

Figure 111: External drift kriging.

It can be seen that the interpolation results are unlikely to be of high quality, due to the relatively high spatial variability over a small area or the (too) small number of stations. We also show an example of cross-validation (omitting one station each time, repeating the spatial interpolation procedure and calculating the difference between actual and estimated rainfall) for the spatial interpolation methods shown earlier.

# argument nfold is the number of stations, first for normal kriging
locv_ok <- krige.cv(pad~1, postajexy, theoretical.var, nmax = 40, 
  nmin = 2, maxdist = 100e3, nfold = 8) 
# results
head(locv_ok)

##   var1.pred var1.var observed   residual     zscore fold
## 1  1906.335 174230.7     1684 -222.33513 -0.5326547    1
## 2  1921.339 173688.0     1575 -346.33923 -0.8310302    2
## 3  1925.517 169022.5     1803 -122.51719 -0.2980058    3
## 4  1823.621 165595.9     2347  523.37863  1.2861489    4
## 5  1920.963 172823.3     1620 -300.96295 -0.7239556    5
## 6  1870.298 174062.8     1917   46.70233  0.1119401    6

# we can see the differences between actual and predicted precipitation estimates
# External drift kriging
locv_edk <- krige.cv(pad~SoraSuha, postajexy, theoretical.var, nmax = 40, 
  nmin = 2, maxdist = 100e3, nfold = 8) 
# Thiesson polygons
locv_nn <- krige.cv(pad~1, postajexy,  model = theoretical.var, nmin = 1, 
  nmax = 1, maxdist = 100e3, nfold = 8)
# IDW method
locv_idw <- krige.cv(pad~1, postajexy, nfold = 8) 
# combine data
razlike <- cbind(EDK = locv_edk$residual, OK = locv_ok$residual, 
  TP = locv_nn$residual, IDW = locv_idw$residual) 
# boxplot
boxplot(razlike , main = "Differences-cross validation", 
  ylim = c(-600, 600),ylab="Precipitation [mm]") 
# add a horizontal line
abline(h = 0, col = "red", lty = 2)

Figure 112: Cross-validation results.

As mentioned above, the interpolation results were not great.

Task 41: Based on the digital elevation model data, determine the slope map using the terrain function in the terra package. Calculate the basic slope statistics.

4.16 Rainfall-runoff modelling

Rainfall-runoff modelling is one of the most widely used practical challenges in water management or hydrology. There are various models (e.g. conceptual, empirical, physically based) that can be used to model the transformation of rainfall into surface runoff. One of these models is the GR4J model (modèle du Génie Rural à 4 paramètres Journalier)¹³³ and its improvements. A description of the model’s history and development (with relevant references) is available on the INRAE website¹³⁴. The GR4J model is a lumped conceptual hydrological model that allows for simulations of flows based on rainfall and potential evapotranspiration data. An Excel version is also available (with limited capabilities). The GR4J model uses 4 parameters (X1, X2, X3, and X4). The model is included in the airGR and airgrteaching packages. In these packages, functions for upgraded versions of hydrological models (e.g. GR5J and GR6J), the CemaNeige snow module, and functions for model calibration, etc. are also available ¹³⁵. There is also the airGRdatasim package for data assimilation and the airGRiwrm package for integrated water resources management¹³⁶. In the following example, the procedure of using GR4J and some improved versions will be presented, based on the data used in applying the model for predicting shallow landslide triggering in the Sora basin¹³⁷. The data to be used will include data on discharge (m3/s and mm of the Železniki station), precipitation (Davča station), and air temperature (Bohinjska Češnjica). Based on the air temperature data, we will also calculate potential evapotranspiration values. The calibration procedure of the GR4J model for the period 2005–2010 will be shown below first, with the first year of data used to warm up the model.

library(airGR, quietly=TRUE) 
data <- read.table("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/airGR/Data-Example.txt",header=T) 
head(data) # look at the first few lines of data

##       Datum Q_m3s Q_mm P_mm  T_C
## 1 1.01.2005 5.711 4.74    0 -2.6
## 2 2.01.2005 5.452 4.53    0 -1.0
## 3 3.01.2005 4.696 3.90    0 -2.7
## 4 4.01.2005 4.451 3.69    0 -2.8
## 5 5.01.2005 4.209 3.49    0 -2.1
## 6 6.01.2005 4.209 3.49    0 -2.0

summary(data) # basic statistics of the data

##     Datum               Q_m3s             Q_mm             P_mm        
##  Length:4383        Min.   : 0.402   Min.   : 0.330   Min.   :  0.000  
##  Class :character   1st Qu.: 1.405   1st Qu.: 1.180   1st Qu.:  0.000  
##  Mode  :character   Median : 2.345   Median : 2.010   Median :  0.000  
##                     Mean   : 4.028   Mean   : 3.417   Mean   :  5.168  
##                     3rd Qu.: 4.447   3rd Qu.: 3.840   3rd Qu.:  3.900  
##                     Max.   :76.227   Max.   :63.270   Max.   :227.900  
##       T_C         
##  Min.   :-13.400  
##  1st Qu.:  2.200  
##  Median :  9.500  
##  Mean   :  8.884  
##  3rd Qu.: 15.300  
##  Max.   : 27.300

plot(data[,3],type="l") # plot the flow data

Figure 113: Measured flow data.

# calculate potential evapotranspiration using the Oudin equation 
# based on air temperature data and station location 
potET <- PE_Oudin(JD = as.POSIXlt(strptime(data[,1], "%d.%m.%Y"))$yday, 
  Temp = data[,5], Lat = 46.2, LatUnit = "deg")
summary(potET) # look at the calculations

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.433   1.500   1.805   3.069   5.175

 # define input data (precipitation, potential evapotranspiration) 
# and the modelling period and model to be used (GR4J)
InputsModel <- CreateInputsModel(FUN_MOD = RunModel_GR4J, DatesR = as.POSIXct(strptime(data[,1], "%d.%m.%Y")), Precip = data[,4], PotEvap = potET)
# find the point in the data when 2010 ends
konec <- which(format(data[,1], format = "%d.%m.%Y")=="31.12.2010") 
# define the sequence, the first year (2005) will be used
# to warm up the model
Ind_Run <- seq(366, konec) 
# define model settings, warm-up and calibration period
RunOptions <- CreateRunOptions(FUN_MOD = RunModel_GR4J, InputsModel = InputsModel, IndPeriod_Run = Ind_Run, IndPeriod_WarmUp = 1:365) 
# select a criterion to check model fit
# (Nash-Sutchliffe criterion) and select the flow data
InputsCrit <- CreateInputsCrit(FUN_CRIT = ErrorCrit_NSE, InputsModel = InputsModel, RunOptions = RunOptions, Obs = data[366:konec,3]) 
# choose a model calibration algorithm (Michel algorithm)
CalibOptions <- CreateCalibOptions(FUN_MOD = RunModel_GR4J, FUN_CALIB = Calibration_Michel) 
# calibrating the model according to the settings defined above 
OutputsCalib <- Calibration_Michel(InputsModel = InputsModel, RunOptions = RunOptions, InputsCrit = InputsCrit, CalibOptions = CalibOptions, FUN_MOD = RunModel_GR4J)

## Grid-Screening in progress (0% 20% 40% 60% 80% 100%)
##   Screening completed (81 runs)
##       Param =  432.681,   -2.376,   83.096,    1.417
##       Crit. NSE[Q]       = 0.4598
## Steepest-descent local search in progress
##   Calibration completed (27 iterations, 261 runs)
##       Param =  677.881,   -0.621,  118.217,    0.500
##       Crit. NSE[Q]       = 0.7363

# save the parameters we have determined in the calibration process
ParamGR4J <- OutputsCalib$ParamFinalR 
ParamGR4J # let's look at the calibrated parameter values

## [1] 677.8810135  -0.6208393 118.2173963   0.5000000

# run the model with calibrated parameters (for the period 2005–2010)
OutputsModel <- RunModel_GR4J(InputsModel = InputsModel, RunOptions = RunOptions, Param = ParamGR4J) 
# draw a comparison between measured and modelled data
plot(OutputsModel, data[366:konec,3])

Figure 114: GR4J model calibration results.

We now proceed to validating the hydrological model for the period 2011–2016, defining the sequence of data to be used for the validation, running the model with the calibrated parameters for the validation period, and checking the goodness of fit during the validation period. The comparison will be made both graphically and based on the numerical goodness-of-fit criterion of the tested rainfall-runoff hydrological model (NSE). The optimal value of the NSE criterion would be 1, and a value of 0.61 represents a relatively good fit between the measured and modelled data.

Ind_Run1 <- seq((konec+1),dim(data)[1]) 
RunOptions1 <- CreateRunOptions(FUN_MOD = RunModel_GR4J, InputsModel = InputsModel, IndPeriod_Run = Ind_Run1)

## Warning in CreateRunOptions(FUN_MOD = RunModel_GR4J, InputsModel = InputsModel, : model warm up period not defined: default configuration used
##   the year preceding the run period is used

OutputsModel1 <- RunModel_GR4J(InputsModel = InputsModel, RunOptions = RunOptions1, Param = ParamGR4J) 
plot(OutputsModel1, Qobs = data[Ind_Run1,3])

Figure 115: GR4J model validation results.

InputsCrit1  <- CreateInputsCrit(FUN_CRIT = ErrorCrit_NSE, InputsModel = InputsModel,RunOptions = RunOptions1, Obs = data[Ind_Run1,3]) 
OutputsCrit1 <- ErrorCrit_NSE(InputsCrit = InputsCrit1, OutputsModel = OutputsModel1)

## Crit. NSE[Q] = 0.6102

OutputsCrit1

## $CritValue
## [1] 0.610208
## 
## $CritName
## [1] "NSE[Q]"
## 
## $CritBestValue
## [1] 1
## 
## $Multiplier
## [1] -1
## 
## $Ind_notcomputed
## NULL
## 
## attr(,"class")
## [1] "NSE"       "ErrorCrit"

The airGRteaching package also allows for the interactive use of the GR4J hydrological model (and other models included in the airGR package). The airGRteaching package requires a slightly different data structure.

library(airGRteaching, quietly=TRUE) 
preob <- data.frame(DatesR=as.POSIXct(strptime(data[,1], "%d.%m.%Y"),tz="UTC"),P=data[,4], E=potET,Qmm=data[,3]) 
# start the Shiny app, which opens in your browser
ShinyGR(ObsDF = preob, SimPer = c("2005-01-01", "2010-12-31"))

However, as we have seen, the GR4J model fails to simulate winter flows adequately and has some difficulties in estimating low flows. The GR6J model was developed with the aim of better simulating low flows. For the example of the Selška Sora river basin up to the Železniki gauging station, let us show an example of the use of the GR6J model with the CemaNeige snow module included. The snow module also needs air temperature and the hypsometric curve of the catchment as input data. The procedure for using the CemaNeige GR6J model is similar to the GR4J model. In addition, the MeanAnSolidPrecip argument must be defined, which includes the fraction of snowfall (for the 5 elevation zones) and has been estimated according to the equation provided by Alexopoulos et al. (2023)¹³⁸, caution is advised when estimating this parameter.

library(terra, quietly=TRUE) 
# activate the L0123001 river basin data included in the airGR package 
data(L0123001) 
# hypsometric curve data format
# additionally the format is also written in ?BasinInfo
BasinInfo

## $BasinCode
## [1] "L0123001"
## 
## $BasinName
## [1] "Blue River at Nourlangie Rock"
## 
## $BasinArea
## [1] 360
## 
## $HypsoData
##   [1]  286  309  320  327  333  338  342  347  351  356  360  365  369  373  378
##  [16]  382  387  393  399  405  411  417  423  428  434  439  443  448  453  458
##  [31]  463  469  474  480  485  491  496  501  507  513  519  524  530  536  542
##  [46]  548  554  560  566  571  577  583  590  596  603  609  615  622  629  636
##  [61]  642  649  656  663  669  677  684  691  698  706  714  722  730  738  746
##  [76]  754  762  770  777  786  797  808  819  829  841  852  863  875  887  901
##  [91]  916  934  952  972  994 1012 1029 1054 1080 1125 1278

teren <- rast("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/airGR/terrain-Sora.sdat")
# calculate the hypsometric curve 
hypso <- c(min(as.numeric(values(teren)),na.rm=T),as.numeric(quantile(as.numeric(values(teren)),probs=seq(from=0.01,by=0.01,to=0.99),na.rm=T)),max(as.numeric(values(teren)),na.rm=T)) 
# define model settings, number of height 
# zones for the snow module, the hypsometric curve
InputsModel2 <- CreateInputsModel(FUN_MOD = RunModel_CemaNeigeGR6J, as.POSIXct(strptime(data[,1], "%d.%m.%Y")), Precip = data[,4], PotEvap = potET, TempMean = data[,5],  ZInputs = median(hypso), HypsoData = hypso, NLayers = 5)

## input series were successfully created on 5 elevation layers for use by CemaNeige

RunOptions2 <- CreateRunOptions(FUN_MOD = RunModel_CemaNeigeGR6J, InputsModel = InputsModel2, IndPeriod_Run = Ind_Run, IndPeriod_WarmUp = 1:365, MeanAnSolidPrecip=c(70,130,180,250,350)) 
# define the goodness of fit criterion and discharge data
InputsCrit2 <- CreateInputsCrit(FUN_CRIT = ErrorCrit_NSE, InputsModel = InputsModel2, RunOptions = RunOptions2, Obs = data[366:konec,3]) 
# model calibration method
CalibOptions2 <- CreateCalibOptions(FUN_MOD = RunModel_CemaNeigeGR6J, FUN_CALIB = Calibration_Michel) 
# run the model calibration, the process takes longer than in the case of GR4J
OutputsCalib2 <- Calibration_Michel(InputsModel = InputsModel2, RunOptions = RunOptions2, InputsCrit = InputsCrit2, CalibOptions = CalibOptions2, FUN_MOD = RunModel_CemaNeigeGR6J)

## Grid-Screening in progress (

## 0% 20% 40% 60% 80% 100%)
##   Screening completed (6561 runs)
##       Param =   90.017,   -0.521,   27.113,    1.369,    0.220,   54.598,    0.002,    6.764
##       Crit. NSE[Q]       = 0.5421
## Steepest-descent local search in progress
##   Calibration completed (45 iterations, 7241 runs)
##       Param =  395.412,   -0.174,   61.234,    0.755,    0.113,   15.810,    0.026,   17.247
##       Crit. NSE[Q]       = 0.7401

# save calibrated parameter values
ParamGR6JCemaNeige <- OutputsCalib2$ParamFinalR 
# run the model with calibrated parameter values
OutputsModel2 <- RunModel_CemaNeigeGR6J(InputsModel = InputsModel2, RunOptions = RunOptions2, Param = ParamGR6JCemaNeige) 
# plot
plot(OutputsModel2, data[366:konec,3])

Figure 116: Calibration results of GR6J with the snow module.

# calculate NSE
ErrorCrit_NSE(InputsCrit = InputsCrit2, OutputsModel = OutputsModel2)

## Crit. NSE[Q] = 0.7401

## $CritValue
## [1] 0.7400699
## 
## $CritName
## [1] "NSE[Q]"
## 
## $CritBestValue
## [1] 1
## 
## $Multiplier
## [1] -1
## 
## $Ind_notcomputed
## NULL
## 
## attr(,"class")
## [1] "NSE"       "ErrorCrit"

Let’s calculate the simulated flow values for the validation period. It can be seen that using the GR6J model with the snow module led to a better match between the modelled and measured flows than using only the GR4J model without the snow module.

RunOptions3 <- CreateRunOptions(FUN_MOD = RunModel_CemaNeigeGR6J, InputsModel = InputsModel2, IndPeriod_Run = Ind_Run1)

## Warning in CreateRunOptions(FUN_MOD = RunModel_CemaNeigeGR6J, InputsModel = InputsModel2, : model warm up period not defined: default configuration used
##   the year preceding the run period is used

## Warning in CreateRunOptions(FUN_MOD = RunModel_CemaNeigeGR6J, InputsModel = InputsModel2, : 'MeanAnSolidPrecip' not defined: it was automatically set to c(328,328,328,328,328)  from the 'InputsModel' given to the function. Be careful in case your application is (short-term) forecasting.

OutputsModel3 <- RunModel_CemaNeigeGR6J(InputsModel = InputsModel2, RunOptions = RunOptions3, Param = ParamGR6JCemaNeige)
plot(OutputsModel3, Qobs = data[Ind_Run1,3])

Figure 117: GR6J model validation results with snow module.

InputsCrit3  <- CreateInputsCrit(FUN_CRIT = ErrorCrit_NSE, InputsModel = InputsModel2,RunOptions = RunOptions3, Obs = data[Ind_Run1,3]) 
OutputsCrit3 <- ErrorCrit_NSE(InputsCrit = InputsCrit3, OutputsModel = OutputsModel3)

## Crit. NSE[Q] = 0.6618

OutputsCrit3

## $CritValue
## [1] 0.6618097
## 
## $CritName
## [1] "NSE[Q]"
## 
## $CritBestValue
## [1] 1
## 
## $Multiplier
## [1] -1
## 
## $Ind_notcomputed
## NULL
## 
## attr(,"class")
## [1] "NSE"       "ErrorCrit"

In the same way as for the GR4J, you can also run the Shiny app.

preob1 <- data.frame(DatesR=as.POSIXct(strptime(data[,1], "%d.%m.%Y"),
  tz="UTC"),P=data[,4], E=potET,Qmm=data[,3],T=data[,5]) 
ShinyGR(ObsDF = preob1, SimPer = c("2005-01-01", "2010-12-31"), 
  ZInputs = median(hypso),HypsoData = hypso, NLayers = 5)

The airGR package also includes the GR4H model, which allows for simulations of flows with hourly time steps. As regards this model, it is also worth mentioning that, for example, the PE_Oudin function allows for disaggregation of potential evapotranspiration data (into hourly time steps). In the example below, we will calculate the potential evapotranspiration with an hourly time step and check the basic statistics of the hourly data.

potETh <- PE_Oudin(JD = as.POSIXlt(strptime(data[,1], "%d.%m.%Y"))$yday, 
  Temp = data[,5], Lat = 46.2, LatUnit = "deg",TimeStepIn = "daily",TimeStepOut = "hourly")  
summary(potETh)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07519 0.10802 0.60550

Other packages are available that allow for similar simulations to be performed, an overview of which is given in a paper by Astagneau et al.¹³⁹. One of these packages is TUWmodel, which allows for the use of the TUWmodel model¹⁴⁰. This model is relatively similar in structure to the HBV hydrological model and allows for simulations of discharges¹⁴¹. In the following example, the use of the TUWmodel will be demonstrated, following a presentation by Prof. Parajka¹⁴². First, the model will be run with random parameter values, and then the model will be calibrated using the DEoptim package. More information about the parameters is available in the help for the TUWmodel. We will also activate the hydroGOF package, which includes some functions to assess the model fit.

library(TUWmodel, quietly=TRUE) 
preob1 <- data.frame(DatesR=as.POSIXct(strptime(data[,1], "%d.%m.%Y"),
  tz="UTC"),P=data[,4], E=potET,Qmm=data[,3],T=data[,5]) 
# run the model with randomly selected parameter values
sim1 <- TUWmodel(prec=preob1[,2], airt=preob1[,5], ep=preob1[,3], area=1., 
   param=c(1.3, 2.0, -1.0, 1.0, 0.0, 0.8, 360.0, 0.2, 0.3, 7.0, 150.0, 50.0, 2.0, 10.0, 25.0),
   incon=c(50,0,2.5,2.5))
# check the structure
str(sim1)

## List of 22
##  $ itsteps: int 4383
##  $ nzones : int 1
##  $ area   : num 1
##  $ param  : num [1, 1:15] 1.3 2 -1 1 0 0.8 360 0.2 0.3 7 ...
##   ..- attr(*, "names")= chr [1:15] "SCF" "DDF" "Tr" "Ts" ...
##  $ incon  : num [1, 1:4] 50 0 2.5 2.5
##   ..- attr(*, "names")= chr [1:4] "SSM0" "SWE0" "SUZ0" "SLZ0"
##  $ prec   : num [1:4383] 0 0 0 0 0 0 0 0 0 0 ...
##  $ airt   : num [1:4383] -2.6 -1 -2.7 -2.8 -2.1 -2 -1.6 -2.1 -3.2 -1.2 ...
##  $ ep     : num [1:4383] 0.0991 0.166 0.0959 0.0923 0.1223 ...
##  $ output : num [1, 1:20, 1:4383] 0.00229 0 50 0 0 ...
##  $ qzones : num [1, 1:4383] 0.00229 0.00767 0.01466 0.02324 0.02883 ...
##  $ q      : num [1, 1:4383] 0.00229 0.00767 0.01466 0.02324 0.02883 ...
##  $ swe    : num [1, 1:4383] 0 0 0 0 0 0 0 0 0 0 ...
##  $ melt   : num [1, 1:4383] 0 0 0 0 0 0 0 0 0 0 ...
##  $ q0     : num [1, 1:4383] 0 0 0 0 0 0 0 0 0 0 ...
##  $ q1     : num [1, 1:4383] 0.0434 0 0 0 0 ...
##  $ q2     : num [1, 1:4383] 0.0298 0.0326 0.0324 0.0322 0.032 ...
##  $ moist  : num [1, 1:4383] 50 50 50 50 50 50 50 50 50 50 ...
##  $ rain   : num [1, 1:4383] 0 0 0 0 0 0 0 0 0 0 ...
##  $ snow   : num [1, 1:4383] 0 0 0 0 0 0 0 0 0 0 ...
##  $ eta    : num [1, 1:4383] 0 0 0 0 0 0 0 0 0 0 ...
##  $ suz    : num [1, 1:4383] 0.457 0 0 0 0 ...
##  $ slz    : num [1, 1:4383] 4.47 4.89 4.86 4.83 4.8 ...

# plot the simulated values (for the first year)
plot(as.numeric(sim1$q)[1:365],type="l", col="red",ylab="Discharge [mm]") 
# add measured flow data
lines(preob1[1:365,4], col="green")

Figure 118: TUWmodel calculations with random parameter values.

# we see that the differences are relatively large, 
# which is to be expected as we have not calibrated the model parameters
library(hydroGOF, quietly=TRUE)  
# let's check some basic goodness-of-fit statistics
gof(as.numeric(sim1$q)[1:365], preob1[1:365,4])

##          [,1]
## ME       0.09
## MAE      1.47
## MSE      5.51
## RMSE     2.35
## NRMSE % 84.50
## PBIAS %  3.30
## RSR      0.84
## rSD      1.03
## NSE      0.28
## mNSE     0.17
## rNSE     0.23
## d        0.80
## md       0.62
## rd       0.78
## cp      -0.57
## r        0.65
## R2       0.42
## bR2      0.36
## KGE      0.65
## VE       0.45

# define the function to be used in the optimisation process 
msespr <- function (param, precip, temp, potevap, runoff, area) { 
# run the model
simu <- TUWmodel(param,prec=as.numeric(precip), airt=as.numeric(temp), 
  ep=as.numeric(potevap), area=area)$q 
# first year of data used for model warm-up
simu <- simu[-c(1:365)]  
obse <- runoff[-c(1:365)] 
# calculate the MSE criterion, the alternative would be 
# manual calculation without mean((obse - simu)^2) function
mse(simu,obse) 
}

Now let’s calibrate the model with 1 elevation zone. We will define the settings of the optimisation algorithm (we can also change/increase the itermax argument, which defines the maximum number of optimisation steps, thus increasing the computation time). We will then run the model with the best combination of parameters through the end of 2010 (excluding the first year, which we used for model warm-up) and calculate the NSE criterion. We will see that the model performs better in principle than the GR4J model for the same period and approximately as well as the GR6J CemaNeige model, which we also used.

library(DEoptim, quietly=TRUE)

## Warning: package 'DEoptim' was built under R version 4.1.3

## 
## DEoptim package
## Differential Evolution algorithm in R
## Authors: D. Ardia, K. Mullen, B. Peterson and J. Ulrich

area=1 # use one elevation zone
# model calibration
calibrate_period1 <- DEoptim(fn=msespr, lower=c(0.9, 0.0, 1.0, -3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 2.0, 30.0, 1.0, 0.0, 0.0, 0.0),
upper=c(1.5, 5.0, 3.0, 1.0, 2.0, 1.0, 600.0, 20.0, 2.0, 30.0, 250.0, 100.0, 8.0, 30.0, 50.0),control=DEoptim.control(NP=NA, itermax=60, reltol=1e-4, steptol=50, trace=10, parallelType=0), precip=preob1[1:konec,2], temp=preob1[1:konec,5], potevap=preob1[1:konec,3], runoff=preob1[1:konec,4], area=area)

## Iteration: 10 bestvalit: 5.611297 bestmemit:    0.911294    2.485503    1.399058   -1.700807   -1.092843    0.443469  522.158192    4.386748    1.508317   27.313302  102.445324    8.068820    5.506733    9.553699   36.350001
## Iteration: 20 bestvalit: 5.227388 bestmemit:    1.156470    1.321408    1.117402   -2.484593    1.726505    0.175899  554.152875    4.267478    1.319508    2.543315   63.860740   20.742001    6.636973   13.645317   39.906745
## Iteration: 30 bestvalit: 4.572308 bestmemit:    1.091526    1.321408    1.117402   -1.969616   -1.374502    0.595091  554.152875    4.267478    1.319508    2.543315   63.860740   24.942543    6.636973   13.645317   39.906745
## Iteration: 40 bestvalit: 4.543579 bestmemit:    1.091526    1.321408    1.117402   -1.969616   -1.374502    0.595091  554.152875    4.267478    1.319508    2.543315   63.860740   24.942543    5.907190   13.557868   39.906745
## Iteration: 50 bestvalit: 4.390717 bestmemit:    1.065064    1.321408    1.117402   -2.337340   -1.516016    0.377852  554.152875    4.267478    1.319508    2.543315   63.860740   50.156800    3.925980   13.557868   39.906745
## Iteration: 60 bestvalit: 4.356931 bestmemit:    1.160399    1.579897    1.289074   -2.702604   -1.016168    0.517566  468.275596    3.256204    1.103864    3.247163   63.093022   36.892444    5.911351   13.731370   37.320947

sim2 <- TUWmodel(prec=preob1[1:konec,2], airt=preob1[1:konec,5], 
  ep=preob1[1:konec,3], area=1, param=calibrate_period1$optim$bestmem) 
# plot the simulated values (until the end of 2010, 
# excluding the first year used for model warm-up) 
plot(as.numeric(sim2$q)[366:konec],type="l", col="red",ylab="Discharge [mm]")
# add measured flow data
lines(preob1[366:konec,4], col="green")

Figure 119: Results for the TUWmodel calibration period.

# calculate NSE 
NSE(as.numeric(sim2$q)[366:konec],preob1[366:konec,4])

## [1] 0.8030478

Task 42: Perform calibration of the GR5J model for the period 2005–2011 (using data for the Selška Sora river basin) and validate the model using the calibrated parameters for 2012 and 2013. Show the results of the model validation.

Task 43: Run the GR4J model for 2006 –2007 with data from the Selška Sora river basin and randomly generated model parameters (uncalibrated), using the function set.seed(15) before randomly generating 4 numbers. The ranges of the parameters are: x1: 0-2500 mm, x2: -5->5 mm; x3: 0-1000 mm; x4: 0-10 days. Use a uniform distribution. Run the model and compare the results with the measurements.

Task 44: Run the TUWmodel for the validation period (2011–2016) and graph the results and calculate the NSE criterion.

4.17 Sensitivity and uncertainty analysis

Sensitivity analysis is particularly important in hydrological modelling in terms of understanding and interpreting model performance. Such analyses aid our understanding of how changes in input parameters affect model results. This understanding is crucial for explaining model behaviour, for identifying the key factors associated with the analyses, and for gaining insight into the simulated hydrological processes. The results of these analyses can thus guide the selection of parameters that should be given special emphasis in the model calibration process, by identifying those that have the greatest impact on the model results. Sensitivity analysis further helps in quantifying the uncertainty associated with different parameters, allowing for a better understanding of model prediction reliability. This is also related to the acquisition of input data, as for parameters that have a large impact on model performance it makes sense to obtain the highest quality data possible, both in terms of uncertainty and temporal resolution. As an example of sensitivity analyses, we will consider a study of a simple soil erosion model¹⁴³. The procedure will be illustrated by an example of using the sensitivity package, which includes a number of functions to perform sensitivity analyses. However, we will be using the global sensitivity analysis method developed by Sobol¹⁴⁴. There are also various indices, such as the standard regression coefficient or the normalised index, which allow for a quick assessment of the model’s sensitivity ¹⁴⁵. These methods have also been used for sensitivity analyses of rainfall erosivity¹⁴⁶ or WATEM/SEDEM model¹⁴⁷, among others. In the following example, we will define a function that can be used to calculate the potential soil release based on the Gavrilović equation or the Erosion Potential Model (EPM) model. A description of the input parameters can be found in the paper by Bezak et al. (2023)¹⁴⁸.

EPM <- function(Temp,Pa,A,rho,X,Y,S){ 
Tkoef <- sqrt(Temp/10+0.1) # temperature coefficient
Zkoef <- X*Y*(rho+sqrt(S)) # Z coefficient
EPM <- Tkoef*Pa*3.14*A*Zkoef^(3/2) # calculation of potential soil erosion
return(EPM) # function returns the result
}
EPM(10,1462,18.8,0.3,0.2,1.1,30/100) # example of calculation

## [1] 7290.307

library(sensitivity, quietly=TRUE)

## Warning: package 'sensitivity' was built under R version 4.1.3

## Registered S3 method overwritten by 'sensitivity':
##   method    from 
##   print.src dplyr

## 
## Attaching package: 'sensitivity'

## The following object is masked from 'package:raster':
## 
##     extract

## The following object is masked from 'package:terra':
## 
##     extract

## The following object is masked from 'package:dplyr':
## 
##     src

rangemin <- c(0,400,5000,0.1,0.05,0.25,1/100) # lower limit of parameters
rangemax <- c(35,4000,5000,1,1,2,50/100) # upper limit
# determine the number of randomly generated parameters 
# in the range between the minimum and maximum value 
n <- 1000 
# generate parameters
nab1 <- cbind(runif(n,rangemin[1],rangemax[1]),runif(n,rangemin[2],rangemax[2]),runif(n,rangemin[3],rangemax[3]),
  runif(n,rangemin[4],rangemax[4]),runif(n,rangemin[5],rangemax[5]),runif(n,rangemin[6],rangemax[6]),
  runif(n,rangemin[7],rangemax[7])) 
# change column names
colnames(nab1) <- c("Temp","Pa","A","rho","X","Y","S") 
# make another set of parameters
nab2 <- cbind(runif(n,rangemin[1],rangemax[1]),runif(n,rangemin[2],rangemax[2]),runif(n,rangemin[3],rangemax[3]),
   runif(n,rangemin[4],rangemax[4]),runif(n,rangemin[5],rangemax[5]),runif(n,rangemin[6],rangemax[6]),
   runif(n,rangemin[7],rangemax[7])) 
# transform columns
colnames(nab2) <- c("Temp","Pa","A","rho","X","Y","S") 
# define another reformulated function for the EPM model
EPM1 <- function(X){ 
  Tkoef <- sqrt(X[,1]/10+0.1)
  Zkoef <- X[,5]*X[,6]*(X[,4]+sqrt(X[,7]))
  EPM <- Tkoef*X[,2]*3.14*X[,3]*Zkoef^(3/2)
  return(EPM)
}
# perform sensitivity analyses for the EPM model
rez <- sobol(model=EPM1,X1=nab1,X2=nab2,order=1,nboot=1000) 
# plot results
plot(rez)

Figure 120: Sensitivity analysis of the EPM model.

It can be seen that it is the X and Y parameters that have the greatest influence on the model results, so the EPM model results are most sensitive to the X and Y parameters. However, it is often important to perform uncertainty analyses as a regular part of hydrological analyses. Several uncertainty-related packages are available within the R software framework, such as the uncertainty package¹⁴⁹ or the metRology package¹⁵⁰. The functions included in these packages allow, for example, Monte Carlo simulations to be carried out or uncertainty analysis to be performed on measured data. Otherwise, additional methods and procedures for performing uncertainty analysis are also described in Uncertainty Quantification using R¹⁵¹ or in Uncertainty Analysis of Experimental Data with R¹⁵². With a focus on hydrological modelling, the following works are worth mentioning: Rainfall-Runoff Modelling: The Prime¹⁵³ or the paper Parameter estimation and uncertainty analysis in hydrological modelling¹⁵⁴. The following code is an example of a simulation of flows where the range of parameter values is known and the assumption is made that the uniform distribution adequately describes the possible range of parameter values.

n <- 50 # number of simulations
# generate parameters
GR4Jnab1 <- cbind(runif(n,ParamGR4J[1]-50,ParamGR4J[1]+70),runif(n,ParamGR4J[2]-0.9,ParamGR4J[2]+0.7),runif(n,ParamGR4J[3]-70,ParamGR4J[3]+110),round(runif(n,ParamGR4J[4],ParamGR4J[4]+0.7),1)) 
# change column names
colnames(GR4Jnab1) <- c("X1","X2","X3","X4") 
# run the model with the estimated parameter values
plot(RunModel_GR4J(InputsModel = InputsModel, RunOptions = RunOptions1, 
  Param =ParamGR4J)$Qsim[100:200],type="l",col="red",
  ylab="Simulated discharge values [mm]",xlab="Days",lwd=3) 
# and for a possible set of parameters
for(k in 1:n){ 
# calculations
simulacije <- RunModel_GR4J(InputsModel = InputsModel, RunOptions = RunOptions1, 
  Param = c(GR4Jnab1[k,1],GR4Jnab1[k,2],GR4Jnab1[k,3],GR4Jnab1[k,4])) 
# show results
lines(simulacije$Qsim[100:200],col="grey") 
}

Figure 121: Simulated flow values for different parameter combinations.

Task 45: Using the TUWien model, calculate the discharge values (for the same period as in Task 44) for the different parameter combinations used during the model’s optimisation or calibration of the model. Plot the results.

4.18 Climate change

Analyses of the impact of climate change on processes in the water cycle have often been carried out in recent decades with a range of objectives, often based on calculations and projections by the IPCC¹⁵⁵. There are many packages available within the R software environment for working with data related to climate models. A relatively large number of packages are available in the Github repository, such as climate4R¹⁵⁶, downscaleR¹⁵⁷, or climdex¹⁵⁸. There are also packages available in the CRAN repository such as musica¹⁵⁹, which allows for analysis of the impact of downscaling global model results to local levels, with a focus on hydrological studies¹⁶⁰. There is also the spdownscale¹⁶¹ package. Also worth mentioning is the geodata package¹⁶², which allows for (among other things) the analysis of the CMIP6 data available on the Worldclim website¹⁶³. An example of the use of the geodata package will be shown below. First, we will download precipitation data (argument var) for a selected climate model (argument model) and a greenhouse gas emission scenario (ssp) for a selected period (time) and spatial resolution (res) using the cmip6_world function. We will then determine the average monthly precipitation for the Sora basin for the period 2061–2080 for the selected climate model. The geodata package also allows for the download of a number of other variables (including wind speed, air temperature, solar radiation, etc.).

library(geodata, quietly=TRUE) 
# cmip6 <- cmip6_world(model="CNRM-CM6-1", ssp="245", time="2061-2080", 
#      var="prec", res=10, path=tempdir()) # data download
library(terra, quietly=TRUE)
cmip6 <- rast("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/wc2.1_10m_prec_CNRM-CM6-1_ssp245_2061-2080.tif")
cmip6 # we see that these are monthly totals of precipitation (12 maps)

## class       : SpatRaster 
## dimensions  : 1080, 2160, 12  (nrow, ncol, nlyr)
## resolution  : 0.1666667, 0.1666667  (x, y)
## extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## source      : wc2.1_10m_prec_CNRM-CM6-1_ssp245_2061-2080.tif 
## names       : wc2.1~ec_01, wc2.1~ec_02, wc2.1~ec_03, wc2.1~ec_04, wc2.1~ec_05, wc2.1~ec_06, ... 
## min values  :         0.0,         0.0,           0,         0.0,         0.0,         0.0, ... 
## max values  :      1044.4,       934.8,         688,       654.8,       882.1,      1703.8, ...

plot(cmip6[[7]]) # we plot, say, the monthly rainfall for July

Figure 122: Monthly precipitation for July under one of the scenarios.

# deactivate the sensitivity package containing the extract function
detach("package:sensitivity", unload = TRUE) 
# determine the average monthly precipitation
extract(cmip6,porecje4)

##   ID wc2.1_2.5m_prec_01 wc2.1_2.5m_prec_02 wc2.1_2.5m_prec_03
## 1  1              141.1              120.9              155.9
##   wc2.1_2.5m_prec_04 wc2.1_2.5m_prec_05 wc2.1_2.5m_prec_06 wc2.1_2.5m_prec_07
## 1              192.9              163.3              188.5              135.7
##   wc2.1_2.5m_prec_08 wc2.1_2.5m_prec_09 wc2.1_2.5m_prec_10 wc2.1_2.5m_prec_11
## 1              138.1              188.1              224.4              230.1
##   wc2.1_2.5m_prec_12
## 1                181

plot(1:12, as.numeric(extract(cmip6,porecje4))[2:13],type="l",
  xlab="Month",ylab="Monthly precipitation [mm]")

Figure 123: Monthly precipitation for the Sora basin (2061–2080).

# maximum air temperature data downloaded
#slovenija <- worldclim_country("Slovenia", var="tmax", path=tempdir()) 
slovenija <- rast("C:/Users/nbezak/OneDrive - Univerza v Ljubljani/Ucbenik/GIS/SVN_wc2.1_30s_tmax.tif")
# maximum air temperature for July
plot(slovenija[[7]])

Figure 124: Maximum air temperature for Slovenia.

# for the Sora basin only
plot(mask(slovenija[[7]],porecje4))

Figure 125: Maximum temperature for the Sora basin.

The geodata package contains functions for transferring other data, such as ground data (soil_world function) or digital elevation model data (elevation_global function). The data used in the previous example have already been downscaled and bias correction has been applied. However, the R programming environment also has a number of packages that allow these operations to be performed in the event that the underlying CMIP6 simulation data are available. For example, Sezen et al.¹⁶⁴ used the quantile mapping method for precipitation corrections using the qmap package¹⁶⁵, and the delta mapping method for air temperature corrections using the MBC package¹⁶⁶. These methods have also been used in analyses of climate change in Slovenia¹⁶⁷. Data from the project Assessment of climate change in Slovenia by the end of the 21st century are also available on the Open Data Slovenia website (e.g. for the RCP4.5 scenario)¹⁶⁸, where the data can be downloaded in .nc format, which is relatively easy to open and analyse in the R software environment.

The R programming environment also has packages available for accessing various indices such as the North Atlantic Oscillation index (NAO), which tracks a range of processes in the water cycle¹⁶⁹. The rsoi package¹⁷⁰ is a package that allows many indices to be downloaded. We will also calculate the autocorrelation on selected NAO index data. More information on the function can be found in the acf function description.

library(rsoi, quietly=TRUE) 
nao <- download_nao() # download NAO index data
# check the structure of the data
head(nao)

## # A tibble: 6 x 3
##    Year Month   NAO
##   <int> <ord> <dbl>
## 1  1950 jan.   0.92
## 2  1950 feb.   0.4 
## 3  1950 mar.  -0.36
## 4  1950 apr.   0.73
## 5  1950 maj   -0.59
## 6  1950 jun.  -0.06

# we see that this is monthly data from 1950 onwards
tail(nao) # until the end of 2023

## # A tibble: 6 x 3
##    Year Month   NAO
##   <int> <ord> <dbl>
## 1  2024 jul.     NA
## 2  2024 avg.     NA
## 3  2024 sep.     NA
## 4  2024 okt.     NA
## 5  2024 nov.     NA
## 6  2024 dec.     NA

summary(nao$NAO) # basic statistics

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -3.180000 -0.750000  0.040000 -0.005817  0.720000  3.040000         6

plot(nao$NAO,ylab="NAO",type="l") # plot

Figure 126: NAO index fluctuations.

acf(nao$NAO[1:200])

Figure 127: Autocorrelation in NAO data.

Task 46: Based on CMIP6 simulations, graphically display the monthly dynamics in precipitation in Slovenia and its surroundings (lon=c(13,17),lat=c(45,47)) for the period 2061–2080 using the CMCC-ESM2 model.

Task 47: Download the monthly precipitation data for the Murska Sobota station that you used in Task 37 from the ARSO website and analyse whether there is a correlation between the NAO index and the precipitation for the selected station (for the selected period, depending on the availability of ARSO data).

5 Final thoughts

R is a powerful tool for carrying out a wide range analyses for water management or hydrology. The flexibility of the R programming language and it broad array of statistical and graphical capabilities empower an advanced understanding of hydrological processes. R can be used improve data visualisation and interpretation. R improves the accuracy and reproducibility of hydrological studies through its many statistical capabilities. Using R in water management can simplify workflows, which can increase efficiency and save time. Through automation and the plethora of packages tailored for water studies, certain tasks such as data processing, model calibration, and analysis of large volumes of data can be performed more quickly, allowing R users to concentrate on the interpretation of results. The collaborative nature of the R environment facilitates knowledge sharing and collaboration between users of the R environment. Although there are many benefits of using R in the water sector, it is important to be aware of challenges such as its relatively steep learning curve. Accordingly, I hope that this manual will be helpful in making the use of the R software environment more effective in the water sector. Many practical examples of R’s use and a number of packages have been shown, and there are, of course, a much larger number of other packages available, whose has not been specifically shown.

R software in hydrology

Nejc Bezak

23 July 2024