Learning Objectives:
R is a powerful, comprehensive, open-source software framework for doing professional statistics.
It is possible to download and install the software on computers, or to use it through a website-interface without downloading anything.
We will use R through a website-interface in the form of a Jupyter notebook or R Markdown document. In fact, what you are reading here is an R Markdown document that will guide you through the first steps of getting familiar with R.
Watch this short video introduction:
Navigate to the Labs folder and find the Lab_1 folder. Inside you will find a Jupyter Notebook, which can run and execute R code, titled “Math136-Lab_1-YourLastName_YourFirstName-S20”. Rename the file, replacing YourLastName and YourFirstName with your own last and first name.
At the beginning of your Jupyter Notebook, double-click on the “Lab1” text. Replace the text “FirstName LastName” with your actual first and last name. Click “run cell” (looks like a play button) or hit shift+enter.
Have some fun and make a few calculations
You can perform all sorts of basic arithmetic using R:
## [1] 4
Of course we know 2+2 is 4. In the above, you’ll see the gray box shows the code, 2+2, and in the box below that you’ll see the output 4.
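Here is a small sketch of the other basic arithmetic operators, in case you want to try a few more calculations. The particular numbers are arbitrary and chosen only for illustration.

```r
# Basic arithmetic in R: type each line into a cell and run it
2 + 2    # addition, gives 4
7 - 3    # subtraction, gives 4
5 * 7    # multiplication, gives 35
10 / 4   # division, gives 2.5
2^3      # exponent (2 to the power 3), gives 8
```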
We can input much more complicated expressions as well. Just use parentheses liberally!
If we want to evaluate the expression \(\sqrt{\frac{2+5\cdot 7}{10}}\) we need to know the syntax for square root, multiplication, and division. R uses sqrt() for the square root, * for multiplication, and / for division.
## [1] 1.923538
You can add spaces so that it’s easier to understand/read the code.
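The cell behind the output 1.923538 is not shown above. One way to type the expression, without and with spaces, is sketched here using only base R. Note the extra pair of parentheses around the numerator, so that the whole sum 2 + 5*7 is divided by 10.

```r
# sqrt() is the square root, * is multiplication, / is division
sqrt((2+5*7)/10)

# The same expression with spaces; R ignores the extra spaces
sqrt((2 + 5 * 7) / 10)
```

Both lines give the same result, because spaces around operators do not change the calculation.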
Let’s look at one more useful feature of R: assignment.
If you type a less-than sign followed by a hyphen, you get something that looks like an arrow pointing to the left: <-. You can use this arrow to assign values to variables.
For example, you can create a variable x and assign it the value 3 by typing x <- 3 into a cell.
If you hit run, notice nothing happens. There’s no output. This is because we’ve only defined the variable and stored the value of 3 to it. So that’s all that R does. We have to be specific and tell R to show us the output by “calling” the variable.
All we do to call a variable is just type the name of it in a new line:
## [1] 3
Notice that by typing x on the next line we DO get an output of 3.
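Putting the two steps together, a cell might look like the sketch below. The second variable, y, is a hypothetical addition to show that a stored value can be reused in later calculations.

```r
x <- 3       # store the value 3 in x; this line alone prints nothing
x            # call the variable to display its value

y <- x + 2   # hypothetical second variable built from x
y            # displays 5
```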
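You can name a variable almost anything, as long as the name starts with a letter and contains only letters, numbers, periods, and underscores (no spaces).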
If you want to assign a variable a description or text instead of a number we use quotes:
fav_color <- "yellow"
fav_color # reminder: you need to type the name of the variable again to see it as an output
## [1] "yellow"
Comments are very useful in programming and coding. They are parts of code that are NOT read or implemented by the program but are there to be helpful for humans–that is, for you or me to read :-)
In R, any text after a hash symbol (#) on a line is ignored by the program.
For example,
## [1] 11
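The cell that produced the output 11 is not shown above. A cell along the following lines would give that output; the numbers 5 and 6 are an assumption made for this sketch.

```r
# This whole line is a comment, so R skips it
5 + 6  # R computes 5 + 6 and ignores everything after the hash symbol
```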
If you go back up and read the examples, you’ll see that I’ve already used comments!
I’m going to extensively use comments in the code cells to teach you important info that is not part of the code.
If you need to write regular text, then you need to create a different structure that understands plain text called “markdown.”
To do this, click on the “Cell” menu, then under “Cell Type”, select “Change to Markdown M”. Double-click on this cell and you can now start writing.
Let’s assume we have a simple data set of: 1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10.
When entering data “by hand” we use the following code:
Again, it seems like nothing happened. But something did happen. The program has now stored all the numbers to a vector (i.e. an ordered collection of values) variable that we named my_cool_data.
If we want to see our data, we have to type the name again in a new line:
my_cool_data <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10) # define and store values to my_cool_data
my_cool_data
## [1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 10
Now my_cool_data is stored and ready to be studied.
We calculate the mean of the data set my_cool_data as follows:
## [1] 4.0625
Notice the parentheses after the mean that sandwich the data name.
This is because mean() is a function. It needs an input (in the example above the input is the vector my_cool_data) and it gives us an output (the mean of the data).
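The cell that produced 4.0625 is not shown above. Since the text names mean() as the function and my_cool_data as its input, the call presumably looked like this sketch:

```r
# Same data as before, repeated so this cell runs on its own
my_cool_data <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10)

mean(my_cool_data)  # input: the vector; output: its mean
```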
We can compute the mean more explicitly:
# explicit mean
# recall: mean = sum of values divided by # of values
num <- sum(my_cool_data) # numerator is sum of values
denom <- length(my_cool_data) # length() function tells us number of values in data set (vectors)
mean_hand <- num/denom
mean_hand
## [1] 4.0625
Luckily, we don’t need to compute things the long way. We’ll use the functions already programmed into R.
Similarly, we can calculate the median, standard deviation, and variance as follows:
## [1] 4
## [1] 2.015564
## [1] 4.0625
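The three outputs above come from a cell that is not shown. Base R has median(), sd(), and var() for these statistics, so the cell presumably resembled this sketch:

```r
# Same data as before, repeated so this cell runs on its own
my_cool_data <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10)

median(my_cool_data)  # median
sd(my_cool_data)      # sample standard deviation (divides by n - 1)
var(my_cool_data)     # sample variance (the standard deviation squared)
```

For this particular data set the variance happens to equal the mean (4.0625). That is a coincidence, not a general rule.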
We can also get the five-number summary (R’s summary() function reports the mean as well):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 4.062 5.000 10.000
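The output above has the layout that summary() prints, so the missing cell was presumably a call to summary(). The sketch below also shows fivenum(), a base R function that returns only the five numbers. It uses "hinges", which can differ slightly from the quartiles for some data sets, although they agree here.

```r
# Same data as before, repeated so this cell runs on its own
my_cool_data <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10)

summary(my_cool_data)  # min, 1st quartile, median, mean, 3rd quartile, max
fivenum(my_cool_data)  # min, lower hinge, median, upper hinge, max
```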
# A dataset consisting of the 5 values: Yes, No, Yes, Yes, Maybe
acat <- c("Yes", "No", "Yes", "Yes", "Maybe")
# notice that you must use quotes to enclose categorical values
table(acat) # make a frequency table
## acat
## Maybe No Yes
## 1 1 3
If we want to see the box plot, we use:
Notice that by default, the box plot is vertical. To make it horizontal, we add horizontal=TRUE after the data name, separated by a comma.
Notice the box plot looks a little different than what we discussed in class. This is because the value of 10 is considered an outlier.
See the textbook for how outliers are determined using the IQR (inter-quartile range).
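The box plot cells are not shown above; a sketch using base R's boxplot() follows. The second half checks by hand why 10 is flagged, assuming the common 1.5 × IQR rule (the default used by boxplot()). The names q1, q3, iqr, upper_fence, and lower_fence are chosen for this illustration, and the textbook's quartile convention may differ slightly.

```r
# Same data as before, repeated so this cell runs on its own
my_cool_data <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10)

boxplot(my_cool_data)                     # vertical by default
boxplot(my_cool_data, horizontal = TRUE)  # horizontal version

# Checking the outlier by hand with the 1.5 * IQR rule
q1  <- quantile(my_cool_data, 0.25)  # first quartile
q3  <- quantile(my_cool_data, 0.75)  # third quartile
iqr <- IQR(my_cool_data)             # inter-quartile range, q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# Values outside the fences are treated as outliers
my_cool_data[my_cool_data > upper_fence | my_cool_data < lower_fence]
```

Here the quartiles are 3 and 5, so the IQR is 2 and the fences are 0 and 8. Only the value 10 falls outside them.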
A histogram is easy to produce using:
We can easily customize this to add some color:
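The histogram cells are not shown above. A sketch using base R's hist() follows; the color "lightblue" and the axis label are assumptions, and any color name R recognizes would work.

```r
# Same data as before, repeated so this cell runs on its own
my_cool_data <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10)

hist(my_cool_data)  # default histogram

# Same histogram with colored bars and a clearer axis label
hist(my_cool_data, col = "lightblue", xlab = "Value")
```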
A dot plot is easy to produce using:
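The dot plot cell is not shown above, so the exact command used is unknown. One base R option is stripchart() with method = "stack", which stacks repeated values on top of each other; the plotting symbol and axis label below are assumptions.

```r
# Same data as before, repeated so this cell runs on its own
my_cool_data <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10)

# Stacked dot plot: one dot per observation
stripchart(my_cool_data, method = "stack", pch = 19, xlab = "Value")

# The height of each stack should match these counts
table(my_cool_data)
```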
A stem-and-leaf plot is easy as well. To make it more interesting, we’ll introduce a new data set called “ruth”.
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 25
## 3 | 45
## 4 | 1166679
## 5 | 449
## 6 | 0
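The cell that defined ruth is not shown above. The sketch below rebuilds the 15 values by reading them back off the stems and leaves in the display, so the order of the values is arbitrary and may differ from the original cell.

```r
# Values read back from the stem-and-leaf display (stem = tens, leaf = ones)
ruth <- c(22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60)

stem(ruth)  # stem-and-leaf plot
```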
Finally, pie charts require a bit more to set up.
# Pie Charts
grades <- c(4, 10, 37, 8, 1) # the data for each "slice" of the pie
lbls <- c("A", "B", "C", "D", "F") # the labels to apply to each slice (in order)
# Notes:
# "labels=" applies the labels stored in lbls
# "main=" creates a title for the pie chart
pie(grades, labels = lbls, main = "Final Exam Grades for a Statistics Course")
Down | Up | Up | Down | Down | Up |
Down | Up | Down | Up | Down | Up |
Down | Down | Up | Up | Up | Up |
Down | Down | Down | Up | Down | Up |
No Change | Up | Down | Down | No Change | Down |
Create a pie chart with labels and a title.
72, 97, 74, 93, 68, 59, 64, 56, 70, 58, 50, 71, 67, 56, 70, 61, 53, 92, 57, 67,
58, 49, 68, 69, 87, 81, 60, 52, 70, 63, 56, 68, 68, 54, 80, 64, 57, 63, 54, 56,
54, 73, 77, 63, 51, 59, 65, 53, 62, 55, 74, 74, 64, 64, 57, 64, 60, 64, 66, 52,
71, 55, 65, 75, 42, 74, 94
(Tip: you can copy the data with command+c (Mac) or control+c (PC).)
A spreadsheet or table of raw data can be created manually via keyboard input, or by reading in data files written in various standard formats. The structure used in R to represent such tables is called a dataframe.
Consider, for example, the following dataset:
Age | Sex | Class year | SAT score | Financial aid? |
---|---|---|---|---|
18 | F | 1 | 1014 | N |
20 | F | 3 | 1222 | Y |
17 | M | 1 | 1141 | Y |
17 | F | 1 | 1082 | N |
19 | M | 2 | 1261 | Y |
18 | F | 2 | 1288 | N |
20 | F | 1 | 1002 | N |
21 | M | 3 | 1078 | N |
We will input each column of data as a separate variable first, after which we will organize them into a dataframe.
The dataframe can be given any convenient name, e.g., “mydata”.
# First create each column as a separate variable: we'll use the names
# "age", "sex", etc., for the names of my variables
age <- c(18, 20, 17, 17, 19, 18, 20, 21)
sex <- c("f", "f", "m", "f", "m", "f", "f", "m")
year <- c(1, 3, 1, 1, 2, 2, 1, 3)
sat_score <- c(1014, 1222, 1141, 1082, 1261, 1288, 1002, 1078)
f_aid <- c("n", "y", "y", "n", "y", "n", "n", "n")
# Next, we'll combine the variables into a dataframe that we will call "mydata"
mydata <- data.frame(age, sex, year, sat_score, f_aid)
Let’s see how it looks by typing the name of the dataframe, mydata, on a new line:
## age sex year sat_score f_aid
## 1 18 f 1 1014 n
## 2 20 f 3 1222 y
## 3 17 m 1 1141 y
## 4 17 f 1 1082 n
## 5 19 m 2 1261 y
## 6 18 f 2 1288 n
## 7 20 f 1 1002 n
## 8 21 m 3 1078 n
Now we can compute summary stats, make histograms, boxplots, piecharts, etc.
Once a dataframe is created, it is easy to make various displays, and to compute summary statistics for variables in the dataframe.
The following examples show how to do this for variables in the dataframe created above.
The basic structure is as follows: dataframe_name$variable_name, where the dollar sign $ is important because it is how we tell R to look at one specific variable.
##
## n y
## 5 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1002 1062 1112 1136 1232 1288
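The cells that produced the two outputs above are not shown. Judging by the values, they are a frequency table of f_aid and a summary of sat_score, so they presumably resembled this sketch. The last line, the mean of age, is an extra illustration.

```r
# Rebuild the dataframe so this cell runs on its own
age <- c(18, 20, 17, 17, 19, 18, 20, 21)
sex <- c("f", "f", "m", "f", "m", "f", "f", "m")
year <- c(1, 3, 1, 1, 2, 2, 1, 3)
sat_score <- c(1014, 1222, 1141, 1082, 1261, 1288, 1002, 1078)
f_aid <- c("n", "y", "y", "n", "y", "n", "n", "n")
mydata <- data.frame(age, sex, year, sat_score, f_aid)

table(mydata$f_aid)        # frequency table for a categorical variable
summary(mydata$sat_score)  # summary statistics for a numerical variable
mean(mydata$age)           # any function from earlier works the same way
```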
If we want to change the class widths of the histogram, we can do it using the breaks= argument:
# A simple way to set the histogram scale is to suggest
# the number of bars to use, like this
hist(mydata$sat_score, breaks=6) # about 6 equal-width bars (R treats the number as a suggestion)
# It is also easy to customize the plot title, axes labels, etc., like this
hist(mydata$sat_score, breaks=6, xlab="SAT scores", main="A title test")
Age | Major | Employment | Work hours |
---|---|---|---|
19 | Business | Part time | 35 |
19 | English | Part time | 30 |
34 | Business | Unemployed | 0 |
20 | Psychology | Part time | 19 |
20 | Psychology | Part time | 32 |
21 | History | Unemployed | 0 |
21 | Business | Part time | 20 |
21 | History | Part time | 15 |
23 | Psychology | Full time | 36 |
41 | Business | Full time | 50 |
30 | Physics | Unemployed | 0 |
Create a dataframe via keyboard input to represent these data. Print your dataframe and verify that it is correct.
This completes a basic introduction to R and also the basics of descriptive statistics.
In the next lab, we will continue descriptive statistics by studying the full data set of “freshman_15_full”. Since it is tedious to type in data by keyboard, in the next lab we will learn how to import a spreadsheet into R, saving us lots of time!
You’ve done it! You’ve now finished Lab_1! ;-)