Introduction to R for Statistics Computations

Using the R Statistical Programming Language




Quick Start

What is R?


R is a powerful, comprehensive, open-source software environment for doing statistics. You can download and install it on your own computer, or use it through a website interface without downloading anything. In our class, we will learn how to access R through the CoCalc website interface, and how to do some basic statistical computations and graphics.

To get started, log on to the CoCalc server (create an account using your school email) and open a blank "Jupyter notebook". (You can also open a "Sage worksheet", but then you will need to type `%r` at the beginning of each cell.) Select R for your kernel (the language you want to use). Try some basic arithmetic computations in your worksheet. For example, enter the following commands:
  1. \(32+43\)
  2. \(e^{-\sqrt{5.5\pi}}\)
  3. \(x=2.59,\quad y=-6.44,\quad x^2 \cdot \pi^y\)
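As a rough guide, the three computations above could be typed into R as follows. This is only a sketch; the variable name `result` is made up here for illustration.

```r
# Basic arithmetic in R; each line can be run on its own.
32 + 43                  # addition: 75

exp(-sqrt(5.5 * pi))     # e^(-sqrt(5.5*pi)); exp() is the exponential, pi is built in

x <- 2.59                # assign values to variables with <-
y <- -6.44
result <- x^2 * pi^y     # store the answer so it can be reused later
result                   # display the stored answer
```

Note that R has no built-in constant named `e`; use `exp()` for powers of e.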
The Sage cell server looks like the box below. It offers a very convenient way to do short computations and graphics without logging on to the CoCalc server (or having to mess with downloading anything). Try it!

Of course, you can also use this website for quick R Calculations without logging into CoCalc. However, nothing is saved! So if you want to save your work, you should log into your CoCalc account.



Arithmetic

Obviously, you don't need R to handle basic arithmetic. But you do need to know how it works. This is especially true when you do arithmetic while programming more complicated functions. We start with: $$ 3 \cdot (8-6)^3 - 5$$
Try the next few calculations in the cell above:
  1. \(\frac{(11-5)^2+(3-5)^2}{2}\)
  2. \(\sqrt{4 \cdot 5+(8-3)}\)
  3. \(\frac{(11-5)^2+(3-5)^2+(1-5)^2+(6-5)^2}{4}\)
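One way to type these calculations in R is sketched below. The names `a0` to `a3` are made up for illustration. The main thing to watch is that a fraction needs parentheses around the entire numerator.

```r
a0 <- 3 * (8 - 6)^3 - 5                                      # 19
a1 <- ((11 - 5)^2 + (3 - 5)^2) / 2                           # 20
a2 <- sqrt(4 * 5 + (8 - 3))                                  # 5
a3 <- ((11 - 5)^2 + (3 - 5)^2 + (1 - 5)^2 + (6 - 5)^2) / 4   # 14.25

c(a0, a1, a2, a3)   # display all four answers together
```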


Data Basics

Entering Data

First, we must learn how to enter data into R.

my_data <- c(number1, number2, etc)

my_data

The first line of code does a few things. First, we give the data a name, my_data. The symbol <- is used for assignment. Next, we specify the list of data \( number1, number2, \ldots \) using the combine function c(), which builds a vector. After the first line is complete, R stores this information in memory.

The second line then tells R to show or display the data!

Try removing the second line and see what happens.

Here's a basic example showing how to save a simple data set $$ \{1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,10\} $$ and store it to the variable name my_cool_data.
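A minimal sketch of that example, with comments added for illustration:

```r
# Store the data set in a variable named my_cool_data
my_cool_data <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 10)

my_cool_data           # display the stored values
length(my_cool_data)   # how many values were entered: 16
```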

Note that anything after a hashtag (#) is a "comment": it is meant for humans to read and is ignored by the program.

Also, you can edit these examples to make your own computations below by clicking “Evaluate”.

Sorting Data

Sorting a list into increasing (or decreasing) numbers is very useful and common. It is very tedious to do by hand.

To sort a list into increasing order, use the sort() function.

my_data <- c(number1, number2, etc)

sort(my_data) # sort increasing by default

Here's a simple example:
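A small sketch with made-up values. The `decreasing = TRUE` argument reverses the order.

```r
my_data <- c(7, 3, 9, 1, 4)           # made-up values for illustration

increasing <- sort(my_data)                      # 1 3 4 7 9
decreasing <- sort(my_data, decreasing = TRUE)   # 9 7 4 3 1

increasing
decreasing
```

Note that sort() returns a new sorted vector; my_data itself is left unchanged.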



Random Samples

It is very easy to get all sorts of random samples in R.

# Random Sample
# Pick n numbers at random that lie between a and b.
sample( a:b, size = n, replace = FALSE) # without replacement

sample( a:b, size = n, replace = TRUE) # with replacement

Here's a basic example showing how to generate a random sample.
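A possible version of such an example is sketched below. The scenarios (dice rolls, lottery picks) and the variable names are made up for illustration, and `set.seed()` is optional: it just makes the "random" results repeatable.

```r
set.seed(1)   # optional: fix the random seed so results can be reproduced

# 10 rolls of a six-sided die: repeats are allowed
dice <- sample(1:6, size = 10, replace = TRUE)

# 6 lottery numbers from 1 to 49: no number can repeat
lottery <- sample(1:49, size = 6, replace = FALSE)

dice
lottery
```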

Later, once we learn distributions we will learn how to generate random samples chosen from various distributions.

Here's an example with 50 numbers drawn from a random sample, which is then sorted:
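A sketch of one such example, assuming we draw from the whole numbers 1 to 100 with replacement (the range and variable names are illustrative choices):

```r
set.seed(42)   # optional: makes the sample reproducible

my_sample <- sample(1:100, size = 50, replace = TRUE)   # 50 random numbers
sorted_sample <- sort(my_sample)                        # put them in increasing order

sorted_sample
```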



Descriptive Statistics

Mean, Median, Variance, Standard Deviation

Once data is entered into R, the measures of center and variation are easy!

my_data <- c(number1, number2, etc)

mean(my_data)   # compute the mean of my_data
median(my_data) # compute the median of my_data

var(my_data)    # compute the variance of my_data
sd(my_data)     # compute the standard deviation of my_data

Notice that it only shows one number in the output. This is a bit annoying, but we will need to deal with this quirk. One way to handle it is to comment out three of the lines so that only one result displays at a time.
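Another possible workaround, assuming your environment only displays the last result, is to wrap each call in print(). Here is a sketch using the small data set from the Entering Data section:

```r
my_cool_data <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 10)

print(mean(my_cool_data))     # mean: 4.0625
print(median(my_cool_data))   # median: 4
print(var(my_cool_data))      # sample variance (divides by n - 1)
print(sd(my_cool_data))       # sample standard deviation = square root of the variance
```

Note that var() and sd() compute the sample versions, not the population versions.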



Histograms

To create a histogram, first enter your data and then use the hist() function.

# Histogram (let R handle everything)
hist(data)

R lets you customize almost everything. Here's a simple example:

hist( data,                             # data
      main = "A lovely Title",          # add a title
      xlab = "x values are...",         # specify label on x-axis
      xlim = c(min, max),               # specify desired range
      breaks = c(list of values),       # class / bin boundaries
      col = c("blue", "red", etc),      # colors of bars
      freq = FALSE                      # TRUE by default (counts); FALSE plots densities (total area 1)
    )

Note that R automatically decides on the width of the bins/classes.

An example with a few direct specifications:
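A sketch with made-up exam scores. The data, title, and bin boundaries are illustrative choices, not from the course.

```r
# Made-up exam scores for illustration
scores <- c(52, 58, 61, 64, 67, 67, 70, 72, 73, 75, 75, 78,
            80, 81, 84, 88, 91, 95)

h <- hist(scores,
          main   = "Exam Scores (made-up data)",  # title
          xlab   = "Score",                       # x-axis label
          xlim   = c(50, 100),                    # range shown on the x-axis
          breaks = seq(50, 100, by = 10),         # bins: 50-60, 60-70, ..., 90-100
          col    = "lightblue")                   # bar color

h$counts   # hist() also returns the bin counts: 2 5 6 3 2
```

By default each bin includes its right endpoint, so a score of 70 is counted in the 60-70 bin.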

Dot Plots

To create a dot plot, first you need to load a special package called plotrix. A package is a bit of code someone else wrote so we can use it!

Next, enter your data and then use the function dotplot.mtb().

# Dot Plot
library(plotrix) # load a special package

dotplot.mtb( data ) # dot plot

Here's a simple example:
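A minimal sketch, assuming the plotrix package may or may not be installed on your system. If it is missing, the code falls back to base R's stripchart(), which is a different function that draws a similar stacked dot plot.

```r
values <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 10)

if (requireNamespace("plotrix", quietly = TRUE)) {
  # plotrix is available: use its Minitab-style dot plot
  plotrix::dotplot.mtb(values)
} else {
  # fallback (not the plotrix function): stacked dots using base R
  stripchart(values, method = "stack", pch = 19,
             main = "Dot plot (base R fallback)")
}
```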

Stem-and-Leaf Plots

To create a stem-and-leaf plot, use the stem() function.

# Stem-Leaf Plot

stem( data ) # stem-leaf plot

Suppose that we want to make a stem-and-leaf plot of Babe Ruth's home runs:
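The original data cell is not shown here, so this sketch assumes the commonly cited data set: Ruth's season home run totals during his 15 seasons with the Yankees (1920 to 1934). The variable name `ruth_hr` is made up.

```r
# Assumed data: Babe Ruth's home runs per season as a Yankee, 1920-1934
ruth_hr <- c(54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22)

stem(ruth_hr)   # stems are the tens digits, leaves are the ones digits
```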

Pie Charts

To create a pie chart, use the pie() function.

# Pie Charts

slices <- c(list of values)  # enter data of the 'slices'
lbls <- c("category1", "category2", etc) # enter label names for categories matching slice data
pie( slices,
     labels = lbls,
     main = "Add Title"
     )

We can create pie charts:
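Here is a sketch with made-up category counts. The categories, counts, and the percentage labels are illustrative additions.

```r
# Hypothetical counts of students by class standing
slices <- c(12, 8, 5, 3)
lbls   <- c("Freshman", "Sophomore", "Junior", "Senior")

# Optional: add rounded percentages to the labels
pct <- round(100 * slices / sum(slices))
pie_labels <- paste0(lbls, " (", pct, "%)")

pie(slices,
    labels = pie_labels,
    main   = "Class Standing (made-up data)")
```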

One more example: Time Series

We can plot the time series for the 1918-1919 influenza epidemic. The plot for the number of cases is in red, and the plot for 10 times the number of deaths is in blue.



Five Number Summary

To create a five number summary, first enter your data and then use the summary() function. (Note that summary() also reports the mean alongside the five numbers.)

# Five Number Summary
summary(data)
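A short sketch using the data set from the Entering Data section. It also shows base R's fivenum(), which returns just the five numbers. The two functions use slightly different quartile rules, so they can disagree on some data sets.

```r
my_cool_data <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 10)

print(summary(my_cool_data))   # Min, 1st Qu., Median, Mean, 3rd Qu., Max

five <- fivenum(my_cool_data)  # minimum, lower hinge, median, upper hinge, maximum
five                           # 1 3 4 5 10
```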

Box-Whisker Plots

To create a box-whisker plot, use the boxplot() function.

# Box Plot

boxplot(data)

Simple box plot example:
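A minimal sketch using the data set from the Entering Data section. The call to boxplot.stats() is an extra, included only to list the values R treats as outliers.

```r
my_cool_data <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 10)

boxplot(my_cool_data)   # the value 10 is drawn as a separate point

outliers <- boxplot.stats(my_cool_data)$out   # values flagged as outliers
outliers                                      # 10
```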

Note that it displays the boxplot vertically by default. Also, it identifies outliers by default.

Here is another example with more options specified:
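One possible version, with made-up wait times. The data, labels, and color are illustrative choices.

```r
# Hypothetical wait times in minutes
waits <- c(2, 4, 5, 5, 6, 7, 7, 8, 9, 10, 12, 25)

bp <- boxplot(waits,
              horizontal = TRUE,                    # draw the box sideways
              main = "Wait Times (made-up data)",   # title
              xlab = "Minutes",                     # axis label
              col  = "lightgreen")                  # fill color of the box

bp$stats   # whisker ends, hinges, and median used to draw the plot
bp$out     # points plotted as outliers
```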



Distributions

Just like the TI-84 calculator, R makes it easy to compute probabilities using common distributions.

Binomial Distributions give the probability of a number of successes \(x\) in \(n\) trials, where each trial has probability of success \(p\).

Syntax in R: base distribution name: binom (always used with one of the prefixes below, e.g. dbinom()).

There are 4 prefixes: d (exactly), p (cumulative), q (quantile, aka "inverse"), r (random sample).

Exactly:
dbinom(x = #, size = #, prob = #) 
    # computes the probability of exactly x successes 
        # x = number of successes, 
        # size = number of trials, 
        # prob = probability of success of single trial
Cumulative:
pbinom(#, size = #, prob = #) 
    # computes the probability of at most # successes (cumulative), P(X <= #)
    # note: the first argument of pbinom is named q, not x, so drop the 'x=' here
Inverse:
qbinom(p = #, size = #, prob = #)
    # inverse binomial: finds the smallest z such that P(X <= z) >= p (left tail)
Random samples:
rbinom(n = #, size = #, prob = #) 
    # generates n random values from the binomial distribution with the given size and prob
    # n = number of observations     

Examples:

Here are a few examples. Hit the 'Evaluate R Code' button to see the outputs.
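As a sketch, suppose we flip a fair coin 10 times and count heads (a made-up scenario, so size = 10 and prob = 0.5):

```r
exactly5 <- dbinom(x = 5, size = 10, prob = 0.5)   # P(X = 5)  = 0.24609375
atmost3  <- pbinom(3, size = 10, prob = 0.5)       # P(X <= 3) = 0.171875
atleast4 <- 1 - pbinom(3, size = 10, prob = 0.5)   # P(X >= 4) = 1 - P(X <= 3) = 0.828125
med      <- qbinom(p = 0.5, size = 10, prob = 0.5) # smallest z with P(X <= z) >= 0.5, here 5

set.seed(7)                                        # optional: reproducible random values
flips <- rbinom(n = 5, size = 10, prob = 0.5)      # 5 simulated head counts

print(c(exactly5, atmost3, atleast4, med))
print(flips)
```

Notice that "at least 4" is computed from the complement of "at most 3", since pbinom always accumulates from the left by default.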


Continuous Distributions give the probability that a variable falls in a range of values.

Normal Distributions

Let's start with the Normal Distributions.

Syntax in R: base distribution name: norm (always used with one of the prefixes below, e.g. pnorm()).

There are 4 prefixes: d (density), p (cumulative), q (quantile, aka "inverse"), r (random sample).

The Normal Distribution, $$ N(\mu,\sigma), $$ depends on the mean \(\mu\) and standard deviation \(\sigma\). Its probability density function (pdf) is given by: $$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2}\right] $$

In R, to get the exact value of \(f(x)\), use dnorm.

dnorm(x = #, mean = #, sd = #) 
    # computes the height f(x) of the normal density curve at x (not a probability)
        # x = exact value of x (any real number), 
        # mean = mean of normal distribution
        # sd = standard deviation of normal distribution

We don't really need to know or use dnorm unless we want to plot the curve.
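If you do want to plot the curve, one way is base R's curve() function. This is a sketch with a made-up mean of 100 and standard deviation of 15:

```r
mu    <- 100   # hypothetical mean
sigma <- 15    # hypothetical standard deviation

# Draw the normal density from 4 standard deviations below to 4 above the mean
curve(dnorm(x, mean = mu, sd = sigma),
      from = mu - 4 * sigma, to = mu + 4 * sigma,
      xlab = "x", ylab = "f(x)",
      main = "Normal density with mean 100 and sd 15")

peak <- dnorm(mu, mean = mu, sd = sigma)   # height of the curve at its peak
peak
```

The peak height equals \(1/(\sigma\sqrt{2\pi})\), which is what the pdf formula gives at \(x = \mu\).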

To compute probabilities of intervals of a normal distribution, we use the cumulative distribution. In math notation, if \(X \sim N(\mu,\sigma) \) then to compute $$ P(X \le b) \qquad \text{(i.e. less than b--left tail)} $$ we use:

   
pnorm(b, mean = #, sd = #) 
    # computes the cumulative probability P(X <= b)
        # mean = mean of normal distribution
        # sd = standard deviation of normal distribution
        # lower value = -oo 
        # higher value = b

Often, we need other calculations using the normal curve. For example: $$ P(a \le X \le b) \qquad \text{(i.e. between a and b)} $$ or $$ P(X \ge a) \qquad \text{(i.e. greater than a--right tail)} $$

Thankfully, with a bit of basic geometry (drawing pictures helps!), we can use the cumulative from above to accomplish the other two probabilities.

# P(a <= X <= b): 
pnorm(b, mean = #, sd = #) - pnorm(a, mean = #, sd = #) 
    # computes the probability under a normal distribution between the values of a and b
    # (area to the left of b minus area to the left of a)
    # lower value = a 
    # higher value = b
# P(X >= a):   
pnorm(a, mean = #, sd = #, lower.tail = FALSE) 
    # computes the probability under a normal distribution for the values greater than a
    # lower value = a 
    # higher value = +oo 

Examples:

Here are a few examples. Hit the 'Evaluate R Code' button to see the outputs.
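As a sketch, take a normal distribution with a made-up mean of 100 and standard deviation of 15:

```r
left    <- pnorm(115, mean = 100, sd = 15)                      # P(X <= 115), about 0.8413
between <- pnorm(115, mean = 100, sd = 15) -
           pnorm(85,  mean = 100, sd = 15)                      # P(85 <= X <= 115), about 0.6827
right   <- pnorm(130, mean = 100, sd = 15, lower.tail = FALSE)  # P(X >= 130), about 0.0228

print(c(left, between, right))
```

The right-tail answer could also be computed as 1 - pnorm(130, mean = 100, sd = 15).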

Inverse Normal

Inverse:
qnorm(p = #, mean = #, sd = #)
    # inverse normal: finds z such that P(X <= z) = p (left tail, the default)
qnorm(p = #, mean = #, sd = #, lower.tail = FALSE)
    # finds z such that P(X >= z) = p (right tail)

Examples:

Here are a few examples. Hit the 'Evaluate R Code' button to see the outputs.
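A few sketched calls, where the probabilities, mean, and standard deviation are made-up values:

```r
z_left  <- qnorm(p = 0.975, mean = 0, sd = 1)                      # z with P(Z <= z) = 0.975, about 1.96
z_right <- qnorm(p = 0.05, mean = 0, sd = 1, lower.tail = FALSE)   # z with P(Z >= z) = 0.05, about 1.645
x_90    <- qnorm(p = 0.90, mean = 100, sd = 15)                    # 90th percentile of N(100, 15), about 119.2

print(c(z_left, z_right, x_90))

# qnorm undoes pnorm: feeding the answer back recovers the probability
pnorm(x_90, mean = 100, sd = 15)   # 0.9
```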

Additional Continuous Distributions

In addition to the normal distribution, R has other distributions programmed. In the code below, I'll reference a few of the distributions we use in our Intro Statistics course.
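The original code cell is not shown here, so the sketch below assumes the Student t and chi-square distributions are among the ones meant. They follow the same prefix pattern (d, p, q, r) with base names t and chisq, and the degrees of freedom and inputs are made up.

```r
# Student t distribution (df = degrees of freedom)
t_left <- pt(0, df = 10)              # P(T <= 0) = 0.5, since t is symmetric about 0
t_crit <- qt(0.975, df = 10)          # t with P(T <= t) = 0.975, about 2.228

# Chi-square distribution
chi_right <- pchisq(3.84, df = 1, lower.tail = FALSE)   # P(X >= 3.84), about 0.05
chi_crit  <- qchisq(0.95, df = 1)                       # x with P(X <= x) = 0.95, about 3.841

print(c(t_left, t_crit, chi_right, chi_crit))
```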



Inference

Confidence Intervals. Here we show how R will give you a confidence interval.

One Proportion Confidence Interval

Let's say we want to estimate the true population proportion, \(p\). We must first be given a level of confidence, \(CL\), which then gives us \(\alpha = 1-CL\).

The critical value \(z_{\alpha/2}\) is then computed using the "inverse normal", which, recall, is found in R using qnorm.

CL <- 0.95 # give CL 
alpha <- 1-CL

z_alpha <- qnorm(1 - alpha/2, 0, 1) # critical value = inverse normal ( 1 - alpha/2, 0, 1 ), so that it is positive

Next, we state the point estimate, i.e. the sample proportion, \(\hat{p}\), and the standard error, \(\sigma_{\hat{p}}\). $$ \hat{p} = \frac{x}{n} $$ and $$ \sigma_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. $$

And use the formula for the margin of error, \(E\). $$ E = z_{\alpha/2} \cdot \sigma_{\hat{p}}. $$

   
p_hat <- 0.65 # if p_hat is given directly
              # use x/n if not given directly
n <- 40       # sample size (note: need n > 30)

std_error <- sqrt( ( (p_hat)*(1-p_hat) ) / n )

# Margin of Error
Error <- z_alpha * std_error

Examples:

Here are a few examples. Hit the 'Evaluate R Code' button to see the outputs.
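Putting the steps together, a complete calculation might look like the sketch below. It uses the values from the code above (a 95% confidence level, sample proportion 0.65, and n = 40). The names `lower` and `upper` are made up here, and the interval is taken to be \(\hat{p} \pm E\).

```r
CL    <- 0.95                  # confidence level
alpha <- 1 - CL

z_alpha <- qnorm(1 - alpha/2, 0, 1)   # positive critical value, about 1.96

p_hat <- 0.65                  # sample proportion
n     <- 40                    # sample size

std_error <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
Error     <- z_alpha * std_error             # margin of error, about 0.148

lower <- p_hat - Error         # lower end of the interval, about 0.502
upper <- p_hat + Error         # upper end of the interval, about 0.798

print(c(lower, upper))
```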


Hypothesis Tests One Proportion p-test. You can ignore


Scatter Plots and Linear Regression

Coming soon...