Stat 50 - Elementary Statistics

Dr. Jorge Basilio

Data Analysis using an Excel file

Lab 2

Name: SOLUTIONS

Due: Saturday, Feb 1 at 11:59 PM

PART 1

In [1]:
library(readxl) # load library to read excel files in R
fresh_15_full <- read_excel("06-Freshman15.xlsx") # load excel file info into R and save it to "fresh_15_full"
fresh_15_full # see data
Out[1]:
A tibble: 67 × 5
SEX    WT SEPT  WT APRIL  BMI SEPT  BMI APRIL
<chr>  <dbl>    <dbl>     <dbl>     <dbl>
M      72       59        22.02     18.14
M      97       86        19.70     17.44
M      74       69        24.09     22.43
M      93       88        26.97     25.57
F      68       64        21.51     20.10
M      59       55        18.69     17.40
F      64       60        24.24     22.88
F      56       53        21.23     20.23
F      70       68        30.26     29.24
F      58       56        21.88     21.02
F      50       47        17.63     16.89
M      71       69        24.57     23.85
M      67       66        20.68     20.15
F      56       55        20.97     20.36
F      70       68        27.30     26.73
F      61       60        23.30     22.88
F      53       52        19.48     19.24
M      92       92        24.74     24.69
F      57       58        20.69     20.79
M      67       67        20.49     20.60
F      58       58        21.09     21.24
F      49       50        18.37     18.53
M      68       68        22.40     22.61
F      69       69        28.17     28.43
M      87       88        23.60     23.81
M      81       82        26.52     26.78
M      60       61        18.89     19.27
F      52       53        19.31     19.75
M      70       71        20.96     21.32
F      63       64        21.78     22.22
⋮      ⋮        ⋮         ⋮         ⋮
F      63       65        23.87     24.67
F      54       56        18.61     19.34
F      56       58        21.73     22.58
M      54       56        18.93     19.72
M      73       75        25.88     26.72
M      77       79        28.59     29.53
F      63       66        21.89     22.79
F      51       54        18.31     19.28
F      59       62        19.64     20.63
F      65       68        23.02     24.10
F      53       56        20.63     21.91
F      62       65        22.61     23.81
F      55       58        22.03     23.42
M      74       77        20.31     21.34
M      74       78        20.31     21.36
M      64       68        19.59     20.77
M      64       68        21.05     22.31
F      57       61        23.47     25.11
F      64       68        22.84     24.29
F      60       64        19.50     20.90
M      64       68        18.51     19.83
M      66       71        21.40     22.97
F      52       57        17.72     19.42
M      71       77        22.26     23.87
F      55       60        21.64     23.81
M      65       71        22.51     24.45
M      75       82        23.69     25.80
F      42       49        15.08     17.74
M      74       82        22.64     25.33
M      94       105       36.57     40.86
In [2]:
names(fresh_15_full)
Out[2]:
  1. 'SEX'
  2. 'WT SEPT'
  3. 'WT APRIL'
  4. 'BMI SEPT'
  5. 'BMI APRIL'

Problem 1

The five variables are as follows:

  • "SEX" represents the sex of the person (M or F)
  • "WT SEPT" represents the weight of the person in September (fall semester)
  • "WT APRIL" represents the weight of the person in April (the following spring semester)
  • "BMI SEPT" represents the BMI (body mass index) of the person in September (fall semester)
  • "BMI APRIL" represents the BMI (body mass index) of the person in April (the following spring semester)
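
Because several of these column names contain a space, they have to be quoted whenever they are referenced. The sketch below is only an illustration: it assumes a small hypothetical data frame `fresh_small`, built by hand from the first three rows of the Out[1] table rather than read from the Excel file, and shows three equivalent ways to extract one column.

```r
# Hypothetical mini data set: first three rows of the Out[1] table
fresh_small <- data.frame(
  SEX        = c("M", "M", "M"),
  `WT SEPT`  = c(72, 97, 74),
  `WT APRIL` = c(59, 86, 69),
  check.names = FALSE   # keep the spaces in the column names
)

by_backtick <- fresh_small$`WT SEPT`     # backticks
by_quote    <- fresh_small$"WT SEPT"     # double quotes, as used in this lab
by_brackets <- fresh_small[["WT SEPT"]]  # double brackets with a string

by_backtick
```

All three forms return the same numeric vector, so the choice is a matter of style; the rest of this lab uses the double-quote form.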

Problem 2

In [3]:
# 2a
fresh_15_full$"WT SEPT" # show all values of "WT SEPT"
Out[3]:
  1. 72
  2. 97
  3. 74
  4. 93
  5. 68
  6. 59
  7. 64
  8. 56
  9. 70
  10. 58
  11. 50
  12. 71
  13. 67
  14. 56
  15. 70
  16. 61
  17. 53
  18. 92
  19. 57
  20. 67
  21. 58
  22. 49
  23. 68
  24. 69
  25. 87
  26. 81
  27. 60
  28. 52
  29. 70
  30. 63
  31. 56
  32. 68
  33. 68
  34. 54
  35. 80
  36. 64
  37. 57
  38. 63
  39. 54
  40. 56
  41. 54
  42. 73
  43. 77
  44. 63
  45. 51
  46. 59
  47. 65
  48. 53
  49. 62
  50. 55
  51. 74
  52. 74
  53. 64
  54. 64
  55. 57
  56. 64
  57. 60
  58. 64
  59. 66
  60. 52
  61. 71
  62. 55
  63. 65
  64. 75
  65. 42
  66. 74
  67. 94
In [4]:
# 2b
fresh_15_full$"WT SEPT"[55] # one way
fresh_15_full[55,2] # another way
Out[4]:
57
Out[4]:
A tibble: 1 × 1
WT SEPT
<dbl>
57
In [5]:
# 2c
fresh_15_full[45:55,] # show rows 45 to 55, all the columns
Out[5]:
A tibble: 11 × 5
SEX    WT SEPT  WT APRIL  BMI SEPT  BMI APRIL
<chr>  <dbl>    <dbl>     <dbl>     <dbl>
F      51       54        18.31     19.28
F      59       62        19.64     20.63
F      65       68        23.02     24.10
F      53       56        20.63     21.91
F      62       65        22.61     23.81
F      55       58        22.03     23.42
M      74       77        20.31     21.34
M      74       78        20.31     21.36
M      64       68        19.59     20.77
M      64       68        21.05     22.31
F      57       61        23.47     25.11
In [6]:
# 2 d
fresh_15_full[45:55,c(4,5)] # show rows 45 to 55, columns 4 and 5
Out[6]:
A tibble: 11 × 2
BMI SEPT  BMI APRIL
<dbl>     <dbl>
18.31     19.28
19.64     20.63
23.02     24.10
20.63     21.91
22.61     23.81
22.03     23.42
20.31     21.34
20.31     21.36
19.59     20.77
21.05     22.31
23.47     25.11

Problem 3

In [7]:
# 3 a
summary(fresh_15_full$"WT SEPT") # five-number summary (plus the mean) for "WT SEPT"
Out[7]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.00   56.50   64.00   65.06   70.50   97.00 
In [8]:
# 3 a
summary(fresh_15_full$"WT APRIL") # five-number summary (plus the mean) for "WT APRIL"
Out[8]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  47.00   58.00   66.00   66.24   70.00  105.00 

3b.

The median of "WT APRIL" is 66.00 pounds and the median of "WT SEPT" is 64.00 pounds. Therefore, the median weight of the freshmen in April (spring semester) is larger than their median weight in September (fall semester).

In [9]:
# 3 c
# range = max - min
range_sept <- 97.00 - 42.00 # define range for weight of sept
range_april <- 105.00 - 47.00 # define range for weight of april
range_sept
range_april
Out[9]:
55
Out[9]:
58

The range in April (58) is larger than the range in September (55).

In [10]:
# 3 d
# range rule of thumb: estimate for standard deviation = range/4
approx_sd_wt_sept <- range_sept/4 
approx_sd_wt_april <- range_april/4 
approx_sd_wt_sept
approx_sd_wt_april
Out[10]:
13.75
Out[10]:
14.5

Using the range rule of thumb, the "spread" of the weights in April (approx. 14.5) is larger than the spread of the weights in September (approx. 13.75).
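
The minimum and maximum in 3c were typed in by hand from the summary output. As a sketch of an alternative, the same quantities can be computed directly from the data. Here `wt_sept` is an assumed name for a plain vector holding the 67 "WT SEPT" values copied from Out[3], so the example runs without the Excel file. Note that R's `range()` returns the minimum and maximum, not their difference, so `diff()` is applied to it.

```r
# The 67 "WT SEPT" values, copied from Out[3]
wt_sept <- c(72, 97, 74, 93, 68, 59, 64, 56, 70, 58,
             50, 71, 67, 56, 70, 61, 53, 92, 57, 67,
             58, 49, 68, 69, 87, 81, 60, 52, 70, 63,
             56, 68, 68, 54, 80, 64, 57, 63, 54, 56,
             54, 73, 77, 63, 51, 59, 65, 53, 62, 55,
             74, 74, 64, 64, 57, 64, 60, 64, 66, 52,
             71, 55, 65, 75, 42, 74, 94)

range(wt_sept)                        # c(min, max), not the difference
range_sept <- diff(range(wt_sept))    # max - min
approx_sd_wt_sept <- range_sept / 4   # range rule of thumb
range_sept
approx_sd_wt_sept
```

With the full data frame loaded, the same idea would be `diff(range(fresh_15_full$"WT SEPT"))`, which avoids retyping numbers and the transcription mistakes that can come with it.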

Problem 4

There are 35 Females and 32 Males, so there are more Females in the data set.
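
No code is shown for this count. One way to obtain it is `table()`, sketched here on an assumed six-value vector `sex_small` copied from the first six rows of Out[1]. Applied to the full column, `table(fresh_15_full$SEX)` would count all 67 students.

```r
# Hypothetical mini sample: SEX values from the first six rows of Out[1]
sex_small <- c("M", "M", "M", "M", "F", "M")

counts <- table(sex_small)   # tally how often each label occurs
counts
names(which.max(counts))     # the most frequent label in this small sample
```

The counts for this six-row sample are not the counts for the whole data set; only the technique carries over.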

Problem 5

In [11]:
sd_wt_sept <- sd(fresh_15_full$"WT SEPT") # standard deviation of "WT SEPT"
sd_wt_april <- sd(fresh_15_full$"WT APRIL") # standard deviation of "WT APRIL"
sd_wt_sept 
sd_wt_april
Out[11]:
11.2853895852718
Out[11]:
11.2843274960247

The exact standard deviation of the weights in September is 11.285... and the approximation from the range rule of thumb is 13.75, so the approximation overestimates the actual value.

The exact standard deviation of the weights in April is 11.284... and the approximation from the range rule of thumb is 14.5, so the approximation again overestimates the actual value.
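
To put a number on how far off the approximation is, the short sketch below computes the relative overestimate for each month. The values are copied from Out[10] and Out[11]; the variable names `approx_sd`, `exact_sd`, and `rel_error` are introduced here for illustration only.

```r
# Values copied from Out[10] (range rule of thumb) and Out[11] (sd())
approx_sd <- c(sept = 13.75, april = 14.5)
exact_sd  <- c(sept = 11.2853895852718, april = 11.2843274960247)

# relative overestimate: (approximation - exact) / exact
rel_error <- (approx_sd - exact_sd) / exact_sd
round(100 * rel_error, 1)    # percent overestimate for each month
```

The rule of thumb overestimates by about 21.8% in September and 28.5% in April, even though the two exact standard deviations are nearly identical. A few extreme weights stretch the range without changing the standard deviation much.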

PART 2

Problem 6

In [12]:
hist(fresh_15_full$"WT SEPT", col="pink") # histogram for "WT SEPT" 
    # notice the optional col="pink" argument, separated by a comma, which gives the histogram some color ;-)
Out[12]:
In [13]:
hist(fresh_15_full$"WT APRIL", col="pink") # histogram for "WT APRIL"
Out[13]:
In [14]:
hist(fresh_15_full$"BMI SEPT", col="lightblue") # histogram for "BMI SEPT"
Out[14]:
In [15]:
hist(fresh_15_full$"BMI APRIL", col="lightblue") # histogram for "BMI APRIL"
Out[15]:

Problem 7

In [16]:
# box plot 
# compare before and after freshman year weights
boxplot(fresh_15_full[,c(2,3)], horizontal=TRUE, col="aquamarine")
Out[16]:

Comparing the two box plots, we see some evidence supporting the "freshman 15" myth, but the change is not dramatic. The minimum weight increased from 42.00 in September to 47.00 in April, the maximum increased from 97.00 to 105.00, and the median increased from 64.00 to 66.00.

Overall, the April box plot is shifted to the right, indicating that the weights increased, although the third quartile is essentially unchanged (70.50 in September vs. 70.00 in April).
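
The size of the shift can be read off the two summaries from Problem 3. The sketch below subtracts the September values (Out[7]) from the April values (Out[8]); the names `sept`, `april`, and `shift` are assumptions made for this illustration.

```r
# Summary values copied from Out[7] and Out[8]
stat_names <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
sept  <- setNames(c(42.00, 56.50, 64.00, 65.06, 70.50,  97.00), stat_names)
april <- setNames(c(47.00, 58.00, 66.00, 66.24, 70.00, 105.00), stat_names)

shift <- april - sept    # positive values mean the April statistic is larger
round(shift, 2)
```

Every statistic except the third quartile is larger in April, and the median and mean move by only about 1 to 2 units, which is consistent with a modest rather than dramatic increase.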

PART 3

Problem 8

In [17]:
# box plot 
# compare two box plots between M vs F
# the part before the tilde (~) is the variable being plotted (the "WT SEPT" column)
# the part after the tilde (~) is the grouping variable (SEX, M/F)
boxplot(fresh_15_full$"WT SEPT" ~ fresh_15_full$SEX, horizontal=TRUE, col="orange")
Out[17]:
In [18]:
boxplot(fresh_15_full$"WT APRIL" ~ fresh_15_full$SEX, horizontal=TRUE, col="deepskyblue")
Out[18]:
In [19]:
boxplot(fresh_15_full$"BMI SEPT" ~ fresh_15_full$SEX, horizontal=TRUE, col="salmon")
Out[19]:
In [20]:
boxplot(fresh_15_full$"BMI APRIL" ~ fresh_15_full$SEX, horizontal=TRUE, col="yellow")
Out[20]:
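
The formula calls above repeat `fresh_15_full$` on both sides of the tilde. A possible alternative, sketched here, passes the data frame once through the `data` argument and refers to columns by name. The example assumes a hypothetical ten-row data frame `fresh_small` copied from the first ten rows of Out[1], so it runs without the Excel file.

```r
# Hypothetical mini data set: first ten rows of the Out[1] table
fresh_small <- data.frame(
  SEX       = c("M", "M", "M", "M", "F", "M", "F", "F", "F", "F"),
  `WT SEPT` = c(72, 97, 74, 93, 68, 59, 64, 56, 70, 58),
  check.names = FALSE   # keep the space in "WT SEPT"
)

# data = ... lets the formula refer to columns by name;
# backticks are needed because "WT SEPT" contains a space
bp <- boxplot(`WT SEPT` ~ SEX, data = fresh_small,
              horizontal = TRUE, col = "orange")
bp$names   # group labels, one per box
bp$n       # number of observations in each group
```

`boxplot()` also returns the statistics it plotted, so saving the result in `bp` gives access to the group labels and group sizes without reading them off the picture.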

BONUS

Wouldn't it be nice to do all of them in one graph?

In [33]:
boxplot(fresh_15_full$"WT SEPT", fresh_15_full$"WT APRIL", fresh_15_full$"BMI SEPT", fresh_15_full$"BMI APRIL", col="red", horizontal=TRUE)
Out[33]:
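
When vectors are passed to `boxplot()` one by one like this, the boxes are labelled only 1 to 4. A small sketch of one way to label them is the `names` argument. The four short vectors below are assumptions made for this illustration, copied from the first five rows of Out[1].

```r
# Hypothetical mini sample: first five rows of the Out[1] table
wt_sept   <- c(72, 97, 74, 93, 68)
wt_april  <- c(59, 86, 69, 88, 64)
bmi_sept  <- c(22.02, 19.70, 24.09, 26.97, 21.51)
bmi_april <- c(18.14, 17.44, 22.43, 25.57, 20.10)

# names = ... supplies one label per box, in the order the vectors are given
bp <- boxplot(wt_sept, wt_april, bmi_sept, bmi_april,
              names = c("WT SEPT", "WT APRIL", "BMI SEPT", "BMI APRIL"),
              col = "red", horizontal = TRUE)
bp$names
```

Weights and BMI values sit on different scales, so the BMI boxes will look compressed next to the weight boxes; separate plots may be easier to read.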