R commands for Intermediary Statistics

Basic Stuff

Data Manipulation

Descriptive Statistics

Prerequisite

Plots

SimpleRegression

Transformed Regression

Residual Plots

Advanced Methods



Basic Stuff

Control-R
   Take a command from the script and run it (for windows)

Command-Enter
   Take a command from the script and run it (for MAC)

?plot
   Read the help file on plot

??plot
   Search for all commands that have plot in the description

#Remember to fix this
   A comment that won't be run if you do control-r

x <- 2
   Assign the value of 2 to the box named x

x <- c(2,4,3,5,7)
   Assign the vector of numbers 2 4 3 5 and 7 to the box named x

x <- "hello"
   Assign the characters hello to the box named x ls()
   See all the variables (boxes) that you've created

rm(x)
   Remove the box named x from your list of variables

Control-L
   If you press control and l you'll clear the console (commands you've run)

Data manipulation

x<-read.table("http://www.uwyo.edu/crawford/Datasets/printers.txt",header=TRUE)
   Read in the data set from the url, and save in a box called x with the header names

x<-read.table("http://www.uwyo.edu/crawford/Datasets/algea.txt",header=TRUE,sep="\t")
   Read in the data when it's tab delimited (verses comma delimited)

x<-read.csv("http://www.uwyo.edu/crawford/Datasets/brain.txt",header=TRUE,skip=3)
   Read in the data but skip the first 3 lines (of text)

x[2,]
   In the data set x, only use row 2

x[,3]
   In the data set x, only use column 3

x[2,3]
   In the data set x grab the value in row 2 column 3

x[,c(2,3,4,5)]
   In the data set x only use columns 2, 3, 4, and 5

head(x)
   See just first six rows of the dataset x

nrow(x)
   The number of rows in dataset x

round(x,5)
   round the number x to 5 decimal places

as.numeric(as.character(dataset$nums))
   turn factors (through characters) into numbers

x[x$color=="red",]
   In data set x, find only the rows where the color is red

x[x$height>0,]
   In data set x, find only the rows where the height is above 0

na.omit(x)
   The dataset x except remove any rows that have an "NA" value

x<-rnorm(100,3,2)
   Create 100 random numbers that are normal with a mean of 3 and sd 2. Store it in x

Descriptive Statistics

min(x)
   Find the minimum value in x

max(x)
   Find the maximum value in x

sum(x)
   the sum of x

mean(x)
   the mean of x

sd(x)
   the standard deviation of x

Prerequisite Skills

pnorm(1.96)
   Find the area to the left of a z-score of 1.96

qnorm(0.025)
   Get the left tailed z-score for an area of 0.025

rnorm(10)
   Get 10 random draws from the normal distribution

dnorm(1.96)
   The y value of the bellcurve at 1.96 (for plotting) dpois(5,2)
   The probability for the Poisson distribution where theta=2

pt(1.96,2)
pchisq(1.96,2)
pf(1.96,2,5)
   Areas for other distributions with their degrees of freedom

t.test(x)
   A one sample test of mu=0, also confidence interval for the mean

t.test(x,y)
   Two sample test of mu1=mu2, also confidence interval for the difference

t.test(x,y,paired=TRUE)
   Two sample matched pairs t-test (with confidence interval)

Plots

boxplot(x)
   Make a boxplot of x

hist(x)
   Draw a histogram of x

plot(x)
   Plot the values of x in order (not actually that useful in this class)

plot(y~x)
   Draw a scatterplot of y based on x

plot(y~x,xlim=c(0,100))
   Plot y on x, but make the x axis go from 0 to 100

plot(y~x,ylim=c(0,100))
   Plot y on x, but make the y axis go from 0 to 100

plot(y~x,col="red")
   Plot y on x with red dots

plot(y~x,xlab="Time")
   Plot y on x and label the x axis Time

plot(y~x,ylab="Height")
   Plot y on x and label the y axis Height

plot(y~x,main="Height based on Time")
   Plot y on x and write Height based on Time at the top

lines(y~x)
   Add the line for y on x on top of whatever plot is already there

points(y~x)
   Add the dots for y on x on top of whatever plot is already there

legend("topright",col=c("red","yellow","blue"),legend=c("high","medium","low"),lty=1)
   Put a legend in the top right corner. Have the red line say high, etc.

par(mfrow=c(2,2))
   Start putting 4 plots (2 rows, 2 columns) on one picture

x<-seq(0,10,length=1000)
y<-5+2*x
plot(y~x,type="l")
   Plot the line y=5+2*x

Simple Regression

fit<-lm(y~x,data=flowers)
fit<-lm(flowers$y~flowers$x)
   Predict y based on x, and save the results in a variable (box) called fit

plot(fit)
   plot the residuals (4 different plots)

summary(fit)
   Get slopes, p-values, R^2, and the standard error

confint(fit)
   Computes condidence intervals for one or more parameters in an lm model called fit

predict.lm(fit,newdata=data.frame(x=10),interval="prediction",level=.99)
   Give a 99% prediction interval for possible y values where x=10

predict.lm(fit,newdata=data.frame(x=10),interval="confidence",level=.99)
   Give a 99% confidence interval for the average/prediction where x=10

AIC(fit)
BIC(fit)
   Get the AIC or BIC values for a model

fit<-aov(y~x,data=flowers)
fit<-aov(flowers$y~flowers$x)
   Predict y based on x, and save the results in a variable (box) called fit anova(fit)
   Get the sums of squares and F tests for the ANOVA

TukeyHSD(fit)
   Perform pairwise confidence intervals with Tukey's EWER correction

Transformed Regression

lm(y~I(x^2),data=flowers)
   Predict y based on x squared

lm(y~x+I(x^2)+q+w+q*w,data=flowers)
   predict y based on x, x^2, q, w, and their interaction

log(2)
   log uses base e
log10(2)
   log10 uses base 10
exp(2)
   exponent on e

lm(y~log(x),data=flowers)
   log does not use the I() notation

Residual Plots

plot(fit,which=1)
   Plot the first residual plot (residuals vs fitted values)

plot(fit$residuals~x)
   Plot the residuals against x

plot(fit$residuals~fit$fitted.values)
   Plot the residuals against the predicted values

qqnorm(fit$residuals)
   Plot a QQ plot of the residuals

qqline(fit$residuals)
   Put the line on a QQplot

plot(density(fit$residuals))
   Draw the density plot of the residuals

plot(cooks.distance(fit))
   See Cook's Distance for each data point

Advanced Methods

install.packages("leaps")
   Downloading a package of new commands from another server

library(leaps)
   Installing the package of new commands into the R you're using

bestfit<-regsubsets(y~x,nbest=1,nvmax=16)
   Perform best subset regression on the data

summary(bestfit)$bic
summary(bestfit)$which[2,]
   Pick out specific pieces from the best subset regression

plot(TukeyHSD(fit))
   See pairwise plots of the Tukey's HSD comparisons

chisq.test(data)
   Take a matrix and does an independence chi-squared test

chisq.test(data,probs)
   Perform Goodness of Fit on the data

kruskal.wallis(data)
   Perfom a Kruskal Wallis test

kruskal.wallis(data,paired=TRUE)
   Perfom a Signed Rank test