Linear Regression using R

Iteration-0

Consolidated from two good posts. This is just for personal reading and reference use only.



------------------------------------------------------------------------
https://www.r-bloggers.com/simple-linear-regression-2/


One of the most frequently used techniques in statistics is linear regression, where we investigate the potential relationship between a variable of interest (often called the response variable, but there are many other names in use) and a set of one or more variables (known as the independent or explanatory variables). Unsurprisingly there are flexible facilities in R for fitting a range of linear models, from the simple case of a single variable to more complex relationships.
For this example we will use some data from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition – Duxbury 1990). The data come from a study in central Florida in which 15 alligators were captured and two measurements were made on each of them: the weight (in pounds) and the snout vent length (in inches – the distance from the back of the head to the end of the nose).
The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length. The authors analysed the data on the log scale (natural logarithms) and we will follow their approach for consistency. We first create a data frame for this study:
alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
As with most analyses, the first step is to perform some exploratory data analysis to get a visual impression of whether there is a relationship between weight and snout vent length and what form it is likely to take. We create a scatter plot of the data as follows:
library(lattice)   # xyplot is provided by the lattice package
xyplot(lnWeight ~ lnLength, data = alligator,
  xlab = "Snout vent length (inches) on log scale",
  ylab = "Weight (pounds) on log scale",
  main = "Alligators in Central Florida"
)
The scatter plot is shown here:
[Figure: scatter plot of the weight and snout vent length (both on the log scale) for alligators caught in central Florida]
The graph suggests that weight (on the log scale) increases linearly with snout vent length (again on the log scale) so we will fit a simple linear regression model to the data and save the fitted model to an object for further analysis:
alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
The function lm fits a linear model to data; we specify the model using a formula in which the response variable is on the left-hand side, separated by a ~ from the explanatory variables. The formula provides a flexible way to specify various different functional forms for the relationship. The data argument tells R where to look for the variables used in the formula.
Now that the model is saved as an object we can use some of the general purpose functions for extracting information from this object about the linear model, e.g. the parameters or residuals. The big plus with R is that there are functions defined for different types of model, using the same name such as summary, and the system works out what function we intended to use based on the type of object saved. To create a summary of the fitted model:
> summary(alli.mod1)
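Beyond summary, the other generic extractor functions can be applied to the same saved fit. A minimal sketch (output omitted), using only standard functions from base R:

coef(alli.mod1)       # estimated intercept and slope
confint(alli.mod1)    # 95% confidence intervals for the coefficients
fitted(alli.mod1)     # fitted values on the log-weight scale
resid(alli.mod1)      # residuals from the fit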

--------------------------------------------------

https://www.r-bloggers.com/using-linear-regression-to-predict-energy-output-of-a-power-plant/

In this post, we fit a linear regression to predict the energy output at a Combined Cycle Power Plant (CCPP). The dataset is obtained from the UCI Machine Learning Repository. It contains five columns, namely Ambient Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH), Exhaust Vacuum (V), and net hourly electrical energy output (PE) of the plant. The first four are the attributes and are used to predict the output, PE.

Reading in the data and splitting

Since the data is in xlsx format, we will need the xlsx package. We will use the data from the first sheet of the file.
library(xlsx)
powerData <- read.xlsx('Folds5x2_pp.xlsx', 1)

head(powerData)

     AT     V      AP    RH     PE
1 14.96 41.76 1024.07 73.17 463.26
2 25.18 62.96 1020.04 59.08 444.37
3  5.11 39.40 1012.16 92.14 488.56
4 20.86 57.32 1010.24 76.64 446.48
5 10.82 37.50 1009.23 96.62 473.90
6 26.27 59.44 1012.23 58.77 443.67
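As an aside (not part of the original post): the xlsx package requires Java, so if that is a problem the readxl package can read the same file. A minimal sketch, assuming readxl is installed:

library(readxl)
# read the first sheet and convert the returned tibble to a plain data frame
powerData <- as.data.frame(read_excel("Folds5x2_pp.xlsx", sheet = 1))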
Next, we need to split the data into a training set and a testing set. As their names imply, the training set is used to train and build the model, and then this model is tested on the testing set. Let’s say we want to have about 75% of the data in the training set and 25% of the data in the testing set. It can be done as follows:
set.seed(123)
split <- sample(seq_len(nrow(powerData)), size = floor(0.75 * nrow(powerData)))
trainData <- powerData[split, ]
testData <- powerData[-split, ]
head(trainData)

        AT     V      AP    RH     PE
2752 29.14 67.45 1015.51 46.47 433.34
7542 24.67 70.94 1007.99 75.64 443.51
3913 20.84 51.19 1008.63 84.11 448.98
8447 31.73 74.67 1016.38 44.51 425.34
8995  4.44 38.44 1016.14 75.35 486.53
436   9.43 37.14 1013.03 74.99 473.57

head(testData)

      AT     V      AP    RH     PE
2  25.18 62.96 1020.04 59.08 444.37
4  20.86 57.32 1010.24 76.64 446.48
12 20.14 46.93 1014.66 64.22 453.99
13 24.34 73.50 1011.31 84.15 440.29
14 25.71 58.59 1012.77 61.83 451.28
17 18.21 45.00 1022.86 48.84 467.54
Let me explain what the above commands do.
  • First, we set the seed so that the results are reproducible.
  • Then, we create a sequence whose length is equal to the number of rows of the dataset. These numbers act as the indices of the dataset. We randomly select 75% of the numbers in the sequence and store them in the variable split.
  • Lastly, we copy all the rows which correspond to the indices in split into trainData and all the remaining rows into testData (a quick check of the split sizes is sketched below).
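A quick sanity check of the split proportions (a sketch, not from the original post):

nrow(trainData) / nrow(powerData)   # should be close to 0.75
nrow(testData) / nrow(powerData)    # should be close to 0.25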

Building the prediction model

Now, let’s build the prediction model. We will use the lm function for this.
predictionModel <- lm(PE ~ AT + V + AP + RH, data = trainData)
The above function will try to predict PE based on the variables AT, V, AP, and RH. Since we are using all the variables present in the dataset, a shorter way to write the above command is as follows (this is very helpful when there are a large number of variables):
predictionModel <- lm(PE ~ ., data = trainData)
We can now look at the summary of the model.
summary(predictionModel)

Call:
lm(formula = PE ~ ., data = trainData)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.363  -3.148  -0.140   3.162  17.763 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 450.764756  11.331102   39.781  < 2e-16 ***
AT           -1.978787   0.017599 -112.435  < 2e-16 ***
V            -0.232049   0.008415  -27.575  < 2e-16 ***
AP            0.065590   0.010993    5.967 2.54e-09 ***
RH           -0.155561   0.004829  -32.214  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.574 on 7171 degrees of freedom
Multiple R-squared:  0.9284, Adjusted R-squared:  0.9284 
F-statistic: 2.326e+04 on 4 and 7171 DF,  p-value: < 2.2e-16
This will help us determine which variables to include in the model. A linear regression can be represented by the equation: y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ⋯ + ε_i, where y_i represents the outcome we're predicting (PE), the x_ij represent the various attributes (AT, V, AP, and RH), the β values represent their coefficients, β_0 is the intercept (constant term), and ε_i represents the error term. The first column in the summary, namely Estimate, gives us these values: the first value corresponds to the intercept and the rest of the values correspond to the various β coefficients. If the coefficient for a particular attribute is 0 or close to 0, that means it has very little to no effect on the prediction and hence can be removed. The Std. Error column gives an estimate of how much the coefficient may vary from the estimated value. The t value is calculated by dividing the Estimate column by the Std. Error column. The last column gives a measure of how likely it is that the coefficient is 0 and is inversely related to the t value. Hence, an attribute with a high absolute value of t, or a very low value of Pr(>|t|), is desirable.
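As a small illustration (a sketch, not from the original post), the coefficient table can be extracted from the summary object, and the t value column can be reproduced by dividing the estimates by their standard errors:

coefTable <- coef(summary(predictionModel))           # matrix with Estimate, Std. Error, t value, Pr(>|t|)
coefTable[, "Estimate"] / coefTable[, "Std. Error"]   # matches the t value column above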
The easiest way to determine which variables are significant is by looking at the stars next to them. The scheme is explained at the bottom of the table. Variables with three stars are most significant, followed by two stars, and finally one. Variables with a period next to them may or may not be significant and are generally not included in prediction models, and variables with nothing next to them are not significant.
In our model, we can see that all our variables are highly significant, so we will leave our prediction model as it is. In case you are dealing with a dataset in which there are one or more variables which are non-significant, it is advisable to test the model by removing one variable at a time. This is because when two variables are highly correlated with each other, they may become non-significant to the model, but when one of them is removed, they could become significant. This is due to multicollinearity. You can find out more about multicollinearity here.
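One quick way to spot potential multicollinearity (a sketch, not from the original post) is to look at the pairwise correlations among the attributes:

cor(trainData[, c("AT", "V", "AP", "RH")])   # correlation matrix of the four predictors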
The easiest way to check the accuracy of a model is by looking at the R-squared value. The summary provides two R-squared values, namely Multiple R-squared, and Adjusted R-squared. The Multiple R-squared is calculated as follows:
Multiple R-squared = 1 – SSE/SST where:
  • SSE is the sum of squares of the residuals. A residual is the difference between the predicted value and the actual value, and the residuals can be accessed via predictionModel$residuals.
  • SST is the total sum of squares. It is calculated by summing the squares of the differences between the actual values and their mean.
For example, let's say the actual values are 5, 6, 7, and 8, and a model predicts the outcomes as 4.5, 6.3, 7.2, and 7.9. Then SSE can be calculated as: SSE = (5 - 4.5)^2 + (6 - 6.3)^2 + (7 - 7.2)^2 + (8 - 7.9)^2; and SST can be calculated as: mean = (5 + 6 + 7 + 8) / 4 = 6.5; SST = (5 - 6.5)^2 + (6 - 6.5)^2 + (7 - 6.5)^2 + (8 - 6.5)^2. The same calculation is sketched in R below.
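In R, the toy calculation looks like this (a sketch using the numbers above):

actual    <- c(5, 6, 7, 8)
predicted <- c(4.5, 6.3, 7.2, 7.9)
SSE <- sum((actual - predicted)^2)        # 0.25 + 0.09 + 0.04 + 0.01 = 0.39
SST <- sum((actual - mean(actual))^2)     # mean is 6.5, so SST = 5
1 - SSE / SST                             # R-squared for the toy example, about 0.922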
The Adjusted R-squared value is similar to the Multiple R-squared value, but it accounts for the number of variables. This means that the Multiple R-squared will always increase when a new variable is added to the prediction model, but if the variable is a non-significant one, the Adjusted R-squared value will decrease. For more info, refer here.
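For concreteness (a sketch, not from the original post), the adjusted value reported by summary can be reproduced from the multiple R-squared, the number of training observations, and the number of predictors:

r2 <- summary(predictionModel)$r.squared    # multiple R-squared
n  <- nrow(trainData)                       # number of training observations
p  <- length(coef(predictionModel)) - 1     # number of predictors, excluding the intercept
1 - (1 - r2) * (n - 1) / (n - p - 1)        # adjusted R-squared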
An R-squared value of 1 means that it is a perfect prediction model, and an R-squared value of 0 means that it is of no improvement over the baseline model (the baseline model just predicts the output to always be equal to the mean). From the summary, we can see that our R-squared value is 0.9284, which is very high.

Testing the prediction model

We will now apply the prediction model to the test data.
prediction <- predict(predictionModel, newdata = testData)
Now, let’s look at the first few values of prediction and compare them to the values of PE in the test data set.
head(prediction)

       2        4       12       13       14       17 
444.0433 450.5260 456.5837 438.7872 443.1039 463.7809 

head(testData$PE)

[1] 444.37 446.48 453.99 440.29 451.28 467.54
This means that for the value of 444.37, our prediction was 444.0433, for 446.48, it was 450.5260, and so on.
We can calculate the value of R-squared for the prediction model on the test data set as follows:
SSE <- sum((testData$PE - prediction) ^ 2)
SST <- sum((testData$PE - mean(testData$PE)) ^ 2)
1 - SSE/SST

0.9294734
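A related metric that is often reported alongside R-squared (not part of the original post) is the root mean squared error on the test set:

RMSE <- sqrt(mean((testData$PE - prediction)^2))   # average prediction error in the units of PE
RMSE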


http://data.princeton.edu/R/linearModels.html

4.1 Fitting a Model

To fit an ordinary linear model with fertility change as the response and setting and effort as predictors, try
> lmfit = lm( change ~ setting + effort )
Note first that lm is a function, and we assign the result to an object that we choose to call lmfit (for linear model fit). This stores the results of the fit for later examination.
The argument to lm is a model formula, which has the response on the left of the tilde ~ (read "is modeled as") and a Wilkinson-Rogers model specification formula on the right. R uses
+ to combine elementary terms, as in A+B;
: for interactions, as in A:B;
* for both main effects and interactions, so A*B = A + B + A:B.
A nice feature of R is that it lets you create interactions between categorical variables, between categorical and continuous variables, and even between numeric variables (it just creates the cross-product).
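For example, using the same change, setting and effort variables (a sketch, assuming the data are attached as in the fit above), these two calls specify the same model with both main effects and their interaction:

> lm( change ~ setting * effort )                   # expands to main effects plus interaction
> lm( change ~ setting + effort + setting:effort )  # the same model written out explicitly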

4.2 Examining a Fit

Let us look at the results of the fit. One thing you can do with lmfit, as you can with any R object, is print it.
> lmfit
 
Call:
lm(formula = change ~ setting + effort)
 
Coefficients:
(Intercept)      setting       effort  
   -14.4511       0.2706       0.9677
The output includes the model formula and the coefficients. You can get a bit more detail by requesting a summary:
> summary(lmfit)
 
Call:
lm(formula = change ~ setting + effort)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-10.3475  -3.6426   0.6384   3.2250  15.8530 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -14.4511     7.0938  -2.037 0.057516 .  
setting       0.2706     0.1079   2.507 0.022629 *  
effort        0.9677     0.2250   4.301 0.000484 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
 
Residual standard error: 6.389 on 17 degrees of freedom
Multiple R-Squared: 0.7381,     Adjusted R-squared: 0.7073 
F-statistic: 23.96 on 2 and 17 DF,  p-value: 1.132e-05 
The output includes a more conventional table with parameter estimates and standard errors, as well as the residual standard error and multiple R-squared. (By default S-Plus includes the matrix of correlations among parameter estimates, which is often bulky, while R sensibly omits it. If you really need it, add the option correlation=TRUE to the call to summary.)
To get a hierarchical analysis of variance table corresponding to introducing each of the terms in the model one at a time, in the same order as in the model formula, try the anova function:
> anova(lmfit)
Analysis of Variance Table
 
Response: change
          Df  Sum Sq Mean Sq F value    Pr(>F)    
setting    1 1201.08 1201.08  29.421 4.557e-05 ***
effort     1  755.12  755.12  18.497 0.0004841 ***
Residuals 17  694.01   40.82                      
---
Signif. codes:  0  `***'  0.001  `**'  0.01  `*'  0.05  `.'  0.1  ` '  1
Alternatively, you can plot the results using
> plot(lmfit)
This will produce a set of four plots: residuals versus fitted values, a Q-Q plot of standardized residuals, a scale-location plot (square roots of standardized residuals versus fitted values), and a plot of residuals versus leverage that adds bands corresponding to Cook's distances of 0.5 and 1.
R will prompt you to click on the graph window or press Enter before showing each plot, but we can do better. Type par(mfrow=c(2,2)) to set your graphics window to show four plots at once, in a layout with 2 rows and 2 columns. Then redo the graph using plot(lmfit). To go back to a single graph per window use par(mfrow=c(1,1)). There are many other ways to customize your graphs by setting high-level parameters; type ?par to learn more.
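In code, the sequence just described is:

> par(mfrow = c(2, 2))   # 2 x 2 layout: show all four diagnostic plots at once
> plot(lmfit)
> par(mfrow = c(1, 1))   # back to a single plot per window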
Technical Note: You may have noticed that we have used the function plot with all kinds of arguments: one or two variables, a data frame, and now a linear model fit. In R jargon plot is a generic function. It checks for the kind of object that you are plotting and then calls the appropriate (more specialized) function to do the work. There are actually many plot functions in R, including plot.data.frame and plot.lm. For most purposes the generic function will do the right thing and you don't need to be concerned about its inner workings.

4.3 Extracting Results

There are some specialized functions that allow you to extract elements from a linear model fit. For example
> fitted(lmfit)
        1         2         3         4         5         6         7         8 
-2.004026  5.572452 25.114699 21.867637 28.600325 24.146986 17.496913 10.296380 
        ... output edited ... 
extracts the fitted values. In this case it will also print them, because we did not assign them to anything. (The longer form fitted.values is an alias.)
To extract the coefficients use the coef function (or the longer form coefficients):
> coef(lmfit)
(Intercept)     setting      effort 
-14.4510978   0.2705885   0.9677137 
To get the residuals, use the residuals function (or the abbreviation resid):
> residuals(lmfit)
          1           2           3           4           5           6 
  3.0040262   4.4275478   3.8853007   3.1323628   0.3996747  15.8530144
        ... output edited ... 
If you are curious to see exactly what a linear model fit produces, try the function
> names(lmfit)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"
which lists the named components of a linear fit. All of these objects may be extracted using the $ operator. However, whenever there is a special extractor function you are encouraged to use it.

4.4 Factors and Covariates

So far our predictors have been continuous variables, or covariates. We can also use categorical variables, or factors. Let us group family planning effort into three categories:
> effortg = cut(effort, breaks = c(-1, 4, 14, 100), 
+    label=c("weak","moderate","strong"))
The function cut creates a factor, or categorical variable. The first argument is an input vector, the second is a vector of breakpoints, and the third is a vector of category labels. Note that there is one more breakpoint than there are categories. All values greater than the i-th breakpoint and less than or equal to the (i+1)-st breakpoint go into the i-th category. Any values below the first breakpoint or above the last one are coded NA (a special R code for missing values). If the labels are omitted, R generates a suitable default of the form "(a,b]". By default the intervals are closed on the right, so our intervals are ≤ 4; 5-14; and 15+. To change this use the option right=FALSE.
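A quick way to check the grouping (a sketch, assuming effort is attached as in the fits above) is to tabulate the new factor:

> table(effortg)   # counts of observations falling in each of the three categories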
Try fitting the analysis of covariance model:
> covfit = lm( change ~ setting + effortg )
> covfit
 
Call:
lm(formula = change ~ setting + effortg)
 
Coefficients:
    (Intercept)          setting  effortgmoderate    effortgstrong  
        -5.9540           0.1693           4.1439          19.4476
As you can see, family planning effort has been treated automatically as a factor, and R has generated the necessary dummy variables for moderate and strong programs treating weak as the reference cell.
Choice of Contrasts: R codes unordered factors using the reference cell or "treatment contrast" method. The reference cell is always the first category which, depending on how the factor was created, is usually the first in alphabetical order. If you don't like this choice, R provides a special function to re-order levels; check out help(relevel).
S codes unordered factors using the Helmert contrasts by default, a choice that is useful in designed experiments because it produces orthogonal comparisons, but has baffled many a new user. Both R and S-Plus code ordered factors using polynomials. To change to the reference cell method for unordered factors use the following call
> options(contrasts=c("contr.treatment","contr.poly"))
Back on to our analysis of covariance fit. You can obtain a hierarchical anova table for the analysis of covariance model using the anova function:
> anova(covfit)
Analysis of Variance Table
 
Response: change
          Df  Sum Sq Mean Sq F value    Pr(>F)    
setting    1 1201.08 1201.08  36.556 1.698e-05 ***
effortg    2  923.43  461.71  14.053 0.0002999 ***
Residuals 16  525.69   32.86                      
---
Signif. codes:  0  `***'  0.001  `**'  0.01  `*'  0.05  `.'  0.1  ` '  1
Type ?anova to learn more about this function.

4.5 Regression Splines

The real power of R begins to shine when you consider some of the other functions you can include in a model formula. First, you can include mathematical functions, for example log(setting) is a perfectly legal term in a model formula. You don't have to create a variable representing the log of setting and then use it, R will create it 'on the fly', so you can type
> lm( change ~ log(setting) + effort)
If you wanted to use orthogonal polynomials of degree 3 on setting, you could include a term of the form poly(setting, 3).
You can also get R to calculate a well-conditioned basis for regression splines. First you must load the splines library (this step is not needed in S-Plus):
> library(splines)
This makes available the function bs to generate B-splines. For example the call
> lm( change ~ bs(setting, knots = c(66, 74, 84)) + effort )
will generate cubic B-splines with interior knots placed at 66, 74 and 84. This basis will use seven degrees of freedom, four corresponding to the constant, linear, quadratic and cubic terms, plus one for each interior knot. Alternatively, you may specify the number of degrees of freedom you are willing to spend on the fit using the parameter df. For cubic splines R will choose df-4 interior knots placed at suitable quantiles. You can also control the degree of the spline using the parameter degree, the default being cubic.
If you like natural cubic splines, you can obtain a well-conditioned basis using the function ns, which has exactly the same arguments as bs except for degree, which is always three. To fit a natural spline with five degrees of freedom, use the call
> setting.ns <- ns(setting, df=5)
Natural cubic splines are better behaved than ordinary splines at the extremes of the range. The restrictions mean that you save four degrees of freedom. You will probably want to use two of them to place additional knots at the extremes, but you can still save the other two.
To fit an additive model to fertility change using natural cubic splines on setting and effort with only one interior knot each, placed exactly at the median of each variable, try the following call:
> splinefit = lm( change ~ ns(setting, knots = median(setting)) + 
+    ns(effort, knots = median(effort)) )
Here we used the parameter knots to specify where we wanted the knot placed, and the function median to calculate the median of setting and effort.
Do you think the linear model was a good fit? Natural cubic splines with exactly one interior knot require the same number of parameters as an ordinary cubic polynomial, but are much better behaved at the extremes.

4.6 Other Options

The lm function has several additional parameters that we have not discussed. These include
data to specify a dataset, in case it is not attached
subset to restrict the analysis to a subset of the data
weights to do weighted least squares
and many others; see help(lm) for further details. The args function lists the arguments used by any function, in case you forget them. Try args(lm).
The fact that R has powerful matrix manipulation routines means that one can do many of these calculations from first principles. The next couple of lines create a model matrix to represent the constant, setting and effort, and then calculate the OLS estimate of the coefficients as (X'X)^(-1) X'y:
> X <- cbind(1,effort,setting)
> solve( t(X) %*% X ) %*% t(X) %*% change
      
              [,1]
  [1,] -14.4510978
  [2,]   0.9677137
  [3,]   0.2705885
Compare these results with coef(lmfit).
