Result for URL: http://alexei.nfshost.com/PopEcol/lec4/corfact.html
4.2. Correlation between factors
One of the problems with multiple regression is that factors may be correlated.
For example, temperature is often highly correlated with precipitation. If
factors are correlated, then it is impossible to fully separate the effects of
the individual factors. In particular, regression coefficients that indicate the
effect of one factor may change when some other factor is added to or removed
from the model. Step-wise regression helps to evaluate the significance of
individual terms in the equation.
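The dependence of a coefficient on the other factors in the model can be seen in a small numerical sketch. The data below are made up for illustration (y is constructed exactly as 1 + 2*x1 + 3*x2, with x1 and x2 strongly correlated), and the helper names solve_linear_system and fit_ols are hypothetical, not taken from any library.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(factors, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    rows = [[1.0] + [f[i] for f in factors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data: x2 closely follows x1 (the two factors are correlated).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

full = fit_ols([x1, x2], y)   # model with both factors
reduced = fit_ols([x1], y)    # model with x2 removed

print("coefficient of x1, full model:   ", round(full[1], 4))
print("coefficient of x1, reduced model:", round(reduced[1], 4))
```

In the reduced model the coefficient of x1 absorbs part of the effect of the omitted factor x2, so it no longer equals the value used to construct y.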
First, I will remind you of the basics of the analysis of variance (ANOVA).
The total sum of squares (SST) is the sum of squared deviations of individual
measurements from the mean. The total sum of squares is the sum of 2 portions:
(1) Regression sum of squares (SSR), which is the contribution of the factors to
the variance of the dependent variable, and
(2) Error sum of squares (= residual sum of squares) (SSE), which is the
stochastic component of the variation of the dependent variable.
SSR is the sum of squared deviations of predicted values (predicted using
regression) from the mean value, and SSE is the sum of squared deviations of
actual values from predicted values.
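The decomposition SST = SSR + SSE can be checked numerically. The following is a minimal sketch for a simple linear regression y = a + b x fitted by least squares; the data values are made up for illustration only.

```python
# Made-up illustrative data (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(y)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept for y = a + b*x.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * xi for xi in x]

# Total: deviations of measurements from the mean.
sst = sum((yi - mean_y) ** 2 for yi in y)
# Regression: deviations of predicted values from the mean.
ssr = sum((pi - mean_y) ** 2 for pi in predicted)
# Error: deviations of actual values from predicted values.
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

print("SST =", sst)
print("SSR + SSE =", ssr + sse)
```

The two printed numbers agree up to floating-point rounding. The identity holds for least-squares fits that include an intercept.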
The significance of regression is evaluated using the F-statistic:
$$F = \frac{\mathrm{SSR} / \mathrm{df}(\mathrm{SSR})}{\mathrm{SSE} / \mathrm{df}(\mathrm{SSE})}$$
where
df(SSR)= g - 1 is the number of degrees of freedom for the regression sum of
squares which is equal to the number of coefficients in the equation, g, minus 1;
df(SSE)= N - g is the number of degrees of freedom for the error sum of squares
which is equal to the number of observations, N, minus the number of
coefficients, g;
df(SST) = df(SSR) + df(SSE) = N - 1 is the number of degrees of freedom for the
total sum of squares.
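As a sketch of how these definitions fit together, the function below computes the F-statistic and both degrees of freedom from SSR, SSE, N and g. The function name f_statistic is hypothetical. The input values are those of the full model in the example at the end of this section (SSR = 53.2, SSE = 76.3), with N = 56 and g = 3 inferred from df(SSR) = 2 and df(SSE) = 53.

```python
def f_statistic(ssr, sse, n_obs, n_coef):
    """Return (F, df(SSR), df(SSE)) for a regression with n_coef coefficients."""
    df_ssr = n_coef - 1        # g - 1
    df_sse = n_obs - n_coef    # N - g
    f_value = (ssr / df_ssr) / (sse / df_sse)
    return f_value, df_ssr, df_sse


f_value, df_ssr, df_sse = f_statistic(ssr=53.2, sse=76.3, n_obs=56, n_coef=3)
print("F =", round(f_value, 2), "with df =", (df_ssr, df_sse))
# Degrees of freedom add up to N - 1.
print("df(SST) =", df_ssr + df_sse)
```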
The null hypothesis is that the factors have no effect on the dependent
variable. If this is true, then the total sum of squares is approximately
equally distributed among all degrees of freedom. As a result, the sum of
squares per degree of freedom is approximately the same for the regression and
error terms. Then, the F-statistic is approximately equal to 1.
Now the question is: how much should the F-statistic deviate from 1 for the null
hypothesis to be rejected? To answer this question, we need to look at the
distribution of F under the null hypothesis:
[Figure: distribution of F under the null hypothesis (fisher.gif)]
If the estimated (empirical) value of F exceeds the threshold value (the 95%
point of the cumulative distribution), then the effect of all factors combined
is significant at P = 0.05. (See tables of threshold values for P = 0.05, 0.01,
and 0.001.)
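The null distribution of F and its 95% threshold can be approximated by simulation. The sketch below is not the tabulated method referred to above. It assumes normally distributed errors, in which case the numerator and denominator of F behave under the null hypothesis like independent chi-square variables divided by their degrees of freedom. The function name simulate_null_f, the seed, the number of simulations, and the choice of 2 and 53 degrees of freedom are illustrative assumptions.

```python
import random


def simulate_null_f(df_num, df_den, n_sim, seed=1):
    """Simulate F values under the null hypothesis and return them sorted.

    Each mean square is a sum of squared standard normal values
    divided by its number of degrees of freedom.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_sim):
        num = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df_num)) / df_num
        den = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df_den)) / df_den
        values.append(num / den)
    values.sort()
    return values


null_f = simulate_null_f(df_num=2, df_den=53, n_sim=20000)
mean_f = sum(null_f) / len(null_f)
# Empirical 95% point: 95% of the simulated F values lie below it.
threshold_95 = null_f[int(0.95 * len(null_f))]

print("mean F under the null hypothesis:", round(mean_f, 2))
print("approximate 95% threshold:", round(threshold_95, 2))
```

The simulated mean is close to 1, as argued above, while the 95% threshold is well above 1. An empirical F is compared with this upper threshold only.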
Note: In some statistical textbooks you can find a two-tail F-test (the 5% area
is partitioned into two 2.5% areas in the right and left tails of the
distribution). This method is wrong here because a small F indicates that the
regression performs too well (sometimes suspiciously well). The null hypothesis
is not rejected in this case! If F is very small, then we may suspect some
cheating in data analysis. For example, this may happen if too many data points
were removed as "outliers". However, our objective here is not to test for
cheating (we assume no cheating). Thus, we use a 1-tail F-test.
The F-distribution depends on the number of degrees of freedom for the numerator
[df(SSR)] and denominator [df(SSE)].
Standard regression analysis generally cannot detect the significance of
individual factors. The only exceptions are orthogonal plans, in which factors
are independent (= not correlated). In most cases, factors are correlated, and
thus a special method called step-wise regression should be used to test the
significance of individual factors. Step-wise regression is a comparison of two
regression analyses:
(1) the full model and
(2) the reduced model in which one factor is excluded.
The full model has more coefficients (more regression degrees of freedom), and
therefore it fits the data at least as well as the reduced model. Thus, the
regression sum of squares, SSR, is greater for the full model than for the
reduced model. The question is: is this difference significant or not? If it is
not significant, then the factor that was excluded is not important and can be
ignored. The significance is tested with the same F-statistic, but SSR and
df(SSR) are replaced by the differences in SSR and df(SSR) between the full and
reduced models:
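$$F = \frac{(\mathrm{SSR} - \mathrm{SSR}_1) / [\mathrm{df}(\mathrm{SSR}) - \mathrm{df}(\mathrm{SSR}_1)]}{\mathrm{SSE} / \mathrm{df}(\mathrm{SSE})}$$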
where SSR and SSR1 are regression sum of squares for the full and reduced models,
respectively; df(SSR) and df(SSR1) are degrees of freedom for the regression sum
of squares in the full and reduced models, respectively; SSE is the error sum of
squares for the full model; and df(SSE) is the number of degrees of freedom for
the error sum of squares.
Because only one factor was removed in the reduced model,
df(SSR) - df(SSR1) = 1.
The F-statistic is related to the t-statistic if the numerator has only one
degree of freedom:
$$t = \sqrt{F}$$
Thus, the t-statistic can be used instead of the F in the step-wise regression.
Example of the step-wise regression:
Full model [eqseqexm.gif] ; SSR =53.2, SSE =76.3, df(SSR) =2, df(SSE) =53.
Reduced model y = a + b x; SSR =45.7, SSE =83.8, df(SSR1)=1, df(SSE1)=54.
F = (53.2 - 45.7) × 53 / 76.3 = 5.21; t = 2.28; P
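The arithmetic of this example can be reproduced with a short sketch. The P value is truncated in the text above, so the code computes one independently and does not claim to restore the original figure. It relies on the fact that the upper tail of F with 1 numerator degree of freedom equals the two-sided tail of Student's t with df(SSE) degrees of freedom, and integrates the t density numerically with Simpson's rule. The function names t_pdf and two_sided_t_p_value are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))


def two_sided_t_p_value(t, df, n_steps=2000):
    """P(|T| > |t|), using Simpson's rule on the density between 0 and |t|."""
    t = abs(t)
    h = t / n_steps  # n_steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # probability mass between 0 and |t|
    return 1.0 - 2.0 * central_half


# Values from the example above.
ssr_full, ssr_reduced = 53.2, 45.7
df_ssr_full, df_ssr_reduced = 2, 1
sse_full, df_sse_full = 76.3, 53

f_partial = (((ssr_full - ssr_reduced) / (df_ssr_full - df_ssr_reduced))
             / (sse_full / df_sse_full))
t_stat = math.sqrt(f_partial)
p_value = two_sided_t_p_value(t_stat, df_sse_full)

print("F =", round(f_partial, 2))
print("t =", round(t_stat, 2))
print("P =", p_value)
```

The computed P lies between 0.01 and 0.05, so under these assumptions the excluded factor is significant at the 0.05 level but not at the 0.01 level.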