R-squared and Adjusted R-squared

Hi David,

Trust you are doing great!

I chanced upon some old Schweser material which stated that overestimating the regression is a problem that arises when a higher R^2 is the result of an increased number of independent variables in the model, rather than of how well those independent variables explain the dependent variable.

Can this problem be overlooked if, by trial and error, you have found that a newly introduced independent variable gives a better explanation of the dependent variable?

Is it right to eliminate from the model independent variables that explain the dependent variable less well, before using the regression model?

Can adjusted R^2 also mean rectifying the overestimation by eliminating some independent variables and introducing ones that explain the dependent variable well, or should you strictly stick to using the adjusted R^2 estimator?
 

chiyui

Member
I don't know if my answer helps but I'll try.

As you said, R^2 will just increase anyway when you simply add more independent variables on the right-hand side of the regression function. This means that if you use Microsoft's stock return as the dependent variable and the NASDAQ 100 index return as the independent variable, then even if you add a new variable, say, the monthly revenue of apples sold in Washington (ridiculous in this equation, of course), you will still see R^2 become larger than before.
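To make this concrete, here is a minimal simulation sketch (my own toy example, not from the thread; the variable names and numbers are invented): a pure-noise regressor still nudges R^2 up, while adjusted R^2 typically stays flat or falls.

```python
# Toy sketch (hypothetical data): add an irrelevant "apples" regressor to a
# return regression and compare R^2 with adjusted R^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 120                                    # e.g. 10 years of monthly data
market = rng.normal(0.01, 0.04, n)         # the genuine driver ("index return")
apples = rng.normal(50.0, 10.0, n)         # irrelevant regressor ("apple revenue")
y = 0.002 + 1.2 * market + rng.normal(0, 0.03, n)   # the "stock return"

m1 = sm.OLS(y, sm.add_constant(market)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([market, apples]))).fit()

print(f"one regressor : R2={m1.rsquared:.4f}  adj R2={m1.rsquared_adj:.4f}")
print(f"plus 'apples' : R2={m2.rsquared:.4f}  adj R2={m2.rsquared_adj:.4f}")
# R^2 can only go up (or stay equal); adjusted R^2 penalizes the extra
# degree of freedom, so it usually falls for a junk variable like this.
```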

This creates a problem: how can we identify whether a newly added variable is ridiculous or not?
Of course we know the monthly revenue of apples sold in Washington is ridiculous by common sense. But there are cases where we just can't rely on common sense to make the judgment.

So what I think is, trial and error can't help you avoid ridiculous variables and find the truly meaningful ones.

Let me give a funny example, told by a former professor of mine.
Suppose you regress the GDP of your country on one variable: your driving skill. Looking at the regression result, you will surely see that your driving skill "explains" GDP very well in the statistical sense.
WHY? Because GDP has increased over the years, and so has your driving skill (unless you don't drive at all)!
But I think you can see the point of this simple example.
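Here is a hedged toy version of that example (invented data and coefficients, just for illustration): two series that merely share a time trend look highly "significant" when regressed on each other.

```python
# Toy sketch (hypothetical data): regress a trending "GDP" on a trending
# "driving skill". The two are unrelated, yet the fit looks excellent.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
t = np.arange(40)                               # 40 "years"
gdp = 100 + 3.0 * t + rng.normal(0, 2.0, 40)    # GDP grows over time
skill = 10 + 0.5 * t + rng.normal(0, 1.0, 40)   # driving skill improves too

m = sm.OLS(gdp, sm.add_constant(skill)).fit()
print(f"R2={m.rsquared:.3f}  adj R2={m.rsquared_adj:.3f}  t={m.tvalues[1]:.1f}")
# Both R^2 and adjusted R^2 come out very high, and the t-statistic is huge,
# yet driving skill does not cause GDP: the shared trend does all the work.
```

Note that adjusted R^2 is just as fooled here, which previews the adjusted R^2 discussion below.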

So what I think is, whether to add or delete a variable in the model requires critical judgment about the issue you're dealing with.
Of course you can add a variable that explains more, if you make sure this variable is not like your driving skill in the previous example.
Similarly, you can delete a variable that explains less, if you have a good reason to say it is ridiculous (or at least unimportant) enough to undermine the rationality of your model.

About your question on adjusted R^2, I will put it this way.
We know that adjusted R^2 increases when the added variable clears an F-test hurdle (for a single added variable, this works out to its |t-statistic| exceeding 1), and decreases otherwise.
So you may say, if we see the adjusted R^2 increase, we should keep the added variable.
But hold on a sec. Remember the GDP versus driving skill example. In that example you will surely see adjusted R^2 increase when you add the driving skill variable. WHY? Because driving skill is statistically significant for GDP, as we saw before!
So should we add it just because the adjusted R^2 increases?

The point is, whether or not to rely on the adjusted R^2 also depends on your critical judgment about the issue you're dealing with.
Of course you can keep a variable that increases the adjusted R^2, if you make sure it is not like your driving skill in the previous example.
Similarly, you can delete a variable that decreases the adjusted R^2, if you have a good reason to say it is ridiculous (or at least unimportant) enough to undermine the rationality of your model.

Hope my comment will help.
 

chiyui

Member
A little reminder:

R^2 never decreases when you add a variable to the model, even a statistically insignificant one.
Adjusted R^2 increases only when the added variable is statistically significant enough (in the F-test sense); the formula below makes precise how weak that hurdle actually is.
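For reference, a standard way to write the adjusted R^2, with n observations and k independent variables (not counting the intercept), is:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

From this formula it follows that adding a single variable raises adjusted R^2 exactly when its |t-statistic| exceeds 1, which is a much weaker hurdle than conventional 5% significance.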
 

chiyui

Member
At this stage, you may ask: how can we know whether the model is rational or not?

I would say that we can't really use the knowledge of statistics alone to judge the answer to this question.
That's why we need to study finance theory if we're dealing with a stock return regression,
psychology theory if we're dealing with a stimulus-response regression,
and 師奶 theory (the theory of mothers' wisdom) if we're dealing with a regression about how to buy the best-quality oranges at the lowest cost!

In summary, statistical analysis is not the whole picture. It can only tell you whether two things are related; it can't tell you why they are related.
Combining qualitative and quantitative analysis will give you a better understanding of the problem.
 

ShaktiRathore

Well-Known Member
Subscriber
Hi there,
When more independent variables are added to the regression than the value of R^2 increases that means now the regression is more better able to explain the regression. Its possible that the R^2 has increases by addition of independent variable but the variable in itself is not statistically significant in explaining the dependent variable so its the problem of overestimating regression which is eliminated by using adj R^2. So when you add independent variables measure the adj R^2 if its decreasing than it means that addition does not make the regression more accurate even if R^2 increases but when it increases that means addition of independent variable now explain the dependent variable well and thus we accept the new regression model and certainly R^2 also increases. SO adj R^2 itself helps in judging how well the added independent variable explains the regression more well.
Regarding the second question:
"Is it right to eliminate from the model independent variables that explain the dependent variable less well, before using the regression model?"
It is certainly not right to eliminate independent variables just because they explain the dependent variable less well: if the regression as a whole is significant, then we should use the regression as a whole, even if it contains less significant variables. Consider the reasoning. Suppose the regression is Y = a*x1 + b*x2, where x2 is the less significant variable but the regression as a whole significantly explains Y. If x1 and x2 are correlated, then when x1 changes, x2 tends to change too, so part of the variation in x1 explains Y through its interdependency with x2. If we drop x2 and fit Y = a*x1 instead, the variation in x1 must explain Y by itself; the explained variation that came through the interdependency of x1 and x2 is lost, and the coefficient on x1 is distorted because it absorbs the effect of the omitted x2 (omitted-variable bias). So it is useful to keep the less significant variable in the regression, because the other significant variables explain Y partly through their interdependency with the less significant ones, as the sketch below illustrates.
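Here is a short simulation sketch of that point (my own made-up numbers, not from the post): x1 and x2 are correlated, x2 is the "weaker" variable, and dropping x2 both lowers the fit and biases the coefficient on x1.

```python
# Toy sketch (hypothetical data): dropping a correlated, less significant
# regressor (x2) distorts the coefficient on the one you keep (x1).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # correlated with x1
Y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)    # true model: a=2.0, b=0.5

full = sm.OLS(Y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(Y, sm.add_constant(x1)).fit()

print(f"full model : coef on x1 = {full.params[1]:.2f}  (true value 2.0)")
print(f"x2 dropped : coef on x1 = {short.params[1]:.2f}  <- absorbs x2's effect")
# The short regression's x1 coefficient is pushed toward 2.0 + 0.5*0.8 = 2.4:
# classic omitted-variable bias from dropping a correlated regressor.
```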


thanks
 