Omitted Variable Bias

brian.field

Well-Known Member
Subscriber
This notion of OVB is confusing to me.

In order for OVB to be present, the regression equation must omit a variable that is correlated with a variable included in the regression, and the omitted variable must also be a determinant of the dependent variable.

So, I interpret these conditions to suggest that there must be an omitted variable on which both a regressor and the regressand depend.

But the assumptions of MLR clearly state that the regressors are assumed to be independent, which would imply a correlation of zero among them, so I do not understand (I cannot reconcile) how it is possible to require i.i.d. regressors while also requiring that there not be any omitted variables that are correlated with the included regressors.

For example, assume a regression equation includes independent variables x, y, and z, and that the dependent variable is w. Also assume that variable v is correlated with variable x and that w also depends on v. Then, if we do not include v, we have omitted variable bias; yet if we include v, we have violated the i.i.d. multiple linear regression condition.
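To make this concrete, here is a minimal simulation sketch of that scenario (the coefficients and the 0.8 loading are hypothetical, chosen purely for illustration):

```python
# Sketch: w depends on x and on v, and v is correlated with x.
# Omitting v biases the coefficient estimated on x.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
v = 0.8 * x + rng.normal(size=n)             # v is correlated with x
w = 1 + 2 * x + 3 * v + rng.normal(size=n)   # w also depends on v

# Short regression omitting v: the slope on x is biased away from 2.
b_short = np.linalg.lstsq(np.column_stack([np.ones(n), x]), w, rcond=None)[0]
print(b_short[1])   # ~ 2 + 3 * 0.8 = 4.4

# Long regression including v: both coefficients are recovered.
b_long = np.linalg.lstsq(np.column_stack([np.ones(n), x, v]), w, rcond=None)[0]
print(b_long[1], b_long[2])   # ~ 2 and ~ 3
```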

Further, page 201 of S&W states, in assumption #2, that the regressors are i.i.d., but on page 204, S&W states that the OLS estimators "... are correlated; this correlation arises from the correlation between the regressors."

Fundamentally, I am assuming i.i.d. random variables are uncorrelated, since independence implies a covariance of zero and, in kind, a correlation of zero.
(I recall that a correlation of zero does not imply independence, but I believe independence does imply zero correlation.)

Anyone have any thoughts?

Thanks,

Brian
 

Mad_Mac

New Member
I think that as long as there is no perfect collinearity, you should include variable v. The idea that all regressors must be completely independent doesn't sound right to me. I might be wrong, but I think the consequence of them not being independent is that you need a much larger sample to get meaningful results.
 

brian.field

Well-Known Member
Subscriber
Here is a return to a question I asked a few years back, although I now understand why I was mistaken in my question above which began this thread.

1) To address my question above, the MLR assumptions require that the samples themselves be i.i.d., NOT that the independent variables be "independent" of one another!

For example, my following statement is nonsensical:

"But the assumptions of MLR clearly state that the regressors are assumed to be independent, which would imply a correlation of zero among them, so I do not understand (I cannot reconcile) how it is possible to require i.i.d. regressors while also requiring that there not be any omitted variables that are correlated with the included regressors."

Indeed, the independent variables are most often correlated with one another!
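A small numpy sketch of this point (hypothetical numbers): the regressors below are strongly correlated with each other, yet OLS still recovers the coefficients, because the i.i.d. assumption is about the sampled observations, not about independence between the regressors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=n)   # x2 deliberately correlated with x1
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
print(np.corrcoef(x1, x2)[0, 1])             # ~ 0.87, far from zero
print(np.linalg.lstsq(X, y, rcond=None)[0])  # ~ [1, 2, 3] all the same
```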

2) @David Harper CFA FRM - Now I have another question that is a bit more subtle, and I don't think I fully grasp it.

Let's assume we have Y = B0 + B1X1 + B2X2 + error as our regression equation. Also assume that X1 and X2 are determinants of Y. Then let us define X3 as X2^2. It would follow that we have OVB, since X3 is correlated with X2 and X3 is also a determinant of Y. In other words, wouldn't we always have OVB no matter what? We could always identify an omitted variable that is correlated with an included regressor and is also a determinant of the dependent variable.

What are your thoughts, David? Is it the case that we will always have some OVB and that we simply try to minimize it with intuition?
 

ShaktiRathore

Well-Known Member
Subscriber
Brian, regarding your second question:
I think that is why there is the no-multicollinearity assumption. Yes, you are right that there would always be OVB if we consider variables like this, but don't you think the multicollinearity assumption, that the independent variables must not be correlated with each other, would then be violated? Yes, there would always be OVB if we add a power of any independent variable, but what about multicollinearity? I think OVB is a threat when there is no violation of the multicollinearity assumption; if there is a violation, that makes OVB less important as a threat.
thanks
 

ami44

Well-Known Member
Subscriber
As I see it, the problem is that linear models are not uniquely defined in their independent variables. If your model is
b1 * X1 + b2 * X2 = Y
then
b1 * X1 + b3 * X3 = Y
with X3 = 2 * X2 is an equivalent model. I would define the conditions for OVB like this:
If we consider the independent variables X1 and X2, then the omission of a variable X3 gives rise to an OVB if a linear model exists that is not collinear and includes X1, X2, and X3 as variables with non-zero regression coefficients. Also, X3 has to be correlated with X1 or X2 in some way.

So if
b1 * X1 + b2 * X2 + b3 * X3 = Y
is a fully specified linear model, then the omission of X4 = X3^2 does not give rise to an OVB, since the regression coefficient of X4 is zero: all the variability of Y is already explained by X1, X2, and X3.

If, on the other hand, you define X4 = X3^2 and X5 = X3 - X3^2, then
b1 * X1 + b2 * X2 + b4 * X4 + b5 * X5 = Y
would be a linear model in which X4 and X5 have non-zero coefficients, and the omission of either of them would give rise to OVB.
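A small numpy sketch of that decomposition (coefficients hypothetical): in the full model both X4 and X5 carry non-zero weights, and dropping X5 visibly biases the coefficient estimated on X4.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1, x2, x3 = rng.normal(size=(3, n))
x4 = x3 ** 2                 # X4 = X3^2
x5 = x3 - x3 ** 2            # X5 = X3 - X3^2
y = 1 * x1 + 2 * x2 + 3 * x4 + 4 * x5 + rng.normal(size=n)

full = np.column_stack([x1, x2, x4, x5])
print(np.linalg.lstsq(full, y, rcond=None)[0])   # ~ [1, 2, 3, 4]

short = np.column_stack([x1, x2, x4])            # omit x5
print(np.linalg.lstsq(short, y, rcond=None)[0])  # x4 coefficient ~ -1, not 3
```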

I think the confusion stems from the definition of "determinant". You can read (e.g., in the Wikipedia article on OVB) that the condition for OVB is that the omitted variable is a determinant of Y. I'm not sure how this property is defined, but as I tried to illustrate above, it is necessarily a property of the whole linear model and not of the omitted variable alone.
In my example, X4 = X3^2 can give rise to OVB in one model and not in another. Does anybody know a definition of "determinant" in this context?

Nitpick at the end: X and X^2 are only correlated if X is restricted to positive values; I assume that is the case here. We could also use X^3 instead to avoid this problem.

I hope that all made some sense to you.
 

brian.field

Well-Known Member
Subscriber
Another interesting question is: "what type of correlation are we talking about?"

Imperfect multicollinearity is when two or more of the independent variables (regressors) are highly correlated...

But what does "highly correlated" really mean?
Is this correlation the typical linear correlation, in which case my scenario in 2) above would not apply, or are we talking about correlation in the sense of any clear dependency relationship, like a cubic relationship, for instance?
 

brian.field

Well-Known Member
Subscriber
David's notes indicate that:

"The regressors exhibit perfect multicollinearity if one of the regressors is a perfect linear function of the other regressors. The fourth least squares assumption is that the regressors are not perfectly multicollinear."

So, I don't think my example violates the no-perfect-collinearity assumption, @ShaktiRathore, but again, it is a bit dubious.
 

ShaktiRathore

Well-Known Member
Subscriber
Hi Brian,
Yes, it has been a long time.
Yes, I am talking in the context of perfect multicollinearity. That is, if you add a power like X3^2, then it is possible (if it is possible?) that X3^2 is a perfect linear function of the other regressors; in that case perfect multicollinearity holds, and OVB is not that important, i.e., adding X3^2 would not have made the regression more valid, because the no-multicollinearity assumption is violated. But if X3^2 is not a perfect linear function of the other regressors, then there is no perfect multicollinearity, and OVB is important, i.e., adding X3^2 would have made the regression more valid.
thanks
 

ami44

Well-Known Member
Subscriber
But what does "highly correlated" really mean?
Is this correlation the typical linear correlation, in which case my scenario in 2) above would not apply, or are we talking about correlation in the sense of any clear dependency relationship, like a cubic relationship, for instance?

Hi Brian,

I'm pretty sure only linear correlation counts here. If the omitted variable is not correlated with the other independent variables, it can be seen as part of the error term. I seem to recall that error terms need to be uncorrelated, but not necessarily statistically independent.
I don't have any math to back that up, though.
 

brian.field

Well-Known Member
Subscriber
Here is my question, to reiterate.

Let Y = B0 + B1X1 + B2X2 + u.

Then, assume Z = X2^2 (or X2^3 or any other nonlinear relationship, like Z = sqrt(X2), etc.) AND assume that Z is a determinant of Y.

Then, is there OVB?

We have that Y is dependent on Z which is one of the requirements for OVB.

But the second requirement, that Z be correlated with one of the independent variables, is unclear to me in this scenario: X2 and Z are related non-linearly, so they are not "correlated" in the typical manner. Rather, there is a dependence relationship but no linear relationship, so is requirement 2 satisfied?

As is typical for me, I get caught up in exceptions and/or nuances... my degrees in mathematics always lead me to look for contradictions to statements.
 

ShaktiRathore

Well-Known Member
Subscriber
Yes Brian, there is OVB, since Z = X2^2 is a determinant of Y. But don't you think that if the majority of the explanation is done by X2 alone, then why would its power be required? R^2 would not change by much, so even if OVB is there, it would not pose much of a threat.
Also, even if X2 and Z are related non-linearly, how can we say that the regressor Z is not a perfect linear function of the other regressors X1 and X2? We cannot say this with certainty; is it not possible that Z is a perfect linear function of X1 and X2 (a violation of the no-multicollinearity assumption)?
thanks
 

brian.field

Well-Known Member
Subscriber
I don't follow how Z is a linear function of X1 and X2 - I must be missing something obvious.
 

ShaktiRathore

Well-Known Member
Subscriber
Take X1 = 2, X2 = 4, so Z = X2^2 = 16. If Z is a perfect linear function of X1 and X2, say Z = aX1 + bX2, then 2a + 4b = 16, i.e., a + 2b = 8.
Take X1 = 3, X2 = 5, so Z = X2^2 = 25, giving 3a + 5b = 25. Solving the two equations gives a = 10 and b = -1, i.e., Z = 10X1 - X2.
We could have data that satisfies the linear function Z = 10X1 - X2, so that Z is a linear function of X1 and X2.
In that case the assumption "The regressors exhibit perfect multicollinearity if one of the regressors is a perfect linear function of the other regressors. The fourth least squares assumption is that the regressors are not perfectly multicollinear." is violated.
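A quick numpy check of the arithmetic (using only the two hypothetical data points above):

```python
import numpy as np

# Two observations: (X1, X2) = (2, 4) with Z = 16, and (3, 5) with Z = 25.
A = np.array([[2.0, 4.0],
              [3.0, 5.0]])
z = np.array([16.0, 25.0])
a, b = np.linalg.solve(A, z)
print(a, b)   # 10.0 -1.0, i.e. Z = 10*X1 - X2 through these two points
```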
thanks
 

ami44

Well-Known Member
Subscriber
Here is my question, to reiterate.

Let Y = B0 + B1X1 + B2X2 + u.

Then, assume Z = X2^2 (or X2^3 or any other nonlinear relationship, like Z = sqrt(X2), etc.) AND assume that Z is a determinant of Y.

As I said, I'm not sure how "determinant" is defined in this context, but I doubt that Z is one. The regression coefficient of Z is zero in this model, because X1 and X2 already explain all of the systematic variability of Y.

Then, is there OVB?

We have that Y is dependent on Z which is one of the requirements for OVB.

But the second requirement, that Z be correlated with one of the independent variables, is unclear to me in this scenario: X2 and Z are related non-linearly, so they are not "correlated" in the typical manner.

If Z = X2^3, they are correlated. If Z = X2^2, they are only correlated if you restrict X2 in some way, e.g., to positive values. Try it out in Excel: put some numbers in a column and the numbers cubed in the next column, and calculate the correlation coefficient. Even though two variables are not linearly connected, that doesn't mean the correlation is zero. So it looks to me as if your second requirement is fulfilled, but your first is not.
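Here is the same experiment as a numpy sketch instead of Excel (arbitrary numbers): with X symmetric around zero, corr(X, X^2) is near zero while corr(X, X^3) clearly is not, even though neither relationship is linear.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)           # symmetric around zero
print(np.corrcoef(x, x ** 2)[0, 1])    # ~ 0
print(np.corrcoef(x, x ** 3)[0, 1])    # ~ 0.77

x_pos = np.abs(x)                      # restrict to positive values
print(np.corrcoef(x_pos, x_pos ** 2)[0, 1])   # now clearly positive
```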

But all the reasoning above aside: there can't be any OVB in your example, because then every linear regression model would have unlimited OVB, since you can construct an unlimited number of variables like Z.
But if your data really obeys the linear relationship
Y = B0 + B1X1 + B2X2 + u
then of course it is possible to estimate B1 and B2 without bias. Why shouldn't it be?
The bias arises when the actual relationship is not fully described by the formula above.
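A last numpy sketch of this point (hypothetical coefficients): when the data truly follows the linear model, leaving out Z = X2^2 does not bias the estimates, because Z earns a zero coefficient anyway.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x1, x2 = rng.normal(size=(2, n))
y = 0.5 + 2 * x1 + 3 * x2 + rng.normal(size=n)   # the true model is linear

X = np.column_stack([np.ones(n), x1, x2])        # no X2^2 term included
print(np.linalg.lstsq(X, y, rcond=None)[0])      # ~ [0.5, 2, 3], no bias
```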
 