Nicole Seaman

Director of CFA & FRM Operations
Learning objectives: Explain the role of linear regression and logistic regression in prediction. Understand how to encode categorical variables.

Questions:

23.5.1. Clara fits a logistic regression model to her training data that predicts whether a financial transaction is fraudulent. The response (aka, dependent) variable is binary-coded such that y(i) = zero for repayment and y(i) = one for default. Although a fraudulent transaction is not a typical default, Clara refers to the fraudulent outcome, y(i) = one, as a default in this model because the result is nonpayment. Her model has three predictors (aka, independent variables) as follows:
  • X1 is the dollar amount of the transaction
  • X2 is the age of the account in months
  • X3 is a Boolean that indicates a foreign transaction: equal to zero if domestic but equal to one if foreign
These are the associated coefficients returned by her best-fit model:
α = -5.8000, β1 = 0.0020, β2 = -0.0460, and β3 = 3.1000.

As a logistic regression, the probability of default is given by P(X) = 1/[1 + exp(-y)] where y = α + β1*X1 + β2*X2 + β3*X3.

Which of the following is nearest to the model's predicted default probability for a $400.00 transaction posted in a foreign country (i.e., X3 = 1.0) on an account that is only two (i.e., X2 = 2.0) months old?

a. 0.3%
b. 1.7%
c. 12.0%
d. 24.5%
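The two-step calculation implied by the formula above can be checked numerically. This is a quick sketch that plugs the question's coefficients and Clara's transaction values into the linear predictor and then applies the logistic transform:

```python
import math

# Coefficients from the question stem
alpha, b1, b2, b3 = -5.8000, 0.0020, -0.0460, 3.1000

# Clara's transaction: $400 amount, account age of 2 months, foreign (X3 = 1)
x1, x2, x3 = 400.0, 2.0, 1.0

y = alpha + b1 * x1 + b2 * x2 + b3 * x3   # linear predictor (the log-odds)
p = 1.0 / (1.0 + math.exp(-y))            # logistic (sigmoid) transform

print(f"y = {y:.4f}, P(default) = {p:.4f}")  # → y = -1.9920, P(default) = 0.1200
```

Note that the sigmoid maps the unbounded log-odds (here, a negative value) into a valid probability between 0 and 1.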


23.5.2. Patria built a logistic regression model to predict whether her bank's borrowers will default. Her predictors include the following features: income, credit score, debt-to-income ratio, loan purpose (via dummy variables), and employment status. For example, the coefficient for the credit score variable is β(score) = -0.0230. The negative sign indicates that default probability (given by p) is a decreasing function of the credit score, but its interpretation differs from that of a linear regression coefficient. Instead, the β(score) of -0.0230 implies that each one-unit increase in the credit score is associated with a 0.0230 decrease in the log-odds, ln[p/(1-p)], if the other predictors are held constant; equivalently, since exp(-0.0230) = 0.97726, each one-unit (+1) increase in the credit score multiplies the odds, p/(1-p), by a constant ratio of 0.97726.
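The constant odds-ratio interpretation described above can be verified numerically. This sketch uses only the β(score) coefficient from the stem; the baseline log-odds value is an arbitrary illustration, not taken from the question:

```python
import math

beta_score = -0.0230
odds_ratio = math.exp(beta_score)   # ≈ 0.97726 per one-unit score increase
print(f"odds ratio per unit = {odds_ratio:.5f}")

# Illustration with a hypothetical baseline: each +1 to the credit score
# multiplies the odds p/(1-p) by the same constant ratio, regardless of
# the starting level of the log-odds.
log_odds = -1.0                           # hypothetical baseline log-odds
odds_before = math.exp(log_odds)          # odds = exp(log-odds)
odds_after = math.exp(log_odds + beta_score)
print(f"ratio check = {odds_after / odds_before:.5f}")  # equals the odds ratio
```

The ratio check holds for any baseline because exp(a + b) = exp(a) * exp(b); that multiplicative structure is exactly why the log-odds scale makes coefficients easy to interpret.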

In regard to her logistic regression model, each of the following statements is true EXCEPT which is false?

a. She probably used ordinary least squares (OLS) to estimate the coefficients
b. Her model assumes a linear relationship between the predictors and the logit; aka, log-odds given by ln[p/(1-p)]
c. The logistic function, F(y) = 1/[1 + exp(-y)], has a sigmoid shape and helpfully returns a probability because it is a cumulative distribution function (CDF)
d. If the cost of making a bad loan is greater than the cost of denying a loan that would have been repaid, she can set the classification threshold, Z, to a value lower than 0.50.


23.5.3. As he prepares to train new machine learning models on his investment research firm's database, Peter is encoding the features that are natively non-numeric. One of these features is an investment style that locates a fund somewhere in a 3×3 matrix where the dimensions are style (value, growth, or blend) and size (large+, mid, or small/micro). The variable therefore contains one of nine text values (aka, strings). Every fund is located in one and only one of these cells; e.g., a large value fund, a small growth fund.

Given his assignment, Peter would prefer to perform a logistic regression due to its advantages; notably, his training data includes a binary target (aka, dependent) variable. Among the following choices, which is probably his best approach to treating this non-numeric feature?

a. He should add one feature and assign a numeric value from zero to nine
b. He should perform one-hot encoding to create eight 0/1 dummy features
c. He can simply regress on the STYLE variable as textual data because it is already categorical
d. He needs to select a different machine learning algorithm, and it should be unsupervised because logistic regression cannot overcome data with non-numeric features
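Regarding choice (b), one-hot (dummy) encoding of a nine-cell categorical feature can be sketched with the standard library alone. The category labels below are illustrative; dropping one baseline category leaves eight 0/1 dummies and avoids perfect multicollinearity (the dummy-variable trap):

```python
styles = ["value", "growth", "blend"]
sizes = ["large", "mid", "small"]
cells = [f"{size} {style}" for size in sizes for style in styles]  # 9 categories

baseline = cells[0]        # the dropped (reference) category
dummy_cols = cells[1:]     # the 8 remaining 0/1 dummy features

def one_hot(cell):
    """Encode one fund's style cell as eight 0/1 dummies (baseline -> all zeros)."""
    return [1 if cell == col else 0 for col in dummy_cols]

print(one_hot("small growth"))   # exactly one 1 among the eight dummies
print(one_hot(baseline))         # all zeros: the reference category
```

In practice a library routine (e.g., pandas' get_dummies with a dropped first level) does the same thing; the point is that each fund's single text value becomes eight numeric columns, with the baseline cell represented by all zeros.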

Answers here:
 