Regression Models Feature Selection

Variable selection in regression, that is, identifying the best subset of the many candidate variables to include in a model, is arguably the hardest part of model building. Many variable selection methods exist because each one offers a solution to one of the most important problems in statistics.

Variable Selection Techniques:

1) Forward Selection - This method adds variables to the model one at a time until none of the remaining variables can add anything significant to the explanation of the dependent variable. Forward selection starts with no variables in the model. For each candidate variable, a test statistic, a measure of the variable's contribution to the model, is calculated. The variable with the largest test statistic is added to the model if that value exceeds a preset cutoff. The test statistic is then recalculated for the remaining variables, and the evaluation is repeated.

2) Backward Elimination - This method deletes variables one by one from the model until all remaining variables contribute something significant to the dependent variable. Backward elimination (BE) begins with a model that includes all variables. At each step, the variable showing the smallest contribution to the model is deleted. Deletion continues until every variable remaining in the model has a test statistic (TS) value greater than a preset cutoff C.

3) Background Knowledge - This method uses background knowledge to guide variable selection. Background knowledge can be incorporated at two stages at least, and it requires close collaboration between the principal investigator (PI) of the research project, who is usually a nonstatistician, and the statistician in charge of designing and performing the statistical analysis. At the first stage, the PI uses subject-specific knowledge to derive a list of independent variables (IVs) which in principle are relevant as predictors or adjustment variables for the study in question. This list will mostly be based on the availability of variables, and must not take into account the empirical association of the IVs with the outcome variable in the data set. The number of IVs to include in the list may also be guided by the events-per-variable ratio (EPV). Together with the PI, the statistician goes through the list and critically questions the role and further properties of each variable, such as chronology of measurement collection, costs of collection, quality of measurement, and availability to the eventual “user” of the model.
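The two stepwise procedures above (forward selection and backward elimination) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the test statistic is assumed to be the partial F statistic from an ordinary least squares fit, the cutoff defaults to an illustrative value of 4.0, and the function names and the eight-row toy data set are hypothetical.

```python
def rss(columns, y):
    """Residual sum of squares from an OLS fit of y on an intercept plus columns."""
    n = len(y)
    cols = [[1.0] * n] + columns  # design matrix, stored column-wise
    k = len(cols)
    # Augmented normal equations [A^T A | A^T y]
    M = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(a * b for a, b in zip(cols[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    # (assumes no perfectly collinear columns)
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(k):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * b for a, b in zip(M[r], M[i])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    fitted = [sum(beta[j] * cols[j][t] for j in range(k)) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


def partial_f(data, y, small, big):
    """Partial F statistic for the one variable that is in `big` but not in `small`."""
    rss_small = rss([data[v] for v in small], y)
    rss_big = rss([data[v] for v in big], y)
    df_resid = len(y) - len(big) - 1  # observations minus fitted parameters
    return (rss_small - rss_big) / (rss_big / df_resid)


def forward_selection(data, y, c=4.0):
    """Start empty; repeatedly add the best candidate while its statistic exceeds c."""
    selected, remaining = [], list(data)
    while remaining:
        scores = {v: partial_f(data, y, selected, selected + [v]) for v in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= c:
            break  # no remaining variable adds anything significant
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(data, y, c=4.0):
    """Start full; repeatedly drop the weakest variable while its statistic is <= c."""
    selected = list(data)
    while selected:
        scores = {v: partial_f(data, y, [u for u in selected if u != v], selected)
                  for v in selected}
        worst = min(scores, key=scores.get)
        if scores[worst] > c:
            break  # every remaining variable contributes significantly
        selected.remove(worst)
    return selected


# Hypothetical toy data: y depends on x1 and x2 only; x3 is unrelated by construction.
x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [0.5 * a * b * c for a, b, c in zip(x1, x2, x3)]
y = [3 * a - 2 * b + e for a, b, e in zip(x1, x2, noise)]
data = {"x1": x1, "x2": x2, "x3": x3}

print(forward_selection(data, y))     # ['x1', 'x2']
print(backward_elimination(data, y))  # ['x1', 'x2']
```

On this toy data both procedures keep x1 and x2 and drop x3. On real data the two procedures can end up with different subsets, and the result depends on the cutoff chosen, so the value used here should be read as a placeholder.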

I prefer these methods because I have studied and used them in academic projects. That said, I am certain there is no one-size-fits-all method; the choice should depend on the type of data set we are dealing with and on what we are trying to extract from the data. The articles did provide much better insight into the universe of variable selection and the different methodologies that can be employed when working with datasets.

I hope this explains some of these techniques and my reasoning for choosing them. As always, I would love readers' feedback and comments.

Sincerely,
Naman Goel
