Bayesian variable selection in linear regression
Comparison of Bayesian objective procedures for variable selection in linear regression. There are two natural ways of encompassing: one way is to encompass all models into the model containing all possible regressors, and the other one is to encompass the model containing the intercept only into any other.
In this paper we compare the variable selection procedures that result from each of the two mentioned ways of encompassing by analysing their theoretical properties and their behavior on simulated and real data. Relations with frequentist criteria for model selection, such as Mallows' Cp, are provided incidentally. Then, the prior for the latent variables is a product of Bernoulli distributions.
Then, the posterior probabilities of the latent variables are computed, and the covariates are selected as follows: a sufficiently large posterior probability that a component of the latent variable equals one implies that the corresponding covariate is included as an explanatory variable. This formulation involves a large number of hyperparameters for which subjective inputs are required. Conventional improper priors were used for the error variance.
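For small problems, the latent-Bernoulli formulation just described can be sketched by brute-force enumeration of all submodels. The following is our own illustration, not the paper's code: a BIC-style score stands in for the (unspecified) marginal likelihood, and the function name is an assumption.

```python
import itertools
import numpy as np

def inclusion_probabilities(X, y, prior_p=0.5):
    # Enumerate all 2^k submodels (intercept always included), score each with
    # a BIC-style approximation to the marginal likelihood, and combine with
    # independent Bernoulli(prior_p) priors on the latent inclusion indicators.
    n, k = X.shape
    scores, gammas = [], []
    for gamma in itertools.product([0, 1], repeat=k):
        cols = [j for j in range(k) if gamma[j]]
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        sigma2 = max(resid @ resid / n, 1e-12)
        log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        score = log_lik - 0.5 * Z.shape[1] * np.log(n)  # BIC-style log marginal
        score += sum(np.log(prior_p if g else 1 - prior_p) for g in gamma)
        scores.append(score)
        gammas.append(gamma)
    s = np.array(scores)
    w = np.exp(s - s.max())
    w /= w.sum()
    # Posterior probability that each latent indicator equals one
    return np.array(gammas, dtype=float).T @ w
```

A covariate is then selected whenever its returned inclusion probability exceeds a chosen threshold, e.g. 0.5.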
A fully objective analysis for model comparison in linear regression was given by Berger and Pericchi. They utilize an encompassing approach and an empirical measure, the intrinsic Bayes factor, which does not need subjective prior information.
For large sample sizes this empirical measure closely approximates a Bayes factor, namely the Bayes factor for intrinsic priors. Since each Mi is nested in the full model Mk, the corresponding intrinsic priors can be derived. The interpretation of these probabilities is that the submodel having the highest posterior probability is the most plausible reduction in complexity from the full model, the second highest the second most plausible reduction, and so on.
Notice that M1 is nested in Mi for any i, so that the corresponding intrinsic priors can be derived. The variable selection from above is based on multiple pairwise comparisons. However, we note that all the posterior probabilities are computed with respect to the full model Mk and, consequently, each one indicates how plausible a given reduction in complexity is compared to the full model.
The variable selection from below is also based on multiple pairwise comparisons. However, we will see that it is equivalent to ordering the models according to intrinsic model posterior probabilities computed in the space of all models. Therefore, the possible lack of coherence of the variable-selection-from-above approach is not shared by the variable-selection-from-below procedure.
The paper is organized as follows. Section 2 gives the general form of the intrinsic prior distribution for any two nested normal linear models. Section 3 develops a general formula for computing the Bayes factor and the model posterior probabilities, and proves the consistency of the Bayesian procedure.
Section 5 is a short note on stochastic search. Section 6 contains comparisons of the numerical results obtained from the frequentist and Bayesian variable selection procedures for simulated and real datasets. Finally, Section 7 gives some concluding remarks and recommendations. The direct use of improper priors for computing model posterior probabilities is not possible; however, they can be converted into suitable intrinsic priors (Berger and Pericchi; Moreno et al.).
Intrinsic priors for the parameters of the above nested linear models provide a Bayes factor (Moreno et al.). The asymptotic behavior of this posterior probability is as good as can be expected from a Bayesian procedure. Under mild conditions, the posterior probability P(Mi | n, Bij) tends, in probability, to 0 when sampling from model Mj, and to 1 when sampling from Mi.
More precisely, adapting the proof of Moreno and Girón, we can prove the following theorem. Theorem 1. However, this is not a very demanding condition, as the following example, taken from Berger and Pericchi, shows. Example 1. Consider the case of testing whether the slope of a simple linear regression is zero. This statistic is closely related to (in fact, is a monotonic function of) the classical F statistic and the associated p-value for testing the null hypothesis Mi.
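The classical F statistic mentioned in the example is easy to make concrete. The following sketch (our own illustration) computes the usual F statistic for H0: slope = 0 in a simple linear regression; the Bayes factor discussed in the text is a monotone function of this quantity.

```python
import numpy as np

def f_statistic_slope(x, y):
    # Classical F statistic for H0: slope = 0 in simple linear regression,
    # comparing the intercept-only model with the full model.
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss1 = np.sum((y - X @ beta) ** 2)   # residual SS, full model
    rss0 = np.sum((y - y.mean()) ** 2)   # residual SS, intercept-only model
    return (rss0 - rss1) / (rss1 / (n - 2))
```

Under the null the statistic follows an F(1, n-2) distribution, so the associated p-value follows directly.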
While the former paper focuses on variable selection, the latter contemplates its relation with the usual p-values. Remark 1. There is a close relation between the statistics involved in the computation of both Bayes factors. However, there is no closed-form relationship between the corresponding Bayes factors, nor between the pairwise posterior probabilities.
Theorem 2 proves this assertion, which is based on the following lemma. Theorem 2.
Stefano Cabras. The R code is explained in a way that can be easily applied to other similar studies. The application shows that, compared to more traditional methodologies, the BART approach is particularly useful when a high-dimensional set of confounding variables is considered, as its results are not based on a sampling hypothesis.
BART allows for the estimation of different interactive effects between the treatment variable and other covariates. BART models do not require the analyst to make explicit subjective decisions about which covariates must be included in the final models. This makes it an easy procedure to guide policy makers' decisions in different contexts. However, developing an experimental design for this task is not always possible due to a range of potential ethical and economic considerations. Therefore, the use of observational studies may be regarded as a convenient alternative for the estimation of causal effects in education.
For this reason, it may be considered a fertile field in which to analyze the causal effects of different educational variables in a non-experimental setting. Nevertheless, this task necessarily requires the consideration of a large number of confounding variables to render the treated and the control samples comparable.
This is a potential problem when considering traditional matching estimation techniques, such as propensity score matching, given that the two samples cannot be observed for the same values of the confounding variables. BART was originally developed in Chipman et al. This method mainly addresses the optimal estimation of the response surface, i.e. the expected response given the covariates. Moreover, this flexible approach allows for both the inclusion of a large number of covariates, as it does not require a sampling hypothesis, and the estimation of a large number of interactive effects between the treatment and other variables in the analysis.
The fact that it does not require any subjective decision by the analyst, except for choosing the response and the treatment variable, makes it an easy procedure for decision makers to implement in different contexts. We illustrate the use of BART models for causal analysis with an application to the estimation of the causal effect of information and communication technologies (ICT) on the performance of Spanish students in mathematics as measured in PISA. This tutorial extends from the collection of the database to the comparison of the causal effect estimated with BART models and those obtained with other, more traditional approaches, such as linear regression and matching.
The remainder of this paper is structured as follows. We explain the insight of causal estimation with non-experimental data in Section 2. Sections 3 and 4 briefly describe the estimation of causal effects using some traditional methodologies and the BART approach, respectively. We draw some conclusions in Section 8. In order to compute the causal effect of z on the response variable Y, we should know, in principle, the potential results of the test for the same individual under the use of computers, Yi(1), and without the use of computers, Yi(0).
However, this is impossible because only one of the two can be observed; the other, unobservable one is designated as the counterfactual result that has to be estimated with a regression model such as the BART model described below.
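The counterfactual recipe just described can be sketched generically: fit a response-surface model on (X, z) and average the predicted difference between the treated and untreated versions of every individual. In the sketch below (our own illustration, with assumed names) a plain linear model stands in for the BART surface purely to keep the code self-contained; with a BART implementation, the two prediction steps would be identical.

```python
import numpy as np

def estimate_ate(X, z, y):
    # Fit a response-surface model f(x, z) (here: linear regression as a
    # stand-in for BART), then average f(x, 1) - f(x, 0) over individuals.
    n = X.shape[0]
    D = np.column_stack([np.ones(n), X, z])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    D1 = np.column_stack([np.ones(n), X, np.ones(n)])   # everyone treated
    D0 = np.column_stack([np.ones(n), X, np.zeros(n)])  # nobody treated
    return np.mean(D1 @ beta - D0 @ beta)
```

The averaged difference estimates the average treatment effect; the individual differences themselves are the estimated counterfactual contrasts.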
Such a model is used mainly for the estimation of response surfaces, which is the central problem in the estimation of causal effects; in this case, the surface is the response Y to a hypothetical treatment z. The causal effect for each separate individual is not of interest in itself. In observational studies, such as the PISA test, potential results are not typically independent of the treatment.
This is known in the literature as the endogeneity problem. More specifically, the strong ignorability hypothesis regarding treatment assignment states that Y is conditionally independent of z given X, and that the probability of treatment assignment is always positive regardless of the specific value of X. In order to achieve this, it is necessary to include in X all the potential confounding factors; because of that, the matrix X typically has very high dimensionality and is formed from different types of covariates: qualitative, quantitative and ordinal variables.
This situation complicates the analysis, as it requires the use of sophisticated regression models in the estimation of Y. This fact obliges the analyst to consider a set of variables of lower dimension, in many cases putting the strong ignorability assumption in doubt. Finally, it is well known that the specification of regression models with many variables makes it impossible to search over all the possible models with all types of interactions. Finally, we assign the popular Zellner-Siow prior (Zellner and Siow).
The proof of the result is given in Appendix A. Moreover, some preliminary results which are required in our proof are formulated as Lemma A; in our notation, this lemma describes the limiting behavior under the sampling (also true) model MA. Assume that (16) holds. Such a method generates realizations of a Markov chain which converges in distribution to the conditional random vector of interest, i.e. to a draw from the posterior distribution. Thus, after some time of convergence, the so-called burn-in phase, the generated random numbers can be considered a dependent sample from the posterior distribution.
To finally obtain an (approximately) independent sample, only simulated values a given distance d apart are retained; that is, a so-called thinning is performed. MH algorithms are iterative: in each iteration a random number is drawn from a proposal distribution q, which is then accepted or not based on a given criterion. If the proposal q depends on the last accepted random number, one speaks of a random walk MH algorithm.
In order to define an MH algorithm for a given problem, basically the only thing one has to specify is the proposal distribution q used in the algorithm. A realization of this random variable corresponds to the index of a predictor which is going to be added to or removed from the model. Transitions between models which differ by two or more parameters are not allowed. Again, as with the parameters p1, …, it is well known (see Fahrmeir et al.). Experimental studies. In this section the predictive performance of the proposed hierarchical Bayesian approach is analysed and compared with some other Bayesian and non-Bayesian methods.
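The single-flip proposal scheme described above can be sketched as a random-walk MH algorithm over binary inclusion vectors. This is our own illustration with assumed names: a BIC-style score stands in for the paper's hierarchical posterior, and each proposal flips exactly one predictor in or out, so transitions changing two or more parameters never occur.

```python
import numpy as np

def mh_variable_selection(X, y, n_iter=4000, seed=0):
    # Random-walk MH over inclusion vectors gamma: each proposal flips one
    # predictor in or out of the model. Models are scored with a BIC-style
    # approximation (a stand-in for the hierarchical posterior in the text).
    rng = np.random.default_rng(seed)
    n, k = X.shape

    def score(gamma):
        cols = np.flatnonzero(gamma)
        Z = np.column_stack([np.ones(n), X[:, cols]]) if cols.size else np.ones((n, 1))
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        sigma2 = max(resid @ resid / n, 1e-12)
        return -0.5 * n * np.log(sigma2) - 0.5 * Z.shape[1] * np.log(n)

    gamma = np.zeros(k, dtype=int)
    current = score(gamma)
    visits = np.zeros(k)
    for _ in range(n_iter):
        j = rng.integers(k)            # index of the predictor to add/remove
        prop = gamma.copy()
        prop[j] ^= 1
        new = score(prop)
        if np.log(rng.random()) < new - current:   # symmetric proposal
            gamma, current = prop, new
        visits += gamma
    return visits / n_iter             # posterior inclusion frequencies
```

The returned inclusion frequencies are the quantities plotted later for the reduced Markov chain.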
For this task we have implemented the MH algorithm described in Section 2. The comparison includes the lasso (Tibshirani), the adaptive lasso (alasso; Zou), the elastic net (elastic; Zou and Hastie), the Bayesian lasso (blasso; Park and Casella), and the Bayesian adaptive lasso (balasso; Alhamzawi et al.). For our extensive comparisons three real-world datasets and also simulated datasets from two different artificial models are used.
The performance comparison on the real-world datasets is carried out on the basis of a 5-fold cross-validation. Performing a 5-fold cross-validation on a given dataset means partitioning the dataset into 5 equal-sized subsets, iteratively selecting each of these subsets exactly once as testing data, whilst using the remaining subsets for training. Aggregating some measure of accuracy, computed from each of the 5 testing datasets, yields the overall performance of the models under evaluation.
For the aggregation of single accuracies we use the median, since it is more robust to outliers than the mean. For all the non-Bayesian comparison models the predicted target values ŷ1, … are computed via inner products of the estimated coefficient vectors with the test samples. For the Bayesian comparison models the way the predictions are computed depends on the output provided by the R packages used in this paper. In this work the R function blasso included in the R package monomvn (Gramacy) is applied in order to perform Bayesian lasso regression.
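The 5-fold scheme with median aggregation described above can be sketched generically; in the following illustration (names are our own), fit and predict stand for any of the compared learners.

```python
import numpy as np

def cv_median_mse(fit, predict, X, y, n_folds=5, seed=0):
    # Partition the data into n_folds equal-sized folds, hold each fold out
    # once as test data, and aggregate the per-fold MSEs with the median
    # (more outlier-robust than the mean, as noted in the text).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        model = fit(X[train], y[train])
        pred = predict(model, X[f])
        errs.append(np.mean((y[f] - pred) ** 2))
    return np.median(errs)
```

Plugging in, e.g., a least-squares fit and an inner-product predict reproduces the evaluation loop for the non-Bayesian models.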
Moreover, we use the R function brq included in the package Brq (Alhamzawi) for applying Bayesian adaptive lasso regression. It should be mentioned that this R function is not developed for the classical linear regression setting, where the conditional mean of the response variable is modelled; rather, it is developed for the more general setting where any conditional quantile is modelled. For the purposes of the comparison study in this paper the conditional median, i.e. the 0.5 quantile, is modelled.
In order to use the Bayesian elastic net, the function EBglmnet included in the package EBglmnet (Huang and Liu) is used in this work. This function returns the posterior mean of the nonzero regression coefficients, which is then used for prediction, again via inner product computation with the test samples.
It is also important to mention how, inside the 5-fold cross-validation for the real-world datasets, or for the multiple simulated datasets, the hyperparameters of the trained models are determined. For the frequentist models, which are all implemented in the R package glmnet (Friedman et al.), the penalization parameters are determined via cross-validation. For the adaptive lasso the chosen penalization factors are the absolute values of the inverse regression coefficients obtained from the respective ridge-penalized problem, the penalization factor of which is also determined via cross-validation.
For the Bayesian versions of the lasso and the adaptive lasso, the default values of the corresponding functions blasso and brq are assigned to the hyperparameters of the prior distributions.
For further details please refer to the help pages of these packages. However, it should be mentioned that for the Bayesian lasso the option of additional model selection based on a Reversible Jump MCMC algorithm is set to true for the experimental studies in this paper. Moreover, for the Bayesian elastic net the hyperparameters in the prior distribution are selected via 5-fold cross-validation.
This is done by using the function cv.EBglmnet, which is also available in the package EBglmnet. The hyperparameters of our proposed Bayesian model are chosen differently in the following experimental studies to illustrate useful possible choices. These specific choices are described later in this article. For the Bayesian elastic net, the decision of which samples to store is already taken by the implementation itself and cannot be user-defined.
All other Bayesian models come with implementations which allow for a user-specific determination of this fraction. To do so the following procedure is chosen: in a first step the first 10, samples are deleted and, in a second step, all samples except every d-th one are deleted. Note that the samples deleted in the first step are those generated before convergence of the MCMC algorithms used, while the thinning done in the second step should guarantee that the remaining samples can be considered independent.
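The two-step burn-in and thinning procedure just described amounts to simple slicing of the stored chain; a minimal sketch (our own illustration):

```python
import numpy as np

def burn_and_thin(chain, burn_in, thin):
    # Drop the first `burn_in` samples (pre-convergence phase), then keep
    # only every `thin`-th of the remaining ones, so the retained draws can
    # be treated as approximately independent.
    return np.asarray(chain)[burn_in::thin]
```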
For each of the considered Bayesian models 50, dependent samples are generated, except for the Bayesian adaptive lasso, for which 70, are produced. This results in 4, approximately independent samples. The reason for the special treatment is that in our experiments the Bayesian adaptive lasso turned out to require more samples to obtain stable estimates of the target variables. Real-world studies. In this section the performance of the proposed method is compared with the performance of other well-established methods based on three real-world datasets.
The comparison is made along the lines of the above considerations (see the beginning of Section 3). The diabetes dataset. In this section the popular diabetes dataset, originally used by Efron et al., is considered.
The target variable of interest is a quantitative measure of disease progression one year after baseline. A version of the dataset is included in the R package care (Zuber and Strimmer); this version is used in our experiments. The hyperparameters of the proposed Bayesian approach are specified as follows: in an empirical Bayes manner, the parameters p1, … are specified as already mentioned in Section 2.
Thus, a zero-truncated binomial prior is assigned to the unknown model size k. However, in general it is somewhat cumbersome to determine the second parameter in such a way that the distribution has a given mean.
Further, note that by the Abel-Ruffini theorem there is no algebraic solution to the general polynomial equations of degree five or higher with arbitrary coefficients. The parameters a and b are both set to 0.
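Since, as the Abel-Ruffini remark above implies, no general algebraic solution exists, the success probability of a zero-truncated binomial with a prescribed mean can instead be found numerically. The following bisection sketch is our own illustration, not the authors' code; it uses the fact that the mean m*p / (1 - (1-p)^m) of a zero-truncated Binomial(m, p) increases monotonically in p.

```python
def trunc_binom_p(m, target_mean, tol=1e-10):
    # Mean of a zero-truncated Binomial(m, p): m*p / (1 - (1-p)**m),
    # which increases monotonically from 1 (p -> 0) to m (p -> 1).
    def mean(p):
        return m * p / (1.0 - (1.0 - p) ** m)

    # Bisection on p in (0, 1); target_mean must lie in (1, m).
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, trunc_binom_p(10, 3.0) returns the p for which a zero-truncated Binomial(10, p) prior on the model size has mean 3.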
Finally, the tuning parameters (see Section 2) are set. The prostate cancer dataset. In this section the performance of the seven considered methods is compared on the prostate cancer data provided by Stamey et al. This dataset examines the relationship between the level of prostate-specific antigen (lpsa) and eight clinical measures from 99 patients who were about to receive a radical prostatectomy. The eight measurements are log cancer volume (lcavol), log prostate weight (lweight), age, log benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log capsular penetration (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). This dataset is included, for example, in standard R packages.
The predictors are standardized; consequently, we only have to zero-center the target variable lpsa to make the data compatible with the model proposed by us. The hyperparameters of our new Bayesian approach are specified as follows: the parameters p1, … and the remaining parameters are defined as in Section 3.
Thus, performing a 5-fold cross-validation as described at the beginning of Section 3 results in Table 2. The ozone dataset. In this section the ozone dataset originally studied by Breiman and Friedman is used for the performance comparison.
We decide to examine the regression model including all linear, quadratic and two-way interaction terms, resulting in 44 possible predictors. Moreover, we zero-center and standardize (variance equal to one) all variables prior to the training of the various models. We want to mention that, in contrast to a statement at the beginning of Section 3, in this performance evaluation MCMC samples are drawn for all Bayesian models in numbers other than the 50, or 70, stated there. This is the only experimental study where that statement is violated.
The hyperparameters of the newly proposed Bayesian approach are specified as follows: we assign values to the parameters p1, …. We empirically discover that taking the power of three can be a helpful step to guarantee fast convergence of the proposed MH algorithm (see Section 2). Performing a 5-fold cross-validation as described at the beginning of Section 3 results in Table 3.
Simulation studies. In this section, on the basis of simulated data corresponding to two different artificial models, the performance of our new method is compared with that of other well-established methods. This should reflect difficult variable selection problems that more and more organizations have to face in their daily work nowadays and in the future. From both artificial models multiple datasets are simulated. In particular, datasets are sampled, each according to three different settings.
Moreover, all the observations are drawn independently of each other. For each sampled dataset the entire target vector (which includes target values from both the training set and the testing set) is standardized via division by the sample standard deviation of the training target values. This simplifies finding a good choice of the tuning parameters for the proposed MH algorithm.
The scaling allows for choosing them equal to the specifications used in Section 3. The predictors are not scaled at all, since they are sampled from distributions with mean zero and standard deviation one. Note that the multiplicative factor has to be defined in such a way that the parameters sum to one.
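The target-scaling step described above can be sketched directly; note that only the training standard deviation is used, so no test information leaks into the scale (ddof=1 for the sample standard deviation is our assumption).

```python
import numpy as np

def scale_targets(y_train, y_test):
    # Divide both parts of the target vector by the sample standard
    # deviation of the *training* targets only, as described in the text.
    s = np.std(y_train, ddof=1)
    return y_train / s, y_test / s
```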
Aggregating the accuracy measures obtained by training the models to be compared, as described at the beginning of Section 3, results in Table 4. Moreover, one can see that the implementation of the Bayesian adaptive lasso has some serious problems with these simulated data, due to numerical instabilities.
In Figure 5 one can see the Markov chain of the model size k, i.e. its trace over the iterations. Further, in Figure 6 the relative frequencies of a given regression coefficient being non-zero are plotted for the reduced Markov chain.