After you have compromised the purity of your data by using it for model selection, how do you do valid inference with that same data? TBD.
Tricky in general.
Here's one approach: the reusable holdout, from "The reusable holdout: Preserving validity in adaptive data analysis", which, like everything these days, uses differential privacy methods. Soon I will have my smoothies made by ensuring differential privacy for my bananas' identities.
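To make the reusable-holdout idea concrete, here is a minimal sketch of a Thresholdout-style query mechanism in Python. It is a simplification under stated assumptions, not the authors' exact algorithm: the function name `thresholdout`, the default `threshold` and `sigma`, and the noise scales are illustrative choices of mine, and the budget accounting that limits how many times the holdout may be consulted is left out entirely. The idea it shows is that the analyst only ever sees the training-set estimate, unless that estimate disagrees with the holdout by more than a noisy threshold, in which case a noised holdout estimate is returned instead.

```python
import numpy as np


def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """Answer one query given its per-sample values on each data split.

    train_vals / holdout_vals: values of the query function evaluated on
    every training / holdout sample (e.g. 0/1 correctness indicators).
    threshold, sigma: illustrative defaults, assumed here rather than
    taken from any reference implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    train_mean = float(np.mean(train_vals))
    holdout_mean = float(np.mean(holdout_vals))

    # Laplace noise on the comparison hides exactly where the threshold sits.
    eta = rng.laplace(scale=2 * sigma)
    if abs(holdout_mean - train_mean) > threshold + eta:
        # The training estimate looks overfit: release a noised holdout value.
        return holdout_mean + rng.laplace(scale=sigma)
    # Otherwise the training estimate is good enough, and the holdout
    # leaks (almost) nothing about itself.
    return train_mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical query: accuracy indicators of some classifier.
    train_correct = rng.binomial(1, 0.9, size=1000)    # optimistic on training data
    holdout_correct = rng.binomial(1, 0.6, size=1000)  # worse on fresh data
    print(thresholdout(train_correct, holdout_correct, rng=rng))
```

Because each answer depends on the holdout only through a noisy comparison or a noised mean, repeated adaptive queries learn little about the individual holdout points. That limited leakage is where the differential privacy connection comes in.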
- BNSS15: (2015) Algorithmic Stability for Adaptive Data Analysis. ArXiv:1511.02513 [Cs].
- Bune04: (2004) Consistent covariate selection and post model selection inference in semiparametric regression. The Annals of Statistics, 32(3), 898–927.
- TLTT14: (2014) Exact Post-selection Inference for Forward Stepwise and Least Angle Regression. ArXiv:1401.3889 [Stat].
- LSST13: (2013) Exact post-selection inference, with application to the lasso. ArXiv:1311.6238 [Math, Stat].
- DFHP15: (2015) Preserving Statistical Validity in Adaptive Data Analysis. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing - STOC ’15 (pp. 117–126). Portland, Oregon, USA: ACM Press.
- HaUl14: (2014) Preventing False Discovery in Interactive Data Analysis Is Hard. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (pp. 454–463). Washington, DC, USA: IEEE Computer Society.
- ChHS15: (2015) Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach. Annual Review of Economics, 7(1), 649–688.
- BBBZ13: (2013) Valid post-selection inference. The Annals of Statistics, 41(2), 802–837.