### New Perspectives on Data Analysis from Resampling Methods Using R

*John Hilary Maindonald*

Building: Law Building

Room: Breakout 7 - Law Building, Room 028

Date: 2012-07-12 03:30 PM – 05:00 PM

Last modified: 2012-04-20

#### Abstract

The use of simulation to approximate a regression or other result that is known theoretically serves primarily as a check that the simulation has been implemented correctly. Often, however, the modeling process departs in minor or major ways from the model assumptions. For example, selection of explanatory variables may be a component of the modeling process. Simulation then provides a check on the consequences for the statistical properties of the modeling process. For commonly used forms of stepwise regression the results can be devastating, as will be demonstrated.
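As a minimal sketch of this kind of check (simulated data, not the talk's own examples): fit backward stepwise regression, via R's `step()`, to responses that are pure noise, and count how many "significant" predictors survive selection.

```r
## Simulation check on stepwise selection: the true model has NO real
## predictors, yet AIC-based backward elimination routinely retains some.
set.seed(42)
n <- 100; p <- 20; nsim <- 100
kept <- numeric(nsim)
for (i in seq_len(nsim)) {
  X <- matrix(rnorm(n * p), n, p)        # pure-noise explanatory variables
  y <- rnorm(n)                          # response unrelated to X
  dat <- data.frame(y = y, X)
  fit <- step(lm(y ~ ., data = dat), trace = 0)  # backward stepwise by AIC
  kept[i] <- length(coef(fit)) - 1       # number of predictors retained
}
mean(kept)  # average count of spuriously retained predictors
```

Because the retained coefficients are then reported with standard errors that ignore the selection step, their apparent significance is illusory.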

Rather than taking repeated samples from a normal or other theoretical distribution, one can treat the sample data as providing a better indication of the distribution, and resample from that. Alternatively, we can resample from model residuals. This resampling idea underlies bootstrap methodology. Again, this can provide interesting and insightful commentary on the statistical properties of the modeling process, for example with regard to the interplay between variable selection and the identification of outliers.
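A minimal sketch of the residual-resampling idea, on simulated data (the talk uses research data sets): hold the fitted values fixed, resample residuals with replacement, and examine the bootstrap distribution of the slope.

```r
## Residual bootstrap for the slope of a simple linear regression.
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)
fit <- lm(y ~ x)
res <- resid(fit)
fitted_y <- fitted(fit)
B <- 1000
slopes <- replicate(B, {
  ystar <- fitted_y + sample(res, replace = TRUE)  # resample residuals
  coef(lm(ystar ~ x))[2]
})
quantile(slopes, c(0.025, 0.975))  # bootstrap percentile interval for the slope
```

Replacing the `lm()` call with a selection-plus-fit procedure lets the same loop expose how variable selection and outlier identification interact across resamples.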

The Random Forests methodology, most often used for classification problems, demonstrates a use of bootstrap resampling that is radically different from that noted above. It is simple to use and, for many types of problem, remarkably effective. There is scope to extend much more widely the ideas that underpin the Random Forests methodology.
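A base-R sketch of the bootstrap idea that Random Forests exploits (the talk itself may use the `randomForest` package): bag many classifiers, each fitted to a bootstrap sample, and use the out-of-bag cases for an honest error estimate. A crude nearest-centroid rule stands in for a tree here, purely for illustration.

```r
## Bagging with out-of-bag (OOB) assessment, on the built-in iris data.
set.seed(7)
data(iris)
n <- nrow(iris); B <- 100
oob_pred <- matrix(NA_character_, n, B)
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)       # bootstrap sample (in-bag cases)
  oob <- setdiff(seq_len(n), idx)        # out-of-bag cases for this resample
  ## stand-in for a tree: nearest-centroid classifier on the in-bag data
  cent <- aggregate(iris[idx, 1:4], list(Species = iris$Species[idx]), mean)
  d <- as.matrix(dist(rbind(cent[, -1], iris[oob, 1:4])))[-(1:3), 1:3]
  oob_pred[oob, b] <- as.character(cent$Species[apply(d, 1, which.min)])
}
## majority vote over each case's OOB predictions
vote <- apply(oob_pred, 1, function(z) names(which.max(table(z[!is.na(z)]))))
mean(vote != iris$Species)  # OOB misclassification rate
```

The OOB cases play the role of a built-in test set: no data are held out in advance, yet the error estimate is not contaminated by the fit.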

Advances in computer systems have made resampling approaches, including simulation, viable for use even with relatively large data sets. The R system has a range of powerful and versatile tools that can be used to implement such approaches, with modification or adaptation where required. This talk will use data that have been collected for research purposes to illustrate some of the possibilities.

If modern computing systems had been available to the pioneers of theoretical statistics, how would statistical methodology differ from current mainstream methodology? It is an interesting speculation, and a good basis for considering where new developments in statistical methodology may be heading.
