Bootstrapping, Monte Carlo and all that… - Part 1

Extracting relevant results from a dataset requires a rigorous understanding of its statistical properties, chiefly the type of distribution the data follow. When the data can be assumed to follow a known distribution, parametric tests are used to infer the statistical properties of a particular estimator. When no specific distribution can be assumed, nonparametric tests are applied instead.

In statistics, an estimator is a rule for computing an estimate of some unknown parameter from the data. For example, the sample mean is an estimator of the population mean, and the value it takes on a particular sample drawn from the dataset is a statistic (the estimate).
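To make the distinction concrete, here is a minimal sketch in Python using only the standard library. The dataset, its true mean of 10 and the sample size of 50 are made-up values chosen purely for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical dataset: 1000 values from a normal distribution whose
# true mean (the "unknown" parameter in a real problem) is 10.
data = [random.gauss(10.0, 2.0) for _ in range(1000)]

# The estimator is the rule: "take the arithmetic mean of a sample".
estimator = statistics.mean

# Applying the same rule to two different samples gives two estimates.
sample_a = random.sample(data, 50)
sample_b = random.sample(data, 50)
estimate_a = estimator(sample_a)
estimate_b = estimator(sample_b)

print("estimate from sample A:", estimate_a)
print("estimate from sample B:", estimate_b)
```

The rule stays the same while the number it produces changes from sample to sample, and that sample-to-sample variation is what the rest of this post is about.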

Assuming that the data in hand were obtained by measuring a variable with some measurement procedure, they will inherently contain some error, random or systematic, which imparts uncertainty to the statistics we try to infer from them. This uncertainty is usually expressed as an interval around the derived statistic, within which the true value is expected to lie with a stated level of confidence.

Since the system from which the data are derived is not fully deterministic, we model it statistically in order to make predictions. For those predictions to be trusted, the statistical model needs to be robust, and this is where Monte Carlo simulations come to the rescue. A Monte Carlo simulation generates thousands of samples of random data from the proposed model with known parameters (for example, data drawn from a normal probability density function, i.e. a normal PDF, with known mean and standard deviation) and computes the statistic of interest on each sample in order to estimate the characteristics of its sampling distribution. Doing this gives a range of estimates of a particular characteristic (let’s say the median). This range of values of the median, together with the probability or likelihood of each value, is crucial for understanding the robustness of our statistical model.
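As a rough illustration of the procedure described above, the following sketch simulates the sampling distribution of the median under a normal model, using only Python's standard library. The parameter values, the sample size and the number of simulations are assumptions picked for the example, not values taken from any real model.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Assumed model: normal distribution with known parameters.
TRUE_MEAN, TRUE_SD = 5.0, 2.0
SAMPLE_SIZE = 30        # observations per simulated sample
N_SIMULATIONS = 5000    # number of simulated samples

# Draw many samples from the model and record the median of each one.
medians = []
for _ in range(N_SIMULATIONS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    medians.append(statistics.median(sample))

# Summarise the simulated sampling distribution of the median.
mc_mean = statistics.mean(medians)
mc_sd = statistics.stdev(medians)

# Approximate central 95% range of the simulated medians.
medians.sort()
lower = medians[int(0.025 * N_SIMULATIONS)]
upper = medians[int(0.975 * N_SIMULATIONS) - 1]

print("mean of simulated medians:", mc_mean)
print("spread (standard deviation):", mc_sd)
print("approximate 95% range:", lower, "to", upper)
```

Because the true parameters are known here, the simulated medians can be compared directly with the true median of the model (equal to the mean for a normal distribution), which is how the sketch judges how well the estimator behaves.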

The method of bootstrapping is a special case of Monte Carlo simulation in which, instead of generating random data from a distribution, we repeatedly sample with replacement from the available data (also known as historical data), regardless of their distribution, and compute the statistic on each resample to approximate its sampling distribution.
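The resampling idea can be sketched in a few lines of standard-library Python. The "observed" data below are synthetic and deliberately skewed, standing in for a real dataset whose distribution we pretend not to know. The number of resamples and the simple percentile interval are illustrative choices, and other bootstrap interval methods exist.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical observed data (skewed on purpose). In practice this would
# be the dataset in hand, with no distribution assumed.
observed = [random.expovariate(1 / 3.0) for _ in range(200)]

N_RESAMPLES = 2000

# Resample WITH replacement from the observed data, keeping the same
# size as the original, and record the median of each resample.
boot_medians = []
for _ in range(N_RESAMPLES):
    resample = random.choices(observed, k=len(observed))
    boot_medians.append(statistics.median(resample))

observed_median = statistics.median(observed)
boot_se = statistics.stdev(boot_medians)  # bootstrap standard error

# Simple percentile interval (approximately 95%).
boot_medians.sort()
ci_low = boot_medians[int(0.025 * N_RESAMPLES)]
ci_high = boot_medians[int(0.975 * N_RESAMPLES) - 1]

print("observed median:", observed_median)
print("bootstrap standard error:", boot_se)
print("approximate 95% interval:", ci_low, "to", ci_high)
```

Note that `random.choices` samples with replacement, whereas `random.sample` does not, so the former is the one that matches the bootstrap.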

In essence, Monte Carlo simulation is used to understand how well a particular estimator performs under a particular statistical model or distribution, whereas bootstrapping is used to quantify the variability of an estimate computed from the dataset in hand, whose distribution is unknown.

In the next post, we will apply both methods to some test data in order to see their power in practice.