Simulating Popular Distributions in Python

Interest in machine learning and data science has grown rapidly in recent years. More and more students are enrolling in online data science courses that teach them how to fit machine-learning algorithms to simple data sets. Most of these courses are fantastic at explaining complex machine-learning techniques; however, only a few of them delve into the mathematical statistics behind the fancy algorithms. The fundamentals of statistics are grossly undervalued in these courses. For example, there are many so-called data scientists who cannot distinguish between discrete and continuous data. It may seem trivial, but I’ve seen many people simply assume that their data is continuous when in fact it is discrete, most commonly with count data that follows a Poisson distribution. Understanding the properties of various distributions is extremely important in making sense of your data.

To understand the properties of a distribution, it is always helpful to simulate data points and plot them. Using Python 3, we will simulate the most common simple distributions in the world of data science. We won’t explain each distribution in detail; that research can be done in your own time (we point to useful links and resources). Here we will only simulate various popular distributions that can be helpful in many applications. The first step is to import the required libraries.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

Standard Normal Distribution

The Normal Distribution is called “Normal” because it describes, at least approximately, a great many types of phenomena. For example, IQ scores, heights and shoe sizes are commonly modelled with the normal distribution. The standard normal distribution is the special case with mean 0 and standard deviation 1. You can find a detailed explanation of the normal distribution here. It explains the notation used and the central limit theorem.

sns.set(color_codes=True)

# Set the random seed for reproducibility
np.random.seed(123)

# Simulate a standard normal distribution (loc = 0 and scale = 1)
x = np.random.normal(size=500, loc=0, scale=1)

# Plot the histogram, the kernel density estimate and a fitted normal curve
# (note: sns.distplot is deprecated as of seaborn 0.11)
fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, fit=stats.norm, axlabel="values",
                  kde_kws={"color": "r", "lw": 3, "label": "KDE"},
                  fit_kws={"color": "black", "lw": 3, "label": "StdNormal"})
plt.legend(labels=['Kernel Density', 'Standard Normal Dist'])
plt.title("Standard Normal Distribution")
plt.show(block=False)
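Alongside the plot, a quick numerical check can confirm that the simulated sample behaves like a standard normal. The sketch below is our own addition, not part of the plotting code: the names sample_mean, sample_std and within_one_sd are illustrative. It assumes the same seed and sample size as above and compares the sample mean, the sample standard deviation and the share of points within one standard deviation of zero against their theoretical values (0, 1 and roughly 0.683).

```python
import numpy as np

# Same seed and sample size as the plotting example
np.random.seed(123)
x = np.random.normal(loc=0, scale=1, size=500)

sample_mean = x.mean()                   # theory: 0
sample_std = x.std(ddof=1)               # theory: 1
within_one_sd = np.mean(np.abs(x) < 1)   # theory: about 0.683

print(f"mean = {sample_mean:.3f}")
print(f"std = {sample_std:.3f}")
print(f"share within one sd = {within_one_sd:.3f}")
```

With only 500 points the values will not match the theory exactly; increasing size should bring them closer.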

Binomial Distribution

The Binomial Distribution is discrete and is used to model the number of successes in a fixed number of independent trials. In the simulation, n is the number of trials in each experiment and p is the probability of success on each trial, while the size parameter defines how many times we want to run the experiment. The flipping of a coin is the most intuitive way to think about the binomial distribution.

np.random.seed(123)

# 100 experiments, each counting the heads in 10 flips of a fair coin
x = np.random.binomial(size=100, n=10, p=0.5)

# Plot
fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, axlabel="Values",
                  kde_kws={"color": "r", "lw": 3, "label": "KDE"})

# Annotate each histogram bar with its height
for patch in ax.patches:
    ax.annotate("%.2f" % patch.get_height(),
                (patch.get_x() + patch.get_width() / 2., patch.get_height()),
                ha='center', va='center', fontsize=11, color='black',
                xytext=(0, 20), textcoords='offset points')

plt.legend(labels=['Kernel Density', 'Binomial Distribution'])
plt.ylim(0, 0.6)
plt.title("Binomial Distribution")
plt.show(block=False)

In the plot above we have flipped a coin 10 times and repeated this experiment 100 times. For instance, the estimated probability that the coin lands on heads (denoted as a success) exactly 5 times in an experiment is about 0.30. Note that this is an empirical distribution based on 100 experiments, not the theoretical one. Also, as n (the number of flips) gets larger, the binomial distribution can be approximated by the normal distribution. Notice that the kernel density (red line) already closely resembles the normal distribution. Have a look at Khan Academy for a detailed explanation of the distribution.
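To see how far an empirical distribution sits from the theoretical one, the simulated proportions can be placed next to the binomial probability mass function. This is a minimal sketch under the same settings as above (seed 123, n = 10, p = 0.5, 100 experiments); the variable names empirical and theoretical are our own. For reference, the theoretical probability of exactly 5 heads in 10 fair flips is 252/1024, or about 0.246.

```python
import numpy as np
from scipy import stats

np.random.seed(123)
n, p, n_experiments = 10, 0.5, 100
x = np.random.binomial(n=n, p=p, size=n_experiments)

# Possible numbers of heads: 0, 1, ..., n
k = np.arange(n + 1)

# Share of experiments that produced each number of heads
empirical = np.bincount(x, minlength=n + 1) / n_experiments

# Exact binomial probabilities for the same outcomes
theoretical = stats.binom.pmf(k, n, p)

for heads, emp, theo in zip(k, empirical, theoretical):
    print(f"{heads:2d} heads: empirical {emp:.2f}, theoretical {theo:.3f}")
```

Raising n_experiments should pull the empirical column towards the theoretical one.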

Poisson Distribution

The Poisson Distribution is used to model events that occur at random points in time, where we are interested in the number of occurrences of the event within a fixed interval. For example, the number of goals in a match or the number of calls recorded per day. Lambda, the distribution’s only parameter, is the expected number of events in the interval: the rate of the event multiplied by the length of the time interval. Stat Trek is a good place to get started on the Poisson distribution.

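np.random.seed(123)

# 10,000 draws from a Poisson distribution with lambda = 3
x = np.random.poisson(lam=3, size=10000)

# Plot the frequency of each value (kde = False gives counts, not densities)
fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, axlabel="values", kde=False)
plt.title("Poisson Distribution")
plt.ylabel("frequency")
plt.show(block=False)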

Note that in this simulation we plot the frequency and not the density of the distribution. To plot the density on the y-axis instead, you’d only need to change ‘kde = False’ to ‘kde = True’ in the code above and update the y-axis label accordingly.
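A defining property of the Poisson distribution is that its mean and variance both equal Lambda. The sketch below, which is our own addition with illustrative variable names, checks this on the same simulated sample (seed 123, lam = 3, 10,000 draws) and also compares the observed proportions with the theoretical probability mass function from scipy.stats.poisson.

```python
import numpy as np
from scipy import stats

np.random.seed(123)
lam = 3
x = np.random.poisson(lam=lam, size=10000)

# Mean and variance should both be close to lam
sample_mean = x.mean()
sample_var = x.var(ddof=1)
print(f"sample mean = {sample_mean:.3f}, sample variance = {sample_var:.3f}")

# Observed share of each count versus the theoretical probability
counts = np.bincount(x)
k = np.arange(len(counts))
empirical = counts / counts.sum()
theoretical = stats.poisson.pmf(k, lam)

# Largest absolute difference between the two
max_gap = np.abs(empirical - theoretical).max()
print(f"largest gap between empirical and theoretical = {max_gap:.4f}")
```

If the sample variance of real count data is much larger than its mean, that is a hint that a plain Poisson model may not fit well.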

Exponential Distribution

Referring back to the Poisson distribution and the example with the number of goals scored per match, a natural question arises: how would one model the interval of time between the goals? We would use the popular Exponential distribution. Here, Lambda is defined as the rate parameter, and a lower rate parameter is linked to a flatter curve. Note that np.random.exponential takes a scale parameter, which is the inverse of the rate (scale = 1 / Lambda), so scale = 1.5 in the code below corresponds to a rate of about 0.67.
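The link between the two distributions can be checked by simulation: if the waiting times between goals are exponential with rate Lambda, the number of goals per match should follow a Poisson distribution with mean Lambda. The sketch below is only an illustration under assumed values: a rate of 3 goals per match, 2,000 matches of unit length, and variable names of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(123)
rate = 3.0         # assumed average number of goals per match
n_matches = 2000   # hypothetical number of matches, each of length 1

# Waiting times between consecutive goals: exponential with mean 1 / rate.
# We draw about twice as many as needed so the timeline covers every match.
gaps = rng.exponential(scale=1.0 / rate, size=int(2 * rate * n_matches))
arrival_times = np.cumsum(gaps)

# Keep only the goals that fall inside the observed matches
arrival_times = arrival_times[arrival_times < n_matches]

# The integer part of each arrival time identifies the match it belongs to
goals_per_match = np.bincount(arrival_times.astype(int), minlength=n_matches)

print(f"mean goals per match = {goals_per_match.mean():.3f}")
print(f"variance of goals per match = {goals_per_match.var(ddof=1):.3f}")
```

Both printed values should land near the assumed rate, which is the Poisson signature of equal mean and variance.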

np.random.seed(123)

# scale is the mean waiting time, i.e. 1 / rate
x = np.random.exponential(size=500, scale=1.5)

fig = plt.figure(figsize=(10, 6))
ax = sns.distplot(x, fit=stats.expon, axlabel="values",
                  kde_kws={"color": "r", "lw": 3, "label": "KDE"},
                  fit_kws={"color": "black", "lw": 3, "label": "exp"})
plt.legend(labels=['Kernel Density', 'Exponential Dist'])
plt.title("Exponential Distribution")
plt.show(block=False)
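Because NumPy and SciPy parameterise the exponential distribution by scale rather than by rate, it can help to recover the rate from the simulated data. The following sketch is our own addition: it fits scipy.stats.expon to the same sample with the location fixed at zero (floc=0, our assumption that waiting times start at 0) and converts the fitted scale back to a rate.

```python
import numpy as np
from scipy import stats

np.random.seed(123)
scale = 1.5          # mean waiting time used in the simulation
rate = 1 / scale     # the corresponding rate, about 0.67
x = np.random.exponential(scale=scale, size=500)

# Fit an exponential distribution with the location fixed at 0;
# the fitted scale then estimates the mean waiting time
loc_hat, scale_hat = stats.expon.fit(x, floc=0)
rate_hat = 1 / scale_hat

print(f"true scale = {scale}, fitted scale = {scale_hat:.3f}")
print(f"true rate = {rate:.3f}, fitted rate = {rate_hat:.3f}")
```

With the location fixed at zero, the fitted scale is essentially the sample mean, so a larger sample should bring it closer to 1.5.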

Have a look here for a detailed explanation of the Exponential distribution and its applications.

Luka Beverin As a current Masters in Statistics student, Luka is eager to simplify complex topics and provide big-data solutions to real-world problems. He also has an educational background in actuarial and financial engineering. In his spare time, Luka enjoys traveling, writing on machine learning topics and taking part in data science competitions.
