This weekend I had to generate some random sample data, but instead of it being selected from a normal distribution, I needed the data to have some skew. A distribution is said to be skewed if there are more samples on one side of the mean than on the other. To get a feeling for what skew means, have a look at the following figures.
Now, it is quite easy to generate normally distributed random data, for example using the norm class from scipy.stats. There also is a class for a skewed-normally distributed variable, scipy.stats.skewnorm. However, the variable used to express skew is a bit unintuitive there.
So I searched for another solution, and came across a technical report about that issue ( Citation: Ermak & Nasstrom, 1995 Ermak, D. & Nasstrom, J. (1995). A method for generating skewed random numbers using two overlapping uniform distributions . Office of Scientific and Technical Information (OSTI) https://doi.org/10.2172/32589 ) . Ermak and Nasstrom use the sum of two uniformly distributed random variables with overlapping ranges to generate a skewed quasi-uniformly distributed variable, and then add many of these together to generate a skewed quasi-normally distributed variable.
The paper is worth a read, but here is a summary of the thought process. Going back a bit to basic statistics, we find that the normal distribution is somewhat special in that generally, the distribution of the sum of many identically distributed random variables approaches the normal distribution. This result is known as the Central Limit Theorem. It is also considered the reason that normally distributed random variables are so widespread in nature.
So, if we were to consider the sum of a bunch of — for example — uniformly distributed random variables, then the distribution of that sum would also approach the normal distribution. This is one way of generating approximately normally distributed random data.
Ermak and Nasstrom instead generate random data from what they call a double-block distribution. The probability density function of such a distribution is shown in the following figure.
The idea here is to use two uniformly-distributed random variables with differing means. The sum of these two will have a skewed distribution. Adding many of these skewed random variables up will give us an approximated skewed quasi-normal distribution in the same way that summing up identically distributed random variables will give us an approxomate normal distribution.
Ermak and Nasstrom go through the lengths of calculating the required parameters for the two uniformly-distributed random variables in order to get a specific variance and skew, and then provide an algorithm for drawing samples from such a double-block distribution as well as generating samples from the skewed quasi-normal distribution.
I have coded this up using SciPy in Python. To generate N samples with given mean, standard deviation and skewness,
run gen_skewed_continuous(N,mean,stdev,skewness)
. The optional parameter oversample
can be used to modify the number
of skewed-uniformly-distributed variables to be added.
import numpy as np;
import scipy.stats as stats;
def gen_skewed_block(N, var, skew):
# Determine block parameters
a = np.sqrt(5);
offset = np.sqrt(skew ** 2 + 243 * (var ** 3) / 32)
m1 = 2 / (9 * var) * (skew - offset);
m2 = 2 / (9 * var) * (skew + offset);
p1 = m2 / (2 * a * m1 * (m1 - m2));
p2 = m1 / (2 * a * m2 * (m1 - m2));
d1 = -a * m1;
d2 = a * m2;
# Uniform random number distribution
rv = stats.uniform();
# Get pairs of random numbers
r = rv.rvs((N, 2));
# Calculate block numbers
v1 = 2 * d1 * (r[:, 1] - 0.5) + m1;
v2 = 2 * d2 * (r[:, 1] - 0.5) + m2;
values = np.where(r[:, 0] & lt;
2 * d1 * p1, v1, v2);
return values;
def gen_skewed_continuous(N, mean, std, skewness, oversample=10):
# Determine moments
m2 = std ** 2;
m3 = skewness * std ** 3;
# Generate skewed block random values
rv1 = gen_skewed_block(N * oversample,
m2,
m3 * np.sqrt(oversample));
# Reshape them so we have oversample number of values in a row
rv2 = rv1.reshape(-1, oversample);
# Sum them together and scale them to approximate a continuous distribution
rv3 = rv2.sum(axis=1) / np.sqrt(oversample) + mean;
return rv3;
Now, if you think about it, there might also be a way of simplifying this by directly using two non-identically normally distributed random variables, saving the effort for calculating the sums. I’ll have to think about that one.
Bibliography
- Ermak & Nasstrom (1995)
- Ermak, D. & Nasstrom, J. (1995). A method for generating skewed random numbers using two overlapping uniform distributions . Office of Scientific and Technical Information (OSTI) https://doi.org/10.2172/32589