Intro to Power Law

9 min read · Jan 10, 2023

Extremes are more frequent than we expect

distribution plot of heavy-tailed distribution, showing 80–20 rule. 80% of the data is in 20% of the space — 80–20 rule, Picture by Hay Kranen / PD, source

Intro

People tend to perceive the world in a linear fashion. This is probably why the Gaussian distribution is so famous and so widely used. Many aspects of our lives fall close to a Gaussian distribution (source):

height of people
birthweight of babies
blood pressure etc.

But the Gaussian distribution can be utterly useless (and even dangerous) in some cases, for example:

mean income might not tell us much about how people actually live, especially if there is a wide gap between the poor and the rich
Gaussian models might severely underestimate the probability of extreme market fluctuations

These are cases where a power law is more helpful. Let’s dive into it. Graphs in this post were generated by me unless linked to a source. The code for generating my graphs can be found here.

Power law

According to Wikipedia (source), a power law is a functional relationship between two quantities, where one quantity varies as a power of the other. It has the following formula:

$$f(x) = a x^{-k}$$

where:

a is a constant
x is an input variable (for example the income of a person)
k is also a constant, the exponent of the power law

An example value is very simple to calculate: assume k=2, a=1 and x=3, then: 1*3^(-2), which is 1/9 (roughly 0.11111). When the power law is used as a probability distribution, the output is a probability (density). In essence this function is very simple: we raise the input variable x to the power -k and multiply by a. But simple things might have unexpected behavior.
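The arithmetic above can be checked in a couple of lines of Python. This is only a minimal sketch; the function name `power_law` is an assumption chosen for illustration.

```python
def power_law(x, a=1.0, k=2.0):
    """Evaluate a * x**(-k) for a positive input x."""
    return a * x ** (-k)

# The worked example from the text: a=1, k=2, x=3 gives 1/9.
value = power_law(3, a=1, k=2)
print(value)
```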

As a side note, do not confuse the power law with the exponential distribution (source):

$$f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0$$

Exponential distribution probability density function

where λ is a parameter (often called the rate parameter, with values above 0), e is Euler’s number and x is an input variable. Compared to the power law, the input variable x is now in the exponent.
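To get a feeling for how differently the two functions decay, one can evaluate both at growing values of x. The sketch below uses an unnormalized power law with a=1 and the parameter values k=2.5 and λ=0.4; the function names are assumptions chosen for illustration.

```python
import math

def power_law(x, a=1.0, k=2.5):
    """Unnormalized power law a * x**(-k)."""
    return a * x ** (-k)

def exponential_pdf(x, lam=0.4):
    """Exponential density lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x)

# The ratio grows quickly with x: the exponential tail dies off much faster.
for x in (1, 10, 100):
    ratio = power_law(x) / exponential_pdf(x)
    print(f"x={x:>4}  power law={power_law(x):.3e}  "
          f"exponential={exponential_pdf(x):.3e}  ratio={ratio:.3e}")
```

For small x the two values are of similar size, but far out in the tail the power law is larger by many orders of magnitude.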

Experiment

To understand the nature of a power law distribution I’ll compare it with the exponential distribution. The experiment is as follows:

Generate a big list of random samples (10⁸ of them, to be precise) from the power law and exponential distributions separately. Parameters were chosen arbitrarily (k=2.5 for the power law and λ=0.4 for the exponential distribution).
Randomly take from 10 to 10⁶ subsamples (each iteration takes 10 times more subsamples) from both sets of samples. I assume that, like in real life, we have a small amount of data at the beginning and collect more of it in the longer run.
Calculate summary statistics for the subsamples.
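The steps above can be sketched with the Python standard library alone. This is not the code behind the plots in this post; it is a simplified version under stated assumptions: fresh samples are drawn for each size instead of subsampling one big pool, the largest size is 10⁵ to keep it fast, and the power law is a Pareto distribution with lower bound 1 (`random.paretovariate(k - 1)` has a density proportional to x^(-k)). The name `run_experiment` is hypothetical.

```python
import math
import random
import statistics

def run_experiment(k=2.5, lam=0.4,
                   sizes=(10, 100, 1_000, 10_000, 100_000), seed=42):
    """Return summary statistics of power law and exponential samples per size."""
    rng = random.Random(seed)
    results = []
    for n in sizes:
        # Assumption: Pareto with x_min = 1; alpha = k - 1 gives density ~ x**(-k).
        power = [rng.paretovariate(k - 1) for _ in range(n)]
        expo = [rng.expovariate(lam) for _ in range(n)]
        for name, data in (("power law", power), ("exponential", expo)):
            mean = statistics.fmean(data)
            std = math.sqrt(statistics.fmean([(x - mean) ** 2 for x in data]))
            results.append({
                "distribution": name,
                "n": n,
                "mean": mean,
                "median": statistics.median(data),
                "std": std,
                "max": max(data),
            })
    return results

for row in run_experiment():
    print(f"{row['distribution']:<12} n={row['n']:>7} mean={row['mean']:.2f} "
          f"median={row['median']:.2f} std={row['std']:.2f} max={row['max']:.1f}")
```

Exact numbers depend on the seed and on how the power law samples are generated, so they will not match the plots exactly; the qualitative pattern is what matters.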

The simple and obvious statistics to look at are the mean and the median. In the following plot we can see that the power law samples have somewhat higher median and mean values, but not by much. Of course you could say that the mean of the power law is twice as high as that of the exponential distribution. But let’s assume that this is not a problem for us. We still see that with more samples both statistics tend to converge to a certain value range. So far so good.

plot showing how exponential and power law distribution sample means and medians differ if sample size increases. They don’t change too much — Experiment mean and median values at different number of samples

Things get weird when we look at what happens to the spread (variance, shown here as standard deviation) at different sample sizes. The power law sample standard deviation tends to skyrocket! Shouldn’t more samples mean more certainty about the variance, so that it should converge? Not with a power law. And this weird variance is related to the maximum values.

plot showing how exponential and power law distribution sample standard deviations differ if sample size increases. Power law standard deviation increases a lot. Exponential samples standard deviation stays same. — Experiment standard deviation values at different number of samples

The real devil lies in the maximum values. Under a power law, the more samples we have, the bigger the maximum values we see. And the increase is significant (clearly visible on the log scale).

plot showing how exponential and power law distribution maximum values differ if sample size increases. Power law maximum increases a lot. Exponential samples maximum stays almost the same. — Experiment maximum values at different number of samples. Left plot: original y-scale, right plot: logarithmic y-scale for better visualization
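The growth of the maximum can also be estimated without simulation. A rough sketch: the "typical" maximum of n draws is the value that is exceeded on average once in n draws, i.e. where the survival function equals 1/n. The code assumes a Pareto-type power law with lower bound `x_min` (an assumption introduced here) and uses hypothetical function names.

```python
import math

def typical_max_power_law(n, k=2.5, x_min=1.0):
    """Solve (x / x_min)**(-(k - 1)) = 1/n for x."""
    return x_min * n ** (1.0 / (k - 1.0))

def typical_max_exponential(n, lam=0.4):
    """Solve exp(-lam * x) = 1/n for x."""
    return math.log(n) / lam

for exponent in range(1, 7):
    n = 10 ** exponent
    print(f"n=10^{exponent}  power law ~ {typical_max_power_law(n):10.1f}  "
          f"exponential ~ {typical_max_exponential(n):6.1f}")
```

Under these assumptions the power law maximum grows like a power of n, while the exponential maximum grows only like log(n), which is why it looks almost flat in comparison.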

This oddly behaving distribution has some properties which I try to formalize in the following section.

Properties

From the experiment we can conclude that the power law distribution is a heavy-tailed distribution. A significant share of samples lies far away from the mean (or, more precisely, the mean-ish thing I tried to calculate), meaning that the probability of rare events is much higher than, for example, under a Gaussian distribution.

This also means that if data follows a power law it is difficult to learn from history. In the experiment we saw that the more data we had, the more extreme values we saw. For example, we might have historic data about price changes of market indexes and train some models to predict the probability of sudden market crashes. But maybe we just have too few data points to have seen crashes significantly bigger than the current maximum? If we waited long enough we would see a bigger crash.

In real life there is usually some limit (for example markets can collapse to 0 at most, and earthquakes have physical boundaries). In that case we could use a truncated power law distribution. It might make the analysis more complicated, because under a truncated power law the tails contain less data, so we have less information to draw conclusions about what is happening there.

Overall, the power law has the following properties (source: An Introduction to Complex Systems):

table showing how power law properties change if k increases

PDF refers to the probability density function. If a PDF is normalizable, it integrates to 1 (just as probabilities should sum to 1). Most real-life power law distributions are in the range 1<k<3, which means we are dealing with infinite-variance processes (source: An Introduction to Complex Systems). In my experiment the exponent was 2.5, which implies infinite variance, consistent with the plot.
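The thresholds on k follow from a short calculation. As a sketch, assume the density is a pure power law p(x) = C x^(-k) above some lower cutoff x_min > 0 (the cutoff and the symbols C, x_min and m are assumptions introduced here for the derivation). The m-th moment is then:

$$\int_{x_{\min}}^{\infty} x^{m}\, C x^{-k}\, dx = \begin{cases} \dfrac{C\, x_{\min}^{\,m+1-k}}{k-m-1}, & k > m+1 \\[2ex] \infty, & k \le m+1 \end{cases}$$

Setting m=0 shows that the PDF is normalizable only for k>1, with C = (k-1) x_min^(k-1). Setting m=1 shows that the mean is finite only for k>2, where it equals x_min (k-1)/(k-2). Setting m=2 shows that the second moment, and therefore the variance, is finite only for k>3. With k=2.5 the mean is 3 x_min under these assumptions, while the variance is infinite.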

One important property of the power law is that it is not memoryless. This is a crucial difference compared to the exponential distribution (the exponential and geometric distributions are the only two kinds of memoryless distributions). Memorylessness means that the waiting time until a certain event does not depend on how much time has elapsed already (source). For example, earthquakes don’t happen randomly: there is internal pressure that needs to build up before larger ones happen.
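Memorylessness can be checked numerically by comparing P(X > s + t | X > s) for different values of s. The sketch below uses the closed-form survival functions; the Pareto form with lower bound `x_min` is an assumption, and the function names are hypothetical.

```python
import math

def exp_survival(x, lam=0.4):
    """P(X > x) for the exponential distribution."""
    return math.exp(-lam * x)

def pareto_survival(x, k=2.5, x_min=1.0):
    """P(X > x) for a power law with exponent k and lower bound x_min."""
    return (x / x_min) ** (-(k - 1.0)) if x >= x_min else 1.0

def conditional_survival(survival, s, t):
    """P(X > s + t | X > s): chance of lasting t longer after already reaching s."""
    return survival(s + t) / survival(s)

for s in (1, 10, 100):
    print(f"s={s:>3}  exponential={conditional_survival(exp_survival, s, 5):.3f}  "
          f"power law={conditional_survival(pareto_survival, s, 5):.3f}")
```

For the exponential distribution the conditional probability is the same for every s. For the power law it depends on s, so how much has already elapsed does matter.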

Power law is scale invariant. This means that rescaling the input variable by a constant factor c only multiplies the function by a constant: a(cx)^(-k) = c^(-k) * a*x^(-k). In other words, the shape of the density curve is the same regardless of the scale at which we observe it (source). By the way, this is a hint that we are dealing with a complex system (source: An Introduction to Complex Systems and source). A simple example of scale invariance is the distribution of wealth. We know that most of the wealth is held by a tiny fraction of people. If we take the wealthiest 10%, we see that wealth is distributed similarly among them: a few people have a lot more of it than the rest.
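Scale invariance is easy to verify numerically: for a power law, the ratio between the value at c*x and the value at x is the same for every x. A minimal sketch (function names and the factor c=10 are assumptions for illustration), with the exponential distribution as a counterexample:

```python
import math

def power_law(x, a=1.0, k=2.5):
    """Unnormalized power law a * x**(-k)."""
    return a * x ** (-k)

def exponential_pdf(x, lam=0.4):
    """Exponential density lam * exp(-lam * x)."""
    return lam * math.exp(-lam * x)

c = 10.0  # rescaling factor
for x in (1.0, 5.0, 25.0):
    power_ratio = power_law(c * x) / power_law(x)
    exp_ratio = exponential_pdf(c * x) / exponential_pdf(x)
    print(f"x={x:>5}  power law ratio={power_ratio:.6f}  exponential ratio={exp_ratio:.3e}")
```

The power law ratio is always c^(-k), independent of x, while the exponential ratio changes with x.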

How to detect power law distribution

The most straightforward way to detect a power law distribution is to plot the data on a log-log scale (source: An Introduction to Complex Systems). For example, suppose we have a text which contains many words. We count the frequency of each word and rank the words by frequency. Then we take the logarithm of both and plot them. Here is the sample data from our previous experiment:

log-log plot of exponential and power law data. power law data is straight line, exponential data drops a lot faster — Exponential distribution data (blue) vs power law distribution data (red) in log-log scale

Power law data has a distinctive straight-line form on a log-log scale. Exponential data values diminish a lot faster at the far end of the distribution. This also shows why data generated under a power law has more extreme values. But note that the far right tail has fewer points, which makes it difficult to draw firm conclusions about what is happening there.

From the log-log plot it is pretty simple to calculate the exponent of a power law (the slope of the line). But in real life there are some restrictions. First, we might not have enough data to detect the correct distribution. We don’t want to wait until something extreme (bad) happens. On a practical level we should care about the heaviness of the tails. Secondly, our distribution might be truncated: it follows a power law up to some level, but then the tail drops fast. In real-life situations we might expect physical boundaries (a truncated power law) which restrict how extreme the system can get.
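As a rough sketch of how the exponent could be read off in code, the example below fits a straight line to the empirical complementary CDF, P(X ≥ x), on a log-log scale. For a density with exponent k this line has slope -(k-1). For comparison it also computes the maximum likelihood estimate for a continuous power law with a known lower bound. The synthetic Pareto data, the lower bound of 1 and all function names are assumptions made for illustration.

```python
import math
import random

def fit_exponent_loglog(samples):
    """Estimate k from the least-squares slope of the log-log empirical CCDF."""
    xs = sorted(samples, reverse=True)
    n = len(xs)
    log_x = [math.log(x) for x in xs]
    # The i-th largest value (i = 1..n) is reached or exceeded by i of n samples.
    log_ccdf = [math.log((i + 1) / n) for i in range(n)]
    mean_x = sum(log_x) / n
    mean_y = sum(log_ccdf) / n
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_ccdf))
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    slope = sxy / sxx
    return 1.0 - slope  # CCDF slope is -(k - 1)

def fit_exponent_mle(samples, x_min=1.0):
    """Maximum likelihood estimate of k for x >= x_min (x_min assumed known)."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

rng = random.Random(0)
samples = [rng.paretovariate(1.5) for _ in range(20_000)]  # true exponent k = 2.5
print("log-log slope estimate:", round(fit_exponent_loglog(samples), 3))
print("maximum likelihood estimate:", round(fit_exponent_mle(samples), 3))
```

On clean synthetic data both estimates land close to the true exponent. On real data the result is sensitive to the choice of lower bound and to truncation in the tail, which is exactly the restriction described above.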

Examples

Power law distributions are very common in nature and society. There are multiple examples which you can find on Wikipedia (source). I’ll highlight some of them.

The Pareto 80–20 rule is the observation that, for example, 80% of the wealth is owned by 20% of the people. The Pareto principle is an example of how a power law can be used for a positive purpose: to get 80% of the impact we should focus on the 20% that delivers it. For example, focus on the 20% of the biggest oil companies that produce most of the fossil fuel (and pollution). The Pareto rule is a good way to see the scale-free property: if 20% of the customers bring 80% of the sales, then 4% of the customers (20% of the original 20%) bring 64% of the sales.
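The 80–20 arithmetic can be reproduced from the Pareto distribution itself. As a sketch: for a Pareto distribution with shape α > 1, the share of the total held by the top fraction p is p^(1 - 1/α), and α = log(5)/log(4) ≈ 1.16 gives exactly the 80–20 split. The function and constant names are assumptions for illustration.

```python
import math

# Shape parameter for which the top 20% hold exactly 80% of the total.
ALPHA_80_20 = math.log(5) / math.log(4)

def top_share(p, alpha=ALPHA_80_20):
    """Share of the total held by the top fraction p (Pareto, alpha > 1)."""
    return p ** (1.0 - 1.0 / alpha)

for p in (0.2, 0.04, 0.008):
    print(f"top {p:.1%} hold {top_share(p):.1%} of the total")
```

Applying the rule repeatedly (20% of 20% of 20% and so on) keeps producing the same kind of split, which is the scale-free property in action.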

Benford’s law is another example of a power law. It gives the expected frequency of first digits in real-life numbers (for example in tax returns) (source). It states that small digits are more likely to appear as the first digit than bigger ones. For example, in about 30% of cases the first digit should be 1, and in only a bit under 5% of cases it should be 9. Real-life numbers whose first digits don’t follow Benford’s law are suspected of fraud (but of course it is only indirect evidence).

plot showing Benford’s law, smaller digits have higher probability being a first digit in a number. — The distribution of first digits, according to Benford’s law. Each bar represents a digit, and the height of the bar is the percentage of numbers that start with that digit. (Source)
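The percentages in the plot come from a one-line formula: under Benford’s law the probability that the first digit is d equals log10(1 + 1/d). A minimal sketch (the function name is an assumption for illustration):

```python
import math

def benford_probability(d):
    """Probability that the leading digit equals d (d = 1..9) under Benford's law."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(f"first digit {d}: {benford_probability(d):.1%}")
```

A fraud check would compare these expected shares with the observed first-digit frequencies of the dataset in question.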

Zipf’s law refers to the fact that in many numeric datasets the rank-frequency distribution is an inverse relation (source). Take, for example, words in a large corpus: a few words have very high frequency (low rank), while many words have lower frequency and higher rank. The most frequent words take most of the frequency probability mass. From the plot we can see that there are quite a lot of words with low frequency (upper left corner of the plot). On the other hand there are far fewer words with very high frequency (lower right corner).

Log-log plot of word frequency vs the number of words with that frequency; the relationship is linear and decreasing. Words taken from the book “Tõde ja õigus”, parts I-V
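The counting behind such a plot can be sketched in a few lines. The example below is not the code used for the plot above; it uses a tiny made-up sentence instead of the book, and the function names are hypothetical. Taking the logarithm of both columns and plotting them would be the next step.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word occurs."""
    return Counter(re.findall(r"\w+", text.lower()))

def rank_frequency(counts):
    """List of (rank, word, frequency); rank 1 is the most frequent word."""
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

def frequency_of_frequencies(counts):
    """Map each frequency to the number of distinct words that have it."""
    return Counter(counts.values())

# Toy text for illustration only.
text = "the cat and the dog and the bird saw the fish"
counts = word_frequencies(text)
print(rank_frequency(counts)[:3])
print(sorted(frequency_of_frequencies(counts).items()))
```

Even in this toy example many words occur once and only one word occurs four times.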

One example where choosing the wrong distribution for risk modelling blew up a company is Long-Term Capital Management (LTCM) (source). It was founded in 1994 and its board of directors included winners of the Nobel Prize in Economics. Long story short: they used the Gaussian distribution for predicting market fluctuations. Under the Gaussian distribution the chance of an event that is 10 standard deviations from the mean is 1.31x10⁻²³ (a very, very rare event). Under a power law with an exponent of 2, this probability is 1/203 (something to deal with) (source). They blew up spectacularly in 1998.
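The order of magnitude of the Gaussian figure can be checked with the complementary error function from the standard library. This is only a sketch: the exact value depends on the convention (one-sided or two-sided tail), so it need not match the quoted number digit for digit. The function name is an assumption for illustration.

```python
import math

def gaussian_tail(k_sigma, two_sided=False):
    """Probability that a Gaussian value lies more than k_sigma std devs from the mean."""
    one_sided = 0.5 * math.erfc(k_sigma / math.sqrt(2))
    return 2 * one_sided if two_sided else one_sided

for k_sigma in (2, 5, 10):
    print(f"{k_sigma:>2} sigma: one-sided={gaussian_tail(k_sigma):.3e}  "
          f"two-sided={gaussian_tail(k_sigma, two_sided=True):.3e}")
```

Either way, a 10 standard deviation move is practically impossible under a Gaussian model.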

The value of $1,000 invested in LTCM, the Dow Jones Industrial Average and invested monthly in U.S. Treasuries at constant maturity. The LTCM value drops sharply in 1998. (source)

The plot above is a reminder of what might happen if we underestimate the risks of power-law-shaped data.

Conclusion

The power law is essential for describing a variety of phenomena. It has a treacherously simple form which hides potential risks. If data or a phenomenon follows a power law distribution, extreme events might be much more common than we think. But the power law can also be used to have maximum impact with minimal effort.

References

6 Real-Life Examples of the Normal Distribution, Zach, 2021, https://www.statology.org/example-of-normal-distribution/
An Introduction to Complex Systems, Society, Ecology, and Nonlinear Dynamics, Second Edition, Paul Fieguth, 2017
Benford’s law, Wikipedia, https://en.wikipedia.org/wiki/Benford%27s_law
Complex systems, Wikipedia, https://en.wikipedia.org/wiki/Complex_system
Exponential distribution, Wikipedia, https://en.wikipedia.org/wiki/Exponential_distribution
Improved (the LTCM blow up story), Nassim Nicholas Taleb, 2019, https://twitter.com/nntaleb/status/1151863828726059008
Long-Term Capital Management, Wikipedia, https://en.wikipedia.org/wiki/Long-Term_Capital_Management
Memorylessness, Wikipedia, https://en.wikipedia.org/wiki/Memorylessness
Power law, Wikipedia, https://en.wikipedia.org/wiki/Power_law
Stat 720 Notes, Alan Mishler, 2016, https://www.stat.cmu.edu/~cshalizi/networks/16-1/lectures/09/mishler.pdf
Tõde ja õigus, Wikisource, https://et.wikisource.org/wiki/T%C3%B5de_ja_%C3%B5igus_I
Zipf’s law, Wikipedia, https://en.wikipedia.org/wiki/Zipf%27s_law