Intro to Power Law
Extremes are more frequent than we expect
Intro
People tend to perceive the world in a linear fashion. This is probably why the Gaussian distribution is such a famous and widely used statistical distribution. Many aspects of our lives fall close to a Gaussian distribution (source):
- height of people
- birthweight of babies
- blood pressure etc.
But the Gaussian distribution might be utterly useless (and dangerous) in some cases, for example:
- we have probably heard that mean income might not say much about how people live, especially if there is a wide gap between poor and rich
- using Gaussian models might severely underestimate the probability of extreme market fluctuations
The previous cases were examples where a power law is more helpful. Let’s dive into the power law. Graphs in this post were generated by me unless linked to a source. The code for generating my graphs can be found here.
Power law
According to Wikipedia (source), a power law is a functional relationship between two quantities, where one quantity varies as a power of another. It has the following formula:
$$f(x) = a x^{-k}$$
where:
- a is a constant
- x is the input variable (for example, a person’s income)
- k is also a constant, the exponent of the power law
An example value is very simple to calculate: assume k=2, a=1 and x=3, then: 1*3^(-2), which is 1/9 (roughly 0.11111). When the power law is used as a probability distribution, the output is a probability density. In essence this function is very simple: we raise the input variable x to the power -k and multiply by a. But simple things might have unexpected behavior.
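As a minimal sketch, the worked example above can be reproduced in Python. The function name `power_law` and its default arguments are my own choices for illustration, not part of any library.

```python
def power_law(x, a=1.0, k=2.0):
    """Return a * x**(-k): raise x to the power -k, then multiply by a."""
    return a * x ** (-k)


# The example from the text: k=2, a=1, x=3
example = power_law(3, a=1, k=2)
print(example)  # roughly 0.11111, i.e. 1/9
```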
As a side note, do not confuse the power law with the exponential distribution (source):
$$f(x) = \lambda e^{-\lambda x}$$
where λ is a parameter (often called the rate parameter, with values above 0), e is Euler’s number and x is the input variable. Compared to the power law, the input variable x is now in the exponent.
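To make the difference concrete, here is a small sketch that evaluates both functions at a few inputs. The helper names are hypothetical, and k=2.5, λ=0.4 and a=1 are arbitrary illustration values (with a=1 the power law is not normalized, so only the rate of decay should be compared).

```python
import math


def power_law(x, a=1.0, k=2.5):
    """Power law: x is the base, -k is the exponent."""
    return a * x ** (-k)


def exponential_density(x, lam=0.4):
    """Exponential distribution density: x sits in the exponent."""
    return lam * math.exp(-lam * x)


# Tabulate both functions at growing inputs.
comparison = [(x, power_law(x), exponential_density(x)) for x in (1, 10, 100)]
for x, p, e in comparison:
    print(f"x={x:>3}  power law={p:.3e}  exponential={e:.3e}")
```

For large x the exponential value becomes many orders of magnitude smaller than the power law value, which is the first hint of the heavy tail discussed later.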
Experiment
To understand the nature of a power law distribution I’ll compare it with the exponential distribution. The experiment is as follows:
- Generate a big list of random samples (10⁸ of them, to be precise) from the power law and exponential distributions separately. Parameters were chosen arbitrarily (k=2.5 for the power law and λ=0.4 for the exponential distribution).
- Randomly take 10–10⁶ subsamples (each iteration takes 10 times more subsamples than the previous one) from both sets of samples. I assume that, like in real life, we have a small amount of data at the beginning and collect more of it in the longer run.
- Calculate summary statistics for the subsamples.
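The steps above can be sketched with NumPy roughly as follows. This is not the code behind the plots in this post: the pool is scaled down to 10⁶ samples so it runs quickly, the lower cutoff `X_MIN = 1` of the power law is an assumption (the text does not give one), and the power law samples are drawn by inverse-transform sampling.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

K = 2.5            # power-law exponent from the text
LAM = 0.4          # exponential rate from the text
X_MIN = 1.0        # assumed lower cutoff of the power law (not given in the text)
POOL_SIZE = 10**6  # scaled down from the 10^8 samples used in the text

# Inverse-transform sampling for a density proportional to x**(-K) on [X_MIN, inf)
u = rng.random(POOL_SIZE)
power_pool = X_MIN * (1.0 - u) ** (-1.0 / (K - 1.0))
exp_pool = rng.exponential(scale=1.0 / LAM, size=POOL_SIZE)


def summarize(pool, n):
    """Draw n values without replacement and return summary statistics."""
    sub = rng.choice(pool, size=n, replace=False)
    return {
        "mean": float(sub.mean()),
        "median": float(np.median(sub)),
        "variance": float(sub.var(ddof=1)),
        "max": float(sub.max()),
    }


results = {}
for exponent in range(1, 6):  # subsample sizes 10, 100, ..., 10^5
    n = 10**exponent
    results[n] = {
        "power_law": summarize(power_pool, n),
        "exponential": summarize(exp_pool, n),
    }

for n, stats in results.items():
    print(n, stats["power_law"], stats["exponential"])
```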
Simple and obvious statistics to look at are the mean and the median. In the following plot we can see that the power law samples have somewhat higher median and mean values, but not by too much. Of course, you could object that the mean of the power law is twice as high as that of the exponential distribution, but let’s assume that this is not a problem for us. We still see that with more samples both statistics tend to converge into a certain value range. So far so good.
Things get weird when we look at what happens to the variance at different numbers of samples. The sample variance of the power law distribution tends to skyrocket! Shouldn’t more samples mean more certainty about the variance, so that it converges? Not with a power law. And this weird variance is related to the maximum values.
The real devil lies in the maximum values. Under a power law, the more samples we have, the bigger the maximum values we see. And the increase is significant (on a log scale).
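A rough back-of-the-envelope argument shows why the maximum keeps growing. It assumes a pure power law density with exponent k > 1 above a lower cutoff x_min (the cutoff is my assumption, it is not specified in the experiment). The largest of n samples sits roughly at the level x_n that is exceeded about once in n draws:

```latex
P(X > x) = \left(\frac{x}{x_{\min}}\right)^{-(k-1)}, \qquad
n \, P(X > x_n) = 1 \;\Rightarrow\; x_n = x_{\min} \, n^{1/(k-1)}

\text{Exponential: } P(X > x) = e^{-\lambda x} \;\Rightarrow\; x_n = \frac{\ln n}{\lambda}
```

With k=2.5 the typical maximum grows like n^(2/3), while for the exponential distribution it grows only logarithmically in n.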
This funnily behaving distribution has some properties which I try to formalize in the following section.
Properties
From the previous section we can conclude that the power law distribution is a heavy-tailed distribution. A significant share of samples lies far away from the mean (or, more precisely, from the meanish thing I tried to calculate), meaning that the probability of rare events is much higher than, for example, under a Gaussian distribution.
The previous statement also means that if data follows a power law, it is difficult to learn from history. In the previous experiment we saw that the more data we had, the more extreme values we saw. For example, we might have historical data about price changes of market indexes and train some models to predict the probability of sudden market crashes. But maybe we just have too few data points to see crashes significantly bigger than the current maximum? If we waited long enough, we would see a bigger crash.
In real life there is usually some limit (for example, markets can collapse at most to 0, and earthquakes have physical boundaries). In that case we could use a truncated power law distribution. It might make the analysis more complicated, because under a truncated power law the tails contain less data, meaning that we have less information to draw conclusions about what is happening there.
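Overall, a power law has the following properties, depending on the exponent k (source: An Introduction to Complex Systems):
- k ≤ 1: the PDF is not normalizable
- 1 < k ≤ 2: the PDF is normalizable, but the mean is infinite
- 2 < k ≤ 3: the mean is finite, but the variance is infinite
- k > 3: both the mean and the variance are finite
PDF refers to probability density function. If a PDF is normalizable, it integrates to 1 (just like probabilities should sum up to 1). Most real-life power law distributions have exponents in the range 1<k<3, which means we are dealing with infinite-variance processes (source: An Introduction to Complex Systems). In my experiment I used the exponent 2.5, which gives infinite variance, as the plot confirmed.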
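The link between the exponent and infinite variance follows from a short calculation. As a sketch, assume a density p(x) = a x^(-k) for x ≥ x_min > 0 (the lower cutoff x_min is an assumption, needed to keep the integral finite at the lower end). The m-th moment is:

```latex
\int_{x_{\min}}^{\infty} x^{m} \, a x^{-k} \, dx
  = a \left[ \frac{x^{\,m+1-k}}{m+1-k} \right]_{x_{\min}}^{\infty}
  = \frac{a \, x_{\min}^{\,m+1-k}}{k-m-1}
  \quad \text{if } k > m+1, \text{ and it diverges otherwise.}
```

Setting m=0 gives the normalization condition (k > 1), m=1 the mean (k > 2), and m=2 the second moment and with it the variance (k > 3). For k=2.5 the mean is therefore finite, but the variance is not.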
One important property of the power law is that it is not memoryless. This is a crucial difference from the exponential distribution (the exponential and geometric distributions are the only two kinds of memoryless distributions). Memorylessness means that the waiting time until a certain event does not depend on how much time has elapsed already (source). Under a power law, the time already waited does carry information about the remaining wait. For example, earthquakes don’t happen randomly: internal pressure needs to build up before larger ones happen.
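A small numerical check illustrates what memorylessness means. The sketch below uses the closed-form survival functions P(X > x); the power law is assumed to have a lower cutoff x_min = 1 (my assumption), and the function names are hypothetical.

```python
import math

LAM = 0.4    # exponential rate used in the experiment
K = 2.5      # power-law exponent used in the experiment
X_MIN = 1.0  # assumed lower cutoff of the power law


def exp_survival(x):
    """P(X > x) for the exponential distribution."""
    return math.exp(-LAM * x)


def power_law_survival(x):
    """P(X > x) for a power law density proportional to x**(-K) above X_MIN."""
    return (x / X_MIN) ** (-(K - 1.0)) if x >= X_MIN else 1.0


def conditional_survival(survival, s, t):
    """P(X > s + t | X > s): chance of waiting t more, having already waited s."""
    return survival(s + t) / survival(s)


t = 5.0
waited = (1.0, 10.0, 100.0)
exp_rows = [conditional_survival(exp_survival, s, t) for s in waited]
power_rows = [conditional_survival(power_law_survival, s, t) for s in waited]

for s, e, p in zip(waited, exp_rows, power_rows):
    print(f"already waited {s:>5}: exponential={e:.3f}  power law={p:.3f}")
```

The exponential column is the same in every row (the past does not matter), while the power law column rises with the time already waited.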
A power law is scale invariant. This means that multiplying the input variable by a constant factor only multiplies the function by a constant factor: f(cx) = a(cx)^(-k) = c^(-k) f(x). In other words, the shape of the density curve is the same regardless of the scale at which we are observing it (source). By the way, this is a hint that we are dealing with a complex system (source: An Introduction to Complex Systems and source). A simple example of scale invariance is the distribution of wealth. We know that most of the wealth is held by a tiny fraction of people. If we take only the wealthiest 10%, we see a similar distribution of wealth among them: a few people have a lot more of it than the rest.
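Scale invariance is easy to check numerically. In this sketch (hypothetical helper names, arbitrary parameters k=2.5 and λ=0.4) the input is rescaled by a factor c=10 and the ratio f(cx)/f(x) is compared at several values of x. For the power law the ratio is the same everywhere, for the exponential function it is not.

```python
import math


def power_law(x, a=1.0, k=2.5):
    return a * x ** (-k)


def exponential_density(x, lam=0.4):
    return lam * math.exp(-lam * x)


c = 10.0                # rescaling factor
xs = (1.0, 5.0, 50.0)   # points at which the ratio is evaluated

# Power law: f(c*x) / f(x) = c**(-k), independent of x
power_ratios = [power_law(c * x) / power_law(x) for x in xs]

# Exponential: the ratio depends on x, so the shape changes with scale
exp_ratios = [exponential_density(c * x) / exponential_density(x) for x in xs]

print(power_ratios)
print(exp_ratios)
```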
How to detect power law distribution
The most straightforward way to detect a power law distribution is to plot the data on a log-log scale (source: An Introduction to Complex Systems). For example, suppose we have a text which contains many words. We count each word’s frequency and rank the words by frequency. Then we take the logarithm of both and plot them. Sample data from our previous experiment:
Power law data forms a distinctive straight line on a log-log scale. Exponential data values diminish a lot faster at the far end of the distribution. This also shows why data generated under a power law has more extreme values. But note that the far right tails have fewer points, which makes it difficult to draw firm conclusions about what is happening there.
From a log-log plot it is pretty simple to calculate the exponent of a power law (the slope of the line). But in real life there are some restrictions. First, we might not have enough data to detect the correct distribution, and we don’t want to wait until something extreme (bad) happens. On a practical level, we should care about the heaviness of the tails. Second, our distribution might be truncated: it follows a power law up to some level, but then the tail drops fast. In real-life situations we might expect physical boundaries (a truncated power law) which restrict how extreme the system can get.
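As a hedged sketch of the slope idea, the code below generates synthetic power law data (exponent 2.5, lower cutoff 1, both assumptions for illustration) and fits a straight line to the empirical complementary CDF on a log-log scale. For a density with exponent k, the complementary CDF has slope -(k-1), so the exponent is recovered as 1 minus the slope. A maximum-likelihood estimate is included for comparison; it is a different technique from the line fit and is not something this post's plots rely on.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

K_TRUE = 2.5   # exponent used to generate the synthetic data
X_MIN = 1.0    # assumed lower cutoff
N = 100_000    # number of synthetic samples

# Inverse-transform sampling for a density proportional to x**(-K_TRUE)
samples = X_MIN * (1.0 - rng.random(N)) ** (-1.0 / (K_TRUE - 1.0))

# Empirical complementary CDF: fraction of samples >= each sorted value
x_sorted = np.sort(samples)
ccdf = 1.0 - np.arange(N) / N  # 1, (N-1)/N, ..., 1/N

# Straight-line fit on the log-log scale; slope is about -(k - 1)
slope, intercept = np.polyfit(np.log(x_sorted), np.log(ccdf), deg=1)
k_from_slope = 1.0 - slope

# Maximum-likelihood estimate of the exponent, for comparison
k_from_mle = 1.0 + N / np.sum(np.log(samples / X_MIN))

print(f"exponent from log-log slope: {k_from_slope:.3f}")
print(f"exponent from maximum likelihood: {k_from_mle:.3f}")
```

On clean synthetic data both estimates should land near the true exponent. On real data the caveats from the text apply: sparse tails and truncation can distort the line fit considerably.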
Examples
Power law distributions are common in nature and society. There are multiple examples which you can find on Wikipedia (source). I’ll highlight some of them.
The Pareto 80–20 rule is the observation that, for example, 80% of the wealth is owned by 20% of the people. The Pareto principle is an example of how a power law can be used for a positive purpose: to achieve 80% of the impact, we should focus on the 20% that delivers it. For example, focus on the 20% of the biggest oil companies that produce most of the fossil fuel (and pollution). The Pareto rule is a good way to see the scale-free property: if 20% of the customers bring 80% of the sales, then 4% (20% of the original 20%) of them bring 64% of the sales.
Benford’s law is another example of a power law. It describes the expected frequency of first digits in real-life numbers (for example, in tax returns) (source). It states that small digits are more likely to appear as the first digit than bigger ones. For example, in about 30% of cases the first digit should be 1, and in only a bit under 5% of cases should it be 9. Real-life numbers whose first digits don’t follow Benford’s law are suspicious for fraud (but of course this is just indirect evidence).
Zipf’s law refers to the fact that in many numeric datasets the rank-frequency distribution is an inverse relation (source). For example, take the words in a larger corpus: a few words have very high frequency (low rank), while there are many words with lower frequency and higher rank. The most frequent words tend to take most of the probability mass. From the plot we can see that there are quite a lot of words with low frequency (upper left corner of the plot). On the other hand, there tend to be fewer words with very high frequency (upper right corner).
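The nested 80–20 arithmetic can be written out as a tiny sketch. The function name and the `depth` parameter are hypothetical; the rule is simply applied again to the top group at every level.

```python
def pareto_levels(top_share=0.2, value_share=0.8, depth=3):
    """Apply the 80-20 rule repeatedly to the top group.

    Returns (share of customers, share of sales) for each nesting level.
    """
    return [
        (top_share ** level, value_share ** level)
        for level in range(1, depth + 1)
    ]


levels = pareto_levels()
for people, value in levels:
    print(f"top {people:.1%} of customers -> {value:.1%} of sales")
```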
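The percentages above come from the standard Benford formula, P(d) = log10(1 + 1/d) for a first digit d from 1 to 9. A minimal sketch (the function name is my own):

```python
import math


def benford_probability(digit):
    """Expected share of numbers whose first digit is `digit` under Benford's law."""
    if not 1 <= digit <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


benford = {d: benford_probability(d) for d in range(1, 10)}
for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

To screen a dataset, one could compare these expected shares with the observed first-digit frequencies. A large gap is only a reason to look closer, not proof of anything.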
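A rank-frequency table like the one behind such a plot can be built with the standard library. This is only a sketch: the toy sentence stands in for a real corpus, the tokenization (lowercasing and keeping runs of letters) is an assumption, and the function name is hypothetical.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(words)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


toy_text = "the cat and the dog and the bird saw the cat"
table = rank_frequency(toy_text)

# Points for a log-log plot: log10(rank) against log10(frequency)
log_points = [(math.log10(rank), math.log10(freq)) for rank, _, freq in table]

for row in table:
    print(row)
```

With a large enough corpus, the points in `log_points` would be expected to fall close to a straight line if Zipf’s law holds.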
One example where choosing the wrong distribution for risk modelling blew up a company is Long-Term Capital Management (LTCM) (source). It was founded in 1994 and its board of directors included winners of the Nobel Prize in Economics. Long story short: they used the Gaussian distribution for predicting market fluctuations. Under a Gaussian distribution, the chance of an event that is 10 standard deviations from the mean is about 1 in 1.31x10²³ (a very, very rare event). Under a power law with an exponent of 2, this probability is 1/203 (something to deal with) (source). They blew up spectacularly in 1998.
The plot above is a reminder of what might happen if we underestimate the risks of data shaped like a power law.
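The two tail probabilities quoted for the 10-standard-deviation event can be checked with a few lines of Python. The Gaussian part is exact. For the power law part I assume that the 1/203 figure corresponds to a Student t distribution with 2 degrees of freedom (tail exponent 2), evaluated 10 scale units above the center; this is my reading of the number, not something stated in the text.

```python
import math


def gaussian_tail(z):
    """One-sided P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def student_t2_tail(t):
    """One-sided P(T > t) for a Student t variable with 2 degrees of freedom.

    Uses the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(t**2 + 2)).
    Assumed stand-in for the 'power law with exponent 2' in the text.
    """
    return 0.5 * (1.0 - t / math.sqrt(t * t + 2.0))


p_gauss = gaussian_tail(10.0)
p_power = student_t2_tail(10.0)

print(f"Gaussian:  1 in {1.0 / p_gauss:.3g}")
print(f"Power law: 1 in {1.0 / p_power:.0f}")
```

Under these assumptions the Gaussian result is roughly 1 in 1.31x10²³ and the heavy-tailed result roughly 1 in 203, a difference of about 21 orders of magnitude.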
Conclusion
The power law is essential for describing a variety of phenomena. It has a treacherously simple form which hides potential risks. If data or a phenomenon follows a power law distribution, extreme events might be much more common than we think. But a power law can also be used to achieve maximum impact with minimal effort.
References
- 6 Real-Life Examples of the Normal Distribution, Zach, 2021, https://www.statology.org/example-of-normal-distribution/
- An Introduction to Complex Systems, Society, Ecology, and Nonlinear Dynamics, Second Edition, Paul Fieguth, 2017
- Benford’s law, Wikipedia, https://en.wikipedia.org/wiki/Benford%27s_law
- Complex systems, Wikipedia, https://en.wikipedia.org/wiki/Complex_system
- Exponential distribution, Wikipedia, https://en.wikipedia.org/wiki/Exponential_distribution
- Improved (the LTCM blow up story), Nassim Nicholas Taleb, 2019, https://twitter.com/nntaleb/status/1151863828726059008
- Long-Term Capital Management, Wikipedia, https://en.wikipedia.org/wiki/Long-Term_Capital_Management
- Memorylessness, Wikipedia, https://en.wikipedia.org/wiki/Memorylessness
- Power law, Wikipedia, https://en.wikipedia.org/wiki/Power_law
- Stat 720 Notes, Alan Mishler, 2016, https://www.stat.cmu.edu/~cshalizi/networks/16-1/lectures/09/mishler.pdf
- Tõde ja õigus, Wikisource, https://et.wikisource.org/wiki/T%C3%B5de_ja_%C3%B5igus_I
- Zipf’s law, Wikipedia, https://en.wikipedia.org/wiki/Zipf%27s_law