20 September 2019 ||
In this blog post series, we're going to lead you through Bayesian modeling in Haskell with the
We start this series gradually with some simple binary models, move next to linear regression, and finish by building a simple neural network that we "train" with a Metropolis-Hastings sampler.
You don't need any prior knowledge of Bayesian modeling to understand and learn from these posts—and we keep the code simple and understandable for Haskell newcomers.
Want to make this post interactive? Try our notebook version. It includes a Nix shell, the required imports, and some helper routines for plotting. Let's start modeling!
In this first part of the series, we introduce two fundamental concepts of
We examine them based on one of the simplest probabilistic models that we can think of—a model that represents a
You can use it, for example, to describe the answer to a question such as "Did it rain yesterday?".
The model is parameterized by a boolean
b, and in this simple case,
b is also directly the model output.
Without additional information, we assign equal probabilities
0.5 to each value that
b can take (50%
In other words, we get the model parameter
b from a discrete uniform prior distribution.
Let's see how the model looks like in the
model1 :: MonadSample m => m Bool model1 = do b <- uniformD [False, True] return b
monad-bayes a model is expressed through the typeclass
MonadSample typeclass provides a function
random to our
model1 that returns a random sample from it.
The type of the sample itself is set to
Bool in our case.
We define our model by binding together other basic
In this case, for example, we build our model from the
uniformD distribution that is provided by
The model, that is the chain of actions that are bound together in MonadSample, is not executed until we start sampling from the model.
Once we sample, the resulting chain of actions that is executed is simple:
b from a discrete uniform distribution (uniformD) and then return its value.
Sampling can be executed with:
We can get a list of samples with Haskell's
nsamples = 1000 samples <- sampleIOfixed $ replicateM nsamples model1
Then plot the result afterwards with Vega-Lite. You can find our custom plotting functions for Vega-Lite in the notebook.
vlShow $ plot (200, 100) [barPlot "b"] [("b", VL.Booleans samples)]
So far, so good: we now have a model that represents a distribution of
True/False values and we can draw samples from it.
But how can we include observations into this model?
model1 as the answer to "Did it rain yesterday?".
What if we found out "Yes, it did rain yesterday!"?
To include this new knowledge, we need to update the distribution of the model parameter
monad-bayes the function
score is responsible for making samples more or less likely—by a factor.
Don't worry if you are mystified by this explanation, we'll explain more in a moment.
But first check out
model2 that uses
score to include the observation:
model2 :: MonadInfer m => m Bool model2 = do b <- uniformD [False, True] score (if b then 1.0 else 0.0) return b
Notice the new typeclass
MonadInfer that allows us to use
score in addition to sampling.
Instead of a single operation, the model is now a chain of actions: sample, score, …
Here's a naive idea of how this could work:
assume that the representation of the probability distribution of
b was a list of tuples
[(0.5, True), (0.5, False)].
We would multiply this distribution with the distribution of my observation "Yes, it did rain yesterday!", that is with
[(1, True), (0, False)].
Then we would normalize the updated probabilities such that they sum to one, and the job is done.
So are we secretly tracking the probability of all samples—the full distribution—at every step (in the Haskell world a.k.a. some variant of the Dist monad)?
Granted, in the case of a
False question this might be a good approach—but this is not what is happening here for very good reasons:
To update a sample's probability as described above we need to track all samples with their probabilities and run global normalization operations on them. We are essentially running computations with full distributions over all possible values. This quickly becomes intractable because a model can basically be any Haskell function with lots and lots of possible outcomes.
monad-bayes and similar probabilistic frameworks use an elegant approach that do what we want without dragging around full distributions.
The trick behind is to approximate the outcome distribution by drawing successive samples from it.
It turns out that it is enough to know the relative probability of two samples at a time to do this.
The relative probability of the two samples is independent of the probability of other samples—computations on full distributions are thus reduced to computations on samples.
With appropriate sampling algorithms (MCMC) we can approximate the outcome distribution despite this limited knowledge.
It is difficult to overstate the implications:
with MCMC we can address all kinds of sampling related problems in a probabilistic manner that would be completely inaccessible otherwise.
Let's get back to the
score function that modifies the probability to pull a certain sample from the distribution.
We now understand that it multiplies the relative probability to observe a sample compared to any other with a factor.
This means that the sample's probability is left untouched if this factor is
If this factor is
10, the sample's relative probability is increased ten fold with respect to any other sample.
If this factor is
0, the sample's probability is set to
model2 expressed in words:
(a) draw a sample from
[True, False] with equal probability, and (b) multiply the relative probability of a sample with value
1 and of a sample with value
An appropriate sampler can trace and accumulate the score factor of a sample to compare with the score factor of other samples.
The accumulated score factors give us access to the relative probability of the two samples, and then we can use MCMC to start sampling.
We won't go into detail here about how tracing and accumulating works or which two samples we are actually comparing.
monad-bayes provides a few samplers that can go through this process in different ways.
We include here the
prior and the
mh (Metropolis-Hastings sampler) functions before our well known
sampleIOfixed function to use an MCMC sampler.
Hopefully, the general idea became clear enough here. We'll provide more of the details later in this series.
The final result looks like this:
nsamples = 1000 samples <- sampleIOfixed $ prior $ mh nsamples model2 vlShow $ plot (200, 100) [barPlot "b"] [("b", VL.Booleans samples)]
Voilà, we wrote down a model (answer to "Did it rain yesterday?") with an uninformed (uniform) prior, and updated it based on the observation "It rained yesterday!".
The distribution of parameter
b after scoring—its posterior distribution—has probability
1 for True and probability
0 for False.
The operations that we needed to figure out the posterior distribution were random and score.
Let's move onwards to more complex models:
What if we considered a model with two parameters?
Both parameters are drawn independently from uniform continuous distributions between
Again, we want to score this model.
This time, we use the function
condition that is a short form for scoring with
0 based on a condition.
The new model becomes:
model3 :: MonadInfer m => m (Double, Double) model3 = do b <- uniform (-1) 1 m <- uniform (-1) 1 condition $ (b-m) > 0 return (b, m)
The principal approach is the same: pick sample
b, pick sample
m, modify the joint sample probability based on a condition, and return the values of both samples in a tuple.
If we run this through the Metropolis-Hastings sampler, we get:
nsamples = 5000 modelsamples <- sampleIOfixed $ prior $ mh nsamples model3 (xValues, yValues) = unzip modelsamples vlShow $ plot (600, 300) [density2DPlot "b" "m" (-1.1,1.1) (-1.1,1.1)] [("b", VL.Numbers xValues), ("m", VL.Numbers yValues)]
The resulting distribution in the plot above is
b<m, and approximately uniform when
b>m, as we'd expect.
You might spot the initial state of the Markov chain as a faint rectangle in the
How about multiple conditions? Remember, we can freely bind operations together so it shouldn't be a problem. This model chains two sampling and two condition operations:
model4 :: MonadInfer m => m (Double, Double) model4 = do b <- uniform (-1) 1 m <- uniform (-1) 1 condition $ (b-m) > 0 condition $ (b+m) > 0 return (b, m)
And it produces the expected result:
nsamples = 5000 modelsamples <- sampleIOfixed $ prior $ mh nsamples model4 (xValues, yValues) = unzip modelsamples vlShow $ plot (600, 300) [density2DPlot "b" "m" (-1.1,1.1) (-1.1,1.1)] [("b", VL.Numbers xValues), ("m", VL.Numbers yValues)]
We learned how to build models and examine related probability distributions by drawing samples and modifying their relative probabilities with the
The MCMC approach taken by
monad-bayes and similar frameworks avoids computations with full distributions, and works with individual samples to enormously simplify life.
monad-bayes can, therefore, be used to approximate large and complex distributions—something that quickly comes in handy.
We can use
monad-bayes to approximate the distribution of the return values of basically any Haskell function for a given input distribution.
This could even be the return values of entire programs.
Even better—we can use
score to infer the input distribution if we have a way of scoring samples based on observations.
We hope you enjoyed this first post in our Probabilistic Programming with monad‑bayes Series and learned lots! Now, you're ready to build more general statistical models using these building blocks, and proceed to linear regression in our next post. We hope you join us!
We use this GitHub version of
monad-bayes in our posts and notebooks since it's neither on Hackage nor Stackage right now. Here are two original articles you may want to check out: