First Steps

or: How I learned to Stop Worrying and Love Data and ControlIntroduction to Markov chain Monte Carlo (MCMC) Sampling, Part 2:

Gibbs SamplingHaskell art in your browser

with AsteriusHow to make your papers run:

Executable formal semantics for your languageUntrusted CI:

Using Nix to get automatic trusted caching of untrusted builds

20 September 2019 |

|In this blog post series, we're going to lead you through Bayesian modeling in Haskell with the `monad-bayes`

library.
We start this series gradually with some simple binary models, move next to linear regression, and finish by building a simple neural network that we "train" with a Metropolis-Hastings sampler.
You don't need any prior knowledge of Bayesian modeling to understand and learn from these posts—and we keep the code simple and understandable for Haskell newcomers.

Want to make this post interactive? Try our notebook version. It includes a Nix shell, the required imports, and some helper routines for plotting. Let's start modeling!

In this first part of the series, we introduce two fundamental concepts of `monad-bayes`

: `sampling`

and `scoring`

.
We examine them based on one of the simplest probabilistic models that we can think of—a model that represents a `True`

or `False`

choice.
You can use it, for example, to describe the answer to a question such as "Did it rain yesterday?".

The model is parameterized by a boolean `b`

, and in this simple case, `b`

is also directly the model output.
Without additional information, we assign equal probabilities `0.5`

to each value that `b`

can take (50% `True`

, 50% `False`

).
In other words, we get the model parameter `b`

from a discrete uniform prior distribution.

Let's see how the model looks like in the `monad-bayes`

library:

```
model1 :: MonadSample m => m Bool
model1 = do
b <- uniformD [False, True]
return b
```

In `monad-bayes`

a model is expressed through the typeclass `MonadSample`

.
The `MonadSample`

typeclass provides a function `random`

to our `model1`

that returns a random sample from it.
The type of the sample itself is set to `Bool`

in our case.

We define our model by binding together other basic `MonadSample`

models.
In this case, for example, we build our model from the `uniformD`

distribution that is provided by `monad-bayes`

The model, that is the chain of actions that are bound together in *MonadSample*, is not executed until we start sampling from the model.
Once we sample, the resulting chain of actions that is executed is simple:
draw `b`

from a discrete uniform distribution (*uniformD*) and then return its value.

Sampling can be executed with:

```
sampleIOfixed model1
```

```
False
```

We can get a list of samples with Haskell's `replicateM`

function:

```
nsamples = 1000
samples <- sampleIOfixed $ replicateM nsamples model1
```

Then plot the result afterwards with Vega-Lite. You can find our custom plotting functions for Vega-Lite in the notebook.

```
vlShow $ plot (200, 100) [barPlot "b"] [("b", VL.Booleans samples)]
```

So far, so good: we now have a model that represents a distribution of `True/False`

values and we can draw samples from it.
But how can we include observations into this model?

Consider again `model1`

as the answer to "Did it rain yesterday?".
What if we found out "Yes, it did rain yesterday!"?
To include this new knowledge, we need to update the distribution of the model parameter `b`

.

In `monad-bayes`

the function `score`

is responsible for making samples more or less likely—by a factor.
Don't worry if you are mystified by this explanation, we'll explain more in a moment.
But first check out `model2`

that uses `score`

to include the observation:

```
model2 :: MonadInfer m => m Bool
model2 = do
b <- uniformD [False, True]
score (if b then 1.0 else 0.0)
return b
```

Notice the new typeclass `MonadInfer`

that allows us to use `score`

in addition to sampling.
Instead of a single operation, the model is now a chain of actions: *sample*, *score*, …

Here's a naive idea of how this could work:
assume that the representation of the probability distribution of `b`

was a list of tuples `[(0.5, True), (0.5, False)]`

.
We would multiply this distribution with the distribution of my observation "Yes, it did rain yesterday!", that is with `[(1, True), (0, False)]`

.
Then we would normalize the updated probabilities such that they sum to one, and the job is done.

So are we secretly tracking the probability of all samples—the full distribution—at every step (in the Haskell world a.k.a. some variant of the *Dist* monad)?
Granted, in the case of a `True`

/`False`

question this might be a good approach—*but this is not what is happening here for very good reasons*:

To update a sample's probability as described above we need to track all samples with their probabilities and run global normalization operations on them. We are essentially running computations with full distributions over all possible values. This quickly becomes intractable because a model can basically be any Haskell function with lots and lots of possible outcomes.

`monad-bayes`

and similar probabilistic frameworks use an elegant approach that do what we want *without dragging around full distributions*.
The trick behind is to approximate the outcome distribution by drawing successive samples from it.
It turns out that it is enough to know the *relative probability* of two samples at a time to do this.
The relative probability of the two samples is independent of the probability of other samples—computations on full distributions are thus reduced to computations on samples.
With appropriate sampling algorithms (MCMC) we can approximate the outcome distribution despite this limited knowledge.
It is difficult to overstate the implications:
with MCMC we can address all kinds of sampling related problems in a probabilistic manner that would be completely inaccessible otherwise.

Let's get back to the `score`

function that modifies the probability to pull a certain sample from the distribution.
We now understand that it multiplies the *relative probability* to observe a sample compared to any other with a factor.
This means that the sample's probability is left untouched if this factor is `1`

.
If this factor is `10`

, the sample's relative probability is increased ten fold with respect to any other sample.
If this factor is `0`

, the sample's probability is set to `0`

.

Here is `model2`

expressed in words:
(a) draw a sample from `[True, False]`

with equal probability, and (b) multiply the relative probability of a sample with value `True`

with `1`

and of a sample with value `False`

with `0`

.

An appropriate sampler can trace and accumulate the score factor of a sample to compare with the score factor of other samples.
The accumulated score factors give us access to the relative probability of the two samples, and then we can use MCMC to start sampling.
We won't go into detail here about how tracing and accumulating works or which two samples we are actually comparing.
`monad-bayes`

provides a few samplers that can go through this process in different ways.
We include here the `prior`

and the `mh`

(Metropolis-Hastings sampler) functions before our well known `sampleIOfixed`

function to use an MCMC sampler.
Hopefully, the general idea became clear enough here. We'll provide more of the details later in this series.

The final result looks like this:

```
nsamples = 1000
samples <- sampleIOfixed $ prior $ mh nsamples model2
vlShow $ plot (200, 100) [barPlot "b"] [("b", VL.Booleans samples)]
```

Voilà, we wrote down a model (answer to "Did it rain yesterday?") with an uninformed (uniform) prior, and updated it based on the observation "It rained yesterday!".
The distribution of parameter `b`

after scoring—its posterior distribution—has probability `1`

for True and probability `0`

for False.
The operations that we needed to figure out the posterior distribution were *random* and *score*.

Let's move onwards to more complex models:
What if we considered a model with two parameters?
Both parameters are drawn independently from uniform *continuous* distributions between `-1`

and `1`

.
Again, we want to score this model.
This time, we use the function `condition`

that is a short form for scoring with `1`

or `0`

based on a condition.
The new model becomes:

```
model3 :: MonadInfer m => m (Double, Double)
model3 = do
b <- uniform (-1) 1
m <- uniform (-1) 1
condition $ (b-m) > 0
return (b, m)
```

The principal approach is the same: pick sample `b`

, pick sample `m`

, modify the joint sample probability based on a condition, and return the values of both samples in a tuple.
If we run this through the Metropolis-Hastings sampler, we get:

```
nsamples = 5000
modelsamples <- sampleIOfixed $ prior $ mh nsamples model3
(xValues, yValues) = unzip modelsamples
vlShow $ plot (600, 300)
[density2DPlot "b" "m" (-1.1,1.1) (-1.1,1.1)]
[("b", VL.Numbers xValues), ("m", VL.Numbers yValues)]
```

The resulting distribution in the plot above is `0`

where `b<m`

, and approximately uniform when `b>m`

, as we'd expect.
You might spot the initial state of the Markov chain as a faint rectangle in the `b<m`

region.

How about multiple conditions? Remember, we can freely bind operations together so it shouldn't be a problem. This model chains two sampling and two condition operations:

```
model4 :: MonadInfer m => m (Double, Double)
model4 = do
b <- uniform (-1) 1
m <- uniform (-1) 1
condition $ (b-m) > 0
condition $ (b+m) > 0
return (b, m)
```

And it produces the expected result:

```
nsamples = 5000
modelsamples <- sampleIOfixed $ prior $ mh nsamples model4
(xValues, yValues) = unzip modelsamples
vlShow $ plot (600, 300)
[density2DPlot "b" "m" (-1.1,1.1) (-1.1,1.1)]
[("b", VL.Numbers xValues), ("m", VL.Numbers yValues)]
```

We learned how to build models and examine related probability distributions by drawing samples and modifying their relative probabilities with the `score`

function.
The MCMC approach taken by `monad-bayes`

and similar frameworks avoids computations with full distributions, and works with individual samples to enormously simplify life. `monad-bayes`

can, therefore, be used to approximate large and complex distributions—something that quickly comes in handy.
We can use `monad-bayes`

to approximate the distribution of the return values of basically any Haskell function for a given input distribution.
This could even be the return values of entire programs.
Even better—we can use `score`

to infer the input distribution if we have a way of scoring samples based on observations.

We hope you enjoyed this first post in our *Probabilistic Programming with monad‑bayes Series* and learned lots! Now, you're ready to build more general statistical models using these building blocks, and proceed to linear regression in our next post. We hope you join us!

We use this GitHub version of `monad-bayes`

in our posts and notebooks since it's neither on Hackage nor Stackage right now. Here are two original articles you may want to check out: