Engineering blog - Tweag

Blog: data-science (32 posts)

27 February 2025

Evaluating the evaluators: know your RAG metrics

Evaluating retrieval-augmented generation (RAG) is easier than ever, but you need to keep a close eye on the LLMs that drive the evaluation metrics. We discuss some common pitfalls and solutions.

24 September 2024

Python Packaging in the Real World

An empirical analysis of Python packages on PyPI and biomedical journals in 2023, with a focus on the quality of dependency declarations.

12 October 2023

Use monad-bayes and rhine in your interactive machine learning application

21 September 2023

Behind the scenes with FawltyDeps 0.13.0

FawltyDeps 0.13.0 introduces a brand new mapping strategy. In this post, we'll delve into the mechanics of how dependencies and imports are matched, as well as how you can leverage these new features to boost your Python dependency management workflow.

Announcing FawltyDeps

FawltyDeps is a new tool to help you identify undeclared and unused dependencies in your Python code, making your projects leaner and more reproducible.

Chainsail goes open-source

Simeon Carstens

Tweag releases the full source code of the Chainsail web service, for sampling multimodal distributions, first announced in August 2022. This blog post gives a tour of the Chainsail service architecture, links out to the relevant parts of the source code and proposes possible extensions to Chainsail for which the Tweag team would welcome contributions from the community.

8 December 2022

Creating a Delta Lake in Haskell with Sparkle

A practical introduction to using Sparkle to write Haskell programs which interface with Delta Lake.

25 October 2022

Chainsail goes Bayesian statistics

Abdellatif Kadiri

Tweag intern Abdellatif summarizes his internship, in which he augmented Chainsail with a Bayesian Replica Exchange scheme to improve sampling of multimodal distributions.

18 October 2022

Improving monad-bayes

Reuben Cohn-Gordon

My experience improving monad-bayes, the probabilistic programming language package, as a Tweag fellow.

Soft k-means clustering with Chainsail

We recently released a beta version of a web service called Chainsail that helps with sampling multimodal probability distributions (see our Chainsail announcement blog post for a quick introduction to this tool). In parallel, this post aims to illustrate a use…

Chainsail: sampling multimodal distributions made easy

Tweag announces Chainsail, a simple-to-use web service for better sampling of multimodal distributions with a scalable and auto-tuning Replica Exchange algorithm at its core.

Reproducible probabilistic programming environments

How to get reproducible development environments for probabilistic programming packages such as PyMC3, Theano or TensorFlow using Nix.

17 November 2021

Safe Sparkle: a resource-safe interface with linear types

On the design choices behind a resource-safe interface for Sparkle using linear types.

30 September 2021

A higher-order integrator for Hamiltonian Monte Carlo

A discussion and benchmark of an alternative integrator for Hamiltonian Monte Carlo.

23 September 2021

Functional data pipelines with funflow2

Introducing a library for writing data pipelines that compose well and fail early.

28 October 2020

Markov chain Monte Carlo Sampling (4)

Simeon Carstens

In the final post of Tweag's four-part series, we discuss Replica Exchange, a powerful MCMC algorithm designed to improve sampling from multimodal distributions. An illustrative example and, as always, an interactive Python notebook with easy-to-modify code lead to an intuitive understanding and invite experimentation.

23 September 2020

Announcing Lagoon

Meet Lagoon, a new open source tool for centralizing and querying semi-structured datasets.

Markov Chain Monte Carlo Sampling (3)

Simeon Carstens

Learn about Hamiltonian Monte Carlo, and how to implement it from scratch.

26 February 2020

Probabilistic Programming with monad‑bayes (3)

In this blog post series, we're going to lead you through Bayesian modeling in Haskell with the monad-bayes library. In the third part of the series, we set up a simple Bayesian neural network.

Markov chain Monte Carlo Sampling (2)

Simeon Carstens

In this second post of Tweag's four-part series, we discuss Gibbs sampling, an important MCMC-related algorithm which can be advantageous when sampling from multivariate distributions. Two different examples and, again, an interactive Python notebook illustrate use cases and the issue of heavily correlated samples.

8 November 2019

Probabilistic Programming with monad‑bayes (2)

Here's Part 2 of Tweag's series about Bayesian modeling in Haskell with the monad-bayes library.

30 October 2019

Porcupine: Announcing First Release

We're happy to announce the first release of Porcupine, an open source framework to express portable and customizable data pipelines.

25 October 2019

Markov chain Monte Carlo Sampling (1)

Simeon Carstens

In this first post of Tweag's four-part series on Markov chain Monte Carlo sampling algorithms, you will learn about why and when to use them and the theoretical underpinnings of this powerful class of sampling methods. We discuss the famous Metropolis-Hastings algorithm and give an intuition on the choice of its free parameters. Interactive Python notebooks invite you to play around with MCMC yourself and thus deepen your understanding of the Metropolis-Hastings algorithm.

20 September 2019

Probabilistic Programming with monad‑bayes (1)

In this blog post series, we're going to lead you through Bayesian modeling in Haskell with the monad-bayes library. In the first part of the series, we introduce two fundamental concepts of `monad-bayes`: `sampling` and `scoring`.

Code Line Patterns

We visualize large collections of Haskell and Python source code as 2D maps using methods from Natural Language Processing (NLP) and dimensionality reduction, and find a surprisingly rich structure for both languages. Clustering on the 2D maps allows us to identify common patterns in source code which give rise to these structures. Finally, we discuss this first analysis in the context of advanced machine learning-based tools performing automatic code refactoring and code completion.

Revelations from repetition

Every day we write repetitive code. A lot of it is boilerplate that you write only to satisfy your compiler/interpreter. But how do languages differ in their boilerplate content? We explore this question using data sets of Python and Haskell code.

The Sneakernet: Towards A Much Faster Internet

Matthias Meschede

Inspired by the Event Horizon Telescope images, we present a quick exploratory study of the future possibilities of a technology called the Sneakernet: Could massive data transfer give new life to the homing pigeon industry? How about using means of transportation that are optimized to carry incredible amounts of weight? Or ones that are designed to be as fast as a bullet?

28 February 2019

Declarative, Reproducible Jupyter Environments

Millions of Jupyter notebooks are spread over the internet: machine learning, astrophysics, biology, economics, you name it. What a great age for reproducible science! Or that's what you think until you try to actually run these notebooks. Then you realize that having understandable high-level code alone is not enough to reproduce something on a computer. JupyterWith is a solution to this problem.

6 February 2019

Mapping a Universe of Open Source Software

Matthias Meschede

The repositories of distributions such as Debian and Nixpkgs are among the largest collections of open source (and some unfree) software. They are complex systems that connect and organize many interdependent packages. In this blog post I'll try to shed some light on them from the perspective of Nixpkgs, mostly with visualizations of its complete dependency graph.

23 January 2019

Harnessing the Power of Haskell in JupyterLab

Haskell and data science, at first sight a great match: native function composition, lazy evaluation, fast execution times, and lots of code checks. These sound like ingredients for scalable, production-ready data transformation pipelines. What is missing then? Why…

Haskell compute PaaS with sparkle

Maintaining a compute cluster for batch or stream data processing is hard work. Connecting it up to storage facilities and time sharing resources across multiple demands even more so. Fortunately cloud service providers these days typically upscale their offering to not just…

25 February 2016

Haskell meets large scale distributed analytics

Large scale distributed applications are complex: there are effects at scale that matter far more than when your application is basking in the warmth of a single machine. Messages between any two processes may or may not make it to their final destination. If reading from a memory…


© 2025 Modus Create, LLC