
Python Monorepo: an Example. Part 2: A Simple CI

13 July 2023 — by Guillaume Desforges, Clément Hurlin

For a software team to be successful, you need excellent communication. That is why we want to build systems that foster cross-team communication, and using a monorepo is an excellent way to do that. However, designing a monorepo can be challenging as it impacts the development workflow of all engineers and comes with its own scaling challenges. Special care for tooling is required for a monorepo to stay performant as a team grows.

In our previous article, we described our choice of structure and tools to bootstrap a Python monorepo. In this post, we continue by describing the continuous integration system (CI). We made a GitHub template which you can use to bootstrap your own monorepo. Check it out on Tweag’s GitHub organization in the python-monorepo-example repository.

We exemplify our CI in the context of a GitHub repository. Our solution, however, is not GitHub-specific by any means: GitLab or Jenkins could also work with the same approach.

Before diving into the details, we would like to acknowledge the support we had from our client Kaiko, for which we did most of the work described in this series of blog posts.

Goals for Our CI

Teams use a CI pipeline to ensure a high level of quality in their work. Upon changes to the code, the CI runs a list of commands to check quality. In our case we want to:

  • check that the code is well formatted;
  • lint the code, i.e., check that it adheres to some standards;
  • check the typing;
  • run tests.

Because our series is focusing on bootstrapping a monorepo for an early startup (from day 0 of a startup until approximately the end of the first year), we don’t describe anything fancy or complicated.

We aimed for a simple system that achieves a great deal of reproducibility and modularity, but doesn’t require any DevOps skill. As such, it is amenable to the early months of a tech startup, when a handful of non-specialist engineers lay out the foundations.

Structuring the Workflows

Making a CI pipeline for a repository with a single Python package is usually quite simple. All steps are run from the root of a single Python package folder, so it is easy to install its dependencies in a single environment and run any command we want. In the case of a monorepo where there are many packages to process, we can’t share a single environment for all packages as their dependencies could conflict.

One could build a CI pipeline for each of the packages that needs to be checked, but not all steps are “equal”. Checking formatting, for instance, runs quickly on a large code base and does not require any of the packages’ dependencies, whereas checking imports or typing requires dependencies to be installed.

To that end, we distinguish two types of CI pipelines:

  • A global CI, that runs in the repository’s top-level folder.
  • Zero or more instances of a local CI, each instance running in a package’s folder.

The Global CI

When a Pull Request triggers the CI, this global pipeline executes once for the whole repository. It completes quickly, giving fast feedback for the most common issues.

It needs the development dependencies from ./dev-requirements.txt installed (at the top-level), but none of the dependencies of the code it checks.
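
To give an idea, a plausible dev-requirements.txt for the tools used in this post pins the formatters, linters, type-checker, and test runner exactly (the versions below are illustrative, not necessarily the ones from the template):

black==23.3.0
flake8==6.0.0
isort==5.12.0
pylint==2.17.4
pyright==1.1.316
pytest==7.4.0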

The global CI is declared in .github/workflows/top_level.yaml.

---
name: Top-level CI
on:
  workflow_dispatch:
  pull_request:
  push:
    branches: main # Comment this line if you want to test the CI before opening a PR

jobs:
  ci-global:
    runs-on: ubuntu-22.04
    timeout-minutes: 10

    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Install Python
        uses: actions/setup-python@v3
        timeout-minutes: 5
        with:
          python-version-file: .python-version
          cache: "pip"
          cache-dependency-path: |
            pip-requirements.txt
            dev-requirements.txt

      - name: Install Python dependencies
        run: |
          pip install -r pip-requirements.txt
          pip install -r dev-requirements.txt

      - name: Format Python imports
        run: |
          isort --check-only $(git ls-files "*.py")

      - name: Format Python
        run: |
          black --check $(git ls-files "*.py")

      - name: Lint Python
        run: |
          flake8 $(git ls-files "*.py")

      # Are all public symbols documented? (see pyproject.toml configuration)
      - name: Lint Python doc
        run: |
          pylint $(git ls-files "*.py")

A few notes about the pipeline above:

  1. We assume a fixed version of Python, which is stored in the top-level .python-version file. Some tools can pick it up, making it easier to create reproducible environments, and it avoids hardcoding and duplicating the version in the pipeline file. This file is also consumed by pyenv, which we recommend for managing the Python interpreter instead of relying on a system install.

    Later on, this mechanism can be generalized to multiple Python versions using GitHub matrices; we don’t delve into this level of detail in this post, but a minimal sketch is given after these notes.

  2. We use an exact version of Ubuntu (ubuntu-22.04), instead of using a moving version like ubuntu-latest. This improves reproducibility and avoids surprises¹ when GitHub updates the version pointed to by ubuntu-latest.

  3. We protect the entire pipeline from running too long by using timeout-minutes. On multiple occasions, we have seen pipelines get stuck and consume minutes until the default timeout is reached. The default timeout is 360 minutes! We’d rather set a timeout to avoid wasting money.

  4. Dependencies are cached, thanks to the cache and cache-dependency-path entries of the setup-python action. This makes the pipeline noticeably faster when dependencies are unchanged.
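
As mentioned in the first note, the global pipeline can later be generalized to several Python versions with a GitHub matrix. A minimal sketch, assuming the list of versions is hardcoded in the workflow rather than read from .python-version, could look like this:

jobs:
  ci-global:
    runs-on: ubuntu-22.04
    timeout-minutes: 10
    strategy:
      matrix:
        python-version: ["3.10", "3.11"]

    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Install Python
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"
          cache-dependency-path: |
            pip-requirements.txt
            dev-requirements.txt

      # ... the remaining steps are unchanged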

Finally, note that the global pipeline can be tested locally by developers, by using act as follows:

act -j ci-global

The Local CI

The local CI’s pipeline executes from a project or library folder. This pipeline executes in a context where the dependencies of the concerned project or library are installed. Because of this, it is usually slower than the global pipeline. To mitigate this, we will make sure this pipeline runs only when it is relevant to do so, as explained in the triggers section. When a Pull Request triggers the CI to run, the local pipeline executes zero or more times, depending on the list of files being changed. The multiple runs (if any) all start from different subfolders of the monorepo and don’t overlap.

Because the monorepo contains many Python packages that are all type-checked and tested similarly, we can follow the DRY² principle and share the same definition of the CI pipelines.

This is possible in GitHub thanks to reusable workflows:

---
name: Reusable Python CI

on:
  workflow_call:
    inputs:
      working-directory:
        required: true
        type: string
      install-packages:
        description: "Space-separated list of packages to install using apt-get."
        default: ""
        type: string
      # To avoid being billed 360 minutes if a step does not terminate
      # (we've seen the setup-python step below do so!)
      ci-timeout:
        description: "The timeout of the ci job. The default is 25min"
        default: 25
        type: number

jobs:
  ci-local-py-template:
    runs-on: ubuntu-22.04
    timeout-minutes: ${{ inputs.ci-timeout }}

    defaults:
      run:
        working-directory: ${{ inputs.working-directory }}

    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Setup Python
        uses: actions/setup-python@v4
        timeout-minutes: 5 # Fail fast to minimize billing if this step freezes (it happened!)
        with:
          python-version-file: ${{ github.workspace }}/.python-version
          cache: "pip"
          cache-dependency-path: |
            dev-requirements.txt
            pip-requirements.txt
            ${{ inputs.working-directory }}/requirements.txt

      - name: Install extra packages
        if: ${{ inputs.install-packages != ''}}
        run: |
          sudo apt-get install -y ${{ inputs.install-packages }}

      - name: Install dependencies
        run: |
          pip install -r ${{ github.workspace }}/pip-requirements.txt
          pip install -r ${{ github.workspace }}/dev-requirements.txt -r requirements.txt

      - name: Typechecking
        run: |
          pyright $(git ls-files "*.py")

      - name: Test
        run: |
          python3 -m pytest tests/  # Assume that tests are in folder "tests/"

This pipeline is used by all Python packages. This is possible because they share the same structure, outlined in the first post of this series:

  1. All packages share the same development dependencies (for linting, formatting and testing), defined in the top-level dev-requirements.txt file.
  2. All packages have their own dependencies in a requirements.txt file in their own folder.
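
For instance, a package needing third-party libraries would pin them in its own requirements.txt; the contents below are purely hypothetical:

# libs/fancy/requirements.txt (hypothetical contents)
numpy==1.26.4
requests==2.31.0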

Sometimes, a Python package in our monorepo may need specific system-wide dependencies, for instance CUDA or libffmpeg. The install-packages parameter allows the pipeline of a specific library to install additional system packages via apt.³
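
For instance, a library that needs ffmpeg (a hypothetical example) would pass it through the install-packages input when calling the reusable workflow; the full caller workflow, with its triggers, is shown in the next section:

jobs:
  ci-libs-fancy:
    uses: ./.github/workflows/ci_python_reusable.yml
    with:
      working-directory: libs/fancy
      install-packages: "ffmpeg" # hypothetical system dependency
    secrets: inherit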

For now, we have defined a reusable workflow. But this is not yet enough: we need to actually run it! In the next section, we show how to trigger this pipeline for each library.

Triggers

In order to use the reusable workflow, we need to trigger it for every Python package that we want to check. However, we do not want to trigger the pipelines for all Python packages in the monorepo on every small change. Instead, we want to check the Python packages that are impacted by the changes of a Pull Request. This is important to make sure that the monorepo setup can scale as more and more Python packages are created.

GitHub has the perfect mechanism for this: the paths keyword, which can be specified in a workflow’s trigger rules. With this specification, a workflow is triggered if and only if at least one file or directory changed by the Pull Request matches one of the expressions listed under paths; if none matches, the pipeline is not started at all.

As in the first post of this series, suppose the monorepo contains two libraries, named base and fancy:

...see the first post...
├── dev-requirements.txt
├── pip-requirements.txt
├── pyproject.toml
└── libs/
    ├── base/
    │   ├── README.md
    │   ├── pyproject.toml
    │   └── requirements.txt
    └── fancy/
        ├── README.md
        ├── pyproject.toml
        └── requirements.txt

Then the pipeline for the fancy library is as follows:

---
name: CI libs/fancy

on:
  pull_request:
    paths:
      - "dev-requirements.txt"
      - "pip-requirements.txt"
      - ".github/workflows/ci_python_reusable.yml"
      - ".github/workflows/ci_fancy.yml"
      - "libs/base/**" # libs/fancy depends on libs/base
      - "libs/fancy/**"
  workflow_dispatch: # Allows triggering the workflow manually in the GitHub UI

jobs:
  ci-libs-fancy:
    uses: ./.github/workflows/ci_python_reusable.yml
    with:
      working-directory: libs/fancy
    secrets: inherit

This pipeline runs if there are changes to any of the following:

  1. the top-level files dev-requirements.txt and pip-requirements.txt;
  2. the pipeline’s files;
  3. the code the library depends on, i.e. libs/base;
  4. the library’s own code, i.e., libs/fancy.

Just like the global pipeline, a local pipeline can be executed locally by a developer using act.

For example, the pipeline above is executed locally by running:

act -j ci-libs-fancy

And One Template to Rule Them All

We advise using a template to automate the creation of new Python packages in the monorepo, for two main reasons:

  • To save time for developers, by providing a simple means to do a complex task. Not everyone is familiar with the ins and outs of a Python package’s scaffolding, yet this should not prevent anyone from creating one.
  • To keep all Python packages consistent and principled. In the early days of an organization, when there are not many examples to copy/paste, things tend to grow organically. In this scenario, there is a high risk of incompatibilities being introduced.

In addition, in the long run, having consistent libraries will help introduce global changes. This is important in the early days of a startup, as the technological choices and solutions are often changing.

We chose cookiecutter to make and use templates because it can be installed with pip and is known to most seasoned Python developers. It uses Jinja2 templating to render files, both their file names and their content.

Our template is structured as follows:

{{cookiecutter.module_name}}/
├── mycorp
│   └── {{cookiecutter.module_name}}
│       └── __init__.py
├── pyproject.toml
├── README.md
├── requirements.txt
├── setup.cfg
└── tests
    ├── conftest.py
    └── test_example.py
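
The variables appearing in the template ({{cookiecutter.module_name}}, etc.) are declared, along with their default values, in a cookiecutter.json file at the root of the template. A minimal sketch, with made-up defaults, could be:

{
  "module_name": "my_library",
  "package_name": "my-library",
  "package_description": "A short description of the library.",
  "owner_full_name": "Jane Doe",
  "owner_email_address": "jane.doe@mycorp.com"
}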

For example, the pyproject.toml template looks like this:

[tool.poetry]
name = "mycorp-{{ cookiecutter.package_name }}"
version = "0.0.1"
description = "{{ cookiecutter.package_description }}"
authors = [
    "{{ cookiecutter.owner_full_name }} <{{ cookiecutter.owner_email_address }}>"
]

The template also uses a hook to generate the new library’s instance of the CI, i.e., the second YAML file described in the Triggers section above. This is convenient, as this file is mostly boilerplate that does not need to be adapted.
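
In cookiecutter, such hooks live in a hooks/ folder next to cookiecutter.json; a hooks/post_gen_project.py script is rendered by Jinja2 and then executed from the freshly generated package’s folder. A minimal sketch of a hook that writes the library’s CI trigger file, assuming the template is instantiated inside libs/ and using illustrative file names (not necessarily the template’s actual ones), could be:

# hooks/post_gen_project.py
# Rendered by Jinja2 (so the cookiecutter variable below is substituted),
# then executed from the generated package's folder.
from pathlib import Path

NAME = "{{ cookiecutter.module_name }}"

WORKFLOW = f"""---
name: CI libs/{NAME}

on:
  pull_request:
    paths:
      - "dev-requirements.txt"
      - "pip-requirements.txt"
      - ".github/workflows/ci_python_reusable.yml"
      - ".github/workflows/ci_{NAME}.yml"
      - "libs/{NAME}/**"
  workflow_dispatch:

jobs:
  ci-libs-{NAME}:
    uses: ./.github/workflows/ci_python_reusable.yml
    with:
      working-directory: libs/{NAME}
    secrets: inherit
"""

def main() -> None:
    # Assumes the template was instantiated in libs/, so the repository
    # root is two levels above the generated folder libs/<module_name>.
    repo_root = Path.cwd().parent.parent
    workflow_file = repo_root / ".github" / "workflows" / f"ci_{NAME}.yml"
    workflow_file.write_text(WORKFLOW)
    print(f"Generated {workflow_file}")

if __name__ == "__main__":
    main()

A developer would then create a new library with something like cookiecutter templates/python-library --output-dir libs (the template path is made up), answer the prompts, and get both the package scaffolding and its CI trigger file.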

While this may seem like a minor thing, this template proved to foster the adoption of shared common practices, be it at the level of the Python configuration (pyproject.toml, requirements.txt) or at the level of the CI (hey, it’s automatically adopted!).

We like to compare the adoption of shared practices to excellent UI design. Any UI designer would tell you it takes only a minor glitch or a minor pain point in a UI to deter users from adopting a new tool or application. We believe the same holds for best practices: making things easier for developers is critical for the adoption of a new tool or new workflow.

Possible Improvements

A few quality-of-life improvements are possible to maintain uniformity and augment automation:

  1. The CI can be augmented to check that all pyproject.toml files share the same values for some entries. For example, the values in the [build-system] section should be the same in all pyproject.toml files:

    [build-system]
    requires = ["poetry-core>=1.0.0"]
    build-backend = "poetry.core.masonry.api"

    The same goes for the configuration of the formatter ([tool.black] section), the linter ([tool.pylint] section) and the type-checker ([tool.pyright] section).

    Such a checker can be implemented using the toml library; a minimal sketch is given after this list.

  2. In a similar vein, the CI can be augmented to check that the definitions of the local pipelines stay consistent with one another.

    This check could be avoided if we were using a tool dedicated to monorepo CIs, Pants for example, which has good Python support.⁴ This is one topic where our setup is a trade-off between simplicity (GitHub Actions only) and efficiency (some pipeline duplication, mitigated by additional checks).

  3. We recommend using a top-level CODEOWNERS file to automatically select reviewers on Pull Requests. A CODEOWNERS file maps paths to GitHub handles; an example is given after this list. When a Pull Request is created, if the repository is configured appropriately, GitHub goes over the list of paths changed by the PR, finds the handle matching each path, and adds it to the list of reviewers of the Pull Request.

    In this manner, Pull Request authors don’t have to select reviewers manually, which saves them time and makes clear who owns what in the codebase.

    One can additionally require approvals from owners to merge, but in a startup that is scaling up we don’t recommend it: requiring approval from owners can slow things down when people are unavailable, as the teams are usually too small to offer 365 days of availability on every part of the codebase.

  4. We recommend using a mergebot like Kodiak or Mergify. In our experience, mergebots are like chocolate: once you’ve tasted them, it’s really hard not to love them.

    In practice, they save valuable time for developers by taking over the boring job of rebasing when concurrent Pull Requests compete to merge. If the codebase has good test coverage, developers can assign a Pull Request to the bot and forget about it, knowing the bot will take care of eventually merging it without compromising quality.
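
Coming back to the first improvement above, a minimal sketch of such a uniformity checker using the toml library (the glob pattern and the set of checked sections are assumptions, not part of the template) could be:

# check_pyproject_uniformity.py
# Fails if selected sections differ between the monorepo's pyproject.toml files.
import sys
from pathlib import Path

import toml  # pip install toml

# Sections expected to be identical in every package (an assumption of this sketch).
SHARED_SECTIONS = ["build-system", "tool.black", "tool.pylint", "tool.pyright"]

def get_section(config: dict, dotted_key: str):
    """Return the nested section at 'dotted_key', or None if it is absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def main() -> int:
    files = sorted(Path(".").glob("libs/*/pyproject.toml"))
    if len(files) < 2:
        return 0  # Nothing to compare
    reference_file, *others = files
    reference = toml.loads(reference_file.read_text())
    exit_code = 0
    for file in others:
        config = toml.loads(file.read_text())
        for section in SHARED_SECTIONS:
            if get_section(config, section) != get_section(reference, section):
                print(f"{file}: section [{section}] differs from {reference_file}")
                exit_code = 1
    return exit_code

if __name__ == "__main__":
    sys.exit(main())

As for the third improvement, a CODEOWNERS file lives at the repository’s top level (or in .github/) and simply maps path patterns to handles; the handles below are made up:

# CODEOWNERS
dev-requirements.txt  @mycorp/platform-team
pip-requirements.txt  @mycorp/platform-team
libs/base/            @mycorp/platform-team
libs/fancy/           @alice @bob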

Conclusion

We have presented a continuous integration (CI) system for a monorepo that strikes a good balance between being easy to use and being featureful.

Notable features of this CI include a clear separation between fast repo-wide jobs and slower library-specific jobs. This CI is modular thanks to paths triggers and reusable workflows, and relatively fast thanks to caching. Finally, developers can jump into the monorepo and start new projects easily thanks to the templates it provides.

This CI is simple enough for the early months of a startup when CI specialists are not yet available. Nonetheless, it paves the way for a shared culture that fosters quality and, we believe, delivers faster.


  1. You still have something to do when using an exact version of a runner: you need to change it when GitHub deprecates and ultimately removes it. But as opposed to things breaking when a latest runner is updated, you get a notification from GitHub before it happens, giving you time to plan ahead. If you know it beforehand, it’s not a surprise anymore, right?
  2. “Don’t Repeat Yourself”
  3. Generally speaking, apt is subpar for reproducibility. However, it does fit well for an early stage CI that wants to keep it simple. If reproducibility starts being an issue here, Nix could enter the picture.
  4. We don’t demonstrate the usage of Pants for this monorepo, because Pants works better with a shared Python environment (one single requirements.txt file at the top-level of the repository), whereas our client Kaiko had business incentives to work with an environment per Python package (as described in the first post of this series). If we had been in a global sandbox scenario, Pants would be the obvious choice for preparing this monorepo for scaling, once a CI specialist has been hired for example. Pants is our favorite pick here because our monorepo is pure Python and Pants’ Python support is excellent.
About the authors

Guillaume Desforges is a versatile engineer based in Paris, with fluency in machine learning, data engineering, web development and functional programming.

Clément Hurlin is a Senior Software Engineer who straddles the management/engineer boundary. Clément studied Computer Science at Telecom Nancy and received his PhD from Université Nice Sophia Antipolis, where he proved multithreaded programs using linear logic.


This article is licensed under a Creative Commons Attribution 4.0 International license.
