Tweag

Behind the scenes with FawltyDeps v0.13.0: Matching imports with dependencies

21 September 2023 — by Johan Herland, Nour El Mawass, Maria Knorps, Zhihan Zhang

We have previously introduced FawltyDeps, a tool that helps Python projects avoid the dreaded and seemingly unavoidable state in which the dependencies declared in the configuration do not match those actually imported in the code [1]. FawltyDeps is the perfect addition to your CI, your pre-commit hooks, or your dependency management arsenal.

Curious to know how FawltyDeps works its magic? In this sequel we’ll delve into an essential component of FawltyDeps: how it matches imports and dependencies behind the scenes, and why it is important to get this matching right.

We’ve been busy working on an improved mapping strategy that combines versatility with simplicity, and we have come a long way from the quite limited version we presented in our first announcement. By the end of this post, you’ll have a solid understanding of FawltyDeps’ brand new mapping options and how to tailor them to your project’s unique context and needs.

Matching imports and dependencies

Simply put, FawltyDeps extracts the imports from your code and the dependencies declared in your project configuration, and matches them against each other:

  • the imports that are not present in your declared dependencies are reported as undeclared dependencies
  • the declared dependencies that are not imported in your code are reported as unused dependencies.
Figure 1. An illustration of extracting imports and dependencies in a Python project and matching them to each other.

When matching imports and dependencies, we first assume that a dependency (specifically: the package it references) and an import have the same name. This approximation works well for many Python packages. numpy is a good example: in your code, you write import numpy, and to install it you run pip install numpy, or you list numpy in your requirements.txt (or wherever you list your project dependencies).

Problem solved! So why are we even writing this post?

It turns out that, as always, things are not that simple™. Many packages provide import names that are different from the package name. For example:

  • You depend on the pyyaml package, but you import yaml (as seen in Figure 1).
  • You depend on the scikit-learn package, but you import sklearn.
  • You depend on the setuptools package, but you import pkg_resources or one of the other modules it provides, since setuptools exposes multiple import names.
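To make the problem concrete, here is a minimal sketch of what matching by name alone amounts to. It is not FawltyDeps' actual code: the two sets are hand-written stand-ins for the imports and declared dependencies that would be extracted from a project.

```python
# Hand-written stand-ins for what would be extracted from a project.
imports = {"numpy", "yaml", "sklearn"}
declared = {"numpy", "pyyaml", "scikit-learn"}

# Matching by name alone is just a pair of set differences.
undeclared = imports - declared  # imported, but no dependency of that name
unused = declared - imports      # declared, but no import of that name

print(sorted(undeclared))  # ['sklearn', 'yaml']
print(sorted(unused))      # ['pyyaml', 'scikit-learn']
```

Only numpy is matched correctly. The other two packages are each reported twice, once as an undeclared import and once as an unused dependency, even though the project declares everything it uses.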

Clearly our first approximation (hereafter referred to as the identity mapping) is not good enough. To solve this, we need a smarter mapping: a way to figure out which packages correspond to which imports. In practice, there are a few different ways to acquire these mappings, each having its advantages and limitations. Our main goal here is to lay out the mappings we support in FawltyDeps, and explain how they can be used individually or together to resolve packages into their respective imports.

Mapping from already-installed packages

Arguably, the only correct way for FawltyDeps to match packages to imports is to actually ask each package what imports it provides. FawltyDeps can do this [2], but it first needs to find where the packages are installed, and that turns out to be more complicated than one might think.

In the first versions of FawltyDeps, we had not yet properly drilled into this issue. Instead, we only looked at the Python environment in which FawltyDeps itself was already running, and we simply assumed that your project dependencies should be installed into the same environment [3]. If a dependency of your project was not found in this environment, we would fall back to the identity mapping.

This meant simply pushing the problem onto the user, however, and making FawltyDeps harder to use. What we wanted instead was for FawltyDeps to resolve the dependencies wherever they may be installed. This is where things can get very complicated: In general, there is a bewildering variety of ways to install dependencies in the Python world.


We are not going to open the entire Pandora’s box of Python packaging and dependency management in this blog post, except to note some examples of where Python packages (specifically: your project’s 3rd-party dependencies) can typically be found:

  • System-wide package locations, like those found under /usr/lib/python* or /usr/local/lib/python* (whether installed by your system’s package manager or system-wide pip install).
  • User-specific packages, installed by tools like pipx install or pip install --user.
  • Virtual environments (from venv, virtualenv, Poetry, PDM, etc.), located either within your project, or somewhere else.
  • Other, less common, methods or locations [4] that resemble any of the above.

We would like FawltyDeps to work with as many of these as possible and, wherever possible, to discover and use them automatically by default.

As of v0.13.0 we have come a long way towards realizing this vision: We support the kinds of Python environments mentioned above (for FawltyDeps’ purpose, a “Python environment” really means any directory in which Python packages could be installed), and the following diagram outlines how FawltyDeps determines which Python environments are used to look up the project’s dependencies:

finding local pyenvs
Figure 2. FawltyDeps’ strategy for finding local Python environments

In other words:

  • The --pyenv option lets you point to one or more Python environments. All of these environments will be used when matching dependencies to imports.
  • If --pyenv is not used, FawltyDeps will automatically find and use Python environments that exist within your project directories (i.e. within any directory that is passed as a positional argument to FawltyDeps, a.k.a. “basepath”, or the current directory by default).
  • If no Python environment is found by the two methods above, FawltyDeps will fall back to using the environment in which it’s running.
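The precedence described above can be sketched in a few lines. This is only an approximation under stated assumptions: it recognizes virtual environments by their pyvenv.cfg marker file (FawltyDeps' real detection may use other criteria and covers more kinds of environments), it uses sys.path as a stand-in for "the environment in which it's running", and the function names find_venvs and select_environments are hypothetical.

```python
import sys
import tempfile
from pathlib import Path


def find_venvs(basepath):
    """Find directories under basepath that contain a pyvenv.cfg marker file."""
    return sorted(cfg.parent for cfg in Path(basepath).rglob("pyvenv.cfg"))


def select_environments(explicit_pyenvs, basepath):
    """Return the places to search for installed packages, in order of precedence."""
    if explicit_pyenvs:
        # 1. --pyenv was given: use exactly those environments.
        return [Path(p) for p in explicit_pyenvs]
    discovered = find_venvs(basepath)
    if discovered:
        # 2. Otherwise, use the environments found inside the project.
        return discovered
    # 3. Otherwise, fall back to the environment we are running in.
    return [Path(p) for p in sys.path if p]


# Build a throwaway project containing something that looks like a venv.
with tempfile.TemporaryDirectory() as project:
    venv_dir = Path(project) / ".venv"
    venv_dir.mkdir()
    (venv_dir / "pyvenv.cfg").write_text("home = /usr/bin\n")

    auto = select_environments([], project)
    explicit = select_environments(["other/venv"], project)
    print(auto == [venv_dir])                # True
    print(explicit == [Path("other/venv")])  # True
```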

There is still some way to go until all the details are perfect here [5], but we believe this approach covers most common cases well.

Temporarily installing dependencies to complete the mapping

There is an elephant in the room that we have not yet talked about: Sometimes you may be running FawltyDeps on a project where the project dependencies are not installed at all! Then what can you do? (Assuming that you don’t want to go through the bother of installing packages manually.) Until recently FawltyDeps would simply fall back to the identity mapping for any packages that it could not find locally, with the undeclared/unused report provided by FawltyDeps suffering as a result.

With the new --install-deps option introduced in v0.13.0, we are now able to provide a better alternative. With this option, FawltyDeps will not fall back to the identity mapping; instead, it will automatically use pip install to install the unresolved dependencies (from PyPI, by default [6]) into a temporary virtualenv [7], and it will then use this as an additional source for the dependency-to-import mapping. For dependencies that are not found locally, this allows FawltyDeps to come up with the correct mapping (and hence produce a much better undeclared/unused report) rather than relying on the imperfect identity mapping.
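The general technique can be sketched with the standard library's venv, tempfile and subprocess modules. This is an illustration of the idea, not FawltyDeps' implementation; the helper names (venv_python, pip_install_command, with_temporary_install) are hypothetical, and the install step needs network access when it is actually run.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(venv_dir):
    """Path to the interpreter inside a virtualenv (the layout differs on Windows)."""
    if sys.platform == "win32":
        return Path(venv_dir) / "Scripts" / "python.exe"
    return Path(venv_dir) / "bin" / "python"


def pip_install_command(venv_dir, requirements):
    """Build the pip command that installs the requirements into venv_dir."""
    return [str(venv_python(venv_dir)), "-m", "pip", "install", *requirements]


def with_temporary_install(requirements, callback):
    """Install requirements into a throwaway venv, then call callback(venv_dir)."""
    with tempfile.TemporaryDirectory() as tmp:
        venv.create(tmp, with_pip=True)
        # check=True raises CalledProcessError if any package fails to install.
        subprocess.run(pip_install_command(tmp, requirements), check=True)
        # The venv is deleted as soon as the callback returns.
        return callback(Path(tmp))


# Only build the command here; running it would download from the package index.
command = pip_install_command("demo-venv", ["pyyaml"])
print(command[1:])  # ['-m', 'pip', 'install', 'pyyaml']
```

The callback is where the temporary environment would be inspected for the import names its packages provide, before the directory is cleaned up.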

Since this is a potentially expensive strategy we have chosen to hide it behind the --install-deps command-line option. If you want to always enable this option, you can set the corresponding install_deps configuration variable to true in the [tool.fawltydeps] section of your pyproject.toml.

Note that there is no guarantee that we’re able to resolve all dependencies with this method: For example, there could be a typo in your declared dependency that means it will never be found on PyPI, or there could be other circumstances (e.g. network issues) that prevent this strategy from working at all. What happens with such unresolved dependencies will be covered below.

User-defined mapping

The mappings discussed above have FawltyDeps look into packages that are actually installed (whether in an existing local environment or temporarily by FawltyDeps). But this might not always be achievable in practice. You might want to run FawltyDeps in your CI, possibly on multiple libraries, without having to either set up a local environment or access packages from outside sources (like PyPI).

A simple solution to this is to provide FawltyDeps with your own custom mapping. [8] We have chosen not to ship any database with the code as it needs to be frequently updated, with no guarantee of it covering all Python packages. Instead, we allow users to provide their own custom TOML mapping. This mapping does not have to be complete and it can be used in conjunction with the other mappings discussed in this article. We talk more about how FawltyDeps combines different mappings in the following section.

Putting it together: FawltyDeps’ mapping strategy

Now that we have gathered all these mappings, let’s see how to best combine them.

Overall, we have three guiding principles in this endeavor:

  • Completeness: we should be able to resolve all dependencies extracted from a project into associated import names, as otherwise we cannot reach any conclusions about undeclared or unused dependencies.
  • Correctness: some mappings offer a higher level of correctness than others. Identity mapping, for example, is correct for many - but certainly not all - packages. Resolving a dependency via a locally installed package offers a higher guarantee of correctness.
  • Transparency: we should be able to trace back what mapping was used to resolve any given dependency. This allows users to discover where they may improve the information passed to FawltyDeps (e.g. using --pyenv to point at the most appropriate Python environments). It also makes it much easier for us to diagnose where FawltyDeps itself might be improved.

First, let’s start by repeating our available strategies:

  • Identity mapping: The simplest strategy, but also the worst. We would like to avoid using it as much as possible.
  • Looking at locally installed packages: Our best option in terms of correctness, but not always complete: sometimes we have to concede that not all dependencies are available in a local Python environment, so we still need a fallback strategy.
  • Installing packages (from PyPI) into a temporary virtualenv: The ultimate fallback solution, but quite heavy-weight, and not always suitable (e.g. in a restricted CI environment). Hence, we put this behavior behind the --install-deps option.
  • Custom/user-defined mapping: Allow the user to have the final say in how dependencies are mapped into imports. This strategy should override the other strategies, but we expect few users will want to go through the fuss of defining their own mapping, so we cannot rely on this being used commonly.

Now, we need to figure out how to combine these strategies in the best way.

We have chosen to organize them in the sequence shown in Figure 3 below. Each strategy, when given the name of a dependency, can either return a successful mapping of that dependency name (into a corresponding set of import names), or return nothing (when a dependency is not found by that strategy). Dependencies that are not resolved by a strategy are passed on to the next strategy in the sequence. Since a dependency is mapped by only one strategy (the first one that returns something), we need to organize our strategies in order of decreasing preference. In other words:

  • The user-defined mapping, when provided, should always override other mappings. It thus comes first in the sequence.
  • Next, we want to look at the locally installed packages.
  • Finally, if we have not been able to find the dependency in either of the above, we want to use a fallback strategy:
    • If the user has enabled --install-deps, we attempt to install packages (subject to pip configuration, but from PyPI by default). If any of these packages fail to install, we abort the entire process and raise an error, as we do not expect the user wants a further fallback to the inaccurate identity mapping.
    • Otherwise, our fallback is the identity mapping, that is, we assume any unresolved dependency points to a package (as yet unseen) that provides a single import of the same name. Although this strategy is always “successful” (in terms of mapping to an import name), it is crucially not always correct!

To illustrate:

Figure 3. The sequence of resolvers used by FawltyDeps
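The same sequence can be written down as a short function. This is a simplified sketch, not FawltyDeps' actual resolver code: each strategy is modelled as a plain dictionary, the temporary installation is represented by a pre-computed temp_installed dictionary instead of a real pip run, and the names resolve and UnresolvedError are hypothetical. Recording which strategy resolved each dependency reflects the transparency principle above.

```python
class UnresolvedError(Exception):
    """Raised when install_deps is on and a package could not be installed."""


def resolve(dependencies, custom, installed, install_deps=False, temp_installed=None):
    """Map each dependency to (import names, name of the strategy that resolved it)."""
    temp_installed = temp_installed or {}
    resolved = {}
    for dep in dependencies:
        if dep in custom:
            # 1. The user-defined mapping always wins.
            resolved[dep] = (set(custom[dep]), "custom")
        elif dep in installed:
            # 2. Then packages found in local Python environments.
            resolved[dep] = (set(installed[dep]), "installed")
        elif install_deps:
            # 3a. Temporary installation; no further fallback on failure.
            if dep not in temp_installed:
                raise UnresolvedError(dep)
            resolved[dep] = (set(temp_installed[dep]), "temporary install")
        else:
            # 3b. Identity mapping: always "succeeds", but may be wrong.
            resolved[dep] = ({dep}, "identity")
    return resolved


deps = ["numpy", "scikit-learn", "pyyaml"]
installed = {"numpy": {"numpy"}}
custom = {"scikit-learn": ["sklearn"]}

default = resolve(deps, custom, installed)
print(default["pyyaml"])  # ({'pyyaml'}, 'identity')

better = resolve(deps, custom, installed, install_deps=True,
                 temp_installed={"pyyaml": {"yaml"}})
print(better["pyyaml"])  # ({'yaml'}, 'temporary install')
```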

To bring this back into the overall context of FawltyDeps: once we have resolved the dependencies through the above mapping strategies, we now have an overall mapping of dependency names to provided import names, and this is the basis for the final report:

  • Any import found in the project that is not covered by any dependency is reported as an undeclared dependency.
  • Any dependency found to only provide imports that are never imported from anywhere is reported as a possibly unused dependency.
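Given such a mapping, the two rules above come down to a couple of set operations. The sketch below is a simplified illustration with hand-written data (the requests import is a made-up example of something the project forgot to declare), and the function name report is hypothetical. Note how a package that provides several imports counts as used as soon as any one of them is imported.

```python
def report(imports, resolved):
    """Return (undeclared imports, unused dependencies) for a resolved mapping."""
    imports = set(imports)
    provided = set().union(*resolved.values())
    undeclared = imports - provided
    unused = {dep for dep, names in resolved.items() if not names & imports}
    return undeclared, unused


# Hand-written mapping of dependency names to the import names they provide.
resolved = {
    "setuptools": {"setuptools", "pkg_resources"},
    "pyyaml": {"yaml"},
    "scikit-learn": {"sklearn"},
}
imports = {"pkg_resources", "yaml", "requests"}

undeclared, unused = report(imports, resolved)
print(sorted(undeclared))  # ['requests']
print(sorted(unused))      # ['scikit-learn']
```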

The table below provides a summary of the available mappings, sorted in the order FawltyDeps processes them, along with options to customize them.

| Priority | Mapping strategy | Options |
|----------|------------------|---------|
| 1 | User-defined mapping | Provide a custom mapping in TOML format via --custom-mapping-file or a [tool.fawltydeps.custom_mapping] section in pyproject.toml. Default: no custom mapping. |
| 2 | Mapping from installed packages | Point to one or more environments via --pyenv. Default: auto-discovery of Python environments under the project’s basepath. If none are found, default to the Python environment in which FawltyDeps itself is installed. |
| 3a | Mapping via temporary installation of packages | Activated with the --install-deps option. |
| 3b | Identity mapping | Active by default. Deactivated when --install-deps is used. |

Examples

This section dives into some practical scenarios. Suppose you have a simple requirements.txt file:

numpy>=1.25.0
scikit-learn
pyyaml

We assume that these packages are already imported in some_script.py as follows:

import numpy
import sklearn
import yaml

As we can see, our project has defined all its dependencies as it should, so FawltyDeps should ideally not report any problems. But let’s also assume that we’re running FawltyDeps in an incomplete environment - one where pyyaml is not installed - to see how this affects FawltyDeps.

Example 1: running with default options

When running with default options, like so:

fawltydeps

FawltyDeps will run through the default sequence of mappings, as shown in Figure 4:

Figure 4. A scenario where FawltyDeps resolves a requirements.txt file with default options

In particular:

  • No custom mapping is provided.
  • FawltyDeps automatically finds local environments or defaults to its own environment. In this example it finds scikit-learn and numpy in the local environment, and we can see that scikit-learn is correctly resolved to the sklearn import name.
  • Identity mapping is used to resolve any dependencies not resolved via previous mappers. In this example, pyyaml was not found above, and was therefore incorrectly resolved by the identity mapping to the import name pyyaml.

The resulting output from FawltyDeps is:

These imports appear to be undeclared dependencies:
- 'yaml'

These dependencies appear to be unused (i.e. not imported):
- 'pyyaml'

For a more verbose report re-run with the `--detailed` option.

This first example shows a common pitfall of the identity mapping. Next, we will see how --install-deps improves the situation:

Example 2: running with custom options

Let’s now take advantage of some advanced FawltyDeps options by running the following command:

fawltydeps --custom-mapping-file my_mapping.toml --pyenv venv --install-deps

Figure 5 shows the path FawltyDeps takes through the sequence of mappings:

Figure 5. A scenario in which a requirements.txt file is resolved with a customized mapping configuration.

In particular:

  • We provide a partial custom mapping (e.g. via --custom-mapping-file). In this example, the custom mapping is defined in my_mapping.toml:
    scikit-learn = ["sklearn"]
  • We point to a local virtual environment (with --pyenv) where some dependencies are installed. (In this example, only numpy is installed in venv.)
  • We pass --install-deps, to ask FawltyDeps to temporarily install and resolve any remaining dependencies.

FawltyDeps returns the following result:

No undeclared or unused dependencies detected.

As expected, FawltyDeps now returns a better result: the --install-deps option downloads the pyyaml package from PyPI and makes it available to the resolver, which can now map the yaml import to the correct pyyaml dependency declaration.

Customizing FawltyDeps’ mappers

These examples demonstrate two extremes and we expect most usage to fall somewhere in between.

With the --json flag, the resulting package-to-imports mapping is exposed in the output under the .resolved_deps key. Using a command like this:

fawltydeps --custom-mapping-file my_mapping.toml --pyenv venv --install-deps --json | jq .resolved_deps

you can see which mappings are used to resolve a package into a set of imports, and further iterate on the mapping options to help FawltyDeps perform its best on your codebase.

Conclusion

FawltyDeps has come a long way from the version we presented in our first announcement. While it was initially limited to resolving packages from its own environment and falling back to the identity mapping, it now supports arbitrary local environments and custom user mappings, and it can temporarily install and resolve packages on its own. On top of that, it can also automatically discover virtual environments inside the analyzed project. [9]

We strive to provide a default behavior that makes sense for most projects, and to offer a customizable yet simple interface for advanced users that wish to take control over the mapping process. We believe the result is a powerful tool that delivers a complete, correct and transparent matching of your project’s dependencies and imports.

As always, we would be happy to hear your feedback! Try out the latest version of FawltyDeps and reach out to us with any problems or questions on our GitHub repository.


  1. The recent publication of Computational reproducibility of Jupyter notebooks from biomedical publications highlights that missing dependencies are a frequent occurrence in repositories hosting scientific computational experiments and have a detrimental effect on reproducibility.

  2. We depend on functionality from the excellent importlib_metadata library to extract imports exposed in locally installed packages.

  3. This assumption was made no matter whether FawltyDeps was installed in a virtualenv or as part of the system-wide Python installation, and we only documented that FawltyDeps had to be installed into the same environment as your project dependencies. One example of where this did not work out well is when you installed FawltyDeps with pipx install fawltydeps: This makes fawltydeps available everywhere (via your $PATH), but pipx installs it into its own, separate, virtualenv that is isolated from your project, meaning that FawltyDeps would almost always fall back to the identity mapping, and yield poor results.

  4. Some less common locations of Python packages:

    • __pypackages__ directories (even though PEP 582 was recently rejected, these still occur in the wild).
    • Conda and other environment managers (not yet explicitly supported, although it’s on our radar).
    • Nix closures containing Python packages, like those produced by poetry2nix.

  5. One open issue is that FawltyDeps currently does not look at package versions. This usually does not cause problems in practice, but there are corner cases where it might: Consider, for example, a package_foo that used to provide two import names module_a and module_b, but starting from version 2, it only provides module_a. Now, if your project declares a dependency on package_foo>=2, but you still happen to import module_b in your code, this should be reported by FawltyDeps as an undeclared dependency (because you’re declaring a dependency on a version of package_foo where module_b no longer exists). However, if package_foo version 1 (not version 2) happens to be installed in your project’s environment, FawltyDeps will simply believe that package_foo (whichever version) provides both module_a and module_b, and the error won’t be flagged.

  6. To customize automatic installation (for example, to use a different package index), you can use pip’s environment variables.

  7. Note that the PyPI API does not currently expose imports of the hosted packages (see here and here for relevant discussions). Downloading and unpacking these packages is therefore necessary.

  8. Some tools rely on custom mappings. A notable example is the Pants build system, which relies on static mappings provided by the user. Another example is the pipreqs library, which keeps a static database mapping packages to the import names they expose.

  9. For completeness, here is an overview of the changes we’ve made to our mapping strategy over the last releases, and that together realize the picture presented in this blog post:

    • v0.7 introduces the --pyenv option to allow FawltyDeps to look up packages in a different Python environment than the one in which FawltyDeps is running.
    • v0.9 adds the user-defined mapping.
    • v0.10 adds support for __pypackages__ directories.
    • v0.11 introduces support for multiple --pyenv options.
    • v0.12 revamps our project traversal, allowing Python environments to be automatically found inside the project.
    • v0.13 introduces the --install-deps option allowing missing project dependencies to be mapped correctly instead of using the identity mapping.

About the authors
Johan Herland
Johan is a Developer Productivity Engineer at Tweag. Originally from Western Norway, he is currently based in Delft, NL, and enjoys this opportunity to discover the Netherlands and the rest of continental Europe. Johan has almost twenty years of industry experience, mostly working with Linux and open source software within the embedded realm. He has a passion for designing and implementing elegant and useful solutions to challenging problems, and is always looking for underlying root causes to the problems that face software developers today. Outside of work, he enjoys playing jazz piano and cycling.

Nour El Mawass
Nour is a data scientist/engineer who recently made the leap of faith from Academia to Industry. She has worked on Machine Learning, Data Science and Data Engineering problems in various domains. She has a PhD in Computer Science and currently lives in Paris, where she stubbornly tries to get her native Lebanese plants to live on her tiny Parisian balcony.

Maria Knorps
Maria, a mathematician turned Senior Data Engineer, excels at blending her analytical prowess and software development skills within the tech industry. Her role at Tweag is twofold: she is not only a key contributor to the innovative AI projects but also heavily involved in data engineering aspects, such as building robust data pipelines and ensuring data integrity. This skill set was honed through her transition from academic research in numerical modelling of turbulence to the realm of software development and data science.

Zhihan Zhang
Zhihan is a data scientist/engineer with expertise in Machine Learning. Holding a PhD in Geometric Topology and Probabilities, she transitioned from a pure mathematics background to apply her knowledge in industry. Currently at Tweag, Zhihan focuses on developing data engineering solutions and cloud deployment. Beyond her technical skills, she has a passion for music, arts, literature, and cooking.


This article is licensed under a Creative Commons Attribution 4.0 International license.

© 2024 Modus Create, LLC