Capturing computational environments

Last updated on 2025-04-15

Estimated time: 10 minutes

Overview

Questions

  • What is a package manager?
  • What are virtual environments?
  • How can we use them to capture information about a specific computational environment?

Objectives

  • Develop conceptual understanding of virtual environments

Now that you have a better idea of the challenges around computational reproducibility, let’s look at how Package Managers and Virtual Environments can be used together to capture the ‘packages’ layer of your project’s computational environment.

What is a Package Manager?


As the name suggests, a Package Manager is a tool used for adding, removing, upgrading, and keeping track of the packages installed for a particular piece of software (including programming languages). As part of this course we will be using Python, and so we will be using pip, the package manager that comes bundled with Python.

In the context of your code, the packages you install are also known as the dependencies for your code (i.e. your code depends on these packages being available to work). The packages you install will also have their own dependencies, and these dependencies may have their own dependencies. Generally this is not something you need to worry about, because as part of the installation process the package manager will work out the dependencies of all the packages that need to be installed (a process called dependency resolution), and then install them for you.

For example:

You want to install the pandas package. When you do, pip will see that one of the dependencies of pandas is the numpy package.
So when you install pandas, pip will also install numpy for you.
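
The idea of following dependencies of dependencies can be sketched in a few lines of Python. The dependency table below is a simplified, hypothetical stand-in for the metadata pip reads from each package (the real dependency lists vary between versions), and the function name is made up for illustration. Real dependency resolution also has to compare version constraints, which this sketch ignores.

```python
# Hypothetical, simplified dependency table (not real package metadata).
DEPENDENCIES = {
    "pandas": ["numpy", "python-dateutil"],
    "python-dateutil": ["six"],
    "numpy": [],
    "six": [],
}


def packages_to_install(package, table):
    """Return the package plus everything it depends on, directly or indirectly."""
    to_install = []
    stack = [package]
    while stack:
        current = stack.pop()
        if current in to_install:
            continue  # already handled via another dependency
        to_install.append(current)
        # Queue up this package's own dependencies so they are visited too.
        stack.extend(table.get(current, []))
    return sorted(to_install)


print(packages_to_install("pandas", DEPENDENCIES))
# ['numpy', 'pandas', 'python-dateutil', 'six']
```

Asking for one package produces a list of four, which is why you rarely need to install dependencies by hand.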

Package managers will also keep track of the specific versions of the packages installed for a project, and can produce files allowing this information to be shared. This functionality is a key part of capturing a specific computational environment and we’ll return to it later.
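
To get a feel for the information that is being tracked, the sketch below lists the packages and versions visible to the Python interpreter that runs it. It uses the standard library module importlib.metadata (available from Python 3.8), not pip itself, and the function name is a hypothetical one chosen for this example. The output will differ from machine to machine.

```python
from importlib import metadata


def installed_packages():
    """Return 'name==version' strings for packages in the current environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    # Sort case-insensitively so the list is easy to scan.
    return sorted(pins, key=str.lower)


for pin in installed_packages():
    print(pin)
```

Each line pairs a package name with the exact version installed, which is the kind of record needed to recreate an environment elsewhere.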

You can learn more about Python packages and how to make one in the FAIR4RS Packaging lesson.

Where does pip get packages from?

When you install a package using pip it will typically access the Python Package Index (PyPI) to download and install that package.
PyPI is an online repository of over 500,000 packages, and is the most commonly used source for installing Python packages.

It is also possible to install packages from local files, private repositories, or even directly from GitHub repositories, but this is outside the scope of this lesson.

What are Virtual Environments?


By default, when you use pip to install a package it will be installed in Python’s base environment, so using pip alone will result in different projects sharing the same space.

Discussion

Can you think of a few reasons why this may be a problem?

  • Different projects requiring different versions of the same package (Dependency clashes)
    • Could be a package you use directly
    • Could also be a dependency of a package you use
  • Difficulty identifying which packages are required for which projects (Isolating dependencies)

This diagram illustrates the situation:

  1. In this example, ‘Project 1’ requires numpy v1.18. ‘Project 2’ doesn’t directly require a conflicting version of numpy, but it uses pandas, which requires at least numpy v1.22. This creates a dependency clash: you cannot have both versions of numpy installed in the same environment, so either:
    • You break the older project’s computational environment, or
    • You cannot develop your new code.
  2. If you manage to resolve the dependency clash, you still have the issue that the additional dependencies (namely pillow) from ‘Project 1’ will also be captured as part of the computational environment for ‘Project 2’. This may not cause any issues, but it is generally not good practice:
    • When capturing information about the computational environment for a project we only want to include exactly what is required for the reproduction of that project.
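
The clash in point 1 can be checked with a small sketch. The version numbers come from the example above; the helper names are hypothetical, and the version handling is deliberately simplified (it only understands plain dotted numbers, whereas pip supports a much richer set of version specifiers).

```python
def parse_version(text):
    """Turn a version string such as '1.22' into a comparable tuple (1, 22)."""
    return tuple(int(part) for part in text.split("."))


def can_share_environment(pinned, minimum):
    """True if a package pinned at `pinned` also meets a 'at least `minimum`' requirement."""
    return parse_version(pinned) >= parse_version(minimum)


PROJECT_1_NUMPY = "1.18"       # 'Project 1' needs exactly this version
PANDAS_MINIMUM_NUMPY = "1.22"  # pandas, used by 'Project 2', needs at least this

print(can_share_environment(PROJECT_1_NUMPY, PANDAS_MINIMUM_NUMPY))
# False
```

Versions are compared as tuples of integers because comparing them as text would wrongly place 1.9 after 1.10.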

Virtual environments are a tool designed to solve both of these problems. Conceptually they work by creating a separate, self-contained space to install packages. Because these spaces are isolated from one another, you are able to install different versions of the same package for different projects without creating dependency clashes. This isolation between projects also allows you to accurately capture which packages were used within a specific project, making it easier to recreate that aspect of the computational environment in a different context.
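
One way to see this isolation from inside Python is to ask the interpreter where it lives and where it installs packages. The sketch below assumes an environment created with Python’s standard venv module (other tools may behave differently), and the function names are hypothetical.

```python
import sys
import sysconfig


def in_virtual_environment():
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the Python installation it was made from.
    return sys.prefix != sys.base_prefix


def install_location():
    """Directory where packages are installed for this interpreter."""
    return sysconfig.get_paths()["purelib"]


print("In a virtual environment:", in_virtual_environment())
print("Packages are installed to:", install_location())
```

Running this from two different environments should report two different install locations, which is what keeps each project’s packages apart.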

Capturing the ‘packages’ level of a computational environment


Now that we’ve described Package Managers and Virtual Environments, we can outline the steps required to successfully capture the ‘packages’ layer of a computational environment for a project:

  1. Create a virtual environment for your project

  2. Develop your project, installing packages into the virtual environment as needed

  3. Periodically record the packages installed in the environment, ideally to a file alongside the code
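
As a preview, the three steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not the workflow this lesson prescribes: the function names are hypothetical, requirements.txt is a common convention for the output file and not a requirement, and the install and record steps assume pip is available inside the environment. In practice these steps are usually typed at the command line.

```python
import os
import subprocess
import venv
from pathlib import Path


def env_python(env_dir):
    """Path to the Python interpreter inside a virtual environment."""
    if os.name == "nt":  # Windows layout
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


def create_environment(env_dir, with_pip=True):
    """Step 1: create a virtual environment in env_dir."""
    venv.create(env_dir, with_pip=with_pip)
    return env_python(env_dir)


def install_package(python, package):
    """Step 2: install a package into the environment that owns `python`."""
    subprocess.run([str(python), "-m", "pip", "install", package], check=True)


def record_packages(python, out_file="requirements.txt"):
    """Step 3: write the environment's installed packages to a file."""
    result = subprocess.run(
        [str(python), "-m", "pip", "freeze"],
        check=True,
        capture_output=True,
        text=True,
    )
    Path(out_file).write_text(result.stdout)
    return Path(out_file)
```

A typical use would call create_environment(".venv") once, call install_package with the returned interpreter path whenever a new package is needed, and call record_packages after each change so the file stays in step with the environment.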

In the next section we’ll get to grips with using pip and venv, and then move on to how to capture the ‘packages’ layer of a computational environment using them.

Key Points

  • Package Managers are used to install, remove, upgrade, and track software.
  • In the context of Python and other programming languages, this software takes the form of packages: bundles of other people’s code.
  • However, installing all packages to the same place causes dependency clashes and makes recreating a computational environment difficult.
  • Virtual environments are used to deal with this problem by creating isolated spaces where packages can be installed without interfering with one another.
  • Using these two tools together allows you to capture the ‘packages’ layer of your computational environment.