Content from Introduction


Last updated on 2024-12-09

Estimated time: 12 minutes

Overview

Questions

  • How can code design make your code more FAIR?

Objectives

  • Understand the definition of software in research.
  • Understand the FAIR principles as applied to research software.
  • Understand how code design is related to the FAIR principles.

What is software in academia?


It is not always easy to define what constitutes software in a research setting. The size of projects can vary from a small script of a few dozen lines to a massive project with millions of lines. If you are interested in a discussion around a research software definition, a good starting point is Defining Research Software: a controversial discussion. An abbreviated summary of this paper, and the pragmatic definition we use as a basis for this course, is…

Key Points

Research Software includes source code files, algorithms, scripts, computational workflows and executables that were created during the research process or for a research purpose. Software components (e.g., operating systems, libraries, dependencies, packages, scripts, etc.) that are used for research but were not created during or with a clear research intent should be considered software in research and not Research Software. This differentiation may vary between disciplines.

Reminder: The FAIR principles applied to research software


The FAIR principles (Findable, Accessible, Interoperable and Reusable) were originally designed for research data in order to “enhance their re-usability” (see Wilkinson et al. (2016)). This seminal paper made clear that while data is a central aspect of research, the principles should also apply to the algorithms, tools and workflows that led to the production of that data. A few years later, in 2022, a set of recommendations was published (Chue Hong et al. (2022); Barker et al. (2022)) to apply these FAIR principles to research software. An overview of the FAIR principles adapted to research software is that…

Key Points

  • Findable: Software, and its associated metadata, is easy for both humans and machines to find.

  • Accessible: Software, and its metadata, is retrievable via standardised protocols.

  • Interoperable: Software interoperates with other software by exchanging data and/or metadata, and/or through interaction via Application Programming Interfaces (APIs), described through standards.

  • Reusable: Software is both usable (can be executed) and reusable (can be understood, modified, built upon, or incorporated into other software).

How can Code Design help the FAIR principles?


By designing your code efficiently you will make it FAIRer. Code design is about making your code easy to read, adapt, maintain and share. Here’s how some of the principles benefit from good design:

  • Interoperability: Writing code in a modular way and using standard data formats allows other systems to communicate with it.

  • Reusability: Documented code, with docstrings and comments, makes it easier for others to understand, use, and modify your code. Using a modular design with single-task blocks also increases reusability. Indeed, small simple pieces can be more easily transferred to other projects or extended without significant refactoring.

The goal of this lecture is to dive into these practices and learn how the way you code determines how maintainable, adaptable, and sustainable your software is in the long run.

Content from Why should you care?


Last updated on 2024-12-09

Estimated time: 12 minutes

Overview

Questions

  • Why should you know about code design?

Objectives

  • Understand the 4 main concepts developed in this course: maintainability, readability, reusability and scalability

Why should you care?


Reproducibility and Reliability

Good code practices ensure that research results are reproducible and reliable. Research findings are often scrutinized and validated by others in the field, and well-written code facilitates this process. Clean, well-documented, and well-tested code allows other researchers to replicate experiments, verify results, and build upon existing work, thus advancing scientific knowledge.

Efficiency and Maintainability

Writing good code enhances efficiency and maintainability. Research projects can span several years and involve multiple collaborators. Readable and well-structured code makes it easier for current and future researchers to understand, modify, and extend the software. This reduces the time and effort required to troubleshoot issues, implement new features, or adapt the code for different datasets or experiments.

Collaboration and Community Contribution

Good coding practices facilitate collaboration and contribution from the wider research community. Open-source research software, written with clear, standardized coding practices, attracts contributions from other researchers and developers. This collaborative environment can lead to improvements in the software, innovative uses, and more robust and versatile tools, ultimately benefiting the entire research community.

Readability


Definition and key aspects

Readability in software refers to how easily a human reader can understand the purpose, control flow, and operation of the code. High readability means that the code is clear, easy to follow, and well-organized, which greatly enhances maintainability, collaboration, and reduces the likelihood of bugs.

Key aspects:

  • Descriptive Naming: Use meaningful and descriptive names that convey the purpose of the variable.

  • Consistent Formatting: Consistent indentation improves the visual structure of the code. Keeping lines of code within a reasonable length (usually 80-100 characters) prevents horizontal scrolling and improves readability.

  • Comments and documentation: Brief comments within the code explaining non-obvious parts. Detailed documentation at the beginning of modules, classes, and functions explaining their purpose, parameters, and return values.

  • Code structure: Breaking down code into functions, classes, and modules that each handle a specific task. Group related pieces of code together, and separate different functionalities clearly.
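
As a small illustration of these aspects, here is a hypothetical calculation written twice: once with cryptic names and no explanation, and once following the points above (the names and docstring below are invented for the example).

PYTHON

# Hard to read: cryptic names, no structure, no explanation.
def f(a, b):
    return a * b * 0.5

# Easier to read: descriptive names and a short docstring.
def triangle_area(base, height):
    """Return the area of a triangle given its base and height."""
    return base * height * 0.5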

Benefits:

1 - Maintainability: Your code will be easier to understand and modify. This also greatly reduces the risk of errors when introducing changes.

2 - Collaboration: Writing readable code will enhance teamwork and make it easier for others to contribute. Code reviews will be easier!

3 - Efficiency: You are going to save a LOT of time. You will waste less time deciphering your code, and that saved time can be spent developing it.

4 - Quality: Reduces the likelihood of bugs and errors, leading to more reliable code.

Reusability


Definition and Key aspects

Reusability in software refers to the ability to use existing software components (such as functions, classes, modules, or libraries) across multiple projects or in different parts of the same project without significant modification. Reusable code is designed to be generic and flexible, promoting efficiency, reducing redundancy, and enhancing maintainability.

Key aspects:

  • Modularity: Encapsulate functionality within well-defined modules or classes that can be independently reused.

  • Abstraction: Provide simple interfaces while hiding the complex implementation details.

  • Parametrization: Design functions and methods that accept parameters to make them adaptable to different situations.

  • Generic and Reusable Components: Develop generic libraries and utility functions that can be reused across multiple projects.

  • Documentation and Naming: Provide comprehensive documentation for modules, classes, and functions to explain their usage.

  • Avoid hardcoding values: Instead, use constants or configuration files.
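
For instance, here is a small sketch contrasting a hardcoded routine with a parametrised one (the file name, format and threshold are invented for the example):

PYTHON

# Hardcoded: only works for one specific file and one threshold.
def count_large_values_hardcoded():
    with open("results.txt") as f:
        values = [float(line) for line in f]
    return sum(1 for value in values if value > 10)

# Reusable: the file name and threshold are parameters with a named default.
DEFAULT_THRESHOLD = 10.0

def count_large_values(filename, threshold=DEFAULT_THRESHOLD):
    """Count how many values in a one-number-per-line file exceed the threshold."""
    with open(filename) as f:
        values = [float(line) for line in f]
    return sum(1 for value in values if value > threshold)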

Benefits:

  • Time saving: Reusable components save development time. You don’t need to rewrite from scratch! This avoids duplication of effort by using existing solutions for common tasks.

  • Consistency: Using the same code components across projects ensures consistency in functionality and behavior.

  • Maintainability: Reusable components can be maintained and updated independently, making it easier to manage large codebases.

  • Quality: Reusable components are often well-tested, leading to more reliable and bug-free software.

Maintainability


Definition and key aspects

Maintainability in software refers to the ease with which a software system can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment. Highly maintainable software is designed to be easily understood, tested, and updated by developers, ensuring that the software can evolve over time with minimal effort and cost.

Key aspects:

  • Code readability: Your code should be organized logically, with meaningful names for variables, functions and classes.

  • Modularity: If you divide your software into distinct modules or components, each responsible for a specific functionality, you will greatly reduce dependencies.

  • Documentation: The documentation of the code should be continuously updated to reflect the latest state of the software.

  • Automated testing: Testing your software is important to make sure that modifications and the implementation of new functionalities do not break it (see the sketch below).
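
As a minimal sketch of automated testing, assuming the pytest framework and a small add() function in a calculator module (one appears later in this lesson), a test file might look like this and would be run with the pytest command:

PYTHON

"""test_calculator.py -- a minimal automated test, assuming pytest is installed."""
from calculator import add

def test_add():
    # If a future change breaks add(), running pytest will flag it immediately.
    assert add(2, 3) == 5
    assert add(-1, 1) == 0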

Benefits

  • Reduced technical debt: Maintainable code is easier to refactor and improve over time, reducing the accumulation of technical debt. The cost and effort of maintaining the software will be significantly reduced.

  • Faster development: If your code is maintainable, it will be easier to understand, modify and extend. It will also be easier to identify and fix bugs.

  • Increased collaboration: Maintainable code makes it easier for people to join your project!

  • Adaptability to new requirements: If your code is maintainable, it will be easier to adapt to changing (or new) requirements, as is often the case in research.

Quiz


The question for each code example is: ‘Is this code readable, reusable and maintainable?’

Challenge

Code #1:

PYTHON

import math
def process_list(data):
    processed_list = []
    for x in data:
        if x * 1.5 < 5:
            processed_list.append(math.sqrt(x) * 2 + 3)

    return processed_list

#Example usage
input_data = [1, 2, 3, 4, 5, 6]
result = process_list(input_data)
print("processed list:", result)
  • Reusable: The function can be used with any list of integers to filter and transform the data.

  • Partially Readable: The code is readable in the sense that it uses a simple structure that is easy to follow, but its purpose is unclear: there are no comments explaining what the function is doing or why it is doing it.

However, the code will be difficult to maintain because:

  • Constraints are not explained.
  • The logic includes “magic numbers” (2 and 3) without any explanation or named constants.
  • There is no error handling, which makes it harder to maintain when unexpected inputs occur.
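
A possible refactoring addressing these points is sketched below; since the original purpose is unknown, the constant names and the docstring wording are only an interpretation.

PYTHON

import math

# Named constants replace the unexplained "magic numbers".
SCALE_FACTOR = 1.5
THRESHOLD = 5
MULTIPLIER = 2
OFFSET = 3

def process_list(data):
    """Keep values whose scaled value is below THRESHOLD, then transform them."""
    if any(x < 0 for x in data):
        raise ValueError("All values must be non-negative to take a square root.")
    return [math.sqrt(x) * MULTIPLIER + OFFSET
            for x in data if x * SCALE_FACTOR < THRESHOLD]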

Challenge

Code #2:

PYTHON

def calculate_statistics():
    data = [23, 45, 12, 67, 34, 89, 23, 45, 23, 34]
    total_sum = sum(data)
    count = len(data)
    average = total_sum / count
    
    data_sorted = sorted(data)
    if count % 2 == 0:
        median = (data_sorted[count // 2 - 1] + data_sorted[count // 2]) / 2
    else:
        median = data_sorted[count // 2]

    occurrences = {}
    for num in data:
        if num in occurrences:
            occurrences[num] += 1
        else:
            occurrences[num] = 1
    mode = max(occurrences, key=occurrences.get)

    print("Sum:", total_sum)
    print("Average:", average)
    print("Median:", median)
    print("Mode:", mode)

# Calculate statistics for the specific data set
calculate_statistics()
  • Maintainable: The code is well-structured, with clear variable names and straightforward logic. It’s easy to understand and modify if needed.
  • Readable: The code uses descriptive variable names and simple constructs, making it easy to follow.

However, the code is not reusable because the function calculate_statistics is hardcoded to work with a specific dataset defined within the function. It cannot be easily reused with different datasets without modifying the function itself.
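
One way to make it reusable is to pass the dataset in as a parameter and return the results instead of printing them; a possible sketch:

PYTHON

def calculate_statistics(data):
    """Return the sum, average, median and mode of a list of numbers."""
    total_sum = sum(data)
    count = len(data)
    average = total_sum / count

    data_sorted = sorted(data)
    if count % 2 == 0:
        median = (data_sorted[count // 2 - 1] + data_sorted[count // 2]) / 2
    else:
        median = data_sorted[count // 2]

    mode = max(set(data), key=data.count)
    return total_sum, average, median, mode

# The dataset is now supplied by the caller, so any list of numbers works.
total, average, median, mode = calculate_statistics([23, 45, 12, 67, 34, 89, 23, 45, 23, 34])
print("Sum:", total, "Average:", average, "Median:", median, "Mode:", mode)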

Content from Code structure


Last updated on 2024-12-09

Estimated time: 12 minutes

Overview

Questions

  • How to structure code in a scalable and reusable way?

Objectives

  • Learn to use functions and classes
  • Understand how to organise your code in modules and packages

Introduction


When you’re writing code, making it consistent and well-structured is just as essential as ensuring it produces the correct result. You should think about how you structure your code as you write it, as this will make it easier to read and maintain in the future both by yourself and others.

It’s important to follow the design principles of the programming language you are using. Python is known as an object-oriented programming language, which means Python code is structured around creating, using, and interacting with code objects. There are many different types of objects in Python, such as basic integers or text strings, lists or dictionaries which contain multiple objects, and functions that operate on objects. It’s also encouraged to create your own classes of objects, to make your code more modular and reusable.

When your code grows in size and complexity, it’s a good idea to split it into multiple files, known as modules, and organise these modules into packages. This makes your code easier to manage and maintain, and allows you to reuse code across multiple projects.

Functions


Functions are a way to group code together that performs a specific task. There are many built-in functions in Python, such as print() to output values or len() to get the length of a list. But you can also create your own functions, to perform tasks that you need to do multiple times.

Functions are defined using the def keyword, followed by the function name and a set of parentheses containing any parameters the function takes. The function body is then indented and contains the code that the function will execute. The return keyword is used to specify what the function should output.

Below is a very basic function that simply takes two parameters and returns their sum. Once you have defined a function, you can call it by using its name and passing in the required parameters as arguments to the function. In this case the parameters a and b are set to 3 and 5 respectively, and the result of the function is stored in the result variable:

PYTHON

def add(a, b):
    return a + b

result = add(3, 5)
print(result)

OUTPUT

8

Whitespace

If you’re used to other programming languages like C or R, you might be surprised to see that Python does not use curly braces {} to define blocks like functions. Instead, Python uses indentation to define blocks of code, either with spaces or tabs (known as whitespace).

It’s important to be consistent with your indentation, as mixing different numbers of spaces or tabs can cause errors. Most code editors will automatically indent and convert tabs to spaces when writing Python.

This is obviously a very simple example, but functions can be much more complex and can take multiple parameters, return multiple values, or raise errors if something goes wrong.
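
For instance, a slightly richer (hypothetical) function might return two values at once and raise an error for invalid input:

PYTHON

def divide(a, b):
    """Return the integer quotient and remainder of a divided by b."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a // b, a % b

quotient, remainder = divide(17, 5)
print(quotient, remainder)

OUTPUT

3 2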

Any function acts as a reusable block of code that can be called multiple times within your program. Building your code out of small, modular functions makes it easier to read and maintain, by not having to repeat the same code multiple times you save space and only need to edit the code in one place when making changes. It’s also easier and more reliable to test the output of individual functions to make sure they work correctly, rather than having to run the entire program. There will be a future session on Testing and Continuous Integration which will cover this in more detail.

Scope

When you define a variable inside a function, it is only accessible within that function. This is known as the scope of the variable. If you try to access a variable that is defined inside a function from outside the function, you will get an error:

PYTHON

def my_function():
    x = 10
    return x

result = my_function()
print(x)

OUTPUT

NameError: name 'x' is not defined

Anything defined outside of a function is said to be in the global scope, and can be accessed from anywhere in the program (including within functions). However, it’s generally considered good practice to avoid using global variables, as they can make your code harder to understand and debug.
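
As a small sketch of the difference (the names are invented for the example), compare a function that modifies a global variable with one that takes and returns values explicitly:

PYTHON

# Relying on a global variable: the function's effect is hidden from its signature.
total = 0

def add_to_total(value):
    global total
    total += value

# Passing values in and returning the result keeps the dependencies explicit.
def add_amount(total, value):
    return total + value

total = add_amount(0, 5)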

Challenge 1

What will be the output of this code?

PYTHON

x = 5
y = 10

def my_function(z):
    x = 20
    return x + y + z
result = my_function(3)

print('x:', x)
print('y:', y)
print('Result:', result)
print('z:', z)

OUTPUT

x: 5
y: 10
Result: 33
NameError: name 'z' is not defined

The x variable inside the function is a different variable to the x variable outside, so changing it inside the function does not affect the global x variable.

The y variable is accessible inside the function because it is defined in the global scope.

The result variable is the sum of the function’s x variable, the global y variable and the argument z.

The z variable is not defined outside the function, so trying to print it in the main body of the script will raise an error.

Classes


Classes are a way to group functions and data together into a single object. Classes act as a blueprint for creating objects, which are then called instances of the class.

Similar to functions, classes are defined using the class keyword, followed by the class name. The class body is then indented and contains any properties or methods (i.e. functions) that the class has.

Below is a very simple class for a Rectangle object, which has properties for its width and height, as well as a method to calculate its area:

PYTHON

class Rectangle:
    width = 5
    height = 3

    def get_area(self):
        return self.width * self.height

The self parameter

Methods in a class always define self as the first parameter, which is used as a reference to the instance of the class that the method is being called on. In this case the get_area method uses the width and height properties of the Rectangle instance. You can actually call this parameter anything you like, but self is the convention in Python.

You can then create an instance of this class by calling the class name as if it were a function, and you can access its properties and methods using the . operator:

PYTHON

my_rectangle = Rectangle()
print('This rectangle has a width of', my_rectangle.width, 'and a height of', my_rectangle.height)
print('Its area is', my_rectangle.get_area())

OUTPUT

This rectangle has a width of 5 and a height of 3
Its area is 15

In this case the class is not very reusable, as the width and height are fixed. You can pass arguments when creating an instance of the class by defining a special __init__() method. The __init__() method is called whenever a new instance of the class is created, and it can take values to set the initial state of the object. In the example below, the Rectangle class takes width and height parameters and stores them as properties of the self object (i.e. the instance of the class that is being created):

PYTHON

class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def get_area(self):
        return self.width * self.height

my_rectangle = Rectangle(10, 20)
print('This rectangle has a width of', my_rectangle.width, 'and a height of', my_rectangle.height)
print('Its area is', my_rectangle.get_area())

OUTPUT

This rectangle has a width of 10 and a height of 20
Its area is 200

“Dunder” methods

In Python, methods that start and end with double underscores are called “dunder” (short for “double underscore”) or “magic” methods. These methods are special and have specific built-in meanings, such as __init__() being called when an instance of a class is initialised. We’ll see more examples of “dunders” later in this section.
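
As one more illustrative example (not needed for the challenges below), adding a __str__() method to the Rectangle class controls what print() shows for an instance:

PYTHON

class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def __str__(self):
        # Called automatically by print() and str().
        return f"Rectangle of width {self.width} and height {self.height}"

print(Rectangle(10, 20))

OUTPUT

Rectangle of width 10 and height 20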

Classes are a powerful way to structure your code, as they allow you to group related functions and data together in a way that is reusable and easy to understand.

Challenge 2

In the code below, we have multiple lists representing items in a grocery store. Each fruit has a name, price, and the number in stock.

PYTHON

fruits = ['apple', 'banana', 'orange']
fruit_prices = [1.00, 0.50, 0.75]
fruit_count = [10, 20, 15]

Create a class called Stock that has properties called name, price, count, and two methods: display() that prints out the name and unit price of the fruit, and get_total_value() that returns the total value of the stock.

Then create a new list of Stock objects, and call the display() method on each object.

PYTHON

class Stock:
    def __init__(self, name, price, count):
        self.name = name
        self.price = price
        self.count = count

    def display(self):
        print(f'Each {self.name} costs £{self.price}')

    def get_total_value(self):
        return self.price * self.count


fruits = [
    Stock('apple', 1.00, 10),
    Stock('banana', 0.50, 20),
    Stock('orange', 0.75, 15)
]
for fruit in fruits:
    fruit.display()

OUTPUT

Each apple costs £1.0
Each banana costs £0.5
Each orange costs £0.75

Challenge 3

Now, create a Shop class that has a property called stock that is a list of Stock objects. The Shop class should have a method called display_stock() that calls the display() method on each item in the stock list, and a method called get_total_stock_value() that returns the total value of all items in the stock list.

Then create a new Shop object with the fruits list as the input, and call the display_stock() and get_total_stock_value() methods.

PYTHON

class Shop:
    def __init__(self, stock):
        self.stock = stock

    def display_stock(self):
        for item in self.stock:
            item.display()

    def get_total_stock_value(self):
        total = 0
        for item in self.stock:
            total += item.get_total_value()
        return total

shop = Shop(fruits)
shop.display_stock()
print('Total stock value:', shop.get_total_stock_value())

OUTPUT

Each apple costs £1.0
Each banana costs £0.5
Each orange costs £0.75
Total stock value: 31.25

Naming conventions

Note that in each of these examples, the variables, properties and class names are written as nouns (e.g. result, Rectangle, width or Shop), while the functions and class methods are lowercase verbs or short phrases (e.g. add(), get_area() or display_stock()). This is a common convention in Python, and following it can help you write more readable code and understand the difference between objects and functions more easily.

We will see more examples of coding conventions like this in later sections.

Scripts and Modules


A Python script is a file containing Python code that can be executed by the Python interpreter. You can run a script by calling the Python interpreter with the script file as an argument, like this:

BASH

python my_script.py

When you’re writing a large program, it’s a good idea to split your code into multiple files to make it easier to manage, and to make it easier to reuse code in other projects. Each file containing Python code is called a module, and you can import modules into scripts and other modules to use the functions and classes they contain.

For example, you could create a file called calculator.py that contains the add() function we defined earlier, as well as other functions for subtraction, multiplication, and division.

PYTHON

"""calculator.py"""

def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def multiply(a, b):
    return a * b

def divide(a, b):
    return a / b

Now, in another script or module, you can import the calculator module and use its functions by using the import keyword:

PYTHON

"""script.py"""
import calculator

result = calculator.add(5, 3)
print(result)

OUTPUT

8

You can also import specific functions or classes from a module, rather than importing the whole module. This can be useful if you only need one or two functions, as it saves some time and can make your code more readable:

PYTHON

from calculator import add

result = add(5, 3)

Python comes with a lot of built-in modules that you can import and use in your code, collectively known as the Standard Library. These work in the same way; for instance, if you want trigonometric functions you can do from math import sin, cos, tan and then use those functions in your code.
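
For example, a quick sketch using the standard math module:

PYTHON

from math import pi, sin

print(sin(pi / 2))

OUTPUT

1.0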

Running a module as a script

In some cases, you might want a mixture of code that executes when the module is run as a script as well as functions and classes that can be imported into other modules. To keep the two separate, you can use another special “dunder”: the built-in __name__ variable. It is set to '__main__' when the module is run as a script, and to the module name when the module is imported into another module.

As such, if you include a check for if __name__ == '__main__': in your module, you can define code that only runs when the module is run as a script:

PYTHON

"""calculator.py"""
def add(a, b):
    return a + b

if __name__ == '__main__':
    result = add(5, 3)
    print('Test result:', result)

BASH

$ python calculator.py
Test result: 8

If you didn’t include the if __name__ == '__main__': check, then every time you tried to import the add() function from the calculator module, the test code would run as well and print out the result.

Packages


Once you have a collection of modules that you want to reuse across multiple projects, you can organise them into a package. A package is a directory containing multiple modules, along with a special __init__.py file that tells Python that the directory is a package (this is yet another example of a “dunder” being used as a special marker in Python, in this case being used to signify that a directory is importable).

For example, you could create a package called my_package that contains the calculator module we defined earlier, as well as a new module called geometry that contains functions for working with shapes. When organising these modules into a package, the directory structure would look like this:

my_package/
    __init__.py
    calculator.py
    geometry.py

The __init__.py file can be empty, but it can also contain code that runs when the package is imported.
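
For instance, one optional but common pattern is to re-export frequently used functions in __init__.py so they can be imported directly from the package (a minimal sketch, assuming the calculator module above):

PYTHON

"""my_package/__init__.py"""
from my_package.calculator import add, subtract

With this in place, from my_package import add would also work.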

You can then import the modules from the package in the same way as before, but you need to include the package name either as a prefix or by using the from keyword:

PYTHON

import my_package.calculator
my_package.calculator.add(5, 3)

OR

from my_package import calculator
calculator.add(5, 3)

OR

from my_package.calculator import add
add(5, 3)

If your package grows large enough, you can also create sub-packages within your package by creating subdirectories with their own __init__.py files. This allows you to organise your code into a hierarchical structure that makes it easier to manage and understand.

Challenge 4

A researcher has all of their code for a project in a single Python file, and they want to split it into multiple modules within a package. Here is a list of the functions and classes they have defined in their script:

  • load_data(): a function that reads data from a file
  • clean_data(): a function that removes any missing values from the data
  • plot_data(): a function that plots the data
  • Data: a class that holds the data, returned by load_data()
  • Model: a class that represents a machine learning model created from the data
  • test_data(): a function that tests the data class is working correctly
  • test_model(): a function that tests the model class is working correctly
  • run_experiment(): a function that runs the entire experiment, taking a data file as an input

There is also a test file called test_data.csv that contains some example data, and is used by the test_data() and test_model() functions.

How would you organise this code into a package?

Here is an example of how you could organise the code into a package:

research_project/
    __init__.py
    data/
        __init__.py
        data.py
    model/
        __init__.py
        model.py
    plot/
        __init__.py
        plot.py
    scripts/
        __init__.py
        run_experiment.py
    tests/
        __init__.py
        test_data.csv
        test_data.py
        test_model.py
  • The data module would contain the load_data() and clean_data() functions and the Data class.
  • The model module would contain the Model class definition.
  • The plot module would contain the plot_data() function.
  • The scripts module would contain the run_experiment() function in a standalone script.
  • The tests module would contain the test_data() and test_model() functions, as well as the test_data.csv file.

For instance, if you wanted to load and plot the data in a different script, you could import the data and plot modules like this:

PYTHON

from research_project.data import load_data, clean_data
from research_project.plot import plot_data

data = load_data('data.csv')
cleaned_data = clean_data(data)
plot_data(cleaned_data)

Although it seems like a lot of files for a small amount of code, this structure makes it easier to manage and maintain the project over time, and will make it easier to reuse the code in other projects in the future.

Once you have your package organised, you can share it with others by using a code hosting platform like GitHub, or uploading it to the Python Package Index (PyPI, https://pypi.org/). If you do this there are some additional files you should include to make your package more user-friendly, such as a README file that explains what the package does and how to use it, and a LICENSE file that specifies the terms under which the code can be used. There is another session on Packaging which will go into more detail on how to create and share Python packages.

Final exercise : Rewriting a Python Script


In this final exercise, you will rewrite a Python script that is poorly structured and difficult to read, with a large amount of repeated code.

Challenge 5

You can find the script here: student_scores.py.

This Python script is designed to read in a data file containing student names and scores (all generated randomly!), although here we just include the data as a text string in the script to save having to download a separate file.

After checking there are no errors with the input, the script then goes through each student, calculates their total score across the three exams and prints out a summary of the data. It then calculates the average score for each assignment, and prints out the student with the highest total score in the class.

You can download it and run it on your local computer using python student_scores.py. The output should look like this:

OUTPUT

studentid  firstname  surname    score1  score2  score3  total
39816      Fiona      Ellis      15      18      16      49
40859      Philip     Holdcroft  12      17      15      44
71625      Kathleen   Ingram     20      19      19      58
91462      David      Nicholson  14      16      18      48
97297      Mark       Walch      18      20      17      55
Average score1: 15.80
Average score2: 18.00
Average score3: 18.00
Student with highest total: Kathleen Ingram (58.00)

Your task is to rewrite this script using functions and classes to make it more modular and reusable. You can also move code out of the script into separate files and modules if you think it will make the code easier to manage.

Make sure when you’re done that the output of the script is the same as the original.

Content from The Zen of Python


Last updated on 2024-12-09

Estimated time: 12 minutes

Overview

Questions

  • What are PEPs?
  • How to write clean code?
  • How can I do this efficiently with Pylint?

Objectives

  • Understand why it is important to write good code
  • Write PEP8 compliant code
  • Use Pylint to help with code formatting and programmatic errors

Python Enhancement Proposals and the Zen of Python


The Python Enhancement Proposals (PEPs) are documents that provide information to the Python community, or describe a new feature for Python or its processes or environment. Some of them focus on design and style:

  • The main one is PEP8. It lays out rules to write clean code in Python.
  • Docstring conventions are given in PEP257.
  • The Zen of Python in PEP20 gives guiding principles for Python’s design. It is accessible in any Python distribution with:
In [1]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Readability counts


As Guido van Rossum (Python creator and Benevolent Dictator for Life) once said, “Code is read much more often than it is written”.

While coding you may spend a few hours (or days) on a piece of code, and once you are done with it you will not write it again. Nevertheless, there is a very high chance that you will read it again. If the piece of code is part of an ongoing project, you will have to remember what that code does and why you wrote it. Hence, readability counts! Remembering what a piece of code does after a few weeks or months is not easy. Following the standard guidelines will greatly help you (and save you a lot of time!).

In addition, if multiple people are looking at the code and developing it with you, writing readable code is paramount. If people have to decipher your coding style before even trying to understand what you are coding, things become very difficult for everybody. PEP8 provides a standardisation of the Python coding style.

Explicit is better than implicit.


Writing clear code is not complicated. It starts with giving meaningful names to variables, functions and classes. Avoid single-letter names like x or y. For example:

PYTHON

# This is bad:

x = 5
y = 10
z = 2*x + 2*y

# This is much better:

width = 5
height = 10
perimeter = 2*height + 2*width

Just by using descriptive names we can understand what the code is trying to do.

In addition, everything that you write (variables, constants, functions, classes…) comes with a naming convention. The main conventions are:

  • Variables, functions and methods use the snake_case convention. This means they should use lowercase letters, with words separated by underscores:

PYTHON

# This is bad
def ComputePerimeter(width, height):
    return 2*width + 2*height

# This is good
def compute_perimeter(width, height):
    return 2*width + 2*height
  • Class names follow the PascalCase convention (also known as CamelCase). In that convention, each word starts with a capital letter and there are NO underscores between words.

PYTHON

# This is bad
class example_class:

# This is good
class ExampleClass:
  • Constant names follow the UPPER_SNAKE_CASE convention. Constants, or variables that are intended to remain unchanged, should be written in all uppercase letters, with words separated by underscores.

PYTHON

# This is bad
speedoflight = 3e8
planckconstant = 6.62e-34

# This is good
SPEED_OF_LIGHT = 3e8
PLANCK_CONSTANT = 6.62e-34

Beautiful is better than ugly


In the context of Python, beautiful means that the code is clean, readable and well structured. Beautiful code is easy to understand, not only for you but also for other people who might have to maintain the code in the future. It uses meaningful names and a clear logic and structure.

Challenge

What is this code doing?

PYTHON

result = [x * 2 if x % 2 == 0 else x * 3 for x in range(10) if x % 2 != 1 and x != 4]

This one-liner tries to:

  • Filter out odd numbers and the number 4.
  • For even numbers, double them.
  • For the rest, triple them.

‘Beautiful is better than ugly’ means that developers should aim for simplicity and elegant solutions. Code becomes very difficult to maintain when the author tries to cram as much functionality as possible into a single line or function. Always try to break your code down into clear, single-purpose components.

Beautiful code is aesthetically pleasing because it follows good design principles (see next chapter). It is modular, reusable, and adheres to the DRY (Don’t Repeat Yourself) principle. It avoids unnecessary complexity and focuses on clarity.

Challenge

Rewrite the previous one-liner to make it more understandable.

PYTHON

result = []
for x in range(10):
    if x % 2 == 1 or x == 4:
        continue
    if x % 2 == 0:
        result.append(x * 2)
    else:
        result.append(x * 3)

print(result)

The advantages of this version:

  • Clarity: Each part of the logic is isolated—first filtering, then applying the transformation based on conditions.
  • Step-by-Step: It’s clear what’s happening at each step without trying to parse it all at once.
  • Debuggable: It’s easier to debug and modify, especially if you need to change one part of the logic.
  • Maintainability: Each step is explicit, making it easier for others (or yourself in the future) to understand.

Nevertheless, this does not mean that you should over-complicate your code. As you get to know the language in more detail, you will learn how it works, which will help you to be concise and efficient:

Challenge

Consider the following function:

PYTHON

def is_empty(lst):
    if len(lst) == 0:
        return True
    else:
        return False

lst = []
print(is_empty(lst))

Rewrite this in two lines.

PYTHON

lst = []
print(not lst)

The advantages of this version:

  • Readability: It’s immediately clear that the code checks if the list is empty.
  • Conciseness: The not operator works directly with lists in Python, making the code more succinct.
  • Simplicity: Eliminates unnecessary conditional checks and additional code.

Sparse is better than dense.


When you write your code it is important to make it readable. Avoiding cluttered code by keeping it sparse and spaced out increases clarity and readability. Using whitespace, correct indentation and separation will make your code quicker to understand. Moreover, when code is spread out with proper comments and breaks, it is easier to modify or debug. Let’s see an example:

Challenge

What is wrong with this code? Is it actually working?

PYTHON

def   example_function(param1,param2):print(param1+param2*2) 
def   another_function(x,y):return x+y
class MyClass: 
    def __init__(self,param): self.param=param
    def  method(self): 
        if self.param >10: print("Value is greater than 10")
        else: print("Value 10") 
my_list=[1,2,3,4,5]
dictionary={'key1':'value1','key2':'value2'}
result=another_function(5,10) 
print(result)

So what are the rules?

  • Indentation: The convention is to use 4 spaces. Tabs are not recommended as they can lead to inconsistencies:

PYTHON

def example_function():
    if True:
        print("Indented correctly")
  • Whitespaces around operators: A single space on both sides of binary operators should be included (+, -, *, /, =, ==, !=, <, >, <=, >=, etc).

PYTHON

#This is bad
a=2
b=3
c=4
result=a+b+c


#This is good
a = 2
b = 3
c = 4
result = a + b + c
  • Comma and colon spacing: You should include a single space after a comma, and a space after the colon in dictionaries:

PYTHON


#This is bad
dictionary={'key1':'value1','key2':'value2'}

#This is good
dictionary = {'key1': 'value1', 'key2': 'value2'}
  • Blank lines: Use two blank lines before a top-level function or class definition and use a single blank line between method definitions inside a class.

PYTHON

# This is bad
class MyClass:
    def method_one(self):
        pass
    def method_two(self):
        pass

# This is good

class MyClass:
    def method_one(self):
        pass

    def method_two(self):
        pass

Challenge

Based on what we saw up to now, rewrite this code to make it easier to understand.

PYTHON

def   example_function(param1,param2):print(param1+param2*2, end=' ')
print("The result is:",  param1,param2) 
def   another_function(x,y):return x+y
class  MyClass: def __init__(self,param):self.param=param
def  method(self):if self.param >10:print("Value is greater than 10")
else:print("Value is 10 or less") 
my_list=[1,2,3,4,5]
dictionary={'key1':'value1','key2':'value2'}
result=another_function(5,10) 
print(result)

PYTHON

def calculate_adjusted_sum(base_value, multiplier):
    """
    Calculate and print the sum of the base_value and twice the multiplier.
    
    Args:
        base_value (int or float): The base value to which the adjusted multiplier will be added.
        multiplier (int or float): The value that will be doubled and added to the base value.
    """
    adjusted_sum = base_value + (multiplier * 2)
    print(adjusted_sum, end=' ')
    print("The adjusted sum is:", base_value, multiplier)


def add_two_numbers(x, y):
    """
    Return the sum of two numbers.
    
    Args:
        x (int or float): The first number.
        y (int or float): The second number.
    
    Returns:
        int or float: The sum of x and y.
    """
    return x + y


class ValueChecker:
    def __init__(self, value):
        """
        Initialize with a specific value.
        
        Args:
            value (int or float): The value to be checked.
        """
        self.value = value

    def check_and_print_message(self):
        """
        Print a message based on whether the value is greater than 10 or not.
        """
        if self.value > 10:
            print("The value is greater than 10.")
        else:
            print("The value is 10 or less.")


# Example usage
numbers_list = [1, 2, 3, 4, 5]
key_value_pairs = {'key1': 'value1', 'key2': 'value2'}

# Add two numbers and print the result
result = add_two_numbers(5, 10)
print("Sum of numbers:", result)

We can now see that the class and the first function are not used at all in the rest of the code. If the code stands like this, they can be removed.

If the implementation is hard to explain, it’s a bad idea…If the implementation is easy to explain, it may be a good idea.


If you follow this FAIR training program you might be interested in sharing your code with the wider research community. If that’s the case, people might want to have a look at your code. This aphorism tells you that how you implement your code matters! Code should always be easy to understand. If you are unable to explain what your code is doing, then you should not leave it in your software. Conversely, if you are able to explain easily what your piece of code is doing, it is probably a good implementation. For example:

Challenge

What is this code doing?

PYTHON


def check_number(num):
    if num % 2 == 0:
        if num % 5 == 0:
            return True
        else:
            return False
    else:
        return False

How could you make it easier to understand?

The function checks if a number is both even and a multiple of 5. A better way of doing it could be:

PYTHON

def check_number(num):
    return num % 2 == 0 and num % 5 == 0

In addition to writing simpler and more logical code, commenting your code is important. For more complex operations it is often useful to explain the logic behind the reasoning and why a particular approach was chosen.

There are a few rules for writing comments in Python:

  • Comments should be complete sentences and start with a capital letter.
  • Block comments apply to the code coming after it and are indented to the same level of that code. Each line should start with a # followed by a single space.
  • Inline comments should be separated by at least two spaces from the piece of code they relate to.
  • Comments should not state the obvious (it is distracting).

Finally, when you update your code you should always update the comment. ‘Comments that contradict the code are worse than no comments’ [PEP8].
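
A short sketch following these rules (the measurement scenario is invented for illustration):

PYTHON

raw_measurements_mm = [12.0, 48.5, 103.2]  # Example data, in millimetres

# Convert the raw measurements from millimetres to metres before
# comparing them against the instrument threshold.
measurements_m = [value / 1000 for value in raw_measurements_mm]

threshold_m = 0.05  # Instrument sensitivity limit, in metres
large_values = [value for value in measurements_m if value > threshold_m]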

PyLint


PyLint is a tool that analyzes Python code to find programming errors, enforce a coding standard, and look for improvements. It provides a score based on the number of issues detected, helping you write clean and readable code.

Key Features of PyLint

  • Error Detection: Detects issues such as using undefined variables, unnecessary imports, and more.

  • Coding Standard Enforcement: Checks the code against PEP 8. Flags violations such as incorrect indentation, naming conventions, and line length.

  • Code Quality Metrics: Provides a detailed report with metrics like code complexity, number of lines, and number of classes. Offers a score that reflects the overall quality of the code.

  • Refactoring Suggestions: Suggests improvements to make the code cleaner and more efficient. Highlights duplicated code, unused variables, and functions that can be simplified.

Running pylint

To analyse a Python file you can simply run:

BASH

pylint your_python_file.py

When you run PyLint on a Python file, it provides an output with the following components:

  • Messages: Each detected issue is reported with a message ID, type, line number, and a brief description.
  • Statistics: Provides a summary of the issues found, such as the number of errors, warnings, and refactor suggestions.
  • Score: An overall score out of 10, reflecting the code quality based on the issues detected.

Challenge

Let’s have a look at an example: Consider that file here and run PyLint on it. Try to clean up the code according to the error messages you see.

Configuration:

PyLint can be configured to match your specific project requirements. You can create a configuration file (.pylintrc) to customize the behavior of PyLint, such as enabling/disabling certain checks, adjusting thresholds, and more. Generate a configuration file using:

BASH

pylint --generate-rcfile > .pylintrc
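
Inside .pylintrc you can then adjust individual settings. Below is a small illustrative excerpt; check the generated file and the PyLint documentation for the full list of options.

[FORMAT]
max-line-length=100

[MESSAGES CONTROL]
disable=missing-module-docstring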

Integrating with IDEs

Many Integrated Development Environments (IDEs) and text editors, such as Visual Studio Code, PyCharm, and Sublime Text, support PyLint integration. This allows you to see linting results directly within your editor as you write code.

Content from Principles of Code design


Last updated on 2024-12-04

Estimated time: 12 minutes

Overview

Questions

  • How to write maintainable, readable, reusable and scalable code?

Objectives

  • Be familiar with standard principles of code design
  • Understand what they mean and how to apply them

Coding principles are guidelines and best practices that anybody writing code should follow to write clean, maintainable and efficient code. They enhance code quality and ensure it is readable, reusable and less prone to errors.

You aren’t gonna need it (YAGNI)


(Comic: xkcd.com)

Introduction

The principle YAGNI stands for “You Aren’t Gonna Need It”. This principle encourages you to build only what is needed right now, avoiding adding features for hypothetical future needs. It comes from Agile programming and aims to reduce spending time and resources on unnecessary code and keep the code clean and understandable.

Why YAGNI is important:

  • Simplicity: By avoiding unnecessary code you will reduce complexity, making it easier to read, maintain, and debug code.
  • Saving Time: Don’t waste time building features that may never be used.
  • Flexibility: Writing only what is needed makes any changes in requirements easier to implement.

Applying YAGNI

Let’s consider the following instruction: create a function that applies a percentage discount to a price. Here is a solution that does not respect the YAGNI principle:

PYTHON

def calculate_discount(price, discount_type="percentage", value=10.0):
    '''
    This function applies a discount to a price

    Parameters
    ----------
    price   : float
              Original price
    discount_type: str
                   type of discount ['percentage' or 'fixed']
    value:  float
            discount to be applied

    Return
    ------
    discounted_price: float
                      final price after applying discount

    Raises
    ------
    ValueError
            if the discount type is not 'percentage' or 'fixed'

    '''
    if discount_type == "percentage":
        return price - (price * (value / 100))
    elif discount_type == "fixed":
        return price - value
    else:
        raise ValueError("Invalid discount type")

In that example, the software engineer has planned for other possible use cases (different types of discount) that were not required. It is an example of over-engineering. A better implementation would be:

PYTHON

def calculate_discount(price, discount_percentage):
    '''
    Function that applies a discount. The discount is given as a percentage of the original price.

    Parameters
    ----------
    price:  float
            original price
    discount_percentage:  float
            discount to apply, as a percentage of the original price

    Return
    ------
    final_price: float
                final price after applying discount
    '''
    final_price = price - (price * (discount_percentage / 100))
    return final_price

Exercise

Challenge

Context: You’re working on a feature to calculate the final price of items in a shopping cart. Right now, the only two requirements are (1) to apply a fixed 10% discount to the total cart price and (2) to return the final price with a $ sign in front of the total (e.g. $42.2). However, the initial implementation includes additional features that anticipate potential, but not confirmed, future requirements.

PYTHON

def calculate_final_price(prices, currency="USD", discount_type="percentage", discount_value=0.1, include_shipping=False, shipping_cost=5.0):

    # Calculate the initial total price
    total = sum(prices)

    # Apply discount based on type
    if discount_type == "percentage":
        total -= total * discount_value
    elif discount_type == "fixed":
        total -= discount_value

    # Include shipping if specified
    if include_shipping:
        total += shipping_cost

    # Format total with currency symbol
    if currency == "USD":
        return f"${total:.2f}"
    elif currency == "EUR":
        return f"€{total:.2f}"
    else:
        raise ValueError("Unsupported currency")

Work on the calculate_final_price function to apply the YAGNI principle by removing unnecessary parameters and logic, focusing only on the known requirements.

PYTHON

def calculate_final_price(prices):
    # Calculate the total with a fixed 10% discount
    total = sum(prices) * 0.9
    return f"${total:.2f}"

Summary

  • YAGNI encourages you to code only the requirements you currently have.
  • Write lean, purpose-focused code and avoid implementing hypothetical features.
  • Keeps your code agile and maintainable.

Keep it simple, Stupid (KISS) & Curly’s Law


Introduction

The KISS Principle stands for “Keep It Simple, Stupid” and points out that writing simple code should be a primary goal in design. Complex structures often lead to unreadable and error-prone code. This is especially important in research, where maintaining code over a long time period is essential.

Why KISS is important?

  • Readability: Simple code is easier to understand. There is a high chance that the person who will read your code the most is yourself, so help your future self.
  • Maintainability: Bugs are easier to find and fix when each component is simple.
  • Upgradability: Simple code is easier to adapt to changes in the requirements.

It is easy to recognize complex code. When you have too many nested loops or if statements, it means that your code is not optimal. In such cases, you should take a step back and try to simplify the structure.

Curly’s Law says that a function should focus on a single task. Each function should “do one thing” and “do it well,” meaning that if a function has multiple tasks, consider breaking it down.

Why is Curly’s Law important?

  • Reusability: Simple single-task functions are easier to reuse.
  • Bug fixing: When your code is composed of simple functions, potential issues are easier to localise.
  • Testing: Simple single-task functions are easier to test.
  • Modularity: Code becomes more modular and organized.

Applying KISS and Curly’s Law: Simplifying a Complex Function

Let’s consider a function that computes the area of circles, rectangles and triangles:

PYTHON

def calculate_area(shape, dimensions):
    '''
    This function computes the area of a given geometrical shape


    Parameters
    ----------
    shape   : str
              shape to consider. Can be rectangle, circle or triangle

    dimensions: list
                of dimension to consider. For rectangle and triangle you need to give a list
                of 2 numbers. For circle, you need to pass a list of one quantity (radius).

    Return
    ------
    area      : float
                area of the shape

    Raises
    ------
    ValueError
            if the shape is not recognised
    '''
    if shape == "rectangle":
        area = dimensions[0] * dimensions[1]
    elif shape == "circle":
        area = 3.14159 * (dimensions[0] ** 2)
    elif shape == "triangle":
        area = 0.5 * dimensions[0] * dimensions[1]
    else:
        raise ValueError("Unsupported shape!")

    return area
        
area = calculate_area("rectangle", [10, 20])

This function is able to compute the area of each shape. To apply KISS and Curly’s Law, you can split this function into three simple, independent functions:

PYTHON

def rectangle_area(length, width):
    return length * width

def circle_area(radius):
    return 3.14159 * radius ** 2

def triangle_area(base, height):
    return 0.5 * base * height

# Simple and clear usage
area = rectangle_area(10, 20)

In that version, functions are specific and easy to understand and there is no unnecessary complexity in shape management. It is easier to maintain and extend.

Exercise

Challenge

Let’s consider a function that processes data by removing missing values, calculating the average and returning a formatted result:

PYTHON

def process_data(data):

    cleaned_data = [x for x in data if x is not None]  # Remove missing values

    average = sum(cleaned_data) / len(cleaned_data)    # Calculate average

    return f"Average: {average:.2f}"  

Using KISS and Curly’s law, rewrite this code.

PYTHON

def remove_missing(data):
    '''
    This function removes missing data from a list
    Parameter
    ---------
    data   : list
             list of numbers

    Return
    ------
    cleaned_data: list
                  of data without missing values
    '''
    cleaned_data = [x for x in data if x is not None]

    return cleaned_data


def calculate_average(data):
    '''
    This function computes the average of the input data

    Parameters
    ----------
    data   : list
             of numbers

    Return
    ------
    average : float
              average of the data
    '''
    average = sum(data) / len(data)

    return average

def format_average(average):
    '''
    Format the number given as a parameter as a string.

    Parameter
    ---------
    average    : float
                 number to format

    Return
    ------
    formatted_string    : str
                          of the form 'Average: X.YZ'
    '''
    return f"Average: {average:.2f}"

Summary

  • The KISS Principle encourages you to keep your code as simple as possible.
  • Curly’s Law advises you to keep functions focused on a single task.
  • Combining these principles improves code readability, maintainability and testability.

Don’t repeat yourself (DRY) - Rule of three


Introduction

The DRY Principle states: “Don’t Repeat Yourself.” It encourages you to minimize duplication by refactoring similar code patterns. This leads to more readable, maintainable, and scalable code.

Why DRY is important:

  • Improves Readability: Code is clearer when it’s not cluttered with repeated logic.
  • Reduces Bugs: If you need to make changes, you only do it in one place, reducing the chance of errors.
  • Saves Time: Updating and testing code is faster when code is organized with minimal duplication.

Using functions to avoid repeating code

Instead of writing the same code in multiple places in your script, create a function. This makes updates easier and avoids errors. For example, consider the following code:

PYTHON

price1 = 100 * 1.2
price2 = 150 * 1.2
price3 = 200 * 1.2
print(price1, price2, price3)

The same operation is repeated three times with a different value. If you create a function that performs this operation, you can refactor your code:

PYTHON

# With DRY Principle
def calculate_price(base_price):
    return base_price * 1.2

price1 = calculate_price(100)
price2 = calculate_price(150)
price3 = calculate_price(200)
print(price1, price2, price3)

Using loops instead of manual repetition

In the previous example we still call the function three times, which is not optimal. In general, if you're applying the same operation to multiple elements, use a loop to avoid repeated code blocks:

PYTHON

prices = [100, 150, 200]
for price in prices:
    print(calculate_price(price))

Using constants for common values

When a value is repeated in multiple places, declare it as a constant variable. This way, you only need to change it once if necessary. Consider the following code:

PYTHON

total = (100 * 0.1) + (200 * 0.1) + (300 * 0.1)

The value 0.1 is repeated three times. If you want to change it, you will need to do it in three places. To save time and add clarity to your code, declare the value 0.1 as a constant, as follows:

PYTHON

TAX_RATE = 0.1
total = (100 * TAX_RATE) + (200 * TAX_RATE) + (300 * TAX_RATE)

Now if you want to change 0.1 to 0.2 you only need to do it once. In addition, the constant's name tells you what the value represents, so the code is already clearer.
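
Functions, loops and constants can of course be combined. A minimal sketch computing the same total without any repetition (tax_amount is an illustrative helper name):

PYTHON

TAX_RATE = 0.1   # single place to change the tax rate

def tax_amount(price):
    return price * TAX_RATE

prices = [100, 200, 300]
total = sum(tax_amount(price) for price in prices)
print(total)   # 60.0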

Challenge

Write a code, without repetition, that produces the following output:

Hello, Alice!
Hello, Bob!
Hello, Charlie!

PYTHON

def greet(name):
    print(f"Hello, {name}!")

names = ["Alice", "Bob", "Charlie"]
for name in names:
    greet(name)

Summary

DRY helps you write clear, efficient, and error-resistant code. Use functions, loops, and constants to reduce repetition. A DRY approach saves time and effort in the long run, especially when scaling or debugging code.

It is important to note that prematurely refactoring code might lead to unnecessary complexity. This is why DRY is often associated with the Rule of Three, a guideline suggesting that you should wait until a piece of code is repeated three times before refactoring it. It ensures that you only refactor when a pattern is stable and has been repeated enough times.

Principle of least astonishment (POLA)


Introduction

The Principle of Least Astonishment (POLA) states that code should work in a way that does not surprise its users and maintainers. POLA encourages you to design code that aligns with common expectations.

Why POLA is important:

  • Usability: When code works as expected, users and maintainers are less likely to misuse or misunderstand it.
  • Maintainability: Familiar and predictable patterns make the code easier to maintain and upgrade.
  • Collaboration: Consistent and intuitive code makes it easier for multiple people to work on and extend it.

Common violations

Here are three common violations of POLA:

  • Naming Conventions: Function or variable names that don't align with their purpose often lead to confusion and misuse.

  • Unexpected Return Types: Functions that return types users wouldn’t expect, such as a function sometimes returning an integer and other times returning None.

  • Multiple Functionalities: Using functions for multiple unrelated tasks often leads to unexpected behaviors.

Applying POLA

Example 1: Consider a function that returns different types based on a condition, which could confuse users who expect one type.

PYTHON

def calculate_total(items):
    if not items:
        return None  # If no items, return None
    return sum(items)

The problem with this function is that the type of the returned value depends on a condition. A potential solution is to always return a number:

PYTHON

def calculate_total(items):
    if not items:
        return 0  # Return 0 instead of None for consistency
    return sum(items)

With this solution, the user of the code will always get the same type out of that function.

Example 2: Consider a function that does two different tasks: processing some data and saving it to a file.

PYTHON

def process_data(data, save=False):
    cleaned_data = [d.strip() for d in data]
    if save:
        with open('data.txt', 'w') as f:
            f.write('\n'.join(cleaned_data))
    return cleaned_data

The user may not expect that processing data will also save it to a file. This can lead to data being overwritten. To avoid this, you might want to separate the two functionalities into two different functions:

PYTHON

def process_data(data):
    return [d.strip() for d in data]

def save_data(data, filename='data.txt'):
    with open(filename, 'w') as f:
        f.write('\n'.join(data))

This solution keeps each function’s purpose clear.

Exercise

Challenge

Refactor calculate_area to make it more predictable and intuitive.

PYTHON


from math import pi

def calculate_area(shape, a, b=0):
    if shape == "rectangle":
        return a * b  # Expects both `a` and `b`
    elif shape == "circle":
        return pi * (a ** 2)  # Ignores `b`
    elif shape == "triangle":
        return 0.5 * a * b  # Expects `a` as base and `b` as height
    else:
        return "Unknown shape"

# Example usage:
print(calculate_area("rectangle", 5))       
print(calculate_area("circle", 3, 4))       
print(calculate_area("triangle", 6, 3))     
print(calculate_area("hexagon", 5, 5))     

PYTHON

from math import pi

# Specific, predictable functions for each shape
def rectangle_area(length, width):
    if length <= 0 or width <= 0:
        raise ValueError("Length and width must be positive numbers.")
    return length * width

def circle_area(radius):
    if radius <= 0:
        raise ValueError("Radius must be a positive number.")
    return pi * radius ** 2

def triangle_area(base, height):
    if base <= 0 or height <= 0:
        raise ValueError("Base and height must be positive numbers.")
    return 0.5 * base * height

# Example usage
print(f"Rectangle Area: {rectangle_area(10, 5)}")   # Valid rectangle area
print(f"Circle Area: {circle_area(3)}")             # Valid circle area
print(f"Triangle Area: {triangle_area(6, 3)}")      # Valid triangle area

# Invalid dimensions now raise a clear error instead of returning a different type
try:
    rectangle_area(10, -5)
except ValueError as error:
    print(f"Invalid Rectangle Area: {error}")

Content from Don't touch your code anymore!


Last updated on 2024-12-10 | Edit this page

Estimated time: 12 minutes

Overview

Questions

  • How can you modify your code configuration without touching it?

Objectives

  • Understand the purpose and usage of configuration files and command line interfaces.
  • Create, read, and write configuration files in different formats (INI, JSON, YAML).
  • Learn to design and implement command-line interfaces (CLI) using argparse.
  • Integrate configuration files with CLI arguments for dynamic applications.

Until now, what we have seen deals with the design of the code itself and how to make it cleaner, more readable and maintainable. Now we are going to see how to change your code's configuration without touching the code itself. Research is often based on trial-and-error loops: you will often find yourself rerunning a code with different parameters to try different configurations. Hard-coding these values leads to inflexibility and error-prone results, because it means you have to edit the code itself every time you change the configuration. In addition, unless you track all your trials very carefully, you will probably lose track of some of them.

Configuration files


Why would you need them?

Configuration files allow you to adjust parameters of the code (filenames, directories, values, etc.) while leaving the code itself untouched.

Benefits:

  • Easier Reproducibility: By simply changing configuration files, you can reproduce the same results or adjust parameters for new experiments.
  • Collaboration: Configuration files allow collaborators to use the same script but adjust settings for their own environment. It is also easier to share configurations between collaborators.
  • Minimizing Code Modifications: Parameters are externalized, making the core code cleaner and more maintainable.
  • Documentation: Well-structured configuration files serve as documentation for your run. They provide a clear and organized record of the settings used, which is crucial for understanding and interpreting results.
  • Version Control: Configuration files can be versioned alongside the code using version control systems like Git.

Types of configuration files

As is often the case in Python, multiple formats are available:

  • INI files: a simple, human-readable format, organised in sections of key/value pairs. The module to read these files is configparser, which is part of the standard library.

[section1]
key1 = value1
key2 = value2

#Comments

[section2]
key1 = value1


[Section3]
key = value3
    multiline

INI files are structured as (case-sensitive) sections in which you list key/value pairs (like in a dictionary), separated by either the = or : sign. Values can span multiple lines as long as the extra lines are indented with respect to the first line, and comments are accepted. All data is parsed as strings.

  • JSON files: Originally developed for JavaScript, they are very popular in web applications. The module to read these files is json, which is also part of the standard library.
{
  "section1": {
    "key1": "value1",
    "key2": "value2"
  },
  "section2": {
    "key1": "value1"
  }
}

JSON files are also structured as sections of key/value pairs. A JSON file starts with an opening brace { and ends with a closing brace }. Each section is given by its name followed by a colon, and its key/value pairs are listed within braces (one pair of braces per section). However, comments are not allowed and JSON can be a little more cumbersome to write by hand.
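
Reading a JSON configuration file is straightforward with the json module. A minimal sketch, assuming the content above is saved in a file called config.json:

PYTHON

import json

with open('config.json') as json_file:
    config = json.load(json_file)   # returns nested dictionaries

print(config['section1']['key1'])   # prints: value1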

  • YAML files: also a popular format (used for GitHub Actions, for example). In order to read (and write) YAML files, you will need to install a third-party package called PyYAML.
section1:
  key1: value1
  key2: value2

section2:
  key1: value1

# Comments

YAML files also work with sections and key/value pairs, with nesting indicated by indentation.
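
A minimal sketch for reading such a file, assuming PyYAML is installed and the content above is saved in a file called config.yaml:

PYTHON

import yaml  # provided by the third-party PyYAML package

with open('config.yaml') as yaml_file:
    config = yaml.safe_load(yaml_file)   # safe_load is preferred over load for untrusted files

print(config['section1']['key1'])   # prints: value1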

  • TOML files: a bit more recent than the other formats, but increasingly widely used in Python (a simple example is the setup.py installation file, which has become a pyproject.toml file in recent years). They support nested structures and typed values, and their syntax is quite similar to INI files, although string values must be quoted. It is worth mentioning that the tomllib library is part of the Python standard library from Python version 3.11.
[section1]
key1 = "value1"
key2 = "value2"

[section2]
key1 = "value1"
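
Reading a TOML file requires opening it in binary mode. A minimal sketch, assuming Python 3.11+ and that the content above is saved in a file called config.toml:

PYTHON

import tomllib  # part of the standard library from Python 3.11

# tomllib expects a file object opened in binary mode
with open('config.toml', 'rb') as toml_file:
    config = tomllib.load(toml_file)

print(config['section1']['key1'])   # prints: value1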

Loading and writing INI files: Configparser

In the following, we will be using INI files. We will start with a simple exercise: writing a configuration file manually.

Challenge

Using the text editor of your choice, create an INI file with three sections: simulation, environment and initial conditions. In the first section, two parameters are given: time_step set at 0.01s and total_time set at 100.0s. The environment section also has two parameters with gravity at 9.81 and air_resistance at 0.02. Finally the initial conditions are: velocity at 10.0 km/s, angle at 45 degrees and height at 1m.

Create a file called 'config.ini' with the following content:

[simulation]
time_step = 0.01
total_time = 100.0

[environment]
gravity = 9.81
air_resistance = 0.02

[initial_conditions]
velocity = 10.0
angle = 45.0
height = 1.0

Reading configuration files: INI

Reading an INI file is very easy. It requires the configparser module, which you do not need to install because it is part of the standard library. To read a config file, import the module and create a parser object, which is then used to read the file we created just above, as follows:

PYTHON

##Import the library
import configparser 

##Create the parser object
parser = configparser.ConfigParser()

##Read the configuration file
parser.read('config.ini')

From there you can access everything that is in the configuration file. Firstly, you can list the section names and check whether a given section is present (useful to check that the config file matches what you expect):

PYTHON

>>> print(parser.sections())
['simulation', 'environment', 'initial_conditions'] 


>>> print(parser.has_section('simulation'))
True

>>> print(parser.has_section('Finalstate'))
False

Eventually, you will need to extract the values from the configuration file. You can get all the keys inside a section at once:

PYTHON

>>> options = parser.options('simulation')
>>> print(options)
['time_step', 'total_time']

You can also extract everything at once; in that case each key/value pair is called an item:

PYTHON

>>> items_in_simulation = parser.items('simulation')
>>> print(items_in_simulation)
[('time_step', '0.01'), ('total_time', '100.0')]

That method returns a list of tuples, each containing a key/value pair. Values will always be of type string.

Alternatively, you can turn each section into a dictionary:

PYTHON

>>> dict(parser['simulation'])
{'time_step': '0.01', 'total_time': '100.0'}

Finally, you can directly access the value of a key inside a given section like this:

PYTHON

>>> time_step = parser['simulation']['time_step']
>>> print(time_step)
0.01

By default, ALL values will be strings. Another option is to use the .get() method:

PYTHON

>>> time_step_with_get = parser.get('simulation', 'time_step')
>>> print(time_step_with_get)
0.01

It will also return a string… and that can be annoying when your values have other types, because you will have to convert everything to the right type yourself. Fortunately, other methods are available:

  • .getint() will extract the value and convert it to an integer
  • .getfloat() will extract the value and convert it to a float
  • .getboolean() will extract the value and convert it to a boolean. Interestingly, it returns True if the value is 1, yes, true or on, and False if the value is 0, no, false or off.
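
For example, with the config.ini file created earlier, the typed getters return values that are ready to use (a minimal sketch):

PYTHON

>>> parser.getfloat('simulation', 'time_step')
0.01
>>> type(parser.getfloat('simulation', 'time_step'))
<class 'float'>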

Writing configuration files

On some occasions, it might also be interesting to write configuration files programmatically. Configparser allows you to write INI files as well. As for reading them, everything starts by importing the module and creating an object:

PYTHON

#Let's import the ConfigParser object directly
from configparser import ConfigParser

# And create a config object
config = ConfigParser()

Creating a configuration is equivalent to creating dictionaries:

PYTHON

config['simulation'] = {'time_step': 1.0, 'total_time': 200.0}
config['environment'] = {'gravity': 9.81, 'air_resistance': 0.02}
config['initial_conditions'] = {'velocity': 5.0, 'angle': 30.0, 'height': 0.5}

And finally you will have to save it:

PYTHON

with open('config_file_program.ini', 'w') as configfile: ##This opens config_file_program.ini in write mode
    config.write(configfile)

After running that piece of code, you will end up with a new file called config_file_program.ini with the following content:

[simulation]
time_step = 1.0
total_time = 200.0

[environment]
gravity = 9.81
air_resistance = 0.02

[initial_conditions]
velocity = 5.0
angle = 30.0
height = 0.5

Challenge

Consider the following INI file:

[fruits]
oranges = 3
lemons = 6
apples = 5

[vegetables]
onions = 1
asparagus = 2
beetroots = 4

Read it using the configparser library. Then change the number of beetroots to 2 and the number of oranges to 5, and add a section 'pastries' with 5 croissants. Finally, save it back to disk in a different file.

PYTHON

##Import the package
import configparser

###create the object
config = configparser.ConfigParser()

##read the file
config.read('conf_fruit.ini')


###Change the values
config['fruits']['oranges'] = str(5)
config['vegetables']['beetroots'] = str(2)

###Add a section with a new key/pair value
config['pastries'] = {'croissants': '5'}


##save it back
with open('new_conf_fruits.ini', 'w') as openconfig:
    config.write(openconfig)

Using command line interfaces


Definition & advantages

A Command Line Interface (CLI) is a text-based interface used to interact with software and operating systems. It allows users to type commands into a terminal or command prompt to perform specific tasks, ranging from file manipulation to running scripts or programs.

When writing research software, CLIs are particularly useful:

  • Configuration: Using a CLI, it is easy to modify the configuration of a piece of software without having to touch the source code.

  • Batch Processing: Researchers often need to process large datasets or run simulations multiple times. CLI commands can be easily scripted to automate these tasks, saving time and reducing the potential for human error.

  • Quick Execution: Experienced users can perform complex tasks more quickly with a CLI compared to navigating through a GUI.

  • Adding New Features: Adding new arguments and options is straightforward, making it easy to extend the functionality of your software as requirements evolve.

  • Documentation: CLI helps document the functionality of your script through the help command, making it clearer how different options and parameters affect the outcome.

  • Use on HPC systems: HPC systems are often accessed through a terminal, making command line interfaces particularly useful for launching codes there.

Creating a command line interface in Python

In Python, there is a very nice module called argparse. It allows you to write a command line interface in a very limited number of lines. Again, that module is part of the standard library so you do not need to install anything.

As for the configuration files, we must start by importing the module and creating a parser object. The parser object can take a few arguments; the main ones are:

  • prog: The name of the program
  • description: A short description of the program.
  • epilog: Text displayed at the bottom of the help

We would proceed as follows:

PYTHON

###import the library
import argparse


###create the parser object
parser = argparse.ArgumentParser(description='This program is an example of command line interface in Python',
                                 epilog='Author: R. Thomas, 2024, UoS')

Once this is written, you need to tell the program to analyse (parse) the arguments passed to it. This is done with the parse_args() method:

args = parser.parse_args()

If you save everything in a python file (e.g. cli_course.py) and run python cli_course.py --help you will see the following on the terminal:

usage: cli_course.py [-h]

This program is an example of command line interface in Python

optional arguments:
  -h, --help  show this help message and exit

Author: R. Thomas, 2024, UoS

You can see that the only option implemented so far is the help. It is generated automatically when the command line interface is constructed, so you do not need to implement it yourself. Now let's add extra arguments!

Define command line arguments

The argparse module implements the add_argument() method to add arguments. Based on the code we prepared before, you would use it this way:

PYTHON

parser.add_argument(SOMETHING TO ADD HERE)

Two main types of arguments are possible:

  • Optional arguments: their names start with - or -- and they are called in the terminal by their name. They can be omitted by the user.

  • Positional arguments: their names DO NOT start with - or --, the user cannot omit them, and they are not called by name (only the value needs to be passed).

For example, you can add these lines before args = parser.parse_args() in the cli_course.py file that you created before:

PYTHON

parser.add_argument('file')               # positional argument (mandatory)
parser.add_argument('file2')              # positional argument (mandatory)
parser.add_argument('-c', '--count')      # option that takes a value
parser.add_argument('-n')                 # option that takes a value
parser.add_argument('--max')              # option that takes a value

If, once again, you print the help in the terminal with python cli_course.py --help, you will see the following being displayed:

usage: cli_course.py [-h] [-c COUNT] [-n N] [--max MAX] file file2

This program is an example of command line interface in Python

positional arguments:
  file
  file2

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
  -n N
  --max MAX

Author: R. Thomas, 2024, UoS

The help tells you that file and file2 are positional arguments. The user has to provide a value for each of them (in the right order!). In the next section of the help output we find the help option, which was already there before, and the count, n and max options:

  • The count option can be called using -c OR --count, followed by a value COUNT that the user needs to provide.
  • The n option is called using -n plus a value.
  • The max option is called using --max plus a value.

Now that we have defined a few arguments, we can tune them a little bit. The first thing you will want to do is to provide the user of your program with some help. As it stands now, displaying the help tells you which arguments can be used, but nothing tells you what they actually are. To prevent any confusion, add a one-line help message to each argument:

PYTHON

parser.add_argument('file', help='input data file to the program')          # position argument
parser.add_argument('file2', help='Configuration file to the program')      # position argument
parser.add_argument('-c', '--count', help='Number of counts per iteration') # option that takes a value
parser.add_argument('-n', help='Number of iteration')                       # option that takes a value
parser.add_argument('--max', help='Maximum population per iteration')       # option that takes a value

These short descriptions will be displayed when using the help:

usage: cli_course.py [-h] [-c COUNT] [-n N] [--max MAX] file file2

This program is an example of command line interface in Python

positional arguments:
  file                  input data file to the program
  file2                 Configuration file to the program

options:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of counts per iteration
  -n N                  Number of iteration
  --max MAX             Maximum population per iteration

Author: R. Thomas, 2024, UoS

It is possible to use extra options when defining arguments; we list a few here:

  • action: this option allows you to store boolean values, for example action='store_true' stores True when the option is passed and False otherwise.

  • default: This allows you to define a default value for the argument. If the argument is not used, the default value will be selected: parser.add_argument('--color', default='blue').

  • type: By default, arguments are extracted as strings. Nevertheless, it is possible to have them interpreted as other types using the type option: parser.add_argument('-i', type=int). If the user passes a value that cannot be converted to the expected type, an error will be returned.

  • choices: If you want to restrict the values an argument can take, you can use the choices option to add this constraint: parser.add_argument('--color', choices=['blue', 'red', 'green']). If the user passes 'purple' as the value, an error will be raised.
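
These options can be combined when defining an argument. A minimal sketch (the argument names are purely illustrative):

PYTHON

parser.add_argument('--verbose', action='store_true',
                    help='Print extra information')             # boolean flag, False unless used
parser.add_argument('-i', type=int, default=10,
                    help='Number of iterations')                 # converted to int, defaults to 10
parser.add_argument('--color', choices=['blue', 'red', 'green'],
                    default='blue', help='Colour of the plot')   # restricted to three values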

Finally, you need to be able to retrieve all the argument values.

How to get values from the CLI?

To get values from the command line interface you need to look into the args variable that you defined with the line args = parser.parse_args(). Each argument can be accessed as 'args.' + the argument name:

args = parser.parse_args()
print(args) # Gives the namespace content
print(args.file) #direct access to the 1st positional argument
print(args.max) #direct access to the max optional argument

Below we give a couple of examples of calls to the program with different configurations:

[user@user]$ python cli_course.py file1path/file.py -c 3 --max 5
usage: cli_course.py [-h] [-c COUNT] [-n N] [--max MAX] file file2
cli_course.py: error: the following arguments are required: file2  ####<---One positional argument is missing.

[user@user]$ python cli_course.py file1path/file.py file2path/file2.py -c 3
Namespace(count='3', file='file1path/file.py', file2='file2path/file2.py', max=None, n=None)
file1path/file.py
None

Challenge

Create a Python script called basic_cli.py that:

  • Accepts two arguments: --input_file (path to the data file) and --output_dir (directory for saving results).
  • Prints out the values of these arguments.

Expected output:

Input file: /data/input.txt
Output directory: /results/
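
A possible solution, as a minimal sketch using argparse (the help messages are illustrative):

PYTHON

import argparse

parser = argparse.ArgumentParser(description='Basic command line interface example')
parser.add_argument('--input_file', help='path to the data file')
parser.add_argument('--output_dir', help='directory for saving results')
args = parser.parse_args()

print(f"Input file: {args.input_file}")
print(f"Output directory: {args.output_dir}")

Running python basic_cli.py --input_file /data/input.txt --output_dir /results/ should then reproduce the expected output.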

Final exercise: Mixing everything


For this last part of the final lecture, we will combine a bit of everything we have seen during the module (you should plan for about an hour for this exercise).

Challenge

It all starts with a configuration file that you should download.

You will create five Python files in a directory called 'final':

  • main.py: it will contain the main code of the program
  • cli.py: that will contain the command line interface
  • conf.py: that will handle configuration file
  • simulation.py: that will handle the simulation that we are going to fake.
  • __init__.py: that will stay empty.

You will start by creating the command line interface in the cli.py file with the following optional arguments:

  • --config: that will take a string value and the user will use it to pass the configuration file.
  • --timestamp: that will take a float as value
  • --save: an action argument. If used, it should be True; otherwise False.

You should wrap this up in a function called command_line.
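
A possible skeleton for cli.py, as a minimal sketch (only the argument names and the function name come from the instructions above):

PYTHON

# cli.py
import argparse

def command_line():
    parser = argparse.ArgumentParser(description='Final exercise: fake simulation')
    parser.add_argument('--config', type=str, help='path to the configuration file')
    parser.add_argument('--timestamp', type=float, help='time step overriding the configuration')
    parser.add_argument('--save', action='store_true', help='save the final configuration to disk')
    return parser.parse_args()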

In main.py, you will import cli.py and call the command_line function. You should get the values of all the arguments so that we can analyse them. If the --config argument is empty (=None), you will close the program with a message printed in the terminal ('No config file passed…exit').

If something is passed to --config, the code will continue and you will read the configuration file. This will be done by calling the conf.py file, where you will create a function called read_conf that takes the file as argument. This function will return a dictionary with the complete configuration. In main.py, you must retrieve this complete configuration.

Once you are there, check the --timestamp option from the command line interface. If something has been given, you should replace, in the configuration, the value under Parameters/time_step by the value given by the user.

With this final configuration (updated or not, depending on the user's request) you will create a simulation. This will be done in simulation.py. In that file you will create a Simulation class that takes the three parameters L, M and H from the configuration file. These parameters will become properties of that class. You will also create a method (a function inside the class) called get_total() that computes L + M.

Back to main.py, you will create a Simulation object and then print the result of the get_total() method.

Finally, if the user called the --save argument, you will use a function write_conf() that you will create in the conf.py file that will write the final configuration to a file called final_conf.ini.

The solution can be found in the GitHub repository.