Data Flow Should Follow Execution Flow

Let us start with an analogy...

Imagine you are on a police surveillance team, watching suspicious comings-and-goings on Functional Street. You walk past Larry's Barber Shop, and see that it is empty except for Larry himself. A little while later, from a distant observation point, you observe three shady individuals leave the Bada Bing club and enter Larry's Barber Shop. A police sniper contacts you by radio, and asks how many people are in Larry's. You answer "Four".

Now, on a similar assignment, you are conducting surveillance on Object-Oriented Avenue. After chatting to Mr Sharp who is alone in his shop, you retire across the street to discreetly observe. You see three dodgy characters leave The Red Lion pub and enter Sharp's Bait and Tackle. Anticipating the question, you think about how many people are in Sharp's. What is the correct answer? Four? Nope. There are seven. What? People came in the BACK DOOR. Five minutes later, your radio crackles to life with the question "How many people are in Sharp's?" How many indeed? The correct answer is: zero. What? They all left through the back door.

This story is analogous to reading code where the data flow does not follow the control flow. The control flow (or execution flow) is the order in which the code runs, which is what you follow, line by line, as you try to understand what the code does. Data flow describes the movement of data between different components. Suppose you are reading the code, and see that function_A() is executed, followed by function_B(). Further suppose that just by reading the code in front of you, you see that data generated by function_A() is passed as input into function_B(). Then we can say that the data flow follows the control flow.

Interestingly, data flow is something you don't think about unless it deviates from the execution flow. How might it do that? Through the use of evil global variables.

Evil Global Variables

Most CPUs work by executing machine code instructions and storing values in registers, which are later read by other instructions. When the first higher-level languages were developed, they tended to follow this model of doing some operation, storing the output value, then having subsequent operations read the value from its memory store. It was some time before experience with larger-scale software programs showed the dangers of this approach.

A concrete example of the use of global variables in Python:

evil_global_variable = 0

def main():
    operation_1()

    if evil_global_variable == 1:
        operation_2()
    else:
        operation_3()


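def operation_1():
    # Without this declaration, the assignment below would create a new
    # local variable and leave the global untouched.
    global evil_global_variable
    ...
    evil_global_variable = peculiar_computation()
    ...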


def operation_2():
    ...
    # Lots of computation that does not use evil_global_variable.
    ...


def operation_3():
    ...
    result = 3.1415927 * r**2 * evil_global_variable
    ...

The opacity and obfuscation are obvious in the example above. Simply by reading the code in main(), one cannot know that operation_1() sets evil_global_variable, while operation_2() does not use evil_global_variable at all, and operation_3() depends on evil_global_variable.

Thus, global variables force the reader to make much more mental effort to understand the code: You have to read every single damn line to see where global variables are being set, and remember that fact as you read on. Then much later, when you encounter code that reads that variable, you have to remember the state of that variable. So you have to keep a parallel thread in your mind. You are reading code - that is one mental thread - but now you need another mental thread to track the flow of data between functions.

The effort of reading all the code might be reasonable for a small code base, but it becomes a huge time cost as the code base gets bigger. And the cognitive task of maintaining a mental thread of the global variable's value becomes simply infeasible for large enough code bases. This will inevitably lead to bugs. Consider this scenario: function_A() dumps a result to a global variable. Some time later, function_B() reads that variable as an input. But surprise! That new bit of code your colleague wrote yesterday alters the global variable before function_B() reads it. And that is why you are debugging today...

So don't do this. Make data follow the control flow. Pass necessary data in, so that you can see it right there as you read the code that uses it. Here is the example code fixed:

def main():
    local_variable = operation_1()

    if local_variable == 1:
        operation_2()
    else:
        operation_3(local_variable)


def operation_1():
    ...
    local_variable = peculiar_computation()
    ...
    return local_variable


def operation_2():
    ...
    # Lots of computation that does not need local_variable.
    ...


def operation_3(local_variable):
    ...
    result = 3.1415927 * r**2 * local_variable
    ...

An Object-Oriented Trap

The code in the example above that uses global variables is terrible!!! So let us 'fix' it by applying the panacea of object-oriented programming!!!

class AnAllTooTypicalClass:

    def __init__(self):
        self.perfectly_innocent_class_variable = 0

    def main(self):
        self.operation_1()

        if self.perfectly_innocent_class_variable == 1:
            self.operation_2()
        else:
            self.operation_3()

    def operation_1(self):
        ...
        self.perfectly_innocent_class_variable = peculiar_computation()
        ...

    def operation_2(self):
        ...
        # Lots of computation that does not use perfectly_innocent_class_variable.
        ...

    def operation_3(self):
        ...
        result = 3.1415927 * r**2 * self.perfectly_innocent_class_variable
        ...

Much better!!! Object-oriented design saves the day!!! The evil global variable has been encapsulated, and the problems have been fixed. Right?

Sadly, no.

A careful look will reveal that the problematic structure is exactly the same; it has merely been moved down a level, from the file level to the class level. Adding an extra layer of abstraction - the class - has not changed the relationship between the troublesome variable and the functions that access it. (Note that in this example, the extra layer of abstraction adds no benefit, and has therefore made the code worse.)

We have uncovered a serious 'foot gun' of object-oriented design. Various claims about the desirability of 'data encapsulation' and the like obscure the fact that object-oriented design positively encourages the use of what are effectively global variables. Adding an extra layer of abstraction and relabelling them as 'object attributes' does not eliminate the problems they cause.

Now, the problems with the object-oriented example code above could be fixed by eliminating the object variable, and passing data directly into the methods that use it. However, it is worth stepping back to look at the issue from a higher level of abstraction, which we do next.
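Before moving on, here is a minimal sketch of what that narrower fix might look like. The class name, the stub body of peculiar_computation(), the return values, and passing r as a parameter are all assumptions made so that the example runs; the original listing elides those details.

```python
import math


def peculiar_computation():
    # Stand-in for the elided computation in the original example.
    return 3


class NoHiddenState:
    """Hypothetical rewrite: the attribute is gone, data is passed explicitly."""

    def main(self, r):
        value = self.operation_1()

        if value == 1:
            return self.operation_2()
        return self.operation_3(r, value)

    def operation_1(self):
        return peculiar_computation()

    def operation_2(self):
        # Computation that does not need the value at all.
        return None

    def operation_3(self, r, value):
        # Everything this method depends on arrives through its parameters.
        return math.pi * r**2 * value
```

Reading main() alone is now enough to see which method produces the value and which one consumes it. Note that no method uses self for anything any more, which hints that the class itself is not pulling its weight.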

When To Use Objects

Objects are very good at capturing state. They are excellent for carrying around variables and the methods (functions) that access those variables, though care must be taken to ensure that the state always remains consistent. This code construction is so useful that even strict functional languages such as Haskell have it.
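As an illustration, a small state-carrying object might look something like the sketch below. The Thermostat class and its invariant are hypothetical, chosen only to show state being kept consistent at every point where it can change:

```python
class Thermostat:
    """Hypothetical state-carrying object; invariant: low <= target <= high."""

    def __init__(self, low, high, target):
        if not low <= target <= high:
            raise ValueError("target must lie within [low, high]")
        self._low = low
        self._high = high
        self._target = target

    @property
    def target(self):
        return self._target

    def set_target(self, value):
        # Reject updates that would break the invariant; state is left unchanged.
        if not self._low <= value <= self._high:
            raise ValueError("target must lie within [low, high]")
        self._target = value
```

The object does one job: it holds a few related values and refuses to enter an inconsistent state. It contains no long-running algorithm.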

Problems arise when objects are used as containers for algorithm-like code. As the example above shows, a common anti-pattern is to have a class that carries out a complex algorithm using what are effectively global variables.

The structure of an object is inherently tree-like, and as the section Make Execution Flow Obvious argued, algorithm-like code suits linear-type structures, while data-like code suits tree-like structures.

A possible use case for an object that carries out complex algorithms is as a context for the computations, in which the object attributes are all immutable, so that the object does not have global variables but rather global constants. Then the functions can easily access the global constants. However, since there are no encapsulated variables, some would argue that this violates principles of object-oriented design, and a class should not be used in this instance. So we do not advise doing this - we merely mention this theoretical use case for completeness.

In general, just separate the concerns of algorithm-like code and data-like code, and follow these rules of thumb:

Make objects small, and concerned chiefly with carrying state. Use as few objects as possible, but no fewer - separate the different kinds of state into separate objects.

The above advice is a general corollary of the Golden Rule: Separation Of Concerns.
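One way these rules of thumb might play out in practice is sketched below. Position, Velocity, and step are hypothetical names: two small objects each carry one kind of state, while the algorithm-like code lives in a plain function whose inputs and output are visible at the call site.

```python
from dataclasses import dataclass


@dataclass
class Position:
    x: float
    y: float


@dataclass
class Velocity:
    dx: float
    dy: float


def step(position, velocity, dt):
    """Algorithm-like code as a plain function: data in, data out."""
    return Position(
        position.x + velocity.dx * dt,
        position.y + velocity.dy * dt,
    )
```

A call such as step(Position(0.0, 0.0), Velocity(1.0, 2.0), 0.5) shows everything the computation depends on, so the data flow follows the execution flow.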

🙠