Merging Two Dictionaries

Merge two dictionaries together such that the resulting dictionary always retain the greater value among mappings with common keys.

Here’s an example:

# merging two dictionaries, retaining the greatest value among
# common keys
>>> dict1 = {'bananas': 7, 'apples': 3, 'pears': 14}
>>> dict2 = {'bananas': 3, 'apples': 6, 'grapes': 9}
>>> merge_max_mappings(dict1, dict2)
{'bananas': 7, 'apples': 6, 'pears': 14, 'grapes': 9}

Write a function that accepts two dictionaries and merges them in the fashion shown above. The dictionaries’ keys need not be strings, and the values should be any data type that can be ordered (e.g. can be compared using the inequality operators <, >, etc.).

A Simple, Buggy Solution

Let’s begin by writing a straightforward but flawed solution to our problem. The following function will correctly merge its two input dictionaries and is generally well-written, however, something insidious is afoot… Can you identify the bad behavior that will result from this function?

def buggy_merge_max_mappings(dict1, dict2):
    # create the output dictionary, which contains all
    # the mappings from `dict1`
    merged = dict1

    # populate `merged` with the mappings in dict2 if:
    #   - the key doesn't exist in `merged`
    #   - the value in dict2 is larger
    for key in dict2:
        if key not in merged or dict2[key] > merged[key]:
            merged[key] = dict2[key]
    return merged

Let’s first see what this function does right. Recall that iterating over a dictionary will produce each of its keys one-by-one. Thus for key in dict2 loops over every key in dict2. We then set a key-value from dict2 mapping in merged if that key doesn’t exist in merged or if the value is larger than the one stored in existing mapping. Given that merged is initialized to have the same mappings as dict1, this is a correct algorithm for merging our two dictionaries based on max-value.

The problem with our function is that we inadvertently merge dict2 into dict1, rather than merging the two dictionaries into a new dictionary. Recall that dictionaries are mutable objects and that the statement merged = dict1 simply assigns a variable that references dict1 rather than creating a new copy of the dictionary. Thus calling this function will mutate (change) the state of dict1, as demonstrated here:

>>> exam_1 = dict(Alice=99, Bob=87, Cindy=65)  # equivalent to {'Alice': 99, 'Bob': 87, 'Cindy': 65}
>>> exam_2 = dict(Alice=77, Bob=90, Cindy=78)
>>> buggy_merge_max_mappings(exam_1, exam_2)
{'Alice': 99, 'Bob': 90, 'Cindy': 78}

See that the value of exam_1 has changed, and that it matches the output of our function:

>>> exam_1
{'Alice': 99, 'Bob': 90, 'Cindy': 78}

To reiterate, this is because the statement merged = dict1 found at the beginning of our function merely creates a reference to dict1, not a copy of it. In the above example exam_1 stored a class’ exam-1 scores for each student. After passing it to our function, it now stores the max score for each student across two exams!

While we are likely to have tested that our function properly merges dictionaries as we desire, we are less likely to test that it leaves its inputs unchanged. This is a valuable lesson to be l

Takeaway

  • Be mindful of object mutability and take care never to unwittingly mutate an input variable or global scope variable within a function.
  • When testing a function, include a check that the function does not mutate its inputs.

A Simple, Correct Solution

We can easily stomp the bug in the previous function; updating merged = dict1 with either merged = dict(dict1) or merged = dict1.copy() will ensure that merged references a new dictionary, which we are free to update:

def simple_merge_max_mappings(dict1, dict2):
    """ Merges two dictionaries based on the largest value in a given mapping.

    Parameters
    ----------
    dict1 : Dict[Any, Comparable]
    dict2 : Dict[Any, Comparable]

    Returns
    -------
    Dict[Any, Comparable]
        The merged dictionary
    """
    merged = dict(dict1)
    for key in dict2:
        if key not in merged or dict2[key] > merged[key]:
            merged[key] = dict2[key]
    return merged

Note the use of simple and descriptive variables (e.g. we use the variable name key when we iterate over the keys of a dictionary). This, along with a good docstring, makes our code easy to read, understand, and debug. Also note that our code is general: it makes no presumptions about the dictionaries’ keys - they need not be strings nor of any other particular type. Similarly, the only constraint on the dictionaries’ values is that they can be compared against one another. This is reflected in the docstring of our function.

Consider the importance of the ordering of the conditional statement:

if key not in merged or dict2[key] > merged[key]:

What is different if we flip the ordering of the terms? I.e.:

if dict2[key] > merged[key] or key not in merged:

The problem with this flipped ordering is that the given key may not exist in merged yet, thus dict2[key] > merged[key] will raise a KeyError. Using the original ordering, such a case would cause key not in merged to return True, and the overall expression will return True without evaluating the second part of the expression (convince yourself that False or <whatever> should will always return False).

A Minor Optimization

Supposing that your code makes heavy use of dictionary merging and that its performance is a bottleneck for the overall performance of your code, then there is a minor optimization that we can implement.

Consider a case of extreme imbalance in the sizes of our dictionaries; suppose dict1 is contains one key while dict2 contains 10,000. It would be preferable for our solution to manually iterate over the smaller of these two dictionaries. We can easily accommodate this in our function by choosing merged to be the longer of the two dictionaries and iterating over the other dictionary:

def opt_merge_max_mappings(dict1, dict2):
    """ Merges two dictionaries based on the largest value in a given mapping.

    Parameters
    ----------
    dict1 : Dict[Any, Comparable]
    dict2 : Dict[Any, Comparable]

    Returns
    -------
    Dict[Any, Comparable]
        The merged dictionary
    """
    # we will iterate over `other` to populate `merged`
    merged, other = (dict1, dict2) if len(dict1) > len(dict2) else (dict2, dict1)
    merged = dict(merged)

    for key in other:
        if key not in merged or other[key] > merged[key]:
            merged[key] = other[key]
    return merged

Here, we make keen use of a inline if-else statement and iterable unpacking, which helps keep our code succinct despite the logic that we have added to it. See that

merged, other = (dict1, dict2) if len(dict1) > len(dict2) else (dict2, dict1)

is equivalent to:

if len(dict1) < len(dict2):
    merged = dict1
    other = dict2
else:
    merged = dict2
    other = dict1

We can use the timeit magic command in either a Jupyter notebook or an IPython console to time our functions (Note: each timeit must be run in a separate notebook cell, and %%timeit must be the topmost command in the cell).

a = {}
b = dict(zip(range(10000), range(10000)))  # {1 : 1, 2: 2, ..., 9999:9999}
%%timeit
simple_merge_max_mappings(a, b)
2.05 ms ± 90.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
opt_merge_max_mappings(a, b)
455 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This is a relatively small speedup despite the rather stark example we cooked up.

Extended Problem: Merging Arbitrary Numbers of Dictionaries

There is no reason that our function should only be able to merge two dictionaries, it should be easy to accommodate an arbitrary number of inputs:

>>> a = dict(a=0, b=100, c=3)
>>> b = dict(a=10, b=10)
>>> c = dict(c=50)
>>> d = dict(d=-70)
>>> e = dict()
>>> merge_max_mappings(a, b, c, d, e)
{'a': 10, 'b': 100, 'c': 50, 'd': -70}

Before you write your solution, consider the following: 1. How do you write a function signature so that it can handle an arbitrary number of input dictionaries? 2. How should your function handle zero inputs? How about one input?

Generalized Solution

Addressing point #1, we will want to use the *args syntax in our function signature so that it can accommodate an arbitrary number of dictionaries. Thus all of the dictionaries fed to our function will be packed into a tuple that can be accessed via args (or whatever variable name we use in our signature).

Regarding point #2, we can handle the case of receiving no inputs by simply returning an empty dictionary. We will also see that handling the case of a single input is more subtle than one might guess. This point will be discussed further once the solution is presented.

Our solution is to create an empty dictionary, merged, and to simply iterate over each mapping of each input dictionary and perform our merging as we did above:

def merge_max_mappings(*dicts):
    """ Merges an arbitrary number of dictionaries based on the
    maximum value in a given mapping.

    Parameters
    ----------
    dicts : Dict[Any, Comparable]

    Returns
    -------
    Dict[Any, Comparable]
        The merged dictionary
    """
    merged = {}
    for d in dicts:  # `dicts` is a tuple storing the input dictionaries
        for key in d:
            if key not in merged or d[key] > merged[key]:
                merged[key] = d[key]
    return merged

Handling Zero Inputs

See that our function returns an empty dictionary if it is not passed any inputs, this is because dicts will be an empty tuple and thus our loop over it will immediately exit, returning an unpopulated merged:

>>> merge_max_mappings()
{}

While you may have thought to have returned None in the case of there being no dictionaries to merge, returning an empty dictionary is much better behavior as code downstream from our function will only need to accommodate one type of output. Additionally, our function’s docstring promises that it will return a dictionary and we should always abide by this contract.

Handling One Input

In the case that our function is passed a single dictionary, our function effectively makes a (shallow) copy of that dictionary and returns it. Suppose we tried to be clever and wrote our function to pass a single dictionary through; e.g.

def bad_merge_max_mappings(*dicts):
    if len(dicts) == 1:
        return dicts[0]
    merged = {}
    for d in dicts:
        for key in d:
            if key not in merged or d[key] > merged[key]:
                merged[key] = d[key]
    return merged

What is wrong with this solution? The problem is similar to the one that we considered above; our function always returns a dictionary that is independent from its inputs, meaning that mutating the merged dictionary has no impact on any of the inputs dictionaries. This is no longer the case when we receive a single dictionary - here the “merged” dictionary is simply a reference to the input. Any subsequent mutation of the output dictionary will also mutate the input dictionary, and vice versa. This unanticipated behavior could create very hard-to-find bugs in your code!

It is best to not try to handle specialized cases like this, when the general code behaves appropriately to begin with.

Extra Challenges

  • Write tests for your functions where the dictionary keys aren’t strings and the values aren’t numbers.
  • What if you wanted to merge values based on a criterion other than the maximum value? Try rewriting the function so that a comparison function can passed in as an argument. Remember that functions, once defined, are objects just like integers and strings - they can be passed as arguments into other functions.

The following code is correct, but really bad:

def gross_merge_max_mappings(*dicts):
    """merges dicts"""
    merged = {}
    for i in range(len(dicts)):
        for j in dicts[i]:
            if not (j in list(merged)):
                merged[j] = dicts[i][j]
            elif dicts[i][j] > merged[j]:
                merged[j] = dicts[i][j]
            else:
                continue
    return merged

Compare this code to the solution that we devised, and enumerate all of the stylistic mistakes that are made here. Appreciate how simple and readable our code is by comparison and make a promise to yourself that you will not write code like this.