Merging Two Dictionaries¶
Merge two dictionaries together such that the resulting dictionary always retain the greater value among mappings with common keys.
Here’s an example:
# merging two dictionaries, retaining the greatest value among
# common keys
>>> dict1 = {'bananas': 7, 'apples': 3, 'pears': 14}
>>> dict2 = {'bananas': 3, 'apples': 6, 'grapes': 9}
>>> merge_max_mappings(dict1, dict2)
{'bananas': 7, 'apples': 6, 'pears': 14, 'grapes': 9}
Write a function that accepts two dictionaries and merges them in the
fashion shown above. The dictionaries’ keys need not be
strings,
and the values should be any data type that can be ordered (e.g. can be
compared using the inequality operators <
, >
, etc.).
A Simple, Buggy Solution¶
Let’s begin by writing a straightforward but flawed solution to our problem. The following function will correctly merge its two input dictionaries and is generally well-written, however, something insidious is afoot… Can you identify the bad behavior that will result from this function?
def buggy_merge_max_mappings(dict1, dict2):
# create the output dictionary, which contains all
# the mappings from `dict1`
merged = dict1
# populate `merged` with the mappings in dict2 if:
# - the key doesn't exist in `merged`
# - the value in dict2 is larger
for key in dict2:
if key not in merged or dict2[key] > merged[key]:
merged[key] = dict2[key]
return merged
Let’s first see what this function does right. Recall that iterating
over a
dictionary
will produce each of its keys one-by-one. Thus for key in dict2
loops over every key in dict2
. We then set a key-value from
dict2
mapping in merged
if that key doesn’t exist in merged or
if the value is larger than the one stored in existing mapping. Given
that merged
is initialized to have the same mappings as dict1
,
this is a correct algorithm for merging our two dictionaries based on
max-value.
The problem with our function is that we inadvertently merge dict2
into dict1
, rather than merging the two dictionaries into a new
dictionary. Recall that dictionaries are
mutable
objects and that the statement merged = dict1
simply assigns a
variable that references dict1
rather than creating a new copy of
the dictionary. Thus calling this function will mutate (change) the
state of dict1
, as demonstrated here:
>>> exam_1 = dict(Alice=99, Bob=87, Cindy=65) # equivalent to {'Alice': 99, 'Bob': 87, 'Cindy': 65}
>>> exam_2 = dict(Alice=77, Bob=90, Cindy=78)
>>> buggy_merge_max_mappings(exam_1, exam_2)
{'Alice': 99, 'Bob': 90, 'Cindy': 78}
See that the value of exam_1
has changed, and that it matches the
output of our function:
>>> exam_1
{'Alice': 99, 'Bob': 90, 'Cindy': 78}
To reiterate, this is because the statement merged = dict1
found at
the beginning of our function merely creates a reference to dict1
,
not a copy of it. In the above example exam_1
stored a class’ exam-1
scores for each student. After passing it to our function, it now stores
the max score for each student across two exams!
While we are likely to have tested that our function properly merges dictionaries as we desire, we are less likely to test that it leaves its inputs unchanged. This is a valuable lesson to be l
Takeaway
- Be mindful of object mutability and take care never to unwittingly mutate an input variable or global scope variable within a function.
- When testing a function, include a check that the function does not mutate its inputs.
A Simple, Correct Solution¶
We can easily stomp the bug in the previous function; updating
merged = dict1
with either merged = dict(dict1)
or
merged = dict1.copy()
will ensure that merged
references a new
dictionary, which we are free to update:
def simple_merge_max_mappings(dict1, dict2):
""" Merges two dictionaries based on the largest value in a given mapping.
Parameters
----------
dict1 : Dict[Any, Comparable]
dict2 : Dict[Any, Comparable]
Returns
-------
Dict[Any, Comparable]
The merged dictionary
"""
merged = dict(dict1)
for key in dict2:
if key not in merged or dict2[key] > merged[key]:
merged[key] = dict2[key]
return merged
Note the use of simple and descriptive variables (e.g. we use the
variable name key
when we iterate over the keys of a dictionary).
This, along with a good docstring, makes our code easy to read,
understand, and debug. Also note that our code is general: it makes no
presumptions about the dictionaries’ keys - they need not be strings nor
of any other particular type. Similarly, the only constraint on the
dictionaries’ values is that they can be compared against one another.
This is reflected in the docstring of our function.
Consider the importance of the ordering of the conditional statement:
if key not in merged or dict2[key] > merged[key]:
What is different if we flip the ordering of the terms? I.e.:
if dict2[key] > merged[key] or key not in merged:
The problem with this flipped ordering is that the given key may not
exist in merged
yet, thus dict2[key] > merged[key]
will raise a
KeyError
. Using the original ordering, such a case would cause
key not in merged
to return True
, and the overall expression
will return True
without evaluating the second part of the
expression (convince yourself that False or <whatever>
should will
always return False
).
A Minor Optimization¶
Supposing that your code makes heavy use of dictionary merging and that its performance is a bottleneck for the overall performance of your code, then there is a minor optimization that we can implement.
Consider a case of extreme imbalance in the sizes of our dictionaries;
suppose dict1
is contains one key while dict2
contains 10,000.
It would be preferable for our solution to manually iterate over the
smaller of these two dictionaries. We can easily accommodate this in our
function by choosing merged
to be the longer of the two dictionaries
and iterating over the other dictionary:
def opt_merge_max_mappings(dict1, dict2):
""" Merges two dictionaries based on the largest value in a given mapping.
Parameters
----------
dict1 : Dict[Any, Comparable]
dict2 : Dict[Any, Comparable]
Returns
-------
Dict[Any, Comparable]
The merged dictionary
"""
# we will iterate over `other` to populate `merged`
merged, other = (dict1, dict2) if len(dict1) > len(dict2) else (dict2, dict1)
merged = dict(merged)
for key in other:
if key not in merged or other[key] > merged[key]:
merged[key] = other[key]
return merged
Here, we make keen use of a inline if-else statement and iterable unpacking, which helps keep our code succinct despite the logic that we have added to it. See that
merged, other = (dict1, dict2) if len(dict1) > len(dict2) else (dict2, dict1)
is equivalent to:
if len(dict1) < len(dict2):
merged = dict1
other = dict2
else:
merged = dict2
other = dict1
We can use the timeit
magic
command
in either a Jupyter notebook or an IPython console to time our functions
(Note: each timeit
must be run in a separate notebook cell, and
%%timeit
must be the topmost command in the cell).
a = {}
b = dict(zip(range(10000), range(10000))) # {1 : 1, 2: 2, ..., 9999:9999}
%%timeit
simple_merge_max_mappings(a, b)
2.05 ms ± 90.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
opt_merge_max_mappings(a, b)
455 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is a relatively small speedup despite the rather stark example we cooked up.
Extended Problem: Merging Arbitrary Numbers of Dictionaries¶
There is no reason that our function should only be able to merge two dictionaries, it should be easy to accommodate an arbitrary number of inputs:
>>> a = dict(a=0, b=100, c=3)
>>> b = dict(a=10, b=10)
>>> c = dict(c=50)
>>> d = dict(d=-70)
>>> e = dict()
>>> merge_max_mappings(a, b, c, d, e)
{'a': 10, 'b': 100, 'c': 50, 'd': -70}
Before you write your solution, consider the following: 1. How do you write a function signature so that it can handle an arbitrary number of input dictionaries? 2. How should your function handle zero inputs? How about one input?
Generalized Solution¶
Addressing point #1, we will want to use the *args
syntax in our
function signature so that it can accommodate an arbitrary number of
dictionaries.
Thus all of the dictionaries fed to our function will be packed into a
tuple that can be accessed via args
(or whatever variable name we
use in our signature).
Regarding point #2, we can handle the case of receiving no inputs by simply returning an empty dictionary. We will also see that handling the case of a single input is more subtle than one might guess. This point will be discussed further once the solution is presented.
Our solution is to create an empty dictionary, merged
, and to simply
iterate over each mapping of each input dictionary and perform our
merging as we did above:
def merge_max_mappings(*dicts):
""" Merges an arbitrary number of dictionaries based on the
maximum value in a given mapping.
Parameters
----------
dicts : Dict[Any, Comparable]
Returns
-------
Dict[Any, Comparable]
The merged dictionary
"""
merged = {}
for d in dicts: # `dicts` is a tuple storing the input dictionaries
for key in d:
if key not in merged or d[key] > merged[key]:
merged[key] = d[key]
return merged
Handling Zero Inputs¶
See that our function returns an empty dictionary if it is not passed
any inputs, this is because dicts
will be an empty tuple and thus
our loop over it will immediately exit, returning an unpopulated
merged
:
>>> merge_max_mappings()
{}
While you may have thought to have returned None
in the case of
there being no dictionaries to merge, returning an empty dictionary is
much better behavior as code downstream from our function will only need
to accommodate one type of output. Additionally, our function’s
docstring promises that it will return a dictionary and we should always
abide by this contract.
Handling One Input¶
In the case that our function is passed a single dictionary, our function effectively makes a (shallow) copy of that dictionary and returns it. Suppose we tried to be clever and wrote our function to pass a single dictionary through; e.g.
def bad_merge_max_mappings(*dicts):
if len(dicts) == 1:
return dicts[0]
merged = {}
for d in dicts:
for key in d:
if key not in merged or d[key] > merged[key]:
merged[key] = d[key]
return merged
What is wrong with this solution? The problem is similar to the one that we considered above; our function always returns a dictionary that is independent from its inputs, meaning that mutating the merged dictionary has no impact on any of the inputs dictionaries. This is no longer the case when we receive a single dictionary - here the “merged” dictionary is simply a reference to the input. Any subsequent mutation of the output dictionary will also mutate the input dictionary, and vice versa. This unanticipated behavior could create very hard-to-find bugs in your code!
It is best to not try to handle specialized cases like this, when the general code behaves appropriately to begin with.
Extra Challenges¶
- Write tests for your functions where the dictionary keys aren’t strings and the values aren’t numbers.
- What if you wanted to merge values based on a criterion other than the maximum value? Try rewriting the function so that a comparison function can passed in as an argument. Remember that functions, once defined, are objects just like integers and strings - they can be passed as arguments into other functions.
The following code is correct, but really bad:
def gross_merge_max_mappings(*dicts):
"""merges dicts"""
merged = {}
for i in range(len(dicts)):
for j in dicts[i]:
if not (j in list(merged)):
merged[j] = dicts[i][j]
elif dicts[i][j] > merged[j]:
merged[j] = dicts[i][j]
else:
continue
return merged
Compare this code to the solution that we devised, and enumerate all of the stylistic mistakes that are made here. Appreciate how simple and readable our code is by comparison and make a promise to yourself that you will not write code like this.