# Data Structures (Part I): Introduction¶

Here we survey Python’s built-in data structures. You should already be familiar with its lists and tuples, two data structures that facilitate working with sequential data. Two other critical, built-in data structures to be introduced are:

- dictionary: a mapping from “keys” to “values”
- sets: an unordered collection of items that can be used to perform set-algebraic operations (e.g. union and intersection)

These data structures are not merely convenient constructs with some nice pre-written functionality; they also provide an interface to some optimized core utilities that have been written in C (or whatever language your Python interpreter is written in). For example, let’s write a function that checks if an item is contained within an iterable:

```
def is_in(seq, target):
""" Returns True if `target` is contained in `seq`."""
for item in seq:
if item == target:
return True
return False
```

This function mirrors the C-algorithm that Python uses “under the hood” for checking for membership in a list (assuming you are using the CPython interpreter, which you almost definitely are). Because their function is implemented “at a lower level”, and need not be interpreted, we expect it to be faster than ours:

```
>>> x = [1, "moo", 3, True, 5, None, 7, 8]
>>> is_in(x, -1) # takes 980 nanoseconds on my machine
False
>>> -1 in x # takes 320 nanoseconds on my machine
False
```

Here, Python’s built-in sequence-membership function is 3x faster than using our own function. Furthermore, it will be important to know the advantages provided by each of the data structures. For instance, testing for membership in a set is even faster than is checking for membership in a list:

```
# test for membership in a list
>>> -1 in [1, "moo", 3, True, 5, None, 7, 8] # takes 295 nanoseconds on my machine
False
# test for membership in a set
>>> -1 in {1, "moo", 3, True, 5, None, 7, 8} # takes 65 nanoseconds on my machine
False
```

We get a 4.5x speedup in our membership test just by using a set instead of a list, because the use of a set permits Python to use an entirely different algorithm for checking for membership. On our end, we merely replaced square brackets with curly braces! Hopefully this is sufficient motivation for learning about Python’s data structures and the algorithms that they utilize “under the hood”.

**Takeaway**:

Python’s data structures come with a wealth of built-in functionality. Furthermore, understanding where each data structure “shines” is critical for writing efficient Python code. It is not necessary to memorize this information, but you should know that it exists and should be referenced frequently.

## Describing Algorithm Complexity¶

In order to meaningfully compare the relative efficiency of algorithms, it is useful to summarize how algorithms “scale” with problem size. Two sorting algorithms may be comparable when sorting tens of items, and yet they may have wildly different performances when sorting thousands of items.

“Big-O” notation allows us to denote how an algorithm’s run time scales against problem size. Specifically, it signifies the “worst-case scenario” performance of an algorithm.

Let’s take, for instance, the `is_in`

function that we wrote at the
beginning of this section. In it, we iterate over a collection to check
if it contains a specific item. The worst-case scenario for this
algorithm is when the item is not a member of the collection at all - we
have to iterate over the entire collection before we can conclude that
it does not possess our item. So if we increase the collection to be
\(n\) times larger in size, it should take \(n\) times as long
to iterate over it to determine that the item is not a member of the
collection (again, dealing with the worst-case scenario). Because the
worst-case scenario run time of `is_in`

scales linearly with the size
of the collection, \(n\), we denote it’s run time complexity, using
big-O notation, as \(\mathcal{O}(n)\).

Now suppose we did a truly terrible job writing a membership algorithm, and performed a nested iteration over our collection:

```
def is_in_slow(seq, target):
""" Returns True if `target` is contained in `seq`."""
for item in seq:
# for each item in seq, iterate over seq in its entirety!
for item2 in seq:
if item == target:
return True
return False
```

For each item in `seq`

we iterate over `seq`

again, thus in the
worst-case scenario we need to iterate over \(n\) items, \(n\)
separate times - a total of \(n^{2}\) “steps” in our algorithm. Thus
we would say that `is_in_slow`

is a \(\mathcal{O}(n^{2})\)
algorithm: whereas doubling size of `seq`

would increase the run time
of our \(\mathcal{O}(n)\) algorithm by a factor of 2 (linear
scaling), it would increase the run time of this
\(\mathcal{O}(n^{2})\) algorithm by 4 (quadratic scaling).

Here is a more formal description of this notation:

**“Big-O” Notation**: Suppose that \(n\) denotes the “size” of the
input to an algorithm, and that the mathematical expression \(f(n)\)
describes how many computational steps the algorithm must take in its
*worst-case scenario*, given that input. Then the algorithm’s run time
complexity can be denoted using the **“Big-O”** notation:
\(\mathcal{O}(f(n))\).

A few important notes:

- We only care about the “highest-order” term in \(f(n)\). That is, \(\mathcal{O}(n + n^{2})\) should just be written as \(\mathcal{O}(n^{2})\).
- We never care about constant factors in our scaling. That is, even if an algorithm iterates over a sequence twice, its big-O complexity should be written as \(\mathcal{O}(n)\), rather than \(\mathcal{O}(2n)\).
- An algorithm whose run time
*does not depend on the size of its input*is a \(\mathcal{O}(1)\) algorithm.- Example: a function that returns the second element from a list.

- There are more nuanced methods for analyzing algorithm complexity than solely considering the worst-case scenario, which can be overly pessimistic. Know that “big-O” notation can be used to convey mean performance, amortized performance, and other types of analysis. Here, we will simply stick to the worst-case analysis.

**Takeaway**:

We will be using the “big-O” notation, \(\mathcal{O}(f(n))\), to summarize the performance of the algorithms used by Python’s data structures.

## Sequential Data Structures: Lists and Tuples¶

The “Sequence Types” section already introduced lists and tuples. Recall
that both provide the same interface for accessing and summarizing the
contents of a heterogeneous sequence of objects. However, a list can be
mutated - updated, removed from, and added to - whereas a tuple cannot
be mutated. Thus a list is *mutable*, whereas a tuple is *immutable*.
Here you will find a summary of the algorithmic complexities of many of
the built-in functions that work on sequential data structures.

For a complete rundown of the functions available to Python’s list, go here.

### List/Tuple Complexities¶

Let `seq`

represent a **list or tuple** of length \(n\); \(i\)
and \(j\) are integers drawn randomly from the interval
\([0, n-1]\); \(k\) is the length of any subsequence involved in
the operation. The following is a summary of the complexities associated
with various common operations using lists and tuple (according to their
implementations in CPython):

Operation | Complexity | Explanation |
---|---|---|

`len(seq)` |
O(1) | Return the number of items in the sequence |

`seq[i]` |
O(1) | Retrieve any item from the sequence |

`seq[i:j]` |
O(k) | Retrieve a length-k slice from the sequence |

`for item in seq..` |
O(n) | Iterate over the sequence |

`obj in seq` |
O(n) | Check if `obj` is a
member of `seq` |

`seq.count(obj)` |
O(n) | Count the number of
occurrences of
`obj` in `seq` |

`seq.index(obj)` |
O(n) | Return position-index
of `obj` in `seq` |

### List Complexities¶

Here we consider some mutating operations, like `append`

, that are
available to lists and not tuples. It is important to note that lists
are implemented such that:

- operations that add-to or remove-from the
*end*of the list are \(\mathcal{O}(1)\) - operations that add-to or remove-from the
*beginning*of the list are \(\mathcal{O}(n)\)

Let `my_list`

represent a list of length \(n\), and `i`

is an
integer drawn randomly from the interval \([0, n-1]\). The following
is a summary of the complexities associated with various operations
using a list (according to its implementation in CPython):

Operation | Complexity | Explanation |
---|---|---|

`my_list[i] = obj` |
O(1) | Set the ith entry of the list with a new object. |

``my_list.append(obj) `` | O(1) | Append a new object to the end of the list. |

`my_list.pop()` |
O(1) | Remove the object
from the end of the
list. |

`my_list.pop(0)` |
O(n) | Remove the object
from the beginning
of the list. |

`my_list.sort()` |
O(nlog(n)) | Return a sorted version of the list. |