# Measuring the Accuracy of a Classification Model¶

Suppose that we are working on a project in which we have some model
that can process an image and classify its content. For example, my
`cat_dog_goose_other` function tries to classify whether a picture is
of a cat (class 0), a dog (class 1), a goose (class 2), or something
else (class 3). We want to measure the *accuracy* of our classifier.
That is, we want to feed it a series of images whose contents are known
and tally the number of times the model’s prediction matches the true
content of an image. The accuracy is the fraction of images that the
model classifies correctly.
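
Written as a formula: if \(\hat{y}_i\) denotes the model's predicted class for image \(i\) and \(y_i\) that image's true class, then over \(N\) images

\[\text{accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\hat{y}_i = y_i\big]\]

where the indicator \(\mathbb{1}[\cdot]\) is \(1\) when the prediction matches the true label and \(0\) otherwise.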

For each image we feed the `cat_dog_goose_other` model, it will
produce four **scores** - one score for each class. The model was
designed such that the class with the highest score corresponds to its
prediction. There are no constraints on the values the scores can take.
For example, if the model processes one image it will return a
shape-\((1, 4)\) score-array:

```
>>> scores = cat_dog_goose_other(image)
# processing one image produces a 1x4 array of classification scores
>>> scores
array([[-10, 33, 580, 100]])
```

Here, our model has predicted that this is a picture of a goose, since
the score associated with class 2 (`scores[0, 2]`) is the largest value.
In general, if we pass `cat_dog_goose_other` an array of \(N\)
images, it will return a shape-\((N, 4)\) array of classification
scores - each of the \(N\) images has \(4\) scores associated
with it.

Because we are measuring our model’s accuracy, we have curated a set of
images whose contents are known. That is, we have a true **label** for
each image, which is encoded as a class-ID. For example, a picture of a
cat would have the label `0` associated with it, a picture of a dog
would have the label `1`, and so on. Thus, a stack of \(N\) images
would have associated with it a shape-\((N,)\) array of integer
labels; each label falls within \([0, 4)\).

Suppose we have passed our model five images, and it produced the following scores:

```
# Classification scores produced by `cat_dog_goose_other`
# on five images. A shape-(5, 4) array.
>>> import numpy as np
>>> scores = np.array([[ 30,  1, 10, 80],   # prediction: other
...                    [-10, 20,  0, -5],   # prediction: dog
...                    [ 27, 50,  9, 30],   # prediction: dog
...                    [ -1,  0, 84,  3],   # prediction: goose
...                    [  5,  2, 10,  0]])  # prediction: goose
```

And suppose that the true labels for these five images are:

```
# truth: cat, dog, dog, goose, other
>>> labels = np.array([0, 1, 1, 2, 3])
```

Our model classified three out of five images correctly; thus, our accuracy function should return 0.6:

```
>>> classification_accuracy(scores, labels)
0.6
```

To generalize this problem, assume that your classifier is dealing with \(K\) classes (instead of \(4\)). Complete the following function.

**Tip:** You will find it useful to leverage NumPy's `numpy.argmax` function.

```
def classification_accuracy(classification_scores, true_labels):
    """
    Returns the fractional classification accuracy for a batch of N predictions.

    Parameters
    ----------
    classification_scores : numpy.ndarray, shape=(N, K)
        The scores for K classes, for a batch of N pieces of data
        (e.g. images).

    true_labels : numpy.ndarray, shape=(N,)
        The true label for each datum in the batch: each label is an
        integer in the domain [0, K).

    Returns
    -------
    float
        (num_correct) / N
    """
    # YOUR CODE HERE
    pass
```

## Unvectorized Solution¶

A simple approach to this problem is to first loop over the rows of our
classification scores. We know that each such row stores the scores for
each class for a particular data point, and that the *index* of the
highest score in that row gives us the predicted label for that data
point (e.g. image in our hypothetical use-case). We can then directly
compare these predicted labels with the true labels to compute the
accuracy.

We can use the function `numpy.argmax` to get the index of the highest
score, and thus the predicted class-ID, for each data point. Recall that
NumPy arrays use row-major traversal ordering, so performing a for-loop
over `classification_scores` will yield one row of the array at a time.
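
As a quick, self-contained sketch of this behavior, iterating over a 2D array produces its rows one at a time:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# a for-loop over a 2D array visits one row (a 1D array) per iteration
for row in x:
    print(row)
# [1 2]
# [3 4]
# [5 6]
```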

```
pred_labels = []  # will store the N predicted class-IDs
for row in classification_scores:
    # store the index associated with the highest score for each datum
    pred_labels.append(np.argmax(row))
```

Next, we need to count the fraction of predicted class-IDs that match the true labels.

```
num_correct = 0
for i in range(len(pred_labels)):
    if pred_labels[i] == true_labels[i]:
        num_correct += 1
```

Or we can make use of a generator expression and the built-in `zip` to be much more succinct:

```
# recall: int(True) -> 1, int(False) -> 0
num_correct = sum(p == t for p, t in zip(pred_labels, true_labels))
```

We can formally write this out into the following function:

```
def unvectorized_accuracy(classification_scores, true_labels):
    """
    Returns the fractional classification accuracy for a batch of N predictions.

    Parameters
    ----------
    classification_scores : numpy.ndarray, shape=(N, K)
        The scores for K classes, for a batch of N pieces of data
        (e.g. images).

    true_labels : numpy.ndarray, shape=(N,)
        The true label for each datum in the batch: each label is an
        integer in the domain [0, K).

    Returns
    -------
    float
        (num_correct) / N
    """
    pred_labels = []  # will store the N predicted class-IDs
    for row in classification_scores:
        pred_labels.append(np.argmax(row))

    num_correct = 0
    for i in range(len(pred_labels)):
        if pred_labels[i] == true_labels[i]:
            num_correct += 1
    return num_correct / len(true_labels)
```

Testing against our example from above:

```
>>> unvectorized_accuracy(scores, labels)
0.6
```

Hooray! We have a working accuracy function! However, this function can be greatly simplified and optimized by vectorizing it.

## Vectorized Solution¶

`numpy.argmax` is one of NumPy's vectorized sequential functions.
As such, it accepts `axis` as a keyword argument. This means that,
instead of calling `np.argmax` on each row of `classification_scores`
in a for-loop, we can simply instruct `np.argmax` to operate *across
the columns of each row of the array* by specifying `axis=1`.

```
# returns the column-index of the max value
# within each row of `classification_scores`
pred_labels = np.argmax(classification_scores, axis=1)
```

This simple expression eliminates our first for-loop entirely.
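
For instance, applying this to the shape-(5, 4) `scores` array from earlier (reproduced here so the snippet is self-contained) recovers all five predictions at once:

```python
import numpy as np

scores = np.array([[ 30,  1, 10, 80],
                   [-10, 20,  0, -5],
                   [ 27, 50,  9, 30],
                   [ -1,  0, 84,  3],
                   [  5,  2, 10,  0]])

# column-index of the max value within each row
pred_labels = np.argmax(scores, axis=1)
print(pred_labels)  # [3 1 1 2 2] -> other, dog, dog, goose, goose
```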

Next, we can use NumPy’s *vectorized logical operations*, specifically
`==`, to get a boolean-valued array that stores `True` wherever the
predicted labels match the true labels and `False` everywhere else.
Recall that `True` behaves like `1` and `False` like `0`. Thus, we can
call `np.mean` on our resulting boolean-valued array to compute the
number of correct predictions divided by the total number of
predictions. We can thus vectorize our second for-loop with:

```
# computes the fraction of correctly predicted labels
frac_correct = np.mean(pred_labels == true_labels)
```
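
Continuing with the five-image example, the boolean comparison and its mean work out as follows (the predicted labels are those produced by `np.argmax(scores, axis=1)` earlier):

```python
import numpy as np

pred_labels = np.array([3, 1, 1, 2, 2])  # predictions from the earlier example
true_labels = np.array([0, 1, 1, 2, 3])  # the corresponding true labels

matches = pred_labels == true_labels  # shape-(5,) boolean array
print(matches)           # [False  True  True  True False]
print(np.mean(matches))  # 0.6
```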

All together, making keen use of vectorization allows us to write our
classification accuracy function *in a single line of code*.

```
def classification_accuracy(classification_scores, true_labels):
    """
    Returns the fractional classification accuracy for a batch of N predictions.

    Parameters
    ----------
    classification_scores : numpy.ndarray, shape=(N, K)
        The scores for K classes, for a batch of N pieces of data
        (e.g. images).

    true_labels : numpy.ndarray, shape=(N,)
        The true label for each datum in the batch: each label is an
        integer in the domain [0, K).

    Returns
    -------
    float
        (num_correct) / N
    """
    return np.mean(np.argmax(classification_scores, axis=1) == true_labels)
```
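
One small caveat worth noting: `np.mean` returns a NumPy scalar (`numpy.float64`) rather than a built-in `float`. If a plain Python float is strictly required, the result can be cast explicitly; a minimal sketch:

```python
import numpy as np

def classification_accuracy(classification_scores, true_labels):
    # cast the NumPy scalar returned by np.mean to a built-in float
    return float(np.mean(np.argmax(classification_scores, axis=1) == true_labels))

scores = np.array([[ 30,  1, 10, 80],
                   [-10, 20,  0, -5],
                   [ 27, 50,  9, 30],
                   [ -1,  0, 84,  3],
                   [  5,  2, 10,  0]])
labels = np.array([0, 1, 1, 2, 3])
print(classification_accuracy(scores, labels))  # 0.6
```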

Not only is this cleaner to look at, but it was also simpler and less error-prone to write. Moreover, it is much faster than our unvectorized solution: given \(N = 10{,}000\) data points and \(K = 100\) classes, our vectorized solution is roughly \(40\times\) faster.

(The following “time-it” code blocks must each be run in an independent
cell of a Jupyter notebook or IPython console; `%%timeit` must be the
topmost command in the cell.)

```
>>> N = 10000
>>> K = 100
>>> scores = np.random.rand(N, K)
>>> labels = np.random.randint(low=0, high=K, size=N)
```

```
>>> %%timeit
... unvectorized_accuracy(scores, labels)
39.5 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
>>> %%timeit
... classification_accuracy(scores, labels)
1.6 ms ± 7.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```