
Example 1: Analysis of Linear Search

To look at how to perform analysis, we will start with a performance analysis of the following Python function for a linear search:

def linear_search(my_list, key):
    for i in range(0, len(my_list)):
        if my_list[i] == key:
            return i

    return -1

Assumptions

Looking at the Python code, we can see that there is a loop. The loop stops on one of two conditions: either it reaches the end of the list, or the function returns early because my_list[i] matched key (that is, key was found). So, what are the possibilities when I run this program?

1) key is not in the list... in that case, the loop stops when it gets to the end of the list

2) key is in the list... but where? If it is at the front, then the loop stops right away; if it is at the back, then the loop needs to go through all the elements... clearly these will affect how many operations we carry out.

The worst case scenario is that of an unsuccessful search... the loop will need to run the most times for that. If the search is successful, however, where will the key be? To perform an average case analysis, we need to consider the likelihood of finding the key at each position. Making the assumption that the key is equally likely to be in any element of the list, we should expect to go through half the list on average to find key.

We will look at each of these in turn below.
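To make these scenarios concrete before counting operations formally, here is a small sketch (the counting helper linear_search_counted below is my own illustration, not part of the original example) that reports how many == comparisons each scenario performs:

def linear_search_counted(my_list, key):
    """Same algorithm, but also reports how many == comparisons were made."""
    comparisons = 0
    for i in range(0, len(my_list)):
        comparisons += 1
        if my_list[i] == key:
            return i, comparisons
    return -1, comparisons

data = list(range(100))                    # 0, 1, ..., 99
print(linear_search_counted(data, 0))      # (0, 1)    best case: key at the front
print(linear_search_counted(data, 99))     # (99, 100) worst successful case: key at the back
print(linear_search_counted(data, -5))     # (-1, 100) unsuccessful: every element is checked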

Let n represent the size of the list my_list.
Let T(n) represent the number of operations necessary to perform linear search on an array of n items.

Next we go through the code and count how many operations are done. We can do this by simply adding a comment to each line with the total operation count for that line. That is, we count how many operations are on that line AND the number of times that particular line is executed.

def linear_search(my_list, key):
    for i in range(0, len(my_list)):    # n + 2
        if my_list[i] == key:           # n
            return i                    # 0

    return -1                           # 1
caution

In Python 3, range() and len() both run in constant time. Further, range() is called only once to determine the range of values used for looping. This is not the same as testing the continuation condition in C/C++, which happens at the top of every iteration. Here, you simply loop through the entire range of values.
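A quick way to convince yourself of this (a small sketch, not part of the original note): the range object is built once, before the loop starts, so changing the list inside the loop does not change how many iterations run.

nums = [4, 8, 15, 16, 23, 42]
for i in range(0, len(nums)):   # range(0, 6) is computed once, up front
    nums.append(0)              # growing the list does NOT add iterations
print(i)                        # prints 5 -- the loop ran exactly 6 times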

Where do the counts come from?

  • Firstly, n is the length of the list.
def linear_search(my_list, key):
    for i in range(0, len(my_list)):  # n + 2 - reason: you change the value of i each time you go through the loop.
                                      # This change is done for every value from 0 to n-1 inclusive.
                                      # Thus, i is changed n times in total. The len() call is constant,
                                      # and range() in Python 3 is also constant, so we count those as 1
                                      # each. range() is only called once to get all the values the loop
                                      # must iterate through.

        if my_list[i] == key:         # n - reason: every time through the loop you must count the == (note: this is
                                      # one operator, not two. You need two equal signs to type it out, but
                                      # it is a single operator. You can count [] as an op if you want;
                                      # if you do, do it consistently or not at all.)

            return i                  # 0 - reason: on an unsuccessful search this return never occurs

    return -1                         # 1 - reason: on an unsuccessful search, this return always occurs

Come up with the expression for T(n)

Recall that T(n) represents the total number of operations needed for an unsuccessful linear search. As such, we get T(n) by summing up the number of operations. Thus:

$$\begin{aligned} T(n) &= n + 2 + n + 1 \\ &= 2n + 3 \end{aligned}$$

The $2n$ comes from the 2 operations that we do each time through the loop: the change to i and the == comparison, each done once for every one of the $n$ elements in the list. The $+3$ comes from the 3 operations that we always have to do no matter what: the len() call, the range() call, and the final return -1.

Thus the expression for the number of operations performed is:

$T(n) = 2n + 3$

Now, imagine $n$ becoming a very, very large number. As $n$ gets bigger and bigger, the $3$ matters less and less. Thus, in the end we only care about $2n$. $2n$ is the dominating term of the polynomial. The idea is similar to this: if you had 2 dollars, then adding 3 dollars is a significant increase to the amount of money you have. However, if you had 2 million dollars, then 3 dollars matter very little. Similarly, when finding something in a list with 2 million elements by comparing against each of them, the 3 operations for the range() call, the len() call, and the return matter very little.
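As a quick numeric illustration (a small sketch of my own, not part of the original page), you can print the exact count next to the dominating term and watch the $+3$ become negligible:

n_values = [2, 20, 2_000, 2_000_000]
for n in n_values:
    exact = 2 * n + 3        # full operation count T(n)
    dominant = 2 * n         # dominating term only
    print(n, exact, dominant, exact / dominant)
# as n grows, the ratio exact / dominant gets closer and closer to 1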

Using the dominating term, we can say that the linear search algorithm has a run time that never exceeds a constant multiple of $n$. In other words, linear search is $O(n)$.

Now... can we actually prove this?

According to the formal definition of big-O:

" T(n)T(n) is O(f(n))O(f(n))" if for some constants cc and n0n_0​​, T(n)cf(n)T(n) \leq cf(n) for all nn0n \geq n_0

In other words... to prove that $T(n) = 2n + 3$ is $O(n)$, we must find 2 constants $c$ and $n_0$ such that the statement $T(n) \leq cn$ is true for all $n \geq n_0$

The important part about this... is to realize that we can pick any constants we want as long as they meet the criteria. For $c$, I can pick any number greater than 2. I will pick 3 (there isn't a good reason for this other than $3 > 2$), thus $c = 3$. I will also pick 3 for $n_0$. The reason is that starting at $n = 3$, $2n + 3 \leq 3n$
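To fill in that last step: for every $n \geq 3$ we have $3 \leq n$, so

$$2n + 3 \leq 2n + n = 3n$$

which is exactly the required inequality $T(n) \leq cn$ with $c = 3$ and $n_0 = 3$.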

We can see this in the graph below. Here, we see $T(n) = 2n + 3$ (blue) as well as $cf(n) = 3n$ (orange). At $n = 3$, the line for $2n + 3$ falls under $3n$ and never gets above it again as $n$ gets bigger and bigger.
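If you want to reproduce a plot like the one described, here is a minimal sketch (it assumes matplotlib and numpy are available; the axis range and styling are my own choices, not from the original page):

import numpy as np
import matplotlib.pyplot as plt

n = np.arange(0, 21)
plt.plot(n, 2 * n + 3, color="tab:blue", label="T(n) = 2n + 3")
plt.plot(n, 3 * n, color="tab:orange", label="cf(n) = 3n")
plt.axvline(3, color="gray", linestyle="--", label="n0 = 3")   # crossover point
plt.xlabel("n")
plt.ylabel("operations")
plt.legend()
plt.show()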

Thus, we have proven that the function is $O(n)$ because we were able to find the two constants $c$ and $n_0$ needed for $T(n)$ to be $O(n)$.

This is also the reason why it did not matter if we counted the [i] operator. If we counted that operator, our function would be: $T(n) = 3n + 3$.

The dominating term would be $3n$. We would need to pick a different $c$ that is greater than 3 (and possibly a different $n_0$), but we would still be able to find these, and thus $T(n)$ is still $O(n)$.
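For example (one possible choice among many): with $c = 4$ and $n_0 = 3$, we get $3n + 3 \leq 3n + n = 4n$ for all $n \geq 3$, so $T(n) = 3n + 3$ is still $O(n)$.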

In the previous analysis, we assumed we wouldn't find the data, and thus the worst possible case of searching through the entire list was used. However, what if the item was in the list?

Assuming that key is equally likely to be in any element, we would have to search through $n$ elements at worst (finding it at the last element) and $n/2$ elements on average.

Now, since we will at some point find the item, the return inside the if will also be performed, while the return -1 will not.

If we assume that the worst case occurs and we do not find the item until we get to the last element, our operation count would be:

def linear_search(my_list, key):
    for i in range(0, len(my_list)):    # n + 2
        if my_list[i] == key:           # n
            return i                    # 1

    return -1                           # 0

When we add up our counts, we would end up with the same function.
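Adding up these counts gives $T(n) = (n + 2) + n + 1 + 0 = 2n + 3$, exactly as before.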

If we assume we will only search half the list... then what would we get?

def linear_search(my_list, key):
    for i in range(0, len(my_list)):    # n/2 + 2
        if my_list[i] == key:           # n/2
            return i                    # 1

    return -1                           # 0

T(n) = n/2 + 2 + n/2 + 1 = n + 3

In this case, the dominating term is $n$, and thus $T(n)$ is still $O(n)$.
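To check: with $c = 2$ and $n_0 = 3$, we have $n + 3 \leq n + n = 2n$ for all $n \geq 3$, so $T(n) = n + 3$ satisfies the definition of $O(n)$.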

caution

One thing to note: if you flip between the Python code and an equivalent C/C++ version, it would seem that the number of steps is higher in the C/C++ version of the code. However, this does not mean that the C/C++ version is slower. In fact, generally speaking, you will find that C/C++ code runs faster than Python due to the way it is compiled, optimized, and executed. This is why applications where speed truly matters in real time are often written in C/C++. So you cannot compare the two in this manner! Algorithm analysis gives you performance only as a function of resource consumption; it does not translate directly into wall-clock time!