Searching
Searching is the process of finding particular piece of data (the key) within a data structure. Some data structures will support faster searching by the way that the data is stored and organized.
In this section of the notes we are concerned with how to search a list. In particular we are looking at lists that are array like. For example, Python lists are "array like", as are vectors from the C++ standard library. The list from C++ standard library however is not array based so this is not the type of list we are considering in this chapter. What we want are data structures that provide fast random access to any element given its index.
Linear Search
The linear search algorithm is a pretty standard algorithm that you have probably written at some point in your previous programming courses. You may not have called it linear search, but it is likely that you have written it. Here we will formalize what the function does.
The linear search algorithm is given a list of values and a key. It returns the first index of where the key is found or -1 if the key is not part of the list. We use -1 to indicate the state of not finding the value because 0 is a perfectly valid index into the list.
The code for this is pretty straight forward. Start at the beginning of the list and search until you either find the item or you reach the end of the list.
- Python
- C++
def linear_search(my_list, key):
for i in range(0, len(my_list)):
if my_list[i] == key:
return i
return -1
template <class TYPE>
int linearSearch(const vector<TYPE>& arr, const TYPE& key){
int rc=-1;
for(int i=0;i<arr.size()&& rc==-1;i++){
if(arr[i]==key){
rc=i;
}
}
return rc;
}
Performance of a linear search
A linear search has a run time of O(n). You saw this analysis in the algorithms analysis part of the notes.
If you were to sort the array and then, perform a linear search (where you would stop when you either find the key or find a value in the list that is bigger than key,) it's not worth it as your overall cost will be more than O(n) (beacues of sorting first!). Assuming that your key can be found anywhere in the array (i.e. equal chance of being smallest/biggest/second smallest/etc.), finding a value, on average, still requires that you go through half the array, and at the worst case, the entire array. The search part of the process is still, therefore, linear.
Binary Search
Binary search is a search that can only be performed on lists that are sorted AND have random access to each of its elements.
Consider two different number guessing games:
In game 1 you are asked to guess a number between the values 1 to 1000. After every guess you are told whether you are correct or incorrect. That is it, no other information other than whether you are right or wrong.
In game 2 you are asked to guess a number between the values 1 to 1000. After every guess you are told that the guess is either correct, too high, or too low.
Which game is it easier to find the correct number?
In the first game, you are only told whether your guess was right or wrong. Thus, any guess that you make can eliminate only one number from consideration. Every other number is still possible. Initially you had 1000 options to choose from. After making a guess, if you are wrong, you have 999 options left to choose from. After two guesses, 998 possibilities. This is linear search. After you look at one element, you eliminate the possibility that the value is there... but it is possible for value to be in every other element.
In the second game, you are told that not only whether the guess is correct or in correct but also whether the value is too high or too low. This allows you to eliminate many more numbers. If you guess 500 as your first guess, then even if the number is not 500, you can immediate either eliminate all numbers bigger than 500 or all numbers smaller than 500. Repeatedly doing this, always guessing at the middle of the remaining valid range will allow you to continue to half the data set with each guess.
This is why the second game is much easier... you are zeroing in on the value you want.
Preconditions of a binary search
To perform a binary search requires two things:
- the list must be sorted. This allows us to split the array into two pieces, one where the key might be found and the other where the key definitely can't be found.
- the second requirement is the list must allow fast random access (ie be array-like). That is getting to any element of the list takes the same amount of time. In the list from the C++ standard library this is not the case.
The algorithm
The binary search algorithm goes like this: Track the range of possible indexes by storing the first/last index where key may be found initially this is 0 and (length of list) - 1. Calculate the mid point index between those first/last indexes look at the value at the middle element if we found it we are done if the key is smaller than middle element, then key can only be found between first index and element before middle element. Thus, set last index to middle index - 1. * if key is greater than middle element then key can only be found after that middle element, thus set first index to middle index + 1
- Python
- C++
def binary_search(my_list, key):
low_index = 0 # low_index stores lowest index where you might find key
high_index = len(my_list) - 1 # high_index stores highest index where you might find key
# initially these indexes cover every element in array
while low_index <= high_index: # when low_index become bigger than high index, we stop
# because we have eliminated all possiblities
mid_index = (low_index + high_index) // 2 # be careful here, we need to find mid point
# this is NOT high_index/2. That only works
# when low_index == 0
if key == my_list[mid_index]:
return mid_index
elif key < my_list[mid_index]:
high_index = mid_index - 1
else:
low_index = mid_index + 1
return -1
template <class TYPE>
int binarySearch(const vector<TYPE>& arr, const TYPE& key){
int rc=-1;
int low=0;
int high=arr.size()-1;
int mid;
while(low<=high && rc==-1){
mid=(low+high)/2;
if(key < arr[mid])
high=mid-1;
else if(key > arr[mid] )
low=mid+1;
else
rc=mid;
}/*while*/
return rc;
}