Sorting

This is an article about sorting and the classic ways to achieve it in computer science.

Bubble Sort

Bubble sort was described by Edward Harry Friend in 1956 and is a very inefficient sorting algorithm. It is so bad, and so easy to understand, that even Barack Obama understands it.

Bubble sort uses two loops, the outer one iterating as often as there are elements in the unsorted list. The inner loop compares each item with its neighbour and, if it is bigger, swaps the two so that the bigger item ends up on the right side. In this way, large items (e.g. large numbers) bubble all the way up to the right end of the list.

Problems

The problem is that not only do up to n-1 comparisons have to be made on each outer pass, but also a lot of exchanges in case the list is sorted in a somewhat reverse order. As a result the algorithm faces the hares-and-tortoises problem: big items bubble up to the end fairly quickly (the hares), while small items crawl down to the front pretty slowly (the tortoises).

Complexity

Outer Loop

n-1 outer passes are made to process a list of size n. We don't need to make an nth pass, since after n-1 passes all items are in place already. Let n be 1: zero passes are necessary to sort a list of length 1. Let n be 2: one pass has to be made to sort a list of 2 items.

Inner Loop

An arbitrary item of the n items can only be compared with the remaining n-1 items. For a list of 1 item, no comparison is needed to sort it; for a list of 2 items, one comparison suffices. More precisely, the i-th pass makes n-i comparisons, because by then the last i-1 items are already in place.

To calculate complexity we look at how many comparisons have to be made in total to sort a list, that is, all inner-loop comparisons summed over all outer passes.

The total number of comparisons made is hence the sum of the first n-1 integers. For a list of 4 items, (4-1) + (4-2) + (4-3) = 3 + 2 + 1 = 6 comparisons have to be made in total. There are only 3 passes when n = 4, since only n-1 passes are needed to sort a list of size 4.

Another way to understand this:

Pass    Comparisons
1       n-1
2       n-2
...     ...
n-1     1

Hence, since the sum of the first n integers is defined as

$$ \sum\limits_{k=1}^n k = \frac{n(n+1)}{2} $$

the sum of the first n - 1 integers is

$$ \sum\limits_{k=1}^{n-1} k = \frac{(n-1)n}{2} = \frac{1}{2}n^2 - \frac{1}{2}n $$

resulting in

$$ \mathcal O(n^2) $$ comparisons being made in the worst case. (You can try the formula with our example above: for n = 4 it gives ½ · 16 - ½ · 4 = 6.) In the best case, when the list is already sorted, no exchanges are made at all. On average, a comparison leads to an exchange about half of the time.

Implementation
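
A minimal sketch in Python (the name bubble_sort and the in-place style are my own choices, not a canonical listing):

def bubble_sort(a_list):
    # n-1 outer passes; after each pass the largest remaining item
    # has bubbled up to the end of the unsorted part
    for pass_end in range(len(a_list) - 1, 0, -1):
        for i in range(pass_end):
            # swap neighbours so the bigger item ends up on the right
            if a_list[i] > a_list[i + 1]:
                a_list[i], a_list[i + 1] = a_list[i + 1], a_list[i]
    return a_list

Calling bubble_sort([55, 22, 12, 7]) makes the 3 passes and 6 comparisons counted above and returns [7, 12, 22, 55].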

Selection Sort

Selection Sort, while still in the same category as Bubble Sort efficiency-wise, improves on the exchanges being made: it compares just as often as Bubble Sort, but exchanges only one time each outer loop. It moves the biggest value found on each outer pass to the end of the list.

Complexity

So while n-1 passes are still made to process a list of size n, the number of exchanges made in the inner loop is reduced. Yet the number of comparisons is still the sum of the first n - 1 integers:

$$ \frac{1}{2}n^2 - \frac{1}{2}n $$

resulting in

$$ \mathcal O(n^2) $$

still a quadratic order of magnitude. Yet, in benchmarks the Selection Sort algorithm will be faster than Bubble Sort due to the fewer exchanges being made on each pass.

Implementation

What happens is that on each pass the length of the current list partition is counted down by one, and in the inner loop each element of the partition is compared against the value at positionMax. This way positionMax always references the largest value found on the pass, which is then switched with the last value of the current partition. Only n-1 switches are made in total, compared to up to quadratically many in Bubble Sort.
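
A Python sketch along the lines of this description (the name selection_sort is mine; position_max mirrors the positionMax variable above):

def selection_sort(a_list):
    # the partition a_list[0..last] shrinks by one on each outer pass
    for last in range(len(a_list) - 1, 0, -1):
        position_max = 0
        # find the position of the largest value in the current partition
        for i in range(1, last + 1):
            if a_list[i] > a_list[position_max]:
                position_max = i
        # a single exchange per pass: the largest value goes to the end
        a_list[last], a_list[position_max] = a_list[position_max], a_list[last]
    return a_list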

Insertion Sort

Insertion sort once again has a quadratic complexity of O(n²), yet reduces the number of exchanges to zero and uses shifts instead. A shift costs about one third of the computational work of an exchange, because only one assignment is made in the inner loop.

The algorithm maintains a sorted sublist and looks for unsorted items to be inserted into the sublist at the right place by shifting the sublist items to the right.

Complexity

$$ \mathcal O(n^2) $$

Implementation
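
A minimal Python sketch of the shifting approach described above (names are my own choices):

def insertion_sort(a_list):
    # the items left of index form the sorted sublist
    for index in range(1, len(a_list)):
        current_value = a_list[index]
        position = index
        # shift bigger sublist items one place to the right;
        # each shift is a single assignment, not a full exchange
        while position > 0 and a_list[position - 1] > current_value:
            a_list[position] = a_list[position - 1]
            position -= 1
        # insert the current value at the right place
        a_list[position] = current_value
    return a_list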

Shell Sort

The Shell Sort algorithm is a variation of insertion sort in which smaller sublists are chosen using an increment i, called the gap, creating sublists whose items are i positions apart.

If the original list was

[0,55,22,12,7,24,15]

and i = 3 the resulting sublists would be:

[0,12,15]   sublist 1 (indices 0, 3, 6)
[55,7]      sublist 2 (indices 1, 4)
[22,24]     sublist 3 (indices 2, 5)

If these sublists are then sorted and put back together, the resulting list would be:

[0,7,22,12,55,24,15]

That list can be divided into sublists again with a smaller gap. A final insertion sort, with a gap of 1, will then perform better than a traditional one because the list is already almost sorted.
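
A Python sketch, assuming the common gap sequence that starts at n // 2 and halves on each round (the example above uses a fixed increment of 3 instead):

def shell_sort(a_list):
    gap = len(a_list) // 2
    while gap > 0:
        # insertion sort on each sublist of items that are gap apart
        for start in range(gap):
            for index in range(start + gap, len(a_list), gap):
                current_value = a_list[index]
                position = index
                while position >= gap and a_list[position - gap] > current_value:
                    a_list[position] = a_list[position - gap]
                    position -= gap
                a_list[position] = current_value
        # the last round, with gap = 1, is a plain insertion sort
        gap //= 2
    return a_list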

Complexity

If we choose a changing increment on each pass, the resulting order of magnitude can become

$$ \mathcal O(n^\frac{3}{2}) $$

Merge Sort

Time complexity can be vastly improved by a divide-and-conquer strategy: the list is continually split in half, with the sort calling itself recursively, until no further split is possible. Then the merging starts, combining the sublists in an ordered manner until all lists are merged back into one.

Complexity

Splitting a list of length n into halves is possible log₂ n times. Merging the sublists back together into a list of length n requires n operations, thus merge sort's complexity is

$$ \mathcal O(n \log n) $$

The Merge Sort needs additional space to store the sublists, which will be a problem with huge data sets.
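
A recursive Python sketch; this version returns a new list instead of sorting in place, which makes the additional space explicit (names are my own):

def merge_sort(a_list):
    # base case: a list of 0 or 1 items is already sorted
    if len(a_list) <= 1:
        return a_list
    # split the list in half and sort each half recursively
    middle = len(a_list) // 2
    left = merge_sort(a_list[:middle])
    right = merge_sort(a_list[middle:])
    # merge the two sorted halves in an ordered manner
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # one of the halves may still hold remaining items
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged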

Quick Sort

Finally we arrive at a sorting algorithm that can rightfully be called sophisticated. The Englishman Tony Hoare of the University of Oxford (yeeeh!) developed it. In addition to being quite efficient, it avoids using additional space the way Merge Sort does.

Quick sort selects a pivot value around which it will split the list in two.

Choosing an optimal pivot value is paramount to efficiency. A common practice is the median of three: consider the first, the middle and the last element of the list and choose the median of those values. E.g. if these values were 32, 8 and 90, the median value would be 32.

The algorithm maintains a moving leftmark and a moving rightmark. The leftmark compares each value it passes with the pivot value and stops moving when it finds a bigger one; the rightmark then moves and stops when it finds a value smaller than the pivot. The two values are then switched.

After the switch the marks keep converging toward each other, comparing and switching, until they cross and the split point is found. The pivot value is then exchanged with the value at the split point. This ensures that only values smaller than the pivot are located left of the split point and only bigger values right of it.

The list is then split into two and the algorithm is called recursively on the resulting lists.
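
A Python sketch of the leftmark/rightmark scheme described above; for brevity it takes the first element as the pivot rather than the median of three (names are my own):

def quick_sort(a_list, first=0, last=None):
    if last is None:
        last = len(a_list) - 1
    # a partition of 0 or 1 items needs no work
    if first >= last:
        return a_list
    pivot_value = a_list[first]
    leftmark = first + 1
    rightmark = last
    while True:
        # move leftmark right until a value bigger than the pivot is found
        while leftmark <= rightmark and a_list[leftmark] <= pivot_value:
            leftmark += 1
        # move rightmark left until a value smaller than the pivot is found
        while rightmark >= leftmark and a_list[rightmark] >= pivot_value:
            rightmark -= 1
        if rightmark < leftmark:
            break  # the marks crossed: rightmark is the split point
        a_list[leftmark], a_list[rightmark] = a_list[rightmark], a_list[leftmark]
    # put the pivot at the split point and recurse on both partitions
    a_list[first], a_list[rightmark] = a_list[rightmark], a_list[first]
    quick_sort(a_list, first, rightmark - 1)
    quick_sort(a_list, rightmark + 1, last)
    return a_list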

Complexity

On average, the complexity of Quick Sort is around

$$ \mathcal O(n \log n) $$

but may degrade to O(n²) if the split points are not near the middle of the lists. Unlike Merge Sort, it does not use any additional space for sublists.

Tim Sort

The creators of Python implemented Tim sort as the language's default sorting mechanism. It is a combination of merge sort and insertion sort together with further heuristics, giving excellent performance on the sorting problems programmers commonly face. It finds runs of already sorted data and merges these runs together. Its worst-case complexity is O(n log n) and its best case is O(n). You can use it in Python with:

# internal sort: sorts the list in place
aList.sort()

# built-in function returning a sorted copy
sorted(aList)
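
Both take an optional key function and a reverse flag, which are part of the built-in API, e.g.:

# sort by absolute value, in descending order
sorted([3, -7, 1], key=abs, reverse=True)  # returns [-7, 3, 1]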