Saturday 30 November 2013

Sorting Algorithms

Much of the coursework in the past few weeks has been centred around sorting algorithms. I was really looking forward to studying these! Professor Heap gave great demonstrations of implementations of selection sort, merge sort, and quicksort with his different-sized empty coffee cups; these demonstrations helped make the algorithms more concrete and memorable.

We then went on to discuss the efficiency of the different sorting algorithms, which was very interesting. Selection sort was the least efficient of the three: it has O(n^2) efficiency. Merge sort, on the other hand, is much more efficient and has O(n log n) efficiency, while quicksort has O(n log n) efficiency in the best case and, contrary to its name, O(n^2) efficiency in the worst case! This inefficiency in the worst case arises because of the choice of a pivot element. The pivot element is used to split the list into two parts: one part containing elements less than the pivot and the other part containing elements greater than the pivot. Then the algorithm is performed recursively on the two parts, and so on, until single-element or empty parts are obtained, which are of course already sorted. Then the list is put back together again in sorted order. Ideally, the pivot element is somewhere close to the median of the list, but it is possible (albeit unlikely) that the pivot is repeatedly at the lower end or the upper end of the list, in which case the benefit of dividing and conquering is lost and the algorithm's performance becomes similar to selection sort's. In our implementation of the algorithm we chose the pivot to be the first element of the list in each step, but we spoke about how the code could be modified to pick the pivot from a random index, which in the case of an already sorted or nearly-sorted list would help to avoid the worst case. Merge sort, despite being a divide-and-conquer algorithm, doesn't have this same problem because the list is split in each step by simply cutting it into two halves. One thing that really surprised me was how few steps it takes to split a list into single-element or empty parts by repeatedly splitting it in two; the power of exponential growth is always surprising! When it comes to coding the algorithms, I much prefer the code for quicksort because it doesn't involve any complicated merging procedure as is necessary in merge sort.

However, these aren't the only sorting algorithms that we have been studying. In our labs we also looked at bubble sort, which is a simple but not very efficient sorting algorithm: it typically has O(n^2) (in)efficiency, but if an already sorted list is passed to it, it has O(n) efficiency. So bubble sort is slightly better than selection sort. It has O(n) efficiency for an already sorted list because each pass of the algorithm goes through the items of the list in sequence, and if no swaps are made on the first pass the list must already be sorted, so the algorithm can stop right away. (A sketch of quicksort and of bubble sort with this early exit appears below.)

There is a plethora of different sorting algorithms, and at first even the few that we studied seemed like a lot. Wikipedia has a great article on sorting algorithms with a comparison chart of the algorithms' big-O efficiencies. Each sorting algorithm also has its own article, and some of them have very elucidating animations that demonstrate the sorting algorithms on a large list. I found Wikipedia to be a very useful resource and I suggest you check it out if you haven't done so already!
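To make the discussion above a bit more concrete, here is a minimal sketch of the two ideas mentioned: quicksort with the first element as pivot (and a comment showing the random-pivot variant), and bubble sort with the early exit that gives it O(n) behaviour on an already sorted list. This is not our actual lecture or assignment code; the function names are just illustrative.

    import random

    def quicksort(lst):
        """Return a new sorted list built by partitioning around a pivot."""
        if len(lst) <= 1:
            # Empty or single-element lists are already sorted.
            return lst[:]
        # First-element pivot, as in our implementation; to avoid the worst
        # case on already sorted input, a random pivot could be used instead:
        # pivot_index = random.randrange(len(lst))
        pivot_index = 0
        pivot = lst[pivot_index]
        rest = lst[:pivot_index] + lst[pivot_index + 1:]
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        return quicksort(smaller) + [pivot] + quicksort(larger)

    def bubble_sort(lst):
        """Sort lst in place; the early exit gives O(n) time on a sorted list."""
        for end in range(len(lst) - 1, 0, -1):
            swapped = False
            for i in range(end):
                if lst[i] > lst[i + 1]:
                    lst[i], lst[i + 1] = lst[i + 1], lst[i]
                    swapped = True
            if not swapped:
                # No swaps on this pass, so the list is already sorted.
                return

For example, quicksort([3, 1, 2]) returns [1, 2, 3], while bubble_sort sorts its argument in place and returns early if a pass makes no swaps.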

Monday 18 November 2013

Midterm no. 2

We had our second midterm last week and I think it went very well! I was much more prepared and much less nervous for this midterm than I was for the first one. Again, I thought the test was at an appropriate difficulty level and the length of the test was very reasonable for the 50 minutes we were given to write it. We only have two weeks left of coursework, so stay tuned for another blog post later on this week!

Sunday 10 November 2013

Assignment 2 and the Algebra of Regular Expressions

My partner and I finished our assignment last weekend. It was a much shorter assignment than the previous one, but much more abstract. It took me a little while to come to a basic understanding of what regular expressions are, and although we completed the assignment without much difficulty, I still feel like I don't fully grasp what regular expressions are and how more general regular expressions work, since the assignment was focused on a very simple and particular class of expressions and because I haven't been exposed to real applications of more general regular expressions. (I realized the majority of the ideas that I wrote about below after having written this sentence, and now my understanding of regular expressions has grown immensely. I can see how what we did with our simple regular expressions can be generalized.) However, I do see the motivation for regular expressions: they code for different structures of strings and can be used to check whether a string matches a particular structure. Our regular expressions consisted of the symbols '0', '1', and 'e', combined with the operator symbols '|', '.', and '*', grouped together using parentheses and following some basic rules. The regular expressions we considered code for different structures of binary strings.

(see: http://www.cdf.toronto.edu/~heap/148/F13/Assignments/A2/a2.pdf)

After writing the last couple of sentences above, I realized that regular expressions form an algebraic structure! (I should have realized this sooner, since I am a pure mathematics student.)

The algebraic structure consists of the set S of all regular expressions, along with two binary operations, | : SxS -> S and . : SxS -> S, and one unary operation, * : S -> S. Moreover, the symbols 0, 1, and e are themselves defined to be regular expressions, and are thus elements of S. Hence, any composition of the binary and unary operations that makes sense maps the regular expressions 0, 1, e, and any other regular expressions in S to regular expressions in S.

We use infix notation for | and . and postfix notation for *. Also, grouping is denoted with parentheses as in ordinary arithmetic, and by convention * takes precedence over the other operators, as is the case with postfix and prefix operations in ordinary arithmetic (such as negation). Thus, the following is a regular expression:

                                                                     ((1.(1|0)*).1).

The outer parentheses can be removed without rendering the expression ambiguous, as in:

(1)                                                                  (1.(1|0)*).1, 

but we use them by convention in our assignment.

For those who do not know what an algebraic structure is, a familiar example is the integers together with the binary operations of addition and multiplication and the unary operation of negation. In this algebraic structure (Z, +, x, -), which is a type of ring, the following expression is an integer:

                                                                     (1x-(1+0))x1.

If we use postfix notation for negation instead of prefix notation, then the expression becomes:

(2)                                                                 (1x(1+0)-)x1.

Do you see the similarity between (1) and (2)?

Now, by studying the structure of (S, |, ., *) it might be possible to find different identities and properties that the operations defined on S satisfy. Using these identities, we can simplify the code used for matching strings against regular expressions in the regex_match function. But in order to write down identities between regular expressions we need some notion of "equality" between regular expressions. Thus, we must define an equivalence relation on our algebraic structure, and for this equivalence relation to be of any use it must be compatible with the structure of our algebraic structure. Such an equivalence relation is called a congruence relation.

Now, what does it mean for two regular expressions to be "equal"? Well, a regular expression r codes for a particular structure that a string can possess. Thus, a regular expression defines the set of strings that all possess the particular structure coded for by the regular expression; we'll denote this set S(r). Sometimes a string might possess the structure that two different regular expressions code for, even though the structures they code for are in general very different and usually not possessed by the same string. For example, the binary string '' (the empty string) is in the set S(e) and is also in the set S(1*); however, the string '111' is in the set S(1*) but not in the set S(e). It does not make sense to consider two regular expressions r and s to be equal if the sets S(r) and S(s) only have one or a few elements in common. A useful definition of equality between regular expressions is the following:

Def.: Two regular expressions r and s are equal, denoted r ~ s, iff S(r) = S(s).

This is indeed an equivalence relation, and it is trivial to verify that this is the case: it inherits the necessary properties (reflexivity, symmetry, and transitivity) from set equality, which is itself an equivalence relation.

In our assignment we say a string matches a regular expression iff it possesses the structure coded for by the regular expression. That is, a string x matches a regular expression r iff x is an element of S(r). Thus, an equivalent definition of regular expression equality is:

Def.: Two regular expressions r and s are equal, denoted r ~ s, iff every string that matches r also matches s and vice versa.

From the definitions of r | s, r . s, and r*, where r and s are regular expressions (refer to the assignment handout linked above), it follows from the way we have defined regular expression equality that ~ is a congruence relation. This means that if r ~ s, then in any expression containing r we can replace r with s without changing the string structure that the resulting expression codes for.

One simple identity in the algebra of our regular expressions is that the operation * is idempotent. That is, if r is a regular expression, then

                                                                         r** ~ r*.

This identity is easy to verify; try it! Moreover, by induction we have that

(3)                                                                   r*...* ~ r*,

for any number of times that the r on the left is "starred". Therefore, in the code for the function regex_match from our assignment, the case that handles a regular expression that is starred (potentially any positive integer number of times) only needs to check whether the string matches the regular expression starred once. This drastically reduces the number of recursive function calls needed in the cases where the regular expression argument contains a regular subexpression that is starred a large number of times. My partner and I used this identity (equation (3)) to our advantage in the programming of our assignment; a sketch of the idea follows below.
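As an illustration of how identity (3) can be exploited, here is a minimal sketch, assuming the regular expressions are represented as strings like the examples above. The helper name collapse_stars is hypothetical and is not taken from our actual regex_match code; it simply strips redundant repeated stars so that an expression starred many times is treated exactly like the same expression starred once.

    def collapse_stars(regex):
        """Apply identity (3): r**...* codes for the same strings as r*."""
        # Repeated trailing stars are redundant, so drop all but one of them.
        while regex.endswith('**'):
            regex = regex[:-1]
        return regex

A matching function could call such a helper on the (sub)expression it is about to examine, so that, for instance, '(1|0)****' is handled with the same number of recursive calls as '(1|0)*'.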

I encourage the reader to investigate this algebraic structure; if he or she finds any interesting identities, please comment them. I would be very interested in seeing what you come up with!

Sunday 3 November 2013

Linked-Lists: another collection ADT to add to the list

From programming in Python, we have become quite acquainted with many different data structures for representing collections of objects. In fact, we have been using so many different ways of representing collections that it's a little overwhelming. There are lists, tuples, dicts, and sets, to name some of the built-in types in Python. There are also endless ways of combining these structures: nested lists, list-valued dicts, etc. We have also written our own ADT classes for representing collections: stacks, queues, linked-lists, trees, etc., and not only have we written one class for each of these ADTs, we have implemented each of them in a number of different ways! Of course there is a principal reason for this: certain data structures (or certain implementations of a data structure) are more suited to some applications than others, perhaps because they simplify the code or reduce algorithmic complexity. But there are so many ways of representing collections, and so many ways of implementing the same data structure, that when it comes time to decide which one to use it becomes difficult to figure out which one would work best.

Lately we've been working with linked-lists, and one of our main motivations for doing so is that linked-lists can be more efficient than the built-in list data type provided by Python. A linked-list consists of a sequence of nodes, each of which contains a value and a reference to the next node in the linked-list. We first wrote a class for the nodes and then we wrote a class for linked-lists, which for the most part consisted of an attribute referencing the first node of the list, as well as some other attributes and methods. (A minimal sketch of these two classes appears below.)

One of the ways in which linked-lists are more efficient than the built-in lists is that the algorithm for adding a new item to the front of a linked-list has O(1) complexity, whereas the algorithm for adding a new item to the front of a built-in list has O(n) complexity. The reason for this was immediately clear to me: in order to add a new item to the front of a linked-list, the algorithm only needs to create a new node for the item and link it to the front node of the list (which can be done in a constant amount of time), whereas for built-in lists the algorithm needs to "shift" all of the other items down the list after the new item is added, and the time required for this is proportional to the length n of the list. The algorithm for inserting or deleting an item at any point of a linked-list is also more efficient than that for a built-in list, for the same reason: built-in lists need the elements after the insertion point to be "shifted".

I had nothing more than a vague understanding of what this "shifting" was, since I had no clue how the built-in lists were implemented, so out of curiosity I did a bit of research. The built-in lists store all of their items together in computer memory, and thus when a new item is inserted the other items need to be moved. It turns out that the built-in list data type in Python is actually a dynamic array data type, whereas a linked-list is a list data type. What distinguishes the two is that list data types use sequential access, while array data types use direct access. Direct access allows indexing algorithms to access an item by its index in constant time (O(1) complexity), whereas sequential access forces indexing algorithms to traverse the list in sequence and thus access an item by its index in time proportional to the index (O(n) complexity). So although linked-lists beat the built-in lists in Python when it comes to inserting or deleting elements, they lose against built-in lists when it comes to indexing.
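Here is a minimal sketch of the kind of node and linked-list classes described above; it is not our actual lab code, and the class and method names are just illustrative. The prepend method shows why adding to the front is O(1): only one new node is created and one reference changes, no matter how long the list is, while __getitem__ shows the O(n) cost of sequential access.

    class Node:
        """A single node: a value plus a reference to the next node."""

        def __init__(self, value, next_node=None):
            self.value = value
            self.next = next_node


    class LinkedList:
        """A linked-list that keeps a reference to its first node."""

        def __init__(self):
            self.front = None
            self.size = 0

        def prepend(self, value):
            """Add value to the front of the list in O(1) time."""
            # The new node points at the current front node and then
            # becomes the new front; no other nodes are touched.
            self.front = Node(value, self.front)
            self.size += 1

        def __getitem__(self, index):
            """Return the item at index; sequential access makes this O(n)."""
            node = self.front
            for _ in range(index):
                if node is None:
                    raise IndexError('linked-list index out of range')
                node = node.next
            if node is None:
                raise IndexError('linked-list index out of range')
            return node.value

Compare lst.insert(0, value) on a built-in list, which has to shift every existing element over by one slot before the new item can be placed at the front.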

Sunday 27 October 2013

Hint for e4b?

Professor Heap gave the following hint for exercise 4, part b:

"Assigning a depth of -1 to empty trees should simplify your code—you should be able to treat this case more uniformly." 

I managed to solve this exercise without much difficulty, and nowhere in writing the function did I see the need to assign a depth of -1 to empty trees. Defining the depth of an empty tree to be 0 seems much more natural to me and my code was quite simple—only around 6 lines (a sketch of the approach I have in mind is below). I was curious if any other students incorporated this hint into their code and found it useful. If so, how did it simplify your code? Perhaps the hint made more sense for certain approaches to the function than to others.
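For context, here is a minimal sketch of the sort of depth function I mean, assuming a simple Tree class with a value and a list of children; the class is illustrative rather than the one from the exercise handout. With this convention a single leaf has depth 1 and the empty tree (None) has depth 0, and no -1 is needed anywhere:

    class Tree:
        """A simple tree: a value plus a (possibly empty) list of child trees."""

        def __init__(self, value, children=None):
            self.value = value
            self.children = children if children is not None else []


    def depth(tree):
        """Return the depth of tree, where an empty tree (None) has depth 0."""
        if tree is None:
            return 0
        if not tree.children:
            return 1
        return 1 + max(depth(child) for child in tree.children)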

Midterm

I only have one midterm left this month after a very busy week last week, so now that the time is available it seems fitting to write about the CSC148 midterm. I'm not always great at test-taking and thinking under pressure so I didn't do as well as I thought I could have, but on the whole I thought it was a very fair midterm; the questions were at a good difficulty level and the length of the test was very reasonable. One thing I found difficult was writing code with pen and paper. Although conceptually everything is the same, it felt very strange since for the most part I have only been typing code on my laptop. Another difficulty I had was that I was a little unsure of what was expected for the last question, which was on describing test cases, and I spent a little too much time worrying and thinking about what to write. The first question was a good exercise in carefully going through what a function does when it is called and, although it was tedious, I thought it was an appropriate question for the test.

I think this midterm exposed that I have a bit of reliance on my computer for quickly checking that my code works and that I didn't leave certain cases untreated, after having quickly typed something up. Being more careful and mentally checking code while programming is something I will have to work on for the next midterm.

Monday 14 October 2013

The PEP8 Checker is Foolishly Consistent

Need I say more? I guess it's natural for an automated PEP8 checker to be foolishly consistent, as it would be difficult to program one to recognize when readability and style are improved by not following PEP8, something that is quite subjective. PEP8 checkers therefore sometimes check for consistency even when it is foolish, and thereby do not really check for the spirit of PEP8 at all. "Programs must be written for people to read, and only incidentally for machines to execute." (Abelson & Sussman, Structure and Interpretation of Computer Programs). Maybe someday computers will be able to evaluate what is subjectively more readable for humans better than humans can. Until then, humans are probably the best PEP8 checkers.