HW2. Sorting & Binary Searching (due 16 Sep, 23:59)
EX1(1p) Task scheduler - a terrible randomized task scheduler creates tasks and assigns them randomly to different queues. It decides on its own, for each queue or stack, whether to insert a new item (unique id) or to remove (delete) one. Currently there are 10K timepoints, but you can also run it for 100K or 1M. It performs hardly any checks. See code: https://colab.research.google.com/drive/1JnwDsIn-C_FL75q_ZogtEWqY38ILBCkX
Simulate those operations on the queues and stacks (collectively, data structures, DS). Calculate the performance of the scheduler in total and for each DS separately. Report the results in a table, using one row for the totals and one row per DS, with the following columns (a bookkeeping sketch follows the list):
- How many items were inserted and deleted in total (and per DS)?
- How many erroneous commands (e.g. dequeue or pop from an empty DS) did each DS receive?
- What maximal size did each DS reach, and when?
- How many elements remain in each DS at the end of the scheduler work?
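For the bookkeeping, a minimal sketch is given below. It assumes the scheduler's run can be read as a stream of (operation, DS name, item id) commands; this command format and the function name are illustrative assumptions, not taken from the notebook, so adapt them to whatever the Colab code actually emits. The DS names are passed in as parameters, which also covers the configurable queue/stack counts asked for in Task 2.

```python
from collections import deque

# Illustrative bookkeeping sketch -- the ('insert'/'delete', name, item) command
# format is an ASSUMPTION; adapt it to the actual output of the Colab scheduler.
def simulate(commands, queue_names, stack_names):
    ds = {name: deque() for name in list(queue_names) + list(stack_names)}
    stats = {name: {"inserted": 0, "deleted": 0, "errors": 0,
                    "max_size": 0, "max_size_t": None} for name in ds}
    for t, (op, name, item) in enumerate(commands):
        d, s = ds[name], stats[name]
        if op == "insert":
            d.append(item)
            s["inserted"] += 1
        elif not d:
            s["errors"] += 1                  # dequeue/pop from an empty DS
        else:
            if name in queue_names:
                d.popleft()                   # queues are FIFO
            else:
                d.pop()                       # stacks are LIFO
            s["deleted"] += 1
        if len(d) > s["max_size"]:            # track the maximal size and when
            s["max_size"], s["max_size_t"] = len(d), t
    for name in ds:
        stats[name]["remaining"] = len(ds[name])  # items left at the end
    return stats
```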
EX2(1p) Task scheduler continued … If there are items remaining at the end of the run, dequeue or pop all remaining values one by one, decreasing the size of each DS by one at each subsequent time point.
- Calculate, for each DS, how long elements stayed in it on average (including the new “emptying period” after the scheduler finishes).
- Which value stayed in each DS the longest, and for how long? (E.g. if item 3245 was inserted into Q3 at t00200 and removed at t02345, then it stayed in Q3 for 2145 time units. Hint: add the insertion time to the DS together with the item id; see the sketch below.)
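The hint above can be implemented by storing (item id, insertion time) pairs instead of bare ids. A sketch under that assumption follows; it processes the events of one DS at a time, and the function and variable names are illustrative.

```python
from collections import deque

def record_stays(events, is_queue=True):
    """events: (t, op, item_id) triples for ONE data structure, op in {'insert','delete'}.
    Returns the average stay and the (item_id, duration) pair that stayed longest."""
    d = deque()
    stays = []                           # (item_id, duration) for every removed item
    t = -1
    for t, op, item in events:
        if op == "insert":
            d.append((item, t))          # store the insertion time next to the id
        elif d:                          # ignore erroneous deletes from an empty DS
            item_id, t_in = d.popleft() if is_queue else d.pop()
            stays.append((item_id, t - t_in))
    while d:                             # "emptying period": one removal per timepoint
        t += 1
        item_id, t_in = d.popleft() if is_queue else d.pop()
        stays.append((item_id, t - t_in))
    avg = sum(dur for _, dur in stays) / len(stays) if stays else 0.0
    longest = max(stays, key=lambda s: s[1], default=None)
    return avg, longest
```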
Provide an interpretation of the main (expected or unexpected) similarities and differences between the various data structures.
Make sure your code accommodates changes in the number of queues and/or stacks. Demonstrate this by running Task 2 for 5 queues and 3 stacks (the original configuration), as well as for 3 queues and 5 stacks.
State briefly how you would make the scheduler code fairer to tasks (customers?) and how you would avoid erroneous deletions, i.e. pops from the “wrong” DS. You can change the number of queues/stacks and choose which queue you insert elements into or delete them from, while keeping the same total number of data structures (eight in total) and the same frequency of insertions and deletions as in the original scheduler. In other words, you can change only the name/nr of the queue or stack into which you insert or from which you delete (based on your inspection of the underlying queues/stacks). Remember - you can also use all eight data structures as queues only, all 8 as stacks, or any other split between them.
Provide evidence (a comparison) that the new scheduler code is “better”. Provide only the columns necessary for the comparison to make the point.
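One possible redirection strategy is sketched below as an assumption about what "fairer" could mean, not as the only valid answer: keep the scheduler's insert/delete timing, but send each insertion to the currently shortest DS and each deletion to the longest non-empty one. This balances sizes and makes erroneous deletions impossible as long as any DS is non-empty.

```python
# Hypothetical target-selection helpers; `ds` maps DS name -> deque, as in the
# EX1 sketch. Only the target name changes, which the exercise allows.
def pick_insert_target(ds):
    return min(ds, key=lambda name: len(ds[name]))       # shortest DS gets the item

def pick_delete_target(ds):
    nonempty = [name for name in ds if ds[name]]
    if not nonempty:
        return None                                      # nothing to delete anywhere
    return max(nonempty, key=lambda name: len(ds[name]))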
EX3(1p) Quicksort - the provided code compares sorting methods. It includes a Quicksort() implementation. Your task is to run the code for 1M random values. Characterise what is wrong with the provided code, then fix it to solve that problem/issue.
https://colab.research.google.com/drive/1qTtWCuDNVnccayELkgR4iSaBf-SEYUb-
After you have fixed it, run it for 3-4 more data types: small-precision integers, very large integers, floating-point numbers, and differently generated data, to see whether there are any data-type-specific differences in speed.
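A timing harness for the data-type comparison might look like the sketch below. The dataset choices are assumptions, and `sort_fn` stands in for your fixed Quicksort from this task.

```python
import time
import numpy as np

# Minimal timing sketch; replace sort_fn with your fixed Quicksort from EX3.
sort_fn = sorted

datasets = {
    "small ints": np.random.randint(0, 100, 1_000_000),       # many duplicates
    "large ints": np.random.randint(-2**62, 2**62, 1_000_000),
    "floats": np.random.random(1_000_000),
    "reverse sorted": np.arange(1_000_000)[::-1],              # differently generated
}
for name, data in datasets.items():
    arr = list(data)                     # plain Python list, as in the notebook
    t0 = time.perf_counter()
    sort_fn(arr)
    print(f"{name:>14}: {time.perf_counter() - t0:.2f} s")
```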
EX4(1p) Implement Quicksort with two pivots. Describe how you might try to avoid bad splits to make your code "unbreakable". Compare your “best” take on Quicksort to the code presented in Task 3. How many times faster can you make your implementation (e.g. compare and vary the ways you split the data)?
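For reference, a minimal in-place dual-pivot (Yaroslavskiy-style) Quicksort sketch is given below. It uses randomized pivot choice as one way to make bad splits unlikely on sorted or adversarial input; other measures worth trying are median-of-three pivot sampling and falling back to insertion sort on small subarrays.

```python
import random

def dual_pivot_quicksort(a, lo=0, hi=None):
    """In-place dual-pivot Quicksort (Yaroslavskiy-style three-way partition)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    # Randomized pivots: one way to guard against consistently bad splits.
    r1, r2 = random.randrange(lo, hi + 1), random.randrange(lo, hi + 1)
    a[lo], a[r1] = a[r1], a[lo]
    a[hi], a[r2] = a[r2], a[hi]
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
    p, q = a[lo], a[hi]                      # p <= q
    lt, gt, k = lo + 1, hi - 1, lo + 1       # [lo+1,lt) < p, (gt,hi-1] > q
    while k <= gt:
        if a[k] < p:
            a[k], a[lt] = a[lt], a[k]
            lt += 1
        elif a[k] > q:
            while a[gt] > q and k < gt:
                gt -= 1
            a[k], a[gt] = a[gt], a[k]
            gt -= 1
            if a[k] < p:
                a[k], a[lt] = a[lt], a[k]
                lt += 1
        k += 1
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]              # pivots to their final positions
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)          # elements < p
    if p < q:                                    # if p == q the middle is all equal
        dual_pivot_quicksort(a, lt + 1, gt - 1)  # p <= elements <= q
    dual_pivot_quicksort(a, gt + 1, hi)          # elements > q

data = [random.random() for _ in range(1_000_000)]
dual_pivot_quicksort(data)
assert all(data[i] <= data[i + 1] for i in range(len(data) - 1))
```

The `p < q` guard matters for duplicate-heavy input (such as the small-precision integers from Task 3): when both pivots are equal, the middle partition is already uniform and recursing into it would only shrink the problem by two elements per level.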
EX5(1p) Binary Search and Order Statistics - how many random lookups and rank-order-statistics queries can be performed within 1 minute on 10M elements of sorted and unsorted data?
There are two types of queries: a) look up a key for its membership and its rank k; b) look up which value is ranked k'th in the data. Generate the keys and ranks randomly.
Run these two types of queries on unordered data (with linear-time methods) and on already ordered data (binary search and lookup).
Starting from how many random rank-order queries on originally randomly ordered data would it actually be faster to first sort the data and then “look up” the keys using binary search?
Since the linear-time order-statistics query changes the underlying data, how does that affect the speed after many queries over the same data? Describe what happens to the underlying data after repeated application of linear rank-order queries. Visualize that gradual change using much smaller random data, e.g. only 1000 values.
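A sketch of both query types is below (names are illustrative). On sorted data, `bisect` answers membership and rank in O(log n); on unsorted data, quickselect answers the k-th-rank query in expected O(n), and because it partitions in place, its repeated application is exactly what gradually pre-sorts the data, which the last question asks you to observe.

```python
import bisect
import random

def rank_of(sorted_data, key):
    """Query (a) on sorted data: membership and rank of `key` in O(log n)."""
    i = bisect.bisect_left(sorted_data, key)
    member = i < len(sorted_data) and sorted_data[i] == key
    return member, i                     # rank = number of strictly smaller elements

def quickselect(a, k, lo=0, hi=None):
    """Query (b) on unsorted data: k-th smallest (0-based) in expected O(n).
    NOTE: partitions `a` in place, so repeated calls gradually pre-sort it."""
    if hi is None:
        hi = len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        p = a[random.randrange(lo, hi + 1)]          # random pivot value
        i, j = lo, hi
        while i <= j:                                # Hoare-style partition
            while a[i] < p:
                i += 1
            while a[j] > p:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        if k <= j:
            hi = j                                   # answer is in the left part
        elif k >= i:
            lo = i                                   # answer is in the right part
        else:
            return a[k]                              # a[k] equals the pivot

data = [random.random() for _ in range(10_000)]
assert quickselect(data[:], 42) == sorted(data)[42]
```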
EX6(bonus 2p) Let's dig deeper into the scheduler. We extend the same scheduling task to, e.g., supermarket customers arriving at random intervals, each taking a random time to serve at the cashier. Customers arrive according to the following generator, which also includes the estimated time for one cashier to serve one customer:
https://colab.research.google.com/drive/1pqj2GepwJRcSNtCMrTrB0lFCTYRCMmmk
Devise a methodology to add and remove cashiers (new queues) dynamically to match demand. Add or remove cashiers after every 1000 timepoints. Make it a goal to add cashiers as demand increases, so that no customer spends much more than 20 minutes in the queue and no cashier has fewer than 3 customers in their queue. Analyze the demand in non-overlapping time frames of 1000 timepoints. How many customers had to wait for more than 20 minutes, and by how much? How long did cashiers have to wait during their shifts without customers? Visualize the required nr of cashiers over these non-overlapping time frames.
How many cashier “unit hours” (a “unit hour” is 1000 timepoints) are needed in total under such a strategy? Cashiers work in shifts measured in 1000-timepoint “unit hours”.
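A minimal sketch of one possible adjustment rule is below. The 20-minute and 3-customer thresholds come from the task; the data layout, the timepoint-to-minute conversion, and the function name are assumptions to adapt to the generator's actual output.

```python
from collections import deque

MINUTE = 1  # ASSUMPTION: one timepoint equals one minute; adjust to the generator's units

def adjust_cashiers(queues, wait_estimates, n_min=1):
    """Apply after every 1000-timepoint window. `queues` is a list of deques of
    waiting customers; `wait_estimates` is the predicted wait per queue (both
    assumed layouts). Opens a cashier when waits exceed 20 minutes, closes one
    when every queue has fewer than 3 customers."""
    if any(w > 20 * MINUTE for w in wait_estimates):
        queues.append(deque())                       # demand rising: open a new queue
    elif len(queues) > n_min and all(len(q) < 3 for q in queues):
        closing = queues.pop()                       # demand low: close one queue
        while closing:
            min(queues, key=len).append(closing.popleft())  # reassign its customers
    return queues
```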