Non-Relational databases: CouchDB

In this Lab we are going to learn how to use the CouchDB NoSQL database, which follows the Document-oriented non-relational model. We will take an example dataset, convert its data into json documents and upload them to CouchDB. We will also look how to create MapReduce JavaScript queries in the database.

References

Dataset Description

The dataset that we will use in this practice session is taken from the Tennis-Data Betting, Results and Livescores Portal, which provides historical tennis results and fixed odds betting data from the men's ATP and women's WTA tours. Its main goal is to help developing quantitative tennis betting systems and forecasting.

Name: Tennis results & Betting odds data for the ATP & WTA Tours
Download the tennis tournaments 2018.csv CSV file.
Location of the original data: http://www.tennis-data.co.uk/alldata.php (MS Excel format)

Dataset attributes (column names) are:
ATP: integer Location: string Tournament: string Date: string Series: string Court: string Surface: string Round: string Best of: integer Winner: string
	Loser: string WRank: integer LRank: string WPts: integer LPts: string W1: integer L1: integer W2: integer L2: integer W3: integer
		L3: integer W4: integer L4: integer W5: integer L5: integer Wsets: integer Lsets: integer Comment: string B365W: double B365L: double
			EXW: double EXL: double LBW: double LBL: double PSW: double PSL: double MaxW: double MaxL: double AvgW: double AvgL: double

Each row in the file represents a tennis match. The fields include data on the winner, loser, sets played as well as betting odds by different betting brokers for either player winning the match
- The odds are represented as payout coefficients meaning that e.g for a bet of 10 and odds 1.5 the broker will pay out 10 x 1.5 = 15
- A detailed explanation of the dataset can be found here

Exercise 13.1 Install CouchDB

Your first task is to install CouchDB in your computer. CouchDB works in Windows, Linux and MacOS. Do not forget to uninstall the database once you have completed the lab and submitted the solution.

Download and install CouchDB 2.1.0 from http://couchdb.apache.org/
To see if the installation worked point your browser to http://localhost:5984
You should see the default CouchDB Welcome JSON document when everything is working properly.

Exercise 13.2 Using the Futon web interface

CouchDB also deploys a Futon web User Interface for managing the database, manually creating JSON documents and also creating design documents.

Open up the Futon web UI at http://localhost:5984/_utils/
Create a new database named test.
Add some JSON documents to the database you just created.

Next we will take a look how we can access documents in the database

Exercise 13.3 Using the RESTful Query Interface to access CouchDB

We can send HTTP methods (GET, PUT, POST, DELETE) to the CouchDB web service to execute typical database operations, such as fetching, adding, modifying or deleting documents.

Add a document to your database by making an HTTP PUT request on the command line
- Linux:
  - Use the command line shell:
  - curl -X PUT http://localhost:5984/test/foobar -d '{"foo": "bar"}'
- Windows:
  - Use Windows Power Shell:
  - Invoke-RestMethod -Method Put -Uri http://localhost:5984/test/foobar -Body '{"foo": "bar"}'
- In this example test is the name of the database and foobar is the id of the document you enter into the database
Perform an HTTP GET request to query the foobar document from the database test
- Linux:
  - curl -X GET http://localhost:5984/test/foobar
- Windows:
  - Invoke-RestMethod -Method Get -Uri http://localhost:5984/test/foobar
- You can also access the document simply by navigating with your browser to the same url:
  - http://localhost:5984/test/foobar
Feel free to try out other HTTP API functions

Exercise 13.4 Bulk Data Upload

Create a new database named tennis in your CouchDB instance for this data
Download the tennis tournaments 2018.csv file
We will use a Spark Python program (which you can download from here) to upload the content of 2018.csv dataset file (each row as a separate document representing a tennis match) to the CouchDB.
- Its argument should be: input_folder (location of 2018.csv file)
- The Spark program expects the CouchDB to be running @ http://localhost:5984
- It first uses the DataFrame API to load in csv file, executes a set of parallel queries to convert the DataFrame rows into a list of Python dictionaries (Equivalent to JSON documents) and stores the list in local CouchDB using foreachPartition() Spark operation.
- The resulting JSON documents in the CouchDB will look something like this:
- ```
{
   "_id": "7f4842ef6e27eeee8b3aeec0850007ce",
   "_rev": "1-bdc2e63835a24039690db0a4f5a3b4d3",
   "ATP": 1,
   "Location": "Brisbane",
   "Tournament": "Brisbane International",
   "Date": "12/30/2013",
   "Series": "ATP250"
   /*...*/
}
```
Download and run the Spark program in your computer
- You will have to install couchdb python library. When using PyCharm, you need to do the following:
- Add couchdb to your PyCharm virtual enviorement:
  - Go to File->Settings->Project Interpreter-> Plus Icon (upper right) -> Search for package name (couchdb) -> Install Package
- If you get an error that couchdb module does not exist when you run the Spark application:
  - Also install couchdb client library for your computer's default Python enviorement
  - python -m pip install couchdb

Exercise 13.5 Counting the Documents using MapReduce query

To run MapReduce queries on CouchDB documents we need to specify them as Views in Design Documents. These are documents whose id has the prefix _design/
Lets create a new design document view in the tennis database. tennis -> Design Documents -> New View
- Read up on design documents here and here
Write the map and reduce functions to count all documents in the database.
- Map:
```
function (doc) {
  emit("count", 1);
}
```
- Reduce: _count (Predefined function)
- Assign a name to the Design Document (winners) and also to the view (Index name: mycount).
We can call and access the result of the created view at:
- localhost:5984/tennis/_design/DESIGN_DOC_NAME/_view/ViEW_NAME
- For example: http://localhost:5984/tennis/_design/winners/_view/mycount
- To force data to be grouped by key <url>?group=true
- To not to keep it grouped by key: <url>?group=false (all map output will be aggregated into a single value by applying the reduce function)
- We can address the json fields inside the document using . - like this: doc.EXL to get the value of field EXL
Lets now change the Map function to count separately the matches where
1. the favorite won (doc.AvgW < doc.AvgL)
2. and the matches where this was not the case (doc.AvgW => doc.AvgL)
  - AvgW is the payout ratio of the match winner and AvgL was the losers payout ratio.
  - NB! Smaller payout ratio indicates that the chance of that player winning was considered higher.
Assign a separate key in the Map function for each case (For example, "favorite_won" for doc.AvgW < doc.AvgL and "favorite_failed" otherwise) like this:
- ```
function (doc) {
  if(doc.AvgW < doc.AvgL)
      emit("favorite_won", 1);
  else
      emit("favorite_failed", 1);
}
```
Check the result of the view in the database, while making sure to add ?group=true at the end of the url.

Exercise 13.6 Who Beat the betting odds?

Let us look more closely at the matches with a surprise winner i.e matches that the favorite lost.
Add a new view (againstOdds) to your design document (winners) to count how many times each winner beat the betting odds against them (using again the fields AvgW and AvgL to assess the betting odds)
Write the Map and Reduce functions to find winners who beat the betting odds and count how many times each Winner did it.
- Map: Emit winner's name as key and 1 as value if winners odds were considered to be lower (NB! Remember: Higher payout in betting means that odds of that person are considered lower! )
- Reduce: Count all records
View the results of the view function at localhost:5984/tennis/_design/winners/_view/againstOdds?group=true
- NB! ?group=true at the end of the URL forces data to be grouped by the key in Reduce. Without this attribute, reduce will aggregate all results into a single value by default.

Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
Take a screenshot of the view result

Exercise 13.7 Biggest Payouts

Add another view to find the five matches with the largest payouts (matches where AvgW is the largest)
- You only need a Map part of the query for this. Leave the Reduce as NONE.
Keep in mind that CouchDB only allows for sorting (Add &descending=true to change the sorting order) results based on the key and not the value.
- This means the map function will have to emit AvgW as the key.
- Value should be the name of the Winner
Add a new list function "topFive" to your design document.
- List functions are used to present data in the desired format (not just JSON).
- Read more about them here
- Your list function should take the results of your MapReduce query, limit the list of results to five and print them out as a HTML table inside a HTML document.
- Unfortunately, Futon does not have an editor for List functions and you will have to manually modify the Design Document as a JSON file to add a new JavaScript list function.
- To add a new List to the Design document:
  - Open the Design document in JSON editor: http://localhost:5984/_utils/#database/tennis/_design/winners
  - Add a new "lists" entry after "views" and create a new list function called "topFive". Here is an example of a list function which simply prints out all values:
    - ```
     "lists": {
        "topFive": "function(head, req) { while(row = getRow()) {send(row.value);}}"
      },
```
- List function needs to be written as a single line string into the design document (just like the already existing map functions that are there)
View the results of the list function at localhost:5984/tennis/_design/winners/_list/topFive/biggestPayouts

Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
Take a screenshot of the view result

NB! While you can sort and limit results without a List function. You need to implement a JavScript list function to get full marks for this exercise.

Bonus Exercise: Safe Betting

Write a MapReduce job to count the matches where either the winner’s or loser’s betting were less than 1.2 (AvgW and AvgL)
Using the resulting count, write another MapReduce job which simulates placing 1000 pounds divided in equal bets on the favorite on each match where the favorite’s payout (AvgW or AvgL) is less than 1.2 and calculates the winnings.
Is it possible to “beat the house” using this strategy?
Deliverables:
1. Write a brief description how your solution works
2. The design document containing the solution
3. Also take a screenshot of the respective view result

Bonus Exercise: Advanced Betting Strategy

Devise a betting strategy that yields more profit / less loss than the one given in the previous exercise.
- PS! You can use other fields (not just AvgW and AvgL) to devise your strategy.
Formalize the strategy as a CouchDB MapReduce job and run the 1000 pound simulation to see how it stacks up against the safe betting strategy.
Feel free to use additional MapReduce jobs for intermediate data.
Write a short explanation on how the strategy works.
I will test the strategy on a different data set than the one we used in the practice session.
If it works better than the strategy in the previous exercise you get an extra point.
The strategy must be different from the strategy from Bonus Exercise: Safe Betting. Meaning: just changing the coefficient will not be sufficient.
Deliverables:
1. Write a brief description how your solution works
2. The design document containing the solution
3. Also take a screenshot of the respective view result

Deliverables:

Design documents containing the MapReduce code for Exercises 13.6 and 13.7
- You should provide is as a text document.
  - Direct link from where you find the design document 'should' be http://localhost:5984/tennis/_design/winners/
- NB! It is not allowed to provide it as a screenshot!
Take a screenshot of the view result in Exercises 13.6 and 13.7
NB! Do not forget to uninstall CouchDB

13. Practice session 13

Sellele ülesandele ei saa enam lahendusi esitada.

Potential issues and solutions

If you get an error that couchdb module does not exist when you run the Spark application:
- Also install couchdb client library for your computer's default Python enviorement
- python -m pip install couchdb

(Original author of the Lab Materials was Jürmo Mehine)

Hajusandmetöötlus pilves 2018/19 sügis