Cloud Computing - Courses - Institute of Computer Science

Practice 5 - Non-Relational databases: CouchDB

In this Lab we are going to learn how to use the CouchDB NoSQL database, which follows the Document-oriented non-relational model. We will take an example dataset, convert its data into json documents and upload them to CouchDB. We will also look how to create MapReduce JavaScript queries in the database.

References

Dataset Description

The dataset that we will use in this practice session is taken from the Tennis-Data Betting, Results and Livescores Portal, which provides historical tennis results and fixed odds betting data from the men's ATP and women's WTA tours. Its main goal is to help developing quantitative tennis betting systems and forecasting.

Name: Tennis results & Betting odds data for the ATP & WTA Tours
Download the tennis tournaments 2018.csv CSV file.
Location of the original data: http://www.tennis-data.co.uk/alldata.php (MS Excel format)

Dataset attributes (column names) are:
ATP: integer Location: string Tournament: string Date: string Series: string Court: string Surface: string Round: string Best of: integer Winner: string	Loser: string WRank: integer LRank: string WPts: integer LPts: string W1: integer L1: integer W2: integer L2: integer W3: integer	L3: integer W4: integer L4: integer W5: integer L5: integer Wsets: integer Lsets: integer Comment: string B365W: double B365L: double	EXW: double EXL: double LBW: double LBL: double PSW: double PSL: double MaxW: double MaxL: double AvgW: double AvgL: double

Each row in the file represents a tennis match. The fields include data on the winner, loser, sets played as well as betting odds by different betting brokers for either player winning the match
- The odds are represented as payout coefficients meaning that e.g for a bet of 10 and odds 1.5 the broker will pay out 10 x 1.5 = 15
- A detailed explanation of the dataset can be found here

Exercise 5.1 Install CouchDB

Your first task is to install CouchDB using docker image in OpenStack VM. CouchDB can also be installed Windows, Linux and MacOS as a standalone server. Here, we are going to use OpenStack VM.

Create a virtual machine with following configuration as carried out in Practice1
- Boot Source-->Volume Snapshot-->Lab5_VM (Docker Engine and Docker Compose are already configured in this image)
- Flavour--> m3.nano
Connect to the VM and verify for the installation of docker engine and docker compose.
Create docker container and choose COUCH_PASSWORD freely docker run -itd --name couchdb -p 80:5984 -e COUCHDB_USER='admin' -e COUCHDB_PASSWORD=<CHOOSE_FREELY> -v /home/ubuntu/dbdata:/opt/couchdb/data couchdb
Check the couchdb container running using sudo docker ps
Let us look into Fauxton web User Interface where CouchDB provides this for managing the database, manually creating JSON documents and also creating design documents.
- Open up the Fauxton web UI at http://<YOUR_VM_IP>:80/_utils/ in your browser.
- Create a new (non partitioned) database named test.
- Add some JSON documents to the database you just created.

Next we will take a look how we can access documents in the database

Exercise 5.2 Using the RESTful Query Interface to access CouchDB

We can send HTTP methods (GET, PUT, POST, DELETE) to the CouchDB web service to execute typical database operations, such as fetching, adding, modifying or deleting documents.
Try out following commands in the VM

Get the detail information of couchdb database using curl http://username:password@localhost:80
- This will provide a response containing Couchdb version, uuid and features etc,.
Create a database demo using curl -X PUT http://username:password@localhost:80/demo. Check the status in the Fauxton web interface and you can see the data base created.
Add a document to your database by making an HTTP PUT request curl -X PUT http://username:password@localhost:80/demo/foobar -d '{"foo": "bar"}'. Verify the same in the Fauxton web interface.
- In this example demo is the name of the database and foobar is the id of the document you enter into the database
Perform an HTTP GET request to query the foobar document from the database demo curl -X GET http://username:password@localhost:80/demo/foobar
Check the current status of the database demo using curl -X GET http://username:password@localhost:80/demo
- It provides you the information of document counts, deleted documents so far, etc.
Perform Update on the document.
- Use PUT method
- Update the value of foo with your name
- Please add the _rev attribute within the request.
- It is similar to how you create a document
- Take the screenshot of the Fauxton web UI with your updated document and your VM_IP should be visible.
Perform delete operation on the document foobar using DELETE method Example
Feel free to try with other operations HTTP API functions

Exercise 5.3 Bulk Data Upload

In this exercise you will create a new database and upload the content of the tennis database into it as JSON documents. We will write a Python script for uploading bulk data. The exercise description will teach how to use couchdb python library and upload a bulk csv data using HTTP interface. We also use python pandas to read and convert csv data to json documents.

Create a new (non partitioned) database named tennis in your CouchDB instance for this data (You can use HTTP PUT method or Fauxton UI)
Download the tennis tournaments 2018.csv file using wget
Install the required pip packages in this task. For this, create a requirements.txt and add ibmcloudant and pandas and then run using command pip3 install -r requirements.txt
Now let us create python script bulk_upload.py according to the following instructions
- Import ibmcloudant packages
  - Import CloudantV1 and BulkDocs libraries as from ibmcloudant.cloudant_v1 import CloudantV1, BulkDocs, here CloudantV1 is used to create connection string and BulkDocs used to create a couchdb schema for bulk json documents.
  - Import BasicAuthenticator library as from ibm_cloud_sdk_core.authenticators import BasicAuthenticator. It is used to authenticate to the cloudant database using username and password
- Import pandas package
- Read the 2018.csv using pandas into a data frame (Hint: pandas.read_csv(...) method)
- Replace empty, nan, inf values with "" (empty string). Otherwise, CouchDB has problems.
  - For example, you could use fillna(..).
  - Example: filled_dataframe=tennis_dataframe.fillna('')
- Transform the dataframe into a list of json records (Hint: json_records = dataframe.to_dict('records'))
- Create a BasicAuthenticator with username and password (Hint: authenticator = BasicAuthenticator(username, password))
- Create connection service using CloudantV1 with basic authentication service = CloudantV1(authenticator=authenticator)
- Set the url of the cloudant database to created service service.set_service_url('http://<VM_IP>:80')
- Create BulDocs scheme for json records bulk_docs = BulkDocs.from_dict({"docs":json_records})
- Insert the BulkDocs using service.post_bulk_docs and get the response. Example of postbulkdocs. Here, db should be tennis.
- Run the Python script.
- Check that the new entries have been added to the tennis database.
- The resulting JSON documents in the CouchDB will look something like this:
- ```
{
   "_id": "7f4842ef6e27eeee8b3aeec0850007ce",
   "_rev": "1-bdc2e63835a24039690db0a4f5a3b4d3",
   "ATP": 1,
   "Location": "Brisbane",
   "Tournament": "Brisbane International",
   "Date": "12/30/2013",
   "Series": "ATP250"
   /*...*/
}
```
- After entry of the documents, you should have a total of 2637 documents. Verify this is the case to make sure data entry was correct and you didn't accidentally add the data multiple times. If that happened, clear the DB and try again.

Exercise 5.4 Counting the Documents using MapReduce query

To run MapReduce queries on CouchDB documents we need to specify them as Views in Design Documents. These are documents whose id has the prefix _design/
Lets create a new design document view in the tennis database. tennis -> Design Documents -> New View
- Read up on design documents here and here
Write the map and reduce functions to count all documents in the database.
- Map:
```
function (doc) {
  emit("count", 1);
}
```
- Reduce: _count (Predefined function)
- Assign a name to the Design Document (winners) and also to the view (Index name: mycount).
We can call and access the result of the created view at:
- <VM_IP>:80/tennis/_design/DESIGN_DOC_NAME/_view/ViEW_NAME
- For example: http://<VM_IP>:80/tennis/_design/winners/_view/mycount
- To force data to be grouped by key <url>?group=true
- To not to keep it grouped by key: <url>?group=false (all map output will be aggregated into a single value by applying the reduce function)
Lets now change the Map function to count separately the matches where
1. the favorite won (doc.AvgW < doc.AvgL)
2. and the matches where this was not the case (doc.AvgW => doc.AvgL)
  - AvgW is the payout ratio of the match winner and AvgL was the losers payout ratio.
  - NB! Smaller payout ratio indicates that the chance of that player winning was considered higher.
Assign a separate key in the Map function for each case (For example, "favorite_won" for doc.AvgW < doc.AvgL and "favorite_failed" otherwise) like this:
- ```
function (doc) {
  if(doc.AvgW < doc.AvgL)
      emit("favorite_won", 1);
  else
      emit("favorite_failed", 1);
}
```
- Note: Map and Reduce functions in CouchDB are in JavaScript!
Check the result of the view in the database, while making sure to add ?group=true at the end of the url.

Exercise 5.5 Who Beat the betting odds?

Let us look more closely at the matches with a surprise winner i.e matches that the favorite lost.
Add a new view Index name==>(againstOdds) to your design document (winners) to count for each winner, how many times they beat the betting odds against them (using again the fields AvgW and AvgL to assess the betting odds)
Write the Map and Reduce functions to find winners who beat the betting odds and count how many times each Winner did it.
- Map: Emit winner's name as key and 1 as value if winners odds were considered to be lower (NB! Remember: Higher payout in betting means that odds of that person are considered lower! )
- Reduce: Count all records
View the results of the view function at <VM_IP>:80/tennis/_design/winners/_view/againstOdds?group=true
- NB! ?group=true at the end of the URL forces data to be grouped by the key in Reduce. Without this attribute, reduce will aggregate all results into a single value by default.

Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- You can find the design document content here: http://<VM_IP>:80/_utils/#database/tennis/_design/winners
Take a screenshot of the view result
- screenshot should display both the url and results!

Exercise 5.6 Biggest Payouts

Add another view to find the five matches with the largest payouts (matches where AvgW is the largest)
- You only need a Map part of the query for this. Leave the Reduce as NONE.
Keep in mind that CouchDB only allows for sorting (Add &descending=true to change the sorting order) results based on the key and not the value.
- This means the map function will have to emit AvgW as the key.
- Value should be the name of the Winner
Find a way how to limit the number of returned results to 5.

Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- You can find the design document content here: http://<VM_IP>:80/_utils/#database/tennis/_design/winners
- Hind: Consult the API: https://docs.couchdb.org/en/latest/api/ddoc/views.html#db-design-design-doc-view-view-name
Take a screenshot of the result
- screenshot should display both the url and results!

Exercise 5.7 Court effect on payouts

Let's create one more view to examine how the type of court affects payouts for outdoor Matches. Namely, we want to see the maximum payout odds for each type of court. For this, we want to have grouping by the court type.
- Use Map to filter out only documents where Court is "Outdoor".
  - Do not use emit(..) with other documents.
  - Use the court Surface value as the key, and the AvgW value as the value when emitting
- Add a CUSTOM Reduce function
  - The reduce function should return the maximum of provided values.
  - PS! You can ignore/remove the rereduce variable from the (generated) code. It is not necessary for function that computes max.
- Then, invoke the View with grouping enabled to get the max payout for each court type.

Take a screenshot of the result
- screenshot should display both the url and results!

Bonus Exercise I: Safe Betting

Write a MapReduce job to count the matches where either the winner’s or loser’s betting were less than 1.2 (AvgW and AvgL)
Using the resulting count, write another MapReduce job which simulates placing 1000 euros divided in equal bets on the favorite on each match where the favorite’s payout (AvgW or AvgL) is less than 1.2 and calculates the winnings.
Is it possible to “beat the house” using this strategy?
Deliverables:
1. Write a brief description how your solution works
2. The design document containing the solution
3. Also take a screenshot of the respective view result

Bonus Exercise II: Advanced Betting Strategy

Devise a betting strategy that yields more profit / less loss than the one given in the previous exercise.
- PS! You can use other fields (not just AvgW and AvgL) to devise your strategy.
Formalize the strategy as a CouchDB MapReduce job and run the 1000 euro simulation to see how it stacks up against the safe betting strategy.
Feel free to use additional MapReduce jobs for intermediate data.
Write a short explanation on how the strategy works.
I will test the strategy on a different data set than the one we used in the practice session.
If it works better than the strategy in the previous exercise you get an extra point.
The strategy must be different from the strategy from Bonus Exercise: Safe Betting. Meaning: just changing the coefficient will not be sufficient.
Deliverables:
1. Write a brief description how your solution works
2. The design document containing the solution
3. Also take a screenshot of the respective view result

Deliverables:

Design documents containing the MapReduce code for Exercises 5.5,5.6 and 5.7
- You should provide it as a text document containing full MapReduce code.
  - Direct link from where you find the design document 'should' be http://<VM_IP>:80/tennis/_design/winners/
- NB! It is not allowed to provide it as a screenshot!
Take a screenshot of the view result in Exercises 5.2,5.5, 5.6 and 5.7
NB! Do not forget to terminate your instance

5. Lab 5

Solutions for this task can no longer be submitted.

Cloud Computing 2021/22 spring