Practice 5 - Non-Relational databases: CouchDB
In this Lab we are going to learn how to use the CouchDB NoSQL database, which follows the Document-oriented non-relational model. We will take an example dataset, convert its data into json documents and upload them to CouchDB. We will also look how to create MapReduce JavaScript queries in the database.
References
Dataset Description
The dataset that we will use in this practice session is taken from the Tennis-Data Betting, Results and Livescores Portal, which provides historical tennis results and fixed odds betting data from the men's ATP and women's WTA tours. Its main goal is to help developing quantitative tennis betting systems and forecasting.
- Name: Tennis results & Betting odds data for the ATP & WTA Tours
- Download the tennis tournaments 2018.csv CSV file.
- Location of the original data: http://www.tennis-data.co.uk/alldata.php (MS Excel format)
Dataset attributes (column names) are: | |||
|
|
|
|
- Each row in the file represents a tennis match. The fields include data on the winner, loser, sets played as well as betting odds by different betting brokers for either player winning the match
- The odds are represented as payout coefficients meaning that e.g for a bet of 10 and odds 1.5 the broker will pay out 10 x 1.5 = 15
- A detailed explanation of the dataset can be found here
Exercise 5.1 Install CouchDB
Your first task is to install CouchDB using docker image in OpenStack VM. CouchDB can also be installed Windows, Linux and MacOS as a standalone server. Here, we are going to use OpenStack VM.
- Create a virtual machine with following configuration as carried out in Practice1
- Boot Source-->Volume Snapshot-->Lab5_VM (Docker Engine and Docker Compose are already configured in this image)
- Flavour--> m3.nano
- Connect to the VM and verify for the installation of docker engine and docker compose.
- Create docker container and choose COUCH_PASSWORD freely
docker run -itd --name couchdb -p 80:5984 -e COUCHDB_USER='admin' -e COUCHDB_PASSWORD=<CHOOSE_FREELY> -v /home/ubuntu/dbdata:/opt/couchdb/data couchdb
- Check the couchdb container running using
sudo docker ps
- Let us look into Fauxton web User Interface where CouchDB provides this for managing the database, manually creating JSON documents and also creating design documents.
- Open up the Fauxton web UI at
http://<YOUR_VM_IP>:80/_utils/
in your browser. - Create a new (non partitioned) database named
test
. - Add some JSON documents to the database you just created.
- Open up the Fauxton web UI at
Next we will take a look how we can access documents in the database
Exercise 5.2 Using the RESTful Query Interface to access CouchDB
We can send HTTP methods (GET, PUT, POST, DELETE) to the CouchDB web service to execute typical database operations, such as fetching, adding, modifying or deleting documents.
Try out following commands in the VM
- Get the detail information of couchdb database using
curl http://username:password@localhost:80
- This will provide a response containing Couchdb version, uuid and features etc,.
- Create a database demo using
curl -X PUT http://username:password@localhost:80/demo
. Check the status in the Fauxton web interface and you can see the data base created. - Add a document to your database by making an HTTP PUT request
curl -X PUT http://username:password@localhost:80/demo/foobar -d '{"foo": "bar"}'
. Verify the same in the Fauxton web interface.- In this example
demo
is the name of the database andfoobar
is the id of the document you enter into the database
- In this example
- Perform an HTTP GET request to query the
foobar
document from the database democurl -X GET http://username:password@localhost:80/demo/foobar
- Check the current status of the database demo using
curl -X GET http://username:password@localhost:80/demo
- It provides you the information of document counts, deleted documents so far, etc.
- Perform Update on the document.
- Use PUT method
- Update the value of foo with your name
- Please add the _rev attribute within the request.
- It is similar to how you create a document
- Take the screenshot of the Fauxton web UI with your updated document and your VM_IP should be visible.
- Perform delete operation on the document foobar using DELETE method Example
- Feel free to try with other operations HTTP API functions
Exercise 5.3 Bulk Data Upload
In this exercise you will create a new database and upload the content of the tennis database into it as JSON documents. We will write a Python script for uploading bulk data. The exercise description will teach how to use couchdb python library and upload a bulk csv data using HTTP interface. We also use python pandas to read and convert csv data to json documents.
- Create a new (non partitioned) database named tennis in your CouchDB instance for this data (You can use HTTP PUT method or Fauxton UI)
- Download the tennis tournaments 2018.csv file using
wget
- Install the required pip packages in this task. For this, create a
requirements.txt
and addibmcloudant
andpandas
and then run using commandpip3 install -r requirements.txt
- Now let us create python script
bulk_upload.py
according to the following instructions- Import
ibmcloudant
packages- Import CloudantV1 and BulkDocs libraries as
from ibmcloudant.cloudant_v1 import CloudantV1, BulkDocs
, here CloudantV1 is used to create connection string and BulkDocs used to create a couchdb schema for bulk json documents. - Import BasicAuthenticator library as
from ibm_cloud_sdk_core.authenticators import BasicAuthenticator
. It is used to authenticate to the cloudant database using username and password
- Import CloudantV1 and BulkDocs libraries as
- Import
pandas
package - Read the
2018.csv
using pandas into a data frame (Hint: pandas.read_csv(...) method) - Replace empty, nan, inf values with "" (empty string). Otherwise, CouchDB has problems.
- For example, you could use fillna(..).
- Example:
filled_dataframe=tennis_dataframe.fillna('')
- Transform the dataframe into a list of json records (Hint: json_records = dataframe.to_dict('records'))
- Create a BasicAuthenticator with username and password (Hint: authenticator = BasicAuthenticator(username, password))
- Create connection service using CloudantV1 with basic authentication
service = CloudantV1(authenticator=authenticator)
- Set the url of the cloudant database to created service
service.set_service_url('http://<VM_IP>:80')
- Create BulDocs scheme for json records
bulk_docs = BulkDocs.from_dict({"docs":json_records})
- Insert the BulkDocs using service.post_bulk_docs and get the response. Example of postbulkdocs. Here, db should be tennis.
- Run the Python script.
- Check that the new entries have been added to the tennis database.
- The resulting JSON documents in the CouchDB will look something like this:
{ "_id": "7f4842ef6e27eeee8b3aeec0850007ce", "_rev": "1-bdc2e63835a24039690db0a4f5a3b4d3", "ATP": 1, "Location": "Brisbane", "Tournament": "Brisbane International", "Date": "12/30/2013", "Series": "ATP250" /*...*/ }
- After entry of the documents, you should have a total of 2637 documents. Verify this is the case to make sure data entry was correct and you didn't accidentally add the data multiple times. If that happened, clear the DB and try again.
- Import
Exercise 5.4 Counting the Documents using MapReduce query
- To run MapReduce queries on CouchDB documents we need to specify them as
Views
inDesign Documents
. These are documents whose id has the prefix_design/
- Lets create a new
design document view
in the tennis database.tennis -> Design Documents -> New View
- Write the map and reduce functions to count all documents in the database.
- Map:
function (doc) { emit("count", 1); }
- Reduce:
_count
(Predefined function) - Assign a name to the Design Document (
winners
) and also to the view (Index name:mycount
).
- Map:
- We can call and access the result of the created view at:
- <VM_IP>:80/tennis/_design/DESIGN_DOC_NAME/_view/ViEW_NAME
- For example: http://<VM_IP>:80/tennis/_design/winners/_view/mycount
- To force data to be grouped by key
<url>?group=true
- To not to keep it grouped by key:
<url>?group=false
(all map output will be aggregated into a single value by applying the reduce function)
- Lets now change the Map function to count separately the matches where
- the favorite won (doc.AvgW < doc.AvgL)
- and the matches where this was not the case (doc.AvgW => doc.AvgL)
- AvgW is the payout ratio of the match winner and AvgL was the losers payout ratio.
- NB! Smaller payout ratio indicates that the chance of that player winning was considered higher.
- Assign a separate key in the Map function for each case (For example, "favorite_won" for doc.AvgW < doc.AvgL and "favorite_failed" otherwise) like this:
function (doc) { if(doc.AvgW < doc.AvgL) emit("favorite_won", 1); else emit("favorite_failed", 1); }
- Note: Map and Reduce functions in CouchDB are in JavaScript!
- Check the result of the view in the database, while making sure to add
?group=true
at the end of the url.
Exercise 5.5 Who Beat the betting odds?
- Let us look more closely at the matches with a surprise winner i.e matches that the favorite lost.
- Add a new view Index name==>(
againstOdds
) to your design document (winners
) to count for each winner, how many times they beat the betting odds against them (using again the fields AvgW and AvgL to assess the betting odds) - Write the Map and Reduce functions to find winners who beat the betting odds and count how many times each Winner did it.
- Map: Emit winner's name as key and 1 as value if winners odds were considered to be lower (NB! Remember: Higher payout in betting means that odds of that person are considered lower! )
- Reduce: Count all records
- View the results of the view function at
<VM_IP>:80/tennis/_design/winners/_view/againstOdds?group=true
- NB!
?group=true
at the end of the URL forces data to be grouped by the key in Reduce. Without this attribute, reduce will aggregate all results into a single value by default.
- NB!
- Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- You can find the design document content here: http://<VM_IP>:80/_utils/#database/tennis/_design/winners
- Take a screenshot of the view result
- screenshot should display both the url and results!
Exercise 5.6 Biggest Payouts
- Add another view to find the five matches with the largest payouts (matches where AvgW is the largest)
- You only need a Map part of the query for this. Leave the Reduce as NONE.
- Keep in mind that CouchDB only allows for sorting (Add
&descending=true
to change the sorting order) results based on the key and not the value.- This means the map function will have to emit AvgW as the key.
- Value should be the name of the Winner
- Find a way how to limit the number of returned results to 5.
- Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- You can find the design document content here: http://<VM_IP>:80/_utils/#database/tennis/_design/winners
- Hind: Consult the API: https://docs.couchdb.org/en/latest/api/ddoc/views.html#db-design-design-doc-view-view-name
- Take a screenshot of the result
- screenshot should display both the url and results!
Exercise 5.7 Court effect on payouts
- Let's create one more view to examine how the type of court affects payouts for outdoor Matches. Namely, we want to see the maximum payout odds for each type of court. For this, we want to have grouping by the court type.
- Use Map to filter out only documents where Court is "Outdoor".
- Do not use emit(..) with other documents.
- Use the court Surface value as the key, and the AvgW value as the value when emitting
- Add a CUSTOM Reduce function
- The reduce function should return the maximum of provided values.
- PS! You can ignore/remove the rereduce variable from the (generated) code. It is not necessary for function that computes max.
- Then, invoke the View with grouping enabled to get the max payout for each court type.
- Use Map to filter out only documents where Court is "Outdoor".
- Take a screenshot of the result
- screenshot should display both the url and results!
Bonus Exercise I: Safe Betting
- Write a MapReduce job to count the matches where either the winner’s or loser’s betting were less than 1.2 (AvgW and AvgL)
- Using the resulting count, write another MapReduce job which simulates placing 1000 euros divided in equal bets on the favorite on each match where the favorite’s payout (AvgW or AvgL) is less than 1.2 and calculates the winnings.
- Is it possible to “beat the house” using this strategy?
- Deliverables:
- Write a brief description how your solution works
- The design document containing the solution
- Also take a screenshot of the respective view result
Bonus Exercise II: Advanced Betting Strategy
- Devise a betting strategy that yields more profit / less loss than the one given in the previous exercise.
- PS! You can use other fields (not just AvgW and AvgL) to devise your strategy.
- Formalize the strategy as a CouchDB MapReduce job and run the 1000 euro simulation to see how it stacks up against the safe betting strategy.
- Feel free to use additional MapReduce jobs for intermediate data.
- Write a short explanation on how the strategy works.
- I will test the strategy on a different data set than the one we used in the practice session.
- If it works better than the strategy in the previous exercise you get an extra point.
- The strategy must be different from the strategy from Bonus Exercise: Safe Betting. Meaning: just changing the coefficient will not be sufficient.
- Deliverables:
- Write a brief description how your solution works
- The design document containing the solution
- Also take a screenshot of the respective view result
Deliverables:
- Design documents containing the MapReduce code for Exercises 5.5,5.6 and 5.7
- You should provide it as a text document containing full MapReduce code.
- Direct link from where you find the design document 'should' be http://<VM_IP>:80/tennis/_design/winners/
- NB! It is not allowed to provide it as a screenshot!
- You should provide it as a text document containing full MapReduce code.
- Take a screenshot of the view result in Exercises 5.2,5.5, 5.6 and 5.7
- NB! Do not forget to terminate your instance