Non-Relational databases: CouchDB
In this Lab we are going to learn how to use the CouchDB NoSQL database, which follows the Document-oriented non-relational model. We will take an example dataset, convert its data into json documents and upload them to CouchDB. We will also look how to create MapReduce JavaScript queries in the database.
References
Dataset Description
The dataset that we will use in this practice session is taken from the Tennis-Data Betting, Results and Livescores Portal, which provides historical tennis results and fixed odds betting data from the men's ATP and women's WTA tours. Its main goal is to help developing quantitative tennis betting systems and forecasting.
- Name: Tennis results & Betting odds data for the ATP & WTA Tours
- Download the tennis tournaments 2018.csv CSV file.
- Location of the original data: http://www.tennis-data.co.uk/alldata.php (MS Excel format)
Dataset attributes (column names) are: | |
| |
| |
| |
|
- Each row in the file represents a tennis match. The fields include data on the winner, loser, sets played as well as betting odds by different betting brokers for either player winning the match
- The odds are represented as payout coefficients meaning that e.g for a bet of 10 and odds 1.5 the broker will pay out 10 x 1.5 = 15
- A detailed explanation of the dataset can be found here
Exercise 13.1 Install CouchDB
Your first task is to install CouchDB in your computer. CouchDB works in Windows, Linux and MacOS. Do not forget to uninstall the database once you have completed the lab and submitted the solution.
- Download and install CouchDB 2.1.0 from http://couchdb.apache.org/
- To see if the installation worked point your browser to
http://localhost:5984
- You should see the default CouchDB Welcome JSON document when everything is working properly.
Exercise 13.2 Using the Futon web interface
CouchDB also deploys a Futon web User Interface for managing the database, manually creating JSON documents and also creating design documents.
- Open up the Futon web UI at
http://localhost:5984/_utils/
- Create a new database named
test
. - Add some JSON documents to the database you just created.
Next we will take a look how we can access documents in the database
Exercise 13.3 Using the RESTful Query Interface to access CouchDB
We can send HTTP methods (GET, PUT, POST, DELETE) to the CouchDB web service to execute typical database operations, such as fetching, adding, modifying or deleting documents.
- Add a document to your database by making an HTTP PUT request on the command line
- Linux:
- Use the command line shell:
curl -X PUT http://localhost:5984/test/foobar -d '{"foo": "bar"}'
- Windows:
- Use Windows Power Shell:
Invoke-RestMethod -Method Put -Uri http://localhost:5984/test/foobar -Body '{"foo": "bar"}'
- In this example
test
is the name of the database andfoobar
is the id of the document you enter into the database
- Linux:
- Perform an HTTP GET request to query the
foobar
document from the databasetest
- Linux:
curl -X GET http://localhost:5984/test/foobar
- Windows:
Invoke-RestMethod -Method Get -Uri http://localhost:5984/test/foobar
- You can also access the document simply by navigating with your browser to the same url:
- Linux:
- Feel free to try out other HTTP API functions
Exercise 13.4 Bulk Data Upload
- Create a new database named tennis in your CouchDB instance for this data
- Download the tennis tournaments 2018.csv file
- We will use a Spark Python program (which you can download from here) to upload the content of 2018.csv dataset file (each row as a separate document representing a tennis match) to the CouchDB.
- Its argument should be:
input_folder
(location of 2018.csv file) - The Spark program expects the CouchDB to be running @ http://localhost:5984
- It first uses the DataFrame API to load in
csv
file, executes a set of parallel queries to convert the DataFrame rows into a list of Python dictionaries (Equivalent to JSON documents) and stores the list in local CouchDB usingforeachPartition()
Spark operation. - The resulting JSON documents in the CouchDB will look something like this:
{ "_id": "7f4842ef6e27eeee8b3aeec0850007ce", "_rev": "1-bdc2e63835a24039690db0a4f5a3b4d3", "ATP": 1, "Location": "Brisbane", "Tournament": "Brisbane International", "Date": "12/30/2013", "Series": "ATP250" /*...*/ }
- Its argument should be:
- Download and run the Spark program in your computer
- You will have to install
couchdb
python library. When using PyCharm, you need to do the following: - Add couchdb to your PyCharm virtual enviorement:
- Go to
File->Settings->Project Interpreter-> Plus Icon (upper right) -> Search for package name (couchdb) -> Install Package
- Go to
- If you get an error that couchdb module does not exist when you run the Spark application:
- Also install couchdb client library for your computer's default Python enviorement
python -m pip install couchdb
- You will have to install
Exercise 13.5 Counting the Documents using MapReduce query
- To run MapReduce queries on CouchDB documents we need to specify them as
Views
inDesign Documents
. These are documents whose id has the prefix_design/
- Lets create a new
design document view
in the tennis database.tennis -> Design Documents -> New View
- Write the map and reduce functions to count all documents in the database.
- Map:
function (doc) { emit("count", 1); }
- Reduce:
_count
(Predefined function) - Assign a name to the Design Document (
winners
) and also to the view (Index name:mycount
).
- Map:
- We can call and access the result of the created view at:
- localhost:5984/tennis/_design/DESIGN_DOC_NAME/_view/ViEW_NAME
- For example: http://localhost:5984/tennis/_design/winners/_view/mycount
- To force data to be grouped by key
<url>?group=true
- To not to keep it grouped by key:
<url>?group=false
(all map output will be aggregated into a single value by applying the reduce function) - We can address the json fields inside the document using
.
- like this:doc.EXL
to get the value of field EXL
- Lets now change the Map function to count separately the matches where
- the favorite won (doc.AvgW < doc.AvgL)
- and the matches where this was not the case (doc.AvgW => doc.AvgL)
- AvgW is the payout ratio of the match winner and AvgL was the losers payout ratio.
- NB! Smaller payout ratio indicates that the chance of that player winning was considered higher.
- Assign a separate key in the Map function for each case (For example, "favorite_won" for doc.AvgW < doc.AvgL and "favorite_failed" otherwise) like this:
function (doc) { if(doc.AvgW < doc.AvgL) emit("favorite_won", 1); else emit("favorite_failed", 1); }
- Check the result of the view in the database, while making sure to add
?group=true
at the end of the url.
Exercise 13.6 Who Beat the betting odds?
- Let us look more closely at the matches with a surprise winner i.e matches that the favorite lost.
- Add a new view (
againstOdds
) to your design document (winners
) to count how many times each winner beat the betting odds against them (using again the fields AvgW and AvgL to assess the betting odds) - Write the Map and Reduce functions to find winners who beat the betting odds and count how many times each Winner did it.
- Map: Emit winner's name as key and 1 as value if winners odds were considered to be lower (NB! Remember: Higher payout in betting means that odds of that person are considered lower! )
- Reduce: Count all records
- View the results of the view function at
localhost:5984/tennis/_design/winners/_view/againstOdds?group=true
- NB!
?group=true
at the end of the URL forces data to be grouped by the key in Reduce. Without this attribute, reduce will aggregate all results into a single value by default.
- NB!
- Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- Take a screenshot of the view result
Exercise 13.7 Biggest Payouts
- Add another view to find the five matches with the largest payouts (matches where AvgW is the largest)
- You only need a Map part of the query for this. Leave the Reduce as NONE.
- Keep in mind that CouchDB only allows for sorting (Add
&descending=true
to change the sorting order) results based on the key and not the value.- This means the map function will have to emit AvgW as the key.
- Value should be the name of the Winner
- Add a new list function
"topFive"
to your design document.- List functions are used to present data in the desired format (not just JSON).
- Read more about them here
- Your list function should take the results of your MapReduce query, limit the list of results to five and print them out as a HTML table inside a HTML document.
- Unfortunately, Futon does not have an editor for List functions and you will have to manually modify the Design Document as a JSON file to add a new JavaScript list function.
- To add a new List to the Design document:
- Open the Design document in JSON editor: http://localhost:5984/_utils/#database/tennis/_design/winners
- Add a new
"lists"
entry after"views"
and create a new list function called"topFive"
. Here is an example of a list function which simply prints out all values:"lists": { "topFive": "function(head, req) { while(row = getRow()) {send(row.value);}}" },
- List function needs to be written as a single line string into the design document (just like the already existing map functions that are there)
- View the results of the list function at
localhost:5984/tennis/_design/winners/_list/topFive/biggestPayouts
- Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- Take a screenshot of the view result
NB! While you can sort and limit results without a List function. You need to implement a JavScript list function to get full marks for this exercise.
Bonus Exercise: Safe Betting
- Write a MapReduce job to count the matches where either the winner’s or loser’s betting were less than 1.2 (AvgW and AvgL)
- Using the resulting count, write another MapReduce job which simulates placing 1000 pounds divided in equal bets on the favorite on each match where the favorite’s payout (AvgW or AvgL) is less than 1.2 and calculates the winnings.
- Is it possible to “beat the house” using this strategy?
- Deliverables:
- Write a brief description how your solution works
- The design document containing the solution
- Also take a screenshot of the respective view result
Bonus Exercise: Advanced Betting Strategy
- Devise a betting strategy that yields more profit / less loss than the one given in the previous exercise.
- PS! You can use other fields (not just AvgW and AvgL) to devise your strategy.
- Formalize the strategy as a CouchDB MapReduce job and run the 1000 pound simulation to see how it stacks up against the safe betting strategy.
- Feel free to use additional MapReduce jobs for intermediate data.
- Write a short explanation on how the strategy works.
- I will test the strategy on a different data set than the one we used in the practice session.
- If it works better than the strategy in the previous exercise you get an extra point.
- The strategy must be different from the strategy from Bonus Exercise: Safe Betting. Meaning: just changing the coefficient will not be sufficient.
- Deliverables:
- Write a brief description how your solution works
- The design document containing the solution
- Also take a screenshot of the respective view result
Deliverables:
- Design documents containing the MapReduce code for Exercises 13.6 and 13.7
- You should provide is as a text document.
- Direct link from where you find the design document 'should' be http://localhost:5984/tennis/_design/winners/
- NB! It is not allowed to provide it as a screenshot!
- You should provide is as a text document.
- Take a screenshot of the view result in Exercises 13.6 and 13.7
- NB! Do not forget to uninstall CouchDB
Potential issues and solutions
- If you get an error that couchdb module does not exist when you run the Spark application:
- Also install couchdb client library for your computer's default Python enviorement
python -m pip install couchdb
(Original author of the Lab Materials was Jürmo Mehine)