Non-Relational databases: CouchDB
In this Lab we are going to learn how to use the CouchDB NoSQL database, which follows the Document-oriented non-relational model. We will take an example dataset, convert its data into json documents and upload them to CouchDB. We will also look how to create MapReduce JavaScript queries in the database.
References
Dataset Description
The dataset that we will use in this practice session is taken from the Tennis-Data Betting, Results and Livescores Portal, which provides historical tennis results and fixed odds betting data from the men's ATP and women's WTA tours. Its main goal is to help developing quantitative tennis betting systems and forecasting.
- Name: Tennis results & Betting odds data for the ATP & WTA Tours
- Download the tennis tournaments 2018.csv CSV file.
- Location of the original data: http://www.tennis-data.co.uk/alldata.php (MS Excel format)
Dataset attributes (column names) are: | |
| |
| |
| |
|
- Each row in the file represents a tennis match. The fields include data on the winner, loser, sets played as well as betting odds by different betting brokers for either player winning the match
- The odds are represented as payout coefficients meaning that e.g for a bet of 10 and odds 1.5 the broker will pay out 10 x 1.5 = 15
- A detailed explanation of the dataset can be found here
Exercise 5.1 Install CouchDB
Your first task is to install CouchDB in your computer. CouchDB works in Windows, Linux and MacOS. Do not forget to uninstall the database once you have completed the lab and submitted the solution.
- Download and install CouchDB 3.1.1 from http://couchdb.apache.org/
- Linux installation guide is here
- Windows installation guide is here https://docs.couchdb.org/en/latest/install/windows.html
- ' On Windows
- Make sure you also have/install the .NET Framework 3.5, as the CouchDB install guides highlight
- On Linux
- the installer will prompt for the following:
- Couchdb general configuration:
standalone
- bind-address: Give IP address if your using VM or else
127.0.0.1
- Set admin user password
- Couchdb general configuration:
- the installer will prompt for the following:
- To see if the installation worked point your browser to
http://localhost:5984
for local installation and if its installed in VM, then check usinghttp://VM-IPadress:5984
- You should see the default CouchDB Welcome JSON document when everything is working properly.
Exercise 5.2 Using the Fauxton web interface
CouchDB also deploys a Fauxton web User Interface for managing the database, manually creating JSON documents and also creating design documents.
- Open up the Fauxton web UI at
http://localhost:5984/_utils/
- Create a new (non partitioned) database named
test
. - Add some JSON documents to the database you just created.
Next we will take a look how we can access documents in the database
Exercise 5.3 Using the RESTful Query Interface to access CouchDB
We can send HTTP methods (GET, PUT, POST, DELETE) to the CouchDB web service to execute typical database operations, such as fetching, adding, modifying or deleting documents.
We are going to disable authentication for the test database, for the illustration purposes:
- Go to test database view in the Fauxton interface.
- Click on Permissions.
- Under MEMBERS (NOT Admins!), remove the _admin user entry by clicking on "x" button
- Now we can send non authenticated rest requests to this database, which will simplify trying simple rest requests (at least for Windows)
- PS! Do not do this for the next exercises!
- Add a document to your database by making an HTTP PUT request on the command line
- Linux:
- Use the command line shell:
curl -X PUT http://localhost:5984/test/foobar -d '{"foo": "bar"}'
- Windows:
- Use Windows Power Shell:
Invoke-RestMethod -Method Put -Uri http://localhost:5984/test/foobar -Body '{"foo": "bar"}'
- In this example
test
is the name of the database andfoobar
is the id of the document you enter into the database
- Linux:
- Perform an HTTP GET request to query the
foobar
document from the databasetest
- Linux:
curl -X GET http://localhost:5984/test/foobar
- Windows:
Invoke-RestMethod -Method Get -Uri http://localhost:5984/test/foobar
- You can also access the document simply by navigating with your browser to the same url:
- Linux:
- Feel free to try out other HTTP API functions
Exercise 5.4 Bulk Data Upload
In this exercise you will create a new database and upload the content of the tennis database into it as JSON documents. We will use a prepared Python script for uploading bulk data. The exercise description will teach how to use Python PyCharm IDE to simplify working and running Python scripts. But you are free to use your own means to execute the required python script.
- Create a new (non partitioned) database named tennis in your CouchDB instance for this data
- Download the tennis tournaments 2018.csv file
- We will use a Python Pandas program (which you can download from here) to upload the content of 2018.csv dataset file (each row as a separate document representing a tennis match) to the CouchDB.
- Modify the Python script.
- At the start of the script, you should specify:
input_file
- location of 2018.csv file- Couchdb username
- Couchdb password
- At the start of the script, you should specify:
- The script expects the CouchDB to be running @ http://localhost:5984
- It first uses the DataFrame API to load in
csv
file, converts the DataFrame rows into a list of Python dictionaries (Equivalent to JSON documents) and stores the list in local CouchDB using.
Running the script
Here are 2 options how to execute the Python script:
1) If you already feel confident using Python & PIP from the command-line, run the script from the terminal, after having installed the pip packages: couchdb
and pandas
.
2) Alternatively, download and install the community edition of PyCharm Python IDE.
- https://www.jetbrains.com/pycharm/download
- Note: UT students can also get licensed professional version of PyCharm and other IntelliJ-based IDEs. See here for more information
- Open PyCharm and create a new Python project.
- Create a new VirtualEnv.
- Make sure to activate Inherit global site-packages
- You will have to install
couchdb
andpandas
python libraries. When using PyCharm, you need to do the following:- Go to
File->Settings->Project:sparklab2018-> Project Interpreter-> Plus Icon (upper right) -> Search for package name -> Install Package
- Add couchdb and pandas libraries
- Go to
Expected output after running the script
- Check that the new entries have been added to the tennis database.
- The resulting JSON documents in the CouchDB will look something like this:
{ "_id": "7f4842ef6e27eeee8b3aeec0850007ce", "_rev": "1-bdc2e63835a24039690db0a4f5a3b4d3", "ATP": 1, "Location": "Brisbane", "Tournament": "Brisbane International", "Date": "12/30/2013", "Series": "ATP250" /*...*/ }
Exercise 5.5 Counting the Documents using MapReduce query
- To run MapReduce queries on CouchDB documents we need to specify them as
Views
inDesign Documents
. These are documents whose id has the prefix_design/
- Lets create a new
design document view
in the tennis database.tennis -> Design Documents -> New View
- Write the map and reduce functions to count all documents in the database.
- Map:
function (doc) { emit("count", 1); }
- Reduce:
_count
(Predefined function) - Assign a name to the Design Document (
winners
) and also to the view (Index name:mycount
).
- Map:
- We can call and access the result of the created view at:
- localhost:5984/tennis/_design/DESIGN_DOC_NAME/_view/ViEW_NAME
- For example: http://localhost:5984/tennis/_design/winners/_view/mycount
- To force data to be grouped by key
<url>?group=true
- To not to keep it grouped by key:
<url>?group=false
(all map output will be aggregated into a single value by applying the reduce function) - We can address the json fields inside the document using
.
- like this:doc.EXL
to get the value of field EXL
- Lets now change the Map function to count separately the matches where
- the favorite won (doc.AvgW < doc.AvgL)
- and the matches where this was not the case (doc.AvgW => doc.AvgL)
- AvgW is the payout ratio of the match winner and AvgL was the losers payout ratio.
- NB! Smaller payout ratio indicates that the chance of that player winning was considered higher.
- Assign a separate key in the Map function for each case (For example, "favorite_won" for doc.AvgW < doc.AvgL and "favorite_failed" otherwise) like this:
function (doc) { if(doc.AvgW < doc.AvgL) emit("favorite_won", 1); else emit("favorite_failed", 1); }
- Check the result of the view in the database, while making sure to add
?group=true
at the end of the url.
Exercise 5.6 Who Beat the betting odds?
- Let us look more closely at the matches with a surprise winner i.e matches that the favorite lost.
- Add a new view (
againstOdds
) to your design document (winners
) to count how many times each winner beat the betting odds against them (using again the fields AvgW and AvgL to assess the betting odds) - Write the Map and Reduce functions to find winners who beat the betting odds and count how many times each Winner did it.
- Map: Emit winner's name as key and 1 as value if winners odds were considered to be lower (NB! Remember: Higher payout in betting means that odds of that person are considered lower! )
- Reduce: Count all records
- View the results of the view function at
localhost:5984/tennis/_design/winners/_view/againstOdds?group=true
- NB!
?group=true
at the end of the URL forces data to be grouped by the key in Reduce. Without this attribute, reduce will aggregate all results into a single value by default.
- NB!
- Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- You can find the design document content here: http://localhost:5984/_utils/#database/tennis/_design/winners
- Take a screenshot of the view result
- screenshot should display both the url and results!
Exercise 5.7 Biggest Payouts
- Add another view to find the five matches with the largest payouts (matches where AvgW is the largest)
- You only need a Map part of the query for this. Leave the Reduce as NONE.
- Keep in mind that CouchDB only allows for sorting (Add
&descending=true
to change the sorting order) results based on the key and not the value.- This means the map function will have to emit AvgW as the key.
- Value should be the name of the Winner
- Find a way how to limit the number of returned results to 5.
- Save the design documents containing the MapReduce code for this exercise as a deliverable from this exercise
- You can find the design document content here: http://localhost:5984/_utils/#database/tennis/_design/winners
- Take a screenshot of the result
- screenshot should display both the url and results!
Bonus Exercise I: Safe Betting
- Write a MapReduce job to count the matches where either the winner’s or loser’s betting were less than 1.2 (AvgW and AvgL)
- Using the resulting count, write another MapReduce job which simulates placing 1000 euros divided in equal bets on the favorite on each match where the favorite’s payout (AvgW or AvgL) is less than 1.2 and calculates the winnings.
- Is it possible to “beat the house” using this strategy?
- Deliverables:
- Write a brief description how your solution works
- The design document containing the solution
- Also take a screenshot of the respective view result
Bonus Exercise II: Advanced Betting Strategy
- Devise a betting strategy that yields more profit / less loss than the one given in the previous exercise.
- PS! You can use other fields (not just AvgW and AvgL) to devise your strategy.
- Formalize the strategy as a CouchDB MapReduce job and run the 1000 euro simulation to see how it stacks up against the safe betting strategy.
- Feel free to use additional MapReduce jobs for intermediate data.
- Write a short explanation on how the strategy works.
- I will test the strategy on a different data set than the one we used in the practice session.
- If it works better than the strategy in the previous exercise you get an extra point.
- The strategy must be different from the strategy from Bonus Exercise: Safe Betting. Meaning: just changing the coefficient will not be sufficient.
- Deliverables:
- Write a brief description how your solution works
- The design document containing the solution
- Also take a screenshot of the respective view result
Deliverables:
- Design documents containing the MapReduce code for Exercises 5.6 and 5.7
- You should provide is as a text document.
- Direct link from where you find the design document 'should' be http://localhost:5984/tennis/_design/winners/
- NB! It is not allowed to provide it as a screenshot!
- You should provide is as a text document.
- Take a screenshot of the view result in Exercises 5.6 and 5.7
- NB! Do not forget to uninstall CouchDB
Potential issues and solutions
- If you get an error that couchdb module does not exist when you run the data import application:
- Also install couchdb client library for your computer's default Python environment
python -m pip install couchdb