Practice 1 - Requesting and utilizing Cloud computing resources
- Labs are supervised by Pelle Jakovits (jakovits ät ut . ee, Ülikooli 17, Room 324)
- Feel free to send an email if you have any questions about the lab or issues with submitting deliverables.
Introduction
In this practice session you will learn how to access the cloud services that are used in the rest of the course for data processing. We will be using a private OpenStack cloud running on the hardware of the High Performance Computing Center at the University of Tartu, located at: https://stack.cloud.hpc.ut.ee/
To access the local cloud resources you will have to be inside the university network, so you should either use the lab computers, the Eduroam WiFi in the institute building, or set up a VPN connection to the university network.
- VPN - https://wiki.ut.ee/pages/viewpage.action?pageId=17105590
- Eduroam Wifi - https://wiki.ut.ee/display/AA/WiFi+ja+eduroam
We will concentrate on the Infrastructure as a Service (IaaS) type of cloud services, where you get direct and full access to the computing resources of the cloud. IaaS is a model of Cloud computing in which virtualized computing resources are provided to users over the internet. In comparison to using physical servers, computing resources can be provisioned on-demand and in real time, and applications running on the same hardware can be separated into different secure environments, each containing its own OS, software libraries and kernel.
Working with IaaS model of Cloud usually consists of the following steps:
- Register an account to access the cloud services
- Select appropriate virtual machine image to run (Ubuntu, Debian, Windows, etc.)
- Specify how many computing resources (CPU, RAM, disk space) are allocated for the instance.
- Start a new instance of the selected virtual machine image. Log into the instance as the root user over the internet and configure it to meet your requirements, e.g. install needed software, upload your own application, and perform any required configuration actions as you would on any real computer.
- As you will lose all your work when the instance is deleted, you have three options for persisting the changes you made:
- Save all your configuration steps to a script that will launch and configure the instance automatically for you.
- Bundle a new image from your running instance and next time launch your custom image.
- Save the running instance as a snapshot, and next time launch new instances from there.
The first option is more flexible, as it is easier to change the script than to bundle a new image when something changes. The second and third options are simpler to use once you have a stable configuration or when launching a large number of instances.
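As an illustration of the first option, the configuration steps could be collected into a single script that is run on every fresh instance. A minimal sketch, assuming an Ubuntu image (the package list just mirrors what is installed later in Exercise 1.5; the file name is illustrative):

```shell
# Write a minimal bootstrap script that repeats the manual setup steps.
# On OpenStack such a script could be passed to the instance at launch time.
cat > bootstrap.sh <<'EOF'
#!/bin/bash
set -e                            # abort if any step fails
apt-get update                    # refresh the package index
apt-get install -y openjdk-8-jdk  # install the software we need
EOF
# Sanity-check the script without running it
bash -n bootstrap.sh && echo "bootstrap.sh syntax OK"
```

The advantage is exactly the flexibility mentioned above: changing one line of the script changes every future instance, with no image rebuilding.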
Exercise 1.1. - Accessing the cloud services
Verify that you have access to the university OpenStack cloud resources and familiarize yourself with the available cloud functionality.
- Log into https://stack.cloud.hpc.ut.ee/ using your university username and password and ut.ee as domain.
- Familiarize yourself with the available OpenStack cloud functionality.
- Create a ssh Key Pair for accessing Virtual Machines over the network. Make sure the name of the Key Pair includes your last name!
- You will find this functionality under Compute -> Key Pairs
- NB! Copy the private key to the clipboard and save it on your computer in a text file with the extension .pem, in a location where you can easily find it later.
- If using Putty in Windows to connect to the cloud instance over ssh, you should use PuTTYgen to convert the certificate into the Putty-specific .ppk format. Use the Load and Save private key functionality in the PuTTYgen program to do it.
- PEM (Privacy Enhanced Mail) is a Base64 encoded DER certificate. PEM certificates are frequently used for servers as they can easily be translated into readable data using a simple text editor.
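Because PEM is plain Base64 text, you can check that a key file looks right straight from the command line. A small sketch using a throwaway key generated with openssl (for the lab you would instead inspect the .pem file downloaded from OpenStack):

```shell
# Generate a throwaway private key in PEM format (illustrative only --
# do NOT use this for the lab, use the Key Pair from OpenStack)
openssl genrsa -out demo.pem 2048 2>/dev/null
# The file is readable text; the first line is the PEM header
head -n 1 demo.pem
```

A valid PEM private key starts with a `-----BEGIN ... PRIVATE KEY-----` header line; if your downloaded file does not, the copy/paste from the browser likely went wrong.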
Exercise 1.2. - Requesting computing resources from the cloud
In this exercise you will start a Cloud instance (or virtual machine) while specifying its configuration and the computing resources available to it.
- Use the OpenStack web interface
- Under the "Compute" tab go to "Instances" and start a new instance by clicking the "Launch Instance" button (If not specified leave the default values)
- Start a new instance of Ubuntu 16.04 virtual machine image
- Use your last name as the Instance Name under Details tab
- Choose ubuntu16.04 under Source tab & change the volume Size to 10GB
- Also enable Delete Volume on Instance Delete under Source tab
- Choose the capacity of the instance
- Under Flavor tab, choose m1.small as the type of the instance
- Choose network for the instance
- Under Networks tab, choose provider_64_net
- Specify what Key Pair to use under the Key Pair tab!
- Use the Key Pair that you created in the previous exercise. If you lose the downloaded file, you will have to create a new one!
Exercise 1.3. Accessing your Cloud instance over the internet
We will use the Secure Shell (ssh) protocol to log into the started instance over the internet. Instances in the cloud can have multiple IP addresses: a Public IP for accessing the instance from outside the cloud and a Private IP for accessing the instance from inside the cloud (from other instances). However, our instances will only have a single IP in the current configuration.
- Log into the instance through ssh using SSH Key based authentication
- On Linux:
ssh -i path_to_my_key_pair_file ubuntu@<instance public ip address>
- For example: ssh -i .hpc/jakovits_ldpc.pem ubuntu@172.17.64.63
- If you get an error, check that the path to the keyfile is correct and that it has correct permissions (chmod 400 <filename>)
- On Windows:
- Either copy the private key pair file to a university Linux server (like math.ut.ee) and use the previous ssh command.
- Or use Putty, SSH Secure Shell or WinSCP to get a command line interface to the remote server through ssh.
- In Windows, we first have to transform the private key file (_keyname_.pem) we downloaded from OpenStack into a .ppk file.
- Follow the "To prepare to connect to a Linux instance from Windows using PuTTY" section at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html#prepare-for-putty
- Username for the SSH connection has to be ubuntu. Specify it in Putty under Connection->Data->Login details.
- Host must be the public IP of the instance you started.
- SSH Key must be the same .ppk key you converted with PuTTYgen. Specify its location under Connection->SSH->Auth->Private key file...
- Once logged into the instance, change the password of the user ubuntu
sudo passwd ubuntu
- Assign a password you can remember.
- We have to use sudo because otherwise the passwd command requires us to enter the current password, which we do not know.
- Now the user ubuntu has a password and we can use it to log into the instance through the web console instead of a ssh client.
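If the private key file's permissions are too open, ssh refuses to use it with an "UNPROTECTED PRIVATE KEY FILE" warning. A quick sketch of setting and verifying the permissions (my_key.pem is a placeholder name for your downloaded key file):

```shell
touch my_key.pem         # placeholder standing in for your downloaded .pem file
chmod 400 my_key.pem     # owner read-only, as ssh requires for private keys
stat -c '%a' my_key.pem  # prints the octal permissions: 400
```

Mode 400 means only the file's owner can read it and nobody can write it, which is exactly what ssh checks for before accepting the key.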
Exercise 1.4. Accessing your instance through the web interface
To have access to your VM when the external network connection is down or there is a problem connecting over SSH, we can use the OpenStack web interface and VNC. Make sure you changed the password for the user ubuntu earlier, so you can log into your VM using username:password through the OpenStack web console. Virtual Network Computing (VNC) is a graphical desktop sharing system that uses the Remote Frame Buffer protocol (RFB) to remotely control another computer. It transmits the keyboard and mouse events from one computer to another, relaying the graphical screen updates back in the other direction, over a network. https://en.wikipedia.org/wiki/Virtual_Network_Computing
- Log into the OpenStack web interface at https://stack.cloud.hpc.ut.ee/
- Go to Instances page and click on the name of your instance.
- Go to the Console tab and click on the "Click here to show only console" link.
- A command line interface should show up in a few moments. Refresh the page if it does not show up. If you see only a black screen, try hitting ENTER a few times.
- Log into the instance using ubuntu as the username and the password you previously specified.
- FIX ERROR: "sudo: unable to resolve host <your_machine_name_here>"
- If you try entering any sudo commands, e.g. sudo free or sudo du, you will get an "unable to resolve ..." error.
- In order to fix it, edit the /etc/hosts file and add your hostname to the end of the first line like this: 127.0.0.1 localhost <your_machine_hostname_here>. You can use nano with sudo rights to do it.
- Take a screenshot of the web command line interface after you have successfully logged in and executed a sudo command without error. The browser should stay visible in the screenshot.
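The /etc/hosts fix above can also be done non-interactively. A sketch that demonstrates the sed command on a copy of the file (on the instance you would run the same sed on /etc/hosts itself, with sudo):

```shell
cp /etc/hosts hosts.demo        # work on a copy for illustration
HN=$(cat /etc/hostname)         # the machine's hostname
sed -i "1s/$/ $HN/" hosts.demo  # append it to the end of the first line
head -n 1 hosts.demo            # first line now ends with the hostname
```

On the instance the real command would be `sudo sed -i "1s/$/ $(cat /etc/hostname)/" /etc/hosts`, after which sudo can resolve the host name again.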
It is up to your preferences whether to use the web interface or an ssh client to access your instances. The web client allows you to avoid installing additional ssh software, but may not work as well when you need to use a lot of copy/paste commands or transfer files.
Exercise 1.5. - Installing software on the instance and creating a snapshot.
We will now install Java JDK and Cloudera Hadoop distribution in a single node configuration to be able to try out some simple Hadoop MapReduce programs that we will be working on in the following labs.
- Install Java JDK on the instance
sudo apt-get update
sudo apt-cache search <name of the software>
sudo apt-get install <name of the software>
- openjdk-8-jdk is a suitable Java package to install
- Check that Java is installed
java -version
- Let's install a large scale data processing framework - Hadoop - on the instance
- What Hadoop is and how exactly it works will be described in the second lecture
- We will be installing it using a single node configuration. In the following labs, we will work in a real Hadoop cluster.
- Add Cloudera repository to be able to install Cloudera Hadoop Ubuntu packages by creating a new repository source file:
sudo wget 'https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/cloudera.list' -O /etc/apt/sources.list.d/cloudera.list
- Add public key for authenticating the added source
wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key -O archive.key
sudo apt-key add archive.key
- Install cloudera Hadoop packages
sudo apt-get update
sudo apt-get install hadoop
- Test that Hadoop is working by executing the following command:
hadoop
- Save the instance as a snapshot from the web interface (Create Snapshot)
- When naming the snapshot, include your last name!
- For example:
jakovits_lab1_snapshot
This will create a snapshot of the running instance, which can be used to skip all the virtual machine configuration steps you have performed, by selecting it (instead of the Ubuntu 16 image) as the base for new instances.
Exercise 1.6. - Running a data processing task on the cloud instance
In this exercise, you will execute a couple of Hadoop MapReduce example applications inside the cloud instance to verify that everything was installed properly. In the following labs we will use Hadoop MapReduce in a cluster environment where there are many more resources available than in a single small instance.
- Log in again into the instance
- Create an input folder at /home/ubuntu/input and create some text files inside it
- For example, you can download text books from Project Gutenberg
- You can download files directly to the instance from the internet using the wget command.
- Move the downloaded file into the /home/ubuntu/input folder (You can use the mv command)
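A sketch of fetching a book and moving it into place (the Gutenberg URL and file name are illustrative; any plain-text file works):

```shell
mkdir -p ~/input   # for the ubuntu user, ~/input is /home/ubuntu/input
# Download an example plain-text book (substitute any URL you like)
wget -q -O book.txt https://www.gutenberg.org/files/1342/1342-0.txt
mv book.txt ~/input/   # move the download into the input folder
ls ~/input             # verify the file is there
```

Several books in the input folder make the word counts in the next step more interesting.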
- Let's run an example data processing task
- A collection of already compiled MapReduce programs is located at /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar
- You can run hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar to get the list of available MapReduce programs in this example jar.
- Let's first try to execute Distributed Grep.
- Distributed grep searches all text files in a folder for a user specified string and writes into output (folder) how many times the string was seen.
- Execute the following command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar grep /home/ubuntu/input /home/ubuntu/output 'tere'
- Investigate the content of the output folder to access the result of the Distributed Grep program.
- Browse and try other example MapReduce programs:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar
- For example, try to run Pi calculation.
- Take a screenshot that shows you running one of the Hadoop commands from the command-line. It should display the command you executed and also partial output (which can be quite verbose).
We will take a look into the source code of these MapReduce example programs in the next lab and perform a number of modifications to improve them.
NB! Once you are done, you must delete your instance! Also, be careful you do not delete work of other students.
Deliverables
- Screenshots created in exercise 1.4 and 1.6
- Compress the screenshots into a single zip container file and upload them through the following submission form.
- Provide answers to the following questions:
- What are cloud Images? How are Snapshots different from Images?
- What happens if you lose your ssh Keypair? What happens to existing instances which were started with the lost ssh key?
NB! You can update your lab submission by uploading a modified version of your solution as long as you do it before the submission deadline.
Solutions to common problems
- java.net.UnknownHostException: pelletest: pelletest: Name or service not known
- Add the name of the machine to the list of known hosts inside the instance. Otherwise we will get an error when trying to execute a Hadoop MapReduce job.
- Check the current hostname of the instance:
cat /etc/hostname
- Modify the first line of the /etc/hosts file to add the hostname after 127.0.0.1 localhost:
sudo nano /etc/hosts
- The line should look something like this:
127.0.0.1 localhost hostname_of_the_instance
- ssh command line command complains about key.pem file permissions.
- Check that the path to the Keypair .pem file is correct and that it has correct file permissions (chmod 400 <filename>)
Additional links & information
- OpenStack cloud for the course https://stack.cloud.hpc.ut.ee
- List of companies using Hadoop and their use cases https://wiki.apache.org/hadoop/PoweredBy
- Hadoop wiki https://hadoop.apache.org/
- List of Hadoop tools, frameworks and engines https://hadoopecosystemtable.github.io/
- Commercial Hadoop distributions and support https://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
- SSH secure shell can be downloaded from here: https://software.sites.unc.edu/shareware/#s
- Putty and PuttyGen can be downloaded from here: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html