Protecting data with encryption
Some data is confidential, i.e. it is not public and hence only authorized people should be able to read it. For example, business and state secrets, medical data and some personally identifiable information (PII) are usually considered confidential.
The method for securing data depends on the confidentiality level and on how the data is stored. Physical data, such as information on a sheet of paper, can be kept securely in a safe, but securing digital information is more complicated. The device that stores digital information can be kept in a safe, but at some point the device has to be taken out so that the data can be accessed by a computer. But how should confidential data be secured when it is stored in a computer?
There is no simple answer to this question, as it depends on the context: the owner of the data, the confidentiality level and the potential adversary. The most important factor is the potential adversary. The security requirements designed to defend against a competitor or an absentminded employee differ significantly from those needed when the adversary is a nation state. In the latter case it might be almost impossible to create a fully secure environment for the confidential data. One method that might help in this case is using an air-gapped computer (a computer that is disconnected from all networks). However, this only works if it is used correctly. If the computer is not physically secured, the adversary might break into the room where it is located and copy the confidential data. In addition, it might be necessary to transport the data once in a while, so it also needs to be protected while it is being moved. In such cases it is important to protect the data by using encryption.
Encryption is one of the most basic methods for securing data. When used correctly, encryption can provide protection even against an advanced adversary. In the following we describe what symmetric encryption is and how it can be used to protect data.
Symmetric encryption
Encryption provides confidentiality by making data illegible without the right key. The process that is used to secure data is called encryption and the encrypted form of data is called ciphertext. With the right key it is possible to restore the initial data, also known as plaintext. This process is called decryption.
To protect data we need an encryption process and a decryption process but these do not work without a key.
               key                       key
                |                         |
                v                         v
plaintext --[encrypt]--> ciphertext --[decrypt]--> plaintext
Therefore, in order to protect data with encryption three algorithms are needed: one for generating the random key, one for encrypting data and one for decrypting the ciphertext. Together they form a cryptosystem.
An encryption key is in many ways similar to an ordinary door key. It must be protected so that it is not lost or copied by somebody who wants to misuse it (and copying is much easier in the digital world than in the physical world). Moreover, if an encryption key is lost, the encrypted data cannot be decrypted anymore (at least not by the owner of the data). In symmetric encryption the same key is used both to encrypt and to decrypt data.
Strong encryption requires strong encryption keys, which are usually long, randomly generated bitstrings (arrays of bits, i.e., zeros and ones). Unfortunately, humans are bad at remembering long random data. For example, try to memorize the following randomly generated 128-bit key (shown in hexadecimal encoding to make it human readable):
83f38541d9fa2540c8663d9d82a5c97f
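A key of this kind could be produced as in the following minimal sketch. It assumes Python and its standard `secrets` module, neither of which is prescribed by this text; the variable name `key` is only illustrative.

```python
import secrets

# Draw 16 random bytes (16 * 8 = 128 bits) from the operating system's
# cryptographically secure random number generator.
key = secrets.token_bytes(16)

# Hex encoding writes every byte as two characters, so a 128-bit key
# is printed as 32 hexadecimal digits, like the example key above.
print(key.hex())
```

Every run prints a different value, which is exactly the point: the key must not be predictable.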
Instead of memorizing encryption keys, people use passwords, which are much shorter (and have less entropy). The encryption key is then encrypted with a key derived from the user's password, and the encrypted key is stored together with the encrypted data, usually in a separate file or in the header of the encrypted document. For example, when the whole system drive is encrypted, the encrypted key is usually stored in the header of the drive.
Alternatively, it is possible to put the original encryption key in a hardware device that protects it from being copied and from unauthorized access. Examples of such hardware devices are smart cards, hardware security modules (HSM) and Trusted Platform Modules (TPM).
The security level of encrypted data depends on several factors. First, it is important to choose a safe encryption algorithm, as there are many vulnerable and outdated algorithms. Second, the encryption key has to be random, as otherwise the security assumptions do not hold. Third, the encryption / decryption key has to be stored and managed securely so that it does not leak and no unauthorized party gets access to it. Fourth, the encryption software has to follow the specification of the encryption algorithm and must not contain bugs. Implementing crypto algorithms is difficult even when all specifications are followed, because of side-channel attacks. Therefore, it is an unwritten rule that people should not write their own crypto software if there are other options. Fifth, you have to trust the computer that is running the encryption / decryption algorithm, as it has access to the plaintext. Thus, encryption does not provide confidentiality if that computer is infected with malware or controlled by a third party.
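A password is normally not used as a key directly. It is first stretched into a fixed-length key by a key derivation function, and that derived key then protects the real encryption key. The sketch below shows only the derivation step, assuming Python's standard `hashlib.pbkdf2_hmac`; the function name `derive_key`, the example password and the iteration count are illustrative assumptions, not values taken from any particular product.

```python
import hashlib
import secrets

def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Stretch a password into a 128-bit key with PBKDF2-HMAC-SHA256.

    The iteration count is an illustrative assumption; a higher count
    makes every password guess more expensive for an attacker.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=16
    )

# The salt is random but not secret; it is stored next to the encrypted key.
salt = secrets.token_bytes(16)
password_key = derive_key("correct horse battery staple", salt)
print(password_key.hex())
```

The same password and salt always give the same derived key, so the user only needs to remember the password, while the salt can sit in the file header.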
Examples of symmetric encryption
- mobile communication
- paid cable TV channels
- Skype calls
- communication between a WiFi router and a computer (if it's not an open hotspot)
- communication with web sites using HTTPS
- encrypting documents using Estonian ID-card
- DRM protection on DVDs
Encryption algorithms
An encryption algorithm is called a cipher. Let's take the simple cipher ROT13 as an example. In ROT13, every letter in a message (plaintext) is replaced by the letter that is 13 places ahead of it in the alphabet. The ends of the alphabet are tied together, so after Z comes A again. An example is given in the figure.
But where is the key here? We can construct a generalized version of ROT13, called ROTn, where each letter is shifted by n positions. Here n is the key and it is constant for the whole message. For the Latin alphabet there are 26 different options for choosing n. If n equals 0 then the encrypted message is the same as the original, but mathematically this is still a valid key. This key space is so small that simply trying all possible keys, i.e., brute forcing the key, is easy. The ROTn cipher is also known as the Caesar cipher.
There are two kinds of modern symmetric encryption algorithms: stream ciphers (Est: jadašiffer) and block ciphers (Est: plokkšiffer). Stream ciphers encrypt data one bit or byte at a time (see figure) and are typically faster than block ciphers. Therefore, they are mostly used in real-time systems that require high data throughput, e.g. to encrypt mobile communication. Examples of stream ciphers include RC4 and the A5 family, the latter being used in mobile communication. While RC4 is still used in some places, its usage has decreased significantly since statistical vulnerabilities were found in it. As a result, Google, Mozilla and Microsoft decided to stop supporting RC4 in their browsers in 2015 and 2016.
Block ciphers encrypt data block by block, where the block has a predefined length (see figure below). Block ciphers are generally slower than stream ciphers but have been less prone to attacks. The best known block ciphers include the Data Encryption Standard (DES), 3DES, Blowfish and the Advanced Encryption Standard (AES). DES used to be the most widely used block cipher, but it should not be used anymore: its 56-bit key is so short that data encrypted with DES can be decrypted with a brute force attack. DES was replaced by AES, against which no practical attack is currently known (if it is implemented and used correctly).
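The ROTn cipher and the brute force attack on it can be sketched in a few lines. This is a minimal illustration in Python (our choice of language); the function name `rot_n` and the sample message are made up for the example.

```python
import string

ALPHABET = string.ascii_uppercase  # the 26 letters A-Z

def rot_n(text: str, n: int) -> str:
    """Shift every letter A-Z by n positions, wrapping around after Z."""
    shifted = []
    for ch in text.upper():
        if ch in ALPHABET:
            shifted.append(ALPHABET[(ALPHABET.index(ch) + n) % 26])
        else:
            shifted.append(ch)  # spaces and punctuation are left as they are
    return "".join(shifted)

ciphertext = rot_n("HELLO WORLD", 13)
print(ciphertext)  # URYYB JBEYQ

# Brute force: undo the shift for each of the 26 possible keys and
# look for the one line that is readable.
for n in range(26):
    print(n, rot_n(ciphertext, -n))
```

Because 13 is half of 26, applying ROT13 twice returns the original text. For any other n, decryption means shifting by -n (or, equivalently, by 26 - n).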
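To get a feeling for why key length matters for brute force attacks, the following back-of-the-envelope calculation compares the key space of DES (56-bit effective key) with that of AES using its shortest key length (128 bits). The attacker speed of 10^12 guesses per second is purely a hypothetical assumption chosen for illustration.

```python
# A k-bit key has 2**k possible values.
des_keys = 2 ** 56    # DES: 56-bit effective key
aes_keys = 2 ** 128   # AES with its shortest key length

guesses_per_second = 10 ** 12          # hypothetical attacker speed
seconds_per_year = 60 * 60 * 24 * 365

# Time needed to try every key once (worst case for the attacker).
des_hours = des_keys / guesses_per_second / 3600
aes_years = aes_keys / guesses_per_second / seconds_per_year

print(f"DES:     about {des_hours:.0f} hours")
print(f"AES-128: about {aes_years:.1e} years")
```

Under this assumption the whole DES key space is searched in less than a day, while the same search for a 128-bit key would take far longer than the age of the universe.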
Are there any perfectly secure encryption algorithms?
It is important to understand the concept of information-theoretic security. A cryptosystem is information-theoretically secure if its security follows directly from information theory, so that it cannot be broken even by an adversary who has access to unlimited computing resources. Therefore, if an algorithm is information-theoretically secure, no adversary is able to break it without the key. An information-theoretically secure symmetric encryption algorithm exists, but it is hardly ever used in computer systems. So why is it not used if it provides perfect security?
The one-time pad
The one-time pad is an information-theoretically secure symmetric cryptosystem. In this cryptosystem the encryption key is as long as the plaintext, and each plaintext bit is combined with the corresponding key bit by using modular addition. More information about modular arithmetic can be found in the tutorial: What is modular arithmetic?
As a bit can have only two values, zero and one, it is logical to use addition modulo 2 to combine the plaintext bit with the key bit. This is exactly the XOR operation. To encrypt a plaintext bit, it is XOR-ed with its paired key bit; to decrypt the corresponding ciphertext bit, it is XOR-ed with the same key bit again. The following list shows the result of the XOR operation for all possible inputs.
- 1 XOR 1 = 0
- 1 XOR 0 = 1
- 0 XOR 1 = 1
- 0 XOR 0 = 0
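The XOR table above is all that is needed to build a one-time pad. The following minimal sketch (in Python, with the illustrative helper name `xor_bytes`) applies XOR byte by byte, which is the same as applying it to each of the eight bits in every byte.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equally long byte strings, bit by bit."""
    if len(data) != len(key):
        raise ValueError("the one-time pad key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"attack at dawn"
key = secrets.token_bytes(len(plaintext))   # fresh random key, same length

ciphertext = xor_bytes(plaintext, key)      # encryption
recovered = xor_bytes(ciphertext, key)      # decryption is the same operation

print(ciphertext.hex())
print(recovered)
```

Decryption works because XOR-ing with the same key bit twice cancels out: (p XOR k) XOR k = p.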
There are several problems with the one-time pad. First, the key has to be very large: for example, to encrypt a one gigabyte file one would need a one gigabyte random key. This makes key generation slow and the key difficult to transport. Second, a key can be used only once, so a new key is needed for each message. Third, the key has to be perfectly random, and it is very difficult to create truly random bits. These problems show why using the one-time pad is not feasible in commercial systems. In spite of this, there are still some use cases for the one-time pad. For example, spies may be given keys that can be used once to send an encrypted message that cannot be decrypted once the key is destroyed.
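Why the key may be used only once can be shown with a small experiment. If two messages are encrypted with the same key, XOR-ing the two ciphertexts cancels the key and leaves the XOR of the two plaintexts. The sketch below is a hedged illustration in Python; the messages and the helper name `xor_bytes` are made up for the example.

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

p1 = b"meet at noon"
p2 = b"meet at nine"
key = secrets.token_bytes(len(p1))

# Mistake: the same key is used for two different messages.
c1 = xor_bytes(p1, key)
c2 = xor_bytes(p2, key)

# An eavesdropper who sees only c1 and c2 can compute this value,
# because (p1 XOR key) XOR (p2 XOR key) = p1 XOR p2.
leak = xor_bytes(c1, c2)
print(leak == xor_bytes(p1, p2))  # True
print(leak.hex())                 # zero bytes wherever the messages agree
```

The leaked value does not depend on the key at all, so the randomness of the key no longer protects the messages.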
Transparency and backdoors
Making an encryption program open-source does not make it less secure. If the implementation is correct and the program does not contain bugs, then revealing the source code of a standardized cryptosystem should have no effect on its security. This is a direct result of Kerckhoffs's principle, which says that a cryptosystem should be designed to be secure even if everything except the key is public. This means that the security depends only on the secrecy and randomness of the key. Due to Kerckhoffs's principle almost all modern cryptographic algorithms are public, as they are designed to be secure without relying on tricks hidden in the algorithm.
Following the same reasoning, crypto software should be open-source, as then it can be reviewed by cryptographers and the security community. When crypto software is closed source, the users have to trust that the developers did not make mistakes. However, it would be naive to believe that a large software product does not contain any bugs. A commonly cited estimate used to be around 15-50 bugs per 1000 lines of code. In addition, it is important to audit crypto / security software to make sure that it does not contain backdoors. With open source products the community can review the code and make sure that the product does not contain obvious backdoors. While it is also possible to have backdoors hidden in standards, as in Dual EC, they are still rather unlikely, as building backdoors into standards is expensive and complicated. As a source code audit would probably detect regular backdoors, it is advisable to use open source crypto software if that is an option. History gives us several examples of crypto products that contained backdoors.
In recent years the discussion about mandatory backdoors and regulated cryptography has resurfaced. The source of the discussion is a combination of terrorism and the increasing number of encrypted connections, and it is claimed that law enforcement agencies are no longer able to monitor suspects. Similar ideas were already present in the US during the 1990s, and at that time the opponents of regulated cryptography won. In those days strong cryptography was considered military equipment, and therefore exporting such applications was restricted. The main idea of the pro-regulation side was to create a backdoored chip (the Clipper chip) that would give access when necessary. However, the plan had security issues and met strong resistance, which led to it being dropped.
Now we are back to the same question, with several countries pondering whether to regulate cryptography. But times have changed, and it is no longer possible to restrict the usage of strong cryptography because it has become a central piece of the digital economy. Moreover, it is not wise to restrict the usage of encryption software, as it is so widespread. There are hundreds of different software products that offer to protect data with the help of encryption. These tools originate from many countries around the globe, and thus blocking encryption software in one country would not work. If a country set restrictions on the usage of such software, the average person would suffer the most, as criminals could continue to use the restricted software. This problem is described in a blog post by Bruce Schneier: A Worldwide Survey of Encryption Products.
Still, several countries are interested in creating laws that mandate backdoors. However, this would create a need for new crypto protocols that make it possible to create "secure" backdoors. What do we mean by secure backdoors? Well, we would not want our opponents to use the same backdoor on us. Therefore, a secure backdoor should only work for selected parties. The problem is that currently there are no such crypto protocols that could not also be exploited by the opponents. Let's assume that key escrow is mandated to get access to encrypted content. In this case the security of the encrypted data lies in the escrow key, which must not fall into the wrong hands. However, we have seen several examples of top secret documents being leaked. In addition, there are cyber attacks, bribes, espionage, etc. Thus, if the backdoor falls into the wrong hands, they can exploit the vulnerability as well. This concept is unfortunately not widely understood by decision makers. What may be even worse is the precedent that such regulation would create. If a western country can create such laws, then why couldn't countries with less freedom of speech, like China or Russia, have the same rights? Western companies doing business in these countries would then be forced to comply as well. In order to bring attention to these problems, well known cryptographers co-authored a report: Keys Under Doormats: Mandating insecurity by requiring government access to all data and communications.
There are several other reports on this topic:
- The Effect of Encryption on Lawful Access to Communications and Data (CSIS, 2017)
- The Risks of “Responsible Encryption” (Stanford, 2018)
- Moving the Encryption Policy Conversation Forward (The Carnegie Endowment for International Peace, 2019)
Encryption vs encoding
Sometimes encryption is confused with encoding (Est. kodeering), but in fact they are completely different. Encryption is meant to hide a message from unauthorized view, while encoding "translates" information between different formats. It is important to notice that an encoding algorithm is public and no key is used. Therefore, everyone can decode the data if they know which encoding was used. For example, the characters on this webpage are encoded using UTF-8.
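The difference can be seen in a short experiment: encoding and decoding need no secret at all, only knowledge of the format. The sketch below is a minimal illustration in Python using the standard `base64` module; the sample word is arbitrary.

```python
import base64

text = "šiffer"

# UTF-8 turns characters into bytes; "š" needs two bytes, the rest one each.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'\xc5\xa1iffer'

# Base64 turns arbitrary bytes into printable characters.
encoded = base64.b64encode(utf8_bytes)
print(encoded)

# Anyone who knows the formats can reverse both steps. No key is involved.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```

This is why base64-encoded data must never be treated as protected: it only looks unreadable to a human.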
In everyday life, we are used to decimal numbers, where we have 10 digits: 0-9. However, every decimal number (and in computers, everything is a number) can be encoded as a binary number that the computer understands. In binary, since there are only two digits (0 and 1), the same number is a lot longer.
Other common encodings are octal (eight digits: 0-7), hexadecimal (hex; 16 digits: 0-9, A-F) and base64 (64 digits: A-Z, a-z, 0-9, +, /). The last two are most often used for compact representations of binary data, e.g. encryption keys.
The same data in different encodings:
binary: 001001001101011101101100000001000001100011011111
octal: 1115355401014337
decimal: 40507648776415
hex: 24d76c0418df
base64: JNdsBBjf
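The representations above can be reproduced with a few lines of Python (our choice of language; only standard library functions are used). The value is treated as a 48-bit number, i.e. six bytes.

```python
import base64

value = 0x24d76c0418df           # the hex form from the listing above
raw = value.to_bytes(6, "big")   # the same value as 6 bytes (48 bits)

print("binary: ", format(value, "048b"))   # 48 binary digits, zero padded
print("octal:  ", format(value, "o"))
print("decimal:", value)
print("hex:    ", raw.hex())
print("base64: ", base64.b64encode(raw).decode("ascii"))
```

Note how the length shrinks as the number of available digits grows: 48 characters in binary, 16 in octal, 14 in decimal, 12 in hex and only 8 in base64.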
Lab
In the lab session we will introduce some built-in encryption functions of the Windows operating system. We will learn to use VeraCrypt. In addition, we will try to recover deleted files and then securely erase them.
Further reading
- Cryptography
- Random numbers and how to generate them
- Other links