Homework assignment: Measuring information and data compression
Goal of the homework
The goal is to understand the meaning of the theoretical characterization of the amount of information contained in a specific message. It is informative to compute theoretical limits on the achievable compression efficiency and to compare them with the results obtained from practical data compression techniques. Another goal is to discover an analogy between the way compression algorithms approximate the source model and the heuristic approaches to "tokenization" in the framework of LLMs.
Test data examples
The example data files are listed in the table in the Appendix. Use your birthdate as your variant number.
Steps

Step 1. Estimate one-dimensional and two-dimensional probabilities for the data source.

Remark 1: While estimating empirical probabilities, use "sliding" blocks. In this case, a file of length N contains N − n + 1 blocks of length n.

Step 2. Estimate the source entropies H(X), H(X^2) = H(X_i X_{i+1}), H_2(X) = H(X^2)/2, and H(X_i | X_{i−1}). Comment on the achievable compression efficiency.
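Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not a complete solution: the string `b"abracadabra"` is a hypothetical stand-in for the data file of your variant, and the function names `block_probs` and `entropy` are my own.

```python
from collections import Counter
import math

def block_probs(data: bytes, n: int) -> dict:
    """Empirical probabilities of "sliding" blocks of length n.
    A file of length N yields N - n + 1 overlapping blocks."""
    blocks = [data[i:i + n] for i in range(len(data) - n + 1)]
    total = len(blocks)
    return {b: c / total for b, c in Counter(blocks).items()}

def entropy(probs: dict) -> float:
    """Shannon entropy in bits: H = -sum_p p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs.values())

# Hypothetical sample; substitute the data file for your variant.
data = b"abracadabra"
H1 = entropy(block_probs(data, 1))        # H(X)
H2_joint = entropy(block_probs(data, 2))  # H(X^2) = H(X_i X_{i+1})
H2_rate = H2_joint / 2                    # H_2(X) = H(X^2)/2
```

Comparing H(X), H_2(X), and H(X_i | X_{i−1}) shows how much the two-dimensional model tightens the compression bound relative to the memoryless one.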
Hint: For estimating conditional entropies, the following formula can be helpful: H(Y|X) = H(XY)−H(X).