Message digest exercise

Create a working file

Create a file containing a megabyte of random values in your home directory:

cd
dd  if=/dev/urandom  of=~/myfile1M  bs=1024  count=1024

Verify that you ended up with a megabyte:

ls  -l  ~/myfile1M

It will show a size of 1,048,576 (which is one megabyte). Examine the contents in hex if you like (interrupt with ctrl-C if you get impatient):

xxd  -g1  myfile1M
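
To double-check the arithmetic, note that 1024 blocks of 1024 bytes each is 1024 x 1024 bytes; the shell can multiply that for you:

echo  $(( 1024 * 1024 ))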

Below we'll manipulate the content of files with message digest utilities. We refer to the files' contents as "messages."

Generate digests

You can produce a string whose value is
  1) practically unique to this file,
  2) dependent on its whole content, and
  3) short.
Such a string is called a "digest" or "hash" of the content of the file. There are (at least) a couple of different utilities that can do it: md5sum and sha1sum. They use different algorithms ("Message Digest" and "Secure Hash Algorithm") that both output strings with the properties mentioned. Try both:

md5sum  myfile1M
sha1sum  myfile1M

In the first case the md5 algorithm's output string, derived from a megabyte of input, is itself just 32 characters. sha1 outputs just 40 characters. (The digests' actual sizes are a constant 16 and 20 bytes respectively. The encoded outputs are double that, since normal hexadecimal encoding uses 2 ascii characters to represent a single byte.)
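
You can confirm those lengths without counting by hand; this quick check (not part of the original exercise) prints the number of hex characters in each digest:

md5sum  myfile1M  | awk '{print length($1)}'
sha1sum  myfile1M  | awk '{print length($1)}'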

Digest unique to a given message

Different messages produce different digests, and the same message always produces the same digest. A digest utility, given the same content from different files or from a pipe, will produce the same digest every time.

echo  "How now brown cow"  >  mytestfile1
echo  "How now brown cow"  >  mytestfile2
md5sum  mytestfile1
md5sum  mytestfile2
echo  "How now brown cow"  |  md5sum

Note that the digest is the same each time (namely 5be031ac67e80d243ff758ca436f6d3a), because so is the message. The digest fingerprints the file. Please don't try to prove that other messages produce other digests by trying them all. You don't have time.

Digest depends on "every bit" of the message

The digest is sensitive to the whole message, producing highly dissimilar digests even for highly similar messages. Consider the message "ABC" in ascii. In binary A is 01000001, B is 01000010, and C is 01000011. So the message "ABB" differs from "ABC" by just a single bit (the last one), while the other 23 bits are all the same. Same with "ABA," which differs from "ABC" solely in the next-to-last bit. Generate the 3 corresponding digests:

echo  -n  "ABC"  |  md5sum
echo  -n  "ABB"  |  md5sum
echo  -n  "ABA"  |  md5sum

echo  -n  "ABC"  |  sha1sum
echo  -n  "ABB"  |  sha1sum
echo  -n  "ABA"  |  sha1sum

The inputs are nearly identical; what about the 3 corresponding digests?
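
They should look nothing alike. To quantify that rather than eyeball it, here is a small bash sketch (not part of the original exercise) that counts how many of the 32 hex characters two of the md5 digests share position-for-position; by pure chance you would expect only about 2 of the 32 to agree:

a=$(echo -n "ABC" | md5sum | cut -d' ' -f1)   # digest of "ABC"
b=$(echo -n "ABB" | md5sum | cut -d' ' -f1)   # digest of "ABB"
same=0
for ((k=0; k<32; k++)); do
  [ "${a:$k:1}" = "${b:$k:1}" ] && same=$((same+1))
done
echo "$same of 32 hex characters match"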

If you receive a file and know the digest value the sender generated for the original, running the digest utility on your copy will reveal whether yours departs from the original in the slightest (i.e., whether it changed in transit), even by just one bit. Digests are commonly used this way for data validation, like checksums. They are also used in generating digital signatures and in storing passwords. They are employed in the protocols (TLS/SSL) used routinely for secure access to commercial and financial websites. They are in play, behind the scenes, whenever you visit your bank on the web.
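
Coming back to file verification: the *sum utilities support that workflow directly through their --check (-c) option. A minimal sketch (the file name sums.txt is just an arbitrary choice):

sha1sum  mytestfile1  mytestfile2  >  sums.txt
sha1sum  -c  sums.txt

If either file later changes, even by a single bit, its line in the -c output will report FAILED instead of OK.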

Digest size for large files remains short

Digest size is constant: a few dozen bytes at most, a figure characteristic of the algorithm. It doesn't grow for large files, so a digest of the same small size represents either a huge or a tiny file. An interesting command to generate an arbitrary number of bytes, say 1000 of them, is "head -c 1000 < /dev/urandom". The head command takes the first 1000 bytes it is given. The /dev/urandom device generates an endless stream of bytes with (pseudo)random values. The < redirection operator feeds them to head. Use it to generate some files ranging from 1 to 64 megabytes in size:

for i in {0..6}; do     head -c $[1048576*2**$i] < /dev/urandom     > ~/myfile$[$i+1]; done
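
List the results to note the range of file sizes you got:

ls  -l  ~/myfile?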

You don't need to use it, but this is a functionally equivalent script formerly used here:

#!/bin/bash
# start fresh, then create a 1 MB file of random bytes
rm -f ~/myfile?
dd if=/dev/urandom of=~/myfile1 bs=1024 count=1024 2> /dev/null
# double the size six times by concatenating each file with itself
i=1
while [ $i -le 6 ]
do
  echo "2 x myfile$i -> myfile$[$i+1]"
  cat ~/myfile$i ~/myfile$i >> ~/myfile$[$i+1]
  let i=i+1
done
ls -l ~/myfile?

Note the range of file sizes produced. Now produce digests for those files large and small:

md5sum  myfile?
sha1sum  myfile?

All the digests are of constant size. It doesn't matter how big the file itself is.
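
One way to see that side by side, pairing each file's size with the length of its digest, is this quick loop (not part of the original exercise; stat -c %s is the GNU coreutils way to print a file's size in bytes):

for f in ~/myfile?; do
  printf "%10s bytes -> %s hex chars of sha1 digest\n" "$(stat -c %s "$f")" "$(sha1sum "$f" | awk '{print length($1)}')"
done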

Deprecation of digest algorithms

Security algorithms typically have a lifetime. At the outset they provide security sufficient for their purposes. But then they gradually become unacceptable due to factors like the discovery of algorithmic weaknesses and/or the increasing processing power of computers with which to mount attacks against them. At that point they are said to be "deprecated." Both md5 and sha1 are deprecated. There are several other variants of sha that employ modifications to the digest calculation algorithm and produce longer digests. For example, while sha1 produces a 20-byte digest, its more recent counterpart sha512 produces one that is 64 bytes instead. Try several of the variants to see the differences in output size:

echo  -n  "ABC"  |  sha1sum
echo  -n  "ABC"  |  sha224sum
echo  -n  "ABC"  |  sha256sum
echo  -n  "ABC"  |  sha384sum
echo  -n  "ABC"  |  sha512sum

You can see clear evidence of the movement to stronger digest algorithms in the growth of password digest length in later versions of linux. For example, assigning the word "password" as an account password in fedora 13 versus fedora 7 produces a much longer password digest (the stored version of the password) in the /etc/shadow file.

 "password" as a password, as hashed and stored in fedora 7:
  $1$QA5LiQni$hGZSchyZwb81ukYbr271./

 "password" as a password, as hashed and stored in fedora 13:
 $6$ygvkpyQm$nnSxzzAP1pQUZJGw2.UfbDG3iKRpk16zcwV3oIcs1KXq9Irk7za5L7gbbEd9T7naNqPYjBhBgXx/9LVJ/mqHK1
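
The leading $1$ and $6$ mark these as md5-crypt and sha512-crypt hashes respectively. If you are curious, a reasonably recent openssl can produce hashes in both formats (with a random salt each time); this is just an illustration, not how the distributions themselves set passwords:

# the -6 (sha512-crypt) option requires openssl 1.1.1 or newer
openssl  passwd  -1  password
openssl  passwd  -6  password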