Line termination: operating systems use different conventions

While typing text into an editor, pressing the "a" key inserts an "a" character into the document. Pressing the "b" key inserts a "b" character. In plain text each character occupies one byte, and there's a coding system that assigns a code number to each character. The number assigned to "a" happens to be 97, for example. The "a" character I'm talking about is a byte whose content is the number 97, encoded in binary (i.e., 01100001). This coding system is called ASCII and is well known. Various utilities represent characters' code values sometimes in the decimal number system, sometimes (rarely) in binary, and sometimes in hexadecimal. So "a" might show up as 97, 01100001, or 61, which are synonyms in these different numeric "languages" for the same number.
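You can verify these representations for yourself at a linux shell. A minimal sketch (printf's "'a" notation yields a character's code number, and bc prints the binary form without its leading zero):

printf '%d\n' "'a"         # decimal: 97
printf '%x\n' "'a"         # hexadecimal: 61
echo 'obase=2; 97' | bc    # binary: 1100001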

What do you suppose goes into the file when you press the Enter key? Something does, I assure you. But while the character inserted for "a" is the same regardless of your platform-- unix, Windows, Macintosh-- the character for Enter is not. This causes portability problems if you try to use the document on one of the other platforms.

Here's an example in Windows using its Notepad editor and making exactly these 6 keystrokes: a, Enter, b, Enter, c, and Enter.

If you then save the document to a file called abcwin.txt on a diskette, there are tools that can tell you how many bytes it contains.
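For example, at a Windows command prompt, dir lists the file's size in bytes; for abcwin.txt the size column reads 9:

dir abcwin.txt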

For 6 keystrokes you might expect 6 bytes, on the assumption that each keystroke outputs one byte. That's true for the letters of the alphabet. But we see here that there are 9 bytes. So you can figure out how many bytes the Enter key generated. How many?

Now let's do much the same in linux, using its gedit editor* and making exactly the same 6 keystrokes: a, Enter, b, Enter, c, and Enter.

If you then save the document to a file called abclin.txt on the same diskette, the same kind of tools can tell you how many bytes it contains.
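In linux, wc with its -c option counts a file's bytes:

wc -c abclin.txt
6 abclin.txt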

This time there are 6 bytes. That matches our expectation and seems to confirm that the Enter key produces a single output byte, just like the keys for the letters of the alphabet, whereas in Windows we saw two. Now we would like to know exactly which bytes these are, that is, what ASCII codes they have. Whatever they are, they aren't visual characters, since we see no symbol corresponding to them on the screen. We do, of course, see their effect on the screen, namely the vertical stacking of the letters on separate lines. But not a symbol.

Peering into a file to see its raw bytes in terms of coded ASCII values is a special job. There is a category of programs to do that job. They are called hexadecimal editors, or hexadecimal dump programs. Linux has one called xxd (and a freeware hex editor for Windows is xvi32). Running xxd on our two files should reveal all.
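Here's roughly what the two dumps look like (the byte values are the point; the column spacing here is approximate). xxd's left column is the offset within the file, the middle is the bytes in hex, and the right column renders each byte's character, with a dot standing in for the nonprinting ones:

xxd abcwin.txt
00000000: 610d 0a62 0d0a 630d 0a                   a..b..c..

xxd abclin.txt
00000000: 610a 620a 630a                           a.b.c.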

Note that "a," "b," and "c" show up as 61, 62, and 63 respectively. In the linux file, you see the single character 0a after each letter, that is, wherever we pressed the Enter key. In the Windows file, the characters 0d and 0a appear-- a pair of them-- instead. Hex 0a, a control character as opposed to a printing character, is called a line feed. 0d is called a carriage return. (Both terms derive from typewriter technology.) Pretty much all the programs in the Windows arena understand and expect that, in text, this 0d0a pair of characters is the signal for the end of one line and the beginning of another, whereas in linux they all expect the single 0a character to denote the same thing. A problem arises with cross-platform exchange of files: the software sees something other than what it expects and can get confused.
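A classic symptom in linux, for instance, is a shell script whose file somehow acquired Windows line endings (the script name here is invented for illustration). bash takes the stray 0d as part of the interpreter's name, displaying it as ^M:

./hello.sh
bash: ./hello.sh: /bin/bash^M: bad interpreter: No such file or directory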

This problem sometimes arises in the common scenario where a file traveling between two machines of the same platform is handled (copied, edited, downloaded) on the other platform along the way. For example, if I produce a file in linux for you to use in linux, I might distribute it to you via my website. You might download it while in Windows before moving it to your linux environment. While you handle it in Windows, it might get converted from linux format to Windows format. Some software does that (FTP clients transferring in ASCII mode, for example, translate line endings). When it arrives at its destination, it's in the wrong form.

Should you suspect this, say because you get odd error messages when using an imported file, look at the file through a hex editor. If you see 0d0a pairs, you're working with a Windows file in a linux world and need to change it before you can use it reliably. How? Maybe your hex editor has a search-and-replace option. In that case search for 0d0a and replace it with 0a. That's the easiest way. Or, linux offers two utilities for doing the conversions, win-to-lin and lin-to-win. They are:

dos2unix - converts text files in Windows format to unix format

unix2dos - converts text files in unix format to Windows format
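In their simplest usage, each converts the named file in place. For our two sample files, the win-to-lin and lin-to-win conversions would be:

dos2unix abcwin.txt
unix2dos abclin.txt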

A third way, for unix junkies, is via the unix stream editor program sed.

sed 's/\r$//' infile > outfile     # win-to-lin: deletes the 0d before each 0a

sed 's/$/\r/' infile > outfile     # lin-to-win: inserts a 0d before each 0a

And there are other ways it could be done. For instance, our Windows file can be modified to linux format with each of the above two methods, and then our linux file similarly modified to Windows format.
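Here's a sketch of that round trip, working on copies so the original files survive (the copy names are invented for this demonstration):

cp abcwin.txt w2l-a.txt ; dos2unix w2l-a.txt    # win-to-lin, method 1
sed 's/\r$//' abcwin.txt > w2l-b.txt            # win-to-lin, method 2
cp abclin.txt l2w-a.txt ; unix2dos l2w-a.txt    # lin-to-win, method 1
sed 's/$/\r/' abclin.txt > l2w-b.txt            # lin-to-win, method 2
wc -c w2l-a.txt w2l-b.txt l2w-a.txt l2w-b.txt   # should report 6, 6, 9, and 9 bytes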

When Macintosh comes into the picture there is a third variation on this theme. Classic Mac OS (before OS X) used the carriage return, 0d, alone as the line separator. So Windows used carriage return plus line feed, unix used line feed, and Mac used carriage return. Just to be different, I suppose.


* gedit, like most linux editors, may by default add a newline character at the end of the file of its own accord, in addition to and apart from what the user actually typed. To control this behavior in GNOME:  gsettings set org.gnome.gedit.preferences.editor ensure-trailing-newline false  (or use dconf-editor, the graphical equivalent of the gsettings command).
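You can inspect the setting's current value the same way:  gsettings get org.gnome.gedit.preferences.editor ensure-trailing-newline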