Thursday, January 16, 2014

What is the end of the line (a couple of notes on newlines)

Handling newlines

When moving text files between different systems, line breaks (also known as newlines) can be problematic. This is because different systems use different character codes (or combinations of them) to represent a newline.

The basic codes used are the following:
LF (Line feed) '\n' ^J 0x0A 10 (decimal)
CR (Carriage return) '\r' ^M 0x0D 13 (decimal)

The basic cases for the typical systems are:
LF Unix-like systems (including GNU/Linux & OS X, for example)
CR+LF Windows & DOS, and typically the textual Internet protocols (see below)
CR Mac OS before OS X
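
To see the three conventions as actual bytes, here is a small sketch using printf and od, both of which should be available on any Unix-like system. The sample text ("one", "two") is made up for the illustration.

```shell
# Write the same two lines in each convention and dump the bytes.
# od -c shows LF as \n and CR as \r.
printf 'one\ntwo\n'     | od -c   # LF    (Unix-like systems)
printf 'one\r\ntwo\r\n' | od -c   # CR+LF (Windows & DOS)
printf 'one\rtwo\r'     | od -c   # CR    (Mac OS before OS X)
```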

What about the (textual) Internet protocols? In general they use CR+LF at the protocol level, although they usually recommend that applications also accept a plain LF. In text (ASCII) mode, FTP converts newlines between CR+LF and the local system's convention; in binary mode it leaves them untouched.
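
As an illustration of CR+LF at the protocol level, the following sketch builds a minimal HTTP request by hand. The function name http_request and the host example.com are only placeholders for this example; nothing is actually sent over the network.

```shell
# Build a minimal HTTP/1.1 request; every header line ends with CR+LF
# and an empty line (just CR+LF) ends the header block.
http_request() {
    # $1: host name (hypothetical parameter for this sketch)
    printf 'GET / HTTP/1.1\r\nHost: %s\r\n\r\n' "$1"
}

# Dump the bytes instead of sending them, to make the \r \n pairs visible.
http_request example.com | od -c
```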

How to convert them

A typical case is a file on Unix/Linux/OS X that has Windows newlines. In this case, an extra ^M (CR) is found at the end of each line.
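
A quick way to check for those stray CRs without any extra tools is to count the lines that end with one. This is only a sketch; count_cr_lines is a name made up here, and it reads the text from standard input.

```shell
# Count the lines on standard input that end with a CR character.
count_cr_lines() {
    # $(printf '\r') yields a literal CR; \$ anchors it to the end of the line.
    grep -c "$(printf '\r')\$"
}

# Two of these three lines have Windows newlines.
printf 'a\r\nb\r\nc\n' | count_cr_lines   # prints 2
```

For a real file this would be used as count_cr_lines < somefile. A result equal to the number of lines suggests a file with Windows newlines throughout.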

There are many ways to convert files between different newline formats (see Wikipedia). One utility for this is flip. Be a bit careful with it, since the tool modifies the file in place.

The current newline type of a file can be determined with
flip -t file

A file can be converted with -u (to Unix) or -d (to DOS/Windows), like
flip -u file_with_windows_newlines

tr can also be used. For example, to convert between old Mac (CR) and Unix (LF) newlines:
tr '\r' '\n' < macfile.txt > unixfile.txt
tr '\n' '\r' < unixfile.txt > macfile.txt
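
The tr commands above swap one character for another, which fits the CR-only Mac case. For Windows files the CR has to be removed or added instead. Here is a minimal sketch of that, written as filters that read standard input; the names to_unix and to_dos are made up for this example.

```shell
# CR+LF -> LF: delete every CR. Note that this also removes any CR that
# happens to sit in the middle of a line, not only those before an LF.
to_unix() {
    tr -d '\r'
}

# LF -> CR+LF: strip a trailing CR if one is already there, then
# write each line back with CR+LF at the end.
to_dos() {
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}

# Example: round trip, with the bytes shown by od -c.
printf 'one\r\ntwo\r\n' | to_unix | od -c
printf 'one\ntwo\n'     | to_dos  | od -c
```

With files this would look like to_unix < winfile.txt > unixfile.txt. Unlike flip, these write to a new file and leave the original alone.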

Links