Handling newlines
When moving text files between different systems, line breaks (also known as newlines) can be problematic. This is because different systems use different character codes (or combinations of them) to represent newlines. The basic codes used are the following:
LF (Line feed) | '\n' | ^J | 0x0A | 10 (decimal) |
CR (Carriage return) | '\r' | ^M | 0x0D | 13 (decimal) |
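To see these codes in an actual byte stream, a hex dump is handy. The following is a minimal sketch, assuming a POSIX shell with od and tr available; the helper name hex_bytes is made up for this illustration.

```shell
# hex_bytes (hypothetical helper): print stdin as one continuous string of hex byte values.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# A Windows-style line ends in 0d 0a (CR+LF), a Unix-style line in 0a (LF).
printf 'hi\r\n' | hex_bytes; echo    # prints 68690d0a
printf 'hi\n'   | hex_bytes; echo    # prints 68690a
```

Running od -c on a file instead shows the same bytes as \r and \n, which is often easier to read.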
The basic cases for the typical systems are:
LF | Unix-like systems (including GNU/Linux & OS X, for example) |
CR+LF | Windows & DOS; typically also textual Internet protocols (see below) |
CR | Mac OS before OS X |
How about the (textual) Internet protocols? In general they use CR+LF at the protocol level, although they usually recommend that applications also accept a plain LF. In text mode, FTP converts newlines between CR+LF and the local system's convention; in binary mode it leaves them untouched.
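As an illustration of CR+LF at the protocol level, here is a rough sketch that builds a minimal HTTP/1.1 request by hand. The helper name http_request and the host example.com are placeholders chosen for this example; the request is only printed, not sent.

```shell
# http_request HOST PATH (hypothetical helper): print a minimal HTTP/1.1 request.
# Every header line ends in CR+LF, and an empty CR+LF line ends the header block.
http_request() {
    printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$2" "$1"
}

# Show the request byte by byte; od -c prints CR as \r and LF as \n.
http_request example.com / | od -c
```

Using printf with explicit \r\n keeps the line endings independent of the system the script runs on, whereas echo would emit only the local newline.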
How to convert them
A typical case is a file on Unix/Linux/OS X that has Windows newlines. In this case, an extra ^M (CR) appears at the end of each line. There are many ways to convert files between different newline formats (see Wikipedia).
One utility for this is flip (be a bit careful, since the tool modifies the file in place). The current newline type of a file can be determined with
$ flip -t file
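If flip is not installed, a rough guess can be made by counting CR and LF bytes. This is only a heuristic sketch, not a replacement for flip -t: the function name newline_style is made up here, and a file with equal numbers of stray CRs and LFs would be misreported as CR+LF.

```shell
# newline_style FILE (hypothetical helper): guess the newline convention from byte counts.
newline_style() {
    # Delete everything except CR (or LF) and count what is left.
    cr_count=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf_count=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    if [ "$cr_count" -eq 0 ] && [ "$lf_count" -eq 0 ]; then
        echo "none"
    elif [ "$cr_count" -eq 0 ]; then
        echo "LF"
    elif [ "$lf_count" -eq 0 ]; then
        echo "CR"
    elif [ "$cr_count" -eq "$lf_count" ]; then
        echo "CR+LF"
    else
        echo "mixed"
    fi
}

# Try it on a small Windows-style sample file.
sample=$(mktemp)
printf 'one\r\ntwo\r\n' > "$sample"
newline_style "$sample"    # prints CR+LF
rm -f "$sample"
```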
With flip, a file can be converted using -u (to Unix) or -d (to DOS/Windows), for example:
flip -u file_with_windows_newlines
Also tr can be used, as in the following (converting old Mac OS newlines to Unix, and back):
tr '\r' '\n' < macfile.txt > unixfile.txt
tr '\n' '\r' < unixfile.txt > macfile.txt
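Since tr maps single characters, the commands above only cover the plain CR case. For the Windows case described earlier, one possible approach is to delete the CR bytes, and to use awk for the opposite direction. This is a minimal sketch assuming POSIX tr and awk; the function names are made up for this example.

```shell
# crlf_to_lf (hypothetical helper): drop every CR byte from stdin.
# Note: this also removes any stray CR in the middle of a line.
crlf_to_lf() {
    tr -d '\r'
}

# lf_to_crlf (hypothetical helper): end every input line with CR+LF.
# Note: input that already has CR+LF endings would end up with a doubled CR.
lf_to_crlf() {
    awk '{ printf "%s\r\n", $0 }'
}

# Check the results byte by byte.
printf 'one\r\ntwo\r\n' | crlf_to_lf | od -c
printf 'one\ntwo\n' | lf_to_crlf | od -c
```

Both helpers read stdin and write stdout, so unlike flip they never modify the original file; redirect the output to a new file, as in crlf_to_lf < winfile.txt > unixfile.txt.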
Links
- Newline article in Wikipedia
- EOL story
- ASCII control code chart (Control-M ^M & Control-J ^J)