7.2. String Escape Characters

Figure 7.14. Why do we have '\r\n' on Windows?

Table 7.2. Frequently used escape characters

Sequence    Description
\n          New line (LF - Linefeed)
\r          Carriage Return (CR)
\t          Horizontal Tab (TAB)
\'          Single quote '
\"          Double quote "
\\          Backslash \

Table 7.3. Less frequently used escape characters

Sequence      Description
\a            Bell (BEL)
\b            Backspace (BS)
\f            New page (FF - Form Feed)
\v            Vertical Tab (VT)
\uF680        Character with 16-bit (2 bytes) hex value F680
\U0001F680    Character with 32-bit (4 bytes) hex value 0001F680
\101          Character with octal value 101 (one to three octal digits)
\x41          Character with hex value 41 (exactly two hex digits)

print('\U0001F680')     # 🚀
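The numeric escapes differ mainly in how many digits they consume. The following is a small sketch (the variable names are made up for illustration) that spells the same character, capital letter A, in four ways and shows what happens when more than two digits follow \x:

```python
# Four spellings of the same character, U+0041 (capital letter A).
from_hex = '\x41'          # \xhh: exactly two hex digits
from_octal = '\101'        # \ooo: one to three octal digits
from_u16 = '\u0041'        # \uXXXX: exactly four hex digits
from_u32 = '\U00000041'    # \UXXXXXXXX: exactly eight hex digits

print(from_hex, from_octal, from_u16, from_u32)

# \x stops after two digits, so the remaining digits are ordinary text:
# '\x1F680' is the control character U+001F followed by the text '680'.
mixed = '\x1F680'
print(len(mixed))
print(mixed[1:])
```

To write a code point above FF, use \u (up to FFFF) or \U (any code point) instead of \x.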

7.2.1. Escape characters

  • Escape characters

  • \t - tab

  • \r - carriage return

  • \n - newline

  • \r\n - newline (on Windows)

  • \b - backspace

  • \v - vertical tab

  • \f - form feed

  • \xhh - hexadecimal (exactly two hex digits)

  • \ooo - octal (one to three octal digits; there is no \o prefix)

  • \u - Unicode entity 16-bit

  • \U - Unicode entity 32-bit

  • \\ - backslash

  • \' - apostrophe

  • \" - double quote

>>> import string
>>>
>>>
>>> string.whitespace
' \t\n\r\x0b\x0c'
>>> print('Hello\nWorld')
Hello
World

Linefeed means to advance downward to the next line; however, it has been repurposed and renamed. Used as "newline", it terminates lines (commonly confused with separating lines). This is commonly escaped as \n, abbreviated LF or NL, and has ASCII value 10 or 0x0A. CRLF (but not CRNL) is used for the pair \r\n [#stackFF]_.

>>> print('Hello\r\nWorld')
Hello
World

Carriage return means to return to the beginning of the current line without advancing downward. The name comes from a printer's carriage, as monitors were rare when the name was coined. This is commonly escaped as \r, abbreviated CR, and has ASCII value 13 or 0x0D [#stackFF]_.

>>> print('Hello\rWorld')
World

The most common difference (and probably the only one worth worrying about) is that lines end with CRLF on Windows, NL on Unix-likes, and CR on older Macs (the situation has changed with OS X to be like Unix). Note that the shift in meaning from LF to NL, for the exact same character, gives the differences between Windows and Unix. (Windows is, of course, newer than Unix, so it didn't adopt this semantic shift.) The old Mac convention probably came from the Apple II, which used CR. CR was common on other 8-bit systems, too, like the Commodore and Tandy. ASCII wasn't universal on these systems: Commodore used PETSCII, which had LF at 0x8d (!). Atari had no LF character at all. For whatever reason, CR = 0x0d was more-or-less standard. Many text editors can read files in any of these three formats and convert between them, but not all utilities can [#stackFF]_.
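The three line-ending conventions can be handled uniformly in Python. Here is a minimal sketch, assuming a made-up sample string that mixes all three styles:

```python
# LF and CR are single characters with the ASCII values given above.
print(ord('\n'), ord('\r'))

# A made-up sample mixing Unix (LF), Windows (CRLF) and old Mac (CR) endings.
text = 'unix\nwindows\r\nold mac\rlast line'

# str.splitlines() recognizes all three conventions.
lines = text.splitlines()
print(lines)

# Normalize everything to LF: replace CRLF first, then any leftover CR.
normalized = text.replace('\r\n', '\n').replace('\r', '\n')
print(repr(normalized))
```

The order of the two replace() calls matters: replacing the lone \r first would turn every \r\n into \n\n.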

>>> print('Hello\bWorld')
HellWorld

\b is a nondestructive backspace. It moves the cursor backward, but doesn't erase what's there. The output that follows then overwrites the previous character.
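The missing letter is only a display effect of the terminal. A short sketch showing that the string itself still contains every character, including the backspace:

```python
text = 'Hello\bWorld'

# The backspace is stored as one ordinary character (ASCII value 8);
# nothing is removed from the string.
print(len(text))
print(repr(text))
print(ord('\b'))
print('o' in text)
```

This prints 11, 'Hello\x08World', 8 and True: the letter o is still there, the terminal just draws W on top of it.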

>>> print('Hello\sWorld')
Hello\sWorld
>>> print('Hello\tWorld')
Hello   World

Form feed means to advance downward to the next "page". It was commonly used as a page separator, but is now also used as a section separator. (It's uncommonly used in source code to divide logically independent functions or groups of functions.) Text editors can use this character when you "insert a page break". This is commonly escaped as \f, abbreviated FF, and has ASCII value 12 or 0x0C [#stackFF]_.

>>> print('Hello\fWorld')
Hello World

Form feed is a bit more interesting (even though less commonly used directly), and with the usual definition of page separator, it can only come between lines (e.g. after the newline sequence of NL, CRLF, or CR) or at the start or end of the file [#stackFF]_.
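As a page separator, form feed is easy to work with. A minimal sketch, assuming a made-up three-page document in which each form feed comes right after a newline:

```python
# A made-up document: each page ends with a newline,
# and a form feed separates consecutive pages.
document = 'page one\n\fpage two\n\fpage three\n'

print(ord('\f'))

# Splitting on the form feed recovers the individual pages.
pages = document.split('\f')
print(len(pages))
print(pages)
```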

Vertical tab was used to speed up printer vertical movement. Some printers used special tab belts with various tab spots. This helped align content on forms: VT to header space, fill in header, VT to body area, fill in lines, VT to form footer. Generally it was coded in the program as a character constant. From the keyboard, it would be CTRL-K. It is hardly used any more. Most forms are generated in a printer control language like PostScript [#stackVT1]_.

>>> print('Hello\vWorld')
Hello
     World

The above output suggests that the default vertical tab size is one line. This could be used to do a line feed without a carriage return on devices that convert linefeed to carriage-return + linefeed [#stackVT1]_.

Microsoft Word uses VT as a line separator in order to distinguish it from the normal new line function, which is used as a paragraph separator [1].
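A few properties of the vertical tab can be checked directly. This is a small sketch; the Ctrl-K arithmetic relies on the usual terminal convention that a control key keeps only the low five bits of the letter's code:

```python
import string

vertical_tab = '\v'

# VT has ASCII value 11 (0x0B); this is the '\x0b' in string.whitespace.
print(ord(vertical_tab))
print(vertical_tab in string.whitespace)

# Ctrl-K: keeping the low five bits of ord('K') gives the same value.
print(ord('K') & 0x1F)

# Because VT counts as whitespace, str.split() treats it as a separator.
words = 'Hello\vWorld'.split()
print(words)
```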

7.2.2. Case Study

  • Windows absolute path problem

  • An absolute path includes every entry in the directory hierarchy

  • An absolute path on *nix starts with the root directory (/)

  • An absolute path on Windows starts with a drive letter

Linux (and other *nix):

>>> file = '/home/myuser/newfile.txt'

macOS:

>>> file = '/Users/myuser/newfile.txt'

Windows:

>>> file = 'c:/Users/myuser/newfile.txt'
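These rules can be checked with the standard pathlib module, which is not otherwise used in this chapter. The sketch below uses the pure path classes, so it does not touch the file system and should behave the same on any operating system:

```python
from pathlib import PurePosixPath, PureWindowsPath

posix_path = PurePosixPath('/home/myuser/newfile.txt')
windows_path = PureWindowsPath('c:/Users/myuser/newfile.txt')

# On *nix the leading / makes the path absolute.
print(posix_path.is_absolute())
print(posix_path.parts)

# On Windows the drive letter is required as well.
print(windows_path.is_absolute())
print(windows_path.drive)

# A rooted path without a drive letter is not absolute on Windows.
no_drive = PureWindowsPath('/Users/myuser/newfile.txt')
print(no_drive.is_absolute())
```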

  • Problem with paths on Windows

  • Windows uses backslash (\) as a path separator

  • Use r-string for paths

Let's say we have a path to a file:

>>> print('C:/Users/myuser/newfile.txt')
C:/Users/myuser/newfile.txt

By convention, paths on Windows use the backslash (\) rather than the slash (/) as a path separator. This is where the problems start. Let's change slashes to backslashes one at a time, starting from the end (the one before newfile.txt):

>>> print('C:/Users/myuser\newfile.txt')
C:/Users/myuser
ewfile.txt

This is because \n is a newline character. For this to work, we need to escape the backslash (\\n).
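The effect is easier to see when we inspect the string instead of printing it. A short sketch:

```python
path = 'C:/Users/myuser\newfile.txt'

# repr() shows the newline character that replaced the backslash and the n.
print(repr(path))

# The string now contains two lines, and the text 'newfile' is gone.
print(path.splitlines())
print('newfile' in path)
```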

Now let's convert another slash to a backslash, this time the one before the directory named myuser:

>>> print('C:/Users\myuser\\newfile.txt')
SyntaxWarning: invalid escape sequence '\m'
C:/Users\myuser\newfile.txt

Since Python 3.12, an unrecognized escape sequence (in this case \m) produces SyntaxWarning: invalid escape sequence '\m'. Such a backslash needs to be escaped or the string needs to be written as a raw string. For now this is only a warning, so the code still runs, but it is expected to become an error in a future version, so it is better to avoid it now.
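The warning is issued when the source code is compiled, which we can observe with the standard warnings module. This is a sketch only: the source line is kept in a raw string so that this example itself stays warning-free, and the try/except covers a possible future version that rejects the sequence outright.

```python
import warnings

# The code under test, stored as text. The raw string keeps \m intact.
source = r"path = 'C:/Users\myuser'"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    try:
        code = compile(source, '<example>', 'exec')
    except SyntaxError:
        # A future Python version may reject the unknown escape entirely.
        code = None

for warning in caught:
    print(warning.category.__name__, warning.message)

if code is not None:
    namespace = {}
    exec(code, namespace)
    # The backslash was kept as a literal character.
    print(namespace['path'])
```

On Python 3.12 and newer the category is SyntaxWarning; older Python 3 versions report a DeprecationWarning instead.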

The last slash (the one before Users):

>>> print('C:\Users\\myuser\\newfile.txt')
Traceback (most recent call last):
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

This time the problem is more serious. The problem is with \Users. After the escape sequence \U, Python expects a Unicode code point written as exactly eight hexadecimal digits, e.g. \U0001F600, which is a smiley 😀 emoticon. In this example, Python finds the letter s, which is not a valid hexadecimal digit, and therefore raises a SyntaxError telling the user that there is an error with decoding bytes. The only valid hexadecimal digits are 0123456789abcdefABCDEF and the letter s isn't one of them.
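Unlike the warning for \m, this error prevents the code from compiling at all. The sketch below reproduces it in a controlled way by compiling the offending line from a raw string, then shows a valid \U escape for comparison:

```python
# The offending line, stored as text so that this example itself compiles.
source = r"path = 'C:\Users\myuser'"

message = None
try:
    compile(source, '<example>', 'exec')
except SyntaxError as error:
    message = error.msg

print(message)

# A valid \U escape has exactly eight hex digits and yields one character.
smiley = '\U0001F600'
print(len(smiley), hex(ord(smiley)))
```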

There are two ways to avoid this problem. Escape every backslash:

>>> print('C:\\Users\\myuser\\newfile.txt')
C:\Users\myuser\newfile.txt

Or use r-string:

>>> print(r'C:\Users\myuser\newfile.txt')
C:\Users\myuser\newfile.txt

Both produce the same output, so you can choose either one. In my opinion r-strings are less error-prone and I use them whenever I have to deal with paths.
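A short sketch comparing both spellings. It also uses PureWindowsPath from the standard pathlib module (not otherwise covered in this chapter) to take the path apart, and shows one caveat of r-strings; the variable names are made up for illustration:

```python
from pathlib import PureWindowsPath

escaped = 'C:\\Users\\myuser\\newfile.txt'
raw = r'C:\Users\myuser\newfile.txt'

# Both literals create exactly the same string.
print(escaped == raw)

# PureWindowsPath parses Windows paths on any operating system.
path = PureWindowsPath(raw)
print(path.name)
print(path.parent)
print(path.as_posix())

# Caveat: a raw string cannot end with a single backslash,
# so a trailing separator has to be added separately.
directory = r'C:\Users\myuser' + '\\'
print(directory)
```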

7.2.3. References