Removing backslash in Python at runtime [duplicate] - python

I need a way for my function to take in a string at runtime and remove the backslashes while KEEPING the character each one precedes. So for \a I must get a. This must also work for sequences that are not recognized escapes, like \e -> e.
I've scoured the internet looking for a general solution to this problem, but there does not appear to be one. The best solution I have found uses a dictionary to build the string from scratch, as in: How to prevent automatic escaping of special characters in Python
escape_dict={'\a':r'\a',
'\b':r'\b',
'\c':r'\c',
'\f':r'\f',
'\n':r'\n',
'\r':r'\r',
'\t':r'\t',
'\v':r'\v',
'\'':r'\'',
'\"':r'\"',
'\0':r'\0',
'\1':r'\1',
'\2':r'\2',
'\3':r'\3',
'\4':r'\4',
'\5':r'\5',
'\6':r'\6',
'\7':r'\7',
'\8':r'\8',
'\9':r'\9'}
def raw(text):
    """Returns a raw string representation of the string"""
    new_string=''
    for char in text:
        try:
            new_string += escape_dict[char]
        except KeyError:
            new_string += char
    return new_string
However, this fails in general because of conflicts between the escaped numbers and escaped letters. Using three-digit numbers like \001 instead of \1 also fails, because the output will have additional digits in it, which defeats the purpose: I should simply remove the backslash. Other proposed solutions based on encodings, like the one found in Process escape sequences in a string in Python, do not work either, because they just convert the escape characters into hex codes: \a gets converted to \x07. Even if I were to somehow remove this, the character a is still lost.

There is a function you may want to use for this purpose called repr().
repr() computes the “official” string representation of an object (a representation that has all information about the object) and str() is used to compute the “informal” string representation of an object (a representation that is useful for printing the object).
Example:
s = 'This is a \t string tab. And this is a \n newline character'
print(s) # This prints `s` with a real tab and a real newline inserted in the string
print(repr(s)) # This prints `s` as it was written in the source, with each backslash and the letter that followed it (plus surrounding quotes)
# So you can work on the escaped form, for example:
print(repr(s).replace('\\', '_'))
# Calling replace() on `s` directly changes nothing here, because `s` itself contains no backslash characters
print(s.replace('\\', '_'))
So you can replace the backslashes in your string by applying repr(<your string>) first and then calling replace() on the result.
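A minimal sketch of one way to get the behavior asked for in the question, assuming the goal is exactly "drop every backslash and turn each single-letter escape back into its letter". The names `ESCAPE_LETTERS` and `strip_backslashes` are made up for this illustration. Octal escapes such as \1 are not handled, since they cannot be reversed unambiguously at runtime.

```python
# Control characters produced by Python's single-letter escapes, mapped back
# to the letter that followed the backslash in the source literal.
ESCAPE_LETTERS = {
    "\a": "a",
    "\b": "b",
    "\f": "f",
    "\n": "n",
    "\r": "r",
    "\t": "t",
    "\v": "v",
}


def strip_backslashes(text):
    """Drop literal backslashes and undo single-letter escapes (hypothetical helper)."""
    pieces = []
    for char in text:
        if char == "\\":
            continue  # a real backslash that survived in the string: remove it
        # Replace a control character with its letter, keep anything else as-is.
        pieces.append(ESCAPE_LETTERS.get(char, char))
    return "".join(pieces)


print(strip_backslashes("\a"))        # a
print(strip_backslashes("\\e"))       # e
print(strip_backslashes("x\ty\\z"))   # xtyz
```

Note that this also turns genuine tabs and newlines in the input into the letters t and n, which matches the question as stated but may not be what every caller wants.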

Related

How to write a regular expression with a list of quotes, using raw string literal? [duplicate]

Technically, any odd number of backslashes, as described in the documentation.
>>> r'\'
  File "<stdin>", line 1
    r'\'
       ^
SyntaxError: EOL while scanning string literal
>>> r'\\'
'\\\\'
>>> r'\\\'
  File "<stdin>", line 1
    r'\\\'
         ^
SyntaxError: EOL while scanning string literal
It seems like the parser could just treat backslashes in raw strings as regular characters (isn't that what raw strings are all about?), but I'm probably missing something obvious.
The common misconception about Python's raw strings is that a backslash within a raw string is just a regular character like any other. It is NOT. The key to understanding this is the following passage from the Python tutorial:
When an 'r' or 'R' prefix is present, a character following a
backslash is included in the string without change, and all
backslashes are left in the string
So any character following a backslash is part of the raw string. Once the parser enters a raw string (a non-Unicode one) and encounters a backslash, it knows there are two characters: the backslash and the character following it.
This way:
r'abc\d' comprises a, b, c, \, d
r'abc\'d' comprises a, b, c, \, ', d
r'abc\'' comprises a, b, c, \, '
and:
r'abc\' comprises a, b, c, \, ' but there is no terminating quote now.
The last case shows that, per the documentation, the parser cannot find a closing quote, because the last quote you see above is part of the string. In other words, a backslash cannot come last here, because it would 'devour' the string's closing quote.
The reason is explained in this part of that section of the documentation:
String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.
So raw strings are not 100% raw, there is still some rudimentary backslash-processing.
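A small sketch that checks the character breakdowns listed above. It is only an illustration: the variable names are invented here, and the invalid literal is passed to `compile()` as text so that the example file itself still parses.

```python
# Each raw literal keeps the backslash AND the character after it.
s1 = r'abc\d'
s2 = r'abc\'d'
s3 = r'abc\''

print(list(s1))  # ['a', 'b', 'c', '\\', 'd']
print(list(s2))  # ['a', 'b', 'c', '\\', "'", 'd']
print(list(s3))  # ['a', 'b', 'c', '\\', "'"]

# The source text  r'abc\'  has no terminating quote, so it cannot be compiled.
invalid_source = "r'abc\\'"
try:
    compile(invalid_source, "<example>", "eval")
    raw_literal_is_valid = True
except SyntaxError:
    raw_literal_is_valid = False

print(raw_literal_is_valid)  # False
```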
That's the way it is! I see it as one of those small defects in python!
I don't think there's a good reason for it, but it's definitely not parsing; it's really easy to parse raw strings with \ as a last character.
The catch is, if you allow \ to be the last character in a raw string then you won't be able to put " inside a raw string. It seems python went with allowing " instead of allowing \ as the last character.
However, this shouldn't cause any trouble.
If you're worried about not being able to easily write Windows folder paths such as c:\mypath\, don't be: you can represent them as r"C:\mypath", and if you need to append a subdirectory name, don't do it with string concatenation, since that is not the right way to do it anyway. Use os.path.join:
>>> import os
>>> os.path.join(r"C:\mypath", "subfolder")
'C:\\mypath\\subfolder'
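The output above is what os.path.join returns on Windows. As a hedged sketch, the standard-library modules `ntpath` and `pathlib.PureWindowsPath` apply Windows path rules on any platform, which makes the same join reproducible elsewhere. The variable names are illustrative only.

```python
import ntpath
from pathlib import PureWindowsPath

base = r"C:\mypath"

# ntpath is the Windows flavour of os.path and is importable on every platform.
joined = ntpath.join(base, "subfolder")
print(joined)  # C:\mypath\subfolder

# PureWindowsPath does the same with the "/" operator, without touching the filesystem.
as_path = PureWindowsPath(base) / "subfolder"
print(as_path)  # C:\mypath\subfolder

# Joining with an empty last component yields a trailing backslash,
# so no raw literal ending in "\" is needed.
with_trailing = ntpath.join(base, "")
print(with_trailing)  # C:\mypath\
```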
To end a raw string with a backslash, I suggest this trick:
>>> print(r"c:\test"'\\')
c:\test\
It uses the implicit concatenation of string literals in Python and concatenates one string delimited with double quotes with another that is delimited by single quotes. Ugly, but works.
Another trick is to use chr(92), which evaluates to a single backslash.
I recently had to clean a string of backslashes and the following did the trick:
CleanString = DirtyString.replace(chr(92),'')
I realize that this does not take care of the "why" but the thread attracts many people looking for a solution to an immediate problem.
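The two tricks above can be put side by side. This is just a sketch with made-up variable names; `BACKSLASH` is a convenience constant introduced here, not something from the standard library.

```python
BACKSLASH = chr(92)  # code point 92 is the backslash character

# Adjacent string literals are concatenated at compile time,
# so a raw literal can be followed by a normal one holding the backslash.
path_a = r"c:\test" "\\"
path_b = r"c:\test" + BACKSLASH
print(path_a)            # c:\test\
print(path_a == path_b)  # True

# Removing every backslash from a string without typing one in the source.
dirty = r"a\b\c"
clean = dirty.replace(BACKSLASH, "")
print(clean)             # abc
```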
Since \" is allowed inside a raw string, a quote preceded by a backslash cannot be used to identify the end of the string literal.
Why not stop parsing the string literal when you encounter the first "?
If that were the case, then \" wouldn't be allowed inside the string literal. But it is.
The reason r'\' is syntactically incorrect is that, although the string expression is raw, the quote character in use (single or double) always has to be escaped inside the literal, since it would otherwise mark the end of the string. So if you want to put a single quote inside a single-quoted string, there is no other way than using \'. The same applies to double quotes.
But you could use:
'\\'
Another user who has since deleted their answer (not sure if they'd like to be credited) suggested that the Python language designers may be able to simplify the parser design by using the same parsing rules and expanding escaped characters to raw form as an afterthought (if the literal was marked as raw).
I thought it was an interesting idea and am including it as community wiki for posterity.
Naive raw strings
The naive idea of a raw string is
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
and it will mean itself.
Unfortunately, this does not work, because if the whatever
happens to contain a quote, the raw string would end at that point.
It is simply impossible that I can put "whatever I want"
between fixed delimiters, because some of it could look like
the terminating delimiter -- no matter what that delimiter is.
Real-world raw strings (variant 1)
One possible approach to this problem would be to say
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
This restriction sounds harsh, until one recognizes that
Python's large offering of quotes can accommodate most situations
with this rule. The following are all valid Python quotes:
'
"
'''
"""
With this many possibilities for the delimiter, almost anything
can be made to work.
About the only exception would be if the string
literal is supposed to contain a complete list of all allowed
Python quotes.
Real-world raw strings (variant 2, as in Python)
Python, however, takes a different route using
an extended version of the above rule.
It effectively states
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
If I insist on including a quote, even that is allowed,
but I have to put a backslash before it.
So the Python approach is, in a sense, even more liberal
than variant 1 above -- but it has the side effect of
"mis"interpreting the closing quote as part of the string
if the last intended character of the string is a backslash.
Variant 2 does not help in this case:
If I want the quote in my string,
but not the backslash, the permitted form of my string literal
will not be what I need.
However, given the three other kinds of quotes I have
at my disposal, I will probably just pick one of those and my
problem will be solved -- so this is not a problematic case.
The problematic case is this one:
If I want my string to end with a backslash, I am at a loss.
I need to resort to concatenating a non-raw string literal
containing the backslash.
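A short sketch of the two variants discussed above, using invented variable names. Picking a different delimiter (variant 1) keeps the text clean, while escaping the quote inside a raw string (variant 2) leaves the backslash in the result.

```python
# Variant 1 in practice: choose a delimiter that does not occur in the text.
single_in_double = r"it's"
double_in_single = r'say "hi"'
both_in_triple = r'''both ' and " here'''

# Variant 2: the quote may be escaped, but the backslash stays in the string.
escaped = r'it\'s'

print(single_in_double, len(single_in_double))  # it's 4
print(escaped, len(escaped))                    # it\'s 5

# The problematic case: a trailing backslash needs a non-raw literal appended.
ends_with_backslash = r'C:\some\dir' + '\\'
print(ends_with_backslash)                      # C:\some\dir\
```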
Conclusion
After writing this, I agree with several of the other posters
that variant 1 would have been easier to understand and to accept
and therefore more pythonic. That's life!
Coming from C, it is pretty clear to me that a single \ works as an escape character, allowing you to put special characters such as newlines, tabs and quotes into strings.
That does indeed disallow \ as last character since it will escape the " and make the parser choke. But as pointed out earlier \ is legal.
Some tips:
1) If you need to manipulate backslashes in a path, the standard Python module os.path is your friend. For example:
os.path.normpath('c:/folder1/')
2) If you want to build strings with backslashes in them BUT without a backslash at the END of the string, a raw string is your friend (use the 'r' prefix before your string literal). For example:
r'\one \two \three'
3) If you need to prefix a string in a variable X with a backslash, you can do this:
X='dummy'
bs=r'\ ' # don't forget the space after the backslash or you will get an EOL error
X2=bs[0]+X # X2 now contains \dummy
4) If you need to create a string with a backslash at the end, combine tips 2 and 3:
voice_name='upper'
lilypond_display=r'\DisplayLilyMusic \ ' # don't forget the space at the end
lilypond_statement=lilypond_display[:-1]+voice_name
Now lilypond_statement contains "\DisplayLilyMusic \upper"
Long live Python! :)
n3on
Despite its role, even a raw string cannot end in a single
backslash, because the backslash escapes the following quote
character—you still must escape the surrounding quote character to
embed it in the string. That is, r"...\" is not a valid string
literal—a raw string cannot end in an odd number of backslashes.
If you need to end a raw string with a single backslash, you can use
two and slice off the second.
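The "use two and slice off the second" advice can be sketched as follows. The path is only an example value.

```python
# A raw literal may end in an EVEN number of backslashes,
# so write two and drop the last character.
doubled = r"C:\mypath\\"
with_trailing = doubled[:-1]

print(doubled)        # C:\mypath\\
print(with_trailing)  # C:\mypath\
print(len(doubled), len(with_trailing))  # 11 10
```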
I ran into this problem and found a partial solution that is good enough for some cases. A raw string literal cannot end with a single backslash, but a string value can: 'a string\\' holds exactly one trailing backslash, and writing it to a text file produces a single backslash at the end. So if all you need is to save text ending in a single backslash on your computer, it is possible:
x = 'a string\\'
print(repr(x)) # 'a string\\' (repr shows the backslash escaped; the string holds only one)
# Now save it in a text file and it will appear with a single backslash:
with open("my_file.txt", 'w') as h:
    h.write(x)
By the way, this does not carry over to JSON: if you dump the string with Python's json library, the backslash is written escaped, as \\.
Finally, I work with Spyder, and I noticed that if I open the variable in Spyder's text editor by double-clicking on its name in the variable explorer, it is presented with a single backslash and can be copied to the clipboard that way (not very helpful for most needs, but maybe for some).
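A sketch that makes the file and JSON behavior above checkable. It writes to a temporary directory instead of my_file.txt so it leaves nothing behind; that substitution and the variable names are choices made for this illustration.

```python
import json
import os
import tempfile

x = 'a string\\'          # nine characters, the last one is a single backslash
print(len(x), x)          # 9 a string\

# Plain text file: the backslash is written exactly once.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_file.txt")
    with open(path, "w") as handle:
        handle.write(x)
    with open(path) as handle:
        on_disk = handle.read()
print(on_disk == x)       # True

# JSON text: the backslash is escaped in the serialized form,
# but loading it back restores the original single backslash.
encoded = json.dumps(x)
print(encoded)            # "a string\\"
print(json.loads(encoded) == x)  # True
```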

Regex patterns with windows paths in python [duplicate]

I found a python package on GitHub that doesn't work. It attempts to replace a substring within a url with another string.
string = "filename.txt"
rewrite = "c:\\windows\\system32\\drivers\\hosts"
url = "https://www.example.com/path?parameter=filename.txt"
fullrewrite = re.sub(string, rewrite, url)
The string, rewrite, and url parameters are arbitrary and not hard-coded. I just put them there as an example (this is a path traversal testing library I'm trying to play around with).
When I run this code, I get a KeyError from re, which is expected according to the docs:
If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.
I tried using repr to convert the string into a raw string:
raw = repr(rewrite)[1:-1] # [1:-1] removes extra quotes.
fullrewrite = re.sub(string, raw, url)
But this creates double backslashes in the resulting url: https://www.example.com/path?parameter=c:\\windows\\system32\\drivers\\hosts
My question is how am I supposed to have it replace the key word so that the resulting string is: https://www.example.com/path?parameter=c:\windows\system32\drivers\hosts?
This is my understanding; please correct me if I'm wrong.
You are not getting double backslashes but escaped backslashes. In both the re module and Python string literals, a single backslash is a special character that usually starts an escape sequence rather than standing for itself. To get one literal backslash, you usually have to escape it with another. In that sense, a double backslash is the written representation of a single backslash.
If you pass 'c:\\' to print() or save it to a text file, you get c:\.
P.S. Since '\q' is not a recognized escape sequence in Python, '\q'=='\\q' returns True.
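For the replacement itself, here is a hedged sketch of three options that avoid the escape problem, reusing the variable names from the question. Which one fits depends on whether `string` is meant to be a literal substring, which is assumed here.

```python
import re

string = "filename.txt"
rewrite = "c:\\windows\\system32\\drivers\\hosts"
url = "https://www.example.com/path?parameter=filename.txt"

# Option 1: for a literal substring, no regular expression is needed at all.
via_replace = url.replace(string, rewrite)

# Option 2: pass a function as the replacement; its return value is inserted
# as-is, so backslashes in `rewrite` are not treated as escape sequences.
via_function = re.sub(re.escape(string), lambda match: rewrite, url)

# Option 3: keep a string replacement, but double every backslash so the
# replacement template turns each pair back into one.
via_escaped = re.sub(re.escape(string), rewrite.replace("\\", "\\\\"), url)

print(via_function)
# https://www.example.com/path?parameter=c:\windows\system32\drivers\hosts
```

Printing the result shows single backslashes. Inspecting it in the interactive interpreter shows them doubled, because that display uses repr().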

Create a class opening a csv file and generating a new excel file with csv data [duplicate]


Print only a single slash using print r"\" (Python) [duplicate]

Technically, any odd number of backslashes, as described in the documentation.
>>> r'\'
File "<stdin>", line 1
r'\'
^
SyntaxError: EOL while scanning string literal
>>> r'\\'
'\\\\'
>>> r'\\\'
File "<stdin>", line 1
r'\\\'
^
SyntaxError: EOL while scanning string literal
It seems like the parser could just treat backslashes in raw strings as regular characters (isn't that what raw strings are all about?), but I'm probably missing something obvious.
The whole misconception about python's raw strings is that most of people think that backslash (within a raw string) is just a regular character as all others. It is NOT. The key to understand is this python's tutorial sequence:
When an 'r' or 'R' prefix is present, a character following a
backslash is included in the string without change, and all
backslashes are left in the string
So any character following a backslash is part of raw string. Once parser enters a raw string (non Unicode one) and encounters a backslash it knows there are 2 characters (a backslash and a char following it).
This way:
r'abc\d' comprises a, b, c, \, d
r'abc\'d' comprises a, b, c, \, ', d
r'abc\'' comprises a, b, c, \, '
and:
r'abc\' comprises a, b, c, \, ' but there is no terminating quote now.
Last case shows that according to documentation now a parser cannot find closing quote as the last quote you see above is part of the string i.e. backslash cannot be last here as it will 'devour' string closing char.
The reason is explained in the part of that section which I highlighted in bold:
String quotes can be escaped with a
backslash, but the backslash remains
in the string; for example, r"\"" is a
valid string literal consisting of two
characters: a backslash and a double
quote; r"\" is not a valid string
literal (even a raw string cannot end
in an odd number of backslashes).
Specifically, a raw string cannot end
in a single backslash (since the
backslash would escape the following
quote character). Note also that a
single backslash followed by a newline
is interpreted as those two characters
as part of the string, not as a line
continuation.
So raw strings are not 100% raw, there is still some rudimentary backslash-processing.
That's the way it is! I see it as one of those small defects in python!
I don't think there's a good reason for it, but it's definitely not parsing; it's really easy to parse raw strings with \ as a last character.
The catch is, if you allow \ to be the last character in a raw string then you won't be able to put " inside a raw string. It seems python went with allowing " instead of allowing \ as the last character.
However, this shouldn't cause any trouble.
If you're worried about not being able to easily write windows folder pathes such as c:\mypath\ then worry not, for, you can represent them as r"C:\mypath", and, if you need to append a subdirectory name, don't do it with string concatenation, for it's not the right way to do it anyway! use os.path.join
>>> import os
>>> os.path.join(r"C:\mypath", "subfolder")
'C:\\mypath\\subfolder'
In order for you to end a raw string with a slash I suggest you can use this trick:
>>> print r"c:\test"'\\'
test\
It uses the implicit concatenation of string literals in Python and concatenates one string delimited with double quotes with another that is delimited by single quotes. Ugly, but works.
Another trick is to use chr(92) as it evaluates to "\".
I recently had to clean a string of backslashes and the following did the trick:
CleanString = DirtyString.replace(chr(92),'')
I realize that this does not take care of the "why" but the thread attracts many people looking for a solution to an immediate problem.
Since \" is allowed inside the raw string. Then it can't be used to identify the end of the string literal.
Why not stop parsing the string literal when you encounter the first "?
If that was the case, then \" wouldn't be allowed inside the string literal. But it is.
The reason for why r'\' is syntactical incorrect is that although the string expression is raw the used quotes (single or double) always have to be escape since they would mark the end of the quote otherwise. So if you want to express a single quote inside single quoted string, there is no other way than using \'. Same applies for double quotes.
But you could use:
'\\'
Another user who has since deleted their answer (not sure if they'd like to be credited) suggested that the Python language designers may be able to simplify the parser design by using the same parsing rules and expanding escaped characters to raw form as an afterthought (if the literal was marked as raw).
I thought it was an interesting idea and am including it as community wiki for posterity.
Naive raw strings
The naive idea of a raw string is
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
and it will mean itself.
Unfortunately, this does not work, because if the whatever
happens to contain a quote, the raw string would end at that point.
It is simply impossible that I can put "whatever I want"
between fixed delimiters, because some of it could look like
the terminating delimiter -- no matter what that delimiter is.
Real-world raw strings (variant 1)
One possible approach to this problem would be to say
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
This restriction sounds harsh, until one recognizes that
Python's large offering of quotes can accommodate most situations
with this rule. The following are all valid Python quotes:
'
"
'''
"""
With this many possibilities for the delimiter, almost anything
can be made to work.
About the only exception would be if the string
literal is supposed to contain a complete list of all allowed
Python quotes.
Real-world raw strings (variant 2, as in Python)
Python, however, takes a different route using
an extended version of the above rule.
It effectively states
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
If I insist on including a quote, even that is allowed,
but I have to put a backslash before it.
So the Python approach is, in a sense, even more liberal
than variant 1 above -- but it has the side effect of
"mis"interpreting the closing quote as part of the string
if the last intended character of the string is a backslash.
Variant 2 is not helpful:
If I want the quote in my string,
but not the backslash, the allowed version of my string literal
will not be what I need.
However, given the three different other kinds of quotes I have
at my disposal, I will probably just pick one of those and my
problem will be solved -- so this is not problematic case.
The problematic case is this one:
If I want my string to end with a backslash, I am at a loss.
I need to resort to concatenating a non-raw string literal
containing the backslash.
Conclusion
After writing this, I go with several of the other posters
that variant 1 would have been easier to understand and to accept
and therefore more pythonic. That's life!
Comming from C it pretty clear to me that a single \ works as escape character allowing you to put special characters such as newlines, tabs and quotes into strings.
That does indeed disallow \ as last character since it will escape the " and make the parser choke. But as pointed out earlier \ is legal.
some tips :
1) if you need to manipulate backslash for path then standard python module os.path is your friend. for example :
os.path.normpath('c:/folder1/')
2) if you want to build strings with backslash in it BUT without backslash at the END of your string then raw string is your friend (use 'r' prefix before your literal string). for example :
r'\one \two \three'
3) if you need to prefix a string in a variable X with a backslash then you can do this :
X='dummy'
bs=r'\ ' # don't forget the space after backslash or you will get EOL error
X2=bs[0]+X # X2 now contains \dummy
4) if you need to create a string with a backslash at the end then combine tip 2 and 3 :
voice_name='upper'
lilypond_display=r'\DisplayLilyMusic \ ' # don't forget the space at the end
lilypond_statement=lilypond_display[:-1]+voice_name
now lilypond_statement contains "\DisplayLilyMusic \upper"
long live python ! :)
n3on
Despite its role, even a raw string cannot end in a single
backslash, because the backslash escapes the following quote
character—you still must escape the surrounding quote character to
embed it in the string. That is, r"...\" is not a valid string
literal—a raw string cannot end in an odd number of backslashes.
If you need to end a raw string with a single backslash, you can use
two and slice off the second.
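The slicing trick from the quote above can be sketched as follows, assuming Python 3. The helper is_valid_literal is a hypothetical name introduced here; it only uses the built-in compile() to check whether a piece of source text is accepted as an expression:

```python
# Write two trailing backslashes in the raw literal, then drop the last one.
path = r'C:\temp\\'[:-1]
print(path)                            # C:\temp\

def is_valid_literal(source):
    """Return True if the source text compiles as a Python expression."""
    try:
        compile(source, '<literal>', 'eval')
        return True
    except SyntaxError:
        return False

print(is_valid_literal('r"abc\\"'))    # source text r"abc\"  -> False
print(is_valid_literal('r"abc\\\\"'))  # source text r"abc\\" -> True
```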
I encountered this problem and found a partial solution which is good for some cases. Although a Python string literal cannot end with a single unescaped backslash, the string can still be written to a text file with a single backslash at the end. Therefore, if what you need is to save text ending in a single backslash on your computer, it is possible:
x = 'a string\\'
x
'a string\\'
# Now save it in a text file and it will appear with a single backslash:
with open("my_file.txt", 'w') as h:
    h.write(x)
BTW, this does not work with JSON: if you dump the string using Python's json library, the backslash is written escaped, as two backslashes.
Finally, I work with Spyder, and I noticed that if I open the variable in Spyder's text editor by double-clicking on its name in the variable explorer, it is presented with a single backslash and can be copied to the clipboard that way (it's not very helpful for most needs, but maybe for some).
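A rough sketch of the file and JSON behaviour, assuming Python 3. A temporary directory stands in for the real location of my_file.txt. The string holds only one backslash all along; the doubled backslash is just how the literal and the repr spell it:

```python
import json
import os
import tempfile

x = 'a string\\'           # one backslash at the end
print(len(x))              # 9

# Write the string to a file and read it back unchanged.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'my_file.txt')
    with open(path, 'w') as h:
        h.write(x)
    with open(path) as h:
        on_disk = h.read()
print(on_disk)             # a string\

# JSON escapes the backslash in its text form, but loading restores it.
encoded = json.dumps(x)
print(encoded)             # "a string\\"
print(json.loads(encoded) == x)   # True
```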

Python raw strings and html parsing

How do Python raw strings and string literals work? I'm trying to make a web scraper to download PDFs from a site. When I search the string it works, but when I try to implement it in Python I always get None as my answer:
import urllib
import re
url=""  # insert url here
sock=urllib.urlopen(url)
htmlSource=sock.read()
sock.close()
m=re.match(r"<a href.*?pdf[^>]*?", raw(htmlSource))
print m
$ python temp.py
None
The raw function is from here: http://code.activestate.com/recipes/65211-convert-a-string-into-a-raw-string/
That said, how can I complete this program so that I can print out all of the matches and then download the pdfs?
Thanks!
You seem to be very confused.
A 'string literal' is a string that you type into the program. Because there needs to be a clear beginning and end to your string, certain characters become inconvenient to have within the middle of the string, and escape sequences must be used to represent them.
Python offers 'raw' string literals which have different rules for how the escape sequences are interpreted: the same rules are used to figure out where the string ends (so a single backslash, followed by the opening quote character, doesn't terminate the string), but then the stuff between the quotes doesn't get transformed. So, while '\'' is a string that consists of a single quote character (the \' in the middle is an escape sequence that produces the quote), r'\'' is a string that consists of a backslash and a quote character.
The raw string literal produces an object of type str. It is the same type as produced by an ordinary string literal. These are often used for the pattern for a regex operation, because the strings used for regexes often need to contain a lot of backslashes. If you wanted to write a regex that matched a backslash in the source text, and you didn't have raw string literals, then you would need to put, perhaps surprisingly, four backslashes between the quotes in your source code: the Python compiler would interpret this as a string containing two real backslashes, which in turn represents "match a backslash" in the regex syntax.
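To illustrate the four-backslash point, here is a small sketch assuming Python 3; the sample text is made up:

```python
import re

text = r'C:\Users'       # sample text containing one backslash

pattern_plain = '\\\\'   # four backslashes in the source, two in the string
pattern_raw = r'\\'      # two backslashes in the source, two in the string

print(pattern_plain == pattern_raw)           # True
print(type(pattern_raw) is str)               # True: an ordinary str object
print(re.search(pattern_raw, text).group())   # \
```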
The function you found is an imperfect attempt to re-introduce escape sequences into input text. This is not what you want to do, doesn't even really make sense, and doesn't meet the author's own spec anyway. It seems to be based on a misconception similar to your own. The concept of a "raw equivalent of" a string is nonsensical. There is, really, no such thing as "a raw string"; raw string literals are a convenience for creating ordinary strings.
You want to search for the pattern within htmlSource. It is already in the form you need it to be in. Your problem has nothing to do with string escapes. When a string comes from user input, file input, or basically anything other than the program source, it is not processed the way string literals are, unless you explicitly arrange for that to happen. If the web page contains a backslash followed by an n, the string that gets read by urllib contains, in the corresponding spot, exactly that - a backslash followed by an n, not a newline.
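As a quick check of that last point, the sketch below (Python 3 assumed) uses an in-memory stream as a stand-in for a file or web response. The payload is built with chr(92), the backslash character, so that no escape sequence in the source code is involved:

```python
import io

# Text as it would arrive from outside the program: a backslash (chr(92))
# followed by the letter n, not a newline character.
payload = 'line one' + chr(92) + 'n' + 'still line one'
stream = io.StringIO(payload)
data = stream.read()

print(len(data.splitlines()))   # 1: reading did not create a line break
print('\\n' in data)            # True: backslash followed by n is present
print('\n' in data)             # False: there is no newline character
```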
The problem is as follows: you want to search the string, as you said: "when I search the string it works". You are currently matching the string. See the documentation:
Help on function match in module re:
match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.
Your pattern does not appear at the beginning of the string, since the HTML for the webpage does not start with the <a> tag you are looking for.
You want m=re.search(r"<a href.*?pdf[^>]*?", htmlSource).
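The difference can be seen in a small sketch, written for Python 3 and using a made-up HTML snippet in place of the downloaded page. The findall pattern is one possible way to list every PDF link and is an assumption about the page's markup (double-quoted href values), not something taken from the question:

```python
import re

# Hypothetical page content standing in for htmlSource.
html_source = ('<html><body><a href="files/report.pdf">Report</a> '
               '<a href="files/notes.pdf">Notes</a></body></html>')

pattern = r'<a href.*?pdf[^>]*?'

# match only tries at the very start of the string, which is "<html>".
print(re.match(pattern, html_source))            # None

# search scans forward until the pattern fits somewhere.
print(re.search(pattern, html_source).group())   # <a href="files/report.pdf

# findall returns the captured group for every non-overlapping match.
links = re.findall(r'<a href="([^"]*\.pdf)"', html_source)
print(links)   # ['files/report.pdf', 'files/notes.pdf']
```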
Check out this answer. It seems that Python’s urllib is a lot less user‐friendly — and Unicode‐friendly — than it should be. It seems to force you to deal with ugly raw bytes content instead of decoding it for you into a normal string.
