I'm trying to learn python, and I'm pretty new at it, and I can't figure this one part out.
Basically, what I'm doing now is something that takes the source code of a webpage, and takes out everything that isn't words.
Webpages have a lot of \n and \t, and I want something that will find \ and delete everything between it and the next ' '.
def removebackslash(source):
while(source.find('\') != -1):
startback = source.find('\')
endback = source[startback:].find(' ') + startback + 1
source = source[0:startback] + source[endback:]
return source
is what I have. It doesn't work like this, because the \' doesn't close the string, but when I change \ to \\, it interprets the string as \\. I can't figure out anything that is interpreted at '\'
\ is an escape character; it either gives characters a special meaning or takes said special meaning away. Right now, it's escaping the closing single quote and treating it as a literal single quote. You need to escape it with itself to insert a literal backslash:
def removebackslash(source):
while(source.find('\\') != -1):
startback = source.find('\\')
endback = source[startback:].find(' ') + startback + 1
source = source[0:startback] + source[endback:]
return source
Try using replace:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
So in your case:
my_text = my_text.replace('\n', '')
my_text = my_text.replace('\t', '')
As others have said, you need to use '\\'. The reason you think this isn't working is because when you get the results, they look like they begin with two backslashes. But they don't begin with two backslashes, it's just that Python shows two backslashes. If it didn't, you couldn't tell the difference between a newline (represented as \n) and a backslash followed by the letter n (represented as \\n).
There are two ways to convince yourself of what's really going on. One is to use print on the result, which causes it to expand the escapes:
>>> x = "here is a backslash \\ and here comes a newline \n this is on the next line"
>>> x
u'here is a backslash \\ and here comes a newline \n this is on the next line'
>>> print x
here is a backslash \ and here comes a newline
this is on the next line
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ and here comes a newline \n this is on the next line'
>>> print x[startback:]
\ and here comes a newline
this is on the next line
Another way is to use len to verify the length of the string:
>>> x = "Backslash \\ !"
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ !'
>>> print x[startback:]
\ !
>>> len(x[startback:])
3
Notice that len(x[startback:]) is 3. The string contains three characters: backslash, space, and exclamation point. You can see what's going on even more simply by just looking at a string that contains only a backslash:
>>> x = "\\"
>>> x
u'\\'
>>> print x
\
>>> len(x)
1
x only looks like it starts with two backslashes when you evaluate it at the interactive prompt (or otherwise use it's __repr__ method). When you actually print it, you can see it's only one backslash, and when you look at its length, you can see it's only one character long.
So what this means is you need to escape the backslash in your find, and you need to recognize that the backslashes displayed in the output may also be doubled.
The SO auto-format shows your problem. Since \ is used to escape characters, it's escaping the end quotes. Try changing that line to (note the use of double quotes):
while(source.find("\\") != -1):
Read more about escape characters in the docs.
I don't think anyone's mentioned this yet, but if you don't want to deal with having to escape characters just use a raw string.
source.find(r'\')
Adding the letter r before the string tells Python not to interpret any special characters and keeps the string exactly as you type it.
Related
How to write a regular expression which can handle the following substitution scenario.
Hello, this is a ne-
w line of text wher-
e we are trying hyp-
henation.
i have a short Python code which handles breaking long one_line strings into a multi_line string and produces output similar to the code sample given above
I want a regular expression that takes care of the single hyphenated character like in first and second line and just pulls up the single hyphenated character on the previous like.
something like re.sub("-\n<any character>","<the any character>\n")
I can not find a way on how to handle the hyphenated character
below is some further information about the question
Word = "Python string comparison is performed using the characters in both strings. The characters in both strings are compared one by one."
def hyphenate(word, x):
for i in range(x, len(word), x):
word = word[:i] + ("\n" if (word[i] == " " or word[i-1] == " " ) else "-\n") + (word[i:] if word[i] != " " else word[(i+1):])
return(word)
print(hyphenate(Word, 20))
#Produced output
Python string compar-
ison is performed
using the character- <=
s in both strings.
The characters in b- <=
oth strings are co-
mpared one by one.
#Desired output
Python string compar-
ison is performed
using the characters <=
in both strings.
The characters in <=
both strings are co-
mpared one by one.
You don't need to include the trailing character at all.
re.sub(r'-\n', '')
If for some reason you do need to capture the character, you can use r'\1' to refer back to it.
re.sub(r'-\n([aeiou])', r'\1')
The notation r'...' produces a "raw string" where backslashes only represent themselves. In Python, backslashes in strings are otherwise processed as escapes - for example, '\n' represents the single wharacter newline, whereas r'\n' represents the two literal characters backslash and n (which in a regex match a literal newline).
Why I need to add 5 backslash on the left if I want to show three of them in Python? How to count the backslash?
# [ ] print "\\\WARNING!///"
print('"\\\\\Warning///"')
You can use a "raw string" by adding r:
print(r'"\\\Warning///"')
This helps to avoid the backslash's "escape" properties, which python uses to control the use of special characters
Backslash is taken as escape sequence most of the time thus for printing single \ one needs to use \\
Unlike standard C, any unrecognized escape sequences are left unchanged in python.
>>> print('\test')
' est' # '\t' evaluates to tab + 'est'
>>> print('\\test')
'\test' # '\\' evaluates to literal '\' + 'test'
>>> print('\\\test')
'\ est' # '\\' evaluates to literal '\' + '\t' evaluates to tab + 'est'
>>> print('\\\\test')
'\\test' # '\\' evaluates to literal '\' + '\\' evaluates to literal '\' + 'test'
according to https://docs.python.org/2.0/ref/strings.html
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string.
since \W is not a valid escape sequence, \W is printed as it is. on the other hand \\ is printed as \.
so \\\\\W is printed as \\\W
However, in python 3.6, according to Strings and bytes literals
Changed in version 3.6: Unrecognized escape sequences produce a DeprecationWarning. In some future version of Python they will be a SyntaxError.
So your code might give SyntaxError in future python.
\ is used for special characters like '\n', '\t' etc. You should type 2n or 2n-1 \ for printing n \.
>>> print('\warning') # 1 \
\warning
>>> print('\\warning') # 2 \
\warning
>>> print('\\\warning') # 3 \
\\warning
>>> print('\\\\warning') # 4 \
\\warning
>>> print('\\\\\warning') # 5 \
\\\warning
>>> print('\\\\\\warning') # 6 \
\\\warning
>>>
Actually you should type 2n backslashes to represent n of them, technically. In a strict grammar, backslash is reserved as an escape character and has a special meaning, i.e., it does not represent "backslash". So, we took it from our character set to give it a special meaning then how to represent a pure "backslash"? The answer is represent it in the way newline character is represented, namely '\' stands for a backslash.
And the reason why you get 3 backslashes printed when 5 is typed is: The first 4 of them is interpreted as I said above and when it comes to the fifth one, the interpreter found that there is no definition for '\W' so it treated the fifth backslash as a normal one instead of a escape character. This is an advanced convenience feature of the interpreter and might not be true in other versions of it or in other languages (especially in more strict ones).
I want to remove \n from a string if it is in a string.
I have tried:
slashn = str(chr(92))+"n"
if slashn in newString:
newerString = newString.replace(slashn,'')
print(newerString)
else:
print(newString)
Assume that newString is a word that has \n at the end of it. E.g. text\n.
I have also tried the same code except slash equals to "\\"+"n".
Use str.replace() but with raw string literals:
newString = r"new\nline"
newerString = newString.replace(r"\n", "")
If you put a r right before the quotes enclosing a string literal, it becomes a raw string literal that does not treat any backslash characters as special escape sequences.
Example to clarify raw string literals (output is behind the #> comments):
# Normal string literal: single backslash escapes the 'n' and makes it a new-line character.
print("new\nline")
#> new
#> line
# Normal string literal: first backslash escapes the second backslash and makes it a
# literal backslash. The 'n' won't be escaped and stays a literal 'n'.
print("new\\nline")
#> new\nline
# Raw string literal: All characters are taken literally, the backslash does not have any
# special meaning and therefore does not escape anything.
print(r"new\nline")
#> new\nline
# Raw string literal: All characters are taken literally, no backslash has any
# special meaning and therefore they do not escape anything.
print(r"new\\nline")
#> new\\nline
You can use strip() of a string. Or strip('\n'). strip is a builtin function of a string.
Example:
>>>
>>>
>>> """vivek
...
... """
'vivek\n\n'
>>>
>>> """vivek
...
... """.strip()
'vivek'
>>>
>>> """vivek
...
... \n"""
'vivek\n\n\n'
>>>
>>>
>>> """vivek
...
... \n""".strip()
'vivek'
>>>
Look for the help command for a string builtin function strip like this:
>>>
>>> help(''.strip)
Help on built-in function strip:
strip(...)
S.strip([chars]) -> string or unicode
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
>>>
Use
string_here.rstrip('\n')
To remove the newline.
Try with strip()
your_string.strip("\n") # removes \n before and after the string
If you want to remove the newline from the ends of a string, I'd use .strip(). If no arguments are given then it will remove whitespace characters, this includes newlines (\n).
Using .strip():
if newString[-1:-2:-1] == '\n': #Test if last two characters are "\n"
newerString = newString.strip()
print(newerString)
else:
print(newString)
Another .strip() example (Using Python 2.7.9)
Also, the newline character can simply be represented as "\n".
Text="test.\nNext line."
print(Text)
Output:::: test.\nNextline"
This is because the element is stored in double inverted commas.In such cases next line will behave as text enclose in string.
I'm trying to write a regular expression in python, and one of the characters involved in it is the \001 character. putting \001 in a string doesn't seem to work. I also tried 'string' + str(chr(1)), but the regex doesn't seem to catch it. Please for the love of god somebody help me, I've been struggling with this all day.
import sys
import postgresql
import re
if len(sys.argv) != 2:
print("usage: FixToDb <fix log file>")
else:
f = open(sys.argv[1], 'r')
timeExp = re.compile(r'(\d{2}):(\d{2}):(\d{2})\.(\d{6}) (\S)')
tagExp = re.compile('(\\d+)=(\\S*)\001')
for line in f:
#parse the time
m = timeExp.match(line)
print(m.group(1) + ':' + m.group(2) + ':' + m.group(3) + '.' + m.group(4) + ' ' + m.group(5));
tagPairs = re.findall('\\d+=\\S*\001', line)
for t in tagPairs:
tagPairMatch = tagExp.match(t)
print ("tag = " + tagPairMatch.group(1) + ", value = " + tagPairMatch.group(2))
Here's is an example line of for the input. I replaced the '\001' character with a '~' for readability
15:32:36.357227 R 1 0 0 0 8=FIX.4.2~9=0067~35=A~52=20120713-19:32:36~34=1~49=PD~56=P~98=0~108=30~10=134
output:
15:32:36.357227 R
tag = 8, value = FIX.4.29=006735=A52=20120713-19:32:3634=149=PD56=P98=0108=3010=134
So it doesn't stop at the '\001' character.
chr(1) should work, as will "\x01", as will "\001". (Note that chr(1) already returns a string, so you don't need to do str(chr(1)).) In your example it looks like you have both "\001" and chr(1), so that won't work unless you have two of the characters in a row in your data.
You say the regex "doesn't seem to catch it", but you don't give an example of your input data, so it's impossible to say why.
Edit; Okay, it looks like the problem has nothing to do with the \001. It is the classic greediness problem. The \S* in your tagExp expression will match a \001 character (since that character is not whitespace. So the \S* is gobbling the entire line. Use \S*? to make it non-greedy.
Edit: As others have noted, it also looks like your backslashes are awry. In regular expressions you face a backslash-doubling problem: Python uses the backslash for its own string escapes (like \t for tab, \n for newline), but regular expressions also use the backslash for their own purposes (e.g., \s for whitespace). The usual solution is to use raw strings, but you can't do that if you want to use the "\001" escape. However, you could use raw strings for your timeExp regex. Then in your other regexes, double the backslashes (except on \001, because you want that one to be interpreted as a character-code escape).
Instead of using \S to match the value, which can be any non-whitespace character, including \001, you should use [^\x01], which will match any character that is not \001.
#Sam Mussmann, no...
1 (decimal) = \001 (octal) <> \x01 (UNICODE)
I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 - i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print coord_re
But this gives me 0.7133,2.25378. What am I doing wrong?
You should be using raw strings for regex, try the following:
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing all matches the equivalent of chr(1) + "," + chr(2):
>>> '\1,\2'
'\x01,\x02'
>>> print '\1,\2'
,
>>> print r'\1,\2' # this is what you actually want
\1,\2
Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
Python interprets the \1 as a character with ASCII value 1, and passes that to sub.
Use raw strings, in which Python doesn't interpret the \.
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
This is covered right in the beginning of the re documentation, should you need more info.