how to place a character literal in a python string - python

I'm trying to write a regular expression in Python, and one of the characters involved in it is the \001 character. Putting \001 in a string doesn't seem to work. I also tried 'string' + str(chr(1)), but the regex doesn't seem to catch it. Please for the love of god somebody help me, I've been struggling with this all day.
import sys
import postgresql
import re
if len(sys.argv) != 2:
    print("usage: FixToDb <fix log file>")
else:
    f = open(sys.argv[1], 'r')
    timeExp = re.compile(r'(\d{2}):(\d{2}):(\d{2})\.(\d{6}) (\S)')
    tagExp = re.compile('(\\d+)=(\\S*)\001')
    for line in f:
        # parse the time
        m = timeExp.match(line)
        print(m.group(1) + ':' + m.group(2) + ':' + m.group(3) + '.' + m.group(4) + ' ' + m.group(5))
        tagPairs = re.findall('\\d+=\\S*\001', line)
        for t in tagPairs:
            tagPairMatch = tagExp.match(t)
            print("tag = " + tagPairMatch.group(1) + ", value = " + tagPairMatch.group(2))
Here is an example line of the input. I replaced the '\001' character with a '~' for readability:
15:32:36.357227 R 1 0 0 0 8=FIX.4.2~9=0067~35=A~52=20120713-19:32:36~34=1~49=PD~56=P~98=0~108=30~10=134
output:
15:32:36.357227 R
tag = 8, value = FIX.4.29=006735=A52=20120713-19:32:3634=149=PD56=P98=0108=3010=134
So it doesn't stop at the '\001' character.

chr(1) should work, as will "\x01", as will "\001". (Note that chr(1) already returns a string, so you don't need to do str(chr(1)).) In your example it looks like you have both "\001" and chr(1), so that won't work unless you have two of the characters in a row in your data.
You say the regex "doesn't seem to catch it", but you don't give an example of your input data, so it's impossible to say why.
Edit: Okay, it looks like the problem has nothing to do with the \001. It is the classic greediness problem. The \S* in your tagExp expression will match a \001 character (since that character is not whitespace), so the \S* is gobbling the entire line. Use \S*? to make it non-greedy.
Edit: As others have noted, watch your backslashes too. In regular expressions you face a backslash-doubling problem: Python uses the backslash for its own string escapes (like \t for tab, \n for newline), but regular expressions also use the backslash for their own purposes (e.g., \s for whitespace). The usual solution is to use raw strings, as your timeExp regex does. That works for this character as well, because the re module understands the \x01 and \001 escapes itself, so r'(\d+)=(\S*?)\x01' is a valid pattern. If you stay with ordinary strings, double the backslashes (except on \001, because you want that one to be interpreted as a character-code escape).
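To make the greediness point concrete, here is a minimal sketch. The sample line is a shortened, made-up version of the input in the question (an assumption, not the asker's real data), and the patterns use raw strings with \x01, which the re module parses on its own.

```python
import re

# All three spellings denote the same one-character string (code point 1).
soh_variants = (chr(1), "\x01", "\001")
print(len(set(soh_variants)))  # number of distinct values among the three

# Hypothetical sample: three tag=value pairs, each terminated by \x01.
line = "8=FIX.4.2\x019=0067\x0135=A\x01"

# Greedy: \S* also consumes the \x01 delimiters, so one long match is found.
greedy = re.findall(r"(\d+)=(\S*)\x01", line)

# Non-greedy: \S*? stops at the first \x01 that lets the pattern succeed.
lazy = re.findall(r"(\d+)=(\S*?)\x01", line)

print(greedy)
print(lazy)
```

The greedy version returns a single pair whose value still contains the delimiters, which is the behaviour shown in the question's output; the non-greedy version returns one pair per tag.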

Instead of using \S to match the value, which can be any non-whitespace character, including \001, you should use [^\x01], which will match any character that is not \001.
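A small sketch of the negated character class, assuming the message is a string of tag=value pairs separated by \x01. The helper name parse_fix and the sample message are made up for illustration.

```python
import re

# A value is "everything up to the next \x01", so it can never swallow the delimiter.
TAG = re.compile(r"(\d+)=([^\x01]*)")

def parse_fix(message):
    """Return a dict mapping each tag number (as a string) to its value."""
    return dict(TAG.findall(message))

# Hypothetical message built from a few of the pairs shown in the question.
message = "\x01".join(["8=FIX.4.2", "9=0067", "35=A", "52=20120713-19:32:36", "10=134"])
fields = parse_fix(message)

for tag, value in fields.items():
    print("tag = " + tag + ", value = " + value)
```

Unlike the \S approach, this also keeps values that contain spaces, and it does not require a trailing delimiter after the last pair.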

@Sam Mussmann, no...
1 (decimal) = \001 (octal) <> \x01 (UNICODE)

Related

delete whitespace in regular expression

I'm learning Python and also English, and I have a problem that might be easy, but I can't solve it. I have a folder of .txt files, and with a regular expression I was able to extract a sequence of 17 digits from each one. I need to rename each file with the sequence I extracted from it.
import os
import re
path_txt = (r'C:\Users\usuario\Desktop\files')
name_files = os.listdir(path_txt)
for TXT in name_files:
    with open(path_txt + '\\' + TXT, "r") as content:
        search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())
        if search is not None:
            print(search.group(0))
            f = open(os.path.join("Processes", search.group(0) + ".txt"), "w")
            for line in content:
                print(line)
                f.write(line)
            f.close()
There are .txt files where the sequences appear with spaces between characters, and my regular expression cannot find them (example: 00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5).
Edit: They are serial numbers that were typed by hand, so sometimes they appear with "." and "-" and other times without them. Sometimes spaces appear because of typos.
You want this regex:
search = re.search(r'(\d{5}.*\d{4}.*\d{3}.*\d{2}.*\d{2}-.*\d)', content.read())
Dot . is any character. By putting \ in front of the dot you escaped it and searched for dots and not any character.
You can use \D in your regular expression to match any non-numeric character (including white space) and + to match one or more (or * to match zero or more), so you could rewrite your expression as:
pattern = r'(\d{5}\D+\d{4}\D+\d{3}\D+\d{2}\D+\d{2}\D+\d)'
re.findall(pattern, '00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5')
# ['00372.2004 .442.02.00-1', '00572.2008.872.02.00- 5']
Note I am using re.findall to find every match in the string and return them in a list.
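Since the question says the separators are sometimes missing altogether, a variant with "zero or more" separators may fit better. The following is only a sketch: the helper name find_serials and the third sample (a serial typed with no separators) are assumptions, and the separator set is limited to dots, hyphens and whitespace.

```python
import re

# Six digit groups (5+4+3+2+2+1 = 17 digits) with optional dots, hyphens
# or whitespace between them.
SEP = r"[\s.-]*"
SERIAL = re.compile(SEP.join([r"\d{5}", r"\d{4}", r"\d{3}", r"\d{2}", r"\d{2}", r"\d"]))

def find_serials(text):
    """Return every serial found in text, reduced to its 17 digits."""
    return [re.sub(r"\D", "", match) for match in SERIAL.findall(text)]

text = "00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5, 00123200544202001"
print(find_serials(text))
```

Normalizing each match to digits only also gives a consistent file name regardless of how the serial was typed.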

Python: Ignore a # / and random numbers in a string

I use some code to read a website, scrape some information, pass it to Google and print some directions.
I'm having an issue with some of the information: the site I use sometimes adds a # followed by 3 random digits, then a / and another 3 digits, e.g. #037/100.
How can I use Python to ignore this "#037/100" string?
I currently use
for i, part in enumerate(list(addr_p)):
    if '#' in part:
        del addr_p[i]
        break
to remove the # if found but I'm not sure how to do it for the random numbers
Any ideas ?
If you find yourself wanting to remove "a # followed by three digits, a forward slash and three more digits" from a string s, you could do
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\d{3}\/\d{3}', '', s)
print(t)
Result (note that the spaces on both sides of the removed token remain):
this is a string  with other stuff
Explanation:
# - literal character '#'
\d{3} - exactly three digits
\/ - forward slash (the backslash is harmless but optional: / has no special meaning in Python's re)
\d{3} - exactly three digits
And the whole thing that matches the above (if it's present) is replaced with '' - i.e. "removed".
import re
addr_p[i] = re.sub(r'#[0-9]+/[0-9]+$', '', addr_p[i])
I'm no wizard with regular expressions, but I'd imagine you could do something like this. Note that re.sub returns a new string, so the result has to be assigned back.
You could even handle '#' in the regexp as well.
If the format is always the same, then you could check whether the string starts with a #, then drop its first 8 characters (the length of "#037/100").
if part[0:1] == '#':
    part = part[8:]
If the first character is a #, this sets the string to itself from index 8 to the end.
I'd double your problems and match against a regular expression for this.
import re
regex = re.compile(r'([\w\s]+)#\d+\/\d+([\w\s]+)')
m = regex.match('This is a string with a #123/987 in it')
if m:
    s = m.group(1) + m.group(2)
    print(s)
A more concise way:
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\S+', '', s)
print(t)
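Putting the answers together for a list like addr_p, here is a minimal sketch. The function name strip_unit_tokens and the sample address parts are made up, and it assumes the token always has the form # + three digits + / + three digits. It builds a new list instead of deleting from the list while iterating over it.

```python
import re

# '#', three digits, '/', three digits, e.g. "#037/100".
TOKEN = re.compile(r"#\d{3}/\d{3}")

def strip_unit_tokens(parts):
    """Remove the token from every part and drop parts that end up empty."""
    cleaned = []
    for part in parts:
        part = TOKEN.sub("", part)
        part = " ".join(part.split())  # collapse leftover double spaces
        if part:
            cleaned.append(part)
    return cleaned

# Hypothetical address parts for illustration.
addr_p = ["12 Example Road #037/100", "#123/456", "Singapore"]
print(strip_unit_tokens(addr_p))
```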

Multiple Regex Search and Replace

I'm trying to create a simple script which will take the regular expressions from a file, and then carry out the searches and replacements on another file. This is what I have but it doesn't work, the file is unchanged, what am I doing wrong?
import re, fileinput
separator = ' => '
file = open("searches.txt", "r")
for search in file:
    pattern, replacement = search.split(separator)
    pattern = 'r"""' + pattern + '"""'
    replacement = 'r"""' + replacement + '"""'
    for line in fileinput.input("test.txt", inplace=1):
        line = re.sub(pattern, replacement, line)
        print(line, end="")
The file searches.txt looks like this:
<p (class="test">.+?)</p> => <h1 \1</h1>
(<p class="not">).+?(</p>) => \1This was changed by the script\2
and test.txt like this:
<p class="test">This is an element with the test class</p>
<p class="not">This is an element without the test class</p>
<p class="test">This is another element with the test class</p>
I did a test to see if it's getting the expression from the file correctly:
>>> separator = ' => '
>>> file = open("searches.txt", "r")
>>> for search in file:
...     pattern, replacement = search.split(separator)
...     pattern = 'r"""' + pattern + '"""'
...     replacement = 'r"""' + replacement + '"""'
...     print(pattern)
...     print(replacement)
...
r"""<p (class="test">.+?)</p>"""
r"""<h1 \1</h1>
"""
r"""(<p class="not">).+?(</p>)"""
r"""\1This was changed by the script\2"""
The closing triple quotes on the first replacement are on a newline for some reason, could this be the cause of my problem?
You don't need
pattern = 'r"""' + pattern + '"""'
In the call to re.sub, pattern should be the actual regex. So <p (class="test">.+?)</p>. When you wrap all those double quotes around it, it makes it so that the pattern never matches the text in your file.
You may have seen code like this, where text stands for the string being searched:
replaced = re.sub(r"""\w+""", '-', text)
In that case, the r""" tells the Python interpreter that this is a "raw" triple-quoted string, i.e. a string in which backslash sequences are not replaced (such as \n being replaced with a newline). Programmers often use raw strings in Python to write regexes because they want to use regex sequences (like \w above) without having to escape the backslash. Without a raw string, the regex would have to be '\\w+', which gets confusing.
In any case, you don't need the triple double quotes at all. The last snippet could simply have been written:
replaced = re.sub(r'\w+', '-', text)
Finally, your other problem is that your input file has newlines in it, separating each case of pattern => replacement. So really it's "pattern => replacement\n" and the trailing newline follows your replacement variable. Try doing:
for search in file:
    search = search.rstrip()  # remove the trailing \n from the input
    pattern, replacement = search.split(separator)
Two observations:
1) Use .strip() when reading the file like so:
pattern, replacement = search.strip().split(separator)
This removes the trailing \n from each line read from the file.
2) Use re.escape() rather than the r"""+ str +""" form you are using if you intend to escape regex meta characters from the pattern
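Combining the points above (no added quotes, trailing newline stripped), the rule loading and replacement could look roughly like this. It is a sketch that works on in-memory strings rather than on searches.txt and test.txt so it can be run as is; load_rules and apply_rules are made-up helper names.

```python
import re

SEPARATOR = " => "

def load_rules(lines):
    """Turn 'pattern => replacement' lines into (compiled regex, replacement) pairs."""
    rules = []
    for raw in lines:
        raw = raw.rstrip("\n")  # drop the newline that follows each rule
        if not raw:
            continue
        pattern, replacement = raw.split(SEPARATOR, 1)
        rules.append((re.compile(pattern), replacement))
    return rules

def apply_rules(rules, text):
    """Apply every rule in order to the whole text."""
    for regex, replacement in rules:
        text = regex.sub(replacement, text)
    return text

# The two rules and the three lines from the question, as they would be read from the files.
rule_lines = [
    r'<p (class="test">.+?)</p> => <h1 \1</h1>' + "\n",
    r'(<p class="not">).+?(</p>) => \1This was changed by the script\2' + "\n",
]
html = (
    '<p class="test">This is an element with the test class</p>\n'
    '<p class="not">This is an element without the test class</p>\n'
    '<p class="test">This is another element with the test class</p>\n'
)
result = apply_rules(load_rules(rule_lines), html)
print(result, end="")
```

The patterns are passed to re exactly as they appear in the rules file; the r prefix above is only needed because the sample rules are written as Python literals here.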

Can't get single \ in python

I'm trying to learn python, and I'm pretty new at it, and I can't figure this one part out.
Basically, what I'm doing now is something that takes the source code of a webpage, and takes out everything that isn't words.
Webpages have a lot of \n and \t, and I want something that will find \ and delete everything between it and the next ' '.
def removebackslash(source):
    while(source.find('\') != -1):
        startback = source.find('\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
is what I have. It doesn't work like this, because the \' doesn't close the string, but when I change \ to \\, it interprets the string as \\. I can't figure out anything that is interpreted as '\'.
\ is an escape character; it either gives characters a special meaning or takes said special meaning away. Right now, it's escaping the closing single quote and treating it as a literal single quote. You need to escape it with itself to insert a literal backslash:
def removebackslash(source):
    while(source.find('\\') != -1):
        startback = source.find('\\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
Try using replace:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
So in your case:
my_text = my_text.replace('\n', '')
my_text = my_text.replace('\t', '')
As others have said, you need to use '\\'. The reason you think this isn't working is because when you get the results, they look like they begin with two backslashes. But they don't begin with two backslashes, it's just that Python shows two backslashes. If it didn't, you couldn't tell the difference between a newline (represented as \n) and a backslash followed by the letter n (represented as \\n).
There are two ways to convince yourself of what's really going on. One is to use print on the result, which shows the actual characters instead of their escaped representation:
>>> x = "here is a backslash \\ and here comes a newline \n this is on the next line"
>>> x
'here is a backslash \\ and here comes a newline \n this is on the next line'
>>> print(x)
here is a backslash \ and here comes a newline
 this is on the next line
>>> startback = x.find('\\')
>>> x[startback:]
'\\ and here comes a newline \n this is on the next line'
>>> print(x[startback:])
\ and here comes a newline
 this is on the next line
Another way is to use len to verify the length of the string:
>>> x = "Backslash \\ !"
>>> startback = x.find('\\')
>>> x[startback:]
'\\ !'
>>> print(x[startback:])
\ !
>>> len(x[startback:])
3
Notice that len(x[startback:]) is 3. The string contains three characters: backslash, space, and exclamation point. You can see what's going on even more simply by just looking at a string that contains only a backslash:
>>> x = "\\"
>>> x
'\\'
>>> print(x)
\
>>> len(x)
1
x only looks like it starts with two backslashes when you evaluate it at the interactive prompt (or otherwise use its __repr__ method). When you actually print it, you can see it's only one backslash, and when you look at its length, you can see it's only one character long.
So what this means is you need to escape the backslash in your find, and you need to recognize that the backslashes displayed in the output may also be doubled.
The SO auto-format shows your problem. Since \ is used to escape characters, it's escaping the end quotes. Try changing that line to (note the use of double quotes):
while(source.find("\\") != -1):
Read more about escape characters in the docs.
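I don't think anyone's mentioned this yet, but raw strings are the usual way to avoid escaping backslashes. Adding the letter r before the string tells Python not to process backslash escapes, so r'\n' is a backslash followed by the letter n. There is one catch that matters here: a raw string literal cannot end in a single backslash, so
source.find(r'\')
is still a syntax error. For a lone backslash you need '\\' (or "\\"), as in the other answers.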
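If the text really does contain literal backslashes, the loop from the question can also be written as one substitution. This is a sketch under that assumption; remove_backslash_tokens is a made-up name, and unlike the original loop it also handles a backslash that is not followed by any space.

```python
import re

def remove_backslash_tokens(source):
    """Remove each backslash and everything after it up to and including the next space."""
    # r'\\' is a regex for one literal backslash; [^ ]* takes the rest of the token.
    return re.sub(r"\\[^ ]*(?: |$)", "", source)

# A raw string, so the backslashes below are real characters in the text.
sample = r"Hello \n\t world \x41 end"
print(len("\\"))  # length of a string holding a single backslash
print(remove_backslash_tokens(sample))
```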

Handling backreferences to capturing groups in re.sub replacement pattern

I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 - i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print coord_re
But this gives me 0.7133,2.25378. What am I doing wrong?
You should be using raw strings for regex, try the following:
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing each match with the equivalent of chr(1) + "," + chr(2):
>>> '\1,\2'
'\x01,\x02'
>>> print('\1,\2')
,
>>> print(r'\1,\2') # this is what you actually want
\1,\2
Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
Python interprets the \1 as a character with ASCII value 1, and passes that to sub.
Use raw strings, in which Python doesn't interpret the \.
coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)
This is covered right in the beginning of the re documentation, should you need more info.
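The difference can be checked side by side. This sketch reuses the coords string from the question; the variable names and the two alternative spellings (\g<1> group references and a lookaround pattern) are additions for illustration, not part of the original answers.

```python
import re

coords = "0.71331, 52.25378"

# Non-raw replacement: Python turns "\1,\2" into chr(1) + "," + chr(2) before re sees it.
broken = re.sub(r"(\d), (\d)", "\1,\2", coords)

# Raw replacement: re receives the backreferences \1 and \2.
fixed = re.sub(r"(\d), (\d)", r"\1,\2", coords)

# Same result with explicit group references.
explicit = re.sub(r"(\d), (\d)", r"\g<1>,\g<2>", coords)

# Alternative that matches only the ", " itself, so no backreferences are needed.
lookaround = re.sub(r"(?<=\d), (?=\d)", ",", coords)

print(repr(broken))
print(fixed, explicit, lookaround)
```

The repr of the broken result shows the two control characters that are invisible in the question's printed output.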
