delete whitespace in regular expression - python

I'm learning Python and also English. I have a problem that might be easy, but I can't solve it. I have a folder of .txt files, and with a regular expression I was able to extract a 17-digit sequence from each one. I need to rename each file with the sequence I extracted from it.
import os
import re
path_txt = (r'C:\Users\usuario\Desktop\files')
name_files = os.listdir(path_txt)
for TXT in name_files:
    with open(path_txt + '\\' + TXT, "r") as content:
        search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())
        if search is not None:
            print(search.group(0))
            f = open(os.path.join( "Processes" , search.group(0) + ".txt"), "w")
            for line in content:
                print(line)
                f.write(line)
            f.close()
There are .txt files where the sequences appear with spaces between characters, and my regular expression cannot find them (examples: 00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5).
Edit: They are serial numbers that were typed by hand, so sometimes they appear with "." and "-" and other times without them. Sometimes spaces appear because of typos.

You want this regex:
search = re.search(r'(\d{5}.*\d{4}.*\d{3}.*\d{2}.*\d{2}-.*\d)', content.read())
An unescaped dot . matches any character (except a newline, by default). By putting \ in front of the dot you escaped it, so your pattern searched for literal dots rather than any character.
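One caveat worth checking before adopting this pattern: .* is greedy and matches any run of characters, so it can swallow far more than a separator. A minimal sketch, assuming the two sample serials from the question sit on the same line (the variable names are illustrative):

```python
import re

sample = '00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5'
greedy = re.search(r'(\d{5}.*\d{4}.*\d{3}.*\d{2}.*\d{2}-.*\d)', sample)

# The last greedy .* runs on to the final digit of the line, so both
# serials (and the comma between them) end up in one single match.
swallowed_everything = greedy.group(0) == sample
print(swallowed_everything)
```

The pattern also still requires the hyphen, so a serial typed without "-" would not be found.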

You can use \D in your regular expression to match any non-numeric character (including white space) and + to match one or more (or * to match zero or more), so you could rewrite your expression as:
pattern = r'(\d{5}\D+\d{4}\D+\d{3}\D+\d{2}\D+\d{2}\D+\d)'
re.findall(pattern, '00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5')
# ['00372.2004 .442.02.00-1', '00572.2008.872.02.00- 5']
Note I am using re.findall to find every match in the string and return them in a list.
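If the serial should also be found when the separators are missing altogether (the question's edit says that happens), the separators can be made optional. A minimal sketch, assuming the only separators are dots, hyphens and whitespace; SERIAL and find_serials are illustrative names, not part of the answers above:

```python
import re

# Assumption: between digit groups there may be any mix of dots,
# hyphens and whitespace, or nothing at all.
SEP = r'[\s.-]*'
SERIAL = re.compile(
    r'\d{5}' + SEP + r'\d{4}' + SEP + r'\d{3}' + SEP
    + r'\d{2}' + SEP + r'\d{2}' + SEP + r'\d'
)

def find_serials(text):
    """Return every serial found in text, reduced to its 17 digits."""
    return [re.sub(r'\D', '', m.group(0)) for m in SERIAL.finditer(text)]

sample = '00372.2004 .442.02.00-1, 00572.2008.872.02.00- 5, 00111222233344556'
serials = find_serials(sample)
print(serials)
```

Stripping the non-digits gives one canonical spelling per serial, which is convenient if the result is going to be used as a file name.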

Related

Cannot get my regular expression to capture target [duplicate]

I need some help on declaring a regex. My inputs are like the following:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>
The required output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100.
and there are many other lines in the txt files
with such tags
I've tried this:
#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(infile) as reader:
        for line in reader:
            line2 = line.replace('<[1> ', '')
            line = line2.replace('</[1> ', '')
            line2 = line.replace('<[1>', '')
            line = line2.replace('</[1>', '')
            print(line)
I've also tried this (but it seems like I'm using the wrong regex syntax):
line2 = line.replace('<[*> ', '')
line = line2.replace('</[*> ', '')
line2 = line.replace('<[*>', '')
line = line2.replace('</[*>', '')
I don't want to hard-code the replacements for 1 through 99.
This tested snippet should do it:
import re
line = re.sub(r"</?\[\d+>", "", line)
Edit: Here's a commented version explaining how it works:
line = re.sub(r"""(?x)  # Use free-spacing mode (the flag must come first).
    <    # Match a literal '<'
    /?   # Optionally match a '/'
    \[   # Match a literal '['
    \d+  # Match one or more digits
    >    # Match a literal '>'
    """, "", line)
Regexes are fun! But I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: the "metacharacters", which need to be escaped (i.e. with a backslash placed in front), and the rules are different inside and outside character classes. There is an excellent online tutorial at www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
str.replace() does fixed replacements. Use re.sub() instead.
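A short side-by-side may make the difference concrete. This is only a sketch on one line of the sample input; the variable names are illustrative:

```python
import re

line = "where the<[99> number ranges from 1-100</[99>."

# str.replace looks for the literal text '<[*>', which never occurs,
# so the line comes back unchanged.
unchanged = line.replace('<[*>', '')

# re.sub interprets its first argument as a pattern: '<', an optional
# '/', a literal '[', one or more digits, then '>'.
cleaned = re.sub(r'</?\[\d+>', '', line)

print(unchanged)
print(cleaned)
```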
I would go like this (regex explained in comments):
import re
# If you need to use the regex more than once it is suggested to compile it.
pattern = re.compile(r"</{0,}\[\d+>")
# <\/{0,}\[\d+>
#
# Match the character “<” literally «<»
# Match the character “/” literally «\/{0,}»
#    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d+»
#    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»
subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>"""
result = pattern.sub("", subject)
print(result)
If you want to learn more about regex, I recommend reading Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.
The easiest way
import re
txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>'
out = re.sub("(<[^>]+>)", '', txt)
print(out)
The replace method of string objects does not accept regular expressions, only fixed strings (see the documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).
You have to use the re module:
import re
newline = re.sub(r"</?\[[0-9]+>", "", line)
You don't have to use a regular expression (for your sample string):
>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'
>>> for w in s.split(">"):
...     if "<" in w:
...         print(w.split("<")[0])
...
this is a paragraph with
in between
and then there are cases ... where the
number ranges from 1-100
.
and there are many other lines in the txt files
with
such tags
import os, sys, re, glob
pattern = re.compile(r"</?\[\d+>")
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(infile) as reader:
        for line in reader:
            # Remove every opening or closing tag such as <[1> or </[99>.
            retline = pattern.sub("", line)
            sys.stdout.write(retline)

How to insert commas amongst any letter and following any digit using regex

import re
fileinput = open('INFILE.txt', 'r')
fileoutput = fileinput.read()
fileinput.close()
replace = re.sub(r'([A-Za-z]),([A-Za-z])', r'\1\2', fileoutput)
print(replace)
replaceout = open('OUTFILE.txt', 'w')
replaceout.write(replace)
replaceout.close()
The code above deletes commas between any two letters, whether uppercase or lowercase. How do I insert a comma between any letter and a following digit? I tried this code:
replace = re.sub(r"([a-z])([0-9])", r",\1", fileoutput)
but it does not work. Any suggestions on how to insert a comma between any letter and any digit?
This may help you understand how to add the comma and refer back to what you matched. The parentheses around part of the pattern capture a value that you can reuse later in the replacement. The first captured group is referenced as \1, the second as \2, and so on.
Inside the square brackets you tell the regex which characters you want it to match. Without a quantifier, the character class matches a single character, so the code below puts a comma after each character.
import re
test = "123frogger"
replace = re.sub(r'([A-Za-z0-9])', r'\1,', test)
This creates the output:
1,2,3,f,r,o,g,g,e,r,
Here's an update based on one of your comments above about the content of what you are trying to adjust.
import re
test = "Vilniausnuoma483,NuomaVilniuiiraplinkVilniu"
replace = re.sub(r'([A-Za-z])([0-9].*)', r'\1,\2', test)
It will output the following.
Vilniausnuoma,483,NuomaVilniuiiraplinkVilniu
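Note that the .* in the pattern above consumes the rest of the string, so only the first letter-digit boundary gets a comma. If every boundary should get one, a sketch along these lines may be closer to what the question asks for (insert_commas is a made-up name for illustration):

```python
import re

def insert_commas(text):
    # Capture a letter and the digit right after it, then write both
    # back with a comma in between.
    return re.sub(r'([A-Za-z])([0-9])', r'\1,\2', text)

first = insert_commas("Vilniausnuoma483,NuomaVilniuiiraplinkVilniu")
second = insert_commas("abc123def456")
print(first)
print(second)
```

The attempt in the question, with the replacement r",\1", fails because it writes the comma before the letter and never writes back the digit captured as \2; its [a-z] class also skips uppercase letters.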

copy required data from one file to another file in python

I am new to Python and am stuck on this. I have a file a.txt which contains 10-15 lines of HTML code and text. I want to copy the data that matches my regular expression from a.txt to b.txt. Suppose I have the line Hello "World" How "are" you, and I want to copy the data between double quotes, i.e. World and are, to the new file.
This is what I have done:
if x in line:
    p = re.compile("\"*\"")
    q = p.findall(line)
    print(q)
But this displays only " " (double quotes) as output. I think there is a mistake in my regular expression.
Any help is greatly appreciated.
Thanks.
Your regex (which translates to "*" without all the string escaping) matches zero or more quotes, followed by a quote.
You want
p = re.compile(r'"([^"]*)"')
Explanation:
"        # Match a quote
(        # Match and capture the following:
[^"]*    # 0 or more characters except quotes
)        # End of capturing group
"        # Match a quote
This assumes that you never have to deal with escaped quotes, e.g.:
He said: "The board is 2\" by 4\" in size"
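For the simple, unescaped case, the two patterns can be compared directly on the sample line from the question. A small sketch (the variable names are illustrative):

```python
import re

line = 'Hello "World" How "are" you'

# Pattern from the question: zero or more quotes followed by a quote.
# Each quote character stands alone, so every match is a single '"'.
broken = re.findall("\"*\"", line)

# Corrected pattern: capture whatever sits between a pair of quotes.
fixed = re.findall(r'"([^"]*)"', line)

print(broken)
print(fixed)
```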
Capture the group you're interested in (i.e., the text between quotes), extract the matches from each line, then write them one per line to the new file, e.g.:
import re
with open('input') as fin, open('output', 'w') as fout:
    for line in fin:
        matches = re.findall('"(.*?)"', line)
        fout.writelines(match + '\n' for match in matches)

how to place a character literal in a python string

I'm trying to write a regular expression in Python, and one of the characters involved in it is the \001 character. Putting \001 in a string doesn't seem to work. I also tried 'string' + str(chr(1)), but the regex doesn't seem to catch it. Please, for the love of god, somebody help me; I've been struggling with this all day.
import sys
import postgresql
import re
if len(sys.argv) != 2:
    print("usage: FixToDb <fix log file>")
else:
    f = open(sys.argv[1], 'r')
    timeExp = re.compile(r'(\d{2}):(\d{2}):(\d{2})\.(\d{6}) (\S)')
    tagExp = re.compile('(\\d+)=(\\S*)\001')
    for line in f:
        #parse the time
        m = timeExp.match(line)
        print(m.group(1) + ':' + m.group(2) + ':' + m.group(3) + '.' + m.group(4) + ' ' + m.group(5))
        tagPairs = re.findall('\\d+=\\S*\001', line)
        for t in tagPairs:
            tagPairMatch = tagExp.match(t)
            print("tag = " + tagPairMatch.group(1) + ", value = " + tagPairMatch.group(2))
Here is an example line of the input. I replaced the '\001' character with '~' for readability:
15:32:36.357227 R 1 0 0 0 8=FIX.4.2~9=0067~35=A~52=20120713-19:32:36~34=1~49=PD~56=P~98=0~108=30~10=134
output:
15:32:36.357227 R
tag = 8, value = FIX.4.29=006735=A52=20120713-19:32:3634=149=PD56=P98=0108=3010=134
So it doesn't stop at the '\001' character.
chr(1) should work, as will "\x01", as will "\001". (Note that chr(1) already returns a string, so you don't need to do str(chr(1)).) In your example it looks like you have both "\001" and chr(1), so that won't work unless you have two of the characters in a row in your data.
You say the regex "doesn't seem to catch it", but you don't give an example of your input data, so it's impossible to say why.
Edit: Okay, it looks like the problem has nothing to do with the \001. It is the classic greediness problem. The \S* in your tagExp expression will match a \001 character (since that character is not whitespace), so the \S* gobbles up the entire line. Use \S*? to make it non-greedy.
Edit: As others have noted, it also looks like your backslashes are awry. In regular expressions you face a backslash-doubling problem: Python uses the backslash for its own string escapes (like \t for tab, \n for newline), but regular expressions also use the backslash for their own purposes (e.g., \s for whitespace). The usual solution is to use raw strings. That works here too, because the re module itself understands escapes such as \x01 and three-digit octal escapes like \001 inside a pattern. If you prefer to keep a non-raw string for the \001 escape, double the other backslashes (as you already do) and leave \001 single, so that Python interprets it as a character-code escape. Your timeExp regex can stay a raw string either way.
Instead of using \S to match the value, which can be any non-whitespace character, including \001, you should use [^\x01], which will match any character that is not \001.
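A minimal sketch of that suggestion, assuming every tag=value pair in the log line (including the last one) is terminated by the \001 delimiter. The sample line is a shortened version of the one in the question, and SOH, tag_exp and pairs are illustrative names:

```python
import re

SOH = "\x01"  # the same character as "\001" and chr(1)

# Shortened sample from the question, with '~' turned back into the delimiter.
# Assumption: the final pair is also followed by the delimiter.
line = "15:32:36.357227 R 1 0 0 0 8=FIX.4.2~9=0067~35=A~52=20120713-19:32:36~10=134~"
line = line.replace("~", SOH)

# A raw string works here because the re module parses \x01 itself.
tag_exp = re.compile(r"(\d+)=([^\x01]*)\x01")

pairs = tag_exp.findall(line)
for tag, value in pairs:
    print("tag = " + tag + ", value = " + value)
```

Because the value is defined as "anything except the delimiter", the match cannot run past the end of a pair, so no non-greedy quantifier is needed.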
@Sam Mussmann: note that 1 (decimal), \001 (octal) and \x01 (hexadecimal) all denote the same character.
