Python re.search finds result but group doesn't work - python

I want to find the number matching my pattern in every line of a .txt file.
text fragment
sometext - 0.007442749125388171
sometext - 0.004296183916209439
sometext - 0.0037923667088698393
sometext - 0.003137404884873018
code
file = codecs.open(FILEPATH, encoding='utf-8')
for cnt, line in enumerate(file):
    result_text = re.match(r'[a-zżźćńółęąś]*', line).group()
    result_value = re.search(r'[0-9].[0-9]*', line).group()
    print("Line {}: {}".format(cnt, line))
It's strange because re.search finds results:
<_sre.SRE_Match object; span=(8, 28), match='0.001879612135574806'>
but if I try to assign the result to a variable, I get this:
error
File "read.py", line 18, in <module>
result_value = re.search(r'[0-9].[0-9]*', line).group()
AttributeError: 'NoneType' object has no attribute 'group'

To capture part of a match, put parentheses around the sub-pattern you want and pass that group's index to the group() method. Calling group() with no argument returns the whole match, so the AttributeError itself means that re.search returned None for at least one line (for example, an empty line at the end of the file).
For example, for your second match, the code could be modified as follows:
# There is only 1 group here, so we pass index 1 (the dot is escaped to match a literal ".")
result_value = re.search(r'([0-9]\.[0-9]*)', line).group(1)
As proposed by other comments in your question, you may also want to check whether matches were found before trying to extract the captured groups:
import re
with open("file.txt", encoding="utf-8") as text_file:
    for i, line in enumerate(text_file):
        text_matches = re.match(r'([a-zżźćńółęąś]*)', line)
        if text_matches is None:
            continue
        text_result = text_matches.group(1)
        # \. matches a literal decimal point
        value_matches = re.search(r'([0-9]\.[0-9]*)', line)
        if value_matches is None:
            # No number on this line (for example, an empty line): skip it
            continue
        value_result = value_matches.group(1)
        print("Line {}: {} {}".format(i, text_result, value_result))
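To make the difference between group() and group(1) concrete, here is a minimal sketch. The helper name extract_value and the sample lines are made up for illustration, and the escaped dot is an assumption about the intended pattern.

```python
import re

# Hypothetical helper: returns the decimal number found in a line, or None.
def extract_value(line):
    match = re.search(r'([0-9]+\.[0-9]+)', line)
    if match is None:
        # re.search returns None when nothing matches (e.g. an empty line);
        # calling .group() on None is what raises the AttributeError.
        return None
    # group() (same as group(0)) is the whole match;
    # group(1) is the first parenthesised group. Here they coincide.
    assert match.group() == match.group(1)
    return match.group(1)

lines = ["sometext - 0.007442749125388171\n", "\n"]
values = [extract_value(line) for line in lines]
print(values)
```

The empty line yields None instead of crashing, which is the behaviour the None check above is meant to give you.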

I'd like to suggest a tighter regex definition:
^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$
Explanation
Multiline mode (re.MULTILINE): causes ^ and $ to match at the beginning/end of each line (not only at the beginning/end of the string)
^ asserts the beginning of the line
([a-zżźćńółęąś]+) capture group to match the "identifier"
\s+-\s+ the separator in-between with a variable number of spaces
(\d+\.\d+) matches the decimal number
$ asserts the end of the line
Sample Code:
import re
regex = r"^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$"
test_str = ("sometext - 0.007442749125388171\n"
            "sometext - 0.004296183916209439\n"
            "sometext - 0.0037923667088698393\n"
            "sometext - 0.003137404884873018")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    # Groups are numbered from 1; group 0 is the whole match
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {group_num}: {group}".format(group_num=group_num, group=match.group(group_num)))
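If you read the file line by line rather than as one string, the same pattern can be used without re.MULTILINE. This is only a sketch: parse_line and the sample list are hypothetical names, and converting the value to float is my assumption about what you want to do with it.

```python
import re

# Same pattern as above, compiled once; ^ and $ refer to the single line passed in.
LINE_RE = re.compile(r"^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$")

def parse_line(line):
    """Return (identifier, value) for a well-formed line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), float(match.group(2))

sample = ["sometext - 0.007442749125388171", "", "not a valid line"]
parsed = [parse_line(line) for line in sample]
print(parsed)
```

Lines that do not fit the format come back as None, so they can be filtered out or logged instead of raising an exception.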

Related

Regular expression to search string from a text file

I wrote the code below to extract two values from a specific line in a text file. My text file has multiple lines of information and I am trying to find the line below:
2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856
I am extracting the time (11:15:09) and the bandwidth (1751856) from the line above.
import re
import matplotlib.pyplot as plt
import sys
time = []
bandwidth = []
myfile = open(sys.argv[1])
for line in myfile:
    line = line.rstrip()
    if re.findall('TMMBR with bps:',line):
        time.append(line[12:19])
        bandwidth.append(line[-7:])
plt.plot(time,bandwidth)
plt.xlabel('time')
plt.ylabel('bandwidth')
plt.title('TMMBR against time')
plt.legend()
plt.show()
The problem here is that I am using absolute index values (line[12:19]) to extract the data, which doesn't work if the line has extra characters or extra spaces. What regular expression can I write to extract the values? I am new to regular expressions.
Try this:
(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+
(?:\d+:\d+:|(?<=TMMBR with bps: )) non-capturing group with two alternatives:
\d+:\d+: one or more digits followed by a colon, twice (the hours and the minutes).
| OR
(?<=TMMBR with bps: ) a lookbehind: a position that is preceded by the string "TMMBR with bps: ".
\d+ one or more digits.
import re
txt1 = '2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856'
res = re.findall(r'(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+', txt1)
print(res[0]) #Output: 11:15:09
print(res[1]) #Output: 1751856
You can use a slightly more specific pattern with 2 capture groups:
(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b
Explanation
(\d\d:\d\d:\d\d) Capture group 1, match a time like format
\.\d{3}\b Match a dot and 3 digits
.* Match the rest of the line
\bTMMBR with bps:\s* A word boundary, match TMMBR with bps: and optional whitespace chars
(\d+) Capture group 2, match 1 or more digits
\b A word boundary
Example
import re
s = r"2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856"
pattern = r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b"
m = re.search(pattern, s)
if m:
    print(m.groups())
Output
('11:15:09', '1751856')
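Applied to the original plotting problem, the two capture groups could feed the lists directly. A rough sketch, assuming you want the bandwidth as an integer; collect_tmmbr is a hypothetical helper and the second log line is invented to show a non-matching line.

```python
import re

TMMBR_RE = re.compile(r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b")

def collect_tmmbr(lines):
    """Return parallel lists of time strings and integer bandwidths."""
    times, bandwidths = [], []
    for line in lines:
        m = TMMBR_RE.search(line)
        if m:
            times.append(m.group(1))
            # int() so that a plot gets a numeric axis instead of string labels
            bandwidths.append(int(m.group(2)))
    return times, bandwidths

log = [
    "2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856",
    "2022-05-03 11:15:10.001 [6489266] | some unrelated message",  # made-up line
]
times, bandwidths = collect_tmmbr(log)
print(times, bandwidths)
```

Because the pattern does not depend on fixed column positions, extra spaces or characters elsewhere in the line should not break the extraction.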
You can just use split:
BPS_SEPARATOR = "TMMBR with bps: "
for line in strings:
    line = line.rstrip()
    if BPS_SEPARATOR in line:
        time.append(line.split(" ")[1])
        bandwidth.append(line.split(BPS_SEPARATOR)[1])
Use a context manager for handling the file.
Don't use re.findall just to check whether a pattern occurs in a string; it's not efficient. Use re.search instead when you do need a regex.
In your case it's enough to split the line and take the parts you need:
with open(sys.argv[1]) as myfile:
    ...
    if 'TMMBR with bps:' in line:
        parts = line.split()
        # parts[1] is "11:15:09.395"; [:-4] drops the milliseconds
        time.append(parts[1][:-4])
        bandwidth.append(parts[-1])

Skip python from reading line if it has a certain character

This is the input file data in foo.txt
Wildthing83:A1106P
http://Wink3319:Camelot1#members/
f.signat#cnb.fr:arondep60
And I want to output the data in the following format:
f.signat#cnb.fr:arondep60
Wildthing83:A1106P
fr:arondep60
Here is the code
import re
f = open('foo.txt','r')
matches = re.findall(r'(\w+:\w+)#',f.read())
for match in matches:
    print(match)
f.seek(0)
matches = re.findall(r'([\w.]+#[\w.]+:\w+)',f.read())
for match in matches:
    print(match)
f.seek(0)
matches = re.findall(r'(\w+:\w+)\n',f.read())
for match in matches:
    print(match)
Here is my output.
Wink3319:Camelot1
f.signat#cnb.fr:arondep60
Wildthing83:A1106P
fr:arondep60
As you can tell, it's outputting fr:arondep60 and I don't want it to. Is there a way to stop Python from reading a line that contains a # symbol? That way Python would never even look at it.
Pretty ugly solution, but it should work: read the file line by line and check for "#" before deciding which pattern to apply.
for line in f:
    if "#" in line:
        # Lines with '#': try the "user:pass#" form first, then the "name#domain:pass" form
        matches = re.findall(r'(\w+:\w+)#', line) or re.findall(r'([\w.]+#[\w.]+:\w+)', line)
    else:
        # Only lines without '#' are tested against the plain "user:pass" pattern
        matches = re.findall(r'(\w+:\w+)$', line.rstrip())
    for match in matches:
        print(match)
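If the goal is simply to ignore every line that contains "#", a sketch along these lines may be enough. The function name credentials_without_hash is hypothetical, and the list stands in for the lines of foo.txt.

```python
import re

def credentials_without_hash(lines):
    """Return user:password pairs, looking only at lines that contain no '#'."""
    found = []
    for line in lines:
        if "#" in line:
            # Skip the line entirely, so no pattern can match a fragment of it
            continue
        found.extend(re.findall(r"\w+:\w+", line))
    return found

data = [
    "Wildthing83:A1106P\n",
    "http://Wink3319:Camelot1#members/\n",
    "f.signat#cnb.fr:arondep60\n",
]
result = credentials_without_hash(data)
print(result)
```

Since the check happens per line, fr:arondep60 can no longer be picked out of the middle of the third line.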

Python regex to get n characters before and after a keyword in a line of text

I'm trying to parse through a file and search for keywords from a list of strings. I need to return the 'n' characters before and after each occurrence. I have it working without regex, but it's not very efficient. Any idea how to do the same with regex and findall? lookup is a list of strings. This is what I have without regex:
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            if string in line:
                # Split the line in 2 substrings
                tmp1 = line.split(string)[0]
                tmp2 = line.split(string)[1]
                # Truncate only 'n' characters before and after the keyword
                tmp = tmp1[-n:] + string + tmp2[:n]
                # Do something here...
This is the start with regex:
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Regex search with Ignorecase
            searchObj = re.findall(string, line, re.M | re.I)
            if searchObj:
                print("search --> : ", searchObj)
                # Loop through searchObj and get n characters
From https://docs.python.org/2/library/re.html
start([group])
end([group])
Return the indices of the start and end of the substring matched by
group; group defaults to zero (meaning the whole matched substring).
Return -1 if group exists but did not contribute to the match. For a
match object m, and a group g that did contribute to the match, the
substring matched by group g (equivalent to m.group(g)) is
m.string[m.start(g):m.end(g)]
Note that m.start(group) will equal m.end(group) if group matched a
null string. For example, after m = re.search('b(c?)', 'cba'),
m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both
2, and m.start(2) raises an IndexError exception.
Using re.finditer you can generate an iterator of MatchObject and then use these attributes to get the start and end of your substrings.
I got it to work. Below is the code if anyone needs it:
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Case-insensitive search; finditer yields one match object per occurrence
            for match in re.finditer(string, line, re.M | re.I):
                # Start and end index of the keyword
                start, end = match.span()
                # Keep only 'n' characters before and after the keyword;
                # max() stops a negative start index from wrapping around
                tmp = line[max(0, start - n):end + n] + '\n'
                print(tmp)
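The same idea can be wrapped in a small function. This is a sketch, not the only way to do it: keyword_contexts is a hypothetical name, and I am assuming the lookup strings are literal keywords, which is why they go through re.escape.

```python
import re

def keyword_contexts(line, keyword, n):
    """Return each occurrence of keyword with up to n characters on either side."""
    # re.escape treats the keyword as literal text, even if it contains '.' or '+'
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    contexts = []
    for match in pattern.finditer(line):
        start = max(0, match.start() - n)  # a negative start would wrap around
        end = match.end() + n              # slicing past the end of a string is safe
        contexts.append(line[start:end])
    return contexts

result = keyword_contexts("ab KEY cd key", "key", 2)
print(result)
```

Clamping the start index matters when the keyword sits closer than n characters to the beginning of the line.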

trying to print a group from a regex match in python

I am trying to print the group info from my regex match.
My script matches my regex versus line in my file, so that's working.
I have based this on the python regex tutorial btw ...
I'm a python newbie (with some perl experience) :)
import re
file = open('read.txt', 'r')
p = re.compile("""
.*,\\\"
(.*) # use grouping here with brackets so we can fetch value with group later on
\\\"
""", re.VERBOSE)
i = 0
for line in file:
    if p.match(line):
        print(p.group()) #this is the problematic group line
    i += 1
p.match() returns a match object (or None) and you need to assign it to something, because the group() method lives on the match object, not on the compiled pattern. Try
for line in file:
    m = p.match(line)
    if m:
        print(m.group())  # whole match; use m.group(1) for the captured value
    i += 1
You are not using the match object returned by match. Try this:
for line in file:
    matched = p.match(line)
    if matched:
        print(matched.group()) # this should now work
    i += 1
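To see what group() and group(1) return for a pattern like this, here is a minimal sketch. The raw-string rewrite of the pattern and the sample line are my assumptions about the intent (capture the text between the double quotes that follow a comma).

```python
import re

# Raw-string version of a similar verbose pattern; whitespace and '#' comments
# inside the pattern are ignored because of re.VERBOSE.
p = re.compile(r"""
    .*,"      # anything up to a comma followed by a double quote
    (.*)      # group 1: the quoted value
    "         # closing double quote
""", re.VERBOSE)

m = p.match('id,name,"hello world"')  # hypothetical input line
if m:
    whole = m.group()    # the entire matched text
    value = m.group(1)   # only what the parentheses captured
    print(whole)
    print(value)
```

The match object m carries the groups; the compiled pattern p has no group() method, which is why the original line failed.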

regex python Fasta

Thank you for your previous advice.
I have another regex problem:
I now have a list with this pattern:
*7 3 279 0
*33 2 254 0.0233918128654971
*39 2 276 0.027431421446384
and a file with DNA sequencing in Fasta format:
EDIT: reformatted lines
>OCTU1
GCTTGTCTCAAAGATTAAGCCATGCATGTATAAGCACAAGCCTAAAATGGTGAAGCCGCGAATAGCTCATTACAACAGTCGTAGTTTATTGGAAAGTTCACTATGGATAACTGTGGTAATTCTAGAGCTAATACATGTTCCAATCCTCGACTCACGGAGAGGTGCATTTATTAGAACAAAGCTGATCAGACTATGTCTGTCTCAGGTTGACTCTGAATAACTTTGCTAATCGCACAGTCTTTGTACTGGCGATGTATCTTTCATGCTATGTA
>OCTU2
GCTGCTTCCTTGGATGTGGTAGCCGTTTCTCAGGCTCCCTCTCCGGAATCGAACCCTATTCCCCGTTACCCGTTCAACCATGGTAGGCCCTACTACCATCAAAGTTGATAGGGCAGATATTTGAAAGACATCGCCGCACAAAGGCTATGCGATTAGCAAAGTTATTAGATCAACGACGCAGCGATCGGCTTTGACTAATAAATCACCCCTCCAGTTGGGGACTTTTACATGTATTAGCTCTAGAATTACCACAGTTATCCATTAGTGAAGTACCTTCCAATAAACTATACTGTTTAATGAGCCATTCGCGGTTTCACCGTAAAATTAGGTTGTCTTAGACATGCATGGCTTAATCTTTGTAGACAAGC
I need to find the numbers marked with * in the list (e.g., 7 or 33) among the headers of the Fasta file (e.g., >OCTU7 and >OCTU33) and copy to another file only the Fasta sequences that appear in the list. This is my script:
regex=re.compile(r'.+\d+\s+')
OCTU=b.readlines()
while OCTU:
    for line in a:
        if regex.match(OCTU)==line:
            c.write(OCTU)
The script seems to run, but I think the pattern is not correct because the file created is empty.
Thank you in advance for your advice.
You could first collect the id numbers from file a to a set for fast lookup later:
seta = set()
regexa = re.compile(r'\*(\d+)') #matches asterisk followed by digits, captures digits
for line in a:
    m = regexa.match(line) #looks for match at start of line
    if m:
        seta.add(m.group(1))
Then loop over b. Use next(b) inside the loop to get the following line, which holds the sequence (this assumes each sequence occupies exactly one line).
regexb = re.compile(r'>OCTU(\d+)') #matches ">OCTU" followed by digits, captures digits
for line in b:
    m = regexb.match(line)
    if m:
        sequence = next(b)
        if m.group(1) in seta:
            c.write(line)
            c.write(sequence)
You may want to use Biopython to parse the fasta file.
Then you can slice out the number, look it up in your list, and access the sequence and sequence name more reliably. If a fasta file has line wrapping, the above method may run into problems.
import collections
from Bio import SeqIO
infile = "yourfastafile.fasta"
outfile = "desired_outfilename.fasta"
dct = collections.OrderedDict()
for record in SeqIO.parse(infile, "fasta"):
    # description is an attribute of the record, not a method
    dct[record.description] = str(record.seq).upper()
with open(outfile, "a") as handle:
    for k, v in dct.items():
        # k looks like "OCTU7"; seta (from the answer above) holds the ids as strings
        if k[4:] in seta:
            handle.write(">" + k + "\n" + v + "\n")
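If installing Biopython is not an option, wrapped sequence lines can also be handled in plain Python. This is a rough sketch under the assumption that every header looks like >OCTU followed by digits; filter_fasta and the tiny sample records are made up for illustration.

```python
import re

def filter_fasta(fasta_lines, wanted_ids):
    """Yield (header, sequence) for records whose >OCTU<number> id is in wanted_ids."""
    header_re = re.compile(r">OCTU(\d+)")
    header, chunks, keep = None, [], False
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if keep:
                # A new header closes the previous record
                yield header, "".join(chunks)
            m = header_re.match(line)
            header, chunks = line, []
            keep = bool(m) and m.group(1) in wanted_ids
        elif line:
            chunks.append(line)  # wrapped sequence lines are concatenated
    if keep:
        yield header, "".join(chunks)

fasta = [">OCTU1", "GCTT", "GTCT", ">OCTU7", "ACGT", "TTAA", ">OCTU33", "GGCC"]
selected = list(filter_fasta(fasta, {"7", "33"}))
print(selected)
```

wanted_ids plays the role of seta from the answer above: a set of id strings collected from the list file.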
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r">.+\n[acgtnACGTN\n]+"
test_str = (">AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368\n"
            "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC\n"
            "CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC\n"
            "CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG\n"
            "AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC\n"
            "CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG\n"
            "TTTAATTACAGACCTGAA")
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Groups are numbered from 1; this pattern has none, so the loop body is skipped
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
