I am trying to print the group info from my regex match.
My script matches my regex against each line in my file, so that part is working.
I have based this on the python regex tutorial btw ...
I'm a python newbie (with some perl experience) :)
import re
file = open('read.txt', 'r')
p = re.compile("""
.*,\\\"
(.*) # use grouping here with brackets so we can fetch value with group later on
\\\"
""", re.VERBOSE)
i = 0
for line in file:
    if p.match(line):
        print(p.group())  # this is the problematic group line
        i += 1
p.match() returns a match object, which you need to assign to a variable. Try:
for line in file:
    m = p.match(line)
    if m:
        print(m.group())
        i += 1
You are not using the match object returned by match(). Try this:
for line in file:
    matched = p.match(line)
    if matched:
        print(matched.group())  # this should now work
        i += 1
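For reference, here is a minimal sketch of the difference between group() and group(1) on the match object. The sample line is made up for illustration, and the pattern is the one from the question rewritten as a raw string (my rewrite, not the asker's exact text):

```python
import re

# Same pattern as in the question, written as a raw string.
# In re.VERBOSE mode, whitespace and "#" comments inside the pattern are ignored.
p = re.compile(r"""
    .*,"      # any prefix, then a comma and an opening quote
    (.*)      # group 1: the text between the quotes
    "
""", re.VERBOSE)

sample_line = 'id,name,"hello world"\n'  # hypothetical input line

m = p.match(sample_line)  # keep the match object; the compiled pattern has no group()
if m:
    whole = m.group()    # the entire matched text
    inner = m.group(1)   # only the first parenthesised group
    print(whole)
    print(inner)
```

group() with no argument is the same as group(0); the numbered groups start at 1.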
I wrote the code below to extract two values from a specific line in a text file. My text file has multiple lines of information, and I am trying to find the line below:
2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856
I am extracting the time (11:15:09) and bandwidth (1751856) from the line above.
import re
import matplotlib.pyplot as plt
import sys
time =[]
bandwidth = []
myfile = open(sys.argv[1])
for line in myfile:
    line = line.rstrip()
    if re.findall('TMMBR with bps:',line):
        time.append(line[12:19])
        bandwidth.append(line[-7:])
plt.plot(time,bandwidth)
plt.xlabel('time')
plt.ylabel('bandwidth')
plt.title('TMMBR against time')
plt.legend()
plt.show()
The problem is that I am using absolute index values (line[12:19]) to extract the data, which doesn't work if the line has extra characters or extra spaces. What regular expression can I write to extract the values? I am new to regular expressions.
Try this:
(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+
(?:\d+:\d+:|(?<=TMMBR with bps: )) non-capturing group.
\d+: one or more digits followed by a colon :.
\d+: one or more digits followed by a colon :.
| OR
(?<=TMMBR with bps: ) a position where it is preceded by the sentence TMMBR with bps: .
\d+ one or more digits.
See regex demo
import re
txt1 = '2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856'
res = re.findall(r'(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+', txt1)
print(res[0]) #Output: 11:15:09
print(res[1]) #Output: 1751856
You can use a more specific pattern with 2 capture groups:
(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b
Explanation
(\d\d:\d\d:\d\d) Capture group 1, match a time like format
\.\d{3}\b Match a dot and 3 digits
.* Match the rest of the line
\bTMMBR with bps:\s* A word boundary, match TMMBR with bps: and optional whitespace chars
(\d+) Capture group 2, match 1 or more digits
\b A word boundary
See a regex101 demo and a Python demo.
Example
import re
s = r"2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856"
pattern = r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b"
m = re.search(pattern, s)
if m:
    print(m.groups())
Output
('11:15:09', '1751856')
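To tie this back to the loop in the question, here is a rough sketch that applies the same pattern line by line. The function name parse_tmmbr and the sample lines (other than the one from the question) are hypothetical, and converting the bandwidth to int is my own choice so that a plot would get a numeric axis:

```python
import re

# Pattern from the answer above: group 1 is the time, group 2 the bandwidth.
TMMBR_RE = re.compile(r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b")

def parse_tmmbr(lines):
    """Return (times, bandwidths) for every line that reports a TMMBR value."""
    times, bandwidths = [], []
    for line in lines:
        m = TMMBR_RE.search(line)
        if m:  # lines without a TMMBR entry are skipped
            times.append(m.group(1))
            bandwidths.append(int(m.group(2)))
    return times, bandwidths

# Hypothetical log lines; the first is the one from the question.
sample = [
    "2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856",
    "2022-05-03 11:15:09.400 [6489266] | some other log line",
    "2022-05-03   11:15:10.101 [6489266]   | BwMgr Received a TMMBR with bps:    900000",
]

times, bandwidths = parse_tmmbr(sample)
print(times)
print(bandwidths)
```

Because the pattern does not depend on column positions, the third line with extra spaces is still picked up.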
You can just use split:
BPS_SEPARATOR = "TMMBR with bps: "
for line in strings:
    line = line.rstrip()
    if BPS_SEPARATOR in line:
        time.append(line.split(" ")[1])
        bandwidth.append(line.split(BPS_SEPARATOR)[1])
Use a context manager for handling the file.
Don't use re.findall just to check whether a pattern occurs in a string; it's not efficient. Use re.search instead when you need a regex.
In your case it's enough to split the line and take the parts you need:
with open(sys.argv[1]) as myfile:
    ...
        if 'TMMBR with bps:' in line:
            parts = line.split()
            time.append(parts[1][:-4])
            bandwidth.append(parts[-1])
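As a small self-contained sketch of the split approach: the helper name split_tmmbr is hypothetical, and it assumes the timestamp is always the second whitespace-separated field and the bandwidth is always the last one.

```python
def split_tmmbr(line):
    """Return (time, bandwidth) from a TMMBR log line, or None for other lines."""
    if 'TMMBR with bps:' not in line:
        return None
    parts = line.split()        # splits on any run of whitespace
    time_part = parts[1][:-4]   # drop the 4-character millisecond suffix, e.g. ".395"
    return time_part, parts[-1]

sample = "2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856"
result = split_tmmbr(sample)
print(result)
```

This tolerates extra spaces between fields, but unlike the regex versions it would break if a field were inserted before the timestamp.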
I'm trying to read a file in Python with the following regex, pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+'), so that it returns matches for both alternatives, but it only returns the first one.
Here is my code:
import regex
filename = "file.log"
pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+')
matchvalues = []
new_output = []
comm = 'comm="nom-http"'
i = 0
with open(filename, 'r') as audit:
    lines = audit.readlines()
    for line in lines:
        match = regex.search(pattern, line)
        if match:
            new_line = match.group()
            print(new_line)
            matchvalues.append(new_line)
matchvalues_size = len(matchvalues)
print(matchvalues)
Can you guys help me please?
Normally \K is not used within lookbehinds since its meaning is to make the match succeed at the current position and one usually does not want lookbehinds to be part of the current match. So I don't know why you are using variable-length lookbehinds, which require the regex package, to begin with. That said, I did not have a problem matching 'comm="nom-http"' with your regex:
>>> pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+')
... comm = 'comm="nom-http"'
... regex.search(pattern, comm)
<regex.Match object; span=(0, 15), match='comm="nom-http"'>
Note that comm="nom is part of the match due to the presence of \K in the regex.
But simpler would be to use:
pattern = r'((?:auid=|comm="nom)\S+)'
So, what is the problem you are having? When you say, "It is only returning the first one", are you then referring to not the pattern but the first occurrence on the line because there may be multiple occurrences? If so, then instead of doing regex.search, do regex.findall, which will return a list of string matches.
import re
pattern = r'((?:auid=|comm="nom)\S+)'
matches = re.findall(pattern, line)
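A quick sketch of what findall returns with the simplified pattern. The audit-style line below is invented for illustration; only the auid= and comm="nom fragments come from the question:

```python
import re

pattern = r'((?:auid=|comm="nom)\S+)'

# Hypothetical log line containing both fields.
line = 'type=SYSCALL msg=audit(1650000000.123:456): auid=1000 uid=0 comm="nom-http" exe="/usr/bin/nom"'

# search() stops at the first occurrence on the line ...
first = re.search(pattern, line).group()

# ... while findall() returns every non-overlapping occurrence.
matches = re.findall(pattern, line)

print(first)
print(matches)
```

With exactly one capturing group in the pattern, findall returns a list of the strings captured by that group.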
I want to find the number matching my pattern in every line of a .txt file.
text fragment
sometext - 0.007442749125388171
sometext - 0.004296183916209439
sometext - 0.0037923667088698393
sometext - 0.003137404884873018
code
file = codecs.open(FILEPATH, encoding='utf-8')
for cnt, line in enumerate(file):
    result_text = re.match(r'[a-zżźćńółęąś]*', line).group()
    result_value = re.search(r'[0-9].[0-9]*', line).group()
    print("Line {}: {}".format(cnt, line))
It's strange because re.search finds results:
<_sre.SRE_Match object; span=(8, 28), match='0.001879612135574806'>
but if I want to assign result to variable I get this:
error
File "read.py", line 18, in <module>
result_value = re.search(r'[0-9].[0-9]*', line).group()
AttributeError: 'NoneType' object has no attribute 'group'
To capture a group in a regular expression, put parentheses around the part you want to capture and pass the index of that group to the group() method. (Calling group() with no argument returns the whole match, so that call by itself is not what raises the error; the AttributeError means re.search returned None for a line in which the pattern found nothing.)
For example, for your second match, the code could be written as follows:
# There is only 1 group here, so we pass index 1
result_value = re.search(r'([0-9]\.[0-9]*)', line).group(1)
As proposed by other comments in your question, you may also want to check whether matches were found before trying to extract the captured groups:
import re
with open("file.txt") as text_file:
    for i, line in enumerate(text_file):
        text_matches = re.match(r'([a-zżźćńółęąś]*)', line)
        if text_matches is None:
            continue
        text_result = text_matches.group(1)
        # the dot is escaped so it matches only a literal decimal point
        value_matches = re.search(r'([0-9]\.[0-9]*)', line)
        if value_matches is None:
            continue
        value_result = value_matches.group(1)
        print("Line {}: {}".format(text_result, value_result))
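To make the failure mode concrete, here is a small sketch of how a line without a number leads to None. The helper name extract_value and the sample lines after the first are hypothetical; the guess that the real file contains a blank or number-free line is an assumption, since the full file is not shown:

```python
import re

# Escaped dot, and at least one digit on each side of it.
VALUE_RE = re.compile(r'[0-9]+\.[0-9]+')

def extract_value(line):
    """Return the first decimal number in the line, or None if there is none."""
    m = VALUE_RE.search(line)
    return m.group() if m else None

lines = [
    "sometext - 0.007442749125388171\n",
    "\n",                  # e.g. an empty line at the end of the file
    "no number here\n",
]

values = [extract_value(line) for line in lines]
print(values)
```

Calling .group() directly on the result of re.search for the second or third line would raise the AttributeError shown in the question.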
I'd like to suggest a tighter regex definition:
^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$
Demo
Explanation
re.MULTILINE mode: causes ^ and $ to match at the beginning/end of each line (not only at the beginning/end of the string)
^ assert the beginning of the line
([a-zżźćńółęąś]+) capture group to match the "identifier"
\s+-\s+ the separator in-between with a variable number of spaces
(\d+\.\d+) matches the decimal number
$ asserts the end of the line
Sample Code:
import re
regex = r"^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$"
test_str = ("sometext - 0.007442749125388171\n"
"sometext - 0.004296183916209439\n"
"sometext - 0.0037923667088698393\n"
"sometext - 0.003137404884873018")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum}: {group}".format(groupNum = groupNum, group = match.group(groupNum)))
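If the file is read line by line rather than as one string, the same pattern can be applied per line without re.MULTILINE. A possible sketch, where parse_lines is a hypothetical helper and the conversion to float is my addition:

```python
import re

# Same pattern as above; each line is stripped first, so ^ and $ apply to one line.
LINE_RE = re.compile(r"^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$")

def parse_lines(lines):
    """Return a list of (identifier, value) pairs for the lines that match."""
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:  # blank or malformed lines are skipped instead of raising
            results.append((m.group(1), float(m.group(2))))
    return results

parsed = parse_lines([
    "sometext - 0.007442749125388171\n",
    "\n",
    "sometext - 0.003137404884873018\n",
])
print(parsed)
```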
This question is similar to one I asked a few days ago, but it is about something completely different.
I have a string that is known, but is a portion of code, and it varies, I currently find it by using:
for num, line in enumerate(code, 1):
    if re.match("function (.*) {", line):
That gets me through a good portion of what I need, as I need to know the line number that it starts at. My problem starts here. What I need is just the part where I am using the (.*) regular expression.
You mean the text between ( and )?
Use capturing groups:
m = re.match("function (.*) {", line)
if m:
    print(m.group(1))
The match object that is returned contains the contents of all groups. I would use re.search if 'function' isn't always at the beginning of a line, and '.+' to match function names with at least one character.
line_to_fn = {}
for num, line in enumerate(code, 1):
    match = re.search("function (.+) {", line)
    if match:
        matches = match.groups()
        assert len(matches) == 1, repr(matches)
        line_to_fn[num] = matches[0]
# line_to_fn: {1: 'something', 5: 'something_else'}
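A runnable sketch of the same idea on made-up input. The code list and the map_functions name are hypothetical. The lazy .+? and the escaped brace are my own variation on the pattern above, chosen so that a second " {" later on the same line is not swallowed into the captured name:

```python
import re

# Lazy quantifier and an escaped brace; a variation on the pattern above.
FUNC_RE = re.compile(r"function (.+?) \{")

def map_functions(code_lines):
    """Map 1-based line numbers to the text captured after 'function '."""
    line_to_fn = {}
    for num, line in enumerate(code_lines, 1):
        m = FUNC_RE.search(line)
        if m:
            line_to_fn[num] = m.group(1)
    return line_to_fn

# Hypothetical input standing in for the asker's code.
code = [
    "function something {",
    "  echo hi",
    "}",
    "",
    "function something_else {",
]

line_to_fn = map_functions(code)
print(line_to_fn)
```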
This is the input file data in foo.txt
Wildthing83:A1106P
http://Wink3319:Camelot1#members/
f.signat#cnb.fr:arondep60
And I want to output the data in the following format:
f.signat#cnb.fr:arondep60
Wildthing83:A1106P
fr:arondep60
Here is the code
import re
f = open('foo.txt','r')
matches = re.findall(r'(\w+:\w+)#',f.read())
for match in matches:
    print(match)
f.seek(0)
matches = re.findall(r'([\w.]+#[\w.]+:\w+)',f.read())
for match in matches:
    print(match)
f.seek(0)
matches = re.findall(r'(\w+:\w+)\n',f.read())
for match in matches:
    print(match)
Here is my output.
Wink3319:Camelot1
f.signat#cnb.fr:arondep60
Wildthing83:A1106P
fr:arondep60
As you can tell, it's outputting fr:arondep60 and I don't want it to. Is there a way to stop Python from applying that last pattern to any line that contains a # symbol, so that such lines are never even looked at?
Pretty ugly solution, but it should work.
line = f.readline()
if not "#" in line:
    matches = re.findall(r'(\w+:\w+)#',line)
    for match in matches:
        print(match)
line = f.readline()
if not "#" in line:
    matches = re.findall(r'([\w.]+#[\w.]+:\w+)',f.read())
    for match in matches:
        print(match)
line = f.readline()
if not "#" in line:
    matches = re.findall(r'(\w+:\w+)\n',f.read())
    for match in matches:
        print(match)
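An alternative sketch that goes through the input one line at a time and picks the patterns depending on whether the line contains "#". This is not the code from the answer above but a different arrangement of the question's three patterns; the function name extract is hypothetical, and the results come out in file order rather than grouped by pattern:

```python
import re

HASH_SUFFIX_RE = re.compile(r'(\w+:\w+)#')           # user:pass followed by '#'
HASH_INSIDE_RE = re.compile(r'([\w.]+#[\w.]+:\w+)')  # name#domain:pass
PLAIN_RE = re.compile(r'^(\w+:\w+)$')                # the whole line is user:pass

def extract(lines):
    """Collect the wanted tokens, applying the plain pattern only to lines without '#'."""
    found = []
    for raw in lines:
        line = raw.strip()
        if '#' in line:
            found.extend(HASH_SUFFIX_RE.findall(line))
            found.extend(HASH_INSIDE_RE.findall(line))
        else:
            found.extend(PLAIN_RE.findall(line))
    return found

# The three lines of foo.txt from the question.
sample = [
    "Wildthing83:A1106P\n",
    "http://Wink3319:Camelot1#members/\n",
    "f.signat#cnb.fr:arondep60\n",
]

result = extract(sample)
for item in result:
    print(item)
```

Because the plain user:pass pattern is never applied to a line containing "#", fr:arondep60 is no longer reported on its own.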