This is the input file data in foo.txt
Wildthing83:A1106P
http://Wink3319:Camelot1#members/
f.signat#cnb.fr:arondep60
I want to output the data in the following format:
f.signat#cnb.fr:arondep60
Wildthing83:A1106P
fr:arondep60
Here is the code
import re
f = open('foo.txt', 'r')
matches = re.findall(r'(\w+:\w+)#', f.read())
for match in matches:
    print(match)
f.seek(0)
matches = re.findall(r'([\w.]+#[\w.]+:\w+)', f.read())
for match in matches:
    print(match)
f.seek(0)
matches = re.findall(r'(\w+:\w+)\n', f.read())
for match in matches:
    print(match)
Here is my output.
Wink3319:Camelot1
f.signat#cnb.fr:arondep60
Wildthing83:A1106P
fr:arondep60
As you can see, it outputs fr:arondep60, which I don't want. Is there a way to stop Python from reading any line that contains a # symbol, so that the regex never even looks at it?
Pretty ugly solution, but it should work.
import re
with open('foo.txt') as f:
    for line in f:
        if "#" in line:
            # Lines containing '#' are only checked against the two '#' patterns.
            matches = re.findall(r'(\w+:\w+)#', line)
            matches += re.findall(r'([\w.]+#[\w.]+:\w+)', line)
        else:
            # The plain pattern never sees a line that contains '#'.
            matches = re.findall(r'(\w+:\w+)\n', line)
        for match in matches:
            print(match)
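If the file is still read in one go, another way to keep the plain pattern away from lines containing # is to anchor it to whole lines with re.MULTILINE. This is a different technique from the per-line check above, shown as a minimal sketch: the text variable holds the sample data from the question instead of reading foo.txt, and whole_line_pairs is a name chosen for illustration.

```python
import re

# Sample data copied from the question (normally the contents of foo.txt).
text = (
    "Wildthing83:A1106P\n"
    "http://Wink3319:Camelot1#members/\n"
    "f.signat#cnb.fr:arondep60\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line, so a
# line only matches when it consists of nothing but word:word. Any line that
# contains '#', '.' or '/' is rejected as a whole.
whole_line_pairs = re.findall(r"^(\w+:\w+)$", text, re.MULTILINE)
print(whole_line_pairs)
```

Because the anchors force the pattern to cover the entire line, fr:arondep60 can no longer be picked out of the middle of a longer line.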
Related
I wrote the code below to extract two values from a specific line in a text file. My text file has multiple lines of information, and I am trying to find the line below:
2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856
I am extracting the time (11:15:09) and the bandwidth (1751856) from the line above:
import re
import matplotlib.pyplot as plt
import sys
time = []
bandwidth = []
myfile = open(sys.argv[1])
for line in myfile:
    line = line.rstrip()
    if re.findall('TMMBR with bps:', line):
        time.append(line[12:19])
        bandwidth.append(line[-7:])
plt.plot(time, bandwidth)
plt.xlabel('time')
plt.ylabel('bandwidth')
plt.title('TMMBR against time')
plt.legend()
plt.show()
The problem is that I am using absolute index values (line[12:19]) to extract the data, which breaks if the line has extra characters or extra spaces. What regular expression can I write to extract the values? I am new to regular expressions.
Try this:
(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+
(?:\d+:\d+:|(?<=TMMBR with bps: )) non-capturing group.
\d+: one or more digits followed by a colon :.
\d+: one or more digits followed by a colon :.
| OR
(?<=TMMBR with bps: ) a position that is preceded by the text TMMBR with bps: .
\d+ one or more digits.
See regex demo
import re
txt1 = '2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856'
res = re.findall(r'(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+', txt1)
print(res[0]) #Output: 11:15:09
print(res[1]) #Output: 1751856
You can use a slightly more specific pattern with 2 capture groups:
(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b
Explanation
(\d\d:\d\d:\d\d) Capture group 1, match a time like format
\.\d{3}\b Match a dot and 3 digits
.* Match the rest of the line
\bTMMBR with bps:\s* A word boundary, match TMMBR with bps: and optional whitespace chars
(\d+) Capture group 2, match 1 or more digits
\b A word boundary
See a regex101 demo and a Python demo.
Example
import re
s = r"2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856"
pattern = r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b"
m = re.search(pattern, s)
if m:
    print(m.groups())
Output
('11:15:09', '1751856')
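To connect this back to the plotting code in the question, the same pattern can be applied line by line. The following is only a sketch: extract_tmmbr is a hypothetical helper, the second and third log lines are made up to show a non-matching line and a line with extra spaces, and converting the bandwidth to int is an assumption about what the plot needs (matplotlib treats strings as categories, not numbers).

```python
import re

# Pattern taken from the answer above.
PATTERN = re.compile(r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b")

# The first line is from the question; the other two are invented examples.
LOG_LINES = [
    "2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856",
    "2022-05-03 11:15:10.001 [6489266] | (other.cc:12): unrelated message",
    "2022-05-03  11:15:11.120 [6489266] |  (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps:   98000",
]


def extract_tmmbr(lines):
    """Collect (time, bandwidth) values from lines that match PATTERN."""
    times, bandwidths = [], []
    for line in lines:
        m = PATTERN.search(line)
        if m:  # lines without a TMMBR entry are skipped
            times.append(m.group(1))
            bandwidths.append(int(m.group(2)))
    return times, bandwidths


times, bandwidths = extract_tmmbr(LOG_LINES)
print(times)
print(bandwidths)
```

The two lists can then be passed to plt.plot(times, bandwidths) in place of the slices based on fixed indices.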
You can just use split:
BPS_SEPARATOR = "TMMBR with bps: "
for line in strings:
    line = line.rstrip()
    if BPS_SEPARATOR in line:
        time.append(line.split(" ")[1])
        bandwidth.append(line.split(BPS_SEPARATOR)[1])
Use a context manager to handle the file.
Don't use re.findall just to check whether a pattern occurs in a string; it's inefficient. Use re.search instead when you really need a regex.
In your case it's enough to split the line and take the parts you need:
with open(sys.argv[1]) as myfile:
    ...
    if 'TMMBR with bps:' in line:
        parts = line.split()
        time.append(parts[1][:-4])
        bandwidth.append(parts[-1])
My file contains token words that start and end with the symbol #. There can also be two such pairs on a single line.
For example:
line1 ncghtdhj #token1# jjhhja #token2# hfyuj.
line2 hjfuijgt #token3# ghju
line3 hdhjii#jk8ok#token4#hj
How do I get a list of the tokens, like
[token1,token2,token3,jk8ok,token4]
using Python's re module?
I tried:
mlist = re.findall(r'#.+#', content)
but it does not work as expected.
If jk8ok can also be a match and the token should contain no spaces, you can use a negated character class inside a capturing group, followed by a positive lookahead that asserts there is a # to the right:
#([^\s#]+)(?=#)
Regex demo | Python demo
For example
import re
regex = r"#([^\s#]+)(?=#)"
test_str = ("line1 ncghtdhj #token1# jjhhja #token2# hfyuj. \n"
"line2 hjfuijgt #token3# ghju \n"
"line3 hdhjii#jk8ok#token4#hj")
print(re.findall(regex, test_str))
Result
['token1', 'token2', 'token3', 'jk8ok', 'token4']
If the tokens should be on the same line and spaces are allowed, you might use
#([^\r\n#]+)(?=#)
If you only want to match token followed by a digit:
#(token\d+)(?=#)
Regex demo
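The three patterns above behave differently on the same input, which is easiest to see side by side. This is a small sketch, not part of the original answer: the sample string is adapted from the question, with #my token# made up to include a space.

```python
import re

# Adapted from the question; "#my token#" is an invented token with a space.
sample = (
    "line1 ncghtdhj #token1# jjhhja #token2# hfyuj.\n"
    "line2 hjfuijgt #my token# ghju\n"
    "line3 hdhjii#jk8ok#token4#hj"
)

# No whitespace allowed inside the token.
no_spaces = re.findall(r"#([^\s#]+)(?=#)", sample)
# Spaces allowed, but the token may not cross a line break.
with_spaces = re.findall(r"#([^\r\n#]+)(?=#)", sample)
# Only the literal word "token" followed by digits.
token_digits = re.findall(r"#(token\d+)(?=#)", sample)

print(no_spaces)
print(with_spaces)
print(token_digits)
```

One thing to watch for: the lookahead does not consume the closing #, so that # can also open the next match. With the variant that allows spaces, the text between two pairs on the same line (here " jjhhja ") is therefore returned as well.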
First, you need to separate the words with # on the beginning and end. And then you can filter out the words between the #.
import re
with open("filename", "r") as fp:
    lines = fp.readlines()
lines_string = " ".join(lines)
# Separating the words with # at the beginning and end.
temp1 = re.findall(r"#([^\s#]+)(?=#)", lines_string)
# Filtering out the words between the #s.
temp2 = list(map(lambda x: re.findall(r"\w+", x), temp1))
# Flattening the list.
tokens = [val for sublist in temp2 for val in sublist]
Output:
['token1', 'token2', 'token3', 'jk8ok', 'token4']
I have used the regex as mentioned by #The fourth bird
I want to find the number matching my pattern on every line of a .txt file.
text fragment
sometext - 0.007442749125388171
sometext - 0.004296183916209439
sometext - 0.0037923667088698393
sometext - 0.003137404884873018
code
import codecs
import re
file = codecs.open(FILEPATH, encoding='utf-8')
for cnt, line in enumerate(file):
    result_text = re.match(r'[a-zżźćńółęąś]*', line).group()
    result_value = re.search(r'[0-9].[0-9]*', line).group()
    print("Line {}: {}".format(cnt, line))
It's strange, because re.search does find results:
<_sre.SRE_Match object; span=(8, 28), match='0.001879612135574806'>
but when I try to assign the result to a variable, I get this error:
File "read.py", line 18, in <module>
result_value = re.search(r'[0-9].[0-9]*', line).group()
AttributeError: 'NoneType' object has no attribute 'group'
When capturing a group in a regular expression, you put parentheses around the part you want to capture and pass the index of that group to the group() method; calling group() with no argument returns the whole match.
For example, for your second match, the code should be modified as follows:
# There is only 1 group here, so we pass index 1
result_value = re.search(r'([0-9].[0-9]*)', line).group(1)
As proposed by other comments in your question, you may also want to check whether matches were found before trying to extract the captured groups:
import re
with open("file.txt") as text_file:
    for i, line in enumerate(text_file):
        text_matches = re.match(r'([a-zżźćńółęąś]*)', line)
        if text_matches is None:
            continue
        text_result = text_matches.group(1)
        value_matches = re.search(r'([0-9].[0-9]*)', line)
        if value_matches is None:
            continue
        value_result = value_matches.group(1)
        print("Line {}: {}".format(text_result, value_result))
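How group() relates to numbered groups, and where the None behind the AttributeError comes from, can be checked in isolation. A minimal sketch: the pattern with two groups and the string "no digits here" are made up for illustration and are not taken from the question's file.

```python
import re

line = "sometext - 0.007442749125388171"

# Two capture groups: the digits before and after the decimal point.
m = re.search(r"([0-9]+)\.([0-9]+)", line)

whole = m.group()          # same as m.group(0): the entire match
integer_part = m.group(1)  # first pair of parentheses
fraction = m.group(2)      # second pair of parentheses

# When nothing matches, re.search returns None, and calling .group() on
# None raises the AttributeError shown in the question.
no_match = re.search(r"[0-9]+\.[0-9]+", "no digits here")

print(whole, integer_part, fraction, no_match)
```

This suggests that the error in the question comes from a line in the file on which the pattern finds nothing (an empty line, for instance), which is why checking for None before calling group() avoids it.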
I'd like to suggest a tighter regex definition:
^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$
Demo
Explanation
re.MULTILINE (multiline mode) causes ^ and $ to match at the beginning and end of each line, not only at the beginning and end of the whole string
^ assert the beginning of the line
([a-zżźćńółęąś]+) capture group to match the "identifier"
\s+-\s+ the separator in-between with a variable number of spaces
(\d+\.\d+) matches the decimal number
$ asserts the end of the line
Sample Code:
import re
regex = r"^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$"
test_str = ("sometext - 0.007442749125388171\n"
            "sometext - 0.004296183916209439\n"
            "sometext - 0.0037923667088698393\n"
            "sometext - 0.003137404884873018")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum}: {group}".format(groupNum=groupNum, group=match.group(groupNum)))
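When reading the file line by line, as in the question, the same pattern can be applied to one line at a time without re.MULTILINE. A possible sketch, assuming the value should end up as a float; parse_line is a hypothetical helper name.

```python
import re

# Same pattern as above, compiled once and applied to a single line.
LINE_RE = re.compile(r"^([a-zżźćńółęąś]+)\s+-\s+(\d+\.\d+)$")


def parse_line(line):
    """Return (identifier, value) for a matching line, or None otherwise."""
    m = LINE_RE.match(line.strip())  # strip() drops the trailing newline
    if m is None:
        return None
    return m.group(1), float(m.group(2))


print(parse_line("sometext - 0.003137404884873018\n"))
print(parse_line("no number here"))
```

Returning None for lines that do not fit leaves it to the caller to skip or report them, instead of failing on .group().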
I'm trying to parse through a file and search for each keyword from a list of strings. I need to return the 'n' characters before and after each occurrence. I have it working without regex, but it's not very efficient. Any idea how to do the same with regex and findall? lookup is a list of strings. This is what I have without regex:
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            if string in line:
                # Split the line in 2 substrings
                tmp1 = line.split(string)[0]
                tmp2 = line.split(string)[1]
                # Truncate only 'n' characters before and after the keyword
                tmp = tmp1[-n:] + string + tmp2[:n]
                # Do something here...
This is the start with regex:
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Regex search with Ignorecase
            searchObj = re.findall(string, line, re.M | re.I)
            if searchObj:
                print("search --> : ", searchObj)
                # Loop through searchObj and get n characters
From https://docs.python.org/2/library/re.html
start([group])
end([group])
Return the indices of the start and end of the substring matched by
group; group defaults to zero (meaning the whole matched substring).
Return -1 if group exists but did not contribute to the match. For a
match object m, and a group g that did contribute to the match, the
substring matched by group g (equivalent to m.group(g)) is
m.string[m.start(g):m.end(g)]
Note that m.start(group) will equal m.end(group) if group matched a
null string. For example, after m = re.search('b(c?)', 'cba'),
m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both
2, and m.start(2) raises an IndexError exception.
Using re.finditer you can generate an iterator of match objects and then call these methods (start() and end()) to get the start and end of your substrings.
I got it to work. Below is the code if anyone needs it:
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Regex
            searchObj = re.finditer(string, line, re.M | re.I)
            for match in searchObj:
                # Find the start index of the keyword
                start = match.span()[0]
                # Find the end index of the keyword
                end = match.span()[1]
                # Truncate line to get only 'n' characters before and after the keyword;
                # max() keeps a negative start from wrapping around to the end of the line
                tmp = line[max(start - n, 0):end + n] + '\n'
                print(tmp)
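The same idea can be wrapped in a small function that is easy to test on a single line. This is a sketch under a few assumptions: keyword_context is a hypothetical name, the keywords are treated as literal text (hence re.escape, which the code above does not use), and the start index is clamped at 0 so a match near the beginning of the line does not produce a negative slice index.

```python
import re


def keyword_context(line, keyword, n):
    """Return each occurrence of keyword with up to n characters on both sides."""
    snippets = []
    # re.escape makes characters such as '.' in the keyword match literally.
    for match in re.finditer(re.escape(keyword), line, re.IGNORECASE):
        start = max(match.start() - n, 0)  # never slice from a negative index
        end = match.end() + n              # slicing past the end is safe
        snippets.append(line[start:end])
    return snippets


print(keyword_context("The quick brown fox", "quick", 3))
print(keyword_context("Quick start", "quick", 3))
```

In the loop over the file, this would be called once per line and per entry in lookup.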
I am trying to print the group info from my regex match.
My script matches my regex against each line in my file, and that part is working.
I have based this on the python regex tutorial btw ...
I'm a python newbie (with some perl experience) :)
import re
file = open('read.txt', 'r')
p = re.compile("""
.*,\\\"
(.*) # use grouping here with brackets so we can fetch value with group later on
\\\"
""", re.VERBOSE)
i = 0
for line in file:
    if p.match(line):
        print(p.group())  # this is the problematic group line
    i += 1
re.match() returns a match object - you need to assign it to something. Try
for line in file:
    m = p.match(line)
    if m:
        print(m.group())
    i += 1
You are not using the regex object returned by match. Try this:
for line in file:
    matched = p.match(line)
    if matched:
        print(matched.group())  # this should now work
    i += 1
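The difference between the compiled pattern and the match object can be seen on a single string. A minimal sketch: the input line is made up, and the pattern is a raw-string rewrite of the one in the question, so treat it as an approximation of the original.

```python
import re

pattern = re.compile(r'''
    .*,"      # everything up to a comma followed by a double quote
    (.*)      # group 1: the text between the quotes
    "         # closing double quote
''', re.VERBOSE)

line = 'id42,"hello world"\n'  # hypothetical input line

m = pattern.match(line)        # the match object is what carries the groups
whole = m.group()              # the entire matched text
inner = m.group(1)             # only the captured group

print(whole)
print(inner)
print(hasattr(pattern, "group"))  # the compiled pattern has no group() method
```

Since the goal in the question was to fetch the value inside the parentheses, m.group(1) is probably the call that is wanted; m.group() returns everything the pattern matched.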