Regular expression to search string from a text file - python

I wrote the code below to extract two values from a specific line in a text file. My text file has multiple lines of information, and I am trying to find the following line:
2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856
I am extracting the time (11:15:09) and the bandwidth (1751856) from the above line:
import re
import matplotlib.pyplot as plt
import sys
time = []
bandwidth = []
myfile = open(sys.argv[1])
for line in myfile:
    line = line.rstrip()
    if re.findall('TMMBR with bps:', line):
        time.append(line[12:19])
        bandwidth.append(line[-7:])
plt.plot(time, bandwidth)
plt.xlabel('time')
plt.ylabel('bandwidth')
plt.title('TMMBR against time')
plt.legend()
plt.show()
The problem here is that I am using absolute index values (line[12:19]) to extract the data, which doesn't work if the line has extra characters or extra spaces. What regular expression can I write to extract the values? I am new to regular expressions.

Try this:
(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+
(?:\d+:\d+:|(?<=TMMBR with bps: )) non-capturing group.
\d+: one or more digits followed by a colon :.
\d+: one or more digits followed by a colon :.
| OR
(?<=TMMBR with bps: ) a position where it is preceded by the sentence TMMBR with bps: .
\d+ one or more digits.
See regex demo
import re
txt1 = '2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856'
res = re.findall(r'(?:\d+:\d+:|(?<=TMMBR with bps: ))\d+', txt1)
print(res[0]) #Output: 11:15:09
print(res[1]) #Output: 1751856

You can use a slightly more specific pattern with 2 capture groups:
(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b
Explanation
(\d\d:\d\d:\d\d) Capture group 1, match a time like format
\.\d{3}\b Match a dot and 3 digits
.* Match the rest of the line
\bTMMBR with bps:\s* A word boundary, match TMMBR with bps: and optional whitespace chars
(\d+) Capture group 2, match 1 or more digits
\b A word boundary
See a regex101 demo and a Python demo.
Example
import re
s = r"2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856"
pattern = r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b"
m = re.search(pattern, s)
if m:
    print(m.groups())
Output
('11:15:09', '1751856')
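As a rough sketch of how this pattern could replace the fixed slices in the question's loop, the helper below collects the time and bandwidth from any iterable of lines. The name parse_tmmbr_lines and the second sample line are made up for illustration, and converting the bandwidth to int is an assumption so that the values plot numerically.

```python
import re

# Same pattern as in the answer above: group 1 is the time, group 2 the bps value.
TMMBR_RE = re.compile(r"(\d\d:\d\d:\d\d)\.\d{3}\b.*\bTMMBR with bps:\s*(\d+)\b")

def parse_tmmbr_lines(lines):
    """Return (times, bandwidths) for every line that contains a TMMBR entry."""
    times, bandwidths = [], []
    for line in lines:
        m = TMMBR_RE.search(line)
        if m:
            times.append(m.group(1))
            bandwidths.append(int(m.group(2)))  # assumption: numeric values are wanted
    return times, bandwidths

sample = [
    "2022-05-03 11:15:09.395 [6489266] | (rtcp_receiver.cc:823): BwMgr Received a TMMBR with bps: 1751856",
    "2022-05-03 11:15:10.001 [6489266] | some unrelated log line",  # hypothetical line
]
print(parse_tmmbr_lines(sample))
```

Because a file object is itself an iterable of lines, the same helper should work as parse_tmmbr_lines(myfile) inside a with open(sys.argv[1]) as myfile: block.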

You can just use split:
BPS_SEPARATOR = "TMMBR with bps: "
for line in strings:
    line = line.rstrip()
    if BPS_SEPARATOR in line:
        time.append(line.split(" ")[1])
        bandwidth.append(line.split(BPS_SEPARATOR)[1])

Use a context manager to handle the file.
Don't use re.findall just to check whether a pattern occurs in a string; it's inefficient. Use re.search instead when you need a regex.
In your case it's enough to split the line and take the parts you need:
with open(sys.argv[1]) as myfile:
    ...
        if 'TMMBR with bps:' in line:
            parts = line.split()
            time.append(parts[1][:-4])
            bandwidth.append(parts[-1])

Related

Python Regex - get words around match

I want to get the words before and after my match. I could use string.split(' ') - but as I already use regex, isn't there a much better way using only regex?
Using a match object, I can get the exact location. However, this location is character indexed.
import re
myString = "this. is 12my90\nExample string"
pattern = re.compile(r"(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
m = pattern.search(myString)
print("Hit: "+m.group())
print("Index range: "+str(m.span()))
print("Words around match: "+myString[m.start()-1:m.end()+1]) # should be +/-1 in _words_, not characters
Output:
Hit: 12my90
Index range: (9, 15)
Words around match: 12my90
For getting the matching word and the word before, I tried:
pattern = re.compile(r"(\b(w+)\b)\s(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
Which yields no matches.
In the second pattern you have to escape the w+ like \w+.
Apart from that, there is a newline in your example string, which you can match with another \s after the match.
Your pattern with 3 capturing groups might look like
(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)
Regex demo
You could use the capturing groups to get the values
print("Words around match: " + m.group(1) + " " + m.group(3))
The newline character is missing from your pattern:
regx = r"(\w+)\s12(\w+)90\n(\w+)"
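Putting the corrected three-group pattern together with the question's sample string, a minimal runnable sketch might look like this. The variable names are only illustrative, and the pattern assumes exactly one whitespace character on each side of the hit with no punctuation touching it.

```python
import re

my_string = "this. is 12my90\nExample string"

# Group 1: the word before, group 2: the hit, group 3: the word after.
# \s also matches the newline between "12my90" and "Example".
pattern = re.compile(r"(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)", re.IGNORECASE)

m = pattern.search(my_string)
if m:
    before, hit, after = m.groups()
    print("Hit:", hit)
    print("Words around match:", before, after)
```

If the neighbouring words can be separated by several spaces or by punctuation, the \s parts would need to be loosened accordingly.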

How can I remove a specific character from multi line string using regex in python

I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is to skip the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
Simply put, if a line starts with ':', I want to ignore that character.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I can't see my mistake. Could anyone please help me by pointing out where I am going wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Demo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
Regex Demo
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
# Import the regex library
import re
# Remove every lowercase letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)
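Since the stated goal is to drop a colon whenever a line starts with one, another option is to anchor on line starts with re.MULTILINE instead of describing the whole string. This is only a sketch: the sample string below is a simplified stand-in for the question's st (plain newlines, two spaces of indentation), not the exact original.

```python
import re

# Simplified stand-in for the question's string (an assumption, not the original).
st = "emp:firstinfo\n  :secondinfo\n  thirdinfo\n"

# With re.MULTILINE, ^ matches at the start of every line.
# Group 1 keeps any leading spaces or tabs; the colon right after them is dropped.
cleaned = re.sub(r"^([ \t]*):", r"\1", st, flags=re.MULTILINE)
print(cleaned)
```

The colon in emp:firstinfo is left alone because it is not the first non-blank character on its line.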

Regex for listing words start and ends with symbol

My file content has token words that start and end with the symbol #. There can also be two pairs on a single line.
E.g.:
line1 ncghtdhj #token1# jjhhja #token2# hfyuj.
line2 hjfuijgt #token3# ghju
line3 hdhjii#jk8ok#token4#hj
How do I get a list of the tokens, like
[token1,token2,token3,jk8ok,token4]
using Python re?
I tried:
mlist = re.findall(r'#.+#', content)
It is not working as expected.
If jk8ok can also be a match and there should be no spaces in the token, you might use a negated character class with a capturing group, plus a positive lookahead to assert that what is on the right is a #:
#([^\s#]+)(?=#)
Regex demo | Python demo
For example
import re
regex = r"#([^\s#]+)(?=#)"
test_str = ("line1 ncghtdhj #token1# jjhhja #token2# hfyuj. \n"
"line2 hjfuijgt #token3# ghju \n"
"line3 hdhjii#jk8ok#token4#hj")
print(re.findall(regex, test_str))
Result
['token1', 'token2', 'token3', 'jk8ok', 'token4']
If the tokens should be on the same line and spaces are allowed, you might use
#([^\r\n#]+)(?=#)
If you only want to match token followed by a digit:
#(token\d+)(?=#)
Regex demo
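To see why the attempt from the question returns too much, it may help to run both patterns on a single sample line. This is a small sketch; the variable names are only for illustration.

```python
import re

line = "line1 ncghtdhj #token1# jjhhja #token2# hfyuj."

# Greedy .+ runs to the last # on the line, so both tokens end up in one match.
greedy = re.findall(r"#.+#", line)
print(greedy)

# The negated class cannot cross a # or whitespace, and the lookahead does not
# consume the closing #, so it can still open the next token.
tokens = re.findall(r"#([^\s#]+)(?=#)", line)
print(tokens)
```

The first call prints one long match spanning both tokens, the second prints the two token names on their own.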
First, you need to pick out the words that have a # at the beginning and end. Then you can filter out the words between the #s.
import re
with open("filename", "r") as fp:
    lines = fp.readlines()
lines_string = " ".join(lines)
# Separating the words with # at the beginning and end.
temp1 = re.findall(r"#([^\s#]+)(?=#)", lines_string)
# Filtering out the words between the #s.
temp2 = list(map(lambda x: re.findall(r"\w+", x), temp1))
# Flattening the list.
tokens = [val for sublist in temp2 for val in sublist]
Output:
['token1', 'token2', 'token3', 'jk8ok', 'token4']
I have used the regex suggested by @The fourth bird.

regex capture info in text file after multiple blank lines

I open a complex text file in python, match everything else I need with regex but am stuck with one search.
I want to capture the numbers after the 'start after here' line. The blank space between the two rows is important, and I plan to split on it later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
text = open(r"file.txt","r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried this here https://regexr.com/ and it seems to work but it is for Java.
It doesn't find anything. I assume this is because I need to incorporate multiline matching, but I am unsure how to read the file in differently or how to incorporate it. I have tried many adjustments to the regex without success.
import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str)
new_str = m[2]
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])
The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then 1+ non-digit characters are matched, which runs up to just before 5.7. Then 1+ characters other than a newline are matched, which matches 5.7,-9.0,6.2. It will not match the following empty line or the next line.
One option could be to match your string and then match all the following lines that start with a decimal in a capturing group.
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
Regex demo | Python demo
If you want to match the decimals (or just one or more digits) before the comma you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
Regex demo
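As a sketch of that last step, the captured block from the example above could be split on runs of newlines and each row converted to floats. The names block, NUMBER_RE and rows are made up for illustration.

```python
import re

# The value of the first capturing group from the example above.
block = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"

# A signed decimal or integer that is followed by a comma or the end of the row.
NUMBER_RE = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)")

rows = [
    [float(n) for n in NUMBER_RE.findall(row)]
    for row in re.split(r"[\r\n]+", block)
    if row  # skip empty pieces, e.g. from a trailing newline
]
print(rows)
```

Each inner list then corresponds to one row of the file, so the blank lines only act as separators.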

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It tries to match just the whole string.
Or put a $ anchor at the end of your regex and call re.match(pattern, string). It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of the regex and call re.search(pattern, string), but it would be a very strange combination.
I also have a remark about how you specified your conditions, which may be incomplete: you gave the string $40 million as an example and stated that the only reason to reject it is the space and letters after $40.
So you should actually have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
One more remark concerning Python literals: apparently you forgot to prefix the pattern with r. If you use an r-string (raw string) literal, you do not have to double the backslashes inside.
So I think the most natural solution is to call a function devoted to matching the whole string (i.e. fullmatch), without adding start / end anchors. The whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And optional space.
)? - End of the non-capturing group, with ? stating that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash escaping is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.]?[\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.]? - An optional single comma or dot.
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, it may occur several times.
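A short sketch of the stricter pattern combined with fullmatch is shown below; the list of candidate strings is made up for illustration. The last two lines illustrate one practical difference between the two choices mentioned earlier: $ also matches just before a trailing newline, while fullmatch does not accept it, so lines read from a file are best stripped first.

```python
import re

# Optional "$" prefix, then digit runs separated by at most one comma or dot.
pat = re.compile(r"(?:\$\s?)?[\d]+(?:[,.]?[\d]+)*")

candidates = ["416", "15,704", "$40 million", "2...5", "3.,44"]
accepted = [s for s in candidates if pat.fullmatch(s)]
print(accepted)

# A trailing newline is tolerated by match() with $, but not by fullmatch().
with_dollar = bool(re.match(r"\d+$", "416\n"))
with_fullmatch = bool(re.fullmatch(r"\d+", "416\n"))
print(with_dollar, with_fullmatch)
```

Calling line.rstrip() before pat.fullmatch(line) avoids rejecting otherwise valid lines because of their line ending.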
With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$') # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches a line made up only of digits, commas and dots, followed by zero or more whitespace characters. If you want to match only numbers with at most one , or ., use the following pattern: '^\d+[,.]?\d+\s*$'. Code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
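The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+), followed by an optional , or . ([,.]?), followed by another group of digits, with optional trailing whitespace (\s*). Note that it requires at least two digits, so a single-digit line such as 7 would not match.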
