Parsing Regex with "|" (OR) in Python

I'm trying to read a file in Python using the regex pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+') so that it returns matches for both alternatives, but it only returns the first one.
Here is my code:
import regex
filename = "file.log"
pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+')
matchvalues = []
new_output = []
comm = 'comm="nom-http"'
i = 0
with open(filename, 'r') as audit:
    lines = audit.readlines()
    for line in lines:
        match = regex.search(pattern, line)
        if match:
            new_line = match.group()
            print(new_line)
            matchvalues.append(new_line)
    matchvalues_size = len(matchvalues)
print(matchvalues)
Can you guys help me please?

Normally \K is not used within lookbehinds: \K resets the start of the reported match to the current position, and a lookbehind is by definition never part of the match. So I don't know why you are combining the two, which requires the regex package, to begin with. That said, I did not have a problem matching 'comm="nom-http"' with your regex:
>>> import regex
>>> pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+')
>>> comm = 'comm="nom-http"'
>>> regex.search(pattern, comm)
<regex.Match object; span=(0, 15), match='comm="nom-http"'>
Note that comm="nom is part of the match due to the presence of \K in the regex.
But simpler would be to use:
pattern = r'((?:auid=|comm="nom)\S+)'
So, what exactly is the problem you are having? When you say "it only returns the first one", do you mean not the first alternative of the pattern but the first occurrence on the line, because a line may contain multiple occurrences? If so, then instead of regex.search use regex.findall (or re.findall, since the simpler pattern no longer needs the regex package), which returns a list of all matching strings:
import re
pattern = r'((?:auid=|comm="nom)\S+)'
matches = re.findall(pattern, line)
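As a minimal sketch of the difference between search and findall, the snippet below runs the simplified pattern against a single made-up log line. The line content is an assumption for illustration only and is not taken from the asker's file.log.

```python
import re

# Simplified pattern from the answer, without the outer capture group
pattern = re.compile(r'(?:auid=|comm="nom)\S+')

# Hypothetical audit log line (an assumption, not from the original file)
line = 'type=SYSCALL auid=1000 uid=0 comm="nom-http" exe="/usr/bin/nom"'

# search stops at the first occurrence on the line
first = pattern.search(line).group()

# findall returns every non-overlapping occurrence on the line
matches = pattern.findall(line)

print(first)
print(matches)
```

With search only the auid field is reported, while findall returns both fields in the order they appear on the line.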

Related

Python replace between two chars (no split function)

I am currently investigating a problem in which I want to replace something in a string.
For example, I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers, like '123.49, 19.30'. Using split is not an option, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working. Can someone help me?
Thanks in advance
You could do something like this:
reg = r'(\d+\.\d+), (\d+\.\d+).*'
if re.search(reg, your_text):
    match = re.search(reg, your_text)
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, add the ^ anchor at the beginning to make sure that only the first two numbers are taken (the pattern is written as a raw string, with the decimal point escaped):
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile(r'^(\d*\.?\d*), (\d*\.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In your code you use import re as regex. If you do that, you have to call regex.search instead of re.search.
But in this case you can just use import re.
If you use , (.*) you capture everything after the first occurrence of , and you do not take digits into account.
If you want the first 2 numbers as stated in the question, '123.49, 19.30', separated by a comma, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
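Since the question mentions data where the third number is sometimes present and sometimes not, here is a small sketch that applies the two-number pattern to several rows. The helper name first_two_numbers and the sample rows are assumptions made for illustration.

```python
import re

# Two decimal numbers separated by a comma and optional whitespace
FIRST_TWO = re.compile(r'\b\d+\.\d+,\s*\d+\.\d+\b')

def first_two_numbers(row):
    """Return the first two decimal numbers of a row, or None if absent."""
    match = FIRST_TWO.search(row)
    return match.group() if match else None

# Hypothetical rows: with a third number, without one, and without any match
rows = ['123.49, 19.30, 02\n', '7.5, 8.25\n', 'no numbers here\n']
results = [first_two_numbers(row) for row in rows]
print(results)
```

Returning None for rows without a match keeps the check for a failed search in one place instead of repeating it at every call site.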

How to find the pattern using regex?

curP = "https://programmers.co.kr/learn/courses/4673'>#!Muzi#Muzi!)jayg07con&&"
I want to find Muzi in this string with a regex.
For example:
MuziMuzi : count 0, because it is considered one word
Muzi&Muzi : count 2, because the & in between separates the words
7Muzi7Muzi : count 2
I tried to use a regex to find all matches:
curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern = re.compile('[^a-zA-Z]muzi[^a-zA-Z]')
print(pattern.findall(curP))
I expected ['!muzi#','#Muzi!']
but the result is
['!muzi#']
You need to use this as your regex:
pattern = re.compile('[^a-zA-Z]muzi(?=[^a-zA-Z])', flags=re.IGNORECASE)
(?=[^a-zA-Z]) says that muzi must be followed by a character matching [^a-zA-Z], but the lookahead does not consume any characters. So the first match is only !Muzi, leaving the following # available to start the next match.
Your original regex was consuming !Muzi# leaving Muzi!, which would not match the regex.
Your matches will now be:
['!Muzi', '#Muzi']
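For the counting examples in the question, a variant of this idea may be worth sketching. Note that this is not the pattern from the answer above: it swaps in negative lookarounds on both sides, so that the start and end of the string also count as word boundaries. The name count_word is hypothetical.

```python
import re

# Variant pattern (an assumption, not the answer's): "muzi" must not be
# directly preceded or followed by a letter. Neither lookaround consumes
# characters, so a single separator can serve two neighbouring words.
WORD = re.compile(r'(?<![a-zA-Z])muzi(?![a-zA-Z])', flags=re.IGNORECASE)

def count_word(text):
    """Count standalone occurrences of the word in text."""
    return len(WORD.findall(text))

# The three cases listed in the question
counts = {s: count_word(s) for s in ('MuziMuzi', 'Muzi&Muzi', '7Muzi7Muzi')}
print(counts)
```

With the answer's pattern, a word at the very start or very end of the string is not counted, because a non-letter character is required on both sides; the negative lookarounds only require that no letter is adjacent.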
As I understand it, you want to get whatever character appears on each side of your keyword Muzi.
That means that the #, in this case, has to be shared by both output values.
One way to do that with a regex is to modify the string as you find matches.
Here is my solution:
import re
# Define the function to find the pattern
def find_pattern(curP):
    pattern = re.compile('([^a-zA-Z]muzi[^a-zA-Z])', flags=re.IGNORECASE)
    return pattern.findall(curP)[0]

curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern_array = []
# Find the first appearance of the pattern in the string
pattern_array.append(find_pattern(curP))
# Remove the first occurrence of the keyword from the string
curP = curP.replace('Muzi', '', 1)
# Find the second appearance of the pattern in the string
pattern_array.append(find_pattern(curP))
print(pattern_array)
Output:
['!Muzi#', '#Muzi!']

How to use regex to tell if first and last character of a string match?

I'm relatively new to Python and regex, and I want to check whether a string's first and last characters are the same.
If the first and last characters are the same, return 'True' (e.g. 'aba').
If the first and last characters are not the same, return 'False' (e.g. 'ab').
Below is the code I've written:
import re
string = 'aba'
pattern = re.compile(r'^/w./1w$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
But the above code does not produce any output.
If and only if you really want to use regex (for learning purposes):
import re
string = 'aba'
string2 = 'no match'
pattern = re.compile(r'^(.).*\1$')
if re.match(pattern, string):
    print('ok')
else:
    print('nok')
if re.match(pattern, string2):
    print('ok')
else:
    print('nok')
output:
ok
nok
Explanations:
^(.).*\1$
^ start of line anchor
(.) match the first character of the line and store it in a group
.* match any number of characters (zero or more)
\1 backreference to the first group, in this case the first character, which forces the last character to be equal to the first one
$ end of line anchor
Demo: https://regex101.com/r/DaOPEl/1/
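As a small sketch of how this pattern behaves on edge cases, the snippet below wraps it in a helper (the name same_ends is hypothetical). One thing it illustrates: because the group and the backreference each need their own character, a one-character string does not match.

```python
import re

# Same pattern as explained above
SAME_ENDS = re.compile(r'^(.).*\1$')

def same_ends(s):
    """Return True if the regex sees the same first and last character."""
    return SAME_ENDS.match(s) is not None

results = {s: same_ends(s) for s in ('aba', 'ab', 'aa', 'a')}
print(results)
```

Here 'a' yields False, whereas a direct comparison of the first and last characters would yield True for a one-character string, so the two approaches are not fully equivalent.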
Otherwise, the best approach is simply to compare string[0] == string[-1]:
string = 'aba'
if string[0] == string[-1]:
    print('same')
output:
same
Why overengineer this with a regex at all? One principle of programming should be to keep it simple, like:
string[0] == string[-1]
Or is there a need for regex?
The above answer by @Tobias is perfect and simple, but if you want a solution using regex, try the code below.
Code:
import re
string = 'abbaaaa'
pattern = re.compile(r'^(.).*\1$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
Output :
<_sre.SRE_Match object; span=(0, 7), match='abbaaaa'>
I think this is the regex you are trying to execute:
Code:
import re
string = 'aba'
pattern = re.compile(r'^(\w).(\1)$')
matches = pattern.finditer(string)
for match in matches:
    print(match.group(0))
Output:
aba
If you want to check with regex, use the following:
import re
string = 'aba is a cowa'
pat = r'^(.).*\1$'
if re.findall(pat, string):
    print(string)
If the first and last characters of the string are the same, findall returns a list containing that character and the string is printed; otherwise it returns an empty list and nothing is printed.

Parsing timestamps with Python regular expressions ':' character not found

I am teaching myself Python and I am trying to use a regular expression to obtain a timestamp from an application log file (I normally use grep, cut and awk for this).
My log files contain many lines that start with a date followed by a time:
18.12.19 14:03:16 [ ..... # message error
18.12.19 14:03:16 [
:
I normally use a simple grep command, grep "14\:03\:16" mytext,
and the expression "14:03:16" works as well, so after some research I came up with this regex,
where res is one of the lines above:
datap = re.compile(r'(\d{2}):(\d{2}):(\d{2})')
m = datap.match(res)
This does not find anything whereas
datap = re.compile(r'(\d{2}).(\d{2}).(\d{2})')
m = datap.match(res)
captures the date.
Why is the character : not found? I have tried \: as well and it does not work either. Thanks in advance.
re.match tries to match the regex from the beginning of the string.
From the docs:
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding match object.
Return None if the string does not match the pattern; note that this
is different from a zero-length match.
When you did
datap = re.compile(r'(\d{2}).(\d{2}).(\d{2})')
m = datap.match(res)
the regex actually matched the date, not the time (because it is at the beginning of the string):
print(m)
# <re.Match object; span=(0, 8), match='18.12.19'>
If you use re.search then you will get the expected output:
import re
res = '18.12.19 14:03:16 [ ..... # message error'
datap = re.compile(r'(\d{2}):(\d{2}):(\d{2})')
m = datap.search(res)
print(m)
# <re.Match object; span=(9, 17), match='14:03:16'>
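To make the contrast explicit, here is a short sketch that runs both methods on the same line and then unpacks the three capture groups. The variable names hours, minutes and seconds are introduced here for illustration.

```python
import re

res = '18.12.19 14:03:16 [ ..... # message error'
time_re = re.compile(r'(\d{2}):(\d{2}):(\d{2})')

# match only tries at position 0, where the date (not the time) is found
at_start = time_re.match(res)

# search scans forward until the pattern fits somewhere in the string
anywhere = time_re.search(res)

if anywhere:
    hours, minutes, seconds = anywhere.groups()
    print(hours, minutes, seconds)
```

The colon itself needs no escaping in Python regular expressions; the difference in behaviour comes entirely from where the pattern is allowed to start.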

Get particular information from a string

I want to get the value of Name from fstr using a regex in Python. I tried the code below, but did not get the intended result.
Any help will be highly appreciated.
fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever" #",Extra=whatever" this portion is optional
myobj = re.search( r'(.*?),Name(.*?),*(.*)', fstr, re.M|re.I)
print(myobj.group(2))
You may not believe it, but the actual problem is the ,* in your regular expression. It makes matching the , optional. So the second capturing group in your regex matches nothing ((.*?) means match zero or more characters, lazily), and then the next item, ,*, is checked; it means match , zero or more times. It matches zero times, and the last capturing group matches the rest of the string.
If you want to fix your regex, you can simply remove the * after the comma, like this:
myobj = re.search( r'(.*?),Name(.*?),(.*)', fstr, re.I)
print(myobj.group(2))
# =XYZ
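The effect described above can be seen by printing all capture groups of both versions side by side. This is only a sketch for inspection; the variable names original and fixed are introduced here.

```python
import re

fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever"

# Pattern from the question: the comma after the second group is optional
original = re.search(r'(.*?),Name(.*?),*(.*)', fstr, re.I)

# Corrected pattern: the comma after the second group is required
fixed = re.search(r'(.*?),Name(.*?),(.*)', fstr, re.I)

print(original.groups())
print(fixed.groups())
```

In the original version the lazy second group stays empty and the greedy third group takes everything after Name; once the comma is mandatory, the second group has to grow until it reaches one.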
But as the other answer shows, you don't have to create additional capture groups.
BTW, I like to use RegEx only when it is particularly needed. In this case, I would have solved it, without RegEx, like this
fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever"
d = dict(item.split("=") for item in fstr.split(","))
# {'FCode': '1', 'Extra': 'whatever', 'Name': 'XYZ', 'MCode': '1'}
Now that I have all the information, I can access it like this:
print(d["Name"])
# XYZ
Simple, huh? :-)
Edit: If you want to use the same regex for one million records, we can slightly improve the performance by precompiling the regex, like this:
import re
pattern = re.compile(r"Name=([^,]+)", re.I)
match = pattern.search(fstr)
if match:
    print(match.group(1))
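As a sketch of how the precompiled pattern might be reused across many records, including ones where the optional ",Extra=..." part is missing, consider the helper below. The name get_name and the sample records are assumptions for illustration.

```python
import re

# Compiled once, reused for every record
NAME_RE = re.compile(r"Name=([^,]+)", re.I)

def get_name(record):
    """Return the value of the Name field, or None if it is absent."""
    match = NAME_RE.search(record)
    return match.group(1) if match else None

# Hypothetical records: with Extra, without Extra, and without Name
records = [
    "MCode=1,FCode=1,Name=XYZ,Extra=whatever",
    "MCode=1,FCode=1,Name=XYZ",
    "MCode=1,FCode=1",
]
names = [get_name(record) for record in records]
print(names)
```

Because [^,]+ stops at the next comma or at the end of the string, the same pattern covers records with and without the trailing field.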
You can do it as follows:
import re
fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever"
myobj = re.search( r'Name=([^,]+)', fstr, re.M|re.I)
print(myobj.group(1))
# XYZ
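Try this (the trailing comma is left out of the pattern so that it still matches when the optional ",Extra=..." part is absent):
rule = re.compile(r"Name=(?P<Name>\w*)")
res = rule.search(fstr)
res.group("Name")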
