Using a single regex to search for 2 criteria in python - python

I have a function:
def extract_pid(log_line):
    regex = PROBLEM
    result = re.search(regex, log_line)
    if result is None:
        return None
    return "{} ({})".format(result[1], result[2])
The function should return the pid number between [ ] and the uppercase text that follows it, for example:
logline = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
From the above string I'd expect the outcome of my function to return 12345 (ERROR)
I have two criteria to meet in this function: \[\d+\] and [A-Z]{2,}. If I test each regex individually, I get the expected outcome.
My question is: how do I specify both regexes in a single pattern and output them as shown in the function above? I cannot find a simple "use this for AND" anywhere in the documentation. I found "| for or", but I need it to match both criteria.
I understand this could be done in two functions and joined together, but I've been tasked with doing this in a single function.

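Use two capturing groups:
regex = r'\[(\d+)\].*?([A-Z]{2,})'
The square brackets sit outside the first group so that only the digits are captured, which gives 12345 rather than [12345]. The .*? means "match any characters in between, as few as possible". If you can, replace that with ": " (colon, space), assuming that's what's always there.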
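For reference, here is a minimal sketch of the complete function with the two-group pattern filled in. It assumes the pid is always followed by a colon and a space, as in the sample log line; if that does not hold, the looser .*? variant can be swapped in.

```python
import re

def extract_pid(log_line):
    # Group 1: digits inside the brackets; group 2: the uppercase word after ": ".
    # The ": " separator is an assumption based on the sample log line.
    regex = r"\[(\d+)\]: ([A-Z]{2,})"
    result = re.search(regex, log_line)
    if result is None:
        return None
    return "{} ({})".format(result[1], result[2])

logline = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
print(extract_pid(logline))  # 12345 (ERROR)
```

Indexing the match object with result[1] requires Python 3.6 or later; result.group(1) works on older versions too.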

Related

Replace capturing group with return value from passing capturing group to a function

I am trying to replace a specific capturing group with the return value from passing said capturing group to a function. The following code is in Python:
def translateWord(word):
    # ... do some stuff
    return word

def translateSentence(sentence):
    # ([alpha and ']+) [non-alpha]*
    # keep the () part, ignore the rest
    p = re.compile(r"([a-zA-Z']+)[^a-zA-Z]*")
    # find each match, then translate each word and replace
    return p.sub(lambda match: translateWord(match.group(1)), sentence)
This code replaces the entire match as opposed to the capturing group.
Example of bad output:
>>> sentence = "This isn't my three-egg omelet."
>>> sentence = translateSentence(sentence)
>>> print(sentence)
Isthayisn'tyayymayeethrayeggyayomeletyay
The code needs to output this instead:
Isthay isn'tyay ymay eethray-eggyay omeletyay.
The translateWord() function should only operate on a string input. I could test to see what kind of input the function is taking and change behavior based on that, but that defeats the purpose. How would one do this correctly?
Anyway, just try:
return p.sub(lambda match: translateWord(match.group(1)), sentence)
It looks like you got confused about what to pass as the second parameter to re.sub: you pass the actual function (in this case, the lambda expression), no need to try to embed that in a string.
If you want to change just a group, though, the re methods don't support that directly. Instead, you have to rebuild the string for the whole match yourself, replacing the groups you want to change.
The easier way is to expand your lambda into a separate multi-line function that does that mangling for you. It can use the start() and end() methods of the match object it receives (these give the same positions as the .regs attribute) to find the group's limits and build the replacement string. Those positions refer to the whole input string, so subtract the start of the match before slicing the matched text:
def replace_group(match):
    translated = translateWord(match.group(1))
    matched = match.group(0)
    offset = match.start(0)  # group positions are relative to the whole string
    start, end = match.start(1) - offset, match.end(1) - offset
    return matched[:start] + translated + matched[end:]
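To see the group-only replacement end to end, here is a small self-contained sketch. The translate_word stand-in below simply upper-cases its input; it is a hypothetical placeholder for the real translation logic, which the question does not show.

```python
import re

def translate_word(word):
    # Hypothetical stand-in for the real translation: upper-case the word.
    return word.upper()

def translate_sentence(sentence):
    pattern = re.compile(r"([a-zA-Z']+)[^a-zA-Z]*")

    def replace_group(match):
        matched = match.group(0)
        # start()/end() are positions in the whole sentence, so shift them
        # to be relative to the matched text before slicing.
        offset = match.start(0)
        start, end = match.start(1) - offset, match.end(1) - offset
        return matched[:start] + translate_word(match.group(1)) + matched[end:]

    return pattern.sub(replace_group, sentence)

print(translate_sentence("This isn't my three-egg omelet."))
# THIS ISN'T MY THREE-EGG OMELET.
```

Spaces, the hyphen, and the final period survive because only the slice covered by group 1 is rewritten.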

Matching if any keyword from a list is present in a string

I have a list of keywords. A sample is:
['IO', 'IO Combination','CPI Combos']
What I am trying to do is see if any of these keywords is present in a string. For example, if my string is "there is a IO competition coming in Summer 2018", it contains IO, so it should be identified. If the string is "there is a competition coming in Summer 2018", no keyword should be identified.
I wrote this Python code but it also identifies IO in competition:
if any(word.lower() in string_1.lower() for word in keyword_list):
    print('FOUND A KEYWORD IN STRING')
I also want to know which keyword was found in the string (if any). What is the issue in my code, and how can I make sure that it matches only complete words?
Regex solution
You'll need to implement word boundaries here:
import re
keywords = ['IO', 'IO Combination','CPI Combos']
words_flat = "|".join(r'\b{}\b'.format(word) for word in keywords)
rx = re.compile(words_flat)
string = "there is a IO competition coming in Summer 2018"
match = rx.search(string)
if match:
    print("Found: {}".format(match.group(0)))
else:
    print("Not found")
Here, each keyword is wrapped in \b on both sides and the results are joined with |.
Afterwards, you can search with rx.search(), and the code prints "Found: IO" in this example.
Even shorter with a direct comprehension:
rx = re.compile("|".join(r'\b{}\b'.format(word) for word in keywords))
Non-regex solution
Note that a non-regex solution also works for single-word keywords. You just have to reorder your comprehension and use split(), like
found = any(word in keywords for word in string.split())
if found:
    pass  # do something here
Notes
The latter has the drawback that strings like
there is a IO. competition coming in Summer 2018
#         ---^---
won't match, because split() yields the token "IO." with the trailing period, whereas the regex solution still treats IO as a word there (hence the two approaches yield different results). Additionally, because split() breaks on whitespace, multi-word phrases like CPI Combos cannot be found. The regex solution also supports case-insensitive matching (just pass flags=re.IGNORECASE).
It really depends on your actual requirements.
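The notes above can be combined into one sketch. It makes three choices that are not in the original answer: keywords are passed through re.escape() in case they contain regex metacharacters, they are sorted longest first so that "IO Combination" is preferred over "IO", and matching is case-insensitive. The helper name find_keywords is purely illustrative.

```python
import re

keywords = ['IO', 'IO Combination', 'CPI Combos']

# Longest first, so multi-word phrases win over their shorter prefixes.
ordered = sorted(keywords, key=len, reverse=True)
rx = re.compile(
    r"\b(?:{})\b".format("|".join(re.escape(k) for k in ordered)),
    re.IGNORECASE,
)

def find_keywords(text):
    # Return every keyword occurrence as it appears in the text.
    return [m.group(0) for m in rx.finditer(text)]

print(find_keywords("there is a IO competition coming in Summer 2018"))  # ['IO']
print(find_keywords("there is a competition coming in Summer 2018"))     # []
```

Using finditer() instead of search() reports all matched keywords rather than only the first one.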
for index, key in enumerate(mylist):
    if key.find(mystring) != -1:
        return index
It loops over your list and, for every item, checks whether your string is contained in the item. find() returns -1 when the string is not found, so any other value means it is contained; in that case you get the index of the item where it was found with the help of enumerate(). (The return assumes the loop sits inside a function.)

How to match and remove occurrences from a file using regex

I am new to Python and I am trying to get some contents from a file using regex. I upload a file, load it into memory, and then run this regular expression. I want to extract the names from the file, but it also needs to work with names that contain spaces, like "Marie Anne". So imagine that the array of names has these values:
all_names = [{"name": "Marie Anne", "id": 1}, {"name": "Johnathan", "id": 2}, {"name": "Marie", "id": 3}, {"name": "Anne", "id": 4}, {"name": "John", "id": 5}]
The string that I am searching is multiline and might contain multiple occurrences.
print all_names  # a list of id/name dicts, ordered by name length, longest first
textToStrip = stdout.decode('ascii', 'ignore').lower()
for i in range(len(all_names)):
    print all_names[i]
    m = re.search(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', textToStrip)
    if m:
        textToStrip = re.sub(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', "", textToStrip, 100)
        print "found " + all_names[i]['name']
print textToStrip
The script finds the names. The re.sub line removes each found name from the text so that "Marie Anne" and "Marie" are not both taken from the same occurrence, but it also removes the extra characters, such as "," or ".", before or after the name.
Any help would be much appreciated... or if you have a better solution for this problem, even better.
The characters on both sides are deleted because the \W on each side is part of the regexp you pass to re.sub(), and re.sub() replaces everything the regexp matches.
There's an alternate way to do this. If you wrap the parts that you want to keep in grouping parentheses, and call re.sub() with a callable (a function) instead of the replacement string, that function can extract the group values from the match object passed to it and assemble a return value that preserves them.
Read the documentation for re.sub for details.
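A minimal sketch of that idea follows, written for Python 3 and using a hypothetical helper named strip_name. It captures the non-word character on each side of the name and puts both back, so only the name itself is removed.

```python
import re

def strip_name(text, name):
    # Capture the delimiter on each side so it can be restored.
    pattern = re.compile(r"(\W)" + re.escape(name.lower()) + r"(\W)")
    # The callable receives each match and returns only the two delimiters.
    return pattern.sub(lambda m: m.group(1) + m.group(2), text)

text = "hello, marie anne. and marie!"
print(strip_name(text, "Marie Anne"))  # hello, . and marie!
```

As in the original pattern, a name at the very start or end of the text is not matched, because \W needs an actual character on each side.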

replace multiple words - python

There can be an input "some word".
I want to replace this input with "<strong>some</strong> <strong>word</strong>" in some other text which contains this input
I am trying with this code:
input = "some word".split()
pattern = re.compile('(%s)' % input, re.IGNORECASE)
result = pattern.sub(r'<strong>\1</strong>',text)
but it is failing, and I know why. I am wondering how to pass all elements of the list input to compile() so that (%s) can catch each of them.
I appreciate any help.
The right approach, since you're already splitting the string into a list, is to wrap each item of the list directly (without using a regex at all):
sterm = "some word".split()
result = " ".join("<strong>%s</strong>" % w for w in sterm)
In case you're wondering, the pattern you were looking for was:
pattern = re.compile('(%s)' % '|'.join(sterm), re.IGNORECASE)
This works on your string because the regular expression would become
(some|word)
which means "matches some or matches word".
However, this is not a good approach as it does not work for all strings. For example, consider cases where one word contains another, such as
a banana and an apple
which becomes:
<strong>a</strong> <strong>banana</strong> <strong>a</strong>nd <strong>a</strong>n <strong>a</strong>pple
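One way to avoid the partial matches shown above is to anchor each alternative with word boundaries. The sketch below is an illustration rather than part of the original answer; the helper name highlight is hypothetical, and it assumes the search phrase contains at least one word.

```python
import re

def highlight(text, phrase):
    # Longest words first, each escaped, and the whole alternation wrapped
    # in \b so that "a" cannot match inside "and" or "apple".
    words = sorted(phrase.split(), key=len, reverse=True)
    pattern = re.compile(
        r"\b(%s)\b" % "|".join(re.escape(w) for w in words),
        re.IGNORECASE,
    )
    return pattern.sub(r"<strong>\1</strong>", text)

print(highlight("a banana and an apple", "a banana"))
# <strong>a</strong> <strong>banana</strong> and an apple
```

Only whole-word occurrences are wrapped, so "and", "an", and "apple" are left alone.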
It looks like you're wanting to search for multiple words - this word or that word. Which means you need to separate your searches by |, like the script below:
import re
text = "some word many other words"
input = '|'.join('some word'.split())
pattern = re.compile('(%s)' % input, flags=0)
print(pattern.sub(r'<strong>\1</strong>', text))
I'm not completely sure I know what you're asking, but if you want to pass all the elements of input as parameters in the compile function call, you can use *input instead of input; * unpacks the list into its elements. As an alternative, couldn't you just try joining the list with | and adding ( at the beginning and ) at the end?
Alternatively, you can use the join operator with a list comprehension to create the intended result.
text = "some word many other words".split()
result = ' '.join(['<strong>'+i+'</strong>' for i in text])

How to get a capture group that doesnt always exist?

I have a regex something like
(\d\d\d)(\d\d\d)(\.\d\d){0,1}
When it matches, I can easily get the first two groups, but how do I check whether the third occurred 0 or 1 times?
Also, another minor question: in (\.\d\d) I only care about the \d\d part. Is there a way to tell the regex that \.\d\d needs to appear 0 or 1 times, but that I want to capture only the \d\d part?
This was based on a problem of parsing an hhmmss string that has an optional decimal part for seconds (so it becomes hhmmss.ss). I put \d\d\d in the question so it is clear which \d\d I'm talking about.
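import re
value = "123456.33"
# (?:...) groups without capturing, so group 3 holds only the two digits
match = re.search(r"^(\d\d\d)(\d\d\d)(?:\.(\d\d)){0,1}$", value)
if match:
    print(match.group(1))
    print(match.group(2))
    # an optional group that did not take part in the match is None
    if match.group(3) is not None:
        print(match.group(3))
    else:
        print("3rd group not found")
else:
    print("value doesn't match regex")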
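Applied to the hhmmss case described in the question, the same technique might look like the sketch below. The function name parse_hhmmss and the choice to return a tuple are assumptions made for illustration; ? is used as the equivalent shorthand for {0,1}.

```python
import re

# Two digits each for hours, minutes, seconds; the fraction is optional and
# only its digits are captured, thanks to the non-capturing (?:...) wrapper.
TIME_RE = re.compile(r"^(\d\d)(\d\d)(\d\d)(?:\.(\d\d))?$")

def parse_hhmmss(value):
    match = TIME_RE.match(value)
    if match is None:
        return None
    hours, minutes, seconds, fraction = match.groups()
    # fraction is None when the optional group did not take part in the match.
    hundredths = int(fraction) if fraction is not None else None
    return int(hours), int(minutes), int(seconds), hundredths

print(parse_hhmmss("123456.33"))  # (12, 34, 56, 33)
print(parse_hhmmss("123456"))     # (12, 34, 56, None)
```

Checking the group against None, rather than testing its truthiness, keeps a captured "00" distinct from a missing fraction.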
