Python regex to detect string in multiline - python

I am trying to detect a string that sometimes appears on one line and sometimes spans multiple lines.
case 1:
==================== 1 error in 500.14 seconds =============
case 2:
================= 3 tests deselected by "-m 'not regression'" ==================
21 failed, 553 passed, 35 skipped, 3 deselected, 4 error, 51 rerun in 6532.96 seconds
I have tried the following, but it does not work:
==+.*(?i)(?m)(error|failed).*(==+|seconds)

Use the below regex:
==+[\s\S]*?(\d+)\s(error|failed).*(==+|seconds)
[\s\S] instead of . also matches line breaks, so the pattern can span multiple lines
(\d+) is the first capturing group, so matches[0] will always contain the number, such as 1 or 21
(error|failed) is the second capturing group, so matches[1] will contain either 'error' or 'failed'
Testing in Python:
import re
pattern = r"==+[\s\S]*?(\d+)\s(error|failed).*(==+|seconds)"  # raw string keeps the backslashes intact
case1 = "==================== 1 error in 500.14 seconds ============="
p = re.compile(pattern)
matches = p.match(case1).groups()
matches[0] + " " + matches[1] # Output: '1 error'
case2 = """================= 3 tests deselected by -m 'not regression' ==================
21 failed, 553 passed, 35 skipped, 3 deselected, 4 error, 51 rerun in 6532.96 seconds"""
matches = p.match(case2).groups()
matches[0] + " " + matches[1] # Output: '21 failed'
Hope this helps!
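One caveat when the summary sits inside longer output: p.match only matches at the start of the string, and calling .groups() on a failed match raises AttributeError. The sketch below shows one way to guard against both, assuming Python 3. The helper name first_problem is made up for illustration, and it swaps in re.search plus a None check while keeping the answer's pattern.

```python
import re

# Same pattern as in the answer; the last group is made non-capturing
# because only the count and the keyword are needed.
SUMMARY_RE = re.compile(r"==+[\s\S]*?(\d+)\s(error|failed).*(?:==+|seconds)")

def first_problem(output):
    """Return (count, kind) for the first 'N error' / 'N failed', or None."""
    m = SUMMARY_RE.search(output)  # search scans the whole text, unlike match
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

case1 = "collected 1 item\n==================== 1 error in 500.14 seconds ============="
case2 = ("================= 3 tests deselected by -m 'not regression' ==================\n"
         "21 failed, 553 passed, 35 skipped, 3 deselected, 4 error, 51 rerun in 6532.96 seconds")
clean = "===== 5 passed in 1.20 seconds ====="

print(first_problem(case1))
print(first_problem(case2))
print(first_problem(clean))
```

Returning None for a clean run lets the caller decide what "no errors" should mean instead of crashing on a missing match.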

Related

Manipulate time-range in a pandas Dataframe

I need to clean up a CSV import that gives me a range of times (in string form). My code is at the bottom; I currently use regular expressions and replace() on the DataFrame to convert other characters. I am just not sure how to:
select the numbers already in 24-hour format and add :00
select the 12-hour format numbers and convert them to 24-hour.
Input (from csv import):
break_notes
0 15-18
1 18.30-19.00
2 4PM-5PM
3 3-4
4 4-4.10PM
5 15 - 17
6 11 - 13
So far I have got it to look like (remove spaces, AM/PM, replace dot with colon):
break_notes
0 15-18
1 18:30-19:00
2 4-5
3 3-4
4 4-4:10
5 15-17
6 11-13
However, I would like it to look like this ('HH:MM-HH:MM' format):
break_notes
0 15:00-18:00
1 18:30-19:00
2 16:00-17:00
3 15:00-16:00
4 16:00-16:10
5 15:00-17:00
6 11:00-13:00
My code is:
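import pandas as pd

data = pd.read_csv('test.csv')
# regex=True is needed on recent pandas versions, where str.replace defaults to literal matching
data.break_notes = data.break_notes.str.replace(r'([P].|[ ])', '', regex=True).str.strip()
data.break_notes = data.break_notes.str.replace(r'([.])', ':', regex=True).str.strip()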
Here is a converter function based on your input data. convert_entry takes a complete entry, splits it on the dash, and passes each part to convert_single, since both halves of an entry can be converted individually. After the conversion, it joins them with a dash again.
convert_single uses a regex to search for the important parts of the time string.
It starts with some digits \d+ (the hours), then optionally a dot or a colon and some more digits [.:]?(\d+)? (the minutes), and after that optionally AM or PM (AM|PM)? (only PM is relevant in this case).
import re

def convert_single(s):
    # hours, optional minutes after a dot or colon, optional AM/PM suffix
    m = re.search(r"(\d+)[.:]?(\d+)?(AM|PM)?", s)
    hours = m.group(1)
    minutes = m.group(2) or "00"
    if m.group(3) == "PM" and int(hours) < 12:  # 12PM is already 12:xx
        hours = str(int(hours) + 12)
    return hours.zfill(2) + ":" + minutes.zfill(2)

def convert_entry(value):
    # convert both halves of "start-end" independently
    start, end = value.split("-")
    start = convert_single(start)
    end = convert_single(end)
    return "-".join((start, end))

values = ["15-18", "18.30-19.00", "4PM-5PM", "3-4", "4-4.10PM", "15 - 17", "11 - 13"]
for value in values:
    cvalue = convert_entry(value)
    print(cvalue)
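The converter above treats each half on its own, so "3-4" and the start of "4-4.10PM" stay in the morning, while the desired output in the question shows them as afternoon times. A possible extension is sketched below. It rests on two assumptions that are not in the original answer: a start time without a suffix borrows PM from the end time when that keeps the range in order, and bare hours below a cutoff (the hypothetical parameter assume_pm_below, default 8) are treated as afternoon. All names here are illustrative.

```python
import re

# hours, optional minutes after "." or ":", optional AM/PM suffix
TIME_RE = re.compile(r"(\d+)(?:[.:](\d+))?\s*(AM|PM)?", re.IGNORECASE)

def parse_time(text):
    m = TIME_RE.search(text)
    if m is None:
        raise ValueError(f"no time found in {text!r}")
    return int(m.group(1)), int(m.group(2) or 0), (m.group(3) or "").upper()

def to_24h(hours, minutes, suffix, assume_pm_below):
    if suffix == "PM" and hours < 12:
        hours += 12
    elif suffix == "AM" and hours == 12:
        hours = 0
    elif not suffix and hours < assume_pm_below:
        hours += 12  # assumption: bare small hours mean afternoon
    return f"{hours:02d}:{minutes:02d}"

def convert_range(value, assume_pm_below=8):
    start_text, end_text = value.split("-")
    sh, sm, ss = parse_time(start_text)
    eh, em, es = parse_time(end_text)
    if not ss and es and sh <= eh:
        ss = es  # assumption: "4-4.10PM" means both ends are PM
    return (to_24h(sh, sm, ss, assume_pm_below) + "-"
            + to_24h(eh, em, es, assume_pm_below))

values = ["15-18", "18.30-19.00", "4PM-5PM", "3-4", "4-4.10PM", "15 - 17", "11 - 13"]
converted = [convert_range(v) for v in values]
print(converted)
```

With pandas, the same function could be applied with something like data.break_notes.map(convert_range). Whether the cutoff of 8 suits the real data is a judgment call that depends on when breaks actually happen.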

regex matching multiple repeating groups

I have the following string:
s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
I would like to parse the statuses and counts after "workorders". I've tried the following regex:
r = r"workorders:( (\d+) (\w+),?)*"
but this only returns the last group. How can I return all groups?
p.s. I know I could do this in python, but was wondering if there's a pure regex solution
>>> s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
>>> r = r"workorders:( (\d+) (\w+),?)*"
>>> re.findall(r, s)
[(' 134 completed', '134', 'completed')]
>>>
output should be close to
[('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]
For the text in the example, you could try it like this:
(?:(\d+) (\w+)(?=,|$))+
Explanation
A non capturing group (?:
A capturing group for one or more digits (\d+)
A white space
A capturing group for one or more word characters (\w+)
A positive lookahead which asserts that what follows is either a comma or the end of the string (?=,|$)
Close the non capturing group and repeat that one or more times )+
That would give you:
[('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]
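To see why the original attempt returned only the last pair, here is a small sketch, assuming Python 3 and the example string from the question. A capturing group that is repeated keeps only its final iteration, whereas a pattern that matches one pair at a time lets re.findall collect every pair. The variable names are illustrative.

```python
import re

s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"

# A repeated capturing group remembers only its last iteration.
repeated = re.search(r"workorders:(?: (\d+) (\w+),?)*", s)
last_only = repeated.groups()

# Matching a single "count status" pair per match lets findall gather them all.
# The lookahead requires a comma or the end of the string after each pair,
# which skips "3434 garbage".
pairs = re.findall(r"(\d+) (\w+)(?=,|$)", s)

print(last_only)
print(pairs)
```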
This should work for your particular case if you only need the counts:
re.findall(r'[:,] (\d+)', s)  # ['138', '2', '3', '134']
In my experience, I found it better to use regex after you process the string as much as possible; regex on an arbitrary string will only cause headaches.
In your case, try splitting on ':' (or even workorders:) and getting the stuff after to get only the counts of statuses. After that, it's easy to get the counts for each status.
import re
s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
statuses = s.split(':')  # [' 3434 garbage workorders', ' 138 waiting, 2 running, 3 failed, 134 completed']
statusesStr = statuses[1]  # ' 138 waiting, 2 running, 3 failed, 134 completed'
statusRe = re.compile(r"(\d+)\s*(\w+)")
statusRe.findall(statusesStr)  # [('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]
Edit: changed expression to meet desired outcome and more robust
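An answer that tries to match only the entries that come after the colon:
re.findall(r'(?: )\d+ \w+', s)
Note that this returns whole substrings with a leading space rather than tuples, and it also matches ' 3434 garbage' before the colon, so the result still needs filtering.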
pairs = re.findall(r'(\d+) ([A-Za-z]+)', s.split("workorders:")[1])
You can then turn this into a dict (the name pairs avoids shadowing the built-in map):
x = {v: int(k) for k, v in pairs}

Conditioning on Regex

I have several strings from which I need to extract the block numbers. The block numbers come in formats such as "3rd block", "pine block", "block 2" and "block no 4". Please note that these are just the formats; the numbers could change. I have combined them with OR conditions.
The problem is that the regex sometimes extracts the previous word, which belongs to something else: for "main phase block 2" I need "block 2" to be extracted. re.search returns only the first result, and alternation with "OR" has its limits.
What I want is to add exceptions or condition my regex with something like
if 1 or 2 digits (like 23 , 3 ,6 ,7 etc) occur before the word "block", extract "block" with the word following "block".
Eg :
string = "rmv clusters phase 2 block 1 , flat no 209 dev." #extract "block 1" and not "2 block".
if words "phase , apartment or building" come before "block", extract word that follows block (irrespective of whether its a number or word)
Eg:
string2 = "sky line apartments block 2 chandra layout" #extract "block 2" and not "apartments block"
Here is what I have done. But I've got no idea about adding conditions.
p = re.compile(r'(block[^a-z]\s\d*)|(\w+\sblock[^a-z])|(block\sno\s\d+)')
q = p.search(str)
this is a part of an entire function.
Tested on Python 2.7 and 3.3.
import re
# Adjacent string literals are concatenated, so this is one string holding three test lines.
strings = ("rmv clusters phase 2 block 1 , flat no 209 dev.\n"
           "sky line apartments block 2 chandra layout\n"
           "foo bar 99 block baz")  # tests rule 1.
Here are the rules you stated you wanted:
if 1 or 2 digits (like 23 , 3 ,6 ,7 etc) occur before the word "block", extract "block" with the word following "block".
if words "phase , apartment or building" come before "block", extract word that follows block (irrespective of whether it's a number or word). * I'm inferring you want the word block too.
So
regex = re.compile(r'''
(?:\d{1,2}\s)(block\s\w*) # rule 1
| # or
(?:(phase|apartment|building).*?)(block\s\w+) # rule 2
''', re.X)
found = regex.finditer(strings)
for i in found:
    print(i.groups())
prints:
(None, 'phase', 'block 1')
(None, 'apartment', 'block 2')
('block baz', None, None)
None is the default for a group that did not take part in the match, so you can pick a preference and let the short-circuiting or return group 1 if it is non-empty, or group 3 if group 1 is empty (i.e. evaluates as False in Python's boolean contexts).
>>> found = regex.finditer(strings)
>>> for i in found:
...     print(i.group(1) or i.group(3))
...
block 1
block 2
block baz
So to put this thing into a simple function:
def block(text):  # renamed from str to avoid shadowing the built-in
    regex = re.compile(r'''
        (?:\d{1,2}\s)(block\s\w*)                        # rule 1
        |                                                # or
        (?:(phase|apartment|building).*?)(block\s\w+)    # rule 2
        ''', re.X)
    match = regex.search(text)
    if not match:
        return ''
    else:
        return match.group(1) or match.group(3) or ''
usage:
>>> block("foo bar 99 block baz")
'block baz'
>>> block("sky line apartments block 2 chandra layout")
'block 2'
Why don't you write multiple regexes? See the following Python 3 snippet:
import re

def getBlockMatch(string):
    p1Regex = re.compile(r'block\s+\d+')
    p2Regex = re.compile(r'(block[^a-z]\s\d*)|(\w+\sblock[^a-z])|(block\sno\s\d+)')
    # prefer the plain "block <number>" form; fall back to the broader pattern
    if p1Regex.search(string) is not None:
        return p1Regex.findall(string)
    else:
        return p2Regex.findall(string)

string = "rmv clusters phase 2 block 1 , flat no 209 dev."
print(getBlockMatch(string))
string = "sky line apartments block 2 chandra layout"
print(getBlockMatch(string))
Outputs:
['block 1']
['block 2']
>>> import re
>>> string = "rmv clusters phase 2 block 1 , flat no 209 dev."
>>> string2 = "sky line apartments block 2 chandra layout"
>>> print(re.findall(r'block\s+\d+', string))
['block 1']
>>> print(re.findall(r'block\s+\d+', string2))
['block 2']
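The "multiple regexes" idea can be pushed one step further by trying patterns in priority order and returning the first hit. The sketch below assumes Python 3 and covers the four formats named in the question ("block no 4", "block 2", "3rd block", "pine block"). The ordering, the names BLOCK_PATTERNS and first_block, and the sample addresses in the last two lines are illustrative assumptions, not part of any answer above.

```python
import re

# Most specific pattern first; the generic "<word> block" form is the last resort.
BLOCK_PATTERNS = [
    re.compile(r"\bblock\s+no\s+\d+\b"),  # "block no 4"
    re.compile(r"\bblock\s+\d+\b"),       # "block 2"
    re.compile(r"\b\w+\s+block\b"),       # "3rd block", "pine block"
]

def first_block(text):
    """Return the highest-priority block mention in text, or '' if none."""
    for pattern in BLOCK_PATTERNS:
        m = pattern.search(text)
        if m:
            return m.group(0)
    return ""

print(first_block("rmv clusters phase 2 block 1 , flat no 209 dev."))
print(first_block("sky line apartments block 2 chandra layout"))
print(first_block("near 3rd block main road"))
print(first_block("flat 12 block no 4 second cross"))
```

Because "block <number>" is tried before "<word> block", a string such as "phase 2 block 1" yields "block 1" rather than "2 block", which is the behaviour the question asks for.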

Regex with grouping, how to terminate the group?

I need to match the text below with a regexp and want to access the resulting groups.
String to be searched:
Products in these categories Nr 24432 in Kitchen ( Bestsellers ) Nr 11 in Home Improvement > Garden Nr 25 in Hobby > Gärtnerei
Expected Results:
"Kitchen","Home Improvement > Garden", "Hobby > Gärtnerei"
This is the regexp that I came up with so far, but it only matches the first occurrence.
Any ideas?
Nr [0-9]{1,} in ([0-9A-z >&äÄüÜöÖ]{1,})
Not sure how you're currently trying to match them, but this should work:
import re
text = "Products in these categories Nr 24432 in Kitchen ( Bestsellers ) Nr 11 in Home Improvement > Garden Nr 25 in Hobby > Gärtnerei "
for m in re.finditer(r"Nr [0-9]{1,} in ([0-9A-z >&äÄüÜöÖ]{1,})", text):
    print(m.group(1))
Also, your second match will match the whole rest of the string.
I suggest changing it to something like:
Nr [0-9]+ in (.+?)(?=[^0-9A-z >&äÄüÜöÖ]|$| Nr )
+ means the same as {1,}
.+? means one or more wild-cards (non-greedily)
?= means look-ahead, so it checks if the next character is an invalid character, end-of-line or " Nr " - the start of the next match.
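Putting the suggested pattern to work, a minimal sketch assuming Python 3: the captured text can carry a trailing space (for example before the opening parenthesis), so each group is stripped. The character class is written as A-Za-z here, which is a small deviation from the pattern above, because the range A-z also covers a few punctuation characters between Z and a.

```python
import re

text = ("Products in these categories Nr 24432 in Kitchen ( Bestsellers ) "
        "Nr 11 in Home Improvement > Garden Nr 25 in Hobby > Gärtnerei ")

# Lazy capture that stops before an invalid character, the end of the
# string, or the start of the next " Nr " item.
pattern = re.compile(r"Nr [0-9]+ in (.+?)(?=[^0-9A-Za-z >&äÄüÜöÖ]|$| Nr )")

categories = [m.group(1).strip() for m in pattern.finditer(text)]
print(categories)
```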

Extract multiple line data between two symbols - Regex and Python3

I have a huge file from which I need data for specific entries. File structure is:
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
What I need to do is extract all the lines (the complete record) for certain entries. For example, if I need the record for Entry1.1, then I can use the entry name '>Entry1.1' and the next '>' as markers in a regex to extract the lines in between. But I do not know how to build such complex regex expressions. Once I have such an expression, I will put it in a for loop:
For entry in entrylist:
    GET record from big_file
    DO some processing
    WRITE in result file
What could be the REGEX to perform such extraction of record for specific entries? Is there any more pythonic way to achieve this? I would appreciate your help on this.
AK
With regex
import re
ss = '''
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
'''
patbase = r'(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'
while True:
    x = input('What entry do you want ? (empty to quit) : ')
    if not x:
        break
    # re.escape keeps the dot in e.g. "1.1" from matching any character
    found = re.findall(patbase % re.escape(x), ss, re.DOTALL)
    if found:
        print('found ==', found)
        for each_entry in found:
            print('\n%s\n' % each_entry)
    else:
        print('\n ** There is no such an entry **\n')
Explanation of '(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))':
1)
%s receives the reference of the entry: 1.1, 2, 2.1, etc.
2)
The portion (?![^\n]+?\d) performs a verification.
(?![^\n]+?\d) is a negative look-ahead assertion which says that what comes after %s must not be [^\n]+?\d, that is to say, any characters [^\n]+? followed by a digit \d.
I write [^\n] to mean "any character except a newline \n".
I have to write this instead of simply .+? because I set the flag re.DOTALL, so a .+? here could run on to the end of the entry.
However, I only want to verify that, after the entered reference (represented by %s in the pattern), there are no additional digits before the end OF THE LINE.
This matters because if there is an Entry2.1 but no Entry2, and the user enters only 2 because he wants Entry2 and no other, the regex would otherwise detect Entry2.1 and yield it, although the user really wanted Entry2.
3)
At the end of (>Entry *%s(?![^\n]+?\d).+?), the part .+? catches the complete block of the entry, because with re.DOTALL the dot represents any character, including a newline \n.
That is why I set the flag re.DOTALL: it lets the pattern portion .+? pass over the newlines until the end of the entry.
4)
I want the matching to stop at the end of the desired entry, not inside the next one, so that the group defined by the parentheses in (>Entry *%s(?![^\n]+?\d).+?) catches exactly what we want.
Hence, I put a positive look-ahead assertion (?=>|(?:\s*\Z)) at the end, which says that the non-greedy .+? must stop just before either > (the beginning of the next entry) or the end of the string \Z.
As the end of the last entry may not be exactly the end of the entire string, I put \s*, which means "possible whitespace before the very end".
So \s*\Z means "there can be whitespace before reaching the end of the string".
Whitespace characters are a blank, \f, \n, \r, \t, \v.
I'm no good with regexes, so I try to look for non-regex solutions whenever I can. In Python, the natural place to store iteration logic is in a generator, and so I'd use something like this (no-itertools-required version):
def group_by_marker(seq, marker):
    seq = iter(seq)  # share one iterator so the second loop resumes where the first stopped
    group = []
    # advance past negatives at start
    for line in seq:
        if marker(line):
            group = [line]
            break
    for line in seq:
        # found a new group start; yield what we've got
        # and start over
        if marker(line) and group:
            yield group
            group = []
        group.append(line)
    # might have extra bits left..
    if group:
        yield group
In your example case, we get:
>>> with open("entry0.dat") as fp:
...     marker = lambda line: line.startswith(">Entry")
...     for group in group_by_marker(fp, marker):
...         print(repr(group[0]), len(group))
...
'>Entry1.1\n' 10
'>Entry2.1\n' 9
'>Entry3.2\n' 4
One advantage to this approach is that we never have to keep more than one group in memory, so it's handy for really large files. It's not nearly as fast as a regex, although if the file is 1 GB you're probably I/O bound anyhow.
I'm not entirely sure what you're asking, but does this get you any closer? It stores each entry name as a dictionary key, with a list of that entry's lines as the value, assuming the file is formatted the way I believe it is. Does it have duplicate entries? Here's what I've got:
entries = {}
key = ''
with open('entries.txt') as f:
    for entry in f:
        if entry.startswith('>Entry'):
            key = entry[1:].strip()  # removes > and newline
            entries[key] = []
        else:
            # setdefault avoids a KeyError for lines before the first entry
            entries.setdefault(key, []).append(entry)
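For the loop described in the question (get the record for each wanted entry, process it, write a result), a streaming variant may be enough. The sketch below assumes Python 3 and that every record starts with a line beginning with ">". It yields only the wanted records, so lines of unwanted entries are never stored. The function name iter_wanted_records and the in-memory sample are made up for illustration; with a real file, an open file object would take the place of the StringIO.

```python
import io

def iter_wanted_records(lines, wanted):
    """Yield (entry_name, data_lines) for each entry whose name is in wanted."""
    name, record = None, []
    for line in lines:
        if line.startswith(">"):
            if name in wanted:
                yield name, record
            name, record = line[1:].strip(), []
        elif name in wanted:
            record.append(line.rstrip("\n"))
    if name in wanted:  # the last record has no following '>' line
        yield name, record

sample = io.StringIO(
    ">Entry1.1\n#size=1688\n704 1 1 1 4\n979 2 2 2 0\n"
    ">Entry2.1\n#size=6251\n6110 3 1.5 0 2\n"
    ">Entry3.2\n#size=1777\n"
)

found = list(iter_wanted_records(sample, {"Entry2.1", "Entry3.2"}))
for entry_name, data_lines in found:
    print(entry_name, len(data_lines))
```

Passing the wanted names as a set keeps each membership check cheap, and the file is read only once regardless of how many entries are requested.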
