Matching both possible solutions in Regex - python

I have a string aaab. I want a Python expression to match aa, so I expect the regular expression to return aa and aa since there are two ways to find substrings of aa.
However, this is not what's happening.
This is what I've done:
import re
a = "aaab"
b = re.match('aa', a)

You can achieve it with a look-ahead and a capturing group inside it:
(?=(a{2}))
Since a look-ahead does not move on to the next position in string, we can scan the same text many times thus enabling overlapping matches.
Python code:
import re
p = re.compile(r'(?=(a{2}))')
test_str = "aaab"
print(re.findall(p, test_str))
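To see where each overlapping match starts, here is a minimal sketch using re.finditer (the variable names are illustrative, not from the answer above). It also shows that a plain 'aa' pattern finds only one match, because ordinary matches consume the characters they cover.

```python
import re

test_str = "aaab"

# A plain pattern consumes the characters it matches, so matches cannot overlap.
plain = re.findall(r"aa", test_str)

# The look-ahead matches an empty string at each position; group 1 holds the text.
overlapping = [(m.start(), m.group(1)) for m in re.finditer(r"(?=(a{2}))", test_str)]

print(plain)        # ['aa']
print(overlapping)  # [(0, 'aa'), (1, 'aa')]
```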

To generalize @stribizhev's solution to match one or more a characters: (?=(a{1,}))
For three or more: (?=(a{3,})), and so on.
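As a quick check of the two generalized patterns, the sketch below runs them against the same "aaab" string from the question. Note that with an open-ended quantifier each position reports the longest run that starts there, so the results differ in length.

```python
import re

test_str = "aaab"

# One or more: the greedy quantifier takes the longest run available at each position.
one_or_more = re.findall(r"(?=(a{1,}))", test_str)

# Three or more: only position 0 has three consecutive a characters ahead of it.
three_or_more = re.findall(r"(?=(a{3,}))", test_str)

print(one_or_more)    # ['aaa', 'aa', 'a']
print(three_or_more)  # ['aaa']
```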


python regex unexpected match groups

I am trying to find all occurrences of either "_"+digit or "^"+digit, using the regex ((_|\^)[1-9])
The groups I'd expect back, e.g. for "X_2ZZZY^5", would be [('_2'), ('^5')], but instead I am getting [('_2', '_'), ('^5', '^')]
Is my regex incorrect? Or is my expectation of what gets returned incorrect?
Many thanks
** My original regex used (_\^); this was incorrect and should have been (_|\^). The question has been amended accordingly.
You have two groups in your regex, so you get two groups back for each match. You also need to match at least one digit after the special character.
Try this:
([_\^][1-9]+)
Demand at least 1 digit (1-9) following the special characters _ or ^, placed inside a single capture group:
import re
text = "X_2ZZZY^5"
pattern = r"([_\^][1-9]{1,})"
regex = re.compile(pattern)
res = re.findall(regex, text)
print(res)
Returning:
['_2', '^5']
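The effect of the number of capturing groups on what re.findall returns can be sketched as follows. The three patterns are illustrative variants on the same input: two groups give tuples, one group gives that group's text, and no groups give the whole match.

```python
import re

text = "X_2ZZZY^5"

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"((_|\^)[1-9])", text)

# Inner group made non-capturing: one group is left, so findall returns strings.
one_group = re.findall(r"((?:_|\^)[1-9])", text)

# No groups at all: findall returns the whole match.
no_groups = re.findall(r"[_^][1-9]", text)

print(two_groups)  # [('_2', '_'), ('^5', '^')]
print(one_group)   # ['_2', '^5']
print(no_groups)   # ['_2', '^5']
```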

Extract date from inside a string with Python

I have the following string. The leading letters can differ, and there can be two, three, or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date. I tried:
filename_without_ending.split(".")[0][-6:]
This solution is not valid, since the format might also differ from time to time. Sometimes the string looks like this:
PZA191030_392001_USB
The only REAL pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group:
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
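The question also asks for a valid date, which the snippet above does not produce. A minimal sketch follows, assuming the six digits are a two-digit year, month, and day (YYMMDD); the question does not state the format, so treat that as a guess. The helper name extract_date is hypothetical, and the trailing \. is dropped from the pattern so that the underscore-separated variant from the question also matches.

```python
import re
from datetime import datetime

# Hypothetical helper: assumes the six digits are a two-digit year, month and day.
def extract_date(name):
    match = re.search(r"^[A-Z]{2,4}(\d{6})", name)
    if match is None:
        return None
    # strptime raises ValueError if the digits do not form a real date.
    return datetime.strptime(match.group(1), "%y%m%d").date()

print(extract_date("PR191030.213101.ABD"))   # 2019-10-30
print(extract_date("PZA191030_392001_USB"))  # 2019-10-30
```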
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done with:
filename_without_ending.split(".")[0][2::]
This slices the first part from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the first letters can differ, we have to ignore the letters and extract the digits.
So, using the re module (for regular expressions), apply a regex pattern to the string. It returns the part of the string that matches the pattern.
'\d' matches a digit [0-9], and the + quantifier matches one or more of them.
findall() finds all occurrences of the pattern in a given string, while search() finds only the first occurrence.
import re
s = "PR191030.213101.ABD"
print(re.findall(r"\d+", s)[0])
print(re.search(r"\d+", s).group())

regex. Find multiple occurrence of pattern

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?
You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
if the occurrences are separated by at least one whitespace character. If whitespace between them is optional, you can use:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
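The two patterns only differ on input where codes touch each other without whitespace. The sketch below uses a made-up sample string (not from the question) to show that difference. With \s+, the trailing \b cannot match between "F56" and "F23", so only the isolated code is found.

```python
import re

pattern_plus = r"\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b"
pattern_star = r"\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b"

# Hypothetical input in which two codes touch each other without whitespace.
sample = "F56F23 G87"

plus_result = re.findall(pattern_plus, sample)
star_result = re.findall(pattern_star, sample)

print(plus_result)  # ['G87']
print(star_result)  # ['F56F23 G87']
```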
print(list(map(lambda x: x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))))
Change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)': parentheses around the whole pattern, \s* to join occurrences that are connected by whitespace, and a + after the last parenthesis to match more than one occurrence (greedy).
Now you have a list of tuples, and you need the element at index 0 of each one. You can use map and lambda for this. To remove trailing blanks I used strip().

How can I extract the information I want using this RegEx or better?

So here's the Regular Expression I have so far.
r"(?s)(?<=([A-G][1-3])).*?(?=[A-G][1-3]|$)"
It looks behind for a letter A-G followed by a digit 1-3, and looks ahead for the same sequence (or the end of the string). I've tested it using Regex101.
Here's what it returns for each match
This is the string I'm testing it against,
"A1 **ACBFEKJRQ0Z+-** F2 **.,12STLMGHD** F1 **9)(** D2 **!?56WXP** C1 **IONVU43\"\'** E1 **Y87><** A3 **-=.,\'\"!?><()#**"
(the string shouldn't have any spaces but I needed to embolden the values between each Letter followed by a number so it is easier to see what I want)
What I want it to do is store the values between each of the matches for the group (The "Full Matches") and the matches for the group they coincide with to use later.
In the end I would like to end up with either a list of tuples or a dictionary for example:
dict = {"A1":"ACBFEKJRQ0Z+-", "F2":",12STLMGHD", "F1":"9)(", "next group match":"characters that follow"}
or
list_of_tuples = (["A1","ACBFEKJRQ0Z+-"], ["F2","12STLMGHD"], ["F1","9)("], ["next group match","characters that follow"])
The string being compared to the RegEx won't ever have something like "C1F2" btw
P.S. Excuse the terrible explanation, any help is greatly appreciated
I suggest
(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)
The (?s) enables . to match line breaks, ([A-G][1-3]) captures the uppercase letter+digit into Group 1, and ((?:(?![A-G][1-3]).)*) captures into Group 2 all text up to, but not including, the next uppercase letter+digit sequence.
The same regex can be unrolled as ([A-G][1-3])([^A-G]*(?:[A-G](?![1-3])[^A-G]*)*) for better performance (no re.DOTALL modifier or (?s) is necessary with it).
Python demo:
import re
regex = r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)"
test_str = """A1 ACBFEKJRQ0Z+-F2.,12STLMGHDF19)(D2!?56WXPC1IONVU43"'E1Y87><A3-=.,'"!?><()#"""
dct = dict(re.findall(regex, test_str))
print(dct)
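As a sanity check, the sketch below runs the tempered and the unrolled pattern on the same test string and compares the results. It also iterates over the pairs as a list, which is the other output shape the question asked for; a list keeps repeated keys, whereas dict() would keep only the last value for a repeated key.

```python
import re

tempered = r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)"
unrolled = r"([A-G][1-3])([^A-G]*(?:[A-G](?![1-3])[^A-G]*)*)"

test_str = """A1 ACBFEKJRQ0Z+-F2.,12STLMGHDF19)(D2!?56WXPC1IONVU43"'E1Y87><A3-=.,'"!?><()#"""

pairs = re.findall(tempered, test_str)

# Both patterns should split the string into the same (key, value) pairs.
print(pairs == re.findall(unrolled, test_str))

# Each pair is (letter+digit key, text that follows it).
for key, value in pairs:
    print(key, repr(value))
```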

Return any number of matching groups with re findall in python

I have a relatively complex string that contains a bunch of data. I am trying to extract the relevant pieces of the string using a regex command. The portions I am interested in are contained in square brackets, like this:
s = '"data":["value":3.44}] lol haha "data":["value":55.34}] "data":["value":2.44}] lol haha "data":["value":56.34}]'
And the regex expression I have built is as follows:
l = re.findall(r'\"data\"\:.*(\[.*\])', s)
I was expecting this to return
['["value":3.44}]', '["value":55.34}]', '["value":2.44}]', '["value":56.34}]']
But instead all I get is the last one, i.e.,
['["value":56.34}]']
How can I catch 'em all?
It's because quantifiers are greedy by default. So .* will match everything between the first "data": and the last [, so there's only one [...] left to match.
Use non-greedy quantifiers by adding ?.
l = re.findall(r'\"data\"\:.*?(\[.*?\])', s)
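The difference between the greedy and the non-greedy pattern can be checked side by side. The sketch below also includes a third variant using a negated character class, which is an alternative not mentioned above; it assumes the bracketed part directly follows "data": as it does in this sample.

```python
import re

s = '"data":["value":3.44}] lol haha "data":["value":55.34}] "data":["value":2.44}] lol haha "data":["value":56.34}]'

# Greedy: the first .* runs to the last [ in the string, leaving one bracketed part.
greedy = re.findall(r'"data":.*(\[.*\])', s)

# Non-greedy: each quantifier stops as early as possible.
lazy = re.findall(r'"data":.*?(\[.*?\])', s)

# Negated class: [^\]]* cannot run past the first ], so no lazy quantifier is needed.
negated = re.findall(r'"data":(\[[^\]]*\])', s)

print(greedy)  # ['["value":56.34}]']
print(lazy)    # ['["value":3.44}]', '["value":55.34}]', '["value":2.44}]', '["value":56.34}]']
print(lazy == negated)  # True
```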
You can also use finditer to extract the relevant content iteratively:
import re
s = '"data":["value":3.44}] lol haha "data":["value":55.34}] "data":["value":2.44}] lol haha "data":["value":56.34}]'
for m in re.finditer(r'(\[.*?\])', s):
    print(m.group(1))
OUTPUT
["value":3.44}]
["value":55.34}]
["value":2.44}]
["value":56.34}]
