regex. Find multiple occurrence of pattern - python

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?

You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
The non-capturing group (?:\s+[FG]\d{2})* matches zero or more further F/G codes, each separated from the previous one by whitespace.
Code:
>>> import re
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']

So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
if consecutive occurrences must be separated by at least one whitespace character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
For the sample string, both the first and the second regex produce the same result:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
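The sample string does not show where the two variants differ. Here is a minimal sketch, assuming a hypothetical input (not from the question) in which two codes are written with no space between them:

```python
import re

# Hypothetical input: "F56F23" has no separator between the two codes.
text = "F56F23 and G87"

# Variant 1: consecutive codes must be separated by whitespace.
with_space = re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b', text)

# Variant 2: whitespace between consecutive codes is optional.
optional_space = re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b', text)

print(with_space)
print(optional_space)
```

With \s+ the run F56F23 is skipped entirely, because there is no word boundary between 6 and F; with \s* it comes back as a single match.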

print(list(map(lambda x: x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))))
Change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)': an extra pair of brackets around everything, \s* to join occurrences that are connected by whitespace, and a + after the inner bracket to match more than one occurrence (greedy).
Now you have a list of tuples, of which you need the element at index 0 of each. You can use map and lambda for this. To remove the trailing blanks I used strip().
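To make the "element at index 0" step concrete, here is a small sketch of what findall returns for this pattern before any post-processing. The variable names are only for illustration:

```python
import re

my_string = "this data is F56 F23 and G87"
pattern = r'((\b(F\d{2}|G\d{2})\b\s*)+)'

# With several capturing groups, findall returns one tuple per match:
# (whole run, last repetition of the inner group, last code matched).
raw = re.findall(pattern, my_string)
print(raw)

# Only the first element of each tuple is the complete run.
runs = [groups[0].strip() for groups in raw]
print(runs)
```

A list comprehension does the same job as map with a lambda here; which one to use is a matter of taste.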

Related

How to extract a repeated pattern from a word without space

I have the following word: "PANGOLINUPANGO" and would like to split it into ["PANGO","LINUP","PANGO"]. So in general splitting by repeated pattern appearing in the word (not a string with spaces).
I have tried the following Python re expression but can't get what I need:
[m.group(0) for m in re.finditer(r"(\D)\1*", s)]
It can also be like the following: 'VRJAMVRJAM' which should result into ['VRJAM','VRJAM'], so not necessarily non-contiguous repeats.
Here's a solution:
(\w+)(\w*)(\1)
(\w+) captures the leading letters, (\w*) matches any optional letters in the middle, and (\1) is a backreference that matches the same text the first group captured.
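A quick sketch of how this pattern behaves on the two words from the question. The helper name split_repeat is hypothetical, and matching from the start of the word with match() is an assumption:

```python
import re

# The pattern from the answer: prefix, optional middle, then the prefix again.
PATTERN = re.compile(r'(\w+)(\w*)(\1)')

def split_repeat(word):
    """Return (prefix, middle, repeat), or None if no prefix is repeated."""
    m = PATTERN.match(word)
    return m.groups() if m else None

print(split_repeat("PANGOLINUPANGO"))
print(split_repeat("VRJAMVRJAM"))
print(split_repeat("ABC"))
```

Note that for PANGOLINUPANGO the middle part comes out as LINU, since the P belongs to the second PANGO. Because (\w+) is greedy, the longest prefix that occurs again later is chosen, and the pattern does not require the repeat to sit at the very end of the word.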

returning info from a string

return poker_hand(list_of_five_cards) returns a string similar to this:
**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)
and I have created a string out of it. I want the information inside the brackets. In this vein I have tried:
s = str(poker_hand(one_man))
print s
the_search = re.search(r"\((\w+)\)", s)
and this returns None when you type print the_search. I have also tried
s[s.find("(")+1:s.find(')')]
print s
which returns the whole string. Does anyone know what I am doing wrong?
EDIT: sorry for the confusion, I should have been clearer:
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
re the assigning... trying to assign it now, will post the results
The pattern you are using to find the item in brackets is not right: \w+ matches neither the space nor the period in "One pair.", so the search fails.
You can test your regex at http://regexr.com/
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
pattern = r'\(.+\.\)'
for item in re.findall(pattern, s):
    print(item.strip('().'))
output:
One pair
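As a sketch of why the two attempts in the question go wrong (the relaxed pattern below is just one possible fix, not the only one):

```python
import re

s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'

# \w+ cannot cross the space or the period, so nothing matches.
print(re.search(r"\((\w+)\)", s))

# Allowing spaces and an optional trailing period makes it match.
m = re.search(r"\(([\w ]+)\.?\)", s)
print(m.group(1))

# The slice from the question works too; its result just has to be
# assigned (or printed) instead of printing s again.
inside = s[s.find("(") + 1:s.find(")")]
print(inside)
```

The slice keeps the trailing period, so it would still need something like rstrip('.') to get the desired output.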
IIUC, your string always ends with the closing bracket. Then try this:
'**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'.split('(')[1][:-1]
Out[1]: 'One pair.'
The idea is to split on the opening bracket, take what comes after it, and drop the closing bracket with [:-1]. Note that the trailing period is kept.
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
You can use something like:
import re
string = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"
result = re.findall(r"\((.*?)\.?\)", string )
print(result[0])
Regex Explanation:
\((.*?)\.?\)
Match the character “(” literally «\(»
Match the regex below and capture its match into backreference number 1 «(.*?)»
   Match any single character that is NOT a line break character (line feed) «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “.” literally «\.?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character “)” literally «\)»
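The same breakdown can be kept next to the pattern itself with re.VERBOSE. This is only a sketch; the name PAREN_TEXT is made up for the example:

```python
import re

# The pattern from the answer, written out with inline comments.
PAREN_TEXT = re.compile(r"""
    \(       # a literal opening parenthesis
    (.*?)    # group 1: any characters except newline, as few as possible
    \.?      # an optional literal period
    \)       # a literal closing parenthesis
""", re.VERBOSE)

string = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"
print(PAREN_TEXT.findall(string))
```

In verbose mode whitespace inside the pattern is ignored and # starts a comment, so the regex stays identical to the one-line version.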
Use the groups:
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
print(s)
m = re.search(r'\(([\s\S]+)\.\)', s)
print(m.group(1))

Regular expression to limit multiple occurrence of any character to two

I am looking for a regular expression to limit multiple occurrence of any character in a string to two.
eg: Reallllly like and Sooooooo good should be converted to Really like and So good.
This replaces every sequence of three or more identical characters with just two:
re.sub(r'(.)\1{2,}', r'\1\1', "Realllllly goooood")
Edit: fixed typo.
I don't know how to do it with a regex, but itertools.groupby works well:
>>> from itertools import groupby
>>> g = groupby('reallllly goood')
>>> ''.join(''.join(list(x)[:2]) for _,x in g)
'really good'
The answer from @pacholik is almost right.
Proper expression:
re.sub(r'(.)\1{2,}', r'\1\1', "Realllllly goood")
This replaces runs of three or more identical characters, not four or more: the first character (.) plus two or more repeats \1{2,} is replaced with two copies of that character, \1\1.
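If the limit should be configurable, the same idea can be wrapped in a function. A minimal sketch, where limit_repeats and max_run are hypothetical names and max_run is assumed to be at least 1:

```python
import re

def limit_repeats(text, max_run=2):
    """Shorten every run of more than max_run identical characters."""
    # (.) captures one character; \1{max_run,} requires at least max_run
    # further copies, i.e. a run of max_run + 1 or more in total.
    pattern = r'(.)\1{%d,}' % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, text)

print(limit_repeats("Realllllly goooood"))
print(limit_repeats("Sooooooo good"))
print(limit_repeats("Sooooooo good", max_run=1))
```

Runs that are already within the limit, such as the oo in good, are left alone when max_run is 2.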

Matching both possible solutions in Regex

I have a string aaab. I want a Python expression to match aa, so I expect the regular expression to return aa and aa since there are two ways to find substrings of aa.
However, this is not what's happening.
This is what I've done:
a = "aaab"
b = re.match('aa', a)
You can achieve it with a look-ahead and a capturing group inside it:
(?=(a{2}))
Since a look-ahead does not consume any characters, the engine can test the same text again from the next position, which enables overlapping matches.
Python code:
import re
p = re.compile(r'(?=(a{2}))')
test_str = "aaab"
print(re.findall(p, test_str))
To generalize @stribizhev's solution to match one or more of the character a: (?=(a{1,}))
For three or more: (?=(a{3,})) etc.
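The look-ahead trick can be wrapped in a small helper that also reports where each overlapping match starts. This is a sketch; overlapping is a hypothetical name, and the sub-pattern is assumed to be a valid regex fragment:

```python
import re

def overlapping(sub_pattern, text):
    """Return (start, match) pairs, including overlapping matches."""
    # The look-ahead itself matches the empty string, so the scan moves
    # forward one position at a time; group 1 holds the text found there.
    lookahead = re.compile(r'(?=(' + sub_pattern + r'))')
    return [(m.start(), m.group(1)) for m in lookahead.finditer(text)]

print(overlapping('aa', 'aaab'))
print(overlapping('a{1,}', 'aaab'))
```

With a{1,} each position yields the longest run of a starting there, so the results get shorter as the scan moves right.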

Python regexp: get all group's sequence

I have a regex like this '^(a|ab|1|2)+$' and want to get the full sequence of blocks it matched.
For example, for re.search(reg, 'ab1') I want to get ('ab','1').
I can get an equivalent result with the '^(a|ab|1|2)(a|ab|1|2)$' pattern,
but I don't know in advance how many blocks will be matched by (pattern)+.
Is this possible, and if yes - how?
Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you'll get back just a list of strings, one string per match. If you had more than one group, you'd get back a list of tuples, each containing one string per group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I removed the ^ and $ anchors from the pattern because, if findall has to find more than one substring, it must be able to search anywhere in str. I also removed the +, so each match is a single occurrence of one of the options in the pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
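A short sketch of that behaviour, together with one possible workaround: validate the whole string first, then pull out the blocks with findall. FULL, TOKEN and blocks are hypothetical names, and the alternatives are reordered so that ab is tried before a:

```python
import re

# With a repeated capturing group, only the last repetition is kept.
m = re.search(r'^(a|ab|1|2)+$', 'ab1')
print(m.group(0), m.groups())

FULL = re.compile(r'(?:ab|a|1|2)+')   # validates the whole string
TOKEN = re.compile(r'ab|a|1|2')       # extracts one block at a time

def blocks(text):
    """Return the list of blocks, or None if text is not made of them."""
    if FULL.fullmatch(text) is None:
        return None
    return TOKEN.findall(text)

print(blocks('ab1'))
print(blocks('abx'))
```

One caveat: findall picks blocks left to right without backtracking, so for some sets of alternatives the split it returns may differ from the one the anchored pattern used internally.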
I think you don't need regexes for this problem;
you need a recursive graph search function instead.
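One way to read that suggestion, as a rough sketch rather than the answerer's actual code: a recursive function that tries every block at the current position and collects all complete splits. The name segmentations is made up for the example:

```python
def segmentations(text, tokens):
    """Return every way to write text as a sequence of the given tokens."""
    if not text:
        return [[]]  # one way to split the empty string: no tokens at all
    results = []
    for tok in tokens:
        if text.startswith(tok):
            # Recurse on the remainder and prepend the token we just used.
            for rest in segmentations(text[len(tok):], tokens):
                results.append([tok] + rest)
    return results

print(segmentations('ab1', ['a', 'ab', '1', '2']))
print(segmentations('7325189087', ['7325189', '7325', '9087', '087', '18']))
```

Unlike a single regex match, this returns all valid splits, which matters when a string can be decomposed in more than one way. For long inputs it can get slow, and memoizing on the remaining text would be the usual remedy.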
