using regular expression to split string in python

using regular expression to split string in python - python

I use
re.compile(r"(.+?)\1+").findall('44442(2)2(2)44')
can get
['4','2(2)','4']
, but how can I get
['4444','2(2)2(2)','44']
by using regular expression?
Thanks

No change to your pattern needed. Just need to use to right function for the job. re.findall will return a list of groups if there are capturing groups in the pattern. To get the entire match, use re.finditer instead, so that you can extract the full match from each actual match object.
pattern = re.compile(r"(.+?)\1+")
[match.group(0) for match in pattern.finditer('44442(2)2(2)44')]

With minimal change to OP's regular expression:
[m[0] for m in re.compile(r"((.+?)\2+)").findall('44442(2)2(2)44')]
findall will give you the full match if there are no groups, or groups if there are some. So given that you need groups for your regexp to work, we simply add another group to encompass the full match, and extract it afterwards.

You can do:
[i[0] for i in re.findall(r'((\d)(?:[()]*\2*[()]*)*)', s)]
Here the Regex is:
((\d)(?:[()]*\2*[()]*)*)
which will output a list of tuples containing the two captured groups, and we are only interest din the first one hence i[0].
Example:
In [15]: s
Out[15]: '44442(2)2(2)44'
In [16]: [i[0] for i in re.findall(r'((\d)(?:[()]*\2*[()]*)*)', s)]
Out[16]: ['4444', '2(2)2(2)', '44']

Related

Which pattern was matched among those that I passed through a regular expression?

I am using regexp with pyhton and the library re. The regular expression I am passing contains many possible variations of a string, such as:
myRExp = ("aaaaa|bbbbb|ccccc|ddddd")
This is what I am doing to match the full regular expression
# read a file with two columns
df = pd.read_csv('a_file.csv')
# get second column and create a unique regular expression
myRExp = "|".join(df[df.columns[1]])
# now test if line contains myRExp
if re.match(myRExp, line):
# get the actual matching pattern and do something with it
What I need to do is to know which substring from myRExp was actually matching the line, i.e. which one between "aaaaa", "bbbbb", "ccccc" or "ddddd", matched?
EDIT:
Let's go with the example. This is my regular expression:
>>> linE = 'zzzzbbdbbxxx'
>>> myRExp = "(aa[a|b]a)|(bb[c|d]bb)|(ccc[d|c]c)"
by re.match() I can now match it and get this output (note that I am using search to make my point here):
# do we have a match? (yes)
>>> matched = re.search(myRExp, linE)
# show groups: I partially care
>>> matched.groups(0)
(0, 'bbdbb', 0)
At this point, what I need is the index of the regular expression that matched: the match was (bb[c|d]bb), then the output should be 2, i.e. the index of that regular expression group in myRExp:
index of matched.groups(0) in myRExp
Is there any way of obtaining the index?

Grab the "match object" returned by the regex call, and you can examine it:
m = re.match(myRExp, line)
if m:
print("Matched", m.group(0))
This will show you the part of your string that matched, which in this case is the simplest way to get what you are after.
If your regex contains groups and you want to know exactly which of the groups matched, use m.groups() instead:
>>> probe = "(orange)|(or)|(or.*)"
>>> m = re.match(probe, 'order')
>>> m.groups()
(None, 'or', None)
There should only be one value that is not None, so you can take its index and look up the regex in your list of regex substrings. Here's one way to find the index with a one-liner:
>>> match_index = list(map(bool, m.groups())).index(True)

I would suggest, that you can use this website
There you can tinker and adapt your regular expressions and get visual feedback what is matched, when providing test strings. Also the syntax is documented for the rare case you forget some commands ;)

regex. Find multiple occurrence of pattern

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?

You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']

So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
in case it is separated by at least one spacing character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']

print map(lambda x : x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))
change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)' (brackets around, /s* to find all, that are connected by whitespaces, a + after the last bracket to find more than one occurance (greedy)
now you have a list of lists, of which you need every 0th Argument. You can use map and lambda for this. To kill last blanks I used strip()

Python parentheses and returning only certain part of regex

I have a list of strings that I'm looping through. I have the following regular expression (item is the string I'm looping through at any given moment):
regularexpression = re.compile(r'set(\d+)e', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
What I want it to do is return numbers that have the word set before them and the letter e after them.
However, I also want it to return numbers that have set before them and x after them. If I use the following code:
regularexpression = re.compile(r'set(\d+)(e|x)', re.IGNORECASE)
number = re.search(regularexpression,item).group(1)
Instead of returning just the number, it also returns e or x. Is there a way to use parentheses to group my regular expression into bits without it returning everything in the parentheses?

Your example code seems fine already, but to answer your question, you can make a non-capturing group using the (?:) syntax, e.g.:
set(\d+)(?:e|x)
Additionally, in this specific example you can just use a character class:
set(\d+)[ex]

It appears you are looking at more than just .group(1); you have two capturing groups defined in your regular expression.
You can make the second group non-capturing by using (?:...) instead of (...):
regularexpression = re.compile(r'set(\d+)(?:e|x)', re.IGNORECASE)

Python regexp: get all group's sequence

I have a regex like this '^(a|ab|1|2)+$' and want to get all sequence for this...
for example for re.search(reg, 'ab1') I want to get ('ab','1')
Equivalent result I can get with '^(a|ab|1|2)(a|ab|1|2)$' pattern,
but I don't know how many blocks been matched with (pattern)+
Is this possible, and if yes - how?

try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
print i
The ab option has been moved to be first, so it will match ab in favor of just a.
findall method matches your regular expression more times and returns a list of matched groups. In this simple example you'll get back just a list of strings. Each string for one match. If you had more groups you'll get back a list of tuples each containing strings for each group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I'm removing the ^$ signs from the pattern because if findall has to find more than one substring, then it should search anywhere in str. Then I've removed + so it matches single occurences of those options in pattern.

Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.

I think you don't need regexpes for this problem,
you need some recursial graph search function

How to make re.split() inclusive

I have the following:
>>> x='STARSHIP_TROOPERS_INVASION_2012_LOCDE'
>>> re.split('_\d{4}',x)[0]
'STARSHIP_TROOPERS_INVASION'
How would I get the year included? For example:
STARSHIP_TROOPERS_INVASION_2012
Note there are tens of thousands of titles, and I need to split on the year for each. I can't do a normal python split() here.

A more straightforward solution would be using re.search()/MatchObject.end():
m = re.search('_\d{4}', x)
print x[:m.end(0)]
If you want to stick with split(), you can use a lookbehind:
re.split('(?<=_\d{4}).', x)
(This work even when the year is at the end of the string, because split() returns an array with the original string in case the delimiter isn't found.)

If its always going to be the same pattern, then why not:
>>> x = 'STARSHIP_TROOPERS_INVASION_2012_LOCDE'
>>> x[:x.rfind('_')]
'STARSHIP_TROOPERS_INVASION_2012'
For your original regular expression, since you aren't capturing the matched group, it is not part of your matches:
>>> re.split('_\d{4}',x)
['STARSHIP_TROOPERS_INVASION', '_LOCDE']
>>> re.split('_(\d{4})',x)
['STARSHIP_TROOPERS_INVASION', '2012', '_LOCDE']
The () marks the selection as a captured group:
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed, and can be matched later
in the string with the \number special sequence, described below. To
match the literals '(' or ')', use ( or ), or enclose them inside a
character class: [(] [)].

you may use both split() and search() supposing you have a single such date in your string you wish to split at.
import re
x='STARSHIP_TROOPERS_INVASION_2012_LOCDE'
date=re.search('_\d{4}',x).group(0)
print(date)
gives
>>>
_2012
and
print(re.split('_\d{4}',x)[0]+date)
gives
STARSHIP_TROOPERS_INVASION_2012

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

using regular expression to split string in python - python

I use re.compile(r"(.+?)\1+").findall('44442(2)2(2)44') can get ['4','2(2)','4'] , but how can I get ['4444','2(2)2(2)','44'] by using regular expression? Thanks

Related

Which pattern was matched among those that I passed through a regular expression?

regex. Find multiple occurrence of pattern

Python parentheses and returning only certain part of regex

Python regexp: get all group's sequence

How to make re.split() inclusive

Categories

Resources