I have a regex like this:
query = "(A((hh)|(hn)|(n))?)"
and an input inp = "Ahhwps edAn". I want to extract every match of the pattern along with the unmatched (remaining) text, preserving the order of the input.
The output should look like ['Ahh', 'wps ed', 'An'] or ['Ahh', 'w', 'p', 's', ' ', 'e', 'd', 'An'].
I searched online but found nothing.
How can I do this?
re.split includes the text of any capturing groups in the resulting list.
Capturing groups are the constructs formed with a pair of unescaped parentheses. Your pattern is full of redundant capturing groups, and re.split will return all of them. Remove the unnecessary ones, convert the remaining inner groups to non-capturing ones, and keep only the outer pair of parentheses so that the whole pattern is a single capturing group.
Use
re.split(r'(A(?:hh|hn|n)?)', s)
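Note that there may be empty strings in the output list, for example when the input starts or ends with a match. Use filter(None, result) to get rid of them; in Python 3, wrap it in list(...) because filter returns an iterator.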
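As a minimal sketch of the advice above (the helper name split_keep_matches is made up for illustration), the following splits on a single outer capturing group and drops the empty strings. The second part shows one possible way to get the character-by-character output from the question by adding a catch-all alternative; that variant is an assumption about what the asker wants, not part of the answer above.

```python
import re

# Only the outer pair of parentheses captures; the inner group is non-capturing.
PATTERN = re.compile(r'(A(?:hh|hn|n)?)')

def split_keep_matches(text):
    """Split text on PATTERN, keeping the matches and dropping empty strings."""
    return [part for part in PATTERN.split(text) if part]

inp = "Ahhwps edAn"
grouped = split_keep_matches(inp)
print(grouped)
# ['Ahh', 'wps ed', 'An']

# Alternative output: every unmatched character becomes its own element.
# The trailing "|." matches any single character the pattern did not consume.
per_char = re.findall(r'A(?:hh|hn|n)?|.', inp, flags=re.DOTALL)
print(per_char)
# ['Ahh', 'w', 'p', 's', ' ', 'e', 'd', 'An']
```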
The match objects' span() method is really useful for what you're after.
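import re
pat = re.compile("(A((hh)|(hn)|(n))?)")
inp = "Ahhwps edAn"
result = []
i = k = 0
for m in re.finditer(pat, inp):
    j, k = m.span()
    if i < j:
        # unmatched text between the previous match and this one
        result.append(inp[i:j])
    # the match itself
    result.append(inp[j:k])
    i = k
if i < len(inp):
    # unmatched text after the last match
    result.append(inp[k:])
print(result)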
Here's what the output looks like.
['Ahh', 'wps ed', 'An']
This technique handles any non-matching prefix and suffix text as well. If you use an inp value of "prefixAhhwps edAnsuffix", you'll get the output I think you'd want:
['prefix', 'Ahh', 'wps ed', 'An', 'suffix']
You can try this:
import re
import itertools
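# inp is the input string from the question; each whitespace-separated word is cut into chunks of half its length
new_data = list(itertools.chain.from_iterable([re.findall(".{"+str(len(i) // 2)+"}", i) for i in inp.split()]))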
Output:
['Ahh', 'wps', 'ed', 'An']
Related
I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns, joining consecutive ones (separated by a space) into a single string.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this only gives the pairs, not the single ones. I want both kinds.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_ = 'NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x:x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res
join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require any modification if you expect three or more consecutive tokens of the same type:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. Here is a fast solution using only regular expressions; the code is explained in the comments:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean up the target words, turning word/NNP into word.
# You could use str.replace, but this shows a technique:
# refer back to a captured group with \index_of_group.
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining REGEX:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+): a capturing group wrapped around the repeated non-capturing subgroup (?:(?:\s)?\w+(?=\s+|$)). The subgroup matches each word that is followed by spaces or the end of the line, and the outer group captures the whole sequence. Without the outer group, the match would return only the first word.
(?:\s+|$) : remove trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep a sequence of word/NNP tokens in a single group. A pattern like (word)/NNP (word)/NNP returns two elements in one group, not a single text. Once /NNP is removed the text becomes word word, so the regex ((?:\w+\s)+) can capture the sequence of words. It is not quite that simple, because we must capture only the words that do not have /sequence_of_letters at the end. The benefit is that there is no need to loop over the matched groups and concatenate elements to build the final text.
NOTE: both solutions work fine if all words are in the format word/sequence_of_letters. If some words are not in this format,
you need to fix them first: to keep them, add /NNP at the end of each word; otherwise add /DUMMY to remove them.
Using re.split, but slow because I'm using a list comprehension to fix the result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']
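The two-step approach above can be wrapped in a small function. This is a sketch only: the name extract_tagged and its tag parameter are hypothetical, it assumes tokens always have the form word/TAG separated by whitespace, and it adds a \b after the tag (not in the answer above) so that a tag such as NNPS is not mistaken for NNP.

```python
import re

def extract_tagged(tagged, tag="NNP"):
    """Return runs of consecutive word/tag tokens, joined with the tag removed."""
    suffix = "/" + re.escape(tag) + r"\b"
    token = r"\w+" + suffix
    # Step 1: find each run of one or more consecutive tokens carrying the tag.
    runs = re.findall(token + r"(?:\s+" + token + r")*", tagged)
    # Step 2: strip the "/TAG" suffix from every token in the run.
    return [re.sub(suffix, "", run) for run in runs]

tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
print(extract_tagged(tagged_sent_str))
# ['European Community', 'European']
```

Because the run is matched with a repeated group, three or more consecutive tokens are handled without changing the pattern.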
I need to process strings in Python in this way:
I have strings with '>=', '=', '<=', '<' or '>' in front of them, for example:
'>=1_2_3'
'<2_3_2'
What I want to achieve is splitting the strings to obtain, respectively:
'>=', '1_2_3'
'<', '2_3_2'
Basically I need to split them at the first numeric character.
Is there a way to achieve this with regular expressions, without iterating over the string and checking whether each character is a number or a '_'?
Thank you.
This will do:
re.split(r'(^[^\d]+)', string)[1:]
Example:
>>> re.split(r'(^[^\d]+)', '>=1_2_3')[1:]
['>=', '1_2_3']
>>> re.split(r'(^[^\d]+)', '<2_3_2')[1:]
['<', '2_3_2']
import re
strings = ['>=1_2_3','<2_3_2']
for s in strings:
    mat = re.match(r'([^\d]*)(\d.*)', s)
    print(mat.groups())
Outputs:
('>=', '1_2_3')
('<', '2_3_2')
This just groups everything up until the first digit in one group, then that first digit and everything after into a second.
You can access the individual groups with mat.group(1), mat.group(2)
You can split using this regex (re.split only splits on zero-width matches like this one in Python 3.7 and later):
(?<=[<>=])(?=\d)
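A short sketch of how that lookaround pattern could be used. It relies on re.split accepting patterns that match the empty string, which Python supports from version 3.7 on; the helper name split_operator is made up for illustration.

```python
import re

# Zero-width split point: preceded by an operator character, followed by a digit.
BOUNDARY = re.compile(r'(?<=[<>=])(?=\d)')

def split_operator(s):
    """Split a string such as '>=1_2_3' into [operator, value]."""
    return BOUNDARY.split(s, maxsplit=1)

for s in ['>=1_2_3', '<2_3_2']:
    print(split_operator(s))
# ['>=', '1_2_3']
# ['<', '2_3_2']
```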
There's probably a better way but you can split with a capture then join the second two elements:
values = re.split(r'(\d)', '>=1_2_3', maxsplit = 1)
values = [values[0], values[1] + values[2]]
I tried some code like below.
re.findall(r'(\d{2}){2}', 'shs111111111')
I want to get the result like
11111111
but the result is
['11', '11']
Edit:
I made some errors in the example; what I really need is to find all the repeated substrings.
Like this:
re.findall(r'([actg]{2,}){2,}', 'aaaaaaaccccctttttttttt')
I would like the result to be ['aaaaaaa', 'ccccc', 'tttttttttt']
But I got ['aa', 'cc', 'tt']
What's the problem, and how can I do this?
I believe you need this regex:
>>> print re.findall(r'(?:\d{2}){2,}', 'shs111111111');
['11111111']
EDIT: Based on edited question you can use:
>>> print re.findall(r'(([actg\d])\2+)', 'aaaaaaaccccctttttttttt');
[('aaaaaaa', 'a'), ('ccccc', 'c'), ('tttttttttt', 't')]
And grab captured group #1 from each pair.
Using finditer:
>>> arr=[]
>>> for match in re.finditer(r'(([actg\d])\2+)', 'aaaaaaaccccctttttttttt') :
...     arr.append( match.groups()[0] )
...
>>> print arr
['aaaaaaa', 'ccccc', 'tttttttttt']
When the pattern contains capturing groups, re.findall returns the groups rather than the whole match. So use
re.findall(r'(?:\d{2}){2}', 'shs111111111')
Just make the group non-capturing.
Relevant doc excerpt:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
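To make the quoted behaviour concrete, here is a small sketch comparing how re.findall treats capturing and non-capturing groups. The sample string is built explicitly as 'shs' followed by nine 1s, and the variable names are only for illustration.

```python
import re

s = 'shs' + '1' * 9

# One capturing group: findall returns only that group (its last repetition).
one_group = re.findall(r'(\d{2}){2}', s)
print(one_group)        # ['11', '11']

# Non-capturing group: findall returns the whole match.
no_group = re.findall(r'(?:\d{2}){2}', s)
print(no_group)         # ['1111', '1111']

# Open-ended repetition takes the longest even-length run of digits.
open_ended = re.findall(r'(?:\d{2}){2,}', s)
print(open_ended)       # ['11111111']

# Two capturing groups: findall returns a list of tuples.
two_groups = re.findall(r'(\d)(\d)', s)
print(two_groups)       # [('1', '1'), ('1', '1'), ('1', '1'), ('1', '1')]
```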
(([acgt])\2+)
Use this pattern and keep the first element of each result tuple:
x="aaaaaaaccccctttttttttt"
print([i[0] for i in re.findall(r'(([acgt])\2+)', x)])
re.findall alone cannot return a plain ['aaaaaaa', 'ccccc', 'tttttttttt'] list, because checking for repetition with a back-reference requires a capturing group, and findall then returns the groups. Here is a regex with a named group, letter, that holds a, or b, etc.; the (?P=letter)+ back-reference then matches all repetitions of that group.
((?P<letter>[a-zA-Z])(?P=letter)+)
This regex is best used with finditer, as described in @anubhava's post.
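A sketch of that finditer approach: m.group(0) is the whole match, so the capturing group needed for the back-reference never shows up in the result. The outer parentheses of the pattern are dropped here because group(0) already covers the full run; the function name repeated_runs is made up for illustration.

```python
import re

# The named group holds one letter; the back-reference repeats that same letter.
RUN = re.compile(r'(?P<letter>[a-zA-Z])(?P=letter)+')

def repeated_runs(text):
    """Return every run of two or more identical letters, in order."""
    return [m.group(0) for m in RUN.finditer(text)]

seq = 'aaaaaaaccccctttttttttt'
runs = repeated_runs(seq)
print(runs)
# ['aaaaaaa', 'ccccc', 'tttttttttt']
```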
I'm trying to write a Regex for strings -
c190_12_1-10
c129
abc_1-90
to separate to -
['c190_12_', '1', '10']
['c', '129']
['abc_', '1', '90']
So far I've come up with (\D+)(\d+)-?(\d+)?
But it doesn't work for all combinations. What am I missing here?
You can use this:
import re
items = ['c190_12_1-10', 'c129', 'abc_1-90']
reg = re.compile(r'^(.+?)(\d+)(?:-(\d+))?$')
for item in items:
    m = reg.match(item)
    print(m.groups())
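For reference, a sketch of what this pattern returns for the three sample strings, with the optional third group left out when it did not take part in the match (the helper name split_item is hypothetical). The lazy (.+?) lets the prefix contain digits, which is why the question's (\D+) stops too early on c190_12_1-10.

```python
import re

REG = re.compile(r'^(.+?)(\d+)(?:-(\d+))?$')

def split_item(item):
    """Return [prefix, number] or [prefix, start, end]; None if nothing matches."""
    m = REG.match(item)
    if m is None:
        return None
    # Optional groups that did not participate are None; leave them out.
    return [g for g in m.groups() if g is not None]

for item in ['c190_12_1-10', 'c129', 'abc_1-90']:
    print(split_item(item))
# ['c190_12_', '1', '10']
# ['c', '129']
# ['abc_', '1', '90']
```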
Not sure what exactly you do and don't want to match, but this might work for you:
(?:(\w+)(\d+)-|([a-z]+))(\d+)$
http://regex101.com/r/uA3eZ4
The secret here was working backwards from the end, where the condition always seems to be the same. Then, using the alternation and the non-capturing group, you end up with the result you've shown.
I have a string like "SAB_bARGS_D". What I want is for the string to be divided into a list of characters, but whenever there is a _ sign, the next character gets appended to the previous one.
So the answer for the above should be ['S','A','B_b','A','R','G','S_D']
It can be done with a for loop traversing the string, but is there a built-in function that I can use?
Thanks a lot
Update
Hello all
Thanks to Robert Rossney and aaronasterling I got the required answer, but I have a very similar question that I will ask here as well. Let's say that my string now follows the criterion that it can contain a letter, or a letter followed by _ and a number. How can I separate the string into a list now? The suggested solutions cannot be used as they are, since S_10 would be separated into S_1 and 0. It would be helpful if someone could show how to do this using a regular expression. Thanks a lot.
I know, I'll use regular expressions:
>>> import re
>>> pattern = "[^_]_[^_]|[^_]"
>>> re.findall(pattern, "SAB_bARGS_D", re.IGNORECASE)
['S', 'A', 'B_b', 'A', 'R', 'G', 'S_D']
The pattern tries to match 3 characters in a row - non-underscore, underscore, non-underscore - and, failing that, tries to match a non-underscore character.
I would probably use a for loop.
def a_split(inp_string):
    res = []
    if not inp_string: return res # allows us to assume the string is nonempty
    # This avoids taking res[-1] when res is empty if the string starts with _
    # and simplifies the loop.
    inp = iter(inp_string)
    last = next(inp)
    res.append(last)
    for c in inp:
        if '_' in (c, last): # might want to use (c == '_' or last == '_')
            res[-1] += c
        else:
            res.append(c)
        last = c
    return res
You can get some performance gain by storing res.append in a local variable and calling that directly, instead of looking up the local variable res and then performing an attribute lookup to get its append method each time.
A string like 'a_b_c' will not be split at all. No behavior was specified for this case, but it wouldn't be too hard to modify the code to do something else. Also, a string like '_ab' will split into ['_a', 'b'], and similarly for 'ab_'.
Using a regular expression
>>> import re
>>> s="SAB_bARGS_D"
>>> re.findall("(.(?:_.)?)",s)
['S', 'A', 'B_b', 'A', 'R', 'G', 'S_D']
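For the follow-up in the question's update (a letter, optionally followed by _ and a number), one possible adaptation of the pattern above is to let the optional part match an underscore plus one or more digits. This is a sketch under the assumption that the string contains only letters, underscores and digits in that arrangement; the sample input and the name split_tokens are made up.

```python
import re

# A letter, optionally followed by "_" and a run of digits.
TOKEN = re.compile(r'[A-Za-z](?:_\d+)?')

def split_tokens(s):
    """Return each letter, with a trailing _number kept attached to it."""
    return TOKEN.findall(s)

print(split_tokens("SAB_10ARGS_2"))
# ['S', 'A', 'B_10', 'A', 'R', 'G', 'S_2']
```

Using \d+ rather than a single character is what keeps S_10 together instead of splitting it into S_1 and 0.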