Regex Group Matching - python

I am trying to match a pattern like so:
Pattern: (abc)(def)(ghi)h
Match:
Group 0 = [a,b,c]
Group 1 = [d,e,f]
Group 2 = [g,h,i]
Group 3 = h
Is it possible via regex to extrapolate the data into a list like that described?
The code being used is Python for reference.

AFAIK, that's not possible in one regex. You could do something like this:
import re

matches = re.findall('[^()]+', '(abc)(def)(ghi)h')
groups = []  # not named `map`, which would shadow the built-in
for m in matches:
    groups.append(list(m))
for e in groups:
    print(e)
which will print:
['a', 'b', 'c']
['d', 'e', 'f']
['g', 'h', 'i']
['h']
EDIT
The pattern [^()] matches any character other than a ( and ), so [^()]+ matches one or more characters other than ( and ).
Everything between a [ and ] is called a character class, and will always match just a single character. The ^ at the start makes it a negated character class (matches everything-but what is defined in it).
More info about character classes: http://www.regular-expressions.info/charclass.html
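The same idea can be sketched more compactly as a list comprehension. This is only an illustration for Python 3; the names `text` and `groups` are assumptions, not part of the original answer.

```python
import re

text = '(abc)(def)(ghi)h'

# [^()]+ grabs each maximal run of characters that are not parentheses;
# list() then splits every run into its individual characters.
groups = [list(run) for run in re.findall(r'[^()]+', text)]

for group in groups:
    print(group)
```

Each inner list corresponds to one run of characters between (or after) the parentheses.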

Related

Regex to match all repeating alphanumerical subpatterns [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 2 years ago.
After searching for a while, I could only find how to match repetitions of a specific subpattern. Is there a way to find 3 or more consecutive repetitions of any subpattern?
For example:
re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
→ ['a', 'b', 'x', 'aaabbbxxx_']
re.findall(<the_regex>, 'lalala luuluuluul')
→ ['la', 'luu', 'uul']
I apologize in advance if this is a duplicate and would be grateful to be redirected to the original question.
With this lookahead-based regex you may not get exactly the output shown in the question, but you will get very close.
r'(?=(.+)\1\1)'
Code:
>>> import re
>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']
RegEx Details:
Since the whole regex is a lookahead, it does not consume any characters: a lookahead is a zero-width match. This lets findall return overlapping matches from the input.
findall returns only the capture group in our regex.
(?=: Start lookahead
(.+): Match 1 or more of any character (greedy) and capture in group #1
\1\1: Match two more occurrences of group #1 using the back-reference \1 twice
): End lookahead
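To make the zero-width behaviour visible, here is a small sketch that uses re.finditer instead of findall and records where each match starts and ends. The variable names are illustrative assumptions.

```python
import re

pattern = re.compile(r'(?=(.+)\1\1)')
text = 'lalala luuluuluul'

# Every match is zero-width (start == end); the repeated unit is in group 1.
hits = [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]

for start, end, unit in hits:
    print(start, end, repr(unit))
```

Because nothing is consumed, the scan can report 'luu' at index 7 and the overlapping 'uul' at index 8.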
re.findall() won't find overlapping matches. But you can find the non-overlapping matches using a capture group followed by a positive lookahead that matches a back-reference to that group.
>>> import re
>>> regex = r'(.+)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'a', 'b', 'x', 'a', 'b', 'x']
>>> re.findall(regex, 'lalala luuluuluul')
['la', 'luu']
>>>
This will find the longest matches; if you change (.+) to (.+?) you'll get the shortest matches at each point.
>>> regex = r'(.+?)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['a', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
It is not possible without defining the subpattern first.
Anyway, if the subpattern is just <any_alphanumeric>, then re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_') would produce something like this :
['a', 'b', 'x', 'aa', 'ab', 'bb', 'bx', 'xx', 'x_', 'aaa', 'aaab', 'aaabb', ....]
ie, every alphanumeric combination that is repeated thrice - so a lot of combinations, not just ['a', 'b', 'x', 'aaabbbxxx_']

Python: replace an exact matching substring with variable

I have a list of strings like 'cdbbdbda', 'fgfghjkbd', 'cdbbd' etc. I also have a variable fed from another list of strings. What I need is to replace a substring in the first list's strings, say b by z, only if it is preceded by a substring from the variable list, leaving all other occurrences untouched.
What I have:
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
What I do:
for i in a:
    for j in c:
        if j+'b' in i:
            i = re.sub('b', 'z', i)
What I need:
'cdzbdzda'
'fgfghjkbd'
'cdzbd'
What I get:
'cdzzdzda'
'fgfghjkbd'
'cdzzd'
all instances of 'b' are replaced.
I'm new to this, so any help is very welcome. Looking for an answer on Stack Overflow, I found many solutions using regex word boundaries, or using re or str.replace with a count, but I can't use them because the length of the string and the number of occurrences of 'b' can vary.
I think if you include j in both the pattern and the replacement, you'll get what you want.
>>> for i in a:
...     for j in c:
...         i = re.sub(j+'b', j+'z', i)
...     print(i)
...
cdzbdzda
fgfghjkbd
cdzbd
>>>
I added print(i) because your loop doesn't make in-place changes, so without that output it's not possible to see what replacements were made.
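Since rebinding the loop variable never changes the list itself, one way to keep the results is to collect them in a new list. A minimal sketch, assuming the same `a` and `c` as in the question; `new_a`, `s` and `prefix` are illustrative names.

```python
import re

a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']

new_a = []
for s in a:
    for prefix in c:
        # Replace only a 'b' that directly follows the prefix.
        s = re.sub(prefix + 'b', prefix + 'z', s)
    new_a.append(s)  # rebinding s never touches a, so store the result

print(new_a)
```

The original list `a` is left unchanged.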
You should simply use regular expressions with a positive lookbehind assertion.
Like this:
import re
for i in a:
    for j in c:
        i = re.sub('(?<=' + j + ')b', 'z', i)
The base case is:
re.sub('(?<=d)b', 'z', 'cdbbdbda')
You can use a list comprehension:
import re
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
new_a = [re.sub('|'.join('(?<={})b'.format(i) for i in c), 'z', b) for b in a]
Output:
['cdzbdzda', 'fgfghjkbd', 'cdzbd']
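If the prefixes could ever contain regex metacharacters such as '.', which is an assumption beyond the question, escaping them with re.escape keeps them literal. The helper name `replace_after` and its parameters are hypothetical.

```python
import re

def replace_after(prefixes, text, old='b', new='z'):
    """Replace `old` with `new` only where it directly follows one of `prefixes`."""
    # One lookbehind per prefix; re.escape keeps characters such as '.' literal.
    pattern = '|'.join(
        '(?<={}){}'.format(re.escape(p), re.escape(old)) for p in prefixes
    )
    return re.sub(pattern, new, text)

print(replace_after(['d', 'f', 'l'], 'cdbbdbda'))
print(replace_after(['.'], 'a.bxb'))
```

Without the escaping, a prefix of '.' would act as "any character" and the second call would replace both b's.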

split string into list by regex

I need a regex that splits an input string into a list according to the following rules:
1) By dot;
2) Do not split expression if it is in quotes.
Examples:
'a.b.c' -> ['a', 'b', 'c'];
'a."b.c".d' -> ['a', 'b.c', 'd'];
'a.'b.c'.d' -> ['a', 'b.c', 'd'];
'a.'b c'.d' -> ['a', 'b c', 'd'];
You could leverage the newer regex module with the following expression:
(["']).*?\1(*SKIP)(*FAIL)|\.
This captures an opening quote, matches up to the matching closing quote, and then makes that matched part fail so it is skipped. The other alternative is the dot itself.
In Python:
import regex as re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1(*SKIP)(*FAIL)|\.""")
for line in data.split("\n"):
    if line:
        parts = [part.strip("'").strip('"') for part in rx.split(line) if part]
        print(parts)
Which yields
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']
['a', 'b c', 'd']
See a demo on regex101.com.
If you want to stick with the re module, you could replace the dot in question before and split by the replacement afterwards.
import re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1|(?P<dot>\.)""")
needle = "SUPERMAN"
def replacer(match):
    if match.group('dot') is not None:
        return needle
    else:
        return match.group(0)

for line in data.split("\n"):
    if line:
        line = rx.sub(replacer, line)
        parts = [part.strip("'").strip('"') for part in line.split(needle) if part]
        print(parts)
This yields the exact same output as above. Please note that both approaches won't work for escaped quotes.
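Another way to stay within the re module is to match the tokens rather than the separators. The following is only a sketch under the same limitation (no escaped quotes); the names `TOKEN` and `split_dotted` are hypothetical.

```python
import re

# A token is a double-quoted run, a single-quoted run, or a run without dots.
TOKEN = re.compile(r'''"([^"]*)"|'([^']*)'|([^.]+)''')

def split_dotted(text):
    # At most one of the three groups is non-empty per match,
    # so joining the groups yields the token without its quotes.
    return [''.join(groups) for groups in TOKEN.findall(text)]

print(split_dotted('a.b.c'))
print(split_dotted('a."b.c".d'))
print(split_dotted("a.'b c'.d"))
```

Matching tokens avoids the placeholder string, at the cost of a slightly longer pattern.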
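You can also do it with a little extra work: first split on '.', then rejoin the pieces that belong to a quoted part.
string_data = 'a."b.c".d'
data = string_data.split('.')
result = []  # renamed from `list` to avoid shadowing the built-in
value = None
for i in range(len(data)):
    if value:
        # this piece was already merged into the previous quoted value
        value = None
    else:
        if '"' in data[i]:
            # rejoin the two halves of a quoted part that was split on its dot
            value = data[i] + '.' + data[i+1]
        if value:
            result.append(value)
        else:
            result.append(data[i])
print(result)
This prints ['a', '"b.c"', 'd']. Note that the quotes are kept and that only a double-quoted part containing exactly one dot is handled.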
As an alternative, you could try an alternation (|) with a positive lookbehind (?<= and a positive lookahead (?= for the single and double quotes:
(?<=").*?(?=")|(?<=').*?(?=')|[a-z]+
import re
regex = r"(?<=\").*?(?=\")|(?<=').*?(?=')|[a-z]+"
line = "a.\"b.t\".qq.d.d.'d'.'d.g.r'.d.d"
print(re.findall(regex, line))
['a', 'b.t', 'qq', 'd', 'd', 'd', '.', 'd.g.r', 'd', 'd']
here is a regex for you:
\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?
implementation:
import re
regex = re.compile( r"""\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?""")
def str2list(string):
    b = regex.findall(string)
    l = []
    # each match is a tuple of groups; keep only the non-empty ones
    for i in b:
        for j in i:
            if j:
                l.append(j)
    return l

print(str2list('a.b.c'))
print(str2list('a."b.c".d'))
print(str2list("a.'b.c'.d"))
output:
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']

Split a string in Python having parenthesis (multiple splitters)

I have a string, for example:
"ab(abcds)kadf(sd)k(afsd)(lbne)"
I want to split it to a list such that the list is stored like this:
a
b
abcds
k
a
d
f
sd
k
afsd
lbne
I need the characters outside the parentheses as separate elements, one per character, and each parenthesized group as a single element.
I am not able to think of any solution to this problem.
You can use iter to make an iterator and use itertools.takewhile to extract the strings between the parens:
from itertools import takewhile

s = "ab(abcds)kadf(sd)k(afsd)(lbne)"
it = iter(s)
print([ch if ch != "(" else "".join(takewhile(lambda x: x != ")", it)) for ch in it])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
If ch is not equal to ( we just take the char; if ch is a ( we use takewhile, which keeps taking chars from the same iterator until we hit a ).
Or, using re.findall, get all strings enclosed in parentheses with \((.+?)\) and all other characters with (\w):
import re
print([''.join(tup) for tup in re.findall(r'\((.+?)\)|(\w)', s)])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
You just need to use the magic of 're.split' and some logic.
import re

string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
temp = []
x = re.split(r'[(]', string)
# x = ['ab', 'abcds)kadf', 'sd)k', 'afsd)', 'lbne)']
for i in x:
    if ')' not in i:
        temp.extend(list(i))
    else:
        t = re.split(r'[)]', i)
        temp.append(t[0])
        temp.extend(list(t[1]))
print(temp)
# temp = ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
Note the difference between append and extend: append adds its argument as a single element, while extend adds each item of an iterable separately.
I hope this helps.
You have two options. The really easy one is to just iterate over the string. For example:
my_string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
my_list = []
in_parens = False
buffer = ''
for char in my_string:
    if char == '(':
        in_parens = True
    elif char == ')':
        in_parens = False
        my_list.append(buffer)
        buffer = ''
    elif in_parens:
        buffer += char
    else:
        my_list.append(char)
The other option is regex.
I would suggest regex. It is worth practicing.
Try Python's re module. If you are new to re it may take a bit of time, but you can do all kinds of string manipulation once you get it.
import re
search_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
re_pattern = re.compile(r'(\w)|\((\w*)\)')  # match a single character, or the characters inside parentheses
print([x if x else y for x, y in re_pattern.findall(search_string)])

Finding consecutive consonants in a word

I need code that will show me the consecutive consonants in a word. For example, for "concertation" I need to obtain ["c","nc","rt","t","n"].
Here is my code:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes += x + ''
    return consonnes
I manage to find the consonants, but I don't see how to find them consecutively. Can anybody tell me what I need to do?
You can use regular expressions, implemented in the re module
Better solution
>>> import re
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation", re.IGNORECASE)
['c', 'nc', 'rt', 't', 'n']
[bcdfghjklmnpqrstvwxyz]+ matches any sequence of one or more characters from the character class.
re.IGNORECASE enables a case-insensitive match on the characters. That is:
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "CONCERTATION", re.IGNORECASE)
['C', 'NC', 'RT', 'T', 'N']
Another Solution
>>> import re
>>> re.findall(r'[^aeiou]+', "concertation",)
['c', 'nc', 'rt', 't', 'n']
[^aeiou] is a negated character class. It matches any character other than the ones listed in the class, which in short means it matches the consonants in the string.
+ is a quantifier: it matches one or more occurrences of the pattern in the string.
Note: this will also match adjacent non-alphabetic characters, because the character class is anything other than lowercase vowels.
Example
>>> re.findall(r'[^aeiou]+', "123concertation",)
['123c', 'nc', 'rt', 't', 'n']
If you are sure that the input always contains only letters, this solution is fine.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
The string is scanned left-to-right, and matches are returned in the order found.
If you are curious about how the result is obtained for
re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation")
concertation
|
c
concertation
 |
# o is not present in the character class. Matching ends here. Adds match 'c' to the output list
concertation
  |
  n
concertation
   |
   c
concertation
    |
# Match ends again. Adds match 'nc' to list
# And so on
You could do this with regular expressions and the re module's split function:
>>> import re
>>> re.split(r"[aeiou]+", "concertation", flags=re.I)
['c', 'nc', 'rt', 't', 'n']
This method splits the string whenever one or more consecutive vowels are matched.
To explain the regular expression "[aeiou]+": here the vowels have been collected into a class [aeiou], while the + indicates that one or more occurrences of any character in this class can be matched. Hence the string "concertation" is split at o, e, a and io.
The re.I flag means that the case of the letters will be ignored, effectively making the character class equal to [aAeEiIoOuU].
Edit: One thing to keep in mind is that this method implicitly assumes that the word contains only vowels and consonants. Numbers and punctuation will be treated as non-vowels/consonants. To match only consecutive consonants, instead use re.findall with the consonants listed in the character class (as noted in other answers).
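A related detail is that re.split returns empty strings at the edges when the word starts or ends with a vowel. A small sketch of filtering them out; the function name `consonant_runs` is a hypothetical helper.

```python
import re

def consonant_runs(word):
    # Splitting on vowels leaves '' at the edges when the word
    # starts or ends with a vowel, so drop the empty pieces.
    return [part for part in re.split(r'[aeiou]+', word, flags=re.I) if part]

raw = re.split(r'[aeiou]+', 'atone', flags=re.I)
print(raw)
print(consonant_runs('atone'))
```

For "concertation" the filter changes nothing, since that word begins and ends with a consonant.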
One useful shortcut to typing out all the consonants is to use the third-party regex module instead of re.
This module supports set operations (in its VERSION1 mode, for example with the inline flag (?V1)), so the character class containing the consonants can be neatly written as the entire alphabet minus the vowels:
[[a-z]--[aeiou]] # equal to [bcdfghjklmnpqrstvwxyz]
Where [a-z] is the entire alphabet, -- is set difference and [aeiou] are the vowels.
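With only the standard re module, a similar "letters minus vowels" effect can be approximated with a negative lookahead. This is a swapped-in technique, not the regex module's set difference, and the name `CONSONANTS` is an illustrative assumption.

```python
import re

# (?![aeiou]) rejects vowels; [a-z] then accepts any remaining letter.
# With IGNORECASE both parts also apply to uppercase letters.
CONSONANTS = re.compile(r'(?:(?![aeiou])[a-z])+', re.IGNORECASE)

print(CONSONANTS.findall('123concertation'))
print(CONSONANTS.findall('CONCERTATION'))
```

Unlike [^aeiou]+, this pattern skips digits and punctuation because only letters can match.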
If you are up for a non-regex solution, itertools.groupby would work perfectly fine here, like this:
>>> from itertools import groupby
>>> is_vowel = lambda char: char in "aAeEiIoOuU"
>>> def suiteConsonnes(in_str):
...     return ["".join(g) for v, g in groupby(in_str, key=is_vowel) if not v]
...
>>> suiteConsonnes("concertation")
['c', 'nc', 'rt', 't', 'n']
A really, really simple solution without importing anything is to replace the vowels with a single thing, then split on that thing:
def SuiteConsonnes(mot):
    consonnes = ''.join([l if l not in "aeiou" else "0" for l in mot])
    # compare with != rather than `is not`, which tests identity instead of equality
    return [c for c in consonnes.split("0") if c != '']
To keep it really similar to your code, and to turn it into a generator, we get this:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes.append(x)
        elif consonnes:
            yield ''.join(consonnes)
            consonnes = []
    if consonnes:
        yield ''.join(consonnes)
def SuiteConsonnes(mot):
    consonnes=[]
    consecutive = '' # initialize consecutive string of consonants
    for x in mot:
        if x in "aeiou": # checks if x is a vowel
            if consecutive: # checks if consecutive string is not empty
                consonnes.append(consecutive) # append consecutive string to consonnes
                consecutive = '' # reinitialize consecutive for another consecutive string of consonants
        else:
            consecutive += x # add x to consecutive string if x is not a vowel
    if consecutive: # checks if consecutive string is not empty
        consonnes.append(consecutive) # append last consecutive string of consonants
    return consonnes
SuiteConsonnes('concertation')
#['c', 'nc', 'rt', 't', 'n']
Not that I'd recommend it for readability, but a one-line solution is:
In [250]: q = "concertation"
In [251]: [s for s in ''.join([l if l not in 'aeiou' else ' ' for l in q]).split()]
Out[251]: ['c', 'nc', 'rt', 't', 'n']
That is: replace each vowel with a space, join the characters back into one string, and split it on whitespace.
Use regular expressions from re built-in module:
import re
def find_consonants(string):
    # find all runs of one or more non-vowels:
    return re.findall(r'[^aeiou]+', string)
Although I think you should go with @nu11p01n73R's answer, this will also work:
re.sub('[AaEeIiOoUu]+',' ','concertation').split()
