I'm trying to use a Python regular expression to match 'BrahuiHan' or 'BrahuiYoruba'.
>>> re.search(r'((Brahui|Han|Yoruba)+\d+)', '10xBrahuiHan50_10xBrahuiYoruba50n4').groups()
('BrahuiHan50', 'Han')
This only returns the groups of the first match. I thought it should return the second match too, i.e. BrahuiYoruba.
If you want to capture all occurrences of a pattern, you need to use re.findall:
>>> import re
>>> re.findall(r'((Brahui|Han|Yoruba)+\d+)', '10xBrahuiHan50_10xBrahuiYoruba50n4')
[('BrahuiHan50', 'Han'), ('BrahuiYoruba50', 'Yoruba')]
>>>
re.search will only capture the first occurrence.
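If only the names are wanted, without the trailing digits, one option is to make the alternation non-capturing and capture just the run of names. This is a minimal sketch that assumes the digits should be dropped from the result; the variable names are illustrative.

```python
import re

text = '10xBrahuiHan50_10xBrahuiYoruba50n4'

# (?:...) groups the alternatives without capturing them; the outer
# parentheses capture the whole run of names, and \d+ stays outside the capture.
names = re.findall(r'((?:Brahui|Han|Yoruba)+)\d+', text)
print(names)
```

With a single capturing group, re.findall returns a flat list of strings rather than a list of tuples.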
Try
import re
regex = re.compile("((Brahui|Han|Yoruba)+\\d{1,})")
testString = "" # fill this in
matchArray = regex.findall(testString)
# the matchArray variable contains the list of matches
Is it possible to return the contents that match a wildcard (like .*) in a regex pattern in Python?
For example, a match like:
re.search('stack.*flow','stackoverflow')
would return the string 'over'.
Use a capturing group:
>>> import re
>>> re.search('stack(.*)flow', 'stackoverflow').group(1)
'over'
Yes, you can capture the result. Just wrap that part of the pattern in parentheses:
matchobj = re.search('stack(.*)flow','stackoverflow')
print(matchobj.group(1)) # => over
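Note that re.search returns None when nothing matches, so calling .group(1) directly on the result raises an AttributeError in that case. A small sketch of a guarded version follows; the helper name between is hypothetical, and re.escape is used on the assumption that the surrounding text should be matched literally.

```python
import re

def between(text, left, right):
    """Return whatever lies between left and right, or None if there is no match."""
    # Escape the literal parts so characters like '.' are not treated as regex syntax.
    match = re.search(re.escape(left) + '(.*)' + re.escape(right), text)
    return match.group(1) if match else None

print(between('stackoverflow', 'stack', 'flow'))
print(between('stackexchange', 'stack', 'flow'))
```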
I have the following code:
tablesInDataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
for table in tablesInDataset:
    tableregex = re.compile("\d{8}")
    tablespec = re.match(tableregex, table)
    everythingbeforedigits = tablespec.group(0)
    digits = tablespec.group(1)
My regex should only return the string if it contains 8 digits after an underscore. Once it returns the string, I want to use .match() to get two groups using the .group() method. The first group should contain a string with all of the characters before the digits, and the second should contain a string with the 8 digits.
What is the correct regex to get the results I am looking for using .match() and .group()?
Use capture groups:
>>> import re
>>> s = 'henry_jones_12345678'
>>> pat = re.compile(r'(?P<name>.*)_(?P<number>\d{8})')
>>> pat.findall(s)
[('henry_jones', '12345678')]
You get the nice feature of named groups, if you want it:
>>> match = pat.match(s)
>>> match.groupdict()
{'name': 'henry_jones', 'number': '12345678'}
tableregex = re.compile(r"(.*)_(\d{8})")
I think this pattern should match what you need: (.*?_)(\d{8}).
First group includes everything up to the 8 digits, including the underscore. Second group is the 8 digits.
If you don't want the underscore included, use this instead: (.*?)_(\d{8})
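As a rough sketch of how that pattern could be applied to the list from the question: .match() returns None for names without the 8 digits, so the result has to be checked before calling .group(). Also note that group(0) is the whole match, so the two captured parts are group(1) and group(2). The names tables and results are illustrative.

```python
import re

tables = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
pattern = re.compile(r'(.*?)_(\d{8})')

results = {}
for table in tables:
    match = pattern.match(table)
    if match is None:
        # No underscore followed by 8 digits: skip this table name.
        continue
    # group(1) is everything before the underscore, group(2) is the 8 digits.
    results[table] = (match.group(1), match.group(2))

print(results)
```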
Here you go:
import re
tablesInDataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
rx = re.compile(r'^(\D+)_(\d{8})$')
matches = [match.groups()
           for item in tablesInDataset
           for match in [rx.search(item)]
           if match]
print(matches)
Better than any dot-star-soup :)
I want to replace the string
ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg
with
ID12345678
How can I replace this via regex?
I tried this - it didn't work.
import re
re.sub(r'_\w+_\d_\d+_\w+','')
Thank you
You can use re.sub with the pattern [^_]*, which matches a run of characters that contains no underscore. Capture that leading run, let .* consume the rest of the string, and replace the whole match with the captured group:
>>> s="ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
>>> import re
>>> re.sub(r'([^_]*).*',r'\1',s)
'ID12345678'
If the ID can appear anywhere in the string, use re.search instead:
>>> re.search(r'ID\d+',s).group(0)
'ID12345678'
>>> s="_S3_MPRAGE_ADNI_ID12345678_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
>>> re.search(r'ID\d+',s).group(0)
'ID12345678'
Without a regex, you can simply use split() on the original string, where the ID comes first:
>>> s.split('_',1)[0]
'ID12345678'
Assuming the first part varies and you want to drop everything from the first underscore onward:
import re
s = "ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
print(re.sub(r'_.*$', r'', s))
I have a string:
This is #lame
Here I want to extract lame. The problem is that the string can also be
This is lame
in which case I don't extract anything. The string can also be:
This is #lame but that is #not
Here I extract lame and not.
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters were preceded by an # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
You can use the re library. Here is an example:
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#(\w+)")  # capture the word after '#', without the '#' itself
lst = regular.findall(test_case)
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))
import re
a = re.compile('myregex')
a.search(target_string)
I'm looking for a way to get the index of the next char after a possible match.
I know you can get the end position of a group, but my regex consists of more than just a group. Can I get the end of the whole match, or do I need to start counting the characters in my regex string?
The MatchObject returned by .search() has a .end() method; by default it returns the end position for which your whole regular expression matched.
You can also pass the method a group to find the end point of for that specific group, with the 0 group being the whole pattern.
Demonstration:
>>> import re
>>> a = re.compile('(my|your) regex')
>>> match = a.search('Somewhere in this string is my regex; not much different from your regex.')
>>> print(match.end())
36
>>> print(match.end(1))
30
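Since .end() is the index just past the match, it can be used directly to look at the next character. The sketch below reuses the string from the demonstration and also shows .span(), which returns the start and end together; the variable names are illustrative, and the bounds check covers the case where the match runs to the end of the string.

```python
import re

text = 'Somewhere in this string is my regex; not much different from your regex.'
pattern = re.compile('(my|your) regex')

match = pattern.search(text)
if match is not None:
    next_index = match.end()  # index of the first character after the whole match
    # Guard against a match that ends exactly at the end of the string.
    next_char = text[next_index] if next_index < len(text) else None
    whole_span = match.span()   # (start, end) of the whole match
    group_span = match.span(1)  # (start, end) of group 1 only
    print(next_index, repr(next_char), whole_span, group_span)
```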