use regex to replace long string - python

I want to replace the string
ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg
with
ID12345678
How can I replace this via regex?
I tried this - it didn't work.
import re
re.sub(r'_\w+_\d_\d+_\w+','')
Thank you

You can use re.sub with the pattern [^_]*, which matches any run of characters that does not contain _. Capture that leading run, let .* consume the rest of the string, and replace the whole match with the captured group:
>>> s="ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
>>> import re
>>> re.sub(r'([^_]*).*',r'\1',s)
'ID12345678'
But if the ID could appear anywhere in your string, you can use re.search as follows:
>>> re.search(r'ID\d+',s).group(0)
'ID12345678'
>>> s="_S3_MPRAGE_ADNI_ID12345678_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
>>> re.search(r'ID\d+',s).group(0)
'ID12345678'
Without regex, for the original string (where the ID comes first), you can simply use split():
>>> s.split('_',1)[0]
'ID12345678'
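If the ID may be missing, re.search returns None and calling .group(0) on it raises an AttributeError. The following is a minimal sketch that makes the no-match case explicit; the helper name extract_id and the assumption that an ID is "ID" followed by digits are illustrative and not part of the question.

```python
import re

# Assumption: an ID is the literal "ID" followed by one or more digits.
ID_PATTERN = re.compile(r'ID\d+')

def extract_id(name):
    """Return the first ID-like token in name, or None if there is none."""
    match = ID_PATTERN.search(name)
    return match.group(0) if match else None

print(extract_id("ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone"))  # ID12345678
print(extract_id("_S3_MPRAGE_ADNI_ID12345678_32Ch"))            # ID12345678
print(extract_id("no_identifier_here"))                         # None
```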

Assuming the first part is variable, remove everything from the first underscore onward:
import re
s = "ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
print(re.sub(r'_.*$', '', s))

Related

Remove Characters After Number

I have a string from a file, and I need to remove all the characters after the substring "c1525". Can regex be used for this? The pattern I am seeing is that all my strings have a "c" followed by 4 digits (although I have seen more than 4 digits, so that needs to be taken into account).
d098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566
expected:
d098746532d1234567c1525
You could use a regex with a capturing group and re.sub:
import re
s = 'd098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566'
s2 = re.sub(r'(c\d{4,}).*', r'\1', s)
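Output: 'd098746532d1234567c1525'
Without regex, if the marker substring is known in advance, slice up to the end of it:
s = 'd098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566'
s[:s.find('c1525')+len('c1525')]
Output:
d098746532d1234567c1525
Alternatively, match from the start of the string up to the first "c" and the digits that follow it:
>>> txt = "d098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566"
>>> import re
>>> re.match(r"[^c]*(c\d+)", txt).group()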
'd098746532d1234567c1525'
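The approaches above can be combined into one small helper that uses the end position of the match for slicing. This is only a sketch: the function name truncate_after_marker is made up for illustration, and returning the input unchanged when no marker is found is an assumption about the desired behaviour.

```python
import re

def truncate_after_marker(text):
    """Keep everything up to and including the first 'c' followed by 4 or more digits."""
    match = re.search(r'c\d{4,}', text)
    # Assumption: if no marker is found, return the text unchanged.
    return text[:match.end()] if match else text

s = 'd098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566'
print(truncate_after_marker(s))  # d098746532d1234567c1525
```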

Return the content of a Wildcard match in Python

Is it possible to return the contents that match a wildcard (like .*) in a regex pattern in Python?
For example, a match like:
re.search('stack.*flow','stackoverflow')
would return the string 'over'.
Use a capturing group:
>>> import re
>>> re.search('stack(.*)flow', 'stackoverflow').group(1)
'over'
Yes, you can capture the result. To do so, wrap that part of the pattern in parentheses ():
matchobj = re.search('stack(.*)flow','stackoverflow')
print(matchobj.group(1)) # => over

Python regular expression search vs match

I'm trying to use a Python regular expression to match 'BrahuiHan' or 'BrahuiYoruba'
>>> re.search(r'((Brahui|Han|Yoruba)+\d+)', '10xBrahuiHan50_10xBrahuiYoruba50n4').groups()
('BrahuiHan50', 'Han')
This only returns the first match. I thought it should return the second one too, i.e. BrahuiYoruba.
If you want to capture all occurrences of a pattern, you need to use re.findall:
>>> import re
>>> re.findall(r'((Brahui|Han|Yoruba)+\d+)', '10xBrahuiHan50_10xBrahuiYoruba50n4')
[('BrahuiHan50', 'Han'), ('BrahuiYoruba50', 'Yoruba')]
>>>
re.search will only capture the first occurrence.
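If only the whole matches are wanted rather than tuples of groups, a non-capturing group (?:...) makes findall return plain strings, and finditer gives access to the match positions as well. A minimal sketch using the string from the question:

```python
import re

text = '10xBrahuiHan50_10xBrahuiYoruba50n4'

# (?:...) groups the alternatives without creating a capture group,
# so findall returns the full match instead of a tuple of groups.
pattern = re.compile(r'(?:Brahui|Han|Yoruba)+\d+')

whole_matches = pattern.findall(text)
print(whole_matches)  # ['BrahuiHan50', 'BrahuiYoruba50']

# finditer yields match objects, which also expose the match position.
for m in pattern.finditer(text):
    print(m.start(), m.group(0))
```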
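Try
import re
# The + after the group lets names such as BrahuiHan match as a whole;
# (?:...) is non-capturing, so findall returns the full matches as strings.
regex = re.compile(r"(?:Brahui|Han|Yoruba)+\d+")
testString = "" # fill this in
matchArray = regex.findall(testString)
# the matchArray variable contains the list of matches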

extracting multiple instances regex python

I have a string:
This is #lame
Here I want to extract lame. But here is the issue: the above string can also be
This is lame
Here I don't extract anything. And the string can also be:
This is #lame but that is #not
Here I extract lame and not.
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters are preceded by a # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
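A capturing group gives the same results as the lookbehind here, because findall returns the group contents when the pattern has exactly one group. This is a sketch of that alternative; the names TAG and extract_tags are made up for illustration.

```python
import re

# Assumption: a tag is '#' followed by one or more word characters.
# With a single capturing group, findall returns only the captured part.
TAG = re.compile(r'#(\w+)')

def extract_tags(text):
    """Return the list of words that directly follow a '#'."""
    return TAG.findall(text)

print(extract_tags('This is #lame'))                   # ['lame']
print(extract_tags('This is lame'))                    # []
print(extract_tags('This is #lame but that is #not'))  # ['lame', 'not']
```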
You should use the re library. Here is an example:
import re
test_case = "This is #lame but that is #not"
# note: this pattern keeps the leading '#' in each match
regular = re.compile(r"#\w*")
lst = regular.findall(test_case)
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))

python regex: match to the first "}"

I have a string s containing:
Hello {full_name} this is my special address named {address1}_{address2}.
I am attempting to match all the substrings that are enclosed in curly brackets.
Attempting:-
matches = re.findall(r'{.*}', s)
gives me
['{full_name}', '{address1}_{address2}']
but what I am actually trying to retrieve is
['{full_name}', '{address1}', '{address2}']
How can I do that?
>>> import re
>>> text = 'Hello {full_name} this is my special address named {address1}_{address2}.'
>>> re.findall(r'{[^{}]*}', text)
['{full_name}', '{address1}', '{address2}']
Try a non-greedy match:
matches = re.findall(r'{.*?}', s)
You need a non-greedy quantifier:
matches = re.findall(r'{.*?}', s)
Note the question mark ?.
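To see the difference between the quantifiers side by side, here is a small sketch on a shortened sample string (the sample and the variable names are chosen for illustration). The last pattern adds a capturing group to return the names without the braces.

```python
import re

sample = 'named {address1}_{address2}.'

# Greedy: .* runs to the last '}' it can find.
greedy = re.findall(r'\{.*\}', sample)
# Non-greedy: .*? stops at the first '}' that completes a match.
lazy = re.findall(r'\{.*?\}', sample)
# Negated class: braces cannot appear inside a match at all.
negated = re.findall(r'\{[^{}]*\}', sample)
# Capturing group: return only the text between the braces.
names = re.findall(r'\{([^{}]*)\}', sample)

print(greedy)   # ['{address1}_{address2}']
print(lazy)     # ['{address1}', '{address2}']
print(negated)  # ['{address1}', '{address2}']
print(names)    # ['address1', 'address2']
```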
