I have a string read from a file, and I need to remove every character after the substring "c1525". Can regex be used for this? The pattern I see is that all my strings contain a "c" followed by 4 digits (although I have seen more than 4 digits, so that needs to be taken into account).
d098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566
expected:
d098746532d1234567c1525
You could use a regex with a capturing group and re.sub:
import re
s = 'd098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566'
s2 = re.sub(r'(c\d{4,}).*', r'\1', s)
output: 'd098746532d1234567c1525'
s = 'd098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566'
s[:s.find('c1525')+len('c1525')]
Output:
d098746532d1234567c1525
>>> txt = "d098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566"
>>> import re
>>> re.match(r"[^c]*(c\d+)", txt).group()
'd098746532d1234567c1525'
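The answers above all assume the marker is present. A minimal sketch that also handles a missing marker, assuming the marker is always a "c" followed by at least four digits; the helper name truncate_after_code is hypothetical:

```python
import re

# "c" followed by four or more digits, as described in the question.
CODE_RE = re.compile(r'c\d{4,}')

def truncate_after_code(s):
    """Return s cut off right after the first 'c' + 4-or-more digits."""
    m = CODE_RE.search(s)
    # Leave the string untouched when no such marker is present.
    return s[:m.end()] if m else s

s = 'd098746532d1234567c1525qFPXplFSm-FS8575664637338586224hKwHFmFSRRnm0Uc006566'
print(truncate_after_code(s))  # d098746532d1234567c1525
```

Unlike the [^c]* variant, re.search does not require that no other "c" appears before the marker; with an earlier stray "c", that re.match call would return None.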
import re
str_ = "8983605653Sudanshu452365423256Shinde"
print(re.findall(r"\d{10}\B|[A-Za-z]{8}|\d{12}|[A-Za-z]{6}",str_))
current output
['8983605653', 'Sudanshu', '4523654232', 'Shinde']
Desired output
['8983605653', 'Sudanshu', '452365423256', 'Shinde']
A regex find all on \d+|\D+ should work here:
str_ = "8983605653Sudanshu452365423256Shinde"
matches = re.findall(r'\d+|\D+', str_)
print(matches) # ['8983605653', 'Sudanshu', '452365423256', 'Shinde']
The pattern alternately matches runs of digits (\d+) or runs of non-digits (\D+), so each maximal run becomes one item in the result.
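The same split into digit and non-digit runs can also be sketched without regex. This assumes plain ASCII input, where str.isdigit agrees with \d, and swaps in itertools.groupby for the regex alternation:

```python
from itertools import groupby

str_ = "8983605653Sudanshu452365423256Shinde"

# groupby() starts a new group every time str.isdigit flips between
# True and False, so each group is one run of digits or non-digits.
parts = [''.join(group) for _, group in groupby(str_, key=str.isdigit)]
print(parts)  # ['8983605653', 'Sudanshu', '452365423256', 'Shinde']
```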
Instead of using an alternation with |, you can match the whole string with capture groups and then print the group values.
import re
str_ = "8983605653Sudanshu452365423256Shinde"
m = re.match(r"(\d{10})([A-Za-z]{8})(\d{12})([A-Za-z]{6})",str_)
if m:
    print(list(m.groups()))
Output
['8983605653', 'Sudanshu', '452365423256', 'Shinde']
I have the following code:
tablesInDataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
for table in tablesInDataset:
    tableregex = re.compile("\d{8}")
    tablespec = re.match(tableregex, table)
    everythingbeforedigits = tablespec.group(0)
    digits = tablespec.group(1)
My regex should only return the string if it contains 8 digits after an underscore. Once it returns the string, I want to use .match() to get two groups using the .group() method. The first group should contain a string with all of the characters before the digits, and the second should contain a string with the 8 digits.
What is the correct regex to get the results I am looking for using .match() and .group()?
Use capture groups:
>>> import re
>>> s = "henry_jones_12345678"
>>> pat = re.compile(r'(?P<name>.*)_(?P<number>\d{8})')
>>> pat.findall(s)
[('henry_jones', '12345678')]
You get the nice feature of named groups, if you want it:
>>> match = pat.match(s)
>>> match.groupdict()
{'name': 'henry_jones', 'number': '12345678'}
tableregex = re.compile(r"(.*)_(\d{8})")
I think this pattern should match what you need: (.*?_)(\d{8}).
First group includes everything up to the 8 digits, including the underscore. Second group is the 8 digits.
If you don't want the underscore included, use this instead: (.*?)_(\d{8})
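Applied to the loop from the question, the capture-group pattern might be used as sketched below. Two details are assumptions about what the asker wants: fullmatch is used so that exactly 8 trailing digits are required, and non-matching names are skipped. Note that group(0) is always the whole match, so the two parts are group(1) and group(2).

```python
import re

tables_in_dataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
table_regex = re.compile(r"(.*)_(\d{8})")

results = {}
for table in tables_in_dataset:
    m = table_regex.fullmatch(table)
    if m is None:
        # No underscore followed by exactly 8 trailing digits: skip it.
        continue
    everything_before_digits, digits = m.group(1), m.group(2)
    results[table] = (everything_before_digits, digits)

print(results)  # {'henry_jones_12345678': ('henry_jones', '12345678')}
```

Checking for None before calling .group() avoids the AttributeError that the original loop would raise on the names that do not match.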
Here you go:
import re
tablesInDataset = ["henry_jones_12345678", "henry_jones", "henry_jones_123"]
rx = re.compile(r'^(\D+)_(\d{8})$')
matches = [match.groups()
           for item in tablesInDataset
           for match in [rx.search(item)]
           if match]
print(matches)
Better than any dot-star-soup :)
I want to replace the string
ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg
with
ID12345678
How can I replace this via regex?
I tried this - it didn't work.
import re
re.sub(r'_\w+_\d_\d+_\w+','')
Thank you
You can use re.sub with a pattern built on [^_]*, which matches the longest run of characters that does not contain _. Capture that leading run and replace the whole string with it:
>>> s="ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
>>> import re
>>> re.sub(r'([^_]*).*',r'\1',s)
'ID12345678'
But if the ID could appear anywhere in your string, you can use re.search as follows:
>>> re.search(r'ID\d+',s).group(0)
'ID12345678'
>>> s="_S3_MPRAGE_ADNI_ID12345678_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
>>> re.search(r'ID\d+',s).group(0)
'ID12345678'
Without regex, when the ID is the first field (as in the original string), you can simply use split():
>>> s.split('_',1)[0]
'ID12345678'
I guess the first part is variable, then
import re
s = "ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg"
print(re.sub(r'_.*$', r'', s))  # ID12345678
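For the case where the ID is always the first underscore-separated field, a regex-free sketch with str.partition behaves the same way and also copes with input that has no underscore at all (the variable names here are illustrative):

```python
s = ("ID12345678_S3_MPRAGE_ADNI_32Ch_2_98_clone_transform_clone_reg_"
     "N3Corrected1_mask_cp_strip_durastripped_N3Corrected_clone_lToads_lesions_seg")

# partition() splits at the first "_" only and never raises;
# a string without any underscore comes back unchanged as the head.
subject_id, _sep, _rest = s.partition("_")
print(subject_id)  # ID12345678

bare_id, _sep, _rest = "ID12345678".partition("_")
print(bare_id)  # ID12345678
```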
I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
['AbAD']
I guess the above is because the last D of the first match is no longer available for matching? Is there any way to turn off this behaviour?
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
[]
Basically what I want is that there shouldn't be more than three capital letters surrounding a small letter, and therefore I placed a negative match around them. But ['AbAD'] should be matched, but it is not getting matched. Any ideas?
This is mainly because regex matches cannot overlap. Put your regex inside a lookahead in order to handle overlapping matches.
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
For the 2nd one, you should use negative lookahead and lookbehind assertions, like below:
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
The problem with your second regex is that [^A-Z] has to consume a character, and there is no character at all before the first A, so the pattern cannot match there. The negative lookbehind (?<![A-Z]) does not consume any character; it only asserts that the match is not preceded by an uppercase letter, which is also true at the start of the string. That's why your version gets no match while the lookaround version does.
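To make the zero-width behaviour visible, here is a small sketch that reports where each overlapping match starts. It uses re.finditer rather than re.findall, which is a choice made for this illustration and not part of the answer above:

```python
import re

s = 'AbADaDD'

# The lookahead itself matches an empty string, so the scan advances one
# character at a time and the captured groups are free to overlap.
overlapping = re.compile(r'(?=([A-Z][a-z][A-Z][A-Z]))')
hits = [(m.start(), m.group(1)) for m in overlapping.finditer(s)]
print(hits)  # [(0, 'AbAD'), (3, 'DaDD')]

# Adding the lookbehind/lookahead guards rejects 'DaDD', because the
# character before it ('A') is an uppercase letter.
strict = re.compile(r'(?<![A-Z])(?=([A-Z][a-z][A-Z][A-Z])(?![A-Z]))')
strict_hits = strict.findall(s)
print(strict_hits)  # ['AbAD']
```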
The problem with your regex is that it consumes the string as it progresses, leaving nothing for the second match. Use a lookahead to make sure it does not consume the string.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
For your second regex again do the same.
print(re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))", s))
To see why the original first pattern finds only one match:
1) After the first match, the string left is aDD, because the first part has already been consumed.
2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it cannot produce a second match.
1st issue,
You should use this pattern,
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
2nd issue
You should use,
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']
mystring = "q1)whatq2)whenq3)where"
want something like ["q1)what", "q2)when", "q3)where"]
My approach is to find the q\d+\) pattern then move till I find this pattern again and stop. But I'm not able to stop.
I did req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?
You could try the below code which uses re.findall function,
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
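q\d+\) matches a q followed by one or more digits and then a ) symbol.
(?:(?!q\d+).)* matches any character, zero or more times, as long as q\d+ does not start at that character. The negative lookahead is checked before each character is consumed, so the match stops right before the next question marker.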
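An equivalent way to express "stop before the next marker" is a lazy quantifier followed by a lookahead. This is an alternative sketch, not the pattern from the answer above, and it assumes the text contains no newlines (since . does not match them by default):

```python
import re

mystring = "q1)whatq2)whenq3)where"

# .*? grows one character at a time until the lookahead succeeds:
# either the next "q<digits>)" marker or the end of the string.
pieces = re.findall(r'q\d+\).*?(?=q\d+\)|$)', mystring)
print(pieces)  # ['q1)what', 'q2)when', 'q3)where']
```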