I need to find the exact position where the string matched.
>>> import re
>>> pattern = 'Test.*1'
>>> str1='Testworld1'
>>> match = re.search(pattern,str1)
>>> match.group()
'Testworld1'
I need the position of the 1 (the 10th byte) in the 'Testworld1' string, which matched the pattern .*1.
You want to do two things. First, make a group out of the 1; then, when accessing the group, you can call .start(), like so:
>>> pattern = 'Test.*(1)'
>>> match = re.search(pattern,str1)
>>> match.group(1)
'1'
>>> match.start(1)
9
How about end()?
>>> pattern = r'Test.*1'
>>> str1='Testworld1'
>>> match = re.search(pattern,str1)
>>> match.end()
10
For more complicated applications (where you are not just looking for the last position of the last character in your match), you might want to use capturing and start instead:
>>> pattern = r'Test.*(11)'
>>> str1='Testworld11'
>>> match = re.search(pattern,str1)
>>> match.start(1) + 1
10
Here, start(n) gives you the beginning index of the capture of the nth group, where groups are counted from left to right by their opening parentheses.
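To show how start(), end() and span() relate to each other, here is a minimal sketch that reuses the pattern and string from above. The variable names are only illustrative.

```python
import re

match = re.search(r'Test.*(1)', 'Testworld1')

# span(n) is the (start(n), end(n)) pair; n = 0 (the default) is the whole match.
whole_span = match.span()      # span of the whole match
group_span = match.span(1)     # span of the first capturing group

# end() is exclusive, so the zero-based index of the last matched character
# is end() - 1; add 1 again if you count positions starting from 1.
last_index = match.end() - 1

print(whole_span, group_span, last_index)
```

Because end() is exclusive, match.end() happens to equal the one-based position of the last matched character, which is why it returns 10 in the example above.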
I have a list of strings:
str_list = ['123_456_789_A1', '678_912_000_B1', '980_210_934_A1', '632_210_464_B1']
And I basically want another list:
output_list = ['789', '000', '934', '464']
It is always going to be the third group of numbers, which will always be followed by _A or _B.
So far I have:
import re
m = re.search('_(.+?)_A', text)
if m:
    found = m.group(1)
But I keep getting something like: 456_789
Just use a simple list comprehension for this:
ans = [i.split("_")[-2] for i in str_list]
If you only want to match digits followed by an underscore and an uppercase char, you can match the digits and assert the underscore and uppercase char directly to the right.
To match only A or B, use [AB] else use [A-Z] to match that range.
\d+(?=_[AB])
You can use re.search to find the first occurrence in the string.
import re
str_list = ['123_456_789_A1', '678_912_000_B1', '980_210_934_A1', '632_210_464_B1']
str_list = [re.search(r'\d+(?=_[AB])', s).group() for s in str_list]
print(str_list)
Output
['789', '000', '934', '464']
Or use a capturing-group version that also matches the preceding _, to be a bit more precise, since your pattern also matched the leading _:
str_list = [re.search(r'_(\d+)_[AB]', s).group(1) for s in str_list]
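As for why the original attempt returned 456_789, the following sketch reproduces it on one of the sample strings and contrasts it with the digit-only group. The helper name third_number is hypothetical; it only adds a guard for strings that do not match.

```python
import re

text = '123_456_789_A1'

# The search starts at the first underscore; the lazy .+? then grows until
# '_A' follows, so it swallows the underscore in the middle.
lazy = re.search(r'_(.+?)_A', text).group(1)

# Restricting the group to digits forces the match to start at the
# underscore directly in front of the wanted number.
digits = re.search(r'_(\d+)_[AB]', text).group(1)


def third_number(s):
    """Return the digits before _A/_B, or None if the string does not match."""
    m = re.search(r'_(\d+)_[AB]', s)
    return m.group(1) if m else None


print(lazy, digits, third_number('no_match_here'))
```

A lazy quantifier only shortens the match from the right; it does not move the starting point, which is always the leftmost position where a match is possible.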
I have a string like this '00004079_20150427_5_169_192_114.npz', and I want to split it into this ['00004079_20150427_5', '169_192_114.npz'].
I tried the Python string split() method:
a = '00004079_20150427_5_169_192_114.nii.npz'
a.split("_", 3)
but it returned this:
['00004079', '20150427', '5', '169_192_114.nii.npz']
How can I split this into 2 parts by the third "_" appearance?
I also tried this:
reg = ".*\_.*\_.\_"
re.split(reg, a)
but it returns:
['', '169_192_114.nii.npz']
You can split the string on the delimiter _ up to 3 times and then join everything except the last value back together:
>>> s = '00004079_20150427_5_169_192_114.npz'
>>> *start, end = s.split('_', 3)
>>> start = '_'.join(start)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
For Python 2, you can do this instead:
>>> lst = s.split('_', 3)
>>> end = lst.pop()
>>> start = '_'.join(lst)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
One possible approach (if going with regex):
import re
s = '00004079_20150427_5_169_192_114.nii.npz'
res = re.search(r'^((?:[^_]+_){2}[^_]+)_(.+)', s)
print(res.groups())
The output:
('00004079_20150427_5', '169_192_114.nii.npz')
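The same idea can be generalized to "split at the n-th occurrence of a separator". This is a minimal sketch, not taken from the answers above; the function name split_at_nth and its behaviour when there are fewer than n separators are assumptions.

```python
def split_at_nth(text, sep, n):
    """Split text into two parts at the n-th occurrence of sep (1-based).

    If sep occurs fewer than n times, the whole text is returned as the
    first part and the second part is empty (an arbitrary choice).
    """
    parts = text.split(sep)
    if len(parts) <= n:
        return text, ''
    # Re-join the pieces on either side of the n-th separator.
    return sep.join(parts[:n]), sep.join(parts[n:])


head, tail = split_at_nth('00004079_20150427_5_169_192_114.npz', '_', 3)
print(head, tail)
```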
I have a list of strings, and I want to find all the strings that end with _1234, where 1234 can be any 4-digit number. Ideally I would find all the matching elements and what the digits actually are, or at least the first matching element and its 4 digits.
For example, I have
['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
I want to get
['1024', '0510']
So far I have found that _\d{4}$ matches _1234 and returns a match object, and match_object.group(0) is the actual matched string. But is there a better way to search for _\d{4}$ and return only the \d{4}, without the _?
Use re.search():
import re
lst = ['A', 'BB_1024', 'CQ_2', 'x_0510']
newlst = []
for item in lst:
    match = re.search(r'_(\d{4})\Z', item)
    if match:
        newlst.append(match.group(1))
print(newlst)  # ['1024', '0510']
As for the regex, the pattern matches an underscore and exactly 4 digits at the end of the string, capturing only the digits (note the parens). The captured group is then accessible via match.group(1) (remember that group(0) is the entire match).
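The question used $ while this answer uses \Z. In Python the two differ only when the string ends with a newline, which the short sketch below illustrates. The input with a trailing newline is made up for the demonstration.

```python
import re

with_newline = 'x_0510\n'

# $ also matches just before a newline at the very end of the string.
dollar = re.search(r'_(\d{4})$', with_newline)

# \Z matches only at the very end of the string, so the newline blocks it.
strict = re.search(r'_(\d{4})\Z', with_newline)

print(dollar.group(1), strict)
```

If the strings may come from a file read line by line, \Z (or stripping the lines first) avoids accepting entries that still carry their line ending.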
import re
src = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765', 'AB2421', 'D3&1345']
res = []
p = re.compile(r'.*\D(\d{4})$')
for s in src:
    m = p.match(s)
    if m:
        res.append(m.group(1))
print(res)
This works: \D means "not a digit", so it will also match 'AB2421', 'D3&1345' and so on.
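One consequence of requiring \D is that a string consisting of exactly four digits is not matched, because there is no character in front of them. A possible variant, shown here only as a sketch, swaps in a negative lookbehind instead; it is a different technique from the pattern above, and the extra '1234' entry is added for illustration. The comprehensions use the := operator, so Python 3.8 or later is assumed.

```python
import re

src = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765', 'AB2421', '1234']

# Pattern from the answer: a non-digit must precede the four digits.
with_nondigit = re.compile(r'.*\D(\d{4})$')

# Variant: only require that no digit precedes them (start of string is fine).
with_lookbehind = re.compile(r'(?<!\d)(\d{4})$')

res_nondigit = [m.group(1) for s in src if (m := with_nondigit.match(s))]
res_lookbehind = [m.group(1) for s in src if (m := with_lookbehind.search(s))]

print(res_nondigit)
print(res_lookbehind)
```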
Please show some code next time you ask a question here, even if it doesn't work at all. It makes it easier for people to help you.
If you're interested in a solution without any regex, here's a way with list comprehensions:
>>> data = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
>>> endings = [text.split('_')[-1] for text in data]
>>> endings
['A', '1024', '2', '0510', '98765']
>>> [x for x in endings if x.isdigit() and len(x)==4]
['1024', '0510']
Try this:
[s[-4:] for s in lst if s[-4:].isdigit() and len(s) > 4]
Just check whether the last four characters are digits. Note that this does not check for the underscore, so 'y_98765' would yield '8765'.
I added the len(s) > 4 check to correct the mistake Joran pointed out.
Try this code:
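import re
r = re.compile(r".*?([0-9]+)$")
# Keeps the whole strings that end in one or more digits (not only 4-digit endings).
newlist = list(filter(r.match, mylist))
print(newlist)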
My regex:
联系人[::]\s{1,2}([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*
Test string:
联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大
Code
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> re.findall(p, s)
[('啊啊', '实打实大')]
# finditer
>>> for i in re.finditer(p, s):
... print(i.groups())
...
('啊啊', '实打实大')
You can test it here https://regex101.com/
I want every item separated by [\s,,、] to be captured, but only the first and the last are returned. I don't see anything wrong with my regex, yet the result is wrong; I have been stuck on this for half an hour.
As I mentioned in my comments, you need to use re.search (to get a single match only) or re.finditer (to get multiple matches) and access the corresponding group captures (in your case, it is captures(2)):
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> res = []
>>> for x in re.finditer(p, s):
...     res.append(x.captures(2))
...
>>> print(res)
[['实打实大', '好说歹说', '实打实', '实打实大']]
>>> m = re.search(p, s)
>>> if m:
...     print(m.captures(2))
...
['实打实大', '好说歹说', '实打实', '实打实大']
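captures() is provided by the third-party regex module; the standard-library re module keeps only the last value of a repeated group. If installing regex is not an option, one possible workaround is to capture everything after the label and extract the items in a second step. The sketch below assumes the items run to the end of the line, and its character classes include both ASCII and full-width punctuation, which is a guess about the intended input.

```python
import re

s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'

# With the standard-library re module, a repeated group keeps only its last capture.
last_only = re.search(p, s).groups()

# Workaround: grab everything after the label, then pull out the items.
m = re.search(r'联系人[::]\s*(.+)', s)
names = re.findall(r'[^\s,,、]+', m.group(1)) if m else []

print(last_only)
print(names)
```

Unlike the single-pattern version, this does not limit the separators to one or two characters, so it is somewhat more permissive.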
I'm new to Python, and I would like to find a substring in a string.
For example, if I have a substring of some constant letters such as:
substring = 'sdkj'
And a string of some letters such as:
string = 'sdjskjhdvsnea'
I want to make a counter so that any letters S, D, K, and J found in the string the counter will get incremented by 1. For example, for the above example, the counter will be 8.
How can I achieve this?
Maybe this code can help you:
>>> string = 'sdjskjhdvsnea'
>>> substring = 'sdkj'
>>> counter = 0
>>> for x in string:
...     if x in substring:
...         counter += 1
...
>>> counter
8
An alternative solution using re.findall():
>>> import re
>>> substring = 'sdkj'
>>> string = 'sdjskjhdvsnea'
>>> len(re.findall('|'.join(list(substring)), string))
8
Edit:
As you apparently do want the count of the appearances of the whole four-character substring, regex is probably the easiest method:
>>> import re
>>> string = 'sdkjhsgshfsdkj'
>>> substring = 'sdkj'
>>> len(re.findall(substring, string))
2
re.findall will give you a list of all (non-overlapping) appearances of substring in string:
>>> re.findall('sdkj', 'sdkjhsgshfsdkj')
['sdkj', 'sdkj']
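Two details matter when counting with re.findall: matches do not overlap, and the substring is interpreted as a pattern. The following sketch, with made-up inputs, shows a lookahead for counting overlapping occurrences and re.escape for substrings that contain regex metacharacters.

```python
import re

# Non-overlapping: after matching 'aa' the scan resumes behind the match.
non_overlapping = len(re.findall('aa', 'aaaa'))

# A zero-width lookahead consumes nothing, so every start position is tried.
overlapping = len(re.findall('(?=aa)', 'aaaa'))

# Without escaping, '.' matches any character; re.escape makes it literal.
unescaped = re.findall('a.b', 'a.b axb')
escaped = re.findall(re.escape('a.b'), 'a.b axb')

print(non_overlapping, overlapping, unescaped, escaped)
```

For a plain, non-overlapping count of a literal substring, str.count gives the same number without a regex.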
Normally, "finding a sub-string 'sdkj'" would mean trying to locate the appearances of that complete four-character substring within the larger string. In this case, it appears that you simply want the sum of the counts of those four letters:
sum(string.count(c) for c in substring)
Or, more efficiently, use collections.Counter:
from collections import Counter
counts = Counter(string)
sum(counts.get(c, 0) for c in substring)
This only iterates over string once, rather than once for each c in substring, so is O(m+n) rather than O(m*n) (where m == len(string) and n == len(substring)).
In action:
>>> string = "sdjskjhdvsnea"
>>> substring = "sdkj"
>>> sum(string.count(c) for c in substring)
8
>>> from collections import Counter
>>> counts = Counter(string)
>>> sum(counts.get(c, 0) for c in substring)
8
Note that you may want set(substring) to avoid double-counting:
>>> sum(string.count(c) for c in "sdjks")
11
>>> sum(string.count(c) for c in set("sdjks"))
8
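Putting the Counter approach and the set() de-duplication together, a small helper might look like this. The name count_letters is hypothetical, and the comparison is case-sensitive.

```python
from collections import Counter


def count_letters(string, letters):
    """Count how many characters of string are one of the given letters."""
    counts = Counter(string)  # one pass over string
    # set() ensures a letter listed twice is not counted twice;
    # a Counter returns 0 for letters that never occur.
    return sum(counts[c] for c in set(letters))


total = count_letters('sdjskjhdvsnea', 'sdkj')
total_with_duplicate = count_letters('sdjskjhdvsnea', 'sdjks')
print(total, total_with_duplicate)
```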