How to choose a certain position to split a string by "_"?


I have a string like this '00004079_20150427_5_169_192_114.npz', and I want to split it into this ['00004079_20150427_5', '169_192_114.npz'].
I tried the Python string split() method:
a = '00004079_20150427_5_169_192_114.nii.npz'
a.split("_", 3)
but it returned this:
['00004079', '20150427', '5', '169_192_114.nii.npz']
How can I split this into 2 parts at the third occurrence of "_"?
I also tried this:
reg = ".*\_.*\_.\_"
re.split(reg, a)
but it returns:
['', '169_192_114.nii.npz']

You can split the string on the delimiter _ at most 3 times and then join everything except the last value back together:
>>> s = '00004079_20150427_5_169_192_114.npz'
>>> *start, end = s.split('_', 3)
>>> start = '_'.join(start)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
For Python 2, which does not support starred assignment, use this instead:
>>> lst = s.split('_', 3)
>>> end = lst.pop()
>>> start = '_'.join(lst)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
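The same split-and-rejoin idea can be wrapped in a small helper. This is only a sketch: the name split_at_nth and its parameters are hypothetical, and returning an empty second part when there are fewer than n separators is an assumption about the desired behaviour.

```python
def split_at_nth(text, sep, n):
    """Split text into two parts at the n-th occurrence of sep (1-based)."""
    parts = text.split(sep)
    if len(parts) <= n:
        # Fewer than n separators: there is nothing to split off.
        return text, ''
    # Rejoin the first n fields and the remaining fields separately.
    return sep.join(parts[:n]), sep.join(parts[n:])


head, tail = split_at_nth('00004079_20150427_5_169_192_114.npz', '_', 3)
print(head)  # 00004079_20150427_5
print(tail)  # 169_192_114.npz
```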

One possible approach (if going with regex):
import re
s = '00004079_20150427_5_169_192_114.nii.npz'
res = re.search(r'^((?:[^_]+_){2}[^_]+)_(.+)', s)
print(res.groups())
The output:
('00004079_20150427_5', '169_192_114.nii.npz')
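For readability, the same pattern can be written with re.VERBOSE and guarded against names that do not match. This is a sketch under assumptions: the function name split_third_underscore is hypothetical, and returning None for names with fewer than three underscores is a design choice, not something the question specifies.

```python
import re

# Same regex as above, spread out with comments via re.VERBOSE.
PATTERN = re.compile(r"""
    ^((?:[^_]+_){2}[^_]+)   # group 1: three underscore-free fields joined by underscores
    _                       # the third underscore, dropped from the result
    (.+)$                   # group 2: everything after it
""", re.VERBOSE)


def split_third_underscore(name):
    """Return (head, tail) split at the third underscore, or None if no match."""
    m = PATTERN.match(name)
    return m.groups() if m else None


print(split_third_underscore('00004079_20150427_5_169_192_114.nii.npz'))
```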

Related

Get every 2nd and 3rd characters of a string in Python

I know that my_str[1::3] gets me every 2nd character in chunks of 3, but what if I want to get every 2nd and 3rd character? Is there a neat way to do that with slicing, or do I need some other method like a list comprehension plus a join:
new_str = ''.join([s[i * 3 + 1: i * 3 + 3] for i in range(len(s) // 3)])

I think a generator expression with enumerate would be the cleanest:
>>> "".join(c if i % 3 in (1,2) else "" for (i, c) in enumerate("peasoup booze scaffold john"))
'eaou boz safol jhn'

Instead of picking out the 2nd and 3rd characters, why not delete the 1st character of each chunk?
Something like this (the variable is named s here to avoid shadowing the built-in str):
>>> s = '123456789'
>>> tmp = list(s)
>>> del tmp[::3]
>>> new_str = ''.join(tmp)
>>> new_str
'235689'
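A slicing-only variant is also possible by interleaving the two extended slices s[1::3] and s[2::3]. This is a sketch, not a benchmarked recommendation; the function name every_2nd_and_3rd is hypothetical, and zip_longest is used so that a trailing incomplete chunk is not lost.

```python
from itertools import zip_longest


def every_2nd_and_3rd(s):
    """Keep the 2nd and 3rd character of every chunk of 3."""
    seconds = s[1::3]  # 2nd character of each chunk
    thirds = s[2::3]   # 3rd character of each chunk
    # Interleave the two slices; pad with '' if the last chunk is incomplete.
    return ''.join(a + b for a, b in zip_longest(seconds, thirds, fillvalue=''))


print(every_2nd_and_3rd('123456789'))  # 235689
```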

Regex multiple same pattern/repeated captures not work correctly, only match first and last

My regex:
联系人[:：]\s{1,2}([^\s,，、]+)(?:[\s,，、]{1,2}([^\s,，、]+))*
Test string:
联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大
Code
>>> import regex as re
>>> p = r'联系人[:：]\s*([^\s,，、]+)(?:[\s,，、]{1,2}([^\s,，、]+))*'
>>> s = '联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大'
>>> re.findall(p, s)
[('啊啊', '实打实大')]
# finditer
>>> for i in re.finditer(p, s):
...     print(i.groups())
...
('啊啊', '实打实大')
You can test it at https://regex101.com/
(regex101 can't save regexes at the moment, so I posted screenshots of the matches instead.)
I want every name separated by [\s,，、] to be captured, but only the first and the last are matched. I can't see anything wrong with my regex, yet the result is wrong; I have been stuck on this for half an hour...

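As I mentioned in my comments, a repeated capturing group keeps only its last capture in groups(). You need to use re.search (to get a single match only) or re.finditer (to get multiple matches) and access all captures of the corresponding group (in your case, captures(2)). Note that captures() is provided by the third-party regex module, not by the standard-library re: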
>>> import regex as re
>>> p = r'联系人[:：]\s*([^\s,，、]+)(?:[\s,，、]{1,2}([^\s,，、]+))*'
>>> s = '联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大'
>>> res = []
>>> for x in re.finditer(p, s):
...     res.append(x.captures(2))
...
>>> print(res)
[['实打实大', '好说歹说', '实打实', '实打实大']]
>>> m = re.search(p, s)
>>> if m:
...     print(m.captures(2))
...
['实打实大', '好说歹说', '实打实', '实打实大']
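If only the standard-library re module is available (it has no captures()), a two-step approach gives a similar result: first locate the text after the label, then tokenize it. This is a sketch with a hypothetical function name, contact_names, and it differs from the original pattern in two ways: it returns the first name together with the rest, and it takes everything up to the end of the line instead of limiting separators to 1 or 2 characters.

```python
import re


def contact_names(text):
    """Return all names that follow the 联系人 label on the same line."""
    # Step 1: grab everything after the label and the colon.
    m = re.search(r'联系人[:：]\s*(.+)', text)
    if not m:
        return []
    # Step 2: every run of non-separator characters is one name.
    return re.findall(r'[^\s,，、]+', m.group(1))


print(contact_names('联系人: 啊啊，实打实大, 好说歹说、实打实 实打实大'))
```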

python strip string from end the most greedily

here it is:
str_ = 'file_.csv_.csv.bz2'
re.sub(regex, '', str_)
I want a value for regex that yields 'file_.csv_', i.e. the file name without its actual extension. Here the extension is '.csv.bz2', but in general it is '.csv' optionally followed by a compression suffix (bz2, gz, 7z, ...).
More precisely, I want re.sub to match greedily from the end of str_, so that only the final '.csv...' suffix is removed.
With regex = '\.csv.*$' I get only 'file_'.
I could of course call os.path.splitext(), check whether the result ends with '.csv', and call os.path.splitext() again if so, but is there a shorter way?

You could use re.split() to split off the suffix:
result = re.split(r'\.csv(?:\.\w+)?$', filename)[0]
Demo:
>>> import re
>>> filename = 'file_.csv_.csv.bz2'
>>> re.split(r'\.csv(?:\.\w+)?$', filename)[0]
'file_.csv_'
>>> re.split(r'\.csv(?:\.\w+)?$', 'foobar_.csv_.csv')[0]
'foobar_.csv_'
>>> re.split(r'\.csv(?:\.\w+)?$', 'foobar_.csv_.csv.gz')[0]
'foobar_.csv_'

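This removes the trailing run of extensions and leaves only the file name. It relies on the character just before the extensions (here _) not being a dot, a lowercase letter, or a digit: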
>>> s = "file_.csv_.csv.bz2"
>>> m = re.sub(r'[.a-z0-9]+$', r'', s)
>>> m
'file_.csv_'
>>> s = "foobar_.csv_.csv.gz"
>>> m = re.sub(r'[.a-z0-9]+$', r'', s)
>>> m
'foobar_.csv_'
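A regex-free alternative is to check for known suffixes explicitly with str.endswith. This is a sketch: the function name strip_csv_extension and the COMPRESSION_SUFFIXES tuple are hypothetical, and the list of compression formats is an assumption that would need to be extended for real use. Unlike the character-class pattern above, it leaves a name such as 'data.csv.gz' with a non-empty stem.

```python
# Assumed set of compression suffixes; extend as needed.
COMPRESSION_SUFFIXES = ('.bz2', '.gz', '.7z')


def strip_csv_extension(filename):
    """Remove a trailing '.csv' or '.csv.<compression>' from filename."""
    for suffix in COMPRESSION_SUFFIXES:
        full = '.csv' + suffix
        if filename.endswith(full):
            return filename[:-len(full)]
    if filename.endswith('.csv'):
        return filename[:-len('.csv')]
    # Not a CSV name: return it unchanged.
    return filename


print(strip_csv_extension('file_.csv_.csv.bz2'))  # file_.csv_
```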

identifying position of the pattern match

I need to find the exact position where the string matched..
>>> pattern = 'Test.*1'
>>> str1='Testworld1'
>>> match = re.search(pattern,str1)
>>> match.group()
'Testworld1'
I need the position of 1(10th byte) from the 'Testworld1' string which matched the pattern .*1.

You want to do two things. First, put the 1 in a capturing group; then call .start() on the match with that group's number. The result is a 0-based index, so 9 corresponds to the 10th character:
>>> pattern = 'Test.*(1)'
>>> match = re.search(pattern,str1)
>>> match.group(1)
'1'
>>> match.start(1)
9

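How about end()? It returns the index just past the match, which equals the 1-based position of the last matched character: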
>>> pattern = r'Test.*1'
>>> str1='Testworld1'
>>> match = re.search(pattern,str1)
>>> match.end()
10
For more complicated applications (where you are not just looking for the position of the last character in your match), you might want to use a capturing group and start() instead:
>>> pattern = r'Test.*(11)'
>>> str1='Testworld11'
>>> match = re.search(pattern,str1)
>>> match.start(1) + 1
10
Here, start(n) gives you the beginning index of the capture of the nth group, where groups are counted from left to right by their opening parentheses.
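The relationship between start(), end(), and span() for a group can be seen side by side in a short, self-contained sketch. The variable names zero_based and one_based are only illustrative.

```python
import re

m = re.search(r'Test.*(1)', 'Testworld1')

zero_based = m.start(1)     # index of the captured '1', counting from 0
one_based = zero_based + 1  # the same character, counting from 1
span = m.span(1)            # (start, end) of group 1; end is exclusive

print(zero_based, one_based, span)  # 9 10 (9, 10)
```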

Python string split decimals from end of string

I use nlst on a ftp server which returns directories in the form of lists. The format of the returned list is as follows:
[xyz123,abcde345,pqrst678].
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123 i.e split the string at the beginning of the integer part. Any help on this will be appreciated!

>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']

For example, using the re module:
>>> import re
>>> items = ['xyz123','ABCDE345','pqRst678']
>>> regex = r'(\D+)(\d+)'
>>> for item in items:
...     m = re.match(regex, item)
...     alpha, digits = m.groups()
...     print(alpha, digits)
...
xyz 123
ABCDE 345
pqRst 678

Use the regular expression module re:
import re

def splitEntry(entry):
    # re.search (not re.match) is needed: the digits are at the end, not the start.
    trailingDigits = re.search(r"\d+$", entry)
    if trailingDigits:
        pos = trailingDigits.start(0)
        alpha, numeric = entry[:pos], entry[pos:]
    else:  # no digits found at end of string
        alpha, numeric = entry, ""
    return (alpha, numeric)
Note that the regular expression is '\d+$', which matches the run of digits at the end of the string. Digits earlier in the string are not counted, e.g. xy3zzz134 -> "xy3zzz", "134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course, it is still a problem if the filename itself ends with numbers.

Another non-re answer:
>>> import itertools
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']

If you don't want to use regex, then you can do something like this. Note that I have not tested this so there could be a bug or typo somewhere.
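items = ["xyz123", "abcde345", "pqrst678"]  # renamed from list to avoid shadowing the built-in
newlist = []
for item in items:
    for pos in range(len(item)):
        if item[pos].isnumeric():
            # Split at the first numeric character and move on to the next item.
            newlist.append([item[:pos], item[pos:]])
            break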

>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]

I don't think it's that difficult without re:
>>> s="xyz123"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['xyz', '123']
>>> s="abcde345"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['abcde', '345']
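The answers above use two different split points: the first digit in the string, or the run of digits at the very end. The two agree on names like xyz123 but differ when digits also appear in the middle. The following sketch puts both side by side; the function names are hypothetical, and returning an empty numeric part when no digits are found is an assumption.

```python
import re


def split_at_first_digit(name):
    """Split where the first digit appears anywhere in the string."""
    m = re.search(r'\d', name)
    if m is None:
        return name, ''
    return name[:m.start()], name[m.start():]


def split_trailing_digits(name):
    """Split off only the run of digits at the very end of the string."""
    m = re.search(r'\d+$', name)
    if m is None:
        return name, ''
    return name[:m.start()], name[m.start():]


for name in ['xyz123', 'xy3zzz134']:
    print(name, split_at_first_digit(name), split_trailing_digits(name))
```

For directory names that may contain digits in the middle, the trailing-digits version is usually the safer choice.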
