Python: strip a suffix from the end of a string as greedily as possible - python

Here it is:
import re
str_ = 'file_.csv_.csv.bz2'
re.sub(regex, '', str_)
I want a value for regex that yields 'file_.csv_', i.e. the file name without its actual extension, which here is '.csv.bz2' and in general could be '.csv' followed by nothing or by any compression suffix (.bz2, .gz, .7z, ...).
More precisely, I want re.sub to match as much as possible, but only from the end of str_.
With regex = r'\.csv.*$' I would get only 'file_', because the match starts at the first '.csv'.
I could of course call os.path.splitext(), check whether the result ends with '.csv', and call os.path.splitext() again if so, but is there a shorter way?

You could use re.split() to split off the suffix:
result = re.split(r'\.csv(?:\.\w+)?$', filename)[0]
Demo:
>>> import re
>>> filename = 'file_.csv_.csv.bz2'
>>> re.split(r'\.csv(?:\.\w+)?$', filename)[0]
'file_.csv_'
>>> re.split(r'\.csv(?:\.\w+)?$', 'foobar_.csv_.csv')[0]
'foobar_.csv_'
>>> re.split(r'\.csv(?:\.\w+)?$', 'foobar_.csv_.csv.gz')[0]
'foobar_.csv_'

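This removes the whole run of trailing extensions and leaves only the file name. Note that it relies on the character before the extensions (here '_') not being a lowercase letter, a digit, or a dot: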
>>> s = "file_.csv_.csv.bz2"
>>> m = re.sub(r'[.a-z0-9]+$', r'', s)
>>> m
'file_.csv_'
>>> s = "foobar_.csv_.csv.gz"
>>> m = re.sub(r'[.a-z0-9]+$', r'', s)
>>> m
'foobar_.csv_'
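To see how the character-class pattern differs from one anchored on '.csv', here is a small comparison. It is only a sketch; the second file name is a made-up example, not taken from the question:

```python
import re

broad = r'[.a-z0-9]+$'          # any trailing run of dots, lowercase letters and digits
anchored = r'\.csv(?:\.\w+)?$'  # only '.csv' plus one optional extra suffix

# Both patterns agree when an underscore precedes the extensions.
same_broad = re.sub(broad, '', 'file_.csv_.csv.bz2')
same_anchored = re.sub(anchored, '', 'file_.csv_.csv.bz2')

# Hypothetical name with no underscore: the broad pattern eats the whole string.
bare_broad = re.sub(broad, '', 'data2020.csv.gz')
bare_anchored = re.sub(anchored, '', 'data2020.csv.gz')

print(repr(same_broad), repr(same_anchored))  # 'file_.csv_' 'file_.csv_'
print(repr(bare_broad), repr(bare_anchored))  # '' 'data2020'
```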

Related

Regular expressions (regex) - How to split a string by the first X digits appear in it?

I've been struggling to find the right regex (Python) to cover my requirement:
I want to split a string according to the first place in which there are 6 digits.
For example -
stringA = 'abcdf123456789'
Will ideally be cut into -
StringB='abcdf123456'
StringC='789'
So far - This is the solution I came up with:
x = re.split("(?=[0-9])", stringA)
And then loop over the results while counting the characters.
Your help will be greatly appreciated!
import re
stringA = 'abcdf123456789'
result = re.split("(?<=[0-9]{6})",stringA,maxsplit=1)
print(result)
# ['abcdf123456', '789']
Using a lookbehind:
>>> stringA = 'abcdf123456789'
>>> re.split(r'(?<=\d{6})', stringA, maxsplit=1)
['abcdf123456', '789']
You may use this code with 2 capture groups:
>>> import re
>>> stringA = 'abcdf123456789'
>>> [(stringB,stringC)] = re.findall(r'(.*?\d{6})(.*)', stringA)
>>> print (stringB)
abcdf123456
>>> print (stringC)
789
You can split on 6 digits with maxsplit=1, and capture the group you split on, then you can build your strings easily:
import re
stringA = 'abcdf123456789'
split = re.split(r'(\d{6})', stringA, maxsplit=1)
# split is now ['abcdf', '123456', '789']
stringB = ''.join(split[:2])
stringC = split[2]
print(stringB)
print(stringC)
# abcdf123456
# 789
import re
stringA = 'abcdf123456789'
index = re.search(r'\d{6}', stringA).end()
stringB = stringA[:index]
stringC = stringA[index:]
You can simply use findall() with capture groups, i.e. '()', to pull out the parts you need:
import re
stringA = 'abcdf123456789'
pattern = r"([\D]*\d{6})(.*)"
result = re.findall(pattern, stringA)
print(result)
#output [('abcdf123456', '789')]
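The answers above all assume the string really contains six consecutive digits. re.search returns None when it does not, so calling .end() on the result would raise AttributeError. Also, splitting on a zero-width lookbehind with re.split requires Python 3.7 or later. A minimal sketch that handles the no-match case; split_after_digits and its fallback of returning the whole string are assumptions for illustration:

```python
import re

def split_after_digits(text, n=6):
    """Split text right after the first run of n digits (hypothetical helper)."""
    match = re.search(r'\d{%d}' % n, text)
    if match is None:
        # Assumed fallback: no run of n digits, so nothing is split off.
        return text, ''
    end = match.end()
    return text[:end], text[end:]

print(split_after_digits('abcdf123456789'))  # ('abcdf123456', '789')
print(split_after_digits('abc123'))          # ('abc123', '')
```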

How to choose a certain position to split a string by "_"?

I have a string like this '00004079_20150427_5_169_192_114.npz', and I want to split it into this ['00004079_20150427_5', '169_192_114.npz'].
I tried the Python string split() method:
a = '00004079_20150427_5_169_192_114.nii.npz'
a.split("_", 3)
but it returned this:
['00004079', '20150427', '5', '169_192_114.nii.npz']
How can I split this into 2 parts at the third occurrence of "_"?
I also tried this:
reg = ".*\_.*\_.\_"
re.split(reg, a)
but it returns:
['', '169_192_114.nii.npz']
You can split the string on the delimiter _ up to 3 times and then join everything except the last value back together:
>>> s = '00004079_20150427_5_169_192_114.npz'
>>> *start, end = s.split('_', 3)
>>> start = '_'.join(start)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
For Python 2, you can do this instead:
>>> lst = s.split('_', 3)
>>> end = lst.pop()
>>> start = '_'.join(lst)
>>>
>>> start
'00004079_20150427_5'
>>> end
'169_192_114.npz'
One possible approach (if going with regex):
import re
s = '00004079_20150427_5_169_192_114.nii.npz'
res = re.search(r'^((?:[^_]+_){2}[^_]+)_(.+)', s)
print(res.groups())
The output:
('00004079_20150427_5', '169_192_114.nii.npz')
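The split-and-join idea generalizes to any delimiter and any position. A minimal sketch; split_at_nth is a hypothetical helper, and returning the whole string with an empty tail when there are fewer than n delimiters is an assumed convention:

```python
def split_at_nth(text, sep, n):
    """Split text into two parts at the n-th occurrence of sep (hypothetical helper)."""
    parts = text.split(sep, n)
    if len(parts) <= n:
        # Fewer than n separators: assumed fallback is "no split".
        return text, ''
    return sep.join(parts[:n]), parts[n]

print(split_at_nth('00004079_20150427_5_169_192_114.nii.npz', '_', 3))
# ('00004079_20150427_5', '169_192_114.nii.npz')
print(split_at_nth('a_b', '_', 3))
# ('a_b', '')
```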

Regex multiple same pattern/repeated captures not work correctly, only match first and last

My regex:
联系人[::]\s{1,2}([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*
Test string:
联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大
Code
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> re.findall(p, s)
[('啊啊', '实打实大')]
# finditer
>>> for i in re.finditer(p, s):
...     print(i.groups())
...
('啊啊', '实打实大')
You can test it at https://regex101.com/.
I want every item separated by [\s,,、] to be captured, but only the first and the last are matched. I can't see anything wrong with my regex, yet the result is wrong; this had me stuck for half an hour...
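As I mentioned in my comments, you need to use re.search (to get a single match only) or re.finditer (to get multiple matches) and read every capture of the repeated group with captures(); in your case that is captures(2). A repeated group keeps only its last capture in groups(), which is why you see just the first and the last item. Note that captures() comes from the third-party regex module imported here; the standard re module does not have it: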
>>> import regex as re
>>> p = r'联系人[::]\s*([^\s,,、]+)(?:[\s,,、]{1,2}([^\s,,、]+))*'
>>> s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'
>>> res = []
>>> for x in re.finditer(p, s):
...     res.append(x.captures(2))
...
>>> print(res)
[['实打实大', '好说歹说', '实打实', '实打实大']]
>>> m = re.search(p, s)
>>> if m:
...     print(m.captures(2))
...
['实打实大', '好说歹说', '实打实', '实打实大']
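If installing the third-party regex module is not an option, a two-step approach with the standard re module gets the same items: first isolate the text after the label, then collect the runs of non-separator characters. This is only a sketch of a different technique; contact_names is a hypothetical helper, and unlike the original pattern it does not limit separators to one or two characters:

```python
import re

s = '联系人: 啊啊,实打实大, 好说歹说、实打实 实打实大'

def contact_names(text):
    """Return every name listed after the '联系人' label (hypothetical helper)."""
    m = re.search(r'联系人[::]\s*(.*)', text)
    if m is None:
        return []
    # Each run of characters that is not whitespace, a comma, or '、' is one name.
    return re.findall(r'[^\s,,、]+', m.group(1))

names = contact_names(s)
print(names)
# ['啊啊', '实打实大', '好说歹说', '实打实', '实打实大']
```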

return all matches in regex in Python

>>> import re
>>> p=re.compile('(a(.)c)d')
Why does the following return only 'abcd' and not 'aecd' as well? What should I do if I want both? And what if I want only 'aecd'?
>>> m=p.match('abcdeaecd')
>>> m.group()
'abcd'
>>> m.groups()
('abc', 'b')
Thanks!
You can simplify your regex like this:
import re
p = re.compile(r'a.cd')
Then use findall to get all the matches:
print(p.findall('abcdeaecd'))
# ['abcd', 'aecd']
Alternatively, keep your original regex and iterate over the matches:
print([item.group() for item in p.finditer('abcdeaecd')])
# ['abcd', 'aecd']
You will want to use finditer instead of match, because match only tries to match at the start of the string:
ms = p.finditer('abcdeaecd')
for m in ms:
    # do something with m.group() or m.groups()
    print(m.group(), m.groups())
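Putting the pieces together with the original pattern, the following sketch shows what each call returns. Note that findall returns tuples of groups, not whole matches, when the pattern contains capture groups. Taking the last match is just one assumed way to get only 'aecd'; the question does not say how that match should be chosen:

```python
import re

p = re.compile(r'(a(.)c)d')
text = 'abcdeaecd'

first_only = p.match(text).group()                      # match() is anchored at position 0
whole_matches = [m.group(0) for m in p.finditer(text)]  # every full match, left to right
group_tuples = p.findall(text)                          # tuples of the capture groups
last_match = whole_matches[-1]                          # assumed rule: keep the last match

print(first_only)     # abcd
print(whole_matches)  # ['abcd', 'aecd']
print(group_tuples)   # [('abc', 'b'), ('aec', 'e')]
print(last_match)     # aecd
```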

Python string split decimals from end of string

I use nlst on an FTP server, which returns the directory names as a list. The format of the returned list is as follows:
[xyz123,abcde345,pqrst678].
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123, i.e. split the string where the integer part begins. Any help on this will be appreciated!
>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']
For example, using the re module:
>>> import re
>>> items = ['xyz123','ABCDE345','pqRst678']
>>> regex = r'(\D+)(\d+)'
>>> for item in items:
...     m = re.match(regex, item)
...     alpha, digits = m.groups()
...     print(alpha, digits)
...
xyz 123
ABCDE 345
pqRst 678
Use the regular expression module re:
import re
def splitEntry(entry):
    # search (rather than match) so the digits can sit at the end of the string
    firstDecMatch = re.search(r"\d+$", entry)
    alpha, numeric = "", ""
    if firstDecMatch:
        pos = firstDecMatch.start(0)
        alpha, numeric = entry[:pos], entry[pos:]
    else:  # no decimals found at end of string
        alpha = entry
    return (alpha, numeric)
Note that the regular expression is r"\d+$", which matches all the digits at the end of the string. Digits in the first part are not counted, e.g. xy3zzz134 -> "xy3zzz", "134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course it is still a problem if the filename itself ends with numbers.
Another non-re answer:
>>> import itertools
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']
If you don't want to use regex, then you can do something like this. Note that I have not tested this so there could be a bug or typo somewhere.
items = ["xyz123", "abcde345", "pqrst678"]
newlist = []
for item in items:
    for pos in range(len(item)):
        if item[pos].isnumeric():
            newlist.append([item[:pos], item[pos:]])
            break
>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]
I don't think it's that difficult without re:
>>> s="xyz123"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['xyz', '123']
>>> s="abcde345"
>>> for n,i in enumerate(s):
...     if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['abcde', '345']
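Another regex-free option is str.rstrip with the digit characters, which strips only the trailing run of digits and so also copes with names that contain digits earlier on or no digits at all. A minimal sketch; split_trailing_digits is a hypothetical helper name:

```python
def split_trailing_digits(entry):
    """Split entry into (head, trailing ASCII digits) (hypothetical helper)."""
    head = entry.rstrip('0123456789')
    return head, entry[len(head):]

for name in ['xyz123', 'abcde345', 'pqrst678', 'xy3zzz134', 'nodigits']:
    print(split_trailing_digits(name))
# ('xyz', '123')
# ('abcde', '345')
# ('pqrst', '678')
# ('xy3zzz', '134')
# ('nodigits', '')
```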
