Split Python string by letters and keep delimiters - python

Using regex, how can I split a string and keep its delimiters in the returned results? I'm trying to split a string containing numbers and letters into pieces that each start with one of a set of letters followed by any numerical characters, including '.', but it isn't working correctly.
Below is my test string. I'm using Python 2.7 and it's not producing what I'd expect.
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, re.IGNORECASE))
print len(parts), parts
>>> 3 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z']
I would expect it to give me this
>>> 10 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
It should output a list of strings where each string starts with one of the letters in the regex character class, MLHVCSQTAZ.

In your code you are passing re.IGNORECASE as the third argument to re.split, but the third positional argument of re.split is maxsplit, not flags.
re.IGNORECASE equals 2, hence your input is split only two times.
You may use:
>>> list(filter(None, re.split(r'([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, 0, re.I)))
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
Or use inline mode for ignore case:
re.split(r'(?i)([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s)
I suggest using this simple re.findall code, which uses a nearly identical regex:
parts = re.findall('(?i)[MLHVCSQTAZ][^MLHVCSQTAZ]*', s)
Reference: SRE_FLAG_IGNORECASE = 2 in lib/python2.7/sre_constants.py (thanks to a comment from @vks)
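The maxsplit/flags mix-up can be checked directly. The sketch below assumes Python 3 (list comprehensions stand in for filter, and print is a function); the names pattern, flag_value, limited and full are illustrative only.

```python
import re

s = ('M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,'
     'C136.667L158.542,136.667L185,110.208L160.394,83.962Z')
pattern = r'([MLHVCSQTAZ][^MLHVCSQTAZ]+)'

# re.IGNORECASE has the integer value 2, so in the third positional
# slot of re.split it is read as maxsplit=2 rather than as a flag.
flag_value = int(re.IGNORECASE)

# Equivalent to the call in the question: at most two splits happen.
limited = [part for part in re.split(pattern, s, maxsplit=flag_value) if part]

# Passing the flag by keyword leaves maxsplit at its default of 0 (no limit).
full = [part for part in re.split(pattern, s, flags=re.IGNORECASE) if part]

print(flag_value, len(limited), len(full))
```

Naming the argument (flags=...) avoids the problem entirely, since a flag can then never land in the maxsplit position.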

You can use re.findall:
import re
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
result = re.findall(r'[A-Z][.\d,]+|[A-Z]', s)
Output:
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']

parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, flags=re.IGNORECASE))
You need to pass flags as a keyword argument; check the re.split function definition.
In Python 2.7 (and in Python 3 before 3.7), the default re module does not split on zero-width assertions, so you can also use the regex module for that.
import regex
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
print regex.split('(?=[MLHVCSQTAZ][^MLHVCSQTAZ])', s, flags=regex.IGNORECASE|regex.VERSION1)
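On Python 3.7 and later, the standard re module can split on a zero-width match as well, so the third-party regex module is not strictly needed there. A minimal sketch, assuming Python 3.7+ and a simplified lookahead that splits before every command letter:

```python
import re

s = ('M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,'
     'C136.667L158.542,136.667L185,110.208L160.394,83.962Z')

# Split at every position that is followed by a command letter.
# Any empty piece produced at the start of the string is filtered out.
parts = [part
         for part in re.split(r'(?=[MLHVCSQTAZ])', s, flags=re.IGNORECASE)
         if part]

print(len(parts), parts)
```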

Related

How to get String Before Last occurrence of substring?

I want to get the part of a string before the last occurrence of a given substring.
My string is:
path =
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov
My substring is 1001-1010, which occurs twice. All I want is the string before its last occurrence.
Note: my substring is dynamic, with different padding, but it always consists of numbers.
I want:
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v
I have done this using regex and slicing:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> q = re.findall(r"\d*-\d*", p)
>>> q[-1].join(p.split(q[-1])[:-1])
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v'
>>>
Is there a better way to do this purely with regex?
Please note I have tried many approaches, e.g.:
regular expression to match everything until the last occurrence of /
Regex Last occurrence?
I got an answer by using regex with slicing, but I want to achieve this using regex alone.
Why use regex? Just use built-in string methods:
path = "D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov"
index = path.rfind("1001-1010")
print(path[:index])
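One caveat with rfind: it returns -1 when the substring is absent, and path[:-1] would then silently drop the last character. A small sketch using str.rpartition, which makes the missing case explicit; the helper name before_last is hypothetical.

```python
def before_last(text, sub):
    """Return the part of text before the last occurrence of sub, or None."""
    head, sep, _tail = text.rpartition(sub)
    # rpartition returns ('', '', text) when sub is not found,
    # so an empty separator signals "no match".
    return head if sep else None


path = ("D:/me/vol101/Prod/cent/2019_04_23_01/image/"
        "AVEN_000_3400_img_pic_p1001-1010/pxy/"
        "AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov")

print(before_last(path, "1001-1010"))
print(before_last(path, "9999-0000"))  # substring absent
```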
You can use a simple greedy match and a capture group:
(.*)1001-1010
Your match is in capture group #1.
Since .* is greedy by nature, it will match the longest possible text before matching your keyword 1001-1010.
As per the comments below, if the keyword is not a static string then you may use this regex:
r'(.*\D)\d+-\d+'
Python Code:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> print (re.findall(r'(.*\D)\d+-\d+', p))
['D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v']
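If the keyword is only known at run time but is still a literal string, the greedy capture can be built from a variable. A sketch, assuming the keyword is held in a hypothetical variable named keyword; re.escape guards against characters that have special meaning in a regex.

```python
import re

p = ('D:/me/vol101/Prod/cent/2019_04_23_01/image/'
     'AVEN_000_3400_img_pic_p1001-1010/pxy/'
     'AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov')
keyword = '1001-1010'  # hypothetical run-time value

# The greedy (.*) consumes as much as possible, so the literal keyword
# that follows it is matched at its last occurrence in the string.
match = re.search('(.*)' + re.escape(keyword), p)
prefix = match.group(1) if match else None

print(prefix)
```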
Thanks @anubhava,
My first regex was:
.*(\d*-\d*)\/
Now I have corrected mine:
.*(\d*-\d*)
or
(.*)(\d*-\d*)
which gives me:
>>> q = re.search(r'.+(\d*-\d*)', p)
>>> q.group()
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010'
>>>
(.*\D)\d+-\d+
This gives me exactly what I want:
>>> q = re.search(r'(.*\D)\d+-\d+', p)
>>> q.groups()
('D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v',)
>>>

Getting word from string

How can I get the word example from a string like this:
str = "http://test-example:123/wd/hub"
I wrote something like this:
print(str[10:str.rfind(':')])
but it doesn't work correctly if the string looks like
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use the following non-regex approach because you know example is a 7-letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
There are many ways.
Using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
Using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
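When the input is always a URL, one less format-dependent option is to let the standard library isolate the host name first and split only that. A sketch, assuming Python 3 and assuming the wanted word is everything after the first hyphen in the host name; word_after_hyphen is a hypothetical helper.

```python
from urllib.parse import urlsplit


def word_after_hyphen(url):
    """Return the text after the first '-' in the URL's host name, or None."""
    host = urlsplit(url).hostname  # scheme, port and path are stripped here
    if not host or '-' not in host:
        return None
    return host.split('-', 1)[1]


print(word_after_hyphen("http://test-example:123/wd/hub"))
print(word_after_hyphen("http://tests-example:123/wd/hub"))
```

Because the port and path are removed before splitting, extra colons or hyphens later in the URL do not affect the result.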
I am not sure why you want to get a particular word from a string. I guess you want to check whether this word is present in the given string.
If that is the case, the code below can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
Using re:
import re
text = "http://test-example:123/wd/hub"
m = re.search(r'(?<=-).+(?=:)', text)
if m:
    print(m.group())
Python strings have a built-in find method:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
will return:
12
-1
This is the index of the found substring. If it equals -1, the substring was not found in the string. You can also use the in keyword:
'example' in 'http://test-example:123/wd/hub'
True

Removing many types of chars from a Python string

I have some string X and I wish to remove semicolons, periods, commas, colons, etc, all in one go. Is there a way to do this that doesn't require a big chain of .replace(somechar,"") calls?
You can use the translate method with a first argument of None:
string2 = string1.translate(None, ";.,:")
Alternatively, you can use the filter function:
string2 = filter(lambda x: x not in ";,.:", string1)
Note that both of these options only work for non-Unicode strings and only in Python 2.
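For Python 3 strings, a comparable approach is to build a deletion table with str.maketrans, whose third argument lists the characters to remove. A minimal sketch; the sample value of string1 is made up for illustration.

```python
string1 = "a;b.c,d:e"  # hypothetical sample input

# The first two arguments (character mappings) are left empty;
# every character in the third argument is deleted by translate().
table = str.maketrans('', '', ';.,:')
string2 = string1.translate(table)

print(string2)
```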
You can use re.sub to pattern match and replace. The following replaces only h and i with empty strings:
In [1]: s = 'byehibyehbyei'
In [2]: re.sub('[hi]', '', s)
Out[2]: 'byebyebye'
Don't forget to import re.
>>> import re
>>> foo = "asdf;:,*_-"
>>> re.sub('[;:,*_-]', '', foo)
'asdf'
[;:,*_-] - List of characters to be matched
'' - Replace match with nothing
Using the string foo.
For more information take a look at the re.sub(pattern, repl, string, count=0, flags=0) documentation.
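If the set of characters to remove comes from a variable, the character class can be built with re.escape so that characters such as ], ^ or - are not misread as regex syntax. A sketch, assuming a non-empty set of characters; remove_chars is a hypothetical helper.

```python
import re


def remove_chars(text, unwanted):
    """Remove every character listed in unwanted (assumed non-empty) from text."""
    # re.escape neutralises characters that are special inside a regex,
    # so the brackets always form a plain character class.
    pattern = '[' + re.escape(unwanted) + ']'
    return re.sub(pattern, '', text)


print(remove_chars("asdf;:,*_-", ";:,*_-"))
```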
Don't know about the speed, but here's another example without using re.
commas_and_stuff = ",+;:"
words = "words; and stuff!!!!"
cleaned_words = "".join(c for c in words if c not in commas_and_stuff)
Gives you:
'words and stuff!!!!'

Python RegEx search and replace with part of original expression

I'm new to Python and looking for a way to replace all occurrences of "[A-Z]0" with the [A-Z] portion of the string to get rid of certain numbers that are padded with a zero. I used this snippet to get rid of the whole occurrence from the field I'm processing:
import re
def strip_zeros(s):
    return re.sub("[A-Z]0", "", s)
test = strip_zeros(!S_fromManhole!)
How do I perform the same type of procedure but without removing the leading letter of the "[A-Z]0" expression?
Thanks in advance!
Use backreferences.
http://www.regular-expressions.info/refadv.html "\1 through \9 Substituted with the text matched between the 1st through 9th pair of capturing parentheses."
http://docs.python.org/2/library/re.html#re.sub "Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern."
Untested, but it would look like this:
return re.sub(r"([A-Z])0", r"\1", s)
This places the first letter inside a capture group and references it with \1.
You can try something like:
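A quick check of the backreference approach on a few made-up values (the sample strings are hypothetical, not taken from the question's data):

```python
import re


def strip_zeros(s):
    # ([A-Z]) captures the letter; \1 puts it back, so only the 0 is removed.
    return re.sub(r"([A-Z])0", r"\1", s)


print(strip_zeros("A05"))       # zero directly after a letter is dropped
print(strip_zeros("A10"))       # zero not directly after a letter is kept
print(strip_zeros("MH01-B02"))  # every letter-zero pair is handled
```

Only one zero is removed per letter; if a run of padding zeros has to go, a pattern such as r"([A-Z])0+" might be a starting point.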
In [47]: s = "ab0"
In [48]: s.translate(None, '0')
Out[48]: 'ab'
In [49]: s = "ab0zy"
In [50]: s.translate(None, '0')
Out[50]: 'abzy'
I like Patashu's answer for this case but for the sake of completeness, passing a function to re.sub instead of a replacement string may be cleaner in more complicated cases. The function should take a single match object and return a string.
>>> def strip_zeros(s):
...     def unpadded(m):
...         return m.group(1)
...     return re.sub("([A-Z])0", unpadded, s)
...
>>> strip_zeros("Q0")
'Q'

python regular expression substitute

I need to find the value of "taxid" in a large number of strings similar to one given below. For this particular string, the 'taxid' value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How do I write a regular expression for this in Python?
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
for line in lines:
    # re.match returns None when a line has no |taxid:...| field
    match = re.match(r".*\|taxid:([^|]+)\|.*", line)
    if match:
        print(match.groups())
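For a large number of lines, the search-based pattern can be compiled once and wrapped in a small helper that also copes with lines lacking a taxid. A sketch; extract_taxid is a hypothetical name, and the second and third sample lines are invented for illustration.

```python
import re

TAXID_RE = re.compile(r'taxid:(\d+)')


def extract_taxid(line):
    """Return the digits following 'taxid:' in line, or None if absent."""
    match = TAXID_RE.search(line)
    return match.group(1) if match else None


lines = [
    'score:0.86|taxid:9606(Human)|intact:EBI-999900',
    'taxid:10090(Mouse)|score:0.40',  # hypothetical: taxid at the start
    'score:0.12|intact:EBI-000000',   # hypothetical: no taxid at all
]
taxids = [extract_taxid(line) for line in lines]
print(taxids)
```

Using search rather than match means the field is found wherever it appears in the line, which fits the requirement that taxid may occur anywhere in the text.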
