Python RegEx search and replace with part of original expression - python

I'm new to Python and looking for a way to replace all occurrences of "[A-Z]0" with the [A-Z] portion of the string to get rid of certain numbers that are padded with a zero. I used this snippet to get rid of the whole occurrence from the field I'm processing:
import re
def strip_zeros(s):
return re.sub("[A-Z]0", "", s)
test = strip_zeros(!S_fromManhole!)
How do I perform the same type of procedure but without removing the leading letter of the "[A-Z]0" expression?
Thanks in advance!

Use backreferences.
http://www.regular-expressions.info/refadv.html "\1 through \9 Substituted with the text matched between the 1st through 9th pair of capturing parentheses."
http://docs.python.org/2/library/re.html#re.sub "Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern."
Untested, but it would look like this:
return re.sub(r"([A-Z])0", r"\1", s)
Placing the first letter inside a capture group and referencing it with \1

you can try something like
In [47]: s = "ab0"
In [48]: s.translate(None, '0')
Out[48]: 'ab'
In [49]: s = "ab0zy"
In [50]: s.translate(None, '0')
Out[50]: 'abzy'

I like Patashu's answer for this case but for the sake of completeness, passing a function to re.sub instead of a replacement string may be cleaner in more complicated cases. The function should take a single match object and return a string.
>>> def strip_zeros(s):
... def unpadded(m):
... return m.group(1)
... return re.sub("([A-Z])0", unpadded, s)
...
>>> strip_zeros("Q0")
'Q'

Related

Split Python String by letters and keep deliminators

Using regex, how can i split a string and keep it's deliminators in the returned results? I'm trying to split a string containing numbers and strings by a set of letters followed by any numerical value including '.' however it's not appearing to work correctly.
Below is my test string, im using python 2.7 and it's not producing what id expect.
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, re.IGNORECASE))
print len(parts), parts
>>> 3 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z']
I would expect it to give me this
>>> 10 ['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
It should output a list of strings where each string starts with a letter, found in the original regex MLHVCSQTAZ
In your code you are passing re.IGNORECASE as 3rd argument to re.split but 3rd argument of re.split is maxsplit not flags.
re.IGNORECASE equals to 2 hence your input is split only two times.
You may use:
>>> list(filter(None, re.split(r'([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, 0, re.I)))
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
Or use inline mode for ignore case:
re.split(r'(?i)([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s)
I suggest using this simple re.findall code that uses almost identical regex:
parts = re.findall('(?i)[MLHVCSQTAZ][^MLHVCSQTAZ]*', s)
Reference: SRE_FLAG_IGNORECASE = 2 in lib/python2.7/sre_constants.py (thanks to comment from #vks)
You can use re.findall:
import re
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
result = re.findall('[A-Z][\.\d,]+|[A-Z]', s)
Output:
['M160.394,83.962', 'L121.5,52', 'L86.31,73.378', 'L58,104.917', 'L89.75,', 'C136.667', 'L158.542,136.667', 'L185,110.208', 'L160.394,83.962', 'Z']
parts = filter(None, re.split('([MLHVCSQTAZ][^MLHVCSQTAZ]+)', s, flags=re.IGNORECASE))
You need to use flags.Check re.split function definition.
Default re does not support 0 width assertion split.So you can also use regex module for that.
import regex
s = 'M160.394,83.962L121.5,52L86.31,73.378L58,104.917L89.75,C136.667L158.542,136.667L185,110.208L160.394,83.962Z'
print regex.split('(?=[MLHVCSQTAZ][^MLHVCSQTAZ])', s, flags=regex.IGNORECASE|regex.VERSION1)

How to get String Before Last occurrence of substring?

I want to get String before last occurrence of my given sub string.
My String was,
path =
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov
my substring, 1001-1010 which will occurred twice. all i want is get string before its last occurrence.
Note: My substring is dynamic with different padding but only number.
I want,
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v
I have done using regex and slicing,
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> q = re.findall("\d*-\d*",p)
>>> q[-1].join(p.split(q[-1])[:-1])
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v'
>>>
Is their any better way to do by purely using regex?
Please Note I have tried so many eg:
regular expression to match everything until the last occurrence of /
Regex Last occurrence?
I got answer by using regex with slicing but i want to achieve by using regex alone..
Why use regex. Just use built in string methods:
path = "D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov"
index = path.rfind("1001-1010")
print(path[:index])
You can use a simple greedy match and a capture group:
(.*)1001-1010
Your match is in capture group #1
Since .* is greedy by nature, it will match longest match before matching your keyword 1001-1010.
RegEx Demo
As per comments below if keyword is not a static string then you may use this regex:
r'(.*\D)\d+-\d+'
Python Code:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> print (re.findall(r'(.*\D)\d+-\d+', p))
['D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v']
Thanks #anubhava,
My first regex was,
.*(\d*-\d*)\/
Now i have corrected mine..
.*(\d*-\d*)
or
(.*)(\d*-\d*)
which gives me,
>>> q = re.search('.+(\d*-\d*)', p)
>>> q.group()
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v0001-1001'
>>>
(.*\D)\d+-\d+
this gives me exactly what i want...
>>> q = re.search('(.*\D)\d+-\d+', p)
>>> q.groups()
('D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v',)
>>>

Need help extracting data from a file

I'm a newbie at python.
So my file has lines that look like this:
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
I need help coming up with the correct python code to extract every float preceded by a colon and followed by a space (ex: [-0.294118, 0.487437,etc...])
I've tried dataList = re.findall(':(.\*) ', str(line)) and dataList = re.split(':(.\*) ', str(line)) but these come up with the whole line. I've been researching this problem for a while now so any help would be appreciated. Thanks!
try this one:
:(-?\d\.\d+)\s
In your code that will be
p = re.compile(':(-?\d\.\d+)\s')
m = p.match(str(line))
dataList = m.groups()
This is more specific on what you want.
In your case .* will match everything it can
Test on Regexr.com:
In this case last element wasn't captured because it doesnt have space to follow, if this is a problem just remove the \s from the regex
This will do it:
import re
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
for match in re.finditer(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE):
print match.group(1)
Or:
match = re.search(r"(-?\d\.\d+)", line, re.DOTALL | re.MULTILINE)
if match:
datalist = match.group(1)
else:
datalist = ""
Output:
-0.294118
0.487437
0.180328
-0.292929
0.00149028
-0.53117
-0.0333333
Live Python Example:
http://ideone.com/DpiOBq
Regex Demo:
https://regex101.com/r/nR4wK9/3
Regex Explanation
(-?\d\.\d+)
Match the regex below and capture its match into backreference number 1 «(-?\d\.\d+)»
Match the character “-” literally «-?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character that is a “digit” (ASCII 0–9 only) «\d»
Match the character “.” literally «\.»
Match a single character that is a “digit” (ASCII 0–9 only) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Given:
>>> s='-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333.333'
With your particular data example, you can just grab the parts that would be part of a float with a regex:
>>> re.findall(r':([\d.-]+)', s)
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
You can also split and partition, which would be substantially faster:
>>> [e.partition(':')[2] for e in s.split() if ':' in e]
['-0.294118', '0.487437', '0.180328', '-0.292929', '-1', '0.00149028', '-0.53117', '-0.0333.333']
Then you can convert those to a float using try/except and map and filter:
>>> def conv(s):
... try:
... return float(s)
... except ValueError:
... return None
...
>>> filter(None, map(conv, [e.partition(':')[2] for e in s.split() if ':' in e]))
[-0.294118, 0.487437, 0.180328, -0.292929, -1.0, 0.00149028, -0.53117, -0.0333333]
A simple oneliner using list comprehension -
str = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
[float(s.split()[0]) for s in str.split(':')]
Note: this is simplest to understand (and pobably fastest) as we are not doing any regex evaluation. But this would only work for the particular case above. (eg. if you've to get the second number - in the above not so correctly formatted string would need more work than a single one-liner above).

python regular expression substitute

I need to find the value of "taxid" in a large number of strings similar to one given below. For this particular string, the 'taxid' value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How to write regular expression for this in python.
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
for line in lines:
match = re.match(".*\|taxid:([^|]+)\|.*",line)
print match.groups()

is there a way find substring index through a regular expression in python?

i want to find a substring 's index postion,but the substring is long and hard to expression(multiline,& even you need escape for it ) so i want to use regex to match them,and return the substring's index,
the function like str.find or str.rfind ,
is there some package help for this?
Use the .start() method on the result object of a successful match (the so called match object):
mo = re.search('foo', veryLongString)
if mo:
return mo.start()
If the match was successful, mo.start() will give you the (first) index of the matching substring within the searched string.
Something like this might work:
import re
def index(longstr, pat):
rx = re.compile(r'(?P<pre>.*?)({0})'.format(pat))
match = rx.match(longstr)
return match and len(match.groupdict()['pre'])
Then:
>>> index('bar', 'foo') is None
True
>>> index('barfoo', 'foo')
3
>>> index('\xbarfoo', 'foo')
2
regular expression "re" package should be of help
http://docs.python.org/library/re.html
http://diveintopython3.ep.io/regular-expressions.html
http://www.doughellmann.com/PyMOTW/re/

Categories