Getting python regexp for data - python

I have been trying to find a Python regular expression to match the following lines. I want to extract the portion of each line between "|" and the "." that precedes upx.
My attempt was:
pattern=compile.re(re"^\S+\|(\S+).upx\.+")
But it did not work
My data:
UMM_189|XXYT9888_UMX_5711769.upx_OWED_786_bopsio_34527_sen_72.345615
AMW_126|7010.upx_XAWA01266525261
QEA_234|Serami_bolismun_milte_1_UMM1.upx_YU_168145
MMP_377|723C_UMM_5711781.upx_UXA_2_serax_78120_ser_23.26255277
My expected output:
XXYT9888_UMX_5711769
7010
Serami_bolismun_milte_1_UMM1
723C_UMM_5711781
Any better ideas please?

I do not think that Regex is necessary here because your data is pretty ordered. A list comprehension with str.split and str.splitlines will suffice:
>>> data = '''\
... UMM_189|XXYT9888_UMX_5711769.upx_OWED_786_bopsio_34527_sen_72.345615
... AMW_126|7010.upx_XAWA01266525261
... QEA_234|Serami_bolismun_milte_1_UMM1.upx_YU_168145
... MMP_377|723C_UMM_5711781.upx_UXA_2_serax_78120_ser_23.26255277
... '''
>>> [x.split('|', 1)[1].split('.upx', 1)[0] for x in data.splitlines()]
['XXYT9888_UMX_5711769', '7010', 'Serami_bolismun_milte_1_UMM1', '723C_UMM_5711781']
>>>

Try this:
>>> re.findall(r'\|(.*?)\.',data)
['XXYT9888_UMX_5711769', '7010', 'Serami_bolismun_milte_1_UMM1', '723C_UMM_5711781']
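One caveat with the lazy pattern above: it stops at the first dot after the pipe, so it relies on the wanted portion never containing a dot. A minimal sketch, assuming every line contains the literal ".upx" marker, anchors on that marker instead. The second sample line below is made up (it is not in the question's data) purely to show the difference.

```python
import re

data = """\
UMM_189|XXYT9888_UMX_5711769.upx_OWED_786_bopsio_34527_sen_72.345615
AMW_126|70.10.upx_XAWA01266525261
"""

# Lazy match up to the first dot: truncates the hypothetical ID "70.10".
first_dot = re.findall(r'\|(.*?)\.', data)

# Lazy match up to the literal ".upx" marker: keeps the whole ID.
upx_anchored = re.findall(r'\|(.*?)\.upx', data)

print(first_dot)
print(upx_anchored)
```

For the data in the question both patterns give the same result, so this only matters if dotted IDs can occur.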

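import re
your_str = "UMM_189|XXYT9888_UMX_5711769.upx_OWED_786_bopsio_34527_sen_72.345615"
# The named group "id" captures everything between "|" and the literal ".upx"
result = re.match(r'^[A-Z]{3}_[0-9]{3}\|(?P<id>[A-Za-z0-9_]*)\.upx', your_str)
print(result.group('id'))  # XXYT9888_UMX_5711769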

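You have the escaped dot (\.) and the plain dot (.) backwards. The call should also be re.compile with a raw string (r"..."), not compile.re(re"..."). Try
pattern = re.compile(r"^\S+\|(\S+)\.upx.+")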
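As a rough illustration of how a compiled pattern of this shape could be applied to the question's data, here is a minimal sketch. The variable names are made up, and skipping non-matching lines is an assumption about what the asker would want.

```python
import re

# "\." matches a literal dot; the trailing ".+" matches the rest of the line.
pattern = re.compile(r"^\S+\|(\S+)\.upx.+")

lines = [
    "UMM_189|XXYT9888_UMX_5711769.upx_OWED_786_bopsio_34527_sen_72.345615",
    "AMW_126|7010.upx_XAWA01266525261",
    "QEA_234|Serami_bolismun_milte_1_UMM1.upx_YU_168145",
    "MMP_377|723C_UMM_5711781.upx_UXA_2_serax_78120_ser_23.26255277",
]

extracted = []
for line in lines:
    m = pattern.match(line)
    if m:  # skip lines that do not fit the expected shape (an assumption)
        extracted.append(m.group(1))

print(extracted)
```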


String.split() after n characters

I can split a string like this:
string = 'ABC_elTE00001'
string = string.split('_elTE')[1]
print(string)
How do I automate this, so I don't have to pass '_elTE' to the function? Something like this:
string = 'ABC_elTE00001'
string = string.split('_' + 4 characters)[1]
print(string)
Use a regex. The re module has re.split, which works like str.split except that it splits on a regex pattern. It's worth a look at the docs:
>>> import re
>>> string = 'ABC_elTE00001'
>>> re.split(r'_\w{4}', string)
['ABC', '00001']
>>>
The above example is using a regex pattern as you see.
split() on _ and take everything after the first four characters.
s = 'ABC_elTE00001'
# s.split('_')[1] gives elTE00001
# To get the string after 4 chars, we'd slice it [4:]
print(s.split('_')[1][4:])
OUTPUT:
00001
You can use a regular expression to automate the extraction:
import re
string = 'ABC_elTE00001'
data = re.findall('.([0-9]*$)',string)
print(data)
This is a rather ugly version that literally "translates" string.split('_' + 4 characters)[1]:
s = 'ABC_elTE00001'
# s.find("_") locates the underscore; the slice covers it plus the next 4 characters
print(s.split(s[s.find("_"):(s.find("_")+1)+4])[1])
# 00001
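The same idea can be written without a regex and without hard-coding '_elTE'. This is only a sketch: the helper name after_delimiter and its parameters are made up for illustration, and raising an error when the delimiter is missing is an assumption.

```python
def after_delimiter(s, delim='_', skip=4):
    """Return what follows the first `delim` plus the next `skip` characters."""
    head, sep, tail = s.partition(delim)
    if not sep:  # partition returns an empty separator when delim is absent
        raise ValueError(f"{delim!r} not found in {s!r}")
    return tail[skip:]

print(after_delimiter('ABC_elTE00001'))
```

str.partition splits at the first occurrence only, so later underscores in the string are left alone.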

Slice string at last digit in Python

So I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment and I want to truncate them such that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search('\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then, it will output a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub and $ to match and substitute alphabetical characters
and underscores until the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
...     i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
    names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
    for name in names:
        # Drop trailing characters until the name ends in a digit
        while not name[-1].isdigit():
            name = name[:-1]
        print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010
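The reversed-string idea from the question can also be made to work. The sketch below assumes the goal is to cut everything after the last digit; the function name is made up, and returning the string unchanged when it has no digit is an assumption. It does not diagnose the asker's bug, since the slicing code was not shown, but one thing to watch for with this approach is that a negative slice such as s[:-n] returns an empty string when n is 0.

```python
import re

def truncate_after_last_digit(s):
    """Cut off everything that follows the last digit in s."""
    m = re.search(r'\d', s[::-1])
    if m is None:
        return s  # no digit at all: leave the string unchanged (an assumption)
    # m.start() is the number of characters after the last digit
    return s[:len(s) - m.start()]

print(truncate_after_last_digit('111_Joe_Smith_2010_Assessment'))
```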

Python regular expression for

What would be the regular expression for data like this?
/home//Desktop/3A5F.py
path/sth/R67G.py
a/b/c/d/t/6UY7.py
I would like to get these:
3A5F.py
R67G.py
6UY7.py
It looks like you're parsing paths, in which case you should really be using os.path instead of regex:
from os.path import basename
basename('/home//Desktop/3A5F.py')
# 3A5F.py
It is a simple split, no regex needed:
>>> "/home//Desktop/3A5F.py".split("/")[-1]
'3A5F.py'
As an alternative, you can get same result without regexps:
lines = ['/home//Desktop/3A5F.py', 'path/sth/R67G.py', 'a/b/c/d/t/6UY7.py']
result = [l.split('/')[-1] for l in lines]
print(result)
# ['3A5F.py', 'R67G.py', '6UY7.py']
Use: [^\/]*\.py$
But this is a bad question. You need to show what you have tried. We are not here to do your work for you.
You can use this.
import re
pattern = ".*/(.*$)"
mystring = "/home//Desktop/3A5F.py"
re.findall(pattern, mystring)
You can also use os.path.split(mystring), which returns a (head, tail) tuple; the tail is the file name.
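To compare the approaches in this thread side by side, here is a small sketch. It assumes every path ends in ".py" (otherwise re.search returns None and the regex line fails) and uses pathlib.PurePosixPath so that "/" is treated as the separator on any platform.

```python
import re
from pathlib import PurePosixPath

paths = ['/home//Desktop/3A5F.py', 'path/sth/R67G.py', 'a/b/c/d/t/6UY7.py']

# Regex: the last run of non-slash characters ending in ".py"
by_regex = [re.search(r'[^/]*\.py$', p).group(0) for p in paths]

# pathlib: the final path component, no regex needed
by_pathlib = [PurePosixPath(p).name for p in paths]

print(by_regex)
print(by_pathlib)
```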

python regex find characters from and end of the string

svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz
From the following string I need to fetch Rev1223, so I was wondering whether a regular expression can do that. I would like something like string.search("Rev" up to the next /).
So far I have split the string using split:
s1, s2, s3, s4, s5 = string.split("/", 4)
You don't need a regex to do this. It is as simple as:
s = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
s.split('/')[-2]  # second-to-last path component; avoid naming a variable str
Here is a quick python example
>>> import re
>>> s = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
>>> p = re.compile(r'.*/(Rev\d+)/.*')
>>> p.match(s).groups()[0]
'Rev1223'
Find second part from the end using regex, if preferred:
/(Rev\d+)/[^/]+$
http://regex101.com/r/cC6fO3/1
>>> import re
>>> m = re.search(r'/(Rev\d+)/[^/]+$', 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz')
>>> m.groups()[0]
'Rev1223'
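If the position of the revision component is not guaranteed, a search that does not depend on it may be safer. This is a hedged sketch: the helper name find_revision is made up, and it assumes the component always has the form "Rev" followed by digits and fills a whole path segment.

```python
import re

def find_revision(path):
    """Return the first 'Rev<digits>' path component, or None if absent."""
    m = re.search(r'(?:^|/)(Rev\d+)(?:/|$)', path)
    return m.group(1) if m else None

s = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
print(find_revision(s))
```

Returning None instead of raising lets the caller decide how to handle paths without a revision.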

Changing only one letter when there are a lot of similar ones in a string

Suppose I have the following strings:
I.like.football
sky.is.blue
I need to make a loop that changes the last '.' to '_' so it looks like this:
I.like_football
sky.is_blue
They all have a similar style (3 words, 2 dots).
How can I do that in a loop?
s = 'I.like.football'
parts = s.rsplit('.', 1)  # split from the right, at the last '.' only
print('_'.join(parts))    # then join the two parts with '_'
# output: I.like_football
Or in a single line:
s = '_'.join(s.rsplit('.', 1))
str.replace lets you specify the number of replacements. Unfortunately there is no str.rreplace, so you'd need to reverse the string before and after :) eg.
>>> def f(s):
... return s[::-1].replace(".", "_", 1)[::-1]
...
>>> f('I.like.football')
'I.like_football'
>>> f('sky.is.blue')
'sky.is_blue'
Alternatively you could use one of str.rpartition, str.rsplit, str.rfind
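For instance, the str.rpartition alternative could look roughly like this. The function name replace_last_dot is made up for illustration, and leaving strings without a dot unchanged is an assumption.

```python
def replace_last_dot(s, new='_'):
    """Replace only the last '.' in s with `new`."""
    head, sep, tail = s.rpartition('.')
    if not sep:  # no dot found: rpartition returns ('', '', s)
        return s
    return head + new + tail

for line in ['I.like.football', 'sky.is.blue']:
    print(replace_last_dot(line))
```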
This doesn't even need to run in a loop:
import re
# Match a '.' that is followed only by non-dot characters up to the end of a line
p = re.compile(r'\.(?=[^.]+$)', re.MULTILINE)
test_str = "I.like.football\nsky.is.blue"
subst = "_"
result = re.sub(p, subst, test_str)
print(result)
