python extract specific part of the changing string - python

I have this URL string:
Hdf5File= '/home/Windows-Share/SCS931000126/20170101.h5'
I want to get two desired output from this string:
1- 'SCS931000126'
2- '20170101'
I wrote this regular expression to extract the above mentioned outputs, so I wrote:
import re
print(re.split(r'/', (re.split(r'[a-f]',Hdf5File)[4]))[1])
print(re.split(r'\.', (re.split(r'/', (re.split(r'[a-f]',Hdf5File)[4]))[2]))[0])
This gives me the desired output(if there is a better way to extract these outputs please let me know).
But the case is that this part of the URL /home/Windows-Share/ might change, is there anyway that I only get my desired output which are always at the end of the string regardless of the part of URL that might change?
for example if I have :
Hdf5File='/home/dal/windows-Share/SCS931000126/20170101.h5'
Then i cant re-use my regex. is there any way to this in a more reusable way?

Do you need re.split? You can just as well use str.split for this one:
In [294]: x, y = Hdf5File.split('/')[-2:]
In [296]: x, y.split('.')[0]
Out[296]: ('SCS931000126', '20170101')

While a simple split will work as cᴏʟᴅsᴘᴇᴇᴅ already demonstrated, you can also use os.path to get parts of your url:
import os
Hdf5File= '/home/Windows-Share/SCS931000126/20170101.h5'
f = os.path.basename(Hdf5File)
d = os.path.basename(os.path.dirname(Hdf5File))
print( d, f ) # SCS931000126 20170101.h5
# and to remove the file extension:
f = os.path.splitext(f)[0]
print(f) # 20170101

Related

Delete all characters that come after a given string

how exactly can I delete characters after .jpg? is there a way to differentiate between the extension I take with python and what follows?
for example I have a link like that
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg?
I tried replacing but it didn't work
another way?
Use a forum to count strings or something like ?
I tried to get jpg files with this
for link in links:
res = requests.get(link).text
soup = BeautifulSoup(res, 'html.parser')
img_links = []
for img in soup.select('a.thumbnail img[src]'):
print(img["src"])
with open('links'+'.csv', 'a', encoding = 'utf-8', newline='') as csv_file:
file_is_empty = os.stat(self.filename+'.csv').st_size == 0
fieldname = ['links']
writer = csv.DictWriter(csv_file, fieldnames = fieldname)
if file_is_empty:
writer.writeheader()
writer.writerow({'links':img["src"]})
img_links.append(img["src"])
You could use split (assuming the string has 'jpg', otherwise the code below will just return the original url).
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
You can make use of regular expression. You just want to ignore the characters after .jpg so you can some use of something like this:
import re
new_url=re.findall("(.*\.jpg).*",old_url)[0]
(.*\.jpg) is like a capturing group where you're matching any number of characters before .jpg. Since . has a special meaning you need to escape the . in jpg with a \. .* is used to match any number of character but since this is not inside the capturing group () this will get matched but won't get extracted.
You can use the .find function to find the characters .jpg then you can index the string to get everything but that. Ex:
string = https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
index = string.find(".jpg")
new_string = string[:index+ 4]
You have to add four because that is the length of jpg so it does not delete that too.
The find() method returns the lowest index of the substring if it is found in given string. If its is not found then it returns -1.
str ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = str.find('jpg')
print(result)
new_str = str[:result]
print(new_str+'jpg')
See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension (if we know it's always .jpg, it's fine!)
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
Now, outside of the original question, I think there might be something wrong with how you got that piece of information int he first place. There must be a better way to extract that jpeg without messing around with the path. Sadly I can't help you with that since I a novice with BeautifulSoup.
You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg

Replacing specific substrings in a specific part of a string

I have a following text file that is to be edited in a certain manner. The part of the file that comes to inside the (init: part is to be overwritten and nothing except that should be edited.
File:
(define (problem bin-picking-doosra)
(:domain bin-picking-second)
;(:requirements :typing :negative-preconditions)
(:objects
)
(:init
(batsmen first_batsman)
(bowler none_bowler)
(umpire third_umpire)
(spectator no_spectator)
)
(:goal (and
(batsmen first_batsman)
(bowler last_bowler)
(umpire third_umpire)
(spectator full_spectator)
)
)
)
In this file I want replace every line that is inside the (init: section with the required string. In this case, I want to replace:
(batsmen first_batsman) with (batsmen none_batsmen)
(bowler none_bowler) with (bowler first_bowler)
(umpire third_umpire) with (umpire leg_umpire)
(spectator no_spectator) with (spectator empty_spectator)
The code I currently have the following:
file_path = "/home/mus/problem_turtlebot.pddl"
s = open(file_path).read()
s = s.replace('(batsmen first_batsman)', '(batsmen '+ predicate_batsmen + '_batsman)')
f = open(file_path, 'w')
f.write(s)
f.close()
The term predicate_batsmen here contains the word none. It works fine this way. This code only satisfies point number 1. mentioned above
There are three problems that I have.
This code also changes the '(batsmen first_batsmen)' part in (goal: part which I dont want. I only want it to change the (init: part
Currently for the other strings in the (init: part, I have to redo this code with different statement. For eg: for '(bowler none_bowler)' i.e. point number 2 above, I have to have a copy of the coded lines again which I think is a not a good coding technique. Any better way for it.
If we consider the first string in (init: that is to be overwritten i.e (batsmen first_batsman). Is there a way in python that no matter what matter what is written in the question mark part of the string like (batsmen ??????_batsman) could be replaced with none. For now it is 'first' but even if it is written 'second'((batsmen second_batsman)) or 'last' ((batsmen last_batsman)) , I want to replace it with 'none'(batsmen none_batsman).
Any ideas on these issues?
Thanks
First of all you need to find the init-group. The init-group seems to have the structure:
(:init
...
)
where ... is some recurrence of text contained inside parenthesis, e.g. "(batsmen first_batsman)". Regular expressions is a powerful way to locate these kind of patterns in text. If you are not familiar with regular expressions (or regex for short) have a look here.
The following regex locates this group:
import re
#Matches the items in the init-group:
item_regex = r"\([\w ]+\)\s+"
#Matches the init-group including items:
init_group_regex = re.compile(r"(\(:init\s+({})+\))".format(item_regex))
init_group = init_group_regex.search(s).group()
Now you have the init-group in match. The next step is to locate the term you would want to replace, and actually replace it. re.sub can do just that! First store the mappings in a dictionary:
mappings = {'batsmen first_batsman': 'batsmen '+ predicate_batsmen + '_batsman',
'bowler none_bowler': 'bowler first_bowler',
'umpire third_umpire': 'umpire leg_umpire',
'spectator no_spectator': 'spectator empty_spectator'}
Finding the occurrences and replacing them by their corresponding value one-by-one:
for key, val in mappings.items():
init_group = re.sub(key, val, init_group)
Finally you can replace the init-group in the original string:
s = init_group_regex.sub(init_group, s)
This is really flexible! You can use regex in mappings to have it match anything you like, including:
mappings = {'batsmen \w+_batsman': '(batsmen '+ predicate_batsmen + '_batsman)'}
to match 'batsmen none_batsman', 'batsmen first_batsman' etc.

Python regular expression, ignoring characters until some charater is matched a number of times

i'm renaming a batch of files i downloaded from a torrent and wanted to get the episode's name,so i figured regex would do the trick. I'm kinda new to regex so I'd appreciate the help. This is what i could come up to:
i have a class related to other renaming functions so the function defined here is within this class, that initializes with the path to the files directory, the expression to rename to and the file extension.
im using glob to access all files with the extension ".mkv"
for debugging i printed out all the file names:
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
def strip_ep_name(self):
for i, f in enumerate(self.files):
f_list = f.split("\\")
name, ext = os.path.splitext(f_list[-1])
ep_name = name.strip(r'(.*?)".720p.WEB-DL.x264-[MULVAcoded]"')
print(ep_name)
for me, the goal is to get the episode's name, either with or without the episode's number, because i can, later on, give the episode a new name.
and the output is:
r.Robot.S02E01.eps2.0_unm4sk-pt1.t
r.Robot.S02E02.eps2.0_unm4sk-pt2.t
r.Robot.S02E03.eps2.1_k3rnel-pan1c.ks
r.Robot.S02E04.eps2.2_init_1.as
r.Robot.S02E05.eps2.3.logic-b0mb.h
r.Robot.S02E06.eps2.4.m4ster-s1ave.aes
r.Robot.S02E07.eps2.5_h4ndshake.sm
r.Robot.S02E08.eps2.6.succ3ss0r.p1
r.Robot.S02E09.eps2.7_init_5.fv
r.Robot.S02E10.eps2.8_h1dden-pr0cess.a
r.Robot.S02E11.eps2.9_pyth0n-pt1.p7z
r.Robot.S02E12.eps2.9_pyth0n-pt2.p7z
I wanted to strip all the ".eps2.2" before the episode's name, but they dont follow an order.
Now I don't know how to move on from here. can anyone help?
Do it all in one step:
\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$
Broken down, this reads:
\.eps\d+\.\d+ # ".eps", followed by digits, a dot and other digits
[-_.] # one of -, _ or .
(.+?) # anything else lazily afterwards
(?:\.720p.+) # until .720p is found (might need some tweaking)
\. # a dot
(\w+)$ # some word characters (aka the file extension) at the end
This needs to be replaced by .\1.\2 to get your desired format in the end.
Everything in Python:
import re
filenames = """
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
"""
rx = re.compile(r'\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$', re.M)
filenames = rx.sub(r".\1.\2", filenames)
print(filenames)
Which yields
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv
See a demo on regex101.com.
Firstly import the regex module of Python:
import re
Then use this to replace from "r.Robot.S02E01.eps2.0_unm4sk-pt1.t" :
ep_name = re.sub(r"eps2\.\d{1,2}(\.|\_)","",episode_name)
use ep_name in loop and pass episode name to episode_name one by one and then print ep_name.
Output will be like:
r.Robot.S02E01.unm4sk-pt1.t
I'm not sure if I understand correctly, I don't know the series hence nor do I the titles. But do you really need re?
for f in files:
print(f[23:-35].split('.')[0])
results in
unm4sk-pt1
unm4sk-pt2
k3rnel-pan1c
init_1
logic-b0mb
m4ster-s1ave
h4ndshake
succ3ss0r
init_5
h1dden-pr0cess
pyth0n-pt1
pyth0n-pt2
Edit:
I still don't see an actual target format definition in your post, but just in case that #Jan is right, here's the re-less solution for that, too:
for f in files:
print(f[:16] + '.'.join(f[23:].split('.')[:2]) + '.mkv')
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv

Formatting form data?

I need to convert the form data below to a slightly different format to be able to submit correctly.
I have this form data.
PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:
Wanted format:
PaReq=eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz%2FlO6en5PrLZCs6j%0D%0ANWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA%2F6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp%0D%0ACEXF7Pg8X9JRgAIICbhCWz9wMY%2Boj%2FEYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY%0D%0AYMu12NOtlOUUgKZp3N%2BikGUsRbF3WeHWO0CAVphXgMdnkFWtiap%2FY5sldBGFjf1Yuzzv0PL8evrc%0D%0ApDMCtMLqk1hyiqCHoT%2F0HIimCE%2FHmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ%0D%0AFfC2LHKuzqg%2B3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ%2BtakbbhOfr6h9sjC65rpSehE%0D%0Ad4Yy1TXkQb9zlNkWEmD%2Br642A6n71A0vHRBwP9j%2F7TDLBQ%3D%3D%0D%0A&TermUrl=https%3A%2F%2Fwww.footpatrol.co.uk%2Fcheckout%2F3d&MD=
I have tried this but seems to be a different format than what I need to submit correctly.
Code:
import urllib.parse
print(urllib.parse.quote_plus('''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''))
Is this obtainable with python? And what do i need to do to achieve the wanted end result?
if your paraneters are separated by newlines you can use the splitlines method to get a list of parameters, and use re.split on each item to get a list with name, value.
Then apply quote_plus on each name and value, '='.join them and '&'.join all parameters.
import urllib.parse
import re
data = '''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6jNWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNpCEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQYYMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrcpDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQFfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehEd4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''
data = [re.split(':(?!//)', line) for line in data.splitlines()]
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If your data is split by newlines arbitrarily, you could join the lines and split by name. Then zip names and values, quote and join.
data = ''.join(data.splitlines())
data = zip(['PaReq', 'TermUrl', 'MD'], re.split('PaReq:|TermUrl:|MD:', data)[1:])
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If you want to keep the newline cheracter, use only the last two lines in the second code snippet.

String comparing in python

I have an array of strings like
urls_parts=['week', 'weeklytop', 'week/day']
And i need to monitor inclusion of this strings in my url, so this example needs to be triggered by weeklytop part only:
url='www.mysite.com/weeklytop/2'
for part in urls_parts:
if part in url:
print part
But it is of course triggered by 'week' too.
What is the way to do it right?
OOps, let me specify my question a bit.
I need that code not to trigger when url='www.mysite.com/week/day/2' and part='week'
The only url needed to trigger on is when the part='week' and the url='www.mysite.com/week/2' or 'www.mysite.com/week/2-second' for example
This is how I would do it.
import re
urls_parts=['week', 'weeklytop', 'week/day']
urls_parts = sorted(urls_parts, key=lambda x: len(x), reverse=True)
rexes = [re.compile(r'{part}\b'.format(part=part)) for part in urls_parts]
urls = ['www.mysite.com/weeklytop/2', 'www.mysite.com/week/day/2', 'www.mysite.com/week/4']
for url in urls:
for i, rex in enumerate(rexes):
if rex.search(url):
print url
print urls_parts[i]
print
break
OUTPUT
www.mysite.com/weeklytop/2
weeklytop
www.mysite.com/week/day/2
week/day
www.mysite.com/week/4
week
Suggestion to sort by length came from #Roman
Sort you list by len and break from the loop at first match.
try something like this:
>>> print(re.findall('\\weeklytop\\b', 'www.mysite.com/weeklytop/2'))
['weeklytop']
>>> print(re.findall('\\week\\b', 'www.mysite.com/weeklytop/2'))
[]
program:
>>> urls_parts=['week', 'weeklytop', 'week/day']
>>> url='www.mysite.com/weeklytop/2'
>>> for parts in urls_parts:
if re.findall('\\'+parts +r'\b', url):
print (parts)
output:
weeklytop
Why not use urls_parts like this?
['/week/', '/weeklytop/', '/week/day/']
A slight change in your code would solve this issue -
>>> for part in urls_parts:
if part in url.split('/'): #splitting the url string with '/' as delimiter
print part
weeklytop

Categories