String comparing in python - python

I have a list of strings like
urls_parts = ['week', 'weeklytop', 'week/day']
and I need to detect which of these strings occurs in my URL. This example should be triggered by the weeklytop part only:
url = 'www.mysite.com/weeklytop/2'
for part in urls_parts:
    if part in url:
        print(part)
But of course it is triggered by 'week' too.
What is the right way to do this?
Oops, let me clarify my question a bit.
The code must not trigger when url='www.mysite.com/week/day/2' and part='week'.
For part='week', it should trigger only on URLs such as 'www.mysite.com/week/2' or 'www.mysite.com/week/2-second'.

This is how I would do it.
import re
urls_parts=['week', 'weeklytop', 'week/day']
# Try the longest parts first so that 'week/day' wins over 'week'.
urls_parts = sorted(urls_parts, key=len, reverse=True)
# re.escape guards against parts that contain regex metacharacters.
rexes = [re.compile(r'{part}\b'.format(part=re.escape(part))) for part in urls_parts]
urls = ['www.mysite.com/weeklytop/2', 'www.mysite.com/week/day/2', 'www.mysite.com/week/4']
for url in urls:
    for i, rex in enumerate(rexes):
        if rex.search(url):
            print(url)
            print(urls_parts[i])
            print('')
            break
OUTPUT
www.mysite.com/weeklytop/2
weeklytop
www.mysite.com/week/day/2
week/day
www.mysite.com/week/4
week
The suggestion to sort by length came from @Roman.

Sort your list by length (longest first) and break out of the loop at the first match.
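A minimal sketch of that idea, assuming each part must line up with whole path segments (start right after a '/' and end at a '/' or at the end of the URL). The helper name match_part is hypothetical and not from the thread:

```python
import re

def match_part(url, parts):
    """Return the longest part that appears in url as whole path segments, or None."""
    for part in sorted(parts, key=len, reverse=True):
        # The part must follow a '/' and be followed by a '/' or the end of the URL.
        pattern = '/' + re.escape(part) + r'(?=/|$)'
        if re.search(pattern, url):
            return part
    return None

urls_parts = ['week', 'weeklytop', 'week/day']
for url in ['www.mysite.com/weeklytop/2',
            'www.mysite.com/week/day/2',
            'www.mysite.com/week/2-second']:
    print(url, '->', match_part(url, urls_parts))
```

Because the longest parts are tried first, 'week/day' is reported before 'week' gets a chance to match the same URL.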

Try something like this, using \b word boundaries:
>>> import re
>>> print(re.findall(r'\bweeklytop\b', 'www.mysite.com/weeklytop/2'))
['weeklytop']
>>> print(re.findall(r'\bweek\b', 'www.mysite.com/weeklytop/2'))
[]
Program:
>>> urls_parts = ['week', 'weeklytop', 'week/day']
>>> url = 'www.mysite.com/weeklytop/2'
>>> for part in urls_parts:
...     if re.findall(r'\b' + re.escape(part) + r'\b', url):
...         print(part)
...
Output:
weeklytop

Why not use urls_parts like this?
['/week/', '/weeklytop/', '/week/day/']

A slight change in your code would solve this issue:
>>> for part in urls_parts:
...     if part in url.split('/'):  # split the URL on '/' and compare whole segments
...         print(part)
...
weeklytop

Related

Remove Prefixes From a String

What's a cute way to do this in python?
Say we have a list of strings:
clean_be
clean_be_al
clean_fish_po
clean_po
and we want the output to be:
be
be_al
fish_po
po
Another approach which will work for all scenarios:
import re
data = ['clean_be',
'clean_be_al',
'clean_fish_po',
'clean_po', 'clean_a', 'clean_clean', 'clean_clean_1']
for item in data:
    item = re.sub('^clean_', '', item)
    print(item)
Output:
be
be_al
fish_po
po
a
clean
clean_1
Here is a possible solution that works with any prefix:
prefix = 'clean_'
result = [s[len(prefix):] if s.startswith(prefix) else s for s in lst]
You've provided only minimal information about what you're trying to achieve, but the desired output for the 4 given inputs can be produced by the following function, which drops everything up to and including the first underscore:
def func(string):
    return "_".join(string.split("_")[1:])
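You can do this:
strlist = ['clean_be','clean_be_al','clean_fish_po','clean_po']
def func(myList: list, start: str):
    ret = []
    for element in myList:
        # lstrip(start) would strip any leading characters found in start,
        # not the prefix itself, so slice the prefix off explicitly.
        ret.append(element[len(start):] if element.startswith(start) else element)
    return ret
print(func(strlist, 'clean_'))
I hope this was useful, Nohab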
There are many ways to do this, based on what you have provided.
Apart from the above answers, you can also do it this way:
string = 'clean_be_al'
string = string.replace('clean_','',1)
This removes the first occurrence of clean_ in the string, wherever it appears.
Also, if the string is guaranteed to start with 'clean_', you can simply slice off its first six characters:
string = 'clean_be_al'
print(string[6:])
You can use lstrip to strip leading characters and rstrip to strip trailing characters, which happens to work for this input:
line = "clean_be"
print(line.lstrip("clean_"))
Drawback:
lstrip([chars])
The [chars] argument is not a prefix; rather, all combinations of its values are stripped.
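The drawback is easiest to see with an input whose remainder begins with one of the stripped characters. The sketch below uses a hypothetical input ('clean_apple', not from the question) and a hypothetical helper remove_prefix. On Python 3.9 and later, the built-in str.removeprefix does the same job as the helper.

```python
def remove_prefix(text, prefix):
    """Return text without prefix if it starts with it, otherwise text unchanged."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

word = "clean_apple"  # hypothetical input, chosen to expose the pitfall

# lstrip treats its argument as a set of characters {c, l, e, a, n, _},
# so it also eats the leading 'a' of 'apple'.
stripped = word.lstrip("clean_")

# Slicing after a startswith check removes the literal prefix only.
sliced = remove_prefix(word, "clean_")

print(stripped)  # pple
print(sliced)    # apple
```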

How to extract string from python list

Feels like this should be easy, but I can't find the right keywords to search for the answer.
Given ['"https://container.blob.core.windows.net/"'] as results from a python statement...
...how do I extract only the URL and drop the ['" and "']?
You want the first element of the list without the first and last char
>>> l[0][1:-1]
'https://container.blob.core.windows.net/'
How about using a regex?
In [35]: url_list = ['"https://container.blob.core.windows.net/"']
In [36]: url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url_list[0])[0]
In [37]: print(url)
https://container.blob.core.windows.net/
Try this:
a = ['"https://container.blob.core.windows.net/"']
result = a[0].replace("\"","")
print(result)
Result, as a Python string:
'https://container.blob.core.windows.net/'
How about getting the first element with lst[0] (where lst is your list) and removing the double quotes from it using replace() or strip()?
print(lst[0].replace('"', ''))
OR
print(lst[0].strip('"'))

python extract specific part of the changing string

I have this URL string:
Hdf5File= '/home/Windows-Share/SCS931000126/20170101.h5'
I want to get two outputs from this string:
1- 'SCS931000126'
2- '20170101'
I wrote these regular expressions to extract them:
import re
print(re.split(r'/', (re.split(r'[a-f]',Hdf5File)[4]))[1])
print(re.split(r'\.', (re.split(r'/', (re.split(r'[a-f]',Hdf5File)[4]))[2]))[0])
This gives me the desired output (if there is a better way to extract these values, please let me know).
However, the /home/Windows-Share/ part of the path might change. Is there any way to get just my desired outputs, which are always at the end of the string, regardless of the part that changes?
For example, if I have:
Hdf5File='/home/dal/windows-Share/SCS931000126/20170101.h5'
then I can't reuse my regex. Is there a more reusable way to do this?
Do you need re.split? You can just as well use str.split for this one:
In [294]: x, y = Hdf5File.split('/')[-2:]
In [296]: x, y.split('.')[0]
Out[296]: ('SCS931000126', '20170101')
While a simple split will work as cᴏʟᴅsᴘᴇᴇᴅ already demonstrated, you can also use os.path to get parts of your url:
import os
Hdf5File= '/home/Windows-Share/SCS931000126/20170101.h5'
f = os.path.basename(Hdf5File)
d = os.path.basename(os.path.dirname(Hdf5File))
print( d, f ) # SCS931000126 20170101.h5
# and to remove the file extension:
f = os.path.splitext(f)[0]
print(f) # 20170101
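The same idea can also be sketched with pathlib, assuming the values of interest are always the last directory name and the file name without its extension. The helper name dir_and_stem is hypothetical:

```python
from pathlib import PurePosixPath

def dir_and_stem(path):
    """Return (name of the last directory, file name without extension)."""
    p = PurePosixPath(path)
    return p.parent.name, p.stem

# Both paths from the question give the same result,
# regardless of how the leading part changes.
print(dir_and_stem('/home/Windows-Share/SCS931000126/20170101.h5'))
print(dir_and_stem('/home/dal/windows-Share/SCS931000126/20170101.h5'))
```

PurePosixPath only parses the string and never touches the file system, so it behaves the same on any platform.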

How would I get rid of certain characters then output a cleaned up string In python?

In this snippet of code I am trying to obtain the links to images posted in a groupchat by a certain user:
import groupy
from groupy import Bot, Group, Member
prog_group = Group.list().first
prog_members = prog_group.members()
prog_messages = prog_group.messages()
rojer = str(prog_members[4])
rojer_messages = ['none']
rojer_pics = []
links = open('rojer_pics.txt', 'w')
print(prog_group)
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            links.write(str(message) + '\n')
links.close()
The issue is that the links file ends up containing the entire message: ("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>"
What I want to do is get rid of the characters that aren't part of the URL, so that it is written like so:
"https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12"
Are there any methods in Python that can manipulate a string like this?
I just used string.split() and split the message on the single quotes:
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            link = str(message).split("'")
            rojer_pics.append(link[1])
            links.write(str(link[1]) + '\n')
This can be done using string indices and the string method .find():
>>> url = "(\"Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')"
>>> url = url[url.find('+')+1:-2]
>>> url
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
>>>
>>> string = '("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12\')>"'
>>> string.split('+')[1][:-4]
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
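Both slicing answers depend on the exact number of trailing characters. A regular expression is a possible alternative that does not; the sketch below assumes the URL ends at the first whitespace, quote, ')' or '>' character. The names URL_RE and first_url are hypothetical:

```python
import re

# Match from the scheme up to (not including) whitespace, a quote, ')' or '>'.
URL_RE = re.compile(r"""https?://[^\s'")>]+""")

def first_url(text):
    """Return the first URL found in text, or None if there is none."""
    match = URL_RE.search(text)
    return match.group(0) if match else None

message = ("(\"Rojer Doewns: Heres a special one "
           "+https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>\"")
print(first_url(message))
```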

python replace() not working as expected

My script is supposed to write html files changing the html menu to show the current page as class="current_page_item" so that it will be highlighted when rendered. It has to do two replacements, first set the previous current page to be not current, then set the new current page to current. The two writeText.replace lines do not appear to have any effect. It doesn't give me an error or anything. Any suggestions would be appreciated.
for each in startList:
    sectionName = s[each:s.find("\n",each)].split()[1]
    if sectionName[-3:] != "-->":
        end = s.find("end "+sectionName+'-->')
        sectionText = s[each+len(sectionName)+12:end-1]
        writeText = templatetop+"\n"+sectionText+"\n"+templatebottom
        writeText.replace('<li class="current_page_item">','<li>')
        writeText.replace('<li><a href="'+sectionName+'.html','<li class="current_page_item"><a href="'+sectionName+'.html')
        f = open(sectionName+".html", 'w+')
        f.write(writeText)
        f.close()
Here is part of the string I am targeting (templatetop):
<li class="current_page_item">Home</li>
<li>History</li>
<li>Members</li>
replace returns the resulting string, so you need to do this:
writeText = writeText.replace('<li class="current_page_item">','<li>')
writeText = writeText.replace('<li><a href="'+sectionName+'.html','<li class="current_page_item"><a href="'+sectionName+'.html')
You should not expect that to work: as the documentation says, replace does not modify the string in place.
Return a copy of the string with all occurrences of substring old replaced by new.
So first you replace '<li class="current_page_item">' with '<li>' and then you replace '<li>' with '<li class="current_page_item">'. That's a bit funny, I have to say.
In addition to the problem pointed out by misha, that replace returns the result, your two replacements in fact cancel each other out.
>>> writeText = """<li class="current_page_item">Home</li>
... <li>History</li>
... <li>Members</li>"""
>>> result = writeText.replace('<li class="current_page_item">','<li>')
>>> result = result.replace('<li><a href="index.html','<li class="current_page_item"><a href="index.html')
>>> result == writeText
True
Now this is just the first iteration of replacements, but it's a good indication that you are using the wrong solution. It also means you can simply remove the first of the replacements and it will still work.
Also, why are you doing the replacement on writeText, when you are only targeting templatetop?
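To illustrate both points, here is a minimal sketch. The menu markup with <a href="..."> links is an assumption reconstructed from the replace calls in the question, and the helper name mark_current is hypothetical:

```python
def mark_current(menu_html, page):
    """Return menu_html with only the entry for page marked as current."""
    # Strings are immutable: replace returns a new string, so keep the result.
    cleared = menu_html.replace('<li class="current_page_item">', '<li>')
    target = '<li><a href="' + page + '.html'
    return cleared.replace(target, '<li class="current_page_item"><a href="' + page + '.html')

# Assumed menu markup (the links are not shown in the question's excerpt).
menu = ('<li class="current_page_item"><a href="index.html">Home</a></li>\n'
        '<li><a href="history.html">History</a></li>\n'
        '<li><a href="members.html">Members</a></li>')

menu.replace('Home', 'Start')  # result discarded, so menu is unchanged
updated = mark_current(menu, 'history')
print(updated)
```

For the page that is already current ('index' here), the two replacements cancel out and the function returns the menu unchanged.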
