Python finding nth occurrence of forward slash

I have some URLs in a JSON file, and I'm trying to extract just the image name (i.e. 1234_5678.jpg). The URLs look like this:
"display_url":
"https://scontent-ort2-1.cdninstagram.com/v/t51.2885-15/e35/42672335_535716956833725_410505336278760344_n.jpg?_nc_ht=scontent-ort2-1.cdninstagram.com&_nc_cat=109&_nc_ohc=PCKXombie-oAX-T37mi&tp=1&oh=69744106833b4fa24cb921e6e1009d32&oe=6024044B&ig_cache_key=MTg5ODMzNjQ1NzMwMTM4Njg2Nw%3D%3D.2"
I decided the method to use would be to locate the 6th occurrence of the forward slash, as well as the .jpg, and extract the substring between them:
import json

def findnth(haystack, needle, n):
    parts = haystack.split(needle, n+1)
    if len(parts) <= n+1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

with open('pathtofile.json', encoding='utf8') as json_file:
    data = json.load(json_file)
    for p in data['GraphImages']:
        url = p['display_url']
        start = findnth(url, "/", 6)
        end = url.find(".jpg")
        print(start)
        print(end)
        url = url[start:end+3]
However, the start value is always -1. The end value is between 90-110, which seems reasonable. Why isn't my nth search function locating the appropriate location?

The -1 most likely comes from findnth treating n as zero-based: findnth(url, "/", 6) looks for a seventh slash, and this URL contains only six, so the len(parts) <= n+1 check fires and returns -1 (you would want n=5 for the sixth slash). That said, you can skip the counting entirely and use urlparse.
Ex:
import os
from urllib.parse import urlparse
url = "https://scontent-ort2-1.cdninstagram.com/v/t51.2885-15/e35/42672335_535716956833725_410505336278760344_n.jpg?_nc_ht=scontent-ort2-1.cdninstagram.com&_nc_cat=109&_nc_ohc=PCKXombie-oAX-T37mi&tp=1&oh=69744106833b4fa24cb921e6e1009d32&oe=6024044B&ig_cache_key=MTg5ODMzNjQ1NzMwMTM4Njg2Nw%3D%3D.2"
o = urlparse(url)
print(os.path.basename(o.path))
# --> 42672335_535716956833725_410505336278760344_n.jpg
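Applied to the loop from the question, that looks something like this (a sketch, assuming the same GraphImages / display_url layout in your JSON file):
import os
import json
from urllib.parse import urlparse

with open('pathtofile.json', encoding='utf8') as json_file:
    data = json.load(json_file)
    for p in data['GraphImages']:
        # Drop the query string and keep only the last path component, e.g. 1234_5678.jpg
        image_name = os.path.basename(urlparse(p['display_url']).path)
        print(image_name)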

Related

extract Unique id from the URL using Python

I've a URL like this:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x= 'Enterprise-Business-Planning-Analyst_3103928-1'
I want to extract the id at the end of the URL, i.e. the x part from the above string, to get the unique id.
Any help regarding this will be highly appreciated.
_parsed_url.path.split("/")[-1].split('-')[-1]
I am using this, but it is giving an error.
Python's built-in urllib.parse and pathlib libraries can help here.
from urllib.parse import urlparse
from pathlib import PurePath

url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x = PurePath(urlparse(url).path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text Enterprise-Business-Planning-Analyst_3103928-1 you can split() with respect to the / character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.split("/")[-1])
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text 3103928 you can replace the _ character with - and you can split() with respect to the - character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.replace("_", "-").split("-")[-2])
# 3103928
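If what you ultimately need is only the numeric id, a regex with a capture group is another option. This is just a sketch, assuming the id is always the run of digits between the last underscore and the trailing -1:
import re

url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
match = re.search(r'_(\d+)-\d+$', url)   # capture the digit run after the last underscore
if match:
    print(match.group(1))
# 3103928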

Change URL in Python

How can I change the activeOffset in this URL? I am using Python and a while loop.
https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC
It first should be 10, then 20, then 30 ...
I tried urlparse but I don't understand how to just increase the number
Thanks!
If this is a fixed URL, you can write activeOffset={} in the URL then use format to replace {} with specific numbers:
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset={}&orderBy=kh&orderDirection=ASC"
for offset in range(10, 100, 10):
    print(url.format(offset))
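Since the question mentions a while loop, the same formatting trick works there as well; a quick sketch (the stop value of 100 is just an assumption, adjust it to your paging logic):
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset={}&orderBy=kh&orderDirection=ASC"
offset = 10
while offset < 100:              # assumed upper bound
    print(url.format(offset))
    offset += 10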
If you cannot modify the URL (because you get it as an input from some other part of your program), you can use regular expressions to replace occurrences of activeOffset=... with the required number (reference):
import re
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
query = "activeOffset="
pattern = re.compile(query + "\\d+") # \\d+ means any sequence of digits
for offset in range(10, 100, 10):
    # Replace occurrences of the pattern with the modified query
    print(pattern.sub(query + str(offset), url))
If you want to use urlparse, you can apply the previous approach to the fragment part returned by urlparse:
import re
from urllib.parse import urlparse, urlunparse
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
query = "activeOffset="
pattern = re.compile(query + "\\d+") # \\d+ means any sequence of digits
parts = urlparse(url)
for offset in range(10, 100, 10):
    fragment_modified = pattern.sub(query + str(offset), parts.fragment)
    parts_modified = parts._replace(fragment=fragment_modified)
    url_modified = urlunparse(parts_modified)
    print(url_modified)
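Since the fragment here is itself formatted like a query string, another option (a sketch, not what the answer above does) is to let parse_qs and urlencode do the rewriting instead of a regex:
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
parts = urlparse(url)
for offset in range(10, 100, 10):
    params = parse_qs(parts.fragment)        # {'activeOffset': ['10'], 'orderBy': ['kh'], ...}
    params['activeOffset'] = [str(offset)]
    fragment_modified = urlencode(params, doseq=True)
    print(urlunparse(parts._replace(fragment=fragment_modified)))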

Extract urls from a string of html data

I already tried to extract this HTML data with BeautifulSoup, but it's limited to tags. What I need is to get the trailing something.html or some/something.html after the prefix www.example.com/products/, while eliminating parameters like ?search=1. I'd prefer to use regex for this, but I don't know the exact pattern.
input:
System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html
prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
I guess you want to use re here, with a trick, since a "?" will follow the "html" in a URI:
import re
L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"
>>> [re.search(prefix+'(.*)html', el).group(1) + 'html' for el in L if prefix in el]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']
Though the above answer using the re module is just awesome, you could also work around it without the module, like this:
prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
for l in L:
    input_ = l.rsplit(prefix, 1)
    try:
        input_ = input_[1]
        ans.append(input_[:input_.index('.html')] + '.html')
    except Exception:
        # prefix or '.html' not found in this URL
        pass
print(ans)
# ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
Another option is to use urlparse instead of/along with re
It will allow you to split a URL like this:
import urlparse
my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlparse.urlsplit(my_url)
url_obj.scheme
>>> 'http'
url_obj.netloc
>>> 'www.example.com'
url_obj.path
>>> '/products/ammaxxllx.html'
url_obj.query
>>> 'spam=eggs'
url_obj.fragment
>>> 'sometag'
# Now you're able to work with every chunk as wanted!
prefix = '/products'
if url_obj.path.startswith(prefix):
    # Do whatever you need, replacing the initial characters. You can use re here
    print url_obj.path[len(prefix) + 1:]
>>> ammaxxllx.html
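Putting that together with the question's list and expected output, here is a sketch using Python 3's urllib.parse (the Python 2 urlparse module used above was renamed there):
from urllib.parse import urlsplit

L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
prefix = '/products/'
result = []
for link in L:
    path = urlsplit(link).path       # scheme-relative URLs (//host/...) parse fine too
    if path.startswith(prefix):
        result.append(path[len(prefix):])
print(result)
# ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']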

decoding Microsoft Safelink URL in Python

import re
import urllib
import HTMLParser

urlRegex = re.compile(r'(.+)&data=')
match = urlRegex.search('https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0')
x = match.group()
urlRegex_1 = re.compile(r'url=(.+)&data=')
match_1 = urlRegex_1.search(x)
print match_1.group(1)
htmlencodedurl = urllib.unquote(urllib.unquote(match_1.group(1)))
actual_url = HTMLParser.HTMLParser().unescape(htmlencodedurl)
So the 'actual_url' displays this:
'https://office.memoriesflower.com/Permission/%$^&&##^$%^&^&#^%%&#^*#&^%'
I need it to display this:
https://office.memoriesflower.com/Permission/office.php
The following is cleaner, as it uses urlparse to extract the query string and then uses path operations to remove the unwanted component:
import posixpath as path
from urlparse import urlparse, parse_qs, urlunparse
url = 'https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0'
target = parse_qs(urlparse(url).query)['url'][0]
p = urlparse(target)
q = p._replace(path=path.join(path.dirname(path.dirname(p.path)), path.basename(p.path)))
print urlunparse(q)
prints https://office.memoriesflower.com/Permission/office.php
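For what it's worth, on Python 3 the same approach only needs different import locations and a print call; a sketch:
import posixpath as path
from urllib.parse import urlparse, parse_qs, urlunparse

url = 'https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0'
target = parse_qs(urlparse(url).query)['url'][0]    # the percent-decoded url= parameter
p = urlparse(target)
q = p._replace(path=path.join(path.dirname(path.dirname(p.path)), path.basename(p.path)))
print(urlunparse(q))
# https://office.memoriesflower.com/Permission/office.php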
I found this having a similar problem. Here's the code I used to resolve the issue. It's not particularly elegant, but you may be able to tweak it for your needs.
self.urls = re.findall(r"safelinks\.protection\.outlook\.com/\?url=.*?sdata=", self.body.lower(), re.M)
if len(self.urls) > 0:
    for i, v in enumerate(self.urls):
        self.urls[i] = v[38:-11]
This works by getting the value in an ugly format and then stripping off the excess pieces of each item as a string. I believe the proper way to do this is with grouping, but this worked well enough for my needs.
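Since grouping is indeed the cleaner route, here is a hedged sketch of what that could look like (Python 3; it reuses the url=...&data= capture idea from the question's own regex and, like the question's code, unquotes twice):
import re
from urllib.parse import unquote

body = 'https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0'  # stand-in for the message body
pattern = re.compile(r"safelinks\.protection\.outlook\.com/\?url=(.*?)&data=", re.IGNORECASE)
decoded = [unquote(unquote(m)) for m in pattern.findall(body)]   # capture and decode each url= value
print(decoded)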

In Python, how can I get a substring that's in between coordinates XY of another string?

I'm trying to create a script that auto-downloads the Bing wallpaper of the day so it can be my desktop background, and so far I found out that the URL for it is hidden in the g_img tag. So what I've done so far is find the position of g_img={url: " and then of the next ".
from os.path import expanduser
import urllib
import os
HOME = expanduser("~")
os.system("wget https://www.bing.com")
str="g_img={url: \""
file="index.html"
pos=open(file, 'r').read().find(str)
pos2=open(file, 'r').read().find("\"", pos)
# function to get the url string
# function to get PicName
BingBackUrl= "https://www.bing.com%s" % url
urllib.urlretrieve(BingBackUrl, PicName)
os.system("gsettings set org.gnome.desktop.background picture-uri file://%s/%s" % (HOME, PicName))
Something like this. I'd use with so the file gets closed at the end:
with open(file) as data:
    urlraw = data.read()
    pos = urlraw.find(str) + len(str)
    pos2 = urlraw.find("\"", pos)   # start the search at pos so we find the closing quote
    url = urlraw[pos:pos2]          # the image URL between the quotes
The file is just a string, so there's no XY; it's all one dimension. Then you need to take into account that you want the end of the first string, not the beginning, hence adding len(str).
Should basically do what you need, if I've read your requirements properly.
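PicName is never defined in the question's snippet; one hedged way to derive it from the extracted url is to reuse the basename idea from the answers above. The path below is a made-up example, not Bing's actual format:
import os

url = "/az/hprichbg/rb/SomeImage_1920x1080.jpg"   # hypothetical value extracted above
PicName = os.path.basename(url)                   # assumption: the URL ends with the image file name
BingBackUrl = "https://www.bing.com%s" % url
print(BingBackUrl, PicName)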
