Decoding Microsoft Safelink URL in Python

import re
import urllib
import HTMLParser
urlRegex = re.compile(r'(.+)&data=')
match=urlRegex.search('https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0')
x = match.group()
urlRegex_1 = re.compile(r'url=(.+)&data=')
match_1 = urlRegex_1.search(x)
print match_1.group(1)
htmlencodedurl = urllib.unquote(urllib.unquote(match_1.group(1)))
actual_url = HTMLParser.HTMLParser().unescape(htmlencodedurl)
So the 'actual_url' displays this:
'https://office.memoriesflower.com/Permission/%$^&&##^$%^&^&#^%%&#^*#&^%'
I need it to display this:
https://office.memoriesflower.com/Permission/office.php

The following is cleaner as it uses the urlparse to extract the query string and then uses path operations to remove the unwanted component:
import posixpath as path
from urlparse import urlparse, parse_qs, urlunparse
url = 'https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0'
target = parse_qs(urlparse(url).query)['url'][0]
p = urlparse(target)
q = p._replace(path=path.join(path.dirname(path.dirname(p.path)), path.basename(p.path)))
print urlunparse(q)
prints https://office.memoriesflower.com/Permission/office.php
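For readers on Python 3, where urlparse lives in urllib.parse, the same idea can be sketched as below. This is only an illustration: the function names unwrap_safelink and drop_parent_segment are made up here, the example link is a shortened hypothetical one, and the sketch assumes the unwanted component is always the directory directly above the file name.

```python
import posixpath
from urllib.parse import parse_qs, urlparse, urlunparse


def unwrap_safelink(safelink):
    """Return the percent-decoded value of the `url` query parameter, or None."""
    values = parse_qs(urlparse(safelink).query).get("url")
    return values[0] if values else None


def drop_parent_segment(url):
    """Remove the directory that directly contains the final path component."""
    parts = urlparse(url)
    grandparent = posixpath.dirname(posixpath.dirname(parts.path))
    new_path = posixpath.join(grandparent, posixpath.basename(parts.path))
    return urlunparse(parts._replace(path=new_path))


# Hypothetical Safelink with a shortened target, for illustration only.
wrapped = (
    "https://na01.safelinks.protection.outlook.com/"
    "?url=https%3A%2F%2Fexample.com%2FPermission%2Fjunk%2Foffice.php"
    "&data=01&reserved=0"
)
target = unwrap_safelink(wrapped)
print(target)                       # https://example.com/Permission/junk/office.php
print(drop_parent_segment(target))  # https://example.com/Permission/office.php
```

parse_qs already percent-decodes the value once, so no separate unquote call is needed for the outer layer.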

I found this while having a similar problem. Here's the code I used to resolve the issue. It's not particularly elegant, but you may be able to tweak it for your needs.
self.urls = (re.findall("safelinks\.protection\.outlook\.com/\?url=.*?sdata=", self.body.lower(), re.M))
if len(self.urls) > 0:
    for i, v in enumerate(self.urls):
        self.urls[i] = v[38:-11]
This works by getting the value in an ugly format and then stripping off the excess pieces of each item as a string. I believe the proper way to do this is with grouping, but this worked well enough for my needs.
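The grouping approach mentioned above could look roughly like the following Python 3 sketch. The pattern and the name extract_wrapped_urls are assumptions made for illustration, as is the guess that the separator before data= may appear either as & or as the HTML entity &amp; in a message body.

```python
import re
from urllib.parse import unquote

# The capture group keeps only the wrapped URL, so no slicing by offset is needed.
SAFELINK_RE = re.compile(
    r"safelinks\.protection\.outlook\.com/\?url=(.*?)&(?:amp;)?data=",
    re.IGNORECASE,
)


def extract_wrapped_urls(body):
    """Return the percent-decoded target of every Safelink found in body."""
    return [unquote(encoded) for encoded in SAFELINK_RE.findall(body)]


# Hypothetical message body, for illustration only.
body = (
    "Click https://na01.safelinks.protection.outlook.com/"
    "?url=https%3A%2F%2Fexample.com%2Fa&data=01&sdata=abc&reserved=0 or "
    "https://EUR02.safelinks.protection.outlook.com/"
    "?url=https%3A%2F%2Fexample.org&amp;data=02&amp;sdata=def"
)
print(extract_wrapped_urls(body))
# ['https://example.com/a', 'https://example.org']
```

Using re.IGNORECASE instead of lowercasing the body leaves the case of the extracted URLs untouched.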

Related

extract Unique id from the URL using Python

I've a URL like this:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x= 'Enterprise-Business-Planning-Analyst_3103928-1'
I want to extract the ID at the end of the URL, i.e. the x part of the string above, to get the unique ID.
Any help regarding this will be highly appreciated.
_parsed_url.path.split("/")[-1].split('-')[-1]
I am using this, but it gives an error.
Python's urllib.parse and pathlib builtin libraries can help here.
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
from urllib.parse import urlparse
from pathlib import PurePath
x = PurePath(urlparse(url).path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text Enterprise-Business-Planning-Analyst_3103928-1 you can split() with respect to the / character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.split("/")[-1])
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text 3103928 you can replace the _ character with - and you can split() with respect to the - character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.replace("_", "-").split("-")[-2])
# 3103928
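If only the numeric part is needed, a regular expression anchored to the end of the last path segment is a little less sensitive to extra hyphens or underscores elsewhere in the URL. This is a sketch under the assumption that the ID is the run of digits after the last underscore, optionally followed by a -N suffix; the function name job_id is made up for illustration.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def job_id(url):
    """Return the digits after the final underscore of the last path segment."""
    slug = PurePosixPath(urlparse(url).path).name
    match = re.search(r"_(\d+)(?:-\d+)?$", slug)
    return match.group(1) if match else None


url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(job_id(url))
# 3103928
```

The function returns None when the last segment does not follow that shape, rather than raising an IndexError.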

How to remove string after .com and "https://" from an URL in Python

I need to merge two dataframes using the URL as the key. However, the URLs contain extra strings: in df1 I have https://www.mcdonalds.com/us/en-us.html, whereas in df2 I have https://www.mcdonalds.com
I need to remove the /us/en-us.html after the .com and the https:// from the url, so I can perform the merge using url between 2 dfs. Below is a simplified example. What would be the solution for this?
df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your-location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}
df1['url']==df2['url']
Out[7]: False
Thanks.
URLs are not trivial to parse. Take a look at the urllib module in the standard library.
Here's how you could remove the path after the domain:
import urllib.parse
def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)
df1['url'] = df1['url'].apply(remove_path)
You can use urlparse as suggested by others, or you could also use urlsplit. However, neither will handle a bare host such as www.cemexusa.com, which has no scheme. So if you do not need the scheme in your key, you could use something like this:
from urllib.parse import urlsplit
def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
Here is a full working example:
import pandas as pd
import io
from urllib.parse import urlsplit
df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")
df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")
df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)
def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)
joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))
# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)
The output of print(joined) would be:
  Description                Key  Last Update
0   Junk Food  www.mcdonalds.com         2021
1       Cemex   www.cemexusa.com         2020
There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:
urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com
urlsplit("https://www.realpython.com").hostname # also a valid URL
# www.realpython.com
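One way to handle the omitted www is to normalize the key so that both spellings map to the same value. The sketch below is only one possible convention and assumes that a leading "www." never distinguishes two different sites in your data.

```python
from urllib.parse import urlsplit


def to_key(url):
    """Return the lowercased hostname without a leading 'www.'."""
    if "://" not in url:
        # Give bare hosts a scheme so that urlsplit fills in the hostname.
        url = f"https://{url}"
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(to_key("https://realpython.com/pandas-merge-join-and-concat"))
# realpython.com
print(to_key("https://www.realpython.com"))
# realpython.com
print(to_key("www.cemexusa.com"))
# cemexusa.com
```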
What is the difference between urlparse and urlsplit?
It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.
[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
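The difference is easiest to see on a URL that actually carries params, that is, a ;-separated suffix on the last path segment. The example URL below is made up for illustration.

```python
from urllib.parse import urlparse, urlsplit

url = "https://example.com/path;type=a?x=1"

parsed = urlparse(url)  # 6 fields, params split off from the path
split = urlsplit(url)   # 5 fields, params stay inside the path

print(parsed.path, parsed.params)
# /path type=a
print(split.path)
# /path;type=a
print(parsed.hostname == split.hostname)
# True
```

For extracting the hostname the two functions give the same answer, so the choice only matters when the path may contain such params.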
Use urlparse and isolate the hostname:
from urllib.parse import urlparse
urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'

Change url in python

how can I change the activeOffset in this url? I am using Python and a while loop
https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC
It first should be 10, then 20, then 30 ...
I tried urlparse but I don't understand how to just increase the number
Thanks!
If this is a fixed URL, you can write activeOffset={} in the URL then use format to replace {} with specific numbers:
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset={}&orderBy=kh&orderDirection=ASC"
for offset in range(10,100,10):
    print(url.format(offset))
If you cannot modify the URL (because you get it as an input from some other part of your program), you can use regular expressions to replace occurrences of activeOffset=... with the required number:
import re
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
query = "activeOffset="
pattern = re.compile(query + "\\d+") # \\d+ means any sequence of digits
for offset in range(10,100,10):
    # Replace occurrences of pattern with the modified query
    print(pattern.sub(query + str(offset), url))
If you want to use urlparse, you can apply the previous approach to the fragment part returned by urlparse:
import re
from urllib.parse import urlparse, urlunparse
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
query = "activeOffset="
pattern = re.compile(query + "\\d+") # \\d+ means any sequence of digits
parts = urlparse(url)
for offset in range(10,100,10):
    fragment_modified = pattern.sub(query + str(offset), parts.fragment)
    parts_modified = parts._replace(fragment = fragment_modified)
    url_modified = urlunparse(parts_modified)
    print(url_modified)
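Because the fragment in this URL happens to be formatted like a query string, it can also be edited without a regex by parsing it with parse_qsl and rebuilding it with urlencode. This is a sketch that assumes the fragment always has that key=value&key=value shape; the helper name with_offset is made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_offset(url, offset):
    """Return url with activeOffset in the fragment set to offset."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.fragment))  # insertion order is preserved
    params["activeOffset"] = str(offset)
    return urlunsplit(parts._replace(fragment=urlencode(params)))


url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
for offset in range(10, 40, 10):
    print(with_offset(url, offset))
```

Note that urlencode percent-encodes special characters in the values, which makes no difference for the plain values used here.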

Extract urls from a string of html data

I already tried to extract this HTML data with BeautifulSoup, but it is limited to tags. What I need is the trailing something.html or some/something.html after the prefix www.example.com/products/, without parameters such as ?search=1. I would prefer to use a regex for this, but I don't know the exact pattern.
input:
System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html
prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
I guess you want to use re here, with a trick, since a "?" may follow the "html" in a URI:
import re
L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"
>>> [re.search(prefix+'(.*)html', el).group(1) + 'html' for el in L if prefix in el]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']
The re-based answer above works well, but you can also do this without the module, like this:
prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
for l in L:
    input_ = l.rsplit(prefix, 1)
    try:
        input_ = input_[1]
        ans.append(input_[:input_.index('.html')] + '.html')
    except Exception as e:
        pass
print(ans)
['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
Another option is to use urlparse instead of/along with re
It will allow you to split a URL like this:
import urlparse
my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlparse.urlsplit(my_url)
url_obj.scheme
# 'http'
url_obj.netloc
# 'www.example.com'
url_obj.path
# '/products/ammaxxllx.html'
url_obj.query
# 'spam=eggs'
url_obj.fragment
# 'sometag'
# Now you're able to work with every chunk as wanted!
prefix = '/products'
if url_obj.path.startswith(prefix):
    # Do whatever you need, replacing the initial characters. You can use re here
    print(url_obj.path[len(prefix) + 1:])
# ammaxxllx.html
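Since the input in the question is one long string rather than a list of URLs, the matching can also be done directly on the raw text. The following Python 3 sketch assumes every wanted path ends in .html and contains no whitespace, quotes, ?, # or ]; the function name product_paths and the shortened sample text are made up for illustration.

```python
import re


def product_paths(text, prefix="www.example.com/products/"):
    """Return every path ending in .html that directly follows prefix in text."""
    pattern = re.compile(re.escape(prefix) + r"([^\s\"'?#\]]+?\.html)")
    return pattern.findall(text)


# Shortened, hypothetical sample in the shape of the question's input.
text = (
    '"productUrl":"//www.example.com/products/sam-sung-i250505808.html?search=1", '
    '"image": ["//www.example.com/products/site/ammaxxllx.html], '
    '"https://www.example.com/site/kakzja.html'
)
print(product_paths(text))
# ['sam-sung-i250505808.html', 'site/ammaxxllx.html']
```

re.escape keeps the dots in the prefix from matching arbitrary characters, and the lazy quantifier stops at the first .html so the ?search=1 parameter is never included.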

Get YouTube video url or YouTube video ID from a string using RegEx

So I've been stuck on this for about an hour or so now and I just cannot get it to work. So far I have been trying to extract the whole link from the string, but now I feel like it might be easier to just get the video ID.
The RegEx would need to take the ID/URL from the following link styles, no matter where they are in a string:
http://youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
https://youtube.com/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
youtube.com/iwGFalTRHDA
youtube.com/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc
So far, I have this RegEx:
((http(s)?\:\/\/)?(www\.)?(youtube|youtu)((\.com|\.be)\/)(watch\?v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
This catches the link, but it also breaks the link down into multiple parts and adds those to the list, so if a string contains a single YouTube link, the output when I print the list is something like this:
('https://www.youtube.com/watch?v=Idn7ODPMhFY', 'https://', 's', 'www.', 'youtube', '.com/', '.com', 'watch?v=', 'Idn7ODPMhFY', '', '')
I need the list to contain only the link itself, or just the video ID (which would be preferable). I have tried doing this myself for quite a while now, but I just cannot figure it out. Could someone sort out the regex for me and tell me where I am going wrong, so that I don't run into this issue again in the future?
Instead of writing a complicated regex that probably will not work in all cases, you are better off using tools that parse the URL, such as urllib:
from urllib.parse import urlparse, parse_qs
url = 'http://youtube.com/watch?v=iwGFalTRHDA'
def get_id(url):
    u_pars = urlparse(url)
    quer_v = parse_qs(u_pars.query).get('v')
    if quer_v:
        return quer_v[0]
    pth = u_pars.path.split('/')
    if pth:
        return pth[-1]
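Note that the path fallback always returns a string: str.split never returns an empty list, so a URL with neither a v parameter nor a path yields an empty string rather than None.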
I tested it with the sample urls:
>>> get_id('http://youtube.com/watch?v=iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related')
'iwGFalTRHDA'
>>> get_id('https://youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://youtu.be/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('youtube.com/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4')
'r5nB9u4jjy4'
>>> get_id('http://www.youtube.com/watch?v=t-ZRX8984sc')
't-ZRX8984sc'
>>> get_id('http://youtu.be/t-ZRX8984sc')
't-ZRX8984sc'
Here's the approach I'd use, no regex needed at all.
(This is pretty much equivalent to @Willem Van Onsem's solution, plus an easy to run / update unit test).
from urlparse import parse_qs
from urlparse import urlparse
import re
import unittest

TEST_URLS = [
    ('iwGFalTRHDA', 'http://youtube.com/watch?v=iwGFalTRHDA'),
    ('iwGFalTRHDA', 'http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related'),
    ('iwGFalTRHDA', 'https://youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'http://youtu.be/n17B_uFF4cA'),
    ('iwGFalTRHDA', 'youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'youtube.com/n17B_uFF4cA'),
    ('r5nB9u4jjy4', 'http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4'),
    ('t-ZRX8984sc', 'http://www.youtube.com/watch?v=t-ZRX8984sc'),
    ('t-ZRX8984sc', 'http://youtu.be/t-ZRX8984sc'),
    (None, 'http://www.stackoverflow.com')
]

YOUTUBE_DOMAINS = [
    'youtu.be',
    'youtube.com',
]

def extract_id(url_string):
    # Make sure all URLs start with a valid scheme
    if not url_string.lower().startswith('http'):
        url_string = 'http://%s' % url_string
    url = urlparse(url_string)
    # Check host against whitelist of domains
    if url.hostname.replace('www.', '') not in YOUTUBE_DOMAINS:
        return None
    # Video ID is usually to be found in 'v' query string
    qs = parse_qs(url.query)
    if 'v' in qs:
        return qs['v'][0]
    # Otherwise fall back to path component
    return url.path.lstrip('/')

class TestExtractID(unittest.TestCase):
    def test_extract_id(self):
        for expected_id, url in TEST_URLS:
            result = extract_id(url)
            self.assertEqual(
                expected_id, result, 'Failed to extract ID from '
                'URL %r (got %r, expected %r)' % (url, result, expected_id))

if __name__ == '__main__':
    unittest.main()
I really advise following @LukasGraf's comment; however, if you really must use a regex, you can try the following:
(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
Here is a working example in regex101:
https://regex101.com/r/5eRqn2/1
And here the python example:
In [38]: r = re.compile('(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(?:\-|\_)[0-z]{4}|.(?:\-|\_)[0-z]{9}))')
In [39]: r.match('http://youtube.com/watch?v=iwGFalTRHDA').groups()
Out[39]: ('iwGFalTRHDA',)
In [40]: r.match('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related').groups()
Out[40]: ('iwGFalTRHDA',)
In [41]: r.match('https://youtube.com/iwGFalTRHDA').groups()
Out[41]: ('iwGFalTRHDA',)
To keep a group from being captured, make it non-capturing by writing it as (?:...).
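The effect on re.findall, which is what produced the tuple of fragments in the question, can be seen in a small comparison. The simplified patterns below only cover the youtu.be form and are meant as an illustration of grouping, not as a complete YouTube matcher.

```python
import re

text = "watch http://youtu.be/n17B_uFF4cA now"

# Every (...) is a capturing group, so findall returns one tuple per match.
capturing = re.compile(r"(https?://)?(www\.)?youtu\.be/([\w-]{11})")
print(capturing.findall(text))
# [('http://', '', 'n17B_uFF4cA')]

# With (?:...) only the ID group is captured, so findall returns plain strings.
non_capturing = re.compile(r"(?:https?://)?(?:www\.)?youtu\.be/([\w-]{11})")
print(non_capturing.findall(text))
# ['n17B_uFF4cA']
```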
