Extract urls from a string of html data - python

I already tried extracting this HTML data with BeautifulSoup, but it is limited to tags. What I need is the trailing something.html or some/something.html after the prefix www.example.com/products/, without parameters such as ?search=1. I would prefer to use a regex for this, but I don't know the exact pattern.
input:
System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html
prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']

I guess you want to use re here, with a small trick, since a "?" may follow the "html" in a URI:
import re
L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"
>>> [re.search(prefix+'(.*)html', el).group(1) + 'html' for el in L if prefix in el]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']
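The answer above works on a list of URLs that has already been split out. As a minimal sketch, assuming the input is still one raw string as in the question, the same idea can be applied with re.findall. The function name extract_paths and the shortened sample data are hypothetical, made up for illustration:

```python
import re

def extract_paths(data, prefix):
    """Return every '<something>.html' that directly follows prefix in data."""
    # re.escape keeps the dots in the prefix from acting as wildcards.
    # The character class stops at quotes, '?', whitespace and ']', so a
    # match cannot run past the end of one URL into the next.
    pattern = re.escape(prefix) + r'([^"?\s\]]+?\.html)'
    return re.findall(pattern, data)

# Shortened, hypothetical sample in the same shape as the question's input.
data = ('"productUrl":"//www.example.com/products/phone-i250505808.html?search=1", '
        '"image": ["//www.example.com/products/site/ammaxxllx.html], '
        '"https://www.example.com/site/kakzja.html')
print(extract_paths(data, "www.example.com/products/"))
```

The last URL is skipped because it does not contain the /products/ prefix, and the ?search=1 parameter is dropped because the match ends at .html.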

The re-based answer above works well, but you could also do this without the re module, like this:
prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
for l in L:
    input_ = l.rsplit(prefix, 1)
    try:
        input_ = input_[1]
        ans.append(input_[:input_.index('.html')] + '.html')
    except (IndexError, ValueError):
        # prefix or '.html' not found in this URL: skip it
        pass
print(ans)
['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']

Another option is to use urlparse (urllib.parse in Python 3) instead of, or along with, re.
It lets you split a URL like this:
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit
my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlsplit(my_url)
url_obj.scheme
>>> 'http'
url_obj.netloc
>>> 'www.example.com'
url_obj.path
>>> '/products/ammaxxllx.html'
url_obj.query
>>> 'spam=eggs'
url_obj.fragment
>>> 'sometag'
# Now you're able to work with every chunk as wanted!
prefix = '/products'
if url_obj.path.startswith(prefix):
    # Do whatever you need, replacing the initial characters. You can use re here
    print(url_obj.path[len(prefix) + 1:])
>>> ammaxxllx.html
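The URLs in the question are protocol-relative (they start with //), and urlsplit still recognises the host in that form. A small sketch of the parser-based approach wrapped in a helper; the function name product_path and its default arguments are assumptions for illustration, not part of the original answer:

```python
from urllib.parse import urlsplit

def product_path(url, host="www.example.com", prefix="/products/"):
    """Return the path after prefix, or None if the URL is not a product URL."""
    parts = urlsplit(url)
    # The query string (?search=1) and fragment live in separate fields,
    # so parts.path never contains them.
    if parts.netloc == host and parts.path.startswith(prefix):
        return parts.path[len(prefix):]
    return None

urls = [
    "//www.example.com/products/site/ammaxxllx.html",
    "https://www.example.com/site/kakzja.html",
    "//www.example.com/products/ammaxxllx.html?search=1",
]
print([p for p in map(product_path, urls) if p is not None])
```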

Related

extract Unique id from the URL using Python

I've a URL like this:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x= 'Enterprise-Business-Planning-Analyst_3103928-1'
I want to extract the ID at the end of the URL, i.e. the x part of the string above, to get the unique ID.
Any help with this will be highly appreciated.
_parsed_url.path.split("/")[-1].split('-')[-1]
I am using this, but it gives an error.
Python's standard-library modules urllib.parse and pathlib can help here.
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
from urllib.parse import urlparse
from pathlib import PurePath
x = PurePath(urlparse(url).path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text Enterprise-Business-Planning-Analyst_3103928-1 you can split() with respect to the / character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.split("/")[-1])
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text 3103928 you can replace the _ character with - and you can split() with respect to the - character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.replace("_", "-").split("-")[-2])
# 3103928
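The replace-and-split approach depends on how many hyphens the job title contains after the underscore. As a hedged alternative, assuming the ID is always a run of digits after the last underscore, optionally followed by a -N suffix, a regex anchored at the end of the last path segment could look like this. The function name job_slug_and_id is hypothetical:

```python
import re
from urllib.parse import urlparse

def job_slug_and_id(url):
    """Return (last path segment, numeric ID) for a job URL."""
    # Take the last non-empty path segment, ignoring any query string.
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: the ID is the digits after '_' at the end of the slug,
    # optionally followed by a '-<number>' suffix.
    match = re.search(r"_(\d+)(?:-\d+)?$", slug)
    return slug, (match.group(1) if match else None)

url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(job_slug_and_id(url))
```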

Strip A specific part from a url string in python

I'm passing through some URLs and I'd like to strip out a part that changes dynamically, so I don't know it beforehand.
An example url is:
https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2
And I'd like to strip the gid=lostchapter part without any of the rest.
How do I do that?
You can use urllib to convert the query string into a Python dict and access the desired item:
In [1]: from urllib import parse
In [2]: s = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
In [3]: q = parse.parse_qs(parse.urlsplit(s).query)
In [4]: q
Out[4]:
{'pid': ['2'],
'gid': ['lostchapter'],
'lang': ['en_GB'],
'practice': ['1'],
'channel': ['desktop'],
'demo': ['2']}
In [5]: q["gid"]
Out[5]: ['lostchapter']
Here is a simple way to strip it out:
urls = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
# Import the `urlparse` and `urlunparse` methods
from urllib.parse import urlparse, urlunparse
# Parse the URL
url = urlparse(urls)
# Convert the `urlparse` object back into a URL string
url = urlunparse(url)
# Strip the string
url = url.split("?")[1]
url = url.split("&")[1]
# Print the new URL
print(url) # Prints "gid=lostchapter"
Method 1: Using urlparse
from urllib.parse import urlparse
p = urlparse('https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2')
param: list[str] = [i for i in p.query.split('&') if i.startswith('gid=')]
Output: ['gid=lostchapter']
Method 2: Using regex
import re
# [^&]* stops at the next '&'; a greedy .* would run on to the last '&' in the URL
param: str = re.search(r'gid=[^&]*', 'https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2').group()
You can change the regex pattern to match the expected output. As written, it extracts whatever value follows gid=, up to the next &.
We can try doing a regex replacement:
import re
url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
output = re.sub(r'(?<=[?&])gid=lostchapter&?', '', url)
print(output) # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
For a more generic replacement, match on the following regex pattern:
(?<=[?&])gid=\w+&?
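If the goal is to remove the parameter rather than extract it, the regex replacement can leave a trailing & when gid is the last parameter. A parser-based sketch avoids that by rebuilding the query string. The helper name drop_param and the example.com/play address are hypothetical, since the real host is elided in the question:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_param(url, name):
    """Return url with every occurrence of the query parameter name removed."""
    parts = urlsplit(url)
    # parse_qsl keeps the original parameter order as (key, value) pairs.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    # urlencode re-escapes the values, so the rebuilt URL may differ from the
    # input in how special characters are percent-encoded.
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical URL; the real host is not shown in the question.
print(drop_param("https://example.com/play?pid=2&gid=lostchapter&lang=en_GB", "gid"))
```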
Using string slicing (I'm assuming there will be an '&' after gid=lostchapter):
url = r'https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
start = url.find('gid')
end = start + url[start:].find('&')
url = url[start:end]
print(url)
Output:
gid=lostchapter
What I'm doing here is:
find the index of the first occurrence of "gid"
find the first "&" after "gid"
take the slice of the URL from "gid" up to, but not including, that "&"

Get number sequence after an specific string in url text

I'm coding a Python script to check a bunch of URLs and get their ID text. The URLs follow this pattern:
http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
Up to
http://XXXXXXX.XXX/index.php?id=YYYYYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
What I'm trying to do is get only the numbers after the id= and before the &.
I've tried the regex (\D+)(\d+), but it also returns the numbers from auth.
Any suggestions on how to get only the id sequence?
Another way is to use split:
string = 'http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX'
string.split('id=')[1].split('&auth=')[0]
Output:
YY
These are URLs, so I would just use a URL parser in this case.
Look at urllib.parse.
Use urlparse to get the query string, and then parse_qs to turn it into a dict.
import urllib.parse as p
url = "http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"
query = p.urlparse(url).query
params = p.parse_qs(query)
print(params['id'])
You can include the start and stop tokens in the regex:
pattern = r'id=(\d+)(?:&|$)'
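A short sketch of how that pattern might be used. The [?&] in front of id= is an addition not in the answer above, so that a parameter such as uid= is not picked up by accident. The function name extract_id and the example.com URLs are hypothetical:

```python
import re

# Assumption: the id is purely numeric and is a query parameter, so it is
# preceded by '?' or '&' and followed by '&' or the end of the string.
ID_PATTERN = re.compile(r'[?&]id=(\d+)(?:&|$)')

def extract_id(url):
    """Return the digits of the id parameter, or None if there is no match."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

# Hypothetical URLs in the same shape as the question's.
for url in ["http://example.com/index.php?id=1234&auth=abc99",
            "http://example.com/index.php?auth=abc99&id=77",
            "http://example.com/index.php?uid=5"]:
    print(extract_id(url))
```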
You can try this regex:
import re
urls = ["http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"]
for url in urls:
    # non-greedy, so the match stops at the first '&' after id=
    id_value = re.search(r"id=(.*?)(?=&)", url).group(1)
    print(id_value)
That will get you the id value from each URL:
YY
YYY
YYYY
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX""".splitlines()
for v in variables:
    p1 = v.split("id=")[1]
    p2 = p1.split("&")[0]
    print(p2)
Output:
YY
YYY
YYYY
If you prefer regex
import re
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
output:
['YY', 'YYY', 'YYYY']
I'm not sure whether "only the numbers after id= and before &" means that there could be letters mixed in with the digits, so I came up with this:
import re
variables = """http://XXXXXXX.XXX/index.php?id=5Y44Y&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=Y2242YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=5YY453YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
x2 = []
for p in x:
    x2.append(re.sub("\\D", "", p))
print(x2)
Output:
['5Y44Y', 'Y2242YY', '5YY453YY']
['544', '2242', '5453']
Use the regex id=[0-9]+:
import re
pattern = "id=[0-9]+"
id_value = re.findall(pattern, url)[0].split("id=")[1]  # id_value avoids shadowing the built-in id()
Done this way, the id does not need to be followed by &auth, which makes it more versatile, and a trailing &auth won't break it either.

decoding Microsoft Safelink URL in Python

import re
import urllib
import HTMLParser
urlRegex = re.compile(r'(.+)&data=')
match=urlRegex.search('https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0')
x = match.group()
urlRegex_1 = re.compile(r'url=(.+)&data=')
match_1 = urlRegex_1.search(x)
print(match_1.group(1))
htmlencodedurl = urllib.unquote(urllib.unquote(match_1.group(1)))
actual_url = HTMLParser.HTMLParser().unescape(htmlencodedurl)
So the 'actual_url' displays this:
'https://office.memoriesflower.com/Permission/%$^&&##^$%^&^&#^%%&#^*#&^%'
I need it to display this:
https://office.memoriesflower.com/Permission/office.php
The following is cleaner, as it uses urlparse to extract the query string and then uses path operations to remove the unwanted component:
import posixpath as path
from urllib.parse import urlparse, parse_qs, urlunparse  # Python 2: from urlparse import ...
url = 'https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Foffice.memoriesflower.com%2FPermission%2F%2525%2524%255E%2526%2526*%2523%2523%255E%2524%2525%255E%2526%255E*%2526%2523%255E%2525%2525%2526%2540%255E*%2523%2526%255E%2525%2523%2526%2540%2525*%255E%2540%255E%2523%2525%255E%2540%2526%2525*%255E%2540%2Foffice.php&data=01%7C01%7Cdavid.levin%40mheducation.com%7C0ac9a3770fe64fbb21fb08d50764c401%7Cf919b1efc0c347358fca0928ec39d8d5%7C0&sdata=PEoDOerQnha%2FACafNx8JAep8O9MdllcKCsHET2Ye%2B4%3D&reserved=0'
# parse_qs percent-decodes the url parameter once
target = parse_qs(urlparse(url).query)['url'][0]
p = urlparse(target)
# drop the second-to-last path component, keeping /Permission and office.php
q = p._replace(path=path.join(path.dirname(path.dirname(p.path)), path.basename(p.path)))
print(urlunparse(q))
prints https://office.memoriesflower.com/Permission/office.php
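For the general case, where the wrapped URL has no junk path component to remove, a minimal Python 3 sketch of the same parse_qs idea might look like this. The function name decode_safelink and the short sample link are hypothetical, and the host check is an assumption about what Safelink hosts look like, based only on the URL in the question:

```python
from urllib.parse import urlparse, parse_qs

def decode_safelink(link):
    """Return the URL wrapped in a Safelink, or link unchanged if it is not one."""
    parsed = urlparse(link)
    # Assumption: Safelink hosts end with this suffix, as in the question.
    if not parsed.netloc.endswith("safelinks.protection.outlook.com"):
        return link
    # parse_qs percent-decodes the value once, which undoes the wrapping.
    targets = parse_qs(parsed.query).get("url")
    return targets[0] if targets else link

# Hypothetical, shortened Safelink for illustration.
wrapped = ("https://na01.safelinks.protection.outlook.com/"
           "?url=https%3A%2F%2Fexample.com%2Fdocs%2Fpage.html&data=01%7C01&reserved=0")
print(decode_safelink(wrapped))
```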
I found this while having a similar problem. Here's the code I used to resolve the issue. It's not particularly elegant, but you may be able to tweak it for your needs.
self.urls = re.findall(r"safelinks\.protection\.outlook\.com/\?url=.*?sdata=", self.body.lower(), re.M)
if len(self.urls) > 0:
    for i, v in enumerate(self.urls):
        # drop the 38-character "safelinks...?url=" prefix and the trailing 11 characters
        self.urls[i] = v[38:-11]
This works by getting the value in an ugly format and then stripping off the excess pieces of each item as a string. I believe the proper way to do this is with grouping, but this worked well enough for my needs.
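The grouping mentioned above could be sketched as follows: a capture group takes the place of the fixed slice offsets, and re.IGNORECASE replaces lowercasing the body, which would otherwise alter case-sensitive parts of the target URL. The function name find_safelink_targets and the sample body are hypothetical:

```python
import re
from urllib.parse import unquote

# The group captures the percent-encoded target up to the next '&' or whitespace.
SAFELINK_RE = re.compile(
    r"safelinks\.protection\.outlook\.com/\?url=([^&\s]+)", re.IGNORECASE)

def find_safelink_targets(body):
    """Return the decoded target of every Safelink found in body."""
    return [unquote(encoded) for encoded in SAFELINK_RE.findall(body)]

# Hypothetical message body for illustration.
body = ("See https://na01.safelinks.protection.outlook.com/"
        "?url=https%3A%2F%2Fexample.com%2FA.html&data=x&sdata=y&reserved=0 now")
print(find_safelink_targets(body))
```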

How would I get rid of certain characters then output a cleaned up string In python?

In this snippet of code I am trying to obtain the links to images posted in a groupchat by a certain user:
import groupy
from groupy import Bot, Group, Member
prog_group = Group.list().first
prog_members = prog_group.members()
prog_messages = prog_group.messages()
rojer = str(prog_members[4])
rojer_messages = ['none']
rojer_pics = []
links = open('rojer_pics.txt', 'w')
print(prog_group)
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            links.write(str(message) + '\n')
links.close()
The issue is that the links file ends up containing the entire message: ("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>"
What I want to do is get rid of the characters that aren't part of the URL, so that it is written like this:
"https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12"
Are there any methods in Python that can manipulate a string like this?
I just used str.split() and split the message on the single-quote character:
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            link = str(message).split("'")
            rojer_pics.append(link[1])
            links.write(str(link[1]) + '\n')
This can be done using string indices and the string method .find():
>>> url = "(\"Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')"
>>> url = url[url.find('+')+1:-2]
>>> url
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
>>>
>>> string = '("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12\')>"'
>>> string.split('+')[1][:-4]
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
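Both slicing answers rely on fixed offsets at the end of the string. A hedged alternative, assuming the link always starts with http:// or https:// and ends at the first quote, bracket or whitespace, is to search for the URL itself. The function name first_url is hypothetical:

```python
import re

# Match from the scheme up to the first whitespace, quote, ')' or '>'.
URL_RE = re.compile(r"https?://[^\s'\")>]+")

def first_url(text):
    """Return the first http(s) URL found in text, or None."""
    match = URL_RE.search(text)
    return match.group(0) if match else None

message = '("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12\')>"'
print(first_url(message))
```

This does not depend on the + before the link or on how many trailing characters the message representation adds.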
