parsing a url in python with changing part in it

I'm parsing a URL in Python; below you can find a sample URL and the code. What I want to do is split the part number (74743) out of the URL and write a for loop that takes it from a list of parts.
I tried to use urlparse but couldn't get it working end to end, mostly because of the changing parts in the URL. I just want the easiest and fastest way to do this.
Sample url:
http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is=
(http://example.com/wps/portal) Always fixed
(lYuxDoIwGAYf6f9aqKSjMNQ) Always changing
(74743) Will be taken from a list named Parts
(IntNumberOf=&is=) Also changing depending on the section of the website
Here's the Code:
from lxml import html
import requests
import urlparse
Parts = [74743, 85731, 93021]
url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
parsing = urlparse.urlsplit(url)
print parsing

>>> import urlparse
>>> url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
>>> split_url = urlparse.urlsplit(url)
>>> split_url.path
'/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/'
You can split the path into a list of strings using '/', slice the list, and re-join:
>>> path = split_url.path
>>> path.split('/')
['', 'wps', 'portal', 'lYuxDoIwGAYf6f9aqKSjMNQ', '']
Slice off the last two:
>>> path.split('/')[:-2]
['', 'wps', 'portal']
And re-join:
>>> '/'.join(path.split('/')[:-2])
'/wps/portal'
To parse the query, use parse_qs:
>>> urlparse.parse_qs(split_url.query)
{'PartNo': ['74743']}
To keep the empty parameters use keep_blank_values=True:
>>> query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
>>> query
{'PartNo': ['74743'], 'is': [''], 'IntNumberOf': ['']}
You can then modify the query dictionary:
>>> query['PartNo'] = 85731
And update the original split_url:
>>> import urllib
>>> updated = split_url._replace(
...     path='/'.join(path.split('/')[:-2] + ['ASDFZXCVQWER', '']),
...     query=urllib.urlencode(query, doseq=True))
>>> urlparse.urlunsplit(updated)
'http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&IntNumberOf=&is='
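For the loop over the parts list that the question asks about, here is a minimal Python 3 sketch of the same idea (in Python 3 these functions live in urllib.parse). The helper name url_for_part is made up for illustration, and the sketch assumes the changing path segment can be left as it is and only PartNo needs to be swapped:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def url_for_part(url, part_no):
    """Return `url` with its PartNo query parameter set to `part_no`."""
    split_url = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as IntNumberOf=
    query = parse_qs(split_url.query, keep_blank_values=True)
    query['PartNo'] = [str(part_no)]
    # doseq=True expands each list of values back into key=value pairs
    return urlunsplit(split_url._replace(query=urlencode(query, doseq=True)))

parts = [74743, 85731, 93021]
url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
urls = [url_for_part(url, part) for part in parts]
for u in urls:
    print(u)
```

Each generated URL keeps the path and the blank parameters untouched, so it can be passed straight to requests.get.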

Related

Strip A specific part from a url string in python

I'm passing through some urls and I'd like to strip a part of each one which changes dynamically, so I don't know it beforehand.
An example url is:
https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2
And I'd like to strip the gid=lostchapter part without any of the rest.
How do I do that?

You can use urllib to convert the query string into a Python dict and access the desired item:
In [1]: from urllib import parse
In [2]: s = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
In [3]: q = parse.parse_qs(parse.urlsplit(s).query)
In [4]: q
Out[4]:
{'pid': ['2'],
'gid': ['lostchapter'],
'lang': ['en_GB'],
'practice': ['1'],
'channel': ['desktop'],
'demo': ['2']}
In [5]: q["gid"]
Out[5]: ['lostchapter']

Here is a simple way to extract it (note that it relies on gid being the second query parameter):
urls = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
# Import the `urlparse` and `urlunparse` functions
from urllib.parse import urlparse, urlunparse
# Parse the URL
url = urlparse(urls)
# Convert the `urlparse` object back into a URL string
url = urlunparse(url)
# Strip the string
url = url.split("?")[1]
url = url.split("&")[1]
# Print the new URL
print(url) # Prints "gid=lostchapter"

Method 1: Using urlparse
from urllib.parse import urlparse
p = urlparse('https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2')
param: list[str] = [i for i in p.query.split('&') if i.startswith('gid=')]
Output: ['gid=lostchapter']
Method 2: Using Regex
import re
param: str = re.search(r'gid=[^&]*', 'https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2').group()
The pattern [^&]* stops at the next '&', so only the gid parameter is matched (a greedy .*& would run on to the last '&' in the URL). You can change the regex pattern to match other expected outputs; currently it will extract any value.

We can try doing a regex replacement:
import re
url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
output = re.sub(r'(?<=[?&])gid=lostchapter&?', '', url)
print(output) # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
For a more generic replacement, match on the following regex pattern:
(?<=[?&])gid=\w+&?
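As an alternative to the regex replacement, the query string can be rebuilt without the unwanted parameter. This is only a sketch: drop_param is a hypothetical helper name, and the host example.com/play stands in for the elided part of the URL in the question.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_param(url, name):
    """Return `url` with every query parameter called `name` removed."""
    parts = urlsplit(url)
    # parse_qsl keeps the original order; keep_blank_values keeps empty ones
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical full URL; the original question elides the host and path
url = 'https://example.com/play?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
stripped = drop_param(url, 'gid')
print(stripped)
```

Unlike the regex, this does not leave a dangling '&' when the removed parameter happens to be the last one. Note that urlencode re-encodes the values, so unusual escapes in the original query may come out normalized.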

Using string slicing (I'm assuming there will be an '&' after gid=lostchapter)
url = r'https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
start = url.find('gid')
end = start + url[start:].find('&')
url = url[start:end]
print(url)
output
gid=lostchapter
What I'm trying to do here is:
find the index of the first occurrence of "gid"
find the first "&" after "gid"
take the slice of the url from "gid" up to (but not including) that "&"

Extract urls from a string of html data

I already tried to extract this html data with BeautifulSoup but it's only limited with tags. What I need to do is to get the trailing something.html or some/something.html after the prefix www.example.com/products/ while eliminating the parameters like ?search=1. I prefer to use regex with this but I don't know the exact pattern for this.
input:
System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html
prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']

I guess you want to use re here, with a trick, since a "?" may follow the "html" in a URI:
import re
L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"
>>> [re.search(prefix+'(.*)html', el).group(1) + 'html' for el in L if prefix in el]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']

The above answer using the re module is great, but you could also do this without the module, like this:
prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
for l in L:
    input_ = l.rsplit(prefix, 1)
    try:
        input_ = input_[1]
        ans.append(input_[:input_.index('.html')] + '.html')
    except (IndexError, ValueError):
        # prefix or '.html' not found in this entry; skip it
        pass
print(ans)
['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']

Another option is to use urlparse instead of/along with re
It will allow you to split a URL like this:
import urlparse
my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlparse.urlsplit(my_url)
url_obj.scheme
>>> 'http'
url_obj.netloc
>>> 'www.example.com'
url_obj.path
>>> '/products/ammaxxllx.html'
url_obj.query
>>> 'spam=eggs'
url_obj.fragment
>>> 'sometag'
# Now you're able to work with every chunk as wanted!
prefix = '/products'
if url_obj.path.startswith(prefix):
    # Do whatever you need, replacing the initial characters. You can use re here
    print(url_obj.path[len(prefix) + 1:])
>>> ammaxxllx.html
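Since the question starts from one raw string rather than a ready-made list of URLs, here is a rough sketch that pulls the paths straight out of the text with re.findall. The sample text is a shortened, hypothetical version of the input in the question, and the pattern assumes the wanted paths never contain a quote, a question mark, or whitespace:

```python
import re

prefix = "www.example.com/products/"

# Shortened, hypothetical stand-in for the raw data in the question
text = ('"productUrl":"//www.example.com/products/'
        'sam-sung-b309i-i250505808-s341878516.html?search=1", '
        '"image": ["//www.example.com/products/site/ammaxxllx.html"], '
        '"https://www.example.com/site/kakzja.html"')

# re.escape makes the dots in the prefix literal; the lazy group stops at
# the first ".html", and the character class keeps it from crossing a
# quote, "?" or whitespace into a neighbouring value.
pattern = re.compile(re.escape(prefix) + r'([^"?\s]+?\.html)')
found = pattern.findall(text)
print(found)
```

Because the prefix is part of the pattern, URLs outside /products/ (such as the last one in the sample) are skipped without any extra filtering.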

Get YouTube video url or YouTube video ID from a string using RegEx

So I've been stuck on this for about an hour or so now and I just cannot get it to work. So far I have been trying to extract the whole link from the string, but now I feel like it might be easier to just get the video ID.
The RegEx would need to take the ID/URL from the following link styles, no matter where they are in a string:
http://youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
https://youtube.com/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
youtube.com/iwGFalTRHDA
youtube.com/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc
So far, I have this RegEx:
((http(s)?\:\/\/)?(www\.)?(youtube|youtu)((\.com|\.be)\/)(watch\?v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
This catches the link, however it also breaks down the link in to multiple parts and also adds that to the list too, so if a string contains a single youtube link, the output when I print the list is something like this:
('https://www.youtube.com/watch?v=Idn7ODPMhFY', 'https://', 's', 'www.', 'youtube', '.com/', '.com', 'watch?v=', 'Idn7ODPMhFY', '', '')
I need the list to only contain the link itself, or just the video id (which would be more preferable). I have really tried doing this myself for quite a while now but I just cannot figure it out. I was wondering if someone could sort out the regex for me and tell me where I am going wrong so that I don't run in to this issue again in the future?

Instead of writing a complicated regex that probably won't work in all cases, you'd be better off using tools that analyze the url, like urllib:
from urllib.parse import urlparse, parse_qs
url = 'http://youtube.com/watch?v=iwGFalTRHDA'
def get_id(url):
    u_pars = urlparse(url)
    quer_v = parse_qs(u_pars.query).get('v')
    if quer_v:
        return quer_v[0]
    pth = u_pars.path.split('/')
    if pth:
        return pth[-1]
This function will return None if both attempts fail.
I tested it with the sample urls:
>>> get_id('http://youtube.com/watch?v=iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related')
'iwGFalTRHDA'
>>> get_id('https://youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://youtu.be/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('youtube.com/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4')
'r5nB9u4jjy4'
>>> get_id('http://www.youtube.com/watch?v=t-ZRX8984sc')
't-ZRX8984sc'
>>> get_id('http://youtu.be/t-ZRX8984sc')
't-ZRX8984sc'

Here's the approach I'd use, no regex needed at all.
(This is pretty much equivalent to @Willem Van Onsem's solution, plus an easy to run / update unit test).
from urlparse import parse_qs
from urlparse import urlparse
import re
import unittest
TEST_URLS = [
    ('iwGFalTRHDA', 'http://youtube.com/watch?v=iwGFalTRHDA'),
    ('iwGFalTRHDA', 'http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related'),
    ('iwGFalTRHDA', 'https://youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'http://youtu.be/n17B_uFF4cA'),
    ('iwGFalTRHDA', 'youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'youtube.com/n17B_uFF4cA'),
    ('r5nB9u4jjy4', 'http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4'),
    ('t-ZRX8984sc', 'http://www.youtube.com/watch?v=t-ZRX8984sc'),
    ('t-ZRX8984sc', 'http://youtu.be/t-ZRX8984sc'),
    (None, 'http://www.stackoverflow.com')
]
YOUTUBE_DOMAINS = [
    'youtu.be',
    'youtube.com',
]
def extract_id(url_string):
    # Make sure all URLs start with a valid scheme
    if not url_string.lower().startswith('http'):
        url_string = 'http://%s' % url_string
    url = urlparse(url_string)
    # Check host against whitelist of domains
    if url.hostname.replace('www.', '') not in YOUTUBE_DOMAINS:
        return None
    # Video ID is usually to be found in 'v' query string
    qs = parse_qs(url.query)
    if 'v' in qs:
        return qs['v'][0]
    # Otherwise fall back to path component
    return url.path.lstrip('/')
class TestExtractID(unittest.TestCase):
    def test_extract_id(self):
        for expected_id, url in TEST_URLS:
            result = extract_id(url)
            self.assertEqual(
                expected_id, result, 'Failed to extract ID from '
                'URL %r (got %r, expected %r)' % (url, result, expected_id))
if __name__ == '__main__':
    unittest.main()

I really recommend following @LukasGraf's comment; however, if you really must use regex, you can check the following:
(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))
Here is a working example in regex101:
https://regex101.com/r/5eRqn2/1
And here the python example:
In [38]: r = re.compile(r'(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(?:\-|\_)[0-z]{4}|.(?:\-|\_)[0-z]{9}))')
In [39]: r.match('http://youtube.com/watch?v=iwGFalTRHDA').groups()
Out[39]: ('iwGFalTRHDA',)
In [40]: r.match('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related').groups()
Out[40]: ('iwGFalTRHDA',)
In [41]: r.match('https://youtube.com/iwGFalTRHDA').groups()
Out[41]: ('iwGFalTRHDA',)
To keep a group from being captured (and from showing up in the results), make it non-capturing with this syntax: (?:...)
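To make the difference concrete, here is a small sketch comparing capturing and non-capturing groups with re.findall. The two patterns are deliberately simplified illustrations, not the full regex from this answer, and they only cover the youtu.be/ID and youtube.com/watch?v=ID link styles:

```python
import re

text = ("see http://youtu.be/n17B_uFF4cA and "
        "https://www.youtube.com/watch?v=t-ZRX8984sc")

# Every (...) is a capturing group, so findall returns one tuple per match
# with an entry for each group (empty string if the group did not match).
capturing = re.findall(
    r'(https?://)?(www\.)?youtu(\.be/|be\.com/watch\?v=)([\w-]{11})', text)

# With (?:...) only the video ID group is captured, so findall returns
# a flat list of IDs.
non_capturing = re.findall(
    r'(?:https?://)?(?:www\.)?youtu(?:\.be/|be\.com/watch\?v=)([\w-]{11})', text)

print(capturing)
print(non_capturing)
```

This is the same effect seen in the question: the long tuple of fragments comes from the capturing groups, not from the match itself.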

Or syntax when parsing urls with regex & python

Struggling with some regex here. I'll be looping through several urls, but I cannot get the regex to recognize revenue or cost and grab them both. Essentially the code and the output I'm after would look something like this:
import re
url = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
values = []
for i in url:
    values.append(re.search(r'(?<=revenue=)(.*?)(?=&|;)', i).group(0))
print values
[[224.00, ''],
'224.00',
[224.00, 13]]

You need to use re.findall since re.search returns only the first match.
>>> values = []
>>> for i in url:
...     values.append(re.findall(r'(?:\brevenue=|\bcost=)(.*?)(?:[&;]|$)', i))
...
>>> values
[['224.00', ''], ['224.00'], ['224.00', '13']]

Use urlparse.urlparse to parse the URL, and urlparse.parse_qs to parse the query string.
import re
from urlparse import urlparse, parse_qs
reqs = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
urls = [re.split(' +', s, 1)[1] for s in reqs]
kv = [parse_qs(urlparse(url).query) for url in urls]
values = [(e.get('revenue'), e.get('cost')) for e in kv]
# values = [{'revenue': e.get('revenue'), 'cost': e.get('cost')} for e in kv]
Sample output (parse_qs provides a list of values for every key, since the query may contain duplicate keys):
[(['224.00'], None), (['224.00'], None), (['224.00'], ['13'])]
The values line is not necessary. You can use the kv dict directly.
If you have to deal with invalid input, the list comprehensions for urls and kv have to be rewritten as loops:
For urls, you need to check for and filter out entries without an HTTP method.
For kv, you need to wrap urlparse in try/except to catch invalid syntax.
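A loop-friendly version of those two checks might look like the following Python 3 sketch. The function name extract_revenue_cost is made up for illustration, and returning None for a malformed line is just one possible policy. Passing keep_blank_values=True is an addition that lets an empty cost= come back as '' instead of being dropped:

```python
from urllib.parse import urlsplit, parse_qs

def extract_revenue_cost(request_line):
    """Return (revenue, cost) from a 'METHOD /path?query' line, or None."""
    # Expect "<method> <url>"; lines without both pieces are rejected
    pieces = request_line.split(None, 1)
    if len(pieces) != 2:
        return None
    try:
        query = urlsplit(pieces[1]).query
    except ValueError:
        # urlsplit raises ValueError on some malformed URLs
        return None
    params = parse_qs(query, keep_blank_values=True)
    # parse_qs maps each key to a list of values; take the first, if any
    revenue = params.get('revenue', [None])[0]
    cost = params.get('cost', [None])[0]
    return (revenue, cost)

reqs = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
        'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
        'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13',
        'garbage']
values = [extract_revenue_cost(r) for r in reqs]
print(values)
```

Distinguishing '' (parameter present but empty) from None (parameter absent) matches the three cases in the question's expected output more closely than the default parse_qs behaviour.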

Changing hostname in a url

I am trying to use python to change the hostname in a url, and have been playing around with the urlparse module for a while now without finding a satisfactory solution. As an example, consider the url:
https://www.google.dk:80/barbaz
I would like to replace "www.google.dk" with e.g. "www.foo.dk", so I get the following url:
https://www.foo.dk:80/barbaz.
So the part I want to replace is what urlparse.urlsplit refers to as hostname. I had hoped that the result of urlsplit would let me make changes, but the resulting type ParseResult doesn't allow me to. If nothing else I can of course reconstruct the new url by appending all the parts together with +, but this would leave me with some quite ugly code with a lot of conditionals to get "://" and ":" in the correct places.

You can use urllib.parse.urlparse function and ParseResult._replace method (Python 3):
>>> import urllib.parse
>>> parsed = urllib.parse.urlparse("https://www.google.dk:80/barbaz")
>>> replaced = parsed._replace(netloc="www.foo.dk:80")
>>> print(replaced)
ParseResult(scheme='https', netloc='www.foo.dk:80', path='/barbaz', params='', query='', fragment='')
If you're using Python 2, then replace urllib.parse with urlparse.
ParseResult is a subclass of namedtuple and _replace is a namedtuple method that:
returns a new instance of the named tuple replacing specified fields
with new values
UPDATE:
As @2rs2ts said in a comment, the netloc attribute includes the port number.
Good news: ParseResult has hostname and port attributes.
Bad news: hostname and port are not members of the namedtuple; they're dynamic properties, so you can't do parsed._replace(hostname="www.foo.dk"). It'll throw an exception.
If you don't want to split on : and your url always has a port number and doesn't have a username and password (that is, urls like "https://username:password@www.google.dk:80/barbaz"), you can do:
parsed._replace(netloc="{}:{}".format(parsed.hostname, parsed.port))
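Putting those pieces together, a small helper could rebuild netloc from its parts so that the port and any credentials survive. This is only a sketch: replace_hostname is a hypothetical name, and it assumes the new host is a plain name (an IPv6 literal would need square brackets added).

```python
from urllib.parse import urlsplit, urlunsplit

def replace_hostname(url, new_host):
    """Return `url` with its hostname swapped, keeping userinfo and port."""
    parts = urlsplit(url)
    # Everything before the last '@' in netloc is "user:password", if present
    userinfo, at, _ = parts.netloc.rpartition('@')
    netloc = new_host
    if parts.port is not None:
        netloc = '{}:{}'.format(netloc, parts.port)
    # userinfo and at are both empty strings when there are no credentials
    return urlunsplit(parts._replace(netloc=userinfo + at + netloc))

print(replace_hostname('https://www.google.dk:80/barbaz', 'www.foo.dk'))
print(replace_hostname('https://user:pw@www.google.dk:80/barbaz', 'www.foo.dk'))
print(replace_hostname('https://www.google.dk/barbaz', 'www.foo.dk'))
```

Building netloc from the port property rather than string-replacing the old hostname avoids accidental matches inside the username or password.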

You can take advantage of urlsplit and urlunsplit from Python's urlparse:
>>> from urlparse import urlsplit, urlunsplit
>>> url = list(urlsplit('https://www.google.dk:80/barbaz'))
>>> url
['https', 'www.google.dk:80', '/barbaz', '', '']
>>> url[1] = 'www.foo.dk:80'
>>> new_url = urlunsplit(url)
>>> new_url
'https://www.foo.dk:80/barbaz'
As the docs state, the argument passed to urlunsplit() "can be any five-item iterable", so the above code works as expected.

Using urlparse and urlunparse methods of urlparse module:
import urlparse
old_url = 'https://www.google.dk:80/barbaz'
url_lst = list(urlparse.urlparse(old_url))
# Now url_lst is ['https', 'www.google.dk:80', '/barbaz', '', '', '']
url_lst[1] = 'www.foo.dk:80'
# Now url_lst is ['https', 'www.foo.dk:80', '/barbaz', '', '', '']
new_url = urlparse.urlunparse(url_lst)
print(old_url)
print(new_url)
Output:
https://www.google.dk:80/barbaz
https://www.foo.dk:80/barbaz

A simple string replace of the host in the netloc also works in most cases:
>>> p = urlparse.urlparse('https://www.google.dk:80/barbaz')
>>> p._replace(netloc=p.netloc.replace(p.hostname, 'www.foo.dk')).geturl()
'https://www.foo.dk:80/barbaz'
This will not work if, by some chance, the user name or password matches the hostname. You cannot limit str.replace to replace the last occurrence only, so instead we can use split and join:
>>> p = urlparse.urlparse('https://www.google.dk:www.google.dk@www.google.dk:80/barbaz')
>>> new_netloc = 'www.foo.dk'.join(p.netloc.rsplit(p.hostname, 1))
>>> p._replace(netloc=new_netloc).geturl()
'https://www.google.dk:www.google.dk@www.foo.dk:80/barbaz'

I would also recommend using urlsplit and urlunsplit like @linkyndy's answer, but for Python 3 it would be:
>>> from urllib.parse import urlsplit, urlunsplit
>>> url = list(urlsplit('https://www.google.dk:80/barbaz'))
>>> url
['https', 'www.google.dk:80', '/barbaz', '', '']
>>> url[1] = 'www.foo.dk:80'
>>> new_url = urlunsplit(url)
>>> new_url
'https://www.foo.dk:80/barbaz'

You can always do this trick:
>>> from urllib import parse
>>> p = parse.urlparse("https://stackoverflow.com/questions/21628852/changing-hostname-in-a-url")
>>> parse.ParseResult(**dict(p._asdict(), netloc='perrito.com.ar')).geturl()
'https://perrito.com.ar/questions/21628852/changing-hostname-in-a-url'

To just replace the host without touching the port in use (if any), use this:
import re, urlparse
p = list(urlparse.urlsplit('https://www.google.dk:80/barbaz'))
p[1] = re.sub('^[^:]*', 'www.foo.dk', p[1])
print urlparse.urlunsplit(p)
prints
https://www.foo.dk:80/barbaz
If you've not given any port, this works fine as well.
If you prefer the _replace way Nigel pointed out, you can use this instead:
p = urlparse.urlsplit('https://www.google.dk:80/barbaz')
p = p._replace(netloc=re.sub('^[^:]*', 'www.foo.dk', p.netloc))
print urlparse.urlunsplit(p)
