Strip a specific part from a URL string in Python

I'm iterating over some URLs and I'd like to strip out a part of each one that changes dynamically, so I don't know it beforehand.
An example url is:
https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2
And I'd like to strip the gid=lostchapter part without any of the rest.
How do I do that?

You can use urllib to convert the query string into a Python dict and access the desired item:
In [1]: from urllib import parse
In [2]: s = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
In [3]: q = parse.parse_qs(parse.urlsplit(s).query)
In [4]: q
Out[4]:
{'pid': ['2'],
'gid': ['lostchapter'],
'lang': ['en_GB'],
'practice': ['1'],
'channel': ['desktop'],
'demo': ['2']}
In [5]: q["gid"]
Out[5]: ['lostchapter']
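Because parse_qs returns a list of values for every key, a small helper can pick the first value and rebuild the key=value pair the question asks for. This is a minimal sketch assuming Python 3; the helper name get_param is hypothetical and not part of urllib. Note that parse_qs percent-decodes values, so the returned text may differ from the raw URL if the value was escaped.

```python
from urllib.parse import parse_qs, urlsplit


def get_param(url, key):
    """Return 'key=value' for the first occurrence of key, or None if absent."""
    # keep_blank_values=True so that an empty value such as 'gid=' is still reported
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    values = query.get(key)
    if not values:
        return None
    return f"{key}={values[0]}"


url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
print(get_param(url, "gid"))  # gid=lostchapter
```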

Here is a simple way to extract it. Note that it relies on gid being the second query parameter:
urls = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
# Import the `urlparse` and `urlunparse` methods
from urllib.parse import urlparse, urlunparse
# Parse the URL
url = urlparse(urls)
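# Convert the parsed result back into a URL string (a round trip; the string is unchanged)
url = urlunparse(url)
# Keep everything after "?" (the query string), then take the second "&"-separated parameter
url = url.split("?")[1]
url = url.split("&")[1]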
# Print the new URL
print(url) # Prints "gid=lostchapter"

Method 1: Using urlparse
from urllib.parse import urlparse
p = urlparse('https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2')
param: list[str] = [i for i in p.query.split('&') if i.startswith('gid=')]
print(param[0])
Output: gid=lostchapter
Method 2: Using regex
import re
# [^&]* stops at the next "&", so only the gid parameter is matched
param: str = re.search(r'gid=[^&]*', 'https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2').group()
You can change the regex pattern to match the expected values more strictly. As written, it extracts any value up to the next "&" (or the end of the string).

We can try doing a regex replacement:
import re
url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
output = re.sub(r'(?<=[?&])gid=lostchapter&?', '', url)
print(output) # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
For a more generic replacement, match on the following regex pattern:
(?<=[?&])gid=\w+&?
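If a regex feels fragile, the same removal can be done by parsing the query string and rebuilding the URL. This is a sketch assuming Python 3; the function name remove_param and the example.com URL are hypothetical stand-ins. Because urlencode re-encodes the remaining values, their escaping may be normalized compared with the input.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_param(url, key):
    """Return url with every occurrence of the query parameter `key` removed."""
    parts = urlsplit(url)
    # parse_qsl keeps the original order and duplicate keys as (name, value) pairs
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != key]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "https://example.com/play?pid=2&gid=lostchapter&lang=en_GB&demo=2"
print(remove_param(url, "gid"))  # https://example.com/play?pid=2&lang=en_GB&demo=2
```

Unlike the regex, this also handles the case where gid is the last parameter without leaving a trailing "&".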

Using string slicing (I'm assuming there will be an '&' after gid=lostchapter)
url = r'https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
start = url.find('gid')
# index of the first '&' that follows 'gid'
end = url.find('&', start)
url = url[start:end]
print(url)
output
gid=lostchapter
What I'm trying to do here is:
find the index of the first occurrence of "gid"
find the first "&" after "gid"
take the slice of the URL that starts at "gid" and ends just before that "&"
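The steps above can be sketched as a small function that also covers the case where there is no "&" after the parameter. The name slice_param is hypothetical. One caveat: a plain find can also match inside a longer parameter name that ends in "gid=", so this is only suitable when the parameter names are known.

```python
def slice_param(url, key="gid"):
    """Return the 'key=value' substring using only str.find and slicing."""
    start = url.find(key + "=")
    if start == -1:
        return None  # parameter not present
    end = url.find("&", start)  # first '&' after the parameter
    if end == -1:
        end = len(url)  # parameter is the last one in the URL
    return url[start:end]


url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
print(slice_param(url))  # gid=lostchapter
```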

Related

extract Unique id from the URL using Python

I've a URL like this:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x= 'Enterprise-Business-Planning-Analyst_3103928-1'
I want to extract the ID at the end of the URL, i.e. the x part of the string above, to get the unique ID.
Any help regarding this will be highly appreciated.
_parsed_url.path.split("/")[-1].split('-')[-1]
I am using this, but it gives an error.
Python's urllib.parse and pathlib standard-library modules can help here.
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
from urllib.parse import urlparse
from pathlib import PurePath
x = PurePath(urlparse(url).path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text Enterprise-Business-Planning-Analyst_3103928-1 you can split() with respect to the / character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.split("/")[-1])
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text 3103928 you can replace the _ character with - and you can split() with respect to the - character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.replace("_", "-").split("-")[-2])
# 3103928
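The replace-and-split approach depends on the exact number of hyphens. A regex anchored at the end of the last path segment is a little more explicit. This is a sketch that assumes the ID is the run of digits after the last underscore, optionally followed by a "-N" suffix; that layout is inferred from the single example URL and may not hold for other job links.

```python
import re
from urllib.parse import urlparse

url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'

# Last path segment, e.g. 'Enterprise-Business-Planning-Analyst_3103928-1'
last_segment = urlparse(url).path.rsplit('/', 1)[-1]

# Digits after the final underscore, with an optional '-<digits>' suffix
match = re.search(r'_(\d+)(?:-\d+)?$', last_segment)
job_id = match.group(1) if match else None
print(job_id)  # 3103928
```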

Change url in python

How can I change the activeOffset in this URL? I am using Python and a while loop.
https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC
It first should be 10, then 20, then 30 ...
I tried urlparse but I don't understand how to just increase the number
Thanks!
If this is a fixed URL, you can write activeOffset={} in the URL then use format to replace {} with specific numbers:
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset={}&orderBy=kh&orderDirection=ASC"
for offset in range(10, 100, 10):
    print(url.format(offset))
If you cannot modify the URL (because you get it as an input from some other part of your program), you can use regular expressions to replace occurrences of activeOffset=... with the required number:
import re
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
query = "activeOffset="
pattern = re.compile(query + "\\d+") # \\d+ means any sequence of digits
for offset in range(10, 100, 10):
    # Replace occurrences of pattern with the modified query
    print(pattern.sub(query + str(offset), url))
If you want to use urlparse, you can apply the previous approach to the fragment part returned by urlparse:
import re
from urllib.parse import urlparse, urlunparse
url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
query = "activeOffset="
pattern = re.compile(query + "\\d+") # \\d+ means any sequence of digits
parts = urlparse(url)
for offset in range(10, 100, 10):
    fragment_modified = pattern.sub(query + str(offset), parts.fragment)
    parts_modified = parts._replace(fragment=fragment_modified)
    url_modified = urlunparse(parts_modified)
    print(url_modified)
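Since the fragment here has the same key=value&key=value shape as a query string, it can also be parsed and rebuilt without a regex. This is a sketch assuming Python 3; set_fragment_param is a hypothetical helper name. urlencode re-encodes the values, which makes no difference for this URL but may change escaping in others.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_fragment_param(url, key, value):
    """Return url with `key` inside the fragment set to `value`."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.fragment, keep_blank_values=True)
    # Replace the value for the matching key and leave the other pairs untouched
    updated = [(k, str(value) if k == key else v) for k, v in pairs]
    return urlunsplit(parts._replace(fragment=urlencode(updated)))


url = "https://www.dieversicherer.de/versicherer/auto---reise/typklassenabfrage#activeOffset=10&orderBy=kh&orderDirection=ASC"
for offset in range(10, 40, 10):
    print(set_fragment_param(url, "activeOffset", offset))
```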

Extract urls from a string of html data

I already tried to extract this HTML data with BeautifulSoup, but it is limited to tags. What I need is the trailing something.html or some/something.html after the prefix www.example.com/products/, without parameters like ?search=1. I would prefer to use a regex for this, but I don't know the exact pattern.
input:
System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html
prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
I guess you want to use re here, with a trick, since a "?" may follow the "html" in a URI:
import re
L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"
>>> [re.search(prefix+'(.*)html', el).group(1) + 'html' for el in L if prefix in el]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']
The answer above using the re module works well, but you can also do this without that module, like this:
prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
for l in L:
    input_ = l.rsplit(prefix, 1)
    try:
        input_ = input_[1]
        ans.append(input_[:input_.index('.html')] + '.html')
    except (IndexError, ValueError):
        # prefix or '.html' not found in this URL; skip it
        pass
print(ans)
['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
Another option is to use urllib.parse (the urlparse module in Python 2) instead of, or along with, re.
It will allow you to split a URL like this:
from urllib.parse import urlsplit
my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlsplit(my_url)
url_obj.scheme
# 'http'
url_obj.netloc
# 'www.example.com'
url_obj.path
# '/products/ammaxxllx.html'
url_obj.query
# 'spam=eggs'
url_obj.fragment
# 'sometag'
# Now you're able to work with every chunk as wanted!
prefix = '/products'
if url_obj.path.startswith(prefix):
    # Do whatever you need, replacing the initial characters. You can use re here
    print(url_obj.path[len(prefix) + 1:])
# ammaxxllx.html
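The question starts from one long string rather than a list of URLs, so a single findall over the raw text may be closer to what was asked. This is a sketch under assumptions: the data string is a shortened, hypothetical stand-in for the asker's input, and the pattern assumes that a product path never contains a double quote, a "?" or whitespace.

```python
import re

# Shortened, hypothetical stand-in for the raw text in the question
data = ('"productUrl":"//www.example.com/products/sam-sung-b309i-i250505808.html?search=1", '
        '"image": ["//www.example.com/products/site/ammaxxllx.html"], '
        '"https://www.example.com/site/kakzja.html"')

prefix = "www.example.com/products/"

# re.escape keeps the dots in the prefix literal; the lazy group stops at the first '.html'
pattern = re.compile(re.escape(prefix) + r'([^"?\s]+?\.html)')
paths = pattern.findall(data)
print(paths)  # ['sam-sung-b309i-i250505808.html', 'site/ammaxxllx.html']
```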

parsing a url in python with changing part in it

I'm parsing a URL in Python. Below you can find a sample URL and the code. What I want to do is split the (74743) out of the URL and write a for loop that takes its replacement from a parts list.
I tried to use urlparse but couldn't get it to work, mostly because of the changing parts in the URL. I just want the easiest and fastest way to do this.
Sample url:
http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is=
(http://example.com/wps/portal) Always fixed
(lYuxDoIwGAYf6f9aqKSjMNQ) Always changing
(74743) Will be taken from a list name Parts
(IntNumberOf=&is=) Also changing depending on the section of the website
Here's the Code:
from lxml import html
import requests
import urlparse
Parts = [74743, 85731, 93021]
url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
parsing = urlparse.urlsplit(url)
print parsing
>>> import urlparse
>>> url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
>>> split_url = urlparse.urlsplit(url)
>>> split_url.path
'/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/'
You can split the path into a list of strings using '/', slice the list, and re-join:
>>> path = split_url.path
>>> path.split('/')
['', 'wps', 'portal', 'lYuxDoIwGAYf6f9aqKSjMNQ', '']
Slice off the last two:
>>> path.split('/')[:-2]
['', 'wps', 'portal']
And re-join:
>>> '/'.join(path.split('/')[:-2])
'/wps/portal'
To parse the query, use parse_qs:
>>> parsed_query = urlparse.parse_qs(split_url.query)
>>> parsed_query
{'PartNo': ['74743']}
To keep the empty parameters use keep_blank_values=True:
>>> query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
>>> query
{'PartNo': ['74743'], 'is': [''], 'IntNumberOf': ['']}
You can then modify the query dictionary:
>>> query['PartNo'] = 85731
And update the original split_url:
>>> import urllib
>>> updated = split_url._replace(path='/'.join(path.split('/')[:-2] +
...                                             ['ASDFZXCVQWER', '']),
...                              query=urllib.urlencode(query, doseq=True))
>>> urlparse.urlunsplit(updated)
'http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&IntNumberOf=&is='
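The session above is Python 2. In Python 3 the same functions live in urllib.parse, and the loop over the parts list that the question asks for could look like the sketch below. It keeps the changing path segment as it is and only swaps PartNo. The variable names parts_list and urls are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
parts_list = [74743, 85731, 93021]  # the asker's Parts list

split_url = urlsplit(url)
# keep_blank_values=True preserves the empty IntNumberOf and is parameters
query = parse_qs(split_url.query, keep_blank_values=True)

urls = []
for part in parts_list:
    query['PartNo'] = [str(part)]
    # doseq=True expands the value lists produced by parse_qs
    updated = split_url._replace(query=urlencode(query, doseq=True))
    urls.append(urlunsplit(updated))

for u in urls:
    print(u)
```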

Or syntax when parsing urls with regex & python

Struggling with some regex here. I'll be looping through several URLs, but I cannot work out how to get the regex to recognize revenue or cost and grab them both. Essentially the output would look something like this:
import re
url = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
values = []
for i in urls:
values.append(re.search(r'(?<=revenue=)(.*?)(?=&|;)',url).group(0))
print values
[[224.00, ''],
'224.00',
[224.00, 13]]
You need to use re.findall since re.search returns only the first match.
>>> values = []
>>> for i in url:
...     values.append(re.findall(r'(?:\brevenue=|\bcost=)(.*?)(?:[&;]|$)', i))
...
>>> values
[['224.00', ''], ['224.00'], ['224.00', '13']]
Use urlparse to parse the URL and parse_qs to parse the query string. Both live in urllib.parse in Python 3 (in Python 2 the module is named urlparse).
import re
from urllib.parse import urlparse, parse_qs
reqs = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13']
urls = [re.split(' +', s, 1)[1] for s in reqs]
kv = [parse_qs(urlparse(url).query) for url in urls]
values = [(e.get('revenue'), e.get('cost')) for e in kv]
# values = [{'revenue': e.get('revenue'), 'cost': e.get('cost')} for e in kv]
Sample output (parse_qs provides a list of values for every key, since the query may contain duplicate keys):
[(['224.00'], None), (['224.00'], None), (['224.00'], ['13'])]
The values line is not necessary. You can use the kv dict directly.
If you have to deal with invalid input, the list comprehensions for urls and kv have to be rewritten as a loop:
For urls, you need to check for and filter out entries without an HTTP method.
For kv, you need to wrap urlparse in try/except to catch invalid syntax.
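A loop along those lines might look like the following sketch, assuming Python 3. The function name extract_values is hypothetical, and the validity checks (two whitespace-separated fields, ValueError from urlparse) are one reasonable reading of "invalid input" rather than a complete validator. It passes keep_blank_values=True so that an empty cost= shows up as [''] instead of None, which is closer to the output the question asks for.

```python
from urllib.parse import parse_qs, urlparse


def extract_values(reqs):
    """Return (revenue, cost) value lists for each well-formed request line."""
    results = []
    for line in reqs:
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # no "<METHOD> <target>" pair on this line; skip it
        try:
            query = urlparse(fields[1]).query
        except ValueError:
            continue  # urlparse rejected the target (e.g. malformed netloc)
        params = parse_qs(query, keep_blank_values=True)
        results.append((params.get('revenue'), params.get('cost')))
    return results


reqs = ['GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=',
        'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00',
        'GET /ca.gif?rb=1631&ca=20564929&ra=%n&pid=&revenue=224.00&cost=13',
        'garbage']
print(extract_values(reqs))
# [(['224.00'], ['']), (['224.00'], None), (['224.00'], ['13'])]
```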
