How to decode a POST URL string containing a plus sign - Python

I have an encoded URL string
http://epub.sipo.gov.cn/patentoutline.action?strWhere=OPD%3D%272019.02.15%27+and+PA%3D%27%25%E5%8D%8E%E4%B8%BA%25%27
obtained via Chrome's inspector. I tried to write a requests POST call to fetch the page; the best I could figure out is the following, but it does not work properly. The troubling part seems to be the plus sign. (If there is no and clause, "OPD='2019.02.15'" or "PA='%华为%'" works fine.)
import requests
url = 'http://epub.sipo.gov.cn/patentoutline.action'
params = {'strWhere': r"OPD='2019.02.15' and PA='%华为%'"} # cannot find results
# params = {'strWhere': r"OPD='2019.02.15'"} # works
# params = {'strWhere': r"PA='%华为%'"} # works
r = requests.post(url, data=params)
print(r.content.decode())

Replace the spaces in the URL with %20. You can use a function before sending it:
str.replace(old, new[, max])
For example:
params = {'strWhere': r"OPD='2019.02.15' and PA='%华为%'"}
params['strWhere'] = params['strWhere'].replace(' ', '%20')
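A fuller sketch of sending the request this way, assuming the server accepts %20 for spaces: urlencode(..., quote_via=quote) percent-encodes the value with %20 instead of +, and passing the resulting string (rather than a dict) as data= keeps requests from re-encoding the percent signs.

import requests
from urllib.parse import urlencode, quote

url = 'http://epub.sipo.gov.cn/patentoutline.action'
# quote_via=quote encodes spaces as %20; the default quote_plus would use '+'
body = urlencode({'strWhere': r"OPD='2019.02.15' and PA='%华为%'"}, quote_via=quote)
r = requests.post(url, data=body,
                  headers={'Content-Type': 'application/x-www-form-urlencoded'})
print(r.content.decode())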

Related

How to parse a JSONP response in Python?

How can I parse a JSONP response? I tried json.loads(), but it doesn't work on JSONP.
From the following reading:
JSONP is JSON with padding, that is, you put a string at the beginning
and a pair of parentheses around it.
So I tried removing the padding from the string and then using json.loads():
import requests
from json import loads

response = requests.get(link)  # link is the URL of the JSONP endpoint
startidx = response.text.find('(')
endidx = response.text.rfind(')')
data = loads(response.text[startidx + 1:endidx])

It works.
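As a quick check without a network call, the same stripping works on any callback-wrapped string (the payload below is hypothetical):

from json import loads

payload = 'someCallback({"user": "alice", "id": 7});'
start = payload.find('(')
end = payload.rfind(')')
data = loads(payload[start + 1:end])
print(data['user'])  # alice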

How to remove %0D in Python requests

import requests
nexmokey = 'mykey'
nexmosec = 'mysecretkey'
nexmoBal = 'https://rest.nexmo.com/account/get-balance?api_key={}&api_secret={}'.format(nexmokey, nexmosec)
rr = requests.get(nexmoBal)
print(rr.url)
I would like to send a POST request to
https://rest.nexmo.com/account/get-balance?api_key=mykey&api_secret=mysecretkey
but why does %0D appear?
https://rest.nexmo.com/account/get-balance?api_key=mykey%0D&api_secret=mysecretkey%0D
requests.get expects parameters like api_secret=my_secret to be provided through the params argument rather than as part of the URL; they are then URL-encoded for you.
Use this:
nexmoBal = 'https://rest.nexmo.com/account/get-balance'
rr = requests.get(nexmoBal, params={'api_key': nexmokey, 'api_secret': nexmosec})
The fact that %0D ends up in there indicates you have a character 0x0D in the values: a carriage return (part of the end-of-line sequence on Windows systems), probably because you are reading the key and secret from some file and didn't include that part in the example code.
Also, note that you mention you want to post, but you're calling .get().
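If the values do come from a file, stripping each line removes the trailing carriage return; a minimal sketch, assuming a hypothetical nexmo.key file with the key on the first line and the secret on the second:

# Hypothetical file: key on line 1, secret on line 2
with open('nexmo.key') as fh:
    nexmokey = fh.readline().strip()  # strip() drops the trailing '\r' / '\r\n'
    nexmosec = fh.readline().strip()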

Extracting unique URLs in Python

I would like to extract all the unique URL items in my list in order to move on with a web scraping project. Although I have a huge list of URLs on my side, I will construct a minimal scenario here to explain the main issue. Assume that my list is like this:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"https://www.ox.ac.uk/research"
]
def ExtractUniqueUrls(urls):
    pass

ExtractUniqueUrls(url_list)
For this minimal scenario, I am expecting only two unique URLs: "https://www.ox.ac.uk" and "https://www.ox.ac.uk/research". Although the URL elements have some differences, such as "http" vs. "https", with or without a trailing "/", index.php, index.html, they all point to exactly the same web page. There might be other possibilities I have missed (please mention them if you catch any). Anyway, what is a proper and efficient way to handle this issue using Python 3?
I am not looking for a hard-coded solution that focuses on each case individually. For instance, I do not want to manually check whether the URL has a "/" at the end or not. Possibly there is a much better solution with another package, such as urllib? For that reason, I looked at urllib.parse, but I could not come up with a proper solution so far.
Thanks
Edit: I added one more example to the end of my list to explain better. Otherwise, you might assume that I am looking for the root URL, but that is not the case at all.
Covering only the cases you've revealed:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"ox.ac.uk/research",
"ox.ac.uk/index.php?12"]
def url_strip_gen(source: list):
    replace_dict = {".php": "", ".html": "", "http://": "", "https://": ""}
    for url in source:
        for key, val in replace_dict.items():
            url = url.replace(key, val, 1)
        url = url.rstrip('/')
        yield url[4:] if url.startswith("www.") else url

print(set(url_strip_gen(url_list)))
{'ox.ac.uk/index?12', 'ox.ac.uk/index', 'ox.ac.uk/research', 'ox.ac.uk'}
This won't cover the case where the URL itself contains .html, like www.htmlsomething; that can be compensated for with urlparse, which stores the path and the host separately, as below:
>>> import pprint
>>> from urllib.parse import urlparse
>>> a = urlparse("http://ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='http', netloc='ox.ac.uk', path='/index.php', params='', query='12', fragment='')
However, without a scheme:
>>> a = urlparse("ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='', netloc='', path='ox.ac.uk/index.php', params='', query='12', fragment='')
the whole host goes into the path attribute.
To compensate, we either need to strip any scheme and then prepend one to every URL, or check whether each URL starts with a scheme and add one only if it doesn't. The former is easier to implement.
replace_dict = {"http://": "", "https://": ""}

for url in source:
    # Unify scheme to HTTP
    for key, val in replace_dict.items():
        url = url.replace(key, val, 1)
    url = "http://" + (url[4:] if url.startswith("www.") else url)
    parsed = urlparse(url)
With this you are guaranteed separate control over each section of your URL via urlparse. However, since you haven't specified which parts should count toward a URL being unique, I'll leave that task to you.
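As one possible completion (my assumption: netloc plus path, with the extension, trailing slash, and query string stripped, is what identifies a page):

from urllib.parse import urlparse

def unique_urls(source):
    seen = set()
    for url in source:
        # Unify the scheme so urlparse fills in netloc correctly
        for prefix in ("http://", "https://"):
            url = url.replace(prefix, "", 1)
        url = "http://" + (url[4:] if url.startswith("www.") else url)
        parsed = urlparse(url)
        # Assumption: drop the extension, trailing '/' and query string
        path = parsed.path
        for ext in (".php", ".html"):
            if path.endswith(ext):
                path = path[:-len(ext)]
        seen.add(parsed.netloc + path.rstrip("/"))
    return seen

print(unique_urls(url_list))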
Here's a quick and dirty attempt:
def extract_unique_urls(url_list):
    unique_urls = []
    for url in url_list:
        # Remove the 'https://' etc. part
        if url.find('//') > -1:
            url = url.split('//')[1]
        # Remove the 'www.' part
        url = url.replace('www.', '')
        # Remove any trailing '/'
        url = url.rstrip('/')
        # If not a root url, inspect the last part of the url
        if url.find('/') > -1:
            # Extract the last part
            last_part = url.split('/')[-1]
            # Drop the last part if it contains a '.'
            if last_part.find('.') > -1:
                url = '/'.join(url.split('/')[:-1]).rstrip('/')
        # Append if not already in the list
        if url not in unique_urls:
            unique_urls.append(url)
    # Sorting for the fun of it
    return sorted(unique_urls)
I'm sure it doesn't cover all possible cases, but you can extend it as needed. I'm also not sure whether you wanted to keep the 'http(s)://' parts; if so, just add them back to the results.
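Run against the question's original list, this should yield the two expected entries, modulo the scheme:

print(extract_unique_urls(url_list))
# ['ox.ac.uk', 'ox.ac.uk/research']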

Python Variable Mutation Best Practice

Let's say I'm passing a variable into a function, and I want to ensure it's properly formatted for my end use, with consideration for several potential unwanted formats.
Example: I want to store only lowercase representations of URL addresses, without http:// or https://.
def standardize(url):
    # Lowercase
    temp_url = url
    url = temp_url.lower()
    # Remove 'http://'
    if 'http://' in url:
        temp_url = url
        url = temp_url.replace('http://', '')
    if 'https://' in url:
        temp_url = url
        url = temp_url.replace('https://', '')
    return url
I'm only just encroaching on the title of Novice, and was wondering whether there is a more Pythonic approach to this kind of process.
The end goal is the transformation of a URL like https://myurl.com/RANDoM --> myurl.com/random.
The particular application of URL string formatting isn't important.
A simple re.sub will do the trick:
import re

def standardize(url):
    return re.sub("^https?://", '', url.lower())
# with 'https'
print(standardize('https://myurl.com/RANDoM')) # prints 'myurl.com/random'
# with 'http'
print(standardize('http://myurl.com/RANDoM')) # prints 'myurl.com/random'
# both work
def standardize(url):
    return url.lower().replace("https://", "").replace("http://", "")
That's as simple as I can make it, but the chaining is a little ugly.
If you're willing to import re, you could also do something like this:
import re

def standardize(url):
    return re.sub("^https?://", "", url.lower())

How to add custom parameters to a URL query string with Python?

I need to add custom parameters to a URL query string using Python.
Example:
This is the URL that the browser is fetching (GET):
/scr.cgi?q=1&ln=0
then some Python commands are executed, and as a result I need to set the following URL in the browser:
/scr.cgi?q=1&ln=0&SOMESTRING=1
Is there some standard approach?
You can use urlsplit() and urlunsplit() to break apart and rebuild a URL, then use urlencode() on the parsed query string:
# Python 2 imports; on Python 3 all four live in urllib.parse
from urllib import urlencode
from urlparse import parse_qs, urlsplit, urlunsplit

def set_query_parameter(url, param_name, param_value):
    """Given a URL, set or replace a query parameter and return the
    modified URL.

    >>> set_query_parameter('http://example.com?foo=bar&biz=baz', 'foo', 'stuff')
    'http://example.com?foo=stuff&biz=baz'
    """
    scheme, netloc, path, query_string, fragment = urlsplit(url)
    query_params = parse_qs(query_string)
    query_params[param_name] = [param_value]
    new_query_string = urlencode(query_params, doseq=True)
    return urlunsplit((scheme, netloc, path, new_query_string, fragment))
Use it as follows:
>>> set_query_parameter("/scr.cgi?q=1&ln=0", "SOMESTRING", 1)
'/scr.cgi?q=1&ln=0&SOMESTRING=1'
Use urlsplit() to extract the query string, parse_qsl() to parse it (or parse_qs() if you don't care about argument order), add the new argument, urlencode() to turn it back into a query string, and urlunsplit() to fuse it back into a single URL, then redirect the client.
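A sketch of that flow on Python 3 (the helper name add_query_parameter is mine):

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_query_parameter(url, name, value):
    parts = urlsplit(url)
    query = parse_qsl(parts.query)  # list of (name, value) pairs, order preserved
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(add_query_parameter("/scr.cgi?q=1&ln=0", "SOMESTRING", 1))
# /scr.cgi?q=1&ln=0&SOMESTRING=1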
You can use furl, a Python URL-manipulation library.
import furl
f = furl.furl("/scr.cgi?q=1&ln=0")
f.args['SOMESTRING'] = 1
print(f.url)
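which should print the URL with the new argument appended:
# /scr.cgi?q=1&ln=0&SOMESTRING=1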
import urllib

url = "/scr.cgi?q=1&ln=0"
param = urllib.urlencode({'SOMESTRING': 1})  # Python 2; use urllib.parse.urlencode on Python 3
# Add a separating '&' unless the URL already ends with one
url = url.endswith('&') and (url + param) or (url + '&' + param)

See the urlencode docs for details.
