I would like to use asyncio to get the webpage.
However, when I executed the code below, no page is obtained.
The code is
import aiofiles
import aiohttp
from aiohttp import ClientSession
import asyncio

async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        html = await res.text(encoding='GB18030')
        return 0, html
    except:
        return 1, []

async def main_get_webpage(urls):
    webpage = []
    connector = aiohttp.TCPConnector(limit=60)
    async with ClientSession(connector=connector) as session:
        tasks = [get_webpage(url, session) for url in urls]
        result = await asyncio.gather(*tasks)
        for status, data in result:
            print(status)
            if status == 0:
                webpage.append(data)
    return webpage

if __name__ == '__main__':
    urls = ['https://lcdsj.fang.com/house/3120178164/fangjia.htm', 'https://mingliugaoerfuzhuangyuan0551.fang.com/house/2128242324/fangjia.htm']
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop = asyncio.get_event_loop()
    webpage = loop.run_until_complete(main_get_webpage(urls))
I expect two zeros to be printed in the function main_get_webpage(urls).
However, two ones are printed.
What's wrong with my code?
How to fix the problem?
Thank you very much.
What's wrong with my code?
What's wrong is that you have a try: ... except: that masks the source of the problem. If you remove the except clause, you will find an error message that communicates the underlying issue:
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xb7 in position 47676: illegal multibyte sequence
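That traceback is what surfaces if you run the same coroutine with the try/except stripped out, e.g. this minimal sketch:

async def get_webpage(url, session):
    # No try/except: let the real exception propagate instead of returning (1, [])
    res = await session.request(method="GET", url=url)
    html = await res.text(encoding='GB18030')
    return 0, html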
The web page is not encoded as GB18030. The page declares itself as GB2312 (a precursor to GB18030), but using that as the encoding also fails.
How to fix the problem?
Depending on what you want to do with the web page text, you have several options:
Find an encoding supported by Python that works with the page as given. This is the ideal option, but I wasn't able to find it with a short search. (Using this answer to find out what Chrome thinks the page uses didn't help either, because the response was GBK, which again produces an error on character 47676.)
Decode the page with a more relaxed error handler, such as res.text(encoding='GB18030', errors='replace'). That will give you a good approximation of the text, with the undecipherable bytes rendered as the unicode replacement character. This is a good option if you need to search the page for a substring or analyze it as text, and don't care about a weird character somewhere in it.
Give up the idea of decoding the page as text, and just use res.read() to get the raw bytes. This option is best if you need to archive or cache the page, or index it. (A short sketch of the last two options follows this list.)
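A minimal sketch (not tested against these particular pages) of what the last two options look like inside the original coroutine; aiohttp's text() accepts an errors argument and read() returns the raw body bytes:

async def get_webpage(url, session):
    res = await session.request(method="GET", url=url)

    # Option 2: lenient decoding; undecodable bytes become U+FFFD.
    html = await res.text(encoding='GB18030', errors='replace')

    # Option 3: skip decoding and keep the raw bytes for archiving/caching.
    raw = await res.read()

    return html, raw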
I think a better way may be to just use await res.text() instead of await res.text(encoding='GB18030'), because as https://docs.aiohttp.org/en/stable/client_reference.html?highlight=encoding#aiohttp.ClientResponse.text says:
> If encoding is None content encoding is autocalculated using Content-Type HTTP header and chardet tool if the header is not provided by server.
I would argue that if aiohttp didn't use the charset in Content-Type to decode the response text, its implementation would be rather problematic. You really don't need to provide the encoding parameter.
I checked the two URLs in your example; the Content-Type is text/html; charset=utf-8 for both, so you can't use GB18030 to decode them.
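A small sketch of that change applied to the original coroutine; everything else stays the same:

async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        # Let aiohttp pick the encoding from the Content-Type header
        # (both example URLs declare charset=utf-8).
        html = await res.text()
        return 0, html
    except Exception:
        return 1, []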
I want to write a Python script that retrieves data from a website using its API. Using the code below (which comes from this [tutorial][1] - click [here for more context][2]), I am able to successfully submit a query to the website but not retrieve the data I need.
import requests, sys, json

WEBSITE_API = "https://rest.uniprot.org/beta"

def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
        sys.exit()
    return response

r = get_url(f"{WEBSITE_API}/uniprotkb/search?query=parkin AND (taxonomy_id:9606)&fields=id,accession,length,ft_site", headers={"Accept": "text/tsv"})
print(r.text)
Instead of printing the data I've queried, I get an error message:
{"url":"http://rest.uniprot.org/uniprotkb/search","messages":["Invalid request received. Requested media type/format not accepted: 'text/tsv'."]}
I figured out that I can print the data as a JSON-formatted string if I change the second-to-last line of my code to this:
r = get_url(f"{WEBSITE_API}/uniprotkb/search?query=parkin AND (taxonomy_id:9606)&fields=id,accession,length,ft_site")
In other words, I simply remove the following code snippet:
headers={"Accept": "text/tsv"}
I highly doubt this is an issue with the website, because the tutorial the non-working code comes from was a November 2021 webinar put on by an affiliated institution. I think a more likely explanation is that my Python environment lacks some library, and I would appreciate any advice about how to investigate this possibility. I would also appreciate code and/or library suggestions that would put me on the path to simply parsing the JSON string into a tab-delimited file (see the sketch after the links below).
[1]: https://colab.research.google.com/drive/1SU3j4VmXHYrYxOlrdK_NCBhajXb0iGes#scrollTo=EI91gRfPYfbc
[2]: https://drive.google.com/file/d/1qZXLl19apibjszCXroC1Jg2BGjcychqX/view
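For the last part, here is a rough sketch (not from the tutorial) of flattening the JSON response into tab-separated lines with only the standard library; the field names (results, primaryAccession, uniProtkbId, sequence.length) are assumptions about the response shape, so print one entry first to confirm them:

import csv, sys

data = r.json()  # r is the response returned by get_url(...) above
writer = csv.writer(sys.stdout, delimiter='\t')
writer.writerow(["accession", "id", "length"])
for entry in data.get("results", []):        # assumed top-level key
    writer.writerow([
        entry.get("primaryAccession"),       # assumed field names
        entry.get("uniProtkbId"),
        entry.get("sequence", {}).get("length"),
    ])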
I'm trying to see if I'm able to get the response data, as I'm learning how to use regex in Locust. I'm reproducing my test script from JMeter using Locust.
This is the part of the code that I'm having problem with.
import time, csv, json
from locust import HttpUser, task, between, tag

class ResponseGet(HttpUser):
    response_data = ""
    wait_time = between(1, 1.5)
    host = "https://portal.com"
    username = "NA"
    password = "NA"

    @task
    def portal(self):
        print("Portal Task")
        response = self.client.post('/login', json={'username': 'user', 'password': '123'})
        print(response)
        self.response_data = json.loads(response.text)
        print(response_data)
I've tried this suggestion and I somehow can't make it work.
My idea is get response data > use regex to extract string > pass the string for the next task to use
For example:
Get login response data > use regex to extract token > use the token for the next task.
Is there any better way to do this?
The way you're doing it should work, but Locust's HttpUser client is based on Requests, so if you want to access the response data as JSON you should be able to do that with just self.response_data = response.json(). Note that this, like your json.loads(response.text) call, will only work if the response body is valid JSON.
If your problem is in parsing the response text as JSON, it's likely that the response just isn't JSON, possibly because you're getting an error page. You could print the response body before you attempt to load it as JSON, but your current print(response) won't do that, because it just prints the Response object returned by Requests; you'd need print(response.text) instead.
As far as whether a regex would be the right solution for getting at the token returned in the response, that will depend on how exactly the response is formatted.
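If the login endpoint does return JSON, a regex may not be needed at all. Here is a rough sketch (the 'token' field, '/dashboard' path, and Bearer header are assumptions, not taken from your API) of pulling a value out of the login response and reusing it in a later task:

import json
from locust import HttpUser, task, between

class ResponseGet(HttpUser):
    wait_time = between(1, 1.5)
    host = "https://portal.com"
    token = None

    @task
    def portal(self):
        response = self.client.post('/login', json={'username': 'user', 'password': '123'})
        print(response.text)             # inspect the raw body first
        data = response.json()           # only works if the body is valid JSON
        self.token = data.get('token')   # hypothetical field name

    @task
    def next_task(self):
        if self.token:                   # skip until the login task has run
            self.client.get('/dashboard',
                            headers={'Authorization': 'Bearer ' + self.token})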
I have the following, very basic code that throws: TypeError: the JSON object must be str, not 'bytes'
import requests
import json

url = 'my url'
user = 'my user'
pwd = 'my password'

myResponse = requests.get(url, auth=(user, pwd))
if myResponse.ok:
    Data = json.loads(myResponse.content)
I tried adding a decode to the Data variable as follows, but it throws the same error: jData = json.loads(myResponse.content).decode('utf-8')
Any suggestions?
json.loads(myResponse.content.decode('utf-8'))
You just put it in the wrong order, innocent mistake.
(In-depth answer.) As courteously pointed out by wim, in some rare cases the server could opt for UTF-16 or UTF-32. These cases are less common, since the developers in that scenario would be consciously deciding to throw away valuable bandwidth. So, if you run into encoding issues, you can change 'utf-8' to 'utf-16', 'utf-32', etc.
There are a couple of solutions for this. You could use Requests' built-in .json() method:
myResponse.json()
Or, you could opt for character encoding detection via chardet. Chardet is a character-encoding detection library whose main function, detect, can identify most common encodings; you can then use the result to decode your bytes.
import chardet
json.loads(myResponse.content.decode(chardet.detect(myResponse.content)["encoding"]))
Let requests decode it for you:
data = response.json()
This will check headers (Content-Type) and response encoding, auto-detecting how to decode the content correctly.
Python 3.6+ does this automatically (json.loads accepts bytes), so your code shouldn't raise this error on Python 3.6+.
See What's New in Python 3.6.
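A quick illustration of that: json.loads accepts bytes directly starting with Python 3.6, which is why the original snippet only fails on older versions.

import json

print(json.loads(b'{"ok": true}'))   # {'ok': True} on 3.6+; TypeError on 3.5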
I have been working on some code that will grab emergency incident information from a service called PulsePoint. It works with software built into computer controlled dispatch centers.
This is an app that empowers citizen heroes who are CPR trained to help before a first responder arrives on scene. I'm merely using it to get other emergency incidents.
I reverse-engineered their app, as they have no documentation on how to make your own requests. Because of this I have knowingly left in the API key and auth info, since it's in plain text in the Android manifest file.
I will definitely make a Python module eventually for interfacing with this service; for now it's just messy.
Anyhow, sorry for that long boring intro.
My real question is: how can I simplify this function so that it looks and runs a bit cleaner, while making a timed request and returning a JSON object that can be accessed through subscripts?
import requests, time, json

def getjsonobject(agency):
    startsecond = time.strftime("%S")
    url = REDACTED
    body = []
    currentagency = requests.get(url=url, verify=False, stream=True,
                                 auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                                 timeout=13)
    for chunk in currentagency.iter_content(1024):
        body.append(chunk)
        if int(startsecond) + 5 < int(time.strftime("%S")):  # Shitty internet proof, with timeout above
            raise Exception("Server sent too much data")
    jsonstringforagency = str(b''.join(body))[2:][:-1]  # Removes characters that wrap the response body so that the next line doesn't error
    currentagencyjson = json.loads(jsonstringforagency)  # Loads response as decodable JSON
    return currentagencyjson

currentincidents = getjsonobject("lafdw")
for inci in currentincidents["incidents"]["active"]:
    print(inci["FullDisplayAddress"])
Requests handles acquiring the body data, checking for JSON, and parsing the JSON for you automatically, and since you're giving the timeout arg I don't think you need separate timeout handling. Requests also handles constructing the URL for GET requests, so you can put your query information into a dictionary, which is much nicer. Combining those changes and removing unused imports gives you this:
import requests

params = dict(both=1,
              minimal=1,
              apikey=REDACTED)
url = REDACTED

def getjsonobject(agency):
    myParams = dict(params, agency=agency)
    return requests.get(url, verify=False, params=myParams, stream=True,
                        auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                        timeout=13).json()
Which gives the same output for me.
Examples:
url_1 = "http://yinwang.org/blog-cn/2013/04/21/ydiff-%E7%BB%93%E6%9E%84%E5%8C%96%E7%9A%84%E7%A8%8B%E5%BA%8F%E6%AF%94%E8%BE%83/"
url_2 = "http://yinwang.org/blog-cn/2013/04/21/ydiff-%E7%BB%93%E6%9E%84%E5%8C%96%E7%9A%84%E7%A8%8B%E5%BA%8F%E6%AF%94%E8%BE%83"
As you can see, if I don't add a / at the end of the URL, urllib2.urlopen(url_2) returns a 400 error, because the effective URL should be url_1. If the URL doesn't include any Chinese characters, both urllib2.urlopen and urllib.urlopen add the / automatically.
The question is that urllib.urlopen works well in all of these situations, but urllib2.urlopen only works well when the URL contains no Chinese characters.
So I wonder whether this is a little bug in urllib2.urlopen, or is there another explanation for it?
What actually happens here is a couple of redirections initiated by the server, before the actual error:
Request: http://yinwang.org/blog-cn/2013/04/21/ydiff-%E7%BB%93%E6%9E%84%E5%8C%96%E7%9A%84%E7%A8%8B%E5%BA%8F%E6%AF%94%E8%BE%83
Response: Redirect to http://www.yinwang.org/blog-cn/2013/04/21/ydiff-%E7%BB%93%E6%9E%84%E5%8C%96%E7%9A%84%E7%A8%8B%E5%BA%8F%E6%AF%94%E8%BE%83
Request: http://www.yinwang.org/blog-cn/2013/04/21/ydiff-%E7%BB%93%E6%9E%84%E5%8C%96%E7%9A%84%E7%A8%8B%E5%BA%8F%E6%AF%94%E8%BE%83
Response: Redirect to http://www.yinwang.org/blog-cn/2013/04/21/ydiff-结构化的程序比较/ (actually 'http://www.yinwang.org/blog-cn/2013/04/21/ydiff-\xe7\xbb\x93\xe6\x9e\x84\xe5\x8c\x96\xe7\x9a\x84\xe7\xa8\x8b\xe5\xba\x8f\xe6\xaf\x94\xe8\xbe\x83/', to be precise)
AFAIK, that last redirect is invalid. The address should be plain ASCII (non-ascii characters should be encoded). The correct encoded address would be: http://www.yinwang.org/blog-cn/2013/04/21/ydiff-%E7%BB%93%E6%9E%84%E5%8C%96%E7%9A%84%E7%A8%8B%E5%BA%8F%E6%AF%94%E8%BE%83/
Now, it seems that urllib is playing nice and doing the conversion itself, before requesting the final address, whereas urllib2 simply uses the address it receives.
You can see that if you try to open the final address manually:
urllib
>>> print urllib.urlopen('http://www.yinwang.org/blog-cn/2013/04/21/ydiff-\xe7\xbb\x93\xe6\x9e\x84\xe5\x8c\x96\xe7\x9a\x84\xe7\xa8\x8b\xe5\xba\x8f\xe6\xaf\x94\xe8\xbe\x83/').geturl()
http://www.yinwang.org/blog-cn/2013/04/21/ydiff-%E7%BB%93%E6%9E%84%E5%8C%96%E7%9A%84%E7%A8%8B%E5%BA%8F%E6%AF%94%E8%BE%83/
urllib2
>>> try:
... urllib2.urlopen('http://www.yinwang.org/blog-cn/2013/04/21/ydiff-\xe7\xbb\x93\xe6\x9e\x84\xe5\x8c\x96\xe7\x9a\x84\xe7\xa8\x8b\xe5\xba\x8f\xe6\xaf\x94\xe8\xbe\x83/')
... except Exception as e:
... print e.geturl()
...
http://www.yinwang.org/blog-cn/2013/04/21/ydiff-š╗ôŠ×äňîľšÜäšĘőň║ĆŠ»öŔżâ/
Solution
If it is your server, you should fix the problem there. Otherwise, I guess it should be possible to write a custom urllib2.HTTPRedirectHandler that encodes the redirection URLs before urllib2 follows them.
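As a rough, untested sketch of that idea (Python 2, since this is urllib2): subclass HTTPRedirectHandler and percent-encode the Location URL before the redirect is followed. The safe character set below is a guess that keeps the usual URL punctuation intact:

import urllib
import urllib2

class QuotingRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Percent-encode non-ASCII bytes, leaving normal URL characters alone.
        newurl = urllib.quote(newurl, safe="%/:=&?~#+!$,;'@()*[]")
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)

opener = urllib2.build_opener(QuotingRedirectHandler())
print opener.open(url_2).geturl()   # url_2 from the question above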