I'm using urlopen() to open a website and pull (financial) data from it. Here is my line:
sourceCode = urlopen('xxxxxxxx').read()
After this, I then pull the data I need out. I loop through different pages on the same domain to pull data (stock info). I end the body of the loop with:
time.sleep(1)
as I'm told that keeps the site from blocking me. My program will run for a few minutes, but at some point, it stalls and quits pulling data. I can rerun it and it'll run another arbitrary amount of time and then stall.
Is there something I can do to prevent this?
This worked (for most websites) for me:
If you're using the urllib.request library, you can create a Request and spoof the user agent. This might mean that they stop blocking you.
from urllib.request import Request, urlopen
req = Request(path, headers={'User-Agent': 'Mozilla/5.0'})
data = urlopen(req).read()
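If the script hangs partway through rather than being blocked outright, the read itself may be stalling. As a rough sketch (the URL list, the one-second sleep, and the 10-second timeout are assumptions based on the question), passing a timeout to urlopen makes a hung connection raise an exception you can catch instead of freezing the loop:
import time
from urllib.request import Request, urlopen

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

for url in urls:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        # timeout (seconds) makes a stalled connection raise instead of hanging forever
        sourceCode = urlopen(req, timeout=10).read()
    except OSError as e:  # covers URLError and socket timeouts
        print('Skipping {}: {}'.format(url, e))
        continue
    # ... pull the data out of sourceCode here ...
    time.sleep(1)  # throttle requests, as in the original loop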
Hope this helps
I am trying to use requests to pull information from the NPI API but it is taking on average over 20 seconds to pull the information. If I try and access it via my web browser it takes less than a second. I'm rather new to this and any help would be greatly appreciated. Here is my code.
import json
import sys
import requests
url = "https://npiregistry.cms.hhs.gov/api/?number=&enumeration_type=&taxonomy_description=&first_name=&last_name=&organization_name=&address_purpose=&city=&state=&postal_code=10017&country_code=&limit=&skip="
htmlfile = requests.get(url)
data = htmlfile.json()
for i in data["results"]:
    print(i)
This might be due to the response being incorrectly formatted, or due to requests taking longer than necessary to set up the request. To solve these issues, read on:
Server response formatted incorrectly
One possibility is that parsing the response is actually the slow part. You can check this by not reading the response you receive from the server. If the code is still slow, this is not your problem, but if this fixes it, the problem likely lies with parsing the response.
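A rough way to check this (a sketch; url is the NPI endpoint from the question) is to request with stream=True, which defers downloading the body, so you can time the two phases separately:
import time
import requests

start = time.time()
r = requests.get(url, stream=True)  # stream=True: headers only, body not downloaded yet
print("headers received after {:.2f}s".format(time.time() - start))
data = r.json()  # now the body is downloaded and parsed
print("body read and parsed after {:.2f}s".format(time.time() - start))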
If some headers are set incorrectly, this can lead to parsing errors that prevent chunked transfer (source).
In other cases, setting the encoding manually might resolve parsing problems (source).
To fix those, try:
r = requests.get(url)
r.raw.chunked = True  # Fix issue 1
r.encoding = 'utf-8'  # Fix issue 2
print(r.text)
Setting up the request takes long
This is mainly applicable if you're sending multiple requests in a row. To prevent requests having to set up the connection each time, you can utilize a requests.Session. This makes sure the connection to the server stays open and configured and also persists cookies as a nice benefit. Try this (source):
import requests
session = requests.Session()
for _ in range(10):
    session.get(url)
Didn't solve your issue?
If that did not solve your issue, I have collected some other possible solutions here.
I am parsing through various links using requests; however, some links are "bad", i.e. they simply don't load, which causes my program to hang and eventually crash.
Is there a way to set a time limit for getting a request, so that if that time passes (i.e. it fails to get a response from the URL) it just returns some kind of error? Or is there some other way I can prevent bad links from breaking my program?
urllib2 is one option
import urllib2
test_url = "http://www.test.com"
try:
    urllib2.urlopen(test_url)
except:
    pass
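To actually enforce a time limit, note that urllib2.urlopen also accepts a timeout argument (in seconds), so a link that never responds raises an exception you can catch instead of hanging. A sketch (the 5-second limit is arbitrary):
import socket
import urllib2

test_url = "http://www.test.com"
try:
    # a stalled connection now raises instead of blocking forever
    urllib2.urlopen(test_url, timeout=5)
except (urllib2.URLError, socket.timeout):
    pass  # treat the link as "bad" and move on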
Edit: I have found that I made a mistake; the cause of the error wasn't urllib but nltk, which wasn't able to process a long string that came from this exact page. Sorry for this one.
I do not know why, but this happens no matter whether I use urllib2.urlopen or requests when I come across one specific URL.
import requests
r = requests.get('SomeURL')
html = r.text
print html
Here is its behavior:
1) When I go through a loop of 200 URLs, it freezes every time at exactly the same URL. It stays there for hours if I do not terminate the program.
2) When I try just that piece of code outside of the loop, it works.
3) If I blacklist just this URL, it goes through the loop without problems.
It doesn't return any kind of error code, it works fine outside of the loop, and a timeout is set, but the timeout doesn't do anything. The request still hangs for an indefinite time.
So is there any other way to forcefully stop the HTTP GET request after a certain time, since the timeout doesn't work? Is there any other library besides urllib2 and requests that could do the job and that honors timeout limits?
for i in range(0, mincount):
    code(call the request for urlist[i])
#end
It always works but freezes only when I request this one site. If I had 200 requests to yahoo, for example, it would work, but when I try to go to this particular URL I cannot.
edit: It is a standard for loop and there is not much room for error.
I think it's simply a very slow page; on my system, it takes about 9.7s to load.
If you are trying to run it in a short loop, it would indeed seem to freeze.
You could try something like
import requests

links = [
    'SomeURL',
    'http://www.google.com/'
]
for link in links:
    try:
        html = requests.get(link, timeout=2.).content
        print("Successfully loaded {}".format(link))
    except requests.Timeout:
        print("Timed out loading {}".format(link))
which gave me
Timed out loading SomeURL
Successfully loaded http://www.google.com/
I am currently using the Python requests package to make JSON requests. Unfortunately, the service I need to query has a daily maximum request limit. Right now, I cache the executed request URLs, so in case I go beyond this limit, I know where to continue the next day.
r = requests.get('http://someurl.com', params=request_parameters)
log.append(r.url)
However, to use this log the next day, I need to create the request URLs in my program before actually sending the requests, so I can match them against the strings in the log. Otherwise, it would decrease my daily limit. Does anybody have an idea how to do this? I didn't find any appropriate method in the requests package.
You can use PreparedRequests.
To build the URL, you can build your own Request object and prepare it:
from requests import Session, Request
s = Session()
p = Request('GET', 'http://someurl.com', params=request_parameters).prepare()
log.append(p.url)
Later, when you're ready to send, you can just do this:
r = s.send(p)
The relevant section of the documentation is here.
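Putting it together, here is a minimal sketch of the log check (treating the log as a set of already-requested URLs is an assumption, and the helper name fetch_if_new is made up for illustration):
from requests import Session, Request

session = Session()
log = set()  # URLs already requested, e.g. reloaded from yesterday's log file

def fetch_if_new(url, params):
    # build the final URL without sending anything
    prepared = Request('GET', url, params=params).prepare()
    if prepared.url in log:
        return None  # already spent part of the daily quota on this URL
    response = session.send(prepared)
    log.add(prepared.url)
    return response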
This code was working fine and reading html.
Then the site just stopped giving any data on read(). No error code.
It's because the web server has detected something unusual, right?
(Before I figured out setting the User-Agent, I was getting Error 403: Bad Behavior.)
Does urllib2 have some noticeable signature that raises a flag?
Would switching to another library help?
I'm not doing anything suspicious. I can't see any behavior difference between me using this library to read a page and using lynx browser.
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
search_url = 'http://www.google.com/?search=stackoverflow'
raw = opener.open(search_url)
print raw.headers
print raw.read()
Given your print statement, I assume you are running this on Python 2.x.
I ran the same thing on my system and it works, irrespective of setting the user-agent. What David Robinson suggested might have something to do over here.
On another note, I have personally used the following example snippet
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
multiple times, over multiple use cases, and have never encountered your error.
Try using mechanize instead of plain urllib2 for crawling search engines; it mimics a browser's behaviour better.
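A minimal sketch of the mechanize approach (the header value and URL are placeholders):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # don't fetch or obey robots.txt
br.addheaders = [('User-agent', 'Mozilla/5.0')]
response = br.open('http://www.google.com/?search=stackoverflow')
print response.read()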