When I use Selenium RC, I sometimes get an error, but sometimes not. I guess it's related to the timeout of wait_for_page_to_load(), but I don't know how long it needs.
The error information:
Exception: Timed out after 30000ms
File "C:\Users\Herta\Desktop\test\newtest.py", line 9, in <module>
sel.open(url)
File "C:\Users\Herta\Desktop\test\selenium.py", line 764, in open
self.do_command("open", [url,])
File "C:\Users\Herta\Desktop\test\selenium.py", line 215, in do_command
raise Exception, data
This is my program:
from selenium import selenium
url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.start()
sel.open(url)
sel.wait_for_page_to_load(1000)
f = sel.get_html_source()
sav = open('test.html','w')
sav.write(f)
sav.close()
sel.stop()
Timing is a big issue when automating UI pages. You want to make sure you use timeouts when needed and provide the needed time for certain events. I see that you have
sel.open(url)
sel.wait_for_page_to_load(1000)
The sel.wait_for_page_to_load call after sel.open is redundant: all sel.open commands have a built-in wait. This may be the cause of your problem, because Selenium already waits as part of the sel.open command and is then told to wait again for a page to load; since no new page is loading, it throws an error.
However, this is unlikely, since the trace is thrown on the sel.open command. Wawa's response above may be your best bet.
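If the redundant wait is indeed the issue, a minimal sketch of the same script without it (relying only on the built-in wait of sel.open) would be:
from selenium import selenium

url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.start()
sel.open(url)              # open() already blocks until the page has loaded
f = sel.get_html_source()
sav = open('test.html', 'w')
sav.write(f)
sav.close()
sel.stop()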
The "Timed out after 30000ms" message is coming from the sel.open(url) call which uses the selenium default timeout. Try increasing this time using sel.set_timeout("timeout"). I would suggest 60 seconds as a good starting point, if 60 seconds doesn't work, try increasing the timeout. Also make sure that you can get to the page normally.
from selenium import selenium
url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.set_timeout('60000')
sel.start()
sel.open(url)
sel.wait_for_page_to_load(1000)
f = sel.get_html_source()
sav = open('test.html','w')
sav.write(f)
sav.close()
sel.stop()
I had this problem and it was windows firewall blocking selenium server. Have you tried adding an exception to your firewall?
Goal:
The goal is to build a bot on Replit that iteratively scrapes Yahoo Finance pages like this Amazon page and tracks the dynamic 'Volume' datapoint for abnormally large changes. Right now I am trying to pull that exact datapoint down reliably, and I have been using the yahoo_fin API to do so. I have also considered using bs4, but I'm not sure whether it can extract dynamic data. (I'd greatly appreciate it if you happen to know the answer to that: can bs4 extract dynamic data?)
Problem:
The script seems to work, but it does not stay online due to what appears to be an error in yahoo_fin. Usually within around 5 minutes of turning the bot on, it throws the following error:
File "/home/runner/goofy/scrape.py", line 13, in fetchCurrentVolume
table = si.get_quote_table(ticker)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/yahoo_fin/stock_info.py", line 293, in get_quote_table
tables = pd.read_html(requests.get(site, headers=headers).text)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 1098, in read_html
return _parse(
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 926, in _parse
raise retained
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 906, in _parse
tables = p.parse_tables()
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 222, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/io/html.py", line 552, in _parse_tables
raise ValueError("No tables found")
ValueError: No tables found
However, this usually happens after a number of tables have already been found.
Here is the fetchCurrentVolume function:
import yahoo_fin.stock_info as si
def fetchCurrentVolume(ticker):
    table = si.get_quote_table(ticker)
    currentVolume = table['Volume']
    return currentVolume
The API documentation is linked above under Goal. Whenever this error is raised, the bot exits its @tasks.loop and goes offline. If you know of a way to fix the current use of yahoo_fin, OR any other way to obtain the dynamic data found at this XPath: '//div[@id="quote-summary"]/div/table/tbody/tr', then you will have pulled me out of a three-week-long debacle with this issue! Thank you.
If you are able to retrieve some data and then it cuts out, it is probably a rate limit. Try adding a sleep of a few seconds between requests.
see here for how to use sleep
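A minimal sketch of what that could look like (the five-second pause and the example tickers are assumptions, not values from the question):
import time
import yahoo_fin.stock_info as si

def fetchCurrentVolume(ticker):
    table = si.get_quote_table(ticker)
    return table['Volume']

for ticker in ['AMZN', 'AAPL']:   # hypothetical watch list
    print(ticker, fetchCurrentVolume(ticker))
    time.sleep(5)                 # pause between requests to stay under any rate limit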
Maybe the web server bonks out every so often while the tables are being rewritten, or something like that.
If you use a try/except, wait a few seconds and then try again before bailing out, that might work if it is just an occasional hiccup:
import yahoo_fin.stock_info as si
import time

def fetchCurrentVolume(ticker):
    try:
        table = si.get_quote_table(ticker)
        currentVolume = table['Volume']
    except Exception:
        # hopefully this was just a hiccup and it will be back up in 5 seconds
        time.sleep(5)
        table = si.get_quote_table(ticker)
        currentVolume = table['Volume']
    return currentVolume
Following the docs here for Auth Code Flow, I can't seem to get the example to work.
import apis
import spotipy
import spotipy.util as util
username = input("Enter username: ")
scope = "user-library-read"
token = util.prompt_for_user_token(username, scope,
                                   client_id=apis.SPOTIFY_CLIENT,
                                   client_secret=apis.SPOTIFY_SECRET,
                                   redirect_uri="http://localhost")

if token:
    sp = spotipy.Spotify(auth=token)
    results = sp.current_user_saved_tracks()
    for item in results['items']:
        track = item['track']
        print(track['name'] + ' - ' + track['artists'][0]['name'])
else:
    print("Can't get token for", username)
I get a 400 Bad Request error:
Traceback (most recent call last):
File "/home/termozour/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/193.6494.30/plugins/python/helpers/pydev/pydevd.py", line 1434, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/termozour/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/193.6494.30/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/media/termozour/linux_main/PycharmProjects/SpotiDowner/main.py", line 27, in <module>
sp = spotipy.Spotify(auth=token)
File "/media/termozour/linux_main/PycharmProjects/SpotiDowner/venv/lib/python3.7/site-packages/spotipy/util.py", line 92, in prompt_for_user_token
token = sp_oauth.get_access_token(code, as_dict=False)
File "/media/termozour/linux_main/PycharmProjects/SpotiDowner/venv/lib/python3.7/site-packages/spotipy/oauth2.py", line 382, in get_access_token
raise SpotifyOauthError(response.reason)
spotipy.oauth2.SpotifyOauthError: Bad Request
Yes, I've seen the other ideas people had to fix this issue but none worked for me:
I tried to reset the client secret, change my URI to a website or localhost, check my client ID and secret ID, and they are all fine.
I also went all the way into spotipy/oauth2.py, where the traceback originates, added a neat little print(response.text), and got a marvelous {"error":"invalid_grant","error_description":"Invalid authorization code"}
Any ideas or insight?
I have to specify that I ran this code from PyCharm Professional (2019.3.3), and the issue came from a trailing space.
When the code is run, spotipy asks you to paste the confirmation URL you are redirected to, i.e. the redirect URI you set in the app with the authorization code appended.
The issue was how PyCharm handles URLs in its terminal window when you run the project. When you paste a URL and press enter, PyCharm opens the URL in the browser (for some reason), so the workaround is to add a space after the URL and then press enter. That works for PyCharm, but it sometimes breaks the code being parsed. In this case, it did.
I then ran the code from the PyCharm terminal (python3), pasted the URL and pressed enter directly. Not only did it not open a browser window, it also accepted the URL and let me get my info from Spotify.
In any case, the code is fine; it was the IDE that was breaking it all. Maybe a good suggestion would be for the library itself to strip any trailing whitespace from the URL when it parses it.
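If you want to guard against this on your own side, a minimal sketch of the idea, handling the redirect URL yourself (the extract_auth_code helper is hypothetical and not part of spotipy), would be to strip whitespace before pulling the code out:
from urllib.parse import urlparse, parse_qs

def extract_auth_code(pasted_url):
    # strip any trailing whitespace the IDE may have added, then
    # pull the ?code=... parameter out of the redirect URL
    cleaned = pasted_url.strip()
    query = parse_qs(urlparse(cleaned).query)
    return query.get('code', [None])[0]

redirect_url = input("Paste the URL you were redirected to: ")
code = extract_auth_code(redirect_url)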
This 'bug' has been reported here and has been marked as fixed (spoiler alert: not fixed) and will be fixed in PyCharm 2020.1
I tried running the code normally from PyCharm 2020.1 and it all works fine, so I can confirm: in my case it was an IDE issue.
I am scraping a website which has 2 versions at the moment, and when you visit the site you never know which one you are going to get. For this reason I have had to set up two separate files to scrape it.
For the sake of simplicity I have a master file which controls the running of the two files:
attempts = 0
while attempts < 10:
    try:
        try:
            runfile('file1.py')
        except SomeException:
            runfile('file2.py')
        break
    except:
        attempts += 1
So basically this keeps trying a maximum of 10 times until the correct version of the site meets the correct scraper file.
The problem with this is that the files launch a webdriver every time, so I can end up with several empty browsers clogging up the machine. Is there any command which can just close all webdriver instances? I cannot use driver.quit() because in the environment of this umbrella script, driver is not a recognized variable.
I also cannot use driver.quit() at the end of file1.py or file2.py because when file1.py encounters an error, it ceases to run and so the driver.quit() command will not be executed. I can't use a try / except because then my master file won't understand that there was an error in file1.py and thus won't run file2.py.
Handle the exception in individual runners, close the driver and raise a common exception that you then handle in the caller.
In file1.py and file2.py
try:
    # routine
    ...
except Exception as e:
    driver.quit()
    raise e
You can factor this out to the caller by initializing the driver in the caller, and passing the driver instance to functions instead of modules.
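For example, a minimal sketch of that refactor (scrape_version_1 and scrape_version_2 are placeholders for whatever file1.py and file2.py currently do, and Firefox is just an assumed browser):
from selenium import webdriver

def scrape_version_1(driver):
    ...  # logic currently in file1.py

def scrape_version_2(driver):
    ...  # logic currently in file2.py

attempts = 0
while attempts < 10:
    driver = webdriver.Firefox()   # the caller owns the driver
    try:
        try:
            scrape_version_1(driver)
        except Exception:
            scrape_version_2(driver)
        break
    except Exception:
        attempts += 1
    finally:
        driver.quit()              # always runs, so no browsers are left behind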
You can have a try..finally block in runfile.
def runfile(filename):
    driver = ...
    try:
        ...
    finally:
        # close the webdriver
        driver.quit()
Is there a way to detect whether there is a browser available on the system on which the script is run? Nothing happens when running the following code on a server:
try:
    webbrowser.open("file://" + os.path.realpath(path))
except webbrowser.Error:
    print "Something went wrong when opening webbrowser"
It's weird that no exception is caught and no browser opens. I'm running the script from the command line over an SSH connection, and I'm not very proficient in server-related stuff, so there may be another way of detecting this that I am missing.
Thanks!
Check out the documentation:
webbrowser.get([name])
Return a controller object for the browser type name. If name is empty, return a controller for a default browser appropriate to the caller’s environment.
This works for me:
try:
# we are not really interested in the return value
webbrowser.get()
webbrowser.open("file://" + os.path.realpath(path))
except Exception as e:
print "Webbrowser error: " % e
Output:
Webbrowser error: could not locate runnable browser
I am developing a crawler using the multiprocessing module.
It uses multiprocessing.Queue to store the URL info that still needs to be crawled, the page contents that still need to be parsed, and so on; multiprocessing.Event to control the sub-processes; and multiprocessing.Manager.dict to store the hashes of crawled URLs, with each Manager.dict instance guarded by a multiprocessing.Lock.
All three kinds of objects are shared between the sub-processes and the parent process, and they are organized in a single class; I use an instance of that class to pass the shared params from the parent process to the sub-processes. Like this:
import multiprocessing
from multiprocessing.managers import SyncManager

MGR = SyncManager()
MGR.start()  # the manager must be started before .dict() can be used

class Global_Params():
    Queue_URL = multiprocessing.Queue()
    URL_RESULY = MGR.dict()
    URL_RESULY_Mutex = multiprocessing.Lock()
    STOP_EVENT = multiprocessing.Event()

global_params = Global_Params()
As my own timeout mechanism, I use process.terminate() to stop any process that cannot finish by itself within a reasonable time.
In my test case there are 2500+ target sites (some are out of service, some are huge).
The crawler works site by site through the target-sites file.
At the beginning the crawler works well, but after a long time (sometimes 8 hours, sometimes 2 hours, sometimes more than 15 hours), once it has crawled more than 100 sites (the exact number varies), I get the error "Errno 32 broken pipe".
I have tried the following methods to locate and solve the problem:
I located the site A that the crawler broke on, then used the crawler to crawl that site separately, and it worked well. Even when I took a fragment (say, 20 sites) of the target-sites file that contained site A, the crawler worked well.
I added "-X /tmp/pymp-* 240 /tmp" to /etc/cron.daily/tmpwatch;
when the break occurs, the /tmp/pymp-* files are still there.
I used multiprocessing.managers.SyncManager instead of multiprocessing.Manager and ignored most signals except SIGKILL and SIGTERM.
For each target site, I clear most of the shared params (queues, dicts and events); if an error occurs, I create a new instance:
while global_params.Queue_url.qsize() > 0:
    try:
        global_params.Queue_url.get(block=False)
    except Exception, e:
        print_info(str(e))
        print_info("Clear Queue_url error!")
        time.sleep(1)
        global_params.Queue_url = Queue()
        pass
The following is the traceback; print_info is a function I defined myself to print and store debug info:
[Errno 32] Broken pipe
Traceback (most recent call last):
File "Spider.py", line 613, in <module>
main(args)
File "Spider.py", line 565, in main
spider.start()
File "Spider.py", line 367, in start
print_info("STATIC_RESULT size:%d" % len(global_params.STATIC_RESULT))
File "<string>", line 2, in __len__
File "/usr/local/python2.7.3/lib/python2.7/multiprocessing/managers.py", line 769, in _callmethod
kind, result = conn.recv()
EOFError
I can't understand why; does anyone know the reason?
I don't know if this fixes your problem, but there is one point to mention:
global_params.Queue_url.get(block=False)
... throws a Queue.Empty exception if the queue is empty. It's not worth recreating the Queue just because of an empty exception.
The recreation of the queue can lead to race conditions.
From my point of view, you have two possibilities:
get rid of the "queue recreation" code block (see the sketch below this list)
switch to an other Queue implementation
use:
from Queue import Queue
instead of:
from multiprocessing import Queue
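For the first possibility, a minimal sketch of draining the existing queue without recreating it (Python 2 style to match the crawler; global_params is the instance from the question):
import Queue   # only needed for the Queue.Empty exception class

# drain whatever is left, but keep the same queue object alive
while True:
    try:
        global_params.Queue_url.get(block=False)
    except Queue.Empty:
        break   # the queue is empty now; nothing needs to be recreated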