I'm working with threads and I need to download a web page in a thread. I also have a thread that sends the request to the site but doesn't wait for a response.
The one that doesn't wait is this:
import urllib
from threading import Thread

class peticion(Thread):
    def __init__(self, url):
        Thread.__init__(self)
        self.url = url

    def run(self):
        f = urllib.urlopen(self.url)
        f.close()
This one works correctly. However, the one that has to wait for the response takes a seemingly random time to complete, anywhere from 5 seconds to 2 minutes, or it may never finish. This is the class:
class playerConn(Thread):
    def __init__(self, ev):
        Thread.__init__(self)
        self.ev = ev

    def run(self):
        try:
            params = urllib.urlencode('''params go here''')
            f = urllib.urlopen('''site goes here''')
            resp = f.read()
            f.close()
        finally:
            # do something with the response
            pass
With or without the try...finally statement it doesn't work; the code after the urlopen call never gets to execute.
What can I do?
It appears to just be a problem with the URL; the code is fine and doesn't do anything wrong.
I bet you have some kind of problem on the website's side, maybe a 404 or similar.
Try opening something on localhost, just to test.
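For example, here is a minimal sketch of that localhost test, assuming Python 2 (to match the question's urllib) and port 8000 as a placeholder:

# In one terminal, serve the current directory:
#     python -m SimpleHTTPServer 8000
# Then run the same call the thread makes, but against the local server:
import urllib

f = urllib.urlopen('http://localhost:8000/')
print f.getcode()   # should print 200 almost immediately
f.close()

If this returns instantly while the real site hangs, the delay is on the remote server, not in your threading code.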
I'm currently developing in Python 3 (still a beginner) with the Tornado framework, and I have a function which I would like to run in the background. To be more precise, the task of the function is to download a big file (chunk by chunk) and probably do some more things after each chunk is downloaded. But the calling function should not wait for the download function to complete; it should continue execution instead.
Here are some code examples:
@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")
    self.downloadfunc(file_url, target_path)  # I don't want to wait here
    print("Do something else")

@gen.coroutine
def downloadfunc(self, file_url, target_path):
    response = urllib.request.urlopen(file_url)
    CHUNK = 16 * 1024
    with open(target_path, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)
            time.sleep(0.1)  # do something after a chunk is downloaded - sleep only as example
I've read this answer on Stack Overflow https://stackoverflow.com/a/25083098/2492068 and tried to use it.
I thought that if I used @gen.coroutine but no yield, dosomethingfunc would continue without waiting for downloadfunc to finish. But the behaviour is the same (with yield or without): "Do something else" is only printed after downloadfunc has finished the download.
What am I missing here?
To benefit from Tornado's asynchronicity, a non-blocking function must be yielded at some point. Since the code of downloadfunc is entirely blocking, dosomethingfunc does not get control back until the called function is finished.
There are a couple of issues with your code:
time.sleep is blocking; use tornado.gen.sleep instead,
urllib's urlopen is blocking; use tornado.httpclient.AsyncHTTPClient
So the downloadfunc could look like:
@gen.coroutine
def downloadfunc(self, file_url, target_path):
    client = tornado.httpclient.AsyncHTTPClient()
    # the fetch below starts the download and
    # gives back control to the ioloop while waiting for data
    res = yield client.fetch(file_url)
    with open(target_path, 'wb') as f:
        f.write(res.body)  # res is an HTTPResponse; the downloaded bytes are in res.body
    yield tornado.gen.sleep(0.1)
To implement it with streaming (by chunk) support, you might want to do it like this:
# for large files you must increase max_body_size,
# because the default body limit in Tornado is set to 100MB
tornado.httpclient.AsyncHTTPClient.configure(None, max_body_size=2*1024**3)

@gen.coroutine
def downloadfunc(self, file_url, target_path):
    client = tornado.httpclient.AsyncHTTPClient()
    # the streaming_callback will be called with each received portion of data
    yield client.fetch(file_url, streaming_callback=write_chunk)

def write_chunk(chunk):
    # note the "a" mode, to append to the file
    with open(target_path, 'ab') as f:
        print('chunk %s' % len(chunk))
        f.write(chunk)
Now you can call it in dosomethingfunc without yield and the rest of the function will proceed.
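For illustration, a minimal sketch of what dosomethingfunc could then look like; file_url and target_path are just the placeholders from the question:

@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")
    # calling the coroutine without yield schedules the download on the
    # IOLoop and returns immediately (fire and forget)
    self.downloadfunc(file_url, target_path)
    print("Do something else")  # printed right away, before the download finishes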
Edit:
Modifying the chunk size is not supported (exposed), on either the server or the client side. You may also want to look at https://groups.google.com/forum/#!topic/python-tornado/K8zerl1JB5o
The class BrokenLinkTest in the code below does the following.
takes a web page url
finds all the links in the web page
gets the headers of the links concurrently (this is done to check whether a link is broken or not)
prints 'completed' when all the headers have been received.
import threading

from bs4 import BeautifulSoup
import requests

class BrokenLinkTest(object):
    def __init__(self, url):
        self.url = url
        self.thread_count = 0
        self.lock = threading.Lock()

    def execute(self):
        soup = BeautifulSoup(requests.get(self.url).text)
        self.lock.acquire()
        for link in soup.find_all('a'):
            url = link.get('href')
            threading.Thread(target=self._check_url(url))
        self.lock.acquire()

    def _on_complete(self):
        self.thread_count -= 1
        if self.thread_count == 0:  # check if all the threads are completed
            self.lock.release()
            print "completed"

    def _check_url(self, url):
        self.thread_count += 1
        print url
        result = requests.head(url)
        print result
        self._on_complete()

BrokenLinkTest("http://www.example.com").execute()
Can the concurrency/synchronization part be done in a better way? I did it using threading.Lock. This is my first experiment with Python threading.
def execute(self):
    soup = BeautifulSoup(requests.get(self.url).text)
    threads = []
    for link in soup.find_all('a'):
        url = link.get('href')
        t = threading.Thread(target=self._check_url, args=(url,))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
You could use the join method to wait for all the threads to finish.
Note I also added a start call, and passed the bound method object to the target param. In your original example you were calling _check_url in the main thread and passing the return value to the target param.
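To make that difference concrete, a short sketch (not the poster's full code):

# Original: _check_url(url) runs immediately in the main thread, and its
# return value (None) is what gets passed as the thread's target.
threading.Thread(target=self._check_url(url))

# Fixed: pass the bound method itself plus its arguments, then start the thread.
t = threading.Thread(target=self._check_url, args=(url,))
t.start()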
All threads in Python run on the same core (because of the GIL), so you won't gain any performance by doing it this way. Also, it's very unclear what is actually happening:
You never actually start the threads, you just initialize them.
The threads themselves do absolutely nothing other than decrementing the thread count.
You may only gain performance in a thread-based scenario if your program is delivering work to the IO (sending requests, writing to files and so on), where other threads can do work in the meantime.
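For illustration, here is a minimal sketch (not the poster's code) of that IO-bound case where threads do help: the HEAD requests spend most of their time waiting on the network, so a thread pool can overlap them. Python 3 and concurrent.futures are assumed here, with a placeholder URL list:

from concurrent.futures import ThreadPoolExecutor

import requests

def check(url):
    # return the status code, or the error text if the request fails
    try:
        return url, requests.head(url, timeout=10).status_code
    except requests.RequestException as e:
        return url, str(e)

urls = ["http://www.example.com", "http://www.example.org"]  # placeholder list
with ThreadPoolExecutor(max_workers=10) as pool:
    # the requests run concurrently; map yields results in input order
    for url, status in pool.map(check, urls):
        print(url, status)
print("completed")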
I am trying to make a simple function to download a file in Python.
The code is something like
import urllib

def download(url, dest):
    urllib.urlretrieve(url, dest)
My issue is: if I want to cancel the download process in the middle of downloading, how do I approach that?
This function runs in the background of the app and is triggered by a button. Now I am trying to stop it with another button.
The platform is XBMC.
A simple class to do the same as your download function:
import urllib
import threading

class Downloader:
    def __init__(self):
        self.stop_down = False
        self.thread = None

    def download(self, url, destination):
        self.thread = threading.Thread(target=self.__down, args=(url, destination))
        self.thread.start()

    def __down(self, url, dest):
        _continue = True
        handler = urllib.urlopen(url)
        self.fp = open(dest, "wb")  # binary mode, so binary files don't get corrupted
        while not self.stop_down and _continue:
            data = handler.read(4096)
            self.fp.write(data)
            _continue = data  # an empty read means the download has finished
        handler.close()
        self.fp.close()

    def cancel(self):
        self.stop_down = True
So, when someone clicks the "Cancel" button you have to call the cancel() method.
Please note that this will not remove the partially downloaded file if you cancel it, but that should not be hard to achieve using os.unlink(), for example.
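For instance, a minimal sketch of a cancel() that also removes the partial file; it assumes __down() stores the destination path on the instance (e.g. self.dest = dest), which the class above does not do yet:

import os

def cancel(self):
    self.stop_down = True
    if self.thread is not None:
        self.thread.join()         # wait for __down() to close the file handle
    if os.path.exists(self.dest):  # hypothetical attribute saved in __down()
        os.unlink(self.dest)       # discard the partially downloaded file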
The following example script shows how to use it, starting the download of a ~20Mb file and cancelling it after 5 seconds:
import time

if __name__ == "__main__":
    url = "http://ftp.postgresql.org/pub/source/v9.2.3/postgresql-9.2.3.tar.gz"
    down = Downloader()
    down.download(url, "file")
    print "Download started..."
    time.sleep(5)
    down.cancel()
    print "Download canceled"
If you are canceling by pressing CTRL+C, then you can catch the built-in KeyboardInterrupt exception and proceed with whatever you think the best move is.
In this case, if I cancel in the middle of a download, I simply want that partial file to be deleted:
import os
import urllib

def download(url, dest):
    try:
        urllib.urlretrieve(url, dest)
    except KeyboardInterrupt:
        # remove the partially downloaded file
        if os.path.exists(dest):
            os.remove(dest)
    except Exception, e:
        raise
I'm working on an application that downloads the code of a web page and captures the links.
It works, but if I connect the program to a GUI, it locks the corresponding button until the download is completed.
If I trigger the download via a separate thread, to avoid the button lock, it just freezes and does not complete execution.
Is this normal? Or am I missing something?
Below is the snippet of code. If I call grab() from a separate thread, nothing happens, not even errors.
The function update_observers() only notifies the observers, nothing else.
The observer is responsible for making any changes, in this case redrawing the GUI.
def grab(self, url):
    try:
        self._status = 'Downloading page.'
        self.update_observers()
        inpu = urllib2.urlopen(url)
    except URLError, e:
        self._status = 'Error: ' + e.reason
        self.update_observers()
        return None
    resp = []
    self._status = 'Parsing links'
    self.update_observers()
    for line in inpu.readlines():
        for reg in self._regexes:
            links = reg.findall(line)
            for link in links:
                resp.append(link)
    self._status = 'Ready.'
    self.update_observers()
    return resp
This code is called here:
def grab(self, widget):
    t = Thread(target=self.work)
    t.setDaemon(True)
    t.start()

def work(self):
    print "Working"
    self.links = None
    self.links = self.grabber.grab(self.txtLink.get_text())
    for link in self.links:
        self.store.append([link])
    print "Ok."
If I move the code from work() to grab(), removing the threading stuff, everything works fine.
I just called gtk.gdk.threads_init() before gtk.main() and everything worked perfectly without any changes.
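For reference, a minimal sketch of where that call goes; PyGTK 2.x is assumed, and App is just a placeholder for whatever builds the window and hooks up grab():

import gtk

if __name__ == "__main__":
    # must be called before gtk.main() so worker threads
    # can safely coexist with the GTK main loop
    gtk.gdk.threads_init()
    app = App()  # hypothetical: builds the GUI and connects the buttons
    gtk.main()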
I can't figure out the problem in this code.
class Threader(threading.Thread):
    def __init__(self, queue, url, host):
        threading.Thread.__init__(self)
        self.queue = queue
        self.url = url
        self.host = host

    def run(self):
        print self.url  # http://www.stackoverflow.com
        with contextlib.closing(urllib2.urlopen(self.url)) as u:
            source = u.read()
        print "hey"  # this is not printing!
        source = self.con()
        doc = Document(source)
        self.queue.put((doc, self.host))
When I run this code, print self.url successfully outputs the URL, but print "hey" is not working. So basically (I believe) there is something in contextlib which is blocking the code. I also tried the conventional urlopen method without using contextlib, but it doesn't work either. Furthermore, I tried try-except, but the program doesn't raise any error. So what may be the problem here?
Your code doesn't run as posted, so I have taken the liberty of adapting it a bit (imports; it also doesn't know about Document and self.con) and making it compatible with Python 2 (that's what I use here at the moment) - it works:
from __future__ import with_statement
import threading, Queue, urllib2, contextlib

class Threader(threading.Thread):
    def __init__(self, queue, url, host):
        threading.Thread.__init__(self)
        self.queue = queue
        self.url = url
        self.host = host

    def run(self):
        print self.url
        with contextlib.closing(urllib2.urlopen(self.url)) as u:
            source = u.read()
        print "hey"

if __name__ == '__main__':
    t = Threader(Queue.Queue(), 'http://www.stackoverflow.com', '???')
    t.start()
    t.join()
EDIT: it also works with "with" and contextlib.
Since the problem persists when using only urllib, the most probable cause is that the URL you are trying to open does not respond.
You should try to:
open the URL in a browser or a simple web client (like wget on Linux),
set the timeout parameter of urllib2.urlopen (see the sketch below).
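A minimal sketch of that timeout, assuming Python 2 and a 10-second limit; url is a placeholder for the address that hangs:

import socket
import urllib2

try:
    u = urllib2.urlopen(url, timeout=10)  # give up instead of hanging forever
    source = u.read()
    u.close()
except (urllib2.URLError, socket.timeout), e:
    print "request failed:", e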