ThreadPoolExecutor + Requests == deadlock? - python

I have a small piece of code that makes a lot of requests to Google's search service:
from concurrent.futures import ThreadPoolExecutor
import requests
import requests.packages.urllib3
import time

requests.packages.urllib3.disable_warnings()

def check(page):
    r = requests.get('https://www.google.ru/#q=test&start={}'.format(page * 10))
    return len(r.text)

def main():
    for q in xrange(30):
        st_t = time.time()
        with ThreadPoolExecutor(20) as pool:
            ret = [x for x in pool.map(check, xrange(1, 1000))]
        print time.time() - st_t

if __name__ == "__main__":
    main()
It works at first, but then something goes wrong. All 20 threads are still alive, yet they do nothing. I can see in htop that they are alive, but I don't understand why nothing happens.
Any ideas what could be wrong?

This is a known issue, and the requests team did not get enough information to debug it; see this. Possibly it is a CPython issue; see this.
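One mitigation worth trying (my suggestion, not part of the linked reports) is to give every request an explicit timeout, so a stuck connection cannot hang a worker thread indefinitely; check() would look roughly like this:

def check(page):
    try:
        # timeout=(connect, read) in seconds; the values here are arbitrary examples
        r = requests.get(
            'https://www.google.ru/#q=test&start={}'.format(page * 10),
            timeout=(5, 10))
    except requests.exceptions.RequestException:
        return 0  # treat a failed or timed-out request as an empty result
    return len(r.text)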

Related

How do I 'spam' a URL via Python?

I want to make a script that automatically sends a lot of requests to a URL via Python.
Example link: https://page-views.glitch.me/badge?page_id=page.id
I've tried Selenium, but that's very slow.
pip install requests

import requests

for i in range(100):  # or whatever amount of requests you wish to send
    requests.get("https://page-views.glitch.me/badge?page_id=page.id")
Or, if you really want to hammer the address, you could use multiprocessing:
import multiprocessing as mp
import requests

def my_func(x):
    for i in range(x):
        print(requests.get("https://page-views.glitch.me/badge?page_id=page.id"))

def main():
    pool = mp.Pool(mp.cpu_count())
    pool.map(my_func, range(0, 100))

if __name__ == "__main__":
    main()
You can send multiple get() requests in a loop as follows:

# driver is an already-created Selenium WebDriver instance
for i in range(100):
    driver.get("https://page-views.glitch.me/badge?page_id=page.id")

Multiprocessing 'Pool' option deployed with Python 3 on Windows 10 not working, giving an error

I'm trying to use the multiprocessing library with Python 3. The module imports correctly and does not show any error; however, when I use it, I get an error.
Here is my code:
from multiprocessing import Pool
import time

start_time = time.process_time()
p = Pool(10)

def print_range():
    for i in range(10000):
        print('Something')

end_time = time.process_time()
print(end_time - start_time)

p.map(print_range())
However, I get this error:
ImportError: cannot import name 'Pool' from 'multiprocessing' (C: ...path file)
Has anyone encountered this error and found a solution for it? Thanks.
Might be related to safely importing main. See this section in the documentation. Specifically, you'd need to change your code like this:
from multiprocessing import Pool
import time

def print_range():
    for i in range(10000):
        print('Something')

if __name__ == '__main__':
    start_time = time.process_time()
    p = Pool(10)
    end_time = time.process_time()
    print(end_time - start_time)
    p.map(print_range())  # incorrect usage
In addition, your usage of map() is not correct. See the documentation for examples, or use p.apply(print_range) instead.
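For example, a minimal sketch of the corrected usage (changing print_range to take an argument is my assumption, just so there is an iterable to map over):

from multiprocessing import Pool

def print_range(n):
    for i in range(n):
        print('Something')

if __name__ == '__main__':
    with Pool(10) as p:
        p.apply(print_range, (10000,))           # run the function once in a worker process
        p.map(print_range, [1000, 2000, 3000])   # or map it over an iterable of arguments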

socketio.emit() doesn't work when interacting using Popen on Windows in a Thread

I think a quick code snippet explains my problem better than words, so please have a look at this:
from flask import Flask
from flask.ext.socketio import SocketIO
from threading import Thread
import subprocess
import threading
from eventlet.green.subprocess import Popen

app = Flask(__name__)
socketio = SocketIO(app)

def get_tasks_and_emit():
    instance = Popen(["tasklist"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1)
    lines_iterator = iter(instance.stdout.readline, b"")
    data = ""
    for line in lines_iterator:
        data += line.decode("utf8")
    socketio.emit("loaded", data)
    print("::: DEBUG - returned tasks with thread")

@app.route("/")
def index():
    html = "<!DOCTYPE html>"
    html += "<script src=https://code.jquery.com/jquery-2.2.0.min.js></script>"
    html += "<script src=https://cdn.socket.io/socket.io-1.4.5.js></script>"
    html += "<script>"
    html += "var socket = io.connect(window.location.origin);"
    html += "socket.on('loaded', function(data) {alert(data);});"
    html += "function load_tasks_threaded() {$.get('/tasks_threaded');}"
    html += "function load_tasks_nonthreaded() {$.get('/tasks');}"
    html += "</script>"
    html += "<button onclick='load_tasks_nonthreaded()'>Load Tasks</button>"
    html += "<button onclick='load_tasks_threaded()'>Load Tasks (Threaded)</button>"
    return html

@app.route("/tasks")
def tasks():
    get_tasks_and_emit()
    print("::: DEBUG - returned tasks without thread")
    return ""

@app.route("/tasks_threaded")
def tasks_threaded():
    threading.Thread(target=get_tasks_and_emit).start()
    return ""

if __name__ == "__main__":
    socketio.run(app, port=7000, debug=True)
I am running this code on Windows with eventlet. If I don't use eventlet, everything works fine (but of course much more slowly, because of Werkzeug's threaded mode). (I just checked, and it's not working on Linux either.)
I hope someone can point me in the right direction. (My Python version is 3.5.1, by the way.)
I found the problem. Apparently you have to monkey-patch the threading module, so I added:
import eventlet
eventlet.monkey_patch(thread=True)
and then I also had a problem with long-running programs. I ran into the same issue as described in this Stack Overflow post:
Using Popen in a thread blocks every incoming Flask-SocketIO request
So I added
eventlet.sleep()
to the for loop that processes the pipes.
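With that change, the loop in get_tasks_and_emit() ends up looking roughly like this (my reconstruction; the rest of the function is unchanged):

    for line in lines_iterator:
        data += line.decode("utf8")
        eventlet.sleep()  # yield to the eventlet hub so other greenlets can run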
EDIT:
As temoto pointed out, alternatively one can also just use the threading module from eventlet.green like this:
from eventlet.green import threading
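In other words, the plain threading import in the application above is swapped for the green one (a sketch, assuming nothing else in the app needs real OS threads):

# instead of:  import threading
from eventlet.green import threading  # cooperative threads that work with eventlet

# threading.Thread(target=get_tasks_and_emit).start() is then called as before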

python apscheduler, an easier way to run jobs?

I have jobs scheduled through apscheduler. I have 3 jobs so far, but I will soon have many more, and I'm looking for a way to scale my code.
Currently, each job is its own .py file, and in each file I have turned the script into a function named run(). Here is my code.
from apscheduler.scheduler import Scheduler
import logging
import job1
import job2
import job3

logging.basicConfig()
sched = Scheduler()

@sched.cron_schedule(day_of_week='mon-sun', hour=7)
def runjobs():
    job1.run()
    job2.run()
    job3.run()

sched.start()
This works; the code is crude right now, but it gets the job done. However, when I have 50 jobs, it will be absurdly long. How do I scale it?
Note: the actual names of the jobs are arbitrary and don't follow a pattern. The file is named scheduler.py, and I run it with execfile('scheduler.py') in a Python shell.
import urllib
import threading
import datetime

pages = ['http://google.com', 'http://yahoo.com', 'http://msn.com']

#------------------------------------------------------------------------------
# Getting the pages WITHOUT threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()

def runjobs():
    for page in pages:
        job(page)

start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microseconds WITHOUT threads" \
    .format((end - start).microseconds)

#------------------------------------------------------------------------------
# Getting the pages WITH threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()

def runjobs():
    threads = []
    for page in pages:
        t = threading.Thread(target=job, args=(page,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microseconds WITH threads" \
    .format((end - start).microseconds)
Look at
http://furius.ca/pubcode/pub/conf/bin/python-recursive-import-test
This will help you import all of your .py files.
While importing, you can build a list holding a reference to each job's run function, for example:
[job1.run, job2.run]
Then iterate through the list and call each function.
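A rough sketch of that idea (the jobs package name and the convention that every job module exposes a run() function are my assumptions for illustration):

import importlib
import pkgutil

import jobs  # hypothetical package containing job1.py, job2.py, ...

def load_job_runners():
    # Import every module found in the jobs package and collect its run() function.
    runners = []
    for _, name, _ in pkgutil.iter_modules(jobs.__path__):
        module = importlib.import_module('jobs.' + name)
        runners.append(module.run)  # store the function itself; do not call it yet
    return runners

def runjobs():
    for run in load_job_runners():
        run()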
Thanks Arjun

Python Multiprocessing Timeout problems

I am using multiprocessing in Python and trying to kill the run after a timeout. But it doesn't work, and I don't know why.
I followed an example, and it seemed easy: just start the process and, after 2 seconds, terminate it. But it doesn't work for me.
Could you please help me figure it out? Thanks for your help!
from amazonproduct import API
import multiprocessing
import time

AWS_KEY = '...'
SECRET_KEY = '...'
ASSOC_TAG = '...'

def crawl():
    api = API(AWS_KEY, SECRET_KEY, 'us', ASSOC_TAG)
    for root in api.item_search('Beauty', Keywords='maybelline',
                                ResponseGroup='Large'):
        # extract paging information
        nspace = root.nsmap.get(None, '')
        products = root.xpath('//aws:Item',
                              namespaces={'aws': nspace})
        for product in products:
            print product.ASIN,

if __name__ == '__main__':
    p = multiprocessing.Process(target = crawl())
    p.start()
    if time.sleep(2.0):
        p.terminate()
Well, this won't work:
if time.sleep(2.0):
    p.terminate()
time.sleep does not return anything, so the above statement is always equivalent to if None:. None is False in a boolean context, so there you go.
If you want it to always terminate, take out that if statement. Just do a bare time.sleep.
Also, bug:
p = multiprocessing.Process(target = crawl())
This isn't doing what you think it's doing. You need to specify target=crawl, NOT target=crawl(). The latter calls the function immediately in your main process; the former passes the function itself to Process, which will then execute it in parallel.
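Putting both fixes together, a minimal corrected version of the __main__ block would look roughly like this (keeping the 2-second delay from the question):

if __name__ == '__main__':
    p = multiprocessing.Process(target=crawl)  # pass the function itself, do not call it
    p.start()
    time.sleep(2.0)  # bare sleep, no if statement
    p.terminate()    # kill the crawl after the timeout
    p.join()         # reap the terminated process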
