I have a tiny piece of code that makes a lot of requests to the Google search service:
from concurrent.futures import ThreadPoolExecutor
import requests
import requests.packages.urllib3
import time

requests.packages.urllib3.disable_warnings()

def check(page):
    r = requests.get('https://www.google.ru/#q=test&start={}'.format(page * 10))
    return len(r.text)

def main():
    for q in xrange(30):
        st_t = time.time()
        with ThreadPoolExecutor(20) as pool:
            ret = [x for x in pool.map(check, xrange(1, 1000))]
        print time.time() - st_t

if __name__ == "__main__":
    main()
It works at first, but then something goes wrong: all 20 threads are still alive, yet they do nothing. I can see in htop that they are alive, but I don't understand why nothing happens.
Any ideas what could be wrong?
This is a known issue, and the requests team did not get enough information to debug it; see this. Possibly it is a CPython issue; see this.
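For what it's worth, one mitigation worth trying (my own suggestion, not something from the linked reports, and the timeout value is arbitrary) is to give each request a timeout so a stuck socket raises an exception instead of blocking its worker thread forever:
def check(page):
    try:
        # a timeout makes a hung connection raise instead of blocking the thread
        r = requests.get('https://www.google.ru/#q=test&start={}'.format(page * 10),
                         timeout=10)
    except requests.exceptions.RequestException:
        return 0
    return len(r.text)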
Related
I want to make a script that automatically sends a lot of requests to a URL via Python.
Example link: https://page-views.glitch.me/badge?page_id=page.id
I've tried Selenium, but that's very slow.
pip install requests
import requests
for i in range(100):  # Or whatever amount of requests you wish to send
    requests.get("https://page-views.glitch.me/badge?page_id=page.id")
Or, if you really want to hammer the address, you could use multiprocessing:
import multiprocessing as mp
import requests
def my_func(x):
    for i in range(x):
        print(requests.get("https://page-views.glitch.me/badge?page_id=page.id"))

def main():
    pool = mp.Pool(mp.cpu_count())
    pool.map(my_func, range(0, 100))

if __name__ == "__main__":
    main()
You can send multiple get() requests in a loop as follows:
for i in range(100):
    driver.get("https://page-views.glitch.me/badge?page_id=page.id")
I'm trying to use the multiprocessing library in Python 3. The module imports without any error, but when I actually use it, I get an error.
Here is my code:
from multiprocessing import Pool
import time

start_time = time.process_time()
p = Pool(10)

def print_range():
    for i in range(10000):
        print('Something')

end_time = time.process_time()
print(end_time - start_time)
p.map(print_range())
However I get this error:
ImportError: cannot import name 'Pool' from 'multiprocessing' (C: ...path file)
Has anyone encountered this error and found a solution for it? Thanks.
This might be related to safely importing the main module; see this section in the documentation. Specifically, you'd need to change your code like this:
from multiprocessing import Pool
import time

def print_range():
    for i in range(10000):
        print('Something')

if __name__ == '__main__':
    start_time = time.process_time()
    p = Pool(10)
    end_time = time.process_time()
    print(end_time - start_time)
    p.map(print_range())  # incorrect usage
In addition, your usage of map() is not correct. See the documentation for examples, or use p.apply(print_range) instead.
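For example, here is a minimal sketch of how map() is meant to be used, with the work rewritten as a function that takes one argument (the function name is just illustrative):
from multiprocessing import Pool

def print_value(i):
    print('Something', i)

if __name__ == '__main__':
    with Pool(10) as p:
        # map() takes the function itself plus an iterable of arguments;
        # each item is handed to print_value in a worker process
        p.map(print_value, range(10000))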
I think a quick code snippet explains my problem better, so please have a look at this:
from flask import Flask
from flask.ext.socketio import SocketIO
from threading import Thread
import subprocess
import threading
from eventlet.green.subprocess import Popen

app = Flask(__name__)
socketio = SocketIO(app)

def get_tasks_and_emit():
    instance = Popen(["tasklist"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1)
    lines_iterator = iter(instance.stdout.readline, b"")
    data = ""
    for line in lines_iterator:
        data += line.decode("utf8")
    socketio.emit("loaded", data)
    print("::: DEBUG - returned tasks with thread")

@app.route("/")
def index():
    html = "<!DOCTYPE html>"
    html += "<script src=https://code.jquery.com/jquery-2.2.0.min.js></script>"
    html += "<script src=https://cdn.socket.io/socket.io-1.4.5.js></script>"
    html += "<script>"
    html += "var socket = io.connect(window.location.origin);"
    html += "socket.on('loaded', function(data) {alert(data);});"
    html += "function load_tasks_threaded() {$.get('/tasks_threaded');}"
    html += "function load_tasks_nonthreaded() {$.get('/tasks');}"
    html += "</script>"
    html += "<button onclick='load_tasks_nonthreaded()'>Load Tasks</button>"
    html += "<button onclick='load_tasks_threaded()'>Load Tasks (Threaded)</button>"
    return html

@app.route("/tasks")
def tasks():
    get_tasks_and_emit()
    print("::: DEBUG - returned tasks without thread")
    return ""

@app.route("/tasks_threaded")
def tasks_threaded():
    threading.Thread(target=get_tasks_and_emit).start()
    return ""

if __name__ == "__main__":
    socketio.run(app, port=7000, debug=True)
I am running this code on Windows with eventlet; if I don't use eventlet, everything works fine (but is of course much slower because of the Werkzeug threading mode). I just checked, and it doesn't work on Linux either.
I hope someone can point me in the right direction. (My Python version is 3.5.1, by the way.)
I found the problem. Apparently you have to monkey-patch the threading module, so I added
import eventlet
eventlet.monkey_patch(thread=True)
and then I also had a problem with long-running programs. I had the same problem as the person in this Stack Overflow post:
Using Popen in a thread blocks every incoming Flask-SocketIO request
So I added
eventlet.sleep()
to the for loop that processes the pipes.
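In context, the yield goes inside the loop that reads the subprocess output; a sketch of just those lines from get_tasks_and_emit:
    for line in lines_iterator:
        data += line.decode("utf8")
        eventlet.sleep()  # cooperatively yield so other green threads and incoming requests get a turn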
EDIT:
As temoto pointed out, alternatively one can also just use the threading module from eventlet.green like this:
from eventlet.green import threading
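In other words, the only change to the code above would be the import; threading.Thread then creates a green thread, and the explicit monkey_patch call is no longer needed for this (a sketch based on temoto's suggestion, not something I have re-tested):
from eventlet.green import threading  # instead of: import threading

# the rest of the code stays the same
threading.Thread(target=get_tasks_and_emit).start()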
I have jobs scheduled through APScheduler. I have 3 jobs so far, but will soon have many more. I'm looking for a way to scale my code.
Currently, each job is its own .py file, and in each file I have turned the script into a function named run(). Here is my code:
from apscheduler.scheduler import Scheduler
import logging
import job1
import job2
import job3

logging.basicConfig()
sched = Scheduler()

@sched.cron_schedule(day_of_week='mon-sun', hour=7)
def runjobs():
    job1.run()
    job2.run()
    job3.run()

sched.start()
This works; right now the code is naive, but it gets the job done. When I have 50 jobs, though, it will get absurdly long. How do I scale it?
Note: the actual names of the jobs are arbitrary and don't follow a pattern. The file is named scheduler.py, and I run it using execfile('scheduler.py') in the Python shell.
import urllib
import threading
import datetime

pages = ['http://google.com', 'http://yahoo.com', 'http://msn.com']

#------------------------------------------------------------------------------
# Getting the pages WITHOUT threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()

def runjobs():
    for page in pages:
        job(page)

start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microseconds WITHOUT threads" \
    .format((end - start).microseconds)

#------------------------------------------------------------------------------
# Getting the pages WITH threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()

def runjobs():
    threads = []
    for page in pages:
        t = threading.Thread(target=job, args=(page,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microseconds WITH threads" \
    .format((end - start).microseconds)
Look at
http://furius.ca/pubcode/pub/conf/bin/python-recursive-import-test
This will help you import all Python / .py files.
While importing, you can build a list that holds a reference to each job's function, for example:
[job1.run, job2.run]
Then iterate through the list and call each function. :)
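A minimal sketch of that idea, assuming the job files are collected in a package directory named jobs (the package name and layout are my assumption):
import logging
import pkgutil
import importlib

from apscheduler.scheduler import Scheduler

import jobs  # hypothetical package containing job1.py, job2.py, ...

logging.basicConfig()
sched = Scheduler()

# Import every module in the jobs package and keep a reference to its run()
job_funcs = []
for _, name, _ in pkgutil.iter_modules(jobs.__path__):
    module = importlib.import_module('jobs.' + name)
    job_funcs.append(module.run)  # store the function itself, don't call it yet

@sched.cron_schedule(day_of_week='mon-sun', hour=7)
def runjobs():
    for run in job_funcs:
        run()

sched.start()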
Thanks Arjun
I am using multiprocessing in Python and trying to kill the run after a timeout, but it doesn't work, and I don't know why.
I followed an example, and it seems easy: just start the process and, after 2 seconds, terminate it. But it doesn't work for me.
Could you please help me figure it out? Thanks for your help!
from amazonproduct import API
import multiprocessing
import time

AWS_KEY = '...'
SECRET_KEY = '...'
ASSOC_TAG = '...'

def crawl():
    api = API(AWS_KEY, SECRET_KEY, 'us', ASSOC_TAG)
    for root in api.item_search('Beauty', Keywords='maybelline',
                                ResponseGroup='Large'):
        # extract paging information
        nspace = root.nsmap.get(None, '')
        products = root.xpath('//aws:Item',
                              namespaces={'aws': nspace})
        for product in products:
            print product.ASIN,

if __name__ == '__main__':
    p = multiprocessing.Process(target = crawl())
    p.start()
    if time.sleep(2.0):
        p.terminate()
Well, this won't work:
if time.sleep(2.0):
    p.terminate()
time.sleep does not return anything, so the statement above is always equivalent to if None:. None is falsy, so p.terminate() is never reached.
If you want it to always terminate, take out that if statement. Just do a bare time.sleep.
Also, bug:
p = multiprocessing.Process(target = crawl())
This isn't doing what you think it's doing. You need to pass target=crawl, NOT target=crawl(). The latter calls the function immediately in the parent process; the former passes the function itself to Process, which will then execute it in parallel.
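Putting both fixes together, here is a minimal sketch of the intended start-and-terminate pattern (the crawl body stays exactly as in the question):
if __name__ == '__main__':
    p = multiprocessing.Process(target=crawl)  # pass the function, don't call it
    p.start()
    time.sleep(2.0)   # let it run for 2 seconds
    p.terminate()     # then kill the child process
    p.join()          # clean up the terminated process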