Make Post Requests with Files Simultaneously - python

I have written a simple server and it runs well.
Now I want to write some code that makes many POST requests to my server simultaneously, to simulate a stress test. I am using Python.
Suppose the url of my server is http://myserver.com.
file1.jpg and file2.jpg are the files needed to be uploaded to the server.
Here is my testing code. I use threading and urllib2.
async_posts.py

from Queue import Queue
from threading import Thread
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2, sys, time

num_thread = 4
queue = Queue(2 * num_thread)

def make_post(url):
    register_openers()
    data = {"file1": open("path/to/file1.jpg", "rb"),
            "file2": open("path/to/file2.jpg", "rb")}
    datagen, headers = multipart_encode(data)
    request = urllib2.Request(url, datagen, headers)
    start = time.time()
    res = urllib2.urlopen(request)
    end = time.time()
    return res.code, end - start  # Return the status code and duration of this request.
def daemon():
    while True:
        url = queue.get()
        status, duration = make_post(url)
        print status, duration
        queue.task_done()

for _ in range(num_thread):
    thd = Thread(target=daemon)
    thd.daemon = True
    thd.start()

try:
    urls = ["http://myserver.com"] * num_thread
    for url in urls:
        queue.put(url)
    queue.join()
except KeyboardInterrupt:
    sys.exit(1)
When num_thread is small (e.g. 4), my code runs smoothly. But when I raise num_thread to a slightly larger number, say 10, the threads break down and keep throwing httplib.BadStatusLine errors.
I don't know where my code goes wrong, or maybe there is a better way to do this?
As a reference, my server is written in Python using Flask and Gunicorn.
Thanks in advance.
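For comparison, here is a sketch of the same pressure test written with the requests library and concurrent.futures instead of the poster/urllib2 opener machinery; the URL and file paths are the question's placeholders, and post_files/pressure_test are made-up helper names:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def post_files(url, paths):
    """POST the named files to `url`; return (status code, seconds taken)."""
    files = {name: open(p, "rb") for name, p in paths.items()}
    try:
        start = time.time()
        resp = requests.post(url, files=files)
        return resp.status_code, time.time() - start
    finally:
        for f in files.values():
            f.close()

def pressure_test(url, paths, num_requests=10, num_threads=10):
    """Fire `num_requests` concurrent uploads, collecting (status, duration)."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(post_files, url, paths) for _ in range(num_requests)]
        return [f.result() for f in as_completed(futures)]

# Usage with the question's placeholders:
# pressure_test("http://myserver.com",
#               {"file1": "path/to/file1.jpg", "file2": "path/to/file2.jpg"})
```

Each worker opens its own file handles per request, so the threads never share a read position the way a single shared open() would.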

Related

How to continuously pull data from a URL in Python?

I have a link, e.g. www.someurl.com/api/getdata?password=..., and when I open it in a web browser it sends a constantly updating document of text. I'd like to make an identical connection in Python, and dump this data to a file live as it's received. I've tried using requests.Session(), but since the stream of data never ends (and dropping it would lose data), the get request also never ends.
import requests
s = requests.Session()
x = s.get("www.someurl.com/api/getdata?password=...") #never terminates
What's the proper way to do this?
I found the answer I was looking for here: Python Requests Stream Data from API
Full implementation:

import requests

url = "www.someurl.com/api/getdata?password=..."
s = requests.Session()

with open('file.txt', 'a') as fp:
    with s.get(url, stream=True) as resp:
        for line in resp.iter_lines(chunk_size=1):
            fp.write(line.decode() + '\n')  # decode the bytes; str(line) would write the b'...' repr
Note that chunk_size=1 is necessary for the data to be written as soon as a new complete message arrives, rather than waiting for an internal buffer to fill before iterating over the lines. I believe chunk_size=None is meant to do this, but it doesn't work for me.
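A variant of that loop, as a sketch: decoding each line explicitly and flushing after each write keeps the file in step with the stream even through OS-level buffering (stream_to_file is an illustrative name; the URL is still the question's placeholder):

```python
import requests

def stream_to_file(url, path):
    """Append each line of a streaming response to `path` as it arrives."""
    with requests.Session() as s, open(path, "a", encoding="utf-8") as fp:
        with s.get(url, stream=True) as resp:
            for raw in resp.iter_lines(chunk_size=1):
                if raw:  # skip keep-alive blank lines
                    # decode() avoids writing the b'...' bytes repr
                    fp.write(raw.decode("utf-8", "replace") + "\n")
                    fp.flush()  # make each message visible on disk immediately
```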
You can keep making GET requests to the url:

import requests
import time

url = "www.someurl.com/api/getdata?password=..."
sess = requests.session()
while True:
    req = sess.get(url)
    time.sleep(10)
This will terminate the request after 1 second:

import multiprocessing
import time
import requests

def get_from_url(result_queue):
    s = requests.Session()
    resp = s.get("www.someurl.com/api/getdata?password=...")
    result_queue.put(resp.text)  # a plain global would not survive the process boundary

if __name__ == '__main__':
    while True:
        result_queue = multiprocessing.Queue()
        p = multiprocessing.Process(target=get_from_url, name="get_from_url",
                                    args=(result_queue,))
        p.start()
        # Wait 1 second for the get request
        time.sleep(1)
        p.terminate()
        p.join()
        # do something with the data
        if not result_queue.empty():
            print(result_queue.get())  # or smth else
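For plain (non-streaming) polling, requests' own timeout parameter may replace the process-kill approach entirely; note that it bounds connect and read waits, not the total duration of an endless streaming body. A sketch, with poll_once as an illustrative name:

```python
import requests

def poll_once(url, seconds=1.0):
    """One GET with at most `seconds` of connect/read blocking."""
    try:
        return requests.get(url, timeout=seconds).text
    except requests.exceptions.Timeout:
        return None  # took too long; the caller can retry on the next tick
```

The caller can loop over poll_once() with time.sleep() between attempts, without spawning and terminating a process per request.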

How to send post requests using multi threading in python?

I'm trying to use multithreading to send POST requests with tokens from a txt file.
I only managed to send GET requests; if I modify the GET to POST, it results in an error.
I want to send POST requests with the tokens in them and verify for each token whether it is valid or not (JSON response).
Here is the code:
import threading
import time
from queue import Queue
import requests

file_lines = open("tokens.txt", "r").readlines()  # Gets the tokens from the txt file.
for line in file_lines:
    param = {
        "Token": line.replace('\n', '')
    }

def make_request(url):
    """Makes a web request, prints the thread name, URL, and
    response text.
    """
    resp = requests.get(url)
    with print_lock:
        print("Thread name: {}".format(threading.current_thread().name))
        print("Url: {}".format(url))
        print("Response code: {}\n".format(resp.text))

def manage_queue():
    """Manages the url_queue and calls the make request function"""
    while True:
        # Stores the URL and removes it from the queue so no
        # other threads will use it.
        current_url = url_queue.get()
        # Calls the make_request function
        make_request(current_url)
        # Tells the queue that the processing on the task is complete.
        url_queue.task_done()

if __name__ == '__main__':
    # Set the number of threads.
    number_of_threads = 5
    # Needed to safely print in multi-threaded programs.
    print_lock = threading.Lock()
    # Initializes the queue that all threads will pull from.
    url_queue = Queue()
    # The list of URLs that will go into the queue.
    urls = ["https://www.google.com"] * 30
    # Start the threads.
    for i in range(number_of_threads):
        # Send the threads to the function that manages the queue.
        t = threading.Thread(target=manage_queue)
        # Makes the thread a daemon so it exits when the program finishes.
        t.daemon = True
        t.start()
    start = time.time()
    # Puts the URLs in the queue
    for current_url in urls:
        url_queue.put(current_url)
    # Wait until all threads have finished before continuing the program.
    url_queue.join()
    print("Execution time = {0:.5f}".format(time.time() - start))
I want to send a post request for each token in the txt file.
Errors I get when replacing get with post:

Traceback (most recent call last):
  File "C:\Users\Creative\Desktop\multithreading.py", line 40, in <module>
    url_queue = Queue()
NameError: name 'Queue' is not defined

Traceback (most recent call last):
  File "C:\Users\Creative\Desktop\multithreading.py", line 22, in manage_queue
    current_url = url_queue.post()
AttributeError: 'Queue' object has no attribute 'post'

I also tried a solution using tornado and async, but none of them worked.
I finally managed to do post requests using multi threading.
If anyone sees an error or can improve my code, feel free to do so :)
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import time

url_list = [
    "https://www.google.com/api/"
]
tokens = {'Token': '326729'}

def download_file(url):
    html = requests.post(url, stream=True, data=tokens)
    return html.content

start = time()
processes = []
with ThreadPoolExecutor(max_workers=200) as executor:
    for url in url_list:
        processes.append(executor.submit(download_file, url))

for task in as_completed(processes):
    print(task.result())

print(f'Time taken: {time() - start}')
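The snippet above posts a single hard-coded token; to check one token per line of tokens.txt, as the question asks, the same executor pattern can be extended. This is a sketch: the file name and the "Token" field follow the question, while load_tokens/check_token/check_all are made-up helper names:

```python
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_tokens(path):
    """One token per line; blank lines are skipped."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def check_token(url, token):
    """POST a single token; return (token, HTTP status code)."""
    resp = requests.post(url, data={"Token": token})
    return token, resp.status_code

def check_all(url, path, workers=20):
    """Check every token concurrently; return {token: status_code}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(check_token, url, t) for t in load_tokens(path)]
        return dict(f.result() for f in as_completed(futures))
```

To decide true/false per token, check_token could instead return resp.json() and inspect the field the server reports.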

My threads seem to be hanging at the last couple of tasks (image downloads with requests)

I'm using threads to download images from the ImageNet database.
Here is the link:
http://www.image-net.org/
First, I searched the ImageNet database for "puppies" and was able to get a text file with 1000+ urls (urls to images).
(If you don't want to go that manner,
I've uploaded the urls onto this pastebin (first 400 or so):
https://pastebin.com/yTcHq0iw )
I then read the first 200 lines of the text file (thus 200 urls) and used threads to download those 200 images.
If I download only 100 images (read only the first 100 urls), the threads execute perfectly.
However, if I try something like 150+ (150, 175, 200, etc.), the threads will download the first 147 or so (172ish if I'm using 175), then hang for about 30 seconds before finishing up the last few images.
I'm using Requests to download the images, so I'm not sure if Requests is having trouble making some connections and that is the cause. I'm not too familiar with Requests' lower-level API, so I'm not sure how to fix it if it is a Requests problem. I found some code on the internet and attempted to override some of Requests' built-in options, but these tweaks haven't solved the "hanging" problem.
Here is my code:
import os
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import time
import queue
from threading import Thread

SAVE_DIR = r'C:\Users\Moondra\Desktop\TEMP\Puppy_threading'  # is a constant

def decorator_function(func):
    def wrapper(*args, **kwargs):
        session = requests.Session()
        retry = Retry(connect=0, backoff_factor=0.2)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        return func(*args, session=session, **kwargs)
    return wrapper

# Using threading:
image_count = 0

#@decorator_function   (optional decorator_function)
def download_image(session=None):
    global image_count
    if not session:
        session = requests.Session()
    while not q.empty():
        try:
            r = session.get(q.get(block=False))
        except (requests.exceptions.RequestException, UnicodeError) as e:
            print(e)
            image_count += 1
            q.task_done()
            continue
        image_count += 1
        q.task_done()
        print('image', image_count)
        with open(os.path.join(
                SAVE_DIR, 'image_{}.jpg'.format(image_count)),
                'wb') as f:
            f.write(r.content)

q = queue.Queue()
with open(r'C:\Users\Moondra\Desktop\puppies.txt', 'rt') as f:
    for i in range(200):
        line = f.readline()
        q.put(line.strip())
print(q.qsize())

threads = []
start = time.time()
for i in range(50):
    t = Thread(target=download_image)
    t.setDaemon(True)
    threads.append(t)
    t.start()

q.join()
for t in threads:
    t.join()
    print(t.name, 'has joined')

end = time.time()
print('time taken: {:.4f}'.format(end - start))
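No answer is recorded here, but two details in the code above are worth flagging: session.get() is called without a timeout, so one stalled server can pin a thread (and the last few images) until the OS gives up on the connection, and q.get(block=False) can raise queue.Empty in the race between the q.empty() check and the get. A sketch of a worker with both guards (download_worker is an illustrative name; the save step is elided as in the original):

```python
import queue
import requests

def download_worker(q, session=None, timeout=5):
    """Drain URLs from q. The per-request timeout keeps one dead host from
    pinning a thread, and catching queue.Empty ends the loop cleanly."""
    session = session or requests.Session()
    while True:
        try:
            url = q.get(block=False)
        except queue.Empty:
            return                      # queue drained: exit instead of racing q.empty()
        try:
            r = session.get(url, timeout=timeout)
            # ... save r.content to disk as in the original code ...
        except requests.exceptions.RequestException as e:
            print(e)
        finally:
            q.task_done()               # exactly one task_done per dequeued URL
```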

Downloading images with gevent

My task is to download 1M+ images from a given list of urls. What is the recommended way to do so?
After reading Greenlet Vs. Threads I looked into gevent, but I can't get it to run reliably. I played around with a test set of 100 urls and sometimes it finishes in 1.5s, but sometimes it takes over 30s, which is strange, as the timeout* per request is 0.1, so it should never take more than 10s.
*see below in code
I also looked into grequests but they seem to have issues with exception handling.
My 'requirements' are that I can
inspect the errors raised while downloading (timeouts, corrupt images...),
monitor the progress of the number of processed images and
be as fast as possible.
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
import cStringIO
import gevent.hub

POOL_SIZE = 300

def download_image_wrapper(task):
    return download_image(task[0], task[1])

def download_image(image_url, download_path):
    raw_binary_request = requests.get(image_url, timeout=0.1).content
    image = Image.open(cStringIO.StringIO(raw_binary_request))
    image.save(download_path)

def download_images_gevent_spawn(list_of_image_urls, base_folder):
    download_paths = ['/'.join([base_folder, url.split('/')[-1]])
                      for url in list_of_image_urls]
    parameters = [[image_url, download_path] for image_url, download_path in
                  zip(list_of_image_urls, download_paths)]
    tasks = [gevent.spawn(download_image_wrapper, parameter_tuple) for parameter_tuple in parameters]
    for task in tasks:
        try:
            task.get()
        except Exception:
            print 'x',
            continue
        print '.',

test_urls = []  # list of 100 urls
t1 = time()
download_images_gevent_spawn(test_urls, 'download_temp')
print time() - t1
I think it is better to stick with urllib2, following the example of https://github.com/gevent/gevent/blob/master/examples/concurrent_download.py#L1
Try this code; I suppose it is what you're asking for.
import gevent
from gevent import monkey
# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()
import sys

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen

urls = sorted(chloya_files)  # chloya_files: the answerer's own list of urls

def download_file(url):
    data = urlopen(url).read()
    img_name = url.split('/')[-1]
    with open('c:/temp/img/' + img_name, 'wb') as f:
        f.write(data)
    return True

from time import time

t1 = time()
tasks = [gevent.spawn(download_file, url) for url in urls]
gevent.joinall(tasks, timeout=12.0)
print("Successful: %s from %s" % (sum(1 if task.value else 0 for task in tasks), len(tasks)))
print(time() - t1)
There's a simple solution using gevent and Requests: simple-requests.
Use a Requests Session for HTTP persistent connections. Since gevent makes Requests asynchronous, I think there's no need for a timeout on the HTTP requests.
By default, requests.Session caches TCP connections (pool_connections) for 10 hosts and limits concurrency to 10 HTTP requests per cached TCP connection (pool_maxsize). The default configuration should be tweaked to suit the need by explicitly creating an http adapter.
session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount('http://', http_adapter)
Split the tasks into producer and consumer: image downloading is the producer task and image processing is the consumer task.
If the image processing library PIL is not asynchronous, it may block producer coroutines. If so, the consumer pool can be a gevent.threadpool.ThreadPool, e.g.:
from gevent.threadpool import ThreadPool
consumer = ThreadPool(POOL_SIZE)
This is an overview of how it can be done. I didn't test the code.
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
from io import BytesIO
import os
from urlparse import urlparse
from gevent.pool import Pool

def download(url):
    try:
        response = session.get(url)
    except Exception as e:
        print(e)
    else:
        if response.status_code == requests.codes.ok:
            file_name = urlparse(url).path.rsplit('/', 1)[-1]
            return (response.content, file_name)
        response.raise_for_status()

def process(img):
    if img is None:
        return None
    img, name = img
    img = Image.open(BytesIO(img))
    path = os.path.join(base_folder, name)
    try:
        img.save(path)
    except Exception as e:
        print(e)
    else:
        return True

def run(urls):
    consumer.map(process, producer.imap_unordered(download, urls))

if __name__ == '__main__':
    POOL_SIZE = 300
    producer = Pool(POOL_SIZE)
    consumer = Pool(POOL_SIZE)
    session = requests.Session()
    http_adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
    session.mount('http://', http_adapter)
    test_urls = []  # list of 100 urls
    base_folder = 'download_temp'
    t1 = time()
    run(test_urls)
    print(time() - t1)
I suggest paying attention to Grablib http://grablib.org/
It is an asynchronous parser based on pycurl and multicurl.
It also tries to recover from network errors automatically (e.g. retry on timeout, etc.).
I believe the Grab:Spider module will solve 99% of your problems.
http://docs.grablib.org/en/latest/index.html#spider-toc

Parallel/Concurrent http request sending from urllib2 in python

I was trying to send out HTTP POST requests in parallel to a web service. To be specific, I'd like to load-test the web service's response time under concurrent requests. So I plan to use threads and urllib2 to achieve the job: each thread carries out one HTTP request via urllib2.
Here is how I did it:
import urllib2 as u2
import time
import sys
from threading import Thread

def run_job():
    try:
        req = u2.Request(
            url = **the web service url**,
            data = **data send to the web service**,
            headers = **http header web service requires**,
        )
        opener = u2.build_opener()
        u2.install_opener(opener)
        start_time = time.time()
        response = u2.urlopen(req, timeout = 60)
        end_time = time.time()
        html = response.read()
        code = response.code
        response.close()
        if code == 200:
            print end_time - start_time
        else:
            print -1
    except Exception, e:
        print -2

if __name__ == "__main__":
    N = 1
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    threads = []
    for i in range(N):  # note: range(1, N) would start only N - 1 threads
        t = Thread(target=run_job, args=())
        threads.append(t)
    [x.start() for x in threads]
    [x.join() for x in threads]
In the meantime, I use Fiddler2 to capture the requests being sent out. Fiddler is a tool to compose HTTP requests and send them out, and it can also capture the HTTP requests going through the host.
When I look at Fiddler2, those requests are sent out one by one instead of all together at once, which is what I expected to happen. If what Fiddler shows is right (requests are sent one by one), then to my knowledge there must be some queue the requests are waiting in. Could someone shed some light on what happens behind this? And if possible, how do I make truly parallel requests?
Also, I have put two timestamps before and after the request takes place; if requests wait in a queue after urllib2.urlopen is executed, then the delta of the two timestamps includes the time spent in that queue. Is it possible to be more precise, i.e. to measure the time between the request being sent out and the response being received?
Many Thanks,
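No answer is recorded here, but one common refinement is to line the worker threads up on a barrier so that thread start-up cost does not stagger the requests. Below is a sketch in Python 3 (urllib.request stands in for urllib2; run_load is a made-up name). Note the client-side delta still measures the full round trip; separating send time from receive time needs a packet-level view such as Fiddler's timers or tcpdump:

```python
import time
import threading
import urllib.request  # Python 3 successor of urllib2

def run_load(url, n):
    """Fire n GET requests as close to simultaneously as threads allow."""
    barrier = threading.Barrier(n)   # releases all workers at the same moment
    results = []
    lock = threading.Lock()

    def worker():
        barrier.wait()               # line every thread up before firing
        start = time.time()
        try:
            resp = urllib.request.urlopen(url, timeout=60)
            resp.read()
            code, elapsed = resp.getcode(), time.time() - start
        except Exception:
            code, elapsed = -1, -1.0  # mirror the question's error sentinel
        with lock:
            results.append((code, elapsed))

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Even with the barrier, the GIL and the server's accept queue can still serialize parts of the exchange; the barrier only removes the client-side stagger.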
