gevent + requests blocks when using socks - python

I use Python (2.7.6), gevent (1.1.2) and requests (2.11.1) to make HTTP requests concurrently, and it works well. But when I add a SOCKS proxy to requests, it blocks.
This is my code:
import time
import requests
import logging
import click
import gevent
from gevent import monkey
monkey.patch_all()
FORMAT = '%(asctime)-15s %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger('test')
#socks proxy
user = MY_SOCKS_PROXY_USERNAME
password = MY_SOCKS_PROXY_PASSWORD
host = MY_SOCKS_PROXY_HOST
port = MY_SOCKS_PROXY_PORT
proxies = {
    'http': 'socks5://{0}:{1}@{2}:{3}'.format(user, password, host, port),
    'https': 'socks5://{0}:{1}@{2}:{3}'.format(user, password, host, port),
}
url = 'https://www.youtube.com/user/NBA'
def fetch_url(i, with_proxy):
    while True:
        logger.warning('thread %s fetch url' % i)
        try:
            if with_proxy:
                res = requests.get(url, proxies=proxies, timeout=5)
            else:
                res = requests.get(url, timeout=5)
        except Exception as e:
            logger.error(str(e))
            continue
        logger.warning(res.status_code)

def do_other_thing():
    while True:
        logger.warning('do other thing...')
        time.sleep(1)

@click.command()
@click.option('--with_proxy/--without_proxy', help='if use proxy', default=True)
def run(with_proxy):
    if with_proxy:
        logger.warning('with proxy......')
    else:
        logger.warning('without proxy......')
    ts = []
    ts.append(gevent.spawn(do_other_thing))
    for i in xrange(3):
        ts.append(gevent.spawn(fetch_url, i, with_proxy))
    gevent.joinall(ts)

if __name__ == '__main__':
    run()
The screenshots (one run with the proxy, one without) show the result.
With the proxy, do_other_thing blocks until fetch_url is done.
Without the proxy, it works fine (timeout errors occur because of the GFW).
Can anyone help me solve this problem? Thanks very much!

I also asked this question on GitHub. A collaborator was very kind and helped me solve it. The fix is simply:
moving the gevent import and monkeypatch to the very top of the file before doing anything else
I also have a project with multiple files, and moving the import and monkey patch to the very top of the first file solved my problem there as well.
The issue thread is on GitHub.
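For anyone hitting the same thing, here is a minimal sketch (my own summary of the fix, not code from the issue) of what the corrected layout looks like, with the monkey patch applied before requests, or anything else that touches socket/ssl, is imported:
# must run before anything that imports socket/ssl is pulled in
from gevent import monkey
monkey.patch_all()

import gevent
import requests  # now sees the patched, cooperative socket module

def fetch(url):
    return requests.get(url, timeout=5).status_code

jobs = [gevent.spawn(fetch, 'https://www.youtube.com/user/NBA') for _ in range(3)]
gevent.joinall(jobs)
print [job.value for job in jobs]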

Related

Downloading images with gevent

My task is to download 1M+ images from a given list of urls. What is the recommended way to do so?
After having read Greenlet Vs. Threads I looked into gevent, but I can't get it to run reliably. I played around with a test set of 100 urls: sometimes it finishes in 1.5s, but sometimes it takes over 30s, which is strange since the timeout* per request is 0.1s, so it should never take more than 10s.
*see below in code
I also looked into grequests but they seem to have issues with exception handling.
My 'requirements' are that I can
inspect the errors raised while downloading (timeouts, corrupt images...),
monitor the progress of the number of processed images and
be as fast as possible.
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
import cStringIO
import gevent.hub

POOL_SIZE = 300

def download_image_wrapper(task):
    return download_image(task[0], task[1])

def download_image(image_url, download_path):
    raw_binary_request = requests.get(image_url, timeout=0.1).content
    image = Image.open(cStringIO.StringIO(raw_binary_request))
    image.save(download_path)

def download_images_gevent_spawn(list_of_image_urls, base_folder):
    download_paths = ['/'.join([base_folder, url.split('/')[-1]])
                      for url in list_of_image_urls]
    parameters = [[image_url, download_path] for image_url, download_path in
                  zip(list_of_image_urls, download_paths)]
    tasks = [gevent.spawn(download_image_wrapper, parameter_tuple) for parameter_tuple in parameters]
    for task in tasks:
        try:
            task.get()
        except Exception:
            print 'x',
            continue
        print '.',

test_urls = # list of 100 urls
t1 = time()
download_images_gevent_spawn(test_urls, 'download_temp')
print time() - t1
I think it would be better to stick with urllib2, following the example at https://github.com/gevent/gevent/blob/master/examples/concurrent_download.py#L1
Try this code; I suppose it is what you're asking for.
import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import sys

urls = sorted(chloya_files)

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen

def download_file(url):
    data = urlopen(url).read()
    img_name = url.split('/')[-1]
    with open('c:/temp/img/' + img_name, 'wb') as f:
        f.write(data)
    return True

from time import time

t1 = time()
tasks = [gevent.spawn(download_file, url) for url in urls]
gevent.joinall(tasks, timeout=12.0)
print "Successful: %s from %s" % (sum(1 if task.value else 0 for task in tasks), len(tasks))
print time() - t1
There's a simple solution using gevent and Requests: the simple-requests library.
Use a Requests Session for HTTP persistent connections. Since gevent makes Requests asynchronous, I think there's no need for a timeout on the HTTP requests.
By default, requests.Session caches TCP connections (pool_connections) for 10 hosts and limits 10 concurrent HTTP requests per cached connection pool (pool_maxsize). The defaults should be tweaked to suit your needs by explicitly creating an HTTP adapter.
session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount('http://', http_adapter)
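If some of the image URLs are served over HTTPS, the same adapter presumably needs to be mounted for the https:// prefix as well (my assumption; the snippet above only covers http://):
session.mount('https://', http_adapter)  # assumption: also pool HTTPS connections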
Break the work into producer and consumer tasks: image downloading is the producer task and image processing is the consumer task.
If the image-processing library PIL is not asynchronous, it may block the producer coroutines. In that case, the consumer pool can be a gevent.threadpool.ThreadPool, e.g.:
from gevent.threadpool import ThreadPool
consumer = ThreadPool(POOL_SIZE)
This is an overview of how it can be done. I didn't test the code.
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
from io import BytesIO
import os
from urlparse import urlparse
from gevent.pool import Pool

def download(url):
    try:
        response = session.get(url)
    except Exception as e:
        print(e)
    else:
        if response.status_code == requests.codes.ok:
            file_name = urlparse(url).path.rsplit('/', 1)[-1]
            return (response.content, file_name)
        response.raise_for_status()

def process(img):
    if img is None:
        return None
    img, name = img
    img = Image.open(BytesIO(img))
    path = os.path.join(base_folder, name)
    try:
        img.save(path)
    except Exception as e:
        print(e)
    else:
        return True

def run(urls):
    consumer.map(process, producer.imap_unordered(download, urls))

if __name__ == '__main__':
    POOL_SIZE = 300
    producer = Pool(POOL_SIZE)
    consumer = Pool(POOL_SIZE)

    session = requests.Session()
    http_adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
    session.mount('http://', http_adapter)

    test_urls = # list of 100 urls
    base_folder = 'download_temp'
    t1 = time()
    run(test_urls)
    print time() - t1
I would suggest taking a look at Grablib http://grablib.org/
It is an asynchronous parser based on pycurl and multicurl.
It also tries to handle network errors automatically (e.g. retrying on timeout, etc.).
I believe the Grab:Spider module will solve 99% of your problems.
http://docs.grablib.org/en/latest/index.html#spider-toc

Make Post Requests with Files Simultaneously

I wrote a simple server and it runs well.
Now I want to write some code that makes many POST requests to my server simultaneously, to simulate a stress test. I am using Python.
Suppose the url of my server is http://myserver.com.
file1.jpg and file2.jpg are the files needed to be uploaded to the server.
Here is my testing code. It uses threading and urllib2.
async_posts.py
from Queue import Queue
from threading import Thread
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2, sys, time

num_thread = 4
queue = Queue(2 * num_thread)

def make_post(url):
    register_openers()
    data = {"file1": open("path/to/file1.jpg", "rb"), "file2": open("path/to/file2.jpg", "rb")}
    datagen, headers = multipart_encode(data)
    request = urllib2.Request(url, datagen, headers)
    start = time.time()
    res = urllib2.urlopen(request)
    end = time.time()
    return res.code, end - start  # return the status code and duration of this request

def daemon():
    while True:
        url = queue.get()
        status, duration = make_post(url)
        print status, duration
        queue.task_done()

for _ in range(num_thread):
    thd = Thread(target=daemon)
    thd.daemon = True
    thd.start()

try:
    urls = ["http://myserver.com"] * num_thread
    for url in urls:
        queue.put(url)
    queue.join()
except KeyboardInterrupt:
    sys.exit(1)
When num_thread is small (e.g. 4), my code runs smoothly. But when I switch num_thread to a slightly larger number, say 10, the threads break down and keep throwing httplib.BadStatusLine errors.
I don't know why my code goes wrong; or maybe there is a better way to do this?
As a reference, my server is written in Python using Flask and Gunicorn.
Thanks in advance.

How to create a HTTP proxy handler with Python 3 HTTP lib

I'm trying to define a proxy handler to use http.client from behind a corporate proxy. I only know how to define a proxy handler for urllib:
http_proxy_full_auth_string = "http://" + "%s:%s@%s:%s" % (http_proxy_user,
                                                           http_proxy_passwd,
                                                           http_proxy_server,
                                                           http_proxy_port)
proxy_handler = urllib.request.ProxyHandler({"http": http_proxy_full_auth_string})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
resp = urllib.request.urlopen(uri).read()
And how do I do this with http.client...?
P.S.: sorry for my limited English skills...
This might be an old thread, but folks may stumble upon it like I did and not know how to authenticate.
import http.client
import base64
auth_hash = base64.b64encode(b"username:password").decode("utf-8")
conn = http.client.HTTPSConnection("proxy-ip or hostname", port="proxy-port")
conn.set_tunnel(
    "example.com",
    headers={"Proxy-Authorization": f"Basic {auth_hash}"})
conn.request("GET", "/")
This is how you do it with basic authentication.
See the http.client documentation for Python 3.
import http.client
conn = http.client.HTTPSConnection("proxy_domain", 8080)
conn.set_tunnel("www.python.org")
conn.request("HEAD","/index.html")

How to make this Twisted Python Proxy faster?

The code below is an HTTP proxy for content filtering. It uses a GET request to send the URL of the current site to a server, which processes it and responds. It runs VERY slowly. Any ideas on how to make it faster?
Here is the code:
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
from Tkinter import *
#import win32api
import urllib2
import urllib
import os
import sys
import webbrowser

cwd = os.path.abspath(sys.argv[0])[0]
proxies = {}
user = "zachb"

class BlockingProxyRequest(ProxyRequest):
    def process(self):
        params = {}
        params['Location'] = self.uri
        params['User'] = user
        params = urllib.urlencode(params)
        req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
        resp = req.read()
        req.close()
        if resp == "allow":
            pass
        else:
            self.transport.write('''BLOCKED BY ADMIN!''')
            self.transport.loseConnection()
        ProxyRequest.process(self)

class BlockingProxy(Proxy):
    requestFactory = BlockingProxyRequest

factory = http.HTTPFactory()
factory.protocol = BlockingProxy
reactor.listenTCP(8000, factory)
reactor.run()
Anyone have any ideas on how to make this run faster? Or even a better way to write it?
The main cause of slowness in this proxy is probably these three lines:
req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
resp = req.read()
req.close()
A normal Twisted-based application is single threaded. You have to go out of your way to get threads involved. That means that whenever a request comes in, you are blocking the one and only processing thread on this HTTP request. No further requests are processed until this HTTP request completes.
Try using one of the APIs in twisted.web.client (e.g. Agent or getPage). These APIs don't block, so your server will handle concurrent requests concurrently. That should translate into much smaller response times.
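As an illustration only, here is a rough, untested sketch of how the filtering lookup might be rewritten on top of getPage so it no longer blocks the reactor; the callback names are made up, while the URL and user come from the question:
import urllib
from twisted.internet import reactor
from twisted.web import http
from twisted.web.client import getPage
from twisted.web.proxy import Proxy, ProxyRequest

user = "zachb"

class FilteringProxyRequest(ProxyRequest):
    def process(self):
        params = urllib.urlencode({'Location': self.uri, 'User': user})
        # getPage returns a Deferred instead of blocking the reactor
        d = getPage("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params)
        d.addCallbacks(self._gotVerdict, self._lookupFailed)

    def _gotVerdict(self, resp):
        if resp == "allow":
            ProxyRequest.process(self)
        else:
            self.transport.write('''BLOCKED BY ADMIN!''')
            self.transport.loseConnection()

    def _lookupFailed(self, failure):
        # hypothetical policy: block when the filter service cannot be reached
        self.transport.write('''FILTER LOOKUP FAILED!''')
        self.transport.loseConnection()

class FilteringProxy(Proxy):
    requestFactory = FilteringProxyRequest

factory = http.HTTPFactory()
factory.protocol = FilteringProxy
reactor.listenTCP(8000, factory)
reactor.run()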

How can I open a website with urllib via proxy in Python?

I have this program that checks a website, and I want to know how I can check it via a proxy in Python...
Here is the code, just as an example:
while True:
    try:
        h = urllib.urlopen(website)
        break
    except:
        print '['+time.strftime('%Y/%m/%d %H:%M:%S')+'] '+'ERROR. Trying again in a few seconds...'
        time.sleep(5)
By default, urlopen uses the environment variable http_proxy to determine which HTTP proxy to use:
$ export http_proxy='http://myproxy.example.com:1234'
$ python myscript.py # Using http://myproxy.example.com:1234 as a proxy
If you instead want to specify a proxy inside your application, you can give a proxies argument to urlopen:
proxies = {'http': 'http://myproxy.example.com:1234'}
print("Using HTTP proxy %s" % proxies['http'])
urllib.urlopen("http://www.google.com", proxies=proxies)
Edit: If I understand your comments correctly, you want to try several proxies and print each proxy as you try it. How about something like this?
candidate_proxies = ['http://proxy1.example.com:1234',
                     'http://proxy2.example.com:1234',
                     'http://proxy3.example.com:1234']
for proxy in candidate_proxies:
    print("Trying HTTP proxy %s" % proxy)
    try:
        result = urllib.urlopen("http://www.google.com", proxies={'http': proxy})
        print("Got URL using proxy %s" % proxy)
        break
    except:
        print("Trying next proxy in 5 seconds")
        time.sleep(5)
Python 3 is slightly different here. It will try to auto-detect proxy settings, but if you need specific or manual proxy settings, consider this kind of code:
#!/usr/bin/env python3
import urllib.request

proxy_support = urllib.request.ProxyHandler({'http' : 'http://user:pass@server:port',
                                             'https': 'https://...'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

with urllib.request.urlopen(url) as response:
    html = response.read()  # ... implement further processing here
Refer also to the relevant section in the Python 3 docs
Here is some example code showing how to use urllib to connect via a proxy:
authinfo = urllib.request.HTTPBasicAuthHandler()
proxy_support = urllib.request.ProxyHandler({"http": "http://ahad-haam:3128"})

# build a new opener that adds authentication and caching FTP handlers
opener = urllib.request.build_opener(proxy_support, authinfo,
                                     urllib.request.CacheFTPHandler)

# install it
urllib.request.install_opener(opener)

f = urllib.request.urlopen('http://www.google.com/')
"""
For http and https use:
proxies = {'http': 'http://proxy-source-ip:proxy-port',
           'https': 'https://proxy-source-ip:proxy-port'}
More proxies can be added similarly:
proxies = {'http': 'http://proxy1-source-ip:proxy-port',
           'http': 'http://proxy2-source-ip:proxy-port',
           ...
          }
Usage:
filehandle = urllib.urlopen(external_url, proxies=proxies)
Don't use any proxies (in the case of links within the network):
filehandle = urllib.urlopen(external_url, proxies={})
Use proxy authentication via username and password:
proxies = {'http': 'http://username:password@proxy-source-ip:proxy-port',
           'https': 'https://username:password@proxy-source-ip:proxy-port'}
Note: avoid using special characters such as : and @ in usernames and passwords.
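If the credentials do contain such characters, one possible workaround (my assumption, not part of the answer above; the credentials shown are hypothetical) is to percent-encode them with urllib.quote before building the proxy URL:
import urllib

# hypothetical credentials containing reserved characters
username = urllib.quote('user@example.com', safe='')
password = urllib.quote('p@ss:word', safe='')

proxies = {'http': 'http://%s:%s@proxy-source-ip:proxy-port' % (username, password)}
filehandle = urllib.urlopen(external_url, proxies=proxies)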
