I am new to Python and I am currently working on a multiple webpage scraper. While I was playing around with Python I found out about threading, and this really speeds up the code. The problem is that the script scrapes lots of sites, and I'd like to do this in 'batches' while using threading.
When I've got an array of 1000 items, I'd like to grab 10 items. When the script is done with these 10 items, grab 10 new items until there is nothing left.
I hope someone can help me. Thanks in advance!
import subprocess
import threading
from multiprocessing import Pool

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    pool = Pool()
    sites = ["http://site1.com", "http://site2.com", "http://site3.com", "http://site4.com"]
    results = pool.imap(scrape, sites)
    for result in results:
        print(result)
In the future I'll use an SQLite database to store all the URLs (this will replace the array). When I run the script I want to be able to stop the process and continue it whenever I want. This is not my question, but it is the context of my problem.
Question: ... array of 1000 items I'd like to grab 10 items
for p in range(0, 1000, 10):
    process10(sites[p:p+10])
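process10 is left undefined above; here is a minimal sketch of one way to fill it in, reusing the scrape function from the question and a pool of ten workers (the function name and pool size are just illustrative):

from multiprocessing import Pool

def process10(batch):
    # Scrape one batch of (up to) 10 URLs in parallel, then return so the
    # outer loop can move on to the next batch.
    with Pool(10) as pool:
        for result in pool.map(scrape, batch):
            print(result)

Creating a fresh pool per batch keeps the batching explicit but is a bit wasteful; a single long-lived pool, as in the other answers, avoids that overhead.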
How about using Process and Queue from multiprocessing? Writing a worker function and calling it from a loop makes it run in batches. With Process, jobs can be started and stopped when needed, which gives you more control over them.
import subprocess
from multiprocessing import Process, Queue

def worker(url_queue, result_queue):
    for url in iter(url_queue.get, 'DONE'):
        scrape_result = scrape(url)
        result_queue.put(scrape_result)

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    sites = ["http://site1.com", "http://site2.com", "http://site3.com", "http://site4.com", "http://site5.com",
             "http://site6.com", "http://site7.com", "http://site8.com", "http://site9.com", "http://site10.com",
             "http://site11.com", "http://site12.com", "http://site13.com", "http://site14.com", "http://site15.com",
             "http://site16.com", "http://site17.com", "http://site18.com", "http://site19.com", "http://site20.com"]

    url_queue = Queue()
    result_queue = Queue()
    processes = []

    for url in sites:
        url_queue.put(url)

    for i in range(10):
        p = Process(target=worker, args=(url_queue, result_queue))
        p.start()
        processes.append(p)
        url_queue.put('DONE')  # one sentinel per worker so every worker can exit

    for p in processes:
        p.join()

    result_queue.put('DONE')

    for response in iter(result_queue.get, 'DONE'):
        print(response)
Note that multiprocessing.Queue is a FIFO queue that supports putting and getting elements.
Related
So this is the first time I am playing around with threading, so please bear with me here. In my main application (which I will implement this into), I need to add multithreading to my script. The script will read account info from a text file, then log in and do some tasks with each account. I need to make sure that threads aren't reading the same line from the accounts text file, since that would screw everything up, which I'm not quite sure how to do.
from multiprocessing import Queue, Process
from threading import Thread
from time import sleep

urls_queue = Queue()
max_process = 10

def dostuff():
    with open('acc.txt', 'r') as accounts:
        for account in accounts:
            account.strip()
            split = account.split(":")
            a = {
                'user': split[0],
                'pass': split[1],
                'name': split[2].replace('\n', ''),
            }
            sleep(1)
            print(a)

    for i in range(max_process):
        urls_queue.put("DONE")

def doshit_processor():
    while True:
        url = urls_queue.get()
        if url == "DONE":
            break

def main():
    file_reader_thread = Thread(target=dostuff)
    file_reader_thread.start()

    procs = []
    for i in range(max_process):
        p = Process(target=doshit_processor)
        procs.append(p)
        p.start()

    for p in procs:
        p.join()
    print('all done')

    # wait for all tasks in the queue
    file_reader_thread.join()

if __name__ == '__main__':
    main()
So at the moment I don't think the threading is even working, because it's printing one account per second, even with 10 threads. It should be printing 10 accounts per second, which it isn't, and that has me confused. Also, I am not sure how to make sure that threads won't pick the same account line. Help from a big brain is much appreciated.
The problem is that you create a single thread to generate the data for your processes, but then you never post that data to the queue. You sleep in that single thread, so you see one item generated per second and then... nothing, because the item is never queued. It seems that all you are doing is creating a process pool, so the built-in multiprocessing.Pool should work for you.
I've set the pool's "chunk size" low so that workers are only given one work item at a time. This is good for workflows where the processing time can vary per work item. By default, the pool tries to optimize for the case where processing times are roughly equal and instead tries to reduce interprocess communication time.
Your data looks like a colon-separated file, so you can use csv to cut down the processing there too. This smaller script should do what you want.
import multiprocessing as mp
from time import sleep
import csv

max_process = 10

def doshit_processor(row):
    sleep(1)  # if you want to simulate work
    print(row)

def main():
    with open('acc.txt', newline='') as accounts:
        table = list(csv.DictReader(accounts, fieldnames=('user', 'pass', 'name'),
                                    delimiter=':'))
    with mp.Pool(max_process) as pool:
        pool.map(doshit_processor, table, chunksize=1)
    print('all done')

if __name__ == '__main__':
    main()
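For reference, here is a small, self-contained demonstration of what that DictReader call produces; the account values below are invented, not from the question:

import csv
import io

# Hypothetical acc.txt contents (made-up accounts):
sample = io.StringIO("alice:hunter2:Alice Smith\nbob:swordfish:Bob Jones\n")
for row in csv.DictReader(sample, fieldnames=('user', 'pass', 'name'), delimiter=':'):
    # prints one dict per line, e.g. {'user': 'alice', 'pass': 'hunter2', 'name': 'Alice Smith'}
    print(row)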
I'm running multiple threads in Python. I've tried using the threading module and the multiprocessing module. Even though the execution gives the correct result, the terminal gets stuck every time and the printing of the output gets messed up.
Here's a simplified version of the code.
import subprocess
import threading
import argparse
import sys

result = []

def check_thread(args, components, id):
    for i in components:
        cmd = <command to be given to terminal>
        output = subprocess.check_output([cmd], shell=True)
        result.append((id, i, output))

def check(args, components):
    # lock = threading.Lock()
    # lock = threading.Semaphore(value=1)
    thread_list = []
    for id in range(3):
        t = threading.Thread(target=check_thread, args=(args, components, id))
        thread_list.append(t)
    for thread in thread_list:
        thread.start()
    for thread in thread_list:
        thread.join()
    for res in result:
        print(res)
    return res

if __name__ == '__main__':
    parser = argparse.ArgumentParser(....)
    parser.add_argument(.....)
    args = parser.parse_args()
    components = ['comp1', 'comp2']
    while True:
        print('SELECTION MENU\n1)\n2)\n')
        option = raw_input('Enter option')
        if option == '1':
            res = check(args, components)
        if option == '2':
            <do something else>
        else:
            sys.exit(0)
I've tried using the multiprocessing module with Process and Pool. I tried passing a lock to check_thread, and tried returning a value from check_thread() and using a queue to collect the values, but every time it's the same result: the execution is successful, but the terminal gets stuck and the printed output is shabby.
Is there any fix for this? I'm using Python 2.7 on a Linux terminal.
Here is how the shabby output looks (screenshot of the garbled terminal output omitted).
You should use a multiprocessing queue, not a list.
import multiprocessing as mp

# Define an output queue
output = mp.Queue()

# Define an example function
def function(params, output):
    # Process params and store the result in res
    res = params ** 2  # placeholder work; replace with your real processing
    output.put(res)

# Set up a list of processes that we want to run
processes = [mp.Process(target=function, args=(5, output)) for x in range(10)]

# Run processes
for p in processes:
    p.start()

# Exit the completed processes
for p in processes:
    p.join()

# Get process results from the output queue
results = [output.get() for p in processes]

print(results)
I'm going to write a program that has multiple processes (CPU-bound) and multiple threads (I/O-bound). (The code below is just a sample, not the actual program.)
But when the code reaches join(), the program deadlocks.
My code is posted below.
import requests
import time
from multiprocessing import Process, Queue
from multiprocessing.dummy import Pool

start = time.time()

queue = Queue()
rQueue = Queue()

url = 'http://www.bilibili.com/video/av'
for i in xrange(10):
    queue.put(url + str(i))

def goURLsCrawl(queue, rQueue):
    threadPool = Pool(7)
    while not queue.empty():
        threadPool.apply_async(urlsCrawl, args=(queue.get(), rQueue))
    threadPool.close()
    threadPool.join()
    print 'end'

def urlsCrawl(url, rQueue):
    response = requests.get(url)
    rQueue.put(response)

p = Process(target=goURLsCrawl, args=(queue, rQueue))
p.start()
p.join()  # join() is here

end = time.time()
print 'total time %0.4f' % (end - start,)
Thanks in advance.😊
I finally found the reason. As you can see, I imported Queue from multiprocessing, so that Queue should only be used with Processes, but I had the threads access that Queue in my code, so something unknown must be happening behind the scenes.
To correct it, just import Queue (the thread-oriented queue) instead of multiprocessing.Queue.
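For what it is worth, here is one way to apply that idea, written for Python 3 rather than the Python 2 code above; it is a sketch, not the poster's exact fix. The thread pool gets a plain thread-safe queue that lives inside the crawler process, multiprocessing.Queue is kept only for handing results back to the parent, and the parent drains the results before joining. The status-code result and the 10-second timeout are assumptions, and requests must be installed.

import queue
import requests
from multiprocessing import Process, Queue
from multiprocessing.dummy import Pool  # thread pool

def goURLsCrawl(urls, rQueue):
    url_queue = queue.Queue()        # thread-safe queue, local to this process
    for u in urls:
        url_queue.put(u)
    threadPool = Pool(7)
    while not url_queue.empty():
        threadPool.apply_async(urlsCrawl, args=(url_queue.get(), rQueue))
    threadPool.close()
    threadPool.join()

def urlsCrawl(url, rQueue):
    # Always put exactly one small, picklable item per URL so the parent
    # knows how many results to expect.
    try:
        rQueue.put((url, requests.get(url, timeout=10).status_code))
    except Exception as exc:
        rQueue.put((url, str(exc)))

if __name__ == '__main__':
    urls = ['http://www.bilibili.com/video/av' + str(i) for i in range(10)]
    rQueue = Queue()
    p = Process(target=goURLsCrawl, args=(urls, rQueue))
    p.start()
    for _ in urls:                   # drain results before joining, so the
        print(rQueue.get())          # child can always flush its queue and exit
    p.join()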
I'm running python 2.7.3 and I noticed the following strange behavior. Consider this minimal example:
from multiprocessing import Process, Queue

def foo(qin, qout):
    while True:
        bar = qin.get()
        if bar is None:
            break
        qout.put({'bar': bar})

if __name__ == '__main__':
    import sys

    qin = Queue()
    qout = Queue()
    worker = Process(target=foo, args=(qin, qout))
    worker.start()

    for i in range(100000):
        print i
        sys.stdout.flush()
        qin.put(i**2)

    qin.put(None)
    worker.join()
When I loop over 10,000 or more, my script hangs on worker.join(). It works fine when the loop only goes to 1,000.
Any ideas?
The qout queue in the subprocess gets full. The data you put in it from foo() doesn't fit in the buffer of the OS's pipes used internally, so the subprocess blocks trying to fit more data. But the parent process is not reading this data: it is simply blocked too, waiting for the subprocess to finish. This is a typical deadlock.
So, in effect, there is a limit on how much data a queue can buffer before it has to be read. Consider the following modification:
from multiprocessing import Process, Queue

def foo(qin, qout):
    while True:
        bar = qin.get()
        if bar is None:
            break
        #qout.put({'bar': bar})

if __name__ == '__main__':
    import sys

    qin = Queue()
    qout = Queue()  ## POSITION 1

    for i in range(100):
        #qout = Queue()  ## POSITION 2
        worker = Process(target=foo, args=(qin, qout))
        worker.start()
        for j in range(1000):
            x = i*100 + j
            print x
            sys.stdout.flush()
            qin.put(x**2)

        qin.put(None)
        worker.join()

    print 'Done!'
This works as-is (with the qout.put line commented out). If you try to save all 100,000 results, then qout becomes too large: if I uncomment the qout.put({'bar': bar}) in foo and leave the definition of qout at POSITION 1, the code hangs. If, however, I move the qout definition to POSITION 2, then the script finishes.
So, in short, you have to be careful that neither qin nor qout becomes too large. (See also: Multiprocessing Queue maxsize limit is 32767.)
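A sketch of the usual way around this, based on the question's minimal example but written with Python 3 syntax (not part of the original answer): read everything out of qout in the parent before calling worker.join(), so the worker is never stuck waiting for pipe space.

from multiprocessing import Process, Queue

def foo(qin, qout):
    while True:
        bar = qin.get()
        if bar is None:
            break
        qout.put({'bar': bar})

if __name__ == '__main__':
    qin = Queue()
    qout = Queue()
    worker = Process(target=foo, args=(qin, qout))
    worker.start()

    n = 100000
    for i in range(n):
        qin.put(i ** 2)
    qin.put(None)

    # Drain qout *before* joining: the worker can only exit once all of its
    # results have been read out of the underlying pipe.
    results = [qout.get() for _ in range(n)]
    worker.join()
    print(len(results))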
I had the same problem on Python 3 when I tried to put strings into a queue with a total size of about 5000 characters.
In my project there was a host process that sets up a queue and starts a subprocess, then joins. After the join, the host process reads from the queue. When the subprocess produces too much data, the host hangs on join. I fixed this using the following function to wait for the subprocess in the host process:
from multiprocessing import Process, Queue
from queue import Empty

def yield_from_process(q: Queue, p: Process):
    while p.is_alive():
        p.join(timeout=1)
        while True:
            try:
                yield q.get(block=False)
            except Empty:
                break
I read from the queue as soon as it fills, so it never gets very large.
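A hypothetical usage sketch building on the function above; the producer and the names here are illustrative, not from the original post:

def produce(q: Queue):
    for i in range(100000):
        q.put({'bar': i ** 2})

if __name__ == '__main__':
    result_queue = Queue()
    proc = Process(target=produce, args=(result_queue,))
    proc.start()
    count = 0
    for item in yield_from_process(result_queue, proc):
        count += 1   # consume results as they arrive, keeping the queue small
    print(count, 'results consumed')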
I was trying to .get() an async result after the pool had closed: an indentation error had put the loop outside of the with block. I had this:

with multiprocessing.Pool() as pool:
    async_results = list()
    for job in jobs:
        async_results.append(
            pool.apply_async(
                _worker_func,
                (job,),
            )
        )

# wrong: the pool is already closed here, so .get() hangs
for async_result in async_results:
    yield async_result.get()
I needed this:

with multiprocessing.Pool() as pool:
    async_results = list()
    for job in jobs:
        async_results.append(
            pool.apply_async(
                _worker_func,
                (job,),
            )
        )

    # right: .get() while still inside the with block, so the pool is alive
    for async_result in async_results:
        yield async_result.get()
I need to do blocking XML-RPC calls from my Python script to several physical servers simultaneously and perform actions based on the response from each server independently.
To explain in detail, let us assume the following pseudo code:

while True:
    response = call_to_server1()  # blocking and takes a very long time
    if response == this:
        do that

I want to do this for all the servers simultaneously and independently, but from the same script.
Use the threading module.
Boilerplate threading code (I can tailor this if you give me a little more detail on what you are trying to accomplish)
import threading

def run_me(func):
    while not stop_event.isSet():
        response = func()  # blocking and takes a very long time
        if response == this:
            do that

def call_to_server1():
    # code to call server 1...
    return magic_server1_call()

def call_to_server2():
    # code to call server 2...
    return magic_server2_call()

# used to stop your loops
stop_event = threading.Event()

t = threading.Thread(target=run_me, args=(call_to_server1,))
t.start()

t2 = threading.Thread(target=run_me, args=(call_to_server2,))
t2.start()

# wait for threads to return
t.join()
t2.join()

# we are done....
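One thing the boilerplate leaves out is that nothing ever sets stop_event, so the two joins will block until you do. A hedged sketch of one way to trigger the stop; it would go before the join calls, and the 60-second wait is purely illustrative:

import time

time.sleep(60)      # illustrative: let the workers poll for a while
stop_event.set()    # each run_me loop exits after its current call returns,
                    # so t.join() and t2.join() can then complete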
You can use the multiprocessing module:
import multiprocessing

def call_to_server(ip, port):
    ....
    ....

process = []
for i in xrange(server_count):
    process.append(multiprocessing.Process(target=call_to_server, args=(ip, port)))
    process[i].start()

# waiting for the processes to stop
for p in process:
    p.join()
You can use multiprocessing plus queues. Here is an example with a single sub-process:
import multiprocessing
import time

def processWorker(input, result):
    def remoteRequest(params):
        ## this is my remote request
        return True
    while True:
        work = input.get()
        if 'STOP' in work:
            break
        result.put(remoteRequest(work))

input = multiprocessing.Queue()
result = multiprocessing.Queue()

p = multiprocessing.Process(target=processWorker, args=(input, result))
p.start()

requestlist = ['1', '2']
for req in requestlist:
    input.put(req)

for i in xrange(len(requestlist)):
    res = result.get(block=True)
    print 'retrieved ', res

input.put('STOP')
time.sleep(1)
print 'done'
To have more than one sub-process, simply use a list object to store all the sub-processes you start.
The multiprocessing queue is a thread- and process-safe object.
Then you may keep track of which request is being executed by each sub-process simply by storing the request together with a work id (the work id can be a counter that is incremented whenever the queue is filled with new work). Using multiprocessing.Queue is robust, since you do not need to rely on parsing stdout/stderr, and you also avoid the related limitations.
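A minimal sketch of that work-id bookkeeping, in Python 3 syntax; the tuple layout and the helper names are assumptions for illustration, not from the original answer. The parent puts (workid, request) pairs on the input queue, and the worker echoes the workid back with every result.

from multiprocessing import Process, Queue

def do_request(request):
    return 'response to %s' % request   # placeholder for the real remote call

def worker(inq, outq):
    # Echo the workid back with each result so the parent always knows
    # which request an answer belongs to.
    for workid, request in iter(inq.get, 'STOP'):
        outq.put((workid, do_request(request)))

if __name__ == '__main__':
    inq, outq = Queue(), Queue()
    p = Process(target=worker, args=(inq, outq))
    p.start()

    requestlist = ['1', '2', '3']
    for workid, request in enumerate(requestlist):
        inq.put((workid, request))

    for _ in requestlist:
        workid, result = outq.get()
        print(workid, result)

    inq.put('STOP')
    p.join()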
Then, you can also set a timeout on how long you want a get call to wait at most, e.g.:
import Queue

try:
    res = result.get(block=True, timeout=10)
except Queue.Empty:
    print 'timed out: no result within 10 seconds'
Use Twisted.
It has a lot of useful tools for working with networks, and it is also very good at working asynchronously.
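A hedged sketch of what the XML-RPC case can look like with Twisted (twisted.web.xmlrpc.Proxy and Deferred callbacks); the endpoint URLs and the method name are made up, and this is only an outline, not a drop-in solution:

from twisted.internet import defer, reactor
from twisted.web.xmlrpc import Proxy

def on_response(response, server):
    # React to each server's answer independently.
    print(server, response)

def on_error(failure, server):
    print(server, 'failed:', failure.getErrorMessage())

servers = [b'http://server1/RPC2', b'http://server2/RPC2']  # made-up endpoints
calls = []
for url in servers:
    d = Proxy(url).callRemote('some_method')  # made-up method name
    d.addCallback(on_response, url)
    d.addErrback(on_error, url)
    calls.append(d)

# Stop the reactor once every call has either succeeded or failed.
defer.DeferredList(calls).addCallback(lambda _: reactor.stop())
reactor.run()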