python append and read from the list simultaneously

I have a situation where I have to read and write a list simultaneously.
It seems the code only starts reading after it has finished writing all the elements to the list. What I want is for the code to keep adding elements at one end while I keep processing the first 10 elements at the same time.
import csv
from multiprocessing import Pool

testlist = []
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        testlist.append(row)

def render(small):
    # do some stuff
    pass

while len(testlist) > 0:
    pool = Pool(processes=10)
    small = testlist[:10]
    del testlist[:10]
    pool.map_async(render, small)
    pool.close()
    pool.join()

You need a queue that is shared between processes. One process adds to the queue, the other processes items from the queue.
Here is a simplified example:
from multiprocessing import Process, Queue

def put(queue):
    # writes items to the queue
    queue.put('something')

def get(queue):
    # gets items from the queue to work on
    while True:
        item = queue.get()
        if item is None:        # None signals there is no more work
            break
        # do something with item

if __name__ == '__main__':
    queue = Queue()
    getter_process = Process(target=get, args=(queue,))
    getter_process.daemon = True
    getter_process.start()
    put(queue)                  # Send something to the queue
    queue.put(None)             # Tell the getter we are done
    getter_process.join()       # Wait for the getter to finish
If you only want to process 10 things at a time, you can limit the queue size to 10. This means the "writer" cannot add anything while the queue already has 10 items waiting to be processed.
By default, the queue has no bound/limit. The documentation is a good place to start for more on queues.
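For example, sticking with the queue pattern from the snippet above, a bounded queue is just a matter of passing maxsize when the queue is created; the range(100) loop below is a stand-in for your csv rows, not part of the original question:
from multiprocessing import Process, Queue

def get(queue):
    # gets items from the queue to work on
    while True:
        item = queue.get()
        # do something with item

if __name__ == '__main__':
    queue = Queue(10)                        # at most 10 items may wait in the queue
    getter_process = Process(target=get, args=(queue,))
    getter_process.daemon = True             # killed automatically when the main process exits
    getter_process.start()
    for row in range(100):                   # stand-in for rows read from the csv
        queue.put(row)                       # blocks whenever 10 items are already waiting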

You can do it like this:
x = []
y = [1, 2, 3, 4, 5, 6, 7, ...]   # ... stands for more incoming values
for i in y:
    x.append(i)
    if len(x) < 10:
        print(x)
    else:
        print(x[:10])
        del x[:10]    # lists do not support subtraction; drop the first 10 elements instead
PS: assuming y is an infinite stream
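Since a real y would be an endless stream rather than a literal list, here is a small sketch of the same batching idea fed by a generator; itertools.count is only a stand-in for whatever actually produces the data:
import itertools

def stream():
    # stand-in for the infinite data source
    yield from itertools.count(1)

x = []
for i in stream():
    x.append(i)
    if len(x) >= 10:
        batch = x[:10]
        del x[:10]       # drop the batch from the front of the buffer
        print(batch)     # hand the 10-element batch to the processing code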

Related

python multithreading/multiprocessing for a loop with 3+ arguments

Hello, I have a CSV with about 2.5k lines of Outlook emails and passwords.
The CSV looks like
header:
username, password
content:
test1233#outlook.com,123password1
test1234#outlook.com,123password2
test1235#outlook.com,123password3
test1236#outlook.com,123password4
test1237#outlook.com,123password5
The code lets me log into the accounts and delete every mail from them, but it takes too long to run the script over 2.5k accounts, so I wanted to make it faster with multithreading.
This is my code:
from csv import DictReader
import imap_tools
from datetime import datetime

def IMAPDumper(accountList, IMAP_SERVER, search_criteria, row):
    accountcounter = 0
    with open(accountList, 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        for row in csv_dict_reader:
            # TIMESTAMP FOR FURTHER DEBUGGING TO CHECK IF THE SCRIPT IS STOPPING AT A POINT
            TIMESTAMP = datetime.now().strftime("[%H:%M:%S]")
            # adds a counter for the amount of accounts processed by the script
            accountcounter = accountcounter + 1
            print("_____________________________________________")
            print(TIMESTAMP, "Account", accountcounter)
            print("_____________________________________________")
            # resetting emailcounter each time
            emailcounter = 0
This is a job that is best accomplished with a thread pool, whose optimum size will need to be found by experiment. I have set the size below to 100, which may be overly ambitious (or not). You can try decreasing or increasing NUM_THREADS to see what effect it has.
The important thing is to modify the IMAPDumper function so that it is passed a single row from the csv file to process, and therefore does not need to open and read the file itself.
There are various methods you can use with class ThreadPool in module multiprocessing.pool (this class is not well documented; it is the multithreading analog of the multiprocessing Pool class in module multiprocessing.pool and has exactly the same interface). The advantage of imap_unordered is that (1) the passed iterable argument can be a generator that will not be converted to a list, which saves memory and time if that list would be very large, and (2) the ordering of the results (the return values from the worker function, IMAPDumper in this case) is arbitrary, so it might run slightly faster than imap or map. Since your worker function does not explicitly return a value (it defaults to None), this should not matter.
from csv import DictReader
import imap_tools
from datetime import datetime
from multiprocessing.pool import ThreadPool
from functools import partial

def IMAPDumper(IMAP_SERVER, search_criteria, row):
    """ process a single row """
    # TIMESTAMP FOR FURTHER DEBUGGING TO CHECK IF THE SCRIPT IS STOPPING AT A POINT
    TIMESTAMP = datetime.now().strftime("[%H:%M:%S]")
    # the running account counter from the original loop does not apply here,
    # since each call handles exactly one row; identify the account by the row itself
    print("_____________________________________________")
    print(TIMESTAMP, "Account", row)
    ...  # etc

def generate_rows():
    """ generator function to yield rows """
    with open('outlookAccounts.csv', newline='') as f:
        dict_reader = DictReader(f)
        for row in dict_reader:
            yield row

NUM_THREADS = 100

worker = partial(IMAPDumper, "outlook.office365.com", "ALL")
pool = ThreadPool(NUM_THREADS)
for return_value in pool.imap_unordered(worker, generate_rows()):
    # must iterate the iterator returned by imap_unordered to ensure all tasks are run and completed
    pass  # return values are None
This is not necessarily the best way to do it, but it is the quickest to write. I don't know if you are familiar with Python generators, but we will have to use one. The generator will work as a work dispatcher.
from csv import DictReader

def generator():
    with open("t.csv", 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        for row in csv_dict_reader:
            yield row

gen = generator()
Next, you will have your main function where you do your IMAP stuff
def main():
    while True:
        # The try prevents the thread from crashing once the whole file has been processed
        try:
            # Returns the next line of the csv
            working_set = next(gen)
            # do_some_stuff
            # -
            # do_other_stuff
        except:
            break
Then you just have to split the work across multiple threads!
import threading

# You can change the number of threads
number_of_threads = 5
thread_list = []

# Creates 5 thread objects
for _ in range(number_of_threads):
    thread_list.append(threading.Thread(target=main))

# Starts all thread objects
for thread in thread_list:
    thread.start()
I hope this helped you!
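One caveat worth adding: calling next() on one shared generator from several threads at once can raise ValueError: generator already executing. A small lock around the call avoids that; the safe_next helper below is my own addition, not part of the answer:
import threading

gen_lock = threading.Lock()

def safe_next(gen):
    # only one thread at a time may advance the shared generator
    with gen_lock:
        return next(gen)

def main():
    while True:
        try:
            working_set = safe_next(gen)
            # do your IMAP stuff with working_set here
        except StopIteration:
            break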

How to run two synchronous processes at the same time in python

Let two processes, function1 and function2, run at the same time.
function1 // continuously appends to the list
function2 // takes the list from function1, copies all of its data to another list, flushes the original list, and processes the copied list
sample code:
list_p = []

def function1(data):
    list_p.append(data)

def function2(list_p):
    list_q = list_p.copy()
    list_p.clear()    # lists have no flush(); clear() empties the original list in place
    x = process(list_q)
    return x

while True:
    # data coming in continuously
    function1(coming_data)
So, how can I run function1 and function2 at the same time, so that I can take the data from function1 and flush it (after flushing, function1 starts appending again from index 0) while new data keeps being appended in function1?
In other words, function1 could be appending to the list while function2 processes the copied list; after function2 finishes its processing, it again takes all the data that accumulated in the original list in the meantime.
Here is an example using threading. In place of a data stream I used the input function in the producer. (It is based on https://techmonger.github.io/55/producer-consumer-python/.)
from threading import Thread
from queue import Queue

q = Queue()
final_results = []

def producer():
    while True:
        i = int(input('Give me some number: '))  # here you should get data from the data stream
        q.put(i)

def consumer():
    while True:
        number = q.get()
        result = number**2
        final_results.append(result)
        print(final_results)
        q.task_done()

t = Thread(target=consumer)
t.daemon = True
t.start()

producer()
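If the producer is finite and you need to be sure every queued number has been processed before the program moves on, the consumer's q.task_done() calls pair with q.join(). A minimal self-contained sketch of that shutdown, assuming the producer stops after a fixed amount of input:
from threading import Thread
from queue import Queue

q = Queue()
final_results = []

def producer():
    for _ in range(5):                       # finite stand-in for the data stream
        i = int(input('Give me some number: '))
        q.put(i)
    q.join()                                 # blocks until task_done() was called for every item
    print('All done:', final_results)

def consumer():
    while True:
        number = q.get()
        final_results.append(number**2)
        q.task_done()

t = Thread(target=consumer, daemon=True)
t.start()
producer()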

Starting a large number of dependent process in async using python multiprocessing

Problem: I have a DAG (directed acyclic graph) like structure for starting the execution of some massive data processing on a machine. Some of the processes can only be started when their parent data processing is completed, because there are multiple levels of processing. As a first goal I want to use the Python multiprocessing library to handle everything on one single machine, and later scale out to execute on different machines using Managers. I have no prior experience with Python multiprocessing. Can anyone suggest whether it is a good library to begin with? If yes, some basic implementation ideas would do just fine. If not, what else can be used to do this in Python?
Example:
A -> B
B -> D, E, F, G
C -> D
In the above example I want to kick off A and C first (in parallel); after their successful execution, the remaining processes just wait for B to finish. As soon as B finishes its execution, all the other processes start.
P.S.: Sorry, I cannot share the actual data because it is confidential, though I tried to make it clear using the example.
I'm a big fan of using processes and queues for things like this.
Like so:
from multiprocessing import Process, Queue
from Queue import Empty as QueueEmpty
import time

# example process functions
def processA(queueA, queueB):
    while True:
        try:
            data = queueA.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter the queue
            continue
        # do stuff with data
        queueB.put(data)

def processB(queueB, _):
    while True:
        try:
            data = queueB.get_nowait()
            if data == 'END':
                break
        except QueueEmpty:
            time.sleep(2)  # wait some time for data to enter the queue
            continue
        # do stuff with data

# helper functions for starting and stopping processes
def start_procs(num_workers, target_function, args):
    procs = []
    for _ in range(num_workers):
        p = Process(target=target_function, args=args)
        p.start()
        procs.append(p)
    return procs

def shutdown_process(proc_lst, queue):
    for _ in proc_lst:
        queue.put('END')
    for p in proc_lst:
        try:
            p.join()
        except KeyboardInterrupt:
            break

queueA = Queue(<size of queue> * 3)  # needs to be a bit bigger than actual. 3x works well for me
queueB = Queue(<size of queue>)
queueC = Queue(<size of queue>)
queueD = Queue(<size of queue>)

procsA = start_procs(number_of_workers, processA, (queueA, queueB))
procsB = start_procs(number_of_workers, processB, (queueB, None))

# feed some data to processA
[queueA.put(data) for data in start_data]

# shutdown processes
shutdown_process(procsA, queueA)
shutdown_process(procsB, queueB)

# etc, etc. You could arrange the start, stop, and data feed statements to arrive at the DAG behaviour you desire
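As an illustration of that arrangement for the exact graph in the question (A and C first, then B, then D/E/F/G), the levels can be sequenced with plain start/join calls; the do_A ... do_G functions below are placeholders for the real processing steps:
from multiprocessing import Process

# placeholder workers; swap in the real processing functions
def do_A(): print('A running')
def do_B(): print('B running')
def do_C(): print('C running')
def do_D(): print('D running')
def do_E(): print('E running')
def do_F(): print('F running')
def do_G(): print('G running')

def run_level(targets):
    # start every process in this level of the DAG, then wait for all of them
    procs = [Process(target=t) for t in targets]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    run_level([do_A, do_C])              # A and C have no unmet dependencies
    run_level([do_B])                    # B waits for A
    run_level([do_D, do_E, do_F, do_G])  # D waits for B and C; E, F and G wait for B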

Python Multiprocessing on Iterator

I am trying to use multiprocessing to handle csv files in excess of 2GB. The problem is that the input is only being consumed in one process while the others appear to be idle.
The following recreates the problem I am encountering. Is it possible to use multiprocessing with an iterator? Consuming the full input into memory is not desired.
import csv
import multiprocessing
import time

def something(row):
    # print row[0]
    # pass
    return row

def main():
    start = time.time()
    i = open("input.csv")
    reader = csv.reader(i, delimiter='\t')
    print reader.next()
    p = multiprocessing.Pool(16)
    print "Starting processes"
    j = p.imap(something, reader, chunksize=10000)
    count = 1
    while j:
        print j.next()
    print time.time() - start

if __name__ == '__main__':
    main()
I think you are confusing "processes" with "processors".
Your program is definitely spawning multiple processes at the same time, as you can verify in the system or resource monitor while your program is running. How many processors or CPU cores are used depends mainly on the OS, and has a lot to do with how CPU-intensive the task you are delegating to each process is.
Make a little modification to your something function to introduce a sleep time that simulates the work being done in the function:
def something(row):
    time.sleep(.4)
    return row
Now, first run your function sequentially on each row of your file, and notice that each result comes one by one, every 400 ms.
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print(next(reader))
        # SEQUENTIALLY:
        for row in reader:
            result = something(row)
            print(result)
Now try it with the pool of workers. Keep it at a low number, say 4 workers, and you will see that the results still come every 400 ms, but in groups of 4 (or roughly the number of workers in the pool):
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print(next(reader))
        # IN PARALLEL
        print("Starting processes")
        p = multiprocessing.Pool(4)
        results = p.imap(something, reader)
        for result in results:
            print(result)  # one result is the processing of 4 rows...
While running in parallel, check the system monitor and look at how many "python" processes are being executed. It should be one plus the number of workers.
I hope this explanation is useful.
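If you want to see that distribution for yourself, a quick check (my own addition, not part of the answer) is to return the worker's process id from something and watch how the rows spread across the pool:
import os
import time

def something(row):
    time.sleep(.4)
    # returning the pid alongside the row shows which worker process handled it
    return (os.getpid(), row)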

Multi-process, using Queue & Pool

I have a Producer process that runs and puts its results in a Queue.
I also have a Consumer function that takes the results from the Queue and processes them, for example:
def processFrame(Q, commandsFile):
    fr = Q.get()
    frameNum = fr[0]
    Frame = fr[1]
    #
    # Process the frame
    #
    commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes; their number should be set by the user:
processes = raw_input('Enter the number of processes you want to use: ')
I tried using Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q, toFile))
When I try this, it returns: RuntimeError: Queue objects should only be shared between processes through inheritance.
What does that mean?
I also tried to use a list of processes:
while q.empty() == False:
    mp = [Process(target=processFrame, args=(q, toFile)) for x in range(int(processes))]
    for p in mp:
        p.start()
    for p in mp:
        p.join()
This one seems to run, but not as expected. It uses multiple processes on the same frame from the Queue; doesn't Queue have locks?
Also, in this case the number of processes I'm allowed to use must divide the number of frames without a remainder. For example: if I have 10 frames I can use only 1, 2, 5, or 10 processes; if I use 3 or 4, it will create a process while the Queue is empty and it won't work.
If you want to recycle the process until the queue is empty, you should just try something like this:
code1:
def processframe():
    while True:
        frame = queue.get()
        ## do something
Your process will be blocked until there is something in the queue.
I don't think it is a good idea to use multiprocessing on the consumer part; you should use it on the producer.
If you want to terminate the process when the queue is empty, you can do something like this:
code2:
def processframe():
    while not queue.empty():
        frame = queue.get()
        ## do something
    terminate_process()
update:
If you want to use multiprocessing in the consumer part, just do a simple loop and add code2; then you will be able to close your processes when you finish doing stuff with the queue.
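Putting that update together, one way to run several consumers is sketched below. It uses one end-of-work sentinel per consumer instead of the empty() check from code2, because multiprocessing.Queue.empty() is documented as not reliable; the range(20) frames are just stand-ins for whatever the producer actually puts on the queue:
from multiprocessing import Process, Queue

NUM_CONSUMERS = 4

def processframe(queue):
    while True:
        frame = queue.get()
        if frame is None:        # sentinel: no more frames, this consumer may exit
            break
        # do something with frame

if __name__ == '__main__':
    queue = Queue()
    for frame in range(20):      # stand-in for frames coming from the producer
        queue.put(frame)
    for _ in range(NUM_CONSUMERS):
        queue.put(None)          # one sentinel per consumer
    consumers = [Process(target=processframe, args=(queue,)) for _ in range(NUM_CONSUMERS)]
    for p in consumers:
        p.start()
    for p in consumers:
        p.join()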
I am not entirely sure what you are trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its map or map_async methods?
from multiprocessing import Pool
from foo import bar  # your function

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    result = p.map_async(bar, [("arg #1", "arg #2"), ...])
    print result.get()
It collects the results from your function in an unordered(!) iterable and you can use them however you wish.
UPDATE
I think you should not use a queue and be more straightforward:
from multiprocessing import Pool

def process_frame(fr):  # PEP8 and see the difference in definition
    # magic
    return result  # and result handling!

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    results = p.map_async(process_frame, [fr_1, fr_2, ...])
    # Do not ever write or manipulate with files in parallel processes
    # if you are not 100% sure what you are doing!
    for result in results.get():
        commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time

def f(x):
    return x*x

def g(yr):
    with open("result.txt", "ab") as f:
        for y in yr:
            f.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    while True:
        # here you fetch new data and send it to process
        new_data = [random.randint(1, 50) for i in range(4)]
        pool.map_async(f, new_data, callback=g)
Here is an example of how to do it; I updated the algorithm to be "infinite": it can only be closed by an interruption or a kill command from outside. You can also use apply_async, but it would cause slowdowns in result handling (depending on the speed of processing).
I also tried keeping result.txt open in the global scope for a long time, but it hit a deadlock every time.
