Hello, I have a CSV with about 2.5k lines of Outlook emails and passwords.
The CSV looks like
header:
username, password
content:
test1233#outlook.com,123password1
test1234#outlook.com,123password2
test1235#outlook.com,123password3
test1236#outlook.com,123password4
test1237#outlook.com,123password5
The code lets me log into the accounts and delete every mail from them, but it takes too long for the script to work through 2.5k accounts, so I wanted to speed it up with multithreading.
This is my code:
from csv import DictReader
import imap_tools
from datetime import datetime

def IMAPDumper(accountList, IMAP_SERVER, search_criteria, row):
    accountcounter = 0
    with open(accountList, 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        for row in csv_dict_reader:
            # TIMESTAMP FOR FURTHER DEBUGGING TO CHECK IF THE SCRIPT IS STOPPING AT A POINT
            TIMESTAMP = datetime.now().strftime("[%H:%M:%S]")
            # adds a counter for the amount of accounts processed by the script
            accountcounter = accountcounter + 1
            print("_____________________________________________")
            print(TIMESTAMP, "Account", accountcounter)
            print("_____________________________________________")
            # resetting emailcounter each time
            emailcounter = 0
This is a job best accomplished with a thread pool, whose optimum size you will need to experiment with. I have set the size below to 100, which may be overly ambitious (or not). You can try decreasing or increasing NUM_THREADS to see what effect it has.
The important thing is to modify function IMAPDumper so that it is passed a single row of the CSV file to process, and therefore does not need to open and read the file itself.
There are various methods you can use with class ThreadPool in module multiprocessing.pool (this class is not well documented; it is the multithreading analog of the multiprocessing pool class Pool in module multiprocessing.pool and has exactly the same interface). The advantage of imap_unordered is that (1) the passed iterable argument can be a generator that will not be converted to a list, which saves memory and time if that list would be very large, and (2) the ordering of the results (the return values from the worker function, IMAPDumper in this case) is arbitrary, so it might run slightly faster than imap or map. Since your worker function does not explicitly return a value (it defaults to None), this should not matter.
from csv import DictReader
import imap_tools
from datetime import datetime
from multiprocessing.pool import ThreadPool
from functools import partial

def IMAPDumper(IMAP_SERVER, search_criteria, row):
    """ process a single row """
    # TIMESTAMP FOR FURTHER DEBUGGING TO CHECK IF THE SCRIPT IS STOPPING AT A POINT
    TIMESTAMP = datetime.now().strftime("[%H:%M:%S]")
    # each call now handles exactly one account, so print the username instead of a running counter
    print("_____________________________________________")
    print(TIMESTAMP, "Account", row['username'])
    ...  # etc

def generate_rows():
    """ generator function to yield rows """
    with open('outlookAccounts.csv', newline='') as f:
        dict_reader = DictReader(f)
        for row in dict_reader:
            yield row

NUM_THREADS = 100

worker = partial(IMAPDumper, "outlook.office365.com", "ALL")
pool = ThreadPool(NUM_THREADS)
for return_value in pool.imap_unordered(worker, generate_rows()):
    # must iterate the iterator returned by imap_unordered to ensure all tasks are run and completed
    pass  # return values are None
This is not necessarily the best way to do it, but the quickest to write. I don't know if you are familiar with Python generators, but we will have to use one; the generator will work as a work dispatcher.
import threading
from csv import DictReader

def generator():
    with open("t.csv", 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        for row in csv_dict_reader:
            yield row

gen = generator()
# next() on a shared generator is not thread-safe, so guard it with a lock
gen_lock = threading.Lock()
Next, you will have your main function where you do your IMAP stuff
def main():
    while True:
        # The try prevents the thread from crashing once the whole file has been processed
        try:
            # Returns the next line of the csv
            with gen_lock:
                working_set = next(gen)
            # do_some_stuff
            # -
            # do_other_stuff
        except StopIteration:
            break
Then you just have to split the work across multiple threads!
# You can change the number of threads
number_of_threads = 5
thread_list = []

# Creates 5 thread objects
for _ in range(number_of_threads):
    thread_list.append(threading.Thread(target=main))

# Starts all thread objects
for thread in thread_list:
    thread.start()
I hope this helped you!
Related
Let us consider the following code, where I calculate the factorial of 4 really large numbers, saving each output to a separate .txt file (out_mp_{idx}.txt). I use multiprocessing (4 processes) to reduce the computation time. Though this works fine, I want to output all 4 results in one file.
One way is to open each of the (4) generated files and append them to a new file, but that's not my choice (below is just a simplistic version of my code; I have too many files to handle, which defeats the purpose of time-saving via multiprocessing). Is there a better way to automate this so that the results from the processes are all dumped/appended to a single file? Also, in my case the result returned from each process could be several lines, so how would we avoid an open-file conflict when one process is appending its results to the output file and a second process returns its answer and wants to open/access the output file?
As an alternative, I tried the pool.imap route, but that's not as computationally efficient as the code below. Something like this SO post.
from multiprocessing import Process
import os
import time

tic = time.time()

def factorial(n, idx):  # function to calculate the factorial
    num = 1
    while n >= 1:
        num *= n
        n = n - 1
    with open(f'out_mp_{idx}.txt', 'w') as f0:  # saving output to a separate file
        f0.writelines(str(num))

def My_prog():
    jobs = []
    N = [10000, 20000, 40000, 50000]  # numbers for which factorial is desired
    n_procs = 4
    # executing multiple processes
    for i in range(n_procs):
        p = Process(target=factorial, args=(N[i], i))
        jobs.append(p)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    print(f'Exec. Time:{time.time()-tic} [s]')

if __name__ == '__main__':
    My_prog()
You can do this:
1) Create a Queue
   a) manager = Manager()
   b) data_queue = manager.Queue()
   c) put all data in this queue.
2) Create a thread and start it before the multiprocessing part
   a) create a function which waits on data_queue.
Something like:
def fun():
    while True:
        data = data_queue.get()
        if isinstance(data, Sentinel):  # Sentinel = whatever marker class/object you choose
            break
        # write data to the output file
3) Remember to send a sentinel object to the queue after all the multiprocessing work is done.
You can also make this thread a daemon thread and skip the sentinel part.
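To make the three steps concrete, here is a minimal sketch of that pattern applied to the factorial example from the question. The output file name out_mp_all.txt and the choice of None as the sentinel are my own assumptions, not from the original code.

from multiprocessing import Process, Manager
from threading import Thread

def factorial(n, idx, data_queue):
    num = 1
    while n >= 1:
        num *= n
        n -= 1
    data_queue.put(f'{idx}: {num}\n')  # hand the result to the writer instead of opening a file

def writer(data_queue):
    with open('out_mp_all.txt', 'w') as f:
        while True:
            data = data_queue.get()
            if data is None:           # None is used as the sentinel here
                break
            f.write(data)

if __name__ == '__main__':
    manager = Manager()
    data_queue = manager.Queue()
    writer_thread = Thread(target=writer, args=(data_queue,))
    writer_thread.start()
    jobs = [Process(target=factorial, args=(n, i, data_queue))
            for i, n in enumerate([10000, 20000, 40000, 50000])]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    data_queue.put(None)   # all processes are done, tell the writer thread to stop
    writer_thread.join()

Because only the writer thread ever touches the output file, there is no open-file conflict to worry about, no matter how many lines each process produces.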
I want to split a huge file into small files. There are approximately 2 million IDs in the file and I want to sort them by modulo. When you run the program, it should ask for the number of files you want to divide the main file into (x = int(input())). I want to separate the file with a modulo function: if ID % x == 1, the ID should be added to q1 and written to f1. But it only adds the first line that matches the requirement.
import multiprocessing

def createlist(x, i, queue_name):
    with open("events.txt") as f:
        next(f)
        for line in f:
            if int(line) % x == i:
                queue_name.put(line)

def createfile(x, i, queue_name):
    for i in range(x):
        file_name = "file{}.txt".format(i+1)
        with open(file_name, "w") as text:
            text.write(queue_name.get())

if __name__ == '__main__':
    x = int(input("number of parts "))
    i = 0
    for i in range(x):
        queue_name = "q{}".format(i+1)
        queue_name = multiprocessing.Queue()
        p0 = multiprocessing.Process(target=createlist, args=(x, i, queue_name,))
        process_name = "p{}".format(i+1)
        process_name = multiprocessing.Process(target=createfile, args=(x, i, queue_name,))
        p0.start()
        process_name.start()
Your createfile has two functional issues.
it only reads from the queue once, then terminates
it iterates the range of the desired number of subsets a second time, hence even after fixing the issue with the single queue-read you get one written file and parts - 1 empty files.
To fix your approach make createfile look like this:
def createfile(i, queue_name):  # Note: x has been removed from the args
    file_name = "file{}.txt".format(i + 1)
    with open(file_name, "w") as text:
        while True:
            if queue_name.empty():
                break
            text.write(queue_name.get())
Since x has been removed from createfile's arguments, you'd also remove it from the process instantiation:
process_name = multiprocessing.Process(target = createfile, args = (i,queue_name,))
However ... do not do it like this. The more subsets you want, the more processes and queues you create (two processes and one queue per subset). That is a lot of overhead you create.
Also, while having one responsible process per output file for writing might still make some sense, having multiple processes reading the same (huge) file completely does not.
I did some timing and testing with an input file containing 1000 lines, each consisting of one random integer between 0 and 9999. I created three algorithms and ran each for ten iterations while tracking execution time. I did this for a desired number of subsets of 1, 2, 3, 4, 5 and 10. For the numbers below I took the mean value of each series.
orwqpp (green): Is one-reader-writer-queue-per-part. Your approach. It saw an average increase in execution time of 0.48 seconds per additional subset.
orpp (blue): Is one-reader-per-part. This one had a common writer process that took care of writing to all files (a rough sketch of this variant is shown further below). It saw an average increase in execution time of 0.25 seconds per additional subset.
ofa (yellow): Is one-for-all. One single function, not run in a separate process, reading and writing in one go. It saw an average increase in execution time of 0.0014 seconds per additional subset.
Keep in mind that these figures were created with an input file 1/2000 the size of yours. The processes in what resembles your approach completed so quickly, they barely got in each other's way. Once the input is large enough to make the processes run for a longer amount of time, contention for CPU resources will increase and so will the penalty of having more processes as more subsets are requested.
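For reference, here is a rough sketch of what such a one-reader-per-part setup with a common writer process could look like; this is my own reconstruction, not the exact code used for the timings above:

import multiprocessing

def reader(parts, part, out_queue):
    # one reader per part: scan the whole file and keep only the lines for this part
    with open("events.txt") as f:
        next(f)
        for line in f:
            if int(line) % parts == part:
                out_queue.put((part, line))
    out_queue.put(None)  # one sentinel per reader

def writer(parts, out_queue):
    # single writer process owns all output files
    handles = {p: open("file{}.txt".format(p), "w") for p in range(parts)}
    finished = 0
    while finished < parts:
        item = out_queue.get()
        if item is None:
            finished += 1
            continue
        part, line = item
        handles[part].write(line)
    for h in handles.values():
        h.close()

if __name__ == "__main__":
    parts = int(input("number of parts "))
    q = multiprocessing.Queue()
    readers = [multiprocessing.Process(target=reader, args=(parts, p, q)) for p in range(parts)]
    w = multiprocessing.Process(target=writer, args=(parts, q))
    for proc in readers + [w]:
        proc.start()
    for proc in readers + [w]:
        proc.join()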
Here's the one-for-all approach:
def one_for_all(parts):
    handles = dict({el: open("file{}.txt".format(el), "w") for el in range(parts)})
    with open("events.txt") as f:
        next(f)
        for line in f:
            fnum = int(line) % parts
            handles.get(fnum).write(line)
    [h.close() for h in handles.values()]

if __name__ == '__main__':
    x = int(input("number of parts "))
    one_for_all(x)
This currently names the files based on the result of the modulo operation, so numbers where int(line) % parts is 0 will be in file0.txt and so on.
If you don't want that, simply add 1 when formatting the file name:
handles = dict({el: open("file{}.txt".format(el+1), "w") for el in range(parts)})
I am fairly new to Python, so kindly excuse any insufficient information. As part of the curriculum I was introduced to Python for quants/finance; I am studying multiprocessing and trying to understand it better. I tried modifying the given problem and now I am mentally stuck.
Problem:
I have a function which gives me ticks, in ohlc format.
{'scrip_name':'ABC','timestamp':1504836192,'open':301.05,'high':303.80,'low':299.00,'close':301.10,'volume':100000}
every minute. I wish to do the following calculations concurrently and preferably append/insert into the same list:
Find the Moving Average of the last 5 close data
Find the Median of the last 5 open data
Save the tick data to a database.
so expected data is likely to be
{'scrip_name':'ABC','timestamp':1504836192,'open':301.05,'high':303.80,'low':299.00,'close':301.10,'volume':100000,'MA_5_open':300.25,'Median_5_close':300.50}
Assuming that the data is going to a DB, it's fairly easy to write a simple insert routine for the database. I don't see that as a great challenge, since I can spawn a process to execute an insert statement every minute.
How do I sync 3 different functions/processes (a function to insert into the DB, a function to calculate the average, a function to calculate the median), while holding 5 ticks in memory to calculate the 5-period simple moving average, and push the results back to the dict/list?
This assumption is what challenges me in writing the multiprocessing routine. Can someone guide me? I don't want to use a pandas DataFrame.
====REVISION/UPDATE===
The reason why I don't want any solution based on pandas/numpy is that my objective is to understand the basics, not the nuances of a new library. Please don't mistake my need for understanding as arrogance or unwillingness to be open to suggestions.
The advantage of having
p1 = Process(target=Median, args=(sourcelist,))
p2 = Process(target=Average, args=(sourcelist,))
p3 = Process(target=insertdb, args=(updatedlist,))
would help me understand the possibility of scaling processes based on the number of functions/algo components. But how should I make sure p1 & p2 are in sync, while p3 executes only after p1 & p2?
Here is an example of how to use multiprocessing:
from multiprocessing import Pool, cpu_count
from functools import partial
from statistics import median

def db_func(ma, med):
    db.save(something)  # placeholder: insert the moving average and median into your database

def backtest_strat(d, db_func):
    a = d.get('avg')
    db_func(sum(a) / len(a), median(a))

with Pool(cpu_count()) as p:
    bs = partial(backtest_strat, db_func=db_func)
    print(p.map(bs, [{'avg': [1, 2, 3, 4, 5], 'median': [1, 2, 3, 4, 5]}]))
Also see: https://stackoverflow.com/a/24101655/2026508
Note that this will not speed anything up unless there are a lot of slices.
So for the speed-up part:
def get_slices(data):
    for slice in data:
        yield {'avg': [1, 2, 3, 4, 5], 'median': [1, 2, 3, 4, 5]}

p.map(bs, get_slices(data))
From what I understand, multiprocessing works by message passing via pickles, so when pool.map is called it should have access to all three things: the two arrays and the db_func function. There are of course other ways to go about it, but hopefully this shows one way.
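If you want to convince yourself of the pickling point, a quick sanity check (my own addition, not part of the answer's code) is to round-trip the payload yourself:

import pickle

payload = {'avg': [1, 2, 3, 4, 5], 'median': [1, 2, 3, 4, 5]}
# the argument dicts are sent to the workers as pickles, so they must survive a round trip
assert pickle.loads(pickle.dumps(payload)) == payload
# the partial-wrapped function must be picklable too, which is why backtest_strat and db_func
# are defined at module level rather than inside another function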
Question: how should I make sure p1&p2 are in sync while p3 should execute post p1&p2
If you synchronize all processes, computing one task (p1, p2, p3) can be no faster than the slowest process.
In the meantime the other processes would sit idle.
This is called the "producer-consumer problem".
A solution using a Queue serializes all data, so no explicit synchronization is required.
# Process-1
def Producer():
    task_queue.put(data)

# Process-2
def Consumer(task_queue):
    data = task_queue.get()
    # process data
You want multiple consumer processes and one consumer process that gathers all results.
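For the Queue-based route, a minimal runnable sketch could look like the following. The names (ticks_queue, results_queue, average_worker) and the single hard-coded 5-tick window are just illustrative choices, not from the question's code:

import multiprocessing as mp

def average_worker(ticks_queue, results_queue):
    # consumer: compute the 5-period moving average of 'close' for each window it receives
    while True:
        window = ticks_queue.get()
        if window is None:                    # sentinel: no more work
            break
        ma = sum(t['close'] for t in window) / len(window)
        results_queue.put(('MA_5_close', window[-1]['timestamp'], ma))

def result_gatherer(results_queue):
    # single consumer of results: this is where the db insert would go
    while True:
        item = results_queue.get()
        if item is None:
            break
        print('result:', item)

if __name__ == '__main__':
    ticks_queue = mp.Queue()
    results_queue = mp.Queue()
    workers = [mp.Process(target=average_worker, args=(ticks_queue, results_queue))]
    gatherer = mp.Process(target=result_gatherer, args=(results_queue,))
    for p in workers + [gatherer]:
        p.start()
    # producer: put one 5-tick window on the queue (normally one tick arrives every minute)
    window = [{'timestamp': 1504836192 + 60 * i, 'open': 301.0 + i, 'close': 301.1 + i}
              for i in range(5)]
    ticks_queue.put(window)
    for _ in workers:
        ticks_queue.put(None)                 # stop the workers
    for p in workers:
        p.join()
    results_queue.put(None)                   # stop the gatherer
    gatherer.join()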
If you don't want to use a Queue, you have to use sync primitives instead.
This example lets all processes run independently.
Only the Result process waits until it is notified.
This example uses an unlimited task buffer, tasks = mp.Manager().list().
Its size could be minimized if list entries for completed tasks are reused.
If you have some very fast algorithms, combine them into one process.
import multiprocessing as mp
from random import randrange

# Base class for all WORKERS
class Worker(mp.Process):
    tasks = mp.Manager().list()
    task_ready = mp.Condition()
    parties = mp.Manager().Value(int, 0)
    lock = mp.Lock()

    @classmethod
    def join(cls):
        # Wait until all Data processed
        ...

    def get_task(self):
        for i, task in enumerate(Worker.tasks):
            if task is None: continue
            if not self.__class__.__name__ in task['result']:
                return (i, task['range'])
        return (None, None)

    # Main Process Loop
    def run(self):
        while True:
            # Get a Task for this WORKER
            idx, _range = self.get_task()
            if idx is None:
                break
            # Compute this _range with this Worker's compute() method
            result = self.compute(_range)
            # Update Worker.tasks
            with Worker.lock:
                task = Worker.tasks[idx]
                task['result'][self.__class__.__name__] = result
                parties = len(task['result'])
                Worker.tasks[idx] = task
            # If Last, notify Process Result
            if parties == Worker.parties.value:
                with Worker.task_ready:
                    Worker.task_ready.notify()

class Result(Worker):
    # Main Process Loop
    def run(self):
        while True:
            with Worker.task_ready:
                Worker.task_ready.wait()
            # Get (idx, _range) from tasks List
            idx, _range = self.get_task()
            if idx is None:
                break
            # process Task Results
            # Mark this tasks List Entry as done for reuse
            Worker.tasks[idx] = None

class Average(Worker):
    def compute(self, _range):
        # average of DATA[_range]
        start, end = _range
        return sum(d['close'] for d in DATA[start:end]) / (end - start)

class Median(Worker):
    def compute(self, _range):
        # median of DATA[_range]
        start, end = _range
        values = sorted(d['open'] for d in DATA[start:end])
        return values[len(values) // 2]

if __name__ == '__main__':
    DATA = mp.Manager().list()
    WORKERS = [Result(), Average(), Median()]
    # Average and Median both have to contribute to a task's result before Result is notified
    Worker.parties.value = len(WORKERS) - 1
    for w in WORKERS:
        w.start()

    # Example creates a Task every 5 Records
    for i in range(1, 16):
        DATA.append({'id': i, 'open': 300 + randrange(0, 5), 'close': 300 + randrange(-5, 5)})
        if i % 5 == 0:
            Worker.tasks.append({'range': (i-5, i), 'result': {}})

    Worker.join()
Tested with Python: 3.4.2
I am trying to use multiprocessing to handle CSV files in excess of 2 GB. The problem is that the input is only being consumed in one process while the others seem to be idling.
The following recreates the problem I am encountering. Is it possible to use multiprocessing with an iterator? Consuming the full input into memory is not desired.
import csv
import multiprocessing
import time

def something(row):
    # print row[0]
    # pass
    return row

def main():
    start = time.time()
    i = open("input.csv")
    reader = csv.reader(i, delimiter='\t')
    print reader.next()

    p = multiprocessing.Pool(16)
    print "Starting processes"
    j = p.imap(something, reader, chunksize=10000)

    count = 1
    while j:
        print j.next()

    print time.time() - start

if __name__ == '__main__':
    main()
I think you are confusing "processes" with "processors".
Your program is definitely spawning multiple processes at the same time, as you can verify in the system or resources monitor while your program is running. How many processors or CPU cores are being used depends mainly on the OS, and has a lot to do with how CPU-intensive the task you are delegating to each process is.
Make a little modification to your something function to introduce a sleep time that simulates the work being done in the function:
def something(row):
    time.sleep(.4)
    return row
Now, first run your function sequentially on each row in your file, and notice that each result comes out one by one, every 400 ms:
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print(next(reader))
        # SEQUENTIALLY:
        for row in reader:
            result = something(row)
            print(result)
Now try with the pool of workers. Keep it at a low number, say 4 workers, and you will see that the result comes every 400ms, but in groups of 4 (or roughly the number of workers in the pool):
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print(next(reader))
        # IN PARALLEL
        print("Starting processes")
        p = multiprocessing.Pool(4)
        results = p.imap(something, reader)
        for result in results:
            print(result)  # results arrive in groups of roughly 4 rows at a time
While running in parallel, check the system monitor and look for how many "python" processes are being executed. Should be one plus the number of workers.
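If you prefer checking from inside the script rather than with the system monitor, something along these lines (my own addition, not part of the code above) reports the worker count:

import multiprocessing

if __name__ == '__main__':
    p = multiprocessing.Pool(4)
    # active_children() returns the live child processes of this script, i.e. the 4 pool workers
    print(len(multiprocessing.active_children()))
    p.close()
    p.join()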
I hope this explanation is useful.
I have a situation where I have to read and write the list simultaneously.
It seems the code starts reading only after it finishes writing all the elements to the list. What I want is for the code to keep adding elements at one end while I keep processing the first 10 elements at the same time.
import csv
from multiprocessing import Pool

testlist = []
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        testlist.append(row)

def render(small):
    # do some stuff
    pass

while (len(testlist) > 0):
    pool = Pool(processes=10)
    small = testlist[:10]
    del testlist[:10]
    pool.map_async(render, small)
    pool.close()
    pool.join()
You need a queue that is shared between processes. One process adds items to the queue, and the other takes items from the queue and processes them.
Here is a simplified example:
from multiprocessing import Process, Queue

def put(queue):
    # writes items to the queue
    queue.put('something')

def get(queue):
    # gets item from the queue to work on
    while True:
        item = queue.get()
        # do something with item

if __name__ == '__main__':
    queue = Queue()

    getter_process = Process(target=get, args=(queue,))
    getter_process.daemon = True
    getter_process.start()

    put(queue)  # Send something to the queue
    getter_process.join()  # Wait for the getter to finish
If you want to process only 10 things at a time, you can limit the queue size to 10. This means the "writer" cannot add anything more while the queue already has 10 items waiting to be processed.
By default, the queue has no bounds/limits. The documentation is a good place to start for more on queues.
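As a minimal illustration of the bounded-queue idea (maxsize is the only change compared to the example above):

from multiprocessing import Queue

queue = Queue(maxsize=10)  # put() now blocks whenever 10 items are already waiting to be processed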
You can do it like this:
x = []
y = [1, 2, 3, 4, 5, 6, 7, ...]

for i in y:
    x.append(i)
    if len(x) < 10:
        print x
    else:
        print x[:10]
        x = x[10:]
PS: assuming y is an infinite stream