I have a dataset, sitting in a .txt file, consisting of 10 million rows in the form of RDF triples, like this:
wsdbm:User0 wsdbm:follows wsdbm:User300 .
wsdbm:User6 wsdbm:likes wsdbm:Product92 .
wsdbm:Product0 rev:hasReview wsdbm:Review478 .
wsdbm:User2 wsdbm:friendOf wsdbm:User119 .
....
Since these are RDF triples, in our case we have
Subjects: User0, User6, Product0, User2
Predicates: follows, likes, hasReview, friendOf
Objects: User300, Product92, Review478, User119
My goal is to write a query in the SQL form:
SELECT follows.subject, follows.object, friendOf.object,
likes.object, hasReview.object
FROM follows, friendOf, likes, hasReview
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
So far, I have created a class called PropertyTables, which has a method that iterates over the initial file and converts each subject, predicate and object into an integer, to improve computational time on the join and to save memory:
class PropertyTables():
"""
This class holds all 4 Property Tables necessary for the required query.
Each Property Table is an instance of the class 'PropertyTable'.
"""
def __init__(self):
self.property_tables = defaultdict()
self.hash_map = HashDict()
def parse_file(self, file_path, remove_prefix = False):
data = open(file_path, 'r')
for line in data:
subj, prop, *obj = line.rstrip('\n.').split('\t')
obj = obj[0].rstrip()
if remove_prefix:
subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
self.hash_and_store(subj, prop, obj)
data.close()
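The remove_prefix helper is not shown; here is a minimal sketch of what it might look like, assuming the namespace prefix (wsdbm:, rev:, ...) always ends at the first colon:
def remove_prefix(self, term):
    # Hypothetical helper: 'wsdbm:User0' -> 'User0', 'rev:hasReview' -> 'hasReview'.
    return term.split(':', 1)[-1]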
The class PropertyTable, mentioned in the docstring:
class PropertyTable():
"""
This class represents a single Property Table, i.e. it holds every Subject and Object
"""
def __init__(self):
self.table = []
def insert(self, r, s):
# If r is already a tuple (a previously joined row), concatenate it with s.
# Otherwise r and s are single values and we store them as a new 2-tuple.
# This is mostly relevant when creating the Property Tables while reading the data.
if type(r) == tuple:
self.table.append(r + s)
else:
self.table.append((r, s))
The class HashDict() is a simple dictionary that hashes values, so we can retrieve them again after the join.
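HashDict is only described here, not shown; below is a minimal sketch of one way it might work, assuming it interns each string as a consecutive integer and keeps a reverse list so the original values can be recovered after the join (hash_values matches the call made in hash_and_store later in this post; the lookup method is a hypothetical addition):
class HashDict:
    # Minimal sketch, not the original implementation: interns strings as
    # consecutive integers and keeps a reverse list for lookups after the join.
    def __init__(self):
        self.to_int = {}     # string -> int
        self.to_str = []     # int -> string

    def _intern(self, value):
        if value not in self.to_int:
            self.to_int[value] = len(self.to_str)
            self.to_str.append(value)
        return self.to_int[value]

    def hash_values(self, subj, obj):
        # Returns integer ids for a subject/object pair.
        return self._intern(subj), self._intern(obj)

    def lookup(self, i):
        # Hypothetical reverse lookup: integer id back to the original string.
        return self.to_str[i]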
So as not to make this post too long, I currently have a single hash join algorithm:
def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
ht = defaultdict(list)
# Create Hash Table for table1
for s in property_1.table:
ht[s[index_0]].append(s)
# Join Tables
joined_table = PropertyTable()
for r in property_2.table:
for s in ht[r[index_1]]:
joined_table.insert(s, r)
return joined_table
I use this function to join the tables sequentially, following the conditions from before:
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
join_follows_friendOf = hash_join(pt.property_tables['follows'], 1, pt.property_tables['friendOf'], 0)
join_friendOf_likes = hash_join(join_follows_friendOf, 3, pt.property_tables['likes'], 0)
join_likes_hasReview = hash_join(join_friendOf_likes, 5, pt.property_tables['hasReview'], 0)
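To clarify the index arguments (1, 3, 5): they refer to positions in the progressively concatenated rows. Assuming the integer ids produced by HashDict, the layouts look roughly like this:
# follows row:              (f_subj, f_obj)
# after joining friendOf:   (f_subj, f_obj, fo_subj, fo_obj)                  -> index 3 = friendOf.object
# after joining likes:      (f_subj, f_obj, fo_subj, fo_obj, l_subj, l_obj)   -> index 5 = likes.object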
The result is correct for small tables, but 10 million rows simply result in an Out of Memory Error and I am looking for ways to avoid this. Sorry for the very long post, but I think some details are necessary to get useful advice!
Edit:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
53 68.0 MiB 68.0 MiB 1 #profile
54 def hash_and_store(self, subj, prop, obj):
55
56 68.0 MiB 0.0 MiB 1 hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
57
58 68.0 MiB 0.0 MiB 1 if prop not in self.property_tables:
59 self.property_tables[prop] = PropertyTable()
60 68.0 MiB 0.0 MiB 1 self.property_tables[prop].insert(hashed_subj, hashed_obj)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
32 68.1 MiB 68.1 MiB 1 #profile
33 def parse_file(self, file_path, remove_prefix = False):
34
35 68.1 MiB 0.0 MiB 1 data = open(file_path, 'r')
36
37
38
39
40
41 80.7 MiB 0.3 MiB 109311 for line in data:
42 80.7 MiB 0.0 MiB 109310 subj, prop, *obj = line.rstrip('\n.').split('\t')
43 80.7 MiB 0.5 MiB 109310 obj = obj[0].rstrip()
44
45 80.7 MiB 0.0 MiB 109310 if remove_prefix:
46 80.7 MiB 9.0 MiB 655860 subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
47
48 80.7 MiB 0.0 MiB 109310 if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
49 80.7 MiB 2.8 MiB 80084 self.hash_and_store(subj, prop, obj)
50
51 80.7 MiB 0.0 MiB 1 data.close()
Line # Mem usage Increment Occurrences Line Contents
=============================================================
38 80.7 MiB 80.7 MiB 1 #profile
39 def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
40
41 80.7 MiB 0.0 MiB 1 ht = defaultdict(list)
42
43 # Create Hash Table for table1
44
45 81.2 MiB 0.0 MiB 31888 for s in property_1.table:
46 81.2 MiB 0.5 MiB 31887 ht[s[index_0]].append(s)
47
48 # Join Tables
49
50 81.2 MiB 0.0 MiB 1 joined_table = PropertyTable()
51
52 203.8 MiB 0.0 MiB 45713 for r in property_2.table:
53 203.8 MiB 0.0 MiB 1453580 for s in ht[r[index_1]]:
54 203.8 MiB 122.6 MiB 1407868 joined_table.insert(s, r)
55
56 203.8 MiB 0.0 MiB 1 return joined_table
The core of your question is this:
The result is correct for small tables, but 10 million rows simply result in an Out of Memory Error and I am looking for ways to avoid this.
Following your top-level problem statement but with a less generic structure, we can do something like this:
def runQuery(dataLines):
from collections import defaultdict
pred = dict(zip(['follows','friendOf','likes','hasReview'],range(4)))
tables = [defaultdict(list) for _ in pred]
def encode(s):
if s[-1].isdigit():
i = 0
while s[-1 - i].isdigit():
i += 1
return int(s[-i:])
if any(s.endswith(k) for k in pred):
return sum(v for k, v in pred.items() if s.endswith(k))
return None
for line in dataLines:
if not line:
continue
subj, prop, *obj = line.rstrip('\n.').split('\t')
obj = obj[0].rstrip()
subj, prop, obj = [encode(s) for s in (subj, prop, obj)]
if prop is not None:
tables[prop][subj].append(obj)
tables = [{k:tuple(v) for k, v in table.items()} for table in tables]
#[print(list(pred.keys())[i], tables[i], sep='\n') for i in range(len(pred))]
# create reverse index for subject, object where subject [user] follows object [user]
object_of_follows = defaultdict(set)
for k, v in tables[pred['follows']].items():
for user in v:
object_of_follows[user].add(k)
# create reverse index for subject, object where subject [user] is friendOf object [user]
object_of_friendOf = defaultdict(set)
for k, v in tables[pred['friendOf']].items():
if k in object_of_follows:
for user in v:
object_of_friendOf[user].add(k)
# create reverse index for subject, object where subject [user] likes object [product]
object_of_likes = defaultdict(set)
for k, v in tables[pred['likes']].items():
if k in object_of_friendOf:
for product in v:
object_of_likes[product].add(k)
# create reverse index for subject, object where subject [product] hasReview object [review]
object_of_hasReview = defaultdict(set)
for k, v in tables[pred['hasReview']].items():
if k in object_of_likes:
for review in v:
object_of_hasReview[review].add(k)
def addToResult(result, e):
d = object_of_hasReview[e]
c = {y for x in d for y in object_of_likes[x]}
b = {y for x in c for y in object_of_friendOf[x]}
a = {y for x in b for y in object_of_follows[x]}
toAdd = [(ax, bx, cx, dx, e) for dx in d for cx in c for bx in b for ax in a]
result += toAdd
result = []
for e in object_of_hasReview:
addToResult(result, e)
print(f'result row count {len(result):,}')
return result
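A hypothetical way to call it, streaming the lines straight from the triples file (the filename is an assumption; runQuery only needs an iterable of lines):
with open('watdiv_10M.txt') as f:      # filename assumed
    result = runQuery(f)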
Explanation:
Create a list of 4 tables (follows, friendOf, likes, hasReview), each a dictionary mapping subject to a tuple of objects
Create 4 reverse indexes (object_of_follows, object_of_friendOf, object_of_likes, object_of_hasReview); a short sketch of this construction follows the list. For example:
object_of_follows is a dict that maps each user that is an object in follows to a set of users, each of which is a subject in follows that follows the object
object_of_friendOf is a dict that maps each object (user) in friendOf to a set of users, each of which is a subject (user) associated with the object in friendOf and is in object_of_follows (in other words, is an object for one or more subjects in follows)
etc.
Explode each review that survived in object_of_hasReview into multiple result rows, each containing a unique combination of follows.subject, follows.object, friendOf.object, likes.object, hasReview.object, as specified in the query
Return the list of all such exploded rows.
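To make the reverse-index construction concrete, here is a tiny illustration with made-up integer ids (not taken from the dataset):
from collections import defaultdict

# 'follows' table as built above: subject -> tuple of objects (made-up ids)
follows = {0: (10, 11), 1: (11,)}

object_of_follows = defaultdict(set)
for subj, objs in follows.items():
    for obj in objs:
        object_of_follows[obj].add(subj)

# object_of_follows is now {10: {0}, 11: {0, 1}}: each followed user maps
# back to the set of users who follow them.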
Test code for 10 million lines:
dataLines = []
numFollowers = 1000
numChildren = 10
overlapFactor = max(1, numChildren // 2)
def largerPowerOfTen(x):
y = 1
while x >= y:
y *= 10
return y
aCeil = largerPowerOfTen(numFollowers)
bCeil = largerPowerOfTen(aCeil * numChildren)
cCeil = largerPowerOfTen(bCeil * numChildren)
dCeil = largerPowerOfTen(cCeil * numChildren)
friendOf, likes = set(), set()
for a in range(numFollowers):
for b in range(aCeil + a * overlapFactor, aCeil + a * overlapFactor + numChildren):
dataLines.append(f'wsdbm:User{a} wsdbm:follows wsdbm:User{b} .\n')
for c in range(bCeil + b * overlapFactor, bCeil + b * overlapFactor + numChildren):
if (b,c) not in friendOf:
dataLines.append(f'wsdbm:User{b} wsdbm:friendOf wsdbm:User{c} .\n')
friendOf.add((b,c))
for d in range(cCeil + c * overlapFactor, cCeil + c * overlapFactor + numChildren):
if (c,d) not in likes:
dataLines.append(f'wsdbm:User{c} wsdbm:likes wsdbm:Product{d} .\n')
likes.add((c,d))
for e in range(dCeil * (d + 1), dCeil * (d + 1) + numChildren):
dataLines.append(f'wsdbm:Product{d} wsdbm:hasReview wsdbm:Review{e} .\n')
print(f'dataLines row count {len(dataLines):,}')
from timeit import timeit
n = 1
print(f'Timeit results:')
t = timeit(f"runQuery(dataLines)", setup=f"from __main__ import dataLines, runQuery", number=n) / n
print(f'======== runQuery ran in {t} seconds using {n} iterations')
'''
result = runQuery(dataLines)
print(f'result row count {len(result):,}')
print(f'{"follows.subject":>20}{"follows.object":>20}{"friendsOf.object":>20}{"likes.object":>20}{"hasReview.object":>20}')
[print(f'{a:20}{b:20}{c:20}{d:20}{e:20}') for a,b,c,d,e in result]
'''
Output:
dataLines row count 10,310,350
Timeit results:
result row count 12,398,500
======== runQuery ran in 81.53253880003467 seconds using 1 iterations
Here's input/output from a smaller-scale sample run:
Params
numFollowers = 3
numChildren = 3
overlapFactor = 2
Input (after storing in tables):
follows
{0: (10, 11, 12), 1: (12, 13, 14), 2: (14, 15, 16)}
friendOf
{10: (120, 121, 122), 11: (122, 123, 124), 12: (124, 125, 126), 13: (126, 127, 128), 14: (128, 129, 130), 15: (130, 131, 132), 16: (132, 133, 134)}
likes
{120: (1240, 1241, 1242), 121: (1242, 1243, 1244), 122: (1244, 1245, 1246), 123: (1246, 1247, 1248), 124: (1248, 1249, 1250), 125: (1250, 1251, 1252), 126: (1252, 1253, 1254), 127: (1254, 1255, 1256), 128: (1256, 1257, 1258), 129: (1258, 1259, 1260), 130: (1260, 1261, 1262), 131: (1262, 1263, 1264), 132: (1264, 1265, 1266), 133: (1266, 1267, 1268), 134: (1268, 1269, 1270)}
hasReview
{1240: (12410000, 12410001, 12410002), 1241: (12420000, 12420001, 12420002), 1242: (12430000, 12430001, 12430002, 12430000, 12430001, 12430002), 1243: (12440000, 12440001, 12440002), 1244: (12450000, 12450001, 12450002, 12450000, 12450001, 12450002, 12450000, 12450001, 12450002), 1245: (12460000, 12460001, 12460002, 12460000, 12460001, 12460002), 1246: (12470000, 12470001, 12470002, 12470000, 12470001, 12470002, 12470000, 12470001, 12470002), 1247: (12480000, 12480001, 12480002), 1248: (12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002), 1249: (12500000, 12500001, 12500002, 12500000, 12500001, 12500002, 12500000, 12500001, 12500002), 1250: (12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002), 1251: (12520000, 12520001, 12520002, 12520000, 12520001, 12520002), 1252: (12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002), 1253: (12540000, 12540001, 12540002, 12540000, 12540001, 12540002, 12540000, 12540001, 12540002), 1254: (12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002), 1255: (12560000, 12560001, 12560002), 1256: (12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002), 1257: (12580000, 12580001, 12580002, 12580000, 12580001, 12580002, 12580000, 12580001, 12580002), 1258: (12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002), 1259: (12600000, 12600001, 12600002, 12600000, 12600001, 12600002), 1260: (12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002), 1261: (12620000, 12620001, 12620002, 12620000, 12620001, 12620002, 12620000, 12620001, 12620002), 1262: (12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002), 1263: (12640000, 12640001, 12640002), 1264: (12650000, 12650001, 12650002, 12650000, 12650001, 12650002, 12650000, 12650001, 12650002), 1265: (12660000, 12660001, 12660002, 12660000, 12660001, 12660002), 1266: (12670000, 12670001, 12670002, 12670000, 12670001, 12670002, 12670000, 12670001, 12670002), 1267: (12680000, 12680001, 12680002), 1268: (12690000, 12690001, 12690002, 12690000, 12690001, 12690002), 1269: (12700000, 12700001, 12700002), 1270: (12710000, 12710001, 12710002)}
Output
result row count 351
follows.subject follows.object friendsOf.object likes.object hasReview.object
0 10 120 1240 12410000
0 10 120 1240 12410001
0 10 120 1240 12410002
0 10 120 1241 12420000
0 10 120 1241 12420001
0 10 120 1241 12420002
0 10 120 1242 12430000
0 10 121 1242 12430000
0 10 120 1242 12430001
0 10 121 1242 12430001
0 10 120 1242 12430002
0 10 121 1242 12430002
0 10 121 1243 12440000
0 10 121 1243 12440001
0 10 121 1243 12440002
0 10 121 1244 12450000
0 11 121 1244 12450000
0 10 122 1244 12450000
0 11 122 1244 12450000
0 10 121 1244 12450001
0 11 121 1244 12450001
0 10 122 1244 12450001
0 11 122 1244 12450001
0 10 121 1244 12450002
0 11 121 1244 12450002
etc.
I have a bunch of pandas dataframes I would like to write out to any format (csv, json, etc.), and I would like to preserve the order, based on the order in which the dataframes were read. Unfortunately .to_csv() can take some time, sometimes 2x longer than just reading the dataframe.
Take the following scenario as an example: running the task linearly means reading a dataframe, writing it out, then repeating for the remaining dataframes. This can take about 3x longer than just reading the dataframes. Theoretically, if we can push the writing (to_csv()) onto separate threads (2 threads, plus the main thread doing the reading), the total execution time could come down to almost a third of the linear (synchronous) version. Of course, with just 3 reads it only looks about half as fast, but the more dataframes you read, the bigger the gain (theoretically).
Unfortunately, it does not actually work like that. I am getting only a very small gain in performance, and the read time actually takes longer. This might be because to_csv() is CPU intensive and uses all the resources in the process; since everything runs as threads within the same process, it all shares the same resources, so there is not much to gain.
So my question is: how can I improve the code to get performance closer to the theoretical numbers? I tried using multiprocessing but failed to get working code. How can I do this with multiprocessing? Are there other ways I could improve the total execution time of such a task?
Here's my sample code using multiple threads:
import pandas as pd
import datetime
import os
from threading import Thread
import queue
from io import StringIO
from line_profiler import LineProfiler
NUMS = 500
DEVNULL = open(os.devnull, 'w')
HEADERS = ",a,b,c,d,e,f,g\n"
SAMPLE_CSV = HEADERS + "\n".join([f"{x},{x},{x},{x},{x},{x},{x},{x}" for x in range(4000)])
def linear_test():
print("------Linear Test-------")
main_start = datetime.datetime.now()
total_read_time = datetime.timedelta(0)
total_add_task = datetime.timedelta(0)
total_to_csv_time = datetime.timedelta(0)
total_to_print = datetime.timedelta(0)
for x in range(NUMS):
start = datetime.datetime.now()
df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
total_read_time += datetime.datetime.now() - start
start = datetime.datetime.now()
#
total_add_task += datetime.datetime.now() - start
start = datetime.datetime.now()
data = df.to_csv()
total_to_csv_time += datetime.datetime.now() - start
start = datetime.datetime.now()
print(data, file=DEVNULL)
total_to_print += datetime.datetime.now() - start
print("total_read_time: {}".format(total_read_time))
print("total_add_task: {}".format(total_add_task))
print("total_to_csv_time: {}".format(total_to_csv_time))
print("total_to_print: {}".format(total_to_print))
print("total: {}".format(datetime.datetime.now() - main_start))
class Handler():
def __init__(self, num_workers=1):
self.num_workers = num_workers
self.total_num_jobs = 0
self.jobs_completed = 0
self.answers_sent = 0
self.jobs = queue.Queue()
self.results = queue.Queue()
self.start_workers()
def add_task(self, task, *args, **kwargs):
args = args or ()
kwargs = kwargs or {}
self.total_num_jobs += 1
self.jobs.put((task, args, kwargs))
def start_workers(self):
for i in range(self.num_workers):
t = Thread(target=self.worker)
t.daemon = True
t.start()
def worker(self):
while True:
item, args, kwargs = self.jobs.get()
item(*args, **kwargs)
self.jobs_completed += 1
self.jobs.task_done()
def get_answers(self):
while self.answers_sent < self.total_num_jobs or self.jobs_completed == 0:
yield self.results.get()
self.answers_sent += 1
self.results.task_done()
def task(task_num, df, q):
ans = df.to_csv()
q.put((task_num, ans))
def parallel_test():
print("------Parallel Test-------")
main_start = datetime.datetime.now()
total_read_time = datetime.timedelta(0)
total_add_task = datetime.timedelta(0)
total_to_csv_time = datetime.timedelta(0)
total_to_print = datetime.timedelta(0)
h = Handler(num_workers=2)
q = h.results
answers = {}
curr_task = 1
t = 1
for x in range(NUMS):
start = datetime.datetime.now()
df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
total_read_time += datetime.datetime.now() - start
start = datetime.datetime.now()
h.add_task(task, t, df, q)
t += 1
total_add_task += datetime.datetime.now() - start
start = datetime.datetime.now()
#data = df.to_csv()
total_to_csv_time += datetime.datetime.now() - start
start = datetime.datetime.now()
#print(data, file=DEVNULL)
total_to_print += datetime.datetime.now() - start
print("total_read_time: {}".format(total_read_time))
print("total_add_task: {}".format(total_add_task))
print("total_to_csv_time: {}".format(total_to_csv_time))
print("total_to_print: {}".format(total_to_print))
for task_num, ans in h.get_answers():
#print("got back: {}".format(task_num, ans))
answers[task_num] = ans
if curr_task in answers:
print(answers[curr_task], file=DEVNULL)
del answers[curr_task]
curr_task += 1
# In case others are left out
for k, v in answers.items():
print(k)
h.jobs.join() # block until all tasks are done
print("total: {}".format(datetime.datetime.now() - main_start))
if __name__ == "__main__":
# linear_test()
# parallel_test()
lp = LineProfiler()
lp_wrapper = lp(linear_test)
lp_wrapper()
lp.print_stats()
lp = LineProfiler()
lp_wrapper = lp(parallel_test)
lp_wrapper()
lp.print_stats()
The output is below. You can see that in the linear test, reading the data frames only took 4.6 seconds (42% of the total execution), but reading the data frames in the parallel test took 9.7 seconds (93% of the total execution):
------Linear Test-------
total_read_time: 0:00:04.672765
total_add_task: 0:00:00.001000
total_to_csv_time: 0:00:05.582663
total_to_print: 0:00:00.668319
total: 0:00:10.935723
Timer unit: 1e-07 s
Total time: 10.9309 s
File: ./test.py
Function: linear_test at line 33
Line # Hits Time Per Hit % Time Line Contents
==============================================================
33 def linear_test():
34 1 225.0 225.0 0.0 print("------Linear Test-------")
35 1 76.0 76.0 0.0 main_start = datetime.datetime.now()
36 1 32.0 32.0 0.0 total_read_time = datetime.timedelta(0)
37 1 11.0 11.0 0.0 total_add_task = datetime.timedelta(0)
38 1 9.0 9.0 0.0 total_to_csv_time = datetime.timedelta(0)
39 1 9.0 9.0 0.0 total_to_print = datetime.timedelta(0)
40
41 501 3374.0 6.7 0.0 for x in range(NUMS):
42
43 500 5806.0 11.6 0.0 start = datetime.datetime.now()
44 500 46728029.0 93456.1 42.7 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
45 500 40199.0 80.4 0.0 total_read_time += datetime.datetime.now() - start
46
47 500 6821.0 13.6 0.0 start = datetime.datetime.now()
48 #
49 500 6916.0 13.8 0.0 total_add_task += datetime.datetime.now() - start
50
51 500 5794.0 11.6 0.0 start = datetime.datetime.now()
52 500 55843605.0 111687.2 51.1 data = df.to_csv()
53 500 53640.0 107.3 0.0 total_to_csv_time += datetime.datetime.now() - start
54
55 500 6798.0 13.6 0.0 start = datetime.datetime.now()
56 500 6589129.0 13178.3 6.0 print(data, file=DEVNULL)
57 500 18258.0 36.5 0.0 total_to_print += datetime.datetime.now() - start
58
59 1 221.0 221.0 0.0 print("total_read_time: {}".format(total_read_time))
60 1 95.0 95.0 0.0 print("total_add_task: {}".format(total_add_task))
61 1 87.0 87.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
62 1 85.0 85.0 0.0 print("total_to_print: {}".format(total_to_print))
63 1 112.0 112.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
------Parallel Test-------
total_read_time: 0:00:09.779954
total_add_task: 0:00:00.016984
total_to_csv_time: 0:00:00.003000
total_to_print: 0:00:00.001001
total: 0:00:10.488563
Timer unit: 1e-07 s
Total time: 10.4803 s
File: ./test.py
Function: parallel_test at line 106
Line # Hits Time Per Hit % Time Line Contents
==============================================================
106 def parallel_test():
107 1 100.0 100.0 0.0 print("------Parallel Test-------")
108 1 33.0 33.0 0.0 main_start = datetime.datetime.now()
109 1 24.0 24.0 0.0 total_read_time = datetime.timedelta(0)
110 1 10.0 10.0 0.0 total_add_task = datetime.timedelta(0)
111 1 10.0 10.0 0.0 total_to_csv_time = datetime.timedelta(0)
112 1 10.0 10.0 0.0 total_to_print = datetime.timedelta(0)
113 1 13550.0 13550.0 0.0 h = Handler(num_workers=2)
114 1 15.0 15.0 0.0 q = h.results
115 1 9.0 9.0 0.0 answers = {}
116 1 7.0 7.0 0.0 curr_task = 1
117 1 7.0 7.0 0.0 t = 1
118
119 501 5017.0 10.0 0.0 for x in range(NUMS):
120 500 6545.0 13.1 0.0 start = datetime.datetime.now()
121 500 97761876.0 195523.8 93.3 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
122 500 45702.0 91.4 0.0 total_read_time += datetime.datetime.now() - start
123
124 500 8259.0 16.5 0.0 start = datetime.datetime.now()
125 500 167269.0 334.5 0.2 h.add_task(task, t, df, q)
126 500 5009.0 10.0 0.0 t += 1
127 500 11865.0 23.7 0.0 total_add_task += datetime.datetime.now() - start
128
129 500 6949.0 13.9 0.0 start = datetime.datetime.now()
130 #data = df.to_csv()
131 500 7921.0 15.8 0.0 total_to_csv_time += datetime.datetime.now() - start
132
133 500 6498.0 13.0 0.0 start = datetime.datetime.now()
134 #print(data, file=DEVNULL)
135 500 8084.0 16.2 0.0 total_to_print += datetime.datetime.now() - start
136
137 1 3321.0 3321.0 0.0 print("total_read_time: {}".format(total_read_time))
138 1 4669.0 4669.0 0.0 print("total_add_task: {}".format(total_add_task))
139 1 1995.0 1995.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
140 1 113037.0 113037.0 0.1 print("total_to_print: {}".format(total_to_print))
141
142 501 176106.0 351.5 0.2 for task_num, ans in h.get_answers():
143 #print("got back: {}".format(task_num, ans))
144 500 5169.0 10.3 0.0 answers[task_num] = ans
145 500 4160.0 8.3 0.0 if curr_task in answers:
146 500 6429159.0 12858.3 6.1 print(answers[curr_task], file=DEVNULL)
147 500 5646.0 11.3 0.0 del answers[curr_task]
148 500 4144.0 8.3 0.0 curr_task += 1
149
150 # In case others are left out
151 1 24.0 24.0 0.0 for k, v in answers.items():
152 print(k)
153
154 1 61.0 61.0 0.0 h.jobs.join() # block until all tasks are done
155
156 1 328.0 328.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
Rather than rolling your own solution, you may want to look at Dask, particularly Dask's Distributed DataFrame if you want to read multiple CSV files into one "virtual" big DataFrame, or Delayed to run functions, as in your example, in parallel across multiple cores. See the lightweight examples there if you scroll down: https://docs.dask.org/en/latest/
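For illustration only, a rough sketch of the Delayed route applied to a read-then-serialize loop like yours (read_and_convert and the paths list are placeholders, not your actual code):
import pandas as pd
from dask import delayed, compute

def read_and_convert(path):
    # Placeholder for the serial body of your loop: read one CSV, serialize it back to text.
    df = pd.read_csv(path, header=0, index_col=0)
    return df.to_csv()

paths = ["df_0.csv", "df_1.csv", "df_2.csv"]        # placeholder file list
tasks = [delayed(read_and_convert)(p) for p in paths]
# compute() returns the results in the same order as the tasks, so the
# original ordering of the dataframes is preserved.
csv_texts = compute(*tasks, scheduler="processes")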
Your other lightweight choice is to use Joblib's Parallel interface; it looks a lot like Delayed but with much less functionality. I tend to go for Joblib if I want a lightweight solution, then upgrade to Dask if I need more: https://joblib.readthedocs.io/en/latest/parallel.html
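And a comparable sketch with Joblib (same placeholders as above):
import pandas as pd
from joblib import Parallel, delayed

def read_and_convert(path):
    # Placeholder for the serial body of your loop, same as in the Dask sketch.
    df = pd.read_csv(path, header=0, index_col=0)
    return df.to_csv()

paths = ["df_0.csv", "df_1.csv", "df_2.csv"]        # placeholder file list
# n_jobs=-1 uses all cores; results come back in submission order, so the
# original ordering of the dataframes is preserved.
csv_texts = Parallel(n_jobs=-1)(delayed(read_and_convert)(p) for p in paths)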
For both tools, if you go down the delayed route: write a function that works in a for loop in series (you have this already), then wrap it in the respective delayed syntax and "it should just work", as in the sketches above. In both cases it will use all the cores on your machine by default.