Memory Profiler giving constant memory in all steps - python

I want to get the change in memory for every step in my function.
I have written the code for interpolation search, and even with an input as large as 10,000 elements in a list, there is still no change in memory.
The code is:
import time
from memory_profiler import profile

@profile
def interpolation_search(numbers, value):
    low = 0
    high = len(numbers) - 1
    mid = 0
    while numbers[low] <= value and numbers[high] >= value:
        mid = low + ((value - numbers[low]) * (high - low)) / (numbers[high] - numbers[low])
        if numbers[mid] < value:
            low = mid + 1
        elif numbers[mid] > value:
            high = mid - 1
        else:
            return mid
    if numbers[low] == value:
        return low
    else:
        return -1

if __name__ == "__main__":
    # Pre-sorted numbers
    numbers = [-100, -6, 0, 1, 5, 14, 15, 26, 28, 29, 30, 31, 35, 37, 39, 40, 41, 42]
    num = []
    for i in range(100000):
        num.append(i)
    value = 15
    # Print numbers to search
    print 'Numbers:'
    print ' '.join([str(i) for i in numbers])
    # Find the index of 'value'
    start_time1 = time.time()
    index = interpolation_search(numbers, value)
    # Print the index where 'value' is located
    print '\nNumber %d is at index %d' % (value, index)
    print("--- Run Time %s seconds---" % (time.time() - start_time1))
The output that I am getting is:
Numbers:
-100 -6 0 1 5 14 15 26 28 29 30 31 35 37 39 40 41 42
Filename: C:/Users/Admin/PycharmProjects/timenspace/Interpolation.py
Line # Mem usage Increment Line Contents
================================================
4 21.5 MiB 0.0 MiB #profile()
5 def interpolation_search(numbers, value):
6 21.5 MiB 0.0 MiB low = 0
7 21.5 MiB 0.0 MiB high = len(numbers) - 1
8 21.5 MiB 0.0 MiB mid = 0
9
10 21.5 MiB 0.0 MiB while numbers[low] <= value and numbers[high] >= value:
11 21.5 MiB 0.0 MiB mid = low + ((value - numbers[low]) * (high - low)) / (numbers[high] - numbers[low])
12
13 21.5 MiB 0.0 MiB if numbers[mid] < value:
14 low = mid + 1
15
16 21.5 MiB 0.0 MiB elif numbers[mid] > value:
17 21.5 MiB 0.0 MiB high = mid - 1
18 else:
19 21.5 MiB 0.0 MiB return mid
20
21 if numbers[low] == value:
22 return low
23 else:
24 return -1
Number 15 is at index 6
--- Run Time 0.0429999828339 seconds---
As you can see, the memory remains constant at 21.5 MiB in all steps.
Please help me with this. Thank you.

Why do you expect it to increase? I don't see any memory allocations, i.e., the list numbers does not grow in size.
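For contrast, a minimal sketch (not from the post) of a function that does allocate inside the profiled body; memory_profiler should then report a non-zero increment on the line that builds the list:

from memory_profiler import profile

@profile
def build_big_list(n):
    # Allocating a large list inside the profiled function shows up as a
    # positive increment on this line in the report.
    data = list(range(n))
    total = sum(data)
    # Deleting the list usually shows up as a negative increment.
    del data
    return total

if __name__ == "__main__":
    build_big_list(10 ** 7)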

Related

Getting Out of Memory Error for Join Algorithm

I got a dataset, sitting in a .txt file, consisting of 10 million rows in the form of RDF triples, like so:
wsdbm:User0 wsdbm:follows wsdbm:User300 .
wsdbm:User6 wsdbm:likes wsdbm:Product92 .
wsdbm:Product0 rev:hasReview wsdbm:Review478 .
wsdbm:User2 wsdbm:friendOf wsdbm:User119 .
....
Since these are RDF triples, in our case we have
Subjects: User0, User6, Product, User2
Predicates: follows, likes, hasReview, friendOf
Objects: User300, Product92, Review478, User119
My goal is to write a query in the SQL form:
SELECT follows.subject, follows.object, friendOf.object,
likes.object, hasReview.object
FROM follows, friendOf, likes, hasReview
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
So far, I have created a class called PropertyTables, which has a method that iterates over the initial file and converts each subject, predicate and object into an integer, to improve computational time on the join and save memory:
class PropertyTables():
    """
    This class holds all 4 Property Tables necessary for the required query.
    Each Property Table is an instance of the class 'PropertyTable'.
    """
    def __init__(self):
        self.property_tables = defaultdict()
        self.hash_map = HashDict()

    def parse_file(self, file_path, remove_prefix=False):
        data = open(file_path, 'r')
        for line in data:
            subj, prop, *obj = line.rstrip('\n.').split('\t')
            obj = obj[0].rstrip()
            if remove_prefix:
                subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
            if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
                self.hash_and_store(subj, prop, obj)
        data.close()
The class PropertyTable, mentioned in the docstring above:
class PropertyTable():
    """
    This class represents a single Property Table, i.e. it holds every Subject and Object.
    """
    def __init__(self):
        self.table = []

    def insert(self, r, s):
        # If r and s are already tuples, they get appended to the Property Table.
        # Otherwise, we convert them to a tuple beforehand. This is mostly relevant when
        # creating the Property Tables while reading the data.
        if type(r) == tuple:
            self.table.append(r + s)
        else:
            self.table.append((r, s))
The class HashDict() is a simple dictionary that hashes values, so we can retrieve them again after the join.
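HashDict itself is not shown in the post; purely as an assumption for context, a minimal sketch of such a class might look like this:

class HashDict(dict):
    """Hypothetical sketch (the real class is not shown in the post):
    maps each string to a small integer id and keeps the reverse mapping
    so the original strings can be recovered after the join."""

    def __init__(self):
        super().__init__()
        self.reverse = {}

    def hash_values(self, *values):
        ids = []
        for v in values:
            if v not in self:
                self[v] = len(self)      # next unused integer id
                self.reverse[self[v]] = v
            ids.append(self[v])
        return tuple(ids)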
To avoid going too far with one post, I will just show the single hash join algorithm I have for now:
def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
    ht = defaultdict(list)
    # Create Hash Table for table1
    for s in property_1.table:
        ht[s[index_0]].append(s)
    # Join Tables
    joined_table = PropertyTable()
    for r in property_2.table:
        for s in ht[r[index_1]]:
            joined_table.insert(s, r)
    return joined_table
I use this function to sequentially join each table, given the requirements from before.
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
join_follows_friendOf = hash_join(pt.property_tables['follows'], 1, pt.property_tables['friendOf'], 0)
join_friendOf_likes = hash_join(join_follows_friendOf, 3, pt.property_tables['likes'], 0)
join_likes_hasReview = hash_join(join_friendOf_likes, 5, pt.property_tables['hasReview'], 0)
The result is correct for small tables, but 10 million rows simply result in an Out of Memory error, and I am looking for ways to avoid this. I am sorry for this very extensive post, but I guess some details are necessary in order to get some advice!
Edit:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
53 68.0 MiB 68.0 MiB 1 #profile
54 def hash_and_store(self, subj, prop, obj):
55
56 68.0 MiB 0.0 MiB 1 hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
57
58 68.0 MiB 0.0 MiB 1 if prop not in self.property_tables:
59 self.property_tables[prop] = PropertyTable()
60 68.0 MiB 0.0 MiB 1 self.property_tables[prop].insert(hashed_subj, hashed_obj)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
32 68.1 MiB 68.1 MiB 1 #profile
33 def parse_file(self, file_path, remove_prefix = False):
34
35 68.1 MiB 0.0 MiB 1 data = open(file_path, 'r')
36
37
38
39
40
41 80.7 MiB 0.3 MiB 109311 for line in data:
42 80.7 MiB 0.0 MiB 109310 subj, prop, *obj = line.rstrip('\n.').split('\t')
43 80.7 MiB 0.5 MiB 109310 obj = obj[0].rstrip()
44
45 80.7 MiB 0.0 MiB 109310 if remove_prefix:
46 80.7 MiB 9.0 MiB 655860 subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
47
48 80.7 MiB 0.0 MiB 109310 if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
49 80.7 MiB 2.8 MiB 80084 self.hash_and_store(subj, prop, obj)
50
51 80.7 MiB 0.0 MiB 1 data.close()
Line # Mem usage Increment Occurrences Line Contents
=============================================================
38 80.7 MiB 80.7 MiB 1 #profile
39 def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
40
41 80.7 MiB 0.0 MiB 1 ht = defaultdict(list)
42
43 # Create Hash Table for table1
44
45 81.2 MiB 0.0 MiB 31888 for s in property_1.table:
46 81.2 MiB 0.5 MiB 31887 ht[s[index_0]].append(s)
47
48 # Join Tables
49
50 81.2 MiB 0.0 MiB 1 joined_table = PropertyTable()
51
52 203.8 MiB 0.0 MiB 45713 for r in property_2.table:
53 203.8 MiB 0.0 MiB 1453580 for s in ht[r[index_1]]:
54 203.8 MiB 122.6 MiB 1407868 joined_table.insert(s, r)
55
56 203.8 MiB 0.0 MiB 1 return joined_table
The core of your question is this:
The result is correct for small tables, but 10 million rows simply result in an Out of Memory Error and I am looking for ways to avoid this.
Following your top-level problem statement but with a less generic structure, we can do something like this:
def runQuery(dataLines):
    from collections import defaultdict
    pred = dict(zip(['follows', 'friendOf', 'likes', 'hasReview'], range(4)))
    tables = [defaultdict(list) for _ in pred]

    def encode(s):
        if s[-1].isdigit():
            i = 0
            while s[-1 - i].isdigit():
                i += 1
            return int(s[-i:])
        if any(s.endswith(k) for k in pred):
            return sum(v for k, v in pred.items() if s.endswith(k))
        return None

    for line in dataLines:
        if not line:
            continue
        subj, prop, *obj = line.rstrip('\n.').split('\t')
        obj = obj[0].rstrip()
        subj, prop, obj = [encode(s) for s in (subj, prop, obj)]
        if prop is not None:
            tables[prop][subj].append(obj)
    tables = [{k: tuple(v) for k, v in table.items()} for table in tables]
    #[print(list(pred.keys())[i], tables[i], sep='\n') for i in range(len(pred))]

    # create reverse index for subject, object where subject [user] follows object [user]
    object_of_follows = defaultdict(set)
    for k, v in tables[pred['follows']].items():
        for user in v:
            object_of_follows[user].add(k)

    # create reverse index for subject, object where subject [user] is friendOf object [user]
    object_of_friendOf = defaultdict(set)
    for k, v in tables[pred['friendOf']].items():
        if k in object_of_follows:
            for user in v:
                object_of_friendOf[user].add(k)

    # create reverse index for subject, object where subject [user] likes object [product]
    object_of_likes = defaultdict(set)
    for k, v in tables[pred['likes']].items():
        if k in object_of_friendOf:
            for product in v:
                object_of_likes[product].add(k)

    # create reverse index for subject, object where subject [product] hasReview object [review]
    object_of_hasReview = defaultdict(set)
    for k, v in tables[pred['hasReview']].items():
        if k in object_of_likes:
            for review in v:
                object_of_hasReview[review].add(k)

    def addToResult(result, e):
        d = object_of_hasReview[e]
        c = {y for x in d for y in object_of_likes[x]}
        b = {y for x in c for y in object_of_friendOf[x]}
        a = {y for x in b for y in object_of_follows[x]}
        toAdd = [(ax, bx, cx, dx, e) for dx in d for cx in c for bx in b for ax in a]
        result += toAdd

    result = []
    for e in object_of_hasReview:
        addToResult(result, e)
    print(f'result row count {len(result):,}')
    return result
Explanation:
Create a list of 4 tables (follows, friendOf, likes, hasReview), each a dictionary mapping subject to a tuple of objects
Create 4 reverse indexes (object_of_follows, object_of_friendOf, object_of_likes, object_of_hasReview); for example:
object_of_follows is a dict that maps each user that is an object in follows to a set of users, each of which is a subject in follows that follows the object
object_of_friendOf is a dict that maps each object (user) in friendOf to a set of users, each of which is a subject (user) associated with the object in friendOf and is in object_of_follows (in other words, is an object for one or more subjects in follows)
etc.
Explode each review that survived in object_of_hasReview into multiple result rows containing each unique result follows.subject, follows.object, friendsOf.object, likes.object, hasReview.object as specified in the query
Return the list of all such exploded rows.
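To make the reverse-index step concrete, here is a tiny standalone sketch (toy data, not taken from the post) of the idea behind object_of_follows and the other reverse indexes:

from collections import defaultdict

# subject -> objects ("user 0 follows users 10 and 11")
follows = {0: (10, 11), 1: (11, 12)}

object_of_follows = defaultdict(set)
for subj, objs in follows.items():
    for obj in objs:
        object_of_follows[obj].add(subj)    # object -> set of subjects that follow it

print(dict(object_of_follows))              # {10: {0}, 11: {0, 1}, 12: {1}}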
Test code for 10 million lines:
dataLines = []
numFollowers = 1000
numChildren = 10
overlapFactor = max(1, numChildren // 2)

def largerPowerOfTen(x):
    y = 1
    while x >= y:
        y *= 10
    return y

aCeil = largerPowerOfTen(numFollowers)
bCeil = largerPowerOfTen(aCeil * numChildren)
cCeil = largerPowerOfTen(bCeil * numChildren)
dCeil = largerPowerOfTen(cCeil * numChildren)

friendOf, likes = set(), set()
for a in range(numFollowers):
    for b in range(aCeil + a * overlapFactor, aCeil + a * overlapFactor + numChildren):
        dataLines.append(f'wsdbm:User{a} wsdbm:follows wsdbm:User{b} .\n')
        for c in range(bCeil + b * overlapFactor, bCeil + b * overlapFactor + numChildren):
            if (b, c) not in friendOf:
                dataLines.append(f'wsdbm:User{b} wsdbm:friendOf wsdbm:User{c} .\n')
                friendOf.add((b, c))
            for d in range(cCeil + c * overlapFactor, cCeil + c * overlapFactor + numChildren):
                if (c, d) not in likes:
                    dataLines.append(f'wsdbm:User{c} wsdbm:likes wsdbm:Product{d} .\n')
                    likes.add((c, d))
                for e in range(dCeil * (d + 1), dCeil * (d + 1) + numChildren):
                    dataLines.append(f'wsdbm:Product{d} wsdbm:hasReview wsdbm:Review{e} .\n')

print(f'dataLines row count {len(dataLines):,}')

from timeit import timeit
n = 1
print(f'Timeit results:')
t = timeit(f"runQuery(dataLines)", setup=f"from __main__ import dataLines, runQuery", number=n) / n
print(f'======== runQuery ran in {t} seconds using {n} iterations')

'''
result = runQuery(dataLines)
print(f'result row count {len(result):,}')
print(f'{"follows.subject":>20}{"follows.object":>20}{"friendsOf.object":>20}{"likes.object":>20}{"hasReview.object":>20}')
[print(f'{a:20}{b:20}{c:20}{d:20}{e:20}') for a,b,c,d,e in result]
'''
Output:
dataLines row count 10,310,350
Timeit results:
result row count 12,398,500
======== runQuery ran in 81.53253880003467 seconds using 1 iterations
Here's input/output from a smaller-scale sample run:
Params
numFollowers = 3
numChildren = 3
overlapFactor = 2
Input (after storing in tables):
follows
{0: (10, 11, 12), 1: (12, 13, 14), 2: (14, 15, 16)}
friendOf
{10: (120, 121, 122), 11: (122, 123, 124), 12: (124, 125, 126), 13: (126, 127, 128), 14: (128, 129, 130), 15: (130, 131, 132), 16: (132, 133, 134)}
likes
{120: (1240, 1241, 1242), 121: (1242, 1243, 1244), 122: (1244, 1245, 1246), 123: (1246, 1247, 1248), 124: (1248, 1249, 1250), 125: (1250, 1251, 1252), 126: (1252, 1253, 1254), 127: (1254, 1255, 1256), 128: (1256, 1257, 1258), 129: (1258, 1259, 1260), 130: (1260, 1261, 1262), 131: (1262, 1263, 1264), 132: (1264, 1265, 1266), 133: (1266, 1267, 1268), 134: (1268, 1269, 1270)}
hasReview
{1240: (12410000, 12410001, 12410002), 1241: (12420000, 12420001, 12420002), 1242: (12430000, 12430001, 12430002, 12430000, 12430001, 12430002), 1243: (12440000, 12440001, 12440002), 1244: (12450000, 12450001, 12450002, 12450000, 12450001, 12450002, 12450000, 12450001, 12450002), 1245: (12460000, 12460001, 12460002, 12460000, 12460001, 12460002), 1246: (12470000, 12470001, 12470002, 12470000, 12470001, 12470002, 12470000, 12470001, 12470002), 1247: (12480000, 12480001, 12480002), 1248: (12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002), 1249: (12500000, 12500001, 12500002, 12500000, 12500001, 12500002, 12500000, 12500001, 12500002), 1250: (12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002), 1251: (12520000, 12520001, 12520002, 12520000, 12520001, 12520002), 1252: (12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002), 1253: (12540000, 12540001, 12540002, 12540000, 12540001, 12540002, 12540000, 12540001, 12540002), 1254: (12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002), 1255: (12560000, 12560001, 12560002), 1256: (12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002), 1257: (12580000, 12580001, 12580002, 12580000, 12580001, 12580002, 12580000, 12580001, 12580002), 1258: (12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002), 1259: (12600000, 12600001, 12600002, 12600000, 12600001, 12600002), 1260: (12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002), 1261: (12620000, 12620001, 12620002, 12620000, 12620001, 12620002, 12620000, 12620001, 12620002), 1262: (12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002), 1263: (12640000, 12640001, 12640002), 1264: (12650000, 12650001, 12650002, 12650000, 12650001, 12650002, 12650000, 12650001, 12650002), 1265: (12660000, 12660001, 12660002, 12660000, 12660001, 12660002), 1266: (12670000, 12670001, 12670002, 12670000, 12670001, 12670002, 12670000, 12670001, 12670002), 1267: (12680000, 12680001, 12680002), 1268: (12690000, 12690001, 12690002, 12690000, 12690001, 12690002), 1269: (12700000, 12700001, 12700002), 1270: (12710000, 12710001, 12710002)}
Output
result row count 351
follows.subject follows.object friendsOf.object likes.object hasReview.object
0 10 120 1240 12410000
0 10 120 1240 12410001
0 10 120 1240 12410002
0 10 120 1241 12420000
0 10 120 1241 12420001
0 10 120 1241 12420002
0 10 120 1242 12430000
0 10 121 1242 12430000
0 10 120 1242 12430001
0 10 121 1242 12430001
0 10 120 1242 12430002
0 10 121 1242 12430002
0 10 121 1243 12440000
0 10 121 1243 12440001
0 10 121 1243 12440002
0 10 121 1244 12450000
0 11 121 1244 12450000
0 10 122 1244 12450000
0 11 122 1244 12450000
0 10 121 1244 12450001
0 11 121 1244 12450001
0 10 122 1244 12450001
0 11 122 1244 12450001
0 10 121 1244 12450002
0 11 121 1244 12450002
etc.

Which month has the highest median for maximum_gust_speed out of all the available records

Which month has the highest median for maximum_gust_speed out of all the available records? Also find the respective value.
The data set looks like this:
Day Average temperature (°F) Average humidity (%) Average dewpoint (°F) Average barometer (in) Average windspeed (mph) Average gustspeed (mph) Average direction (°deg) Rainfall for month (in) Rainfall for year (in) Maximum rain per minute Maximum temperature (°F) Minimum temperature (°F) Maximum humidity (%) Minimum humidity (%) Maximum pressure Minimum pressure Maximum windspeed (mph) Maximum gust speed (mph) Maximum heat index (°F)
0 1/01/2009 37.8 35 12.7 29.7 26.4 36.8 274 0.0 0.0 0.0 40.1 34.5 44 27 29.762 29.596 41.4 59.0 40.1
1 2/01/2009 43.2 32 14.7 29.5 12.8 18.0 240 0.0 0.0 0.0 52.8 37.5 43 16 29.669 29.268 35.7 51.0 52.8
2 3/01/2009 25.7 60 12.7 29.7 8.3 12.2 290 0.0 0.0 0.0 41.2 6.7 89 35 30.232 29.260 25.3 38.0 41.2
3 4/01/2009 9.3 67 0.1 30.4 2.9 4.5 47 0.0 0.0 0.0 19.4 -0.0 79 35 30.566 30.227 12.7 20.0 32.0
4 5/01/2009 23.5 30 -5.3 29.9 16.7 23.1 265 0.0 0.0 0.0 30.3 15.1 56 13 30.233 29.568 38.0 53.0 32.0
The code I have written is below; however, the test case fails.
Code:
data1= data[data['Maximum gust speed (mph)']!= 0.0]
#print(data1.count())
#print(data.count())
#print(data.median())
#print(data1.median())
max_gust_value_median = data1.groupby(pd.DatetimeIndex(data1['Day']).month).agg({'Maximum gust speed (mph)':pd.Series.median})
#print(max_gust_value_median)
max_gust_month = "max_gust_month = " + str(max_gust_value_median.idxmax()[0])
max_gust_value = "max_gust_value = " + format((max_gust_value_median.max()[0]),'.2f')
print(max_gust_value)
print(max_gust_month)
Output :
max_gust_value = 32.20
max_gust_month = 11
Error :
=================================== FAILURES ===================================
_____________________________ test_max_gust_month ______________________________
def test_max_gust_month():
assert hash_dict["max_gust_month"] == answer_dict["max_gust_month"]
E AssertionError: assert 'd1aecb72eff6...7412c2a651d81' == 'e6e3cedb0dc6...798711404a6c8'
E - e6e3cedb0dc67a96317798711404a6c8
E + d1aecb72eff64d1169f7412c2a651d81
test.py:52: AssertionError
_____________________________ test_max_gust_value ______________________________
def test_max_gust_value():
assert hash_dict["max_gust_value"] == answer_dict["max_gust_value"]
E AssertionError: assert '6879064548a1...2361f91ecd7b0' == '5818ebe448c4...471e93c92d545'
E - 5818ebe448c43f2dfed471e93c92d545
E + 6879064548a136da2f22361f91ecd7b0
test.py:55: AssertionError
=========================== short test summary info ============================
FAILED test.py::test_max_gust_month - AssertionError: assert 'd1aecb72eff6......
FAILED test.py::test_max_gust_value - AssertionError: assert '6879064548a1......
========================= 2 failed, 9 passed in 0.13s ==========================
Below is the code:
data['Month'] = pd.to_datetime(data['Day'], dayfirst=True).dt.strftime('%B')
month_list = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
month_grp = data.groupby(['Month'])
month_name_value_all = []
max_value = []
for i in month_list:
    month_name_value = []
    value = month_grp.get_group(i).median().loc['Maximum gust speed (mph)']
    month_name_value.append(i)
    max_value.append(value)
    month_name_value.append(value)
    month_name_value_all.append(month_name_value)
max_val = max(max_value)
max_gust_value = format(max_val, '.2f')
for j in month_name_value_all:
    month_max_find = []
    month_max_find.append(j)
    # compare against the raw float (not the formatted string), otherwise the break never fires
    if max_val in j:
        break
max_gust_month = month_max_find[0][0]
print("max_gust_value = ", max_gust_value)
print("max_gust_month = ", max_gust_month)
You can try this way:
#Convert Day column values to datetime
df['Date'] = pd.to_datetime(df['Day'], format='%d/%m/%Y')
#Create a new column month_index
df['month_index'] = df['Date'].dt.month
#Group the dataframe by month, then find the median maximum gust speed per month
max_gust_median_month = df.groupby(['month_index'])['Maximum gust speed (mph)'].median()
#Find the highest of the monthly medians
max_gust_value = max_gust_median_month.max()
max_gust_value
#Find the month (index) of that highest median
max_gust_month = max_gust_median_month.idxmax()
max_gust_month
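For completeness, a self-contained usage sketch of the same approach on toy data (the real dataset is not reproduced here); calendar.month_name is added in case a month name rather than a number is wanted:

import calendar
import pandas as pd

# Toy data standing in for the real dataset.
df = pd.DataFrame({
    'Day': ['1/01/2009', '2/01/2009', '1/02/2009'],
    'Maximum gust speed (mph)': [59.0, 51.0, 38.0],
})
df['month_index'] = pd.to_datetime(df['Day'], format='%d/%m/%Y').dt.month

max_gust_median_month = df.groupby('month_index')['Maximum gust speed (mph)'].median()
max_gust_month = int(max_gust_median_month.idxmax())
max_gust_value = format(max_gust_median_month.max(), '.2f')
print("max_gust_month =", max_gust_month, "({})".format(calendar.month_name[max_gust_month]))
print("max_gust_value =", max_gust_value)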

Python multithreading not getting desired performance

I have a bunch of pandas dataframes I would like to write out to some format (csv, json, etc.), and I would like to preserve the order based on the order in which the data frames were read. Unfortunately .to_csv() can take some time, sometimes 2x longer than just reading the dataframe.
Let's take the image as an example:
Here you can see the task running linearly: read a data frame, print it out, then repeat for the remaining data frames. This can take about 3x longer than just reading the data frames. Theoretically, if we push the printing (to_csv()) to separate threads (2 threads, plus the main thread reading), we could achieve improved performance, with a total execution time of almost a third of the linear (synchronous) version. Of course, with just 3 reads it only looks about half as fast, but the more dataframes you read, the faster it should get (theoretically).
Unfortunately, in practice it does not work like that. I am getting a very small gain in performance, and the read time is actually taking longer. This might be because to_csv() is CPU intensive and uses all the resources in the process, and since it is multithreaded, everything shares the same resources; thus there is not much gain.
So my question is: how can I improve the code to get performance closer to the theoretical numbers? I tried using multiprocessing but failed to get working code. How can I do this with multiprocessing? Are there other ways I could improve the total execution time of such a task?
Here's my sample code using multithreads:
import pandas as pd
import datetime
import os
from threading import Thread
import queue
from io import StringIO
from line_profiler import LineProfiler

NUMS = 500
DEVNULL = open(os.devnull, 'w')
HEADERS = ",a,b,c,d,e,f,g\n"
SAMPLE_CSV = HEADERS + "\n".join([f"{x},{x},{x},{x},{x},{x},{x},{x}" for x in range(4000)])

def linear_test():
    print("------Linear Test-------")
    main_start = datetime.datetime.now()
    total_read_time = datetime.timedelta(0)
    total_add_task = datetime.timedelta(0)
    total_to_csv_time = datetime.timedelta(0)
    total_to_print = datetime.timedelta(0)
    for x in range(NUMS):
        start = datetime.datetime.now()
        df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
        total_read_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #
        total_add_task += datetime.datetime.now() - start

        start = datetime.datetime.now()
        data = df.to_csv()
        total_to_csv_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        print(data, file=DEVNULL)
        total_to_print += datetime.datetime.now() - start

    print("total_read_time: {}".format(total_read_time))
    print("total_add_task: {}".format(total_add_task))
    print("total_to_csv_time: {}".format(total_to_csv_time))
    print("total_to_print: {}".format(total_to_print))
    print("total: {}".format(datetime.datetime.now() - main_start))

class Handler():
    def __init__(self, num_workers=1):
        self.num_workers = num_workers
        self.total_num_jobs = 0
        self.jobs_completed = 0
        self.answers_sent = 0
        self.jobs = queue.Queue()
        self.results = queue.Queue()
        self.start_workers()

    def add_task(self, task, *args, **kwargs):
        args = args or ()
        kwargs = kwargs or {}
        self.total_num_jobs += 1
        self.jobs.put((task, args, kwargs))

    def start_workers(self):
        for i in range(self.num_workers):
            t = Thread(target=self.worker)
            t.daemon = True
            t.start()

    def worker(self):
        while True:
            item, args, kwargs = self.jobs.get()
            item(*args, **kwargs)
            self.jobs_completed += 1
            self.jobs.task_done()

    def get_answers(self):
        while self.answers_sent < self.total_num_jobs or self.jobs_completed == 0:
            yield self.results.get()
            self.answers_sent += 1
            self.results.task_done()

def task(task_num, df, q):
    ans = df.to_csv()
    q.put((task_num, ans))

def parallel_test():
    print("------Parallel Test-------")
    main_start = datetime.datetime.now()
    total_read_time = datetime.timedelta(0)
    total_add_task = datetime.timedelta(0)
    total_to_csv_time = datetime.timedelta(0)
    total_to_print = datetime.timedelta(0)
    h = Handler(num_workers=2)
    q = h.results
    answers = {}
    curr_task = 1
    t = 1
    for x in range(NUMS):
        start = datetime.datetime.now()
        df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
        total_read_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        h.add_task(task, t, df, q)
        t += 1
        total_add_task += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #data = df.to_csv()
        total_to_csv_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #print(data, file=DEVNULL)
        total_to_print += datetime.datetime.now() - start

    print("total_read_time: {}".format(total_read_time))
    print("total_add_task: {}".format(total_add_task))
    print("total_to_csv_time: {}".format(total_to_csv_time))
    print("total_to_print: {}".format(total_to_print))

    for task_num, ans in h.get_answers():
        #print("got back: {}".format(task_num, ans))
        answers[task_num] = ans
        if curr_task in answers:
            print(answers[curr_task], file=DEVNULL)
            del answers[curr_task]
            curr_task += 1

    # In case others are left out
    for k, v in answers.items():
        print(k)

    h.jobs.join()  # block until all tasks are done

    print("total: {}".format(datetime.datetime.now() - main_start))

if __name__ == "__main__":
    # linear_test()
    # parallel_test()

    lp = LineProfiler()
    lp_wrapper = lp(linear_test)
    lp_wrapper()
    lp.print_stats()

    lp = LineProfiler()
    lp_wrapper = lp(parallel_test)
    lp_wrapper()
    lp.print_stats()
The output is below, where you can see that in the linear test reading the data frames only took 4.6 seconds (42% of the total execution), but reading the data frames in the parallel test took 9.7 seconds (93% of the total execution):
------Linear Test-------
total_read_time: 0:00:04.672765
total_add_task: 0:00:00.001000
total_to_csv_time: 0:00:05.582663
total_to_print: 0:00:00.668319
total: 0:00:10.935723
Timer unit: 1e-07 s
Total time: 10.9309 s
File: ./test.py
Function: linear_test at line 33
Line # Hits Time Per Hit % Time Line Contents
==============================================================
33 def linear_test():
34 1 225.0 225.0 0.0 print("------Linear Test-------")
35 1 76.0 76.0 0.0 main_start = datetime.datetime.now()
36 1 32.0 32.0 0.0 total_read_time = datetime.timedelta(0)
37 1 11.0 11.0 0.0 total_add_task = datetime.timedelta(0)
38 1 9.0 9.0 0.0 total_to_csv_time = datetime.timedelta(0)
39 1 9.0 9.0 0.0 total_to_print = datetime.timedelta(0)
40
41 501 3374.0 6.7 0.0 for x in range(NUMS):
42
43 500 5806.0 11.6 0.0 start = datetime.datetime.now()
44 500 46728029.0 93456.1 42.7 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
45 500 40199.0 80.4 0.0 total_read_time += datetime.datetime.now() - start
46
47 500 6821.0 13.6 0.0 start = datetime.datetime.now()
48 #
49 500 6916.0 13.8 0.0 total_add_task += datetime.datetime.now() - start
50
51 500 5794.0 11.6 0.0 start = datetime.datetime.now()
52 500 55843605.0 111687.2 51.1 data = df.to_csv()
53 500 53640.0 107.3 0.0 total_to_csv_time += datetime.datetime.now() - start
54
55 500 6798.0 13.6 0.0 start = datetime.datetime.now()
56 500 6589129.0 13178.3 6.0 print(data, file=DEVNULL)
57 500 18258.0 36.5 0.0 total_to_print += datetime.datetime.now() - start
58
59 1 221.0 221.0 0.0 print("total_read_time: {}".format(total_read_time))
60 1 95.0 95.0 0.0 print("total_add_task: {}".format(total_add_task))
61 1 87.0 87.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
62 1 85.0 85.0 0.0 print("total_to_print: {}".format(total_to_print))
63 1 112.0 112.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
------Parallel Test-------
total_read_time: 0:00:09.779954
total_add_task: 0:00:00.016984
total_to_csv_time: 0:00:00.003000
total_to_print: 0:00:00.001001
total: 0:00:10.488563
Timer unit: 1e-07 s
Total time: 10.4803 s
File: ./test.py
Function: parallel_test at line 106
Line # Hits Time Per Hit % Time Line Contents
==============================================================
106 def parallel_test():
107 1 100.0 100.0 0.0 print("------Parallel Test-------")
108 1 33.0 33.0 0.0 main_start = datetime.datetime.now()
109 1 24.0 24.0 0.0 total_read_time = datetime.timedelta(0)
110 1 10.0 10.0 0.0 total_add_task = datetime.timedelta(0)
111 1 10.0 10.0 0.0 total_to_csv_time = datetime.timedelta(0)
112 1 10.0 10.0 0.0 total_to_print = datetime.timedelta(0)
113 1 13550.0 13550.0 0.0 h = Handler(num_workers=2)
114 1 15.0 15.0 0.0 q = h.results
115 1 9.0 9.0 0.0 answers = {}
116 1 7.0 7.0 0.0 curr_task = 1
117 1 7.0 7.0 0.0 t = 1
118
119 501 5017.0 10.0 0.0 for x in range(NUMS):
120 500 6545.0 13.1 0.0 start = datetime.datetime.now()
121 500 97761876.0 195523.8 93.3 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
122 500 45702.0 91.4 0.0 total_read_time += datetime.datetime.now() - start
123
124 500 8259.0 16.5 0.0 start = datetime.datetime.now()
125 500 167269.0 334.5 0.2 h.add_task(task, t, df, q)
126 500 5009.0 10.0 0.0 t += 1
127 500 11865.0 23.7 0.0 total_add_task += datetime.datetime.now() - start
128
129 500 6949.0 13.9 0.0 start = datetime.datetime.now()
130 #data = df.to_csv()
131 500 7921.0 15.8 0.0 total_to_csv_time += datetime.datetime.now() - start
132
133 500 6498.0 13.0 0.0 start = datetime.datetime.now()
134 #print(data, file=DEVNULL)
135 500 8084.0 16.2 0.0 total_to_print += datetime.datetime.now() - start
136
137 1 3321.0 3321.0 0.0 print("total_read_time: {}".format(total_read_time))
138 1 4669.0 4669.0 0.0 print("total_add_task: {}".format(total_add_task))
139 1 1995.0 1995.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
140 1 113037.0 113037.0 0.1 print("total_to_print: {}".format(total_to_print))
141
142 501 176106.0 351.5 0.2 for task_num, ans in h.get_answers():
143 #print("got back: {}".format(task_num, ans))
144 500 5169.0 10.3 0.0 answers[task_num] = ans
145 500 4160.0 8.3 0.0 if curr_task in answers:
146 500 6429159.0 12858.3 6.1 print(answers[curr_task], file=DEVNULL)
147 500 5646.0 11.3 0.0 del answers[curr_task]
148 500 4144.0 8.3 0.0 curr_task += 1
149
150 # In case others are left out
151 1 24.0 24.0 0.0 for k, v in answers.items():
152 print(k)
153
154 1 61.0 61.0 0.0 h.jobs.join() # block until all tasks are done
155
156 1 328.0 328.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
Rather than rolling your own solution you may want to look at Dask - particularly Dask's Distributed DataFrame if you want to read multiple CSV files into one "virtual" big DataFrame, or Delayed to run functions, as per your example, in parallel across multiple cores. See light examples here if you scroll down: https://docs.dask.org/en/latest/
Your other lightweight choice is to use Joblib's Parallel interface; this looks exactly like Delayed but with much less functionality. I tend to go for Joblib if I want a lightweight solution, then upgrade to Dask if I need more: https://joblib.readthedocs.io/en/latest/parallel.html
For both tools, if you go down the delayed route, write a function that works in a for loop in series (you have this already), then wrap it in the respective delayed syntax and "it should just work". In both cases it will, by default, use all the cores on your machine.
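As a rough illustration of the Joblib route (a sketch under the assumption that the dataframes are collected in a plain list; the helper name to_csv_text is mine, not from the post):

from io import StringIO

import pandas as pd
from joblib import Parallel, delayed

HEADERS = ",a,b,c,d,e,f,g\n"
SAMPLE_CSV = HEADERS + "\n".join(f"{x},{x},{x},{x},{x},{x},{x},{x}" for x in range(4000))

def to_csv_text(df):
    # Runs in a worker process, so the CPU-heavy to_csv() calls no longer
    # compete with the reads happening in the main process.
    return df.to_csv()

if __name__ == "__main__":
    dfs = [pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0) for _ in range(20)]
    # n_jobs=-1 uses all cores; results come back in the order the inputs were submitted.
    csv_texts = Parallel(n_jobs=-1)(delayed(to_csv_text)(df) for df in dfs)
    print(len(csv_texts))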

Why does my Python loop consume all the memory?

I want to generate and keep a set of tuples for a certain amount of time. Yet I found the program seemed to consume all the memory if given enough time.
I have tried two methods: one is deleting the newly generated variables, the other is calling gc.collect(). But neither of them worked. If I just generate the tuples and do not keep them, the program consumes a limited amount of memory.
generate and keep: gk.py
import gc
import time
from memory_profiler import profile
from random import sample
from sys import getsizeof

@profile
def loop(limit):
    t = time.time()
    i = 0
    A = set()
    while True:
        i += 1
        duration = time.time() - t
        a = tuple(sorted(sample(range(200), 100)))
        A.add(a)
        if not i % int(1e4):
            print('step {:.2e}...'.format(i))
        if duration > limit:
            print('done')
            break
        # method 1: delete the variables
        # del duration, a
        # method 2: use gc
        # gc.collect()
    memory = getsizeof(t) + getsizeof(i) + getsizeof(duration) + \
             getsizeof(a) + getsizeof(limit) + getsizeof(A)
    print('memory consumed: {:.2e}MB'.format(memory/2**20))
    pass

def main():
    limit = 300
    loop(limit)
    pass

if __name__ == '__main__':
    print('running...')
    main()
generate and not keep: gnk.py
import time
from memory_profiler import profile
from random import sample
from sys import getsizeof

@profile
def loop(limit):
    t = time.time()
    i = 0
    while True:
        i += 1
        duration = time.time() - t
        a = tuple(sorted(sample(range(200), 100)))
        if not i % int(1e4):
            print('step {:.2e}...'.format(i))
        if duration > limit:
            print('done')
            break
    memory = getsizeof(t) + getsizeof(i) + getsizeof(duration) + \
             getsizeof(a) + getsizeof(limit)
    print('memory consumed: {:.2e}MB'.format(memory/2**20))
    pass

def main():
    limit = 300
    loop(limit)
    pass

if __name__ == '__main__':
    print('running...')
    main()
use "mprof" (needs module memory_profiler) in cmd/shell to check memory usage
mprof run my_file.py
mprof plot
result of gk.py
memory consumed: 4.00e+00MB
Filename: gk.py
Line # Mem usage Increment Line Contents
================================================
12 32.9 MiB 32.9 MiB #profile
13 def loop(limit):
14 32.9 MiB 0.0 MiB t = time.time()
15 32.9 MiB 0.0 MiB i = 0
16 32.9 MiB 0.0 MiB A = set()
17 32.9 MiB 0.0 MiB while True:
18 115.8 MiB 0.0 MiB i += 1
19 115.8 MiB 0.0 MiB duration = time.time() - t
20 115.8 MiB 0.3 MiB a = tuple(sorted(sample(range(200), 100)))
21 115.8 MiB 2.0 MiB A.add(a)
22 115.8 MiB 0.0 MiB if not i % int(1e4):
23 111.8 MiB 0.0 MiB print('step {:.2e}...'.format(i))
24 115.8 MiB 0.0 MiB if duration > limit:
25 115.8 MiB 0.0 MiB print('done')
26 115.8 MiB 0.0 MiB break
27 # method 1: delete the variables
28 # del duration, a
29 # method 2: use gc
30 # gc.collect()
31 memory = getsizeof(t) + getsizeof(i) + getsizeof(duration) + \
32 115.8 MiB 0.0 MiB getsizeof(a) + getsizeof(limit) + getsizeof(A)
33 115.8 MiB 0.0 MiB print('memory consumed: {:.2e}MB'.format(memory/2**20))
34 115.8 MiB 0.0 MiB pass
result of gnk.py
memory consumed: 9.08e-04MB
Filename: gnk.py
Line # Mem usage Increment Line Contents
================================================
11 33.0 MiB 33.0 MiB #profile
12 def loop(limit):
13 33.0 MiB 0.0 MiB t = time.time()
14 33.0 MiB 0.0 MiB i = 0
15 33.0 MiB 0.0 MiB while True:
16 33.0 MiB 0.0 MiB i += 1
17 33.0 MiB 0.0 MiB duration = time.time() - t
18 33.0 MiB 0.1 MiB a = tuple(sorted(sample(range(200), 100)))
19 33.0 MiB 0.0 MiB if not i % int(1e4):
20 33.0 MiB 0.0 MiB print('step {:.2e}...'.format(i))
21 33.0 MiB 0.0 MiB if duration > limit:
22 33.0 MiB 0.0 MiB print('done')
23 33.0 MiB 0.0 MiB break
24 memory = getsizeof(t) + getsizeof(i) + getsizeof(duration) + \
25 33.0 MiB 0.0 MiB getsizeof(a) + getsizeof(limit)
26 33.0 MiB 0.0 MiB print('memory consumed: {:.2e}MB'.format(memory/2**20))
27 33.0 MiB 0.0 MiB pass
I have two questions:
Both programs consumed more memory than their variables occupied: "gk.py" consumed 115.8 MiB while its variables occupied 4.00 MB, and "gnk.py" consumed 33.0 MiB while its variables occupied 9.08e-04 MB. Why did the programs consume more memory than the corresponding variables occupied?
The memory consumed by "gk.py" increases linearly with time, while the memory consumed by "gnk.py" remains constant over time. Why does this happen?
Any help would be appreciated.
Given that the size of the set is constantly increasing, there will come a time when it eventually consumes all memory.
An estimate (from my computer):
10 seconds of code running ~ 5e4 tuples saved to the set
300 seconds of code running ~ 1.5e6 tuples saved to the set
1 tuple = 100 integers ~ 400 bytes
total:
1.5e6 * 400 bytes = 6e8 bytes = 600 MB filled in 300 s
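A quick way to see why the getsizeof-based figure in the question understates the real cost is to also sum the sizes of the tuples the set references; sys.getsizeof is shallow by design and only counts the set's own structure. A small sketch:

from sys import getsizeof
from random import sample

# Sketch only: getsizeof(A) reports the set's own structure, not the tuples
# (or the ints) it references, which is why gk.py's "memory consumed" figure
# stays far below what mprof reports.
A = {tuple(sorted(sample(range(200), 100))) for _ in range(10000)}

shallow = getsizeof(A)
deep = shallow + sum(getsizeof(t) for t in A)   # still ignores the shared int objects
print('shallow: {:.2f} MB'.format(shallow / 2**20))
print('with tuples: {:.2f} MB'.format(deep / 2**20))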

Strange increment value reported during IPython memory profiling

While checking Jake VanderPlas' "Python Data Science Handbook", I was recreating the usage examples of various debugging and profiling tools. He provides an example for demonstrating %mprun with the following function:
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
        del L
    return total
I proceeded to execute it in a Jupyter notebook, and got the following output:
Line # Mem usage Increment Line Contents
================================================
1 81.3 MiB 81.3 MiB def sum_of_lists(N):
2 81.3 MiB 0.0 MiB total = 0
3 81.3 MiB 0.0 MiB for i in range(5):
4 113.2 MiB -51106533.7 MiB L = [j ^ (j >> i) for j in range(N)]
5 119.1 MiB 23.5 MiB total += sum(L)
6 81.3 MiB -158.8 MiB del L
7 81.3 MiB 0.0 MiB return total
... which immediately struck me as odd. According to the book, I should have gotten a 25.4 MiB increase on line 4 and a corresponding negative increase on line 6. Instead I have a massive negative increment, which does not line up at all with what I would have expected to happen; going by line 6, there should have been a 158.8 MiB increment.
On the other hand, Mem usage paints a more sensible picture (113.2 - 81.3 = 31.9 MiB increase). So I'm left with a weird, giant negative increment, and two measured changes in memory usage that don't agree with each other. What is going on, then?
Just to check if there's something truly bizarre going on with my interpreter/profiler, I went ahead and replicated the example given in this answer, and got this output:
Line # Mem usage Increment Line Contents
================================================
2 86.5 MiB 86.5 MiB def my_func():
3 94.1 MiB 7.6 MiB a = [1] * (10 ** 6)
4 246.7 MiB 152.6 MiB b = [2] * (2 * 10 ** 7)
5 94.1 MiB -152.6 MiB del b
6 94.1 MiB 0.0 MiB return a
Nothing wrong there, I think. What could be going on with the previous example?
