I have a function that finds the common items, the uncommon items, and their rates between a given list (one list) and other lists (60,000 lists) for each user (4,000 users). Running the loop below takes too long and uses too much memory: only part of the list gets constructed before the program crashes. I think this is due to the length of the returned list and its heavy elements (tuples), so I split it into the two functions below, but the problem seems to be in appending the tuples to the list, [(user, [items], rate), (user, [items], rate), ....]. I want to create dataframes from the returned values. What should I change in the algorithm to get around this and reduce memory usage?
I am using Python 3.7, Windows 10 64-bit, 8 GB RAM.
common items function:
def common_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))
    return user, com_items, com_items_rate
uncommon items function:
def uncommon_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))
    uncom_items = list(set(list2) - set(com_items))  # uncommon items that belong to list2
    uncom_items_rate = len(uncom_items) / len(set(list1).union(set(list2)))
    return user, com_items_rate, uncom_items, uncom_items_rate  # common_items_rate is also needed
Constructing the list:
common_item_rate_tuple_list = []
for usr in users:  # users.shape = 4,000
    list1 = get_user_list(usr)  # a function to get list1, it takes 0:00:00.015632 or less for a user
    # print(usr, len(list1))
    for list2 in df["list2"]:  # df.shape = 60,000
        common_item_rate_tuple = common_items(usr, list1, list2)
        common_item_rate_tuple_list.append(common_item_rate_tuple)
print(len(common_item_rate_tuple_list))  # 4,000 * 60,000 = 240,000,000 items
# sample of common_item_rate_tuple_list:
# [(1, [2, 5, 8], 0.676), (1, [7, 4], 0.788), ...., (4000, [1, 5, 7, 9], 0.318), (4000, [8, 9, 6], 0.521)]
I looked at (Memory errors and list limits?) and (Memory error when appending to list in Python); they deal with the already constructed list. And I could not apply the suggested answer for (Python list memory error).
There are a couple of things you should consider for speed and memory management with data this big.
You are, or should be, working only with sets here, because order has no meaning in your lists and you are doing a lot of intersecting of sets. So, can you change your get_user_list() function to return a set instead of a list? That will prevent all of the unnecessary conversions you are doing. Same for list2: just make it a set right away.
When you look for "uncommon items" you should just use the symmetric difference operator on the sets. Much faster, and far fewer list -> set conversions.
At the end of your loop, do you really want to create a list of 240M sub-lists? That is probably your memory explosion. I would suggest a dictionary keyed by user name, and you only need to create an entry in it if there are common items. If the matches are "sparse", you will get a much smaller data container.
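For illustration, a minimal sketch of the set operators involved (the numbers are made up):

a = {1, 2, 5, 8}                  # e.g. one user's items, already a set
b = {2, 5, 9}                     # one of the "other" lists, also as a set
common = a & b                    # {2, 5}     intersection
only_in_b = b - a                 # {9}        items of b not shared with a
either_not_both = a ^ b           # {1, 8, 9}  symmetric difference
rate = len(common) / len(a | b)   # 2/5 = 0.4, the same rate the question computes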
--- Edit w/ example
So I think your hope of keeping this in a dataframe is too big. Perhaps you can do what is needed without storing it in a dataframe. A dictionary makes sense. You may even be able to compute things "on the fly" and not store the data. Anyhow, here is a toy example that shows the memory problem, using 4K users and 10K "other lists". Of course the size of the intersected sets may make this vary, but it is informative:
import sys
import pandas as pd

# create list of users by index
users = list(range(4000))
match_data = list()
size_list2 = 10_000
for user in users:
    for t in range(size_list2):
        match_data.append((user, (1, 5, 6, 9), 0.55))  # 4 dummy matches and fake percentage
print(match_data[:4])
print(f'size of match: {sys.getsizeof(match_data)/1_000_000} MB')

df = pd.DataFrame(match_data)
print(df.head())
print(f'size of dataframe {sys.getsizeof(df)/1_000_000} MB')
This yields the following:
[(0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55)]
size of match: 335.072536 MB
0 1 2
0 0 (1, 5, 6, 9) 0.55
1 0 (1, 5, 6, 9) 0.55
2 0 (1, 5, 6, 9) 0.55
3 0 (1, 5, 6, 9) 0.55
4 0 (1, 5, 6, 9) 0.55
size of dataframe 3200.00016 MB
You can see that a scaled-down version of your idea, with only 10K other lists, is already 3.2 GB in a dataframe. This will be unmanageable.
Here is an idea for a data structure, just using dictionaries all the way.
del df

# just keep it in a dictionary
data = {}  # intended format: key = (usr, other_list) : value = [common elements]

# some fake user data
user_items = {1: {2, 3, 5, 7, 99},
              2: {3, 5, 88, 790},
              3: {2, 4, 100}}

# some fake "list 2" data
list2 = [{1, 2, 3, 4, 5},
         {88, 100},
         {11, 13, 200}]

for user in user_items.keys():
    for idx, other_set in enumerate(list2):  # using enumerate to get the index of the other list
        common_elements = user_items.get(user) & other_set  # set intersection
        if common_elements:  # only put it into the dictionary if it is not empty
            data[(user, idx)] = common_elements

# try a couple data pulls
print(f'for user 1 and other list 0: {data.get((1, 0))}')
print(f'for user 2 and other list 2: {data.get((2, 2))}')  # use .get() to be safe. It will return None if no entry
The output here is:
for user 1 and other list 0: {2, 3, 5}
for user 2 and other list 2: None
Your other alternative, if you are going to be working with this data a lot, is to put these tables into a database like sqlite, which is built in and won't bomb out your memory.
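If you go that route, a minimal sqlite3 sketch could look like this (the table and column names are made up):

import sqlite3

conn = sqlite3.connect("matches.db")  # file-backed, so it does not have to live in RAM
conn.execute("CREATE TABLE IF NOT EXISTS matches (user INTEGER, other_list INTEGER, common_rate REAL)")

# insert rows in batches as they are computed instead of holding 240M tuples in memory
rows = [(1, 0, 0.676), (1, 1, 0.788)]  # fake example rows
conn.executemany("INSERT INTO matches VALUES (?, ?, ?)", rows)
conn.commit()

# pull back only what is needed
print(conn.execute("SELECT * FROM matches WHERE user = 1").fetchall())
conn.close()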
Since my first question wasn't clear enough, I'm going to rewrite it so it's clearer (I'm sorry I wasn't clear in the 1st question).
I have a folder that has 46 subdirectories. Each subdirectory has a .txt file. The text within those .txt files varies in number of lines and looks like this:
0
0
0
1
...
1
0
0
1
I want to have a list with 46 tuples (one for every subdirectory, and so one for each .txt file), and inside each tuple I want the number of lines that contain 0s and the number of lines that contain 1s in the text of the file that the tuple is assigned to. Something along the lines of this:
[(17342,2453),(342,127),...,(45003,69),(420,223)]
Currently my code (not working properly) is like this:
from pathlib import Path

def count_0s(paths):
    for p in paths:
        list_zeros = []
        list_ones = []
        for line in p.read_text().splitlines():
            zeros = 0
            zeros += line.count('0')
            ones = 0
            ones += line.count('1')
            list_zeros.append(zeros)
            list_ones.append(ones)
    return list_zeros, list_ones

path = "/content/drive/MyDrive/data/classes/"
paths = Path(path).glob("*/marked*.txt")
n_zeros = count_0s(paths)
n_zeros
What changes do I need to apply so that, instead of 2 lists, I have a list with tuples?
Assuming you have two lists, the first with the number of zeros and the second with the number of ones, you can just rearrange the data at the end:
n_zeros = [1, 2, 3, 4]
n_ones = [5, 6, 7, 8]

pairs = []
for n_zero, n_one in zip(n_zeros, n_ones):
    pairs.append((n_zero, n_one))
print(pairs)
Should return
[(1, 5), (2, 6), (3, 7), (4, 8)]
A better way would be to do the pairing in your main loop, instead of saving two lists.
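For example, a sketch of what that could look like, reusing the per-line counting from the question's count_0s() (the helper name is made up):

def count_0s_and_1s(paths):
    pairs = []
    for p in paths:
        lines = p.read_text().splitlines()
        zeros = sum(line.count('0') for line in lines)
        ones = sum(line.count('1') for line in lines)
        pairs.append((zeros, ones))  # one (zeros, ones) tuple per file
    return pairs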
I have 3 lists:
list1 = [min_0, min_1, ..., min_150] consists of minimum indexes and usually has 50-150 elements,
list2 = [max_0, max_1, ..., max_150] consists of maximum indexes and usually has 50-150 elements,
list3 = [min_0, max_0, max_1, min_1, ..., max_149, min_150]
list3 is list1 and list2 joined together, and it is ordered. list3 generally has 200-300 elements.
I want to create 5-element combinations [x0, x1, x2, x3, x4] from list3 that fit two conditions, using Python's itertools.
Condition 1: x0, x2 and x4 must be in list1 and x1, x3 must be in list2, or x0, x2, x4 must be in list2 and x1, x3 must be in list1.
Condition 2: x4 - x0 <= 89
The problem is performance. The number of possible combinations for C(300, 5) is 19,582,837,560. I have tried splitting list3 into n parts and got some good performance, but in that case I miss some possibilities that fit my conditions.
I hope the question is clear. How can I get the best performance? Thanks.
In order to avoid billions of iterations, you will need to simplify the combination domains. This will be easier using sets.
So let's say your 3 lists are actually sets:
set1 = set(list1)
set2 = set(list2)
set3 = set(list3)
You have two patterns to look for:
Let's do part 1:
elements of list3 where x0,x2,x4 are in list1 and x1,x3 are in list2
x0,x2,x4 will be combinations of 3 out of set3 & set1
x1,x3 will be combinations of 2 out of set3 & set2
The 5-value tuples will be the product of these combinations:
from itertools import combinations

part1 = { (x0,x1,x2,x3,x4) for x0,x2,x4 in combinations(set3&set1,3)
                           if x4-x0 <= 89
                           for x1,x3 in combinations(set3&set2,2) }
The second part uses the same approach, but with the odd/even elements taken from the other lists:
part2 = { (x0,x1,x2,x3,x4) for x0,x2,x4 in combinations(set3&set2,3)
if x4-x0 <= 89
for x1,x3 in combinations(set3&set1,2) }
And the result is the union of the two parts:
result = part1 | part2
Depending on the data, this could still be in the millions of combinations but this method will greatly reduce the number of invalid combinations that need to be filtered out by conditions.
If that is still not fast enough, you should consider writing your own combinations function to optimize applying the set3 filter and the x4-x0 <= 89 condition within the combinatory logic (i.e. a 3-level nested loop giving (x0, x4, x2) that skips x4 values that don't fit the condition, preferably over sorted lists to allow short circuiting).
Note that if any of your lists contain duplicate values, you will definitely need to write your own filtering and combination functions to obtain pre-filtered subsets before multiplying 3-tuple and 2-tuple combinations.
[EDIT] Here is an example of how to write the custom combination function. I made it a generator in order to avoid creating a result set with a hundred million elements. It only generates valid combinations and applies condition 2 as early as possible to avoid useless iterations through invalid combinations:
m = 150
n = 200
list1 = list(range(m))
list2 = list(range(m, 2*m))
list3 = list(range(2, 2*n, 2))

def combine(L1, L2, L3):
    S3 = set(L3)
    inL1 = [x for x in L1 if x in S3]
    inL2 = [x for x in L2 if x in S3]
    for evens, odds in [(inL1, inL2), (inL2, inL1)]:  # only generate valid combinations
        for p0, x0 in enumerate(evens[:-2]):
            for p4, x4 in enumerate(evens[p0+2:], p0+2):
                if abs(x4-x0) > 89: continue  # short circuit condition 2 early
                for x2 in evens[p0+1:p4]:
                    for p1, x1 in enumerate(odds[:-1]):
                        for x3 in odds[p1+1:]:
                            yield (x0, x1, x2, x3, x4)

print(sum(1 for _ in combine(list1, list2, list3)))  # 230488170
The 230,488,170 combinations were produced in 22 seconds on my laptop.
Here are the first few combinations in my example:
for combo in combine(list1,list2,list3): print(combo)
(2, 150, 4, 152, 6)
(2, 150, 4, 154, 6)
(2, 150, 4, 156, 6)
(2, 150, 4, 158, 6)
(2, 150, 4, 160, 6)
(2, 150, 4, 162, 6)
(2, 150, 4, 164, 6)
(2, 150, 4, 166, 6)
(2, 150, 4, 168, 6)
(2, 150, 4, 170, 6)
(2, 150, 4, 172, 6)
(2, 150, 4, 174, 6) ...
KeyboardInterrupt
If you get hundreds of millions of valid combinations, you may need to rethink the way you're processing the data because you will run into performance and memory problems at every corner.
Use a function for condition 1, then apply it to condition 2. That way condition 1 will have precise usage.
This problem is largely the same as a classic 0-1 knapsack problem, but with some minor rule changes and a large dataset to play with.
Dataset (product ID, price, length, width, height, weight):
(20,000 rows)
Problem:
A company is closing in fast on delivering its 1 millionth order. The marketing team decides to give the customer who makes that order a prize as a gesture of appreciation. The prize is: the lucky customer gets a delivery tote and 1 hour in the warehouse. Use the hour to fill up the tote with any products you desire and take them home for free.
Rules:
1 of each item
Combined volume < tote capacity (45 * 35 * 30 = 47250)
Item must fit individually (Dimensions are such that it can fit into the tote, e.g. 45 * 45 * 1 wouldn't fit)
Maximize value of combined products
Minimize weight on draws
Solution (using dynamic programming):
from functools import reduce

# The main solver function
def Solver(myItems, myCapacity):
    dp = {myCapacity: (0, (), 0)}
    getKeys = dp.keys
    for i in range(len(myItems)):
        itemID, itemValue, itemVolume, itemWeight = myItems[i]
        for oldVolume in list(getKeys()):
            newVolume = oldVolume - itemVolume
            if newVolume >= 0:
                myValue, ListOfItems, myWeight = dp[oldVolume]
                node = (myValue + itemValue, ListOfItems + (itemID,), myWeight + itemWeight)
                if newVolume not in dp:
                    dp[newVolume] = node
                else:
                    currentValue, loi, currentWeight = dp[newVolume]
                    if currentValue < node[0] or (currentValue == node[0] and node[-1] < currentWeight):
                        dp[newVolume] = node
    return max(dp.values())

# Generate the product of all elements within a given list
def List_Multiply(myList):
    return reduce(lambda x, y: x * y, myList)

toteDims = [30, 35, 45]
totalVolume = List_Multiply(toteDims)
productsList = []
with open('products.csv', 'r') as myFile:
    for myLine in myFile:
        myData = [int(x) for x in myLine.strip().split(',')]
        itemDims = [myDim for myDim, maxDim in zip(sorted(myData[2:5]), toteDims) if myDim <= maxDim]
        if len(itemDims) == 3:
            productsList.append((myData[0], myData[1], List_Multiply(myData[2:5]), myData[5]))

print(Solver(productsList, totalVolume))
Issue:
The output contains repeated items,
i.e. (14018, (26, 40, 62, 64, 121, 121, 121, 152, 152), 13869)
How can I correct this to make it choose only 1 of each item?
It seems that the reason your code may produce answers with duplicate items is that, in the inner loop where you iterate over all of the volumes generated so far, the code may already have replaced the solution for an existing volume value before you get to it.
E.g. if your productsList contained the following
productsList = [
    # id, value, volume, weight
    [1, 1, 2, 1],
    [2, 1, 3, 2],
    [3, 3, 5, 1]
]
and
totalVolume = 10
then by the time you got to the third item, dp.keys() would contain:
10, 8, 7, 5
The order of iteration is not guaranteed, but for the sake of this example, let's assume it is as given above. Then dp[5] would be replaced by a new solution containing item #3, and later in the iteration, we would be using that as a base for a new, even better solution (except now with a duplicate item).
To overcome the above problem, you could sort the keys before the iteration (in ascending order, which is the default), like for oldVolume in sorted(getKeys()). Assuming all items have a non-negative volume, this should guarantee that we never replace a solution in dp before we have iterated over it.
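A sketch of that one-line change in context (the rest of Solver() stays exactly as in the question):

# inside Solver(), only the inner loop header changes:
for oldVolume in sorted(getKeys()):  # ascending order, so an entry is always used before the current item can overwrite it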
Another possible problem I see above is the way we get the optimal solution at the end using max(dp.values()). In the problem statement, it says that we want to minimize weight in the case of a draw. If I'm reading the code correctly, the elements of the tuple are value, list of items, weight in that order, so below we're tied for value, but the latter choice would be preferable because of the smaller weight... however max returns the first one:
>>> max([(4, (2, 3), 3), (4, (1, 3), 2)])
(4, (2, 3), 3)
It's possible to specify the sorting key to max so something like this might work:
>>> max([(4, (2, 3), 3), (4, (1, 3), 2)], key=lambda x: (x[0], -x[-1]))
(4, (1, 3), 2)
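Applied to the solver, the final line of Solver() would then become something like:

return max(dp.values(), key=lambda x: (x[0], -x[-1]))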
I'm working on a problem that finds the distance (the number of distinct items) between two consecutive uses of an item, in real time. The input is read from a large file (~10G), but for illustration I'll use a small list.
from collections import OrderedDict

unique_dist = OrderedDict()
input = [1, 4, 4, 2, 4, 1, 5, 2, 6, 2]

for item in input:
    if item in unique_dist:
        indx = unique_dist.keys().index(item)  # find the index
        unique_dist.pop(item)                  # pop the item
        size = len(unique_dist)                # find the size of the dictionary
        unique_dist[item] = size - indx        # update the distance value
    else:
        unique_dist[item] = -1                 # -1 if it is new

print input
print unique_dist
As we see, for each item I first check whether the item is already present in the dictionary; if it is, I update the value of the distance, otherwise I insert it at the end with the value -1. The problem is that this seems to be very inefficient as the size grows. Memory isn't a problem, but the pop function seems to be. I say that because, just as a test, if I do:
for item in input:
    unique_dist[item] = random.randint(1, 99999)
the program runs really fast. My question is: is there any way I could make my program more efficient (faster)?
EDIT:
It seems that the actual culprit is indx = unique_dist.keys().index(item). When I replaced that with indx = 1, the program was orders of magnitude faster.
According to a simple analysis I did with the cProfile module, the most expensive operations by far are OrderedDict.__iter__() and OrderedDict.keys().
The following implementation is roughly 7 times as fast as yours (according to the limited testing I did).
It avoids the call to unique_dist.keys() by maintaining a separate list of keys. I'm not entirely sure, but I think this also avoids the call to OrderedDict.__iter__().
It avoids the call to len(unique_dist) by incrementing the size variable whenever necessary. (I'm not sure how expensive of an operation len(OrderedDict) is, but whatever)
from collections import OrderedDict

def distance(input):
    dist = []
    key_set = set()
    keys = []
    size = 0
    for item in input:
        if item in key_set:
            index = keys.index(item)
            del keys[index]
            del dist[index]
            keys.append(item)
            dist.append(size - index - 1)
        else:
            key_set.add(item)
            keys.append(item)
            dist.append(-1)
            size += 1
    return OrderedDict(zip(keys, dist))
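For reference, calling it on the sample input from the question should reproduce the result of the original approach:

print(distance([1, 4, 4, 2, 4, 1, 5, 2, 6, 2]))
# OrderedDict([(4, 1), (1, 2), (5, -1), (6, -1), (2, 1)])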
I modified @Rawing's answer to overcome the overhead caused by the lookup and insertion time taken by the set data structure.
from random import randint

dist = {}
input = []
for x in xrange(1, 10):
    input.append(randint(1, 5))

keys = []
size = 0
for item in input:
    if item in dist:
        index = keys.index(item)
        del keys[index]
        keys.append(item)
        dist[item] = size - index - 1
    else:
        keys.append(item)
        dist[item] = -1
        size += 1

print input
print dist
How about this:
from collections import OrderedDict

unique_dist = OrderedDict()
input = [1, 4, 4, 2, 4, 1, 5, 2, 6, 2]

for item in input:
    if item in unique_dist:
        indx = unique_dist.keys().index(item)
        # unique_dist.pop(item)              # don't pop the item
        size = len(unique_dist)              # now the dictionary is one element too big
        unique_dist[item] = size - indx - 1  # therefore decrement the value here
    else:
        unique_dist[item] = -1               # -1 if it is new

print input
print unique_dist
[1, 4, 4, 2, 4, 1, 5, 2, 6, 2]
OrderedDict([(1, 2), (4, 1), (2, 2), (5, -1), (6, -1)])
Beware that the entries in unique_dist are now ordered by the first occurrence of each item in the input; yours were ordered by their last occurrence:
[1, 4, 4, 2, 4, 1, 5, 2, 6, 2]
OrderedDict([(4, 1), (1, 2), (5, -1), (6, -1), (2, 1)])
for i, (x, y, z) in enumerate(zip(analysisValues, analysisValues[1:], analysisValues[2:])):
    if all(k < 0.5 for k in (x, y, z)):
        instance = i
        break
This code iterates through an array and looks for the first 3 consecutive values that meet the condition '< 0.5'.
==============================
I'm working with 'timeseries' data and comparing the values at t, t+1s and t+2s.
If the data is sampled at 1Hz, then 3 consecutive values are compared and the code above is correct (points 0, 1, 2).
If the data is sampled at 2Hz, then every other point must be compared (points 0, 2, 4), and
if the data is sampled at 3Hz, then every third point must be compared (points 0, 3, 6).
The sample rate of the input data can vary, but it is known and recorded in the variable 'SRate'.
==============================
Please can you help me incorporate 'time' into this point-by-point analysis?
You can use extended slice notation, giving the step value as SRate:
for i,(x,y,z) in enumerate(zip(analysisValues, \
analysisValues[SRate::SRate], \
analysisValues[2 * SRate::SRate])):
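As a sketch of the full loop (note: this assumes the first sequence is stepped by SRate as well, so that each triple is evenly spaced at points i*SRate, (i+1)*SRate and (i+2)*SRate):

for i, (x, y, z) in enumerate(zip(analysisValues[::SRate],
                                  analysisValues[SRate::SRate],
                                  analysisValues[2 * SRate::SRate])):
    if all(k < 0.5 for k in (x, y, z)):
        instance = i * SRate  # index in analysisValues of the first point of the matching triple
        break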
Let us first construct a helper generator which does the following:
from itertools import izip, tee, ifilter

def sparsed_window(iterator, elements=2, step=1):
    its = tee(iterator, elements)
    for i, it in enumerate(its):
        for _ in range(i * step):
            next(it, None)  # wind forward each iterator by the needed number of items
    return izip(*its)

print list(sparsed_window([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3, 2))
Output:
>>>
[(1, 3, 5), (2, 4, 6), (3, 5, 7), (4, 6, 8), (5, 7, 9), (6, 8, 10)]
This helper avoids creating nearly identical lists in memory. It uses tee to cleverly cache only the part that is needed.
The helper code is based on the pairwise recipe.
Then we can use this helper to get what we want:
def find_instance(iterator, hz=1):
    iterated_in_sparsed_window = sparsed_window(iterator, elements=3, step=hz)
    fitting_values = ifilter(lambda (i, els): all(el < 0.5 for el in els), enumerate(iterated_in_sparsed_window))
    i, first_fitting = next(fitting_values, (None, None))
    return i

print find_instance([1, 0.4, 1, 0.4, 1, 0.4, 1, 0.4, 1], hz=2)
Output:
>>>
1