I am trying to speed up my nested loop; it currently takes 15 minutes for 100k customers.
I am also having trouble adding a condition that multiplies by the lookup2 value only for states A, B, and C, and multiplies by 1 otherwise.
import numpy as np
import pandas as pd

customer_data = pd.DataFrame({"cust_id": [1, 2, 3, 4, 5, 6, 7, 8],
                              "state": ['B', 'E', 'D', 'A', 'B', 'E', 'C', 'A'],
                              "cust_amt": [1000,300, 500, 200, 400, 600, 200, 300],
                              "year":[3, 3, 4, 3, 4, 2, 2, 4],
                              "group":[10, 25, 30, 40, 55, 60, 70, 85]})
state_list = ['A','B','C','D','E']
# All lookups should be dataframes with the year and/or group and the value like these.
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'lim %': 0.1})
lookup2 = pd.concat([pd.DataFrame({'group':g, 'lookup_val': 0.1, 'year':range(1, 11)}
                                  for g in customer_data['group'].unique())]).explode('year')
multi_data = np.arange(250).reshape(10,5,5)
lookups = [lookup1, lookup2]
# Preprocessing.
# Transform the state to categorical code to use it as array index.
customer_data['state'] = pd.Categorical(customer_data['state'],
                                        categories=state_list,
                                        ordered=True).codes
# Set index on lookups.
for i in range(len(lookups)):
    if 'group' in lookups[i].columns:
        lookups[i] = lookups[i].set_index(['year', 'group'])
    else:
        lookups[i] = lookups[i].set_index(['year'])
Calculation:
results = {}
for customer, state, amount, start, group in customer_data.itertuples(name=None, index=False):
    for year in range(start, len(multi_data)+1):
        if year == start:
            # First year: scale the state's row of multi_data by the customer amount.
            results[customer] = [[amount * multi_data[year-1, state, :]]]
        else:
            # Later years: matrix-multiply the previous result by this year's matrix.
            results[customer].append([results[customer][-1][-1] @ multi_data[year-1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
Example of expected output:
{1: [[array([55000, 56000, 57000, 58000, 59000]),
      array([5500., 5600., 5700., 5800., 5900.]),
      array([550., 560., 570., 580., 590.])],...
You could use multiprocessing if you have more than one CPU.
import multiprocessing as mp
from multiprocessing import Pool

def get_customer_data(data_tuple) -> dict:
    results = {}
    customer, state, amount, start, group = data_tuple
    for year in range(start, len(multi_data)+1):
        if year == start:
            results[customer] = [[amount * multi_data[year-1, state, :]]]
        else:
            results[customer].append([results[customer][-1][-1] @ multi_data[year-1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
    return results

# The guard keeps worker processes from re-running this block on import.
if __name__ == "__main__":
    p = Pool(mp.cpu_count())
    # Pool.map() takes a function and an iterable like a list or generator
    results_list = p.map(get_customer_data, list(customer_data.itertuples(name=None, index=False)))
    # results_list is a list of dicts; merge them into one dict
    results_dict = {k:v for x in results_list for k,v in x.items()}
    p.close()
Glad to see you posting this! As promised, my thoughts:
Pandas works very well with columns. What you need to do is remove loops as much as possible (in your case, I would get rid of the main loop over customers and keep the year and lookups loops).
To do this, forget about the results{} variable for now. You want to do the calculations directly on the DataFrame. For example your first calculation would become something like:
customer_data['meaningful_column_name'] = list(customer_data['cust_amt'].to_numpy()[:, None] * multi_data[customer_data['year'].to_numpy() - 1, customer_data['state'].to_numpy(), :])
For your lookups loop you just have to be aware that the if statement will be looking at entire columns.
Finally, as it seems you want to have your data in a list of arrays you will need to do some formatting to extract the data from a DataFrame structure.
I hope that makes some sense.
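A minimal sketch of that first vectorized step, assuming the state column has already been converted to integer codes as in the question's preprocessing. The names years, states, amounts, first_step, factor, and adjusted are made up for illustration, and the np.where line is only one possible way to express the "states A, B, C only" condition from the question, with a constant 0.1 standing in for the lookup2 value:

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data; "state" is already an integer code
# (A=0, B=1, C=2, D=3, E=4).
customer_data = pd.DataFrame({"cust_id": [1, 2, 3],
                              "state": [1, 4, 3],
                              "cust_amt": [1000, 300, 500],
                              "year": [3, 3, 4]})
multi_data = np.arange(250).reshape(10, 5, 5)

years = customer_data["year"].to_numpy() - 1   # zero-based index into multi_data
states = customer_data["state"].to_numpy()
amounts = customer_data["cust_amt"].to_numpy()

# One row of multi_data per customer, scaled by that customer's amount.
# Result shape: (number of customers, 5).
first_step = amounts[:, None] * multi_data[years, states, :]

# Hypothetical handling of the extra condition: apply the factor only to
# states A, B, C (codes 0, 1, 2) and multiply everyone else by 1.
lookup2_val = 0.1  # assumed constant, as in the question's lookup2
factor = np.where(np.isin(states, [0, 1, 2]), lookup2_val, 1.0)
adjusted = first_step * factor[:, None]
```

Because every customer is handled in one array operation, only the loop over years (and lookups) would remain in Python.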
I have this code:
listplayers = [['player1', 5,1,300,100, ..., n],['player2', 10,5,650,150,...n],['player3', 17,6,1100,1050...,n]]
dictionary = {
    'name':[],
    'totalwins':[],
    'totalloss':[],
    'moneywon':[],
    'moneylost':[],
}
for x in listplayers:
    dictionary['name'].append(x[0])
    dictionary['totalwins'].append(x[1])
    dictionary['totalloss'].append(x[2])
    dictionary['moneywon'].append(x[3])
    dictionary['moneylost'].append(x[4])
My output:
dictionary = {
    'name': [player1,player2,player3,...,n],
    'totalwins':[5,10,17,...,n],
    'totalloss':[1,5,6],
    'moneywon':[300,650,1100],
    'moneylost':[100,150,1050],
}
It works just fine, but I have to write out every dictionary key and append every item individually
(ex: dictionary['totalwins'].append(x[1])),
so if I had a dictionary with 30 keys and a list with 30 different player characteristics (ex: wins, losses, etc.) I would have to write 30 lines.
Is there a way to write the same code in fewer lines (ex: by looping through everything) instead of writing 30 lines like so:
1 for x in listplayers:
2 dictionary['name'].append(x[0])
3 dictionary['totalwins'].append(x[1])
... ...
30 dictionary['key30'].append(x[30])
If you make a list of keys, you can zip up the values, zip that with the keys, and pass the whole thing to dict():
listplayers = [['player1',5,1,300,100], ['player2',10,5,650,150], ['player3',17,6,1100,1050]]
keys = ['playersname','totalwins','totalloss','moneywon','moneylost']
dictionary = dict(zip(keys, zip(*listplayers)))
dictionary
# {'playersname': ('player1', 'player2', 'player3'),
# 'totalwins': (5, 10, 17),
# 'totalloss': (1, 5, 6),
# 'moneywon': (300, 650, 1100),
# 'moneylost': (100, 150, 1050)}
Notice that this gives you tuples, not lists. If that's a problem, you can wrap the zips in a dict comprehension or use map to convert them:
dictionary = {key: list(values) for key, values in zip(keys, zip(*listplayers))}
or
dictionary = dict(zip(keys, map(list,zip(*listplayers))))
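One caveat worth sketching: zip() stops at the shortest input without complaint, so a row with a missing value would silently drop keys. The length check below is an added precaution, not part of the approach above:

```python
listplayers = [['player1', 5, 1, 300, 100],
               ['player2', 10, 5, 650, 150],
               ['player3', 17, 6, 1100, 1050]]
keys = ['playersname', 'totalwins', 'totalloss', 'moneywon', 'moneylost']

# zip() silently stops at the shortest input, so check the row lengths first.
for row in listplayers:
    if len(row) != len(keys):
        raise ValueError(f"expected {len(keys)} values, got {len(row)}: {row}")

# Transpose the rows into columns and pair each column with its key.
dictionary = {key: list(values) for key, values in zip(keys, zip(*listplayers))}
```

With 30 keys, only the keys list grows; the last line stays the same.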
You could do the following.
list1 = [['player1', 5,1,300,100],['player2', 10,5,650,150]]
dictionary = {f'key_{i}':[*x] for i,x in enumerate(zip(*list1))}
The resulting dictionary:
{'key_0': ['player1', 'player2'],
'key_1': [5, 10],
'key_2': [1, 5],
'key_3': [300, 650],
'key_4': [100, 150]}
Or, if you have some key names in mind:
list1 = [['player1', 5,1,300,100],['player2', 10,5,650,150]]
keys = ['playersname',
'totalwin',
'totalloss',
'moneywon',
'moneyloss']
{keys[i]:[*x] for i,x in enumerate(zip(*list1))}
The result:
{'playersname': ['player1', 'player2'],
'totalwin': [5, 10],
'totalloss': [1, 5],
'moneywon': [300, 650],
'moneyloss': [100, 150]}
How can I write code that changes the values in each individual array within the multidimensional array a to zeros from the first negative value onward? For example, the second array within a, [12,34,5,6,88,-10,30,75], contains the negative value -10, so that value and all the values after it should become zeros, turning the array into [12,34,5,6,88,0,0,0]. How would I get my Expected Output?
import numpy as np
a = np.array([[12,45,50,60,30],
[12,34,5,6,88,-10,30,75],
[3,45,332,45,-12,-4,-64,12],
[12,45,3,22,323]])
Expected Output:
[[12,45,50,60,30],
[12,34,5,6,88,0,0,0],
[3,45,332,45,0,0,0,0],
[12,45,3,22,323]]
Try this:
import numpy as np
a = np.array([[12,45,50,60,30],
              [12,34,5,6,88,-10,30,75],
              [3,45,332,45,-12,-4,-64,12],
              [12,45,3,22,323]], dtype='object')
for l in a:
    for i in l:
        if i<0:
            l[l.index(i):] = [0] * len(l[l.index(i):])
a
output:
array([list([12, 45, 50, 60, 30]), list([12, 34, 5, 6, 88, 0, 0, 0]),
list([3, 45, 332, 45, 0, 0, 0, 0]), list([12, 45, 3, 22, 323])],
dtype=object)
Second solution:
import numpy as np
def neg_to_zero(l):
    for i in l:
        if i<0:
            l[l.index(i):] = [0] * len(l[l.index(i):])
a = np.array([[12,45,50,60,30],
              [12,34,5,6,88,-10,30,75],
              [3,45,332,45,-12,-4,-64,12],
              [12,45,3,22,323]], dtype='object')
list(map(neg_to_zero, a))
a
Your array:
In [608]: a = np.array([[12,45,50,60,30],
...: [12,34,5,6,88,-10,30,75],
...: [3,45,332,45,-12,-4,-64,12],
...: [12,45,3,22,323]])
<ipython-input-608-894f7005e102>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
a = np.array([[12,45,50,60,30],
In [609]: a
Out[609]:
array([list([12, 45, 50, 60, 30]), list([12, 34, 5, 6, 88, -10, 30, 75]),
list([3, 45, 332, 45, -12, -4, -64, 12]),
list([12, 45, 3, 22, 323])], dtype=object)
This contains lists that vary in length. It is not multidimensional. Making it an array, as opposed to leaving it as a list of lists, does not make it any easier to process.
Either way you have to iterate and change each list separately.
First, pay attention to the answer by hpaulj. Don't use numpy if your data is unsuitable. Your data is unsuitable for numpy because you have a list of lists where each contained list has a different length. It would be suitable for numpy if all had the same length (matrix shape).
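For comparison, here is a sketch of what the rectangular case could look like, assuming every row had the same length. The data values are made up, and this is only one possible NumPy formulation:

```python
import numpy as np

# Rectangular data: every row has the same length (made-up values).
a = np.array([[12, 34, 5, -10, 30],
              [3, -12, -4, 64, 12],
              [12, 45, 3, 22, 323]])

# Running "or" along each row: True from the first negative entry onward.
mask = np.logical_or.accumulate(a < 0, axis=1)

# Boolean-mask assignment zeroes those positions in place.
a[mask] = 0
```

This does not work on the ragged data from the question, which is why the rest of this answer sticks to plain lists.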
To the problem itself: reduce it to solving the task on a single list, then transform each list.
data = [
[12, 45, 50, 60, 30],
[12, 34, 5, 6, 88, -10, 30, 75],
[3, 45, 332, 45, -12, -4, -64, 12],
[12, 45, 3, 22, 323]
]
for row in data:
    transform(row)
The algorithm: we'll iterate over the list, and when we find the negative element, we know the current position, and then we can set all following elements.
I'll show you two variants.
The first variant uses slicing. It also uses enumerate(), which gives you (index, value) tuples for a list (or other iterable).
def transform(lst):
    for (index, value) in enumerate(lst):
        if value < 0:
            lst[index:] = [0] * (len(lst) - index)
            break  # everything from here on is zero, nothing left to check
It creates a new list filled with zeroes by multiplying [0] (a 1-element list) by the length of the remainder. Then it assigns that to a slice of the list that is being transformed. This slice assignment changes the list itself.
The second variant works with a bit of "state":
def transform(lst):
    do_overwrite = False
    for (index, value) in enumerate(lst):
        if value < 0:
            do_overwrite = True # "flips a switch", stays on
        if do_overwrite:
            lst[index] = 0
Python lists are objects, like pretty much everything else in Python. That means that when you call a function and pass a list as an argument, the list isn't copied; the function receives that same list object to work with. Any changes to that object are "visible" to the caller, because it is the same list object that is handled.
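A small illustration of that in-place behavior, using the slicing variant. The names alias, copy_first, and untouched are made up for this example:

```python
def transform(lst):
    """Zero out the first negative value and everything after it, in place."""
    for index, value in enumerate(lst):
        if value < 0:
            lst[index:] = [0] * (len(lst) - index)
            break

row = [12, 34, 5, 6, 88, -10, 30, 75]
alias = row          # a second name for the same list object, not a copy
transform(row)       # returns None; the list itself is modified

copy_first = [3, 45, -12, 7]
untouched = list(copy_first)   # an explicit copy is a different object
transform(copy_first)          # changes copy_first but not untouched
```

If the original data must be preserved, make the copy before calling the function.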
I'm trying to create a ranking system to order the fruits from biggest to smallest:
is_big = [None,None,10,1/100,-1/100]
I need to combine the values of is_big[i] with the elements at the corresponding indexes of each entry in fruits:
fruits = [['Pear','green',2,3,5],
['Apple','red',5,2,10],
['mango','yellow',4,6,12]]
To make the ranking system, I give each fruit a score based on its element values and the values of is_big: I need to multiply is_big[i] by the fruit's element at index i and sum the products, wherever is_big[i] is not None.
The expected result should be fruits_ranking = [[19.98],[49.92],[39.94]]
so that after sorting by fruits_ranking I get the result
fruits =[['Apple','red',5,2,10],
['mango','yellow',4,6,12],
['Pear','green',2,3,5]]
My code so far is:
rs = []
for i in range(len(fruits)):
    for c in range(len(is_big)):
        if is_big[c] != None and not isinstance(is_big[i],str):
            rs.append(is_big[i]*fruits[c])
As you can see, my code does not work. Any kind of help would be appreciated.
One way is the following:
[[i,sum(x*y for x,y in zip(i[2:],[0 if v is None else v for v in is_big][2:]))] for i in fruits]
Output:
[[['Pear', 'green', 2, 3, 5], 19.98],
[['Apple', 'red', 5, 2, 10], 49.92],
[['mango', 'yellow', 4, 6, 12], 39.940000000000005]]
You could use sorted() with a function score as the key parameter that computes the score you need. The score you describe is a dot product with is_big, which is why converting to np.ndarray proves useful: it lets you multiply elementwise with a plain * and then sum.
import numpy as np
# avoid Nones, convert to np.ndarray to enable dot product
coeffs = np.array(is_big[2:])
def score(input_fruit):
    input_numeric = np.array(input_fruit[2:]) # keep only numerical values
    return (input_numeric * coeffs).sum() # elementwise product, then sum
First, you can check that scores match your requirements:
>>> print(list(zip(fruits, map(score, fruits))))
[
(["Pear", "green", 2, 3, 5], 19.98),
(["Apple", "red", 5, 2, 10], 49.92),
(["mango", "yellow", 4, 6, 12], 39.940000000000005),
]
Now sort:
# reverse=True for descending order
>>> print(sorted(fruits, key=score, reverse=True))
[["Apple", "red", 5, 2, 10], ["mango", "yellow", 4, 6, 12], ["Pear", "green", 2, 3, 5]]
An itertools solution might be:
import itertools as it
import operator as op
is_big = [None, None, 10, 1/100, -1/100]
fruits = [["Pear", "green", 2, 3, 5], ["Apple", "red", 5, 2, 10], ["mango", "yellow", 4, 6, 12]]
fr_ranks = [sum(fr*val for fr, val in zip_one if val is not None)
for zip_one in it.starmap(zip, zip(fruits, it.repeat(is_big)))]
# [19.98, 49.92, 39.940000000000005]
sorted_fruits, _ = zip(*sorted(zip(fruits, fr_ranks), key=op.itemgetter(1), reverse=True))
# (['Apple', 'red', 5, 2, 10], ['mango', 'yellow', 4, 6, 12], ['Pear', 'green', 2, 3, 5])
First, we align each fruit list in fruits with the value list is_big via repeating is_big and zipping it with the fruits, which will give 2-tuples of such list pairs when evaluated. Then it.starmap, with its function argument being zip, generates the desired fruit-value pairs that will reside in zip_one. A sum of products while skipping None's will give the fruit rankings.
Then we sort the fruits and this is sorting two parallel lists where the key is the corresponding element of fr_ranks. This gives a tuple but you can easily cast to list.
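For comparison, the same ranking can be sketched with only built-ins. The function name score and the variable ranked are chosen here for illustration; the skip-None rule is the one described in the question:

```python
is_big = [None, None, 10, 1/100, -1/100]
fruits = [['Pear', 'green', 2, 3, 5],
          ['Apple', 'red', 5, 2, 10],
          ['mango', 'yellow', 4, 6, 12]]

def score(fruit):
    # Weighted sum over the positions where is_big has a weight.
    return sum(value * weight
               for value, weight in zip(fruit, is_big)
               if weight is not None)

# Sort by score, biggest first; sorted() returns a new list of the same rows.
ranked = sorted(fruits, key=score, reverse=True)
```

Passing score as the key avoids building and unzipping a parallel list of rankings.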
I have a dictionary of arrays like this:
y_dict= {1: np.array([5, 124, 169, 111, 122, 184]),
2: np.array([1, 2, 3, 4, 5, 6, 111, 184]),
3: np.array([169, 5, 111, 152]),
4: np.array([0, 567, 5, 78, 90, 111]),
5: np.array([]),
6: np.array([])}
I need to find the intersection of the arrays in my dictionary y_dict.
As a first step, I removed the empty arrays from the dictionary, like this:
dic = {i:j for i,j in y_dict.items() if np.array(j).size != 0}
So, dic has the following view:
dic = { 1: np.array([5, 124, 169, 111, 122, 184]),
2: np.array([1, 2, 3, 4, 5, 6, 111, 184]),
3: np.array([169, 5, 111, 152]),
4: np.array([0, 567, 5, 78, 90, 111])}
To find the intersection, I tried a tuple-based approach like this:
result_dic = list(set.intersection(*({tuple(p) for p in v} for v in dic.values())))
Actual result is empty list: [];
Expected result should be: [5, 111]
Could you please help me to find intersection of arrays in dictionary? Thanks
The code you posted is overcomplex and wrong because there's one extra inner iteration that needs to go. You want to do:
result_dic = list(set.intersection(*(set(v) for v in dic.values())))
or with map and without a for loop:
result_dic = list(set.intersection(*(map(set,dic.values()))))
Result:
[5, 111]
- iterate on the values (ignore the keys)
- convert each numpy array to a set (converting to tuple also works, but intersection would convert those to sets anyway)
- pass the lot to intersection with argument unpacking
We can even get rid of step 1 by creating sets on every array and filtering out the empty ones using filter:
result_dic = list(set.intersection(*(filter(None,map(set,y_dict.values())))))
That's for the sake of a one-liner, but in real life, expressions may be decomposed so they're more readable & commentable. That decomposition may also help us to avoid the crash which occurs when passed no arguments (because there were no non-empty sets) which defeats the smart way to intersect sets (first described in Best way to find the intersection of multiple sets?).
Just create the list beforehand, and call intersection only if the list is not empty. If it is empty, just return an empty list instead:
non_empty_sets = [set(x) for x in y_dict.values() if x.size]
result_dic = list(set.intersection(*non_empty_sets)) if non_empty_sets else []
You should be using numpy's intersection here, not directly in Python. And you'll need to add special handling for the empty intersection.
>>> intersection = None
>>> for a in y_dict.values():
...     if a.size:
...         if intersection is None:
...             intersection = a
...             continue
...         intersection = np.intersect1d(intersection, a)
...
>>> if intersection is not None:
...     print(intersection)
...
[  5 111]
When intersection is None, all of the arrays in y_dict had size zero (no elements). In this case the intersection is not well-defined; you have to decide for yourself what the code should do here. Raising an exception is probably reasonable, but it depends on the use case.
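The same loop can be sketched more compactly with functools.reduce. The helper name intersect_non_empty is made up, and raising ValueError for the all-empty case is just one possible choice:

```python
from functools import reduce

import numpy as np

y_dict = {1: np.array([5, 124, 169, 111, 122, 184]),
          2: np.array([1, 2, 3, 4, 5, 6, 111, 184]),
          3: np.array([169, 5, 111, 152]),
          4: np.array([0, 567, 5, 78, 90, 111]),
          5: np.array([]),
          6: np.array([])}

def intersect_non_empty(arrays):
    """Intersect all non-empty arrays; raise if there are none."""
    non_empty = [a for a in arrays if a.size]
    if not non_empty:
        raise ValueError("no non-empty arrays to intersect")
    # np.intersect1d returns the sorted, unique common values of two arrays.
    return reduce(np.intersect1d, non_empty)

common = intersect_non_empty(y_dict.values())
```

Note that if only one non-empty array is present, reduce returns it unchanged, so the result is not sorted or de-duplicated in that case.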