Background info
multi_data is a 3D array with shape (10, 5, 5). For this example, multi_data = np.arange(250).reshape(10, 5, 5).
Each of the 10 matrices is a 5x5 matrix over the states (A-E).
The matrices are in chronological order and represent time in years, in increments of 1,
starting from multi_data[0], which contains the matrix values for year 1, up to multi_data[9] for year 10.
Example of multi_data at year 1
multi_data[0]
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
Customers usually make a purchase within a few years of signing up (not immediately); for example, this customer made a purchase in year 3.
Hence the matrix calculation for this customer starts at year 3.
Each customer has a current_state (A-E), and I need to transform the customer data so that I can multiply it by the matrices. For example, customer 1 has current state B, so the amount goes in the second element of the vector: customer1 = np.array([0, 1000, 0, 0, 0]).
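For illustration, such a vector can be built from the state letter and the amount with a small helper (state_vector is just an illustrative name, not part of the actual code):

import numpy as np

state_list = ['A', 'B', 'C', 'D', 'E']

def state_vector(state, amount):
    # Zero vector with the amount placed at the index of the state letter.
    vec = np.zeros(len(state_list))
    vec[state_list.index(state)] = amount
    return vec

customer1 = state_vector('B', 1000)   # array([   0., 1000.,    0.,    0.,    0.])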
dataframe 1 (customers)
cust_id | state | amount | year
      1 | B     |   1000 |    3
      2 | D     |    500 |    2
multi_data = np.arange(250).reshape(10, 5, 5)
customer1 = np.array([0, 1000, 0, 0, 0])
output = customer1
results = []
for arr in multi_data[3:4]:  # customer purchases in year 3, so multiply customer1 by the year-3 matrix
    output = output @ arr
    results.append(output)
Example of the output:
results = [array([80000, 81000, 82000, 83000, 84000])]
I then need to multiply the results by dataframe 2
dataframe_2
| year | lim % |
|    1 |  0.19 |
|    2 |  0.11 |
|    3 |  0.02 |
|   10 |  0.23 |
So I multiply the results by the lim % at year 3:
dataframe2 = dataframe2.loc[dataframe2['year'] == 3]
results = dataframe2['lim %'].values * results
Example of the output:
[array([1600,1620,1640,1660,1680])]
I then need to multiply these results by the year-4 matrix and then by the year-4 lim %, and so on until year 10 is reached,
like this:
customer1 = np.array([1600, 1620, 1640, 1660, 1680])  # results carried over from year 3
output = customer1
results = []
for arr in multi_data[4:5]:  # multiplying by the year-4 matrix
    output = output @ arr
    results.append(output)
dataframe2 = dataframe2.loc[dataframe2['year'] == 4]
results = dataframe2['lim %'].values * results
Is there an easier way to do this that is less manual? I need to continue this calculation until year 10 for each customer, and I need to save the results for each customer after every calculation.
Additional info:
I am currently looping over all customers and years as below, but my problem is that I have a lot of vlookup-type calculations like dataframe2 that need to be applied between the years for each customer, and I have to save the results for each customer after each calculation.
results_dict = {}
for _id, c, y in zip(cust_id, cust_amt, year):
    results = []
    for m in multi_data[y:]:
        c = c @ m
        results.append(c)
    results_dict[_id] = results
Unfortunately, since you need all the intermediate results, I don't think this can be optimised much, so you will need the loop. If you didn't need the intermediate results, you could precompute the matrix product for each year up to year 10; here, however, that doesn't help.
To integrate the look-ups in your loop, you can put all the lookup dataframes in a list and use the DataFrame index to query the values. You can also convert the state to an integer index. Note that you don't need to create the customer1 vector: since it is non-zero in only one position, you can directly extract the relevant row of the matrix and multiply it by the amount.
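As a quick check of that last point (using the year-to-index convention multi_data[year - 1] adopted below), multiplying the one-hot vector by the matrix gives the same result as scaling the relevant matrix row by the amount:

import numpy as np

multi_data = np.arange(250).reshape(10, 5, 5)
customer1 = np.array([0, 1000, 0, 0, 0])   # state B, amount 1000
state_idx = 1                              # position of 'B' in ['A', 'B', 'C', 'D', 'E']
year = 3

full_product = customer1 @ multi_data[year - 1]
row_product = 1000 * multi_data[year - 1, state_idx, :]

assert np.array_equal(full_product, row_product)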
Sample data:
import pandas as pd
import numpy as np
customer_data = pd.DataFrame({"cust_id": [1, 2, 3, 4, 5, 6, 7, 8],
                              "state": ['B', 'E', 'D', 'A', 'B', 'E', 'C', 'A'],
                              "cust_amt": [1000, 300, 500, 200, 400, 600, 200, 300],
                              "year": [3, 3, 4, 3, 4, 2, 2, 4],
                              "group": [10, 25, 30, 40, 55, 60, 70, 85]})
state_list = ['A','B','C','D','E']
# All lookups should be dataframes with the year and/or group and the value like these.
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'lim %': 0.1})
lookup2 = pd.concat([pd.DataFrame({'group': g, 'lookup_val': 0.1, 'year': range(1, 11)})
                     for g in customer_data['group'].unique()])
multi_data = np.arange(250).reshape(10,5,5)
Preprocessing:
# Put all lookups in order of calculation in this list.
lookups = [lookup1, lookup2]
# Preprocessing.
# Transform the state to categorical code to use it as array index.
customer_data['state'] = pd.Categorical(customer_data['state'],
                                        categories=state_list,
                                        ordered=True).codes
# Set index on lookups.
for i in range(len(lookups)):
    if 'group' in lookups[i].columns:
        lookups[i] = lookups[i].set_index(['year', 'group'])
    else:
        lookups[i] = lookups[i].set_index(['year'])
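After this preprocessing, a value can be fetched directly through the index, for example:

lookups[0].loc[3].iat[0]        # lim % for year 3 -> 0.1
lookups[1].loc[(3, 10)].iat[0]  # lookup_val for year 3, group 10 -> 0.1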
Calculating results:
results = {}
for customer, state, amount, start, group in customer_data.itertuples(name=None, index=False):
    for year in range(start, len(multi_data) + 1):
        if year == start:
            results[customer] = [[amount * multi_data[year - 1, state, :]]]
        else:
            results[customer].append([results[customer][-1][-1] @ multi_data[year - 1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
Accessing the results:
# Here are examples of how you obtain the results from the dictionary:
# Customer 1 at start year, first result.
results[1][0][0]
# Customer 1 at start year, second result (which would be after lookup1 here).
results[1][0][1]
# Customer 1 at start year, third result (which would be after lookup2 here).
results[1][0][2]
# Customer 1 at year start+1, first result.
results[1][1][0]
# ...
# Customer c at year start+y, result k+1.
results[c][y][k]
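If a flat table is easier to work with downstream, the nested dictionary could be unpacked like this (just a sketch; the column names are illustrative):

import pandas as pd

rows = []
for cust, per_year in results.items():
    for year_offset, steps in enumerate(per_year):
        for step, vec in enumerate(steps):
            rows.append({'cust_id': cust,
                         'year_offset': year_offset,  # 0 = start year
                         'step': step,                # 0 = matrix product, then one entry per lookup
                         'values': vec})

results_df = pd.DataFrame(rows)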
I have a large df with coordinates in multiple dimensions. I am trying to create classes (objects) based on a threshold difference between the coordinates. An example df is below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
Based on this df, I want to assign each row to a class based on a ±2 threshold across all coordinates, so the df will get a group name added to each row. The output of this threshold function would be:
 x   y   z  group
 1  10   7  -
 2  14   6  -
 3   5   2  G1
 4  14  43  -
 5   3   1  G1
 6  12  40  -
It is similar to clustering, but I want to use my own threshold functions. How can this be done in Python?
EDIT
To clarify: the threshold applies between similar rows. All rows within the ± threshold of each other across all coordinates will be grouped as a single object. It can also be seen as grouping rows based on a threshold across all columns and assigning a unique label to each group.
As far as I understood, what you need is the apply function. It was not entirely clear from your statement whether you need all the pairwise differences between the coordinates or just the neighbouring differences (x - y and y - z): row 5 has a difference of 4 between its x and z coordinates, but is still assigned to class G1.
That's why I wrote it for both possibilities, so you can choose the one you need:
import pandas as pd
import numpy as np

def your_specific_function(row):
    # For all pairwise differences use this instead:
    # diffs = np.array([abs(row.x - row.y), abs(row.y - row.z), abs(row.x - row.z)])

    # For only the neighbouring differences (x - y, y - z) use this:
    diffs = np.diff(row)
    if all(diffs <= 2):
        return 'G1'
    else:
        return '-'

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12], 'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis=1)
print(df.head())
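For the sample df above, the printed head should look roughly like this (row 5 is cut off by head()):

   x   y   z group
0  1  10   7     -
1  2  14   6     -
2  3   5   2    G1
3  4  14  43     -
4  5   3   1    G1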
I have a data frame which looks as follows
df= pd.DataFrame(np.array([[1, 2, 3], [1, 5, 6], [1, 8, 9],[2, 18, 9],[3, 99, 10],[3, 0.3, 5],[2, 58, 78],[4, 8, 9]]),
columns=['id', 'point_A', 'point_B'])
Now I want to create a column which is the sum of point_A and point_B for each row. I can do that with this code: df["sum_of_all"] = df[["point_A", "point_B"]].sum(axis=1)
Now I want to rank the rows based on sum_of_all, but per id: the largest sum within an id gets grade 1, and so on. How can I do that?
Update:
Once I have computed the sum and sorted, my goal is to assign a grade per id, i.e. id 2 at index 6 -> grade 1, id 2 at index 3 -> grade 2, id 3 at index 4 -> grade 1, id 3 at index 5 -> grade 2, and so on.
That's the expectation.
IIUC
# Sort so the largest sum comes first (ties broken by id), then number the rows within each id.
df2 = df.sort_values(by=['sum_of_all', 'id'], ascending=[False, False])
df2['grade'] = df2.groupby('id')['sum_of_all'].cumcount() + 1
df2
Outcome: df2 sorted by sum_of_all, with a grade column numbering the rows within each id (grade 1 = largest sum).
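If you prefer to keep the original row order, an alternative (just a variant, not required) is groupby().rank(), which skips the explicit sort:

df['grade'] = (df.groupby('id')['sum_of_all']
                 .rank(ascending=False, method='first')
                 .astype(int))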
I have an array of binary rows. What I want is to be able to pick a specific percentage of the ones from every row.
For example, say there are 100 ones per row: I want to randomly get back 20% from the first row, 10% from the second,
40% from the third, and 30% from the fourth (100% in total, of course).
0| 00000000001000000010000000000000000000001000000100000000000000000000000000000001 ...
1| 00000000000000010000000000001000000000000100000000000000000000000000000000000000 ...
2| 00000000000000000000000000000010010000000000000000000000000000010000100000000000 ...
3| 01000000000000100000000000000000000000001000100000000000000010000000000000000000 ...
That is easy: just do random.choice(one_idxs, pct) on every row. The problem is that the total number of selected bits has to be 100 too,
i.e. if some bit positions overlap and the random selection picks them, the total number of distinct bits will differ from 100.
Plus, on every row it should try to pick bits which were not selected previously, at least as an option.
Any ideas?
Example of the code I use for the straightforward case (which does not account for selected indexes repeating across rows, only within a row):
ones_count = 100
for i in range(arr.shape[0]):          # for every row; arr is the binary array shown above
    bits_cnt = int(ones_count * probs[i])
    idxs = np.flatnonzero(arr[i])      # indices of the ones in row i
    selected = np.random.choice(idxs, size=bits_cnt, replace=False)
I have to pick only the ONES; that's why I'm using indexes.
Using lists of strings as a convenience instead of bit arrays and getting 4 samples...
In [39]: data = ['10000101',
...: '11110000',
...: '00011000']
In [40]: idxs = random.sample(range(len(data[0])), 4)
In [41]: # 20% row 1, 30% row 2, 50% row 3
In [42]: row_selections = random.choices(range(len(data)), [0.2, 0.3, 0.5], k=len(idxs))
In [43]: idxs
Out[43]: [7, 3, 1, 4]
In [44]: row_selections
Out[44]: [0, 2, 0, 1]
In [45]: picks = [ data[r][c] for (r, c) in zip(row_selections, idxs)]
In [46]: picks
Out[46]: ['1', '1', '0', '0']
OK, in light of your comment, this should work better as an example of how to pick ones only in proportion from each list/array:
import random
a1= '10001010111110001101010101'
a2= '00101010001011010010100010'
a1 = [int(t) for t in a1]
a2 = [int(t) for t in a2]
a1_one_locations= [idx for idx, v in enumerate(a1) if v==1]
a2_one_locations= [idx for idx, v in enumerate(a2) if v==1]
# lists of indices where 1 exists in each list...
print(a1_one_locations)
print(a2_one_locations)
n_samples = 6 # total desired
# 40% from a1, remainder from a2
a1_samples = int(n_samples * 0.4)
a2_samples = n_samples - a1_samples
a1_picks = random.sample(a1_one_locations, a1_samples)
a2_picks = random.sample(a2_one_locations, a2_samples)
# print results
print('indices from a1: ', a1_picks)
print('indices from a2: ', a2_picks)
Output:
[0, 4, 6, 8, 9, 10, 11, 12, 16, 17, 19, 21, 23, 25]
[2, 4, 6, 10, 12, 13, 15, 18, 20, 24]
indices from a1: [6, 21]
indices from a2: [10, 15, 4, 20]
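The question also asks to avoid re-picking indices already chosen in earlier rows. A minimal sketch of that idea, reusing the index lists from the snippet above (the total can fall short if a row runs out of fresh ones):

import random

rows = [a1_one_locations, a2_one_locations]  # per-row indices of ones, from above
shares = [0.4, 0.6]                          # fraction of the total to draw from each row
n_samples = 6

picked = set()
picks_per_row = []
for ones, share in zip(rows, shares):
    fresh = [i for i in ones if i not in picked]   # skip indices used by earlier rows
    k = min(round(n_samples * share), len(fresh))
    chosen = random.sample(fresh, k)
    picked.update(chosen)
    picks_per_row.append(chosen)

print(picks_per_row)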
Let's say I have an array such as this:
a = np.array([[1, 2, 3, 4, 5, 6, 7], [20, 25, 30, 35, 40, 45, 50], [2, 4, 6, 8, 10, 12, 14]])
and a dataframe such as this:
num letter
0 1 a
1 2 b
2 3 c
What I would then like to do is to calculate the difference between the first and last number in each sequence in the array and ultimately add this difference to a new column in the df.
Currently I am able to calculate the desired difference in each sequence in this manner:
for i in a:
print(i[-1] - i[0])
Giving me the following results:
6
30
12
What I expected to be able to do is replace the print with an assignment to df['new_col'], like so:
df['new_col'] = (i[-1] - i[0])
And for my df to then look like this:
num letter new_col
0 1 a 6
1 2 b 30
2 3 c 12
However, I end up getting this:
num letter new_col
0 1 a 12
1 2 b 12
2 3 c 12
I would also really appreciate it if anyone could tell me what the equivalents of .diff() and .shift() are in numpy, as I tried using them the same way you would on a pandas dataframe and just got error messages. This would be useful if I want to calculate differences not just between the first and last numbers but somewhere in between.
Any help would be really appreciated, cheers.
Currently you only perform the difference calculation with the very last value of i, so every row gets the same number.
Use a list comprehension instead:
a = np.array([[1, 2, 3, 4, 5, 6, 7], [20, 25, 30, 35, 40, 45, 50], [2, 4, 6, 8, 10, 12, 14]])
b = [i[-1] - i[0] for i in a]
If the lengths don't match, you need to pad the list with NaNs:
b = b + [np.nan] * (len(df) - len(b))
df['new_col'] = b
Might be better off doing this in a DataFrame if your array grows in size.
df1 = pd.DataFrame(a.T)
df['new_col'] = df1.iloc[-1] - df1.iloc[0]
print(df)
num letter new_col
0 1 a 6
1 2 b 30
2 3 c 12
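Regarding the numpy equivalents of .diff() and .shift() asked about in the question: np.diff covers the first, and a shift can be emulated with slicing (np.roll also shifts, but wraps around). A small sketch:

import numpy as np

row = np.array([1, 2, 3, 4, 5, 6, 7])

np.diff(row)        # pairwise differences: array([1, 1, 1, 1, 1, 1])
np.diff(row, n=2)   # second-order differences

# Shift by one without wrap-around, padding the front with NaN (needs a float array).
shifted = np.empty(row.shape, dtype=float)
shifted[0] = np.nan
shifted[1:] = row[:-1]

np.roll(row, 1)     # shifts, but wraps the last element to the front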
I have thousands of data frames like the following, though much larger (1,000,000 rows, 100 columns).
data = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
'count':[45, 66, 6, 6, 1, 432, 3],
'Value':['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
I want to randomly sample from this data frame and build a new data frame such that the sum of count equals N. That is, I want to sample using the count value as a weight and produce a new, resampled data frame whose count column sums to N.
The relative proportions should stay approximately the same, no resampled value should exceed its original count, and the values in cols1 (or any other column except Value and count) should remain the same.
For example, if N was 50, it might look like:
pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
'count':[4, 7, 1, 1, 0, 37, 0],
'Value':['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
How can this be done?
Efficiency is key, otherwise I could expand the data frame based on count and randomly sample without replacement, then merge it back together.
Thanks,
Jack
Using multinomial sampling, this is relatively easy.
import numpy as np
from itertools import chain

def downsample(df, N):
    prob = df['count'] / sum(df['count'])
    df['count'] = list(chain.from_iterable(np.random.multinomial(n=N, pvals=prob, size=1)))
    df = df[df['count'] != 0]
    return df
For OP's example:
downsample(data, 50)
returns:
Value cols1 count
1 Boy 5 1
3 Corn 4 16
5 Barnes 32 33
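One caveat: multinomial sampling is with replacement, so a resampled count can in principle exceed the original count. If that constraint has to hold exactly, a without-replacement draw with numpy's multivariate hypergeometric generator (a sketch, assuming NumPy >= 1.18) behaves like expanding by count and sampling without replacement, without actually expanding:

import numpy as np

rng = np.random.default_rng()

def downsample_capped(df, N):
    # Draw N items without replacement, so no row can exceed its original count.
    out = df.copy()
    out['count'] = rng.multivariate_hypergeometric(out['count'].to_numpy(), N)
    return out[out['count'] != 0]

downsample_capped(data, 50)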