Iterate Through All Possible Combinations, Summing Another Column - python

I have a dataframe such as the one below:
user_id  sales  example_flag_1  example_flag_2  quartile_1  quartile_2
1        10     0               1               1           1
2        21     1               1               2           2
3        300    0               1               3           3
4        41     0               1               4           4
5        55     0               1               1           1
...
I'm attempting to iterate through all possible combinations of (in my example) example_flag_1, example_flag_2, quartile_1, and quartile_2. Then, for each combination, what is the sum of sales for users who fit that combination profile?
For example, for all users with 1, 1, 1, 1, what is the sum of their sales?
What about 0, 1, 1, 1?
I want the computer to go through all possible combinations and tell me.
I hope that's clear, but let me know if you have any questions.

Sure.
Use itertools.product() to generate the combinations, functools.reduce() to generate the mask, and you're off to the races:
import itertools
from functools import reduce
import pandas as pd

data = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5],
        "sales": [10, 21, 300, 41, 55],
        "example_flag_1": [0, 1, 0, 0, 0],
        "example_flag_2": [1, 1, 1, 1, 1],
        "quartile_1": [1, 2, 3, 4, 1],
        "quartile_2": [1, 2, 3, 4, 1],
    }
)
flag_columns = ["example_flag_1", "example_flag_2", "quartile_1", "quartile_2"]
# Distinct values observed in each flag column
flag_options = [set(data[col].unique()) for col in flag_columns]
for combo_options in itertools.product(*flag_options):
    combo = {col: option for col, option in zip(flag_columns, combo_options)}
    # AND together one boolean mask per column
    mask = reduce(lambda x, y: x & y, [data[col] == option for col, option in combo.items()])
    sales_sum = data[mask].sales.sum()
    print(combo, sales_sum)
This prints out (e.g.)
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 1, 'quartile_2': 1} 65
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 1, 'quartile_2': 2} 0
...
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 1} 0
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 2} 0
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 3} 300
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 4} 0
...
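If the number of combinations grows large, masking the frame once per combination can get slow. A possible alternative, sketched here under the assumption that the same `data` and `flag_columns` are used, is to let `groupby` sum the observed combinations and then reindex over the full Cartesian product so that unobserved combinations show up as 0. The names `observed`, `all_combos` and `sales_by_combo` are illustrative only.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5],
        "sales": [10, 21, 300, 41, 55],
        "example_flag_1": [0, 1, 0, 0, 0],
        "example_flag_2": [1, 1, 1, 1, 1],
        "quartile_1": [1, 2, 3, 4, 1],
        "quartile_2": [1, 2, 3, 4, 1],
    }
)
flag_columns = ["example_flag_1", "example_flag_2", "quartile_1", "quartile_2"]

# Sum sales for every combination that actually occurs in the data
observed = data.groupby(flag_columns)["sales"].sum()

# Build every possible combination of the observed values per column
all_combos = pd.MultiIndex.from_product(
    [sorted(data[col].unique()) for col in flag_columns], names=flag_columns
)

# Combinations with no matching users get a sum of 0
sales_by_combo = observed.reindex(all_combos, fill_value=0)

# Show only the combinations that have any sales
print(sales_by_combo[sales_by_combo > 0])
```

The result is a Series indexed by the four flag values, so a single combination can be looked up with `sales_by_combo.loc[(0, 1, 1, 1)]`.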

Related

How to find the highest value according to specified certain conditions in for loop?

I managed to put the arrays in for loop and, depending on the condition, select the values I need. From these selected values I try to choose the highest value from the matrix a and b. Unfortunately, somehow I miss some syntax.
my code
a=np.array([0, 0, 0, 1, 1, 1, 2, 4,2, 2])
b=np.array([0, 1, 2, 0, 1, 2, 0, 1, 2,5])
max_b=b[0]
for (j), (k) in zip(a,b):
    #print(j,k)
    if j>=2 and k>=1:
        print(j,'a')
        print(k,'b')
output:
4 a
1 b
2 a
2 b
2 a
5 b
What I need: from these numbers, the largest value of j and the largest value of k
4 a
5 b
I also created the code specifically to get the highest value in the loop from one matrix without other conditions to make it work better, but I can't incorporate it correctly into my code
maxv=a[0]
for i in a:
    if i > maxv:
        maxv=i
print(maxv)
This is my attempt, but it is stupid
a=np.array([0, 0, 0, 1, 1, 1, 2, 4,2, 2])
b=np.array([0, 1, 2, 0, 1, 2, 0, 1, 2,5])
#max_b=b[0]
for (j), (k) in zip(a,b):
    #print(j,k)
    if j>=2 and k>=1:
        #print(j,'a')
        # print(k,'b')
        max_a=j
        max_b=k
        if j > max_a:
            max_a=k
print(max_a)
Can you advise me how it could work?
A correct solution using for loops follows.
Your attempt never updated max_b, did not keep a running max_a at all, and never compared the current value against the running maximum before overwriting it.
import numpy as np
a = np.array([0, 0, 0, 1, 1, 1, 2, 4, 2, 2])
b = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 5])
# Starting values; this works here because a[0] and b[0] are below
# every value that passes the filter
max_a = a[0]
max_b = b[0]
for j, k in zip(a, b):
    # print(j, k)
    if j >= 2 and k >= 1:
        # Update each running maximum only when the current value is larger
        if max_a < j:
            max_a = j
        if max_b < k:
            max_b = k
print(max_a, "a")
print(max_b, "b")
Alternatively, build a boolean mask with NumPy and call .max() on the filtered arrays.
This solution needs no explicit for loop; the approach is known as vectorization.
import numpy as np
a = np.array([0, 0, 0, 1, 1, 1, 2, 4, 2, 2])
b = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 5])
a_ge_2 = a >= 2
b_ge_1 = b >= 1
conditions_apply_mask = a_ge_2 & b_ge_1
a_filtered = a[conditions_apply_mask]
b_filtered = b[conditions_apply_mask]
max_a_filtered = a_filtered.max()
max_b_filtered = b_filtered.max()
print(f"{max_a_filtered}, a")
print(f"{max_b_filtered}, b")
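One caveat with the masked approach: calling `.max()` on an empty array raises a `ValueError`, which happens when no pair satisfies both conditions. A minimal sketch of a guard follows; the helper name `max_where` and its `default` parameter are hypothetical and not part of NumPy.

```python
import numpy as np

def max_where(values, mask, default=None):
    """Return the largest element of values where mask is True, or default if none match."""
    selected = values[mask]
    return selected.max() if selected.size else default

a = np.array([0, 0, 0, 1, 1, 1, 2, 4, 2, 2])
b = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 5])
mask = (a >= 2) & (b >= 1)

print(max_where(a, mask), "a")  # 4 a
print(max_where(b, mask), "b")  # 5 b
print(max_where(a, a > 100))    # None, because nothing matches
```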

Transformations of Numpy Array, Pass values to function

I am trying to take a numpy array and feed it into a constraint problem.
The table that I am working with is below -
Type  1  2  3  4  5  6
A     0  1  1  0  1  1
B     0  1  1  0  1  1
C     0  0  1  1  1  1
D     0  0  1  1  1  1
E     1  1  1  1  0  1
F     1  1  1  1  1  1
When I transform the table into an array, I am trying to do two things:
1) Add 6 more columns (would be headers 7, 8, 9, 10, 11, and 12). The new columns would have the exact same values of columns 1-6 for each type (A, B, C, etc...)
2) For each type, I am trying to pass the values into a constraint method. So for example I'd like to pass (x, y) where x is the type ("a") and y are the columns where value = 1. In other words the first (x, y) would be ("a", [2, 3, 5, 6, 8, 9, 11, 12])
(for step #1 it may not be necessary to create columns 7-12 if I had the ability to just add 6 to existing "true" values columns 1-6)
I have tried using nditer but I am confused on how to hold on to the "Type" in a nditer, do a boolean check on the value and then pass the column name? I was thinking that I would not even use the column name and just use a counter to come up with the y variable.
You can do it using apply and np.tile, as shown below.
np.tile repeats the boolean mask of each row, so the doubled mask covers columns 1-12; indexing np.arange with it and adding 1 yields the 1-based column numbers.
import numpy as np
import pandas as pd
df = pd.DataFrame([['A', 0, 1, 1, 0, 1, 1], ['B', 0, 1, 1, 0, 1, 1], ['C', 0, 0, 1, 1, 1, 1], ['D', 0, 0, 1, 1, 1, 1], ['E', 1, 1, 1, 1, 0, 1], ['F', 1, 1, 1, 1, 1, 1]], columns=('Type', '1', '2', '3', '4', '5', '6'))
def constraint(x, y):
    print("x: ", x)
    print("y: ", y)
    return x
# row.iloc[0] is the type; row.iloc[1:] holds the six 0/1 flags
df.apply(lambda row: constraint(row.iloc[0], np.arange(2*len(row.iloc[1:]))[np.tile((row.iloc[1:]==1),2)]+1), axis=1)
OR using a loop like below
for i in range(len(df)):
    row=df.iloc[i]
    constraint(row.iloc[0], np.arange(2*len(row.iloc[1:]))[np.tile((row.iloc[1:]==1),2)]+1)
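The question also suggests skipping the extra columns and simply adding 6 to the positions that are already true. A rough sketch of that idea follows; the helper `true_columns` and its `n_repeats` parameter are hypothetical names, and the `(type, columns)` pairs are collected in a list rather than passed to a real constraint method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["A", 0, 1, 1, 0, 1, 1],
        ["B", 0, 1, 1, 0, 1, 1],
        ["C", 0, 0, 1, 1, 1, 1],
        ["D", 0, 0, 1, 1, 1, 1],
        ["E", 1, 1, 1, 1, 0, 1],
        ["F", 1, 1, 1, 1, 1, 1],
    ],
    columns=["Type", "1", "2", "3", "4", "5", "6"],
)

def true_columns(flags, n_repeats=2):
    """Return 1-based column numbers where flags == 1, repeated with an offset per block."""
    flags = np.asarray(flags)
    base = np.flatnonzero(flags == 1) + 1          # e.g. [2, 3, 5, 6]
    offsets = np.arange(n_repeats) * flags.size    # e.g. [0, 6]
    return (base[None, :] + offsets[:, None]).ravel()

# One (type, columns) pair per row of the table
pairs = [
    (row_type, true_columns(values))
    for row_type, values in zip(df["Type"], df.drop(columns="Type").to_numpy())
]

for row_type, columns in pairs:
    print(row_type, columns.tolist())
```

For type A this gives `[2, 3, 5, 6, 8, 9, 11, 12]`, matching the example in the question.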

Count of the number of identical values in two arrays for all the unique values in an array

I have two arrays A and B. A has multiple values (these values can be string or integer or float) and B has values 0 and 1. I need, for each unique value in A, the count of points that coincide with the 1s in B and the 0s in B. Both the counts need to be stored as separate variables.
For example:
A = [1, 1, 3, 2, 2, 1, 1, 3, 3] # input multivalue array; it has three unique values – 1,2,3
B = [0, 0, 0, 1, 1, 1, 0, 1, 0] # input binary array
#Desired result:
countA1_B1 = 1 #for unique value of '1' in A the count of places where there is '1' in B
countA1_B0 = 3 #for unique value of '1' in A the count of places where there is '0' in B
countAno1_B1 = 3 #for unique value of '1' in A the count of places where there is no '1' in A but there is '1' in B
countAno1_B0 = 2 #for unique value of '1' in A the count of places where there is no '1' in A and there is '0' in B
I need this for all the unique values in A. The A array/list would be a raster and hence the unique values will not be known. So the code would first extract the unique values in A and then do the remaining calculations
My approach to solving this (see my previous question):
import numpy as np
A = [1, 1, 3, 2, 2, 1, 1, 3, 3] # input array
B = [0, 0, 0, 1, 1, 1, 0, 1, 0] # input binary array
A_arr = np.array(A)
A_unq = np.unique(A_arr)
#code 1
A_masked_arrays = np.array((A_arr[None, :] == A_unq[:, None]).astype(int))
#code 2
# A_masked_arrays = [(A==unique_val).astype(int) for unique_val in np.unique(A)]
print(A_masked_arrays)
out = {val: arr for val, arr in zip(list(A_unq), list(A_arr))}
#zip() throws error
#TypeError: 'zip' object is not callable.
dict = {}
for i in A_unq:
    for j in A_masked_arrays:
        dict = i, j
        print(dict)
Result obtained:
# from code 1
[[1 1 0 0 0 1 1 0 0]
[0 0 0 1 1 0 0 0 0]
[0 0 1 0 0 0 0 1 1]]
# from code 2
[array([1, 1, 0, 0, 0, 1, 1, 0, 0]), array([0, 0, 0, 1, 1, 0, 0, 0, 0]),
array([0, 0, 1, 0, 0, 0, 0, 1, 1])]
Using dictionary creation I get this result
(1, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(1, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(1, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
(2, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(2, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(2, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
(3, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(3, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(3, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
This is where I am stuck. From here, how do I get to the final counts for each unique value in A, i.e. countA1_B1, countA1_B0, countAno1_B1, countAno1_B0 and so on? Thanks in advance.
Selective use of np.bincount should do the trick:
import numpy as np
# Ai maps every element of A to the index of its unique value in Au
Au, Ai = np.unique(A, return_inverse=True)
out = np.empty((2, Au.size))
# Row 0: count of positions where B == 0; row 1: count where B == 1
out[0] = np.bincount(Ai, weights=1 - np.array(B), minlength=Au.size)
out[1] = np.bincount(Ai, weights=np.array(B), minlength=Au.size)
outdict = {}
for i in range(Au.size):
    for j in [0, 1]:
        outdict[(Au[i], j)] = out[j, i]
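The question also asks for the complementary counts (positions where A is not the given value). These follow by subtracting the per-value counts from the totals of B. A minimal sketch using the example arrays; the key names `in_B1`, `in_B0`, `not_B1` and `not_B0` are made up for illustration.

```python
import numpy as np

A = np.array([1, 1, 3, 2, 2, 1, 1, 3, 3])
B = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0])

values, inverse = np.unique(A, return_inverse=True)
n = values.size

# Per unique value of A: how many positions have B == 1, and how many have B == 0
in_b1 = np.bincount(inverse, weights=B, minlength=n).astype(int)
in_b0 = np.bincount(inverse, minlength=n) - in_b1

# Complement: positions where A is any other value
not_b1 = int(B.sum()) - in_b1
not_b0 = int((B == 0).sum()) - in_b0

counts = {
    value.item(): {"in_B1": int(i1), "in_B0": int(i0), "not_B1": int(n1), "not_B0": int(n0)}
    for value, i1, i0, n1, n0 in zip(values, in_b1, in_b0, not_b1, not_b0)
}
print(counts[1])  # {'in_B1': 1, 'in_B0': 3, 'not_B1': 3, 'not_B0': 2}
```

For the value 1 this reproduces the four desired numbers listed in the question.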
It's much easier to use pandas to do this kind of groupby operation:
In [11]: import pandas as pd
In [12]: df = pd.DataFrame({"A": A, "B": B})
In [13]: df
Out[13]:
   A  B
0  1  0
1  1  0
2  3  0
3  2  1
4  2  1
5  1  1
6  1  0
7  3  1
8  3  0
Now you can use groupby:
In [14]: gb = df.groupby("A")["B"]
In [15]: gb.count() # number of As
Out[15]:
A
1 4
2 2
3 3
Name: B, dtype: int64
In [16]: gb.sum() # number of As where B == 1
Out[16]:
A
1 1
2 2
3 1
Name: B, dtype: int64
In [17]: gb.count() - gb.sum() # number of As where B == 0
Out[17]:
A
1 3
2 0
3 2
Name: B, dtype: int64
You can also do this more explicitly and more generally (e.g. if it's not just 0 and 1) with an apply:
In [18]: gb.apply(lambda x: (x == 1).sum())
Out[18]:
A
1 1
2 2
3 1
Name: B, dtype: int64
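When B can take more than two values, another option is `pd.crosstab`, which counts every (A, B) pair in one call. This is a sketch on the example data, not something the original answer used.

```python
import pandas as pd

A = [1, 1, 3, 2, 2, 1, 1, 3, 3]
B = [0, 0, 0, 1, 1, 1, 0, 1, 0]

# Rows are the unique values of A, columns the unique values of B,
# and each cell is the number of positions with that (A, B) pair
table = pd.crosstab(pd.Series(A, name="A"), pd.Series(B, name="B"))
print(table)

# Look up a single count, e.g. A == 1 and B == 0
print(table.loc[1, 0])
```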

Pandas - Row number since last greater than 0 value

Let's say I have a Pandas series like so:
import pandas as pd
pd.Series([1, 0, 0, 1, 0, 0, 0], name='series')
How would I add a column with a row count since the last >0 number, like so:
pd.DataFrame({
'series': [1, 0, 0, 1, 0, 0, 0],
'row_num': [0, 1, 2, 0, 1, 2, 3]
})
Try this, where s is the series from the question:
s.groupby(s.cumsum()).cumcount()
Output:
0 0
1 1
2 2
3 0
4 1
5 2
6 3
dtype: int64
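Grouping on `s.cumsum()` works for this example because the series holds only 0s and 1s. If the series could contain negative numbers, the running sum would no longer change exactly at each positive value. A hedged variant that groups on `(s > 0).cumsum()` instead is sketched below; `group_id` and `s2` are illustrative names.

```python
import pandas as pd

s = pd.Series([1, 0, 0, 1, 0, 0, 0], name="series")

# A new group starts at every value greater than 0
group_id = (s > 0).cumsum()
row_num = s.groupby(group_id).cumcount()

df = pd.DataFrame({"series": s, "row_num": row_num})
print(df)

# With a negative value in the data, the two grouping keys differ
s2 = pd.Series([2, 0, -1, 0, 3, 0])
print(s2.groupby((s2 > 0).cumsum()).cumcount().tolist())  # [0, 1, 2, 3, 0, 1]
print(s2.groupby(s2.cumsum()).cumcount().tolist())        # [0, 1, 0, 1, 0, 1]
```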
Alternatively, with NumPy:
1) Find the positions where the series/array is nonzero
2) Calculate the differences from one such position to the next
3) Repeat each start position by its difference and subtract the result from a running index
i = np.flatnonzero(s)
n = len(s)
delta = np.diff(np.append(i, n))
r = np.arange(n)
r - r[i].repeat(delta)
array([0, 1, 2, 0, 1, 2, 3])
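The NumPy snippet above appears to assume that the first element is nonzero; otherwise the repeated start positions are shorter than the array and the subtraction fails. A possible alternative that tolerates leading zeros uses a running maximum of the last positive index. The function name `rows_since_positive` is hypothetical, and counting from the start of the array for leading zeros is an assumption about the desired behaviour.

```python
import numpy as np

def rows_since_positive(values):
    """Number of rows since the last value > 0 (counted from index 0 before any positive)."""
    values = np.asarray(values)
    r = np.arange(values.size)
    # Index of the most recent positive value at or before each position
    last_positive = np.maximum.accumulate(np.where(values > 0, r, 0))
    return r - last_positive

print(rows_since_positive([1, 0, 0, 1, 0, 0, 0]))  # [0 1 2 0 1 2 3]
print(rows_since_positive([0, 0, 1, 0]))           # [0 1 0 1]
```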

Find the sum of certain columns in pandas

I am trying to use pandas to sum certain columns while retaining the others.
For example:
member_no, data_1, data_2, data_3, dat_1, dat_2, other_1, other_2
1, 1, 3, 0, 0, 1, 1, 0
1, 1, 3, 0, 0, 1, 0, 1
2, 0, 1, 5, 1 ,0, 1, 0
2, 0, 1, 5, 1 ,0, 0, 1
I want the result to be
member_no, data_1, data_2, data_3, dat_1, dat_2, other_1, other_2
1, 1, 3, 0, 0, 1, 1, 1
2, 0, 1, 5, 1, 0, 1, 1
For a given member id, all the columns with 'data' and 'dat' will have the same value, so I just want to retain that. The columns with the 'other' attribute need to be summed.
Thanks for the help.
You're looking for a groupby on member_no followed by max():
df = df.groupby('member_no', as_index=False).max()
print(df)
   member_no  data_1  data_2  data_3  dat_1  dat_2  other_1  other_2
0          1       1       3       0      0      1        1        1
1          2       0       1       5      1      0        1        1
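Note that `max` reproduces the expected output here only because each 'other' column contains at most one 1 per member; it is not a true sum. If the 'other' columns really need to be added up, one option is to pass a per-column mapping to `agg`. This is a sketch that assumes the column prefixes from the question; `agg_map` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "member_no": [1, 1, 2, 2],
        "data_1": [1, 1, 0, 0],
        "data_2": [3, 3, 1, 1],
        "data_3": [0, 0, 5, 5],
        "dat_1": [0, 0, 1, 1],
        "dat_2": [1, 1, 0, 0],
        "other_1": [1, 0, 1, 0],
        "other_2": [0, 1, 0, 1],
    }
)

# Sum the 'other' columns; keep the first value of every remaining column
agg_map = {
    col: ("sum" if col.startswith("other") else "first")
    for col in df.columns
    if col != "member_no"
}
result = df.groupby("member_no", as_index=False).agg(agg_map)
print(result)
```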
