Find the sum of certain columns in pandas - python

I am trying to use pandas to sum certain columns while retaining the others.
For example:
member_no, data_1, data_2, data_3, dat_1, dat_2, other_1, other_2
1, 1, 3, 0, 0, 1, 1, 0
1, 1, 3, 0, 0, 1, 0, 1
2, 0, 1, 5, 1, 0, 1, 0
2, 0, 1, 5, 1, 0, 0, 1
I want the result to be
member_no, data_1, data_2, data_3, dat_1, dat_2, other_1, other_2
1, 1, 3, 0, 0, 1, 1, 1
2, 0, 1, 5, 1, 0, 1, 1
For a given member id, all the columns with 'data' and 'dat' will have the same value, so I just want to retain that. The columns with the 'other' attribute need to be summed.
Thanks for the help.

You're looking for a groupby on member_no plus max. Because the 'data'/'dat' columns are constant within each member and each 'other' flag is 1 in exactly one row per member, max both retains the constant values and reproduces the sums:
df = df.groupby('member_no', as_index=False).max()
print(df)
member_no data_1 data_2 data_3 dat_1 dat_2 other_1 other_2
0 1 1 3 0 0 1 1 1
1 2 0 1 5 1 0 1 1
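If the 'other' columns should literally be summed while everything else is merely retained (max and sum happen to coincide in this example), a per-column aggregation map is a more explicit option -- a sketch assuming the original df and the column layout above:
agg_map = {col: ('sum' if col.startswith('other') else 'first')
           for col in df.columns if col != 'member_no'}
df = df.groupby('member_no', as_index=False).agg(agg_map)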


Iterate Thru All Possible Combinations, Summing Another Column

I have a dataframe such as below:
user_id  sales  example_flag_1  example_flag_2  quartile_1  quartile_2
1        10     0               1               1           1
2        21     1               1               2           2
3        300    0               1               3           3
4        41     0               1               4           4
5        55     0               1               1           1
...
I'm attempting to iterate through all possible combinations of (in my example) example_flag_1, example_flag_2, quartile_1, and quartile_2. Then, for each combination, what is the sum of sales for users who fit that combination profile?
For example, for all users with 1, 1, 1, 1, what is the sum of their sales?
What about 0, 1, 1, 1?
I want the computer to go through all possible combinations and tell me.
I hope that's clear, but let me know if you have any questions.
Sure.
Use itertools.product() to generate the combinations, functools.reduce() to generate the mask, and you're off to the races:
import itertools
from functools import reduce

import pandas as pd

data = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5],
        "sales": [10, 21, 300, 41, 55],
        "example_flag_1": [0, 1, 0, 0, 0],
        "example_flag_2": [1, 1, 1, 1, 1],
        "quartile_1": [1, 2, 3, 4, 1],
        "quartile_2": [1, 2, 3, 4, 1],
    }
)

flag_columns = ["example_flag_1", "example_flag_2", "quartile_1", "quartile_2"]
flag_options = [set(data[col].unique()) for col in flag_columns]

for combo_options in itertools.product(*flag_options):
    combo = {col: option for col, option in zip(flag_columns, combo_options)}
    mask = reduce(lambda x, y: x & y, [data[col] == option for col, option in combo.items()])
    sales_sum = data[mask].sales.sum()
    print(combo, sales_sum)
This prints out (e.g.)
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 1, 'quartile_2': 1} 65
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 1, 'quartile_2': 2} 0
...
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 1} 0
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 2} 0
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 3} 300
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 4} 0
...
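If you only need the sums for combinations that actually occur in the data, a plain groupby gives the same numbers in one line; combinations with no matching rows simply won't appear in the result:
observed = data.groupby(flag_columns)["sales"].sum()
print(observed)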

Getting binary labels from a dataframe and a list of labels

Suppose I have the following list of labels,
labs = ['G1','G2','G3','G4','G5','G6','G7']
and also suppose that I have the following df:
   group entity_label
0      0           G1
1      0           G2
3      1           G5
4      1           G1
5      2           G1
6      2           G2
7      2           G3
To produce the above df you can use:
df_test = pd.DataFrame({'group': [0, 0, 0, 1, 1, 2, 2, 2, 2],
                        'entity_label': ['G1', 'G2', 'G2', 'G5', 'G1', 'G1', 'G2', 'G3', 'G3']})
df_test = df_test.drop_duplicates(subset=['group', 'entity_label'], keep='first')
For each group I want to look up against the list of labels and build a new dataframe with binary indicator lists:
group entity_label_binary
0 0 [1, 1, 0, 0, 0, 0, 0]
1 1 [1, 0, 0, 0, 1, 0, 0]
2 2 [1, 1, 1, 0, 0, 0, 0]
Namely, for group 0 we have G1 and G2, hence the 1s in the first two positions above, and so on. How can one do this?
One option, based on crosstab:
labs = ['G1','G2','G3','G4','G5','G6','G7']
(pd.crosstab(df_test['group'], df_test['entity_label'])
   .clip(upper=1)
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Variant, with get_dummies and groupby.max:
(pd.get_dummies(df_test['entity_label'])
   .groupby(df_test['group']).max()
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Output:
group entity_label_binary
0 0 [1, 1, 0, 0, 0, 0, 0]
1 1 [1, 0, 0, 0, 1, 0, 0]
2 2 [1, 1, 1, 0, 0, 0, 0]
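One caveat, assuming a recent pandas (2.0+): pd.get_dummies now returns boolean columns, so the lists in the second variant would come out as True/False. Casting to int keeps the 0/1 output shown above:
(pd.get_dummies(df_test['entity_label'])
   .astype(int)
   .groupby(df_test['group']).max()
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)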

Loading a .dat file into one list and counting the entries in it

I have a .dat file with the following:
0 0 0 0 1 1
1 1 0 1 0 0
0 0 1 1 0 0
0 1 1 1 1 1
0 1 0 0 0 1
1 1 0 1 0 1
I'm trying to get this all into one list and then count the number of 1s and 0s.
I've got the code so far:
with open('image.dat', 'r') as a:
    for line in a:
        b = [line.strip()]
        print(b)
c = b.count(0)
print(c)
This just gives me:
['0 0 0 0 1 1']
['1 1 0 1 0 0']
['0 0 1 1 0 0']
['0 1 1 1 1 1']
['0 1 0 0 0 1']
['1 1 0 1 0 1']
0
I'm new to coding and I've tried everything.
Thanks for helping.
You can just count the number of times the string '0' (or '1') occurs within the file:
with open("image.dat", "r") as a:
    print(a.read().count('0'))
To load the data into one list, you can use the list.extend method, for example:
data = []
with open('image.dat', 'r') as a:
    for line in a:
        data.extend(map(int, line.split()))
print(data)
Prints:
[0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
Then:
print('Number of 0:', data.count(0))
print('Number of 1:', data.count(1))
Prints:
Number of 0: 18
Number of 1: 18
EDIT: To load the data as a list of lists:
lines = []
with open('image.dat', 'r') as a:
    for line in a:
        line = line.strip()
        if not line:
            continue
        lines.append(list(map(int, line.split())))
print(lines)
Prints:
[[0, 0, 0, 0, 1, 1], [1, 1, 0, 1, 0, 0], [0, 0, 1, 1, 0, 0], [0, 1, 1, 1, 1, 1], [0, 1, 0, 0, 0, 1], [1, 1, 0, 1, 0, 1]]
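For totals over the list of lists, collections.Counter keeps one running tally across all rows -- a small usage sketch building on the lines list above:
from collections import Counter

totals = Counter()
for row in lines:
    totals.update(row)
print('Number of 0:', totals[0])
print('Number of 1:', totals[1])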

Count of the number of identical values in two arrays for all the unique values in an array

I have two arrays A and B. A has multiple values (these values can be string or integer or float) and B has values 0 and 1. I need, for each unique value in A, the count of points that coincide with the 1s in B and the 0s in B. Both the counts need to be stored as separate variables.
For example:
A = [1, 1, 3, 2, 2, 1, 1, 3, 3] # input multivalue array; it has three unique values – 1,2,3
B = [0, 0, 0, 1, 1, 1, 0, 1, 0] # input binary array
#Desired result:
countA1_B1 = 1 #for unique value of '1' in A the count of places where there is '1' in B
countA1_B0 = 3 #for unique value of '1' in A the count of places where there is '0' in B
countAno1_B1 = 3 #for unique value of '1' in A the count of places where there is no '1' in A but there is '1' in B
countAno1_B0 = 2 #for unique value of '1' in A the count of places where there is no '1' in A and there is '0' in B
I need this for all the unique values in A. The A array/list would be a raster, so the unique values will not be known in advance; the code would first extract the unique values in A and then do the remaining calculations.
My approach to solving this (see my previous question):
import numpy as np

A = [1, 1, 3, 2, 2, 1, 1, 3, 3] # input array
B = [0, 0, 0, 1, 1, 1, 0, 1, 0] # input binary array
A_arr = np.array(A)
A_unq = np.unique(A_arr)

#code 1
A_masked_arrays = np.array((A_arr[None, :] == A_unq[:, None]).astype(int))
#code 2
# A_masked_arrays = [(A_arr == unique_val).astype(int) for unique_val in A_unq]
print(A_masked_arrays)

out = {val: arr for val, arr in zip(list(A_unq), list(A_arr))}
#zip() throws error
#TypeError: 'zip' object is not callable.

dict = {}
for i in A_unq:
    for j in A_masked_arrays:
        dict = i, j
        print(dict)
Result obtained:
# from code 1
[[1 1 0 0 0 1 1 0 0]
[0 0 0 1 1 0 0 0 0]
[0 0 1 0 0 0 0 1 1]]
# from code 2
[array([1, 1, 0, 0, 0, 1, 1, 0, 0]), array([0, 0, 0, 1, 1, 0, 0, 0, 0]),
array([0, 0, 1, 0, 0, 0, 0, 1, 1])]
Using dictionary creation I get this result
(1, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(1, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(1, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
(2, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(2, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(2, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
(3, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(3, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(3, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
This is where I am stuck. From here, how do I get to the final counts for each unique value in A as countA1_B1, countA1_B0, countAno1_B1, countAno1_B0 and so on? Need help with this. Thanks in advance.
Selective use of np.bincount should do the trick:
import numpy as np

Au, Ai = np.unique(A, return_inverse=True)
out = np.empty((2, Au.size))
out[0] = np.bincount(Ai, weights=1 - np.array(B), minlength=Au.size)
out[1] = np.bincount(Ai, weights=np.array(B), minlength=Au.size)

outdict = {}
for i in range(Au.size):
    for j in [0, 1]:
        outdict[(Au[i], j)] = out[j, i]
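A quick sanity check against the example arrays (the values are floats because np.bincount with weights returns a float array):
for a_val, b_val in sorted(outdict):
    print(f"A={a_val}, B={b_val}: count={int(outdict[(a_val, b_val)])}")
which should print count 3 for (A=1, B=0), 1 for (A=1, B=1), 0 for (A=2, B=0), 2 for (A=2, B=1), 2 for (A=3, B=0) and 1 for (A=3, B=1), matching the pandas results below.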
It's much easier to use pandas to do this kind of groupby operation:
In [11]: import pandas as pd
In [12]: df = pd.DataFrame({"A": A, "B": B})
In [13]: df
Out[13]:
A B
0 1 0
1 1 0
2 3 0
3 2 1
4 2 1
5 1 1
6 1 0
7 3 1
8 3 0
Now you can use groupby:
In [14]: gb = df.groupby("A")["B"]
In [15]: gb.count() # number of As
Out[15]:
A
1 4
2 2
3 3
Name: B, dtype: int64
In [16]: gb.sum() # number of As where B == 1
Out[16]:
A
1 1
2 2
3 1
Name: B, dtype: int64
In [17]: gb.count() - gb.sum() # number of As where B == 0
Out[17]:
A
1 3
2 0
3 2
Name: B, dtype: int64
You can also do this more explicitly and more generally (e.g. if it's not just 0 and 1) with an apply:
In [18]: gb.apply(lambda x: (x == 1).sum())
Out[18]:
A
1 1
2 2
3 1
Name: B, dtype: int64
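The complementary counts the question also asks for (countAno1_B1 and countAno1_B0, i.e. rows where A is not the given value) follow from the overall totals -- a sketch building on the gb object above:
total_b1 = df["B"].sum()               # rows where B == 1, any A
total_b0 = (df["B"] == 0).sum()        # rows where B == 0, any A
b1_per_value = gb.sum()                # B == 1 where A == value
b0_per_value = gb.count() - gb.sum()   # B == 0 where A == value
no_value_b1 = total_b1 - b1_per_value  # B == 1 where A != value
no_value_b0 = total_b0 - b0_per_value  # B == 0 where A != value
For A == 1 this gives countAno1_B1 = 4 - 1 = 3 and countAno1_B0 = 5 - 3 = 2, matching the values in the question.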

Pandas - Row number since last greater than 0 value

Let's say I have a Pandas series like so:
import pandas as pd

s = pd.Series([1, 0, 0, 1, 0, 0, 0], name='series')
How would I add a column with a row count since the last >0 number, like so:
pd.DataFrame({
    'series': [1, 0, 0, 1, 0, 0, 0],
    'row_num': [0, 1, 2, 0, 1, 2, 3]
})
Try this:
s.groupby(s.cumsum()).cumcount()
Output:
0 0
1 1
2 2
3 0
4 1
5 2
6 3
dtype: int64
NumPy
1. Find the places where the series/array is greater than 0.
2. Calculate the differences from one place to the next.
3. Subtract those values from a sequence.
import numpy as np

i = np.flatnonzero(s)
n = len(s)
delta = np.diff(np.append(i, n))
r = np.arange(n)
r - r[i].repeat(delta)
array([0, 1, 2, 0, 1, 2, 3])
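Either approach drops straight into a DataFrame column if you want the exact layout from the question -- a sketch using the groupby version:
df = pd.DataFrame({'series': [1, 0, 0, 1, 0, 0, 0]})
df['row_num'] = df.groupby(df['series'].cumsum()).cumcount()
print(df)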
