Pandas - Row number since last greater than 0 value - python

Let's say I have a Pandas series like so:
import pandas as pd
s = pd.Series([1, 0, 0, 1, 0, 0, 0], name='series')
How would I add a column with a row count since the last >0 number, like so:
pd.DataFrame({
    'series': [1, 0, 0, 1, 0, 0, 0],
    'row_num': [0, 1, 2, 0, 1, 2, 3]
})

Try this:
s.groupby(s.cumsum()).cumcount()
Output:
0 0
1 1
2 2
3 0
4 1
5 2
6 3
dtype: int64
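Why this works: cumsum increments at every value greater than 0, so each run starting at a nonzero value gets its own group id, and cumcount numbers the rows within each group. A quick sketch, assuming the series is bound to s as above:
s.cumsum()                        # 1 1 1 2 2 2 2 -> a new group id at each value > 0
s.groupby(s.cumsum()).cumcount()  # 0 1 2 0 1 2 3 -> row position within each group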

Numpy
1. Find the places where the series/array is greater than 0
2. Calculate the differences from one place to the next
3. Subtract those values from a sequence
import numpy as np

i = np.flatnonzero(s)             # indices where the value is > 0
n = len(s)
delta = np.diff(np.append(i, n))  # length of the run following each nonzero position
r = np.arange(n)
r - r[i].repeat(delta)            # note: assumes the first element is > 0
array([0, 1, 2, 0, 1, 2, 3])
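To get the two-column frame shown in the question, either result can be attached as a column; a minimal sketch using the pandas one-liner:
df = s.to_frame()                                 # column 'series'
df['row_num'] = s.groupby(s.cumsum()).cumcount()
print(df)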

Related

Iterate Thru All Possible Combinations, Summing Another Column

I have a dataframe such as below:
user_id  sales  example_flag_1  example_flag_2  quartile_1  quartile_2
      1     10               0               1           1           1
      2     21               1               1           2           2
      3    300               0               1           3           3
      4     41               0               1           4           4
      5     55               0               1           1           1
...
I'm attempting to iterate through all possible combinations of (in my example) example_flag_1, example_flag_2, quartile_1, and quartile_2. Then, for each combination, what is the sum of sales for users who fit that combination profile?
For example, for all users with 1, 1, 1, 1, what is the sum of their sales?
What about 0, 1, 1, 1?
I want the computer to go through all possible combinations and tell me.
I hope that's clear, but let me know if you have any questions.
Sure.
Use itertools.product() to generate the combinations, functools.reduce() to generate the mask, and you're off to the races:
import itertools
from functools import reduce

import pandas as pd

data = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5],
        "sales": [10, 21, 300, 41, 55],
        "example_flag_1": [0, 1, 0, 0, 0],
        "example_flag_2": [1, 1, 1, 1, 1],
        "quartile_1": [1, 2, 3, 4, 1],
        "quartile_2": [1, 2, 3, 4, 1],
    }
)

flag_columns = ["example_flag_1", "example_flag_2", "quartile_1", "quartile_2"]
flag_options = [set(data[col].unique()) for col in flag_columns]

for combo_options in itertools.product(*flag_options):
    combo = {col: option for col, option in zip(flag_columns, combo_options)}
    mask = reduce(lambda x, y: x & y, [data[col] == option for col, option in combo.items()])
    sales_sum = data[mask].sales.sum()
    print(combo, sales_sum)
This prints out, for example:
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 1, 'quartile_2': 1} 65
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 1, 'quartile_2': 2} 0
...
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 1} 0
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 2} 0
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 3} 300
{'example_flag_1': 0, 'example_flag_2': 1, 'quartile_1': 3, 'quartile_2': 4} 0
...
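If you only need the combinations that actually occur in the data (the product above also prints combinations with zero matching rows), a plain groupby over the same flag_columns is a simpler sketch:
data.groupby(flag_columns)["sales"].sum()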

Getting binary labels from a dataframe and a list of labels

Suppose I have the following list of labels,
labs = ['G1','G2','G3','G4','G5','G6','G7']
and also suppose that I have the following df:
   group entity_label
0      0           G1
1      0           G2
3      1           G5
4      1           G1
5      2           G1
6      2           G2
7      2           G3
To produce the above df you can use:
df_test = pd.DataFrame({'group': [0,0,0,1,1,2,2,2,2],
                        'entity_label': ['G1','G2','G2','G5','G1','G1','G2','G3','G3']})
df_test = df_test.drop_duplicates(subset=['group','entity_label'], keep='first')
For each group I want to use a mapping to look the labels up in labs and make a new dataframe with binary labels:
   group    entity_label_binary
0      0  [1, 1, 0, 0, 0, 0, 0]
1      1  [1, 0, 0, 0, 1, 0, 0]
2      2  [1, 1, 1, 0, 0, 0, 0]
Namely, for group 0 we have G1 and G2, hence the 1s in the first two positions, and so on. How can one do this?
One option, based on crosstab:
labs = ['G1','G2','G3','G4','G5','G6','G7']
(pd.crosstab(df_test['group'], df_test['entity_label'])
   .clip(upper=1)
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Variant, with get_dummies and groupby.max:
(pd.get_dummies(df_test['entity_label'])
   .groupby(df_test['group']).max()
   .reindex(columns=labs, fill_value=0)
   .agg(list, axis=1)
   .reset_index(name='entity_label_binary')
)
Output:
   group    entity_label_binary
0      0  [1, 1, 0, 0, 0, 0, 0]
1      1  [1, 0, 0, 0, 1, 0, 0]
2      2  [1, 1, 1, 0, 0, 0, 0]
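A third option, closer to the "mapping" idea in the question, tests each label in labs for membership per group; a minimal sketch (apply rather than agg, since the function returns a list):
(df_test.groupby('group')['entity_label']
        .apply(lambda g: [int(lab in set(g)) for lab in labs])
        .reset_index(name='entity_label_binary'))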

Count of the number of identical values in two arrays for all the unique values in an array

I have two arrays A and B. A has multiple values (these can be strings, integers, or floats) and B has values 0 and 1. For each unique value in A, I need the count of points that coincide with the 1s in B and the count that coincide with the 0s in B. Both counts need to be stored as separate variables.
For example:
A = [1, 1, 3, 2, 2, 1, 1, 3, 3] # input multivalue array; it has three unique values – 1,2,3
B = [0, 0, 0, 1, 1, 1, 0, 1, 0] # input binary array
#Desired result:
countA1_B1 = 1 #for unique value of '1' in A the count of places where there is '1' in B
countA1_B0 = 3 #for unique value of '1' in A the count of places where there is '0' in B
countAno1_B1 = 3 #for unique value of '1' in A the count of places where there is no '1' in A but there is '1' in B
countAno1_B0 = 2 #for unique value of '1' in A the count of places where there is no '1' in A and there is '0' in B
I need this for all the unique values in A. The A array/list would be a raster, so the unique values are not known in advance; the code should first extract the unique values in A and then do the remaining calculations.
My approach to solving this (see my previous question):
import numpy as np
A = [1, 1, 3, 2, 2, 1, 1, 3, 3] # input array
B = [0, 0, 0, 1, 1, 1, 0, 1, 0] # input binary array
A_arr = np.array(A)
A_unq = np.unique(A_arr)
# code 1
A_masked_arrays = np.array((A_arr[None, :] == A_unq[:, None]).astype(int))
# code 2
# A_masked_arrays = [(A == unique_val).astype(int) for unique_val in np.unique(A)]
print(A_masked_arrays)
out = {val: arr for val, arr in zip(list(A_unq), list(A_arr))}
# zip() throws error
# TypeError: 'zip' object is not callable.
dict = {}
for i in A_unq:
    for j in A_masked_arrays:
        dict = i, j
        print(dict)
Result obtained:
# from code 1
[[1 1 0 0 0 1 1 0 0]
[0 0 0 1 1 0 0 0 0]
[0 0 1 0 0 0 0 1 1]]
# from code 2
[array([1, 1, 0, 0, 0, 1, 1, 0, 0]), array([0, 0, 0, 1, 1, 0, 0, 0, 0]),
array([0, 0, 1, 0, 0, 0, 0, 1, 1])]
Using dictionary creation I get this result
(1, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(1, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(1, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
(2, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(2, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(2, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
(3, array([1, 1, 0, 0, 0, 1, 1, 0, 0]))
(3, array([0, 0, 0, 1, 1, 0, 0, 0, 0]))
(3, array([0, 0, 1, 0, 0, 0, 0, 1, 1]))
This is where I am stuck up. From here how to get to the final count of each unique value in A as countA1_B1, countA1_B0, countAno1_B1, countAno1_B0 and so on. Need help with this. Thanks in advance.
Selective use of np.bincount should do the trick:
Au, Ai = np.unique(A, return_inverse=True)  # unique values and, for each element, its index in Au
out = np.empty((2, Au.size))
out[0] = np.bincount(Ai, weights=1 - np.array(B), minlength=Au.size)  # counts where B == 0
out[1] = np.bincount(Ai, weights=np.array(B), minlength=Au.size)      # counts where B == 1
outdict = {}
for i in range(Au.size):
    for j in [0, 1]:
        outdict[(Au[i], j)] = out[j, i]
It's much easier to use pandas to do this kind of groupby operation:
In [11]: import pandas as pd
In [12]: df = pd.DataFrame({"A": A, "B": B})
In [13]: df
Out[13]:
   A  B
0  1  0
1  1  0
2  3  0
3  2  1
4  2  1
5  1  1
6  1  0
7  3  1
8  3  0
Now you can use groupby:
In [14]: gb = df.groupby("A")["B"]
In [15]: gb.count() # number of As
Out[15]:
A
1 4
2 2
3 3
Name: B, dtype: int64
In [16]: gb.sum() # number of As where B == 1
Out[16]:
A
1 1
2 2
3 1
Name: B, dtype: int64
In [17]: gb.count() - gb.sum() # number of As where B == 0
Out[17]:
A
1 3
2 0
3 2
Name: B, dtype: int64
You can also do this more explicitly and more generally (e.g. if it's not just 0 and 1) with an apply:
In [18]: gb.apply(lambda x: (x == 1).sum())
Out[18]:
A
1 1
2 2
3 1
Name: B, dtype: int64
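If you specifically need the four named variables from the question for a given value v, boolean masks give them directly; a minimal sketch with the question's data:
import numpy as np
A = np.array([1, 1, 3, 2, 2, 1, 1, 3, 3])
B = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0])
v = 1
countA1_B1 = int(((A == v) & (B == 1)).sum())    # 1
countA1_B0 = int(((A == v) & (B == 0)).sum())    # 3
countAno1_B1 = int(((A != v) & (B == 1)).sum())  # 3
countAno1_B0 = int(((A != v) & (B == 0)).sum())  # 2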

How to obtain the documents that belong to each cluster in density-based clustering?

I use DBSCAN clustering for text documents as follows,
thanks to this post.
db = DBSCAN(eps=0.3, min_samples=2).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Now I want to see which document belongs to which cluster, like:
[I have a car and it is blue] belongs to cluster0
or
idx [112] belongs to cluster0
A similar question was asked here, and I have already tested some of the answers provided there, such as:
X[labels == 1, :]
and I got:
array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]], dtype=int64)
but this does not help me. Please let me know if you have any suggestions or other ways to do it.
If you have a pandas dataframe df with columns idx and messages, then all you have to do is
df['cluster'] = db.labels_
in order to get a new column cluster with the cluster membership.
Here is a short demo with dummy data:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
X = np.array([[1, 2], [5, 8], [2, 3],
              [8, 7], [8, 8], [2, 2]])
db = DBSCAN(eps=3, min_samples=2).fit(X)
db.labels_
# array([0, 1, 0, 1, 1, 0], dtype=int64)
# convert our numpy array to pandas:
df = pd.DataFrame({'Column1':X[:,0],'Column2':X[:,1]})
print(df)
# result:
   Column1  Column2
0        1        2
1        5        8
2        2        3
3        8        7
4        8        8
5        2        2
# add new column with the belonging cluster:
df['cluster'] = db.labels_
print(df)
# result:
   Column1  Column2  cluster
0        1        2        0
1        5        8        1
2        2        3        0
3        8        7        1
4        8        8        1
5        2        2        0
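To answer the original question directly: once the cluster column exists, filtering on it recovers the members of any cluster, e.g.
# all rows (documents) assigned to cluster 0:
print(df[df['cluster'] == 0])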

Find the sum of certain columns in pandas

I am trying to use pandas to sum certain columns while retaining the others.
For eg:
member_no, data_1, data_2, data_3, dat_1, dat_2, other_1, other_2
1, 1, 3, 0, 0, 1, 1, 0
1, 1, 3, 0, 0, 1, 0, 1
2, 0, 1, 5, 1, 0, 1, 0
2, 0, 1, 5, 1, 0, 0, 1
I want the result to be
member_no, data_1, data_2, data_3, dat_1, dat_2, other_1, other_2
1, 1, 3, 0, 0, 1, 1, 1
2, 0, 1, 5, 1, 0, 1, 1
For a given member id, all the columns with 'data' and 'dat' will have the same value, so I just want to retain that. The columns with the 'other' attribute need to be summed.
Thanks for the help.
You're looking for a groupby on member_no followed by max:
df = df.groupby('member_no', as_index=False).max()
print(df)
   member_no  data_1  data_2  data_3  dat_1  dat_2  other_1  other_2
0          1       1       3       0      0      1        1        1
1          2       0       1       5      1      0        1        1
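Note that max matches the desired output here only because each 'other' flag appears at most once per member; if those flags can repeat, summing just the 'other' columns is safer. A sketch, assuming the column names from the question:
# sum the 'other' columns, keep the first value of everything else
agg_map = {c: ('sum' if c.startswith('other') else 'first')
           for c in df.columns if c != 'member_no'}
df.groupby('member_no', as_index=False).agg(agg_map)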
