I have a column called 'on' with a series of 0s and 1s:
d1 = {'on': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]}
df = pd.DataFrame(d1)
I want to create a new column called 'value' that does a cumulative count (cumsum()) only while the 'on' column is 1, and restarts from zero once the 'on' column shows a zero.
I tried using a combination of cumsum() and np.where, but it doesn't give what I want:
df['value_try'] = df['on'].cumsum()
df['value_try'] = np.where(df['on'] == 0, 0, df['value_try'])
Attempt:
on value_try
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 4
9 1 5
10 0 0
What my desired output would be:
on value
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 1
9 1 2
10 0 0
You can form groups of consecutive 0s or 1s by checking whether the value of on equals that of the previous row with .shift(), and turn that into group numbers with Series.cumsum(). Then use GroupBy.cumsum() within each group to get the value.
g = df['on'].ne(df['on'].shift()).cumsum()
df['value'] = df.groupby(g)['on'].cumsum()
Result:
print(df)
on value
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 1
9 1 2
10 0 0
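For illustration, the intermediate group labels g for this example are:
print(g.tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5]
Each run of consecutive equal values in 'on' gets its own group number, so the cumulative sum restarts whenever a new run begins.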
Let us try cumcount + cumsum
df['out'] = df.groupby(df['on'].eq(0).cumsum()).cumcount()
Out[18]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 0
8 1
9 2
10 0
dtype: int64
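This lines up because every run of 1s here is immediately preceded by a 0. If the series could start with 1, one sketch (the runs name below is just illustrative, not from the answers) combines the run-grouping from the first answer with cumcount and an explicit mask:
runs = df['on'].ne(df['on'].shift()).cumsum()
df['out'] = df.groupby(runs).cumcount().add(1).where(df['on'].eq(1), 0)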
Related
I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds cumulative 1's from column A, and starts over if the value in column A becomes 0 again. So desired output:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let's form the groups with cumsum and count within them with groupby + cumcount:
df['B'] = df.groupby(df.A.eq(0).cumsum()).cumcount().where(df.A == 1, 0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
I have a pandas DataFrame that looks like Y =
0 1 2 3
0 1 1 0 0
1 0 0 0 0
2 1 1 1 0
3 1 1 0 0
4 1 1 0 0
5 1 1 0 0
6 1 0 0 0
7 1 1 1 0
8 1 0 0 0
... .. .. .. ..
14989 1 1 1 1
14990 1 1 1 0
14991 1 1 1 1
14992 1 1 1 0
[14993 rows x 4 columns]
There are a total of 5 unique rows:
1 1 0 0
0 0 0 0
1 1 1 0
1 0 0 0
1 1 1 1
For each unique row, I want to count how many times it appears in the Y DataFrame.
Let's use np.unique:
c, v = np.unique(df.values, axis=0, return_counts=True)
c
array([[0, 0, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0]], dtype=int64)
v
array([1, 2, 4, 2], dtype=int64)
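If you want the unique rows and their counts side by side, one option (a small sketch assuming the integer column labels shown above) is:
pd.DataFrame(c, columns=df.columns).assign(count=v)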
We can use .groupby for this to get the unique combinations.
While applying groupby, we use size() to count the rows in each group.
# Groupby on all columns which aggregates the data
df_group = df.groupby(list(df.columns)).size().reset_index()
# Because we used reset_index we need to rename our count column
df_group.rename({0:'count'}, inplace=True, axis=1)
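In recent pandas versions you can avoid the separate rename by naming the count column when resetting the index:
df_group = df.groupby(list(df.columns)).size().reset_index(name='count')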
Output
0 1 2 3 count
0 0 0 0 0 1
1 1 0 0 0 2
2 1 1 0 0 4
3 1 1 1 0 4
4 1 1 1 1 2
Note
I copied the example dataframe you provided, which looks like this:
print(df)
0 1 2 3
0 1 1 0 0
1 0 0 0 0
2 1 1 1 0
3 1 1 0 0
4 1 1 0 0
5 1 1 0 0
6 1 0 0 0
7 1 1 1 0
8 1 0 0 0
14989 1 1 1 1
14990 1 1 1 0
14991 1 1 1 1
14992 1 1 1 0
I made a sample for you.
import itertools
import random
import pandas as pd

iter_list = list(itertools.product([0, 1], [0, 1], [0, 1], [0, 1]))
sum_list = []
for i in range(1000):
    sum_list.append(random.choice(iter_list))
target_df = pd.DataFrame(sum_list)
target_df.reset_index().groupby(list(target_df.columns)).count().rename(columns={'index': 'count'}).reset_index()
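As a further option, pandas 1.1+ provides DataFrame.value_counts, which counts unique rows directly; applied to the generated sample:
target_df.value_counts().reset_index(name='count')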
I have the following pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})
>>> df
first_column
0 0
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 1
9 1
10 0
11 0
12 0
13 0
14 1
15 1
16 1
17 1
18 1
19 0
20 0
first_column is a binary column of 0s and 1s. There are "clusters" of consecutive ones, which always come in groups of at least two.
My goal is to create a column which "counts" the number of rows of ones per group:
>>> df
first_column counts
0 0 0
1 0 0
2 0 0
3 1 3
4 1 3
5 1 3
6 0 0
7 0 0
8 1 2
9 1 2
10 0 0
11 0 0
12 0 0
13 0 0
14 1 5
15 1 5
16 1 5
17 1 5
18 1 5
19 0 0
20 0 0
This sounds like a job for df.loc[], e.g. df.loc[df.first_column == 1]...something
I'm just not sure how to take into account each individual "cluster" of ones, and how to label each of the unique clusters with the "row count".
How would one do this?
Here's one approach with NumPy's cumsum and bincount -
def cumsum_bincount(a):
    # Append 0 & look for a [0,1] pattern; form a binned array based off the groups of 1s
    ids = a * (np.diff(np.r_[0, a]) == 1).cumsum()
    # Get the bincount, index into the counts with ids and finally mask out the 0s
    return a * np.bincount(ids)[ids]
Sample run -
In [88]: df['counts'] = cumsum_bincount(df.first_column.values)
In [89]: df
Out[89]:
first_column counts
0 0 0
1 0 0
2 0 0
3 1 3
4 1 3
5 1 3
6 0 0
7 0 0
8 1 2
9 1 2
10 0 0
11 0 0
12 0 0
13 0 0
14 1 5
15 1 5
16 1 5
17 1 5
18 1 5
19 0 0
20 0 0
Set the first five elements to 1 (so the first six rows are all 1s) and test again -
In [101]: df.first_column.values[:5] = 1
In [102]: df['counts'] = cumsum_bincount(df.first_column.values)
In [103]: df
Out[103]:
first_column counts
0 1 6
1 1 6
2 1 6
3 1 6
4 1 6
5 1 6
6 0 0
7 0 0
8 1 2
9 1 2
10 0 0
11 0 0
12 0 0
13 0 0
14 1 5
15 1 5
16 1 5
17 1 5
18 1 5
19 0 0
20 0 0
Since first_column is binary, I can use astype(bool) to get True/False
If I take the opposite of those and cumsum I get a handy way of lumping together the Trues or 1s
I then groupby and count with transform
transform broadcasts the count aggregation across the original index
I first use where to group all 0s together.
I use where again to set their counts to 0
I use assign to generate a copy of df with a new column. This is because I don't want to clobber the df we already have. If you want to write directly to df use df['counts'] = c
t = df.first_column.astype(bool)
c = df.groupby((~t).cumsum().where(t, -1))['first_column'].transform('count').where(t, 0)
df.assign(counts=c)
first_column counts
0 0 0
1 0 0
2 0 0
3 1 3
4 1 3
5 1 3
6 0 0
7 0 0
8 1 2
9 1 2
10 0 0
11 0 0
12 0 0
13 0 0
14 1 5
15 1 5
16 1 5
17 1 5
18 1 5
19 0 0
20 0 0
Here is another approach with pandas groupby, which I think is quite readable. A (possible) advantage is that it does not rely on the assumption that only 1 and 0 are present in the column.
The main insight is to create groups of consecutive values and then simply compute their length. We also carry each group's value in the group key, so we can filter out the zeros.
# Relevant column -> grouper needs to be 1-Dimensional
col_vals = df['first_column']
# Group by sequence of consecutive values and value in the sequence.
grouped = df.groupby([(col_vals != col_vals.shift(1)).cumsum(), col_vals])
# Get the length of consecutive values if they are different from zero, else zero
df['counts'] = grouped['first_column'].transform(lambda group: len(group)).where(col_vals != 0, 0)
This is how the groups and keys look like:
for key, group in grouped:
    print(key, group)
(1, 0) first_column
0 0
1 0
2 0
(2, 1) first_column
3 1
4 1
5 1
(3, 0) first_column
6 0
7 0
(4, 1) first_column
8 1
9 1
(5, 0) first_column
10 0
11 0
12 0
13 0
(6, 1) first_column
14 1
15 1
16 1
17 1
18 1
(7, 0) first_column
19 0
20 0
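As a small variation, the lambda can be replaced with the built-in 'size' aggregation, which avoids calling a Python function for every group:
df['counts'] = grouped['first_column'].transform('size').where(col_vals != 0, 0)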
I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to assign a 5 to any ID that has a 1 anywhere in the Number column, and a 0 to those that don't. For example, if the number 1 appears anywhere in the Number column for ID 1, I want to place a 5 in the Total column for every row with that ID.
My desired output would look as such
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I'm trying to think of a way to leverage applymap for this, but I'm not sure how to implement it.
Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID')['Number'].transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
You can use DataFrame.groupby() on the ID column, take the max() of the Number column, turn that into a dictionary, and then use it to create the 'Total' column. Example -
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
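A fully vectorized alternative (a sketch, not from the answers above, assuming Number contains only 0s and 1s as in the question) broadcasts the per-ID maximum with transform and scales it by 5:
df['Total'] = df.groupby('ID')['Number'].transform('max') * 5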
Given a DataFrame, I would like to compute the number of zeros in each row. How can I compute it with Pandas?
This is what I've done so far; it returns a boolean mask marking the zeros:
def is_blank(x):
    return x == 0
indexer = train_df.applymap(is_blank)
Use a boolean comparison, which will produce a boolean df. We can then cast this to int (True becomes 1, False becomes 0) and call sum with axis=1 to count row-wise:
In [56]:
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
df
Out[56]:
a b c
0 1 0 0
1 0 0 0
2 0 1 0
3 1 0 0
4 3 1 0
In [64]:
(df == 0).astype(int).sum(axis=1)
Out[64]:
0 2
1 3
2 2
3 2
4 1
dtype: int64
Breaking the above down:
In [65]:
(df == 0)
Out[65]:
a b c
0 False True True
1 True True True
2 True False True
3 False True True
4 False False True
In [66]:
(df == 0).astype(int)
Out[66]:
a b c
0 0 1 1
1 1 1 1
2 1 0 1
3 0 1 1
4 0 0 1
EDIT
As pointed out by David, the astype(int) is unnecessary, as the Boolean values will be upcast to int when calling sum, so this simplifies to:
(df == 0).sum(axis=1)
You can count the zeros per row using the following pandas expression.
It may help someone who needs to count a particular value in each row
df.isin([0]).sum(axis=1)
Here df is the dataframe and 0 is the value we want to count.
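The same expression with the default axis (axis=0) counts the zeros per column instead:
df.isin([0]).sum()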
Here is another solution using apply() and value_counts().
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
df.apply(lambda s: s.value_counts().get(key=0, default=0), axis=1)
Given the following dataframe df
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
'B': [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
'C': [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
'D': [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
'E': [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]})
[Out]:
A B C D E
0 1 0 1 0 0
1 1 0 1 0 0
2 1 0 1 0 1
3 1 0 0 0 0
4 1 1 0 0 1
5 0 0 1 0 0
6 1 0 0 1 0
7 0 0 0 0 1
8 0 0 0 1 0
9 0 1 0 0 1
Apart from the various answers mentioned before, if the requirement is just using Pandas, another option would be using pandas.DataFrame.eq
df['Zero_Count'] = df.eq(0).sum(axis=1)
[Out]:
A B C D E Zero_Count
0 1 0 1 0 0 3
1 1 0 1 0 0 3
2 1 0 1 0 1 2
3 1 0 0 0 0 4
4 1 1 0 0 1 2
5 0 0 1 0 0 4
6 1 0 0 1 0 3
7 0 0 0 0 1 4
8 0 0 0 1 0 4
9 0 1 0 0 1 3
However, one can also do it with numpy using numpy.sum
import numpy as np
df['Zero_Count'] = np.sum(df == 0, axis=1)
[Out]:
A B C D E Zero_Count
0 1 0 1 0 0 3
1 1 0 1 0 0 3
2 1 0 1 0 1 2
3 1 0 0 0 0 4
4 1 1 0 0 1 2
5 0 0 1 0 0 4
6 1 0 0 1 0 3
7 0 0 0 0 1 4
8 0 0 0 1 0 4
9 0 1 0 0 1 3
Or even using numpy.count_nonzero as follows
df['Zero_Count'] = np.count_nonzero(df == 0, axis=1)
[Out]:
A B C D E Zero_Count
0 1 0 1 0 0 3
1 1 0 1 0 0 3
2 1 0 1 0 1 2
3 1 0 0 0 0 4
4 1 1 0 0 1 2
5 0 0 1 0 0 4
6 1 0 0 1 0 3
7 0 0 0 0 1 4
8 0 0 0 1 0 4
9 0 1 0 0 1 3