I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to apply a 5 to any ID that has a 1 anywhere in the Number column, and a 0 to those that don't. For example, if a 1 appears anywhere in the Number column for ID 1, I want to place a 5 in the Total column for every row with that ID.
My desired output would look like this:
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I'm trying to think of a way to leverage applymap for this, but I'm not sure how to implement it.
Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID')['Number'].transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
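A fully vectorized variant, offered as a sketch that assumes Number only ever holds 0 or 1, is to transform with 'max' and scale the result:

# the group max is 1 exactly when any row in the group is 1,
# so multiplying by 5 yields the desired 0/5 column
df['Total'] = df.groupby('ID')['Number'].transform('max') * 5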
You can use DataFrame.groupby() on the ID column, take the max() of the Number column, turn that into a dictionary, and then use the dictionary to create the Total column. Example -
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply(lambda row: 5 if grouped[row['ID']] else 0, axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply(lambda row: 5 if grouped[row['ID']] else 0, axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
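Since grouped is a plain dict keyed by ID, Series.map avoids the row-wise apply entirely; a sketch, again assuming Number is only ever 0 or 1:

# map each ID to its group max (0 or 1), then scale to 0/5
df['Total'] = df['ID'].map(grouped) * 5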
I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that cumulatively counts the 1s from column A and starts over whenever the value in column A becomes 0 again. So the desired output:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let's do a cumsum to mark the groups, then groupby with cumcount:
df['B'] = df.groupby(df.A.eq(0).cumsum()).cumcount().where(df.A == 1, 0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
Use shift with ne and cumsum to label consecutive runs, then take a grouped cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
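For reference, a self-contained version of this approach with the sample data rebuilt from the question (the first answer produces the same column):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 1, 1, 1, 0]})

# label each run of equal values, then cumulative-sum A within each run
runs = df['A'].shift().ne(df['A']).cumsum()
df['B'] = df.groupby(runs)['A'].cumsum()
print(df)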
When I pass my data to the groupby function with the following code
x = x.groupby(['Time', 'Distance'], as_index=True, observed=False).size().reset_index()
x.columns = ['Time', 'Distance', 'Flow']
x.head(3)
I get this output:
Time Distance Flow
0 0 5 1
1 0 7 170
2 0 8 10
However, I need to do some smoothing, so I also need the skipped values, such as:
Time Distance Flow
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
4 0 4 0
5 0 5 1
etc. In short, I also need the missing group combinations. How can I do this?
Use:
x = pd.DataFrame({
'Time':[0,1,1,1,1,0],
'Distance':[4,5,4,5,5,3],
})
df = x.groupby(['Time', 'Distance'],as_index=True,observed=False).size()
print (df)
Time Distance
0 3 1
4 1
1 4 1
5 3
dtype: int64
df1 = df.unstack(fill_value=0).stack().reset_index(name='Flow')
print (df1)
Time Distance Flow
0 0 3 1
1 0 4 1
2 0 5 0
3 1 3 0
4 1 4 1
5 1 5 3
Or:
m = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df1 = df.reindex(m, fill_value=0).reset_index(name='Flow')
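Note that both variants only fill in combinations of values that each appear somewhere in the data. If Distance values that never occur at all are also needed (0, 1, 2, ... as in the desired output), one sketch is to build that level explicitly as a full integer range:

import numpy as np
import pandas as pd

# extend the Distance level to the full range 0..max, so distances that
# never occur anywhere in the data are included as well
full_dist = np.arange(df.index.get_level_values('Distance').max() + 1)
m = pd.MultiIndex.from_product([df.index.levels[0], full_dist],
                               names=df.index.names)
df1 = df.reindex(m, fill_value=0).reset_index(name='Flow')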
For each unique user in the DataFrame, I would like to delete the row where val equals 1 first occurs, along with all rows before it.
For instance, from the following DataFrame I would like to get a second DataFrame with those rows removed for each user.
user val
0 1 0
1 1 1
2 1 0
3 1 1
4 2 0
5 2 0
6 2 1
7 2 0
8 3 1
9 3 0
10 3 0
11 3 0
12 3 1
The desired output:
user val
0 1 0
1 1 1
2 2 0
3 3 0
4 3 0
5 3 0
6 3 1
Sample Data
import pandas as pd
s = [1,1,1,1,2,2,2,2,3,3,3,3,3]
t = [0,1,0,1,0,0,1,0,1,0,0,0,1]
df = pd.DataFrame(zip(s,t), columns=['user', 'val'])
Use groupby with cummax and shift to remove all rows before, and including, the first 1 in the 'val' column per user.
Assuming your values are only 1 or 0, it's also possible to create the mask with a double cumsum.
m = df.groupby('user').val.apply(lambda x: x.eq(1).cummax().shift().fillna(False))
# m = df.groupby('user').val.apply(lambda x: x.cumsum().cumsum().gt(1))
df.loc[m]
Output:
user val
2 1 0
3 1 1
7 2 0
9 3 0
10 3 0
11 3 0
12 3 1
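If you prefer to avoid apply, an equivalent sketch keeps everything as grouped Series operations (fill_value on a grouped shift assumes pandas >= 0.24):

# the grouped cumulative sum flags everything from the first 1 onward
# within each user; the grouped shift then excludes that first 1 itself
flagged = df['val'].eq(1).groupby(df['user']).cumsum().gt(0)
m = flagged.groupby(df['user']).shift(fill_value=False)
print(df.loc[m])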
I have this original dataframe:
ID T value
1 0 1
1 4 3
2 0 0
2 4 1
2 7 3
The missing rows should take the value from the previous row.
The output should look like this:
ID T value
1 0 1
1 1 1
1 2 1
1 3 1
1 4 3
2 0 0
2 1 0
2 2 0
2 3 0
2 4 1
2 5 1
2 6 1
2 7 3
... ... ...
I tried a loop, but it takes a long time to process.
Any idea how to solve this for large dataframe?
Thanks!
This solution requires unique integer values in T within each group.
Use groupby with a custom function - for each group use reindex, then replace the NaNs in the value column by forward filling with ffill:
import numpy as np

df1 = (df.groupby('ID')[['T', 'value']]
         .apply(lambda x: x.set_index('T')
                           .reindex(np.arange(x['T'].min(), x['T'].max() + 1)))
         .ffill()
         .astype(int)
         .reset_index())
print (df1)
ID T value
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 3
5 2 0 0
6 2 1 0
7 2 2 0
8 2 3 0
9 2 4 1
10 2 5 1
11 2 6 1
12 2 7 3
If you get the error:
ValueError: cannot reindex from a duplicate axis
it means there are duplicated T values within a group, like:
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 1 <- T=4 is duplicated within group 2
4 2 4 3 <- T=4 is duplicated within group 2
5 2 7 3
The solution is to aggregate the values first so that T is unique per group - e.g. by sum:
df = df.groupby(['ID', 'T'], as_index=False)['value'].sum()
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 4
4 2 7 3
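An alternative sketch that avoids reindex altogether: build the full integer range of T per ID, then left-merge the original frame and forward fill (this assumes the same precondition of unique T per group):

import numpy as np
import pandas as pd

# one row per (ID, T) covering min(T)..max(T) of each ID
full = (df.groupby('ID')['T']
          .apply(lambda s: pd.Series(np.arange(s.min(), s.max() + 1)))
          .rename('T')
          .reset_index(level=0))
df1 = full.merge(df, on=['ID', 'T'], how='left')
# rows created by the merge hold NaN; fill them from the previous row per ID
df1['value'] = df1.groupby('ID')['value'].ffill().astype(int)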
Dataframe
a b c
0 0 1 1
1 0 1 1
2 0 0 1
3 0 0 1
4 1 1 0
5 1 1 1
6 1 1 1
7 0 0 1
I am trying to apply a cumulative count (cumcount) to multiple columns of the DataFrame; I have tried applying the cumulative count by grouping each column separately. Is there an easy way to achieve the expected output?
I have tried this code, but it is not working:
li = []
for column in df.columns:
    li.append(df.groupby(column)[column].cumcount())
pd.concat(li, axis=1)
Expected output
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Create consecutive groups by comparing with shifted values, apply cumcount to each column, and finally set 1 where the original value is 0 using a boolean mask:
df = (df.ne(df.shift()).cumsum()
        .apply(lambda x: df.groupby(x).cumcount() + 1)
        .mask(df == 0, 1))
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Another solution, if performance is important - count only the 1 values, and finally set 1 everywhere else via np.where:
a = df == 1
b = a.cumsum()
arr = np.where(a, b-b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
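For copy-paste, here is a self-contained version of that second approach, with the sample frame rebuilt from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 0],
                   'b': [1, 1, 0, 0, 1, 1, 1, 0],
                   'c': [1, 1, 1, 1, 0, 1, 1, 1]})

a = df == 1      # mask of the 1s we want to count
b = a.cumsum()   # running total of 1s per column
# subtracting the total as of the last 0 restarts the count at each run
# of 1s; positions holding 0 are set to 1, as in the expected output
arr = np.where(a, b - b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)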