pandas calculate difference based on indicators grouped by a column - python

Here is my question. I don't know how to describe it, so I will just give an example.
a b k
0 0 0
0 1 1
0 2 0
0 3 0
0 4 1
0 5 0
1 0 0
1 1 1
1 2 0
1 3 1
1 4 0
Here, "a" is user id, "b" is time, and "k" is a binary indicator flag. "b" is consecutive for sure.
What I want to get is this:
a b k diff_b
0 0 0 nan
0 1 1 nan
0 2 0 1
0 3 0 2
0 4 1 3
0 5 0 1
1 0 0 nan
1 1 1 nan
1 2 0 1
1 3 1 2
1 4 0 1
So, diff_b is a time-difference variable: it shows the duration between the current time point and the most recent earlier time point with an action (k == 1). If no action has ever occurred before the current row, it is NaN. diff_b is computed per group of a; for each user it is calculated independently.
Can anyone suggest a better title? I don't know how to describe this in English. So complex...
Thank you!
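For reference, a minimal reproduction of the example data (assuming standard pandas):
import pandas as pd

df = pd.DataFrame({'a': [0]*6 + [1]*5,
                   'b': [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4],
                   'k': [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]})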

IIUC
df['New'] = df.b.loc[df.k == 1]  # get the value of b wherever k equals 1
df.New = df.groupby('a').New.apply(lambda x: x.ffill().shift())  # forward-fill within each group, then shift one row down
df.b - df['New']  # yields
Out[260]:
0 NaN
1 NaN
2 1.0
3 2.0
4 3.0
5 1.0
6 NaN
7 NaN
8 1.0
9 2.0
10 1.0
dtype: float64
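Putting it together as a single assignment (a sketch of the same idea, using where and transform instead of the temporary New column):
last = df.b.where(df.k == 1)  # b where k == 1, NaN elsewhere
df['diff_b'] = df.b - last.groupby(df.a).transform(lambda x: x.ffill().shift())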

Create partitions of the rows after each k == 1 up to and including the next k == 1, using shift and cumsum, for each group of a:
parts = df.groupby('a').k.apply(lambda x: x.shift().cumsum())
Group by df.a and parts, and within each group compute b - b.min() + 1:
vals = df.groupby([df.a, parts]).b.apply(lambda x: x-x.min()+1)
Set the values to NaN where the partition is 0 (before the first action) and assign back to the DataFrame:
import numpy as np

df['diff_b'] = np.select([parts != 0], [vals], np.nan)
outputs:
a b k diff_b
0 0 0 0 NaN
1 0 1 1 NaN
2 0 2 0 1.0
3 0 3 0 2.0
4 0 4 1 3.0
5 0 5 0 1.0
6 1 0 0 NaN
7 1 1 1 NaN
8 1 2 0 1.0
9 1 3 1 2.0
10 1 4 0 1.0
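To see why the masking step works, here are the intermediate partition values for user a == 0 (a sketch over the same data):
parts = df.groupby('a').k.apply(lambda x: x.shift().cumsum())
# for a == 0: NaN, 0, 1, 1, 1, 2  -> a new partition opens right after each k == 1,
# so partition 0 (and the leading NaN) marks rows with no prior action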

Related

Alternating end and start value of groups Pandas

I've got a df ('c' always starts with 0 and 'd' always starts with 1 in each group):
a b c d
0 1 0 1
0 1 0 1
0 2 0 1
0 2 0 0
1 3 0 1
1 3 1 0
1 2 0 1
1 2 0 0
The groups (grouped by 'a' and 'b') in columns 'c' and 'd' are the alternatives to pick from.
The end value of each picked group should alternate with the first value of the next picked group (the transition between subgroups doesn't matter).
For the first group in each subgroup ('a'), pick column 'c'.
Result should be:
a b c d
0 1 0 nan
0 1 0 nan
0 2 nan 1
0 2 nan 0
1 3 0 nan
1 3 1 nan
1 2 0 nan
1 2 0 nan
Any ideas? Maybe something like: pick the group in column c, and if its end value and the next group's first value are both 0, the next group in c becomes nan nan.
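A possible sketch of the alternation logic (my own attempt; the pick helper and its switching rule are assumptions based on the description above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1],
                   'b': [1, 1, 2, 2, 3, 3, 2, 2],
                   'c': [0, 0, 0, 0, 0, 1, 0, 0],
                   'd': [1, 1, 1, 0, 1, 0, 1, 0]})

def pick(sub):
    # walk the (a, b) groups in order, starting from column 'c';
    # switch columns whenever the current group's first value
    # would repeat the previously picked group's last value
    sub = sub.astype({'c': float, 'd': float})  # allow NaN
    col, last = 'c', None
    for _, grp in sub.groupby('b', sort=False):
        if last is not None and grp[col].iloc[0] == last:
            col = 'd' if col == 'c' else 'c'
        other = 'd' if col == 'c' else 'c'
        sub.loc[grp.index, other] = np.nan
        last = grp[col].iloc[-1]
    return sub

result = df.groupby('a', group_keys=False).apply(pick)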

How to subtract data frames with different fill values

I need to subtract two DataFrames with different indexes (which causes NaN values when one of the values is missing), and I want to replace the missing values from each DataFrame with a different number (fill value).
For example, let's say I have df1 and df2:
df1:
A B C
0 0 3 0
1 0 0 4
2 4 0 2
df2:
A B C
0 0 3 0
1 1 2 0
3 1 2 0
subtracted = df1.sub(df2):
A B C
0 0 0 0
1 -1 -2 4
2 NaN NaN NaN
3 NaN NaN NaN
I want row 2 of subtracted to have the values from row 2 of df1 (treating the row missing from df2 as zeros), and row 3 of subtracted to have the value 5.
I expect -
subtracted:
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
I tried using the sub method with fill_value=5, but then the same fill is applied to both rows 2 and 3, which is not what I want.
One way would be to reindex df2 with fill_value=0, then subtract and fillna with 5:
ix = pd.RangeIndex(df1.index.union(df2.index).max() + 1)
df1.sub(df2.reindex(ix, fill_value=0)).fillna(5).astype(df1.dtypes)
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
We have to reindex here to get aligned indices; this way we can use the sub method.
import numpy as np

idxmin = df2.index.min()
idxmax = df2.index.max()
idx = np.arange(idxmin, idxmax + 1)
df1.reindex(idx).sub(df2.reindex(idx).fillna(0)).fillna(5)
A B C
0 0.0 0.0 0.0
1 -1.0 -2.0 4.0
2 4.0 0.0 2.0
3 5.0 5.0 5.0
I found the combine_first method, which almost satisfies my needs:
df2.combine_first(df1).sub(df2, fill_value=0)
but still produces only:
A B C
0 0 0 0
1 0 0 0
2 4 0 2
3 0 0 0
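To generalize the idea from the answers above, here's a small helper capturing the pattern (a sketch; the function name and parameters are mine, not a pandas API):
def sub_with_fills(df_a, df_b, b_fill=0, result_fill=5):
    # rows missing from df_b are treated as b_fill;
    # rows missing from df_a produce result_fill directly
    idx = df_a.index.union(df_b.index)
    return df_a.sub(df_b.reindex(idx, fill_value=b_fill)).fillna(result_fill)

sub_with_fills(df1, df2)  # reproduces the expected output above (as floats)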

Cumulative sum of a dataframe column with restart

I would like to perform the following function on a dataframe.
Calculate the cumulative sum of a column, noting:
It looks at the previous index only, not including the current one; e.g. the very first value will be zero, as there is no previous data to look at.
When it stops cumulating, e.g. when the increment is zero, it restarts the count.
Number Cumulative
0 1 0
1 1 1
2 1 2
3 0 3
4 0 0
5 1 0
6 1 1
7 0 2
I know there is an expanding function, but it doesn't restart when it sees zero.
IIUC, this works by making groups according to whether the previous row was 0, then getting the cumulative count:
>>> df
Number
0 1
1 1
2 1
3 0
4 0
5 1
6 1
7 0
df['Cumulative'] = df.groupby(df.Number.shift().eq(0).cumsum()).cumcount()
>>> df
Number Cumulative
0 1 0
1 1 1
2 1 2
3 0 3
4 0 0
5 1 0
6 1 1
7 0 2
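To see why this works, here's the intermediate grouping key; the label increments right after each 0 in Number, so cumcount restarts there:
key = df.Number.shift().eq(0).cumsum()
# key: 0 0 0 0 1 2 2 2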
Alternatively, if it really is cumsum you want, then apply cumsum with a similar grouping (without the shift in the key), and shift the result down by 1:
df['Cumulative'] = df.groupby(df.Number.eq(0).cumsum())['Number'].cumsum().shift().fillna(0)
>>> df
Number Cumulative
0 1 0.0
1 1 1.0
2 1 2.0
3 0 3.0
4 0 0.0
5 1 0.0
6 1 1.0
7 0 2.0
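Again, the grouping key for reference (here each 0 in Number closes a group, and the shift moves the per-group cumsum down one row):
key = df.Number.eq(0).cumsum()
# key: 0 0 0 1 2 3 3 4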

Adding new column to DataFrame with values dependent on index ref

I want to add a new column to this DataFrame in Pandas, where I assign a StoreID by rolling through the indexes:
It currently looks like this:
Unnamed: 12 Store
0 NaN 1
1 NaN 1
2 NaN 1
0 NaN 1
1 NaN 1
2 NaN 1
0 NaN 1
1 NaN 1
2 NaN 1
0 NaN 1
1 NaN 1
2 NaN 1
I want it to look like this:
Unnamed: 12 Store StoreID
0 NaN 1 1
1 NaN 1 1
2 NaN 1 1
0 NaN 1 2
1 NaN 1 2
2 NaN 1 2
0 NaN 1 5
1 NaN 1 5
2 NaN 1 5
0 NaN 1 11
1 NaN 1 11
2 NaN 1 11
The variable changes when the index hits 0. The report will have a variable number of items, most stores having hundreds of thousands of records.
I can create a new column easily but I can't seem to work out how to do this!
Any help much appreciated - I'm just starting out with Python.
You can also get the cumsum of the diff of the indexes:
df['g'] = (df.index.to_series().diff() < 0).cumsum()
0 0
1 0
2 0
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
2 3
Using np.ndarray.cumsum:
df['g'] = (df.index == 0).cumsum() - 1
print(df)
col Store g
0 NaN 1 0
1 NaN 1 0
2 NaN 1 0
0 NaN 1 1
1 NaN 1 1
2 NaN 1 1
0 NaN 1 2
1 NaN 1 2
2 NaN 1 2
0 NaN 1 3
1 NaN 1 3
2 NaN 1 3
IIUC, try cumcount:
df.groupby(df.index).cumcount()
Out[11]:
0 0
1 0
2 0
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
2 3
dtype: int64
Thanks for everyone's replies. I ended up solving the problem with:
table['STORE_ID'] = (table.index == 0).cumsum() - 1
then adding some logic to look up the store_id based on the sequence (the replacements run from highest to lowest so earlier ones aren't overwritten):
table.loc[table['STORE_ID'] == 3, 'STORE_ID'] = 11
table.loc[table['STORE_ID'] == 2, 'STORE_ID'] = 3
table.loc[table['STORE_ID'] == 1, 'STORE_ID'] = 2
table.loc[table['STORE_ID'] == 0, 'STORE_ID'] = 1
I imagine there's a simpler solution to get to the Store_ID sequence quicker but this gets the job done for now.
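One simpler route (a sketch; the mapping {0: 1, 1: 2, 2: 3, 3: 11} is read off the example above) is to map the sequence directly:
import pandas as pd

seq = (table.index == 0).cumsum() - 1  # 0, 1, 2, ... per store block
table['STORE_ID'] = pd.Series(seq, index=table.index).map({0: 1, 1: 2, 2: 3, 3: 11})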

Pandas group operation on columns

I have a pandas DataFrame that I am grouping with groupby:
dis type id date qty
1 1 10 2017-01-01 1
1 1 10 2017-01-01 0
1 1 10 2017-01-02 4.5
1 2 11 2017-04-03 1
1 2 11 2017-04-03 2
1 2 11 2017-04-03 0
1 2 11 2017-04-05 0
I want to apply some operations to this groupby object:
Add a new column total_order that counts the number of orders on a particular date for a particular material.
Add a column zero_qty that counts the number of zero-quantity orders on a particular date for a particular material.
Change the date column so it holds the number of days between each subsequent order for a particular material; the first order becomes 0.
The final dataframe should like something like this:
dis type id date qty total_order zero_qty
1 1 10 0 1 2 1
1 1 10 0 0 2 1
1 1 10 1 4.5 1 1
1 2 11 0 1 3 2
1 2 11 0 2 3 2
1 2 11 0 0 3 2
1 2 11 2 0 1 1
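Note that date must be a datetime column for the .dt.days step below; a minimal setup for the example (assuming standard pandas):
import pandas as pd

df = pd.DataFrame({'dis': [1, 1, 1, 1, 1, 1, 1],
                   'type': [1, 1, 1, 2, 2, 2, 2],
                   'id': [10, 10, 10, 11, 11, 11, 11],
                   'date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-02',
                                           '2017-04-03', '2017-04-03', '2017-04-03',
                                           '2017-04-05']),
                   'qty': [1, 0, 4.5, 1, 2, 0, 0]})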
I think you need transform to count the size of each group for total_order, then count the number of zeros in qty, and finally get the difference with diff, fillna, and dt.days.
Notice: the difference needs sorted data; sort_values handles that if necessary:
df = df.sort_values(['dis','type','id','date'])
g = df.groupby(['dis','type','id','date'])
df['total_order'] = g['id'].transform('size')
df['zero_qty'] = g['qty'].transform(lambda x: (x == 0).sum()).astype(int)
df['date'] = df.groupby(['dis','type','id'])['date'].diff().fillna(0).dt.days
print (df)
dis type id date qty total_order zero_qty
0 1 1 10 0 1.0 2 1
1 1 1 10 0 0.0 2 1
2 1 1 10 1 4.5 1 0
3 1 2 11 0 1.0 3 1
4 1 2 11 0 2.0 3 1
5 1 2 11 0 0.0 3 1
6 1 2 11 2 0.0 1 1
Another solution: instead of multiple transforms, use apply with a custom function:
df = df.sort_values(['dis','type','id','date'])
def f(x):
    x['total_order'] = len(x)
    x['zero_qty'] = x['qty'].eq(0).sum().astype(int)
    return x
df = df.groupby(['dis','type','id','date']).apply(f)
df['date'] = df.groupby(['dis','type','id'])['date'].diff().fillna(0).dt.days
print (df)
dis type id date qty total_order zero_qty
0 1 1 10 0 1.0 2 1
1 1 1 10 0 0.0 2 1
2 1 1 10 1 4.5 1 0
3 1 2 11 0 1.0 3 1
4 1 2 11 0 2.0 3 1
5 1 2 11 0 0.0 3 1
6 1 2 11 2 0.0 1 1
EDIT:
The last line can be rewritten too, if you need to process more columns:
def f2(x):
    # add any other per-group processing here
    x['date'] = x['date'].diff().fillna(0).dt.days
    return x
df = df.groupby(['dis','type','id']).apply(f2)
