Alternating end and start value of groups Pandas - python

I've got a df ('c' always starts with 0 and 'd' always starts with 1 in each group):
a b c d
0 1 0 1
0 1 0 1
0 2 0 1
0 2 0 0
1 3 0 1
1 3 1 0
1 2 0 1
1 2 0 0
The groups (grouped by 'a' and 'b') in columns 'c' and 'd' are the alternatives to pick from.
The end value of each picked group should alternate with the first value of the next group (the transition between subgroups doesn't matter).
For the first group in each subgroup ('a'), pick column 'c'.
Result should be:
a b c d
0 1 0 nan
0 1 0 nan
0 2 nan 1
0 2 nan 0
1 3 0 nan
1 3 1 nan
1 2 0 nan
1 2 0 nan
Any ideas? Maybe something like: pick the group in column c, and if its values are both 0, the next group in c becomes nan nan.
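One possible reading of the rules above, as a rough sketch rather than a definitive answer: walk through the 'b' groups of each 'a' in order of appearance, start with 'c', and afterwards keep 'c' only if its first value differs from the last value that was kept, otherwise keep 'd'. It assumes each 'b' value forms one contiguous block inside its 'a' subgroup; the names `out`, `last`, `pick` and `drop` are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [1, 1, 2, 2, 3, 3, 2, 2],
    "c": [0, 0, 0, 0, 0, 1, 0, 0],
    "d": [1, 1, 1, 0, 1, 0, 1, 0],
})

# Work on a float copy so the dropped alternative can be set to NaN.
out = df.astype({"c": float, "d": float})

for _, sub in df.groupby("a", sort=False):
    last = None  # last value of the previously picked group, reset per subgroup
    for _, grp in sub.groupby("b", sort=False):
        # Assumed rule: the first group takes 'c'; later groups take 'c' only
        # if its first value differs from the previous end value, else 'd'.
        pick = "c" if last is None or grp["c"].iloc[0] != last else "d"
        drop = "d" if pick == "c" else "c"
        out.loc[grp.index, drop] = np.nan
        last = grp[pick].iloc[-1]

print(out)
```

On the sample data this keeps 'c' for groups (0, 1), (1, 3) and (1, 2) and 'd' for group (0, 2), which is the layout shown in the expected result.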

Related

Splitting a non delimited column and create an additional column to count which number value

I have a problem in which I want to take Table 1 and turn it into Table 2 using Python.
Does anybody have any ideas? I've tried to split the Value column from table 1 but run into issues in that each value is a different length, hence I can't always define how much to split it.
Equally I have not been able to think through how to create a new column that counts the position that value was in the string.
Table 1, before:
ID Value
1 000000S
2 000FY
Table 2, after:
ID Position Value
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
1 7 S
2 1 0
2 2 0
2 3 0
2 4 F
2 5 Y
You can split the string into individual characters and explode:
out = (df
.assign(Value=df['Value'].apply(list))
.explode('Value')
)
output:
ID Value
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 0
0 1 S
1 2 0
1 2 0
1 2 0
1 2 F
1 2 Y
Given:
ID Value
0 1 000000S
1 2 000FY
Doing:
df.Value = df.Value.apply(list)
df = df.explode('Value')
df['Position'] = df.groupby('ID').cumcount() + 1
Output:
ID Value Position
0 1 0 1
0 1 0 2
0 1 0 3
0 1 0 4
0 1 0 5
0 1 0 6
0 1 S 7
1 2 0 1
1 2 0 2
1 2 0 3
1 2 F 4
1 2 Y 5
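Putting the two answers together, a self-contained sketch that starts from Table 1 and ends with the column order of Table 2 could look like this. The variable name `out` and the final column reordering are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Value": ["000000S", "000FY"]})

# Turn each string into a list of characters, then give every character its own row.
out = df.assign(Value=df["Value"].map(list)).explode("Value", ignore_index=True)

# Number the characters within each ID, starting at 1.
out["Position"] = out.groupby("ID").cumcount() + 1

# Reorder the columns to match Table 2.
out = out[["ID", "Position", "Value"]]
print(out)
```

Because `cumcount` numbers rows in their existing order, the positions follow the character order of the original string.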

Transform cell values as column headers and fill it with 1 if matching in python

I have a dataframe:
df
ID 0 1 2 3 4 ....
1 10 20 5 1 2 ....
2 3 4 NaN 10 1 ....
And I need to transpose the cell values of the column 0,1,2,3,4... to the column headers, and fill it for the Id's with 1 if the cell value is present for the respective ID.
Desired Output:
ID 1 2 3 4 5 ... 10 20 ..
1 1 1 0 0 1 ... 1 1 ..
2 1 0 1 1 0 ... 1 0 ..
Note that some entries can be NaN.
How can I get the desired output?
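Use DataFrame.set_index with DataFrame.stack to remove missing values, then create indicators with get_dummies, take the max per first index level to get 1/0, and finally convert the column names to integers:
# .max(level=0) was removed in pandas 2.0, so group on the first index level instead;
# dtype=int keeps the indicators as 1/0 rather than True/False
df1 = (pd.get_dummies(df.set_index('ID').stack(), dtype=int)
.groupby(level=0)
.max()
.rename(columns=int)
.reset_index())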
print (df1)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 1 1 1
1 2 1 0 1 1 0 1 0
EDIT:
print (df)
ID 0 1 2 3 4 5
0 1 10 20 5.0 1 2 5
1 2 3 4 NaN 10 1 2
If you use max, the output always contains 0/1 values (check column 5):
df1 = (pd.get_dummies(df.set_index('ID').stack(), dtype=int)
.groupby(level=0)
.max()
.rename(columns=int)
.reset_index())
print (df1)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 1 1 1
1 2 1 1 1 1 0 1 0
But if you use sum, it counts the values (check column 5):
df2 = (pd.get_dummies(df.set_index('ID').stack(), dtype=int)
.groupby(level=0)
.sum()
.rename(columns=int)
.reset_index())
print (df2)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 2 1 1
1 2 1 1 1 1 0 1 0
Another way using melt and pd.crosstab
df1 = df.melt('ID')
df_final = pd.crosstab(index=df1.ID, columns=df1.value).reset_index()
Out[673]:
value ID 1.0 2.0 3.0 4.0 5.0 10.0 20.0
0 1 1 1 0 0 1 1 1
1 2 1 0 1 1 0 1 0
Note: pd.crosstab counts frequencies by default, so duplicate values are counted as many times as they occur. If you want only a 1/0 indicator, chain ge(1) and astype as follows:
pd.crosstab(index=df1.ID, columns=df1.value).ge(1).astype(int).reset_index()
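For reference, here is a self-contained sketch of the get_dummies approach on the sample data from the question, written against current pandas. It assumes pandas 2.x or later, groups on the first index level rather than passing `level=` to `max`, and adds an explicit `dropna()` after `stack()` in case the installed version keeps missing values. The name `indicators` is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2],
    0: [10, 3],
    1: [20, 4],
    2: [5, np.nan],
    3: [1, 10],
    4: [2, 1],
})

# Long format: one row per (ID, original column) pair, missing values dropped.
stacked = df.set_index("ID").stack().dropna()

indicators = (
    pd.get_dummies(stacked, dtype=int)  # one 0/1 column per distinct cell value
    .groupby(level=0)                   # collapse back to one row per ID
    .max()                              # max gives presence, sum would give counts
    .rename(columns=int)                # 1.0 -> 1, 2.0 -> 2, ...
    .reset_index()
)
print(indicators)
```

Swapping `.max()` for `.sum()` turns the indicators into counts, as described in the EDIT above.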

How to subtract data frames with different fill values

I need to subtract two Data Frames with different indexes (which causes 'NaN' values when one of the values is missing) and I want to replace the missing values from each Data Frame with different number (fill value).
For example, let's say I have df1 and df2:
df1:
A B C
0 0 3 0
1 0 0 4
2 4 0 2
df2:
A B C
0 0 3 0
1 1 2 0
3 1 2 0
subtracted = df1.sub(df2):
A B C
0 0 0 0
1 -1 -2 4
2 NaN NaN NaN
3 NaN NaN NaN
I want the row with index 2 of subtracted to have the values from row 2 of df1, and the row with index 3 of subtracted to have the value 5.
I expect -
subtracted:
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
I tried using the sub method with fill_value=5, but then I get 0 in both rows 2 and 3.
One way would be to reindex df2 with fill_value set to 0, then subtract and fillna with 5:
# Index.union is used instead of `|`, which newer pandas no longer treats as a set union
ix = pd.RangeIndex(df1.index.union(df2.index).max()+1)
df1.sub(df2.reindex(ix, fill_value=0)).fillna(5).astype(df1.dtypes)
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
We have to reindex here to get aligned indices. This way we can use the sub method.
import numpy as np
# this range is built from df2's index only, which covers 0..3 in this example
idxmin = df2.index.min()
idxmax = df2.index.max()
idx = np.arange(idxmin, idxmax+1)
df1.reindex(idx).sub(df2.reindex(idx).fillna(0)).fillna(5)
A B C
0 0.0 0.0 0.0
1 -1.0 -2.0 4.0
2 4.0 0.0 2.0
3 5.0 5.0 5.0
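Both answers above rely on the same idea: align the two frames on a shared index, fill the rows missing from df2 with 0 before subtracting, and fill what is still missing afterwards (rows absent from df1) with 5. A minimal self-contained sketch of that idea, using the union of both indexes and assuming integer output is wanted:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0, 4], "B": [3, 0, 0], "C": [0, 4, 2]}, index=[0, 1, 2])
df2 = pd.DataFrame({"A": [0, 1, 1], "B": [3, 2, 2], "C": [0, 0, 0]}, index=[0, 1, 3])

# Every label that appears in either frame.
idx = df1.index.union(df2.index)

subtracted = (
    df1.reindex(idx)                        # rows missing from df1 become NaN
    .sub(df2.reindex(idx, fill_value=0))    # rows missing from df2 count as 0
    .fillna(5)                              # rows missing from df1 get the value 5
    .astype(int)
)
print(subtracted)
```

The two fill values stay separate because they are applied at different stages: 0 goes into df2 before the subtraction, 5 goes into the result after it.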
I found the combine_first method that almost satisfies my needs:
df2.combine_first(df1).sub(df2, fill_value=0)
but still produces only:
A B C
0 0 0 0
1 0 0 0
2 4 0 2
3 0 0 0

pandas calculate difference based on indicators grouped by a column

Here is my question. I don't know how to describe it, so I will just give an example.
a b k
0 0 0
0 1 1
0 2 0
0 3 0
0 4 1
0 5 0
1 0 0
1 1 1
1 2 0
1 3 1
1 4 0
Here, "a" is user id, "b" is time, and "k" is a binary indicator flag. "b" is consecutive for sure.
What I want to get is this:
a b k diff_b
0 0 0 nan
0 1 1 nan
0 2 0 1
0 3 0 2
0 4 1 3
0 5 0 1
1 0 0 nan
1 1 1 nan
1 2 0 1
1 3 1 2
1 4 0 1
So, diff_b is a time difference variable. It shows the duration between the current time point and the last time point with an action. If there is never an action before, it returns nan. This diff_b is grouped by a. For each user, this diff_b is calculated independently.
Can anyone revise my title? I don't know how to describe it in english. So complex...
Thank you!
IIUC
df['New'] = df.b.loc[df.k==1]  # value of b on rows where k equals 1, NaN elsewhere
df['New'] = df.groupby('a')['New'].transform(lambda x: x.ffill().shift())  # forward-fill within each user, then shift by one row
df.b - df['New']  # yields
Out[260]:
0 NaN
1 NaN
2 1.0
3 2.0
4 3.0
5 1.0
6 NaN
7 NaN
8 1.0
9 2.0
10 1.0
dtype: float64
Create partitions of the rows after k == 1 up to the next k == 1 using shift and cumsum, for each group of a:
parts = df.groupby('a').k.transform(lambda x: x.shift().cumsum())
Group by df.a and parts, and calculate the difference between b and b.min() within each group:
vals = df.groupby([df.a, parts]).b.transform(lambda x: x - x.min() + 1)
Set the values to null where parts == 0 and assign back to the dataframe:
df['diff_b'] = np.select([parts != 0], [vals], np.nan)
outputs:
a b k diff_b
0 0 0 0 NaN
1 0 1 1 NaN
2 0 2 0 1.0
3 0 3 0 2.0
4 0 4 1 3.0
5 0 5 0 1.0
6 1 0 0 NaN
7 1 1 1 NaN
8 1 2 0 1.0
9 1 3 1 2.0
10 1 4 0 1.0
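As a self-contained sketch of the forward-fill idea on the sample data: keep b only on action rows, carry that value forward within each user, shift by one row so an action row refers to the previous action rather than itself, then subtract. The intermediate name `last_action` is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0] * 6 + [1] * 5,
    "b": [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4],
    "k": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# b on rows with an action (k == 1), NaN everywhere else.
last_action = df["b"].where(df["k"].eq(1))

# Per user: carry the latest action time forward, then shift one row down.
last_action = last_action.groupby(df["a"]).transform(lambda s: s.ffill().shift())

df["diff_b"] = df["b"] - last_action
print(df)
```

Using `transform` keeps the result aligned with the original row index, so it can be assigned straight back to the frame.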

pandas calculate difference based on indicators grouped by a column with duplicated grouped pair

Here is an example.
a b k c
0 0 0 0
0 1 1 0
0 2 0 0
0 3 0 0
0 4 1 0
0 5 0 0
0 0 0 1
0 1 1 1
0 2 0 1
0 3 0 1
0 4 1 1
0 5 0 1
1 0 0 0
1 1 1 0
1 2 0 0
1 3 1 0
1 4 0 0
1 0 0 1
1 1 1 1
1 2 0 1
1 3 1 1
1 4 0 1
Here, "a" is user id, "b" is time, "c" is product and "k" is a binary indicator flag. For each c, "b" is consecutive, and the flag "k" is the same for each unique pair (a, b), which means it is independent of "c". What I want to get is this:
a b k c diff_b
0 0 0 0 nan
0 1 1 0 nan
0 2 0 0 1
0 3 0 0 2
0 4 1 0 3
0 5 0 0 1
0 0 0 1 nan
0 1 1 1 nan
0 2 0 1 1
0 3 0 1 2
0 4 1 1 3
0 5 0 1 1
1 0 0 0 nan
1 1 1 0 nan
1 2 0 0 1
1 3 1 0 2
1 4 0 0 1
1 0 0 1 nan
1 1 1 1 nan
1 2 0 1 1
1 3 1 1 2
1 4 0 1 1
So, diff_b is a time difference variable. It shows the duration between the current time point and the last time point with an action. If there was never an action before, it returns nan. This diff_b is grouped by a. For each user, diff_b is calculated independently, and for the same user but a different product, it should be independent of the product as well.
Thank you.
You just need to add c to the grouping keys in the second step:
df['New'] = df.b.loc[df.k==1]  # value of b on rows where k equals 1, NaN elsewhere
df['New'] = df.groupby(['a','c'])['New'].transform(lambda x: x.ffill().shift())  # forward-fill within each (user, product) pair, then shift by one row
df.b - df['New']
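To make the change of grouping keys concrete, the computation can be wrapped in a small function that takes the key columns as an argument. This is only a sketch: `time_since_last_action` is a hypothetical helper name, and the sample frame is rebuilt by repeating the single-product data once per product.

```python
import pandas as pd


def time_since_last_action(frame, keys):
    """Hypothetical helper: b minus the b of the previous k == 1 row, per group."""
    last = frame["b"].where(frame["k"].eq(1))
    last = last.groupby([frame[key] for key in keys]).transform(
        lambda s: s.ffill().shift()
    )
    return frame["b"] - last


base = pd.DataFrame({
    "a": [0] * 6 + [1] * 5,
    "b": [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4],
    "k": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Same (a, b, k) rows for product 0 and product 1.
df = pd.concat([base.assign(c=c) for c in (0, 1)], ignore_index=True)

df["diff_b"] = time_since_last_action(df, ["a", "c"])
print(df)
```

Grouping by ['a', 'c'] rather than 'a' alone stops the forward fill from leaking across products of the same user.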
