fill missing days in pandas dataframe - python

Given the dataframe
df = pd.DataFrame(data=[[1,1,3],[1,2,6],[1,4,3],[2,2,6]],columns=['ID','Day','Value'])
df
Out[58]:
ID Day Value
0 1 1 3
1 1 2 6
2 1 4 3
3 2 2 6
As you can see, for ID = 1 the Value for Day 3 is missing, and for ID = 2 the Value for Day 1 is missing. I would like to fill these gaps by adding the missing days with np.nan as the Value:
Out[59]:
ID Day Value
0 1 1 3.0
1 1 2 6.0
2 1 3 NaN
3 1 4 3.0
4 2 1 NaN
5 2 2 6.0

You'll need to define a custom function that performs some reindexing logic (np here is numpy, imported as usual):
import numpy as np

def f(x):
    # Reindex each group on the complete range of days 1..max(Day),
    # so missing days appear with NaN as the Value.
    return x.set_index('Day').reindex(
        np.arange(1, x.Day.max() + 1)
    ).Value
Now, perform a groupby + apply:
df.groupby('ID').apply(f).reset_index()
ID Day Value
0 1 1 3.0
1 1 2 6.0
2 1 3 NaN
3 1 4 3.0
4 2 1 NaN
5 2 2 6.0

Related

Difference in score to next rank

I have a dataframe
Group Score Rank
1 0 3
1 4 1
1 2 2
2 3 2
2 1 3
2 7 1
I need to take the difference between the score at each rank and the score at the next rank within each group. For example, in group 1, rank 1 minus rank 2 gives 4 - 2 = 2.
Expected output:
Group Score Rank Difference
1 0 3 0
1 4 1 2
1 2 2 2
2 3 2 2
2 1 3 0
2 7 1 4
You can try:
df = df.sort_values(['Group', 'Rank'], ascending=[True, False])
df['Difference'] = df.groupby('Group')['Score'].transform('diff').fillna(0).astype(int)
OUTPUT:
Group Score Rank Difference
0 1 0 3 0
2 1 2 2 2
1 1 4 1 2
4 2 1 3 0
3 2 3 2 2
5 2 7 1 4
NOTE: The result is sorted by Group ascending and Rank descending.
You can create a new column holding the score at the next rank within each group using shift(), and then calculate the difference:
# Sort the dataframe
# Sort the dataframe
df = df.sort_values(['Group', 'Rank']).reset_index(drop=True)
# Shift values up by one row within each group, so each row sees the next rank's score
df['Score_next'] = df.groupby('Group')['Score'].shift(-1)
# Calculate the difference; the last rank in each group has no next rank, so fill with 0
df['Difference'] = (df['Score'] - df['Score_next']).fillna(0)
Here is the result:
print(df)
Group Score Rank Score_next Difference
0 1 4 1 2.0 2.0
1 1 2 2 0.0 2.0
2 1 0 3 NaN 0.0
3 2 7 1 3.0 4.0
4 2 3 2 1.0 2.0
5 2 1 3 NaN 0.0
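A more compact variant of the same idea (my sketch, assuming the same df), using groupby diff with a negative period. Because the assignment aligns on the index, the current row order of df does not matter:
df['Difference'] = (
    df.sort_values(['Group', 'Rank'])
      .groupby('Group')['Score']
      .diff(-1)   # score at this rank minus the score at the next rank
      .fillna(0)
)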

Pandas Insert a new row after every nth row

I have a dataframe that looks like below:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
I need to insert a new row after every 4th row, incrementing the value in the third column (C_Type) by 1, keeping the values of the first two columns the same and leaving the last column empty, like the table below:
L_Type L_ID C_Type E_Code
0 1 1 9
0 1 2 9
0 1 3 9
0 1 4 9
0 1 5
0 2 1 2
0 2 2 2
0 2 3 2
0 2 4 2
0 2 5
0 3 1 3
0 3 2 3
0 3 3 3
0 3 4 3
0 3 5
I have searched other threads but could not figure out the exact solution:
How to insert n DataFrame to another every nth row in Pandas?
Insert new rows in pandas dataframe
You can select every 4th row by slicing, add 1 to column C_Type, and add 0.5 to the index of the new rows so that, once sorted, they land directly after the rows they follow. Finally, join the pieces together with concat, sort the index, and recreate a default index with DataFrame.reset_index and drop=True:
import numpy as np

df['C_Type'] = df['C_Type'].astype(int)
# Take every 4th row, bump C_Type, blank out E_Code, and shift the index by 0.5
df2 = (df.iloc[3::4]
         .assign(C_Type=lambda x: x['C_Type'] + 1, E_Code=np.nan)
         .rename(lambda x: x + .5))
df1 = pd.concat([df, df2], sort=False).sort_index().reset_index(drop=True)
print(df1)
print (df1)
L_Type L_ID C_Type E_Code
0 0 1 1 9.0
1 0 1 2 9.0
2 0 1 3 9.0
3 0 1 4 9.0
4 0 1 5 NaN
5 0 2 1 2.0
6 0 2 2 2.0
7 0 2 3 2.0
8 0 2 4 2.0
9 0 2 5 NaN
10 0 3 1 3.0
11 0 3 2 3.0
12 0 3 3 3.0
13 0 3 4 3.0
14 0 3 5 NaN
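If each L_ID block always ends at its maximum C_Type, an alternative sketch (my construction, not from the answer) builds the extra row per group and concatenates:
extra = df.groupby(['L_Type', 'L_ID'], as_index=False)['C_Type'].max()
extra['C_Type'] += 1       # the new row continues the C_Type sequence
extra['E_Code'] = np.nan   # leave the last column empty
df1 = (pd.concat([df, extra], sort=False)
         .sort_values(['L_Type', 'L_ID', 'C_Type'])
         .reset_index(drop=True))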

Identify first non-zero element within group composed of multiple columns in pandas

I have a dataframe that looks like the following. The rightmost column is my desired column:
Group1 Group2 Value Target_Column
1 3 0 0
1 3 1 1
1 4 1 1
1 4 1 0
2 5 5 5
2 5 1 0
2 6 0 0
2 6 1 1
2 6 9 0
How do I identify the first non-zero value in a group defined by two columns (Group1 & Group2), and then create a column that shows that first non-zero value and zeros everywhere else?
This question is very similar to one posed earlier here:
Identify first non-zero element within a group in pandas
but that solution gives an error on groups based on multiple columns.
I have tried:
import pandas as pd
dt = pd.DataFrame({'Group1': [1,1,1,1,2,2,2,2,2], 'Group2': [3,3,4,4,5,5,6,6,6], 'Value': [0,1,1,1,5,1,0,1,9]})
dt['Newcol']=0
dt.loc[dt.Value.ne(0).groupby(dt['Group1','Group2']).idxmax(),'Newcol']=dt.Value
Setup (np here is numpy, imported as usual):
import numpy as np

df['flag'] = df.Value.ne(0)
Using numpy.where and assign:
df.assign(
    target=np.where(df.index.isin(df.groupby(['Group1', 'Group2']).flag.idxmax()),
                    df.Value, 0)
).drop(columns='flag')
Using loc and assign:
df.assign(
    target=df.loc[df.groupby(['Group1', 'Group2']).flag.idxmax(), 'Value']
).fillna(0).astype(int).drop(columns='flag')
Both produce:
Group1 Group2 Value target
0 1 3 0 0
1 1 3 1 1
2 1 4 1 1
3 1 4 1 0
4 2 5 5 5
5 2 5 1 0
6 2 6 0 0
7 2 6 1 1
8 2 6 9 0
The chosen row may differ when a group contains several equal non-zero values, since it is ambiguous which occurrence you need. Using user3483203's setup:
df['flag'] = df.Value.ne(0)
# A stable sort keeps the original order among equal flags, so
# drop_duplicates keeps the first non-zero row in each group
df['Target'] = (df.sort_values('flag', ascending=False, kind='mergesort')
                  .drop_duplicates(['Group1', 'Group2']).Value)
df['Target'] = df['Target'].fillna(0)
df
Out[20]:
Group1 Group2 Value Target_Column Target
0 1 3 0 0 0.0
1 1 3 1 1 1.0
2 1 4 1 1 1.0
3 1 4 1 0 0.0
4 2 5 5 5 5.0
5 2 5 1 0 0.0
6 2 6 0 0 0.0
7 2 6 1 1 1.0
8 2 6 9 0 0.0
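A sketch that avoids the helper column entirely (my construction; nz and first_nz are names I made up), marking the first non-zero row per (Group1, Group2) with a grouped cumulative sum:
nz = df['Value'].ne(0).astype(int)
# the first non-zero row in each group is where the running count of non-zeros hits 1
first_nz = nz.eq(1) & nz.groupby([df['Group1'], df['Group2']]).cumsum().eq(1)
df['Target'] = df['Value'].where(first_nz, 0)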

Count if data is higher than another series within a rolling window of past two (or more) values in pandas

I have this two Series in a DataFrame:
A B
1 2
2 3
2 1
4 3
5 2
and I would like to create a new column df['C'] that counts how many times the value in column df['A'] is higher than the value in column df['B'] within a rolling window over the previous 2 (or more) rows.
The result would be something like this:
A B C
1 2 NaN
2 3 NaN
2 1 0
4 3 1
5 2 2
I would also like to create a column that sums the values of df['A'] that are higher than df['B'], again using the same rolling window.
With the following result:
A B C D
1 2 NaN NaN
2 3 NaN NaN
2 1 0 0
4 3 1 2
5 2 2 6
Thanks in advance.
IIUC:
df.assign(C=df.A.gt(df.B).rolling(2).sum().shift(),
          D=(df.A.gt(df.B) * df.A).rolling(2).sum().shift())
Out[1267]:
A B C D
0 1 2 NaN NaN
1 2 3 NaN NaN
2 2 1 0.0 0.0
3 4 3 1.0 2.0
4 5 2 2.0 6.0
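The window length generalizes directly; a small sketch with the width as a parameter (n and higher are my names, built from the same constructs as the answer):
n = 2
higher = df['A'].gt(df['B'])
df['C'] = higher.rolling(n).sum().shift()                    # count of A > B in the previous n rows
df['D'] = df['A'].where(higher, 0).rolling(n).sum().shift()  # sum of those A values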

How can I randomly change the values of some rows in a pandas DataFrame?

I have a pandas Dataframe like below:
UserId ProductId Quantity
1 1 6
1 4 1
1 7 3
2 4 2
3 2 7
3 1 2
Now, I want to randomly select 20% of the rows of this DataFrame, using df.sample(n), and change the value of the Quantity column of those rows to zero. I would also like to keep the indexes of the altered rows, so the resulting DataFrame would be:
UserId ProductId Quantity
1 1 6
1 4 1
1 7 3
2 4 0
3 2 7
3 1 0
and I would like to keep a list recording that rows 3 and 5 have been altered. How can I achieve that?
By using update
dfupdate=df.sample(2)
dfupdate.Quantity=0
df.update(dfupdate)
update_list = dfupdate.index.tolist() # from cᴏʟᴅsᴘᴇᴇᴅ :)
df
Out[44]:
UserId ProductId Quantity
0 1.0 1.0 6.0
1 1.0 4.0 0.0
2 1.0 7.0 3.0
3 2.0 4.0 0.0
4 3.0 2.0 7.0
5 3.0 1.0 2.0
Using loc to change the data, i.e.:
change = df.sample(2).index
df.loc[change,'Quantity'] = 0
Output:
UserId ProductId Quantity
0 1 1 0
1 1 4 1
2 1 7 3
3 2 4 0
4 3 2 7
5 3 1 2
change.tolist() : [3, 0]
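To select 20% of the rows rather than a fixed count, df.sample also accepts frac (a small sketch; random_state is my addition, for reproducibility only):
change = df.sample(frac=0.2, random_state=0).index
df.loc[change, 'Quantity'] = 0
altered = change.tolist()   # indexes of the altered rows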
