Pandas column difference over time - python

**Edit at bottom**
I have a data frame with inventory data that looks like the following:
import pandas as pd

d = {'product': ['a', 'b', 'a', 'b', 'c'],
     'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
df
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 a 3 2020-6-7
3 b 5 2020-6-7
4 c 2 2020-6-7
I would like to know what the inventory difference is month to month. The output would look like this:
df
product diff isnew date
0 a nan nan 2020-6-6
1 b nan nan 2020-6-6
2 a 2 False 2020-6-7
3 b 3 False 2020-6-7
4 c 2 True 2020-6-7
**Edit:** Sorry if I was not clear in the first example. In reality I have many months of data, so I am not just looking at the difference of one period vs the other. It needs to be a general case that looks at the difference of month n vs n-1, then n-1 vs n-2, and so on.
What's the best way to do this in Pandas?

I guess the key here is to find the isnew:
# rows after the first month
new_prods = df['date'] != df.date.min()
# rows whose product has already appeared before
duplicated = df.duplicated('product')

# first appearance of new products, or repeated (old) products
valids = new_prods | duplicated
df.loc[valids, 'is_new'] = ~duplicated

# then the difference:
df['diff'] = (df.groupby('product')['amount'].diff()  # normal differences
                .fillna(df['amount'])                 # fill the first value for each product
                .where(df['is_new'].notna())          # remove the first month
             )
Output:
product amount date is_new diff
0 a 1 2020-6-6 NaN NaN
1 b 2 2020-6-6 NaN NaN
2 a 3 2020-6-7 False 2.0
3 b 5 2020-6-7 False 3.0
4 c 2 2020-6-7 True 2.0
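Since the diff is computed per product over all rows, the same approach extends unchanged to any number of months, which is what the edit asks for. A minimal sketch with a hypothetical third date added to the example data:
import pandas as pd

# example data plus an assumed third date, 2020-6-8
d = {'product': ['a', 'b', 'a', 'b', 'c', 'a', 'c'],
     'amount': [1, 2, 3, 5, 2, 4, 7],
     'date': ['2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7', '2020-6-7',
              '2020-6-8', '2020-6-8']}
df = pd.DataFrame(d)

new_prods = df['date'] != df.date.min()
duplicated = df.duplicated('product')
df.loc[new_prods | duplicated, 'is_new'] = ~duplicated
df['diff'] = (df.groupby('product')['amount'].diff()
                .fillna(df['amount'])
                .where(df['is_new'].notna()))
print(df)
# the 2020-6-8 rows get diff 1 (a: 4-3) and 5 (c: 7-2), both with is_new False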

You can try groupby on the column product and diff the column amount to get the 'diff' column, then use duplicated on product for the 'isnew' column.
df['diff'] = df.groupby('product')['amount'].diff()
df['isnew'] = ~df['product'].duplicated()
print (df)
product amount date diff isnew
0 a 1 2020-6-6 NaN True
1 b 2 2020-6-6 NaN True
2 a 3 2020-6-7 2.0 False
3 b 5 2020-6-7 3.0 False
4 c 2 2020-6-7 NaN True
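If, as in the expected output above, a brand-new product's first diff should equal its own amount while the very first month stays NaN, this can be combined with the fillna idea from the first answer. A minimal sketch under that assumption:
first_month = df['date'] == df['date'].min()
df['diff'] = (df.groupby('product')['amount'].diff()
                .fillna(df['amount'])   # a new product's first diff is its amount
                .mask(first_month))     # the first month has nothing to compare to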

Related

Fill cell containing NaN with average of value before and after considering groupby

I would like to fill missing values in a pandas dataframe with the average of the cells directly before and after the missing value, taking the different IDs into account.
maskedid test value
1 A 4
1 B NaN
1 C 5
2 A 5
2 B NaN
2 B 2
expected DF
maskedid test value
1 A 4
1 B 4.5
1 C 5
2 A 5
2 B 3.5
2 B 2
Try to interpolate:
df['value'] = df['value'].interpolate()
And by group:
df['value'] = df.groupby('maskedid')['value'].apply(pd.Series.interpolate)
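A minimal self-contained sketch of the grouped interpolation, with the data rebuilt from the table above; transform keeps the original index, so the result assigns back cleanly:
import pandas as pd
import numpy as np

df = pd.DataFrame({'maskedid': [1, 1, 1, 2, 2, 2],
                   'test': ['A', 'B', 'C', 'A', 'B', 'B'],
                   'value': [4, np.nan, 5, 5, np.nan, 2]})

# interpolate within each maskedid so values never leak across IDs
df['value'] = df.groupby('maskedid')['value'].transform(lambda s: s.interpolate())
print(df)  # the two NaN rows become 4.5 and 3.5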

Pandas Groupby (shift) function to return null for first entry

For the following code in Python, I'm trying to get the difference between the latest rating (by date) and the previous rating, which I have done in 'orinc'.
However, where there's no previous rating (i.e. the first entry for 'H_Name'), it's returning the current rating. Is there anything to add to this code so it would return null or NaN?
df2['orinc'] = df2['HIR_OfficialRating'] - df2.groupby('H_Name')['HIR_OfficialRating'].shift(1)
Sure. Check whether the values in column H_Name are unique: for a unique value, shift returns a missing value, because there is no previous row in the group to shift from.
The first value of each group is NaN for the same reason.
import pandas as pd

df2 = pd.DataFrame({'H_Name': ['a','a','a','a','e','b','b','c','d'],
                    'HIR_OfficialRating': list(range(9))})
df2['new'] = df2.groupby('H_Name')['HIR_OfficialRating'].shift(1)
print (df2)
H_Name HIR_OfficialRating new
0 a 0 NaN < first value of group a
1 a 1 0.0
2 a 2 1.0
3 a 3 2.0
4 e 4 NaN <-unique e
5 b 5 NaN < first value of group b
6 b 6 5.0
7 c 7 NaN <-unique c
8 d 8 NaN <-unique d
df2['orinc'] = df2['HIR_OfficialRating'] - df2.groupby('H_Name')['HIR_OfficialRating'].shift(1)
print (df2)
H_Name HIR_OfficialRating orinc
0 a 0 NaN
1 a 1 1.0
2 a 2 1.0
3 a 3 1.0
4 e 4 NaN
5 b 5 NaN
6 b 6 1.0
7 c 7 NaN
8 d 8 NaN
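As a side note, subtracting the shifted column within each group is exactly what groupby(...).diff() computes, so an equivalent one-liner (assuming the same df2) is:
df2['orinc'] = df2.groupby('H_Name')['HIR_OfficialRating'].diff()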

How could I replace a null value in a group?

I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 with the difference from the next lower price in the same group?
for example:
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:3
neighboorhood:a, bed:1, bath:1, price:2
I get price differences of 0, 2, 1, nan and I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
import pandas as pd

data = [
    [1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],
    [5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns=['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price'].diff(-1))
I think you can first remove duplicates across all the columns used for the groupby plus price, create the new column in the filtered data, and finally use merge with a left join back to the original:
df1 = (df.dropna()
.sort_values('price',ascending=False)
.drop_duplicates(['neighborhoodname','beds','baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function to back-fill the 0 values within each group, which avoids wrong output for one-row groups (where values would otherwise be back-filled from a different group):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN

Remove outliers from aggregated Dataframe (Python)

My original dataframe looks like this (only the first rows):
categories id products
0 A 1 a
1 B 1 a
2 C 1 a
3 A 1 b
4 B 1 b
5 A 2 c
6 B 2 c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The dataframe then looks like this; I also added an outlier row from my DF:
id products A B C
0 1 2 2 2 1
1 2 1 1 1 0
2 3 50 1 1 30
Now I am trying to remove the outliers in my new DF:
#remove outliners
del df2['id']
df2 = df2.loc[df2['products']<=20,[str(i) for i in df2.columns]]
What I then get is:
products A B C
0 2 NaN NaN NaN
1 1 NaN NaN NaN
It removes the outliers, but why do I now get only NaNs in the category columns?
The column selection [str(i) for i in df2.columns] converts the column labels to strings; if the actual labels are not strings, .loc no longer finds them, and older pandas versions silently return all-NaN columns for such missing labels (newer versions raise a KeyError). Filtering the rows is enough, so keep all columns:
df2 = df2.loc[df2['products'] <= 20]
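A minimal sketch on the aggregated frame from the question (values taken from the table above), showing that a plain boolean row filter keeps the category columns intact:
import pandas as pd

df2 = pd.DataFrame({'id': [1, 2, 3],
                    'products': [2, 1, 50],
                    'A': [2, 1, 1],
                    'B': [2, 1, 1],
                    'C': [1, 0, 30]})

df2 = df2.loc[df2['products'] <= 20]  # row filter only, all columns kept
print(df2.drop(columns='id'))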

Aggregate under certain condition

I have this data frame.
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 4, 2, 3],
                   'user': ['A', 'B', 'B', 'B', 'A', 'A'],
                   'num_posts': [1, 2, 3, 4, 5, 6]})
I want a new column containing each user's total number of posts up to the date of that post, excluding that day. What I want looks like this:
user day num_posts total_todate
A 1 1 0
B 2 2 3
B 1 3 0
B 4 4 5
A 2 5 1
A 3 6 6
Any ideas?
You can sort the data frame by day, group by user, calculate the cumulative sum of the num_posts column, and then shift it down by 1:
df['total_todate'] = (df.sort_values('day')
                        .groupby('user').num_posts
                        .transform(lambda p: p.cumsum().shift())
                        .fillna(0))
df
# day num_posts user total_todate
#0 1 1 A 0.0
#1 2 2 B 3.0
#2 1 3 B 0.0
#3 4 4 B 5.0
#4 2 5 A 1.0
#5 3 6 A 6.0
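The fillna(0) step leaves total_todate as float (0.0, 3.0, ...); if the integer values shown in the expected output are wanted, the column can be cast afterwards:
df['total_todate'] = df['total_todate'].astype(int)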
