I created this dataframe and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How can I replace that 0 with the difference to the next lower price within the same group?
for example:
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:3
neighboorhood:a, bed:1, bath:1, price:2
I get price differences of 0, 2, 1, nan, but I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
import pandas as pd

data = [
    [1,'a',1,1,5], [2,'a',1,1,5], [3,'a',1,1,4], [4,'a',1,1,2],
    [5,'b',1,2,6], [6,'b',1,2,6], [7,'b',1,2,3]
]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
I think you can first remove duplicates across all the columns used for the groupby plus price, create the new column in the filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname', 'beds', 'baths', 'price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function to back-fill the 0 values within each group, which avoids wrong outputs for one-row groups (data leaking in from other groups):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
I have a dataframe containing information about users rating items over a period of time. It contains a number of rows with identical 'user_id' and 'business_id', which I retrieve using the following code:
mask = reviews_df.duplicated(subset=['user_id','business_id'], keep=False)
dup = reviews_df[mask]
This gives me the duplicated rows. I now need to remove all such duplicates from the original dataframe and substitute them with their average. Is there a fast and elegant way to achieve this? Thanks!
So if you have a dataframe that looks like:
review_id user_id business_id stars date
0 1 0 3 2.0 2019-01-01
1 2 1 3 5.0 2019-11-11
2 3 0 2 4.0 2019-10-22
3 4 3 4 3.0 2019-09-13
4 5 3 4 1.0 2019-02-14
5 6 0 2 5.0 2019-03-17
Then the solution should be something like this:
df.loc[df.duplicated(['user_id', 'business_id'], keep=False)]\
  .groupby(['user_id', 'business_id'])\
  .apply(lambda x: x.stars - x.stars.mean())
With the following result:
user_id business_id
0 2 2 -0.5
5 0.5
3 4 3 1.0
4 -1.0
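Note that the snippet above returns each duplicated review's deviation from its group mean rather than replacing the rows. If the goal is instead to collapse each (user_id, business_id) pair into a single row that carries the average rating, a minimal sketch (assuming 'stars' is the only column to average, and that keeping the first review_id and date per pair is acceptable) could be:
# Collapse duplicate (user_id, business_id) pairs into one row with the mean rating.
# Rows that were never duplicated form their own one-row groups and pass through unchanged.
averaged = (df.groupby(['user_id', 'business_id'], as_index=False)
              .agg({'review_id': 'first', 'date': 'first', 'stars': 'mean'}))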
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row, and rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
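If you then want it as a column on the original frame, as in new_df (a small follow-up sketch; it relies on the index alignment restored by droplevel above):
df['rate_of_change'] = s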
You can actually get around the groupby + apply given how your DataFrame is sorted: rows with the same ID_col are contiguous, so you can just check whether ID_col matches the previous (shifted) row.
Calculate the rate of change for every row, and then only assign the values back for rows that stay within the same group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use pandas' diff inside a groupby.apply:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
I am working on a pandas DataFrame, something like the one below:
id vals
0 1 11
1 1 5.5
2 1 -2
3 1 8
4 2 3
5 2 4
6 2 19
7 2 20
The above is just a small part of the df; the vals are grouped by id, and there will always be an equal number of vals per id. In the case above it's 4 values each for id = 1 and id = 2.
What I am trying to achieve is to add the value at index 0 to the value at index 4, then the value at index 1 to the value at index 5, and so on.
Following is the expected df/series, say df2:
total
0 14
1 9.5
2 17
3 28
The real df has hundreds of ids, not just the two above.
groupby() can be used, but I don't get how to access the specific indices from each group.
Please let me know if anything is unclear.
Groupby on the modulo of the df.index values and take the sum of vals:
In [805]: df.groupby(df.index % 4).vals.sum()
Out[805]:
0 14.0
1 9.5
2 17.0
3 28.0
Name: vals, dtype: float64
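A variant that does not depend on df having the default 0..n-1 integer index (a sketch, assuming the rows are already in order within each id) is to group on each row's position within its id instead:
# position of each row within its id group: 0, 1, 2, 3, 0, 1, 2, 3, ...
pos = df.groupby('id').cumcount()
df.groupby(pos).vals.sum()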
Since there are exactly 4 values per ID, we can simply reshape the underlying 1D array into a 2D array and sum along the appropriate axis (axis=0 in this case):
pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Sample run -
In [192]: df
Out[192]:
id vals
0 1 11.0
1 1 5.5
2 1 -2.0
3 1 8.0
4 2 3.0
5 2 4.0
6 2 19.0
7 2 20.0
In [193]: pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Out[193]:
total
0 14.0
1 9.5
2 17.0
3 28.0
So I have some time series data on which I want to compute the daily return/increment, where daily increment = value_at_time(T) / value_at_time(T-1).
import pandas as pd
df=pd.DataFrame([1,2,3,7]) #Sample data frame
df[1:]
out:
0
1 2
2 3
3 7
df[:-1]
out:
0
0 1
1 2
2 3
######### Method 1
df[1:]/df[:-1]
out:
0
0 NaN
1 1
2 1
3 NaN
######### Method 2
df[1:]/df[:-1].values
out:
0
1 2.000000
2 1.500000
3 2.333333
######### Method 3
df[1:].values/df[:-1]
out:
0
0 2
1 1
2 2
My questions are:
If df[:-1] and df[1:] each have only three values (row slices of the dataframe), then why doesn't Method 1 work?
Why are Methods 2 and 3, which look almost identical, giving different results?
Why does using .values in Method 2 make it work?
Let's look at each method.
Method 1: if you look at what the slices return, you can see that the indices don't align:
In [87]:
print(df[1:])
print(df[:-1])
0
1 2
2 3
3 7
0
0 1
1 2
2 3
so when you do the division, only two index labels (1 and 2) overlap:
In [88]:
df[1:]/df[:-1]
Out[88]:
0
0 NaN
1 1.0
2 1.0
3 NaN
Method 2 produces a NumPy array; this has no index, so the division is performed element-wise in order, as expected:
In [89]:
df[:-1].values
Out[89]:
array([[1],
[2],
[3]], dtype=int64)
Giving:
In [90]:
df[1:]/df[:-1].values
Out[90]:
0
1 2.000000
2 1.500000
3 2.333333
Method 3 works for the same reason as Method 2: the numerator becomes a plain array with no index, so the division is element-wise and the result takes the denominator's index (0, 1, 2).
So how do you do this in pure pandas? Use shift to align the indices as desired:
In [92]:
df.shift(-1)/df
Out[92]:
0
0 2.000000
1 1.500000
2 2.333333
3 NaN
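As a final note, df.shift(-1)/df places each ratio on the row of the earlier value. If you want the increment defined exactly as value_at_time(T) / value_at_time(T-1) and aligned to time T, the same idea shifted the other way (a minor variant, not part of the answer above) is:
df / df.shift(1)
# yields NaN, 2.0, 1.5, 2.333... for rows 0-3 (row 0 has no previous value)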