python & pandas: get average rank

I have a data frame
ID 2014-01-01 2015-01-01 2016-01-01
1 NaN 0.1 0.2
2 0.1 0.3 0.5
3 0.2 NaN 0.7
4 0.8 0.4 0.1
For each date (column), I want to get the rank of each id. For example, in column '2014-01-01', id = 4 has the greatest value, so we assign rank 1 to id = 4; id = 3 has the second greatest value, so we give it rank 2. If the data is NaN, just ignore it.
ID 2014-01-01 2015-01-01 2016-01-01
1 NaN 3 3
2 3 2 2
3 2 NaN 1
4 1 1 4
The next step is to get the average rank of each id. For example, AvgRank of id1 = (4+3)/2 = 3.5 and AvgRank of id2 = (3+2+2)/3 = 2.33.
ID AvgRank
1 3
2 2.33
3 1.5
4 2
My algorithm is:
create a dictionary for each id ({str: list}) -> loop through all the columns -> for each column, calculate the rank and append it to the list in the dictionary
but I think that is too complicated for this simple problem.
Is there an easier way to get the AvgRank table?
Here is the code to create the dataframe:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   '2014-01-01': [float('NaN'), 0.1, 0.2, 0.8],
                   '2015-01-01': [0.1, 0.3, float('NaN'), 0.4],
                   '2016-01-01': [0.2, 0.5, 0.7, 0.1]})

It's unclear why you think the rank should be 4 for the first-row value in the second column, but the following gives you what you want. Here we call rank on the columns of interest (everything except 'ID') and pass method='dense' and ascending=False so the largest value gets rank 1:
In [60]:
df.drop('ID', axis=1).rank(method='dense', ascending=False)
Out[60]:
2014-01-01 2015-01-01 2016-01-01
0 NaN 3 3
1 3 2 2
2 2 NaN 1
3 1 1 4
We then concat the 'ID' column from the original df with the row-wise mean of those ranks (mean with axis=1), and rename the resulting column:
In [67]:
pd.concat([df['ID'], df.drop('ID', axis=1).rank(method='dense', ascending=False).mean(axis=1)], axis=1).rename(columns={0: 'AvgRank'})
Out[67]:
ID AvgRank
0 1 3.000000
1 2 2.333333
2 3 1.500000
3 4 2.000000
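For completeness, the same result can be written more compactly by making 'ID' the index first; a sketch along the same lines, assuming a reasonably recent pandas:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   '2014-01-01': [float('NaN'), 0.1, 0.2, 0.8],
                   '2015-01-01': [0.1, 0.3, float('NaN'), 0.4],
                   '2016-01-01': [0.2, 0.5, 0.7, 0.1]})

# Rank each date column (largest value -> rank 1), then average across rows.
avg_rank = (df.set_index('ID')
              .rank(method='dense', ascending=False)
              .mean(axis=1)
              .rename('AvgRank')
              .reset_index())
print(avg_rank)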


How could I replace a null value in a group?

I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the value 0 with the difference from the next lower price in the same group?
for example:
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:3
neighborhood:a, bed:1, bath:1, price:2
I get price differences of 0, 2, 1, NaN, but I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
import pandas as pd

data = [
    [1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],
    [5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
I think you can first remove duplicates over all the columns used for the groupby plus 'price', create the new column in that filtered data, and finally merge back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname', 'beds', 'baths', 'price', 'difference_price']], how='left')
print(df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
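To see what the merge joins on, it helps to inspect df1 after the dedup step; with the sample data above, it should look like this:
print(df1)
   id neighborhoodname  beds  baths  price  difference_price
4   5                b     1      2      6               3.0
0   1                a     1      1      5               1.0
2   3                a     1      1      4               2.0
6   7                b     1      2      3               NaN
3   4                a     1      1      2               NaN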
Or you can use a lambda function that back-fills the 0 values within each group; doing the fill per group avoids wrong output when a group has only one row (values bleeding in from another group):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print(df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN

Pandas calculate average value of column for rows satisfying condition

I have a dataframe containing information about users rating items over a period of time. Its columns are review_id, user_id, business_id, stars, and date (a sample is reproduced in the answer below).
In the dataframe I have a number of rows with identical 'user_id' and 'business_id', which I retrieve using the following code:
mask = reviews_df.duplicated(subset=['user_id','business_id'], keep=False)
dup = reviews_df[mask]
obtaining something like this (only the duplicated rows). I now need to remove all such duplicates from the original dataframe and substitute them with their average. Is there a fast and elegant way to achieve this? Thanks!
Say you have a dataframe that looks like:
review_id user_id business_id stars date
0 1 0 3 2.0 2019-01-01
1 2 1 3 5.0 2019-11-11
2 3 0 2 4.0 2019-10-22
3 4 3 4 3.0 2019-09-13
4 5 3 4 1.0 2019-02-14
5 6 0 2 5.0 2019-03-17
Then the solution should be something like that:
df.loc[df.duplicated(['user_id', 'business_id'], keep=False)]\
  .groupby(['user_id', 'business_id'])\
  .apply(lambda x: x.stars - x.stars.mean())
With the following result:
user_id  business_id
0        2            2   -0.5
                      5    0.5
3        4            3    1.0
                      4   -1.0
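Note that the snippet above returns each duplicate's deviation from its group mean rather than replacing the duplicates. If the goal is literally to collapse each (user_id, business_id) pair into a single row carrying the average stars, a groupby aggregation is one option; a sketch built on the sample frame above (keeping the first review_id and date is an assumption):
import pandas as pd

df = pd.DataFrame({'review_id': [1, 2, 3, 4, 5, 6],
                   'user_id': [0, 1, 0, 3, 3, 0],
                   'business_id': [3, 3, 2, 4, 4, 2],
                   'stars': [2.0, 5.0, 4.0, 3.0, 1.0, 5.0],
                   'date': ['2019-01-01', '2019-11-11', '2019-10-22',
                            '2019-09-13', '2019-02-14', '2019-03-17']})

# Collapse every (user_id, business_id) pair into one row whose stars value
# is the mean of its duplicates; which review_id/date to keep is a choice.
deduped = (df.groupby(['user_id', 'business_id'], as_index=False)
             .agg({'review_id': 'first', 'stars': 'mean', 'date': 'first'}))
print(deduped)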

How to find rate of change across successive rows using time and data columns after grouping by a different column using pandas?

I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is, find the rate of change of data_col by using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row, and rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use pandas.diff:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1       0         NaN
        1   -0.044444
        2    0.000000
2       3         NaN
        4    0.400000
3       5         NaN
dtype: float64
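Both apply-based answers can also be written without apply: per-group diffs keep the original index, so the ratio can be assigned straight back as a column. A sketch using the sample df:
import pandas as pd

df = pd.DataFrame({'ID_col': [1, 1, 1, 2, 2, 3],
                   'time_in_hours': [62.5, 40, 20, 30, 20, 50],
                   'data_col': [4, 3, 3, 1, 5, 6]})

# groupby.diff() yields NaN at each group's first row, which gives the NaN
# rate_of_change at every ID boundary for free.
g = df.groupby('ID_col')
df['rate_of_change'] = g['data_col'].diff() / g['time_in_hours'].diff().abs()
print(df)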

Adding values of a pandas column only at specific indices

I am working on a pandas data frame, something like below:
id vals
0 1 11
1 1 5.5
2 1 -2
3 1 8
4 2 3
5 2 4
6 2 19
7 2 20
Above is just a small part of the df; the vals are grouped by id, and there will always be an equal number of vals per id. In the above case it's 4 values each for id = 1 and id = 2.
What I am trying to achieve is to add the value at index 0 with value at index 4, then value at index 1 with value at index 5 and so on.
Following is the expected df/series, say df2:
total
0 14
1 9.5
2 17
3 28
The real df has hundreds of ids, not just two as above.
groupby() can be used, but I don't get how to pick the specific indices from each group.
Please let me know if anything is unclear.
Groupby on the modulo of the df.index values and take the sum of vals:
In [805]: df.groupby(df.index % 4).vals.sum()
Out[805]:
0 14.0
1 9.5
2 17.0
3 28.0
Name: vals, dtype: float64
Since there are exactly 4 values per ID, we can simply reshape the underlying 1D array data to a 2D array and sum along the appropriate axis (axis=0 in this case):
pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Sample run -
In [192]: df
Out[192]:
id vals
0 1 11.0
1 1 5.5
2 1 -2.0
3 1 8.0
4 2 3.0
5 2 4.0
6 2 19.0
7 2 20.0
In [193]: pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Out[193]:
total
0 14.0
1 9.5
2 17.0
3 28.0
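If the index is not a clean 0..n-1 range (so df.index % 4 is unreliable), grouping on the position within each id works as well; a sketch:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2],
                   'vals': [11, 5.5, -2, 8, 3, 4, 19, 20]})

# cumcount() numbers the rows 0..3 inside each id, so grouping on it adds
# the k-th value of every id regardless of the frame's index labels.
print(df.groupby(df.groupby('id').cumcount())['vals'].sum())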

computing daily return/increment on dataframe

So I've got some timeseries data on which I want to compute the daily return/increment, where daily increment = value_at_time(T) / value_at_time(T-1).
import pandas as pd
df = pd.DataFrame([1, 2, 3, 7])  # sample data frame
df[1:]
out:
0
1 2
2 3
3 7
df[:-1]
out:
0
0 1
1 2
2 3
######### Method 1
df[1:]/df[:-1]
out:
0
0 NaN
1 1
2 1
3 NaN
######### Method 2
df[1:]/df[:-1].values
out:
0
1 2.000000
2 1.500000
3 2.333333
######### Method 3
df[1:].values/df[:-1]
out:
0
0 2
1 1
2 2
My questions are:
1. If df[:-1] and df[1:] each have only three values (row slices of the dataframe), why doesn't Method 1 work?
2. Why are Methods 2 and 3, which are almost identical, giving different results?
3. Why does using .values in Method 2 make it work?
Let's look at each.
Method 1: if you look at what the slices return, you can see that the indices don't align:
In [87]:
print(df[1:])
print(df[:-1])
0
1 2
2 3
3 7
0
0 1
1 2
2 3
so when we do the division, only two index labels (1 and 2) intersect:
In [88]:
df[1:]/df[:-1]
Out[88]:
0
0 NaN
1 1.0
2 1.0
3 NaN
Method 2 produces a NumPy array, which has no index, so the division is performed element-wise in order, as expected:
In [89]:
df[:-1].values
Out[89]:
array([[1],
       [2],
       [3]], dtype=int64)
Giving:
In [90]:
df[1:]/df[:-1].values
Out[90]:
0
1 2.000000
2 1.500000
3 2.333333
Method 3 works for the same reason as Method 2.
So how do we do this in pure pandas? Use shift to align the indices as desired:
In [92]:
df.shift(-1)/df
Out[92]:
0
0 2.000000
1 1.500000
2 2.333333
3 NaN
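Note that df.shift(-1)/df aligns each ratio at time T-1. If you want the daily increment value_at_time(T) / value_at_time(T-1) aligned at row T instead, shift the denominator, or use the built-in pct_change helper (which returns the ratio minus 1):
df / df.shift(1)       # value(T) / value(T-1), NaN in the first row
df.pct_change() + 1    # the same ratios via the percent-change helper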
