I have a dataframe like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'year': [1990, 1990, 1992, 1992, 1992],
                   'value': [100, 200, 300, 400, np.nan],
                   'rank': [2, 1, 2, 1, 3]})
print(df)
year value rank
0 1990 100.0 2
1 1990 200.0 1
2 1992 300.0 2
3 1992 400.0 1
4 1992 NaN 3
I am trying to achieve this:
# For year 1990, maximum value is 200, rank is 1 and also relative value is 1.
year value rank value_relative
0 1990 100.0 2 0.5
1 1990 200.0 1 1
2 1992 300.0 2 0.75
3 1992 400.0 1 1
4 1992 NaN 3 NaN
My attempt:
df['value_relative'] = df.groupby('year')['value'].transform(lambda x: x/x[x.rank == 1]['value'])
How can we do this operation where we calculate relative value for each year?
IIUC, use transform with 'first' after sort_values:
df['value_relative']=df.value/df.sort_values('rank').groupby('year').value.transform('first')
df
Out[60]:
year value rank value_relative
0 1990 100.0 2 0.50
1 1990 200.0 1 1.00
2 1992 300.0 2 0.75
3 1992 400.0 1 1.00
4 1992 NaN 3 NaN
Or just transform with 'max':
df['value_relative']=df.value/df.groupby('year').value.transform('max')
Another method:
df.value/df.loc[df.groupby('year')['rank'].transform('idxmin'),'value'].values
Out[64]:
0 0.50
1 1.00
2 0.75
3 1.00
4 NaN
Name: value, dtype: float64
If you need the rank-2 value as the denominator:
df.value/df.year.map(df.loc[df['rank']==2].set_index('year')['value'])
The difference here depends on how you get your rank: if it is based on the max of value, then all of these return the same result; but if it is a given rank unrelated to the value column, then you should use 'first'.
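For example, a hypothetical case (not from the question) where the rank is assigned independently of the value, so 'first' and 'max' give different answers:
import numpy as np
import pandas as pd

# rank 1 deliberately points at the smaller value, so rank is not derived from 'value'
d = pd.DataFrame({'year': [1990, 1990],
                  'value': [100.0, 200.0],
                  'rank': [1, 2]})

# denominator = value of the rank-1 row (100) -> 1.0, 2.0
print(d.value / d.sort_values('rank').groupby('year').value.transform('first'))

# denominator = max value (200) -> 0.5, 1.0
print(d.value / d.groupby('year').value.transform('max'))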
I liked and accepted Wen's answer, but wanted to give my 2 cents:
The simplest method is to just divide value by the maximum, but I was trying to learn how to do this using the separate rank column:
df.groupby('year')['value'].transform(lambda x: x/x.max())
0 0.50
1 1.00
2 0.75
3 1.00
4 NaN
Another simple method for rank == 2:
df.groupby('year')['value'].transform(lambda x: x/x.nlargest(2).iloc[-1])
0 1.000000
1 2.000000
2 1.000000
3 1.333333
4 NaN
NOTE: Wen's method:
df.value/df.year.map(df.loc[df['rank']==2].set_index('year')['value'])
0 1.000000
1 2.000000
2 1.000000
3 1.333333
4 NaN
The original dataset is:
Group  Year  Value
A      1990  NaN
A      1992  1
A      1995  NaN
A      1997  NaN
A      1998  NaN
A      2001  NaN
A      2002  1
B      1991  1
B      1992  NaN
B      1995  NaN
B      1998  NaN
B      2001  1
B      2002  NaN
I want to forward fill within each group, conditional on the 'Year' column: only forward fill a missing value while 'Year' is no more than five years past the last observed value.
For example, the value for group A in 1992 is 1, so the value for group A in 1995 should be forward filled with 1 since 1995-1992=3 <= 5; the value for group A in 1997 should also be forward filled with 1 since 1997-1992=5 <= 5; but the value for group A in 1998 should not be forward filled since 1998-1992=6 > 5.
The dataset I want is as follows:
Group  Year  Value
A      1990  NaN
A      1992  1
A      1995  1
A      1997  1
A      1998  NaN
A      2001  NaN
A      2002  1
B      1991  1
B      1992  1
B      1995  1
B      1998  NaN
B      2001  1
B      2002  1
You can use a double groupby.ffill and mask with where:
# identify rows within 5 of the previous non-NA value
m = (df['Year'].where(df['Value'].notna())
       .groupby(df['Group']).ffill()
       .rsub(df['Year']).le(5)
     )
# groupby.ffill and mask
df['Value'] = df.groupby('Group')['Value'].ffill().where(m)
Output:
Group Year Value
0 A 1990 NaN
1 A 1992 1.0
2 A 1995 1.0
3 A 1997 1.0
4 A 1998 NaN
5 A 2001 NaN
6 A 2002 1.0
7 B 1991 1.0
8 B 1992 1.0
9 B 1995 1.0
10 B 1998 NaN
11 B 2001 1.0
12 B 2002 1.0
You can use cummax to identify 5-year ranges:
# rows that have a value get a "deadline" of Year + 5; cummax carries the latest
# deadline forward within each group, and any row whose Year exceeds it starts a new block
x = df['Year'].mask(df['Value'].notna(), df['Year'] + 5).groupby(df['Group']).cummax()
# forward fill only within each (Group, block)
df['Value'] = df.groupby([df['Group'], x])['Value'].ffill()
Result:
Group Year Value
0 A 1990 NaN
1 A 1992 1.0
2 A 1995 1.0
3 A 1997 1.0
4 A 1998 NaN
5 A 2001 NaN
6 A 2002 1.0
7 B 1991 1.0
8 B 1992 1.0
9 B 1995 1.0
10 B 1998 NaN
11 B 2001 1.0
12 B 2002 1.0
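If you want to see how the 5-year blocks are formed, you can inspect the helper key directly (a small sketch; x is the same expression as in the answer above):
x = df['Year'].mask(df['Value'].notna(), df['Year'] + 5).groupby(df['Group']).cummax()
print(df.assign(block=x))  # rows sharing a 'block' value are forward filled together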
I have this dataframe:
df = pd.DataFrame({'Position1': [1, 2, 3], 'Count1': [55, 35, 45],
                   'Position2': [4, 2, 7], 'Count2': [15, 35, 75],
                   'Position3': [3, 5, 6], 'Count3': [45, 95, 105]})
print(df)
Position1 Count1 Position2 Count2 Position3 Count3
0 1 55 4 15 3 45
1 2 35 2 35 5 95
2 3 45 7 75 6 105
I want to join the Position columns into one column named "Positions" while aligning the data in the Count columns, like so:
   Positions  Count1  Count2  Count3
0          1      55     NaN     NaN
1          2      35      35     NaN
2          3      45     NaN      45
3          4     NaN      15     NaN
4          5     NaN     NaN      95
5          6     NaN     NaN     105
6          7     NaN      75     NaN
I've tried melting the dataframe, combining and merging columns but I am a bit stuck.
Note that the NaN values can easily be replaced using df.fillna to get a dataframe like so:
df = df.fillna(0)
Positions Count1 Count2 Count3
0 1 55 0 0
1 2 35 35 0
2 3 45 0 45
3 4 0 15 0
4 5 0 0 95
5 6 0 0 105
6 7 0 75 0
Here is a way to do what you've asked:
df = (df[['Position1', 'Count1']]
      .rename(columns={'Position1': 'Positions'})
      .join(df[['Position2', 'Count2']].set_index('Position2'), on='Positions', how='outer')
      .join(df[['Position3', 'Count3']].set_index('Position3'), on='Positions', how='outer')
      .sort_values(by=['Positions'])
      .reset_index(drop=True))
Output:
Positions Count1 Count2 Count3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
Explanation:
Use join first on Position1, Count1 and Position2, Count2 (with Position1 renamed to Positions), then join that result with Position3, Count3.
Sort by Positions and use reset_index to create a new integer range index (ascending with no gaps).
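If there were more Position/Count pairs, the same idea generalizes; here is a hedged sketch that starts from the original df and assumes the columns follow the PositionN/CountN naming pattern:
from functools import reduce

# one small frame per PositionN/CountN pair, each keyed by a common 'Positions' column
pairs = [df[[f'Position{i}', f'Count{i}']]
         .rename(columns={f'Position{i}': 'Positions'})
         for i in range(1, 4)]

# outer-merge them all on 'Positions', then tidy up the order and index
out = reduce(lambda left, right: left.merge(right, on='Positions', how='outer'), pairs)
out = out.sort_values('Positions').reset_index(drop=True)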
Does this achieve what you are after?
import pandas as pd
df = pd.DataFrame({'Position1': [1, 2, 3], 'Count1': [55, 35, 45],
                   'Position2': [4, 2, 7], 'Count2': [15, 35, 75],
                   'Position3': [3, 5, 6], 'Count3': [45, 95, 105]})
df1, df2, df3 = df.iloc[:,:2], df.iloc[:, 2:4], df.iloc[:, 4:6]
df1.columns, df2.columns, df3.columns = ['Positions', 'Count1'], ['Positions', 'Count2'], ['Positions', 'Count3']
df1.merge(df2, on='Positions', how='outer').merge(df3, on='Positions', how='outer').sort_values('Positions')
Output: the same table as shown above (Positions 1-7 with the matching Count values), though here the index is not reset.
pd.wide_to_long unpivots the DataFrame from wide to long, and that is what's used here.
The column names are also renamed along the way:
df['id'] = df.index
# wide -> long: one row per (id, pos), holding a Position and a Count
df2 = pd.wide_to_long(df, stubnames=['Position', 'Count'], i='id', j='pos').reset_index()
# back to wide, keyed by (id, Position), with one count column per pos
df2 = df2.pivot(index=['id', 'Position'], columns='pos', values='Count').reset_index().fillna(0).add_prefix('count_')
df2.rename(columns={'count_id': 'id', 'count_Position': 'Position'}, inplace=True)
df2
RESULT:
pos  id  Position  count_1  count_2  count_3
0     0         1     55.0      0.0      0.0
1     0         3      0.0      0.0     45.0
2     0         4      0.0     15.0      0.0
3     1         2     35.0     35.0      0.0
4     1         5      0.0      0.0     95.0
5     2         3     45.0      0.0      0.0
6     2         6      0.0      0.0    105.0
7     2         7      0.0     75.0      0.0
One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index=None,
     names_to=('.value', 'num'),
     names_pattern=r"(.+)(\d+)")
 .pivot_wider(index='Position', names_from='num')
)
Position Count_1 Count_2 Count_3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
In the pivot_longer step, the .value placeholder determines which parts of the column names remain as column headers - in this case Position and Count.
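For comparison, a rough plain-pandas sketch of the same reshape (hedged: it starts from the original df, assumes the PositionN/CountN naming pattern, and assumes each Position value appears at most once per numbered column):
import pandas as pd

# wide -> long: one row per (original row, num), holding a Position and a Count
long = pd.wide_to_long(df.reset_index(), stubnames=['Position', 'Count'],
                       i='index', j='num').reset_index()

# long -> wide: one row per Position, one Count column per num
wide = (long.pivot(index='Position', columns='num', values='Count')
            .add_prefix('Count_')
            .reset_index())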
I know how to compute the groupby mean or std separately, but now I want to compute both at the same time.
My code:
df =
a b c d
0 Apple 3 5 7
1 Banana 4 4 8
2 Cherry 7 1 3
3 Apple 3 4 7
xdf = df.groupby('a').agg([np.mean(),np.std()])
Present output:
TypeError: _mean_dispatcher() missing 1 required positional argument: 'a'
Try removing the () from the np functions - agg expects the functions themselves, not the result of calling them:
xdf = df.groupby("a").agg([np.mean, np.std])
print(xdf)
Prints:
b c d
mean std mean std mean std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
EDIT: To "flatten" the column MultiIndex:
xdf = df.groupby("a").agg([np.mean, np.std])
xdf.columns = xdf.columns.map("_".join)
print(xdf)
Prints:
b_mean b_std c_mean c_std d_mean d_std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
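As a side note, recent pandas versions warn when NumPy functions are passed to agg, so the same result can be had with string aggregation names (a small variation on the answer above, assuming a reasonably current pandas):
xdf = df.groupby("a").agg(["mean", "std"])  # same aggregations, no NumPy callables
xdf.columns = xdf.columns.map("_".join)     # same flattening trick as above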
I have a DataFrame with thousands of rows and hundreds of columns, where I want to forward fill the data, but grouped by id and only within a date range of the original data. What I mean is: if we have data for id 1 on 01/01/2020 but null values on 01/05/2020 and 02/02/2020, I would like to fill the value on 01/05/2020 but not on 02/02/2020, since 02/02/2020 is not within a 30-day period. A plain ffill fills everything based on the last result.
import pandas as pd
import numpy as np
res = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                    'date': ['01/01/2020', '01/05/2020', '02/03/2020', '02/05/2020',
                             '04/01/2020', '01/01/2020', '01/02/2020'],
                    'result': [1.5, np.nan, np.nan, 2.6, np.nan, np.nan, 6.0]})
res['result1']= res.groupby(['id']).apply(lambda x: x.result.ffill()).reset_index(drop=True)
The result I get is:
id date result result1
0 1 01/01/2020 1.5 1.5
1 1 01/05/2020 NaN 1.5
2 1 02/03/2020 NaN 1.5
3 1 02/05/2020 2.6 2.6
4 1 04/01/2020 NaN 2.6
5 2 01/01/2020 NaN NaN
6 2 01/02/2020 6.0 6.0
What I want is:
id date result result1
0 1 01/01/2020 1.5 1.5
1 1 01/05/2020 NaN 1.5
2 1 02/03/2020 NaN NaN
3 1 02/05/2020 2.6 2.6
4 1 04/01/2020 NaN NaN
5 2 01/01/2020 NaN NaN
6 2 01/02/2020 6.0 6.0
You can try merge_asof:
res['date'] = pd.to_datetime(res['date'])
res = res.sort_values('date')
# rows that actually have a result become the right-hand side of the asof merge
res1 = res.dropna(subset=['result']).rename(columns={'result': 'result1'})
# match each row to the most recent earlier result within 30 days, per id
out = pd.merge_asof(res.reset_index(), res1, by='id', on='date',
                    tolerance=pd.Timedelta(30, unit='d'),
                    direction='backward').sort_values('index')
Out[72]:
index id date result result1
0 0 1 2020-01-01 1.5 1.5
3 1 1 2020-01-05 NaN 1.5
4 2 1 2020-02-03 NaN NaN
5 3 1 2020-02-05 2.6 2.6
6 4 1 2020-04-01 NaN NaN
1 5 2 2020-01-01 NaN NaN
2 6 2 2020-01-02 6.0 6.0
Not as elegant as Ben's merge_asof, but you can do something like this:
res['date'] = pd.to_datetime(res['date'])
# valid blocks
valids = res['result'].notna().cumsum()
# first dates in each block
first_dates = res.groupby(['id',valids])['date'].transform('min')
# How far we ffill
mask = (res['date']-first_dates)<pd.Timedelta('30D')
# ffill and then mask
res['result1'] = res['result'].groupby(res['id']).ffill().where(mask)
Output:
id date result result1
0 1 2020-01-01 1.5 1.5
1 1 2020-01-05 NaN 1.5
2 1 2020-02-03 NaN NaN
3 1 2020-02-05 2.6 2.6
4 1 2020-04-01 NaN NaN
5 2 2020-01-01 NaN NaN
6 2 2020-01-02 6.0 6.0
I have a pandas dataframe that looks as follows:
In [23]: dataframe.head()
Out[23]:
column_id 1 10 11 12 13 14 15 16 17 18 ... 46 47 48 49 5 50 \
row_id ...
1 NaN NaN 1 1 1 1 1 1 1 1 ... 1 1 NaN 1 NaN NaN
10 1 1 1 1 1 1 1 1 1 NaN ... 1 1 1 NaN 1 NaN
100 1 1 NaN 1 1 1 1 1 NaN 1 ... NaN NaN 1 1 1 NaN
11 NaN 1 1 1 1 1 1 1 1 NaN ... NaN 1 1 1 1 1
12 1 1 1 NaN 1 1 1 1 NaN 1 ... 1 NaN 1 1 NaN 1
The thing is, I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:
In [24]: dataframe.transpose().corr().head()
Out[24]:
row_id 1 10 100 11 12 13 14 15 16 17 ... 90 91 92 93 94 95 \
row_id ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
Is there any other way of computing correlations that avoids this? Maybe an easy way to calculate the Euclidean distance between rows with just one method, just as the Pearson correlation has?
The key question here is what distance metric to use.
Let's say this is your data.
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 \
0 1 1 1 NaN 1 NaN NaN 1 1 1 ... 1 1 NaN 1 NaN 1 1 1
1 1 1 1 NaN 1 1 1 1 1 1 ... NaN 1 1 NaN NaN 1 1 1
2 1 1 1 1 1 1 1 1 1 1 ... 1 NaN 1 1 1 1 1 NaN
3 1 NaN 1 NaN 1 NaN 1 NaN 1 1 ... 1 1 1 1 NaN 1 1 1
4 1 1 1 1 1 1 1 1 NaN 1 ... NaN 1 1 1 1 1 1 1
What is the % difference?
You can compute a distance metric as percentage of values that are different between each column. The result shows the % difference between any 2 columns.
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 7 8 9 ... 40 \
0 0.00 0.36 0.33 0.37 0.32 0.41 0.35 0.33 0.39 0.33 ... 0.37
1 0.36 0.00 0.37 0.29 0.30 0.37 0.33 0.37 0.33 0.31 ... 0.35
2 0.33 0.37 0.00 0.36 0.29 0.38 0.40 0.34 0.30 0.28 ... 0.28
3 0.37 0.29 0.36 0.00 0.29 0.30 0.34 0.26 0.32 0.36 ... 0.36
4 0.32 0.30 0.29 0.29 0.00 0.31 0.35 0.29 0.29 0.25 ... 0.27
What is the correlation coefficient?
Here, we use the Pearson correlation coefficient. This is a perfectly valid metric; specifically, it translates to the phi coefficient in the case of binary data.
>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 1.000000 0.013158 0.026262 -0.059786 -0.024293 -0.078056 0.054074
1 0.013158 1.000000 -0.093109 0.170159 0.043187 0.027425 0.108148
2 0.026262 -0.093109 1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786 0.170159 -0.124540 1.000000 0.004245 0.184153 0.042524
4 -0.024293 0.043187 -0.048485 0.004245 1.000000 0.079196 -0.099834
Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
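If you want to check that claim yourself, a quick sketch using scipy.stats.spearmanr on the same zero_data:
from scipy.stats import spearmanr

spearman = zero_data.apply(lambda col1: zero_data.apply(
    lambda col2: spearmanr(col1, col2)[0]))
# for 0/1 data this should match the Pearson matrix up to floating-point noise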
What is the Euclidean distance?
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 0.000000 6.000000 5.744563 6.082763 5.656854 6.403124 5.916080
1 6.000000 0.000000 6.082763 5.385165 5.477226 6.082763 5.744563
2 5.744563 6.082763 0.000000 6.000000 5.385165 6.164414 6.324555
3 6.082763 5.385165 6.000000 0.000000 5.385165 5.477226 5.830952
4 5.656854 5.477226 5.385165 5.385165 0.000000 5.567764 5.916080
By now, you'd have a sense of the pattern. Create a distance method. Then apply it pairwise to every column using
data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))
If your distance method relies on the presence of zeroes instead of NaNs, convert to zeroes using .fillna(0).
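For larger frames the nested apply gets slow; one hedged alternative is to let SciPy compute all pairwise distances at once (a sketch; pdist works on rows, hence the transpose to compare columns):
from scipy.spatial.distance import pdist, squareform

zero_data = data.fillna(0)
# 50x50 DataFrame of column-vs-column Euclidean distances
euclid = pd.DataFrame(squareform(pdist(zero_data.T, metric='euclidean')),
                      index=zero_data.columns, columns=zero_data.columns)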
A proposal to improve the excellent answer from @s-anand for Euclidean distance:
instead of
zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)
we can apply fillna only after taking the difference, thus:
distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))
This way, the distance on missing dimensions will not be counted.
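A tiny check of the difference on a made-up pair of columns (hypothetical data, just to illustrate):
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 1.0])
b = pd.Series([np.nan, np.nan, 1.0])

# fill first, then measure: the missing entries count as 0-vs-1 and 0-vs-0
print(np.linalg.norm(a.fillna(0) - b.fillna(0)))  # 1.0
# subtract first, then fill: every dimension with a missing value drops out
print(np.linalg.norm((a - b).fillna(0)))          # 0.0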
This is my NumPy-only version of @S Anand's fantastic answer, which I put together in order to help myself understand his explanation better.
Happy to share it with a short, reproducible example:
# Preliminaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
Let's try scipy.stats.pearsonr first.
Executing:
from scipy.stats import pearsonr
distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
returns a 5x5 DataFrame holding the same values as the array shown below, and:
rslt_np = np.apply_along_axis(
    lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0],
                                     axis=0, arr=iris_df),
    axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
returns:
array([[1.00, -0.12, 0.87, 0.82, 0.78],
[-0.12, 1.00, -0.43, -0.37, -0.43],
[0.87, -0.43, 1.00, 0.96, 0.95],
[0.82, -0.37, 0.96, 1.00, 0.96],
[0.78, -0.43, 0.95, 0.96, 1.00]])
As a second example let's try the distance correlation from the dcor library.
Executing:
import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
returns the same 5x5 matrix as the array shown below, while:
rslt_np = np.apply_along_axis(
    lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2),
                                     axis=0, arr=iris_df),
    axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
returns:
array([[1.00, 0.31, 0.86, 0.83, 0.78],
[0.31, 1.00, 0.54, 0.51, 0.51],
[0.86, 0.54, 1.00, 0.97, 0.95],
[0.83, 0.51, 0.97, 1.00, 0.95],
[0.78, 0.51, 0.95, 0.95, 1.00]])
I compared 3 variants from the other answers here for speed, using a trial 1000x25 matrix (leading to a resulting 1000x1000 distance matrix).
dcor library
Time: 0.03s
https://dcor.readthedocs.io/en/latest/functions/dcor.distances.pairwise_distances.html
import dcor
result = dcor.distances.pairwise_distances(data)
scipy.spatial.distance_matrix
Time: 0.05s
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
from scipy.spatial import distance_matrix
result = distance_matrix(data, data)
using a lambda function with numpy or pandas
Time: 180s / 90s
import numpy as np
import pandas as pd

distance = lambda x, y: np.sqrt(np.sum((x - y) ** 2))  # variant A (180s)
distance = lambda x, y: np.linalg.norm(x - y)          # variant B (90s)
result = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
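If you want to reproduce this kind of comparison yourself, here is a minimal timing sketch (using the same 1000x25 trial size as above):
import time
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

data = pd.DataFrame(np.random.rand(1000, 25))

t0 = time.perf_counter()
result = distance_matrix(data, data)  # 1000x1000 pairwise Euclidean distances
print(f"scipy distance_matrix: {time.perf_counter() - t0:.3f}s")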