I have a dataframe below. I am trying to create lags for var1, var2 and var3 by calculating
(var_n / lag2(var_n)) - 1 (where n is 1, 2, 3).
The code below works fine for lag2, but I need to perform the calculation grouped by "grp".
CODE:
lag = [2]
df = pd.concat([df] + [df.groupby('grp')[['var1', 'var2', 'var3']].shift(x).add_prefix('lag' + str(x)) for x in lag], axis=1)
In a different approach I tried the following, but I am not able to apply the group by:
yoy = [12]
columns_y = df.loc[:, 'var1':'var3']
for col in columns_y.columns:
    for x in yoy:
        columns_y.loc[:, col + "_yoy"] = (columns_y[col] / columns_y[col].shift(x)) - 1
Try this:
import pandas as pd

df = pd.DataFrame({
    'grp': ['a', 'a', 'a', 'b', 'b', 'b'],
    'abc2': ['l', 'm', 'n', 'p', 'q', 'r'],
    'abc3': ['x', 'y', 'z', 'a', 'b', 'c'],
    'var1': [20, 30, 20, 40, 50, 90],
    'var2': [50, 80, 70, 20, 30, 40],
    'var3': [50, 80, 70, 20, 30, 40]})

lag = [2]
lags_df = pd.concat([
    df.groupby('grp')[[f'var{i+1}' for i in range(3)]]
      .shift(x)
      .add_prefix(f'lag{x}_')
    for x in lag
], axis=1)
print(pd.concat([df, lags_df], axis=1))
outputs:
grp abc2 abc3 var1 var2 var3 lag2_var1 lag2_var2 lag2_var3
0 a l x 20 50 50 NaN NaN NaN
1 a m y 30 80 80 NaN NaN NaN
2 a n z 20 70 70 20.0 50.0 50.0
3 b p a 40 20 20 NaN NaN NaN
4 b q b 50 30 30 NaN NaN NaN
5 b r c 90 40 40 40.0 20.0 20.0
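The lags alone don't give the ratio asked for. As a small follow-up sketch (my addition, not part of the original answer), the grouped calculation can be done directly, since grouping by 'grp' keeps the shift from crossing group boundaries:
for x in lag:
    for col in ['var1', 'var2', 'var3']:
        # divide by the within-group lag; NaN where the lag is undefined
        df[f'{col}_yoy{x}'] = df[col] / df.groupby('grp')[col].shift(x) - 1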
df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite values in this data frame using another frame with the following indexes:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
    df._set_value(i, 'HE', dff._get_value(i, 'HE'))
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update (it works in place):
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
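Note that update aligns on both the index and the column labels, and NaN values in the other frame are skipped by default, so only the cells that dff actually provides are overwritten.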
I need to perform the following operation on a pandas dataframe df inside a for loop with 50 iterations or more:
Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4.
The columns common to all 5 dataframes (df, df1, df2, df3 and df4) are A, B, C and D.
EDIT
The dataframes all have different shapes: df is the master dataframe with the maximum number of rows, and the other 4 dataframes each have fewer rows than df, varying among themselves. So while merging columns, the rows from both dataframes need to be matched first.
Input df
A B C D X Y Z W
1 2 3 4 nan nan nan nan
2 3 4 5 nan nan nan nan
5 9 7 8 nan nan nan nan
4 8 6 3 nan nan nan nan
df1
A B C D X Y Z W
2 3 4 5 100 nan nan nan
4 8 6 3 200 nan nan nan
df2
A B C D X Y Z W
1 2 3 4 nan 50 nan nan
df3
A B C D X Y Z W
1 2 3 4 nan nan 1000 nan
4 8 6 3 nan nan 2000 nan
df4
A B C D X Y Z W
2 3 4 5 nan nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 nan nan nan 45
Output df
A B C D X Y Z W
1 2 3 4 nan 50 1000 nan
2 3 4 5 100 nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 200 nan 2000 45
What is the most efficient and fastest way to achieve this? I tried using 4 separate combine_first statements, but that doesn't seem to be the most efficient way.
Can this be done with just 1 line of code instead?
Any help will be appreciated. Many thanks in advance.
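A minimal sketch of one approach (my suggestion, assuming A, B, C and D together uniquely identify a row): set them as the index so pandas aligns rows automatically, then fold combine_first over the four frames with functools.reduce.
from functools import reduce

keys = ['A', 'B', 'C', 'D']
# each combine_first keeps existing values and only fills cells that are still NaN
out = reduce(lambda acc, d: acc.combine_first(d.set_index(keys)),
             [df1, df2, df3, df4],
             df.set_index(keys)).reset_index()
One caveat: combine_first returns the union of the indexes in sorted order, so the original row order of df may not be preserved; reindexing against df's keys would restore it.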
I have a dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43
First back fill the missing values, then select the first row with a double [] to get a one-row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[[-1]]
print (df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0
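For reference, a reproducible setup for the frame above (my own reconstruction; the MultiIndex column labels are assumed):
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['2017', '2018'], ['A', 'B', 'C']])
df = pd.DataFrame(np.hstack([np.diag([12, 23, 45]), np.diag([98, 65, 43])]),
                  columns=cols)
df = df.mask(df == 0)  # the off-diagonal cells are NaN in the question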
One could also sum down each column:
import pandas as pd
import numpy as np

# Create the DataFrame:
tmp = np.hstack((np.diag([12., 23., 45.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)

# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
0 1 2 3 4 5
0 12.0 23.0 45.0 98.0 65.0 43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
One loses the column names in this approach (the frame here was built without them).
All-NaN columns will return 0 in the result.
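A variant worth noting (my addition, not part of the original answer): summing the original frame directly keeps its column labels, and passing min_count=1 makes all-NaN columns come out as NaN instead of 0.
# Series.to_frame().T turns the column sums back into a one-row DataFrame
df2 = df.sum(axis=0, min_count=1).to_frame().T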
Just curious about the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 and leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other=0), to replace values that don't meet the condition with 0:
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
If you check the docs for DataFrame.where, you'll see it replaces values where the condition is False - by default with NaN, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax, called boolean indexing, is for filtering rows - it removes the rows that don't match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
If you use loc, it is also possible to filter rows by a condition and select columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (with NaN by default).
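A related method worth knowing (my addition, not from the answers above): DataFrame.mask is the inverse of where - it replaces the values where the condition is True.
# equivalent to df.where(df['Goals'] > 10): blank out rows with Goals <= 10
df2 = df.mask(df['Goals'] <= 10)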
import numpy as np
import pandas as pd

NaN = np.nan
df = pd.DataFrame({'a': ['y', NaN, 'y', NaN, NaN, 'x', 'x', 'y', NaN],
                   'b': [NaN, 'x', NaN, 'y', 'x', NaN, NaN, NaN, 'y'],
                   'd': [1, 0, 0, 1, 1, 1, 0, 1, 0]})
I'm trying to summarize this dataframe using sum. I thought df.groupby(['a','b']).aggregate(sum) would work, but it returns an empty result.
How can I achieve this result?
a b
x 1 1
y 2 1
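The groupby returns nothing because pandas drops rows where any grouping key is NaN, and no row here has both a and b filled, so every row is dropped. One way around this is to melt a and b into a single column before aggregating: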
import numpy as np
import pandas as pd

NaN = np.nan
df = pd.DataFrame(
    {'a': ['y', NaN, 'y', NaN, NaN, 'x', 'x', 'y', NaN],
     'b': [NaN, 'x', NaN, 'y', 'x', NaN, NaN, NaN, 'y'],
     'd': [32, 12, 55, 98, 23, 11, 9, 91, 3]})

melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b'])
result = pd.pivot_table(melted, values='d', index=['value'],
                        columns=['variable'], aggfunc=np.median)
print(result)
yields
variable a b
value
x 10.0 17.5
y 55.0 50.5
Explanation:
Melting the DataFrame with melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b']) produces
d variable value
0 32 a y
1 12 a NaN
2 55 a y
3 98 a NaN
4 23 a NaN
5 11 a x
6 9 a x
7 91 a y
8 3 a NaN
9 32 b NaN
10 12 b x
11 55 b NaN
12 98 b y
13 23 b x
14 11 b NaN
15 9 b NaN
16 91 b NaN
17 3 b y
and now we can use pd.pivot_table to pivot and aggregate the d values:
result = pd.pivot_table(melted, values='d', index=['value'],
                        columns=['variable'], aggfunc=np.median)
Note that the aggfunc can take a list of functions, such as [np.sum, np.median, np.min, np.max, np.std] if you wish to summarize the data in more than one way.
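For instance, a quick sketch of that multi-function form (my example, reusing melted from above):
# one block of columns per aggregation function
result = pd.pivot_table(melted, values='d', index=['value'],
                        columns=['variable'],
                        aggfunc=[np.sum, np.median, np.min, np.max])
print(result)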