df = pd.DataFrame(
    {'a': ['y', NaN, 'y', NaN, NaN, 'x', 'x', 'y', NaN],
     'b': [NaN, 'x', NaN, 'y', 'x', NaN, NaN, NaN, 'y'],
     'd': [1, 0, 0, 1, 1, 1, 0, 1, 0]})
I'm trying to summarize this dataframe using sum. I thought df.groupby(['a','b']).aggregate(sum) would work, but it returns an empty result (every row has a NaN in either a or b, and groupby drops rows whose keys contain NaN).
How can I achieve this result?
   a  b
x  1  1
y  2  1
import numpy as np
import pandas as pd
NaN = np.nan
df = pd.DataFrame(
    {'a': ['y', NaN, 'y', NaN, NaN, 'x', 'x', 'y', NaN],
     'b': [NaN, 'x', NaN, 'y', 'x', NaN, NaN, NaN, 'y'],
     'd': [32, 12, 55, 98, 23, 11, 9, 91, 3]})
melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b'])
result = pd.pivot_table(melted, values='d', index=['value'], columns=['variable'],
                        aggfunc=np.median)
print(result)
yields
variable     a     b
value
x         10.0  17.5
y         55.0  50.5
Explanation:
Melting the DataFrame with melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b']) produces
     d variable value
0   32        a     y
1   12        a   NaN
2   55        a     y
3   98        a   NaN
4   23        a   NaN
5   11        a     x
6    9        a     x
7   91        a     y
8    3        a   NaN
9   32        b   NaN
10  12        b     x
11  55        b   NaN
12  98        b     y
13  23        b     x
14  11        b   NaN
15   9        b   NaN
16  91        b   NaN
17   3        b     y
and now we can use pd.pivot_table to pivot and aggregate the d values:
result = pd.pivot_table(melted, values='d', index=['value'], columns=['variable'],
                        aggfunc=np.median)
Note that the aggfunc can take a list of functions, such as [np.sum, np.median, np.min, np.max, np.std] if you wish to summarize the data in more than one way.
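For instance, a minimal sketch reusing the melted frame from above (the particular set of functions is just illustrative):
summary = pd.pivot_table(melted, values='d', index=['value'], columns=['variable'],
                         aggfunc=[np.sum, np.median, np.min, np.max, np.std])
print(summary)  # the columns become a MultiIndex of (function, variable)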
Toy example code
Let's say I have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[11,21,31], "B":[12,22,32], "C":[np.nan,23,33], "D":[np.nan,24,34], "E":[15,25,35]})
Which would return:
>>> df
A B C D E
0 11 12 NaN NaN 15
1 21 22 23.0 24.0 25
2 31 32 33.0 34.0 35
Remove all columns with NaN values
I know how to remove all the columns which have a NaN value in any row, like this:
out1 = df.dropna(axis=1, how="any")
Which returns:
>>> out1
A B E
0 11 12 15
1 21 22 25
2 31 32 35
Expected output
However, what I want is to remove all columns after a NaN value is found. For the toy example code the expected output would be:
A B
0 11 12
1 21 22
2 31 32
Question
How can I remove all columns after a NaN is found within any row in a pandas DataFrame?
What I would do:
- check every element for being null/not null
- take the cumulative sum of every row across the columns
- check any() for every column, across the rows
- use that result as an indexer:
df.loc[:, ~df.isna().cumsum(axis=1).any(axis=0)]
Gives me:
A B
0 11 12
1 21 22
2 31 32
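To make the chain easier to inspect, here is the same logic unrolled step by step (a sketch; the intermediate names are mine):
mask = df.isna()            # True where a value is missing
seen = mask.cumsum(axis=1)  # per row, nonzero from the first NaN onward
bad = seen.any(axis=0)      # per column, True if any row has already seen a NaN
out = df.loc[:, ~bad]       # keep only the columns before the first NaN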
I found the following way to get the expected output:
colFirstNaN = df.isna().any(axis=0).idxmax()  # first column with a NaN in any row
indexColLastValue = df.columns.tolist().index(colFirstNaN) - 1  # column just before it
ColLastValue = df.columns[indexColLastValue]
out2 = df.loc[:, :ColLastValue]
And the output would then be:
>>> out2
A B
0 11 12
1 21 22
2 31 32
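One caveat (my note): idxmax() returns the first column label even when the frame contains no NaN at all, so a guard is needed for that edge case, e.g.:
if not df.isna().any(axis=0).any():  # no NaN anywhere: keep every column
    out2 = df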
I have the dataframe below. I am trying to create lags for var1, var2, var3 by calculating
(var_n / lag2(var_n)) - 1 (where n is 1, 2, 3)
The code below works fine for lag2, but I need to perform the calculation grouped by "grp".
CODE:
lag = [2]
df = pd.concat([df] + [df.groupby('grp')[['var1','var2','var3']].shift(x)
                         .add_prefix('lag'+str(x)) for x in lag], axis=1)
In a different approach I tried the below, but I am not able to apply the group by:
yoy = [12]
columns_y = df.loc[:, 'var1':'var3']
for col in columns_y.columns:
    for x in yoy:
        columns_y.loc[:, col + "_yoy"] = (columns_y[col] / columns_y[col].shift(x)) - 1
Try this:
df = pd.DataFrame({
    'grp': ['a','a','a','b','b','b'],
    'abc2': ['l','m','n','p','q','r'],
    'abc3': ['x','y','z','a','b','c'],
    'var1': [20,30,20,40,50,90],
    'var2': [50,80,70,20,30,40],
    'var3': [50,80,70,20,30,40]})
lag = [2]
lags_df = pd.concat([
    df.groupby('grp')[[f'var{i+1}' for i in range(3)]]
      .shift(x)
      .add_prefix(f'lag{x}_')
    for x in lag
], axis=1)
print(pd.concat([df, lags_df], axis=1))
outputs:
grp abc2 abc3 var1 var2 var3 lag2_var1 lag2_var2 lag2_var3
0 a l x 20 50 50 NaN NaN NaN
1 a m y 30 80 80 NaN NaN NaN
2 a n z 20 70 70 20.0 50.0 50.0
3 b p a 40 20 20 NaN NaN NaN
4 b q b 50 30 30 NaN NaN NaN
5 b r c 90 40 40 40.0 20.0 20.0
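From there, the ratio the question asks for can be computed column by column; a sketch (the _ratio column names and the -1 form are my reading of the question):
out = pd.concat([df, lags_df], axis=1)
for i in range(1, 4):
    out[f'var{i}_ratio'] = out[f'var{i}'] / out[f'lag2_var{i}'] - 1
print(out)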
I have dataframe as follows:
  2017           2018
     A    B    C     A    B    C
0   12  NaN  NaN    98  NaN  NaN
1  NaN   23  NaN   NaN   65  NaN
2  NaN  NaN   45   NaN  NaN   43
I want to convert this dataframe into:
  2017          2018
     A   B   C     A   B   C
0   12  23  45    98  65  43
First back fill the missing values, then select the first row with a double [] to get a one-row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[[-1]]
print (df)
   2017              2018
      A     B     C     A     B     C
0  12.0  23.0  45.0  98.0  65.0  43.0
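If the values were scattered so that no single row became complete after filling, a sketch applying max() to the original frame would also work, under my assumption of at most one non-NaN value per column (max skips NaN and keeps the column labels):
df2 = df.max().to_frame().T  # one row; MultiIndex columns preserved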
One could sum down each column:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 45.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
      0     1     2     3     4     5
0  12.0  23.0  45.0  98.0  65.0  43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.
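On the first point, the column names do survive when the frame actually has them, since sum(axis=0) returns a Series indexed by the columns. A sketch assuming the MultiIndex columns from the question:
cols = pd.MultiIndex.from_product([['2017', '2018'], ['A', 'B', 'C']])
df = pd.DataFrame(tmp, columns=cols)
df2 = df.sum(axis=0).to_frame().T  # one row, MultiIndex columns preserved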
I am trying the code below and getting NaN for all the columns/rows in the output:
import numpy as np
import pandas as pd
data1 = np.array([1,2,4,5,6])
data2 = np.array([11,12,14,15,16])
ser1 = pd.Series(data1)
ser2 = pd.Series(data2)
ser4 = pd.Series(data1)
dataframe = pd.DataFrame([ser1,ser2,ser2],['a','b','c'])
Output is:
0 1 2 3 4
a 1 2 4 5 6
b 11 12 14 15 16
c 11 12 14 15 16
But for the code below, I am getting NaN for all the data in the output:
dataframe = pd.DataFrame([ser1,ser2,ser2,ser4],['a','b','c','d'],['AA','BB','CC','DD','EE'])
AA BB CC DD EE
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
I was expecting the output to be the series data with the column names 'AA','BB','CC','DD','EE' respectively.
I tried to find similar questions on the forum but was unable to find any.
The problem is index alignment: the original column names are 0 to N, created from the index values of the Series, so if you pass different values in the columns list they do not match, and pandas returns NaN for all the data.
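A minimal sketch of the alignment at work (the names here are illustrative):
s = pd.Series([1, 2, 3])                    # index is 0, 1, 2
pd.DataFrame([s], columns=['x', 'y', 'z'])  # 'x','y','z' never match 0,1,2 -> all NaN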
A possible solution is to create the index values of each Series from your new column names:
data1 = np.array([1,2,4,5,6])
data2 = np.array([11,12,14,15,16])
i = ['AA','BB','CC','DD','EE']
ser1 = pd.Series(data1, index=i)
ser2 = pd.Series(data2, index=i)
ser4 = pd.Series(data1, index=i)
dataframe = pd.DataFrame([ser1,ser2,ser2],['a','b','c'])
print (dataframe)
AA BB CC DD EE
a 1 2 4 5 6
b 11 12 14 15 16
c 11 12 14 15 16
You can also give each Series a name, which becomes its row label:
ser1 = pd.Series(data1, index=i, name='a')
ser2 = pd.Series(data2, index=i, name='b')
ser4 = pd.Series(data1, index=i, name='c')
dataframe = pd.DataFrame([ser1,ser2,ser2])
print (dataframe)
AA BB CC DD EE
a 1 2 4 5 6
b 11 12 14 15 16
b 11 12 14 15 16
You can also ignore the index of the series by stacking them into an array with np.vstack; this lets you set your own index and columns:
pd.DataFrame(np.vstack([ser1,ser2,ser2,ser4]),['a','b','c','d'],['AA','BB','CC','DD','EE'])
AA BB CC DD EE
a 1 2 4 5 6
b 11 12 14 15 16
c 11 12 14 15 16
d 1 2 4 5 6
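An equivalent sketch (my variant) uses Series.to_numpy(), which likewise discards the original index of each series:
rows = [s.to_numpy() for s in (ser1, ser2, ser2, ser4)]
pd.DataFrame(rows, index=['a','b','c','d'], columns=['AA','BB','CC','DD','EE'])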
I am trying to calculate the difference in certain rows based on the values from other columns.
Using the example data frame below, I want to calculate the difference in Time based on the values in the Code column. Specifically, I want to loop through and determine the time difference between B and A. So Time in B - Time in A.
I can do this manually using the iloc function, but I was hoping to find a more efficient way, especially if I have to repeat this process numerous times.
import pandas as pd
import numpy as np
k = 5
N = 15
d = {'Time': np.random.randint(k, k + 100, size=N),
     'Code': ['A','x','B','x','A','x','B','x','A','x','B','x','A','x','B']}
df = pd.DataFrame(data=d)
Output:
Code Time
0 A 89
1 x 39
2 B 24
3 x 62
4 A 83
5 x 57
6 B 69
7 x 10
8 A 87
9 x 62
10 B 86
11 x 11
12 A 54
13 x 44
14 B 71
Expected Output:
diff
1 -65
2 -14
3 -1
4 17
First filter by boolean indexing, then subtract with sub, using reset_index to get a default index so the Series a and b align; finally, if you want a one-column DataFrame, add to_frame:
a = df.loc[df['Code'] == 'A', 'Time'].reset_index(drop=True)
b = df.loc[df['Code'] == 'B', 'Time'].reset_index(drop=True)
Similar alternative solution:
a = df.loc[df['Code'] == 'A'].reset_index()['Time']
b = df.loc[df['Code'] == 'B'].reset_index()['Time']
c = b.sub(a).to_frame('diff')
print (c)
diff
0 -65
1 -14
2 -1
3 17
Finally, for a new index starting from 1, add rename:
c = b.sub(a).to_frame('diff').rename(lambda x: x + 1)
print (c)
diff
1 -65
2 -14
3 -1
4 17
Another approach, if you need to compute more differences, is to reshape with unstack:
df = df.set_index(['Code', df.groupby('Code').cumcount() + 1])['Time'].unstack()
print (df)
1 2 3 4 5 6 7
Code
A 89.0 83.0 87.0 54.0 NaN NaN NaN
B 24.0 69.0 86.0 71.0 NaN NaN NaN
x 39.0 62.0 57.0 10.0 62.0 11.0 44.0
# finally, drop the NaN entries
c = df.loc['B'].sub(df.loc['A']).dropna()
print (c)
1 -65.0
2 -14.0
3 -1.0
4 17.0
dtype: float64
# subtracting with NaNs present - fill_value=0 treats missing values as 0, so the result has no NaNs
d = df.loc['x'].sub(df.loc['A'], fill_value=0)
print (d)
1 -50.0
2 -21.0
3 -30.0
4 -44.0
5 62.0
6 11.0
7 44.0
dtype: float64
Assuming your Code is a repeat of 'A', 'x', 'B', 'x', you can just use
>>> (df.Time[df.Code == 'B'].reset_index() - df.Time[df.Code == 'A'].reset_index())[['Time']]
Time
0 -65
1 -14
2 -1
3 17
But note that the original assumption, that 'A' and 'B' values alternate, seems fragile.
If you want the index to run from 1 to 4, as in your question, you can assign the previous result to diff and then use
diff.index += 1
>>> diff
Time
1 -65
2 -14
3 -1
4 17
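Since the alternation assumption is fragile, here is a more order-robust sketch (my variant): pair the i-th 'A' with the i-th 'B' via a per-code counter instead of relying on strict alternation:
t = df[df.Code.isin(['A', 'B'])].copy()
t['pair'] = t.groupby('Code').cumcount() + 1     # i-th A / i-th B
wide = t.pivot(index='pair', columns='Code', values='Time')
diff = (wide['B'] - wide['A']).to_frame('diff')  # index already runs from 1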