I have this df:
CODE TMAX
0 000130 NaN
1 000130 NaN
2 000130 32.0
3 000130 32.2
4 000130 NaN
5 158328 NaN
6 158328 8.8
7 158328 NaN
8 158328 NaN
9 158328 9.2
... ... ...
I want to count the number of non nan values and the number of nan values in the 'TMAX' column. But i want to count since the first non NaN value and by code.
Expected result in code 000130: 2 non nan values and 1 NaN values.
Expected result in code 158328: 2 non nan values and 2 NaN values.
Same with the other codes...
How can i do this?
Thanks in advance.
If need CODEs too add GroupBy.cummax and count values by crosstab:
m = df.TMAX.notna()
s = m[m.groupby(df['CODE']).cummax()]
df1 = pd.crosstab(df['CODE'], s).rename(columns={True:'non NaNs',False:'NaNs'})
print (df1)
TMAX NaNs non NaNs
CODE
130 1 2
158328 2 2
If need explicitely filter also column CODE by mask:
m = df.TMAX.notna()
mask = m.groupby(df['CODE']).cummax()
df1 = pd.crosstab(df.loc[mask, 'CODE'], m[mask]).rename(columns={True:'non NaNs',False:'NaNs'})
Use first_valid_index to find the first non-NaN index and filter. Then use isna to create a boolean mask and count the values.
def countNaN(s):
return (
s.loc[s.first_valid_index():]
.isna()
.value_counts()
.rename({True: 'NaN', False: 'notNaN'})
)
df.groupby('CODE').apply(countNaN)
Output
CODE
000130 notNaN 2
NaN 1
158328 notNaN 2
NaN 2
I have a data frame in python pandas as follows:
( the first two columns, mygroup1 & mygroup2 are groupby columns)
df =
**mygroup1 mygroup2 tname #dt #num #vek**
a p alpha may 6 a
b q alpha june 8 b
c r beta may 9 c
d s beta june 11 d
I want to pivot the table (the values in tname column) which should be the following with names of columns joined with tname values taken from the other columns (#dt,#num and #vec)
**mygroup1 mygroup2 alpha#dt alpha#num alpha#vec beta#dt beta#num beta#vec**
a p may 6 a nan nan nan
b q june 8 b nan nan nan
c r nan nan nan may 9 c
d s nan nan nan june 11 d
I am trying to do a pivot using pandas pivot table but not able to get in the below format which I really want. I will appreciate any help.
You can do:
new_df = df.set_index(['mygroup1','mygroup2','tname']).unstack('tname')
new_df.columns = [f'{y}{x}' for x,y in new_df.columns]
new_df = new_df.sort_index(axis=1).reset_index()
Output:
mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek
0 a p may 6.0 a NaN NaN NaN
1 b q june 8.0 b NaN NaN NaN
2 c r NaN NaN NaN may 9.0 c
3 d s NaN NaN NaN june 11.0 d
I have dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43
First back filling missing values and then select first row by double [] for one row DataFrame:
df = df.bfill().iloc[[0]]
#alternative
#df = df.ffill().iloc[-1]]
print (df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0
One could sum along the columns:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 42.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.NaN
df = pd.DataFrame(tmp, )
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
0 1 2 3 4 5
0 12.0 23.0 42.0 98.0 65.0 43.0
This is convenient because Dataframe.sum ignores NaN by default. Couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.
I have the following DataFrame
VOTES CITY
24 A
22 A
20 B
NaN A
NaN A
30 B
NaN C
I need to fill the NaN with mean of values where CITY is 'A' or 'C'
The following code I tried was only updating the first row in VOTES and rest allwere updated to NaN.
train['VOTES'][((train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))]=train['VOTES'].loc[((~train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))].astype(int).mean(axis=0)
The output of 'VOTES' after this all values are updated as 'NaN' except one record which is at index 0. The Mean is calculated correctly though .
Use Series.fillna only for filtered rows with mean of filtered rows:
train['VOTES_EN']=train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)').astype(float)
m= train['CITY'].isin(['A','C'])
mean = train.loc[m,'VOTES_EN'].mean()
train.loc[m,'VOTES_EN']=train.loc[m,'VOTES_EN'].fillna(mean)
train['VOTES_EN'] = train['VOTES_EN'].astype(int)
print (train)
VOTES CITY VOTES_EN
0 24.0 A 24
1 22.0 A 22
2 20.0 B 20
3 NaN A 23
4 NaN A 23
5 30.0 B 30
6 NaN C 23
Can someone please help me understand the steps to convert a Python pandas DataFrame that is in record form (data set A), into one that is pivoted with nested columns (as shown in data set B)?
For this question the underlying schema has the following rules:
Each ProjectID appears once
Each ProjectID is associated to a single PM
Each ProjectID is associated to a single Category
Multiple ProjectIDs can be associated with a single Category
Multiple ProjectIDs can be associated with a single PM
Input Data Set A
df_A = pd.DataFrame({'ProjectID':[1,2,3,4,5,6,7,8],
'PM':['Bob','Jill','Jack','Jack','Jill','Amy','Jill','Jack'],
'Category':['Category A','Category B','Category C','Category B','Category A','Category D','Category B','Category B'],
'Comments':['Justification 1','Justification 2','Justification 3','Justification 4','Justification 5','Justification 6','Justification 7','Justification 8'],
'Score':[10,7,10,5,15,10,0,2]})
Desired Output
Notice above the addition of a nested index across the columns. Also notice that 'Comments' and 'Score' both appear at the same level beneath 'ProjectID'. Finally see how the desired output does NOT aggregate any data, but groups/merges the category data into one row per category value.
I have tried so far:
df_A.set_index(['Category','ProjectID'],append=True).unstack() - This would only work if I first create a nested index of ['Category','ProjectID] and ADD that to the original numerical index created with a standard dataframe, however it repeats each instance of a Category/ProjectID match as its own row (because of the original index).
df_A.groupby() - I wasn't able to use this because it appears to force aggregation of some sort in order to get all of the values of a single category on a single row.
df_A.pivot('Category','ProjectID',values='Comments') - I can perform a pivot to avoid unwanted aggregation and it starts to look similar to my intended output, but can only see the 'Comments' field and also cannot set nested columns this way. I receive an error when trying to set values=['Comments','Score'] in the pivot statement.
I think the answer is somewhere between pivot, unstack, set_index, or groupby, but I don't know how to complete the pivot, and then add the appropriate nested column index.
I'd appreciate any thoughts you all have.
Question updated based on Mr. T's comments. Thank you.
I think this is what you are looking for:
pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2)
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
0 Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
EDIT:
To select rows by category you should get rid of the row index 0 by adding .xs():
In [3]: df_A_transformed = pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2).xs(0)
In [4]: df_A_transformed
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
In [5]: df_A_transformed.loc['Category B']
Out[5]:
PM ProjectID
Amy 6 Comments NaN
Score NaN
Bob 1 Comments NaN
Score NaN
Jack 3 Comments NaN
Score NaN
4 Comments Justification 4
Score 5
8 Comments Justification 8
Score 2
Jill 2 Comments Justification 2
Score 7
5 Comments NaN
Score NaN
7 Comments Justification 7
Score 0
Name: Category B, dtype: object