Can someone please help me understand the steps to convert a Python pandas DataFrame that is in record form (data set A), into one that is pivoted with nested columns (as shown in data set B)?
For this question the underlying schema has the following rules:
Each ProjectID appears once
Each ProjectID is associated to a single PM
Each ProjectID is associated to a single Category
Multiple ProjectIDs can be associated with a single Category
Multiple ProjectIDs can be associated with a single PM
Input Data Set A
df_A = pd.DataFrame({'ProjectID':[1,2,3,4,5,6,7,8],
'PM':['Bob','Jill','Jack','Jack','Jill','Amy','Jill','Jack'],
'Category':['Category A','Category B','Category C','Category B','Category A','Category D','Category B','Category B'],
'Comments':['Justification 1','Justification 2','Justification 3','Justification 4','Justification 5','Justification 6','Justification 7','Justification 8'],
'Score':[10,7,10,5,15,10,0,2]})
Desired Output
Notice above the addition of a nested index across the columns. Also notice that 'Comments' and 'Score' both appear at the same level beneath 'ProjectID'. Finally see how the desired output does NOT aggregate any data, but groups/merges the category data into one row per category value.
I have tried so far:
df_A.set_index(['Category','ProjectID'], append=True).unstack() - This would only work if I first create a nested index of ['Category','ProjectID'] and ADD that to the original numerical index created with a standard dataframe; however, it repeats each instance of a Category/ProjectID match as its own row (because of the original index).
df_A.groupby() - I wasn't able to use this because it appears to force aggregation of some sort in order to get all of the values of a single category on a single row.
df_A.pivot('Category','ProjectID',values='Comments') - I can perform a pivot to avoid unwanted aggregation, and it starts to look similar to my intended output, but I can only see the 'Comments' field and cannot set nested columns this way. I receive an error when trying to set values=['Comments','Score'] in the pivot statement.
I think the answer is somewhere between pivot, unstack, set_index, or groupby, but I don't know how to complete the pivot, and then add the appropriate nested column index.
I'd appreciate any thoughts you all have.
Question updated based on Mr. T's comments. Thank you.
I think this is what you are looking for:
pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2)
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
0 Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
EDIT:
To select rows by category you should get rid of the row index 0 by adding .xs():
In [3]: df_A_transformed = pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2).xs(0)
In [4]: df_A_transformed
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
In [5]: df_A_transformed.loc['Category B']
Out[5]:
PM ProjectID
Amy 6 Comments NaN
Score NaN
Bob 1 Comments NaN
Score NaN
Jack 3 Comments NaN
Score NaN
4 Comments Justification 4
Score 5
8 Comments Justification 8
Score 2
Jill 2 Comments Justification 2
Score 7
5 Comments NaN
Score NaN
7 Comments Justification 7
Score 0
Name: Category B, dtype: object
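As a possible alternative (a sketch, not part of the answer above), the same nested column layout can be built directly with unstack, assuming the df_A defined in the question; it avoids the extra 0 row level, so no .xs() step is needed:
df_B = (df_A.set_index(['Category', 'PM', 'ProjectID'])
            .unstack(['PM', 'ProjectID'])        # lift PM and ProjectID into the columns
            .reorder_levels([1, 2, 0], axis=1)   # reorder to (PM, ProjectID, Comments/Score)
            .sort_index(axis=1))                 # group the nested columns together
Category stays as a plain row index, so df_B.loc['Category B'] gives the same per-category view as above.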
I'd like to delete some outliers from my dataframe:
  Product    Brand  Year  calcium_100g  phosphorus_100g  iron_100g  magnesium_100g
 Poduct A  Brand A  2020             8               50        NaN             NaN
 Poduct B  Brand A  2021            54               -1        NaN              17
 Poduct C  Brand C  2020           NaN              NaN        NaN             NaN
 Poduct D  Brand C  2018           NaN               50         80             NaN
 Poduct E  Brand E  2019           123               50        NaN              27
The outliers I'd like to delete are values greater than 100 or below 0 in the columns ending with "_100g" (-1 and 123 in this case).
I found a way to select the columns ending with "_100g":
Columns100g = list(data.filter(like='_100g', axis = 1).columns)
But at this point I can't find a way to delete my outliers.
I would suggest using the drop() method, which accepts an index of rows to remove, together with lt(0) and gt(100) to flag the out-of-range values, combining them with | (or) and reducing with any(axis=1), which returns True for every row where at least one of the selected columns satisfies either condition:
# Columns that have '_100g'
c = df.filter(like='_100g').columns

# Rows where any '_100g' value is below 0 or above 100
mask = (df[c].lt(0) | df[c].gt(100)).any(axis=1)

# Drop the rows above / below your threshold
new = df.drop(df[mask].index)
Prints back:
print(new)
Product Brand Year ... phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 ... 50.0 NaN NaN
2 Poduct C Brand C 2020 ... NaN NaN NaN
3 Poduct D Brand C 2018 ... 50.0 80.0 NaN
[3 rows x 7 columns]
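An equivalent way to get the same result, as a sketch reusing the c columns from above, is to keep only the rows where every '_100g' value is either missing or inside the 0-100 range, instead of dropping by index:
# Keep rows whose '_100g' values are all NaN or within [0, 100]
keep = (df[c].isna() | (df[c].ge(0) & df[c].le(100))).all(axis=1)
new = df[keep]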
Using your Columns100g, you can use a for-loop to filter multiple columns:
for col in Columns100g:
    data = data[(data[col].fillna(0) >= 0) & (data[col].fillna(0) <= 100)]
Edit: But if you want to change outlier values to NaN, you can simply do:
data[(data[Columns100g]<0)|(data[Columns100g]>100)] = np.nan
Output:
Product Brand Year calcium_100g phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 8.0 50.0 NaN NaN
1 Poduct B Brand A 2021 54.0 NaN NaN 17.0
2 Poduct C Brand C 2020 NaN NaN NaN NaN
3 Poduct D Brand C 2018 NaN 50.0 80.0 NaN
4 Poduct E Brand E 2019 NaN 50.0 NaN 27.0
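The same replacement can also be written with DataFrame.mask, which swaps in NaN wherever the condition is True; a small sketch, assuming the Columns100g list from the question:
# Set out-of-range '_100g' values to NaN, leave everything else untouched
data[Columns100g] = data[Columns100g].mask(
    data[Columns100g].lt(0) | data[Columns100g].gt(100))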
Pandas has a function named drop that you can use alongside the filter code you wrote.
# Deleting columns
# Delete the "_100" column from the dataframe
data = data.drop("_100", axis=1)
# alternatively, delete columns using the columns parameter of drop
data = data.drop(columns="_100")
# Delete the _100 column from the dataframe in place
# Note that the original 'data' object is changed when inplace=True
data.drop("_100", axis=1, inplace=True).
# Delete multiple columns from the dataframe
data = data.drop(["Y2001", "Y2002", "Y2003"], axis=1)
If you replace the placeholder names I wrote (_100) with the columns from your filter, you are good to go.
You can also read more about the drop function in the link below:
https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
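Putting this together with the filter from the question, a sketch for dropping every matching column in one go could look like:
# Drop all columns whose name contains '_100g'
data = data.drop(columns=data.filter(like='_100g').columns)
Note that this removes whole columns rather than the outlier rows, so it only helps if you want to discard those measurement columns entirely.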
I am trying to filter out groups using pandas. I have tried groupby, but I can't figure out how to filter whole groups out based on criteria from the DataFrame. Below is a print of my dataframe. I want to group the users (1-4), check whether each group has a primary account or not, and then show only the users who do not have one. Does anyone have an idea for this?
So far my code looks like
df=pd.read_csv("accounts_test.csv")
grouped = df.groupby('User')
Dataframe:
User primary account_type
0 1 NaN current_acc
1 1 yes savings
2 1 NaN invest
3 2 NaN current_acc
4 2 NaN invest
5 2 NaN savings
6 3 NaN savings
7 3 yes current_acc
8 3 NaN invest
9 4 NaN savings
10 4 NaN invest
11 4 NaN current_acc
Wanted output after filtering:
User primary account_type
3 2 NaN current_acc
4 2 NaN invest
5 2 NaN savings
9 4 NaN savings
10 4 NaN invest
11 4 NaN current_acc
You can try groupby() + filter():
df.groupby('User').filter(lambda x:x['primary'].ne('yes').all())
OR
use groupby() + transform() as a mask and then pass it to df:
df[df.groupby('User')['primary'].transform(lambda x:x.ne('yes').all())]
Another way using only fast, optimized vectorized operations (without a lambda function):
Use .loc + .groupby() + .transform() on 'all':
df.loc[df['primary'].ne('yes').groupby(df['User']).transform('all')]
or, if your NaN is a real null value (np.nan), you can also use:
df.loc[df['primary'].isna().groupby(df['User']).transform('all')]
Result:
User primary account_type
3 2 NaN current_acc
4 2 NaN invest
5 2 NaN savings
9 4 NaN savings
10 4 NaN invest
11 4 NaN current_acc
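Another option, as a sketch rather than one of the answers above, is to collect the users that do have a primary account and exclude them with isin:
# Users that have at least one primary account
has_primary = df.loc[df['primary'].eq('yes'), 'User'].unique()

# Keep only the rows of users without any primary account
df[~df['User'].isin(has_primary)]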
There are two columns, UserID and Country. In some rows Country has a value, but in other rows it is NaN for the same UserID. I want to map the known Country value onto the NaN rows.
UserID Country
1 India
2 US
3 Uk
1 nan
4 nan
2 nan
4 nan
Output required:
UserID Country
1 India
2 US
3 Uk
1 India
4 nan
2 US
4 nan
I tried doing it this way:
df['Country']=df['UserID'].map(lambda x:df[x])
but I am getting an error for UserID 4.
I tried replacing the country of UserID 4 manually:
df['Country']=np.where(df['UserID']==4,'India',df['Country'])
but I am still getting an error. What went wrong, or is there any other way to approach this?
Try via groupby() and ffill():
df['Country']=df.groupby('UserID')['Country'].ffill()
OR
via groupby() and fillna():
df['Country']=df.groupby('UserID')['Country'].fillna(method='ffill')
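Note that ffill only propagates a value that appears earlier in the frame for a given UserID. If the known country could also come after the NaN rows, a sketch that fills from the first non-null value per user would be:
# Fill each NaN with the first known Country for that UserID;
# users with no known country (e.g. UserID 4) stay NaN
df['Country'] = df['Country'].fillna(
    df.groupby('UserID')['Country'].transform('first'))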
I have this df:
CODE TMAX
0 000130 NaN
1 000130 NaN
2 000130 32.0
3 000130 32.2
4 000130 NaN
5 158328 NaN
6 158328 8.8
7 158328 NaN
8 158328 NaN
9 158328 9.2
... ... ...
I want to count the number of non-NaN values and the number of NaN values in the 'TMAX' column, but I want to count starting from the first non-NaN value, and separately for each code.
Expected result for code 000130: 2 non-NaN values and 1 NaN value.
Expected result for code 158328: 2 non-NaN values and 2 NaN values.
Same for the other codes...
How can I do this?
Thanks in advance.
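For reference, a minimal sketch to rebuild the example frame (assuming CODE is stored as strings, which is what keeps the leading zeros, and TMAX as floats):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CODE': ['000130'] * 5 + ['158328'] * 5,
    'TMAX': [np.nan, np.nan, 32.0, 32.2, np.nan,
             np.nan, 8.8, np.nan, np.nan, 9.2]})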
If you need the counts per CODE, add GroupBy.cummax to mask out everything before the first non-NaN value, then count the values with crosstab:
m = df.TMAX.notna()
s = m[m.groupby(df['CODE']).cummax()]
df1 = pd.crosstab(df['CODE'], s).rename(columns={True:'non NaNs',False:'NaNs'})
print (df1)
TMAX NaNs non NaNs
CODE
130 1 2
158328 2 2
If you need to explicitly filter the CODE column by the mask as well:
m = df.TMAX.notna()
mask = m.groupby(df['CODE']).cummax()
df1 = pd.crosstab(df.loc[mask, 'CODE'], m[mask]).rename(columns={True:'non NaNs',False:'NaNs'})
Use first_valid_index to find the first non-NaN index and filter. Then use isna to create a boolean mask and count the values.
def countNaN(s):
    return (
        s.loc[s.first_valid_index():]
        .isna()
        .value_counts()
        .rename({True: 'NaN', False: 'notNaN'})
    )
df.groupby('CODE')['TMAX'].apply(countNaN)
Output
CODE
000130 notNaN 2
NaN 1
158328 notNaN 2
NaN 2
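If you prefer the same two-column layout as the crosstab above, the grouped result can be reshaped with unstack; a small follow-up sketch:
# One row per CODE, one column per count; missing combinations become 0
df.groupby('CODE')['TMAX'].apply(countNaN).unstack(fill_value=0)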
I am trying to extract from a dataframe the rows that have exactly one non-NaN element, with the rest being NaN.
For example :
A B C
0 NaN NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
For this example dataframe, it should return the first two rows (0 and 1).
I tried this code but it doesn't work:
df_table.isnull(df_table[cols]).all(axis=1)
Thanks!
Use sum instead of all:
df.loc[df.notnull().sum(axis=1)==1]
To get the non-NaN elements, you can use, for example, max:
df.loc[df.notnull().sum(axis=1)==1].max(axis=1)
or
df.loc[df.notnull().sum(axis=1)==1].ffill(axis=1).iloc[:,-1]
which gives:
0 2.0
1 3.0
dtype: float64
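If you also want to know which column the single value came from, a sketch using stack on the filtered rows (stack drops the NaNs by default):
# Boolean mask of rows with exactly one non-NaN value
one_value = df.notnull().sum(axis=1) == 1

# Series indexed by (row, column) holding each row's single value
df.loc[one_value].stack()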