I have the following Pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4], 'type': ['a,b,c,d', 'b,d', 'c,e', np.nan]})
I need to split the type column on the comma delimiter and pivot the values into multiple columns, one per distinct value.
I looked at Pandas documentation for pivot() and also searched stackoverflow. I did not find anything that seems to achieve (directly or indirectly) what I need to do here. Any suggestions?
Edited:
enke's solution works using Pandas 1.3.5. However it does not work using the latest version 1.4.1. Here is the screenshot:
You could use str.get_dummies to get the dummy variables; then join back to df:
out = df[['id']].join(df['type'].str.get_dummies(sep=',').add_prefix('type_').replace(0, float('nan')))
Output:
id type_a type_b type_c type_d type_e
0 1 1.0 1.0 1.0 1.0 NaN
1 2 NaN 1.0 NaN 1.0 NaN
2 3 NaN NaN 1.0 NaN 1.0
3 4 NaN NaN NaN NaN NaN
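If str.get_dummies misbehaves on a particular pandas version, one alternative is to split, explode and cross-tabulate. This is only a sketch and not a confirmed fix for the 1.4.1 problem mentioned in the question; the names long and wide are made up for illustration, and explode(ignore_index=True) assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'type': ['a,b,c,d', 'b,d', 'c,e', np.nan]})

# One row per (id, value) pair; the NaN row for id 4 stays as a single NaN row.
long = (df.assign(type=df['type'].str.split(','))
          .explode('type', ignore_index=True))

# crosstab ignores NaN values, so reindex brings id 4 back as an all-zero row.
wide = (pd.crosstab(long['id'], long['type'])
          .reindex(df['id'], fill_value=0)
          .astype(float)
          .replace(0, np.nan)
          .add_prefix('type_')
          .rename_axis(columns=None)
          .reset_index())
print(wide)
```

The result has the same layout as the get_dummies output above: an id column followed by one type_* column per distinct value.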
Related
I'm cleaning some data and I've been struggling with one thing.
I have a dataframe with 7740 rows and 68 columns.
Most of the columns contain NaN values.
What I'm interested in is removing the rows where both of these columns are NaN: [SERIAL_ID], [NUMBER_ID]
Example :
SERIAL_ID NUMBER_ID
8RY68U4R NaN
8756ERT5 8759321
NaN NaN
NaN 7896521
7EY68U4R NaN
95856ERT5 988888
NaN NaN
NaN 4555555
Results
SERIAL_ID NUMBER_ID
8RY68U4R NaN
8756ERT5 8759321
NaN 7896521
7EY68U4R NaN
95856ERT5 988888
NaN 4555555
That is, a row is removed only when both columns are NaN.
I used the following to do so:
df.dropna(subset=['SERIAL_ID', 'NUMBER_ID'], how='all', inplace=True)
When I use this on my dataframe with 68 columns the result I get is this one :
SERIAL_ID NUMBER_ID
NaN NaN
NaN NaN
NaN NaN
NaN 7896521
NaN NaN
95856ERT5 NaN
NaN NaN
NaN 4555555
I tried with a copy of the dataframe that has only 3 columns, and it works fine.
On the full dataframe it is somehow working (I can tell because I have an identical ID in another column), but it removes some of the values, and I have no idea why.
Please help, I've been struggling with this the whole day.
Thanks again.
I don't know why it works for 3 columns but not for the original 68.
However, we can obtain the desired output another way.
Use boolean indexing:
df[df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)]
You can use boolean logic, or simply do something like this for any given pair of columns:
import numpy as np
import pandas as pd
# sample dataframe
d = {'SERIAL_ID':['8RY68U4R', '8756ERT5', np.nan, np.nan],
'NUMBER_ID':[np.nan, 8759321, np.nan ,7896521]}
df = pd.DataFrame(d)
# flag rows where both ID columns are missing
df['nans'] = df['NUMBER_ID'].isnull() & df['SERIAL_ID'].isnull()
# filter rows: keep those where at least one ID is present
df_filtered = df[~df['nans']]
print(df_filtered)
which returns this:
SERIAL_ID NUMBER_ID nans
0 8RY68U4R NaN False
1 8756ERT5 8759321.0 False
3 NaN 7896521.0 False
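The helper column is optional. The sketch below uses the same sample data to show that the boolean mask and dropna(subset=..., how='all') select the same rows. It also counts the values pandas actually treats as missing, on the assumption (not confirmed by the question) that the 68-column frame may contain placeholder strings such as 'NaN' instead of real missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SERIAL_ID': ['8RY68U4R', '8756ERT5', np.nan, np.nan],
    'NUMBER_ID': [np.nan, 8759321, np.nan, 7896521],
})
cols = ['SERIAL_ID', 'NUMBER_ID']

# Keep rows where at least one of the two ID columns is present.
kept_mask = df[df[cols].notna().any(axis=1)]

# Same selection expressed with dropna; returns a new frame.
kept_dropna = df.dropna(subset=cols, how='all')

# Diagnostic: how many values per column are truly missing.
missing_counts = df[cols].isna().sum()
print(kept_mask)
print(missing_counts)
```

If the counts are lower than expected on the real data, the "NaN" entries are probably ordinary strings and need converting before either approach will drop them.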
Given the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[np.nan,1,2],'b':[np.nan,np.nan,4]})
a b
0 NaN NaN
1 1.0 NaN
2 2.0 4.0
How do I return rows where both columns 'a' and 'b' are null without having to use pd.isnull for each column?
Desired result:
a b
0 NaN NaN
I know this works (but it's not how I want to do it):
df.loc[pd.isnull(df['a']) & pd.isnull(df['b'])]
I tried this:
df.loc[pd.isnull(df[['a', 'b']])]
...but got the following error:
ValueError: Cannot index with multidimensional key
Thanks in advance!
You are close:
df[pd.isnull(df[['a', 'b']]).all(axis=1)]
Or
df[df[['a','b']].isna().all(axis=1)]
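How about the following? Note that it returns the complement, i.e. every row except those where both columns are null: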
df.dropna(subset=['a','b'], how='all')
With your shown samples, please try the following, which uses the isnull function.
mask1 = df['a'].isnull()
mask2 = df['b'].isnull()
df[mask1 & mask2]
The answer above creates two variables for readability. If you would rather put the conditions inside the indexer and skip the intermediate variables (mask1 and mask2 here), try the following.
df[df['a'].isnull() & df['b'].isnull()]
Output will be as follows.
a b
0 NaN NaN
You can use dropna() with the parameter how='all':
df.dropna(how='all')
Output:
a b
1 1.0 NaN
2 2.0 4.0
Since the question was updated, you can then create masking either using df.isnull() or using df.isna() and filter accordingly.
df[df.isna().all(axis=1)]
a b
0 NaN NaN
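To make the difference between the answers above concrete, here is a small sketch on the question's data. The variable names cols, both_null and rest are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, 2], 'b': [np.nan, np.nan, 4]})
cols = ['a', 'b']

# Rows where every listed column is null (what the question asks for).
both_null = df[df[cols].isna().all(axis=1)]

# dropna(how='all') removes exactly those rows, so it yields the complement.
rest = df.dropna(subset=cols, how='all')

print(both_null)
print(rest)
```

Together the two results partition the original frame, which is why dropna alone cannot return the all-null rows.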
I have the following dataframe:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, 5, np.nan],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
I want to do a ffill() on column B with df["B"].ffill(inplace=True) which results in the following df:
A B C D
0 NaN 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 NaN 4.0 5.0 NaN
3 NaN 3.0 NaN 4.0
Now I want to replace all NaN values with their corresponding value from column B. The documentation states that you can give fillna() a Series, so I tried df.fillna(df["B"], inplace=True). This results in the exact same dataframe as above.
However, if I put in a simple value (e.g. df.fillna(0, inplace=True)), then it does work:
A B C D
0 0.0 2.0 0.0 0.0
1 3.0 4.0 0.0 1.0
2 0.0 4.0 5.0 0.0
3 0.0 3.0 0.0 4.0
The funny thing is that the fillna() does seem to work with a Series as value parameter when operated on another Series object. For example, df["A"].fillna(df["B"], inplace=True) results in:
A B C D
0 2.0 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 4.0 4.0 5.0 NaN
3 3.0 3.0 NaN 4.0
My real dataframe has a lot of columns and I would hate to manually fillna() all of them. Am I overlooking something here? Didn't I understand the docs correctly perhaps?
EDIT I have clarified my example in such a way that 'ffill' with axis=1 does not work for me. In reality, my dataframe has many, many columns (hundreds) and I am looking for a way to not have to explicitly mention all the columns.
Try changing the axis to 1 (columns):
df = df.ffill(axis=1).bfill(axis=1)
If you need to specify the columns, you can do something like this:
df[["B","C"]] = df[["B","C"]].ffill(axis=1)
EDIT:
Since you need something more general and df.fillna(df.B, axis = 1) is not implemented yet, you can try with:
df = df.T.fillna(df.B).T
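Or in place (this relies on df.T sharing memory with df, which is not guaranteed in every pandas version, so the assignment form above is the safer one):
df.T.fillna(df.B, inplace=True)
This works because the index of df.B coincides with the columns of df.T, so pandas knows which value to use for each column. From the docs: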
value: scalar, dict, Series, or DataFrame.
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
So, for example, the NaN in column 0 at row A (in df.T) will be replaced with the value at index 0 in df.B.
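An alternative that avoids transposing is to fill column by column with apply. This is a sketch of a different technique from the one in the answer above: each column is passed to the lambda as a Series, and Series.fillna aligns the filler on the row index. The name filled is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Forward-fill column B first, as in the question.
df['B'] = df['B'].ffill()

# Fill the gaps in every column with the value of B from the same row.
filled = df.apply(lambda col: col.fillna(df['B']))
print(filled)
```

No column names are listed explicitly, so the same line should work for a frame with hundreds of columns.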
I have a dataframe as shown below (top 3 rows):
Sample_Name Sample_ID Sample_Type IS Component_Name IS_Name Component_Group_Name Outlier_Reasons Actual_Concentration Area Height Retention_Time Width_at_50_pct Used Calculated_Concentration Accuracy
Index
1 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown True GluCer(d18:1/12:0)_LCB_264.3 NaN NaN NaN 0.1 2.733532e+06 5.963840e+05 2.963911 0.068676 True NaN NaN
2 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown True GluCer(d18:1/17:0)_LCB_264.3 NaN NaN NaN 0.1 2.945190e+06 5.597470e+05 2.745026 0.068086 True NaN NaN
3 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown False GluCer(d18:1/16:0)_LCB_264.3 GluCer(d18:1/17:0)_LCB_264.3 NaN NaN NaN 3.993535e+06 8.912731e+05 2.791991 0.059864 True 125.927659773487 NaN
When trying to generate a pivot table:
pivoted_report_conc = raw_report.pivot(index = "Sample_Name", columns = 'Component_Name', values = "Calculated_Concentration")
I get the following error:
ValueError: Index contains duplicate entries, cannot reshape
I tried resetting the index but it did not help. I couldn't find any duplicate values in the "Index" column. Could someone please help identify the problem here?
The expected output would be a reshaped dataframe with only the unique component names as columns and respective concentrations for each sample name:
Sample_Name GluCer(d18:1/12:0)_LCB_264.3 GluCer(d18:1/17:0)_LCB_264.3 GluCer(d18:1/16:0)_LCB_264.3
20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN NaN 125.927659773487
To clarify, I am not looking to aggregate the data, just reshape it.
You can use groupby() and unstack() to get around the error you're seeing with pivot().
Here's some example data, with a few edge cases added, and some column values removed or substituted for MCVE:
# df
Sample_Name Sample_ID IS Component_Name Calculated_Concentration Outlier_Reasons
Index
1 foo NaN True x NaN NaN
1 foo NaN True y NaN NaN
2 foo NaN False z 125.92766 NaN
2 bar NaN False x 1.00 NaN
2 bar NaN False y 2.00 NaN
2 bar NaN False z NaN NaN
(df.groupby(['Sample_Name','Component_Name'])
.Calculated_Concentration
.first()
.unstack()
)
Output:
Component_Name x y z
Sample_Name
bar 1.0 2.0 NaN
foo NaN NaN 125.92766
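pivot() raises "Index contains duplicate entries" when the same (index, columns) pair occurs more than once, regardless of whether the frame's own index is unique. The sketch below uses made-up data (not the asker's report) to list the offending rows with duplicated() and then reshape with the groupby/first/unstack approach shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ('foo', 'x') appears twice, which is what breaks pivot().
raw_report = pd.DataFrame({
    'Sample_Name': ['foo', 'foo', 'foo', 'bar'],
    'Component_Name': ['x', 'x', 'z', 'x'],
    'Calculated_Concentration': [np.nan, 1.5, 125.92766, 1.0],
})
key = ['Sample_Name', 'Component_Name']

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = raw_report[raw_report.duplicated(subset=key, keep=False)]
print(dupes)

# first() takes the first non-null value in each group, then unstack reshapes.
wide = (raw_report.groupby(key)['Calculated_Concentration']
        .first()
        .unstack())
print(wide)
```

Inspecting dupes first shows whether taking the first non-null value is acceptable or whether the duplicates point to a data problem.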
You should be able to accomplish what you are looking to do by using the pandas.pivot_table() functionality as documented here.
With your dataframe stored as df use the following code:
import pandas as pd
df = pd.read_table('table_from_which_to_read')
new_df = pd.pivot_table(df, index=['Sample_Name'], columns='Component_Name', values='Calculated_Concentration')
If you want something other than the mean of the concentration value, you will need to change the aggfunc parameter.
EDIT
Since you don't want to aggregate over the values, you can reshape the data by using the set_index function on your DataFrame with documentation found here.
import pandas as pd
df = pd.DataFrame({'NonUniqueLabel':['Item1','Item1','Item1','Item2'],
'SemiUniqueLabel':['X','Y','Z','X'], 'Value':[1.0,100,5,None]})
new_df = df.set_index(['NonUniqueLabel','SemiUniqueLabel'])
The resulting table should look like what you expect the results to be and will have a multi-index.
I'm trying to do a pivot of a table containing strings as results.
import pandas as pd
df1 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': ["on","off","off","on","on","off","off","on"]})
df1.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])
But I get: DataError: No numeric types to aggregate.
This works as intended when I change result values to numbers:
df2 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': [1,0,0,1,1,0,0,1]})
df2.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])
And I get what I need:
variable1 A B
variable2 a b a b
variable3 x y x y x y
index
0 1 NaN NaN NaN NaN NaN
1 NaN NaN 0 NaN NaN NaN
2 NaN NaN NaN NaN 0 NaN
3 NaN NaN NaN NaN NaN 1
4 NaN 1 NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN 0
6 NaN NaN NaN NaN 0 NaN
7 NaN NaN NaN 1 NaN NaN
I know I can map the strings to numerical values and then reverse the operation, but maybe there is a more elegant solution?
My original reply was based on Pandas 0.14.1, and since then, many things changed in the pivot_table function (rows --> index, cols --> columns... )
Additionally, it appears that the original lambda trick I posted no longer works on Pandas 0.18. You have to provide a reducing function (even if it is min, max or mean). But even that seemed improper - because we are not reducing the data set, just transforming it.... So I looked harder at unstack...
import pandas as pd
df1 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': ["on","off","off","on","on","off","off","on"]})
# these are the columns to end up in the multi-index columns.
unstack_cols = ['variable1', 'variable2', 'variable3']
First, set an index on the data using the index + the columns you want to stack, then call unstack using the level arg.
df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols)
Resulting dataframe is below.
I think the best compromise is to replace on/off with True/False, which will enable pandas to "understand" the data better and act in an intelligent, expected way.
df2 = df1.replace({'on': True, 'off': False})
You essentially conceded this in your question. My answer is, I don't think there's a better way, and you should replace 'on'/'off' anyway for whatever comes next.
As Andy Hayden points out in the comments, you'll get better performance if you replace on/off with 1/0.
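For completeness, the string values can also be pivoted directly by giving pivot_table a reducing function that does not need numbers. The sketch below uses aggfunc='first' with the current keyword names (index, columns). It assumes each combination of index and variables occurs at most once, so 'first' simply passes the single value through.

```python
import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})

unstack_cols = ['variable1', 'variable2', 'variable3']

# 'first' is a reducer that works on strings, unlike the default mean.
wide = df1.pivot_table(values='result', index='index',
                       columns=unstack_cols, aggfunc='first')
print(wide)
```

The columns form a three-level MultiIndex, the same shape as the numeric result shown in the question.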