Pandas pivot table ValueError: Index contains duplicate entries, cannot reshape - python

I have a dataframe as shown below (top 3 rows):
Sample_Name Sample_ID Sample_Type IS Component_Name IS_Name Component_Group_Name Outlier_Reasons Actual_Concentration Area Height Retention_Time Width_at_50_pct Used Calculated_Concentration Accuracy
Index
1 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown True GluCer(d18:1/12:0)_LCB_264.3 NaN NaN NaN 0.1 2.733532e+06 5.963840e+05 2.963911 0.068676 True NaN NaN
2 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown True GluCer(d18:1/17:0)_LCB_264.3 NaN NaN NaN 0.1 2.945190e+06 5.597470e+05 2.745026 0.068086 True NaN NaN
3 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown False GluCer(d18:1/16:0)_LCB_264.3 GluCer(d18:1/17:0)_LCB_264.3 NaN NaN NaN 3.993535e+06 8.912731e+05 2.791991 0.059864 True 125.927659773487 NaN
When trying to generate a pivot table:
pivoted_report_conc = raw_report.pivot(index = "Sample_Name", columns = 'Component_Name', values = "Calculated_Concentration")
I get the following error:
ValueError: Index contains duplicate entries, cannot reshape
I tried resetting the index but it did not help. I couldn't find any duplicate values in the "Index" column. Could someone please help identify the problem here?
The expected output would be a reshaped dataframe with only the unique component names as columns and respective concentrations for each sample name:
Sample_Name GluCer(d18:1/12:0)_LCB_264.3 GluCer(d18:1/17:0)_LCB_264.3 GluCer(d18:1/16:0)_LCB_264.3
20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN NaN 125.927659773487
To clarify, I am not looking to aggregate the data, just reshape it.

You can use groupby() and unstack() to get around the error you're seeing with pivot().
Here's some example data, with a few edge cases added, and some column values removed or substituted to make an MCVE:
# df
Sample_Name Sample_ID IS Component_Name Calculated_Concentration Outlier_Reasons
Index
1 foo NaN True x NaN NaN
1 foo NaN True y NaN NaN
2 foo NaN False z 125.92766 NaN
2 bar NaN False x 1.00 NaN
2 bar NaN False y 2.00 NaN
2 bar NaN False z NaN NaN
(df.groupby(['Sample_Name','Component_Name'])
.Calculated_Concentration
.first()
.unstack()
)
Output:
Component_Name x y z
Sample_Name
bar 1.0 2.0 NaN
foo NaN NaN 125.92766
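
As for why pivot() fails: it requires each (Sample_Name, Component_Name) pair to be unique. A minimal sketch to surface the offending rows, assuming your frame is named raw_report as in the question:
dupes = raw_report[raw_report.duplicated(subset=['Sample_Name', 'Component_Name'], keep=False)]
print(dupes)  # every row whose (Sample_Name, Component_Name) pair occurs more than once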

You should be able to accomplish what you are looking to do by using the pandas.pivot_table() functionality, as documented in the pandas API reference.
With your dataframe stored as df, use the following code:
import pandas as pd
df = pd.read_table('table_from_which_to_read')
new_df = pd.pivot_table(df, index=['Sample_Name'], columns='Component_Name', values='Calculated_Concentration')
If you want something other than the mean of the concentration value, you will need to change the aggfunc parameter.
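For instance, a hedged sketch that keeps the first value per pair instead of averaging (assuming duplicates should not be combined):
new_df = pd.pivot_table(df, index='Sample_Name', columns='Component_Name',
                        values='Calculated_Concentration', aggfunc='first')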
EDIT
Since you don't want to aggregate over the values, you can reshape the data by using the set_index function on your DataFrame, documented in the pandas API reference.
import pandas as pd
df = pd.DataFrame({'NonUniqueLabel': ['Item1', 'Item1', 'Item1', 'Item2'],
                   'SemiUniqueValue': ['X', 'Y', 'Z', 'X'],
                   'Value': [1.0, 100, 5, None]})
new_df = df.set_index(['NonUniqueLabel', 'SemiUniqueValue'])
The resulting table should look like what you expect the results to be and will have a multi-index.
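If you would rather end up with the semi-unique values as columns instead of a multi-index, a minimal sketch (continuing from the example above) is to unstack the inner level:
wide = new_df['Value'].unstack('SemiUniqueValue')
# columns are now X, Y, Z, with one row per NonUniqueLabel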

Related

Python - Pandas - DROPNA(subset) deleting value for no apparent reasons?

I'm cleaning some data and I've been struggling with one thing.
I have a dataframe with 7740 rows and 68 columns.
Most of the columns contain NaN values.
What I'm interested in is removing rows where the value is NaN in both of these two columns: SERIAL_ID, NUMBER_ID
Example:
SERIAL_ID   NUMBER_ID
8RY68U4R    NaN
8756ERT5    8759321
NaN         NaN
NaN         7896521
7EY68U4R    NaN
95856ERT5   988888
NaN         NaN
NaN         4555555
Results:
SERIAL_ID   NUMBER_ID
8RY68U4R    NaN
8756ERT5    8759321
NaN         7896521
7EY68U4R    NaN
95856ERT5   988888
NaN         4555555
That is, rows are removed only when NaN appears in both columns.
I've used the followings to do so :
df.dropna(subset=['SERIAL_ID', 'NUMBER_ID'], how='all', inplace=True)
When I use this on my dataframe with 68 columns, the result I get is this:
SERIAL_ID   NUMBER_ID
NaN         NaN
NaN         NaN
NaN         NaN
NaN         7896521
NaN         NaN
95856ERT5   NaN
NaN         NaN
NaN         4555555
I tried with a copy of the dataframe with only 3 columns, and it worked fine.
It is somehow working (I can tell because I have an identical ID in another column) but it removes some of the values, and I have no idea why.
Please help, I've been struggling with this the whole day.
Thanks again.

I don't know why it only works for 3 columns and not for the original 68.
However, we can obtain the desired output another way.
Use boolean indexing:
df[df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)]
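This keeps every row where at least one of the two columns is non-null, which is the same condition dropna(subset=..., how='all') tests. A minimal sketch of assigning the result back:
mask = df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)
df = df[mask]  # equivalent to df.dropna(subset=['SERIAL_ID', 'NUMBER_ID'], how='all')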
You can use boolean logic, or simply do something like this for the given columns:
import numpy as np
import pandas as pd

# sample dataframe
d = {'SERIAL_ID': ['8RY68U4R', '8756ERT5', np.nan, np.nan],
     'NUMBER_ID': [np.nan, 8759321, np.nan, 7896521]}
df = pd.DataFrame(d)

# flag rows where both columns are null (multiplying boolean Series acts as AND)
df['nans'] = df['NUMBER_ID'].isnull() * df['SERIAL_ID'].isnull()

# keep only rows where at least one ID is present
df_filtered = df[df['nans'] == False]
print(df_filtered)
which returns this:
SERIAL_ID NUMBER_ID nans
0 8RY68U4R NaN False
1 8756ERT5 8759321.0 False
3 NaN 7896521.0 False

What happens when I pass a boolean dataframe to the indexing operator for another dataframe in pandas?

There's something fundamental about manipulating pandas dataframes which I am not getting.
TL;DR: passing a boolean series to the indexing operator [] of a pandas dataframe returns the rows or columns of that df where the series was True. But passing a boolean dataframe (i.e., multidimensional) returns a weird dataframe consisting only of NaN values.
Edit: to rephrase: why is it possible to pass a dataframe of boolean values to another dataframe, and what does it do? With a series, this makes sense, but with a dataframe, I don't understand what's happening 'under the hood', and why in my example I get a dataframe of NaN values.
In detail with examples:
When I pass a pandas boolean Series to the indexing operator, it returns the rows corresponding to indices where the Series is True:
import pandas as pd
test_list = [[1,2,3,4],[3,4,5],[4,5]]
test_df = pd.DataFrame(test_list)
test_df
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df[test_df[2].isnull()]
0 1 2 3
2 4 5 NaN NaN
So far, so good. But what happens when I do this:
test_df[test_df.isnull()]
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
Why does this return a dataframe consisting of only NaN values? I would expect it to either return an error, or perhaps to return a new dataframe truncated using the boolean mask dataframe. But I find this output completely confusing.
Edit: As an outcome I would expect to get an error. I don't understand why it's possible to pass a dataframe under these circumstances, or why it returns this dataframe of NaN values.
test_df[..] calls an indexing method __getitem__(). From the source code:
def __getitem__(self, key):
...
# Do we have a (boolean) DataFrame?
if isinstance(key, DataFrame):
return self.where(key)
# Do we have a (boolean) 1d indexer?
if com.is_bool_indexer(key):
return self._getitem_bool_array(key)
As you can see, if the key is a boolean DataFrame, it will call pandas.DataFrame.where(). The function of where() is to replace values where the condition is False with NaN by default.
# print(test_df.isnull())
0 1 2 3
0 False False False False
1 False False False True
2 False False True True
# print(test_df)
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df.where(test_df.isnull()) replaces non-null values with NaN.
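A quick sketch to confirm the equivalence on the example above:
# indexing with a boolean frame gives the same result as calling where() with it
test_df[test_df.isnull()].equals(test_df.where(test_df.isnull()))  # True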
I believe all values are transformed to NaN because you passed the entire df. In effect, the 'error message' is that all the returned values are NaN (including those that were not NaN), which shows that something went wrong. But surely a more experienced user will be able to answer you in more detail. Also note that most of the time you want to remove or transform these NaN values, not just flag them.
Following my comment above and LoukasPap's answer, here is a way to flag, count, and then remove or transform these NaN values:
First flag NaN values:
test_df.isnull()
You might also be interested in counting your NaN values:
test_df.isnull().sum() # sum NaN by column
test_df.isnull().sum().sum() # get grand total of NaN
You can now drop NaN values by row:
test_df.dropna()
Or by column:
test_df.dropna(axis=1)
Or replace NaN values with the median:
test_df.fillna(test_df.median())

Check whether a dataframe cell contains value that is in another dataframe's cell

I'm trying to do the following:
Given a row in df1, if str(row['code']) appears in any row of df2['code'], then I would like all those rows of df2['lamer_url_1'] and df2['shopee_url_1'] to take the corresponding values from df1.
Then carry on with the next row of df1['code']...
'''
==============
Initial Tables:
df1
code lamer_url_1 shopee_url_1
0 L61B18H089 b a
1 L61S19H014 e d
2 L61S19H015 z y
df2
code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0 L61B18H089-F1424 NaN NaN NaN NaN
1 L61S19H014-S1500 NaN NaN NaN NaN
2 L61B18H089-F1424 NaN NaN NaN NaN
==============
Expected output:
df2
code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0 L61B18H089-F1424 b a NaN NaN
1 L61S19H014-S1500 e d NaN NaN
2 L61B18H089-F1424 b a NaN NaN
'''
I assumed that the common part of "code" in "df2" is the characters before "-". I also assumed that from "df1" we want 'lamer_url_1', 'shopee_url_1' and from "df2" we want 'lamer_url_2', 'shopee_url_2' (correct me in a comment if I am wrong so I can polish the code):
df1.set_index(df1['code'], inplace=True)
df2.set_index(df2['code'].apply(lambda x: x.split('-')[0]), inplace=True)
df2.index.names = ['code_join']

df3 = pd.merge(df2[['code', 'lamer_url_2', 'shopee_url_2']],
               df1[['lamer_url_1', 'shopee_url_1']],
               left_index=True, right_index=True)
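An alternative sketch under the same prefix assumption, starting from the original df1 and df2 and filling df2's columns in place with Series.map instead of a merge:
prefix = df2['code'].str.split('-').str[0]   # the characters before the "-"
lookup = df1.set_index('code')
df2['lamer_url_1'] = prefix.map(lookup['lamer_url_1'])
df2['shopee_url_1'] = prefix.map(lookup['shopee_url_1'])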

When I convert my numpy array to a DataFrame it updates values to NaN

import impyute.imputation.cs as imp
print(Data)
Data = pd.DataFrame(data = imp.em(Data),columns = columns)
print(Data)
When I run the above code, all my values get converted to NaN as shown below. Can someone help me see where I am going wrong?
Before
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 31 5.0 ... 117.50 5.0
1 61 2.0 ... 122.80 3.0
2 116 0.0 ... 137.50 2.5
3 123 0.0 ... 77.58 2.0
4 27 0.0 ... 135.10 3.5
5 77 0.0 ... 84.60 2.5
After
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
Edited
Solution first
Instead of passing columns to pd.DataFrame, just manually assign column names:
data = pd.DataFrame(imp.em(data))
data.columns = columns
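Equivalently, as a one-line sketch, set_axis can relabel the columns in place of the two-step assignment:
data = pd.DataFrame(imp.em(data)).set_axis(columns, axis=1)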
Cause
The error lies in Data = pd.DataFrame(data = imp.em(Data), columns = columns).
imp.em has a decorator @preprocess which converts the input into a numpy.array if it is a pandas.DataFrame.
...
if pd_DataFrame and isinstance(args[0], pd_DataFrame):
args[0] = args[0].as_matrix()
return pd_DataFrame(fn(*args, **kwargs))
It therefore returns a dataframe reconstructed from a matrix, having range(data.shape[1]) as column names.
And as I point out below, when pd.DataFrame is instantiated with mismatching columns on another pd.DataFrame, all the contents become NaN.
You can test this by
import pandas as pd
from impyute.util import preprocess

@preprocess
def test(data):
    return data

data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
columns = data.columns
data = pd.DataFrame(test(data), columns=columns)
size time
0 NaN NaN
1 NaN NaN
2 NaN NaN
When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which of the columns from the original dataframe you want to use.
It does not re-label the dataframe. This is not odd; it is just the way pandas handles reindexing:
By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.
# Make new pseudo dataset
data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
data
size time
0 3 1
1 2 2
2 1 3
#Make new dataset with original `data`
data = pd.DataFrame(data, columns = ["a", "b"])
data
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
There may be some bug in the impyute library. You are using the em function, which is simply a way to fill missing values via the expectation-maximization algorithm. You can try without using that function:
df = pd.DataFrame(data = Data ,columns = columns)
You can raise this issue on the impyute issue tracker after confirming. To confirm, first load the data using the above example and check whether there are null values present by using the df.isnull() method.
Data = pd.DataFrame(data = np.array(imp.em(Data)),columns = columns)
Doing this solved the issue I was facing; I guess the data returned by the em function isn't a plain numpy array.

pandas - pivot_table with non-numeric values? (DataError: No numeric types to aggregate)

I'm trying to do a pivot of a table containing strings as results.
import pandas as pd
df1 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': ["on","off","off","on","on","off","off","on"]})
df1.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])
But I get: DataError: No numeric types to aggregate.
This works as intended when I change result values to numbers:
df2 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': [1,0,0,1,1,0,0,1]})
df2.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])
And I get what I need:
variable1 A B
variable2 a b a b
variable3 x y x y x y
index
0 1 NaN NaN NaN NaN NaN
1 NaN NaN 0 NaN NaN NaN
2 NaN NaN NaN NaN 0 NaN
3 NaN NaN NaN NaN NaN 1
4 NaN 1 NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN 0
6 NaN NaN NaN NaN 0 NaN
7 NaN NaN NaN 1 NaN NaN
I know I can map the strings to numerical values and then reverse the operation, but maybe there is a more elegant solution?
My original reply was based on Pandas 0.14.1, and since then, many things have changed in the pivot_table function (rows --> index, cols --> columns...).
Additionally, it appears that the original lambda trick I posted no longer works on Pandas 0.18. You have to provide a reducing function (even if it is min, max or mean). But even that seemed improper, because we are not reducing the data set, just transforming it... So I looked harder at unstack...
import pandas as pd
df1 = pd.DataFrame({'index' : range(8),
'variable1' : ["A","A","B","B","A","B","B","A"],
'variable2' : ["a","b","a","b","a","b","a","b"],
'variable3' : ["x","x","x","y","y","y","x","y"],
'result': ["on","off","off","on","on","off","off","on"]})
# these are the columns to end up in the multi-index columns.
unstack_cols = ['variable1', 'variable2', 'variable3']
First, set an index on the data using the index + the columns you want to stack, then call unstack using the level arg.
df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols)
The resulting dataframe has the same shape as the numeric output above, with the on/off strings in place of 1/0.
I think the best compromise is to replace on/off with True/False, which will enable pandas to "understand" the data better and act in an intelligent, expected way.
df2 = df1.replace({'on': True, 'off': False})
You essentially conceded this in your question. My answer is, I don't think there's a better way, and you should replace 'on'/'off' anyway for whatever comes next.
As Andy Hayden points out in the comments, you'll get better performance if you replace on/off with 1/0.
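A minimal sketch of that combination on the example data, using the modern keyword names (index/columns rather than the old rows/cols):
df2 = df1.replace({'on': 1, 'off': 0})
pivoted = df2.pivot_table(values='result', index='index',
                          columns=['variable1', 'variable2', 'variable3'])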
