Pandas Error: Index contains duplicate entries, cannot reshape - python

My question may look like a duplicate, as I found several questions with the same error:
Pandas: grouping a column on a value and creating new column headings
Python/Pandas - ValueError: Index contains duplicate entries, cannot reshape
Pandas pivot produces "ValueError: Index contains duplicate entries, cannot reshape"
I tried all the solutions presented in those posts, but none worked. I believe the error may be caused by my dataset format, which has strings instead of numbers and possibly duplicate entries. Here is an example of my dataset:
protocol_no  activity  description
1586212      walk      twice a day
1586212      drive     5 km
1586212      drive     At least 30 min
1586212      sleep     NaN
1586212      eat       1500 calories
2547852      walk      NaN
2547852      drive     NaN
2547852      eat       3200 calories
2547852      eat       Avoid pasta
2547852      sleep     At least 10 hours
The output I'm trying to achieve is:
protocol_no  walk         drive  sleep              eat
1586212      twice a day  5 km   NaN                1500 calories
2547852      NaN          NaN    At least 10 hours  3200 calories
I tried using pivot and pivot_table with code like this:
df.pivot(index="protocol_no", columns="activity", values="description")
But I'm still getting this error:
ValueError: Index contains duplicate entries, cannot reshape
I have no idea what is going wrong, so any help would be appreciated!
EDIT:
I noticed my data contains duplicate entries, as stated by the error and by users @DYZ and @SeaBean. So I've edited the dataset example and provided the answer that worked for my data as well. Hope it helps someone.

Try using .pivot_table() with aggfunc='first' (or something similar) if you get a duplicate index error when using .pivot():
df.pivot_table(index="protocol_no", columns="activity", values="description", aggfunc='first')
This is a common situation when the column you set as the index has duplicated values. Using aggfunc='first' (or sometimes aggfunc='sum', depending on the situation) will most likely solve the problem.
Result:
activity     drive  eat            sleep              walk
protocol_no
1586212      5 km   1500 calories  NaN                twice a day
2547852      NaN    3200 calories  At least 10 hours  NaN
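For reference, a minimal, self-contained sketch that reproduces the situation on the sample data from the question and applies this fix:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "protocol_no": [1586212] * 5 + [2547852] * 5,
    "activity": ["walk", "drive", "drive", "sleep", "eat",
                 "walk", "drive", "eat", "eat", "sleep"],
    "description": ["twice a day", "5 km", "At least 30 min", np.nan, "1500 calories",
                    np.nan, np.nan, "3200 calories", "Avoid pasta", "At least 10 hours"],
})

# df.pivot(index="protocol_no", columns="activity", values="description") raises
# ValueError here because (1586212, "drive") appears twice, so there is no single
# cell to put each description in. pivot_table with aggfunc='first' keeps the
# first description for each (protocol_no, activity) pair instead.
out = df.pivot_table(index="protocol_no", columns="activity",
                     values="description", aggfunc="first")
print(out)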
Edit
Based on your latest edit with duplicate entries, you can just modify the solution above by changing the aggfunc, as follows:
df.pivot_table(index="protocol_no", columns="activity", values="description", aggfunc=lambda x: ' '.join(x.dropna()))
Here, we change the aggfunc from 'first' to lambda x: ' '.join(x.dropna()). It achieves the same result as your desired output without adding multiple lines of code.
Result:
activity     drive                 eat                          sleep              walk
protocol_no
1586212      5 km At least 30 min  1500 calories                                   twice a day
2547852                            3200 calories Avoid pasta    At least 10 hours

Although SeaBean's answer worked on my data, I took a closer look and noticed it really did contain duplicate entries (as in the example I later edited into my question). To deal with this, the best solution is to join those duplicate entries before pivoting.
1- Before the join, I needed to remove the NaNs from my dataset. Otherwise it would raise another error:
df["description"].fillna("", inplace=True)
2- Then I executed the groupby, joining the duplicate entries:
df = df.groupby(["protocol_no", "activity"], as_index=False).agg({"description": " ".join})
3- Last but not least, I executed the pivot as I had intended to do in my question:
df.pivot(index="protocol_no", columns="activity", values="description")
4- Voilà, the result:
protocol_no  drive                 eat                          sleep              walk
1586212      5 km At least 30 min  1500 calories                                   twice a day
2547852                            3200 calories Avoid pasta    At least 10 hours
5- The info of my resulting dataset, using df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1586212 to 2547852
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   drive   2 non-null      object
 1   eat     2 non-null      object
 2   sleep   2 non-null      object
 3   walk    2 non-null      object
dtypes: object(4)
memory usage: 80.0+ bytes
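For reference, assuming the same sample DataFrame as in the sketch above, the three steps can also be chained into a single pipeline (just a sketch; the step-by-step version is equivalent):
df_wide = (
    df.fillna({"description": ""})                            # step 1: remove NaNs
      .groupby(["protocol_no", "activity"], as_index=False)   # step 2: join duplicate entries
      .agg({"description": " ".join})
      .pivot(index="protocol_no", columns="activity", values="description")  # step 3: pivot
)
print(df_wide)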
Hope it helps someone, and many thanks to SeaBean and DYZ for their insights. :)

Related

How to split a series into multiple columns with a separator python pandas dataframe

I have a CSV with one single column, and I want to get a multi-column DataFrame in Python to work on.
My test.csv is the following (in one single column):
ID;location;level;temperature;season
001;leeds;63;11;autumn
002;sydney;36;11;spring
003;lyon;250;11;autumn
004;edmonton;645;8;autumn
I want to get the information in a Data Frame like this:
ID Location Level Temperature Season
001 Leeds 63 11 Autumn
002 Sydney 36 11 Spring
003 Lyon 250 11 Autumn
004 Edmonton 645 8 Autumn
I've tried:
df = pd.read_csv(r'test.csv', header=0, index_col=0)
df = df.str.split(';', expand=True)
But I got this error: 'DataFrame' object has no attribute 'str'. This was an attempt to use it as with a Series. Is there a similar way to do it with DataFrames?
I would like to know if there is a Pythonic way to do it, or whether I should iterate over the rows.
Is str.split deprecated? I found that it exists for Series, but it seems to be deprecated.
Any guidance is much appreciated.
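A minimal sketch of the usual approach: let read_csv do the splitting by passing sep=';' instead of splitting afterwards (this assumes the file really is semicolon-delimited; the dtype for ID keeps the leading zeros):
import pandas as pd

# read the semicolon-separated file directly into multiple columns
df = pd.read_csv("test.csv", sep=";", dtype={"ID": str})
print(df.head())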

Incremental spend in 6 weeks for two groups using pandas

I have Excel data with the following information:
df.head()
User_id Group Week Spend Purchases Group
170309867 Test 2014-10-13 794.66 2 Test-NonRed
57954586 Test 2014-10-13 55.99 1 Test-Red
118068583 Test 2014-10-13 40.87 1 Test-NonRed
516478393 Test 2014-10-13 17.5 2 Test-NonRed
457873235 Test 2014-10-13 20.44 1 Test-Red
From the above information, I need to calculate the incremental spend in the six weeks for the total Test group (test-Red and test-NonRed) vs. control. I need it in absolute ($) and % terms.
I have tried this in pandas:
df2= df.groupby(by=['Group','Week']).sum().abs().groupby(level=[0]).cumsum()
And I have the following result,
df2.head()
And then I calculated the sum for each group as,
df2.groupby(by=['group2']).sum()
df2.head()
I would like to have them (the incremental spend) as an absolute value, which I tried with abs(), and I also need it as an absolute percentage. Any help would be much appreciated.
The expected result is the incremental spend over the six weeks for the total Test group (test-Red and test-NonRed) vs. Control, in absolute spend and then as a percentage. Something like this:
Group incremental_spend incremental_%
Control 11450175 #%
test-NonRed 50288158 #%
test-Red 12043938 #%
So my real questions:
1. Is the above-mentioned approach the right way to calculate the incremental spend on Spend for column Group over the 6 weeks in column Week?
2. Also, I need all my results as absolute counts and absolute %.
I think there are several problems here which make your question difficult to understand.
Vocabulary
What you describe as "Incremental spend" is just the sum.
What you do in two steps is the sum of the cumulative sum .cumsum().sum(), which is not right.
Also I am not sure whether you need abs, which gives the absolute value (abs(-1) gives 1) and will thus only have an effect if there are negative values in your data.
Unfortunately the sample dataset is not large enough to draw a conclusion.
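A tiny sketch of the difference, using a toy spend series:
import pandas as pd

s = pd.Series([10, 20, 30])    # weekly spend for one group
print(s.sum())                 # 60  -> the total spend over the weeks
print(s.cumsum().sum())        # 10 + 30 + 60 = 100 -> sum of the running totals, not the total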
Dataset
Your dataset has two columns with the identical name Group, which is error-prone.
Missing information
You want to get final values (sums) as a ratio (%), but you do not indicate what the reference value for this ratio is.
Is it the sum of Spend for the control group?
Potential solution
>>> df # Sample dataframe with one entry as 'Control' group
Out[]:
User_id Group Week Spend Purchases Group.1
0 170309867 Test 2014-10-13 794.66 2 Test-NonRed
1 57954586 Test 2014-10-13 55.99 1 Test-Red
2 118068583 Test 2014-10-13 40.87 1 Test-NonRed
3 516478393 Test 2014-10-13 17.50 2 Control
4 457873235 Test 2014-10-13 20.44 1 Test-Red
df2 = pd.DataFrame(df.groupby('Group.1').Spend.sum()) # Get 'Spend' sum for each group
>>> df2
Out[]:
Spend
Group.1
Control 17.50
Test-NonRed 835.53
Test-Red 76.43
control_spend_total = df2.loc['Control'].values # Get total spend for 'Control' group
>>> control_spend_total
Out[]: array([ 17.5])
df2['Spend_%'] = df2.Spend / control_spend_total * 100 # Add 'Spend_%' column
>>> df2
Out[]:
Spend Spend_%
Group.1
Control 17.50 100.000000
Test-NonRed 835.53 4774.457143
Test-Red 76.43 436.742857
Does it look like what you want?

Filling in a pandas column based on existing number of strings

I have a pandas data-frame that looks like this:
ID Hobbby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars NaN
5 Photo Andrew
6 Football NaN
.............. 1303 rows.
The number of names filled in might be larger than 2 as well. I would like to end up with the entire Name column filled, with the rows split equally among the existing names (or +1 when the rows do not divide evenly). I already store the total number of names in a variable; in the above case it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID Hobbby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars Kevin
5 Photo Andrew
6 Football Andrew
I tried: replacing NaN with 0 in column Name using fillna, filtering the column to end up with a dataframe that has only the NaN fields, then using len(df) to get the number of NaNs, and from there creating 2 dataframes each containing half of the df. But I think this approach is completely wrong, as I do not always have 2 names. There could be 2, 3, 4, etc. (this is given by a dictionary).
Any help is highly appreciated.
Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
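A minimal sketch on the sample data, assuming the rows are ordered so that each missing name should take the most recent non-missing one above it:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Hobbby": ["Travel", "Photo", "Travel", "Cars", "Photo", "Football"],
    "Name": ["Kevin", "Andrew", "Kevin", np.nan, "Andrew", np.nan],
})

df["Name"] = df["Name"].ffill()   # ID 4 -> Kevin, ID 6 -> Andrew
print(df)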

Fill DataFrame row values based on another dataframe row's values pandas

DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
DataFrame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 [Min, Max, Days Unused], which are all integers.
I had an iterative solution where I access the DataFrame1 object one row at a time, check for a match with DataFrame2, and once found, append the columns from there to the original DataFrame.
Is there a better way? It is making my computer slow to a crawl as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'Days Unused') to df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'Days Unused']
df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
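A small runnable sketch of that merge with shortened toy frames (the exact column names, e.g. 'Days Unused', are assumptions based on the question):
import pandas as pd

df1 = pd.DataFrame({
    "Device": ["RWCLD", "RWCLD"],
    "MedDescription": ["Acetaminophen (TYLENOL) 325 mg Tab", "Ampicillin Inj (AMPICILLIN) 2 g Each"],
    "Quantity": [54, 13],
})
df2 = pd.DataFrame({
    "MedDescription": ["Acetaminophen (TYLENOL) 325 mg Tab"],
    "Min": [4], "Max": [10], "Days Unused": [3],
})

target_cols = ["MedDescription", "Min", "Max", "Days Unused"]
merged = df1.merge(df2[target_cols], on="MedDescription", how="left")
print(merged)   # Min/Max/Days Unused are NaN for rows with no matching MedDescription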
This sounds like an opportunity to use Pandas' built-in functions for joining datasets; you should be able to join on MedDescription with the desired columns from DataFrame2. The join function in Pandas is very efficient and should far outperform your method of looping through rows.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld,ldAc,on='MedDescription',how='outer')
This is the way I used to join the two DataFrames; it seems to work, although it deleted one of the indexes that contained the devices.

Filtering based on the "rows" data after creating a pivot table in python pandas

I have a set of data that I'm getting from a SQL database and reading into a pandas dataframe. The resulting df is about 250M rows and growing every day. Therefore, I'd like to pivot the table to give me a much, much smaller table to work with (a few thousand rows).
The table looks something like this but much bigger:
data
report_date item_id views category
0 2013-06-01 2 3 a
1 2013-06-01 2 2 b
2 2013-06-01 5 16 a
3 2013-06-01 2 4 c
4 2013-06-01 2 5 d
I'd like to make this much smaller by ignoring the "category" column and just getting a total for views by date and item_id.
I'm doing this:
pivot = data.pivot_table(values=['views'], rows=['report_date','item_id'], aggfunc='sum')
views
report_date item_id
2013-06-01 2 14
2013-06-01 5 16
Now imagine this is much bigger with the data range going for months and thousands of item_id's. I'd like to select the total views for item_id = 2 and report_date between '2013-06-01' and '2013-06-10' or something along those lines.
I've searched for several hours straight, but I can't see how to select and/or filter on the values in my "rows" (i.e. report_date and item_id) section. I can only filter/select data in the "values" section (e.g. views). This question is similar, and at the very end the asker asked the same question I'm asking, but it was never answered. I just wanted to try and draw attention to it.
Filtering and selecting from pivot tables made with python pandas
I appreciated all the help. This site and the community have been absolutely invaluable.
You should be able to slice it like so:
In [11]: pivot.ix[('2013-06-01', 3):('2013-06-01', 6)]
Out[11]:
views
report_date item_id
2013-06-01 5 16
See advanced indexing in the docs.
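On current pandas versions the rows= keyword and the .ix indexer are gone; a hedged sketch of the same idea with index= and boolean masks on the index levels (data is the DataFrame from the question, and report_date is assumed to be stored as strings; compare against pd.Timestamp values otherwise):
import pandas as pd

pivot = data.pivot_table(values="views",
                         index=["report_date", "item_id"],
                         aggfunc="sum")

# total views for item_id == 2 with report_date between the two dates
dates = pivot.index.get_level_values("report_date")
items = pivot.index.get_level_values("item_id")
mask = (dates >= "2013-06-01") & (dates <= "2013-06-10") & (items == 2)
print(pivot[mask]["views"].sum())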
