I have the following data frame:
OBJECTID 2017 2018 2019 2020 2021
1.0 NaN NaN 7569.183179 7738.162829 7907.142480
2.0 NaN NaN 766.591146 783.861122 801.131099
3.0 NaN NaN 8492.215747 8686.747704 8881.279662
4.0 NaN NaN 40760.327825 41196.877473 41633.427120
5.0 NaN NaN 6741.819674 6788.981231 6836.142788
I am trying to apply a spline interpolation on each row to get the values for 2017 and 2018 using the following code:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order=1, limit_direction="both", axis=1)
However, I get the following error:
ValueError: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.
The dataframe in this question is just a subset of a much larger dataset I am using. All of the examples I have seen apply the spline interpolation down each column, but I can't seem to get it to work across each row. I feel like it's a simple solution and I'm just missing it. Could someone please help?
It appears to be because the dtype of the index (really the columns, for axis=1) is object in your case, since the columns also contain a string name. Even though you are grabbing a slice of the columns that contains only integer years, the overall index dtype remains the same - object. interpolate then looks at that dtype and punts when it sees object.
Example - even though the years are stored as integers the overall dtype is object:
df.columns
Index(['OBJECTID', 2017, 2018, 2019, 2020, 2021], dtype='object')
If we did this:
df.drop(columns=['OBJECTID'], inplace=True)
df.columns = df.columns.astype('uint64')
df.columns
UInt64Index([2017, 2018, 2019, 2020, 2021], dtype='uint64')
Then the axis=1 interpolation works:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order=1, limit_direction="both", axis=1)
2017 2018 2019 2020 2021
0 7231.223878 7400.203528 7569.183179 7738.162829 7907.142480
1 732.051193 749.321169 766.591146 783.861122 801.131099
2 8103.151832 8297.683789 8492.215747 8686.747704 8881.279662
3 39887.228530 40323.778178 40760.327825 41196.877473 41633.427120
4 6647.496560 6694.658117 6741.819674 6788.981231 6836.142788
Dropping the OBJECTID column here was done only to illustrate what is going on; if you need to keep it, see the sketch below.
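If you need to keep OBJECTID, a sketch of an alternative is to move it into the index before interpolating, so the column labels are purely numeric (this assumes scipy is installed, since method="spline" relies on it):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "OBJECTID": [1.0, 2.0, 3.0, 4.0, 5.0],
    2017: np.nan,
    2018: np.nan,
    2019: [7569.183179, 766.591146, 8492.215747, 40760.327825, 6741.819674],
    2020: [7738.162829, 783.861122, 8686.747704, 41196.877473, 6788.981231],
    2021: [7907.142480, 801.131099, 8881.279662, 41633.427120, 6836.142788],
})

# Move OBJECTID into the index; only year labels remain as columns
df = df.set_index("OBJECTID")
# The columns Index keeps its object dtype, so cast it explicitly
df.columns = df.columns.astype("int64")

years = list(range(2017, 2022))
df[years] = df[years].interpolate(
    method="spline", order=1, limit_direction="both", axis=1
)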
I am trying to fill missing values in a subset of rows. I am using inplace=True in fillna(), but it is not working in a Jupyter notebook: the first two rows of the Surface column still show NaN, and I am not sure why.
I have to write it like this for it to work - why? Thank you for your help.
data.loc[mark,'Surface']=data.loc[mark,'Surface'].fillna(value='TEST')
Here is the code that does not work:
mark=(data['Pad']==51) | (data['Pad']==52) | (data['Pad']==53) | (data['Pad']==54) | (data['Pad']==55)
data.loc[mark,'Surface'].fillna(value='TEST',inplace=True)
This one is working:
data.loc[mark,'Surface']=data.loc[mark,'Surface'].fillna(value='TEST')
The main issue you're bumping into here is that pandas does not have very explicit view-vs-copy rules. Your result indicates to me that .loc is returning a copy instead of a view. While pandas does try to return a view from .loc, there are a decent number of caveats.
After playing around a little, it seems that using a boolean/positional index mask returns a copy - you can verify this with the private _is_view attribute:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Pad": range(40, 60), "Surface": np.nan})
print(df)
Pad Surface
0 40 NaN
1 41 NaN
2 42 NaN
. ... ...
19 59 NaN
# Create masks
bool_mask = df["Pad"].isin(range(51, 56))
positional_mask = np.where(bool_mask)[0]
# Check `_is_view` after simple .loc:
>>> df.loc[bool_mask, "Surface"]._is_view
False
>>> df.loc[positional_mask, "Surface"]._is_view
False
So neither of the approaches above returns a "view" of the original data, which is why performing an inplace operation does not change the original dataframe. In order to return a view from .loc, you will need to use a slice as your row indexer.
>>> df.loc[10:15, "Surface"]._is_view
True
Now this still won't resolve your issue, because the value you're filling NaN with may or may not change the dtype of the "Surface" column. In the example I have set up, "Surface" has a float64 dtype - and by filling in NaN with the value "Test", you are forcing that dtype to change, which is incompatible with the original dataframe. If your "Surface" column is an object dtype, then you don't need to worry about this.
>>> df.dtypes
Pad int64
Surface float64
# this does not work because "Test" is incompatible with float64 dtype
>>> df.loc[10:15, "Surface"].fillna("Test", inplace=True)
# this works because 0.9 is an appropriate value for a float64 dtype
>>> df.loc[10:15, "Surface"].fillna(0.9, inplace=True)
>>> print(df)
Pad Surface
.. ... ...
8 48 NaN
9 49 NaN
10 50 0.9
11 51 0.9
12 52 0.9
13 53 0.9
14 54 0.9
15 55 0.9
16 56 NaN
17 57 NaN
.. ... ...
TLDR; don't rely on inplace in pandas in general. In the bulk of its operations it still creates a copy of the underlying data and then attempts to replace the original source with the new copy. Pandas is not memory-efficient, so if you're worried about memory performance you may want to switch to something designed to be zero-copy from the ground up, like Vaex, instead of trying to go through pandas.
Your approach of assigning the filled slice back to the dataframe is the most appropriate and will ensure you receive the correct result, updating the dataframe as "inplace" as possible:
>>> df.loc[bool_mask, "Surface"] = df.loc[bool_mask, "Surface"].fillna("Test")
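Putting it together, a minimal end-to-end sketch of the recommended pattern (here Surface is created as object dtype so a string fill value is type-compatible):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Pad": range(40, 60),
    "Surface": pd.Series([np.nan] * 20, dtype=object),
})
mark = data["Pad"].isin(range(51, 56))

# Assigning the result back updates the original frame reliably
data.loc[mark, "Surface"] = data.loc[mark, "Surface"].fillna("TEST")
print(data.loc[mark])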
I have the following data frame called "new_df":
dato uttak annlegg Merd ID Leng BW CF F B H K
0 2020-12-15 12_20 LL 3 1 48.0 1200 1.085069 0.0 2.0 0.0 NaN
1 2020-12-15 12_20 LL 3 2 43.0 830 1.043933 0.0 1.0 0.0 NaN
columns are:
'dato', 'uttak', 'annlegg', 'Merd', 'ID', 'Leng', 'BW', 'CF', 'F', 'B', 'H', 'K'
when I do:
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
I got all means except the column "BW" like this:
annlegg Merd ID Leng CF F B H K
0 KH 1 42.557143 56.398649 1.265812 0.071770 1.010638 0.600000 0.127907
1 KH 2 42.683794 56.492228 1.270522 0.021978 0.739130 0.230769 0.075862
2 KH 3 42.177866 35.490119 1.125416 0.000000 0.384146 0.333333 0.034483
Column "BW" just disappeared when I groupby, no matter "as_index" True or False, why is that?
It appears the content of the BW column does not have a numerical type but an object type instead, which is used for storing strings, for instance. Thus, when applying groupby with the mean aggregation function, your column disappears, since computing the mean of an object (think of a string) does not make sense in general.
You should start by converting your BW column:
First method: pd.to_numeric
This first method will safely convert your whole column to floats.
new_df['BW'] = pd.to_numeric(new_df['BW'])
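If some entries cannot be parsed as numbers, pd.to_numeric will raise by default; a hedged variant is errors='coerce', which turns unparseable entries into NaN so you can inspect them first:

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(new_df['BW'], errors='coerce')
# Inspect the original values that failed to convert
print(new_df.loc[converted.isna(), 'BW'])
new_df['BW'] = converted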
Second method: df.astype
If you do not want to convert your data to float (for instance, if you know that the column only contains ints, or if floating-point precision does not interest you), you can use the astype method, which allows you to convert to almost any type you want:
new_df['BW'] = new_df['BW'].astype(float) # Converts to float
new_df['BW'] = new_df['BW'].astype(int) # Converts to integer
You can then apply your groupby and aggregation as you did!
That's probably due to the wrong data type. You can try this.
new_df = new_df.convert_dtypes()
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
You can check dtypes via:
new_df.dtypes
You can try the .agg() function to target specific columns.
new_df.groupby(['annlegg','Merd']).agg({'BW':'mean'})
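One caveat for newer pandas: since pandas 2.0, groupby(...).mean() no longer silently drops non-numeric columns but raises a TypeError instead, so you either convert the column as above or request numeric columns explicitly:

# In pandas >= 2.0, be explicit about which columns to average
new_df.groupby(['annlegg', 'Merd'], as_index=False).mean(numeric_only=True)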
How do I multiply each element of a given column of my dataframe with a scalar?
(I have tried looking on SO, but cannot seem to find the right solution)
Doing something like:
df['quantity'] *= -1 # trying to multiply each row's quantity column with -1
gives me a warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Note: If possible, I do not want to be iterating over the dataframe and do something like this...as I think any standard math operation on an entire column should be possible w/o having to write a loop:
for idx, row in df.iterrows():
df.loc[idx, 'quantity'] *= -1
EDIT:
I am running pandas 0.16.2
full trace:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
Try using the apply function.
df['quantity'] = df['quantity'].apply(lambda x: x*-1)
Note: for those using pandas 0.20.3 and above, and are looking for an answer, all these options will work:
df = pd.DataFrame(np.ones((5, 6)),
                  columns=['one', 'two', 'three', 'four', 'five', 'six'])
df.one *= 5
df.two = df.two * 5
df.three = df.three.multiply(5)
df['four'] = df['four'] * 5
df.loc[:, 'five'] *= 5
df.iloc[:, 5] = df.iloc[:, 5] * 5
which results in
one two three four five six
0 5.0 5.0 5.0 5.0 5.0 5.0
1 5.0 5.0 5.0 5.0 5.0 5.0
2 5.0 5.0 5.0 5.0 5.0 5.0
3 5.0 5.0 5.0 5.0 5.0 5.0
4 5.0 5.0 5.0 5.0 5.0 5.0
Here's the answer after a bit of research:
df.loc[:, 'quantity'] *= -1  # seems to prevent SettingWithCopyWarning
More recent pandas versions have the pd.DataFrame.multiply function.
df['quantity'] = df['quantity'].multiply(-1)
The real problem is not that there is anything wrong with your code: iloc, loc, apply, or *= - any of them would have worked.
The real problem is how you created the df DataFrame. Most likely you created df as a slice of another DataFrame without using .copy(). The correct way to create df as a slice of another DataFrame is df = original_df.loc[some slicing].copy().
The problem is already stated in the error message you got: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
You will get the same message in the most current version of pandas too.
Whenever you receive this kind of error message, you should always check how you created your DataFrame. Chances are you forgot the .copy(), as in the sketch below.
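A minimal sketch of that failure mode (original_df and the filter are hypothetical):

import pandas as pd

original_df = pd.DataFrame({"quantity": [1, 2, 3], "price": [10, 20, 30]})

# A slice taken without .copy() may be a view of original_df;
# writing to it triggers SettingWithCopyWarning
df = original_df[original_df["price"] > 10]
# df['quantity'] *= -1   # -> SettingWithCopyWarning

# An explicit copy makes ownership clear and silences the warning
df = original_df[original_df["price"] > 10].copy()
df["quantity"] *= -1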
Try df['quantity'] = df['quantity'] * -1.
A bit old, but I was still getting the same SettingWithCopyWarning. Here was my solution:
df.loc[:, 'quantity'] = df['quantity'] * -1
I got this warning using Pandas 0.22. You can avoid this by being very explicit using the assign method:
df = df.assign(quantity = df.quantity.mul(-1))
A little late to the game, but for future searchers, this also should work:
df.quantity = df.quantity * -1
You can use the label of the column you want to multiply:
df.loc[:, 6] *= -1
This will multiply the column labeled 6 by -1.
It's also possible to use positional indices with .iloc:
df.iloc[:, 0] *= -1
Update 2022-08-10
Python: 3.10.5 - pandas: 1.4.3
As mentioned in previous comments, one of the applicable approaches is using a lambda. But be careful with data types when using the lambda approach.
Suppose you have a pandas Data Frame like this:
# Create List of lists
products = [[1010, 'Nokia', '200', 1800], [2020, 'Apple', '150', 3000], [3030, 'Samsung', '180', 2000]]
# Create the pandas DataFrame
df = pd.DataFrame(products, columns=['ProductId', 'ProductName', 'Quantity', 'Price'])
# print DataFrame
print(df)
ProductId ProductName Quantity Price
0 1010 Nokia 200 1800
1 2020 Apple 150 3000
2 3030 Samsung 180 2000
So, suppose you want to triple the value of Quantity for all rows and you use the following statement:
# This statement considers the values of Quantity as string and updates the DataFrame
df['Quantity'] = df['Quantity'].apply(lambda x:x*3)
# print DataFrame
print(df)
The Result will be:
ProductId ProductName Quantity Price
0 1010 Nokia 200200200 1800
1 2020 Apple 150150150 3000
2 3030 Samsung 180180180 2000
The above statement treated the values of Quantity as strings.
So, in order to do the multiplication the right way, the following statement with a conversion generates the correct output:
# This statement considers the values of Quantity as integer and updates the DataFrame
df['Quantity'] = df['Quantity'].apply(lambda x:int(x)*3)
# print DataFrame
print(df)
Therefore the output will be like this:
ProductId ProductName Quantity Price
0 1010 Nokia 600 1800
1 2020 Apple 450 3000
2 3030 Samsung 540 2000
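For completeness, a vectorized alternative that avoids the per-element lambda is to convert the column once and multiply it directly:

# Convert once, then use the vectorized operator
df['Quantity'] = pd.to_numeric(df['Quantity']) * 3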
I hope this helps :)
I am getting a ValueError: cannot reindex from a duplicate axis when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.
Here is my session inside of ipdb trace. I have a DataFrame with string index, and integer columns, float values. However when I try to create sum index for sum of all columns I am getting ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem, what could I be missing?
I don't really understand what ValueError: cannot reindex from a duplicate axis means. What does this error message mean? Maybe this will help me diagnose the problem, and this is the most answerable part of my question.
ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')
ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False
Here is the error:
ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
I tried to reproduce this with a simple example, but I failed
In [32]: import pandas as pd
In [33]: import numpy as np
In [34]: a = np.arange(35).reshape(5,7)
In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In [36]: df.values.dtype
Out[36]: dtype('int64')
In [37]: df.loc['sums'] = df.sum(axis=0)
In [38]: df
Out[38]:
10 11 12 13 14 15 16
x 0 1 2 3 4 5 6
y 7 8 9 10 11 12 13
u 14 15 16 17 18 19 20
z 21 22 23 24 25 26 27
w 28 29 30 31 32 33 34
sums 70 75 80 85 90 95 100
This error usually arises when you join or assign to a column while the index has duplicate values. Since you are assigning to a row, I suspect a duplicate value in affinity_matrix.columns - and note that the index you printed does contain u'047' twice, which triggers the same failure.
As others have said, you've probably got duplicate values in your original index. To find them do this:
df[df.index.duplicated()]
Indices with duplicate values often arise if you create a DataFrame by concatenating other DataFrames. If you don't care about preserving the values of your index and you want them to be unique values, then when you concatenate the data, set ignore_index=True.
Alternatively, to overwrite your current index with a new one, instead of using df.reindex(), set:
df.index = new_index
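A small sketch tying these together (the data is hypothetical):

import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "b"])

# Show the rows whose index label is a repeat
print(df[df.index.duplicated()])   # the second "b" row, v == 3

# Replace the index outright if its values don't matter
df.index = range(len(df))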
Simple Fix
Run this before grouping
df = df.reset_index()
Thanks to this github comment for the solution.
For people who are still struggling with this error, it can also happen if you accidentally create a duplicate column with the same name. Remove duplicate columns like so:
df = df.loc[:,~df.columns.duplicated()]
Simply sidestep the error by appending .values at the end, so pandas assigns the raw values without aligning on the index.
affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0).values
I came across this error today when I wanted to add a new column like this
df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
I wanted to process the REMARK column of df_temp to return 1 or 0. However, I typed the wrong variable, df. And it returned an error like this:
----> 1 df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
2417 else:
2418 # set column
-> 2419 self._set_item(key, value)
2420
2421 def _setitem_slice(self, key, value):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
2483
2484 self._ensure_valid_index(value)
-> 2485 value = self._sanitize_column(key, value)
2486 NDFrame._set_item(self, key, value)
2487
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value, broadcast)
2633
2634 if isinstance(value, Series):
-> 2635 value = reindexer(value)
2636
2637 elif isinstance(value, DataFrame):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in reindexer(value)
2625 # duplicate axis
2626 if not value.index.is_unique:
-> 2627 raise e
2628
2629 # other
ValueError: cannot reindex from a duplicate axis
As you can see, the right code should be
df_temp['REMARK_TYPE'] = df_temp.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
Because df and df_temp have a different number of rows, the assignment had to reindex the result, which raised ValueError: cannot reindex from a duplicate axis.
I hope you can understand this and that my answer helps other people debug their code.
In my case, this error popped up not because of duplicate values, but because I attempted to join a shorter Series to a DataFrame: both had the same index, but the Series had fewer rows (it was missing the top few). The following worked for my purposes:
df.head()
SensA
date
2018-04-03 13:54:47.274 -0.45
2018-04-03 13:55:46.484 -0.42
2018-04-03 13:56:56.235 -0.37
2018-04-03 13:57:57.207 -0.34
2018-04-03 13:59:34.636 -0.33
series.head()
date
2018-04-03 14:09:36.577 62.2
2018-04-03 14:10:28.138 63.5
2018-04-03 14:11:27.400 63.1
2018-04-03 14:12:39.623 62.6
2018-04-03 14:13:27.310 62.5
Name: SensA_rrT, dtype: float64
df = series.to_frame().combine_first(df)
df.head(10)
SensA SensA_rrT
date
2018-04-03 13:54:47.274 -0.45 NaN
2018-04-03 13:55:46.484 -0.42 NaN
2018-04-03 13:56:56.235 -0.37 NaN
2018-04-03 13:57:57.207 -0.34 NaN
2018-04-03 13:59:34.636 -0.33 NaN
2018-04-03 14:00:34.565 -0.33 NaN
2018-04-03 14:01:19.994 -0.37 NaN
2018-04-03 14:02:29.636 -0.34 NaN
2018-04-03 14:03:31.599 -0.32 NaN
2018-04-03 14:04:30.779 -0.33 NaN
2018-04-03 14:05:31.733 -0.35 NaN
2018-04-03 14:06:33.290 -0.38 NaN
2018-04-03 14:07:37.459 -0.39 NaN
2018-04-03 14:08:36.361 -0.36 NaN
2018-04-03 14:09:36.577 -0.37 62.2
I wasted a couple of hours on the same issue. In my case, I had to call reset_index() on a dataframe before using an apply function.
Before merging, or looking up from another indexed dataset, you need to reset the index, as one dataset can have only one index.
I got this error when I tried adding a column from a different table. Indeed I got duplicate index values along the way. But it turned out I was just doing it wrong: I actually needed to df.join the other table.
This pointer might help someone in a similar situation.
In my case it was caused by a mismatch in dimensions: I accidentally used a column from a different df during a mul operation.
This can also be the cause: it may happen if you are trying to insert a DataFrame-typed column into a DataFrame. I solved my problem like this:
df['my_new'] = pd.Series(my_new.values)
If you get this error after merging two dataframes, removing the suffixes, and trying to write to Excel:
Your problem is that there are columns you are not merging on that are common to both source DataFrames. Pandas needs a way to say which one came from where, so it adds the suffixes, the defaults being '_x' on the left and '_y' on the right.
If you have a preference on which source data frame to keep the columns from, then you can set the suffixes and filter accordingly, for example if you want to keep the clashing columns from the left:
# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
df,
tempdf[what_i_care_about],
on=['myid', 'myorder'],
how='outer',
suffixes=('', '_delete_suffix') # Left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delete_suffix')]]
Alternatively, you can drop one of each of the clashing columns prior to merging, then Pandas has no need to assign a suffix.
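A sketch of that alternative, reusing the names from the snippet above (they come from the question and may not match your data):

# Drop the clashing non-key columns from the right-hand frame first,
# so pandas never needs to invent suffixes
keys = ['myid', 'myorder']
right = tempdf[what_i_care_about]
clashing = [c for c in right.columns if c in df.columns and c not in keys]
df = pd.merge(df, right.drop(columns=clashing), on=keys, how='outer')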
Just add .to_numpy() to the end of the series you want to concatenate.
It happened to me when I appended two dataframes into another (df3 = df1.append(df2)), so the output was:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 4 d
1 5 e
2 6 f
df3
A B
0 1 a
1 2 b
2 3 c
0 4 d
1 5 e
2 6 f
The simplest way to fix the indexes is to use the df.reset_index(drop=bool, inplace=bool) method, as Connor said. You can set the drop argument to True to avoid the old index being added as a column, and inplace to True to make the index reset permanent.
Here is the official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
In addition, you can also use the ".set_index(keys=list, inplace=bool)" method, like this:
new_index_list = list(range(0, len(df3)))
df3['new_index'] = new_index_list
df3.set_index(keys='new_index', inplace=True)
Official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
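Note that DataFrame.append was deprecated and then removed in pandas 2.0; the equivalent with pd.concat can avoid the duplicated index up front:

import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
df2 = pd.DataFrame({"A": [4, 5, 6], "B": ["d", "e", "f"]})

# ignore_index=True renumbers the rows, so no duplicate labels are created
df3 = pd.concat([df1, df2], ignore_index=True)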
Make sure your index does not have any duplicates. I simply did df.reset_index(drop=True, inplace=True) and I don't get the error anymore! But you might want to keep the index; in that case, just set drop to False.
df = df.reset_index(drop=True) worked for me
I was trying to create a histogram using seaborn.
sns.histplot(data=df, x='Blood Chemistry 1', hue='Outcome', discrete=False, multiple='stack')
I get ValueError: cannot reindex from a duplicate axis. To solve it, I had to choose only the rows where x has no missing values:
data = df[~df['Blood Chemistry 1'].isnull()]
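An equivalent filter using dropna (same result, arguably clearer intent):

data = df.dropna(subset=['Blood Chemistry 1'])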