Python: Pandas DataFrame: how to multiply an entire column by a scalar

How do I multiply each element of a given column of my dataframe with a scalar?
(I have tried looking on SO, but cannot seem to find the right solution)
Doing something like:
df['quantity'] *= -1  # multiply each value in the quantity column by -1
gives me a warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Note: If possible, I do not want to iterate over the dataframe and do something like the following, since I think any standard math operation on an entire column should be possible without writing a loop:
for idx, row in df.iterrows():
    df.loc[idx, 'quantity'] *= -1
EDIT:
I am running 0.16.2 of Pandas
full trace:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s

Try using the apply function:
df['quantity'] = df['quantity'].apply(lambda x: x*-1)
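Note that .apply runs the Python function once per element; for a numeric column the plain vectorized form is equivalent and usually much faster. A minimal sketch on a hypothetical toy frame:
import pandas as pd
df = pd.DataFrame({'quantity': [1, 2, 3]})
df['quantity'] = df['quantity'].apply(lambda x: x * -1)  # elementwise, Python-level
# equivalent vectorized form, usually much faster:
# df['quantity'] = df['quantity'] * -1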

Note: for those using pandas 0.20.3 and above who are looking for an answer, all of these options will work:
df = pd.DataFrame(np.ones((5, 6)), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df.one *= 5
df.two = df.two * 5
df.three = df.three.multiply(5)
df['four'] = df['four'] * 5
df.loc[:, 'five'] *= 5
df.iloc[:, 5] = df.iloc[:, 5] * 5
which results in
one two three four five six
0 5.0 5.0 5.0 5.0 5.0 5.0
1 5.0 5.0 5.0 5.0 5.0 5.0
2 5.0 5.0 5.0 5.0 5.0 5.0
3 5.0 5.0 5.0 5.0 5.0 5.0
4 5.0 5.0 5.0 5.0 5.0 5.0

Here's the answer after a bit of research:
df.loc[:, 'quantity'] *= -1  # seems to prevent SettingWithCopyWarning

More recent pandas versions have the multiply method (here, pandas.Series.multiply, since it is called on a single column):
df['quantity'] = df['quantity'].multiply(-1)

The real reason you are getting the warning is not that there is anything wrong with your code: any of iloc, loc, apply, or *= would have worked.
The real problem is how you created the df DataFrame. Most likely you created df as a slice of another DataFrame without using .copy(). The correct way to create df as a slice of another DataFrame is df = original_df.loc[some_slicing].copy().
The problem is already stated in the message you got: "SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
You will get the same message in the most recent versions of pandas too.
Whenever you receive this kind of warning, check how you created your DataFrame. Chances are you forgot the .copy().
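A minimal sketch of the difference (the frame, column names, and filter are hypothetical):
import pandas as pd
original_df = pd.DataFrame({'quantity': [1, 2, 3], 'price': [10, 20, 30]})
df_slice = original_df[original_df['price'] > 10]  # a slice of original_df: assigning into it may warn
df = original_df[original_df['price'] > 10].copy()  # an independent copy: safe to assign to
df['quantity'] *= -1  # no SettingWithCopyWarning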

Try df['quantity'] = df['quantity'] * -1.

A bit old, but I was still getting the same SettingWithCopyWarning. Here was my solution:
df.loc[:, 'quantity'] = df['quantity'] * -1

I got this warning using Pandas 0.22. You can avoid this by being very explicit using the assign method:
df = df.assign(quantity = df.quantity.mul(-1))
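Since assign returns a new DataFrame, it also accepts a callable, which is handy in method chains; a small sketch:
df = df.assign(quantity=lambda d: d.quantity.mul(-1))  # the callable receives the (possibly intermediate) frame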

A little late to the game, but for future searchers, this also should work:
df.quantity = df.quantity * -1

You can use the label of the column you want to multiply:
df.loc[:,6] *= -1
This will multiply the column whose label is 6 by -1 (with .loc, the 6 is a column label, not a position).

It's also possible to use positional (numerical) indices with .iloc.
df.iloc[:,0] *= -1

Update 2022-08-10
Python: 3.10.5 - pandas: 1.4.3
As mentioned in previous answers, one of the applicable approaches is using a lambda. But be careful with data types when using the lambda approach.
Suppose you have a pandas Data Frame like this:
# Create List of lists
products = [[1010, 'Nokia', '200', 1800], [2020, 'Apple', '150', 3000], [3030, 'Samsung', '180', 2000]]
# Create the pandas DataFrame
df = pd.DataFrame(products, columns=['ProductId', 'ProductName', 'Quantity', 'Price'])
# print DataFrame
print(df)
ProductId ProductName Quantity Price
0 1010 Nokia 200 1800
1 2020 Apple 150 3000
2 3030 Samsung 180 2000
So, suppose you want to triple the value of Quantity for all rows and use the following statement:
# This statement considers the values of Quantity as string and updates the DataFrame
df['Quantity'] = df['Quantity'].apply(lambda x:x*3)
# print DataFrame
print(df)
The result will be:
ProductId ProductName Quantity Price
0 1010 Nokia 200200200 1800
1 2020 Apple 150150150 3000
2 3030 Samsung 180180180 2000
The above statement treats the values of Quantity as strings.
So, to do the multiplication correctly, the following statement with a conversion generates the correct output:
# This statement considers the values of Quantity as integer and updates the DataFrame
df['Quantity'] = df['Quantity'].apply(lambda x:int(x)*3)
# print DataFrame
print(df)
Therefore the output will be like this:
ProductId ProductName Quantity Price
0 1010 Nokia 600 1800
1 2020 Apple 450 3000
2 3030 Samsung 540 2000
I hope this helps :)
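A vectorized alternative (a sketch: it converts the column's dtype once instead of casting inside the lambda):
# Cast the whole column to integer once, then multiply vectorized
df['Quantity'] = df['Quantity'].astype(int) * 3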

Related

pandas fillna sequentially step by step

I have a dataframe like the one below:
Re_MC,Fi_MC,Fin_id,Res_id,
1,2,3,4
,7,6,11
11,,31,32
,,35,38
df1 = pd.read_clipboard(sep=',')
I would like to fillna in two steps:
a) First, compare only Re_MC and Fi_MC. If a value is missing in either of these columns, copy it from the other column.
b) If, after step a, there is still NA in either Re_MC or Fi_MC, copy values from Fin_id for Fi_MC and from Res_id for Re_MC.
So, I tried the below two approaches
Approach 1 - This works but is not efficient/elegant
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC'])
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Fin_id'])
Approach 2 - This doesn't work and provides incorrect output
df1['Re_MC'] = df1['Re_MC'].fillna(df1['Fi_MC']).fillna(df1['Res_id'])
df1['Fi_MC'] = df1['Fi_MC'].fillna(df1['Re_MC']).fillna(df1['Fin_id'])
Is there any other efficient way to fillna in a sequential manner? Meaning, we do step a first and then, based on the result of step a, we do step b.
I expect my output to be as shown below.
Updated code:
df_new = (df_new
    .fillna({'Re MC': df_new['Re Cust'], 'Re MC': df_new['Re Cust_System']})
    .fillna({'Fi MC': df_new['Fi.Fi Customer'], 'Final MC': df_new['Re.Fi Customer']})
    .fillna({'Fi MC': df_new['Re MC']})
    .fillna({'Class Fi MC': df_new['Re MC']})
)
You can use dictionaries in fillna:
(df1
 .fillna({'Re_MC': df1['Fi_MC'], 'Fi_MC': df1['Re_MC']})
 .fillna({'Re_MC': df1['Res_id'], 'Fi_MC': df1['Fin_id']})
)
Because every dictionary value is computed against the original df1, step a fills the two columns symmetrically in one pass; the second fillna then applies step b's fallback only to what is still missing. (Approach 2 went wrong because the first assignment mutated df1 before the second line read it.)
output:
Re_MC Fi_MC Fin_id Res_id
0 1.0 2.0 3 4
1 7.0 7.0 6 11
2 11.0 11.0 31 32
3 38.0 35.0 35 38

update data frame based on data from another data frame using pandas python

I have two data frames, df1 and df2. Both share a common first column: SKUCode in df1 and SKU in df2.
df1:
df2:
I want to update df1 and set SKUStatus=0 if SKUCode matches SKU in df2.
I want to add new row to df1 if SKU from df2 has no match to SKUCode.
So after the operation, df1 looks like the following:
One way I could get this done is via df2.iterrows() and looping through values; however, I think there must be a neater way of doing this?
Thank you
import pandas as pdx
df1 = pdx.DataFrame({'SKUCode': ['A', 'B', 'C', 'D'], 'ListPrice': [1798, 2997, 1798, 999], 'SalePrice': [1798, 2997, 1798, 999], 'SKUStatus': [1, 1, 1, 0], 'CostPrice': [500, 773, 525, 300]})
df2 = pdx.DataFrame({'SKUCode': ['X', 'Y', 'B'], 'Status': [0, 0, 0], 'e_date': ['31-05-2020', '01-06-2020', '01-06-2020']})
df1.merge(df2,left_on='SKUCode')
Try this, using an outer merge, which gives both matching and non-matching records:
In [75]: df_m = df1.merge(df2, on="SKUCode", how='outer')
In [76]: mask = df_m['Status'].isnull()
In [77]: df_m.loc[~mask, 'SKUStatus'] = df_m.loc[~mask, 'Status']
In [78]: df_m[['SKUCode', "ListPrice", "SalePrice", "SKUStatus", "CostPrice"]].fillna(0.0)
output
SKUCode ListPrice SalePrice SKUStatus CostPrice
0 A 1798.0 1798.0 1.0 500.0
1 B 2997.0 2997.0 0.0 773.0
2 C 1798.0 1798.0 1.0 525.0
3 D 999.0 999.0 0.0 300.0
4 X 0.0 0.0 0.0 0.0
5 Y 0.0 0.0 0.0 0.0
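If you also want SKUStatus back as an integer after the fillna (the NaNs force the column to float), a hedged follow-up sketch (df_final is a hypothetical name):
df_final = df_m[['SKUCode', 'ListPrice', 'SalePrice', 'SKUStatus', 'CostPrice']].fillna(0.0)
df_final['SKUStatus'] = df_final['SKUStatus'].astype(int)  # restore integer dtype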
I'm not sure exactly if I understood you correctly, but I think you can use .loc, something along the lines of:
df1.loc[df2['Status'] != 0, 'SKUStatus'] = 1
You should have a look at the pd.merge function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).
First rename the key column so both frames share a name (e.g. rename SKU to SKUCode). Then try:
df1.merge(df2, on='SKUCode')
If you provide input data (not screenshots), I can try with the appropriate parameters.

How to reset Pandas DataFrame indices?

Hi all,
I have (what should be) a very simple Pandas question. I'm trying to "refresh" the indices of my DataFrame, df.
I'm trying:
df3 = df2.reindex()
df3.head()
But this still gives me:
Dose
13539.0 1.0
13539.0 2.0
13539.0 5.0
13539.0 3.0
13539.0 4.0
I need to keep the "Dose" column in the same order, but make its indices 0 -> len(Dose).
In other words, my desired output is:
Dose
0 1.0
1 2.0
2 5.0
3 3.0
4 4.0
Clearly I've pre-processed something to mess this up, but I'd appreciate any insight as to how I can amend it :-)
Thank you!!
df3.reset_index(drop=True, inplace=True)
When you reset the index, the old index is added as a column, and a new sequential index is used. Then you can use the drop parameter to avoid the old index being added as a column.
Refer to this Pandas documentation.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
You might want to look up the documentation for reset_index. In your case:
df3.reset_index(drop=True)
This solution does the job... (note that after selecting the single column, you are left with a Series rather than a DataFrame):
df2 = df2.reset_index()
df2 = df2['Dose']
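For reference, a minimal sketch of what the drop parameter changes (toy data modeled on the question):
import pandas as pd
df = pd.DataFrame({'Dose': [1.0, 2.0, 5.0]}, index=[13539.0, 13539.0, 13539.0])
df.reset_index()           # old index is kept as a new 'index' column
df.reset_index(drop=True)  # old index is discarded; new index is 0..n-1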

Rounding down values in Pandas dataframe column with NaNs

I have a Pandas dataframe that contains a column of float64 values:
tempDF = pd.DataFrame({'id': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51, 51, 51, 51, 76, 76, 76, 91, 91, 91, 91],
                       'measure': [3.2, 4.2, 6.8, 5.6, 3.1, 4.8, 8.8, 3.0, 1.9, 2.1, 2.4, 3.5, 4.2, 5.2, 4.3, 3.6, 5.2, 7.1, 6.5, 7.3]})
I want to create a new column containing just the integer part. My first thought was to use .astype(int):
tempDF['int_measure'] = tempDF['measure'].astype(int)
This works fine but, as an extra complication, the column I have contains a missing value:
tempDF.loc[10, 'measure'] = np.nan  # the original used .ix, which has since been removed from pandas
This missing value causes the .astype(int) method to fail with:
ValueError: Cannot convert NA to integer
I thought I could round down the floats in the column of data. However, the .round(0) function will round to the nearest integer (higher or lower) rather than rounding down. I can't find a function equivalent to ".floor()" that will act on a column of a Pandas dataframe.
Any suggestions?
You could just apply numpy.floor:
import numpy as np
tempDF['int_measure'] = tempDF['measure'].apply(np.floor)
id measure int_measure
0 12 3.2 3
1 12 4.2 4
2 12 6.8 6
...
9 51 2.1 2
10 51 NaN NaN
11 51 3.5 3
...
19 91 7.3 7
You could also try:
df.apply(lambda s: s // 1)
Using np.floor is faster, however.
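If you want an actual integer dtype while keeping the missing value, newer pandas versions (0.24+) also offer the nullable Int64 dtype; a sketch, under that version assumption:
import numpy as np
# Floor first so every value is whole, then cast to nullable integer;
# the NaN becomes pd.NA instead of raising ValueError
tempDF['int_measure'] = np.floor(tempDF['measure']).astype('Int64')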
The answers here are pretty dated: as of pandas 0.25.2 (perhaps earlier), some of them now emit the warning
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Following that suggestion, the assignment would be
df.iloc[:,0] = df.iloc[:,0].astype(int)
for one particular column.

What does `ValueError: cannot reindex from a duplicate axis` mean?

I am getting a ValueError: cannot reindex from a duplicate axis when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.
Here is my session inside of ipdb trace. I have a DataFrame with string index, and integer columns, float values. However when I try to create sum index for sum of all columns I am getting ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem, what could I be missing?
I don't really understand what ValueError: cannot reindex from a duplicate axis means. What does this error message mean? Maybe this will help me diagnose the problem, and this is the most answerable part of my question.
ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')
ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False
Here is the error:
ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
I tried to reproduce this with a simple example, but I failed
In [32]: import pandas as pd
In [33]: import numpy as np
In [34]: a = np.arange(35).reshape(5,7)
In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In [36]: df.values.dtype
Out[36]: dtype('int64')
In [37]: df.loc['sums'] = df.sum(axis=0)
In [38]: df
Out[38]:
10 11 12 13 14 15 16
x 0 1 2 3 4 5 6
y 7 8 9 10 11 12 13
u 14 15 16 17 18 19 20
z 21 22 23 24 25 26 27
w 28 29 30 31 32 33 34
sums 70 75 80 85 90 95 100
This error usually arises when you join / assign to a column while the index has duplicate values. Since you are assigning to a row, I suspect that there is a duplicate value in affinity_matrix.columns, perhaps not shown in your question.
As others have said, you've probably got duplicate values in your original index. To find them do this:
df[df.index.duplicated()]
Indices with duplicate values often arise if you create a DataFrame by concatenating other DataFrames. If you don't care about preserving the values of your index and you want them to be unique, set ignore_index=True when you concatenate the data.
Alternatively, to overwrite your current index with a new one, instead of using df.reindex(), set:
df.index = new_index
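A small sketch tying these together (hypothetical toy frames):
import pandas as pd
df = pd.concat([pd.DataFrame({'a': [1]}), pd.DataFrame({'a': [2]})])  # both rows get index 0
df[df.index.duplicated()]  # shows the offending rows
df = pd.concat([pd.DataFrame({'a': [1]}), pd.DataFrame({'a': [2]})], ignore_index=True)  # unique 0..n-1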
Simple Fix
Run this before grouping
df = df.reset_index()
Thanks to this github comment for the solution.
For people who are still struggling with this error, it can also happen if you accidentally create a duplicate column with the same name. Remove duplicate columns like so:
df = df.loc[:,~df.columns.duplicated()]
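A quick sketch of what that line does (hypothetical duplicated column):
import pandas as pd
df = pd.DataFrame([[1, 2]], columns=['a', 'a'])  # two columns both named 'a'
df = df.loc[:, ~df.columns.duplicated()]  # keeps only the first occurrence of each name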
Simply sidestep the error by using .values at the end:
affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0).values
.values strips the index from the Series, so pandas assigns positionally instead of trying to align (reindex) against the duplicated axis.
I came across this error today when I wanted to add a new column like this
df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
I wanted to process the REMARK column of df_temp to return 1 or 0. However, I mistakenly typed df instead of df_temp, and it returned an error like this:
----> 1 df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
2417 else:
2418 # set column
-> 2419 self._set_item(key, value)
2420
2421 def _setitem_slice(self, key, value):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
2483
2484 self._ensure_valid_index(value)
-> 2485 value = self._sanitize_column(key, value)
2486 NDFrame._set_item(self, key, value)
2487
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value, broadcast)
2633
2634 if isinstance(value, Series):
-> 2635 value = reindexer(value)
2636
2637 elif isinstance(value, DataFrame):
/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in reindexer(value)
2625 # duplicate axis
2626 if not value.index.is_unique:
-> 2627 raise e
2628
2629 # other
ValueError: cannot reindex from a duplicate axis
As you can see, the right code should be
df_temp['REMARK_TYPE'] = df_temp.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)
Because df and df_temp have a different number of rows, the assignment returned ValueError: cannot reindex from a duplicate axis.
I hope this helps other people debug their code.
In my case, this error popped up not because of duplicate values, but because I attempted to join a shorter Series to a DataFrame: both had the same index, but the Series had fewer rows (it was missing the top few). The following worked for my purposes:
df.head()
SensA
date
2018-04-03 13:54:47.274 -0.45
2018-04-03 13:55:46.484 -0.42
2018-04-03 13:56:56.235 -0.37
2018-04-03 13:57:57.207 -0.34
2018-04-03 13:59:34.636 -0.33
series.head()
date
2018-04-03 14:09:36.577 62.2
2018-04-03 14:10:28.138 63.5
2018-04-03 14:11:27.400 63.1
2018-04-03 14:12:39.623 62.6
2018-04-03 14:13:27.310 62.5
Name: SensA_rrT, dtype: float64
df = series.to_frame().combine_first(df)
df.head(10)
SensA SensA_rrT
date
2018-04-03 13:54:47.274 -0.45 NaN
2018-04-03 13:55:46.484 -0.42 NaN
2018-04-03 13:56:56.235 -0.37 NaN
2018-04-03 13:57:57.207 -0.34 NaN
2018-04-03 13:59:34.636 -0.33 NaN
2018-04-03 14:00:34.565 -0.33 NaN
2018-04-03 14:01:19.994 -0.37 NaN
2018-04-03 14:02:29.636 -0.34 NaN
2018-04-03 14:03:31.599 -0.32 NaN
2018-04-03 14:04:30.779 -0.33 NaN
2018-04-03 14:05:31.733 -0.35 NaN
2018-04-03 14:06:33.290 -0.38 NaN
2018-04-03 14:07:37.459 -0.39 NaN
2018-04-03 14:08:36.361 -0.36 NaN
2018-04-03 14:09:36.577 -0.37 62.2
I wasted a couple of hours on the same issue. In my case, I had to reset_index() on the DataFrame before using the apply function.
Before merging, or looking up from another indexed dataset, you need to reset the index, as one dataset can have only one index.
I got this error when I tried adding a column from a different table. Indeed I got duplicate index values along the way. But it turned out I was just doing it wrong: I actually needed to df.join the other table.
This pointer might help someone in a similar situation.
In my case it was caused by a mismatch in dimensions: I accidentally used a column from a different df during a mul operation, which can also be a cause of this error.
It may also happen if you are trying to insert a DataFrame-typed column into a DataFrame. You can try this:
df['my_new'] = pd.Series(my_new.values)
If you get this error after merging two DataFrames, removing the suffixes, and then trying to write to Excel:
Your problem is that there are columns you are not merging on that are common to both source DataFrames. Pandas needs a way to say which one came from where, so it adds the suffixes, the defaults being '_x' on the left and '_y' on the right.
If you have a preference on which source data frame to keep the columns from, then you can set the suffixes and filter accordingly, for example if you want to keep the clashing columns from the left:
# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
    df,
    tempdf[what_i_care_about],
    on=['myid', 'myorder'],
    how='outer',
    suffixes=('', '_delete_suffix')  # left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delete_suffix')]]
Alternatively, you can drop one of each of the clashing columns prior to merging, then Pandas has no need to assign a suffix.
Just add .to_numpy() to the end of the series you want to concatenate.
It happened to me when I appended 2 dataframes into another (df3 = df1.append(df2)); note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions use df3 = pd.concat([df1, df2]). The output was:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 4 d
1 5 e
2 6 f
df3
A B
0 1 a
1 2 b
2 3 c
0 4 d
1 5 e
2 6 f
The simplest way to fix the indexes is using the "df.reset_index(drop=bool, inplace=bool)" method, as Connor said... you can set the 'drop' argument to True to avoid the old index being added as a column, and 'inplace' to True to make the reset permanent.
Here is the official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
In addition, you can also use the ".set_index(keys=list, inplace=bool)" method, like this:
new_index_list = list(range(0, len(df3)))
df3['new_index'] = new_index_list
df3.set_index(keys='new_index', inplace=True)
Official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
Make sure your index does not have any duplicates. I simply did df.reset_index(drop=True, inplace=True) and I don't get the error anymore! But if you want to keep the old index as a column, just set drop to False.
df = df.reset_index(drop=True) worked for me
I was trying to create a histogram using seaborn.
sns.histplot(data=df, x='Blood Chemistry 1', hue='Outcome', discrete=False, multiple='stack')
I get ValueError: cannot reindex from a duplicate axis. To solve it, I had to choose only the rows where x has no missing values:
data = df[~df['Blood Chemistry 1'].isnull()]
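An equivalent filter using dropna (a sketch, with the same hypothetical column name):
data = df.dropna(subset=['Blood Chemistry 1'])  # keep only rows where that column is not NaN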
