Set value to an entire column of a pandas dataframe - python

I'm trying to set the entire column of a dataframe to a specific value.
In [1]: df
Out [1]:
issueid industry
0 001 xxx
1 002 xxx
2 003 xxx
3 004 xxx
4 005 xxx
From what I've seen, loc is the best practice when replacing values in a dataframe (or isn't it?):
In [2]: df.loc[:,'industry'] = 'yyy'
However, I still received this much talked-about warning message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
If I do
In [3]: df['industry'] = 'yyy'
I got the same warning message.
Any ideas? Working with Python 3.5.2 and pandas 0.18.1.
EDIT Jan 2023:
Given the volume of visits on this question, it's worth stating that my original question was really more about dataframe copy-versus-slice than "setting value to an entire column".
On copy-versus-slice: My current understanding is that, in general, if you want to modify a subset of a dataframe after slicing, you should create the subset by .copy(). If you only want a view of the slice, no copy() needed.
On setting value to an entire column: simply do df[col_name] = col_value

You can use the assign function:
df = df.assign(industry='yyy')

Python can do unexpected things when new objects are defined from existing ones. You stated in a comment above that your dataframe is defined along the lines of df = df_all.loc[df_all['issueid']==specific_id,:]. In this case, df is really just a stand-in for the rows stored in the df_all object: a new object is NOT created in memory.
To avoid these issues altogether, I often have to remind myself to use the copy module, which explicitly forces objects to be copied in memory so that methods called on the new objects are not applied to the source object. I had the same problem as you, and avoided it using the deepcopy function.
In your case, this should get rid of the warning message:
from copy import deepcopy
df = deepcopy(df_all.loc[df_all['issueid']==specific_id,:])
df['industry'] = 'yyy'
EDIT: Also see David M.'s excellent comment below!
df = df_all.loc[df_all['issueid']==specific_id,:].copy()
df['industry'] = 'yyy'

df.loc[:,'industry'] = 'yyy'
This does the magic. You are to add '.loc' with ':' for all rows. Hope it helps

You can do :
df['industry'] = 'yyy'

Assuming your Data frame is like 'Data' you have to consider if your data is a string or an integer. Both are treated differently. So in this case you need be specific about that.
import pandas as pd
data = [('001','xxx'), ('002','xxx'), ('003','xxx'), ('004','xxx'), ('005','xxx')]
df = pd.DataFrame(data,columns=['issueid', 'industry'])
print("Old DataFrame")
print(df)
df.loc[:,'industry'] = str('yyy')
print("New DataFrame")
print(df)
Now if want to put numbers instead of letters you must create and array
list_of_ones = [1,1,1,1,1]
df.loc[:,'industry'] = list_of_ones
print(df)
Or if you are using Numpy
import numpy as np
n = len(df)
df.loc[:,'industry'] = np.ones(n)
print(df)

This provides you with the possibility of adding conditions on the rows and then change all the cells of a specific column corresponding to those rows:
df.loc[(df['issueid'] == '001'), 'industry'] = str('yyy')

Seems to me that:
df1 = df[df['col1']==some_value] will not create a new DataFrame, basically, changes in df1 will be reflected in the parent df. This leads to the warning.
Whereas, df1 = df[df['col1]]==some_value].copy() will create a new DataFrame, and changes in df1 will not be reflected in df. The copy method is recommended if you don't want to make changes to your original df.

I had a similar issue before even with this approach df.loc[:,'industry'] = 'yyy', but once I refreshed the notebook, it ran well.
You may want to try refreshing the cells after you have df.loc[:,'industry'] = 'yyy'.

Only use them instead:
df.iloc[:]['industry'] = 'yyy'
remember: this only works with exist columns in dataframe
this for people who didn't work .loc

For anyone else coming for this answer and doesn't want to use copy -
df['industry'] = df['industry'].apply(lambda x: '')

if you just create new but empty data frame, you cannot directly sign a value to a whole column. This will show as NaN because the system wouldn't know how many rows the data frame will have!You need to either define the size or have some existing columns.
df = pd.DataFrame()
df["A"] = 1
df["B"] = 2
df["C"] = 3

Related

Pandas Dataframe: Function doesn't preserve my custom column order when returning df

I followed this code from user Lala la (https://stackoverflow.com/a/55803252/19896454)
to put 3 columns at the front and leave the rest with no changes. It works well inside the function but when returns the dataframe, it loses column order.
My desperate solution was to put the code on the main program...
Other functions in my code are able to return modified versions of the dataframe with no problem.
Any ideas what is happening?
Thanks!
def define_columns_order(df):
cols_to_order = ['LINE_ID','PARENT.CATEGORY', 'CATEGORY']
new_columns = cols_to_order + (df.columns.drop(cols_to_order).tolist())
df = df[new_columns]
return df
try using return(df.reindex(new_columns, axis=1)) and keep in mind DataFrame modifications are not in place, unless you specify inplace=True, therefore you need to explicitly assign the result returned by your function to your df variable

How to undo the changes made in original DataFrame?

I am working with a dataset. As a precautionary measure, I created a back-up copy using the following command.
Orig. Dataframe = df
df_copy = df.copy(deep = True)
Now, I dropped few columns from original dataframe (df) by mistake using inplace = True.
I tried to undo the operation, but no use.
So, the question is how to get my original dataframe (df) from copied dataframe (df_copy) ?
Yoy cannot restore it. Code like below dosen't work.
df = df_copy.copy(deep = True)
Every variables which reference original df keep reference after operation above.

How to filter multiple dataframes in a loop?

I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019,df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
data = data[['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
]]
but it doesn't apply the filter to the data frame. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame, you just throw the result away by overriding what the name data points to.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
As soon as you write var = new_value, you do not change the original object but have the variable refering a new object.
If you want to change the dataframes from df_list_2019, you have to use an inplace=True method. Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
])
for data in df_list_2019:
data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware, pandas experts recommend to prefere the df = df. ... idiom to the df...(..., inplace=True) because it allows chaining the operations. So you should ask yourself if #timgeb's answer cannot be used. Anyway this one should work for your requirements.

Drop Columns in Pandas Dataframe: Inconsistency in Output

Problem: While dropping column labelled 'Happiness_Score' below, I'm getting it dropped in the parent Dataframe as well. This is not supposed to happen, would like clarification on this?
A = df_new
A.drop('Happiness_Score', axis = 1, inplace = True)
This is the output: As you can see the column gets dropped in df_new too; isn't inplace = True mean that it gets dropped only in the A Dataframe.
NOTE:
I'm able to workaround this by changing the code; now output is as expected.
B=df_new.drop('Happiness_Score', axis = 1)
Actually, when you do A = df_new
you are not creating a copy of the Dataframe, rather just a pointer. So to execute this correctly you should use A = df_new.copy()
When you are selecting a subset or indexing: A = df_new[condition] then it creates copy of a slice of a dataframe, so your workaround works too.
A = def_new creates a new reference to your original def_new, an not a new copy. You are binding A to the same thing def_new holding the reference to. And what happens when you do modification in a reference? It is reflected in the original object. I'll illustrate this with an example.
orgList = [1,2,3,4,5]
bkpList = orgList
print(bkpList is orgList) #OUTPUT: True
This is because both variables are pointing to same list. Modify any one, and change will be reflected in original list. Same thing can be observed in your dataframe case.
Solution: Keep a separate copy of your dataframe.
The variable A is a reference to df_new. Try creating A by doing a complete slice of df_new or df_new.copy().

Python Pandas SettingWithCopyWarning copies vs new objects

I'm working with a dataframe 'copy' created by sub-setting a previous one - see below:
import random
import pandas as pd
df = pd.DataFrame({'data':list(random.sample(range(10,100),25))})
df_filtered = df.query('data > 20 and data < 80')
df_filtered.rename(columns={'data':'observations'},inplace=True)
The problem is, when the rename method is called I receive a SettingWithCopy warning that, as I understand it, means I'm operating on a copy of the original (df in this case) object. The warning text is: "A value is trying to be set on a copy of a slice from a DataFrame"
I found this question that was answered using a different approach to subsetting. I prefer the Dataframe.query() method myself (syntax-wise). Is there a way I can create a new Dataframe object using the.query() method rather than the method suggested in the question I linked? I've tried a few options with iloc but haven't been successful thus-far.
You can always explicitly make a copy by calling .copy() on your filtered dataframe. Concretely, replace
df_filtered = df.query('data > 20 and data < 80')
with
df_filtered = df.query('data > 20 and data < 80').copy()
Does that get rid of the warning?
try this instead of using inplace=True:
In [12]: df_filtered = df.query('data > 20 and data < 80')
In [13]: df_filtered = df_filtered.rename(columns={'data':'observations'})
.rename() function returns a new object, so you can simply overwrite your DF with the returned new DF
if you use inplace the following is happening
from docs:
inplace : boolean, default False
Whether to return a new DataFrame. If True then value of copy is ignored.
Returns:
renamed : DataFrame (new object)
PS basically you should try to avoid using inplace=True and use df = df.function(...) technique instead

Categories