I followed this code from user Lala la (https://stackoverflow.com/a/55803252/19896454)
to put 3 columns at the front and leave the rest unchanged. It works well inside the function, but when the function returns the dataframe, the column order is lost.
My desperate solution was to put the code on the main program...
Other functions in my code are able to return modified versions of the dataframe with no problem.
Any ideas what is happening?
Thanks!
def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    df = df[new_columns]
    return df
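Try returning df.reindex(new_columns, axis=1), and keep in mind that most DataFrame methods return a new object rather than modifying the original (unless the method accepts inplace=True and you pass it). Inside the function, df = df[new_columns] only rebinds the local name df, so you need to explicitly assign the result returned by your function to your df variable, e.g. df = define_columns_order(df).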
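A minimal sketch of that advice. The sample data and the extra column VALUE are made up for illustration; only the function body comes from the question. It shows that calling the function without assigning the result leaves the caller's frame unchanged:

```python
import pandas as pd

def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    # reindex returns a new DataFrame; the caller's object is not modified
    return df.reindex(new_columns, axis=1)

# Hypothetical sample data, not from the original question
df = pd.DataFrame({
    'VALUE': [1, 2],
    'CATEGORY': ['x', 'y'],
    'LINE_ID': [10, 11],
    'PARENT.CATEGORY': ['p', 'q'],
})

define_columns_order(df)            # result discarded: df keeps its old order
order_before = df.columns.tolist()

df = define_columns_order(df)       # result assigned: df now has the new order
order_after = df.columns.tolist()
```

If the reordered frame never shows up in the main program, the first thing to check is whether the call site looks like the first call or the second one.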
I would like to use assign() to create new columns by method chaining (which is an elegant way of expressing a number of operations on a dataframe). However, I can't seem to find a way to do this without creating a copy, which is much slower than doing it in place because of the associated memory allocation. Is it possible to do this with a simple method that modifies the dataframe in place and returns the result?
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
df['c'] = df.a + df.b  # in place, fast, but cannot chain
df.sum()  # ...so it takes two lines of code
df.assign(c=df.a + df.b).sum()  # compact, but MUCH slower as assign() returns a copy of the df rather than assigning in place
.assign can take a callable that will accept the current state of the dataframe within a chain.
df = (
    pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
    .assign(c=lambda df: df["a"] + df["b"])
    .sum()
)
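To make "the current state of the dataframe within a chain" concrete, here is a small sketch with fixed values instead of random ones (the column e and the name original are illustrative assumptions). Note that assign() still returns a new DataFrame at each step, so this addresses chaining, not the copy cost raised in the question:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame(np.ones((5, 2)), columns=['a', 'b'])

totals = (
    original
    .assign(c=lambda d: d['a'] + d['b'])   # d is the frame so far: columns a, b
    .assign(e=lambda d: d['c'] * 2)        # d now also contains c
    .sum()
)

# The frame the chain started from is left as it was
original_columns = original.columns.tolist()
```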
I want to normalize some columns of a pandas data frame using MinMaxScaler in this way:
scaler = MinMaxScaler()
numericals = ["TX_TIME_SECONDS",'TX_Amount']
When I do it this way:
df.loc[:][numericals] = scaler.fit_transform(df.loc[:][numericals])
it is not done in place and df is not changed,
whereas when I do it this way:
df.loc[:, numericals] = scaler.fit_transform(df.loc[:][numericals])
the numerical columns of df are changed in place.
So, what's the difference between df.loc[:, ~] and df.loc[:][~]?
df.loc[:][numericals] is two separate operations: df.loc[:] selects all rows and returns a new object, and then [numericals] selects the columns "TX_TIME_SECONDS" and 'TX_Amount' of that returned object and assigns a value to them. The problem is that the returned object might be a copy, so the assignment may never reach the actual DataFrame.
The correct way of making this assignment is df.loc[:, numericals]. This is a single indexing call on df itself, with the row and column selectors passed together, so the assignment is applied to the original DataFrame.
I suggest you read some documentation because this is pretty basic.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
https://www.geeksforgeeks.org/python-pandas-dataframe-loc/
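A rough sketch of the difference, under a few assumptions: the sample values are made up, and a hand-written min-max computation stands in for MinMaxScaler so the snippet needs only pandas. The chained form is written out as the two steps it really consists of:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "TX_TIME_SECONDS": [0.0, 30.0, 60.0],
    "TX_Amount": [10.0, 20.0, 40.0],
    "label": ["a", "b", "c"],
})
numericals = ["TX_TIME_SECONDS", "TX_Amount"]

# Stand-in for MinMaxScaler().fit_transform(df[numericals])
block = df[numericals]
scaled = ((block - block.min()) / (block.max() - block.min())).to_numpy()

# Chained form, split into its two steps:
tmp = df.loc[:]             # step 1: builds an intermediate object
tmp[numericals] = scaled    # step 2: assigns into that intermediate object;
                            # whether df sees the change is not guaranteed

# Single call: row and column selectors go to one .loc assignment on df itself
df.loc[:, numericals] = scaled
```

After the last line, the two numeric columns of df hold the scaled values regardless of what the chained step did or did not do.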
I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019,df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
    data = data[['host_since','host_response_time','host_response_rate',
                 'host_acceptance_rate','host_is_superhost','host_total_listings_count',
                 'host_has_profile_pic','host_identity_verified',
                 'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
                 'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
                 'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
                 'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
                 'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
                 'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
                 'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
                 ]]
but it doesn't apply the filter to the data frame. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame; you just throw the result away, because each iteration only rebinds the name data and nothing else keeps a reference to the filtered frame.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
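As a small variation on the list above, here is a sketch that keeps the results in a dict so each filtered frame can still be looked up by name. The two tiny frames, their columns, and the dict keys are hypothetical stand-ins for the real listings data:

```python
import pandas as pd

# Hypothetical stand-ins for the real per-city DataFrames
df_spain = pd.DataFrame({'price': [100, 120], 'beds': [1, 2], 'notes': ['a', 'b']})
df_venice = pd.DataFrame({'price': [200, 90], 'beds': [3, 1], 'notes': ['c', 'd']})

cols = ['price', 'beds']
frames = {'spain': df_spain, 'venice': df_venice}

# Each selection produces a new DataFrame; storing it keeps it alive
filtered = {name: frame[cols] for name, frame in frames.items()}
```

The original frames are untouched, so df_spain still has all three columns while filtered['spain'] has only the selected two.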
As soon as you write var = new_value, you do not change the original object; you only make the variable refer to a new object.
If you want to change the dataframes in df_list_2019 themselves, you have to use a method that supports inplace=True. Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
])
for data in df_list_2019:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware: pandas experts recommend preferring the df = df. ... idiom over df...(..., inplace=True), because it allows chaining operations. So you should ask yourself whether @timgeb's answer can be used instead. Either way, this one should work for your requirements.
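To see the two loops side by side, here is a minimal sketch on two tiny made-up frames (the frames and the shortened keep set are illustrative assumptions, not the real listings data):

```python
import pandas as pd

# Hypothetical stand-ins for the real per-city DataFrames
df_spain = pd.DataFrame({'price': [100], 'beds': [1], 'notes': ['a']})
df_venice = pd.DataFrame({'price': [200], 'beds': [3], 'notes': ['c']})
keep = {'price', 'beds'}

# Rebinding: the loop variable points at a new frame, the originals are untouched
for data in [df_spain, df_venice]:
    data = data[['price', 'beds']]
cols_after_rebinding = df_spain.columns.tolist()

# In-place drop: the objects held in the list are themselves modified
for data in [df_spain, df_venice]:
    data.drop(columns=[c for c in data.columns if c not in keep], inplace=True)
cols_after_inplace = df_spain.columns.tolist()
```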
Problem: When I drop the column labelled 'Happiness_Score' below, it gets dropped in the parent DataFrame as well. This is not supposed to happen. Could someone clarify why it does?
A = df_new
A.drop('Happiness_Score', axis = 1, inplace = True)
This is the output: as you can see, the column gets dropped in df_new too. Doesn't inplace = True mean that it gets dropped only in the A DataFrame?
NOTE:
I'm able to workaround this by changing the code; now output is as expected.
B=df_new.drop('Happiness_Score', axis = 1)
Actually, when you do A = df_new you are not creating a copy of the DataFrame, just a second reference to the same object. To get an independent object, use A = df_new.copy().
Your workaround works too: drop without inplace=True returns a new DataFrame and leaves df_new untouched. Likewise, selecting a subset such as A = df_new[condition] produces a new object rather than another name for df_new.
A = df_new creates a new reference to your original df_new, and not a new copy. You are binding A to the same object that df_new refers to. And what happens when you modify an object through one of its references? The change is visible through every other reference to it. I'll illustrate this with an example.
orgList = [1,2,3,4,5]
bkpList = orgList
print(bkpList is orgList) #OUTPUT: True
This is because both variables point to the same list. Modify it through either name, and the change is visible through the other. The same thing happens in your dataframe case.
Solution: Keep a separate copy of your dataframe.
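The same idea on a DataFrame, as a minimal sketch (the two-row sample frame and the helper variable names are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data
df_new = pd.DataFrame({'Country': ['X', 'Y'], 'Happiness_Score': [7.5, 6.9]})

A = df_new                      # second name for the same object
same_object = A is df_new

B = df_new.copy()               # independent copy
B.drop('Happiness_Score', axis=1, inplace=True)
kept_in_original = 'Happiness_Score' in df_new.columns

A.drop('Happiness_Score', axis=1, inplace=True)
dropped_via_alias = 'Happiness_Score' not in df_new.columns
```

Dropping on the copy B leaves df_new alone, while dropping on the alias A removes the column from df_new as well, because A and df_new are one object.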
The variable A is a reference to df_new. Try creating A by doing a complete slice of df_new or df_new.copy().