I would like to solve the below problem
I have the below code. I need to pass in several data frames and apply the change to all of them at once
def reverse_df(*df):
    for x in df:
        x = x.loc[::-1].reset_index(level=0, drop=True)
    return
reverse_df(df1, df2, df3, df4, df5)
I am able to change a dataframe inside a function only when I use inplace=True, as below
def remove_na(*df):
    for x in df:
        x.dropna(axis=0, how='all', inplace=True)
    return
remove_na(df1, df2, df3, df4, df5)
but the below doesn't work
def remove_na(*df):
    for x in df:
        x = x.dropna(axis=0, how='all')
    return
remove_na(df1, df2, df3, df4, df5)
What am I doing wrong?
Short answer: x = x.dropna(axis=0, how='all') inside a function only rebinds the local name x to the new dataframe returned by dropna. The dataframe the caller passed in is never touched, so the change is lost as soon as the function returns.
To solve the particular case of reversing the dataframe you can do:
def reverse(df):
    df.reset_index(drop=False, inplace=True)
    df.sort_index(ascending=False, inplace=True)
    df.set_index('index', drop=True, inplace=True)
However, since inplace operations are not really inplace, you're probably better off returning a modified dataframe.
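A minimal sketch of the "return a modified dataframe" approach could look like the following. The helper name reversed_frames and the sample frames are made up for illustration; the point is only that the function returns new objects and the caller rebinds its own names to them.

```python
import pandas as pd


def reversed_frames(*frames):
    """Return a reversed, re-indexed copy of each frame (inputs are untouched)."""
    return [f.iloc[::-1].reset_index(drop=True) for f in frames]


# Hypothetical sample data, only for illustration.
df1 = pd.DataFrame({"a": [1, 2, 3]})
df2 = pd.DataFrame({"b": ["x", "y"]})

# Rebind the caller's names to the objects the function returned.
df1, df2 = reversed_frames(df1, df2)
```

The tuple unpacking on the last line is what makes the change visible to the caller; without it the returned list would simply be discarded.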
Related
I followed this code from user Lala la (https://stackoverflow.com/a/55803252/19896454)
to put 3 columns at the front and leave the rest unchanged. It works well inside the function, but when it returns the dataframe, the column order is lost.
My desperate solution was to put the code on the main program...
Other functions in my code are able to return modified versions of the dataframe with no problem.
Any ideas what is happening?
Thanks!
def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + (df.columns.drop(cols_to_order).tolist())
    df = df[new_columns]
    return df
Try using return df.reindex(new_columns, axis=1). Also keep in mind that DataFrame modifications are not in place unless you specify inplace=True, so you need to explicitly assign the result returned by your function to your df variable.
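As a rough illustration of that advice (the sample data is hypothetical, and the column names are taken from the question), the difference between discarding and assigning the function's result looks like this:

```python
import pandas as pd


def define_columns_order(df):
    cols_to_order = ["LINE_ID", "PARENT.CATEGORY", "CATEGORY"]
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    # reindex returns a new DataFrame; the caller's object is not modified.
    return df.reindex(new_columns, axis=1)


# Hypothetical data for illustration.
df = pd.DataFrame(
    {"VALUE": [1], "CATEGORY": ["c"], "LINE_ID": [10], "PARENT.CATEGORY": ["p"]}
)

define_columns_order(df)       # result discarded: df keeps its old column order
order_before = df.columns.tolist()

df = define_columns_order(df)  # result assigned: df now has the new order
order_after = df.columns.tolist()
```

If the column order appears to be "lost" after the call, the most likely cause is the first form: the function was called but its return value was never assigned.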
From experience, some pandas functions require that I reassign the dataframe if I want their effect to stick, because by default they return a modified copy instead of changing the dataframe. For example: df.drop("ColA", axis=1) on its own will not actually drop the column; I need to write df = df.drop("ColA", axis=1), or df.drop("ColA", axis=1, inplace=True), if I want to modify the dataframe.
This seems to be the case with some other pandas functions. Therefore, what I usually do is redefine a dataframe for every function so that I can ensure it is modified. For example:
df = df.set_index("id")
df = df.sort_values(by="Date")
df["B"] = df["B"].fillna(-1)
df = df.reset_index(drop = True)
df["ColA"] = df["ColA"].astype(str)
I know some of these functions do not require redefining the dataframe, but I just do it to make sure the changes are applied. My question is if there is a way to know which functions require redefining the dataframe and which don't need it, and also if there is any computational difference between using df = df.set_index("id") and df.set_index("id") if they have the same output.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)?
My question is if there is a way to know which functions require redefining the dataframe and which don't need it
It's called the manual.
set_index() has an inplace parameter; if that's set to True, you won't need to reassign.
sort_values() has that too.
fillna() has that too.
reset_index() has that too.
astype() has copy=True by default, but heed the warning about setting it to False:
"be very careful setting copy=False as changes to values then may propagate to other pandas objects"
if there is any computational difference between
Yes – if Pandas is able to make the changes in-place, it won't need to copy the series or dataframe, which could be a significant time and memory expense with large dataframes.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)?
Yes, there is. The first reassigns a series into a dataframe, the other just assigns the single series into the (now misnamed) name df
There is a long discussion about this on the pandas GitHub.
I also agree that it is best not to use inplace, because it is confusing and it is not clear how or when it actually saves memory.
Should I redefine a pandas dataframe with every function?
I think yes; large DataFrames may be an exception (link).
There is also a list of methods with an inplace parameter.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)
If you use df["B"] = df["B"].fillna(-1), column B (a Series) is assigned back into the DataFrame with its missing values replaced by -1.
If you use df = df["B"].fillna(-1), the call returns a Series with the replaced values, but it is assigned to df, so the original DataFrame is overwritten by this Series.
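The two assignments can be compared side by side in a small sketch. The data and the extra name kept are hypothetical; kept exists only so the original DataFrame stays reachable after df is rebound.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [1, 2], "B": [None, 5.0]})

# Column assignment: df stays a DataFrame, only column B is replaced.
df["B"] = df["B"].fillna(-1)
kept = df

# Rebinding the name: df now refers to a Series; the DataFrame is only
# reachable through other references (here, `kept`).
df = df["B"].fillna(-1)
```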
I don't think there is a solution for this. Some methods work inplace by default and some others return a copy of the df and you need to reassign the df as you usually do. The best option is to check the docs (for the inplace parameter) every time you want to use some method, and you will learn by practice, at least the most common ones, like sorting, resetting the index, etc.
What does the inplace parameter of replace() and drop() methods do?
I didn't manage to understand from the docs.
Example:
df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?',-99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
If you pass the parameter inplace=False, it will create a new DataFrame on which the operation has been performed.
If you pass the parameter inplace=True, it will apply the operation directly on the DataFrame you're working on. Hence, the following lines are doing the same thing (conceptually):
df.replace('?',-99999, inplace=True)
df = df.replace('?', -99999, inplace=False)
Using the inplace version allows you to work on a single DataFrame. Using the other version allows you to create a new DataFrame on which you can work while keeping the original one, like this:
df_dropped = df.replace('?', -99999, inplace=False)
Without inplace, df.replace('?', -99999) creates a new dataframe which is just like df, but with '?' replaced by -99999; df is not changed. With inplace=True, df itself is changed.
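A small sketch can make the difference concrete. The data is hypothetical, and a numeric placeholder (-1) stands in for the '?' from the question so that the column keeps a single dtype. Note that a call with inplace=True returns None.

```python
import pandas as pd

# Hypothetical data for illustration; -1 plays the role of the '?' placeholder.
df = pd.DataFrame({"id": [1, 2], "value": [-1, 7]})

# Default (inplace=False): a new DataFrame is returned, df is unchanged.
replaced = df.replace(-1, -99999)

# inplace=True: df itself is modified and the call returns None.
result = df.drop(["id"], axis=1, inplace=True)
```

Because the in-place call returns None, writing df = df.drop(..., inplace=True) would leave df set to None, which is a common source of confusion.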
I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019,df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
    data = data[['host_since','host_response_time','host_response_rate',
                 'host_acceptance_rate','host_is_superhost','host_total_listings_count',
                 'host_has_profile_pic','host_identity_verified',
                 'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
                 'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
                 'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
                 'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
                 'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
                 'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
                 'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
                 ]]
but it doesn't apply the filter to the data frame. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame, you just throw the result away by overriding what the name data points to.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
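One way to keep track of which result belongs to which table is to store the frames in a dict instead of a list. This is only a sketch: the frame names, the dict keys and the shortened column list are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the real listing tables.
df_spain = pd.DataFrame({"price": [10], "beds": [1], "notes": ["a"]})
df_venice = pd.DataFrame({"price": [20], "beds": [2], "notes": ["b"]})

cols = ["price", "beds"]
frames = {"spain": df_spain, "venice": df_venice}

# Build new, filtered objects and keep them under the same keys.
filtered = {name: frame[cols] for name, frame in frames.items()}
```

The original frames are left untouched; the filtered versions are looked up by key, for example filtered["spain"].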
As soon as you write var = new_value, you do not change the original object; you only make the variable refer to a new object.
If you want to change the dataframes from df_list_2019, you have to use an inplace=True method. Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
])
for data in df_list_2019:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware, pandas experts recommend preferring the df = df. ... idiom over df...(..., inplace=True), because it allows chaining the operations. So you should ask yourself whether @timgeb's answer can be used instead. Anyway, this one should work for your requirements.
Is there a fluent setter for index? Something like df.with_index(other_index), which would keep the data (the np.array) as is, but replaces the index with other_index?
A non-fluent way (that modifies an existing DataFrame) is:
df.index = other_index
I found a way that doesn't affect the original, can be generated on-the-fly without temporary variables (so, kind of fluent) and is a shallow copy (it doesn't duplicate the data itself; the same unique np.array is shared by both df and the result). However, it is a bit verbose:
def with_index(df, index):
    return pd.DataFrame(data=df.values, index=index, columns=df.columns)
Alternatively:
def with_axes(df, index=None, columns=None):
    if index is None:
        index = df.index
    if columns is None:
        columns = df.columns
    return pd.DataFrame(data=df.values, index=index, columns=columns)
Is there some method that I missed that does that? I tried df.assign(index=other_index), but it just creates a new column called 'index'... And of course, df.reindex(), df.replace(), df.set_index() do different things.
Doh. I just found it:
df.set_axis(other_index, inplace=False)
(Apparently, in a future pandas release, inplace=None will default to False, instead of the current default of True, as of 0.25.1.)
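For readers on a newer pandas, here is a sketch of the same idea. It assumes a version in which set_axis returns a new object by default (to my knowledge 1.0 and later; as far as I know, recent releases no longer accept the inplace argument at all), so the call below leaves inplace out. The data and labels are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2, 3]})
other_index = ["x", "y", "z"]

# set_axis returns a new DataFrame with the given labels; df keeps its index.
relabeled = df.set_axis(other_index, axis=0)

# Because it returns a DataFrame, it can sit in the middle of a method chain.
total = df.set_axis(other_index, axis=0).loc["y":"z", "a"].sum()
```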