I would like to use assign() to create new columns by method chaining (which is an elegant way of expressing a sequence of operations on a dataframe), but I can't seem to find a way to do this without creating a copy, which is much slower than doing it in place due to the associated memory allocation. Is it possible to do this with a simple method that modifies the dataframe in place and returns the result?
For example:
df = pd.DataFrame(np.random.randn(5,2), columns=['a', 'b'])
df['c'] = df.a + df.b  # in place, fast, but cannot chain
df.sum()  # ...takes two lines of code
df.assign(c=df.a + df.b).sum()  # compact, but MUCH slower: assign() returns a copy of the df rather than assigning in place
.assign can take a callable that receives the current state of the dataframe at that point in the chain:
df = (
    pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
    .assign(c=lambda df: df["a"] + df["b"])
    .sum()
)
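The lambda is what makes this work mid-chain: the intermediate dataframe produced by the previous step has no name yet, so the callable is the only way to refer to it. As an aside, under Copy-on-Write (the default behavior in pandas 3.0), assign no longer eagerly copies the underlying data, so the memory cost of chaining is smaller than it once was.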
Related
From experience, some pandas functions require that I redefine the dataframe if I intend to keep the changes, because otherwise they return a modified copy rather than changing the dataframe in place. For example: df.drop("ColA", axis=1) will not actually drop the column; I need to write df = df.drop("ColA", axis=1), or df.drop("ColA", axis=1, inplace=True), if I want to modify the dataframe.
This seems to be the case with some other pandas functions as well. Therefore, what I usually do is redefine the dataframe for every function so that I can ensure it is modified. For example:
df = df.set_index("id")
df = df.sort_values(by="Date")
df["B"] = df["B"].fillna(-1)
df = df.reset_index(drop=True)
df["ColA"] = df["ColA"].astype(str)
I know some of these functions do not require redefining the dataframe, but I just do it to make sure the changes are applied. My question is whether there is a way to know which functions require redefining the dataframe and which don't, and also whether there is any computational difference between using df = df.set_index("id") and df.set_index("id", inplace=True) when they have the same output.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)?
My question is if there is a way to know which functions require redefining the dataframe and which don't need it
It's called the manual.
set_index() has an inplace=True parameter; if that's set, you won't need to reassign.
sort_values() has that too.
fillna() has that too.
reset_index() has that too.
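For illustration, a minimal sketch of the two idioms using sort_values (the data here is made up):
import pandas as pd

df = pd.DataFrame({"id": [2, 1], "Date": ["2020-02-01", "2020-01-01"]})

# Reassignment idiom: the method returns a modified copy.
df = df.sort_values(by="Date")

# inplace idiom: the method mutates df and returns None.
df.sort_values(by="Date", inplace=True)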
astype() has copy=True by default, but heed the warning setting it to False:
"be very careful setting copy=False as changes to values then may propagate to other pandas objects"
if there is any computational difference between
Yes – if Pandas is able to make the changes in-place, it won't need to copy the series or dataframe, which could be a significant time and memory expense with large dataframes.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)?
Yes, there is. The first assigns a Series back into a column of the dataframe; the other just binds that single Series to the (now misnamed) name df.
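A quick sketch of the difference (toy data, just for illustration):
import pandas as pd

df = pd.DataFrame({"A": [1.0], "B": [None]})

df["B"] = df["B"].fillna(-1)  # df is still a DataFrame; only column B is updated
df = df["B"].fillna(-1)       # df is now a Series; the rest of the frame is gone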
There is a long discussion about this on the pandas GitHub; check this.
I also agree it is best not to use inplace, because it is confusing and it is not clear how or when it actually saves memory.
Should I redefine a pandas dataframe with every function?
I think yes, though if you work with large DataFrames there may be exceptions; link.
There is a list of methods with an inplace parameter.
Also is there a difference between df["B"] = df["B"].fillna(-1) and df = df["B"].fillna(-1)
If use df["B"] = df["B"].fillna(-1) it reassign column B (Series) back with replaced missing values to -1.
If use df = df["B"].fillna(-1) it return Series with replaced values, but it is reassigned to df, so original DataFrame is overwitten by this replaced Series.
I don't think there is a general solution for this. Some methods work in place by default and some others return a copy of the df, so you need to reassign the df as you usually do. The best option is to check the docs (for the inplace parameter) every time you want to use some method, and you will learn the most common ones by practice: sorting, resetting the index, etc.
I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy-paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019,df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
    data = data[['host_since', 'host_response_time', 'host_response_rate',
                 'host_acceptance_rate', 'host_is_superhost', 'host_total_listings_count',
                 'host_has_profile_pic', 'host_identity_verified',
                 'neighbourhood', 'neighbourhood_cleansed', 'zipcode', 'latitude', 'longitude', 'property_type', 'room_type',
                 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'weekly_price',
                 'monthly_price', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights',
                 'minimum_nights_avg_ntm', 'has_availability', 'availability_30', 'availability_60', 'availability_90',
                 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'review_scores_rating',
                 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value',
                 'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'reviews_per_month']]
but it doesn't apply the filter to the dataframes. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame; you just throw the result away by rebinding the name data to the new object.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
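If you need the filtered frames back under their original names, you can unpack the list afterwards (a sketch using the names from the question):
df_spain_2019, df_amsterdam_2019, df_venice_2019, df_sicily_2019 = filtered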
As soon as you write var = new_value, you do not change the original object; you just make the variable refer to a new object.
If you want to change the dataframes from df_list_2019, you have to use an inplace=True method. Here, you could use drop:
keep = set(['host_since', 'host_response_time', 'host_response_rate',
            'host_acceptance_rate', 'host_is_superhost', 'host_total_listings_count',
            'host_has_profile_pic', 'host_identity_verified',
            'neighbourhood', 'neighbourhood_cleansed', 'zipcode', 'latitude', 'longitude', 'property_type', 'room_type',
            'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'weekly_price',
            'monthly_price', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights',
            'minimum_nights_avg_ntm', 'has_availability', 'availability_30', 'availability_60', 'availability_90',
            'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'review_scores_rating',
            'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value',
            'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'reviews_per_month'])

for data in df_list_2019:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware: pandas experts recommend preferring the df = df. ... idiom over df...(..., inplace=True) because it allows chaining operations. So you should ask yourself whether #timgeb's answer can be used instead. Anyway, this one should work for your requirements.
The function that I'm applying is a little expensive, so I want it to calculate the value only once for each unique value.
The only solution I've been able to come up with has been as follows:
This step is needed because apply doesn't work on arrays, so I have to convert the unique values into a Series:
new_vals = pd.Series(data['column'].unique()).apply(function)
This one is needed because .merge has to be used on DataFrames:
new_dataframe = pd.DataFrame(index=data['column'].unique(), data=new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert the result to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution, I guess I could just add this function to the library of common tools that I hedonistically pull in with from me_tools import *:
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values,
                                 index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                    .drop_duplicates()
                    .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original (calling) DataFrame to another (passed) DataFrame that contains two columns: the original column and the new column derived from it.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
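For example, reusing the names from the earlier question (this is just a sketch of the call; data, 'column', and function are whatever you already have):
data = apply_unique(data, 'column', 'calculated_column', function)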
Edit:
As an aside: it's best to drop new_col if it already exists; otherwise the merge will append suffixes to each new_col:
if new_col in df:
    df = df.drop(new_col, axis='columns')
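If you would rather avoid the merge entirely, a compact variant of the same unique-then-broadcast idea (a sketch, not part of the original answer) is to compute func once per unique value into a dict and broadcast it with Series.map:
mapping = {v: func(v) for v in df[orig_col].unique()}
df[new_col] = df[orig_col].map(mapping)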
I have an hierarchical dataset:
df = pd.DataFrame(np.random.rand(6, 6),
                  columns=[['A', 'A', 'A', 'B', 'B', 'B'],
                           ['mean', 'max', 'avg'] * 2],
                  index=pd.date_range('20000103', periods=6))
I want to apply a function to all values under column A. I can set the values to something:
df.loc[slice(None), 'A'] = 1
Easy enough. Now, instead of assigning a value, if I want to apply a mapping to this MultiIndex slice, it does not work.
For example, let me apply a simple formatting statement:
df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
This step works fine. However, I cannot assign this to the original df:
df.loc[slice(None), 'A'] = df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
Everything turns into a NaN. Any help would be appreciated.
You can do it in a couple of ways:
df['A'] = df['A'].applymap('{:.2f}'.format)
or (this will keep the original dtype)
df['A'] = df['A'].round(2)
or as a string
df['A'] = df['A'].round(2).astype(str)
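One version note: DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so on recent versions the first option becomes:
df['A'] = df['A'].map('{:.2f}'.format)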
I have an apply function that operates on each row in my dataframe. The result of that apply function is a new value. This new value is intended to go in a new column for that row.
So, after applying this function to all of the rows in the dataframe, there will be an entirely new column in that dataframe.
How do I do this in pandas?
Two ways primarily:
df['new_column'] = df.apply(my_fxn, axis=1)
or
df = df.assign(new_column=df.apply(my_fxn, axis=1))
If you need to use other arguments, you can pass them to the apply function, but sometimes it's easier (for me) to just use a lambda:
df['new_column'] = df.apply(lambda row: my_fxn(row, global_dict), axis=1)
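Alternatively, apply can forward extra positional arguments itself through its args parameter, which is equivalent to the lambda above:
df['new_column'] = df.apply(my_fxn, axis=1, args=(global_dict,))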
Additionally, if your function can operate on arrays in a vectorized fashion, you could just do:
df['new_column'] = my_fxn(df['col1'], df['col2'])
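Putting it all together, a self-contained sketch with a made-up my_fxn (all names here are illustrative):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20]})

def my_fxn(a, b):
    # Vectorized: operates on whole Series at once, no per-row Python loop.
    return a + b

df['new_column'] = my_fxn(df['col1'], df['col2'])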