I am writing a function to categorize data into bins in a df. As a first step, the function extracts the numbers from a string column and replaces the column of text with a column of numbers.
The function is somehow overwriting the original dataframe, despite me only manipulating a copy of it.
def categorizeColumns(df):
    newdf = df
    if 'Runtime' in newdf.columns:
        for row in range(len(newdf['Runtime'])):
            strRuntime = newdf['Runtime'][row]
            numsRuntime = [int(i) for i in strRuntime.split() if i.isdigit()]
            newdf.loc[row,'Runtime'] = numsRuntime[0]
    return newdf

df = pd.read_csv('moviesSeenRated.csv')
newdf = categorizeColumns(df)
The original df has a column of runtimes like this: [34 mins, 32 mins, 44 mins] etc., and the newdf should have [34, 32, 44], which it does. However, the original df also changes outside the function.
What's going on here? Any fixes? Thanks in advance.
EDIT: Seems like I wasn't making a copy, I needed to do
df.copy()
Thank you all!
The problem is that you aren't actually making a copy of the dataframe in the line newdf = df. To make a copy, you could do newdf = df.copy().
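To see the difference directly, here is a minimal sketch with a made-up three-row frame (the data is hypothetical, not the contents of moviesSeenRated.csv). Plain assignment binds a second name to the same object, while .copy() creates an independent one:

```python
import pandas as pd

# Hypothetical stand-in for the CSV data.
df = pd.DataFrame({"Runtime": ["34 mins", "32 mins", "44 mins"]})

alias = df                # second name for the same object
independent = df.copy()   # new object with its own data

print(alias is df)        # True
print(independent is df)  # False

# Writing through the copy leaves the original untouched.
independent.loc[0, "Runtime"] = "99 mins"
print(df.loc[0, "Runtime"])  # 34 mins

# Writing through the alias changes the original, because it is the original.
alias.loc[1, "Runtime"] = "99 mins"
print(df.loc[1, "Runtime"])  # 99 mins
```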
You are not making a copy of the dataframe. The assignment newdf = df only creates a second reference to the same object, so every change made through newdf also shows up in df.
You have to call .copy() on your dataframe:
import pandas as pd

def categorizeColumns(df):
    # Work on an independent copy so the caller's dataframe is left alone.
    newdf = df.copy()
    if 'Runtime' in newdf.columns:
        for row in range(len(newdf['Runtime'])):
            strRuntime = newdf['Runtime'][row]
            # Keep only the whitespace-separated tokens that are digits.
            numsRuntime = [int(i) for i in strRuntime.split() if i.isdigit()]
            newdf.loc[row,'Runtime'] = numsRuntime[0]
    return newdf

df = pd.read_csv('moviesSeenRated.csv')
newdf = categorizeColumns(df)
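As an aside, the row-by-row loop can usually be replaced by a vectorized string operation. The following is only a sketch, under the assumption that every entry contains a run of digits as in the '34 mins' examples; the sample frame and the snake_case function name are mine, not the original poster's:

```python
import pandas as pd

def categorize_columns(df):
    new_df = df.copy()
    if "Runtime" in new_df.columns:
        # Pull the first run of digits out of each string, then convert to int.
        digits = new_df["Runtime"].str.extract(r"(\d+)", expand=False)
        new_df["Runtime"] = digits.astype(int)
    return new_df

# Hypothetical sample data standing in for the CSV.
df = pd.DataFrame({"Runtime": ["34 mins", "32 mins", "44 mins"]})
new_df = categorize_columns(df)

print(new_df["Runtime"].tolist())  # [34, 32, 44]
print(df["Runtime"].tolist())      # ['34 mins', '32 mins', '44 mins']
```

If some rows might contain no digits at all, the astype(int) step would fail, so that case would need separate handling.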
I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019,df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
    data = data[['host_since','host_response_time','host_response_rate',
                 'host_acceptance_rate','host_is_superhost','host_total_listings_count',
                 'host_has_profile_pic','host_identity_verified',
                 'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
                 'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
                 'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
                 'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
                 'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
                 'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
                 'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
                 ]]
but it doesn't apply the filter to the data frame. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame, you just throw the result away by overriding what the name data points to.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
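If you also want to keep track of which result belongs to which city, a dict works just as well as a list. A minimal sketch with toy frames and a shortened, hypothetical column list (the dict keys are my own labels):

```python
import pandas as pd

# Toy stand-ins for the real listings data.
df_spain_2019 = pd.DataFrame({"price": [50, 80], "beds": [1, 2], "notes": ["a", "b"]})
df_venice_2019 = pd.DataFrame({"price": [120], "beds": [3], "notes": ["c"]})

cols = ["price", "beds"]  # shortened for illustration

frames_2019 = {"spain": df_spain_2019, "venice": df_venice_2019}
filtered = {name: frame[cols] for name, frame in frames_2019.items()}

print(list(filtered["spain"].columns))  # ['price', 'beds']
print(list(df_spain_2019.columns))      # ['price', 'beds', 'notes']
```

The original frames keep all their columns; only the entries in filtered are reduced.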
As soon as you write var = new_value, you do not change the original object; you only make the variable refer to a new object.
If you want to change the dataframes in df_list_2019 themselves, you have to use a method that works in place (inplace=True). Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
])
for data in df_list_2019:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware, pandas experts recommend preferring the df = df. ... idiom over the df...(..., inplace=True) one, because it allows chaining operations. So you should ask yourself whether @timgeb's answer could be used instead. Anyway, this one should work for your requirements.
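The difference between the two approaches can be checked on toy data. In this sketch (the frame contents and the short keep set are hypothetical), rebinding the loop variable leaves the original frames alone, whereas an in-place drop changes the very objects that the list and the original variable names refer to:

```python
import pandas as pd

# Hypothetical miniature frames.
df_a = pd.DataFrame({"price": [1], "beds": [2], "notes": ["x"]})
df_b = pd.DataFrame({"price": [3], "beds": [4], "notes": ["y"]})
frames = [df_a, df_b]
keep = {"price", "beds"}

# Rebinding: `data` points to a new object each time; df_a and df_b are unchanged.
for data in frames:
    data = data[[col for col in data.columns if col in keep]]
print(list(df_a.columns))  # ['price', 'beds', 'notes']

# In place: the objects themselves lose the unwanted columns.
for data in frames:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
print(list(df_a.columns))  # ['price', 'beds']
```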
I want to create a function in which part of the code modifies an existing pandas dataframe df, and under some conditions df should be emptied. The challenge is that this function is not allowed to return the dataframe itself; it can only modify df through the alias it receives. An example of this is the following function:
import pandas as pd
import random

def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup

testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that df = pd.DataFrame() doesn't work because the local name df is then bound to the new pd.DataFrame() instead of the input dataframe. So is there any way to change df in place into an empty dataframe?
Thank you in advance
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the dataframe is big enough (by big enough I mean 1 million+ total entries).
df = pd.DataFrame(columns=df.columns)
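will give you an empty dataframe with the same columns (and be way faster than using the drop method). Be aware that this rebinds the name df, so inside a function it does not empty the dataframe the caller passed in.
I believe that is what you're asking.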
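Since the question asks for the caller's object to be emptied from inside a function, here is a minimal sketch contrasting rebinding with in-place emptying. The function names and the toy frame are hypothetical, and no claim is made here about relative speed on large frames:

```python
import pandas as pd

def empty_by_rebinding(df):
    # Rebinds the local name only; the caller's object is untouched.
    df = pd.DataFrame(columns=df.columns)

def empty_rows_in_place(df):
    # Removes every row from the object the caller passed in; columns remain.
    df.drop(df.index, inplace=True)

def clear_in_place(df):
    # Removes every row and then every column from the caller's object.
    df.drop(df.index, inplace=True)
    df.drop(columns=df.columns, inplace=True)

a = pd.DataFrame({"col1": [True, False]})
empty_by_rebinding(a)
print(len(a))  # 2

b = pd.DataFrame({"col1": [True, False]})
empty_rows_in_place(b)
print(len(b), list(b.columns))  # 0 ['col1']

c = pd.DataFrame({"col1": [True, False]})
clear_in_place(c)
print(c.shape)  # (0, 0)
```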
I'm trying to work out the correct method for cycling through a number of pandas dataframes using a 'for loop'. All of them contain 'year' columns from 1960 to 2016, and from each df I want to remove the columns '1960' to '1995'.
I created a list of dfs and also a list of str values for the years.
dflist = [apass,rtrack,gdp,pop]
dfnewlist =[]
for i in range(1960, 1996):
    dfnewlist.append(str(i))
for df in dflist:
    df = df.drop(dfnewlist, axis = 1)
My for loop runs without error, but it does not remove the columns.
Edit - Just to add, when I do this manually without the for loop, such as below, it works fine:
gdp = gdp.drop(dfnewlist, axis = 1)
This is a common issue with for loops. When you write
for df in dflist:
and then reassign df, the change does not happen to the actual object in the list, only to the name df.
Use enumerate to fix it:
for i, df in enumerate(dflist):
    dflist[i] = df.drop(dfnewlist, axis=1)
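One caveat worth checking: writing back into the list updates dflist[i], but names such as gdp still refer to the old, unmodified frames. A small sketch with made-up two-year frames; the dict at the end is just one possible way to keep name-based access, not something the answer above prescribes:

```python
import pandas as pd

# Hypothetical miniature frames with only two year columns.
gdp = pd.DataFrame({"1995": [1.0], "1996": [2.0]})
pop = pd.DataFrame({"1995": [3.0], "1996": [4.0]})
dflist = [gdp, pop]
drop_cols = ["1995"]

for i, df in enumerate(dflist):
    dflist[i] = df.drop(drop_cols, axis=1)

print(list(dflist[0].columns))  # ['1996']
print(list(gdp.columns))        # ['1995', '1996']

# Keeping the frames in a dict lets you look up the trimmed versions by name.
frames = {"gdp": gdp, "pop": pop}
frames = {name: frame.drop(drop_cols, axis=1) for name, frame in frames.items()}
print(list(frames["gdp"].columns))  # ['1996']
```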
To ensure some robustness, you can use the errors='ignore' flag, so that the drop won't error out if one of the columns doesn't exist.
However, your real problem is that when you loop, df starts by referring to the thing in the list. But then you rebind the name df to the result of df.drop(dfnewlist, axis=1). This does not replace the dataframe in your list as you'd hoped; it only makes the name df point to a new object that is no longer the item in the list.
Instead, you can use the inplace=True flag.
drop_these = [*map(str, range(1960, 1996))]
for df in dflist:
    df.drop(drop_these, axis=1, errors='ignore', inplace=True)
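A short check of how errors='ignore' behaves, using hypothetical frames where one of them lacks a year column that appears in the drop list:

```python
import pandas as pd

full = pd.DataFrame({"1994": [1], "1995": [2], "1996": [3]})
partial = pd.DataFrame({"1995": [4], "1996": [5]})  # has no '1994' column
dflist = [full, partial]

drop_these = [*map(str, range(1994, 1996))]  # ['1994', '1995']

for df in dflist:
    # Labels that are missing from a frame are skipped instead of raising.
    df.drop(drop_these, axis=1, errors="ignore", inplace=True)

print(list(full.columns))     # ['1996']
print(list(partial.columns))  # ['1996']
```

Without errors='ignore', the drop on the second frame would raise a KeyError because '1994' is not among its columns.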
I can't understand why this snippet of code:
df = PA.DataFrame()
[df.append(aFunction(x)) for x in aPandaSeries]
does not give me the same DataFrame (df) as:
df = PA.DataFrame()
for x in xrange(len(aPandaSeries)):
    df = df.append(aFunction(aPandaSeries[x]))
I am trying to pythonise the second section by using the first section, but df has far fewer rows in the former than the latter.
A couple of things...
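DataFrame.append() does not modify df in place; it returns a new DataFrame (unlike list.append(), which changes the list and returns None). The list comprehension therefore throws every result away and df stays as it was, while the loop keeps the results because it reassigns df = df.append(...).
List comprehensions are meant for building a list of values, not for side effects, so you generally wouldn't call .append() inside one. It makes more sense to iterate over the Series directly and keep the reassignment:
for x in aPandaSeries:
    df = df.append(aFunction(x))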
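Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current versions, a common replacement is to collect the pieces in a list and call pd.concat once, which also brings the list comprehension back in a natural way. A sketch, where a_function and the sample Series are hypothetical stand-ins for aFunction and aPandaSeries:

```python
import pandas as pd

def a_function(x):
    # Hypothetical stand-in: returns a one-row frame for each value.
    return pd.DataFrame({"value": [x], "squared": [x * x]})

a_panda_series = pd.Series([1, 2, 3])

# Build all pieces first, then concatenate them in a single call.
pieces = [a_function(x) for x in a_panda_series]
df = pd.concat(pieces, ignore_index=True)

print(len(df))  # 3
```

Concatenating once at the end also avoids copying the growing frame on every iteration.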