Is there a way to make changing DataFrame faster in a loop? - python

for index, row in df.iterrows():
    print(index)
    name = row['name']
    new_name = get_name(name)
    row['new_name'] = new_name
    df.loc[index] = row
In this piece of code, my testing shows that the last line makes it really slow: it inserts the new column row by row. Maybe I should store all the 'new_name' values in a list and update the df outside of the loop?

Use Series.apply to run the function on each value of the column; it is much faster than iterrows:
df['new_name'] = df['name'].apply(get_name)
If you want to improve performance further, you need to change the function itself so it can be vectorized, if possible, but that depends on the function.

df['new_name'] = df.apply(lambda row: get_name(row['name']), axis=1)
.apply isn't a best practice performance-wise, but I am not sure there is a better option here when get_name is an arbitrary Python function.
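To make the difference concrete, here is a minimal runnable sketch of the apply approach; get_name and the data are made up for illustration (the original question doesn't show them):

```python
import pandas as pd

# Hypothetical stand-in for the question's get_name function.
def get_name(name):
    return name.upper()

df = pd.DataFrame({'name': ['alice', 'bob', 'carol']})

# Column-wise apply: one pass over the column, no per-row DataFrame writes.
df['new_name'] = df['name'].apply(get_name)

# If the function happens to be vectorizable (here: plain upper-casing),
# the pandas string accessor avoids Python-level calls entirely.
df['new_name_vec'] = df['name'].str.upper()
```

Both approaches avoid the slow `df.loc[index] = row` write-back; which one applies depends on whether the real get_name can be expressed with vectorized pandas operations.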

How to get the full dataframe using lambda function in python?

I have a loop logic using iterrows but the performance is bad
result = []
for index, row in df_test.iterrows():
    result.append(product_recommendation_model.predict(df_test.iloc[[index]]))
submission = pd.DataFrame({'ID': df_test['ID'],
                           'Result': result})
display(submission)
I would like to rewrite it using apply with a lambda, but I don't know how to pass each row to the model as a data frame:
a = df_test.apply(lambda x: product_recommendation_model.predict(df_test.iloc[[x]]), axis=1)
Can anyone help me please? Thanks.
I think this works for you:
df_new = df_test.apply(lambda row: pd.Series([row['ID'], product_recommendation_model.predict(row)]), axis=1)
df_new.columns = ['ID', 'Result']
Note: you can also pass a single column value to predict via row[column_name]; row itself carries all column values of the row.
Finally, I can run it with the code below; i.to_frame().T converts each row Series back into a one-row DataFrame, which is what the model's predict expects:
df_test.apply(lambda i: product_recommendation_model.predict(i.to_frame().T), axis=1)
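As a runnable sketch of both variants, assuming product_recommendation_model has a scikit-learn-style predict(DataFrame) interface (the model and data below are made up; the real model is not shown in the question):

```python
import pandas as pd

# Hypothetical stand-in for product_recommendation_model: any object with
# a predict(DataFrame) -> array-like method.
class DummyModel:
    def predict(self, X):
        return (X['score'] > 0.5).astype(int).to_numpy()

model = DummyModel()
df_test = pd.DataFrame({'ID': [1, 2, 3], 'score': [0.2, 0.9, 0.7]})

# Row-by-row, as in the accepted answer: to_frame().T rebuilds a 1-row frame.
per_row = df_test.apply(lambda i: model.predict(i.to_frame().T)[0], axis=1)

# Usually far faster: a single predict call over the whole frame.
batch = model.predict(df_test)

submission = pd.DataFrame({'ID': df_test['ID'], 'Result': batch})
```

If the model accepts a whole DataFrame at once (most scikit-learn-style models do), the single batch call is the better choice; the per-row apply only makes sense when predictions genuinely must happen one row at a time.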

How to filter multiple dataframes in a loop?

I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019, df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
    data = data[['host_since', 'host_response_time', 'host_response_rate',
                 'host_acceptance_rate', 'host_is_superhost', 'host_total_listings_count',
                 'host_has_profile_pic', 'host_identity_verified',
                 'neighbourhood', 'neighbourhood_cleansed', 'zipcode', 'latitude', 'longitude', 'property_type', 'room_type',
                 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'weekly_price',
                 'monthly_price', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights',
                 'minimum_nights_avg_ntm', 'has_availability', 'availability_30', 'availability_60', 'availability_90',
                 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'review_scores_rating',
                 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value',
                 'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'reviews_per_month']]
but it doesn't apply the filter to the data frame. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame; you just throw the result away by rebinding the name data inside the loop.
You need to store the results somewhere, a list for example:
cols = ['host_since', 'host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
As soon as you write var = new_value, you do not change the original object; you make the variable refer to a new object.
If you want to change the dataframes in df_list_2019 in place, you have to use a method with inplace=True. Here, you could use drop:
keep = set(['host_since', 'host_response_time', 'host_response_rate',
            'host_acceptance_rate', 'host_is_superhost', 'host_total_listings_count',
            'host_has_profile_pic', 'host_identity_verified',
            'neighbourhood', 'neighbourhood_cleansed', 'zipcode', 'latitude', 'longitude', 'property_type', 'room_type',
            'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'weekly_price',
            'monthly_price', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights',
            'minimum_nights_avg_ntm', 'has_availability', 'availability_30', 'availability_60', 'availability_90',
            'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'review_scores_rating',
            'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value',
            'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'reviews_per_month'])
for data in df_list_2019:
    data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware: pandas experts recommend preferring the df = df.method(...) idiom over df.method(..., inplace=True), because it allows chaining operations. So ask yourself whether #timgeb's answer can be used instead. Anyway, this one should work for your requirements.
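If the frames should stay addressable by name rather than by list position, a dict comprehension is another way to keep the filtered results together. A minimal sketch with made-up miniature dataframes and an abbreviated column list:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the per-city dataframes.
df_spain_2019 = pd.DataFrame({'host_since': ['2015'], 'price': [80], 'extra': ['x']})
df_venice_2019 = pd.DataFrame({'host_since': ['2017'], 'price': [120], 'extra': ['y']})

cols = ['host_since', 'price']  # abbreviated column list for illustration

# Filter every frame once, keeping each result under its city name.
filtered = {name: df[cols]
            for name, df in {'spain': df_spain_2019,
                             'venice': df_venice_2019}.items()}
```

This keeps the originals untouched and sidesteps the rebinding pitfall entirely, since each filtered copy is stored under its own key.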

Python does not update dataframe while iterating over rows

I don't get why python won't update my dataframe object:
The code snippet is this:
for index, row in df.iterrows():
    t = df.loc[index, :"score"]
    b = [float(i) for i in t if i != 's']
    m = sum(b) / len(b)
    df.at[index, "score"] = m
    print(df.at[index, "score"])  # does not print m; prints 0, the default value
The thing that this snippet should do is get all the values in a row, compute the average and then add this average to the dataframe.
Iterating over rows in a DataFrame is very seldom the way to go.
Instead, use
df.loc[:, :'score'].mean(axis='columns')
which is more readable and much faster.
To answer your question directly (why your way doesn't work) we would need more information (see the comments).
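Assuming the 's' entries mark missing values, as the loop's `if i != 's'` filter suggests, a vectorized sketch can coerce them to NaN so the row mean skips them automatically (the toy data here is made up):

```python
import pandas as pd

# Toy frame: 's' marks a missing entry, as the question's filter suggests.
df = pd.DataFrame({'a': [1, 's'], 'b': [3, 5], 'score': [0, 0]})

# Coerce non-numeric entries to NaN, then take the row-wise mean over the
# same column slice the loop used; NaNs are skipped automatically,
# matching the `if i != 's'` filter.
vals = df.loc[:, :'score'].apply(pd.to_numeric, errors='coerce')
df['score'] = vals.mean(axis='columns')
```

Like the original loop, this slice includes the initial 'score' column in the average; drop it from the slice if that is not intended.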

How to use Apply() and self defined function to change data in DataFrame?

What is the easiest way to make some changes in the index column of different rows in a DataFrame ?
def fn(country):
    if any(char.isdigit() for char in country):
        return country[:-2]
    else:
        return country

df.loc["Country"].apply(fn, axis=1)
I can't test right now. Can you try df['Country'] = df.apply(lambda row: fn(row), axis=1) and change your function argument to take the row into account (e.g. use row['Country'] inside fn)? This way you can manipulate anything you want row by row, using other column values as well.
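If only the Country column is needed, a plain Series.apply avoids the row-wise lambda entirely. A minimal runnable sketch with made-up data, reusing the question's fn:

```python
import pandas as pd

def fn(country):
    # The question's fn: drop the last two characters when the
    # name contains any digit, e.g. 'France12' -> 'France'.
    if any(char.isdigit() for char in country):
        return country[:-2]
    return country

df = pd.DataFrame({'Country': ['France12', 'Spain', 'Italy07']})

# Series.apply calls fn once per value; no axis argument is needed
# (Series.apply has none, which is why df.loc["Country"].apply(fn, axis=1)
# in the question fails).
df['Country'] = df['Country'].apply(fn)
```

The row-wise df.apply(..., axis=1) form is only needed when fn has to look at several columns of the row at once.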

append to empty dataframe iteratively

I have a for loop. At each iteration a dataframe is created. I want this dataframe to be appended to an overall result dataframe.
Currently I tried to do it with this code:
resultDf = pd.DataFrame()
for name in list:
    iterationresult = calculatesomething(name)
    resultDf.append(iterationresult)
print(resultDf)
However, the resultDf is empty.
How can this be done?
UPDATE
I think changing
resultDf.append(iterationresult)
to
resultDf = resultDf.append(iterationresult)
does the trick: append does not modify the frame in place, it returns a new DataFrame.
Not iterative, but how about simply:
df = pd.DataFrame([calculatesomething(name) for name in list])
This is much more straightforward, and faster as well.
Another idiomatic idea could be to do this:
df = pd.DataFrame(list, columns = ["name"])
df["calc"] = df.name.map(calculatesomething)
By the way, it's bad practice to call a list list, because it shadows the builtin type.
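Note that DataFrame.append was deprecated and then removed in pandas 2.0; collecting the pieces in a list and concatenating once at the end is the usual replacement, and also avoids the quadratic copying of repeated appends. A sketch with a made-up calculatesomething (the real one is not shown in the question):

```python
import pandas as pd

# Made-up stand-in for the question's calculatesomething(name).
def calculatesomething(name):
    return pd.DataFrame({'name': [name], 'length': [len(name)]})

names = ['alice', 'bob']  # avoid shadowing the builtin `list`

# Build all pieces first, then concatenate once at the end.
resultDf = pd.concat([calculatesomething(n) for n in names],
                     ignore_index=True)
```

ignore_index=True renumbers the rows 0..n-1 instead of keeping each piece's own index.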
