How to make get_dummies work in place? - python

I apply get_dummies on my DataFrame to generate dummy variables. It creates a new DataFrame. How can I change my original DataFrame instead?
This works, but is there a better way?
import pandas as pd
data = pd.DataFrame({'gender': [ 'female', 'male']})
data1 = pd.get_dummies(data, columns = ['gender'])
# data is still unchanged
data.drop(data.columns, inplace=True, axis=1)
data[data1.columns] = data1

In your code, you are creating a new dataframe, then removing all of the data from the old dataframe, and then putting the new data back into the old dataframe.
Instead of your last three lines of code, you can just say:
data = pd.get_dummies(data, columns = ['gender'])
The get_dummies function creates a new dataframe and saves it in the place of the old one. This is functionally the same as your code, but it is much easier to understand.

Related

Apply function to multiple dataframes and create multiple dataframes from that

I have a list of multiple data frames on cryptocurrency. So I want to apply a function to all of these data frames, which should convert all the data frames so that I am only left with data from 2021.
The function looks like this:
dataframe_list = [bitcoin, have, binance, Cardano, chainlink, cosmos, crypto com, dogecoin, eos, Ethereum, iota, litecoin, monero, nem, Polkadot, Solana, stellar, tether, uni swap, usdcoin, wrapped, xrp]
def date_func(i):
i['Date'] = pd.to_datetime(i['Date'])
i = i.set_index(i['Date'])
i = i.sort_index()
i = i['2021-01-01':]
return(i)
for dataframe in dataframe_list:
dataframe = date_func(dataframe)
However, I am only left with one data frame called 'dataframe', which only contains values of the xrp dataframe.
I would like to have a new dataframe from each dataframe, called aave21, bitcoin21 .... which only contains values from 2021 onwards.
What am I doing wrong?
Best regards and thanks in advance.
You are overwriting dataframe when iterating over dataframe_list, i.e. you only keep the latest dataframe.
You can either try:
dataframe = pd.DataFrame()
for df in dataframe_list:
dataframe.append(date_func(df))
Or shorter:
dataframe = pd.concat([data_func(df) for df in dataframe_list])
You are overwriting dataframe variable in your for loop when iterating over dataframe_list. You need to keep appending results into a new variable.
final_df = pd.DataFrame()
for dataframe in dataframe_list:
final_df.append(date_func(dataframe))
print(final_df)

The Drop() method appears to do nothing to my dataframe, but no error is given

I am using drop() to try to remove a row from a dataframe, based on an index. Nothing seems to happen. I get lots of errors if I play with the syntax, but the below example yields the same dataframe I started with.
data= {"col1":[1, 3, 3,3],"col2":[4,5,6,4],"col3":[7,6,6,8]}
testdf2 = pd.DataFrame(data)
testdf2.drop([1])
testdf2
I assume I'm missing something obvious?
When using drop, you must reassign to your df (Or create a new one).
import pandas as pd
data= {"col1":[1, 3, 3,3],"col2":[4,5,6,4],"col3":[7,6,6,8]}
testdf2 = pd.DataFrame(data)
testdf2 = testdf2.drop([1])
testdf2
Alternatively, supply inplace.
testdf2.drop([1], inplace=True)
However, this could lead to complications regarding view / copy and I usually reassign.
Hope that helps!

How to filter multiple dataframes in a loop?

I have a lot of dataframes and I would like to apply the same filter to all of them without having to copy paste the filter condition every time.
This is my code so far:
df_list_2019 = [df_spain_2019,df_amsterdam_2019, df_venice_2019, df_sicily_2019]
for data in df_list_2019:
data = data[['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
]]
but it doesn't apply the filter to the data frame. How can I change the code to do that?
Thank you
The filter (column selection) is actually applied to every DataFrame, you just throw the result away by overriding what the name data points to.
You need to store the results somewhere, a list for example.
cols = ['host_since','host_response_time', ...]
filtered = [df[cols] for df in df_list_2019]
As soon as you write var = new_value, you do not change the original object but have the variable refering a new object.
If you want to change the dataframes from df_list_2019, you have to use an inplace=True method. Here, you could use drop:
keep = set(['host_since','host_response_time','host_response_rate',
'host_acceptance_rate','host_is_superhost','host_total_listings_count',
'host_has_profile_pic','host_identity_verified',
'neighbourhood','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type',
'accommodates','bathrooms','bedrooms','beds','amenities','price','weekly_price',
'monthly_price','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights',
'minimum_nights_avg_ntm','has_availability','availability_30','availability_60','availability_90',
'availability_365','number_of_reviews','number_of_reviews_ltm','review_scores_rating',
'review_scores_checkin','review_scores_communication','review_scores_location', 'review_scores_value',
'instant_bookable','is_business_travel_ready','cancellation_policy','reviews_per_month'
])
for data in df_list_2019:
data.drop(columns=[col for col in data.columns if col not in keep], inplace=True)
But beware, pandas experts recommend to prefere the df = df. ... idiom to the df...(..., inplace=True) because it allows chaining the operations. So you should ask yourself if #timgeb's answer cannot be used. Anyway this one should work for your requirements.

Efficient way to empty a pandas dataframe using its mutable alias property

So I want to create a function in which a part of the codes modifies an existing pandas dataframe df and under some conditions, the df will be modified to empty. The challenge is that this function is now allwoed to return the dataframe itself; it can only modify the df by handling the alias. An example of this is the following function:
import pandas as pd
import random
def random_df_modifier(df):
letter_lst = list('abc')
message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
df[chosen_tup[0]] = chosen_tup[1]
if chosen_tup[0] == letter_lst[-1]:
print('Game over')
df = pd.DataFrame()#<--this line won't work as intended
return chosen_tup
testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware of the reason df = pd.DataFrame() won't work is because the local df is now associated with the pd.DataFrame() instead of the mutable alias of the input dataframe. so is there any way to change the df inplace to an empty dataframe?
Thank you in advance
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficientcy because df.drop() may suffer from performance issue
when the dataframe is big enough(by big enough I mean 1mil+ total entries).
df = pd.DataFrame(columns=df.columns)
will empty a dataframe in pandas (and be way faster than using the drop method).
I believe that is what your asking.

Change one column of a DataFrame only

I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
var = float(var)
return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column',1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
Apply does not work inplace, but rather returns a series that you discard in this line:
df1['column'].apply(make_float)
Apart from Yakym's solution, you can also do this -
df['column'] += 0.0

Categories