Efficient way to empty a pandas dataframe using its mutable alias property - python

So I want to create a function in which part of the code modifies an existing pandas dataframe df and, under some conditions, empties it. The challenge is that this function is not allowed to return the dataframe itself; it can only modify the df through its alias. An example of this is the following function:
import pandas as pd
import random

def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup

testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that the reason df = pd.DataFrame() won't work is that the local name df is rebound to the new pd.DataFrame() instead of remaining a mutable alias of the input dataframe. So is there any way to change df in place into an empty dataframe?
Thank you in advance
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the dataframe is big enough (by big enough I mean 1M+ total entries).

df = pd.DataFrame(columns=df.columns)
will empty a dataframe in pandas (and be way faster than using the drop method).
I believe that is what you're asking.
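Note, though, that inside the asker's function this assignment still only rebinds the local name, so the caller's dataframe is unchanged; to empty it through the alias you need in-place mutation. A minimal sketch building on the drop approach from EDIT1 (empty_in_place is an illustrative name, not a pandas function):
import pandas as pd

def empty_in_place(df):
    # Drop all rows, then all columns, mutating the caller's object itself
    # so every other reference sees an empty dataframe.
    df.drop(df.index, inplace=True)
    df.drop(columns=df.columns, inplace=True)

testing_df = pd.DataFrame({'col1': [True, False]})
alias = testing_df
empty_in_place(testing_df)
print(alias.empty)  # True: the change is visible through the alias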

Related

Why does pandas.DataFrame change the data source?

I'm learning Python, and I found something I can't understand.
I created a pandas.DataFrame from an ndarray, and then modified only the DF, not the ndarray.
To my surprise, the ndarray changed too!
Is the data cached inside the DF?
If yes, why did the changes propagate to the ndarray?
If no, what about a DF created without any source?
from pandas import DataFrame
import numpy as np

if __name__ == '__main__':
    nda1 = np.zeros((3, 3), dtype=float)
    print(f'original nda1:\n{nda1}\n')
    df1 = DataFrame(nda1)
    print(f'original df1:\n{df1}\n')
    df1.iat[2, 2] = 999
    # print(f'df1 in main:\n{df1}\n')
    print(f'nda1 after modify:\n{nda1}\n')
DataFrames use NumPy arrays under the hood. As your array has a single homogeneous dtype, it is kept as is, without a copy.
You can check it with:
pd.DataFrame(nda1).values.base is nda1
# True
You can force a copy to avoid the issue:
df1 = pd.DataFrame(nda1.copy())
or copy from within the constructor:
df1 = pd.DataFrame(nda1, copy=True)
Check that the underlying array is now a different one:
pd.DataFrame(nda1, copy=True).values.base is nda1
# False
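As a quick end-to-end check (a sketch reusing the question's own names), the forced copy shields the source array:
import numpy as np
import pandas as pd

nda1 = np.zeros((3, 3), dtype=float)
df1 = pd.DataFrame(nda1, copy=True)  # force an independent copy
df1.iat[2, 2] = 999                  # modify the DataFrame only
print(nda1[2, 2])                    # 0.0: the source ndarray is untouched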
Many programmers run into this. It is because of this line:
df1 = DataFrame(nda1)
Constructed this way, the DataFrame wraps the existing array without copying it, so the two stay intertwined. If you want a dataframe with no tie to its source, use:
df2 = df1.copy()
or
df1 = DataFrame(nda1.copy())
Highly relevant post:
Why can pandas dataframes change each other

column filter and multiplication in dask dataframe

I am trying to replicate the following operation on a dask dataframe, where I have to filter the dataframe based on a column value and multiply another column accordingly.
The pandas equivalent is the following -
import dask.dataframe as dd
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
I am trying to do this on a dask dataframe, but its .loc indexer doesn't support item assignment:
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] ==1]
df1['adjusted_revenue'] = 0.7*df1['gross_revenue']
df2 = df.loc[df['tracked'] ==0]
df2['adjusted_revenue'] = 0.3*df2['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping if there is any simpler way to do this.
Thanks!
You should use .apply, which is probably the right thing to do with pandas too; or perhaps where. However, to keep things as similar to your original as possible, here it is with map_partitions, in which you act on each piece of the dataframe independently, and those pieces really are pandas dataframes.
def make_col(df):
    df['adjusted_revenue'] = 0
    df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
    df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
    return df

new_df = df.map_partitions(make_col)
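For completeness, here is a sketch of the where-based alternative mentioned at the start of this answer; it avoids map_partitions and reads identically in pandas and dask (it assumes the tracked column only ever holds 0 or 1):
# Pick 0.7x where tracked == 1, otherwise fall back to 0.3x.
df['adjusted_revenue'] = (0.7 * df['gross_revenue']).where(
    df['tracked'] == 1,
    0.3 * df['gross_revenue'],
)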

pandas conditional selection - returning a view not a copy

I have an original pandas DataFrame and a chain of objects created by doing conditional selection on it. Each time I do a conditional selection, pandas creates a new dataframe. In other words:
import pandas as pd
df = pd.DataFrame(dict(A=range(3,23), B=range(5,25)))
print(id(df))
df2 = df[df['A']> 15]
print(id(df2))
df = pd.DataFrame(dict(A=range(3,43), B=range(5,45)))
print(id(df))
# output:
139963862409288
139963862409456
139963862275296
In the above example, I want df2 to change when I update df. I know that because I rebind the variable df to a new pandas DataFrame (a new object), its id changes and df2 is no longer connected to the new df. Is there any way to do what I want? Is there any method/attribute in pandas that keeps the connection between the original DataFrame and my conditional selection, or any Pythonic way I'm not aware of?
What are you trying to accomplish? Maybe it can be accomplished in a different way?
Regarding having views instead of copies: when you select a single row or column, you may get a view (it depends on the dtype layout). The code below demonstrates this:
import pandas as pd
df = pd.DataFrame(dict(A=range(8,13), B=range(10,15), C=range(-3,2)))
print(df)
print('-----------')
dfa = df['A']
df2 = df.loc[2]
dfi = df.iloc[2]
dfa[2] = 42
df2['B'] = 99
dfi['C'] = -1
print(df)
print(dfa)
print(df2)
print(dfi)
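As far as I know, pandas has no mechanism to keep a conditional selection live. One Pythonic workaround (a sketch; select_big is an illustrative name) is to store the selection as a function and re-evaluate it against whatever df currently refers to:
import pandas as pd

def select_big(d):
    # Recompute the conditional selection against whatever frame is passed in.
    return d[d['A'] > 15]

df = pd.DataFrame(dict(A=range(3, 23), B=range(5, 25)))
df2 = select_big(df)  # selection from the first df
df = pd.DataFrame(dict(A=range(3, 43), B=range(5, 45)))  # rebind df
df2 = select_big(df)  # re-run, so it reflects the current df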

Modifying a pandas dataframe that may be a view

I have a pandas DataFrame df that is returned from a function and I generally don't know whether it is an independent object or a view on another DataFrame. I want to add new columns to it but don't want to copy it unnecessarily.
df['new_column'] = 0
may give a nasty warning about modifying a copy
df = df.copy()
may be expensive if df is large.
What's the best way here?
You should use an indexer to create your s1, such as:
import pandas as pd
s = pd.DataFrame({'a':[1,2], 'b':[2,3]})
indexer = s[s.a > 1].index
s1 = s.loc[indexer, :]
s1['c'] = 0
should remove the warning.

How to make get_dummies work in place?

I apply get_dummies on my DataFrame to generate dummy variables. It creates a new DataFrame. How can I change my original DataFrame instead?
This works, but is there a better way?
import pandas as pd
data = pd.DataFrame({'gender': [ 'female', 'male']})
data1 = pd.get_dummies(data, columns = ['gender'])
# data is still unchanged
data.drop(data.columns, inplace=True, axis=1)
data[data1.columns] = data1
In your code, you are creating a new dataframe, then removing all of the data from the old dataframe, and then putting the new data back into the old one.
Instead of your last three lines of code, you can just say:
data = pd.get_dummies(data, columns = ['gender'])
get_dummies creates a new dataframe, and the assignment simply rebinds the name data to it. For your example this is functionally the same as your code, but it is much easier to understand.
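If the change really must happen in place (say, because other references to data need to see it), the asker's drop-and-reassign pattern can be wrapped in a small helper. A sketch, with get_dummies_inplace as a hypothetical name rather than a pandas function:
import pandas as pd

def get_dummies_inplace(df, columns):
    # Hypothetical helper: mutate df itself so every alias sees the result.
    dummies = pd.get_dummies(df, columns=columns)
    df.drop(df.columns, inplace=True, axis=1)
    df[dummies.columns] = dummies

data = pd.DataFrame({'gender': ['female', 'male']})
alias = data
get_dummies_inplace(data, columns=['gender'])
print(alias)  # the alias shows the gender_female / gender_male columns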
