I have a pandas DataFrame df that is returned from a function and I generally don't know whether it is an independent object or a view on another DataFrame. I want to add new columns to it but don't want to copy it unnecessarily.
df['new_column'] = 0
may give a nasty warning about modifying a copy
df = df.copy()
may be expensive if df is large.
What's the best way here?
You should use an indexer to create your s1, such as:
import pandas as pd
s = pd.DataFrame({'a':[1,2], 'b':[2,3]})
indexer = s[s.a > 1].index
s1 = s.loc[indexer, :]
s1['c'] = 0
should remove the warning.
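If you control the code that produces the subset, another common way to avoid the warning is to take an explicit copy at the point of selection. This is a minimal sketch reusing the frame from the answer above. Note that it does copy the selected rows, so it trades some memory for an unambiguously independent object:

```python
import pandas as pd

s = pd.DataFrame({'a': [1, 2], 'b': [2, 3]})

# Copy only the selected rows, so s1 is clearly its own object.
s1 = s.loc[s['a'] > 1].copy()

# Adding a column now affects s1 only; s is left untouched.
s1['c'] = 0
```

Only the filtered rows are copied here, not the whole original frame, which may be acceptable when the selection is much smaller than the source.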
Related
I am trying to replicate the following operation on a dask dataframe, where I have to filter the dataframe based on a column value and multiply another column by a factor that depends on it.
The following is the pandas equivalent:
import dask.dataframe as dd
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
I am trying to do this on a dask dataframe but it doesn't support assignment.
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] ==1]
df1['adjusted_revenue'] = 0.7*df1['gross_revenue']
df2 = df.loc[df['tracked'] ==0]
df2['adjusted_revenue'] = 0.3*df2['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping there is a simpler way to do this.
Thanks!
You could use .apply, which is probably the right thing to do with pandas too, or perhaps where. However, to keep things as similar as possible to your original, here it is with map_partitions, in which you act on each piece of the dataframe independently, and those pieces really are pandas dataframes.
def make_col(df):
    df['adjusted_revenue'] = 0
    df.loc[(df.tracked == 1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
    df.loc[(df.tracked == 0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
    return df
new_df = df.map_partitions(make_col)
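The where route mentioned above can be sketched in plain pandas as follows. This is a minimal sketch with made-up sample data; the column names come from the question. Whether the same expression works unchanged on a dask dataframe is an assumption to verify against your dask version.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({'tracked': [1, 0, 1],
                   'gross_revenue': [100.0, 200.0, 300.0]})

gross = df['gross_revenue']

# Where tracked == 0 use 0.3 * gross, otherwise fall back to 0.0.
fallback = (0.3 * gross).where(df['tracked'] == 0, 0.0)

# Where tracked == 1 use 0.7 * gross, otherwise use the fallback above.
df['adjusted_revenue'] = (0.7 * gross).where(df['tracked'] == 1, fallback)
```

Because this builds the column from whole-column expressions rather than item assignment through .loc, it avoids the operation that raised the TypeError in the question.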
I am using drop in pandas with inplace=True set. I am performing this on a duplicate dataframe, but the original dataframe is also being modified.
df1 = df
for col in df1.columns:
    if df1[col].sum() > 1:
        df1.drop(col, inplace=True, axis=1)
This is modifying my 'df' dataframe and I don't understand why.
Use df1 = df.copy(). Otherwise they are the same object in memory.
However, it would be better to generate a new DataFrame directly, e.g.
df1 = df.loc[:, df.sum() <= 1]
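The difference between an alias and a copy can be checked directly. The following is a small sketch with made-up data (the column names are hypothetical); it keeps the columns whose sum is at most 1, matching the loop in the question:

```python
import pandas as pd

# Hypothetical sample frame: column 'a' sums to 0, column 'b' sums to 3.
df = pd.DataFrame({'a': [0, 0], 'b': [1, 2]})

alias = df                 # same object under a second name
independent = df.copy()    # separate object with its own data

# Dropping in place on the copy does not touch df.
independent.drop('b', axis=1, inplace=True)

# Build the filtered frame directly instead of dropping in a loop.
kept = df.loc[:, df.sum() <= 1]
```

Here `alias is df` is true, so any in-place operation through alias would also be visible through df, while independent and kept are separate objects.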
So I want to create a function in which part of the code modifies an existing pandas dataframe df and, under some conditions, empties it. The challenge is that this function is not allowed to return the dataframe itself; it can only modify df through the alias it receives. An example of this is the following function:
import pandas as pd
import random
def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup
testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that df = pd.DataFrame() won't work because the local name df is then bound to the new pd.DataFrame() instead of remaining an alias of the input dataframe. So is there any way to change df in place to an empty dataframe?
Thank you in advance
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the dataframe is big enough (by big enough I mean 1 million+ total entries).
df = pd.DataFrame(columns=df.columns)
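will give you an empty dataframe with the same columns (and be way faster than using the drop method).
I believe that is what you're asking. Note, however, that this rebinds the name df to a new object, so inside a function it does not change the caller's dataframe.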
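The difference between rebinding the local name and mutating the passed-in object can be seen in a short sketch. The two helper functions below are hypothetical names introduced only for illustration:

```python
import pandas as pd

def empty_by_rebinding(df):
    # Rebinds the local name only; the caller's object is unchanged.
    df = pd.DataFrame(columns=df.columns)

def empty_in_place(df):
    # Mutates the object the caller passed in: all rows are removed,
    # the columns are kept.
    df.drop(df.index, inplace=True)

first = pd.DataFrame({'col1': [True, False]})
empty_by_rebinding(first)
rows_after_rebinding = len(first)

second = pd.DataFrame({'col1': [True, False]})
empty_in_place(second)
rows_after_in_place = len(second)
```

After running this, first still has its original rows while second has none. This sketch says nothing about the relative speed of the two approaches on large frames; that would need to be measured.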
I need to perform some operations on a pandas DataFrame() in order to evaluate some measure but leave my DataFrame as is. So I thought that I should start by duplicating it in memory :
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame(df1)
When printing
print(id(df1), id(df2))
I do get two different memory addresses. So, as I understand it, these are two different instances of DataFrame().
However, if I do:
df2['b'] = [4,5,6]
print(df1)
df1 appears with a 'b' column, although I only added it in df2.
Why is this happening?
How can I really duplicate my DataFrame so that operations on one do not modify the other?
I am on Python 3.5 and pandas 0.24.2
You need to use pd.DataFrame.copy
df2 = df1.copy()
An assignment, even to a new variable, still references the same data and indices in memory, which means a manipulation through df1 or df2 changes the same underlying data. With copy, however, df2 gets its own copy of the data that can be manipulated independently.
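A short check of the copy behavior, reusing the frame from the question (a minimal sketch; the modified values are arbitrary):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = df1.copy()         # deep copy by default

df2['b'] = [4, 5, 6]     # new column exists only on the copy
df2.loc[0, 'a'] = 100    # value change is made only on the copy
```

After these operations df1 still has the single column 'a' with its original values.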
Explanation:
Why do you get two different memory addresses when calling the pd.DataFrame on a DataFrame?
Simply put, pandas.DataFrame is a wrapper around numpy.ndarray. When you called pd.DataFrame with the df1 dataframe as input, a new pd.DataFrame wrapper was created (thus a different memory address), but the data is exactly the same. You can verify that with the following code:
In [2]: import pandas as pd
...: df1 = pd.DataFrame({'a':[1,2,3]})
...: df2 = pd.DataFrame(df1)
...:
In [3]: print(id(df1), id(df2))
(4665009296, 4665009360)
In [4]: df1._data
Out[4]:
BlockManager
Items: Index([u'a'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64
In [5]: id(df1._data)
Out[5]: 4522343248
In [6]: id(df2._data)
Out[6]: 4522343248
As you can see, the memory address for df1._data and df2._data is exactly the same.
This is also clear when you read the DataFrame source code on GitHub, where, at the beginning of the constructor, the same data is referenced by the new dataframe:
if isinstance(data, DataFrame):
    data = data._data
I have an original pandas Dataframe with a chain of objects doing conditional selection on it. Each time I do a conditional selection, pandas creates a new dataframe. In other words:
import pandas as pd
df = pd.DataFrame(dict(A=range(3,23), B=range(5,25)))
print(id(df))
df2 = df[df['A']> 15]
print(id(df2))
df = pd.DataFrame(dict(A=range(3,43), B=range(5,45)))
print(id(df))
# output:
139963862409288
139963862409456
139963862275296
In the above example, I want df2 to change when I update df. I know that, because I rebind the variable df to a new pandas DataFrame (a new object), its ID changes and df2 is no longer connected to the new df. Is there any way to do it the way I want? Is there any method/attribute in pandas to keep the connection between the original DataFrame and my conditional selection, or any Pythonic way I'm not aware of?
What are you trying to accomplish? Maybe it can be accomplished in a different way?
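Regarding having views instead of copies: when you select a single row or column, you may get a view, depending on the pandas version and the dtypes involved (with Copy-on-Write enabled, writes like the ones below no longer propagate to df). The code below demonstrates the view behavior: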
import pandas as pd
df = pd.DataFrame(dict(A=range(8,13), B=range(10,15), C=range(-3,2)))
print(df)
print('-----------')
dfa = df['A']
df2 = df.loc[2]
dfi = df.iloc[2]
dfa[2]=42
df2['B']=99
dfi['C']=-1
print(df)
print(dfa)
print(df2)
print(dfi)
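Since a boolean selection produces a new object that does not follow later rebinding of df, one option is to keep the condition in a function and recompute the selection whenever it is needed. This is a sketch under that assumption; the helper name select_large_a and its threshold parameter are hypothetical:

```python
import pandas as pd

def select_large_a(frame, threshold=15):
    # Recompute the conditional selection from whatever frame is passed in.
    return frame[frame['A'] > threshold]

df = pd.DataFrame(dict(A=range(3, 23), B=range(5, 25)))
first = select_large_a(df)

# Rebinding df to a new DataFrame; calling the function again
# yields a selection based on the new data.
df = pd.DataFrame(dict(A=range(3, 43), B=range(5, 45)))
second = select_large_a(df)
```

This does not keep a live link between the two objects; it only makes the selection cheap to re-derive from the current df.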