Memory handling in pandas DataFrame - Python

I need to perform some operations on a pandas DataFrame in order to evaluate some measure, but leave my DataFrame as is. So I thought I should start by duplicating it in memory:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame(df1)
When printing
print(id(df1), id(df2))
I get two different memory addresses. So to my mind, these are two different instances of DataFrame.
However, if I do:
df2['b'] = [4,5,6]
print(df1)
df1 appears with a 'b' column, although I only added it in df2.
Why is this happening?
How can I really duplicate my DataFrame so that operations on one do not modify the other?
I am on Python 3.5 and pandas 0.24.2

You need to use pd.DataFrame.copy:
df2 = df1.copy()
An assignment, even to a new variable (and likewise passing an existing DataFrame to the pd.DataFrame constructor), references the same data/indices in memory, so manipulating either df1 or df2 changes the same underlying data. Using copy, however, df2 gets its own copy of the data that can be manipulated independently.
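A quick sketch of the difference, building on the example above:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = df1.copy()        # independent copy of the data
df2['b'] = [4, 5, 6]    # modifies df2 only

print(df1)  # df1 still has only column 'a'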
Explanation:
Why do you get two different memory addresses when calling pd.DataFrame on a DataFrame?
Simply put, pandas.DataFrame is a wrapper around numpy.ndarray. When you call pd.DataFrame with the df1 DataFrame as input, a new pd.DataFrame wrapper is created (hence the different memory address), but the data is exactly the same. You can verify that with the following code:
In [2]: import pandas as pd
...: df1 = pd.DataFrame({'a':[1,2,3]})
...: df2 = pd.DataFrame(df1)
...:
In [3]: print(id(df1), id(df2))
4665009296 4665009360
In [4]: df1._data
Out[4]:
BlockManager
Items: Index(['a'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64
In [5]: id(df1._data)
Out[5]: 4522343248
In [6]: id(df2._data)
Out[6]: 4522343248
As you can see, the memory address for df1._data and df2._data is exactly the same.
This is also clear when you read the DataFrame source code on GitHub, where, at the beginning of the constructor, the new DataFrame references the same data:
if isinstance(data, DataFrame):
    data = data._data
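To confirm that the two wrappers really share the same underlying data, numpy's shares_memory check can be used (a small sketch based on the example above):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame(df1)

# True: both wrappers point at the same underlying block of values
print(np.shares_memory(df1['a'].values, df2['a'].values))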

Related

pandas apply function (UDF) fails to return multiple values in case the dataframe is empty

I want to be able to return more than a single column from a pandas UDF (apply function). This works nicely, as long as the DataFrame is non-empty!
If it is empty, it fails with: not enough values to unpack (expected 3, got 0). Is this to be considered a bug in pandas? Or should the user be forced to manually check the length of the filtered DataFrame before executing the function? Or is there a better way to avoid this problem?
import pandas as pd
df = pd.DataFrame({'foo':[1,2,3], 'bar':[4,5,6]})
def my_function(x):
    # print(x)
    # some computation
    # returns multiple values (tuple)
    # simplified here
    return 1, 1, 1
df = df[df.foo > 10]
df['r1'], df['r2'], df['r3'] = zip(*df.apply(my_function, axis=1))
df
One solution is to use pd.concat in combination with result_type='expand'.
cols = {0: 'r1', 1: 'r2', 2: 'r3'}
df = pd.concat([df, df.apply(my_function, axis=1, result_type='expand')], axis=1).rename(columns=cols)
You have to rename the columns afterwards. Also, the resulting empty DataFrame repeats the first two columns:
Output:
foo bar foo bar
versus
foo bar
Both DataFrames are empty, so this may not matter to you.
I think checking for empty DataFrames is good practice in pandas, so Siddhant's solution in the comments is just fine.
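That check is not reproduced above; a minimal sketch of such a guard (hedged, since the exact comment solution is not shown here) could be:
# guard against the empty case before unpacking the results
if df.empty:
    df = df.reindex(columns=list(df.columns) + ['r1', 'r2', 'r3'])
else:
    df['r1'], df['r2'], df['r3'] = zip(*df.apply(my_function, axis=1))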

Drop function in Pandas is changing original dataframe?

I am using drop in pandas with inplace=True. I am performing this on a duplicate DataFrame, but the original DataFrame is also being modified.
df1 = df
for col in df1.columns:
    if df1[col].sum() > 1:
        df1.drop(col, inplace=True, axis=1)
This is modifying my df DataFrame and I don't understand why.
Use df1 = df.copy(). Otherwise they are the same object in memory.
However, it would be better to generate a new DataFrame directly, e.g.
df1 = df.loc[:, df.sum() <= 1]
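A small illustration of that boolean-mask selection (the sample data is made up):
import pandas as pd

df = pd.DataFrame({'a': [1, 0], 'b': [0, 0], 'c': [3, 4]})

# keep only the columns whose sum is at most 1: 'a' (sum 1) and 'b' (sum 0)
df1 = df.loc[:, df.sum() <= 1]
print(df1.columns.tolist())  # ['a', 'b']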

Efficient way to empty a pandas dataframe using its mutable alias property

So I want to create a function in which part of the code modifies an existing pandas DataFrame df, and under some conditions the df is emptied. The challenge is that this function is not allowed to return the DataFrame itself; it can only modify df through the alias. An example is the following function:
import pandas as pd
import random

def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup

testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that the reason df = pd.DataFrame() won't work is that it rebinds the local name df to a new pd.DataFrame() instead of the mutable alias of the input DataFrame. So is there any way to change df in place into an empty DataFrame?
Thank you in advance.
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the DataFrame is big enough (by big enough I mean 1mil+ total entries).
df = pd.DataFrame(columns=df.columns)
will empty a DataFrame in pandas (and is much faster than using the drop method).
I believe that is what you're asking.
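Note that this rebinds the name rather than mutating through the alias, so inside a function like the asker's an in-place operation is still needed. A minimal sketch combining this with the asker's EDIT1 (the function name is illustrative):
import pandas as pd

def empty_df_inplace(df):
    # mutate through the alias: dropping every row empties the caller's DataFrame too
    df.drop(df.index, inplace=True)

testing_df = pd.DataFrame({'col1': [True, False]})
empty_df_inplace(testing_df)
print(testing_df)  # empty DataFrame, columns preserved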

Modifying a pandas dataframe that may be a view

I have a pandas DataFrame df that is returned from a function and I generally don't know whether it is an independent object or a view on another DataFrame. I want to add new columns to it but don't want to copy it unnecessarily.
df['new_column'] = 0
may give a nasty warning about modifying a copy
df = df.copy()
may be expensive if df is large.
What's the best way here?
You should use an indexer to create your s1, such as:
import pandas as pd
s = pd.DataFrame({'a':[1,2], 'b':[2,3]})
indexer = s[s.a > 1].index
s1 = s.loc[indexer, :]
s1['c'] = 0
That should remove the warning.

Selecting/excluding sets of columns in pandas [duplicate]

This question already has answers here: Delete a column from a Pandas DataFrame (20 answers).
Closed 4 years ago.
I would like to create views or dataframes from an existing dataframe based on column selections.
For example, I would like to create a dataframe df2 from a dataframe df1 that holds all of its columns except two. I tried the following, but it didn't work:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# Try to create a second dataframe df2 from df with all columns except 'B' and 'D'
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
# This returns an error ("unhashable type: set")
df2 = df[my_cols]
What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?
You can either drop the columns you do not need or select the ones you want:
# Using DataFrame.drop (columns 1 and 3 are 'B' and 'D')
df.drop(df.columns[[1, 3]], axis=1, inplace=True)
# drop by name
df1 = df1.drop(['B', 'D'], axis=1)
# Select the ones you want
df1 = df[['A', 'D']]
There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.
Here, the result is used to remove columns B and D from df:
df2 = df[df.columns.difference(['B', 'D'])]
Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:
# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns
# does not create a copy of the dataframe
subset = df.columns.difference(['B', 'D'])
df = df.drop_duplicates(subset=subset)
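A quick check of the reordering caveat mentioned above (a small sketch):
import pandas as pd

cols = pd.Index(['D', 'A', 'C', 'B'])
# difference returns a sorted result, so the original column order is lost
print(cols.difference(['B']))  # Index(['A', 'C', 'D'], dtype='object')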
Another option, without dropping or filtering in a loop:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]
# or more simply include columns:
df[['A', 'B']]
# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]
# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically
df[df.columns.difference(['C', 'D'])]
You don't really need to convert that into a set:
cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
Also have a look at the built-in DataFrame.filter function.
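For example, filter can also select by exact names without a regex (a quick sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# keep columns by exact name
df2 = df.filter(items=['A', 'C'])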
Minimalistic but greedy approach (sufficient for the given df):
df.filter(regex="[^BD]")
Conservative/lazy approach (exact matches only):
df.filter(regex="^(?!(B|D)$).*$")
Conservative and generic:
exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
You have 4 columns: A, B, C, D.
Here is a better way to select the columns you need for the new DataFrame:
df2 = df1[['A', 'D']]
If you wish to use column numbers instead, use positional indexing with iloc:
df2 = df1.iloc[:, [0, 3]]
You just need to convert your set to a list:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
Here's how to create a copy of a DataFrame excluding a list of columns:
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)
But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)
This method doesn't achieve that:
>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
A B C D
0 999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
A C
0 0.251262 -1.980673
Note that this is also true of @piggybox's method. (Although that method is nice and slick and Pythonic. I'm not doing it down!!)
For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.
In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.
As of pandas 0.20.0, usecols accepts callables. This update allows more flexible options for reading columns:
skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)
The latter pattern is essentially the inverse of the traditional usecols usage: only the specified columns are skipped.
Given
Data in a file
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
filename = "foo.csv"
df.to_csv(filename)
Code
skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1
Output
A C
0 0.062350 0.076924
1 -0.016872 1.091446
2 0.213050 1.646109
3 -1.196928 1.153497
4 -0.628839 -0.856529
...
Details
A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).
Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.
A request was raised for a "skipcols" option in this issue and was addressed in a later issue.
