Why does a copy get created when assigning with None? - python

In[216]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})
In[217]: bar = foo.ix[:1]
In[218]: bar
Out[218]:
a b
0 1 3
1 2 4
A view is created as expected.
In[219]: bar['a'] = 100
In[220]: bar
Out[220]:
a b
0 100 3
1 100 4
In[221]: foo
Out[221]:
a b
0 100 3
1 100 4
2 3 5
If the view is modified, so is the original DataFrame foo.
However, if the assignment is done with None, then a copy seems to be made.
Can anyone shed some light on what's happening, and maybe the logic behind it?
In[222]: bar['a'] = None
In[223]: bar
Out[223]:
a b
0 None 3
1 None 4
In[224]: foo
Out[224]:
a b
0 100 3
1 100 4
2 3 5

When you assign bar['a'] = None, you're forcing the column to change its dtype from, e.g., int64 to object.
Doing so forces pandas to allocate a new object array for the column, and then of course it writes to that new array instead of to the old array that's shared with the original DataFrame.
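A minimal sketch of the mechanism (using .iloc, since .ix is deprecated; whether the slice is a view at all depends on the pandas version, and under copy-on-write in pandas 2.x it never shares data in the first place):
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
bar = foo.iloc[:2]       # may be a view of foo in older pandas

bar['a'] = 100           # same dtype: writes in place, can propagate to foo
bar['a'] = None          # int64 cannot hold None: a new object array is allocated
print(bar.dtypes['a'])   # object
print(foo.dtypes['a'])   # still int64; foo keeps the old array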

You are doing a form of chained assignment; see here for why this is a really bad idea.
See this question as well.
Pandas will generally warn you that you are modifying a view (even more so in 0.15.0).
In [49]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})
In [51]: foo
Out[51]:
a b
0 1 3
1 2 4
2 3 5
In [52]: bar = foo.ix[:1]
In [53]: bar
Out[53]:
a b
0 1 3
1 2 4
In [54]: bar.dtypes
Out[54]:
a int64
b int64
dtype: object
# this is an internal attribute (shown for illustration)
In [56]: bar._is_view
Out[56]: True
# this will warn in 0.15.0
In [57]: bar['a'] = 100
/usr/local/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [58]: bar._is_view
Out[58]: True
# bar is now a copied object (and will replace the existing dtypes with new ones).
In [59]: bar['a'] = None
In [60]: bar.dtypes
Out[60]:
a object
b int64
dtype: object
You should never rely on whether something is a view (even in numpy), except in certain performance-critical situations. It is not a guaranteed construct; it depends on the memory layout of the underlying data.
You should very, very rarely try to set data for propagation through a view, and doing this in pandas is almost always going to cause trouble when you have mixed dtypes. (In numpy you can only have a view on a single dtype; I am not even sure what a view on a multi-dtyped array that changes the dtype would do, or whether it's even allowed.)
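If you want to check whether a column still shares its buffer with the original, numpy's np.shares_memory gives a direct answer. A sketch (to_numpy(copy=False) needs pandas >= 0.24, and whether anything is shared at all is version-dependent):
import numpy as np
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
bar = foo.iloc[:2]

# True if bar's 'a' column is a view on foo's buffer
print(np.shares_memory(foo['a'].to_numpy(copy=False),
                       bar['a'].to_numpy(copy=False)))

bar['a'] = None   # the dtype change forces a fresh object array for bar
print(np.shares_memory(foo['a'].to_numpy(copy=False),
                       bar['a'].to_numpy(copy=False)))   # False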

Related

Why does assigning with [:] versus iloc[:] yield different results in pandas?

I am so confused with different indexing methods using iloc in pandas.
Let's say I am trying to convert a 1-d DataFrame to a 2-d DataFrame. First I have the following 1-d DataFrame:
a_array = [1,2,3,4,5,6,7,8]
a_df = pd.DataFrame(a_array).T
And I am going to convert that into a 2-d DataFrame with the size 2x4. I start by presetting the 2-d DataFrame as follows:
b_df = pd.DataFrame(columns=range(4),index=range(2))
Then I use a for-loop to convert a_df (1-d) to b_df (2-d) with the following code:
for i in range(2):
    b_df.iloc[i,:] = a_df.iloc[0,i*4:(i+1)*4]
It only gives me the following results
0 1 2 3
0 1 2 3 4
1 NaN NaN NaN NaN
But when I change b_df.iloc[i,:] to b_df.iloc[i][:], the result is correct, like the following, which is what I want:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
Could anyone explain what the difference between .iloc[i,:] and .iloc[i][:] is, and why .iloc[i][:] worked in my example above but .iloc[i,:] did not?
There is a very, very big difference between series.iloc[:] and series[:], when assigning back. (i)loc always checks to make sure whatever you're assigning from matches the index of the assignee. Meanwhile, the [:] syntax assigns to the underlying NumPy array, bypassing index alignment.
s = pd.Series(index=[0, 1, 2, 3], dtype='float')
s
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
# Let's get a reference to the underlying array with `copy=False`
arr = s.to_numpy(copy=False)
arr
# array([nan, nan, nan, nan])
# Reassign using slicing syntax
s[:] = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
arr
# array([1., 2., 3., 4.]) # underlying array has changed
# Now, reassign again with `iloc`
s.iloc[:] = pd.Series([5, 6, 7, 8], index=[3, 4, 5, 6])
s
0 NaN
1 NaN
2 NaN
3 5.0
dtype: float64
arr
# array([1., 2., 3., 4.]) # `iloc` created a new array for the series
# during reassignment leaving this unchanged
s.to_numpy(copy=False) # the new underlying array, for reference
# array([nan, nan, nan, 5.])
Now that you understand the difference, let's look at what happens in your code. Just print out the RHS of your loops to see what you are assigning:
for i in range(2):
    print(a_df.iloc[0, i*4:(i+1)*4])
# output - first row
0 1
1 2
2 3
3 4
Name: 0, dtype: int64
# second row. Notice the index is different
4 5
5 6
6 7
7 8
Name: 0, dtype: int64
When assigning to b_df.iloc[i, :] in the second iteration, the indexes are different so nothing is assigned and you only see NaNs. However, changing b_df.iloc[i, :] to b_df.iloc[i][:] will mean you assign to the underlying NumPy array, so indexing alignment is bypassed. This operation is better expressed as
for i in range(2):
    b_df.iloc[i, :] = a_df.iloc[0, i*4:(i+1)*4].to_numpy()
b_df
0 1 2 3
0 1 2 3 4
1 5 6 7 8
It's also worth mentioning this is a form of chained assignment, which is not a good thing, and also makes your code harder to read and understand.
The difference is that in the first case the Python interpreter executed the code as:
b_df.iloc[i,:] = a_df.iloc[0,i*4:(i+1)*4]
#as
b_df.iloc.__setitem__((i, slice(None)), value)
where the value would be the right hand side of the equation.
Whereas in the second case the Python interpreter executed the code as:
b_df.iloc[i][:] = a_df.iloc[0,i*4:(i+1)*4]
#as
b_df.iloc.__getitem__(i).__setitem__(slice(None), value)
where again the value would be the right hand side of the equation.
In each of those two cases, a different code path is taken inside __setitem__ because the keys differ: (i, slice(None)) versus slice(None).
Therefore we get different behavior.
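To make those two call paths concrete, here is a toy indexer (purely illustrative, not pandas internals; the class is made up) that logs which special methods fire:
class LoggingIndexer:
    def __getitem__(self, key):
        print("__getitem__(%r)" % (key,))
        return self
    def __setitem__(self, key, value):
        print("__setitem__(%r, %r)" % (key, value))

ix = LoggingIndexer()
ix[0, :] = 'x'   # one call:  __setitem__((0, slice(None, None, None)), 'x')
ix[0][:] = 'x'   # two calls: __getitem__(0), then __setitem__(slice(None, None, None), 'x')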
Could anyone explain to me what the difference between .iloc[i,:] and
.iloc[i][:] is
The difference between .iloc[i,:] and .iloc[i][:]
In the case of .iloc[i,:] you are directly accessing a specific position of the DataFrame, selecting all (:) columns of the i-th row. As far as I know, it is equivalent to leaving the 2nd dimension unspecified (.iloc[i]).
In the case of .iloc[i][:] you are performing two chained operations, so the result of .iloc[i] is then affected by [:]. Using this to set values is discouraged by pandas itself here with a warning, so you shouldn't use it:
Whether a copy or a reference is returned for a setting operation, may
depend on the context. This is sometimes called chained assignment and
should be avoided
... and why .iloc[i][:] worked in my example above but not .iloc[i,:]
As @Scott mentioned in the comments on the OP, data alignment is intrinsic, so index labels on the right side of the = won't be used if they are not present on the left side. This is why there are NaN values in the 2nd row.
So, to make things clear, you could do as follows:
for i in range(2):
    # Get the slice
    a_slice = a_df.iloc[0, i*4:(i+1)*4]
    # Reset the indices
    a_slice.reset_index(drop=True, inplace=True)
    # Set the slice into b_df
    b_df.iloc[i,:] = a_slice
Or you can convert to list instead of using reset_index:
for i in range(2):
    # Get the slice
    a_slice = a_df.iloc[0, i*4:(i+1)*4]
    # Convert the slice into a list and set it into b_df
    b_df.iloc[i,:] = list(a_slice)

Error when setting default value to entire new column in Pandas dataframe

The code works, but I am getting this warning when trying to set a default value of 1 for an entire new column in a pandas DataFrame. What does this warning mean, and how can I rework the code so I don't get it?
df['new']=1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
this should solve the problem:
soldactive = df[(df.DispositionStatus == 'Sold') & (df.AssetStatus == 'Active')].copy()
your code:
removesold = df[(df.ExitDate.isin(errorval)) & (df.DispositionStatus == 'Sold') & (df.AssetStatus == 'Resolved')]
df = df.drop(removesold.index)
soldactive = df[(df.DispositionStatus == 'Sold') & (df.AssetStatus == 'Active')]
soldactive['FlagError'] = 1
You've created the soldactive DataFrame as a copy of the subset (sliced) df.
After that, you're trying to create a new column on that copy. It gives you the warning "A value is trying to be set on a copy of a slice from a DataFrame" because DataFrames are value-mutable (see the excerpt from the docs below).
Docs:
All pandas data structures are value-mutable (the values they contain
can be altered) but not always size-mutable. The length of a Series
cannot be changed, but, for example, columns can be inserted into a
DataFrame. However, the vast majority of methods produce new objects
and leave the input data untouched. In general, though, we like to
favor immutability where sensible.
Here is a test case:
In [375]: df
Out[375]:
a b c
0 9 6 4
1 5 2 8
2 8 1 6
3 3 4 1
4 8 0 2
In [376]: a = df[1:3]
In [377]: a['new'] = 1
C:\envs\py35\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [378]: del a
In [379]: a = df[1:3].copy()
In [380]: a['new'] = 1
In [381]: a
Out[381]:
a b c new
1 5 2 8 1
2 8 1 6 1
In [382]: df
Out[382]:
a b c
0 9 6 4
1 5 2 8
2 8 1 6
3 3 4 1
4 8 0 2
Solution
df.loc[:, 'new'] = 1
pandas uses [] to provide a copy. Use loc and iloc to access the DataFrame directly.
What's more, if the 'new' column didn't already exist, it would have worked. It only threw that warning because the column already existed and you were trying to edit it on a view or copy... I think.

Subsetting DataFrame using ix in Python

I am trying to learn how subsetting works in a pandas DataFrame. I made a random DataFrame as below.
import pandas as pd
import numpy as np
np.random.seed(1234)
X = pd.DataFrame({'var1' : np.random.randint(1,6,5), 'var2' : np.random.randint(6,11,5),
                  'var3': np.random.randint(11,16,5)})
X = X.reindex(np.random.permutation(X.index))
X.iloc[[0,2], 1] = None
X returns,
var1 var2 var3
0 3 NaN 11
4 3 9 13
3 2 NaN 14
2 5 9 12
1 2 7 13
pandas method .loc is strictly label based and .iloc is for integer positions. .ix can be used to combine position based index and labels.
However, in the above example, the row indices are integers, and .ix understands them as row indices not positions. Suppose that I want to retrieve the first two rows of 'var2'. In R, X[1:2, 'var2'] would give the answer. In Python, X.ix[[0,1], 'var2'] returns NaN 7 rather than NaN 9.
The question is "Is there a simple way to let .ix know the indices are position based?"
I've found some solutions for this but they are not simple and intuitive in some cases.
For example, by using _slice() as below, I could get the result I wanted.
>>> X._slice(slice(0, 2), 0)._slice(slice(1,2),1)
var2
0 NaN
4 9
When the row indices are not integers, there's no problem.
>>> X.index = list('ABCED')
>>> X.ix[[0,1], 'var2']
A NaN
B 9
Name: var2, dtype: float64
You could use X['var2'].iloc[[0,1]]:
In [280]: X['var2'].iloc[[0,1]]
Out[280]:
0 NaN
4 9
Name: var2, dtype: float64
Since X['var2'] is a view of X, X['var2'].iloc[[0,1]] is safe for both
access and assignments. But be careful if you use this "chained indexing"
pattern (such as the index-by-column-then-index-by-iloc pattern used here) for assignments, since it does not
generalize to the case of assignments with multiple columns.
For example, X[['var2', 'var3']].iloc[[0,1]] = ... generates a copy of a
sub-DataFrame of X so assignment to this sub-DataFrame does not modify X.
See the docs on "Why assignments using chained indexing
fails" for more explanation.
To be concrete and to show why this view-vs-copy distinction is important: If you have this warning turned on:
pd.options.mode.chained_assignment = 'warn'
then this assign raises a SettingWithCopyWarning warning:
In [252]: X[['var2', 'var3']].iloc[[0,1]] = 100
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a
DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
and the assignment fails to modify X. Eek!
In [281]: X
Out[281]:
var1 var2 var3
0 3 NaN 11
4 3 9 13
3 2 NaN 14
2 5 9 12
1 2 7 13
To get around this problem, when you want an assignment to affect X, you must
assign through a single indexer (e.g. X.iloc[...] = ..., X.loc[...] = ..., or X.ix[...] = ...), that is, without chained indexing.
In this case, you could use
In [265]: X.iloc[[0,1], X.columns.get_indexer_for(['var2', 'var3'])] = 100
In [266]: X
Out[266]:
var1 var2 var3
0 3 100 100
4 3 100 100
3 2 NaN 14
2 5 9 12
1 2 7 13
but I wonder if there is a better way, since this is not terribly pretty.

pandas dataframe view vs copy, how do I tell?

What's the difference between:
pandas df.loc[:,('col_a','col_b')]
and
df.loc[:,['col_a','col_b']]
The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Thanks
If your DataFrame has a simple column index, then there is no difference.
For example,
In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))
In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
But if the DataFrame has a MultiIndex, there can be a big difference:
df = pd.DataFrame(np.random.randint(10, size=(5,4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
                                                   list('CDCDC')]))
#       foo    bar
#        A  B   A  B
# baz C  7  9   9  9
#     D  7  5   5  4
# qux C  5  0   5  1
#     D  1  7   7  4
#     C  6  4   3  5
In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64
In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
The KeyError is saying that the MultiIndex has to be lexsorted. If we do that (via sortlevel below, which newer pandas spells sort_index(axis=1)), we still get a different result:
In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]:
baz  C    9
     D    5
qux  C    0
     D    7
     C    4
Name: (foo, B), dtype: int64
In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]:
      foo
       A  B
baz C  7  9
    D  7  5
qux C  5  0
    D  1  7
    C  6  4
Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.
In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
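If what you actually want is specific columns rather than whole top levels, a list of tuples addresses exact columns. A small self-contained sketch of the same kind of frame as above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(5, 4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2 + ['bar']*2,
                                                     list('ABAB')]))

print(df.loc[:, ('foo', 'B')])                  # Series: exactly the (foo, B) column
print(df.loc[:, [('foo', 'B'), ('bar', 'A')]])  # DataFrame: just those two columns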
I think the operating principle with Pandas is that if you use df.loc[...] as
an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form
df.loc[...] = value
then you can trust Pandas to alter df itself.
The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chained assignments of the form
df.loc[...][...] = value
Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then
df.loc[...][...] = value
is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.
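A quick demonstration of that silent no-op (a sketch; boolean-mask selection always produces a copy, so the chained write below never reaches df, though pandas will usually emit a SettingWithCopyWarning):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

df.loc[df['a'] > 1]['b'] = 0   # writes into a temporary copy

print(df['b'].tolist())        # [4, 5, 6] -- unchanged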
I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.
However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):
If the resultant NDFrame can not be expressed as a basic slice of the
underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
If the resultant NDFrame has columns of different dtypes, then df.loc
will again probably return a copy.
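The first rule mirrors NumPy itself, where a basic slice shares memory and fancy indexing does not; np.shares_memory makes this easy to verify:
import numpy as np

a = np.arange(10)
basic = a[2:6]            # basic slice -> view
fancy = a[[2, 3, 4, 5]]   # integer-array ("fancy") indexing -> copy

print(np.shares_memory(a, basic))   # True
print(np.shares_memory(a, fancy))   # False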
However, there is an easy way to determine if x = df.loc[...] is a view a posteriori: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.
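Here is that a posteriori test in code (a sketch: it mutates x, so run it on throwaway data and restore the value afterwards):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
x = df.loc[0:1]            # view or copy? we cannot tell up front

before = df.iat[0, 0]
x.iat[0, 0] = before + 1   # poke the candidate

print('view' if df.iat[0, 0] != before else 'copy')
df.iat[0, 0] = before      # restore df either way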

How to create lazy_evaluated dataframe columns in Pandas

A lot of times, I have a big DataFrame df to hold the basic data, and I need to create many more columns to hold the derivative data calculated from the basic data columns.
I can do that in Pandas like:
df['derivative_col1'] = df['basic_col1'] + df['basic_col2']
df['derivative_col2'] = df['basic_col1'] * df['basic_col2']
....
df['derivative_coln'] = func(list_of_basic_cols)
etc. Pandas will calculate and allocate the memory for all derivative columns all at once.
What I want now is a lazy evaluation mechanism that postpones the calculation and memory allocation of derivative columns until the moment they are actually needed. Something like defining the lazily evaluated columns as:
df['derivative_col1'] = pandas.lazy_eval(df['basic_col1'] + df['basic_col2'])
df['derivative_col2'] = pandas.lazy_eval(df['basic_col1'] * df['basic_col2'])
That would save time and memory, like a Python 'yield' generator: issuing the df['derivative_col2'] command would trigger only that specific calculation and memory allocation.
So how to do lazy_eval() in Pandas? Any tips/thoughts/references are welcome.
Starting in 0.13 (releasing very soon), you can do something like this. This is using generators to evaluate a dynamic formula. In-line assignment via eval will be an additional feature in 0.13, see here
In [19]: df = DataFrame(randn(5, 2), columns=['a', 'b'])
In [20]: df
Out[20]:
a b
0 -1.949107 -0.763762
1 -0.382173 -0.970349
2 0.202116 0.094344
3 -1.225579 -0.447545
4 1.739508 -0.400829
In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]
Create a generator that evaluates each formula using eval, assigns the result, and then yields the frame.
In [22]: def lazy(x, formulas):
....: for col, f in formulas:
....: x[col] = x.eval(f)
....: yield x
....:
In action (gen.next() is Python 2 syntax; in Python 3 use next(gen)):
In [23]: gen = lazy(df,formulas)
In [24]: gen.next()
Out[24]:
a b c
0 -1.949107 -0.763762 -2.712869
1 -0.382173 -0.970349 -1.352522
2 0.202116 0.094344 0.296459
3 -1.225579 -0.447545 -1.673123
4 1.739508 -0.400829 1.338679
In [25]: gen.next()
Out[25]:
a b c d
0 -1.949107 -0.763762 -2.712869 5.287670
1 -0.382173 -0.970349 -1.352522 0.516897
2 0.202116 0.094344 0.296459 0.059919
3 -1.225579 -0.447545 -1.673123 2.050545
4 1.739508 -0.400829 1.338679 2.328644
So it's a user-determined ordering for the evaluation (and not on-demand). In theory numba is going to support this, so pandas could possibly support it as a backend for eval (which currently uses numexpr for immediate evaluation).
My 2c:
Lazy evaluation is nice, but it can easily be achieved by using Python's own continuation/generator features, so building it into pandas, while possible, is quite tricky and would need a really compelling use case to be generally useful.
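For comparison, here is what "use Python's own features" can look like: keep the formulas as callables and materialize a column only on first access. This is a hypothetical helper, not a pandas API; the class name and recipe dict are made up:
import pandas as pd

class LazyColumns:
    # Materialize derived columns only when first requested.
    def __init__(self, df, recipes):
        self.df = df
        self.recipes = recipes   # {col_name: callable taking df, returning a Series}

    def __getitem__(self, col):
        if col not in self.df.columns and col in self.recipes:
            self.df[col] = self.recipes[col](self.df)   # compute and cache on demand
        return self.df[col]

df = pd.DataFrame({'basic_col1': [1, 2, 3], 'basic_col2': [4, 5, 6]})
lazy = LazyColumns(df, {
    'derivative_col1': lambda d: d['basic_col1'] + d['basic_col2'],
    'derivative_col2': lambda d: d['basic_col1'] * d['basic_col2'],
})

print(lazy['derivative_col1'])   # computed and cached here
print(df.columns.tolist())       # derivative_col2 has not been allocated yet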
You could subclass DataFrame, and add the column as a property. For example,
import pandas as pd
class LazyFrame(pd.DataFrame):
    @property
    def derivative_col1(self):
        self['derivative_col1'] = result = self['basic_col1'] + self['basic_col2']
        return result

x = LazyFrame({'basic_col1': [1, 2, 3],
               'basic_col2': [4, 5, 6]})
print(x)
# basic_col1 basic_col2
# 0 1 4
# 1 2 5
# 2 3 6
Accessing the property (via x.derivative_col1, below) calls the derivative_col1 function defined in LazyFrame. This function computes the result and adds the derived column to the LazyFrame instance:
print(x.derivative_col1)
# 0 5
# 1 7
# 2 9
print(x)
# basic_col1 basic_col2 derivative_col1
# 0 1 4 5
# 1 2 5 7
# 2 3 6 9
Note that if you modify a basic column:
x['basic_col1'] *= 10
the derived column is not automatically updated:
print(x['derivative_col1'])
# 0 5
# 1 7
# 2 9
But if you access the property, the values are recomputed:
print(x.derivative_col1)
# 0 14
# 1 25
# 2 36
print(x)
# basic_col1 basic_col2 derivative_col1
# 0 10 4 14
# 1 20 5 25
# 2 30 6 36
