pandas dataframe view vs copy, how do I tell?


What's the difference between:
df.loc[:,('col_a','col_b')]
and
df.loc[:,['col_a','col_b']]
The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Thanks

If your DataFrame has a simple column index, then there is no difference.
For example,
In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))
In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
But if the DataFrame has a MultiIndex, there can be a big difference:
df = pd.DataFrame(np.random.randint(10, size=(5,4)),
columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
list('ABAB')]),
index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
list('CDCDC')]))
# foo bar
# A B A B
# baz C 7 9 9 9
# D 7 5 5 4
# qux C 5 0 5 1
# D 1 7 7 4
# C 6 4 3 5
In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
The KeyError is saying that the MultiIndex has to be lexsorted. If we do that, then we still get a different result:
In [29]: df.sortlevel(axis=1).loc[:, ('foo','B')]
Out[29]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [30]: df.sortlevel(axis=1).loc[:, ['foo','B']]
Out[30]:
foo
A B
baz C 7 9
D 7 5
qux C 5 0
D 1 7
C 6 4
Why is that? df.sortlevel(axis=1).loc[:, ('foo','B')] is selecting the column where the first column level equals foo, and the second column level is B.
In contrast, df.sortlevel(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
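The tuple-versus-list distinction can be reproduced with a small deterministic frame. This is only a sketch: it assumes a recent pandas release and uses sort_index(axis=1) to lexsort the columns, since sortlevel is not available in newer versions. The list contains only 'foo' because newer versions are expected to raise a KeyError for labels that are missing from the first level.

```python
import numpy as np
import pandas as pd

# Two-level column index: ('foo', 'A'), ('foo', 'B'), ('bar', 'A'), ('bar', 'B')
cols = pd.MultiIndex.from_arrays([["foo", "foo", "bar", "bar"], list("ABAB")])
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=cols).sort_index(axis=1)

# A tuple is one key spanning both levels: exactly one column, returned as a Series
single = df.loc[:, ("foo", "B")]

# A list holds labels for the first level only: every column under 'foo'
group = df.loc[:, ["foo"]]

print(single)
print(group)
```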
I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form
df.loc[...] = value
then you can trust Pandas to alter df itself.
The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chained assignments of the form
df.loc[...][...] = value
Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then
df.loc[...][...] = value
is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well: there are no references to the copy, so there is no way to access it after the assignment statement completes, and (at least in CPython) it is soon garbage collected.
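As a minimal sketch of that pitfall, the snippet below uses a boolean mask (which yields a copy) and compares a chained assignment with a single .loc assignment. The frame and its values are made up for illustration, and the warnings filter is only there to keep the output quiet.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning
    # Chained form: df.loc[mask] is evaluated first and yields a copy,
    # so the assignment modifies that temporary copy, not df.
    df.loc[df["A"] > 2]["B"] = -1
after_chained = df["B"].tolist()

# Single indexer: pandas modifies df itself.
df.loc[df["A"] > 2, "B"] = -1
after_single = df["B"].tolist()

print(after_chained)
print(after_single)
```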
I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.
However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):
If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.
However, there is an easy way to determine if x = df.loc[..] is a view a posteriori: simply see if changing a value in x affects df. If it does, it is a view; if not, x is a copy.
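A sketch of that after-the-fact check, using a list-based column selection (which, per the rule of thumb above, is expected to be a copy). Note that this is an assumption-laden test: if pandas' Copy-on-Write mode is active, a write to x should never propagate to df, so the check would always report a copy.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
x = df.loc[:, ["A", "B"]]  # arbitrary column list: expected to be a copy

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a SettingWithCopyWarning may be emitted here
    x.iloc[0, 0] = 999  # change a value in x ...

# ... and see whether the change shows up in df
x_is_view = bool(df.iloc[0, 0] == 999)
print("view" if x_is_view else "copy")
```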

Related

Pandas df['col1':'col2'] giving the output I don't understand

I was doing something very basic like this -
data = np.arange(1,13).reshape(4,3)
table = pd.DataFrame(data, index = list('abcd'), columns =['foo','bar','baz'])
table
foo bar baz
a 1 2 3
b 4 5 6
c 7 8 9
d 10 11 12
And then I ran this -
table['bar':'foo']
#output
foo bar baz
c 7 8 9
d 10 11 12
I don't get why I am getting this result. Note that I am not asking for any other solution or workaround. I am just looking for explanation/rules behind this behavior.

I'm not entirely sure, but it looks like you can't use slicing for column names. The slicing only works on the rows, so only c and d are (lexicographically) between bar and foo.
You can instead use loc:
table.loc[:, 'foo':'bar']
Note that I changed the order of foo and bar. This is because the columns are ordered as you defined them (foo -> bar -> baz), not lexicographically. With loc, 'bar':'foo' will return an empty dataframe.

It's basically outputting row slices by comparing bar and foo lexicographically with the existing index labels. The output includes rows c and d as they're the only two rows that fall between bar and foo: a < b < bar < c < d < ... < foo
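The lexicographic ordering described here can be checked directly. The sketch below rebuilds the table from the question and compares the row slice with the labels that sort between 'bar' and 'foo'. The helper names (rows, ordered, between) are only illustrative.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    np.arange(1, 13).reshape(4, 3),
    index=list("abcd"),
    columns=["foo", "bar", "baz"],
)

# df[x:y] slices rows by index label, even when x and y look like column names
rows = table["bar":"foo"]

# Sort the index labels together with the two slice bounds
ordered = sorted(list(table.index) + ["bar", "foo"])
between = ordered[ordered.index("bar") + 1 : ordered.index("foo")]

print(ordered)
print(list(rows.index))
```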

First, you have to know that the notation df[x:y] tries to slice your DataFrame by index labels, not columns. This differs from the notation df[x], which tries to select a column. A generic way to filter your DataFrame is to use .loc (or .iloc). You should read the documentation about Indexing and selecting data.
>>> table['bar':'foo']
foo bar baz
c 7 8 9 # 'bar' <= 'c' (a b bar c d)
d 10 11 12 # 'd' <= 'foo' (a b c d foo)
With the default index, this code usually raises an exception, because a RangeIndex (or another integer index) cannot be sliced with string labels:
>>> table.reset_index(drop=True)['bar':'foo']
...
TypeError: cannot do slice indexing on RangeIndex with these indexers [bar] of type str
Immediately, we understand the problem and use .loc:
>>> table.loc[:, 'foo':'bar']
foo bar
a 1 2
b 4 5
c 7 8
d 10 11

How to implement arbitrary condition in pandas style function? [duplicate]

I would like to perform arithmetic on one or more dataframes columns using pd.eval. Specifically, I would like to port the following code that evaluates a formula:
x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
...to code using pd.eval. The reason for using pd.eval is that I would like to automate many workflows, so creating them dynamically will be useful to me.
My two input DataFrames are:
import pandas as pd
import numpy as np
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df1
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
df2
A B C D
0 5 9 8 9
1 4 3 0 3
2 5 0 2 3
3 8 1 3 3
4 3 7 0 1
I am trying to better understand pd.eval's engine and parser arguments to determine how best to solve my problem. I have gone through the documentation, but the difference was not made clear to me.
What arguments should be used to ensure my code is working at the maximum performance?
Is there a way to assign the result of the expression back to df2?
Also, to make things more complicated, how do I pass x as an argument inside the string expression?

You can use 1) pd.eval(), 2) df.query(), or 3) df.eval(). Their various features and functionality are discussed below.
Examples will involve these dataframes (unless otherwise specified).
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
1) pandas.eval
This is the "Missing Manual" that pandas doc should contain.
Note: of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage are more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section introduces functionality that is common to all three functions, including (but not limited to) allowed syntax, precedence rules, and keyword arguments.
pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do
x = 5
pd.eval("df1.A + (df1.B * x)")
Some things to note here:
The entire expression is a string
df1, df2, and x refer to variables in the global namespace, these are picked up by eval when parsing the expression
Specific columns are accessed using attribute access. You can also use "df1['A'] + (df1['B'] * x)" to the same effect.
I will be addressing the specific issue of reassignment in the section explaining the target=... argument below. But for now, here are more simple examples of valid operations with pd.eval:
pd.eval("df1.A + df2.A") # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5") # Valid, returns a pd.DataFrame object
...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.
pd.eval("df1 > df2")
pd.eval("df1 > 5")
pd.eval("df1 < df2 and df3 < df4")
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")
A list detailing all the supported features and syntax can be found in the documentation. In summary,
Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
Comparison operations, including chained comparisons, e.g., 2 < df < df2
Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
list and tuple literals, e.g., [1, 2] or (1, 2)
Attribute access, e.g., df.a
Subscript expressions, e.g., df[0]
Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
This section of the documentation also specifies syntax that is not supported, including set/dict literals, if-else statements, loops, comprehensions, and generator expressions.
From the list, it is obvious you can also pass expressions involving the index, such as
pd.eval('df1.A * (df1.index > 1)')
1a) Parser Selection: The parser=... argument
pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.
Using the default parser pandas, the overloaded bitwise operators & and | which implement vectorized AND and OR operations with pandas objects will have the same operator precedence as and and or. So,
pd.eval("(df1 > df2) & (df3 < df4)")
Will be the same as
pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')
And also the same as
pd.eval("df1 > df2 and df3 < df4")
With the 'pandas' parser, the parentheses are optional. In ordinary Python code, they are required to override the higher precedence of the bitwise operators:
(df1 > df2) & (df3 < df4)
Without that, we end up with
df1 > df2 & df3 < df4
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Use parser='python' if you want to maintain consistency with python's actual operator precedence rules while evaluating the string.
pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
The other difference between the two parsers is the semantics of the == and != operators with list and tuple nodes: with the 'pandas' parser they behave like in and not in, respectively. For example,
pd.eval("df1 == [1, 2, 3]")
Is valid, and will run with the same semantics as
pd.eval("df1 in [1, 2, 3]")
On the other hand, pd.eval("df1 == [1, 2, 3]", parser='python') will raise a NotImplementedError.
1b) Backend Selection: The engine=... argument
There are two options - numexpr (the default) and python. The numexpr option uses the numexpr backend which is optimized for performance.
With Python backend, your expression is evaluated similar to just passing the expression to Python's eval function. You have the flexibility of doing more inside expressions, such as string operations, for instance.
df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')
0 True
1 False
2 True
Name: A, dtype: bool
Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so use at your own risk! It is generally not recommended to change this option to 'python' unless you know what you're doing.
1c) local_dict and global_dict arguments
Sometimes, it is useful to supply values for variables used inside expressions, but not currently defined in your namespace. You can pass a dictionary to local_dict.
For example:
pd.eval("df1 > thresh")
UndefinedVariableError: name 'thresh' is not defined
This fails because thresh is not defined. However, this works:
pd.eval("df1 > thresh", local_dict={'thresh': 10})
This is useful when you have variables to supply from a dictionary. Alternatively, with the Python engine, you could simply do this:
mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')
But this may be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this makes a convincing argument for the use of these parameters.
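A small self-contained sketch of the local_dict approach follows. The frame is made up, and passing df1 itself through local_dict is just a choice made here so the expression does not depend on the calling namespace. The result is compared with the equivalent plain pandas expression.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [5, 7, 2], "B": [0, 9, 4]})

# Every name used in the expression is supplied explicitly through local_dict
result = pd.eval("df1 > thresh", local_dict={"df1": df1, "thresh": 4})

# The same comparison written as ordinary pandas code
expected = df1 > 4

print(result)
```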
1d) The target (+ inplace) argument, and Assignment Expressions
This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that supports item assignment (i.e. implements __setitem__), such as dicts and (you guessed it) DataFrames.
Consider the example in the question
x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
To assign a column "D" to df2, we do
pd.eval('D = df1.A + (df1.B * x)', target=df2)
A B C D
0 5 9 8 5
1 4 3 0 52
2 5 0 2 22
3 8 1 3 48
4 3 7 0 42
This is not an in-place modification of df2 (but it can be... read on). Consider another example:
pd.eval('df1.A + df2.A')
0 10
1 11
2 7
3 16
4 10
dtype: int32
If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:
df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
F B G H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to
# df = df.assign(B=pd.eval('df1.A + df2.A'))
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If you wanted to perform an in-place mutation on df, set inplace=True.
pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to
# df['B'] = pd.eval('df1.A + df2.A')
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If inplace is set without a target, a ValueError is raised.
While the target argument is fun to play around with, you will seldom need to use it.
If you wanted to do this with df.eval, you would use an expression involving an assignment:
df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
Note
One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:
pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)
It can also parse nested lists with the 'python' engine:
pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]
And lists of strings:
pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]
The problem, however, is for lists with length larger than 100:
pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python')
AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
More information on this error, its causes, fixes, and workarounds can be found here.
2) DataFrame.eval:
As mentioned above, df.eval calls pd.eval under the hood, with a bit of juxtaposition of arguments. The v0.23 source code shows this:
def eval(self, expr, inplace=False, **kwargs):
    from pandas.core.computation.eval import eval as _eval
    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)
eval creates arguments, does a little validation, and passes the arguments on to pd.eval.
For more, you can read on: When to use DataFrame.eval() versus pandas.eval() or Python eval()
2a) Usage Differences
2a1) Expressions with DataFrames vs. Series Expressions
For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.
2a2) Specifying Column Names
Another major difference is how columns are accessed. For example, to add two columns "A" and "B" in df1, you would call pd.eval with the following expression:
pd.eval("df1.A + df1.B")
With df.eval, you need only supply the column names:
df1.eval("A + B")
Since, within the context of df1, it is clear that "A" and "B" refer to column names.
You can also refer to the index using index (unless the index is named, in which case you would use the name).
df1.eval("A + index")
Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k" which stands for "index at level k". In other words, the expression above can be written as df1.eval("A + ilevel_0").
These rules also apply to df.query.
2a3) Accessing Variables in Local/Global Namespace
Variables supplied inside expressions must be preceded by the "@" symbol, to avoid confusion with column names.
A = 5
df1.eval("A > @A")
The same goes for query.
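As a runnable sketch of this rule, the snippet below uses a made-up single-column frame in which the column and the Python variable share the name A. The prefixed name is looked up in the surrounding Python scope, while the bare name refers to the column.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [5, 7, 2, 8, 7]})
A = 5  # a Python variable with the same name as the column

# Bare A is the column; the prefixed A is the Python variable defined above
mask = df1.eval("A > @A")
print(mask.tolist())
```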
It goes without saying that your column names must follow the rules for valid identifier naming in Python to be accessible inside eval. See here for a list of rules on naming identifiers.
2a4) Multiline Queries and Assignment
A little known fact is that eval supports multiline expressions that deal with assignment (whereas query doesn't). For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do
df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")
A B C D E F G
0 5 0 3 3 5 14 False
1 7 9 3 5 16 7 True
2 2 4 7 6 6 5 True
3 8 8 1 6 16 9 True
4 7 7 8 1 14 10 True
3) eval vs query
It helps to think of df.query as a function that uses pd.eval as a subroutine.
Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True result. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,
The result of the evaluation of this expression is first passed to
DataFrame.loc and if that fails because of a multidimensional key
(e.g., a DataFrame) then the result will be passed to
DataFrame.__getitem__().
This method uses the top-level pandas.eval() function to evaluate the
passed query.
In terms of similarity, query and df.eval are both alike in how they access column names and variables.
The key difference between the two, as mentioned above, is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider
df1.A
0 5
1 7
2 2
3 8
4 7
Name: A, dtype: int32
df1.B
0 0
1 9
2 4
3 8
4 7
Name: B, dtype: int32
To get all rows where "A" >= "B" in df1, we would use eval like this:
m = df1.eval("A >= B")
m
0 True
1 False
2 False
3 True
4 True
dtype: bool
m represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter df1:
df1[m]
# df1.loc[m]
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
However, with query, the intermediate result "m" is directly passed to loc, so with query, you would simply need to do
df1.query("A >= B")
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
Performance wise, it is exactly the same.
df1_big = pd.concat([df1] * 100000, ignore_index=True)
%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")
14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But the latter is more concise, and expresses the same operation in a single step.
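The equivalence of the two approaches can be checked on a small frame. This sketch hard-codes the df1 values shown earlier rather than regenerating them from the random seed.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": [5, 7, 2, 8, 7],
    "B": [0, 9, 4, 8, 7],
    "C": [3, 3, 7, 1, 8],
    "D": [3, 5, 6, 6, 1],
})

# Two steps: build the boolean mask with eval, then filter with it
mask = df1.eval("A >= B")
via_eval = df1[mask]

# One step: query evaluates the expression and filters the rows itself
via_query = df1.query("A >= B")

print(via_query)
```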
Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index)
df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
But don't.
Bottom line: Please use query when querying or filtering rows based on a conditional expression.

There are great tutorials already, but before jumping wildly into eval/query because of its simpler syntax, bear in mind that it has severe performance issues if your dataset has fewer than 15,000 rows.
In that case, simply use df.loc[mask1, mask2].
Refer to: Expression Evaluation via eval()

Conditionally replace values in pandas.DataFrame with previous value

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.
I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).
Is there a fast and/or memory efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)
A simple example:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
A B
0 1 a
1 2 b
2 3 c
3 4 d
4 1000 e # '1000 e' --> '4 e'
5 6 f
6 7 g
7 8 h

You can simply mask values over your threshold and use ffill:
df.assign(A=df.A.mask(df.A.gt(10)).ffill())
A B
0 1.0 a
1 2.0 b
2 3.0 c
3 4.0 d
4 4.0 e
5 6.0 f
6 7.0 g
7 8.0 h
Using mask is necessary rather than something like shift, because it guarantees non-outlier output in the case that the previous value is also above a threshold.
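To illustrate that last point, here is a small sketch with two consecutive outliers (the values and the threshold of 10 are made up). The shift-based variant copies the first outlier into the second position, while mask plus ffill carries the last valid value forward.

```python
import pandas as pd

s = pd.Series([1, 2, 1000, 2000, 5])
is_outlier = s.gt(10)

# mask() blanks every outlier, then ffill() carries the last good value forward
masked = s.mask(is_outlier).ffill()

# shift-based alternative: replace each outlier with its immediate predecessor,
# which may itself be an outlier
shifted = s.where(~is_outlier, s.shift())

print(masked.tolist())
print(shifted.tolist())
```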

I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.
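import numpy as np

def df_replace_with_previous(df, col, maskfunc, inplace=False):
    arr = np.array(df[col])  # work on a NumPy copy of the column
    mask = maskfunc(arr)     # True where the value is an outlier
    # Shift the mask by one so each True selects the entry before an outlier
    arr[mask] = arr[list(mask)[1:] + [False]]
    if inplace:
        df[col] = arr
        return
    else:
        df2 = df.copy()
        df2[col] = arr
        return df2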
This creates a mask, shifts it by one position so that the True values point at the previous entry, and updates the array. Of course, this needs to be applied repeatedly if there are multiple adjacent outliers (N times for N consecutive outliers), which is not ideal.
Usage in the case given in OP:
df_replace_with_previous(df,'A',lambda x:x>10,False)

Subsetting DataFrame using ix in Python

I am trying to learn how subsetting works in pandas DataFrame. I made a random dataframe as below.
import pandas as pd
import numpy as np
np.random.seed(1234)
X = pd.DataFrame({'var1' : np.random.randint(1,6,5), 'var2' : np.random.randint(6,11,5),
'var3': np.random.randint(11,16,5)})
X = X.reindex(np.random.permutation(X.index))
X.iloc[[0,2], 1] = None
X returns,
var1 var2 var3
0 3 NaN 11
4 3 9 13
3 2 NaN 14
2 5 9 12
1 2 7 13
pandas method .loc is strictly label based and .iloc is for integer positions. .ix can be used to combine position based index and labels.
However, in the above example, the row indices are integers, and .ix understands them as row indices not positions. Suppose that I want to retrieve the first two rows of 'var2'. In R, X[1:2, 'var2'] would give the answer. In Python, X.ix[[0,1], 'var2'] returns NaN 7 rather than NaN 9.
The question is "Is there a simple way to let .ix know the indices are position based?"
I've found some solutions for this but they are not simple and intuitive in some cases.
For example, by using _slice() as below, I could get the result I wanted.
>>> X._slice(slice(0, 2), 0)._slice(slice(1,2),1)
var2
0 NaN
4 9
When the row indices are not integers, there's no problem.
>>> X.index = list('ABCED')
>>> X.ix[[0,1], 'var2']
A NaN
B 9
Name: var2, dtype: float64

You could use X['var2'].iloc[[0,1]]:
In [280]: X['var2'].iloc[[0,1]]
Out[280]:
0 NaN
4 9
Name: var2, dtype: float64
Since X['var2'] is a view of X, X['var2'].iloc[[0,1]] is safe for both
access and assignments. But be careful if you use this "chained indexing"
pattern (such as the index-by-column-then-index-by-iloc pattern used here) for assignments, since it does not
generalize to the case of assignments with multiple columns.
For example, X[['var2', 'var3']].iloc[[0,1]] = ... generates a copy of a
sub-DataFrame of X so assignment to this sub-DataFrame does not modify X.
See the docs on "Why assignments using chained indexing
fails" for more explanation.
To be concrete and to show why this view-vs-copy distinction is important: If you have this warning turned on:
pd.options.mode.chained_assignment = 'warn'
then this assignment raises a SettingWithCopyWarning:
In [252]: X[['var2', 'var3']].iloc[[0,1]] = 100
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a
DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
and the assignment fails to modify X. Eek!
In [281]: X
Out[281]:
var1 var2 var3
0 3 NaN 11
4 3 9 13
3 2 NaN 14
2 5 9 12
1 2 7 13
To get around this problem, when you want an assignment to affect X, you must assign to a single indexer (e.g. X.iloc[...] = ... or X.loc[...] = ... or X.ix[...] = ...), that is, without chained indexing.
In this case, you could use
In [265]: X.iloc[[0,1], X.columns.get_indexer_for(['var2', 'var3'])] = 100
In [266]: X
Out[266]:
var1 var2 var3
0 3 100 100
4 3 100 100
3 2 NaN 14
2 5 9 12
1 2 7 13
but I wonder if there is a better way, since this is not terribly pretty.
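One possible alternative, offered only as a sketch: translate the row positions into labels with X.index and keep the column names, so a single .loc call does the assignment. This assumes the index labels are unique, and the frame below hard-codes the values shown above instead of regenerating them from the random seed.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "var1": [3, 3, 2, 5, 2],
        "var2": [np.nan, 9, np.nan, 9, 7],
        "var3": [11, 13, 14, 12, 13],
    },
    index=[0, 4, 3, 2, 1],
)

# X.index[[0, 1]] turns the positions 0 and 1 into the labels 0 and 4,
# so .loc can combine them with column names in a single indexer
X.loc[X.index[[0, 1]], ["var2", "var3"]] = 100

print(X)
```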

Why does a copy get created when assigned with None?

In[216]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})
In[217]: bar = foo.ix[:1]
In[218]: bar
Out[218]:
a b
0 1 3
1 2 4
A view is created as expected.
In[219]: bar['a'] = 100
In[220]: bar
Out[220]:
a b
0 100 3
1 100 4
In[221]: foo
Out[221]:
a b
0 100 3
1 100 4
2 3 5
If view is modified, so is the original dataframe foo.
However, if the assignment is done with None, then a copy seems to be made.
Can anyone shed some light on what's happening and maybe the logic behind?
In[222]: bar['a'] = None
In[223]: bar
Out[223]:
a b
0 None 3
1 None 4
In[224]: foo
Out[224]:
a b
0 100 3
1 100 4
2 3 5

When you assign bar['a'] = None, you're forcing the column to change its dtype from an integer dtype such as int32 or int64 to object.
Doing so forces it to allocate a new array of object for the column, and then of course it writes to that new array instead of to the old array that's shared with the original DataFrame.
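The dtype change itself is easy to observe on a standalone frame. This is a sketch that only shows the dtypes before and after the assignment. It does not try to demonstrate the view or copy relationship with a parent frame.

```python
import pandas as pd

bar = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
before = bar["a"].dtype  # an integer dtype

# None cannot be stored in an integer array, so the column becomes object dtype
bar["a"] = None
after = bar["a"].dtype

print(before, "->", after)
```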

You are doing a form of chained assignment, see here why this is a really bad idea.
See this question as well here
Pandas will generally warn you that you are modifying a view (even more so in 0.15.0).
In [49]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})
In [51]: foo
Out[51]:
a b
0 1 3
1 2 4
2 3 5
In [52]: bar = foo.ix[:1]
In [53]: bar
Out[53]:
a b
0 1 3
1 2 4
In [54]: bar.dtypes
Out[54]:
a int64
b int64
dtype: object
# this is an internal method (but is for illustration)
In [56]: bar._is_view
Out[56]: True
# this will warn in 0.15.0
In [57]: bar['a'] = 100
/usr/local/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#!/usr/local/bin/python
In [58]: bar._is_view
Out[58]: True
# bar is now a copied object (and will replace the existing dtypes with new ones).
In [59]: bar['a'] = None
In [60]: bar.dtypes
Out[60]:
a object
b int64
dtype: object
You should never rely on whether something is a view (even in numpy), except in certain very performance-critical situations. It is not guaranteed, since it depends on the memory layout of the underlying data.
You should very, very rarely try to propagate changes through a view, and doing this in pandas is almost always going to cause trouble when you have mixed dtypes. (In numpy you can only have a view on a single dtype; I am not even sure what a view on a multi-dtyped array that changes dtype would do, or if it's even allowed.)
