Pandas advanced indexing assignment - python

In a Pandas (v0.8.0) DataFrame I want to overwrite one slice of columns with another.
The below code throws the listed error.
What would be an efficient alternative method for achieving this?
df = DataFrame({'a' : range(0,7),
'b' : np.random.randn(7),
'c' : np.random.randn(7),
'd' : np.random.randn(7),
'e' : np.random.randn(7),
'f' : np.random.randn(7),
'g' : np.random.randn(7)})
# overwrite cols
df.ix[:,'b':'d'] = df.ix[:, 'e':'g']
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 68, in __setitem__
    self._setitem_with_indexer(indexer, value)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 98, in _setitem_with_indexer
    raise ValueError('Setting mixed-type DataFrames with '
ValueError: Setting mixed-type DataFrames with array/DataFrame pieces not yet supported
Edited
And as a variation, how could I also specify a subset of the rows to set?
df.ix[df['a'] < 3, 'b':'d'] = df.ix[df['a'] < 3, 'e':'g']

The issue is that using .ix[] returns a view to the actual memory objects for that subset of the DataFrame, rather than a new DataFrame made out of its contents.
Instead use
# The left-hand-side does not use .ix, since we're assigning into it.
df[['b','c']] = df.ix[:,'e':'f'].copy()
Note that you will need .copy() if you are intent on using .ix to do the slicing, otherwise it would set columns 'b' and 'c' as the same objects in memory as the columns 'e' and 'f', which does not seem like what you want to do here.
Alternatively, to avoid worrying about the copying, you can just do:
df[['b','c']] = df[['e','f']]
If the convenience of indexing matters to you, one way to simulate this effect is to write your own function:
def col_range(df, col1, col2):
    # Slice the first row only, then read the column labels off the resulting Series.
    return list(df.ix[df.index.values[0], col1:col2].index)
Now you could do the following:
df[col_range(df,'b','d')] = df.ix[:,'e':'g'].copy()
Note: in the definition of col_range I used the first index which will select the first row of the data frame. I did this because making a view of the whole data frame just to select a range of columns seems wasteful, whereas one row probably won't matter. Since slicing this way produces a Series, the way to extract the columns is to actually grab the index, and I return them as a list.
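On current pandas releases .ix is no longer available, so the helper above will not run as written. A minimal sketch of the same idea, assuming a pandas version that provides Index.slice_indexer, reads the labels straight from df.columns without touching any row data. The sample frame here is made up for illustration.

```python
import pandas as pd


def col_range(df, first, last):
    """Return the column labels from `first` to `last`, inclusive."""
    # slice_indexer turns the two labels into a positional slice over the columns.
    return list(df.columns[df.columns.slice_indexer(first, last)])


# Illustrative frame with columns a..g, as in the question.
df = pd.DataFrame(0.0, index=range(3), columns=list("abcdefg"))
cols = col_range(df, "b", "d")
print(cols)
```

Because only the column Index is consulted, this does not depend on the frame having any rows.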
Added for additional row slice request:
To specify a set of rows in the assignment, you can use .ix, but you need to specify just a matrix of values on the right-hand side. Having the structure of a sub-DataFrame on the right-hand side will cause problems.
df.ix[0:4,col_range(df,'b','d')] = df.ix[0:4,'e':'g'].values
You can replace the [0:4] with [df.index.values[i]:df.index.values[j]] or [df.index.values[i] for i in range(N)] or even with logical values such as [df['a']>5] to only get rows where the 'a' column exceeds 5, for example.
The full slice for an example of logical indexing where you want column 'a' bigger than 5 and column 'e' less than 10 might look like this:
import numpy as np
my_rows = np.logical_and(df['a'] > 5, df['e'] < 10)
df.ix[my_rows,col_range(df,'b','d')] = df.ix[my_rows,'e':'g'].values
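For readers on a pandas version where .ix has been removed, the same row-and-column overwrite can be sketched with .loc. This is an illustration under stated assumptions rather than the original answer's code: the thresholds and random data are made up, and .to_numpy() plays the role that .values plays above. Passing a labelled DataFrame on the right-hand side would, as far as I know, make pandas align on column names, which is not what is wanted when copying 'e':'g' into 'b':'d'.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": range(7), **{c: rng.standard_normal(7) for c in "bcdefg"}})
original = df.copy()  # kept only so the result can be checked afterwards

# Boolean row mask; & combines the two conditions element-wise.
my_rows = (df["a"] > 2) & (df["e"] < 10)

# .to_numpy() drops the labels, so values are placed by position
# (e -> b, f -> c, g -> d) rather than aligned on column names.
df.loc[my_rows, "b":"d"] = df.loc[my_rows, "e":"g"].to_numpy()
```

Rows outside the mask and the source columns 'e' to 'g' are left untouched.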
In many cases, you will not need to use the .ix on the left-hand side (I recommend against it because it only works in some cases and not in others). For instance, something like:
df["A"] = np.repeat(False, len(df))
df["A"][df["B"] > 0] = True
will work as is, no special .ix needed for identifying the rows where the condition is true. The .ix seems to be needed on the left when the thing on the right is complicated.

Related


Why does one use of iloc() give a SettingWithCopyWarning, but the other doesn't?

Inside a method of a class I use this statement:
self.__datacontainer.iloc[-1]['c'] = value
Doing this I get a
"SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame"
Now I tried to reproduce this error and wrote the following simple code:
import pandas, numpy
df = pandas.DataFrame(numpy.random.randn(5,3),columns=list('ABC'))
df.iloc[-1]['C'] = 3
There I get no error. Why do I get an error in the first statement and not in the second?
Chain indexing
As the documentation and a couple of other answers on this site ([1], [2]) suggest, chain indexing is considered bad practice and should be avoided.
Since there doesn't seem to be a graceful way of making assignments using integer position based indexing (i.e. .iloc) without violating the chain indexing rule (as of pandas v0.23.4), it is advised to instead use label based indexing (i.e. .loc) for assignment purposes whenever possible.
However, if you absolutely need to access data by row number, you can do:
df.iloc[-1, df.columns.get_loc('c')] = 42
or
df.iloc[[-1, 1], df.columns.get_indexer(['a', 'c'])] = 42
Pandas behaving oddly
From my understanding you're absolutely right to expect the warning when trying to reproduce the error artificially.
What I've found so far is that it depends on how a dataframe is constructed
df = pd.DataFrame({'a': [4, 5, 6], 'c': [3, 2, 1]})
df.iloc[-1]['c'] = 42 # no warning
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'c': ['t', 'u', 'v']})
df.iloc[-1]['c'] = 'f' # no warning
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'c': [3, 2, 1]})
df.iloc[-1]['c'] = 42 # SettingWithCopyWarning: ...
It seems that pandas (at least v0.23.4) handles mixed-type and single-type dataframes differently when it comes to chain assignments [3]
def _check_is_chained_assignment_possible(self):
    """
    Check if we are a view, have a cacher, and are of mixed type.
    If so, then force a setitem_copy check.
    Should be called just near setting a value
    Will return a boolean if it we are a view and are cached, but a
    single-dtype meaning that the cacher should be updated following
    setting.
    """
    if self._is_view and self._is_cached:
        ref = self._get_cacher()
        if ref is not None and ref._is_mixed_type:
            self._check_setitem_copy(stacklevel=4, t='referant',
                                     force=True)
        return True
    elif self._is_copy:
        self._check_setitem_copy(stacklevel=4, t='referant')
    return False
which appears really odd to me although I'm not sure if it's not expected.
However, there's an old bug with a similar behaviour.
UPDATE
According to the developers the above behaviour is expected.
Don't focus on the warning. The warning is just an indication, sometimes it doesn't even come up when you expect it should. Sometimes you will notice it occurs inconsistently. Instead, just avoid chained indexing or generally working with what could be a copy.
You wish to index by row integer location and column label. That's an unnatural mix, given Pandas has functionality to index by integer positions or labels, but not both simultaneously.
In this case, you can use integer positional indexing for both rows and columns via a single iat call:
df.iat[-1, df.columns.get_loc('C')] = 3
Or, if your index labels are guaranteed to be unique, you can use at:
df.at[df.index[-1], 'C'] = 3
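To see the single-call alternatives end to end, here is a small sketch on a mixed-dtype frame like the one that triggered the warning earlier. The helper name set_by_position is hypothetical and exists only for this illustration; it assumes the column labels are unique, so that get_loc returns a single integer.

```python
import pandas as pd


def set_by_position(df, row_pos, col_label, value):
    """Hypothetical helper: set one cell by integer row position and column label."""
    # Translate the label to a position so a single .iloc call does the assignment.
    df.iloc[row_pos, df.columns.get_loc(col_label)] = value


df = pd.DataFrame({"a": ["x", "y", "z"], "c": [3, 2, 1]})

set_by_position(df, -1, "c", 42)  # last row, by position
df.at[df.index[0], "c"] = 7       # first row, by translating the position to a label

print(df)
```

Both assignments go through one indexing operation on df itself, so there is no intermediate object for pandas to warn about.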
So it's pretty hard to answer this without context around your problem operation, but the pandas documentation covers this pretty well.
>>> df[['C']].iloc[0] = 2 # This is a problem
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Basically it boils down to - don't chain together indexing operations if you can just use a single operation to do it.
>>> df.loc[0, 'C'] = 2 # This is ok
The warning you're getting is because you've failed to set a value in the original dataframe that you're presumably trying to modify - instead, you've copied it and set something into the copy (usually when this happens to me I don't even have a reference to the copy and it just gets garbage collected, so the warning is pretty helpful)
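A short sketch of the difference, using a made-up three-row frame: the chained form writes into a temporary object, while the single .loc call writes into df. The warning is silenced here only to keep the demonstration quiet; which warning class is raised depends on the pandas version.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "C": [4, 5, 6]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the chained-assignment warning for this demo
    df[["C"]].iloc[0] = 2            # df[["C"]] is a new frame; the write lands there

after_chained = df["C"].tolist()     # df itself has not changed

df.loc[0, "C"] = 2                   # one indexing operation, applied to df directly
after_loc = df["C"].tolist()

print(after_chained, after_loc)
```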

Assigning pandas selection to a variable and then modifying it

I am trying to select some rows from a pandas dataframe and store the subset/selection into a variable so I can perform multiple operations on this subset (including modification) without having to do the selection again. But I don't quite understand why it doesn't work.
For example, this doesn't work as expected (the original df doesn't get modified):
df = pd.DataFrame({"a":list(range(1,3))})
subDf = df.loc[df.a==2,:]
subDf.loc[:,"a"] = -1 # also throws SettingWithCopyWarning
# ... do more stuff with subDf...
But, this works as expected:
df = pd.DataFrame({"a":list(range(1,3))})
mask = (df.a==2)
df.loc[mask,"a"] = -1
After reading the pandas docs on indexing view vs copy, I was under the impression that selecting via .loc will return a view, but apparently that's not the case given the SettingWithCopyWarning. What am I misunderstanding here?
In subDf = df.loc[df.a==2,:] the method you are using is actually __getitem__ (df.loc.__getitem__) which is not guaranteed to return a view. When you assign something to loc (for example df.loc[mask,"a"] = -1) you are actually calling __setitem__ (df.loc.__setitem__). Here, since it has to assign a value to that slice, it is guaranteed to be a view.
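If the goal is to work on the selection several times and still end up with the changes in df, one possible pattern (a sketch, not the only option) is to keep the mask, take an explicit copy, modify the copy freely, and write it back with a single .loc assignment. The column "b" and the values below are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

mask = df["a"] == 2
sub = df.loc[mask].copy()  # explicit, independent copy: no ambiguity about views

# ... do as many operations on the subset as needed ...
sub["a"] = -1
sub["b"] = sub["b"] * 2

# Write the result back in one step; the DataFrame on the right is
# aligned on its index and column labels.
df.loc[mask, ["a", "b"]] = sub[["a", "b"]]

print(df)
```

The explicit .copy() also makes the intent clear to anyone reading the code: edits to sub are not expected to reach df until the final assignment.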

Tricky str value replacement within PANDAS DataFrame

Problem Overview:
I am attempting to clean stock data loaded from CSV file into Pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, PANDAS ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The PANDAS documentation suggests using the .replace() method, but that doesn't seem to be working with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a','M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index label i you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is quite an exhaustive tutorial on missing values and pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replace as follows.
mask = all_exchanges['MarketCap'].str.contains('M|B')
# overwrites any data in the MarketCapSym column
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'
# only replaces 'n/a'; assign the result back, since replace() on a selection works on a copy
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace('n/a', 'M')
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line:
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Initialize the values in the new column to those in 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() on each element, extracting [$.0-9] and leaving only the M|B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with PANDAS. Let me know what you think!
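For comparison, the same extraction can be sketched with the vectorized string methods instead of a list comprehension. This assumes, as the posts above suggest but do not show, that MarketCap holds strings such as "$1.5B" and uses "n/a" for missing values; the three sample rows are invented.

```python
import pandas as pd

# Invented sample data in the shape described in the question.
all_exchanges = pd.DataFrame({"MarketCap": ["$1.5B", "$250.3M", "n/a"]})

# Capture a trailing M or B; rows without one become NaN and are
# then filled with the 'n/a' placeholder.
all_exchanges["MarketCapSym"] = (
    all_exchanges["MarketCap"]
    .str.extract(r"([MB])$", expand=False)
    .fillna("n/a")
)

print(all_exchanges)
```

Unlike stripping characters with re.sub, the pattern states what to keep, so a value like "n/a" does not leak into the symbol column as "n/a" by accident of what was removed.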

What rules does Pandas use to generate a view vs a copy?

I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I'm lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here are the rules; later ones override earlier ones:
All operations generate a copy
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example above uses .query; this will always return a copy, as it's evaluated by numexpr.)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
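As a concrete illustration of the single-indexer form, here is a minimal sketch on a small made-up frame. The row selection and the column slice are passed to .loc together, so the assignment is one __setitem__ call on df.

```python
import numpy as np
import pandas as pd

# 4x4 frame of 0.0 .. 15.0 with row labels 1..4 (illustrative data).
df = pd.DataFrame(np.arange(16.0).reshape(4, 4), columns=list("BCDE"), index=range(1, 5))
df.loc[2, "C"] = -1.0  # make exactly one row satisfy C <= B

mask = df["C"] <= df["B"]

# Rows and columns in one .loc call: this modifies df in place.
df.loc[mask, "B":"D"] = 0.0

print(df)
```

Only row 2 matches the mask, so its B, C and D values are overwritten while column E and every other row stay as they were.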
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three seem to be all references of df, but the last one is not!
