Should pandas dataframes be nested? - python

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (IE a dataframe assigned to an element of a data frame). So for example:
import pandas as pd
import numpy as np
def some_operation(row):
results = np.random.rand(50, 3) * row['p1'] / row['p2']
res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
return res
# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object) # make sure generic types can be used
# loop over each row, call some_operation and store results DataFrame
for ind, row in df_master.iterrows():
df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe a convenient container because the column/index notation is quite slick. If DataFrames should not be used in this way is there similar alternative? I was looking at the Panel class but I am not sure if it is the proper solution for my application. I would hate forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.

Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
a b
0 100 200
1 2 5
2 3 6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)

Related

Why do I need to slice my series for my function to work? [duplicate]

I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I'm lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here's the rules, subsequent override:
All operations generate a copy
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be that's why this is not reliable). This is mainly for efficiency. (the example from above is for .query; this will always return a copy as its evaluated by numexpr)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you shoulld never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed, offer a much more full explanation.
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three seem to be all references of df, but the last one is not!

Why does one use of iloc() give a SettingWithCopyWarning, but the other doesn't?

Inside a method from a class i use this statement:
self.__datacontainer.iloc[-1]['c'] = value
Doing this i get a
"SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame"
Now i tried to reproduce this error and write the following simple code:
import pandas, numpy
df = pandas.DataFrame(numpy.random.randn(5,3),columns=list('ABC'))
df.iloc[-1]['C'] = 3
There i get no error. Why do i get an error in the first statement and not in the second?
Chain indexing
As the documentation and a couple of other answers on this site ([1], [2]) suggest, chain indexing is considered bad practice and should be avoided.
Since there doesn't seem to be a graceful way of making assignments using integer position based indexing (i.e. .iloc) without violating the chain indexing rule (as of pandas v0.23.4), it is advised to instead use label based indexing (i.e. .loc) for assignment purposes whenever possible.
However, if you absolutely need to access data by row number you can
df.iloc[-1, df.columns.get_loc('c')] = 42
or
df.iloc[[-1, 1], df.columns.get_indexer(['a', 'c'])] = 42
Pandas behaving oddly
From my understanding you're absolutely right to expect the warning when trying to reproduce the error artificially.
What I've found so far is that it depends on how a dataframe is constructed
df = pd.DataFrame({'a': [4, 5, 6], 'c': [3, 2, 1]})
df.iloc[-1]['c'] = 42 # no warning
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'c': ['t', 'u', 'v']})
df.iloc[-1]['c'] = 'f' # no warning
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'c': [3, 2, 1]})
df.iloc[-1]['c'] = 42 # SettingWithCopyWarning: ...
It seems that pandas (at least v0.23.4) handles mixed-type and single-type dataframes differently when it comes to chain assignments [3]
def _check_is_chained_assignment_possible(self):
"""
Check if we are a view, have a cacher, and are of mixed type.
If so, then force a setitem_copy check.
Should be called just near setting a value
Will return a boolean if it we are a view and are cached, but a
single-dtype meaning that the cacher should be updated following
setting.
"""
if self._is_view and self._is_cached:
ref = self._get_cacher()
if ref is not None and ref._is_mixed_type:
self._check_setitem_copy(stacklevel=4, t='referant',
force=True)
return True
elif self._is_copy:
self._check_setitem_copy(stacklevel=4, t='referant')
return False
which appears really odd to me although I'm not sure if it's not expected.
However, there's an old bug with a similar behavour.
UPDATE
According to the developers the above behaviour is expected.
Don't focus on the warning. The warning is just an indication, sometimes it doesn't even come up when you expect it should. Sometimes you will notice it occurs inconsistently. Instead, just avoid chained indexing or generally working with what could be a copy.
You wish to index by row integer location and column label. That's an unnatural mix, given Pandas has functionality to index by integer positions or labels, but not both simultaneously.
In this case, you can use use integer positional indexing for both rows and columns via a single iat call:
df.iat[-1, df.columns.get_loc('C')] = 3
Or, if your index labels are guaranteed to be unique, you can use at:
df.at[df.index[-1], 'C'] = 3
So it's pretty hard to answer this without context around your problem operation, but the pandas documentation covers this pretty well.
>>> df[['C']].iloc[0] = 2 # This is a problem
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Basically it boils down to - don't chain together indexing operations if you can just use a single operation to do it.
>>> df.loc[0, 'C'] = 2 # This is ok
The warning you're getting is because you've failed to set a value in the original dataframe that you're presumably trying to modify - instead, you've copied it and set something into the copy (usually when this happens to me I don't even have a reference to the copy and it just gets garbage collected, so the warning is pretty helpful)

How to unpack the columns of a pandas DataFrame to multiple variables

Lists or numpy arrays can be unpacked to multiple variables if the dimensions match. For a 3xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured that the following works, which is already close to what I try to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores columns separately in Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns in a single ndarray, which likely involves copying actions and type conversions. Also, the resulting container has a single dtype able to accommodate all data in the data frame.
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The dataframe.values shown method is indeed a good solution, but it involves building a numpy array.
In the case you want to access pandas series methods after unpacking, I personally use a different approach.
For the people like me that use a lot of chained methods, I have a solution by adding a custom unpacking method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
"lat": [30, 40],
"lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Tuple
def unpack(self: pd.DataFrame) -> Tuple[pd.Series]:
return (
self[col]
for col in self.columns
)
pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or, can be used in a method chaining.
Imagine a geo function which has to take a latitude serie in the first arg and a longitude in the second arg, named do_something_geographical(lat, lon)
df_result = (
df
.(...some method chaining...)
.assign(
geographic_result=lambda dataframe: do_something_geographical(dataframe[["lat", "lon"]].unpack())
)
.(...some method chaining...)
)

Why use pandas.assign rather than simply initialize new column?

I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by by just initializing a new column 'on the fly'. Is there a reason why assign is better?
For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])
but the pandas.DataFrame.assign documentation recommends doing this:
df.assign(ln_A = lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)
Both methods return the same dataframe. In fact, the first method (my 'on the fly' assignment) is significantly faster (0.202 seconds for 1000 iterations) than the .assign method (0.353 seconds for 1000 iterations).
So is there a reason I should stop using my old method in favour of df.assign?
The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.
In particular, DataFrame.assign returns you a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.
In your particular case:
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign
>>> new_df = df.assign(A=1)
If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference, by necessity .assign must copy the data while [...] does not.
The premise on assign is that it returns:
A new DataFrame with the new columns in addition to all the existing columns.
And also you cannot do anything in-place to change the original dataframe.
The callable must not change input DataFrame (though pandas doesn't check it).
On the other hand df['ln_A'] = np.log(df['A']) will do things inplace.
So is there a reason I should stop using my old method in favour of df.assign?
I think you can try df.assign but if you do memory intensive stuff, better to work what you did before or operations with inplace=True.

What rules does Pandas use to generate a view vs a copy?

I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I'm lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here's the rules, subsequent override:
All operations generate a copy
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be that's why this is not reliable). This is mainly for efficiency. (the example from above is for .query; this will always return a copy as its evaluated by numexpr)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you shoulld never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed, offer a much more full explanation.
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three seem to be all references of df, but the last one is not!

Categories