Why use pandas.assign rather than simply initialize new column? - python

I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by by just initializing a new column 'on the fly'. Is there a reason why assign is better?
For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])
but the pandas.DataFrame.assign documentation recommends doing this:
df.assign(ln_A = lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)
Both methods return the same dataframe. In fact, the first method (my 'on the fly' assignment) is significantly faster (0.202 seconds for 1000 iterations) than the .assign method (0.353 seconds for 1000 iterations).
So is there a reason I should stop using my old method in favour of df.assign?

The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.
In particular, DataFrame.assign returns you a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.
In your particular case:
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign
>>> new_df = df.assign(A=1)
If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference, by necessity .assign must copy the data while [...] does not.

The premise on assign is that it returns:
A new DataFrame with the new columns in addition to all the existing columns.
And also you cannot do anything in-place to change the original dataframe.
The callable must not change input DataFrame (though pandas doesn't check it).
On the other hand df['ln_A'] = np.log(df['A']) will do things inplace.
So is there a reason I should stop using my old method in favour of df.assign?
I think you can try df.assign but if you do memory intensive stuff, better to work what you did before or operations with inplace=True.

Related

Why do I need to slice my series for my function to work? [duplicate]

I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I'm lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here's the rules, subsequent override:
All operations generate a copy
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be that's why this is not reliable). This is mainly for efficiency. (the example from above is for .query; this will always return a copy as its evaluated by numexpr)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you shoulld never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed, offer a much more full explanation.
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three seem to be all references of df, but the last one is not!

How is Pandas DataFrame handled across multiple custom functions when passed as argument?

We have a project where we have multiple *.py scripts with functions that receive and return pandas dataframe variable(s) as arguments.
But this make me wonder: What is the behavior in memory of the dataframe variable when they are passed as argument or as returned variables from those functions?
Does modifying the df variable alters the parent/main/global variable as well?
Consider the following example:
import pandas as pd
def add_Col(df):
df["New Column"] = 10 * 3
def mod_Col(df):
df["Existing Column"] = df["Existing Column"] ** 2
data = [0,1,2,3]
df = pd.DataFrame(data,columns=["Existing Column"])
add_Col(df)
mod_col(df)
df
When df is displayed at the end: Will the new Column show up? what about the change made to "Existing Column" when calling mod_col?
Did invoking add_Col function create a copy of df or only a pointer?
What is the best practice when passing dataframes into functions becuase if they are large enough I am sure creating copies will have both performance and memory implications right?
It depends. DataFrames are mutable objects, so like lists, they can be modified within a function, without needing to return the object.
On the other hand, the vast majority of pandas operations will return a new object so modifications would not change the underlying DataFrame. For instance, below you can see that changing values with .loc will modify the original, but if you were to multiply the entire DataFrame (which returns a new object) the original remains unchanged.
If you had a function that has a combination of both types of changes of these you could modify your DataFrame up to the point that you return a new object.
Changes the original
df = pd.DataFrame([1,2,4])
def mutate_data(df):
df.loc[1,0] = 7
mutate_data(df)
print(df)
# 0
#0 1
#1 7
#2 4
Will not change original
df = pd.DataFrame([1,2,4])
def mutate_data(df):
df = df*2
mutate_data(df)
print(df)
# 0
#0 1
#1 2
#2 4
What should you do?
If the purpose of a function is to modify a DataFrame, like in a pipeline, then you should create a function that takes a DataFrame and returns the DataFrame.
def add_column(df):
df['new_column'] = 7
return df
df = add_column(df)
#┃ ┃
#┗ on lhs & rhs ┛
In this scenario it doesn't matter if the function changes or creates a new object, because we intend to modify the original anyway.
However, that may have unintended consequences if you plan to write to a new object
df1 = add_column(df)
# | |
# New Obj Function still modifies this though!
A safe alternative that would require no knowledge of the underlying source code would be to force your function to copy at the top. Thus in that scope changes to df do not impact the original df outside of the function.
def add_column_maintain_original(df):
df = df.copy()
df['new_column'] = 7
return df
Another possibility is to pass a copy to the function:
df1 = add_column(df.copy())
yes the function will indeed change the data frame itself without creating a copy of it. You should be careful of it because you might end up having columns changed without you noticing.
In my opinion the best practice depend on use cases and using .copy() will indeed have an impact on your memory.
If for instance you are creating a pipeline with some dataframe as input you do not want to change the input dataframe itself. While if you are just processing a dataframe and you are splitting the processing in different function you can write the function how you did it

Assigning pandas selection to a variable and then modifying it

I am trying to select some rows from a pandas dataframe and store the subset/selection into a variable so I can perform multiple operations on this subset (including modification) without having to do the selection again. But I don't quite understand why it doesn't work.
For example, this doesn't work as expected (the original df doesn't get modified):
df = pd.DataFrame({"a":list(range(1,3))})
subDf = df.loc[df.a==2,:]
subDf.loc[:,"a"] = -1 # also throws SettingWithCopyWarning
# ... do more stuff with subDf...
But, this works as expected:
df = pd.DataFrame({"a":list(range(1,3))})
mask = (df.a==2)
df.loc[mask,"a"] = -1
After reading the pandas docs on indexing view vs copy, I was under the impression that selecting via .loc will return a view, but apparently that's not the case given the SettingWithCopyWarning. What am I misunderstanding here?
In subDf = df.loc[df.a==2,:] the method you are using is actually __getitem__ (df.loc.__getitem__) which is not guaranteed to return a view. When you assign something to loc (for example df.loc[mask,"a"] = -1) you are actually calling __setitem__ (df.loc.__setitem__). Here, since it has to assign a value to that slice, it is guaranteed to be a view.

Should pandas dataframes be nested?

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (IE a dataframe assigned to an element of a data frame). So for example:
import pandas as pd
import numpy as np
def some_operation(row):
results = np.random.rand(50, 3) * row['p1'] / row['p2']
res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
return res
# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object) # make sure generic types can be used
# loop over each row, call some_operation and store results DataFrame
for ind, row in df_master.iterrows():
df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe a convenient container because the column/index notation is quite slick. If DataFrames should not be used in this way is there similar alternative? I was looking at the Panel class but I am not sure if it is the proper solution for my application. I would hate forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
a b
0 100 200
1 2 5
2 3 6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)

What rules does Pandas use to generate a view vs a copy?

I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I'm lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here's the rules, subsequent override:
All operations generate a copy
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be that's why this is not reliable). This is mainly for efficiency. (the example from above is for .query; this will always return a copy as its evaluated by numexpr)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you shoulld never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed, offer a much more full explanation.
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three seem to be all references of df, but the last one is not!

Categories