Unexpected: Is Pandas DataFrame a slice of its former self?

I intended to drop all rows in a dataframe that I no longer need using the following:
df = df[my_selection]
where my_selection is a series of boolean values.
Later when I tried to add a column as follows:
df['New column'] = pd.Series(data)
I got the well-known "SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
Does this mean that df is actually a slice of its former self?
Or why am I being accused of assigning values to a slice?
Demo code:
import pandas as pd
data = {
'A': pd.Series(range(8)),
'B': pd.Series(range(8,0,-1))
}
df = pd.DataFrame(data)
df
Output:
A B
0 0 8
1 1 7
2 2 6
3 3 5
4 4 4
5 5 3
6 6 2
7 7 1
This causes a warning:
my_selection = df['A'] < 4
df = df[my_selection]
df['C'] = pd.Series(range(4))
This does not create a warning:
df = pd.DataFrame(data)
df['C'] = pd.Series(range(8))
Should I be using df.drop?
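For what it's worth (this is not from the original thread), the usual way to silence the warning in this exact situation is to take an explicit copy after filtering, so pandas knows the filtered frame is independent of the original:

```python
import pandas as pd

data = {
    'A': pd.Series(range(8)),
    'B': pd.Series(range(8, 0, -1)),
}
df = pd.DataFrame(data)

# Taking an explicit copy of the boolean-filtered result makes the new
# object independent of the original, so later column assignments do not
# trigger SettingWithCopyWarning.
df = df[df['A'] < 4].copy()
df['C'] = pd.Series(range(4))
```

The pd.Series(range(4)) aligns on the surviving index labels 0-3, so no NaNs appear in the new column.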

Related

pd.DataFrame on dataframe

What does pd.DataFrame do when applied to a dataframe? Please see the code below.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
In [3]: b = pd.DataFrame(a)
In [4]: a['c'] = [7,8,9]
In [5]: a
Out[5]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [6]: b
Out[6]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [7]: a.drop(columns='c', inplace=True)
In [8]: a
Out[8]:
a b
0 1 4
1 2 5
2 3 6
In [9]: b
Out[9]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In In[3], the function pd.DataFrame is applied to the dataframe a. It turns out that the ids of a and b are different. However, when a column is added to a, the same column is added to b; yet when we drop a column from a, the column is not dropped from b. So what does pd.DataFrame do? Are a and b the same object or different objects? What should we do to a so that the column is also dropped from b? Or, how do we prevent a column from being added to b when we add a column to a?
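As a side note (a sketch, not from the original answer): pd.DataFrame accepts a copy argument, and passing copy=True gives b its own data, so changes to a no longer show up in b. The exact sharing behavior of copy=False has varied across pandas versions, but copy=True always decouples the two:

```python
import pandas as pd

a = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))

# copy=True (equivalently, b = a.copy()) gives b its own underlying data,
# so mutating a's values no longer shows up in b.
b = pd.DataFrame(a, copy=True)
a.loc[0, 'a'] = 100
```

After this, a.loc[0, 'a'] is 100 while b.loc[0, 'a'] is still 1.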
I would avoid your statements at all cost. It is better to create the dataframe directly, like this:
df=pd.DataFrame({'a': [0,1,2], 'b': [3,4,5], 'c':[6,7,8]})
The result is a dataframe with an index and column names.
You can add a column to df, like this:
df['d'] = [8,9,10]
And remove a column from the dataframe, like this:
df.drop(columns='c',inplace=True)
I would not create a dataframe from a function definition; use 'append' instead. It works for both dictionaries and dataframes. An example of a dictionary-based append:
df = pd.DataFrame(columns=['Col1','Col2','Col3','Col4']) # create empty df with column names.
append_dict = {'Col1':value_1, 'Col2':value_2, 'Col3':value_3,'Col4':value_4}
df = df.append(append_dict, ignore_index=True)
The values can be changed in a loop, so each appended row can depend on the previous ones. For appending a dataframe, check the pandas documentation (just replace the append_dict argument with the dataframe you would like to append).
Is this what you want?
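A caveat worth adding to the answer above: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same dictionary-based pattern is written with pd.concat. A sketch, with placeholder values standing in for value_1 through value_4:

```python
import pandas as pd

# Create an empty df with column names, as in the answer above
df = pd.DataFrame(columns=['Col1', 'Col2', 'Col3', 'Col4'])
append_dict = {'Col1': 1, 'Col2': 2, 'Col3': 3, 'Col4': 4}

# Wrap the dict in a one-row DataFrame and concatenate;
# ignore_index=True renumbers the rows, as append did.
df = pd.concat([df, pd.DataFrame([append_dict])], ignore_index=True)
```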

Pandas dataframe how to replace row with one with additional attributes

I have a method that adds additional attributes to a given pandas series and I want to update a row in the df with the returned series.
Lets say I have a simple dataframe:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
a b
0 1 3
1 2 4
and now I want to replace a row with one that has additional attributes; all other rows will show NaN for that column, e.g.:
subdf = df.loc[1]
subdf["newVal"] = "foo"
# subdf is created externally and returned. Now it must be updated.
df.loc[1] = subdf #or something
df would look like:
a b newVal
0 1 3 NaN
1 2 4 foo
Without loss of generality, first reindex the columns and then assign with (i)loc:
df = df.reindex(subdf.index, axis=1)
df.iloc[-1] = subdf
df
a b newVal
0 1 3 NaN
1 2 4 foo
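An alternative sketch (not from the original answer, with a hypothetical newVal attribute as in the question): assign the returned attributes back one at a time with .loc, which enlarges the frame with any new column and fills the other rows with NaN:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

subdf = df.loc[1].copy()   # copy so we don't mutate a view of df
subdf['newVal'] = 'foo'    # stand-in for the externally added attributes

# Setting via .loc enlarges df with any column it hasn't seen before,
# leaving NaN in the remaining rows.
for col, val in subdf.items():
    df.loc[1, col] = val
```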

Pandas (Python) - Update column of a dataframe from another one with conditions and different columns

I had a problem and I found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way to do it.
I already had an answer for a really similar problem, but here I do not have the same number of rows in each dataframe. Sorry for the "double post", but the first question is still valid, so I think it's better to ask a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing info. Example:
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 where columns 'A' and 'A2' correspond.
The result would be :
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I think it's not a very good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution :
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because the dataframes have different numbers of rows:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin to check which rows are the same in both DataFrames. Then create NaN by the mask, fill them with combine_first, and finally cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
With a minor tweak in the way in which the boolean mask gets created, you can get it to work:
cols = ['A', 'A2']
# Slice it to match the shape of the other dataframe to compare elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
df.loc[rows,'B'] = df2.loc[rows,'B']
df
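The question's own merge-based attempt can also be tightened into a sketch that keeps df's B wherever df2 has no match, using fillna between the merged columns instead of the fillna(0)-plus-addition trick:

```python
import pandas as pd

df = pd.DataFrame([[1, 4, 0], [2, 5, 1], [2, 5, 1]], columns=['A', 'A2', 'B'])
df2 = pd.DataFrame([[1, 4, 2], [3, 5, 2]], columns=['A', 'A2', 'B'])

# Left-join on the key columns: rows of df without a match in df2 get NaN
# in B_new, and fillna falls back to df's original B there. Keys that only
# exist in df2, like (3, 5), simply drop out of a left join.
merged = df.merge(df2, on=['A', 'A2'], how='left', suffixes=('', '_new'))
df['B'] = merged['B_new'].fillna(merged['B']).astype(int)
```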

Make new column in Pandas dataframe by adding values from other columns

I have a dataframe with values like
A B
1 4
2 6
3 9
I need to add a new column by adding values from column A and B, like
A B C
1 4 5
2 6 8
3 9 12
I believe this can be done using lambda function, but I can't figure out how to do it.
Very simple:
df['C'] = df['A'] + df['B']
Building a little more on Anton's answer, you can add all the columns like this:
df['sum'] = df[list(df.columns)].sum(axis=1)
The simplest way would be to use DeepSpace's answer. However, if you really want to use an anonymous function you can use apply:
df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
You could use the sum function to achieve that, as @EdChum mentioned in the comments:
df['C'] = df[['A', 'B']].sum(axis=1)
In [245]: df
Out[245]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
You could do:
df['C'] = df.sum(axis=1)
If you only want to do numerical values:
df['C'] = df.sum(axis=1, numeric_only=True)
The axis parameter takes either 0 or 1 as its argument: axis=0 sums down each column (over the rows), while axis=1 sums across the columns of each row, which is what we want here.
As of Pandas version 0.16.0 you can use assign as follows:
df = pd.DataFrame({"A": [1,2,3], "B": [4,6,9]})
df.assign(C = df.A + df.B)
# Out[383]:
# A B C
# 0 1 4 5
# 1 2 6 8
# 2 3 9 12
You can add multiple columns this way as follows:
df.assign(C = df.A + df.B,
          Diff = df.B - df.A,
          Mult = df.A * df.B)
# Out[379]:
# A B C Diff Mult
# 0 1 4 5 3 4
# 1 2 6 8 4 12
# 2 3 9 12 6 27
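One more convenience of assign worth noting (a sketch, not from the original answer): if you pass callables instead of precomputed Series, each callable receives the frame as built so far, so later columns can refer to earlier ones in the same call:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 6, 9]})

# Each lambda receives the dataframe as built so far, so Half can use C.
# assign returns a new frame; df itself is left untouched.
out = df.assign(C=lambda d: d.A + d.B,
                Half=lambda d: d.C / 2)
```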
Concerning n00b's comment: "I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
I was getting the same warning. In my case it was because I was trying to perform the column addition on a dataframe that was created like this:
df_b = df[['colA', 'colB', 'colC']]
instead of:
df_c = pd.DataFrame(df, columns=['colA', 'colB', 'colC'])
df_b is a copy of a slice from df
df_c is a new dataframe. So
df_c['colD'] = df['colA'] + df['colB'] + df['colC']
will add the columns and won't raise any warning. Same if .sum(axis=1) is used.
I wanted to add a comment responding to the error message n00b was getting but I don't have enough reputation. So my comment is an answer in case it helps anyone...
n00b said:
I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
He got this warning because whatever manipulations he did to his dataframe prior to creating df['C'] created a view into the dataframe rather than a copy of it. The warning didn't arise from the simple calculation df['C'] = df['A'] + df['B'] suggested by DeepSpace.
Have a look at the Returning a view versus a copy docs.
You can also do this using loc:
In [37]: df = pd.DataFrame({"A":[1,2,3],"B":[4,6,9]})
In [38]: df
Out[38]:
A B
0 1 4
1 2 6
2 3 9
In [39]: df['C']=df.loc[:,['A','B']].sum(axis=1)
In [40]: df
Out[40]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
eval lets you sum and create columns right away:
In [8]: df.eval('C = A + B', inplace=True)
In [9]: df
Out[9]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
Since inplace=True you don't need to assign it back to df.
You can solve it simply by adding:
df['C'] = df['A'] + df['B']

Multi Column DDPLY/R function in Pandas/Python

I have the following statement in R
library(plyr)
filteredData <- ddply(data, .(ID1, ID2), businessrule)
I am trying to use Python and Pandas to duplicate the action.
I have tried...
data['judge'] = data.groupby(['ID1','ID2']).apply(lambda x: businessrule(x))
But this provides error...
incompatible index of inserted column with frame index
The error message can be reproduced with
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['ID1', 'ID2', 'val'])
df['new'] = df.groupby(['ID1', 'ID2']).apply(lambda x: x.values.sum())
# TypeError: incompatible index of inserted column with frame index
It is likely that your code raises an error for the same reason this toy example does.
The right-hand side is a Series with a 2-level MultiIndex:
ID1 ID2
0 1 3
3 4 12
6 7 21
9 10 30
dtype: int64
df['new'] = ... tells Pandas to assign this Series to a column in df.
But df has a single-level index:
ID1 ID2 val
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
Because the single-level index is incompatible with the 2-level MultiIndex, the
assignment fails. It is in general never correct to assign the result of
groupby/apply to a column of df unless the columns or levels you group by
also happen to be valid index keys in the original DataFrame, df.
Instead, assign the Series to a new variable, just like what the R code does:
filteredData = data.groupby(['ID1','ID2']).apply(businessrule)
Note that lambda x: businessrule(x) can be replaced with businessrule.
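When you do want a per-group result broadcast back onto every row of df (one value repeated for each member of its group), groupby.transform is the usual tool. A sketch with a toy sum standing in for businessrule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['ID1', 'ID2', 'val'])

# Keep the per-group result in its own object, as the R code does
result = df.groupby(['ID1', 'ID2'])['val'].sum()

# If you need the result aligned row-by-row with df, broadcast it back
# with transform instead of apply; its output always matches df's index.
df['group_sum'] = df.groupby(['ID1', 'ID2'])['val'].transform('sum')
```

Here every (ID1, ID2) pair is unique, so each group has one row and group_sum equals val; with repeated keys, transform would repeat the group total on every member row.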
