Make new column in Panda dataframe by adding values from other columns - python

I have a dataframe with values like
A B
1 4
2 6
3 9
I need to add a new column by adding values from column A and B, like
A B C
1 4 5
2 6 8
3 9 12
I believe this can be done using lambda function, but I can't figure out how to do it.

Very simple:
df['C'] = df['A'] + df['B']

Building a little more on Anton's answer, you can add all the columns like this:
df['sum'] = df[list(df.columns)].sum(axis=1)

The simplest way would be to use DeepSpace answer. However, if you really want to use an anonymous function you can use apply:
df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

You could use sum function to achieve that as #EdChum mentioned in the comment:
df['C'] = df[['A', 'B']].sum(axis=1)
In [245]: df
Out[245]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12

You could do:
df['C'] = df.sum(axis=1)
If you only want to do numerical values:
df['C'] = df.sum(axis=1, numeric_only=True)
The parameter axis takes as arguments either 0 or 1, with 0 meaning to sum across columns and 1 across rows.

As of Pandas version 0.16.0 you can use assign as follows:
df = pd.DataFrame({"A": [1,2,3], "B": [4,6,9]})
df.assign(C = df.A + df.B)
# Out[383]:
# A B C
# 0 1 4 5
# 1 2 6 8
# 2 3 9 12
You can add multiple columns this way as follows:
df.assign(C = df.A + df.B,
Diff = df.B - df.A,
Mult = df.A * df.B)
# Out[379]:
# A B C Diff Mult
# 0 1 4 5 3 4
# 1 2 6 8 4 12
# 2 3 9 12 6 27

Concerning n00b's comment: "I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
I was getting the same error. In my case it was because I was trying to perform the column addition on a dataframe that was created like this:
df_b = df[['colA', 'colB', 'colC']]
instead of:
df_c = pd.DataFrame(df, columns=['colA', 'colB', 'colC'])
df_b is a copy of a slice from df
df_c is an new dataframe. So
df_c['colD'] = df['colA'] + df['colB']+ df['colC']
will add the columns and won't raise any warning. Same if .sum(axis=1) is used.

I wanted to add a comment responding to the error message n00b was getting but I don't have enough reputation. So my comment is an answer in case it helps anyone...
n00b said:
I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
He got this error because whatever manipulations he did to his dataframe prior to creating df['C'] created a view into the dataframe rather than a copy of it. The error didn't arise form the simple calculation df['C'] = df['A'] + df['B'] suggested by DeepSpace.
Have a look at the Returning a view versus a copy docs.

Can do using loc
In [37]: df = pd.DataFrame({"A":[1,2,3],"B":[4,6,9]})
In [38]: df
Out[38]:
A B
0 1 4
1 2 6
2 3 9
In [39]: df['C']=df.loc[:,['A','B']].sum(axis=1)
In [40]: df
Out[40]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12

eval lets you sum and create columns right away:
In [8]: df.eval('C = A + B', inplace=True)
In [9]: df
Out[9]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
Since inplace=True you don't need to assign it back to df.

You can solve it by adding simply:
df['C'] = df['A'] + df['B']

Related

pd.DataFrame on dataframe

What does pd.DataFrame does on a dataframe? Please see the code below.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
In [3]: b = pd.DataFrame(a)
In [4]: a['c'] = [7,8,9]
In [5]: a
Out[5]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [6]: b
Out[6]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [7]: a.drop(columns='c', inplace=True)
In [8]: a
Out[8]:
a b
0 1 4
1 2 5
2 3 6
In [9]: b
Out[9]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In In[3], the function pd.DataFrame is applied on a dataframe a. It turns out that the ids of a and b are different. However, when a column is added to a, the same column is added to b, but when we drop a column from a, the column is not dropped from b. So what does pd.DataFrame does? Are a and b the same object or different? What should we do to a so that we drop the column from b? Or, how do we prevent a column from being added to b when we add a column to a?
I would avoid your statements at all cost. Better would be to make a dataframe as such:
df=pd.DataFrame({'a': [0,1,2], 'b': [3,4,5], 'c':[6,7,8]})
The above result is a dataframe, with indices and column names.
You can add a column to df, like this:
df['d'] = [8,9,10]
And remove a column to the dataframe, like this:
df.drop(columns='c',inplace=True)
I would not create a dataframe from a function definition, but use 'append' instead. Append works for dictionaries and dataframes. An example for a dictionary based append:
df = pd.DataFrame(columns=['Col1','Col2','Col3','Col4']) # create empty df with column names.
append_dict = {'Col1':value_1, 'Col2':value_2, 'Col3':value_3,'Col4':value_4}
df = df.append(append_dict,ignore_index=True).
The values can be changed in a loop, so it does something with respect to the previous values. For dataframe append, you can check the pandas documentation (just replace the append_dict argument with the dataframe that you like to append)
Is this what you want?

Unexpected: Is Pandas DataFrame a slice of its former self?

I intended to drop all rows in a dataframe that I no longer need using the following:
df = df[my_selection]
where my_selection is a series of boolean values.
Later when I tried to add a column as follows:
df['New column'] = pd.Series(data)
I got the well-known "SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
Does this mean that df is actually a slice of its former self?
Or why am I being accused of assigning values to a slice?
Demo code:
import pandas as pd
data = {
'A': pd.Series(range(8)),
'B': pd.Series(range(8,0,-1))
}
df = pd.DataFrame(data)
df
Output:
A B
0 0 8
1 1 7
2 2 6
3 3 5
4 4 4
5 5 3
6 6 2
7 7 1
This causes a warning:
my_selection = df['A'] < 4
df = df[my_selection]
df['C'] = pd.Series(range(4))
This does not create a warning:
df = pd.DataFrame(data)
df['C'] = pd.Series(range(8))
Should I be using df.drop?

Pandas (Python) - Update column of a dataframe from another one with conditions and different columns

I had a problem and I found a solution but I feel it's the wrong way to do it. Maybe, there is a more 'canonical' way to do it.
I already had an answer for a really similar problem, but here I have not the same amount of rows in each dataframe. Sorry for the "double-post", but the first one is still valid so I think it's better to make a new one.
Problem
I have two dataframe that I would like to merge without having extra column and without erasing existing infos. Example :
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 if columns 'A' and 'A2' corresponds.
The result would be :
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I think it's not a really good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution :
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I have this error because of the wrong number of rows :
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone has a better way to do ?
Thanks !
I think you can use DataFrame.isin for check where are same rows in both DataFrames. Then create NaN by mask, which is filled by combine_first. Last cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
With a minor tweak in the way in which the boolean mask gets created, you can get it to work:
cols = ['A', 'A2']
# Slice it to match the shape of the other dataframe to compare elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
df.loc[rows,'B'] = df2.loc[rows,'B']
df

set multiple Pandas DataFrame columns to values in a single column or multiple scalar values at the same time

I'm trying to set multiple new columns to one column and, separately, multiple new columns to multiple scalar values. Can't do either. Any way to do it other than setting each one individually?
df=pd.DataFrame(columns=['A','B'],data=np.arange(6).reshape(3,2))
df.loc[:,['C','D']]=df['A']
df.loc[:,['C','D']]=[0,1]
for c in ['C', 'D']:
df[c] = d['A']
df['C'] = 0
df['D'] = 1
Maybe it is what you are looking for.
df=pd.DataFrame(columns=['A','B'],data=np.arange(6).reshape(3,2))
df['C'], df['D'] = df['A'], df['A']
df['E'], df['F'] = 0, 1
# Result
A B C D E F
0 0 1 0 0 0 1
1 2 3 2 2 0 1
2 4 5 4 4 0 1
The assign method will create multiple, new columns in one step. You can pass a dict() with the column and values to return a new DataFrame with the new columns appended to the end.
Using your examples:
df = df.assign(**{'C': df['A'], 'D': df['A']})
and
df = df.assign(**{'C': 0, 'D':1})
See this answer for additional detail: https://stackoverflow.com/a/46587717/4843561

How to merge a Series and DataFrame

If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge with the dataframe.
So you specify the data as the values but multiply them by the length, set the columns to the index and set params for left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT for the situation where you want the index of your constructed df from the series to use the index of the df then you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match the length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, much simpler and concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply just returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official document as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official document also includes example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using shape() to return the number of rows from df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
If df is a pandas.DataFrame then df['new_col']= Series list_object of length len(df) will add the or Series list_object as a column named 'new_col'. df['new_col']= scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col']= [scalar]*len(df)
So a two-line code serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6

Categories