Assign values to multiple columns in Pandas [duplicate] - python

This question already has answers here:
How to add multiple columns to pandas dataframe in one assignment?
(13 answers)
Closed 2 years ago.
I have the following simple DataFrame, df:
0
0 1
1 2
2 3
When I try to create new columns and assign values to them, as in the example below:
df['col2', 'col3'] = [(2,3), (2,3), (2,3)]
I get the following structure:
0 (col2, col3)
0 1 (2, 3)
1 2 (2, 3)
2 3 (2, 3)
However, I am looking for a way to get this instead:
0 col2, col3
0 1 2, 3
1 2 2, 3
2 3 2, 3

Looks like the solution is simple:
df['col2'], df['col3'] = zip(*[(2,3), (2,3), (2,3)])
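A minimal runnable sketch of the zip approach, assuming the same toy data as above:
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3]})
rows = [(2, 3), (2, 3), (2, 3)]
# zip(*rows) transposes the row tuples into (2, 2, 2) and (3, 3, 3),
# which unpack into the two column assignments.
df['col2'], df['col3'] = zip(*rows)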

There is a convenient way to join multiple series to a dataframe via a list of tuples: construct a dataframe from your list of tuples before assignment:
df = pd.DataFrame({0: [1, 2, 3]})
df[['col2', 'col3']] = pd.DataFrame([(2,3), (2,3), (2,3)])
print(df)
0 col2 col3
0 1 2 3
1 2 2 3
2 3 2 3
This is convenient, for example, when you wish to join an arbitrary number of series.
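One caveat worth noting: assigning a DataFrame aligns on the index, so if df does not have a default RangeIndex, construct the right-hand side with a matching index:
df[['col2', 'col3']] = pd.DataFrame([(2, 3), (2, 3), (2, 3)], index=df.index)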

Alternatively, assign can be used; note that it returns a new DataFrame rather than modifying df in place:
df.assign(col2=2, col3=3)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html

I ran across this issue when trying to apply multiple scalar values to multiple new columns and couldn't find a better way. If I'm missing something blatantly obvious, let me know, but df[['b','c']] = 0 doesn't work. Here's the simplified code:
# Create the "current" dataframe
df = pd.DataFrame({'a':[1,2]})
# List of columns I want to add
col_list = ['b','c']
# Quickly create key : scalar value dictionary
scalar_dict = {c: 0 for c in col_list}
# Create the dataframe for those columns - key here is setting the index = df.index
df[col_list] = pd.DataFrame(scalar_dict, index = df.index)
Or, what appears to be slightly faster is to use .assign():
df = df.assign(**scalar_dict)


Pandas Dataframe: How to create a column of incremental unique value count of another column [duplicate]

This question already has answers here:
Creating a new column assigning same index to repeated values in Pandas DataFrame [closed]
(2 answers)
Closed 2 years ago.
Consider the sample dataframe ('value' column is of no significance here):
df = pd.DataFrame({'key':list('AABBBC'), 'value': [1, 2, 3, 4, 5, 6]})
What I want is a column that counts the unique values of the 'key' column, the caveat being that the count is incrementally ascending and only goes up when a value appears that hasn't occurred in any previous row. So here "A" will be assigned 1, "B" 2 and "C" 3.
The desired result looks like this:
key value count_unique
0 A 1 1
1 A 2 1
2 B 3 2
3 B 4 2
4 B 5 2
5 C 6 3
Right now I can only achieve this with a couple of steps:
df1 = df.drop_duplicates('key').reset_index(drop = True).drop(columns = ['value'])
df1['count_unique'] = df1.index+1
pd.merge(df, df1.set_index(['key']), left_on = ['key'], right_index= True, how = 'left')
It doesn't look very Pythonic and is not the most efficient. Any advice is appreciated.
Use factorize, which enumerates the unique values in order of first appearance:
df['count_unique'] = df['key'].factorize()[0] + 1
Output:
key value count_unique
0 A 1 1
1 A 2 1
2 B 3 2
3 B 4 2
4 B 5 2
5 C 6 3
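For reference, an equivalent spelling uses groupby with sort=False, so that groups are numbered in order of first appearance:
df['count_unique'] = df.groupby('key', sort=False).ngroup() + 1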

simple pivot dataframe in pandas [duplicate]

This question already has answers here:
How to switch columns rows in a pandas dataframe
(2 answers)
Closed 4 years ago.
Have a simple df:
df = pd.DataFrame({"v": [1, 2]}, index = pd.Index(data = ["a", "b"], name="colname"))
Want to reshape it to look like this:
a b
0 1 2
How do I do that? I looked at the docs for pd.pivot and pd.pivot_table, but
df.reset_index().pivot(columns = "colname", values = "v")
produces a df that has NaNs obviously.
Update: I want DataFrames, not Series, because I am going to concatenate a bunch of them together to store the results of a computation.
From your setup
v
colname
a 1
b 2
Seems like you need to transpose
>>> df.T
or
>>> df.transpose()
Both of which yield:
colname a b
v 1 2
You can always reset the index to get 0 and set the columns name to None to get your expected output:
ndf = df.T.reset_index(drop=True)
ndf.columns.name = None
a b
0 1 2
How about:
df.T.reset_index(drop=True)
[out]
colname a b
0 1 2
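For the record, the two steps can be chained into a single expression, assuming a pandas version where rename_axis accepts None to clear the columns name:
out = df.T.reset_index(drop=True).rename_axis(None, axis=1)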

Pandas (Python) - Update column of a dataframe from another one with conditions and different columns

I had a problem and found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way.
I already had an answer for a really similar problem, but here the dataframes do not have the same number of rows. Sorry for the "double post", but the first question is still valid, so I think it's better to ask a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing values. Example:
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 wherever the values in columns 'A' and 'A2' correspond.
The result would be :
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I don't think it's a really good one:
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution:
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because of the differing number of rows:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin to check which rows are the same in both DataFrames. Then create NaN by mask and fill them with combine_first. Last, cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
With a minor tweak in the way in which the boolean mask gets created, you can get it to work:
cols = ['A', 'A2']
# Slice it to match the shape of the other dataframe to compare elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
df.loc[rows,'B'] = df2.loc[rows,'B']
df
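For reference, a cleaner variant of the asker's own merge approach avoids the fillna(0)-plus-addition trick by filling unmatched rows directly from the original column; a sketch assuming the ('A', 'A2') pairs are unique in df2:
import pandas as pd

df = pd.DataFrame([[1, 4, 0], [2, 5, 1], [2, 5, 1]], columns=['A', 'A2', 'B'])
df2 = pd.DataFrame([[1, 4, 2], [3, 5, 2]], columns=['A', 'A2', 'B'])
# A left merge keeps every row of df; B from df2 lands in 'B_new',
# which is NaN wherever ('A', 'A2') has no match in df2.
merged = df.merge(df2, on=['A', 'A2'], how='left', suffixes=('', '_new'))
df['B'] = merged['B_new'].combine_first(merged['B']).astype(int)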

set multiple Pandas DataFrame columns to values in a single column or multiple scalar values at the same time

I'm trying to set multiple new columns to one column and, separately, multiple new columns to multiple scalar values. Can't do either. Any way to do it other than setting each one individually?
df = pd.DataFrame(columns=['A', 'B'], data=np.arange(6).reshape(3, 2))
# Neither of these works for new columns:
df.loc[:, ['C', 'D']] = df['A']
df.loc[:, ['C', 'D']] = [0, 1]
# Setting each column individually does work:
for c in ['C', 'D']:
    df[c] = df['A']
df['C'] = 0
df['D'] = 1
Maybe this is what you are looking for:
df = pd.DataFrame(columns=['A', 'B'], data=np.arange(6).reshape(3, 2))
df['C'], df['D'] = df['A'], df['A']
df['E'], df['F'] = 0, 1
# Result
A B C D E F
0 0 1 0 0 0 1
1 2 3 2 2 0 1
2 4 5 4 4 0 1
The assign method will create multiple new columns in one step. You can pass a dict of column names and values to return a new DataFrame with the new columns appended to the end.
Using your examples:
df = df.assign(**{'C': df['A'], 'D': df['A']})
and
df = df.assign(**{'C': 0, 'D':1})
See this answer for additional detail: https://stackoverflow.com/a/46587717/4843561
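As a side note, if many columns share the same scalar, dict.fromkeys keeps the dict construction compact:
df = df.assign(**dict.fromkeys(['C', 'D'], 0))  # same as {'C': 0, 'D': 0}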

How to merge a Series and DataFrame

If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge a DataFrame and a Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge with the dataframe.
So you specify the data as the series values repeated len(s) times, set the columns to the series index, and set left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT: for the situation where you want the df constructed from the series to use the index of the original df, you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the lengths of the indices match.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
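As an aside, recent pandas (2.1+) deprecates the method= argument of fillna, so the first variant above would nowadays be written with the .ffill() method directly:
df.join(pd.DataFrame(s).T).ffill()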
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
                     columns=s.index,
                     index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a returned Series into columns of its output, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of the DataFrame it is applied to is well documented in the official documentation:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of this usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframe like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using shape to get the number of rows of df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display(df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, e.g.:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int, as in your example. If the column you specify isn't in the df, pandas will create a new column with the name you specify. So after your dataframe is constructed (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to do this for all the elements of a list of tuples, or for the keys and values of a dictionary, depending on how your real data is stored; a sketch of the dictionary variant follows.
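A sketch of the dictionary variant (the names here are illustrative):
new_cols = {'s1': 5, 's2': 6}  # hypothetical mapping of column names to scalars
for name, value in new_cols.items():
    df[name] = value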
If df is a pandas.DataFrame, then df['new_col'] = list_object, where list_object is a list or Series of length len(df), will add that list or Series as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df).
So a two-line loop serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
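One more concise option worth noting: a Series supports the mapping protocol (keys() and __getitem__), so it can be unpacked straight into assign; a sketch using the edited setup:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])
s = pd.Series({'s1': 5, 's2': 6})
# Each series element becomes a scalar column broadcast down the frame.
result = df.assign(**s)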
