Merge columns with more than one value in pandas dataframe - python

I've got this DataFrame in Python using pandas:
Column 1  Column 2  Column 3
hello     a,b,c     1,2,3
hi        b,c,a     4,5,6
The values in column 3 belong to the categories in column 2.
Is there a way to combine columns 2 and 3 so that I get this output?
Column 1  a  b  c
hello     1  2  3
hi        6  4  5
Any advice would be very helpful! Thank you!

df.apply(lambda x: pd.Series(x['Column 3'].split(','), index=x['Column 2'].split(',')), axis=1)
Output:
   a  b  c
0  1  2  3
1  4  5  6
Assign the result to df1 and concat:
df1 = df.apply(lambda x: pd.Series(x['Column 3'].split(','), index=x['Column 2'].split(',')), axis=1)
pd.concat([df['Column 1'], df1], axis=1)
Output:
  Column 1  a  b  c
0    hello  1  2  3
1       hi  4  5  6
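Put together as a self-contained script (sample frame reconstructed from the question):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'Column 1': ['hello', 'hi'],
                   'Column 2': ['a,b,c', 'b,c,a'],
                   'Column 3': ['1,2,3', '4,5,6']})

# Split each row's values and use the categories as the new column labels;
# apply aligns the per-row Series by label, so 'b,c,a' lands correctly.
wide = df.apply(lambda x: pd.Series(x['Column 3'].split(','),
                                    index=x['Column 2'].split(',')), axis=1)

result = pd.concat([df['Column 1'], wide], axis=1)
print(result)
```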

You can use pd.crosstab after exploding the commas:
new_df = (df.assign(t=df['Column 2'].str.split(','),
                    a=df['Column 3'].str.split(','))
            .explode(['t', 'a']))
output = (pd.crosstab(index=new_df['Column 1'], columns=new_df['t'],
                      values=new_df['a'], aggfunc='sum')
            .reset_index())
Output:
t Column 1  a  b  c
0    hello  1  2  3
1       hi  6  4  5
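Since each (row, category) pair occurs exactly once after the explode, a plain pivot works as well and needs no aggregation; a sketch, assuming pandas >= 1.3 (for multi-column explode) and the same sample frame:

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({'Column 1': ['hello', 'hi'],
                   'Column 2': ['a,b,c', 'b,c,a'],
                   'Column 3': ['1,2,3', '4,5,6']})

# Explode both comma-separated columns into long form in lockstep.
long = (df.assign(t=df['Column 2'].str.split(','),
                  a=df['Column 3'].str.split(','))
          .explode(['t', 'a']))

# Each ('Column 1', category) pair is unique, so pivot suffices.
out = long.pivot(index='Column 1', columns='t', values='a').reset_index()
print(out)
```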

Efficiency-wise, I'd say do all the wrangling in vanilla Python and create a new dataframe:
from collections import defaultdict

outcome = defaultdict(list)
for column, row in zip(df['Column 2'], df['Column 3']):
    column = column.split(',')
    row = row.split(',')
    for first, last in zip(column, row):
        outcome[first].append(last)
pd.DataFrame(outcome).assign(Column=df['Column 1'])

   a  b  c Column
0  1  2  3  hello
1  6  4  5     hi


How to find duplicate values (not rows) in an entire pandas dataframe?

Consider this dataframe.
df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})
  one two three
0   a   e     a
1   b   f     j
2   c   g     h
3   d   h     a
How can I output all duplicate values and their respective index? The output can look something like this.
id value
0 2 h
1 3 h
2 0 a
3 0 a
4 3 a
Try .melt + .duplicated:
x = df.reset_index().melt("index")
print(
    x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
    .reset_index(drop=True)
    .rename(columns={"index": "id"})
)
Prints:
id value
0 0 a
1 3 h
2 0 a
3 2 h
4 3 a
We can stack the DataFrame, use Series.loc to keep only the rows where the value is Series.duplicated, then use Series.reset_index to convert back to a DataFrame:
new_df = (
    df.stack()                                  # Convert to long form
      .droplevel(-1).rename_axis('id')          # Handle the MultiIndex
      .loc[lambda x: x.duplicated(keep=False)]  # Filter to duplicated values
      .reset_index(name='value')                # Make the Series a DataFrame
)
new_df:
id value
0 0 a
1 0 a
2 2 h
3 3 h
4 3 a
Here I used melt to reshape and duplicated(keep=False) to select the duplicates:
(df.rename_axis('id')
   .reset_index()
   .melt(id_vars='id')
   .loc[lambda d: d['value'].duplicated(keep=False), ['id', 'value']]
   .sort_values(by='id')
   .reset_index(drop=True)
)
Output:
id value
0 0 a
1 0 a
2 2 h
3 3 h
4 3 a
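For reference, a self-contained run of the melt-based approach on the question's frame:

```python
import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({'one': list('abcd'),
                   'two': list('efgh'),
                   'three': list('ajha')})

# Reshape to long form, then keep every occurrence of a repeated value.
x = df.reset_index().melt('index')
dupes = (x.loc[x.duplicated(['value'], keep=False), ['index', 'value']]
           .reset_index(drop=True)
           .rename(columns={'index': 'id'}))
print(dupes)
```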

Assign each unique value of column to whole Dataframe as if data frame duplicate itself based on value of another column

I am trying to iterate over the values of a column from df2 and assign each value to df1, as if df1 replicated itself once for every value of that column from df2.
Let's say I have df1 as below:
df1
1
2
3
and df2 as below:
df2
A
B
C
I want a third dataframe, df3, to end up like below:
df3
1 A
2 A
3 A
1 B
2 B
3 B
1 C
2 C
3 C
So far I have tried the code below:
for i, value in ACS_shock['scenario'].iteritems():
    df1['sec'] = df1[i] = value[:]
But when I generate the file from df1, my output looks like this:
1 A B C
2 A B C
3 A B C
Any idea how I can correct this code?
Much appreciated.
You can use pd.concat and np.repeat:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.Series([1,2,3])
>>> df1
0 1
1 2
2 3
dtype: int64
>>> df2 = pd.Series(list('ABC'))
>>> df2
0 A
1 B
2 C
dtype: object
>>> df3 = pd.DataFrame({'df1': pd.concat([df1]*3).reset_index(drop=True),
...                     'df2': np.repeat(df2, 3).reset_index(drop=True)})
>>> df3
df1 df2
0 1 A
1 2 A
2 3 A
3 1 B
4 2 B
5 3 B
6 1 C
7 2 C
8 3 C
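With pandas >= 1.2 the same Cartesian product can also be written as a cross merge; a minimal sketch on the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'df1': [1, 2, 3]})
df2 = pd.DataFrame({'df2': ['A', 'B', 'C']})

# how='cross' pairs every row of df2 with every row of df1;
# starting from df2 groups the result by the letter, as in the question.
df3 = df2.merge(df1, how='cross')[['df1', 'df2']]
print(df3)
```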

Looking up values in two pandas data frames and create new columns

I have two data frames in my problem.
df1
ID Value
1 A
2 B
3 C
df2:
ID F_ID S_ID
1 2 3
2 3 1
3 1 2
I want to create a column next to each ID column that will store the values looked up from df1. The output should look like this :
ID ID_Value F_ID F_ID_Value S_ID S_ID_Value
1 A 2 B 3 C
2 B 3 C 1 A
3 C 1 A 2 B
Basically looking up from df1 and creating a new column to store these values.
You can use map on each column of df2 with the values of df1:
s = df1.set_index('ID')['Value']
for col in df2.columns:
    df2[f'{col}_value'] = df2[col].map(s)
print (df2)
ID F_ID S_ID ID_value F_ID_value S_ID_value
0 1 2 3 A B C
1 2 3 1 B C A
2 3 1 2 C A B
Or with apply and concat:
df_ = pd.concat([df2, df2.apply(lambda x: x.map(s)).add_suffix('_value')], axis=1)
df_ = df_.reindex(sorted(df_.columns), axis=1)
If column order is important (per the comments it is not), it is necessary to use DataFrame.insert with enumerate and some maths:
s = df1.set_index('ID')['Value']
for i, col in enumerate(df2.columns, 1):
    df2.insert(i * 2 - 1, f'{col}_value', df2[col].map(s))
print (df2)
ID ID_value F_ID F_ID_value S_ID S_ID_value
0 1 A 2 B 3 C
1 2 B 3 C 1 A
2 3 C 1 A 2 B
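For comparison, a sketch (same sample frames, not from the answers above) that maps every cell of df2 through the lookup in one pass via stack:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'F_ID': [2, 3, 1], 'S_ID': [3, 1, 2]})

s = df1.set_index('ID')['Value']

# Stack df2 into one long Series, map all IDs at once, then unstack back;
# reindexing by df2.columns restores the original column order.
values = df2.stack().map(s).unstack()[df2.columns].add_suffix('_Value')
out = pd.concat([df2, values], axis=1)
print(out)
```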

delimit/split row values and form individual rows

Reproducible code for the data:
import pandas as pd
d = pd.DataFrame(list({"a": "[1,2,3,4]", "b": "[1,2,3,4]"}.items()))
d
   0          1
0  a  [1,2,3,4]
1  b  [1,2,3,4]
I want to split/delimit column 1 and create an individual row for each of the split values.
expected output:
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
Should I be removing the brackets first and then splitting the values? I really have no idea how to do this. Any reference that would help me solve this, please?
Based on the logic from that answer:
s = d[1]\
    .apply(lambda x: pd.Series(eval(x)))\
    .stack()
s.index = s.index.droplevel(-1)
s.name = "split"
d.join(s).drop(1, axis=1)
Because you have strings containing a list (and not lists) in your cells, you can use eval:
dict_v = {"a": "[1,2,3,4]", "b": "[1,2,3,4]"}
df = pd.DataFrame(list(dict_v.items()))
df = (df.rename(columns={0: 'l'}).set_index('l')[1]
        .apply(lambda x: pd.Series(eval(x))).stack()
        .reset_index().drop('level_1', 1).rename(columns={'l': 0, 0: 1}))
Another, probably faster, way is to create a DataFrame directly:
df = (pd.DataFrame(df[1].apply(eval).tolist(), index=df[0])
        .stack().reset_index(level=1, drop=True)
        .reset_index(name='1'))
Your output is:
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
All the renames are just to reproduce your exact input/output.
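On newer pandas (>= 1.1) a shorter route is ast.literal_eval (safer than eval) plus explode; a sketch, assuming the same two-column frame:

```python
import ast
import pandas as pd

# Same data as in the question (variable renamed to avoid shadowing dict).
data = {"a": "[1,2,3,4]", "b": "[1,2,3,4]"}
df = pd.DataFrame(list(data.items()))

# Parse each string into a real list, then expand one row per element.
df[1] = df[1].apply(ast.literal_eval)
out = df.explode(1, ignore_index=True)
print(out)
```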

Adding a column in dataframes based on similar columns in them

I am trying to get an output where column d of d1 and d2 is summed wherever a, b, c are the same (like a groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
That is, merge the two data frames and add the resulting column d where a, b, c are the same. d1.add(d2) (and radd) gives me an aggregate over all columns instead.
The solution should be a DataFrame which can be added again to another similarly.
Any help is appreciated.
You can use set_index first:
print (d2.set_index(['a','b','c'])
         .add(d1.set_index(['a','b','c']), fill_value=0)
         .astype(int)
         .reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
Alternatively, concat and aggregate with groupby:
df = pd.concat([d1, d2])
df.groupby(['a', 'b', 'c'], as_index=False)['d'].sum()
   a  b  c  d
0  1  2  3  8
1  2  3  4  5
