Shaping a Pandas DataFrame (multiple columns into 2) - python

I have a dataframe similar to the one below and need it reshaped as per the expected output.
df = pd.DataFrame({
'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
'col2': [1, 3, 5, 7, 9, 11],
'col3': [2, 4, 6, 8, 10, 12]
})
col1 col2 col3
0 A 1 2
1 A 3 4
2 A 5 6
3 B 7 8
4 B 9 10
5 B 11 12
Expected Output
df_expected = pd.DataFrame({
'A': [1, 2, 3, 4, 5, 6],
'B': [7, 8, 9, 10, 11, 12]
})
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
So far I have tried stack, unstack & pivot without getting the desired result.
Thanks for your help!

pd.DataFrame(df.groupby('col1').agg(list).T.sum().to_dict())
Note that summing the lists concatenates each group's col2 and col3, so column A comes out as [1, 3, 5, 2, 4, 6] rather than the interleaved [1, 2, 3, 4, 5, 6] of the expected output.
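If the row-wise interleaved order matters, one option (not from the original answer) is to ravel each group's col2/col3 values row by row; a sketch assuming equal-sized groups as in the sample data:
# one column per group, values taken row by row within the group
pd.DataFrame({k: g[['col2', 'col3']].to_numpy().ravel() for k, g in df.groupby('col1')})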

Use NumPy to reshape the data, then package it back up into a dataframe. Note this relies on every group in col1 having the same number of rows and on col1 being sorted.
import numpy as np

cols = (df['col2'], df['col3'])
# stack interleaves col2/col3 row-wise; the reshape splits the flat
# sequence into one row per group (len(cols) equals the number of groups here)
data = np.stack(cols, axis=1).reshape(len(cols), len(df))
dft = pd.DataFrame(data, index=df['col1'].unique()).T
print(dft)
Result
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
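For reference, the pivot route the asker mentions can also be made to work by melting first and numbering the values within each group. A sketch, not from the original answers; the names long, pos and dft are illustrative:
long = df.melt(id_vars='col1', value_vars=['col2', 'col3'], ignore_index=False)
long = long.sort_index(kind='mergesort')       # stable sort: restores row order, col2 before col3 per row
long['pos'] = long.groupby('col1').cumcount()  # running position within each group
dft = long.pivot(index='pos', columns='col1', values='value')
print(dft)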


Replace values in dataframe where updated versions are in another dataframe [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 1 year ago.
I have two dataframes, something like this:
df1 = pd.DataFrame({
'Identity': [3, 4, 5, 6, 7, 8, 9],
'Value': [1, 2, 3, 4, 5, 6, 7],
'Notes': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})
df2 = pd.DataFrame({
'Identity': [4, 8],
'Value': [0, 128],
})
In[3]: df1
Out[3]:
Identity Value Notes
0 3 1 a
1 4 2 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 6 f
6 9 7 g
In[4]: df2
Out[4]:
Identity Value
0 4 0
1 8 128
I'd like to use df2 to overwrite df1 but only where values exist in df2, so I end up with:
Identity Value Notes
0 3 1 a
1 4 0 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 128 f
6 9 7 g
I've been searching through all the various merge, combine, and join functions, but I can't seem to find one that does what I want. Is there a simple way of doing this?
Use map with a lookup Series built from df2, falling back to the original values where there is no match:
df1['Value'] = df1['Identity'].map(df2.set_index('Identity')['Value']).fillna(df1['Value'])
Or try set_index with reindex, then reset_index and fillna:
df1['Value'] = (df2.set_index('Identity').reindex(df1['Identity'])
                   .reset_index(drop=True)['Value'].fillna(df1['Value']))
>>> df1
Identity Value Notes
0 3 1.0 a
1 4 0.0 b
2 5 3.0 c
3 6 4.0 d
4 7 5.0 e
5 8 128.0 f
6 9 7.0 g
Reindexing on df1['Identity'] produces NaN wherever df2 has no matching row, and fillna then takes those values from df1.
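Another option, not from the original answers: align both frames on Identity and let DataFrame.update overwrite df1 in place with df2's non-NaN values. A sketch; depending on your pandas version, Value may be upcast to float just as in the reindex approach:
df1 = df1.set_index('Identity')
df1.update(df2.set_index('Identity'))  # overwrite only where df2 has a value for that Identity
df1 = df1.reset_index()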

How to merge multiple columns of the same data frame

How can I merge multiple column values into one column of the same data frame, getting a new column with the unique values?
  Column1 Column2 Column3 Column4 Column5
0       a       1       2       3       4
1       a       3       4       5
2       b       6       7       8
3       c       7       7
Output:
Column A
a
a
b
c
1
3
6
7
2
4
5
8
Use unstack or melt to reshape, then remove missing values with dropna and duplicates with drop_duplicates:
df1 = df.unstack().dropna().drop_duplicates().reset_index(drop=True).to_frame('A')
df1 = df.melt(value_name='A')[['A']].dropna().drop_duplicates().reset_index(drop=True)
print (df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8
Here is another way to do it if you are OK using NumPy. This handles either NaNs or empty strings in the original dataframe and is a bit faster than unstack or melt.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Column1': ['a', 'a', 'b', 'c'],
'Column2': [1, 3, 6, 7],
'Column3': [2, 4, 7, 7],
'Column4': [3, 5, 8, np.nan],
'Column5': [4, '', '', np.nan]})
# flatten column-major (order='F') so Column1's values come first; pd.unique keeps order
u = pd.unique(df.values.flatten(order='F'))
# drop empty strings and NaNs
u = u[np.where(~np.isin(u, ['']) & ~pd.isnull(u))[0]]
df1 = pd.DataFrame(u, columns=['A'])
print(df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8
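If you would rather stay in pandas but also drop the empty strings, a small variation on the unstack answer (not from the original answers) is to normalize '' to NaN first; assuming the same df as above:
df1 = (df.replace('', np.nan)            # treat empty strings like missing values
         .unstack().dropna().drop_duplicates()
         .reset_index(drop=True).to_frame('A'))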

Performing outer join that merges joined columns

I am performing an outer join on two DataFrames:
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'date': [4, 5, 6, 7, 8],
'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 6],
'date': [4, 5, 6, 7, 8],
'str': ['A', 'B', 'C', 'D', 'Q']})
pd.merge(df1, df2, on=["id","date"], how="outer")
This gives the result
date id str_x str_y
0 4 1 a A
1 5 2 b B
2 6 3 c C
3 7 4 d D
4 8 5 e NaN
5 8 6 NaN Q
Is it possible to perform the outer join such that the str columns are concatenated? In other words, how do I perform the join so that we get the DataFrame
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
where all NaN have been set to None.
I don't think so; a possible solution is to replace the NaNs and concatenate the columns yourself:
df = (pd.merge(df1, df2, on=["id","date"], how="outer", suffixes=('','_'))
       .assign(str=lambda x: x['str'].fillna('') + x['str_'].fillna(''))
       .drop(columns='str_'))
Similar alternative:
df = (pd.merge(df1, df2, on=["id","date"], how="outer", suffixes=('','_'))
       .assign(str=lambda x: x.filter(like='str').fillna('').values.sum(axis=1))
       .drop(columns='str_'))
print (df)
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
If the ('id', 'date') pairs are unique in each data frame, then you can set them as the index and simply add the dataframes:
icols = ['date', 'id']
df1.set_index(icols).add(df2.set_index(icols), fill_value='').reset_index()
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
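Since the key pairs are unique, you can also skip merge entirely: concatenate the two frames and let a groupby string sum do the concatenation. A sketch under the same uniqueness assumption; the name out is illustrative:
out = (pd.concat([df1, df2], ignore_index=True)
         .groupby(['id', 'date'], as_index=False)['str'].sum())  # summing strings concatenates them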

Pandas loop a dataframe and compare all rows with other DF rows and assign a value

I have two DF:
df1 = pd.DataFrame({'A': [3, 2, 5, 1, 6], 'B': [4, 8, 5, 6, 2], 'C': [4, 8, 3, 8, 0], 'D': [1, 5, 2, 5, 7], 'zebra': [5, 7, 2, 4, 8]})
df2 = pd.DataFrame({'B': [7, 3, 5, 8, 8], 'D': [4, 5, 8, 5, 3]})
print(df1)
print(df2)
A B C D zebra
0 3 4 4 1 5
1 2 8 8 5 7
2 5 5 3 2 2
3 1 6 8 5 4
4 6 2 0 7 8
B D
0 7 4
1 3 5
2 5 8
3 8 5
4 8 3
This is a simple example; in reality df1 has 1000k+ rows and 10+ columns, while df2 has only 24 rows and fewer columns. I would like to loop over all rows in df2 and compare specific columns (for example 'B' and 'D') with the same columns in df1. Where the values in both columns match a row in df1, the corresponding 'zebra' value from that row should be assigned to a new 'zebra' column in df2. If no match is found, assign 0 or NaN.
B D zebra
0 7 4 nan
1 3 5 nan
2 5 8 nan
3 8 5 7
4 8 3 nan
In the example, only row index 3 in df2 ('B': 8, 'D': 5) matches a row in df1 (index 1; note that row index should not matter in the comparison), so it receives the corresponding 'zebra' value 7.
A merge would do
df2.merge(df1[['B', 'D', 'zebra']], on = ['B', 'D'], how = 'left')
B D zebra
0 7 4 NaN
1 3 5 NaN
2 5 8 NaN
3 8 5 7.0
4 8 3 NaN
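If zeros are preferred over NaN (the question allows either), fill and cast afterwards; a small follow-up, not part of the original answer, with out as an illustrative name:
out = df2.merge(df1[['B', 'D', 'zebra']], on=['B', 'D'], how='left')
out['zebra'] = out['zebra'].fillna(0).astype(int)  # 0 instead of NaN, back to integer dtype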

groupby common values in two columns

I need to extract a common max value from pairs of rows that have common values in two columns.
The commonality is between the values in columns A and B: rows 0 and 1 form a pair, rows 2 and 3 form a pair, and row 4 is on its own.
f = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]], columns=['A', 'B', 'Value'])
f
A B Value
0 1 2 30
1 2 1 20
2 2 6 15
3 6 2 70
4 7 10 35
The goal is to extract max values, so the end result is:
f_final = pd.DataFrame([[1, 2, 30, 30], [2, 1, 20, 30], [2, 6, 15, 70], [6, 2, 70, 70], [7, 10, 35, 35]], columns=['A', 'B', 'Value', 'Max'])
f_final
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
I could do this if there is a way to assign a common, non-repeating key:
f_key = pd.DataFrame([[1, 1, 2, 30], [1, 2, 1, 20], [2, 2, 6, 15], [2, 6, 2, 70], [3, 7, 10, 35]], columns=['key', 'A', 'B', 'Value'])
f_key
key A B Value
0 1 1 2 30
1 1 2 1 20
2 2 2 6 15
3 2 6 2 70
4 3 7 10 35
Following up with the groupby and transform:
f_key['Max'] = f_key.groupby('key')['Value'].transform('max')
f_key.drop(columns='key', inplace=True)
f_key
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
Question 1:
How would one assign this common key?
Question 2:
Is there a better way of doing this, skipping the common-key step?
Cheers...
You could sort the values in columns A and B so that for each row the value in A is less than or equal to the value in B. Once the values are ordered, you can apply groupby/transform('max') as usual:
import pandas as pd
df = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
print(df)
yields
A B Value Max
0 1 2 30 30
1 1 2 20 30
2 2 6 15 70
3 2 6 70 70
4 7 10 35 35
The above method will still work even if the values in A and B are strings. For example,
df = pd.DataFrame([['ab', 'ac', 30], ['ac', 'ab', 20],
                   ['cb', 'ca', 15], ['ca', 'cb', 70],
                   ['ff', 'zz', 35]], columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
yields
In [267]: df
Out[267]:
A B Value Max
0 ab ac 30 30
1 ab ac 20 30
2 ca cb 15 70
3 ca cb 70 70
4 ff zz 35 35
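To answer Question 1 more directly: if you prefer not to mutate A and B, you can build the common key from a row-wise sort and group on it, leaving the original columns untouched. A sketch, not from the original answer, assuming the two columns are uniformly typed so np.sort can compare them:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
                  columns=['A', 'B', 'Value'])
# Sort each (A, B) pair so that (1, 2) and (2, 1) map to the same key,
# then group on the key Series without touching the original columns.
key = pd.Series(list(map(tuple, np.sort(df[['A', 'B']].to_numpy(), axis=1))), index=df.index)
df['Max'] = df.groupby(key)['Value'].transform('max')
print(df)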
