pandas dataframe how to merge all rows based on groupby - python

I have dataframe with many columns, 2 are categorical and the rest are numeric:
df = [type1 , type2 , type3 , val1, val2, val3
a b q 1 2 3
a c w 3 5 2
b c t 2 9 0
a b p 4 6 7
a c m 2 1 8]
I want to apply a merge based on the operation groupby(["type1","type2"]) that will create the following dataframe:
df = [type1 , type2 ,type3, val1, val2, val3 , val1_a, val2_b, val3_b
a b q 1 2 3 4 6 7
a c w 3 5 2 2 1 8
b c t 2 9 0 2 9 0
Please note: each group contains either 1 or 2 rows, never more. If a group has a single row, just duplicate it.

The idea is to use GroupBy.cumcount to build a counter within each (type1, type2) group, add that counter to a MultiIndex, reshape with DataFrame.unstack, forward-fill missing values along each row with ffill, convert back to integers, sort the columns by the counter level, and finally flatten the MultiIndex columns in a list comprehension:
g = df.groupby(["type1","type2"]).cumcount()
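df1 = (df.drop(columns="type3") # type3 holds strings and would break astype(int); the output below omits it
.set_index(["type1","type2", g])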
.unstack()
.ffill(axis=1)
.astype(int)
.sort_index(level=1, axis=1))
df1.columns = [f'{a}_{b}' if b != 0 else a for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
type1 type2 val1 val2 val3 val1_1 val2_1 val3_1
0 a b 1 2 3 4 6 7
1 a c 3 5 2 2 1 8
2 b c 2 9 0 2 9 0
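For reference, a self-contained sketch of the approach above that can be run as-is. It rebuilds the question's frame by hand and assumes that type3 may be dropped before reshaping, since it is not numeric; the name wide is only illustrative.

```python
import pandas as pd

# The frame from the question, rebuilt by hand
df = pd.DataFrame({
    "type1": ["a", "a", "b", "a", "a"],
    "type2": ["b", "c", "c", "b", "c"],
    "type3": ["q", "w", "t", "p", "m"],
    "val1": [1, 3, 2, 4, 2],
    "val2": [2, 5, 9, 6, 1],
    "val3": [3, 2, 0, 7, 8],
})

# 0 for the first row of each (type1, type2) group, 1 for the second
g = df.groupby(["type1", "type2"]).cumcount()

wide = (df.drop(columns="type3")            # assumption: type3 is not needed
          .set_index(["type1", "type2", g])
          .unstack()                        # counter becomes a column level
          .ffill(axis=1)                    # single-row groups reuse their own values
          .astype(int)
          .sort_index(level=1, axis=1))     # all counter-0 columns first, then counter-1

# Flatten ("val1", 1) to "val1_1"; counter 0 keeps the plain name
wide.columns = [f"{a}_{b}" if b != 0 else a for a, b in wide.columns]
wide = wide.reset_index()
print(wide)
```

The forward fill only works because unstack places each val column's counter-0 and counter-1 copies next to each other, so a missing second value is filled from the same variable.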

Related

Pandas replace columns by merging another dataframe

I have a dataframe df1 that looks like this:
id A B
0 1 10 5
1 1 11 6
2 2 10 7
3 2 11 8
And another dataframe df2:
id A
0 1 3
1 2 4
Now I want to replace A column in df1 with the value of A in df2 based on id, so the result should look like this:
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
There's a way that I can drop column A in df1 first and merge df2 to df1 on id like df1 = df1.drop(['A'], axis=1).merge(df2, how='left', on='id'), but if there're like 10 columns in df2, it will be pretty hard. Is there a more elegant way to do so?
Here is one way to do it, using DataFrame.update (df below is the question's df1). It requires setting id as the index of both frames so that update can align the rows:
df.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df.update(df2)
df['A'] = df['A'].astype(int) # update can leave A as float, so cast it back to int
df.reset_index()
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
Alternatively, merge just the id column of df (the question's df1) with df2, then use combine_first to fill in the remaining columns from the original DataFrame:
df = df[['id']].merge(df2).combine_first(df)
print(df)
Output:
A B id
0 3 5 1
1 3 6 1
2 4 7 2
3 4 8 2
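The drop-and-merge idea from the question also scales to many columns if the shared column names are computed rather than typed out. The following is a minimal sketch under that assumption, not taken from either answer; shared and result are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 1, 2, 2], "A": [10, 11, 10, 11], "B": [5, 6, 7, 8]})
df2 = pd.DataFrame({"id": [1, 2], "A": [3, 4]})

# Columns present in both frames, excluding the join key
shared = df1.columns.intersection(df2.columns).drop("id")

result = (df1.drop(columns=shared)               # remove the columns to be replaced
             .merge(df2, on="id", how="left")    # bring in df2's versions
             [df1.columns])                      # restore the original column order
print(result)
```

With how="left", any id missing from df2 would get NaN in the replaced columns, which is worth checking before relying on the result.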

How to set column headers to the first row in Pandas dataframe?

How do I set the column header of a dataframe to the first row of a dataframe and reset the column names?
# Creation of dataframe
df = pd.DataFrame({"A": ["1", "4", "7"],
"B": ["2", "5", "8"],
"C": ['3','6','9']})
# df:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
Desired Outcome:
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
Use Index.to_frame with a transpose to turn the column names into a one-row DataFrame, prepend it with concat, and finally replace the column names with a range:
df = pd.concat([df.columns.to_frame().T, df], ignore_index=True)
df.columns = range(len(df.columns))
print (df)
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
Or use DataFrame.set_axis for chained method solution:
df = (pd.concat([df.columns.to_frame().T, df], ignore_index=True)
.set_axis(range(len(df.columns)), axis=1))
What you want to do is similar to reset_index but on the other axis. Unfortunately, there is no axis parameter in reset_index.
But, you can cheat a bit and apply a double transposition to handle the columns as index temporarily:
df.T.reset_index().T.reset_index(drop=True)
output:
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
You can use np.vstack on a list of column names and the DataFrame to create an array with one extra row; then cast it into pd.DataFrame:
out = pd.DataFrame(np.vstack([df.columns, df]))
Output:
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
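A small runnable check that the three approaches above give the same values, using the frame from the question. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1", "4", "7"],
                   "B": ["2", "5", "8"],
                   "C": ["3", "6", "9"]})

# 1. concat the header (as a one-row frame) on top of the data
via_concat = pd.concat([df.columns.to_frame().T, df], ignore_index=True)
via_concat.columns = range(len(via_concat.columns))

# 2. double transposition around reset_index
via_transpose = df.T.reset_index().T.reset_index(drop=True)

# 3. stack the header on top of the values with NumPy
via_vstack = pd.DataFrame(np.vstack([df.columns, df]))

print(via_concat)
```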

Pandas dataframe take group max value per group when groupby

I have dataframe with many columns, 2 are categorical and the rest are numeric:
df = [type1 , type2 , type3 , val1, val2, val3
a b q 1 2 3
a c w 3 5 2
b c t 2 9 0
a b p 4 6 7
a c m 2 1 8]
I want to apply a merge based on groupby(["type1","type2"]) that takes the max value of each column within each group:
df = [type1 , type2 ,type3, val1, val2, val3
a b q 2 6 7
a c w 4 5 8
b c t 2 9 0
Explanation: val3 of first row is 7 because this is the maximal value when type1 = a, type2 = b.
Similarly, val3 of second row is 8 because this is the maximal value when type1 = a, type2 = c.
If you need to aggregate all columns by max:
df = df.groupby(["type1","type2"]).max()
print (df)
type3 val1 val2 val3
type1 type2
a b q 4 6 7
c w 3 5 8
b c t 2 9 0
If some columns need a different aggregation, build a dictionary that maps every column name to 'max' and then override individual entries; here type3 uses first and val1 uses last (this starts again from the original df):
d = dict.fromkeys(df.columns.difference(['type1','type2']), 'max')
d['type3'] = 'first'
d['val1'] = 'last'
df = df.groupby(["type1","type2"], as_index=False, sort=False).agg(d)
print (df)
type1 type2 type3 val1 val2 val3
0 a b q 4 6 7
1 a c w 2 5 8
2 b c t 2 9 0
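A self-contained sketch of the dictionary approach, assuming the goal is the max of every numeric column while keeping type3 from the first row of each group. The names agg_map and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "type1": ["a", "a", "b", "a", "a"],
    "type2": ["b", "c", "c", "b", "c"],
    "type3": ["q", "w", "t", "p", "m"],
    "val1": [1, 3, 2, 4, 2],
    "val2": [2, 5, 9, 6, 1],
    "val3": [3, 2, 0, 7, 8],
})

# Every non-key column defaults to "max" ...
agg_map = dict.fromkeys(df.columns.difference(["type1", "type2"]), "max")
# ... except type3, which takes the value from the first row of the group
agg_map["type3"] = "first"

# sort=False keeps groups in order of first appearance
out = df.groupby(["type1", "type2"], as_index=False, sort=False).agg(agg_map)
print(out)
```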

Pandas - combine columns and put one after another?

I have the following dataframe:
a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6
The desirable output is:
a,b
1,3
2,4
3,5
2,4
3,5
4,6
There are many "a" and "b" headers in the dataframe; the maximum is a50 and b50. So I am looking for a way to combine them all into just "a" and "b".
I think it's possible to do with concat, but I have no idea how to combine it all, putting all the values under each other. I'll be grateful for any ideas.
You can use pd.wide_to_long:
pd.wide_to_long(df.reset_index(), ['a','b'], 'index', 'No').reset_index()[['a','b']]
Output:
a b
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
5 4 6
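A runnable version of the wide_to_long call with keyword arguments spelled out. The explicit sort_index is an addition here, made so that the row order (all of suffix 1, then all of suffix 2) does not depend on the function's internal ordering; reshaped is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"a1": [1, 2, 3], "a2": [2, 3, 4],
                   "b1": [3, 4, 5], "b2": [4, 5, 6]})

reshaped = (pd.wide_to_long(df.reset_index(),     # "index" column identifies each row
                            stubnames=["a", "b"], # column prefixes to collect
                            i="index",
                            j="No")               # new level holding the numeric suffix
              .sort_index(level=["No", "index"])  # suffix 1 rows first, then suffix 2
              .reset_index(drop=True)[["a", "b"]])
print(reshaped)
```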
First we read the dataframe:
import pandas as pd
from io import StringIO
s = """a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6"""
df = pd.read_csv(StringIO(s), sep=',')
Then we stack the columns, and separate the number of the columns from the letter 'a' or 'b':
stacked = df.stack().rename("val").reset_index(1).reset_index()
cols_numbers = pd.DataFrame(stacked
.level_1
.str.split(r'(\d+)') # raw string; \d+ also handles multi-digit suffixes such as a50
.apply(lambda l: l[:2])
.tolist(),
columns=["col", "num"])
x = cols_numbers.join(stacked[['val', 'index']])
print(x)
col num val index
0 a 1 1 0
1 a 2 2 0
2 b 1 3 0
3 b 2 4 0
4 a 1 2 1
5 a 2 3 1
6 b 1 4 1
7 b 2 5 1
8 a 1 3 2
9 a 2 4 2
10 b 1 5 2
11 b 2 6 2
Finally, we group by index and num to get two columns a and b, and we fill the first row of the b column with the second value, to get what was expected:
result = (x
.set_index("col", append=True)
.groupby(["index", "num"])
.val
.apply(lambda g:
g
.unstack()
.bfill() # fillna(method="bfill") is deprecated in recent pandas
.head(1))
.reset_index(-1, drop=True))
print(result)
col a b
index num
0 1 1.0 3.0
2 2.0 4.0
1 1 2.0 4.0
2 3.0 5.0
2 1 3.0 5.0
2 4.0 6.0
To get rid of the multiindex at the end: result.reset_index(drop=True)
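Since the question mentions concat, here is a minimal sketch of that route. It is not from either answer and assumes every header is a single letter ("a" or "b") followed by a number, with both letters present for each number.

```python
import pandas as pd

df = pd.DataFrame({"a1": [1, 2, 3], "a2": [2, 3, 4],
                   "b1": [3, 4, 5], "b2": [4, 5, 6]})

# Numeric suffixes found in the headers: [1, 2] here, up to 50 in the question
suffixes = sorted({int(col[1:]) for col in df.columns})

# One two-column block per suffix, renamed to plain "a" and "b"
blocks = [df[[f"a{n}", f"b{n}"]].set_axis(["a", "b"], axis=1) for n in suffixes]

# Put the blocks one under another
combined = pd.concat(blocks, ignore_index=True)
print(combined)
```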

How to extract values of one dataframe with values of other dataframe in pandas?

Suppose that you create the following pandas data frames:
In[1]: print(df1.to_string())
ID value
0 1 a
1 2 b
2 3 c
3 4 d
In[2]: print(df2.to_string())
Id_a Id_b
0 1 2
1 4 2
2 2 1
3 3 3
4 4 4
5 2 2
How can I create a frame df_ids_to_values with the following values:
In[3]: print(df_ids_to_values.to_string())
value_a value_b
0 a b
1 d b
2 b a
3 c c
4 d d
5 b b
In other words, I would like to replace the ids in df2 with the corresponding values from df1. I have tried doing this with a for loop, but it is very slow, and I am hoping that there is a function in pandas that lets me do this efficiently.
Thanks for your help...
Start by setting an index on df1
df1 = df1.set_index('ID')
then join the two columns
df = df2.join(df1, on='Id_a')
df = df.rename(columns = {'value' : 'value_a'})
df = df.join(df1, on='Id_b')
df = df.rename(columns = {'value' : 'value_b'})
result:
> df
Id_a Id_b value_a value_b
0 1 2 a b
1 4 2 d b
2 2 1 b a
3 3 3 c c
4 4 4 d d
5 2 2 b b
[6 rows x 4 columns]
(and you get to your expected output with df[['value_a','value_b']])
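An alternative sketch, not from the answer above, that maps each id column through a lookup Series instead of joining twice. It assumes every id in df2 exists in df1; lookup is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"Id_a": [1, 4, 2, 3, 4, 2], "Id_b": [2, 2, 1, 3, 4, 2]})

# Series indexed by ID, so Series.map can translate ids into values
lookup = df1.set_index("ID")["value"]

df_ids_to_values = (df2.apply(lambda col: col.map(lookup))   # map every id column
                       .rename(columns={"Id_a": "value_a", "Id_b": "value_b"}))
print(df_ids_to_values.to_string())
```

Ids that are missing from df1 would come out as NaN rather than raising an error.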
