How to group columns which respect certain conditions - python

I have been trying since this afternoon to group columns that respect certain conditions. To give an easy example, I have 3 columns like this:
ID1_column_A ID2_column_B ID2_column_C
234 100 10
334 130 11
34 250 40
34 200 25
My aim is to group the columns that start with the same ID, so here I will have only 2 columns in the output:
ID1_column_A Fusion_B_C
234 110
334 141
34 290
34 225
Thanks for reading.

IIUC, you can try this:
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
ID1 ID2
0 234 110
1 334 141
2 34 290
3 34 225
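Note that groupby(..., axis=1) is deprecated in recent pandas releases; if that warning appears, a transpose-based equivalent (a sketch producing the same frame) is:
df.T.groupby(df.columns.str.split('_').str[0]).sum().T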

import pandas as pd
data = pd.DataFrame({'ID1_column_A': [234, 334, 34, 34], 'ID2_column_B': [100, 130, 250, 200], 'ID2_column_C': [10, 11, 40, 25]})
data['Fusion_B_C'] = data.loc[:, 'ID2_column_B':'ID2_column_C'].sum(axis=1)
data.drop(columns=['ID2_column_B', 'ID2_column_C'], inplace=True)  # delete the now-merged source columns
print(data)
Output:
ID1_column_A Fusion_B_C
0 234 110
1 334 141
2 34 290
3 34 225
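A slightly more general variant (a sketch, assuming the ID2_ prefix convention holds for every column to be merged) selects the columns by prefix instead of hard-coding their names:
id2_cols = [c for c in data.columns if c.startswith('ID2_')]
data['Fusion_B_C'] = data[id2_cols].sum(axis=1)
data = data.drop(columns=id2_cols)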

Related

Merge two dataframes with overlapping index, keeping column values from left DataFrame

How can I join/merge two Pandas DataFrames with partially overlapping indexes, where the resulting joined DataFrame retains the column values from the first DataFrame, i.e. drops the duplicates from df2?
import pandas as pd
import io
df1 = """
date; count
'2020-01-01'; 210
'2020-01-02'; 189
'2020-01-03'; 612
'2020-01-04'; 492
'2020-01-05'; 185
'2020-01-06'; 492
'2020-01-07'; 155
'2020-01-08'; 62
'2020-01-09'; 15
"""
df2 = """
date; count
'2020-01-04'; 21
'2020-01-05'; 516
'2020-01-06'; 121
'2020-01-07'; 116
'2020-01-08'; 82
'2020-01-09'; 121
'2020-01-10'; 116
'2020-01-11'; 82
'2020-01-12'; 116
'2020-01-13'; 82
"""
df1 = pd.read_csv(io.StringIO(df1), sep=";")
df2 = pd.read_csv(io.StringIO(df2), sep=";")
print(df1)
print(df2)
I have tried using
df1.reset_index().merge(df2, how='outer').set_index('date')
however, this drops the joined df2 values. Is there a method to keep the duplicated rows of the first dataframe?
Desired outcome:
print(df3)
date count
'2020-01-01' 210
'2020-01-02' 189
'2020-01-03' 612
'2020-01-04' 492
'2020-01-05' 185
'2020-01-06' 492
'2020-01-07' 155
'2020-01-08' 62
'2020-01-09' 15
'2020-01-10' 116
'2020-01-11' 82
'2020-01-12' 116
'2020-01-13' 82
Any help greatly appreciated, thank you.
Use combine_first, which aligns on the index (here date), keeps df1's values wherever they exist, and fills the remaining labels from df2:
df3 = (df1.set_index('date')
          .combine_first(df2.set_index('date'))
          .reset_index()
      )
Output:
date count
0 '2020-01-01' 210
1 '2020-01-02' 189
2 '2020-01-03' 612
3 '2020-01-04' 492
4 '2020-01-05' 185
5 '2020-01-06' 492
6 '2020-01-07' 155
7 '2020-01-08' 62
8 '2020-01-09' 15
9 '2020-01-10' 116
10 '2020-01-11' 82
11 '2020-01-12' 116
12 '2020-01-13' 82
Here is another way, using concat and drop_duplicates:
df3=pd.concat([df1, df2]).drop_duplicates(["date"], keep="first", ignore_index=True)
output:
date count
0 '2020-01-01' 210
1 '2020-01-02' 189
2 '2020-01-03' 612
3 '2020-01-04' 492
4 '2020-01-05' 185
5 '2020-01-06' 492
6 '2020-01-07' 155
7 '2020-01-08' 62
8 '2020-01-09' 15
9 '2020-01-10' 116
10 '2020-01-11' 82
11 '2020-01-12' 116
12 '2020-01-13' 82
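Note that keep="first" prefers df1 only because it is listed first in the concat, and the result preserves concatenation order; if the dates could overlap out of order, an explicit sort keeps the output chronological (a sketch):
df3 = (pd.concat([df1, df2])
         .drop_duplicates(["date"], keep="first")
         .sort_values("date", ignore_index=True))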

Create new dataframe column from a shifted existing column

I have a dataframe with open, high, low, close prices of a stock. I want to add an additional column that has the percent change between today's open and yesterday's high. This is my current implementation; however, the resulting column contains percent changes between the current day's high and open.
df
open high low close
0 100 110 95 103
1 103 113 103 111
2 111 132 109 124
3 124 136 114 130
My attempt (incorrect):
df['prevhigh_curropen'] = (df['open'] - df['high']).shift(-1) / df['high'].shift(-1)
Output (incorrect):
open high low close prevhigh_curropen
0 100 110 95 103 -0.091
1 103 113 103 111 -0.089
2 111 132 109 124 -0.159
3 124 136 114 130 -0.088
Desired output:
open high low close prevhigh_curropen
0 100 110 95 103 nan
1 103 113 103 111 -0.064
2 111 132 109 124 -0.018
3 124 136 114 130 -0.061
Is there a non-iterative way to do this like I attempted above?
Your formula is wrong; you have to use df['high'].shift():
df = pd.DataFrame({'open': range(1, 11), 'high': range(1, 11)})
df['prevhigh_curropen'] = df['open'].sub(df['high'].shift()) \
                                    .div(df['high'].shift()) \
                                    .mul(100)
>>> df
open high prevhigh_curropen
0 1 1 NaN
1 2 2 100.000000
2 3 3 50.000000
3 4 4 33.333333
4 5 5 25.000000
5 6 6 20.000000
6 7 7 16.666667
7 8 8 14.285714
8 9 9 12.500000
9 10 10 11.111111
For your sample the output is:
>>> df
open high low close prevhigh_curropen
0 100 110 95 103 NaN
1 103 113 103 111 -6.363636
2 111 132 109 124 -1.769912
3 124 136 114 130 -6.060606
The first value is NaN because we don't know the high value from the previous day.
We can simplify the terms slightly from (a - b) / b to (a / b) - (b / b) to (a / b) - 1.
Mathematical Operators:
df['prevhigh_curropen'] = (df['open'] / df['high'].shift()) - 1
or with Series Methods:
df['prevhigh_curropen'] = df['open'].div(df['high'].shift()).sub(1)
The benefit here is that we only need to shift once and maintain a single copy of df['high'].shift().
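If the shifted series is reused elsewhere, binding it to a name once makes that point explicit (a sketch yielding the same column):
prev_high = df['high'].shift()
df['prevhigh_curropen'] = df['open'] / prev_high - 1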
Resulting df:
open high low close prevhigh_curropen
0 100 110 95 103 NaN
1 103 113 103 111 -0.063636
2 111 132 109 124 -0.017699
3 124 136 114 130 -0.060606
Setup Used:
import pandas as pd
df = pd.DataFrame({
    'open': [100, 103, 111, 124],
    'high': [110, 113, 132, 136],
    'low': [95, 103, 109, 114],
    'close': [103, 111, 124, 130]
})

Pandas Dataframe

I have a dataframe containing a number of columns and rows, in all of the columns except for the leftmost two, there is data of the form "integer-integer". I would like to split all of these columns into two columns, with each integer in its own cell, and remove the dash.
I have tried to follow the answers in Pandas Dataframe: Split multiple columns each into two columns, but it seems that they are splitting after one element, while I would like to split on the "-".
By way of example, suppose I have a dataframe of the form:
I would like to split the columns labelled 2 through to 22, to have them called 2F, 2A, 3F, 3A, ..., 6A with the data in the first row being R1, Hawthorn, 229, 225, 91, 81, ..., 12.
Thank you for any help.
You can use DataFrame.set_index with DataFrame.stack to get a Series, split it into 2 new columns with Series.str.split, convert to integers, create the new column names with DataFrame.set_axis, reshape back with DataFrame.unstack, sort the columns with DataFrame.sort_index, and finally flatten the MultiIndex and convert the index back to columns with DataFrame.reset_index:
# first replace the column names with default values
df.columns = range(len(df.columns))
df = (df.set_index([0, 1])
        .stack()
        .str.split('-', expand=True)
        .astype(int)
        .set_axis(['F', 'A'], axis=1, inplace=False)
        .unstack()
        .sort_index(axis=1, level=[1, 0], ascending=[True, False]))
df.columns = df.columns.map(lambda x: f'{x[1]}{x[0]}')
df = df.reset_index()
print(df)
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 216 142 439 367 7 12
1 R2 Sydney 226 214 93 92 151 167 377 381 12 8
2 R3 Geelong 216 228 91 166 159 121 369 349 16 14
3 R4 North Melbourne 213 239 169 126 142 155 355 394 8 9
4 R5 Gold Coast 248 226 166 94 267 169 455 389 18 6
5 R6 St Kilda 242 197 118 161 158 156 466 353 15 16
6 R7 Fremantle 225 219 72 84 224 185 449 464 7 5
For Input:
df = pd.DataFrame({0: ['R1'], 1: ['Hawthorn'], 2: ['229-225'], 3: ['91-81'], 4:['210-142'], 5:['439-367'], 6:['7-12']})
0 1 2 3 4 5 6
0 R1 Hawthorn 229-225 91-81 210-142 439-367 7-12
Trying the code:
for i in df.columns[2:]:
    df[[str(i) + 'F', str(i) + 'A']] = pd.DataFrame(df[i].str.split('-').tolist(), index=df.index)
    del df[i]
Prints (1st row):
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 210 142 439 367 7 12
You can use a lambda function to split a Series:
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
>>> data
0 12-24
1 13-26
2 14-28
3 15-30
df["d1"] = df["data"].apply(lambda x: x.split("-")[0])
df["d2"] = df["data"].apply(lambda x: x.split("-")[1])
df.head()
>>>
data d1 d2
0 12-24 12 24
1 13-26 13 26
2 14-28 14 28
3 15-30 15 30
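The two apply calls split each string twice; the vectorized Series.str.split with expand=True does the same in one pass (a sketch, assuming the same data column):
df[['d1', 'd2']] = df['data'].str.split('-', expand=True)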

Compare each row of Pandas df1 with every row within df2 and return string value from closest matching column

I have two data frames.
df1 includes 4 men and 4 women with their weight and height (inches).
#df1
John, 236, 76
Jack, 204, 74
Jim, 156, 71
Jared, 182, 72
Suzy, 119, 60
Sally, 149, 66
Sharon, 169, 65
Sammy, 182, 75
df2 includes 4 men and 4 women with their weight and height (inches).
#df2
Aaron, 285, 77
Abe, 236, 75
Alex, 178, 72
Adam, 195, 71
Mary, 148, 66
Maylee, 155, 66
Marilyn, 199, 65
Madison, 160, 73
What I am trying to do is have men from df1 be compared to men from df2 to see who they are most like based on height and weight. Just subtract weight from weight and height from height and return an absolute value for each man in df2. More specifically, return the name of the man most similar.
So in this case John's closest match is Abe, so in a new column:
df1['doppelganger'] = "Abe".
I'm a beginner hobbyist so even pointing me in the right direction would be helpful. I've been looking through stack overflow for about five hours trying to figure out how to go about something like this.
First it is necessary to distinguish the men from the women, so a new column g is created by repeating m and f 4 times each. Then DataFrame.merge with an outer join on the new column produces all combinations; new columns are created for the absolute differences, and a last column holds their sum. The rows are then sorted by 3 columns with DataFrame.sort_values, so the first row per (A, g) group can be kept with DataFrame.drop_duplicates:
df = (df1.assign(g=['m']*4 + ['f']*4)
         .merge(df2.assign(g=['m']*4 + ['f']*4), on='g', how='outer', suffixes=('', '_'))
         .assign(dif1=lambda x: x['B'].sub(x['B_']).abs(),
                 dif2=lambda x: x['C'].sub(x['C_']).abs(),
                 sumdiff=lambda x: x['dif1'] + x['dif2'])
         .sort_values(['A', 'g', 'sumdiff'])
         .drop_duplicates(['A', 'g'])
         .sort_index()
         .rename(columns={'A_': 'doppelganger'}))
print (df)
A B C g doppelganger B_ C_ dif1 dif2 sumdiff
1 John 236 76 m Abe 236 75 0 1 1
7 Jack 204 74 m Adam 195 71 9 3 12
10 Jim 156 71 m Alex 178 72 22 1 23
14 Jared 182 72 m Alex 178 72 4 0 4
16 Suzy 119 60 f Mary 148 66 29 6 35
20 Sally 149 66 f Mary 148 66 1 0 1
25 Sharon 169 65 f Maylee 155 66 14 1 15
31 Sammy 182 75 f Madison 160 73 22 2 24
Input DataFrames:
print (df1)
A B C
0 John 236 76
1 Jack 204 74
2 Jim 156 71
3 Jared 182 72
4 Suzy 119 60
5 Sally 149 66
6 Sharon 169 65
7 Sammy 182 75
print (df2)
A B C
0 Aaron 285 77
1 Abe 236 75
2 Alex 178 72
3 Adam 195 71
4 Mary 148 66
5 Maylee 155 66
6 Marilyn 199 65
7 Madison 160 73
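For the same inputs, a numpy broadcasting sketch (assuming the g helper column from the answer above has been added to both frames, with columns named A, B, C as shown) finds each closest match without building the full merged table:
import numpy as np
for g, sub2 in df2.groupby('g'):
    mask = df1['g'] == g
    # |weight diff| + |height diff| between every df1 row and every df2 row in the group
    dist = np.abs(df1.loc[mask, ['B', 'C']].to_numpy()[:, None]
                  - sub2[['B', 'C']].to_numpy()).sum(axis=2)
    df1.loc[mask, 'doppelganger'] = sub2['A'].to_numpy()[dist.argmin(axis=1)]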

Comparing/Mapping different series in different Dataframes

I have two data frames. DataFrame A, which is the main dataframe, has 3 columns: "Number", "donation" and "Var1". DataFrame B has 2 columns: "Number" and "location". The "Number" column in DataFrame B is a subset of "Number" in A. What I would like to do is form a new column in DataFrame A, "NEW", which checks each value of "Number" against DataFrame B: if the value is present there it becomes 1, otherwise 0.
>>>DFA
Number donation Var1
243 4 45
677 56 34
909 34 22
565 78 24
568 90 21
784 33 88
787 22 66
>>>DFB
Number location
909 PB
565 WB
784 AU
These are the two dataframes, I want the DFA with a new column which looks something like this.
>>>DFA
Number donation Var1 NEW
243 4 45 0
677 56 34 0
909 34 22 1
565 78 24 1
568 90 21 0
784 33 88 1
787 22 66 0
The new column has value 1 if the Number is present in DFB and 0 if it is absent.
You could use the isin method:
DFA['NEW'] = (DFA['Number'].isin(DFB['Number'])).astype(int)
For example,
import pandas as pd
DFA = pd.DataFrame({'Number': [243, 677, 909, 565, 568, 784, 787],
                    'Var1': [45, 34, 22, 24, 21, 88, 66],
                    'donation': [4, 56, 34, 78, 90, 33, 22]})
DFB = pd.DataFrame({'Number': [909, 565, 784], 'location': ['PB', 'WB', 'AU']})
DFA['NEW'] = (DFA['Number'].isin(DFB['Number'])).astype(int)
print(DFA)
yields
Number Var1 donation NEW
0 243 45 4 0
1 677 34 56 0
2 909 22 34 1
3 565 24 78 1
4 568 21 90 0
5 784 88 33 1
6 787 66 22 0
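An equivalent formulation with numpy.where, in case the boolean-to-integer cast should be explicit (a sketch):
import numpy as np
DFA['NEW'] = np.where(DFA['Number'].isin(DFB['Number']), 1, 0)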
