Merge two dataframes in Pandas by taking the mean between the columns - python

I have the following dataframes:
df1
C1 C2 C3
0 0 0 0
1 0 0 0
df2
C1 C4 C5
0 1 1 1
1 1 1 1
The result I am looking for is:
df3
C1 C2 C3 C4 C5
0 0.5 0 0 1 1
1 0.5 0 0 1 1
Is there an easy way to accomplish this?
Thanks in advance!
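For reference, a minimal construction of the two input frames (assuming a default RangeIndex, as the display suggests):
import pandas as pd

df1 = pd.DataFrame({'C1': [0, 0], 'C2': [0, 0], 'C3': [0, 0]})
df2 = pd.DataFrame({'C1': [1, 1], 'C4': [1, 1], 'C5': [1, 1]})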

You can use concat and then groupby with axis=1:
s=pd.concat([df1,df2],axis=1)
s.groupby(s.columns.values,axis=1).mean()
Out[116]:
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
A nice alternative from @cᴏʟᴅsᴘᴇᴇᴅ:
s.groupby(level=0,axis=1).mean()
Out[117]:
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
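Note that on newer pandas versions, where column-wise groupby (axis=1) is deprecated, the same result can be obtained by transposing, grouping on the row index, and transposing back - a sketch:
s.T.groupby(level=0).mean().T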

DataFrame.add
df3 = df2.add(df1, fill_value=0)
df3[df1.columns.intersection(df2.columns)] /= 2
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
concat + groupby
pd.concat([df1, df2], axis=1).groupby(axis=1, level=0).mean()
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
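As a quick sanity check, the two approaches can be verified against each other with pd.testing.assert_frame_equal (a sketch reusing the frames above; astype(float) guards against int/float dtype differences):
left = df2.add(df1, fill_value=0)
left[df1.columns.intersection(df2.columns)] /= 2
right = pd.concat([df1, df2], axis=1).T.groupby(level=0).mean().T
pd.testing.assert_frame_equal(left.astype(float), right.astype(float))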

Related

Pandas dataframe intersection with varying groups

I have a large pandas dataframe with varying rows and columns, which looks more or less like:
time id angle ...
0.0 a1 33.67 ...
0.0 b2 35.90 ...
0.0 c3 42.01 ...
0.0 d4 45.00 ...
0.1 a1 12.15 ...
0.1 b2 15.35 ...
0.1 c3 33.12 ...
0.2 a1 65.28 ...
0.2 c3 87.43 ...
0.3 a1 98.85 ...
0.3 c3 100.12 ...
0.4 a1 11.11 ...
0.4 c3 83.22 ...
...
I am trying to aggregate the ids together and then find ids that have time intervals in common. I have tried using pandas groupby and can easily group them by id and get their respective groups with information. How can I then take it a step further and find ids that also have the same timestamps?
Ideally I'd like to return the intersection of certain fixed time intervals (2-3 seconds) for similar ids with the fixed time-interval overlap:
time id angle ...
0.0 a1 33.67 ...
0.1 a1 12.15 ...
0.2 a1 65.28 ...
0.3 a1 98.85 ...
0.0 c3 42.01 ...
0.1 c3 33.12 ...
0.2 c3 87.43 ...
0.3 c3 100.12 ...
Code tried so far:
# create the pandas groupby object, grouped by id
df1 = df.groupby(['id'], as_index=False)
Which outputs:
time id angle ...
(0.0 a1 33.67
...
0.4 a1 11.11)
(0.0 b2 35.90
0.1 b2 15.35)
(0.0 c3 42.01
...
0.4 c3 83.22)
(0.0 d4 45.00)
But I'd like to return only a dataframe where id and time are the same over a fixed interval - in the above example, 0.4 seconds.
Any ideas on a fairly simple way to achieve this with pandas dataframes?
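For reference, the example data above can be reconstructed as follows (a sketch that omits the elided '...' columns):
import pandas as pd

df = pd.DataFrame({
    'time':  [0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4],
    'id':    ['a1','b2','c3','d4','a1','b2','c3','a1','c3','a1','c3','a1','c3'],
    'angle': [33.67, 35.90, 42.01, 45.00, 12.15, 15.35, 33.12, 65.28, 87.43,
              98.85, 100.12, 11.11, 83.22]})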
If you need to filter rows by some interval - e.g. here between 0 and 0.4 - and get all ids which overlap, use boolean indexing with Series.between first, then DataFrame.pivot:
df1 = df[df['time'].between(0, 0.4)].pivot(index='time', columns='id', values='angle')
print (df1)
id a1 b2 c3 d4
time
0.0 33.67 35.90 42.01 45.0
0.1 12.15 15.35 33.12 NaN
0.2 65.28 NaN 87.43 NaN
0.3 98.85 NaN 100.12 NaN
0.4 11.11 NaN 83.22 NaN
There are missing values for the non-overlapping ids, so remove columns with any NaNs by DataFrame.dropna, then reshape to 3 columns by DataFrame.unstack and Series.reset_index:
print (df1.dropna(axis=1))
id a1 c3
time
0.0 33.67 42.01
0.1 12.15 33.12
0.2 65.28 87.43
0.3 98.85 100.12
0.4 11.11 83.22
df2 = df1.dropna(axis=1).unstack().reset_index(name='angle')
print (df2)
id time angle
0 a1 0.0 33.67
1 a1 0.1 12.15
2 a1 0.2 65.28
3 a1 0.3 98.85
4 a1 0.4 11.11
5 c3 0.0 42.01
6 c3 0.1 33.12
7 c3 0.2 87.43
8 c3 0.3 100.12
9 c3 0.4 83.22
There are many ways to define the filter you're asking for:
df.groupby('id').filter(lambda x: len(x) > 4)
# OR
df.groupby('id').filter(lambda x: x['time'].eq(0.4).any())
# OR
df.groupby('id').filter(lambda x: x['time'].max() == 0.4)
Output:
time id angle
0 0.0 a1 33.67
2 0.0 c3 42.01
4 0.1 a1 12.15
6 0.1 c3 33.12
7 0.2 a1 65.28
8 0.2 c3 87.43
9 0.3 a1 98.85
10 0.3 c3 100.12
11 0.4 a1 11.11
12 0.4 c3 83.22
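To generalize this to an arbitrary fixed interval, the filter idea can be wrapped in a small helper - a sketch (the function name ids_covering_interval is hypothetical), keeping only ids that have an observation at every timestamp occurring in the window:
def ids_covering_interval(df, start, end):
    # restrict to the window, then count the distinct timestamps in it
    window = df[df['time'].between(start, end)]
    n_times = window['time'].nunique()
    # keep only groups that cover every timestamp in the window
    return window.groupby('id').filter(lambda g: g['time'].nunique() == n_times)

# e.g. ids_covering_interval(df, 0, 0.4) keeps only the a1 and c3 rows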

Transposing some columns after groupby in pandas

Hey I am struggling with a transformation of a DataFrame:
The initial frame has a format like this:
df = pd.DataFrame({'A': ['A1','A1','A1','A1','A1','A2','A2','A2','A2','A3','A3','A3','A3'],
                   'B': ['B1','B1','B1','B1','B2','B2','B2','B3','B3','B3','B4','B4','B4'],
                   'C': ['C1','C1','C1','C2','C2','C3','C3','C4','C4','C5','C5','C6','C6'],
                   'X': ['a','b','c','a','c','a','b','b','c','a','c','a','c'],
                   'Y': [1,4,4,2,4,1,4,3,1,2,3,4,5]})
A B C X Y
A1 B1 C1 a 1
A1 B1 C1 b 4
A1 B1 C1 c 4
A1 B1 C2 a 2
A1 B2 C2 c 4
A2 B2 C3 a 1
A2 B2 C3 b 4
A2 B3 C4 b 3
A2 B3 C4 c 1
A3 B3 C5 a 2
A3 B4 C5 c 3
A3 B4 C6 a 4
A3 B4 C6 c 5
I have some columns at the beginning on which I want to apply groupby, and then transpose the last two columns.
First, df.groupby(['A','B','C','X']).sum()
Y
A B C X
A1 B1 C1 a 1
b 4
c 4
C2 a 2
B2 C2 c 4
A2 B2 C3 a 1
b 4
B3 C4 b 3
c 1
A3 B3 C5 a 2
B4 C5 c 3
C6 a 4
c 5
and then transpose the X/Y columns and add them horizontally.
A B C a b c
A1 B1 C1 1.0 4.0 4.0
A1 B1 C2 2.0 NaN NaN
A1 B2 C2 NaN NaN 4.0
A2 B2 C3 1.0 4.0 NaN
A2 B3 C4 NaN 3.0 1.0
A3 B3 C5 2.0 NaN NaN
A3 B4 C5 NaN NaN 3.0
A3 B4 C6 4.0 NaN 5.0
Not all groupby rows have all values, so the gaps need to be filled with something like np.nan.
The question is related to this one, but my case is more complicated and I couldn't figure it out.
Use Series.unstack for reshaping:
df1 = (df.groupby(['A','B','C','X'])['Y'].sum()
         .unstack()
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
A B C a b c
0 A1 B1 C1 1.0 4.0 4.0
1 A1 B1 C2 2.0 NaN NaN
2 A1 B2 C2 NaN NaN 4.0
3 A2 B2 C3 1.0 4.0 NaN
4 A2 B3 C4 NaN 3.0 1.0
5 A3 B3 C5 2.0 NaN NaN
6 A3 B4 C5 NaN NaN 3.0
7 A3 B4 C6 4.0 NaN 5.0
Alternative with DataFrame.pivot_table:
df1 = (df.pivot_table(index=['A','B','C'],
                      columns='X',
                      values='Y',
                      aggfunc='sum')
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
A B C a b c
0 A1 B1 C1 1.0 4.0 4.0
1 A1 B1 C2 2.0 NaN NaN
2 A1 B2 C2 NaN NaN 4.0
3 A2 B2 C3 1.0 4.0 NaN
4 A2 B3 C4 NaN 3.0 1.0
5 A3 B3 C5 2.0 NaN NaN
6 A3 B4 C5 NaN NaN 3.0
7 A3 B4 C6 4.0 NaN 5.0
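If zeros are preferable to NaN for the missing combinations, pivot_table also accepts fill_value - a minor variation on the above:
df1 = (df.pivot_table(index=['A','B','C'],
                      columns='X',
                      values='Y',
                      aggfunc='sum',
                      fill_value=0)
         .reset_index()
         .rename_axis(None, axis=1))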

How to split variable sized string-based column into multiple columns in Pandas DataFrame?

I have a pandas DataFrame which is of the form :
A B C D
A1 6 7.5 NaN
A1 4 23.8 <D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>
A2 7 11.9 <D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4>
A3 11 0.8 <D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>
Column D is a raw-string column with multiple categories in each entry. The value of each category is calculated by dividing its last two values. For example, in the 2nd row:
D1 = 12/4 = 3
D2 = 3.5/1 = 3.5
I need to split column D based on its categories and join them to my DataFrame. The problem is that the column is dynamic and can have nearly 35-40 categories within a single entry. For now, all I'm doing is a brute-force approach iterating over all rows, which is very slow for large datasets. Can someone please help me?
EXPECTED OUTCOME
A B C D1 D2 D3 D4 D5
A1 6 7.5 NaN NaN NaN NaN NaN
A1 4 23.8 3.0 3.5 NaN NaN NaN
A2 7 11.9 5.0 NaN 3.4 NaN NaN
A3 11 0.8 NaN 5.0 3.4 5.0 3.4
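For a reproducible setup, the frame above can be constructed as follows (a sketch; the empty D entry in the first row is assumed to be a true NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A3'],
    'B': [6, 4, 7, 11],
    'C': [7.5, 23.8, 11.9, 0.8],
    'D': [np.nan,
          '<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>',
          '<D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4>',
          '<D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>']})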
Use:
d = df['D'].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
d = d.droplevel(1).set_index(0, append=True).astype(float)
d = df.join(d[1].div(d[2]).round(1).unstack()).drop(columns='D')
Details:
Use Series.str.extractall to extract all the capture groups from column D as specified by the regex pattern.
print(d)
0 1 2 # --> capture groups
match
1 0 D1 12 4
1 D2 3.5 1
2 0 D1 10 2
1 D3 13.5 4
3 0 D2 10 2
1 D3 13.5 4
2 D4 10 2
3 D5 13.5 4
Use DataFrame.droplevel + DataFrame.set_index with the optional parameter append=True to drop the unused match level and append the category labels to the dataframe's index.
print(d)
1 2
0
1 D1 12.0 4.0
D2 3.5 1.0
2 D1 10.0 2.0
D3 13.5 4.0
3 D2 10.0 2.0
D3 13.5 4.0
D4 10.0 2.0
D5 13.5 4.0
Use Series.div to divide column 1 by column 2 and Series.round to round the values, then use Series.unstack to reshape, and finally DataFrame.join to join the new dataframe back onto df:
print(d)
A B C D1 D2 D3 D4 D5
0 A1 6 7.5 NaN NaN NaN NaN NaN
1 A1 4 23.8 3.0 3.5 NaN NaN NaN
2 A2 7 11.9 5.0 NaN 3.4 NaN NaN
3 A3 11 0.8 NaN 5.0 3.4 5.0 3.4

Pandas - Remove leading and trailing zeroes from each row

I would like to remove leading and trailing zero values row-wise in my df and then have the remaining values shifted so they are 'aligned'.
Probably best demonstrated with the below example.
Initial df:
index c1 c2 c3 c4 c5 c6 c7 c8
1 0 0 1 2 3 4 5 0
2 0 0 0 1 2 3 4 5
3 0 1 2 3 0 0 0 0
4 0 0 1 2 3 4 0 0
5 1 2 3 4 5 6 7 0
6 0 0 0 1 0 0 4 0
Output:
index c1 c2 c3 c4 c5 c6 c7
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3
4 1 2 3 4
5 1 2 3 4 5 6 7
6 1 0 0 4
Note that there is potential for zeroes within the "string" of true values, so the trim needs to stop at the first non-zero value from each end. Is this possible? Thanks.
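For reference, the example frame can be built like this (assuming 'index' is the index name, as shown):
import pandas as pd

df = pd.DataFrame([[0, 0, 1, 2, 3, 4, 5, 0],
                   [0, 0, 0, 1, 2, 3, 4, 5],
                   [0, 1, 2, 3, 0, 0, 0, 0],
                   [0, 0, 1, 2, 3, 4, 0, 0],
                   [1, 2, 3, 4, 5, 6, 7, 0],
                   [0, 0, 0, 1, 0, 0, 4, 0]],
                  columns=['c1','c2','c3','c4','c5','c6','c7','c8'],
                  index=pd.Index(range(1, 7), name='index'))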
Using np.trim_zeros:
Trim the leading and/or trailing zeros from a 1-D array or sequence.
import numpy as np
import pandas as pd

out = pd.DataFrame([np.trim_zeros(i) for i in df.values], index=df.index)
out.columns = df.columns[:len(out.columns)]
c1 c2 c3 c4 c5 c6 c7
index
1 1 2 3 4.0 5.0 NaN NaN
2 1 2 3 4.0 5.0 NaN NaN
3 1 2 3 NaN NaN NaN NaN
4 1 2 3 4.0 NaN NaN NaN
5 1 2 3 4.0 5.0 6.0 7.0
6 1 0 0 4.0 NaN NaN NaN
You can use this:
df_out = df.apply(lambda x: pd.Series(x.loc[x.mask(x == 0).first_valid_index():
                                            x.mask(x == 0).last_valid_index()].tolist()),
                  axis=1)
df_out = df_out.set_axis(df.columns[df_out.columns], axis=1)
Output:
c1 c2 c3 c4 c5 c6 c7
index
1 1.0 2.0 3.0 4.0 5.0 NaN NaN
2 1.0 2.0 3.0 4.0 5.0 NaN NaN
3 1.0 2.0 3.0 NaN NaN NaN NaN
4 1.0 2.0 3.0 4.0 NaN NaN NaN
5 1.0 2.0 3.0 4.0 5.0 6.0 7.0
6 1.0 0.0 0.0 4.0 NaN NaN NaN

How can I compare a DataFrame with another DataFrame's columns?

I have two different DataFrames. df is the complete one and sample is the one to compare against. Here is the data I have:
sample.tail()
T1 C C1 C2 C3
0 1 5 0.0 7.0 5.0
df.tail()
T1 T2 C C1 C2 C3
0 1 0 5 4.0 6.0 6.0
1 0 0 5 5.0 4.0 6.0
2 0 1 7 5.0 5.0 4.0
3 1 1 0 7.0 5.0 5.0
4 1 1 5 0.0 7.0 5.0
I have selected some columns from the sample df and am trying to find the rows in df that match the sample.
Here is what I did so far, but with no luck:
cols = sample.columns
df = df[df[cols] == sample[cols]]
and i am getting the following error:
ValueError: Can only compare identically-labeled DataFrame objects
Can you kindly help me find a solution for this?
EDIT: Expected output
df.tail()
T1 T2 C C1 C2 C3
0 1 0 5 0.0 7.0 5.0
21 1 1 5 0.0 7.0 5.0
27 1 0 5 0.0 7.0 5.0
34 1 1 5 0.0 7.0 5.0
42 1 1 5 0.0 7.0 5.0
47 1 0 5 0.0 7.0 5.0
51 1 1 5 0.0 7.0 5.0
You can see that all the data matches the sample dataframe except T2. This is the expected output for me.
Thanks
The == comparison raises that error because two DataFrames can only be compared when they are identically labeled (same index and columns). Instead, select the shared columns with pd.Index.intersection and compare row-wise as tuples:
cols = sample.columns.intersection(df.columns)
df[df[cols].apply(tuple, axis=1).isin(sample[cols].apply(tuple, axis=1))]
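An alternative sketch is an inner merge on the shared columns; note that unlike the isin approach above, merge resets the row index:
cols = sample.columns.intersection(df.columns)
df.merge(sample[cols].drop_duplicates(), on=list(cols), how='inner')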
