I am trying to use linear interpolation to fill in the missing values in my DataFrame. The interpolation should be applied separately to each group of rows that share the same id. An example of the DataFrame is below:
mdata:
id  f1   f2   f3   f4   f5
d1  34   3    5    nan  6
d1  nan  4    6    9    7
d1  37   nan  6    10   8
d2  nan  7    8    1    32
d2  12   8    nan  45   56
d2  13   9    11   46   59
Given the above example, I want to apply the interpolation function on the rows with id d1, then d2, and so on. I tried grouping them and then using interpolation, but something is wrong in my code:
mdata = [~mdata['id'].map(mdata.groupby('id').apply(mdata.interpolate(method='linear', limit_direction='both')))]
My desired output should be something like this:
output:
id  f1    f2  f3   f4  f5
d1  34    3   5    9   6
d1  35.5  4   6    9   7
d1  37    5   6    10  8
d2  12    7   8    1   32
d2  12    8   9.5  45  56
d2  13    9   11   46  59
You can define a function:
def f(x):
    return x.interpolate(method='linear', limit_direction='both')

# Finally:
mdata = mdata.groupby('id').apply(f)
Or via an anonymous function:
mdata = (mdata.groupby('id')
              .apply(lambda x: x.interpolate(method='linear', limit_direction='both')))
output of mdata:
id f1 f2 f3 f4 f5
0 d1 34.0 3.0 5.0 9.0 6
1 d1 35.5 4.0 6.0 9.0 7
2 d1 37.0 4.0 6.0 10.0 8
3 d2 12.0 7.0 8.0 1.0 32
4 d2 12.0 8.0 9.5 45.0 56
5 d2 13.0 9.0 11.0 46.0 59
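If you prefer to avoid apply, a minimal sketch using groupby(...).transform on just the numeric columns (assuming they are f1 through f5, as in the example) gives the same fill and leaves the id column untouched:
num_cols = ['f1', 'f2', 'f3', 'f4', 'f5']  # assumed numeric columns from the example
mdata[num_cols] = (mdata.groupby('id')[num_cols]
                        .transform(lambda s: s.interpolate(method='linear',
                                                           limit_direction='both')))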
I have two dfs. Here is the first:
F1_ID  F2_ID  Event_ID  Date
a1     b2     ab4       5/12/21
a2     b3     ab5       5/12/21
b2     a1     ab4       5/12/21
b3     a2     ab5       5/12/21
The second df has a lot more information in it, so I am going to show a filtered version of it:
F1_ID  Event_Name  F2_ID  Event_ID  Date    stats  amount  F1_str_total  F2_str_total
a1     Test        b2     ab1       5/8/21  12     41      13            17
a2     Test1       b3     ab2       5/8/21  16     42      12            54
b2     Test        a1     ab1       5/8/21  -12    -41     0             7
b3     Test1       a2     ab2       5/8/21  -16    -42     87            97
I would like to append the details in df1 to df2 and put None in the missing columns, but I'm not sure how to do this.
Expected Output:
F1_ID  Event_Name  F2_ID  Event_ID  Date     stats  amount  F1_str_total  F2_str_total
a1     Test        b2     ab1       5/8/21   12     41      13            17
a2     Test1       b3     ab2       5/8/21   16     42      12            54
b2     Test        a1     ab1       5/8/21   -12    -41     0             7
b3     Test1       a2     ab2       5/8/21   -16    -42     87            97
a1     None        b2     ab4       5/12/21  None   None    None          None
a2     None        b3     ab5       5/12/21  None   None    None          None
b2     None        a1     ab4       5/12/21  None   None    None          None
b3     None        a2     ab5       5/12/21  None   None    None          None
Simply use pandas.DataFrame.append() (note: append was deprecated in pandas 1.4 and removed in 2.0, so prefer the pandas.concat() option shown further down):
df2 = df2.append(df1, ignore_index=True)
print(df2)
F1_ID Event_Name F2_ID Event_ID Date stats amount F1_str_total \
0 a1 Test b2 ab1 5/8/21 12.0 41.0 13.0
1 a2 Test1 b3 ab2 5/8/21 16.0 42.0 12.0
2 b2 Test a1 ab1 5/8/21 -12.0 -41.0 0.0
3 b3 Test1 a2 ab2 5/8/21 -16.0 -42.0 87.0
4 a1 NaN b2 ab4 5/12/21 NaN NaN NaN
5 a2 NaN b3 ab5 5/12/21 NaN NaN NaN
6 b2 NaN a1 ab4 5/12/21 NaN NaN NaN
7 b3 NaN a2 ab5 5/12/21 NaN NaN NaN
F2_str_total
0 17.0
1 54.0
2 7.0
3 97.0
4 NaN
5 NaN
6 NaN
7 NaN
Or you can use pandas.concat()
df2 = pd.concat([df2, df1], ignore_index=True)
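If you need literal None rather than NaN in the new cells, as in the expected output, one option is to cast to object dtype and replace the NaNs afterwards; a minimal sketch:
df2 = pd.concat([df2, df1], ignore_index=True)
# object dtype is required to hold None; this is one way to do it, not the only one
df2 = df2.astype(object).where(df2.notna(), None)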
I'm trying to fill a dataframe with missing data. I've got these two dataframes:
df1:
df1 = pd.DataFrame({'a':['11','11','11','11','22','22','43','43'], 'x': ['d1', 'd2','d3','d4','d1','d2','d1','d3'], 'b': [1, 2,3,4,5,6,7,8]})
a x b
0 11 d1 1
1 11 d2 2
2 11 d3 3
3 11 d4 4
4 22 d1 5
5 22 d2 6
6 43 d1 7
7 43 d3 8
df2:
df2 = pd.DataFrame({'x': ['d1', 'd2','d3','d4']})
x
0 d1
1 d2
2 d3
3 d4
I've tried doing this:
df1.groupby('a', as_index=False).apply(lambda d: d.merge(df2, on='x', how='right')).reset_index(drop=True)
But my result is:
a x b
0 11 d1 1.0
1 11 d2 2.0
2 11 d3 3.0
3 11 d4 4.0
4 22 d1 5.0
5 22 d2 6.0
6 NaN d3 NaN
7 NaN d4 NaN
8 NaN d2 NaN
9 NaN d4 NaN
10 43 d1 7.0
11 43 d3 8.0
My desired output would be:
a x b
0 11 d1 1.0
1 11 d2 2.0
2 11 d3 3.0
3 11 d4 4.0
4 22 d1 5.0
5 22 d2 6.0
6 22 d3 NaN
7 22 d4 NaN
8 43 d1 7.0
9 43 d2 NaN
10 43 d3 8.0
11 43 d4 NaN
Is it possible to fill the missing data represented by NaN in the rows where I need it? This way I've got d2 and d4 in rows 8 and 9 when I need them in rows 10 and 11.
My dataframe has around 150-200 rows, so I'm trying to keep this as generic as I can.
For performance, groupby with merge is not a good idea. Better to create a MultiIndex with all possible combinations of the a and x columns and use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df1['a'].unique(), df2['x']], names=['a','x'])
df = df1.set_index(['a','x']).reindex(mux).reset_index()
print (df)
a x b
0 11 d1 1.0
1 11 d2 2.0
2 11 d3 3.0
3 11 d4 4.0
4 22 d1 5.0
5 22 d2 6.0
6 22 d3 NaN
7 22 d4 NaN
8 43 d1 7.0
9 43 d2 NaN
10 43 d3 8.0
11 43 d4 NaN
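If the reindex form reads less naturally, an equivalent sketch via a cross merge (requires pandas 1.2+ for how='cross'), assuming df1 and df2 as defined above:
full = df1[['a']].drop_duplicates().merge(df2, how='cross')  # all (a, x) combinations
df = full.merge(df1, on=['a', 'x'], how='left')              # bring in b where it exists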
Then, if you need to set a to NaN where b is missing and move those rows to the end of their a groups, use:
df = (df.assign(tmp = df['b'].isna())
.sort_values(['a','tmp'])
.assign(a = lambda x: x['a'].mask(x['b'].isna()))
.drop('tmp', axis=1))
print (df)
a x b
0 11 d1 1.0
1 11 d2 2.0
2 11 d3 3.0
3 11 d4 4.0
4 22 d1 5.0
5 22 d2 6.0
6 NaN d3 NaN
7 NaN d4 NaN
8 43 d1 7.0
10 43 d3 8.0
9 NaN d2 NaN
11 NaN d4 NaN
I might not fully understand the question, but shouldn't the concatenation be more like:
a x b
0 11 d1 1.0
1 11 d2 2.0
2 11 d3 3.0
3 11 d4 4.0
4 22 d1 5.0
5 22 d2 6.0
6 NaN d3 NaN
7 NaN d4 NaN
8 43 d1 7.0
9 NaN d2 NaN
10 43 d3 8.0
11 NaN d4 NaN
Which is what I get from your code:
import pandas as pd
df1 = pd.DataFrame({'a':['11','11','11','11','22','22','43','43'], 'x': ['d1', 'd2','d3','d4','d1','d2','d1','d3'], 'b': [1, 2,3,4,5,6,7,8]})
df2 = pd.DataFrame({'x': ['d1', 'd2','d3','d4']})
print(df1.groupby('a', as_index=False).apply(lambda d: d.merge(df2, on='x', how='right')).reset_index(drop=True))
Result:
a x b
0 11 d1 1.0
1 11 d2 2.0
2 11 d3 3.0
3 11 d4 4.0
4 22 d1 5.0
5 22 d2 6.0
6 NaN d3 NaN
7 NaN d4 NaN
8 43 d1 7.0
9 NaN d2 NaN
10 43 d3 8.0
11 NaN d4 NaN
**Table 1**

Column_name  Value
K1           13
K2           25
K4           46
H1           56
H3           26
H4           46
H6           56

**Table2**

Column_name  Value
K1           65
K2           31
K3           71
H2           56
I want to merge the tables on the shared Column_name column. The join below only keeps the names present in Table 1 as reference:
left_join = pd.merge(table1, table2,
                     on='Column_name',
                     how='left')
I want my output to be:
Column_name  Value1  Value2
K1           13      65
K2           25      31
K3                   71
K4           46
H1           56
H2                   56
H3           26
H4           46
H6           56
Use an outer join in pandas:
>>> df1 = pd.DataFrame({"Column_name":["K1","K2","K4","H1","H3","H4","H6"],"col2":[13,25,46,56,26,46,56]})
>>> df2 = pd.DataFrame({"Column_name":["K1","K2","K3","H2"],"col3":[65,31,71,56]})
>>> pd.merge(df1, df2, on="Column_name", how="outer")
Column_name col2 col3
0 K1 13.0 65.0
1 K2 25.0 31.0
2 K4 46.0 NaN
3 H1 56.0 NaN
4 H3 26.0 NaN
5 H4 46.0 NaN
6 H6 56.0 NaN
7 K3 NaN 71.0
8 H2 NaN 56.0
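To match the Value1/Value2 headers in the desired output, a small follow-up sketch (col2/col3 are the names from the example frames above; rename as needed):
out = (pd.merge(df1, df2, on="Column_name", how="outer")
         .rename(columns={"col2": "Value1", "col3": "Value2"}))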
This question is similar to a few others about conditional filling. I'm trying to conditionally fill the Numx and Numy columns based on the following rules:
If the value in Code starts with A, I want to keep the values as they are.
If the value in Code starts with B, I want to keep the same initial value and return NaNs in the following rows until the next value in Code.
If the value in Code starts with C, I want to repeat the same first value until the next floats appear in ['Numx','Numy'].
import pandas as pd
import numpy as np
d = ({
'Code' :['A1','A1','','B1','B1','A2','A2','','B2','B2','','A3','A3','A3','','B1','','B4','B4','A2','A2','A1','A1','','B4','B4','C1','C1','','','D1','','B2'],
'Numx' : [30.2,30.5,30.6,35.6,40.2,45.5,46.1,48.1,48.5,42.2,'',30.5,30.6,35.6,40.2,45.5,'',48.1,48.5,42.2, 40.1,48.5,42.2,'',48.5,42.2,43.1,44.1,'','','','',45.1],
'Numy' : [1.9,2.3,2.5,2.2,2.5,3.1,3.4,3.6,3.7,5.4,'',2.3,2.5,2.2,2.5,3.1,'',3.6,3.7,5.4,6.5,8.5,2.2,'',8.5,2.2,2.3,2.5,'','','','',3.2]
})
df = pd.DataFrame(data=d)
Output:
Code Numx Numy
0 A1 30.2 1.9
1 A1 30.5 2.3
2 30.6 2.5
3 B1 35.6 2.2
4 B1 40.2 2.5
5 A2 45.5 3.1
6 A2 46.1 3.4
7 48.1 3.6
8 B2 48.5 3.7
9 B2 42.2 5.4
10 nan nan
11 A3 30.5 2.3
12 A3 30.6 2.5
13 A3 35.6 2.2
14 40.2 2.5
15 B1 45.5 3.1
16 nan nan
17 B4 48.1 3.6
18 B4 48.5 3.7
19 A2 42.2 5.4
20 A2 40.1 6.5
21 A1 48.5 8.5
22 A1 42.2 2.2
23 nan nan
24 B4 48.5 8.5
25 B4 42.2 2.2
26 C1 43.1 2.3
27 C1 44.1 2.5
28 nan nan
29 nan nan
30 D1 nan nan
31 nan nan
32 B2 45.1 3.2
I have used code posted in another question, but it returns too many NaNs:
df['Code_new'] = df['Code'].where(df['Code'].isin(['A1','A2','A3','A4','B1','B2','B4','C1'])).ffill()
df[['Numx','Numy']] = df[['Numx','Numy']].mask(df['Code_new'].duplicated())
mask = df['Code_new'] == 'A1'
df.loc[mask, ['Numx','Numy']] = df.loc[mask, ['Numx','Numy']].ffill()
This produces this output:
Code Numx Numy Code_new
0 A1 30.2 1.9 A1
1 A1 30.2 1.9 A1
2 30.2 1.9 A1
3 B1 35.6 2.2 B1
4 B1 NaN NaN B1
5 A2 45.5 3.1 A2
6 A2 NaN NaN A2
7 NaN NaN A2
8 B2 48.5 3.7 B2
9 B2 NaN NaN B2
10 NaN NaN B2
11 A3 30.5 2.3 A3
12 A3 NaN NaN A3
13 A3 NaN NaN A3
14 NaN NaN A3
15 B1 NaN NaN B1
16 NaN NaN B1
17 B4 48.1 3.6 B4
18 B4 NaN NaN B4
19 A2 NaN NaN A2
20 A2 NaN NaN A2
21 A1 30.2 1.9 A1
22 A1 30.2 1.9 A1
23 30.2 1.9 A1
24 B4 NaN NaN B4
25 B4 NaN NaN B4
26 C1 43.1 2.3 C1
27 C1 NaN NaN C1
28 NaN NaN C1
29 NaN NaN C1
30 D1 NaN NaN C1
31 NaN NaN C1
32 B2 NaN NaN B2
My desired output would be:
Code Numx Numy
0 A1 30.2 1.9
1 A1 30.5 2.3
2 30.6 2.5
3 B1 35.6 2.2
4 B1 nan nan
5 A2 45.5 3.1
6 A2 46.1 3.4
7 48.1 3.6
8 B2 48.5 3.7
9 B2 nan nan
10 nan nan
11 A3 30.5 2.3
12 A3 30.6 2.5
13 A3 35.6 2.2
14 40.2 2.5
15 B1 45.5 3.1
16 nan nan
17 B4 48.1 3.6
18 B4 nan nan
19 A2 42.2 5.4
20 A2 40.1 6.5
21 A1 48.5 8.5
22 A1 42.2 2.2
23 nan nan
24 B4 48.5 8.5
25 B4 nan nan
26 C1 43.1 2.3
27 C1 43.1 2.3
28 43.1 2.3
29 43.1 2.3
30 D1 43.1 2.3
31 43.1 2.3
32 B2 45.1 3.2
I think I need to change this line: mask = df['Code_new'] == 'A1'. The code works, but it only applies to the rows where Code is 'A1'. Is it as easy as adding all the other values here, i.e. A1-A4, B1-B4, C1?
I believe you need:
m2 = df['Code'].isin(['A1','A2','A3','A4','B1','B2','B4','C1'])
# create a helper column: forward fill codes, then make each consecutive run unique
df['Code_new'] = df['Code'].where(m2).ffill()
df['Code_new'] = (df['Code_new'] + '_' +
                  df['Code_new'].ne(df['Code_new'].shift()).cumsum().astype(str))
# keep all values in groups starting with A; mask duplicates in the other groups
m1 = df['Code_new'].str.startswith(('A1','A2','A3','A4')).fillna(False)
df[['Numx','Numy']] = df[['Numx','Numy']].mask(df['Code_new'].duplicated() & ~m1)
# forward fill only within groups starting with C
mask = df['Code_new'].str.startswith('C').fillna(False)
df.loc[mask, ['Numx','Numy']] = df.loc[mask, ['Numx','Numy']].ffill()
print(df)
print (df)
Code Numx Numy Code_new
0 A1 30.2 1.9 A1_1
1 A1 30.5 2.3 A1_1
2 30.6 2.5 A1_1
3 B1 35.6 2.2 B1_2
4 B1 NaN NaN B1_2
5 A2 45.5 3.1 A2_3
6 A2 46.1 3.4 A2_3
7 48.1 3.6 A2_3
8 B2 48.5 3.7 B2_4
9 B2 NaN NaN B2_4
10 NaN NaN B2_4
11 A3 30.5 2.3 A3_5
12 A3 30.6 2.5 A3_5
13 A3 35.6 2.2 A3_5
14 40.2 2.5 A3_5
15 B1 45.5 3.1 B1_6
16 NaN NaN B1_6
17 B4 48.1 3.6 B4_7
18 B4 NaN NaN B4_7
19 A2 42.2 5.4 A2_8
20 A2 40.1 6.5 A2_8
21 A1 48.5 8.5 A1_9
22 A1 42.2 2.2 A1_9
23 A1_9
24 B4 48.5 8.5 B4_10
25 B4 NaN NaN B4_10
26 C1 43.1 2.3 C1_11
27 C1 43.1 2.3 C1_11
28 43.1 2.3 C1_11
29 43.1 2.3 C1_11
30 D1 43.1 2.3 C1_11
31 43.1 2.3 C1_11
32 B2 45.1 3.2 B2_12
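Once the fill logic has run, the Code_new helper column is no longer needed; as a final cleanup step you could drop it:
df = df.drop(columns='Code_new')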