update / merge and update a subset of columns in pandas - python

I have df1:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2:
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
I am trying to update df2 only on the columns in a choice list, choice = ['ColA','ColB'], for rows where ID1 and ID2 both match between the two dataframes.
Expected output:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
So far I have tried:
u = df1.set_index(['ID1','ID2'])
u = u.loc[u.index.dropna()]
v = df2.set_index(['ID1','ID2'])
v= v.loc[v.index.dropna()]
v.update(u)
v.reset_index()
This gives me the correct update (but I lose the IDs which are NaN), and the update also takes place on ColC, which I don't want:
ID1 ID2 ColA ColB ColC
0 45.0 23.0 a 1.0 xyz
1 56.0 24.0 b 2.0 abc
2 23.0 45.0 NaN 0.0 e
3 56.0 29.0 NaN 0.0 NaN
I have also tried merge and combine_first, but I can't figure out the best approach to do this based on the choice list.

Use merge with right join and then combine_first:
choice= ['ColA','ColB']
joined = ['ID1','ID2']
c = choice + joined
df3 = df1[c].merge(df2[c], on=joined, suffixes=('','_'), how='right')[c]
print (df3)
ColA ColB ID1 ID2
0 a 1.0 45.0 23.0
1 b 2.0 56.0 24.0
2 NaN NaN NaN 25.0
3 NaN NaN NaN 26.0
4 NaN NaN 23.0 45.0
5 NaN NaN 45.0 NaN
6 NaN NaN 56.0 29.0
df2[c] = df3.combine_first(df2[c])
print (df2)
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
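Optionally, if you want to be certain that ['ID1','ID2'] uniquely identifies rows in df1 (so the right join cannot duplicate rows of df2), merge's validate parameter can assert that. A minimal sketch, reusing c and joined from above ('1:m' means the keys must be unique on the df1 side):
df3 = df1[c].merge(df2[c], on=joined, suffixes=('','_'),
                   how='right', validate='1:m')[c]   # raises MergeError if df1 keys repeat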

Here's a way:
df1
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
df3 = df1.merge(df2, on=['ID1','ID2'], left_index=True)[['ColA_x','ColB_x']]
df2.loc[df3.index, 'ColA'] = df3['ColA_x']
df2.loc[df3.index, 'ColB'] = df3['ColB_x']
output
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
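Note that some pandas versions do not allow combining on= with left_index=True in a single merge call. If yours complains, a hedged variant that keeps df2's row labels explicitly (continuing with the same df1 and df2) is:
df3 = (df2.reset_index()
          .merge(df1[['ColA','ColB','ID1','ID2']].dropna(subset=['ID1','ID2']),
                 on=['ID1','ID2'], suffixes=('', '_x'))
          .set_index('index'))          # the 'index' column comes from reset_index
df2.loc[df3.index, 'ColA'] = df3['ColA_x']
df2.loc[df3.index, 'ColB'] = df3['ColB_x']
The dropna on the key columns keeps NaN from ever acting as a join key.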

There seems to still be the issue in 0.24 where NaN merges with NaN when they are keys. Prevent this by dropping those records before merging. I'm assuming ['ID1', 'ID2'] is a unique key for df1 (for rows where both are not null):
keys = ['ID1', 'ID2']
updates = ['ColA', 'ColB']
df3 = df2.merge(df1[updates+keys].dropna(subset=keys), on=keys, how='left')
Then resolve the information: take the value from df1 if it's not null, else take the value from df2. In recent versions of pandas the merge output should be ordered so that for duplicated columns _x appears to the left of the corresponding _y column. If not, sort the columns:
#df3 = df3.sort_index(axis=1) # If not sorted _x left of _y
df3.groupby([x[0] for x in df3.columns.str.split('_')], axis=1).apply(lambda x: x.ffill(1).iloc[:, -1])
ColA ColB ColC ID1 ID2
0 a 1.0 NaN 45.0 23.0
1 b 2.0 NaN 56.0 24.0
2 NaN 0.0 fd NaN 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 e 23.0 45.0
5 NaN 0.0 r 45.0 NaN
6 NaN 0.0 NaN 56.0 29.0
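If only the choice columns should be written back into df2 (leaving df2's ColC untouched), a hedged follow-up is to store the resolved frame in a variable, say resolved (the name is arbitrary), and copy just those columns across:
resolved = df3.groupby([x[0] for x in df3.columns.str.split('_')], axis=1).apply(lambda x: x.ffill(1).iloc[:, -1])
df2[updates] = resolved[updates]   # updates == ['ColA', 'ColB'] from above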

Related

Stretching the data frame by date

I have this data frame:
ID date X1 X2 Y
A 16-07-19 58 50 0
A 21-07-19 28 74 0
B 25-07-19 54 65 1
B 27-07-19 50 30 0
B 29-07-19 81 61 0
C 30-07-19 55 29 0
C 31-07-19 97 69 1
C 03-08-19 13 48 1
D 19-07-18 77 27 1
D 20-07-18 68 50 1
D 22-07-18 89 57 1
D 23-07-18 46 70 0
D 26-07-18 56 13 0
E 06-08-19 47 35 1
I want to "stretch" the data by date, from the first row, to the last row of each ID (groupby),
and to fill the missing values with NaN.
For example: ID A has two rows on 16-07-19, and 21-07-19.
After the implementation, (s)he should have 6 rows on 16-21 of July, 2019.
Expected result:
ID date X1 X2 Y
A 16-07-19 58.0 50.0 0.0
A 17-07-19 NaN NaN NaN
A 18-07-19 NaN NaN NaN
A 19-07-19 NaN NaN NaN
A 20-07-19 NaN NaN NaN
A 21-07-19 28.0 74.0 0.0
B 25-07-19 54.0 65.0 1.0
B 26-07-19 NaN NaN NaN
B 27-07-19 50.0 30.0 0.0
B 28-07-19 NaN NaN NaN
B 29-07-19 81.0 61.0 0.0
C 30-07-19 55.0 29.0 0.0
C 31-07-19 97.0 69.0 1.0
C 01-08-19 NaN NaN NaN
C 02-08-19 NaN NaN NaN
C 03-08-19 13.0 48.0 1.0
D 19-07-18 77.0 27.0 1.0
D 20-07-18 68.0 50.0 1.0
D 21-07-18 NaN NaN NaN
D 22-07-18 89.0 57.0 1.0
D 23-07-18 46.0 70.0 0.0
D 24-07-18 NaN NaN NaN
D 25-07-18 NaN NaN NaN
D 26-07-18 56.0 13.0 0.0
E 06-08-19 47.0 35.0 1.0
Use DataFrame.asfreq per group, working with a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
df = df.set_index('date').groupby('ID')[cols].apply(lambda x: x.asfreq('d')).reset_index()
print (df)
ID date X1 X2 Y
0 A 2019-07-16 58.0 50.0 0.0
1 A 2019-07-17 NaN NaN NaN
2 A 2019-07-18 NaN NaN NaN
3 A 2019-07-19 NaN NaN NaN
4 A 2019-07-20 NaN NaN NaN
5 A 2019-07-21 28.0 74.0 0.0
6 B 2019-07-25 54.0 65.0 1.0
7 B 2019-07-26 NaN NaN NaN
8 B 2019-07-27 50.0 30.0 0.0
9 B 2019-07-28 NaN NaN NaN
10 B 2019-07-29 81.0 61.0 0.0
11 C 2019-07-30 55.0 29.0 0.0
12 C 2019-07-31 97.0 69.0 1.0
13 C 2019-08-01 NaN NaN NaN
14 C 2019-08-02 NaN NaN NaN
15 C 2019-08-03 13.0 48.0 1.0
16 D 2018-07-19 77.0 27.0 1.0
17 D 2018-07-20 68.0 50.0 1.0
18 D 2018-07-21 NaN NaN NaN
19 D 2018-07-22 89.0 57.0 1.0
20 D 2018-07-23 46.0 70.0 0.0
21 D 2018-07-24 NaN NaN NaN
22 D 2018-07-25 NaN NaN NaN
23 D 2018-07-26 56.0 13.0 0.0
24 E 2019-08-06 47.0 35.0 1.0
Another idea with DataFrame.reindex per group:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max()))
df = df.set_index('date').groupby('ID')[cols].apply(f).reset_index()
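One caveat with the reindex idea: pd.date_range builds an unnamed index, so after reset_index the date column may come back as level_1. A small hedged fix is to name the range (or rename the column afterwards):
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), name='date'))
# or, after the fact: df = df.rename(columns={'level_1': 'date'})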
Here is my sort jitsu:
def Sort_by_date(dataf):
    # rule1
    dataf['Current'] = pd.to_datetime(dataf.Current)
    dataf = dataf.sort_values(by=['Current'], ascending=True)
    # rule2
    Mask = (dataf['Current'] > '1/1/2020') & (dataf['Current'] <= '12/31/2022')
    dataf = dataf.loc[Mask]
    return dataf
You can modify this code to sort by date for your solution.
Next, let's group the data by ID:
Week1 = WeeklyDF.groupby('ID')
Week1_Report = Week1[['ID','date','X1','X2','Y']]
Week1_Report
Lastly, let's replace the NaN values:
Week1_Report['X1'].fillna("X1 is 0", inplace=True)
Week1_Report['X2'].fillna("X2 is 0", inplace=True)
Week1_Report['Y'].fillna("Y is 0", inplace=True)

How do I transform a very large dataframe to get the count of values in all columns (without using df.stack or df.apply)

I am working with a very large dataframe (~3 million rows) and I need the count of values from multiple columns, grouped by time-related data.
I have tried to stack the columns, but the resulting dataframe was very long and wouldn't fit in memory. Similarly, df.apply gave memory issues.
For example if my sample dataframe is like,
id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,
I have written the following script that does what is needed for a small sample but fails on a large dataframe.
df.set_index(["id", "date"], inplace=True)
df = df.stack(level=[0])
df = df.groupby(level=[0,1]).value_counts()
df = df.unstack(level=[1,2])
I also have a solution via apply but it has the same complications.
The expected result is,
date 1/1/2014 1/4/2014 ... 1/5/2014 1/4/2014 1/7/2014
abc xyz qwe qwe ... xyz xyz abc qwe
id ...
1 4.0 2.0 1.0 NaN ... NaN NaN NaN NaN
2 2.0 NaN NaN 4.0 ... NaN NaN NaN NaN
3 NaN NaN NaN NaN ... 2.0 NaN NaN NaN
4 NaN NaN NaN NaN ... NaN 1.0 1.0 1.0
I am looking for a more optimized version of what I have written.
Thanks for the help !!
You don't want to use stack. Therefore, another solution is to use crosstab on id with each of the date and field columns. Finally, concat them together, groupby the index, and sum. Use a list comprehension on df.columns[2:] to create each crosstab (note: I assume the first 2 columns are id and date, as in your sample):
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum()
Out[497]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 0.0 0.0 0.0 1.0 4.0 0.0 2.0 0.0 0.0 0.0
3 0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
4 0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0
I think showing 0 is better than NaN. However, if you want NaN instead of 0, you just need to chain an additional replace as follows:
pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum().replace({0: np.nan})
Out[501]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4.0 1.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 NaN NaN NaN 1.0 4.0 NaN 2.0 NaN NaN NaN
3 NaN NaN NaN 2.0 NaN NaN NaN NaN 2.0 NaN NaN
4 NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN 1.0 1.0

Pandas Dataframe interpolating in sections delimited by indexes

My sample code is as follow:
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],
         'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],
         'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate the various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil), but the data is scarce and has stretches that provide no information, as in the example. There are also segments where I know for a fact the bus is stopped, such as at dawn, but the information comes through as 'nan' as well.
What I need:
I've been experimenting with dataframe.interpolate() parameters (limit and limit_direction) but came up short. If I set df.interpolate(limit=2), I interpolate not only the data that I need but also data that shouldn't be touched. So I need to interpolate only within sections delimited by a length limit.
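For illustration, a minimal sketch of that unwanted behaviour on the sample df above:
print(df.interpolate(limit=2))
# fills up to two values into every gap, e.g. col1 rows 1-2 become 2.0 and 3.0,
# even though that whole 3-NaN block should stay NaN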
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic I've been trying to apply is basically to find the NaNs, calculate the difference between their indexes, create a new dataframe_temp to interpolate, and only then add it to another, creating a new dataframe_final. But this has become hard to achieve due to the fact that NaN == NaN returns False.
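As an aside on that last point: a float NaN (which is what the columns hold after astype(float)) never compares equal to anything, including itself, so equality checks cannot find it; pd.isna / np.isnan is the way to detect it. A minimal sketch:
import numpy as np
print(np.nan == np.nan)    # False - NaN is never equal to itself
print(np.isnan(np.nan))    # True
print(df['col1'].isna())   # element-wise missing-value mask on a column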
This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
df_fw = df.interpolate(limit=1)
df_bk = df.interpolate(limit=1, limit_direction='backward')
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
def interp(df, limit):
    d = df.notna().rolling(limit + 1).agg(any).fillna(1)
    d = pd.concat({
        i: d.shift(-i).fillna(1)
        for i in range(limit + 1)
    }).prod(level=1)
    return df.interpolate(limit=limit).where(d.astype(bool))
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Can also handle variation in NaN from column to column. Consider a different df
dictx = {'col1':[1,'nan','nan','nan',5,'nan','nan',7,'nan',9,'nan','nan','nan',13],
         'col2':[20,'nan','nan','nan',22,'nan',25,'nan','nan',30,'nan','nan','nan',25],
         'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9,'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0
Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],
         'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],
         'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.
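The same steps can be wrapped in a small helper so the limit is easy to vary; a sketch that simply repackages the code above (the function name is arbitrary):
def interpolate_short_gaps(frame, limit):
    notnull = frame.notna().all(axis=1)
    group = notnull.cumsum()                     # one group per non-null row plus its trailing nulls
    ignore = frame.groupby(group).filter(lambda grp: len(grp) > limit).index
    ignore = frame.loc[~notnull].index.intersection(ignore)   # only null rows get ignored
    keep = frame.index.difference(ignore)
    out = frame.copy()
    out.loc[keep] = out.loc[keep].interpolate()
    return out
print(interpolate_short_gaps(df, limit=2))       # reproduces the output above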
This is a partial answer.
for i in list(df):
    for x in range(len(df[i])):
        if not df[i][x] > -100:
            df[i][x] = 0
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True
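For completeness, the same zero-fill can be done without the explicit loops; a one-line sketch:
df = df.fillna(0)   # replaces every NaN with 0 in a single vectorised step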

Combining dataframes in pandas with the same rows and columns, but different cell values

I'm interested in combining two dataframes in pandas that have the same row indices and column names, but different cell values. See the example below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A':[22,2,np.NaN,np.NaN],
                    'B':[23,4,np.NaN,np.NaN],
                    'C':[24,6,np.NaN,np.NaN],
                    'D':[25,8,np.NaN,np.NaN]})
df2 = pd.DataFrame({'A':[np.NaN,np.NaN,56,100],
                    'B':[np.NaN,np.NaN,58,101],
                    'C':[np.NaN,np.NaN,59,102],
                    'D':[np.NaN,np.NaN,60,103]})
In[6]: print(df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In[7]: print(df2)
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I would like the resulting frame to look like this:
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I have tried different ways of pd.concat and pd.merge but some of the data always gets replaced with NaNs. Any pointers in the right direction would be greatly appreciated.
Use combine_first:
print (df1.combine_first(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or fillna:
print (df1.fillna(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or update:
df1.update(df2)
print (df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Use combine_first
df1.combine_first(df2)

Pandas reorder data

This is probably an easy one using pivot, but since I am not adding the numbers (every row is unique) how should I go about doing this?
Input:
Col1 Col2 Col3
0 123.0 33.0 ABC
1 345.0 39.0 ABC
2 567.0 100.0 ABC
3 123.0 82.0 PQR
4 345.0 10.0 PQR
5 789.0 38.0 PQR
6 890.0 97.0 XYZ
7 345.0 96.0 XYZ
Output:
Col1 ABC PQR XYZ
0 123.0 33.0 82.0 NaN
1 345.0 39.0 10.0 96.0
2 567.0 100.0 NaN NaN
3 789.0 NaN 38.0 NaN
4 890.0 NaN NaN 97.0
And could I get this output in dataframe format, please? Thanks so much for taking a look!
You can use pivot:
print (df.pivot(index='Col1', columns='Col3', values='Col2'))
Col3 ABC PQR XYZ
Col1
123.0 33.0 82.0 NaN
345.0 39.0 10.0 96.0
567.0 100.0 NaN NaN
789.0 NaN 38.0 NaN
890.0 NaN NaN 97.0
Another solution with set_index and unstack:
print (df.set_index(['Col1','Col3'])['Col2'].unstack())
Col3 ABC PQR XYZ
Col1
123.0 33.0 82.0 NaN
345.0 39.0 10.0 96.0
567.0 100.0 NaN NaN
789.0 NaN 38.0 NaN
890.0 NaN NaN 97.0
EDIT by comment:
Need pivot_table:
print (df.pivot_table(index='Col1', columns='Col3', values='Col2'))
Col3 ABC PQR XYZ
Col1
123.0 33.0 82.0 NaN
345.0 39.0 10.0 96.0
567.0 100.0 NaN NaN
789.0 NaN 38.0 NaN
890.0 NaN NaN 97.0
Another, faster solution with groupby, aggregating mean (by default pivot_table also aggregates by mean), converting to a Series with DataFrame.squeeze and finally unstack:
print (df.groupby(['Col1','Col3']).mean().squeeze().unstack())
Col3 ABC PQR XYZ
Col1
123.0 33.0 82.0 NaN
345.0 39.0 10.0 96.0
567.0 100.0 NaN NaN
789.0 NaN 38.0 NaN
890.0 NaN NaN 97.0
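One note on the pivot_table variant: plain pivot raises an error if any (Col1, Col3) pair repeats, whereas pivot_table aggregates the duplicates (mean by default). If averaging is not what you want, the aggregation can be chosen explicitly; a hedged sketch:
print(df.pivot_table(index='Col1', columns='Col3', values='Col2', aggfunc='first'))
# keeps the first Col2 value per (Col1, Col3) pair instead of averaging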
