I have a dataframe in the following format:
A B C D
2020-11-18 64.0 74.0 34.0 57.0
2020-11-20 NaN 71.0 NaN 58.0
2020-11-23 NaN 11.0 NaN NaN
2020-11-25 69.0 NaN NaN 0.0
2020-11-27 NaN 37.0 19.0 NaN
2020-11-29 63.0 NaN NaN 85.0
2020-12-03 NaN 73.0 NaN 49.0
2020-12-10 NaN NaN 32.0 NaN
2020-12-22 52.0 90.0 33.0 24.0
2020-12-23 NaN 96.0 NaN NaN
2020-12-28 78.0 NaN NaN 68.0
2020-12-29 17.0 70.0 NaN 16.0
2021-01-03 51.0 43.0 NaN 66.0
I want to obtain a new dataframe that contains the last non-NaN values for each month in each column:
A B C D
2020-11 63.0 37.0 19.0 85.0
2020-12 17.0 70.0 33.0 16.0
I tried grouping by month and applying a lambda that returns the in-group maximum index like so:
df.loc[df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())]
which yields:
A B C D
2020-11-29 63.0 NaN NaN 85.0
2020-12-29 17.0 70.0 NaN 16.0
This returns the values that appear on the last day of each month, not the last non-NaN values: whenever a column is NaN on that last day, the result is NaN. Instead, I'd like a NaN to appear only if a column has no values at all for that month.
Use GroupBy.last:
df = df.groupby(df.index.to_period('M')).last()
print (df)
A B C D
2020-11 63.0 37.0 19.0 85.0
2020-12 17.0 70.0 33.0 16.0
2021-01 51.0 43.0 NaN 66.0
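GroupBy.last takes the last non-NaN value per column within each group, which is exactly the required behaviour. A minimal runnable sketch with a subset of the sample data (columns A and C only), assuming a DatetimeIndex as in the question:
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2020-11-18', '2020-11-27', '2020-11-29',
                      '2020-12-22', '2020-12-29', '2021-01-03'])
df = pd.DataFrame({'A': [64.0, np.nan, 63.0, 52.0, 17.0, 51.0],
                   'C': [34.0, 19.0, np.nan, 33.0, np.nan, np.nan]},
                  index=idx)

# last() skips NaN, so November's C comes from 2020-11-27 (19.0),
# not from the NaN on 2020-11-29
print (df.groupby(df.index.to_period('M')).last())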
Given a dataframe as follows:
id value1 value2
0 3918703 62.0 64.705882
1 3919144 60.0 60.000000
2 3919534 62.5 30.000000
3 3919559 55.0 55.000000
4 3920438 82.0 82.031250
5 3920463 71.0 71.428571
6 3920502 70.0 69.230769
7 3920535 80.0 40.000000
8 3920674 62.0 62.222222
9 3920856 80.0 79.987176
I want to check whether value2 is within plus or minus 10% of value1, and record the outcome in a new column, results_review.
If it's outside that range, mark the row with 'no' in results_review.
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
How can I do that in Pandas? Thanks in advance for your help.
Use Series.between with DataFrame.loc:
m = df['value2'].between(df['value1'].mul(0.9), df['value1'].mul(1.1))
df.loc[~m, 'results_review'] = 'no'
print(df)
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
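An equivalent one-step construction, assuming (as in the expected output) that in-range rows should stay NaN: start from 'no' everywhere and blank out the rows that pass the check with Series.where.
import pandas as pd

# df as in the question
m = df['value2'].between(df['value1'] * 0.9, df['value1'] * 1.1)
# keep 'no' only where the check fails; passing rows become NaN
df['results_review'] = pd.Series('no', index=df.index).where(~m)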
I have df1:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2:
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
I am trying to update df2 only on the columns in choice = ['ColA','ColB'], for rows where both ID1 and ID2 match across the two dataframes.
Expected output:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
So far I have tried:
u = df1.set_index(['ID1','ID2'])
u = u.loc[u.index.dropna()]
v = df2.set_index(['ID1','ID2'])
v= v.loc[v.index.dropna()]
v.update(u)
v.reset_index()
This gives me the correct update (though I lose the rows whose IDs are NaN), but the update also touches ColC, which I don't want:
ID1 ID2 ColA ColB ColC
0 45.0 23.0 a 1.0 xyz
1 56.0 24.0 b 2.0 abc
2 23.0 45.0 NaN 0.0 e
3 56.0 29.0 NaN 0.0 NaN
I have also tried merge and combine_first, but I can't figure out the best approach based on the choice list.
Use merge with right join and then combine_first:
choice= ['ColA','ColB']
joined = ['ID1','ID2']
c = choice + joined
df3 = df1[c].merge(df2[c], on=joined, suffixes=('','_'), how='right')[c]
print (df3)
ColA ColB ID1 ID2
0 a 1.0 45.0 23.0
1 b 2.0 56.0 24.0
2 NaN NaN NaN 25.0
3 NaN NaN NaN 26.0
4 NaN NaN 23.0 45.0
5 NaN NaN 45.0 NaN
6 NaN NaN 56.0 29.0
df2[c] = df3.combine_first(df2[c])
print (df2)
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
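The trailing [c] after the merge restores the original column order, and combine_first then fills df3's NaNs back from df2, so unmatched rows keep their original values. Because only the columns in c are assigned back, ColC is never touched; note that ColB comes back as float (0.0) since the NaNs introduced by the right join force an upcast.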
Here's a way:
df1
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
# merge on the key columns, carrying df2's original index along so the
# matched rows can be written back with .loc
df3 = (df2.reset_index()
          .merge(df1, on=['ID1', 'ID2'], suffixes=('', '_x'))
          .set_index('index')[['ColA_x', 'ColB_x']])
df2.loc[df3.index, 'ColA'] = df3['ColA_x']
df2.loc[df3.index, 'ColB'] = df3['ColB_x']
Output:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
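This relies on the ['ID1', 'ID2'] pairs being unique in df1: duplicated keys would put repeated labels into df3's index, and the .loc assignment back into df2 would misbehave.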
There still seems to be an issue in 0.24 where NaN merges with NaN when they are keys. Prevent this by dropping those records before merging. I'm assuming ['ID1', 'ID2'] is a unique key for df1 (for rows where both are not null):
keys = ['ID1', 'ID2']
updates = ['ColA', 'ColB']
df3 = df2.merge(df1[updates+keys].dropna(subset=keys), on=keys, how='left')
Then resolve the information: take the value from df1 if it's not null, otherwise take the value from df2. In recent versions of pandas the merge output columns are ordered so that, for duplicated columns, _x appears to the left of _y. If not, sort the columns:
#df3 = df3.sort_index(axis=1) # If not sorted _x left of _y
# group columns by base name (stripping the _x/_y suffix), forward-fill
# across each pair so the _y (df1) value wins when present, then keep
# the last column of each group
df3.groupby([x[0] for x in df3.columns.str.split('_')], axis=1).apply(lambda x: x.ffill(1).iloc[:, -1])
ColA ColB ColC ID1 ID2
0 a 1.0 NaN 45.0 23.0
1 b 2.0 NaN 56.0 24.0
2 NaN 0.0 fd NaN 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 e 23.0 45.0
5 NaN 0.0 r 45.0 NaN
6 NaN 0.0 NaN 56.0 29.0
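On newer pandas, where axis=1 groupby is deprecated, the same resolution can be written per column. A sketch assuming the _x/_y suffixes produced by the merge above (_x from df2, _y from df1):
# prefer df1's value (_y) and fall back to df2's (_x)
for col in updates:
    df3[col] = df3[col + '_y'].fillna(df3[col + '_x'])
df3 = df3[['ColA', 'ColB', 'ColC', 'ID1', 'ID2']]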
My sample code is as follows:
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil), but the data is scarce, with stretches that provide no information, as in the example. There are also segments where I know for a fact the buses are stopped, such as at dawn, but that information comes through as 'nan' as well.
What I need:
I've been experimenting with DataFrame.interpolate() parameters (limit and limit_direction) but have come up short. If I set df.interpolate(limit=2), I interpolate not only the data I need but also data that shouldn't be touched. So I need to interpolate only within gaps bounded by a limit.
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic I've been trying to apply is to find the NaNs, calculate the difference between their indexes, create a temporary dataframe to interpolate, and only then add it into a new final dataframe. But this has been hard to achieve because NaN == NaN returns False.
This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
# forward interpolation: fill at most one NaN after each valid value
df_fw = df.interpolate(limit=1)
# backward interpolation: fill at most one NaN before each valid value
df_bk = df.interpolate(limit=1, limit_direction='backward')
# keep only the cells that both passes could fill
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
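The trick: with limit=1 a gap longer than one row cannot be reached by both the forward and the backward pass, so intersecting the two passes leaves longer runs untouched. Wrapped as a small function (the name and wrapper are mine, not part of the original answer):
def interp_short_gaps(df, limit=1):
    # forward pass fills at most `limit` values at the start of each gap
    fw = df.interpolate(limit=limit)
    # backward pass fills at most `limit` values at the end of each gap
    bk = df.interpolate(limit=limit, limit_direction='backward')
    # keep a filled cell only where the backward pass also reached it;
    # exact for limit=1, only approximate for larger limits (hence "hack")
    return fw.where(bk.notna())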
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
def interp(df, limit):
    # 1.0 where the trailing window of (limit + 1) rows contains at
    # least one non-null value
    d = df.notna().rolling(limit + 1).agg(any).fillna(1)
    # test every window position covering each row; the product is 1
    # only for rows that are not inside a NaN run longer than `limit`
    d = pd.concat({
        i: d.shift(-i).fillna(1)
        for i in range(limit + 1)
    }).prod(level=1)
    # interpolate normally, then mask off rows inside the long runs
    return df.interpolate(limit=limit).where(d.astype(bool))
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
It can also handle NaN runs that vary from column to column. Consider a different df:
dictx = {'col1':[1,'nan','nan','nan',5,'nan','nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan','nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9,'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0
Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.
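Note that limit here bounds the whole group (the leading non-null row plus its trailing NaNs), so a run of g consecutive NaNs is interpolated only when g + 1 <= limit.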
This is a partial answer.
for i in list(df):
    for x in range(len(df[i])):
        # any comparison with NaN is False, so this test catches
        # exactly the NaN cells without an explicit isna check
        if not df[i][x] > -100:
            df.loc[x, i] = 0
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True
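The > -100 trick works because every comparison involving NaN evaluates to False, which is also why the NaN-equality checks in the question failed. The explicit, idiomatic test is pd.isna:
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False: NaN never compares equal to itself
print(np.nan > -100)      # False: any comparison with NaN is False
print(pd.isna(np.nan))    # True: the explicit missing-value test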
I'm interested in combining two dataframes in pandas that have the same row indices and column names, but different cell values. See the example below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A':[22,2,np.NaN,np.NaN],
'B':[23,4,np.NaN,np.NaN],
'C':[24,6,np.NaN,np.NaN],
'D':[25,8,np.NaN,np.NaN]})
df2 = pd.DataFrame({'A':[np.NaN,np.NaN,56,100],
'B':[np.NaN,np.NaN,58,101],
'C':[np.NaN,np.NaN,59,102],
'D':[np.NaN,np.NaN,60,103]})
In[6]: print(df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In[7]: print(df2)
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I would like the resulting frame to look like this:
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I have tried different ways of pd.concat and pd.merge but some of the data always gets replaced with NaNs. Any pointers in the right direction would be greatly appreciated.
Use combine_first:
print (df1.combine_first(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or fillna:
print (df1.fillna(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or update:
df1.update(df2)
print (df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
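All three give the same result here because the non-NaN cells of df1 and df2 never overlap. In general they differ: combine_first and fillna return a new frame and keep df1's existing non-NaN values, while update modifies df1 in place and, with the default overwrite=True, would replace df1's values with any non-NaN value found in df2.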