Pandas: Missing value imputation based on date - python

I have a pandas data-frame which is as follows:
df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103], "val1": [np.nan, 4, np.nan, np.nan, 1, np.nan], "val2": [5, np.nan, np.nan, np.nan, np.nan, 5], "rand": [np.nan, 3, 7, 8, np.nan, 4], "val3": [5, np.nan, np.nan, np.nan, 3, np.nan], "unique_date": [pd.Timestamp(2002, 3, 3), pd.Timestamp(2002, 3, 5), pd.Timestamp(2003, 4, 5), pd.Timestamp(2003, 4, 9), pd.Timestamp(2003, 8, 7), pd.Timestamp(2003, 9, 7)], "end_date": [pd.Timestamp(2005, 3, 3), pd.Timestamp(2003, 4, 7), np.nan, np.nan, pd.Timestamp(2003, 10, 7), np.nan]})
df_first
id val1 val2 rand val3 unique_date end_date
0 102 NaN 5.0 NaN 5.0 2002-03-03 2005-03-03
1 102 4.0 NaN 3.0 NaN 2002-03-05 2003-04-07
2 102 NaN NaN 7.0 NaN 2003-04-05 NaT
3 102 NaN NaN 8.0 NaN 2003-04-09 NaT
4 103 1.0 NaN NaN 3.0 2003-08-07 2003-10-07
5 103 NaN 5.0 4.0 NaN 2003-09-07 NaT
The missing value imputation should be done in a way that there is forward fill of the values that appear in each row from the data-frame that has an end_date value.
The forward fill performs for as long as the unique_date is before the end_date for the same id.
Based on what is said in the last paragraph above, the forward fill should be done per id.
Lastly, the missing value imputation should take place only for certain columns that have a name that has val in it. An important note is that no other columns have that pattern in their name. In case I haven't made myself clear enough, the solution for the above posted data-frames is posted bellow:
id val1 val2 rand val3 unique_date
0 102 NaN 5.0 NaN 5.0 2002-03-03
1 102 4.0 5.0 3.0 5.0 2002-03-05
2 102 4.0 5.0 7.0 5.0 2003-04-05
3 102 NaN 5.0 8.0 5.0 2003-04-09
4 103 1.0 NaN NaN 3.0 2003-08-07
5 103 1.0 5.0 4.0 3.0 2003-08-07
Let me know if you need any further clarification since the whole thing seems rather complicated at first sight.
Looking forward to you answers!

Sorry for the confusing question as well as explanation. At the end I was able to achieve what I wanted in the following way.
df_first = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103],
"val1": [np.nan, 4, np.nan, np.nan, 1, np.nan],
"val2": [5, np.nan, np.nan, np.nan, np.nan, 5],
"val3": [np.nan, 3, np.nan, np.nan, np.nan, 4],
"val4": [5, np.nan, np.nan, np.nan, 3, np.nan],
"rand": [3, np.nan, 1, np.nan, 5, 6],
"unique_date": [pd.Timestamp(2002, 3, 3),
pd.Timestamp(2002, 3, 5),
pd.Timestamp(2003, 4, 5),
pd.Timestamp(2003, 4, 9),
pd.Timestamp(2003, 8, 7),
pd.Timestamp(2003, 9, 7)],
"end_date": [pd.Timestamp(2005, 3, 3),
pd.Timestamp(2003, 4, 7),
np.nan,
np.nan,
pd.Timestamp(2003, 10, 7),
np.nan]})
display(df_first)
indexes = []
columns = df_first.filter(like="val").columns
for column in columns:
indexes.append(df_first.columns.get_loc(column))
elements = df_first.values[:,indexes]
ids = df_first.values[:,df_first.columns.get_loc("id")]
start_dates = df_first.values[:,df_first.columns.get_loc("unique_date")]
end_dates = df_first.values[:,df_first.columns.get_loc("end_date")]
for i in range(len(elements)):
if pd.notnull(end_dates[i]):
not_nan_indexes = np.argwhere(~pd.isnull(elements[i])).ravel()
elements_prop = elements[i,not_nan_indexes]
j = i
while (j < len(elements) and start_dates[j] < end_dates[i] and ids[i] == ids[j]):
elements[j, not_nan_indexes] = elements_prop
j+=1
df_first[columns] = elements
df_first = df_first.drop(columns="end_date")
display(df_first)
Probably the solution is an overkill, but I was not able to find anything pandas specific to achieve what I wanted to.

Related

Substract Two Dataframes by Index and Keep String Columns

I would like to subtract two data frames by indexes:
# importing pandas as pd
import pandas as pd
# Creating the second dataframe
df1 = pd.DataFrame({"Type":['T1', 'T2', 'T3', 'T4', 'T5'],
"A":[10, 11, 7, 8, 5],
"B":[21, 5, 32, 4, 6],
"C":[11, 21, 23, 7, 9],
"D":[1, 5, 3, 8, 6]},
index =["2001", "2002", "2003", "2004", "2005"])
df1
# Creating the first dataframe
df2 = pd.DataFrame({"A":[1, 2, 2, 2],
"B":[3, 2, 4, 3],
"C":[2, 2, 7, 3],
"D":[1, 3, 2, 1]},
index =["2000", "2002", "2003", "2004"])
df2
# Desired
df = pd.DataFrame({"Type":['T1', 'T2', 'T3', 'T4', 'T5'],
"A":[10, 9, 5, 6, 5],
"B":[21, 3, 28, 1, 6],
"C":[11, 19, 16, 4, 9],
"D":[1, 2, 1, 7, 5]},
index =["2001", "2002", "2003", "2004", "2005"])
df
df1.subtract(df2)
However, it returns in some cases NAs, I would like to keep values from the first df1 if not deductable.
You could handle NaN using:
df1.subtract(df2).combine_first(df1).dropna(how='all')
output:
A B C D Type
2001 10.0 21.0 11.0 1.0 T1
2002 9.0 3.0 19.0 2.0 T2
2003 5.0 28.0 16.0 1.0 T3
2004 6.0 1.0 4.0 7.0 T4
2005 5.0 6.0 9.0 6.0 T5
You can use select_dtypes to choose the correct data type, then subtract the reindex data:
(df1.select_dtypes(include='number')
.sub(df2.reindex(df1.index, fill_value=0))
.join(df1.select_dtypes(exclude='number'))
)
Output:
A B C D Type
2001 10 21 11 1 T1
2002 9 3 19 2 T2
2003 5 28 16 1 T3
2004 6 1 4 7 T4
2005 5 6 9 6 T5

is there another alternative than trying to make iloc or loc or lambda work to iterate over rows

import pandas as pd
import numpy as np
dataset = [(2, 4, 6, 43, np.nan, np.nan, np.nan, np.nan),
(10, 12, 14, np.nan, np.nan, np.nan, np.nan, np.nan),
(20, 22, 24, 4, np.nan, np.nan, 4, np.nan),
(10, 12, 14, np.nan, np.nan, np.nan, 4, np.nan),
(10, 12, 14, np.nan, np.nan, np.nan, 4, np.nan),
(20, 22, 24, 60, np.nan, np.nan, 60, 4),
(10, 12, 14, np.nan, np.nan, np.nan, 60, 4),
(10, 12, 14, np.nan, np.nan, np.nan, 60, 4),
(20, 22, 24, 13, np.nan, np.nan, 13, 60),
(10, 12, 14, np.nan, np.nan, np.nan, 13, 60),
(20, 22, 24, 26, np.nan, np.nan, 26, 13),
(28, 30, 32, np.nan, np.nan, np.nan, 26, 13)]
df = pd.DataFrame(dataset, columns = ("A", "B", "C", "D", "1 prev D", "2 prev D", "1 prev D example", "2 prev D example" ))
## 1 prev d example = if D == np.nan look 1 cell above
## 2 prev d example = if 1 prev D example == D, look 1 cell above D, if np.nan, look 1 cell above again
i have done circles with loc and iloc and lambda, and trying to work out the obvious "go to method" for iterating time stamped data
Your comment was not super clear, but if all you want are the last two non-NaN values from column D then use masking to ignore NaN and then slice:
In [4]: df[df["D"].notnull()][-2:]
Out[4]:
A B C D 1 prev D 2 prev D 1 prev D example 2 prev D example
8 20 22 24 13.0 NaN NaN 13.0 60.0
10 20 22 24 26.0 NaN NaN 26.0 13.0
Again, not really sure what the other columns are for or if this is actually what you want.
Please update your original post with an example of your expected output if this answer isn't sufficient.

Grouping data by id, var1 into consecutive dates in python using pandas

I have some data that looks like:
df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
"val": [9, 2, 4, 7, 6, 3, 2],
"dates": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I want group this data into IDs and var1 where the dates are consecutive, if a day is missed I want to start a new record.
For example the final output should be:
df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104], "var1": ['a', 'b', 'b', 'a', 'c'],
"val": [13, 2, 13, 3, 2],
"start_date": [pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)],
"end_date": [pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12)]})
I have tried this a few ways and keep failing, the length of time that something can exist for is unknown and the possible number of var1 can change with each id and with date window as well.
For example I have tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). Then I thought I could take date(min) and date(min)+count_days to get 'start_date' and 'end_date'
s = df_raw_dates.groupby(['id','var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)
I have also tried:
df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'date': ['first', 'last']}).reset_index()
Which gets me closer, but I don't think this deals with the consecutive days problem but instead provides the earliest and latest day which unfortunately isn't something that I can take forward.
EDIT: adding more context
Another approach is:
df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)
which provides a list of ids and dates, but I am getting stuck with finding min max consecutive dates within this new window
Extended example that has a break in the date range for group (102,'a').
df_raw_dates = pd.DataFrame(
{
"id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
"var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
"val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
"dates": [
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 1),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 2),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 5),
pd.Timestamp(2020, 3, 12),
pd.Timestamp(2020, 1, 3),
pd.Timestamp(2020, 1, 7),
pd.Timestamp(2020, 1, 8),
pd.Timestamp(2020, 1, 9),
pd.Timestamp(2020, 1, 21),
pd.Timestamp(2020, 1, 25),
],
}
)
Further example
This is using the anwser below from wwii
import pandas as pd
import collections
df_raw_dates1 = pd.DataFrame(
{
"id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
"var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
"val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
"dates": [
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 22),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 21),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 20),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 19),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18),
pd.Timestamp(2021, 1, 18)
],
}
)
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
# print(g)
eyed, var1 = k
dt = g['dates']
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
filt = g.loc[in_block]
breaks = filt['dates'].diff() != day
groups = breaks.cumsum()
date_groups = g.groupby(groups)
# print(k,groups,groups.any())
# accomodate groups with only one date
if not groups.any():
new_df['id'].append(eyed)
new_df['var1'].append(var1)
new_df['val'].append(g.val.sum())
new_df['start'].append(g.dates.min())
new_df['end'].append(g.dates.max())
continue
for _,date_range in date_groups:
start,end = date_range['dates'].min(), date_range['dates'].max()
val = date_range.val.sum()
new_df['id'].append(eyed)
new_df['var1'].append(var1)
new_df['val'].append(val)
new_df['start'].append(start)
new_df['end'].append(end)
print(pd.DataFrame(new_df))
>>> id var1 val start end
0 100 a 0.0 2021-01-22 2021-01-22
1 100 b 0.0 2021-01-22 2021-01-22
2 100 d 0.0 2021-01-22 2021-01-22
3 105 a 0.0 2021-01-22 2021-01-22
4 105 a 1.0 2021-01-21 2021-01-21
5 105 a 0.0 2021-01-20 2021-01-20
6 105 a 10.0 2021-01-19 2021-01-19
7 105 b 2.0 2021-01-22 2021-01-22
8 105 b 9.0 2021-01-21 2021-01-21
9 105 b 5.0 2021-01-20 2021-01-20
10 105 b 13.0 2021-01-19 2021-01-19
From the above I would have expected the rows 3,4,5,6 to be grouped together and 7,8,9,10 also. I am not sure why this example now breaks?
Not sure what the difference with this example and the extended example above is and why this seems to not work?
I don't have Pandas superpowers so I never try to do groupby one-liners, maybe someday.
Adapting the accepted answer to SO question Find group of consecutive dates in Pandas DataFrame - first group by ['id','var1']; for each group group by consecutive date ranges.
import pandas as pd
sep = "************************************\n"
day = pd.Timedelta('1d')
# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])
for k,g in gb:
print(g)
dt = g['dates']
# find difference in days between rows
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
# create a Series to identify consecutive ranges to group by
# this cumsum trick can be found in many SO answers
filt = g.loc[in_block]
breaks = filt['dates'].diff() != day
groups = breaks.cumsum()
# split into date ranges
date_groups = g.groupby(groups)
for _,date_range in date_groups:
print(date_range)
print(sep)
You can see that the (102,'a') group has been split into two groups.
id var1 val dates
0 102 a 9 2020-01-01
2 102 a 4 2020-01-02
7 102 a 1 2020-01-03
id var1 val dates
8 102 a 2 2020-01-07
9 102 a 3 2020-01-08
10 102 a 4 2020-01-09
Going a bit further: while iterating construct a dictionary to make a new DataFrame with.
import pandas as pd
import collections
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
# print(g)
eyed,var = k
dt = g['dates']
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
filt = g.loc[in_block]
breaks = filt['dates'].diff() != day
groups = breaks.cumsum()
date_groups = g.groupby(groups)
# print(k,groups,groups.any())
# accomodate groups with only one date
if not groups.any():
new_df['id'].append(eyed)
new_df['var1'].append(var)
new_df['val'].append(g.val.mean())
new_df['start'].append(g.dates.min())
new_df['end'].append(g.dates.max())
continue
for _,date_range in date_groups:
start,end = date_range['dates'].min(),date_range['dates'].max()
val = date_range.val.mean()
new_df['id'].append(eyed)
new_df['var1'].append(var)
new_df['val'].append(val)
new_df['start'].append(start)
new_df['end'].append(end)
print(pd.DataFrame(new_df))
>>>
id var1 val start end
0 102 a 4.666667 2020-01-01 2020-01-03
1 102 a 3.000000 2020-01-07 2020-01-09
2 102 b 2.000000 2020-01-01 2020-01-01
3 103 a 3.000000 2020-01-05 2020-01-05
4 103 b 6.500000 2020-01-02 2020-01-03
5 104 c 2.000000 2020-03-12 2020-03-12
6 108 a 99.000000 2020-01-21 2020-01-25
Seems pretty tedious, maybe someone will come along with a less-verbose solution. Maybe some of the operations could be put in functions and .apply or .transform or .pipe could be used making it a little cleaner.
It does not account for ('id','var1') groups that have more than one date but only single date ranges. e.g.
id var1 val dates
11 108 a 99 2020-01-21
12 108 a 99 2020-01-25
You might need to detect if there are any gaps in a datetime Series and use that fact to accommodate.

Sort rows and removing NaN values

I have a dataset looks like below:
state Item_Number
0 AP 1.0, 4.0, 20.0, 2.0, 11.0, 7.0
1 GOA 1.0, 4.0, nan, 2.0, 8.0, nan
2 GU 1.0, 4.0, 13.0, 2.0, 11.0, 7.0
3 KA 1.0, 23.0, nan, nan, 11.0, 7.0
4 MA 1.0, 14.0, 13.0, 2.0, 19.0, 21.0
I want to remove NaN values and sort the rows, as well as convert float to int. After completion the dataset should looks like below:
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21
Another solution using Series.str.split and Series.apply:
df['Item_Number'] = (df.Item_Number.str.split(',')
.apply(lambda x: ', '.join([str(z) for z in sorted([int(float(y)) for y in x if 'nan' not in y])])))
[out]
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21
Use list comprehension with remove missing values by principe NaN != NaN:
df['Item_Number'] = [sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)]) for x in df['Item_Number']]
print (df)
state Item_Number
0 AP [1, 2, 4, 7, 11, 20]
1 GOA [1, 2, 4, 8]
2 GU [1, 2, 4, 7, 11, 13]
3 KA [1, 7, 11, 23]
4 MA [1, 2, 13, 14, 19, 21]
If need strings:
df['Item_Number'] = [' '.join(map(str, sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)]))) for x in df['Item_Number']]
print (df)
state Item_Number
0 AP 1 2 4 7 11 20
1 GOA 1 2 4 8
2 GU 1 2 4 7 11 13
3 KA 1 7 11 23
4 MA 1 2 13 14 19 21

Merging two pandas dataframes by interval

I have two pandas dataframes with following format:
df_ts = pd.DataFrame([
[10, 20, 1, 'id1'],
[11, 22, 5, 'id1'],
[20, 54, 5, 'id2'],
[22, 53, 7, 'id2'],
[15, 24, 8, 'id1'],
[16, 25, 10, 'id1']
], columns = ['x', 'y', 'ts', 'id'])
df_statechange = pd.DataFrame([
['id1', 2, 'ok'],
['id2', 4, 'not ok'],
['id1', 9, 'not ok']
], columns = ['id', 'ts', 'state'])
I am trying to get it to the format, such as:
df_out = pd.DataFrame([
[10, 20, 1, 'id1', None ],
[11, 22, 5, 'id1', 'ok' ],
[20, 54, 5, 'id2', 'not ok'],
[22, 53, 7, 'id2', 'not ok'],
[15, 24, 8, 'id1', 'ok' ],
[16, 25, 10, 'id1', 'not ok']
], columns = ['x', 'y', 'ts', 'id', 'state'])
I understand how to accomplish it iteratively by grouping by id and then iterating through each row and changing status when it appears. Is there a pandas build-in more scalable way of doing this?
Unfortunately pandas merge support only equality joins. See more details at the following thread:
merge pandas dataframes where one value is between two others
if you want to merge by interval you'll need to overcome the issue, for example by adding another filter after the merge:
joined = a.merge(b,on='id')
joined = joined[joined.ts.between(joined.ts1,joined.ts2)]
You can merge pandas data frames on two columns:
pd.merge(df_ts,df_statechange, how='left',on=['id','ts'])
in df_statechange that you shared here there is no common values on ts in both dataframes. Apparently you just copied not complete data frame here. So i got this output:
x y ts id state
0 10 20 1 id1 NaN
1 11 22 5 id1 NaN
2 20 54 5 id2 NaN
3 22 53 7 id2 NaN
4 15 24 8 id1 NaN
5 16 25 10 id1 NaN
But indeed if you have common ts in the data frames it will have your desired output. For example:
df_statechange = pd.DataFrame([
['id1', 5, 'ok'],
['id1', 8, 'ok'],
['id2', 5, 'not ok'],
['id2',7, 'not ok'],
['id1', 9, 'not ok']
], columns = ['id', 'ts', 'state'])
the output:
x y ts id state
0 10 20 1 id1 NaN
1 11 22 5 id1 ok
2 20 54 5 id2 not ok
3 22 53 7 id2 not ok
4 15 24 8 id1 ok
5 16 25 10 id1 NaN

Categories