I am a little new to Python and have a problem like this. I have a dataframe of data from multiple sensors. There are missing (NA) values in the dataset, and they need to be filled according to the rules below.
If the next sensor has data at the same timestamp, fill the gap with the next sensor's data.
If the neighboring sensor has no data either, fill the gap with the average of all available sensors at the same timestamp.
If all sensors are missing data at the same timestamp, use linear interpolation of the sensor's own series to fill the missing values.
Here is some sample data I built.
import pandas as pd
import numpy as np

sensor1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [1,1,1,1,1,1,1,1,1,1], "value": [np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
sensor2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [2,2,2,2,2,2,2,2,2,2], "value": [3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
sensor3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [3,3,3,3,3,3,3,3,3,3], "value": [2,3,4,5,6,7,np.nan,np.nan,7,8]})
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
sensordata = pd.concat([sensor1, sensor2, sensor3]).reset_index(drop=True)
Any help would be appreciated.
With the answer from Christian below, the solution is as follows.
# create the sample data
df1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [1,1,1,1,1,1,1,1,1,1], "value": [np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
df2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [2,2,2,2,2,2,2,2,2,2], "value": [3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
df3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [3,3,3,3,3,3,3,3,3,3], "value": [2,3,4,5,6,7,np.nan,np.nan,7,8]})
df = pd.concat([df1, df2, df3]).reset_index(drop=True)
# pivot dataframe
df = df.pivot(index = 'date', columns ='sensor',values ='value')
# step 1: fill missing values from a specified sensor first (here sensor 3)
selectedsensor = 3
for c in df.columns:
    df[c] = df[c].fillna(df[selectedsensor])
# step 2, use average of all available sensors to fill
df = df.transpose().fillna(df.transpose().mean()).transpose()
# step 3, use interpolate to fill remaining missing values
df = df.interpolate()
# unstack back to the original data format
df = df.reset_index()
df = df.melt(id_vars=['date'],var_name = 'sensor')
#df = df.unstack('sensor').reset_index()
#df = df.rename(columns ={0:'value'})
The final output is as follows:
date sensor value
0 2000-01-01 1 2.0
1 2000-01-02 1 2.0
2 2000-01-03 1 2.0
3 2000-01-04 1 2.0
4 2000-01-05 1 2.0
5 2000-01-06 1 7.0
6 2000-01-07 1 6.0
7 2000-01-08 1 5.0
8 2000-01-09 1 4.0
9 2000-01-10 1 6.0
10 2000-01-01 2 3.0
11 2000-01-02 2 4.0
12 2000-01-03 2 5.0
13 2000-01-04 2 6.0
14 2000-01-05 2 7.0
15 2000-01-06 2 7.0
16 2000-01-07 2 7.0
17 2000-01-08 2 7.0
18 2000-01-09 2 7.0
19 2000-01-10 2 8.0
20 2000-01-01 3 2.0
21 2000-01-02 3 3.0
22 2000-01-03 3 4.0
23 2000-01-04 3 5.0
24 2000-01-05 3 6.0
25 2000-01-06 3 7.0
26 2000-01-07 3 7.0
27 2000-01-08 3 7.0
28 2000-01-09 3 7.0
29 2000-01-10 3 8.0
You can do the following:
Your dataset, pivoted:
df = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor1":[np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6], "sensor2":[3,4,5,6,7,np.nan,np.nan,np.nan,7,8], "sensor3":[2,3,4,5,6,7,np.nan,np.nan,7,8]}).set_index('date')
1) This is fillna with method='bfill' and limit=1 along axis=1 (in newer pandas, df.bfill(limit=1, axis=1)):
df.fillna(method='bfill',limit=1,axis=1)
2) This is fillna with the mean along axis 1. That isn't directly implemented, but we can work around it by transposing:
df.transpose().fillna(df.transpose().mean()).transpose()
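If you prefer to avoid the double transpose, a row-by-row equivalent is below (a sketch; DataFrame.apply with axis=1 is slower on wide frames, but it reads more directly):
# fill each row's NaNs with that row's mean across the available sensors
df.apply(lambda row: row.fillna(row.mean()), axis=1)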
3) This is just interpolate:
df.interpolate()
Bonus:
This got a bit uglier, since I had to apply it column by column, but here is one selecting sensor 3 to fill:
for c in df.columns:
    df[c] = df[c].fillna(df["sensor3"])
df
I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA': ['A','A','A','A','A','B','B','B','B'],
                   'Date': ['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
                   'Mark': [1,2,3,4,5,1,2,3,3]})
print(df)
Based on this data frame, I wanted the Mark from the previous year. I managed to acquire the maximum Mark per COTA with .max(), but I wanted the last one; I thought I could get it with .last(), but it didn't work.
Here is an example of my code:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates are dd/mm/yyyy
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print(df)
COTA Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-09-10 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
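A minimal sketch of one way to do it, assuming the goal is the chronologically last Mark per (COTA, year): .last() simply returns whichever row comes last within each group, so sort by Date first; the rest mirrors the .max() version above, and 'Last_MarkLastYear' is just an illustrative column name.
# sort by date within each group, then take the chronologically last Mark
s1 = (df.sort_values('Date')
        .groupby(['COTA', 'LastYear'])['Mark']
        .last())
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Last_MarkLastYear'), on=['COTA', 'LastYear'])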
This isn't a duplicate. I already referred to this post_1 and post_2.
My question is different and not about the agg function. It is about keeping the grouped-by column in the output during a ffill operation. The code works fine otherwise; I am sharing the full code so you get the idea. The problem is in the commented line, so look out for that line below.
I have a dataframe like the one given below:
import pandas as pd
import numpy as np  # used further below

df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val': [5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What this code (written with the help of Jezrael from the forum) does is add missing dates based on a threshold value. The only issue is that I don't see the grouped-by column in the output.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
         .groupby('subject_id')
         .resample('d')
         .last()
         .index
         .to_frame(index=False))
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
df2 = df2.groupby(df2['subject_id']).ffill()  # problem is here
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
As shown in the code above, I tried the approaches below:
df2 = df2.groupby(df2['subject_id']).ffill() # doesn't help
df2 = df2.groupby(df2['subject_id']).ffill().reset_index() # doesn't help
df2 = df2.groupby('subject_id',as_index=False).ffill() # doesn't help
The output was incorrect: the subject_id column was missing.
I expect my output to have the subject_id column as well.
Here are 2 possible solutions. ffill inside a groupby excludes the grouping key from its output, so either specify all the other columns in a list after the groupby and assign back:
cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill()  # assign back, keeping subject_id intact
Or create an index from the subject_id column and group by that index:
# newer pandas versions
df2 = df2.set_index('subject_id').groupby('subject_id').ffill().reset_index()
# older pandas versions
df2 = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print(df2)
subject_id date time_1 val day month count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 4.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 4.0 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 4.0 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 4.0 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 5.0 1.0
33 1 2173-05-05 2173-05-05 13:37:00 1 5 5.0 1.0
95 1 2173-07-06 2173-07-06 13:39:00 6 6 7.0 1.0
96 1 2173-07-07 2173-07-07 13:39:00 6 7 7.0 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 7.0 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 4.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8 9 4.0 NaN
100 2 2173-04-10 2173-04-10 22:00:00 8 10 4.0 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 4.0 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 4.0 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 4.0 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 4.0 1.0
I have the code below:
import pandas as pd
import datetime
df=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
df["date"]=pd.to_datetime(df["date"])
df['date'] = df.date.apply(lambda x: datetime.datetime.strftime(x,'%b')) # SHOWS date as MONTH
pvt_enroll=df.pivot_table(index='site', columns="date", values = 'baseline', aggfunc = {'baseline' : 'count'}, fill_value=0, margins=True) # Pivot_Table with enrollment by SITE by MONTH
pvt_enroll.to_csv("pivot_test.csv")
table_enroll_site_month = pd.read_csv('pivot_test.csv', encoding='latin-1')
table_enroll_site_month.rename(columns={'site':'Study Site'}, inplace=True)
table_enroll_site_month
Study Site Apr Jul Jun May All
0 A 5.0 0.0 8.0 4.0 17.0
1 B 9.0 0.0 11.0 5.0 25.0
2 C 6.0 1.0 3.0 20.0 30.0
3 D 5.0 0.0 3.0 2.0 10.0
4 E 5.0 0.0 5.0 0.0 10.0
5 All 30.0 1.0 30.0 31.0 92.0
And I wonder how to:
1. Display the months with the year, as Apr16 Jul16 Jun16 May16.
2. Get the same table without running the step pvt_enroll.to_csv("pivot_test.csv"). I mean, can I get the same result without needing to save to a .csv file first?
I think that by using %b%y you can get the 'Apr16' etc. format.
I tried with the following code, without saving to .csv; resetting the pivot table's index takes the place of the CSV round-trip.
import pandas as pd
from datetime import datetime
df=pd.read_csv("demo.csv")
df["date"]=pd.to_datetime(df["date"])
df['date'] = df['date'].apply(lambda x: datetime.strftime(x,'%b%y'))
pvt_enroll=df.pivot_table(index='site', columns="date", values = 'baseline', aggfunc = {'baseline' : 'count'}, fill_value=0, margins=True) # Pivot_Table with enrollment by SITE by MONTH
pvt_enroll.reset_index(inplace=True)
pvt_enroll.rename(columns={'site':'Study Site'}, inplace=True)
print(pvt_enroll)
And I got the output as follows
date Study Site Apr16 Jul16 Jun16 May16 All
0 A 5 0 8 4 17
1 B 9 0 11 5 25
2 C 6 1 3 20 30
3 D 5 0 3 2 10
4 E 5 0 5 0 10
5 All 30 1 30 31 92
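A small cosmetic note: the stray date at the left of the printed header is the pivot table's columns name. If you want it gone, one line clears it before printing (this is the standard pandas attribute, not anything specific to this data):
pvt_enroll.columns.name = None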
import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-10', '2015-01-11', '2015-01-12'], 'a': [1,2,3,4]})
df2 = pd.DataFrame({'date': ['2015-01-01', '2015-01-05', '2015-01-11'], 'b': [10,20,30]})
df = df1.merge(df2, on=['date'], how='outer')
df = df.sort_values('date')
print(df)
"like magnetic thing" may not be a good expression in title. I will explain below.
I want record from df2 to match the first record of df1 which date is greater or equals df2's. For example, I want df2's '2015-01-05' to match df1's '2015-01-10'.
I cannot achieve it by simply merging them in inner, outer, left way. Though, the above result is very close to what I want.
a date b
0 1.0 2015-01-01 10.0
4 NaN 2015-01-05 20.0
1 2.0 2015-01-10 NaN
2 3.0 2015-01-11 30.0
3 4.0 2015-01-12 NaN
How can I achieve the following, either from what I have done or in some other way from scratch?
a date b
0 1.0 2015-01-01 10.0
1 2.0 2015-01-10 20.0
2 3.0 2015-01-11 30.0
3 4.0 2015-01-12 NaN
Making sure your dates are dates:
df1.date = pd.to_datetime(df1.date)
df2.date = pd.to_datetime(df2.date)
numpy
np.searchsorted
# positions of the first df1 date at or after each df2 date
ilocs = df1.date.values.searchsorted(df2.date.values)
# write each df2 'b' onto the matching df1 row
df1.loc[df1.index[ilocs], 'b'] = df2.b.values
df1
a date b
0 1 2015-01-01 10.0
1 2 2015-01-10 20.0
2 3 2015-01-11 30.0
3 4 2015-01-12 NaN
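One assumption in the searchsorted line above is that every df2 date has a df1 date at or after it; otherwise searchsorted returns len(df1) and the .loc lookup fails. A sketch of a guard for that edge case:
ilocs = df1.date.values.searchsorted(df2.date.values)
valid = ilocs < len(df1)  # drop df2 rows with no df1 date at or after them
df1.loc[df1.index[ilocs[valid]], 'b'] = df2.b.values[valid]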
pandas
pd.merge_asof gets you really close
pd.merge_asof(df1, df2)
a date b
0 1 2015-01-01 10
1 2 2015-01-10 20
2 3 2015-01-11 30
3 4 2015-01-12 30
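If you need the exact desired output (NaN in the last row rather than the carried-forward 30), one possible sketch, starting from fresh df1/df2 before the numpy assignment above, maps each df2 row forward onto the first df1 date at or after it and joins that mapping back onto df1. Here target is just a hypothetical helper column name, and this assumes a pandas version with merge_asof's direction parameter:
# first df1 date at or after each df2 date
mapping = pd.merge_asof(df2, df1[['date']].assign(target=df1['date']),
                        on='date', direction='forward')
# attach each b to its matched df1 date
out = df1.merge(mapping[['target', 'b']].rename(columns={'target': 'date'}),
                on='date', how='left')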
I would like to get a count of how many of the previous 5 values in df['A'] are < the current value in df['A'] and are also >= df2['A']. I am trying to avoid looping over every row and column because I'd like to apply this to a larger data set.
Given this...
import pandas as pd

list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1, index=pd.date_range('2000-1-1', periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1 - .05))
I would like to return this (solved in Excel with COUNTIFS)...
The line below achieves the first part (thanks Alexander), and Divakar and DSM have also weighed in previously (here and here).
df3 = pd.DataFrame(df.rolling(center=False, window=6).apply(lambda rollwin: (rollwin[:-1] < rollwin[-1]).sum()))
But I am unable to add the comparison to df2. Please help.
FOLLOW UP on 10/27/16:
How would I write the lambda above as a standard function?
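A minimal sketch of that lambda as a named function, passing raw=True so the window always arrives as a plain numpy array:
def count_below_last(rollwin):
    # count how many of the first five values are strictly below the last one
    return (rollwin[:-1] < rollwin[-1]).sum()

df3 = pd.DataFrame(df.rolling(window=6).apply(count_below_last, raw=True))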
10/28/16:
See below. Taking column 'A' from both df and df2, I am trying to count how many of the previous 5 values from df['A'] fall between the current df2['A'] and df['A']. Said differently: how many values from each orange box fall between the yellow low-high range?
UPDATE: different list1 data produces an incorrect df3...
list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[26,108],[25,102],[26,106],[25,111],[22,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1-.05))
df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum()))
df
Out[9]:
A B
2000-01-01 21 101
2000-01-02 22 110
2000-01-03 25 113
2000-01-04 24 112
2000-01-05 21 109
2000-01-06 26 108
2000-01-07 25 102
2000-01-08 26 106
2000-01-09 25 111
2000-01-10 22 110
df3
Out[8]:
A B
2000-01-01 NaN NaN
2000-01-02 NaN NaN
2000-01-03 NaN NaN
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 1.0 0.0
2000-01-07 2.0 0.0
2000-01-08 3.0 1.0
2000-01-09 2.0 3.0
2000-01-10 1.0 3.0
EXCEL EXAMPLES (11/14): see below; I am trying to count how many numbers in the blue box fall within the range highlighted in orange.
list1 = [[21,50,101],[22,52,110],[25,49,113],[24,49,112],[21,55,109],[28,54,108],[30,57,102],[26,56,106],[25,58,111],[24,60,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('ABC'))
print(df)
I believe this matches your new screenshot "Given Data".
A B C
2000-01-01 21 50 101
2000-01-02 22 52 110
2000-01-03 25 49 113
2000-01-04 24 49 112
2000-01-05 21 55 109
2000-01-06 28 54 108
2000-01-07 30 57 102
2000-01-08 26 56 106
2000-01-09 25 58 111
2000-01-10 24 60 110
and the same function:
print(pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum())))
gives your desired output "Desired outcome":
A B C
2000-01-01 nan nan nan
2000-01-02 nan nan nan
2000-01-03 nan nan nan
2000-01-04 nan nan nan
2000-01-05 nan nan nan
2000-01-06 0 1 0
2000-01-07 0 1 0
2000-01-08 1 2 1
2000-01-09 1 2 3
2000-01-10 0 2 3
import pandas as pd

list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1, index=pd.date_range('2000-1-1', periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1 - .05))

window = 6
results = []
for i in range(len(df) - window + 1):
    slice_df1 = df.iloc[i:i + window]
    slice_df2 = df2.iloc[i:i + window]
    compare1 = slice_df1['A'].iloc[-1]
    compare2 = slice_df2['A'].iloc[-1]
    a = slice_df1.iloc[:-1]['A'].between(compare2, compare1)  # Series have a between method
    results.append(a.sum())
df_res = pd.DataFrame(data=results, index=df.index[window-1:], columns=['countifs'])
df_res = df_res.reindex(df.index, fill_value=0.0)
print(df_res)
which yields:
countifs
2000-01-01 0.0000
2000-01-02 0.0000
2000-01-03 0.0000
2000-01-04 0.0000
2000-01-05 0.0000
2000-01-06 0.0000
2000-01-07 0.0000
2000-01-08 1.0000
2000-01-09 1.0000
2000-01-10 0.0000
BUT seeing that there is a logical relationship between your upper and lower bound (value and value - 5%), this will perhaps be what you want:
import numpy as np

df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: sum(np.logical_and(
            rollwin[-1]*0.95 <= rollwin[:-1],
            rollwin[:-1] < rollwin[-1]))))
and if you prefer the pd.Series.between() approach (note that between() is inclusive at both ends by default, whereas the version above uses a strict < at the upper bound):
df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum()))