There are a lot of stations in csv file, I don't know how to use loop to count the number of nan of every station. There is I got so far, count one by one. Can someone help me please, thank you in advance.
station1= train_df[train_df['station'] == 28079004]
station1 = station1[['date', 'O_3']]
count_nan = len(station1) - station1.count()
print(count_nan)
I think need create index by station column with set_index, filter columns for check missing values and last count them by sum:
train_df = pd.DataFrame({'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'date':pd.date_range('2015-01-01', periods=6),
'O_3':[np.nan,3,np.nan,9,2,np.nan],
'station':[28079004] * 2 + [28079005] * 4})
print (train_df)
B C date O_3 station
0 4 7 2015-01-01 NaN 28079004
1 5 8 2015-01-02 3.0 28079004
2 4 9 2015-01-03 NaN 28079005
3 5 4 2015-01-04 9.0 28079005
4 5 2 2015-01-05 2.0 28079005
5 4 3 2015-01-06 NaN 28079005
df = train_df.set_index('station')[['date', 'O_3']].isnull().sum(level=0).astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
Another solution:
df = train_df[['date', 'O_3']].isnull().groupby(train_df['station']).sum().astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
Although jez already answered and that answer is probably better here. This is how a groupby would look like:
import pandas as pd
import numpy as np
np.random.seed(444)
n = 10
train_df = pd.DataFrame({
'station': np.random.choice(np.arange(28079004,28079008), size=n),
'date': pd.date_range('2018-01-01', periods=n),
'O_3': np.random.choice([np.nan,1], size=n)
})
print(train_df)
s = train_df.groupby('station')['O_3'].apply(lambda x: x.isna().sum())
print(s)
prints:
station date O_3
0 28079007 2018-01-01 NaN
1 28079004 2018-01-02 1.0
2 28079007 2018-01-03 NaN
3 28079004 2018-01-04 NaN
4 28079007 2018-01-05 NaN
5 28079004 2018-01-06 1.0
6 28079007 2018-01-07 NaN
7 28079004 2018-01-08 NaN
8 28079006 2018-01-09 NaN
9 28079007 2018-01-10 1.0
And the output (s):
station
28079004 2
28079006 1
28079007 4
Related
I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
based on this data frame I wanted the MARK from the previous year, I managed to acquire the maximum COTA but I wanted the last one, I used .max() and I thought I could get it with .last() but it didn't work.
follow the example of my code.
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['Found', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['Found', 'LastYear'])
print (df)
Found Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year
This isn't duplicate. I already referred this post_1 and post_2
My question is different and not about agg function. It is about displaying grouped by column as well during ffill operation. Though the code works fine, just sharing the full code for you to get an idea. Problem is in the commented line. look out for that line below.
I have a dataframe like as given below
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173- 04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What this code with the help of Jezrael from forum does is add missing dates based on threshold value. Only issue is,I don't see the grouped by column during output
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
df2 = df2.groupby(df2['subject_id']).ffill() # problem is here #here is the problem
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
As shown in code above, I tried the below approaches
df2 = df2.groupby(df2['subject_id']).ffill() # doesn't help
df2 = df2.groupby(df2['subject_id']).ffill().reset_index() # doesn't help
df2 = df2.groupby('subject_id',as_index=False).ffill() # doesn't help
Incorrect output without subject_id
I expect my output to have subject_id column as well
Here are 2 possible solutions - specify all columns in list after groupby and assign back:
cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill() # problem is here #here is the problem
Or create index by subject_id column and grouping by index:
#newer pandas versions
df2 = df2.set_index('subject_id').groupby('subject_id').ffill().reset_index()
#oldier pandas versions
df2 = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day month count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 4.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 4.0 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 4.0 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 4.0 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 5.0 1.0
33 1 2173-05-05 2173-05-05 13:37:00 1 5 5.0 1.0
95 1 2173-07-06 2173-07-06 13:39:00 6 6 7.0 1.0
96 1 2173-07-07 2173-07-07 13:39:00 6 7 7.0 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 7.0 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 4.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8 9 4.0 NaN
100 2 2173-04-10 2173-04-10 22:00:00 8 10 4.0 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 4.0 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 4.0 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 4.0 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 4.0 1.0
I have a dataframe like as shown below
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05
12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06
13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09
22:00:00','2173-04-11 04:00:00','2173- 04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
As you can see from the dataframe above that there are few missing dates in between. I would like to create new records for those dates and fill in values from the immediate previous row
def dt(df):
r = pd.date_range(start=df.date.min(), end=df.date.max())
df.set_index('date').reindex(r)
new_df = df.groupby(['subject_id','month']).apply(dt)
This generates all the dates. I only want to find the missing date within the input date interval for each subject for each month
I did try the code from this related post. Though it helped me but doesn't get me the expected output for this updated/new requirement. As we do left join, it copies all records. I can't do inner join either because it will drop non-match column. I want a mix of left join and inner join
Currently it creates new records for all 365 days in a year which I don't want. something like below. This is not expected
I only wish to add missing dates between input date interval as shown below. For example subject = 1, in the 4th month has records from 3rd and 5th. but 4th is missing. So we add record for 4th day alone. We don't need 6th,7th etc unlike current output. Similarly in 7th month, record for 7th day missing. so we just add a new record for that
I expect my output to be like as shown below
Here is problem you need resample for append new days, so it is necessary.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
print (df1)
subject_id date
0 1 2173-04-03
1 1 2173-04-04
2 1 2173-04-05
3 1 2173-04-06
4 1 2173-04-07
.. ... ...
99 2 2173-04-10
100 2 2173-04-11
101 2 2173-04-12
102 2 2173-04-13
103 2 2173-04-14
[104 rows x 2 columns]
Idea is remove unnecessary missing rows - you can create threshold for minimum consecutive mising values (here 5) and remove rows (created new column fro easy test):
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5.0 3.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5.0 3.0 NaN
2 1 2173-04-04 NaT NaN NaN 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5.0 5.0 NaN
32 1 2173-05-04 2173-05-04 13:14:00 5.0 4.0 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1.0 5.0 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6.0 6.0 NaN
96 1 2173-07-07 NaT NaN NaN 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5.0 8.0 NaN
98 2 2173-04-08 2173-04-08 16:00:00 5.0 8.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8.0 9.0 NaN
100 2 2173-04-10 NaT NaN NaN 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3.0 11.0 NaN
102 2 2173-04-12 NaT NaN NaN 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4.0 13.0 NaN
104 2 2173-04-14 2173-04-14 08:00:00 6.0 14.0 NaN
Last use previous solution:
df2 = df2.groupby(df['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1 5 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6 6 NaN
96 1 2173-07-07 2173-07-07 13:39:00 6 7 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 1.0
99 2 2173-04-09 2173-04-09 22:00:00 8 9 1.0
100 2 2173-04-10 2173-04-10 22:00:00 8 10 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 1.0
EDIT: Solution with reindex for each month:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df['month'] = df['time_1'].dt.month
df1 = (df.drop_duplicates(['date','subject_id'])
.set_index('date')
.groupby(['subject_id', 'month'])
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max())))
.rename_axis(('subject_id','month','date'))
.index
.to_frame(index=False)
)
print (df1)
subject_id month date
0 1 4 2173-04-03
1 1 4 2173-04-04
2 1 4 2173-04-05
3 1 5 2173-05-04
4 1 5 2173-05-05
5 1 7 2173-07-06
6 1 7 2173-07-07
7 1 7 2173-07-08
8 2 4 2173-04-08
9 2 4 2173-04-09
10 2 4 2173-04-10
11 2 4 2173-04-11
12 2 4 2173-04-12
13 2 4 2173-04-13
14 2 4 2173-04-14
df2 = df1.merge(df, how='left')
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id month date time_1 val day
0 1 4 2173-04-03 2173-04-03 12:35:00 5 3
1 1 4 2173-04-03 2173-04-03 12:50:00 5 3
2 1 4 2173-04-04 2173-04-04 12:50:00 5 4
3 1 4 2173-04-05 2173-04-05 12:59:00 5 5
4 1 5 2173-05-04 2173-05-04 13:14:00 5 4
5 1 5 2173-05-05 2173-05-05 13:37:00 1 5
6 1 7 2173-07-06 2173-07-06 13:39:00 6 6
7 1 7 2173-07-07 2173-07-07 13:39:00 6 7
8 1 7 2173-07-08 2173-07-08 11:30:00 5 8
9 2 4 2173-04-08 2173-04-08 16:00:00 5 8
10 2 4 2173-04-09 2173-04-09 22:00:00 8 9
11 2 4 2173-04-10 2173-04-10 22:00:00 8 10
12 2 4 2173-04-11 2173-04-11 04:00:00 3 11
13 2 4 2173-04-12 2173-04-12 04:00:00 3 12
14 2 4 2173-04-13 2173-04-13 04:30:00 4 13
15 2 4 2173-04-14 2173-04-14 08:00:00 6 14
Does this help?
def fill_dates(df):
result = pd.DataFrame()
for i,row in df.iterrows():
if i == 0:
result = result.append(row)
else:
start_date = result.iloc[-1]['time_1']
end_date = row['time_1']
# print(start_date, end_date)
delta = (end_date - start_date).days
# print(delta)
if delta > 0 and start_date.month == end_date.month:
for j in range(delta):
day = start_date + timedelta(days=j+1)
new_row = result.iloc[-1].copy()
new_row['time_1'] = day
new_row['remarks'] = 'added'
if new_row['time_1'].date() != row['time_1'].date():
result = result.append(new_row)
result = result.append(row)
else:
result = result.append(row)
result.reset_index(inplace = True)
return result
I have a dataframe df, which can be created with this:
import pandas as pd
import datetime
#create the dates to make into columns
datestart=datetime.date(2018,1,1)
dateend=datetime.date(2018,1,5)
newcols=pd.date_range(datestart,dateend).date
#create the test data
d={'name':['a','b','c','d'],'earlydate': [datetime.date(2018,1,1),datetime.date(2018,1,3),datetime.date(2018,1,4),datetime.date(2018,1,5)]}
#create initial test dataframe
df=pd.DataFrame(data=d)
#create the new dataframe with empty newcols
df=pd.concat([df,pd.DataFrame(columns=newcols)])
AND Looks like this:
df
Out[17]:
name earlydate 2018-01-01 ... 2018-01-03 2018-01-04 2018-01-05
0 a 2018-01-01 NaN ... NaN NaN NaN
1 b 2018-01-03 NaN ... NaN NaN NaN
2 c 2018-01-04 NaN ... NaN NaN NaN
3 d 2018-01-05 NaN ... NaN NaN NaN
[4 rows x 7 columns]
What I am looking to do is fill all of the empty newcols with the difference in days between the newcol name and the earlydate (newcolname(which is a date)-earlydate(which is a date). I want to do this dataframe 'wise', and not use a function,lambda,apply, or a for loop. I am fairly certain this should be able to be done dataframe wise, not column or row wise.
The result/expected ending df can be created with this:
dresultdata={'name':['a','b','c','d'],
'earlydate': [datetime.date(2018,1,1),datetime.date(2018,1,3),datetime.date(2018,1,4),datetime.date(2018,1,5)],
datetime.date(2018,1,1):[0,-2,-3,-4], #this is the difference in days between the column name and the earlydate
datetime.date(2018,1,2):[-1,1,2,3],
datetime.date(2018,1,3):[-2,0,1,2],
datetime.date(2018,1,4):[-3,-1,0,1]}
dferesult=pd.DataFrame(data=dresultdata)
And looks like this:
dferesult
Out[19]:
name earlydate 2018-01-01 2018-01-02 2018-01-03 2018-01-04
0 a 2018-01-01 0 -1 -2 -3
1 b 2018-01-03 -2 1 0 -1
2 c 2018-01-04 -3 2 1 0
3 d 2018-01-05 -4 3 2 1
I have made this work by looping as follows:
for d in newcols:
df.loc[:,d]=d-df.earlydate
But it takes forever for large frames (1m rows). Ideas welcome!
IIUC:
i = pd.to_datetime(df.earlydate.values).values
j = pd.to_datetime(df.columns[2:]).values
df.iloc[:, 2:] = (j - i[:, None]).astype('timedelta64[D]').astype(int)
df
earlydate name 2018-01-01 2018-01-02 2018-01-03 2018-01-04 2018-01-05
0 2018-01-01 a 0 1 2 3 4
1 2018-01-03 b -2 -1 0 1 2
2 2018-01-04 c -3 -2 -1 0 1
3 2018-01-05 d -4 -3 -2 -1 0
I have 2 df and I would like to merge them based on 2 keys - ID and date:
I following is just a small slice of the entire df
df_pw6
ID date pw10_0 pw50_0 pw90_0
0 153 2018-01-08 27.88590 43.2872 58.2024
0 2 2018-01-05 11.03610 21.4879 31.6997
0 506 2018-01-08 6.98468 25.3899 45.9486
df_ex
date ID measure f188 f187 f186 f185
0 2017-07-03 501 NaN 1 0.5 7 4.0
1 2017-07-03 502 NaN 0 2.5 5 3.0
2 2018-01-08 506 NaN 5 9.0 9 1.2
As you can see, only the third row has a match.
When I type:
#check date
df_ex.iloc[2,0]== df_pw6.iloc[1,1]
True
#check ID
df_ex.iloc[2,1] == df_pw6.iloc[2,0]
True
Now I try to merge them:
df19 = pd.merge(df_pw6,df_ex,on=['date','ID'])
I get an empty df
When I try:
df19 = pd.merge(df_pw6,df_ex,how ='left',on=['date','ID'])
I get:
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 153 2018-01-08 00:00:00 27.88590 43.2872 58.2024 NaN NaN NaN NaN NaN
1 2 2018-01-05 00:00:00 11.03610 21.4879 31.6997 NaN NaN NaN NaN NaN
2 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN NaN NaN NaN NaN
My desired result should be:
> ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
>
> 0 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
I have run your codes post your edit, and I succeeded in getting the desired result.
import pandas as pd
# copy paste your first df by hand
pw = pd.read_clipboard()
# copy paste your second df by hand
ex = pd.read_clipboard()
pd.merge(pw,ex,on=['date','ID'])
# output [edited. now it is the correct result OP wanted.]
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 506 2018-01-08 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2