Pandas, create missing dates dictionary from dates column - python

I have a DataFrame which contains data from last year, but the dates column has some missing dates
date
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
4 2019-11-05
I want to create a dictionary of gaps between dates, so keys would be start dates and values as end dates, something like:
dates_gaps = {2019-10-21:2019-10-29, 2019-10-29:2019-11-01,2019-11-01:2019-11-04 ...}
so I created a column to indicate whether a gap exists with the following:
df['missing_dates'] = df[DATE].diff().dt.days > 1
which outputs the following:
# True indicates a gap before this date
0 2019-10-21 False
1 2019-10-29 True
2 2019-11-01 True
3 2019-11-04 True
4 2019-11-05 False
and I'm having trouble going forward from here

You can add a condition to also keep missing values (so the first row is included), convert the date column to strings with Series.dt.strftime, and finally create the dictionary with zip:
diff = df['date'].diff()
s = df.loc[(diff.dt.days > 1) | diff.isna(), 'date'].dt.strftime('%Y-%m-%d')
print (s)
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
Name: date, dtype: object
d = dict(zip(s, s.shift(-1)[:-1]))
print (d)
{'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04'}
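If you also need the concrete dates that fall inside each gap, a small follow-up sketch (continuing with the dictionary d above) expands each pair with pd.date_range:
missing = {start: pd.date_range(start, end)[1:-1].strftime('%Y-%m-%d').tolist()
           for start, end in d.items()}
print(missing['2019-10-21'])   # the seven dates 2019-10-22 .. 2019-10-28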

Just convert these dates into datetime and find the difference between two adjacent dates:
a = pd.to_datetime('1900-01-01', format='%Y-%m-%d')
b = pd.to_datetime('1900-02-01', format='%Y-%m-%d')
c = a-b
c.days # -31
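To apply the same idea to a whole column rather than two scalars, a small sketch using Series.diff on the dates from the question:
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-10-21', '2019-10-29', '2019-11-01',
                                  '2019-11-04', '2019-11-05']))
gaps = dates.diff().dt.days   # NaN for the first row, then the day difference to the previous row
print(gaps.tolist())          # [nan, 8.0, 3.0, 3.0, 1.0]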

Related

Find max number of consecutive days

The code below groups the dataframe by a key.
df = pd.DataFrame(data, columns=['id', 'date', 'cnt'])
df['date'] = pd.to_datetime(df['date'])
for c_id, group in df.groupby('id'):
    print(c_id)
    print(group)
This produces a result like this:
id date cnt
1 2019-01-02 1
1 2019-01-03 2
1 2019-01-04 3
1 2019-01-05 1
1 2019-01-06 2
1 2019-01-07 1
id date cnt
2 2019-01-01 478964
2 2019-01-02 749249
2 2019-01-03 1144842
2 2019-01-04 1540846
2 2019-01-05 1444918
2 2019-01-06 1624770
2 2019-01-07 2227589
id date cnt
3 2019-01-01 41776
3 2019-01-02 82322
3 2019-01-03 93467
3 2019-01-04 56674
3 2019-01-05 47606
3 2019-01-06 41448
3 2019-01-07 145827
id date cnt
4 2019-01-01 41776
4 2019-01-02 82322
4 2019-01-06 93467
4 2019-01-07 56674
From this result, I want to find the maximum consecutive number of days for each id. So id 1 would be 6, id 2 would be 7, id 3 would be 7, and id 4 would be 2.
Use:
m = (df.assign(date=pd.to_datetime(df['date']))  # if necessary convert, else drop this line
       .groupby('id')['date']
       .diff()
       .gt(pd.Timedelta('1D'))
       .cumsum())
df.groupby(['id', m]).size().max(level='id')
Output
id
1 6
2 7
3 7
4 2
dtype: int64
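Note: on recent pandas versions where Series.max(level=...) is no longer available, an equivalent form of the last line (a small sketch) would be:
df.groupby(['id', m]).size().groupby(level='id').max()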
To get your result, run:
result = df.groupby('id').apply(lambda grp: grp.groupby(
    (grp.date.shift() + pd.Timedelta(1, 'd') != grp.date).cumsum())
    .id.count().max())
Details:
df.groupby('id') - First level grouping (by id).
grp.groupby(...) - Second level grouping (by sequences
of consecutive dates).
grp.date.shift() - Date from the previous row.
+ pd.Timedelta(1, 'd') - Shifted by 1 day.
!= grp.date - Not equal to the current date. The result
is a Series with True on the start of each sequence of
consecutive dates.
cumsum() - Convert the above (bool) Series to a Series
of int - consecutive numbers of above sequences, starting
from 1.
id - Take id column from each (second level) group.
count() - Compute the size of the current group.
.max() - Take max from sizes of second level groups
(within the current level 1 group).
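To make the second-level grouping concrete, here is a minimal sketch of the intermediate Series for a single id (using the dates of id 4 from the sample above):
import pandas as pd

grp = pd.DataFrame({'date': pd.to_datetime(['2019-01-01', '2019-01-02',
                                            '2019-01-06', '2019-01-07'])})
starts = grp.date.shift() + pd.Timedelta(1, 'd') != grp.date
print(starts.tolist())              # [True, False, True, False] -> True marks the start of a streak
print(starts.cumsum().tolist())     # [1, 1, 2, 2] -> a label per streak
print(grp.groupby(starts.cumsum()).size().max())   # 2 -> the longest streak for id 4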

Remove non date values from data-frame column python

I have a dataframe (df) which the head looks like:
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
Is there a way to remove the non date values (not the rows)? With this example the resulting frame would look like:
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4
All the examples I can see want me to drop the rows and I'd like to keep them.
Use pd.to_datetime with errors='coerce':
df.assign(Date=pd.to_datetime(df.Date, errors='coerce'))
Date
0 2015-01-04
1 1996-01-09
2 NaT
3 1992-12-05
4 NaT
You can fill those NaT with empty strings if you'd like (though I don't recommend it)
df.assign(Date=pd.to_datetime(df.Date, errors='coerce').fillna(''))
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00
4
If you want to preserve the original values and simply replace the ones that don't look like dates with '':
df.assign(Date=df.Date.mask(pd.to_datetime(df.Date, errors='coerce').isna(), ''))
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4
One more simple approach:
>>> df
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce').fillna('')
>>> df
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00
4
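If you would rather keep the valid entries looking like the original strings instead of Timestamps with 00:00:00, a possible sketch (assuming the strings are day-first dd/mm/yyyy, which is an assumption about your data):
import pandas as pd

# assumption: the original strings are day-first (dd/mm/yyyy)
df['Date'] = (pd.to_datetime(df['Date'], errors='coerce', dayfirst=True)
                .dt.strftime('%d/%m/%Y')
                .fillna(''))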

Keep only the rows with dates

I have a very messy dataframe imported from excel with only some rows containing a date in the first column (index 0, no headers). How do I drop all the rows that don't contain a date?
I would use pd.to_datetime with errors='coerce', then drop the null dates by indexing:
For example:
>>> df
x y
0 2011-02-03 1
1 x 2
2 1 3
3 2012-03-03 4
>>> df[pd.to_datetime(df.x, errors='coerce').notnull()]
x y
0 2011-02-03 1
3 2012-03-03 4
Note: This will lead to some problems if you have different date formats in your column
Explanation:
using pd.to_datetime with errors='coerce' will look for a date-like string, and return NaT (which is null) if it is not found:
>>> pd.to_datetime(df.x, errors='coerce')
0 2011-02-03
1 NaT
2 NaT
3 2012-03-03
Name: x, dtype: datetime64[ns]
Therefore, you can get all the non-null values using notnull:
>>> pd.to_datetime(df.x, errors='coerce').notnull()
0 True
1 False
2 False
3 True
Name: x, dtype: bool
And use that as a mask on your original dataframe
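If you also want the kept column as real datetimes rather than the original strings, a small follow-up sketch (continuing with the df shown above):
kept = df[pd.to_datetime(df.x, errors='coerce').notnull()].copy()
kept['x'] = pd.to_datetime(kept['x'])   # kept.x is now datetime64 instead of object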

Pandas check for corresponding column in list and lowest date

I have a dataframe with multiple status fields per row. I want to check if any of the status fields have values in a list, and if so, I need to take the lowest date field for the corresponding status. My list of acceptable values and a sample dataframe look like this:
checkList = ['Foo','Bar']
df = pd.DataFrame([['A', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['B', 'Foo', datetime.datetime(2017, 10, 1), 'Other', datetime.datetime(2017, 9, 1), np.nan, np.nan],
                   ['C', 'Bar', datetime.datetime(2016, 1, 1), np.nan, np.nan, 'Foo', datetime.datetime(2016, 5, 5)]],
                  columns=['record', 'status1', 'status1_date', 'status2', 'status2_date', 'another_status', 'another_status_date'])
print df
record status1 status1_date status2 status2_date another_status \
0 A NaN NaT NaN NaT NaN
1 B Foo 2017-10-01 Other 2017-09-01 NaN
2 C Bar 2016-01-01 NaN NaT Foo
another_status_date
0 NaT
1 NaT
2 2016-05-05
I need to figure out if any of the statuses are in the approved list. If so, I need the first date for an approved status. The output would look like this:
print output_df
record master_status master_status_date
0 A False NaT
1 B True 2017-10-01
2 C True 2016-01-01
Thoughts on how best to approach? I can't just take min date, I'd need min where corresponding status field is in the list.
master_status = df.apply(lambda x: False if all([pd.isnull(rec) for rec in x[1:]]) else True, axis=1)
master_status_date = df.apply(lambda x: min([i for i in x[1:] if isinstance(i, datetime.datetime)]), axis=1)
record = df['record']
n_df = pd.concat([record, master_status, master_status_date], 1)
print(n_df)
record 0 1
0 A False NaT
1 B True 2017-09-01
2 C True 2016-01-01
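The result above takes the minimum over all date columns, which is why record B shows 2017-09-01 rather than the expected 2017-10-01. A minimal sketch that restricts the minimum to dates whose corresponding status is in checkList; the status/date column pairing below is an assumption read off the sample columns:
import pandas as pd

# df and checkList as defined in the question above
status_pairs = [('status1', 'status1_date'),
                ('status2', 'status2_date'),
                ('another_status', 'another_status_date')]

def approved_min(row):
    # collect the dates whose corresponding status value is in the approved list
    dates = [row[d] for s, d in status_pairs
             if row[s] in checkList and pd.notnull(row[d])]
    return pd.Series({'master_status': bool(dates),
                      'master_status_date': min(dates) if dates else pd.NaT})

output_df = pd.concat([df[['record']], df.apply(approved_min, axis=1)], axis=1)
print(output_df)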

Count number of rows for each ID within 1 year

I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with an aggregation by size.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
# convert to datetime (if the first number is the day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyarbeforenow = now - pd.offsets.DateOffset(years=1)
oneyarafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyarbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyarafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
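To attach these counts back to the original rows as columns (which is what the question asks for), one possible follow-up sketch:
df = df.join(df1.rename(columns={0: 'last_year', 1: 'next_year'}), on='ID')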
EDIT:
If you need to compare each date against a per-group reference date (here the group's last date, x.iat[-1]) minus or plus a one-year offset, you need a custom function with the condition and a sum of the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(),
                         (x < x.iat[-1] + offs).sum()], index=['last', 'next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
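To see what the custom function computes for a single group, a small sketch for id 'a' (reproducing the 1 and 3 in the first output row; the dates are rebuilt here because df was overwritten by the apply call):
import pandas as pd

g = pd.to_datetime(pd.Series(['01/01/2016', '05/01/2016', '10/05/2017']), dayfirst=True)
offs = pd.offsets.DateOffset(years=1)
print((g > g.iat[-1] - offs).sum())   # 1 -> dates later than (last date - 1 year)
print((g < g.iat[-1] + offs).sum())   # 3 -> dates earlier than (last date + 1 year)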
In [19]: x['date'] = pd.to_datetime(x['date'])  # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
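If you prefer one row per year and a column per id, the same groupby result can be unstacked (a small sketch continuing the session above):
In [28]: x.groupby(['date', 'id']).size().unstack(fill_value=0)
Out[28]:
id    a  b
date
2014  0  2
2016  2  0
2017  1  1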
Using resample takes care of the missing in-between years; see year 2015 below.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only year in columns
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
