Remove non date values from data-frame column python - python

I have a dataframe (df) which the head looks like:
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
Is there a way to remove the non date values (not the rows)? With this example the resulting frame would look like:
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4
All the examples I can see want me to drop the rows and I'd like to keep them.

pd.to_datetime
With errors='coerce'
df.assign(Date=pd.to_datetime(df.Date, errors='coerce'))
Date
0 2015-01-04
1 1996-01-09
2 NaT
3 1992-12-05
4 NaT
You can fill those NaT with empty strings if you'd like (though I don't recommend it)
df.assign(Date=pd.to_datetime(df.Date, errors='coerce').fillna(''))
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00
4
If you want to preserve whatever the things were in your dataframe and simply replace the things that don't look like dates with ''
df.assign(Date=df.Date.mask(pd.to_datetime(df.Date, errors='coerce').isna(), ''))
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4

One more simple way around..
>>> df
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce').fillna('')
>>> df
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00

Related

pandas calculate date difference in a new column referencing previous cell

I have pandas dataframe that contains dates in column Date. I need to add another column Days which contains the date difference from previous cell. So date in ith cell should difference from i-1th. And for the first difference consider it to be 0.
Date Days
08-01-1997 0
09-01-1997 1
10-01-1997 1
13-01-1997 3
14-01-1997 1
15-01-1997 1
01-03-1997 45
03-03-1997 2
04-03-1997 1
05-03-1997 1
13-06-1997 100
I tried this but not useful.
First convert the Date column to pandas DateTime object, then calculate the difference which is timedelta object, from there, take the days from Series.dt and assign 0 to first value
>>> df['Date']=pd.to_datetime(df['Date'], dayfirst=True)
>>> df['Days']=(df['Date']-df['Date'].shift()).dt.days.fillna(0).astype(int)
OUTPUT
df
Date Days
0 1997-01-08 0
1 1997-01-09 1
2 1997-01-10 1
3 1997-01-13 3
4 1997-01-14 1
5 1997-01-15 1
6 1997-03-01 45
7 1997-03-03 2
8 1997-03-04 1
9 1997-03-05 1
10 1997-06-13 100
you can use diff as well
df['date_up'] = pd.to_datetime(df['Date'],dayfirst=True)
df['date_diff'] = df['date_up'].diff()
df['date_diff_num_days'] = df['date_diff'].dt.days.fillna(0).astype(int)
df.head()
Date Days date_up date_diff date_diff_num_days
0 08-01-1997 0 1997-01-08 NaT 0
1 09-01-1997 1 1997-01-09 1 days 1
2 10-01-1997 1 1997-01-10 1 days 1
3 13-01-1997 3 1997-01-13 3 days 3
4 14-01-1997 1 1997-01-14 1 days 1

Just keep the first value of every minute in pandas dataframe

I want to reduce my data. My initial dataframe looks as follows:
index
time [hh:mm:ss]
value1
value2
0
0 days 00:00:00.000000
3
4
1
0 days 00:00:04.000000
5
2
2
0 days 00:02:02.002300
7
9
3
0 days 00:02:03.000000
9
7
4
0 days 03:02:03.000000
4
3
Now I want to reduce my data in order to only keep the cells of every new minute (respectively also new hour and days). the other way around: only the first line of a new minute should be kept. all remaining lines of this minute should be dropped.
So the resulting table looks as follows:
index
time
value1
value2
0
0 days 00:00:00.000000
3
4
2
0 days 00:02:02.002300
7
9
4
0 days 03:02:03.000000
4
3
Any ideas how to approach this?
There is used timedeltas so is possible create TimedeltaIndex and use DataFrame.resample by 1Minute with Resampler.first, only are added all minutes, so removed only NaNs rows:
df.index = pd.to_timedelta(df['time [hh:mm:ss]'])
df = df.resample('1Min').first().dropna(how='all').reset_index(drop=True)
print (df)
time [hh:mm:ss] value1 value2
0 0 days 00:00:00.000000 3.0 4.0
1 0 days 00:02:02.002300 7.0 9.0
2 0 days 03:02:03.000000 4.0 3.0
You could extract the D:HH:MM using apply and multiple splits, and then delete the duplicates, choosing the first value.
dms = df['time [hh:mm:ss]'].apply(lambda x: ':'.join( [x.split(' days ')[0], *x.split('days ')[1].split(':')[:2]]) )
df.iloc[dms.drop_duplicates().index]
d = '''index,time,value1,value2 0,0 days 00:00:00.000000,3,4 1,0 days
00:00:04.000000,5,2 2,0 days 00:02:02.002300,7,9 3,0 days
00:02:03.000000,9,7 4,0 days 03:02:03.000000,4,3'''
df = pd.read_csv(StringIO(d),parse_dates=True)
df
df['time1'] = pd.to_datetime(df['time'].str.slice(7))
df.set_index('time1',inplace=True)
df
df.groupby([df.index.hour,df.index.minute]).head(1).sort_index().reset_index(drop=True)

How to group and aggregate data starting from constant and ending on changing date? [duplicate]

This question already has an answer here:
How to count unique occurrences grouping by changing time period?
(1 answer)
Closed 1 year ago.
I need to aggregate data between constant date, like first day of year, and all the other dates through the year. There are two variants of this problem:
easier - sum:
created_at value
01-01-2012 5
02-01-2012 6
05-01-2012 1
05-01-2012 1
01-02-2012 3
02-02-2012 2
05-02-2012 1
which should output:
Date Month to date sum Year to date sum
01-01-2012 5 5
02-01-2012 11 11
05-01-2012 13 13
01-02-2012 3 14
02-02-2012 5 15
05-02-2012 6 16
and harder - count unique:
created_at value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
05-02-2012 d
which should output:
Date Month to date unique Year to date unique
01-01-2012 1 1
02-01-2012 2 2
05-01-2012 3 3
01-02-2012 1 3
02-02-2012 1 3
05-02-2012 2 4
The data is, of course, in Pandas data frame.The obvious, but very clumsy way is to create for loop between the starting dates and the moving one. The problem looks like a popular one. Is there some reasonable pandas builtin way for such type of computation? Regarding counting unique I also want to avoid stacking lists, as I have large number of rows and unique values.
I was checking out Pandas window functions, but it doesn't look like a solution.
Try with groupby:
Cumulative sum:
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")
df["Month to date sum"] = df.groupby(df["created_at"].dt.month)["value"].transform('cumsum')
df["Year to date sum"] = df.groupby(df["created_at"].dt.year)["value"].transform('cumsum')
>>> df
created_at value Month to date sum Year to date sum
0 2012-01-01 5 5 5
1 2012-01-02 6 11 11
2 2012-01-05 1 12 12
3 2012-02-01 3 3 15
4 2012-02-02 2 5 17
5 2012-02-05 1 6 18
Cumulative unique count:
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")
df2["Month to date unique"] = df2.groupby(df2["created_at"].dt.month)["value"].apply(lambda x: (~x.duplicated()).cumsum())
df2["Year to date unique"] = df2.groupby(df2["created_at"].dt.year)["value"].apply(lambda x: (~x.duplicated()).cumsum())
>>> df2
created_at value Month to date unique Year to date unique
0 2012-01-01 a 1 1
1 2012-01-02 b 2 2
2 2012-01-05 c 3 3
3 2012-02-01 a 1 3
4 2012-02-02 a 1 3
5 2012-02-05 d 2 4

Count cases by dates and save it in a new dataframe

In one data frame (called X) I have Patient_admitted_id, Date, Hospital_ID of tested covid positive patients (I show this data frame below). I want to generate a separated data frame (called Y) with Dates of the calendar, total number of covid Cases and cumulative cases.
I dont know how to generate the column Cases
X data frame:
data = {'Patient_admitted_id': [214321,224323,3234234,23423],
'Date': ['2021-01-22', '2021-01-22','2021-01-22','2021-01-20'], # This is just an example I have created here, the real X data frame contains proper date values generated with Datatime
'Hospital_ID': ['1', '2','3','2'],
}
X = pd.DataFrame(data, columns=['Patient_admitted_id','Date', 'Hospital_ID' ])
X
Patient_admitted_id Date Hospital_ID
0 214321 2021-01-22 1
1 224323 2021-01-22 2
2 3234234 2021-01-22 3
3 23423 2021-01-20 2
...
Desirable Y data frame:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
...
Use DataFrame.resample by days with counts by Resampler.size with Series.cumsum for cumulative counts:
X['Date']= pd.to_datetime(X['Date'])
df = X.resample('D', on='Date').size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
print (df)
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
You can use groupby on Date column and call size to get the count of individual Dates, you can then simply call Cumsum on cases to get the desired output
out = X.groupby('Date').size().to_frame('Cases').reset_index()
out['Cumulative'] = out['Cases'].cumsum()
out variable holds the desired dataframe.
OUTPUT:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-22 3 4
Just adding a solution with pd.Grouper
X['Date']= pd.to_datetime(X['Date'])
df = X.groupby(pd.Grouper(key='Date', freq='D')).size().reset_index(name='Cases')
df['Cumulative'] = df.Cases.cumsum()
df
Output
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4

Pandas, create missing dates dictionary from dates column

I have a DataFrame which contains data from last year, but the dates column has some missing dates
date
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
4 2019-11-05
I want to create a dictionary of gaps between dates, so keys would be start dates and values as end dates, something like:
dates_gaps = {2019-10-21:2019-10-29, 2019-10-29:2019-11-01,2019-11-01:2019-11-04 ...}
so I created a column to indicate whether a gap exists with the following:
df['missing_dates'] = df[DATE].diff().dt.days > 1
which outputs the following:
# True indicates if there's a gap or not
0 2019-10-21 False
1 2019-10-29 True
2 2019-11-01 True
3 2019-11-04 True
4 2019-11-05 False
and I'm having trouble going forward from here
You can add condition for compare missing values, convert date columnto strings by Series.dt.strftime and last create dictionary with zip:
diff = df['date'].diff()
s = df.loc[(diff.dt.days > 1) | diff.isna(), 'date'].dt.strftime('%Y-%m-%d')
print (s)
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
Name: date, dtype: object
d = dict(zip(s, s.shift(-1)[:-1]))
print (d)
{'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04'}
just convert these dates into datetime and find the difference between two adjacent dates.
a = pd.to_datetime('1900-01-01', format='%Y-%m-%d')
b = pd.to_datetime('1900-02-01', format='%Y-%m-%d')
c = a-b
c.days # -31

Categories