I have a file with two date columns: one has a timestamp and one does not. I need to read the file, disregard the timestamp, and compare the two dates. If the two dates are the same, I need to write the row to the output file and disregard all other rows.
I'm having trouble deciding whether I should use a datetime function on the input, format the date there, and then simply check whether the two are equivalent, or whether I should be using a timedelta.
I've tried a couple different ways but haven't had success.
df = pd.read_csv("File.csv", dtype={'DATETIMESTAMP': np.datetime64, 'DATE':np.datetime64})
Gives me: TypeError: the dtype <M8 is not supported for parsing, pass this column using parse_dates instead
I've also tried to just remove the timestamp and then compare, but the strings end up with different date formats and that doesn't work either.
df['RemoveTimestamp'] = df['DATETIMESTAMP'].apply(lambda x: x[:10])
df = df[df['RemoveTimestamp'] == df['DATE']]
Any guidance appreciated.
Here is my sample input CSV file:
"DATE", "DATETIMESTAMP"
"8/6/2014","2014-08-06T10:18:38.000Z"
"1/15/2013","2013-01-15T08:57:38.000Z"
"3/7/2013","2013-03-07T16:57:18.000Z"
"12/4/2012","2012-12-04T10:59:37.000Z"
"5/6/2014","2014-05-06T11:07:46.000Z"
"2/13/2013","2013-02-13T15:51:42.000Z"
import pandas as pd
import numpy as np
# your data; both columns are strings
# ================================================
df = pd.read_csv('sample_data.csv')
df
DATE DATETIMESTAMP
0 8/6/2014 2014-08-06T10:18:38.000Z
1 1/15/2013 2013-01-15T08:57:38.000Z
2 3/7/2013 2013-03-07T16:57:18.000Z
3 12/4/2012 2012-12-04T10:59:37.000Z
4 5/6/2014 2014-05-06T11:07:46.000Z
5 2/13/2013 2013-02-13T15:51:42.000Z
# processing
# =================================================
# convert string to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['DATETIMESTAMP'] = pd.to_datetime(df['DATETIMESTAMP'])
# cast timestamp to date
df['DATETIMESTAMP'] = df['DATETIMESTAMP'].values.astype('<M8[D]')
DATE DATETIMESTAMP
0 2014-08-06 2014-08-06
1 2013-01-15 2013-01-15
2 2013-03-07 2013-03-07
3 2012-12-04 2012-12-04
4 2014-05-06 2014-05-06
5 2013-02-13 2013-02-13
# compare
df['DATE'] == df['DATETIMESTAMP']
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
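As an aside, on a recent pandas you can truncate the time without dropping to NumPy; a minimal sketch, assuming the sample file above (skipinitialspace handles the space after the comma in the header row, and tz_localize(None) drops the UTC marker that the trailing 'Z' produces):
import pandas as pd

df = pd.read_csv('sample_data.csv', skipinitialspace=True)
df['DATE'] = pd.to_datetime(df['DATE'])
ts = pd.to_datetime(df['DATETIMESTAMP']).dt.tz_localize(None)
df['DATETIMESTAMP'] = ts.dt.normalize()   # zero out the time-of-day
df[df['DATE'] == df['DATETIMESTAMP']]     # keep only the matching rows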
How about:
import time

filename = 'dates.csv'
with open(filename) as f:
    contents = f.readlines()

for line in contents[1:]:  # skip the header row
    date1, date2 = line.split(',')
    date1 = date1.strip('"')
    date2 = date2.split('T')[0].strip().strip('"')
    # normalize m/d/Y to Y-m-d so the two strings are comparable
    date1a = time.strftime("%Y-%m-%d", time.strptime(date1, "%m/%d/%Y"))
    if date1a == date2:
        print(line, end='')
I want to re-assign/replace values in my current data:
20000123
19850123
19880112
19951201
19850123
20190821
20000512
19850111
19670133
19850123
As you can see, there is a value 19670133 (YYYYMMDD), which is not a valid date, since no month has 33 days in it. So I want to re-assign it to the end of the month. I tried converting just that value to the end of the month, and it works.
But when I try to replace the old values with the new ones, it becomes a problem.
What I've tried is this:
for x in df_tmp_customer['date']:
    try:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x), axis=1)
    except Exception:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x[0:6]+"01") + pd.offsets.MonthEnd(n=0), axis=1)
This is the part that snaps a date to the end of the month:
pd.to_datetime(x[0:6] + "01") + pd.offsets.MonthEnd(n=0)
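For illustration, here is roughly what that expression does to the invalid value (a quick sketch; the literal "19670133" stands in for one element of the column):
import pandas as pd

x = "19670133"                          # invalid: January has no day 33
first = pd.to_datetime(x[0:6] + "01")   # Timestamp('1967-01-01')
end = first + pd.offsets.MonthEnd(n=0)  # rolls forward to Timestamp('1967-01-31')
print(end)                              # 1967-01-31 00:00:00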
Probably not efficient on a large dataset, but it can be done using pendulum.parse():
import pendulum

def parse_dates(x):
    # keep decrementing the day until pendulum accepts the date
    i = 0
    while True:
        try:
            return pendulum.parse(str(int(x) - i)).date()
        except ValueError:
            i += 1

df["date"] = df["date"].apply(parse_dates)
print(df)
date
0 2000-01-23
1 1985-01-23
2 1988-01-12
3 1995-12-01
4 1985-01-23
5 2019-08-21
6 2000-05-12
7 1985-01-11
8 1967-01-31
9 1985-01-23
For a vectorized solution, you can use:
# try to convert to YYYYMMDD
date1 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
# get rows for which conversion failed
m = date1.isna()
# try to get end of month
date2 = pd.to_datetime(df.loc[m, 'date'].str[:6], format='%Y%m', errors='coerce').add(pd.offsets.MonthEnd())
# Combine both
df['date2'] = date1.fillna(date2)
NB: this assumes df['date'] is of string dtype. If it is of integer dtype instead, use df.loc[m, 'date'].floordiv(100) in place of df.loc[m, 'date'].str[:6].
Output:
date date2
0 20000123 2000-01-23
1 19850123 1985-01-23
2 19880112 1988-01-12
3 19951201 1995-12-01
4 19850123 1985-01-23
5 20190821 2019-08-21
6 20000512 2000-05-12
7 19850111 1985-01-11
8 19670133 1967-01-31 # invalid replaced by end of month
9 19850123 1985-01-23
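Regarding the NB above, a sketch of the integer-dtype variant (the .astype(str) cast is my addition so that the %Y%m format applies to strings; it is not part of the original answer):
date2 = pd.to_datetime(
    df.loc[m, 'date'].floordiv(100).astype(str),  # e.g. 19670133 -> '196701'
    format='%Y%m', errors='coerce'
).add(pd.offsets.MonthEnd())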
I have a timestamp that looks like this: "1994-10-01:00:00:00", and when I try to read this dataset with pd.read_csv or pd.read_table, it imports everything, including the date column ([0]), but not even as an object. This is part of my code:
namevar = ['timestamp', 'nsub',
'sub_cms', # var 1 [cms]
'sub_gwflow', # var 2 [cfs]
'sub_interflow', # var 3 [cfs]
'sub_sroff', # var 4 [cfs]
....
'subinc_sroff', # var 13
'subinc_tavgc'] # var 14
df = pd.read_csv(root.anima, delimiter='\t', skiprows=1, header=avar+6, index_col=0,
                 names=namevar, infer_datetime_format=True,
                 parse_dates=[0])
print(df)
Results in:
nsub sub_cms ... subinc_sroff subinc_tavgc
timestamp
1994-10-01:00:00:00 1 4.4180 ... 0.0 59.11000
1994-10-01:00:00:00 2 2.6690 ... 0.0 89.29000
1994-10-01:00:00:00 3 4.3170 ... 0.0 77.02000
...
2000-09-30:00:00:00 2 2.3580 ... 0.0 0.19570
2000-09-30:00:00:00 3 2.2250 ... 0.0 0.73340
2000-09-30:00:00:00 4 0.8876 ... 0.0 0.07124
[8768 rows x 15 columns]
print(df.dtypes)
Results in:
nsub int64
sub_cms float64
sub_gwflow float64
sub_interflow float64
sub_sroff float64
subinc_actet float64
...
subinc_sroff float64
subinc_tavgc float64
dtype: object
My ultimate goal is that once the timestamp is in the dataframe, I can modify it by getting rid of the time, with:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y%m%d', infer_datetime_format=True)
but when I run this now, it gives me "KeyError: 'timestamp'".
Any help in getting the timestamp in the dataframe is much appreciated.
As highlighted by @s13wr81, the way to bring 'timestamp' into the dataframe as a column was to remove index_col from the read_csv statement.
In order to edit the timestamp properly, I needed to remove the :Hr:Min:Sec portion of it by using:
df['timestamp'] = df.timestamp.str.split(":", expand=True)[0]
and then, to convert timestamp to a pandas datetime, I used:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
I am not sure about the data, but I think timestamp is not a column but the index. This kind of problem sometimes happens when we do grouping.
Try:
"timestamp" in df.columns
If the result is False, then:
df = df.reset_index()
Next, to strip the time from timestamp, try:
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d:%H:%M:%S').dt.normalize()
df = pd.read_csv(root.anima, delimiter='\t', skiprows=1, header=avar+6, index_col=0,
                 names=namevar, infer_datetime_format=True,
                 parse_dates=[0])
I think you are explicitly telling pandas to consider column 0 as the index, which happens to be your datetime column.
Kindly try removing index_col=0 from the pd.read_csv() call, and I think it will work.
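A sketch of the suggested call (keeping root.anima, avar, and namevar from the question), with index_col removed so that timestamp stays a regular column, parsed explicitly with the non-standard format discussed below:
df = pd.read_csv(root.anima, delimiter='\t', skiprows=1, header=avar+6,
                 names=namevar)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d:%H:%M:%S')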
I think the issue is that the timestamp is in a non-standard format. There is a colon between the date and time parts. Here is a way to convert the value in the example:
import datetime
# note ':' between date part and time part
raw_timestamp = '1994-10-01:00:00:00'
format_string = '%Y-%m-%d:%H:%M:%S'
result = datetime.datetime.strptime(raw_timestamp, format_string)
print(result)
1994-10-01 00:00:00
You could use pd.to_datetime() with the format_string in this example, to process an entire column of timestamps.
UPDATE
Here is an example that uses a modified version of the original data (timestamp + one column; every entry is unique):
from io import StringIO
import pandas as pd
data = '''timestamp nsub
1994-10-01:00:00:00 1
1994-10-02:00:00:00 2
1994-10-03:00:00:00 3
2000-09-28:00:00:00 4
2000-09-29:00:00:00 5
2000-09-30:00:00:00 6
'''
df = pd.read_csv(StringIO(data), sep=r'\s+')
df['timestamp'] = pd.to_datetime(df['timestamp'],
                                 format='%Y-%m-%d:%H:%M:%S',
                                 errors='coerce')
print(df, end='\n\n')
print(df.dtypes)
timestamp nsub
0 1994-10-01 1
1 1994-10-02 2
2 1994-10-03 3
3 2000-09-28 4
4 2000-09-29 5
5 2000-09-30 6
timestamp datetime64[ns]
nsub int64
dtype: object
So let's say this is my code:
df = pd.read_table('file_name', sep=';')
today = pd.Timestamp("today").strftime("%d.%m.%Y")
df = df[(df['column1'] < today)]
df
Here's the table from the csv file:
Column 1
27.02.2018
05.11.2018
22.05.2018
01.11.2018
01.08.2018
01.08.2018
16.10.2018
22.08.2018
21.11.2018
So as you can see, I imported a table from a CSV file. I only need to see dates before today (16.10.2018), but when I run the code, this is what I get:
Column 1
05.11.2018
01.11.2018
01.08.2018
01.08.2018
This means Python is only looking at the days and ignoring the months, which is wrong. I need it to understand that this is a date, not just numbers. What do I do to achieve that?
PS: I'm new to Python.
You should convert your column to the date type, not strings, since strings are compared lexicographically.
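For instance, comparing the raw strings goes character by character:
>>> '05.11.2018' < '16.10.2018'
True
which holds simply because '0' < '1', even though 5 November 2018 is after 16 October 2018.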
You can thus convert it with:
# convert the strings to date(time) objects
df['column1'] = pd.to_datetime(df['column1'], format='%d.%m.%Y')
Then you can compare it with a date object, like:
>>> from datetime import date
>>> df[df['column1'] < date.today()]
column1
0 2018-02-27
2 2018-05-22
4 2018-08-01
5 2018-08-01
7 2018-08-22
I have a Series of dates in datetime64 format.
I want to convert them to a series of Period with a monthly frequency. (Essentially, I want to group dates into months for analytical purposes).
There must be a way of doing this - I just cannot find it quickly.
Note: these dates are not the index of the data frame - they are just a column of data in the data frame.
Example input data (as a Series)
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-12-01']))
print (data)
My current kludge/workaround looks like this:
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.DatetimeIndex(data).to_period('M')
data = pd.Series(data.year).astype('str') + '-' + pd.Series((data.month).astype('int')).map('{:0>2d}'.format)
data = data.where(data != '2262-04', other='No Date')
print (data)
There are some issues currently (even in master) with handling NaT in a PeriodIndex, so your approach won't work like that. But it seems that you simply want to resample, so do this. You can of course specify a function for 'how' if you want.
In [57]: data
Out[57]:
0 2014-10-01
1 2014-10-01
2 2014-10-31
3 2014-11-15
4 2014-11-30
5 NaT
6 2014-12-01
dtype: datetime64[ns]
In [58]: df = DataFrame(dict(A = data, B = np.arange(len(data))))
In [59]: df.dropna(how='any',subset=['A']).set_index('A').resample('M',how='count')
Out[59]:
B
A
2014-10-31 3
2014-11-30 2
2014-12-31 1
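As an aside, on recent pandas versions the direct conversion the question asked for works, since Series.dt.to_period propagates NaT; a minimal sketch:
import pandas as pd
import numpy as np

data = pd.to_datetime(pd.Series(
    ['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15',
     '2014-11-30', np.nan, '2014-12-01']))
print(data.dt.to_period('M'))  # NaT stays NaT; the rest become monthly Periods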
import pandas as pd
import numpy as np
from datetime import datetime

data = pd.to_datetime(
    pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.nan, '2014-01-01']))
data = pd.Series(['{}-{:02d}'.format(x.year, x.month) if isinstance(x, datetime) else "Nat"
                  for x in pd.DatetimeIndex(data).to_pydatetime()])
0 2014-10
1 2014-10
2 2014-10
3 2014-11
4 2014-11
5 Nat
6 2014-01
dtype: object
This is the best I could come up with. If the only possible non-datetime objects are floats, you can change if isinstance(x, datetime) to if not isinstance(x, float).
Python 2.7, pandas version 0.13.
What is the safest way to read a CSV and convert a column to dates?
I noticed that in my case, a white space in the column of dates was converted to today's date. Why?
Here's my CSV data:
fake_file = StringIO.StringIO("""case,opdate,
7,10/18/2006,
7,10/18/2008,
621, ,""")
Here's my code:
df=pd.DataFrame(pd.read_csv('path.csv',parse_dates=['opdate']))
tragically fills in the white space with today's date!
df=pd.DataFrame(pd.read_csv('path.csv',parse_dates=['opdate'],na_values=' '))
works, but do I really have to know that it is always ' ', instead of, say, '' or 'null'?
What is the safest way to convert dates and keep the nulls (especially when the null isn't a consistent value)?
One way is to pass a different date parser to read_csv (I threw in a null too):
fake_file = StringIO.StringIO("""case,opdate,
7,null,
7,10/18/2008,
621, ,""")
In [11]: parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y', coerce=True)
In [12]: pd.read_csv(fake_file, parse_dates=['opdate'], date_parser=parser)
Out[12]:
case opdate Unnamed: 2
0 7 NaT NaN
1 7 2008-10-18 NaN
2 621 NaT NaN
[3 rows x 3 columns]
Another option is to convert to dates after the fact using to_datetime:
In [21]: df = pd.read_csv(fake_file)
In [22]: pd.to_datetime(df.opdate, format='%m/%d/%Y')
ValueError: time data 'null' does not match format '%m/%d/%Y'
In [23]: pd.to_datetime(df.opdate, format='%m/%d/%Y', coerce=True)
Out[23]:
0 NaT
1 2008-10-18
2 NaT
Name: opdate, dtype: datetime64[ns]
In [24]: df['opdate'] = pd.to_datetime(df.opdate, format='%m/%d/%Y', coerce=True)
I think the fact that both to_datetime and read_csv convert blanks/spaces to today's date is definitely a bug...
You can specify NA values using the na_values argument to read_csv:
fake_file = StringIO.StringIO("""case,opdate,
7,10/18/2006,
7,10/18/2008,
621, ,""")
df = pd.read_csv(fake_file, parse_dates=[1], na_values={'opdate': ' '})
Output:
case opdate Unnamed: 2
0 7 2006-10-18 NaN
1 7 2008-10-18 NaN
2 621 NaT NaN
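If the null marker isn't consistent, one sketch (the particular strings here are assumptions, not from the original data) is to pass several candidates at once, since na_values accepts a list per column:
df = pd.read_csv(fake_file, parse_dates=[1],
                 na_values={'opdate': [' ', '', 'null', 'NULL']})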