What is the format for UNIX timestamp of '253402128000000'? - python

I'm trying to convert a whole column of timestamp values in UNIX format, but I get some values that don't look like a normal timestamp: 253402128000000
From what I know, a timestamp should look like: 1495245009655
I've tried milliseconds, nanoseconds, and other configurations for pandas to_datetime, but I haven't been able to find a setting that converts this value.
EDIT
My data looks like below and the ValidEndDateTime seems way off.
"ValidStartDateTime": "/Date(1495245009655)/",
"ValidEndDateTime": "/Date(253402128000000)/",
SOLUTION
I've accepted the answer below because I can see the date is a "never-end" date: all the values in my dataset that can't be converted are set to the same value, 253402128000000.
Thank you for the answers!

From a comment of yours:
The data I get looks like this: "ValidStartDateTime": "/Date(1495245009655)/", "ValidEndDateTime": "/Date(253402128000000)/",
The numbers appear to be UNIX timestamps in milliseconds, and the big "End" one seems to mean "never end"; note the special date:
1495245009655 = Sat May 20 2017 01:50:09
253402128000000 = Thu Dec 30 9999 00:00:00
Converted with https://currentmillis.com/
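A quick way to double-check these conversions without an external site (a small sketch; Python's datetime, unlike pandas' nanosecond-resolution Timestamp, handles years up to 9999):
from datetime import datetime, timezone

# Both values are milliseconds since the epoch
print(datetime.fromtimestamp(1495245009655 / 1000, tz=timezone.utc))
# 2017-05-20 01:50:09.655000+00:00
print(datetime.fromtimestamp(253402128000000 / 1000, tz=timezone.utc))
# 9999-12-30 00:00:00+00:00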

If the value is instead treated as microseconds, dividing by 1,000,000 gives 253402128 seconds, which corresponds to roughly 44 years ago:
Format: Microseconds (1/1,000,000 second)
GMT: Wed Jan 11 1978 21:28:48 GMT+0000
I used this website as reference: https://www.unixtimestamp.com/

Use pd.to_datetime:
>>> pd.to_datetime(1495245009655, unit='ms')
Timestamp('2017-05-20 01:50:09.655000')
>>> pd.to_datetime(253402128000000 / 100, unit='ms')
Timestamp('2050-04-19 22:48:00')
Note that the full value cannot be converted directly: pandas' nanosecond-resolution Timestamp only reaches 2262-04-11, so the year-9999 milliseconds value overflows, and the division by 100 above is only there to bring the number into range for demonstration.
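If you're converting the whole column, one simple option (a sketch, not from the original answers) is to let the out-of-range "never end" sentinel coerce to NaT:
import pandas as pd

s = pd.Series([1495245009655, 253402128000000])

# Year-9999 values overflow pandas' ns-resolution Timestamp, so turn
# them into NaT ("no end date") instead of raising OutOfBoundsDatetime
converted = pd.to_datetime(s, unit='ms', errors='coerce')
# 0   2017-05-20 01:50:09.655
# 1                       NaT
# dtype: datetime64[ns]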

Related

Why does pandas interpret Aug-30 as 1930-08, but not 2030-08? [duplicate]

I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.
Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.
E.g.:
dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
               '2061-01-09', '2055-02-08'],
              dtype='datetime64[ns]', freq=None)
The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?
That seems to be the behavior of the Python datetime library; I did a test to see where the cutoff is, and it sits between 68 and 69:
>>> import datetime
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)
Two-digit year ambiguity
So it seems that anything with a %y year below 69 is assigned the century 2000, while 69 and upwards get 1900.
The two digits of %y can only go from 00 to 99, which becomes ambiguous once we start crossing centuries.
If there is no overlap, you can process the data manually and annotate the century yourself, killing the ambiguity: for example, decide that anything in your data with a year between 17 and 68 is attributed to 1917-1968 (instead of 2017-2068).
If there is overlap, e.g. you have data from both 2016 and 1916 and both were logged as '16', that's ambiguous and there isn't sufficient information to parse it, unless the data is ordered by date, in which case you can use a heuristic that switches the century as you parse; a sketch follows below.
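A minimal sketch of that ordered-data heuristic (my own illustration, assuming the rows are already in ascending chronological order):
import pandas as pd

def parse_ordered(strings, fmt='%d-%b-%y'):
    # Parse with %y first, then walk backwards and pull any date that
    # breaks the ascending order back by a century
    parsed = [pd.to_datetime(s, format=fmt) for s in strings]
    for i in range(len(parsed) - 2, -1, -1):
        while parsed[i] > parsed[i + 1]:
            parsed[i] -= pd.DateOffset(years=100)
    return pd.DatetimeIndex(parsed)

parse_ordered(['1-Jan-65', '15-Jun-70', '5-Dec-94', '26-Sep-05'])
# DatetimeIndex(['1965-01-01', '1970-06-15', '1994-12-05', '2005-09-26'], dtype='datetime64[ns]', freq=None)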
from the docs
Year 2000 (Y2K) issues: Python depends on the platform’s C library,
which generally doesn’t have year 2000 issues, since all dates and
times are represented internally as seconds since the epoch. Function
strptime() can parse 2-digit years when given %y format code. When
2-digit years are parsed, they are converted according to the POSIX
and ISO C standards: values 69–99 are mapped to 1969–1999, and values
0–68 are mapped to 2000–2068.
For anyone looking for a quick and dirty code snippet to fix these cases, this worked for me:
from datetime import timedelta
import pandas as pd

col = 'date'
df[col] = pd.to_datetime(df[col])
# Newer pandas raises on datetime64-vs-datetime.date comparisons,
# so compare against a Timestamp instead
future = df[col] > pd.Timestamp(year=2050, month=1, day=1)
df.loc[future, col] -= timedelta(days=365.25 * 100)
You may need to tune the threshold date closer to the present depending on the earliest dates in your data.
You can write a simple function to correct this parsing of the wrong year, as stated below:
import datetime

def fix_date(x):
    # Threshold chosen for this particular data set: anything parsed
    # after 1989 is assumed to belong to the previous century
    if x.year > 1989:
        year = x.year - 100
    else:
        year = x.year
    return datetime.date(year, x.month, x.day)

df['date_column'] = df['date_column'].apply(fix_date)
Hope this helps.
Another quick solution to the problem:
import pandas as pd
import numpy as np

dates = pd.DataFrame(['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55'])
for i in dates:
    tempyear = pd.to_numeric(dates[i].str[-2:])
    dates["temp_year"] = np.where((tempyear >= 44) & (tempyear <= 99), tempyear + 1900, tempyear + 2000).astype(str)
    dates["temp_month"] = dates[i].str[:-2]
    dates["temp_flyr"] = dates["temp_month"] + dates["temp_year"]
    dates["pddt"] = pd.to_datetime(dates.temp_flyr.str.upper(), format='%d-%b-%Y', yearfirst=False)
    tempdrops = ["temp_year", "temp_month", "temp_flyr", i]
    dates.drop(tempdrops, axis=1, inplace=True)
And the output is as follows; here I have converted the output from object to pandas datetime format using pd.to_datetime:
pddt
0 2005-09-26
1 2005-09-26
2 1970-06-15
3 1994-12-05
4 1961-01-09
5 1955-02-08
As mentioned in some other answers, this works best if there is no overlap between the dates of the two centuries.
If running into the same problem using a pandas DataFrame, compare against the current date (or a cutoff year) and apply a lambda function similar to the one below:
import datetime as dt

df["column"] = df["column"].apply(lambda x: x - dt.timedelta(days=365 * 100) if x > dt.datetime.now() else x)
or
df["column"] = df["column"].apply(lambda x: x - dt.timedelta(days=365 * 100) if x.year > 2022 else x)

Converting a day decimal number array to unix timestamp in Python

I have an array of numbers (e.g. 279.341, 279.345, 279.348) which relate to the date and time in 2017 (it's supposed to be October 6th 2017). To be able to compare this data to another dataset, I need to convert that array into an array of UNIX timestamps.
I have successfully done something similar in MATLAB (code below) but don't know how to translate it to Python.
MATLAB:
adcpTimeStr = datestr(adcp.adcp_day_num,'2017 mmm dd HH:MM:SS');
adcpTimeRaw = datetime(adcpTimeStr,'InputFormat','yyyy MMM dd HH:mm:ss');
adcpTimenumRaw = datenum(adcpTimeRaw)';
What would be a good way of converting the array into UNIX timestamps?
Assuming these numbers are fractional days of the year (UTC) and the year is 2017, in Python you would do:
from datetime import datetime, timedelta, timezone
year = datetime(2017,1,1, tzinfo=timezone.utc) # the starting point
doy = [279.341, 279.345, 279.348]
# add days to starting point as timedelta and call timestamp() method:
unix_t = [(year+timedelta(d)).timestamp() for d in doy]
# [1507363862.4, 1507364208.0, 1507364467.2]
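For large arrays, a vectorized pandas variant of the same idea might look like this (a sketch under the same day-of-year assumption):
import pandas as pd

doy = pd.Series([279.341, 279.345, 279.348])
# Add the fractional days to the start of 2017, then measure seconds
# elapsed since the UNIX epoch
t = pd.Timestamp('2017-01-01', tz='UTC') + pd.to_timedelta(doy, unit='D')
unix_t = (t - pd.Timestamp('1970-01-01', tz='UTC')).dt.total_seconds()
# [1507363862.4, 1507364208.0, 1507364467.2]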

How can I handle wrong year format

Being new to Python and pandas, I faced the following problem.
In my dataframe I have a column with dates (yyyy-mm-ddThh-mm-sec) where most of the years are fine (they look like 2008), but in some rows the year is written like 0008. Because of this I have trouble formatting the column using pd.to_datetime.
My thought was to convert it first into a 2-digit year (using pd.to_datetime(df['date']).dt.strftime('%y %b, %d %H:%M:%S.%f +%Z')), but I got an error: Out of bounds nanosecond timestamp: 08-10-02 14:41:00.
Are there any other options to convert 0008 to 2008 in the dataframe?
Thanks for the help in advance
If the format of the bad data is always the same (the bad years are always 4 characters), then you can use the str accessor:
df = pd.DataFrame({'date':['2008-01-01', '0008-01-02']})
df['date'] = pd.to_datetime(df['date'].str[2:], yearfirst=True)
date
0 2008-01-01
1 2008-01-02
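A variation on the same idea (my sketch, not from the answer above) rewrites only a leading "00" century, so correctly-formatted rows pass through untouched:
import pandas as pd

df = pd.DataFrame({'date': ['2008-01-01', '0008-01-02']})
# Replace a leading "00" with "20"; rows that already start with a
# real century are left alone
df['date'] = pd.to_datetime(df['date'].str.replace(r'^00', '20', regex=True))
# 0   2008-01-01
# 1   2008-01-02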

Date change halfway through csv from YYYY-MM-DD to DD/MM/YY and after switch datetime no longer works

I have a csv of daily temperature data with 3 columns: dates, daily maximum temperatures, and daily minimum temperatures. I attached it here so you can see what I mean.
I am trying to break this data set into smaller datasets of 30-year periods. For the first few years of Old.csv the dates are entered as YYYY-MM-DD, but they switch to DD/MM/YY in 1900. After the date format switches, my code to split the years no longer works. Here is what I'm using:
df2 = pd.read_csv("Old.csv")
test = df2[
    (pd.to_datetime(df2['Date']) > pd.to_datetime('1897-01-01')) &
    (pd.to_datetime(df2['Date']) < pd.to_datetime('1899-12-31'))
]
and it works... BUT when I switch to 1900 and beyond it stops. So this one doesn't work:
test = df2[
    (pd.to_datetime(df2['Date']) > pd.to_datetime('1900-01-01')) &
    (pd.to_datetime(df2['Date']) < pd.to_datetime('1905-12-31'))
]
The above code gives me an empty data set, despite working pre-1900. I'm assuming this is some sort of formatting issue, but I thought that using .to_datetime would fix that. I also tried this:
df2['Date'] = pd.to_datetime(df2['Date'])
to reformat the entire column before I ran the code above, but it still didn't work. The other interesting thing is that I have a separate csv with dates consistently entered as MM/DD/YY, and that one works with the code above. Could it be an issue with the turn of the century? Does anyone know how to fix this?
You're dealing with date/time data in different formats; for this you could use a more flexible parser, for instance dateutil.parser.
Example:
>>> from dateutil.parser import parse
>>> df
Date
0 1897-01-01
1 1899-12-31
2 01/01/00
>>> df.Date.apply(parse)
0   1897-01-01
1   1899-12-31
2   2000-01-01
Name: Date, dtype: datetime64[ns]
and use your function on the parsed data.
As remarked in the comment above, it's still not clear whether year "00" refers to year 1900 or 2000, but maybe you can infer that from the context of the csv file.
To change all years in the 'DD/MM/YY' format to 1900 dates, you could define your own parse function:
>>> def my_parse(d):
...     if d[-3] == '/':
...         d = d[:-3] + '/19' + d[-2:]
...     return parse(d)
>>> df.Date.apply(my_parse)
0   1897-01-01
1   1899-12-31
2   1900-01-01
Python is reading 00 as 2000 instead of 1900, so I tried this to edit 00 to read as 1900:
df2.Date.dt.year.replace(2000, 1900, inplace=True)
But Python returned an error saying that dates are not directly editable. So I then changed them to strings and edited that way using:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
This works, but now I need to find a way to loop through 1896-1968 without having to type that line out every time.
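One way to avoid the per-year replace loop entirely (my own sketch, assuming all the DD/MM/YY rows belong to 1900-1968) is to parse each format separately and combine the results:
import pandas as pd

s = pd.Series(['1897-01-01', '1899-12-31', '01/01/00'])

# Parse each format separately; rows that don't match become NaT
iso = pd.to_datetime(s, format='%Y-%m-%d', errors='coerce')
dmy = pd.to_datetime(s, format='%d/%m/%y', errors='coerce')

# %y maps 00-68 to 2000-2068, so shift the DD/MM/YY rows back a century
dmy = dmy - pd.DateOffset(years=100)

parsed = iso.fillna(dmy)
# 0   1897-01-01
# 1   1899-12-31
# 2   1900-01-01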

Pandas: slow date conversion

I'm reading a huge CSV with a date field in the format YYYYMMDD and I'm using the following lambda to convert it when reading:
import pandas as pd
df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t),
                                                      format='%Y%m%d', coerce=True))
This function is very slow though.
Any suggestion to improve it?
Note: As #ritchie46's answer states, this solution may be redundant since pandas version 0.25 per the new argument cache_dates that defaults to True
Try using this function for parsing dates:
def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date, format=format) for date in date_pd_series.unique()}
    return date_pd_series.map(dates)
Use it like:
df['date-column'] = lookup(df['date-column'], format='%Y%m%d')
Benchmarks:
$ python date-parse.py
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse
Great suggestion from #EdChum: using infer_datetime_format=True can be significantly faster. Below is my example.
I have a file of temperature data from a sensor log, which looks like this:
RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...
My code reads the csv and parses the date (parse_dates=['Date']).
With infer_datetime_format=False, it takes 8min 8sec:
Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.
With infer_datetime_format=True, it takes 13sec:
Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.
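For reference, the read call with the flag enabled might look like this (a sketch; the file name is an assumption):
import pandas as pd

df = pd.read_csv('temperature_log.csv',
                 parse_dates=['Date'],
                 infer_datetime_format=True)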
Unless you're stuck with a very old version of pandas, pre 0.25, this answer is not for you.
The functionality described here has been merged into pandas in version 0.25
Streamlined date parsing with caching
Reading all data and then converting it will always be slower than converting while reading the CSV, since you won't need to iterate over all the data twice if you do it right away. You also don't have to store it as strings in memory.
We can define our own date parser that utilizes a cache for the dates it has already seen.
import pandas as pd

cache = {}

def cached_date_parser(s):
    if s in cache:
        return cache[s]
    # errors='coerce' replaces the old coerce=True keyword
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
    cache[s] = dt
    return dt

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)
This has the same advantage as #fixxxer's answer of only parsing each string once, with the extra added bonus of not having to read all the data and THEN parse it, saving you memory and processing time.
Since pandas version 0.25 the function pandas.read_csv accepts a cache_dates=boolean (which defaults to True) keyword argument. So no need to write your own function for caching as done in the accepted answer.
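On 0.25+ the original read can therefore be reduced to something like this (a sketch; cache_dates is shown explicitly even though it is the default):
import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 cache_dates=True)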
No need to specify a date_parser; pandas is able to parse this without any trouble, and it will be much faster:
In [21]:
import io
import pandas as pd
t="""date,val
20120608,12321
20130608,12321
20140308,12321"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null datetime64[ns]
val 3 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 72.0 bytes
In [22]:
df
Out[22]:
date val
0 2012-06-08 12321
1 2013-06-08 12321
2 2014-03-08 12321
Try the standard library:
import datetime
parser = lambda t: datetime.datetime.strptime(str(t), "%Y%m%d")
However, I don't really know if this is much faster than pandas.
Since your format is so simple, what about
import datetime

def parse(t):
    string_ = str(t)
    return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
EDIT: you say you need to take care of invalid data.
def parse(t):
    string_ = str(t)
    try:
        return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
    except ValueError:
        return default_datetime  # you should define that somewhere else
All in all, I'm a bit conflicted about the validity of your problem:
you need to be fast, but still you get your data from a CSV
you need to be fast, but still need to deal with invalid data
That's kind of contradictory; my personal approach here would be to assume that your "huge" CSV just needs to be brought into a better-performing format once, and you either shouldn't care about the speed of that conversion process (because it only happens once) or you should get whatever produces the CSV to give you better data; there are so many formats that don't rely on string parsing.
If your datetime column has a UTC timestamp and you only need part of it, convert it to a string, slice what you need, and then apply the below for much faster access.
created_at
2018-01-31 15:15:08 UTC
2018-01-31 15:16:02 UTC
2018-01-31 15:27:10 UTC
2018-02-01 07:05:55 UTC
2018-02-01 08:50:14 UTC
df["date"]= df["created_at"].apply(lambda x: str(x)[:10])
df["date"] = pd.to_datetime(df["date"])
I have a csv with ~150k rows. After trying almost all the suggestions in this post, I found it 25% faster to:
read the file row by row using Python 3.7's native csv.reader
convert all 4 numeric columns using float()
parse the date column with datetime.datetime.fromisoformat()
and behold: finally convert the list to a DataFrame (!)
It baffles me how this can be faster than native pandas pd.read_csv(...)... can someone explain?
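A sketch of that recipe (the file name and column layout are assumptions: an ISO date column followed by four numeric columns):
import csv
import datetime
import pandas as pd

rows = []
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for rec in reader:
        # fromisoformat for the date, float() for the numeric columns
        rows.append([datetime.datetime.fromisoformat(rec[0])]
                    + [float(x) for x in rec[1:5]])

df = pd.DataFrame(rows, columns=['date', 'a', 'b', 'c', 'd'])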
