Pandas: slow date conversion - python

I'm reading a huge CSV with a date field in the format YYYYMMDD and I'm using the following lambda to convert it when reading:
import pandas as pd
df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t),
                                                      format='%Y%m%d', errors='coerce'))
This function is very slow though.
Any suggestion to improve it?

Note: As @ritchie46's answer states, this solution may be redundant since pandas version 0.25 because of the new cache_dates argument, which defaults to True.
Try using this function for parsing dates:
def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date, format=format) for date in date_pd_series.unique()}
    return date_pd_series.map(dates)
Use it like:
df['date-column'] = lookup(df['date-column'], format='%Y%m%d')
Benchmarks:
$ python date-parse.py
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse

Great suggestion @EdChum! As @EdChum suggests, using infer_datetime_format=True can be significantly faster. Below is my example.
I have a file of temperature data from a sensor log, which looks like this:
RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...
My code reads the csv and parses the date (parse_dates=['Date']).
With infer_datetime_format=False, it takes 8min 8sec:
Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.
With infer_datetime_format=True, it takes 13sec:
Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.

Unless you're stuck with a very old version of pandas, pre 0.25, this answer is not for you.
The functionality described here has been merged into pandas in version 0.25
Streamlined date parsing with caching
Reading all the data and then converting it will always be slower than converting while reading the CSV, since converting right away means you don't iterate over all the data twice. You also don't have to store the dates as strings in memory.
We can define our own date parser that utilizes a cache for the dates it has already seen.
import pandas as pd
cache = {}
def cached_date_parser(s):
    # Return the already-parsed value if this string was seen before
    if s in cache:
        return cache[s]
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
    cache[s] = dt
    return dt
df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)
This has the same advantage as @fixxxer's answer, parsing each string only once, with the added bonus of not having to read all the data and THEN parse it, which saves you memory and processing time.

Since pandas version 0.25 the function pandas.read_csv accepts a cache_dates=boolean (which defaults to True) keyword argument. So no need to write your own function for caching as done in the accepted answer.
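A minimal sketch of what that looks like, assuming a headerless file whose first column holds YYYYMMDD values. The sample data and the column names date and val are made up for illustration, and cache_dates=True is spelled out only to make the default visible:

```python
import io

import pandas as pd

# Hypothetical sample: repeated YYYYMMDD dates in the first column
csv_text = """20120608,1
20120608,2
20130608,3
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["date", "val"],   # illustrative column names
    parse_dates=["date"],    # parse this column while reading
    cache_dates=True,        # default since pandas 0.25; repeated strings are parsed once
)
print(df.dtypes)
```

With many repeated dates the cache should give a benefit similar to the hand-written lookup, without a second pass over the column.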

No need to specify a date_parser, pandas is able to parse this without any trouble, plus it will be much faster:
In [21]:
import io
import pandas as pd
t="""date,val
20120608,12321
20130608,12321
20140308,12321"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null datetime64[ns]
val 3 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 72.0 bytes
In [22]:
df
Out[22]:
        date    val
0 2012-06-08  12321
1 2013-06-08  12321
2 2014-03-08  12321

Try the standard library:
import datetime
parser = lambda t: datetime.datetime.strptime(str(t), "%Y%m%d")
However, I don't really know if this is much faster than pandas.
Since your format is so simple, what about
def parse(t):
    string_ = str(t)
    return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
EDIT: You say you need to take care of invalid data.
def parse(t):
    string_ = str(t)
    try:
        return datetime.date(int(string_[:4]), int(string_[4:6]), int(string_[6:]))
    except ValueError:
        # raised by int() for non-numeric text and by date() for impossible dates
        return default_datetime  # you should define that somewhere else
All in all, I'm a bit conflicted about the validity of your problem:
you need to be fast, but still you get your data from a CSV
you need to be fast, but still need to deal with invalid data
That's kind of contradictory. My personal approach here would be to assume that your "huge" CSV just needs to be brought into a better-performing format once. Then either you shouldn't care about the speed of that conversion (because it only happens once), or you should get whatever produces the CSV to give you better data. There are many formats that don't rely on string parsing.

If your datetime column holds UTC timestamps and you only need part of each one, convert it to a string, slice out what you need, and then apply the code below for much faster parsing.
created_at
2018-01-31 15:15:08 UTC
2018-01-31 15:16:02 UTC
2018-01-31 15:27:10 UTC
2018-02-01 07:05:55 UTC
2018-02-01 08:50:14 UTC
df["date"]= df["created_at"].apply(lambda x: str(x)[:10])
df["date"] = pd.to_datetime(df["date"])

I have a CSV with ~150k rows. After trying almost all the suggestions in this post, I found it 25% faster to:
read the file row by row using the native csv.reader in Python 3.7,
convert all 4 numeric columns using float(),
parse the date column with datetime.datetime.fromisoformat(),
and, behold, finally convert the list to a DataFrame (!)
It baffles me how this can be faster than the native pandas pd.read_csv(...). Can someone explain?
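The steps above can be sketched roughly as follows. The column layout (one ISO-formatted timestamp followed by four numeric columns), the sample rows and the helper name load_rows are assumptions for illustration, not the poster's actual file or code:

```python
import csv
import io
from datetime import datetime

import pandas as pd

# Hypothetical sample: one ISO timestamp column followed by four numeric columns
csv_text = """timestamp,a,b,c,d
2019-01-01 00:00:00,1.0,2.0,3.0,4.0
2019-01-01 00:05:00,1.5,2.5,3.5,4.5
"""

def load_rows(handle):
    """Read rows with csv.reader, converting each field by hand."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = []
    for record in reader:
        stamp = datetime.fromisoformat(record[0])       # Python 3.7+
        values = [float(field) for field in record[1:]]
        rows.append([stamp] + values)
    return header, rows

header, rows = load_rows(io.StringIO(csv_text))
df = pd.DataFrame(rows, columns=header)  # build the DataFrame in one step at the end
print(df.dtypes)
```

One plausible reason this can win is that fromisoformat() only handles a single fixed layout, so it skips the format guessing that a generic parser may do per value.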

Related

Convert (Dutch) datetime string to datetime format in Pandas

I have a CSV file that contains information about the period August 22, 2022 up to September 21, 2022. I loaded the CSV into Python using the Pandas library. The timestamps in the CSV file are in a Dutch format (and are strings), i.e., %d-%m-%y. When I use pd.to_datetime() for the timestamps, not all data points are converted correctly. For example:
Old (in %d-%m-%y)    New
22-08-22             2022-08-22 (%Y-%m-%d)
31-08-22             2022-08-31 (%Y-%m-%d)
01-09-22             2022-01-09 (%Y-%d-%m)
06-09-22             2022-06-09 (%Y-%d-%m)
12-09-22             2022-12-09 (%Y-%d-%m)
13-09-22             2022-09-13 (%Y-%m-%d)
21-09-22             2022-09-21 (%Y-%m-%d)
So, for some data points the months and days are interchanged. I want to convert the strings into the right datetime format. How to solve this?
Thanks in advance!
If you read the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) you will see that pd.to_datetime accepts yearfirst and dayfirst as parameters.
In your case, just pass dayfirst=True and that's it.
You have to specify the format during conversion because these dates are ambiguous to pandas. Without a format, pandas guesses, and by default it tries month-first, so a value such as 01-09-22 is read as January 9.
So try this instead:
df['<your_date_col>'] = pd.to_datetime(df['<your_date_col>'], format='%d-%m-%y')
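As a small worked example of the explicit-format approach, here are a few of the values from the question parsed with format='%d-%m-%y'. The Series name raw is only illustrative:

```python
import pandas as pd

# Values taken from the question; day comes first, year has two digits
raw = pd.Series(["31-08-22", "01-09-22", "12-09-22", "13-09-22"])

parsed = pd.to_datetime(raw, format="%d-%m-%y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2022-08-31', '2022-09-01', '2022-09-12', '2022-09-13']
```

With the format fixed, every value is read day-first, including the ambiguous ones where the day is 12 or lower.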

What is the format for UNIX timestamp of '253402128000000'?

I'm trying to convert a whole column of timestamp values in UNIX format, but some values don't look like a normal timestamp: 253402128000000
As far as I know, a timestamp should look like: 1495245009655
I've tried milliseconds, nanoseconds and other configurations for Pandas to_datetime, but I haven't been able to find one that can convert the value.
EDIT
My data looks like below and the ValidEndDateTime seems way off.
"ValidStartDateTime": "/Date(1495245009655)/",
"ValidEndDateTime": "/Date(253402128000000)/",
SOLUTION
I've accepted the answer below because I can see the date is a "never-end" date: all the values in my dataset that can't be converted are set to the same value, 253402128000000.
Thank you for the answers!
From a comment of yours:
The data I get looks like this: "ValidStartDateTime": "/Date(1495245009655)/", "ValidEndDateTime": "/Date(253402128000000)/",
The numbers appear to be UNIX timestamps in milliseconds and the big "End" one seems to mean "never end", note the special date:
1495245009655 = Sat May 20 2017 01:50:09
253402128000000 = Thu Dec 30 9999 00:00:00
Converted with https://currentmillis.com/
I think the value was meant to be divided by 1,000,000, giving 253402128, and then read as seconds.
That would be approximately 44 years ago.
Format: Microseconds (1/1,000,000 second)
GMT: Wed Jan 11 1978 21:28:48 GMT+0000
I used this website as reference: https://www.unixtimestamp.com/
Use pd.to_datetime:
>>> pd.to_datetime(1495245009655, unit='ms')
Timestamp('2017-05-20 01:50:09.655000')
>>> pd.to_datetime(253402128000000 / 100, unit='ms')
Timestamp('2050-04-19 22:48:00')
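To make the "never end" reading concrete, here is a minimal sketch that uses only the standard library. It assumes the values arrive as "/Date(...)/" strings as in the question. The names parse_ms_date and NEVER_ENDS_MS are hypothetical, and mapping the sentinel to None is one possible choice rather than an established convention. pandas' default nanosecond timestamps only reach the year 2262, which is consistent with the conversion failing in the question, whereas datetime.datetime goes up to the year 9999:

```python
import re
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
NEVER_ENDS_MS = 253402128000000  # sentinel value seen in the question's data

def parse_ms_date(text):
    """Parse '/Date(<milliseconds>)/' into a naive UTC datetime, or None for the sentinel."""
    match = re.fullmatch(r"/Date\((-?\d+)\)/", text)
    if match is None:
        raise ValueError(f"unexpected date literal: {text!r}")
    millis = int(match.group(1))
    if millis == NEVER_ENDS_MS:
        return None  # assumption: treat the far-future value as "no end date"
    return EPOCH + timedelta(milliseconds=millis)

print(parse_ms_date("/Date(1495245009655)/"))           # 2017-05-20 01:50:09.655000
print(parse_ms_date("/Date(253402128000000)/"))         # None
print(EPOCH + timedelta(milliseconds=NEVER_ENDS_MS))    # 9999-12-30 00:00:00
```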

Python Pandas read_csv function does not allow to change parsed dates into required format

I am a python beginner and am trying to read a csv file with pandas. The issue is that the date column in the csv has following format: 2020-03-12 00:00:00+00:00. Within the read_csv function already, I want to change the date format into isoformat (%Y-%m-%d). I tried all stackoverflow solutions but none of them work. This is my code:
import time
from datetime import date
import pandas as pd
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url,
                        usecols=[2, 5, 8],
                        index_col=['Landkreis', 'Meldedatum'],
                        parse_dates=['Meldedatum'],
                        squeeze=True
                        ).sort_index()
Current result
The column "Meldedatum" should only show the date, not the hours and minutes. Yet, I can't change the format because it is an index column.
Your help is much appreciated!
Read your csv normally into dataframe without specifying any format.
Then do this:
countries['Meldedatum'] = pd.to_datetime(countries['Meldedatum'])
This should give you the format you want.
That's just how pandas displays a datetime object. It always stores fields for hours/minutes/seconds/milliseconds, even if they are all set to zero. You can't change this internal representation.
You can, however, cast datetime objects to string, in order to format their representation the way you want. Keep in mind that you lose all functionality of a datetime object along the way.
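A minimal sketch of that cast, assuming values shaped like the one in the question (2020-03-12 00:00:00+00:00). The variable names are illustrative:

```python
import pandas as pd

# Sample values in the format described in the question
stamps = pd.Series(["2020-03-12 00:00:00+00:00", "2020-03-13 00:00:00+00:00"])

parsed = pd.to_datetime(stamps, utc=True)     # timezone-aware datetimes
as_text = parsed.dt.strftime("%Y-%m-%d")      # plain strings, date part only
print(as_text.tolist())
# ['2020-03-12', '2020-03-13']
```

The result is a column of strings, so datetime operations such as resampling or date arithmetic no longer apply to it.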
It looks like you want to count the number of occurrences per day. If that's the case, you should use a groupby object. We don't need to set the index columns or parse dates in this case. We can also convert the representation of the datetime objects to strings, if that's your preference:
import pandas as pd
# get the data (squeeze=True is dropped: it has no effect on a multi-column result and was removed in pandas 2.0)
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url, usecols=[2, 5, 8], index_col=None).sort_index()
# modify dates to strings
countries['Meldedatum'] = countries.Meldedatum.astype(str).apply(lambda x: x.split('T')[0])
# group by Landkreis and Meldedatum
grouped_countries = countries.groupby(['Landkreis', 'Meldedatum']).count()
print(grouped_countries)
# output:
                               AnzahlFall
Landkreis          Meldedatum
LK Ahrweiler       2020-03-12           5
                   2020-03-13           2
                   2020-03-14           1
                   2020-03-16           3
                   2020-03-17           5
...                                   ...
StadtRegion Aachen 2020-04-14           8
                   2020-04-15          37
                   2020-04-16          23
                   2020-04-17          18
                   2020-04-18           5

Generate a datetime format string from timestamp

I want to generate time/date format strings from the input data I got.
Is there an easy way to do this?
My input data looks like this:
'01.12.2016 23:30:59,123'
So my code should generate the following format string:
'%d.%m.%Y %H:%M:%S,%f'
Background:
I used pandas.to_datetime() to generate datetime objects for further processing. This works great, but the function gets slow (it uses dateutil.parser.parse here) with a lot of data (>~50k). At the moment I'm providing the format string above hardcoded within my code to speed up to_datetime(), which also works great. Now I want to generate the format string within the code to be more flexible regarding the input data.
edit (because the first two answers do not fit to my question):
I want to generate the format string not the datetime string.
edit2:
New approach to formulating the question: I'm reading in a file with a lot of data. Every line of data has a timestamp in the following format: '01.12.2016 23:30:59,123'. I want to convert these timestamps into datetime objects. For this I'm using pandas.to_datetime() at the moment. This function works perfectly, but it gets slow since I have some files with over 50k datasets. To speed this up I'm passing a format string to the function: pandas.to_datetime(format='%d.%m.%Y %H:%M:%S,%f'). This speeds up the process but it is less flexible. Therefore I want to evaluate the format string only for the first dataset and use it for the rest of the 50k or more datasets.
How is this possible?
You can try the infer_datetime_format parameter, but be aware that pd.to_datetime() uses dayfirst=False by default.
Demo:
In [422]: s
Out[422]:
0 01.12.2016 23:30:59,123
1 23.12.2016 03:30:59,123
2 31.12.2016 13:30:59,123
dtype: object
In [423]: pd.to_datetime(s, infer_datetime_format=True)
Out[423]:
0 2016-01-12 23:30:59.123
1 2016-12-23 03:30:59.123
2 2016-12-31 13:30:59.123
dtype: datetime64[ns]
In [424]: pd.to_datetime(s, infer_datetime_format=True, dayfirst=True)
Out[424]:
0 2016-12-01 23:30:59.123
1 2016-12-23 03:30:59.123
2 2016-12-31 13:30:59.123
dtype: datetime64[ns]
Use "datetime" to return the date and time. I think this will help you.
import datetime
print(datetime.datetime.now().strftime('%d.%m.%Y %H:%M:%S,%f'))
You can use datetime.strptime() inside datetime package which would return a datetime.datetime object.
In your case you should do something like:
datetime.strptime('01.12.2016 23:30:59,123', '%d.%m.%Y %H:%M:%S,%f')
After you have the datetime.datetime object, you can use datetime.strftime() function to get the datetime in the desired string format.
You should probably have a look here: https://github.com/humangeo/DateSense/
From its documentation:
>>> import DateSense
>>> print DateSense.detect_format( ["15 Dec 2014", "9 Jan 2015"] )
%d %b %Y
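The idea from the question, determining the format once from the first value and reusing it for the rest, can also be sketched without an extra dependency. This is a simple stand-in that tries a fixed list of candidate formats with datetime.strptime. The function name detect_format and the candidate list are hypothetical and would need to be extended for real data:

```python
from datetime import datetime

# Illustrative candidates; extend this list with the formats you expect to meet
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%d.%m.%Y %H:%M:%S,%f",
    "%d.%m.%Y %H:%M:%S",
    "%m/%d/%Y %H:%M:%S",
]

def detect_format(sample, candidates=CANDIDATE_FORMATS):
    """Return the first candidate format that parses the sample, or None."""
    for fmt in candidates:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue  # this candidate does not match, try the next one
        return fmt
    return None

fmt = detect_format("01.12.2016 23:30:59,123")
print(fmt)  # %d.%m.%Y %H:%M:%S,%f
```

The detected string can then be passed as format= to pandas.to_datetime() for the whole column. A day-first versus month-first ambiguity cannot be resolved from a single sample, so the order of the candidates matters.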

Python CSV data analysis based on date time

I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.
Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
Here are the requirements: “What is the LATEST movement type for each serial number?”
I need to parse the CSV file and for each UNIQUE serial number, take the movement type that has the LATEST “posting date”.
As an example, for Serial Number 2LMXK1 the latest posting date/time is 1/5/15 at 14:00.
Here is basically what I will need to obtain:
“Serial Number 2LMXK1 has a movement type 301 and was last updated 1/5/15 14:00”.
I have started with some code that parses the CSV file and creates a dictionary.
#Import modules
import csv
import pandas as pd
fields = ['Serial number','Movement type','Posting date']
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields)
dc = df.to_dict()
#print (df['Serial number'])
for value in dc.items():
    print(value)
This code works to parse the CSV and create a dictionary.
However, I need help with the date comparison and filtering techniques. How may I create another dictionary that only lists unique serial numbers with the latest posting date? Once I have created a new filtered data dictionary I can use that to import into our asset management database. The idea is that I will use python to analyze and manipulate the data before importing into our system.
Pandas is a useful library for more than just reading csv files. In fact, you don't need the csv library at all here (it's not being used in the code sample you posted)
First you need to make sure the dates are read in as dates, by using the parse_dates parameter of the read_csv function. Then you can use pandas' grouping functionality.
# parse the 3rd column (index 2) as dates
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields, parse_dates=[2])
last_movement = df.sort_values('Posting date').groupby('Serial number').last()
To create the string that you want, you can then iterate through the rows of last_movement:
for index, row in last_movement.iterrows():
    print('Serial Number {} has a movement type {} and was last updated {}'
          .format(index, row['Movement type'], row['Posting date']))
Which will produce the following:
Serial Number 2LMXK1 has a movement type 301 and was last updated 2015-01-05 14:00:00
Serial Number BR83GP has a movement type 301 and was last updated 2015-01-09 15:30:00
Serial Number JEMLP3 has a movement type 203 and was last updated 2015-01-07 17:30:00
Side note: Pandas should be able to read the column headings for you, so you shouldn't need the usecols parameter
How best to create the dict or sort the list depends a little on what you want. For the parsing side of things, to convert a string into a date object so you can do sane comparisons, you probably want the datetime class in the datetime module (yes, datetime.datetime).
It's got a strptime() function that will do exactly that:
import datetime
datetime.datetime.strptime(r"1/5/15 13:00", "%d/%m/%y %H:%M")
# I've assumed you have a Day/Month/Year format
The only bit of oddness is the format specifier, which is documented here:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
(note that where it talks about zero-padded, that's for output. It'll parse non-zero padded numbers fine)
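Putting strptime() together with a plain dictionary, a sketch of the "latest movement per serial number" lookup could look like this. It uses a few rows from the question and assumes a month-first date format, which matches the 2015-01-05 style output of the pandas answer. The function name latest_movements is hypothetical:

```python
import csv
import io
from datetime import datetime

# A few rows from the question's sample data
csv_text = """Serial number,Movement type,Posting date
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,203,1/7/15 17:30
JEMLP3,101,1/6/15 9:00
"""

DATE_FORMAT = "%m/%d/%y %H:%M"  # assumption: month comes first

def latest_movements(handle):
    """Map each serial number to (movement type, posting date) of its latest row."""
    latest = {}
    for row in csv.DictReader(handle):
        posted = datetime.strptime(row["Posting date"], DATE_FORMAT)
        serial = row["Serial number"]
        # keep the row only if it is newer than what we have stored so far
        if serial not in latest or posted > latest[serial][1]:
            latest[serial] = (row["Movement type"], posted)
    return latest

result = latest_movements(io.StringIO(csv_text))
for serial, (movement, posted) in result.items():
    print(f"Serial Number {serial} has a movement type {movement} and was last updated {posted}")
```

If the file is actually day-first, only DATE_FORMAT needs to change.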
