Convert (Dutch) datetime string to datetime format in Pandas - python

I have a CSV file that contains information about the period August 22, 2022 up to September 21, 2022. I loaded the CSV into Python using the Pandas library. The timestamps in the CSV file are in a Dutch format (and are strings), i.e., %d-%m-%Y. When I use pd.to_datetime() for the timestamps, not all data points are converted correctly. For example:
Old (in %d-%m-%Y)    New
22-08-22             2022-08-22 (%Y-%m-%d)
31-08-22             2022-08-31 (%Y-%m-%d)
01-09-22             2022-01-09 (%Y-%d-%m)
06-09-22             2022-06-09 (%Y-%d-%m)
12-09-22             2022-12-09 (%Y-%d-%m)
13-09-22             2022-09-13 (%Y-%m-%d)
21-09-22             2022-09-21 (%Y-%m-%d)
So, for some data points the months and days are interchanged. I want to convert the strings into the right datetime format. How can I solve this?
Thanks in advance!

If you read the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html you will see that pd.to_datetime can take yearfirst and dayfirst as parameters.
In your case, just pass dayfirst=True to it and that's it.

You have to specify the format during conversion because the dates here are ambiguous to pandas. Without an explicit format, pandas guesses one and, by default, treats ambiguous dates as month-first.
So try this instead:
df.<your_date_col> = pd.to_datetime(df.<your_date_col>, format='%d-%m-%y')
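To illustrate the explicit-format approach on the sample dates from the question, here is a minimal sketch. The Series name raw and the hard-coded values are assumptions for demonstration only; in practice this would be the date column of your own DataFrame.

```python
import pandas as pd

# Sample values copied from the question (day-month-two-digit-year strings).
raw = pd.Series(['22-08-22', '31-08-22', '01-09-22', '06-09-22',
                 '12-09-22', '13-09-22', '21-09-22'])

# An explicit format removes the day/month ambiguity:
# %d = day, %m = month, %y = two-digit year.
parsed = pd.to_datetime(raw, format='%d-%m-%y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

With the format given, every value lands in August or September, including the ambiguous ones such as 01-09-22.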

Related

What is the format for UNIX timestamp of '253402128000000'?

I'm trying to convert a whole column of timestamp values in UNIX format, but I get some values that don't look like a normal timestamp: 253402128000000
As far as I know, a timestamp should look like: 1495245009655
I've tried milliseconds, nanoseconds and other unit settings for Pandas to_datetime, but I haven't been able to find one that converts the value.
EDIT
My data looks like below and the ValidEndDateTime seems way off.
"ValidStartDateTime": "/Date(1495245009655)/",
"ValidEndDateTime": "/Date(253402128000000)/",
SOLUTION
I've accepted the answer below because I can see the date is a "never-end" date: all the values in my dataset that can't be converted are set to the same value, 253402128000000
Thank you for the answers!
From a comment of yours:
The data I get looks like this: "ValidStartDateTime": "/Date(1495245009655)/", "ValidEndDateTime": "/Date(253402128000000)/",
The numbers appear to be UNIX timestamps in milliseconds and the big "End" one seems to mean "never end", note the special date:
1495245009655 = Sat May 20 2017 01:50:09
253402128000000 = Thu Dec 30 9999 00:00:00
Converted with https://currentmillis.com/
I think the value is in microseconds: dividing it by 1,000,000 gives 253402128 seconds.
That corresponds to a date approximately 44 years ago.
Format: Microseconds (1/1,000,000 second)
GMT: Wed Jan 11 1978 21:28:48 GMT+0000
I used this website as reference: https://www.unixtimestamp.com/
Use pd.to_datetime:
>>> pd.to_datetime(1495245009655, unit='ms')
Timestamp('2017-05-20 01:50:09.655000')
>>> pd.to_datetime(253402128000000 / 100, unit='ms')
Timestamp('2050-04-19 22:48:00')
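If the big value is read as milliseconds, as the accepted answer suggests, it falls outside the range of a nanosecond-resolution pandas Timestamp but still fits in the standard library's datetime. Below is a minimal sketch using only the standard library. The helper parse_ms_date and the constant NEVER_ENDS_MS are hypothetical names, and treating that value as a "never ends" sentinel is an assumption based on the question's own conclusion.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
DATE_PATTERN = re.compile(r"/Date\((\d+)\)/")
NEVER_ENDS_MS = 253402128000000  # sentinel value seen in the question's data


def parse_ms_date(raw):
    """Return an aware datetime, or None for the assumed 'never ends' sentinel."""
    match = DATE_PATTERN.fullmatch(raw)
    if match is None:
        raise ValueError(f"unexpected date literal: {raw!r}")
    millis = int(match.group(1))
    if millis == NEVER_ENDS_MS:
        return None
    return EPOCH + timedelta(milliseconds=millis)


start = parse_ms_date("/Date(1495245009655)/")
end = parse_ms_date("/Date(253402128000000)/")

# The sentinel itself is still representable by the standard library.
sentinel_as_date = EPOCH + timedelta(milliseconds=NEVER_ENDS_MS)

print(start)             # 2017-05-20 01:50:09.655000+00:00
print(end)               # None
print(sentinel_as_date)  # 9999-12-30 00:00:00+00:00
```

Mapping the sentinel to None (or NaT in pandas) keeps "no end date" distinct from a real timestamp.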

Python Pandas read_csv function does not allow to change parsed dates into required format

I am a python beginner and am trying to read a csv file with pandas. The issue is that the date column in the csv has following format: 2020-03-12 00:00:00+00:00. Within the read_csv function already, I want to change the date format into isoformat (%Y-%m-%d). I tried all stackoverflow solutions but none of them work. This is my code:
import time
from datetime import date
import pandas as pd
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url,
                        usecols=[2, 5, 8],
                        index_col=['Landkreis', 'Meldedatum'],
                        parse_dates=['Meldedatum'],
                        squeeze=True
                        ).sort_index()
The column "Meldedatum" should only show the date, not the hours and minutes. Yet, I can't change the format because it is an index column.
Your help is much appreciated!
Read your csv normally into dataframe without specifying any format.
Then do this:
countries['Meldedatum'] = pd.to_datetime(countries['Meldedatum'])
This should give you the format you want.
That's just how pandas displays a datetime object. It always stores fields for hours/minutes/seconds/milliseconds, even if they are all set to zero. You can't change this internal representation.
You can, however, cast datetime objects to string, in order to format their representation the way you want. Keep in mind that you lose all functionality of a datetime object along the way.
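A minimal sketch of that trade-off, assuming the timestamps look like the one quoted in the question. The Series below is made-up sample data, not the real download.

```python
import pandas as pd

# Two sample values in the format from the question.
stamps = pd.to_datetime(
    pd.Series(['2020-03-12 00:00:00+00:00', '2020-03-13 00:00:00+00:00']),
    utc=True,
)

# Option 1: format as text. The display is date-only, but the values are
# now plain strings and no longer support datetime operations.
as_text = stamps.dt.strftime('%Y-%m-%d')

# Option 2: convert to datetime.date objects (stored with object dtype).
as_dates = stamps.dt.date

print(as_text.tolist())
print(as_dates.tolist())
```

Either option changes only how the values are stored and shown afterwards; the parsed column stamps itself still carries the time fields.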
It looks like you want to count the number of occurrences per day. If that's the case, you should use a groupby object. We don't need to set the index columns or parse dates in this case. We can also convert the representation of the datetime objects to strings, if that's your preference:
import pandas as pd
# get the data
url = 'https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data'
countries = pd.read_csv(url, usecols=[2, 5, 8], index_col=None).sort_index()
# modify dates to strings
countries['Meldedatum'] = countries.Meldedatum.astype(str).apply(lambda x: x.split('T')[0])
# group by Landkreis and Meldedatum
grouped_countries = countries.groupby(['Landkreis', 'Meldedatum']).count()
print(grouped_countries)
# output:
                               AnzahlFall
Landkreis          Meldedatum
LK Ahrweiler       2020-03-12           5
                   2020-03-13           2
                   2020-03-14           1
                   2020-03-16           3
                   2020-03-17           5
...                                   ...
StadtRegion Aachen 2020-04-14           8
                   2020-04-15          37
                   2020-04-16          23
                   2020-04-17          18
                   2020-04-18           5

Working on dates with mm-dd-YY & YY-mm-dd format in pandas

I am trying to do a simple test of pandas' capabilities for handling dates and formats.
For that I have created a dataframe with the values below:
df = pd.DataFrame({'date1' : ['10-11-11','12-11-12','10-10-10','12-11-11',
'12-12-12','11-12-11','11-11-11']})
Here I am assuming that the values are dates. And I am converting it into proper format using pandas' to_datetime function.
df['format_date1'] = pd.to_datetime(df['date1'])
print(df)
Out[3]:
      date1 format_date1
0  10-11-11   2011-10-11
1  12-11-12   2012-12-11
2  10-10-10   2010-10-10
3  12-11-11   2011-12-11
4  12-12-12   2012-12-12
5  11-12-11   2011-11-12
6  11-11-11   2011-11-11
Here, Pandas is reading the dates of the dataframe as "MM/DD/YY" and converting them into its native format (i.e. YYYY/MM/DD). I want to check whether Pandas can take my input indicating that the date format is actually "YY/MM/DD" and then convert it into its native format. This will change the value of row no. 5. To do this, I ran the following code, but it gives me an error.
df3['format_date2'] = pd.to_datetime(df3['date1'], format='%Y/%m/%d')
ValueError: time data '10-10-10' does not match format '%Y/%m/%d' (match)
I have seen the sort of solution here. But I was hoping to get a little easy and crisp answer.
%Y in the format specifier takes the 4-digit year (i.e. 2016). %y takes the 2-digit year (i.e. 16, meaning 2016). Change the %Y to %y and it should work.
Also the dashes in your format specifier are not present. You need to change your format to %y-%m-%d
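Putting both corrections together, a minimal sketch with the dataframe from the question (the column name format_date2 follows the question's own code):

```python
import pandas as pd

df = pd.DataFrame({'date1': ['10-11-11', '12-11-12', '10-10-10', '12-11-11',
                             '12-12-12', '11-12-11', '11-11-11']})

# %y = two-digit year, and dashes as separators to match the input strings.
df['format_date2'] = pd.to_datetime(df['date1'], format='%y-%m-%d')

print(df)
```

Row 5 ('11-12-11') now reads as 2011-12-11 instead of 2011-11-12.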

Generate a datetime format string from timestamp

I want to generate time/date format strings from the input data I got.
Is there an easy way to do this?
My input data looks like this:
'01.12.2016 23:30:59,123'
So my code should generate the following format string:
'%d.%m.%Y %H:%M:%S,%f'
Background:
I used pandas.to_datetime() to generate datetime objects for further processing. This works great, but the function gets slow (it uses dateutil.parser.parse here) with a lot of data (>~50k). At the moment I'm providing the format string above hardcoded within my code to speed up to_datetime(), which also works great. Now I want to generate the format string within the code to be more flexible regarding the input data.
edit (because the first two answers do not fit to my question):
I want to generate the format string not the datetime string.
edit2:
New approach to formulating the question: I'm reading in a file with a lot of data. Every line of data has a timestamp with the following format: '01.12.2016 23:30:59,123'. I want to convert these timestamps into datetime objects. For this I'm using pandas.to_datetime() at the moment. This function works perfectly, but it gets slow since I have some files with over 50k datasets. To speed this process up I'm passing a format string to the function: pandas.to_datetime(format='%d.%m.%Y %H:%M:%S,%f'). This speeds up the process but it is less flexible. Therefore I want to determine the format string only for the first dataset and use it for the rest of the 50k or more datasets.
How is this possible?
You can try the infer_datetime_format parameter, but be aware that pd.to_datetime() uses dayfirst=False by default.
Demo:
In [422]: s
Out[422]:
0 01.12.2016 23:30:59,123
1 23.12.2016 03:30:59,123
2 31.12.2016 13:30:59,123
dtype: object
In [423]: pd.to_datetime(s, infer_datetime_format=True)
Out[423]:
0 2016-01-12 23:30:59.123
1 2016-12-23 03:30:59.123
2 2016-12-31 13:30:59.123
dtype: datetime64[ns]
In [424]: pd.to_datetime(s, infer_datetime_format=True, dayfirst=True)
Out[424]:
0 2016-12-01 23:30:59.123
1 2016-12-23 03:30:59.123
2 2016-12-31 13:30:59.123
dtype: datetime64[ns]
Use "datetime" to return the date and time. I think this will help you.
import datetime
print(datetime.datetime.now().strftime('%d.%m.%Y %H:%M:%S,%f'))
You can use datetime.strptime() inside datetime package which would return a datetime.datetime object.
In your case you should do something like:
datetime.strptime('01.12.2016 23:30:59,123', '%d.%m.%Y %H:%M:%S,%f').
After you have the datetime.datetime object, you can use datetime.strftime() function to get the datetime in the desired string format.
You should probably have a look here: https://github.com/humangeo/DateSense/
From its documentation:
>>> import DateSense
>>> print DateSense.detect_format( ["15 Dec 2014", "9 Jan 2015"] )
%d %b %Y
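Another option, if the set of possible formats is known in advance, is to test the first timestamp against a short list of candidates and reuse whichever one matches. This is only a sketch using the standard library: the function detect_format and the list CANDIDATE_FORMATS are hypothetical names, and the candidate list is an assumption you would adapt to your own data.

```python
from datetime import datetime

# Hypothetical list of formats the input files are expected to use.
CANDIDATE_FORMATS = [
    '%Y-%m-%d %H:%M:%S',
    '%d.%m.%Y %H:%M:%S,%f',
    '%d-%m-%y',
]


def detect_format(sample, candidates=CANDIDATE_FORMATS):
    """Return the first candidate format that parses `sample`, else None."""
    for fmt in candidates:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue
        return fmt
    return None


detected = detect_format('01.12.2016 23:30:59,123')
print(detected)  # %d.%m.%Y %H:%M:%S,%f
```

The detected string can then be passed as format= to pandas.to_datetime() for the remaining rows. Note that a single sample cannot distinguish day-first from month-first when both readings are valid, so the order of the candidates matters.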

Python CSV data analysis based on date time

I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.
Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
Here are the requirements: “What is the LATEST movement type for each serial number?”
I need to parse the CSV file and for each UNIQUE serial number, take the movement type that has the LATEST “posting date”.
As an example, for Serial Number 2LMXK1 the latest posting date/time is 1/5/15 at 14:00.
Here is basically what I will need to obtain:
“Serial Number 2LMXK1 has a movement type 301 and was last updated 1/5/15 14:00”.
I have started with some code that parses the CSV file and creates a dictionary.
#Import modules
import csv
import pandas as pd
fields = ['Serial number','Movement type','Posting date']
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields)
dc = df.to_dict()
#print (df['Serial number'])
for value in dc.items():
    print (value)
This code works to parse the CSV and create a dictionary.
However, I need help with the date comparison and filtering techniques. How may I create another dictionary that only lists unique serial numbers with the latest posting date? Once I have created a new filtered data dictionary I can use that to import into our asset management database. The idea is that I will use python to analyze and manipulate the data before importing into our system.
Pandas is a useful library for more than just reading csv files. In fact, you don't need the csv library at all here (it's not being used in the code sample you posted)
First you need to make sure the dates are read in as dates, by using the parse_dates parameter of the read_csv function. Then you can use pandas' grouping functionality.
# parse the 3rd column (index 2) as dates
df = pd.read_csv('import.csv', skipinitialspace=True, usecols=fields, parse_dates=[2])
last_movement = df.sort_values('Posting date').groupby('Serial number').last()
To create the string that you want, you can then iterate through the rows of last_movement:
for index, row in last_movement.iterrows():
    print('Serial Number {} has a movement type {} and was last updated {}'
          .format(index, row['Movement type'], row['Posting date']))
Which will produce the following:
Serial Number 2LMXK1 has a movement type 301 and was last updated 2015-01-05 14:00:00
Serial Number BR83GP has a movement type 301 and was last updated 2015-01-09 15:30:00
Serial Number JEMLP3 has a movement type 203 and was last updated 2015-01-07 17:30:00
Side note: Pandas should be able to read the column headings for you, so you shouldn't need the usecols parameter
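For reference, here is a self-contained sketch of the same sort-then-group approach that runs without import.csv by reading the sample rows from a string. The explicit month-first format is an assumption, chosen to match the output shown above, and the final to_dict call is one possible way to get the dictionary the question asks for.

```python
import io
import pandas as pd

CSV_TEXT = """Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Assumption: dates are month/day/two-digit-year.
df['Posting date'] = pd.to_datetime(df['Posting date'], format='%m/%d/%y %H:%M')

# Sort chronologically, then keep the last row per serial number.
last_movement = df.sort_values('Posting date').groupby('Serial number').last()

# One dictionary entry per serial number, e.g. for a later import step.
latest = last_movement.to_dict(orient='index')
print(latest['2LMXK1'])
```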
The dict creation or the best way to sort the list depends a little on what you want. For the parsing side of things, to convert a string into a date object so you can then do sane comparisons, you probably want the datetime class in the datetime module (yes, datetime.datetime).
It's got a strptime() function that will do exactly that:
import datetime
datetime.datetime.strptime(r"1/5/15 13:00", "%d/%m/%y %H:%M")
# I've assumed you have a Day/Month/Year format
The only bit of oddness is the format specifier, which is documented here:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
(note that where it talks about zero-padded, that's for output. It'll parse non-zero padded numbers fine)
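Building on strptime(), the filtered dictionary can also be produced with the standard library alone. This is a rough sketch: the function name latest_movements is hypothetical, the default format keeps the Day/Month/Year assumption made above, and the four sample rows are a shortened excerpt of the question's data.

```python
import csv
import io
from datetime import datetime


def latest_movements(csv_file, date_format="%d/%m/%y %H:%M"):
    """Map each serial number to (movement type, latest posting date)."""
    latest = {}
    for row in csv.DictReader(csv_file):
        posted = datetime.strptime(row["Posting date"], date_format)
        serial = row["Serial number"]
        # Keep the row only if it is newer than what we have seen so far.
        if serial not in latest or posted > latest[serial][1]:
            latest[serial] = (row["Movement type"], posted)
    return latest


sample = io.StringIO(
    "Serial number,Movement type,Posting date\n"
    "2LMXK1,101,1/5/15 9:00\n"
    "2LMXK1,301,1/5/15 14:00\n"
    "JEMLP3,203,1/7/15 17:30\n"
    "JEMLP3,101,1/6/15 9:00\n"
)
result = latest_movements(sample)
for serial, (movement, posted) in result.items():
    print(serial, movement, posted)
```

If the dates are actually Month/Day/Year, pass date_format="%m/%d/%y %H:%M" instead. Movement types stay strings here because csv.DictReader does not convert values.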
