The default date format in my CSV is dd/mm/yyyy. When I convert the column to datetime with df['Date'] = pd.to_datetime(df['Date']), the format changes to mm/dd/yyyy.
Then I used df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%d/%m/%Y')
to convert back to dd/mm/yyyy, but the values are now strings (object dtype). I need them as datetime. When I call df['Date'] = pd.to_datetime(df['Date']) again, it goes back to the previous format. I need your help.
You can use the parse_dates and dayfirst arguments of pd.read_csv; see the docs for read_csv().
df = pd.read_csv('myfile.csv', parse_dates=['Date'], dayfirst=True)
This will read the Date column as datetime values, correctly taking the first part of the date input as the day. Note that in general you will want your dates to be stored as datetime objects.
Then, if you need to output the dates as a string you can call dt.strftime():
df['Date'].dt.strftime('%d/%m/%Y')
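As a minimal sketch of the approach above, the snippet below reads a small in-memory CSV instead of myfile.csv. The sample rows and the Value column are made up for illustration. It shows that the Date column stays a datetime while the strings are produced separately.

```python
import io

import pandas as pd

# Hypothetical sample data in dd/mm/yyyy format, standing in for myfile.csv.
csv_text = "Date,Value\n13/01/2019,1\n02/03/2019,2\n"

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], dayfirst=True)

# The column is a real datetime column, so the .dt accessors work.
months = df["Date"].dt.month.tolist()

# Strings are produced only when needed for display or export.
formatted = df["Date"].dt.strftime("%d/%m/%Y").tolist()
print(months, formatted)
```

Note that df["Date"] itself is left untouched; only the formatted copy holds text.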
When I use again this: df['Date'] = pd.to_datetime(df['Date']), it gets back to the previous format.
No, you cannot simultaneously have the string format of your choice and keep your series of type datetime. As remarked here:
datetime series are stored internally as integers. Any
human-readable date representation is just that, a representation,
not the underlying integer. To access your custom formatting, you can
use methods available in Pandas. You can even store such a text
representation in a pd.Series variable:
formatted_dates = df['datetime'].dt.strftime('%m/%d/%Y')
The dtype of formatted_dates will be object, which indicates
that the elements of your series point to arbitrary Python types. In
this case, those arbitrary types happen to be all strings.
Lastly, I strongly recommend you do not convert a datetime series
to strings until the very last step in your workflow. This is because
as soon as you do so, you will no longer be able to use efficient,
vectorised operations on such a series.
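The distinction between the stored datetime and its text representation can be checked directly. This is only a sketch with made-up sample values; the variable names are not from the original post.

```python
import pandas as pd

# Hypothetical day-first date strings.
raw = pd.Series(["01/02/2019", "15/03/2019"])

# Parse once with an explicit format; the result is a datetime series.
dates = pd.to_datetime(raw, format="%d/%m/%Y")

# Formatting produces a separate series of strings.
as_text = dates.dt.strftime("%d/%m/%Y")

is_datetime = pd.api.types.is_datetime64_any_dtype(dates)
text_is_datetime = pd.api.types.is_datetime64_any_dtype(as_text)

# Date arithmetic works on the datetime values, not on the strings.
gap_days = (dates.iloc[1] - dates.iloc[0]).days
print(is_datetime, text_is_datetime, gap_days)
```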
This solution should work for cases where a column has mixed date formats. Add more conditions to the function if needed. The pandas to_datetime() function alone was not working for me, but this seems to work well.
import datetime
import pandas as pd

def format(val):
    # Let pandas parse the value first, then render it as mm/dd/yyyy text.
    a = pd.to_datetime(val, errors='coerce', cache=False).strftime('%m/%d/%Y')
    try:
        # Prefer the day-first reading when it gives a valid date.
        date_time_obj = datetime.datetime.strptime(a, '%d/%m/%Y')
    except ValueError:
        date_time_obj = datetime.datetime.strptime(a, '%m/%d/%Y')
    return date_time_obj.date()
Saving the changes to the same column.
df['Date'] = df['Date'].apply(lambda x: format(x))
Saving as CSV.
df.to_csv(f'{file_name}.csv', index=False, date_format='%s')
Related
I have dates as strings (example: 3/24/2020) that I would like to convert to datetime64[ns] format:
df2['date'] = pd.to_datetime(df1["str_date"], format='%m/%d/%Y')
Using pandas to_datetime on a vaex dataframe results in an error:
ValueError: time data 'str_date' does not match format '%m/%d/%Y' (match)
I have seen a possible duplicate question:
df2['pdate']=df2.date.astype('datetime64[ns]')
However, that answer is a type cast. My case requires parsing strings with a format ('%m/%d/%Y') into datetime64[ns], not just a type cast.
Solution: write a custom function, then use .apply
vaex can use the apply function for object operations, so you can use datetime and np.datetime64 to convert each date string, then apply it.
import numpy as np
from datetime import datetime

def convert_to_datetime(date_string):
    # The format string must match your data, e.g. "%m/%d/%Y" for "3/24/2020".
    return np.datetime64(datetime.strptime(str(date_string), "%Y%m%d%H%M%S"))
df['date'] = df.date.apply(convert_to_datetime)
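The conversion function can be checked on its own, outside vaex, before applying it to a column. The sketch below is a variant that assumes the m/d/Y strings from the question; the fmt parameter is a hypothetical addition, not part of the original answer.

```python
from datetime import datetime

import numpy as np


def convert_to_datetime(date_string, fmt="%m/%d/%Y"):
    # fmt is a hypothetical parameter; it must match the incoming strings.
    return np.datetime64(datetime.strptime(str(date_string), fmt))


# The example value from the question.
converted = convert_to_datetime("3/24/2020")
print(converted)
```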
I have the following datatable, which I would like to filter by dates greater than "2019-01-01". The problem is that the dates are strings.
dt_dates = dt.Frame({"days_date": ['2019-01-01','2019-01-02','2019-01-03']})
This is my best attempt.
dt_dates[f.days_date > datetime.strptime(f.days_date, "2019-01-01")]
This returns the error:
TypeError: strptime() argument 1 must be str, not Expr
What is the best way to filter dates in Python's datatable?
Reference
python datatable
f-expressions
Your datetime syntax for converting a string to a datetime is incorrect.
What you're looking for is:
dt_dates[f.days_date > datetime.strptime(f.days_date, "%Y-%m-%d")]
where the second argument to strptime is the date format.
However, let's take a step back, because strptime expects a plain string rather than an f-expression, so this isn't the right way to do it.
First, we should convert all the dates in your Frame to datetimes. I'll be honest, I've never used datatable, but the syntax looks extremely similar to a pandas DataFrame.
In a DataFrame, we can do the following:
df_date['days_date'] = df_date['days_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
This goes through each row of the 'days_date' column and converts each string into a datetime.
From there, we can use a filter to get the relevant rows:
df_date = df_date[df_date['days_date'] > datetime.strptime("2019-01-01", "%Y-%m-%d")]
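For the pandas route described in this answer, a runnable sketch might look like the following. It swaps the per-row strptime call for pd.to_datetime, which converts the whole column at once; the frame is built from the three sample dates in the question.

```python
import pandas as pd

# The sample dates from the question, as strings.
df_date = pd.DataFrame({"days_date": ["2019-01-01", "2019-01-02", "2019-01-03"]})

# Convert the whole column in one call instead of row by row.
df_date["days_date"] = pd.to_datetime(df_date["days_date"], format="%Y-%m-%d")

# Boolean mask: keep rows strictly after the cutoff date.
later = df_date[df_date["days_date"] > pd.Timestamp("2019-01-01")]
print(later)
```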
datatable version 1.0.0 introduced native support for date and time data types. Note the difference between these two ways to initialize data:
dt_dates = dt.Frame({"days_date": ['2019-01-01','2019-01-02','2019-01-03']})
dt_dates.stypes
> (stype.str32,)
and
dt_dates = dt.Frame({"days_date": ['2019-01-01','2019-01-02','2019-01-03']}, stype="date32")
dt_dates.stypes
> (stype.date32,)
The latter frame contains a days_date column of type datatable.Type.date32, which represents a calendar date. One can then filter by date as follows:
split_date = datetime.datetime.strptime("2019-01-01", "%Y-%m-%d")
dt_split_date = dt.time.ymd(split_date.year, split_date.month, split_date.day)
dt_dates[dt.f.days_date > dt_split_date, :]
I need to print a string on the first multi-index in a date format.
Essentially, I need to delete all data on the first date. But finding out the cause of this error is also very important to me. Thank you very much in advance!
As commented, dt.date returns datetime.date objects, which are different from pandas' datetime type. Use dt.floor('D') or dt.normalize() instead. For example, this would work:
df['Date'] = df.session_started.dt.normalize()
df['Time'] = df.session_started.dt.hour
df_hour = df.groupby(['Date','Time']).checkbooking.count()
df_hour.loc['2019-01-13']
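The difference between dt.date and dt.normalize() can be seen on a small frame. This is a sketch with made-up rows; the column names follow the answer, and the lookup uses an explicit pd.Timestamp rather than a date string.

```python
import pandas as pd

# Hypothetical sessions: two on 2019-01-13 in the same hour, one the next day.
df = pd.DataFrame({
    "session_started": pd.to_datetime(
        ["2019-01-13 09:15", "2019-01-13 09:45", "2019-01-14 10:00"]
    ),
    "checkbooking": [1, 1, 1],
})

# dt.date yields plain datetime.date objects, so the column is not datetime64.
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df.session_started.dt.date)

# dt.normalize() keeps the datetime64 dtype and sets the time to midnight.
df["Date"] = df.session_started.dt.normalize()
df["Time"] = df.session_started.dt.hour
normalized_is_datetime = pd.api.types.is_datetime64_any_dtype(df["Date"])

df_hour = df.groupby(["Date", "Time"]).checkbooking.count()

# Select every hour bucket of the first date.
first_day = df_hour.loc[pd.Timestamp("2019-01-13")]
print(first_day)
```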
I'm getting dates from my API in iso format.
When I'm doing:
df = DataFrame(results)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)
And I look at the CSV I see for example:
lastDeliveryDate
2018-11-21 16:25:53.990000-05:00
However, when I do:
df = DataFrame(results)
df.to_json(path_or_buf=file_name, orient="records",compression='gzip', lines=True)
I see (other record):
"lastDeliveryDate":1543258826689
This is a problem.
When I load the data from the CSV into Google BigQuery, everything is fine. The date is parsed correctly.
But when I changed the loading to JSON, the date is not parsed correctly.
I see the dates in this format:
50866-01-09 23:46:40 UTC
This occurs because to_json() and to_csv() produce different output for dates in ISO format.
How can I fix this? Must I edit the data frame and convert all my date columns to regular UTC? How can I do that? And why is it needed for to_json() but not for to_csv()?
As explained in "How do I translate an ISO 8601 datetime string into a Python datetime object?", I tried:
df["lastDeliveryDate"] = dateutil.parser.parse(df["lastDeliveryDate"])
But it gives:
TypeError: Parser must be a string or character stream, not Series
From the Pandas documentation on to_json():
date_format: {None, ‘epoch’, ‘iso’}
Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.
So, with orient="records", you'll have to set date_format="iso" to get a date-time format that can be understood later on:
df.to_json(path_or_buf=file_name, orient="records", date_format="iso",
           compression='gzip', lines=True)
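The effect of date_format can be compared side by side without writing a file. This is a sketch only: the timestamp is a made-up sample, and to_json returns a string here because no path is passed.

```python
import json

import pandas as pd

# Hypothetical single record with a timezone-aware timestamp.
df = pd.DataFrame(
    {"lastDeliveryDate": pd.to_datetime(["2018-11-26 19:00:26.689"], utc=True)}
)

# Default for orient="records": dates are written as epoch milliseconds.
epoch_json = df.to_json(orient="records", lines=True)

# With date_format="iso": dates are written as ISO 8601 strings.
iso_json = df.to_json(orient="records", lines=True, date_format="iso")

epoch_value = json.loads(epoch_json)["lastDeliveryDate"]
iso_value = json.loads(iso_json)["lastDeliveryDate"]
print(epoch_value, iso_value)
```

A loader that expects a timestamp string or epoch seconds will misread the epoch-millisecond integer, which is consistent with the far-future dates described in the question.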
Basically, dateutil.parser.parse() expects a string as its parameter, but you passed the whole column. Try it with a lambda function:
df["lastDeliveryDate"] = df["lastDeliveryDate"].apply(lambda row: dateutil.parser.parse(row))
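As a sketch of the element-wise parse next to a vectorised alternative: pd.to_datetime with utc=True is a different technique from the one in this answer, and it normalises every value to UTC. The sample value is the one shown in the question's CSV output.

```python
import dateutil.parser
import pandas as pd

# The ISO-style string from the question's CSV output.
df = pd.DataFrame({"lastDeliveryDate": ["2018-11-21 16:25:53.990000-05:00"]})

# Element-wise: parse() receives one string at a time.
parsed_each = df["lastDeliveryDate"].apply(dateutil.parser.parse)

# Vectorised alternative: parse the whole column and convert to UTC.
parsed_utc = pd.to_datetime(df["lastDeliveryDate"], utc=True)

print(parsed_each.iloc[0], parsed_utc.iloc[0])
```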
From a source I retrieve some data in JSON format. I want to save this data (measurements over time) as a text file. Periodically I want to go back to the same source and check whether new measurements are available; if so, I want to append them to the existing measurements.
The data I get looks like this:
{"xyz":[{"unixtime":"1458255600","time":"00:00","day":"18\/03","value":"11","paramlabel":"30-500 mHz","popupcorr":"550","iconnr":"7","paramname":"30-500 mHz"},{"unixtime":"1458256200","time":"00:10","day":"18\/03","value":"14","paramlabel":"30-500 mHz","popupcorr":"550","iconnr":"7","paramname":"30-500 mHz"},etc.]}
I load this data into a pandas DataFrame to be able to work with it more easily. When I load this into a dataframe however, all columns are treated as strings. How can I make sure that the unixtime column is treated as a timestamp (such that I can convert to a datetime)?
Use to_datetime and pass unit='s' to treat the values as epoch time, after converting the dtype to int using astype:
df['unixtime'] = pd.to_datetime(df['unixtime'].astype(int), unit='s')
Example:
In [162]:
pd.to_datetime(1458255600, unit='s')
Out[162]:
Timestamp('2016-03-17 23:00:00')
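Putting the pieces together for the JSON in the question, a minimal sketch could look like this. The payload is trimmed to two records and two fields for brevity, which is an assumption about what matters here.

```python
import json

import pandas as pd

# Trimmed version of the payload from the question; all values arrive as strings.
payload = (
    '{"xyz":[{"unixtime":"1458255600","value":"11"},'
    '{"unixtime":"1458256200","value":"14"}]}'
)

records = json.loads(payload)["xyz"]
df = pd.DataFrame(records)

# Convert the string columns: epoch seconds to datetime, readings to int.
df["unixtime"] = pd.to_datetime(df["unixtime"].astype(int), unit="s")
df["value"] = df["value"].astype(int)
print(df)
```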