Pandas won't recognize date while reading csv - python

I'm working on a script which reads in a .csv file with pandas and fills in a specific form.
One column in the .csv file is a birthday-column.
While reading the .csv I parse the column with parse_dates to get a datetime object, so I can format it to my needs:
df = pd.read_csv('readfile1.csv',sep=';', parse_dates=['birthday'])
While it works perfectly with readfile1.csv, it won't work with readfile2.csv. But these files look exactly the same.
The error I get makes me think that pandas' automatic parsing to datetime is not working:
print(df.at[i,'birthday'].strftime("%d%m%Y"))
AttributeError: 'str' object has no attribute 'strftime'
In both cases the format of the birthday looks like:
'1965-05-16T12:00:00.000Z' #from readfile1.csv
'1934-04-06T11:00:00.000Z' #from readfile2.csv
I can't figure out what's wrong. I checked the encoding of the files and both are 'UTF-8'. Any ideas?
Thank you!
Greetings

If you do not set the keyword parse_dates, and instead convert the column after reading the csv with pd.to_datetime and the keyword errors='coerce', what result do you get? Does the column have NaT values? – MrFuppes
MrFuppes' comment on calling pd.to_datetime led to success: one faulty date in the column was the cause of the error. Lumber Jacks's hint was also helpful for determining the datatypes!
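For readers hitting the same error: a stdlib sketch of MrFuppes' idea (the sample rows and the 'not-a-date' value are invented here). pd.to_datetime(col, errors='coerce') does this in one call, turning unparseable rows into NaT so you can locate them:

```python
from datetime import datetime

# hypothetical sample column; the middle row simulates the one faulty date
rows = ["1965-05-16T12:00:00.000Z", "not-a-date", "1934-04-06T11:00:00.000Z"]
bad_rows = []
for i, s in enumerate(rows):
    try:
        # %z accepts the literal 'Z' suffix on Python 3.7+
        datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f%z")
    except ValueError:
        bad_rows.append(i)
print(bad_rows)  # [1]
```

With pandas the equivalent check would be df[pd.to_datetime(df['birthday'], errors='coerce').isna()].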

Related

pyarrow/parquet saving large timestamp incorrectly

I've got some timestamps in a database that are 9999-12-31 and trying to convert to parquet. Somehow these timestamps all end up as 1816-03-29 05:56:08.066 in the parquet file.
Below is some code to reproduce the issue.
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

file_path = "tt.parquet"
schema = pa.schema([pa.field("tt", pa.timestamp("ms"))])
table = pa.Table.from_arrays([pa.array([datetime(9999, 12, 31)], pa.timestamp("ms"))], ["tt"])
writer = pq.ParquetWriter(file_path, schema)
writer.write_table(table)
writer.close()
I'm not trying to read the data with pandas, but when I inspect the file with pandas it fails with: pyarrow.lib.ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp.
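The out-of-bounds error is verifiable from first principles: pandas' timestamp[ns] type stores nanoseconds since the epoch in a signed 64-bit integer, which only reaches the year 2262. A quick stdlib check, no pyarrow needed:

```python
from datetime import datetime, timezone

INT64_MAX = 2**63 - 1  # upper bound of timestamp[ns] storage
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
dt = datetime(9999, 12, 31, tzinfo=timezone.utc)
ns_since_epoch = int((dt - epoch).total_seconds()) * 10**9
print(ns_since_epoch > INT64_MAX)  # True: 9999-12-31 cannot fit in timestamp[ns]
```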
I'm loading the parquet files into Snowflake and get back the incorrect timestamp. I've also tried inspecting with parquet-tools but that doesn't seem to work with timestamps.
Does parquet/pyarrow not support large timestamps? How can I store the correct timestamp?
It turns out I needed to set use_deprecated_int96_timestamps=False on the parquet writer.
The documentation says it defaults to False, but I had set the flavor to 'spark', which I think overrode it.
Thanks for the help.
Clearly the timestamp '9999-12-31' is being used not as a real timestamp, but as a flag for an invalid value.
If at the end of the pipeline Snowflake is seeing those as '1816-03-29 05:56:08.066', then you could just keep them as that - or re-cast them to whatever value you want them to have in Snowflake. At least it's consistent.
But if you insist that you want Python to handle the 9999 cases correctly, look at this question that solves it with use_deprecated_int96_timestamps=True:
handling large timestamps when converting from pyarrow.Table to pandas

pandas.read_csv() can apply different date formats within the same column! Is it a known bug? How to fix it?

I have realised that, unless the format of a date column is declared explicitly or semi-explicitly (with dayfirst), pandas can apply different date formats to the same column, when reading a csv file! One row could be dd/mm/yyyy and another row in the same column mm/dd/yyyy!
Insane doesn't even come close to describing it! Is it a known bug?
To demonstrate: the script below creates a very simple table with the dates from January 1st to the 31st, in the dd/mm/yyyy format, saves it to a csv file, then reads back the csv.
I then use pandas.DatetimeIndex to extract the day.
Well, the extracted day is 1 for the first 12 rows (where both month and day were < 13), and 13, 14, etc. afterwards. How on earth is this possible?
The only way I have found to fix this is to declare the date format, either explicitly or just with dayfirst=True. But it's a pain because it means I must declare the date format even when I import csv with the best-formatted dates ever! Is there a simpler way?
This happens to me with pandas 0.23.4 and Python 3.7.1 on Windows 10
import numpy as np
import pandas as pd
df=pd.DataFrame()
df['day'] =np.arange(1,32)
df['day']=df['day'].apply(lambda x: "{:0>2d}".format(x) )
df['month']='01'
df['year']='2018'
df['date']=df['day']+'/'+df['month']+'/'+df['year']
df.to_csv('mydates.csv', index=False)
#same results whether you use parse_dates or not
imp = pd.read_csv('mydates.csv',parse_dates=['date'])
imp['day extracted']=pd.DatetimeIndex(imp['date']).day
print(imp['day extracted'])
By default it assumes the American date format and, if that fails, it switches mid-column without throwing an error. Though this breaks the Zen of Python by letting an error pass silently, remember that "Explicit is better than implicit": if you know your data has an international format, you can use dayfirst
imp = pd.read_csv('mydates.csv', parse_dates=['date'], dayfirst=True)
With files you produce, be unambiguous by using an ISO 8601 format with a timezone designator.
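The ambiguity the answer describes can be reproduced with plain strptime: the same string is a valid date under both conventions, and only the format string decides which one you get:

```python
from datetime import datetime

s = "05/01/2018"  # May 1st or January 5th?
us = datetime.strptime(s, "%m/%d/%Y")  # American reading
eu = datetime.strptime(s, "%d/%m/%Y")  # European reading
print(us.month, eu.month)  # 5 1
```

An unambiguous ISO 8601 string like "2018-01-05" parses the same way under any convention, which is why the answer recommends it for files you produce yourself.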

Pandas DatetimeIndex string format conversion from American to European

Ok I have read some data from a CSV file using:
df=pd.read_csv(path,index_col='Date',parse_dates=True,dayfirst=True)
The data are in the European date convention format dd/mm/yyyy, which is why I am using dayfirst=True.
However, what I want to do is change the string format of my dataframe index from the American (yyyy/mm/dd) to the European (dd/mm/yyyy) format, just to be visually consistent with how I read the dates.
I couldn't find any relevant argument in the pd.read_csv method.
In the output I want a dataframe whose index is a datetime index, visually consistent with the European date format.
Could anyone propose a solution? It should be straightforward, since I guess there is a pandas method to handle this, but I am currently stuck.
Try something like the following once it's loaded from the CSV. I don't believe it's possible to perform the conversion as part of the reading process.
import pandas as pd
df = pd.DataFrame({'date': pd.date_range(start='11/24/2016', periods=4)})
df['date_eu'] = df['date'].dt.strftime('%d/%m/%Y')
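The same '%d/%m/%Y' pattern works on plain datetime objects, which is a cheap way to sanity-check the format string outside pandas:

```python
from datetime import date

d = date(2016, 11, 24)
print(d.strftime('%d/%m/%Y'))  # 24/11/2016
```

Note that the result is a string column for display; the datetime index itself has no display format to change, which is why the conversion can't happen inside read_csv.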

Pandas 19 not parsing dates

A script I'd written using an earlier version of Pandas now no longer works. Date parsing is not working. This is my read_html line:
gnu = pd.read_html('gnucash.html', flavor="html5lib", header=0, parse_dates=['Date'])
Pandas identifies the HTML table properly but returns the date as unicode. The HTML has been generated by Gnucash and is in ISO format Y-m-d (no times).
Whatever I do I can't get Pandas to recognise the dates. I tried including a date_parser, but read_html doesn't recognise that.
Apologies @IanS, I've inadvertently deleted your comment. When I set out an example it worked, so I think the problem is with my html file: there must be a non-date buried in the date column. Anyway, pandas did what it ought to with my sample file.
Thanks for taking an interest.
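A stdlib sketch of the suspected failure mode (the "Total" row is an invented stand-in for whatever non-date is buried in the real column): one unparseable cell is enough to make whole-column date parsing fall back to strings:

```python
from datetime import datetime

column = ["2016-01-05", "2016-01-06", "Total"]  # hypothetical stray non-date row
all_dates = True
for cell in column:
    try:
        datetime.strptime(cell, "%Y-%m-%d")
    except ValueError:
        all_dates = False  # pandas likewise leaves the whole column as object dtype
print(all_dates)  # False
```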

to_datetime not working on a string in the format YYYY-MM-DD HH:MM in pandas

I have a date in the format 2014-01-31 05:47.
When it's read into pandas, the column's dtype is object.
When I try to convert it with pd.to_datetime, there is no error, but the dtype does not change to datetime.
Please suggest a way out.
T=pd.read_csv("TESTING.csv")
T['DATE']=pd.to_datetime(T['DATE'])
T.dtypes
>DATE object
T['DATE']
>2014-01-31 05:47
Basically, Pandas doesn't understand what the string "2014-01-31 05:47" is other than the fact that you gave it a string. If you read this string in from a CSV file then read the Pandas docs on the read_csv method that allows you to parse datetimes.
However, given something like this:
records = ["2014-01-31 05:47", "2014-01-31 14:12"]
df = pandas.DataFrame(records)
df.dtypes
>0 object
>dtype: object
This is because you haven't told Pandas how to parse your string into a datetime (or TimeStamp) type.
Using the pandas.to_datetime method is what you want, but you must be careful to pass it only the column that has the values you want to convert. Remember that pandas won't mutate the dataframe here; you need to assign the result back.
df[0] = pandas.to_datetime(df[0])
df.dtypes
>0 datetime64[ns]
>dtype: object
This is what you want. The cells are now the right format.
There are many ways to achieve the same thing, you could use the apply() method with a lambda, correctly parse from CSV or SQL or work with Series.
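For this specific string the matching format is '%Y-%m-%d %H:%M'; checking it with the stdlib first is a cheap way to confirm the pattern before handing it to pandas (e.g. via pd.to_datetime(T['DATE'], format='%Y-%m-%d %H:%M')):

```python
from datetime import datetime

ts = datetime.strptime("2014-01-31 05:47", "%Y-%m-%d %H:%M")
print(ts.year, ts.hour, ts.minute)  # 2014 5 47
```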
