pyarrow/parquet saving large timestamp incorrectly - python

I've got some timestamps in a database that are 9999-12-31 and I'm trying to convert them to parquet. Somehow these timestamps all end up as 1816-03-29 05:56:08.066 in the parquet file.
Below is some code to reproduce the issue.
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

file_path = "tt.parquet"
schema = pa.schema([pa.field("tt", pa.timestamp("ms"))])
table = pa.Table.from_arrays([pa.array([datetime(9999, 12, 31)], pa.timestamp("ms"))], ["tt"])
writer = pq.ParquetWriter(file_path, schema)
writer.write_table(table)
writer.close()
I'm not ultimately reading the data with pandas, but when I tried inspecting the file with pandas it fails with pyarrow.lib.ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp.
I'm loading the parquet files into Snowflake and get back the incorrect timestamp. I've also tried inspecting with parquet-tools but that doesn't seem to work with timestamps.
Does parquet/pyarrow not support large timestamps? How can I store the correct timestamp?

It turns out that, in my case, I needed to set use_deprecated_int96_timestamps=False on the Parquet writer.
The documentation says it defaults to False, but I had set the flavor to 'spark', so I think that overrode it.
Thanks for the help.
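For reference, a minimal sketch of passing the flag explicitly on the writer (shown here without flavor='spark'; whether an explicit False wins over the flavor's default may depend on your pyarrow version):
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

schema = pa.schema([pa.field("tt", pa.timestamp("ms"))])
table = pa.Table.from_arrays([pa.array([datetime(9999, 12, 31)], pa.timestamp("ms"))], ["tt"])
# Write int64 millisecond timestamps instead of the deprecated int96 encoding.
writer = pq.ParquetWriter("tt.parquet", schema, use_deprecated_int96_timestamps=False)
writer.write_table(table)
writer.close()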

Clearly the timestamp '9999-12-31' is being used not as a real timestamp, but as a flag for an invalid value.
If at the end of the pipeline Snowflake is seeing those as '1816-03-29 05:56:08.066', then you could just keep them as that - or re-cast them to whatever value you want them to have in Snowflake. At least it's consistent.
But if you do want Python to handle the year-9999 values correctly, look at this question, which solves it with use_deprecated_int96_timestamps=True:
handling large timestamps when converting from pyarrow.Table to pandas
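If the goal is just to read the table back into pandas without the nanosecond overflow error, here is a hedged sketch (assuming the file was written with int64 millisecond timestamps as in the fix above; timestamp_as_object keeps the values as Python datetime objects instead of casting to timestamp[ns]):
import pyarrow.parquet as pq

table = pq.read_table("tt.parquet")
# Avoid the cast to timestamp[ns], which overflows for year-9999 values.
df = table.to_pandas(timestamp_as_object=True)
print(df["tt"].iloc[0])  # 9999-12-31 00:00:00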

Related

Values of the columns are null and swapped in pyspark dataframe

I am using pyspark==2.3.1. I performed the data preprocessing with pandas and now I want to convert my preprocessing function from pandas to pyspark. But when reading the CSV file with pyspark, a lot of values become null in a column that actually has values. If I try to perform any operation on this dataframe, it swaps the values of that column with other columns. I also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks
Result from pyspark (screenshot omitted): the values of the column "property_type" come back as null, but the actual dataframe has values there.
CSV file (screenshot omitted).
But pyspark works fine with small datasets (screenshot omitted).
In our case we faced a similar issue. Things you need to check:
Check whether your data has " (double quotes); pyspark would treat it as a delimiter and the data looks malformed.
Check whether your csv data is multiline.
We handled this situation with the following configuration:
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline",'true').csv(schema_file_location)
Are you limited to the CSV file format?
Try parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works with this format really well.
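A rough sketch of that route (file names are hypothetical, .to_parquet() needs pyarrow or fastparquet installed, and `spark` is assumed to be an existing SparkSession):
import pandas as pd

df = pd.read_csv("data.csv")      # the already-preprocessed pandas DataFrame
df.to_parquet("data.parquet")     # column types are preserved in the file

# In the Spark job:
spark_df = spark.read.parquet("data.parquet")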

What is the difference between timestamp[ns] and timestamp[s]?

I am reading the schema of parquet and JSON files to generate a DDL for a Redshift table. I am getting some data types like timestamp[ns] and timestamp[s]. I tried to look it up on the internet but couldn't understand the difference.
Can you please make me understand with some examples?
timestamp[x] is a timestamp expressed in units of x.
In your case, s = seconds and ns = nanoseconds.
For example:
Timestamp[s] = 2020-03-14T15:32:52
Timestamp[ms] = 2020-03-14T15:32:52.192
Timestamp[us] = 2020-03-14T15:32:52.192548
Timestamp[ns] = 2020-03-14T15:32:52.192548165
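A quick way to see the difference is to floor the same instant to each unit with pandas (just an illustrative sketch):
import pandas as pd

ts = pd.Timestamp("2020-03-14T15:32:52.192548165")  # nanosecond precision
print(ts.floor("s"))   # 2020-03-14 15:32:52
print(ts.floor("ms"))  # 2020-03-14 15:32:52.192
print(ts.floor("us"))  # 2020-03-14 15:32:52.192548
print(ts)              # 2020-03-14 15:32:52.192548165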

Pandas won't recognize date while reading csv

I'm working on a script which reads in a .csv file with pandas and fills in a specific form.
One column in the .csv file is a birthday-column.
While reading the .csv I parse it with 'parse_dates' to get a datetime object so I can format it for my needs:
df = pd.read_csv('readfile1.csv',sep=';', parse_dates=['birthday'])
While it works perfectly with readfile1.csv, it won't work with readfile2.csv. But these files look exactly the same.
The error I get makes me think that the automatic parsing to datetime through pandas is not working:
print(df.at[i,'birthday'].strftime("%d%m%Y"))
AttributeError: 'str' object has no attribute 'strftime'
In both cases the format of the birthday looks like:
'1965-05-16T12:00:00.000Z' #from readfile1.csv
'1934-04-06T11:00:00.000Z' #from readfile2.csv
I can't figure out what's wrong. I checked the encoding of the files and both are 'UTF-8'. Any ideas?
Thank you!
Greetings
If you do not set the keyword parse_dates and instead convert the column after reading the csv, with pd.to_datetime and the keyword errors='coerce', what result do you get? Does the column have NaT values? – MrFuppes
MrFuppes' comment about calling pd.to_datetime led to success: one faulty date in the column was the cause of the error. Lumber Jacks' hint was also helpful for determining the datatypes!
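For completeness, a small sketch of that diagnostic approach (file and column names follow the question):
import pandas as pd

df = pd.read_csv('readfile2.csv', sep=';')                         # no parse_dates
df['birthday'] = pd.to_datetime(df['birthday'], errors='coerce')   # unparseable dates become NaT
print(df[df['birthday'].isna()])                                   # shows the faulty row(s)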

Right format of Timestamp for filtering pyspark dataframe for Cassandra

I'm storing the timestamp as YYYY-mm-dd HH:MM:SSZ in Cassandra and I am able to filter the data to get a certain range of time in cql shell, but when I try the same on a pyspark dataframe I don't get any values in the filtered dataframe.
Can anyone help me find the right datetime format in pyspark for this?
Thank you.
This format for timestamps works just fine. I think you have a problem with Spark SQL types, so you may need to perform an explicit cast of the timestamp string so that Spark can do the correct comparison.
For example, this Scala code works correctly (you may need to adjust it to Python):
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-07-17 14:41:34.373Z' as timestamp) AND ts <= cast('2019-07-19 19:01:56Z' as timestamp)")
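A rough PySpark translation of the snippet above (same keyspace/table names; it assumes the spark-cassandra-connector package is on the classpath and `spark` is an existing SparkSession):
data = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="sdtest", keyspace="test")
        .load())

filtered = data.filter(
    "ts >= cast('2019-07-17 14:41:34.373Z' as timestamp) "
    "AND ts <= cast('2019-07-19 19:01:56Z' as timestamp)"
)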

Reading Date times from Excel to Python using Pandas

I'm trying to read an Excel file into Python and then split the data into numbers (integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (e.g. 12:30:00) they expect it to be recognized as a time. However, python currently treats it as dtype object.
If I specify the column with parse_dates then it works; however, since I don't know what the data is in advance, I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, I would want this to be done without having to specify the column (so anything that can be converted is).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd
object_col = df.select_dtypes(object)
2) convert it to seconds
datetime_col_in_seconds = pd.to_timedelta(object_col.iloc[:, 0]).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Finally, you can convert it back to datetime:
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
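Putting those steps together, a minimal sketch with a hypothetical two-column frame standing in for the uploaded Excel data:
import pandas as pd

df = pd.DataFrame({"value": [1.5, 2.0], "start": ["12:30:00", "08:15:00"]})

object_col = df.select_dtypes(object)                                # step 1: object-dtype columns
seconds = pd.to_timedelta(object_col.iloc[:, 0]).dt.total_seconds()  # step 2: 45000.0 and 29700.0
as_datetime = pd.to_datetime(seconds, unit="s")                      # optional: 1970-01-01 12:30:00, 1970-01-01 08:15:00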
This does what I need
for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column into a timedelta format. If a column can't be converted, pd.to_timedelta raises a ValueError and the loop moves on to the next column.
After it has run, any column that could be recognized as a timedelta format has been converted.
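A quick check of the loop on a small hypothetical frame (column names are made up; the exact resulting dtypes may vary between pandas versions, but the time-like values come back as Timedelta objects):
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "start": ["12:30:00", "08:15:00"]})

for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:   # non-time columns like "name" are left untouched
        pass

print(df["start"].iloc[0])  # 0 days 12:30:00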
