Unable to format timestamp in pyspark - python

I have a CSV data like the below:
time_value,annual_salary
5/01/2019 1:02:16,120.56
06/01/2019 2:02:17,12800
7/01/2019 03:02:18,123.00
08/01/2019 4:02:19,123isdhad
Now I want to convert the time_value column to a timestamp. So I created a view over these records and tried to convert it, but it throws an error:
spark.sql("select to_timestamp(time_value,'M/dd/yyyy H:mm:ss') as time_value from table")
Error:
Text '5/1/2019 1:02:16' could not be parsed

Judging by the error you are seeing, this is a date-format issue.
Text '5/1/2019 1:02:16' could not be parsed
But your format pattern is specified as
'M/dd/yyyy H:mm:ss'
You can see that the day appears as /1/ (a single digit), but your pattern uses dd, which expects two digits.
Please try the following format:
'M/d/yyyy H:mm:ss'
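For reference, here is a minimal sketch of the corrected query; the view name "table" and the column time_value come from the question, while the CSV file name is an assumption:
# Minimal sketch: load the CSV shown above and parse single-digit days with 'd' instead of 'dd'
df = spark.read.option("header", True).csv("data.csv")  # "data.csv" is an assumed file name
df.createOrReplaceTempView("table")
spark.sql("select to_timestamp(time_value, 'M/d/yyyy H:mm:ss') as time_value from table").show(truncate=False)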

I tried your SQL and had no problem. It may be an issue with the Spark version; I used 2.4.8.

Related

Reading Data Frame in Atoti?

While reading a DataFrame in Atoti using the following code, the error shown below occurs.
#Code
global_data=session.read_pandas(df,keys=["Row ID"],table_name="Global_Superstore")
#error
ArrowInvalid: Could not convert '2531' with type str: tried to convert to int64
How can I solve this?
I was trying to read a DataFrame using atoti functions.
There are values with different types in that particular column. If you aren't going to preprocess the data and you're fine with that column being read as a string, then you should specify the exact datatypes of each of your columns (or that particular column), either when you load the dataframe with pandas, or when you read the data into a table with the function you're currently using:
import atoti as tt

global_superstore = session.read_pandas(
    df,
    keys=["Row ID"],
    table_name="Global_Superstore",
    types={
        "<invalid_column>": tt.type.STRING,
    },
)
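Alternatively, you can force the column to a string when pandas loads the data, before handing the dataframe to atoti. A minimal sketch, with an assumed file name:
import pandas as pd

# Sketch: read the problematic column as a string up front ("global_superstore.csv" is an assumed file name)
df = pd.read_csv("global_superstore.csv", dtype={"<invalid_column>": str})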

pyarrow/parquet saving large timestamp incorrectly

I've got some timestamps in a database that are 9999-12-31 and I'm trying to convert them to parquet. Somehow these timestamps all end up as 1816-03-29 05:56:08.066 in the parquet file.
Below is some code to reproduce the issue.
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

file_path = "tt.parquet"
schema = pa.schema([pa.field("tt", pa.timestamp("ms"))])
table = pa.Table.from_arrays([pa.array([datetime(9999, 12, 31)], pa.timestamp("ms"))], ["tt"])
writer = pq.ParquetWriter(file_path, schema)
writer.write_table(table)
writer.close()
I'm not trying to read the data with pandas, but I did try inspecting the file with pandas, and that ends up with a pyarrow.lib.ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp error.
I'm loading the parquet files into Snowflake and get back the incorrect timestamp. I've also tried inspecting with parquet-tools but that doesn't seem to work with timestamps.
Does parquet/pyarrow not support large timestamps? How can I store the correct timestamp?
It turns out, for me, it was because I needed to set use_deprecated_int96_timestamps=False on the parquet writer.
The documentation says it's False by default, but I had set the flavor to 'spark', so I think that overrode it.
Thanks for the help.
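A minimal sketch of that change against the writer from the code above (everything else stays the same):
# Pass the flag explicitly so that flavor='spark' cannot switch the column to the legacy INT96 encoding
writer = pq.ParquetWriter(file_path, schema, use_deprecated_int96_timestamps=False)
writer.write_table(table)
writer.close()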
Clearly the timestamp '9999-12-31' is being used not as a real timestamp, but as a flag for an invalid value.
If at the end of the pipeline Snowflake is seeing those as '1816-03-29 05:56:08.066', then you could just keep them as that - or re-cast them to whatever value you want them to have in Snowflake. At least it's consistent.
But if you insist that you want Python to handle the 9999 cases correctly, look at this question that solves it with use_deprecated_int96_timestamps=True:
handling large timestamps when converting from pyarrow.Table to pandas

PySpark caused mismatch column when reading from csv

Edit: The previous problem was solved by specifying the argument multiLine as True in the spark.read.csv function. However, I discovered another problem when using the spark.read.csv function.
Another problem I encountered was with another csv file in the same dataset as described in the question. It is a review dataset from insideairbnb.com.
The csv file is like this:
But the output of the read.csv function concatenated several lines together and generated a weird format:
Any thoughts? Thank you for your time.
The following problem was solved by specifying the multiLine argument in the spark.read.csv function. The root cause was that there were \r\n\n\r strings in one of the columns, which the function treated as line separators instead of part of the string.
I attempted to load a large csv file to a spark dataframe using PySpark.
listings = spark.read.csv("listings.csv")
# Register the DataFrame as a temp view so it can be queried with Spark SQL
listings.createOrReplaceTempView("listings")
When I tried to get a glance at the result using Spark SQL with the following code:
listing_query = "SELECT * FROM listings LIMIT 20"
spark.sql(listing_query).show()
I got the following result:
Which is very weird, considering that reading the csv with pandas outputs the correct table format without the mismatched columns.
Any idea about what caused this issue and how to fix it?
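For what it's worth, a minimal sketch of the multiLine fix mentioned in the edit above; the header, quote, and escape options are assumptions for a CSV with quoted free-text fields:
listings = (
    spark.read
    .option("header", True)     # assumed: the file has a header row
    .option("multiLine", True)  # keep \r\n inside quoted fields as part of the value
    .option("escape", '"')      # assumed: embedded quotes are escaped by doubling
    .csv("listings.csv")
)
listings.createOrReplaceTempView("listings")
spark.sql("SELECT * FROM listings LIMIT 20").show()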

Writing Pandas dataframe to Snowflake but issue with Date column

I have a Pandas dataframe with a date column (e.g., 29-11-2019). But when I'm writing the dataframe to Snowflake, it throws an error like this:
sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 100040 (22007): Date '29-11-2019' is not recognized
I've tried changing the datatype to datetime:
df['REPORTDATE'] = df['REPORTDATE'].astype('datetime64[ns]')
and I get this error:
sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 100035 (22007): Timestamp '00:24.3' is not recognized
Will appreciate your help.
Snowflake now enforces a strict date format and the date is expected as YYYY-MM-DD. Any other format is not going to be recognized, and "odd" dates like 0000-00-00 are also not going to be recognized.
You can try to change DATE_INPUT_FORMAT in the session to 'DD-MM-YYYY' and see if that fixes anything. Otherwise you'd have to re-format your source data (my guess would be strftime("%Y-%m-%d %H:%M:%S")) if there is an hour/minute/second piece in it, but be aware that in the DATE format for Snowflake these get truncated anyway.
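If you prefer to fix it on the pandas side, here is a minimal sketch that re-formats the column to YYYY-MM-DD before writing to Snowflake, assuming REPORTDATE holds day-first strings such as '29-11-2019':
import pandas as pd

# Sketch: parse day-first dates, then re-format them as YYYY-MM-DD strings for Snowflake
df["REPORTDATE"] = (
    pd.to_datetime(df["REPORTDATE"], format="%d-%m-%Y")
      .dt.strftime("%Y-%m-%d")
)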

String to MMM-YY format

I am using xlrd to read a spreadsheet and write to a database. However, there is a cell value which needs to be written to a date column in the database.
The cell is a string; I read it in and am trying to convert it to MON-YY format as follows.
sales_month_val = curr_sheet.cell(1,5).value
print sales_month_val
current_sales_month = datetime.strptime(sales_month_val,'%MMM%-%YY%')
But I keep getting a conversion-failed error message. Is the above conversion to datetime correct for the MON-YY format?
Thanks,
bee
You should take a look at this strftime reference.
The format you are looking for is:
%b-%y
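A minimal sketch of parsing with that pattern; the sample value 'Jan-19' is an assumption about how MON-YY looks in your sheet:
from datetime import datetime

# Sketch: parse an abbreviated month name and a two-digit year
sales_month_val = "Jan-19"  # assumed example of the cell value
current_sales_month = datetime.strptime(sales_month_val, "%b-%y")
print(current_sales_month)  # 2019-01-01 00:00:00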
