Getting an error while using overwrite mode with column mapping in Spark (Python)

I am loading data from CSV files stored in S3 into a Snowflake table using a PySpark script. Earlier the table had 10 columns, and in the updated files I started receiving 12 columns. In Snowflake you can't add new columns in between existing ones (in MySQL we can alter a table and add columns in between, but in Snowflake new columns can only be added at the end).
I have now altered the table and added the two new columns at the end.
In the script I am doing upserts (insert + update). For new records inserted into Snowflake, I am using append mode. For updated records, I am using overwrite mode.
For the insert (append mode), I have specified column mapping by name, and it works fine:
df.write.format("net.snowflake.spark.snowflake").option("column_mapping", "name").mode("append").save()
But when I added column mapping by name to the update logic (overwrite mode),
it gives me an error saying column mapping only works with append mode:
df.write.format("net.snowflake.spark.snowflake").option("column_mapping", "name").mode("overwrite").save()
I have to specify column mapping because the order of columns in the Snowflake table is different from the order of columns in the CSV file.
Error:
the column mapping only works in append mode.
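One workaround that is sometimes suggested (not part of the question itself): append the update rows into a staging table, where column mapping by name is allowed, and then run a MERGE on the Snowflake side. A rough sketch follows; updates_df, sfOptions, the table names, the join key, and the use of the Snowflake Python connector for the MERGE step are all placeholders or assumptions, not the asker's actual setup.

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"
sfOptions = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<db>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Placeholder DataFrame holding only the rows that need to be updated.
updates_df = spark.read.option("header", "true").csv("s3://<bucket>/updates/")

# 1) Append the update rows into a staging table; column mapping by name
#    is allowed in append mode.
(updates_df.write
    .format(SNOWFLAKE_SOURCE)
    .options(**sfOptions)
    .option("dbtable", "TARGET_TABLE_STAGE")
    .option("column_mapping", "name")
    .mode("append")
    .save())

# 2) Apply the updates with a MERGE executed in Snowflake. This sketch uses the
#    Snowflake Python connector (assumed to be installed) for the DML step.
import snowflake.connector

merge_sql = """
MERGE INTO TARGET_TABLE t
USING TARGET_TABLE_STAGE s
  ON t.ID = s.ID
WHEN MATCHED THEN UPDATE SET t.COL1 = s.COL1, t.COL2 = s.COL2
"""
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    database="<db>", schema="<schema>", warehouse="<warehouse>")
try:
    conn.cursor().execute(merge_sql)
finally:
    conn.close()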

Related

Automatically create the table column SQL code from a .CSV file in PostgreSQL

I have some large (500+ MB) .CSV files that I need to import into a PostgreSQL database.
I am looking for a script or tool that helps me to:
Generate the CREATE TABLE SQL code, ideally taking the data in the .CSV file into account in order to choose the optimal data type for each column.
Use the header of the .CSV as the column names.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
You can use the open source tool pgfutter to create a table from your CSV file.
(GitHub link)
PostgreSQL also has COPY functionality; however, COPY expects that the table already exists.
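If you would rather stay in Python, here is a rough sketch of the same idea: take the column names from the CSV header and guess a data type for each column from a sample of rows. The function, file, and table names are placeholders, and the type inference is deliberately crude.

import csv

def guess_type(values):
    # Very rough type inference from sample values.
    def all_match(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False
    if all_match(int):
        return "BIGINT"
    if all_match(float):
        return "DOUBLE PRECISION"
    return "TEXT"

def create_table_sql(csv_path, table_name, sample_rows=1000):
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        sample = [row for _, row in zip(range(sample_rows), reader)]
    columns = []
    for i, name in enumerate(header):
        col_type = guess_type([row[i] for row in sample if i < len(row)])
        columns.append('"{}" {}'.format(name, col_type))
    return "CREATE TABLE {} (\n  {}\n);".format(table_name, ",\n  ".join(columns))

print(create_table_sql("data.csv", "my_table"))

Once the table exists, COPY (or \copy from psql) can bulk-load the file, as noted above.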

Avoiding exponential values in a CSV when loading into a database

I am using a Python script to structure a CSV file fetched from a website and store it as a CSV in an Azure container. The CSV file is then loaded into an Azure database using an Azure Data Factory Copy activity.
One of the amount columns in the CSV contains values in exponential notation. For example, it has the value 3.7E+08 (the actual value is 369719968). When the data is loaded into the database, the value is stored as 3.7E+08 in the staging table, where the column data type is varchar(30). I convert it to float and then to decimal to store it in the target table. After the conversion to float, the value comes out as 370000000 in the database, which causes a big mismatch in the amount.
Kindly suggest how we can resolve this issue. We tried converting the amount column to a string in the CSV using Python, but the value still comes out in exponential notation.
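For what it's worth, the rounding usually happens where the CSV is produced: once the amount has been serialized as 3.7E+08, only two significant digits survive, so no later cast can recover 369719968. A minimal pandas sketch (the column and file names are placeholders), assuming the full value is still available at the point the structured CSV is written:

import pandas as pd

# Read the amount column as text so pandas does not reformat it on input.
df = pd.read_csv("source.csv", dtype={"amount": str})

# Convert to a number and write it back out without scientific notation.
# If the source text is already "3.7E+08", the lost digits cannot be recovered;
# the fix has to be applied before the precision is dropped.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df.to_csv("structured.csv", index=False, float_format="%.0f")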

How to handle schema changes in Spark?

Consider the following scenario:
Incremental data gets ingested daily into an HDFS location, and from there I have to read the data using PySpark and find the latest/active records.
Also, I have to handle schema changes in the data, as new fields may get added.
How can I achieve schema comparison and handle schema changes in pyspark?
How can I handle data which got loaded before the schema changes?
Is the approach below a good one?
Generate a script to create Hive tables on top of the HDFS location.
Then compare the schema of the source data and the Hive table using PySpark (see the sketch below). If there is a schema change, use the new schema from the source to create the new DDL for the table, drop the existing table, and recreate it with the new schema.
Create a view over the Hive tables to get the latest records using the primary key and an audit column.
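As a starting point, here is a minimal PySpark sketch of the schema-comparison step; the input path, the Hive table name, and the case-insensitive name comparison are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

incoming_df = spark.read.option("header", "true").csv("hdfs:///data/incoming/")  # placeholder path
existing_df = spark.table("db.target_table")  # placeholder Hive table

incoming_fields = {f.name.lower(): f.dataType for f in incoming_df.schema.fields}
existing_fields = {f.name.lower(): f.dataType for f in existing_df.schema.fields}

new_columns = set(incoming_fields) - set(existing_fields)
changed_types = {
    name for name in set(incoming_fields) & set(existing_fields)
    if incoming_fields[name] != existing_fields[name]
}

if new_columns or changed_types:
    # Schema drift detected: regenerate the DDL from incoming_df.schema,
    # recreate (or ALTER) the Hive table, then reload/backfill as needed.
    print("new columns:", new_columns, "changed types:", changed_types)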

Using Python: Extracting tables from postgres DB as CSV files and adding current date column to the CSV file

I have 25 tables in a single schema in my Postgres DB. I would like to extract these tables as CSV files. These tables don't have a current date column, so we would like to add a current date column to the CSV files.
I am planning to do this in Python, and the code will be executed on an EC2 instance.
One of the options I explored is the COPY command.
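A minimal sketch of that approach with psycopg2 (the connection details, schema, and table list are placeholders); running COPY over a SELECT lets you append the current date column on the way out:

import psycopg2

TABLES = ["table_a", "table_b"]  # placeholder for the 25 table names

conn = psycopg2.connect(host="<host>", dbname="<db>", user="<user>", password="<password>")
try:
    with conn.cursor() as cur:
        for table in TABLES:
            sql = ("COPY (SELECT t.*, CURRENT_DATE AS extract_date "
                   "FROM my_schema.{} t) TO STDOUT WITH CSV HEADER".format(table))
            with open("{}.csv".format(table), "w") as f:
                cur.copy_expert(sql, f)
finally:
    conn.close()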

Trying to merge identically structured SQLite databases - each contains 3 tables, 1 being a join table for many-to-many relationships between the other 2

I am new to both Python and SQLite.
I have used Python to extract data from xlsx workbooks. Each workbook is one series made up of several sheets and is its own database, but I would also like a merged database of every series together. The structure is the same for all.
The structure of my database is:
*Table A with an autoincrement primary key id, a logical variable, and 1 other variable.
*Table B with an autoincrement primary key id, a logical variable, and 4 other variables.
*Table C is keyed by the Table A id and Table B id, which together form the primary key, and also has 4 other variables specific to this instance of intersection between Table A and Table B.
I tried using the answer at
Sqlite merging databases into one, with unique values, preserving foregin key relation
along with various other ATTACH solutions, but each time I got an error message ("cannot ATTACH database within transaction").
Can anyone suggest why I can't get ATTACH to work?
I also tried a ToMerge like the one at How can I merge many SQLite databases?
and it couldn't do ToMerge in the transaction either.
I also initially tried connecting to the existing SQLite db, making dictionaries from the existing tables in python, then adding the information in the dictionaries into a new 'merged' db, but this actually seemed to be far slower than the original process of extracting everything from the xlsx files.
I realize I can easily just run my xlsx to SQL python script again and again for each series directing it all into the one big SQL database and that is my backup plan, but I want to learn how to do it the best, fastest way.
So, what is the best way for me to merge identically structured SQLite databases into one, maintaining my foreign keys?
TIA for any suggestions
:-)
You cannot execute the ATTACH statement from inside a transaction.
You did not start a transaction, but Python tried to be clever, got the type of your statement wrong, and automatically started a transaction for you.
Set connection.isolation_level = None.
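A minimal sketch of what that looks like (the database file, table, and column names are placeholders; a real merge would also have to remap the autoincrement ids from Table A and Table B before copying the join table, so the foreign keys don't collide):

import sqlite3

conn = sqlite3.connect("merged.db")
conn.isolation_level = None  # autocommit, so ATTACH is not wrapped in an implicit transaction

conn.execute("ATTACH DATABASE 'series1.db' AS src")

# Simplified copy of one table; repeat for the others, remapping ids as needed.
conn.execute("INSERT INTO table_a (logical_var, other_var) "
             "SELECT logical_var, other_var FROM src.table_a")

conn.execute("DETACH DATABASE src")
conn.close()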
