I have some large (500+ MB) .CSV files that I need to import into a PostgreSQL database.
I am looking for a script or tool that helps me to:
Generate the SQL CREATE TABLE code for the columns, ideally taking the data in the .CSV file into account in order to choose the best data type for each column.
Use the header row of the .CSV for the column names.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
You can use the open-source tool pgfutter to create a table from your CSV file.
GitHub link
PostgreSQL also has COPY functionality; however, COPY expects that the table already exists.
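If you would rather stay in Python, here is a minimal sketch of the same idea: sample some rows to guess each column's type, build the CREATE TABLE from the header, then bulk-load with COPY. The file name, table name, connection string and the crude int/float/text inference are assumptions for illustration, not part of pgfutter:

```python
import csv
import psycopg2

def guess_type(values):
    """Very rough type inference over sampled values: int, float, else text."""
    non_empty = [v for v in values if v != ""]
    try:
        for v in non_empty:
            int(v)
        return "bigint"
    except ValueError:
        pass
    try:
        for v in non_empty:
            float(v)
        return "double precision"
    except ValueError:
        return "text"

# File, table and connection details below are placeholders.
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    sample = [row for _, row in zip(range(1000), reader)]  # inspect the first 1000 rows

columns = [(name, guess_type([row[i] for row in sample]))
           for i, name in enumerate(header)]
ddl = ("CREATE TABLE my_table (\n    "
       + ",\n    ".join(f'"{name}" {ctype}' for name, ctype in columns)
       + "\n)")

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    cur.execute(ddl)                       # create the table from the inferred schema
    with open("data.csv") as f:
        cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)
```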
I am using a Python script to structure a CSV file fetched from a website and store it as CSV in an Azure container. The CSV file is then loaded into an Azure database using an Azure Data Factory Copy Activity.
One of the amount columns in the CSV contains values in exponential notation. For example, it has the value 3.7E+08 (the actual value is 369719968). When the data is loaded into the database, the value is stored as 3.7E+08 in the staging table, where the column data type is varchar(30). I am converting it to float and then to decimal in order to store it in the target table. On converting to float, the value comes out as 370000000 in the database, which causes a big mismatch in the amount.
Kindly suggest how we can resolve this issue. We tried converting the amount column to a string in the CSV using Python, but the value still comes out in exponential notation.
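One thing to check is the step that writes the CSV: if the amount is held as a Python float and formatted with an exponent there, the precision is already lost before Azure Data Factory copies it. Below is a minimal sketch of keeping the full value by parsing with Decimal and writing a plain fixed-point string; the file and column names are made up for illustration:

```python
import csv
from decimal import Decimal

# Hypothetical fetched data; the amount is kept as text end to end.
rows = [{"id": 1, "amount": "369719968"}]

with open("staging.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "amount"])
    writer.writeheader()
    for row in rows:
        # Parse with Decimal (not float) so no precision is lost, then format
        # as a plain fixed-point string so no exponent can reappear.
        amount = Decimal(row["amount"])
        writer.writerow({"id": row["id"], "amount": format(amount, "f")})
```

On the database side, casting the varchar column directly to decimal, rather than going through float first, also avoids the rounding, provided the text no longer contains an exponent.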
Consider the following scenario:
Incremental data gets ingested daily into an HDFS location, and from there I have to read the data using PySpark and find out the latest/active records.
Also, I have to handle schema changes in the data, as new fields may get added.
How can I achieve schema comparison and handle schema changes in PySpark?
How can I handle data which got loaded before the schema changes?
Is the approach below a good one?
Generate a script to create Hive tables on top of the HDFS location.
Then compare the schema of the source data and the Hive table using PySpark (a rough sketch is shown below). If there is a schema change, use the new schema from the source to create the new DDL for table creation. Drop the existing table and recreate it with the new schema.
Create a view from the Hive tables to get the latest records using the primary key and audit column.
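A rough sketch of the schema comparison and latest-record selection in PySpark, assuming the incremental data lands as Parquet, the Hive table is external over that location, the primary key is id and the audit column is last_updated (all of these names, paths and the database/table are assumptions):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

source_path = "hdfs:///data/incoming/"          # hypothetical HDFS location
df_new = spark.read.parquet(source_path)

# 1. Compare the incoming schema with the Hive table's schema.
hive_fields = {f.name for f in spark.table("mydb.my_table").schema.fields}
new_fields = {f.name for f in df_new.schema.fields}
added = new_fields - hive_fields

# 2. If columns were added, recreate the external table with the new DDL.
#    Files written before the change simply return NULL for the new columns.
if added:
    ddl_cols = ", ".join(f"`{f.name}` {f.dataType.simpleString()}"
                         for f in df_new.schema.fields)
    spark.sql("DROP TABLE IF EXISTS mydb.my_table")
    spark.sql(f"CREATE EXTERNAL TABLE mydb.my_table ({ddl_cols}) "
              f"STORED AS PARQUET LOCATION '{source_path}'")

# 3. Latest/active record per primary key, chosen by the audit column.
w = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
latest = (spark.table("mydb.my_table")
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn"))
latest.createOrReplaceTempView("my_table_latest")
```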
I have 25 tables in a single schema in my Postgres DB. I would like to extract these tables as CSV files. The tables don't have a current-date column, so we would like to add a current-date column to each CSV file.
I am planning to do this in Python, and the code will be executed on an EC2 instance.
One of the options I explored is the COPY command.
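A minimal sketch of that approach with psycopg2, where the COPY query itself appends CURRENT_DATE so the exported CSV already contains the extra column; the schema, table names and connection details are placeholders:

```python
import datetime
import psycopg2

# Placeholders: list all 25 tables of the schema here.
tables = ["customers", "orders"]
today = datetime.date.today().isoformat()

conn = psycopg2.connect(host="mydb.example.com", dbname="mydb",
                        user="me", password="secret")
with conn, conn.cursor() as cur:
    for table in tables:
        # Append the current date inside the COPY query itself, so the
        # exported CSV already carries the extra column.
        query = (f"COPY (SELECT *, CURRENT_DATE AS extract_date "
                 f"FROM my_schema.{table}) TO STDOUT WITH (FORMAT csv, HEADER true)")
        with open(f"{table}_{today}.csv", "w") as f:
            cur.copy_expert(query, f)
```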
I am new to both python and SQLite.
I have used Python to extract data from xlsx workbooks. Each workbook is one series of several sheets and is its own database, but I would also like a merged database of every series together. The structure is the same for all.
The structure of my database is:
* Table A with an autoincrement primary key id, a logical variable and 1 other variable.
* Table B with an autoincrement primary key id, a logical variable and 4 other variables.
* Table C is joined by the Table A id and Table B id, which together form its primary key, and also has 4 other variables specific to this instance of intersection between Table A and Table B (sketched as DDL below).
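For reference, here is a rough reconstruction of that structure as SQLite DDL; all column names other than the ids are placeholders:

```python
import sqlite3

ddl = """
CREATE TABLE table_a (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    logical_var INTEGER,
    other_var   TEXT
);
CREATE TABLE table_b (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    logical_var INTEGER,
    var1 TEXT, var2 TEXT, var3 TEXT, var4 TEXT
);
CREATE TABLE table_c (
    a_id INTEGER REFERENCES table_a(id),
    b_id INTEGER REFERENCES table_b(id),
    var1 TEXT, var2 TEXT, var3 TEXT, var4 TEXT,
    PRIMARY KEY (a_id, b_id)
);
"""
with sqlite3.connect("series1.db") as conn:   # hypothetical per-series database file
    conn.executescript(ddl)
```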
I tried using the answer at
Sqlite merging databases into one, with unique values, preserving foregin key relation
along with various other ATTACH solutions, but each time I got an error message ("cannot ATTACH database within transaction").
Can anyone suggest why I can't get ATTACH to work?
I also tried a ToMerge like the one at How can I merge many SQLite databases?
and it couldn't do ToMerge in the transaction either.
I also initially tried connecting to the existing SQLite db, making dictionaries from the existing tables in python, then adding the information in the dictionaries into a new 'merged' db, but this actually seemed to be far slower than the original process of extracting everything from the xlsx files.
I realize I can easily just run my xlsx to SQL python script again and again for each series directing it all into the one big SQL database and that is my backup plan, but I want to learn how to do it the best, fastest way.
So, what is the best way for me to merge identically structured SQLite databases into one, maintaining my foreign keys?
TIA for any suggestions
:-)
You cannot execute the ATTACH statement from inside a transaction.
You did not start a transaction, but Python tried to be clever, got the type of your statement wrong, and automatically started a transaction for you.
Set connection.isolation_level = None.
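For example, a rough sketch of the merge with autocommit enabled; the file, table and column names are placeholders, and with autoincrement ids you will still need to remap the ids when filling Table C:

```python
import sqlite3

conn = sqlite3.connect("merged.db")
conn.isolation_level = None   # autocommit mode, so ATTACH is not wrapped in a transaction

conn.execute("ATTACH DATABASE 'series1.db' AS src")

# Copy parent rows first; because the ids are autoincremented they may be
# renumbered, so Table C rows need their ids remapped before insertion.
conn.execute("INSERT INTO table_a (logical_var, other_var) "
             "SELECT logical_var, other_var FROM src.table_a")

conn.execute("DETACH DATABASE src")
conn.close()
```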