Consider the following scenario:
Incremental data gets ingested daily into a HDFS location, and from there I have to read the data using pyspark and find out the latest/active records.
Also, I have to handle schema changes in the data, as new fields may get added.
How can I achieve schema comparison and handle schema changes in pyspark?
How can I handle data which got loaded before the schema changes?
Is the below approach is a good one?
Generate a script to create hive tables on top of HDFS location.
Then compare the schema of source table and Hive table using pyspark. If there is schema change use the new schema from source to create the new ddl for table creation. Drop the existing table and create the table with new schema.
Create a view from the hive tables to get the latest records using primary key and audit column.
Related
I have some large (+500 Mbytes) .CSV files that I need to import into a Postgres SQL database.
I am looking for a script or tool that helps me to:
Generate the table columns SQL CREATE code, ideally taking into account the data in the .CSV file in order to create the optimal data types for each column.
Use the header of the .CSV as the name of the column.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
you can use this open source tool called pgfutter to create table from your csv file.
git hub link
also postgresql has COPY functionality however copy expect that the table already exists.
We're doing streaming inserts on a BigQuery table.
We want to update the schema of a table without changing its name.
For example, we want to drop a column because it has sensitive data but we want to keep all the other data and the table name the same.
Our process is as follows:
copy original table to temp table
delete original table
create new table with original table name and new schema
populate new table with old table's data
cry because the last (up to) 90 minutes of data is stuck in streaming buffer and was not transferred.
How to avoid the last step?
I believe the new streaming API does not use streaming buffer anymore. Instead, it writes data directly to the destination table.
To enable API you have to enroll with BigQuery Streaming V2 Beta Enrollment Form:
You can find out more in the following link
I hope it addresses your case.
I have a task to import multiple Excel files in their respective sql server tables. The Excel files are of different schema and I need a mechanism to create a table dynamically; so that I don't have to write a Create Table query. I use SSIS, and I have seen some SSIS articles on the same. However, it looks I have to define the table anyhow. OpenRowSet doesn't work well in case of large excel files.
You can try using BiML, which dynamically creates packages based on meta data.
The only other possible solution is to write a script task.
Background
I studied and found that bigQuery doesn't accept schemas defined by online tools (which have different formats, even though meaning is same).
So, I found that if I want to load data (where no. of columns keeps varying and increasing dynamically) into a table which has a fixed schema.
Thoughts
What i could do as a workaround is:
First check if the data being loaded has extra fields.
If it has, a schema mismatch will occur, so first you create a temporary table in BQ and load this data into the table using "autodetect" parameter, which gives me a schema (that is in a format,which BQ accepts schema files).
Now i can download this schema file and use it,to update my exsisting table in BQ and load it with appropriate data.
Suggestion
Any thoughts on this, if there is a better approach please share.
We are in the process of releasing a new feature that can update the schema of the destination table within a load/query job. With autodetect and the new feature you can directly load the new data to the existing table, and the schema will be updated as part of the load job. Please stay tuned. The current ETA is 2 weeks.
Background
I am loading files from local machine to BigQuery.Each file has variable number of fields.So,i am using 'autodetect=true' while running load job.
Issue is,when load job is run for first time and if the destination table doesn't exsist,Bigquery creates the table ,by infering the fields present in our file and that becomes New table's schema.
Now,when i run load job with a different file,which contains some extra (Eg:"Middile Name":"xyz")fields ,bigQuery throws error saying "field doesn't exsist in table")
From this post::BigQuery : add new column to existing tables using python BQ API,i learnt that columns can be added dynamically.However what i don't understand is,
Query
How will my program come to know,that the file being uploaded ,contains extra fields and schema mismatch will occur.(Not a problem ,if table doesn't exsist bcoz. new table will be created).
If my program can somehow infer the extra fields present in file being uploaded,i could add those columns to the exsisting table and then run the load job.
I am using python BQ API.
Any thoughts on how to automate this process ,would be helpful.
You should check schema update options. There is an option named as "ALLOW_FIELD_ADDITION" that will help you.
A naive solution would be:
1.get the target table schema using
service.tables().get(projectId=projectId, datasetId=datasetId, tableId=tableId)
2.Generate schema of your data in the file.
3.Compare the schemas (kind of a "diff") and then add those columns to the target table ,which are extra in your data schema
Any better ideas or approaches would be highly appreciated!