I would like to add a column that is the result of two existing columns in BigQuery. I am using Apache Beam to read from BigQuery, process the data, and write the result back to the same BigQuery table as a new column.
The Beam BigQuery connector does not explicitly support BigQuery DML; however, you can write a pipeline that inserts the results of your processing into a separate table and, after the pipeline runs, run a DML statement that updates the column in the original table from that auxiliary table.
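For instance, a minimal sketch of that follow-up DML step using the BigQuery Python client could look like this (project, table, and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names: the pipeline wrote its results to aux_table keyed by
# `id`, and they are copied into `new_col` of the original table.
dml = """
UPDATE `my-project.my_dataset.original_table` AS t
SET t.new_col = a.new_col
FROM `my-project.my_dataset.aux_table` AS a
WHERE t.id = a.id
"""

# Run the DML statement and wait for it to complete.
client.query(dml).result()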
Alternatively, if your processing logic can be expressed in SQL, you're probably better off just implementing it as an SQL DML statement without using a pipeline.
Related
I have two tables stored in BigQuery and want to join the columns from one table to the other. This needs to be done using Apache Beam (Python) for a Dataflow pipeline on Google Cloud Platform. I just cannot find an approach to do this with Apache Beam. WriteToBigQuery only appends rows, which is not what I need; I need to add columns from another table. Both tables use the same primary keys. Any help will be appreciated.
FEEDBACK: See the responses below from Guillaume. This solved my problem and was a better approach as opposed to using Apache Beam and Dataflow!
You can try the following snippet to read data from BigQuery over Dataflow, join the two tables, and write the data to a new BigQuery table:
data_loading = (
    p1
    | 'ReadBQ' >> beam.io.Read(beam.io.BigQuerySource(
        query='''SELECT a.Coll1, b.Coll2
                 FROM `PROJ.dataset.table-a` AS a, `PROJ.dataset.table-b` AS b
                 WHERE a.coll_join = b.coll_join''',
        use_standard_sql=True))
)
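The snippet above only covers the read; a possible continuation for the write step (the destination table name and schema below are placeholders, not part of the original answer) might look like:

data_loading | 'WriteToBQ' >> beam.io.WriteToBigQuery(
    'PROJ:dataset.table-c',
    schema='Coll1:STRING,Coll2:STRING',
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)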
I want to test my complex Hive queries beforehand by executing them on empty DataFrames using PySpark or pandas. How can I do this? I don't want to create a Hive connection; I just want to mock the tables as DataFrames and then execute the queries against them.
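One possible way to mock the tables, sketched in PySpark with hypothetical table and column names: register empty DataFrames with the expected schemas as temp views, then run the query against them.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Hypothetical schema mirroring the Hive table being mocked.
orders_schema = StructType([
    StructField('order_id', IntegerType()),
    StructField('customer', StringType()),
])

# Empty DataFrame registered under the Hive table's name.
spark.createDataFrame([], orders_schema).createOrReplaceTempView('orders')

# The complex query can now be parsed and executed against the mocked table.
spark.sql('SELECT customer, COUNT(*) AS n FROM orders GROUP BY customer').show()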
I have a BigQuery data warehouse containing all the data from a MongoDB database; the data is synced once a day.
I would like to add a column to one of my tables; that column is a cleaned and lemmatized version of another column (of type string). I can't do that with DBT because I need to use the Python library spaCy. How could I run such a transformation on my table without having to pull all the data locally and send 10M UPDATEs to BigQuery? Is there some GCP tool to run Python functions against BigQuery, like Dataflow or something like that?
And more generally, how do you transform data when tools like DBT are not enough?
Thanks for your help!
You can try Dataflow batch processing for your requirement, since Dataflow is a fully managed service that can run the transformation on your table without downloading the data locally, and the spaCy library can be used within Dataflow pipelines. Although BigQuery and Dataflow are managed services that can process large amounts of data, it is always a best practice to split larger NLP jobs into smaller ones, as discussed here.
Note: as you want to add a column that is a lemmatized and cleaned version of an existing column, it would be better to write the results to a new destination table.
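As an illustration only (not part of the original answer), a minimal Beam pipeline for this could look like the following; project, dataset, table, and column names are placeholders, and the spaCy model is assumed to be installed on the workers:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class LemmatizeFn(beam.DoFn):
    def setup(self):
        # Load the spaCy model once per worker rather than once per element.
        import spacy
        self._nlp = spacy.load('en_core_web_sm')

    def process(self, row):
        # `row` is a dict produced by ReadFromBigQuery; add the new column.
        doc = self._nlp(row['text'] or '')
        out = dict(row)
        out['text_lemmatized'] = ' '.join(t.lemma_ for t in doc if not t.is_space)
        yield out

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadBQ' >> beam.io.ReadFromBigQuery(
         query='SELECT id, text FROM `my-project.my_dataset.source_table`',
         use_standard_sql=True)
     | 'Lemmatize' >> beam.ParDo(LemmatizeFn())
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.destination_table',
         schema='id:STRING,text:STRING,text_lemmatized:STRING',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))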
I have to read three different BigQuery tables and then join them to get some data, which will be stored in a GCS bucket. I was using the Spark BigQuery connector.
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', bq_dataset + bq_table) \
    .load()
bqdf.createOrReplaceTempView('bqdf')
This reads the entire table into a DataFrame. I know that I can apply filters on the tables and also select only the required columns, then create three DataFrames and join them to get the output.
Is there any equivalent way to achieve this?
I have the option of using the BigQuery client API (https://googleapis.dev/python/bigquery/latest/index.html) and importing it in the PySpark script. However, if I can achieve this through the Spark BigQuery connector, I don't want to use the API call from the Python script.
Please help.
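For reference, a rough sketch of the approach described above (table names, columns, filters, and the output path are hypothetical), relying on the connector's filter option and column pruning so that only the needed data is read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('bq-join-to-gcs').getOrCreate()

def read_bq(table, columns, filter_expr):
    # The connector pushes the filter down to BigQuery and prunes columns,
    # so only the required data is read into the DataFrame.
    return (spark.read.format('bigquery')
            .option('table', table)
            .option('filter', filter_expr)
            .load()
            .select(*columns))

# Hypothetical tables, columns, and join key.
df_a = read_bq('proj.dataset.table_a', ['id', 'col_a'], 'col_a IS NOT NULL')
df_b = read_bq('proj.dataset.table_b', ['id', 'col_b'], 'col_b > 0')
df_c = read_bq('proj.dataset.table_c', ['id', 'col_c'], "col_c != ''")

# Join the three DataFrames on the common key and write the result to GCS.
result = df_a.join(df_b, 'id').join(df_c, 'id')
result.write.mode('overwrite').parquet('gs://my-bucket/joined-output/')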
Consider the following scenario:
Incremental data gets ingested daily into an HDFS location, and from there I have to read the data using PySpark and find out the latest/active records.
Also, I have to handle schema changes in the data, as new fields may get added.
How can I achieve schema comparison and handle schema changes in pyspark?
How can I handle data which got loaded before the schema changes?
Is the approach below a good one?
Generate a script to create Hive tables on top of the HDFS location.
Then compare the schema of the source table and the Hive table using PySpark. If there is a schema change, use the new schema from the source to create the new DDL for table creation. Drop the existing table and recreate it with the new schema.
Create a view on the Hive tables to get the latest records, using the primary key and an audit column.
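A rough sketch of the schema comparison and the latest-record view in PySpark (paths, table names, and the key/audit columns are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical locations and table names.
source_df = spark.read.parquet('hdfs:///data/incoming/latest/')
hive_cols = set(spark.table('mydb.events').schema.fieldNames())
source_cols = set(source_df.schema.fieldNames())

if source_cols - hive_cols:
    # Schema drift detected: rebuild the table DDL from the source schema.
    ddl_cols = ', '.join(f'`{f.name}` {f.dataType.simpleString()}'
                         for f in source_df.schema.fields)
    spark.sql('DROP TABLE IF EXISTS mydb.events')
    spark.sql(f"CREATE EXTERNAL TABLE mydb.events ({ddl_cols}) "
              "STORED AS PARQUET LOCATION 'hdfs:///data/incoming/'")

# View exposing only the latest record per primary key, based on an audit column.
spark.sql("""
    CREATE OR REPLACE VIEW mydb.events_latest AS
    SELECT * FROM (
        SELECT e.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY audit_ts DESC) AS rn
        FROM mydb.events e
    ) t WHERE t.rn = 1
""")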