Add column from column using Python and spaCy

I have a BigQuery data warehouse containing all the data from a MongoDB database; the data is synced once a day.
I would like to add a column to one of my tables: a cleaned + lemmatized version of another (string) column. I can't do that with dbt because I need the Python library spaCy. How can I run such a transformation on my table without pulling all the data locally and sending 10M UPDATE statements to BigQuery? Is there a GCP tool to run Python functions against BigQuery, like Dataflow or something similar?
And more generally, how do you transform data when tools like dbt are not enough?
Thanks for your help!

You can try Dataflow batch processing for this requirement: Dataflow is a fully managed service that can run a transformation on your table without downloading the data locally, and the spaCy library can be used inside a Dataflow pipeline. Although BigQuery and Dataflow are managed services that can process large amounts of data, it is still best practice to split large NLP jobs into smaller ones, as discussed here.
Note - since you want to add a column that is a lemmatized and cleaned version of an existing column, it is better to write the result to a new destination table.
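A minimal sketch of such a Beam pipeline, assuming a table `my_project.my_dataset.my_table` with `id` and `text` columns and the `en_core_web_sm` spaCy model (all placeholder names). The spaCy model is loaded once per worker in `setup()` rather than once per row:

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation, and collapse whitespace before lemmatizing."""
    text = re.sub(r"[^\w\s]", " ", (text or "").lower())
    return re.sub(r"\s+", " ", text).strip()

def run(argv=None):
    # Imports kept local so the pure helper above has no Beam/spaCy dependency.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class LemmatizeFn(beam.DoFn):
        def setup(self):
            # Load the spaCy model once per worker, not once per element.
            import spacy
            self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

        def process(self, row):
            doc = self.nlp(clean_text(row["text"]))
            yield {"id": row["id"],
                   "text_lemmatized": " ".join(t.lemma_ for t in doc)}

    # Pass --runner=DataflowRunner, --project, --region, etc. on the command line.
    opts = PipelineOptions(argv)
    with beam.Pipeline(options=opts) as p:
        (p
         | "Read" >> beam.io.ReadFromBigQuery(
               query="SELECT id, text FROM `my_project.my_dataset.my_table`",
               use_standard_sql=True)
         | "Lemmatize" >> beam.ParDo(LemmatizeFn())
         | "Write" >> beam.io.WriteToBigQuery(
               "my_project:my_dataset.my_table_lemmatized",
               schema="id:STRING,text_lemmatized:STRING",
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```

You can then join the new table back to the original on `id` in a dbt model if you need both columns side by side.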

Related

Consistency checks in dbt Cloud

The goal is to compare the data in the source dataset with the data after cleaning, modelling, and ingesting it into the data warehouse, and to send an alert in case of a mismatch.
Is this possible with dbt Cloud, or should I use Python?
If by "source" you mean a different database, before it is ingested into your data warehouse, I'd recommend using datadiff.
Once the data is in your warehouse, you can use dbt to compare two different tables (let's say in your raw or source schema and your final modeled schema).
There are quite a few tests in dbt-utils for this. You may also be interested in dbt-expectations if you need more powerful or complex tests.

Export BigQuery table to GCS bucket into multiple folders/files corresponding to clusters

Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the BigQuery Python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (stored in a BigQuery table column), and then create the multiple files inside it.
The table is pretty huge and won't fit in RAM (it could, but that's not the idea); it is 1+ TB. I cannot naively loop over it with Python.
I have read quite a lot of documentation and parsed the parameters, but I can't find a clean solution. Did I miss something, or is there no Google solution?
My plan B is to use Apache Beam and Dataflow, but I don't have the skills yet, and I would like to avoid that solution as much as possible for simplicity and maintenance.
You have two solutions:
Create one export query per aggregation. If you have 100 nomenclature values, query the table 100 times and export the data to the target directories. The issue is cost: you will pay for processing the table 100 times.
Use Apache Beam to extract the data and sort it. Then, with a dynamic destination, you can create all the GCS paths you want. The issue is that it requires Apache Beam skills.
There is an extra solution, similar to the second one, but using Spark, and especially Spark serverless. If you have more skill in Spark than in Apache Beam, it could be more efficient.
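The first option can be sketched with BigQuery's `EXPORT DATA` statement, one job per value; table, column, and bucket names below are placeholders, and note that each export scans the table again, which is exactly the cost caveat above:

```python
def export_statement(table, column, value, bucket):
    """Build an EXPORT DATA statement writing one GCS 'folder' per value."""
    return (
        f"EXPORT DATA OPTIONS("
        f"uri='gs://{bucket}/{value}/*.csv', format='CSV', overwrite=true) AS "
        f"SELECT * FROM `{table}` WHERE {column} = '{value}'"
    )

def export_all(table, column, bucket):
    # Local import so the pure SQL builder above has no GCP dependency.
    from google.cloud import bigquery

    client = bigquery.Client()
    values = client.query(
        f"SELECT DISTINCT {column} AS v FROM `{table}`").result()
    for row in values:
        # One export job per nomenclature value -> one folder per value.
        client.query(export_statement(table, column, row.v, bucket)).result()
```

Clustering or partitioning the table on the nomenclature column would reduce the bytes scanned by each per-value query.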

Automatically Importing Data From a Website in Google Cloud

I am trying to find a way to automatically update a BigQuery table using this link: https://www6.sos.state.oh.us/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:1
This link is updated with new data every week, and I want to be able to replace the BigQuery table with the new data. I have researched exporting spreadsheets to BigQuery, but that is not a streamlined approach.
How would I go about submitting a script that imports the data and feeds it to BigQuery?
I assume you already have a working script that parses the content of the URL and places the contents in BigQuery. Based on that I would recommend the following workflow:
Upload the script as a Google Cloud Function. If your script isn't written in a compatible language (e.g. Python, Node.js, Go), you can use Google Cloud Run instead. Set the Cloud Function to be triggered by a Pub/Sub message. In this scenario, the content of the Pub/Sub message doesn't matter.
Set up a Google Cloud Scheduler job to (a) run at 12am every Saturday (or whatever time you wish) and (b) send a dummy message to the Pub/Sub topic that your Cloud Function is subscribed to.
You can also make an HTTP request to the page from a language like Python with the Requests library, save the data into a pandas DataFrame or a CSV file, and then push it into a BigQuery table using the BigQuery client library.
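A rough sketch of such a Pub/Sub-triggered Cloud Function, with a placeholder target table: the file is fetched with Requests and loaded with `WRITE_TRUNCATE` so the table is replaced on each weekly run, and `autodetect` infers the schema from the CSV.

```python
DOWNLOAD_URL = ("https://www6.sos.state.oh.us/ords/"
                "f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:1")
TARGET_TABLE = "my_project.my_dataset.voter_file"  # placeholder

def load_settings():
    """Key load-job settings: replace the table and infer the schema."""
    return {"write_disposition": "WRITE_TRUNCATE", "autodetect": True}

def import_file(event, context):
    """Cloud Function entry point; the Pub/Sub payload is ignored."""
    # Local imports so the pure helper above has no external dependency.
    import io
    import requests
    from google.cloud import bigquery

    data = requests.get(DOWNLOAD_URL, timeout=300).content
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        **load_settings(),
    )
    client.load_table_from_file(
        io.BytesIO(data), TARGET_TABLE, job_config=job_config).result()
```

The Cloud Scheduler job from the answer above then just publishes a dummy message to the topic this function subscribes to.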

Is there a way to serverlessly update a big aggregated table daily?

I am trying to build a big aggregated table with Google's tools, but I am a bit lost on how to do it.
Here is what I would like to create: I have a big table in BigQuery. It is updated daily with about 1.2M events for every user of the application. I would like an auto-updating aggregate table (updated once every day) built on top of it, with all user data broken down by userID. But how do I continuously update the data inside it?
I read a bit about Firebase and BigQuery, but since they are very new to me I can't figure out whether this is possible to do serverlessly.
I know how to do it with a Jenkins process that queries the big events table for the last day, gets all userIDs, joins with the existing aggregate values for those userIDs, takes whatever changed, and deletes from the aggregate before inserting the updated data (in Python).
The issue is I want to do this entirely within the Google ecosystem. Can Firebase do that? Can BigQuery? How? With what tools? Could this be solved using the available serverless functions?
I am more familiar with Redshift.
You can use a fairly new BigQuery feature: scheduled queries.
I use it to create rollup tables. If you need something more custom, you can use Cloud Scheduler to call any Google product that can be triggered by an HTTP request, such as Cloud Functions, Cloud Run, or App Engine.
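A sketch of what such a rollup could look like: a MERGE that upserts yesterday's per-user counts, registered as a scheduled query through the BigQuery Data Transfer client. The table names, the `event_timestamp` field, and the single-count aggregate are all assumptions to adapt to your schema.

```python
def rollup_merge_sql(events_table, rollup_table):
    """Daily MERGE upserting yesterday's per-user aggregates into the rollup."""
    return f"""
    MERGE `{rollup_table}` T
    USING (
      SELECT userID, COUNT(*) AS event_count
      FROM `{events_table}`
      WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      GROUP BY userID
    ) S
    ON T.userID = S.userID
    WHEN MATCHED THEN
      UPDATE SET T.event_count = T.event_count + S.event_count
    WHEN NOT MATCHED THEN
      INSERT (userID, event_count) VALUES (S.userID, S.event_count)
    """

def schedule_rollup(project, events_table, rollup_table):
    # Local import so the SQL builder above has no GCP dependency.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    config = bigquery_datatransfer.TransferConfig(
        display_name="daily user rollup",
        data_source_id="scheduled_query",
        schedule="every 24 hours",
        params={"query": rollup_merge_sql(events_table, rollup_table)},
    )
    client.create_transfer_config(
        parent=f"projects/{project}/locations/US", transfer_config=config)
```

The same MERGE can also be pasted directly into the "Scheduled queries" UI in the BigQuery console if you'd rather not script the setup.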

Schema-free solution to BigQuery load job

Background
I studied and found that BigQuery doesn't accept schemas defined by online tools (which use different formats, even though the meaning is the same).
So I am looking for a way to load data (where the number of columns keeps varying and increasing dynamically) into a table which has a fixed schema.
Thoughts
What I could do as a workaround is:
First, check if the data being loaded has extra fields.
If it has, a schema mismatch will occur, so first create a temporary table in BQ and load the data into it using the "autodetect" parameter, which gives me a schema (in a format that BQ accepts for schema files).
Now I can download this schema file and use it to update my existing table in BQ, then load the data into it.
Suggestion
Any thoughts on this? If there is a better approach, please share.
We are in the process of releasing a new feature that can update the schema of the destination table within a load/query job. With autodetect and the new feature, you can load the new data directly into the existing table, and the schema will be updated as part of the load job. Please stay tuned; the current ETA is 2 weeks.
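The capability described here corresponds to `schema_update_options` on load (and query) jobs in the Python client. A sketch combining it with the extra-field check from the workaround above; the table ID and GCS URI are placeholders:

```python
def extra_fields(existing, incoming):
    """Field names present in the incoming data but missing from the table."""
    known = set(existing)
    return [f for f in incoming if f not in known]

def load_with_schema_update(table_id, gcs_uri):
    # Local import so the pure helper above stays dependency-free.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        # Let the load job add new columns itself, no temp table needed.
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    )
    client.load_table_from_uri(
        gcs_uri, table_id, job_config=job_config).result()
```

With `ALLOW_FIELD_ADDITION` set, rows carrying previously unseen columns extend the destination schema during the load, which replaces the temporary-table workaround entirely.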
