I'm working in a limited Airflow environment in which I don't have access to google-cloud-bigquery but do have access to pandas-gbq. My goal is to load some JSON API data, whose schema involves records (nested fields), into a BigQuery table. My strategy is to first read all the data into a pandas dataframe, using a dictionary to represent the records, e.g.:
uuid | metadata1                                | ...
001  | {u'time_updated': u'', u'name': u'jeff'} | ...
...
Then, I've been trying to use pandas_gbq.to_gbq to load this into BQ. The issue is that I get:
Error at Row: 0, Reason: invalid, Location: metadata1, Message: This field: metadata1 is not a record.
I realize this is because the Google Cloud documentation says that pandas-gbq "Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values."
So I won't be able to upload a dataframe with records to BQ this way, since again I can't use google-cloud-bigquery in my environment.
What would be the best strategy to upload my data to BQ (around 30k rows and 6 or so columns, with roughly 8 nested fields each)?
I know this sounds like a very bad strategy, but I could upload a flattened version of all fields in a record as a single string to the BQ table, and then run a query from my code to replace these flattened fields with their record-form versions (a rough sketch of this is below). But this seems really bad, since for a time the table would contain the wrong schema.
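To make that workaround concrete, here is a minimal sketch of it; all dataset/table/column names are assumptions, and the rebuild query is run through pandas_gbq as well since google-cloud-bigquery is not available:

import json
import pandas_gbq

# Serialize the record column to a flat JSON string so pandas-gbq can upload it.
df['metadata1'] = df['metadata1'].apply(json.dumps)
pandas_gbq.to_gbq(df, 'my_dataset.staging_table', project_id='my-project',
                  if_exists='replace')

# Then rebuild the record column from the flattened string (Standard SQL):
rebuild_sql = """
CREATE OR REPLACE TABLE `my-project.my_dataset.final_table` AS
SELECT
  uuid,
  STRUCT(
    JSON_EXTRACT_SCALAR(metadata1, '$.time_updated') AS time_updated,
    JSON_EXTRACT_SCALAR(metadata1, '$.name') AS name
  ) AS metadata1
FROM `my-project.my_dataset.staging_table`
"""
pandas_gbq.read_gbq(rebuild_sql, project_id='my-project')  # run the rebuild query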
Any thoughts would be much appreciated. Thanks in advance.
Related
I have a BigQuery data warehouse containing all the data from a MongoDB database; that data is synced once a day.
I would like to add a column to one of my tables; that column is a cleaned + lemmatized version of another column (whose type is string). I can't do that with DBT because I need to use the Python library spaCy. How could I run such a transformation on my table without having to pull all the data locally and send 10M UPDATE statements to BigQuery? Are there GCP tools to run Python functions against BigQuery, like Dataflow or something like that?
And in a more general way, how do you transform data when tools like DBT are not enough?
Thanks for your help !
You can try Dataflow batch processing for your requirement, since Dataflow is a fully managed service which can run a transformation on your table without downloading the data locally, and the spaCy library can be used within Dataflow pipelines. Although BigQuery and Dataflow are managed services that can process large amounts of data, for larger NLP jobs it is always a best practice to split larger jobs into smaller ones, as discussed here.
Note - Since you want to add a column which is a lemmatized and cleaned version of a column in a table, it would be better to write the result to a new destination table.
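As a rough illustration only, here is a minimal Apache Beam sketch of such a pipeline; the table names, column names and spaCy model are assumptions, not part of the original answer:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import spacy

class LemmatizeColumn(beam.DoFn):
    def setup(self):
        # Load the spaCy model once per worker, not once per element.
        self.nlp = spacy.load('en_core_web_sm')

    def process(self, row):
        out = dict(row)  # do not mutate the input element
        doc = self.nlp(out['text_column'] or '')
        out['text_column_lemma'] = ' '.join(t.lemma_ for t in doc if not t.is_punct)
        yield out

# spaCy and its model must be installed on the workers (e.g. via setup.py/requirements).
options = PipelineOptions(
    runner='DataflowRunner', project='my-project',
    region='us-central1', temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromBigQuery(
           query='SELECT id, text_column FROM `my-project.my_dataset.source_table`',
           use_standard_sql=True)
     | 'Lemmatize' >> beam.ParDo(LemmatizeColumn())
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.destination_table',
           schema='id:STRING,text_column:STRING,text_column_lemma:STRING',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))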
I need to pull a table ABC from Salesforce with 4 million records and 250 columns. I'm using the Python simple-salesforce API to do it, but it is running out of memory as I'm using a machine with 8 GB of RAM.
Is there any way to query such a large number of records from Salesforce using PySpark? If so, please suggest how.
If there is any other approach using either Python or PySpark, please suggest that as well...
Is there a chance you are using query_all from simple_salesforce? If so, you might try query_more or query_all_iter instead, so that your script does not attempt to load all the records into a single Python list.
Simple Salesforce documentation here.
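For example, a minimal sketch (credentials, field list and CSV output are assumptions) that streams records with query_all_iter and writes them out as it goes instead of holding everything in memory:

import csv
from simple_salesforce import Salesforce

sf = Salesforce(username='user@example.com', password='***', security_token='***')

fields = ['Id', 'Name']  # extend with the remaining columns of ABC
soql = 'SELECT {} FROM ABC'.format(', '.join(fields))

with open('abc_export.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    # query_all_iter yields one record at a time, so memory use stays flat.
    for record in sf.query_all_iter(soql):
        record.pop('attributes', None)  # drop Salesforce metadata on each record
        writer.writerow({k: record.get(k) for k in fields})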
I'm using the Python client library to insert data into a BigQuery table. The code is as follows.
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, ...]
What's the problem? There is no exception either.
The client.insert_rows_json method uses streaming inserts.
Inserting data into BigQuery with streaming inserts causes a delay before the data shows up in the table preview on the BigQuery console; it does not appear there immediately. So
you need to query the table to confirm the data was inserted.
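For instance, a quick sanity check (the table name is an assumption) that counts rows instead of relying on the preview:

query = 'SELECT COUNT(*) AS n FROM `my-project.my_dataset.my_table`'
result = list(client.query(query).result())
print(result[0].n)  # streamed rows are visible to queries even before the preview updates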
There are two possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
Got the answer to my question. The problem was that I was inserting data for one extra column that was not in the table. I found a hack to work out whether data is failing to insert into the BigQuery table.
Convert the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for each column (see the sketch after these steps).
Run bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json in your terminal and see if it throws any errors. If it throws an error, it's likely that something is wrong with your data or your table schema.
Change the data/table schema as per the error and retry inserting the same data via Python.
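A minimal sketch of the conversion step (data_to_insert is assumed to be the same list of dicts that was passed to insert_rows_json):

import json

with open('newline_delimited_json_file.json', 'w') as f:
    for row in data_to_insert:
        f.write(json.dumps(row) + '\n')  # one JSON object per line, keys = column names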
It would be better if the Python API threw an error/exception like the terminal does; that would be helpful.
In my Python project, I need to fill a BigQuery table with a relational dataframe. I'm having a lot of trouble creating a new table from scratch and being sure that the first data I upload to it actually ends up in the table.
I've read the page https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency and have seen that applying an insertId to the insert query would solve the problem, but since I use pandas dataframes, the to_gbq function of the pandas-gbq package seems perfect for this task. Yet when using to_gbq and a new table is created/replaced, sometimes (apparently randomly) the first data chunk is not written into the table.
Does anybody know how to ensure the complete insertion of a DataFrame into a newly created BigQuery table? Thanks
I believe you are encountering https://github.com/pydata/pandas-gbq/issues/75. Basically, pandas uses the BigQuery streaming API to write data into tables, but the streaming API has a delay after table creation before it starts working.
Edit: Version 0.3.0 of pandas-gbq fixes this issue by using a load job to upload data frames to BigQuery instead of streaming.
In the meantime, I'd recommend using a "load job" to create the tables. For example, using the client.load_table_from_file method in the google-cloud-bigquery package.
from google.cloud.bigquery import LoadJobConfig
from six import StringIO

destination_table = client.dataset(dataset_id).table(table_id)
job_config = LoadJobConfig()
job_config.write_disposition = 'WRITE_APPEND'
job_config.source_format = 'NEWLINE_DELIMITED_JSON'

# Serialize each dataframe row as one line of newline-delimited JSON.
rows = []
for _, row in maybe_a_dataframe.iterrows():
    row_json = row.to_json(force_ascii=False, date_unit='s', date_format='iso')
    rows.append(row_json)

body = StringIO('{}\n'.format('\n'.join(rows)))
client.load_table_from_file(
    body,
    destination_table,
    job_config=job_config).result()  # wait for the load job to complete
Edit: This code sample fails for columns containing non-ASCII characters. See https://github.com/pydata/pandas-gbq/pull/108
Background
I studied and found that BigQuery doesn't accept schemas defined by online tools (they use different formats, even though the meaning is the same).
So I need a way to load data (where the number of columns keeps varying and increasing dynamically) into a table which has a fixed schema.
Thoughts
What I could do as a workaround is:
First, check if the data being loaded has extra fields.
If it does, a schema mismatch will occur, so first create a temporary table in BQ and load this data into it using the "autodetect" parameter, which gives me a schema (in a format that BQ accepts for schema files).
Now I can download this schema file and use it to update my existing table in BQ, then load the table with the appropriate data.
Suggestion
Any thoughts on this? If there is a better approach, please share.
We are in the process of releasing a new feature that can update the schema of the destination table within a load/query job. With autodetect and the new feature, you can directly load the new data into the existing table, and the schema will be updated as part of the load job. Please stay tuned. The current ETA is 2 weeks.
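For reference, this capability is exposed today in the google-cloud-bigquery client as schema update options on a load job; a minimal sketch (dataset/table/file names are assumptions):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

with open('new_data.json', 'rb') as f:
    job = client.load_table_from_file(
        f, 'my-project.my_dataset.my_table', job_config=job_config)
job.result()  # new columns are added to the destination schema as part of the load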