How to publish JSON data into PubSub and then push to BigQuery? - python

I am fetching Twitter data in JSON format and I want to push that data (same JSON format) into a Pub/Sub topic. I am able to publish data to the topic, but when I try to push the data into BigQuery, it ends up in the error record table (I have already created the BigQuery table with the proper schema and supplied the table_id and dataset_id in the configuration).

The issue could be due to the reasons below:
1] The destination table schema and the schema implied by the JSON messages are different.
2] The datatype of a column and the datatype specified in the JSON are different.
These corrective measures can be taken:
1] To fix a schema mismatch, make sure the JSON keys match the destination table schema. While creating the schema, follow the steps mentioned in doc [1], which explains how to specify the schema of a BigQuery table.
2] To fix a datatype mismatch, make sure you specify the correct datatype for each value.
Consider the JSON message below:
{"message": "hi", "id": 65, "marks": 89.8}
In the above example, message is a STRING, id is an INTEGER, and marks is a FLOAT.
Any mismatch in datatype will cause the message to be routed to the error record table.
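For illustration, here is a minimal publishing sketch, assuming a hypothetical project and topic wired to a destination table with columns message STRING, id INTEGER, marks FLOAT (names are placeholders, not from the question):

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "twitter-topic")

# Keys and value types must match the destination table schema exactly:
# message STRING, id INTEGER, marks FLOAT.
row = {"message": "hi", "id": 65, "marks": 89.8}

future = publisher.publish(topic_path, json.dumps(row).encode("utf-8"))
print(future.result())  # prints the message ID once the publish succeeds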
References:
[1] https://cloud.google.com/bigquery/docs/schemas
[2] https://cloud.google.com/bigquery/docs/error-messages#metadata-errors-for-streaming-inserts
[3] https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery

Related

Partial match file headers to Avro schema headers

I am building a data pipeline and making use of Avro as the file format. The pipeline is built in Python. The source data is received as csv and converted to avro format using a defined avro schema as expected.
During the conversion, there is obviously validation between the file schema and the avro schema. This will output an error if a column is missing from the file or if data types in the file do not match the schema.
My issue is the source files are received from an external source every month and often have changes to the headers and the content of the files. This is not an issue for major changes as this would need to be corrected on our side regardless. However, the headers of the source files often contain minor changes that do not warrant a break in the system and interference from our side - say if the schema header is "Client" and the source file header is "Client Name". Or even simpler, "Client" and "Clients".
Is there a way to perform partial header matches or keyword matching when converting to avro format? Any assistance with this would be much appreciated!
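One possible approach (a sketch, not something built into the Avro tooling) is to fuzzy-match the incoming file headers against the Avro schema field names before conversion, for example with Python's difflib, and flag anything that falls below a similarity threshold for manual review:

import difflib

def map_headers(file_headers, schema_fields, cutoff=0.6):
    """Best-effort mapping of CSV headers to Avro schema field names.

    Hypothetical helper: uses fuzzy string matching, so ambiguous or
    heavily renamed headers should still be reviewed manually.
    """
    mapping = {}
    for header in file_headers:
        matches = difflib.get_close_matches(header, schema_fields, n=1, cutoff=cutoff)
        mapping[header] = matches[0] if matches else None
    return mapping

# "Client Name" and "Clients" both resolve to the schema field "Client".
print(map_headers(["Client Name", "Clients", "Amount"], ["Client", "Amount"]))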

Pandas df `to_gbq` with nested data

I'm working in a limited Airflow environment in which I don't have access to google-cloud-bigquery but do have access to pandas-gbq. My goal is to load some JSON API data using some schema involving records into a BigQuery table. My strategy is to first read all the data into a pandas dataframe using a dictionary to represent the records: e.g.
uuid    metadata1                                    ...
001     {u'time_updated': u'', u'name': u'jeff'}     ...
Then, I've been trying to use pandas_gbq.to_gbq to load it into BQ. The issue is I get:
Error at Row: 0, Reason: invalid, Location: metadata1, Message: This field: metadata1 is not a record.
I realize this is because the Google Cloud documentation says that pandas-gbq "Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values."
And so I won't be able to upload a dataframe with records to BQ in this way since again I can't use google-cloud-bigquery in my environment.
What would be the best strategy for me to upload my data to BQ (around 30k rows and 6 or so columns with 8ish nested fields each)?
I know this sounds like a very bad strategy, but I could upload a flattened version of all the fields in a record as a single string to the BQ table, and then run a query from my code to replace these flattened fields with their record-form versions. But this seems really bad, since for a time the table would contain the wrong schema.
Any thoughts would be much appreciated. Thanks in advance.
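One possibility to consider, assuming the environment has the bq command-line tool available (an assumption, not something stated in the question), is to sidestep the CSV conversion entirely: serialize the dataframe to newline-delimited JSON and load it with a load job, which does support nested records. A rough sketch:

import subprocess
import pandas as pd

df = pd.DataFrame([{"uuid": "001", "metadata1": {"time_updated": "", "name": "jeff"}}])

# Write newline-delimited JSON; nested dicts become nested JSON objects.
df.to_json("rows.json", orient="records", lines=True)

# Hypothetical dataset/table names; requires the bq CLI and a table with a RECORD column.
subprocess.run(
    ["bq", "load", "--source_format=NEWLINE_DELIMITED_JSON",
     "my_dataset.my_table", "rows.json"],
    check=True,
)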

InternalError_: Spectrum Scan Error. S3 to Redshift copy command

I am trying to copy some data from an S3 bucket to a Redshift table by using the COPY command. The format of the file is Parquet. When I execute the COPY command query, I get InternalError_: Spectrum Scan Error.
This is the first time I tried copying from a parquet file.
Please help me if there is a solution for this. I am using boto3 in python.
This generally happens for the reasons below:
If there is a mismatch in the number of columns between the table and the file.
If the column type in your file schema is incompatible with the target table's column type.
Try going through the error logs. You may find a partial log in CloudWatch, and in the error output you can also find the number of the query you ran.
Go to the AWS Redshift query editor and run the query below to get the full log:
select message
from svl_s3log
where query = '<<your query number>>'
order by query,segment,slice;
Hope this helps!
This error usually indicates a problem with compatibility between the data in your file and your Redshift tables. You can get more insight about the error from the table SVL_S3LOG. In my case it was because the file had some invalid UTF-8 characters.
Spectrum Scan Errors are usually caused by two things:
a) a column mismatch between source and destination,
e.g. when copying data from S3 to Redshift, the columns of the Parquet file are not in the same order as those in the Redshift table.
b) a datatype mismatch between source and destination,
e.g. in an S3-to-Redshift copy, col1 has datatype integer in the Parquet file while the same col1 has datatype float in Redshift.
Verify the schema along with its datatypes;
matching the column order and the datatypes between source and destination will resolve the Spectrum Scan Error.
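Putting these suggestions together, a minimal sketch using boto3's Redshift Data API (cluster, database, IAM role, table, and S3 path are all assumed placeholders) might look like this:

import boto3

client = boto3.client("redshift-data")

# Run the COPY; the Parquet column order and types must match the target table.
client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        COPY my_schema.my_table
        FROM 's3://my-bucket/path/to/parquet/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS PARQUET;
    """,
)

# execute_statement is asynchronous; use describe_statement / get_statement_result
# to inspect the outcome. If the COPY fails with a Spectrum Scan Error, pull the
# detailed message from SVL_S3LOG:
client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT message FROM svl_s3log ORDER BY query DESC LIMIT 10;",
)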

Bigquery data not getting inserted

I'm using python client library to insert data to big query table. The code is as follows.
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, ...]
What's the problem? There is no exception either.
The client.insert_rows_json method uses streaming inserts. Streaming inserts cause latency in the table preview on the BigQuery console, so the data does not appear there immediately. You need to query the table to confirm the data was inserted.
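For example, a quick count query (the table name here is an assumed placeholder) confirms whether the streamed rows actually landed:

from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical fully qualified table name.
query = "SELECT COUNT(*) AS n FROM `my-project.my_dataset.my_table`"
for row in client.query(query).result():
    print(row.n)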
It can be one of 2 possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
I got the answer to my question. The problem was that I was inserting data for one extra column for which no data existed. I found a hack to figure out why the data is not being inserted into the BigQuery table:
Change the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for that particular column.
Run bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json in your terminal and see if it throws any errors. If it throws an error, it's likely that something is wrong with your data or table schema.
Change the data/table schema as per the error and retry inserting the same via Python.
It would be better if the Python API threw an error/exception like the terminal does; that would be helpful.
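In the meantime, a small sketch like the one below (the table id is an assumed placeholder) turns the per-row error details returned by insert_rows_json into an exception instead of a silently passing assert:

from google.cloud import bigquery

client = bigquery.Client()
table = "my-project.my_dataset.my_table"  # hypothetical table id
rows = [{"a": "b", "c": "d"}]

errors = client.insert_rows_json(table, rows)
if errors:
    # insert_rows_json reports bad rows in its return value rather than raising,
    # so turn them into an exception explicitly.
    raise RuntimeError(f"Streaming insert failed: {errors}")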

Bigquery : Create table if not exist and load data using Python and Apache AirFlow

First, I get all the data using a MySQL query from the production database and then store that data as newline-delimited JSON in Google Cloud Storage. What I want to do is:
1. check if the table exists
2. if the table doesn't exist, create the table using schema autodetect
3. store the data
All of this will be scheduled in Airflow. What really confuses me is number 2: how can I do this in Python? Or can Airflow do this automatically?
Airflow can do this automatically. The create_disposition parameter creates the table if needed. And the autodetect parameter does exactly what you need. This is for Airflow 1.10.2.
GCS_to_BQ = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq',
    bucket='test_bucket',
    source_objects=['folder1/*.csv', 'folder2/*.csv'],
    destination_project_dataset_table='dest_table',
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='bq-conn',
    google_cloud_storage_conn_id='gcp-conn',
    autodetect=True,  # This uses autodetect
    dag=dag
)
From the BigQuery command line, if your JSON file is on GCS, then Loading JSON data with schema auto-detection does 2 + 3 for you in one command.
Looking at the Airflow documentation, GoogleCloudStorageToBigQueryOperator seems to do the same thing. I checked its source: it simply calls the BigQuery load API. I believe it will do what you want.
When it is unclear what an argument means, you can search the BigQuery Jobs API documentation using the argument name.
E.g., to achieve 1 in your task list, you only need to specify:
write_disposition (string) – The write disposition if the table already exists.
But in order to know what string you need to pass as write_disposition, you have to search the BigQuery documentation.
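If you still want to perform check 1 yourself in Python before the load, a minimal sketch with the BigQuery client library (the table id is an assumed placeholder) could be:

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()
try:
    client.get_table("my-project.my_dataset.my_table")  # hypothetical table id
    table_exists = True
except NotFound:
    table_exists = False
print(table_exists)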
