I am trying to copy some data from an S3 bucket to a Redshift table using the COPY command. The format of the file is Parquet. When I execute the COPY command, I get InternalError_: Spectrum Scan Error.
This is the first time I tried copying from a parquet file.
Please help me if there is a solution for this. I am using boto3 in python.
This generally happens for the following reasons:
There is a mismatch in the number of columns between the table and the file.
A column type in your file's schema is incompatible with the corresponding column type in the target table.
Try going into the error logs. You might find a partial log in CloudWatch. From the screenshot you have uploaded, you can also find the query number of the query you ran.
Go to the AWS Redshift query editor and run the query below to get the full log:
select message
from svl_s3log
where query = '<<your query number>>'
order by query,segment,slice;
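Since the question mentions boto3, here is a rough sketch of running that same diagnostic query through the Redshift Data API instead of the query editor; the cluster identifier, database, user, and the query number 12345 are placeholders you would replace with your own values.

import time
import boto3

# Placeholder cluster/database/user names; 12345 stands in for your query number.
rsd = boto3.client("redshift-data", region_name="us-east-1")
resp = rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="select message from svl_s3log where query = 12345 order by query, segment, slice;",
)

# Poll until the statement completes, then print the Spectrum error messages.
status = rsd.describe_statement(Id=resp["Id"])["Status"]
while status not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
    status = rsd.describe_statement(Id=resp["Id"])["Status"]

if status == "FINISHED":
    for record in rsd.get_statement_result(Id=resp["Id"])["Records"]:
        print(record[0].get("stringValue"))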
Hope this helps!
This error usually indicates a compatibility problem between the data in your file and the Redshift table. You can get more insight into the error from the table 'SVL_S3LOG'. In my case it was because the file had some invalid UTF-8 characters.
Spectrum Scan Errors are usually caused by two things.
a) A column mismatch between source and destination,
e.g. if you are copying data from S3 to Redshift, the Parquet columns are not in the same order as those in the Redshift table.
b) A datatype mismatch between source and destination,
e.g. in an S3 to Redshift copy, col1 has datatype integer in the Parquet file while the same col1 has datatype float in Redshift.
Verify the schema and its datatypes;
matching the column order and the datatypes between source and destination will resolve the Spectrum Scan Error.
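One quick way to do that comparison, assuming you have pyarrow installed and a local copy of one of the Parquet files (file and table names below are placeholders), is to print the Parquet schema and set it against the table definition from SVV_COLUMNS:

import pyarrow.parquet as pq

# Placeholder file name: a local copy of one of the Parquet files from S3.
parquet_schema = pq.read_schema("part-00000.parquet")
for field in parquet_schema:
    print(field.name, field.type)

# Then compare against the Redshift side, for example by running:
#   select column_name, data_type
#   from svv_columns
#   where table_name = 'my_table'
#   order by ordinal_position;
# The column order and datatypes must line up with the Parquet fields above.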
Related
I am fetching Twitter data in JSON format and I want to push that data (in the same JSON format) into a Pub/Sub topic. I am able to publish data to the topic, but when I try to push the data into BigQuery, it stores the data in the "error record table" (I have already created the BigQuery table with the proper schema and supplied the table_id and dataset_id as credentials).
The issue could be due to the following reasons:
1] The destination table schema and the schema specified in the JSON format are different.
2] The datatype of a column and the datatype specified in the JSON format are different.
These corrective measures can be taken:
1] To fix the schema mismatch issue, make sure the JSON matches the destination table schema. While creating the schema, try following the steps mentioned in doc [1], which explains how to specify the schema of a BigQuery table.
2] Since there is a mismatch in datatype, make sure that you specify the correct datatype.
Consider the JSON record below:
{"message":"hi","id":65,"marks":89.8}
In the above example, message is a string, id is an integer, and marks is a float.
So on any datatype mismatch, the message will be published to the error record table.
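For instance, a destination table that accepts the record above could be created with a matching schema via the Python client; the project, dataset, and table names here are made up:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

schema = [
    bigquery.SchemaField("message", "STRING"),
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("marks", "FLOAT"),
]
table = bigquery.Table("my-project.my_dataset.tweets", schema=schema)
client.create_table(table)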
References:
[1] https://cloud.google.com/bigquery/docs/schemas
[2] https://cloud.google.com/bigquery/docs/error-messages#metadata-errors-for-streaming-inserts
[3] https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery
Basically I want an SQL connection to a CSV file in an S3 bucket using Amazon Athena. I don't know anything about the file other than that the first row gives the header names. Does anyone know a solution for this?
You have at least two ways of doing this. One is to examine a few rows of the file to detect the data types, then write a CREATE TABLE SQL statement as shown in the Athena docs.
If you know you are getting only strings and numbers (for example) and if you know all the columns will have values, it can be relatively easy to build it that way. But if types can be more flexible or columns can be empty, building a robust solution from scratch might be tricky.
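As a rough sketch of that manual route, assuming a plain comma-delimited file without quoted fields, the DDL can be submitted from Python with boto3; the bucket, database, and column definitions are placeholders you would adapt after inspecting the file:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical columns guessed from a few rows of the CSV; the first row is skipped as the header.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_csv_table (
  id int,
  name string,
  amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/csv-data/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)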
So the second option would be to use the AWS Glue Catalog to define a crawler, which does exactly what I described above, but automatically. It also creates the metadata you need in Athena, so you don't need to write the CREATE TABLE statement.
As a bonus, you can use that automatically catalogued data not only from Athena, but also from Redshift and EMR. And if you keep adding new files to the same bucket (every day, every hour, every week...) you can tell the crawler to run again and rediscover the data in case the schema has evolved.
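If you go the Glue route, the crawler can also be set up from Python; the role ARN, database name, and S3 path below are placeholders:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical names; an existing IAM role with Glue and S3 access is required.
glue.create_crawler(
    Name="csv-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/csv-data/"}]},
)
glue.start_crawler(Name="csv-crawler")

# Once the crawler finishes, the table appears in the Glue Data Catalog
# and can be queried from Athena (or Redshift Spectrum, or EMR) right away.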
I'm using the Python client library to insert data into a BigQuery table. The code is as follows.
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a':'b','c':'d'},{'a':'f','q':'r'}, ...]
What's the problem? There is no exception either.
The client.insert_rows_json method uses streaming inserts. Inserting data into BigQuery via streaming inserts causes a delay before it shows up in the table preview on the BigQuery console. The data does not appear there immediately, so you need to query the table to confirm that the data was inserted.
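For example, a simple count query confirms whether the rows actually landed; this assumes client and tablename from the question's snippet, with tablename being a fully qualified table id:

count_job = client.query(f"SELECT COUNT(*) AS n FROM `{tablename}`")
for row in count_job.result():
    print(row.n)  # non-zero once the streamed rows are queryable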
There are two possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
I got the answer to my question. The problem was that I was inserting data for one extra column that was not in the table. I found a hack to figure out why data is not getting inserted into a BigQuery table.
Change the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for that particular column.
Run the following in your terminal and see if it throws any errors:
bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json
If it throws an error, it's likely that something is wrong with your data or table schema.
Change the data or table schema as per the error and retry the insert via Python.
It would be helpful if the Python API threw an error or exception the way the terminal does.
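The same check can be done without leaving Python, because a load job (unlike streaming inserts) reports schema problems when you wait on it. A minimal sketch, with the table id as a placeholder and data_to_insert being the same list of dicts as before:

from google.cloud import bigquery

client = bigquery.Client()

# Load the rows through a load job instead of streaming inserts.
job = client.load_table_from_json(data_to_insert, "my-project.my_dataset.my_table")

try:
    job.result()  # blocks until the load finishes; raises if the job failed
except Exception:
    print(job.errors)  # per-field details, similar to what bq prints in the terminal
    raise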
Background
I studied this and found that BigQuery doesn't accept schemas defined by online tools (which have different formats, even though the meaning is the same).
So I need a way to load data (where the number of columns keeps varying and increasing dynamically) into a table which has a fixed schema.
Thoughts
What I could do as a workaround is:
First, check if the data being loaded has extra fields.
If it does, a schema mismatch will occur, so first create a temporary table in BQ and load this data into it using the "autodetect" parameter, which gives me a schema (in a format which BQ accepts for schema files).
Now I can download this schema file and use it to update my existing table in BQ and load it with the appropriate data (a sketch of this is below).
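A minimal sketch of that workaround with the Python client, where the scratch table and file names are made up: load into a temporary table with autodetect, then read back the schema BigQuery inferred.

from google.cloud import bigquery

client = bigquery.Client()

# 1. Load the new file into a scratch table, letting BigQuery infer the schema.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
with open("new_data.json", "rb") as source_file:  # hypothetical file name
    client.load_table_from_file(
        source_file, "my-project.staging.temp_table", job_config=job_config
    ).result()

# 2. Read back the schema BigQuery detected, in the format BQ accepts.
detected_schema = client.get_table("my-project.staging.temp_table").schema
for field in detected_schema:
    print(field.name, field.field_type)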
Suggestion
Any thoughts on this? If there is a better approach, please share.
We are in the process of releasing a new feature that can update the schema of the destination table within a load/query job. With autodetect and the new feature you can directly load the new data to the existing table, and the schema will be updated as part of the load job. Please stay tuned. The current ETA is 2 weeks.
Background
I am loading files from my local machine to BigQuery. Each file has a variable number of fields, so I am using 'autodetect=true' when running the load job.
The issue is that when the load job is run for the first time and the destination table doesn't exist, BigQuery creates the table by inferring the fields present in the file, and that becomes the new table's schema.
Now, when I run a load job with a different file which contains some extra fields (e.g. "Middle Name":"xyz"), BigQuery throws an error saying the field doesn't exist in the table.
From this post: BigQuery: add new column to existing tables using python BQ API, I learnt that columns can be added dynamically. However, what I don't understand is:
Query
How will my program come to know that the file being uploaded contains extra fields and that a schema mismatch will occur? (Not a problem if the table doesn't exist, because a new table will be created.)
If my program can somehow infer the extra fields present in the file being uploaded, I could add those columns to the existing table and then run the load job.
I am using the Python BQ API.
Any thoughts on how to automate this process would be helpful.
You should check the schema update options. There is an option named "ALLOW_FIELD_ADDITION" that will help you; a sketch is below.
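In the Python client this surfaces as a schema update option on the load job config. A sketch, with the table and file names as placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Allows the load job to add the new columns it finds in the file.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

with open("file_with_extra_fields.json", "rb") as source_file:  # hypothetical file
    client.load_table_from_file(
        source_file, "my-project.my_dataset.my_table", job_config=job_config
    ).result()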
A naive solution would be:
1. Get the target table schema using:
service.tables().get(projectId=projectId, datasetId=datasetId, tableId=tableId)
2. Generate the schema of the data in your file.
3. Compare the schemas (a kind of "diff") and then add to the target table the columns that are extra in your data's schema (see the sketch below).
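A sketch of that diff using the current google-cloud-bigquery client rather than the older discovery-based service.tables().get(...) call; the table id, the rows variable, and the STRING default for new columns are assumptions:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # hypothetical table id

existing = {field.name for field in table.schema}
# `rows` stands for the list of dicts you are about to load.
incoming = {key for row in rows for key in row}

missing = sorted(incoming - existing)
if missing:
    # Naively type every new column as STRING; adjust the types as needed.
    table.schema = list(table.schema) + [
        bigquery.SchemaField(name, "STRING") for name in missing
    ]
    client.update_table(table, ["schema"])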
Any better ideas or approaches would be highly appreciated!