Bigquery data not getting inserted - python

I'm using the Python client library to insert data into a BigQuery table. The code is as follows:
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, ...]
What's the problem? There is no exception either.

The client.insert_rows_json method uses streaming inserts.
Data loaded into BigQuery via streaming inserts takes time to show up in the table preview in the BigQuery console.
The data does not appear there immediately, so you need to query the table to confirm the rows were inserted.
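A minimal sketch of such a confirmation query, reusing the client and variables from the question (the column alias is illustrative):
# Streamed rows are queryable even while the console preview still lags behind.
query_job = client.query(f"SELECT COUNT(*) AS row_count FROM `{tablename}`")
for row in query_job.result():
    print(row.row_count)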

There are two possible situations:
your data does not match the table schema (a quick check for this is sketched below)
your table is freshly created, and the update is just not yet visible
References:
Related GitHub issue
Data availability
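To rule out the first situation, you can compare the table's schema with the keys present in your rows. A minimal sketch, assuming the client and variables from the question:
table = client.get_table(tablename)  # tablename must be a fully qualified table ID
schema_fields = {field.name for field in table.schema}

for row in data_to_insert:
    unknown = set(row) - schema_fields
    if unknown:
        print("Keys not present in the table schema:", unknown)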

I got the answer to my question. The problem was that I was inserting data for one extra column that does not exist in the table. I found a workaround to figure out why data is not being inserted into a BigQuery table:
Convert the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for that particular column (see the sketch below).
Run bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json in your terminal and see if it throws any errors. If it does, something is likely wrong with your data or table schema.
Fix the data or table schema as per the error and retry the insert via Python.
It would be helpful if the Python API raised an error/exception the way the terminal command does.
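A minimal sketch of the conversion step, assuming data_to_insert is the same list of dicts from the question (the file name is illustrative):
import json

# Write one JSON object per line (newline-delimited JSON) so that
# `bq load --source_format=NEWLINE_DELIMITED_JSON` can read it.
with open("newline_delimited_json_file.json", "w") as f:
    for row in data_to_insert:
        f.write(json.dumps(row) + "\n")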

Related

How to deal with (cx_Oracle.DatabaseError) DPI-1037: column at array position fetched with error 1406?

I am using Python, pandas, sqlalchemy and cx_Oracle to execute a select query in an Oracle DB.
But I am not able to retrieve data from a column that has type LONG. When I remove it from the query, it runs fine, but when it is included I get "(cx_Oracle.DatabaseError) DPI-1037: column at array position (*) fetched with error 1406"
I ran into this post: How to solve error: ORA-01406 fetched column value was truncated?, and read this article: http://www.dba-oracle.com/t_ora_01406_fetched_column_value_was_truncated.htm
The problem seems to be related to the size of the column buffer, but I don't know how I would overcome it using pandas or another Python library.
Is there any way I can deal with this error?
Thank you in advance
For anyone who stumbles upon this, I ended up executing the query via cx_Oracle directly and loading the result with pandas afterwards.
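A minimal sketch of that approach (the connection string, table, and column names are placeholders, not the exact code from the answer):
import cx_Oracle
import pandas as pd

conn = cx_Oracle.connect("user/password@host:1521/service_name")
cursor = conn.cursor()
cursor.execute("SELECT id, long_column FROM my_table")  # query including the LONG column

# Build the DataFrame from the fetched rows and the cursor metadata.
columns = [desc[0] for desc in cursor.description]
df = pd.DataFrame(cursor.fetchall(), columns=columns)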

Pandas df `to_gbq` with nested data

I'm working in a limited Airflow environment in which I don't have access to google-cloud-bigquery but do have access to pandas-gbq. My goal is to load some JSON API data, whose schema involves records, into a BigQuery table. My strategy is to first read all the data into a pandas DataFrame, using a dictionary to represent the records, e.g.:
uuid | metadata1                                | ...
001  | {u'time_updated': u'', u'name': u'jeff'} | ...
Then I've been trying to use pandas_gbq.to_gbq to load this into BQ. The issue is that I get
Error at Row: 0, Reason: invalid, Location: metadata1, Message: This field: metadata1 is not a record.
and I realize this is because, as the Google Cloud documentation says, pandas-gbq "Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values."
So I won't be able to upload a DataFrame with records to BQ this way, since, again, I can't use google-cloud-bigquery in my environment.
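A minimal sketch of the setup described above (the destination table and project ID are illustrative; this reproduces the failing approach rather than a fix):
import pandas as pd
import pandas_gbq

df = pd.DataFrame({
    "uuid": ["001"],
    "metadata1": [{"time_updated": "", "name": "jeff"}],  # dict column intended to become a RECORD
})

# Fails with "This field: metadata1 is not a record" because pandas-gbq
# serializes the DataFrame to CSV, which cannot represent nested values.
pandas_gbq.to_gbq(df, "dataset.table", project_id="my-project")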
What would be the best strategy for me to upload my data to BQ (around 30k rows and 6 or so columns with 8ish nested fields each)?
I know this sounds like a very bad strategy, but I could upload a flattened version of all fields in a record as a single string to the BQ table, and then run a query from my code to replace these flattened fields with their record-form versions. But this seems really bad since, for a time, the table would contain the wrong schema.
Any thoughts would be much appreciated. Thanks in advance.

SalesForce SOQL issue with query

I've been trying to pull a table from Salesforce into a Jupyter notebook to use as a DataFrame. When I create the query and try executing it, I get this message:
Malformed request https://lumn.my.salesforce.com/services/data/v42.0/query/?q=SELECT+WhoId%2C+ActivityDate%2C+OwnerId%2C+IsDeleted%2C+AccountId%2C+IsClosed%2C+CreatedDate%.... Response content: [{'message': "exceeded 100000 distinct who/what's", 'errorCode': 'OPERATION_TOO_LARGE'}]
I pulled in another table successfully using sf.query_all(), which had more than 100000 rows, so I'm stumped on this one, and there is not a lot of documentation on what to do at this point since the error doesn't give me a recordsURL to pull in increments. I also tried sf.bulk.Task.query_all(), but that does not work either, and I got the same error message when pulling this data through Excel. The project I'm working on does not filter this table down in any way, so it has to pull everything from the columns I am working with. If anyone has any idea how to circumvent this, please share!
I'm not familiar with this error, but you might give "PK chunking" a go: https://developer.salesforce.com/docs/atlas.en-us.234.0.api_asynch.meta/api_asynch/async_api_headers_enable_pk_chunking.htm
And there's some good reading on working with huge tables in general: https://developer.salesforce.com/docs/atlas.en-us.salesforce_large_data_volumes_bp.meta/salesforce_large_data_volumes_bp/ldv_deployments_introduction.htm
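A rough sketch of enabling PK chunking on a Bulk API query job by sending the header from the linked docs, reusing the session from the existing simple_salesforce connection sf (the API version, chunk size, and object name are assumptions, not tested values):
import requests

bulk_url = f"https://{sf.sf_instance}/services/async/42.0/job"
headers = {
    "X-SFDC-Session": sf.session_id,
    "Content-Type": "application/xml; charset=UTF-8",
    "Sforce-Enable-PKChunking": "chunkSize=100000",  # the header described in the linked docs
}
job_xml = """<?xml version="1.0" encoding="UTF-8"?>
<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
    <operation>query</operation>
    <object>Task</object>
    <contentType>CSV</contentType>
</jobInfo>"""

response = requests.post(bulk_url, data=job_xml, headers=headers)
response.raise_for_status()
# The response contains a job id; add the SOQL query as a batch to the job,
# then poll for the chunked batches and download each result set.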

snowflake select cursor statement fails

cursor.execute("SELECT * FROM table")
I am using the above code to execute the select query, but it gets stuck because the table has 93 million records.
Is there any other method to extract all the data from a Snowflake table in a Python script?
Depending on what you are trying to do with that data, it would probably be most efficient to run a COPY INTO <location> statement to unload the data into a file in a stage, and then run a GET via Python to bring that file to wherever you are running Python.
However, you might want to provide more detail on how you are using the data in Python after the cursor.execute statement. Are you going to iterate over the data set to do something (in which case you may be better off issuing SQL statements directly to Snowflake instead), load it into pandas (there are better Snowflake functions for pandas in that case), or something else? If you are just creating a file from it, then my suggestion above will work.
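A rough sketch of that unload-and-download approach (the stage, table, local path, and connection parameters are placeholders for illustration):
import snowflake.connector

con = snowflake.connector.connect(
    user="...", password="...", account="...",
    warehouse="...", database="...", schema="...",
)
cur = con.cursor()

# Unload the table into compressed CSV files in an internal stage.
cur.execute("""
    COPY INTO @my_stage/large_table/
    FROM large_table
    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
""")

# Download the unloaded files to the machine running Python.
cur.execute("GET @my_stage/large_table/ file:///tmp/large_table/")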
The problem is that when you fetch data from Snowflake into Python, the query gets stuck because of the volume of records and the Snowflake-to-Python data conversion.
Are you trying to fetch all the data from the table, and how are you using the data downstream? That is the most important question. Also, restrict the number of columns you select.
Improving Query Performance by Bypassing Data Conversion
To improve query performance, use the SnowflakeNoConverterToPython class in the snowflake.connector.converter_null module to bypass data conversions from the Snowflake internal data type to the native Python data type, e.g.:
import snowflake.connector
from snowflake.connector.converter_null import SnowflakeNoConverterToPython

con = snowflake.connector.connect(
    ...,  # your usual connection parameters
    converter_class=SnowflakeNoConverterToPython,
)
for rec in con.cursor().execute("SELECT * FROM large_table"):
    # rec contains raw Snowflake data, skipping conversion to native Python types
    ...

Reattempting of failed bulk inserts in pymongo

I am trying to do a bulk insert of documents into a MongoDB collection in Python, using pymongo. This is what the code looks like:
collection_name.insert_many([logs[i] for i in range(len(logs))])
where logs is a list of dictionaries of variable length.
This works fine when there are no issues with any of the logs. However, if any one of the logs has some kind of issue and pymongo refuses to save it (say, the document fails to match the validation schema set for that collection), the whole bulk insert fails and no documents are inserted into the database.
Is there any way I can retry the bulk insert by ignoring only the defective log?
You can ignore those types of errors by passing ordered=False as an option: collection.insert_many(logs, ordered=False). With that, all operations are attempted before an exception is raised, and you can catch the exception.
See https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
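A minimal sketch of catching that exception and inspecting which documents were rejected (names follow the question):
from pymongo.errors import BulkWriteError

try:
    collection_name.insert_many(logs, ordered=False)
except BulkWriteError as exc:
    # Every log except the defective ones has been inserted;
    # the details list which operations failed and why.
    for error in exc.details["writeErrors"]:
        print(error["index"], error["errmsg"])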
