BigQuery API for access log - I'm losing data - python

I have been writing an access log to a MySQL table, but recently it became too much for MySQL, so I decided to save it in Google BigQuery instead. I don't know if it is the best option, but it seems viable. Does anyone have comments about that? Okay...
I started integrating with Google BigQuery and made a small application with Flask (a Python framework). I created endpoints to receive data and send it to BigQuery. Now my main application sends data to a URL that points to my Flask application, which in turn sends it to BigQuery. Any observations or suggestions here?
Finally, my problem: sometimes I'm losing data. I made a script to test my main application and check the results. I ran the script many times and noticed that I lost some data, because sometimes the same data is saved and sometimes it isn't. Does anyone have an idea of what could be happening? And most importantly, how can I prevent losing data in that case? How can my application be prepared to notice that data wasn't saved to Google BigQuery and then handle it, for example by trying again?
I am using the google-cloud-python library (reference: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#tables).
My code:
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client(project=project_id)
table_ref = client.dataset(dataset_id).table(table_id)
SCHEMA = [SchemaField(**field) for field in schema]
# create_rows streams the rows and returns a list of per-row insert errors
errors = client.create_rows(table_ref, [row], SCHEMA)
That is all

As I expected, you don't handle errors. Make sure you handle the return value and understand how streaming inserts work: if you stream 1,000 rows and 56 of them fail, you get those 56 back, and you need to retry only those 56 rows. The insertId field is also important, since BigQuery uses it to de-duplicate rows that are sent more than once.
Streaming Data into BigQuery
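To make the retry idea concrete, here is a minimal sketch, assuming the newer insert_rows_json API of google-cloud-bigquery (the function name, attempt limit, and backoff are my own choices, not from the original answer):

import time

def stream_with_retry(client, table, rows, row_ids, max_attempts=5):
    # Stream rows and retry only the ones BigQuery reports as failed.
    pending = list(zip(row_ids, rows))
    for attempt in range(max_attempts):
        ids = [row_id for row_id, _ in pending]
        payload = [row for _, row in pending]
        # row_ids become insertId values, so retried rows can be de-duplicated
        errors = client.insert_rows_json(table, payload, row_ids=ids)
        if not errors:
            return []
        failed = {e["index"] for e in errors}
        pending = [item for i, item in enumerate(pending) if i in failed]
        time.sleep(2 ** attempt)  # simple exponential backoff between attempts
    return pending  # (row_id, row) pairs that still failed after all attempts

Anything the function returns after the last attempt can be written to a local queue or a dead-letter table so the data is not silently lost.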

Related

Use Google Cloud Scheduler to get a Pandas Data Frame loaded into Google Big Query?

I'm following the documentation from the link below.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_gbq.html#pandas.DataFrame.to_gbq
Everything is set up perfectly fine in my data frame. Now I'm trying to export it to GBQ. Here's a one-liner that should pretty much work... but it doesn't:
pandas_gbq.to_gbq(my_df, 'table_in_gbq', 'my_project_id', chunksize=None, reauth=False, if_exists='append', private_key=False, auth_local_webserver=True, table_schema=None, location=None, progress_bar=True, verbose=None)
I'm having a lot of trouble getting Cloud Scheduler to run the job successfully. The scheduler runs, and I see a message saying the 'Result' was a 'Success', but no data is actually loaded into BigQuery. When I run the job on the client side, everything is fine. When I run it from the server, no data gets loaded. I'm guessing the credentials are throwing it off, but I can't tell for sure what's going on. All I know for sure is that Google says the job runs successfully, but no data is loaded into my table.
My question is: how can I modify this to run using Google Cloud Scheduler, with minimal security, so I can rule out some kind of security issue? Or otherwise determine exactly what's going on here?
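One way to take ambient credentials out of the picture is to pass an explicit service-account credential to pandas-gbq. A minimal sketch, assuming pandas-gbq 0.8 or later and a service-account JSON key (the key path and table names are placeholders):

import pandas_gbq
from google.oauth2 import service_account

# Load an explicit service-account key instead of relying on ambient credentials
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"  # placeholder path
)

pandas_gbq.to_gbq(
    my_df,
    "my_dataset.table_in_gbq",   # placeholder dataset.table
    project_id="my_project_id",
    if_exists="append",
    credentials=credentials,
)

If the scheduled job loads data with explicit credentials but not without them, the problem is most likely the default credentials available in the server-side environment.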

Google BigQuery web UI not returning correct data

I saved some analysis data into BigQuery by date using the Python googleapiclient in batch mode (service.tabledata().insertAll()). The response indicates success, but when I query in the Web UI it does not return the correct data; it looks like some days' data is lost.
But when I query the data using the BigQuery client service in a Python program (client.run_sync_query(sql)), it returns the right data.
Can someone explain how this abnormal behavior happens? Any useful advice would be appreciated.
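Since insertAll is a streaming insert, one thing worth checking is whether the missing rows are still in the table's streaming buffer; data in the buffer is returned by queries but is not always reflected in the web UI preview or table details. A small diagnostic sketch with the google-cloud-bigquery client (project and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my_project")        # placeholder project
table = client.get_table("my_dataset.my_table")       # placeholder table

# Rows still in the streaming buffer are queryable but may not show up
# in the web UI preview until they are flushed to managed storage.
if table.streaming_buffer:
    print("rows in streaming buffer:", table.streaming_buffer.estimated_rows)
else:
    print("no streaming buffer; all rows are in managed storage")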

How to sync data between a Google Sheet and a MySQL DB?

So I have a Google Sheet that maintains a lot of data. I also have a MySQL DB with a huge chunk of data. There is a vital piece of information in the Sheet that is also present in the DB, and both need to be in sync. The information always enters the Sheet first. I had a Python script with MySQL queries to update my database separately.
Now the workflow has changed: data will enter the Sheet, and whenever that happens the database has to be updated automatically.
After some research, I found that using the onEdit function of Google Apps Script (I learned about it from here), I could pick up when the file has changed.
The next step is to fetch the data from the relevant cell, which I can do using this.
Now I need to connect to the DB and send some queries. This is where I am stuck.
Approach 1:
Have a Python web app running live and send the data to it via UrlFetchApp. This I have yet to try (a sketch of the receiving end follows at the end of this question).
Approach 2:
Connect to MySQL remotely through Apps Script. But after 2-3 hours of reading the docs, I am not sure this is possible.
So this is my scenario. Is there any viable solution you can think of, or a better approach?
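For Approach 1, a minimal sketch of what the receiving Python web app might look like, assuming Flask and the PyMySQL driver (the endpoint path, table, and column names are hypothetical):

import pymysql
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/sheet-update", methods=["POST"])   # hypothetical endpoint
def sheet_update():
    payload = request.get_json()                # sent from Apps Script via UrlFetchApp
    conn = pymysql.connect(host="localhost", user="app",
                           password="secret", database="mydb")
    try:
        with conn.cursor() as cur:
            # hypothetical table and columns; adjust to the real schema
            cur.execute(
                "UPDATE records SET value = %s WHERE record_id = %s",
                (payload["value"], payload["record_id"]),
            )
        conn.commit()
    finally:
        conn.close()
    return jsonify(status="ok")

On the Apps Script side, onEdit would then POST a JSON payload to this endpoint with UrlFetchApp.fetch.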
Connect directly to MySQL. You likely missed reading this part: https://developers.google.com/apps-script/guides/jdbc
Using JDBC within Apps Script will work if you have the time to build this yourself.
If you don't want to roll your own solution, check out SeekWell. It allows you to connect to databases and write SQL queries directly in Sheets. You can create a "Run Sheet" that will run multiple queries at once and schedule those queries to run without you even opening the Sheet.
Disclaimer: I made this.

Saving Data on GAE: logging vs. datastore

I have a Google App Engine app that has to deal with a lot of data collection. The data I gather is around millions of records per day. As I see it, there are two simple approaches to dealing with this in order to be able to analyze the data:
1. Use the logging API to generate App Engine logs, and then try to load these into BigQuery (or, more simply, export them to CSV and do the analysis in Excel).
2. Save the data in the App Engine datastore (ndb), then download that data later and/or load it into BigQuery.
Is there any preferable method of doing this?
Thanks!
BigQuery has a new Streaming API, which they claim was designed for high-volume real-time data collection.
Advice from practice: we are currently logging 20M+ multi-event records a day via method 1 as described above. It works pretty well, except when the batch uploader is not called (normally every 5 minutes); then we need to detect this and re-run the importer.
Also, we are currently in the process of migrating to the new Streaming API, but it is not yet in production, so I can't say how reliable it is.
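For reference, the "load it into BigQuery" step of method 1 could look roughly like this with the modern google-cloud-bigquery client, assuming the logs have already been exported as CSV files to a Cloud Storage bucket (the project, bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my_project")   # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,           # assumes a header row in the exported CSV
    autodetect=True,               # or pass an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-log-bucket/exports/*.csv",   # placeholder GCS path
    "my_dataset.daily_events",            # placeholder destination table
    job_config=job_config,
)
load_job.result()  # waits for the batch load job to finish
print("loaded rows:", client.get_table("my_dataset.daily_events").num_rows)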

Google AppEngine - How To Perform a Partial Datastore Download

I have a running GAE app that has been collecting data for a while. I am now at the point where I need to run some basic reports on this data and would like to download a subset of the live data to my dev server. Downloading all entities of a kind would simply be too big a data set for the dev server.
Does anyone know of a way to download a subset of entities of a particular kind? Ideally it would be based on entity attributes like date or client ID, but any method would work. I've even tried a regular, full download and then arbitrarily killing the process when I thought I had enough data, but it seems the data is locked up in the .sql3 files generated by the bulkloader.
It looks like the default utilities for downloading from and uploading to the GAE datastore (appcfg.py and bulkloader.py) don't support filtering.
It seems reasonable to do one of two things:
write a utility (select + export + save to a local file) and run it locally against the remote GAE datastore from the remote API shell, as sketched below
write an admin web handler that does select + export + zip: add a new URL to the app's handlers, upload it to GAE, and call it over HTTP
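A rough sketch of the first option, assuming an ndb model and that the code is pasted into remote_api_shell.py (the kind, properties, and filter are hypothetical):

import csv
from google.appengine.ext import ndb

class LogRecord(ndb.Model):            # hypothetical kind
    client_id = ndb.StringProperty()
    created = ndb.DateTimeProperty()
    payload = ndb.TextProperty()

def export_subset(client_id, since, path="subset.csv"):
    # Export only the entities matching the filter to a local CSV file.
    query = LogRecord.query(LogRecord.client_id == client_id,
                            LogRecord.created >= since)
    with open(path, "wb") as fh:
        writer = csv.writer(fh)
        writer.writerow(["client_id", "created", "payload"])
        for record in query.iter(batch_size=500):
            writer.writerow([record.client_id,
                             record.created.isoformat(),
                             (record.payload or u"").encode("utf-8")])

Run from the remote API shell, the query executes against the live datastore while the CSV file is written on the local machine.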
