Google Cloud Storage <-> Google App Engine -> Google BigQuery - python

We are developing a project to process our log data. The idea is to:
upload log data from our local Logstash instance to Google Cloud Storage
write a Python script that inserts a job to import the log data from Google Cloud Storage into Google BigQuery
write a Python script to process the data in BigQuery itself
Note: for the Python scripts, we are deciding whether to run them on Google App Engine or Google Compute Engine.
The questions are:
Is this a practical solution?
The structure of the log data changes quite often, and this will cause errors when inserting into BigQuery. How are we going to handle that in the Python script?
In case we have to re-run the log data for a particular period, how can we do that? Do we need to write a Python script?
Thanks

There is a new API for streaming data directly into BigQuery which may be a better match for your use case.
Instead of using a job to load data into BigQuery, you can choose to
stream your data into BigQuery one record at a time by using the
tabledata().insertAll() method. This approach enables querying data
without the delay of running a load job. There are several important
trade-offs to consider before choosing an approach.
If the structure of your data changes, you could have BigQuery run over the raw tables and update the derived tables accordingly. Streaming the raw data gives you the most flexibility, but at the higher cost of having to post-process the data again.
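As a minimal sketch of what a streaming insert looks like with the same discovery-based client used in the load-job answer below (the authorized http object, project, dataset, table and row contents are all placeholders):

from apiclient.discovery import build

# oAuthedHttp is an httplib2.Http object already authorized with OAuth 2.
service = build("bigquery", "v2", http=oAuthedHttp)

rows = [
    # One dict per record; an optional "insertId" key enables de-duplication.
    {"json": {"timestamp": "2014-01-01T00:00:00Z", "message": "example log line"}},
]

response = service.tabledata().insertAll(
    projectId="Example-BigQuery-ProjectId",
    datasetId="LogsDataset",
    tableId="LogsTable",
    body={"rows": rows}
).execute()

# Streaming inserts fail per row, not per request, so check insertErrors.
print(response.get("insertErrors", []))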

There is the streaming data solution that someone has already mentioned, but if you're trying to move a large block of log data rather than set up a continuous stream, you may want to take the route of using asynchronous load jobs instead.
The GCS client library acts like most Python file libraries when used in Google App Engine, and can store files for import in Cloud Storage buckets:
import cloudstorage as gcs

filePath = "/CloudStorageBucket/dir/dir/logs.json"

# The file is closed automatically when the with block exits.
with gcs.open(filePath, "w") as f:
    f.write(SomeLogData)
You can instruct BigQuery to load a list of CSV or newline-delimited JSON files from Cloud Storage by creating load jobs via the API (note: you will need to use OAuth 2):
from apiclient.discovery import build

# oAuthedHttp is an httplib2.Http object already authorized with OAuth 2.
service = build("bigquery", "v2", http=oAuthedHttp)

job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.json"],
            "schema": {
                "fields": [
                    {"name": "Column1",
                     "type": "STRING"},
                    ...
                ]
            },
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
    projectId="Example-BigQuery-ProjectId",
    body=job
).execute()
You can read more about how to create BigQuery load jobs if you want to set other properties like the write disposition or skipping rows in a CSV file. You can also see other good examples of how to load data, including command line prompts.
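For example, those extra properties slot into the same load configuration; a minimal sketch (the values are illustrative):

# Additional options from the BigQuery v2 load configuration.
job["configuration"]["load"].update({
    "writeDisposition": "WRITE_TRUNCATE",  # or "WRITE_APPEND" to add to the table
    "skipLeadingRows": 1,                  # only relevant when sourceFormat is "CSV"
})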
Edit:
To answer your more specific questions:
Is this a practical solution?
Yes. We export our Google App Engine logs to Cloud Storage and import them into BigQuery using deferred tasks. Some have used MapReduce jobs, but this can be overkill if you don't need to shuffle or reduce.
The structure of the log data changes quite often; this will cause an error
when inserting into BigQuery. How are we going to handle it in the Python script?
It shouldn't be an issue unless you're parsing the messages before they reach BigQuery. A better design is to port the messages, timestamps, levels etc. to BigQuery and then digest them with queries there.
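For example, a deliberately loose schema that keeps the raw message as a single column (the field names here are illustrative) could look like this:

# Keep the raw message as one STRING column and extract details later with
# queries, so log-format changes don't break the load jobs.
log_schema = {
    "fields": [
        {"name": "timestamp", "type": "TIMESTAMP"},
        {"name": "level",     "type": "STRING"},
        {"name": "message",   "type": "STRING"}   # raw, unparsed log line
    ]
}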
In case we have to re-run the log data for a particular period, how can we do that? Do we need to write a Python script?
Streaming the data won't give you backups unless you set them up yourself in BigQuery. Using the method I outlined above will automatically give you backups in Google Cloud Storage, which is preferred.
Know that BigQuery is an OLAP database, not a transactional one, so it's typically best to rebuild tables each time you add more log data rather than trying to insert new data. It's counter-intuitive, but BigQuery is designed for this, as it can import 10,000 files / 1 TB at a time. Using pagination with the job write disposition, you can in theory import hundreds of thousands of records fairly quickly. Streaming the data would be ideal if you don't care about having backup logs.

Related

BigQuery Storage API: Is it possible to stream / save AVRO files directly to Google Cloud Storage?

I would like to export a 90 TB BigQuery table to Google Cloud Storage. According to the documentation, BigQuery Storage API (beta) should be the way to go due to export size quotas (e.g., ExtractBytesPerDay) associated with other methods.
The table is date-partitioned, with each partition occupying ~300 GB. I have a Python AI Notebook running on GCP, which runs partitions (in parallel) through this script adapted from the docs.
from google.cloud import bigquery_storage_v1

client = bigquery_storage_v1.BigQueryReadClient()

table = "projects/{}/datasets/{}/tables/{}".format(
    "bigquery-public-data", "usa_names", "usa_1910_current"
)  # I am using my private table instead of this one.

requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO

parent = "projects/{}".format(project_id)  # project_id is my GCP project id
session = client.create_read_session(
    parent,
    requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)

# The read stream contains blocks of Avro-encoded bytes. The rows() method
# uses the fastavro library to parse these blocks as an iterable of Python
# dictionaries.
rows = reader.rows(session)
Is it possible to save data from the stream directly to Google Cloud Storage?
I tried saving tables as AVRO files to my AI instance using fastavro and later uploading them to GCS using Blob.upload_from_filename(), but this process is very slow. I was hoping it would be possible to point the stream at my GCS bucket. I experimented with Blob.upload_from_file, but couldn't figure it out.
I cannot decode the whole stream to memory and use Blob.upload_from_string because I don't have over ~300 GB of RAM.
I spent the last two days parsing GCP documentation, but couldn't find anything, so I would appreciate your help, preferably with a code snippet, if at all possible. (If working with another file format is easier, I am all for it.)
Thank you!
Is it possible to save data from the stream directly to Google Cloud Storage?
By itself, the BigQuery Storage API is not capable of writing directly to GCS; you'll need to pair the API with code to parse the data, write it to local storage, and subsequently upload to GCS. This could be code that you write manually, or code from a framework of some kind.
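As a rough, single-stream illustration of that "parse locally, then upload" approach (not a fix for the throughput problem; the bucket and file names are placeholders, reader and session come from the snippet in the question, and it assumes each parsed row can be treated as a plain dict):

import json

from google.cloud import storage

# Write the decoded rows of one read stream to a local newline-delimited JSON
# file, then upload that file to GCS. This stays single-threaded.
storage_client = storage.Client()
bucket = storage_client.bucket("my-export-bucket")  # placeholder bucket name

local_path = "/tmp/partition.json"
with open(local_path, "w") as f:
    for row in reader.rows(session):
        f.write(json.dumps(dict(row), default=str) + "\n")

bucket.blob("exports/partition.json").upload_from_filename(local_path)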
It looks like the code snippet that you've shared processes each partition in a single-threaded fashion, which caps your throughput at the throughput of a single read stream. The storage API is designed to achieve high throughput through parallelism, so it's meant to be used with a parallel processing framework such as Google Cloud Dataflow or Apache Spark. If you'd like to use Dataflow, there's a Google-provided template you can start from; for Spark, you can use the code snippets that David has already shared.
An easy way to do that would be to use Spark with the spark-bigquery-connector. It uses the BigQuery Storage API in order to read the table directly into a Spark DataFrame. You can create a Spark cluster on Dataproc, which is located in the same data centers as BigQuery and GCS, making the read and write speeds much faster.
A code example would look like this:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()

df.write.format("avro").save("gs://bucket/path")
You can also filter the data and work on each partition separately:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .option("filter", "the_date='2020-05-12'") \
    .load()

# OR, in case you don't need to give the partition at load time

df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()

df.where("the_date='2020-05-12'").write....
Please note that in order to read large amounts of data you would need a sufficiently large cluster.

ETL to BigQuery using Airflow without permission for Cloud Storage / Cloud SQL

I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This way is easy, but it isn't worth it because it takes so much time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to (see the sketch after this list):
Extract the data from the data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free)
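A minimal sketch of those four steps with the google-cloud-storage and google-cloud-bigquery clients (the database driver, connection details, bucket, dataset and table names are all placeholders):

import csv

import pymysql  # placeholder driver; any MySQL/Mongo client works here
from google.cloud import bigquery, storage

# 1. Extract the data from the source (connection details are placeholders)
conn = pymysql.connect(host="db-host", user="user", password="secret", database="mydb")
with conn.cursor() as cursor:
    cursor.execute("SELECT id, name, created_at FROM my_table")
    rows = cursor.fetchall()

# 2. Load the data into a local CSV file
with open("/tmp/export.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# 3. Drop the file into Cloud Storage
storage.Client().bucket("my-staging-bucket").blob("exports/export.csv") \
    .upload_from_filename("/tmp/export.csv")

# 4. Run a BigQuery load job on that file (the load job itself is free)
load_job = bigquery.Client().load_table_from_uri(
    "gs://my-staging-bucket/exports/export.csv",
    "my_project.my_dataset.my_table",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV, autodetect=True
    ),
)
load_job.result()  # wait for the job to finish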
If you want to avoid creating a file and dropping it into Cloud Storage, another way is possible, but much more complex: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the results
For each row, stream-write the result into BigQuery (streaming is not free on BigQuery)
Described like this, it does not seem very complex, but:
You have to maintain the connection to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart from the last point of failure
You have to perform bulk streaming writes into BigQuery to optimize performance; the chunk size has to be chosen wisely (see the sketch below)
Airflow bonus: you have to define and write your own custom operator to do this
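A rough sketch of that chunked streaming write with the google-cloud-bigquery client (the chunk size, destination table, source cursor and row mapping are placeholders, and streaming inserts are billed):

from google.cloud import bigquery

bq = bigquery.Client()
TABLE = "my_project.my_dataset.my_table"  # placeholder destination table
CHUNK_SIZE = 500  # tune this: too small wastes requests, too large hits request limits

buffer = []
for row in source_cursor:  # rows fetched from MySQL/Mongo (placeholder iterable)
    buffer.append({"id": row[0], "name": row[1]})  # map each row to column names
    if len(buffer) >= CHUNK_SIZE:
        errors = bq.insert_rows_json(TABLE, buffer)
        if errors:
            # Handle or retry here so you can restart from the last failed chunk.
            raise RuntimeError(errors)
        buffer = []

if buffer:  # flush whatever is left
    bq.insert_rows_json(TABLE, buffer)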
By the way, I strongly recommend following the first solution.
Additional tip: BigQuery can now query a Cloud SQL database directly. If you still need your MySQL database (to keep some referential data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL referential data.
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections in your Airflow DAG workflow.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose the method of loading your data: incrementally or fully? Be sure to formulate a technique for eliminating duplicate copies of data as well (de-duplication); one option is sketched below.
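One common de-duplication technique after an incremental load (a sketch; the table and column names are placeholders) is to keep only the newest row per key:

from google.cloud import bigquery

bq = bigquery.Client()

# Rebuild the table, keeping only the latest row per primary key.
dedup_sql = """
    CREATE OR REPLACE TABLE `my_project.my_dataset.my_table` AS
    SELECT * EXCEPT(rn)
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM `my_project.my_dataset.my_table`
    )
    WHERE rn = 1
"""
bq.query(dedup_sql).result()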
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/).
The Stitch MySQL integration will ETL your MySQL to Google BigQuery in minutes and keep it up to date without having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won’t be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow

How to set up GCP infrastructure to perform search quickly over massive set of json data?

I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
The easier option would be to just load the GCS data into BigQuery and run your search query from there.
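For example, once the JSON files are loaded into a BigQuery table, the substring search becomes a single query; the table and column names below are placeholders:

from google.cloud import bigquery

bq = bigquery.Client()

# Assumes the files were loaded into a table with a `filename` column and a
# `text_field` column holding the text to search.
sql = """
    SELECT filename
    FROM `my_project.my_dataset.json_files`
    WHERE STRPOS(text_field, @needle) > 0
"""
job = bq.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("needle", "STRING", "substring to find")]
    ),
)
for row in job:
    print(row.filename)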
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE, install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there (Presto doesn't have a direct GCS connector yet, as far as I know). This option seems more elaborate.
Hope it helps!

Saving Data on GAE: logging vs. datastore

I have a Google App Engine app that has to deal with a lot of data collection. The data I gather is around millions of records per day. As I see it, there are two simple approaches to dealing with this in order to be able to analyze the data:
1. use the logging API to generate App Engine logs, and then try to load these into BigQuery (or, more simply, export to CSV and do the analysis with Excel).
2. save the data in the App Engine datastore (ndb), and then download that data later / try to load it into BigQuery.
Is there any preferable method of doing this?
Thanks!
BigQuery has a new Streaming API, which they claim was designed for high-volume real-time data collection.
Advice from practice: we are currently logging 20M+ multi-event records a day via method 1 as described above. It works pretty well, except when the batch uploader is not called (normally every 5 minutes); then we need to detect this and re-run the importer.
Also, we are currently in the process of migrating to the new Streaming API, but it is not yet in production, so I can't say how reliable it is.

Google AppEngine - How To Perform a Partial Datastore Download

I have a running GAE app that has been collecting data for a while. I am now at the point where I need to run some basic reports on this data and would like to download a subset of the live data to my dev server. Downloading all entities of a kind will simply be too big a data set for the dev server.
Does anyone know of a way to download a subset of entities from a particular kind? Ideally it would be based on entity attributes like date, or client ID etc... but any method would work. I've even tried a regular, full download and then arbitrarily killing the process when I thought I had enough data, but it seems the data is locked up in the .sql3 files generated by the bulkloader.
It looks like the default GAE datastore download/upload utilities (appcfg.py and bulkloader.py) don't support filtering.
It seems reasonable to do one of two things:
write a utility (select + export + save-to-local-file) and execute it locally, accessing the remote GAE datastore through the remote API shell (see the sketch after this list)
write an admin web function that does select + export + zip: add a new URL to your handlers, upload it to GAE, and call it over HTTP
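As a rough sketch of the first option, run inside remote_api_shell.py (the model, properties and cutoff date are placeholders; in practice you would import your app's own model classes instead of redefining them):

import csv
import datetime

from google.appengine.ext import ndb

# Placeholder model; substitute your own kind and filter property.
class LogRecord(ndb.Model):
    client_id = ndb.StringProperty()
    created = ndb.DateTimeProperty()
    payload = ndb.TextProperty()

# Select only a subset of entities, e.g. everything after a cutoff date.
cutoff = datetime.datetime(2014, 1, 1)
records = LogRecord.query(LogRecord.created >= cutoff).fetch(10000)

# Export the subset to a local CSV file for the dev server / reporting.
with open("subset.csv", "w") as f:
    writer = csv.writer(f)
    for r in records:
        writer.writerow([r.client_id, r.created.isoformat(), r.payload])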
