Loading a Lot of Data into Google BigQuery from Python - python

I've been struggling to load big chunks of data into BigQuery for a little while now. In Google's docs, I see the insertAll method, which seems to work fine but gives me 413 "Entity too large" errors when I try to send anything over about 100k of JSON. Per Google's docs, I should be able to send up to 1 TB of uncompressed JSON data. What gives? The example on the previous page has me building the request body manually instead of using insertAll, which is uglier and more error prone. I'm also not sure what format the data should be in for that approach.
So, all of that said, what is the clean/proper way of loading lots of data into BigQuery? An example with data would be great. If at all possible, I'd really rather not build the request body myself.

Note that for streaming data to BQ, anything above 10k rows/sec requires talking to a sales rep.
If you'd like to send large chunks directly to BQ, you can send them via POST. If you're using a client library, it should handle making the upload resumable for you. To do this, make a call to jobs.insert() instead of tabledata.insertAll(), and provide the description of a load job. To actually push the bytes with the Python client, you can create a MediaFileUpload or MediaInMemoryUpload and pass it as the media_body parameter.
The other option is to stage the data in Google Cloud Storage and load it from there.
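As a rough illustration of that flow (not code from the answer itself), here is a minimal sketch using the google-api-python-client library; the project, dataset, table, schema and file name are placeholders:

import json
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery = build('bigquery', 'v2', credentials=credentials)

# Describe the load job: destination table, schema, and source format.
job_body = {
    'configuration': {
        'load': {
            'sourceFormat': 'NEWLINE_DELIMITED_JSON',
            'schema': {'fields': [{'name': 'title', 'type': 'STRING'}]},
            'destinationTable': {
                'projectId': 'PROJECT_ID',      # placeholder
                'datasetId': 'DATASET_ID',      # placeholder
                'tableId': 'TABLE_ID',          # placeholder
            },
        }
    }
}

# resumable=True makes the client chunk and retry the upload for you.
media = MediaFileUpload('rows.json',
                        mimetype='application/octet-stream',
                        resumable=True)

job = bigquery.jobs().insert(projectId='PROJECT_ID',
                             body=job_body,
                             media_body=media).execute()
print(job['jobReference']['jobId'])

The load job runs asynchronously, so in practice you would poll jobs.get() with the returned job ID until it completes.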

The example here uses a resumable upload to load a CSV file. While the file used is small, it should work for virtually any size of upload since it uses a robust media upload protocol. It sounds like you want JSON, which means you'd need to tweak the code slightly (there is a JSON example in load_json.py in the same directory). If you have a stream you want to upload instead of a file, you can use a MediaInMemoryUpload instead of the MediaFileUpload used in the example.
BTW ... Craig's answer is correct, I just thought I'd chime in with links to sample code.
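For in-memory data, a hedged variation of the same idea: serialize the rows as newline-delimited JSON and hand them to a MediaInMemoryUpload. The bigquery service object and job_body are assumed to be set up as in the earlier sketch:

import json
from googleapiclient.http import MediaInMemoryUpload

rows = [{'title': u'first'}, {'title': u'second'}]
payload = '\n'.join(json.dumps(r) for r in rows)  # newline-delimited JSON

media = MediaInMemoryUpload(payload.encode('utf-8'),
                            mimetype='application/octet-stream',
                            resumable=True)

job = bigquery.jobs().insert(projectId='PROJECT_ID',
                             body=job_body,
                             media_body=media).execute()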

Related

Understanding protobuff protocol

I'm just doing a reverse-engineering exercise and have run across the application/x-protobuff content type.
I am currently sniffing network calls from Redfin using mitmproxy. I see an endpoint for a result; however, the response is unstructured JSON-formatted data with content type application/x-protobuff. After doing a bit of research, I found out that protobuf uses a schema to map the data internally, and I am assuming the schema also sits in the client somewhere, in a .proto file.
[screenshot of the response headers]
To validate my assumption about what that screenshot shows:
I can see there is a response header called X-ProtoBuf-Schema. Is that the location of the schema, the same schema I could use to decode the response data? How would I go about reading that data in a more structured manner?
I am able to make a request to that endpoint using requests; it just gives me protobuf data.
PS: This is what the JSON format looks like
https://pastebin.com/LY51X9KZ
"and I am assuming the schema also sits in the client somewhere, called .proto file." - I wouldn't assume that at all; the client, once built, doesn't need the .proto - the generated code is used instead of any explicit schema. If a site is publishing a schema, it is probably a serialized FileDescriptorSet from google/protobuf/descriptor.proto, which contains the intent of the .proto, but as data.

How to fetch data as a .zip using Cx_Oracle?

I would like to fetch the data but receive a .zip with all the data instead of a list of tuples. That is, the client makes the specified query, the database server compresses the result data as a .zip, and then sends this .zip to the client.
By doing this I expect to greatly reduce the time spent sending data, because there are lots of repeated fields.
I know Advanced Data Compression exists in Oracle; however, I am not able to achieve this using cx_Oracle.
Any help/workaround is appreciated.
Advanced Network Compression can be enabled as described here, using sqlnet.ora and/or tnsnames.ora:
https://cx-oracle.readthedocs.io/en/latest/user_guide/initialization.html#optnetfiles
https://www.oracle.com/technetwork/database/enterprise-edition/advancednetworkcompression-2141325.pdf
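As a sketch of how that fits together with cx_Oracle (assuming cx_Oracle 8+ initialised in thick mode, and that your database edition licenses Advanced Network Compression; all paths and credentials are placeholders):

# sqlnet.ora (placed in the config_dir below):
#   SQLNET.COMPRESSION = on
#   SQLNET.COMPRESSION_LEVELS = (high)
import cx_Oracle

# Point the Oracle client at the directory holding sqlnet.ora.
cx_Oracle.init_oracle_client(config_dir='/opt/oracle/config')  # placeholder path
conn = cx_Oracle.connect('user', 'password', 'dbhost/service_name')  # placeholders

cur = conn.cursor()
cur.execute('SELECT * FROM my_table')
rows = cur.fetchall()  # compression happens transparently at the network layer

Note that the client still receives ordinary rows/tuples; the compression is transparent at the SQL*Net layer rather than producing an actual .zip file.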

Get the data from a REST API and store it in HDFS/HBase

I'm new to Big Data. I learned that HDFS is for storing more structured data and HBase is for storing unstructured data. I have a REST API from which I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better to load the data into: HDFS or HBase? Also, can you please direct me to some tutorial to do this? I came across this Tutorial with Streaming Data, but I'm not sure if it will fit my use case.
It would be of great help if you could guide me to a particular resource/technology to solve this issue.
There are several questions you have to think about:
Do you want to work with batch files or streaming? It depends on the rate at which your REST API will be requested.
For storage there is not just HDFS and HBase; you have a lot of other solutions such as Cassandra, MongoDB, or Neo4j. It all depends on how you want to use the data (random access vs. full scan, updates with versioning vs. writing new lines, concurrent access). For example, HBase is good for random access, Neo4j for graph storage, and so on. If you are receiving JSON files, MongoDB can be a good choice, as it stores objects as documents.
What is the size of your data?
Here is a good article on the questions to think about when you start a big data project: documentation
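If a simple batch ingest turns out to be enough, a hedged sketch of pulling JSON from the REST API and landing it in HDFS over WebHDFS with the hdfs Python package (the endpoint, user and paths are placeholders) could look like this:

import json
import requests
from hdfs import InsecureClient

# Fetch the records from the REST API (placeholder endpoint).
api_response = requests.get('https://api.example.com/records')
records = api_response.json()

# Write them to HDFS via WebHDFS, one JSON document per line.
client = InsecureClient('http://namenode:9870', user='hadoop')  # placeholders
with client.write('/data/raw/records.json', encoding='utf-8', overwrite=True) as writer:
    for record in records:
        writer.write(json.dumps(record) + '\n')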

Create a large .csv (or any other type!) file with google app engine

I've been struggling to create a file with GAE for two days now, I've examined different approaches and each one seems more complex and time consuming than the previous one.
I've tried simply loading a page and writing the file into the response object with the relevant headers:
self.response.headers['Content-Disposition'] = "attachment; filename=titles.csv"
q = MyObject.all()
for obj in q:
    title = json.loads(obj.data).get('title')
    self.response.out.write(title.encode('utf8') + "\n")
This tells me (in a very long error) "Full proto too large to save, cleared variables". Here's the full error.
I've also checked Cloud Storage, but it needs tons of info and tweaking in the Cloud Console just to get enabled, and the Blobstore, which can save stuff only into the Datastore.
Writing a file can't be this complicated! Please tell me that I am missing something.
That error doesn't have anything to do with writing a CSV; it appears to be a timeout from iterating over all MyObject entities. Remember that requests in GAE are subject to strict limits, and you are probably exceeding those. You probably want to use a query cursor and the deferred API to build up your CSV in stages, but for that you will definitely need to write to the Blobstore or Cloud Storage.
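A minimal sketch of that pattern on the old db API and Python 2.7 runtime follows; MyObject, BATCH_SIZE and write_partial_csv are placeholders, with write_partial_csv standing in for whatever writes each batch to the Blobstore or Cloud Storage:

import json
from google.appengine.ext import deferred

BATCH_SIZE = 500

def build_csv(cursor=None, part=0):
    q = MyObject.all()
    if cursor:
        q.with_cursor(cursor)
    batch = q.fetch(BATCH_SIZE)
    if not batch:
        return  # done; all parts have been written
    lines = [json.loads(obj.data).get('title', u'').encode('utf8') for obj in batch]
    write_partial_csv(part, '\n'.join(lines))  # placeholder: persist this batch
    # Re-queue with the cursor so each task stays well under the request limits.
    deferred.defer(build_csv, q.cursor(), part + 1)

deferred.defer(build_csv)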

upload csv/excel file to appengine (python) for processing

I need to be able to upload an excel or csv file to appengine so that the server can process the rows and create objects. Can anyone provide or point me to an example of how this is done? Thanks for your help.
Uploading to the Blobstore is probably what you are after, then reading the data back and processing it with the csv module.
You might want to look into sending your file to Google Docs in the case of Excel (and other) formats, then reading the rows back via the Spreadsheets API.
If you mean a one-off (or a few) transfers, you're probably looking for the bulk upload system: http://code.google.com/appengine/docs/python/tools/uploadingdata.html
If you're talking about regular uploads during use, you'll need to handle them as post requests to the application.
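Putting the Blobstore suggestion together, here is a hedged sketch on the old webapp2 runtime (the routes, form markup and per-row processing are placeholders):

import csv
import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class UploadFormHandler(webapp2.RequestHandler):
    def get(self):
        # Serve a form whose action is a one-time Blobstore upload URL.
        upload_url = blobstore.create_upload_url('/upload')
        self.response.out.write(
            '<form action="%s" method="POST" enctype="multipart/form-data">'
            '<input type="file" name="file"><input type="submit"></form>' % upload_url)

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        blob_info = self.get_uploads('file')[0]
        # BlobReader is file-like, so the csv module can read it directly.
        reader = csv.reader(blobstore.BlobReader(blob_info.key()))
        for row in reader:
            pass  # create your model objects from each row here
        self.redirect('/')

app = webapp2.WSGIApplication([('/', UploadFormHandler), ('/upload', UploadHandler)])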
