Calculate the size of a multipart/form-data encoded file - Python

I'm writing an application that should receive a file and store it.
One way of storing would be to upload it to another server (e.g. filehoster).
Server-side I'm using Python and the Pyramid framework. I've already solved the problem of reading the file while the client is still uploading: I wrapped the app returned by make_wsgi_app in another class, which handles the upload request so that I'm able to read just the file.
My current problem is determining the file's size while the client is uploading it. The client sends the request multipart/form-data encoded, so the Content-Length header includes the size of the boundaries and Content-Type declarations as well.
I think it's a bad idea to just subtract a fixed number of bytes, because anything in the form can change, and the size computed for the file part would then be wrong.
I read another question about this topic, but I don't want to pull in another library. I think there has to be a halfway easy way to do this in pure Python.
Thanks
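One library-free approach is to count, while streaming, the bytes consumed by the opening boundary line and the part headers, and derive the file size from Content-Length. A rough sketch follows; the boundary parsing and the closing-delimiter accounting are assumptions about a typical single-file upload, not a full multipart parser:

def estimate_file_size(environ, part_header_bytes):
    """Estimate the size of the file part of a multipart/form-data body.

    part_header_bytes is the number of bytes already consumed from
    wsgi.input for the opening boundary line, the part's own headers
    (Content-Disposition, Content-Type) and the blank line after them.
    Assumes a single file field and a client that terminates the body
    with the usual closing delimiter.
    """
    content_length = int(environ['CONTENT_LENGTH'])

    # Boundary as declared by the client, e.g.
    # Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryabc123
    content_type = environ['CONTENT_TYPE']
    boundary = content_type.split('boundary=')[1].split(';')[0].strip('"')

    # Closing delimiter: CRLF + "--" + boundary + "--" + (usually) a final CRLF.
    closing_len = 2 + 2 + len(boundary) + 2 + 2

    return content_length - part_header_bytes - closing_len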

Related

Deserializing Prometheus `remote_write` Protobuf output in Python

I'm experimenting (for the first time) with Prometheus. I've set up Prometheus to send messages to a local Flask server:
remote_write:
- url: "http://localhost:5000/metric"
I'm able to read the incoming bytes, however, I'm not able to convert the incoming messages to any meaningful data.
I'm very new to Prometheus (and Protobuf!) so I'm not sure what the best approach is. I would rather not use a third party package, but want to learn and understand the Protobuf de/serialization myself.
I tried copying the metrics.proto definitions from the Prometheus GitHub and compiling them with protoc. I tried importing the metrics_pb2.py file and parsing the incoming message:
read_metric = metrics_pb2.Metric()
read_metric.ParseFromString(request.data)
I also tried using the remote.proto definitions (specifically WriteRequest) which also didn't work:
read_metric = remote_pb2.WriteRequest()
read_metric.ParseFromString(request.data)
This results in:
google.protobuf.message.DecodeError: Error parsing message
So I suspect that I'm using the wrong Protobuf definitions?
I would really appreciate any help & advice on this!
To provide some more context for what I'm attempting to accomplish:
I'm trying to stream data from multiple Prometheus instances to a message queue so they can be passed to a machine learning model.
I'm using online training with an active learning model, and I want the data to be (near) real-time. That's why I thought the remote_write functionality would be a better approach than continuously scraping each instance. If you have any other ideas on how I can build this system, feel free to share - I've only been playing around with it for a couple of days, so I'm open to any feedback!
ANSWER EDIT:
I first had to decompress the data with snappy (thanks, larsks!):
import snappy  # python-snappy

raw = request.data                     # snappy-compressed protobuf payload
decompressed = snappy.uncompress(raw)  # block format, per the spec
read_metric = remote_pb2.WriteRequest()
read_metric.ParseFromString(decompressed)
The remote.proto document is the correct protobuf specification. You may find this document useful, which explicitly defines the remote write protocol. That document includes the "official" protobuf specification, and mentions that:
The remote write request MUST be encoded using Google Protobuf 3, and MUST use the schema defined above. Note the Prometheus implementation uses gogoproto optimisations - for receivers written in languages other than Golang the gogoproto types MAY be substituted for line-level equivalents.
The document also notes that the body of remote write requests is compressed:
The remote write request in the body of the HTTP POST MUST be compressed with Google’s Snappy. The block format MUST be used - the framed format MUST NOT be used.
So before you can parse the request body you'll need to find a Python solution for decompressing snappy-compressed data.
(I found a link to that google doc from this article that talks about the development of the remote write protocol.)
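Putting the pieces together, a minimal receiver sketch might look like the following. It assumes remote_pb2 was generated with protoc from Prometheus's remote.proto/types.proto and that the python-snappy package is installed; the field names (timeseries, labels, samples) come from those definitions.

import snappy
from flask import Flask, request

import remote_pb2  # module generated by protoc from remote.proto/types.proto

app = Flask(__name__)

@app.route("/metric", methods=["POST"])
def receive_remote_write():
    # The body is block-format snappy-compressed protobuf, per the spec.
    decompressed = snappy.uncompress(request.data)

    write_request = remote_pb2.WriteRequest()
    write_request.ParseFromString(decompressed)

    # Each WriteRequest carries one or more time series with labels and samples.
    for series in write_request.timeseries:
        labels = {label.name: label.value for label in series.labels}
        for sample in series.samples:
            print(labels.get("__name__"), sample.value, sample.timestamp)

    return "", 200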

Understanding the protobuf protocol

I'm just doing some reverse-engineering exercises and have run across the application/x-protobuf content type.
I'm currently sniffing network calls from Redfin using mitmproxy. I see an endpoint for a result, but the response is unstructured, JSON-looking data with content type application/x-protobuf. After doing a bit of research, I found out that protobuf uses a schema to map the data internally, and I am assuming that schema also sits in the client somewhere as a .proto file.
[Screenshot of the response headers]
To validate my assumption about what that screenshot shows:
I can see there is a response header called X-ProtoBuf-Schema. Is that the location of the schema, the same schema I can use to decode the response data? How would I go about reading that data in a more structured manner?
I am able to make a request to that endpoint using requests, but it just gives me protobuf bytes.
PS: This is what the JSON format looks like
https://pastebin.com/LY51X9KZ
"and I am assuming the schema also sits in the client somewhere, called .proto file." - I wouldn't assume that at all; the client, once built, doesn't need the .proto - the generated code is used instead of any explicit schema. If a site is publishing a schema, it is probably a serialized FileDescriptorSet from google/protobuf/descriptor.proto, which contains the intent of the .proto, but as data.

Loading a lot of data into Google BigQuery from Python

I've been struggling to load big chunks of data into BigQuery for a little while now. In Google's docs, I see the insertAll method, which seems to work fine but gives me 413 "Entity too large" errors when I try to send anything over about 100k of data as JSON. Per Google's docs, I should be able to send up to 1 TB of uncompressed JSON data. What gives? The example on the previous page has me building the request body manually instead of using insertAll, which is uglier and more error-prone. I'm also not sure what format the data should be in, in that case.
So, all of that said, what is the clean/proper way of loading lots of data into BigQuery? An example with data would be great. If at all possible, I'd really rather not build the request body myself.
Note that for streaming data to BQ, anything above 10k rows/sec requires talking to a sales rep.
If you'd like to send large chunks directly to BQ, you can send it via POST. If you're using a client library, it should handle making the upload resumable for you. To do this, you'll need to make a call to jobs.insert() instead of tabledata.insertAll(), and provide a description of a load job. To actually push the bytes using the Python client, you can create a MediaFileUpload or MediaInMemoryUpload and pass it as the media_body parameter.
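As a rough illustration of that approach with google-api-python-client, assuming `service` is an authorized BigQuery client built via discovery, the destination table already exists, and the project/dataset/table names are placeholders:

from googleapiclient.http import MediaFileUpload

job_body = {
    "configuration": {
        "load": {
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
        }
    }
}

# Resumable media upload of a local newline-delimited JSON file.
media = MediaFileUpload("data.json",
                        mimetype="application/octet-stream",
                        resumable=True)

job = service.jobs().insert(projectId="my-project",
                            body=job_body,
                            media_body=media).execute()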
The other option is to stage the data in Google Cloud Storage and load it from there.
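The Cloud Storage route uses the same jobs.insert() call, just pointing at files already staged in a bucket instead of attaching a media body (bucket and paths below are hypothetical):

job_body = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/data-*.json"],
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
        }
    }
}

job = service.jobs().insert(projectId="my-project", body=job_body).execute()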
The example here uses a resumable upload to upload a CSV file. While the file used is small, it should work for virtually any size of upload, since it uses a robust media upload protocol. It sounds like you want JSON, which means you'd need to tweak the code slightly (an example for JSON is in the load_json.py file in the same directory). If you have a stream you want to upload instead of a file, you can use a MediaInMemoryUpload instead of the MediaFileUpload used in the example.
BTW ... Craig's answer is correct, I just thought I'd chime in with links to sample code.

Create a large .csv (or any other type!) file with Google App Engine

I've been struggling to create a file with GAE for two days now; I've examined different approaches, and each one seems more complex and time-consuming than the previous one.
I've tried simply loading a page and writing the file into the response object with the relevant headers:
self.response.headers['Content-Disposition'] = "attachment; filename=titles.csv"
q = MyObject.all()
for obj in q:
    title = json.loads(obj.data).get('title')
    self.response.out.write(title.encode('utf8') + "\n")
This tells me (in a very long error) that Full proto too large to save, cleared variables. Here's the full error.
I've also checked Cloud Storage, but it needs tons of information and tweaking in the Cloud Console just to get enabled, and the Blobstore, which can only save stuff into the Datastore.
Writing a file can't be this complicated! Please tell me that I am missing something.
That error doesn't have anything to do with writing a CSV; it appears to be a timeout from iterating over all of your MyObject entities. Remember that requests in GAE are subject to strict limits, and you are probably exceeding them. You probably want to use a cursor and the deferred API to build up your CSV in stages, but for that you will definitely need to write to the Blobstore or Cloud Storage.
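A rough sketch of that cursor-plus-deferred approach, using the question's MyObject model and the (since-deprecated) Blobstore Files API for the output; the batch size and the title-per-line format are assumptions:

import json
from google.appengine.api import files
from google.appengine.ext import db, deferred

BATCH_SIZE = 500

def build_csv(cursor=None, blob_file=None):
    # Create the Blobstore output file on the first run.
    if blob_file is None:
        blob_file = files.blobstore.create(mime_type='text/csv')

    q = MyObject.all()
    if cursor:
        q.with_cursor(cursor)
    rows = q.fetch(BATCH_SIZE)

    # Append this batch of titles to the Blobstore file.
    with files.open(blob_file, 'a') as f:
        for obj in rows:
            title = json.loads(obj.data).get('title')
            f.write(title.encode('utf8') + '\n')

    if len(rows) == BATCH_SIZE:
        # More entities remain: re-queue this function with the updated cursor.
        deferred.defer(build_csv, q.cursor(), blob_file)
    else:
        files.finalize(blob_file)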

Some Google App Engine BlobStore Problems

I have googled and read the docs about the Blobstore on the official Google App Engine site, but there are some things that I still don't understand. My platform is webapp.
Docs I have read:
webapp Blobstore Handlers
Blobstore Python API Overview
The Blobstore Python API
After reading all these docs, I still have some problems:
In the Blobstore Python API Overview it says the maximum size of Blobstore data that can be read by the app with one API call is 1 MB. What does this mean? Does this 1 MB limit apply to send_blob()? Take the following code from webapp Blobstore Handlers as an example:
class ViewPhotoHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, photo_key):
        self.send_blob(photo_key)
Does that mean the photo (which is uploaded and stored in the Blobstore) associated with photo_key must be less than 1 MB? From the context, I don't think so; I think the photo can be as large as 2 GB. But I am not sure.
How is the content type determined by send_blob()? Is it text/html or image/jpeg? Can I set it somewhere myself? The following explanation from webapp Blobstore Handlers is confusing and quite difficult for a non-native English speaker. Can someone paraphrase it with code samples? Where are the docs for send_blob()? I can't find them.
The send_blob() method accepts a save_as argument that determines whether the blob data is sent as raw response data or as a MIME attachment with a filename, which prompts web browsers to save the file with the given name instead of displaying it. If the value of the argument is a string, the blob is sent as an attachment, and the string value is used as the filename. If True and blob_key_or_info is a BlobInfo object, the filename from the object is used. By default, the blob data is sent as the body of the response and not as a MIME attachment.
There is a file http://www.example.com/example.avi which is 20 MB or even 2 GB. I want to fetch example.avi from the internet and store it in the Blobstore. I checked, and the urlfetch request size limit is 1 MB. I searched and haven't found a solution.
Thanks a lot!
send_blob() doesn't involve your application reading the file through the API, so the 1 MB limit doesn't apply. The frontend service that returns the response to the user reads the entire blob and returns all of it in the response (it most likely does the reading in chunks, but this is an implementation detail that you don't have to worry about).
send_blob() sets the content type to either the blob's stored content type or the type you pass via the optional content_type parameter of send_blob(). As for documentation, it seems you need to read the source: there's a docstring in the google.appengine.ext.webapp.blobstore_handlers package.
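For example, both behaviours can be controlled explicitly; content_type and save_as are parameters of BlobstoreDownloadHandler.send_blob():

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class DownloadHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, blob_key):
        blob_info = blobstore.BlobInfo.get(blob_key)
        # Override the stored content type and force a "Save as..." prompt.
        self.send_blob(blob_info,
                       content_type='image/jpeg',
                       save_as='photo.jpg')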
There's really no great solution for fetching arbitrary files from the web and storing them in Blobstore. Most likely you'd need a service running elsewhere, like your own machine or an EC2 instance, to fetch the files and POST them to a blobstore handler in your application.
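A hedged sketch of that worker approach, run outside App Engine: have a handler in your app hand out a fresh Blobstore upload URL (created with blobstore.create_upload_url()), then have the worker download the file and POST it there. The /get_upload_url endpoint below is hypothetical:

import requests

# Ask the App Engine app for a one-time Blobstore upload URL.
upload_url = requests.get("https://your-app.appspot.com/get_upload_url").text

# Download the large file locally in chunks.
with requests.get("http://www.example.com/example.avi", stream=True) as r:
    with open("example.avi", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)

# POST it as multipart/form-data to the Blobstore upload handler.
with open("example.avi", "rb") as f:
    requests.post(upload_url, files={"file": ("example.avi", f, "video/avi")})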
