Understanding protobuf protocol - python

I'm just doing a reverse engineering exercise and have run across the application/x-protobuf content type.
I am currently sniffing network calls from Redfin using mitmproxy. I see an endpoint for a result, however the response looks like unstructured, JSON-formatted data served with content type application/x-protobuf. After doing a bit of research, I found out that protobuf uses a schema to map the data internally, and I am assuming that schema, a .proto file, also sits in the client somewhere.
(screenshot of the response headers)
To validate my assumption about what that screenshot shows: I can see there is a response header called X-ProtoBuf-Schema. Is that the location of the schema, the same schema I could use to decode the response data? How would I go about reading that data in a more structured manner?
I am able to make a request to that endpoint using requests; it just gives me the raw protobuf back.
PS: This is what the JSON format looks like
https://pastebin.com/LY51X9KZ

"and I am assuming the schema also sits in the client somewhere, called .proto file." - I wouldn't assume that at all; the client, once built, doesn't need the .proto - the generated code is used instead of any explicit schema. If a site is publishing a schema, it is probably a serialized FileDescriptorSet from google/protobuf/descriptor.proto, which contains the intent of the .proto, but as data.

Related

Deserializing Prometheus `remote_write` Protobuf output in Python

I'm experimenting (for the first time) with Prometheus. I've setup Prometheus to send messages to a local flask server:
remote_write:
- url: "http://localhost:5000/metric"
I'm able to read the incoming bytes, however, I'm not able to convert the incoming messages to any meaningful data.
I'm very new to Prometheus (and Protobuf!) so I'm not sure what the best approach is. I would rather not use a third party package, but want to learn and understand the Protobuf de/serialization myself.
I tried copying the metrics.proto definitions from the Prometheus GitHub repository and compiling them with protoc, then importing the generated metrics_pb2.py file and parsing the incoming message:
read_metric = metrics_pb2.Metric()
read_metric.ParseFromString(request.data)
I also tried using the remote.proto definitions (specifically WriteRequest) which also didn't work:
read_metric = remote_pb2.WriteRequest()
read_metric.ParseFromString(request.data)
This results in:
google.protobuf.message.DecodeError: Error parsing message
So I suspect that I'm using the wrong Protobuf definitions?
I would really appreciate any help & advice on this!
To provide some more context for what I'm attempting to accomplish:
I'm trying to stream data from multiple Prometheus instances to a message queue so they can be passed to a machine learning model.
I'm using online training with an active learning model, and I want the data to be (near) real-time. That's why I thought the remote_write functionality is the best approach rather than continuously scraping each instance. If you have any other ideas on how I can build this system, feel free to share - I've just been playing around with it for a couple days, so I'm open to any feedback!
ANSWER EDIT:
I had to first decompress the data using snappy, thanks larsks!:
import snappy

# The body is snappy-compressed; decompress before parsing the WriteRequest.
raw = request.data
decompressed = snappy.uncompress(raw)
read_metric = remote_pb2.WriteRequest()
read_metric.ParseFromString(decompressed)
The remote.proto document is the correct protobuf specification. You may find this document useful, which explicitly defines the remote write protocol. That document includes the "official" protobuf specification, and mentions that:
The remote write request MUST be encoded using Google Protobuf 3, and MUST use the schema defined above. Note the Prometheus implementation uses gogoproto optimisations - for receivers written in languages other than Golang the gogoproto types MAY be substituted for line-level equivalents.
The document also notes that the body of remote write requests is compressed:
The remote write request in the body of the HTTP POST MUST be compressed with Google’s Snappy. The block format MUST be used - the framed format MUST NOT be used.
So before you can parse the request body you'll need to find a Python solution for decompressing snappy-compressed data.
(I found a link to that google doc from this article that talks about the development of the remote write protocol.)
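Putting those pieces together, a minimal Flask receiver for the /metric endpoint above might look like the sketch below. It assumes remote_pb2 was generated with protoc from Prometheus's remote.proto (which also pulls in types.proto and the gogoproto definitions) and that the python-snappy package is installed:

import snappy
from flask import Flask, request

import remote_pb2  # generated with protoc from Prometheus's remote.proto

app = Flask(__name__)

@app.route("/metric", methods=["POST"])
def metric():
    # The body is a snappy-compressed (block format) WriteRequest.
    write_request = remote_pb2.WriteRequest()
    write_request.ParseFromString(snappy.uncompress(request.data))

    for ts in write_request.timeseries:
        labels = {label.name: label.value for label in ts.labels}
        for sample in ts.samples:
            print(labels.get("__name__"), sample.timestamp, sample.value)

    return "", 200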

How do I send webhook data and turn it into a CSV file

I'm a little new to coding, but I'd like to know how I can get webhook data from, say, TradingView and turn that info into a CSV. I use Python.
Edit: OK, I figured out how I want to build my app. I am stuck again, but with a more descriptive image.
I'd like to figure out how to turn my current request into a CSV file that I can place at a path of my choice on my computer. Below is my current code, but I'll be editing it.
my current code
Things to note: I am using AWS Chalice as a REST API to receive webhook requests. My goal is to turn that webhook data into a CSV file.
example JSON:
{
  "name": "kaishi",
  "lotsize": "0.03",
  "Signal": "op_sell"
}
These signals change, so the request is always different.
In the image I posted, I get errors from my PowerShell that I don't understand because, as I've said, I am quite new to coding:
raise TypeError('Object of type %s is not JSON serializable')
TypeError: Object of type DataFrame is not JSON serializable
raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index
Those are just two of the errors; they vary depending on how I'm testing my app.
Any info would be appreciated.
Webhooks are requests that are sent from the remote server (in this case the "tradingview" server) to your server, so you'll need to find an existing package that handles webhooks from "tradingview" and deploy it accordingly, or develop with one of the many web service frameworks exposed through Python. A good place to start is exploring some of the open-source options available on GitHub: https://github.com/search?l=Python&q=tradingview&type=Repositories
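The two tracebacks above suggest the handler builds a pandas DataFrame from a dict of scalars and then returns it directly from the route. A sketch that sidesteps both problems by writing the incoming JSON straight to a CSV inside a Chalice route (the route path and file location are placeholders chosen for illustration):

import csv
import os

from chalice import Chalice

app = Chalice(app_name="webhook-to-csv")

CSV_PATH = "/tmp/signals.csv"  # on Lambda, /tmp is the only writable path

@app.route("/webhook", methods=["POST"])
def webhook():
    data = app.current_request.json_body  # e.g. {"name": "kaishi", "lotsize": "0.03", "Signal": "op_sell"}

    write_header = not os.path.exists(CSV_PATH)
    with open(CSV_PATH, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "lotsize", "Signal"])
        if write_header:
            writer.writeheader()
        writer.writerow(data)

    # Return something JSON-serializable, not a DataFrame.
    return {"status": "ok"}

If you do want pandas, wrapping the dict in a list, pd.DataFrame([data]), avoids the "all scalar values, you must pass an index" error, and calling df.to_csv(path) instead of returning the DataFrame avoids the serialization error.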

Loading a Lot of Data into Google Bigquery from Python

I've been struggling to load big chunks of data into bigquery for a little while now. In Google's docs, I see the insertAll method, which seems to work fine, but gives me 413 "Entity too large" errors when I try to send anything over about 100k of data in JSON. Per Google's docs, I should be able to send up to 1TB of uncompressed data in JSON. What gives? The example on the previous page has me building the request body manually instead of using insertAll, which is uglier and more error prone. I'm also not sure what format the data should be in in that case.
So, all of that said, what is the clean/proper way of loading lots of data into Bigquery? An example with data would be great. If at all possible, I'd really rather not build the request body myself.
Note that for streaming data to BQ, anything above 10k rows/sec requires talking to a sales rep.
If you'd like to send large chunks directly to BQ, you can send it via POST. If you're using a client library, it should handle making the upload resumable for you. To do this, you'll need to make a call to jobs.insert() instead of tabledata.insertAll(), and provide a description of a load job. To actually push the bytes using the Python client, you can create a MediaFileUpload or MediaInMemoryUpload and pass it as the media_body parameter.
The other option is to stage the data in Google Cloud Storage and load it from there.
The example here uses the resumable upload to upload a CSV file. While the file used is small, it should work for virtually any size upload since it uses a robust media upload protocol. It sounds like you want json, which means you'd need to tweak the code slightly for json (an example for json is in the load_json.py example in the same directory). If you have a stream you want to upload instead of a file, you can use a MediaInMemoryUpload instead of the MediaFileUpload that is used in the example.
BTW ... Craig's answer is correct, I just thought I'd chime in with links to sample code.
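For reference, a rough sketch of the jobs().insert() load-job approach with the google-api-python-client, roughly what the linked samples do; the project, dataset, table, and file names are placeholders, and it assumes Application Default Credentials are configured:

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

service = build("bigquery", "v2")

job_body = {
    "configuration": {
        "load": {
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "destinationTable": {
                "projectId": "my-project",   # placeholder
                "datasetId": "my_dataset",   # placeholder
                "tableId": "my_table",       # placeholder
            },
            "writeDisposition": "WRITE_APPEND",
        }
    }
}

# Resumable media upload pushes the bytes alongside the job description.
media = MediaFileUpload("rows.json", mimetype="application/octet-stream", resumable=True)

job = service.jobs().insert(projectId="my-project", body=job_body, media_body=media).execute()
print(job["jobReference"]["jobId"])

If the destination table doesn't exist yet, the load configuration would also need a schema (or autodetect), and a MediaInMemoryUpload can replace the MediaFileUpload when the rows are already in memory.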

How do I pass around data between pages in a GAE app?

I have a fairly basic GAE app that takes some input, fetches some data from a webpage, parses, then presents it to the user. Right now, the fairly spare input HTML form POSTs the arguments to the output 'file' which is wholly generated by the handler for that URL.
I'd like to do a couple things with the data (e.g. graph it at a landing page perhaps, then write it to an output file), but I don't know how I should pass the parsed data between the different handlers. I could maybe encode it then successively POST it to other handlers, but my gut says that I shouldn't need to HTTP the data back and forth within my app—it seems terribly inefficient (my gut is also hungry...).
In fairly broad swaths (or maybe a link to an example), how should my handlers handle this?
Later thoughts (edit)
My very rough idea now is to have the form submitted to a page that 1) enters the subsequent query into a database (datastore?) keyed to some hash, then uses that to 2) grab and parse all the data. The parsed data would be stored in memory (memcache?) for near-immediate use to graph it and/or process it into a variety of tabular formats for download. The script that does said parsing redirects to a unique URL based on the hash which can be referred to in order to get the data.
The thought would be that you could save the URL, then if you visit it later when the data has been lost, it can re-query the source to get it back/update it.
Reasonable? Should I be looking at other things?
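The memcache-keyed-by-hash idea could be sketched roughly like this on classic App Engine; the webapp2 handler names and the fetch_and_parse helper are illustrative, not part of the question:

import hashlib
import json

import webapp2
from google.appengine.api import memcache

class SubmitHandler(webapp2.RequestHandler):
    def post(self):
        query = self.request.get("query")
        key = hashlib.sha1(query).hexdigest()[:12]

        parsed = fetch_and_parse(query)  # hypothetical: your existing fetch/parse step
        memcache.set("result:" + key, json.dumps(parsed), time=3600)

        # Redirect to a unique URL derived from the hash.
        self.redirect("/result/" + key)

class ResultHandler(webapp2.RequestHandler):
    def get(self, key):
        cached = memcache.get("result:" + key)
        if cached is None:
            # Cache expired: re-run the query or fall back to the datastore.
            self.abort(404)
        self.response.write(cached)

app = webapp2.WSGIApplication([
    ("/submit", SubmitHandler),
    (r"/result/(\w+)", ResultHandler),
])

Anything you want to survive a memcache eviction (entries can be dropped at any time) would also go into the datastore under the same key.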

Calculate size multipart/form-data encoded file

I'm writing an application that should receive a file and store it.
One way of storing would be to upload it to another server (e.g. filehoster).
Server-side I'm using Python and the Pyramid framework. I've already solved the problem of getting at the file while the client is uploading: I wrapped the app returned by make_wsgi_app in another class, which handles the upload request, so I'm able to read just the file.
My current problem is getting the file's size while the client is uploading. The client sends the request multipart/form-data encoded, so the Content-Length header includes the size of the boundaries and the Content-Type declarations of each part.
I think it's a bad idea to just subtract a fixed size, because anything in the form can change and the calculated size of the file part would then be wrong.
I read another question about this topic, but I don't want to use another lib. I think there has to be a halfway easy way to do this in pure Python.
Thanks
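One pure-Python sketch, under the assumption that the body contains exactly one part (the file) and ends with a standard closing boundary: read and count the opening boundary line and the part headers from the WSGI input stream, then subtract that overhead plus the closing boundary from Content-Length. The function name and the environ plumbing are mine, not Pyramid's:

def file_part_size(environ):
    content_length = int(environ['CONTENT_LENGTH'])
    content_type = environ['CONTENT_TYPE']  # e.g. multipart/form-data; boundary=----xyz
    boundary = content_type.split('boundary=')[1].encode('ascii')

    stream = environ['wsgi.input']
    overhead = 0
    # Count the opening boundary line and the part headers; they are not file data.
    while True:
        line = stream.readline()
        overhead += len(line)
        if line in (b'\r\n', b'\n'):  # blank line ends the part headers
            break

    # Everything that follows is file data except the closing boundary:
    # CRLF + "--" + boundary + "--" + CRLF.
    closing = len(b'\r\n--' + boundary + b'--\r\n')
    return content_length - overhead - closing

With more than one form field you would have to extend the loop to walk each part, but the same idea (count every non-file byte as it streams past and subtract) still applies.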
