I have been working with the Coinbase Websocket API recently for data analysis purposes. I am trying to track the order book at at least one-second frequency.
As far as I am aware, it is possible to use the REST API for that, but it does not include timestamps. The other options are the websocket level2 updates and the full channel.
The problem is that when I process the level2 updates I constantly fall behind in real time (I did not focus on processing speed while programming, since that was not my goal, and I have neither the hardware nor the connection speed for it), so, for example, after 30 minutes I have only managed to process 10 minutes of data.
The problem comes when, for whatever reason, I am disconnected from the exchange: I have to reconnect, and I am left with a big empty window in the middle of my data.
Is there any aggregated feed or other way to do this (receive all updates for one second at a time, or something like that) that I am not aware of? Or should I just resign myself to improving my code and buying better equipment?
P.S.: I am relatively new, so sorry if this type of question does not fit here!
In case anyone is interested: I ended up opening multiple websockets at staggered time windows and reconnecting them periodically, in order to miss as few price updates as possible.
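For reference, a minimal sketch of that workaround using the asyncio `websockets` library. The feed URL and the level2 subscribe message are my assumptions based on the public Coinbase Exchange feed docs, so double-check both against the current API:

```python
# Sketch: keep two overlapping feed connections alive, rotating them
# periodically, so a disconnect on one never leaves a gap in the data.
import asyncio
import itertools
import json

import websockets  # pip install websockets

FEED_URL = "wss://ws-feed.exchange.coinbase.com"  # assumed public feed endpoint
SUBSCRIBE = json.dumps({
    "type": "subscribe",
    "product_ids": ["BTC-USD"],
    "channels": ["level2"],
})
ROTATE_EVERY = 300  # seconds before a fresh connection is started

async def one_connection(conn_id, queue):
    """Open one feed connection and push raw messages onto a shared queue."""
    async with websockets.connect(FEED_URL) as ws:
        await ws.send(SUBSCRIBE)
        loop = asyncio.get_running_loop()
        deadline = loop.time() + 2 * ROTATE_EVERY  # overlap with the successor
        while loop.time() < deadline:
            msg = await ws.recv()
            await queue.put((conn_id, msg))

async def rotating_feed(queue):
    """Start a new overlapping connection every ROTATE_EVERY seconds."""
    for conn_id in itertools.count():
        asyncio.create_task(one_connection(conn_id, queue))
        await asyncio.sleep(ROTATE_EVERY)

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(rotating_feed(queue))
    while True:
        conn_id, msg = await queue.get()
        # Deduplicate overlapping updates here, e.g. by product_id + sequence.
        print(conn_id, msg[:80])

asyncio.run(main())
```

Since the connections overlap by one rotation window, the consumer has to deduplicate updates, e.g. by sequence number.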
I have a pipeline where a message is produced to topic A, processed by a stream processor, and the enriched data is sent to topic B.
Topic B is consumed by 3 other stream processors, which independently perform a small part of the calculation (to reduce the load on any single processor) and forward their enriched data to a new topic each. A final processor reads from all 3 new topics and sends the data on to web clients via websockets.
It all works well, but if the system sits idle for 30 minutes or so with no new messages, it can sometimes take up to 10 seconds for a message to reach the end of the pipeline. Under normal operation this time has been on the order of 10-20 ms.
Every stream processor uses tables to look up previous data and decide how to enrich new messages, so I'm wondering whether access to these tables slows down when they go untouched for a while.
If so, it seems a silly workaround, but it might be possible to use a timer to send a dummy dataset through the pipeline to keep each worker alive and alert.
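For what it's worth, a minimal sketch of that keepalive idea, written against Faust (which the update and resolution below suggest this pipeline is built on); the app id, topic name, interval, and marker value are all made up for illustration:

```python
# Sketch: periodically publish a dummy record so the agents and their
# Kafka connections never sit completely idle.
import faust

app = faust.App("pipeline", broker="kafka://localhost:9092")  # placeholder id/broker
topic_a = app.topic("topic-a", value_serializer="raw")        # hypothetical topic

@app.timer(interval=60.0)  # once a minute; tune to taste
async def keepalive():
    await topic_a.send(value=b"keepalive")

@app.agent(topic_a)
async def enrich(stream):
    async for event in stream:
        if event == b"keepalive":
            continue  # drop the dummy record
        ...  # real enrichment logic goes here
```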
Below is a print output of the time difference from the message initiation to the arrival time at the end of the pipeline:
[2022-05-23 08:52:46,445] [10340] [WARNING] 0:00:00.017999
[2022-05-23 08:53:03,469] [10340] [WARNING] 0:00:00.025995
[2022-05-23 09:09:46,774] [10340] [WARNING] 0:00:06.179146
I wonder whether any of the broker or agent settings noted on this page would be of use here? If anyone knows, please let me know.
UPDATE
So I ran tests where I used the @app.timer option to send a dummy/test message through the entire pipeline every second, and I never saw an instance of slow send times. I also updated things to talk to the app directly using the @app.page() decorator, rather than going through a FastAPI endpoint to send to the topic, and with that I never saw a delay greater than 2 seconds. But the same thing still happened: if it sat idle for a while and then received a new message, it took almost exactly 2 seconds (plus change) to do its thing. This really starts to look like an agent throttling its poll, or Kafka throttling an agent's connection, when throughput is low.
It appears that the issue stems from a setting in Kafka, for both consumers and producers, which basically closes the connection if they haven't sent/consumed messages within a designated time frame.
In Faust, you access this when you define the app, via:
app.conf.producer_connections_max_idle_ms
app.conf.consumer_connections_max_idle_ms
and set them to something appropriate. I understand that this setting is probably left low (9 minutes by default) so that large dynamic clusters can release resources, but in our use case, with a small cluster that will remain static in terms of architecture and design, it's not an issue (I think) to increase it from 9 minutes to 12 or 24 hours.
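For concreteness, these can be supplied as keyword arguments when constructing the app, assuming your Faust version exposes them as settings (the app id and broker URL here are placeholders):

```python
import faust

TWELVE_HOURS_MS = 12 * 60 * 60 * 1000

app = faust.App(
    "pipeline",                       # placeholder app id
    broker="kafka://localhost:9092",  # placeholder broker URL
    producer_connections_max_idle_ms=TWELVE_HOURS_MS,
    consumer_connections_max_idle_ms=TWELVE_HOURS_MS,
)
```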
I'm reformulating my question into a more concise version here; it got flagged for being too broad.
I'm looking for a way, either native python or a framework, which will allow me to do the following:
Publish a webservice which an end customer can call like any other standard webservice (using curl, postman, requests, etc.)
This webservice will accept gigabytes (perhaps tens of GB) of data per call.
While this data is being transmitted, I'd like to break it into chunks and spin off separate threads and/or processes to work on it simultaneously (my processing is complex, but each chunk will be independent and self-contained).
Doing this will let my logic run in parallel with the data upload across the internet, avoiding all that wasted time while the data is just being transmitted.
It will also prevent the gigabytes (or tens of GB) from being loaded entirely into RAM before my logic even begins.
Original Question:
I'm trying to build a web service (in Python) which can accept potentially tens of gigabytes of data and process it. I don't want this to be completely received and built into an in-memory object before being passed to my logic, as (a) this would use a ton of memory, and (b) the processing would be pretty slow, and I'd love to have a processing thread working on chunks of the data while the rest is still being received asynchronously.
I believe I need some sort of streaming solution for this, but I'm having trouble finding any Python solution that handles this case. Most things I've found are about streaming the output (not an issue for me). It also seems that WSGI has issues by design with streaming request data.
Is there a best practice for this sort of issue which I'm missing? And/or, is there a solution that I haven't found?
Edit: Since a couple of people asked, here's an example of the sort of data I'd be looking at. Basically, I'm working with lists of sentences, which may be millions of sentences long, but each sentence (or group of sentences, for convenience) is a separate processing task. Originally I had planned on receiving this as a JSON array like:
{"sentences: [
"here's a sentence",
"here's another sentence",
"I'm also a sentence"
]
}
For this modification I'm thinking it would just be newline-delimited sentences, since I don't really need the JSON structure. So in my head, my solution would be: I get a constant stream of characters, and whenever I hit a newline character, I split off the preceding sentence and pass it to a worker thread or thread pool for processing. I could also work in groups of many sentences to avoid having a ton of threads going at once. But the main thing is that while the main thread is receiving this character stream, it periodically splits off tasks so other threads can start the processing.
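A rough sketch of that flow, using Flask purely for illustration (and subject to the WSGI caveat mentioned above: a server that buffers the whole request body before handing it over would defeat the purpose):

```python
# Sketch: read the request body as a stream, split on newlines, and hand
# batches of sentences to a thread pool while the upload is still arriving.
from concurrent.futures import ThreadPoolExecutor

from flask import Flask, request

app = Flask(__name__)
pool = ThreadPoolExecutor(max_workers=8)
BATCH_SIZE = 1000  # sentences per task, so we don't spawn a thread per sentence

def process_batch(sentences):
    ...  # the (proprietary) per-sentence work goes here

@app.route("/upload", methods=["POST"])
def upload():
    buffer = b""
    batch, futures = [], []
    while True:
        chunk = request.stream.read(64 * 1024)  # read the body as it arrives
        if not chunk:
            break
        buffer += chunk
        *lines, buffer = buffer.split(b"\n")  # keep the trailing partial line
        for line in lines:
            if line:
                batch.append(line.decode("utf-8"))
            if len(batch) >= BATCH_SIZE:
                futures.append(pool.submit(process_batch, batch))
                batch = []
    if buffer:
        batch.append(buffer.decode("utf-8"))  # sentence without a final newline
    if batch:
        futures.append(pool.submit(process_batch, batch))
    for f in futures:
        f.result()  # surface any processing errors before responding
    return {"status": "ok"}
```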
Second Edit: I've had a few thoughts on how to process the data. I can't give many specific details, as it's proprietary, but I could either store the sentences as they come in into Elasticsearch or some other database and have an async process work on that data, or (ideally) just work with the sentences in batches in memory. Order is important, and so is not dropping any sentences. The inputs will be coming from customers over the internet, though, which is why I'm trying to avoid a message-queue-like process: I don't want the overhead of a new call for each sentence.
Ideally, the customer of the webservice shouldn't have to do anything particularly special beyond a normal POST request with a gigantic body; all this special logic should be server-side. My customers won't be expert software engineers, so while a webservice call is perfectly within their wheelhouse, a more complex message-queue process isn't something I want to impose on them.
Unless you share a little more about the type of data, the processing, or what other constraints your problem has, it's going to be very difficult to provide advice more tailored than pointing you to a couple of resources.
... Here is my attempt; hope it helps!
It seems like what you need is the following:
A message-passing or streaming system to deliver/receive the data
Optionally, an asynchronous task queue to fire up different processing tasks on the data
or even a custom data processing pipeline system
Messaging vs Streaming
Examples: RabbitMQ, Kombu (per @abolotnov's comment), Apache Kafka (and Python clients), Faust
The main differences between messaging and streaming vary by system, definition, and who you ask, but in general:
- messaging: a "simple" system that takes care of sending/receiving single messages between two processes
- streaming: adds functionality like the ability to "replay", send mini-batches of groups of messages, process rolling windows, etc.
Messaging systems may also implement broadcasting (send a message to all receivers) and publish/subscribe scenarios, which come in handy if you don't want your publisher (the creator of the data) to keep track of whom to send the data to (the subscribers), or alternatively don't want your subscribers to keep track of whom to get the data from, and when.
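As a concrete example of publish/subscribe, here is a minimal broadcast sketch using RabbitMQ's fanout exchanges via the pika client; the exchange name and message are placeholders, and the publisher and subscriber would normally live in separate processes:

```python
# Sketch: the publisher broadcasts to a fanout exchange; each subscriber
# binds its own queue, so the publisher never tracks who is listening.
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="sentences", exchange_type="fanout")

# Publisher side: routing_key is ignored by fanout exchanges.
channel.basic_publish(exchange="sentences", routing_key="", body=b"a sentence")

# Subscriber side: an exclusive, auto-named queue bound to the exchange.
result = channel.queue_declare(queue="", exclusive=True)
channel.queue_bind(exchange="sentences", queue=result.method.queue)

def on_message(ch, method, properties, body):
    print("received:", body)

channel.basic_consume(queue=result.method.queue,
                      on_message_callback=on_message, auto_ack=True)
channel.start_consuming()
```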
Asynchronous task queue
Examples: Celery, RQ, Taskmaster
This will basically help you define a set of tasks, which may be the smaller chunks of the main processing you intend to do, and then make sure these tasks get performed whenever new data pops up.
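A minimal sketch of that shape with Celery (the broker URL and task body are placeholders):

```python
# Sketch: each chunk of sentences becomes an independent Celery task,
# executed by worker processes as soon as it is enqueued.
from celery import Celery  # pip install celery

app = Celery("chunks", broker="redis://localhost:6379/0")  # placeholder broker URL

@app.task
def process_chunk(sentences):
    # Independent, self-contained processing of one chunk of sentences.
    return len(sentences)

# Producer side, e.g. inside the upload handler:
#   process_chunk.delay(["here's a sentence", "here's another sentence"])
```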
Custom Data Processing Systems
I mainly have one in mind: Dask (official tutorial repo)
This is a system very much built for what it seems you have on your hands: large amounts of information emerging from some source (which may or may not be fully under your control) that needs to flow through a set of processing steps in order to be consumable by some other process (or stored).
Dask is kind of a combination of the previous options, in that you define a computation graph (or task graph) with data sources and computation nodes, some of which may depend on other nodes. Later, depending on the system you deploy on, you can specify synchronous or various types of asynchronous execution for the tasks, while keeping this run-time implementation detail separate from the actual tasks to be performed. This means you could develop on your own computer, later decide to deploy the same pipeline on a cluster, and only need to change the "settings" of this run-time implementation.
Additionally, Dask basically imitates numpy/pandas/pyspark or whatever data-processing framework you may already be using, so the syntax will be virtually the same in almost every case.
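To give a flavor, a dask.bag pipeline over newline-delimited sentences might look like the following sketch; the file pattern and enrichment function are placeholders:

```python
# Sketch: build a lazy task graph, then choose the scheduler at run time.
import dask.bag as db

def enrich(sentence):
    return sentence.upper()  # placeholder for the real processing step

bag = (
    db.read_text("sentences-*.txt")  # placeholder source files
      .map(str.strip)
      .filter(bool)                  # drop empty lines
      .map(enrich)
)

# Same graph, different run-time: threads locally now, a cluster later.
results = bag.compute(scheduler="threads")
```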
I'm using the Nest API to poll the current temperature and related temperature data from two of my Nests (simultaneously).
I was initially polling this data every minute but started getting an error:
nest.nest.APIError: blocked
I don't get the error every minute; it's more intermittent, roughly every 5-10 minutes.
Reading through their documentation, it seems that pulling data once per minute is permissible, but it is also the maximum recommended query frequency.
So I set it to every two minutes. I'm still getting the error.
I'm using this Python package, although I'm starting to wonder if there's too much going on under the hood that's making unnecessary requests.
Has anyone had any experience with this type of Nest error, or this Python package before?
Does polling two Nests with the same authenticated call result in multiple requests, as it relates to their data limiting?
Should I just scrap this package and roll my own? (This is generally my preference, but I need to learn to stop rewriting everything the moment I hit a snag, just so I can fully control and thoroughly understand each aspect of a particular integration, right?)
Our situation is as follows:
We are working on a school project where the intention is that multiple teams walk around a city with smartphones and play a city game while walking.
As such, we can have 10 active smartphones walking around the city, all posting their location and requesting data from Google App Engine.
Someone sits behind a web browser, watching all these teams walk around, sending them messages, etc.
We are using the datastore Google App Engine provides to store all the data these teams send and request, to store and retrieve the messages, etc.
However, we soon found out we were at our maximum limit of reads and writes, so we searched for a way to deliver periodic updates (which cost the most reads and writes) without using any of the limited resources Google provides. And obviously, because it's a school project, we don't want to pay for more reads and writes.
Storing this information in global variables seemed an easy and quick solution, and it was... but when we started to test properly, we noticed some of our data was missing and then reappearing. This turned out to be because there were so many requests going to the cloud that a new instance was spun up, and instances don't keep these global variables in sync.
So our question is:
Can we somehow make sure these global variables are always the same on every running App Engine instance?
OR
Can we limit the number of instances ever running to 1, no matter how many requests come in?
OR
Is there perhaps a better way to store this data, without using the datastore and without using globals?
You should be using memcache. If you use the ndb (new datastore) library, you can automatically cache the results of queries. Obviously this won't improve your writes much, but it should significantly improve the number of reads you can do.
You need to back it with the datastore, as data can be ejected from memcache at any time. If you're willing to take the (small) chance of losing updates, you could use just memcache. You could do something like store just a message ID in the datastore and have the controller periodically verify that every message ID has a corresponding entry in memcache; if one is missing, the controller would re-enter it.
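A minimal sketch of that read-through pattern using the App Engine Python APIs; the model, key name, and expiry are illustrative:

```python
# Sketch: serve reads from memcache, fall back to the datastore on a miss.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Message(ndb.Model):  # illustrative model
    text = ndb.StringProperty()

def get_messages():
    messages = memcache.get("messages")  # fast path: no datastore read
    if messages is None:
        messages = [m.text for m in Message.query().fetch(100)]
        memcache.set("messages", messages, time=60)  # repopulate for 60 s
    return messages

def add_message(text):
    Message(text=text).put()     # durable write (still a datastore op)
    memcache.delete("messages")  # invalidate so the next read refreshes
```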
Interesting question. Some bad news first: I don't think there's a better way of storing the data; no, you won't be able to stop new instances from spawning; and no, you cannot make separate instances always have the same data.
What you could do is have the instances periodically sync themselves with a master record in the datastore. By choosing the frequency of this intelligently and downloading/uploading the information in one lump, you could limit the number of reads/writes to a level that works for you. This is firmly in kludge territory, though.
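A sketch of that kludge: each instance keeps a module-level copy of the shared state and refreshes it from a single master entity at most every N seconds (the entity, key, and interval here are illustrative):

```python
# Sketch: per-instance cache of one shared datastore entity, refreshed
# at most every SYNC_SECONDS to cap the number of datastore reads.
import time
from google.appengine.ext import ndb

SYNC_SECONDS = 30  # illustrative: trade staleness against datastore reads

class MasterRecord(ndb.Model):  # one shared entity holding the game state
    state = ndb.JsonProperty()

_cache = {"state": None, "fetched_at": 0.0}  # module-level, per instance

def get_state():
    now = time.time()
    if now - _cache["fetched_at"] > SYNC_SECONDS:
        record = MasterRecord.get_by_id("master")
        _cache.update(state=record.state if record else {}, fetched_at=now)
    return _cache["state"]
```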
Despite finding the quota for just about everything else, I can't find the limits for free reads/writes, so it's possible they're ludicrously small, but the fact that you're hitting them with a mere 10 smartphones raises a red flag for me. Are you certain the smartphones are being polled (or calling in) at a sensible frequency? It sounds like you might be hammering the datastore unnecessarily.
Consider the Jabber (XMPP) protocol for communication between peers; its free limits are quite generous.
First, definitely use memcache as Tim Delaney said. That alone will probably solve your problem.
If not, you should consider a push model. The advantage is that your clients won't be asking you for new data all the time, only receiving it when something has actually changed. If the update is small enough to deliver in the push message itself, you won't need to worry about datastore reads on memcache misses, or any other duplicated work, for all those clients: you read the data once when it changes and push it out to everyone.
The first options for push are C2DM (Android) and APNs (iOS). Both limit the amount of data they send and the frequency of updates.
If you want to get fancier you could use XMPP instead. This would let you do more frequent updates with (I believe) bigger payloads but might require more engineering. For a starting point, see Stack Overflow questions about Android and iOS.
Have fun!
I was wondering if it would be a good idea to use callLater in Twisted to keep track of auction endings. It would be a callLater on the order of hundreds of thousands of seconds, though; does that matter? It seems like it would be very convenient. But then again, it seems like a horrible idea if the server crashes.
Keeping a database of when all the auctions end seems like the most secure solution, but checking the whole database every second to see whether any auction has ended seems very expensive.
If the server crashes, maybe it could recreate all the callLaters from the database entries of auction end times. Are there other potential concerns with such a model?
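A sketch of that rebuild-on-startup idea with Twisted; the database query is a hypothetical helper:

```python
# Sketch: on startup, re-register one callLater per unfinished auction.
import time
from twisted.internet import reactor

def end_auction(auction_id):
    print("auction ended:", auction_id)  # close the auction, notify bidders, etc.

def schedule_pending(auctions):
    """auctions: iterable of (auction_id, end_timestamp) rows from the database."""
    now = time.time()
    for auction_id, end_ts in auctions:
        delay = max(0, end_ts - now)  # auctions that ended while down fire at once
        reactor.callLater(delay, end_auction, auction_id)

# e.g. schedule_pending(fetch_open_auctions())  # hypothetical DB helper
reactor.run()
```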
One of the Divmod projects, Axiom, might be applicable here. Axiom is an object database. One of its unexpected, useful features is a persistent scheduling system.
You schedule events using APIs provided by the database, and when an event comes due, a callback you specified is called. The events persist across process restarts, since they're represented as database objects. Large numbers of scheduled events are supported, since work is only done to keep track of when the next event is going to happen.
The canonical Divmod site went down some time ago (sadly the company is no longer an operating concern), but the code is all available at http://launchpad.net/divmod.org and the documentation is being slowly rehosted at http://divmod.readthedocs.org/.