Complete sparse data in Graphite - python

I have a list of objects that each contain: Product, Price, Time. It basically holds product price changes over time, so for each price change you get a record with the product name, the new price, and the exact second of the change.
I'm sending the data to graphite using the timestamp (written in python):
import socket

# One datapoint per line on Carbon's plaintext port: "<metric path> <value> <timestamp>\n"
sock = socket.socket()
sock.connect(('my-host-ip', 2003))
message = 'my.metric.prefix.%s %s %d\n' % (product, price, timestamp)
sock.sendall(message.encode())  # encode to bytes (needed on Python 3)
sock.close()
Thing is, as prices do not change very often, the data points are very sparse, which means I get one point per product at a frequency of hours or days. If I look at Graphite at the exact time of the price change, I can see the data point. But if I want to look at price changes over time, I would like a constant line drawn from the data point of the price change going forward.
I tried using:
keepLastValue(my.metric.prefix.*)
It works only if I look at the data points in a time frame of a few minutes, but not hours (and surely not days). Is there a way to do something like that in Graphite? Or do I have to push some redundant data every minute to describe the missing points?

I believe keepLastValue doesn't work for you at coarser time intervals because of the aggregation rules defined in storage-aggregation.conf. You can try using xFilesFactor = 0 and aggregationMethod = last so that each aggregated point always keeps the last value of the metric.
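For example, a storage-aggregation.conf entry matching your metrics might look like the snippet below; the section name and pattern are assumptions based on your metric prefix:

[price_metrics]
pattern = ^my\.metric\.prefix\.
xFilesFactor = 0
aggregationMethod = last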
However I think your concrete use case is much better resolved by using StatsD gauges. Basically you can set an arbitrary numerical value for a gauge in StatsD and it will send (flush) its value to Graphite every 10 seconds by default. You can set the flush interval to a shorter period, like 1 second, if you really need to record the second of the change. If the gauge is not updated at the next flush, StatsD will send the previous value.
So basically StatsD gauges do what you say about sending redundant data to describe missing points.
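For illustration, a minimal sketch with the statsd Python package, assuming a StatsD daemon on localhost (both are assumptions, not part of your current setup):

import statsd

# A gauge keeps its last value between flushes, so Graphite receives a point
# every flush interval even when the price has not changed.
client = statsd.StatsClient('localhost', 8125)
client.gauge('my.metric.prefix.%s' % product, price)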

I had the same problem as well with sparse data.
I used the whisper database tools outlined in the link below to update my whisper files which were aggregating load data on 10 minute intervals.
https://github.com/graphite-project/whisper
First examine my file using whisper-info:
/opt/graphite/bin/whisper-info.py /opt/graphite/storage/whisper/test/system/myserver/n001/loadave.wsp
Then fix aggregation methods using whisper-resize:
/opt/graphite/bin/whisper-resize.py --xFilesFactor=0 --aggregationMethod=last /opt/graphite/storage/whisper/test/system/myserver/n001/loadave.wsp 600:259200
Please be cautious using whisper-resize as it can result in data loss if you aren't careful!

Related

InfluxDB: How to deal with missing data?

Question Description
We perform a lot of timeseries queries, usually through a Python API, and they sometimes fail completely because of missing data.
Because of this, we are not sure where to educate ourselves and get an answer to this specific question: how to deal with missing data in our timeseries (InfluxDB) database.
Example
To describe the problem with an example:
Say we have some timeseries data where we measure the temperature of a room. We have many rooms, and sometimes sensors die or stop working for a week or two before we replace them; in that timeframe the data is missing.
Certain calculations then fail. For example, if we want to calculate the average temperature per day, this fails because on some days there is no measurement from the sensors at all.
One approach we thought of is to interpolate the data for those days: use the last and first available values and fill the days with no data.
This has many downsides, the major one being that interpolated data is fake data you can't trust, and for our more serious processes we would prefer not to store fake (or interpolated) data.
We were wondering what the possible alternatives are and where we can find resources to educate ourselves on this topic.
Answer
The idea is to fill the missing values, the gaps, with null/None, which lets us use InfluxDB's built-in fill().
https://docs.influxdata.com/influxdb/cloud/query-data/flux/fill/
As in that example, we can fill the null values and then perform any additional queries and analysis on the data.
The link above covers the methods available for filling in missing values.
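For illustration, a hedged sketch of that approach with the influxdb-client Python package and Flux; the bucket, measurement, and window size here are hypothetical:

from influxdb_client import InfluxDBClient

# aggregateWindow(createEmpty: true) produces null rows for the empty days,
# and fill() then decides what those nulls become (previous value, a constant, ...).
flux = '''
from(bucket: "rooms")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> aggregateWindow(every: 1d, fn: mean, createEmpty: true)
  |> fill(usePrevious: true)
'''

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
tables = client.query_api().query(flux)
client.close()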

How do you store the results of some python code for use in another calculation?

I am a day trader who is new to Python and learning every day. I have written a basic script (or maybe you'd call it a function?), some basic Python code that pulls the best bid/offer data from an API for me on repeat every 5 seconds.
I now want a rolling average of the data coming in from the API every 5 seconds so I can compare the current data against the rolling average.
My problem is I have no idea where to start or what I should be looking to learn. Any help would be great! Even just to point me in the right direction.
Does the data need to be stored in a .csv that is updated every 5 seconds, or can all this be done within the code?
Thanks in advance for any help, code is below
import time
from binance.client import Client

api_key = "###"
api_secret = "###"

client = Client(api_key, api_secret)  # create the client once, outside the loop

while True:
    # pull the current best bid/offer for ETHUSDT
    ticker_info = client.get_ticker(symbol="ETHUSDT")
    bid_qty = int(float(ticker_info['bidQty']))
    ask_qty = int(float(ticker_info['askQty']))
    bbo_delta = ask_qty - bid_qty
    print("Ask=")
    print(ask_qty)
    print("Bid=")
    print(bid_qty)
    print("Delta=")
    print(bbo_delta)
    print("-")
    time.sleep(5)
Actually, this type of query has various possibilities. From your explanation, I understand that you want to fetch data that is generated every five seconds.
You could scrape the data with Beautiful Soup if you are using Python.
Since the data is updated every five seconds, it may become huge after some time, so storing it properly, in a CSV, an Excel file, or a database, would be a great help to you.
Just scrape the data and store it in CSV format, or if you are using an API, store it in a proper DataFrame.
My suggestion would be to use Beautiful Soup (BS4), read some documentation, and in just a few lines of code store the data in a CSV.
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
By a 'rolling average' do you mean over minutes, hours or days? You can get a one-minute rolling mean by putting the values into a list of len <= 12 and dropping the 'old' (only one minute old) values as new ones arrive. The list is going to get really big as the rolling time window gets big enough to be useful (len = 120k for a one-week average). It's hard to imagine volatility that would make a 5-second sampling interval valuable, but I know nothing about day trading. If you do want that short an interval and a 100k-size data set, reading and writing to a file is going to be too slow.
Try writing code for a one hour rolling average with samples every minute. That will get you started. You can then post the code with specific questions and incrementally work to your goal.
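For illustration, a minimal sketch of that idea with a fixed-length deque; the window size and where you call it are assumptions to adapt to your loop:

from collections import deque

WINDOW = 12                    # 12 samples x 5 s = one-minute window; raise for longer averages
deltas = deque(maxlen=WINDOW)  # the oldest value is dropped automatically when full

def rolling_mean(new_value):
    # add the latest delta and return the current rolling average
    deltas.append(new_value)
    return sum(deltas) / len(deltas)

# inside the existing loop, after computing bbo_delta:
# avg = rolling_mean(bbo_delta)
# print("Rolling avg=", avg)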

Getting good timestamps for sensor readings

I'm trying to get timestamp data to match accelerometer and gyroscope readings.
I'm using a Raspberry Pi 3 B+ with python to pull accelerometer (Adxl345) and gyroscope (ITG3200) readings. I'm currently reading them through I2C as fast as I can, and I write a timestamp from the system time (time.time()) immediately before reading.
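For reference, a bare sketch of that read loop; read_accel and read_gyro are placeholders for the actual I2C reads, which are not shown:

import time

def read_accel():
    return (0.0, 0.0, 0.0)  # placeholder for the ADXL345 I2C read

def read_gyro():
    return (0.0, 0.0, 0.0)  # placeholder for the ITG3200 I2C read

samples = []
for _ in range(1000):
    t = time.time()  # wall-clock stamp taken just before the reads
    samples.append((t, read_accel(), read_gyro()))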
I thought this would be sufficiently accurate, but when I look at the resulting data the time is often not monotonic and/or just looks wrong. In fact, the results often seem to better match the motion I was tracking if I throw out all but the first timestamp and then synthetically create times based on the frequency of the device I'm collecting from!
That said, this approach clearly gives me wrong data, and I'm at a loss as to where to pull correct data. The accelerometer and gyro don't seem to include times in anything they do, and if I pull data as fast as I can, I'm still bound to miss some samples at their highest rates, meaning the times I use will always be somewhat wrong.
The accelerometer also has a FIFO cache it can store some values in, but again, if I pull from that cache, how do I know which timestamp goes with each value? The documentation mentions the cache storing values but nothing about timestamps.
All of which leads me to believe I'm missing something. Is there a secret here I don't know? Or a standard practice I'm unaware of? Any thoughts or suggestions would be most welcome.

Iterating through a list of CryptoCurrencies on Poloniex using a FOR Loop and saving data on each Crypto token in its separate Variable

This question is related to the trading exchange Poloniex.com, where I'm using their public API (https://poloniex.com/support/api/), especially the returnChartData function, through a Python wrapper.
I have a list that includes all the altcoins (alternative coins) listed on Poloniex.
Something like this-
Altcoins=
['BTC_ETH','BTC_ZEC','BTC_XMR','BTC_LTC','BTC_ETC','BTC_BTS','BTC_GNT','BTC_XRP','BTC_FCT','BTC_SC','BTC_DCR','BTC_DASH',.....] (It should have more than 80-100 Altcoins)
The returnChartData function returns the trading and pricing data for a particular currency pair at candle intervals ranging from 5 minutes to a week, so basically it is a historical data API.
I want to use their 4-hour candle data (period=14400), which I wish to fetch every 4 hours for all the altcoins at once.
This is what I wish to do:
1. Use the Poloniex public API and call the historical data for all the altcoins (around 100) every 4 hours.
2. Create a variable with the same name as the altcoin, so around 80-90 variables,
and
3. store the data of each altcoin in its respective variable.
4. Use pandas DataFrames on all those variables and perform trading analysis.
5. Repeat the process every 4 hours. (Of course I need not create the variables again and again.)
So is there any way I can run one or two loops every 4 hours to solve this, or should I run 80-100 calculations individually?
Here is where the API wrapper is taken from: https://github.com/s4w3d0ff/python-poloniex
Here is the sample code for running one calculation at a time:
from poloniex import Poloniex, Coach
import pandas as pd

myCoach = Coach()
public = Poloniex(coach=myCoach)

# Below is the code for a single altcoin; I wish to perform the same process on the whole gamut.
eth = public.returnChartData('BTC_ETH', period=14400)  # saving the data to a variable
eth = pd.DataFrame(eth)
The above code gives me what I want, but how can I write the same piece for 100 altcoins and run them every 4 hours? What if I want to run it every 5 minutes? It would be cumbersome.
This is what I tried to solve the problem:
from poloniex import Poloniex, Coach
import pandas as pd
myCoach = Coach()
public = Poloniex(coach=myCoach)
coinlist=['BTC_ETH','BTC_ZEC','BTC_XMR','BTC_LTC','BTC_ETC','BTC_BTS','BTC_GNT','BTC_XRP','BTC_FCT','BTC_SC','BTC_DCR','BTC_DASH']
for i in coinlist:
    altcoins = public.returnChartData(i, period=14400)
What I tried above gives me data only for the last altcoin in the list, i.e. BTC_DASH. I think it is overwriting the data until it reaches the end.
Can you guys help out, please?
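For what it's worth, a sketch of the same loop with the results kept in a dict keyed by pair name instead of a single variable (using the same wrapper as above; the short coinlist is just for illustration):

from poloniex import Poloniex, Coach
import pandas as pd

public = Poloniex(coach=Coach())

coinlist = ['BTC_ETH', 'BTC_ZEC', 'BTC_XMR', 'BTC_LTC']  # extend to the full 80-100 pairs

chart_data = {}
for pair in coinlist:
    # one DataFrame per pair, instead of one variable per coin
    chart_data[pair] = pd.DataFrame(public.returnChartData(pair, period=14400))

# e.g. chart_data['BTC_ETH'] holds the 4-hour candles for BTC_ETH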
So is there any way that I use and run one or two loops every 4 hours to solve this issue
Just a quick thought.
Yes, there is a way to run two loops every 4 hours: timestamp the moment you start, and if the current time is >= timestamp + 4h, run the loops and reset the timestamp.
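A rough sketch of that timing idea; run_update_loops is a hypothetical function that would fetch and process all pairs:

import time

INTERVAL = 4 * 60 * 60  # 4 hours in seconds
last_run = 0.0

while True:
    now = time.time()
    if now >= last_run + INTERVAL:
        run_update_loops()  # hypothetical: fetch and process all pairs
        last_run = now
    time.sleep(60)          # check again in a minute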
or should I run individual 80-100 calculations individually?
Get more hardware and think about multiprocessing / multithreading to parallelise operations.
Try to store only what you need, and take care to use the start and end parameters when getting chart data from the Poloniex API.
From https://poloniex.com/support/api/, you can see that :
- returnChartData
Returns candlestick chart data. Required GET parameters are "currencyPair", "period" (candlestick period in seconds; valid values are 300, 900, 1800, 7200, 14400, and 86400), "start", and "end". "Start" and "end" are given in UNIX timestamp format and used to specify the date range for the data returned. [...]
Call: https://poloniex.com/public?command=returnChartData&currencyPair=BTC_XMR&start=1405699200&end=9999999999&period=14400
The best method is to set up and use an external DB (MongoDB, TinyDB, etc.) to store the chart data and then keep it updated.
Assuming your DB is constantly in sync with the real market data, you can do whatever you want with your local DB without any risk of overloading Poloniex or hitting its requests/min limit.
Assuming the DB is functional:
You may first store all data from the beginning to the end for each supported pair and each window; it will take a long time to process.
The 5-minute chart data for a coin with more than 3 years of trading activity could take very long to reinject into your DB (depending on the DB, CPU, ...).
Then, at regular intervals, when required and using a cronjob:
- every 5 minutes for the 5-minute window
- every 15 minutes for the 15-minute window
- ...
- every x minutes for the x-minute window
you need to update the chart data for each available pair, with the start parameter set to the previously recorded candle time for that pair and window in your DB, and the end parameter set to the current time (or the Poloniex server's current time if you are not in the same time zone); see the sketch below.
Since each update will give you only 1 or 2 candles per request, the complete update run will be short!
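A hedged sketch of one such incremental update, assuming the same python-poloniex wrapper as above (whose returnChartData accepts start and end) and two hypothetical helpers, last_candle_time and store_candles, that talk to your local DB:

import time
from poloniex import Poloniex, Coach

public = Poloniex(coach=Coach())

def update_pair(pair, period=14400):
    # fetch only the candles newer than the last one already stored locally
    start = last_candle_time(pair, period)   # hypothetical: newest stored candle time for this pair/window
    end = int(time.time())
    candles = public.returnChartData(pair, period=period, start=start, end=end)
    store_candles(pair, period, candles)     # hypothetical: upsert into the local DB

# run from a cronjob, e.g. every 4 hours for the 14400 s window:
# for pair in all_pairs: update_pair(pair, period=14400)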
Finally, you may try multithreading and multiprocessing to speed up the global update procedure, but take care not to overload the Poloniex infrastructure (my advice is no more than 4 concurrent threads).

Pandas-ipython, how to create new data frames with drill down capabilities

I think my problem is simple but I've made a long post in the interest of being thorough.
I need to visualize some data, but first I need to perform some calculations that seem too cumbersome in Tableau (am I hated if I say Tableau sucks?).
My general problem is how to output the data with my calculations in a nice format that can be visualized either in Tableau or something else, so it needs to hang on to a lot of information.
My data set is a number of fields associated with usage of an application by user ID. So there are potentially multiple entries for each user ID, and each entry (record) has information in columns such as the time they began using the app, the end time, the price they paid, whether they were on wifi, and other attributes (dimensions).
I have one year of data and want to do things like calculate the average/total of duration/price paid in the app over each month and over the full year for each user (remember each user will appear multiple times, once for each sign-in).
I know some basics, like appending a column that subtracts start time from end time to get time spent, and my Python is fully functional, but my data capabilities are amateur.
My question is: say I want the following attributes (measures) calculated, all per user ID: average price, total price, max/min price, median price, average duration, total duration, max/min duration, median duration, and number of times logged in (so number of instances of the ID), all on a per-month and per-year basis. I know that I could calculate each of these things, but what is the best way to store them for use in a visualization?
For context, I may want to visualize the group of users who paid on average more than $8 and were in the app a total of more than 3 hours (to this point a simple new table can be created with the info), but if I want it in terms of what shows they watched and whether they were on wifi (other attributes in the original data set), and I want to see it broken down monthly, it seems like having my new table of calculations won't cut it.
Would it then be best to create a yearly table and a table for each month, for a total of 13 tables, each of which contains the user IDs over that time period with all the original information, and then append a column for each calculation (if the calc is an average then I enter the same value for each instance of an ID)?
I searched and found that maybe the plyr functionality in R would be useful, but I am very familiar with Python and using IPython. All I need is a nice data set with all this info that can then be exported into visualization software, unless you can also suggest visualization tools in IPython :)
Any help is much appreciated. I'm so hoping it makes sense to do this in Python, as Tableau is just painful for the calculation side of things... please help :)
It sounds like you want to run a database query like this:
SELECT user, show, month, wifi, SUM(time_in_pp)
FROM usage_records  -- hypothetical table holding the raw per-session rows
GROUP BY user, show, month, wifi
HAVING SUM(time_in_pp) > 3
Put the data into a database and run your queries using the pandas SQL interface or ordinary Python queries. Presumably you would index your database table on these columns.
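If you would rather stay in pandas, here is a hedged sketch of a similar aggregation; the column names (user_id, price, start_time, end_time) and the tiny example frame are assumptions about your data set:

import pandas as pd

# assumed raw usage table, one row per session
df = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'price':      [9.0, 10.0, 2.0],
    'start_time': pd.to_datetime(['2015-01-03 10:00', '2015-01-20 18:00', '2015-02-01 09:00']),
    'end_time':   pd.to_datetime(['2015-01-03 12:30', '2015-01-20 19:45', '2015-02-01 09:20']),
})

df['duration'] = df['end_time'] - df['start_time']
df['month'] = df['start_time'].dt.to_period('M')

monthly = (df.groupby(['user_id', 'month'])
             .agg(avg_price=('price', 'mean'),
                  total_price=('price', 'sum'),
                  total_duration=('duration', 'sum'),
                  logins=('price', 'size'))
             .reset_index())

# example filter: users averaging over $8 with more than 3 hours in the app that month
heavy = monthly[(monthly['avg_price'] > 8) &
                (monthly['total_duration'] > pd.Timedelta(hours=3))]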
