Scalable plotting of a data stream from Elasticsearch - python

Assume that I have a set of documents stored in an index of Elasticsearch, such that each document has the following (simplified) form:
{
"timestamp": N,
"val": X
}
where N is a long integer representing a unix-timestamp and X is some float.
My goal is to plot the behavior of val over time; in other words, obtain a graph where the x-axis is the time(stamp) and the y-axis is the val.
Medium-small number of documents
If the number of documents stored in the index is medium-small, then using python I could do the following. Scan the documents, using for example the scan helper, and create a list of the JSON documents. Next, convert the list into a pandas.DataFrame and sort its rows by timestamp. Finally, I can now easily plot the data as described above. Here is a minimal example:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import pandas

es = Elasticsearch()  # instance of es-client
docs = scan(
    es,
    index='myIndex',
    doc_type='myDocType')
docsList = []
for doc in docs:
    docsList.append(doc['_source'])  # keep only the stored fields
dfDocs = pandas.DataFrame(docsList)
dfDocsSorted = dfDocs.sort_values(by='timestamp')
dfDocsSorted.plot(x='timestamp', y='val')
Here is what the output looks like for a sample data set:
I find this a rather clean solution, given that the number of documents is limited.
Large number of documents
What is the "right" way to do the same as above in the case where the number of documents is large? Note that the sorting step above is rather mandatory, as far as I can tell, since the scan returns documents in a "random" order. Therefore, if the number of documents is large, this step (and the storing of the data) will become an issue.
Is there a canonical way to achieve this using Elasticsearch? Or am I bound to carry out pre-processing locally before being able to plot the data?
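One option worth sketching (not necessarily the canonical way) is to push both the time-ordering and the downsampling to Elasticsearch with a date_histogram aggregation, so that only the bucketed averages travel to the client. This assumes the timestamp field is mapped as a date (or usable as epoch milliseconds); the index and field names follow the question and the "1h" interval is arbitrary:
from elasticsearch import Elasticsearch
import pandas

es = Elasticsearch()

# Let Elasticsearch bucket the documents by time and average `val` per
# bucket, so only the (already time-ordered) buckets come back.
body = {
    "size": 0,
    "aggs": {
        "val_over_time": {
            "date_histogram": {"field": "timestamp", "interval": "1h"},
            "aggs": {"avg_val": {"avg": {"field": "val"}}}
        }
    }
}
resp = es.search(index='myIndex', body=body)
buckets = resp['aggregations']['val_over_time']['buckets']
dfBuckets = pandas.DataFrame(
    [{'timestamp': b['key'], 'val': b['avg_val']['value']} for b in buckets])
dfBuckets.plot(x='timestamp', y='val')
If every raw point is needed rather than bucketed values, elasticsearch.helpers.scan also accepts preserve_order=True together with a sort on timestamp in the query body, which returns the hits already ordered, at the cost of a more expensive scroll.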

Related

Creating two lists from one randomly

I'm using pandas to import a lot of data from a CSV file, and once read I format it to contain only numerical data. This returns a list of lists, where each inner list contains around 140k data points: numericalData[][].
From this list, I wish to create Testing and Training Data. For my testing data, I want to have 30% of my read data numericalData, so I use the following bit of code:
testingAmount = len(numericalData[0]) * trainingDataPercentage / 100
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData:
testingData.append(np.random.choice(numericalData[x], testingAmount) )
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements of data randomly selected from my imported numericalData.
The issue is, my trainingData needs to hold the other 70% of the data, but I'm unsure how to do this. I've tried comparing each element in my testingData, and if both elements aren't equal, adding it to my trainingData. This resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data and then save that new column to my trainingData; alas, that didn't work either.
I've only been working with python for the past week so I'm a bit lost on what to try now.
You can use random.shuffle and split the list after that. For a toy example:
import random

data = list(range(1, 11))  # make a list so it can be shuffled in place (needed on Python 3)
random.shuffle(data)
training = data[:5]
testing = data[5:]
To get more information, read the docs.
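Applied to the NumPy columns from the question, a rough sketch could look like this (assuming numericalData is a list of 38 equally long columns; the data below is a made-up stand-in):
import numpy as np

# Stand-in for the imported data: 38 columns of 140k values each (hypothetical).
numericalData = [np.random.rand(140000) for _ in range(38)]

nRows = len(numericalData[0])
indices = np.random.permutation(nRows)   # shuffle the row indices once
split = int(nRows * 0.7)                 # 70% training, 30% testing

# The same shuffled indices are applied to every column, so rows stay aligned
# and the two sets are disjoint.
trainingData = [col[indices[:split]] for col in numericalData]
testingData = [col[indices[split:]] for col in numericalData]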

influxdb: Write multiple points vs single point multiple times

I'm using InfluxDB in my project and I'm facing an issue with a query when multiple points are written at once.
I'm using influxdb-python to write 1000 unique points to influxdb.
In the influxdb-python there is a function called influxclient.write_points()
I have two options now:
Write each point once every time (1000 times) or
Consolidate 1000 points and write all the points once.
The code for the first option looks like this (pseudo code only) and it works:
thousand_points = [0...999]
while i < 1000:
...
...
point = [{thousand_points[i]}] # A point must be converted to dictionary object first
influxclient.write_points(point, time_precision="ms")
i += 1
After writing all the points, when I write a query like this:
SELECT * FROM "mydb"
I get all the 1000 points.
To avoid the overhead added by a separate write in every iteration, I felt like exploring writing multiple points at once, which is supported by the write_points function.
write_points(points, time_precision=None, database=None,
retention_policy=None, tags=None, batch_size=None)
Write to multiple time series names.
Parameters: points (list of dictionaries, each dictionary represents
a point) – the list of points to be written in the database
So, what I did was:
thousand_points = [0...999]
points = []
while i < 1000:
...
...
points.append({thousand_points[i]}) # A point must be converted to dictionary object first
i += 1
influxclient.write_points(points, time_precision="ms")
With this change, when I query:
SELECT * FROM "mydb"
I only get 1 point as the result. I don't understand why.
Any help will be much appreciated.
You might have a good case for a SeriesHelper.
In essence, you set up a SeriesHelper class in advance, and every time you discover a data point to add, you make a call. The SeriesHelper will batch up the writes for you, up to bulk_size points per write.
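A minimal sketch, loosely following the influxdb-python documentation (the connection settings, measurement, field, and tag names below are placeholders):
from influxdb import InfluxDBClient, SeriesHelper

class MySeriesHelper(SeriesHelper):
    class Meta:
        client = InfluxDBClient('localhost', 8086, database='mydb')
        series_name = 'mydata'   # placeholder measurement name
        fields = ['value']
        tags = ['source']
        bulk_size = 100          # flush to InfluxDB every 100 points
        autocommit = True

for value in thousand_points:
    MySeriesHelper(source='generator', value=value)

MySeriesHelper.commit()          # flush whatever is still buffered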
I know this was asked well over a year ago; however, in order to publish multiple data points in bulk to InfluxDB, each data point needs to have a unique timestamp, it seems, otherwise it will just be continuously overwritten.
I'd import datetime and add the following to each data point within the for loop:
'time': datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
So each datapoint should look something like...
{'fields': data, 'measurement': measurement, 'time': datetime....}
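Here is a rough sketch of the bulk write with per-point times (the connection settings and measurement name are placeholders, and the millisecond offset is just one way to make sure no two timestamps collide):
import datetime
from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', 8086, database='mydb')  # placeholder connection

base = datetime.datetime.utcnow()
points = []
for i, value in enumerate(thousand_points):
    points.append({
        'measurement': 'mydata',   # placeholder measurement name
        # shift each point by one millisecond so no two timestamps collide
        'time': (base + datetime.timedelta(milliseconds=i)).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        'fields': {'value': value},
    })

client.write_points(points)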
Hope this is helpful for anybody else who runs into this!
Edit: Reading the docs shows that another unique identifier is a tag, so you could instead include {'tag': i} (assuming each iteration value is unique) if you don't wish to specify a time. (However, I haven't tried this.)

Write a collection to a binary file

I'm searching for a way to write some data (List, Array, etc.) into a binary file. The collection to put into the binary file represents a list of points. What I have tried so far:
Welcome to Scala 2.12.0-M3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40).
Type in expressions for evaluation. Or try :help.
scala> import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
scala> val oos = new ObjectOutputStream(new FileOutputStream("/tmp/f1.data"))
oos: java.io.ObjectOutputStream = java.io.ObjectOutputStream@13bc8645
scala> oos.writeObject(List(1,2,3,4))
scala> oos.close
scala> val ois = new ObjectInputStream(new FileInputStream("/tmp/f1.data"))
ois: java.io.ObjectInputStream = java.io.ObjectInputStream@392a04e7
scala> val s : List[Int] = ois.readObject().asInstanceOf[List[Int]]
s: List[Int] = List(1, 2, 3, 4)
OK, it's working well. The problem is that maybe tomorrow I will need to read this binary file with another language, such as Python. Is there a way to have a more generic binary file that can be read by multiple languages?
Solution
For anyone searching in the same situation, you can do it like this:
import java.io.RandomAccessFile
import java.nio.ByteBuffer

def write2binFile(filename: String, a: Array[Int]) = {
  val channel = new RandomAccessFile(filename, "rw").getChannel
  val bbuffer = ByteBuffer.allocateDirect(a.length * 4)  // 4 bytes per Int
  val ibuffer = bbuffer.asIntBuffer()                    // Int view of the byte buffer
  ibuffer.put(a)
  channel.write(bbuffer)
  channel.close
}
Format for cross-platform sharing of point coordinates allowing selective access by RANGE
Your requirements are:
store data by Scala, read by Python (or other languages)
the data are lists of point coordinates
store the data on AWS S3
allow fetching only part of the data using RANGE request
Data structure to use
The data must be uniform in structure and size per element to allow calculating the position of a given part by means of RANGE.
If the Scala format for storing lists/arrays fulfils this requirement, and if the binary format is well defined, you may succeed. If not, you have to find another format.
Reading binary data by Python
Assuming the format is known, use the Python struct module from the stdlib to read it.
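For example, assuming the file contains raw 4-byte big-endian ints as written by write2binFile above (big-endian is java.nio's default byte order), a sketch of fetching elements [start, start + count) from S3 with a RANGE request and decoding them could look like this (bucket and key names are placeholders):
import struct
import boto3

def read_points(bucket, key, start, count):
    """Fetch `count` 4-byte big-endian ints starting at element `start`."""
    s3 = boto3.client('s3')
    byte_range = 'bytes={}-{}'.format(start * 4, start * 4 + count * 4 - 1)
    body = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)['Body'].read()
    return struct.unpack('>{}i'.format(len(body) // 4), body)

# e.g. elements 1000..1999 of a file written by write2binFile
points = read_points('my-bucket', 'points.bin', 1000, 1000)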
Alternative approach: split data to smaller pieces
You want to access the data piece by piece, probably expecting one large object on S3 and using HTTP requests with RANGE.
An alternative solution is to split the data into smaller pieces of a reasonable size for fetching (e.g. 64 kB, but you know your use case better), and to design a rule for storing them piece by piece on AWS S3. You may even use a tree structure for this purpose.
There are some advantages with this approach:
use whatever format you like, e.g. XML or JSON; no need to deal with special binary formats
pieces can be compressed, so you will save some costs
Note that AWS S3 will charge you not only for data transfer but also per request, so each HTTP request using RANGE is counted as one.
Cross-platform binary formats to consider
Consider the following formats:
BSON
(Google) Protocol Buffers
HDF5
If you visit the Wikipedia page for any of those formats, you will find many links to other formats.
Anyway, I am not aware of any such format that uses a uniform size per element, as most of them try to keep the size as small as possible. For this reason they cannot be used in a RANGE-based scenario unless some special index file is introduced (which is probably not very feasible).
On the other hand, using these formats with the alternative approach (splitting the data into smaller pieces) should work.
Note: I did some tests in the past regarding storage efficiency and speed of encoding/decoding. From a practical point of view, the best results were achieved using a simple JSON structure (possibly compressed). You find these options on every platform, it is very simple to use, and the speed of encoding/decoding is high (I do not say the highest).

geoJSON: How to create FeatureCollection with dynamic number of Features?

I have a list of Features (all Points) in Python. The Features are dynamic, stemming from database data which is updated at a 30-minute interval.
Hence I never have a static number of Features.
I need to generate a Feature Collection with all Features in my list.
However (as far as I know) the syntax for creating a FeatureCollection wants you to pass it all the features.
ie:
FeatureClct = FeatureCollection(feature1, feature2, feature3)
How does one generate a FeatureCollection without knowing how many features there will be beforehand? Is there a way to append Features to an existing FeatureCollection?
According to the documentation of python-geojson (which I guess you are using; you didn't mention it), you can also pass a list to FeatureCollection; just put all the results into a list and you're good to go:
from geojson import Point, FeatureCollection

feature1 = Point((45, 45))
feature2 = Point((-45, -45))
features = [feature1, feature2]
collection = FeatureCollection(features)
https://github.com/frewsxcv/python-geojson#featurecollection
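For example, with the features built from whatever the database returned on the last refresh (the coordinates below are made up), the list can be any length and then serialised in one go:
import geojson
from geojson import Feature, Point, FeatureCollection

# hypothetical (lon, lat) pairs returned by the database query
rows = [(-45.0, 45.0), (12.5, -7.25), (101.3, 3.1)]

features = [Feature(geometry=Point((lon, lat))) for lon, lat in rows]
collection = FeatureCollection(features)

print(geojson.dumps(collection))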

Pattern for associating pyparsing results with a linked-list of nodes

I have defined a pyparsing rule to parse this text into a syntax-tree...
TEXT COMMANDS:
add Iteration name = "Cisco 10M/half"
append Observation name = "packet loss 1"
assign Observation results_text = 0.0
assign Observation results_bool = True
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append Observation name = "packet loss 2"
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
SYNTAX TREE:
['add', 'Iteration', ['name', 'Cisco 10M/half']]
['append', 'Observation', ['name', 'packet loss 1']]
['assign', 'Observation', ['results_text', '0.0']]
['assign', 'Observation', ['results_bool', 'True']]
['append', 'DataPoint']
['assign', 'DataPoint', ['metric', 'txpackets']]
['assign', 'DataPoint', ['units', 'packets']]
...
I'm trying to associate all the nested key-value pairs in the syntax-tree above into a linked list of objects... the hierarchy looks something like this (each word is a namedtuple... children in the hierarchy are on the parents' list of children):
Log: [
Iteration: [
Observation:
[DataPoint, DataPoint],
Observation:
[DataPoint, DataPoint]
]
]
The goal of all this is to build a generic test data-acquisition platform to drive the flow of tests against network gear, and record the results. After the data is in this format, the same data structure will be used to build a test report. To answer the question in the comments below, I chose a linked list because it seemed like the easiest way to sequentially dequeue the information when writing the report. However, I would rather not assign Iteration or Observation sequence numbers before finishing the tests... in case we find problems and insert more Observations in the course of conducting the test. My theory is that the position of each element in the list is sufficient, but I'm willing to change that if it's part of the problem.
The problem is that I'm getting lost trying to assign Key-Values to objects in the linked list after it's built. For instance, after I insert an Observation namedtuple into the first Iteration, I have trouble reliably handling the update of assign Observation results_bool = True in the example above.
Is there a generalized design pattern to handle this situation? I have googled this for a while, but I can't seem to make the link between parsing the text (which I can do) and managing the data hierarchy (the main problem). Hyperlinks or small demo code are fine... I just need pointers to get on the right track.
I am not aware of an actual design pattern for what you're looking for, but I have a great passion for the issue at hand. I work heavily with network devices, and parsing and organizing the data is a large ongoing challenge for me.
It's clear that the problem is not parsing the data, but what you do with it afterwards. This is where you need to think about the meaning you are attaching to the data you have parsed. The nested-list method might work well for you if the objects containing the lists are also meaningful.
Namedtuples are great for quick-and-dirty class-ish behavior, but they fall flat when you need them to do anything outside of basic attribute access, especially considering that as tuples they are immutable. It seems to me that you'll want to replace certain namedtuple objects with full-blown classes. This way you can highly customize the behavior and methods available.
For example, you know that an Iteration will always contain 1 or more Observation objects which will then contain 1 or more DataPoint objects. If you can accurately describe the relationships, this sets you on the path to handling them.
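As a rough sketch of that idea (illustrative names only, not your code): keep a "current" object for each kind while walking the syntax tree, so that append attaches a new child to the right parent and a later assign lands on the most recently appended object of that kind.
class Node:
    def __init__(self, name=None):
        self.name = name
        self.attrs = {}
        self.children = []

def build(syntax_tree):
    log = Node('Log')
    current = {'Log': log}   # most recently appended object of each kind
    parent = {'Iteration': 'Log', 'Observation': 'Iteration', 'DataPoint': 'Observation'}

    for entry in syntax_tree:
        verb, kind = entry[0], entry[1]
        kv = entry[2] if len(entry) > 2 else None
        if verb in ('add', 'append'):
            node = Node(kv[1] if kv and kv[0] == 'name' else None)
            current[parent[kind]].children.append(node)
            current[kind] = node      # later `assign` lines land here
        elif verb == 'assign':
            current[kind].attrs[kv[0]] = kv[1]
    return log

log = build([
    ['add', 'Iteration', ['name', 'Cisco 10M/half']],
    ['append', 'Observation', ['name', 'packet loss 1']],
    ['assign', 'Observation', ['results_bool', 'True']],
    ['append', 'DataPoint'],
    ['assign', 'DataPoint', ['metric', 'txpackets']],
])
Because children are kept in plain lists, the sequential order is preserved for report writing without assigning sequence numbers up front, and new Observations can still be inserted mid-test.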
I wound up using textfsm, which allows me to keep state between different lines while parsing the configuration file.
