I have a list of Features (all Points) in Python. The Features are dynamic, stemming from database data which is updated at a 30-minute interval.
Hence I never have a static number of features.
I need to generate a Feature Collection with all Features in my list.
However (as far as I know) the syntax for creating a FeatureCollection wants you to pass it all the features.
i.e.:
FeatureClct = FeatureCollection(feature1, feature2, feature3)
How does one generate a FeatureCollection without knowing how many features there will be beforehand? Is there a way to append Features to an existing FeatureCollection?
According to the documentation of python-geojson (which I guess you are using; you didn't mention it), you can also pass a list to FeatureCollection. Just put all the results into a list and you're good to go:
from geojson import Point, FeatureCollection

feature1 = Point((45, 45))
feature2 = Point((-45, -45))
features = [feature1, feature2]
collection = FeatureCollection(features)
https://github.com/frewsxcv/python-geojson#featurecollection
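Since the features come from the database and their number changes every 30 minutes, the list can simply be rebuilt on each refresh. A minimal sketch, where fetch_points_from_db() is a made-up placeholder for however your rows are retrieved:

from geojson import Point, Feature, FeatureCollection

features = []
for lon, lat in fetch_points_from_db():   # hypothetical (lon, lat) rows, refreshed every 30 minutes
    features.append(Feature(geometry=Point((lon, lat))))

collection = FeatureCollection(features)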
I'm trying to perform a simple task which requires iterating over and interacting with specific vectors after loading them into gensim's Word2Vec.
Basically, given a txt file of the form:
t1 -0.11307 -0.63909 -0.35103 -0.17906 -0.12349
t2 0.54553 0.18002 -0.21666 -0.090257 -0.13754
t3 0.22159 -0.13781 -0.37934 0.39926 -0.25967
where t1 is the name of the vector and what follows are its components. I load it in using vecs = KeyedVectors.load_word2vec_format(datapath(f), binary=False).
Now, I want to iterate through the vectors and make a calculation; take summing up all of the vectors as an example. If this were read in using with open(f), I know I could just use .split(' ') on it, but since this is now a KeyedVectors object, I'm not sure what to do.
I've looked through the word2vec documentation, as well as used dir(KeyedVectors) but I'm still not sure if there is an attribute like KeyedVectors.vectors or something that allows me to perform this task.
Any tips/help/advice would be much appreciated!
There's a list of all words in the KeyedVectors object in its .index_to_key property. So one way to sum all the vectors would be to retrieve each by name in a list comprehension:
np.sum([vecs[key] for key in vecs.index_to_key], axis=0)
But if all you really want to do is sum the vectors, and the keys (word tokens) aren't an important part of your calculation, the full set of raw word-vectors is available in the .vectors property, as a numpy array with one vector per row. So you could also do:
np.sum(vecs.vectors, axis=0)
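A minimal end-to-end sketch of both approaches, assuming a plain-text vectors file named vectors.txt in word2vec format (the file name is made up for illustration):

import numpy as np
from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# Sum by key (keeps access to the token behind each row)...
total_by_key = np.sum([vecs[key] for key in vecs.index_to_key], axis=0)

# ...or sum the raw matrix directly (one row per word).
total = np.sum(vecs.vectors, axis=0)

assert np.allclose(total_by_key, total)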
Questions like 1 and 2 give answers for retrieving vocabulary frequencies from gensim word2vec models.
For some reason, they actually just give a descending counter from n (the size of the vocab) down to 1, alongside the tokens ordered from most to least frequent.
For example:
for idx, w in enumerate(model.vocab):
    print(idx, w, model.vocab[w].count)
Gives:
0 </s> 111051
1 . 111050
2 , 111049
3 the 111048
4 of 111047
...
111050 tokiwa 2
111051 muzorewa 1
Why is it doing this? How can I extract term frequencies from the model, given a word?
Those answers are correct for reading the declared token-counts out of a model which has them.
But in some cases, your model may only have been initialized with a fake, descending-by-1 count for each word. In Gensim, this most likely happens if the model was loaded from a source where the counts either weren't available or weren't used.
In particular, if you created the model using load_word2vec_format(), that simple vectors-only format (whether binary or plain-text) inherently contains no word counts. But the words in such files are almost always, by convention, sorted in most-frequent to least-frequent order.
So, Gensim has chosen, when frequencies are not present, to synthesize fake counts, with linearly descending int values, where the (first) most-frequent word begins with the count of all unique words, and the (last) least-frequent word has a count of 1.
(I'm not sure this is a good idea, but Gensim's been doing it for a while, and it ensures code relying on the per-token count won't break, and will preserve the original order, though obviously not the unknowable original true-proportions.)
In some cases, the original source of the file may have saved a separate .vocab file with the word-frequencies alongside the word2vec_format vectors. (In Google's original word2vec.c code release, this is the file generated by the optional -save-vocab flag. In Gensim's .save_word2vec_format() method, the optional fvocab parameter can be used to generate this side file.)
If so, that 'vocab' frequencies filename may be supplied, when you call .load_word2vec_format(), as the fvocab parameter - and then your vector-set will have true counts.
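A minimal sketch of that load, with made-up file names (vectors.bin and vectors.vocab); the side file simply holds one "word count" line per word:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", fvocab="vectors.vocab", binary=True)

# With fvocab supplied, the stored per-word counts are the real frequencies:
# in Gensim 4.x they are read with kv.get_vecattr(word, "count"),
# in Gensim 3.x with kv.vocab[word].count.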
If your word-vectors were originally created in Gensim from a corpus giving actual frequencies, and were always saved/loaded using the Gensim native functions .save()/.load(), which use an extended form of Python pickling, then the original true count info will never have been lost.
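For example, the Gensim-native round trip (file name made up) preserves the counts along with all other metadata:

kv.save("vectors.kv")                  # extended pickling; counts survive this path
kv2 = KeyedVectors.load("vectors.kv")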
If you've lost the original frequency data, but you know the data was from a real natural-language source, and you want a more realistic (but still faked) set of frequencies, an option could be to use the Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this 'tall head, long tail' distribution.) A formula for creating such more-realistic dummy counts is available in the answer:
Gensim: Any chance to get word frequency in Word2Vec format?
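As a rough illustration of the idea (not necessarily the exact formula in the linked answer): Zipf's law says frequency falls off roughly as 1/rank, so synthetic counts could be assigned like this in Gensim 4.x, continuing with the kv object from the sketch above, where the scale k is an arbitrary choice:

k = 1_000_000  # hypothetical count for the most frequent word
for rank, word in enumerate(kv.index_to_key, start=1):
    kv.set_vecattr(word, "count", int(k / rank))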
I have lines like this in my data:
0,tcp,http,SF,181,5450,0,0,0.5,normal.
I want to use the decision tree algorithm for training. I couldn't create LabeledPoints, so I tried HashingTF for the strings, but I couldn't get it to work. "normal" is my target label. How can I create a LabeledPoint RDD to use in pyspark? Also, the label for LabeledPoint requires a double; should I just create some double values for the labels, or should they be hashed?
I came up with a solution.
First of all, Spark's decision tree classifier already has a parameter for this: categoricalFeaturesInfo. From the PySpark API documentation:
categoricalFeaturesInfo - Map from categorical feature index to number of categories. Any feature not in this map is treated as continuous.
However, before doing this, we first have to replace the strings with numbers so that PySpark can understand them.
Then, for the example data above, we create categoricalFeaturesInfo as in the definition, like this:
categoricalFeaturesInfo = {1:len(feature1), 2:len(feature2), 3:len(feature3), 9:len(labels)}
Simply put, the keys are the indexes of the categorical features and the values are the number of categories in that feature.
Note that converting the strings to numbers is enough for the training algorithm, but if you also declare the categorical features like this, it will train faster.
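A minimal sketch of both steps (replacing strings with indexes, then declaring them in categoricalFeaturesInfo), assuming pyspark.mllib; the lookup tables and file name are made up for illustration, and only the protocol, service and flag columns plus the label are treated as categorical:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

# Hypothetical lookup tables mapping each categorical string to an integer index.
protocol_idx = {"tcp": 0, "udp": 1, "icmp": 2}
service_idx = {"http": 0, "smtp": 1, "ftp": 2}
flag_idx = {"SF": 0, "REJ": 1, "S0": 2}
label_idx = {"normal.": 0, "anomaly.": 1}

def parse_line(line):
    parts = line.strip().split(",")
    label = float(label_idx[parts[-1]])           # target label as a double
    features = [
        float(parts[0]),                          # continuous
        float(protocol_idx[parts[1]]),            # categorical -> index
        float(service_idx[parts[2]]),
        float(flag_idx[parts[3]]),
    ] + [float(x) for x in parts[4:-1]]           # remaining continuous fields
    return LabeledPoint(label, features)

data = sc.textFile("data.csv").map(parse_line)    # 'sc' is an existing SparkContext

model = DecisionTree.trainClassifier(
    data,
    numClasses=len(label_idx),
    categoricalFeaturesInfo={1: len(protocol_idx), 2: len(service_idx), 3: len(flag_idx)})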
Assume that I have a set of documents stored in an index of Elasticsearch, such that each document has the following (simplified) form:
{
  "timestamp": N,
  "val": X
}
where N is a long integer representing a unix-timestamp and X is some float.
My goal is to plot the behavior of val over time; in other words, obtain a graph where the x-axis is the time(stamp) and the y-axis is the val.
Medium-small number of documents
If the number of documents stored in the index is medium-small, then using Python I could do the following: scan the documents, using for example the scan helper, and create a list of the JSON documents; next, convert the list into a pandas.DataFrame and sort its rows according to the timestamp; finally, I can easily plot the data as described above. Here is a minimal example:
from elasticsearch.helpers import scan
import pandas

docs = scan(
    es,                  # instance of es-client
    index='myIndex',
    doc_type='myDocType')

docsList = []
for doc in docs:
    docsList.append(doc['_source'])   # keep only the document body

dfDocs = pandas.DataFrame(docsList)
dfDocsSorted = dfDocs.sort_values(by='timestamp')
dfDocsSorted.plot(x='timestamp', y='val')
Here is how the output looks for a sample data set:
I find it a rather clean and simple solution, given that the number of documents is limited.
Large number of documents
What is the "right" way to do the same as above in the case where the number of documents is large? Note that the sorting step above is rather mandatory, as far as I can tell, since the scan returns documents in a "random" order. Therefore, if the number of documents is large, this step (and the storing of the data) will become an issue.
Is there a canonical way to achieve this using Elasticsearch? Or am I bound to carry out a pre-processing locally before being able to plot the data?
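One direction, sketched here under the assumption of the elasticsearch-py helpers, is to push the sort into the scan itself so the client-side sort (and the full in-memory DataFrame) can be skipped; note that preserve_order keeps the scroll sorted but is documented as potentially costly:

from elasticsearch.helpers import scan

docs = scan(
    es,                                       # instance of es-client, as above
    index='myIndex',
    query={'sort': [{'timestamp': 'asc'}]},
    preserve_order=True)                      # keep the requested sort across scroll pages

for hit in docs:
    src = hit['_source']
    handle(src['timestamp'], src['val'])      # handle() is a made-up placeholder (e.g. stream points to a plot)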
I have defined a pyparsing rule to parse this text into a syntax-tree...
TEXT COMMANDS:
add Iteration name = "Cisco 10M/half"
append Observation name = "packet loss 1"
assign Observation results_text = 0.0
assign Observation results_bool = True
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append Observation name = "packet loss 2"
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
SYNTAX TREE:
['add', 'Iteration', ['name', 'Cisco 10M/half']]
['append', 'Observation', ['name', 'packet loss 1']]
['assign', 'Observation', ['results_text', '0.0']]
['assign', 'Observation', ['results_bool', 'True']]
['append', 'DataPoint']
['assign', 'DataPoint', ['metric', 'txpackets']]
['assign', 'DataPoint', ['units', 'packets']]
...
I'm trying to associate all the nested key-value pairs in the syntax-tree above into a linked-list of objects... the hierarchy looks something like this (each word is a namedtuple... children in the hierarchy are on the parent's list of children):
Log: [
Iteration: [
Observation:
[DataPoint, DataPoint],
Observation:
[DataPoint, DataPoint]
]
]
The goal of all this is to build a generic test data-acquisition platform to drive the flow of tests against network gear, and record the results. After the data is in this format, the same data structure will be used to build a test report. To answer the question in the comments below, I chose a linked list because it seemed like the easiest way to sequentially dequeue the information when writing the report. However, I would rather not assign Iteration or Observation sequence numbers before finishing the tests... in case we find problems and insert more Observations in the course of conducting the test. My theory is that the position of each element in the list is sufficient, but I'm willing to change that if it's part of the problem.
The problem is that I'm getting lost trying to assign Key-Values to objects in the linked list after it's built. For instance, after I insert an Observation namedtuple into the first Iteration, I have trouble reliably handling the update of assign Observation results_bool = True in the example above.
Is there a generalized design pattern to handle this situation? I have googled this for a while, but I can't seem to make the link between parsing the text (which I can do) and managing the data-hierarchy (the main problem). Hyperlinks or small demo code are fine... I just need pointers to get on the right track.
I am not aware of an actual design pattern for what you're looking for, but I have a great passion for the issue at hand. I work heavily with network devices and parsing and organizing the data is a large ongoing challenge for me.
It's clear that the problem is not parsing the data, but what you do with it afterwards. This is where you need to think about the meaning you are attaching to the data you have parsed. The nested-list method might work well for you if the objects containing the lists are also meaningful.
Namedtuples are great for quick-and-dirty class-ish behavior, but they fall flat when you need them to do anything outside of basic attribute access, especially considering that as tuples they are immutable. It seems to me that you'll want to replace certain namedtuple objects with full-blown classes. This way you can highly customize the behavior and methods available.
For example, you know that an Iteration will always contain 1 or more Observation objects which will then contain 1 or more DataPoint objects. If you can accurately describe the relationships, this sets you on the path to handling them.
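For example, here is a minimal sketch of what "full-blown classes" could look like for this tree; the class layout and the dispatcher that tracks the current node at each level are my own illustration, not code from the original post:

class Node:
    def __init__(self, **attrs):
        self.attrs = dict(attrs)   # key-value pairs assigned so far
        self.children = []         # ordered child nodes

    def append(self, child):
        self.children.append(child)
        return child

class Log(Node): pass
class Iteration(Node): pass
class Observation(Node): pass
class DataPoint(Node): pass

# Walk the parsed rows, keeping a pointer to the most recently appended node of
# each type, so a later "assign Observation ..." row updates the current Observation.
def build(rows):
    log = Log()
    current = {"Log": log}
    classes = {"Iteration": Iteration, "Observation": Observation, "DataPoint": DataPoint}
    parents = {"Iteration": "Log", "Observation": "Iteration", "DataPoint": "Observation"}
    for row in rows:
        verb, kind = row[0], row[1]
        rest = row[2] if len(row) > 2 else None
        if verb in ("add", "append"):
            node = classes[kind]()
            if rest:
                node.attrs[rest[0]] = rest[1]
            current[parents[kind]].append(node)   # attach under the current parent
            current[kind] = node                  # this node is now the "current" one of its type
        elif verb == "assign":
            current[kind].attrs[rest[0]] = rest[1]
    return log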
I wound up using textfsm, which allows me to keep state between different lines while parsing the configuration file.