influxdb: Write multiple points vs single point multiple times

influxdb: Write multiple points vs single point multiple times - python

I'm using influxdb in my project and I'm facing an issue with query when multiple points are written at once
I'm using influxdb-python to write 1000 unique points to influxdb.
In the influxdb-python there is a function called influxclient.write_points()
I have two options now:
Write each point once every time (1000 times) or
Consolidate 1000 points and write all the points once.
The first option code looks like this(pseudo code only) and it works:
thousand_points = [0...9999
while i < 1000:
...
...
point = [{thousand_points[i]}] # A point must be converted to dictionary object first
influxclient.write_points(point, time_precision="ms")
i += 1
After writing all the points, when I write a query like this:
SELECT * FROM "mydb"
I get all the 1000 points.
To avoid the overhead added by every write in every iteration, I felt like exploring writing multiple points at once. Which is supported by the write_points function.
write_points(points, time_precision=None, database=None,
retention_policy=None, tags=None, batch_size=None)
Write to multiple time series names.
Parameters: points (list of dictionaries, each dictionary represents
a point) – the list of points to be written in the database
So, what I did was:
thousand_points = [0...999]
points = []
while i < 1000:
...
...
points.append({thousand_points[i]}) # A point must be converted to dictionary object first
i += 1
influxclient.write_points(points, time_precision="ms")
With this change, when I query:
SELECT * FROM "mydb"
I only get 1 point as the result. I don't understand why.
Any help will be much appreciated.

You might have a good case for a SeriesHelper
In essence, you set up a SeriesHelper class in advance, and every time you discover a data point to add, you make a call. The SeriesHelper will batch up the writes for you, up to bulk_size points per write

I know this has been asked well over a year ago, however, in order to publish multiple data points in bulk to influxdb each datapoint needs to have a unique timestamp it seems, otherwise it will just be continously overwritten.
I'd import a datetime and add the following to each datapoint within the for loop:
'time': datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
So each datapoint should look something like...
{'fields': data, 'measurement': measurement, 'time': datetime....}
Hope this is helpful for anybody else who runs into this!
Edit: Reading the docs show that another unique identifier is a tag, so you could instead include {'tag' : i} (supposedly each iteration value is unique) if you wish to specify time. (However this I haven't tried)

Related

Is there another way to convert ee.Number to float except getInfo()?

Hello friends!
Summarization:
I got a ee.FeatureCollection containing around 8500 ee.Point-objects. I would like to calculate the distance of these points to a given coordinate, lets say (0.0, 0.0).
For this i use the function geopy.distance.distance() (ref: https://geopy.readthedocs.io/en/latest/#module-geopy.distance). As input the the function takes 2 coordinates in the form of 2 tuples containing 2 floats.
Problem: When i am trying to convert the coordinates in form of an ee.List to float, i always use the getinfo() function. I know this is a callback and it is very time intensive but i don't know another way to extract them. Long story short: To extract the data as ee.Number it takes less than a second, if i want them as float it takes more than an hour. Is there any trick to fix this?
Code:
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100') #ee.FeatureCollection
list_containing_points = fc_containing_points.toList(fc_containing_points.size()) #ee.List
fc_containing_points_length = fc_containing_points.size() #ee.Number
for index in range(fc_containing_points_length.getInfo()): #i need to convert ee.Number to int
point_tmp = list_containing_points.get(i) #ee.ComputedObject
point = ee.Feature(point_tmp) #transform ee.ComputedObject to ee.Feature
coords = point.geometry().coordinates() #ee.List containing 2 ee.Numbers
#when i run the loop with this function without the next part
#i got all the data i want as ee.Number in under 1 sec
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0]) #tuple containing 2 floats
#when i add this part to the function it takes hours
PS: This is my first question, pls be patient with me.

I would use .map instead of your looping. This stays server side until you export the table (or possibly do a .getInfo on the whole thing)
fc_containing_points = ee.FeatureCollection('projects/eephiladamhiwi/assets/Flensburg_100')
fc_containing_points.map(lambda feature: feature.set("distance_to_point", feature.distance(ee.Feature(ee.Geometry.Point([0.0,0.0])))
# Then export using ee.batch.Export.Table.toXXX or call getInfo
(An alternative might be to useee.Image.paint to convert the target point to an image then, use ee.Image.distance to calculate the distance to the point (as an image), then use reduceRegions over the feature collection with all points but 1) you can only calculate distance to a certain distance and 2) I don't think it would be any faster.)
To comment on your code, you are probably aware loops (especially client side loops) are frowned upon in GEE (primarily for the performance reasons you've run into) but also note that any time you call .getInfo on a server side object it incurs a performance cost. So this line
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0])
Would take roughly double the time as this
coords_client = coords.getInfo()
coords_as_tuple_of_ints = (coords_client[1],coords_client[0])
Finally, you could always just export your entire feature collection to a shapefile (using ee.batch.Export.Table.... as above) and do all the operations using geopy locally.

Pyspark - Saving & using previously calculated values

I have a dataset of thousands of files and I read / treat them with PySpark.
First, I've created functions like the following one to treat the whole dataset and this is working great.
def get_volume_spark(data):
days = lambda i: i * 86400 # This is 60sec*60min*24h
partition = Window.partitionBy("name").orderBy(F.col("date").cast("long")).rangeBetween(days(-31), days(0))
data = data.withColumn("monthly_volume", F.count(F.col("op_id")).over(partition))\
.filter(F.col("monthly_volume") >= COUNT_THRESHOLD)
return data
Every day I got new files arriving and I want to treat new files ONLY and append data the the first created file instead of treating the whole dataset again with more data every day because it would be too long and operations has been already made.
The other thing is, here I split by month for example (I calculate the count per month), but no one can assure that I will have a whole month (and certainly not) in the new files. So I want to keep a counter or something to resume where I were.
I wanted to know if there's some way to do that or this is not possible at all.

Python: improve performance in log writing to file

I am writing a piece of code that involves generation of new parameter values over a double FOR loop and store these values to a file. The loop iteration count can go as high as 10,000 * 100,000. I have stored the variable values in a string, which gets appended on every iteration with newer values. Finally, at the end of loop I write the complete string in a txt file.
op=open("output file path","w+")
totresult = ""
for n seconds: #this user input parameter can be upto 100,000
result = ""
for car in (cars running): #number of cars can be 10000
#Code to check if given car is in range to another car
.
.
#if car in range with another car
if distance < 1000:
result = getDetailsofOtherCar()
totresult = totalresult + carName + result
#end of loops
op.write(totresult)
op.close()
My question here is, is there a better pythonic way to perform this kind of logging. As I am guessing the string gets very bulky in the later iterations and may be causing delay in execution. Is the use of string the best possible option to store the values. Or should I consider other python data structures like list, array. I came across Logging python module but would like to get an opinion before switching to it.
I tried looking up for similar issues but found nothing similar to my current doubt.
Open to any suggestions
Thank you
Edit: code added

You can write to the file as you go e.g.
with open("output.txt", "w") as log:
for i in range(10):
for j in range(10):
log.write(str((i,j)))
Update: whether or not directly streaming the records is faster than concatenating them in a memory buffer depends crucially on how big the buffer becomes, which in turn depends on the number of records and the size of each record. On my machine this seems to kick in around 350MB.

write table cell real-time python

I would like to loop trough a database, find the appropriate values and insert them in the appropriate cell in a separate file. It maybe a csv, or any other human-readable format.
In pseudo-code:
for item in huge_db:
for list_of_objects_to_match:
if itemmatch():
if there_arent_three_matches_yet_in_list():
matches++
result=performoperationonitem()
write_in_file(result, row=object_to_match_id, col=matches)
if matches is 3:
remove_this_object_from_object_to_match_list()
can you think of any way other than going every time through all the outputfile line by line?
I don't even know what to search for...
even better, there are better ways to find three matching objects in a db and have the results in real-time? (the operation will take a while, but I'd like to see the results popping out RT)

Assuming itemmatch() is a reasonably simple function, this will do what I think you want better than your pseudocode:
for match_obj in list_of_objects_to_match:
db_objects = query_db_for_matches(match_obj)
if len(db_objects) >= 3:
result=performoperationonitem()
write_in_file(result, row=match_obj.id, col=matches)
else:
write_blank_line(row=match_obj.id) # if you want
Then the trick becomes writing the query_db_for_matches() function. Without detail, I'll assume you're looking for objects that match in one particular field, call it type. In pymongo such a query would look like:
def query_db_for_matches(match_obj):
return pymongo_collection.find({"type":match_obj.type})
To get this to run efficiently, make sure your database has an index on the field(s) you're querying on by first calling:
pymongo_collection.ensure_index({"type":1})
The first time you call ensure_index it could take a long time for a huge collection. But each time after that it will be fast -- fast enough that you could even put it into query_db_for_matches before your find and it would be fine.

summing entries in a variable based on another variable(make unique) in python lists

i have a question as to how i can perform this task in python:-
i have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
till about 110000 entries with about 4000 different combinations of longitude-latitude
i want to count the average connections, average policy status,average of activity flag for each location
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
so on
and i have about 195 files with ~110,000 entries each (sort of a big data problem)
my files are in .csv but im using it as .txt to easily work with it in python(not sure if this is the best idea)
im still new to python so im not really sure whats the best approach to use but i sincerely appreciate any help or guidance for this problem
thanks in advance!

No, if you have the files as .csv, threating them as text does not make sense, since python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest writing the data in a proper database, and use SQL's AVG() and GROUP BY. Python ships with bindings for most databaases. If you have none installed, consider using the sqlite module.

I'll only give you the algorithm, you would learn more by writing the actual code yourself.
Use a Dictionary, with the key as a pair of the form (longitude, latitude) and value as a list of the for [ConnectionSum,policystatusSum,ActivityFlagSum]
loop over the entries once (do count the total number of entries, N)
a. for each entry, if the location exists - add the conn, policystat and Activity value to the existing sum.
b. if the entry does not exist, then assign [0,0,0] as the value
Do 1 and 2 for all files.
After all the entries have been scanned. Loop over the dictionary and divide each element of the list [ConnectionSum,policystatusSum,ActivityFlagSum] by N to get the average values of each.

As long as your locations are restricted to being in the same files (or even close to each other in a file), all you need to do is the stream-processing paradigm. For example if you know that duplicate locations only appear in a file, read each file, calculate the averages, then close the file. As long as you let the old data float out of scope, the garbage collector will get rid of it for you. Basically do this:
def processFile(pathToFile):
...
totalResults = ...
for path in filePaths:
partialResults = processFile(path)
totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1) method of calculating averages "on-line". If for example you are averaging 5,6,7, you would do 5/1=5.0, (5.0*1+6)/2=5.5, (5.5*2+7)/3=6. At each step, you only keep track of the current average and the number of elements. This solution will yield the minimal amount of memory used (no more than the size of your final result!), and doesn't care about which order you visit elements in. It would go something like this. See http://docs.python.org/library/csv.html for what functions you'll need in the CSV module.
import csv
def allTheRecords():
for path in filePaths:
for row in csv.somehow_get_rows(path):
yield SomeStructure(row)
averages = {} # dict: keys are tuples (lat,long), values are an arbitrary
# datastructure, e.g. dict, representing {avgConn,avgPoli,avgActi,num}
for record in allTheRecords():
position = (record.lat, record.long)
currentAverage = averages.get(position, default={'avgConn':0, 'avgPoli':0, 'avgActi':0, num:0})
newAverage = {apply the math I mentioned above}
averages[position] = newAverage
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: If you knew the exactly location of every IP event to infinite precision, the average of everything would be itself. The only reason you can compress your dataset is because your latitude and longitude have finite precision. If you run into this issue if you acquire more precise data, you can choose to round to the appropriate precision. It may be reasonable to round to within 10 meters or something; see latitude and longitude. This requires just a little bit of math/geometry.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

influxdb: Write multiple points vs single point multiple times - python

You might have a good case for a SeriesHelper In essence, you set up a SeriesHelper class in advance, and every time you discover a data point to add, you make a call. The SeriesHelper will batch up the writes for you, up to bulk_size points per write

Related

Is there another way to convert ee.Number to float except getInfo()?

Pyspark - Saving & using previously calculated values

Python: improve performance in log writing to file

write table cell real-time python

summing entries in a variable based on another variable(make unique) in python lists

Categories

Resources