Bigtable data modeling and query with Python

This is my first time using Bigtable, and I can't tell whether I don't understand Bigtable modeling or how to use the Python library.
Some background on what I'm storing:
I am storing time-series events that, let's say, have two columns, name and message. My row key is "#200501163223", so the row key includes the time in the format '%y%m%d%H%M%S'.
Let's say later I needed to add another column called "type".
Also, it is possible that there can be two events in the same second.
So this is what I end up with if I store 2 events, with the second event having the additional "type" data:
account#200501163223
Outbox:name # 2020/05/01-17:32:16.412000
"name1"
Outbox:name # 2020/05/01-16:41:49.093000
"name2"
Outbox:message # 2020/05/01-17:32:16.412000
"msg1"
Outbox:message # 2020/05/01-16:41:49.093000
"msg2"
Outbox:type # 2020/05/01-16:35:09.839000
"temp"
When I query this row key using the Python Bigtable library, I get back a dictionary with my column names as keys and the data as lists of Cell objects.
The "name" and "message" keys would each have 2 objects, and "type" would only have one object since it was only part of the second event.
My question is, how do I know which event, 1 or 2, the "type" value of "temp" belongs to? Is this model just wrong, so that I have to ensure only one event can be stored under a row key (which would be hard to do), or is there a trick I'm missing in the library that lets me associate the event data accordingly?

This is a great question, tasha, and something I've come across before too, so thanks for asking it.
In Bigtable, there is no concept of columns being connected because they came from the same write. That gives you a lot of flexibility in how you use columns and versions, which is helpful for many workloads, but in your case it causes this issue.
The best way to handle this is in two steps.
First, make sure that each time you write to a row, you use the same timestamp for every cell in that write. That would look like this:
import datetime

timestamp = datetime.datetime.utcnow()
row_key = "account#200501163223"

row = table.direct_row(row_key)
row.set_cell(column_family_id,
             "name",
             "name1",
             timestamp)
row.set_cell(column_family_id,
             "type",
             "temp",
             timestamp)
row.commit()
Then, when you query your database, you can apply a filter to get only the latest version (or the latest N versions), or scan based on timestamp ranges.
from google.cloud.bigtable import row_filters

rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(2))
Here are a few snippets with examples of how to use filters with Bigtable reads; they should be added to the documentation soon.
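In the meantime, here is a rough sketch of reading the row back and regrouping the cells by write timestamp so that each group corresponds to one event. It assumes the same table object and the "Outbox" column family from the question; start_dt and end_dt are placeholders for whatever time window you want.

from collections import defaultdict

from google.cloud.bigtable import row_filters

# Read a single row, optionally restricted to a timestamp window
partial_row = table.read_row(
    "account#200501163223",
    filter_=row_filters.TimestampRangeFilter(
        row_filters.TimestampRange(start=start_dt, end=end_dt)))

# Regroup the cells by their write timestamp; cells sharing a timestamp
# were written together and therefore belong to the same event.
events = defaultdict(dict)
for qualifier, cells in partial_row.cells["Outbox"].items():
    for cell in cells:
        events[cell.timestamp][qualifier.decode()] = cell.value.decode()

for ts, fields in sorted(events.items()):
    print(ts, fields)  # e.g. {'name': 'name2', 'message': 'msg2', 'type': 'temp'}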

Related

Query for 2 indices in Elasticsearch?

I'm wondering if it's possible to query for 2 indices in Elasticsearch and display the results mixed together in one table. For example:
Indices:
food-american-burger
food-italian-pizza
food-japanese-ramen
food-mexican-burritos
#query here for burger and pizza, and display the results in a csv file
#i.e. if there was a timestamp field, display results starting from the most recent
I know you can do a query for food-*, but that would also include the 2 indices that I don't want.
I looked up the multisearch module for Elasticsearch DSL, but the documentation only shows an example querying a single index:
ms = MultiSearch(index='blogs')
ms = ms.add(Search().filter('term', tags='python'))
ms = ms.add(Search().filter('term', tags='elasticsearch'))
Part 1:
Is it possible to use this for multiple indices? Ultimately, I would like to query for any number of indices and display all the data in a single human-readable format (CSV, JSON, etc.), but I'm not sure how to perform a single query for only the indices I want.
I currently have the functionality to perform queries and write out the data, but each data file only contains the results of the single index I queried for. I would like to have all the data in one file.
Part 2:
The data is stored in a dictionary, and then I am writing it to a csv. It is currently being ordered by timestamp. The code:
sorted_rows = sorted(rows, key=lambda x: x['#timestamp'], reverse=True)
for row in sorted_rows:
    writer.writerow(row.values())
When writing to the csv, the timestamp field is not the first column. I'm storing the fields in a dictionary, and updating that dictionary for every Elasticsearch hit, then writing it to the csv. Is there a way to move the timestamp field to the first column?
Thanks!
According to the Elasticsearch docs, you can query a single index (e.g. food-american-burger), multiple comma-separated indices (e.g. food-american-burger,food-italian-pizza), or all indices using the _all keyword.
I haven't personally used the Python client, but this is an API convention and should apply to any of the official Elasticsearch clients.
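For what it's worth, here's a rough sketch of how that might look with the elasticsearch-dsl client; the host, the pair of index names, and the @timestamp sort field are assumptions for illustration. Search accepts either a list or a comma-separated string of index names.

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch("http://localhost:9200")

# One search across exactly the indices you want, newest first
s = (Search(using=client, index=["food-american-burger", "food-italian-pizza"])
     .sort("-@timestamp"))

for hit in s:
    print(hit.meta.index, hit.to_dict())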
For part 2, you should probably submit a separate question to keep things to a single topic per question, since the two topics are not directly related.

Organizing column and header data with pandas, python

I'm having a go at using Numpy instead of Matlab, but I'm relatively new to Python.
My current challenge is importing the data from multiple files in a sensible way so that I can use and plot it. The data is organized in columns (Temperature, Pressure, Time, etc., each file being a measurement period), and I decided pandas was probably the best way to import the data. I was thinking of using a top-level descriptor for each file and sub-descriptors for each column. I thought of doing it something like this:
Reading Multiple CSV Files into Python Pandas Dataframe
The problem is I'd like to retain and use some of the data in the header (for plotting, for instance). There are no column titles, but there is general info on the data measurements, something like this:
Flight ID: XXXXXX
Date: 01-27-10 Time: 5:25:19
OWNER
Release Point: xx.304N xx.060E 11 m
Serial Number xxxxxx
Surface Data: 985.1 mb 1.0 C 100% 1.0 m/s # 308 deg.
I really don't know how to extract and store the data in a way that makes sense when combined with the dataframe. I thought of perhaps using a dictionary, but I'm not sure how to split the data efficiently since there's no consistent divider. Any ideas?
Looks like somebody is working with radiosondes...
When I pull in my radiosonde data I usually put it in a multi-level indexed dataframe. The levels could be of various forms and orders, but something like FLIGHT_NUM, DATE, ALTITUDE, etc. would make sense. Also, when working with sonde data I too want some additional information that does not necessarily need to be stored within the dataframe, so I store that as additional attributes. If I were to parse your file and then store it I would do something along the lines of this (yes, there are modifications that can be made to "improve" this):
import pandas as pd

with open("filename.csv", 'r') as f:
    header = [f.readline().strip() for _ in range(6)]  # change to match the number of header rows
    f.seek(0)
    data = pd.read_csv(f, skiprows=6, skipinitialspace=True,
                       na_values=[-999, 'Infinity', '-Infinity'])
# now you can parse your header to get out the necessary information
# continue until you have all the header info you want/need; e.g.
flight = header[0].split(': ')[1]
date = header[1].split(': ')[1].split(' ')[0]
time = header[1].split(': ')[2]
# a lot of the header information will get stored as metadata for me.
# most likely you want more than flight number and date in your metadata, but you get the point.
data.metadata = {'flight': flight,
                 'date': date}
I presume you have a date/time column (call it "dates" here) within your file, so you can use that to re-index your dataframe. If you choose to use different variables within your multi-level index then the same method applies.
new_index = [(data.metadata['flight'],r) for r in data.dates]
data.index = pd.MultiIndex.from_tuples(new_index)
You now have a multi-level indexed dataframe.
Now, regarding your "metadata". EdChum makes an excellent point that if you copy "data" you will NOT copy over the metadata dictionary. Also, if you save "data" to disk via data.to_pickle you will lose your metadata (more on this later). If you want to keep your metadata you have a couple of options.
1. Save the data on a flight-by-flight basis. This will allow you to store metadata for each individual flight's file.
2. Assuming you want to have multiple flights within one saved file: you can add additional columns within your dataframe that hold that information (i.e. another column for flight number, another column for surface temperature, etc.), though this will increase the size of your saved file.
3. Assuming you want to have multiple flights within one saved file (a variant of option 2): you can make your metadata dictionary "keyed" by flight number, e.g.

   data.metadata = {FLIGHT1: {'date': date},
                    FLIGHT2: {'date': date}}
Now to store the metadata. Check out my IO class on storing additional attributes within an h5 file posted here.
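As a minimal sketch of that idea (assuming PyTables is installed; the file name and key are arbitrary), you can stash the metadata dictionary as an attribute alongside the dataframe in an HDFStore:

import pandas as pd

store = pd.HDFStore("flights.h5")
store.put("sonde", data)                                  # the dataframe itself
store.get_storer("sonde").attrs.metadata = data.metadata  # metadata rides along in the h5 file
store.close()

# Reading it back later:
store = pd.HDFStore("flights.h5")
data = store["sonde"]
data.metadata = store.get_storer("sonde").attrs.metadata
store.close()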
Your question was quite broad, so you got a broad answer. I hope this was helpful.

Modify column output for sqlform.grid() in Web2py

I have started using web2py for a web application and am trying to use SQLFORM.grid(...) to display a paginated listing of one of my db tables' data, as in the following minimal example.
grid = SQLFORM.grid(query,
                    links=links,
                    fields=[db.example.date, db.example.foo, db.example.bar])
The db.example.date field contains a Python datetime.datetime object in UTC. At the moment it is displayed just plainly like that. However, I want more control over the actual output, so that I can set the local timezone and modify the output string to something like "2 hours ago".
As seen in another question[0], I can use links to insert new columns. Unfortunately, I can't seem to sort the rows by a field inserted that way. Also, the new columns are inserted on the right instead of replacing my first column, so that does not seem to be a solution.
To sum it up: how do I gain control over the way db.example.date is printed in the end?
[0] Calculated Fields in web2py sqlgrid
You can achieve your goal when you define the table in your model. The represent parameter of the Field constructor that you used in define_table will be recognized by SQLFORM.grid. For example, if you wanted to print just the date with the month name, you could put the following in your model.
Field('a_date', type='date', represent=lambda x, row: x.strftime("%B %d, %Y")),
Your function could also convert to local time.
You need to use prettydate to convert the datetime into a humanized string, and call it in the represent parameter of your Field() descriptor. For example:
from gluon.tools import prettydate
db.example.date.represent = lambda v,r: prettydate(r.date)
That way, any display of the db.example.date would be displayed humanized, including through SQLFORM.grid
If you don't want the date always represented this way as per David Nehme's answer, you can set db.table.field.represent in the controller, just before the grid creation:
db.example.date.represent = lambda value, row: value.strftime("%B %d, %Y")
followed by.
grid = SQLFORM.grid(query,....
I use this often when I join tables. If the represent defined in the model file references row.field, it breaks after a join because it must then be more specific: row.table.field.
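As a rough sketch of both cases (the table and field names follow the question, and prettydate is just one choice of formatter):

from gluon.tools import prettydate

# Single-table grid: the bare value is enough
db.example.date.represent = lambda value, row: prettydate(value)

# Grid over a join: the row is nested per table, so qualify the field
db.example.date.represent = lambda value, row: prettydate(row.example.date)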

How to store numerical lookup table in Python (with labels)

I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as a ndarray, but I'd really like to be able to access the outputs based on the parameter values not just indices. For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo

data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]
for entry in data.find():  # this will yield all results
    print(entry["param1"])  # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, then you could use a Python nested dictionary instead of an ndarray, and serialize it to a JSON text file using the json module.
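For example, a small sketch of that idea, where the parameter names and values are made up and run_model stands in for your model call; note that JSON keys must be strings, so tuple keys have to be stringified before dumping:

import itertools
import json

solar_z_values = [0, 15, 30, 45]
solar_a_values = [0, 90, 170]

table = {}
for solar_z, solar_a in itertools.product(solar_z_values, solar_a_values):
    table[(solar_z, solar_a)] = run_model(solar_z, solar_a)  # your model call here

# Access by parameter value rather than by index:
output = table[(45, 170)]

# JSON keys must be strings, so stringify the tuple keys when serializing
with open("lookup.json", "w") as f:
    json.dump({str(k): v for k, v in table.items()}, f)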
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}  # maps each solar_z value to its array index
solar_a_dict = {...}
...

def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also convert to string and eval, if you want to have some of the fields to be given as "None" and be translated to ":" (to give the full table for that variable).
For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
import pylab as plb
from sys import argv

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = plb.loadtxt(argv[1], dtype=dt)
Now you can access the columns by name, e.g. data['T'], data['L'], data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

Counts of events grouped by date in python?

This is no doubt another noobish question, but I'll ask it anyways:
I have a data set of events with exact datetimes in UTC. I'd like to create a line chart showing the total number of events per day (date) in a specified date range. Right now I can retrieve the total data set for the needed date range, but then I need to go through it and count up for each date.
The app is running on google app engine and is using python.
What is the best way to create a new data set showing each date and the corresponding count (including dates on which there were no events) that I can then pass to a Django template?
Data set for this example looks like this:
class Event(db.Model):
    event_name = db.StringProperty()
    doe = db.DateTimeProperty()
    dlu = db.DateTimeProperty()
    user = db.UserProperty()
Ideally, I want something with date and count for that date.
Thanks and please let me know if something else is needed to answer this question!
You'll have to do the binning in-memory (i.e. after the datastore fetch).
The .date() method of a datetime instance will facilitate your binning; it chops off the time element. Then you can use a dictionary to hold the bins:
bins = {}
for event in Event.all().fetch(1000):
    bins.setdefault(event.doe.date(), []).append(event)
Then do what you wish with (e.g. count) the bins. For a direct count:
import collections

counts = collections.defaultdict(int)
for event in Event.all().fetch(1000):
    counts[event.doe.date()] += 1
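To also cover the dates with no events, which the question asks for, a rough sketch along these lines could fill in zeros over the requested range (start_date and end_date are assumed datetime bounds, not from the original code):

import collections
import datetime

counts = collections.defaultdict(int)
for event in Event.all().filter('doe >=', start_date).filter('doe <', end_date).fetch(1000):
    counts[event.doe.date()] += 1

series = []
day = start_date.date()
while day < end_date.date():
    series.append((day, counts.get(day, 0)))  # 0 for days with no events
    day += datetime.timedelta(days=1)
# series is now a list of (date, count) pairs ready to pass to the Django template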
I can't see how that would be possible with a single query, as GQL has no support for GROUP BY or aggregation generally.
In order to minimize the amount of work you do, you'll probably want to write a task that sums up the per-day totals once, so you can reuse them. I'd suggest using the bulkupdate library to run a once-a-day task that counts events for the previous day, and creates a new model instance, with a key name based on the date, containing the count. Then, you can get all needed data points by doing a query (or better, a batch get) for the set of summary entities you need.
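A rough sketch of that summary-entity idea (the model, property names, and isoformat key scheme are illustrative, not taken from the bulkupdate library):

from google.appengine.ext import db

class DailyEventCount(db.Model):
    date = db.DateProperty()
    count = db.IntegerProperty(default=0)

def store_daily_count(day, total):
    # A key_name derived from the date keeps the once-a-day task idempotent
    DailyEventCount(key_name=day.isoformat(), date=day, count=total).put()

def get_daily_counts(days):
    keys = [db.Key.from_path('DailyEventCount', d.isoformat()) for d in days]
    return db.get(keys)  # batch get; days without a summary come back as None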
