I'm wondering if it's possible to query 2 indices in Elasticsearch and display the results mixed together in one table. For example:
Indices:
food-american-burger
food-italian-pizza
food-japanese-ramen
food-mexican-burritos
#query here for burger and pizza, and display the results in a csv file
#i.e. if there was a timestamp field, display results starting from the most recent
I know you can query food-*, but that would include 2 indices that I don't want.
I looked up the MultiSearch class in the Elasticsearch DSL library, but the documentation only shows an example that queries a single index:
ms = MultiSearch(index='blogs')
ms = ms.add(Search().filter('term', tags='python'))
ms = ms.add(Search().filter('term', tags='elasticsearch'))
Part 1:
Is it possible to use this for multiple indices? Ultimately, I would like to query x number of indices and display all the data in a single human-readable format (CSV, JSON, etc.), but I'm not sure how to perform a single query for only the indices I want.
I currently have the functionality to perform queries and write out the data, but each data file only contains the results of the one index I queried. I would like to write all the data to a single file.
Part 2:
The data is stored in dictionaries, which I then write to a CSV file, ordered by timestamp. The code:
sorted_rows = sorted(rows, key=lambda x: x['#timestamp'], reverse=True)
for row in sorted_rows:
    writer.writerow(row.values())
When writing to the CSV, the timestamp field is not the first column. I'm storing the fields in a dictionary, updating that dictionary for every Elasticsearch hit, and then writing it to the CSV. Is there a way to move the timestamp field to the first column?
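For illustration, one possible approach is csv.DictWriter with an explicit field order so the timestamp column comes first (a rough, untested sketch; the fields other than '#timestamp' are made up for the example):
import csv

# assumed example rows; in practice these come from the Elasticsearch hits
rows = [
    {'food': 'burger', '#timestamp': '2020-05-01T17:32:16', 'price': '5.99'},
    {'food': 'pizza', '#timestamp': '2020-05-01T16:41:49', 'price': '8.50'},
]

sorted_rows = sorted(rows, key=lambda x: x['#timestamp'], reverse=True)

# put '#timestamp' first, keep the remaining fields in their original order
fieldnames = ['#timestamp'] + [k for k in sorted_rows[0] if k != '#timestamp']

with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(sorted_rows)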
Thanks!
According to the Elasticsearch docs, you can query a single index (e.g. food-american-burger), multiple comma-separated indices (e.g. food-american-burger,food-italian-pizza), or all indices using the _all keyword.
I haven't personally used the Python client, but this is an API convention and should apply to any of the official Elasticsearch clients.
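For example, with elasticsearch-dsl it should look something like this (a minimal sketch I haven't run; the index names and the '#timestamp' field are taken from the question, and sorting by that field newest-first is an assumption):
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch('http://localhost:9200')  # assumed local cluster

# query only the indices you care about, newest first
s = (Search(using=client, index='food-american-burger,food-italian-pizza')
     .query('match_all')
     .sort('-#timestamp'))

for hit in s:
    print(hit.to_dict())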
For part 2, you should probably submit a separate question to keep things to a single topic per question, since the two topics are not directly related.
Related
This is my first time using Bigtable, and I can't tell whether I don't understand Bigtable modeling or just how to use the Python library.
Some background on what I'm storing:
I am storing time series events that, let's say, have two columns, name and message. My row key is "#200501163223", so the row key includes the time in the format '%y%m%d%H%M%S'.
Let's say later I needed to add another column called "type".
Also, it is possible that there are two events in the same second.
So this is what I end up with if I store 2 events, with the second event having the additional "type" data:
account#200501163223
Outbox:name # 2020/05/01-17:32:16.412000
"name1"
Outbox:name # 2020/05/01-16:41:49.093000
"name2"
Outbox:message # 2020/05/01-17:32:16.412000
"msg1"
Outbox:message # 2020/05/01-16:41:49.093000
"msg2"
Outbox:type # 2020/05/01-16:35:09.839000
"temp"
When I query this row key using the Python Bigtable library, I get back a dictionary with my column names as keys and the data as a list of Cell objects.
"name" and "message" key would have 2 objects, and "type" would only have one object since it was only part of the second event.
My question is: how do I know which event, 1 or 2, the "type" value of "temp" belongs to? Is this model just wrong, so that I have to ensure only one event is stored under a row key (which would be hard to do), or is there a trick I'm missing in the library that lets me associate the event data correctly?
This is a great question tasha, and something I've come across before too, so thanks for asking it.
In Bigtable, there is no concept of columns being connected because they came from the same write. This gives a lot of flexibility in what you can do with the various columns and versions, which is very helpful to some people, but in your case it causes this issue.
The best way to handle this is with 2 steps.
Make sure that each time you write to a row, you use the same timestamp for every cell in that write. That would look like this:
import datetime

# use one timestamp for every cell in this write so the cells stay associated
timestamp = datetime.datetime.utcnow()

row_key = "account#200501163223"
row = table.direct_row(row_key)
row.set_cell(column_family_id,
             "name",
             "name1",
             timestamp)
row.set_cell(column_family_id,
             "type",
             "temp",
             timestamp)
row.commit()
Then, when you are querying your database, you can apply a filter to get only the latest version (or the latest N versions), or do a scan based on timestamp ranges.
from google.cloud.bigtable import row_filters  # for the read filters
rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(2))
Here are a few snippets with examples on how to use a filter with Bigtable reads. They should be added to the documentation soon.
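To make the grouping concrete, here is a rough sketch of how the cells read back could be regrouped into events by their shared timestamp (assuming the writes used one timestamp per event as above; partial_rows stands for the result of table.read_rows(...) and the names are illustrative):
from collections import defaultdict

# row.cells maps column family -> column qualifier -> list of Cell objects
events = defaultdict(dict)
for row in partial_rows:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            for cell in cells:
                # cells written together share the same timestamp,
                # so the timestamp identifies the event
                events[cell.timestamp][qualifier.decode()] = cell.value.decode()

for ts, event in sorted(events.items()):
    print(ts, event)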
I have implemented a program in VBA for Excel that generates automatic communications based on user inputs (selections of cells).
The macro makes extensive use of VBA's ListObject feature, e.g.:
defining a table (ListObject):
Dim ClsSht As Worksheet
Set ClsSht = ThisWorkbook.Sheets("paragraph texts")
Dim ClsTbl As ListObject
Set ClsTbl = ClsSht.ListObjects(1)
and accessing the table in the code in a very readable manner, ClsTbl now being the table I want to pick data from:
myvariable = ClsTbl.ListColumns("D1").DataBodyRange.Item(34).Value
This means myvariable is item (row) 34 of the data body of column "D1" of the table ClsTbl.
I decided to learn Python to "translate" all that code into Python and make a Django-based program accessible to anyone.
I am a beginner in Python, and I am wondering what the equivalent of VBA's ListObject would be in Python. This decision will shape my whole Python program from the beginning, and I am hesitating a lot over it.
The main idea is to find a way to access table data in a readable way,
i.e. give me the value of column "text" where column "chapter" is 3 and column "paragraph" is "2". The values are unique, meaning there is only one value in the "text" column where that occurs.
Some observations:
I know everything can be done with lists in Python (lists can contain lists that can contain lists...), but this is terrible for readability: mylist1[2][3] (assuming, for instance, that every row is a list of values and the whole table is a list of such rows).
I don't consider building a database an option. There are multiple relatively small tables (from 10 to 500 rows and from 3 to 15 columns) that are related, but not in a database manner. A database would force me to learn yet another language (SQL or similar), and I have more than enough with Python and Django.
The user modifies the structure of many tables (chapters being merged or split).
The data is 100% strings. The only integers are numbers used to sort the text. I don't perform any mathematical operations on the values; I simply concatenate pieces of text and make replacements in texts.
The tables will be loaded into Python as CSV text files.
Please let me know if anything in the question is not clear enough and I will expand it.
Would it be necessary to work with NumPy or pandas, i.e. something that lets me ask for the value of a cell by column names?
A pandas DataFrame should provide everything you need, i.e. conversion to strings, manipulation, import, and export. As a start, try:
import pandas as pd
df = pd.read_csv('your_file.csv')
print(df)
print(df['text'])
The entries of the first row will be converted to labels of the DataFrame columns.
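To address the lookup described in the question (the value of "text" where "chapter" is 3 and "paragraph" is "2"), something along these lines should work; a small sketch assuming the CSV has columns named chapter, paragraph, and text, and that everything is read as strings:
import pandas as pd

# read everything as strings, matching the "data is 100% strings" constraint
df = pd.read_csv('your_file.csv', dtype=str)

# select the single row where both conditions hold, then take its "text" value
match = df[(df['chapter'] == '3') & (df['paragraph'] == '2')]
text_value = match['text'].iloc[0]
print(text_value)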
I have a dataset (CSV file), which I load into Pandas, to be parsed.
Among various fields, there is one that is a mashed-up string of various pieces of info that I would like to separate into their own fields. I did try to use split on that field, but I also end up carrying over data that I don't care about.
Is there a way to parse a specific field for every line in the pandas dataset and say something like: if the string contains X, then put X in a new field named "car brand", for example?
This is an example of what my mashed-up field looks like:
Compound X222-12 Mixed 23.3
Compound 13.2 Single AP 128-A
Element X221-X 55.1 Mixed
Compound 720TC-1 Single RS 69.5
Element F332-2 Double 2.7
Each item is separated by a space, and the final outcome I would like is to get all this data into proper fields, so it will look like this:
itemtype itemcode extra_info frequency_value
Compound X222-12 Mixed 23.3
Compound 128-A Single AP 13.2
Element X221-X Mixed 55.1
Compound 720TC-1 Single RS 69.5
Element F332-2 Double 2.7
The issue I am having is that the fields are not always in the same place, so the frequency may come before the itemcode, or the extra info may consist of more than one word.
I tried this to split it, but it didn't work out:
df = pd.read_csv('myfile.csv')
df['splitdata1'], df['splitdata2'], df['splitdata3'], df['splitdata4'] = df['content'].str.split(' ')
print (df)
Am I using Pandas incorrectly? Should I manage the strings before parsing the CSV file?
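For what it's worth, here is a rough sketch of one way the field could be parsed token by token rather than by position; it assumes (based on the sample above) that the itemcode always contains a dash, the frequency is the only purely numeric token, the itemtype is the first word, and everything left over is extra_info:
import pandas as pd

def parse_content(text):
    """Split one mashed-up value into its parts, token by token."""
    tokens = text.split()
    itemtype = tokens[0]                     # first word: Compound / Element
    itemcode = None
    frequency = None
    extras = []
    for tok in tokens[1:]:
        try:
            frequency = float(tok)           # the only purely numeric token
        except ValueError:
            if '-' in tok and itemcode is None:
                itemcode = tok               # assumed: codes always contain a dash
            else:
                extras.append(tok)           # everything else is extra info
    return pd.Series({'itemtype': itemtype,
                      'itemcode': itemcode,
                      'extra_info': ' '.join(extras),
                      'frequency_value': frequency})

df = pd.DataFrame({'content': ['Compound X222-12 Mixed 23.3',
                               'Compound 13.2 Single AP 128-A']})
print(pd.concat([df, df['content'].apply(parse_content)], axis=1))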
I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as an ndarray, but I'd really like to be able to access the outputs based on the parameter values, not just the indices. For example, rather than accessing the values as table[16][5][17][14], I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo

# connect to the local server, then pick a database and collection
data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]

for entry in data.find():   # this will return all documents
    print(entry["param1"])  # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, then you could use a nested Python dictionary instead of an ndarray and serialize it to a .json text file using the json module.
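For instance, something like this rough sketch (the parameter names and values are just placeholders from the question):
import json

table = {}
# build the nested structure while looping over the parameter combinations
table.setdefault(45, {}).setdefault(170, {}).setdefault(17, {})[0.37] = 0.123  # model output

# save and reload; note that JSON converts the numeric keys to strings
with open('lookup_table.json', 'w') as f:
    json.dump(table, f)

with open('lookup_table.json') as f:
    reloaded = json.load(f)
print(reloaded['45']['170']['17']['0.37'])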
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}  # maps each parameter value to its index along that axis
solar_a_dict = {...}
...
def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also build the index expression as a string and eval it, if you want some of the fields to be given as "None" and translated to ":" (to give the full table for that variable).
For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
import numpy as np
from sys import argv
dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = np.loadtxt(argv[1], dtype=dt)
Now you can access the data columns by name, e.g. data['T'], data['L'], data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
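For a quick self-contained illustration of the structured-array idea (the values here are made up):
import numpy as np

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = np.array([(1.0, 300.0, 0.5, 0.01),
                 (2.0, 310.0, 0.6, 0.02)], dtype=dt)

print(data['T'])        # the whole T column: [300. 310.]
print(data[0]['NMSF'])  # the NMSF value of the first record: 0.5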
I am trying to load data from SQL into NoSQL, i.e. Cassandra, but somehow a few rows are not matching. Can somebody tell me how to count the number of row keys for a particular column family in Cassandra?
I tried get_count and get_multicount, but these methods require keys to be passed. In my case I do not know the keys; instead, I need the total count of row keys.
list column_family_name gives me the list, but limited to only 100 rows. Is there any way I can override the 100-row limit?
As far as I know, there is no way to get a row count for a column family. You have to perform a range query over the whole column family instead.
If cf is your column family, something like this should work:
num_rows = len(list(cf.get_range()))
However, the documentation for get_range indicates that this might cause issues if you have too many rows. You might have to do it in chunks, using start and row_count.
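As a rough sketch (assuming pycassa, which the get_count/get_range method names suggest), you can avoid materializing the whole list and just count while iterating, since get_range pages through the rows internally; the keyspace, host, and column family names are placeholders:
import pycassa

# connect and open the column family (cluster/keyspace names are assumptions)
pool = pycassa.ConnectionPool('my_keyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'my_column_family')

# iterate over the range query lazily and count, instead of building a list
num_rows = sum(1 for _ in cf.get_range())
print(num_rows)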
You can count Cassandra rows without reading all of them.
See the implementation of cassandraCount() in the Spark Cassandra connector, which does this quite efficiently.