How can I access objects of an array in a query? - python

I am currently working on a project for my school. The goal is to analyze log files that are generated by salt-stack. I have already set up a MongoDB and the MONGO_FUTURE_RETURN returner in salt-stack.
I want to analyze the log files via a Python script. Everything is connected and basic queries such as db.saltReturns.find() work fine.
Since I do not want to rewrite the returner and my program needs to access different objects, I need something like "Object.Everything.nestedObject".
To clarify what I mean, I have attached a snippet of data and will show you what I want to access.
I have already tried $, $each and $[], but none of them solved my problem.
"fun_args" : [ ],
"jid" : "20190423135733454092",
"return" : {
    "cmd_|-sssd-ldap-cmd-pam-auth-update-bugfix_|-/usr/local/bin/bugfix-682662-sh_|-wait" : {
        "comment" : "No changes detected",
        "start_time" : "13:58:22.852410",
        "result" : true,
        "duration" : 0.016,
        "__run_num__" : 26,
        "__sls__" : "sssd-ldap.install",
        "changes" : {
        }
    },
    "pkg_|-salt-minion-required-packages_|-salt-minion-required-packages_|-installed" : {
        "comment" : "All specified packages are already installed",
        "name" : "python-concurrent.futures",
        "start_time" : "13:58:18.915102",
        "result" : true,
        "duration" : 24.703,
        "__run_num__" : 5,
        "__sls__" : "salt-minion.install",
        "changes" : {
        },
        "__id__" : "salt-minion-required-packages"
    }
    ...
}
In my script I would like to access:
"full_ret.return.[all].comment"
All I need is a positional operator that can replace the [all] placeholder.

I have searched many forums and documentations but could not find a wildcard operator for mongoDB queries. Sadly this seems to be a feature that will not be implemented.
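Since there is no wildcard for field names, one possible workaround is to fetch the documents and walk over the keys on the client side. The following is only a minimal sketch, assuming each document is shaped like the snippet above (that is, full_ret.return is an object keyed by state ID). The function name collect_comments and the sample document are made up for illustration.

```python
def collect_comments(doc):
    """Map each state ID under full_ret.return to its "comment" value."""
    states = doc.get("full_ret", {}).get("return", {})
    # Assumption: some returns may hold a list or string instead of an object; skip those.
    if not isinstance(states, dict):
        return {}
    return {
        state_id: state.get("comment")
        for state_id, state in states.items()
        if isinstance(state, dict)
    }


# Hypothetical document shaped like the snippet above
sample = {
    "full_ret": {
        "jid": "20190423135733454092",
        "return": {
            "cmd_|-example-state_|-/bin/true_|-wait": {
                "comment": "No changes detected",
                "result": True,
            },
            "pkg_|-example-pkgs_|-example-pkgs_|-installed": {
                "comment": "All specified packages are already installed",
                "result": True,
            },
        },
    }
}

for state_id, comment in collect_comments(sample).items():
    print(state_id, "->", comment)
```

With a live connection, the same helper could be applied to every document returned by db.saltReturns.find(). If the reshaping has to happen on the server instead, the aggregation operator $objectToArray (MongoDB 3.4.4 and later) can turn the object into an array of key/value pairs that ordinary dotted paths can then reach.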

Related

Query for a specific item in firebase from python

I am getting this error
Index not defined, add ".indexOn": "Number", for path "/public_access", to the rules
and have no idea what the issue is. All I'd like to do is check whether the entry exists in my db, to see if the tag has already been used.
My rules look like this
{
"rules": {
"public_access": {
"all_users": {
".indexOn": "Numbers"
},
".read": "auth.uid === 'regular_user'",
".write": true,
},
"connection_testing": {
".read": true,
".write": "auth.uid === 'admin_user'"
},
"login_data": {
".read": true,
".write": true
},
}
}
And the python code I am trying to use is this
ref = db.reference('/public_access/all_users')
snapshot = ref.order_by_child('Number').equal_to('AAAAAA').get()
But it just errors, I've tried copying so many different examples and I've been stuck on it for hours. I know the solution will be something stupidly simple but I just can't apply it to what I want to do.
I don't need the number value at all, it was just an attempt to get it to find 'AAAAAA'.
The end goal is to see if AAAAAA exists by searching for it, and I have not been successful at achieving this
You don't need order_by_child or database rules here. If you want to search for "AAAAAA", just access it by the child method:
ref = db.reference('/public_access/all_users')
data = ref.child('AAAAAA').get()
print(data['Date_of_join']) # '12/34/34'
print(data['Number']) # 2
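For the stated end goal (checking whether AAAAAA exists at all), the lookup above can be wrapped in a small helper. This is a sketch under the assumption that get() returns None when nothing is stored at the location, which is how firebase_admin references behave as far as I know. The names tag_exists and FakeReference are invented here; FakeReference is only a stand-in so the helper can be tried without a Firebase project.

```python
def tag_exists(ref, tag):
    """Return True if a child named `tag` holds any data under `ref`."""
    # Assumption: get() yields None when the location is empty.
    return ref.child(tag).get() is not None


class FakeReference:
    """Offline stand-in that mimics the child()/get() calls used above."""

    def __init__(self, data):
        self._data = data

    def child(self, key):
        value = self._data.get(key) if isinstance(self._data, dict) else None
        return FakeReference(value)

    def get(self):
        return self._data


users = FakeReference({"AAAAAA": {"Date_of_join": "12/34/34", "Number": 2}})
print(tag_exists(users, "AAAAAA"))  # True
print(tag_exists(users, "BBBBBB"))  # False
```

In the real script, the object returned by db.reference('/public_access/all_users') would take the place of the stand-in.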

Elasticsearch prevent indexing of Markdown hyperlinks

I am building a Markdown file content search using Elasticsearch. Currently the whole content of the MD file is indexed in Elasticsearch. The problem is that the search results show links like [Mylink](https://link-url-here.org), [Mylink2](another_page.md).
I would like to prevent indexing of hyperlinks and references to other pages. When someone searches for "Mylink", it should only return the text without the URL. It would be great if someone could help me with the right solution for this.
You need to render Markdown in your indexing application, then remove HTML tags and save the result alongside the Markdown source.
I think you have two main solutions for this problem.
First: clean the data in your source code before indexing it into Elasticsearch.
Second: use an Elasticsearch filter to clean the data for you.
The first solution is the easy one, but if you need to do this inside Elasticsearch, you need to create an ingest pipeline.
Then you can use the Script processor to clean the data with a Painless script that finds the links with a regex and removes them.
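The first option (cleaning the text in your own code before indexing) could look roughly like the sketch below. It assumes only inline links of the form [text](target) need handling; images and reference-style links are left alone. The names MD_LINK and strip_markdown_links are made up for this example.

```python
import re

# Inline Markdown links: [text](target). Nested brackets or parentheses are not supported.
MD_LINK = re.compile(r"\[([^\]\[]+)\]\(([^()]+)\)")


def strip_markdown_links(text):
    """Replace every inline link with its visible text only."""
    return MD_LINK.sub(r"\1", text)


print(strip_markdown_links("[Mylink](https://link-url-here.org), [Mylink2](another_page.md)"))
# Mylink, Mylink2
```

The cleaned string can then be indexed instead of (or next to) the raw Markdown.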
You could use an ingest pipeline with a script processor to extract the link text:
1. Set up the pipeline
PUT _ingest/pipeline/clean_links
{
"description": "...",
"processors": [
{
"script": {
"source": """
if (ctx["content"] == null) {
// nothing to do here
return
}
def content = ctx["content"];
Pattern pattern = /\[([^\]\[]+)\](\(((?:[^\()]+)+)\))/;
Matcher matcher = pattern.matcher(content);
def purged_content = matcher.replaceAll("$1");
ctx["purged_content"] = purged_content;
"""
}
}
]
}
The regex can be tested here and is inspired by this.
2. Include the pipeline when ingesting the docs
POST my-index/_doc?pipeline=clean_links
{
"content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
}
POST my-index/_doc?pipeline=clean_links
{
"content": "[Mylink2](another_page.md)"
}
The python docs are here.
3. Verify
GET my-index/_search?filter_path=hits.hits._source
should yield
{
"hits" : {
"hits" : [
{
"_source" : {
"purged_content" : "Mylink anotherLink",
"content" : "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
}
},
{
"_source" : {
"purged_content" : "Mylink2",
"content" : "[Mylink2](another_page.md)"
}
}
]
}
}
You could instead replace the original content if you want to fully discard them from your _source.
In contrast, you could go a step further in the other direction and store the text + link pairs in a nested field of the form:
{
"content": "...",
"links": [
{
"text": "Mylink",
"href": "https://link-url-here.org"
},
...
]
}
so that when you later decide to make them searchable, you'll be able to do so with precision.
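If the documents are prepared in Python before they are sent to Elasticsearch, the text + link pairs could be collected with a regex along these lines. This is a sketch, not a full Markdown parser: it assumes simple inline links, and build_document is a hypothetical helper name.

```python
import re

# Inline Markdown links: [text](target). Nested brackets or parentheses are not supported.
MD_LINK = re.compile(r"\[([^\]\[]+)\]\(([^()]+)\)")


def build_document(content):
    """Return the original content plus a list of {"text", "href"} pairs."""
    links = [
        {"text": match.group(1), "href": match.group(2)}
        for match in MD_LINK.finditer(content)
    ]
    return {"content": content, "links": links}


doc = build_document("[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)")
print(doc["links"])
```

For the pairs to be queried as units later, the links field would presumably need to be mapped as a nested type in the index.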
Shameless plug: you can find other hands-on ingestion guides in my Elasticsearch Handbook.

Access GlobalParameters in Azure ML Python script

How can one access the global parameters ("GlobalParameters") sent from a web service in a Python script on Azure ML?
I tried:
if 'GlobalParameters' in globals():
    myparam = GlobalParameters['myparam']
but with no success.
EDIT: Example
In my case, I'm sending a sound file over the web service (as a list of samples). I would also like to send a sample rate and the number of bits per sample. I've successfully configured the web service (I think) to take these parameters, so the GlobalParameters now look like:
"GlobalParameters": {
"sampleRate": "44100",
"bitsPerSample": "16",
}
However, I cannot access these variables from the Python script, neither as GlobalParameters["sampleRate"] nor as sampleRate. Is it possible? Where are they stored?
Based on our understanding of your question, there may be a misconception here: Azure ML parameters are not “Global Parameters”; as a matter of fact, they are just parameter substitutions tied to a particular module. So in effect there are no global parameters that are accessible throughout the experiment you have mentioned. That being the case, we think the experiment below accomplishes what you are asking for:
Please add an “Enter Data” module to the experiment and add data in CSV format. Then, for the Data field, click the parameter to create a web service parameter. Add the CSV data, which will be replaced by the data passed from the client application.
Please add an “Execute Python” module and hook up the “Enter Data” output to the “Execute Python” input1. Add Python code that takes dataframe1 and copies it into a Python list. Once you have it in a list, you can use it anywhere in your Python code.
Python code snippet
import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    global_list = []
    for g in dataframe1["Col3"]:
        global_list.append(g)
    df_global = pd.DataFrame(global_list)
    print('Input pandas.DataFrame:\r\n\r\n{0}'.format(df_global))
    return [df_global]
Once you publish your experiment, you can pass new values in the “Data”: “” section below; they are substituted for the “Enter Data” values in the experiment.
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Col1", "Col2", "Col3"],
            "Values": [ [ "0", "value", "0" ], [ "0", "value", "0" ], ]
        },
    },
    "GlobalParameters": {
        "Data": "1,sampleRate,44500\\n2,bitsPerSample,20",
    }
}
Please feel free to let us know if this makes sense.
The GlobalParameters parameter cannot be used in a Python script. It is used to override certain parameters in other modules.
If you, for example, take the 'Split Data' module, you'll find an option to turn a parameter into a web service parameter:
Once you click that, a new section appears titled "Web Service Parameters". There you can change the default parameter name to one of your choosing.
If you deploy your project as a web service, you can override that parameter by putting it in the GlobalParameters parameter:
"GlobalParameters": {
"myFraction": 0.7
}
I hope that clears things up a bit.
Although it is not possible to use GlobalParameters in the Python script (see my previous answer), you can however hack/abuse the second input of the Python script to pass in other parameters. In my example I call them metadata parameters.
To start, I added:
a Web service input module with name: "realdata" (for your real data, of course)
a Web service input module with name: "metadata" (we will abuse this one to pass parameters to our Python).
a Web service output module with name: "computedMetadata"
Connect the modules as follows:
As you can see, I also added a real data set (Restaurant ratings) as well as a dummy metadata CSV (the Enter Data Manually module).
In this manual data you will have to predefine your metadata parameters as if they were a CSV with a header and only a single row to hold the data:
In the example both sampleRate and bitsPerSample are set to 0.
My Python script then takes in that fake CSV as metadata, does some dummy calculation with it and returns the result in a DataFrame:
import pandas as pd

def azureml_main(realdata = None, metadata = None):
    theSum = metadata["sampleRate"][0] + metadata["bitsPerSample"][0]
    outputString = "The sum of the sampleRate and the bitsPerSecond is " + str(theSum)
    print(outputString)
    return pd.DataFrame([outputString])
I then published this as a web service and called it using Node.js like this:
httpreq.post('https://ussouthcentral.services.azureml.net/workspaces/xxx/services/xxx', {
headers: {
Authorization: 'Bearer xxx'
},
json: {
"Inputs": {
"realdata": {
"ColumnNames": [
"userID",
"placeID",
"rating"
],
"Values": [
[
"100",
"101",
"102"
],
[
"200",
"201",
"202"
]
]
},
"metadata": {
"ColumnNames": [
"sampleRate",
"bitsPerSample"
],
"Values": [
[
44100,
16
]
]
}
},
"GlobalParameters": {}
}
}, (err, res) => {
if(err) return console.log(err);
console.log(JSON.parse(res.body));
});
The output was as expected:
{ Results:
{ computedMetadata:
{ type: 'table',
value:
{ ColumnNames: [ '0' ],
ColumnTypes: [ 'String' ],
Values:
[ [ 'The sum of the sampleRate and the bitsPerSecond is 44116' ] ] } } } }
Good luck!
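For a Python client, the same request body could be assembled as sketched below. The input names realdata and metadata and the column names are taken from the example above; the helper name build_request_body is made up, and the actual HTTP call is only hinted at in a comment because the URL and token are placeholders.

```python
import json


def build_request_body(realdata_rows, sample_rate, bits_per_sample):
    """Build the JSON body with the parameters smuggled in via the "metadata" input."""
    return {
        "Inputs": {
            "realdata": {
                "ColumnNames": ["userID", "placeID", "rating"],
                "Values": realdata_rows,
            },
            "metadata": {
                "ColumnNames": ["sampleRate", "bitsPerSample"],
                # A single row holds all metadata parameters
                "Values": [[sample_rate, bits_per_sample]],
            },
        },
        "GlobalParameters": {},
    }


body = build_request_body(
    [["100", "101", "102"], ["200", "201", "202"]],
    sample_rate=44100,
    bits_per_sample=16,
)
payload = json.dumps(body).encode("utf-8")
# The payload would then be POSTed to the service URL with an
# "Authorization: Bearer xxx" header, e.g. via urllib.request.
print(payload.decode("utf-8"))
```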

Elasticsearch relative time range query in Python

I have searched and searched but cannot find an answer for this. I am new to using Elasticsearch with Python and trying to do a simple Python query against my Elasticsearch index which will return a count of the results matching a specific set of criteria in the past hour. I'm getting all the results back using the following (sanitized) code:
hits = es.count(index='myindex-*',q=thing.rstrip() )
Simple enough right? So is there a way to include a relative time range in this query, or do I need to write some Python to figure out the times to insert as a time range?
Thanks in advance for the help!
Yes, all you need is a time-based key in your index, which you then query with:
{
"query" : {
"range" : {
"<time_based_key>" : {
"gte" : "now-1h"
}
}
}
}
To define your time-based key:
curl -XPUT localhost:9200/<index>/<type>/_mapping?pretty -d '
{
    "<type>" : {
        "properties": {
            "<time_based_key>" : {
                "type" : "date",
                "index": "not_analyzed"
            }
        }
    }
}'
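From Python, the range query above can be combined with the original query string in one request body. The sketch below is an assumption-laden example: build_count_body is a made-up helper, "@timestamp" is only a guess at the name of the time-based key, and whether the client expects body=... or query=... depends on the elasticsearch-py version in use.

```python
def build_count_body(query_text, time_field, window="1h"):
    """Combine a Lucene query string with a relative time range filter."""
    return {
        "query": {
            "bool": {
                # Same syntax as the q= parameter used in the question
                "must": {"query_string": {"query": query_text}},
                # Relative date math is resolved by Elasticsearch itself
                "filter": {"range": {time_field: {"gte": "now-" + window}}},
            }
        }
    }


body = build_count_body("status:error", "@timestamp")
print(body)
# With a connected client, something like:
# hits = es.count(index='myindex-*', body=body)
```

Because "now-1h" is evaluated on the server, no time arithmetic is needed in the Python script.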

How to distribute MongoDB test data over VCS?

I'm working on a Python/MongoDB project on both my computer at home and my laptop.
The schema in document stores, naturally, is best represented by the data itself - and that's why I want to distribute my test data over Mercurial, together with the code itself.
Would the best way be to simply dump the BSONs in a file and add it to the Mercurial repository?
Dumping BSON and putting it into a VCS would not make much sense, since it's a binary file and can't be viewed easily.
You can export a collection to JSON by using the mongoexport tool. You can even pass it a query filter to limit the number of exported documents.
Here's an example (reformatted for readability):
sergio#soviet-russia$ mongoexport -d test -c geo \
    -q '{"_id": ObjectId("4efa5f7d8840e680c850cd94") }'
connected to: 127.0.0.1
{ "_id" : { "$oid" : "4efa5f7d8840e680c850cd94" },
"longg" : [ { "start" : 322815488, "end" : 322817535 },
{ "start" : 822815488, "end" : 822817535 } ],
"m" : "Cracow",
"postal" : 55050,
"lat" : [ "XX.89XXX", "XX.74XXX" ] }
exported 1 records
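On the other machine, a test can read the exported file back in. The sketch below assumes the file was written by mongoexport in its default layout of one JSON document per line; load_fixture and the file name in the comment are hypothetical.

```python
import json


def load_fixture(path):
    """Read a mongoexport file that holds one JSON document per line."""
    docs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                docs.append(json.loads(line))
    return docs


# Example use (hypothetical file name):
# docs = load_fixture("fixtures/geo.json")
# Values such as {"$oid": "..."} stay plain dicts here; converting them back
# to ObjectId needs a BSON-aware loader such as bson.json_util from pymongo.
```

Because the export is plain text, Mercurial can diff and merge it like any other source file.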
