ElasticSearch and Python: issue with the search function

I'm trying to use Elasticsearch 6.4 for the first time with an existing web application written in Python/Django. I'm running into some issues and would like to understand why they happen and how to solve them.
###########
# Existing : #
###########
In my application, users can upload document files (.pdf or .doc, for example). The application also has a search feature that searches over the documents indexed by Elasticsearch when they are uploaded.
Document titles always follow the same pattern:
YEAR - DOC_TYPE - ORGANISATION - document_title.extension
For example:
1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf
The search is always restricted to doc_type = ANNUAL_REPORT, because there are several doc_types (ANNUAL_REPORT, OTHERS, ...).
##################
# My environment : #
##################
Here is some information about my Elasticsearch setup (I'm still learning the ES commands):
$ curl -XGET http://127.0.0.1:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open app 5T0HZTbmQU2-ZNJXlNb-zg 5 1 742 2 396.4kb 396.4kb
So my index is app
For the above example, if I fetch this document, 1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf, I get:
$ curl -XGET http://127.0.0.1:9200/app/annual-report/1343?pretty
{
  "_index" : "app",
  "_type" : "annual-report",
  "_id" : "1343",
  "_version" : 33,
  "found" : true,
  "_source" : {
    "attachment" : {
      "date" : "2010-03-04T12:08:00Z",
      "content_type" : "application/pdf",
      "author" : "manshanden",
      "language" : "et",
      "title" : "Microsoft Word - Test document Word.doc",
      "content" : "some text ...",
      "content_length" : 3926
    },
    "relative_path" : "app_docs/APP-TEST/1970_ANNUAL_REPORT_APP-TEST_1342.pdf",
    "title" : "1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf"
  }
}
Now, from the search feature of my web application, I would like to find this document by searching for: 1970.
def search_in_annual(self, q):
    try:
        response = self.es.search(
            index='app', doc_type='annual-report',
            q=q, _source_exclude=['data'], size=5000)
    except ConnectionError:
        return -1, None

    total = 0
    hits = []
    if response:
        for hit in response["hits"]["hits"]:
            hits.append({
                'id': hit['_id'],
                'title': hit['_source']['title'],
                'file': hit['_source']['relative_path'],
            })
        total = response["hits"]["total"]
    return total, hits
But when q=1970, the result is 0
If I write:
response = self.es.search(
    index='app', doc_type='annual-report',
    q="q*", _source_exclude=['data'], size=5000)
It returns my document, but also many documents that have no 1970 in the title or in the document content.
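A useful way to see why q=1970 misses the title is to ask Elasticsearch how it tokenizes that title. A small diagnostic sketch (not part of the application; it assumes the title field uses the default standard analyzer):

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# With the standard analyzer, underscores do not split tokens, so the title
# yields a token like "1970_annual_report_app" rather than a standalone
# "1970", which is why q=1970 finds nothing in the title.
result = es.indices.analyze(
    index='app',
    body={'text': '1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf'}
)
print([token['token'] for token in result['tokens']])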
#################
# My global code : #
#################
This is the class that manages the indexing functions:
# Imports needed by this class (Document is the application's Django model)
from base64 import b64encode

from elasticsearch import Elasticsearch, helpers
from elasticsearch.helpers import BulkIndexError


class EdqmES(object):
    host = 'localhost'
    port = 9200
    es = None

    def __init__(self, *args, **kwargs):
        self.host = kwargs.pop('host', self.host)
        self.port = kwargs.pop('port', self.port)
        # Connect to the Elasticsearch server
        self.es = Elasticsearch([{
            'host': self.host,
            'port': self.port
        }])

    def __str__(self):
        return self.host + ':' + str(self.port)

    @staticmethod
    def file_encode(filename):
        with open(filename, "rb") as f:
            return b64encode(f.read()).decode('utf-8')

    def create_pipeline(self):
        body = {
            "description": "Extract attachment information",
            "processors": [
                {"attachment": {
                    "field": "data",
                    "target_field": "attachment",
                    "indexed_chars": -1
                }},
                {"remove": {"field": "data"}}
            ]
        }
        self.es.index(
            index='_ingest',
            doc_type='pipeline',
            id='attachment',
            body=body
        )

    def index_document(self, doc, bulk=False):
        filename = doc.get_filename()
        try:
            data = self.file_encode(filename)
        except IOError:
            data = ''
            print('ERROR with ' + filename)
            # TODO: log error

        item_body = {
            '_id': doc.id,
            'data': data,
            'relative_path': str(doc.file),
            'title': doc.title,
        }
        if bulk:
            return item_body

        result1 = self.es.index(
            index='app', doc_type='annual-report',
            id=doc.id,
            pipeline='attachment',
            body=item_body,
            request_timeout=60
        )
        print(result1)
        return result1

    def index_annual_reports(self):
        list_docs = Document.objects.filter(category=Document.OPT_ANNUAL)
        print(list_docs.count())
        self.create_pipeline()

        bulk = []
        inserted = 0
        for doc in list_docs:
            inserted += 1
            bulk.append(self.index_document(doc, True))
            if inserted == 20:
                inserted = 0
                try:
                    print(helpers.bulk(self.es, bulk, index='app',
                                       doc_type='annual-report',
                                       pipeline='attachment',
                                       request_timeout=60))
                except BulkIndexError as err:
                    print(err)
                bulk = []
        if inserted:
            print(helpers.bulk(
                self.es, bulk, index='app',
                doc_type='annual-report',
                pipeline='attachment', request_timeout=60))
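As a side note, create_pipeline above registers the ingest pipeline by hand-building the /_ingest/pipeline/attachment URL through es.index. The Python client also exposes a dedicated ingest API; a small sketch of the equivalent call inside create_pipeline (same pipeline definition, just the explicit API):

    def create_pipeline(self):
        # Sketch: same pipeline as above, registered via the ingest namespace
        self.es.ingest.put_pipeline(
            id='attachment',
            body={
                "description": "Extract attachment information",
                "processors": [
                    {"attachment": {
                        "field": "data",
                        "target_field": "attachment",
                        "indexed_chars": -1
                    }},
                    {"remove": {"field": "data"}}
                ]
            }
        )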
My document is indexed when it is submitted through a Django form, thanks to a signal:
@receiver(signals.post_save, sender=Document, dispatch_uid='add_new_doc')
def add_document_handler(sender, instance=None, created=False, **kwargs):
    """ When a document is created, index the new annual report (only) with Elasticsearch and update the
    conformity date if the document is a new declaration of conformity.

    :param sender: Class which is concerned
    :type sender: the model class
    :param instance: Object which was just saved
    :type instance: model instance
    :param created: True for a creation, False for an update
    :type created: boolean
    :param kwargs: Additional parameters of the signal
    :type kwargs: dict
    """
    if not created:
        return
    # Index only annual reports
    elif instance.category == Document.OPT_ANNUAL:
        es = EdqmES()
        es.index_document(instance)

This is what I've done and it seems to work:
def search_in_annual(self, q):
    try:
        response = self.es.search(
            index='app', doc_type='annual-report',
            q=q, _source_exclude=['data'], size=5000)
        if response['hits']['total'] == 0:
            response = self.es.search(
                index='app', doc_type='annual-report',
                body={
                    "query": {"prefix": {"title": q}},
                }, _source_exclude=['data'], size=5000)
    except ConnectionError:
        return -1, None

    total = 0
    hits = []
    if response:
        for hit in response["hits"]["hits"]:
            hits.append({
                'id': hit['_id'],
                'title': hit['_source']['title'],
                'file': hit['_source']['relative_path'],
            })
        total = response["hits"]["total"]
    return total, hits
It lets me search over the title (with a prefix fallback) and the content to find my document.
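If a single round trip is preferred over the two searches above, the keyword query and the title prefix can also be combined in one bool query. This is a sketch of my own variant (same index, doc_type and _source handling as the original), not the author's code:

def search_in_annual(self, q):
    # 'should' with minimum_should_match=1 means: match the free-text query
    # OR the title prefix (or both).
    body = {
        "query": {
            "bool": {
                "should": [
                    {"query_string": {"query": q}},
                    {"prefix": {"title": q}}
                ],
                "minimum_should_match": 1
            }
        }
    }
    try:
        response = self.es.search(
            index='app', doc_type='annual-report',
            body=body, _source_exclude=['data'], size=5000)
    except ConnectionError:
        return -1, None

    hits = [{'id': hit['_id'],
             'title': hit['_source']['title'],
             'file': hit['_source']['relative_path']}
            for hit in response['hits']['hits']]
    return response['hits']['total'], hits

Note that the prefix query is not analyzed, so the query text may need to be lowercased to match the lowercased tokens in the index.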


Python Elasticsearch not returning the same number of results on each run

I have to run a query on Elasticsearch with Python to extract a large volume of data. I'm using elasticsearch 7.9.1 with Python 3.
The problem is that my script doesn't return the same number of rows from one run to another.
Sometimes I get ~300,000 results, sometimes more (1 million), sometimes zero. It seems to happen especially when two runs are close together. I'm not changing the query or the time range in which to search.
I noticed that changing the scroll argument of the es.search method seems to change the behaviour of the script.
With scroll='1s', for example, page['hits']['hits'] is empty, which doesn't really make sense to me.
Here is the class I'm using in my script:
class ElasticsearchFinder():
    def __init__(self, cfg):
        logger.info('ElasticsearchFinder.__init__ : initiate parameters')
        try:
            # ######## ES configuration #######
            self.port = cfg.get('Elasticsearch', 'port')
            self.hostnames = cfg.get('Elasticsearch', 'hostnames')
            self.username = cfg.get('Elasticsearch', 'username')
            self.password = cfg.get('Elasticsearch', 'password')
            self.index = cfg.get('Elasticsearch', 'index')
            self.delay = cfg.get('Elasticsearch', 'delay')
            self.folder = cfg.get('Elasticsearch', 'root_csv_folder')
            self.filename = cfg.get('Elasticsearch', 'csv_filename')

            self.ssl_es_context = create_ssl_context()
            self.ssl_es_context.check_hostname = False
            self.ssl_es_context.verify_mode = ssl.CERT_NONE

            try:
                self.start_time_conf = cfg.get('Elasticsearch', 'start_time')
            except:
                self.start_time_conf = None
            try:
                self.end_time_conf = cfg.get('Elasticsearch', 'end_time')
            except:
                self.end_time_conf = None
        except Exception as e:
            logger.error('ElasticsearchFinder.__init__ : initiate parameters failed, please verify fetch_qradar.conf : %s', str(e))

        now = datetime.datetime.now()
        if self.delay != "0":
            start_date = now - datetime.timedelta(hours=int(self.delay))
        else:
            start_date = now - datetime.timedelta(weeks=52)
        self.end_time = now.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
        self.start_time = start_date.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
        if self.start_time_conf:
            self.start_time = datetime.datetime.strptime(self.start_time_conf, "%Y-%m-%dT%H:%M:%S.%fZ").strftime("%Y-%m-%dT%H:%M:%S.%fZ")
        if self.end_time_conf:
            self.end_time = datetime.datetime.strptime(self.end_time_conf, "%Y-%m-%dT%H:%M:%S.%fZ").strftime("%Y-%m-%dT%H:%M:%S.%fZ")

        try:
            self.es = Elasticsearch(self.hostnames, port=self.port, scheme="https",
                                    http_auth=(self.username, self.password), ssl_context=self.ssl_es_context)
        except Exception as e:
            logger.error('ElasticsearchFinder.__init__ : Connect to remote Elasticsearch %s:%s failed : %s', self.hostnames, self.port, str(e))
    def search_elastic(self, start_time=None, end_time=None):
        try:
            # field to store all data
            data = []
            # parameters end_time and start_time as ARGS
            if start_time:
                self.start_time = datetime.datetime.strptime(start_time, "%Y-%m-%dT%H:%M:%S.%fZ").strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            if end_time:
                self.end_time = datetime.datetime.strptime(end_time, "%Y-%m-%dT%H:%M:%S.%fZ").strftime("%Y-%m-%dT%H:%M:%S.%fZ")

            logger.info('ElasticsearchFinder.search_elastic : QUERY = field1: "value1" and (field2: "value2" or "value3")')
            query = {
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"field1": "value1"}},
                            {"range": {"timestamp": {"lt": self.end_time, "gte": self.start_time}}},
                            {
                                "bool": {
                                    "should": [
                                        {"match": {"field2": "value2"}},
                                        {"match": {"field2": "value3"}}
                                    ]
                                }
                            }
                        ]
                    }
                }
            }

            # Initialize the scroll
            page = self.es.search(
                index=self.index,
                scroll='30m',
                size=1000,
                body=query)
            sid = page['_scroll_id']
            scroll_size = page['hits']['total']['value']
            logger.info('ElasticsearchFinder.search_elastic : search for qradar from {} to {}. Total hits {}'.format(self.start_time, self.end_time, scroll_size))

            # fetch data
            for i in page['hits']['hits']:
                data.append({'some_field': i['_source']['some_field']})

            # Start scrolling
            while scroll_size > 0:
                # Get the number of results returned in the last scroll
                page = self.es.scroll(scroll_id=sid, scroll='30m')
                scroll_size = len(page['hits']['hits'])
                for i in page['hits']['hits']:
                    data.append({'some_field': i['_source']['some_field']})
                # Update the scroll ID (to move to the next page)
                sid = page['_scroll_id']

            logger.info('ElasticsearchFinder.search_elastic : Total stored data {}'.format(len(data)))
            # write CSV file
            self.writeToCSV(self.folder + self.filename, data)
The query has since been checked directly on Kibana and is correct.
(My script also doesn't return the same number of results as Kibana.)
I've had to set large timeouts or else I would get timeout errors.
Any idea why it isn't returning the same number of results each time?
I thought maybe the scroll_id was kept somewhere so Elastic wasn't returning results it had already returned on the previous run, but the scroll_id changes from one run to the next so that seems unlikely.
Turns out the script was creating too many scroll contexts (more than the default 500). This explains why the issue especially appeared if the script was run several times in a row.
This could be seen by looking at the _shards part of the result (page['_shards'] in my case), which contained many error messages like this:
Trying to create too many scroll contexts. Must be less than or equal to: [500]
Reducing the scroll argument in es.search() seems to help, as scroll contexts are then flushed more quickly. Clearing the scroll context after the search with es.clear_scroll() helps as well.
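As a sketch of that combination (shorter scroll window plus explicit cleanup), assuming es, index_name and query are defined as in the class above:

# Keep the scroll window short and release the context when done.
page = es.search(index=index_name, scroll='2m', size=1000, body=query)
sid = page['_scroll_id']
hits = page['hits']['hits']
results = []
while hits:
    results.extend(hits)
    page = es.scroll(scroll_id=sid, scroll='2m')
    sid = page['_scroll_id']
    hits = page['hits']['hits']
# Explicitly free the scroll context instead of waiting for it to expire.
es.clear_scroll(scroll_id=sid)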

Error while deploying machine learning python application

I am trying to deploy my XGBoost model to Kubernetes and I'm stuck writing the Flask code. Here is the code (imported from GitHub). Whenever I call it on the web server, I get the error message "Invalid Parameters". Please help me solve this issue; thank you in advance.
import json
import pickle
import numpy as np
from flask import Flask, request

flask_app = Flask(__name__)

# ML model path
model_path = "Y:/Docker_Tests/Deploy-ML-model-master/Deploy-ML-model-master/ML_Model/model2.pkl"


@flask_app.route('/', methods=['GET'])
def index_page():
    return_data = {
        "error": "0",
        "message": "Successful"
    }
    return flask_app.response_class(response=json.dumps(return_data), mimetype='application/json')


@flask_app.route('/predict', methods=['GET'])
def model_deploy():
    try:
        age = request.form.get('age')
        bs_fast = request.form.get('BS_Fast')
        bs_pp = request.form.get('BS_pp')
        plasma_r = request.form.get('Plasma_R')
        plasma_f = request.form.get('Plasma_F')
        HbA1c = request.form.get('HbA1c')
        fields = [age, bs_fast, bs_pp, plasma_r, plasma_f, HbA1c]
        if not None in fields:
            # Data preprocessing: convert the values to float
            age = float(age)
            bs_fast = float(bs_fast)
            bs_pp = float(bs_pp)
            plasma_r = float(plasma_r)
            plasma_f = float(plasma_f)
            hbA1c = float(HbA1c)
            result = [age, bs_fast, bs_pp, plasma_r, plasma_f, hbA1c]
            # Passing data to model & loading the model from disk
            classifier = pickle.load(open(model_path, 'rb'))
            prediction = classifier.predict([result])[0]
            conf_score = np.max(classifier.predict_proba([result])) * 100
            return_data = {
                "error": '0',
                "message": 'Successful',
                "prediction": prediction,
                "confidence_score": conf_score.round(2)
            }
        else:
            return_data = {
                "error": '1',
                "message": "Invalid Parameters"
            }
    except Exception as e:
        return_data = {
            'error': '2',
            "message": str(e)
        }
    return flask_app.response_class(response=json.dumps(return_data), mimetype='application/json')


if __name__ == "__main__":
    flask_app.run(host='0.0.0.0', port=9091, debug=True)
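Independently of Flask, it can help to sanity-check the pickled model on its own, to rule it out as the source of the error. A minimal sketch using the model path and the feature order from the handler above (the sample values are invented):

import pickle
import numpy as np

model_path = "Y:/Docker_Tests/Deploy-ML-model-master/Deploy-ML-model-master/ML_Model/model2.pkl"

# Invented sample in the order the handler builds `result`:
# [age, bs_fast, bs_pp, plasma_r, plasma_f, hbA1c]
sample = [[45.0, 110.0, 160.0, 7.1, 6.2, 5.9]]

with open(model_path, 'rb') as fh:
    classifier = pickle.load(fh)

print(classifier.predict(sample)[0])
print(round(float(np.max(classifier.predict_proba(sample))) * 100, 2))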

Create Google Cloud Function using API in Python

I'm working on a project with Python (3.6) & Django (1.10) in which I need to create a Google Cloud Function through an API request.
How can I upload the code as a zip archive while creating that function?
Here's what I have tried:
From views.py :
def post(self, request, *args, **kwargs):
    if request.method == 'POST':
        post_data = request.POST.copy()
        post_data.update({'user': request.user.pk})
        form = forms.SlsForm(post_data, request.FILES)
        print('get post request')
        if form.is_valid():
            func_obj = form
            func_obj.user = request.user
            func_obj.project = form.cleaned_data['project']
            func_obj.fname = form.cleaned_data['fname']
            func_obj.fmemory = form.cleaned_data['fmemory']
            func_obj.entryPoint = form.cleaned_data['entryPoint']
            func_obj.sourceFile = form.cleaned_data['sourceFile']
            func_obj.sc_github = form.cleaned_data['sc_github']
            func_obj.sc_inline_index = form.cleaned_data['sc_inline_index']
            func_obj.sc_inline_package = form.cleaned_data['sc_inline_package']
            func_obj.bucket = form.cleaned_data['bucket']
            func_obj.save()

            service = discovery.build('cloudfunctions', 'v1', http=views.getauth(), cache_discovery=False)
            requ = service.projects().locations().functions().generateUploadUrl(
                parent='projects/' + func_obj.project + '/locations/us-central1', body={})
            resp = requ.execute()
            print(resp)

            try:
                auth = views.getauth()
                # Prepare Request Body
                req_body = {
                    "CloudFunction": {
                        "name": func_obj.fname,
                        "entryPoint": func_obj.entryPoint,
                        "timeout": '60s',
                        "availableMemoryMb": func_obj.fmemory,
                        "sourceArchiveUrl": func_obj.sc_github,
                    },
                    "sourceUploadUrl": func_obj.bucket,
                }
                service = discovery.build('cloudfunctions', 'v1beta2', http=auth, cache_discovery=False)
                func_req = service.projects().locations().functions().create(
                    location='projects/' + func_obj.project + '/locations/-',
                    body=req_body)
                func_res = func_req.execute()
                print(func_res)
                return HttpResponse('Submitted',)
            except:
                return HttpResponse(status=500)
    return HttpResponse('Sent!')
Updated Code below:
if form.is_valid():
    func_obj = form
    func_obj.user = request.user
    func_obj.project = form.cleaned_data['project']
    func_obj.fname = form.cleaned_data['fname']
    func_obj.fmemory = form.cleaned_data['fmemory']
    func_obj.entryPoint = form.cleaned_data['entryPoint']
    func_obj.sourceFile = form.cleaned_data['sourceFile']
    func_obj.sc_github = form.cleaned_data['sc_github']
    func_obj.sc_inline_index = form.cleaned_data['sc_inline_index']
    func_obj.sc_inline_package = form.cleaned_data['sc_inline_package']
    func_obj.bucket = form.cleaned_data['bucket']
    func_obj.save()

    #######################################################################
    # FIRST APPROACH FOR FUNCTION CREATION USING STORAGE BUCKET
    #######################################################################
    file_name = os.path.join(IGui.settings.BASE_DIR, 'media/archives/', func_obj.sourceFile.name)
    print(file_name)

    service = discovery.build('cloudfunctions', 'v1')
    func_api = service.projects().locations().functions()
    url_svc_req = func_api.generateUploadUrl(parent='projects/'
                                                    + func_obj.project
                                                    + '/locations/us-central1',
                                             body={})
    url_svc_res = url_svc_req.execute()
    print(url_svc_res)

    upload_url = url_svc_res['uploadUrl']
    print(upload_url)
    headers = {
        'content-type': 'application/zip',
        'x-goog-content-length-range': '0,104857600'
    }
    print(requests.put(upload_url, headers=headers, data=func_obj.sourceFile.name))

    auth = views.getauth()
    # Prepare Request Body
    name = "projects/{}/locations/us-central1/functions/{}".format(func_obj.project, func_obj.fname,)
    print(name)
    req_body = {
        "name": name,
        "entryPoint": func_obj.entryPoint,
        "timeout": "3.5s",
        "availableMemoryMb": func_obj.fmemory,
        "sourceUploadUrl": upload_url,
        "httpsTrigger": {},
    }
    service = discovery.build('cloudfunctions', 'v1')
    func_api = service.projects().locations().functions()
    response = func_api.create(location='projects/' + func_obj.project + '/locations/us-central1',
                               body=req_body).execute()
    pprint.pprint(response)
Now the function is created successfully, but it then fails because the source code doesn't get uploaded to the storage bucket. Maybe something is wrong here:
upload_url = url_svc_res['uploadUrl']
print(upload_url)
headers = {
    'content-type': 'application/zip',
    'x-goog-content-length-range': '0,104857600'
}
print(requests.put(upload_url, headers=headers, data=func_obj.sourceFile.name))
In the request body you have a dictionary "CloudFunction" nested inside the request. The content of "CloudFunction" should go directly in the request body.
request_body = {
    "name": parent + '/functions/' + name,
    "entryPoint": entry_point,
    "sourceUploadUrl": upload_url,
    "httpsTrigger": {}
}
I recommend using "Try this API" to discover the structure of projects.locations.functions.create.
"sourceArchiveUrl" and "sourceUploadUrl" can't appear together. This is explained in the CloudFunction resource documentation:
// Union field source_code can be only one of the following:
"sourceArchiveUrl": string,
"sourceRepository": { object(SourceRepository) },
"sourceUploadUrl": string,
// End of list of possible types for union field source_code.
In the rest of the answer I assume that you want to use "sourceUploadUrl". It requires you to pass it a URL returned to you by .generateUploadUrl(...).execute(). See documentation:
sourceUploadUrl -> string
The Google Cloud Storage signed URL used for source uploading,
generated by [google.cloud.functions.v1.GenerateUploadUrl][]
But before passing it you need to upload a zip file to this URL:
curl -X PUT "${URL}" -H 'content-type:application/zip' -H 'x-goog-content-length-range: 0,104857600' -T test.zip
or in python:
headers = {
    'content-type': 'application/zip',
    'x-goog-content-length-range': '0,104857600'
}
print(requests.put(upload_url, headers=headers, data=data))
This is the trickiest part:
- The case matters and the header names should be lowercase, because the signature is calculated from a hash (here).
- You need 'content-type': 'application/zip'. I deduced this one logically, because the documentation doesn't mention it (here).
- x-goog-content-length-range: min,max is obligatory for all PUT requests to Cloud Storage and is assumed implicitly in this case. More on it here.
- 104857600, the max in the previous entry, is a magic number which I didn't find documented anywhere.
- data is a file-like object.
I also assume that you want to use httpsTrigger. For a Cloud Function you can only choose one trigger field; here it's said that the trigger is a union field. For httpsTrigger, however, you can just leave it as an empty dictionary, since its content does not affect the outcome (as of now).
request_body = {
    "name": parent + '/functions/' + name,
    "entryPoint": entry_point,
    "sourceUploadUrl": upload_url,
    "httpsTrigger": {}
}
You can safely use 'v1' instead of 'v1beta2' for .create().
Here is a full working example. It would be too complicated to present it as part of your code, but you can easily integrate it.
import pprint
import zipfile
import requests
from tempfile import TemporaryFile
from googleapiclient import discovery

project_id = 'your_project_id'
region = 'us-central1'
parent = 'projects/{}/locations/{}'.format(project_id, region)
print(parent)
name = 'ExampleFunctionFibonacci'
entry_point = "fibonacci"

service = discovery.build('cloudfunctions', 'v1')
CloudFunctionsAPI = service.projects().locations().functions()
upload_url = CloudFunctionsAPI.generateUploadUrl(parent=parent, body={}).execute()['uploadUrl']
print(upload_url)

payload = """/**
 * Responds to any HTTP request that can provide a "message" field in the body.
 *
 * @param {Object} req Cloud Function request context.
 * @param {Object} res Cloud Function response context.
 */
exports.""" + entry_point + """= function """ + entry_point + """ (req, res) {
  if (req.body.message === undefined) {
    // This is an error case, as "message" is required
    res.status(400).send('No message defined!');
  } else {
    // Everything is ok
    console.log(req.body.message);
    res.status(200).end();
  }
};"""

with TemporaryFile() as data:
    with zipfile.ZipFile(data, 'w', zipfile.ZIP_DEFLATED) as archive:
        archive.writestr('function.js', payload)
    data.seek(0)
    headers = {
        'content-type': 'application/zip',
        'x-goog-content-length-range': '0,104857600'
    }
    print(requests.put(upload_url, headers=headers, data=data))

# Prepare Request Body
# https://cloud.google.com/functions/docs/reference/rest/v1/projects.locations.functions#resource-cloudfunction
request_body = {
    "name": parent + '/functions/' + name,
    "entryPoint": entry_point,
    "sourceUploadUrl": upload_url,
    "httpsTrigger": {},
    "runtime": 'nodejs8'
}

print('https://{}-{}.cloudfunctions.net/{}'.format(region, project_id, name))

response = CloudFunctionsAPI.create(location=parent, body=request_body).execute()
pprint.pprint(response)
Open and upload a zip file like the following:
file_name = os.path.join(IGui.settings.BASE_DIR, 'media/archives/', func_obj.sourceFile.name)

headers = {
    'content-type': 'application/zip',
    'x-goog-content-length-range': '0,104857600'
}

with open(file_name, 'rb') as data:
    print(requests.put(upload_url, headers=headers, data=data))

'ManyToOneRel' object has no attribute 'parent_model' error with django chartit

I have a Django project with 2 models:
class DeviceModel(models.Model):
    name = models.CharField(max_length=255)

    def __unicode__(self):
        return self.name


class Device(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    device_model = models.ForeignKey(DeviceModel)
    serial_number = models.CharField(max_length=255)

    def __unicode__(self):
        return self.device_model.name + " - " + self.serial_number
There are many devices in the database and I want to plot a chart of "number of devices" per "device model".
I'm trying to do this with django-chartit.
Code in views.py:
ds = PivotDataPool(
    series=[{
        'options': {
            'source': Device.objects.all(),
            'categories': 'device_model'
        },
        'terms': {
            u'Amount': Count('device_model'),
        }
    }],
)

pvcht = PivotChart(
    datasource=ds,
    series_options=[{
        'options': {
            'type': 'column',
            'stacking': True
        },
        'terms': [u'Amount']
    }],
    chart_options={
        'title': {
            'text': u'Device amount chart'},
        'xAxis': {
            'title': {
                'text': u'Device name'}},
        'yAxis': {
            'title': {
                'text': u'Amount'}}}
)

return render(request, 'charts.html', {'my_chart': pvcht})
This seems to plot the result I need, but instead of device names it plots the ForeignKey values (1, 2, 3, 4...), and I need the actual device model names.
I thought the solution was to change the 'categories' value to:
'categories': 'device_model__name'
But this gives me the error:
'ManyToOneRel' object has no attribute 'parent_model'
This type of reference should work according to the official example http://chartit.shutupandship.com/demo/pivot/pivot-with-legend/
What am I missing here?
C:\Anaconda\lib\site-packages\django\core\handlers\base.py in get_response
        response = middleware_method(request, callback, callback_args, callback_kwargs)
        if response:
            break
    if response is None:
        wrapped_callback = self.make_view_atomic(callback)
        try:
            response = wrapped_callback(request, *callback_args, **callback_kwargs)   ###
        except Exception as e:
            # If the view raised an exception, run it through exception
            # middleware, and if the exception middleware returns a
            # response, use that. Otherwise, reraise the exception.
            for middleware_method in self._exception_middleware:
                response = middleware_method(request, e)

D:\django\predator\predator\views.py in charts
    series=[
        {'options': {
            'source': Device.objects.all(),
            'categories': 'device_model__name'
        },
        #'legend_by': 'device_model__device_class'},
        'terms': {
            u'Amount': Count('device_model'),   ###
        }
        }
    ],
    #pareto_term = 'Amount'
)

C:\Anaconda\lib\site-packages\chartit\chartdata.py in __init__
                 'terms': {
                     'asia_avg_temp': Avg('temperature')}}]
        # Save user input to a separate dict. Can be used for debugging.
        self.user_input = locals()
        self.user_input['series'] = copy.deepcopy(series)
        self.series = clean_pdps(series)   ###
        self.top_n_term = (top_n_term if top_n_term
                           in self.series.keys() else None)
        self.top_n = (top_n if (self.top_n_term is not None
                                and isinstance(top_n, int)) else 0)
        self.pareto_term = (pareto_term if pareto_term in
                            self.series.keys() else None)

C:\Anaconda\lib\site-packages\chartit\validation.py in clean_pdps
def clean_pdps(series):
    """Clean the PivotDataPool series input from the user.
    """
    if isinstance(series, list):
        series = _convert_pdps_to_dict(series)
        clean_pdps(series)   ###
    elif isinstance(series, dict):
        if not series:
            raise APIInputError("'series' cannot be empty.")
        for td in series.values():
            # td is not a dict
            if not isinstance(td, dict):

C:\Anaconda\lib\site-packages\chartit\validation.py in clean_pdps
            try:
                _validate_func(td['func'])
            except KeyError:
                raise APIInputError("Missing 'func': %s" % td)
            # categories
            try:
                td['categories'], fa_cat = _clean_categories(td['categories'],
                                                             td['source'])   ###
            except KeyError:
                raise APIInputError("Missing 'categories': %s" % td)
            # legend_by
            try:
                td['legend_by'], fa_lgby = _clean_legend_by(td['legend_by'],
                                                            td['source'])

C:\Anaconda\lib\site-packages\chartit\validation.py in _clean_categories
    else:
        raise APIInputError("'categories' must be one of the following "
                            "types: basestring, tuple or list. Got %s of "
                            "type %s instead."
                            % (categories, type(categories)))
    field_aliases = {}
    for c in categories:
        field_aliases[c] = _validate_field_lookup_term(source.model, c)   ###
    return categories, field_aliases

def _clean_legend_by(legend_by, source):
    if isinstance(legend_by, basestring):
        legend_by = [legend_by]
    elif isinstance(legend_by, (tuple, list)):

C:\Anaconda\lib\site-packages\chartit\validation.py in _validate_field_lookup_term
    # and m2m is True for many-to-many relations.
    # When 'direct' is False, 'field_object' is the corresponding
    # RelatedObject for this field (since the field doesn't have
    # an instance associated with it).
    field_details = model._meta.get_field_by_name(terms[0])
    # if the field is direct field
    if field_details[2]:
        m = field_details[0].related.parent_model   ###
    else:
        m = field_details[0].model
    return _validate_field_lookup_term(m, '__'.join(terms[1:]))

def _clean_source(source):
In categories you can only use fields that are included in the source queryset. In terms, on the other hand, you can use ForeignKey or ManyToMany related fields. Find an example below.
Instead of using
'source': Device.objects.all()
'categories': 'device_model'
try to use
'source': DeviceModel.objects.all()
'categories': 'name'
and then
'Amount': Count('device__device_model')
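Put together, the PivotDataPool from the question might then look like this; this is only a sketch reusing the models above, not the original poster's code:

from django.db.models import Count
from chartit import PivotDataPool

ds = PivotDataPool(
    series=[{
        'options': {
            # Pivot on DeviceModel so 'name' is a direct field of the source
            'source': DeviceModel.objects.all(),
            'categories': 'name',
        },
        'terms': {
            # Count devices through the reverse relation
            u'Amount': Count('device__device_model'),
        },
    }],
)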
I think there is a problem with the newer version of Django (1.8). This code is deprecated:
m = field_details[0].related.parent_model
Instead of it, use:
m = getattr(field_details[0].related, 'related_model', field_details[0].related.model)
You can also find a fix for this problem on GitHub.
Hope this helps.

How to create request body for Python Elasticsearch mSearch

I'm trying to run a multi-search request with the Elasticsearch Python client. I can run a single search correctly but can't figure out how to format the request for an msearch. According to the documentation, the body of the request needs to be formatted as:
The request definitions (metadata-search request definition pairs), as
either a newline separated string, or a sequence of dicts to serialize
(one per row).
What's the best way to create this request body? I've been searching for examples but can't seem to find any.
If you follow the demo from the official docs (even though it's for the Bulk API), you will see how to construct the request in Python with the Elasticsearch client:
Here is the newline separated string way:
import json

def msearch():
    es = get_es_instance()
    search_arr = []

    # req_head
    search_arr.append({'index': 'my_test_index', 'type': 'doc_type_1'})
    # req_body
    search_arr.append({"query": {"term": {"text": "bag"}}, 'from': 0, 'size': 2})

    # req_head
    search_arr.append({'index': 'my_test_index', 'type': 'doc_type_2'})
    # req_body
    search_arr.append({"query": {"match_all": {}}, 'from': 0, 'size': 2})

    request = ''
    for each in search_arr:
        request += '%s \n' % json.dumps(each)

    # as you can see, you just need to feed the <body> parameter,
    # and don't need to specify the <index> and <doc_type> as usual
    resp = es.msearch(body=request)
As you can see, the final request is built from several request units. Each unit has the following structure:
request_header (search controls such as the index name, optional mapping types, search types, etc.)\n
request_body (the query details for this request)\n
The sequence-of-dicts way is almost the same as the previous one, except that you don't need to convert it to a string:
def msearch():
    es = get_es_instance()
    request = []

    req_head = {'index': 'my_test_index', 'type': 'doc_type_1'}
    req_body = {
        'query': {'term': {'text': 'bag'}},
        'from': 0, 'size': 2}
    request.extend([req_head, req_body])

    req_head = {'index': 'my_test_index', 'type': 'doc_type_2'}
    req_body = {
        'query': {'range': {'price': {'gte': 100, 'lt': 300}}},
        'from': 0, 'size': 2}
    request.extend([req_head, req_body])

    resp = es.msearch(body=request)
The response contains one entry per query; read more in the msearch documentation.
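Whichever way the body is built, resp['responses'] holds one result per query, in the same order as the requests. A minimal sketch of consuming the response from the snippets above:

# Each element of resp['responses'] corresponds to one search in the batch.
for i, result in enumerate(resp['responses']):
    if 'error' in result:
        # A failed search does not fail the whole msearch call.
        print('query %d failed: %s' % (i, result['error']))
        continue
    for hit in result['hits']['hits']:
        print(i, hit['_id'], hit['_source'])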
If you are using elasticsearch-dsl, you can use the class MultiSearch.
Example from the documentation:
from elasticsearch_dsl import MultiSearch, Search

ms = MultiSearch(index='blogs')
ms = ms.add(Search().filter('term', tags='python'))
ms = ms.add(Search().filter('term', tags='elasticsearch'))

responses = ms.execute()

for response in responses:
    print("Results for query %r." % response.search.query)
    for hit in response:
        print(hit.title)
Here is what I came up with. I am using the same document type and index so I optimized the code to run multiple queries with the same header:
import json
import logging
import time

from elasticsearch import Elasticsearch
from elasticsearch import exceptions as es_exceptions

RETRY_ATTEMPTS = 10
RECONNECT_SLEEP_SECS = 0.5


def msearch(es_conn, queries, index, doc_type, retries=0):
    """
    Es multi-search query
    :param queries: list of dict, es queries
    :param index: str, index to query against
    :param doc_type: str, defined doc type i.e. event
    :param retries: int, current retry attempt
    :return: list, found docs
    """
    search_header = json.dumps({'index': index, 'type': doc_type})
    request = ''
    for q in queries:
        # request head, body pairs
        request += '{}\n{}\n'.format(search_header, json.dumps(q))
    try:
        resp = es_conn.msearch(body=request, index=index)
        found = [r['hits']['hits'] for r in resp['responses']]
    except (es_exceptions.ConnectionTimeout, es_exceptions.ConnectionError,
            es_exceptions.TransportError):  # pragma: no cover
        logging.warning("msearch connection failed, retrying...")  # Retry on timeout
        if retries > RETRY_ATTEMPTS:  # pragma: no cover
            raise
        time.sleep(RECONNECT_SLEEP_SECS)
        found = msearch(es_conn, queries, index, doc_type, retries=retries + 1)
    except Exception as e:  # pragma: no cover
        logging.critical("msearch error {} on query {}".format(e, queries))
        raise
    return found


es_conn = Elasticsearch()
queries = []
queries.append(
    {"min_score": 2.0, "query": {"bool": {"should": [{"match": {"name.tokenized": {"query": "batman"}}}]}}}
)
queries.append(
    {"min_score": 1.0, "query": {"bool": {"should": [{"match": {"name.tokenized": {"query": "ironman"}}}]}}}
)
queries.append(
    {"track_scores": True, "min_score": 9.0, "query":
        {"bool": {"should": [{"match": {"name": {"query": "not-findable"}}}]}}}
)
q_results = msearch(es_conn, queries, index='pipeliner_current', doc_type='event')
This may be what some of you are looking for if you want to do multiple queries on the same index and doc type.
Got it! Here's what I did for anybody else...
query_list = ""
query_count = 0
es = Elasticsearch("myurl")

for obj in my_list:
    query = constructQuery(name)
    query_count += 1
    query_list += json.dumps({})
    query_list += json.dumps(query)
    if query_count <= 19:
        query_list += "\n"
    if query_count == 20:
        es.msearch(index="m_index", body=query_list)
I was being tripped up by having to add the index twice. Even when using the Python client you still have to include the index part described in the original docs. It works now though!
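A slightly more defensive variant of the same idea, shown only as a sketch (my_list, constructQuery and the URL are placeholders from the post above): it keeps each header/body pair on its own lines, flushes every 20 queries, and sends whatever is left at the end.

import json
from elasticsearch import Elasticsearch

es = Elasticsearch("myurl")  # placeholder URL from the post
BATCH_SIZE = 20

body = ''
pending = 0
for obj in my_list:  # my_list and constructQuery are the poster's own helpers
    query = constructQuery(obj)
    # msearch expects: header line, newline, body line, newline, repeated
    body += json.dumps({}) + "\n" + json.dumps(query) + "\n"
    pending += 1
    if pending == BATCH_SIZE:
        es.msearch(index="m_index", body=body)
        body, pending = '', 0
if pending:
    # Send the final partial batch as well
    es.msearch(index="m_index", body=body)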
