I'm trying to run a multi search request on the Elasticsearch Python client. I can run the singular search correctly but can't figure out how to format the request for a msearch. According to the documentation, the body of the request needs to be formatted as:
The request definitions (metadata-search request definition pairs), as
either a newline separated string, or a sequence of dicts to serialize
(one per row).
What's the best way to create this request body? I've been searching for examples but can't seem to find any.
If you follow the demo in the official docs (even though it's for the Bulk API), you will see how to construct the request in Python with the Elasticsearch client:
Here is the newline-separated string way:
import json

def msearch():
    es = get_es_instance()

    search_arr = []
    # req_head
    search_arr.append({'index': 'my_test_index', 'type': 'doc_type_1'})
    # req_body
    search_arr.append({"query": {"term": {"text": "bag"}}, 'from': 0, 'size': 2})
    # req_head
    search_arr.append({'index': 'my_test_index', 'type': 'doc_type_2'})
    # req_body
    search_arr.append({"query": {"match_all": {}}, 'from': 0, 'size': 2})

    request = ''
    for each in search_arr:
        request += '%s \n' % json.dumps(each)

    # as you can see, you just need to feed the <body> parameter,
    # and don't need to specify the <index> and <doc_type> as usual
    resp = es.msearch(body=request)
As you can see, the final request is built from several request units.
Each unit has the following structure:
request_header (search controls such as index name, optional mapping type, search type, etc.)\n
request_body (the query details for this request)\n
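For illustration (based on the pairs appended above), the serialized body string ends up looking roughly like this, one JSON object per line with a trailing newline:
{"index": "my_test_index", "type": "doc_type_1"}
{"query": {"term": {"text": "bag"}}, "from": 0, "size": 2}
{"index": "my_test_index", "type": "doc_type_2"}
{"query": {"match_all": {}}, "from": 0, "size": 2}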
The sequence-of-dicts way is almost the same as the previous one, except that you don't need to convert it to a string:
def msearch():
    es = get_es_instance()

    request = []

    req_head = {'index': 'my_test_index', 'type': 'doc_type_1'}
    req_body = {
        'query': {'term': {'text': 'bag'}},
        'from': 0, 'size': 2}
    request.extend([req_head, req_body])

    req_head = {'index': 'my_test_index', 'type': 'doc_type_2'}
    req_body = {
        'query': {'range': {'price': {'gte': 100, 'lt': 300}}},
        'from': 0, 'size': 2}
    request.extend([req_head, req_body])

    resp = es.msearch(body=request)
Here is the structure it returns (sketched below). Read more about msearch in the official documentation.
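For reference, a minimal sketch of reading that result - the msearch response carries one entry per header/body pair under the responses key, each shaped like a normal search response:

resp = es.msearch(body=request)
for single_resp in resp['responses']:
    # hits for the corresponding request pair
    print(single_resp['hits']['hits'])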
If you are using elasticsearch-dsl, you can use the class MultiSearch.
Example from the documentation:
from elasticsearch_dsl import MultiSearch, Search
ms = MultiSearch(index='blogs')
ms = ms.add(Search().filter('term', tags='python'))
ms = ms.add(Search().filter('term', tags='elasticsearch'))
responses = ms.execute()
for response in responses:
    print("Results for query %r." % response.search.query)
    for hit in response:
        print(hit.title)
Here is what I came up with. I am using the same document type and index so I optimized the code to run multiple queries with the same header:
from elasticsearch import Elasticsearch
from elasticsearch import exceptions as es_exceptions
import json
import logging
import time

RETRY_ATTEMPTS = 10
RECONNECT_SLEEP_SECS = 0.5

def msearch(es_conn, queries, index, doc_type, retries=0):
    """
    ES multi-search query
    :param es_conn: Elasticsearch, client connection
    :param queries: list of dict, ES queries
    :param index: str, index to query against
    :param doc_type: str, defined doc type i.e. event
    :param retries: int, current retry attempt
    :return: list, found docs
    """
    search_header = json.dumps({'index': index, 'type': doc_type})
    request = ''
    for q in queries:
        # request head, body pairs
        request += '{}\n{}\n'.format(search_header, json.dumps(q))
    try:
        resp = es_conn.msearch(body=request, index=index)
        found = [r['hits']['hits'] for r in resp['responses']]
    except (es_exceptions.ConnectionTimeout, es_exceptions.ConnectionError,
            es_exceptions.TransportError):  # pragma: no cover
        logging.warning("msearch connection failed, retrying...")  # Retry on timeout
        if retries > RETRY_ATTEMPTS:  # pragma: no cover
            raise
        time.sleep(RECONNECT_SLEEP_SECS)
        found = msearch(es_conn, queries=queries, index=index, doc_type=doc_type, retries=retries + 1)
    except Exception as e:  # pragma: no cover
        logging.critical("msearch error {} on query {}".format(e, queries))
        raise
    return found
es_conn = Elasticsearch()
queries = []
queries.append(
    {"min_score": 2.0, "query": {"bool": {"should": [{"match": {"name.tokenized": {"query": "batman"}}}]}}}
)
queries.append(
    {"min_score": 1.0, "query": {"bool": {"should": [{"match": {"name.tokenized": {"query": "ironman"}}}]}}}
)
queries.append(
    {"track_scores": True, "min_score": 9.0, "query":
        {"bool": {"should": [{"match": {"name": {"query": "not-findable"}}}]}}}
)
q_results = msearch(es_conn, queries, index='pipeliner_current', doc_type='event')
This may be what some of you are looking for if you want to do multiple queries on the same index and doc type.
Got it! Here's what I did for anybody else...
query_list = ""
es = ElasticSearch("myurl")
for obj in my_list:
query = constructQuery(name)
query_count += 1
query_list += json.dumps({})
query_list += json.dumps(query)
if query_count <= 19:
query_list += "\n"
if query_count == 20:
es.msearch(index = "m_index", body = query_list)
I was being tripped up by having to add the index twice. Even when using the Python client you still have to include the header part described in the original docs. Works now though!
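For anyone adapting this, here is a minimal self-contained sketch of the same idea (constructQuery and my_list stand in for your own query builder and input data), sending one msearch per batch of 20 and also flushing a trailing partial batch:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch("myurl")

def run_batched_msearch(items, batch_size=20):
    responses = []
    body = ""
    for i, obj in enumerate(items, start=1):
        body += json.dumps({}) + "\n"                    # header line; default index is given to msearch below
        body += json.dumps(constructQuery(obj)) + "\n"   # your own query builder
        if i % batch_size == 0:
            responses.append(es.msearch(index="m_index", body=body))
            body = ""
    if body:  # flush the final partial batch
        responses.append(es.msearch(index="m_index", body=body))
    return responses

responses = run_batched_msearch(my_list)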
Related
I'm using the API to get all the data, so I need to paginate and loop through the pages to get everything I want. Is there any way to extract this automatically?
Alternatively, should I somehow use a while loop to get all this data? What is the best way? Any thoughts?
import json
import os
import requests
from http import HTTPStatus

client_id = ""
client_secret = ""

os.environ["DX_GATEWAY"] = "http://api.com"
os.environ["DX_CLIENT_ID"] = client_id
os.environ["DX_CLIENT_SECRET"] = client_secret

dx_request = requests.Request()

path = "/path/to/api"
params = {
    "Type": "abc",
    "Id": "def",
    "limit": 999,
    "Category": "abc"
}
params_str = "&".join([f"{k}={v}" for k, v in params.items()])
url = "?".join([path, params_str])

vulns = dx_request.get(  ## also tried dx_request.args.get(
    url,
    version=1,
)
if vulns.status_code != int(HTTPStatus.OK):
    raise RuntimeError("API call did not return expected response: " + str(vulns))

response_data = vulns.json()
print(json.dumps(response_data))
Why not handle the pagination with a simple loop, using the Requests library that you already have loaded in the code?
e.g.:
import requests
url = 'https://api.example.com/items'
params = {'limit': 100, 'offset': 0} # set initial parameters for first page of results
while True:
    response = requests.get(url, params=params)
    data = response.json()
    items = data['items']
    # do something with items here...
    if len(items) < 100:
        break  # if fewer than 100 items were returned, we've reached the last page
    params['offset'] += 100  # increment offset to retrieve the next page of results
I am creating a SAM web app, with the backend being an API in front of a Python Lambda function and a DynamoDB table that maintains a count of the number of HTTP calls to the API. The API must also return this number. The YAML template itself loads normally. My problem is writing the Lambda function to increment and return the count. Here is my code:
import json
import os
import boto3
from decimal import Decimal

def lambda_handler(event, context):
    dynamodb = boto3.resource("dynamodb")
    ddbTableName = os.environ["databaseName"]
    table = dynamodb.Table(ddbTableName)
    # Update item in table or add if doesn't exist
    ddbResponse = table.update_item(
        Key={"id": "VisitorCount"},
        UpdateExpression="SET count = count + :value",
        ExpressionAttributeValues={":value": Decimal(context)},
        ReturnValues="UPDATED_NEW",
    )
    # Format dynamodb response into variable
    responseBody = json.dumps({"VisitorCount": ddbResponse["Attributes"]["count"]})
    # Create api response object
    apiResponse = {"isBase64Encoded": False, "statusCode": 200, "body": responseBody}
    # Return api response object
    return apiResponse
I can get VisitorCount to be a string, but not a number. I get this error:
[ERROR] TypeError: lambda_handler() missing 1 required positional argument: 'cou response = request_handler(event, lambda_context)le_event_request
What is going on?
[UPDATE] I found the original error, which was that the function was not properly received by the SAM app. Changing the name fixed this, and it is now being read. Now I have to troubleshoot the actual Python. New Code:
import json
import boto3
import os

dynamodb = boto3.resource("dynamodb")
ddbTableName = os.environ["databaseName"]
table = dynamodb.Table(ddbTableName)
Key = {"VisitorCount": {"N": "0"}}

def handler(event, context):
    # Update item in table or add if doesn't exist
    ddbResponse = table.update_item(
        UpdateExpression="set VisitorCount = VisitorCount + :val",
        ExpressionAttributeValues={":val": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    # Format dynamodb response into variable
    responseBody = json.dumps({"VisitorCount": ddbResponse["Attributes"]["count"]})
    # Create api response object
    apiResponse = {"isBase64Encoded": False, "statusCode": 200, "body": responseBody}
    # Return api response object
    return apiResponse
I am getting a syntax error on Line 13, which is
UpdateExpression= "set VisitorCount = VisitorCount + :val",
But I can't tell where I am going wrong on this. It should update the DynamoDB table to increase the count by 1. Looking at the AWS guide it appears to be the correct syntax.
Not sure what the exact error is, but ddbResponse should be built like this:
ddbResponse = table.update_item(
    Key={
        'key1': aaa,
        'key2': bbb
    },
    UpdateExpression="set VisitorCount = VisitorCount + :val",
    ExpressionAttributeValues={":val": Decimal(1)},
    ReturnValues="UPDATED_NEW",
)
Specify item to be updated with Key (one item for one Lambda call)
Set Decimal(1) for ExpressionAttributeValues
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.03.html#GettingStarted.Python.03.04
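Putting those two points together, a minimal sketch of the handler (assuming the table's partition key is named id, as in the question's first version, and using if_not_exists so the very first call works before the attribute exists):

import json
import os
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["databaseName"])

def handler(event, context):
    ddbResponse = table.update_item(
        Key={"id": "VisitorCount"},
        UpdateExpression="SET VisitorCount = if_not_exists(VisitorCount, :zero) + :val",
        ExpressionAttributeValues={":zero": Decimal(0), ":val": Decimal(1)},
        ReturnValues="UPDATED_NEW",
    )
    # Attributes come back as Decimal, so convert before json.dumps
    count = int(ddbResponse["Attributes"]["VisitorCount"])
    return {"isBase64Encoded": False, "statusCode": 200, "body": json.dumps({"VisitorCount": count})}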
I want to index Shakespeare data in Elasticsearch using its Python API. I am getting the error below.
PUT http://localhost:9200/shakes/play/3 [status:400 request:0.098s]
{'error': {'root_cause': [{'type': 'mapper_parsing_exception', 'reason': 'failed to parse'}], 'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': {'type': 'not_x_content_exception', 'reason': 'Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes'}}, 'status': 400}
Python script:
from elasticsearch import Elasticsearch
from elasticsearch import TransportError
import json

data = []
for line in open('shakespeare.json', 'r'):
    data.append(json.loads(line))

es = Elasticsearch()

res = 0
cl = []
# filtering data which i need
for d in data:
    if res == 0:
        res = 1
        continue
    cl.append(data[res])
    res = 0

try:
    res = es.index(index="shakes", doc_type="play", id=3, body=cl)
    print(res)
except TransportError as e:
    print(e.info)
I also tried using json.dumps but I still get the same error. However, when I add just one element of the list to Elasticsearch, the code below works.
You are not sending a bulk request to ES, but only a simple create request - please take a look here. This method works with a dict that represents a single new doc, not with a list of docs. If you put an id on the create request, you also need to make that value dynamic, otherwise every doc will be overwritten by the next one indexed under the same id. If your JSON has one record per line, you should try this - please read here for the bulk documentation:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

op_list = []
with open(r"C:\ElasticSearch\shakespeare.json") as json_file:
    for record in json_file:
        op_list.append({
            '_op_type': 'index',
            '_index': 'shakes',
            '_type': 'play',
            '_source': record
        })
helpers.bulk(client=es, actions=op_list)
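As a small usage note: helpers.bulk returns a tuple of (number of successfully executed actions, errors), so you can check the outcome roughly like this (with raise_on_error=False the errors come back as a list instead of raising):

success, errors = helpers.bulk(client=es, actions=op_list, raise_on_error=False)
print("indexed {} docs, {} errors".format(success, len(errors)))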
I am trying to fetch product data from an api.
By default this API returns 20 products, and in a single request it can return at most 500 products if we use the API's Limit=500 parameter.
So to fetch all products we need one more parameter alongside Limit: Offset (the number of products to skip).
I have written the following function to achieve this, but when fetching the full data it does not work well and gives me an error like: Login failed, Signature mismatching.
import datetime
import urllib.parse
from hashlib import sha256
from hmac import HMAC
from time import sleep

import requests

def get_data(userid, api_key, action, pagination=True):
    timeformat = datetime.datetime.now().replace(microsecond=0).isoformat() + '+08:00'
    endpoint = 'https://example.com'
    page_json = {}
    # set required parameters for this api
    parameters = {
        'UserID': userid,
        'Version': '1.0',
        'Action': action,
        'Format': 'JSON',
        'Timestamp': timeformat
    }
    if pagination:
        page = 0
        parameters['Limit'] = 500
        while True:
            parameters['Offset'] = 500 * page
            # set the required cryptographic signature
            concatenated = urllib.parse.urlencode(sorted(parameters.items()))
            parameters['Signature'] = HMAC(api_key, concatenated.encode('utf-8'), sha256).hexdigest()
            page += 1
            try:
                response = requests.get(endpoint, params=parameters)
                page_json = response.json()
            except requests.exceptions.ConnectionError:
                print("Connection refused!")
                sleep(5)
    else:
        try:
            concatenated = urllib.parse.urlencode(sorted(parameters.items()))
            # set the required cryptographic signature
            parameters['Signature'] = HMAC(api_key, concatenated.encode('utf-8'), sha256).hexdigest()
            response = requests.get(endpoint, params=parameters)
            page_json = response.json()
        except requests.exceptions.ConnectionError:
            print("Connection refused!")
            sleep(5)
    return page_json
It looks like I am not setting my Signature parameter correctly when fetching the full data. I printed the value of concatenated and it looks like this:
page is 1
concatenated:: Action=GetProducts&Format=JSON&Limit=500&Offset=500&Signature=3d9cd320a4bf816aeea828b9392ed2d5a27cd584b3a337338909c0ab161a101e&Timestamp=2018-05-26T12%3A58%3A38%2B08%3A00&UserID=contact%40example.com.sg&Version=1.0
try: {'ErrorResponse': {'Head': {'ErrorCode': '7', 'ErrorMessage': 'E7:Login failed. Signature mismatching', 'ErrorType': 'Sender', 'RequestAction': 'GetProducts', 'RequestId': '0bb606c015273197313552686ec46f'}}}
page is 2
concatenated:: Action=GetProducts&Format=JSON&Limit=500&Offset=1000&Signature=c1bda1a5ab21c4e4182cc82ca7ba87cb9fc6c5f24c36f9bb006f9da906cf7083&Timestamp=2018-05-26T12%3A58%3A38%2B08%3A00&UserID=contact%40example.com.sg&Version=1.0
try: {'ErrorResponse': {'Head': {'ErrorCode': '7', 'ErrorMessage': 'E7:Login failed. Signature mismatching', 'ErrorType': 'Sender', 'RequestAction': 'GetProducts', 'RequestId': '0bb606c015273197321748243ec3a5'}}}
Can you please look into my function and help me to find out what I have written wrong and what it should be like?
Try this please: the Signature computed for the previous page is still sitting in parameters when the next one is calculated, so the concatenated string you sign for page 2 still contains the Signature from page 1 (you can see it in your printed output) and no longer matches what the server computes. Deleting it at the end of each iteration fixes the mismatch:
if pagination:
    page = 0
    parameters['Limit'] = 500
    while True:
        parameters['Offset'] = 500 * page
        # set the required cryptographic signature
        concatenated = urllib.parse.urlencode(sorted(parameters.items()))
        parameters['Signature'] = HMAC(api_key, concatenated.encode('utf-8'), sha256).hexdigest()
        page += 1
        try:
            response = requests.get(endpoint, params=parameters)
            page_json = response.json()
        except requests.exceptions.ConnectionError:
            print("Connection refused!")
            sleep(5)
        # drop the stale signature before the next iteration recomputes it
        del parameters['Signature']
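An alternative sketch of the same fix: sign a copy of the parameters that never contains the previous Signature, so nothing needs to be deleted between iterations:

# build the signature over everything except any stale Signature entry
unsigned = {k: v for k, v in parameters.items() if k != 'Signature'}
concatenated = urllib.parse.urlencode(sorted(unsigned.items()))
parameters['Signature'] = HMAC(api_key, concatenated.encode('utf-8'), sha256).hexdigest()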
I am using Python to extract data from a Solr API, like so:
import requests
user = 'my_username'
password= 'my password'
url = 'my_url'
print ("Accessing API..")
req = requests.get(url = url, auth=(user, password))
print ("Accessed!")
out = req.json()
#print(out)
However, for some of the API URLs the output is fairly "large" (many of the columns are lists of dictionaries), and so a single request doesn't return all the rows I need.
From looking around, it looks like I should use pagination to bring in results in specified increments. Something like this:
url = 'url?start=0&rows=1000'
Then,
url = 'url?start=1000&rows=1000'
and so on, until no results are returned.
The way I am thinking about it is write a loop, and append result to output with every loop. However, I am not sure how to do that.
Would someone be able to help please?
Thank you in advance!
Did you look at the output? In my experience, a Solr response usually includes a 'numFound' field in its result. On an (old) Solr instance I have locally, doing a random query, I get this result:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "*:*",
      "indent": "true",
      "start": "0",
      "rows": "10",
      "wt": "json",
      "_": "1509460751164"
    }
  },
  "response": {
    "numFound": 7023,
    "start": 0,
    "docs": [.. 10 docs]
  }
}
While working out this code example, I realized you don't really need numFound. Solr will just return an empty list for docs when there are no further results, which makes the loop easier.
import requests

user = 'my_username'
password = 'my password'

# Starting values
start = 0
rows = 1000  # static, but easier to manipulate if it's a variable
base_url = 'my_url?rows={0}&start={1}'

url = base_url.format(rows, start)
req = requests.get(url=url, auth=(user, password))
out = req.json()

total_found = out.get('response', {}).get('numFound', 0)

# Up the start with 1000, so we fetch the next 1000
start += rows
results = out.get('response', {}).get('docs', [])
all_results = results

# Results will be an empty list if no more results are found
while results:
    # Rebuild url based on current start.
    url = base_url.format(rows, start)
    req = requests.get(url=url, auth=(user, password))
    out = req.json()
    results = out.get('response', {}).get('docs', [])
    all_results += results
    start += rows

# All results will now contain all the 'docs' of each request.
print(all_results)
Mind you, those docs will be dict-like, so more parsing will be needed.
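For example, a minimal sketch of pulling one field out of each doc (the field name title is just a placeholder for whatever your Solr schema actually contains):

titles = [doc.get('title') for doc in all_results]
print(titles)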