Limited data in elasticsearch search query in python

I have an index named "twitter_profile_response_tms" with a doc type named "posts".
I want to query all the data in posts. There are 2636 documents in posts in total, but after running the query I only get 10 matching documents.
The following is my query:
res = self.es.search(
    index="twitter_profile_response_tms",
    doc_type='posts',
    body={
        "query": {
            "match": {
                'username': 'wasimakramlive'
            }
        }
    },
)
How should I resolve this?

You can use the hits.total.value key from the Elasticsearch result payload to get the number of matching records.
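A minimal sketch of reading it (assuming Elasticsearch 7+, where hits.total is an object; on older versions it is a plain integer):

res = self.es.search(index="twitter_profile_response_tms",
                     doc_type='posts',
                     body={"query": {"match": {'username': 'wasimakramlive'}}})
total = res['hits']['total']['value']  # ES 7+: {'value': 2636, 'relation': 'eq'}
# total = res['hits']['total']         # ES 6 and earlier: a plain int
print(total, len(res['hits']['hits']))  # e.g. 2636 matches, 10 returned by default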

You can control it using the size parameter. For example, to fetch 50 records, set "size": 50 at the top level of the body:
res = self.es.search(
    index="twitter_profile_response_tms",
    doc_type='posts',
    body={
        "query": {
            "match": {
                'username': 'wasimakramlive'
            }
        },
        "size": 50
    },
)
You can fetch data page-wise as well, using the from and size parameters together; see the Elasticsearch pagination documentation for more.
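A minimal pagination sketch under the same index and query (page_size is an illustrative name):

page_size = 100
for start in range(0, 2636, page_size):
    res = self.es.search(
        index="twitter_profile_response_tms",
        doc_type='posts',
        body={
            "query": {"match": {'username': 'wasimakramlive'}},
            "from": start,
            "size": page_size,
        },
    )
    for hit in res['hits']['hits']:
        print(hit['_source'])  # or process each document as needed

Keep in mind that from + size cannot exceed the index.max_result_window setting (10,000 by default); past that, use the scroll or search_after APIs.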


Unable to replicate post_filter query in elasticsearch-dsl

The query I would like to replicate in DSL is as below:
GET /_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "destination": "singapore"
                    }
                },
                {
                    "terms": {
                        "tag_ids": [
                            "tag_luxury"
                        ]
                    }
                }
            ]
        }
    },
    "aggs": {
        "max_price": {
            "max": {
                "field": "price_range_from.SGD"
            }
        },
        "min_price": {
            "min": {
                "field": "price_range_from.SGD"
            }
        }
    },
    "post_filter": {
        "range": {
            "price_range_from.SGD": {
                "gte": 0.0,
                "lte": 100.0
            }
        }
    }
}
The above query:
Matches terms - destination and tag_ids
Aggregates the results to find the max and min price from the field price_range_from.SGD
Applies a post_filter to subset the result set within the price limits
It works perfectly well in the Elastic/Kibana console.
I replicated the above query in elasticsearch-dsl as below:
es_query = []
es_query.append(Q("term", destination="singapore"))
es_query.append(Q("terms", tag_ids=["tag_luxury"]))
final_query = Q("bool", must=es_query)
es_conn = ElasticSearch.instance().get_client()
dsl_client = DSLSearch(using=es_conn, index=index).get_dsl_client()
dsl_client.query = final_query
dsl_client.aggs.metric("min_price", "min", field="price_range_from.SGD")
dsl_client.aggs.metric("max_price", "max", field="price_range_from.SGD")
q = Q("range", **{"price_range_from.SGD":{"gte": 0.0, "lte": 100.0}})
dsl_client.post_filter(q)
print(dsl_client.to_dict())
response = dsl_client.execute()
print(response.to_dict().get("hits", {}))
Although the aggregations are correct, products beyond the price range are also being returned. There is no error returned but it seems like the post_filter query is not applied.
I dug into the dsl_client object to see whether my query was being captured correctly. I see only the query and aggs, but don't see the post_filter part in the object. The query converted to a dictionary using dsl_client.to_dict() is as below -
{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "destination": "singapore"
                    }
                },
                {
                    "terms": {
                        "tag_ids": [
                            "tag_luxury"
                        ]
                    }
                }
            ]
        }
    },
    "aggs": {
        "min_price": {
            "min": {
                "field": "price_range_from.SGD"
            }
        },
        "max_price": {
            "max": {
                "field": "price_range_from.SGD"
            }
        }
    }
}
Please help. Thanks!
You have to re-assign the dsl_client: Search objects in elasticsearch-dsl are immutable, so post_filter() returns a modified copy instead of changing the object in place:
dsl_client = dsl_client.post_filter(q)
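A minimal sketch of the corrected tail of your snippet (aggregations mutate the search in place, but query-level methods such as post_filter() return a copy):

dsl_client = DSLSearch(using=es_conn, index=index).get_dsl_client()
dsl_client.query = final_query
dsl_client.aggs.metric("min_price", "min", field="price_range_from.SGD")
dsl_client.aggs.metric("max_price", "max", field="price_range_from.SGD")
q = Q("range", **{"price_range_from.SGD": {"gte": 0.0, "lte": 100.0}})
dsl_client = dsl_client.post_filter(q)  # re-assign, otherwise the filter is discarded
print(dsl_client.to_dict())             # should now include the post_filter section
response = dsl_client.execute()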

Using the Firestore REST API to Update a Document Field

I've been searching for a pretty long time, but I can't figure out how to update a field in a document using the Firestore REST API. I've looked at other questions, but they haven't helped me, since I'm getting a different error:
{'error': {'code': 400, 'message': 'Request contains an invalid argument.', 'status': 'INVALID_ARGUMENT', 'details': [{'#type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'field': 'oil', 'description': "Error expanding 'fields' parameter. Cannot find matching fields for path 'oil'."}]}]}}
I'm getting this error even though I know that the "oil" field exists in the document. I'm writing this in Python.
My request body (field is the field in a document and value is the value to set that field to, both strings received from user input):
{
    "fields": {
        field: {
            "integerValue": value
        }
    }
}
My request (authorizationToken is from a different request, dir is also a string from user input which controls the directory):
requests.patch("https://firestore.googleapis.com/v1beta1/projects/aethia-resource-management/databases/(default)/documents/" + dir + "?updateMask.fieldPaths=" + field, data = body, headers = {"Authorization": "Bearer " + authorizationToken}).json()
Based on the official docs (1, 2, and 3), GitHub and a nice article, for the example you have provided you should use the following:
requests.patch("https://firestore.googleapis.com/v1beta1/projects{projectId}/databases/{databaseId}/documents/{document_path}?updateMask.fieldPaths=field")
Your request body should be:
{
    "fields": {
        "field": {
            "integerValue": Value
        }
    }
}
Also keep in mind that if you want to update multiple fields and values you should specify each one separately.
Example:
https://firestore.googleapis.com/v1beta1/projects/{projectId}/databases/{databaseId}/documents/{document_path}?updateMask.fieldPaths=[Field1]&updateMask.fieldPaths=[Field2]
and the request body would be:
{
    "fields": {
        "Field1": {
            "integerValue": Value1
        },
        "Field2": {
            "stringValue": "Value2"
        }
    }
}
EDIT:
Here is a way I have tested which allows you to update some fields of a document without affecting the rest.
This sample code creates a document under the collection users with 4 fields, then updates 3 of the 4 fields (which leaves the unmentioned one unaffected):
from google.cloud import firestore

db = firestore.Client()

# Creating a sample new document "aturing" under collection "users"
doc_ref = db.collection(u'users').document(u'aturing')
doc_ref.set({
    u'first': u'Alan',
    u'middle': u'Mathison',
    u'last': u'Turing',
    u'born': 1912
})

# Updating 3 out of 4 fields (so the last one should remain unaffected)
doc_ref = db.collection(u'users').document(u'aturing')
doc_ref.update({
    u'first': u'Alan',
    u'middle': u'Mathison',
    u'born': 2000
})

# Printing the content of all docs under users
users_ref = db.collection(u'users')
docs = users_ref.stream()
for doc in docs:
    print(u'{} => {}'.format(doc.id, doc.to_dict()))
EDIT: 10/12/2019
PATCH with REST API
I have reproduced your issue, and it seems like you are not converting your request body to JSON properly. You need to use json.dumps() to serialize the request body before sending it.
A working example is the following:
import requests
import json

endpoint = "https://firestore.googleapis.com/v1/projects/[PROJECT_ID]/databases/(default)/documents/[COLLECTION]/[DOCUMENT_ID]?currentDocument.exists=true&updateMask.fieldPaths=[FIELD_1]"
body = {
    "fields": {
        "[FIELD_1]": {
            "stringValue": "random new value"
        }
    }
}
data = json.dumps(body)
headers = {"Authorization": "Bearer [AUTH_TOKEN]"}
print(requests.patch(endpoint, data=data, headers=headers).json())
I found the official documentation not to be of much use, since there was no example mentioned. This is the API endpoint for your Firestore database:
PATCH https://firestore.googleapis.com/v1beta1/projects/{YOUR_PROJECT_ID}/databases/(default)/documents/{COLLECTION_NAME}/{DOCUMENT_NAME}
The following is the body of your API request:
{
    "fields": {
        "first_name": {
            "stringValue": "Kurt"
        },
        "name": {
            "stringValue": "Cobain"
        },
        "band": {
            "stringValue": "Nirvana"
        }
    }
}
The response you should get upon a successful update should look like:
{
    "name": "projects/{YOUR_PROJECT_ID}/databases/(default)/documents/{COLLECTION_ID}/{DOC_ID}",
    "fields": {
        "first_name": {
            "stringValue": "Kurt"
        },
        "name": {
            "stringValue": "Cobain"
        },
        "band": {
            "stringValue": "Nirvana"
        }
    },
    "createTime": "{CREATE_TIME}",
    "updateTime": "{UPDATE_TIME}"
}
Note that performing the above action without an update mask replaces the whole document, meaning that any fields that existed previously but are NOT mentioned in the "fields" body will be deleted. To avoid that, append
?updateMask.fieldPaths={FIELD_NAME} to the end of your API call, once for each individual field that you are writing; any fields not listed in the mask are left untouched.
For example:
PATCH https://firestore.googleapis.com/v1beta1/projects/{YOUR_PROJECT_ID}/databases/(default)/documents/{COLLECTION_NAME}/{DOCUMENT_NAME}?updateMask.fieldPaths=name&updateMask.fieldPaths=band&updateMask.fieldPaths=age and so on.
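A minimal sketch of building such a masked PATCH call in Python (the build_masked_patch helper and the placeholders are illustrative, not part of the API):

import requests
import json

def build_masked_patch(project_id, doc_path, fields, token):
    # fields: e.g. {"band": {"stringValue": "Nirvana"}}
    base = ("https://firestore.googleapis.com/v1beta1/projects/"
            + project_id + "/databases/(default)/documents/" + doc_path)
    # one updateMask.fieldPaths entry per field being written
    mask = "&".join("updateMask.fieldPaths=" + name for name in fields)
    return requests.patch(base + "?" + mask,
                          data=json.dumps({"fields": fields}),
                          headers={"Authorization": "Bearer " + token})

resp = build_masked_patch("{YOUR_PROJECT_ID}", "users/aturing",
                          {"band": {"stringValue": "Nirvana"}}, "{AUTH_TOKEN}")
print(resp.json())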

How to find the count of the number of documents in mongodb using pymongo aggregation?

I'm trying to find the max value of a field across a number of documents, and I want the output to reflect not only the max value of the field but also the total count of documents that the aggregate query retrieves.
I'm able to retrieve the "wait" field with the max value using the query below, but I'm stuck on how to get the count of all the documents that satisfy the $match stage.
db = mongo_client[_MONGO_COLLECTION]
cursor = db.aggregate([
    {"$match": {"owner": {"$exists": False}}},
    {
        "$project": {
            "wait": {
                "$divide": [{"$subtract": [datetime.now(), "$creationDate"]}, 1000],
            }
        }
    },
    {
        "$sort": {
            "wait": -1
        }
    },
    {"$limit": 1},
])
for x in cursor:
    print(x)
You can materialize the cursor into a list and count the results (aggregate returns a CommandCursor, which has no count() method in pymongo):
results = list(cursor)
print(len(results))
print(results)
or
you can add a $count stage to the pipeline, as below:
{
    "$count": "count" // the name of the count field
}
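Note that with the $limit: 1 stage in place, a trailing $count would always return 1, so to get both the max wait and the total matched count in one round trip you can split the pipeline with $facet. A sketch under that assumption (the facet names maxWait and total are illustrative):

cursor = db.aggregate([
    {"$match": {"owner": {"$exists": False}}},
    {"$project": {
        "wait": {"$divide": [{"$subtract": [datetime.now(), "$creationDate"]}, 1000]}
    }},
    {"$facet": {
        "maxWait": [{"$sort": {"wait": -1}}, {"$limit": 1}],  # doc with the largest wait
        "total": [{"$count": "count"}],                       # count of all matched docs
    }},
])
result = next(cursor)
print(result["maxWait"][0]["wait"], result["total"][0]["count"])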

Elastic search not giving data with big number for page size

Size of data to get: 20,000 approx.
Issue: I am searching Elasticsearch indexed data using the below command in Python, but not getting any results back.
from pyelasticsearch import ElasticSearch

es_repo = ElasticSearch(settings.ES_INDEX_URL)
search_results = es_repo.search(
    query, index=advertiser_name, es_from=_from, size=_size)
If I give a size less than or equal to 10,000 it works fine, but not with 20,000.
Please help me find an optimal solution to this.
PS: On digging deeper into ES, I found this error message:
Result window is too large, from + size must be less than or equal to: [10000] but was [19999]. See the scrolling API for a more efficient way to request large data sets.
For real-time use, the best solution is the search_after query. You only need a date field plus another field that uniquely identifies each doc - the _id or _uid field is enough.
Try something like this; in my example I extract all the documents that belong to a single user, where the user field has a keyword datatype:
from elasticsearch import Elasticsearch

es = Elasticsearch()
es_index = "your_index_name"
documento = "your_doc_type"
user = "Francesco Totti"

body2 = {
    "query": {
        "term": {"user": user}
    }
}
res = es.count(index=es_index, doc_type=documento, body=body2)
size = res['count']

body = {
    "size": 10,
    "query": {
        "term": {"user": user}
    },
    "sort": [
        {"date": "asc"},
        {"_uid": "desc"}
    ]
}
result = es.search(index=es_index, doc_type=documento, body=body)
bookmark = [result['hits']['hits'][-1]['sort'][0], str(result['hits']['hits'][-1]['sort'][1])]

body1 = {
    "size": 10,
    "query": {
        "term": {"user": user}
    },
    "search_after": bookmark,
    "sort": [
        {"date": "asc"},
        {"_uid": "desc"}
    ]
}

while len(result['hits']['hits']) < size:
    res = es.search(index=es_index, doc_type=documento, body=body1)
    for el in res['hits']['hits']:
        result['hits']['hits'].append(el)
    bookmark = [res['hits']['hits'][-1]['sort'][0], str(res['hits']['hits'][-1]['sort'][1])]
    body1 = {
        "size": 10,
        "query": {
            "term": {"user": user}
        },
        "search_after": bookmark,
        "sort": [
            {"date": "asc"},
            {"_uid": "desc"}
        ]
    }
Then you will find all the docs appended to the result variable.
If you would prefer to use a scroll query (see the scrolling API docs):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
es_index = "your_index_name"
documento = "your_doc_type"
user = "Francesco Totti"

body = {
    "query": {
        "term": {"user": user}
    }
}
res = helpers.scan(
    client=es,
    scroll='2m',
    query=body,
    index=es_index)
for i in res:
    print(i)
Probably it's an Elasticsearch constraint: the index.max_result_window index setting, which defaults to 10,000.
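If you really need plain from/size paging up to 20,000, a sketch of raising that limit on the index (an assumption about your setup; scroll or search_after remains the recommended approach, since a larger window costs more memory per request):

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.put_settings(
    index="your_index_name",
    body={"index": {"max_result_window": 20000}})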

How to return only aggregation results not hits in elasticsearch query dsl

I am writing a query DSL in Python using http://elasticsearch-dsl.readthedocs.io, and I have the following code:
search.aggs.bucket('per_ts', 'terms', field='ts')\
.bucket('load_time', 'percentiles', field='total_req', percents=[99])
response = search.execute()
This works fine, but it also returns hits, which I don't want.
In curl query mode I can get what I want by setting "size": 0 in:
GET /twitter/tweet/_search
{
    "size": 0,
    "aggregations": {
        "my_agg": {
            "terms": {
                "field": "text"
            }
        }
    }
}
I couldn't find a way to use size = 0 in the query DSL.
Referring to the code of elasticsearch-dsl-py/search.py, you can pass extra body parameters through extra():
s = Search().query(...).extra(from_=0, size=25)
This statement should work; in your case, set size=0 to suppress the hits.
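A minimal sketch applied to the aggregation from the question (assuming the same search object and field names):

search = search.extra(size=0)  # extra() returns a copy, so re-assign
search.aggs.bucket('per_ts', 'terms', field='ts')\
    .bucket('load_time', 'percentiles', field='total_req', percents=[99])
response = search.execute()
print(response.aggregations.per_ts.buckets)  # aggregations only; response.hits is empty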
