Elasticsearch and Python - Correct methodology - python

I am building a search engine for a list of articles I have. Many people advised me to use Elasticsearch for full-text search. I wrote the following code, and it works, but I have a few questions.
1) If the same article is added twice (that is, indexdoc is run twice for the same article), it is accepted and indexed twice. Is there a way to have a "unique key" in the search index?
2) How can I change the scoring / ranking function? I want to give more importance to the title.
3) Is this the correct way to do it anyway?
4) How do I show related results if the query contains a spelling mistake?
from elasticsearch import Elasticsearch
from crsq.models import ArticleInfo
es = Elasticsearch()
def indexdoc(articledict):
    doc = {
        'text': articledict['articlecontent'],
        'title': articledict['articletitle'],
        'url': articledict['url']
    }
    res = es.index(index="article-index", doc_type='article', body=doc)
def searchdoc(keywordstr):
    res = es.search(index="article-index", body={"query": {"query_string": {"query": keywordstr}}})
    print("Got %d Hits:" % res['hits']['total'])
    for hit in res['hits']['hits']:
        print("%(url)s: %(text)s" % hit["_source"])
def indexurl(url):
    articledict = ArticleInfo.objects.filter(url=url).values()
    if len(articledict):
        # values() yields a sequence of dicts, so index the first match
        indexdoc(articledict[0])
    return

1) You have to specify an id for your document. Pass the id parameter when indexing; indexing again with the same id overwrites the existing document instead of adding a second copy:
res = es.index(index="article-index", doc_type='article', body=doc, id="some_unique_id")
2) There is more than one way to do this. For example, you can boost the title by changing your query slightly:
{"query": {"query_string": {"query": keywordstr, "fields": ["text", "title^2"]}}}
With this change, a match in title counts twice as much as a match in text.
3) As a proof of concept, it is not bad.
4) This is a big topic. I think you should check the documentation on suggesters.
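A minimal sketch of points 1, 2 and 4, assuming article URLs are unique and can serve as the basis for the document id. The helper names (article_id, boosted_query, title_suggest_body) and the suggestion label "title-suggestion" are hypothetical; the functions only build values to hand to es.index and es.search and do not contact a server.

```python
import hashlib


def article_id(url):
    """Derive a stable document id from the article URL (assumes URLs are unique)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def boosted_query(keywordstr, title_boost=2):
    """Build a query_string body that weights title matches more than text matches."""
    return {
        "query": {
            "query_string": {
                "query": keywordstr,
                "fields": ["text", "title^%d" % title_boost],
            }
        }
    }


def title_suggest_body(keywordstr):
    """Build a term-suggester body for spelling corrections on the title field."""
    return {
        "suggest": {
            "title-suggestion": {
                "text": keywordstr,
                "term": {"field": "title"},
            }
        }
    }
```

With these, indexing could look like es.index(index="article-index", doc_type='article', body=doc, id=article_id(doc['url'])), so running it twice for the same URL updates one document rather than creating two.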

Related

pymongo query for all items containing a unique identifier

I have a mongo collection with data structured in the following way:
content: {'description': {'text': [{'_date': '2019-05-21', '_sectionId': 'a13a', '_objectId': 'f637cee'},
                                   {'_date': '2019-05-21', '_objectId': '8b2ed183', '_source': 'f637cee'},
                                   { etc....},
                                   {'_date': '2019-05-21', '_sectionId': 'a13a', '_objectId': 'XXXcee'}]
                         },
          'client' : {.....},
         }
I am looking for a way to query the collection and get a list of tuples:
given a section id, I would like to get the corresponding '_objectId' values.
In this case the result would be:
('a13a','f637cee'), ('a13a','XXXcee')
I started with something like this:
import pymongo
myclient = pymongo.MongoClient(mongoconnection)
print('databases names:')
myclient.list_database_names()
# getting the collection:
mydb = myclient["clients"]
query = {'content.description.text._sectionId': 'a13a'}
cur = mydb.find(query)
But I don't know how to extract the information from the cursor.
Any help?
Note that the info might be nested in different places, i.e. there are more nodes preceding "content" that can vary.
Thanks a lot
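Use the second parameter of find() (the projection) to return only the required fields. Because the fields are nested, the projection needs their full dotted paths, and the values then have to be read out of the nested documents.
Ex:
query = {'content.description.text._sectionId': 'a13a'}
fields = {'_id': 0, 'content.description.text._sectionId': 1, 'content.description.text._objectId': 1}
cur = mydb.find(query, fields)
print([(t['_sectionId'], t['_objectId']) for doc in cur for t in doc['content']['description']['text'] if t.get('_sectionId') == 'a13a'])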
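Since the question notes that the nesting above "content" can vary, here is a hedged sketch that walks each returned document recursively instead of relying on a fixed path. The function name find_pairs is hypothetical, and it works on plain dicts and lists, which is what iterating a pymongo cursor yields.

```python
def find_pairs(node, section_id):
    """Collect (_sectionId, _objectId) tuples from anywhere inside a nested document."""
    pairs = []
    if isinstance(node, dict):
        # A sub-document counts as a match only if it carries both keys.
        if node.get("_sectionId") == section_id and "_objectId" in node:
            pairs.append((node["_sectionId"], node["_objectId"]))
        for value in node.values():
            pairs.extend(find_pairs(value, section_id))
    elif isinstance(node, list):
        for item in node:
            pairs.extend(find_pairs(item, section_id))
    return pairs


# Sample document shaped like the one in the question.
sample = {
    "content": {
        "description": {
            "text": [
                {"_date": "2019-05-21", "_sectionId": "a13a", "_objectId": "f637cee"},
                {"_date": "2019-05-21", "_objectId": "8b2ed183", "_source": "f637cee"},
                {"_date": "2019-05-21", "_sectionId": "a13a", "_objectId": "XXXcee"},
            ]
        },
        "client": {},
    }
}
print(find_pairs(sample, "a13a"))
```

Applied to a cursor, this would be something like [p for doc in cur for p in find_pairs(doc, 'a13a')].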

IN Query not working for Amazon DynamoDB

I would like to retrieve items whose attribute value is present in a list of values I provide. Below is the query I use for searching. Unfortunately, the response returns an empty list of items. I don't understand why this is the case and would like to know the correct query.
def search(self, src_words, translations):
    entries = []
    query_src_words = [word.decode("utf-8") for word in src_words]
    params = {
        "TableName": self.table,
        "FilterExpression": "src_word IN (:src_words) AND src_language = :src_language AND target_language = :target_language",
        "ExpressionAttributeValues": {
            ":src_words": {"SS": query_src_words},
            ":src_language": {"S": config["source_language"]},
            ":target_language": {"S": config["target_language"]}
        }
    }
    page_iterator = self.paginator.paginate(**params)
    for page in page_iterator:
        for entry in page["Items"]:
            entries.append(entry)
    return entries
Below is the table that I would like to query. For example, if query_src_words contains [soccer ball, dog], then only the row with entry_id=2 should be returned.
Any insights would be much appreciated.
I think this is because in query_src_words you have "soccer_ball" (with an underscore), while in the database you have "soccer ball" (without an underscore).
Change "soccer_ball" to "soccer ball" in your query_src_words and it should work fine.
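Separately from the spelling of the values, it may be worth checking how the IN operand is passed. As far as I know, IN compares the attribute against each operand listed in the parentheses, so a single string-set placeholder is treated as one operand rather than expanded. The sketch below builds one placeholder per value; the function name build_in_filter and the ":v0", ":v1" naming are hypothetical.

```python
def build_in_filter(attribute, values, prefix="v"):
    """Return an 'attr IN (:v0, :v1, ...)' expression and its attribute values."""
    if not values:
        raise ValueError("IN needs at least one value")
    # One placeholder per value, each typed as a DynamoDB string ("S").
    placeholders = {":%s%d" % (prefix, i): {"S": v} for i, v in enumerate(values)}
    expression = "%s IN (%s)" % (attribute, ", ".join(placeholders))
    return expression, placeholders


expression, attribute_values = build_in_filter("src_word", ["soccer ball", "dog"])
print(expression)
```

The returned expression can be joined with the language conditions using AND, and the returned dict merged into ExpressionAttributeValues.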

Jira Python Custom Fields

I am writing a script to create bugs. We have many custom fields and I cannot figure out how to get them to work correctly in the python code. Can someone please help explain? I have read through as many articles as I can find but none of the solutions are working.
One example of my custom field names is customfield_15400 and has a default value of "NO". The error I get with my below code is:
response text = {"errorMessages":[],"errors":{"customfield_15400":"Could not find valid 'id' or 'value' in the Parent Option object."}}
Code:
project_dict = {'Aroid':'SA', 'OS':'SC'}
epic_dict = {'Aroid':'SA-108', 'OS':'SC-3333'}
for index, row in bugs.iterrows():
    issue = st_jira.create_issue(
        project=project_dict[row['OS']],
        summary="[UO] QA Issue with '%s' on '%s'" % (row['Event Action'], row['Screen Name']),
        issuetype={'name': 'Bug'},
        customfield_15400="No"
    )
Try the following:
customfield_15400={ 'value' : 'NO' }
You can also do the following, value_id being the id of the value in your select field:
customfield_15400={ 'id' : 'value_id' }
The value of a select field is an object, identified either by its value or by its id.
In case anyone else needs the solution, the code below works.
project_dict = {'Android':'SA', 'iOS':'SIC'}
epic_dict = {'Android':'SA-18', 'iOS':'SIC-19'}
for index, row in bugs.iterrows():
    issue = st_jira.create_issue(
        summary="[UO] QA Issue with '%s' on '%s'" % (row['Event Action'], row['Screen Name']),
        labels=['UO'],
        assignee={"name": ""},
        versions=[{"name": "4.4"}],
        fields={'project': project_dict[row['OS']],
                'summary': "[UO] QA Issue with '%s' on '%s'" % (row['Event Action'], row['Screen Name']),
                'labels': ['UO'],
                'assignee': {"name": ""},
                'versions': [{"name": "4.4"}],
                'issuetype': {'name': 'Bug'},
                'customfield_15400': {'value': 'Yes'}}
    )
I have a multi-select list, and the following update fails:
issue.update(fields={'customfield_10100': {'value','Two'}})
It throws an error saying the data was not an array:
"response text = {"errorMessages":[],"errors":{"Custom_field":"data was not an array"}}"
Note that {'value','Two'} is a Python set, not a dict, and a multi-select field expects an array (a list) of option objects. You could try it like this:
issue.update(fields={'customfield_10100': [{'value': "Two"}]})
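To keep the two shapes apart, a small sketch with hypothetical helper names (select_value, multi_select_values) that build the payloads described in the answers: a single object for a select field and a list of objects for a multi-select field. The custom field ids are the ones from the posts above.

```python
def select_value(option):
    """Payload for a single-select custom field: one option object."""
    return {"value": option}


def multi_select_values(options):
    """Payload for a multi-select custom field: a list of option objects."""
    return [{"value": option} for option in options]


fields = {
    "customfield_15400": select_value("Yes"),
    "customfield_10100": multi_select_values(["Two"]),
}
print(fields)
```

The resulting dict could then be passed as issue.update(fields=fields).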

How to set group = true in couchdb

I am trying to use map/reduce to find duplicated data in CouchDB.
The map function looks like this:
function(doc) {
    if (doc.coordinates) {
        emit({
            twitter_id: doc.id_str,
            text: doc.text,
            coordinates: doc.coordinates
        }, 1);
    }
}
and the reduce function is:
function(keys,values,rereduce){return sum(values)}
I want the sum of the values for each distinct key, but it just adds everything together and I get this result:
<Row key=None, value=1035>
Is that a problem with group? How can I set it to true?
Assuming you're using the couchdb package from PyPI, you pass the options you need to the view as keyword arguments, for example by unpacking a dictionary.
For example:
import couchdb
# the design doc and view name of the view you want to use
ddoc = "my_design_document"
view_name = "my_view"
# your server
server = couchdb.Server("http://localhost:5984")
db = server["aCouchDatabase"]
# naming convention when passing a ddoc and view to the view method
view_string = ddoc + "/" + view_name
# query options; group_level selects how many leading elements of an array key to group by
view_options = {"reduce": True,
                "group": True,
                "group_level": 2}
# call the view, unpacking the options as keyword arguments
results = db.view(view_string, **view_options)
for row in results:
    # do something
    pass
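To illustrate what group=true changes, here is a plain-Python imitation of the reduce step. It is not how CouchDB implements it, only a sketch of the observable result; the function name grouped_reduce and the sample rows are made up for illustration.

```python
import json


def grouped_reduce(rows):
    """Sum values per distinct key, the way a sum reduce behaves with group=true."""
    totals = {}
    for key, value in rows:
        # Keys here are JSON objects, so serialize them to get a hashable form.
        serialized = json.dumps(key, sort_keys=True)
        totals[serialized] = totals.get(serialized, 0) + value
    return [(json.loads(serialized), total) for serialized, total in totals.items()]


rows = [
    ({"twitter_id": "1", "text": "a"}, 1),
    ({"twitter_id": "1", "text": "a"}, 1),
    ({"twitter_id": "2", "text": "b"}, 1),
]
ungrouped_total = sum(value for _, value in rows)  # one row with key None
grouped = grouped_reduce(rows)                      # one row per distinct key
print(ungrouped_total, grouped)
```

Without grouping, everything collapses into a single total, which matches the key=None row in the question; with grouping, any key whose total is greater than 1 is a duplicate.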

Mongoengine, retrieving only some of a MapField

For example, in MongoDB:
> db.test.findOne({}, {'mapField.BOXFLUX':1})
{
    "_id" : ObjectId("4fb7b248c450190a2000006a"),
    "mapField" : {
        "BOXFLUX" : {
            "a" : "f",
        }
    }
}
The 'mapField' field is a Mongoengine MapField.
It holds a lot of keys and data, but here I retrieved only 'BOXFLUX'.
The same query does not work in MongoEngine.
For example:
BoxfluxDocument.objects( ~~ querying ~~ ).only('mapField.BOXFLUX')
As you can see, neither only('mapField.BOXFLUX') nor only('mapField__BOXFLUX') works.
It retrieves all of the 'mapField' data, including 'BOXFLUX'.
How can I retrieve only one key of a MapField?
I see there is a ticket for this: https://github.com/hmarr/mongoengine/issues/508
It works for me. Here's an example test case:
def test_only_with_mapfields(self):
    class BlogPost(Document):
        content = StringField()
        author = MapField(field=StringField())
    BlogPost.drop_collection()
    post = BlogPost(content='Had a good coffee today...',
                    author={'name': "Ross", "age": "20"}).save()
    obj = BlogPost.objects.only('author__name',).get()
    self.assertEqual(obj.author['name'], "Ross")
    self.assertEqual(obj.author.get("age", None), None)
Try this:
query = BlogPost.objects({your: query})
if name:
    query = query.only('author__'+name)
else:
    query = query.only('author')
I found my mistake! I called only() twice.
For example:
BlogPost.objects.only('author').only('author__name')
I spent a whole day trying to find out what was wrong with Mongoengine.
So my (wrong) workaround was:
BlogPost.objects()._collection.find_one(~~ filtering query ~~, {'author.'+ name:1})
But as you know, that returns raw data, not a Mongoengine query.
With this code, I cannot run any Mongoengine methods on the result.
In my case, I have to build the query depending on some conditions,
so in my humble opinion it would be great if a later only() call overrode the only() calls made before it.
I hope this feature will be integrated in a future version. Right now, I have to write duplicate code.
Not this code:
query = BlogPost.objects()
query( query~~).only('author')
if name:
    query = query.only('author__'+name)
But this code:
query = BlogPost.objects()
query( query~~).only('author')
if name:
    query = BlogPost.objects().only('author__'+name)
I think the second one looks dirtier than the first one.
Of course, the first version returns all of the author data,
because only('author') is applied rather than only('author__name').
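One way to avoid chaining only() twice is to decide the field path first and call only() a single time. A minimal sketch, where the helper name only_path is hypothetical and the double-underscore path format is the one used in the test case above:

```python
def only_path(field, key=None):
    """Return the path to hand to only(): the whole map field, or a single key of it."""
    if key:
        return "%s__%s" % (field, key)
    return field


# Intended use, with BlogPost and name as in the posts above:
#   query = BlogPost.objects(...).only(only_path('author', name))
print(only_path("author", "name"), only_path("author"))
```

This keeps the conditional logic in one place, so the query itself is built only once.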
