I have an inverted index (as a dictionary) and I want to take a boolean search query as an input to process it and produce a result.
The inverted index is like this:
{
Test : { FileName1: [213, 1889, 27564], FileName2: [133, 9992866, 27272781, 78676818], FileName3: [9211] },
Try : { FileName4 ...
.....
}
Now, given a boolean search query, I have to return the result.
Examples:
Boolean Search Query: test AND try
The result should be all documents that have the words test and try.
Boolean Search Query: test OR try
The result should be all documents that have either test or try.
Boolean Search Query: test AND NOT try
The result should be all documents that have test but not try.
How can I build this search engine to process the given boolean search query?
Thanks in advance!
EDIT: I am retaining the first part of my answer, because if this WASN'T a school assignment, this would be in my opinion still a better way to go about the task. I replace the second part of the answer with update matching OP's question.
What you appear to want to do is to create a query string parser, which would read the query string and translate it into a series of AND/OR/NOT combos to return the correct keys.
There are 2 approaches to this.
According to what you wrote that you need, by far the simplest solution would be to load the data into any SQL database (SQLite, for example, which does not require a full-blown running SQL server), load dictionary keys as a separate field (the rest of your data may all be in a single another field, if you don't care about normal forms &c), and translate incoming queries to SQL, approximately like this:
SQL table has at least this:
CREATE TABLE my_data(
dictkey text,
data text);
python_query="foo OR bar AND NOT gazonk"
sql_keywords=["AND","NOT","OR"]
sql_query=[]
for word in python_query.split(" "):
if word in sql_keywords:
sql_query+=[ word ]
else:
sql_query+=["dictkey='%s'" % word]
real_sql_query=" ".join(sql_query)
This needs some escaping and control checking for SQL injections and special chars, but in general it would just translate your query into SQL, which, when run against the SQL datbase would return the keys (and possibly data) for further processing.
Now for the pure Python version.
What you need to do is to analyze the string you get and apply the logic to your existing Python data.
Analyzing the string to reduce it to specific components (and their interactions) is parsing. If you actually wanted to build your own fully fledged parser, there would be Python modules for that, however, for a school assignment, I expect you are tasked to build your own.
From your description, the query can be expressed in quasi BNF form as:
(<[NOT] word> <AND|OR>)...
Since you say that priority of is not relevant all, you can do it the easy way and parse word by word.
Then you have to match the keywords to the filenames, which, as mentioned in another answer, is easiest to do with sets.
So, it could go approximately like this:
import re
query="foo OR bar AND NOT gazonk"
result_set=set()
operation=None
for word in re.split(" +(AND|OR) +",query):
#word will be in ['foo', 'OR', 'bar', 'AND', 'NOT gazonk']
inverted=False # for "NOT word" operations
if word in ['AND','OR']:
operation=word
continue
if word.find('NOT ') == 0:
if operation is 'OR':
# generally "OR NOT" operation does not make sense, but if it does in your case, you
# should update this if() accordingly
continue
inverted=True
# the word is inverted!
realword=word[4:]
else:
realword=word
if operation is not None:
# now we need to match the key and the filenames it contains:
current_set=set(inverted_index[realword].keys())
if operation is 'AND':
if inverted is True:
result_set -= current_set
else:
result_set &= current_set
elif operation is 'OR':
result_set |= current_set
operation=None
print result_set
Note that this is not a complete solution (for example it does not include dealing with the first term of the query, and it requires the boolean operators to be in uppercase), and is not tested. However, it should serve the primary purpose of showing you how to go about it. Doing more would be writing your course work for you, which would be bad for you. Because you are expected to learn how to do it so you can understand it. Feel free to ask for clarifications.
Another approach could be an in-memory intersection of the posting lists (for your AND cases, you can enhance this for OR, NOT, etc).
Attached a simple merge algorithm to be performed on the posting lists, assuming that the lists are sorted (increasing doc_id order, this can be easily achieved if we index our terms correctly) - this will improve time complexity (O(n1+n2)) as we will perform linear-merge on sorted list and might stop earlier.
Now assume our positional inverted index looks like this: (similar to yours but with posting lists as lists and not dict's- this will be allow compression in future uses) where it maps- String > Terms, while each term consists of (tf, posting list ([P1, P2,...])) and each Posting has (docid, df, position list). Now we can perform a simple AND to all of our postings lists iteratively:
def search(self, sq: BoolQuery) -> list:
# Performs a search from a given query in boolean retrieval model,
# Supports AND queries only and returns sorted document ID's as result:
if sq.is_empty():
return super().search(sq)
terms = [self.index[term] for term in sq.get_terms() if term in self.index]
if not terms:
return []
# Iterate over posting lists and intersect:
result, terms = terms[0].pst_list, terms[1:]
while terms and result:
result = self.intersect(result, terms[0].pst_list)
terms = terms[1:]
return [p.id for p in result]
Now lets look at the intersection:
def intersect(p1: list, p2: list) -> list:
# Performs linear merge of 2x sorted lists of postings,
# Returns the intersection between them (== matched documents):
res, i, j = list(), 0, 0
while i < len(p1) and j < len(p2):
if p1[i].id == p2[j].id:
res.append(p1[i])
i, j = i + 1, j + 1
elif p1[i].id < p2[j].id:
i += 1
else:
j += 1
return res
This simple algorithm can be later expanded when performing phrase search (edit the intersection to calculate slop distance, e.g: |pos1-pos2| < slop)
Taking into account you have that inverted index and that is a dict with test and try as keys you can define the following functions and play with them:
def intersection(list1, list2):
return list(set(list1).intersection(list2))
def union(list1, list2):
return list(set(list1).union(list2))
def notin(list1, list2)
return [filter(lambda x: x not in list1, sublist) for sublist in list2]
intersection(inverted_index['people'].keys(), intersection(inverted_index['test'].keys(), inverted_index['try'].keys()))
I'm looking for a way to reference joined table inside filter. My query is:
session.query(A).outerjoin(B, C, D).filter(B.column_b == 1, C.column_c == 2)
How can I do this kind of filtering without naming model B and C?
session.query(A).outerjoin(B, C, D).filter_by(column_b=1, column_c=2)
Doesn't work because filter_by tries to find 'column_b' and 'column_c' in D. aliased() and alias() also don't work.
I receive dict of field_to_value in a form {"column_a": 0, "column_b": 1, "column_c": 2} And I need to create a query out of it. Of course I could have a mapping like {"column_a": A.column_b, "column_b": B.column_b, "column_c": C.column_c}. But there should be a better way.
Is there any good way to do such filtering?
filter_by applies on the last join so you should be able to break your outerjoins. Try:
session.query(A).outerjoin(B).filter_by(column_b == 1)
.outerjoin(C).filter_by(column_c == 2)
.outerjoin(D)
I'm a newbie. I want to put a condition on numbers. If I check one number, it works fine. But if I want it to return the same value for two differ)
for row in row2:
if int == 1:
return = "Music"
but I want to expand it further to include other numbers. So,
for row in row2:
if int == 1 , 2:
return = "Music"
I've also tried using or in the second option but it didn't work. Any help to check different integers to bring back a certain result would be great.
To test multiple values:
if i in (1, 2):
Note that return is a keyword and you can't assign a value to it.
int is a predefined identifier; you can assign a value to it, but you
really don't want to; it will confuse things terribly.
Do you mean
for row in row2:
if int == 1 and 2:
return = "Music"
or
for row in row2:
if int in (1 , 2):
return = "Music"
Also, int is a built-in function name. So don't use it as a variable name. And return = "Music" is invalid syntax because return is a keyword in Python. Maybe you mean return "Music".
Three mistakes in your code
int is a type, shouldn't use it as a variable name
The syntax for your if statement is wrong
You shouldn't use an = when returning
What you want is probably:
if row in (1,2):
return "Music"
Or
if row == 1 or row == 2:
return "Music"
I have a sql query as follows
select cloumn1,column2,count(column1) as c
from Table1 where user_id='xxxxx' and timestamp > xxxxxx
group by cloumn1,column2
order by c desc limit 1;
And I successed in write the sqlalchemy equvalent
result = session.query(Table1.field1,Table1.field2,func.count(Table1.field1)).filter(
Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
Table1.field1,Travelog.field2).order_by(desc(func.count(Table1.field1))).first()
But I want to avoid using func.count(Table1.field1) in the order_by clause.
How can I use alias in sqlalchemy? Can any one show any example?
Aliases are for tables; columns in a query are given a label instead. This trips me up from time to time too.
You can go about this two ways. It is sufficient to store the func.count() result is a local variable first and reuse that:
field1_count = func.count(Table1.field1)
result = session.query(Table1.field1, Table1.field2, field1_count).filter(
Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
Table1.field1, Travelog.field2).order_by(desc(field1_count)).first()
The SQL produced would still be the same as your own code would generate, but at least you don't have to type out the func.count() call twice.
To give this column an explicit label, call the .label() method on it:
field1_count = func.count(Table1.field1).label('c')
and you can then use that same label string in the order_by clause:
result = session.query(Table1.field1, Table1.field2, field1_count).filter(
Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
Table1.field1, Travelog.field2).order_by(desc('c')).first()
or you could use the field1_count.name attribute:
result = session.query(Table1.field1, Table1.field2, field1_count).filter(
Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
Table1.field1, Travelog.field2).order_by(desc(field1_count.name)).first()
Can also using the c which is an alias of the column attribute but in this case a label will work fine as stated.
Will also point out that the filter doesn't need to be used multiple times can pass comma separated criterion.
result = (session.query(Table1.field1, Table1.field2,
func.count(Table1.field1).label('total'))
.filter(Table1.c.user_id == self.user_id, Table1.timestamp > self.from_ts)
.group_by(Table1.field1,Table1.field2)
.order_by(desc('total')).first())
I have a function which receives an optional argument.
I am querying a database table within this function.
What I would like is:
If the optional argument is specified, I want to add another additional .filter() to my database query.
My query line is already rather long so I don't want to do If .. else .. in which I repeat the whole query twice.
What is the way to do this?
Below is an example to my query and if my_val is specified, I need to add another filtering line.
def my_def (my_val):
query = Session.query(Table1, Table2).\
filter(Table1.c1.in_(some_val)).\
filter(Table1.c2 == 113).\
filter(Table2.c3 == val1).\
filter(Table1.c4 == val2).\
filter(Table2.c5 == val5).\
all()
You can wait to call the .all() method on the query set, something like this:
def my_def (my_val=my_val):
query = Session.query(Table1, Table2).\
filter(Table1.c1.in_(some_val)).\
filter(Table1.c2 == 113).\
filter(Table2.c3 == val1).\
filter(Table1.c4 == val2).\
filter(Table2.c5 == val5)
if my_val:
query = query.filter(Table1.c6 == my_val)
return query.all()