Python Query Processing and Boolean Search - python

I have an inverted index (as a dictionary) and I want to take a boolean search query as an input to process it and produce a result.
The inverted index is like this:
{
Test : { FileName1: [213, 1889, 27564], FileName2: [133, 9992866, 27272781, 78676818], FileName3: [9211] },
Try : { FileName4 ...
.....
}
Now, given a boolean search query, I have to return the result.
Examples:
Boolean Search Query: test AND try
The result should be all documents that have the words test and try.
Boolean Search Query: test OR try
The result should be all documents that have either test or try.
Boolean Search Query: test AND NOT try
The result should be all documents that have test but not try.
How can I build this search engine to process the given boolean search query?
Thanks in advance!

EDIT: I am retaining the first part of my answer, because if this WASN'T a school assignment, this would be in my opinion still a better way to go about the task. I replace the second part of the answer with update matching OP's question.
What you appear to want to do is to create a query string parser, which would read the query string and translate it into a series of AND/OR/NOT combos to return the correct keys.
There are 2 approaches to this.
According to what you wrote that you need, by far the simplest solution would be to load the data into any SQL database (SQLite, for example, which does not require a full-blown running SQL server), load dictionary keys as a separate field (the rest of your data may all be in a single another field, if you don't care about normal forms &c), and translate incoming queries to SQL, approximately like this:
SQL table has at least this:
CREATE TABLE my_data(
dictkey text,
data text);
python_query="foo OR bar AND NOT gazonk"
sql_keywords=["AND","NOT","OR"]
sql_query=[]
for word in python_query.split(" "):
if word in sql_keywords:
sql_query+=[ word ]
else:
sql_query+=["dictkey='%s'" % word]
real_sql_query=" ".join(sql_query)
This needs some escaping and control checking for SQL injections and special chars, but in general it would just translate your query into SQL, which, when run against the SQL datbase would return the keys (and possibly data) for further processing.
Now for the pure Python version.
What you need to do is to analyze the string you get and apply the logic to your existing Python data.
Analyzing the string to reduce it to specific components (and their interactions) is parsing. If you actually wanted to build your own fully fledged parser, there would be Python modules for that, however, for a school assignment, I expect you are tasked to build your own.
From your description, the query can be expressed in quasi BNF form as:
(<[NOT] word> <AND|OR>)...
Since you say that priority of is not relevant all, you can do it the easy way and parse word by word.
Then you have to match the keywords to the filenames, which, as mentioned in another answer, is easiest to do with sets.
So, it could go approximately like this:
import re
query="foo OR bar AND NOT gazonk"
result_set=set()
operation=None
for word in re.split(" +(AND|OR) +",query):
#word will be in ['foo', 'OR', 'bar', 'AND', 'NOT gazonk']
inverted=False # for "NOT word" operations
if word in ['AND','OR']:
operation=word
continue
if word.find('NOT ') == 0:
if operation is 'OR':
# generally "OR NOT" operation does not make sense, but if it does in your case, you
# should update this if() accordingly
continue
inverted=True
# the word is inverted!
realword=word[4:]
else:
realword=word
if operation is not None:
# now we need to match the key and the filenames it contains:
current_set=set(inverted_index[realword].keys())
if operation is 'AND':
if inverted is True:
result_set -= current_set
else:
result_set &= current_set
elif operation is 'OR':
result_set |= current_set
operation=None
print result_set
Note that this is not a complete solution (for example it does not include dealing with the first term of the query, and it requires the boolean operators to be in uppercase), and is not tested. However, it should serve the primary purpose of showing you how to go about it. Doing more would be writing your course work for you, which would be bad for you. Because you are expected to learn how to do it so you can understand it. Feel free to ask for clarifications.

Another approach could be an in-memory intersection of the posting lists (for your AND cases, you can enhance this for OR, NOT, etc).
Attached a simple merge algorithm to be performed on the posting lists, assuming that the lists are sorted (increasing doc_id order, this can be easily achieved if we index our terms correctly) - this will improve time complexity (O(n1+n2)) as we will perform linear-merge on sorted list and might stop earlier.
Now assume our positional inverted index looks like this: (similar to yours but with posting lists as lists and not dict's- this will be allow compression in future uses) where it maps- String > Terms, while each term consists of (tf, posting list ([P1, P2,...])) and each Posting has (docid, df, position list). Now we can perform a simple AND to all of our postings lists iteratively:
def search(self, sq: BoolQuery) -> list:
# Performs a search from a given query in boolean retrieval model,
# Supports AND queries only and returns sorted document ID's as result:
if sq.is_empty():
return super().search(sq)
terms = [self.index[term] for term in sq.get_terms() if term in self.index]
if not terms:
return []
# Iterate over posting lists and intersect:
result, terms = terms[0].pst_list, terms[1:]
while terms and result:
result = self.intersect(result, terms[0].pst_list)
terms = terms[1:]
return [p.id for p in result]
Now lets look at the intersection:
def intersect(p1: list, p2: list) -> list:
# Performs linear merge of 2x sorted lists of postings,
# Returns the intersection between them (== matched documents):
res, i, j = list(), 0, 0
while i < len(p1) and j < len(p2):
if p1[i].id == p2[j].id:
res.append(p1[i])
i, j = i + 1, j + 1
elif p1[i].id < p2[j].id:
i += 1
else:
j += 1
return res
This simple algorithm can be later expanded when performing phrase search (edit the intersection to calculate slop distance, e.g: |pos1-pos2| < slop)

Taking into account you have that inverted index and that is a dict with test and try as keys you can define the following functions and play with them:
def intersection(list1, list2):
return list(set(list1).intersection(list2))
def union(list1, list2):
return list(set(list1).union(list2))
def notin(list1, list2)
return [filter(lambda x: x not in list1, sublist) for sublist in list2]
intersection(inverted_index['people'].keys(), intersection(inverted_index['test'].keys(), inverted_index['try'].keys()))

Related

Kaggle Python course Exercise: Strings and Dictionaries Q. no. 2

Here's the question,
A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word. Complete the function below to help her filter her list of articles.
Your function should meet the following criteria:
Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
Here's my ans(I want to solve this just using loops and ifs):
def word_search(doc_list, keyword):
"""
Takes a list of documents (each document is a string) and a keyword.
Returns list of the index values into the original list for all documents
containing the keyword.
Example:
doc_list = ['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?']
word_search(doc_list, 'casino')
>>> [0]
"""
#non-course provided and my own code starts here.
k=0
print(doc_list,keyword)
for string in doc_list:
print(string)
for char in string:
if char.upper()==keyword[0] or char.lower()==keyword[0]:
print(char,string[string.index(char)-1])
if (string[string.index(char)-1]==" " or string[string.index(char)-1]=="" or string[string.index(char)-1]==".") and (string[string.index(char)+len(keyword)]==" " or string[string.index(char)+len(keyword)]=="" or string[string.index(char)+len(keyword)]=="."):
print(string[string.index(char)-1])
for k in range(len(keyword)):
print(k)
if string[string.index(char)+k].upper()==keyword[k] or string[string.index(char)+k].lower()==keyword[k]:
c=c+k
if len(c)==len(keyword):
x=[doc_list.index(string)]
return x
But after running the check code:
q2.check() #returns,
Incorrect: Got a return value of None given doc_list=['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?'], keyword='casino', but expected a value of type list. (Did you forget a return statement?)
Here's what gets printed out after executing the code:
['The Learn Python Challenge Casino', 'They bought a car, and a horse',
'Casinoville?'] casino
The Learn Python Challenge Casino
C
C
They bought a car, and a horse
c
Casinoville?
C ?
The code is compiling successfully without syntax and other explicit errors. But I can't find any implicit bugs that's generating a wrong ans after struggling for 5+ hrs. please help!
If I remember correctly Kaggle courses also provide you with the solution which is the solution you should understand and use moving forward. Your code has many conditions and it will be tough to determine which of these conditions is not implemented correctly. Might as well check Kaggles' solution because you cant use this moving forward. Also the solution you have has a nested for-loop checking each letter one-by-one. That is extremely inefficient. Nice beginners attempt though :)
Here is the solution using regex
import re
def word_search(documents, keyword):
res=[]
for i,j in enumerate(documents):
if re.findall('\\b'+keyword+'\\b',j,flags=re.IGNORECASE):
res.append(i)
return res
As stated by the answer, your function should return a list. You are instead returning a None value, because at some points in your nested ifs you are going at the end of your function, in which is not specified any return. When you don't specify any return keyword at the end of your function, it will return None as default
By the way, python offers a lot of utils libraries, for example the str.index() method that return the string index if found in the original string
This I think is a better development of your solution:
def word_search(doc_list, keyword):
"""
Takes a list of documents (each document is a string) and a keyword.
Returns list of the index values into the original list for all documents
containing the keyword.
Example:
doc_list = ['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?']
word_search(doc_list, 'casino')
>>> [0]
"""
my_list = []
for doc in doc_list:
curr_doc = doc.lower()
try:
curr_index = curr_doc.index(keyword.lower())
my_list.append(curr_index)
except:
my_list.append(None)
return my_list
print(word_search(['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?'], 'casino'))
output: [27, None, 0]
As you can see, in my code I am returning a List at the end of the function definition, as requested from the problem
A better approach to solve this would be to use the method contains(). An example of its usage can be found here.
So the algorithm would become:
list_to_return = []
counter = 0
for item in doc_list:
if item.contains(word):
list_to_return.append(counter)
counter += 1
return list_to_return
def word_search(doc_list, keyword):
res = []
sum = 0
for i in range(len(doc_list)-1):
if(doc_list[i] == keyword):
sum=sum+1
res.append(doc_list[i])
return sum, res

Is using string concatenation+eval() to build SQLAlchemy queries okay, or is there a better alternative?

I am using React-Admin as my front end which calls my API using Flask-Restful.
When I use filters, it sends a key value pair in a dict. Between 0 and 3 key value pairs may be requested from my API depending on how many filters are used at once.
I have successfully used a range of if statements to see if the key is in the dict and if it is add the corresponding part of the filter to the query.
However, I have heard it is unsafe to use eval() as others can inject code into your app. I believe as I am setting the strings, it isn't possible. Is this safe to use or is there a better way to set a query using flask-sqlalchemy?
def find_all(cls, sort, fltr):
order = sort[1]
if order == 'ASC':
s = sort[0]
else:
s = sort[0] + " desc"
query_name = "cls.query."
if "T" in fltr:
query_name += 'filter(cls.title.like("%"+fltr["T"]+"%")).'
if "c" in fltr:
if fltr['c']:
query_name += "filter_by(complete=False)."
if "a" in fltr:
query_name += 'filter(cls.assigned_id.like("%"+str(fltr["a"])+"%")).'
query_name += 'order_by(s).all()'
return eval(query_name)
where fltr is the dict.
This is working fine, although is it safe to use in a webapp?
There are two distinct questions here.
Is it safe?
Because the queries can only be composed of segments made from constant substrings you wrote and audited yourself, yes. If you ever substituted user-provided data into those queries, that would no longer be true.
Is it good practice?
Absolutely not. There's a performance penalty, a readability penalty, a correctness penalty (insofar as static checking tools can't read the AST for statements hidden behind eval()), and if someone else in the future wanted to substitute in user-provided data, the original code makes it easier for them to go the string-substitution route and introduce vulnerabilities later.
Consider instead:
def find_all(cls, sort, fltr):
order = sort[1]
if order == 'ASC':
s = sort[0]
else:
s = sort[0] + " desc"
q = cls.query
if "T" in fltr:
q = q.filter(cls.title.like("%"+fltr["T"]+"%"))
if "c" in fltr and fltr['c']:
q = q.filter_by(complete=False)
if "a" in fltr:
q = q.filter(cls.assigned_id.like("%"+str(fltr["a"])+"%"))
return q.order_by(s).all()

Python large list manipulation

I have python list like below:
DEMO_LIST = [
[{'unweighted_criket_data': [-46.14554728131345, 2.997789122813151, -23.66171024766996]},
{'weighted_criket_index_input': [-6.275794430258629, 0.4076993207025885, -3.2179925936831144]},
{'manual_weighted_cricket_data': [-11.536386820328362, 0.7494472807032877, -5.91542756191749]},
{'average_weighted_cricket_data': [-8.906090625293496, 0.5785733007029381, -4.566710077800302]}],
[{'unweighted_football_data': [-7.586729834820534, 3.9521665714843675, 5.702038461085529]},
{'weighted_football_data': [-3.512655913521907, 1.8298531225972623, 2.6400438074826]},
{'manual_weighted_football_data': [-1.8966824587051334, 0.9880416428710919, 1.4255096152713822]},
{'average_weighted_football_data': [-2.70466918611352, 1.4089473827341772, 2.0327767113769912]}],
[{'unweighted_rugby_data': [199.99999999999915, 53.91020408163265, -199.9999999999995]},
{'weighted_rugby_data': [3.3999999999999857, 0.9164734693877551, -3.3999999999999915]},
{'manual_rugby_data': [49.99999999999979, 13.477551020408162, -49.99999999999987]},
{'average_weighted_rugby_data': [26.699999999999886, 7.197012244897959, -26.699999999999932]}],
[{'unweighted_swimming_data': [2.1979283454982053, 14.079951031527246, -2.7585499298828777]},
{'weighted_swimming_data': [0.8462024130168091, 5.42078114713799, -1.062041723004908]},
{'manual_weighted_swimming_data': [0.5494820863745513, 3.5199877578818115, -0.6896374824707194]},
{'average_weighted_swimming_data': [0.6978422496956802, 4.470384452509901, -0.8758396027378137]}]]
I want to manipulate list items and do some basic math operation,like getting each data type list (example taking all first element of unweighted data and do sum etc)
Currently I am doing it like this.
The current solution is a very basic one, I want to do it in such way that if the list length is grown, it can automatically calculate the results. Right now there are four list, it can be 5 or 8,the final result should be the summation of all the first element of unweighted values,example:
now I am doing result_u1/4,result_u2/4,result_u3/4
I want it like result_u0/4,result_u1/4.......result_n4/4 # n is the number of list inside demo list
Any idea how I can do that?
(sorry for the beginner question)
You can implement a specific list class for yourself, that adds your summary with new item's values in append function, or decrease them on remove:
class MyList(list):
def __init__(self):
self.summary = 0
list.__init__(self)
def append(self, item):
self.summary += item.sample_value
list.append(self, item)
def remove(self, item):
self.summary -= item.sample_value
list.remove(self, item)
And a simple usage:
my_list = MyList()
print my_list.summary # Outputs 0
my_list.append({'sample_value': 10})
print my_list.summary # Outputs 10
In Python, whenever you start counting how many there are of something inside an iterable (a string, a list, a set, a collection of any of these) in order to loop over it - its a sign that your code can be revised.
Things can can work for 3 of something, can work for 300, 3000 and 3 million of the same thing without changing your code.
In your case, your logic is - "For every X inside DEMO_LIST, do something"
This translated into Python is:
for i in DEMO_LIST:
# do something with i
This snippet will run through any size of DEMO_LIST and each time i is each of whatever is in side DEMO_LIST. In your case it is the list that contains your dictionaries.
Further expanding on that, you can say:
for i in DEMO_LIST:
for k in i:
# now you are in each list that is inside the outer DEMO_LIST
Expanding this to do a practical example; a sum of all unweighted_criket_data:
all_unweighted_cricket_data = []
for i in DEMO_LIST:
for k in i:
if 'unweighted_criket_data' in k:
for data in k['unweighted_cricket_data']:
all_unweighted_cricked_data.append(data)
sum_of_data = sum(all_unweighted_cricket_data)
There are various "shortcuts" to do the same, but you can appreciate those once you understand the "expanded" version of what the shortcut is trying to do.
Remember there is nothing wrong with writing it out the 'long way' especially when you are not sure of the best way to do something. Once you are comfortable with the logic, then you can use shortcuts like list comprehensions.
Start by replacing this:
for i in range(0,len(data_list)-1):
result_u1+=data_list[i][0].values()[0][0]
result_u2+=data_list[i][0].values()[0][1]
result_u3+=data_list[i][0].values()[0][2]
print "UNWEIGHTED",result_u1/4,result_u2/4,result_u3/4
With this:
sz = len(data_list[i][0].values()[0])
result_u = [0] * sz
for i in range(0,len(data_list)-1):
for j in range(0,sz):
result_u[j] += data_list[i][0].values()[0][j]
print "UNWEIGHTED", [x/len(data_list) for x in result_u]
Apply similar changes elsewhere. This assumes that your data really is "rectangular", that is to say every corresponding inner list has the same number of values.
A slightly more "Pythonic"[*] version of:
for j in range(0,sz):
result_u[j] += data_list[i][0].values()[0][j]
is:
for j, dataval in enumerate(data_list[i][0].values()[0]):
result_u[j] += dataval
There are some problems with your code, though:
values()[0] might give you any of the values in the dictionary, since dictionaries are unordered. Maybe it happens to give you the unweighted data, maybe not.
I'm confused why you're looping on the range 0 to len(data_list)-1: if you want to include all the sports you need 0 to len(data_list), because the second parameter to range, the upper limit, is excluded.
You could perhaps consider reformatting your data more like this:
DEMO_LIST = {
'cricket' : {
'unweighted' : [1,2,3],
'weighted' : [4,5,6],
'manual' : [7,8,9],
'average' : [10,11,12],
},
'rugby' : ...
}
Once you have the same keys in each sport's dictionary, you can replace values()[0] with ['unweighted'], so you'll always get the right dictionary entry. And once you have a whole lot of dictionaries all with the same keys, you can replace them with a class or a named tuple, to define/enforce that those are the values that must always be present:
import collections
Sport = collections.namedtuple('Sport', 'unweighted weighted manual average')
DEMO_LIST = {
'cricket' : Sport(
unweighted = [1,2,3],
weighted = [4,5,6],
manual = [7,8,9],
average = [10,11,12],
),
'rugby' : ...
}
Now you can replace ['unweighted'] with .unweighted.
[*] The word "Pythonic" officially means something like, "done in the style of a Python programmer, taking advantage of any useful Python features to produce the best idiomatic Python code". In practice it usually means "I prefer this, and I'm a Python programmer, therefore this is the correct way to write Python". It's an argument by authority if you're Guido van Rossum, or by appeal to nebulous authority if you're not. In almost all circumstances it can be replaced with "good IMO" without changing the sense of the sentence ;-)

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
if res[0] in unwanted_resource_types:
result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
for type in unwanted_resource_types:
if res[0] == type:
result.append(res[1])
also to no avail. Is there something i'm missing? I believe this would be the right place to perform list comprehension, but that's still in my grey basket of understanding fully (The Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
if int(res[0]) in unwanted_resource_types:
result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
if int( res[0] ) in unwanted_resource_types:
result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers, change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ... you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).
The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.

Finding partial strings in a list of strings - python

I am trying to check if a user is a member of an Active Directory group, and I have this:
ldap.set_option(ldap.OPT_REFERRALS, 0)
try:
con = ldap.initialize(LDAP_URL)
con.simple_bind_s(userid+"#"+ad_settings.AD_DNS_NAME, password)
ADUser = con.search_ext_s(ad_settings.AD_SEARCH_DN, ldap.SCOPE_SUBTREE, \
"sAMAccountName=%s" % userid, ad_settings.AD_SEARCH_FIELDS)[0][1]
except ldap.LDAPError:
return None
ADUser returns a list of strings:
{'givenName': ['xxxxx'],
'mail': ['xxxxx#example.com'],
'memberOf': ['CN=group1,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
'CN=group2,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
'CN=group3,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
'CN=group4,OU=Projects,OU=Office,OU=company,DC=domain,DC=com'],
'sAMAccountName': ['myloginid'],
'sn': ['Xxxxxxxx']}
Of course in the real world the group names are verbose and of varied structure, and users will belong to tens or hundreds of groups.
If I get the list of groups out as ADUser.get('memberOf')[0], what is the best way to check if any members of a separate list exist in the main list?
For example, the check list would be ['group2', 'group16'] and I want to get a true/false answer as to whether any of the smaller list exist in the main list.
If the format example you give is somewhat reliable, something like:
import re
grps = re.compile(r'CN=(\w+)').findall
def anyof(short_group_list, adu):
all_groups_of_user = set(g for gs in adu.get('memberOf',()) for g in grps(gs))
return sorted(all_groups_of_user.intersection(short_group_list))
where you pass your list such as ['group2', 'group16'] as the first argument, your ADUser dict as the second argument; this returns an alphabetically sorted list (possibly empty, meaning "none") of the groups, among those in short_group_list, to which the user belongs.
It's probably not much faster to just a bool, but, if you insist, changing the second statement of the function to:
return any(g for g in short_group_list if g in all_groups_of_user)
might possibly save a certain amount of time in the "true" case (since any short-circuits) though I suspect not in the "false" case (where the whole list must be traversed anyway). If you care about the performance issue, best is to benchmark both possibilities on data that's realistic for your use case!
If performance isn't yet good enough (and a bool yes/no is sufficient, as you say), try reversing the looping logic:
def anyof_v2(short_group_list, adu):
gset = set(short_group_list)
return any(g for gs in adu.get('memberOf',()) for g in grps(gs) if g in gset)
any's short-circuit abilities might prove more useful here (at least in the "true" case, again -- because, again, there's no way to give a "false" result without examining ALL the possibilities anyway!-).
You can use set intersection (& operator) once you parse the group list out. For example:
> memberOf = 'CN=group1,OU=Projects,OU=Office,OU=company,DC=domain,DC=com'
> groups = [token.split('=')[1] for token in memberOf.split(',')]
> groups
['group1', 'Projects', 'Office', 'company', 'domain', 'com']
> checklist1 = ['group1', 'group16']
> set(checklist1) & set(groups)
set(['group1'])
> checklist2 = ['group2', 'group16']
> set(checklist2) & set(groups)
set([])
Note that a conditional evaluation on a set works the same as for lists and tuples. True if there are any elements in the set, False otherwise. So, "if set(checklist2) & set(groups): ..." would not execute since the condition evaluates to False in the above example (the opposite is true for the checklist1 test).
Also see:
http://docs.python.org/library/sets.html

Categories