Finding partial strings in a list of strings - python

I am trying to check if a user is a member of an Active Directory group, and I have this:
ldap.set_option(ldap.OPT_REFERRALS, 0)
try:
    con = ldap.initialize(LDAP_URL)
    con.simple_bind_s(userid+"@"+ad_settings.AD_DNS_NAME, password)
    ADUser = con.search_ext_s(ad_settings.AD_SEARCH_DN, ldap.SCOPE_SUBTREE, \
        "sAMAccountName=%s" % userid, ad_settings.AD_SEARCH_FIELDS)[0][1]
except ldap.LDAPError:
    return None
ADUser is a dict whose values are lists of strings:
{'givenName': ['xxxxx'],
 'mail': ['xxxxx@example.com'],
 'memberOf': ['CN=group1,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
              'CN=group2,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
              'CN=group3,OU=Projects,OU=Office,OU=company,DC=domain,DC=com',
              'CN=group4,OU=Projects,OU=Office,OU=company,DC=domain,DC=com'],
 'sAMAccountName': ['myloginid'],
 'sn': ['Xxxxxxxx']}
Of course in the real world the group names are verbose and of varied structure, and users will belong to tens or hundreds of groups.
If I get the list of groups out as ADUser.get('memberOf')[0], what is the best way to check if any members of a separate list exist in the main list?
For example, the check list would be ['group2', 'group16'] and I want to get a true/false answer as to whether any of the smaller list exist in the main list.

If the format example you give is somewhat reliable, something like:
import re
grps = re.compile(r'CN=(\w+)').findall
def anyof(short_group_list, adu):
    all_groups_of_user = set(g for gs in adu.get('memberOf',()) for g in grps(gs))
    return sorted(all_groups_of_user.intersection(short_group_list))
where you pass your list such as ['group2', 'group16'] as the first argument, your ADUser dict as the second argument; this returns an alphabetically sorted list (possibly empty, meaning "none") of the groups, among those in short_group_list, to which the user belongs.
It's probably not much faster to return just a bool, but, if you insist, changing the second statement of the function to:
return any(g for g in short_group_list if g in all_groups_of_user)
might possibly save a certain amount of time in the "true" case (since any short-circuits) though I suspect not in the "false" case (where the whole list must be traversed anyway). If you care about the performance issue, best is to benchmark both possibilities on data that's realistic for your use case!
If performance isn't yet good enough (and a bool yes/no is sufficient, as you say), try reversing the looping logic:
def anyof_v2(short_group_list, adu):
    gset = set(short_group_list)
    return any(g for gs in adu.get('memberOf',()) for g in grps(gs) if g in gset)
any's short-circuit abilities might prove more useful here (at least in the "true" case, again -- because, again, there's no way to give a "false" result without examining ALL the possibilities anyway!-).
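Since the question says real group names are verbose and of varied structure, r'CN=(\w+)' may cut a name short at the first space or hyphen. Here is a minimal sketch of a more tolerant variant. It assumes the group name is always the first CN= component of each memberOf entry, and that entries may arrive as bytes (which python-ldap typically returns under Python 3). The names group_names and in_any_group are made up for this illustration.

```python
import re

# First RDN of the DN, if it is a CN. Backslash-escaped characters (such as "\,")
# are kept as part of the name.
_FIRST_CN = re.compile(r'^CN=((?:\\.|[^,\\])+)', re.IGNORECASE)

def group_names(member_of):
    """Collect the leading CN value of every DN in a memberOf list."""
    names = set()
    for dn in member_of:
        if isinstance(dn, bytes):      # assumption: bytes values are UTF-8 encoded
            dn = dn.decode('utf-8')
        match = _FIRST_CN.match(dn)
        if match:
            names.add(match.group(1))
    return names

def in_any_group(wanted, ad_user):
    """True if the user belongs to at least one of the wanted groups."""
    return not group_names(ad_user.get('memberOf', ())).isdisjoint(wanted)
```

isdisjoint stops at the first common element, so it gives the same short-circuit benefit as any() without building an intersection.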

You can use set intersection (& operator) once you parse the group list out. For example:
> memberOf = 'CN=group1,OU=Projects,OU=Office,OU=company,DC=domain,DC=com'
> groups = [token.split('=')[1] for token in memberOf.split(',')]
> groups
['group1', 'Projects', 'Office', 'company', 'domain', 'com']
> checklist1 = ['group1', 'group16']
> set(checklist1) & set(groups)
set(['group1'])
> checklist2 = ['group2', 'group16']
> set(checklist2) & set(groups)
set([])
Note that a conditional evaluation on a set works the same as for lists and tuples. True if there are any elements in the set, False otherwise. So, "if set(checklist2) & set(groups): ..." would not execute since the condition evaluates to False in the above example (the opposite is true for the checklist1 test).
Also see:
http://docs.python.org/library/sets.html

Related

Trouble converting "for key in dict" to == for exact matching

Good morning,
I am having trouble pulling the correct value from my dictionary because there are similar keys. I believe I need to use == instead of in. However, when I change if key in c_item_number_one: to if key == c_item_number_one: it just falls through to my if not_found: print("Specify Size One"), even though I know 12" is in the dictionary.
c_item_number_one = ('12", Pipe,, SA-106 GR. B,, SCH 40, WALL smls'.upper())
print(c_item_number_one)
My formula is as follows:
def item_one_size_one():
    not_found = True
    for key in size_one_dict:
        if key in c_item_number_one:
            item_number_one_size = size_one_dict[key]
            print(item_number_one_size)
            not_found = False
            break
    if not_found:
        print("Specify Size One")
item_one_size_one()
The current result is:
12", PIPE,, SA-106 GR. B,, SCH 40, WALL SMLS
Specify Size One
To split the user input into fields, use re.split
>>> userin
'12", PIPE,, SA-106 GR. B,, SCH 40, WALL SMLS'
>>> import re
>>> fields = re.split('[ ,]+',userin)
>>> fields
['12"', 'PIPE', 'SA-106', 'GR.', 'B', 'SCH', '40', 'WALL', 'SMLS']
Then compare the key to the first field, or to all fields:
if key == fields[0]:
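To make the difference concrete, here is a small sketch of the field-based lookup. The contents of size_one_dict are invented for the example (the question does not show them), and lookup_size is a hypothetical helper name. It also shows why the substring test is fragile when keys are similar: a shorter key such as 2" is contained in 12".

```python
import re

# Hypothetical dictionary contents; the real size_one_dict is not shown in the question.
size_one_dict = {'12"': 'NPS 12', '2"': 'NPS 2'}

def lookup_size(description, sizes):
    """Look up the first field of the description by exact key match."""
    fields = [f for f in re.split(r'[ ,]+', description.upper()) if f]
    if not fields:
        return None
    return sizes.get(fields[0])

item = '12", Pipe,, SA-106 GR. B,, SCH 40, WALL smls'
size = lookup_size(item, size_one_dict)

# Substring matching would accept the wrong key as well:
substring_hit = '2"' in item.upper()
```

With exact matching on the first field only the 12" entry can match, whereas the in test is also true for 2".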
There are two usages of the word in here - the first is in the context of a for loop, and the second entirely distinct one is in the context of a comparison.
In the construction of a for loop, the in keyword connects the variable that will be used to hold the values extracted from the loop to the object containing values to be looped over.
e.g.
for x in list:
Meanwhile, the entirely distinct usage of the in keyword can be used to tell python to perform a collection test where the left-hand side item is tested to see whether it exists in the rhs-object's collection.
e.g.
if key in c_item_number_one:
So the meaning of the in keyword is somewhat contextual.
If your code is giving unexpected results then you should be able to replace the if-statement to use an == test, while keeping everything else the same.
e.g.
if key == c_item_number_one:
However, since the contents of c_item_number_one is a tuple, you might only want to test equality for the first item in that tuple - the number 12 for example. You should do this by indexing the element in the tuple for which you want to do the comparison:
if key == c_item_number_one[0]:
Here the [0] is telling python to extract only the first element from the tuple to perform the == test.
[edit] Sorry, your c_item_number_one isn't a tuple, it's a long string. What you need is a way of clearly identifying each item to be looked up, using a unique code or value that the user can enter that will uniquely identify each thing. Doing a string-match like this is always going to throw up problems.
There's potential then for a bit of added nuance, the 1st key in your example tuple is a string of '12'. If the key in your == test is a numeric value of 12 (i.e. an integer) then the test 12 == '12' will return false and you won't extract the value you're after. That your existing in test succeeds currently suggests though that this isn't a problem here, but might be something to be aware of later.

Why is my recursion leading to self referencing dict values?

I aim to write a function that splits a budget across options by comparing options based on their benefit/cost ratio and stores them in a list of nested dicts. When multiple options with the same benefit/cost ratio are available, each option shall be pursued separately (as downstream element which in turn may have multiple downstream elements) and reflected as a list of dicts for its upstream dict. There is no limitation as to how many options may occur.
def get_all_allocation_proposals(budget, options, upstream_element=dict()):
    # accepts:
    #   budget to be allocated
    #   list of options
    # returns:
    #   a list of allocation proposals
    # filter options for affordable options and sort by cost benefit ratio
    options = [x for x in options if x['cost'] <= budget]
    options = sorted(options, key=lambda x: (
        x['benefit_cost_ratio'], x['benefit']), reverse=True)
    if (len(options) > 0):
        # select the best options
        best_bc_ratio = options[0]['benefit_cost_ratio']
        best_options = [
            x for x in options if x['benefit_cost_ratio'] == best_bc_ratio]
        upstream_element['downstream_elements'] = []
        for current_element in best_options:
            downstream_options = remove_conflicting_options(
                current_element, options)
            downstream_budget = budget - current_element['cost']
            current_element['donstream_budget'] = downstream_budget
            downstream_elements = get_all_allocation_proposals(downstream_budget,
                                                               downstream_options,
                                                               current_element)
            if downstream_elements is not None:
                current_element['downstream_elements'] = downstream_elements
            upstream_element['downstream_elements'].append(current_element)
        return upstream_element
    else:
        return None
In the code above, when the elements are appended, self referencing dict values are created. Why is that the case and how can I avoid that? All I want to do is to pass on all downstream elements to the first call stack.
Is there something fundamentally flawed with my recursion pattern?
I think the issue is probably because you are passing mutable objects into your recursive call. Specifically downstream_options and current_element are dicts and when you modify them within a given recursion of the function, you are also modifying them at the level above, which in this case seems to leave you attempting to assign a value in the dict to itself (or some such impossibility, I haven't quite managed to follow the logic through).
A quick solution might be (I'm not sure if this will break your logic) to make a copy of these dicts at each recursion:
from copy import deepcopy
...
downstream_elements = get_all_allocation_proposals(downstream_budget,
                                                   deepcopy(downstream_options),
                                                   deepcopy(current_element))
Additionally, as identified in the comments, you should avoid having a mutable default argument, i.e. upstream_element=dict(). This can produce some very confusing behaviour if you actually use the default (which you don't appear to in your code).
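Reading the question's code, the self reference seems to come from the return value: the function returns upstream_element, which is the very dict that was passed in as current_element, so current_element['downstream_elements'] = downstream_elements stores that dict inside itself. One way around it, sketched below under simplifying assumptions, is to return a fresh list of child nodes and never mutate the inputs. build_proposals is a hypothetical name, the stand-in for remove_conflicting_options only drops the chosen option, and the sort by benefit is left out for brevity.

```python
def build_proposals(budget, options):
    """Return a list of proposal trees; the input dicts are never mutated."""
    affordable = [o for o in options if o['cost'] <= budget]
    if not affordable:
        return []
    best_ratio = max(o['benefit_cost_ratio'] for o in affordable)
    proposals = []
    for option in affordable:
        if option['benefit_cost_ratio'] != best_ratio:
            continue
        # Hypothetical stand-in for remove_conflicting_options():
        # here an option only conflicts with itself.
        remaining = [o for o in affordable if o is not option]
        node = dict(option)  # shallow copy, so the caller's dict stays untouched
        node['downstream_budget'] = budget - option['cost']
        node['downstream_elements'] = build_proposals(
            node['downstream_budget'], remaining)
        proposals.append(node)
    return proposals
```

Because every level builds and returns its own list, no dict ever ends up as one of its own values, and the mutable default argument is no longer needed.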

Error when trying to build logical parser

I have these strings stored in a database and I want to convert them to Python expressions to use in an if statement. I will store these strings in a list and loop over them.
For example:
string = "#apple and #banana or #grapes"
i am able to convert this string by replacing # with "a==" and # with "b==" to this :
if a == apple and b == banana or b == grapes
hash refers to a
# refers to b
But when i use eval it throws up error "apple is not defined" because apple is not in quotes. so what i want is this:
if a == "apple" and b == "banana" or b == "grapes"
Is there any way i can do this ?
The strings stored in DB can have any type of format, can have multiple and/or conditions.
Few examples:
string[0] = "#apple and #banana or #grapes"
string[1] = "#apple or #banana and #grapes"
string[2] = "#apple and #banana and #grapes"
There will be an else condition for when no condition is fulfilled.
Thanks
If I understand correctly, you are trying to set up something of a logical parser - you want to evaluate whether the expression can possibly be true or not.
#word or #otherword
is always true since it's possible to satisfy this with #=word for example, but
#word and #otherword
is not since it is impossible to satisfy this. The way you were going is using Python's builtin interpreter, but you seem to "make up" variables a and b, which do not exist. Just to give you a starter for such a parser, here is one bad implementation:
from itertools import product
def test(string):
    var_dict = {}
    word_dict = {}
    cur_var = ord('a')
    expression = []
    for i,w in enumerate(string.split()):
        if not i%2:
            if w[0] not in var_dict:
                var_dict[w[0]] = chr(cur_var)
                word_dict[var_dict[w[0]]] = []
                cur_var += 1
            word_dict[var_dict[w[0]]].append(w[1:])
            expression.append('{}=="{}"'.format(var_dict[w[0]],w[1:]))
        else: expression.append(w)
    expression = ' '.join(expression)
    result = {}
    for combination in product(
            *([(v,w) for w in word_dict[v]] for v in word_dict)):
        exec(';'.join('{}="{}"'.format(v,w) for v,w in combination)+';value='+expression,globals(),result)
        if result['value']: return True
    return False
Beyond not checking if the string is valid, this is not great, but a place to start grasping what you're after.
What this does is create your expression in the first loop, while saving a hash mapping the first characters of words (w[0]) to variables named from a to z (if you want more you need to do better than cur_var+=1). It also maps each such variable to all the words it was assigned to in the original expression (word_dict).
The second loop runs a pretty bad algorithm - product will give all the possible pairings of variable and matching word, and I iterate over each combination and assign our fake variables the words in an exec command. There are plenty of reasons to avoid exec, but this is easiest for setting the variables. If I find a combination that satisfies the expression, I return True, otherwise False. You cannot use eval if you want to assign stuff (or for if, for, while etc.).
Note this can be drastically improved by writing your own logical parser to read the string, though it will probably be longer.
#Evaluated as (#apple and #banana) or #grapes by Python - only #=apple #=banana satisfies this.
>>> test("#apple and #banana or #grapes")
True
#Evaluated as #apple or (#banana and #grapes) by Python - all combinations satisfy this as # does not matter.
>>> test("#apple or #banana and #grapes")
True
#demands both #=banana and #=grapes - impossible.
>>> test("#apple and #banana and #grapes")
False
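For this restricted grammar (terms joined only by and/or, no parentheses, no not), the same yes/no answer can be computed without exec. The sketch below relies on and binding tighter than or, so the expression is an or of and-clauses, and a clause is satisfiable exactly when no prefix symbol is required to equal two different words. satisfiable is a hypothetical name; this is an illustration under those assumptions, not a general parser.

```python
import re

def satisfiable(expression):
    """True if some assignment of one word per prefix symbol makes the expression true."""
    for clause in re.split(r'\s+or\s+', expression.strip()):
        required = {}          # prefix symbol -> the word it must equal in this clause
        consistent = True
        for term in re.split(r'\s+and\s+', clause):
            symbol, word = term[0], term[1:]
            # setdefault returns the word already required for this symbol, if any
            if required.setdefault(symbol, word) != word:
                consistent = False
                break
        if consistent:
            return True
    return False
```

It agrees with test() on the three example strings above, and it never executes text coming from the database.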
I am not sure of what you are asking here, but you can use the replace and split functions :
string = "#apple and #banana"
fruits = string.replace("#", "").split(" and ")
if a == fruits[0] and b == fruits[1]:
Hope this helps

Python Query Processing and Boolean Search

I have an inverted index (as a dictionary) and I want to take a boolean search query as an input to process it and produce a result.
The inverted index is like this:
{
Test : { FileName1: [213, 1889, 27564], FileName2: [133, 9992866, 27272781, 78676818], FileName3: [9211] },
Try : { FileName4 ...
.....
}
Now, given a boolean search query, I have to return the result.
Examples:
Boolean Search Query: test AND try
The result should be all documents that have the words test and try.
Boolean Search Query: test OR try
The result should be all documents that have either test or try.
Boolean Search Query: test AND NOT try
The result should be all documents that have test but not try.
How can I build this search engine to process the given boolean search query?
Thanks in advance!
EDIT: I am retaining the first part of my answer, because if this WASN'T a school assignment, this would be in my opinion still a better way to go about the task. I replace the second part of the answer with update matching OP's question.
What you appear to want to do is to create a query string parser, which would read the query string and translate it into a series of AND/OR/NOT combos to return the correct keys.
There are 2 approaches to this.
According to what you wrote that you need, by far the simplest solution would be to load the data into any SQL database (SQLite, for example, which does not require a full-blown running SQL server), load dictionary keys as a separate field (the rest of your data may all be in a single another field, if you don't care about normal forms &c), and translate incoming queries to SQL, approximately like this:
SQL table has at least this:
CREATE TABLE my_data(
dictkey text,
data text);
python_query="foo OR bar AND NOT gazonk"
sql_keywords=["AND","NOT","OR"]
sql_query=[]
for word in python_query.split(" "):
    if word in sql_keywords:
        sql_query+=[ word ]
    else:
        sql_query+=["dictkey='%s'" % word]
real_sql_query=" ".join(sql_query)
This needs some escaping and control checking for SQL injections and special chars, but in general it would just translate your query into SQL, which, when run against the SQL database, would return the keys (and possibly data) for further processing.
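One common way to handle the escaping concern is to let the database driver bind the words as parameters instead of formatting them into the SQL text. Here is a minimal sketch using the standard sqlite3 module with the my_data table from above. build_query is a hypothetical helper, and the sample rows are invented for the demonstration.

```python
import sqlite3

SQL_KEYWORDS = {"AND", "OR", "NOT"}

def build_query(user_query):
    """Translate 'foo OR bar AND NOT gazonk' into SQL text plus bound parameters."""
    clauses, params = [], []
    for word in user_query.split():
        if word in SQL_KEYWORDS:
            clauses.append(word)
        else:
            clauses.append("dictkey = ?")   # placeholder, the value is bound separately
            params.append(word)
    return "SELECT dictkey, data FROM my_data WHERE " + " ".join(clauses), params

# Demonstration on an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data(dictkey text, data text)")
conn.executemany("INSERT INTO my_data VALUES (?, ?)",
                 [("foo", "x"), ("bar", "y"), ("gazonk", "z")])

sql, params = build_query("foo OR bar AND NOT gazonk")
matching_keys = sorted(row[0] for row in conn.execute(sql, params))
```

The user's words never become part of the SQL text, so quotes inside them cannot change the statement. An empty or malformed query would still produce invalid SQL, so the control checking mentioned above is still needed.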
Now for the pure Python version.
What you need to do is to analyze the string you get and apply the logic to your existing Python data.
Analyzing the string to reduce it to specific components (and their interactions) is parsing. If you actually wanted to build your own fully fledged parser, there would be Python modules for that, however, for a school assignment, I expect you are tasked to build your own.
From your description, the query can be expressed in quasi BNF form as:
(<[NOT] word> <AND|OR>)...
Since you say that operator priority is not relevant at all, you can do it the easy way and parse word by word.
Then you have to match the keywords to the filenames, which, as mentioned in another answer, is easiest to do with sets.
So, it could go approximately like this:
import re
# inverted_index is assumed to be the dictionary from the question
query="foo OR bar AND NOT gazonk"
result_set=set()
operation=None
for word in re.split(" +(AND|OR) +",query):
    #word will be in ['foo', 'OR', 'bar', 'AND', 'NOT gazonk']
    inverted=False # for "NOT word" operations
    if word in ['AND','OR']:
        operation=word
        continue
    if word.find('NOT ') == 0:
        if operation == 'OR':
            # generally "OR NOT" operation does not make sense, but if it does in your case, you
            # should update this if() accordingly
            continue
        inverted=True
        # the word is inverted!
        realword=word[4:]
    else:
        realword=word
    if operation is not None:
        # now we need to match the key and the filenames it contains:
        current_set=set(inverted_index[realword].keys())
        if operation == 'AND':
            if inverted is True:
                result_set -= current_set
            else:
                result_set &= current_set
        elif operation == 'OR':
            result_set |= current_set
    operation=None
print(result_set)
Note that this is not a complete solution (for example it does not include dealing with the first term of the query, and it requires the boolean operators to be in uppercase), and is not tested. However, it should serve the primary purpose of showing you how to go about it. Doing more would be writing your course work for you, which would be bad for you. Because you are expected to learn how to do it so you can understand it. Feel free to ask for clarifications.
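For readers who are past the assignment stage, here is one way the gaps mentioned above could be filled in: the first term seeds the result, and NOT is handled as a complement against all known documents. This is only a sketch under stated assumptions: strict left-to-right evaluation with no precedence, uppercase operators, lowercase index keys, and a tiny made-up inverted_index. The names docs_for and boolean_search are hypothetical.

```python
import re

# Made-up index in the shape from the question: word -> {filename: [positions]}
inverted_index = {
    'test': {'doc1': [1], 'doc2': [5]},
    'try': {'doc2': [7], 'doc3': [2]},
}

def docs_for(index, term, all_docs):
    """Documents matching a single term, which may be prefixed with 'NOT '."""
    negate = term.startswith('NOT ')
    word = term[4:] if negate else term
    docs = set(index.get(word.strip().lower(), {}))
    return all_docs - docs if negate else docs

def boolean_search(query, index):
    """Evaluate 'a AND b OR NOT c' strictly left to right."""
    all_docs = set()
    for postings in index.values():
        all_docs.update(postings)
    # The capture group keeps the operators: ['test', 'AND', 'NOT try']
    parts = re.split(r'\s+(AND|OR)\s+', query.strip())
    result = docs_for(index, parts[0], all_docs)
    for op, term in zip(parts[1::2], parts[2::2]):
        docs = docs_for(index, term, all_docs)
        result = result & docs if op == 'AND' else result | docs
    return result
```

Treating NOT as a complement makes "OR NOT" well defined too, at the cost of building the set of all documents for each query.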
Another approach could be an in-memory intersection of the posting lists (for your AND cases, you can enhance this for OR, NOT, etc).
Below is a simple merge algorithm for the posting lists. It assumes the lists are sorted in increasing doc_id order (easy to guarantee if the terms are indexed correctly), which gives O(n1+n2) time: the merge is linear over the two sorted lists and can stop early.
Now assume our positional inverted index looks like this (similar to yours, but with posting lists as lists rather than dicts, which will allow compression later): it maps String > Term, where each term consists of (tf, posting list ([P1, P2,...])) and each posting has (docid, df, position list). We can then perform a simple AND over all of our posting lists iteratively:
def search(self, sq: BoolQuery) -> list:
    # Performs a search from a given query in boolean retrieval model,
    # Supports AND queries only and returns sorted document ID's as result:
    if sq.is_empty():
        return super().search(sq)
    terms = [self.index[term] for term in sq.get_terms() if term in self.index]
    if not terms:
        return []
    # Iterate over posting lists and intersect:
    result, terms = terms[0].pst_list, terms[1:]
    while terms and result:
        result = self.intersect(result, terms[0].pst_list)
        terms = terms[1:]
    return [p.id for p in result]
Now let's look at the intersection:
@staticmethod  # called as self.intersect(...) above, so it must not take self
def intersect(p1: list, p2: list) -> list:
    # Performs linear merge of 2x sorted lists of postings,
    # Returns the intersection between them (== matched documents):
    res, i, j = list(), 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i].id == p2[j].id:
            res.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i].id < p2[j].id:
            i += 1
        else:
            j += 1
    return res
This simple algorithm can be later expanded when performing phrase search (edit the intersection to calculate slop distance, e.g: |pos1-pos2| < slop)
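The same linear-merge idea carries over to OR and AND NOT. A sketch on plain sorted lists of document IDs (integers here, rather than the posting objects above), assuming each list is strictly increasing; union_sorted and difference_sorted are hypothetical names:

```python
def union_sorted(a, b):
    """OR: merge two sorted ID lists, keeping each ID once."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # whatever is left in either list belongs to the union
    out.extend(b[j:])
    return out

def difference_sorted(a, b):
    """AND NOT: IDs that are in a but not in b."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            j += 1
    out.extend(a[i:])   # nothing left in b can exclude these
    return out
```

Both run in O(n1+n2), like the intersection, and both return sorted output, so they can be chained.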
Taking into account you have that inverted index and that is a dict with test and try as keys you can define the following functions and play with them:
def intersection(list1, list2):
    return list(set(list1).intersection(list2))
def union(list1, list2):
    return list(set(list1).union(list2))
def notin(list1, list2):
    # elements of list1 that do not appear in list2
    return [x for x in list1 if x not in list2]
intersection(inverted_index['people'].keys(), intersection(inverted_index['test'].keys(), inverted_index['try'].keys()))

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','@'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command','f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place to use a list comprehension, but that's still in my grey basket of understanding fully (the Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int( res[0] ) in unwanted_resource_types:
        result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers. Either change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ... you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).
The one-liner:
result = list(map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types))
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.
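A more readable way to write the same idea is to build the ID-to-name dictionary once and then look up each unwanted ID. The sketch below uses shortened sample data rather than the full lists from the question, and name_by_id is a made-up variable name.

```python
# Shortened sample data in the same (ID, Name, Prefix) format as the question.
resource_types = [('0', 'Group', '0'), ('1', 'User', '1'),
                  ('2', 'Filter', '2'), ('3', 'Agent', '3')]
unwanted_resource_types = [0, 1, 3]

# Build the lookup table once instead of once per unwanted ID.
name_by_id = {int(res[0]): res[1] for res in resource_types}

# Keep the order of unwanted_resource_types; skip IDs with no matching resource.
result = [name_by_id[rid] for rid in unwanted_resource_types if rid in name_by_id]
```

Unlike the one-liner, this does not rebuild the dictionary for every element, and an unknown ID is skipped instead of raising KeyError.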
