Count matches in MongoDB $or - Python

I'm trying to count the matches across all columns.
I currently use this code to copy certain fields across from a Scrapy item:
def getDbModel(self, item):
    deal = {"name": item['name']}
    if 'imageURL' in item:
        deal["imageURL"] = item['imageURL']
    if 'highlights' in item:
        deal['highlights'] = replace_tags(item['highlights'], ' ')
    if 'fine_print' in item:
        deal['fine_print'] = replace_tags(item['fine_print'], ' ')
    if 'description' in item:
        deal['description'] = replace_tags(item['description'], ' ')
    if 'search_slug' in item:
        deal['search_slug'] = item['search_slug']
    if 'dealURL' in item:
        deal['dealurl'] = item['dealURL']
I'm wondering how I would turn this into an $or search in MongoDB.
I was looking at something like the below:
def checkDB(self, item):
    # Check if the record exists in the DB
    deal = self.getDbModel(item)
    return self.db.units.find_one({"$or": [deal]})
Firstly, is this the best method to be using?
Secondly, how would I count the number of columns matched, i.e. to limit results to records that match at least two columns?

There is no easy way of counting the number of column matches on MongoDB's end; it just matches and returns documents.
You would probably be better off doing this client side. I am unsure exactly how you intend to use this count figure, but there is no easy way of producing it server side, whether through map-reduce or the aggregation framework.
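For example, a rough client-side sketch (assuming deal_fields is a dict of field -> expected value built from the item, and db is your database handle; both names are illustrative):

def count_matches(deal_fields, doc):
    # How many of the item's fields does this document match?
    return sum(1 for k, v in deal_fields.items() if doc.get(k) == v)

# Fetch candidates with $or (one clause per field), then keep
# only the documents that match at least two fields.
candidates = db.units.find({"$or": [{k: v} for k, v in deal_fields.items()]})
good = [doc for doc in candidates if count_matches(deal_fields, doc) >= 2]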
You could, in the aggregation framework, change your schema a little to put these columns within a properties field and then $sum the matches within the subdocument. This is a good approach since you can also sort on the count to create a kind of relevance search (if that is what you're intending).
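For instance, a rough sketch of such a pipeline, assuming each document is restructured to hold a properties array of {"k": ..., "v": ...} subdocuments (the field names here are illustrative, not your actual schema):

# Count, per document, how many properties match the item,
# keep documents with at least two matches, most relevant first.
pipeline = [
    {"$unwind": "$properties"},
    {"$match": {"$or": [
        {"properties.k": "name", "properties.v": item["name"]},
        {"properties.k": "imageURL", "properties.v": item.get("imageURL")},
    ]}},
    {"$group": {"_id": "$_id", "matches": {"$sum": 1}}},
    {"$match": {"matches": {"$gte": 2}}},
    {"$sort": {"matches": -1}},
]
results = db.units.aggregate(pipeline)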
As to whether this is a good approach depends on your scenario. When using an $or, MongoDB can use a separate index for each clause (a special case within MongoDB indexing), so you should take this into consideration when building an $or and ensure you have indexes covering each condition.
You have also got to consider that MongoDB will effectively evaluate each clause and then merge the results to remove duplicates, which can be heavy for bigger $ors or a large working set.
Also, the format of your $or is wrong: $or takes an array of separate condition documents, one per clause. At the minute you have a single array containing one document with all your attributes, so the attributes will effectively have an $and condition between them and it won't work.
You could probably change your code to:
def getDbModel(self, item):
    deal = []
    deal.append({"name": item['name']})
    if 'imageURL' in item:
        deal.append({"imageURL": item['imageURL']})
    if 'highlights' in item:
        deal.append({"highlights": replace_tags(item['highlights'], ' ')})
    # etc. for the other fields
    # Some way down:
    return self.db.units.find_one({"$or": deal})
NB: I am not a Python programmer
Hope it helps,


Efficient functional list iteration in Python

Suppose I have an array of elements, where each element has some number of properties.
I need to filter this list against some subsets of values determined by predicates. These subsets can of course intersect.
I also need to determine the number of values in each such subset.
Using an imperative approach I could write code like the following, and it would have a running time of about 2*n: one iteration to copy the array, and another to filter it and count the subset sizes.
from itertools import groupby

a = [{'some_number': i, 'some_time': str(i) + '0:00:00'} for i in range(10)]

def predicate1(x):
    return x['some_number'] < 3

def predicate2(x):
    return x['some_time'] < '50:00:00'

def do_something_with_filtered(a, c1, c2):
    print('filtered a {}'.format(a))
    print('{} items had wrong number'.format(c1))
    print('{} items had wrong time'.format(c2))

# imperative style
wrong_number_count = 0
wrong_time_count = 0
for item in a[:]:
    if predicate1(item):
        delete_original(item, a)
        wrong_number_count += 1
    if predicate2(item):
        delete_original(item, a)
        wrong_time_count += 1
    update_some_data(item)
do_something_with_filtered(a, wrong_number_count, wrong_time_count)
Somehow I can't think of a way to do this in Python in a functional style with the same running time.
In functional style I could probably use groupby multiple times, or write a comprehension for each predicate, but that would obviously be slower than the imperative approach.
I think such a thing is possible in Haskell using stream fusion (am I right?). But how do I do it in Python?
Python has strong support for "stream processing" in the form of its iterators, and what you ask seems fairly trivial to do. You just need a way to group your predicates with their counters; it could be a dictionary where the predicate itself is the key.
That said, a simple iterator function that takes in your predicate data structure along with the data to be processed can do what you want. The iterator would have the side effect of changing your data structure with the predicate information. If you want "pure functions" you'd have to duplicate the predicate information beforehand, and maybe pass and retrieve all predicate and counter values to the iterator (through the send method) for each element; I don't think it would be worth that level of purism.
Your code could then look something along these lines:
from collections import OrderedDict

def predicate1(...):
    ...

...

def predicateN(...):
    ...

def do_something_with_filtered(item):
    ...

def multifilter(data, predicates):
    for item in data:
        for predicate in predicates:
            if predicate(item):
                predicates[predicate] += 1
                break
        else:
            yield item

def do_it(data):
    predicates = OrderedDict([(predicate1, 0), ..., (predicateN, 0)])
    for item in multifilter(data, predicates):
        do_something_with_filtered(item)
    for predicate, value in predicates.items():
        print("{} filtered out {} items".format(predicate.__name__, value))

a = ...
do_it(a)
(If you have to count an item for every predicate that it matches, an obvious change from the "break" statement to a state-flag variable is enough.)
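A minimal sketch of that flag-based variant (multifilter_all is just an illustrative name):

def multifilter_all(data, predicates):
    # Like multifilter, but credits every predicate an item matches;
    # the item is filtered out if any predicate matched at all.
    for item in data:
        matched = False
        for predicate in predicates:
            if predicate(item):
                predicates[predicate] += 1
                matched = True
        if not matched:
            yield item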
Yes, fusion in Haskell will often turn something written as two passes into a single pass. Though in the case of lists, it's actually foldr/build fusion rather than stream fusion.
That's not generally possible in languages that don't enforce purity, though. When side effects are involved, it's no longer correct to fuse multiple passes into one. What if each pass performed output? Unfused, you get all the output from each pass separately. Fused, you get the output from both passes interleaved.
It's possible to write a fusion-style framework in Python that will work correctly if you promise to only ever use it with pure functions. But I'm doubtful such a thing exists at the moment. (I'd love to be proven wrong, though.)
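For a flavour of what generator-based "fusion" looks like in Python (a toy sketch, not a full framework; the names fmap and ffilter are made up here), note that chained generators already give you a single physical pass:

def fmap(f, seq):
    for x in seq:
        yield f(x)

def ffilter(pred, seq):
    for x in seq:
        if pred(x):
            yield x

# Logically two passes (filter, then map), physically one traversal:
# each item flows through the whole chain before the next is read.
for item in fmap(lambda x: x * 2, ffilter(lambda x: x > 3, range(10))):
    print(item)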

Key word search just in one column of the file and keeping 2 words before and after key word

I love Python, though I am new to it. With the help of the community (users like Antti Haapala) I was able to make some progress, but I got stuck at the end. Please help. I have a few tasks remaining before I get into my big data POC (I plan to use this code on a text file with over a million records):
• Search for a keyword in column C#3 and keep the two words before and after that keyword.
• Divert the print output to a file.
• Don't touch C#1 and C#2, for referential integrity purposes.
I really appreciate all your help.
My input file:
C#1 C#2 C#3 (these are column headings, included just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it
Desired output file: (only change in Column 3 or last column)
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
Code I am currently using:
s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():
if not line.strip():
continue
fields = line.split(None, 2)
joined = '|'.join(fields)
print(joined)
BTW, if I use the keyword search as-is, it also looks at my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, search only the 3rd column, and keep the two words before and after the keyword(s).
First I need to warn you that using this code on a million records is dangerous. You are dealing with regular expressions, and this method is only good as long as the data itself is regular; otherwise you might end up creating tons of cases to extract the data you want without also extracting the data you don't want.
For a million rows you'll want pandas, as a plain for loop is too slow.
import pandas as pd
import re

df = pd.DataFrame({'C1': [12088, 12089],
                   'C2': ["CITA", "CITA"],
                   'C3': ["Hello very nice lists, better to keep those",
                          "This is great theme for lists keep it"]})

df["C3"] = df["C3"].map(lambda x:
    re.findall(r'(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*', str(x)))
df["C3"] = df["C3"].map(lambda x: x[0].strip())
which gives:
df
      C1    C2                          C3
0  12088  CITA  very nice lists, better to
1  12089  CITA     theme for lists keep it
There are still some open questions about how exactly you intend to perform your keyword search. One obstacle is already contained in your example: how do you deal with characters such as commas? It is also not clear what to do with lines that do not contain the keyword, or what to do when there are not two words before or two words after the keyword. I guess that you yourself are a little unsure about the exact requirements and have not thought about all the edge cases.
Nevertheless, I have made some "blind decisions" about these questions, and here is a naive example implementation that assumes your keyword matching rules are rather simple. I have created the function findword(), and you can adjust it to whatever you like. Maybe this example helps you pin down your own requirements.
KEYWORD = "lists"

S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """

def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None

for line in S.splitlines():
    tokens = line.split("|")
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    l = len(words)
    start = idx - 2 if idx > 1 else 0
    end = idx + 3 if idx < l - 2 else -1
    tokens[2] = " ".join(words[start:end])
    print('|'.join(tokens))
Test:
$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
PS: I hope I got the indices right for slicing. You should check, nevertheless.

Accessing Data using df['foo'] missing data for pattern searching python

I have a function which takes in one row from a DataFrame, matches a pattern, and adds the result to the data. Since the pattern search needs its input to be a string, I am forcing it with str(). However, if I do that it cuts off my url after a certain point.
I figured out that if I force it using the ix function,
str(data.ix[0, 'url'])
it does not cut anything off and gets me what I want. But if I use str(data.ix[:, 'url']), it again cuts off after some point.
The problem is that I cannot specify the index position inside the ix function, as I plan to iterate by row using the apply function. Any suggestions?
def foo(data):
    url = str(data['url'])
    m = re.search(r"model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)", url)
    if m:
        data['make'] = m.group("make")
        data['model'] = m.group("model")
    return data
Iterating row-by-row is a last resort. It's almost always slower, less readable, and less idiomatic.
Fortunately, there is an easy way to do what you want. Check out the str.extract method, added in version 0.13 of pandas; note that it is called on a column (a Series), not on the DataFrame itself.
Something like this...
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
The result, extracted_data, will be a new DataFrame with columns named 'model' and 'make', inferred from the named groups in your regex pattern.
Join it to your original DataFrame, and you're done:
data = data.join(extracted_data)
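Putting it together, a minimal sketch with made-up sample URLs that follow the question's pattern:

import pandas as pd

# Hypothetical sample data; only the query-string format matters here.
data = pd.DataFrame({'url': [
    'http://example.com/?model=focus&id=1&make=ford',
    'http://example.com/?model=civic&id=2&make=honda',
]})

pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
data = data.join(data['url'].str.extract(pattern))
print(data[['make', 'model']])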

Proper way to access items in FlexGridSizer in wxPython?

I have a FlexGridSizer called self.grid with five columns, each row having two TextCtrls, a pair of RadioButtons, and a CheckBox. What is the best way to retrieve the data associated with these objects? Currently I am successfully using:
# get flat list of all items in flexgridsizer, self.grid
children = list(self.grid.GetChildren())

# change list into list of rows (5 items each)
table = [children[i:i+5] for i in range(0, len(children), 5)]

# parse list of 'sizeritems' to extract content
for x in range(len(table)):
    for y in range(len(table[x])):
        widget = table[x][y].GetWindow()
        if isinstance(widget, wx.TextCtrl):
            text = ""
            for num in range(widget.GetNumberOfLines()):
                text += widget.GetLineText(num)
            table[x][y] = text
        if isinstance(widget, wx.RadioButton):
            table[x][y] = widget.GetValue()
        if isinstance(widget, wx.CheckBox):
            table[x][y] = (widget.GetLabel(), widget.GetValue())
This leaves me with table, a list of rows with five elements each, each item being relevant data: text for TextCtrl, bool for RadioButton, and (label, bool) for CheckBox.
This seems to get the job done, but it doesn't feel right.
Is there a better way to recover data from a FlexGridSizer? Alternatively, should I be using a different sizer/control for this layout? (I tried UltimateListCtrl, but it was buggy/didn't actually do what I needed).
You shouldn't really be doing that; instead, you should keep references to the widgets when you create them:
self.widgetTable = []
for row in dataSet:
    self.widgetTable.append([wx.TextCtrl(self, -1, value) for value in row])
Then access them through that:
self.widgetTable[0][0].GetValue()
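Reading the data back is then a plain traversal of that list, e.g. (a sketch, assuming a table of TextCtrls as above):

# Rebuild the table of values without touching the sizer at all.
table = [[ctrl.GetValue() for ctrl in row] for row in self.widgetTable]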
Since you have working code, and seem to be asking about coding style, you may have some luck asking on Code Review.
That being said, what you have here isn't too terrible. I think isinstance() is pretty ugly, so when I did something like this, I went by order of the widgets since I knew every 5th widget was what I wanted. Maybe you could use a similar approach? Or use a try...except structure to avoid isinstance.
So there are two approaches here, the first based on the order of your widgets, and the second just guesses how you retrieve info.
Method 1: So if your widgets have regular order, you can do something like this: (horrendous variable names for demonstration only)
list_of_one_type_of_widget = map(lambda x: x.GetWindow(), self.grid.GetChildren())[::k]
list_of_values_for_first_type = map(lambda x: x.GetValue(), list_of_one_type_of_widget)

list_of_another_type_of_widget = map(lambda x: x.GetWindow(), self.grid.GetChildren())[1::k]
list_of_values_for_second_type = map(lambda x: (x.GetLabel(), x.GetValue()), list_of_another_type_of_widget)
Where k is the number of widget types you have. This is how I tackled the problem when I came up against it, and I think it's pretty nifty and very concise. You're left with a list for each type of widget, which makes processing easy when it depends on the widget type. You could also pretty easily build this back into a table. Note how the second type is sliced with [1::k] rather than [::k]; each subsequent widget type needs an offset one greater than the previous.
Method 2: If you don't have a regular order, you can do something like this:
list_of_values = []
for widget in map(lambda x: x.GetWindow(), self.grid.GetChildren()):
    try:
        list_of_values.append(widget.GetValue())
    except AttributeError:
        # Same as above, but with the appropriate Get function. If there are
        # multiple alternatives, just keep nesting try...except blocks in
        # decreasing order of commonness.
        pass
You could make the case that the second method is more "pythonic", but that's up for debate. Additionally, without some clever tricks, you're left with one list in the end, which may not be ideal for you.
Some notes on your solution:
self.grid.GetChildren() is iterable, so you don't need to convert it to a list before using it as an iterable
You could change your sequential if statements to an if...elif...(else) construct, but it's really not that big of a deal in this case since you don't expect any widget to match more than one test
Hope this helps

Most efficient way of List/Dict Lookups in Python

I have a list of dictionaries, which looks something like this:
abc = [{"name": "bob", "age": 33},
       {"name": "fred", "age": 18},
       {"name": "mary", "age": 64}]
Let's say I want to look up Bob's age. I know I can run a for loop through the list, etc. However, my question is whether there are any quicker ways of doing this.
One thought is to use a loop but break out of it once the lookup (in this case the age for Bob) has completed.
The reason for this question is that my datasets are thousands of lines long, so I'm looking for any performance gains I can get.
Edit: I can see I could use the following generator expression, but I'm not sure whether it would iterate over all items of the list or just until the first dict containing the name bob is found:
next(item for item in abc if item["name"] == "bob")
Thanks,
Depending on how many times you want to perform this operation, it might be worth defining a dictionary mapping names to the corresponding age (or to the list of corresponding ages, if several people can share the same name).
A dictionary comprehension can help you:
abc_dict = {x["name"]:x["age"] for x in abc}
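If several people can share a name, a small variant using defaultdict collects all ages per name (a sketch):

from collections import defaultdict

ages_by_name = defaultdict(list)
for x in abc:
    ages_by_name[x["name"]].append(x["age"])

ages_by_name["bob"]  # [33]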
I'd consider making another dictionary and then using that for multiple age lookups:
age_by_name = {}
for person in abc:
    age_by_name[person['name']] = person['age']

age_by_name['bob']  # this is a quick lookup!
Edit: This is equivalent to the dict comprehension listed in Josay's answer
Try indexing it first (once), and then using the index (many times).
You can index it, e.g., by using a dict (the keys would be what you are searching by, and the values what you are searching for), or by putting the data in a database. That covers the case where you really do many lookups and rarely need to modify the data.
Define a dictionary of dictionaries, like this:
peoples = {"bob": {"name": "bob", "age": 33},
           "fred": {"name": "fred", "age": 18},
           "mary": {"name": "mary", "age": 64}}

person = peoples["bob"]
persons_age = person["age"]
Look up "bob" first, then look up "age".
This is correct, no?
You might write a helper function. Here's a take.
import itertools

# first() returns the first element encountered in an iterable which
# matches the predicate.
#
# If no element matches, StopIteration is raised.
#
# Args:
#     pred: the predicate which determines a matching element.
first = lambda pred, seq: next(itertools.dropwhile(lambda x: not pred(x), seq))
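For example, with the list from the question:

bob = first(lambda d: d["name"] == "bob", abc)
print(bob["age"])  # 33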
