How to check if the document is empty with pymongo? - python

I am trying to prevent pymongo.errors.InvalidOperation: No operations to execute after getting the aggregate function's result. However, the approach I used still raises pymongo.errors.InvalidOperation: No operations to execute on db.collection.insert(), even though I believe the result is not null to begin with.
Here is my code:
import sys
from pymongo import MongoClient

client = MongoClient()
db = client['movies']

for i in range(1, db.movies.count() + 1):
    res = db.ratings.aggregate([
        {"$match": {"movieID": str(i)}},
        {"$group": {"_id": "$movieID", "avg": {"$avg": "$rating"}}}
    ])
    if list(res) != None:
        db.question1.insert(res)
So how can I check whether the document is empty in MongoDB?

There are a couple things that may be wrong here: one is that a lot of pymongo's functions return iterators. For example:
query = db.customers.find({'name': 'John'})

# will print the number of Johns
print(len(list(query)))

# will always print 0, because the query results have already been
# consumed
print(len(list(query)))
Python's list function takes an iterator and creates a list with all the elements returned by the iterator, so list(query) will create a list with everything returned by the query. But, if we don't store the result of list(query) in a variable, they are lost forever and can't be accessed again ;-)
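As an aside, find() cursors can be re-iterated if you really need two passes; a minimal sketch, assuming the same customers collection as above:
query = db.customers.find({'name': 'John'})

print(len(list(query)))  # consumes the cursor
query.rewind()           # Cursor.rewind() resets a find() cursor to its unevaluated state
print(len(list(query)))  # prints the same count again
Converting to a list once is usually simpler, though.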
This consumption behavior is also the case for aggregate, which returns a CommandCursor (which, unlike a find() cursor, cannot even be rewound): the moment you called list(res) you consumed all the results, so you can't access them a second time when you call insert(res) because they are already gone.
The other issue is that list(res) is never None. If there are no results, list(res) will be [] (an empty list), so you actually want to check simply if list(res): or if len(list(res)) > 0:. Putting these two suggestions together, we get:
import sys
from pymongo import MongoClient

client = MongoClient()
db = client['movies']

for i in range(1, db.movies.count() + 1):
    # Convert db.ratings.aggregate directly to a list.
    # Now `res` is a plain list, which can be accessed
    # and traversed many times.
    res = list(db.ratings.aggregate([
        {"$match": {"movieID": str(i)}},
        {"$group": {"_id": "$movieID", "avg": {"$avg": "$rating"}}}
    ]))
    if len(res) > 0:
        db.question1.insert(res)
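Side note: Collection.count() and Collection.insert() used above are deprecated in PyMongo 3.x and removed in 4.0. The same loop against the modern API would look roughly like this (a sketch, same collections assumed):
for i in range(1, db.movies.count_documents({}) + 1):
    res = list(db.ratings.aggregate([
        {"$match": {"movieID": str(i)}},
        {"$group": {"_id": "$movieID", "avg": {"$avg": "$rating"}}}
    ]))
    if res:
        # insert_many replaces the removed insert()
        db.question1.insert_many(res)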

Related

SqlAlchemy 2.x with specific columns makes scalars() return non-orm objects

This question is probably me not understanding the architecture of (new) SQLAlchemy. Typically I use code like this:
query = select(models.Organization).where(
    models.Organization.organization_id == organization_id
)
result = await self.session.execute(query)
return result.scalars().all()
Works fine, I get a list of models (if any).
With a query with specific columns only:
query = (
    select(
        models.Payment.organization_id,
        models.Payment.id,
        models.Payment.payment_type,
    )
    .where(
        models.Payment.is_cleared.is_(True),
    )
    .limit(10)
)
result = await self.session.execute(query)
return result.scalars().all()
I am getting the first row, first column only. The same seems to apply here: https://docs.sqlalchemy.org/en/14/core/connections.html?highlight=scalar#sqlalchemy.engine.Result.scalar
My understanding so far was that in new sqlalchemy we should always call scalars() on the query, as described here: https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-orm-usage
But with specific columns, it seems we cannot use scalars() at all. What is even more confusing is that result.scalars() returns sqlalchemy.engine.result.ScalarResult, which has fetchmany(), fetchall() among other methods that I am unable to iterate in any meaningful way.
My question is, what do I not understand?
My understanding so far was that in new sqlalchemy we should always call scalars() on the query
That is mostly true, but only for queries that return whole ORM objects. Just a regular .execute()
query = select(Payment)
results = sess.execute(query).all()
print(results) # [(Payment(id=1),), (Payment(id=2),)]
print(type(results[0])) # <class 'sqlalchemy.engine.row.Row'>
returns a list of Row objects, each containing a single ORM object. Users found that awkward since they needed to unpack the ORM object from the Row object. So .scalars() is now recommended
results = sess.scalars(query).all()
print(results) # [Payment(id=1), Payment(id=2)]
print(type(results[0])) # <class '__main__.Payment'>
However, for queries that return individual attributes (columns) we don't want to use .scalars() because that will just give us one column from each row, normally the first column
query = select(
Payment.id,
Payment.organization_id,
Payment.payment_type,
)
results = sess.scalars(query).all()
print(results) # [1, 2]
Instead, we want to use a regular .execute() so we can see all the columns
results = sess.execute(query).all()
print(results) # [(1, 123, None), (2, 234, None)]
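If you want name-based access to each row instead of positional tuples, Result also provides .mappings(); a short sketch under the same assumptions as the examples above:
results = sess.execute(query).mappings().all()
# each element behaves like a dict keyed by column name, roughly:
# [{'id': 1, 'organization_id': 123, 'payment_type': None}, ...]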
Notes:
.scalars() is doing the same thing in both cases: return a list containing a single (scalar) value from each row (default is index=0).
sess.scalars() is the preferred construct. It is simply shorthand for sess.execute().scalars().

Using python multiprocessing on a for loop that appends results to dictionary

So I've looked at both the documentation of the multiprocessing module, and also at the other questions asked here, and none seem to be similar to my case, hence I started a new question.
For simplicity, I have a piece of code of the form:
import pandas as pd
from collections import defaultdict

# simple dataframe of some users and their properties.
data = {'userId': [1, 2, 3, 4],
        'property': [12, 11, 13, 43]}
df = pd.DataFrame.from_dict(data)

# a function that generates permutations of the above users, in the form of a list of lists
# such as [[1,2,3,4], [2,1,3,4], [2,3,4,1], [2,4,1,3]]
user_perm = generate_permutations(nr_perm=4)

# a function that computes some relation between users
def comp_rel(df, permutation, user_dict):
    df1 = df.userId.isin(permutation[0])
    df2 = df.userId.isin(permutation[1])
    user_dict[permutation[0]] += permutation[1]
    return user_dict

# and finally a loop:
user_dict = defaultdict(int)
for permutation in user_perm:
    user_dict = comp_rel(df, permutation, user_dict)
I know this code makes very little (if any) sense right now, but I just wrote a small example that is close to the structure of the actual code that I am working on. That user_dict should finally contain userIds and some value.
I have the actual code, and it works fine, gives the correct dict and everything, but... it runs on a single thread. And it's painfully slow, keeping in mind that I have another 15 threads totally free.
My question is, how can I use the multiprocessing module of python to change the last for loop, and be able to run on all threads/cores available? I looked at the documentation, it's not very easy to understand.
EDIT: I am trying to use pool as:
p = multiprocessing.Pool(multiprocessing.cpu_count())
p.map(comp_rel(df, permutation, user_dict), user_perm)
p.close()
p.join()
however this breaks because I am using the line:
user_dict = comp_rel(df, permutation, user_dict)
in the initial code, and I don't know how these dictionaries should be merged after pool is done.
After a short discussion in the comments, I've decided to post a solution using ProcessPoolExecutor:
import concurrent.futures
from collections import defaultdict

def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

user_dict = defaultdict(int)
with concurrent.futures.ProcessPoolExecutor() as executor:
    futures = {executor.submit(comp_rel, df, perm): perm for perm in user_perm}
    for future in concurrent.futures.as_completed(futures):
        try:
            k, v = future.result()
        except Exception as e:
            print(f"{futures[future]} throws {e}")
        else:
            user_dict[k] += v
It works the same as #tzaman's answer, but gives you the possibility to handle exceptions per task. There are also more interesting features in this module; check the docs.
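If you don't need per-future exception handling, executor.map is a more compact variant of the same idea (a sketch, assuming the same comp_rel, df and user_perm as above):
from functools import partial

user_dict = defaultdict(int)
with concurrent.futures.ProcessPoolExecutor() as executor:
    # map preserves input order and re-raises the first worker exception here
    for k, v in executor.map(partial(comp_rel, df), user_perm):
        user_dict[k] += v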
There are two parts to your comp_rel which need to be separated - first is the calculation itself which is computing some value for some userID. The second is the "accumulation" step which is adding that value to the user_dict result.
You can separate the calculation itself so that it returns a tuple of (id, value) and farm it out with multiprocessing, then accumulate the results afterwards in the main thread:
from multiprocessing import Pool
from functools import partial
from collections import defaultdict

# We make this a pure function that just returns a result instead of mutating anything
def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

comp_with_df = partial(comp_rel, df)  # df is always the same, so factor it out

with Pool(None) as pool:  # Pool(None) uses cpu_count automatically
    results = pool.map(comp_with_df, user_perm)

# Now add up the results at the end:
user_dict = defaultdict(int)
for k, v in results:
    user_dict[k] += v
Alternatively you could also pass a Manager().dict() object into the processing function directly, but that's a little more complicated and likely won't get you any additional speed.
Based on #Masklinn's suggestion, here's a slightly better way to do it to avoid memory overhead:
user_dict = defaultdict(int)
with Pool(None) as pool:
    for k, v in pool.imap_unordered(comp_with_df, user_perm):
        user_dict[k] += v
This way we add up the results as they complete, instead of having to store them all in an intermediate list first.
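Note that with many small tasks, inter-process messaging overhead can dominate; both map and imap_unordered accept a chunksize argument that batches tasks per worker round-trip to cut that overhead. A sketch reusing comp_with_df from above (the value 100 is an arbitrary starting point to tune):
user_dict = defaultdict(int)
with Pool(None) as pool:
    # chunksize batches several permutations per worker message
    for k, v in pool.imap_unordered(comp_with_df, user_perm, chunksize=100):
        user_dict[k] += v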

How to call get() on dictionary with indexes?

I have an array of dictionaries, but I am running into a scenario where I have to get the value at the 1st index of the array of dictionaries. Following is the chunk that I am trying to query:
address_data = record.get('Rdata')[0].get('Adata')
This throws the following error:
TypeError: 'NoneType' object is not subscriptable
I tried following:
if record.get('Rdata') and record.get('Rdata')[0].get('Adata'):
    address_data = record.get('Rdata')[0].get('Adata')
but I don't know if the above approach is good or not.
So how should I handle this in Python?
Edit:
"partyrecord": {
"Rdata": [
{
"Adata": [
{
"partyaddressid": 172,
"addressid": 142165
}
]
}
]
}
Your expression assumes that record['Rdata'] will return a list with at least one element, so provide one if that isn't the case.
address_data = record.get('Rdata', [{}])[0].get('Adata')
Now if record['Rdata'] doesn't exist, you'll still have an empty dict on which to invoke get('Adata'). The end result will be address_data being set to None.
(Checking for the key first is preferable if a suitable default is expensive to create, since the default argument is evaluated whether get needs to return it or not. But [{}] is fairly lightweight to build.)
You might want to go for the simple, not exciting route:
role_data = record.get('Rdata')
if role_data:
    address_data = role_data[0].get('Adata')
else:
    address_data = None
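Another idiomatic option is the EAFP style ("easier to ask forgiveness than permission"): attempt the lookup directly and catch whatever fails. A minimal sketch, assuming record has the shape shown in the edit:
try:
    address_data = record['Rdata'][0]['Adata']
except (KeyError, IndexError, TypeError):
    # missing key, empty list, or a None somewhere along the chain
    address_data = None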

Dynamodb scan() using FilterExpression

First post here on Stack and fairly new to programming with Python and using DynamoDB, but I'm simply trying to run a scan on my table that returns results based on two pre-defined attributes.
---Here is my Python code snippet---
shift = "3rd"
date = "2017-06-21"
if shift != "":
response = table.scan(
FilterExpression=Attr("Date").eq(date) and Attr("Shift").eq(shift)
)
My DynamoDB has 4 fields.
ID
Date
Shift
Safety
Now for the issue: upon running, I'm getting two table entries returned when I should only be getting the first entry... the one with "No safety issues", based on my scan criteria.
---Here is my DynamoDB return results---
[
    {
        "Shift": "3rd",
        "Safety": "No safety issues",
        "Date": "2017-06-21",
        "ID": "2"
    },
    {
        "Shift": "3rd",
        "Safety": "Cut Finger",
        "Date": "2017-06-22",
        "ID": "4"
    }
]
Items Returned: 2
I believed that by applying the FilterExpression with the logical 'and' specified, the scan operation would look for entries that meet BOTH criteria.
Could this be because the 'Shift' attribute "3rd" is found in both entries? How do I ensure it returns entries based on BOTH criteria being met, and not just giving me results from one attribute type?
I have a feeling this is simple but I've looked at the available documentation at: http://boto3.readthedocs.io/en/latest/reference/services/dynamodb.html#DynamoDB.Table.scan and am still having trouble. Any help would be greatly appreciated!
P.S. I tried to keep the post simple and easy to understand (not including all my program code) however, if additional information is needed I can provide it!
This is because you used Python's and keyword in your expression, instead of the & operator.
If a and b are both considered True, a and b returns the latter, b:
>>> 2 and 3
3
If any of them is False, or if both of them are, the first False object is returned:
>>> 0 and 3
0
>>> 0 and ''
0
The general rule is, and returns the first object that allows it to decide the truthiness of the whole expression.
Instances of custom classes, like Attr conditions, are always considered True in a boolean context by default. So, your expression:
Attr("Date").eq(date) and Attr("Shift").eq(shift)
will evaluate as the last True object, that is:
Attr("Shift").eq(shift)
which explains why you only filtered on the shift.
You need to use the & operator. While it usually means "bitwise and" between integers in Python, it is overloaded for Attr objects to mean what you want: "both conditions".
So you must use the "bitwise and":
FilterExpression=Attr("Date").eq(date) & Attr("Shift").eq(shift)
According to the documentation:
You are also able to chain conditions together using the logical operators: & (and), | (or), and ~ (not).
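For example, these operators can be combined into more complex filters; a sketch using the question's attributes:
from boto3.dynamodb.conditions import Attr

# Date must match, Shift must be 2nd or 3rd, and Safety must not be blank
condition = (
    Attr("Date").eq("2017-06-21")
    & (Attr("Shift").eq("2nd") | Attr("Shift").eq("3rd"))
    & ~Attr("Safety").eq("")
)
response = table.scan(FilterExpression=condition)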
Using parts from each of the above answers, here's a compact way I was able to get this working:
from functools import reduce
from boto3.dynamodb.conditions import Key, And

response = table.scan(
    FilterExpression=reduce(And, [Key(k).eq(v) for k, v in filters.items()])
)
This allows filtering on multiple conditions given in filters as a dict. For example:
{
    'Status': 'Approved',
    'SubmittedBy': 'JackCasey'
}
For multiple filters, you can use this approach:
import boto3
from boto3.dynamodb.conditions import Key, And

filters = dict()
filters['Date'] = "2017-06-21"
filters['Shift'] = "3rd"

response = table.scan(
    FilterExpression=And(*[Key(key).eq(value) for key, value in filters.items()])
)
Expanding on Maxime Paille's answer, this covers the case when only one filter is present vs many.
from boto3.dynamodb.conditions import Attr
from functools import reduce
from operator import and_

def add_expressions(filters: dict):
    if filters:
        conditions = []
        for key, value in filters.items():
            if isinstance(value, str):
                conditions.append(Attr(key).eq(value))
            if isinstance(value, list):
                conditions.append(Attr(key).is_in(value))
        # reduce handles both a single condition and many chained with "and"
        return reduce(and_, conditions)

def build_query_params(filters):
    query_params = {}
    if len(filters) > 0:
        query_params["FilterExpression"] = add_expressions(filters)
    return query_params

filters = dict()
filters['Date'] = "2017-06-21"
filters['Shift'] = "3rd"

response = table.scan(**build_query_params(filters))
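One caveat worth adding: a single scan call reads at most 1 MB of data before filtering, so on larger tables you need to follow LastEvaluatedKey to see every match. A sketch reusing build_query_params from above:
items = []
params = build_query_params(filters)
while True:
    response = table.scan(**params)
    items.extend(response.get("Items", []))
    last_key = response.get("LastEvaluatedKey")
    if not last_key:
        break  # no more pages
    params["ExclusiveStartKey"] = last_key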

Return Random Result from JSON using PyMongo

I'm attempting to retrieve a random result from a collection of JSON data using PyMongo. I'm using Flask and MongoDB. Here is how it is set up:
def getData():
    dataCollection = db["data"]
    for item in dataCollection.find({}, {"Category": 1, "Name": 1, "Location": 1, "_id": 0}):
        return jsonify(item)
return jsonify(item) returns 1 result and it is always the first one. How can I randomize this?
I tried importing the random module (import random) and switching the last line to random.choice(jsonify(item)), but that results in an error.
Here is what the data looks like that was imported into MongoDB:
[
    {
        "Category": "Tennis",
        "Name": "ABC Courts",
        "Location": "123 Fake St"
    },
    {
        "Category": "Soccer",
        "Name": "XYZ Arena",
        "Location": "319 Ace Blvd"
    },
    {
        "Category": "Basketball",
        "Name": "Dome Courts",
        "Location": "8934 My Way"
    }
]
You're always getting one result because return jsonify(item) ends the request on the first iteration. jsonify returns a full response; it does not just turn the result from Mongo into a JSON object. If you want to pick a random document, turn your Mongo cursor into a sequence with list, then use random.choice:
item = random.choice(list(dataCollection.find({}, {"Category": 1, "Name": 1, "Location": 1, "_id": 0})))
return jsonify(item)
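Alternatively, if the collection is large, pulling every document into memory just to pick one is wasteful. MongoDB's $sample aggregation stage does the random pick server-side; a sketch with the same projection:
item = next(dataCollection.aggregate([
    {"$sample": {"size": 1}},  # let the server pick one random document
    {"$project": {"Category": 1, "Name": 1, "Location": 1, "_id": 0}}
]))
return jsonify(item)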
