MongoDB How do I reference an array? - python

I'm trying to create a MongoDB database that contains two collections: Students and Courses.
The first collection "students" contains:
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.Database
student = [{"_id":"0",
"firstname":"Bert",
"lastname":"Holden"},
{"_id":"1",
"firstname":"Sam",
"lastname":"Olsen"},
{"_id":"2",
"firstname":"James",
"lastname":"Swan"}]
students = db.students
students.insert_many(student)
pprint.pprint(students.find_one())
The second collection "courses" contains:
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.Database
course = [{"_id":"10",
"coursename":"Databases",
"grades":"[{student_id:0, grade:83.442}, {student_id:1, grade:45.323}, {student_id:2, grade:87.435}]"}]
courses = db.courses
courses.insert_many(course)
pprint.pprint(courses.find_one())
I then want to use aggregation to find a student and the corresponding courses with grade(s).
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://127.0.0.1:27017")
db = client["Database"]
pipeline = [
{
"$lookup": {
"from": "courses",
"localField": "_id",
"foreignField": "student_id",
"as": "student_course"
}
},
{
"$match": {
"_id": "0"
}
}
]
pprint.pprint(list(db.students.aggregate(pipeline)))
I'm not sure if the student_id/grade is implemented correctly in the "courses" collection, so that might be one reason why my aggregation returns [].
The aggregation works if I create separate courses for each student, but that seems like a waste of memory, so I would like to have one course with all the student_ids and grades in an array.
Expected output:
[{'_id': '0',
'firstname': 'Bert',
'lastname': 'Holden',
'student_course': [{'_id': '10',
'coursename': 'Databases',
'grade': '83.442',
'student_id': '0'}]}]

A couple of points worth mentioning:
Your example code in file "courses.py" is inserting grades as a string that represents an array, not an actual array. This was pointed out by Matt in the comments, and you requested an explanation. Here is my attempt to explain: if you insert a string that looks like an array, you cannot perform $unwind or $lookup on its sub-elements, because they aren't sub-elements; they are part of a string.
You have array data in courses that holds student grades, which are the datapoints that are desired, but you start the aggregation on the students collection. Instead, perhaps change your perspective a bit and come at it from the courses collection instead of the student perspective. If you do, you may restate the requirement as: "show me all courses and student grades where student id is 0".
Your array data seems to have a datatype mismatch. The student id is an integer in your string variable "array", but the students collection has the student id as a string. The types need to be consistent for the $lookup to work properly (if you don't want to perform a bunch of casting; a sketch of such a cast follows below).
Nonetheless, here is a possible solution to your problem. I have revised the Python code, including a redefinition of the aggregation...
The name of my test database is pythontest, as seen in this code example.
This database must exist prior to running the code; otherwise an error is raised.
File students.py
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.pythontest
student = [{"_id":"0",
"firstname":"Bert",
"lastname":"Holden"},
{"_id":"1",
"firstname":"Sam",
"lastname":"Olsen"},
{"_id":"2",
"firstname":"James",
"lastname":"Swan"}]
students = db.students
students.insert_many(student)
pprint.pprint(students.find_one())
Then the courses file. Notice that the field grades is no longer a string but a valid array, and that the student id is a string, not an integer. (In reality, a stronger datatype such as a UUID or an int would likely be preferable.)
File courses.py
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.pythontest
course = [{"_id":"10",
"coursename":"Databases",
"grades": [{ "student_id": "0", "grade": 83.442}, {"student_id": "1", "grade": 45.323}, {"student_id": "2", "grade": 87.435}]}]
courses = db.courses
courses.insert_many(course)
pprint.pprint(courses.find_one())
... and finally, the aggregation file with the changed aggregation pipeline...
File aggregation.py
from pymongo import MongoClient
import pprint
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.pythontest
pipeline = [
{ "$match": { "grades.student_id": "0" } },
{ "$unwind": "$grades" },
{ "$project": { "coursename": 1, "student_id": "$grades.student_id", "grade": "$grades.grade" } },
{
"$lookup":
{
"from": "students",
"localField": "student_id",
"foreignField": "_id",
"as": "student"
}
},
{
"$unwind": "$student"
},
{ "$project": { "student._id": 0 } },
{ "$match": { "student_id": "0" } }
]
pprint.pprint(list(db.courses.aggregate(pipeline)))
Output of running program
> python3 aggregation.py
[{'_id': '10',
'coursename': 'Databases',
'grade': 83.442,
'student': {'firstname': 'Bert', 'lastname': 'Holden'},
'student_id': '0'}]
The format of the data at the end of the program may not be as desired, but can be tweaked by manipulating the aggregation.
** EDIT **
So if you want to approach this aggregation from the student side rather than from the course side, you can still do that, but because the array lives in courses the aggregation will be a bit more complicated: the $lookup must itself use a pipeline to prepare the foreign documents:
Aggregation from Student perspective
db.students.aggregate([
{ $match: { _id: "0" } },
{ $addFields: { "colStudents._id": "$_id" } },
{
$lookup:
{
from: "courses",
let: { varStudentId: "$colStudents._id"},
pipeline:
[
{ $unwind: "$grades" },
{ $match: { $expr: { $eq: ["$grades.student_id", "$$varStudentId" ] } } },
{ $project: { course_id: "$_id", coursename: 1, grade: "$grades.grade", _id: 0} }
],
as: "student_course"
}
},
{ $project: { _id: 0, student_id: "$_id", firstname: 1, lastname: 1, student_course: 1 } }
])
Output
> python3 aggregation.py
[{'firstname': 'Bert',
'lastname': 'Holden',
'student_course': [{'course_id': '10',
'coursename': 'Databases',
'grade': 83.442}],
'student_id': '0'}]

I was finally able to take a look at this..
TLDR; see Mongo Playground
This solution requires you to store grades as an actual object vs a string.
Consider the following database structure:
db={
// Collection
"students": [
{
"_id": "0",
"firstname": "Bert",
"lastname": "Holden"
},
{
"_id": "1",
"firstname": "Sam",
"lastname": "Olsen"
},
{
"_id": "2",
"firstname": "James",
"lastname": "Swan"
}
],
// Collection
"courses": [
{
"_id": "10",
"coursename": "Databases",
"grades": [
{
student_id: "0",
grade: 83.442
},
{
student_id: "1",
grade: 45.325
},
{
student_id: "2",
grade: 87.435
}
]
}
],
}
You can achieve what you want using the following query:
db.students.aggregate([
{
$match: {
_id: "0"
}
},
{
$lookup: {
from: "courses",
pipeline: [
{
$unwind: "$grades"
},
{
$match: {
"grades.student_id": "0"
}
},
{
$group: {
"_id": "$_id",
"coursename": {
$first: "$coursename"
},
"grade": {
$first: "$grades.grade"
},
"student_id": {
$first: "$grades.student_id"
}
}
}
],
as: "student_course"
}
}
])

Related

Add an ascending serial number field to all existing MongoDB documents in a collection

I have a MongoDB collection which looks something like this:
[
{
"Code": "018906",
"X": "0.12",
},
{
"Code": "018907",
"X": "0.18",
},
{
"Code": "018910",
"X": "0.24",
},
{
"Code": "018916",
"X": "0.75",
},
]
I want to add an ascending serial number field to all existing MongoDB documents inside the collection. After adding, the new collection will look like this:
[
{
"Serial": 1,
"Code": "018906",
"X": "0.12",
},
{
"Serial": 2,
"Code": "018907",
"X": "0.18",
},
{
"Serial": 3,
"Code": "018910",
"X": "0.24",
},
{
"Serial": 4,
"Code": "018916",
"X": "0.75",
},
]
I am open to using any Python MongoDB library such as pymongo or mongoengine.
I am using Python 3.7 and MongoDB v4.2.
You can do it with a single aggregation query by grouping all documents into a single array, then unwinding it with the element index included:
db.collection.aggregate([
{
$group: {
_id: null,
doc: {
$push: "$$ROOT"
}
}
},
{
$unwind: {
path: "$doc",
includeArrayIndex: "doc.Serial"
}
},
{
$replaceRoot: {
newRoot: "$doc"
}
},
{
$out: "new_collection_name"
}
])
All the work is done server-side; there is no need to load the whole collection into the application's memory. If the collection is large enough, you may need to run the aggregation with "allowDiskUse".
Prepend a sort stage to ensure the expected order, if required.
First you need to find all the _id values in the collection, then use a bulk write operation:
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://127.0.0.1:27017")
db = client.your_database  # adjust to your database name

records = db.collection.find({}, {'_id': 1})
i = 1
requests = []
for record in records:
    requests.append(UpdateOne({'_id': record['_id']}, {'$set': {'Serial': i}}))
    i = i + 1
db.collection.bulk_write(requests)

In MongoDB, how do I both create a document with fields if they don't exist and increment the value if it does?

I'm trying to improve the performance of my app and my knowledge of MongoDB. I have been able to execute a fire-and-forget query that both creates fields if they don't exist and otherwise increments a value, as follows:
date = "2018-6"
sid = "012345"
cid = "06789"
hour = "09"  # assumed; this variable was used but not defined in the original snippet
key = "MESSAGES.{}.{}.{}.{}".format(date, sid, cid, hour)
db.stats.update({}, {"$inc": {key: 1}})
This produces a single document with the following structure:
document:
{
"MESSAGES": {
"2018-6": {
"012345": {
"06789": 1
},
"011111": {
"06667": 5
}
},
"2018-5": {
"012345": {
"06789": 20
},
"011111": {
"06667": 15
}
}
}
}
As you can probably imagine, it has become a bit of a nightmare to query this structure as the data grows. I'd like to achieve the same fire-and-forget query but with a better, indexable schema. Something like:
documents:
[{
"SID": "012345",
"MESSAGES: {
"MONTHS": {
"KEY": "2018-6",
"CHANNELS": {
"KEY": "06789",
"COUNT": 1
}
},{
"KEY": "2018-5",
"CHANNELS": {
"KEY": "06667",
"COUNT": 20
}
}]
}
},
{
"SID": "011111",
"MESSAGES: {
"MONTHS": {
"KEY": "2018-6",
"CHANNELS": {
"KEY": "06667",
"COUNT": 5
}
},{
"KEY": "2018-5",
"CHANNELS": {
"KEY": "06667",
"COUNT": 15
}
}]
}
}]
I'm working with quite a large amount of data and these queries can happen many times a second, so it's important that I execute just one operation if at all possible. Any advice you can give is very welcome; feel free to criticise anything you see here too, as my goal is to learn.
Thanks in advance!
UPDATED WITH ATTEMPT:
db.test.updateOne({"SERVER_ID": "23894723487sdf"}, {
"$addToSet" : {
"MESSAGES" : {
"DATE": "2018-6",
"CHANNELS": [{
"ID": "239048349",
"COUNT": NumberInt(1)
}]
}
},
"$inc" : {
"MESSAGES.CHANNELS.$.COUNT" : 1
}},
{upsert: true})

ElasticSearch not respecting custom mapping while indexing data

I am using ElasticSearch 6.2.4. I am currently learning it and writing code in Python. Following is my code. No matter whether I give age as an integer or as text, it still accepts it.
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
# index settings
settings = {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"members": {
"dynamic": "strict",
"properties": {
"name": {
"type": "text"
},
"age": {
"type": "integer"
},
}
}
}
}
if not es.indices.exists('family'):
    es.indices.create(index='family', ignore=400, body=settings)
    print('Created Index')
data = {'name': 'Maaz', 'age': "4"}
result = es.index(index='family', id=2, doc_type='members', body=data)
print(result)
You can give 42 or "42" to a numeric field because both are still numbers, and it has no impact on searching or storing this field; but you can't give, for example, "42a" in any numeric field.

How to iterate or remove MongoDB array list items using pymongo?

I want to iterate over the items of a MongoDB array field (the TRANSACTION list) and remove a specific item from it using pymongo.
I create the Mongo collection as below using Python and pymongo. How can I iterate over the array items with pymongo and remove only the final item in the array?
Data insert query using Python pymongo
# added new method to create block chain structure
def addCoinWiseTransaction(self, senz, coin, format_date):
    self.collection = self.db.block_chain
    coinValexists = self.collection.find({"_id": str(coin)}).count()
    print('coin exists : ', coinValexists)
    if coinValexists > 0:
        print('coin hash exists')
        newTransaction = {"$push": {"TRANSACTION": {"SENDER": senz.attributes["#SENDER"],
                                                    "RECIVER": senz.attributes["#RECIVER"],
                                                    "T_NO_COIN": int(1),
                                                    "DATE": datetime.datetime.utcnow()}}}
        self.collection.update({"_id": str(coin)}, newTransaction)
    else:
        flag = senz.attributes["#f"]
        print(flag)
        if flag == "ccb":
            print('new coin mined by other miner')
            root = {"_id": str(coin),
                    "S_ID": int(senz.attributes["#S_ID"]),
                    "S_PARA": senz.attributes["#S_PARA"],
                    "FORMAT_DATE": format_date,
                    "NO_COIN": int(1),
                    "TRANSACTION": [{"MINER": senz.attributes["#M_S_ID"],
                                     "RECIVER": senz.attributes["#RECIVER"],
                                     "T_NO_COIN": int(1),
                                     "DATE": datetime.datetime.utcnow()}]}
            self.collection.insert(root)
        else:
            print('new coin mined')
            root = {"_id": str(coin),
                    "S_ID": int(senz.attributes["#S_ID"]),
                    "S_PARA": senz.attributes["#S_PARA"],
                    "FORMAT_DATE": format_date,
                    "NO_COIN": int(1),
                    "TRANSACTION": [{"MINER": "M_1",
                                     "RECIVER": senz.sender,
                                     "T_NO_COIN": int(1),
                                     "DATE": datetime.datetime.utcnow()}]}
            self.collection.insert(root)
    return 'DONE'
To remove the last entry, the general idea (as you have mentioned) is to iterate the array and grab the index of the last element as denoted by its DATE field, then update the collection by removing it using $pull. So the crucial piece of data you need for this to work is the DATE value and the document's _id.
One approach you could take is to first use the aggregation framework to get this data. With this, you can run a pipeline where the first step is filtering the documents in the collection using the $match operator, which accepts standard MongoDB queries.
The next stage after filtering the documents is to flatten the TRANSACTION array, i.e. denormalise the documents in the list so that you can filter the final item, i.e. get the last document by the DATE field. This is made possible with the $unwind operator, which, for each input document, outputs n documents, where n is the number of array elements (zero for an empty array).
After deconstructing the array, in order to get the last document, use the $group operator to regroup the flattened documents and, in the process, apply the $max accumulator to the embedded DATE field to obtain the last TRANSACTION date.
So in essence, run the following pipeline and use the results to update the collection. For example, you can run the following pipeline:
mongo shell
db.block_chain.aggregate([
{ "$match": { "_id": coin_id } },
{ "$unwind": "$TRANSACTION" },
{
"$group": {
"_id": "$_id",
"last_transaction_date": { "$max": "$TRANSACTION.DATE" }
}
}
])
You can then get the document with the update data from this aggregate operation using the toArray() method (or the aggregation cursor) and update your collection:
var docs = db.block_chain.aggregate([
{ "$match": { "_id": coin_id } },
{ "$unwind": "$TRANSACTION" },
{
"$group": {
"_id": "$_id",
"LAST_TRANSACTION_DATE": { "$max": "$TRANSACTION.DATE" }
}
}
]).toArray()
db.block_chain.updateOne(
{ "_id": docs[0]._id },
{
"$pull": {
"TRANSACTION": {
"DATE": docs[0]["LAST_TRANSACTION_DATE"]
}
}
}
)
python
def remove_last_transaction(self, coin):
    self.collection = self.db.block_chain
    pipe = [
        {"$match": {"_id": str(coin)}},
        {"$unwind": "$TRANSACTION"},
        {
            "$group": {
                "_id": "$_id",
                "LAST_TRANSACTION_DATE": {"$max": "$TRANSACTION.DATE"}
            }
        }
    ]
    # run aggregate pipeline
    cursor = self.collection.aggregate(pipeline=pipe)
    docs = list(cursor)
    # run update, pulling the array entry whose DATE matches the max
    self.collection.update_one(
        {"_id": docs[0]["_id"]},
        {
            "$pull": {
                "TRANSACTION": {
                    "DATE": docs[0]["LAST_TRANSACTION_DATE"]
                }
            }
        }
    )
Alternatively, you can run a single aggregate operation that will also update your collection, using the $out pipeline stage, which writes the results of the pipeline to the same collection:
If the collection specified by the $out operation already
exists, then upon completion of the aggregation, the $out stage atomically replaces the existing collection with the new results collection. The $out operation does not
change any indexes that existed on the previous collection. If the
aggregation fails, the $out operation makes no changes to
the pre-existing collection.
For example, you could run this pipeline:
mongo shell
db.block_chain.aggregate([
{ "$match": { "_id": coin_id } },
{ "$unwind": "$TRANSACTION" },
{ "$sort": { "TRANSACTION.DATE": 1 } }
{
"$group": {
"_id": "$_id",
"LAST_TRANSACTION": { "$last": "$TRANSACTION" },
"FORMAT_DATE": { "$first": "$FORMAT_DATE" },
"NO_COIN": { "$first": "$NO_COIN" },
"S_ID": { "$first": "$S_ID" },
"S_PARA": { "$first": "$S_PARA" },
"TRANSACTION": { "$push": "$TRANSACTION" }
}
},
{
"$project": {
"FORMAT_DATE": 1,
"NO_COIN": 1,
"S_ID": 1,
"S_PARA": 1,
"TRANSACTION": {
"$setDifference": ["$TRANSACTION", ["$LAST_TRANSACTION"]]
}
}
},
{ "$out": "block_chain" }
])
python
def remove_last_transaction(self, coin):
    self.db.block_chain.aggregate([
        {"$match": {"_id": str(coin)}},
        {"$unwind": "$TRANSACTION"},
        {"$sort": {"TRANSACTION.DATE": 1}},
        {
            "$group": {
                "_id": "$_id",
                "LAST_TRANSACTION": {"$last": "$TRANSACTION"},
                "FORMAT_DATE": {"$first": "$FORMAT_DATE"},
                "NO_COIN": {"$first": "$NO_COIN"},
                "S_ID": {"$first": "$S_ID"},
                "S_PARA": {"$first": "$S_PARA"},
                "TRANSACTION": {"$push": "$TRANSACTION"}
            }
        },
        {
            "$project": {
                "FORMAT_DATE": 1,
                "NO_COIN": 1,
                "S_ID": 1,
                "S_PARA": 1,
                "TRANSACTION": {
                    "$setDifference": ["$TRANSACTION", ["$LAST_TRANSACTION"]]
                }
            }
        },
        {"$out": "block_chain"}
    ])
Whilst this approach can be more efficient than the first, it requires knowledge of the existing fields up front, so in some cases it may not be practical.

MongoDB distinct count for multiple fields, with example

I am using pymongo's MongoClient to do a distinct count across multiple fields.
I found a similar example here: Link
But it doesn't work for me.
For example, given:
data = [{"name": random.choice(all_names),
"value": random.randint(1, 1000)} for i in range(1000)]
collection.insert(data)
I want to count how many distinct (name, value) combinations there are. So I followed the link above and wrote this just as a test (I know this solution is not what I want; I just followed the pattern of the link, trying to understand how it works, and at least this code returns me something):
collection.aggregate([
{
"$group": {
"_id": {
"name": "$name",
"value": "$value",
}
}
},
{
"$group": {
"_id": {
"name": "$_id.name",
},
"count": {"$sum": 1},
},
}
])
But the console gives me this:
on namespace test.$cmd failed: exception: A pipeline stage
specification object must contain exactly one field.
So, what is the right code to do this? Thank you for all your help.
Finally I found a solution: Group by Null
res = col.aggregate([
{
"$group": {
"_id": {
"name": "$name",
"value": "$value",
},
}
},
{
"$group": {"_id": None, "count": {"$sum": 1}}
},
])
