Efficiently query missing integers in a range on a field? - python

I have a database for a backup service I'm writing to backup Yahoo! Groups. It incrementally retrieves messages, which have a contiguous numeric id. stored in a 'message_id' field. So, if the last message on the service is message number 10000, then once the backup is complete, the database should contain 10000 documents, with the sorted 'message_id's of each document being equivalent to range(1, 10000+1).
I'd like to write a query yielding the missing message ids. So if I have 9995 documents in the database, and messages 10, 15, 49, 99, and 1043 are missing, it should return [10, 15, 49, 99, 1043].
I've done the following, getting just the ids from the database and running a set intersection in my app code:
def missing_message_ids(self):
"""Return the set of the ids of all missing messages.."""
latest = self.get_latest_message()
ids = set(range(1, latest['_id']+1))
present_ids = set(doc['_id'] for doc in self.db.messages.find({}, {'_id': 1}))
return ids - present_ids
This is fine for my purposes, but it seems like it might get too slow for a vast number of messages. This is more for curiosity's sake than a real performance requirement: Is there any more efficient way to do this, perhaps entirely on the database engine?

in SQL word one could use CTE for that, in mongo we can use aggregation with $lookup as a kind of CTE (common table expressions)
having this data structure
{
"_id" : ObjectId("575deea531dcfb59af388e17"),
"mesId" : 4.0
}, {
"_id" : ObjectId("575deea531dcfb59af388e18"),
"mesId" : 6.0
}
with missing "mesId" : 5.0 we can use this aggregation query, which will project all next expected ids, and join on them. The limitation here is if we have missing more than one message in sequence, but this could be extended by projecting next Id and making $lookup again.
var project = {
$project : {
_id : 0,
mesId : 1,
nextId : {
$sum : ["$mesId", 1]
}
}
}
var lookup = {
$lookup : {
from : "claudiu",
localField : "nextId",
foreignField : "mesId",
as : "missing"
}
}
var match = {
$match : {
missing : []
}
}
db.claudiu.aggregate([project, lookup, match])
and output:
{
"mesId" : 4.0,
"nextId" : 5.0,
"missing" : []
}

Related

Combine and Subtract items from different Collecitons

Excuse my ignorance, I am new in MongoDB. I am having tree collections, where the one is a superset of the other two whose elements are not overlapped. Each item is distinguish by a unique string id. What I want is to get the items of the superset that are not included in the other two collections. Could you please provide me some hint on how do do this efficiently?
Thanks.
EDIT:
Superset structure:
{ "_id" : 1, "str_id" : "ABC1fd3fsewer", "date": "a day" }
Subset 1 structure: { "_id" : 1, "str_id" : "ABre1fd3fsewer", "description" : "product" }
Subset 2 structure: { "_id" : 1, "str_id" : "ABC1fd3fsewfe"}
Each collection has a different structure but all have a common filed, the str_id.
EDIT Improved by #Neel suggestion
I have following format:
parent = [{'str_id':'a', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'},{'str_id':'b',...},{'str_id':'c',...},{'str_id':'d',...}...]
child1 = [{'str_id':'a', 'tag2': child1_random'},{'str_id':'b', 'tag2': 'child1_random'}]
child2 = [{'str_id':'c', 'tag1':'child2_random'}]
and I want
outcome = [{'str_id':'c', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'},{'str_id':'d', 'tag1':'parent_random', 'tag2': 'parent_random', 'tag3':'parent_random'}]
It sounds like you'll need an aggregate operation.
This document might help you:
Lookup in an array
You can do multiple lookups with one aggregate operation so you can check both the subset collections.
I am going to assume you are working with a REST API and that the client is sending a request for a subset of documents from the superset collection. You can send the array of documents you want to check from superset from the client then:
1 - match all the documents in superset to the array of documents you're sending
2 - unwind your superset document array
3 - lookup the subset collections on "str_id" field and set to a field, like "subset_one_results".
4 - do a match operation on both subset results that returns an empty array on, say, "subset_one_results"... this will match all superset documents that are not contained in subset1 for example.
$match({ $and : { "subset_one_results" : { $eq : [] } }, { "subset_two_results" : { $eq : [] } } })
5 - group them in a new array if you want to return them as an array to the client.
To increase the performance of your operations, you have to determine how often this request will be made. If it risks being often, be sure to create an index on the field that will be solicited if it's not an ObjectId field. I can't tell from your code if you are using a custom string field or an ObjectId, which is why I'm bringing up this point.
I don't know what you're using for making your queries (pure MongoDB query language, driver, etc.) so I am not sure how to answer with code hence delineating the steps up above.

How to use PyMongo find() to search nested array attribute?

Using PyMongo, how would one find/search for the documents where the nested array json object matches a given string.
Given the following 2 Product JSON documents in a MongoDB collection..
[{
"_id" : ObjectId("5be1a1b2aa21bb3ceac339b0"),
"id" : "1",
"prod_attr" : [
{
"name" : "Branded X 1 Sneaker"
},
{
"hierarchy" : {
"dept" : "10",
"class" : "101",
"subclass" : "1011"
}
}
]
},
{
"_id" : ObjectId("7be1a1b2aa21bb3ceac339xx"),
"id" : "2",
"prod_attr" : [
{
"name" : "Branded Y 2 Sneaker"
},
{
"hierarchy" : {
"dept" : "10",
"class" : "101",
"subclass" : "2022"
}
}
]
}
]
I would like to
1. return all documents where prod_att.hierarchy.subclass = "2022"
2. return all documents where prod_attr.name contains "Sneaker"
I appreciate the JSON could be structured differently, unfortunately that is not within my control to change.
1. Return all documents where prod_attr.hierarchy.subclass = "2022"
Based on the Query an Array of Embedded Documents documentation of MongoDB you can use dot notation concatenating the name of the array field (prod_attr), with a dot (.) and the name of the field in the nested document (hierarchy.subclass):
collection.find({"prod_attr.hierarchy.subclass": "2022"})
2. Return all documents where prod_attr.name contains "Sneaker"
As before, you can use the dot notation to query a field of a nested element inside an array.
To perform the "contains" query you have to use the $regex operator:
collection.find({"prod_attr.name": {"$regex": "Sneaker"}})
Another option is to use the MongoDB Aggregation framework:
collection.aggregate([
{"$unwind": "$prod_attr"},
{"$match": {"prod_attr.hierarchy.subclass": "2022"}}
])
the $unwind operator creates a new object for each object inside the prod_attr array, so you will have only nested documents and no array (check the documentation for details).
The next step is the $match operator that actually perform a query on the nested object.
This is a simple example but playing with the Aggregators Operators you have a lot of flexibility.

SELECT column FROM table ORDER BY date ASC mongodb equivalent [duplicate]

In my MongoDB, I have a student collection with 10 records having fields name and roll. One record of this collection is:
{
"_id" : ObjectId("53d9feff55d6b4dd1171dd9e"),
"name" : "Swati",
"roll" : "80",
}
I want to retrieve the field roll only for all 10 records in the collection as we would do in traditional database by using:
SELECT roll FROM student
I went through many blogs but all are resulting in a query which must have WHERE clause in it, for example:
db.students.find({ "roll": { $gt: 70 })
The query is equivalent to:
SELECT * FROM student WHERE roll > 70
My requirement is to find a single key only without any condition. So, what is the query operation for that.
From the MongoDB docs:
A projection can explicitly include several fields. In the following operation, find() method returns all documents that match the query. In the result set, only the item and qty fields and, by default, the _id field return in the matching documents.
db.inventory.find( { type: 'food' }, { item: 1, qty: 1 } )
In this example from the folks at Mongo, the returned documents will contain only the fields of item, qty, and _id.
Thus, you should be able to issue a statement such as:
db.students.find({}, {roll:1, _id:0})
The above statement will select all documents in the students collection, and the returned document will return only the roll field (and exclude the _id).
If we don't mention _id:0 the fields returned will be roll and _id. The '_id' field is always displayed by default. So we need to explicitly mention _id:0 along with roll.
get all data from table
db.student.find({})
SELECT * FROM student
get all data from table without _id
db.student.find({}, {_id:0})
SELECT name, roll FROM student
get all data from one field with _id
db.student.find({}, {roll:1})
SELECT id, roll FROM student
get all data from one field without _id
db.student.find({}, {roll:1, _id:0})
SELECT roll FROM student
find specified data using where clause
db.student.find({roll: 80})
SELECT * FROM students WHERE roll = '80'
find a data using where clause and greater than condition
db.student.find({ "roll": { $gt: 70 }}) // $gt is greater than
SELECT * FROM student WHERE roll > '70'
find a data using where clause and greater than or equal to condition
db.student.find({ "roll": { $gte: 70 }}) // $gte is greater than or equal
SELECT * FROM student WHERE roll >= '70'
find a data using where clause and less than or equal to condition
db.student.find({ "roll": { $lte: 70 }}) // $lte is less than or equal
SELECT * FROM student WHERE roll <= '70'
find a data using where clause and less than to condition
db.student.find({ "roll": { $lt: 70 }}) // $lt is less than
SELECT * FROM student WHERE roll < '70'
I think mattingly890 has the correct answer , here is another example along with the pattern/commmand
db.collection.find( {}, {your_key:1, _id:0})
> db.mycollection.find().pretty();
{
"_id": ObjectId("54ffca63cea5644e7cda8e1a"),
"host": "google",
"ip": "1.1.192.1"
}
db.mycollection.find({},{ "_id": 0, "host": 1 }).pretty();
Here you go , 3 ways of doing , Shortest to boring :
db.student.find({}, 'roll _id'); // <--- Just multiple fields name space separated
// OR
db.student.find({}).select('roll _id'); // <--- Just multiple fields name space separated
// OR
db.student.find({}, {'roll' : 1 , '_id' : 1 ); // <---- Old lengthy boring way
To remove specific field use - operator :
db.student.find({}).select('roll -_id') // <--- Will remove id from result
While gowtham's answer is complete, it is worth noting that those commands may differ from on API to another (for those not using mongo's shell).
Please refer to documentation link for detailed info.
Nodejs, for instance, have a method called `projection that you would append to your find function in order to project.
Following the same example set, commands like the following can be used with Node:
db.student.find({}).project({roll:1})
SELECT _id, roll FROM student
Or
db.student.find({}).project({roll:1, _id: 0})
SELECT roll FROM student
and so on.
Again for nodejs users, do not forget (what you should already be familiar with if you used this API before) to use toArray in order to append your .then command.
Try the following query:
db.student.find({}, {roll: 1, _id: 0});
And if you are using console you can add pretty() for making it easy to read.
db.student.find({}, {roll: 1, _id: 0}).pretty();
Hope this helps!!
Just for educational purposes you could also do it with any of the following ways:
1.
var query = {"roll": {$gt: 70};
var cursor = db.student.find(query);
cursor.project({"roll":1, "_id":0});
2.
var query = {"roll": {$gt: 70};
var projection = {"roll":1, "_id":0};
var cursor = db.student.find(query,projection);
`
db.<collection>.find({}, {field1: <value>, field2: <value> ...})
In your example, you can do something like:
db.students.find({}, {"roll":true, "_id":false})
Projection
The projection parameter determines which fields are returned in the
matching documents. The projection parameter takes a document of the
following form:
{ field1: <value>, field2: <value> ... }
The <value> can be any of the following:
1 or true to include the field in the return documents.
0 or false to exclude the field.
NOTE
For the _id field, you do not have to explicitly specify _id: 1 to
return the _id field. The find() method always returns the _id field
unless you specify _id: 0 to suppress the field.
READ MORE
For better understanding I have written similar MySQL query.
Selecting specific fields
MongoDB : db.collection_name.find({},{name:true,email:true,phone:true});
MySQL : SELECT name,email,phone FROM table_name;
Selecting specific fields with where clause
MongoDB : db.collection_name.find({email:'you#email.com'},{name:true,email:true,phone:true});
MySQL : SELECT name,email,phone FROM table_name WHERE email = 'you#email.com';
This works for me,
db.student.find({},{"roll":1})
no condition in where clause i.e., inside first curly braces.
inside next curly braces: list of projection field names to be needed in the result and 1 indicates particular field is the part of the query result
getting name of the student
student-details = db.students.find({{ "roll": {$gt: 70} },{"name": 1, "_id": False})
getting name & roll of the student
student-details = db.students.find({{ "roll": {$gt: 70}},{"name": 1,"roll":1,"_id": False})
I just want to add to the answers that if you want to display a field that is nested in another object, you can use the following syntax
db.collection.find( {}, {{'object.key': true}})
Here key is present inside the object named object
{ "_id" : ObjectId("5d2ef0702385"), "object" : { "key" : "value" } }
var collection = db.collection('appuser');
collection.aggregate(
{ $project : { firstName : 1, lastName : 1 } },function(err, res){
res.toArray(function(err, realRes){
console.log("response roo==>",realRes);
});
});
it's working
Use the Query like this in the shell:
1. Use database_name
e.g: use database_name
2. Which returns only assets particular field information when matched , _id:0 specifies not to display ID in the result
db.collection_name.find( { "Search_Field": "value" },
{ "Field_to_display": 1,_id:0 } )
If u want to retrieve the field "roll" only for all 10 records in the collections.
Then try this.
In MongoDb :
db.students.find( { } , { " roll " : { " $roll " })
In Sql :
select roll from students
The query for MongoDB here fees is collection and description is a field.
db.getCollection('fees').find({},{description:1,_id:0})
Apart from what people have already mentioned I am just introducing indexes to the mix.
So imagine a large collection, with let's say over 1 million documents and you have to run a query like this.
The WiredTiger Internal cache will have to keep all that data in the cache if you have to run this query on it, if not that data will be fed into the WT Internal Cache either from FS Cache or Disk before the retrieval from DB is done (in batches if being called for from a driver connected to database & given that 1 million documents are not returned in 1 go, cursor comes into play)
Covered query can be an alternative. Copying the text from docs directly.
When an index covers a query, MongoDB can both match the query conditions and return the results using only the index keys; i.e. MongoDB does not need to examine documents from the collection to return the results.
When an index covers a query, the explain result has an IXSCAN stage that is not a descendant of a FETCH stage, and in the executionStats, the totalDocsExamined is 0.
Query : db.getCollection('qaa').find({roll_no : {$gte : 0}},{_id : 0, roll_no : 1})
Index : db.getCollection('qaa').createIndex({roll_no : 1})
If the index here is in WT Internal Cache then it would be a straight forward process to get the values. An index has impact on the write performance of the system thus this would make more sense if the reads are a plenty compared to the writes.
If you are using the MongoDB driver in NodeJs then the above-mentioned answers might not work for you. You will have to do something like this to get only selected properties as a response.
import { MongoClient } from "mongodb";
// Replace the uri string with your MongoDB deployment's connection string.
const uri = "<connection string uri>";
const client = new MongoClient(uri);
async function run() {
try {
await client.connect();
const database = client.db("sample_mflix");
const movies = database.collection("movies");
// Query for a movie that has the title 'The Room'
const query = { title: "The Room" };
const options = {
// sort matched documents in descending order by rating
sort: { "imdb.rating": -1 },
// Include only the `title` and `imdb` fields in the returned document
projection: { _id: 0, title: 1, imdb: 1 },
};
const movie = await movies.findOne(query, options);
/** since this method returns the matched document, not a cursor,
* print it directly
*/
console.log(movie);
} finally {
await client.close();
}
}
run().catch(console.dir);
This code is copied from the actual MongoDB doc you can check here.
https://docs.mongodb.com/drivers/node/current/usage-examples/findOne/
db.student.find({}, {"roll":1, "_id":0})
This is equivalent to -
Select roll from student
db.student.find({}, {"roll":1, "name":1, "_id":0})
This is equivalent to -
Select roll, name from student
In mongodb 3.4 we can use below logic, i am not sure about previous versions
select roll from student ==> db.student.find(!{}, {roll:1})
the above logic helps to define some columns (if they are less)
Using Studio 3T for MongoDB, if I use .find({}, { _id: 0, roll: true }) it still return an array of objects with an empty _id property.
Using JavaScript map helped me to only retrieve the desired roll property as an array of string:
var rolls = db.student
.find({ roll: { $gt: 70 } }) // query where role > 70
.map(x => x.roll); // return an array of role
Not sure this answers the question but I believe it's worth mentioning here.
There is one more way for selecting single field (and not multiple) using db.collection_name.distinct();
e.g.,db.student.distinct('roll',{});
Or, 2nd way: Using db.collection_name.find().forEach(); (multiple fields can be selected here by concatenation)
e.g., db.collection_name.find().forEach(function(c1){print(c1.roll);});
_id = "123321"; _user = await likes.find({liker_id: _id},{liked_id:"$liked_id"}); ;
let suppose you have liker_id and liked_id field in the document so by putting "$liked_id" it will return _id and liked_id only.
For Single Update :
db.collection_name.update({ field_name_1: ("value")}, { $set: { field_name_2 : "new_value" }});
For MultiUpdate :
db.collection_name.updateMany({ field_name_1: ("value")}, { $set: {field_name_2 : "new_value" }});
Make sure indexes are proper.

Searching JSON, Returning String within the JSON Structure

I have a set of JSON data that looks simular to this:
{"executions": [
{
"id": 17,
"orderId": 16,
"executionStatus": "1",
"cycleId": 5,
"projectId": 15006,
"issueId": 133038,
"issueKey": "QTCMP-8",
"label": "",
"component": "",
"projectKey": "QTCMP",
"executionDefectCount": 0,
"stepDefectCount": 0,
"totalDefectCount": 0
},
{
"id": 14,
"orderId": 14,
"executionStatus": "1",
"cycleId": 5,
"projectId": 15006,
"issueId": 133042,
"issueKey": "QTCMP-10",
"label": "",
"component": "",
"projectKey": "QTCMP",
"executionDefectCount": 0,
"stepDefectCount": 0,
"totalDefectCount": 0
}
],
"currentlySelectedExecutionId": "",
"recordsCount": 4
}
I have taken this and parsed it with Python as below:
import json
import pprint
with open('file.json') as dataFile:
data = json.load(dataFile)
With this I am able to find sets of data like executions by doing data["executions"] etc.. What I need to be able to do is search for the string "QTCMP-8" within the structure, and then when I find that specific string, find the "id" that is associated with that string. So in the case of QTCMP-8 it would be id 17; for QTCMP-10 it would be 14.
Is this possible? Do I need to convert the data first? Any help is greatly appreciated!
You can't do this in computational order of O(1), at least with what it is like now. The following is a solution with O(n) complexity for each search.
id = None
for dic in executions:
if dic['issueKey'] == query:
id = dic['id']
break
Doing this in O(1), need a pre-processing of O(n), in which you categorize executions by their issueKey, and save inside it whatever information you want.
# Preprocessing of O(n)
mapping = dict()
for dic in executions:
mapping[dic['issueKey']] = {
'id': dic['id'],
'whatever': 'whateverel'
}
# Now you can query in O(1)
return dic[query]['id']
You might also want to consider working with MongoDB or likes of it, if you're doing heavy json querying.
A simple iteration with a condition will do the job:
for execution in data['executions']:
if "QTCMP" in execution['issueKey']:
print(execution["id"])
# -> 17
# -> 14
You can get the list of all ids as:
>>> [item['id'] for item in my_json['executions'] if item['issueKey'].startswith('QTCMP')]
[17, 14]
where my_json is the variable storing your JSON structure
Note: I am using item['issueKey'].startswith('QTCMP') instead of 'QTCMP' in item['issueKey'] as you need id of items starting with QTCMP. For example, if the value is XXXQTCMP, it's id should not present with the result (but it will result as True on using in statement)

MongoDB query in pymongo with sort feature

I am new in MongoDB and I am trying to create a query.
I have a list, for example: mylist = [a,b,c,d,e]
My dataset has one key with a similar list: mydatalist = [b,d,g,e]
I want to create a query that will return all the data that contains at least one from the mylist.
What I have done.
query = {'mydatalist': {'$in': mylist}}
selector = {'_id':1,'name':1}
mydata = collection.find(query,selector)
That's work perfect. The only thing I want to do and I cannot is to sort the results in base of the number of mylist data they have in the mydatalist. Is there any way to do this in the query or I have to do it manually after in the cursor?
Update with an example:
mylist = [a,b,c,d,e,f,g]
#data from collection
data1[mydatalist] = [a,b,k,l] #2 items from mylist
data2[mydatalist] = [b,c,d,e,m] #4items from mylist
data3[mydatalist] = [a,u,i] #1 item from mylist
So, I want the results to be sorted as data2 -> data1 -> data3
So you want the results sorted by the number of matches to your array selection. Not a simple thing for a find but this can be done with the aggregation framework:
db.collection.aggregate([
// Match your selection to minimise the
{$match: {list: {$in: ['a','b','c','d','e','f','g']}}},
// Projection trick, keep the original document
{$project: {_id: {_id: "$_id", list: "$list" }, list: 1}},
// Unwind the array
{$unwind: "$list"},
// Match only the elements you want
{$match: {list: {$in: ['a','b','c','d','e','f','g']}}},
// Sum up the count of matches
{$group: {_id: "$_id", count: {$sum: 1}}},
// Order by count descending
{$sort: {count: -1 }},
// Clean up the response, however you want
{$project: { _id: 0, _id: "$_id._id", list: "$_id.list", count: 1 }}
])
And there you have your documents in the order you want:
{
"result" : [
{
"_id" : ObjectId("5305bc2dff79d25620079105"),
"count" : 4,
"list" : ["b","c","d","e","m"]
},
{
"_id" : ObjectId("5305bbfbff79d25620079104"),
"count" : 2,
"list" : ["a","b","k","l"]
},
{
"_id" : ObjectId("5305bc41ff79d25620079106"),
"count" : 1,
"list" : ["a","u","i"]
}
],
"ok" : 1
}
Also, it is probably worth mentioning that aggregate in all recent driver versions will return a cursor just as is the case with find. Currently this is emulated by the driver, but as of version 2.6 it will really be for real. This makes aggregate a very valid "swap-in" replacement for find in your implemented calls.

Categories