What is the best way to look for a document’s position in a collection?
I'm using the following code, but it performs very poorly on my large collection of documents.
def get_top_func(user_score):
    # count how many documents have a higher score, then add 1 to get the rank
    return db.collection.count_documents({'score': {'$gt': user_score}}) + 1
Considering your query and the fact that you mention it is slow, I am guessing that you don't have an index on the score field. Creating an index should make this query faster:
db.collection.createIndex( { score: -1 } )
After that, you should be able to run the query you have with better performance.
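For reference, the same index can also be created from PyMongo (a small sketch, assuming db is your PyMongo database handle):
db.collection.create_index([('score', -1)])  # descending index on 'score'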
You can add a field such as myDocId whose value acts as a counter.
That way, while inserting each document into the collection, you also store its number along with the document data.
Every document in a collection already gets an identifier key, _id, filled in by MongoDB itself, but that won't tell you which document it is in sequence (the nth document), because it is made up of 4 parts:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
Also refer to https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
and to MongoDB Query, sort then take nth document for group.
So what you can do is run an aggregation: apply your filters, then project as needed to get the nth element using $arrayElemAt.
https://docs.mongodb.com/manual/reference/operator/aggregation/arrayElemAt/
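For example, here is a rough PyMongo sketch (the sort key and the 0-based position n are assumptions, adjust them to your own filters):
n = 4  # position of the document you want (0-based)
pipeline = [
    {'$sort': {'score': -1}},                                 # your match/sort stages go here
    {'$group': {'_id': None, 'docs': {'$push': '$$ROOT'}}},   # collect documents in order
    {'$project': {'nth': {'$arrayElemAt': ['$docs', n]}}},    # pick the nth one
]
result = list(db.collection.aggregate(pipeline))
nth_doc = result[0]['nth'] if result else None
Keep in mind that $push gathers everything into a single array, so on a very large collection this can run into the pipeline memory limits.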
So I am inserting data that looks like this into my MongoDB collection; it is some polling data.
Link to sample insertions
What I plan on doing is combining the "Poll_Name", "Date", "Sample_Size" and "MoE" values into a unique string and then using some function to convert it into a unique id value.
What I wish to get out of this function is the ability to both create an id for each poll and produce the same id again if a duplicate string is given.
So for example, let's say I wish to add this poll to my database:
{'Poll_Name': 'NBC News/Marist', 'Date': '2020-03-10', 'Sample_Size': '2523 RV', 'MoE': '2.7', 'Biden (D)': '47', 'Trump(R)': '46', 'Spread': 'Biden +1'}
and I create an id for this poll from its "Poll_Name", "Date", "Sample_Size" and "MoE" values,
so the string would come out something like this:
poll_String = "NBC News/Marist2020-03-102523RV2.7"
I then put it through the function that creates an id, and let's say it spits out the value "12345" (for simplicity's sake).
Later on in the insertions, let's say I am adding an exact duplicate of this poll, so the "poll_String" comes out exactly the same for the duplicate.
I need the id-creation function to return the exact same value, i.e. 12345, so that I then know the poll being added is a duplicate. And obviously, in the process, the id created must remain completely unique relative to other polls that differ, so as not to create incorrect duplicate ids.
Is this possible? Or am I asking for something too advanced?
You can use a hashing function to create a hash of the data.
Keep in mind that hashing does not guarantee that another piece of data will never produce the same hash; it is just very unlikely.
So consider the following code:
import hashlib
some_string = "Some test string I want to generate an ID from"
new_id = hashlib.md5(some_string.encode()).hexdigest()
print(new_id)
This snippet will print 051ba4078ab8419b76388ee9173dac1a.
Please note that md5 hashes should not be used to store passwords.
Also, if you want the id to be shorter than this you can simply take the first x characters of the hash. But remember, the shorter the id, the higher the chance of you getting two pieces of data with the same auto-generated id.
The odds of two given pieces of data getting the same auto-generated id with this method are roughly 1/16^x. Consider how much data you have and how unlikely you want id collisions to be; keeping the chance of a collision well under 1% over the lifetime of the application is reasonable in my opinion.
So if you have, say, 100M items, taking the first 10 hexadecimal characters of the md5 hash gives each new item a likelihood of about 0.01% of colliding with an existing id (assuming of course that no two items are identical).
Also, it isn't random, so for the same string you will always get the same hash value.
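Applied to the polls from the question, a minimal sketch could look like this (poll_id and the polls list are illustrative names, not an existing API):
import hashlib

def poll_id(poll):
    # Concatenate the fields mentioned in the question and hash the result.
    poll_string = poll['Poll_Name'] + poll['Date'] + poll['Sample_Size'] + poll['MoE']
    return hashlib.md5(poll_string.encode()).hexdigest()

seen = set()
for poll in polls:  # 'polls' is assumed to be your list of poll dicts
    pid = poll_id(poll)
    if pid in seen:
        continue  # same string -> same hash, so this poll is a duplicate
    seen.add(pid)
    # otherwise insert the poll with pid as its id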
I'm trying to write a cloud function that returns users near a specific location. get_nearby() returns a list of tuples containing upper and lower bounds for a geohash query, and then this loop should query firebase for users within those geohashes.
user_ref = db.collection(u'users')
db_response = []
for query_range in get_nearby(lat, long, radius):
    query = user_ref.where(u'geohash', u'>=', query_range[0]).where(u'geohash', u'<=', query_range[1]).get()
    for el in query:
        db_response.append(el.to_dict())
For some reason when I run this code, it returns only one document from my database, even though there are three other documents with the same geohash as that one. I know the documents are there, and they do get returned when I request the entire collection. What am I missing here?
edit:
The database currently has 4 records in it, 3 of which should be returned in this query:
[
    {name: "Trevor", geohash: "dnvtz"},  # this is the one that gets returned
    {name: "Test", geohash: "dnvtz"},
    {name: "Test", geohash: "dnvtz"}
]
query_range is a tuple with two values. A lower and upper bound geohash. In this case, it's ("dnvt0", "dnvtz").
I decided to clear all documents from my database and then generate a new set of sample data to work with (everything there was only for testing anyway, nothing important). After pushing the new data to Firestore, everything is working. My only assumption is that even though the strings matched up, I'd used the wrong encoding on some of them.
Let's take this simple collection col with 2 documents:
{
"_id" : ObjectId("5ca4bf475e7a8e4881ef9dd2"),
"timestamp" : 1551736800,
"score" : 10
}
{
"_id" : ObjectId("5ca4bf475e7a8e4881ef9dd3"),
"timestamp" : 1551737400,
"score" : 12
}
To access the last timestamp (the one from the second document), I first did this request:
a = db['col'].find({}).sort("_id", -1)
and then a[0]['timestamp']
But as there will be a lot of documents in this collection, I think it would be more efficient to request only the last one with the limit function, like:
a = db['col'].find({}).sort("_id", -1).limit(1)
and then
for doc in a:
    lastTimestamp = doc['timestamp']
As there will be only one document, I can declare the variable inside the loop.
So, three questions:
Do I have to worry about memory/speed issues if I keep using the first request and simply take the first element of the cursor?
Is there a smarter way to access the first element of the cursor than using a loop, when using the limit request?
Is there another way to get that timestamp that I don't know about?
Thanks !
Python 3.6 / Pymongo 3.7
If you are using a field with a unique index in the selection criteria, you should use the find_one method, which will return the only document that matches your query.
That being said, the find method returns a Cursor object and does not load the data into memory.
You might get better performance if you were using a filter; your query as it is now will do a collection scan.
If you are not using a filter and want to retrieve the last document, then the clean way is with the Python built-in next function. You could also use the cursor's next method.
cur = db["col"].find().sort("_id", -1).limit(1)
doc = next(cur, None)  # None when the collection is empty.
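As a side note, find_one also accepts a sort argument, so the same thing can be done in a single call (a small sketch):
last = db['col'].find_one(sort=[('_id', -1)])  # newest document as a dict, or None
lastTimestamp = last['timestamp'] if last else None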
find().sort() with limit(1) is fast, so don't worry about the speed; taking the first element of the cursor this way is the best approach.
This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this to go through each entry, manually check whether it contains a mention of either keyword, and then keep a dictionary for each that maps the date/location to a count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally I should be OK with just using expensive operations like a Scan right?
So if I did something like this:
query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT 3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this set
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
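For the counting part, something like this is what I have in mind (state_from_geo is a placeholder for the lat/long-to-state mapping I mentioned I can already do):
from collections import defaultdict

counts = defaultdict(int)  # keyed by (state, date)
for item in filtered_items:  # the items accumulated from the scans above
    state = state_from_geo(item['geo'])  # placeholder: [latitude, longitude] -> state
    counts[(state, item['date'])] += 1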
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)
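Concretely, the fix looks roughly like this (every attribute is aliased, so none of the names can clash with a reserved word):
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='#dt, #loc, #usr, #txt',
    ExpressionAttributeNames={
        '#dt': 'date',
        '#loc': 'location',
        '#usr': 'user',
        '#txt': 'text',
    },
    Limit=100)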
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to repeat the scan until LastEvaluatedKey is no longer present.
Refer to ExclusiveStartKey as well.
Scan
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
Let's say I have a mongodb collection of the following layout:
{'number':1, '_id':...}
{'number':2, '_id':...}
{'number':4, '_id':...}
and so on. As demonstrated, not all the numbers currently present have to be consecutive.
I want to write code which (a) determines what is the highest value for number found in collection and then (b) inserts a new document whose value for number is 1 higher than the current largest.
So if this is the only code that operates on the collection, no particular value for number should be duplicated. The issue is that, done naively, this creates a race condition where two threads of this code running in parallel might find the same highest value and then insert the same next highest number twice.
So how would I do this atomically? I'm working in Python, so I would prefer a solution in that language, but I will accept an answer that explains the concept in a way that can be adapted to any language.
MongoEngine does what you're looking for in its SequenceField.
Create a new collection called indexes. This collection will look like this:
[
{ '_id': 'mydata.number', 'next': 5 }
]
Whenever you'd like to get and set the next index, you simply use the following statement:
counter = collection.find_and_modify(
    query={'_id': 'mydata.number'},
    update={'$inc': {'next': 1}},
    new=True,
    upsert=True)
What this does is find and update the sequence atomically in MongoDB and retrieve the next number. If the sequence doesn't exist, it is created.
Thus, whenever you want to insert a new value into your collection, call the code above. If you want to maintain multiple indexes across different collections and their fields, simply modify 'mydata.number' to be another string referencing your "index."
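If you're on a newer PyMongo where find_and_modify is deprecated or removed, the equivalent call is find_one_and_update (a sketch, assuming the counter lives in the indexes collection described above):
from pymongo import ReturnDocument

counter = db['indexes'].find_one_and_update(
    {'_id': 'mydata.number'},
    {'$inc': {'next': 1}},
    upsert=True,
    return_document=ReturnDocument.AFTER)  # return the document after the $inc
next_number = counter['next']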
There is no clean transactional way to do this in MongoDB. This is why there is the ObjectID datatype. http://api.mongodb.org/python/current/api/bson/objectid.html
Or you can utilize a unique key inside python using something like UUID: https://docs.python.org/2/library/uuid.html