My goal is to sort millions of logs by timestamp that I receive out of Elasticsearch.
Example logs:
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:00:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:01:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:02:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:04:09.000Z"}
Unfortunately, I am not able to get all the logs out of Elasticsearch already sorted, so it seems I have to do it myself.
Approaches I have tried to get the data sorted out of Elasticsearch:
es = Search(index="somelogs-*").using(client).params(preserve_order=True)
for hit in es.scan():
    print(hit['#timestamp'])
Another approach:
notifications = (es
    .query("range", **{
        "#timestamp": {
            'gte': 'now-48h',
            'lt': 'now'
        }
    })
    .sort("#timestamp")
    .scan()
)
So I am looking for a way to sort these logs myself or directly through Elasticsearch. Currently, I am saving all the data to a local 'logs.json', and it seems I have to iterate over it and sort it myself.
You should definitely let Elasticsearch do the sorting, then return the data to you already sorted.
The problem is that you are using .scan(). It uses Elasticsearch's scan/scroll API, which unfortunately only applies the sorting params on each page/slice, not the entire search result. This is noted in the elasticsearch-dsl docs on Pagination:
Pagination
...
If you want to access all the documents matched by your query you can
use the scan method which uses the scan/scroll elasticsearch API:
for hit in s.scan():
print(hit.title)
Note that in this case the results won’t be sorted.
(emphasis mine)
Using pagination is definitely an option, especially since you have "millions of logs", as you said. There is a search_after pagination API:
Search after
You can use the search_after parameter to retrieve the next page of
hits using a set of sort values from the previous page.
...
To get the first page of results, submit a search request with a sort
argument.
...
The search response includes an array of sort values for
each hit.
...
To get the next page of results, rerun the previous search using the last hit’s sort values as the search_after argument. ... The search’s query and sort arguments must remain unchanged. If provided, the from argument must be 0 (default) or -1.
...
You can repeat this process to get additional pages of results.
(omitted the raw JSON requests since I'll show a sample in Python below)
Here's a sample of how to do it with elasticsearch-dsl for Python. Note that I'm limiting the fields and the number of results to make it easier to test. The important parts here are the sort and the extra(search_after=).
search = Search(using=client, index='some-index')
# The main query
search = search.extra(size=100)
search = search.query('range', **{'#timestamp': {'gte': '2020-12-29T09:00', 'lt': '2020-12-29T09:59'}})
search = search.source(fields=('#timestamp', ))
search = search.sort({
    '#timestamp': {
        'order': 'desc'
    },
})
# Store all the results (it would be better to wrap all this in a generator for performance)
hits = []
# Get the 1st page
results = search.execute()
hits.extend(results.hits)
total = results.hits.total
print(f'Expecting {total}')
# Get the next pages
# Real use-case condition should be "until total" or "until no more results.hits"
while len(hits) < 1000:
    print(f'Now have {len(hits)}')
    last_hit_sort_id = hits[-1].meta.sort[0]
    search = search.extra(search_after=[last_hit_sort_id])
    results = search.execute()
    hits.extend(results.hits)

with open('results.txt', 'w') as out:
    for hit in hits:
        out.write(f'{hit["#timestamp"]}\n')
That produces data that is already sorted:
# 1st 10 lines
2020-12-29T09:58:57.749Z
2020-12-29T09:58:55.736Z
2020-12-29T09:58:53.627Z
2020-12-29T09:58:52.738Z
2020-12-29T09:58:47.221Z
2020-12-29T09:58:45.676Z
2020-12-29T09:58:44.523Z
2020-12-29T09:58:43.541Z
2020-12-29T09:58:40.116Z
2020-12-29T09:58:38.206Z
...
# 250-260
2020-12-29T09:50:31.117Z
2020-12-29T09:50:27.754Z
2020-12-29T09:50:25.738Z
2020-12-29T09:50:23.601Z
2020-12-29T09:50:17.736Z
2020-12-29T09:50:15.753Z
2020-12-29T09:50:14.491Z
2020-12-29T09:50:13.555Z
2020-12-29T09:50:07.721Z
2020-12-29T09:50:05.744Z
2020-12-29T09:50:03.630Z
...
# 675-685
2020-12-29T09:43:30.609Z
2020-12-29T09:43:30.608Z
2020-12-29T09:43:30.602Z
2020-12-29T09:43:30.570Z
2020-12-29T09:43:30.568Z
2020-12-29T09:43:30.529Z
2020-12-29T09:43:30.475Z
2020-12-29T09:43:30.474Z
2020-12-29T09:43:30.468Z
2020-12-29T09:43:30.418Z
2020-12-29T09:43:30.417Z
...
# 840-850
2020-12-29T09:43:27.953Z
2020-12-29T09:43:27.929Z
2020-12-29T09:43:27.927Z
2020-12-29T09:43:27.920Z
2020-12-29T09:43:27.897Z
2020-12-29T09:43:27.895Z
2020-12-29T09:43:27.886Z
2020-12-29T09:43:27.861Z
2020-12-29T09:43:27.860Z
2020-12-29T09:43:27.853Z
2020-12-29T09:43:27.828Z
...
# Last 3
2020-12-29T09:43:25.878Z
2020-12-29T09:43:25.876Z
2020-12-29T09:43:25.869Z
There are some considerations on using search_after as discussed in the API docs:
Use a Point In Time or PIT parameter
If a refresh occurs between these requests, the order of your results may change, causing inconsistent results across pages. To prevent this, you can create a point in time (PIT) to preserve the current index state over your searches.
You need to first make a POST request to get a PIT ID
Then add an extra 'pit': {'id': xxxx, 'keep_alive': '5m'} parameter to every request
Make sure to use the PIT ID from the last response (see the sketch below)
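A rough sketch of how those steps could look with elasticsearch-py and elasticsearch-dsl, assuming Elasticsearch 7.10+ and reusing the client and index name from the snippets above (this is an illustrative sketch, not tested against the original setup):
# Open a point in time (PIT) first
pit = client.open_point_in_time(index='some-index', keep_alive='5m')
pit_id = pit['id']

# When a PIT is used, the index is not passed to Search()
search = Search(using=client)
search = search.extra(size=100, pit={'id': pit_id, 'keep_alive': '5m'})
search = search.sort({'#timestamp': {'order': 'desc'}})

results = search.execute()
# The response may carry an updated PIT ID; reuse it for the next request
pit_id = getattr(results, 'pit_id', pit_id)

# ... paginate with search_after exactly as shown above, then clean up
client.close_point_in_time(body={'id': pit_id})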
Use a tiebreaker
We recommend you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don’t include a tiebreaker field, your paged results could miss or duplicate hits.
This would depend on your Document schema
# Add some ID as a tiebreaker to the `sort` call
search = search.sort(
    {'#timestamp': {
        'order': 'desc'
    }},
    {'some.id': {
        'order': 'desc'
    }}
)
# Include both the sort ID and the some.ID in `search_after`
last_hit_sort_id, last_hit_route_id = hits[-1].meta.sort
search = search.extra(search_after=[last_hit_sort_id, last_hit_route_id])
Thank you Gino Mempin. It works!
But I also figured out that a simple change does the same job.
By adding .params(preserve_order=True), Elasticsearch will return all the data sorted.
es = Search(index="somelog-*").using(client)
notifications = (es
    .query("range", **{
        "#timestamp": {
            'gte': 'now-48h',
            'lt': 'now'
        }
    })
    .sort("#timestamp")
    .params(preserve_order=True)
    .scan()
)
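Note that, according to the elasticsearch-py helpers documentation, scrolling with preserve_order keeps the sort order across pages but can be an expensive operation, so for millions of logs the search_after approach above may still scale better.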
When you set up a campaign in Google AdWords you can add negative keywords to it, so that the search query will not match your campaign if it contains the negative keyword.
I want to extract the list of negative keywords for each campaign. In the documentation I was able to find this example:
def retrieve_negative_keywords(report_utils)
  report_definition = {
    :selector => {
      :fields => ['CampaignId', 'Id', 'KeywordMatchType', 'KeywordText']
    },
    :report_name => 'Negative campaign keywords',
    :report_type => 'CAMPAIGN_NEGATIVE_KEYWORDS_PERFORMANCE_REPORT',
    :download_format => 'CSV',
    :date_range_type => 'TODAY',
    :include_zero_impressions => true
  }
  campaigns = {}
  report = report_utils.download_report(report_definition)
  # Slice off the first row (report name).
  report.slice!(0..report.index("\n"))
  CSV.parse(report, { :headers => true }) do |row|
    campaign_id = row['Campaign ID']
    # Ignore totals row.
    if row[0] != 'Total'
      campaigns[campaign_id] ||= Campaign.new(campaign_id)
      negative = Negative.from_csv_row(row)
      campaigns[campaign_id].negatives << negative
    end
  end
  return campaigns
end
This is written in Ruby, and there are no Python examples for this task. There is also a report for negative keywords, but it holds no metrics and I can't use it to retrieve the list of negative keywords for each campaign.
I am using this structure to query the database:
report_query = (adwords.ReportQueryBuilder()
                .Select('CampaignId', 'Id', 'KeywordMatchType', 'KeywordText')
                .From('CAMPAIGN_NEGATIVE_KEYWORDS_PERFORMANCE_REPORT')
                .During('LAST_7_DAYS')
                .Build())
But querying it gives an error:
googleads.errors.AdWordsReportBadRequestError: Type: QueryError.DURING_CLAUSE_REQUIRES_DATE_COLUMN
When I add Date it throws the same error.
Has anyone been able to extract the negative keyword list per campaign using Python with the Google Adwords API reports?
You can't use a DURING clause when querying CAMPAIGN_NEGATIVE_KEYWORDS_PERFORMANCE_REPORT as that report is a structure report, meaning it doesn't contain statistics. If you remove the During() call and just do
report_query = (googleads.adwords.ReportQueryBuilder()
                .Select('CampaignId', 'Id', 'KeywordMatchType', 'Criteria')
                .From('CAMPAIGN_NEGATIVE_KEYWORDS_PERFORMANCE_REPORT')
                .Build())
you'll get the list of all negative keywords per campaign.
If you think about it, this makes sense: negative keywords prevent your ads from being shown, so metrics like impressions or clicks would be meaningless.
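For completeness, here is a rough sketch of how the download itself might look with the googleads Python client; the API version string and the output path are assumptions, not something from the original post:
from googleads import adwords

# Assumes credentials are configured in googleads.yaml
adwords_client = adwords.AdWordsClient.LoadFromStorage()
report_downloader = adwords_client.GetReportDownloader(version='v201809')  # version is an assumption

report_query = (adwords.ReportQueryBuilder()
                .Select('CampaignId', 'Id', 'KeywordMatchType', 'Criteria')
                .From('CAMPAIGN_NEGATIVE_KEYWORDS_PERFORMANCE_REPORT')
                .Build())

# Write the CSV to a local file (hypothetical path)
with open('negative_keywords.csv', 'w') as output_file:
    report_downloader.DownloadReportWithAwql(
        report_query, 'CSV', output_file,
        skip_report_header=True,
        skip_column_header=False,
        skip_report_summary=True)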
I'm matching two collections residing in two different databases on a criterion and creating a new collection for the records that match it.
The code below works with a simple criterion, but I need a different one.
Definitions
function insertBatch(collection, documents) {
  var bulkInsert = collection.initializeUnorderedBulkOp();
  var insertedIds = [];
  var id;
  documents.forEach(function(doc) {
    id = doc._id;
    // Insert without raising an error for duplicates
    bulkInsert.find({_id: id}).upsert().replaceOne(doc);
    insertedIds.push(id);
  });
  bulkInsert.execute();
  return insertedIds;
}

function moveDocuments(sourceCollection, targetCollection, filter, batchSize) {
  print("Moving " + sourceCollection.find(filter).count() + " documents from " + sourceCollection + " to " + targetCollection);
  var count;
  while ((count = sourceCollection.find(filter).count()) > 0) {
    print(count + " documents remaining");
    sourceDocs = sourceCollection.find(filter).limit(batchSize);
    idsOfCopiedDocs = insertBatch(targetCollection, sourceDocs);
    targetDocs = targetCollection.find({_id: {$in: idsOfCopiedDocs}});
  }
  print("Done!")
}
Call
var db2 = new Mongo("<URI_1>").getDB("analy")
var db = new Mongo("<URI_2>").getDB("clone")
var readDocs = db2.coll1
var writeDocs = db.temp_coll
var Urls = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url", {})
var filter = {"Url": {$in: Urls}}
moveDocuments(readDocs, writeDocs, filter, 10932)
In a nutshell, my criterion is the distinct "Url" string. Instead, I want the Url + Date string to be my criterion. There are 2 problems:
In one collection the date is in the format ISODate("2016-03-14T13:42:00.000+0000") and in the other collection the date format is "2018-10-22T14:34:40Z". How can I make them uniform so that they match each other?
Assuming we get a solution to 1. and create a new array of concatenated strings UrlsAndDate instead of Urls, how would we create a similar concatenated field on the fly and match it in the other collection?
For example: (non-functional code!)
var UrlsAndDate = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url"+"formated_Date" ,{})
var filter= {"Url"+"formated_Date": {$in: Urls }}
readDocs.find(filter)
...and do the same stuff as above!
Any suggestions?
I got a brute-force solution, but it isn't feasible!
Problem:
I want to merge 2 collections, mycoll and coll1. Both have fields named Url and Date. mycoll has 35,000 docs and coll1 has 4.7M docs (16+ GB), which can't be loaded into memory.
The algorithm, written using the pymongo client:
iterate over mycoll
    create a src string "url+common_date_format"
    try to find a match in coll1; since coll1 is big I can't load it into memory and treat it as a dictionary, so I'm iterating over each doc in that collection again and again:
        iterate over coll1
            create a destination string "url+common_date_format"
            if src_string == dest_string:
                insert this doc into a new collection called temp_coll
This is a terrible algorithm: O(35000 * 4.7M) would take ages to complete. If I could load the 4.7M docs into memory, the runtime would drop to O(35000), which is doable!
Any suggestions for a better algorithm?
The first thing I would do is create a compound index on {url: 1, date: 1} on the collections, if they don't already exist. Say collection A has 35k docs and collection B has 4.7M docs. We can't load all 4.7M docs into memory. You are iterating over a cursor object of B in the inner loop; I assume that once that cursor is exhausted, you query the collection again.
The observation to make here is: why are we iterating over 4.7M docs each time? Instead of fetching all 4.7M docs and then matching, we could just fetch the docs that match the url and date of each doc in A. Converting the a_doc date to the b_doc format and then querying is better than converting both to a common format, which would force us to iterate over all 4.7M docs. Read the pseudocode below.
a_docs = a_collection.find()
c_docs = []
for doc in a_docs:
    url = doc['url']
    date = doc['date']
    # convert the date into collection B's format before querying
    date = convert_to_b_collection_date_format(date)
    query = {'url': url, 'date': date}
    b_doc = b_collection.find_one(query)
    if b_doc is not None:
        c_docs.append(b_doc)

c_docs = convert_c_docs_to_required_format(c_docs)
c_collection.insert_many(c_docs)
Above, we loop over the 35k docs and run one indexed query per doc. Given that the compound index exists, each lookup takes logarithmic time, which seems reasonable.
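As a minimal pymongo sketch of the two pieces assumed above, the compound index and the date-format conversion; the URI, database and collection names are placeholders, and it assumes collection B stores dates as "2018-10-22T14:34:40Z" strings while collection A stores BSON ISODates:
from datetime import datetime

import pymongo

client = pymongo.MongoClient('<URI>')        # placeholder URI
big_coll = client['some_db']['coll1']        # the 4.7M-doc collection (names assumed)

# Compound index so that {'url': ..., 'date': ...} lookups don't scan the whole collection
big_coll.create_index([('url', pymongo.ASCENDING), ('date', pymongo.ASCENDING)])

def convert_to_b_collection_date_format(date_value):
    # pymongo returns BSON ISODates as naive UTC datetimes by default
    if isinstance(date_value, datetime):
        return date_value.strftime('%Y-%m-%dT%H:%M:%SZ')
    return date_value  # already a string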
I am using the elasticsearch-dsl Python library to connect to Elasticsearch and do aggregations.
I am using the following code:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
This works fine, but it returns only 10 results in response.aggregations.per_ts.buckets.
I want all the results.
I have tried one solution with size=0, as mentioned in this question:
search.aggs.bucket('per_ts', 'terms', field='ts', size=0)\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
But this results in an error:
TransportError(400, u'parsing_exception', u'[terms] failed to parse field [size]')
I had the same issue. I finally found this solution:
s = Search(using=client, index="jokes").query("match", jks_content=keywords).extra(size=0)
a = A('terms', field='jks_title.keyword', size=999999)
s.aggs.bucket('by_title', a)
response = s.execute()
Since 2.x, size=0 for all bucket results no longer works; please refer to this thread. Here in my example I just set the size to 999999. You can pick a large number according to your case.
It is recommended to explicitly set a reasonable value for size, a number
between 1 and 2147483647.
Hope this helps.
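To then read the buckets out of the response, something like the following should work with elasticsearch-dsl (the aggregation name by_title matches the snippet above; key and doc_count are the standard terms-bucket fields):
for bucket in response.aggregations.by_title.buckets:
    print(bucket.key, bucket.doc_count)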
This is a bit older, but I ran into the same issue. What I wanted was basically an iterator that I could use to go through all the aggregations that I got back (I also have a lot of unique results).
The best thing I found is to create a Python generator like this:
def scan_aggregation_results():
    i = 0
    partitions = 20
    while i < partitions:
        s = Search(using=elastic, index='my_index').extra(size=0)
        agg = A('terms', field='my_field.keyword', size=999999,
                include={"partition": i, "num_partitions": partitions})
        s.aggs.bucket('my_agg', agg)
        result = s.execute()
        for item in result.aggregations.my_agg.buckets:
            yield item.key
        i = i + 1
# in other parts of the code just do
for item in scan_aggregation_results():
    print(item)  # or do whatever you want with it
The magic here is that Elastic will automatically partition the results into the number of partitions I define, here 20. I just have to set the size to something large enough to hold a single partition; in this case the result can be up to 20 million items large (or 20 * 999999). If, like me, you have far fewer items to return (e.g. 20,000), then you will just get 1,000 results per query in your bucket, regardless of the much larger size you defined.
Using the generator construct as outlined above, you can even get rid of that and create your own scanner, so to speak, iterating over all results individually, which is just what I wanted.
You should read the documentation.
So in your case, it should look like this:
search.aggs.bucket('per_date', 'terms', field='date')\
    .bucket('response_time_percentile', 'percentiles', field='total_time',
            percents=percentiles, hdr={"number_of_significant_value_digits": 1})[0:50]
response = search.execute()
I am trying to find a workaround to the following problem. I have seen it quasi-described in this SO question, yet not really answered.
The following code fails, starting with a fresh graph:
from py2neo import neo4j

def add_test_nodes():
    # Add a test node manually
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345, {"user_id": 12345})

def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id": 12345, "name": "Alice"}
    # batch requests
    a = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    batch.set_properties(a, new_node_data)  # <-- I'm the problem
    # execute batch requests and clear
    batch.run()
    batch.clear()

if __name__ == '__main__':
    # Initialize Graph DB service and create a Users node index
    g = neo4j.GraphDatabaseService()
    users_idx = g.get_or_create_index(neo4j.Node, "Users")
    # run the test functions
    add_test_nodes()
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    print alice
    do_batch(g)
    # get alice back and assert additional properties were added
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    assert "name" in alice
In short, I wish, in one batch transaction, to update existing indexed node properties. The failure occurs at the batch.set_properties line, because the BatchRequest object returned by the previous line is not being interpreted as a valid node. Though not entirely identical, it feels like I am attempting something like the answer posted here.
Some specifics
>>> import py2neo
>>> py2neo.__version__
'1.6.0'
>>> g = py2neo.neo4j.GraphDatabaseService()
>>> g.neo4j_version
(2, 0, 0, u'M06')
Update
If I split the problem into separate batches, then it can run without error:
def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id": 12345, "name": "Alice"}
    # batch request 1
    batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    # execute batch request and clear
    alice = batch.submit()[0]
    batch.clear()
    # batch request 2
    batch.set_properties(alice, new_node_data)
    # execute batch request and clear
    batch.run()
    batch.clear()
This works for many nodes as well. Though I do not love the idea of splitting the batch up, this might be the only way at the moment. Anyone have some comments on this?
After reading up on all the new features of Neo4j 2.0.0-M06, it seems that the older workflow of node and relationship indexes is being superseded. There is presently a bit of a divergence in the way Neo4j does indexing: namely, labels and schema indexes.
Labels
Labels can be arbitrarily attached to nodes and can serve as a reference for an index.
Indexes
Indexes can be created in Cypher by referencing a Label (here, User) and a node property key (screen_name):
CREATE INDEX ON :User(screen_name)
Cypher MERGE
Furthermore, the indexed get_or_create methods are now possible via the new Cypher MERGE function, which incorporates Labels and their indexes quite succinctly:
MERGE (me:User{screen_name:"SunPowered"}) RETURN me
Batch
Queries of the sort can be batched in py2neo by appending a CypherQuery instance to the batch object:
from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService()
cypher_merge_user = neo4j.CypherQuery(graph_db,
    "MERGE (user:User {screen_name:{name}}) RETURN user")

def get_or_create_user(screen_name):
    """Return the user if exists, create one if not"""
    return cypher_merge_user.execute_one(name=screen_name)

def get_or_create_users(screen_names):
    """Apply the get or create user cypher query to many usernames in a
    batch transaction"""
    batch = neo4j.WriteBatch(graph_db)
    for screen_name in screen_names:
        batch.append_cypher(cypher_merge_user, params=dict(name=screen_name))
    return batch.submit()

root = get_or_create_user("Root")
users = get_or_create_users(["alice", "bob", "charlie"])
Limitation
There is a limitation, however, in that the results of a Cypher query in a batch transaction cannot be referenced later in the same transaction. The original question was about updating a collection of indexed user properties in one batch transaction. This still does not seem to be possible, as far as I can tell. For example, the following snippet throws an error:
batch = neo4j.WriteBatch(graph_db)
b1 = batch.append_cypher(cypher_merge_user, params=dict(name="Alice"))
batch.set_properties(b1, dict(last_name="Smith"))
resp = batch.submit()
So, although there is a bit less overhead in implementing get_or_create over a labelled node with py2neo, because the legacy indexes are no longer necessary, the original question still needs two separate batch transactions to complete.
Your problem seems not to be in batch.set_properties() but rather in the output of batch.get_or_create_in_index(). If you add the node with batch.create(), it works:
db = neo4j.GraphDatabaseService()
batch = neo4j.WriteBatch(db)
# create a node instead of getting it from index
test_node = batch.create({'key': 'value'})
# set new properties on the node
batch.set_properties(test_node, {'key': 'foo'})
batch.submit()
If you have a look at the properties of the BatchRequest object returned by batch.create() and batch.get_or_create_in_index() there is a difference in the URI because the methods use different parts of the neo4j REST API:
test_node = batch.create({'key': 'value'})
print test_node.uri # node
print test_node.body # {'key': 'value'}
print test_node.method # POST
index_node = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
print index_node.uri # index/node/Users?uniqueness=get_or_create
print index_node.body # {u'value': 12345, u'key': 'user_id', u'properties': {}}
print index_node.method # POST
batch.submit()
So I guess batch.set_properties() somehow can't handle the URI of the indexed node? I.e. it doesn't really get the correct URI for the node?
This doesn't solve the problem, but it could be a pointer for somebody else ;)
I use MongoDB to store compressed HTML files.
Basically, a complete document in mongod looks like:
{'_id': 1, 'p1': data, 'p2': data2, 'p3': data3}
where data, data2, data3 are bson.binary.Binary(zlib_compressed_html).
I have 12 million ids and each dataX is on average 90KB,
so each document has at least size 180KB + sizeof(_id) + some overhead.
The total data size would be at least 2TB.
I would like to note that '_id' is indexed.
I insert into Mongo in the following way:
def _save(self, mongo_col, my_id, page, html):
    doc = mongo_col.find_one({'_id': my_id})
    key = 'p%d' % page
    success = False
    if doc is None:
        doc = {'_id': my_id, key: html}
        try:
            mongo_col.save(doc, safe=True)
            success = True
        except:
            log.exception('Exception saving to mongodb')
    else:
        try:
            mongo_col.update({'_id': my_id}, {'$set': {key: html}})
            success = True
        except:
            log.exception('Exception updating mongodb')
    return success
As you can see, first I look up the collection to see if a document with my_id exists.
If it does not exist, I create it and save it to Mongo; otherwise I update it.
The problem with the above is that although it was super fast, at some point it became really slow.
I will give you some numbers:
when it was fast I was doing 1,500,000 per 4 hours, and afterwards 300,000 per 4 hours.
I suspect that this affects the speed:
Note
When performing update operations that increase the document size beyond the allocated space for that document, the update operation relocates the document on disk and may reorder the document fields depending on the type of update.
As of these driver versions, all write operations will issue a getLastError command to confirm the result of the write operation:
{ getLastError: 1 }
Refer to the documentation on write concern in the Write Operations document for more information.
The above is from: http://docs.mongodb.org/manual/applications/update/
I am saying that because we could have the following:
{'_id': 1, 'p1': some_data}, ...., {'_id': 10000000, 'p2': some_data2}, ... {'_id': N, 'p1': sd3}
and imagine that I am calling the above _save method as:
_save(my_collection, 1, 2, bin_compressed_html)
Then it should update the doc with _id 1. But if what the Mongo docs say is the case, because I am adding a key to the document, the document no longer fits in its allocated space and Mongo has to relocate it.
It is possible that the document gets moved to the end of the collection, which could be very far away on disk. Could this slow things down?
Or does the slowdown have to do with the size of the collection?
In any case, do you think it would be more efficient to modify my structure to be like:
{'_id': ObjectId, 'mid': 1, 'p': 1, 'd': html}
where mid = my_id, p = page, d = compressed html,
and to modify the _save method to do only inserts?
def _save(self, mongo_col, my_id, page, html):
    doc = {'mid': my_id, 'p': page, 'd': html}
    success = False
    try:
        mongo_col.save(doc, safe=True)
        success = True
    except:
        log.exception('Exception saving to mongodb')
    return success
This way I avoid the update (and so the relocation on disk) and one lookup (find_one),
but the documents would be 3x more numerous and I would have 2 indexes (_id and mid).
What do you suggest?
Document relocation could be an issue if you continue to add pages of HTML as new attributes. Would it really be an issue to move the pages to a new collection, where you could simply add them as one record each? Also, I don't really think MongoDB is a good fit for your use case; e.g. Redis would be much more efficient.
Another thing you should take care of is to have enough RAM for your _id index. Use db.mongocol.stats() to check the index size.
When inserting new documents into MongoDB, a document can grow without being moved, up to a certain point, because the DB analyzes the incoming data and adds padding to the document.
So, to deal with fewer document movements, you can do two things:
manually tweak the padding factor
preallocate space (attributes) for each document
See the article about padding or the MongoDB docs for more information about the padding factor.
By the way, instead of using save() for creating new documents, you should use .insert(), which will throw a duplicate key error if the _id is already there (.save() will overwrite your document). A sketch of that pattern follows.
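A minimal sketch of that insert-then-update pattern, using the modern pymongo API (insert_one/update_one instead of the legacy insert/save); the method shape and the log object are taken from the question, but this is an illustration, not the original code:
from pymongo.errors import DuplicateKeyError

def _save(self, mongo_col, my_id, page, html):
    key = 'p%d' % page
    try:
        # insert_one raises DuplicateKeyError instead of silently overwriting
        mongo_col.insert_one({'_id': my_id, key: html})
        return True
    except DuplicateKeyError:
        # the document already exists, so add the new page with an update
        mongo_col.update_one({'_id': my_id}, {'$set': {key: html}})
        return True
    except Exception:
        log.exception('Exception saving to mongodb')
        return False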