I am reading a large amount of data from an API provider. Once I get the response, I need to scan through it, repackage the data, and put it into the App Engine datastore. A particularly big account will contain ~50k entries.
Every time I get some entries from the API, I store 500 entries as a batch in a temp table and send a processing task to a queue. To keep too many tasks from jamming up a single queue, I use 6 queues in total:
data = {}
count = 0
worker_number = 0  # queues are named worker0..worker5
for folder, property in entries:
    data[count] = {
        # repackaging data here
    }
    count = (count + 1) % 500
    if count == 0:
        cache = ClientCache(parent=user_key, data=json.dumps(data))
        cache.put()
        params = {
            'access_token': access_token,
            'client_key': client.key.urlsafe(),
            'user_key': user_key.urlsafe(),
            'cache_key': cache.key.urlsafe(),
        }
        taskqueue.add(
            url=task_url,
            params=params,
            target='dbworker',
            queue_name='worker%d' % worker_number)
        worker_number = (worker_number + 1) % 6
And the task_url will lead to the following:
logging.info('--------------------- Process File ---------------------')
user_key = ndb.Key(urlsafe=self.request.get('user_key'))
client_key = ndb.Key(urlsafe=self.request.get('client_key'))
cache_key = ndb.Key(urlsafe=self.request.get('cache_key'))

cache = cache_key.get()
data = json.loads(cache.data)
for property in data.values():
    logging.info(property)
    try:
        key_name = '%s%s' % (property['key1'], property['key2'])
        metadata = Metadata.get_or_insert(
            key_name,
            parent=user_key,
            client_key=client_key,
            # ... other info
        )
        metadata.put()
    except StandardError, e:
        logging.error(e.message)
All the tasks run on a backend.
With this structure it works fine... well, most of the time. But sometimes I get this error:
2013-09-19 15:10:07.788
suspended generator transaction(context.py:938) raised TransactionFailedError(The transaction could not be committed. Please try again.)
W 2013-09-19 15:10:07.788
suspended generator internal_tasklet(model.py:3321) raised TransactionFailedError(The transaction could not be committed. Please try again.)
E 2013-09-19 15:10:07.789
The transaction could not be committed. Please try again.
Is this a problem of writing to the datastore too frequently? I want to find out how I can balance the pace and let the workers run smoothly.
Also, is there any other way I can improve the performance further? My queue configuration looks like this:
- name: worker0
  rate: 120/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 3
You are writing single entities one at a time.
How about modifying your code to write in batches using ndb.put_multi? That will reduce the round-trip time for each transaction.
And why are you using get_or_insert when you are overwriting the record each time? You might as well just write. Both of these changes will reduce the workload a lot.
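As a rough illustration, the worker loop could be reshaped like this (a minimal sketch reusing the Metadata model and key naming from the question; the elided fields stay elided):
entities = []
for property in data.values():
    key_name = '%s%s' % (property['key1'], property['key2'])
    # Construct the entity locally instead of get_or_insert + put()
    entities.append(Metadata(
        id=key_name,
        parent=user_key,
        client_key=client_key,
        # ... other info
    ))
ndb.put_multi(entities)  # one batched RPC instead of one write per entity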
I am trying to load a large amount of data (13 million rows) into Firebase Firestore, but it is taking forever to finish.
Currently, I am inserting the data row by row using Python. I tried multi-threading, but it is still very slow and not efficient (I have to stay connected to the Internet).
So, is there another way to insert a file into Firebase (a more efficient way to batch insert the data)?
This is the data format
[
{
'010045006031': {
'FName':'Ahmed',
'LName':'Aline'
}
},
{
'010045006031': {
'FName':'Ali',
'LName':'Adel'
}
},
{
'010045006031': {
'FName':'Osama',
'LName':'Luay'
}
}
]
This is the code that I am using
import firebase_admin
from firebase_admin import credentials, firestore
from random import random

def Insert2DB(I):
    # Document ids must be strings, so convert the key before using it
    doc_ref = db.collection('DBCoolect').document(str(I['M']))
    doc_ref.set({"FirstName": I['FName'], "LastName": I['LName']})

cred = credentials.Certificate("ServiceAccountKey.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

List = []
# List is read from a file
List.append({'M': random(), 'FName': 'Ahmed', 'LName': 'Aline'})
List.append({'M': random(), 'FName': 'Ali', 'LName': 'Adel'})
List.append({'M': random(), 'FName': 'Osama', 'LName': 'Luay'})

for item in List:
    Insert2DB(item)
Thanks a lot ...
Firestore does not offer any way to "bulk update" documents. They have to be added individually. There is a facility to batch write, but that's limited to 500 documents per batch, and that's not likely to speed up your process by a large amount.
If you want to optimize the rate at which documents can be added, I suggest reading the documentation on best practices for read and write operations and designing for scale. All things considered, however, there is really no "fast" way to get 13 million documents into Firestore. You're going to be writing code to add each one individually. Firestore is not optimized for fast writes. It's optimized for fast reads.
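If you do try batched writes, a minimal sketch could look like the following. It assumes the same db client, collection name, and row format as the question's code; the chunking simply respects the 500-writes-per-batch limit:
def batch_insert(db, rows, chunk_size=500):
    # Commit the rows in chunks; each commit is one round trip of up to 500 writes.
    for start in range(0, len(rows), chunk_size):
        batch = db.batch()
        for row in rows[start:start + chunk_size]:
            doc_ref = db.collection('DBCoolect').document(str(row['M']))
            batch.set(doc_ref, {"FirstName": row['FName'], "LastName": row['LName']})
        batch.commit()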
Yes, there is a way to bulk-add data into Firebase using Python:
def process_streaming(self, response):
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            sentiment = self.sentiment_model(json_response["data"]["text"])
            self.datalist.append(self.post_process_data(json_response, sentiment))
It extracts the data from the response and saves it into a list.
If we want the loop to wait a few minutes and then upload the accumulated objects from the list into the database, add the code below after the if block.
if (self.start_time + 300 < time.time()):
    print(f"{len(self.datalist)} data send to database")
    self.batch_upload_data(self.datalist)
    self.datalist = []
    self.start_time = time.time() + 300
The final function looks like this:
def process_streaming(self, response):
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            sentiment = self.sentiment_model(json_response["data"]["text"])
            self.datalist.append(self.post_process_data(json_response, sentiment))
        if (self.start_time + 300 < time.time()):
            print(f"{len(self.datalist)} data send to database")
            self.batch_upload_data(self.datalist)
            self.datalist = []
            self.start_time = time.time() + 300
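The batch_upload_data helper is not shown in the answer; a hedged sketch of what it might look like with Firestore batched writes follows (the self.db client, the collection name, and the auto-generated document ids are all assumptions):
def batch_upload_data(self, datalist, chunk_size=500):
    # Hypothetical helper: Firestore limits a batch to 500 writes, so commit in chunks.
    for start in range(0, len(datalist), chunk_size):
        batch = self.db.batch()
        for item in datalist[start:start + chunk_size]:
            doc_ref = self.db.collection('tweets').document()  # auto-generated id (assumed)
            batch.set(doc_ref, item)
        batch.commit()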
I'm trying to implement this example:
https://github.com/Azure/azure-documentdb-python/blob/master/samples/DatabaseManagement/Program.py
To fetch data from Azure DocumentDB and do some visualization. However, I would like to use a query on the line where it says #error here instead.
def read_database(client, id):
    print('3. Read a database by id')
    try:
        db = next((data for data in client.ReadDatabases() if data['id'] == database_id))
        coll = next((coll for coll in client.ReadCollections(db['_self']) if coll['id'] == database_collection))
        return list(itertools.islice(client.ReadDocuments(coll['_self']), 0, 100, 1))
    except errors.DocumentDBError as e:
        if e.status_code == 404:
            print('A Database with id \'{0}\' does not exist'.format(id))
        else:
            raise errors.HTTPFailure(e.status_code)
The fetching is really slow when I want to get >10k items. How can I improve this?
Thanks!
You can't query documents directly through the database entity.
The parameters of the ReadDocuments() method used in your code should be the collection link and query options.
def ReadDocuments(self, collection_link, feed_options=None):
    """Reads all documents in a collection.

    :Parameters:
        - `collection_link`: str, the link to the document collection.
        - `feed_options`: dict

    :Returns:
        query_iterable.QueryIterable
    """
    if feed_options is None:
        feed_options = {}

    return self.QueryDocuments(collection_link, None, feed_options)
So, you could modify your code as below:
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY']})

db = "db"
coll = "coll"

try:
    database_link = 'dbs/' + db
    database = client.ReadDatabase(database_link)

    collection_link = 'dbs/' + db + "/colls/" + coll
    collection = client.ReadCollection(collection_link)

    # options = {}
    # options['enableCrossPartitionQuery'] = True
    # options['partitionKey'] = 'jay'

    docs = client.ReadDocuments(collection_link)
    print(list(docs))
except errors.DocumentDBError as e:
    if e.status_code == 404:
        print('A Database with id \'{0}\' does not exist'.format(id))
    else:
        raise errors.HTTPFailure(e.status_code)
If you want to query a partition of your collection, add the snippet of code that is commented out in the code above:
options = {}
options['enableCrossPartitionQuery'] = True
options['partitionKey'] = 'jay'
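For larger collections it may also help to push the filter to the server instead of reading every document; here is a hedged sketch with QueryDocuments (the SQL text, the category field, and the page size are assumptions, not part of the original answer):
query = {
    'query': 'SELECT * FROM c WHERE c.category = @category',
    'parameters': [{'name': '@category', 'value': 'jay'}],
}
options = {'enableCrossPartitionQuery': True, 'maxItemCount': 1000}

# The iterator pages results lazily instead of materializing >10k items at once
for doc in client.QueryDocuments(collection_link, query, options):
    pass  # process each document here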
It seems that your issue is focused on Azure Cosmos DB query performance.
You could refer to the following points to improve query performance.
Partitioning
You could set partition keys in your database and query with a filter clause on a single partition key, so that queries have lower latency and consume fewer RUs.
Throughput
You could provision a bit more throughput so that Azure Cosmos DB can do more work per unit time. Of course, this will lead to higher costs.
Indexing Policy
The use of indexing paths can offer improved performance and lower latency.
For more details, I recommend that you refer to the official performance documentation.
Hope it helps you.
Reading the Developer's Guide, I found how to delete a single contact:
def delete_contact(gd_client, contact_url):
    # Retrieving the contact is required in order to get the Etag.
    contact = gd_client.GetContact(contact_url)
    try:
        gd_client.Delete(contact)
    except gdata.client.RequestError, e:
        if e.status == 412:
            # Etags mismatch: handle the exception.
            pass
Is there a way to delete all contacts? I could not find a way to do so, and iterating over each contact takes a few minutes for a large batch.
If you are performing a lot of operations, use the batch requests. You can have the server perform multiple operations with a single HTTP request. Batch requests are limited to 100 operations at a time. You can find more information about batch operations in the Google Data APIs Batch Processing documentation.
To delete all contacts, use the ContactsRequest.Batch operation. For this operation, create a List<Contact>, set the BatchData on each contact item, and then pass the list to the ContactsRequest.Batch operation.
private void DeleteAllContacts()
{
    RequestSettings rs = new RequestSettings(this.ApplicationName, this.userName, this.passWord);
    rs.AutoPaging = true;  // this will result in automatic paging for listing and deleting all contacts
    ContactsRequest cr = new ContactsRequest(rs);

    Feed<Contact> f = cr.GetContacts();
    List<Contact> list = new List<Contact>();
    int i = 0;
    foreach (Contact c in f.Entries)
    {
        c.BatchData = new GDataBatchEntryData();
        c.BatchData.Id = i.ToString();
        c.BatchData.Type = GDataBatchOperationType.delete;
        i++;
        list.Add(c);
    }

    cr.Batch(list, new Uri(f.AtomFeed.Batch), GDataBatchOperationType.insert);

    f = cr.GetContacts();
    Assert.IsTrue(f.TotalResults == 0, "Feed should be empty now");
}
I saw some threads about long polling in Python, but my problem is not big enough to need additional kits like Tornado.
I have a JS client. It sends requests to my /longpolling page and waits for the response. Once it gets a response or times out, it sends a new one. This works well.
My /longpolling handler is a function:
currentTime = datetime.datetime.now()
lastUpdate = datetime.datetime.strptime(req.GET["ts"], "%Y-%m-%dT%H:%M:%S.%f")
response = {
    "added": [],
    "updated": [],
    "deleted": []
}
while (datetime.datetime.now() - currentTime).seconds < 600:
    time.sleep(2)
    now = datetime.datetime.now()
    #query = Log.objects.filter(time__range = (lastUpdate, now))
    query = Log.objects.raw("SELECT * FROM ...log WHERE time BETWEEN %s and %s", [lastUpdate, now])
    exist = False
    for log in query:
        exist = True
        type = {
            NEW: "added",
            UPDATED: "updated",
            DELETED: "deleted"
        }[log.type]
        response[type].append(json.loads(log.data))
    if exist:
        response["ts"] = now.isoformat()
        return JsonResponse(response)
response["ts"] = datetime.datetime.now().isoformat()
return JsonResponse(response)
Every 2 seconds, for up to 10 minutes, I want to check for new Log instances in the DB to notify the JS client.
I tried to insert a Log record manually through phpMyAdmin, but the next Log.objects.filter(time__range=(lastUpdate, now)) returns an empty QuerySet. I copied the raw query from the .query attribute; it looks like:
SELECT ... FROM ... WHERE time BETWEEN 2013-01-05 03:30:36 and 2013-01-05 03:45:18
So I quoted 2013-01-05 03:30:36 and 2013-01-05 03:45:18 and executed this SQL through phpMyAdmin and it returned my added record.
I tried:
query = Log.objects.filter(time__range = (lastUpdate, now))
and
query = Log.objects.raw("SELECT * FROM ...log WHERE time BETWEEN %s and %s", [lastUpdate, now])
and
for log in query.iterate():
But it always returns an empty QuerySet and never my added record.
I thought there might be some caching, but where?
Or is the problem that I inserted the new record while the loop was already running? Or maybe there is some thread protection? Why does phpMyAdmin see the record but Django does not?
Please help me, I am stuck.
I haven't run into this problem, so I'm not sure. Based on @DanielRoseman's answer in the thread you linked in the comments, you might be able to do this:
with transaction.commit_on_success():
    query = Log.objects.raw("SELECT * FROM ...log WHERE time BETWEEN %s and %s", [lastUpdate, now])
It seems more likely, though, that you will have to wrap the lines that insert your log entries in the commit_on_success decorator. I'm not sure where in your code the log entries are inserted.
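If the insert side turns out to be the culprit, a hedged sketch of that wrapping could look like this (the add_log_entry helper is hypothetical, the field names are inferred from the question's loop, and it assumes the pre-1.6 Django transaction API the answer refers to):
from django.db import transaction

@transaction.commit_on_success
def add_log_entry(log_type, payload):
    # Hypothetical insert path: committing immediately makes the new row
    # visible to the long-polling view's next SELECT instead of waiting
    # for the surrounding request to finish.
    Log.objects.create(type=log_type, data=json.dumps(payload), time=datetime.datetime.now())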
When running the code below with 200 Documents and 1 DocUser, the script takes approximately 5000 ms according to AppStats. The culprit is that there is a request to the datastore for each lookup of lastEditedBy (datastore_v3.Get), taking 6-51 ms each.
What I'm trying to do is make it possible to show many entities with several properties, where some of them are derived from other entities. There will never be a large number of entities (<5000), and since this is more of an admin interface, there will never be many simultaneous users.
I have tried to optimize by caching the DocUser entities, but I am not able to get the DocUser key from the query without making a new request to the datastore.
1) Does this make sense - is the latency I am experiencing normal?
2) Is there a way to make this work without the additional requests to the datastore?
models.py
class Document(db.Expando):
    title = db.StringProperty()
    lastEditedBy = db.ReferenceProperty(DocUser, collection_name='documentLastEditedBy')
    ...

class DocUser(db.Model):
    user = db.UserProperty()
    name = db.StringProperty()
    hasWriteAccess = db.BooleanProperty(default=False)
    isAdmin = db.BooleanProperty(default=False)
    accessGroups = db.ListProperty(db.Key)
    ...
main.py
out = '<table>'
documents = Document.all()
for i, d in enumerate(documents):
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, d.lastEditedBy.name)
out += '</table>'
This is a typical anti-pattern. You can work around it by:
Prefetching all of the references. Please see Nick's blog entry for details.
Using ndb. This module doesn't have ReferenceProperty. It has various goodies like two automatic caching layers, an asynchronous mechanism called tasklets, etc. For more details, see the ndb documentation.
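Here is a minimal sketch of the ndb route, with the models trimmed to what the listing needs (the trimmed fields and variable names are assumptions): a KeyProperty plus ndb.get_multi resolves every editor in one batched call, and ndb's built-in caching helps on repeat requests.
from google.appengine.ext import ndb

class DocUser(ndb.Model):
    name = ndb.StringProperty()

class Document(ndb.Expando):
    title = ndb.StringProperty()
    lastEditedBy = ndb.KeyProperty(kind=DocUser)

documents = Document.query().fetch(1000)
editor_keys = list({d.lastEditedBy for d in documents if d.lastEditedBy})
editors = {u.key: u.name for u in ndb.get_multi(editor_keys) if u}  # one batched RPC

out = '<table>'
for d in documents:
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, editors.get(d.lastEditedBy, ''))
out += '</table>'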
One way to do it is to prefetch all the docusers to make a lookup dictionary, with the keys being docuser.key() and values being docuser.name.
docusers = DocUser.all().fetch(1000)
docuser_dict = dict( [(i.key(), i.name) for i in docusers] )
Then in your code, you can get the names from the docuser_dict by using get_value_for_datastore to get the docuser.key() without pulling the object from the datastore.
documents = Document.all().fetch(1000)
for i, d in enumerate(documents):
    docuser_key = Document.lastEditedBy.get_value_for_datastore(d)
    last_editedby_name = docuser_dict.get(docuser_key)
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, last_editedby_name)
If you want to cut instance-time, you can break a single synchronous query into multiple asynchronous queries, which can prefetch results while you do other work. Instead of using Document.all().fetch(), use Document.all().run(). You may have to block on the first query you iterate on, but by the time it is done, all other queries will have finished loading results. If you want to get 200 entities, try using 5 queries at once.
q1 = Document.all().run(prefetch_size=20, batch_size=20, limit=20, offset=0)
q2 = Document.all().run(prefetch_size=45, batch_size=45, limit=45, offset=20)
q3 = Document.all().run(prefetch_size=45, batch_size=45, limit=45, offset=65)
q4 = Document.all().run(prefetch_size=45, batch_size=45, limit=45, offset=110)
q5 = Document.all().run(prefetch_size=45, batch_size=45, limit=45, offset=155)
for i, d in enumerate(q1):
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, d.lastEditedBy.name)
for i, d in enumerate(q2):
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, d.lastEditedBy.name)
for i, d in enumerate(q3):
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, d.lastEditedBy.name)
for i, d in enumerate(q4):
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, d.lastEditedBy.name)
for i, d in enumerate(q5):
    out += '<tr><td>%s</td><td>%s</td></tr>' % (d.title, d.lastEditedBy.name)
I apologize for my crummy Python, but the idea is simple: set your prefetch_size = batch_size = limit, and start all your queries at once. q1 has a smaller size because we will block on it first, and blocking is what wastes time. By the time q1 is done, q2 will be done or almost done, and on q3-q5 you will pay zero latency.
See https://developers.google.com/appengine/docs/python/datastore/async#Async_Queries for details.