Why are my Datastore Write Ops so high? - python

I busted through my daily free quota on a new project this weekend. For reference, that's .05 million writes, or 50,000 if my math is right.
Below is the only code in my project that is making any Datastore write operations.
old = Streams.query().fetch(keys_only=True)
ndb.delete_multi(old)

try:
    r = urlfetch.fetch(url=streams_url,
                       method=urlfetch.GET)
    streams = json.loads(r.content)
    for stream in streams['streams']:
        stream = Streams(channel_id=stream['_id'],
                         display_name=stream['channel']['display_name'],
                         name=stream['channel']['name'],
                         game=stream['channel']['game'],
                         status=stream['channel']['status'],
                         delay_timer=stream['channel']['delay'],
                         channel_url=stream['channel']['url'],
                         viewers=stream['viewers'],
                         logo=stream['channel']['logo'],
                         background=stream['channel']['background'],
                         video_banner=stream['channel']['video_banner'],
                         preview_medium=stream['preview']['medium'],
                         preview_large=stream['preview']['large'],
                         videos_url=stream['channel']['_links']['videos'],
                         chat_url=stream['channel']['_links']['chat'])
        stream.put()
    self.response.out.write("Done")
except urlfetch.Error, e:
    self.response.out.write(e)
This is what I know:
There will never be more than 25 "stream" entries in "streams", so it's guaranteed to call .put() exactly 25 times.
I delete everything from the table at the start of this call because everything needs to be refreshed every time it runs.
Right now, this code is on a cron running every 60 seconds. It will never run more often than once a minute.
I have verified all of this by enabling Appstats and I can see the datastore_v3.Put count go up by 25 every minute, as intended.
I have to be doing something wrong here, because 25 a minute is 1,500 writes an hour, not the ~50,000 that I'm seeing now.
Thanks

You are mixing two different things here: write API calls (what your code calls) and low-level datastore write operations. See the billing docs for how the two relate: Pricing of Costs for Datastore Calls (second section).
This is the relevant part:
New Entity Put (per entity, regardless of entity size) = 2 writes + 2 writes per indexed property value + 1 write per composite index value
In your case Streams has 15 indexed properties resulting in: 2 + 15 * 2 = 32 write OPs per write API call.
Total per hour: 60 (requests/hour) * 25 (puts/request) * 32 (operations/put) = 48,000 datastore write operations per hour
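As a quick sanity check of that arithmetic in code:

indexed_properties = 15
ops_per_put = 2 + 2 * indexed_properties      # 32 write ops per entity put
ops_per_hour = 60 * 25 * ops_per_put          # 48,000 write ops per hour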

It seems as though I've finally figured out what was going on, so I wanted to update here.
I found this older answer: https://stackoverflow.com/a/17079348/1452497.
I had missed, somewhere along the line, that indexed properties multiply the writes by a factor of 10 or more; I did not expect that. I didn't need everything indexed, and after turning off indexing in my model the write ops dropped DRAMATICALLY, down to about where I expect them.
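For anyone landing here later, a hedged sketch of what "turning off the index" looks like in ndb (property names follow the model in the question; which ones you keep indexed depends on what you actually filter or sort on):

from google.appengine.ext import ndb

class Streams(ndb.Model):
    channel_id = ndb.StringProperty()                  # keep indexed only if you query on it
    display_name = ndb.StringProperty(indexed=False)   # unindexed: no extra index writes on put()
    name = ndb.StringProperty(indexed=False)
    game = ndb.StringProperty(indexed=False)
    status = ndb.TextProperty()                        # TextProperty is never indexed
    viewers = ndb.IntegerProperty(indexed=False)
    # ... remaining fields declared the same way ...

With only one indexed property, each put() costs 2 + 2 * 1 = 4 write ops instead of 32.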
Thanks guys!

It is 1,500 * 24 = 36,000 writes/day, which is already very near the daily quota.

Related

Efficiently processing ~50 million record file in python

We have a file with about 46 million records in CSV format. Each record has about 18 fields, and one of them is a 64-byte ID. We have another file with about 167K unique IDs in it. The records corresponding to those IDs need to be yanked out. So we have written a Python program that reads the 167K IDs into an array and processes the 46-million-record file, checking whether the ID exists in each of those records. Here is the snippet of the code:
import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
We tested the program on a small set of data; here are the processing times:
lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec
We started running the program on the 46-million-record file (about 13 GB in size) last night around 3:00 am EST; it is now around 10 am EST and it is still processing!
Questions:
Is there a better way to process these records to improve on processing times?
Is Python the right choice? Would awk or some other tool help?
I am guessing the 64-byte ID lookup against the 167K-element array in the following statement is the culprit:
if fieldAry[CUSTOMER_ID] not in idArray:
Is there a better alternative?
Thank you!
Update: This is being processed on an EC2 instance with an EBS volume attached.
You should use a set instead of a list; before the for loop, do:
idArray = set(idArray)

csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
And see the unbelievable speed-up; you're spending days' worth of unnecessary processing time just because you chose the wrong data structure.
The in operator on a set has O(1) time complexity, whereas on a list it is O(n). This might sound like "not a big deal", but it is actually the bottleneck in your script, even though a set has somewhat higher constant factors hidden in that O. Your code is spending something like 30,000 times longer on that single in operation than necessary: if the optimal version needed 30 seconds, you'd now spend 10 days on that single line alone.
See the following test: I generate 1 million IDs and take 10000 aside into another list - to_remove. I then do a for loop like you do, doing the in operation for each record:
import random
import timeit

all_ids = [random.randint(1, 2**63) for i in range(1000000)]
to_remove = all_ids[:10000]
random.shuffle(to_remove)
random.shuffle(all_ids)

def test_set():
    to_remove_set = set(to_remove)
    for i in all_ids:
        if i in to_remove_set:
            pass

def test_list():
    for i in all_ids:
        if i in to_remove:
            pass

print('starting')
print('testing list', timeit.timeit(test_list, number=1))
print('testing set', timeit.timeit(test_set, number=1))
And the results:
testing list 227.91903045598883
testing set 0.14897623099386692
It took 149 milliseconds for the set version; the list version required 228 seconds. Now these were small numbers: in your case you have 50 million input records against my 1 million; thus there you need to multiply the testing set time by 50: with your data set it would take about 7.5 seconds.
For the list version, on the other hand, you need to multiply that time by 50 * 17: not only are there 50 times more input records, but also 17 times more IDs to match against. Thus we get 227 * 50 * 17 = 192,950 seconds.
So your algorithm spends 2.2 days doing something that, with the correct data structure, can be done in 7.5 seconds. Of course this does not mean you can scan the whole 13 GB file in 7.5 seconds, but it probably isn't more than 2.2 days either. So we changed roughly from:
              ~2 days                           ~2.2 days
|--- reading and writing the files ---||------ doing id in list ------|

to something like

              ~2 days                    ~7.5 seconds
|--- reading and writing the files ---||| (doing id in set)
The easiest way to speed things up a bit further would be to parallelize the line processing with some distributed solution. One of the easiest would be to use multiprocessing.Pool. You should do something like this (syntax not checked):
from multiprocessing import Pool

# process_row would be your per-line function: check the set and write/collect the row
p = Pool(processes=4)
p.map(process_row, csvReadHandler)
That said, Python is not the best language for this kind of batch processing (mainly because writing to disk is quite slow). It is better to leave all the disk-writing management (buffering, queueing, etc.) to the Linux kernel, so a bash solution can work quite a bit better. The most efficient way would be to split the input file into chunks and simply do an inverse grep to filter out the IDs:
for file in $list_of_splitted_files; do
    cat $file | grep -Ev "(id1|id2|id3|...)" > $file.out
done
if you need to merge afterwards, simply:
for file in $(ls *.out); do
    cat $file >> final_results.csv
done
Considerations:
I don't know whether doing a single grep with all the IDs is more or less efficient than looping through all the IDs and doing a single-ID grep per pass.
When writing parallel solutions, try to read from / write to different files to minimize the I/O bottleneck (all threads trying to write to the same file), whatever the language.
Put timers on your code for each processing part (see the sketch below). This way you'll see which part is wasting the most time. I really recommend this because I had a similar program to write, and I thought the processing part was the bottleneck (similar to your comparison with the IDs vector), but in reality it was the I/O dragging down the whole execution.
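Here is a rough sketch of that timing suggestion, assuming the csvReadHandler, csvWriteHandler, CUSTOMER_ID and idArray objects from the question are already set up:

import time

timings = {'read': 0.0, 'lookup': 0.0, 'write': 0.0}

t0 = time.time()
for fieldAry in csvReadHandler:
    timings['read'] += time.time() - t0       # time spent fetching this row

    t0 = time.time()
    keep = fieldAry[CUSTOMER_ID] not in idArray
    timings['lookup'] += time.time() - t0     # time spent on the membership test

    t0 = time.time()
    if keep:
        csvWriteHandler.writerow(fieldAry)
    timings['write'] += time.time() - t0      # time spent writing

    t0 = time.time()                          # restart the clock for the next read

print(timings)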
I think it's better to use a database to solve this problem.
First create a database (MySQL or anything else), then load your data from the files into two tables, and finally use a simple SQL query to select the rows, something like:
SELECT * FROM table1 WHERE id IN (SELECT ids FROM table2)
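If you want to try the database route without setting up a MySQL server, a minimal sketch with sqlite3 from the standard library would look something like this (table and column names are made up; the NOT IN mirrors what the original script keeps, flip it to IN if you want the matching rows instead):

import sqlite3

conn = sqlite3.connect('records.db')
conn.execute("CREATE TABLE ids (customer_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE records (customer_id TEXT, line TEXT)")
# ... bulk-load both tables with executemany() ...

# rows whose ID is NOT in the small table, i.e. what the original script writes out
kept = conn.execute("""
    SELECT line FROM records
    WHERE customer_id NOT IN (SELECT customer_id FROM ids)
""")
for (line,) in kept:
    pass  # write the line to the output file here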

Specifying limit and offset in Django QuerySet won't work

I'm using Django 1.6.5 and have MySQL's general query log on, so I can see the SQL hitting MySQL.
I noticed that specifying a bigger limit in a Django QuerySet would not work:
>>> from blog.models import Author
>>> len(Author.objects.filter(pk__gt=0)[0:999])
>>> len(Author.objects.all()[0:999])
MySQL's general log showed that both queries had LIMIT 21.
But a limit smaller than 21 would work; e.g. len(Author.objects.all()[0:10]) would produce SQL with LIMIT 10.
Why is that? Is there something I need to configure?
It happens when you make queries from the shell - the LIMIT clause is added to stop your terminal filling up with thousands of records when debugging:
You were printing (or, at least, trying to print) the repr() of the
queryset. To avoid people accidentally trying to retrieve and print a
million results, we (well, I) changed that to only retrieve and print
the first 20 results and print "remainder truncated" if there were more.
This is achieved by limiting the query to 21 results (if there are 21
results there are more than 20, so we print the "truncated" message).
That only happens in the repr() -- i.e. it's only for diagnostic
printing. No normal user code has this limit included automatically, so
you can happily create a queryset that iterates over a million results.
(Source)
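To see the distinction the quoted explanation draws, here is a rough sketch (the LIMIT comments describe what you would expect to find in the general query log):

# typed bare in the shell, only the repr() preview runs, capped at 21 rows
Author.objects.filter(pk__gt=0)[0:999]                    # -> ... LIMIT 21

# forcing evaluation in code applies the real slice
authors = list(Author.objects.filter(pk__gt=0)[0:999])    # -> ... LIMIT 999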
Django implements OFFSET using Python's array-slicing syntax. If you want to skip the first 10 elements and then show the next 5 elements, use:
MyModel.objects.all()[OFFSET:OFFSET+LIMIT]
For example, if you wanted to fetch 5 authors after an offset of 10, your code would look something like this:
Author.objects.all()[10:15]
You can read more about it in the official Django docs.
I have also written a blog post around this concept; you can read more there.
LIMIT and OFFSET don't work in Django quite the way we might expect them to.
For example.
If we have to read the next 10 rows starting from the 10th row and we specify:
Author.objects.all()[10:10]
It will return an empty list. In order to fetch the next 10 rows, we have to add the offset to the limit:
Author.objects.all()[10:10+10]
And it will return the next 10 rows, starting from the 10th row.
For offset and limit, I used the following and it worked for me :)
MyModel.objects.all()[offset:offset + limit]
For example:
Post.objects.filter(Post_type=typeId)[1:2]  # offset 1, limit 1
It does work, but Django uses an iterator; it does not load all objects at once.

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB and is a large (ha) source of the problem. I've some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of gigabyte size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages (of 1 kB each):
conn.execute("""PRAGMA cache_size = 4000""")
The connection again has a cache for the last 100 statements, as you can see in the function description:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, it is not likely that you will benefit from actively caching statements or pages at application start.
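Putting the two knobs mentioned above together, a minimal sketch (the file path is the illustrative one from your example):

import sqlite3

conn = sqlite3.connect('/some/file/path.db', cached_statements=200)  # statement cache, default 100
conn.execute("PRAGMA cache_size = 4000")  # page cache: 4000 pages instead of the default 2000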

How can I keep a list of recently seen users without running out of RAM/crashing my DB?

Here's some code that should demonstrate what I'm trying to do:
recently_seen = {}  # user_id -> time we last logged them
user_id = 10
while True:
    current_time = datetime.datetime.now()
    if user_id not in recently_seen:
        recently_seen[user_id] = current_time
        print 'seen {0}'.format(user_id)
    else:
        if current_time - recently_seen[user_id] > datetime.timedelta(seconds=5):
            recently_seen[user_id] = current_time
            print 'seen {0}'.format(user_id)
    time.sleep(0.1)
Basically, my program is listening on a socket for users. This is wrapped in a loop that spits out user_ids as it sees them. This means, I'm seeing user_ids every few milliseconds.
What I'm trying to do is log the users it sees and at what times. Saying it saw a user at 0.1 seconds and then again at 0.7 seconds is silly. So I want to implement a 5 second buffer.
It should find a user and, if the user hasn't been seen within 5 seconds, log them to a database.
The two solutions I've come up with is:
1) Keep the user_id in a dictionary (similar to the sample code above) and check against this. The problem is, if it's running for a few days and continues finding new users, this will eventually use up my RAM
2) Log them to a database and check against that. The problem with this is, it finds users every few milliseconds. I don't want to read the database every few milliseconds...
I need some way of creating a list of limited size. That limit would be 5 seconds. Any ideas on how to implement this?
How about removing the user from your dictionary once you log them to the database?
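A hedged sketch of that idea, assuming the recently_seen dict from the question (entries older than the 5-second buffer can be dropped, since the next sighting would be logged anyway):

import datetime

def prune(recently_seen, max_age_seconds=5):
    # drop entries older than the buffer window so the dict cannot grow without bound
    cutoff = datetime.datetime.now() - datetime.timedelta(seconds=max_age_seconds)
    for user_id, seen_at in list(recently_seen.items()):
        if seen_at < cutoff:
            del recently_seen[user_id]

Call it every so often (say, once per loop iteration or on a timer) and the dictionary stays bounded by the number of users seen in the last 5 seconds.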
Why aren't you using a DBM?
It will work like a dictionary but will be stored on the disk.
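A minimal sketch of that idea using shelve from the standard library, which behaves like a dict but persists to disk (keys must be strings; the file name and user_id value are illustrative):

import datetime
import shelve

recently_seen = shelve.open('recently_seen.db')
user_id = '10'                      # shelve keys must be strings
now = datetime.datetime.now()
last = recently_seen.get(user_id)
if last is None or now - last > datetime.timedelta(seconds=5):
    recently_seen[user_id] = now
    # log the sighting to the real database here
recently_seen.close()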

What's the best performing xml parsing for GAE (Python Version)?

I think we all know this page, but the benchmarks provided date from more than two years ago. So, I would like to know if you could point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else.
My objective is to process some XML feeds (about 25k of them, each around 4 kB in size); this will be a daily task. As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
Thanks for your answers.
Edit 01:
@Peter Recore
I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no. Processing takes very little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so I can only focus on this right now.
My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. And, as the internet is live and people keep adding and changing its data, I need the fastest method, because any data insertion during the "downloading and processing" time span will mess with my statistical analysis.
I used to do it from my own computer and the process took 24 minutes back then, but now the website has 12 times more information.
I know this doesn't answer my question directly, but it does what I needed.
I remembered that XML is not the only file type I could use, so instead of using an XML parser I chose to use JSON, which is about 2.5 times smaller in size and therefore faster to download. I used simplejson as my JSON library.
I used from google.appengine.api import urlfetch to get the json feeds in parallel:
class GetEntityJSON(webapp.RequestHandler):
    def post(self):
        url = 'http://url.that.generates.the.feeds/'
        if self.request.get('idList'):
            idList = self.request.get('idList').split(',')
            try:
                asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
            except urlfetch.DownloadError:
                # Dealt with time-out errors (#5) here, as these were very frequent
                pass
            for result in asyncRequests:
                if result.status_code == 200:
                    entityJSON = simplejson.loads(result.content)
                    # Filled a database entity with some json info. It goes like this:
                    # entity = Entity(
                    #     name = entityJSON['name'],
                    #     dateOfBirth = entityJSON['date_of_birth']
                    # ).put()
        self.redirect('/')

    def _asyncFetch(self, urlList):
        rpcs = []
        for url in urlList:
            rpc = urlfetch.create_rpc(deadline=10)
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)
        return [rpc.get_result() for rpc in rpcs]
I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (timed out). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.
But still, 25k feeds getting 5 at a time results in 5k calls. In a queue that can spawn 5 tasks a second, the total task time should be 17min in the end.
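For reference, the arithmetic behind that estimate:

feeds = 25000
batch_size = 5
calls = feeds / batch_size        # 5,000 fetch batches
minutes = calls / 5 / 60.0        # queue drains 5 tasks/second -> ~16.7 minutes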
