I have to choose between Cassandra and MongoDB (or another NoSQL database; I accept suggestions) for a project with a lot of inserts (1M/day).
So I created a small test to measure write performance. Here's the code to insert into Cassandra:
import time
import os
import random
import string
import pycassa

def get_random_string(string_length):
    return ''.join(random.choice(string.letters) for i in xrange(string_length))

def connect():
    """Connect to a test database"""
    connection = pycassa.connect('test_keyspace', ['localhost:9160'])
    db = pycassa.ColumnFamily(connection,'foo')
    return db

def random_insert(db):
    """Insert a record into the database. The record has the following format
    ID timestamp
    4 random strings
    3 random integers"""
    record = {}
    record['id'] = str(time.time())
    record['str1'] = get_random_string(64)
    record['str2'] = get_random_string(64)
    record['str3'] = get_random_string(64)
    record['str4'] = get_random_string(64)
    record['num1'] = str(random.randint(0, 100))
    record['num2'] = str(random.randint(0, 1000))
    record['num3'] = str(random.randint(0, 10000))
    db.insert(str(time.time()), record)

if __name__ == "__main__":
    db = connect()
    start_time = time.time()
    for i in range(1000000):
        random_insert(db)
    end_time = time.time()
    print "Insert time: %lf " %(end_time - start_time)
The code to insert into Mongo is the same, apart from the connection function:
def connect():
    """Connect to a test database"""
    connection = pymongo.Connection('localhost', 27017)
    db = connection.test_insert
    return db.foo2
The results are ~1046 seconds to insert into Cassandra and ~437 seconds into Mongo.
Cassandra is supposed to be much faster than Mongo at inserting data. So, what am I doing wrong?
There is no equivalent to Mongo's unsafe mode in Cassandra. (We used to have one, but we took it out, because it's just a Bad Idea.)
The other main problem is that you're doing single-threaded inserts. Cassandra is designed for high concurrency; you need to use a multithreaded test. See the graph at the bottom of http://spyced.blogspot.com/2010/01/cassandra-05.html (actual numbers are over a year out of date but the principle is still true).
The Cassandra source distribution has such a test included in contrib/stress.
If I am not mistaken, Cassandra allows you to specify whether or not you are doing a MongoDB-equivalent "safe mode" insert. (I don't recall the name of that feature in Cassandra.)
In other words, Cassandra may be configured to write to disk and then return, as opposed to the default MongoDB configuration, which returns immediately after performing an insert without knowing whether the insert was successful. With that default, your application never waits for a pass/fail from the server.
You can change that behavior by using safe mode in MongoDB but this is known to have a large impact on performance. Enable safe mode and you may see different results.
You will harness the true power of Cassandra once you have multiple nodes running. Any node can take a write request. Multithreading a client only floods more requests at the same instance, which stops helping after a point.
Check the Cassandra log for the events that happen during your tests. Cassandra initiates a disk write once the Memtable is full (this is configurable; make it large enough and you will be dealing only with in-RAM writes plus disk writes of the commit log). If a Memtable flush to disk happens during your test, it will slow the test down. I do not know when MongoDB writes to disk.
Might I suggest taking a look at Membase here? It's used in exactly the same way as memcached and is fully distributed so you can continuously scale your write input rate simply by adding more servers and/or more RAM.
For this case, you'll definitely want to go with a client-side Moxi to give you the best performance. Take a look at our wiki: wiki.membase.org for examples and let me know if you need any further instruction...I'm happy to walk you through it and I'm certain that Membase can handle this load easily.
"Create batch mutator for doing multiple insert, update, and remove operations using as few roundtrips as possible."
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.batch
The batch mutator helped me cut insert time by at least half.
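A rough sketch of how that might look, assuming (as I read the pycassa documentation linked above) that `ColumnFamily.batch()` returns a mutator that can be used as a context manager and sends any queued mutations on exit. `batch_insert` is a hypothetical helper name, not part of pycassa.

```python
def batch_insert(column_family, records, queue_size=100):
    """Insert (key, columns) pairs through a batch mutator.

    column_family is assumed to be a pycassa.ColumnFamily. The mutator
    queues inserts and sends them in groups of queue_size, then sends
    whatever is left when the with-block exits.
    """
    count = 0
    with column_family.batch(queue_size=queue_size) as mutator:
        for key, columns in records:
            mutator.insert(key, columns)
            count += 1
    return count
```

In the benchmark from the question, `records` would be a generator yielding the timestamp key and the record dictionary instead of calling `db.insert` once per record.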
Which library is best to use, "boto3" or "Psycopg2", for Redshift operations in Python Lambda functions:
Look up a table in a Redshift cluster
Create a table in a Redshift cluster
Insert data into a Redshift cluster
I would appreciate an answer that includes the following:
Python code, for either library, that addresses all three of the above needs.
Thanks in advance!
Connecting directly to Redshift from Lambda with psycopg2 is the simpler, more straightforward way to go, but it comes with a significant limitation. Lambda functions have run-time limits, and even if your SQL commands don't exceed the maximum run-time, you will be paying for the Lambda function to wait for Redshift to complete the SQL. For fast-running SQL commands this isn't a problem, but inserting data can take some time depending on the amount of data.
If all your Redshift actions take less than a few seconds (and won't grow longer with time), then psycopg2 connecting directly to Redshift is likely the way to go. If the data insert takes a minute or two BUT this process doesn't run very often (daily), then psycopg2 may still be the way to go, as Lambda isn't very expensive when run infrequently. It is a process simplicity vs. cost calculation.
Using the Redshift Data API is more complicated. This process lets you fire the SQL to Redshift and terminate the Lambda. A later-running Lambda checks whether the SQL has completed and, if so, checks its results. If the SQL has not completed, Lambda needs to be invoked again at a later time to see if things are complete. This polling process is often done by a Step Function and a set of different Lambda functions. It is not super difficult, but it is a level of complexity above a single Lambda. Since this is a polling process, there is a wait time between checks for results, which if too long leads to latency and if too short leads to over-polling and additional costs.
If you need the Data API for time-out reasons, then you may want to use both: psycopg2 for short-running queries to the database, like 'does this table exist?', and the Data API for long-running steps like 'insert this 1TB set of data into Redshift'.
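The "check later" step of the polling process could be sketched roughly as below. It assumes `client` is a `boto3.client('redshift-data')` and `statement_id` is the `Id` returned by an earlier `execute_statement` call. `statement_finished` is a hypothetical helper name, and the retry and wait logic is left to whatever invokes it (for example a Step Function).

```python
# Terminal statuses that mean the SQL will never complete successfully.
TERMINAL_FAILURES = ("FAILED", "ABORTED")


def statement_finished(client, statement_id):
    """Check a Data API statement once.

    Returns True when the statement has finished, False while it is still
    submitted or running, and raises RuntimeError if it failed or was aborted.
    """
    description = client.describe_statement(Id=statement_id)
    status = description["Status"]
    if status == "FINISHED":
        return True
    if status in TERMINAL_FAILURES:
        raise RuntimeError(
            "statement %s ended as %s: %s"
            % (statement_id, status, description.get("Error", "no error text"))
        )
    return False  # still SUBMITTED, PICKED or STARTED
```

A polling Lambda would call this once per invocation and return the boolean, so the caller can decide whether to wait and try again.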
Sample basic python code for all three operations using boto3.
import json
import boto3

clientdata = boto3.client('redshift-data')

# looks up table and returns true if found
def lookup_table(table_name):
    response = clientdata.list_tables(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        TablePattern=table_name
    )
    print(response)
    if ( len(response['Tables']) == 0 ):
        return False
    else:
        return True

# creates table with one integer column
def create_table(table_name):
    sqlstmt = 'CREATE TABLE '+table_name+' (col1 integer);'
    print(sqlstmt)
    response = clientdata.execute_statement(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        Sql=sqlstmt,
        StatementName='CreateTable'
    )
    print(response)

# inserts one row with integer value for col1
def insert_data(table_name, dval):
    print(dval)
    sqlstmt = 'INSERT INTO '+table_name+'(col1) VALUES ('+str(dval)+');'
    response = clientdata.execute_statement(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        Sql=sqlstmt,
        StatementName='InsertData'
    )
    print(response)

result = lookup_table('date')
if ( result ):
    print("Table exists.")
else:
    print("Table does not exist!")

create_table("testtab")
insert_data("testtab", 11)
I am not using Lambda; I ran this directly from my shell. It assumes credentials and a default region are already set up for the client. Hope this helps.
I use arango-orm (which uses python-arango in the background) in my Python/ArangoDB back-end. I have set up a small testing util that uses a remote database to insert test data, execute the unit tests and remove the test data again.
I insert my test data with a Python for loop. On each iteration, a small piece of information changes based on a generic object, and then I insert that modified generic object into ArangoDB until I have 10 test objects. However, after that code has run, my test assertions tell me I don't have 10 objects stored in my db, but only 8 (or sometimes 3, 7 or 9). It looks like python-arango runs these queries asynchronously, or ArangoDB already replies with an OK before the data is actually inserted. Does anyone have an idea of what is going on? When I put in a sleep of 1 second after all data is inserted, my tests run green. This obviously is no solution.
This is a little piece of example code I use:
def load_test_data(self) -> None:
    # This method is called from the setUp() method.
    logging.info("Loading test data...")
    for i in range(1, 11):
        # insertion with data object (ORM)
        user = test_utils.get_default_test_user()
        user.id = i
        user.username += str(i)
        user.name += str(i)
        db.add(user)
        # insertion with dictionary
        project = test_utils.get_default_test_project()
        project['id'] = i
        project['name'] += str(i)
        project['description'] = f"Description for project with the id {i}"
        db.insert_document("projects", project)
    # TODO: solve this dirty hack
    sleep(1)

def test_search_by_user_username(self) -> None:
    actual = dao.search("TestUser3")
    self.assertEqual(1, len(actual))
    self.assertEqual(3, actual[0].id)
Then my db is created like this in a separate module:
client = ArangoClient(hosts=f"http://{arango_host}:{arango_port}")
test_db = client.db(arango_db, arango_user, arango_password)
db = Database(test_db)
EDIT:
I had not set the sync property to true upon collection creation, but after changing the collection and setting it to true, the behaviour stays exactly the same.
After getting in touch with the people at ArangoDB, I learned that views are not updated as quickly as collections. They have given me an internal SEARCH option which also waits for views to sync. Since it's an internal option, intended only for unit testing, they highly discourage its use. I only use it for unit testing.
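For anyone who would rather not rely on an internal option, a generic alternative to the fixed `sleep(1)` is to poll until the view has caught up. This is only a sketch: `wait_until` is a hypothetical helper, not part of python-arango or arango-orm, and the predicate you pass is up to your tests.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Call predicate() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or raises TimeoutError if the condition
    was not met in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Hypothetical use at the end of load_test_data(), replacing sleep(1):
#     wait_until(lambda: len(dao.search("TestUser")) == 10)
```

The test setup then waits only as long as the view actually needs, and fails loudly if it never catches up.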
We have around 1500 SQLite databases; each has 0 to 20,000,000 records in its violation table, and the total number of violation records is around 90,000,000.
We generate each file by running a crawler on the 1500 servers. Alongside this violation table we have some other tables, which we use for further analysis.
To analyze the results, we push all these SQLite violation records into a Postgres violation table, along with other insertions and calculations.
The following is the code I use to transfer the records:
class PolicyViolationService(object):
    def __init__(self, pg_dao, crawler_dao_s):
        self.pg_dao = pg_dao
        self.crawler_dao_s = crawler_dao_s
        self.user_violation_count = defaultdict(int)
        self.analyzer_time_id = self.pg_dao.get_latest_analyzer_tracker()

    def process(self):
        """
        transfer policy violation record from crawler db to analyzer db
        """
        for crawler_dao in self.crawler_dao_s:
            violations = self.get_violations(crawler_dao.get_violations())
            self.pg_dao.insert_rows(violations)

    def get_violations(self, violation_records):
        for violation in violation_records:
            violation = dict(violation.items())
            violation.pop('id')
            self.user_violation_count[violation.get('user_id')] += 1
            violation['analyzer_time_id'] = self.analyzer_time_id
            yield PolicyViolation(**violation)

in sqlite dao
==============
def get_violations(self):
    result_set = self.db.execute('select * from policyviolations;')
    return result_set

in pg dao
=========
def insert_rows(self, rows):
    self.session.add_all(rows)
    self.session.commit()
This code works but takes a very long time. What is the right way to approach this problem? We have been discussing parallel processing, skipping SQLAlchemy, and some other options. Please suggest the right way.
Thanks in advance!
The fastest way to get these to PostgreSQL is to use the COPY command, outside any SQLAlchemy.
Within SQLAlchemy one must note that the ORM is very slow. It is doubly slow if you have lots of stuff in the ORM that you then flush. You could make it faster by flushing after every 1000 items or so; that would also make sure that the session does not grow too big. However, why not just use SQLAlchemy Core to generate inserts:
ins = violations.insert().values(col1='value', col2='value')
conn.execute(ins)
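A minimal sketch of the COPY route with batching, under some assumptions: `cursor` is taken to be a psycopg2 cursor (its `copy_expert` method is what gets called), rows are plain dictionaries, and the table and column names are placeholders to be replaced with your own. `chunked`, `rows_to_csv_buffer` and `copy_rows` are hypothetical helper names.

```python
import csv
import io
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def rows_to_csv_buffer(rows, columns):
    """Serialize dict rows into an in-memory CSV file for COPY ... FROM STDIN."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    buf.seek(0)
    return buf


def copy_rows(cursor, table, columns, rows, batch_size=10000):
    """Stream rows into `table` with COPY, one batch at a time.

    Returns the number of rows sent. Committing is left to the caller.
    """
    sql = "COPY %s (%s) FROM STDIN WITH (FORMAT csv)" % (table, ", ".join(columns))
    total = 0
    for batch in chunked(rows, batch_size):
        cursor.copy_expert(sql, rows_to_csv_buffer(batch, columns))
        total += len(batch)
    return total
```

Because `rows` can be a generator, the 90 million records never have to be held in memory at once. The same `chunked` helper also works for flushing the ORM session or executing Core inserts every N rows.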
I have a table in a database, mapped with the SQLAlchemy ORM module (I have a "scoped_session" variable).
I want multiple instances of my program (not just threads, but also instances on several servers) to be able to work on the same table and NOT work on the same data.
So I have coded a manual "row-lock" mechanism to make sure each row is handled by only one instance. In this method I take a full lock on the table while I "row-lock" the rows:
def instance():
    s = scoped_session(sessionmaker(bind=engine))
    engine.execute("LOCK TABLES my_data WRITE")
    rows = s.query(Row_model).filter(Row_model.condition == 1).filter(Row_model.is_locked == 0).limit(10)
    for row in rows:
        row.is_locked = 1
        row.lock_time = datetime.now()
    s.commit()
    engine.execute("UNLOCK TABLES")
    for row in rows:
        manipulate_data(row)
        row.is_locked = 0
    s.commit()

for i in range(10):
    t = threading.Thread(target=instance)
    t.start()
The problem is that while running some instances, several threads crash, each producing this error:
sqlalchemy.exc.DatabaseError: (raised as a result of Query-invoked
autoflush; consider using a session.no_autoflush block if this flush
is occurring prematurely) (DatabaseError) 1205 (HY000): Lock wait
timeout exceeded; try restarting transaction 'UPDATE my_daya SET
row_var = 1}
Where is the catch? What prevents my DB table from being UNLOCKed successfully?
Thanks.
Locks are evil. Avoid them. Things go very bad when errors occur. Especially when you mix sessions with raw SQL statements, like you do.
The beauty of the scoped session is that it wraps a database transaction. This transaction makes the modifications to the database atomic, and also takes care of cleaning up when things go wrong.
Use scoped sessions as follows:
with scoped_session(sessionmaker(bind=engine)) as s:
    <ORM actions using s>
It may be some work to rewrite your code so that it becomes properly transactional, but it will be worth it! SQLAlchemy has tricks to help you with that.
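To illustrate what "properly transactional" can mean for claiming rows, here is a small sketch that uses the standard-library sqlite3 module purely as a stand-in so it runs anywhere. It is not SQLAlchemy and not MySQL, and the SQL would need adapting for your database. The `locked_by` column and the `claim_rows` helper are assumptions made for the example. The point is that the claim is a single UPDATE inside one transaction, so no table lock has to be taken and released by hand.

```python
import sqlite3


def claim_rows(conn, worker_id, limit=10):
    """Mark up to `limit` unclaimed rows as owned by worker_id, atomically."""
    with conn:  # one transaction: commit on success, rollback on exception
        conn.execute(
            "UPDATE my_data SET locked_by = ? "
            "WHERE id IN (SELECT id FROM my_data "
            "             WHERE locked_by IS NULL ORDER BY id LIMIT ?)",
            (worker_id, limit),
        )
    rows = conn.execute(
        "SELECT id FROM my_data WHERE locked_by = ? ORDER BY id", (worker_id,)
    )
    return [r[0] for r in rows]


# Demo data in an in-memory database (assumption for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (id INTEGER PRIMARY KEY, locked_by TEXT)")
conn.executemany("INSERT INTO my_data (id) VALUES (?)", [(i,) for i in range(25)])
conn.commit()

first = claim_rows(conn, "worker-a")
second = claim_rows(conn, "worker-b")
print(first, second)
```

If anything fails inside the `with conn:` block, the transaction is rolled back and no rows are left half-claimed.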
I have an expensive function that takes and returns a small amount of data (a few integers and floats). I have already memoized this function, but I would like to make the memo persistent. There are already a couple of threads relating to this, but I'm unsure about potential issues with some of the suggested approaches, and I have some fairly specific requirements:
I will definitely use the function from multiple threads and processes simultaneously (both using multiprocessing and from separate python scripts)
I will not need read or write access to the memo from outside this python function
I am not that concerned about the memo being corrupted on rare occasions (like pulling the plug or accidentally writing to the file without locking it) as it isn't that expensive to rebuild (typically 10-20 minutes) but I would prefer if it would not be corrupted because of exceptions, or manually terminating a python process (I don't know how realistic that is)
I would strongly prefer solutions that don't require large external libraries as I have a severely limited amount of hard disk space on one machine I will be running the code on
I have a weak preference for cross-platform code, but I will likely only use this on Linux
This thread discusses the shelve module, which is apparently not process-safe. Two of the answers suggest using fcntl.flock to lock the shelve file. Some of the responses in this thread, however, seem to suggest that this is fraught with problems - but I'm not exactly sure what they are. It sounds as though this is limited to Unix (though apparently Windows has an equivalent called msvcrt.locking), and the lock is only 'advisory' - i.e., it won't stop me from accidentally writing to the file without checking it is locked. Are there any other potential problems? Would writing to a copy of the file, and replacing the master copy as a final step, reduce the risk of corruption?
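For reference, the "write to a copy, then replace the master copy" idea can be sketched as follows. This assumes Python 3.3+ for `os.replace` and a POSIX filesystem for the atomic rename. `atomic_pickle_dump` is a hypothetical helper name. It protects readers from seeing a half-written file, but it does not stop two writers from overwriting each other's updates.

```python
import os
import pickle
import tempfile


def atomic_pickle_dump(obj, path):
    """Write obj to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file; the master copy is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed mid-write, at worst a stray `.tmp` file is left behind next to an intact master copy.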
It doesn't look as though the dbm module will do any better than shelve. I've had a quick look at sqlite3, but it seems a bit overkill for this purpose. This thread and this one mention several 3rd party libraries, including ZODB, but there are a lot of choices, and they all seem overly large and complicated for this task.
Does anyone have any advice?
UPDATE: kindall mentioned IncPy below, which does look very interesting. Unfortunately, I wouldn't want to move back to Python 2.6 (I'm actually using 3.2), and it looks like it is a bit awkward to use with C libraries (I make heavy use of numpy and scipy, among others).
kindall's other idea is instructive, but I think adapting this to multiple processes would be a little difficult - I suppose it would be easiest to replace the queue with file locking or a database.
Looking at ZODB again, it does look perfect for the task, but I really do want to avoid using any additional libraries. I'm still not entirely sure what all the issues with simply using flock are - I imagine one big problem is if a process is terminated while writing to the file, or before releasing the lock?
So, I've taken synthesizerpatel's advice and gone with sqlite3. If anyone's interested, I decided to make a drop-in replacement for dict that stores its entries as pickles in a database (I don't bother to keep any in memory as database access and pickling is fast enough compared to everything else I'm doing). I'm sure there are more efficient ways of doing this (and I've no idea whether I might still have concurrency issues), but here is the code:
# On Python older than 3.3 this was 'from collections import MutableMapping'
from collections.abc import MutableMapping
import sqlite3
import pickle

class PersistentDict(MutableMapping):
    def __init__(self, dbpath, iterable=None, **kwargs):
        self.dbpath = dbpath
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'create table if not exists memo '
                '(key blob primary key not null, value blob not null)'
            )
        if iterable is not None:
            self.update(iterable)
        self.update(kwargs)

    def encode(self, obj):
        return pickle.dumps(obj)

    def decode(self, blob):
        return pickle.loads(blob)

    def get_connection(self):
        return sqlite3.connect(self.dbpath)

    def __getitem__(self, key):
        key = self.encode(key)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select value from memo where key=?',
                (key,)
            )
            value = cursor.fetchone()
        if value is None:
            raise KeyError(key)
        return self.decode(value[0])

    def __setitem__(self, key, value):
        key = self.encode(key)
        value = self.encode(value)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'insert or replace into memo values (?, ?)',
                (key, value)
            )

    def __delitem__(self, key):
        key = self.encode(key)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select count(*) from memo where key=?',
                (key,)
            )
            if cursor.fetchone()[0] == 0:
                raise KeyError(key)
            cursor.execute(
                'delete from memo where key=?',
                (key,)
            )

    def __iter__(self):
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select key from memo'
            )
            records = cursor.fetchall()
        for r in records:
            yield self.decode(r[0])

    def __len__(self):
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select count(*) from memo'
            )
            return cursor.fetchone()[0]
sqlite3 out of the box provides ACID. File locking is prone to race-conditions and concurrency problems that you won't have using sqlite3.
Basically, yeah, sqlite3 is more than what you need, but it's not a huge burden. It can run on mobile phones, so it's not like you're committing to running some beastly software. It's going to save you time reinventing wheels and debugging locking issues.
I assume you want to continue to memoize the results of the function in RAM, probably in a dictionary, but use the persistence to reduce the "warmup" time of the application. In this case you're not going to be randomly accessing items directly in the backing store so a database might indeed be overkill (though as synthesizerpatel notes, maybe not as much as you think).
Still, if you want to roll your own, a viable strategy might be to simply load the dictionary from a file at the beginning of your run before starting any threads. When a result isn't in the dictionary, then you need to write it to the file after adding it to the dictionary. You can do this by adding it to a queue and using a single worker thread that flushes items from the queue to disk (just appending them to a single file would be fine). You might occasionally add the same result more than once, but this is not fatal since it'll be the same result each time, so reading it back in twice or more will do no real harm. Python's threading model will keep you out of most kinds of concurrency trouble (e.g., appending to a list is atomic).
Here is some (untested, generic, incomplete) code showing what I'm talking about:
import cPickle as pickle
import time, os.path

cache = {}
queue = []

# run at script start to warm up cache
def preload_cache(filename):
    if os.path.isfile(filename):
        with open(filename, "rb") as f:
            while True:
                try:
                    key, value = pickle.load(f), pickle.load(f)
                except EOFError:
                    break
                cache[key] = value

# your memoized function
def time_consuming_function(a, b, c, d):
    key = (a, b, c, d)
    if key in cache:
        return cache[key]
    else:
        # generate the result here
        # ...
        # add to cache, checking to see if it's already there again to avoid writing
        # it twice (in case another thread also added it) (this is not fatal, though)
        if key not in cache:
            cache[key] = result
            queue.append((key, result))
        return result

# run on worker thread to write new items out
def write_cache(filename):
    with open(filename, "ab") as f:
        while True:
            while queue:
                key, value = queue.pop() # item order not important
                # but must write key and value in single call to ensure
                # both get written (otherwise, interrupting script might
                # leave only one written, corrupting the file)
                f.write(pickle.dumps(key, pickle.HIGHEST_PROTOCOL) +
                        pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
            f.flush()
            time.sleep(1)
If I had time, I'd turn this into a decorator... and put the persistence into a dict subclass... the use of global variables is also sub-optimal. :-) If you use this approach with multiprocessing you'd probably want to use a multiprocessing.Queue rather than a list; you can then use queue.get() as a blocking wait for a new result in the worker process that writes to the file. I've not used multiprocessing, though, so take this bit of advice with a grain of salt.
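For what it's worth, the decorator idea could look roughly like this. It is a Python 3 sketch with hypothetical names (`persistent_memoize`, `slow_add`), and it assumes the function takes only hashable positional arguments. The store can be any dict-like mapping, so a plain dict is used here, but a mapping backed by a file or database should slot in the same way.

```python
import functools


def persistent_memoize(store):
    """Memoize a function in `store`, which can be any dict-like mapping."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            try:
                return store[args]
            except KeyError:
                pass
            result = func(*args)
            store[args] = result
            return result
        return wrapper
    return decorator


calls = []
memo = {}  # stand-in for a persistent mapping (assumption for the demo)


@persistent_memoize(memo)
def slow_add(a, b):
    calls.append((a, b))  # records how often the real function runs
    return a + b


first = slow_add(1, 2)
second = slow_add(1, 2)  # served from the store, slow_add's body is not re-run
```

Keeping the persistence behind the mapping interface means the decorator itself does not need to know whether results live in RAM, a file, or a database.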