Python ArangoDB insertion of objects not completed after method is run

I use arango-orm (which uses python-arango in the background) in my Python/ArangoDB back-end. I have set up a small testing util that uses a remote database to insert test data, execute the unit tests and remove the test data again.
I insert my test data with a Python for loop. Each iteration, a small piece of information changes based on a generic object, and then I insert that modified generic object into ArangoDB, until I have 10 test objects. However, after that code has run, my test assertions tell me I don't have 10 objects stored in my db, but only 8 (or sometimes 3, 7 or 9). It looks like python-arango runs these queries asynchronously, or that ArangoDB already replies with an OK before the data is actually inserted. Does anyone have an idea of what is going on? When I put in a sleep of 1 second after all the data is inserted, my tests run green. This obviously is no solution.
This is a little piece of example code I use:
def load_test_data(self) -> None:
    # This method is called from the setUp() method.
    logging.info("Loading test data...")
    for i in range(1, 11):
        # insertion with data object (ORM)
        user = test_utils.get_default_test_user()
        user.id = i
        user.username += str(i)
        user.name += str(i)
        db.add(user)
        # insertion with dictionary
        project = test_utils.get_default_test_project()
        project['id'] = i
        project['name'] += str(i)
        project['description'] = f"Description for project with the id {i}"
        db.insert_document("projects", project)
    # TODO: solve this dirty hack
    sleep(1)

def test_search_by_user_username(self) -> None:
    actual = dao.search("TestUser3")
    self.assertEqual(1, len(actual))
    self.assertEqual(3, actual[0].id)
Then my db is created like this in a separate module:
client = ArangoClient(hosts=f"http://{arango_host}:{arango_port}")
test_db = client.db(arango_db, arango_user, arango_password)
db = Database(test_db)
EDIT:
I had not set the sync property to true upon collection creation, but after changing the collection and setting it to true, the behaviour stayed exactly the same.

After getting in touch with the people of ArangoDB, I learned that views are not updated as quickly as collections. They have given me an internal SEARCH option which also waits for syncing views. Since it's an internal option, only used for unit testing, they highly discourage the use of it. As for me, I only use it for unit testing.
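For reference, the option in question appears to be waitForSync on the AQL SEARCH operation. A minimal sketch of what such a test-only query could look like through python-arango, where the view name "user_view" and the searched field are made up for illustration:

query = """
FOR doc IN user_view
    SEARCH doc.username == @username
    OPTIONS { waitForSync: true }
    RETURN doc
"""
# waitForSync on SEARCH blocks the query until the ArangoSearch view has
# caught up with the underlying collection; the docs reserve it for testing.
cursor = test_db.aql.execute(query, bind_vars={"username": "TestUser3"})
print(list(cursor))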

Related

How can I do parallel tests in Django with Elasticsearch-dsl?

Has anyone gotten parallel tests to work in Django with Elasticsearch? If so, can you share what configuration changes were required to make it happen?
I've tried just about everything I can think of to make it work, including the solution outlined here. Taking inspiration from how Django itself handles parallel DBs, I have currently created a custom ParallelTestSuite that overrides init_worker to iterate through each index/doctype and change the index names, roughly as follows:
_worker_id = 0

def _elastic_search_init_worker(counter):
    global _worker_id
    with counter.get_lock():
        counter.value += 1
        _worker_id = counter.value
    for alias in connections:
        connection = connections[alias]
        settings_dict = connection.creation.get_test_db_clone_settings(_worker_id)
        # connection.settings_dict must be updated in place for changes to be
        # reflected in django.db.connections. If the following line assigned
        # connection.settings_dict = settings_dict, new threads would connect
        # to the default database instead of the appropriate clone.
        connection.settings_dict.update(settings_dict)
        connection.close()
    ### Everything above this is from the Django version of this function ###
    # Update index names in doctypes
    for doc in registry.get_documents():
        doc._doc_type.index += f"_{_worker_id}"
    # Update index names for indexes and create new indexes
    for index in registry.get_indices():
        index._name += f"_{_worker_id}"
        index.delete(ignore=[404])
        index.create()
    print(f"Started thread # {_worker_id}")
This seems to generally work; however, there's some weirdness that happens seemingly randomly (i.e. running the test suite again doesn't reliably reproduce the issue and/or the error messages change). The following are the various errors I've gotten, and it seems to fail randomly on one of them each test run:
- a 404 raised when trying to create the index in the function above (I've confirmed that it's a 404 coming back from the PUT request; however, the Elasticsearch server logs say the index was created without issue)
- a 500 when trying to create the index, although this one hasn't happened in a while, so I think it was fixed by something else
- query responses that sometimes don't have an items dictionary value inside the _process_bulk_chunk function from the elasticsearch library
I'm thinking that there's something weird going on at the connection layer (like somehow the connections between Django test runner processes are getting the responses mixed up?) but I'm at a loss as to how that would be even possible since Django uses multiprocessing to parallelize the tests and thus they are each running in their own process. Is it somehow possible that the spun-off processes are still trying to use the connection pool of the original process or something? I'm really at a loss of other things to try from here and would greatly appreciate some hints or even just confirmation that this is in fact possible to do.
Is it somehow possible that the spun-off processes are still trying to use the connection pool of the original process or something?
This is exactly what is happening. From the Elasticsearch DSL docs:
Since we use persistent connections throughout the client it means that the client doesn’t tolerate fork very well. If your application calls for multiple processes make sure you create a fresh client after call to fork. Note that Python’s multiprocessing module uses fork to create new processes on POSIX systems.
What I observed happening is that the responses get very weirdly interleaved with a seemingly random client that may have started the request. So a request to index a document might end up with a response to create an index, which has very different attributes on it.
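In practical terms the docs mean that a process created via fork() must build its own client before touching Elasticsearch again. A minimal sketch of that idea, assuming a django-elasticsearch-dsl style ELASTICSEARCH_DSL setting (the alias name is illustrative):

from django.conf import settings
from elasticsearch_dsl.connections import connections

def recreate_es_connection():
    # Called inside the child process, right after the fork: configure()
    # rebuilds the global connection registry, so this worker gets fresh
    # sockets instead of reusing the parent's persistent connections.
    connections.configure(default=settings.ELASTICSEARCH_DSL["default"])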
The fix is to ensure that each test worker has its own Elasticsearch client. This can be done by creating worker-specific connection aliases and then overwriting the current connection aliases (via the private attribute _using) with the worker-specific ones. Below is a modified version of the code you posted with that change:
_worker_id = 0

def _elastic_search_init_worker(counter):
    global _worker_id
    with counter.get_lock():
        counter.value += 1
        _worker_id = counter.value
    for alias in connections:
        connection = connections[alias]
        settings_dict = connection.creation.get_test_db_clone_settings(_worker_id)
        # connection.settings_dict must be updated in place for changes to be
        # reflected in django.db.connections. If the following line assigned
        # connection.settings_dict = settings_dict, new threads would connect
        # to the default database instead of the appropriate clone.
        connection.settings_dict.update(settings_dict)
        connection.close()
    ### Everything above this is from the Django version of this function ###
    # each worker needs its own connection to elasticsearch, the Elasticsearch
    # client uses global connection objects that do not play nice otherwise
    # (imported with an alias so it doesn't shadow django.db.connections above)
    from elasticsearch_dsl.connections import connections as es_connections
    worker_connection_postfix = f"_worker_{_worker_id}"
    es_connections.configure(**{
        alias + worker_connection_postfix: conn_settings
        for alias, conn_settings in settings.ELASTICSEARCH_DSL.items()
    })
    # Update index names in doctypes
    for doc in registry.get_documents():
        doc._doc_type.index += f"_{_worker_id}"
        # Use the worker-specific connection (_using defaults to None, which
        # means the "default" alias)
        doc._doc_type._using = (doc._doc_type._using or "default") + worker_connection_postfix
    # Update index names for indexes and create new indexes
    for index in registry.get_indices():
        index._name += f"_{_worker_id}"
        index._using = (index._using or "default") + worker_connection_postfix
        index.delete(ignore=[404])
        index.create()
    print(f"Started thread # {_worker_id}")

Flask SQLAlchemy Sessions commit() not working "sometimes"

I've been working on a web app for a while, and this is the first time I've noticed this problem. I think it could be related to how SQLAlchemy sessions are handled, so some clarification in simple terms would be helpful.
My configuration for working with flask-sqlalchemy is:
from flask_sqlalchemy import SQLAlchemy
db = SQLAlchemy(app)
My problem: db.session.commit() sometimes doesn't save changes. I wrote some Flask endpoints which are reached via front-end requests from the user's browser.
In this particular case, I'm editing a hotel "Booking" object, altering the "Rooms" column, which is a Text field.
The function does the following:
1. Query the Booking object by the dates in the request.
2. Edit the Rooms column of this Booking object.
3. Commit the changes with db.session.commit().
4. If the user has X functionality active, make some checks by calling a second function:
   4.1. This function makes some checks, then queries and edits another object in the database, different from the "Booking" object I edited previously.
   4.2. At the end of this secondary function I call db.session.commit(). (Note: these changes always get saved correctly in the database.)
   4.3. Return the results to the previous function.
5. Return results to the front end. Just before this return, I print Booking.Rooms to make sure it looks as it should, and it does. I even tried a second commit after the print but before the return. Even so, sometimes Booking.Rooms gets updated as expected and other times it doesn't. I noticed that if I repeat the action enough times it finally works, but since the intermediate function (described in point 4) always saves its changes correctly, this causes an inconsistency in the data and drives me mad: once the point-4 procedure has run correctly, I can't simply repeat the "modify Rooms" action.
So I'm now really confused about whether this is something I don't understand about Flask sessions. As I understand it, every new request to Flask gets an isolated session, right?
I mean, if 2 concurrent users are storing changes in the database, a db.session.commit() from one of the users won't commit the changes from the other one, right?
Likewise, if I call db.session.commit() in one request, those changes are stored in the database, and if I keep modifying things afterwards in the same request, it's effectively another session, right? The committed changes are already safely stored, and I can still use the previous objects for further modifications?
Anyway, none of this should be a problem, because after the commit() I print Booking.Rooms and it looks as expected... And sometimes it gets stored correctly and other times it doesn't...
Also note: when I return this result to the client, the client instantly makes a second request to the server asking for updated Booking data, and that data comes back without the expected changes committed. I assume Flask finished handling the commit() before it received the second request; otherwise it couldn't have returned the first result.
Could this be a limitation of the Flask development server, such that it can't handle many requests correctly, and it wouldn't happen when deployed with gunicorn?
Any hint or clarification about sessions would be nice, because this is pretty strange behaviour, especially since it sometimes works and other times doesn't...
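For what it's worth, the isolation I'm asking about can be checked directly: Flask-SQLAlchemy's db.session is a scoped_session proxy, so each request scope resolves to its own Session object. A minimal sketch, independent of my real app (the in-memory URI is a placeholder):

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite://"  # throwaway in-memory DB
db = SQLAlchemy(app)

@app.route("/session-id")
def session_id():
    # db.session is a proxy; calling it returns the Session bound to the
    # current scope, so concurrent requests don't commit each other's work.
    return str(id(db.session()))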
And as requested, here is the code. I know it's not reproducible as-is: there's a lot of setup behind it, and it would need a lot of data to work as intended under the same circumstances as my case, but it should give an overview of what the functions look like and where the commits I mention above are. Any idea of where the problem could be is very welcome.
# Main function hit by the frontend
@users.route('/url_endpoint1', methods=['POST'], strict_slashes=False)
@login_required
def add_room_to_booking_Api():
    try:
        bookingData = request.get_json()
        roomURL = bookingData["roomSafeURL"]
        targetBooking = bookingData["bookingURL"]
        startDate = bookingData["checkInDate"]
        endDate = bookingData["checkOutDate"]
        roomPrices = bookingData["roomPrices"]
        booking = Bookings.query.filter_by(SafeURL=targetBooking).first()
        alojamiento = Alojamientos.query.filter_by(id=booking.CodigoAlojamiento).first()  # owner of the booking
        room = Rooms.query.filter_by(SafeURL=roomURL).first()
        roomsInBooking = ast.literal_eval(booking.Habitaciones)  # I know, should be json.loads() and json.dumps() for better performance probably...
        # if the room is available for the given days, add it to the booking
        if CheckIfRoomIsAvailableForBooking(alojamiento.id, room, startDate, endDate, booking) == "OK":
            roomsInBooking.append({"id": room.id, "Prices": roomPrices, "Guests": []})  # add the new room to the Rooms column of the booking
            booking.Habitaciones = str(roomsInBooking)  # save the new rooms data
            print(booking.Habitaciones)  # check changes applied
            room.ReservaAsociada = booking.id  # associate booking and room
            for ocupante in room.Ocupantes:  # associate people in the room with the booking
                ocupante.Reserva = booking.id
            # db.session.refresh(booking)  # test I made to check if something changed, but it didn't work
            if some_X_function() == True:  # if the user has some functionality enabled
                # db.session.begin()  # another test which didn't work
                RType = WuBook_Rooms.query.filter_by(ParentType=room.Tipo).first()
                RType = [RType]  # convert to list because I reuse the function in cases with multiple types
                resultAdd = function4(RType, booking.Entrada.replace(hour=0, minute=0, second=0), booking.Salida.replace(hour=0, minute=0, second=0))
                if resultAdd["resultado"] == True:  # returns {"resultado": error, "casos": casos}
                    return jsonify({"resultado": "Error", "mensaje": resultAdd["casos"]})
            print(booking.Habitaciones)  # here I still get the expected result
            db.session.commit()
            # I get this return with everything correct in my frontend, but it is not really stored in the database
            return jsonify({"resultado": "Ok", "mensaje": "Room " + str(room.Identificador) + " added to the booking"})
        else:
            return jsonify({"resultado": "Error", "mensaje": "Room " + str(room.Identificador) + " not available to book in target dates"})
    except Exception as e:
        # some error handling which is not getting hit now
        db.session.rollback()
        print(e, ": en linea", lineno())
        excepcion = str((''.join(traceback.TracebackException.from_exception(e).format()).replace("\n", "</br>"), "</br>Excepcion emitida en la línea: ", lineno()))
        sendExceptionEmail(excepcion, current_user)
        return jsonify({"resultado": "Error", "mensaje": "Error"})
# function from point 4
def function4(RType, startDate, endDate):
    delta = endDate - startDate
    print(startDate, endDate)
    print(delta)
    for ind_type in RType:
        calendarUpdated = json.loads(ind_type.updated_availability_by_date)
        calendarUpdatedBackup = calendarUpdated
        casos = {}
        actualizar = False
        error = False
        for i in range(delta.days):
            day = (startDate + timedelta(days=i))
            print(day, i)
            diaString = day.strftime("%d-%m-%Y")
            if day >= datetime.now() or diaString == datetime.now().strftime("%d-%m-%Y"):  # only care about present and future dates
                disponibilidadLocal = calendarUpdated[diaString]["local"]
                yaReservadas = calendarUpdated[diaString]["local_booked"]
                disponiblesChannel = calendarUpdated[diaString]["avail"]
                # adjust availability data
                if somecondition == True:
                    actualizar = True
                    casos.update({diaString: "Happened X"})
                else:
                    actualizar = False
                    casos.update({diaString: "Happened Y"})
                    error = "Error"
        if actualizar == True:  # this part of the code is hit normally and changes are stored correctly
            ind_type.updated_availability_by_date = json.dumps(calendarUpdated)
            wubookproperty = WuBook_Properties.query.filter_by(id=ind_type.PropertyCode).first()
            wubookproperty.SyncPending = True
            ind_type.AvailUpdatePending = True
        elif actualizar == False:  # some error occurred, revert changes
            ind_type.updated_availability_by_date = json.dumps(calendarUpdatedBackup)
    db.session.commit()  # this commit persists
    return {"resultado": error, "casos": casos}  # return to main function with all these changes stored
It turned out nothing was wrong at the session level; it was my fault in another function, client side, which sends a request to update the same data that had just been updated, but with the old data... So in fact I was getting the data saved correctly in the database, but it was overwritten a few milliseconds later. It was just a missing return statement in a JavaScript file to avoid this outcome...

Python script getting Killed

Environment
Flask 0.10.1
SqlAlchemy 1.0.10
Python 3.4.3
Using unittest
I have created two separate tests whose goal is to look through 700k database records and do some string finds. When the tests are executed one at a time, everything works fine, but when the whole script is executed with:
python name_of_script.py
it exits with "KILLED" at random places.
The main code in both tests goes something like this:
def test_redundant_categories_exist(self):
    self.assertTrue(self.get_redundant_categories() > 0, 'There are 0 redundant categories to remove. Cannot test removing them if there are none to remove.')

def get_redundant_categories(self):
    total = 0
    with get_db_session_scope(db.session) as db_session:
        records = db_session.query(Category)
        for row in records:
            if len(row.c) > 1:
                c = row.c
                # TODO: threads, each thread handles a bulk of rows
                redundant_categories = [cat_x.id
                                        for cat_x in c
                                        for cat_y in c
                                        if cat_x != cat_y and re.search(r'(^|/)' + cat_x.path + r'($|/)', cat_y.path)
                                        ]
                total += len(redundant_categories)
        records = None
        db_session.close()
    return total
The other test calls a function located in the manager.py file that does something similar, but with an added bulk delete in the database.
def test_remove_redundant_mappings(self):
    import os
    os.system("python ../../manager.py remove_redundant_mappings")
    self.assertEqual(self.get_redundant_categories(), 0, "There are redundant categories left after running manager.py remove_redundant_mappings()")
Is it possible for data to be kept in memory between tests? I don't quite understand why executing the tests individually works, but when they are run back to back, the process ends with Killed.
Any ideas?
Edit (things I've tried to no avail):
- importing the function from manager.py and calling it directly, without os.system(..)
- importing gc and running gc.collect() after get_redundant_categories() and after calling remove_redundant_mappings()
While searching high and low, I serendipitously came upon the following comment in this StackOverflow question/answer:
What is happening, I think, is that people are instantiating sessions and not closing them. The objects are then being garbage collected without closing the sessions. Why sqlalchemy sessions don't close themselves when the session object goes out of scope has always and will always be beyond me. — melchoir55
So I added the following to the method that was being tested:
db_session.close()
Now the unittest executes without getting killed.
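For context, a session-scope helper shaped like the get_db_session_scope used above might look like this. This is a sketch only (the real implementation isn't shown in the question), with close() guaranteed in a finally block so the fix doesn't depend on remembering to call it:

from contextlib import contextmanager

@contextmanager
def get_db_session_scope(session):
    # Hypothetical reconstruction: commit on success, roll back on error,
    # and always close so the session's identity map can be reclaimed.
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()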

Python Mock Process for Unit Testing

Background:
I am currently writing a process monitoring tool (Windows and Linux) in Python and implementing unit test coverage. The process monitor hooks into the Windows API function EnumProcesses on Windows and monitors the /proc directory on Linux to find current processes. The process names and process IDs are then written to a log which is accessible to the unit tests.
Question:
When I unit test the monitoring behavior I need a process to start and terminate. I would love it if there were a (cross-platform?) way to start and terminate a fake system process that I could uniquely name (and track its creation in a unit test).
Initial ideas:
I could use subprocess.Popen() to open any system process, but this runs into some issues. The unit tests could falsely pass if the process I'm using to test is also being run by the system. Also, the unit tests are run from the command line, and any Linux process I can think of suspends the terminal (nano, etc.).
I could start a process and track it by its process ID but I'm not exactly sure how to do this without suspending the terminal.
These are just thoughts and observations from initial testing and I would love it if someone could prove me wrong on either of these points.
I am using Python 2.6.6.
Edit:
Get all Linux process IDs:
try:
    processDirectories = os.listdir(self.PROCESS_DIRECTORY)
except IOError:
    return []
return [pid for pid in processDirectories if pid.isdigit()]
Get all Windows process IDs:
import ctypes, ctypes.wintypes

Psapi = ctypes.WinDLL('Psapi.dll')
EnumProcesses = Psapi.EnumProcesses
EnumProcesses.restype = ctypes.wintypes.BOOL

count = 50
while True:
    # Build arguments to EnumProcesses
    processIds = (ctypes.wintypes.DWORD * count)()
    size = ctypes.sizeof(processIds)
    bytes_returned = ctypes.wintypes.DWORD()
    # Call EnumProcesses to find all processes
    if EnumProcesses(ctypes.byref(processIds), size, ctypes.byref(bytes_returned)):
        if bytes_returned.value < size:
            return processIds
        else:
            # We weren't able to get all the processes so double our size and try again
            count *= 2
    else:
        print "EnumProcesses failed"
        sys.exit()
The Windows code is from here.
edit: this answer is getting long :), but some of my original answer still applies, so I leave it in :)
Your code is not so different from my original answer. Some of my ideas still apply.
When you are writing unit tests, you want to test only your logic. When you use code that interacts with the operating system, you usually want to mock that part out. The reason is that you don't have much control over the output of those libraries, as you found out. So it's easier to mock those calls.
In this case, there are two libraries interacting with the system: os.listdir and EnumProcesses. Since you didn't write them, we can easily fake them to return what we need, which in this case is a list.
But wait, in your comment you mentioned:
"The issue I'm having with it however is that it really doesn't test
that my code is seeing new processes on the system but rather that the
code is correctly monitoring new items in a list."
The thing is, we don't need to test the code that actually monitors the processes on the system, because it's third-party code. What we need to test is that your code logic handles the returned processes, because that's the code you wrote. The reason we test against a list is that that's what your logic operates on: os.listdir and EnumProcesses return lists of pids (numeric strings and integers, respectively), and your code acts on such a list.
I'm assuming your code is inside a class (you are using self in your code). I'm also assuming that the calls are isolated inside their own methods (you are using return). So this will be roughly what I suggested originally, except with actual code :) I don't know if they are in the same class or different classes, but it doesn't really matter.
Linux method
Now, testing your Linux process function is not that difficult. You can patch os.listdir to return a list of pids.
def getLinuxProcess(self):
    try:
        processDirectories = os.listdir(self.PROCESS_DIRECTORY)
    except IOError:
        return []
    return [pid for pid in processDirectories if pid.isdigit()]
Now for the test.
import unittest
from fudge import patched_context
import os
import LinuxProcessClass  # class that contains the getLinuxProcess method

def _raise_io_error(path):
    # patched_context needs a callable; a lambda cannot contain a raise statement
    raise IOError

def test_LinuxProcess(self):
    """Test the logic of our getLinuxProcess.

    We patch os.listdir and return our own list, because os.listdir
    returns a list. We do this so that we can control the output
    (we test *our* logic, not a built-in library's functionality).
    """
    # Test we can parse our pids
    fakeProcessIds = ['1', '2', '3']
    with patched_context(os, 'listdir', lambda x: fakeProcessIds):
        myClass = LinuxProcessClass()
        ....
        result = myClass.getLinuxProcess()
        expected = ['1', '2', '3']  # listdir returns strings, and getLinuxProcess keeps them as-is
        self.assertEqual(result, expected)

    # Test we can handle IOError
    with patched_context(os, 'listdir', _raise_io_error):
        myClass = LinuxProcessClass()
        ....
        result = myClass.getLinuxProcess()
        expected = []
        self.assertEqual(result, expected)

    # Test we only get pids
    fakeProcessIds = ['1', '2', '3', 'do', 'not', 'parse']
    .....
Windows method
Testing your Windows method is a little trickier. What I would do is the following:
def prepareWindowsObjects(self):
    """Create and set up the objects needed to get the Windows processes."""
    ...
    Psapi = ctypes.WinDLL('Psapi.dll')
    EnumProcesses = Psapi.EnumProcesses
    EnumProcesses.restype = ctypes.wintypes.BOOL
    self.EnumProcesses = EnumProcesses
    ...

def getWindowsProcess(self):
    count = 50
    while True:
        ...  # Build arguments to EnumProcesses and call EnumProcesses
        if self.EnumProcesses(ctypes.byref(processIds), ...):
            ...
        else:
            return []
I separated the code into two methods to make it easier to read (I believe you are already doing this). Here's the tricky part: EnumProcesses uses pointers, and they are not easy to play with. Another thing is that I don't know how to work with pointers in Python, so I couldn't tell you an easy way to mock that out =P
What I can tell you is simply not to test it. Your logic there is very minimal. Besides increasing the size of count, everything else in that function is creating the space the EnumProcesses pointers will use. Maybe you could add a limit to the count size, but other than that, this method is short and sweet. It returns the Windows processes and nothing more. Just what I was asking for in my original comment :)
So leave that method alone. Don't test it. Make sure, though, that anything that uses getWindowsProcess and getLinuxProcess gets mocked out as per my original suggestion.
Hopefully this makes more sense :) If it doesn't let me know and maybe we can have a chat session or do a video call or something.
original answer
I'm not exactly sure how to do what you are asking, but whenever I need to test code that depends on some outside force (external libraries, popen, or in this case processes), I mock out those parts.
Now, I don't know how your code is structured, but maybe you can do something like this:
def getWindowsProcesses(self, ...):
    '''Call the Windows API function EnumProcesses and
    return the list of processes.
    '''
    # ... call EnumProcesses ...
    return listOfProcesses

def getLinuxProcesses(self, ...):
    '''Look in the /proc dir and return the list of processes.'''
    # ... look in /proc ...
    return listOfProcesses
These two methods do only one thing: get the list of processes. For Windows, it might be just a call to that API, and for Linux just reading the /proc dir. That's all, nothing more. The logic for handling the processes goes somewhere else. This makes these methods extremely easy to mock out, since their implementations are just API calls that return a list.
Your code can then easily call them:
def getProcesses(...):
    '''Get the processes running.'''
    isLinux = # ... logic for determining OS ...
    if isLinux:
        processes = getLinuxProcesses(...)
    else:
        processes = getWindowsProcesses(...)
    # ... do something with processes, write to log file, etc ...
In your test, you can then use a mocking library such as Fudge. You mock out these two methods to return what you expect them to return.
This way you'll be testing your logic since you can control what the result will be.
from fudge import patched_context
...

def test_getProcesses(self, ...):
    monitor = MonitorTool(..)
    # Patch the method that gets the processes. Whenever it gets called, return
    # our predetermined list.
    originalProcesses = [....pids...]
    with patched_context(monitor, "getLinuxProcesses", lambda x: originalProcesses):
        monitor.getProcesses()
        # ... assert logic is right ...

    # Let's "add" some new processes and test that our logic realizes new
    # processes were added.
    newProcesses = [...]
    updatedProcesses = originalProcesses + newProcesses
    with patched_context(monitor, "getLinuxProcesses", lambda x: updatedProcesses):
        monitor.getProcesses()
        # ... assert logic caught new processes ...

    # Let's "kill" our new processes and test that our logic can handle it
    with patched_context(monitor, "getLinuxProcesses", lambda x: originalProcesses):
        monitor.getProcesses()
        # ... assert logic caught processes were 'killed' ...
Keep in mind that if you test your code this way, you won't get 100% code coverage (since your mocked methods won't be run), but this is fine. You're testing your code and not a third party's, which is what matters.
Hopefully this might be able to help you. I know it doesn't answer your question, but maybe you can use this to figure out the best way to test your code.
Your original idea of using subprocess is a good one. Just create your own executable and name it something that identifies it as a testing thing. Maybe make it do something like sleep for a while.
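A minimal sketch of that idea using only the standard library: spawn a Python child that sleeps (so it never grabs the terminal) and embed a recognizable marker in its command line for the monitor to pick up. The marker string is made up for illustration:

import sys
import subprocess

marker = "process_monitor_test_dummy"  # hypothetical unique name
# The marker ends up in the child's command line (/proc/<pid>/cmdline on
# Linux), so the monitor under test can identify it unambiguously.
child = subprocess.Popen([sys.executable, "-c",
                          "import time; time.sleep(30)  # " + marker])
# ... assert that the monitor logged child.pid ...
child.terminate()
child.wait()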
Alternately, you could actually use the multiprocessing module. I've not used Python on Windows much, but you should be able to get process-identifying data out of the Process object you create:
p = multiprocessing.Process(target=time.sleep, args=(30,))
p.start()
pid = p.pid  # Process exposes the child's pid as an attribute, not a method

Cassandra low performance?

I have to choose Cassandra or MongoDB (or another NoSQL database; I accept suggestions) for a project with a lot of inserts (1M/day).
So I created a small test to measure the write performance. Here's the code to insert in Cassandra:
import time
import os
import random
import string
import pycassa

def get_random_string(string_length):
    return ''.join(random.choice(string.letters) for i in xrange(string_length))

def connect():
    """Connect to a test database"""
    connection = pycassa.connect('test_keyspace', ['localhost:9160'])
    db = pycassa.ColumnFamily(connection, 'foo')
    return db

def random_insert(db):
    """Insert a record into the database. The record has the following format:

    ID timestamp
    4 random strings
    3 random integers"""
    record = {}
    record['id'] = str(time.time())
    record['str1'] = get_random_string(64)
    record['str2'] = get_random_string(64)
    record['str3'] = get_random_string(64)
    record['str4'] = get_random_string(64)
    record['num1'] = str(random.randint(0, 100))
    record['num2'] = str(random.randint(0, 1000))
    record['num3'] = str(random.randint(0, 10000))
    db.insert(str(time.time()), record)

if __name__ == "__main__":
    db = connect()
    start_time = time.time()
    for i in range(1000000):
        random_insert(db)
    end_time = time.time()
    print "Insert time: %lf" % (end_time - start_time)
And the code to insert in Mongo is the same, changing only the connection function:
def connect():
    """Connect to a test database"""
    connection = pymongo.Connection('localhost', 27017)
    db = connection.test_insert
    return db.foo2
The results are ~1046 seconds to insert in Cassandra, and ~437 to finish in Mongo.
Cassandra is supposed to be much faster than Mongo at inserting data, so what am I doing wrong?
There is no equivalent to Mongo's unsafe mode in Cassandra. (We used to have one, but we took it out, because it's just a Bad Idea.)
The other main problem is that you're doing single-threaded inserts. Cassandra is designed for high concurrency; you need to use a multithreaded test. See the graph at the bottom of http://spyced.blogspot.com/2010/01/cassandra-05.html (actual numbers are over a year out of date but the principle is still true).
The Cassandra source distribution has such a test included in contrib/stress.
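If you don't want to pull in contrib/stress, a rough multithreaded variant of your own benchmark could look like the sketch below (not tuned; one pycassa connection per thread, 10 threads x 100k inserts to keep the 1M total):

import threading

def insert_many(n):
    # Each thread gets its own connection and column family handle.
    db = connect()
    for _ in xrange(n):
        random_insert(db)

threads = [threading.Thread(target=insert_many, args=(100000,))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()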
If I am not mistaken, Cassandra allows you to specify whether or not you are doing a MongoDB-equivalent "safe mode" insert (I don't recall the name of that feature in Cassandra).
In other words, Cassandra may be configured to write to disk and then return, as opposed to the default MongoDB configuration, which returns immediately after performing an insert without knowing whether the insert succeeded. It just means that your application never waits for a pass/fail from the server.
You can change that behavior by using safe mode in MongoDB, but this is known to have a large impact on performance. Enable safe mode and you may see different results.
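With the old pymongo API used in the question, that would mean acknowledging each write, roughly like this sketch:

# safe=True makes insert() wait for a server acknowledgement (getLastError)
# instead of returning immediately, which is closer to Cassandra's behaviour.
connection = pymongo.Connection('localhost', 27017)
db = connection.test_insert
db.foo2.insert(record, safe=True)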
You will harness the true power of Cassandra once you have multiple nodes running. Any node will be able to take a write request. Multithreading a client only floods more requests at the same instance, which is not going to help past a certain point.
Check the Cassandra log for the events that happen during your tests. Cassandra initiates a disk write once the Memtable is full (this is configurable; make it large enough and you will be dealing only with RAM, plus the commit-log disk writes). If a Memtable disk write happens during your test, it will slow things down. I do not know when MongoDB writes to disk.
Might I suggest taking a look at Membase here? It's used in exactly the same way as memcached and is fully distributed, so you can continuously scale your write input rate simply by adding more servers and/or more RAM.
For this case, you'll definitely want to go with a client-side Moxi to give you the best performance. Take a look at our wiki: wiki.membase.org for examples, and let me know if you need any further instruction... I'm happy to walk you through it, and I'm certain that Membase can handle this load easily.
Create batch mutator for doing multiple insert, update, and remove operations using as few roundtrips as possible.
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.batch
The batch mutator helped me reduce insert time by at least half.
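For illustration, a sketch of what that looks like with the ColumnFamily from the question's connect(); the queue size is an arbitrary choice:

batch = db.batch(queue_size=500)  # db is the pycassa ColumnFamily
for i in range(1000000):
    # build the same random record dict as in random_insert() ...
    record = {'id': str(time.time()), 'str1': get_random_string(64)}  # etc.
    batch.insert(str(time.time()), record)
batch.send()  # flush whatever is still queued client-side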
