Logging to MongoDB from Python

I want to log some information into MongoDB using Python. I found two libraries, mongodblog and log4mongo for Python. Any idea which one is better? Or is there another library that is better than these?

When you use MongoDB for logging, the main concern is lock contention under high write throughput. Although MongoDB's insert is fire-and-forget by default, calling insert() very frequently causes heavy write-lock contention. This can hurt application performance and prevent readers from aggregating or filtering the stored logs.
One solution is to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons run on every application node and take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems such as MongoDB or PostgreSQL. Because the writes are batched, this is far more efficient than writing directly from the apps. The link below describes how to send logs to Fluentd from a Python program.
Fluentd: Data Import from Python Applications
Here are some tutorials about MongoDB + Fluentd.
Fluentd + MongoDB: The Easiest Way to Log Your Data Effectively on 10gen blog
Fluentd: Store Apache Logs into MongoDB

No need to use a logging library. Use pymongo and do the following:
create a different database from your application database (it can be on the same machine) to avoid problems with high write throughput hogging the lock that the rest of your application may need.
if there is a ton of logging to be done, consider using a capped collection
if you need to be analyzing the log as it occurs, write another script that uses a tailable cursor: http://docs.mongodb.org/manual/tutorial/create-tailable-cursor/
The upshot is that all of your logging needs can be taken care of with a few lines of code; a short sketch follows below. Again, there is no need to complicate your code base by introducing extra dependencies when a bit of code will suffice.
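For reference, here is a minimal sketch of the capped-collection and tailable-cursor suggestions above, using pymongo against a recent server (the database and collection names are illustrative, not from the question):
from pymongo import MongoClient, CursorType

client = MongoClient()
db = client.my_app_logs  # a separate database just for logs

# A capped collection has a fixed size; the oldest entries are aged out automatically.
if "log" not in db.list_collection_names():
    db.create_collection("log", capped=True, size=100 * 1024 * 1024)

db.log.insert_one({"msg": "application started"})

# A tailable cursor keeps returning new documents as they are inserted;
# it only works on capped collections.
cursor = db.log.find(cursor_type=CursorType.TAILABLE_AWAIT)
for doc in cursor:
    print(doc["msg"])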

As mentioned by other users here, it is quite simple to log directly using pymongo:
from pymongo import MongoClient
from pymongo import ASCENDING
import datetime

client = MongoClient()
db = client.my_logs
log_collection = db.log
# ensure_index()/insert() are deprecated in recent pymongo; use create_index()/insert_one()
log_collection.create_index([("timestamp", ASCENDING)])

def log(msg):
    """Log `msg` to the MongoDB log collection."""
    entry = {
        'timestamp': datetime.datetime.utcnow(),
        'msg': msg,
    }
    log_collection.insert_one(entry)

log('Log messages like this')
You might want to experiment with using the timestamp as the _id; just remember that _id has to be unique.

You can use Mongolog or Log4Mongo. Both provide a log handler for Python's standard logging package: you instantiate the Mongo log handler, add it to your logger, and the rest is handled out of the box (a short sketch follows the links below). Both also support capped collections, which can be useful when you accumulate a huge number of log records (especially junk logs).
Log4Mongo : https://pypi.org/project/log4mongo/
Github page : https://github.com/log4mongo/log4mongo-python
MongoLog : https://pypi.org/project/mongolog/#description
Github page : https://github.com/puentesarrin/mongodb-log
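For example, wiring Log4Mongo's handler into the standard logging module looks roughly like this (a sketch based on the project's README; the host and database_name arguments are assumptions and may differ between versions):
import logging
from log4mongo.handlers import MongoHandler  # pip install log4mongo

logger = logging.getLogger("my_app")
logger.setLevel(logging.INFO)
# Each log record is written as a document into MongoDB; host/database are assumptions.
logger.addHandler(MongoHandler(host="localhost", database_name="logs"))

logger.info("Log messages like this")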

Here is the author of one of the libraries; I can't say much about the other one.
I agree that for really large volumes of logs you should not necessarily use MongoDB directly (the accepted answer does a fair job of explaining what should be used instead).
However, for medium-sized applications (medium in terms of traffic and volume of logs) where a complex setup might be undesirable, you can use BufferedMongoHandler. This logging class is designed to solve exactly that locking problem.
It does so by collecting messages and writing them periodically instead of immediately. Take a look at the code; it is pretty straightforward.
IMO, if you already use MongoDB and you feel comfortable with it, it's an OK solution.
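As a rough illustration of the buffered variant (constructor defaults only; the buffer size and flush interval are configurable, see the log4mongo README for the exact argument names):
import logging
from log4mongo.handlers import BufferedMongoHandler  # pip install log4mongo

logger = logging.getLogger("my_app")
# Buffers records in memory and flushes them to MongoDB periodically or when
# the buffer fills, instead of doing one insert per log call.
logger.addHandler(BufferedMongoHandler(host="localhost"))
logger.warning("buffered log message")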

Related

Two flask apps using one database

I don't think this is the right place for this question, but I don't know where else to ask it. I want to make a website and an API for that website using the same SQLAlchemy database. Would running them at the same time, independently, be safe, or would this cause corruption from two writes happening at the same time?
SQLAlchemy (SQLA) is a Python wrapper for SQL databases; it is not its own database. If you're running your website (perhaps Flask?) and managing your API from the same script, you can simply use the same reference to your SQLA instance. When you connect to a database with SQLA and save the result to a variable, what you are really saving is the connection, and you keep referencing that variable, as opposed to the less efficient approach of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two write happening at the same time.
To avoid such situations, when a process accesses a file, a check is performed (depending on the OS) to see whether there is a "lock" on that file; if so, the OS refuses to open it. A lock is created when a process accesses a file and no other process holds a lock on it, for example by using with open(filename):, and is released when the process no longer holds an open reference to the file (such as when execution leaves the with open(filename): block). This may be the real issue you encounter when using two simultaneous connections to an SQLite db.
However, if you are using something like MySQL, where you connect to an SQL server process and NOT a file, there is no direct access to a file and therefore no lock on the database, and you may run into that nasty race condition in the following made-up scenario:
Stack Overflow queries the reputation of an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, someone upvotes an answer made by that account that sets it one point under the account-ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by StackOverflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote processes before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends on the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly orders database reads and writes; however, there are still semantic issues that may arise, such as the example given above.
Two applications can use the same database, since the DB is a separate application that is accessed by each Flask app.
What you are asking can be done and is the methodology used by many large web applications, especially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests sent to them and perform them in the correct order while ensuring data reliability.
One question to ask, though, is whether it is useful to write two separate applications. For most Flask-only projects the best approach would be to separate the project using blueprints, with a “main” blueprint and an “api” blueprint.
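For illustration, a minimal blueprint layout might look like this (blueprint and route names are made up):
from flask import Flask, Blueprint, jsonify

main = Blueprint("main", __name__)
api = Blueprint("api", __name__)

@main.route("/")
def index():
    return "hello"

@api.route("/items")
def list_items():
    return jsonify([])  # placeholder; both blueprints can share the same DB setup

app = Flask(__name__)
app.register_blueprint(main)
app.register_blueprint(api, url_prefix="/api")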

control access to a resource with two application instances

I have a use case where I have to count the usage of my application, i.e. which feature is used most. I store all my statistics in a file; each time the application closes, it writes the statistics to that file.
The application is standalone and lives on a server, and multiple people can use it at the same time by running it (so multiple copies of the application will be running in different places).
The problem is that if multiple people try to update the stats at the same time, there is a chance of recording false statistics (a concurrency-control problem). How can I handle such situations in Python?
I store the following data in my stats file:
stats = {user1: {feature1: count, feature2: count, ...},
         user2: {feature1: count, feature2: count, ...}}
You can use portalocker; it's a great module that gives you a portable way to use filesystem locks (a short sketch follows below).
If you are confined to one platform, you could use the platform-specific file-locking primitives, usually exposed via the Python standard library.
You could also use a proper database, but that seems like huge overkill for the task at hand.
EDIT: seeing that you are about to use threading locks, don't! Threading locks (mutexes or semaphores) only prevent several threads in the same process from accessing a shared variable; they do not work across separate processes, let alone independent program instances. What you need is a file-locking mechanism, not thread locks.
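A minimal sketch of the file-locking approach with portalocker (assuming its Lock context manager takes a filename and a timeout; file names are made up):
import json
import os
import portalocker  # third-party: pip install portalocker

STATS_FILE = "stats.json"   # hypothetical paths
LOCK_FILE = "stats.lock"

def bump(user, feature):
    # Hold an exclusive lock on a sidecar lock file while the stats file is
    # read, updated, and rewritten; other application instances block in Lock().
    with portalocker.Lock(LOCK_FILE, timeout=10):
        stats = {}
        if os.path.exists(STATS_FILE):
            with open(STATS_FILE) as fh:
                stats = json.load(fh)
        stats.setdefault(user, {}).setdefault(feature, 0)
        stats[user][feature] += 1
        with open(STATS_FILE, "w") as fh:
            json.dump(stats, fh)

bump("user1", "feature1")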
Your problem seems like a flavor of the readers-writers problem, extended to database transactions. I would create a SQL-transactional class that is primarily responsible for the CRUD operations on your database. I would then have a flag that can be locked and released, much like a mutex, so you can ensure that no two concurrent processes are in their critical section (performing an UPDATE or INSERT on the DB) at the same time.
I also believe there is a readers-writers lock which specifically deals with controlling access to a shared resource, which might be of interest to you.
Please let me know if you have any questions!
In your code, where the shared resource can be used by multiple applications, use locks:
import threading

mylock = threading.Lock()

with mylock:
    ...  # access the shared resource here

multiple users doing form submission with python CGI

I have a simple CGI script in Python that collects values from form fields submitted through POST. After collecting them, I dump the values into a single text file.
Now, when multiple users submit at the same time, how do we handle that?
In C/C++ we use semaphores, mutexes, rwlocks, etc. Do we have anything similar in Python? Also, opening and closing the file for every user request doesn't seem like a good idea.
Our product's code base is in C/C++. I was asked to write a simple CGI script for some reporting purposes and have been googling Python and CGI.
There are lots of Python-based servers you could use. Here's one:
Twisted
Twisted is an event-driven networking engine written in Python and licensed under the open source MIT license.
from twisted.internet import protocol, reactor

class Echo(protocol.Protocol):
    def dataReceived(self, data):
        self.transport.write(data)

class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()

reactor.listenTCP(1234, EchoFactory())
reactor.run()
You probably want to drop the values into a database instead of a text file.
However, threading is available, so you can use a Lock to ensure only one user writes to the file at a time.
http://effbot.org/zone/thread-synchronization.htm
Locks are typically used to synchronize access to a shared resource. For each shared resource, create a Lock object. When you need to access the resource, call acquire to hold the lock (this will wait for the lock to be released, if necessary), and call release to release it.
If you're concerned about multiple users, and considering complex solutions like mutexes or semaphores, you should ask yourself why you're planning on using an unsuitable solution like CGI and text files in the first place. Any complexity you're saving by doing this will be more than outweighed by whatever you put in place to allow multiple users.
The right way to do this is to write a simple WSGI app - maybe using something like Flask - which writes to a database, rather than a text file.
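For illustration only, a minimal Flask view that writes a submitted form field into SQLite might look like this (route, field, table, and file names are made up):
import sqlite3
from flask import Flask, request

app = Flask(__name__)
DB_PATH = "reports.db"  # hypothetical database file

@app.route("/submit", methods=["POST"])
def submit():
    value = request.form.get("value", "")
    conn = sqlite3.connect(DB_PATH)
    with conn:  # commits on success, rolls back on error
        conn.execute("CREATE TABLE IF NOT EXISTS reports (value TEXT)")
        conn.execute("INSERT INTO reports (value) VALUES (?)", (value,))
    conn.close()
    return "ok"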
The simple (and slow) way is to acquire a lock on the file (in C you'd use flock), write to it, and close it. If you think this could become a bottleneck, then use a database or something similar.
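On Unix, the Python equivalent of flock is fcntl.flock; a rough sketch (the file name is made up):
import fcntl  # Unix-only; on Windows use msvcrt.locking or a library like portalocker

def append_report(line, path="report.txt"):
    with open(path, "a") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)   # block until we hold the lock
        try:
            f.write(line + "\n")
            f.flush()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)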

How do I collect up logs for an App Engine request using Python logging?

Using Google App Engine, Python 2.7, threadsafe:true, webapp2.
I would like to include all logging.XXX() messages in my API responses, so I need an efficient way to collect up all the log messages that occur during the scope of a request. I also want to operate in threadsafe:true, so I need to be careful to get only the right log messages.
Currently, my strategy is to add a logging.Handler at the start of my webapp2 dispatch method, and then remove it at the end. To collect logs only for my thread, I instantiate the logging.Handler with the name of the current thread; the handler will simply throw out log records that are from a different thread. I am using thread name and not thread ID because I was getting some unexpected results on dev_appserver when using the ID.
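A sketch of that per-request handler as I understand it (illustrative only, not App Engine-specific code; the class and variable names are mine):
import logging
import threading

class RequestLogHandler(logging.Handler):
    """Collects records emitted by one named thread."""
    def __init__(self, thread_name):
        super(RequestLogHandler, self).__init__()
        self.thread_name = thread_name
        self.records = []

    def emit(self, record):
        # keep only records produced by this request's thread
        if record.threadName == self.thread_name:
            self.records.append(self.format(record))

# inside dispatch():
handler = RequestLogHandler(threading.current_thread().name)
logging.getLogger().addHandler(handler)
try:
    pass  # ... handle the request, logging as usual ...
finally:
    logging.getLogger().removeHandler(handler)
    # handler.records now holds this request's messages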
Questions:
Is it efficient to constantly be adding/removing logging.Handler objects in this fashion? I.e., every request will add, then remove, a Handler. Is this "cheap"?
Is this the best way to get only the logging messages for my request? My big assumption is that each request gets its own thread, and that thread name will actually select the right items.
Am I fundamentally misunderstanding Python logging? Perhaps I should only have a single additional Handler added once at the "module-level" statically, and my dispatch should do something lighter.
Any advice is appreciated. I don't have a good understanding of what Python (and specifically App Engine Python) does under the hood with respect to logging. Obviously, this is eminently possible because the App Engine Log Viewer does exactly the same thing: it displays all the log messages for that request. In fact, if I could piggyback on that somehow, that would be even better. It absolutely needs to be super-cheap though - i.e., an RPC call is not going to cut it.
I can add some code if that will help.
I found lots of goodness here:
from google.appengine.api import logservice
entries = logservice.logs_buffer().parse_logs()

Python sqlite3 and concurrency

I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?
Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via the optional keyword argument check_same_thread:
sqlite3.connect(":memory:", check_same_thread=False)
You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The first thread, which fetches data from the web, enqueues this data into the shared queue. Another thread, which owns the database connection, dequeues data from the queue and passes it to the database.
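A minimal sketch of that producer-consumer pattern with the standard library (the table and column names are made up):
import queue
import sqlite3
import threading

q = queue.Queue()

def db_writer():
    # the only thread that ever touches the connection
    conn = sqlite3.connect("mydatabase.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (payload TEXT)")
    while True:
        item = q.get()
        if item is None:          # sentinel: shut down
            break
        conn.execute("INSERT INTO results (payload) VALUES (?)", (item,))
        conn.commit()
    conn.close()

writer = threading.Thread(target=db_writer)
writer.start()

# fetcher threads simply enqueue their results:
q.put("some fetched data")

q.put(None)
writer.join()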
The following was found on mail.python.org pipermail (thread 1239789):
I have found the solution. I don't know why the Python documentation doesn't have a single word about this option. So we have to add a new keyword argument to the connection function, and we will be able to create cursors from it in different threads. So use:
sqlite3.connect(":memory:", check_same_thread=False)
This works out perfectly for me. Of course, from now on I need to take care
of safe multithreaded access to the db. Anyway, thanks all for trying to help.
Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as Python's threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features; to quote a few of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes the chance of deadlocks.
You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
twisted will take care of the scheduling, callbacks, etc... for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.
Or if you are lazy, like me, you can use SQLAlchemy. It will handle the threading for you, (using thread local, and some connection pooling) and the way it does it is even configurable.
For added bonus, if/when you realise/decide that using Sqlite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, or Postgres, or anything else. You can just switch over.
You need to call session.close() after every transaction to the database, so that each thread uses its own cursor rather than sharing one cursor across threads, which is what causes this error.
Use threading.Lock()
I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.
I tried 3 approaches
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes either using the ThreadPoolExecutor or using the ProcessPoolExecutor as you will run into database locked errors and you will have to retry inserting the chunk again
You can find the code and complete solution for the benchmarks in my SO answer HERE (a rough sketch of the parallel-read approach follows below). Hope that helps!
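For reference, parallel reads with a ProcessPoolExecutor look roughly like this (this is not the benchmark code from the linked answer; file and table names are placeholders):
import sqlite3
from concurrent.futures import ProcessPoolExecutor

def read_chunk(offset):
    # every worker process opens its own connection
    conn = sqlite3.connect("mydatabase.db")
    rows = conn.execute(
        "SELECT * FROM results LIMIT 1000 OFFSET ?", (offset,)
    ).fetchall()
    conn.close()
    return len(rows)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(read_chunk, range(0, 10000, 1000))))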
Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)
I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net
which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.
I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable as fetch and store operations could take >1sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.
You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).
Please consider checking the value of THREADSAFE in the pragma_compile_options of your SQLite installation, for instance with
SELECT * FROM pragma_compile_options;
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you have to do to avoid the threading exception is to create the Python connection with check_same_thread set to False. In your case, that means
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
That's explained in some detail in Python, SQLite, and thread safety
The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. Other threads that are waiting to write will time out after a while (the default is 5 seconds; see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details).
An example of a correct and concurrent insertion would be this:
import threading, sqlite3

class InsertionThread(threading.Thread):

    def __init__(self, number):
        super(InsertionThread, self).__init__()
        self.number = number

    def run(self):
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()

        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            conn.commit()

# create as many of these as you wish,
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!
