Suppose that I have a huge SQLite file (say, 500 MB). Can 10 different Python instances access this file at the same time and update different records of it? Note, the emphasis here is on different records.
For example, suppose that the SQLite file has say 1M rows:
instance 1 will deal with (and update) rows 0 - 100000
instance 2 will deal with (and update) rows 100001 - 200000
.........................
instance 10 will deal with (and update) rows 900001 - 1000000
Meaning, each python instance will only be updating a unique subset of the file. Will this work, or will I have serious integrity issues?
Updated, thanks to André Caron.
You can do that, but only read operations support concurrency in SQLite, since the entire database is locked on any write operation. The SQLite engine will return SQLITE_BUSY status in this situation (if the wait exceeds the default access timeout). Also consider that this heavily depends on how well file locking is implemented for the given OS and file system. In general I wouldn't recommend using the proposed solution, especially considering that the DB file is quite large, but you can try.
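For illustration, a minimal sketch of handling that from Python's built-in sqlite3 module (file and table names are hypothetical): raising the connection timeout makes a blocked writer wait for the lock instead of failing immediately, and SQLITE_BUSY surfaces as an OperationalError.
import sqlite3

# Wait up to 30 seconds for the write lock instead of failing immediately.
conn = sqlite3.connect("huge.db", timeout=30.0)

try:
    with conn:  # wraps the statements in a transaction; commits or rolls back
        conn.execute("UPDATE records SET value = ? WHERE id = ?", ("x", 42))
except sqlite3.OperationalError as exc:  # "database is locked" once the timeout expires
    print("writer blocked:", exc)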
It would be better to use a server-based database (MySQL, PostgreSQL, etc.) to implement the desired app behaviour.
Somewhat. Only one instance can write at any single time, which means concurrent writes will block (refer to the SQLite FAQ). You won't get integrity issues, but I'm not sure you'll really benefit from concurrency, although that depends on how much you write versus how much you read.
I'm working on a Python/Django webapp...and I'm really just digging into Python for the first time (typically I'm a .NET/core stack dev)...currently running PostgreSQL on the backend. I have about 9-10 simple (2-dimensional) lookup tables that will be hit very, very often in real time, and I would like to cache them in memory.
Ideally, I'd like to do this with Postgres itself, but it may be that another data engine and/or some other library will be suited to help with this (I'm not super familiar with Python libraries).
Goals would be:
Lookups are handled in memory (data footprint will never be "large").
Ideally, results could be cached after the first pull (by complete parameter signature) to optimize time, although this is somewhat optional as I'm assuming an in-memory lookup would be pretty quick anyway; and
Also optional, but ideally, even though the lookup tables are stored separately in the DB for importing/human-readability/editing purposes, I would think generating an x-dimensional array for the lookup when loaded into memory would be optimal. Though there are about 9-10 lookup tables total, there are only maybe 10-15 values per table (some smaller) and probably only about 15 parameters in total for the complete lookup against all tables. Basically it's 9-10 tables of modifiers for an equation, so given certain values we look up the x/y values in each table, get the value, and add them together.
So I guess I'm looking for a library and/or suitable backend that handles the in-memory loading and caching (again, the total size of this footprint in RAM will never be a factor)... and possibly can automatically resolve the x lookup tables into a single in-memory x-dimensional table for efficiency (rather than making 9-10 lookups separately)... and caching these results for repeated use when all parameters match a previous query (unless the lookup performs so quickly this is irrelevant).
Lookup tables are not huge... I would say, if I were to write code to break down each x/y value/range and create one giant x-dimensional lookup table by hand, it would probably end up with maybe 15 fields and 150-ish rows... so we aren't talking very much data... but it will be hit very, very often and I don't want to perform these lookups every time against the actual DB.
Recommendations for an engine/library suited best for this (with a preference for still being able to use postgresql for the persistent storage) are greatly appreciated.
You don't have to do anything special to achieve that: if you use the tables frequently, PostgreSQL will make them stay in cache automatically.
If you need to have the tables in cache right from the start, use pg_prewarm. It allows you to explicitly load certain tables into cache and can automatically restore the state of the cache as it was before the last shutdown.
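A minimal sketch of explicitly warming the cache from Python, assuming psycopg2 and hypothetical connection/table names (pg_prewarm itself is the real extension):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm")
for table in ("lookup_a", "lookup_b"):  # your 9-10 lookup tables (hypothetical names)
    cur.execute("SELECT pg_prewarm(%s)", (table,))
conn.commit()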
Once the tables are cached, they will only cause I/O when you write to them.
The efficient in-memory data structures you envision sound like a premature micro-optimization to me. I'd bet that these small lookup tables won't cause a performance problem (if you have indexes on all foreign keys).
I've been having a hard time using a large dictionary (~86GB, 1.75 billion keys) to process a big dataset (2TB) using multiprocessing in Python.
Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded, worker processes (ideally >32) are created that must look up values in the dictionary but not modify its contents, in order to process the ~2TB dataset. The data set needs to be processed in parallel, otherwise the task would take over a month.
Here are the nine approaches (all failing) that I have tried:
Store the dictionary as a global variable in the Python program and then fork the ~32 worker processes. Theoretically this method might work since the dictionary is not being modified and therefore the COW mechanism of fork on Linux would mean that the data structure would be shared and not copied among processes. However, when I attempt this, my program crashes on os.fork() inside of multiprocessing.Pool.map from OSError: [Errno 12] Cannot allocate memory. I'm convinced that this is because the kernel is configured to never overcommit memory (/proc/sys/vm/overcommit_memory is set to 2, and I can't configure this setting on the machine since I don't have root access).
Load the dictionary into a shared-memory dictionary with multiprocessing.Manager.dict. With this approach I was able to fork the 32 worker process without crashing but the subsequent data processing is orders of magnitude slower than another version of the task that required no dictionary (only difference is no dictionary lookup). I theorize that this is because of the inter-process communication between the manager process containing the dictionary and each worker process, that is required for every single dictionary lookup. Although the dictionary is not being modified, it is being accessed many many times, often simultaneously by many processes.
Copy the dictionary into a C++ std::map and rely on Linux's COW mechanism to prevent it from being copied (like approach #1 except with the dictionary in C++). With this approach, it took a long time to load the dictionary into std::map and subsequently crashed from ENOMEM on os.fork() just as before.
Copy the dictionary into pyshmht. It takes far too long to copy the dictionary into pyshmht.
Try using SNAP's HashTable. The underlying implementation in C++ allows for it to be made and used in shared memory. Unfortunately the Python API does not offer this functionality.
Use PyPy. Crash still happened as in #1.
Implement my own shared-memory hash table in Python on top of multiprocessing.Array. This approach still resulted in the out of memory error that occurred in #1.
Dump the dictionary into dbm. After trying to dump the dictionary into a dbm database for four days and seeing an ETA of "33 days", I gave up on this approach.
Dump the dictionary into Redis. When I try to dump the dictionaries (the 86GB dict is loaded from 1024 smaller dicts) into Redis using redis.mset I get a connection reset by peer error. When I try to dump the key-value pairs using a loop, it takes an extremely long time.
How can I process this dataset in parallel efficiently without requiring inter-process communication in order to look up values in this dictionary? I would welcome any suggestions for solving this problem!
I'm using Python 3.6.3 from Anaconda on Ubuntu on a machine with 1TB RAM.
Edit: What finally worked:
I was able to get this to work using Redis. To get around the issue in #9, I had to chunk the large key-value insertion and lookup queries into "bite-sized" chunks so that it was still processing in batches, but didn't time out from too large a query. Doing this allowed the insertion of the 86GB dictionary to be performed in 45 minutes (with 128 threads and some load balancing), and the subsequent processing was not hampered in performance by the Redis lookup queries (finished in 2 days).
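A minimal sketch of that batching idea, assuming redis-py and one of the smaller source dicts as big_dict (names hypothetical):
import itertools
import redis

r = redis.Redis()  # hypothetical local instance

def batches(mapping, size=10000):
    it = iter(mapping.items())
    while True:
        batch = dict(itertools.islice(it, size))
        if not batch:
            return
        yield batch

for batch in batches(big_dict):
    r.mset(batch)  # each MSET stays small enough to avoid resets and time-outs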
Thank you all for your help and suggestions.
You should probably use a system that's meant for sharing large amounts of data with many different processes -- like a Database.
Take your giant dataset and create a schema for it and dump it into a database. You could even put it on a separate machine.
Then launch as many processes as you want, across as many hosts as you want, to process the data in parallel. Pretty much any modern database will be more than capable of handling the load.
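A rough sketch of that layout, assuming PostgreSQL via psycopg2, a hypothetical mapping(key, value) table holding the dictionary, and hypothetical helpers for parsing and chunking the 2 TB input:
import multiprocessing as mp
import psycopg2

def worker(lines):
    # Each worker opens its own connection; the database handles concurrent reads.
    conn = psycopg2.connect("host=dbhost dbname=lookup")  # hypothetical DSN
    cur = conn.cursor()
    out = []
    for line in lines:
        key = extract_key(line)  # hypothetical parser
        cur.execute("SELECT value FROM mapping WHERE key = %s", (key,))
        row = cur.fetchone()
        out.append(process(line, row[0] if row else None))  # hypothetical processing step
    conn.close()
    return out

if __name__ == "__main__":
    with mp.Pool(32) as pool:
        results = pool.map(worker, chunked_dataset)  # hypothetical chunks of the 2 TB input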
Instead of using a dictionary, use a data structure that compresses data, but still has fast lookups.
e.g.:
keyvi: https://github.com/cliqz-oss/keyvi
keyvi is an FSA-based key-value data structure optimized for space & lookup speed. Multiple processes reading from keyvi will re-use the memory, because a keyvi structure is memory-mapped and it uses shared memory. Since your worker processes don't need to modify the data structure, I think this would be your best bet.
marisa trie: https://github.com/pytries/marisa-trie is a static trie structure for Python, based on the marisa-trie C++ library. Like keyvi, marisa-trie also uses memory mapping. Multiple processes using the same trie will use the same memory.
EDIT:
To use keyvi for this task, you can first install it with pip install pykeyvi. Then use it like this:
from pykeyvi import StringDictionaryCompiler, Dictionary
# Create the dictionary
compiler = StringDictionaryCompiler()
compiler.Add('foo', 'bar')
compiler.Add('key', 'value')
compiler.Compile()
compiler.WriteToFile('test.keyvi')
# Use the dictionary
dct = Dictionary('test.keyvi')
dct['foo'].GetValue()
> 'bar'
dct['key'].GetValue()
> 'value'
marisa trie is just a trie, so it wouldn't work as a mapping out of the box, but you can, for example, use a delimiter character to separate keys from values.
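A small sketch of that delimiter trick with marisa-trie, assuming a separator character that never appears in your keys or values and a source dict big_dict:
import marisa_trie

SEP = u'\x00'  # assumed never to occur in keys or values

# Build once: store "key<SEP>value" strings in the trie and save it.
trie = marisa_trie.Trie([key + SEP + value for key, value in big_dict.items()])
trie.save('mapping.marisa')

# In each worker: memory-mapped load, so the pages are shared between processes.
lookup = marisa_trie.Trie()
lookup.mmap('mapping.marisa')

def get(key):
    hits = lookup.keys(key + SEP)  # prefix search on "key<SEP>"
    return hits[0][len(key) + 1:] if hits else None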
If you can successfully load that data into a single process as in point 1, you can most likely work around the problem of fork doing copies by using gc.freeze, introduced in https://bugs.python.org/issue31558.
You have to use Python 3.7+ and call that function before you fork (or before you do the map over the process pool).
Since this requires a virtual copy of the whole memory for the CoW to work, you need to make sure your overcommit settings allow you to do that.
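A minimal sketch of that ordering, assuming Python 3.7+, a hypothetical load_dict() for the 86 GB mapping, and a hypothetical iterable of key chunks:
import gc
import multiprocessing as mp

big_dict = load_dict()  # hypothetical loader for the 86 GB mapping
gc.collect()            # settle the heap first
gc.freeze()             # move surviving objects to the permanent generation so the
                        # garbage collector in forked children doesn't dirty CoW pages

def worker(keys):
    return [big_dict.get(k) for k in keys]  # read-only lookups in the child

if __name__ == "__main__":
    with mp.Pool(32) as pool:
        results = pool.map(worker, key_chunks)  # hypothetical iterable of key chunks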
As most people here already mentioned:
Don't use that big a dictionary; dump it into a database instead!!!
After dumping your data into a database, using indexes will help reduce data retrieval times.
A good indexing explanation for PostgreSQL databases is here.
You can optimize your database even further (I give a PostgreSQL example because that is what I mostly use, but these concepts apply to almost every database).
Assuming you did the above (or if you want to use the dictionary either way...), you can implement a parallel and asynchronous processing routine using Python's asyncio (needs Python version >= 3.4).
The basic idea is to create a mapping method that assigns (maps) an asynchronous task to each item of an iterable and registers each task with asyncio's event loop.
Finally, we collect all those promises with asyncio.gather and wait to receive all the results.
A skeleton code example of this idea:
import asyncio

async def my_processing(value):
    # do stuff with the value...
    processed_value = value  # placeholder for the real work
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_loop = asyncio.get_event_loop()
    my_future = asyncio.gather(
        *(my_coroutine(val) for val in my_iterable)
    )
    return my_loop.run_until_complete(my_future)

my_async_map(my_processing, my_ginormous_iterable)
You can use gevent instead of asyncio, but keep in mind that asyncio is part of the standard library.
Gevent implementation:
import gevent
from gevent.pool import Group

def my_processing(value):
    # do stuff with the value...
    processed_value = value  # placeholder for the real work
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_group = Group()
    return my_group.map(my_coroutine, my_iterable)

my_async_map(my_processing, my_ginormous_iterable)
The already mentioned keyvi (http://keyvi.org) sounds like the best option to me, because "python shared memory dictionary" describes exactly what it is. I am the author of keyvi, so call me biased, but give me the chance to explain:
Shared memory makes it scalable, especially for Python, where the GIL forces you to use multiprocessing rather than threading. That's why a heap-based in-process solution wouldn't scale. Also, shared memory can be bigger than main memory; parts can be swapped in and out.
External, network-based solutions require an extra network hop, which you can avoid by using keyvi; this makes a big performance difference even on the local machine. The question is also whether the external process is single-threaded and therefore introduces a bottleneck again.
I wonder about your dictionary size: 86 GB. There is a good chance that keyvi compresses that nicely, but it's hard to say without knowing the data.
As for processing: Note that keyvi works nicely in pySpark/Hadoop.
Your use case, BTW, is exactly what keyvi is used for in production, even at a higher scale.
The Redis solution sounds good, at least better than some database solutions. To saturate the cores you should use several instances and divide the key space using consistent hashing. Still, I am sure keyvi would scale way better. You should try it if you have to repeat the task and/or need to process more data.
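A tiny sketch of splitting the key space across several Redis instances (simple stable hashing as a stand-in for full consistent hashing; the ports and instance count are assumptions):
import hashlib
import redis

# hypothetical pool of Redis instances, one per core you want to saturate
nodes = [redis.Redis(port=6379 + i) for i in range(8)]

def node_for(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

value = node_for("some key").get("some key")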
Last but not least, you'll find nice material on the website explaining the above in more detail.
Maybe you should try doing it in a database, and maybe try using Dask to solve your problem; let Dask take care of the low-level multiprocessing so you can focus on the main question you want to solve with that large data.
And here is the link you may want to look at: Dask
Well, I do believe that Redis or a database would be the easiest and quickest fix.
But from what I understood, why not reduce the problem, building on your second solution? That is, first load a portion of the billion keys into memory (say 50 million). Then, using multiprocessing, create a pool to work on the 2 TB file. If a line's key exists in the loaded table, push the data to a list of processed lines; if it doesn't exist, push it to a list of unmatched lines. Once you complete reading the data set, pickle your processed list and flush the keys you have stored from memory. Then load the next chunk of keys and repeat the process, this time reading from your unmatched list instead of the original file. Once it is finished completely, read back all your pickled objects.
This should handle the speed issue that you were facing. Of course, I have very little knowledge of your data set and do not know if this is even feasible. You might be left with lines that never found a matching dictionary key, but at that point your data size would be significantly reduced.
Don't know if that is of any help.
Another solution could be to use some existing database driver which can allocate / retire pages as necessary and deal with the index lookup quickly.
dbm has a nice dictionary interface available, and with automatic caching of pages it may be fast enough for your needs. If nothing is modified, you should be able to effectively cache the whole file at the VFS level.
Just remember to disable locking, open in non-synced mode, and open read-only ('r') so nothing impacts caching or concurrent access.
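For example, with the GNU dbm backend (a sketch; the flag letters are gdbm-specific and the path is hypothetical):
import dbm.gnu

# 'r' = read-only; appending 'u' skips file locking, so many reader
# processes can share the OS page cache for the same file.
db = dbm.gnu.open('/path/to/lookup.gdbm', 'ru')

value = db[b'some key']  # gdbm stores and returns bytes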
Since you're only looking to create a read-only dictionary it is possible that you can get better speed than some off the shelf databases by rolling your own simple version. Perhaps you could try something like:
import os.path
import functools

db_dir = '/path/to/my/dbdir'

def write(key, value):
    path = os.path.join(db_dir, key)
    with open(path, 'w') as f:
        f.write(value)

@functools.lru_cache(maxsize=None)
def read(key):
    path = os.path.join(db_dir, key)
    with open(path) as f:
        return f.read()
This will create a folder full of text files. The name of each file is the dictionary key and the contents are the value. Timing this myself I get about 300us per write (using a local SSD). Using those numbers theoretically the time taken to write your 1.75 billion keys would be about a week but this is easily parallelisable so you might be able to get it done a lot faster.
For reading I get about 150us per read with warm cache and 5ms cold cache (I mean the OS file cache here). If your access pattern is repetitive you could memoize your read function in process with lru_cache as above.
You may find that storing this many files in one directory is not possible with your filesystem or that it is inefficient for the OS. In that case you can do like the .git/objects folder: Store the key abcd in a file called ab/cd (i.e. in a file cd in folder ab).
The above would take something like 15TB on disk based on a 4KB block size. You could make it more efficient on disk and for OS caching by trying to group together keys by the first n letters so that each file is closer to the 4KB block size. The way this would work is that you have a file called abc which stores key value pairs for all keys that begin with abc. You could create this more efficiently if you first output each of your smaller dictionaries into a sorted key/value file and then mergesort as you write them into the database so that you write each file one at a time (rather than repeatedly opening and appending).
While the majority suggestion of "use a database" here is wise and proven, it sounds like you may want to avoid using a database for some reason (and you are finding the load into the db to be prohibitive), so essentially it seems you are IO-bound, and/or processor-bound. You mention that you are loading the 86GB index from 1024 smaller indexes. If your key is reasonably regular, and evenly-distributed, is it possible for you to go back to your 1024 smaller indexes and partition your dictionary? In other words, if, for example, your keys are all 20 characters long, and comprised of the letters a-z, create 26 smaller dictionaries, one for all keys beginning with 'a', one for keys beginning 'b' and so on. You could extend this concept to a large number of smaller dictionaries dedicated to the first 2 characters or more. So, for example, you could load one dictionary for the keys beginning 'aa', one for keys beginning 'ab' and so on, so you would have 676 individual dictionaries. The same logic would apply for a partition over the first 3 characters, using 17,576 smaller dictionaries. Essentially I guess what I'm saying here is "don't load your 86GB dictionary in the first place". Instead use a strategy that naturally distributes your data and/or load.
I have data that is best represented by a tree. Serializing the structure makes the most sense, because I don't want to sort it every time, and it would allow me to make persistent modifications to the data.
On the other hand, this tree is going to be accessed from different processes on different machines, so I'm worried about the details of reading and writing. Basic searches didn't yield very much on the topic.
If two users simultaneously attempt to revive the tree and read from it, can they both be served at once, or does one arbitrarily happen first?
If two users have the tree open (assuming they can) and one makes an edit, does the other see the change implemented? (I assume they don't because they each received what amounts to a copy of the original data.)
If two users alter the object and close it at the same time, again, does one come first, or is an attempt made to make both changes simultaneously?
I was thinking of making a queue of changes to be applied to the tree, and then having the tree execute them in the order of submission. I thought I would ask what my problems are before trying to solve any of them.
Without trying it out I'm fairly sure the answer is:
(1) They can both be served at once; however, if one user is reading while the other is writing, the reading user may get strange results.
(2) Probably not. Once the tree has been read from the file into memory, the other user will not see edits made by the first user. If the tree hasn't been read from the file yet, the change will still be picked up.
(3) Both changes will be made simultaneously and the file will likely be corrupted.
Also, you mentioned shelve. From the shelve documentation:
The shelve module does not support concurrent read/write access to
shelved objects. (Multiple simultaneous read accesses are safe.) When
a program has a shelf open for writing, no other program should have
it open for reading or writing. Unix file locking can be used to solve
this, but this differs across Unix versions and requires knowledge
about the database implementation used.
Personally, at this point, you may want to look into using a simple key-value store like Redis with some kind of optimistic locking.
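A sketch of what optimistic locking on the serialized tree could look like with redis-py (WATCH/MULTI/EXEC; the key name and the mutate callback are hypothetical):
import pickle
import redis

r = redis.Redis()

def update_tree(key, mutate):
    # Retry loop: WATCH the key, and if another writer changes it before
    # EXEC, the transaction aborts and we re-read and try again.
    while True:
        with r.pipeline() as pipe:
            try:
                pipe.watch(key)
                tree = pickle.loads(pipe.get(key))
                mutate(tree)  # hypothetical in-place edit of the tree
                pipe.multi()
                pipe.set(key, pickle.dumps(tree))
                pipe.execute()
                return
            except redis.WatchError:
                continue  # lost the race; retry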
You might try klepto, which provides a dictionary interface to a SQL database (using sqlalchemy under the covers). If you choose to persist your data to a MySQL, PostgreSQL, or other available database (aside from SQLite), then you can have two or more people access the data simultaneously, or have two threads/processes access the database tables, and have the database manage the concurrent reads and writes. Using klepto with a database backend will perform under concurrent access as well as if you were accessing the database directly. If you don't want to use a database backend, klepto can write to disk as well; however, there is some potential for conflict when writing to disk, even though klepto uses a "copy-on-write, then replace" strategy that minimizes concurrency conflicts when working with files on disk. When working with a file (or directory) backend, your issues 1-2-3 are still handled due to the strategy klepto employs for saving writes to disk. Additionally, klepto can use an in-memory caching layer that enables fast access, where loads/dumps from the on-disk (or database) backend are done either on demand or when the in-memory cache reaches a user-determined size.
To be specific: (1) both are served at the same time. (2) if one user makes an edit, the other user sees the change -- however that change may be 'delayed' if the second user is using an in-memory caching layer. (3) multiple simultaneous writes are not a problem, due to klepto letting NFS or the sql database handle the "copy-on-write, then replace" changes.
The dictionary interface for klepto.archives is also available in a decorator form that provides LRU caching (and LFU and others), so if you have a function that is generating/accessing the data, hooking up the archive is really easy: you get memoization with an on-disk or database backend.
With klepto, you can pick from several different serialization methods to encode your data. You can have klepto cast data to a string, or use a hashing algorithm (like md5), or use a pickler (like json, pickle, or dill).
You can get klepto here: https://github.com/uqfoundation/klepto
I'm going to have two independent programs (using SQLAlchemy / ORM / Declarative) that will inevitably try to access the same database file/table (SQLite) at the same time.
They could both want to read or write to that table.
Will there be a conflict when this happens?
If the answer is yes, how could this be handled?
SQLite is resistant to the kind of issues you describe. http://www.sqlite.org/howtocorrupt.html gives you details on what could cause problems, and they're generally isolated from anything the code might accidentally do.
If you're concerned due to the nature of your application data access, use BEGIN TRANSACTION and COMMIT/ROLLBACK as appropriate. If your transactions are single query access (that is, you're not reading a value in one query and then changing it in another relative to what you already read), this should not be necessary.
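A minimal sketch of the explicit-transaction pattern with SQLAlchemy Core (1.4+ style; the stock table and column names are hypothetical):
from sqlalchemy import create_engine, text

# The timeout makes a blocked writer wait for SQLite's lock instead of erroring out.
engine = create_engine("sqlite:///shared.db", connect_args={"timeout": 30})

# engine.begin() wraps the block in BEGIN ... COMMIT (ROLLBACK on exception),
# so the read-then-update below is atomic with respect to the other program.
with engine.begin() as conn:
    qty = conn.execute(
        text("SELECT qty FROM stock WHERE id = :id"), {"id": 1}
    ).scalar_one()
    conn.execute(
        text("UPDATE stock SET qty = :qty WHERE id = :id"),
        {"qty": qty - 1, "id": 1},
    )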
I am in the middle of a project that involves grabbing numerous pieces of information out of 70 GB worth of XML documents and loading it into a relational database (in this case PostgreSQL). I am currently using Python scripts and psycopg2 to do the inserts and whatnot. I have found that as the number of rows in some of the tables increases (the largest is at around 5 million rows), the speed of the script (inserts) has slowed to a crawl. What once took a couple of minutes now takes about an hour.
What can I do to speed this up? Was I wrong in using python and psycopg2 for this task? Is there anything I can do to the database that may speed up this process. I get the feeling I am going about this in entirely the wrong way.
Considering the process was fairly efficient before and only slowed down as the dataset grew, my guess is it's the indexes. You could try dropping indexes on the table before the import and recreating them after it's done. That should speed things up.
What are the settings for wal_buffers and checkpoint_segments? For large transactions, you have to tweak some settings. Check the manual.
Consider the book PostgreSQL 9.0 High Performance as well, there is much more to tweak than just the database configuration to get high performance.
I'd try to use COPY instead of inserts; this is what backup tools use for fast loading (see the sketch after this list).
Check that all foreign keys from this table have a corresponding index on the target table. Or better, drop them temporarily before copying and recreate them afterwards.
Increase checkpoint_segments from the default 3 (which means 3 * 16 MB = 48 MB) to a much higher number; try, for example, 32 (512 MB). Make sure you have enough space for this much additional data.
If you can afford to recreate or restore your database cluster from scratch in case of a system crash or power failure, then you can start Postgres with the "-F" option, which will enable the OS write cache.
Take a look at http://pgbulkload.projects.postgresql.org/
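A sketch of the COPY approach with psycopg2 (the table, columns, and XML-parsing generator are hypothetical; in practice you would flush the buffer in chunks rather than hold 70 GB in memory):
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
cur = conn.cursor()

buf = io.StringIO()
for col_a, col_b in records_from_xml():  # hypothetical generator over the parsed XML
    buf.write(f"{col_a}\t{col_b}\n")
buf.seek(0)

# One COPY replaces millions of individual INSERT statements.
cur.copy_from(buf, "target_table", columns=("col_a", "col_b"))
conn.commit()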
There is a list of hints on this topic in the Populating a Database section of the documentation. You might speed up general performance using the hints in Tuning Your PostgreSQL Server as well.
The overhead of checking foreign keys might be growing as the table size increases, which is made worse because you're loading a single record at a time. If you're loading 70GB worth of data, it will be far faster to drop foreign keys during the load, then rebuild them when it's imported. This is particularly true if you're using single INSERT statements. Switching to COPY instead is not a guaranteed improvement either, due to how the pending trigger queue is managed--the issues there are discussed in that first documentation link.
From the psql prompt, you can find the name of the constraint enforcing your foreign key and then drop it using that name like this:
\d tablename
ALTER TABLE tablename DROP CONSTRAINT constraint_name;
When you're done with loading, you can put it back using something like:
ALTER TABLE tablename ADD CONSTRAINT constraint_name FOREIGN KEY (other_table) REFERENCES other_table (join_column);
One useful trick to find out the exact syntax to use for the restore is to do pg_dump --schema-only on your database. The dump from that will show you how to recreate the structure you have right now.
I'd look at the rollback logs. They've got to be getting pretty big if you're doing this in one transaction.
If that's the case, perhaps you can try committing a smaller transaction batch size. Chunk it into smaller blocks of records (1K, 10K, 100K, etc.) and see if that helps.
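A sketch of committing in chunks, assuming an existing psycopg2 connection conn, a hypothetical target_table, and an iterable records of row tuples:
BATCH_SIZE = 10000  # try 1K / 10K / 100K and compare

cur = conn.cursor()
for i, record in enumerate(records, start=1):
    cur.execute("INSERT INTO target_table (col_a, col_b) VALUES (%s, %s)", record)
    if i % BATCH_SIZE == 0:
        conn.commit()  # keep each transaction (and its WAL footprint) bounded
conn.commit()          # flush the final partial batch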
First, 5 million rows is nothing; insert speed should not change much whether the table has 100k or 1 million rows.
One or two indexes won't slow it down that much (if the fill factor is set to 70-90, considering each major import is 1/10 of the table).
Python with psycopg2 is quite fast.
A small tip: you could use the xml2 database extension to read/work with the data.
A small example from
https://dba.stackexchange.com/questions/8172/sql-to-read-xml-from-file-into-postgresql-database
duffymo is right: try to commit in chunks of 10,000 inserts (committing only at the end or after each insert is quite expensive).
Autovacuum might bloat things if you do a lot of deletes and updates; you can turn it off temporarily at the start for certain tables. Set work_mem and maintenance_work_mem according to your server's available resources.
For inserts, increase wal_buffers (in 9.0 and higher it is set automatically by default to -1); if you use PostgreSQL 8, you should increase it manually.
You could also turn fsync off and test wal_sync_method (be cautious: changing this may make your database crash-unsafe if a sudden power failure or hardware crash occurs).
Try to drop foreign keys, disable triggers, or set conditions so triggers don't run / skip execution.
Use prepared statements for inserts and cast the variables (see the sketch after this list).
You could try inserting data into an UNLOGGED table to temporarily hold the data.
Do the inserts have WHERE conditions, or values from a sub-query, functions, or the like?
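A sketch of the prepared-statement idea with psycopg2 (PREPARE/EXECUTE at the SQL level; the table, column types, and records iterable are hypothetical):
cur = conn.cursor()  # assumes an existing psycopg2 connection
cur.execute(
    "PREPARE bulk_ins (int, text) AS "
    "INSERT INTO target_table (id, payload) VALUES ($1, $2)"
)
for record in records:  # hypothetical iterable of (id, payload) tuples
    cur.execute("EXECUTE bulk_ins (%s, %s)", record)
conn.commit()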