I have about 29 million individual unicode elements I want to serialise into a database. My current solution uses SQLite, and incrementally sends a list of length 100,000 at a time:
cursor.executemany("insert into array values (?)", [list of length 100,000])
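In full, the loop looks roughly like this (simplified; the table name and variable names are placeholders):
import sqlite3

def chunks(items, size):
    # yield successive fixed-size slices of the full list
    for i in range(0, len(items), size):
        yield items[i:i + size]

conn = sqlite3.connect("strings.db")   # placeholder path
cursor = conn.cursor()
cursor.execute("create table if not exists array (value text)")

for batch in chunks(elements, 100000):  # `elements` is the ~29M-item list
    cursor.executemany("insert into array values (?)",
                       [(value,) for value in batch])
    conn.commit()  # one commit per batch, not per row

conn.close()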
I thought this would be efficient enough, but after going to sleep and waking up (roughly 6 hours later) and seeing that it still hadn't finished, I thought otherwise.
Would a MongoDB or memcached solution be more efficient?
(feel free to suggest any more efficient method)
If you decide to go with memcached, please take note of the following.
Memcached has a default maximum item size of 1 MB.
Storing anything larger than that requires starting memcached with the maximum item size explicitly specified.
I really like using Redis for storing data structures. Since your requirements aren't entirely clear, I'd suggest Redis: it lets you store the data in several different structures and work with it flexibly.
Please read the documentation on the different data structures that Redis can store.
http://redis.io/documentation
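For example, a rough sketch with redis-py, pushing the elements into a Redis list in pipelined batches (connection settings and key name are just placeholders):
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection settings

BATCH = 100000
pipe = r.pipeline(transaction=False)  # no need for atomicity here, just fewer round trips
for i in range(0, len(elements), BATCH):      # `elements` is the full list of values
    pipe.rpush("array", *elements[i:i + BATCH])  # append one batch to the "array" list
    pipe.execute()                               # send each batch as it is built up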
Try spawning gevent greenlets (e.g. 10) and monitor how many records are processed in a given time, then try increasing the number. You should also look into multiprocessing.
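A minimal sketch of that idea, assuming the work is wrapped in a hypothetical process_batch function:
from gevent import monkey
monkey.patch_all()  # make blocking I/O cooperative; do this before other imports

from gevent.pool import Pool

pool = Pool(10)  # start with 10 greenlets and adjust based on measured throughput
for batch in batches:                # `batches` = your chunks of 100,000 values
    pool.spawn(process_batch, batch)
pool.join()
# note: greenlets only help if the bottleneck is I/O; they won't speed up CPU-bound work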
Related
TL;DR: How to share a large (200MB) read only dict between multiple processes in a performant way, that is accessed VERY heavily without each process having a full copy in memory.
EDIT: It looks like if I just pass the dictionary as the argument for the multiprocessing.Pool/Process, it won't actually create a copy unless a worker modifies the dictionary. I just assumed it would copy. This behavior seems to be Unix only where fork is available and even then not always. But if so, it should solve my problem until this is converted to an ETL job.
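For reference, the fork-based sharing described in the EDIT looks roughly like this (the loader/transform/sink helpers are made-up names):
import multiprocessing as mp

doc_to_docinfo = load_lookup_dict()   # hypothetical loader; builds the ~200MB dict once in the parent

def worker(doc_batch):
    # under the "fork" start method the child inherits doc_to_docinfo without a copy,
    # as long as it only reads from it (copy-on-write at the OS page level)
    return [transform(doc, doc_to_docinfo.get(doc["id"])) for doc in doc_batch]

if __name__ == "__main__":
    mp.set_start_method("fork")       # Unix only; not available on Windows
    with mp.Pool(processes=30) as pool:
        for result in pool.imap_unordered(worker, batched_documents()):
            push_to_destination(result)   # hypothetical sink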
What I'm trying to do:
I have a task to improve a script that replicates data from one store to another, normalizing and transforming the data on the way. This task works on the scale of around 100 million documents coming from the source document store that get rolled up and pushed to another destination document store.
Each document has an ID, and there is another document store that is essentially a key-value store mapping those IDs to some additional information needed for this task. This store is a lot smaller, and querying it while documents from the main store come through is not really an option without heavy caching, and that heavy cache ends up being a copy of the whole thing very quickly. So I just build a dictionary from that entire store at the beginning, before starting anything, and use that. That dictionary is around ~200MB in size. Note that this dictionary is only ever read from.
For this I have set up multiprocessing with around 30 concurrent processes. I've divided the work so that each process hits different indices, and the whole thing finishes in around 4 hours.
I have noticed that I am extremely CPU bound when doing the following 2 things:
Using a thread pool/threads (what I'm currently doing) so each thread can access the dict without issue. The GIL is killing me: I have one process maxing out at 100% all the time while the other CPUs sit idle. Switching to PyPy helped a lot, but I'm still not happy with this approach.
Creating a multiprocessing.Manager().dict() for the large dict and having the child processes access it through that. The server process that this approach creates is constantly at 100% CPU. I don't know why, as I only ever read from this dictionary, so I doubt it's a locking issue. I don't know how the Manager works internally, but I'm guessing that the child processes connect via pipes/sockets for each fetch and the overhead of this is massive. It also suggests that using Redis/Memcache would have the same problem if true. Maybe it can be configured better?
I am Memory bound when doing these things:
Using a SharedMemory view. You can't seem to do this for dicts the way I need to. I can serialize the dict to get it into the shared view, but for it to be usable in the child process you need to deserialize the data back into an actual dict, which creates the copy in that process.
I strongly suspect that unless I've missed something I'm just going to have to "download more ram" or rewrite from Python into something without a GIL (or use ETL like it should be done in...).
In the case of RAM, what is the most efficient way to store a dict like this to make it sting less? It's currently a standard dict mapping each ID to a tuple of the extra information, consisting of three longs/floats.
doc_to_docinfo = {
"ID1": (5.2, 3.0, 455),
}
Are there any more efficient hashmap implementations for this use case than what I'm doing?
You seem to have a problem similar to mine. It should be possible to use my source code here to create a partitioning of those dictionary keys per thread. My suggestion: split the document IDs into partitions of length 3 or 4, keep the partition table in sync across all processes/threads, and then route each part of your documents to the right process/thread; as an entry point, a process does a dictionary lookup to find out which process can handle that part of the dictionary. If you are clever with balancing the partitions, you can also keep the number of documents per thread roughly equal. A rough sketch follows.
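A minimal sketch of the partition-table idea (the partition length and worker count are just examples):
NUM_WORKERS = 30

def partition_label(doc_id, length=3):
    # the first few characters of the ID name the partition ("partitions of length 3 or 4")
    return doc_id[:length]

def build_partition_table(doc_ids, num_workers=NUM_WORKERS):
    # map each partition label to a worker index; keep this table identical in every process/thread
    labels = sorted({partition_label(d) for d in doc_ids})
    return {label: i % num_workers for i, label in enumerate(labels)}

def worker_for(doc_id, partition_table):
    # entry-point lookup: which worker owns this document's slice of the dictionary
    return partition_table[partition_label(doc_id)]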
I'm using a web API to call and receive data to build out an SQL database for historical energy prices. For context, energy prices are set at what are called "nodes", and each node has 20 years of historical data.
I can receive the data in JSON or XML format. I need to do one operation with the received data before I put it into the SQL database. Namely, I need to convert each hour given in Eastern Daylight Time back to its Eastern Standard Time equivalent.
Being brand new to Python (learned in last two weeks), I initially went down a path more intuitive to me:
HTTP Request (XML format) -> Parse to XML object in Python -> Convert Datetime -> Place in SQL database
The total size of the data I'm attempting to get is roughly 150 GB. Because of this, I wanted to get the data in an asynchronous manner and format/put it into SQL as it came in from hundreds of API calls (there's a 50,000-row limit to what I can get at a time). I was using a ThreadPool to do this. Once the data was received, I attempted to use a ProcessPool to convert it into the format I needed for my SQL database, but was unsuccessful.
Looking at the process from a high level, I think this process can be a lot more efficient. I think I can do the following:
HTTP Request (JSON Format) -> Parse to JSON object in Python -> Perform operation to convert datetime (map value using dictionary?) -> Place into SQL database
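For the datetime step, something like this fixed-offset conversion is what I have in mind (the timestamp format is just a guess at what the API returns):
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern Daylight Time, UTC-4
EST = timezone(timedelta(hours=-5), "EST")  # Eastern Standard Time, UTC-5

def edt_to_est(timestamp_str):
    # parse an EDT timestamp and re-express the same instant in EST (one hour earlier on the clock)
    dt = datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M").replace(tzinfo=EDT)
    return dt.astimezone(EST)

print(edt_to_est("2020-07-01 14:00"))  # 2020-07-01 13:00:00-05:00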
I just discovered the OPENJSON library in Python. Is this all I need to do this?
Another issue I need to look into is the limitations of SQLite3. Each node will have its own table in my database, so ideally I'd like to have as many instances of my program as possible getting, parsing, and putting data into my SQLite3 database.
Any help would be much appreciated!
There is no definite answer to your question given so many unknowns, but I can outline a way to get to a solution.
Factors That Influence Performance
The processing is done in stages as you described (I'll abstract away the actual format for now for the reasons I'll describe a bit later):
Fetch data from the remote service
Parse data
Convert data
Store into local DB
For every stage there are limiting factors that do not allow you to increase processing speed.
For fetching data some of them are:
network bandwidth.
parallelism that the remote server supports: the remote server may throttle connections and/or total speed for a single user, or the terms of usage may require you to limit this on the client side.
data format used when downloading. Different formats add their own amount of unneeded/boilerplate formatting and/or data that is sent over the network. It depends on the service and its API, but it may be that the returned XML is smaller than the JSON, so even though XML is usually more verbose, in your particular case XML may be better.
RAM amount (and swap speed) may be a limit on your system in the (very unlikely) case that factors #1 and #2 do not limit you. In that case the downloaded data may not fit into RAM, will be swapped to disk, and this will slow down the download process.
For parsing the data:
RAM amount (for the same reasons as above)
Data format used.
The parser used. Different parser implementations (for JSON, for example) have different speeds.
CPU power: speed and number of processing units.
For data conversion:
RAM amount
CPU power
For data storing:
disk speed
parallelism level the DB efficiently supports
These are not all the factors that limit processing speed, just some of the most obvious ones. There are also other, unknown limitations.
Also, there may be some overhead when passing data between stages. It depends on the design. In some designs (for example, a single process that reads the data from the remote server, processes it in memory and stores it to the database) the overhead may be zero, but in others (multiple processes read data and store it to files, another set of processes opens these files and processes them, and so on) the overhead may be quite big.
The final processing speed is defined by the speed of the slowest stage or by the speed of passing data between stages.
Not all of these factors can be predicted when you design a solution or choose between several designs. Given that there are unknown factors this is even more complicated.
Approach
To be systematic I would use the following approach:
create a simple solution (e.g. a single process that reads data, processes it and stores it to the database)
find the processing speed of every stage using that solution
when you have the processing speed of every stage, look at the slowest one (note that it only makes sense to look at the slowest, as it defines the overall speed)
then find out:
why is it slow?
what limits the speed, and can it be improved?
what is the theoretical limit of that stage? (For example, if you have a 1Gb network and one processing box, you can't read data at more than ~120MB/s; in practice it will be even lower.)
Improve. The improvement is usually one of:
optimize the processing done by a single worker (e.g. choose a better format or library for parsing, remove operations that can be avoided, etc.). If you have hit (or are close to) the theoretical limit of the processing speed, you can't use this option.
add more parallelism
In general, when you try to optimize something, you need to have numbers and compare them while you are doing experiments.
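For example, a very simple way to get those numbers per stage (the stage functions here are placeholders for your own):
import time

def timed(label, func, *args, **kwargs):
    # run one stage and report its wall-clock duration so stages can be compared
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print("%s: %.2fs" % (label, time.perf_counter() - start))
    return result

raw    = timed("fetch",   fetch_page, url)     # placeholder stage functions
parsed = timed("parse",   parse_data, raw)
rows   = timed("convert", convert, parsed)
timed("store", store_rows, rows)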
Parallelism
Python
You should be careful when choosing between threads and processes; for example, threads are not good for CPU-intensive tasks. For more information on this, see Multiprocessing vs Threading Python.
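A rough illustration of the split, assuming hypothetical fetch_page and parse_and_convert functions:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# I/O-bound work (waiting on the remote API) scales well with threads,
# because the GIL is released while waiting on the network.
with ThreadPoolExecutor(max_workers=20) as tpool:
    raw_pages = list(tpool.map(fetch_page, urls))

# CPU-bound work (parsing/converting) needs separate processes to use more than one core.
with ProcessPoolExecutor(max_workers=4) as ppool:
    converted = list(ppool.map(parse_and_convert, raw_pages))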
SQLite
SQLite may have some limitations when multiple processes work with a single database. You need to check whether it is the limiting factor for your speed. Maybe you need to use another database that is better suited to parallelism, and then, as an additional final step, dump the data from it into SQLite in a single pass (that would only require reading the data sequentially and storing it in SQLite, which may be much more efficient than parallel writes to a single SQLite DB).
I have a Python script which hits dozens of API endpoints every 10s to write climate data to a database. Let's say on average I insert 1,500 rows every 10 seconds, from 10 different threads.
I am thinking of making a batch system whereby the insert queries aren't written to the DB as they come in, but added to a waiting list; the list is inserted in one batch when it reaches a certain size, and then of course emptied.
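Roughly what I have in mind (table and column names are made up, the placeholder style depends on the driver, and the flush size is a guess):
import threading

class BatchWriter:
    def __init__(self, conn, flush_size=1000):
        self.conn = conn                  # shared DB connection
        self.flush_size = flush_size
        self.buffer = []
        self.lock = threading.Lock()      # the 10 threads all append through this writer

    def add(self, row):
        with self.lock:
            self.buffer.append(row)
            if len(self.buffer) >= self.flush_size:
                self._flush()

    def _flush(self):
        # one multi-row insert instead of hundreds of single-row inserts
        cur = self.conn.cursor()
        cur.executemany(
            "insert into readings (station, ts, value) values (?, ?, ?)",
            self.buffer)
        self.conn.commit()
        self.buffer = []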
Is this justified due to the overhead with frequently writing small numbers of rows to the db?
If so, would a plain list be wise? I am worried about what happens if my program terminates unexpectedly; perhaps some form of serialized data would be better?
150 inserts per second can be a load on a database and can affect performance. There are pros and cons to changing the approach that you have. Here are some things to consider:
Databases implement ACID, so inserts are secure. This is harder to achieve with buffering schemes.
How important is up-to-date information for queries?
What is the query load?
insert is pretty simple. Alternative mechanisms may require re-inventing the wheel.
Do you have other requirements on the inserts, such as ensuring they are in particular order?
No doubt, there are other considerations.
Here are some possible alternative approaches:
If recent data is not a concern, snapshot the database for querying purposes -- say once per day or once per hour.
Batch inserts in the application threads. A single insert statement can insert multiple rows (see the sketch after this list).
Invest in larger hardware. An insert load that slows down a single processor may have little effect on a larger machine.
Invest in better hardware. More memory and faster disks (particularly solid state) can have a big impact.
No doubt, there are other approaches as well.
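As a sketch of the multi-row insert mentioned above (placeholder syntax and table name depend on your driver and schema):
def insert_batch(cursor, rows):
    # one INSERT statement carrying many rows, instead of one statement per row
    placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
    flat_params = [value for row in rows for value in row]
    cursor.execute(
        "insert into readings (station, ts, value) values " + placeholders,
        flat_params)

# usage: insert_batch(cursor, [("A1", "2020-01-01 00:00", 1.5), ("A2", "2020-01-01 00:00", 2.1)])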
I have a function that operates on a single collection, recursively doing two aggregations and then updating documents. All my indexes are correct.
At the moment I cannot refactor this code, and when the function runs it slams Mongo for about 10 minutes to process the data. The time seems to grow with collection size, averaging about 60 seconds for every additional 3k documents. This collection can grow to hundreds of thousands of documents. (The documents are small: about 10 keys with very small values each.)
There is no need for the result of this function to be real-time - it is scheduled, so it's perfectly fine for me to throttle it back.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation? Or should I address the throttling in Python using sleep or some other method?
recursively doing two aggregations then updating documents
It looks like you need to consider re-modelling your schema. The flexibility of MongoDB's document schema is something you can use to optimise your process. See MongoDB: Data Modeling for more info, examples and patterns.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation?
MongoDB does not have a feature to limit CPU usage per operation. Such a feature may not make sense in a distributed deployment: for example, a limit of 1 CPU for an operation that spans multiple shards is no longer as simple or as desirable.
Alternatively, depending on your use case, if the function does not have to be real-time, you could utilise a secondary read preference: essentially, directing your scheduled reporting to a secondary member, allowing your primary to process other data.
Although make sure you read up on the pros and cons of secondary reads beforehand.
See also MongoDB Replication
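A minimal sketch with PyMongo (connection string, database and collection names are placeholders):
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://host1,host2/?replicaSet=rs0")  # placeholder replica-set URI
coll = client.get_database("mydb").get_collection(
    "mycollection", read_preference=ReadPreference.SECONDARY)

# aggregations issued through this handle are routed to a secondary member,
# so the scheduled job doesn't compete with the primary for CPU
results = coll.aggregate(my_pipeline)  # my_pipeline = your existing aggregation stages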
I'm working on a closed system running in the cloud.
What I need is a search function that uses user-typed-in regexp to filter the rows in a dataset.
phrase = re.compile(request.get("query"))
data = Entry.all().fetch(50000) #this takes around 10s when there are 6000 records
result = [x for x in data if phrase.search(x.title)]
Now, the database itself won't change too much, and there will be no more than 200-300 searches a day.
Is there a way to somehow cache all the Entries (I expect there will be no more than 50,000 of them, each no bigger than 500 bytes), so that retrieving them won't take more than 10 seconds? Or perhaps parallelize it? I don't mind 10 CPU-seconds, but I do mind the 10 seconds that the user has to wait.
To address any answers like "index it and use .filter()": the query is a regexp, and I don't know of any indexing mechanism that would allow using a regexp.
You can also use cachepy or performance engine (shameless plug) to store the data on App Engine's local instances, so you can have faster access to all entities without being limited by memcache boundaries or datastore latency.
Hint: a local instance gets killed if it exceeds about 185 MB of memory, so you can actually store quite a lot of data in it if you know what you're doing.
Since there is a bounded number of entries, you can memcache all of them and then do the filtering in memory as you've outlined. However, note that each memcache entry cannot exceed 1 MB. But you can fetch up to 32 MB of memcache entries in parallel.
So split the entries into subsets, memcache the subsets, and then read them back in parallel by precomputing the memcache keys.
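For example (a sketch assuming the App Engine memcache API and a picklable list of entries; the chunk size is a guess to stay under the 1 MB item limit):
from google.appengine.api import memcache

CHUNK = 500  # chosen so each pickled subset stays well under the 1 MB item limit

def cache_entries(entries):
    # store the entries as numbered subsets under a common key prefix
    chunks = {}
    for n, i in enumerate(range(0, len(entries), CHUNK)):
        chunks[str(n)] = entries[i:i + CHUNK]
    memcache.set_multi(chunks, key_prefix="entries:")
    memcache.set("entries:count", len(chunks))

def load_entries():
    # fetch all subsets in one batched call and stitch them back together
    count = memcache.get("entries:count") or 0
    chunks = memcache.get_multi([str(n) for n in range(count)], key_prefix="entries:")
    return [e for n in range(count) for e in chunks.get(str(n), [])]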
More here:
http://code.google.com/appengine/docs/python/memcache/functions.html
Since your data is on the order of 20 MB, you may be able to load it entirely into local instance memory, which will be as fast as you can get. Alternatively, you could store it as a data file alongside your app; reading that will be faster than accessing the datastore.