keep data in memory persistently - python

I want to write a Python script that loads 2 GB of data from the hard disk into memory. Whenever another program requests it, the script must take an input and do some calculations on this data based on that input. The important thing for me is to keep this 2 GB of data in memory persistently, to speed up the calculations and, more importantly, to avoid a huge I/O load.
How should I keep the data in memory forever? Or, more generally, how should I solve such a problem in Python?

Depending on what kind of data you have, you can keep it in a Python list, set, dict, or any other data structure. If this is just meant to be a cache, you can also use a server like Redis or memcached.
There is nothing special about loading data into memory "forever" as opposed to doing it every time you need it. You can just load it into Python variables and keep them around.
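A minimal sketch of that idea, assuming the data pickles into a dict (the filename and the calculation are placeholder assumptions): load once at startup, then keep the reference alive for the life of the process.

    import pickle

    # Load the dataset once, when the process starts; "data.pkl" and the
    # dict shape are placeholders for whatever your 2 GB really is.
    with open("data.pkl", "rb") as f:
        DATA = pickle.load(f)

    def calculate(keys):
        # Every call works against the in-memory DATA; no disk I/O here.
        return sum(DATA[k] for k in keys)

    # As long as this process stays alive, DATA stays in RAM and can be
    # served to other programs over a socket, a pipe, or an HTTP endpoint.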

Make sure you have 2 GB of free and available RAM, and then use the mmap module (https://docs.python.org/3/library/mmap.html) to map the entire file into active memory.
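A hedged sketch of the mmap approach, assuming the data lives in a single binary file ("big_data.bin" is a placeholder):

    import mmap

    with open("big_data.bin", "rb") as f:
        # length=0 maps the whole file; ACCESS_READ keeps it read-only.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Reads go through the mapping; pages stay in the OS page cache,
        # so repeated lookups avoid fresh disk I/O.
        first_bytes = mm[:16]
        mm.close()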

Related

How to store Python/file data in memory until written to disk?

I have a Python script that writes to disk every time it runs, which happens very frequently. It collects data that changes just as often, so I cannot change the rate at which the script runs. I would like to reduce the number of disk writes by only writing after every x minutes. I need the different script results to persist in memory over multiple script runs. The result data consists of lines of text, which are all strings. This script is running on Debian Linux.
Is there a way to store multiple strings/files directly in memory until I write them all to the disk at a later time?
Of course, there are many options. One way is to create a temporary string buffer (i.e., a container) where you collect these strings until the disk write is triggered. If the strings are timestamped or otherwise identifiable, you can use a dictionary to remember their IDs as well.
You can use a buffer or a queue/pipe to hold the data, then write it through the file pointer and flush() it to the hard disk whenever needed.
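A minimal sketch of that buffer-and-flush pattern, assuming the collection loop lives in one long-running process (the interval and the output path are placeholder assumptions):

    import time

    buffer = []                  # holds result lines between disk writes
    last_flush = time.monotonic()
    FLUSH_INTERVAL = 600         # flush every 10 minutes (assumed value)

    def collect(line):
        global last_flush
        buffer.append(line)      # stays in memory; no disk I/O yet
        if time.monotonic() - last_flush >= FLUSH_INTERVAL:
            with open("results.log", "a") as f:
                f.write("\n".join(buffer) + "\n")
                f.flush()        # one write covers many script results
            buffer.clear()
            last_flush = time.monotonic()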

Passing variables between two python processes

I intend to make a program structured like below:
PS1 is a Python program that runs persistently. PC1, PC2, PC3 are client Python programs. PS1 holds a hashtable in a variable; whenever PC1, PC2, ... ask for the hashtable, PS1 passes it to them.
The intention is to keep the table in memory, since it is a huge variable (it takes 10 GB of memory) and it is expensive to calculate every time. It is not feasible to store it on the hard disk (using pickle or JSON) and read it every time it is needed; the read just takes too long.
So I was wondering if there is a way to keep a Python variable persistently in memory, so it can be used very quickly whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance constraints do not allow simply reading the full set from permanent storage
IMHO, this is exactly what databases were created for. For common use cases, having many processes each using their own copy of a 10 GB object is a waste of memory; the common way is for one single process to hold the data and the others to send requests for it. You did not describe your problem in enough detail, so I cannot say whether the best solution will be:
a SQL database like PostgreSQL or MariaDB - they can cache, and if you have enough memory, everything will automatically be held in memory
a NoSQL database (MongoDB, etc.) if your only (or main) need is single-key access - very nice when dealing with a lot of data that requires fast but simple access
a dedicated server using a dedicated query language, if your needs are very specific and none of the above solutions meets them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be the fastest, provided:
all clients make read-only accesses - it can be extended to r/w access, but that could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do, you will lose all the cache optimizations that real databases implement
the size of the database, the number of client processes, and the external load on the whole system never grow to a level where you hit the swapping problem above
TL;DR: My advice is to experiment with the performance of a good-quality database, optionally with a dedicated cache. Those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work should you carefully analyze the memory requirements, document the limits on the number of client processes and on database size for future maintenance, and use shared memory - read-only data being a hint that shared memory can be a nice solution.
In short, to accomplish what you are asking, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough to hold your entire hashtable in the PS1 server, then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class of object providing the hashtable interface, through which the individual variables in the table are accessed, and pass a separate instance of it to each of the PC# processes, all reading from the same shared RawArray.
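A minimal sketch of that approach; pickling the table into the RawArray is a simplifying assumption (unpickling copies the data into each client, so a real implementation would use a fixed in-place layout instead):

    import ctypes
    import pickle
    from multiprocessing import Process
    from multiprocessing.sharedctypes import RawArray

    def client(shared, size):
        # The child inherits access to the shared bytes; loads() is shown
        # for brevity, though it copies the table into the client process.
        table = pickle.loads(shared[:size])
        print(table["some_key"])

    if __name__ == "__main__":
        table = {"some_key": 42}            # stands in for the 10 GB hashtable
        payload = pickle.dumps(table)
        shared = RawArray(ctypes.c_char, len(payload))  # unsynchronized shared bytes
        shared[:] = payload                 # copy the table into shared memory
        p = Process(target=client, args=(shared, len(payload)))  # PC1 inherits access
        p.start()
        p.join()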

Python 3 - Faster Print & I/O

I'm currently involved in a Python project that involves handling massive amounts of data. As part of it, I have to print very long outputs to files: they are always one-liners, but sometimes consist of millions of digits.
The actual mathematical operations in Python take only seconds, minutes at most. Printing them to a file takes up to several hours, which I don't always have.
Is there any way of speeding up the I/O?
From what I can tell, the number is stored in RAM (or at least I assume so; it's the only thing that would take up 11 GB of RAM), but Python does not print it to a text file immediately. Is there a way to dump that information - if it is the number - to a file? I've tried Task Manager's dump, which gave me a 22 GB dump file (yes, you read that right), and it doesn't look like what I was looking for is in there, although it wasn't very clear.
If it makes a difference, I have Python 3.5.1 (Anaconda and Spyder), Windows 8.1 x64 and 16GB RAM.
By the way, I do run garbage collection (the gc module) inside the script, and I delete variables that are not needed, so those 11 GB aren't just junk.
If you are indeed I/O bound by the time it takes to write the file, multi-threading with a pool of threads may help. Of course, there is a limit to that, but at least it would allow you to issue non-blocking file writes.
Multithreading could speed it up: have writer threads on the side, with an in-memory queue that your computation pushes results onto, as in the sketch below.
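A minimal sketch of that queue pattern; compute_chunks() is a stand-in for whatever produces your output strings:

    import queue
    import threading

    q = queue.Queue()

    def writer():
        # Drains the queue and does the slow disk work off the main thread.
        with open("output.txt", "w") as f:
            for chunk in iter(q.get, None):   # None is the shutdown sentinel
                f.write(chunk)

    t = threading.Thread(target=writer)
    t.start()

    for chunk in compute_chunks():            # assumed: yields result strings
        q.put(chunk)                          # enqueue without blocking on disk

    q.put(None)                               # tell the writer we're done
    t.join()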
From a system design standpoint, evaluate whether you actually need to write everything to the file. Perhaps consider creating various levels of logging, so that a release mode could run faster (if that makes sense in your context).
Use HDF5 file format
The problem is, you have to write a lot of data.
HDF5 is a format that is very space-efficient and that can be read and written by various tools.
Be prepared for a few challenges:
there are multiple Python packages for HDF5; you will have to find the one that fits your needs
installation is not always simple (though there may be a Windows installation binary)
expect a bit of study to understand the data structures to be stored
it will occasionally need some CPU cycles - typically you write a lot of data quickly, and at some moment it has to be flushed to disk; at that moment it starts compressing the data, which can take a few seconds. See GIL for IO bounded thread in C extension (HDF5)
Anyway, I think it is very likely you will manage, and apart from faster writes to the files you will also gain smaller files, which are simpler to handle.
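A minimal sketch with h5py (one of those packages; PyTables is another), assuming the results fit in a NumPy array:

    import numpy as np
    import h5py

    data = np.random.rand(1_000_000)    # stands in for your computed results

    with h5py.File("results.h5", "w") as f:
        # gzip compression is the CPU-for-size trade-off mentioned above.
        f.create_dataset("results", data=data, compression="gzip")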

reducing I/O on application and database

Is there a way to reduce the I/O associated with either MySQL or a Python script? I am thinking of using EC2, and the costs seem okay, except I can't really predict my I/O usage and I am worried it might blindside me with costs.
I basically developed a Python script to parse data and upload it into MySQL. Once it's in MySQL, I do some fairly heavy analytics on it (creating new columns, tables... basically a lot of math and financially oriented analysis on a large dataset). So are there any design best practices to avoid heavy I/O? I think memcached stores everything in memory and accesses it from there; is there a way to get MySQL or other scripts to do the same?
I am running the scripts fine right now on another host with 2 GB of RAM, but the EC2 instance I was looking at had about 8 GB, so I was wondering if I could use the extra memory to save me some money.
By I/O I assume you mean disk I/O... and assuming you can fit everything into memory comfortably, you could:
Disable swap on your box†
Use MySQL MEMORY tables while you are processing (or perhaps consider an SQLite in-memory database, if you are only using the database for the convenience of SQL queries - see the sketch below)
Also: unless you are using EBS, I don't think Amazon charges for I/O on your instance. EBS is much slower than your instance storage, so only use it when you need the persistence, i.e. not while you are crunching data.
†probably bad idea
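A minimal sketch of that SQLite in-memory option (the table and data are placeholders):

    import sqlite3

    # ":memory:" keeps the whole database in RAM: zero disk I/O while you
    # crunch. Persist the results explicitly at the end if you need them.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (ticker TEXT, price REAL)")
    conn.executemany("INSERT INTO prices VALUES (?, ?)",
                     [("AAPL", 101.0), ("MSFT", 55.5)])

    # All the heavy SQL analysis runs against RAM only.
    for row in conn.execute("SELECT ticker, AVG(price) FROM prices GROUP BY ticker"):
        print(row)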
You didn't really specify whether the load is writes or reads. My guess is that you can do it all in a MySQL instance on a ramdisk (tmpfs under Linux).
Operations such as ALTER TABLE and copying big data around end up creating a lot of I/O requests because they move a lot of data. This is not the same as having just a lot of random (or more predictable) queries.
If it's a batch operation, maybe you can do it entirely in a tmpfs instance.
It is possible to run more than one MySQL instance on the machine, and it's pretty easy to start an instance on a tmpfs - just run mysql_install_db with the datadir in the tmpfs, then run mysqld with appropriate parameters. Stick that in some shell scripts and you'll get it to start up. As it's in a tmpfs, it won't need much memory for its buffers - just set them fairly small.

How should I optimize this filesystem I/O bound program?

I have a python program that does something like this:
Read a row from a csv file.
Do some transformations on it.
Break it up into the actual rows as they would be written to the database.
Write those rows to individual csv files.
Go back to step 1 unless the file has been totally read.
Run SQL*Loader and load those files into the database.
Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.
There are a few ideas that I have to solve this:
Read the entire file from step one (or at least read it in very large chunks) and write the files to disk as a whole or in very large chunks. The idea is that the hard disk would spend less time seeking back and forth between files. Would this do anything that buffering wouldn't?
Parallelize steps 1, 2&3, and 4 into separate processes. This would make steps 1, 2, and 3 not have to wait on 4 to complete.
Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with step 2 somehow.
Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?
Poor man's map-reduce:
Use split to break the file up into as many pieces as you have CPUs.
Use batch to run your muncher in parallel.
Use cat to concatenate the results.
Python already does I/O buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long - unless you force the OS to write immediately, for example by closing the file after each write or opening the file in O_SYNC mode.
If the OS isn't doing the right thing, you can try raising the buffer size (the third parameter to open()). For some guidance on appropriate values: given a 100 MB/s, 10 ms latency I/O system, a 1 MB I/O size will result in approximately 50% latency overhead, while a 10 MB I/O size will result in about 9% overhead. If it's still I/O bound, you probably just need more bandwidth. Use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
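A hedged sketch of raising the buffer size; generate_rows() is a stand-in for the output of steps 1-3:

    # A 10 MB buffer keeps per-operation latency overhead near the 9%
    # figure above on a 100 MB/s, 10 ms system.
    with open("rows.csv", "w", buffering=10 * 1024 * 1024) as out:
        for row in generate_rows():    # assumed: yields CSV lines
            out.write(row)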
It is also useful to check whether step 4 spends its time executing or waiting on I/O. If it's executing, you'll need to spend more time finding which part is the culprit and optimizing it, or splitting the work across different processes.
If you are I/O bound, the best way I have found to optimize is to read or write the entire file into/out of memory at once, then operate out of RAM from there on.
With extensive testing I found that my runtime ended up bound not by the amount of data I read from/wrote to disk, but by the number of I/O operations I used to do it. That is what you need to optimize.
I don't know Python, but if there is a way to tell it to write the whole file out of RAM in one go, rather than issuing a separate I/O for each byte, that's what you need to do.
Of course the drawback to this is that files can be considerably larger than available RAM. There are lots of ways to deal with that, but that is another question for another time.
Can you use a ramdisk for step 4? Low millions sounds doable if the rows are less than a couple of kB or so.
Use buffered writes for step 4.
Write a simple function that appends the output to a string, checks the string length, and only writes when you have accumulated enough, which should be some multiple of 4 KB. I would say start with 32 KB buffers and time it.
You would have one buffer per file, so that most "writes" won't actually hit the disk.
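A minimal sketch of that per-file buffer (32 KB default, as suggested; tune the limit by timing):

    class BufferedWriter:
        """Accumulate output in memory; hit the disk only in 32 KB chunks."""

        def __init__(self, path, limit=32 * 1024):
            self.f = open(path, "w")
            self.limit = limit
            self.parts = []     # list of strings is cheaper than str +=
            self.size = 0

        def write(self, s):
            self.parts.append(s)
            self.size += len(s)
            if self.size >= self.limit:
                self.flush()    # most write() calls never reach the disk

        def flush(self):
            self.f.write("".join(self.parts))
            self.parts.clear()
            self.size = 0

        def close(self):
            self.flush()
            self.f.close()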
Isn't it possible to collect a few thousand rows in RAM, then go directly to the database server and execute them?
This would remove the save to and load from the disk that step 4 entails.
If the database server is transactional, this is also a safe way to do it - just have the database begin before your first row and commit after the last.
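A hedged sketch of that batching idea using the DB-API; sqlite3 stands in for your real database driver, and the table, placeholders, and transformed_rows() generator are assumptions:

    import sqlite3

    conn = sqlite3.connect("target.db")
    batch = []

    for row in transformed_rows():          # assumed: output of steps 1-3
        batch.append(row)
        if len(batch) >= 5000:
            with conn:                      # BEGIN ... COMMIT around the batch
                conn.executemany("INSERT INTO target VALUES (?, ?, ?)", batch)
            batch.clear()

    if batch:                               # final partial batch
        with conn:
            conn.executemany("INSERT INTO target VALUES (?, ?, ?)", batch)
    conn.close()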
The first thing is to be certain of what you should optimize. You don't seem to know precisely where your time is going. Before spending more time wondering, use a performance profiler to see exactly where it goes.
http://docs.python.org/library/profile.html
When you know exactly where the time is going, you'll be in a better position to know where to spend your time optimizing.
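A minimal profiling sketch; main() is a stand-in for your pipeline's entry point:

    import cProfile
    import pstats

    cProfile.run("main()", "profile.out")           # profile the whole pipeline
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(10)  # show the top 10 time sinks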
