Advantages of mmap vs fileinput - python

I read that mmap is advantageous over fileinput because it reads a page into the kernel page cache and shares that page with user address space, whereas fileinput brings a page into the kernel and then copies each line into user address space. So there is extra space overhead with fileinput.
So I am planning to move to mmap, but I want to know from advanced Python hackers whether it actually improves performance.
If so, is there an implementation similar to fileinput that uses mmap?
Please point me to any open-source code you are aware of.
Thank you.

mmap maps a file into your address space so that you can index it like an array of bytes or treat it as one big data structure.
It's a lot faster if you are accessing your file in a "random-access" manner -- that is, doing a lot of fseek(), fread(), fwrite() combinations.
But if you are just reading the file in and processing each line once (say), then it is unlikely to be significantly faster. In fact, for any reasonable file size (remember that with mmap the working set still has to fit in RAM -- otherwise paging occurs, which starts to erode mmap's advantage) the two approaches are probably indistinguishable.
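To make the contrast concrete, here is a minimal sketch (the file names, offsets, and sizes are made up) of random access through an mmap versus explicit seek()/read() on an ordinary file object, plus the plain line-by-line loop where mmap rarely buys much:
import mmap

# Random access: slice the mapped file like one big bytes object.
with open("data.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[0:16]              # no explicit seek/read calls
        record = mm[4096:4096 + 64]    # jump anywhere at byte granularity

# The same random access with a plain file object: explicit seek() + read().
with open("data.bin", "rb") as f:
    f.seek(4096)
    record2 = f.read(64)

# Sequential line-by-line processing -- here mmap is unlikely to help much.
with open("data.txt", "rb") as f:
    for line in f:
        pass  # process(line)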

Related

Indexing very large hex file with python

I'm trying to write a program that parses data from a (very) large file whose rows each contain eight 16-bit hex values. For instance, one row would look like this:
edfc b600 edfc 2102 81fb 0000 d1fe 0eff
The data files are expected to be anywhere from 1 to 4 TB, so I wasn't sure what the best approach would be. If I load this file using Python's open() function, could this turn out badly? I'm worried about how much of an impact this will have on my memory if I'm loading such a large file just to index through it. Alternatively, if there's a method I can use to load just the section of data I want from the file, that would be ideal, but as far as I know I don't think that's even possible. Is this correct?
Anyway, some idea of how to approach this very general problem would be much appreciated!
Found an answer on GitHub. In numpy, there's a function called memmap that works for what I'm doing.
import numpy as np

samples = np.memmap("hexdump_samples", mode="r", dtype=np.int16)[100:159]
This didn't seem to cause any issues with the smaller data set I was using, and as far as I understand it shouldn't cause memory problems with the larger files either, since memmap only maps the file rather than loading it all into memory.
It depends on your hardware and how much RAM you have. Python is an interpreted language with a bunch of safeguards, but I wouldn't risk trying to read that whole file into memory with Python. I would recommend using C or C++; they are good with large amounts of data and memory management. You can then parse the data in bite-sized chunks, maybe 16 MB per chunk. Python is extremely slow and memory-inefficient compared to C.
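For what it's worth, if the rows really are fixed width (the sample row above is 39 characters, so 40 bytes with a trailing newline), you can pull out just the section you want with plain seek()/read() and never hold more than a few rows in memory. A minimal sketch -- the 40-byte assumption, the row numbers, and the read_rows helper are mine, not from the question:
ROW_SIZE = 40  # 39 characters of hex text plus '\n'; adjust if the format differs

def read_rows(path, first_row, n_rows):
    # Return n_rows rows starting at first_row, each as a list of ints.
    rows = []
    with open(path, "rb") as f:
        f.seek(first_row * ROW_SIZE)            # jump straight to the wanted row
        for _ in range(n_rows):
            words = f.read(ROW_SIZE).split()    # e.g. [b'edfc', b'b600', ...]
            rows.append([int(w, 16) for w in words])
    return rows

samples = read_rows("hexdump_samples", 1000, 10)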

Python hashlib and sparse files

I wanted to know how the Python hashlib library treats sparse files. If a file has a lot of zero blocks then, instead of wasting CPU and memory on reading those zero blocks, does it do any optimization such as scanning the inode block map and reading only the allocated blocks to compute the hash?
If it does not do this already, what would be the best way to do it myself?
PS: Not sure it would be appropriate to post this question in StackOverflow Meta.
Thanks.
The hashlib module doesn't even work with files. You have to read the data in and pass blocks to the hashing object, so I have no idea why you think it would handle sparse files at all.
The I/O layer doesn't do anything special for sparse files, but that's the OS's job anyway: if the kernel knows a block is a hole, the read operation doesn't need to touch the disk at all; it just fills in your buffer with zeroes.
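To make the "read the data in and pass blocks to the hashing object" part concrete, here is a minimal chunked-hashing sketch (the chunk size and file name are arbitrary); any sparse-file handling happens below this code, inside the OS's read path:
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # holes in a sparse file come back as zero bytes
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

print(sha256_of_file("some_sparse_file.img"))
Note that a standard digest has to include the zero bytes of the holes anyway; skipping unallocated blocks would produce a different hash.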

Is it meaningful to use multiprocessing to read multiple files with Python?

I intend to read a set of small files using the multiprocessing capabilities of Python. However, this feels awkward to me in a sense, because if the disk is rotational then the bottleneck is the rotation time, and even though I use multiple processes, the total read time should be similar to a single-process read. Am I wrong? What are your comments?
In addition, do you think using multiprocessing might cause intertwined reading of the files, so that the contents of these files end up skewed in some way?
Your reasoning is sound, but the only way to find out for sure is by benchmarking (that said, it is unlikely that reading many small files in parallel will increase performance over reading them sequentially).
I am not entirely sure what you mean by "intertwined reading", but -- unless there are bugs in your code or the files are being changed while you're reading them -- you will get exactly the same contents irrespective of how you read it.
You are indeed right: the bottleneck will be disk I/O.
However, the only way to really know is to measure both approaches.
If you have influence over the files, you could go for one larger file as opposed to many smaller files.
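If you do want to measure it, a minimal benchmark sketch along these lines (the glob pattern, worker count, and the trivial read_file stand-in are placeholders) compares a sequential pass against a multiprocessing pool:
import glob
import time
from multiprocessing import Pool

def read_file(path):
    with open(path, "rb") as f:
        return len(f.read())   # stand-in for whatever per-file work you actually do

if __name__ == "__main__":
    paths = glob.glob("data/*.txt")

    start = time.perf_counter()
    for p in paths:
        read_file(p)
    print("sequential:", time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(4) as pool:
        pool.map(read_file, paths)
    print("parallel:  ", time.perf_counter() - start)
Remember to drop the OS file cache (or at least alternate the order of the two runs) between measurements, or the second run will look unrealistically fast.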

Python: quickly loading 7 GB of text files into unicode strings

I have a large directory of text files -- approximately 7 GB. I need to load them quickly into Python unicode strings in IPython. I have 15 GB of memory total. (I'm using EC2, so I can buy more memory if absolutely necessary.)
Simply reading the files will be too slow for my purposes. I have tried copying the files to a ramdisk and then loading them from there into IPython. That speeds things up, but IPython crashes (not enough memory left over?). Here is the ramdisk setup:
mount -t tmpfs none /var/ramdisk -o size=7g
Anyone have any ideas? Basically, I'm looking for persistent in-memory Python objects. The IPython requirement precludes using IncPy: http://www.stanford.edu/~pgbovine/incpy.html
Thanks!
There is much that is confusing here, which makes it more difficult to answer this question:
The IPython requirement. Why do you need to process such large data files from within IPython instead of from a stand-alone script?
The tmpfs RAM disk. I read your question as implying that you read all of your input data into memory at once in Python. If that is the case, then Python allocates its own buffers to hold all the data anyway, and the tmpfs filesystem only buys you a performance gain if you reload the data from the RAM disk many, many times.
Mentioning IncPy. If your performance issues are something you could solve with memoization, why can't you just manually implement memoization for the functions where it would help most?
So. If you actually need all the data in memory at once -- if your algorithm reprocesses the entire dataset multiple times, for example -- I would suggest looking at the mmap module. That will provide the data in raw bytes instead of unicode objects, which might entail a little more work in your algorithm (operating on the encoded data, for example), but will use a reasonable amount of memory. Reading the data into Python unicode objects all at once will require either 2x or 4x as much RAM as it occupies on disk (assuming the data is UTF-8).
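In case it is useful, here is a minimal sketch of that mmap suggestion (the file name and search string are placeholders): the file is mapped once and searched as raw UTF-8 bytes, so nothing is decoded or copied up front:
import mmap

with open("big_corpus.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Work on the encoded bytes directly, e.g. locate a UTF-8 needle.
    offset = mm.find(u"needle".encode("utf-8"))
    print(offset)
    mm.close()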
If your algorithm simply does a single linear pass over the data (as does the Aho-Corasick algorithm you mention), then you'd be far better off just reading in a reasonably sized chunk at a time:
import codecs

with codecs.open(inpath, encoding='utf-8') as f:
    data = f.read(8192)
    while data:
        process(data)
        data = f.read(8192)
I hope this at least gets you closer.
I saw the mention of IncPy and IPython in your question, so let me plug a project of mine that goes a bit in the direction of IncPy, but works with IPython and is well-suited to large data: http://packages.python.org/joblib/
If you are storing your data in numpy arrays (strings can be stored in numpy arrays), joblib can use memmap for intermediate results and be efficient for IO.
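A minimal sketch of how that might look (the cache directory, file name, and load_corpus function are made up, and the exact constructor arguments can differ between joblib versions): an expensive function's numpy result is cached on disk and handed back memory-mapped on later calls:
import numpy as np
from joblib import Memory

memory = Memory("/tmp/joblib_cache", mmap_mode="r")   # cached arrays come back memory-mapped

@memory.cache
def load_corpus(path):
    # Expensive step whose numpy result gets cached on disk by joblib.
    with open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.uint8)

data = load_corpus("big_corpus.txt")   # repeated calls reuse the cached, memmapped array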

Which is faster?

Is it faster to open a large file once and read it completely into a list, or to open many smaller files (whose sizes sum to that of the large file) and load them into lists, manipulating them one by one?
Which is faster? Is the difference in time large enough to impact my program?
A total time difference of less than 30 seconds is negligible for me.
It depends on whether your data fits in your available memory. If you need to resort to paging, or virtual memory, then opening a single giant file might become slower than opening many smaller files. This will be even more true if the computation you need to do creates intermediate variables that won't fit in physical RAM either.
So, as long as the file is not that big, a single open will be faster; but if that is not true, then many opens may be faster.
Finally, note that if you can do many opens, you might be able to do them in parallel and process the various parts in different processes, which might make things faster again.
Obviously one open and close is going to be faster than n opens and closes if you are reading the same amount of data. Plus, when reading a single file the I/O classes you use can take advantage of things like buffering, which makes it even faster.
If you are reading the file sequentially from start until end, one open/close is faster than multiple open/close operations.
However, keep in mind that if you need to do a lot of seeking around within your one big file, then separate files may not be any slower in that case.
Also keep in mind that no matter which approach you are using, you shouldn't read the entire file in at once. Do it in chunks.
Working with a single file is almost certainly going to be faster: you have to read the same amount of data in both cases, but with multiple files you have many more housekeeping operations slowing you down.
Additionally, you can read data from a single file at the maximum speed the disk can handle, making maximum use of the disk buffer and so on, whereas with multiple files the disk head does a lot more dancing, jumping from file to file.
A 30-second time difference? Define "large". Everything that fits into an average computer's RAM would probably not take much more than 30 seconds in total.
Why do you think you need to read the file(s) into a list?
If you can open several small files and process each independently, then surely that means:
(a) that you don't need to read into a list; you can process any file (including one large file) a line at a time, avoiding running-out-of-real-memory problems (see the short sketch after this list)
or
(b) what you need to do is more complicated than you have told us.
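To make option (a) concrete, a minimal sketch of the difference (the file name is a placeholder):
# Reads everything into memory at once -- what the question describes.
with open("big_or_small_file.txt") as f:
    lines = f.readlines()

# Processes one line at a time -- memory use stays small regardless of file size.
with open("big_or_small_file.txt") as f:
    for line in f:
        pass  # do_something(line)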
