I have a program that wants to be called from the command line many times, but involves reading a large pickle file, so each call can be potentially expensive. Is there any way that I can make cPickle just mmap the file into memory rather than read it in its entirety?
You probably don't even need to do this explicitly as your OS's disk cache will probably do a damn good job already.
Any poor performance might actually be related to the cost of deserialization and not the cost of reading it off the disk. You can test this by creating a temporary ram disk and putting the file there.
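A rough way to check where the time actually goes, without even setting up a ram disk, is to time the raw read and the unpickling separately (a sketch; model.pkl is a placeholder for your file):

import pickle
import time

with open('model.pkl', 'rb') as f:
    t0 = time.perf_counter()
    raw = f.read()            # pure I/O; on repeat runs this is likely served from the OS page cache
    t1 = time.perf_counter()
obj = pickle.loads(raw)       # pure deserialization, no disk involved
t2 = time.perf_counter()
print('read: %.3fs, unpickle: %.3fs' % (t1 - t0, t2 - t1))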
And the way to remove the cost of deserialization is to move the loading of the file to a separate python process and call it like a service. Building a quick-and-dirty REST service in python is super-easy and super-useful in these cases.
Take a look at the socket docs for how to do this with a raw socket. The echo server is a good example to start from.
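A minimal sketch of that idea, built on the stdlib socketserver module rather than a raw socket (model.pkl, the port number, and the line-based query protocol are placeholder assumptions, not details from the original setup):

import pickle
import socketserver

with open('model.pkl', 'rb') as f:
    MODEL = pickle.load(f)    # the expensive load happens exactly once, at startup

class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        key = self.rfile.readline().strip().decode()
        # assumes MODEL is dict-like; adapt the lookup to your object
        self.wfile.write(repr(MODEL.get(key)).encode() + b'\n')

if __name__ == '__main__':
    with socketserver.TCPServer(('localhost', 9999), Handler) as server:
        server.serve_forever()

Each command-line invocation then becomes a tiny client that connects, sends its request and reads the reply, instead of re-reading the pickle.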
So I have a couple of questions:
Do reads and writes in Python using file.read() and file.write() happen using locks?
Or can a read and a write happen at the same time?
I have pickled an object to a file on disk. Multiple threads can read the pickled model, but I also want to update the pickled file. One way is to lock the file on every read and write, but this would be inefficient, since locking is unnecessary when the file is only being read by different processes.
How should I solve this ?
I am trying to analyse a large shelve with multiprocessing.Pool. Opened in read-only mode it should be thread safe, but it seems that a large object is being read first, then slowly dispatched through the pool. Can this be done more efficiently?
Here is a minimal example of what I'm doing. Assume that test_file.shelf already exists and is large (14GB+). I can see this snippet hogging 20GB of RAM, even though only a small part of the shelve needs to be read at any one time (there are many more items than processors).
from multiprocessing import Pool
import shelve

def func(key_val):
    print(key_val)

with shelve.open('test_file.shelf', flag='r') as shelf, \
        Pool(4) as pool:
    list(pool.imap_unordered(func, iter(shelf.items())))
Shelves are inherently not fast to open because they work as a dictionary-like object, and loading one takes a while, especially for large shelves, due to the way the backend works: each time you get a new item, that item is loaded into a separate dictionary in memory (lib reference).
Also from the docs:
(imap) For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
You are using the default chunksize of 1, which causes it to take a long time to get through your very large shelve file. The docs suggest using a larger chunksize, instead of sending one item at a time, to speed it up.
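For example, the call from the question could pass an explicit chunksize (a sketch; 256 is an arbitrary value to experiment with, not a recommendation from the docs):

from multiprocessing import Pool
import shelve

def func(key_val):
    print(key_val)

if __name__ == '__main__':
    with shelve.open('test_file.shelf', flag='r') as shelf, Pool(4) as pool:
        # each worker now receives batches of 256 items instead of one at a time
        for _ in pool.imap_unordered(func, shelf.items(), chunksize=256):
            pass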
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
Last, just as a note, I'm not sure that your assumption about it being safe for multiprocessing is correct out of the box, depending on the implementation.
Edited:
As pointed out by juanpa.arrivillaga, this answer describes what is happening in the backend: your entire iterable may be consumed up front, which causes high memory usage.
I had been working on a project where I had to read and process very large csv files with millions of rows as fast as possible.
I came across the link: https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author has benchmarked different ways of accessing csv and the time taken for each step.
He has used a catdevnull process with the code as shown:
def catDevNull():
    os.system('cat %s > /dev/null' % fn)
The time taken in this case is the least. I believe it is independent of the Python version, as the time taken to read the file remains the same. Then he uses the warm cache method as shown:
def wc():
    os.system('wc -l %s > /dev/null' % fn)
The above two methods are the fastest. Using pandas.read_csv for the task takes less time than the other Python methods, but is still slower than the above two.
Assigning x = os.system('cat %s > /dev/null' % fn) and checking the data type, it is a string.
How does os.system read the file such that the time taken is so much less? Also, is there a way to access the file contents after they are read by os.system, for further processing?
I was also curious why reading the file is so much faster in pandas compared to the other methods shown in the link above.
os.system completely relinquishes the control you have in Python. There is no way to access anything which happened in the subprocess after it has finished.
A better way to have some (but not sufficient) control over a subprocess is to use the Python subprocess module. This allows you to interact with the running process using signals and I/O, but still, there is no way to affect the internals of a process unless it has a specific API for allowing you to do that. (Linux exposes some process internals in the /proc filesystem if you want to explore that.)
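If the goal is to get the data back into Python for further processing, a sketch with subprocess (assuming Python 3.7+ for capture_output; fn is a placeholder file name) looks like this:

import subprocess

fn = 'data.csv'   # placeholder
result = subprocess.run(['wc', '-l', fn], capture_output=True, text=True, check=True)
print(result.stdout)                        # e.g. "1000000 data.csv"
line_count = int(result.stdout.split()[0])  # the output is available in Python, unlike with os.system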
I don't think you understand what the benchmark means. The cat >/dev/null is a baseline which simply measures how quickly the system is able to read the file off the disk; your process cannot possibly be faster than the I/O channel permits, so this is the time that the system takes for doing nothing at all. You would basically subtract this time from the subsequent results before you compare their relative performance.
Conventionally, the absolutely fastest way to read a large file is to index it, then use the index in memory to seek to the position inside the file you want to access. Building the index causes some overhead, but if you access the file more than once, the benefits soon cancel out the overhead. Importing your file to a database is a convenient and friendly way to do this; the database encapsulates the I/O completely and lets you query the data as if you could ignore that it is somehow serialized into bytes on a disk behind the scenes.
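For a line-based file, the index can be as simple as a list of byte offsets, built in one pass and then used with seek() (a sketch; the file name and row number are placeholders):

offsets = []
with open('data.csv', 'rb') as f:
    pos = 0
    for line in f:
        offsets.append(pos)   # byte offset where each row starts
        pos += len(line)

# later: jump straight to row 123456 without rereading everything before it
with open('data.csv', 'rb') as f:
    f.seek(offsets[123456])
    row = f.readline()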
Based on my testing, I found that it is a lot faster to query a pandas DataFrame than to query the database (tested with sqlite3).
Thus, the fastest way is to load the csv into a pandas DataFrame and then query the DataFrame as required. Also, if I need to save the data, I can pickle the DataFrame and reuse it as required. The time to pickle and unpickle the file plus querying is much less than storing the data in SQL and then querying for the results.
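A sketch of that workflow (the file and column names are placeholders):

import pandas as pd

df = pd.read_csv('data.csv')           # slow parse, done once
df.to_pickle('data.pkl')               # cheap to write and to read back

df = pd.read_pickle('data.pkl')        # much quicker than re-parsing the csv
subset = df[df['some_column'] > 100]   # query entirely in memory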
Are there any advantages/disadvantages to reading an entire file in one go rather than reading the bytes as required? So is there any advantage to:
file_handle = open("somefile", "rb")
file_contents = file_handle.read()
# do all the things using file_contents
compared to:
file_handle = open("somefile", "rb")
part1 = file_handle.read(10)
# do some stuff
part2 = file_handle.read(8)
# do some more stuff etc
Background: I am writing a p-code (bytecode) interpreter in Python and have initially just written a naive implementation that reads bytes from the file as required and performs the necessary actions. A friend I was showing the program to suggested that I should instead read the entire file into memory (a Python list?) and then process it from memory to avoid lots of slow disk reads. The test files are currently less than 1KB and will probably be at most a few hundred KB, so I would have expected the operating system and disk controller to cache the file, obviating any performance issues caused by repeatedly reading small chunks of it.
Cache aside, you still have system calls. Each read() that actually reaches the OS results in a mode switch into the kernel. You can see this with strace or another tool that traces system calls.
This might be premature for a 100 KB file though. As always, test your code to know for sure.
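A quick way to measure it for your own file (a sketch; program.pcode is a placeholder for the bytecode file):

import timeit

def one_big_read():
    with open('program.pcode', 'rb') as f:
        f.read()

def many_small_reads():
    with open('program.pcode', 'rb') as f:
        while f.read(2):   # two bytes at a time, so many read() calls
            pass

print(timeit.timeit(one_big_read, number=1000))
print(timeit.timeit(many_small_reads, number=1000))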
If you want to do any kind of random access then putting it in a list is going to be much faster than seeking from disk. Even if the OS does cache disk access, you are hitting another layer of cache. In any case, you can't be sure how the OS will behave.
Here are 3 cases I can think of that would motivate doing it in-memory:
You might have a jump instruction which you can execute by adding a number to your program counter. Doing that to the index of an array vs seeking a file is a good use case.
You may want to optimise your VM's behaviour, and that may involve reading the file more than once. Scanning a list twice vs reading a file twice will be much quicker.
Depending on opcodes and the grammar of your language you may want to look ahead in a 'cycle' to speed up execution. If that ends up doing two seeks then this could end up degrading performance.
If your file will always be small enough to fit in RAM then it's probably worth reading it all into memory. Profile it with a real program and see if it's noticeably faster.
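Picking up the jump-instruction case from the list above, a minimal sketch of the in-memory approach (the file name and opcodes are made up for illustration):

with open('program.pcode', 'rb') as f:
    code = f.read()            # a bytes object: random access by index

pc = 0
while pc < len(code):
    op = code[pc]
    if op == 0x01:             # hypothetical JUMP opcode: next byte is the target address
        pc = code[pc + 1]      # a jump is just reassigning an index, no seek needed
    else:
        # ... execute other opcodes ...
        pc += 1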
A single call to read() will be faster than multiple calls to read(). The tradeoff is that with a single call you must be able to fit all data in memory at once, whereas with multiple reads you only have to retain a fraction of the total amount of data. For files that are just a few kilobytes or megabytes, the difference won't be noticeable. For files that are several gigs in size, memory becomes more important.
Also, to do a single read means all of the data must be present, whereas multiple reads can be used to process data as it is streaming in from an external source.
If you are looking for performance, I would recommend going through generators. Since you have a small file size, memory would not be a big concern, but it's still good practice. Reading the file from disk multiple times is a definite bottleneck for a scalable solution.
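A sketch of a generator that reads the file once and streams it in fixed-size chunks (the chunk size and file name are placeholders):

def read_chunks(path, chunk_size=4096):
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

for chunk in read_chunks('somefile'):
    pass   # process each chunk here without holding the whole file in memory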
Let's consider a big file (~100MB). Let's consider that the file is line-based (a text file, with relatively short lines of ~80 chars).
If I use the built-in open()/file(), the file will be loaded in a lazy manner.
I.e. if I do aFile.readline(), only a chunk of the file will reside in memory. Does urllib.urlopen() do something similar (with usage of a cache on disk)?
How big is the difference in performance between urllib.urlopen().readline() and file().readline()? Let's assume that the file is located on localhost. Once I open it with urllib.urlopen() and then with file(), how big will the difference in performance/memory consumption be when I loop over the file with readline()?
What is the best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load a bunch of lines (~50) into a list and then process the list?
open (or file) and urllib.urlopen look like they're more or less doing the same thing there. urllib.urlopen (basically) creates a socket._socketobject and then invokes the makefile method (the contents of that method are included below):
def makefile(self, mode='r', bufsize=-1):
    """makefile([mode[, bufsize]]) -> file object

    Return a regular file object corresponding to the socket. The mode
    and bufsize arguments are as for the built-in open() function."""
    return _fileobject(self._sock, mode, bufsize)
Does the urllib.urlopen() do something similar (with usage of a cache on disk)?
The operating system does. When you use a networking API such as urllib, the operating system and the network card will do the low-level work of splitting data into small packets that are sent over the network, and to receive incoming packets. Those are stored in a cache, so that the application can abstract away the packet concept and pretend it would send and receive continuous streams of data.
How big is the difference in performance between urllib.urlopen().readline() and file().readline()?
It is hard to compare these two. For urllib, this depends on the speed of the network, as well as the speed of the server. Even for local servers, there is some abstraction overhead, so that, usually, it is slower to read from the networking API than from a file directly.
For actual performance comparisons, you will have to write a test script and do the measurement. However, why do you even bother? You cannot replace one with another since they serve different purposes.
What is the best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load a bunch of lines (~50) into a list and then process the list?
Since the bottleneck is the networking speed, it might be a good idea to process the data as soon as you get it. This way, the operating system can cache more incoming data "in the background".
It makes no sense to cache lines in a list before processing them. Your program will just sit there waiting for enough data to arrive while it could be doing something useful already.
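A sketch of processing the response as it arrives, using the Python 3 spelling urllib.request.urlopen (the URL is a placeholder):

from urllib.request import urlopen

with urlopen('http://localhost/bigfile.txt') as response:
    for raw_line in response:           # iterate line by line as data arrives
        line = raw_line.decode('utf-8')
        # process the line immediately instead of collecting it into a list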