Reading large files from a non-local disk in Python

Sorry if this topic has already been covered; I couldn't find it.
I am trying to read a batch of large CSV files (>300 MB) with Python. The files are not located on a local drive.
I am not an expert in programming, but I assumed that copying a file to a local drive first and then reading it should take less time than reading it directly from the remote location (or am I wrong?).
The thing is that I tested both methods and the computation times are similar.
Am I missing something? Can someone explain this, or suggest a good method to read those files as fast as possible?
For copying to the local drive I am using: shutil.copy2
For reading the file: for each line in MyFile
Thanks a lot for your help,
Christophe

Copying a file means reading it sequentially and writing it to another place.
The performance of an application may vary depending on its data access patterns, the ratio of computation to I/O time, network latency, and network bandwidth.
If you execute your script once and read through the file sequentially, it is the same as copying the file, except that you perform computations on the data instead of saving it. Even if you process small chunks of data, the reads probably get buffered. In fact, for a single execution, copying first means you actually read the same data twice: once for the copy, and once more when reading the copy for the computation.
If you execute your script multiple times, then you need to check your data throughput. For example, gigabit Ethernet gives 1 Gbit/s, which is 125 MB/s; if you process data more slowly than that, you are not limited by the bandwidth.
Network latency comes into play when you send multiple requests for small chunks of data. Each read request sends a network request and gets data back after a finite time; this is the latency. If you make one request for one big chunk of data you "pay" the latency cost once; if you ask 1000 times for 1/1000 of the big chunk you "pay" it 1000 times. However, this is probably abstracted away from you by the network file system, and a sequential read will get buffered. Latency would show up if you jumped around the file at random and read small chunks of it.
You can find out what limits you by measuring how many bytes you process per second and comparing that to the limits of the hardware. If it's close to HDD speed (which in your case I bet it is not), you are bound by disk I/O. If it's lower, close to the network bandwidth, you are bound by network I/O. If it's lower still, you are bound either by the processing speed of your code or by network latency. However, if you were bound by I/O you should see a difference between the two approaches, so since you are seeing the same results, the limit is computation.
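To put numbers on this, one could time a plain sequential read and compare the resulting rate with the nominal link and disk speeds. The sketch below is only an illustration: the function names measure_read_throughput and gbit_to_mbyte_per_s and the 1 MiB chunk size are assumptions, not something from the question.

```python
import time


def gbit_to_mbyte_per_s(gbit_per_s):
    """Convert a link speed in Gbit/s to MB/s (1 byte = 8 bits)."""
    return gbit_per_s * 1000 / 8


def measure_read_throughput(path, chunk_size=1024 * 1024):
    """Read a file sequentially and return (bytes_read, seconds, MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the timer on very small files.
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, elapsed, rate
```

Running measure_read_throughput on the remote path and on a local copy, and comparing both rates with gbit_to_mbyte_per_s(1), gives a rough idea of which limit applies. A second run on the same file may be served from the OS cache, so the first run is usually the more honest figure.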

Related

Does multiprocessing speed up file transfers compared to multithreading

I am writing a script to simultaneously accept many file transfers from many computers on a subnet using sockets (around 40 jpg files total). I want to use multithreading or multiprocessing to make the transfers occur as fast as possible.
I'm wondering if this type of image transfer is limited by the CPU - and therefore I should use multiprocessing - or if multithreading will be just as good here.
I would also be curious as to what types of activities are limited by the CPU and require multiprocessing, and which are better suited for multithreading.

If the following assumptions are true:
Your script is simply receiving data from the network and writing that data to disk (more or less) verbatim, i.e. it isn't doing any expensive processing on the data
Your script is running on a modern CPU with typical modern networking hardware (e.g. gigabit Ethernet or slower)
Your script's download routines are not grossly inefficient (e.g. you are receiving reasonably-sized chunks of data and not just 1 byte at a time or something silly like that)
... then it's unlikely that your download rate will be CPU-limited. More likely the bottleneck will be either network bandwidth or disk I/O bandwidth.
In any case, since AFAICT your use-case is embarrassingly parallel (i.e. the various downloads never have to communicate or interact with each other, they just each do their own thing independently), it's unlikely that using multithreading vs multiprocessing will make much difference in terms of performance. Of course, the only way to be certain is to try it both ways and measure the throughput each way.

Short answer:
Generally, it really depends on your workload. If you're serious about performance, please provide details: for example, whether you store the images to disk, whether image sizes are > 1GB or not, etc.
Note: Again, generally, if it is not mission-critical, both ways are acceptable, since we can easily switch between multithreaded and multiprocess implementations using threading.Thread and multiprocessing.Process.
Some more comments:
It seems that I/O, not the CPU, will be the bottleneck.
Between multiprocessing and multithreading, there may be a performance difference due to the GIL and/or your implementation. You could implement both and try them. BTW, IMHO it won't differ much. I think that async I/O vs. blocking I/O will have a greater impact.
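As a rough illustration of how little code changes between the two, here is a minimal sketch. The helper names save_payload and run_transfers are hypothetical, and writing a bytes payload to disk stands in for a real socket transfer.

```python
import threading


def save_payload(path, payload):
    """Stand-in for one transfer: write the received bytes to disk."""
    with open(path, "wb") as handle:
        handle.write(payload)


def run_transfers(jobs, worker_cls=threading.Thread):
    """Start one worker per (path, payload) job and wait for all of them.

    worker_cls can be threading.Thread or multiprocessing.Process; both
    accept target/args and provide start() and join().
    """
    workers = [worker_cls(target=save_payload, args=job) for job in jobs]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Calling run_transfers(jobs) uses threads; passing worker_cls=multiprocessing.Process switches to processes, in which case the call should sit under an if __name__ == "__main__": guard on platforms that spawn new interpreters. Timing both variants on the real workload is the only reliable comparison.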

Unless your file transfer is extremely slow (slower than writing data to disk), multithreading/multiprocessing isn't going to help. By file transfer I mean downloading images and writing them to the local computer with a single HDD.
Using multithreading or multiprocessing when transferring data from several computers with separate disks can definitely improve overall download performance, simply because data on several physical disks can be read in parallel. The problem arises when you try to save these images to your local drive.
You have just a single local HDD (if no disk array is used), and a single HDD, like most hardware devices, can perform just one I/O operation at a time. So trying to write several images to disk at the same time won't improve the overall performance; it can even hamper it.
Just imagine 40 already downloaded images being written to a single mechanical HDD with a single head, to different locations (different physical files), especially if the disk is fragmented. This can even slow down the whole process, because the HDD wastes time moving its magnetic head from one position to another (drives can partially mitigate this by reordering I/O operations to limit head movement).
On the other hand, if you do some CPU-intensive preprocessing on these images and only then save them to disk, multithreading can be really helpful.
As for the question of what's preferred: on modern OSs there is no significant difference between using multithreading and multiprocessing (spawning multiple processes). OSs like Linux or Windows schedule threads, not processes, based on process and thread priorities. So there is not a big difference between 40 single-threaded processes and a single process containing 40 threads. Using multiple processes normally consumes more memory, because the OS has to allocate some extra memory (not much) for every process, but in terms of speed the difference between multithreading and multiprocessing is not significant. There are other important questions to consider when choosing which method to use: will these downloads share some data, such as a common GUI (multithreading is easier to use)? Are the files to download so big that 40 transfers can exhaust the virtual address space of a single process (use multiprocessing)?
Generally:
Multithreading - easier to use in a single application because all threads share the virtual address space of a single process and can easily communicate with each other. On the other hand, a single process has a limited virtual address space (less than 4GB on a 32-bit computer).
Multiprocessing - harder to use in a single application (it needs inter-process communication), but more scalable and more robust (if a file transfer process crashes, only a single file transfer fails), plus more virtual address space to use.

Using multiprocessing to process many files in parallel

I am trying to understand whether my way of using multiprocessing.Pool is efficient. The task that I would like to run in parallel reads a certain file, does a calculation, and then saves the results to a different file. My code looks something like this:
from multiprocessing import Pool, TimeoutError
import deepdish.io as dd

def savefile(a, b, t, c, g, e, d):
    print(a)
    dd.save(str(a), {'b': b, 't': t, 'c': c, 'g': g, 'e': e, 'd': d})

def run_many_calcs():
    num_processors = 6
    print("Num processors - %d" % num_processors)
    pool = Pool(processes=num_processors)  # start 6 worker processes
    for a in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 't', 'y', 'e', 'r', 'w']:
        pool.apply(savefile, args=(a, 4, 5, 6, 7, 8, 1))
How can I see that immediately after one process is finished in one of the processors it continues to the next file?

When considering performance of any program, you have to work out if the performance is bound by I/O (memory, disk, network, whatever) or Compute (core count, core speed, etc).
If I/O is the bottleneck, there's no point having multiple processes, a faster CPU, etc.
If the computation is taking up all the time, then it is worth investing in multiple processes, etc. "Computation time" is often diagnosed as being the problem, but on closer investigation turns out to be limited by the computer's memory bus speed, not the clock rate of the cores. In such circumstances adding multiple processes can make things worse...
Check
You can check yours by doing some performance profiling of your code (there's bound to be a whole load of profiling tools out there for Python).
My Guess
Most of the time these days it's I/O that's the bottleneck. If you don't want to profile your code, betting on a faster SSD is likely the best initial approach.
Unsolvable Computer Science Problem
The architectural features of modern CPUs (L1, L2, L3 cache, QPI, hyperthreads) are all symptoms of the underlying problem in computer design; cores are too quick for the I/O we can wrap around them.
For example, the time taken to transfer 1 byte from SDRAM to the core is exceedingly slow in comparison to the core speed. One just has to hope that the L3, L2 and L1 cache subsystems have correctly predicted the need for that 1 byte and have already fetched it ahead of time. If not, there's a big delay; that's where hyperthreading can help the overall performance of the computer's other processes (they can nip in and get some work done), but does absolutely nothing for the stalled program.
Data fetched from files or networks is very slow indeed.
File System Caching
In your case it sounds like you have 1 single input file; that will at least get cached in RAM by the OS (provided it's not too big).
You may be tempted to read it into memory yourself; I wouldn't bother. If it's large you would be allocating a large amount of memory to hold it, and if that's too big for the RAM in the machine the OS will swap some of that RAM out to the virtual memory page file anyway, and you're worse off than before. If it's small enough there's a good chance the OS will cache the whole thing for you anyway, saving you the bother.
Written files are also cached, up to a point. Ultimately there's nothing you can do if "total process time" is taken to mean that all the data is written to disk; you'd be having to wait for the disk to complete writing no matter what you did in memory and what the OS cached.
The OS's filesystem cache might give an initial impression that file writing has completed (the OS will get on with consolidating the data on the actual drive shortly), but successive runs of the same program will get blocked once that write cache is full.
If you do profile your code, be sure to run it for a long time (or repeatedly), to make sure that the measurements made by the profiler reveal the true underlying performance of the computer. If the results show that most of the time is in file.Read() or file.Write()...

Is there any advantage to reading the entire file

Are there any advantages/disadvantages to reading an entire file in one go rather than reading the bytes as required? So is there any advantage to:
file_handle = open("somefile", "rb")
file_contents = file_handle.read()
# do all the things using file_contents
compared to:
file_handle = open("somefile", "rb")
part1 = file_handle.read(10)
# do some stuff
part2 = file_handle.read(8)
# do some more stuff etc
Background: I am writing a p-code (bytecode) interpreter in Python and have initially just written a naive implementation that reads bytes from the file as required and performs the necessary actions etc. A friend I was showing the program has suggested that I should instead read the entire file into memory (Python list?) and then process it from memory to avoid lots of slow disk reads. The test files are currently less than 1KB and will probably be at most a few 100KB so I would have expected the Operating System and disk controller system to cache the file obviating any performance issues caused by repeatedly reading small chunks of the file.

Cache aside, you still have system calls. Each read() results in a mode switch to trigger the kernel. You can see this with strace or another tool to look at system calls.
This might be premature for a 100 KB file though. As always, test your code to know for sure.

If you want to do any kind of random access then putting it in a list is going to be much faster than seeking from disk. Even if the OS does cache disk access, you are hitting another layer of cache. In any case, you can't be sure how the OS will behave.
Here are 3 cases I can think of that would motivate doing it in-memory:
You might have a jump instruction which you can execute by adding a number to your program counter. Doing that to the index of an array vs seeking a file is a good use case.
You may want to optimise your VM's behaviour, and that may involve reading the file more than once. Scanning a list twice vs reading a file twice will be much quicker.
Depending on opcodes and the grammar of your language you may want to look ahead in a 'cycle' to speed up execution. If that ends up doing two seeks then this could end up degrading performance.
If your file will always be small enough to fit in RAM then it's probably worth reading it all into memory. Profile it with a real program and see if it's noticeably faster.
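The jump case is easy to picture once the bytecode sits in memory. The sketch below assumes a toy instruction set invented purely for illustration (0 = HALT, 1 = PUSH with a one-byte operand, 2 = JMP with a one-byte relative offset); it is not the p-code format from the question.

```python
def load_program(path):
    """Read the whole bytecode file with a single read() call."""
    with open(path, "rb") as handle:
        return handle.read()


def run(program):
    """Execute a toy bytecode held entirely in memory.

    Hypothetical opcodes: 0 = HALT, 1 = PUSH <byte>, 2 = JMP <offset>.
    """
    stack = []
    pc = 0  # program counter: an index into the in-memory bytes
    while pc < len(program):
        opcode = program[pc]
        if opcode == 0:      # HALT
            break
        elif opcode == 1:    # PUSH: the operand is the next byte
            stack.append(program[pc + 1])
            pc += 2
        elif opcode == 2:    # JMP: a relative jump is just arithmetic on pc
            pc += program[pc + 1]
        else:
            raise ValueError("unknown opcode %d at %d" % (opcode, pc))
    return stack
```

With the file-based approach, the JMP branch would need a seek() on the file handle instead of a simple addition, and any look-ahead would cost further reads.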

A single call to read() will be faster than multiple calls to read(). The tradeoff is that with a single call you must be able to fit all data in memory at once, whereas with multiple reads you only have to retain a fraction of the total amount of data. For files that are just a few kilobytes or megabytes, the difference won't be noticeable. For files that are several gigs in size, memory becomes more important.
Also, to do a single read means all of the data must be present, whereas multiple reads can be used to process data as it is streaming in from an external source.

If you are looking for performance, I would recommend using generators. Since your files are small, memory would not be a big concern, but it's still good practice. Still, reading a file from disk multiple times is a definite bottleneck for a scalable solution.
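A generator-based reader might look like the following minimal sketch. The name read_chunks and the 64 KiB default are assumptions chosen for illustration.

```python
def read_chunks(path, chunk_size=64 * 1024):
    """Yield successive chunks of a binary file without loading it whole."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return
            yield chunk
```

A caller can then write for chunk in read_chunks("somefile"): and only one chunk is held in memory at a time, which also suits data that is still arriving from an external source.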

How does urllib.urlopen() work?

Let's consider a big file (~100MB). Let's consider that the file is line-based (a text file, with relatively short line ~80 chars).
If I use the built-in open()/file(), the file will be loaded in a lazy manner.
I.e., if I do aFile.readline(), only a chunk of the file will reside in memory. Does the urllib.urlopen() do something similar (with usage of a cache on disk)?
How big is the difference in performance between urllib.urlopen().readline() and file().readline()? Let's say the file is located on localhost, and I open it once with urllib.urlopen() and once with file(). How big will the difference in performance/memory consumption be when I loop over the file with readline()?
What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?

open (or file) and urllib.urlopen look like they're more or less doing the same thing there. urllib.urlopen is (basically) creating a socket._socketobject and then invoking the makefile method (contents of that method included below)
def makefile(self, mode='r', bufsize=-1):
    """makefile([mode[, bufsize]]) -> file object

    Return a regular file object corresponding to the socket. The mode
    and bufsize arguments are as for the built-in open() function."""
    return _fileobject(self._sock, mode, bufsize)

Does the urllib.urlopen() do something similar (with usage of a cache on disk)?
The operating system does. When you use a networking API such as urllib, the operating system and the network card do the low-level work of splitting data into small packets that are sent over the network, and of receiving incoming packets. Those are stored in a cache, so that the application can abstract away the packet concept and pretend that it sends and receives continuous streams of data.
How big is the difference in performance between urllib.urlopen().readline() and file().readline()?
It is hard to compare these two. For urllib, this depends on the speed of the network, as well as the speed of the server. Even for local servers, there is some abstraction overhead, so that, usually, it is slower to read from the networking API than from a file directly.
For actual performance comparisons, you will have to write a test script and do the measurement. However, why do you even bother? You cannot replace one with another since they serve different purposes.
What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?
Since the bottleneck is the networking speed, it might be a good idea to process the data as soon as you get it. This way, the operating system can cache more incoming data "in the background".
It makes no sense to cache lines in a list before processing them. Your program will just sit there waiting for enough data to arrive while it could be doing something useful already.
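Processing lines as they arrive could be sketched like this. Note the assumptions: the example uses urllib.request.urlopen, the Python 3 counterpart of the urllib.urlopen discussed above, and count_matching_lines is a hypothetical helper whose "processing" is just a substring check.

```python
from urllib.request import urlopen


def count_matching_lines(url, needle):
    """Stream a resource line by line and count lines containing needle."""
    matches = 0
    with urlopen(url) as response:
        for raw_line in response:  # lines are bytes, read incrementally
            if needle in raw_line:
                matches += 1
    return matches
```

Each line is handled as soon as it has been read, so the program never holds more than one line of its own while the operating system keeps buffering incoming data.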

How should I optimize this filesystem I/O bound program?

I have a python program that does something like this:
1. Read a row from a csv file.
2. Do some transformations on it.
3. Break it up into the actual rows as they would be written to the database.
4. Write those rows to individual csv files.
5. Go back to step 1 unless the file has been totally read.
6. Run SQL*Loader and load those files into the database.
Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.
There are a few ideas that I have to solve this:
Read the entire file from step one (or at least read it in very large chunks) and write the file to disk as a whole or in very large chunks. The idea being that the hard disk would spend less time going back and forth between files. Would this do anything that buffering wouldn't?
Parallelize steps 1, 2&3, and 4 into separate processes. This would make steps 1, 2, and 3 not have to wait on 4 to complete.
Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with step 2 somehow.
Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?

Poor man's map-reduce:
Use split to break the file up into as many pieces as you have CPUs.
Use batch to run your muncher in parallel.
Use cat to concatenate the results.

Python already does I/O buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long, unless you force the OS to write the data immediately, for example by closing the file after each write or opening the file in O_SYNC mode.
If the OS isn't doing the right thing, you can try raising the buffer size (third parameter to open()). For some guidance on appropriate values: given a 100MB/s, 10ms latency I/O system, a 1MB I/O size will result in approximately 50% latency overhead, while a 10MB I/O size will result in 9% overhead. If it's still I/O bound, you probably just need more bandwidth. Use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
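The overhead figures follow from treating each request as a fixed latency plus a transfer time of size divided by bandwidth. A small sketch of that arithmetic, where the function name latency_overhead is an assumption and 1 MB is taken as 10**6 bytes:

```python
def latency_overhead(io_size_bytes, bandwidth_bytes_per_s, latency_s):
    """Fraction of each request's total time that is spent on latency."""
    transfer_s = io_size_bytes / bandwidth_bytes_per_s
    return latency_s / (latency_s + transfer_s)


MB = 10 ** 6
for size in (1 * MB, 10 * MB):
    share = latency_overhead(size, 100 * MB, 0.010)
    print("%2d MB request: %.0f%% latency overhead" % (size // MB, share * 100))
```

For the 1 MB case the transfer takes 10 ms, equal to the latency, hence the roughly 50% share; for 10 MB the transfer takes 100 ms, so latency is 10 ms out of 110 ms.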
Also useful is to check if step 4 is taking a lot of time executing or waiting on IO. If it's executing you'll need to spend more time checking which part is the culprit and optimize that, or split out the work to different processes.

If you are I/O bound, the best way I have found to optimize is to read or write the entire file into/out of memory at once, then operate out of RAM from there on.
With extensive testing I found that my runtime ended up bound not by the amount of data I read from/wrote to disk, but by the number of I/O operations I used to do it. That is what you need to optimize.
I don't know Python, but if there is a way to tell it to write the whole file out of RAM in one go, rather than issuing a separate I/O for each byte, that's what you need to do.
Of course the drawback to this is that files can be considerably larger than available RAM. There are lots of ways to deal with that, but that is another question for another time.

Can you use a ramdisk for step 4? Low millions sounds doable if the rows are less than a couple of kB or so.

Use buffered writes for step 4.
Write a simple function that simply appends the output onto a string, checks the string length, and only writes when you have enough which should be some multiple of 4k bytes. I would say start with 32k buffers and time it.
You would have one buffer per file, so that most "writes" won't actually hit the disk.
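A minimal sketch of such a per-file buffer follows. The class name RowBuffer is hypothetical, the threshold is counted in characters rather than bytes for simplicity, and Python's own open() already buffers writes, so whether this helps should be timed rather than assumed.

```python
class RowBuffer:
    """Collect output text in memory and append it to a file in large writes."""

    def __init__(self, path, threshold=32 * 1024):
        self.path = path
        self.threshold = threshold  # flush once this many characters are pending
        self._parts = []
        self._size = 0

    def write(self, text):
        """Queue text; only touch the file once the threshold is reached."""
        self._parts.append(text)
        self._size += len(text)
        if self._size >= self.threshold:
            self.flush()

    def flush(self):
        """Append everything queued so far to the file in one write call."""
        if self._parts:
            with open(self.path, "a", encoding="utf-8") as handle:
                handle.write("".join(self._parts))
            self._parts = []
            self._size = 0
```

One RowBuffer would be created per output file, and flush() must be called on each of them at the end so that the last partial buffer is not lost.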

Isn't it possible to collect a few thousand rows in ram, then go directly to the database server and execute them?
This would remove the save to and load from the disk that step 4 entails.
If the database server is transactional, this is also a safe way to do it - just have the database begin before your first row and commit after the last.
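The idea can be sketched with the standard-library sqlite3 module as a stand-in; the question targets a database loaded via SQL*Loader, so the driver, the placeholder syntax, and the records table with its id and payload columns are all assumptions made for illustration.

```python
import sqlite3


def load_rows(connection, rows, batch_size=1000):
    """Insert rows in batches inside a single transaction."""
    with connection:  # commits on success, rolls back on an exception
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            connection.executemany(
                "INSERT INTO records (id, payload) VALUES (?, ?)", batch
            )
```

Because everything runs inside one transaction, a failure part-way through leaves the table as it was before the first row.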

The first thing is to be certain of what you should optimize. You seem to not know precisely where your time is going. Before spending more time wondering, use a performance profiler to see exactly where the time is going.
http://docs.python.org/library/profile.html
When you know exactly where the time is going, you'll be in a better position to know where to spend your time optimizing.
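As a starting point, a function can be run under the standard cProfile module and the most expensive calls printed. The wrapper name profile_call is hypothetical, and limiting the report to the top 10 entries is an arbitrary choice.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the slowest call chains appear first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

If the report is dominated by file write calls, the time is going to I/O; if it is dominated by the transformation code, buffering changes are unlikely to help much.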
