Which is faster: opening one large file and reading it completely into a list once, or opening several smaller files whose combined size equals that of the large file, loading each one into a list, and manipulating them one by one?
Is the difference in time large enough to impact my program?
A total time difference of less than 30 seconds is negligible for me.
It depends on whether your data fits in your available memory. If you need to resort to paging, or virtual memory, then opening a single giant file might become slower than opening several smaller files. This is even more true if the computation you need to make creates intermediate variables that won't fit in physical RAM either.
So, as long as the file fits comfortably in RAM, a single open will be faster; if it does not, then many opens may be faster.
Finally, note that if you can split the work across many files, you might be able to open them in parallel and process the various parts in different processes, which might make things faster again.
Obviously one open and close is going to be faster than n opens and closes if you are reading the same amount of data. Plus, when reading a single file the I/O classes you use can take advantage of things like buffering, etc, which makes it even faster.
If you are reading the file sequentially from start until end, one open/close is faster than multiple open/close operations.
However keep in mind that if you need to do a lot of seeking in your 1 big file, then maybe storing separate files won't be slower in that case.
Also keep in mind that no matter which approach you are using, you shouldn't read the entire file in at once. Do it in chunks.
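As a minimal sketch of what "in chunks" can look like in Python: the generator below yields fixed-size blocks from a file, so only one block is held in memory at a time. The names `read_in_chunks` and `count_bytes` and the 1 MiB default are illustrative choices, not anything from the question.

```python
def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield successive blocks of at most chunk_size bytes from the file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            yield chunk


def count_bytes(path, chunk_size=1024 * 1024):
    """Example consumer: total file size computed one chunk at a time."""
    return sum(len(chunk) for chunk in read_in_chunks(path, chunk_size))
```

The same loop works whether the data lives in one large file or in several small ones; only the list of paths you iterate over changes.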
Working with a single file is almost certainly going to be faster: you have to read the same amount of data in both cases, but when working with multiple files, you have that much more housekeeping operations slowing you down.
Additionally, you can read data from a single file at the maximum speed the disk can handle, using the disk buffer to the maximum etc., whereas with multiple files, the disk head does a lot more dancing jumping from file to file.
A 30-second time difference? Define "large". Anything that fits into an average computer's RAM would probably not take much more than 30 seconds to read in total.
Why do you think you need to read the file(s) into a list?
If you can open several small files and process each independently, then surely that means:
(a) that you don't need to read into a list, you can process any file (including 1 large file) a line at a time (avoiding running-out-of-real-memory problems)
or
(b) what you need to do is more complicated than you have told us.
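Option (a) could look roughly like the sketch below, which relies on the fact that a Python file object yields one line at a time when iterated. The function name `count_matching_lines` and the `predicate` callback are hypothetical placeholders for whatever per-line work you actually need.

```python
def count_matching_lines(path, predicate):
    """Process a file one line at a time without building a list of lines."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object reads lazily, line by line
            if predicate(line.rstrip("\n")):
                count += 1
    return count
```

For example, `count_matching_lines("big.txt", lambda s: "ERROR" in s)` would scan a file of any size while holding roughly one line in memory (`"big.txt"` is a made-up name).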
Related
I'm trying to write a program that parses data from a (very) large file that contains even rows of 8 sets of 16 bit hex values. For instance, one row would look like this:
edfc b600 edfc 2102 81fb 0000 d1fe 0eff
The data files are expected to be anywhere between 1-4 TB, so I wasn't sure what the best approach would be. If I load this file using Python's open() function, could this turn out badly? I'm worried about how much of an impact this will have on my memory if I'm loading such a large file just to index through. Alternatively, if there's a method I can use to load just the section of data I want from the file, that would be ideal, but as far as I know, I don't think that's even possible. Is this correct?
Anyway, some idea of how to approach this very general problem would be much appreciated!
Found an answer from Github. In numpy, there's a function called memmap that works for what I'm doing.
samples = np.memmap("hexdump_samples", mode="r", dtype=np.int16)[100:159]
This didn't cause any issues with the smaller data set I was using. As far as I understand, it shouldn't cause memory problems with the larger files either, because only the slices that are actually accessed get read from disk.
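One caveat worth checking: `np.memmap` with `dtype=np.int16` reinterprets the raw bytes of the file, so it matches the data only if the file is binary. If the file really is text laid out like the example row, a standard-library alternative is to seek straight to the rows you want. The sketch below assumes every row is exactly 8 four-digit hex values separated by single spaces and terminated by a single `\n` (40 bytes per row), with no header; `ROW_BYTES` and `read_rows` are names made up for illustration.

```python
ROW_BYTES = 8 * 4 + 7 + 1  # 8 four-digit hex values, 7 spaces, 1 newline (assumed layout)


def read_rows(path, first_row, n_rows):
    """Return n_rows rows starting at first_row (0-based) as lists of ints."""
    with open(path, "rb") as f:
        f.seek(first_row * ROW_BYTES)       # jump directly to the first wanted row
        data = f.read(n_rows * ROW_BYTES)   # read only the requested span
    rows = []
    for line in data.decode("ascii").splitlines():
        rows.append([int(token, 16) for token in line.split()])
    return rows
```

Because only `n_rows * ROW_BYTES` bytes are read, memory use does not depend on the total file size. If the row width varies (for example `\r\n` line endings), the offset arithmetic has to be adjusted.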
It depends on your computer hardware and how much RAM you have. Python is an interpreted language with a bunch of safeguards, but I wouldn't risk trying to open that file with Python. I would recommend using C or C++; they are good with large amounts of data and memory management. You can then parse the data in bite-sized chunks, maybe 16 MB per chunk. Python is extremely slow and memory-inefficient compared to C.
I want to do some heavy batch processing and write results to memory using Python.
Since I will be writing ~30 million records, I/O becomes significant. My strategy is to create the file handle object once in a class constructor, then call f.write in a loop.
Should I implement batching logic myself, or rely on the one implicitly employed by the write method?
I can observe that some buffering is happening implicitly by periodically running wc -l on the output. It doesn't go up linearly, instead it goes up every ~5 minutes or so by ~20k lines. Therefore I can only assume that some batching happens internally. Should I therefore assume that my I/O is already optimized?
Alternatively I could:
1. Append my strings to a temporary list until a certain batch size is reached
2. Join the list with "\n".join(l) and write it with a single call
3. Clear the list and continue
My logic would be a bit more convoluted than the overview above, because the business logic that yields the strings to be written also runs in batch mode and utilises the GPU, which is why I am asking whether the above is worth it before attempting it. Also, if you do recommend that approach, I would appreciate a ballpark figure to try for the batch size in step 1. My RAM can handle 100k records; would that be optimal?
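For reference, the two options can be combined in a few lines. File objects returned by Python's `open()` are buffered by default, and the `buffering` argument sets the buffer size in bytes; the sketch below adds the manual batching from the three steps on top of that. The function name `write_records` and the default values for `batch_size` and `buffer_bytes` are arbitrary placeholders to tune by measurement, not recommendations.

```python
def write_records(path, records, batch_size=10_000, buffer_bytes=1024 * 1024):
    """Write an iterable of strings, one per line, in explicit batches.

    Returns the number of records written.
    """
    batch = []
    written = 0
    # buffering > 1 sets the size of the underlying write buffer in bytes
    with open(path, "w", encoding="utf-8", buffering=buffer_bytes) as f:
        for record in records:
            batch.append(record)
            if len(batch) >= batch_size:
                f.write("\n".join(batch) + "\n")  # one write call per batch
                written += len(batch)
                batch.clear()
        if batch:  # flush the final, partially filled batch
            f.write("\n".join(batch) + "\n")
            written += len(batch)
    return written
```

Timing this against a plain `f.write(record + "\n")` loop on a realistic sample is the only reliable way to tell whether the extra batching logic pays off for your workload.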
I'm working on an application where it is convenient to store many (around 10^6-10^9) small files (<1K-10K) on disk. I need to iterate over them multiple times for a computation. Would combining them into a single (possibly compressed) file significantly reduce the IO bottleneck? I haven't measured different solutions yet, but would like to get pointers on what to expect. This application can easily become IO bound.
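To make the single-file idea concrete, here is one possible sketch (without compression): concatenate the small files into one blob and keep a separate index of offsets and lengths, so that iterating becomes one sequential read. The names `pack_files` and `iter_packed` and the JSON index are assumptions for illustration only; with 10^9 entries the index itself would need a more compact format, and keying by base name assumes the names are unique.

```python
import json
import os


def pack_files(paths, blob_path, index_path):
    """Concatenate files into blob_path and record (offset, length) per file."""
    index = {}
    with open(blob_path, "wb") as blob:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()
            index[os.path.basename(path)] = (blob.tell(), len(data))
            blob.write(data)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index


def iter_packed(blob_path, index_path):
    """Yield (name, bytes) for every packed file, in on-disk order."""
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    with open(blob_path, "rb") as blob:
        for name, (offset, length) in sorted(index.items(), key=lambda kv: kv[1][0]):
            blob.seek(offset)
            yield name, blob.read(length)
```

Whether this actually removes the bottleneck depends on the file system and storage device, so it is worth timing against the many-files layout on a representative sample.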
At the moment, I am doing a file comparison on 2 CSV files, checking for duplicate lines in each specific file, checking for data mismatches between the files, and checking for missing data rows in each file.
Currently, I am doing this in memory, built for speed because this will be processing thousands of files constantly. This comes at a price though, it can only process files it can completely store in memory.
I am looking to add a fallback so that the comparison can still be done if, for some reason, the files can't fit in memory (although this should never happen).
What would be a good approach to do this?
Use pandas. Can't beat it for data analysis in python.
https://pandas.pydata.org/pandas-docs/stable/10min.html
Comes complete with a
read_csv(filepath, skiprows=100000, nrows=9999999)
method that loads the specified rows.
It's built on numpy, most of whose functions are implemented in C, making them incredibly fast.
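Independently of pandas, one possible fallback is to stream each file and keep only a small fixed-size digest per line instead of the line itself. This is a sketch of that idea using only the standard library, not the approach from the answer above: it reduces memory use but does not bound it, it treats each physical line as one row (so quoted CSV fields containing newlines would break it), and the names `scan` and `compare` are hypothetical.

```python
import hashlib


def _digest(line):
    """16-byte fingerprint of a line, ignoring the trailing line ending."""
    return hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=16).digest()


def scan(path):
    """Return (set of line digests, 1-based numbers of lines repeating an earlier line)."""
    seen = set()
    duplicates = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            d = _digest(line)
            if d in seen:
                duplicates.append(lineno)
            else:
                seen.add(d)
    return seen, duplicates


def compare(path_a, path_b):
    """Report duplicate lines per file and how many distinct rows each file lacks."""
    seen_a, dup_a = scan(path_a)
    seen_b, dup_b = scan(path_b)
    return {
        "duplicates_a": dup_a,
        "duplicates_b": dup_b,
        "only_in_a": len(seen_a - seen_b),
        "only_in_b": len(seen_b - seen_a),
    }
```

Finding which fields mismatch between rows that share a key would need a second pass keyed on that column; that part is left out here because the question does not say how rows are matched.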
I intend to use Python's multiprocessing capabilities to read a set of small files. However, this seems awkward to me: if the disk is rotational, then the bottleneck is the rotation time, so even though I use multiple processes, the total read time should be similar to that of a single-process read. Am I wrong? What are your comments?
In addition, do you think using multiprocessing might cause intertwined reading of the files, so that the contents of these files are skewed in some way?
Your reasoning is sound, but the only way to find out for sure is by benchmarking (that said, it is unlikely that reading many small files in parallel will increase performance over reading them sequentially).
I am not entirely sure what you mean by "intertwined reading", but unless there are bugs in your code or the files are being changed while you're reading them, you will get exactly the same contents irrespective of how you read them.
You are indeed right, the bottleneck will be disk-IO.
However, the only way to really know is to measure both approaches.
If you have influence on the files, you could go for one larger file as opposed to many smaller files.
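The measurement both answers recommend could be sketched as follows. Note that this uses a thread pool rather than the `multiprocessing` module from the question, purely to keep the example short; blocking file reads release the GIL in CPython, so threads are enough to issue reads concurrently. The function names and the `workers=4` default are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read a whole file and return its size in bytes."""
    with open(path, "rb") as f:
        return len(f.read())


def time_sequential(paths):
    """Read all files one after another; return (total bytes, elapsed seconds)."""
    start = time.perf_counter()
    total = sum(read_file(p) for p in paths)
    return total, time.perf_counter() - start


def time_threaded(paths, workers=4):
    """Read all files through a thread pool; return (total bytes, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start
```

When comparing the two timings, keep in mind that the operating system caches recently read files, so a second pass over the same files will usually be much faster than the first regardless of the method. Measuring on files that have not been read recently gives a fairer picture of disk behaviour.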