I need to add some lines to my code that show the maximum memory used while the code is running. My code opens a CSV file, processes it, and writes a new CSV file. I need to know the maximum memory that was allocated at any point during the run.
I need this to compare different alternatives in terms of memory usage. I have tried memory_usage() and df.memory_usage(deep=True).max(), but I don't know what the numbers they generate actually mean.
I want to be able to say, for example: this code, processing this CSV file, allocated (say) 12 MB of RAM at the most memory-consuming moment of the run.
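For illustration, here is a minimal sketch of the kind of measurement I'm after, using the standard-library tracemalloc module. (Note that tracemalloc reports the peak of Python-level allocations in bytes, which is not exactly the same as the process's total RAM usage.)

import tracemalloc

tracemalloc.start()

# ... open the CSV, process it, write the new CSV ...

current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory allocated: {peak / 1024 / 1024:.1f} MB")
tracemalloc.stop()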
I'm preprocessing a large volume of raw files for future analysis, where they'll be read back sequentially. The raw files are on a network file server and the processed files are being written to a local external USB drive (12 TB HDD, ReFS w/ 4k clusters). Processed files are ~100 KB each and I anticipate accessing them all a few times per year, sorted by filename. There are 60+ million files.
My general question is: how important is it that the preprocessed files are written in the same order they'll be read? Details follow...
Python's multiprocessing module is being used for the preprocessing to max out I/O and CPU. I have a loop within os.walk that lists N=1000 files from the fileserver, then passes that list into Pool.map(), which maps the work to a function that processes the files one by one and writes them to the USB drive. This seems to be working well from a performance standpoint. However, I've noticed that the files are not being processed and written in perfect sequential order, which I assume is because the pool workers are uncoordinated. It's fairly close but not exact. Here's a fictional sample (pretend these are filenames) listed in the order they're written to the USB drive:
1
4
2
3
6
5
9
11
7
10
8
12
17
21
In the future I'll be reading these back in small batches of "adjacent filenames" (N = 16 or 32) like this:
(1, 2, 3, ..., 14, 15, 16). All files will be read sequentially, but they're processed in batches (feeding them into a neural network). I assume this reading will be single-threaded, so it's actually requesting one file at a time from the OS, although I may end up relying on a data loader in TensorFlow or PyTorch, which could do multiple reads in parallel... I'm still learning about that part and am not sure how it will work exactly. But the gist of it is that the files will be read sequentially based on their filename.
My questions are:
How important is it that the files be written in the exact same order they'll be read? Am I causing excessive disk thrashing even though the files are written in approximate filename order such that most files within a given "read-batch" are "close together" on the disk platters? I'm worried about decreased performance and increased risk of drive failure.
If I do want to address this, how would I do so? My untested noob ideas are below:
Have the pool return the processed files as in-memory objects, and then use single-threaded code to write them to disk (see the sketch after these ideas). This would consume more memory, forcing me to decrease the number and/or depth of the pools, leading to greater overhead. (I assume the single-threaded writing would not become a bottleneck.)
Perhaps this single-threaded write activity could be parallelized next to the os.walk? They're hitting different I/O channels and neither is CPU intensive.
My second thought was to somehow coordinate the pools to get them to write files in sequential order, but I'm not sure how to do that or if it's even possible or desirable.
Third idea is to have the pools write files to the system's SSD instead of the USB drive, then at some regular interval use single-threaded code to move them to the external USB drive in the desired order. Essentially treating the SSD like a large buffer.
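Here's a rough, untested sketch of the first idea; the preprocessing itself is a placeholder, and SRC_DIR/DST_DIR are made-up paths. Pool.imap (unlike imap_unordered) yields results in input order, so a single writer loop in the parent process writes the files sequentially even though the workers finish out of order:

import multiprocessing
import os

SRC_DIR = "/mnt/fileserver/raw"   # placeholder for the file server path
DST_DIR = "/mnt/usb/processed"    # placeholder for the USB drive path

def preprocess(src_name):
    # Worker: read one raw file and return (name, processed bytes).
    with open(os.path.join(SRC_DIR, src_name), "rb") as f:
        data = f.read()
    processed = data  # placeholder for the real preprocessing step
    return src_name, processed

if __name__ == "__main__":
    batch = sorted(os.listdir(SRC_DIR))[:1000]  # one batch of ~1000 names
    with multiprocessing.Pool() as pool:
        # imap preserves input order, so writes happen in filename order
        for name, processed in pool.imap(preprocess, batch, chunksize=16):
            with open(os.path.join(DST_DIR, name), "wb") as out:
                out.write(processed)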
Anyways, if anyone has any thoughts on how to approach this, I'd love to hear them!
I have a function that does the following:
Takes a file as input and does basic cleaning.
Extracts the required items from the file and then writes them into a pandas dataframe.
The dataframe is finally converted to CSV and written into a folder.
This is the sample code:
import multiprocessing
import os

import pandas as pd  # used in the elided extraction step

def extract_function(filename):
    with open(filename, 'r') as f:
        input_data = f.readlines()
    try:
        # some basic searching, pattern matching, extracting
        # dataframe creation with 10 columns; the extracted values are
        # filled into the empty dataframe
        # finally df.to_csv()
        pass
    except Exception:
        pass

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    filenames = os.listdir("/home/Desktop/input")
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()
The total number of files in the input folder is 4000. I used multiprocessing, as running the program normally with a for loop was taking some time. Below are the execution times of both approaches:
Normal CPU processing = 139.22 seconds
Multiprocessing = 18.72 seconds
My system specifications are:
Intel i5 7th gen, 12 GB RAM, 1 TB HDD, Ubuntu 16.04
While running the program on the 4000 files, all the cores were fully used (averaging around 90% each). So I decided to increase the input size and repeat the process, this time going from 4000 files to 120,000. But now the CPU usage was erratic at the start, and after some time the utilization dropped (average usage around 10% per core). RAM utilization was also low, averaging 4 GB at most (the remaining 8 GB free).
With the 4000-file input, the CSV writing was fast: I could see a jump of 1000 files or more at an instant. But with the 120,000-file input, the writing slowed to around 300 files at a time, the slowdown progressed roughly linearly, and after some time only 50-70 files were written per instant. All this time the majority of the RAM was free. I restarted the machine and tried the same thing to clear any unwanted zombie processes, but the result was the same.
What is the reason for this? How can I achieve the same multiprocessing performance for a large number of files?
Note:
* Each input file averages around 300 KB in size.
* Each output file written is around 200 bytes.
* The total number of files is 4080, so the total input size is ~1.2 GB.
* The same 4080 files were copied to produce the 120,000-file input.
* This program is an experiment to check multiprocessing with a large number of files.
Update 1
I have tried the same code on a much more powerful machine:
Intel i7 8th gen 8700, 1 TB SSHD & 60 GB RAM.
The file writing was much faster than on the normal HDD. The program took:
For 4,000 files - 3.7 sec
For 120,000 files - 2 min
At some point during the experiment I got the fastest completion time, 84 sec, and at that point it gave me consistent results on two consecutive runs. Thinking this was because I had set the thread factor for the pool size correctly, I restarted and tried again. But this time it was much slower. To give some perspective, during normal runs around 3000-4000 files are written in a second or two, but this time fewer than 600 files were being written per second. In this case, too, the RAM was barely being used, and even though the multiprocessing module was in use, all the cores averaged only around 3-7% utilization.
Reading from and writing to disk is slow, compared to running code and data from RAM. It is extremely slow compared to running code and data from the internal cache in the CPU.
In an attempt to make this faster, several caches are used.
1. A hard disk generally has a built-in cache. In 2012 I did some write testing on this: with the hard disk's write cache disabled, writing speed dropped from 72 MiB/s to 12 MiB/s.
2. Most operating systems today use otherwise unoccupied RAM as a disk cache.
3. The CPU has several levels of built-in caches as well.
(Usually there is a way to disable caches 1 and 2. If you try that you'll see read and write speed drop like a rock.)
So my guess is that once you pass a certain number of files, you exhaust one or more of the caches, and disk I/O becomes the bottleneck.
To verify, you would have to add code to extract_function to measure 3 things:
How long it takes to read the data from disk.
How long it takes to do the calculations.
How long it takes to write the CSV.
Have extract_function return a tuple of those three numbers, and analyse them. Instead of map, I would advise using imap_unordered, so you can start evaluating the numbers as soon as they become available.
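A rough sketch of what that instrumentation could look like; the extraction itself is still elided (replaced by a placeholder dataframe), and the output path is made up:

import multiprocessing
import os
import time

import pandas as pd

def extract_function(filename):
    t0 = time.perf_counter()
    with open(filename, "r") as f:
        input_data = f.readlines()
    t1 = time.perf_counter()

    # ... searching / pattern matching / dataframe creation goes here ...
    df = pd.DataFrame({"line": input_data})  # placeholder for the real extraction
    t2 = time.perf_counter()

    df.to_csv(filename + ".out.csv", index=False)
    t3 = time.perf_counter()

    # (read time, calculation time, write time)
    return t1 - t0, t2 - t1, t3 - t2

if __name__ == "__main__":
    input_dir = "/home/Desktop/input"
    filenames = [os.path.join(input_dir, n) for n in os.listdir(input_dir)]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        # imap_unordered yields each result as soon as that file is done
        for read_t, calc_t, write_t in pool.imap_unordered(extract_function, filenames):
            print(f"read {read_t:.3f}s  calc {calc_t:.3f}s  write {write_t:.3f}s")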
If disk I/O turns out to be the problem, consider using an SSD.
As @RolandSmith and @selbie suggested, I avoided the continuous I/O of writing to CSV files by appending to data frames instead. I think this cleared up the inconsistencies. I checked the "feather" and "parquet" high-performance I/O modules suggested by @CoMartel, but I think they are for compressing large files into a smaller data frame structure; the appending options I needed were not there.
Observations:
The program runs slowly on the first run; successive runs are faster. This behavior was consistent.
I checked for lingering Python processes after the program completed but couldn't find any. So there is some kind of caching in the CPU/RAM that makes execution faster on successive runs.
The program took 72 sec for the first-time execution with 4000 input files and then an average of 14-15 sec for all successive runs after that.
Restarting the system clears those caches and causes the program to run slower again on the first run.
The average fresh-run time is 72 sec. But killing the program as soon as it starts and then running it again took 40 sec for the first run after termination, and then an average of 14 sec for all successive runs.
During a fresh run, core utilization sits around 10-13%. After the successive runs, core utilization reaches 100%.
I checked with the 120,000 files and it follows the same pattern. So, for now, the inconsistency is solved. If such code needs to be used as a server, a dry run should be made so the CPU/RAM caches are warm before it starts accepting API queries, to get faster results.
I have software that writes out large files of a known size. Multiple instances of the software will run in parallel. I want to avoid running out of storage. Since each instance knows how much data it will produce, this should be possible. However, if I just open and write to the files in the usual way, they will grow incrementally, whereas I want to allocate the full size at the beginning and then fill the files from the beginning.
Will one of the following work?
Create a file of all zeros with the given size. Then open it again and write to it.
Create a file of all zeros with the given size. Then seek to the beginning and write from there. (See the sketch below.)
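Here is a minimal sketch of the second option as I imagine it; the filename and size are placeholders. (Note that on many filesystems truncate() produces a sparse file, so the blocks may not actually be reserved up front; os.posix_fallocate is a POSIX-only way to force real allocation.)

import os

FULL_SIZE = 100 * 1024 * 1024  # known final size in bytes (example value)

with open("output.bin", "wb") as f:
    f.truncate(FULL_SIZE)       # extend the file to its full size immediately
    # os.posix_fallocate(f.fileno(), 0, FULL_SIZE)  # POSIX-only alternative
    f.seek(0)                   # back to the beginning
    f.write(b"actual data...")  # fill in the real contents from here on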
I'm trying to read in a somewhat large dataset using pandas read_csv or read_stata functions, but I keep running into Memory Errors. What is the maximum size of a dataframe? My understanding is that dataframes should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?
For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200MB as dta and around 1.2GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
I'm going to post this answer since it was discussed in the comments. I've seen this come up numerous times without an accepted answer.
The MemoryError is intuitive - you're out of memory. But sometimes debugging or solving this error is frustrating, because you seem to have enough memory and yet the error remains.
1) Check for code errors
This may be a "dumb step" but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something from the os module that will search your entire computer and put the output in an Excel file).
2) Make your code more efficient
This goes along the lines of Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open-source languages!
3) Check The Total Memory of the object
The first step is to check the memory of an object. There are a ton of threads on Stack about this, so you can search them. Popular answers are here and here
To find the size of an object in bytes you can always use sys.getsizeof():
import sys
print(sys.getsizeof(OBJECT_NAME_HERE))
Now the error might happen before anything is created, but if you read the csv in chunks you can see how much memory is being used per chunk.
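For instance, a small sketch of reading a CSV in chunks and checking the memory footprint of each chunk; the filename and chunk size here are placeholders:

import pandas as pd

for i, chunk in enumerate(pd.read_csv("data.csv", chunksize=100_000)):
    mem_mb = chunk.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"chunk {i}: {len(chunk)} rows, ~{mem_mb:.1f} MB in memory")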
4) Check the memory while running
Sometimes you have enough memory but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check their documentation.
Use the code below to see the documentation straight from a Jupyter Notebook:
%mprun?
%memit?
Sample use:
%load_ext memory_profiler
def lol(x):
    return x

%memit lol(500)
#output --- peak memory: 48.31 MiB, increment: 0.00 MiB
If you need help with magic functions, this is a great post.
5) This one may belong first... but check for simple things like the bit version
As in your case, simply switching the version of Python you were running solved the issue.
Usually the above steps solve my issues.
Is it faster to open one large file once and read it completely into a list, or to open smaller files whose total size equals the large file and load them into a list and manipulate them one by one?
Which is faster? Is the difference in time large enough to impact my program?
A total time difference of less than 30 seconds is negligible for me.
It depends on whether your data fits in your available memory. If you need to resort to paging, or virtual memory, then opening a single giant file might become slower than opening several smaller files. This will be even more true if the computation you need to make creates intermediate variables that won't fit in physical RAM either.
So, as long as the file is not that big, a single open will be faster; but if that is not true, then many opens may be faster.
Finally, note that if you can do many opens, you might be able to do them in parallel and process the various parts in different processes, which might make things faster again.
Obviously one open and close is going to be faster than n opens and closes if you are reading the same amount of data. Plus, when reading a single file the I/O classes you use can take advantage of things like buffering, etc, which makes it even faster.
If you are reading the file sequentially from start until end, one open/close is faster than multiple open/close operations.
However keep in mind that if you need to do a lot of seeking in your 1 big file, then maybe storing separate files won't be slower in that case.
Also keep in mind that no matter which approach you are using, you shouldn't read the entire file in at once. Do it in chunks.
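For example, a minimal sketch of chunked reading; the filename and chunk size here are arbitrary:

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; tune to taste

with open("big_file.bin", "rb") as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        # process this chunk before reading the next one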
Working with a single file is almost certainly going to be faster: you have to read the same amount of data in both cases, but when working with multiple files, you have that much more housekeeping operations slowing you down.
Additionally, you can read data from a single file at the maximum speed the disk can handle, using the disk buffer to the maximum, etc., whereas with multiple files the disk head does a lot more dancing, jumping from file to file.
A 30 sec time difference? Define large. Anything that fits into an average computer's RAM would probably not take much more than 30 sec in total.
Why do you think you need to read the file(s) into a list?
If you can open several small files and process each independently, then surely that means:
(a) that you don't need to read into a list; you can process any file (including one large file) a line at a time, avoiding running-out-of-real-memory problems (see the sketch below)
or
(b) what you need to do is more complicated than you have told us.
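To illustrate point (a), a minimal sketch of processing a file one line at a time, which keeps memory use flat regardless of file size; the filename and the per-line work are placeholders:

def process_line(line):
    # placeholder for whatever per-line work is needed
    return line.strip()

with open("large_file.txt", "r") as f:
    for line in f:  # the file object yields one line at a time
        result = process_line(line)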