I have several ~100 GB NetCDF files.
Within each NetCDF file there is a variable a, from which I have to extract several data series.
Its dimensions are (1440, 721, 6, 8760).
I need to extract ~20k slices of shape (1, 1, 1, 8760) from each NetCDF file.
Since it is extremely slow to extract one slice (several minutes), I read about how to optimize the process.
Most likely, the chunks are not set optimally.
Therefore, my goal is to change the chunk size to (1,1,1,8760) for a more efficient I/O.
However, I struggle to understand how I can best re-chunk this NetCDF variable.
First of all, by running ncdump -k file.nc, I found that the type is 64-bit offset.
Based on my research, I think this is NetCDF3 which does not support defining chunk sizes.
Therefore, I copied it to NetCDF4 format using nccopy -k 3 source.nc dest.nc.
ncdump -k file.nc now returns netCDF-4.
However, now I'm stuck. I do not know how to proceed.
If anybody has a proper solution in python, matlab, or using nccopy, please share it.
What I'm trying now is the following:
nccopy -k 3 -w -c latitude/1,longitude/1,level/1,time/8760 source.nc dest.nc
Is this the correct approach in theory?
Unfortunately, after 24 hours it still had not finished on a powerful server with more than enough RAM (250 GB) and many CPUs (80).
Your command appears to be correct; re-chunking a file this large simply takes time. You could also try NCO's ncks, for example
ncks -4 --cnk_dmn latitude,1 --cnk_dmn longitude,1 --cnk_dmn level,1 --cnk_dmn time,8760 in.nc out.nc
to see if that is any faster.
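Since the question also asks for a Python solution, here is a minimal sketch using the netCDF4 package, assuming the variable is called a and the dimensions are named latitude, longitude, level and time as in your nccopy command; the slab-by-slab copy loop is only illustrative and not tuned:
import netCDF4

with netCDF4.Dataset("source.nc") as src, \
     netCDF4.Dataset("dest.nc", "w", format="NETCDF4") as dst:
    # recreate the dimensions
    for name, dim in src.dimensions.items():
        dst.createDimension(name, len(dim))
    var = src.variables["a"]
    # new variable with one chunk per (lat, lon, level) time series
    out = dst.createVariable("a", var.dtype, var.dimensions,
                             chunksizes=(1, 1, 1, 8760))
    # copy attributes (_FillValue can only be set at creation time, so skip it)
    out.setncatts({k: var.getncattr(k) for k in var.ncattrs() if k != "_FillValue"})
    # copy one latitude slab at a time to keep memory bounded (~300 MB per slab)
    for i in range(var.shape[0]):
        out[i, ...] = var[i, ...]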
At the risk of being a bit off-topic, I want to share a simple solution for loading large CSV files into a dask dataframe so that the option sorted=True can be applied, which saves significant processing time.
I found the option of doing set_index within dask unworkable for the size of the toy cluster I am using for learning and the size of the files (33GB).
So if your problem is loading large unsorted CSV files (multiple tens of gigabytes) into a dask dataframe and quickly starting to run groupbys, my suggestion is to sort them beforehand with the Unix command sort.
sort's processing needs are negligible and it will not push your RAM usage to unmanageable limits. You can define the number of parallel processes to run as well as the amount of RAM consumed as buffer. As long as you have disk space, this rocks.
The trick here is to export LC_ALL=C in your environment prior to issuing the command. Otherwise, pandas/dask sort and Unix sort will produce different results.
Here is the code I have used
export LC_ALL=C
zcat BigFat.csv.gz |
  fgrep -v "..." |        # have headers? take them away (put the header text here)
  sort -t "," -k 1,1 |    # fancy multi-field sorting/index? add -k 3,3 -k 4,4
  split -l 10000000       # partitions?
The result is then ready for:
ddf = dd.read_csv(.....)
ddf = ddf.set_index(ddf.mykey, sorted=True)
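For completeness, here is a slightly fuller sketch of that last step with the import and file pattern spelled out; the column names and the x* glob (split's default output names) are assumptions, not something from my actual data:
import dask.dataframe as dd

# split writes chunks named xaa, xab, ...; adjust the glob and column names to your data
ddf = dd.read_csv('x*', header=None, names=['mykey', 'val1', 'val2'])
ddf = ddf.set_index('mykey', sorted=True)   # no shuffle: the data is already sorted on disk
result = ddf.groupby('mykey').val1.mean().compute()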
Hope this helps
JC
As discussed above, I am just posting this as a solution to my problem. Hope it works for others.
I am not claiming this is the best, most efficient or most pythonic solution! :-)
I am trying to use Python to do some manipulations on huge text files, and by huge I mean over 100GB. Specifically, I'd like to take samples from the lines of the files. For example, let's say I have a file with ~300 million lines; I want to take just a million, write them to a new file and analyze them later to get some statistics.
The problem is that I can't start from the first line, since the first fraction of the file does not represent the rest of it well enough. Therefore, I have to get about 20% into the file and then start extracting lines. If I do it the naive way, it takes very long (20-30 minutes on my machine) just to get to the 20% line. For example, assuming again that my file has 300 million lines and I want to start sampling from the 60,000,000th line (20%), I could do something like:
start_in_line = 60000000
sample_size = 1000000
with open(huge_file, 'r') as f, open(out_file, 'w') as fo:
    # skip lines one by one until the starting point
    for x in range(start_in_line):
        f.readline()
    # readline() keeps the trailing newline, so suppress print's own
    for y in range(sample_size):
        print(f.readline(), end='', file=fo)
But as I said, this is very slow. I tried using some less naive ways, for example the itertools functions, but the improvement in running time was rather slight.
Therefore, I went for another approach: random seeks into the file. What I do is get the size of the file in bytes, calculate 20% of it and then seek to that byte. For example:
import os

huge_file_size = os.stat(huge_file).st_size
offset_percent = 20
sample_size = 1000000
start_point_byte = int(huge_file_size * offset_percent / 100)

with open(huge_file) as f, open(out_file, 'w') as fo:
    f.seek(start_point_byte)
    f.readline()  # skip the partial line to get to the start of the next full line
    for y in range(sample_size):
        print(f.readline(), end='', file=fo)
This approach works very nicely, BUT!
I always work with pairs of files. Let's call them R1 and R2. R1 and R2 will always have the same number of lines, and I run my sampling script on each one of them. It is crucial for my downstream analyses that the samples taken from R1 and R2 coordinate regarding the lines sampled. For example, if I ended up starting sampling from line 60,111,123 of R1, I must start sampling from the very same line in R2. Even if I miss by one line, my analyses are doomed. If R1 and R2 are of exactly the same size (which is sometimes the case), then I have no problem, because my f.seek() will get me to the same place in both files. However, if the line lengths differ between the files, i.e. the total sizes of R1 and R2 are different, then I have a problem.
So, do you have any idea for a workaround, without having to resort to the naive iteration solution? Maybe there is a way to tell which line I am at, after performing the seek? (couldn't find one...) I am really out of ideas at this point, so any help/hint would be appreciated.
Thanks!
If the lines in each file can have different lengths, there is really no other way than to scan them first (unless there is some form of unique identifier on each line which is the same in both files).
Even if both files have the same length, there could still be lines with different lengths inside.
Now, if you're doing those statistics more than once on different parts of the same files, you could do the following:
do a one-time scan of both files and store the file positions of each line in a third file (preferably in binary form, as 2 x 64-bit values, or at least at a fixed record width, so you can jump directly to the position pair of the line you want, which you can then compute).
then just use those file positions to access the lines in both files (you can even calculate the size of the block you need from the neighbouring file positions in your third file).
When scanning both files at the same time, make sure you use some buffering to avoid a lot of hard-disk seeks.
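To make that concrete, here is a minimal sketch of such an index in Python, assuming plain text files and one pair of 64-bit byte offsets per line; the file names and record layout are just illustrative:
import struct

def build_index(r1_path, r2_path, index_path):
    # one-time scan: store (offset_in_R1, offset_in_R2) for every line
    with open(r1_path, 'rb') as f1, open(r2_path, 'rb') as f2, \
         open(index_path, 'wb') as idx:
        pos1 = pos2 = 0
        for line1, line2 in zip(f1, f2):
            idx.write(struct.pack('<QQ', pos1, pos2))
            pos1 += len(line1)
            pos2 += len(line2)

def seek_both_to_line(f1, f2, index_path, line_no):
    # jump straight to the stored position pair for line_no (0-based)
    record_size = struct.calcsize('<QQ')
    with open(index_path, 'rb') as idx:
        idx.seek(line_no * record_size)
        pos1, pos2 = struct.unpack('<QQ', idx.read(record_size))
    f1.seek(pos1)
    f2.seek(pos2)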
edit:
I don't know Python (I'm a C++ programmer), but I did a quick search and it seems Python also supports memory mapped files (mmap).
Using mmap you could speed things up dramatically (no need to do a readline each time just to know the positions of the lines): just map a view onto part(s) of your file and scan through that mapped memory for the newline character (\n, or 0x0a in hexadecimal). This should take no longer than the time it takes to read the file.
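A rough sketch of that scan with Python's mmap module (untested against your data; it simply records the byte offset of every line start):
import mmap

def line_start_offsets(path):
    offsets = [0]                         # the first line starts at byte 0
    with open(path, 'rb') as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b'\n')
        while pos != -1:
            offsets.append(pos + 1)       # next line starts right after the newline
            pos = mm.find(b'\n', pos + 1)
    return offsets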
Unix files are just streams of characters, so there is no way to seek to a given line, or find the line number corresponding to a given character, or anything else of that form.
You can use standard utilities to find the character position of a line. For example,
head -n 60000000 /path/to/file | wc -c
will print the number of characters in the first 60,000,000 lines of /path/to/file.
While that may well be faster than using python, it is not going to be fast; it is limited by the speed of reading from disk. If you need to read 20GB, it's going to take minutes. But it would be worth trying at least once to calibrate your python programs.
If your files don't change, you could create indexes mapping line numbers to character position. Once you build the index, it would be extremely fast to seek to the desired line number. If it takes half an hour to read 20% of the file, it would take about five hours to construct two indexes, but if you only needed to do it once, you could leave it running overnight.
OK, so thanks for the interesting answers, but this is what I actually ended up doing:
First, I estimate the number of lines in the file without actually counting them. Since my files are ASCII, I know that each character takes 1 byte, so I get the number of characters in, say, the first 100 lines, then get the size of the file and use these numbers to get a (quite rough) estimate of the number of lines. I should say here that although my lines might be of different lengths, they are within a limited range, so this estimate is reasonable.
Once I have that, I use a system call to the Linux sed command to extract a range of lines. So let's say that my file really has 300 million lines and I estimated it to have 250 million (I actually get much better estimates, but it doesn't really matter in my case). I use an offset of 20%, so I'd like to start sampling from line 50,000,000 and take 1,000,000 lines. I do:
os.system("sed -n '50000000,51000000p;51000000q' in_file > out_file")
Note the 51000000q - without this, you'll end up running on the whole file.
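If you want to avoid formatting the shell string by hand, the same call can be made with subprocess; a sketch with the start/end line numbers as variables (the variable names are mine):
import subprocess

start_line = 50000000
end_line = start_line + 1000000
expr = "%d,%dp;%dq" % (start_line, end_line, end_line)
with open("out_file", "w") as fo:
    subprocess.run(["sed", "-n", expr, "in_file"], stdout=fo, check=True)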
This solution is not as fast as using random seeks, but it's good enough for me. It also includes some inaccuracy, but it doesn't bother me in this specific case.
I'd be glad to hear your opinion on this solution.
I have a few big sets of HDF5 files and I am looking for an efficient way of converting the data in these files into XML, TXT or some other easily readable format. I tried working with the Python package h5py (www.h5py.org), but I was not able to figure out any method that gets this done fast enough. I am not restricted to Python and can also code in Java, Scala or Matlab. Can someone give me some suggestions on how to proceed with this?
Thanks,
TM
Mathias711's method is the best direct way. If you want to do it within python, then use pandas.HDFStore:
from pandas import HDFStore
store = HDFStore('inputFile.hd5')
store['table1Name'].to_csv('outputFileForTable1.csv')
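If the HDF5 files were not written by pandas (so HDFStore cannot open them), a generic walk with h5py plus numpy.savetxt is one way to get plain text out; a rough sketch, with the output naming purely illustrative and only sensible for numeric 1D/2D datasets:
import h5py
import numpy

def dump_to_text(h5_path):
    with h5py.File(h5_path, 'r') as f:
        def visit(name, obj):
            # write every numeric 1D/2D dataset to its own text file
            if isinstance(obj, h5py.Dataset) and obj.ndim in (1, 2):
                numpy.savetxt(name.replace('/', '_') + '.txt', obj[()], fmt='%g')
        f.visititems(visit)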
You can use h5dump -o dset.asci -y -w 400 dset.h5
-o dset.asci specifies the output file
-y suppresses printing of the array indices, and -w 400 sets the number of columns per output line; you should pick a large enough number here (roughly the dimension size multiplied by the number of characters and spaces needed to print each value).
dset.h5 is of course the hdf5 file you want to convert
I think this is the easiest way to convert it to an ASCII file, which you can import into Excel or whatever you want. I did it a couple of times, and it worked for me. I got this information from this website.
From measurements I get text files that basically contain a table of float numbers with dimensions 1000x1000. Each takes up about 15MB of space, which, considering that I get about 1000 result files in a series, is unacceptable to save. So I am trying to compress them as much as possible without loss of data. My idea is to group the numbers into ~1000 steps over the range I expect and save those. That would provide sufficient resolution. However, I still have 1,000,000 points to consider, and thus my resulting file is still about 4MB. I probably won't be able to compress that any further?
The bigger problem is the calculation time this takes. Right now I'd guesstimate 10-12 seconds per file, so about 3 hrs for the 1000 files. WAAAAY too much. This is the algorithm I thought up; do you have any suggestions? There are probably far more efficient algorithms to do this, but I am not much of a programmer...
import numpy

data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
out = numpy.empty((1000, 1000), numpy.int16)
i = 0
min = -0.5
max = 0.5
step = (max - min) / 1000
while i <= 999:
    j = 0
    while j <= 999:
        # quantize to the step size, then clip out-of-range values
        k = (data[i, j] // step)
        out[i, j] = k
        if data[i, j] > max:
            out[i, j] = 500
        if data[i, j] < min:
            out[i, j] = -500
        j = j + 1
    i = i + 1
numpy.savetxt('converted.txt', out, fmt="%i")
Thanks in advance for any hints you can provide!
Jakob
I see you store the numpy arrays as text files. There is a faster and more space-efficient way: just dump it.
If your floats can be stored as 32-bit floats, then use this:
data = numpy.genfromtxt('sample.txt',autostrip=True, case_sensitive=True)
data.astype(numpy.float32).dump(open('converted.numpy', 'wb'))
then you can read it with
data = numpy.load(open('converted.numpy', 'rb'), allow_pickle=True)  # allow_pickle is needed for dumped (pickled) arrays in recent NumPy
The files will be 1000x1000x4 Bytes, about 4MB.
The latest version of numpy supports 16-bit floats. Maybe your floats will fit in its limited range.
numpy.savez_compressed will let you save lots of arrays into a single compressed, binary file.
However, you aren't going to be able to compress it that much -- if you have 15GB of data, you're not magically going to fit it in 200MB by compression algorithms. You have to throw out some of your data, and only you can decide how much you need to keep.
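For example, a small sketch of the savez_compressed route using the quantized int16 array from the question (the array here is a placeholder):
import numpy

out = numpy.zeros((1000, 1000), dtype=numpy.int16)   # placeholder for a quantized result

# one compressed archive can hold many named arrays
numpy.savez_compressed('results_batch1.npz', sample1=out, sample2=out)

# read them back later by name
archive = numpy.load('results_batch1.npz')
restored = archive['sample1']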
Use the zipfile, bz2 or gzip module to save to a zip, bz2 or gz file from python. Any compression scheme you write yourself in a reasonable amount of time will almost certainly be slower and have worse compression ratio than these generic but optimized and compiled solutions. Also consider taking eumiro's advice.
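For instance, numpy.savetxt itself will gzip the output (it goes through the gzip module) when the file name ends in .gz, so the existing text format can be kept; a minimal sketch with a placeholder array:
import numpy

out = numpy.zeros((1000, 1000), dtype=numpy.int16)   # placeholder for the quantized array

numpy.savetxt('converted.txt.gz', out, fmt='%i')      # written through gzip automatically
restored = numpy.loadtxt('converted.txt.gz', dtype=numpy.int16)   # read back transparently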
I have a large directory of text files--approximately 7 GB. I need to load them quickly into Python unicode strings in iPython. I have 15 GB of memory total. (I'm using EC2, so I can buy more memory if absolutely necessary.)
Simply reading the files will be too slow for my purposes. I have tried copying the files to a ramdisk and then loading them from there into iPython. That speeds things up, but iPython crashes (not enough memory left over?). Here is the ramdisk setup:
mount -t tmpfs none /var/ramdisk -o size=7g
Anyone have any ideas? Basically, I'm looking for persistent in-memory Python objects. The iPython requirement precludes using IncPy: http://www.stanford.edu/~pgbovine/incpy.html .
Thanks!
There is much that is confusing here, which makes it more difficult to answer this question:
The ipython requirement. Why do you need to process such large data files from within ipython instead of a stand-alone script?
The tmpfs RAM disk. I read your question as implying that you read all of your input data into memory at once in Python. If that is the case, then python allocates its own buffers to hold all the data anyway, and the tmpfs filesystem only buys you a performance gain if you reload the data from the RAM disk many, many times.
Mentioning IncPy. If your performance issues are something you could solve with memoization, why can't you just manually implement memoization for the functions where it would help most?
So. If you actually need all the data in memory at once -- if your algorithm reprocesses the entire dataset multiple times, for example -- I would suggest looking at the mmap module. That will provide the data in raw bytes instead of unicode objects, which might entail a little more work in your algorithm (operating on the encoded data, for example), but will use a reasonable amount of memory. Reading the data into Python unicode objects all at once will require either 2x or 4x as much RAM as it occupies on disk (assuming the data is UTF-8).
If your algorithm simply does a single linear pass over the data (as does the Aho-Corasick algorithm you mention), then you'd be far better off just reading in a reasonably sized chunk at a time:
import codecs

with codecs.open(inpath, encoding='utf-8') as f:
    data = f.read(8192)
    while data:
        process(data)
        data = f.read(8192)
I hope this at least gets you closer.
I saw the mention of IncPy and IPython in your question, so let me plug a project of mine that goes a bit in the direction of IncPy, but works with IPython and is well-suited to large data: http://packages.python.org/joblib/
If you are storing your data in numpy arrays (strings can be stored in numpy arrays), joblib can use memmap for intermediate results and be efficient for IO.
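For concreteness, here is a small sketch of how joblib's disk caching is typically wired up; the cache directory, file path and the trivial processing function are placeholders, not part of the original question:
from joblib import Memory

memory = Memory('./joblib_cache', verbose=0)   # cache directory is arbitrary

@memory.cache
def load_and_process(path):
    # stand-in for your real analysis: here it just counts characters
    with open(path, encoding='utf-8') as f:
        return len(f.read())

# the first call computes and caches the result on disk; repeated calls
# with the same argument load it back instead of recomputing
result = load_and_process('/var/ramdisk/somefile.txt')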