Is speed of file opening/reading language dependent?

Is speed of file opening/reading language dependent? - python

I have really big collection of files, and my task is to open a couple of random files from this collection treat their content as a sets of integers and make an intersection of it.
This process is quite slow due to long times of reading files from disk into memory so I'm wondering whether this process of reading from file can be speed up by rewriting my program in some "quick" language. Currently I'm using python which could be inefficient for this kind of job. (I could implement tests myself if I knew some other languages beside python and javascript...)
Also will putting all the date into database help? Files wont fit the RAM anyway so it will be reading from disk again only with database related overhead.
The content of files is the list of long integers. 90% of the files are quite small, less than a 10-20MB, but 10% left are around 100-200mb. As input a have filenames and I need read each of the files and output integers present in every file given.
I've tried to put this data in mongodb but that was as slow as plain files based approach because I tried to use mongo index capabilities and mongo does not store indexes in RAM.
Now I just cut the 10% of the biggest files and store rest in the redis, sometimes accessing those big files. This is, obviously temporary solution because my data grows and amount of RAM available does not.

One thing you could try is calculating intersections of the files on a chunk-by-chunk basis (i.e., read x-bytes into memory from each, calculate their intersections, and continue, finally calculating the intersection of all intersections).
Or, you might consider using some "heavy-duty" libraries to help you. Consider looking into PyTables (with HDF storage)/using numpy for calculating intersections. The benefit there is that the HDF layer should help deal with not keeping the entire array structure in memory all at once---though I haven't tried any of these tools before, it seems like they offer what you need.

If no file contains duplicate numbers, I'd try this:
sort file1 file2 | uniq -d
If they may contain duplicates, then you need to eliminate duplicates first:
sort -u file1 > /tmp/file1
sort -u file2 > /tmp/file2
cat /tmp/file1 /tmp/file2 | sort | uniq -d
Or if you prefer a version that doesn't (explicitly) use temporary files.
(sort -u file1; sort -u file2) | sort | uniq -d
You don't say what format the files are in (the above assumes text, with one integer per line). If they're in some binary format, you would also need a command to translate them before applying the above commands. By using pipes you can compose this step like this:
(decode file1 | sort -u ; decode file2 | sort -u) | sort | uniq -d
Here decode is the name of a program you would have to write that parses your file format.
Apart from being incredibly short and simple, the good thing about this shell solution is that it works with files of any size, even if they don't fit into RAM.
It's not clear from your question whether you have 2 or an arbitrary number of files to intersect (the start of your question says "a couple", the end "a list of filenames"). To deal with, for example, 5 files instead of 2, use uniq -c | awk '{ if ($1=="5") print $2; }' instead of uniq -d

Related

change chunk block shape in netCDF file

I have several ~100 GB NetCDF files.
Within each NetCDF file, there is a variable a, from which I have to extract several data series
The dimension is (1440,721,6,8760).
I need to extract ~20k slices of dimension (1,1,1,8760) from each NetCDF file.
Since it is extremely slow to extract one slice (several minutes), I read about how to optimize the process.
Most likely, the chunks are not set optimally.
Therefore, my goal is to change the chunk size to (1,1,1,8760) for a more efficient I/O.
However, I struggle to understand how I can best re-chunk this NetCDF variable.
First of all, by running ncdump -k file.nc, I found that the type is 64-bit offset.
Based on my research, I think this is NetCDF3 which does not support defining chunk sizes.
Therefore, I copied it to NetCDF4 format using nccopy -k 3 source.nc dest.nc.
ncdump -k file.nc now returns netCDF-4.
However, now I'm stuck. I do not know how to proceed.
If anybody has a proper solution in python, matlab, or using nccopy, please share it.
What I'm trying now is the following:
nccopy -k 3 -w -c latitude/1,longitude/1,level/1,time/8760 source.nc dest.nc
Is this the correct approach in theory?
Unfortunately, after 24 hours, it still did not finish on a potent server with more then enough RAM (250GB) and many CPUs (80).

Your command appears to be correct. Re-chunking takes time.
ncks -4 --cnk_dmn latitude,1 --cnk_dmn longitude,1 --cnk_dmn level,1 --cnk_dmn time,8760 in.nc out.nc
to see if that is any faster.

dask set_index from large unordered csv file

At the risk of being a bit off-topic, I want to show a simple solution for loading large csv files in a dask dataframe where the option sorted=True can be applied and save a significant time of processing.
I found the option of doing set_index within dask unworkable for the size of the toy cluster I am using for learning and the size of the files (33GB).
So if your problem is loading large unsorted CSV files, ( multiple tens of gigabytes ), into a dask dataframe and quickly start performing groupbys my suggestion is to previously sort them with the unix command "sort".
sort processing needs are negligible and it will not push your RAM limits beyond unmanageable limits. You can define the number of parallel processes to run/sort as well as the ram consumed as buffer. In as far you have disk space, this rocks.
The trick here is to export LC_ALL=C in your environment prior to issue the command. Either wise, pandas/dask sort and unix sort will produce different results.
Here is the code I have used
export LC_ALL=C
zcat BigFat.csv.gz |
fgrep -v ( have headers?? take them away)|
sort -key=1,1 -t "," ( fancy multi field sorting/index ? -key=3,3 -key=4,4)|
split -l 10000000 ( partitions ??)
The result is ready for a
ddf=dd.read_csv(.....)
ddf.set_index(ddf.mykey,sorted=True)
Hope this helps
JC

As discussed above, I am just posting this as a solution to my problem. Hope works for others.
I am not claiming this is the best, most efficient or more pythonic! :-)

How to jump to the same line in two huge text files?

I am trying to use python to do some manipulations on huge text files, and by huge I mean over 100GB. Specifically, I'd like to take samples from the lines of the files. For example, let's say I have a file with ~300 million lines, I want to take just a million, write them to a new file and analyze them later to get some statistics. The problem is, I can't start from the first line, since the first fraction of the file does not represent the rest of it good enough. Therefore, I have to get about 20% into the file, and then start extracting lines. If I do it the naive way, it takes very long (20-30 minutes on my machine) to get to the 20% line. For example, let's assume again that my file has 300 million lines, and I want to start sampling from line 60,000,000th (20%) line. I could do something like:
start_in_line = 60000000
sample_size = 1000000
with open(huge_file,'r') as f, open(out_file,'w') as fo:
for x in range(start_in_line):
f.readline()
for y in range(sample_size):
print(f.readline(),file=fo)
But as I said, this is very slow. I tried using some less naive ways, for example the itertools functions, but the improvement in running time was rather slight.
Therefore, I went for another approach - random seeks into the file. What I do is get the size of the file in bytes, calculate 20% of it and than make a seek to this byte. For example:
import os
huge_file_size = os.stat(huge_file).st_size
offset_percent = 20
sample_size = 1000000
start_point_byte = int(huge_file_size*offset_percent/100)
with open(huge_file) as f, open(out_file,'w') as fo:
f.seek(start_point_byte)
f.readline() # get to the start of next line
for y in range(sample_size):
print(f.readline(),file=fo)
This approach works very nice, BUT!
I always work with pairs of files. Let's call them R1 and R2. R1 and R2 will always have the same number of lines, and I run my sampling script on each one of them. It is crucial for my downstream analyses that the samples taken from R1 and R2 coordinate, regarding the lines sampled. For example, if I ended up starting sampling from line 60,111,123 of R1, I must start sampling from the very same line in R2. Even if I miss by one line, my analyses are doomed. If R1 and R2 are of exactly the same size (which is sometimes the case), then I have no problem, because my f.seek() will get me to the same place in both files. However, if the line lengths are different between the files, i.e. the total sizes of R1 and R2 are different, then I am in a problem.
So, do you have any idea for a workaround, without having to resort to the naive iteration solution? Maybe there is a way to tell which line I am at, after performing the seek? (couldn't find one...) I am really out of ideas at this point, so any help/hint would be appreciated.
Thanks!

If the lines in each file can have different lengths, there is really no other way than to scan them first (unless there is some form of unique identifier on each line which is the same in both files).
Even if both files have the same length, there could still be lines with different lengths inside.
Now, if you're doing those statistics more than once on different parts of the same files, you could do the following:
do a one time scan of both files and store the filepositions of each line in a third file (preferably in binary form (2 x 64bit value) or at least the same width so you can directly jump to the position-pair of the line you want, which you could calculate then).
then just use those filepositions to access the lines in both files (you could even calculate the size of the block you need from the different filepositions in your third file).
When scanning both files at the same time, make sure you use some buffering to avoid a lot of harddisk seeks.
edit:
I don't know Python (I'm a C++ programmer), but I did a quick search and it seems Python also supports memory mapped files (mmap).
Using mmap you could speed things up dramaticly (no need to do a readline each time just to know the positions of the lines): just map a view on part(s) of your file and scan through that mapped memory for the newline (\n or 0x0a in hexadecimal). This should take no longer than the time it takes to read the file.

Unix files are just streams of characters, so there is no way to seek to a given line, or find the line number corresponding to a given character, or anything else of that form.
You can use standard utilities to find the character position of a line. For example,
head -n 60000000 /path/to/file | wc -c
will print the number of characters in the first 60,000,000 lines of /path/to/file.
While that may well be faster than using python, it is not going to be fast; it is limited by the speed of reading from disk. If you need to read 20GB, it's going to take minutes. But it would be worth trying at least once to calibrate your python programs.
If your files don't change, you could create indexes mapping line numbers to character position. Once you build the index, it would be extremely fast to seek to the desired line number. If it takes half an hour to read 20% of the file, it would take about five hours to construct two indexes, but if you only needed to do it once, you could leave it running overnight.

OK, so thanks for the interesting answers, but this is what I actually ended up doing:
First, I estimate the number of lines in the file, without actually counting them. Since my files are ASCII, i know that each character takes 1 byte, so I get the number of characters in, say, the first 100 lines, then get the size of the file and use these numbers to get a (quite rough) estimation of the number of lines. I should say here that although my lines might be of different length, they are within a limited range, so this estimation is reasonable.
Once I have that, I use as system call of the Linux sed command to extract a range of lines. So let's say that my file really has 300 million lines, and I estimated it to have 250 million lines (I get much better estimations, but it doesn't really matter in my case). I use an offset of 20%, so I'd like to start sampling from line 50,000,000 and take 1,000,000 lines. I do:
os.system("sed -n '50000000,51000000p;51000000q' in_file > out_file")
Note the 51000000q - without this, you'll end up running on the whole file.
This solution is not as fast as using random seeks, but it's good enough for me. It also includes some inaccuracy, but it doesn't bother me in this specific case.
I'd be glad to hear your opinion on this solution.

Convert HDF5 file to other formats

I am having a few big files sets of HDF5 files and I am looking for an efficient way of converting the data in these files into XML, TXT or some other easily readable format. I tried working with the Python package (www.h5py.org), but I was not able to figure out any methods with which I can get this stuff done fast enough. I am not restricted to Python and can also code in Java, Scala or Matlab. Can someone give me some suggestions on how to proceed with this?
Thanks,
TM

Mathias711's method is the best direct way. If you want to do it within python, then use pandas.HDFStore:
from pandas import HDFStore
store = HDFStore('inputFile.hd5')
store['table1Name'].to_csv('outputFileForTable1.csv')

You can use h5dump -o dset.asci -y -w 400 dset.h5
-o dset.asci specifies the output file
-y -w 400 specifies the dimension size multiplied by the number of positions and spaces needed to print each value. You should take a very large number here.
dset.h5 is of course the hdf5 file you want to convert
I think this is the easiest way to convert it to an ascii file, which you can import to excel or whatever you want. I did it a couple of times, and it worked for me. I got his information from this website.

How to generate the shortest possible (alpha)numeric unique ID out of a file path?

I would like to generate numeric or alphanumeric (whichever is easier) unique ID as a function of a file path in Python. I am working on a file parsing application and there is a file entity in the DB with descendants and, in order to have a more compact foreign/primary key than the fully qualified path to a file, I would like to convert it into the shortest possible unique digest as possible.
What are my options to do this? Can I use SHA?
How about if i just took an MD5 checksum out of the fully qualified path string and got something like 1736622845? On a command line, it can be done with
echo -n '/my/path/filename' | cksum | cut -d' ' -f1
Is that guaranteed to never repeat for two different inputs? If yes, how would I translate the above bash piped command into pure Python so that I don't have to invoke a system call but get the same value?

The shortest possible unique ID of a string is the string.
You can try to use an alphabet that only contains the characters allowed in the path, so that you use less bits (a lot of work, not a lot of benefit, unless your paths really only contain a few characters)
What I think you want is a fairly good short hash function. As soon as you generate a hash function there's a risk of collision. For most hash functions a good rule of thumb is that you have far less entries than the hash value space. There's a theorem to prove that as soon as you have more than sqrt(key_space) entries you will (with the best hashes) get collisions half the time.
So if you take say 1000 paths, you should aim at having a hash pace of at least 1.000.000 entries to work with. You can chop up other hash functions (say take only the first 2 bytes of the md5). That should work, but note the increase in collisions (where 2 entries will generate the same value).
Also if you are so keen to save space, store the hash value in binary (large int). It's far shorter than the the usual encodings (base64, or hex) and all the DB functions should work fine.
So say you take md5 and store it as a large int, it will take only 16 bytes to store. But you can also only use 8 or 4 (I wouldn't dare go lower than that).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Is speed of file opening/reading language dependent? - python

Related

change chunk block shape in netCDF file

dask set_index from large unordered csv file

How to jump to the same line in two huge text files?

Convert HDF5 file to other formats

How to generate the shortest possible (alpha)numeric unique ID out of a file path?

Categories

Resources