So I hope this question hasn't already been answered, but I can't seem to figure out the right search term.
First some background:
I have text data files that are tabular and can easily climb into the tens of GBs. The computer processing them is already heavily loaded from the hours-long data collection (at up to 30-50 MB/s), as it is doing device processing and control. Therefore, disk space and access are at a premium. We haven't moved from spinning disks to SSDs due to space constraints.
However, we are looking to do something with the just collected data that doesn't need every data point. We were hoping to decimate the data and collect every 1000th point. However, loading these files (Gigabytes each) puts a huge load on the disk which is unacceptable as it could interrupt the live collection system.
I was wondering if it is possible to use a low-level method to access every nth byte (or some other method) in the file (like a database does), because the file is very well defined (two 64-bit doubles in each row). I understand that too low-level an access method might not work because the hard drive might be fragmented, but what would the best approach/method be? I'd prefer a solution in Python or Ruby because that's what the processing will be done in, but in theory R, C, or Fortran could also work.
Finally, upgrading the computer or hardware isn't an option; setting up the system took hundreds of man-hours, so only software changes can be made. That said, if a text file isn't the best way to handle these files, I'm open to other solutions too as a longer-term project.
EDIT: We generate (depending on usage) anywhere from 50,000 lines (records)/sec to 5 million lines/sec; databases aren't feasible at this rate regardless.
This should be doable using seek and read methods on a file object. Doing this will prevent the entire file from being loaded into memory, as you would only be working with file streams.
Also, since the files are well defined and predictable, you won't have any trouble seeking ahead N bytes to the next record in the file.
Below is an example. Demo the code below at http://dbgr.cc/o
with open("pretend_im_large.bin", "rb") as f:
    read_bytes = []
    # seek to the end of the file to find its size
    f.seek(0, 2)
    file_size = f.tell()
    # seek back to the beginning of the stream
    f.seek(0, 0)
    while f.tell() < file_size:
        # read the first byte of the record...
        read_bytes.append(f.read(1).decode())
        # ...then skip ahead 9 bytes to the start of the next record
        f.seek(9, 1)

print(read_bytes)
The code above assumes pretend_im_large.bin is a file with the contents:
A00000000
B00000000
C00000000
D00000000
E00000000
F00000000
The output of the code above is:
['A', 'B', 'C', 'D', 'E', 'F']
I don't think Python is going to give you a strong guarantee that it won't actually read the entire file when you use f.seek. I think this is too platform- and implementation-specific to rely on. You should use Windows-specific tools that give you a guarantee of random access rather than sequential access.
Here's a snippet of Visual Basic that you can modify to suit your needs. You can define your own record type that's two 64-bit integers long. Alternatively, you can use a C# FileStream object and use its seek method to get what you want.
If this is performance-critical software, I think you need to make sure you're getting access to the OS primitives that do what you want. I can't find any references that indicate that Python's seek is going to do what you want. If you go that route, you need to test it to make sure it does what it seems like it should.
Is the file human-readable text or in the native format of the computer (sometimes called binary)? If the files are text, you could reduce the processing load and file size by switching to native format. Converting from the internal representation of floating-point numbers to human-readable numbers is CPU intensive.
If the files are in native format then it should be easy to skip through the file, since each record will be 16 bytes. In Fortran, open the file with an open statement that includes form="unformatted", access="direct", recl=16. Then you can read an arbitrary record X without reading the intervening records via rec=X in the read statement. If the file is text, you can also read it with direct IO, but each pair of numbers might not always use the same number of characters (bytes). You can examine your files and answer that question. If the records are always the same length, then you can use the same technique, just with form="formatted". If the records vary in length, then you could read a large chunk and locate your numbers within the chunk.
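If you do go the native-format route, the same direct-access idea is straightforward in Python too. Here is a minimal sketch, assuming fixed 16-byte binary records (two little-endian 64-bit doubles per record); the filename and decimation step are placeholders:

import struct

RECORD_SIZE = 16   # two 8-byte doubles per record
STEP = 1000        # keep every 1000th record

def decimate(path, step=STEP):
    samples = []
    with open(path, "rb") as f:
        f.seek(0, 2)                        # jump to the end to get the file size
        n_records = f.tell() // RECORD_SIZE
        for i in range(0, n_records, step):
            f.seek(i * RECORD_SIZE)         # jump straight to record i
            x, y = struct.unpack("<dd", f.read(RECORD_SIZE))
            samples.append((x, y))
    return samples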
Related
I'd like to reference the following post as well, and mention that I'm familiar with BioPython.
How to obtain random access of a gzip compressed file
I'm familiar with Bio.bgzf's potential for indexes and random reads. I'm building a library that uses the module to build an index over the blocks that contain data relevant to my interests. The technology is very interesting, but I'm struggling to understand the pace of development and the limitations of what Bio.bgzf, or even the bgzf standard itself, is capable of.
Can Bio.bgzf overwrite a specific line in the file, just as it can read from the virtual offset to the end of the line? If it could, would the new data necessarily need to be exactly the same number of bits?
After using make_virtual_offset() to acquire a position in the .bgzf file for a line that I'd like to overwrite, I'm looking for a method like filehandle.writeline() to replace the line in the block with some new text. If that's not possible, is it possible to get the coordinates for the entire block and rewrite that instead? And if not, would it be fair to say that bgzf index files are suitable for reading only?
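For reference, the read side of what I'm describing looks roughly like this; the offsets are made-up values, and the write-side equivalent is exactly what I'm asking about:

from Bio import bgzf

# Made-up offsets: the start of the compressed block and the position
# of the target line within the decompressed block.
voffset = bgzf.make_virtual_offset(block_start_offset=0, within_block_offset=42)

handle = bgzf.BgzfReader("data.bgzf")
handle.seek(voffset)       # jump to the virtual offset
line = handle.readline()   # read from there to the end of the line
handle.close()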
I am parsing some XML and writing data to different files depending on the XML element that is currently being processed. Processing an element is really fast, and writing the data is, too. Therefore, files would need to be opened and closed very often. For example, given a huge file:
for _, node in lxml.etree.iterparse(file):
    with open(f"{node.tag}.txt", 'a') as fout:
        fout.write(node.attrib['someattr'] + '\n')
This would work, but relatively speaking it would take a lot of time opening and closing the files. (Note: this is a toy program. In reality the actual contents that I write to the files as well as the filenames are different. See the last paragraph for data details.)
An alternative could be:
fhs = {}
for _, node in lxml.etree.iterparse(file):
    if node.tag not in fhs:
        fhs[node.tag] = open(f"{node.tag}.txt", 'w')
    fhs[node.tag].write(node.attrib['someattr'] + '\n')

for _, fh in fhs.items():
    fh.close()
This will keep the files open until the parsing of XML is completed. There is a bit of lookup overhead, but that should be minimal compared to iteratively opening and closing the file.
My question is, what is the downside of this approach, performance wise? I know that this will make the open files inaccessible by other processes, and that you may run into a limit of open files. However, I am more interested in performance issues. Does keeping all file handles open create some sort of memory issues or processing issues? Perhaps too much file buffering is going on in such scenarios? I am not certain, hence this question.
The input XML files can be up to around 70GB. The number of files generated is limited to around 35, which is far from the limits I read about in the aforementioned post.
The obvious downside, which you have already mentioned, is that a lot of memory will be required to keep all the file handles open, depending of course on how many files there are. This is a calculation you have to do on your own. And don't forget the write locks.
Otherwise there isn't very much wrong with it per se, but it would be good to take some precautions:
fhs = {}
try:
    for _, node in lxml.etree.iterparse(file):
        if node.tag not in fhs:
            fhs[node.tag] = open(f"{node.tag}.txt", 'w')
        fhs[node.tag].write(node.attrib['someattr'] + '\n')
finally:
    for fh in fhs.values():
        fh.close()
Note:
When you loop over a dict in Python directly, you really only get the keys. I'd recommend for key, value in d.items(): or for value in d.values(): instead.
You didn't say just how many files the process would end up holding open. If it's not so many that it creates a problem, then this could be a good approach. I doubt you can really know without trying it out with your data and in your execution environment.
In my experience, open() is relatively slow, so avoiding unnecessary calls is definitely worth thinking about; you also avoid setting up all the associated buffers, populating them, flushing them every time you close the file, and garbage-collecting. Since you ask, file pointers do come with large buffers. On OS X, the default buffer size is 8192 bytes (8 KB), and there is additional overhead for the object, as with all Python objects. So if you have hundreds or thousands of files and little RAM, it can add up. You can specify less buffering or no buffering at all, but that could defeat any efficiency gained from avoiding repeated opens.
Edit: For just 35 distinct files (or any two-digit number), you have nothing to worry about: the space that 35 output buffers will need (at 8 KB per buffer for the actual buffering) will not even be the biggest part of your memory footprint. So just go ahead and do it the way you proposed. You'll see a dramatic speed improvement over opening and closing the file for each XML node.
PS. The default buffer size is given by io.DEFAULT_BUFFER_SIZE.
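As a rough illustration (the filenames below are placeholders), you can inspect the default and override it per handle via the buffering argument of open():

import io

print(io.DEFAULT_BUFFER_SIZE)   # e.g. 8192 on many platforms

# A bigger buffer means fewer trips to the OS while the handle stays open.
fh_big = open("example_big.txt", "w", buffering=1024 * 1024)

# buffering=1 requests line buffering (text mode only): flush after every newline.
fh_line = open("example_line.txt", "w", buffering=1)

fh_big.close()
fh_line.close()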
As a good rule, try to close a file as soon as possible.
Note that your operating system also has limits: you can only open a certain number of files at once. You might soon hit this limit and start getting "Failed to open file" exceptions.
Leaking memory and file handles is an obvious problem (if you fail to close the files for some reason).
If you are generating thousands of files, you might consider writing them into a directory structure so they are stored separately in different directories and are easier to access afterwards, for example: a/a/aanode.txt, a/c/acnode.txt, etc.
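A rough sketch of that idea (assuming plain lowercase tag names; adjust the splitting rule to your data):

import os

def path_for_tag(tag, root="out"):
    # e.g. "aanode" -> out/a/a/aanode.txt
    subdir = os.path.join(root, tag[0], tag[1])
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, tag + ".txt")

with open(path_for_tag("aanode"), "a") as fout:
    fout.write("some attribute value\n")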
If the XML contains runs of consecutive nodes of the same type, you can keep writing while that condition holds and only close the file the moment a node for another file appears.
What you gain from it largely depends on the structure of your XML file.
I'm new to protobuf. I need to serialize complex graph-like structure and share it between C++ and Python clients.
I'm trying to apply protobuf because:
It is language agnostic, has generators both for C++ and Python
It is binary. I can't afford text formats because my data structure is quite large
But Protobuf user guide says:
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
https://developers.google.com/protocol-buffers/docs/techniques#large-data
I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.
Why is protobuf bad for serializing large datasets? What should I use instead?
It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.
However, OpenStreetMap does not store the entire file as a single message. Instead it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message only encodes e.g. one node.
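A minimal sketch of that approach, assuming a hypothetical graph_pb2 module generated from a .proto file of your own that defines a Node message. Protobuf has no built-in framing for a stream of messages, so each one is length-prefixed by hand:

import struct

import graph_pb2  # hypothetical module generated by protoc

def write_nodes(path, nodes):
    with open(path, "wb") as f:
        for node in nodes:  # each node is a graph_pb2.Node
            payload = node.SerializeToString()
            f.write(struct.pack("<I", len(payload)))  # 4-byte length prefix
            f.write(payload)

def read_nodes(path):
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (length,) = struct.unpack("<I", header)
            node = graph_pb2.Node()
            node.ParseFromString(f.read(length))
            yield node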
The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.
The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.
If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.
I've accumulated a set of 500 or so files, each of which has an array and header that stores metadata. Something like:
2,.25,.9,26 #<-- header, which is actually cryptic metadata
1.7331,0
1.7163,0
1.7042,0
1.6951,0
1.6881,0
1.6825,0
1.678,0
1.6743,0
1.6713,0
I'd like to read these arrays into memory selectively. We've built a GUI that lets users select one or multiple files from disk, and each is then read into the program. If users want to read in all 500 files, the program is slow opening and closing each one. Therefore, my question is: will it speed up my program to store all of these in a single structure, something like HDF5? Ideally, this would have faster access than the individual files. What is the best way to go about this? I haven't ever dealt with these types of considerations. What's the best way to speed up this bottleneck in Python? The total data is only a few megabytes, so I'd even be amenable to storing it in the program somewhere, not just on disk (but I don't know how to do this).
Reading 500 files in Python should not take much time, as the overall data size is only a few MB. Your data structure is plain and simple, so I don't expect parsing to take much time either.
If the actual slowness is because of opening and closing files, there may be an OS-related issue (it may have very poor I/O).
Have you timed how long it takes to read all the files?
You can also try using a small embedded database like SQLite, where you can store your file data and access the required pieces on the fly.
I'm trying to read some files in a directory, which has 10 text files. With time, the number of files increases, and the total size as of now goes around 400MB.
File contents are in the format:
student_name:student_ID:date_of_join:anotherfield1:anotherfield2
In case of a match, I have to print out the whole line. Here's what I've tried.
import os

findvalue = "student_id"  # this is the user's input (alphanumeric)
directory = "./RecordFolder"

for filename in os.listdir(directory):
    with open(os.path.join(directory, filename)) as f:
        for line in f:
            if findvalue in line:
                print(line)
This works, but it takes a lot of time. How can I reduce the run time?
When text files become too slow, you need to start looking at databases. One of the main purposes of databases is to intelligently handle IO from persistent data storage.
Depending on the needs of your application, SQLite may be a good fit. I suspect this is what you want, given that you don't seem to have a gargantuan data set. From there, it's just a matter of making database API calls and allowing SQLite to handle the lookups -- it does so much better than you!
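For example, a minimal sketch using only the standard library (the table and column names are made up; adapt them to your fields):

import sqlite3

conn = sqlite3.connect("students.db")
conn.execute("""CREATE TABLE IF NOT EXISTS students (
                    student_name TEXT,
                    student_id   TEXT,
                    date_of_join TEXT,
                    field1       TEXT,
                    field2       TEXT)""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_student_id ON students(student_id)")

def load_file(path):
    # Each line is student_name:student_ID:date_of_join:field1:field2
    with open(path) as f:
        rows = (line.rstrip("\n").split(":") for line in f)
        conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

def find(student_id):
    # The index makes this lookup fast; no need to scan every line yourself.
    return conn.execute(
        "SELECT * FROM students WHERE student_id = ?", (student_id,)
    ).fetchall()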
If (for some strange reason) you really don't want to use a database, then consider further breaking up your data into a tree, if at all possible. For example, you could have a file for each letter of the alphabet in which you put student data. This should cut down on looping time since you're reducing the number of students per file. This is a quick hack, but I think you'll lose less hair if you go with a database.
IO is notoriously slow compared to computation, and given that you are dealing with large files, it's probably best to deal with them line by line. I don't see an obvious easy way to speed this up in Python.
Depending on how frequent your "hits" (i.e., findvalue in line) will be, you may decide to write to a file so as not to be slowed down by console output; but if there will be relatively few items found, it won't make much of a difference.
I think for Python there's nothing obvious and major you can do. You could always explore other tools (such as grep or databases ...) as alternative approaches.
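If you want to try the grep route from Python, a minimal sketch (assuming a Unix-like system with grep on the PATH, and reusing findvalue and directory from your snippet) could look like this:

import subprocess

# -r searches the directory recursively; -F treats the value as a fixed string.
result = subprocess.run(
    ["grep", "-rF", findvalue, directory],
    capture_output=True, text=True, check=False,
)
print(result.stdout, end="")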
PS: No need for the else: pass, by the way.