Read flat file as transpose, python

I'm interested in reading fixed width text files in Python in as efficient a manner as I can. Specifically, most of the time I'm interested in one or more columns in the flat file but not entire records.
It strikes me as inefficient to read the file a line at a time and extract the desired columns after reading the entire line into memory. I think I'd rather have the option of reading only the desired columns, top to bottom, left to right (instead of reading left to right, top to bottom).
Is such a thing desirable, and if so, is it possible?

Files are laid out as a (one-dimensional) sequence of bytes. 'Lines' are just a convenience we added to make things easier for humans to read. So, in general, what you're asking for is not possible with plain files. To pull it off, you would need some way of finding where a record starts. The two most common ways are:
Search for newline symbols (in other words, read the entire file).
Use a specially spaced layout, so that each record is laid out with a fixed width. That way, you can use low-level file operations, like seek, to go directly to where you need to go. This avoids reading the entire file, but is painful to do manually.
I wouldn't worry too much about file reading performance unless it becomes a problem. Yes, you could memory map the file, but your OS probably already caches for you. Yes, you could use a database format (e.g., the sqlite3 file format through sqlalchemy), but it probably isn't worth the hassle.
Side note on "fixed width": what precisely do you mean by this? If you really mean "every column always starts at the same offset relative to the start of the record", then you can definitely use Python's seek to skip past data that you are not interested in.
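As a rough illustration of that seek-based idea (the record length, column offset and column width below are made-up values for a hypothetical layout):

RECORD_LEN = 80   # total bytes per record, newline included (assumed)
COL_OFFSET = 20   # byte offset of the wanted column within a record (assumed)
COL_LEN = 10      # width of the wanted column in bytes (assumed)

def read_column(path):
    values = []
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end to learn the file size
        size = f.tell()
        for start in range(0, size, RECORD_LEN):
            f.seek(start + COL_OFFSET)    # jump straight to the column in this record
            values.append(f.read(COL_LEN))
    return values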

How big are the lines? Unless each record is huge, it will probably make little difference whether you read in only the fields you're interested in or the whole line.
For big files with fixed formatting, you might get something out of mmapping the file. I've only done this with C rather than Python, but it seems like mmapping the file then accessing the appropriate fields directly is likely to be reasonably efficient.
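For completeness, a small Python sketch of the mmap idea, again assuming a hypothetical fixed-width layout (all three sizes are placeholders):

import mmap

def read_column_mmap(path, record_len=80, col_offset=20, col_len=10):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # slice the wanted bytes out of each fixed-width record
            return [mm[i + col_offset:i + col_offset + col_len]
                    for i in range(0, len(mm), record_len)]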

Flat files are not good with what you're trying to do. My suggestion is to convert the files to SQL database (using sqlite3) and then reading just the columns you want. SQLite3 is blazing fast.
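A hedged sketch of that suggestion (the table name, column layout and fixed-width slice positions are invented for illustration):

import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (date TEXT, price REAL)")

with open("flatfile.txt") as f:
    # assume bytes 0-9 hold a date and bytes 20-29 hold a price in each record
    rows = ((line[0:10], float(line[20:30])) for line in f)
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
conn.commit()

# later, pull back only the column you care about
prices = [row[0] for row in conn.execute("SELECT price FROM records")]
conn.close()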

If it's truly fixed width, then you should be able to just call read(N) to skip past the fixed number of bytes from the end of your column on one line to the start of it on the next.

Related

Pyarrow Write/Append Columns Arrow File

I have a calculator that iterates over a couple of hundred objects and produces Nx1 arrays for each of those objects, N here being 1-10m depending on configuration. Right now I am summing over these by using a generator expression, so memory consumption is low. However, I would like to store the Nx1 arrays to file, so I can do other computations (compute quantiles, partial sums etc., pandas style). Preferably I would like to use pa.memory_map on a single file (in order to have dataframes not loaded into memory), but I cannot see how I can produce such a file without generating the entire result first. (Monte Carlo results on 200-500*10m floats.)
If I understand correctly, RecordBatchStreamWriter needs a part of the entire table, and I cannot produce only a part of it. The parts the calculator produces are the columns, not parts of all columns. Is there any way of writing "columns" one by one? Either by appending, or by creating an empty arrow file which can be filled? (Schema known.)
As I see it, my alternative is to write several files and use "dataset"/tabular data to "join" them together. My "other computations" would then have to filter or pull parts into memory, as I can't see in the docs that "dataset()" works with memory_map. The result set is too big to fit in memory. (At least on the server it is running on.)
I'm on day 2 of digging through the docs and trying to understand how it all works, so apologies if the "lingo" is not all correct.
On further inspection, it looks like all files used in datasets() must have same schema, so I can not split "columns" in separate files either, can I..
EDIT
After wrestling with this library, I now produce single-column files which I later combine into a single file. However, when following the suggested solution, visible memory consumption (task manager) skyrockets in the step that combines the files. I would expect peaks for every "row group" or combined record batch, but instead it steadily increases until it uses all memory. A snip of this step:
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream(
    os.path.join(self.filepath, self.outfile_name + ".arrow"),
    schema=combined_schema,
) as writer:
    # pull one batch from every single-column file and stitch them into one batch
    for group in zip(*readers):
        combined_batch = pa.RecordBatch.from_arrays(
            [g.column(0) for g in group], names=combined_schema.names
        )
        writer.write_batch(combined_batch)
From this link I would expect the running memory consumption to be about that of combined_batch and then some.
You could do the write in two passes.
First, write each column to its own file. Make sure to set a row group size small enough that a table consisting of one row group from each file comfortably fits into memory.
Second, create a streaming reader for each file you created and one writer. Read a single row group from each one. Create a table by combining all of the partial columns and write the table to your writer. Repeat until you've exhausted all your readers.
I'm not sure that memory mapping is going to help you much here.
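A rough sketch of that first pass (the helper name, the float64 dtype and the chunk source are assumptions, not part of the original answer): each column is written to its own stream file in small batches, so only one chunk needs to be in memory at a time.

import pyarrow as pa

def write_column(path, name, chunks):
    # chunks: an iterable of 1-D float64 arrays produced by the calculator (assumed)
    schema = pa.schema([(name, pa.float64())])
    with pa.ipc.new_stream(path, schema) as writer:
        for chunk in chunks:
            batch = pa.RecordBatch.from_arrays([pa.array(chunk)], names=[name])
            writer.write_batch(batch)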

Space efficient file type to store double precision floats

I am currently running Simulations written in C later analyzing the results using Python scripts.
At the moment the C program writes the results (lots of double values) to a text file, which is slowly but surely eating a lot of disk space.
Is there a file format which is more space efficient to store lots of numeric values?
Ideally, though not necessarily, it should fulfill the following requirements:
Values can be appended continuously such that not all values have to be in memory at once.
The file is more or less easily readable using Python.
I feel like this should be a really common question, but looking for an answer I only found descriptions of various data types within C.
Use a binary file, but please be careful with the format of the data that you are saving. If possible, reduce the width of each variable that you are using: for example, do you need to save a double or a float, or could you get away with just a 16- or 32-bit integer?
Further, yes, you could apply some compression scheme to compress the data before saving and decompress it after reading, but that requires much more work, and it is probably overkill for what you are doing.
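On the Python side, if the C program simply fwrite()s raw doubles, reading them back is straightforward; a small sketch (the file name and chunk size are placeholders):

import numpy as np

# read the whole file at once: 8 bytes per value, no text overhead
data = np.fromfile("results.bin", dtype=np.float64)

# or stream it in chunks if the file is too large for memory
with open("results.bin", "rb") as f:
    while True:
        chunk = np.fromfile(f, dtype=np.float64, count=1_000_000)
        if chunk.size == 0:
            break
        # process chunk here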

Python: Fast and efficient way of writing large text file

I have a speed/efficiency related question about python:
I need to write a large number of very large R dataframe-ish files, about 0.5-2 GB in size. This is basically a large tab-separated table, where each line can contain floats, integers and strings.
Normally, I would just put all my data in a numpy array and use np.savetxt to save it, but since there are different data types it can't really be put into one array.
Therefore I have resorted to simply assembling the lines as strings manually, but this is a tad slow. So far I'm doing:
1) Assemble each line as a string
2) Concatenate all lines as single huge string
3) Write string to file
I have several problems with this:
1) The large number of string concatenations ends up taking a lot of time
2) I run out of RAM trying to keep all the strings in memory
3) ...which in turn leads to more separate file.write commands, which are very slow as well.
So my question is: What is a good routine for this kind of problem? One that balances out speed vs memory-consumption for most efficient string-concatenation and writing to disk.
... or maybe this strategy is simply just bad and I should do something completely different?
Thanks in advance!
Seems like Pandas might be a good tool for this problem. It's pretty easy to get started with pandas, and it deals well with most ways you might need to get data into python. Pandas deals well with mixed data (floats, ints, strings), and usually can detect the types on its own.
Once you have an (R-like) data frame in pandas, it's pretty straightforward to output the frame to csv.
DataFrame.to_csv(path_or_buf, sep='\t')
There's a bunch of other configuration things you can do to make your tab separated file just right.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
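A small sketch of that, with made-up column names and output path:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],                  # integers
    "score": [0.5, 1.25, 2.0],        # floats
    "label": ["a", "b", "c"],         # strings
})
df.to_csv("output.tsv", sep="\t", index=False)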
Unless you are running into a performance issue, you can probably write to the file line by line. Python internally uses buffering and will likely give you a nice compromise between performance and memory efficiency.
Python buffering is different from OS buffering and you can specify how you want things buffered by setting the buffering argument to open.
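For example, a minimal line-by-line writer with an explicit buffer size (the 1 MiB figure and the sample rows are arbitrary placeholders):

rows = [(1, 0.5, "a"), (2, 1.25, "b")]            # stand-in for the real data
with open("output.tsv", "w", buffering=1024 * 1024) as f:
    for row in rows:
        f.write("\t".join(str(field) for field in row) + "\n")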
I think what you might want to do is create a memory mapped file. Take a look at the following documentation to see how you can do this with numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
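A tiny sketch of numpy.memmap, assuming the data can be made homogeneous (a memmap holds raw binary, not a text table); the path, dtype and shape are placeholders:

import numpy as np

mm = np.memmap("output.dat", dtype=np.float64, mode="w+", shape=(1_000_000, 10))
mm[0, :] = np.arange(10)    # rows are written through to disk, not held in RAM
mm.flush()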

Trying to delete values in an hdf5 file with PyTables, but file size is not shrinking [duplicate]

I have an HDF5 file with a one-dimensional (N x 1) dataset of compound elements - actually it's a time series. The data is first collected offline into the HDF5 file, and then analyzed. During analysis most of the data turns out to be uninteresting, and only some parts of it are interesting. Since the datasets can be quite big, I would like to get rid of the uninteresting elements while keeping the interesting ones. For instance, keep elements 0-100 and 200-300 and 350-400 of a 500-element dataset, and dump the rest. But how?
Does anybody have experience with how to accomplish this with HDF5? Apparently it could be done in several ways, at least:
(The obvious solution) Create a fresh file and write the necessary data there, element by element. Then delete the old file.
Or, into the old file, create a new fresh dataset, write the necessary data there, unlink the old dataset using H5Gunlink(), and get rid of the unclaimed free space by running the file through h5repack.
Or, move the interesting elements within the existing dataset towards the start (e.g. move elements 200-300 to positions 101-201 and elements 350-400 to positions 202-252). Then call H5Dset_extent() to reduce the size of the dataset. Then maybe run through h5repack to release the free space.
Since the files can be quite big even when the uninteresting elements have been removed, I'd rather not rewrite them (it would take a long time), but it seems to be required to actually release the free space. Any hints from HDF5 experts?
HDF5 (at least the version I am used to, 1.6.9) does not allow deletion. Actually, it does, but it does not free the used space, with the result that you still have a huge file. As you said, you can use h5repack, but it's a waste of time and resources.
Something that you can do is to have a lateral dataset containing a boolean value, telling you which values are "alive" and which ones have been removed. This does not make the file smaller, but at least it gives you a fast way to perform deletion.
An alternative is to define a slab on your array, copy the relevant data, then delete the old array; or always access the data through the slab and redefine it as you need. (I've never done it, though, so I'm not sure if it's possible, but it should be.)
Finally, you can use the hdf5 mounting strategy to have your datasets in an "attached" hdf5 file you mount on your root hdf5. When you want to delete the stuff, copy the interesting data in another mounted file, unmount the old file and remove it, then remount the new file in the proper place. This solution can be messy (as you have multiple files around) but it allows you to free space and to operate only on subparts of your data tree, instead of using the repack.
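To make the "copy the relevant data" idea concrete, here is a minimal sketch using h5py rather than PyTables (the dataset name and the ranges to keep are assumptions):

import h5py

keep = [(0, 101), (200, 301), (350, 401)]         # half-open element ranges to keep (assumed)
n_keep = sum(b - a for a, b in keep)

with h5py.File("old.h5", "r") as src, h5py.File("new.h5", "w") as dst:
    old = src["timeseries"]                        # assumed dataset name
    new = dst.create_dataset("timeseries", shape=(n_keep,), dtype=old.dtype)
    pos = 0
    for a, b in keep:
        new[pos:pos + (b - a)] = old[a:b]          # copy only the interesting slices
        pos += b - a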
Copying the data or using h5repack as you have described are the two usual ways of 'shrinking' the data in an HDF5 file, unfortunately.
The problem, as you may have guessed, is that an HDF5 file has a complicated internal structure (the file format is here, for anyone who is curious), so deleting and shrinking things just leaves holes in an identical-sized file. Recent versions of the HDF5 library can track the freed space and re-use it, but your use case doesn't seem to be able to take advantage of that.
As the other answer has mentioned, you might be able to use external links or the virtual dataset feature to construct HDF5 files that were more amenable to the sort of manipulation you would be doing, but I suspect that you'll still be copying a lot of data and this would definitely add additional complexity and file management overhead.
H5Gunlink() has been deprecated, by the way. H5Ldelete() is the preferred replacement.

Python: handling a large set of data. Scipy or Rpy? And how?

In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data is stored in a text file. Loading it into Excel is not possible.
2) I need to sum certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (SciPy or Rpy) is best suited for this task?
If you have one in mind, could you provide me with some pointers (e.g. documentation or an online example) that can help me implement a solution?
Speed is a concern. Ideally SciPy or Rpy should be able to handle large files even when the files are so large that they cannot fit into memory.
Neither Rpy nor SciPy is necessary, although NumPy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
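For illustration, a minimal sketch of such a parser that keeps a running total and the current top 10 rows (the file name, delimiter and column index are assumptions):

import csv
import heapq

total = 0.0
top10 = []                                        # min-heap of (value, row) pairs

with open("financial_data.txt") as f:
    for row in csv.reader(f, delimiter="\t"):
        value = float(row[2])                     # assumed: third field holds the amount
        total += value
        if len(top10) < 10:
            heapq.heappush(top10, (value, row))
        elif value > top10[0][0]:
            heapq.heapreplace(top10, (value, row))

print("grand total:", total)
for value, row in sorted(top10, reverse=True):
    print(value, row)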
Python's file I/O doesn't have bad performance, so you can just use the built-in file object directly. You can see what methods are available on it by typing help(file) in the interactive interpreter. Opening a file is part of the core language functionality and doesn't require you to import anything.
Something like:
f = open(r"C:\BigScaryFinancialData.txt", "r")
for line in f:
    # line is a string
    # do whatever you want to do on a per-line basis here, for example:
    print len(line)
f.close()
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
As #gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data? Is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array. For example:
import numpy as np
with open("data.csv", "rb") as f:
    title = f.readline()                  # if your data has a title line
    data = np.loadtxt(f, delimiter=",")   # if your data is separated by ","
print np.sum(data, axis=0)                # sum along axis 0 to get the total of every column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
Since this has the R tag I'll give some R solutions:
Overview: http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)
