Fast lookup in large datasets using python

Fast lookup in large datasets using python - python

I am processing the human genome and have ~10 million SNPs (identified by a "SNP_ID") in a single patient. I have two reference TSV's which contain rows, each row contains a SNP_ID and a floating point number (as well as lots of other metadata), it is all in ASCII format. These reference TSV's are 300-500GB in size.
I need to filter the 10 million SNPs based on criterion contained within the TSVs. In other words find the row with the SNP_ID, lookup the floating point number and decide if the value is above a threshold.
My thought is to store the SNPs in a python set, then do a scan over each TSV, doing a lookup to see if the row in the TSV matches any object in the set. Do you think this is a reasonable approach, or will the lookup time in the set with 10 million items be very slow? I have hundreds of patients this needs to be done over so it shouldn't take more than an hour or two to process.

Your data size is large enough that you should not be working with data structures in memory. Instead, consider using a relational database system. You can start with sqlite, which comes bundled with Python.
This SO answer has details about how to load a TSV into sqlite.
After your set of SNPs and your reference TSVs are in sqlite, you can filter the SNPs with a simple SQL query such as:
SELECT
t1.SNP_ID
FROM
snps t1
LEFT JOIN
ref_tsv t2
ON
t1.SNP_ID = t2.SNP_ID
WHERE
t2.value >= {your_threshold}
;

ok, here's what I would do in your case.
500GB of metadata is a lot, let's look how can we reduce this amount.
your idea to make a set() with SNP_ID is good. Read all your SNP data, make a set of SNP_ID, it will definitely fit into the memory
then read TSV data, for every row check if SNP_ID is in your set, if it is -- save the SNP_ID and the floating point number, discard the rest. You will have 10M records at the most, because one SNP has only that much.
do your magic
start over with the next SNP
It would be nice to put all the data on a fast SSD just in case.
And, something else to try, maybe if you discard the metadata, you will be able to reduce the TSV size to just a few gigabytes, saving the SNP_ID and the float? Then you may easily fit it into the memory and make things much faster.

Related

Pyarrow Write/Append Columns Arrow File

I have a calculator that iterates a couple of hundred object and produces Nx1 arrays for each of those objects. N here being 1-10m depending on configurations. Right now I am summing over these by using a generator expression, so memory consumption is low. However, I would like to store the Nx1 arrays to file, so I can do other computations.(Compute quantiles, partial sums etc. pandas style) Preferably I would like to use pa.memory_map on a single file (in order to have dataframes not loaded into memory), but I can not see how I can produce such a file without generating the entire result first. (Monte Carlo results on 200-500*10m floats).
If I understand correctly RecordBatchStreamWriter needs a part of the entire table, and I can not produce only a part of it. The parts the calculator produces is the columns, not parts of all columns. Is there any way of writing "columns" one by one? Either by appending, or create an empty arrow file which can be filled? (schema known).
As I see it, my alternative is to write several files and use "dataset" /tabular data to "join" them together. My "other computations" would then have to filter or pull parts into memory as I can`t see in the docs that "dataset()" work with memory_map.The result set is to big to fit in memory. (At least on the server it is running on)
I`m on day 2 of digging the docs and trying to understand how it all works, so apologies if the "lingo" is not all correct.
On further inspection, it looks like all files used in datasets() must have same schema, so I can not split "columns" in separate files either, can I..
EDIT
After wrestling with this library, I now produce single column files which I later combine in a single file. However, in following the suggested solution visible memory consumption (task manager) skyrockets in the step of combining the files. I would expect peaks for every "rowgroup" or combined recordbatch, but instead steadily increase to use all memory. A snip of this step:
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream(
os.path.join(self.filepath, self.outfile_name + ".arrow"),
schema=combined_schema,
) as writer:
for group in zip(*readers):
combined_batch = pa.RecordBatch.from_arrays(
[g.column(0) for g in group], names=combined_schema.names
)
writer.write_batch(combined_batch)
From this link I would expect that running memory consumption to be that of combined_batch and some.

You could do the write in two passes.
First, write each column to its own file. Make sure to set a row group size small enough that a table consisting of one row group from each file comfortably fits into memory.
Second, create a streaming reader for each file you created and one writer. Read a single row group from each one. Create a table by combining all of the partial columns and write the table to your writer. Repeat until you've exhausted all your readers.
I'm not sure that memory mapping is going to help you much here.

How to lively save pandas dataframe to file?

I have a Python program that is controlling some machines and stores some data. The data is produced at a rate of about 20 rows per second (and about 10 columns or so). The whole run of this program can be as long as one week, as a result there is a large dataframe.
What are safe and correct ways to store this data? With safe I mean that if something fails in the day 6, I will still have all the data from days 1→6. With correct I mean not re-writing the whole dataframe to a file in each loop.
My current solution is a CSV file, I just print each row manually. This solution is both safe and correct, but the problem is that CSV does not preserve data types and also occupies more memory. So I would like to know if there is a binary solution. I like the feather format as it is really fast, but it does not allow to append rows.

I can think of two easy options:
store chunks of data (e.g. every 30 seconds or whatever suits your use case) into separate files; you can then postprocess them back into a single dataframe.
store each row into an SQL database as it comes in. Sqlite will likely be a good start, but I'd maybe really go for PostgreSQL. That's what databases are meant for, after all.

get_range in random and ordered partitioner

How does the following statements help in improving program efficiency while handling large number of rows say 500 million.
Random Partitioner:
get_range()
Ordered Partitioner:
get_range(start='rowkey1',finish='rowkey10000')
Also how many rows can be handled at a time while using get_range for ordered partitioner with a column family having more than a million rows.
Thanks

Also how many rows can be handled at a time while using get_range for ordered partitioner with a column family having more than a million rows.
pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:
rows = list(cf.get_range())
Your python program will probably run out of memory. The correct way to use it would be:
for key, columns in cf.get_range():
process_data(key, columns)
This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().

EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver. Apparently it already takes care of all I mentioned below.
==========
If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.
If your question is whether you can query Cassandra for all rows in batches of N rows at a time if the random partitioner is in use, then the answer is "yes". The difference to using the order preserving partitioner is that you do not know what the first key for your next batch will be, so you have to use the last key of your current batch as the starting key and ignore the row when iterating over the new batch. For the first batch simply use the "empty" key as the key range limits. Also, there is no way to say how far you have come in relative terms by looking at a key that was returned, as the order is not preserved.
As for the number of rows: Start small. Say 10, then try 100, then 1000. Depending on the number of columns you are looking at, index sizes, available heap, etc. you will see a noticeable performance degradation for a single query beyond a certain threshold.

Saving large Python arrays to disk for re-use later --- hdf5? Some other method?

I'm currently rewriting some python code to make it more efficient and I have a question about saving python arrays so that they can be re-used / manipulated later.
I have a large number of data, saved in CSV files. Each file contains time-stamped values of the data that I am interested in and I have reached the point where I have to deal with tens of millions of data points. The data has got so large now that the processing time is excessive and inefficient---the way the current code is written the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
Read in all of the existing data to python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added I load my database, append the new data, and resave it. This way only a small number of data need to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: whats the best format to save the data in? HDF5 seems like it might have all the features I need---but would something like SQLite make more sense?
EDIT: My data is single dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it wasn't for the fact that there are so many points then CSV would be an ideal format! I am unlikely to want to do lookups of single entries---more likely is that I might want to plot small subsets of data (eg the last 100 hours, or the last 1000 hours, etc).

HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), many programs have support for it (matlab for example), there are libraries for C,C++,fortran,python,... It has a complete toolset to display the contents of a HDF5 file. If you later want to do complex MPI calculation on your data, HDF5 has support for concurrently read/writes. It's very well suited to handle very large datasets.

Maybe you could use some kind of key-value database like Redis, Berkeley DB, MongoDB... But it would be nice some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 232 - 1 elements (4294967295, more than 4
billion of elements per list). The main features of Redis Lists from
the point of view of time complexity are the support for constant time
insertion and deletion of elements near the head and tail, even with
many millions of inserted items. Accessing elements is very fast near
the extremes of the list but is slow if you try accessing the middle
of a very big list, as it is an O(N) operation.

I would use a single file with fixed record length for this usecase. No specialised DB solution (seems overkill to me in that case), just plain old struct (see the documentation for struct.py) and read()/write() on a file. If you have just millions of entries, everything should be working nicely in a single file of some dozens or hundreds of MB size (which is hardly too large for any file system). You also have random access to subsets in case you will need that later.

BST or Hash Table?

I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.

Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportionate to n - however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average of all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from e.g. 1000000 to 1500000 elements, that insert will take a lot of time, but now you've bought yourself 500000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.

You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:
n=n*2;
bucket=realloc(bucket, sizeof(bucket)*n);
for (i=0,j=n/2; j<n; i++,j++) {
bucket[j]=bucket[i];
}

library in C that I hope to make
available as a Python module
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault
scrunching each input row:
for i, field in enumerate(fields):
fields[i] = string_sharer(field, field)
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.