How does the following statements help in improving program efficiency while handling large number of rows say 500 million.
Random Partitioner:
get_range()
Ordered Partitioner:
get_range(start='rowkey1',finish='rowkey10000')
Also how many rows can be handled at a time while using get_range for ordered partitioner with a column family having more than a million rows.
Thanks
Also how many rows can be handled at a time while using get_range for ordered partitioner with a column family having more than a million rows.
pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:
rows = list(cf.get_range())
Your python program will probably run out of memory. The correct way to use it would be:
for key, columns in cf.get_range():
process_data(key, columns)
This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().
EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver. Apparently it already takes care of all I mentioned below.
==========
If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.
If your question is whether you can query Cassandra for all rows in batches of N rows at a time if the random partitioner is in use, then the answer is "yes". The difference to using the order preserving partitioner is that you do not know what the first key for your next batch will be, so you have to use the last key of your current batch as the starting key and ignore the row when iterating over the new batch. For the first batch simply use the "empty" key as the key range limits. Also, there is no way to say how far you have come in relative terms by looking at a key that was returned, as the order is not preserved.
As for the number of rows: Start small. Say 10, then try 100, then 1000. Depending on the number of columns you are looking at, index sizes, available heap, etc. you will see a noticeable performance degradation for a single query beyond a certain threshold.
Related
I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these into a data frame, to later run a function that compares each of the unique set of IDs.
(70_000 ** 69_000) / 2== 2.4 billion - it is not such a large number as to be not computable in a few hours (update I run a dry-run on itertools.product(range(70000), 2) and it took less than 70 seconds, on a 2017 era i7 #3GHz, naively using a single core) But if you are trying to keep this data in memory at once, them it won't fit - and if your system is configured to swap memory to disk before erroring with a MemoryError, this may slow-down the program by 2 or more orders of magnitude, and thus, that is when your problem come from.
itertools.combination does the right thing in this respect, and no need to try to change it for something else: it will yield one combination at a time. What you are doing with the result, however, do change things: if you are streaming the combination to a file and not keeping it in memory, it should be fine, and then, it is just computational time you can't speed up anyway.
If, on the other hand, you are collecting the combinations to a list or other data structure: there is your problem - don't do it.
Now. going a step further than your question, since these combinations are check-able and predictable, maybe trying to generate these is not the right approach at all - you don't give details on how these are to be used, but if used in a reactive form, or on a lazy form, you might have an instantaneous workflow instead.
Your Ram will run full. You can counter this with gc.collect() or emtpying the results but the found results have to be saved inbetween.
You could try something similar to the code below. I would create individual file names or save the results into a database since the result file will be some gb big. Additionaly range of the second loop can probably be divided by 2.
import gc
new_set=set()
for i in range(70000):
new_set.add(i)
print(new_set)
combined_set=set()
for i in range(len(new_set)):
print(i)
if i % 300 ==0:
with open("results","a") as f:
f.write(str(combined_set))
combined_set=set()
gc.collect()
for b in range(len(new_set)):
combined_set.add((i,b))
I've got a sqlite database of music tracks, and I want to remove duplicates. I'd like to compare tracks based on title and duration. (I'll probably try to throw artists in later, but that's a separate table (multiple artists per track), but for now, I've got a text field for the title and an integer field for the duration (in seconds).) Duplicate tracks in this database tend to have similar titles (or at least with similar prefixes) and durations within 5-10 seconds of each other.
I'm trying to learn recordlinkage to detect the duplicates, and my first attempt was to make a full index, use Smith-Waterman to compare titles and a simple linear numeric comparison for the duration. No big surprise; the database was WAAY too big to do a full index. I could do a block index on the duration to limit down to pairs to durations that are identical, but the durations are often off by a few seconds. I could do sorted neighborhood, but if* I'm understanding this correctly (*a big "if"), that means that if I set a window to (for example) 10, each track will only be paired with the 10 closest tracks in terms of duration, which will pretty much always be identical durations and completely miss the durations that are close but not identical. It seems to me like having an "approximate blocking index" or something like that would be a natural step, but I can't seem to find any simple way to do that.
Can anyone help me out here?
Okay, answering my own question here because I believe I've figured out the misunderstanding in my original question.
I was misunderstanding how sorted neighborhood indexing works. I was thinking that if you set the window to (for example) 3, it would sort all the records by the key and then pair each record with exactly 3 neighbor records (the record itself, the one above it, and the one below it). So if there were more than 5 records with the same key value, this would actually result in fewer pairs than a block index. But I'm now pretty sure that it's actually grouping the values by the key first, so that a window of 3 will pair with all records with the exact same key value, all the records with the next highest key value and all the records with the next lowest key value.
Now this doesn't get me exactly what I asked for, but it gets me close enough. If I set a window size of 11 (or 21), then I'll be guaranteed to get all values within 5 seconds (or 10 seconds). If the data is sparse with respect to duration, there will be a bit more. (And this only works because it's integer data. If it were floating point numbers of arbitrary precision, then that would be a different matter.)
What are the top factors to consider when tuning inserts for a LevelDB store?
I'm inserting 500M+ records in the form:
key="rs1234576543" very predictable structure. rs<1+ digits>
value="1,20000,A,C" string can be much longer but usually ~ 40 chars
keys are unique
key insert order is random
into a LevelDB store using the python plyvel, and see dramatic drop in speed as the number of records grows. I guess this is expected but are there tuning measures I could look at to make it scale better?
Example code:
import plyvel
BATCHSIZE = 1000000
db = plyvel.DB('/tmp/lvldbSNP151/', create_if_missing=True)
wb = db.write_batch()
# items not in any key order
for key, value in DBSNPfile:
wb.put(key,value)
if i%BATCHSIZE==0:
wb.write()
wb.write()
I've tried various batch sizes, which helps bit, but am hoping there's something else I've missed. For example, can knowing the max length of a key (or value) be leveraged?
(Plyvel author here.)
LevelDB keeps all database items in sorted order. Since you are writing in a random order, this basically means that all parts of the database get rewritten all the time since LevelDB has to merge SSTs (this happens in the background). Once your database gets larger, and you keep adding more items to it, this results in a reduced write throughput.
I suspect that performance will not degrade as badly if you have better locality of your writes.
Other ideas that may be worth trying out are:
increase the write_buffer_size
increase the max_file_size
experiment with a larger block_size
use .write_batch(sync=False)
The above can all be used from Python using extra keyword arguments to plyvel.DB and to the .write_batch() method. See the api docs for details.
I need to apply two running filters on a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder if it still might be the best solution for me.
My question:
Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?
Why I want to do it:
The data are 400x700 arrays for 15min time steps over a year (So I have 35000 400x700 arrays). Each time step is read into python individually. Now I need to apply one running filter that checks if the last four time steps are equal (element-wise) and if they are, then all four values are set to zero. The next filter uses the data after the first filter has run and checks if the sum of the last twelve time steps exceeds a certain value. When both filters are done I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered accumulated values.
I do not have enough memory to read in all the data at once. So I thought I could create a loop where for each time step a new variable for the 400x700 array is created and the two filters run. The older arrays that are filtered I could then add to the yearly sum and delete, so that I do not have more than 16 (4+12) time steps(arrays) in memory at all times.
I don’t now if it’s correct of me to ask such a question without any code to show, but I would really appreciate the help.
If your question is about the best data structure to keep a certain amount of arrays in memory, in this case I would suggest using a three dimensional array. It's shape would be (400, 700, 12) since twelve is how many arrays you need to look back at. The advantage of this is that your memory use will be constant since you load new arrays into the larger one. The disadvantage is that you need to shift all arrays manually.
If you don't want to deal with the shifting yourself I'd suggest using a deque with a maxlen of 12.
"Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?"
This is a very common question that I think a lot of programmers will face eventually. Two examples for Python on Stack Overflow:
generating variable names on fly in python
How do you create different variable names while in a loop? (Python)
The lesson to learn from this is to not use dynamic variable names, but instead put the pieces of data you want to work with in an encompassing data structure.
The data structure could e.g. be a list, dict or Numpy array. Also the collections.deque proposed by #Midnighter seems to be a good candidate for such a running filter.
I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.
Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportionate to n - however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average of all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from e.g. 1000000 to 1500000 elements, that insert will take a lot of time, but now you've bought yourself 500000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:
n=n*2;
bucket=realloc(bucket, sizeof(bucket)*n);
for (i=0,j=n/2; j<n; i++,j++) {
bucket[j]=bucket[i];
}
library in C that I hope to make
available as a Python module
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault
scrunching each input row:
for i, field in enumerate(fields):
fields[i] = string_sharer(field, field)
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".