What does pandas index.searchsorted() do?

I'm working on two dataframes df1 and df2.
I used the code :
df1.index.searchsorted(df2.index)
But I'm not sure how it works.
Could someone please explain?

The method applies a binary search to the index. This is a well-known algorithm that uses the fact that values are already in sorted order to find an insertion index in as few steps as possible.
Binary search works by picking the middle element of the values, then comparing that to the searched-for value; if the value is lower than that middle element, you then narrow your search to the first half, or you look at the second half if it is larger.
This way you reduce the number of steps needed to find your element to at most the base-2 logarithm of the length of the index. For 1000 elements, that's at most 10 steps; for a million elements, at most 20, etc.
The insertion index is the place to add your value to keep the index in sorted order; the left location also happens to be the index of a matching value, so you can also use this both to find places to insert missing or duplicate values, and to test if a given value is present in the index.
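For a concrete picture, here is a minimal sketch with made-up values; df1.index and df2.index from the question behave the same way:

import pandas as pd

# A sorted index to search in, and some lookup values (illustrative only).
idx = pd.Index([10, 20, 30, 40, 50])
targets = pd.Index([15, 30, 60])

# For each target, searchsorted() returns the position at which it could be
# inserted while keeping idx sorted (the "left" position by default).
print(idx.searchsorted(targets))  # [1 2 5]: 15 before 20, 30 at its own spot, 60 past the end

# Because the left position of an exact match is the match itself, you can
# also use it as a membership test.
pos = idx.searchsorted(30)
print(pos < len(idx) and idx[pos] == 30)  # True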
The pandas implementation is basically the numpy.searchsorted() function, which uses generated C code to optimise this search for different object types, squeezing out every last drop of speed.
Pandas uses the method in various index implementations to ensure fast operations. You usually wouldn't use this method to test if a value is present in the index, for example, because Pandas indexes already implement an efficient __contains__ method for you, usually based on searchsorted() where that makes sense. See DateTimeEngine.__contains__() for such an example.

Related

Slicing operation with Dask in an optimal way with Python

I have a few questions about slicing operations.
In pandas we can do the following:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"],df["B"] is sorted
Since we can't do this (Slicing and Multiple_col_sorting) with Dask (I am not 100% sure), I used another way to do it:
df["A"]=df.sort_values(by=['A'])
first=list(df["A"])[0]
df["B"]=df.sort_values(by=['B'])
end=list(df["B"])[-1]
This way is really time-consuming when the dataframe is large. Is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index in Dask is different from Pandas: a Pandas index is a global ordering of the data, whereas Dask is indexed from 1 to N within each partition, so there are multiple rows with an index value of 1. I think this is why iloc on a row is disallowed.
For this specifically, use first and last:
https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations because they can be computed per partition and then combined across the partition results.
It's possible to get almost .iloc-like behaviour with dask dataframes, but it requires a full pass through the dataset. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1]; however, if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.
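A rough sketch of those two steps, assuming a small illustrative dataframe; the cumsum-based row numbering is only one way to build the global index (in the spirit of the linked answers), not the official one:

import pandas as pd
import dask.dataframe as dd

# Illustrative data; in practice this is already a large dask dataframe.
pdf = pd.DataFrame({"A": [5, 3, 9, 1], "B": [10, 40, 20, 30]})
ddf = dd.from_pandas(pdf, npartitions=2)

# 1) Build a global row number: a cumulative count across partitions.
#    This is the single pass over the whole dataset mentioned above.
ddf = ddf.assign(row_id=1)
ddf["row_id"] = ddf["row_id"].cumsum() - 1      # 0-based row numbers
ddf = ddf.set_index("row_id", sorted=True)

# 2) Swap .iloc[N] for .loc[N].
first_row = ddf.loc[0].compute()

# For .iloc[-1] you need the total number of rows to turn the relative
# position into an absolute one.
n_rows = len(ddf)
last_row = ddf.loc[n_rows - 1].compute()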

Is there something like a blocking index for approximately equal numeric values when indexing with the Python module recordlinkage?

I've got a sqlite database of music tracks, and I want to remove duplicates. I'd like to compare tracks based on title and duration. (I'll probably try to throw artists in later, but that's a separate table with multiple artists per track; for now, I've got a text field for the title and an integer field for the duration in seconds.) Duplicate tracks in this database tend to have similar titles (or at least similar prefixes) and durations within 5-10 seconds of each other.
I'm trying to learn recordlinkage to detect the duplicates, and my first attempt was to make a full index, use Smith-Waterman to compare titles and a simple linear numeric comparison for the duration. No big surprise: the database was way too big to do a full index. I could do a block index on the duration to limit the pairs to identical durations, but the durations are often off by a few seconds. I could do sorted neighborhood, but if I'm understanding this correctly (a big "if"), that means that if I set a window to (for example) 10, each track will only be paired with the 10 closest tracks in terms of duration, which will pretty much always be identical durations and completely miss the durations that are close but not identical. It seems to me like an "approximate blocking index" or something like that would be a natural step, but I can't seem to find any simple way to do that.
Can anyone help me out here?
Okay, answering my own question here because I believe I've figured out the misunderstanding in my original question.
I was misunderstanding how sorted neighborhood indexing works. I was thinking that if you set the window to (for example) 3, it would sort all the records by the key and then pair each record with exactly 3 neighboring records (the record itself, the one above it, and the one below it). So if there were more than 5 records with the same key value, this would actually result in fewer pairs than a block index. But I'm now pretty sure that it actually groups the values by the key first, so that a window of 3 pairs each record with all records that have the exact same key value, all records with the next highest key value, and all records with the next lowest key value.
Now this doesn't get me exactly what I asked for, but it gets me close enough. If I set a window size of 11 (or 21), then I'm guaranteed to get all values within 5 seconds (or 10 seconds). If the data is sparse with respect to duration, it will also pull in some pairs that are a bit further apart. (And this only works because it's integer data. If it were floating point numbers of arbitrary precision, that would be a different matter.)
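For reference, a minimal sketch of that setup with recordlinkage; the column names, window and threshold below are made up for illustration, and I'm assuming the Index()/Compare() style of API:

import pandas as pd
import recordlinkage

# Illustrative frame of tracks; in practice this comes from the sqlite database.
tracks = pd.DataFrame({
    "title": ["Song A", "Song A (remastered)", "Song B"],
    "duration": [200, 204, 318],
})

# Sorted neighbourhood on duration: a window of 11 pairs each record with the
# records in the 5 key values above and below it, which for integer seconds
# means everything within roughly 5 seconds (per the reasoning above).
indexer = recordlinkage.Index()
indexer.sortedneighbourhood("duration", window=11)
candidate_pairs = indexer.index(tracks)

# Compare titles and durations for each candidate pair.
compare = recordlinkage.Compare()
compare.string("title", "title", method="smith_waterman", label="title")
compare.numeric("duration", "duration", offset=5, scale=5, label="duration")
features = compare.compute(candidate_pairs, tracks)

# Treat pairs scoring highly on both features as likely duplicates
# (the threshold here is arbitrary).
duplicates = features[features.sum(axis=1) > 1.5]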

Python set membership test

In an algorithmics course our teacher covered "virtual initialization", where you allocate memory for an operation but don't initialize all the values, since the problem space might be too large compared to the values that actually need to be calculated (for example a dictionary or set), and setting a default value everywhere wastes a lot of time. The basic principle is to have two arrays that point to each other (index each other) and keep track of how many variables have been assigned. Let's say we have an array a and we want to find out if a[i] contains a valid value; we can use array b as an index into a like so:
I had a look at the Python time complexity table at https://wiki.python.org/moin/TimeComplexity and it mentions that a set membership test in the worst case can be O(n). I'm not sure where to find the exact implementation of the set type, but after a bit of googling, most people mention that it uses a hash table. My main questions are:
1. How can a hash table be used to check if a value is valid or not? We can hash every value and read the result, but that doesn't mean the output isn't garbage (we could be reading whatever was written to that address when malloc was called).
2. Virtual initialization completely avoids the problem of collisions in a hash table, so why is it not a better solution than using a hash table?
When you implement a hash table using an array, you need a flag in each entry to indicate whether it's currently in use. This is needed to deal with hash function collisions -- if two values have the same hash code, you can't put them both in the same element of the array.
This means that when you allocate the array, you have to go through it, initializing all these flags. And you have to do this again whenever you grow the hash table.
"Virtual initialization" avoids this. The algorithm you pasted is used to tell whether an entry a[k] is in use. It uses a second array b that contains indexes into a. When an element is inserted, a[k].p is set to an index j into b, and b[j] is set to k. If these two match, and j is also below the count of entries assigned so far, you know that the array element is in use.
To clear an entry, you simply set a[k].p = 0, since all valid entries have p between 1 and n.
This is an example of a time-space tradeoff in algorithm design. To avoid the time spent initializing the array, we allocate a second array. If you have lots of available memory, this can be an acceptable tradeoff.
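A small Python sketch of the two-array scheme (0-based here, unlike the 1-based description above, and the names are my own):

import random

class VirtualArray:
    # a[k] stores (back_pointer, value); b[j] records which slot was the
    # j-th one assigned; count is how many slots have been assigned so far.
    # An entry is valid only if a and b agree, so garbage in a is harmless.

    def __init__(self, n):
        # Simulate uninitialized (garbage) memory; a real implementation in a
        # low-level language would simply not initialize these buffers.
        self.a = [(random.randrange(n), random.randrange(n)) for _ in range(n)]
        self.b = [random.randrange(n) for _ in range(n)]
        self.count = 0

    def is_assigned(self, k):
        j, _ = self.a[k]
        # Valid iff the back pointer is in the assigned range and b points back to k.
        return 0 <= j < self.count and self.b[j] == k

    def set(self, k, value):
        if self.is_assigned(k):
            j, _ = self.a[k]
            self.a[k] = (j, value)
        else:
            self.a[k] = (self.count, value)
            self.b[self.count] = k
            self.count += 1

    def get(self, k):
        if not self.is_assigned(k):
            raise KeyError(k)
        return self.a[k][1]

va = VirtualArray(1000)
va.set(42, "hello")
print(va.is_assigned(42), va.is_assigned(7))   # True False

In a real low-level implementation neither a nor b would be initialized at all; the random values here just stand in for whatever garbage the allocator leaves behind.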

get_range in random and ordered partitioner

How do the following statements help in improving program efficiency when handling a large number of rows, say 500 million?
Random Partitioner:
get_range()
Ordered Partitioner:
get_range(start='rowkey1',finish='rowkey10000')
Also, how many rows can be handled at a time when using get_range with the ordered partitioner on a column family that has more than a million rows?
Thanks
Also, how many rows can be handled at a time when using get_range with the ordered partitioner on a column family that has more than a million rows?
pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:
rows = list(cf.get_range())
Your python program will probably run out of memory. The correct way to use it would be:
for key, columns in cf.get_range():
    process_data(key, columns)
This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().
EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver. Apparently it already takes care of everything I mentioned below.
==========
If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.
If your question is whether you can query Cassandra for all rows in batches of N rows at a time when the random partitioner is in use, then the answer is "yes". The difference from the order-preserving partitioner is that you do not know what the first key of your next batch will be, so you have to use the last key of your current batch as the starting key and skip that row when iterating over the new batch. For the first batch, simply use the "empty" key as the key range limit. Also, there is no way to tell how far along you are in relative terms by looking at a key that was returned, as the order is not preserved.
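In Python-ish terms, that batching loop might look like the following; fetch_rows() is a hypothetical stand-in for whatever raw range query your client exposes (with pycassa you would not need this, since get_range() already pages for you):

def scan_all_rows(fetch_rows, batch_size=100):
    # fetch_rows(start_key, count) is assumed to return up to `count`
    # (key, columns) pairs starting at start_key inclusive; an empty start
    # key means "from the beginning of the token range".
    start_key = ''
    first_batch = True
    while True:
        batch = fetch_rows(start_key, batch_size)
        if not batch:
            break
        for i, (key, columns) in enumerate(batch):
            # Every batch after the first starts with the key we already
            # yielded at the end of the previous batch, so skip it.
            if not first_batch and i == 0:
                continue
            yield key, columns
        if len(batch) < batch_size:
            break
        start_key = batch[-1][0]   # last key of this batch starts the next one
        first_batch = False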
As for the number of rows: Start small. Say 10, then try 100, then 1000. Depending on the number of columns you are looking at, index sizes, available heap, etc. you will see a noticeable performance degradation for a single query beyond a certain threshold.

Best way to implement 2-D array of series elements in Python

I have a dynamic set of data series, on the order of hundreds, where each series is identified by an integer and consists of elements that are also identified by an integer. Each element is an instance of a custom class.
I used a defaultdict to create a nested (2-D) dictionary. This enables me to quickly access a series and individual elements by key/ID. I needed to be able to add and delete elements and entire series, so the dict served me well. Also note that the IDs do not have to be sequential, due to adds and deletes. The IDs are important since they are unique and referenced elsewhere throughout my application.
For example, consider the following data set with keys/IDs,
[1][1,2,3,4,5]
[2][1,4,10]
[4][1]
However, now I realize I want to be able to insert elements in a series, but the dictionary doesn't quite support it. For example, I'd like to be able to insert a new element between 3 and 4 for series 1, causing the IDs above it (from 4,5) to increment (to 5,6):
[1][1,2,3,4,5] becomes
[1][1,2,3,4(new),5,6]
The order matters since the elements are part of a sequential series. I realize that this would be easier with a nested list since it supports insert(), but then I would be forced to iterate over the entire 2-D array to keep the element indices right, wouldn't I?
What would be the best way to implement this data structure in Python?
I think what you want is a dict with list values:
d = {1: [...], 3: [...], ...}
You can then operate on the lists as you please. If the list values are sequential ints, just use:
d[key].append(val)
d[key].sort()
Don't worry about the speed unless you find out it's a problem. Premature optimization
is the root of all evil.
In fact, don't even sort your dict vals until you have to, if you want to be really efficient.
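If you do want the insert-with-renumbering behaviour on top of that dict-of-lists idea, one option (a sketch with made-up element values) is to let the list position itself play the role of the element ID:

from collections import defaultdict

# Each series ID maps to an ordered list of elements; the position in the
# list (plus 1) serves as the element ID from the question.
series = defaultdict(list)
series[1] = ["a", "b", "c", "d", "e"]   # element IDs 1..5
series[2] = ["x", "y", "z"]

# Insert a new element between elements 3 and 4 of series 1.  Everything
# after it shifts up automatically, which is exactly the renumbering
# described in the question (4,5 become 5,6).
series[1].insert(3, "new")

def get_element(series_id, element_id):
    return series[series_id][element_id - 1]

print(get_element(1, 4))   # "new"
print(get_element(1, 6))   # "e", which used to be element 5

Only the one series you insert into shifts; the other series are untouched, so there is no need to walk the whole 2-D structure.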
