I'm looking for the time complexity of these methods as a function of the number of rows in a dataframe, n.
Another way of asking this question is: Are indexes for dataframes in pandas btrees (with log(n) time look ups) or hash tables (with constant time lookups)?
Asking this question because I'd like a way to do constant time look ups for rows in a dataframe based on a custom index.
Alright so it would appear that:
1) You can build your own index on a dataframe with .set_index in O(n) time where n is the number of rows in the dataframe
2) The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index. So accessing a row for the first time using that index takes O(n) time
3) All subsequent row access takes constant time.
So it looks like the indexes are hash tables and not btrees.
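To make points 2) and 3) concrete, here is a minimal pure-Python sketch of a lazily built hash index. It is not pandas code: the class `LazyHashIndex`, its `builds` counter, and the sample labels are made up for illustration, and it assumes the labels are unique.

```python
class LazyHashIndex:
    """Toy model of a label index that builds its hash table on first use."""

    def __init__(self, labels):
        self._labels = list(labels)   # O(n) copy of the labels
        self._mapping = None          # hash table is not built yet
        self.builds = 0               # counts how often the table was built

    def get_loc(self, label):
        if self._mapping is None:
            # First lookup: one O(n) pass mapping each label to its position.
            self._mapping = {lab: pos for pos, lab in enumerate(self._labels)}
            self.builds += 1
        # Every later lookup is a single dict access, O(1) on average.
        return self._mapping[label]


index = LazyHashIndex(["u17", "u03", "u42"])
first = index.get_loc("u42")    # triggers the O(n) build
second = index.get_loc("u03")   # reuses the existing table
```

The first call pays for the build and every later call is a plain dict lookup, which matches the behaviour described above.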
From the Pandas Internals documentation, the default DataFrame index:
"Populates a dict of label to location in Cython to do O(1) lookups."
A dict is implemented as a hash table, which supports Peter Berg's answer to this question.
I have a few questions about the slicing operation.
In pandas we can do the following:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"] and df["B"] are sorted
As far as I can tell, we can't do this (slicing and sorting by multiple columns) with Dask (I am not 100% sure), so I used another approach:
df["A"]=df.sort_values(by=['A'])
first=list(df["A"])[0]
df["B"]=df.sort_values(by=['B'])
end=list(df["B"])[-1]
This approach is really time-consuming when the dataframe is large. Is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index of a Dask dataframe is different from a Pandas index because a Pandas index is a global ordering of the data. A Dask dataframe is typically numbered separately within each partition, so multiple rows can share the same index value. I think this is why iloc on a row is disallowed.
For this specifically, use
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are highly parallelizable operations because they can be computed per partition and then applied again to the results from each partition.
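As a rough illustration of the "per partition, then again on the partial results" idea, here is a pure-Python sketch. It assumes that what the question needs is the first value of sorted A and the last value of sorted B, which are simply the minimum of A and the maximum of B. The partitions are plain lists with made-up numbers and `two_stage` is a hypothetical helper, not a Dask function.

```python
# Hypothetical partitions of two columns; in Dask each would be a pandas object.
partitions_a = [[7, 3, 9], [4, 8], [5, 1, 6]]
partitions_b = [[20, 11], [35, 14, 2], [9]]


def two_stage(reduce_fn, partitions):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce_fn(part) for part in partitions if part]
    return reduce_fn(partials)


smallest_a = two_stage(min, partitions_a)  # same value as sorted A at position 0
largest_b = two_stage(max, partitions_b)   # same value as sorted B at position -1
```

No global sort is needed here. In Dask itself, something like df["A"].min().compute() should give the same kind of result, though I have not benchmarked it against the sorting approach.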
It's possible to get almost .iloc-like behaviour with dask dataframes, but it requires having a pass through the whole dataset once. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1], however if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.
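The numbering idea can be sketched in plain Python. This only illustrates the bookkeeping, not the Dask calls: the partitions are made-up lists, and `by_row_number` and `absolute_position` are hypothetical names standing in for the new index and the position arithmetic.

```python
from itertools import accumulate

# Hypothetical partitions; only their lengths matter for the numbering.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]

# One pass over the data to learn each partition's length.
lengths = [len(part) for part in partitions]
# Offset of a partition = number of rows in all partitions before it.
offsets = [0] + list(accumulate(lengths))[:-1]
total_rows = sum(lengths)

# Zero-based global row number -> value, playing the role of the new index.
by_row_number = {
    offset + i: value
    for offset, part in zip(offsets, partitions)
    for i, value in enumerate(part)
}


def absolute_position(pos, n_rows):
    """Turn an .iloc-style position (possibly negative) into a row label."""
    return pos + n_rows if pos < 0 else pos


last_value = by_row_number[absolute_position(-1, total_rows)]
fourth_value = by_row_number[absolute_position(3, total_rows)]
```

The pass that computes `lengths` is the one-time cost mentioned above. After that, a relative position such as -1 is just simple arithmetic on the known row count.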
Suppose we have a very, very large table similar to this one.
If we use pyspark to read the whole table and run
groupby("id").agg({'value': 'sum'})
compared with reading the table piece by piece,
filtering each piece with a where on date first, then running
groupby("id").agg({'value': 'sum'})
on each piece and finally summing all the partial values together,
which one should be faster?
Can you please elaborate on your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (sums over all dates)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
The cardinality of your id column: whether it has a very large number of unique values or a small number of unique values that repeat.
The file format you are using: columnar (like parquet) or row-based (like csv).
The partitioning and bucketing strategy you use for storing the files. If the date column holds only a few distinct values, go for partitioning; otherwise use bucketing.
All these factors help determine whether the two cases above will show similar processing times or differ drastically. This is because reading a large amount of data takes more time than reading less data that has been pruned by the given filters. Your shuffle block size and number of partitions also play a key role.
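For reference, here is a small pure-Python sketch of the two strategies from the question next to case-2. It only shows that the per-date partial sums add up to the same result as a single aggregation; it says nothing about Spark timings. The rows and the helper `sum_by_id` are made up, since the real table is not shown.

```python
from collections import defaultdict

# Hypothetical rows of (id, date, value).
rows = [
    ("a", 1, 10), ("b", 1, 5), ("a", 2, 7),
    ("c", 2, 1), ("b", 3, 4), ("a", 3, 2),
]


def sum_by_id(records):
    """Group by id and sum the value column."""
    totals = defaultdict(int)
    for id_, _date, value in records:
        totals[id_] += value
    return dict(totals)


# Case 1: a single aggregation over every date.
full = sum_by_id(rows)

# Question's alternative: filter by date, aggregate each slice, add the partials.
combined = defaultdict(int)
for date in sorted({d for _id, d, _v in rows}):
    partial = sum_by_id(r for r in rows if r[1] == date)
    for id_, subtotal in partial.items():
        combined[id_] += subtotal
combined = dict(combined)

# Case 2: only date == 1 is read and aggregated.
date_1_only = sum_by_id(r for r in rows if r[1] == 1)
```

In this toy version every filter scans all rows again, so the piecewise route does strictly more work. Whether a date filter is cheap in Spark depends on the storage layout, as described above.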
As I understand it, the advantage to using the set_index function with a particular column is to allow for direct access to a row based on a value. As long as you know the value, this eliminates the need to search using something like loc thus cutting down the running time of the operation. Pandas also allows you to set multiple columns as the index using this function. My question is, after how many columns do these indexes stop being valuable? If I were to specify every column in my dataframe as the index would I still see increased speed in indexing rows over searching with loc?
The real downside of setting everything as index is buried deep in the advanced indexing docs of Pandas: indexing can change the dtype of the column being set to index. I would expect you to encounter this problem before realizing the prospective performance benefit.
As for that performance benefit, you pay for indexing up front when you construct the Series object, regardless of whether you explicitly set an index. AFAIK Pandas indexes everything by default. And as Jake VanderPlas puts it in his excellent book:
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
-- Jake VanderPlas, The Python Data Science Handbook
So, the reason to set something as index is to make it easier for you to work with your data or to support your data access pattern, not necessarily for performance optimization like a database index.
I'm working on two dataframes df1 and df2.
I used this code:
df1.index.searchsorted(df2.index)
But I'm not sure how it works.
Could someone please explain it to me?
The method applies a binary search to the index. This is a well-known algorithm that uses the fact that values are already in sorted order to find an insertion index in as few steps as possible.
Binary search works by picking the middle element of the values, then comparing that to the searched-for value; if the value is lower than that middle element, you then narrow your search to the first half, or you look at the second half if it is larger.
This way you reduce the number of steps needed to find your element to roughly the base-2 logarithm of the length of the index. For 1000 elements, that's about 10 steps; for a million elements, about 20; and so on.
The insertion index is the place to add your value to keep the index in sorted order; the left location also happens to be the index of a matching value, so you can also use this both to find places to insert missing or duplicate values, and to test if a given value is present in the index.
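Here is a small pure-Python sketch of the left-sided search described above, with a step counter added so the logarithmic bound can be checked. The names `search_left` and `contains` are made up for this example. The standard library's bisect.bisect_left does the same search without the counter.

```python
from bisect import bisect_left
from math import log2


def search_left(values, target):
    """Return (insertion index, number of halving steps) for a sorted list."""
    lo, hi, steps = 0, len(values), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if values[mid] < target:
            lo = mid + 1      # target can only be in the upper half
        else:
            hi = mid          # target is at mid or in the lower half
    return lo, steps


def contains(values, target):
    """Membership test: the left insertion index points at a match, if any."""
    pos, _ = search_left(values, target)
    return pos < len(values) and values[pos] == target


index = list(range(0, 2000, 2))         # 1000 sorted even numbers
pos, steps = search_left(index, 501)    # 501 is absent from the index
max_steps = int(log2(len(index))) + 1   # worst case for this loop
```

Inserting 501 at `pos` keeps the list sorted, and `steps` never exceeds `max_steps`, which is 10 for 1000 elements.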
The pandas implementation is basically the numpy.searchsorted() function, which uses generated C code to optimise this search for different object types, squeezing out every last drop of speed.
Pandas uses the method in various index implementations to ensure fast operations. You usually wouldn't use this method to test if a value is present in the index, for example, because Pandas indexes already implement an efficient __contains__ method for you, usually based on searchsorted() where that makes sense. See DateTimeEngine.__contains__() for such an example.
How do the following statements help improve program efficiency when handling a large number of rows, say 500 million?
Random Partitioner:
get_range()
Ordered Partitioner:
get_range(start='rowkey1',finish='rowkey10000')
Also, how many rows can be handled at a time when using get_range with the ordered partitioner on a column family that has more than a million rows?
Thanks
Also, how many rows can be handled at a time when using get_range with the ordered partitioner on a column family that has more than a million rows?
pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:
rows = list(cf.get_range())
your Python program will probably run out of memory. The correct way to use it would be:
for key, columns in cf.get_range():
    process_data(key, columns)
This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().
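To show why the loop form stays cheap, here is a pure-Python sketch of a generator that hands out rows in chunks. It is not pycassa: this `get_range` is a stand-in written for the example and `store` is an in-memory dict with made-up rows. Only the chunking pattern is the point.

```python
def get_range(store, buffer_size=1024):
    """Hypothetical stand-in for a range query that fetches rows in chunks."""
    keys = sorted(store)
    for start in range(0, len(keys), buffer_size):
        # Only one chunk of rows is materialised at any time.
        chunk = [(key, store[key]) for key in keys[start:start + buffer_size]]
        yield from chunk


store = {f"row{i:04d}": {"col": i} for i in range(5000)}

total = 0
for key, columns in get_range(store, buffer_size=1024):
    total += columns["col"]   # process one row, then move on to the next
```

Wrapping the call in list(...) would instead hold all 5000 rows at once, which is the pattern to avoid with a real column family.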
EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver. Apparently it already takes care of all I mentioned below.
==========
If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.
If your question is whether you can query Cassandra for all rows in batches of N rows at a time if the random partitioner is in use, then the answer is "yes". The difference to using the order preserving partitioner is that you do not know what the first key for your next batch will be, so you have to use the last key of your current batch as the starting key and ignore the row when iterating over the new batch. For the first batch simply use the "empty" key as the key range limits. Also, there is no way to say how far you have come in relative terms by looking at a key that was returned, as the order is not preserved.
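The batching scheme can be sketched in plain Python as follows. This is a simulation, not driver code: `fetch_batch` is a hypothetical stand-in for an inclusive range query, and a dict with made-up keys plays the role of a table whose key order is fixed but not sorted. It assumes a batch size of at least 2, because one row of every later batch is thrown away.

```python
def fetch_batch(table, start_key, count):
    """Hypothetical range query: up to `count` rows from start_key onward,
    inclusive, in the table's own fixed but unsorted key order."""
    keys = list(table)
    begin = 0 if start_key == "" else keys.index(start_key)
    return [(k, table[k]) for k in keys[begin:begin + count]]


def iterate_all(table, batch_size):
    """Walk the whole table in batches; assumes batch_size >= 2."""
    start_key = ""                   # the "empty" key starts at the beginning
    while True:
        batch = fetch_batch(table, start_key, batch_size)
        if start_key != "":
            batch = batch[1:]        # first row repeats the previous batch's last row
        if not batch:
            return
        yield from batch
        start_key = batch[-1][0]     # last key seen becomes the next start key


table = {"k7": 1, "k2": 2, "k9": 3, "k4": 4, "k1": 5}
seen = [key for key, _ in iterate_all(table, batch_size=3)]
```

Each row comes back exactly once, in the store's order rather than in sorted key order, so a key gives no hint of how far the scan has progressed.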
As for the number of rows: Start small. Say 10, then try 100, then 1000. Depending on the number of columns you are looking at, index sizes, available heap, etc. you will see a noticeable performance degradation for a single query beyond a certain threshold.