subselection of columns in dask (from pandas) by computed boolean indexer - python

I'm new to dask (imported as dd) and am trying to convert some pandas (imported as pd) code.
The goal of the following lines is to slice the data to those columns whose values fulfill the calculated requirement, in dask.
There is a given table in CSV. The original pandas code reads
inputdata=pd.read_csv("inputfile.csv");
pseudoa=inputdata.quantile([.035,.965])
pseudob=pseudoa.diff().loc[.965]
inputdata=inputdata.loc[:,inputdata.columns[pseudob.values>0]]
inputdata.describe()
and is working fine.
My simple idea for conversion was to substitute the first line with
inputdata=dd.read_csv("inputfile.csv");
but that resulted in the strange error message IndexError: too many indices for array.
Even when switching to precomputed data in inputdata and pseudob, the error remains.
Maybe the question comes down specifically to the idea of computed boolean slicing of dask columns.
I just found a (maybe suboptimal) way (not a solution) to do that: changing line 4 to the following
inputdata=inputdata.loc[:,inputdata.columns[(pseudob.values>0).compute()[0]]]
seems to work.

Yes, Dask.dataframe's .loc accessor only works if it gets concrete indexing values. Otherwise it doesn't know which partitions to ask for the data. Computing your lazy dask result to a concrete Pandas result is one sensible solution to this problem, especially if your indices fit in memory.

Related

Slicing operation with Dask in an optimal way with Python

I have a few questions about the slicing operation.
In pandas we can do operations as follows:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"] and df["B"] are sorted
Since we can't do this (slicing and multiple-column sorting) with Dask (I am not 100% sure), I used another way to do it:
sorted_a = df.sort_values(by=['A'])
first = list(sorted_a["A"])[0]
sorted_b = df.sort_values(by=['B'])
end = list(sorted_b["B"])[-1]
This way is really time-consuming when the dataframe is large. Is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index of Dask is different from Pandas, because Pandas keeps a global ordering of the data. Dask indexes each partition from 1 to N, so there are multiple items with an index value of 1. This is why .iloc on a row is disallowed, I think.
For this specifically, use
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations because they can be computed per partition and then combined across the partition results.
It's possible to get almost .iloc-like behaviour with dask dataframes, but it requires a full pass through the whole dataset once. For very large datasets, this can be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1], however if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.

Ways of Creating List from Dask dataframe column

I want to create a list/set from a Dask dataframe column. Basically, I want to use this list to filter rows in another dataframe by matching values with a column in this dataframe. I have tried using list(df[column]) and set(df[column]), but it takes a lot of time and ends up giving an error about creating a cluster, or sometimes restarts the kernel when the memory limit is reached.
Can I use dask.bag or multiprocessing to create a list?
When you try to convert a column to a list or set with the regular Python list/set constructors, Python loads everything into memory; that's why you get a memory limit issue.
I believe that using dask.bag might solve that issue, since dask.bag will lazily load your data, although I'm not sure whether df[column] won't have to be read first. Also, be aware that turning that column into a bag will take a while, depending on how big the data is.
Using a dask.bag allows you to run map, filter and aggregate so it seems it could be a good solution for your problem.
You can try to run this to see if it filters the list/bag as you expect.
import dask.bag as db
bag = db.from_sequence(df[column], npartitions=5)
bag.filter(lambda list_element: list_element == "filtered row")
Since this is just an example, you will need to change the npartitions and the lambda expression to fit your needs.
Let me know if this helps.

After how many columns does the Pandas 'set_index' function stop being useful?

As I understand it, the advantage to using the set_index function with a particular column is to allow for direct access to a row based on a value. As long as you know the value, this eliminates the need to search using something like loc thus cutting down the running time of the operation. Pandas also allows you to set multiple columns as the index using this function. My question is, after how many columns do these indexes stop being valuable? If I were to specify every column in my dataframe as the index would I still see increased speed in indexing rows over searching with loc?
The real downside of setting everything as index is buried deep in the advanced indexing docs of Pandas: indexing can change the dtype of the column being set to index. I would expect you to encounter this problem before realizing the prospective performance benefit.
As for that performance benefit, you pay for indexing up front when you construct the Series object, regardless of whether you set an index explicitly; AFAIK Pandas indexes everything by default. And as Jake VanderPlas puts it in his excellent book:
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
-- Jake VanderPlas, The Python Data Science Handbook
So, the reason to set something as index is to make it easier for you to work with your data or to support your data access pattern, not necessarily for performance optimization like a database index.
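A small sketch of that access-pattern argument, with hypothetical data: either way the frame is indexed, but a meaningful index turns a boolean-mask search into a direct label lookup.

```python
import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob", "cal"], "score": [10, 20, 30]})

# without set_index: locate the row via a boolean mask
bob_score = df.loc[df["user"] == "bob", "score"].iloc[0]

# with set_index: direct access by label
indexed = df.set_index("user")
bob_score_indexed = indexed.loc["bob", "score"]
print(bob_score, bob_score_indexed)
```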

Using .loc in pandas slows down calculation

I have the following dataframe, where I want to assign the bottom 1% value to a new column. When I do this calculation using the .loc notation, the assignment takes around 10 seconds, whereas the alternative solution takes only 2 seconds.
df_temp = pd.DataFrame(np.random.randn(100000000,1),columns=list('A'))
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)
Why is the .loc solution slower? I understand using the .loc solution is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?
.loc searches along the entirety of the indices and columns (in this case, only one column) of your df, which is time-consuming and perhaps redundant, in addition to computing the quantile of df_temp['A'] (which is negligible as far as calculation time goes). Your direct assignment method, on the other hand, just evaluates df_temp['A'].quantile(0.01) and assigns it to df_temp['q']. It doesn't need to exhaustively search the indices/columns of your df.
See this answer for a similar description of the .loc method.
As far as safety is concerned, you are not using chained indexing, so you're probably safe (you're not trying to set anything on a copy of your data, it's being set directly on the data itself). It's good to be aware of the potential issues with not using .loc (see this post for a nice overview of SettingWithCopy warnings), but I think that you're OK as far as that goes.
If you want to be more explicit about your column creation, you could do something along the lines of df = df.assign(q=df_temp["A"].quantile(0.01)). It won't really change performance (I don't think), nor the result, but it allows you to see that you're explicitly assigning a new column to your existing dataframe (and thus not setting anything on a copy of said dataframe).
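For instance, a small sketch of that .assign variant (downsized from the question's 100 million rows so it runs quickly):

```python
import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(1000, 1), columns=list("A"))
q = df_temp["A"].quantile(0.01)

# explicit column creation: assign returns a new dataframe with
# the extra column, so nothing is set on a hidden copy
df_temp = df_temp.assign(q=q)
print(df_temp["q"].eq(q).all())
```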

Is it a good practice to preallocate an empty dataframe with types?

I'm trying to load around 3 GB of data into a Pandas dataframe, and I figured that I would save some memory by first declaring an empty dataframe while enforcing that its float columns would be 32-bit instead of the default 64-bit. However, the Pandas dataframe constructor does not allow specifying the dtypes of multiple columns on an empty dataframe.
I found a bunch of workarounds in the replies to this question, but they made me realize that Pandas is not designed in this way.
This made me wonder whether it was a good strategy at all to declare the empty dataframe first, instead of reading the file and then downcasting the float columns (which seems inefficient memory-wise and processing-wise).
What would be the best strategy to design my program?
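One common strategy (a sketch, not necessarily the best one) is to skip the empty-dataframe step entirely and pass the dtypes to the reader, so the floats never materialize as 64-bit in the first place. The column names and the in-memory stand-in for the real 3 GB file are hypothetical.

```python
import io
import pandas as pd

# stand-in for the real 3 GB CSV file
csv_data = io.StringIO("a,b\n1.5,2.5\n3.5,4.5\n")

# request 32-bit floats directly at parse time,
# instead of reading float64 and downcasting afterwards
df = pd.read_csv(csv_data, dtype={"a": "float32", "b": "float32"})
print(df.dtypes["a"], df.dtypes["b"])
```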
