I want to create a list/set from a Dask DataFrame column. Basically, I want to use this list to filter rows in another dataframe by matching values with a column in this dataframe. I have tried using list(df[column]) and set(df[column]), but it takes a lot of time and ends up raising an error about creating a cluster, or sometimes it restarts the kernel when the memory limit is reached.
Can I use dask.bag or multiprocessing to create a list?
When you try to convert a column to a list or set with the regular list/set built-ins, Python will load all of that data into memory; that's why you get a memory limit issue.
I believe that using dask.bag might solve that issue, since dask.bag loads your data lazily, although I'm not sure whether df[column] won't have to be read first. Also, be aware that turning that column into a bag will take a while, depending on how big the data is.
Using a dask.bag lets you run map, filter and aggregate operations, so it seems like it could be a good solution for your problem.
You can try to run this to see if it filters the list/bag as you expect.
import dask.bag as db

bag = db.from_sequence(df[column], npartitions=5)
filtered = bag.filter(lambda list_element: list_element == "filtered row")  # lazy; nothing is computed yet
Since this is just an example, you will need to change the npartitions and the lambda expression to fit your needs.
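To actually materialize the filtered result you would call compute() on the bag. Here is a tiny self-contained sketch; the literal sequence and the "filtered row" value are just placeholders standing in for df[column]:

import dask.bag as db

bag = db.from_sequence(["a", "filtered row", "b"], npartitions=2)   # stand-in for df[column]
matches = bag.filter(lambda element: element == "filtered row")     # still lazy at this point
print(matches.compute())                                            # ['filtered row']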
Let me know if this helps
I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion record pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair MultiIndex. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but it doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the MultiIndex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire MultiIndex into memory and causing an out-of-memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this, sort of, I am going to leave it open because I suspect that, given my inexperience with Python, my process could be improved.
Basically, I had to ditch the index function from the Record Linkage Toolkit. I pulled out the index of the dataframe I was using, converted it to a list, and passed it through the itertools combinations function.
from itertools import combinations

# fl is the dataframe being deduplicated
candidates = fl.index.tolist()
candidates = combinations(candidates, 2)  # lazy iterator of record-pair tuples
This gave me an iterator full of tuples, without having to load everything into memory. I then consumed it in chunks by passing it through an islice grouper in a for loop.
from itertools import islice
for x in iter(lambda: list(islice(candidates, 1000000)), []):
I then proceeded to perform all of the necessary comparisons inside the for loop, and added each resulting dataframe to a dictionary, which I then concatenate at the end for the full result. Python's memory usage hasn't risen above 3 GB the entire time.
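To make the chunked loop concrete, here is a rough, self-contained sketch of the approach; the toy dataframe and the comparison rules (surname, address) are made up, and the recordlinkage calls should be double-checked against the toolkit's documentation:

from itertools import combinations, islice
import pandas as pd
import recordlinkage

# tiny stand-in for the real dataframe `fl` from the question
fl = pd.DataFrame({'surname': ['smith', 'smith', 'jones', 'jonse'],
                   'address': ['1 main st', '1 main st.', '2 oak ave', '2 oak ave']})

candidates = combinations(fl.index.tolist(), 2)        # lazy iterator of pair tuples

compare = recordlinkage.Compare()
compare.exact('surname', 'surname', label='surname')   # hypothetical comparison rules
compare.string('address', 'address', label='address')

results = {}
for i, chunk in enumerate(iter(lambda: list(islice(candidates, 1000000)), [])):
    pairs = pd.MultiIndex.from_tuples(chunk)            # compute() expects a MultiIndex
    results[i] = compare.compute(pairs, fl)              # features for this chunk only

features = pd.concat(results.values())                  # combine the chunks at the end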
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).
If I have a dataset with unknown divisions and would like to sort it according to a column and output to Parquet, it seems to me that Dask does at least some of the work twice:
import dask
import dask.dataframe as dd

def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x

df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.set_index(['name'])  # <- `my_identity` is calculated here, as well as other tasks
df.to_parquet('temp.parq')   # <- previous tasks seem to be recalculated here
If my_identity was computationally demanding, then recomputing it would be really costly.
Am I correct in my understanding that Dask does some work twice here? Is there any way to prevent that?
The explanation below may not be accurate, but hopefully helps a bit.
Let's try to get into Dask's shoes on this. We are asking Dask to create an index based on some variable... Dask only works with sorted indexes, so it will want to know how to re-arrange the data to make it sorted, and also what the appropriate divisions for the partitions will be. The first calculation you see is doing that, and Dask will store only the parts of the calculation necessary for the divisions/data reshuffling.
Then, when we ask Dask to save the data, it computes the variables, shuffles the data (in line with the previous computations) and stores it in the corresponding partitions.
How to avoid this? Possible options:
Persist before setting the index. Once you persist, Dask will compute the variable and keep it on the workers, so setting the index will refer to the results of that computation (there will still be some reshuffling of the data needed); see the sketch after this list. Note that the documentation suggests persisting after setting the index, but that case assumes that the column already exists (i.e. does not require a separate computation).
Sort within partitions; this can be done lazily, but of course it's only an option if you do not need a global sort.
Use plain pandas; this may necessitate some manual chunking of the data (this is what I tend to use for sorting).
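For the first option, a minimal sketch based on the example from the question (whether the recomputation actually disappears is worth confirming on the dashboard):

import dask

def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x

df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.persist()            # my_identity is computed once and kept on the workers
df = df.set_index('name')    # only needs to reshuffle the persisted partitions
df.to_parquet('temp.parq')   # no recomputation of my_identity here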
I would like to define the way in which a Dask dataframe is created (e.g., particular criteria for splitting), or be able to create one manually.
The situation:
I have a Python function that traverses a subset of a large data frame. The traversal can be limited to all rows that match a certain key. So I need to ensure that this key is not split over several partitions.
Currently, I am splitting the input data frame (Pandas) manually and use multiprocessing to process each partition separately.
I would love to use Dask, which I also use for other computations, due to its ease of use. But I can't find a way to manually define how the input dataframe is split in order to later use map_partitions.
Or am I on a completely wrong path here, and should I be using other methods of Dask?
You might find dask.delayed useful, and then use that to create a custom Dask dataframe: https://docs.dask.org/en/latest/dataframe-create.html#dask-delayed
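A rough sketch of that idea, with made-up names (pdf, key, value): split the pandas dataframe yourself so that each key ends up in exactly one piece, wrap each piece with dask.delayed, and build the Dask dataframe from those pieces, so that map_partitions later only ever sees whole keys.

import pandas as pd
import dask
import dask.dataframe as dd

pdf = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'c'],
                    'value': [1, 2, 3, 4, 5]})

# one pandas dataframe per key -> a key can never be split across partitions
pieces = [dask.delayed(group) for _, group in pdf.groupby('key')]
ddf = dd.from_delayed(pieces, meta=pdf.iloc[:0])

# each partition now holds complete keys, so a per-partition function is safe
result = ddf.map_partitions(lambda part: part.assign(total=part['value'].sum())).compute()

In a real workload each delayed piece could just as well be a function call that loads or builds one chunk, rather than an in-memory slice.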
I am trying to use groupby and apply a custom function on a huge dataset, which is giving me memory errors, and the workers are getting killed because of the shuffling. How can I avoid the shuffle and do this efficiently?
I am reading around fifty 700 MB (each) parquet files, and the data in those files is isolated, i.e. no group exists in more than one file. If I try running my code on one file, it works fine, but it fails when I try to run it on the complete dataset.
Dask documentation talks about problems with groupby when you apply a custom function on groups, but they do not offer a solution for such data:
http://docs.dask.org/en/latest/dataframe-groupby.html#difficult-cases
How can I process my dataset in a reasonable timeframe (it takes around 6 minutes for the groupby-apply on a single file) and hopefully avoid the shuffle? I do not need my results to be sorted, nor do I need groupby to try to sort my complete dataset across files.
I have tried using persist, but the data does not fit into RAM (32 GB). Dask does not support a multi-column index, but I tried adding an index on one column to support the groupby, to no avail. Below is what the structure of the code looks like:
import pandas as pd
from dask.dataframe import read_parquet

# custom_function sorts the data within a group (the groups are small, fewer than 50 entries)
# on a field and computes some values based on heuristics (it computes 4 values, but I am
# showing 1 in the example below; the other 3 calculations are similar)
def custom_function(group):
    results = {}
    sorted_group = group.sort_values(['C']).reset_index(drop=True)
    sorted_group['delta'] = sorted_group['D'].diff()
    sorted_group['delta'] = sorted_group['delta'].shift(-1)
    results['res1'] = (sorted_group[sorted_group.delta < -100]['D'].sum() - sorted_group.iloc[0]['D'])
    # similarly 3 more results are generated
    results_df = pd.DataFrame(results, index=[0])
    return results_df

df = read_parquet('s3://s3_directory_path')
results = df.groupby(['A', 'B']).apply(custom_function).compute()
One possibility is that I process one file at a time and do it multiple times, but in that case dask seems useless (no parallel processing) and it will take hours to achieve the desired results. Is there any way to do this efficiently using dask, or any other library? How do people deal with such data?
If you want to avoid shuffling and can promise that the groups are well isolated, then you could just call a pandas groupby-apply within every partition with map_partitions:
df.map_partitions(lambda part: part.groupby(...).apply(...))
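Applied to the structure in the question, that might look roughly like the following; it assumes each parquet file maps to one partition (worth checking via df.npartitions), and custom_function is the one defined in the question. You may also want to pass an explicit meta= to map_partitions to help Dask infer the output schema.

from dask.dataframe import read_parquet

df = read_parquet('s3://s3_directory_path')   # path as in the question

# pandas groupby-apply inside each partition: no cross-partition shuffle
results = df.map_partitions(
    lambda part: part.groupby(['A', 'B']).apply(custom_function)
).compute()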
I was wondering how pandas handles memory usage in Python, and more specifically how memory is handled if I assign the results of a pandas dataframe query to a variable. Under the hood, would it just be some references to the original dataframe object, or would I be cloning all of the data?
I'm afraid of memory ballooning out of control, but I have a dataframe that has non-unique fields I can't index it by. It's incredibly slow to query and plot data from it using commands like df[(df[''] == x) & (df[''] == y)].
(They're both integer values in the rows. They're also not unique, hence the fact that it returns multiple results.)
I'm very new to pandas anyway, but any insights as to how to handle a situation where I'm looking for the arrays of values where two conditions match would be great too. Right now I'm using an O(n) algorithm to loop through and index it because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
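For what it's worth, here is a small, purely illustrative sketch of the two ideas in the question (the column names a and b are made up): checking how much memory a query result actually holds, and speeding up repeated two-column lookups with a sorted, non-unique MultiIndex.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 10, 6000),
                   'b': np.random.randint(0, 10, 6000),
                   'val': np.random.rand(6000)})

# boolean indexing returns a new object containing only the matching rows
subset = df[(df['a'] == 3) & (df['b'] == 7)]
print(subset.memory_usage(deep=True).sum())    # memory of the result, not of df

# for repeated lookups, a sorted (non-unique) MultiIndex lets .loc use the index
indexed = df.set_index(['a', 'b']).sort_index()
fast_subset = indexed.loc[(3, 7)]              # all rows where a == 3 and b == 7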