I have a Parquet file with 160M records and 240 columns. I'm loading it in Python with Dask on an EMR cluster (m5.12xlarge).
import dask.dataframe as dd
df = dd.read_parquet(file)
Now I want the value counts and normalized value counts of one of the columns:
count = df.a.value_counts()
percent = df.a.value_counts(normalize=True)
a_count = dd.concat([count, percent], axis=1, keys=['counts', '%'])
Out:
Dask DataFrame Structure:
counts %
npartitions=1
int64 float64
... ...
Dask Name: concat, 489 tasks
Note that I have 1 partition in total and 489 tasks.
Now I convert it into a pandas DataFrame, which takes only a few seconds to execute and uses around 1.5 GB of memory:
a_count = a_count.compute()
Next, I want all records where one of the columns is null, and then the same value counts as before:
empty_b = df[df['b'].isna()]
count = empty_b.a.value_counts()
percent = empty_b.a.value_counts(normalize=True)
empty_b = dd.concat([count, percent], axis=1, keys=['counts', '%'])
empty_b
Out:
Dask DataFrame Structure:
counts %
npartitions=1
int64 float64
... ...
Dask Name: concat, 828 tasks
This has 1 partition in total and 828 tasks.
Now when I try to convert this into a pandas DataFrame by computing it, it takes a long time and runs out of memory after using 170 GB:
empty_b = empty_b.compute()
Can someone explain what's going wrong here? I'm doing the same thing as before, and on a subset of the larger frame, yet my notebook runs out of memory and fails to execute.
I have a parquet file with 160M records and 240 columns
It's probably best to split up this data into multiple Parquet files
I'm loading it in Python with Dask on an EMR cluster (m5.12xlarge)
This instance type has 48 CPUs. Is this a multi-node cluster or a single node?
Note that I have 1 partition in total and 489 tasks.
You're probably running all these computations on a single core. Try repartitioning the DataFrame into at least 48 partitions so you can leverage the parallelism of your powerful machine.
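For example:

# spread the single partition across the 48 cores
df = df.repartition(npartitions=48)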
Now I'm trying to convert this into pandas dataframe
You generally don't want to convert Dask DataFrames to Pandas DataFrames unless you've significantly reduced the number of rows. You lose all the benefits of Dask's parallelism once you convert to Pandas.
In this example, it seems like you're reading a single Parquet file into a Dask DataFrame with one partition and then converting it back to Pandas. You might want to consider breaking up the Dask DataFrame into multiple partitions (and running computations via Dask) or just reading the file directly into a Pandas DataFrame.
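If you go the pandas route, a minimal sketch (the column list is illustrative; reading only the columns you need keeps memory down):

import pandas as pd

# a single-partition Dask frame adds scheduling overhead but no parallelism,
# so read the one file directly into pandas instead
df = pd.read_parquet(file, columns=['a', 'b'])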
I have a dataframe of 500,000 lines and 3 columns. I would like to compute the result of three functions for every chunk of 5,000 lines in the dataframe (that is, 100 chunks). Two of the three functions are user-defined, while the third is the mean of the values in column 3.
At the moment, I am first extracting a chunk, and then computing the results of the functions for that chunk. For the mean of column 3 I am using df.iloc[:,2].compute().mean() but the other functions are performed outside of dask.
Is there a way to leverage dask's multithreading ability, taking the entire dataframe and a chunk size as input, and have it compute the same functions automatically? This feels like the more appropriate way of using Dask.
Also, this feels like a basic dask question to me, so if this is a duplicate, please just point me to the right place (I'm new to dask and I might not have looked for the right thing so far).
I would repartition your dataframe, and then use the map_partitions function to apply each of your functions in parallel:
import dask

# 100 partitions of ~5,000 rows each
df = df.repartition(npartitions=100)

# apply each function to every partition; these are lazy until computed
a = df.map_partitions(func1)
b = df.map_partitions(func2)
c = df.map_partitions(func3)

# compute all three results in a single pass over the data
a, b, c = dask.compute(a, b, c)
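If func3 is just the mean of column 3, you don't even need map_partitions for it; Dask can aggregate that directly and compute it in the same pass, e.g.:

# lazy global mean of the third column
c = df.iloc[:, 2].mean()
a, b, c = dask.compute(a, b, c)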
You can create an artificial column for grouping indices into those 100 chunks.
import numpy as np

ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')
Then use this idx_group to perform your operations using pandas groupby.
NOTE: You can play with searchsorted to exactly fit your chunk requirements.
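Putting it together, a minimal sketch (func1 and func2 stand in for your own functions; this assumes a default RangeIndex):

import numpy as np
import pandas as pd

# boundaries every 5,000 rows -> 100 groups for 500,000 rows
ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')

# one row of results per 5,000-row chunk
result = df.groupby('idx_group').apply(
    lambda chunk: pd.Series({
        'f1': func1(chunk),
        'f2': func2(chunk),
        'mean_col3': chunk.iloc[:, 2].mean(),
    }))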
I am using Pandas for my analysis (currently in a Jupyter Notebook). I have two large datasets (one is 14 GB and the second is 4 GB). I need to merge these two datasets based on a column. I use the following code:
df = pd.merge(aa, bb, on='column', how='outer')
Normally, this code works. However, since my datasets are large, it takes too much time. I started the job 4 hours ago and it is still running. My machine has 8 GB of RAM.
Do you have any suggestions for that?
You can try using dask.dataframe to parallelize your task:
import dask.dataframe as dd
# define lazy readers
aa = dd.read_csv('file1.csv')
bb = dd.read_csv('file2.csv')
# define merging logic
dd_merged = aa.merge(bb, on='column', how='outer')
# apply merge and convert to dataframe
df = dd_merged.compute()
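One caveat: compute() pulls the full merge result into memory, so if the merged frame is larger than your 8 GB of RAM, write it to disk instead (a sketch; the output path is a placeholder):

# stream the merged result to disk instead of collecting it in memory
dd_merged.to_parquet('merged_output/')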
I've run into a problem processing a wide Spark DataFrame (about 9000 columns, sometimes more).
Task:
Create a wide DF via groupBy and pivot.
Transform the columns to a vector and run KMeans from pyspark.ml on it.
So I built the wide frame, tried to create a vector with VectorAssembler, cached it, and trained KMeans on it.
Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts, on my PC in standalone mode, for a frame of ~500x9000. By contrast, the same processing in pandas (pivoting the df and iterating over the 7 cluster counts) takes less than a minute.
Obviously I understand the overhead and performance penalty of standalone mode, caching, and so on, but it really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs without using VectorAssembler and suffering this performance hit?
A more formal question (per SO rules) would be: how can I speed up this code?
%%time
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue': 'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s
print(tmp.count(), len(tmp.columns))
552, 9378
%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s
%%time
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
rs = [i - j for i, j in zip(lst_levels, lst_levels[1:])]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s
Config:
.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \
VectorAssembler's transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
To put an exact figure on how much things have increased, you can dump your data frame before and after the transformation as parquet files and compare. In my experience, a feature vector built by hand or other feature extraction methods compared to one built by VectorAssembler can cause a size increase of 10x and this was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set with as many columns as you have.
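For example, using the names from the question above (a sketch; the paths are placeholders):

# dump both frames and compare their on-disk sizes
tmp.write.parquet('before_assembler.parquet')
transformed.write.parquet('after_assembler.parquet')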
A few suggestions:
See if you can build your feature vector another way. I'm not sure how performant this would be in Python, but I've got a lot of mileage out of this approach in Scala. I've noticed something like a 5x-6x performance difference comparing logistic regressions (10 params) for manually built vectors or vectors built using other extraction methods (TF-IDF) than VectorAssembled ones.
See if you can reshape your data to reduce the number of columns that need to be processed by VectorAssembler.
See if increasing the RAM available to Spark helps.
The solution was actually found in a map over the RDD.
First of all, we create a map of values for each row.
We also extract all distinct names.
As the penultimate step, we look up each of the names in the row's map of values and return the value, or 0 if nothing is found.
Then VectorAssembler on the results.
Advantages:
You don't have to create a wide dataframe with a huge number of columns, and so you avoid that overhead. (The runtime went from 11 minutes down to 1.)
You still work on the cluster and execute your code within the Spark paradigm.
Example of the code: Scala implementation.
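Since the linked example is in Scala, here is a rough PySpark sketch of the same idea (untested; column names are taken from the question):

from pyspark.ml.linalg import Vectors

# 1. collapse each user's rows into a dict: ObjectPath -> max(PropertyFlagValue)
pairs = (df_states.rdd
         .map(lambda r: ((r['User'], r['ObjectPath']), r['PropertyFlagValue']))
         .reduceByKey(max)
         .map(lambda kv: (kv[0][0], [(kv[0][1], kv[1])]))
         .reduceByKey(lambda a, b: a + b)
         .mapValues(dict))

# 2. all distinct ObjectPath values define the vector layout
names = sorted(r[0] for r in df_states.select('ObjectPath').distinct().collect())

# 3. look each name up in the row's dict (0 if missing) and build the vector
#    directly, so the wide DataFrame is never materialized
transformed = (pairs
               .mapValues(lambda d: Vectors.dense([d.get(n, 0) for n in names]))
               .toDF(['User', 'features']))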
I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:
Geography  AgeGroup  Gender  Race  Count
County1    1         M       1     12
County1    2         M       1     3
County1    2         M       2     0
To:
Geography  Count
County1    15
County2    23
This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options: HDF5? Using itertools (which seems complicated - generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write out before loading in another 70 lines?
Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, especially because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:
import dask.dataframe as dd
df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter:
import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
I do like @root's solution, but I would go a bit further in optimizing memory usage: keep only the aggregated DF in memory and read only the columns you really need:
cols = ['Geography', 'Count']
df = pd.DataFrame()
chunksize = 2  # adjust it! for example --> 10**5

for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
    # merge the previously aggregated DF with a new portion of data and aggregate again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
          .groupby(level=0)['Count']
          .sum()
          .to_frame())

df.reset_index().to_csv('c:/temp/result.csv', index=False)
test data:
Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
output.csv:
Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
PS: using this approach you can process huge files.
PPS: the chunking approach should work unless you need to sort your data; in that case I would use classic UNIX tools like awk, sort, etc. to sort the data first.
I would also recommend using PyTables (HDF5 storage) instead of CSV files: it is very fast, lets you read data conditionally (using the where parameter), and is usually much faster than CSV while saving a lot of resources.
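For example, a sketch of the conditional-read workflow (the file name and key are placeholders):

import pandas as pd

# one-time conversion: append the CSV chunk by chunk into a queryable HDF5 table
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk.to_hdf('data.h5', key='data', format='table', append=True,
                 data_columns=['Geography'])

# later: read back only the rows you need, without loading the whole file
subset = pd.read_hdf('data.h5', key='data', where="Geography == 'County1'")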