I have a 1M-row dataset containing 50k unique person_id values. The goal is to apply the DBSCAN algorithm individually to each of these 50k person_ids, i.e. build clusters per person_id. At the moment I'm doing something similar to this:
from sklearn.cluster import DBSCAN

def cluster_person(x):
    # fit DBSCAN on one person's rows and return the cluster labels
    # (the original function was also named DBSCAN, which shadows the sklearn class)
    cluster = DBSCAN(eps=0.5, min_samples=3).fit(x)
    cluster_labels = cluster.labels_
    return cluster_labels

grouped = df_sample.groupby(["person_id"])
grouped.apply(lambda x: cluster_person(x))
But that means the function runs fifty thousand times, because clustering must be applied individually to each person_id.
I want to be able to run the code only one time for each unique person_id instead of running it 50k times in a loop. Is there any way to create a lookup without running the function over and over again for each person_id?
It doesn't have to be Python only; I can also use SQL, PySpark, etc. I need a solution to escape from this loop.
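Since PySpark is on the table, one option is a grouped pandas UDF. The following is only a sketch, not a tested answer: the feature columns "x" and "y" and the schema types are placeholders that would need to match the real data.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType
import pandas as pd
from sklearn.cluster import DBSCAN

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df_sample)   # df_sample is the original pandas frame

# placeholder output schema: person_id plus the feature columns plus the label
out_schema = StructType([
    StructField("person_id", LongType()),
    StructField("x", DoubleType()),
    StructField("y", DoubleType()),
    StructField("cluster", LongType()),
])

def cluster_one_person(pdf: pd.DataFrame) -> pd.DataFrame:
    # Spark hands this function all rows of a single person_id as a pandas DataFrame
    labels = DBSCAN(eps=0.5, min_samples=3).fit(pdf[["x", "y"]]).labels_
    return pdf[["person_id", "x", "y"]].assign(cluster=labels)

clustered = sdf.groupBy("person_id").applyInPandas(cluster_one_person, schema=out_schema)

This still fits one DBSCAN model per person_id (that part is inherent to the problem), but the 50k fits run in parallel across executors instead of serially inside a single pandas apply loop.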
Related
I have a df consisting of many millions of rows. I need to run a recursive procedure that basically executes the snippet below repeatedly until the condition exhausts itself.
# df index is set to the search column -- this helps a lot, sorting actually hurts performance (surprisingly?)
df = df.set_index('search_col')
# the search function; pull some cols of interest
df[df.index.isin(ids_to_search)][['val1', 'val2']].to_numpy()
Recursion happens because I need to find all the children IDs associated with one ultimate parent ID. The process is as follows:
1. Load a single parent ID
2. Search for its children IDs
3. Use the step 2 children IDs as the new parent IDs
4. Search for their children IDs
5. Repeat steps 3-4 until no more children IDs are found
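For reference, the loop described above can be written iteratively rather than recursively. A rough sketch, assuming the frame is indexed on the parent/search column and has a hypothetical child_id column:

import pandas as pd

def find_descendants(df: pd.DataFrame, root_id: str) -> set:
    # df is indexed on the parent/search column; "child_id" is a placeholder name
    found = set()
    frontier = [root_id]
    while frontier:                            # step 5: stop when no new children appear
        hits = df[df.index.isin(frontier)]     # steps 2/4: look up children of the frontier
        children = set(hits["child_id"]) - found
        found |= children
        frontier = list(children)              # step 3: children become the new parents
    return found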
This approach is not bad, but with thousands of things to check, n times over with the recursion, it's a slow process at the end of the day.
ids_to_search is a list of random 32-character strings, sometimes dozens or hundreds of them per lookup.
What other tricks might I try to employ?
Edit: Other Attempts
Other approaches I have tried, which did not perform any better:
Using modin, leveraging the Dask engine
Swifter + modin, leveraging the Dask engine
Replacing pandas isin with NumPy's np.in1d (and converting the dataframe fully to NumPy as well), with the aim of eventually using JIT/Numba, but I could not get it to work
I am trying to use groupby and apply a custom function on a huge dataset, which gives me memory errors, and the workers get killed because of the shuffling. How can I avoid the shuffle and do this efficiently?
I am reading around fifty Parquet files (700 MB each), and the data in those files is isolated, i.e. no group exists in more than one file. If I run my code on one file it works fine, but it fails when I run it on the complete dataset.
Dask documentation talks about problems with groupby when you apply a custom function on groups, but they do not offer a solution for such data:
http://docs.dask.org/en/latest/dataframe-groupby.html#difficult-cases
How can I process my dataset in a reasonable timeframe (the groupby-apply takes around 6 minutes on a single file) and hopefully avoid the shuffle? I do not need my results to be sorted, or groupby trying to sort my complete dataset across the different files.
I have tried using persist, but the data does not fit into RAM (32 GB). Dask does not support a multi-column index, but I tried setting an index on one column to help the groupby, to no avail. Below is what the structure of the code looks like:
import pandas as pd
from dask.dataframe import read_parquet

# custom_function sorts the data within a group (the groups are small, less than
# 50 entries) on a field and computes some values based on heuristics (it computes
# 4 values, but I am showing 1 in the example below; the other 3 calculations are similar)
def custom_function(group):
    results = {}
    sorted_group = group.sort_values(['C']).reset_index(drop=True)
    sorted_group['delta'] = sorted_group['D'].diff()
    sorted_group.delta = sorted_group.delta.shift(-1)
    results['res1'] = (sorted_group[sorted_group.delta < -100]['D'].sum()
                       - sorted_group.iloc[0]['D'])
    # similarly 3 more results are generated
    results_df = pd.DataFrame(results, index=[0])
    return results_df

df = read_parquet('s3://s3_directory_path')
results = df.groupby(['A', 'B']).apply(custom_function).compute()
One possibility is to process one file at a time and repeat that for every file, but in that case Dask seems useless (no parallel processing) and it will take hours to achieve the desired results. Is there any way to do this efficiently using Dask, or any other library? How do people deal with such data?
If you want to avoid shuffling and can promise that groups are well isolated, then you could just call a pandas groupby-apply across every partition with map_partitions:
df.map_partitions(lambda part: part.groupby(...).apply(...))
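Applied to the structure above, that might look roughly like this (a sketch; it relies on every ('A', 'B') group living entirely inside one partition, and the meta= argument may be needed so Dask can infer the output columns without guessing):

results = df.map_partitions(
    lambda part: part.groupby(['A', 'B']).apply(custom_function),
    # meta=...,  # optionally describe the output frame to skip Dask's inference
).compute()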
I am looking for a way to repartition (in PySpark) a dataset so that all rows that have the same ID in a specified column end up in the same partition. In fact, in each partition I have to run a program which computes a single value for all rows having the same ID.
I have a dataframe (df) built from a Hive QL query (which, let's say, contains 10,000 distinct IDs).
I tried :
df = df.repartition("My_Column_Name")
By default I get 200 partitions, but I always end up with 199 IDs for which I get duplicated computed values when I run the program.
I looked on the web, and some people recommended defining a custom partitioner to use with the repartition method, but I wasn't able to find out how to do that in Python.
Is there a way to do this repartition correctly?
I only want ALL rows with the same ID to be moved to the same partition. It is no problem if a partition contains several groups of rows with distinct IDs. That was just an example; the number of different IDs can be very high, so partitioning a DF into as many partitions as there are distinct IDs would not perform well. I need this because I run a function (which cannot be implemented using basic Spark transformation functions) via the RDD mapPartitions method. This function produces one result per distinct ID, which is why I need all rows with the same ID in the same partition.
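For reference, a rough sketch of pairing a hash repartition on the ID column with RDD mapPartitions; repartition hashes the column, so rows sharing an ID should land in one partition, and a partition may hold several IDs. compute_single_value is a placeholder for the real per-ID function.

from collections import defaultdict

df = df.repartition(200, "My_Column_Name")   # explicit partition count, same column

def process_partition(rows):
    # group the rows of this partition by ID and emit one result per distinct ID
    groups = defaultdict(list)
    for row in rows:
        groups[row["My_Column_Name"]].append(row)
    for key, group_rows in groups.items():
        yield (key, compute_single_value(group_rows))  # placeholder per-ID function

results = df.rdd.mapPartitions(process_partition).collect()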
My problem is as follows:
I have a large dataframe called details containing 900K rows, and another one, named attributes, containing 80M rows.
Both have a column A on which I would like to do a left-outer join, the left dataframe being details.
There are only 75K unique entries in column A in the dataframe details. The dataframe attributes has 80M unique entries in column A.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join, i.e. details.join(attributes, "A", how="left_outer"), just times out (or runs out of memory).
Since there are only 75K unique entries in column A in details, we don't care about the rest of the rows in attributes. So, first I filter those out using:
uniqueA = details.select('A').distinct().collect()
uniqueA = map(lambda x: x.A, uniqueA)
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
I thought this would work out because the attributes table comes down from 80M rows to a mere 75K rows. However, it still takes forever to complete the join (and it never completes).
Next, I thought that there were too many partitions and that the data to be joined was not on the same partitions. Although I don't know how to bring all the data to the same partition, I figured repartitioning might help. So here it goes:
details_repartitioned = details.repartition("A")
attributes_repartitioned = attributes.repartition("A")
The above operation brings down the number of partitions in attributes from 70K to 200. The number of partitions in details is about 1100.
details_attributes = details_repartitioned.join(
    broadcast(attributes_repartitioned), "A", how='left_outer')  # tried without broadcast too
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.
P.S. I have already seen this question, but it does not answer my question.
The details table has 900k rows with 75k distinct entries in column A. I think the filter on column A that you have tried is the correct direction. However, the collect followed by the map operation
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
is too expensive. An alternative approach would be:
from pyspark import StorageLevel

uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # breaking the DAG lineage
attrJoined = attributes.join(uniqueA, "A", "inner")  # join on column A
Also, you probably need to set the number of shuffle partitions (spark.sql.shuffle.partitions) appropriately if you haven't done so yet.
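For example (the value 400 is purely illustrative; tune it to your data and cluster size):

spark.conf.set("spark.sql.shuffle.partitions", "400")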
One problem that could occur in your dataset is skew: among the 75k unique values, only a few might join with a very large number of rows in the attributes table. In that case the join could take much longer and may never finish.
To resolve that you need to find the skewed values of column A and process them separately.
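A rough sketch of that idea (the threshold and the exact handling are illustrative, not a drop-in fix): find the hot values of A, join them separately, and union the results back together.

from pyspark.sql import functions as F

# 1. find hot keys in the large table (threshold is arbitrary)
hot_keys = [r["A"] for r in
            attributes.groupBy("A").count()
                      .filter(F.col("count") > 1000000).collect()]

# 2. join the non-skewed keys the normal way
normal = (details.filter(~F.col("A").isin(hot_keys))
                 .join(attributes.filter(~F.col("A").isin(hot_keys)), "A", "left_outer"))

# 3. join the skewed keys on their own; details is small, so broadcasting its hot
#    rows keeps this part cheap (hot keys by definition have matches in attributes,
#    so an inner join keeps the same details rows as a left-outer join would)
skewed = (attributes.filter(F.col("A").isin(hot_keys))
                    .join(F.broadcast(details.filter(F.col("A").isin(hot_keys))), "A", "inner"))

# 4. combine the two pieces
result = normal.unionByName(skewed)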
I have a PySpark program that takes descriptions of variables and uses the Spark Word2Vec model to transform the descriptions into vectors, then uses KMeans to cluster those vectors in order to group descriptions that are hopefully describing the same thing.
However, the dataset has many duplicates in it. I deduplicate the original dataset, but maintain a count with each unique row of how many duplicates of that description existed originally.
After clustering the description vectors, I zip the result RDD back into the dataset with the descriptions. I would like to order the clusters based on how many total entries of data mapped to each cluster. So, the final RDD looks like this:
[([companyid=u'xxxxxxxx', variableid=u'prop11', description=u'payment method', duplicateCount=u'8', word2vecOutput=DenseVector([0.830574, 1.96709, -0.86785, ...])], clusterID=793), ...]
The cluster ID is a separate element because it was zipped back into the Word2Vec RDD.
I want to find a way to aggregate all the duplicateCount values and produce an ordered list of clusterIDs with their total number of original rows (before deduplication), ordered by that total.
It seems like this should be really easy to do with a simple aggregate function, but for whatever reason, I'm having a hard time wrapping my head around it.
Thanks for any help
EDIT:
To clarify, in each row of my RDD, there is a number labeled duplicateCount. There is another element labeled cluster. I was trying to write a function that would sum the duplicateCount where the cluster is equal, thereby giving me a totalCount for each cluster.
For example, 4 elements might be grouped into cluster 10. However, the first element might have a duplicateCount of 5, the second a duplicateCount of 37, etc (still all in cluster 10). I want to sum the duplicates in every cluster so I can get an actual size of the cluster.
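A minimal sketch of that aggregation, assuming each element of the zipped RDD is a (row, clusterID) pair as in the example above (zipped_rdd is a placeholder name, and the row is assumed to expose duplicateCount as an attribute):

totals = (zipped_rdd
          .map(lambda pair: (pair[1], int(pair[0].duplicateCount)))  # (clusterID, count)
          .reduceByKey(lambda a, b: a + b)                           # total per clusterID
          .sortBy(lambda kv: kv[1], ascending=False))                # biggest clusters first

ordered_clusters = totals.collect()   # [(clusterID, totalCount), ...]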
I thought the W2V and KMeans details would provide helpful context about why I wanted this, but apparently they just made the question confusing.