I have a PySpark program that takes descriptions of variables and uses the Spark Word2Vec model to transform the descriptions into vectors, then uses KMeans to cluster those vectors in order to group descriptions that are hopefully describing the same thing.
However, the dataset has many duplicates in it. I deduplicate the original dataset, but maintain a count with each unique row of how many duplicates of that description existed originally.
After clustering the description vectors, I zip the result RDD back together with the dataset of descriptions. I would like to order the clusters based on how many total entries of data mapped to each cluster. So, the final RDD looks like this:
[([companyid=u'xxxxxxxx', variableid=u'prop11', description=u'payment method', duplicateCount=u'8', word2vecOutput=DenseVector([.830574, 1.96709, -0.86785, ...])], clusterID=793)]
The clusterID is a separate element because it was zipped back into the Word2Vec RDD.
I want to find a way to aggregate all the duplicateCount values and build an ordered list that pairs each clusterID with its total number of original rows (before deduplication), ordered by that total.
It seems like this should be really easy to do with a simple aggregate function, but for whatever reason, I'm having a hard time wrapping my head around it.
Thanks for any help
EDIT:
To clarify, in each row of my RDD, there is a number labeled duplicateCount. There is another element labeled cluster. I was trying to write a function that would sum the duplicateCount where the cluster is equal, thereby giving me a totalCount for each cluster.
For example, 4 elements might be grouped into cluster 10. However, the first element might have a duplicateCount of 5, the second a duplicateCount of 37, and so on (still all in cluster 10). I want to sum the duplicateCounts within every cluster so I can get the actual size of each cluster.
I thought the W2V and KMeans details would provide helpful context about why I want this, but apparently they just made the question confusing.
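For concreteness, this is roughly the shape of aggregation I'm after (just a sketch: zipped_rdd is a placeholder name, and I'm assuming the RDD holds (row, clusterID) pairs where the row exposes duplicateCount):

cluster_sizes = (
    zipped_rdd
    .map(lambda pair: (pair[1], int(pair[0].duplicateCount)))  # (clusterID, duplicateCount)
    .reduceByKey(lambda a, b: a + b)                           # total original rows per cluster
    .sortBy(lambda kv: kv[1], ascending=False)                 # biggest clusters first
)
print(cluster_sizes.collect())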
Related
I have a 1M row dataset, and inside of it I have 50k unique person_id values. The goal is to apply the DBSCAN algorithm individually to each of these 50k unique person_ids, so basically I need to create clusters for each person_id. At the moment I'm doing something similar to this:
from sklearn.cluster import DBSCAN

def run_dbscan(x):
    # renamed so the function does not shadow sklearn's DBSCAN class
    cluster = DBSCAN(eps=0.5, min_samples=3).fit(x)
    cluster_labels = cluster.labels_
    return cluster_labels

grouped = df_sample.groupby(["person_id"])
grouped.apply(lambda x: run_dbscan(x))
But that means the function is run fifty thousand times, because the clustering must be applied individually to each person_id.
I want to get the cluster labels for every unique person_id without calling the function 50k times one after another. Is there any way to create a lookup without running the function over and over again for each person_id?
It doesn't have to be Python only; I can also use SQL, PySpark, etc. I need a solution to escape from this loop.
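A sketch of one possible direction, using Spark's groupBy().applyInPandas (available in PySpark 3.0+) so each person_id's group is clustered in parallel rather than in a serial Python loop; the Spark dataframe name sdf, the feature columns x and y, and the schema string are assumptions here, not from the original setup:

from sklearn.cluster import DBSCAN

def cluster_one_person(pdf):
    # pdf is a pandas DataFrame holding all rows for a single person_id
    pdf = pdf.copy()
    pdf["cluster"] = DBSCAN(eps=0.5, min_samples=3).fit_predict(pdf[["x", "y"]])
    return pdf

result = (
    sdf.groupBy("person_id")
       .applyInPandas(cluster_one_person,
                      schema="person_id long, x double, y double, cluster long")
)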
I have relatively complex CSV files which contain multiple matrices representing several types of data, and I would like to be able to parse these into multiple dataframes.
The complication is that these files are quite variable in size and content, as seen in this example containing two types of data, in this case a Median and Count metric for each sample.
There are some commonalities that all of these files share. Each metric will be stored in a matrix structured essentially like the two in the above example. In particular, the DataType field and subsequent entry will always be there, and the feature space (columns) and sample space (rows) will be consistent within a file (the row space may vary between files).
Essentially, the end result should be a dataframe of the data for just one metric, with the feature ids as the column names (Analyte 1, Analyte 2, etc in this example) and the sample ids as the row names (Location column in this case).
So far I've attempted this using the pandas read_csv function without much success.
In theory I could do something like this, but only if I know (1) the size and (2) the location of the particular matrix for the metric that I am after. In this case the headers for my particular metric of interest would be in row 46 and I happen to know that the number of samples is 384.
import pandas as pd

# works only because the header row (46) and the number of samples are known for this file
df = pd.read_csv('/path/to/file.csv', sep=",", header=46, nrows=385, index_col='Location')
I am at a complete loss how to do this in a dynamic fashion with files and matrices that change dimensions. Any input on overall strategy here would be greatly appreciated!
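For reference, a rough sketch of the kind of dynamic approach I have in mind; it assumes each matrix is introduced by a line whose first field is DataType followed by the metric name, and that the matrix ends at a blank line or at the next DataType marker (those layout details, and the read_metric helper name, are assumptions rather than facts about the files):

import csv
import pandas as pd

def read_metric(path, metric):
    with open(path, newline="") as fh:
        lines = list(csv.reader(fh))
    # locate the "DataType" marker row for the requested metric
    start = next(i for i, row in enumerate(lines)
                 if row and row[0] == "DataType" and metric in row)
    header = start + 1                        # the column names follow the marker
    end = header + 1
    while end < len(lines) and lines[end] and lines[end][0] != "DataType":
        end += 1                              # data runs until a blank line or the next matrix
    df = pd.DataFrame(lines[header + 1:end], columns=lines[header])
    return df.set_index("Location")

median_df = read_metric("/path/to/file.csv", "Median")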
I am looking for a way to repartition (in PySpark) a dataset so that all rows that have the same ID in a specified column end up in the same partition. In fact, I have to run, in each partition, a program which computes a single value for all rows having the same ID.
I have a dataframe (df) built from a HiveQL query which, let's say, contains 10000 distinct IDs.
I tried:
df = df.repartition("My_Column_Name")
By default I get 200 partitions, but I always end up with 199 IDs for which I get duplicated computed values when I run the program.
I looked on the web, and some people recommended defining a custom partitioner to use with the repartition method, but I wasn't able to find how to do that in Python.
Is there a way to do this repartition correctly?
I only want ALL rows with the same ID to be moved to the same partition; it is no problem if a partition contains several groups of rows with distinct IDs. 1000 was just an example; the number of different IDs can be very high, so partitioning the DataFrame into as many partitions as there are distinct IDs would probably not perform well. I need this because I run a function (which cannot be implemented using basic Spark transformation functions) via the RDD mapPartitions method. This function produces one result per distinct ID, which is why I need to have all rows with the same ID in the same partition.
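To make the intent concrete, here is a rough sketch of the repartition-plus-mapPartitions pattern I mean (the partition count of 200 is arbitrary, and len(group) just stands in for the real per-ID computation):

from collections import defaultdict

df = df.repartition(200, "My_Column_Name")       # hash-partition on the ID column with an explicit partition count

def compute_per_id(rows):
    groups = defaultdict(list)
    for r in rows:
        groups[r["My_Column_Name"]].append(r)    # gather every row of this partition by ID
    for key, group in groups.items():
        yield (key, len(group))                  # one result per distinct ID

results = df.rdd.mapPartitions(compute_per_id).collect()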
I'm new to Python and pandas and maybe I'm missing something. I have many columns in a dataframe, with two important columns in particular: Weight and Volume. I want to create many clusters from the rows of the dataframe, with the following conditions:
No cluster accumulates (summing) more than 30000 kg in weight.
No cluster accumulates (summing) more than 30 m^3 in volume.
The clusters are as large as possible while staying below the limits expressed in the first two points.
The resulting cluster for each row is annotated in a "cluster" column in the same dataframe.
The algorithm is implemented in a procedural style, with nested loops. I've been reading about rolling and expanding functions in pandas, but I can't find an idiomatic pandas way (without loops?) to do this. Is there a way?
Here is some code to help explain what I mean.
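To give an idea, here is a rough greedy sketch of the procedural approach I mean (MAX_WEIGHT, MAX_VOLUME and the Weight/Volume column names are assumptions; rows are packed in dataframe order and no single row is assumed to exceed a cap on its own):

MAX_WEIGHT = 30000   # kg
MAX_VOLUME = 30      # m^3

def assign_clusters(df):
    # greedy: start a new cluster whenever adding the next row would break either cap
    labels = []
    cluster = 0
    w_sum = v_sum = 0.0
    for w, v in zip(df["Weight"], df["Volume"]):
        if w_sum + w > MAX_WEIGHT or v_sum + v > MAX_VOLUME:
            cluster += 1
            w_sum = v_sum = 0.0
        w_sum += w
        v_sum += v
        labels.append(cluster)
    df["cluster"] = labels
    return df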
How do the following statements help in improving program efficiency while handling a large number of rows, say 500 million?
Random Partitioner:
get_range()
Ordered Partitioner:
get_range(start='rowkey1', finish='rowkey10000')
Also, how many rows can be handled at a time when using get_range with an ordered partitioner on a column family having more than a million rows?
Thanks
Also, how many rows can be handled at a time when using get_range with an ordered partitioner on a column family having more than a million rows?
pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:
rows = list(cf.get_range())
your Python program will probably run out of memory. The correct way to use it would be:
for key, columns in cf.get_range():
    process_data(key, columns)
This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().
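For example (an illustrative sketch reusing the process_data placeholder from above), a smaller buffer can be requested directly on the call:

for key, columns in cf.get_range(buffer_size=256):
    process_data(key, columns)   # rows are now fetched 256 at a time under the hood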
EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver, which apparently already takes care of everything I mention below.
==========
If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.
If your question is whether you can query Cassandra for all rows in batches of N rows at a time when the random partitioner is in use, then the answer is "yes". The difference from using the order-preserving partitioner is that you do not know what the first key of your next batch will be, so you have to use the last key of your current batch as the starting key and skip that row when iterating over the new batch (sketched below). For the first batch, simply use the "empty" key as the key range limits. Also, there is no way to tell how far you have come in relative terms by looking at a returned key, as the order is not preserved.
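A rough sketch of that batching logic (pseudocode: get_range_slice and process_data are hypothetical helper names standing in for the Thrift-level call, and N is the chosen batch size):

start_key = ''                                         # empty key = start of the ring
while True:
    batch = get_range_slice(start=start_key, count=N)  # fetch the next N rows
    if not batch:
        break
    for i, (key, columns) in enumerate(batch):
        if i == 0 and key == start_key:
            continue                                   # already processed as the last row of the previous batch
        process_data(key, columns)
    if len(batch) < N:
        break                                          # final page
    start_key = batch[-1][0]                           # last key of this batch starts the next one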
As for the number of rows: Start small. Say 10, then try 100, then 1000. Depending on the number of columns you are looking at, index sizes, available heap, etc. you will see a noticeable performance degradation for a single query beyond a certain threshold.
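A crude way to feel that threshold out (sketch only, using the same hypothetical get_range_slice helper as above):

import time

for count in (10, 100, 1000, 10000):
    t0 = time.time()
    batch = get_range_slice(start='', count=count)   # time one query per batch size
    print(count, len(batch), time.time() - t0)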