PySpark repartition according to a specific column - python

I am looking for a way to repartition a dataset (in PySpark) so that all rows that have the same ID in a specified column end up in the same partition. I need this because, in each partition, I have to run a program that computes a single value for all rows sharing the same ID.
I have a dataframe (df) built from a Hive QL query (which, let's say, contains 10000 distinct IDs).
I tried:
df = df.repartition("My_Column_Name")
By default I get 200 partitions, but I always obtain 199 IDs for which I get duplicated computed values when I run the program.
I looked on the web, and some people recommended defining a custom partitioner to use with the repartition method, but I wasn't able to find out how to do that in Python.
Is there a way to do this repartition correctly?

I only want ALL rows with the same ID moved to the same partition; it is no problem if a partition contains several groups of rows with distinct IDs. 10000 was just an example: the number of different IDs can be very high, so partitioning a DataFrame into one partition per distinct ID would not perform well. I need this because I run a function (which cannot be implemented using basic Spark transformations) via the RDD mapPartitions method. This function produces one result per distinct ID, which is why I need all rows with the same ID in the same partition.

Related

pyspark groupby sum all in one time vs partial where then groupby sum for huge table

Suppose we have a very large table similar to this one. Which is faster: using PySpark to read the whole table and run
groupby("id").agg({"value": "sum"})
or filtering the table by date first, running
groupby("id").agg({"value": "sum"})
on each date slice, and then summing all the partial values together?
Can you please elaborate on your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (adds all date)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
The cardinality of your id column: whether you have a very large number of unique values, or a small set of unique values repeating.
The type of file format you are using: columnar (like Parquet) vs row-based (like CSV).
The partitioning and bucketing strategy you are using for storing the files. If the date column holds few distinct values, go for partitioning; otherwise, bucketing.
All these factors help determine whether the above 2 cases show similar processing times or differ drastically. This is because reading a large amount of data takes more time than reading less, pruned data with the given filters. Your shuffle block size and partitions also play a key role.

Parallel reading and processing from Postgres using (Py)Spark

I have a question regarding reading large amounts of data from a Postgres database and processing it in parallel using Spark. Let's assume I have a table in Postgres that I would like to read into Spark using JDBC, and that it has the following columns:
id (bigint)
date (datetime)
many other columns (different types)
Currently the Postgres table is not partitioned. I would like to transform a lot of data in parallel, and eventually store the transformed data somewhere else.
Question: How can we optimize parallel reading of the data from Postgres?
The documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) suggests using a partitionColumn to process the queries in parallel; in addition, one is required to set lowerBound and upperBound. From what I understand, in my case I could use either the id or the date column as partitionColumn. However, the problem is how to set the lowerBound and upperBound values when partitioning on one of these columns: I noticed that data skew arises in my case if they are not set properly. For processing in Spark, I do not care about natural partitions; I just need to transform all the data as quickly as possible, so optimizing for unskewed partitions would be preferred, I think.
I have come up with a solution for this, but I am unsure whether it actually makes sense. Essentially it hashes the ids into partitions by applying mod() to the id column with a specified number of partitions. The dbtable field would then be something like:
"(SELECT *, mod(id, <<num-parallel-queries>>) as part FROM <<schema>>.<<table>>) as t"
And then I use partitionColumn="part", lowerBound=0, and upperBound=<<num-parallel-queries>> as options for the Spark read JDBC job.
Please, let me know if this makes sense!
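For reference, the full option set for that approach might look like the sketch below (the connection URL, schema, and table names are placeholders; note the option is spelled partitionColumn):

```python
num_parallel_queries = 8  # hypothetical degree of read parallelism

# Derived subquery that adds a synthetic, evenly distributed partition key;
# schema and table names are placeholders.
dbtable = (f"(SELECT *, mod(id, {num_parallel_queries}) AS part "
           f"FROM myschema.mytable) AS t")

jdbc_options = {
    "url": "jdbc:postgresql://db-host:5432/mydb",  # placeholder connection URL
    "dbtable": dbtable,
    "partitionColumn": "part",
    "lowerBound": "0",
    "upperBound": str(num_parallel_queries),
    "numPartitions": str(num_parallel_queries),
}

# With a reachable Postgres and the JDBC driver on the classpath:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Spark generates one query per stride between lowerBound and upperBound, so with a mod-derived key in [0, num_parallel_queries) each partition pulls one residue class.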
It is a good idea to "partition" by the primary key column.
To get partitions of equal size, use the table statistics:
SELECT histogram_bounds::text::bigint[]
FROM pg_stats
WHERE tablename = 'mytable'
AND attname = 'id';
If you have default_statistics_target at its default value of 100, this will be an array of 101 values that delimit the percentiles from 0 to 100. You can use this to partition your table evenly.
For example: if the array looks like {42,10001,23066,35723,49756,...,999960} and you need 50 partitions, the first would be all rows with id < 23066, the second all rows with 23066 ≤ id < 49756, and so on.
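A sketch of turning such a histogram_bounds array into JDBC predicates; PySpark's DataFrameReader.jdbc accepts a predicates list, one read partition per predicate. The bounds values below are made up beyond the ones quoted above:

```python
# Hypothetical histogram_bounds values, in the spirit of the quoted array.
bounds = [42, 10001, 23066, 35723, 49756, 63210,
          78004, 99996, 120533, 145872, 999960]

def predicates_from_bounds(bounds, num_partitions):
    """Collapse percentile bounds into num_partitions contiguous id ranges."""
    step = max(1, (len(bounds) - 1) // num_partitions)
    edges = bounds[::step]
    preds = [f"id >= {lo} AND id < {hi}" for lo, hi in zip(edges, edges[1:])]
    preds[0] = f"id < {edges[1]}"     # open-ended first range
    preds[-1] = f"id >= {edges[-2]}"  # open-ended last range
    return preds

preds = predicates_from_bounds(bounds, 5)
# With a reachable database, each predicate becomes one JDBC partition:
# df = spark.read.jdbc(url, "myschema.mytable", predicates=preds, properties=props)
```

Because the bounds delimit equal-frequency percentiles, each predicate covers roughly the same number of rows, which avoids the skew that naive lowerBound/upperBound striding can produce.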

Spark Dataframe Group By Operation and Picking N values from each group

I have a spark data-frame of the following structure:
Operation|RequestURL|RequestBody|IsGetRequest|IsPostRequest
and a variable : val n = 100
I want to perform Group-by on the Operation column in the data frame. Then, I want to Fetch RequestURL and RequestBody columns for n requests (no ordering) in each of these groups (create a new data-frame/rdd/map of this). If a group has less than n requests, I want to duplicate some of the rows in that group to ensure that the number of requests I fetch from each group is the same.
Need help in figuring out how this can be done in an optimized way. I am open to using any language (python/scala) and also convert the data frame to pandas or a hash-map of key and values, if this is not possible to be done using spark data-frame.
I have seen some solutions on Stack Overflow that use grouping and order-by, then a window partition function to get the top-N values.
How my question is different: in my case there is no ordering, and I also want to ensure that an equal number of requests is fetched from each group.
Solved it using a window partition function. Followed it up by converting the resulting dataset to a map of [String, List(String)] using groupBy() and toMap, then traversing the map and replicating the rows using list operations.

Joining a large and a massive spark dataframe

My problem is as follows:
I have a large dataframe called details containing 900K rows and the other one containing 80M rows named attributes.
Both have a column A on which I would like to do a left-outer join, the left dataframe being details.
There are only 75K unique entries in column A in the dataframe details. The dataframe attributes has 80M unique entries in column A.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join i.e. details.join(attributes, "A", how="left_outer") just times out (or gives out of memory).
Since there are only 75K unique entries in column A in details, we don't care about the rest in the dataframe in attributes. So, first I filter that using:
uniqueA = details.select('A').distinct().collect()
uniqueA = map(lambda x: x.A, uniqueA)
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
I thought this would work because the attributes table would come down from 80M rows to a mere 75K rows. However, the join still takes forever to complete (and it never does).
Next, I thought that there were too many partitions and that the data to be joined was not on the same partitions. Although I don't know how to bring all the data to the same partition, I figured repartitioning might help. So here it goes:
details_repartitioned = details.repartition("A")
attributes_repartitioned = attributes.repartition("A")
The above operation brings the number of partitions in attributes down from 70K to 200. The number of partitions in details is about 1100.
details_attributes = details_repartitioned.join(
    broadcast(attributes_repartitioned), "A", how='left_outer')  # tried without broadcast too
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.
P.S. I have already seen this question but that does not answer this question.
The details table has 900K rows with 75K distinct entries in column A. I think the filter on column A you tried is the right direction. However, the collect followed by the map operation
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
is too expensive. An alternative approach would be:
from pyspark import StorageLevel
uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # breaking the DAG lineage
attrJoined = attributes.join(uniqueA, "A", "inner")
Also, you probably need to set the shuffle partitions (spark.sql.shuffle.partitions) correctly, if you haven't done that yet.
One problem that could occur in your dataset is skew: it may be that among the 75K unique values, only a few join with a large number of rows in the attributes table. In that case the join can take much longer and may never finish.
To resolve that, you need to find the skewed values of column A and process them separately.

Summing values in Spark RDD where assigned clusters are equal

I have a PySpark program that takes descriptions of variables and uses the Spark Word2Vec model to transform the descriptions into vectors, then uses KMeans to cluster those vectors in order to group descriptions that are hopefully describing the same thing.
However, the dataset has many duplicates in it. I deduplicate the original dataset, but maintain a count with each unique row of how many duplicates of that description existed originally.
After clustering the description vectors, I zip the resulting RDD back with the dataset of descriptions. I would like to order the clusters based on how many total data entries mapped to each cluster. So a row of the final RDD looks like this:
(Row(companyid=u'xxxxxxxx', variableid=u'prop11', description=u'payment method', duplicateCount=u'8', word2vecOutput=DenseVector([0.830574, 1.96709, -0.86785, ...])), clusterID=793)
The cluster ID is a separate element because it was zipped back into the W2V RDD.
I want to find a way to aggregate all the duplicateCount values per cluster and produce a list of clusterIDs with their total number of original rows (before deduplication), ordered by that total.
It seems like this should be really easy to do with a simple aggregate function, but for whatever reason, I'm having a hard time wrapping my head around it.
Thanks for any help
EDIT:
To clarify, in each row of my RDD, there is a number labeled duplicateCount. There is another element labeled cluster. I was trying to write a function that would sum the duplicateCount where the cluster is equal, thereby giving me a totalCount for each cluster.
For example, 4 elements might be grouped into cluster 10. However, the first element might have a duplicateCount of 5, the second a duplicateCount of 37, etc (still all in cluster 10). I want to sum the duplicates in every cluster so I can get an actual size of the cluster.
I thought the W2V and KMeans would provide helpful context about why I wanted it, but apparently it just made the question confusing
