Spark Dataframe Group By Operation and Picking N values from each group - python

I have a spark data-frame of the following structure:
Operation|RequestURL|RequestBody|IsGetRequest|IsPostRequest
and a variable: val n = 100
I want to perform a group-by on the Operation column of the data frame. Then, I want to fetch the RequestURL and RequestBody columns for n requests (no ordering) from each of these groups (creating a new data-frame/RDD/map of this). If a group has fewer than n requests, I want to duplicate some of the rows in that group to ensure that the number of requests I fetch from each group is the same.
I need help figuring out how this can be done in an optimized way. I am open to using any language (Python/Scala) and also to converting the data frame to pandas or a hash map of keys and values, if this cannot be done using a Spark data frame.
I have seen some solutions on Stack Overflow that use grouping and order-by and then a window partition function to get the top-N values.
How my question is different: in my case there is no ordering. Also, I want to ensure I fetch an equal number of requests from each group.

Solved it using the Window partition function. Followed that up by converting the resulting dataset to a Map[String, List[String]] using groupBy() and toMap, then traversing the map and replicating rows using list operations.
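For reference, a rough PySpark sketch of that approach; df, n, and the column names come from the question, and padding by cycling rows with itertools is one possible tactic rather than the author's exact code.

from itertools import islice, cycle
from pyspark.sql import functions as F
from pyspark.sql.window import Window

n = 100

# Rank rows within each Operation; the order key is arbitrary since no ordering is required.
w = Window.partitionBy("Operation").orderBy(F.monotonically_increasing_id())
topn = (df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") <= n)
          .select("Operation", "RequestURL", "RequestBody"))

# Collect a map of Operation -> list of (RequestURL, RequestBody) pairs, then pad
# groups that have fewer than n rows by cycling through their existing rows.
grouped = (topn.rdd
           .map(lambda r: (r["Operation"], [(r["RequestURL"], r["RequestBody"])]))
           .reduceByKey(lambda a, b: a + b)
           .collectAsMap())

padded = {op: list(islice(cycle(rows), n)) for op, rows in grouped.items()}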

Related

pyspark groupby sum all in one time vs partial where then groupby sum for huge table

Suppose we have a very, very large table like this.
If we use PySpark to read this table and run groupBy("id").agg({'value': 'sum'}),
compared with
reading the table in parts by
filtering on date first, then
running groupBy("id").agg({'value': 'sum'}) on each part,
and then summing all the partial values together.
Which one should be faster?
Can you please elaborate on your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (adds across all dates)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
The cardinality of your id column: whether you have a very large number of unique values or a small number of unique values repeating.
The file format you are using: columnar (like Parquet) vs row-based (like CSV).
The partitioning and bucketing strategy you use for storing the files: if the date column holds few distinct values, go for partitioning, otherwise bucketing.
All these factors help determine whether the two cases show similar processing times or differ drastically. This is because reading a large amount of data takes more time than reading less/pruned data with the given filters. Your shuffle block size and number of partitions also play a key role.
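For concreteness, a minimal sketch of the two cases (assuming a Parquet table with columns id, value and date, and an existing SparkSession named spark; the path is a placeholder):

from pyspark.sql import functions as F

df = spark.read.parquet("/path/to/table")

# case-1: aggregate over the whole table
case1 = df.groupBy("id").agg(F.sum("value").alias("value"))

# case-2: filter on date first (partition pruning applies if the table is
# partitioned by date), then aggregate
case2 = (df.where(F.col("date") == "2021-01-01")
           .groupBy("id")
           .agg(F.sum("value").alias("value")))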

Ways of Creating List from Dask dataframe column

I want to create a list/set from a Dask dataframe column. Basically, I want to use this list to filter rows in another dataframe by matching values against a column in this dataframe. I have tried using list(df[column]) and set(df[column]), but it takes a lot of time and ends up giving an error about creating a cluster, or it sometimes restarts the kernel when the memory limit is reached.
Can I use dask.bag or multiprocessing to create a list?
When you try to convert a column to a list or set with the regular list/set, Python will load it all into memory; that's why you get a memory-limit issue.
I believe that by using dask.bag you might solve that issue, since dask.bag will lazily load your data, although I'm not sure whether df[column] won't have to be read first. Also, be aware that turning that column into a bag will take a while depending on how big the data is.
Using a dask.bag lets you run map, filter and aggregate, so it seems it could be a good solution for your problem.
You can try running this to see if it filters the list/bag as you expect.
import dask.bag as db
bag = db.from_sequence(df[column], npartitions=5)
filtered = bag.filter(lambda list_element: list_element == "filtered row")
filtered.compute()  # materialise the filtered values
Since this is just an example, you will need to change the npartitions and the lambda expression to fit your needs.
Let me know if this helps
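If the goal is then to use those values to filter the second dataframe, one possible follow-up (assuming the distinct values are small enough to collect, and with other_df and other_column as hypothetical names) would be:

values = bag.distinct().compute()
matched = other_df[other_df[other_column].isin(values)]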

Is it possible to manually create Dask data frames? (i.e., not by a fixed partition count)

I would like to define the way a Dask dataframe is created (e.g., a particular criterion for splitting), or be able to create one manually.
The situation:
I have a Python function that traverses a subset of a large data frame. The traversal can be limited to all rows that match a certain key. So I need to ensure that this key is not split over several partitions.
Currently, I am splitting the input data frame (Pandas) manually and use multiprocessing to process each partition separately.
I would love to use Dask, which I also use for other computations, due to its ease of use. But I can't find a way to manually define how the input dataframe is split in order to later use map_partitions.
Or am I on a completely wrong path here, and should I use other Dask methods?
You might find dask.delayed useful here, and then use that to create a custom Dask dataframe: https://docs.dask.org/en/latest/dataframe-create.html#dask-delayed
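As a minimal sketch (assuming a pandas dataframe pdf, a key column named "key", and a per-key function process_key, all placeholder names), you could wrap each key's rows as a delayed piece and assemble them:

import dask
import dask.dataframe as dd

# Split the pandas frame so all rows sharing a key land in one piece, wrap each
# piece as a delayed object, then assemble a Dask dataframe from those pieces.
pieces = [dask.delayed(group) for _, group in pdf.groupby("key")]
ddf = dd.from_delayed(pieces, meta=pdf.head(0))

# Each partition now holds exactly one key, so map_partitions can apply the
# per-key traversal safely.
result = ddf.map_partitions(process_key).compute()

With many distinct keys this creates one (possibly tiny) partition per key, so batching several keys per delayed piece may be worth it.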

Dask map_partitions results in duplicates when reducing and gives wrong results compared to pure pandas

When I use Dask to group by using map_partitions, I obtain duplicated data and wrong results compared to a simple pandas groupby. But when I use npartitions=1, I get the correct results.
Why does this happen? and how can I use multiple partitions and still get the correct results?
my code is
measurements = measurements.repartition(npartitions=38)
measurements.map_partitions(
    lambda df: df.groupby(["id", df.time.dt.to_period("M"), "country", "job"])
                 .source.nunique()
).compute().reset_index()
In pandas, I do
measurements.groupby(["id", measurements.time.dt.to_period("M"),
                      "country", "job"]).source.nunique().reset_index()
PS: I'm using a local cluster on a single machine.
When you call map_partitions, you say you want to perform that action on each partition. Given that each unique grouping value can occur in multiple partitions, you will get an entry for each group, for each partition in which it is found.
What if there were a way to do groupby across partitions and have the results smartly merged for you automatically? Fortunately, this is exactly what dask does, and you did not need to use map_partitions at all.
measurements.groupby(...).field.nunique().compute()
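Filled in with the columns from the question, that suggestion might look like the sketch below (materialising the month as a column first, and assuming your Dask version supports the .dt.to_period accessor):

measurements = measurements.assign(month=measurements.time.dt.to_period("M"))
result = (measurements.groupby(["id", "month", "country", "job"])
                      .source.nunique()
                      .compute()
                      .reset_index())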

PySpark repartition according to a specific column

I am looking at how to repartition (in PySpark) a dataset so that all rows that have the same ID in a specified column end up in the same partition. In fact, I have to run, in each partition, a program which computes a single value for all rows having the same ID.
I have a dataframe (df) built from a Hive QL query (which, let's say, contains 10000 distinct IDs).
I tried :
df = df.repartition("My_Column_Name")
By default I get 200 partitions, but I always end up with 199 IDs for which I get duplicated computed values when I run the program.
I looked on the web, and some people recommended defining a custom partitioner to use with the repartition method, but I wasn't able to find how to do that in Python.
Is there a way to do this repartition correctly?
I only want ALL rows with the same ID moved to the same partition. It's no problem if a partition contains several groups of rows with distinct IDs. 1000 was just an example; the number of different IDs can be very high, so partitioning the DF into as many partitions as there are distinct IDs would probably not perform well. I need this because I run a function (which cannot be implemented using basic Spark transformation functions) via the RDD mapPartitions method. This function produces one result per distinct ID, which is why I need all rows with the same ID in the same partition.
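Not a definitive answer, but a sketch of what may be going on: repartition("My_Column_Name") hash-partitions the rows, so all rows with the same ID do land in one partition, but several distinct IDs will usually share a partition, which can produce duplicated values if the per-partition code assumes a single ID. Grouping locally inside mapPartitions sidesteps that; compute_single_value is a placeholder for the program mentioned above.

from collections import defaultdict

# Same ID -> same partition; several IDs may still share one partition.
df = df.repartition(200, "My_Column_Name")

def per_partition(rows):
    # Group the partition's rows by ID, then emit one result per ID.
    groups = defaultdict(list)
    for row in rows:
        groups[row["My_Column_Name"]].append(row)
    for key, grp in groups.items():
        yield key, compute_single_value(grp)  # compute_single_value is hypothetical

results = df.rdd.mapPartitions(per_partition).collect()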
