I want to choose N rows randomly for each category of a column in a data frame. Let's say the column is 'color' and N is 5; then I'd want to choose 5 items for each of the colors.
The usual way of doing this is something like this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rand, row_number

# Helper names: the random-key and row-number columns, the category column, and N
key, num, color, N = "key", "num", "color", 5

df = (
    df
    # Define a random key that can be used to sort by
    .select("*", rand().alias(key))
    # Sort the rows within each color by the key and enumerate them
    .withColumn(num, row_number().over(Window.partitionBy(color).orderBy(key)))
    # Choose only N items for each category
    .where(f"{num} <= {N}")
    # Drop key column
    .drop(key)
)
But orderBy blows up with an out of memory error on large dataframes. I'm considering using sort to work around this. Context: 'orderBy' runs on a single executor and guarantees total order while sort uses several partitions. I'm ok with the approximate nature of sort as I'm using this to select random subsets anyway.
I can't just replace orderBy as sort can't be used with row_number in a window as above.
Any pointers appreciated.
References:
Code snippet from https://sparkbyexamples.com/pyspark/pyspark-retrieve-top-n-from-each-group-of-dataframe/
Comparison between orderBy and sort from https://towardsdatascience.com/sort-vs-orderby-in-spark-8a912475390
You want to use what's sometimes called a 'salt' to redistribute the data and make each window smaller. (Here I split each colour group by floor(key*8) before randomly sorting within it; the factor of 8 is just a guess that it will work for you and can be increased if you wish.) Then you can re-window as you do today, without the salt, over the already-reduced data.
from pyspark.sql.functions import col, floor

# Define a random key that can be used both to sort by and to salt by
df = (
    df
    .select("*", rand().alias(key))
    # Salted window: partitioning by (color, floor(key*8)) divides each color's data
    # into chunks roughly 8x smaller, then sorts and enumerates within each chunk
    .withColumn(num, row_number().over(
        Window.partitionBy(color, floor(col(key) * 8)).orderBy(key)))
    # Choose only N items for each (color, salt) chunk
    .where(f"{num} <= {N}")
    .drop(num)
    # Re-window without the salt over the much smaller remaining data
    .withColumn(num, row_number().over(Window.partitionBy(color).orderBy(key)))
    # Choose only N items for each category
    .where(f"{num} <= {N}")
    # Drop key column
    .drop(key)
)
I do think you should look into df.sample as it's made to do this type of thing but if you like your logic as is this will work for you.
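For reference, a minimal sketch of the sampling route: df.sampleBy does stratified sampling by a fraction per value rather than an exact N per group, and the colour values and fractions below are placeholders you would replace with your own.
# Keep roughly 10% of the rows for each listed color value; rows whose color
# is not in the dict are dropped. The values shown are placeholders.
fractions = {"red": 0.1, "blue": 0.1, "green": 0.1}
sampled = df.sampleBy("color", fractions=fractions, seed=42)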
I found a very strange behavior with PySpark when I use randomSplit. I have a column is_clicked that takes values 0 or 1, and there are far more zeros than ones. After a random split I would expect the data to be uniformly distributed. But instead, I see that the first rows in the splits are all is_clicked=1, followed by rows that are all is_clicked=0. The number of clicks in the original dataframe df is 9 out of 1000 (which is what I expect), but after the random split the number of clicks is 1000 out of 1000. If I take more rows I see that they are all is_clicked=1 until there are no more such rows, and then they are followed by rows with is_clicked=0.
Does anyone know why the distribution changes after a random split? How can I make is_clicked uniformly distributed after the split?
So indeed PySpark does sort the data when it does randomSplit. Here is a quote from the code:
It is possible that the underlying dataframe doesn't guarantee the
ordering of rows in its constituent partitions each time a split is
materialized which could result in overlapping splits. To prevent
this, we explicitly sort each input partition to make the ordering
deterministic. Note that MapTypes cannot be sorted and are explicitly
pruned out from the sort order.
The solution to this is either to reshuffle the data after the split or to just use filter instead of randomSplit:
Solution 1:
import pyspark.sql.functions as sf

df = df.withColumn('rand', sf.rand(seed=42)).orderBy('rand')
df_train, df_test = df.randomSplit([0.5, 0.5])
df_train = df_train.orderBy('rand')  # reshuffle the training split after randomSplit
Solution 2:
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)
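If you only take the filter route, a brief self-contained sketch (assuming pyspark.sql.functions is imported as sf, as in Solution 1) would add the 'rand' column explicitly and drop it afterwards:
import pyspark.sql.functions as sf

# Add a seeded random column, split on a threshold, then drop the helper column
df = df.withColumn('rand', sf.rand(seed=42))
df_train = df.filter(df.rand < 0.5).drop('rand')
df_test = df.filter(df.rand >= 0.5).drop('rand')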
Here is a blog post with more details.
I'm a complete newbie to Python, and I'm currently trying to work on a problem that requires me to take the average of each column, except that the number of columns is unknown.
I figured out how to do it when I know how many columns there are, doing each calculation separately. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np

# average of all data not including NaN
def average(dataset):
    return np.mean(dataset[np.isfinite(dataset)])

# this is how I did it by each column separate
dataset = np.genfromtxt("some file")
print(average(dataset[:, 0]))
print(average(dataset[:, 1]))

# what I'm trying to do with a loop
def avg(dataset):
    for column in dataset:
        lst = []
        column =  # i'm not sure how to define how many columns I have
        Avg = average(column)
    return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis indicates whether you are taking the average along columns or rows (axis = 0 means you take the average of each column, what you are trying to do). The output will be a vector whose length is the same as the number of columns (or rows) along which you took the average, and each element is the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this.
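For example, a small worked sketch: here np.nanmean is used instead of plain np.mean so that NaNs are skipped, matching the isfinite masking in the question.
import numpy as np

data = np.array([[1.0, 2.0, np.nan],
                 [4.0, np.nan, 6.0]])

col_means = np.nanmean(data, axis=0)  # per-column mean, ignoring NaN
print(col_means)                      # prints [2.5 2.  6. ]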
You CAN do this using a for loop, but it's not a good idea: looping over matrices in NumPy is slow, whereas vectorized operations like np.mean() are very fast. So in general, when using NumPy, try to use those built-in operations instead of looping over everything, at least where possible.
Also, if you want the number of columns in your matrix: my_matrix.shape[1] is the number of columns, and my_matrix.shape[0] is the number of rows.
I have daily csv's that are automatically created for work that average about 1000 rows and exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new txt.file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns (-2000 to 300000) are profit/loss data based on time (milliseconds). The file is usually between 700-1000 lines, and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.group('providers')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format for writing all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (i.e. total_tickets) and applying it to the next calculation (i.e. 1s Loss).
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby, with more examples in the documentation here.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations, it just tells pandas how to find the groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to a variable, e.g. total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
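Putting the first few metrics together, a hedged sketch; the column names 'provider', 'filled', and 'fill_size' come from the question, df is assumed to be the daily CSV already loaded with pandas, and computing fill rate as a mean assumes every row is either filled (1) or rejected (0):
grouped = df.groupby('provider')

total_tickets = grouped['filled'].sum()          # total tickets per provider
share_pct = total_tickets / df['filled'].sum()   # share % of all filled tickets in the file
fill_rate = grouped['filled'].mean()             # filled is 1/0, so the mean is the fill rate
size = grouped['fill_size'].sum()                # Size = sum of fill_size per provider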
Update:
For one second loss (and for the other losses), you need two things:
The number of times df['1000'] < 0 for each provider
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['filled'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
This analogizes to the 10s Loss question.
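For example, a hedged sketch of the 10s Loss metric, assuming the millisecond column labels are the strings '1000' through '10000' in 1000 ms steps (a guess about the file layout); grouped and records_per_group are as defined above:
loss_cols = [str(ms) for ms in range(1000, 10001, 1000)]  # assumed column labels
# Count, per provider, the rows whose worst value over the 1s-10s range is below zero
_10s_loss_freq = grouped.apply(lambda x: (x[loss_cols].min(axis=1) < 0).sum())
_10s_loss = _10s_loss_freq / records_per_group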
The last question, about the average over a range of columns, relies on pandas' understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean() that is axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has a r rows, with c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of certain element (and std. dev, variance, etc.) and then determine if the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
SEQ VAL KEY
0 1 28 keyA
1 2 15 keyB
2 3 13 keyC
3 4 11 keyD
4 1 12 keyA
5 2 23 keyB
6 3 21 keyC
7 4 15 keyD
Both DF's A and B are of this format.
I can iterative get the resultant sets by:
loop_iter = len(A) // max(A['SEQ'])  # integer division so it can be used as a slice step
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to do it by explicitly splitting on the keys, right? So perhaps I have TWO questions: 1) How to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2) How to match a DF and do summary statistics, etc., on the DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its most simple is that you have a pd.Series of values (i.e. a["key"], which let's just call keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where the b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.

    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass

# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values -- those of b["key"] -- always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, summary is a DataFrame whose rows are the returned Series; and if the applied function returns a DataFrame, the result is a DataFrame with a MultiIndex. This is a core pattern in pandas, and there's a whole, whole lot to explore here.
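To make that concrete for the stats-and-range check described in the question, here's a hedged sketch. It uses the SEQ/VAL/KEY layout from the example; B is the example data, A is a small made-up stand-in, and the min/max/std summary is just one possible choice of descriptive_func.
import pandas as pd

A = pd.DataFrame({'SEQ': [1, 2, 3, 4], 'VAL': [20, 14, 25, 11],
                  'KEY': ['keyA', 'keyB', 'keyC', 'keyD']})
B = pd.DataFrame({'SEQ': [1, 2, 3, 4, 1, 2, 3, 4],
                  'VAL': [28, 15, 13, 11, 12, 23, 21, 15],
                  'KEY': ['keyA', 'keyB', 'keyC', 'keyD'] * 2})

def descriptive_func(df):
    # Summarise the VAL column for one key's group of rows in B
    return pd.Series({'min': df['VAL'].min(),
                      'max': df['VAL'].max(),
                      'std': df['VAL'].std()})

summary = B.groupby('KEY').apply(descriptive_func)

# Align each row of A with the stats for its key, then test whether VAL is in B's range
stats = A.join(summary, on='KEY')
A['in_range'] = (stats['VAL'] >= stats['min']) & (stats['VAL'] <= stats['max'])
print(A)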
Is there a possibility to retrieve random rows from Cassandra (using it with Python/Pycassa)?
Update: By random rows I mean randomly selected rows!
You might be able to do this by making a get_range request with a random start key (just a random string), and a row_count of 1.
From memory, I think the finish key would need to be the same as start, so that the query 'wraps around' the keyspace; this would normally return all rows, but the row_count will limit that.
Haven't tried it but this should ensure you get a single result without having to know exact row keys.
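Untested, but a hedged sketch of what that idea would look like with pycassa; the keyspace and column family names are placeholders:
import uuid
import pycassa.pool
import pycassa.columnfamily

pool = pycassa.pool.ConnectionPool('keyspace', ['localhost:9160'])
cf = pycassa.columnfamily.ColumnFamily(pool, 'cfname')

start = uuid.uuid4().hex                           # an arbitrary random string key
# finish=start is the speculative 'wrap around' part described above
rows = cf.get_range(start=start, finish=start, row_count=1)
row_key, columns = next(iter(rows))                # the single row returned, if any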
Not sure what you mean by random rows. If you mean random access rows, then sure you can do it very easily:
import pycassa.pool
import pycassa.columnfamily
pool = pycassa.pool.ConnectionPool('keyspace', ['localhost:9160'])
cf = pycassa.columnfamily.ColumnFamily(pool, 'cfname')
row = cf.get('row_key')
That will give you any row. If you mean that you want a randomly selected row, I don't think you'd be able to do that very easily without knowing what the keys are. You could generate an index row, select a random column from it, and use that to grab a row from another column family. Basically, you'd need to create a new row where each column value is a row key from the column family from which you are trying to select a row. Then you could grab a column randomly from that row, and you'd have the key to a random row.
I don't think pycassa offers any support to grab a random, non-indexed row.
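A hedged sketch of that index-row idea, reusing the pool and cf from the snippet above; the 'row_key_index' column family and 'all_row_keys' row are hypothetical names you would have to create and maintain yourself:
import random

# Hypothetical index row whose column values are row keys of the target column family
index_cf = pycassa.columnfamily.ColumnFamily(pool, 'row_key_index')
index_row = index_cf.get('all_row_keys', column_count=10000)

random_key = random.choice(list(index_row.values()))
random_row = cf.get(random_key)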
This works for my case:
import random

ini = random.randint(0, 999999999)
rows = col_fam.get_range(str(ini), row_count=1, column_count=0, filter_empty=False)
You'll have to adapt this to your row key type (a string in my case).