PySpark: groupBy two dataframes at the same time and apply a pandas UDF - python

I know that for #pandas_udf(schema, PandasUDFType.GROUPED_MAP), it should be used under a pyspark_dataframe.groupBy(something).apply();
What if I need two dataframes in this pandas_udf. For example: I have a dataframe A:
| id | cluster| value |
|:---- |:------:| -----:|
| 1 | A | 3 |
| 2 | A | 5 |
| 3 | B | 7 |
| 4 | B | 5 |
And then I have dataframe B:
| id | cluster |
|:---- |:------:|
| 5 | A |
| 6 | B |
And my desired output dataframe is:
| id | cluster| pred |
|:---- |:------:| -----:|
| 5 | A | 5 |
| 6 | B | 6 |
Here 'pred' is the mean of 'value' within that cluster, obtained from A.groupBy('cluster'). I want to achieve this through a pandas UDF, so I can only do B.groupBy('cluster').apply(my_pandasUDF). So I'm wondering: can my_pandasUDF take two inputs like this?
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def my_pandasUDF(df, A_df=A):
    means = A_df['value'].mean(axis=0)
    df['pred'] = means
    return df

B.groupBy('cluster').apply(my_pandasUDF).show()
Can this code give my desired result? If not, how can I do that? Thank you so much.
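As written, this will generally not work: a GROUPED_MAP pandas UDF is called with only the pandas DataFrame for the current group (optionally preceded by the group key), and a default argument that captures a second Spark DataFrame cannot be used in code that runs on the executors. Below is a minimal sketch of one workaround (not from the original post, and not using a pandas UDF at all): pre-compute the per-cluster means of A and join them onto B.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question.
A = spark.createDataFrame([(1, 'A', 3), (2, 'A', 5), (3, 'B', 7), (4, 'B', 5)],
                          ['id', 'cluster', 'value'])
B = spark.createDataFrame([(5, 'A'), (6, 'B')], ['id', 'cluster'])

# Pre-compute the per-cluster mean of 'value' on A ...
cluster_means = A.groupBy('cluster').agg(F.mean('value').alias('pred'))

# ... and attach it to every row of B by joining on 'cluster'.
B.join(cluster_means, on='cluster', how='left').select('id', 'cluster', 'pred').show()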

Related

pandas group by category and assign a bin with pd.cut

I have a dataframe like the following:
+-------+-------+
| Group | Price |
+-------+-------+
| A | 2 |
| B | 3 |
| A | 1 |
| C | 4 |
| B | 2 |
+-------+-------+
I would like to create a column that tells me, within each group, which range my price value falls into (if I divided each group into 4 intervals).
+-------+-------+--------------------------+
| Group | Price | Range |
+-------+-------+--------------------------+
| A | 2 | [1-2] |
| B | 3 | [2-3] |
| A | 1 | [0-1] |
| C | 4 | [0-4] |
| B | 2 | [0-2] |
+-------+-------+--------------------------+
Does anyone have an idea how to do this using pandas pd.cut and groupby operations?
Thanks
You can pass pd.cut to transform() after a groupby():
df['Range'] = df.groupby('Group')['Price'].transform(pd.cut, bins=4)
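For illustration, a runnable sketch of that answer on the sample data (the DataFrame construction is added here; note that pd.cut's default labels are Interval objects such as (1.75, 2.0], not the [0-1] style shown in the expected output):

import pandas as pd

df = pd.DataFrame({'Group': ['A', 'B', 'A', 'C', 'B'],
                   'Price': [2, 3, 1, 4, 2]})

# Within each group, cut the prices into 4 equal-width bins and label each
# row with the interval it falls into.
df['Range'] = df.groupby('Group')['Price'].transform(pd.cut, bins=4)
print(df)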

Transform a Pandas dataframe into a dataframe with multi-level columns

I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
And I want to convert this dataframe into a multi-column dataframe that looks like this:
+----+-----------+------------+-----------+------------+
| | A | B |
+----+-----------+------------+-----------+------------+
| id | price | amount | price | amount |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
Note: in my input dataframe I replaced the column name "amount_b" with "amount_B" because that lined up with your expected output, so I assumed it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
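As a small usage note (not part of the original answers): once the columns are a two-level MultiIndex, a whole top-level block can be selected directly. A self-contained sketch with placeholder values:

import numpy as np
import pandas as pd

# A tiny frame with the same column layout as the question (random values).
df = pd.DataFrame(np.random.rand(3, 4),
                  columns=['price_A', 'amount_A', 'price_B', 'amount_B'])
df.columns = df.columns.str.split('_', expand=True).swaplevel()

print(df['A'])             # every column under the top-level label 'A'
print(df[('A', 'price')])  # a single column as a Series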

How to apply multiple custom functions on multiple columns in grouped DataFrame in pandas?

I have a pandas DataFrame which is grouped by p_id.
The goal is to get a DataFrame with data shown under 'Output I'm looking for'.
I've tried a few things, but I am struggling to apply two custom aggregation functions:
apply(list) for x_id
'||'.join for x_name.
How can I solve this problem?
Input
| p_id | x_id | x_name |
|------|------|--------|
| 1 | 4 | Text |
| 2 | 4 | Text |
| 2 | 5 | Text2 |
| 2 | 6 | Text3 |
| 3 | 4 | Text |
| 3 | 7 | Text4 |
Output I'm looking for
| p_id | x_ids | x_names |
|------|---------|--------------------|
| 1 | [4] | Text |
| 2 | [4,5,6] | Text||Text2||Text3 |
| 3 | [4,7] | Text||Text4 |
You can certainly do:
df.groupby('p_id').agg({'x_id': list, 'x_name': '||'.join})
Or a little more advanced with named agg:
df.groupby('p_id').agg(x_ids=('x_id', list),
                       x_names=('x_name', '||'.join))
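For reference, a runnable sketch of the named-aggregation version on the question's sample data (the DataFrame construction is added here for illustration):

import pandas as pd

df = pd.DataFrame({'p_id': [1, 2, 2, 2, 3, 3],
                   'x_id': [4, 4, 5, 6, 4, 7],
                   'x_name': ['Text', 'Text', 'Text2', 'Text3', 'Text', 'Text4']})

out = (df.groupby('p_id')
         .agg(x_ids=('x_id', list), x_names=('x_name', '||'.join))
         .reset_index())
print(out)
# p_id 2 -> x_ids [4, 5, 6], x_names 'Text||Text2||Text3'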

Grouping many columns in one column in Pandas

I have a DataFrame that is similar to this one:
| | id | Group1 | Group2 | Group3 |
|---|----|--------|--------|--------|
| 0 | 22 | A | B | C |
| 1 | 23 | B | C | D |
| 2 | 24 | C | B | A |
| 3 | 25 | D | A | C |
And I want to get something like this:
| | Group | id_count |
|---|-------|----------|
| 0 | A | 3 |
| 1 | B | 3 |
| 2 | C | 3 |
| 3 | D | 2 |
Basically, for each group I want to know how many people (id) have chosen it.
I know there is groupby(), but it only gives an appropriate result for one column (if I give it a list, it does not combine Group1, Group2 and Group3 into one column).
Use DataFrame.melt with GroupBy.size:
df1 = (df.melt('id', value_name='Group')
         .groupby('Group')
         .size()
         .reset_index(name='id_count'))
print(df1)
Group id_count
0 A 3
1 B 3
2 C 4
3 D 2
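To make the answer directly runnable, here is the same pipeline preceded by a reconstruction of the sample frame from the question (the construction itself is added for illustration):

import pandas as pd

df = pd.DataFrame({'id': [22, 23, 24, 25],
                   'Group1': ['A', 'B', 'C', 'D'],
                   'Group2': ['B', 'C', 'B', 'A'],
                   'Group3': ['C', 'D', 'A', 'C']})

# Melt the three Group columns into one 'Group' column, then count how many
# ids fall under each group value.
df1 = (df.melt('id', value_name='Group')
         .groupby('Group')
         .size()
         .reset_index(name='id_count'))
print(df1)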

Select certain row values and make them columns in pandas

I have a dataset that looks like the below:
+-------------------------+-------------+------+--------+-------------+--------+--+
| | impressions | name | shares | video_views | diff | |
+-------------------------+-------------+------+--------+-------------+--------+--+
| _ts | | | | | | |
| 2016-09-12 23:15:04.120 | 1 | Vidz | 7 | 10318 | 15mins | |
| 2016-09-12 23:16:45.869 | 2 | Vidz | 7 | 10318 | 16mins | |
| 2016-09-12 23:30:03.129 | 3 | Vidz | 18 | 29291 | 30mins | |
| 2016-09-12 23:32:08.317 | 4 | Vidz | 18 | 29291 | 32mins | |
+-------------------------+-------------+------+--------+-------------+--------+--+
I am trying to build a dataframe to feed to a regression model, and I'd like to parse out specific rows as features. To do this I would like the dataframe to resemble this
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| | name | 15min_shares | 15min_impressions | 15min_video_views | 30min_shares | 30min_impressions | 30min_video_views |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| _ts | | | | | | | |
| 2016-09-12 23:15:04.120 | Vidz | 7 | 1 | 10318 | 18 | 3 | 29291 |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
What would be the best way to do this? I think this would be easier if I were only trying to select one row (15mins): just parse out the unneeded rows and pivot.
However, I need both the 15min and 30min features and am unsure how to proceed, given the need for these columns.
You could take subsets of your DF for the 15mins and 30mins rows, concatenate them side by side, backfill the NaN values of the first row (15mins) with those of its next row (30mins), and then drop the 30mins row, as shown:
prefix_15="15mins"
prefix_30="30mins"
fifteen_mins = (df['diff']==prefix_15)
thirty_mins = (df['diff']==prefix_30)
df = df[fifteen_mins|thirty_mins].drop(['diff'], axis=1)
df_ = pd.concat([df[fifteen_mins].add_prefix(prefix_15+'_'), \
df[thirty_mins].add_prefix(prefix_30+'_')], axis=1) \
.fillna(method='bfill').dropna(how='any')
del(df_['30mins_name'])
df_.rename(columns={'15mins_name':'name'}, inplace=True)
df_
Stacking to pivot and collapsing your columns:
df1 = df.set_index('diff', append=True).stack().unstack(0).T
df1.columns = df1.columns.map('_'.join)
To see just the first row:
df1.iloc[[0]].dropna(axis=1)
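A pivot-based alternative (a sketch, not one of the original answers), assuming each name has exactly one row per interval in the 'diff' column; the sample-data reconstruction is added for illustration:

import pandas as pd

df = pd.DataFrame(
    {'impressions': [1, 2, 3, 4],
     'name': ['Vidz'] * 4,
     'shares': [7, 7, 18, 18],
     'video_views': [10318, 10318, 29291, 29291],
     'diff': ['15mins', '16mins', '30mins', '32mins']},
    index=pd.to_datetime(['2016-09-12 23:15:04.120', '2016-09-12 23:16:45.869',
                          '2016-09-12 23:30:03.129', '2016-09-12 23:32:08.317']))
df.index.name = '_ts'

# Keep only the intervals wanted as features, then pivot to one row per name.
subset = df[df['diff'].isin(['15mins', '30mins'])]
wide = subset.pivot_table(index='name', columns='diff',
                          values=['impressions', 'shares', 'video_views'],
                          aggfunc='first')
# Flatten the (value, diff) column MultiIndex into names like '15mins_shares'.
wide.columns = [f'{diff}_{col}' for col, diff in wide.columns]
print(wide)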
