I'm looking for a more automated approach to subset this dataframe by rank and put the subsets in a list, because if there happen to be 150 ranks I can't create each subset individually.
ID | GROUP | RANK
1 | A | 1
2 | B | 2
3 | C | 3
2 | A | 1
2 | E | 2
2 | G | 3
How can I subset the dataframe by Rank and then put every subset in a list? (Not using group by)
I know how to subset them individually, but I'm not sure how to do this if there are more ranks.
Output:
ranks = [df1,df2,df3....and so on]
Just use groupby directly in a list comprehension
>>> [df for rank, df in df.groupby('RANK')]
This will generate a list of dataframes, each a sub-dataframe related to the corresponding rank.
You can also do a dict comprehension:
>>> dic = {rank: df for rank, df in df.groupby('RANK')}
such that you can access your df via dic[1] for rank == 1.
In more detail, pd.DataFrame.groupby is a method that returns a DataFrameGroupBy object. A DataFrameGroupBy object is an iterable, which means you can iterate over it with a for loop. This iterable generates tuples with two values, where the first is whatever you used to group by (in this case, an integer rank), and the second is the sub-dataframe.
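For reference, a minimal self-contained sketch (building a small frame with the column names from the question) showing both comprehensions:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 2, 2, 2],
    'GROUP': ['A', 'B', 'C', 'A', 'E', 'G'],
    'RANK': [1, 2, 3, 1, 2, 3],
})

# one sub-dataframe per rank, collected in a list
ranks = [sub for rank, sub in df.groupby('RANK')]

# or keyed by rank, so dic[1] is the rank == 1 subset
dic = {rank: sub for rank, sub in df.groupby('RANK')}

print(len(ranks))  # 3
print(dic[1])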
I'm new to pandas and I'm trying to understand if there is a method to find out if two values from one row in df1 are between two values from one row in df2.
Basically my df1 looks like this:
start | value | end
1 | TEST | 5
2 | TEST | 3
...
and my df2 looks like this:
start | value | end
2 | TEST2 | 10
3 | TEST2 | 4
...
Right now I've got it working with two loops:
for row in df1.iterrows():
    for row2 in df2.iterrows():
        if row2[1]["start"] >= row[1]["start"] and row2[1]["end"] <= row[1]["end"]:
            print(row2)
but this doesn't feel like it's the pandas way to me.
What I expect is that row number 2 of df2 gets printed, because 3 > 1 and 4 < 5, i.e.:
3 | TEST2 | 4
Is there a method to do this in a more pandas-like way?
You could use a cross merge to get all combinations of df1 and df2 rows, and filter using classical comparisons. Finally, get the indices and slice:
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique()
       )
df2.loc[idx]
NB: I am using unique here to ensure that a row is selected only once, even if there are several matches.
Output:
start value end
1 3 TEST2 4
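For completeness, a self-contained sketch (constructing df1 and df2 from the tables above, with column names assumed from the question) that reproduces this output:
import pandas as pd

df1 = pd.DataFrame({'start': [1, 2], 'value': ['TEST', 'TEST'], 'end': [5, 3]})
df2 = pd.DataFrame({'start': [2, 3], 'value': ['TEST2', 'TEST2'], 'end': [10, 4]})

# cross merge: every df1 row paired with every df2 row, then keep contained rows
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique())

print(df2.loc[idx])
#    start  value  end
# 1      3  TEST2    4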
Table 1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
Table 2
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
Result Table
Brief Explanation on how the Result table needs to be created:
I have two data frames and I want to merge them based on df1_id, but the date column from the second table should be transposed into the resultant table.
The date columns for the result table will be a range between the min date and max date from the second table.
The column values for those dates in the result table will come from the data column of the second table.
Also, the test column from the second table will only take its value for the latest date in the result table.
I hope this is clear. Any suggestion or help regarding this will be greatly appreciated.
I have tried using pivot on the second table and then merging the pivoted second table with df1, but it's not working. I do not know how to get only one row for the latest value of test.
Note: I am trying to solve this problem using vectorization and do not want to serially parse through each row
You need to pivot your df2 into two separate tables, since we need both the data and test values, and then merge both resulting pivot tables with df1:
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','03-05-2021','05-05-2021'],'data':[12,13,9,16],'test':['g','h','i','j']})
test_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['test'])
data_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['data'])
max_test = test_piv['test'].ffill(axis=1).iloc[:,-1].rename('test')
final = df1.merge(data_piv['data'],left_on=df1.df1_id, right_index=True, how='left')
final = final.merge(max_test,left_on=df1.df1_id, right_index=True, how='left')
The resulting final dataframe is shown below:
| | df1_id | col1 | col2 | 01-05-2021 | 03-05-2021 | 05-05-2021 | test |
|---:|---------:|:-------|:-------|-------------:|-------------:|-------------:|:-------|
| 0 | 1 | a | d | 12 | 9 | 16 | j |
| 1 | 2 | b | e | nan | 13 | nan | h |
| 2 | 3 | c | f | nan | nan | nan | nan |
Here is a solution to the question:
I first sort df2 by df1_id and date to ensure that the table entries are in order.
Then I drop duplicates based on df1_id, keeping the last row, to ensure I have the latest values for test and test2.
Then I pivot df2 to get the dates as columns and data as their values.
Then I merge the latest-values table with df2_pivoted to combine the latest values of test and test2.
Then I merge with df1 to get the resultant table.
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
df2=df2.sort_values(by=['df1_id','date'])
df2_latest_vals = df2.drop_duplicates(subset=['df1_id'],keep='last')
df2_pivoted = df2.pivot_table(index=['df1_id'],columns=['date'],values=['data'])
df2_pivoted = df2_pivoted.droplevel(0,axis=1).reset_index()
df2_pivoted = pd.merge(df2_pivoted,df2_latest_vals,on='df1_id')
df2_pivoted = df2_pivoted.drop(columns=['date','data'])
result = pd.merge(df1,df2_pivoted,on='df1_id',how='left')
result
Note: I have not been able to figure out how to get the entire date range between 01-05-2021 and 05-05-2021 and show the empty values as NaN. If anyone can help, please edit the answer.
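One possible way to fill that gap (a sketch only, assuming the dd-mm-YYYY date strings can be parsed and the pivoted columns reindexed against a full pd.date_range):
# parse dates, pivot, then reindex the columns against the full daily range
df2['date'] = pd.to_datetime(df2['date'], format='%d-%m-%Y')
df2_pivoted = df2.pivot_table(index='df1_id', columns='date', values='data')

full_range = pd.date_range(df2['date'].min(), df2['date'].max(), freq='D')
df2_pivoted = df2_pivoted.reindex(columns=full_range)  # missing dates become NaN

# optionally restore the original dd-mm-YYYY column labels
df2_pivoted.columns = df2_pivoted.columns.strftime('%d-%m-%Y')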
I have a MultiIndex dataframe with the top level columns named:
Col1_1 | Col1_2 | Col2_1 | Col2_2 | ... |
I'm looking to combine Col1_1 with Col1_2 as Col1. I could also do this before creating the MultiIndex, but the original data is more drawn out as:
Col1_1.aspect1 | Col1_1.aspect2 | Col1_2.aspect1 | Col1_2.aspect2 | ... |
where 'aspect1' and 'aspect2' become subcolumns in the MultiIndex.
Please let me know if I can clarify anything, and many thanks in advance.
The expected result combines the two (Col1_1 and Col1_2) into a single column; any number of ways is fine, including stacking/concatenating the data, outputting a summary stat (e.g. mean()), etc.
You can use groupby and apply an aggregation function such as mean against it.
You must group along axis 1 (columns) at level 1 (the lower MultiIndex column level). This applies the grouping across all samples. Then simply take the mean, if that's what you want to achieve:
df.groupby(level=1, axis=1).mean()
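A minimal sketch (with made-up values, assuming the two-level column layout described above):
import pandas as pd
import numpy as np

# two-level columns: (sample, aspect)
cols = pd.MultiIndex.from_product([['Col1_1', 'Col1_2'], ['aspect1', 'aspect2']])
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=cols)

# average the same aspect across all top-level columns
combined = df.groupby(level=1, axis=1).mean()
print(combined)
#    aspect1  aspect2
# 0      1.0      2.0
# 1      5.0      6.0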
I have a table of the form:
item_code | attribute | time_offset | mean | median | description | ...
The attribute column has one of 40 possible values and the time_offset column can be an integer from 0 to 20.
I want to transform this table to a wide one of the form:
item_code | <attribute1>_<time_offset1>_mean | <attribute1>_<time_offset1>_median | <attribute1>_<time_offset1>_description | <attribute1>_<time_offset1>_... | <attribute2>...
I can do this either in SQL or in Pandas but I'm having difficulty with the fact that some of the columns are not numeric, so it is hard to come up with an aggregation function for them.
I can guarantee that each combination of item_code, attribute and time_offset will have only one row, so I do not need an aggregation function. Is there something like a transpose operation that will allow me to do what I am looking for?
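One possible approach, sketched here with assumed column names: pivot needs no aggregation function when each key combination is unique, and the resulting column MultiIndex can then be flattened into the desired names.
import pandas as pd

# toy data with the assumed column names
df = pd.DataFrame({
    'item_code':   ['X', 'X', 'Y'],
    'attribute':   ['attr1', 'attr1', 'attr1'],
    'time_offset': [0, 1, 0],
    'mean':        [1.0, 2.0, 3.0],
    'median':      [1.5, 2.5, 3.5],
    'description': ['a', 'b', 'c'],
})

# pivot keeps non-numeric columns as-is; no aggregation is applied
wide = df.pivot(index='item_code',
                columns=['attribute', 'time_offset'],
                values=['mean', 'median', 'description'])

# flatten (value, attribute, time_offset) into <attribute>_<time_offset>_<value> names
wide.columns = [f'{attr}_{off}_{val}' for val, attr, off in wide.columns]
wide = wide.reset_index()
print(wide)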