Join multiple data frames in PySpark - python

I have the following data frames, each with two columns and exactly the same number of rows. How do I join them so that I get a single data frame with the two columns and all rows from all of the data frames?
For example:
DataFrame-1
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
+--------------+-------------+
DataFrame-2
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_1_URI | 1 |
| sample_1_URI | 1 |
+--------------+-------------+
DataFrame-3
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_2_URI | 2 |
| sample_2_URI | 2 |
+--------------+-------------+
DataFrame-4
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
...
I want the result of the join to be:
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
| sample_1_URI | 1 |
| sample_1_URI | 1 |
| sample_2_URI | 2 |
| sample_2_URI | 2 |
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
Now, if I want to do one-hot encoding for the label column, should it be something like this:
from pyspark.ml.feature import OneHotEncoder
oe = OneHotEncoder(inputCol="label", outputCol="one_hot_label")
df = oe.transform(df)  # df is the joined dataframe <cols, label>

You are looking for union.
In this case, what I would do is put the dataframes in a list and use reduce:
from functools import reduce
dataframes = [df_1, df_2, df_3, df_4]
result = reduce(lambda first, second: first.union(second), dataframes)
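For a concrete picture, here is a minimal, hedged sketch assuming an existing SparkSession named spark and that every frame shares the same schema (union matches columns by position; unionByName can be used if only the names are guaranteed to line up):
from functools import reduce

df_1 = spark.createDataFrame([("sample_0_URI", 0), ("sample_0_URI", 0)], ["col1", "label"])
df_2 = spark.createDataFrame([("sample_1_URI", 1), ("sample_1_URI", 1)], ["col1", "label"])
df_3 = spark.createDataFrame([("sample_2_URI", 2), ("sample_2_URI", 2)], ["col1", "label"])

# Stack the frames on top of each other, pair by pair
result = reduce(lambda first, second: first.union(second), [df_1, df_2, df_3])
result.show()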

Related

pandas group by category and assign a bin with pd.cut

I have a dataframe like the following:
+-------+-------+
| Group | Price |
+-------+-------+
| A | 2 |
| B | 3 |
| A | 1 |
| C | 4 |
| B | 2 |
+-------+-------+
I would like to create a column that tells me which range (dividing each group into 4 intervals) my price value falls into within its group.
+-------+-------+--------------------------+
| Group | Price | Range |
+-------+-------+--------------------------+
| A | 2 | [1-2] |
| B | 3 | [2-3] |
| A | 1 | [0-1] |
| C | 4 | [0-4] |
| B | 2 | [0-2] |
+-------+-------+--------------------------+
Does anyone have an idea how to do this using pandas pd.cut and groupby operations?
Thanks
You can pass pd.cut to transform() after a groupby():
df['Range'] = df.groupby('Group')['Price'].transform(pd.cut, bins=4)
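For reference, a minimal sketch using the sample data above (the exact interval labels depend on how pd.cut splits each group's range):
import pandas as pd

df = pd.DataFrame({"Group": ["A", "B", "A", "C", "B"],
                   "Price": [2, 3, 1, 4, 2]})
# pd.cut is applied per group, so each price is binned within its own group's range
df["Range"] = df.groupby("Group")["Price"].transform(pd.cut, bins=4)
print(df)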

Transform a pandas DataFrame into one with MultiIndex columns

I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
And I want to convert this dataframe into a multi-column data frame that looks like this:
+----+------------------------+------------------------+
|    |            A           |            B           |
+----+-----------+------------+-----------+------------+
| id |   price   |   amount   |   price   |   amount   |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
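As a side note (an addition, not part of the original answer): the stray lowercase b level comes from the amount_b header; assuming that header is just a typo, you could normalize it before splitting:
# Hypothetical pre-step, assuming amount_b was meant to be amount_B
df = df.rename(columns={"amount_b": "amount_B"})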
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
Note: in my input dataframe I replaced the column name "amount_b" with "amount_B", since your expected output suggests it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
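As a usage note (an addition, not from the original answer): once the columns are a MultiIndex, a whole top-level group or a single column can be selected directly:
print(df["A"])             # just the price/amount columns under A
print(df[("B", "price")])  # a single column addressed by a tuple key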

How to apply multiple custom functions on multiple columns in grouped DataFrame in pandas?

I have a pandas DataFrame which is grouped by p_id.
The goal is to get a DataFrame with data shown under 'Output I'm looking for'.
I've tried a few things, but I am struggling to apply two custom aggregation functions:
apply(list) for x_id
'||'.join for x_name.
How can I solve this problem?
Input
| p_id | x_id | x_name |
|------|------|--------|
| 1 | 4 | Text |
| 2 | 4 | Text |
| 2 | 5 | Text2 |
| 2 | 6 | Text3 |
| 3 | 4 | Text |
| 3 | 7 | Text4 |
Output I'm looking for
| p_id | x_ids | x_names |
|------|---------|--------------------|
| 1 | [4] | Text |
| 2 | [4,5,6] | Text||Text2||Text3 |
| 3 | [4,7] | Text||Text4 |
You can certainly do:
df.groupby('p_id').agg({'x_id': list, 'x_name': '||'.join})
Or a little more advanced with named agg:
df.groupby('p_id').agg(x_ids=('x_id', list),
                       x_names=('x_name', '||'.join))
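A runnable sketch, assuming the input table above is loaded into df:
import pandas as pd

df = pd.DataFrame({"p_id": [1, 2, 2, 2, 3, 3],
                   "x_id": [4, 4, 5, 6, 4, 7],
                   "x_name": ["Text", "Text", "Text2", "Text3", "Text", "Text4"]})

# list collects the x_id values per group; '||'.join concatenates the x_name strings
out = (df.groupby("p_id")
         .agg(x_ids=("x_id", list), x_names=("x_name", "||".join))
         .reset_index())
print(out)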

Pivot multiple columns from row to column

I have a PySpark dataframe which looks like this:
| id | name | policy | payment_name | count |
|------|--------|------------|--------------|-------|
| 2 | two | 0 | Hybrid | 58 |
| 2 | two | 1 | Hybrid | 2 |
| 5 | five | 1 | Excl | 13 |
| 5 | five | 0 | Excl | 70 |
| 5 | five | 0 | Agen | 811 |
| 5 | five | 1 | Agen | 279 |
| 5 | five | 1 | Hybrid | 600 |
| 5 | five | 0 | Hybrid | 2819 |
I would like to make the combination of policy and payment_name become a column with the respective count (reducing down to one row per id).
Output would look like this:
| id | name | no_policy_hybrid | no_policy_excl | no_policy_agen | policy_hybrid | policy_excl | policy_agen |
|----|------|------------------|----------------|----------------|---------------|-------------|-------------|
| 2 | two | 58 | 0 | 0 | 2 | 0 | 0 |
| 5 | five | 2819 | 70 | 811 | 600 | 13 | 279 |
In cases where there is no combination we can default it to 0, i.e. id 2 has no combination including payment_name Excl, so it is set to 0 in the example output.
To pivot the table, you would first need a grouping column to combine the policy and the payment_name.
df = df.withColumn("groupingCol", udf("{}_{}".format)("policy", "payment_name"))
When you have that, you can group by the id and name columns and pivot the grouping column.
df.groupBy("id", "name").pivot("groupingCol").agg(F.max("count"))
That should return the correct table columns.
+---+----+------+------+--------+------+------+--------+
| id|name|0_Agen|0_Excl|0_Hybrid|1_Agen|1_Excl|1_Hybrid|
+---+----+------+------+--------+------+------+--------+
| 5|five| 811| 70| 2819| 279| 13| 600|
| 2| two| null| null| 58| null| null| 2|
+---+----+------+------+--------+------+------+--------+
To get the same column names as in your example, you can start by changing the content of the policy column to policy and no_policy like this:
df = df.withColumn("policy", when(col("policy") == 1, "policy").otherwise("no_policy"))
This is how you would replace the missing values with 0:
df.na.fill(0)
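Putting the pieces together, here is a hedged end-to-end sketch. It uses concat_ws and lower instead of the Python udf so the pivoted column names match the headers in the expected output, and it assumes the input table is already loaded as df:
import pyspark.sql.functions as F

# 1 -> "policy", 0 -> "no_policy"
df = df.withColumn("policy", F.when(F.col("policy") == 1, "policy").otherwise("no_policy"))
# e.g. "no_policy_hybrid", "policy_excl"
df = df.withColumn("groupingCol",
                   F.concat_ws("_", F.col("policy"), F.lower(F.col("payment_name"))))

result = (df.groupBy("id", "name")
            .pivot("groupingCol")
            .agg(F.max("count"))
            .na.fill(0))
result.show()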

Select certain row values and make them columns in pandas

I have a dataset that looks like the below:
+-------------------------+-------------+------+--------+-------------+--------+--+
| | impressions | name | shares | video_views | diff | |
+-------------------------+-------------+------+--------+-------------+--------+--+
| _ts | | | | | | |
| 2016-09-12 23:15:04.120 | 1 | Vidz | 7 | 10318 | 15mins | |
| 2016-09-12 23:16:45.869 | 2 | Vidz | 7 | 10318 | 16mins | |
| 2016-09-12 23:30:03.129 | 3 | Vidz | 18 | 29291 | 30mins | |
| 2016-09-12 23:32:08.317 | 4 | Vidz | 18 | 29291 | 32mins | |
+-------------------------+-------------+------+--------+-------------+--------+--+
I am trying to build a dataframe to feed to a regression model, and I'd like to parse out specific rows as features. To do this, I would like the dataframe to resemble this:
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| | name | 15min_shares | 15min_impressions | 15min_video_views | 30min_shares | 30min_impressions | 30min_video_views |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| _ts | | | | | | | |
| 2016-09-12 23:15:04.120 | Vidz | 7 | 1 | 10318 | 18 | 3 | 29291 |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
What would be the best way to do this? I think this would be easier if I were only trying to select 1 row (15mins), just parse out the unneeded rows and pivot.
However, I need both the 15min and 30min features and am unsure how to proceed given the need for these columns.
You could take subsets of your DF for the 15mins and 30mins rows and concatenate them, backfilling the NaN values of the first row (15mins) with those of its next row (30mins) and then dropping that next row (30mins), as shown:
prefix_15="15mins"
prefix_30="30mins"
fifteen_mins = (df['diff']==prefix_15)
thirty_mins = (df['diff']==prefix_30)
df = df[fifteen_mins|thirty_mins].drop(['diff'], axis=1)
df_ = pd.concat([df[fifteen_mins].add_prefix(prefix_15 + '_'),
                 df[thirty_mins].add_prefix(prefix_30 + '_')], axis=1) \
        .fillna(method='bfill').dropna(how='any')
del(df_['30mins_name'])
df_.rename(columns={'15mins_name':'name'}, inplace=True)
df_
Stacking to pivot and collapsing your columns:
df1 = df.set_index('diff', append=True).stack().unstack(0).T
df1.columns = df1.columns.map('_'.join)
To see just the first row:
df1.iloc[[0]].dropna(axis=1)
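For completeness, a minimal runnable sketch of the stacking approach, assuming the sample data above (timestamps abbreviated):
import pandas as pd

df = pd.DataFrame(
    {"impressions": [1, 2, 3, 4],
     "name": ["Vidz"] * 4,
     "shares": [7, 7, 18, 18],
     "video_views": [10318, 10318, 29291, 29291],
     "diff": ["15mins", "16mins", "30mins", "32mins"]},
    index=pd.to_datetime(["2016-09-12 23:15:04", "2016-09-12 23:16:45",
                          "2016-09-12 23:30:03", "2016-09-12 23:32:08"]).rename("_ts"))

# Move "diff" into the index, stack the remaining columns, then pivot the timestamps out
df1 = df.set_index("diff", append=True).stack().unstack(0).T
df1.columns = df1.columns.map("_".join)
print(df1.iloc[[0]].dropna(axis=1))  # only the columns populated for the first row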
