Select certain row values and make them columns in pandas - python

I have a dataset that looks like the below:
+-------------------------+-------------+------+--------+-------------+--------+--+
| | impressions | name | shares | video_views | diff | |
+-------------------------+-------------+------+--------+-------------+--------+--+
| _ts | | | | | | |
| 2016-09-12 23:15:04.120 | 1 | Vidz | 7 | 10318 | 15mins | |
| 2016-09-12 23:16:45.869 | 2 | Vidz | 7 | 10318 | 16mins | |
| 2016-09-12 23:30:03.129 | 3 | Vidz | 18 | 29291 | 30mins | |
| 2016-09-12 23:32:08.317 | 4 | Vidz | 18 | 29291 | 32mins | |
+-------------------------+-------------+------+--------+-------------+--------+--+
I am trying to build a dataframe to feed to a regression model, and I'd like to parse out specific rows as features. To do this I would like the dataframe to resemble this
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| | name | 15min_shares | 15min_impressions | 15min_video_views | 30min_shares | 30min_impressions | 30min_video_views |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| _ts | | | | | | | |
| 2016-09-12 23:15:04.120 | Vidz | 7 | 1 | 10318 | 18 | 3 | 29291 |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
What would be the best way to do this? I think this would be easier if I were only trying to select 1 row (15mins), just parse out the unneeded rows and pivot.
However, I need both the 15min and 30min features, and I am unsure how to proceed given the need for these columns.

You could take the subsets of your DataFrame for the 15mins and 30mins rows and concatenate them side by side, backfill the NaN values of the first row (15mins) with those of its next row (30mins), and then drop that next row (30mins), as shown:
import pandas as pd

prefix_15 = "15mins"
prefix_30 = "30mins"
# keep only the two rows of interest
df = df[df['diff'].isin([prefix_15, prefix_30])]
# split them into one frame per label and drop the helper column
fifteen_mins = df[df['diff'] == prefix_15].drop(columns='diff')
thirty_mins = df[df['diff'] == prefix_30].drop(columns='diff')
# side-by-side concat leaves NaN where the timestamps differ; backfill, then drop the leftover row
df_ = pd.concat([fifteen_mins.add_prefix(prefix_15 + '_'),
                 thirty_mins.add_prefix(prefix_30 + '_')], axis=1) \
        .bfill().dropna(how='any')
df_ = df_.drop(columns='30mins_name').rename(columns={'15mins_name': 'name'})
df_

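Another option is to stack, unstack to pivot, and then collapse the column levels:
df1 = df.set_index('diff', append=True).stack().unstack(0).T
df1.columns = df1.columns.map('_'.join)
To see just the first row (note that dropna needs axis passed by keyword in recent pandas versions):
df1.iloc[[0]].dropna(axis=1)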
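For comparison, here is a minimal pivot-based sketch of the reshaping the question asks for. It is a different technique from the two answers above: it assumes each (name, diff) pair occurs at most once, it keys the result on name and so drops the _ts index, and the helper name features_at is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question; _ts is the index.
df = pd.DataFrame(
    {
        "impressions": [1, 2, 3, 4],
        "name": ["Vidz", "Vidz", "Vidz", "Vidz"],
        "shares": [7, 7, 18, 18],
        "video_views": [10318, 10318, 29291, 29291],
        "diff": ["15mins", "16mins", "30mins", "32mins"],
    },
    index=pd.to_datetime(
        [
            "2016-09-12 23:15:04.120",
            "2016-09-12 23:16:45.869",
            "2016-09-12 23:30:03.129",
            "2016-09-12 23:32:08.317",
        ]
    ).rename("_ts"),
)


def features_at(frame, labels):
    """Hypothetical helper: one row per name, one column per (diff, metric)."""
    picked = frame[frame["diff"].isin(labels)]
    wide = picked.pivot(
        index="name",
        columns="diff",
        values=["shares", "impressions", "video_views"],
    )
    # Columns are (metric, diff) tuples; flatten them to "<diff>_<metric>".
    wide.columns = [f"{diff}_{metric}" for metric, diff in wide.columns]
    return wide.reset_index()


features = features_at(df, ["15mins", "30mins"])
print(features)
```

If the timestamp of the 15mins row is needed as well, it could be carried along as an ordinary column before pivoting; that is left out here to keep the sketch short.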


Pyspark, Two dataframes groupBy at the same time and apply pandasUDF

I know that @pandas_udf(schema, PandasUDFType.GROUPED_MAP) has to be used through pyspark_dataframe.groupBy(something).apply().
What if I need two dataframes in this pandas_udf? For example, I have a dataframe A:
| id | cluster| value |
|:---- |:------:| -----:|
| 1 | A | 3 |
| 2 | A | 5 |
| 3 | B | 7 |
| 4 | B | 5 |
And then I have dataframe B:
| id | cluster |
|:---- |:------:|
| 5 | A |
| 6 | B |
And my desired output dataframe is:
| id | cluster| pred |
|:---- |:------:| -----:|
| 5 | A | 4 |
| 6 | B | 6 |
Here 'pred' is obtained by grouping A by 'cluster' and taking the mean of value within each cluster. I want to achieve this with a pandas UDF, so I can only call B.groupBy('cluster').apply(my_pandasUDF). Can my_pandasUDF take two inputs, like this?
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def my_pandasUDF(df, A_df=A):
    means = A_df['value'].mean(axis=0)
    df['pred'] = means
    return df
B.groupBy('cluster').apply(my_pandasUDF).show()
Can this code give my desired result? If not, how can I do that? Thank you so much.
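As a rough illustration of the result being asked for (each row of B receives the mean value of its cluster in A), the logic can be sketched in plain pandas. This is not a pandas UDF and does not use Spark; the variable names cluster_means and result are made up for the sketch.

```python
import pandas as pd

# The two frames from the question.
A = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "cluster": ["A", "A", "B", "B"],
        "value": [3, 5, 7, 5],
    }
)
B = pd.DataFrame({"id": [5, 6], "cluster": ["A", "B"]})

# Mean of value per cluster in A, as a two-column frame (cluster, pred).
cluster_means = (
    A.groupby("cluster", as_index=False)["value"]
    .mean()
    .rename(columns={"value": "pred"})
)

# Attach each cluster's mean to the matching rows of B.
result = B.merge(cluster_means, on="cluster", how="left")
print(result)
```

In Spark the same idea is usually expressed as an aggregation on A followed by a join with B, which avoids having to pass a second DataFrame into the UDF.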

pandas group by category and assign a bin with pd.cut

I have a dataframe like the following:
+-------+-------+
| Group | Price |
+-------+-------+
| A | 2 |
| B | 3 |
| A | 1 |
| C | 4 |
| B | 2 |
+-------+-------+
I would like to create a column that tells me which range my price value falls into within its group, if each group's prices are divided into 4 intervals.
+-------+-------+--------------------------+
| Group | Price | Range |
+-------+-------+--------------------------+
| A | 2 | [1-2] |
| B | 3 | [2-3] |
| A | 1 | [0-1] |
| C | 4 | [0-4] |
| B | 2 | [0-2] |
+-------+-------+--------------------------+
Does anyone have an idea how to do this with pandas pd.cut and groupby operations?
Thanks
You can pass pd.cut to groupby().transform():
df['Range'] = df.groupby('Group')['Price'].transform(pd.cut, bins=4)
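A small self-contained sketch of per-group binning on the question's data follows. It wraps pd.cut in a lambda so the result can be returned either as interval labels (converted to strings) or as integer bin positions via labels=False; the column names Range and Bin are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Group": ["A", "B", "A", "C", "B"],
        "Price": [2, 3, 1, 4, 2],
    }
)

# Interval label of the bin each price falls into, computed within its own group.
df["Range"] = df.groupby("Group")["Price"].transform(
    lambda s: pd.cut(s, bins=4).astype(str)
)

# Position (0 to 3) of that bin among the group's four equal-width bins.
df["Bin"] = df.groupby("Group")["Price"].transform(
    lambda s: pd.cut(s, bins=4, labels=False)
)
print(df)
```

Because the bin edges are derived from each group's own minimum and maximum, the same price can land in different bins depending on its group.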

Transform a pandas DataFrame into a DataFrame with MultiIndex columns

I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
And I want to convert this dataframe into a multi-column dataframe that looks like this:
+----+-----------+------------+-----------+------------+
| | A | B |
+----+-----------+------------+-----------+------------+
| id | price | amount | price | amount |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
           A                   B         b
       price    amount     price    amount
id
0   0.652826  0.941421  0.823048  0.728427
1   0.400078  0.600585  0.194912  0.269842
2   0.223524  0.146675  0.375459  0.177165
3   0.330626  0.214981  0.389855  0.541666
4   0.578132  0.304780  0.789573  0.268851
5   0.094360  0.514878  0.419333  0.017010
6   0.279122  0.401132  0.722363  0.337094
7   0.444977  0.333254  0.643878  0.371528
8   0.724673  0.063281  0.345225  0.935403
9   0.905482  0.846500  0.585653  0.364495
You can split the column names on the underscore, which turns each name into a tuple. Once each column name is mapped to a tuple, pandas converts the Index to a MultiIndex for you. From there we just need to call swaplevel so that the letter level comes first, and assign the result back to the DataFrame's columns.
Note: in my input DataFrame I replaced the column name "amount_b" with "amount_B" because that matches your expected output, so I assumed it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
          A                   B
      price    amount     price    amount
0  0.652826  0.941421  0.823048  0.728427
1  0.400078  0.600585  0.194912  0.269842
2  0.223524  0.146675  0.375459  0.177165
3  0.330626  0.214981  0.389855  0.541666
4  0.578132  0.304780  0.789573  0.268851
5  0.094360  0.514878  0.419333  0.017010
6  0.279122  0.401132  0.722363  0.337094
7  0.444977  0.333254  0.643878  0.371528
8  0.724673  0.063281  0.345225  0.935403
9  0.905482  0.846500  0.585653  0.364495
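To make the split-and-swap step easy to try, here is a minimal sketch on the first two rows of the question's data. It assumes the corrected column name amount_B, and it adds two lookups to show how the resulting two-level columns can be addressed.

```python
import pandas as pd

# First two rows of the question's data, with "amount_b" corrected to "amount_B".
df = pd.DataFrame(
    {
        "price_A": [0.652826, 0.400078],
        "amount_A": [0.941421, 0.600585],
        "price_B": [0.823048, 0.194912],
        "amount_B": [0.728427, 0.269842],
    }
)
df.index.name = "id"

# "price_A" becomes ("price", "A"); swaplevel then puts the letter on top.
df.columns = df.columns.str.split("_", expand=True).swaplevel()

# Select everything under the top-level key "A".
print(df["A"])

# Select a single cell with a full (letter, field) key.
print(df.loc[0, ("B", "amount")])
```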

Pivot multiple columns from row to column

I have a PySpark dataframe which looks like this:
| id | name | policy | payment_name | count |
|------|--------|------------|--------------|-------|
| 2 | two | 0 | Hybrid | 58 |
| 2 | two | 1 | Hybrid | 2 |
| 5 | five | 1 | Excl | 13 |
| 5 | five | 0 | Excl | 70 |
| 5 | five | 0 | Agen | 811 |
| 5 | five | 1 | Agen | 279 |
| 5 | five | 1 | Hybrid | 600 |
| 5 | five | 0 | Hybrid | 2819 |
I would like to make the combination of policy and payment_name become a column with the respective count (reducing down to one row per id).
Output would look like this:
| id | name | no_policy_hybrid | no_policy_excl | no_policy_agen | policy_hybrid | policy_excl | policy_agen |
|----|------|------------------|----------------|----------------|---------------|-------------|-------------|
| 2 | two | 58 | 0 | 0 | 2 | 0 | 0 |
| 5 | five | 2819 | 70 | 811 | 600 | 13 | 279 |
Where a combination does not exist, we can default it to 0. For example, id 2 has no combination that includes payment_name Excl, so those columns are set to 0 in the example output.
To pivot the table, you first need a grouping column that combines policy and payment_name.
from pyspark.sql import functions as F
from pyspark.sql.functions import col, udf, when
df = df.withColumn("groupingCol", udf("{}_{}".format)("policy", "payment_name"))
Once you have that, you can group by the id and name columns and pivot the grouping column.
df.groupBy("id", "name").pivot("groupingCol").agg(F.max("count"))
That should return the correct table columns.
+---+----+------+------+--------+------+------+--------+
| id|name|0_Agen|0_Excl|0_Hybrid|1_Agen|1_Excl|1_Hybrid|
+---+----+------+------+--------+------+------+--------+
| 5|five| 811| 70| 2819| 279| 13| 600|
| 2| two| null| null| 58| null| null| 2|
+---+----+------+------+--------+------+------+--------+
To get the same column names as in your example, change the content of the policy column to policy and no_policy before building groupingCol, like this:
df = df.withColumn("policy", when(col("policy") == 1, "policy").otherwise("no_policy"))
This is how you would replace the missing values in the pivoted result with 0:
df.na.fill(0)  # returns a new DataFrame, so assign the result
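To check the expected result on the sample rows without a Spark session, the same reshaping can be sketched in pandas. This is a pandas analogue rather than PySpark code; the lower-casing of payment_name is an assumption made so the column names match the question's expected output.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame(
    {
        "id": [2, 2, 5, 5, 5, 5, 5, 5],
        "name": ["two", "two", "five", "five", "five", "five", "five", "five"],
        "policy": [0, 1, 1, 0, 0, 1, 1, 0],
        "payment_name": [
            "Hybrid", "Hybrid", "Excl", "Excl", "Agen", "Agen", "Hybrid", "Hybrid",
        ],
        "count": [58, 2, 13, 70, 811, 279, 600, 2819],
    }
)

# Build the combined label, e.g. policy=0 and "Hybrid" -> "no_policy_hybrid".
label = df["policy"].map({0: "no_policy", 1: "policy"})
df["groupingCol"] = label + "_" + df["payment_name"].str.lower()

# One row per (id, name); combinations that never occur are filled with 0.
wide = df.pivot_table(
    index=["id", "name"],
    columns="groupingCol",
    values="count",
    aggfunc="max",
    fill_value=0,
).reset_index()
print(wide)
```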

Join multiple data frame in PySpark

I have the following few data frames which have two columns each and have exactly the same number of rows. How do I join them so that I get a single data frame which has the two columns and all rows from both the data frames?
For example:
DataFrame-1
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
+--------------+-------------+
DataFrame-2
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_1_URI | 1 |
| sample_1_URI | 1 |
+--------------+-------------+
DataFrame-3
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_2_URI | 2 |
| sample_2_URI | 2 |
+--------------+-------------+
DataFrame-4
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
...
I want the result of the join to be:
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
| sample_1_URI | 1 |
| sample_1_URI | 1 |
| sample_2_URI | 2 |
| sample_2_URI | 2 |
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
Now, if I want to one-hot encode the label column, should it be something like this:
oe = OneHotEncoder(inputCol="label",outputCol="one_hot_label")
df = oe.transform(df) # df is the joined dataframes <cols, label>
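You are looking for union. Note that union matches columns by position, not by name, so the result takes its column names from the first DataFrame.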
In this case, what I would do is put the dataframes in a list and use reduce:
from functools import reduce
dataframes = [df_1, df_2, df_3, df_4]
result = reduce(lambda first, second: first.union(second), dataframes)
