I have a PySpark dataframe which looks like this:
| id | name | policy | payment_name | count |
|------|--------|------------|--------------|-------|
| 2 | two | 0 | Hybrid | 58 |
| 2 | two | 1 | Hybrid | 2 |
| 5 | five | 1 | Excl | 13 |
| 5 | five | 0 | Excl | 70 |
| 5 | five | 0 | Agen | 811 |
| 5 | five | 1 | Agen | 279 |
| 5 | five | 1 | Hybrid | 600 |
| 5 | five | 0 | Hybrid | 2819 |
I would like to make the combination of policy and payment_name become a column with the respective count (reducing down to one row per id).
Output would look like this:
| id | name | no_policy_hybrid | no_policy_excl | no_policy_agen | policy_hybrid | policy_excl | policy_agen |
|----|------|------------------|----------------|----------------|---------------|-------------|-------------|
| 2 | two | 58 | 0 | 0 | 2 | 0 | 0 |
| 5 | five | 2819 | 70 | 811 | 600 | 13 | 279 |
Where a combination does not exist, it can default to 0. For example, id 2 has no row with payment_name Excl, so those columns are set to 0 in the example output.
To pivot the table, you would first need a grouping column to combine the policy and the payment_name.
from pyspark.sql import functions as F
df = df.withColumn("groupingCol", F.concat_ws("_", F.col("policy").cast("string"), F.col("payment_name")))
When you have that, you can group by the id and name columns and pivot the grouping column.
df = df.groupBy("id", "name").pivot("groupingCol").agg(F.max("count"))
That should return the correct table columns.
+---+----+------+------+--------+------+------+--------+
| id|name|0_Agen|0_Excl|0_Hybrid|1_Agen|1_Excl|1_Hybrid|
+---+----+------+------+--------+------+------+--------+
| 5|five| 811| 70| 2819| 279| 13| 600|
| 2| two| null| null| 58| null| null| 2|
+---+----+------+------+--------+------+------+--------+
To get the same column names as in your example, you can start with changing the content of the policy column to policy and no_policy like this:
df = df.withColumn("policy", F.when(F.col("policy") == 1, "policy").otherwise("no_policy"))
This is how you would replace the missing values with 0:
df = df.na.fill(0)
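To check the expected numbers locally without a Spark session, the same reshape can be sketched in pandas. This is a pandas analogue of the steps above, not the PySpark code itself; the lower-casing of payment_name and the variable names label and wide are assumptions made only to match the column names in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [2, 2, 5, 5, 5, 5, 5, 5],
    "name": ["two", "two", "five", "five", "five", "five", "five", "five"],
    "policy": [0, 1, 1, 0, 0, 1, 1, 0],
    "payment_name": ["Hybrid", "Hybrid", "Excl", "Excl", "Agen", "Agen", "Hybrid", "Hybrid"],
    "count": [58, 2, 13, 70, 811, 279, 600, 2819],
})

# Step 1: turn the 0/1 flag into a readable label.
label = df["policy"].map({1: "policy", 0: "no_policy"})

# Step 2: build the grouping column from the label and the payment name.
df["groupingCol"] = label + "_" + df["payment_name"].str.lower()

# Step 3: pivot to one row per (id, name); missing combinations become 0.
wide = df.pivot_table(
    index=["id", "name"],
    columns="groupingCol",
    values="count",
    aggfunc="max",
    fill_value=0,
).reset_index()
wide.columns.name = None

print(wide)
```

The pivoted columns come out in alphabetical order, so reorder them explicitly if the order from the question matters.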
I have 2 data frames
df1
| email | ack |
| -------- | -------------- |
| first#abc.com | 1 |
| second#abc.com | 1 |
| third#abc.com | 1 |
| fourth#abc.com | 1 |
| fifth#abc.com | 1 |
| sixth#abc.com | 1 |
| seventh#abc.com | 1 |
| eight#abc.com | 1 |
df2
| email | ack |name| date|
| -------- | -------------- |-------------- |-------------- |
|first#abc.com | 0 |abc | 01/01/2022 |
| second#abc.com | 0 |xyz | 01/02/2022 |
| third#abc.com | 0 |mno | 01/03/2022 |
| fourth#abc.com | 0 |pqr | 01/04/2022 |
| fifth#abc.com | 0 |adam| 01/05/2022 |
| sixth#abc.com | 0 |eve |01/06/2022|
| seventh#abc.com | 0 |mary|01/07/2022|
| eight#abc.com | 0 |john|01/08/2022|
| nine#abc.com | 0 |kate|01/09/2022|
| ten#abc.com | 0 |matt|01/10/2022|
How do I merge the two dataframes above so that the values in the 'ack' column of df2 are replaced wherever the email address matches?
result
df2
| email | ack |name| date|
| -------- | -------------- |-------------- |-------------- |
|first#abc.com | 1 |abc|01/01/2022|
| second#abc.com | 1 |xyz|01/02/2022|
| third#abc.com | 1 |mno|01/03/2022|
| fourth#abc.com | 1 |pqr|01/04/2022|
| fifth#abc.com | 1 |adam|01/05/2022|
| sixth#abc.com | 1 |eve|01/06/2022|
| seventh#abc.com | 1 |mary|01/07/2022|
| eight#abc.com | 1 |john|01/08/2022|
| nine#abc.com | 0 |kate|01/09/2022|
| ten#abc.com | 0 |matt|01/10/2022|
I tried a left join and an outer join, but both appended rows to the existing rows.
Assuming df1['ack'] is always 1, the following code should work:
df2.loc[df2['email'].isin(df1['email']), 'ack'] = 1
In English:
If df2['email'] is found in df1['email'], set df2['ack'] = 1
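If df1['ack'] is not guaranteed to be 1, a lookup by email is one possible variant. The sketch below uses a shortened copy of the sample data and assumes each email appears at most once in df1; new_ack is a name introduced only for illustration.

```python
import pandas as pd

# Shortened sample data from the question.
df1 = pd.DataFrame({
    "email": ["first#abc.com", "second#abc.com"],
    "ack": [1, 1],
})
df2 = pd.DataFrame({
    "email": ["first#abc.com", "second#abc.com", "nine#abc.com"],
    "ack": [0, 0, 0],
    "name": ["abc", "xyz", "kate"],
})

# Look up each df2 email in df1; emails missing from df1 give NaN.
# Assumes df1["email"] has no duplicates.
new_ack = df2["email"].map(df1.set_index("email")["ack"])

# Keep the existing df2 value wherever df1 has no entry.
df2["ack"] = new_ack.fillna(df2["ack"]).astype(int)

print(df2)
```

Unlike a join, this never changes the number of rows in df2.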
I have a dataframe like the following:
+-------+-------+
| Group | Price |
+-------+-------+
| A | 2 |
| B | 3 |
| A | 1 |
| C | 4 |
| B | 2 |
+-------+-------+
I would like to create a column that tells me which range my price value falls into within its group, if each group were divided into 4 intervals.
+-------+-------+--------------------------+
| Group | Price | Range |
+-------+-------+--------------------------+
| A | 2 | [1-2] |
| B | 3 | [2-3] |
| A | 1 | [0-1] |
| C | 4 | [0-4] |
| B | 2 | [0-2] |
+-------+-------+--------------------------+
Does anyone have an idea how to do this using pandas pd.cut and groupby operations?
Thanks
You can pass pd.cut to groupby():
df['Range'] = df.groupby('Group')['Price'].transform(pd.cut, bins=4)
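Because each group produces its own set of interval categories, one cautious variant is to convert the intervals to strings inside the transform. This is a sketch on the sample data, under the assumption that a text label per row is acceptable; the exact bin edges are whatever pd.cut computes for each group.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Group": ["A", "B", "A", "C", "B"],
    "Price": [2, 3, 1, 4, 2],
})

# Cut each group's prices into 4 equal-width bins, computed per group,
# and store the resulting interval as a string label.
df["Range"] = df.groupby("Group")["Price"].transform(
    lambda s: pd.cut(s, bins=4).astype(str)
)

print(df)
```

Note that pd.cut builds the bins from each group's own minimum and maximum, so the intervals will not start at 0 as in the question's sketch.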
I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
And I want to convert this dataframe into a dataframe with multi-level columns that looks like this:
+----+-----------+------------+-----------+------------+
| | A | B |
+----+-----------+------------+-----------+------------+
| id | price | amount | price | amount |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
Note: in my input dataframe I replaced the column name "amount_b" with "amount_B", because that matches your expected output, so I assumed it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
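Once the columns are a MultiIndex, a whole block or a single inner level can be selected directly. The sketch below uses two rows of rounded sample values and the corrected "amount_B" name; a_block and prices are names introduced only for illustration.

```python
import pandas as pd

# Two rows of rounded sample values, with "amount_B" spelled consistently.
df = pd.DataFrame({
    "price_A": [0.65, 0.40],
    "amount_A": [0.94, 0.60],
    "price_B": [0.82, 0.19],
    "amount_B": [0.73, 0.27],
})

# Split "price_A" into ("price", "A"), then swap so the letter comes first.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
df.index.name = "id"

# Select everything under the top-level key "A".
a_block = df["A"]

# Select the "price" column of every top-level key.
prices = df.xs("price", axis=1, level=1)

print(a_block)
print(prices)
```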
I have a table like the one below:
esn_missing_in_DF_umts
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| cell_name | n_cell_name | source_vendor | target_vendor | source_rnc | target_rnc |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 1 | 8 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 2 | 5 | x | x | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 3 | 6 | x | x | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 4 | 9 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 5 | 10 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 6 | 11 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 7 | 12 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
Two columns, source_rnc and target_rnc, are empty in SQL Server (or in the dataframe).
Here are the other two tables I want to update the two columns from:
esn_umts_intra_sho
|---------------------|------------------|------------------|
| ucell | urelation | ucell_rnc |
|---------------------|------------------|------------------|
| 13 | 5 | abc567 |
|---------------------|------------------|------------------|
| 8 | 6 | abc568 |
|---------------------|------------------|------------------|
| 14 | 8 | abc569 |
|---------------------|------------------|------------------|
| 7 | 9 | abc570 |
|---------------------|------------------|------------------|
| 16 | 10 | abc571 |
|---------------------|------------------|------------------|
| 5 | 11 | abc572 |
|---------------------|------------------|------------------|
| 17 | 12 | abc573 |
|---------------------|------------------|------------------|
| 10 | 9 | abc574 |
|---------------------|------------------|------------------|
| 9 | 17 | abc575 |
|---------------------|------------------|------------------|
| 12 | 11 | abc576 |
|---------------------|------------------|------------------|
| 11 | 12 | abc577 |
|---------------------|------------------|------------------|
df_umts_carrier
|---------------------|------------------|
| cell_name_umts | rnc |
|---------------------|------------------|
| 1 | xyz123 |
|---------------------|------------------|
| 2 | xyz124 |
|---------------------|------------------|
| 3 | xyz125 |
|---------------------|------------------|
| 4 | xyz126 |
|---------------------|------------------|
| 5 | xyz127 |
|---------------------|------------------|
| 6 | xyz128 |
|---------------------|------------------|
| 7 | xyz129 |
|---------------------|------------------|
So now I want to update source_rnc and target_rnc from those two tables, esn_umts_intra_sho and df_umts_carrier.
I imagine that the query could be like this:
UPDATE [toolDB].[dbo].[esn_missing_in_DF_umts]
SET [toolDB].[dbo].[esn_missing_in_DF_umts].[target_rnc] = CASE WHEN [toolDB].[dbo].[esn_missing_in_DF_umts].[target_vendor] = 'HUA' THEN [toolDB].[dbo].[df_umts_carrier].[rnc]
FROM [toolDB].[dbo].[esn_missing_in_DF_umts]
INNER JOIN [toolDB].[dbo].[df_umts_carrier]
ON [n_cell_name] = [cell_name_umts]
ELSE
UPDATE [toolDB].[dbo].[esn_missing_in_DF_umts]
SET [toolDB].[dbo].[esn_missing_in_DF_umts].[target_rnc] = [toolDB].[dbo].[esn_umts_intra_sho].[ucell_rnc]
From [toolDB].[dbo].[esn_missing_in_DF_umts] INNER JOIN [toolDB].[dbo].[esn_umts_intra_sho]
ON [n_cell_name] = [ucell]
I want the final output to be something like this:
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| cell_name | n_cell_name | source_vendor | target_vendor | source_rnc | target_rnc |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 1 | 8 | x | y | xyz123 | abc568 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 2 | 5 | x | x | xyz124 | xyz127 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 3 | 6 | x | x | xyz125 | xyz128 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 4 | 9 | x | y | xyz126 | abc575 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 5 | 10 | x | y | xyz127 | abc574 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 6 | 11 | x | y | xyz128 | abc576 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 7 | 12 | x | y | xyz129 | abc577 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
I even tried with pandas, but it doesn't work...
I hope someone can help me.
The best approach is to write the query first as a SELECT statement with the CASE expression in it. Once it works as expected, you can amend it for your update.
So in this example, if the main table's column = bla, then get the data from the first joined table, else from the other table.
Quick amendment: make sure you are happy to update all rows; otherwise remember to add a WHERE clause. That's why it's best to work out your logic in a SELECT and move on from there.
I think you want something like this:
UPDATE UMT
SET UMT.[target_rnc] = (CASE WHEN UMT.target_vendor = 'HUA' THEN carrier.rnc ELSE SHO.ucell_rnc END)
FROM [toolDB].[dbo].[esn_missing_in_DF_umts] UMT
LEFT JOIN [toolDB].[dbo].[df_umts_carrier] carrier ON UMT.n_cell_name = carrier.cell_name_umts
LEFT JOIN [toolDB].[dbo].[esn_umts_intra_sho] SHO ON UMT.n_cell_name = SHO.ucell
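Since the question also mentions trying pandas, the same CASE logic can be sketched there with two lookups and a conditional pick. This is a pandas analogue of the query, not a replacement for it. It uses a shortened copy of the sample tables and assumes that vendor "x" in the sample plays the role of 'HUA' in the query, and that the lookup keys are unique.

```python
import pandas as pd

# Shortened sample tables from the question.
umts = pd.DataFrame({
    "cell_name": [1, 2, 3],
    "n_cell_name": [8, 5, 6],
    "target_vendor": ["y", "x", "x"],
})
carrier = pd.DataFrame({
    "cell_name_umts": [5, 6, 7],
    "rnc": ["xyz127", "xyz128", "xyz129"],
})
sho = pd.DataFrame({
    "ucell": [8, 7, 5],
    "ucell_rnc": ["abc568", "abc570", "abc572"],
})

# The two LEFT JOINs, expressed as lookups on n_cell_name.
carrier_rnc = umts["n_cell_name"].map(carrier.set_index("cell_name_umts")["rnc"])
sho_rnc = umts["n_cell_name"].map(sho.set_index("ucell")["ucell_rnc"])

# CASE WHEN target_vendor = <carrier vendor> THEN carrier.rnc ELSE sho.ucell_rnc END
# Assumption: "x" stands in for 'HUA' from the query.
is_carrier_vendor = umts["target_vendor"] == "x"
umts["target_rnc"] = carrier_rnc.where(is_carrier_vendor, sho_rnc)

print(umts)
```

Inspecting the result before writing it back mirrors the advice above to get the SELECT right before running the UPDATE.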
I have an Excel file that I read with pandas and convert to a dataframe. Here is a sample of the dataframe:
| | salads_count | salads_count | salads_count | carrot_counts | carrot_counts | carrot_counts |
|---------------|--------------|--------------|--------------|---------------|---------------|---------------|
| | 01.2016 | 02.2016 | 03.2016 | 01.2016 | 02.2016 | 03.2016 |
| farm_location | | | | | | |
| sweden | 42 | 41 | 43 | 52 | 51 | 53 |
The formatting is very odd, but that's what is in the Excel file. Initially, the first 2 rows are not even in MultiIndex form.
I managed to get it into a multiindex with the code below, but some columns are duplicated (salads_count appears several times for example):
arrays = [df.columns.tolist(), df.iloc[0].tolist()]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df.columns = index
I would like to convert the columns to a MultiIndex, something like this:
| | salads_count | | | carrot_counts | | |
|---------------|--------------|---------|---------|---------------|---------|---------|
| | 01.2016 | 02.2016 | 03.2016 | 01.2016 | 02.2016 | 03.2016 |
| farm_location | | | | | | |
| sweden | 42 | 41 | 43 | 52 | 51 | 53 |
Or even better, like this:
| | 01.2016 | | 02.2016 | | | |
|---------------|--------------|--------------|--------------|-------------|---|---|
| | carrot_count | salads_count | carrot_count | salad_count | | |
| farm_location | | | | | | |
| sweden | 52 | 42 | 51 | 41 | | |
How can I do this?
The best approach is to convert the columns to a MultiIndex directly in read_excel with the parameter header=[0,1]:
df = pd.read_excel(file, header=[0,1])
Then use swaplevel with sort_index:
df = df.swaplevel(0,1, axis=1).sort_index(axis=1, level=0)
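To see what swaplevel and sort_index do without an Excel file, the two-level header can be built by hand. The frame below is an assumed stand-in for what read_excel with header=[0,1] would return for the sample; it is not read from a real file.

```python
import pandas as pd

# Hand-built stand-in for the frame read_excel(header=[0, 1]) would produce.
cols = pd.MultiIndex.from_product(
    [["salads_count", "carrot_counts"], ["01.2016", "02.2016", "03.2016"]]
)
df = pd.DataFrame(
    [[42, 41, 43, 52, 51, 53]],
    index=pd.Index(["sweden"], name="farm_location"),
    columns=cols,
)

# Put the date level on top, then sort so each date's columns sit together.
df = df.swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)

print(df)
```

The dates here are plain strings, so they sort alphabetically. That matches chronological order within one year, but not across years.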