Combine Duplicate Rows in a Column in PySpark Dataframe - python

I have duplicate rows in a PySpark dataframe, and I want to combine them into a single row per group, summing the numeric columns, based on duplicate entries in one column.
Current Table
Deal_ID  Title   Customer  In_Progress  Deal_Total
30       Deal 1  Client A  350          900
30       Deal 1  Client A  360          850
50       Deal 2  Client B  30           50
30       Deal 1  Client A  125          200
30       Deal 1  Client A  90           100
10       Deal 3  Client C  32           121
Attempted PySpark Code
F.when(F.count(F.col('Deal_ID')) > 1, F.sum(F.col('In_Progress')) && F.sum(F.col('Deal_Total'))))
.otherwise(),
Expected Table
Deal_ID  Title   Customer  In_Progress  Deal_Total
30       Deal 1  Client A  925          2050
50       Deal 2  Client B  30           50
10       Deal 3  Client C  32           121

You need to group by the columns that identify the duplicates and then aggregate the amounts. I think this solves your problem:
df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
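Note that the dictionary form of agg names the output columns sum(In_Progress) and sum(Deal_Total). If you want to keep the original column names, a minimal sketch using explicit aliases (assuming pyspark.sql.functions imported as F, as in your attempt):

from pyspark.sql import functions as F

df = (df.groupBy('Deal_ID', 'Title', 'Customer')
        .agg(F.sum('In_Progress').alias('In_Progress'),
             F.sum('Deal_Total').alias('Deal_Total')))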

You have a SQL tag, so here is how it would work there:
select
deal_id,
title,
customer,
sum(in_progress) as in_progress,
sum(deal_total) as deal_total
from <table_name>
group by 1,2,3
Otherwise you can use the same group-by function in Python and apply it to your dataframe:
You have to pass in the columns that you need to aggregate by as a list,
then you need to specify the aggregation type and the column you want to add up:
df = df.groupBy(['deal_id', 'title', 'customer']).agg({'in_progress': 'sum', 'deal_total': 'sum'})
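Note the line above is PySpark's groupBy. If you end up doing this in plain pandas instead, a minimal sketch of the equivalent (assuming the same lowercase column names):

# plain-pandas equivalent (the method is groupby, lowercase);
# as_index=False keeps the grouping keys as regular columns
df = df.groupby(['deal_id', 'title', 'customer'], as_index=False).agg(
    {'in_progress': 'sum', 'deal_total': 'sum'})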

Related

Iterate over certain columns with unique values and generate plots python

I'm new to pandas and any help would be appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values, while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The above is the output of unique_vals = df.nunique()
Apologies if this is a repeat question; the closest answer I could find was Iterate through columns to generate separate plots in python, but it pertained to the entire data set.
Thanks!
You can filter the columns using df.columns[unique_vals < 10]
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[df.nunique() < 10]
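From there, a minimal sketch of the loop you asked for (assuming matplotlib and bar charts of value counts, since the low-cardinality columns are categorical):

import matplotlib.pyplot as plt

unique_vals = df.nunique()
for col in df.columns[unique_vals < 10]:
    # bar chart of how often each category appears in this column
    df[col].value_counts().plot(kind='bar', title=col)
    plt.tight_layout()
    plt.show()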

How to drop rows in a pandas dataframe when there are similar values?

I have a pandas dataframe of stock data, and I'm trying to filter some of those tickers.
There are companies that have 2 or more tickers (different types of shares, where one share class is preferred and the other is not).
I want to drop the rows for those additional share types and keep only the share with the higher volume. The dataframe also has the company name, so maybe there is a way of using it as a condition to compare volumes within the same company and then drop rows? How can I do this?
Use groupby and idxmax:
Suppose this dataframe:
>>> df
ticker volume
0 CEBR3 123
1 CEBR5 456
2 CEBR6 789 # <- keep for group CEBR
3 GOAU3 23 # <- keep for group GOAU
4 GOAU4 12
5 CMIN3 135 # <- keep for group CMIN
>>> df.loc[df.groupby(df['ticker'].str.extract(r'^(.*)\d', expand=False),
sort=False)['volume'].idxmax().tolist()]
ticker volume
2 CEBR6 789
3 GOAU3 23
5 CMIN3 135
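An equivalent approach, if you find it more readable, is to sort by volume and keep the first row per company prefix (a sketch using the same regex; the helper column name company is arbitrary):

out = (df.assign(company=df['ticker'].str.extract(r'^(.*)\d', expand=False))
         .sort_values('volume', ascending=False)
         .drop_duplicates('company')   # keeps the highest-volume row per prefix
         .sort_index()                 # restore the original row order
         .drop(columns='company'))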

dataframe count frequency of a string in a column

I have a CSV that contains 1000 rows, and my Python code returns a new dataframe with 3 columns:
noOfPeople, Description, and Location
My final df will be like this one:
id  companyName  noOfPeople  Description  Location
1   comp1        75          tech         USA
2   comp2        22          fashion      USA
3   comp3        70          tech         USA
I want to write code that stops once I have 200 rows where noOfPeople is greater than or equal to 70, and returns all the remaining rows empty. So the code will count rows where noOfPeople >= 70; once 200 rows meet this condition, the code will stop.
Can someone help?
df[df['noOfPeople'] >= 70].iloc[:200]
Use head or iloc to select the first 200 values and then get the max:
print (df1['noOfPeople'].iloc[:200].max())
And add whatever filter you need.
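If the intent is literally to cut off once the 200th qualifying row has been seen, rather than just take the first 200 matches, a cumulative-count sketch (assuming the column name from your sample):

# running count of rows that meet the condition
qualifying = df['noOfPeople'] >= 70
# keep qualifying rows up to and including the 200th one
result = df[qualifying & (qualifying.cumsum() <= 200)]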

Plot multiple rows of dataframe in pandas for specific columns

df
SKU  Comp  Brand  Jan_Sales  Feb_Sales  Mar_sales  Apr_sales  ...  Dec_sales
A    AC    BA     122        100        50         200             300
B    BC    BB     100        50         80         90              250
C    CC    BC     40         30         100        10              11
and so on
Now I want a graph that plots Jan sales, Feb sales and so on through Dec as one line for SKU A, another line on the same graph for SKU B, and likewise for SKU C.
I read a few answers which say that I need to transpose my data, something like below:
df.T.plot()
However, my first column is SKU, and I want to plot based on that. The rest of the columns are numeric. So each line should be labelled with the SKU name, and the plotting should be row-wise.
EDIT (added after receiving some answers, as I am facing this issue in a few other datasets):
Let's say I don't want columns such as Comp and Brand; then what do I do?
Use DataFrame.set_index to convert SKU to the index and then transpose:
df.set_index('SKU').T.plot()
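For your EDIT (excluding columns such as Comp and Brand from the plot), drop them before transposing; a sketch assuming those column names from your sample:

import matplotlib.pyplot as plt

# drop the non-sales columns (names assumed from the sample) before transposing
ax = df.set_index('SKU').drop(columns=['Comp', 'Brand']).T.plot()
ax.set_ylabel('Sales')
plt.show()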

Is there a way to join 2 dataframes using another reference table in python

I have 2 dataframes created from CSV files, and there is another dataframe which is a reference for these tables. For example:
1 Employee demographic (Emp_id, dept_id)
2 Employee detail (Emp_id, RM_ID)
I have a 3rd dataframe (dept_manager) which has only 2 columns (dept_id, RM_id). Now I need to join tables 1 and 2 using the 3rd dataframe as the reference.
I'm trying this out in pandas (Python); any help here would be much appreciated. Thanks in advance.
Table1
Empdemogr
Empid  dept_id
1      10
2      20
1      30
Table2
Empdetail
Empid  RM_id
1      E120
2      E140
3      E130
Table3
dept_manager
dept_id  RM_id
10       E110
10       E120
10       E121
10       E122
10       E123
20       E140
20       E141
20       E142
30       E130
30       E131
30       E132
Output:
Emp_id  dept_id  RM_id
1       10       E120
2       20       E140
1       30       E130
So I am trying to bring this SQL into Python:
select a.Emp_id, a.dept_id, b.RM_id
from Empdemogr a, Empdetail b, dept_manager d
where
a.emp_id = b.emp_id
and a.dept_id = d.dept_id
and b.RM_id = d.RM_id
I'm trying to figure out whether you have a typo or a misunderstanding: the SQL above would not output the result you are looking for based on the provided data. I do not think you will see dept_id '30' in there, since Empid 1's RM_id (E120) is not listed under dept 30 in dept_manager.
But going by your SQL query, here is how you can write the same with pandas dataframes:
Preparing DataFrames (I will leave it up to you how you load the dataframes):
import pandas as pd
EmployeeDemo=pd.read_csv(r"YourEmployeeDemoFile.txt")
EmpDetail=pd.read_csv(r"YourEmpDetailFile.txt")
Dept_Manager=pd.read_csv(r"YourDept_Manager.txt")
Code to Join the DataFrames:
joined_dataframe = pd.merge(pd.merge(EmployeeDemo, EmpDetail, on="Empid"), Dept_Manager, on=["dept_id", "RM_id"])
print(joined_dataframe)
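To verify against the sample tables in the question, here is a self-contained sketch that rebuilds them inline instead of reading CSV files:

import pandas as pd

# rebuild the sample tables from the question inline
Empdemogr = pd.DataFrame({'Empid': [1, 2, 1], 'dept_id': [10, 20, 30]})
Empdetail = pd.DataFrame({'Empid': [1, 2, 3], 'RM_id': ['E120', 'E140', 'E130']})
dept_manager = pd.DataFrame({
    'dept_id': [10, 10, 10, 10, 10, 20, 20, 20, 30, 30, 30],
    'RM_id': ['E110', 'E120', 'E121', 'E122', 'E123',
              'E140', 'E141', 'E142', 'E130', 'E131', 'E132'],
})

joined = pd.merge(pd.merge(Empdemogr, Empdetail, on='Empid'),
                  dept_manager, on=['dept_id', 'RM_id'])
print(joined)
# note: the (1, 30) row drops out, because E120 is not a manager of dept 30
# in the reference table -- matching the caveat above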
