Pandas drop duplicates is not working as expected - python

I am building multiple dataframes from a SQL query that contains a lot of left joins, which produces a bunch of duplicate values. I am familiar with pd.drop_duplicates(), as I use it regularly in my other scripts; however, I can't get this particular one to work.
I am trying to drop_duplicates on a subset of 2 columns. Here is my code:
import pandas as pd

df = pd.read_sql("query")

# Build a sequential index column
index = []
for i in range(len(df)):
    index.append(i)
df['index'] = index
df.set_index([df['index']])  # note: not assigned back (and no inplace=True), so this has no effect

# Collapse substance_use_name into one pipe-delimited string per group
df2 = df.groupby(['SSN', 'client_name', 'Evaluation_Date']).substance_use_name.agg(' | '.join).reset_index()
df2.shape  # (182, 4)

df3 = pd.concat([df, df2], axis=1, join='outer').drop_duplicates(keep=False)
df3.drop_duplicates(subset=['client_name', 'Evaluation_Date'], keep='first', inplace=True)
df3 comes back with 791 rows of data (the exact number of rows my original query returns). After the drop_duplicates call I expected to be left with only 190 rows; however, it only drops down to 301 rows. When I do df3.to_excel(r'file_path.xlsx') and remove duplicates manually in Excel on the same subset, it works just fine and gives me the 190 rows I expect. I'm not sure why.
I noticed in other similar questions on this topic that pandas cannot drop duplicates if a date field has dtype 'object' and that it must be converted to datetime; however, my date field is already a datetime.
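For what it's worth, checks along these lines can confirm whether the remaining "duplicates" really are identical on the subset columns; hidden whitespace or dtype differences are a common culprit. This is only a sketch against the df3 above, and the str.strip() line assumes client_name is a string column:
# Are the subset columns the dtypes you expect?
print(df3[['client_name', 'Evaluation_Date']].dtypes)

# Invisible differences (trailing spaces, case) make rows technically unique
df3['client_name'] = df3['client_name'].str.strip()

# How many rows does pandas itself consider duplicated on this subset?
print(df3.duplicated(subset=['client_name', 'Evaluation_Date']).sum())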
My dataframe looks like this:
| ID | substance1 | substance2 | substance3 | substance4 |
|----|------------|------------|------------|------------|
| 01 | drug       | null       | null       | null       |
| 01 | null       | drug       | null       | null       |
| 01 | null       | null       | drug       | null       |
| 01 | null       | null       | null       | drug       |
| 02 | drug       | null       | null       | null       |
and so on. I want to merge the rows into a single row per ID so it looks like this:
| ID | substance1 | substance2 | substance3 | substance4 |
|----|------------|------------|------------|------------|
| 01 | drug       | drug       | drug       | drug       |
| 02 | drug       | drug       | drug       | drug       |
and so on. Does that make better sense?
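(For illustration only, one way such a collapse could look, assuming the blanks really are NaN values; groupby().first() takes the first non-null value per column in each group. This is a sketch of the desired merge, not necessarily the fix for the drop_duplicates issue above.)
# Collapse the per-substance rows into a single row per ID
df_merged = df.groupby('ID', as_index=False).first()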
Would anyone be able to help me with this?
Thanks!

Related

Create new columns and calculate values based on condition with date in Python

I need to create new Billing and Non-Billing columns based on the date columns.
Condition for column 1: if the Start Date is null or blank, or the Start Date is a future date, or the Start Date is a past date, or the End Date is a past date, then I should create the new column as Non-Billing.
Condition for column 2: if the Start Date is the current date, then I need to create the new column as 'Billable' and calculate it. The calculation should be along the row axis.
Calculation for Billing in a row: Billing = df[Billing] * sum / 168 * 100
Calculation for Non-Billing in a row: Non-Billing = df[Non-Billing] * sum / 168 * 100
Data:
| Employee Name | Java | Python | .NET | React | Start Date | End Date   |
|---------------|------|--------|------|-------|------------|------------|
| Anu           | 10   | 10     | 5    | 5     | 04-21-2021 |            |
| Kalai         |      | 10     |      | 5     | 04-21-2021 | 10-31-2021 |
| Smirthi       |      | 10     | 20   |       | 03-21-2021 |            |
| Madhu         | 20   | 10     | 10   |       | 01-12-2021 |            |
| Latha         | 40   |        | 5    |       |            |            |
Code:
# Adding new columns
total = df.sum(axis=1)
df.insert(len(df.columns), column='Total', value=total)

# Adding Utilization column
utilization = total / 168
df.insert(len(df.columns), column='Utilization', value=utilization)

# Filter dataframe using groupby
df1 = df.groupby(['Employee Name']).sum(min_count=1)
df1['Available'] = 168
I don't understand the conditions very well, as there seem to be some inconsistencies, but I believe this will help you get started:
import pandas as pd
import numpy as np
import datetime
df['Total'] = df.sum(axis=1)
df['Available'] = 168
df['Amount'] = df['Total'] / df['Available'] * 100

# Start with empty Billing/NonBilling columns, then fill one or the other per row
df['Billing'] = np.nan
df['NonBilling'] = np.nan
df.loc[df['Start Date'] == datetime.date.today(), 'Billing'] = df['Amount']
df.loc[df['Start Date'] != datetime.date.today(), 'NonBilling'] = df['Amount']
NOTES:
Make sure about the date type when comparing against today's date; if your dates are being loaded as objects, you may want to do something like this after loading:
df['Start Date'] = pd.to_datetime(df['Start Date']).dt.date
Work out the conditions for Billing/NonBilling to make sure the columns are being populated as intended.
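If it helps to keep all of the conditions in one place once they are pinned down, a rough sketch might look like this (assuming the Amount column from the code above and datetime-parsable Start Date/End Date columns; the exact business rules still need clarifying):
import numpy as np
import pandas as pd

today = pd.Timestamp.today().normalize()

# Keep the date columns as datetime64 so NaT-aware comparisons behave
start = pd.to_datetime(df['Start Date'], errors='coerce')
end = pd.to_datetime(df['End Date'], errors='coerce')

non_billing = start.isna() | (start > today) | (end < today)
billing = start == today

df['Billing'] = np.where(billing, df['Amount'], np.nan)
df['NonBilling'] = np.where(non_billing, df['Amount'], np.nan)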

How to change the size and distribution of a PySpark Dataframe according to the values of its rows & columns?

I have a large PySpark DataFrame that I would like to manipulate as in the example below. I think it is easier to visualise it than to describe it. Hence, for illustrative purposes, let us take a simple DataFrame df:
df.show()
+----------+-----------+-----------+
| series | timestamp | value |
+----------+-----------+-----------+
| ID1 | t1 | value1_1 |
| ID1 | t2 | value2_1 |
| ID1 | t3 | value3_1 |
| ID2 | t1 | value1_2 |
| ID2 | t2 | value2_2 |
| ID2 | t3 | value3_2 |
| ID3 | t1 | value1_3 |
| ID3 | t2 | value2_3 |
| ID3 | t3 | value3_3 |
+----------+-----------+-----------+
In the above DataFrame, each of the three unique values contained in column series (i.e. ID1, ID2 and ID3) has corresponding values (under column value) occurring simultaneously at the same time (i.e. the same entries in column timestamp).
From this DataFrame, I would like a transformation which ends up with the following DataFrame, named, say, result. As can be seen, the size of the DataFrame has changed and even the columns have been renamed according to entries of the original DataFrame.
result.show()
+-----------+-----------+-----------+-----------+
| timestamp | ID1 | ID2 | ID3 |
+-----------+-----------+-----------+-----------+
| t1 | value1_1 | value1_2 | value1_3 |
| t2 | value2_1 | value2_2 | value2_3 |
| t3 | value3_1 | value3_2 | value3_3 |
+-----------+-----------+-----------+-----------+
The order of the columns in result is arbitrary and should not affect the final answer. This illustrative example only contains three unique values in series (i.e. ID1, ID2 and ID3). Ideally, I would like to write a piece of code which automatically detects the unique values in series and generates a corresponding new column for each of them. Does anyone know where I could start? I have tried grouping by timestamp and then collecting a set of distinct series and value entries using the aggregate function collect_set, but with no luck.
Many thanks in advance!
Marioanzas
Just a simple pivot:
import pyspark.sql.functions as F
result = df.groupBy('timestamp').pivot('series').agg(F.first('value'))
Make sure that each (timestamp, series) combination appears in only one row of df; otherwise F.first will silently pick an arbitrary value from among the duplicates.
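If in doubt, a quick check like the following (a sketch) shows whether any (timestamp, series) key occurs more than once:
import pyspark.sql.functions as F

# Any count above 1 means F.first would be choosing among duplicates
df.groupBy('timestamp', 'series').count().filter(F.col('count') > 1).show()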
Extending on mck's answer, I have found a way of improving the pivot performance. pivot is a very expensive operation; hence, for Spark 2.0 onwards, it is recommended to provide the column data (if known) as an argument to the function, as shown in the code below. This will improve performance for DataFrames much larger than the illustrative one posed in this question. Given that the values of series are known beforehand, we can use:
import pyspark.sql.functions as F
series_list = ('ID1', 'ID2', 'ID3')
result = df.groupBy('timestamp').pivot('series', series_list).agg(F.first('value'))
result.show()
+---------+--------+--------+--------+
|timestamp| ID1| ID2| ID3|
+---------+--------+--------+--------+
| t1|value1_1|value1_2|value1_3|
| t2|value2_1|value2_2|value2_3|
| t3|value3_1|value3_2|value3_3|
+---------+--------+--------+--------+

How can I check for matching values in a second dataframe, then return a value from a column in that second dataframe?

I have two dataframes. One contains a list of the most recent meeting for each customer. The second is a list of statuses that each customer has been recorded with, and their start date and end date.
I want to look up a customer and meeting date, and find out what status they were at when the meeting occurred.
What I think this will involve is creating a new column in my meeting dataframe that checks the rows of the statuses dataframe for a matching customer ID, then checks if the date from the first dataframe is between two dates in the second. If it is, the calculated column will take its value from the second dataframe's status column.
My dataframes are:
meeting
| CustomerID | MeetingDate |
|------------|-------------|
| 70704 | 2019-07-23 |
| 70916 | 2019-09-04 |
| 72712 | 2019-04-16 |
statuses
| CustomerID | Status | StartDate | EndDate |
|------------|--------|------------|------------|
| 70704 | First | 2019-04-01 | 2019-06-30 |
| 70704 | Second | 2019-07-01 | 2019-08-25 |
| 70916 | First | 2019-09-01 | 2019-10-13 |
| 72712 | First | 2019-03-15 | 2019-05-02 |
So, I think I want to take meeting.CustomerID and find a match in statuses.CustomerID. I then want to check whether meeting.MeetingDate is between statuses.StartDate and statuses.EndDate. If it is, I want to return statuses.Status from the matching row; if not, ignore that row and move on to the next one to see whether it matches the criteria and return the Status as described.
The final result should look like:
| CustomerID | MeetingDate | Status |
|------------|-------------|--------|
| 70704 | 2019-07-23 | Second |
| 70916 | 2019-09-04 | First |
| 72712 | 2019-04-16 | First |
I'm certain there must be a neater and more streamlined way to do this than what I've suggested, but I'm still learning the ins and outs of python and pandas and would appreciate if someone could point me in the right direction.
This should work. If the columns are not already sorted by CustomerID and Status, that can easily be done first. This assumes your dates are already a datetime type. Here, df2 refers to the dataframe whose columns are CustomerID, Status, StartDate, and EndDate.
import numpy as np
import pandas as pd

# Reverse the rows so the last-listed status for each customer comes first
df2 = df2[::-1]
# Index of the first occurrence of each CustomerID in the reversed frame
row_arr = np.unique(df2.CustomerID, return_index=True)[1]
# Keep that one row per customer and drop the date columns before merging
df2 = df2.iloc[row_arr, :].drop(['StartDate', 'EndDate'], axis=1)
# df1 is the meetings dataframe
final = pd.merge(df1, df2, how='inner', on='CustomerID')
I managed to wrangle something that works for me:
df = statuses.merge(meetings, on='CustomerID')
df = df[(df['MeetingDate'] >= df['StartDate']) & (df['MeetingDate'] <= df['EndDate'])].reset_index(drop=True)
Gives:
| CustomerID | Status | StartDate | EndDate | MeetingDate |
|------------|--------|------------|------------|-------------|
| 70704 | Second | 2019-01-21 | 2019-07-28 | 2019-07-23 |
| 70916 | First | 2019-09-04 | 2019-10-21 | 2019-09-04 |
| 72712 | First | 2019-03-19 | 2019-04-17 | 2019-04-16 |
And I can just drop the now unneeded columns.
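For reference, dropping those columns afterwards is a one-liner (a sketch):
# Keep only the columns needed in the final result
df = df.drop(columns=['StartDate', 'EndDate'])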

Creating new column from API lookup using groupby

I have a dataframe of weather data that looks like this:
+----+------------+----------+-----------+
| ID | Station_ID | Latitude | Longitude |
+----+------------+----------+-----------+
| 0 | 6010400 | 52.93 | -82.43 |
| 1 | 6010400 | 52.93 | -82.43 |
| 2 | 6010400 | 52.93 | -82.43 |
| 3 | 616I001 | 45.07 | -77.88 |
| 4 | 616I001 | 45.07 | -77.88 |
| 5 | 616I001 | 45.07 | -77.88 |
+----+------------+----------+-----------+
I want to create a new column called postal_code using an API lookup based on the latitude and longitude values. I cannot perform a lookup for each row in the dataframe: that would be inefficient, since there are over 500,000 rows but only 186 unique Station_IDs, and it is also infeasible due to rate limiting on the API I need to use.
I believe I need to perform a groupby transform but can't quite figure out how to get it to work correctly.
Any help with this would be greatly appreciated.
I believe you can use groupby only for aggregations, which is not what you want.
First, combine 'Latitude' and 'Longitude'. This gives a new column of tuples:
df['coordinates'] = list(zip(df['Latitude'],df['Longitude']))
Then you can use this 'coordinates' column to build the set of unique (Latitude, Longitude) values, so there are no duplicates:
unique_coordinates = set(df['coordinates'])
Then fetch the postal codes for these coordinates using API calls, as you said, and store them in a dict. You can then use that dict to populate the postal code for each row:
postal_code_dict = {'key': 'value'}  # sample dictionary mapping a coordinate tuple to its postal code
df['postal_code'] = df['coordinates'].apply(lambda x: postal_code_dict[x])
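Putting those steps together, a minimal sketch could look like this (get_postal_code is a hypothetical placeholder for whatever API client ends up being used):
import pandas as pd

def get_postal_code(lat, lon):
    """Hypothetical wrapper around the geocoding API; returns a postal code."""
    ...

df['coordinates'] = list(zip(df['Latitude'], df['Longitude']))

# One API call per unique coordinate pair (186 here), not per row (500,000+)
postal_code_dict = {
    coords: get_postal_code(*coords)
    for coords in set(df['coordinates'])
}

df['postal_code'] = df['coordinates'].map(postal_code_dict)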
Hope this helps.

Why is a PySpark joined column turning into null values?

I'm trying to join two dataframes, but the values from the second one keep turning into nulls:
joint = sdf.join(k, "date", how='left').select(sdf.date, sdf.Res, sdf.Ind, k.gen.cast(IntegerType())).orderBy('date')
output: | 1/1/2001 | 4103 | 9223 | null |
