Compare data series columns based on a column that contains duplicates - python

I have a dataset that I've created by merging 2 DataFrames together on the "NAME" column, so now I have a larger dataset. To finish the DF, I want to apply some logic to clean it up.
Requirements:
I want to select each unique 'NAME' and keep the row with the highest Sales value. If, after going through the Sales column, all of that name's rows are less than 10, then move to the Calls column and keep the row with the highest Calls value; and if all of the Calls are also less than 10, then move to the Target column and keep the row with the highest Target. No rows are summed.
Here's my DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 84 170 265
1 OFFICE 1 2222278 26 103 287
2 OFFICE 1 2222278 97 167 288
3 OFFICE 2 2222289 7 167 288
4 OFFICE 2 2222289 3 130 295
5 OFFICE 2 2222289 9 195 257
6 OFFICE 3 1111111 1 2 286
7 OFFICE 3 1111111 5 2 287
8 OFFICE 3 1111112 9 7 230
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
Here's what I want to show in the final DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
I was thinking of solving this by using df.iterrows().
Here's what I've tried:
for n, v in df.iterrows():
    if int(v['Sales']) > 10:
        calls = df.loc[(v['NAME'] == v) & (int(v['Calls'].max()))]
        if int(calls['Calls']) > 10:
            target = df.loc[(v['NAME'] == v) & (int(v['Target'].max()))]
        else:
            print("No match found")
    else:
        sales = df.loc[(v['NAME'] == v) & (int(v['Sales'].max()))]
However, I keep getting KeyError: False error messages. Any thoughts on what I'm doing wrong?

This is not optimized, but it should meet your needs. The code snippet sends each NAME group to eval_group(), which checks the idxmax() row for each column in turn until the Sales, Calls, Target criterion is met.
If you were to optimize, you could apply vectorization or parallelism principles to eval_group so it is evaluated against all groups at once instead of sequentially (see the groupby sketch after the output below).
A couple of notes: this will return the first row when there is a tie (i.e. multiple records share the same maximum during the idxmax() call). Also, I believe the first row of the desired answer in your question should have OFFICE 1 at row 2, not row 0.
df = pd.read_csv('./data.txt')

def eval_group(df, keys):
    for key in keys:
        row_id = df[key].idxmax()
        if df.loc[row_id][key] >= 10 or key == keys[-1]:
            return row_id

row_ids = []
keys = ['Sales', 'Calls', 'Target']
for name in df['NAME'].unique().tolist():
    condition = df['NAME'] == name
    row_ids.append(eval_group(df[condition], keys))

df = df[df.index.isin(row_ids)]
df
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
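As a rough illustration of the optimization note above, the same selection rule can be expressed with groupby().apply() so pandas drives the per-group loop; this is only a sketch, pick_row_id is a name introduced here, and df is assumed to be the merged frame from the question.

import pandas as pd

# a minimal sketch, assuming df is the merged frame shown in the question
def pick_row_id(group, keys=('Sales', 'Calls', 'Target')):
    # walk the columns in priority order; fall back to the last one unconditionally
    for key in keys:
        row_id = group[key].idxmax()
        if group.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

row_ids = df.groupby('NAME', sort=False).apply(pick_row_id)
result = df.loc[row_ids]

This keeps the column-priority logic in one place while letting pandas handle the grouping; ties still resolve to the first maximal row, as in the answer above.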

This takes a couple of steps, where you have to build intermediate dataframes, do a conditional, and filter based on the result of the conditions:
temp = (df
        .drop(columns='CUSTOMER_SUPPLIER_NUMBER')
        .groupby('NAME', sort=False)
        .idxmax()
        )
# get the booleans for rows less than 10
bools = df.loc(axis=1)['Sales':'Target'].lt(10)
# groupby for each NAME
bools = bools.groupby(df.NAME, sort=False).all()
# conditions buildup
condlist = [~bools.Sales, ~bools.Calls, ~bools.Target]
choicelist = [temp.Sales, temp.Calls, temp.Target]
# you might have to figure out what to use for default
indices = np.select(condlist, choicelist, default=temp.Sales)
# get matching rows
df.loc[indices]
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
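One possible choice for the default flagged in the comment above: since Target is the unconditional final fallback in the stated rule, its per-group idxmax can serve as the default. This is a sketch continuing from the snippet above, not the only valid choice.

# continuing from the snippet above: fall back to the Target row when
# every Sales, Calls, and Target value in a group is below 10
indices = np.select(condlist, choicelist, default=temp.Target)
df.loc[indices]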

Related

Replace blank value in dataframe based on another column condition

I have many blanks in a merged data set and I want to fill them with a condition.
My current code looks like this
import pandas as pd
import csv
import numpy as np
pd.set_option('display.max_columns', 500)
# Read all files into pandas dataframes
Jan = pd.read_csv(r'C:\~\Documents\Jan.csv')
Feb = pd.read_csv(r'C:\~\Documents\Feb.csv')
Mar = pd.read_csv(r'C:\~\Documents\Mar.csv')
Jan=pd.DataFrame({'Department':['52','5','56','70','7'],'Item':['2515','254','818','','']})
Feb=pd.DataFrame({'Department':['52','56','765','7','40'],'Item':['2515','818','524','','']})
Mar=pd.DataFrame({'Department':['7','70','5','8','52'],'Item':['45','','818','','']})
all_df_list = [Jan, Feb, Mar]
appended_df = pd.concat(all_df_list)
df = appended_df
df.to_csv(r"C:\~\Documents\SallesDS.csv", index=False)
Data set:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7
40
7 45
70
5 818
8
52
What I want is to fill the empty cells in Item with the corresponding value from the Department column.
So if Department is 52 and Item is empty, it should be filled with 2515;
if Department is 7 and Item is empty, fill it with 45;
and the result should look like this:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7 45
40
7 45
70
5 818
8
52 2515
I tried the following methods but none of them worked.
1
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(52)), 'Item'] = 2515
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(7)), 'Item'] = 45
2
df["Item"] = df["Item"].fillna(df["Department"])
df = df.replace({"Item":{"52":"2515", "7":"45"}})
Both either return an error or do not work.
Answer:
Hi, I have used the below code and it worked:
b = [52]
df.Item = np.where(df.Department.isin(b), df.Item.fillna(2515), df.Item)
a = [7]
df.Item = np.where(df.Department.isin(a), df.Item.fillna(45), df.Item)
Hope it helps someone who faces the same issue.
The following solution first creates a map of each department and its maximum corresponding item (assuming there is one), and then matches that item to any department with a blank item. Note that in your data frame, the empty items are empty strings ("") rather than NaN.
Create a map:
values = df.groupby('Department').max()
values['Item'] = values['Item'].apply(lambda x: np.nan if x == "" else x)
values = values.dropna().reset_index()
Department Item
0 5 818
1 52 2515
2 56 818
3 7 45
4 765 524
Then use df.apply():
df['Item'] = df.apply(lambda x: values[values['Department'] == x['Department']]['Item'].values if x['Item'] == "" else x['Item'], axis=1)
In this case, the new values will have brackets around them. They can be removed with str.replace():
df['Item'] = df['Item'].astype(str).str.replace(r'\[|\'|\'|\]', "", regex=True)
The result:
Department Item
0 52 2515
1 5 254
2 56 818
3 70
4 7 45
0 52 2515
1 56 818
2 765 524
3 7 45
4 40
0 7 45
1 70
2 5 818
3 8
4 52 2515
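For comparison, roughly the same idea can be written with a plain dictionary lookup instead of a row-wise apply. This is only a sketch: it assumes the blanks are empty strings as in the sample data, and mapping is a name introduced here for illustration.

import numpy as np

# build a Department -> Item map from the rows that already have an Item,
# then fill only the blank cells from that map
mapping = (df.replace({'Item': {'': np.nan}})
             .dropna(subset=['Item'])
             .groupby('Department')['Item'].max()
             .to_dict())
df['Item'] = np.where(df['Item'].eq(''),
                      df['Department'].map(mapping).fillna(''),
                      df['Item'])

Like the groupby('Department').max() map above, this assumes one representative Item per Department is acceptable; departments with no known Item stay blank.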

compare 2 dataframes simultaneously - 2 itertuples?

I'm comparing 2 dataframes and I'd like to see if the name matches on the address, and if so pull the unique ID; otherwise, continue on and search for the best match. (I'm using fuzzy matcher for that part.)
I was exploring itertools and wondered if the itertools.zip_longest option would let me compare 2 items together simultaneously, rather than using 2 for loops (e.g. for x in df1.itertuples(): do something... for y in df2.itertuples(): do something). Would something like this work?
result = itertools.zip_longest(df1.itertuples(), df2.itertuples())
Here's my 2 dataframes -
Here's my DF1:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 123 road 2222277 84 170 265
1 OFFICE 2 15 lane 2222289 7 167 288
2 OFFICE 3 3 highway 1111111 1 2 286
3 OFFICE 4 2 street 1111171 95 193 299
4 OFFICE 5 1 place 1111191 9 193 298
DF2:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER UNIQUE ID
0 OFFICE 1 123 road 2222277 014168
1 OFFICE 2 15 lane 2222289 131989
2 OFFICE 3 3 highway 1111111 149863
3 OFFICE 4 2 street 1111171 198664
4 OFFICE 5 1 place 1111191 198499
5 OFFICE 6 zzzz rd 165198 198791
6 OFFICE 7 5z st 19844 298791
7 OFFICE 8 34 hwy 981818 398791
8 OFFICE 9 81290 rd 899811 498791
9 OFFICE 10 59 rd 699161 598791
10 OFFICE 11 5141 bldvd 33211 698791
Then perform a for loop and do some comparison if statements. I can access both items side by side, but how would I then loop over the items to do the check?
Right now I'm getting: TypeError: 'NoneType' object is not subscriptable
for yy in result:
    if yy[0][1] == yy[1][1]:
        print(yy) ...
If your headers are the same in both DataFrames, just apply merge:
dfmerge = pd.merge(df1, df2)
The output should be the rows common to both frames, with the UNIQUE ID column attached to each match.
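If the column names don't line up exactly, or you want to see which rows failed to match (the candidates for the fuzzy step), a hedged variant is to name the keys explicitly and keep a merge indicator; the column names below are taken from the sample frames in the question.

import pandas as pd

# merge on the shared key columns explicitly; indicator=True marks rows
# from df1 that found no exact match in df2
merged = pd.merge(
    df1, df2,
    on=['NAME', 'ADDRESS', 'CUSTOMER_SUPPLIER_NUMBER'],
    how='left',
    indicator=True,
)
unmatched = merged[merged['_merge'] == 'left_only']

The unmatched rows are the ones you would then hand to the fuzzy matcher.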

Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient.
Issue I'm Having: My code works on a small dataset, but the actual data set is about 5GB and 13M rows. The code has been running for several days now and still hasn't finished. For background, my code is in a Jupyter notebook running on a standard work PC.
Sample Data
import pandas as pd
df = pd.DataFrame({"PatientAccountNumber":[113,113,113,113,225,225,225,225,225,225,225],
"TransactionCode":['50','50','77','60','22','77','25','77','25','77','77'],
"Bucket":['Charity','Charity','Bad Debt','3rd Party','Self Pay','Bad Debt',
'Charity','Bad Debt','Charity','Bad Debt','Bad Debt']})
print(df)
Sample Dataframe
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
7 225 77 Bad Debt
8 225 25 Charity
9 225 77 Bad Debt
10 225 77 Bad Debt
Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    df.drop(df[mask].index[1:], inplace=True)
print(df)
Desired Result (Each patient should have a maximum of one Bad Debt transaction)
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Alternate Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    mask = mask & (mask.cumsum() > 1)
    df.loc[mask, 'Bucket'] = 'DELETE'
df = df[df['Bucket'] != 'DELETE']
Attempted using Dask
I thought maybe Dask would be able to help me out, but I got the following error codes:
Using Dask on first solution - "NotImplementedError: Series getitem in only supported for other series objects with matching partition structure"
Using Dask on second solution - "TypeError: '_LocIndexer' object does not support item assignment"
You can solve this using df.duplicated on both PatientAccountNumber and Bucket and checking if Bucket is Bad Debt:
df[~(df.duplicated(['PatientAccountNumber','Bucket']) & df['Bucket'].eq("Bad Debt"))]
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Create a boolean mask without loop:
mask = df[df['Bucket'].eq('Bad Debt')].duplicated('PatientAccountNumber')
df.drop(mask[mask].index, inplace=True)
>>> df
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
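Another loop-free variant, sketched here against the same sample data, keeps only the first Bad Debt row per patient by counting them within each group:

# a minimal sketch: number the Bad Debt rows within each patient and keep
# only the first one; non-Bad-Debt rows are always kept
bad = df['Bucket'].eq('Bad Debt')
nth_bad = bad.groupby(df['PatientAccountNumber']).cumsum()
result = df[~bad | nth_bad.eq(1)]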

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a ValueError: Can only compare identically-labeled Series objects error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output - grower_moo['Fat'] of 13.60 is less than 14 Fat, therefore gets a price per ton of $430
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores (_) instead of spaces.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
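Because merge_asof requires the left frame to be sorted by the key, the result above comes out in Fat order. If you need the original load order back, one possible approach (a sketch that reruns the flow from the original, unsorted grower_moo) is to remember the index before sorting and restore it afterwards:

# keep the original row order: remember the index before the required sort,
# redo the asof merge, then sort back and drop the helper columns
grower_moo = grower_moo.reset_index().sort_values('Fat')
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
res = res.sort_values('index').drop(columns=['index', 'Wet_Fat'])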
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

Iterating over pandas rows to get minimum

Here is my dataframe:
Date cell tumor_size(mm)
25/10/2015 113 51
22/10/2015 222 50
22/10/2015 883 45
20/10/2015 334 35
19/10/2015 564 47
19/10/2015 123 56
22/10/2014 345 36
13/12/2013 456 44
What I want to do is compare the size of the tumors detected on the different days. Let's consider the cell 222 as an example; I want to compare its size to different cells but detected on earlier days e.g. I will not compare its size with cell 883, because they were detected on the same day. Or I will not compare it with cell 113, because it was detected later on.
As my dataset is large, I have to iterate over the rows. If I explain it in a non-Pythonic way:
For the cell 222:
get_size_distance (absolute value):
|50 - 35| = 15, |50 - 47| = 3, |50 - 56| = 6, |50 - 36| = 14, |50 - 44| = 6
get_minimum = 3; I got this value when I compared it with 564, so I will name it as a pair for the cell 222.
Then do it for the cell 883.
The resulting output should look like this:
Date cell tumor_size(mm) pair size_difference
25/10/2015 113 51 222 1
22/10/2015 222 50 123 6
22/10/2015 883 45 456 1
20/10/2015 334 35 345 1
19/10/2015 564 47 456 3
19/10/2015 123 56 456 12
22/10/2014 345 36 456 8
13/12/2013 456 44 NaN NaN
I will really appreciate your help
It's not pretty, but I believe it does the trick
from datetime import datetime

a = pd.read_clipboard()

# Cut off last row since it was a faulty date. You can skip this.
df = a.copy().iloc[:-1]

# Convert to dates and order just in case (not really needed I guess).
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df = df.sort_values('Date', ascending=False)

# Rename column
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})

# These will be our lists of pairs and size differences.
pairs = []
diffs = []

# Loop over all unique dates
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum
    for row in df.loc[df.Date == date].itertuples():
        # If no earlier cells are available, use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Take the lowest absolute value and fill in otherwise
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])

df['pair'] = pairs
df['size_difference'] = diffs
returns:
Date cell tumor_size pair size_difference
0 2015-10-25 113 51 222.0 1.0
1 2015-10-22 222 50 564.0 3.0
2 2015-10-22 883 45 564.0 2.0
3 2015-10-20 334 35 345.0 1.0
4 2015-10-19 564 47 345.0 11.0
5 2015-10-19 123 56 345.0 20.0
6 2014-10-22 345 36 NaN NaN
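If the loop over dates is still too slow, one possible numpy alternative is to broadcast all pairwise size differences at once and mask out same-or-later dates. Note that this builds an n x n matrix, so it only pays off while that fits in memory; it assumes df already has the parsed Date, cell, and tumor_size columns from the answer above.

import numpy as np

sizes = df['tumor_size'].to_numpy(dtype=float)
dates = df['Date'].to_numpy()
cells = df['cell'].to_numpy()

# pairwise absolute size differences, one row per cell
diff = np.abs(sizes[:, None] - sizes[None, :])

# only allow comparisons against strictly earlier dates
diff[~(dates[:, None] > dates[None, :])] = np.inf

best = diff.argmin(axis=1)                 # column index of the closest earlier cell
has_match = ~np.isinf(diff.min(axis=1))    # rows with no earlier date get no pair

df['pair'] = np.where(has_match, cells[best], np.nan)
df['size_difference'] = np.where(has_match, diff.min(axis=1), np.nan)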
