Python: duplicate one row into two rows and make changes

I have this pandas DataFrame with three columns; the index is shown too.
Name Membership Specs
0 Adam NORMAL 170
1 James NORMAL 170
2 Michael ADMINCOORDINATOR 170
3 Lina NORMAL 170
4 Alexey ADMINCOORDINATOR 170
5 David NORMAL 170
I would like to duplicate the ADMINCOORDINATOR rows and then change the values to the following format:
Name Membership Specs
0 Adam NORMAL 170
1 James NORMAL 170
2 Michael ADMIN 160
3 Michael COORDINATOR 180
4 Lina NORMAL 170
5 Alexey ADMIN 160
6 Alexey COORDINATOR 180
7 David NORMAL 170
So the idea is to split each ADMINCOORDINATOR row into two rows and set the values ADMIN = 160 and COORDINATOR = 180. I would also like to keep the original ordering of the data.
Thank you

You can try:
dmap = {'ADMIN': 160, 'COORDINATOR': 180}

# map ADMIN/COORDINATOR to their new Specs; keep the original value for everything else
update_specs = lambda x: x['Membership'].map(dmap).fillna(x['Specs']).astype(int)

# findall splits 'ADMINCOORDINATOR' into ['ADMIN', 'COORDINATOR'];
# every other value matches '.+' and stays whole
out = (df.assign(Membership=df['Membership'].str.findall(r'ADMIN|COORDINATOR|.+'))
         .explode('Membership', ignore_index=True)
         .assign(Specs=update_specs))
print(out)
# Output
Name Membership Specs
0 Adam NORMAL 170
1 James NORMAL 170
2 Michael ADMIN 160
3 Michael COORDINATOR 180
4 Lina NORMAL 170
5 Alexey ADMIN 160
6 Alexey COORDINATOR 180
7 David NORMAL 170
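The regex alternation tries ADMIN and COORDINATOR before falling back to .+, so only the combined value gets split; everything else survives as a one-element list. A quick check of just that step:
import pandas as pd

print(pd.Series(['ADMINCOORDINATOR', 'NORMAL']).str.findall(r'ADMIN|COORDINATOR|.+'))
# 0    [ADMIN, COORDINATOR]
# 1                [NORMAL]
# dtype: object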

This is what you need:
df = df.assign(Membership=df['Membership'].str.findall(r'ADMIN|COORDINATOR|.+'))
df = df.explode('Membership')

def map_specs(row):
    # new Specs for the split memberships; keep the old value otherwise
    if row['Membership'] == 'ADMIN':
        return 160
    elif row['Membership'] == 'COORDINATOR':
        return 180
    return row['Specs']

df['Specs_new'] = df.apply(map_specs, axis=1)

By artificially splitting on the ADMINCOORDINATOR keyword and exploding on Membership:
import numpy as np

# insert a comma after 'ADMIN' ('ADMINCOORDINATOR' -> 'ADMIN,COORDINATOR'), then split on it
df.Membership = df.Membership.str.replace(r'(ADMIN)', r'\1,', regex=True).str.split(',')
df = df.explode('Membership')
df.Specs = np.where(df.Membership.eq('ADMIN'), 160,
                    np.where(df.Membership.eq('COORDINATOR'), 180, df.Specs))
Name Membership Specs
0 Adam NORMAL 170
1 James NORMAL 170
2 Michael ADMIN 160
2 Michael COORDINATOR 180
3 Lina NORMAL 170
4 Alexey ADMIN 160
4 Alexey COORDINATOR 180
5 David NORMAL 170
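Note that explode keeps the original index (hence the repeated 2 and 4 above). If you want the fresh 0..7 index shown in the desired output, add:
df = df.reset_index(drop=True)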

Related

Trying to access variables while scraping website; trying to get var in script

Trying to web scrape info from this website: http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed=.
For context, I'm trying to find the tyre brand (Bridgestone, Michelin), pattern (e.g. Turanza T001, Ecopia EP500), tyre size (205/55. 16 V (91), 225/50. 16 W (100) XL), seasonality if available (Summer, Winter), and price.
My measurements for the tyre are Width – 205, Aspect Ratio – 55, Rim Size - 16.
I found all the info I need in the var allTyres section of the page source. The problem is, I am struggling with how to extract the "manufacturer" (brand), "description" (the description has the pattern and size), "winter" (0 for no, 1 for yes), "summer" (same as winter) and "price".
Afterwards, I want to export the data in CSV format.
Thanks
To create a pandas DataFrame from the allTyres data you can do the following (from the DataFrame you can select the columns you want, save it to CSV, etc.):
import re
import json
import requests
import pandas as pd
url = "http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed="
# pull the JavaScript 'allTyres = [...]' assignment out of the page and parse it as JSON
data = json.loads(
    re.search(r"allTyres = (.*);", requests.get(url).text).group(1)
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data)
print(df.head())
Prints:
id ManufacturerID width profile rim speed load description part_no pattern manufacturer extra_load run_flat winter summer OEList price tyre_class rolling_resistance wet_grip Graphic noise_db noise_rating info pattern_name recommended rating
0 1881920 647 205 55 16 V 91 205/55VR16 BUDGET VR 2055516VBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
1 3901788 647 205 55 16 H 91 205/55R16 BUDGET 91H 2055516HBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
2 1881957 647 205 55 16 W 91 205/55ZR16 BUDGET ZR 2055516ZBUD Economy N N 0 1 53.54 C1 G F BUD 73 3 0 1
3 6022423 129 205 55 16 H 91 205/55R16 91H UROYAL RAINSPORT 5 2055516HUN09BGS RainSport 5 Uniroyal N N 0 1 70.46 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
4 6022424 129 205 55 16 V 91 205/55R16 91V UROYAL RAINSPORT 5 2055516VUN09BGR RainSport 5 Uniroyal N N 0 1 70.81 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
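From there, selecting the requested fields and exporting to CSV could look like this (a minimal sketch; the column names come from the printed output above, and the output filename is just an example):
cols = ["manufacturer", "description", "winter", "summer", "price"]
df[cols].to_csv("tyres.csv", index=False)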

Compare data series columns based on a column that contains duplicates

I have a dataset that I've created by merging 2 DataFrames together on the "NAME" column, so now I have a larger dataset. To finish the DF, I want to apply some logic to clean it up.
Requirements:
I want to select one row per unique 'NAME', matching each name with its highest Sales row. If, after going through the Sales column, all of that name's rows are less than 10, move to the Calls column and select the row with the highest Calls; if all of the 'CALLS' values are also less than 10, move to the Target column and select the row with the highest Target. No rows are summed.
Here's my DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 84 170 265
1 OFFICE 1 2222278 26 103 287
2 OFFICE 1 2222278 97 167 288
3 OFFICE 2 2222289 7 167 288
4 OFFICE 2 2222289 3 130 295
5 OFFICE 2 2222289 9 195 257
6 OFFICE 3 1111111 1 2 286
7 OFFICE 3 1111111 5 2 287
8 OFFICE 3 1111112 9 7 230
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
Here's what I want to show in the final DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
I was thinking of solving this by using df.iterrows()
Here's what I've tried:
for n, v in df.iterrows():
    if int(v['Sales']) > 10:
        calls = df.loc[(v['NAME'] == v) & (int(v['Calls'].max()))]
        if int(calls['Calls']) > 10:
            target = df.loc[(v['NAME'] == v) & (int(v['Target'].max()))]
        else:
            print("No match found")
    else:
        sales = df.loc[(v['NAME'] == v) & (int(v['Sales'].max()))]
However, I keep getting KeyError: False error messages. Any thoughts on what I'm doing wrong?
This is not optimized, but it should meet your needs. The code snippet sends each NAME group to eval_group(), which checks the index of the maximum for each column in turn until the Sales/Calls/Target criterion is met.
If you were to optimize, you could apply vectorization or parallelism principles to eval_group so it is called against all groups at once instead of sequentially.
A couple of notes: this will return the first row when there is a tie (i.e. multiple records share the maximum during the idxmax() call). Also, I believe the first row of the desired answer in your question should be row 2 for OFFICE 1, not row 0.
df = pd.read_csv('./data.txt')

def eval_group(df, keys):
    # walk the columns in priority order; stop at the first column whose
    # maximum meets the threshold, or fall through to the last column
    for key in keys:
        row_id = df[key].idxmax()
        if df.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

row_ids = []
keys = ['Sales', 'Calls', 'Target']
for name in df['NAME'].unique().tolist():
    condition = df['NAME'] == name
    row_ids.append(eval_group(df[condition], keys))

df = df[df.index.isin(row_ids)]
df
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
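As a rough illustration of the optimization mentioned above, the per-name filtering loop can be folded into a single groupby pass that reuses eval_group (a sketch under the same assumptions):
row_ids = [eval_group(group, keys) for _, group in df.groupby('NAME', sort=False)]
df = df[df.index.isin(row_ids)]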
This takes a couple of steps, where you have to build intermediate DataFrames, run a conditional, and filter based on the result of the conditions:
import numpy as np

# index of the per-column maximum for each NAME
temp = (df
        .drop(columns='CUSTOMER_SUPPLIER_NUMBER')
        .groupby('NAME', sort=False)
        .idxmax()
        )
# get the booleans for rows less than 10
bools = df.loc(axis=1)['Sales':'Target'].lt(10)
# groupby for each NAME
bools = bools.groupby(df.NAME, sort=False).all()
# conditions buildup
condlist = [~bools.Sales, ~bools.Calls, ~bools.Target]
choicelist = [temp.Sales, temp.Calls, temp.Target]
# you might have to figure out what to use for default
indices = np.select(condlist, choicelist, default=temp.Sales)
# get matching rows
df.loc[indices]
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298

compare 2 dataframes simultaneously - 2 itertuples?

I'm comparing 2 DataFrames and I'd like to see if the name matches on the address; if so, pull the unique ID. Otherwise, continue on and search for the best match (I'm using fuzzy matcher for that part).
I was exploring itertools and wondered if the itertools.zip_longest option would work to compare 2 items together simultaneously, rather than using 2 for loops (for example, for x in df1.itertuples(): do something... for y in df2.itertuples(): do something). Would something like this work?
result = itertools.zip_longest(df1.itertuples(), df2.itertuples())
Here are my 2 DataFrames.
DF1:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 123 road 2222277 84 170 265
1 OFFICE 2 15 lane 2222289 7 167 288
2 OFFICE 3 3 highway 1111111 1 2 286
3 OFFICE 4 2 street 1111171 95 193 299
4 OFFICE 5 1 place 1111191 9 193 298
DF2:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER UNIQUE ID
0 OFFICE 1 123 road 2222277 014168
1 OFFICE 2 15 lane 2222289 131989
2 OFFICE 3 3 highway 1111111 149863
3 OFFICE 4 2 street 1111171 198664
4 OFFICE 5 1 place 1111191 198499
5 OFFICE 6 zzzz rd 165198 198791
6 OFFICE 7 5z st 19844 298791
7 OFFICE 8 34 hwy 981818 398791
8 OFFICE 9 81290 rd 899811 498791
9 OFFICE 10 59 rd 699161 598791
10 OFFICE 11 5141 bldvd 33211 698791
Then perform a for loop and do some comparison if statements. I can access both items side by side, but how would I then loop over the items to do the check?
Right now I'm getting:
TypeError: 'NoneType' object is not subscriptable
for yy in result:
    if yy[0][1] == yy[1][1]:
        print(yy) ......
If your headers are the same in both DataFrames, just apply merge:
dfmerge = pd.merge(df1, df2)
By default, merge joins on all shared columns (here NAME, ADDRESS and CUSTOMER_SUPPLIER_NUMBER), so the output should be:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER Sales Calls Target UNIQUE ID
0 OFFICE 1 123 road 2222277 84 170 265 014168
1 OFFICE 2 15 lane 2222289 7 167 288 131989
2 OFFICE 3 3 highway 1111111 1 2 286 149863
3 OFFICE 4 2 street 1111171 95 193 299 198664
4 OFFICE 5 1 place 1111191 9 193 298 198499
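If you instead want to keep every row of df1 and see which ones found a UNIQUE ID, a left join on the shared key columns is a common variation (a sketch; the on list simply names the columns the two frames share):
dfmerge = pd.merge(df1, df2, on=['NAME', 'ADDRESS', 'CUSTOMER_SUPPLIER_NUMBER'], how='left')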

Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient.
Issue I'm Having: My code works on a small dataset, but the actual data set is about 5GB and 13M rows. The code has been running for several days now and still hasn't finished. For background, my code is in a Jupyter notebook running on a standard work PC.
Sample Data
import pandas as pd

df = pd.DataFrame({"PatientAccountNumber": [113, 113, 113, 113, 225, 225, 225, 225, 225, 225, 225],
                   "TransactionCode": ['50', '50', '77', '60', '22', '77', '25', '77', '25', '77', '77'],
                   "Bucket": ['Charity', 'Charity', 'Bad Debt', '3rd Party', 'Self Pay', 'Bad Debt',
                              'Charity', 'Bad Debt', 'Charity', 'Bad Debt', 'Bad Debt']})
print(df)
Sample Dataframe
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
7 225 77 Bad Debt
8 225 25 Charity
9 225 77 Bad Debt
10 225 77 Bad Debt
Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    df.drop(df[mask].index[1:], inplace=True)
print(df)
Desired Result (Each patient should have a maximum of one Bad Debt transaction)
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Alternate Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    mask = mask & (mask.cumsum() > 1)
    df.loc[mask, 'Bucket'] = 'DELETE'
df = df[df['Bucket'] != 'DELETE']
Attempted using Dask
I thought maybe Dask would be able to help me out, but I got the following errors:
Using Dask on first solution - "NotImplementedError: Series getitem in only supported for other series objects with matching partition structure"
Using Dask on second solution - "TypeError: '_LocIndexer' object does not support item assignment"
You can solve this using df.duplicated on both PatientAccountNumber and Bucket and checking whether Bucket is Bad Debt:
df[~(df.duplicated(['PatientAccountNumber','Bucket']) & df['Bucket'].eq("Bad Debt"))]
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Create a boolean mask without a loop:
mask = df[df['Bucket'].eq('Bad Debt')].duplicated('PatientAccountNumber')
df.drop(mask[mask].index, inplace=True)
>>> df
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
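A minimal sketch of one more loop-free equivalent, assuming the same df, using groupby().cumcount() to rank repeats within each (account, bucket) pair:
# cumcount() is 0 for the first occurrence within each (PatientAccountNumber, Bucket) pair
dup_rank = df.groupby(['PatientAccountNumber', 'Bucket']).cumcount()
df_clean = df[~(df['Bucket'].eq('Bad Debt') & dup_rank.gt(0))]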

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition; that condition needs to take input from both the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions with a custom function:
def f(x):
    # check boolean mask
    # print((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()

df['new_column'] = df.apply(f, axis=1)
print(df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare the values against all rows and count the matches by summing the True values.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()

df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
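For large frames, the row-wise apply makes n Python-level passes over the data; broadcasting the same comparison in NumPy is a common speed-up (a sketch, assuming the same df and the taller-and-lighter condition used above):
import numpy as np

h = df['height'].to_numpy()
w = df['weight'].to_numpy()
# count[i] = number of rows j with height[j] > height[i] and weight[j] < weight[i]
df['num_heigher_and_leighter'] = ((h[None, :] > h[:, None]) & (w[None, :] < w[:, None])).sum(axis=1)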
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons taller AND heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to change the & above to a |, or to switch the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)
