I want to divide a pandas dataframe by another based on the column values.
For example let's say I have:
>>> df = pd.DataFrame({'NAME': [ 'CA', 'CA', 'CA', 'AZ', 'AZ', 'AZ', 'TX', 'TX', 'TX'], 'NUM':[1, 2, 3, 1, 2, 3, 1, 2, 3], 'VALUE': [10, 20, 30, 40, 50, 60, 70, 80, 90]})
>>> df
NAME NUM VALUE
0 CA 1 10
1 CA 2 20
2 CA 3 30
3 AZ 1 40
4 AZ 2 50
5 AZ 3 60
6 TX 1 70
7 TX 2 80
8 TX 3 90
>>> states = pd.DataFrame({'NAME': ['CA', "AZ", "TX"], 'DIVISOR': [10, 5, 1]})
>>> states
NAME DIVISOR
0 CA 10
1 AZ 5
2 TX 1
For each NAME and NUM I want to divide the VALUE column in df by the DIVISOR column of the respective state.
Which would give a result of
>>> result = pd.DataFrame({'NAME': [ 'CA', 'CA', 'CA', 'AZ', 'AZ', 'AZ', 'TX', 'TX', 'TX'], 'NUM':[1, 2, 3, 1, 2, 3, 1, 2, 3], 'VALUE': [1, 2, 3, 8, 10, 12, 70, 80, 90]})
>>> result
NAME NUM VALUE
0 CA 1 1
1 CA 2 2
2 CA 3 3
3 AZ 1 8
4 AZ 2 10
5 AZ 3 12
6 TX 1 70
7 TX 2 80
8 TX 3 90
Let us use map:
df['NEW VALUE'] = df['VALUE'].div(df['NAME'].map(states.set_index('NAME')['DIVISOR']))
df
Out[129]:
NAME NUM VALUE NEW VALUE
0 CA 1 10 1.0
1 CA 2 20 2.0
2 CA 3 30 3.0
3 AZ 1 40 8.0
4 AZ 2 50 10.0
5 AZ 3 60 12.0
6 TX 1 70 70.0
7 TX 2 80 80.0
8 TX 3 90 90.0
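If you want to overwrite VALUE itself (to match the desired result frame above) rather than add a NEW VALUE column, the same map works in place; a minimal sketch:
# overwrite VALUE in place using the same NAME -> DIVISOR mapping
df['VALUE'] = df['VALUE'].div(df['NAME'].map(states.set_index('NAME')['DIVISOR']))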
You can use merge as well
result = df.merge(states, on=['NAME'])
result['NEW VALUE'] = result.VALUE / result.DIVISOR
print(result)
NAME NUM VALUE NEW VALUE DIVISOR
0 CA 1 10 1.0 10
1 CA 2 20 2.0 10
2 CA 3 30 3.0 10
3 AZ 1 40 8.0 5
4 AZ 2 50 10.0 5
5 AZ 3 60 12.0 5
6 TX 1 70 70.0 1
7 TX 2 80 80.0 1
8 TX 3 90 90.0 1
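If you want the output to match the desired result exactly, without the extra DIVISOR column, you can drop it once the division is done; a small follow-up sketch:
result = df.merge(states, on='NAME')
result['VALUE'] = result['VALUE'] / result['DIVISOR']
# drop the helper column after dividing
result = result.drop(columns='DIVISOR')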
I feel like there must be a more elegant way to accomplish what you are looking for, but this is the route that I usually take.
myresult = df.copy()
for i in range(len(df['NAME'])):
    for j in range(len(states['NAME'])):
        if df['NAME'][i] == states['NAME'][j]:
            myresult['VALUE'][i] = df['VALUE'][i] / states['DIVISOR'][j]
myresult.head()
Out[10]:
NAME NUM VALUE
0 CA 1 1
1 CA 2 2
2 CA 3 3
3 AZ 1 8
4 AZ 2 10
This is a very brute force method. You loop through each row in df, then through each row in states, and for each pair you check whether the NAME columns match. If they do, you compute VALUE / DIVISOR.
You may get a SettingWithCopyWarning here; it comes from the chained assignment (myresult['VALUE'][i] = ...) rather than from the .copy() call itself.
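A minimal sketch of the same loop written with .loc, which avoids the chained assignment and the warning (same logic, assuming the default RangeIndex used in this example):
myresult = df.copy()
for i in range(len(df)):
    for j in range(len(states)):
        if df.loc[i, 'NAME'] == states.loc[j, 'NAME']:
            # .loc assigns directly on myresult, so there is no chained assignment
            myresult.loc[i, 'VALUE'] = df.loc[i, 'VALUE'] / states.loc[j, 'DIVISOR']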
This is an extension to this question.
I am trying to figure out a non-looping way to identify (an auto-incrementing int would be ideal) the non-unique groups of rows (a group can contain 1 or more rows) within each TDateID, GroupID combination, except that I need it to ignore that paired grouping if all of its rows have Structure = "s".
Here is an example DataFrame:
Index  Cents  Structure  SD_YF  TDateID  GroupID
10     182.5  s          2.1    0        0
11     182.5  s          2.1    0        0
12     153.5  s          1.05   0        1
13     153.5  s          1.05   0        1
14     43     p          11     1        2
15     43     p          11     1        2
4      152    s          21     1        2
5      152    s          21     1        2
21     53     s          13     2        3
22     53     s          13     2        3
24     252    s          25     2        3
25     252    s          25     2        3
In pandas form:
df = pd.DataFrame({'Index': [10, 11, 12, 13, 14, 15, 4, 5, 21, 22, 24, 25],
'Cents': [182.5,
182.5,
153.5,
153.5,
43.0,
43.0,
152.0,
152.0,
53.0,
53.0,
252.0,
252.0],
'Structure': ['s', 's', 's', 's', 'p', 'p', 's', 's', 's', 's', 's', 's'],
'SD_YF': [2.1,
2.1,
1.05,
1.05,
11.0,
11.0,
21.0,
21.0,
13.0,
13.0,
25.0,
25.0],
'TDateID': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
'GroupID': [0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})
My ideal output would be:
Index  Cents  Structure  SD_YF  TDateID  GroupID  UniID
10     182.5  s          2.1    0        0        1
11     182.5  s          2.1    0        0        2
12     153.5  s          1.05   0        1        3
13     153.5  s          1.05   0        1        4
14     43     p          11     1        2        5
15     43     p          11     1        2        6
4      152    s          21     1        2        5
5      152    s          21     1        2        6
21     53     s          13     2        3        7
22     53     s          13     2        3        8
24     252    s          25     2        3        9
25     252    s          25     2        3        10
Note UniID 5: it draws attention to how Index 14 and Index 4 are paired together. Similarly for UniID 6 (Index 15 and Index 5). I hope that makes sense!
Using the following code worked great, except it would need to be adapted for the "Structure != "s" for all rows in the grouping" part.
df['UniID'] = (df['GroupID']
               + df.groupby('GroupID').ngroup().add(1)
               + df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount()
               )
Do the IDs need to be consecutive?
If the occurrence of "duplicate" rows is small, looping over just those groups might not be too bad.
First assign an ID to all the pairs using the code you have (and add an indicator column marking that a row belongs to a paired grouping). Then select all the rows in those groupings (using the indicator column) and iterate over them. If a grouping is all "s", reassign the IDs so they are unique for each row, as in the sketch below.
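One way to read that suggestion, assuming UniID has already been filled by the code from the question and the default RangeIndex is in place; the helper names (pair_groups, in_pair, next_id) are made up for illustration, and the reassigned IDs are not consecutive:
# a GroupID forms "pairs" when it holds more than one (Cents, SD_YF) combination
pair_groups = df.groupby('GroupID')[['Cents', 'SD_YF']].nunique().max(axis=1).gt(1)
df['in_pair'] = df['GroupID'].map(pair_groups)

next_id = df['UniID'].max() + 1
for _, grp in df[df['in_pair']].groupby('GroupID'):
    # if every row of the paired grouping is an "s" structure, break the pairing
    # and give each row its own fresh ID instead of the shared one
    if grp['Structure'].eq('s').all():
        for idx in grp.index:
            df.loc[idx, 'UniID'] = next_id
            next_id += 1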
The tricky thing is to imagine how this should generalize. Here is my understanding: create a sequential count ignoring the p, then back fill those.
m = df['Structure'].eq('s')
df['UniID'] = m.cumsum() + (~m).cumsum().mask(m, 0)
Output:
Index Cents Structure SD_YF TDateID GroupID UniID
0 10 182.5 s 2.10 0 0 1
1 11 182.5 s 2.10 0 0 2
2 12 153.5 s 1.05 0 1 3
3 13 153.5 s 1.05 0 1 4
4 14 43.0 p 11.00 1 2 5
5 15 43.0 p 11.00 1 2 6
6 4 152.0 s 21.00 1 2 5
7 5 152.0 s 21.00 1 2 6
8 21 53.0 s 13.00 2 3 7
9 22 53.0 s 13.00 2 3 8
10 24 252.0 s 25.00 2 3 9
11 25 252.0 s 25.00 2 3 10
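To see why this works, it can help to print the intermediate pieces; a small sketch of what each term contributes on this data:
m = df['Structure'].eq('s')          # True for "s" rows, False for "p" rows
s_count = m.cumsum()                 # running count of "s" rows: 1,2,3,4,4,4,5,6,7,8,9,10
p_count = (~m).cumsum().mask(m, 0)   # running count of "p" rows, zeroed on "s" rows: 0,0,0,0,1,2,0,...
df['UniID'] = s_count + p_count      # so the "p" rows reuse the IDs of the "s" rows that follow them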
I have the following problem:
If a year or day does not exist in df2, then a price and a listing_id are still filled in during the merge, but they should be NaN.
The second problem is that when merging, as soon as multiple rows share the same day and year, the temperature is also merged onto the second row, for example:
d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
d2 = {'id': [122, 244], 'day': [1, 1],
'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 1 4 440 2001
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 1 1 20 2001 4 440 # <-- The second temperature is wrong :/
This should not happen, because if I later have another day-1 row, from year 2002, with a temperature of 30 and I want to calculate the average temperature, I would get (20 + 20 + 30) / 3 = 23.3 when it should be (20 + 30) / 2 = 25. Therefore, if a value has already been filled in, the duplicate row should contain NaN instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
1 2 2 40 2002
2 3 3 50 2004
3 4 4 60 2005
4 5 2 20 1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 2 4 440 2002
2 387 3 5 500 2004
3 4454 4 6 6600 2005
4 521 2 7 500 2005
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left').drop('day', axis=1)
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 40 2002 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 4 440
6 5 2 20 1999 7 500
What I want
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 NaN 2005 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 NaN NaN
IIUC:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
             on=['day', 'year'], how='outer')
id day temperature year listing_id price
0 1.0 1 20.0 2001 2.0 20.0
1 2.0 2 40.0 2002 4.0 440.0
2 3.0 3 50.0 2004 5.0 500.0
3 4.0 4 60.0 2005 6.0 6600.0
4 5.0 2 20.0 1999 NaN NaN
5 NaN 2 NaN 2005 7.0 500.0
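If you also want duplicate temperatures blanked out, as in the "What I want" table where a day/year that has already been used keeps NaN, one possible follow-up (a sketch, shown on the first small example where two listings share day 1 / year 2001) is to mask repeated day/year combinations after the merge:
out = df.merge(df2[['day', 'listing_id', 'price', 'year']],
               on=['day', 'year'], how='left')
# keep the temperature only on the first occurrence of each (day, year)
out.loc[out.duplicated(subset=['day', 'year']), 'temperature'] = float('nan')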
Hi, I am trying to run a lookup-equivalent function in Python, but having tried merge and join I haven't hit the nail on the head yet.
So my first df is this:
list = ['Computer', 'AA', 'Monitor', 'BB', 'Printer', 'BB', 'Desk', 'AA', 'Printer', 'DD', 'Desk', 'BB']
list2 = [1500, 232, 300, 2323, 150, 2323, 250, 2323, 23, 34, 45, 56]
df = pd.DataFrame(list,columns=['product'])
df['number'] = list2
This is how the df would look
product number
0 Computer 1500
1 AA 232
2 Monitor 300
3 BB 2323
4 Printer 150
5 BB 2323
6 Desk 250
7 AA 2323
8 Printer 23
9 DD 34
10 Desk 45
11 BB 56
This is the 2nd dataframe
list_n = ['AA','BB','CC','DD']
list_n2 = ['Y','N','N','Y']
df2 = pd.DataFrame(list_n,columns=['product'])
df2['to_add'] = list_n2
This is what df2 looks like:
product to_add
0 AA Y
1 BB N
2 CC N
3 DD Y
Now, how do I add a column ('to_add') to the first dataframe (df) so it looks like the table below? In Excel it's a simple VLOOKUP. I tried the merge and join functions, but they alter the order of my df, and I don't want the ordering to change. Any ideas?
product number to_add
0 Computer 1500
1 AA 232 Y
2 Monitor 300
3 BB 2323 N
4 Printer 150
5 BB 2323 N
6 Desk 250
7 AA 2323 Y
8 Printer 23
9 DD 34 Y
10 Desk 45
11 BB 56 N
pd.merge will indeed do the job; you were probably not using it correctly:
pd.merge(df, df2, on="product", how="left")
will return:
product number to_add
0 Computer 1500 NaN
1 AA 232 Y
2 Monitor 300 NaN
3 BB 2323 N
4 Printer 150 NaN
5 BB 2323 N
6 Desk 250 NaN
7 AA 2323 Y
8 Printer 23 NaN
9 DD 34 Y
10 Desk 45 NaN
11 BB 56 N
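If you prefer to avoid merge entirely, a map lookup is another VLOOKUP-like option that leaves the row order of df untouched; a minimal sketch:
# map each product to its to_add flag; unmatched products become NaN
df['to_add'] = df['product'].map(df2.set_index('product')['to_add'])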
I have a dataframe df_sample with 10 parsed addresses and am comparing it to another dataframe with hundreds of thousands of parsed address records df. Both df_sample and df share the exact same structure:
zip_code city state street_number street_name unit_number country
12345 FAKEVILLE FLORIDA 123 FAKE ST NaN US
What I want to do is match a single row in df_sample against every row in df, starting with state and take only the rows where the fuzzy.ratio(df['state'], df_sample['state']) > 0.9 into a new dataframe. Once this new, smaller dataframe is created from those matches, I would continue to do this for city, zip_code, etc. Something like:
df_match = df[fuzzy.ratio(df_sample['state'], df['state']) > 0.9]
except that doesn't work.
My goal is to narrow down the number of matches each time I use a harder search criterion, and eventually end up with a dataframe with as few matches as possible based on narrowing it down by each column individually. But I am unsure as to how to do this for any single record.
Create your dataframes
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                   'zip': [1, 2, 3, 4, 5],
                   'state': ['Florida', 'Nevada', 'Texas', 'Florida', 'Texas']})
df_sample = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                          'zip': [6, 7, 8, 9, 10],
                          'state': ['florida', 'Flor', 'NY', 'Florida', 'Tx']})
merged_df = df_sample.merge(df, on='key')
merged_df['fuzzy_ratio'] = merged_df.apply(lambda row: fuzz.ratio(row['state_x'], row['state_y']), axis=1)
merged_df
You get the fuzzy ratio for each pair:
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
1 1 6 florida 2 Nevada 31
2 1 6 florida 3 Texas 17
3 1 6 florida 4 Florida 86
4 1 6 florida 5 Texas 17
5 1 7 Flor 1 Florida 73
6 1 7 Flor 2 Nevada 0
7 1 7 Flor 3 Texas 0
8 1 7 Flor 4 Florida 73
9 1 7 Flor 5 Texas 0
10 1 8 NY 1 Florida 0
11 1 8 NY 2 Nevada 25
12 1 8 NY 3 Texas 0
13 1 8 NY 4 Florida 0
14 1 8 NY 5 Texas 0
15 1 9 Florida 1 Florida 100
16 1 9 Florida 2 Nevada 31
17 1 9 Florida 3 Texas 17
18 1 9 Florida 4 Florida 100
19 1 9 Florida 5 Texas 17
20 1 10 Tx 1 Florida 0
21 1 10 Tx 2 Nevada 0
22 1 10 Tx 3 Texas 57
23 1 10 Tx 4 Florida 0
24 1 10 Tx 5 Texas 57
Then filter out what you don't want:
mask = (merged_df['fuzzy_ratio']>80)
merged_df[mask]
result:
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
3 1 6 florida 4 Florida 86
15 1 9 Florida 1 Florida 100
18 1 9 Florida 4 Florida 100
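From here you can keep narrowing in the same way on the remaining columns; a sketch of a next step, assuming the real merged frame also carries city_x/city_y columns (they are not in the toy example above):
narrowed = merged_df[mask].copy()
# hypothetical next pass: score the city columns on the already-narrowed rows
narrowed['city_ratio'] = narrowed.apply(
    lambda row: fuzz.ratio(str(row['city_x']), str(row['city_y'])), axis=1)
narrowed = narrowed[narrowed['city_ratio'] > 80]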
I'm not familiar with fuzzy, so this is more of a comment than an answer. That said, you can do something like this:
# cross join
df_merge = pd.merge(*[d.assign(dummy=1) for d in (df, df_sample)],
                    on='dummy', how='left')

filters = pd.DataFrame()
# compute the fuzzy ratio for each pair of columns
for col in df.columns:
    filters[col] = (df_merge[[col + '_x', col + '_y']]
                    .apply(lambda x: fuzzy.ratio(x[col + '_x'], x[col + '_y']), axis=1))

# filter only those with ratio > 0.9
df_match = df_merge[filters.gt(0.9).all(1)]
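One caveat, assuming the fuzzy module in question is fuzzywuzzy: fuzz.ratio returns an integer score from 0 to 100, so the threshold in that case should be 90 rather than 0.9, e.g.:
df_match = df_merge[filters.gt(90).all(1)]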
You wrote that your df has a very large number of rows, so a full cross join followed by elimination may cause your code to run out of memory.
Take a look at another solution that requires less memory:
minRatio = 90
result = []
for idx1, t1 in df_sample.state.items():
    for idx2, t2 in df.state.items():
        ratio = fuzz.WRatio(t1, t2)
        if ratio > minRatio:
            result.append([idx1, t1, idx2, t2, ratio])
df2 = pd.DataFrame(result, columns=['idx1', 'state1', 'idx2', 'state2', 'ratio'])
It contains 2 nested loops running over both DataFrames.
The result is a DataFrame with rows containing:
the index and state from df_sample,
the index and state from df,
the ratio.
This tells you which rows in the two DataFrames are "related" to each other.
The advantage is that you don't generate the full cross join and (for now) you operate only on the state columns instead of full rows.
You didn't describe what exactly the final result should be, but I think that based on the above code you will be able to proceed further.
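For example, one possible next step (my own sketch, not part of the original answer) is to use the match table df2 built above to pull only the related rows out of the big frame:
# keep only the rows of the big df that matched at least one sample row on state
matched = df.loc[df2['idx2'].unique()]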
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'id': [1, 2, 3, 4, 5, 6] * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
This is the output of df:
id sales state
0 1 847754 CA
1 2 362532 WA
2 3 615849 CO
3 4 376480 AZ
4 5 381286 CA
5 6 411001 WA
6 1 946795 CO
7 2 857435 AZ
8 3 928087 CA
9 4 675593 WA
10 5 371339 CO
11 6 440285 AZ
I am not able to compute the cumulative percentage within each group in descending order of sales. I want output like this:
id sales state cumsum run_pct
0 2 857435 AZ 857435 0.5121460996296738
1 6 440285 AZ 1297720 0.7751284195436626
2 4 376480 AZ 1674200 1.0
3 3 928087 CA 928087 0.43024216932985404
4 1 847754 CA 1775841 0.8232436013271356
5 5 381286 CA 2157127 1.0
6 1 946795 CO 946795 0.48955704367618535
7 3 615849 CO 1562644 0.807992624547372
8 5 371339 CO 1933983 1.0
9 4 675593 WA 675593 0.46620721731581655
10 6 411001 WA 1086594 0.7498271371847582
11 2 362532 WA 1449126 1.0
One possible solution is to first sort the data, then calculate the cumulative sum, and finally the percentages.
Sorting with ascending states and descending sales:
df = df.sort_values(['state', 'sales'], ascending=[True, False])
Calculating the cumsum:
df['cumsum'] = df.groupby('state')['sales'].cumsum()
and the percentages:
df['run_pct'] = df.groupby('state')['sales'].apply(lambda x: (x/x.sum()).cumsum())
This will give:
id sales state cumsum run_pct
0 4 846079 AZ 846079 0.608566
1 2 312708 AZ 1158787 0.833491
2 6 231495 AZ 1390282 1.000000
3 3 790291 CA 790291 0.506795
4 1 554631 CA 1344922 0.862467
5 5 214467 CA 1559389 1.000000
6 1 983878 CO 983878 0.388139
7 5 779497 CO 1763375 0.695650
8 3 771486 CO 2534861 1.000000
9 6 794407 WA 794407 0.420899
10 2 587843 WA 1382250 0.732355
11 4 505155 WA 1887405 1.000000
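As a side note, the last step can also be written without apply, by dividing the cumulative sum by each state's total via transform; a small sketch assuming the cumsum column computed above:
# run_pct = cumulative sales divided by that state's total sales
df['run_pct'] = df['cumsum'] / df.groupby('state')['sales'].transform('sum')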