pandas: groupby and variable weights - python

I have a dataset with weights for each observation and I want to prepare weighted summaries using groupby but am rusty as to how to best do this. I think it implies a custom aggregation function. My issue is how to properly deal with not item-wise data, but group-wise data. Perhaps it means that it is best to do this in steps rather than in one go.
In pseudo-code, I am looking for
#first, calculate weighted value
for each row:
weighted jobs = weight * jobs
#then, for each city, sum these weights and divide by the count (sum of weights)
for each city:
sum(weighted jobs)/sum(weight)
I am not sure how to work the "for each city"-part into a custom aggregate function and get access to group-level summaries.
Mock data:
import pandas as pd
import numpy as np
np.random.seed(43)
## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
weight = np.random.randint(low=5,high=40,size=N)
jobs = np.random.randint(low=1,high=20,size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city =pd.DataFrame({'industry':ind,'city':cty,'weight':weight,'jobs':jobs})

Simply multiply the two columns:
In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']
Now you can groupby the city (and take the sum):
In [12]: df_city_sums = df_city.groupby('city').sum()
In [13]: df_city_sums
Out[13]:
jobs weight weighted_jobs
city
oakland 362 690 7958
san mateo 367 1017 9026
sf 253 638 6209
[3 rows x 3 columns]
Now you can divide the two sums, to get the desired result:
In [14]: df_city_sums['weighted_jobs'] / df_city_sums['jobs']
Out[14]:
city
oakland 21.983425
san mateo 24.594005
sf 24.541502
dtype: float64

Related

Pandas - Fill in Missing Column Values Regression

I have a data frame 'df' that has missing column values. I want to fill in the missing/NaN values in the Avg Monthly Long Distance Charges column through prediction (regression) using the other column values. Then, replace the NaN values with the new values found.
Data frame: 'df'
Customer ID,Gender,Age,Married,Number of Dependents,City,Zip Code,Latitude,Longitude,Number of Referrals,Tenure in Months,Offer,Phone Service,Avg Monthly Long Distance Charges,Multiple Lines,Internet Service,Internet Type,Avg Monthly GB Download,Online Security,Online Backup,Device Protection Plan,Premium Tech Support,Streaming TV,Streaming Movies,Streaming Music,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status,Churn Category,Churn Reason
0002-ORFBO,Female,37,Yes,0,Frazier Park,93225,34.827662,-118.999073,2,9,None,Yes,42.39,No,Yes,Cable,16,No,Yes,No,Yes,Yes,No,No,Yes,One Year,Yes,Credit Card,65.6,593.3,0,0,381.51,974.81,Stayed,,
0003-MKNFE,Male,46,No,0,Glendale,91206,34.162515,-118.203869,0,9,None,Yes,10.69,Yes,Yes,Cable,10,No,No,No,No,No,Yes,Yes,No,Month-to-Month,No,Credit Card,-4,542.4,38.33,10,96.21,610.28,Stayed,,
0004-TLHLJ,Male,50,No,0,Costa Mesa,92627,33.645672,-117.922613,0,4,Offer E,Yes,33.65,No,Yes,Fiber Optic,30,No,No,Yes,No,No,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,73.9,280.85,0,0,134.6,415.45,Churned,Competitor,Competitor had better devices
0011-IGKFF,Male,78,Yes,0,Martinez,94553,38.014457,-122.115432,1,13,Offer D,Yes,27.82,No,Yes,Fiber Optic,4,No,Yes,Yes,No,Yes,Yes,No,Yes,Month-to-Month,Yes,Bank Withdrawal,98,1237.85,0,0,361.66,1599.51,Churned,Dissatisfaction,Product dissatisfaction
0013-EXCHZ,Female,75,Yes,0,Camarillo,93010,34.227846,-119.079903,3,3,None,Yes,7.38,No,Yes,Fiber Optic,11,No,No,No,Yes,Yes,No,No,Yes,Month-to-Month,Yes,Credit Card,83.9,267.4,0,0,22.14,289.54,Churned,Dissatisfaction,Network reliability
0013-MHZWF,Female,23,No,3,Midpines,95345,37.581496,-119.972762,0,9,Offer E,Yes,16.77,No,Yes,Cable,73,No,No,No,Yes,Yes,Yes,Yes,Yes,Month-to-Month,Yes,Credit Card,69.4,571.45,0,0,150.93,722.38,Stayed,,
0013-SMEOE,Female,67,Yes,0,Lompoc,93437,34.757477,-120.550507,1,71,Offer A,Yes,9.96,No,Yes,Fiber Optic,14,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Bank Withdrawal,109.7,7904.25,0,0,707.16,8611.41,Stayed,,
0014-BMAQU,Male,52,Yes,0,Napa,94558,38.489789,-122.27011,8,63,Offer B,Yes,12.96,Yes,Yes,Fiber Optic,7,Yes,No,No,Yes,No,No,No,No,Two Year,Yes,Credit Card,84.65,5377.8,0,20,816.48,6214.28,Stayed,,
0015-UOCOJ,Female,68,No,0,Simi Valley,93063,34.296813,-118.685703,0,7,Offer E,Yes,10.53,No,Yes,DSL,21,Yes,No,No,No,No,No,No,Yes,Two Year,Yes,Bank Withdrawal,48.2,340.35,0,0,73.71,414.06,Stayed,,
0016-QLJIS,Female,43,Yes,1,Sheridan,95681,38.984756,-121.345074,3,65,None,Yes,28.46,Yes,Yes,Cable,14,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Credit Card,90.45,5957.9,0,0,1849.9,7807.8,Stayed,,
0017-DINOC,Male,47,No,0,Rancho Santa Fe,92091,32.99356,-117.207121,0,54,None,No,,,Yes,Cable,10,Yes,No,No,Yes,Yes,No,No,Yes,Two Year,No,Credit Card,45.2,2460.55,0,0,0,2460.55,Stayed,,
0017-IUDMW,Female,25,Yes,2,Sunnyvale,94086,37.378541,-122.020456,2,72,None,Yes,16.01,Yes,Yes,Fiber Optic,59,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Credit Card,116.8,8456.75,0,0,1152.72,9609.47,Stayed,,
0018-NYROU,Female,58,Yes,0,Antelope,95843,38.715498,-121.363411,0,5,None,Yes,18.65,No,Yes,Fiber Optic,10,No,No,No,No,No,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,68.95,351.5,0,0,93.25,444.75,Stayed,,
0019-EFAEP,Female,32,No,0,La Mesa,91942,32.782501,-117.01611,0,72,Offer A,Yes,2.25,Yes,Yes,Fiber Optic,16,Yes,Yes,Yes,No,Yes,No,No,Yes,Two Year,Yes,Bank Withdrawal,101.3,7261.25,0,0,162,7423.25,Stayed,,
0019-GFNTW,Female,39,No,0,Los Olivos,93441,34.70434,-120.02609,0,56,None,No,,,Yes,DSL,19,Yes,Yes,Yes,Yes,No,No,No,Yes,Two Year,No,Bank Withdrawal,45.05,2560.1,0,0,0,2560.1,Stayed,,
0020-INWCK,Female,58,Yes,2,Woodlake,93286,36.464635,-119.094348,9,71,Offer A,Yes,27.26,Yes,Yes,Fiber Optic,12,No,Yes,Yes,No,No,Yes,Yes,Yes,Two Year,Yes,Credit Card,95.75,6849.4,0,0,1935.46,8784.86,Stayed,,
0020-JDNXP,Female,52,Yes,1,Point Reyes Station,94956,38.060264,-122.830646,0,34,None,No,,,Yes,DSL,20,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,One Year,No,Credit Card,61.25,1993.2,0,0,0,1993.2,Stayed,,
0021-IKXGC,Female,72,No,0,San Marcos,92078,33.119028,-117.166036,0,1,Offer E,Yes,7.77,Yes,Yes,Fiber Optic,22,No,No,No,No,No,No,No,Yes,One Year,Yes,Bank Withdrawal,72.1,72.1,0,0,7.77,79.87,Joined,,
0022-TCJCI,Male,79,No,0,Daly City,94015,37.680844,-122.48131,0,45,None,Yes,10.67,No,Yes,DSL,17,Yes,No,Yes,No,No,Yes,No,Yes,One Year,No,Credit Card,62.7,2791.5,0,0,480.15,3271.65,Churned,Dissatisfaction,Limited range of services
My code:
# Let X = predictor variable and y = target variable
X = pd.DataFrame(df[['Monthly Charge', 'Total Charges', 'Total Long Distance Charges']])
y = pd.DataFrame(df[['Avg Monthly Long Distance Charges']])
# Add a constant variable to the predictor variables
X = sm.add_constant(X)
model01 = sm.OLS(y, X).fit()
df['Avg Monthly Long Distance Charges'].fillna(sm.OLS(y, X).fit())
My code output:
0 42.39
1 10.69
2 33.65
3 27.82
4 7.38
...
7038 46.68
7039 16.2
7040 18.62
7041 2.12
7042 <statsmodels.regression.linear_model.Regressio...
Name: Avg Monthly Long Distance Charges, Length: 7043, dtype: object
My code outputs this, but does not print this into the original data frame. How do I do this? Thanks.

Pandas Fuzzy Matching

I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.
My code is:
import pandas as pd
from fuzzywuzzy import fuzz, process
### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")
#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")
### Variable for accuracy scoring
accuracy = 0
for index, value in df["address"].iteritems():
### This gathers the index from the correct address column in the reference df
ref_index = ref_df["correct_address"][
ref_df["correct_address"]
== process.extractOne(value, ref_df["correct_address"])[0]
].index.toList()[0]
### if each row can score a max total of 1, the ratio must be divided by 100
accuracy += (
fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
)
Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will then output an excel file with the correct values and a background colour to indicate what values were wrong and changed.
I don't believe fuzzywuzzy has a method that allows you to pull the index, value and ration into one tuple - just value and ratio of match.
Hopefully the below code (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is no where near as big.
You can pull the csv text from the links in the comments and run it and see what could work on your larger sample.
For five addresses in the reference frame and 100 contacts in the other its execution timings are:
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms
The below code should be quicker than .iteritems() etc.
Code:
# %%time
import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib
# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')
# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
max_score = -1
max_add = ''
for x in list_add:
score = fuzz.ratio(add, x)
if (score > min_score) & (score > max_score):
max_add = x
max_score = score
return (max_add, max_score)
# given current row of ref_df (via Apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
o = process.extractOne(x, s, score_cutoff = 60)
if o != None:
return o[1]
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())
# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
match = match_addresses(x, contacts_addresses, 75)
if match[1] >= 75:
name = (str(x), str(match[0]))
names.append(name)
name_dict = dict(names)
# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)
# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()
# add difflib scoring for a bit of interest.
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x : difflib.SequenceMatcher\
(None, x['ref_address'], x['matched_address']).ratio(),axis=1)
# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',\
'fuzzywuzzy_score', 'difflib_score']]
# see what we've got
print(compare_df)
# remember: correct_address and ref_address are copies
# so just pick one to compare to matched_address
correct_address ref_address matched_address \
0 86 Nw 66th Street #8673 86 Nw 66th Street #8673 86 Nw 66th St #8673
1 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230
2 6649 N Blue Gum St 6649 N Blue Gum St 6649 N Blue Gum St
3 59 n Groesbeck Hwy 59 n Groesbeck Hwy 59 N Groesbeck Hwy
4 1 Central Avenue 1 Central Avenue 1 Central Ave
fuzzywuzzy_score difflib_score
0 90 0.904762
1 100 1.000000
2 100 1.000000
3 100 0.944444
4 90 0.896552

Create distribution in Pandas

I want to generate a random/simulated data set with a specific distribution.
As an example the distribution has the following properties.
A population of 1000
The Gender mix is: male 49%, female 50%, other 1%
The age has the following distribution: 0-30 (30%), 31-60 (40%), 61-100 (30%)
The resulting data frame would have 1000 rows, and two columns called gender and age (with the above value distributions)
Is there a way to do this in Pandas or another library?
You may try:
N = 1000
gender = np.random.choice(["male","female", "other"], size=N, p = [.49,.5,.01])
age = np.r_[np.random.choice(range(30),size= int(.3*N)),
np.random.choice(range(31,60),size= int(.4*N)),
np.random.choice(range(61,100),size= N - int(.3*N) - int(.4*N) )]
np.random.shuffle(age)
df = pd.DataFrame({"gender":gender,"age":age})

Drop Dataframe Rows Based on a Similarity Measure Pandas

I want to eliminate repeated rows in my dataframe.
I know that that drop_duplicates() method works for dropping rows with identical subcolumn values. However I want to drop rows that aren't identical but similar. For example, I have the following two rows:
Title | Area | Price
Apartment at Boston 100 150000
Apt at Boston 105 149000
I want to be able to eliminate these two columns based on some similarity measure, such as if Title, Area, and Price differ by less than 5%. Say, I could delete rows whose similarity measure > 0.95. This would be particularly useful for large data sets, instead of manually inspecting row by row. How can I achieve this?
Here is a function using difflib. I got the similar function from here. You may also want to check out some of the answers on that page to determine the best similarity metric for your use case.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Title':['Apartment at Boston','Apt at Boston'],
'Area':[100,105],
'Price':[150000,149000]})
def string_ratio(df,col,ratio):
from difflib import SequenceMatcher
import numpy as np
def similar(a, b):
return SequenceMatcher(None, a, b).ratio()
ratios = []
for i, x in enumerate(df[col]):
a = np.array([similar(x, row) for row in df[col]])
a = np.where(a < ratio)[0]
ratios.append(len(a[a != i])==0)
return pd.Series(ratios)
def numeric_ratio(df,col,ratio):
ratios = []
for i, x in enumerate(df[col]):
a = np.array([min(x,row)/max(x,row) for row in df[col]])
a = np.where(a<ratio)[0]
ratios.append(len(a[a != i])==0)
return pd.Series(ratios)
mask = ~((string_ratio(df1,'Title',.95))&(numeric_ratio(df1,'Area',.95))&(numeric_ratio(df1,'Price',.95)))
df1[mask]
It should be able to weed out most of the similar data, though you might want to tweak the string_ratio function if it doesn't suite you case.
See if this meets your needs
Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago','Apt at Seattle','Apt at Seattle','Apt at Chicago']
Area = [100, 105, 100, 102,101,101]
Price = [150000, 149000,150200,150300,150000,150000]
data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())
The df created is as below
Title Area Price
0 Apartment at Boston 100 150000
1 Apt at Boston 105 149000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300
4 Apt at Seattle 101 150000
5 Apt at Chicago 101 150000
Now, we run the code as below
from fuzzywuzzy import fuzz
def fuzzy_compare(a,b):
val=fuzz.partial_ratio(a,b)
return val
tl = df["Title"].tolist()
itered=1
i=0
def do_the_thing(i):
itered=i+1
while itered < len(tl):
val=fuzzy_compare(tl[i],tl[itered])
if val > 80:
if abs((df.loc[i,'Area'])/(df.loc[itered,'Area']))>0.94 and abs((df.loc[i,'Area'])/(df.loc[itered,'Area']))<1.05:
if abs((df.loc[i,'Price'])/(df.loc[itered,'Price']))>0.94 and abs((df.loc[i,'Price'])/(df.loc[itered,'Price']))<1.05:
df.drop(itered,inplace=True)
df.reset_index()
pass
else:
pass
else:
pass
else:
pass
itered=itered+1
while i < len(tl)-1:
try:
do_the_thing(i)
i=i+1
except:
i=i+1
pass
else:
pass
the output is df as below. Repeating Boston & Seattle items are removed when fuzzy match is more that 80 & the values of Area & Price are within 5% of each other.
Title Area Price
0 Apartment at Boston 100 150000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300

Rolling Product in PANDAS over 30-day time window

I am trying to get data ready for a financial event analysis and want to calculate the buy-and-hold abnormal return (BHAR). For a test data set I have three events (noted by event_id), and for each event I have 272 rows, going from t-252 days to t+20 days (noted by the variable time). For each day I also have the stock's return data (ret) as well as the expected return (Exp_Ret), which was calculated using a market model. Here's a sample of the data:
index event_id time ret vwretd Exp_Ret
0 0 -252 0.02905 0.02498 nan
1 0 -251 0.01146 -0.00191 nan
2 0 -250 0.01553 0.00562 nan
...
250 0 -2 -0.00378 0.00028 -0.00027
251 0 -1 0.01329 0.00426 0.00479
252 0 0 -0.01723 -0.00875 -0.01173
271 0 19 0.01335 0.01150 0.01398
272 0 20 0.00722 -0.00579 -0.00797
273 1 -252 0.01687 0.00928 nan
274 1 -251 -0.00615 -0.01103 nan
And here's the issue. I would like to calculate the following BHAR formula for each day:
So, using the above formula as an example, if I would like to calculate the 10-day buy-and-hold abnormal return,I would have to calculate (1+ret_t=0)x(1+ret_t=1)...x(1+ret_t=10), then do the same with the expected return, (1+Exp_Ret_t=0)x(1+Exp_Ret_t=1)...x(1+Exp_Ret_t=10), then substract the latter from the former.
I have made some progress using rolling_apply but it doesn't solve all my problems:
df['part1'] = pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod())
This seems to correctly implement the left-side of the BHAR equation in that it will add in the correct product -- though it will enter the value two rows down (which can be solved by shifting). One problem, though, is that there are three different 'groups' in the dataframe (3 events), and if the window were to go forward more than 30 days it might start using products from the next event. I have tried to implement a groupby with rolling_apply but keep getting error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
df.groupby('event_id').apply(pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod()))
I am sure I am missing something basic here so any help would be appreciated. I might just need to approach it from a different angle. Here's one thought: In the end, what I am most interested in is getting the 30-day and 60-day buy-and-hold abnormal returns starting at time=0. So, maybe it is easier to select each event at time=0 and then calculate the 30-day product going forward? I'm not sure how I could best approach that.
Thanks in advance for any insights.
# Create sample data.
np.random.seed(0)
VOL = .3
df = pd.DataFrame({'event_id': [0] * 273 + [1] * 273 + [2] * 273,
'time': range(-252, 21) * 3,
'ret': np.random.randn(273 * 3) * VOL / 252 ** .5,
'Exp_Ret': np.random.randn(273 * 3) * VOL / 252 ** .5})
# Pivot on time and event_id.
df = df.set_index(['time', 'event_id']).unstack('event_id')
# Calculated return difference from t=0.
df_diff = df.ix[df.index >= 0, 'ret'] - df.loc[df.index >= 0, 'Exp_Ret']
# Calculate cumulative abnormal returns.
cum_returns = (1 + df_diff).cumprod() - 1
# Get 10 day abnormal returns.
>>> cum_returns.loc[10]
event_id
0 -0.014167
1 -0.172599
2 -0.032647
Name: 10, dtype: float64
Edited so that final values of BHAR are included in the main DataFrame.
BHAR = pd.Series()
def bhar(arr):
return np.cumprod(arr+1)[-1]
grouped = df.groupby('event_id')
for name, group in grouped:
BHAR = BHAR.append(pd.rolling_apply(group['ret'],10,bhar) -
pd.rolling_apply(group['Exp_Ret'],10,bhar))
df['BHAR'] = BHAR
You can then slice the DataFrame using df[df['time']>=0] such that you get only the required part.
You can obviously collapse the loop in one line using .apply() on the group, but I like it this way. Shorter lines to read = better readability.
This is what I did:
((df+1.0) \
.apply(lambda x: np.log(x),axis=1)\
.rolling(365).sum() \
.apply(lambda x: np.exp(x),axis=1)-1.0)
result is a rolling product.

Categories