Create distribution in Pandas - python
I want to generate a random/simulated data set with a specific distribution.
As an example, the distribution has the following properties:
A population of 1000
The Gender mix is: male 49%, female 50%, other 1%
The age has the following distribution: 0-30 (30%), 31-60 (40%), 61-100 (30%)
The resulting data frame would have 1000 rows, and two columns called gender and age (with the above value distributions)
Is there a way to do this in Pandas or another library?
You may try:

import numpy as np
import pandas as pd

N = 1000
gender = np.random.choice(["male", "female", "other"], size=N, p=[.49, .5, .01])

# Draw ages for each band in the requested proportions.
# Ranges are inclusive of the upper bound (0-30, 31-60, 61-100), hence the +1.
age = np.r_[np.random.choice(range(0, 31), size=int(.3 * N)),
            np.random.choice(range(31, 61), size=int(.4 * N)),
            np.random.choice(range(61, 101), size=N - int(.3 * N) - int(.4 * N))]
np.random.shuffle(age)

df = pd.DataFrame({"gender": gender, "age": age})
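A quick sanity check of the result (a minimal sketch; the exact proportions will vary a little from run to run because the draws are random):

print(df["gender"].value_counts(normalize=True))
# Bucket the ages back into the three bands and check their shares
print(pd.cut(df["age"], bins=[0, 30, 60, 100], include_lowest=True).value_counts(normalize=True))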
Related
Pandas - Fill in Missing Column Values with Regression
I have a data frame 'df' that has missing column values. I want to fill in the missing/NaN values in the Avg Monthly Long Distance Charges column through prediction (regression) using the other column values, then replace the NaN values with the predicted values.

Data frame 'df':

Customer ID,Gender,Age,Married,Number of Dependents,City,Zip Code,Latitude,Longitude,Number of Referrals,Tenure in Months,Offer,Phone Service,Avg Monthly Long Distance Charges,Multiple Lines,Internet Service,Internet Type,Avg Monthly GB Download,Online Security,Online Backup,Device Protection Plan,Premium Tech Support,Streaming TV,Streaming Movies,Streaming Music,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status,Churn Category,Churn Reason
0002-ORFBO,Female,37,Yes,0,Frazier Park,93225,34.827662,-118.999073,2,9,None,Yes,42.39,No,Yes,Cable,16,No,Yes,No,Yes,Yes,No,No,Yes,One Year,Yes,Credit Card,65.6,593.3,0,0,381.51,974.81,Stayed,,
0003-MKNFE,Male,46,No,0,Glendale,91206,34.162515,-118.203869,0,9,None,Yes,10.69,Yes,Yes,Cable,10,No,No,No,No,No,Yes,Yes,No,Month-to-Month,No,Credit Card,-4,542.4,38.33,10,96.21,610.28,Stayed,,
0004-TLHLJ,Male,50,No,0,Costa Mesa,92627,33.645672,-117.922613,0,4,Offer E,Yes,33.65,No,Yes,Fiber Optic,30,No,No,Yes,No,No,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,73.9,280.85,0,0,134.6,415.45,Churned,Competitor,Competitor had better devices
0011-IGKFF,Male,78,Yes,0,Martinez,94553,38.014457,-122.115432,1,13,Offer D,Yes,27.82,No,Yes,Fiber Optic,4,No,Yes,Yes,No,Yes,Yes,No,Yes,Month-to-Month,Yes,Bank Withdrawal,98,1237.85,0,0,361.66,1599.51,Churned,Dissatisfaction,Product dissatisfaction
0013-EXCHZ,Female,75,Yes,0,Camarillo,93010,34.227846,-119.079903,3,3,None,Yes,7.38,No,Yes,Fiber Optic,11,No,No,No,Yes,Yes,No,No,Yes,Month-to-Month,Yes,Credit Card,83.9,267.4,0,0,22.14,289.54,Churned,Dissatisfaction,Network reliability
0013-MHZWF,Female,23,No,3,Midpines,95345,37.581496,-119.972762,0,9,Offer E,Yes,16.77,No,Yes,Cable,73,No,No,No,Yes,Yes,Yes,Yes,Yes,Month-to-Month,Yes,Credit Card,69.4,571.45,0,0,150.93,722.38,Stayed,,
0013-SMEOE,Female,67,Yes,0,Lompoc,93437,34.757477,-120.550507,1,71,Offer A,Yes,9.96,No,Yes,Fiber Optic,14,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Bank Withdrawal,109.7,7904.25,0,0,707.16,8611.41,Stayed,,
0014-BMAQU,Male,52,Yes,0,Napa,94558,38.489789,-122.27011,8,63,Offer B,Yes,12.96,Yes,Yes,Fiber Optic,7,Yes,No,No,Yes,No,No,No,No,Two Year,Yes,Credit Card,84.65,5377.8,0,20,816.48,6214.28,Stayed,,
0015-UOCOJ,Female,68,No,0,Simi Valley,93063,34.296813,-118.685703,0,7,Offer E,Yes,10.53,No,Yes,DSL,21,Yes,No,No,No,No,No,No,Yes,Two Year,Yes,Bank Withdrawal,48.2,340.35,0,0,73.71,414.06,Stayed,,
0016-QLJIS,Female,43,Yes,1,Sheridan,95681,38.984756,-121.345074,3,65,None,Yes,28.46,Yes,Yes,Cable,14,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Credit Card,90.45,5957.9,0,0,1849.9,7807.8,Stayed,,
0017-DINOC,Male,47,No,0,Rancho Santa Fe,92091,32.99356,-117.207121,0,54,None,No,,,Yes,Cable,10,Yes,No,No,Yes,Yes,No,No,Yes,Two Year,No,Credit Card,45.2,2460.55,0,0,0,2460.55,Stayed,,
0017-IUDMW,Female,25,Yes,2,Sunnyvale,94086,37.378541,-122.020456,2,72,None,Yes,16.01,Yes,Yes,Fiber Optic,59,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Two Year,Yes,Credit Card,116.8,8456.75,0,0,1152.72,9609.47,Stayed,,
0018-NYROU,Female,58,Yes,0,Antelope,95843,38.715498,-121.363411,0,5,None,Yes,18.65,No,Yes,Fiber Optic,10,No,No,No,No,No,No,No,Yes,Month-to-Month,Yes,Bank Withdrawal,68.95,351.5,0,0,93.25,444.75,Stayed,,
0019-EFAEP,Female,32,No,0,La Mesa,91942,32.782501,-117.01611,0,72,Offer A,Yes,2.25,Yes,Yes,Fiber Optic,16,Yes,Yes,Yes,No,Yes,No,No,Yes,Two Year,Yes,Bank Withdrawal,101.3,7261.25,0,0,162,7423.25,Stayed,,
0019-GFNTW,Female,39,No,0,Los Olivos,93441,34.70434,-120.02609,0,56,None,No,,,Yes,DSL,19,Yes,Yes,Yes,Yes,No,No,No,Yes,Two Year,No,Bank Withdrawal,45.05,2560.1,0,0,0,2560.1,Stayed,,
0020-INWCK,Female,58,Yes,2,Woodlake,93286,36.464635,-119.094348,9,71,Offer A,Yes,27.26,Yes,Yes,Fiber Optic,12,No,Yes,Yes,No,No,Yes,Yes,Yes,Two Year,Yes,Credit Card,95.75,6849.4,0,0,1935.46,8784.86,Stayed,,
0020-JDNXP,Female,52,Yes,1,Point Reyes Station,94956,38.060264,-122.830646,0,34,None,No,,,Yes,DSL,20,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,One Year,No,Credit Card,61.25,1993.2,0,0,0,1993.2,Stayed,,
0021-IKXGC,Female,72,No,0,San Marcos,92078,33.119028,-117.166036,0,1,Offer E,Yes,7.77,Yes,Yes,Fiber Optic,22,No,No,No,No,No,No,No,Yes,One Year,Yes,Bank Withdrawal,72.1,72.1,0,0,7.77,79.87,Joined,,
0022-TCJCI,Male,79,No,0,Daly City,94015,37.680844,-122.48131,0,45,None,Yes,10.67,No,Yes,DSL,17,Yes,No,Yes,No,No,Yes,No,Yes,One Year,No,Credit Card,62.7,2791.5,0,0,480.15,3271.65,Churned,Dissatisfaction,Limited range of services

My code:

# Let X = predictor variables and y = target variable
X = pd.DataFrame(df[['Monthly Charge', 'Total Charges', 'Total Long Distance Charges']])
y = pd.DataFrame(df[['Avg Monthly Long Distance Charges']])

# Add a constant to the predictor variables
X = sm.add_constant(X)

model01 = sm.OLS(y, X).fit()
df['Avg Monthly Long Distance Charges'].fillna(sm.OLS(y, X).fit())

My code output:

0       42.39
1       10.69
2       33.65
3       27.82
4        7.38
        ...
7038    46.68
7039     16.2
7040    18.62
7041     2.12
7042    <statsmodels.regression.linear_model.Regressio...
Name: Avg Monthly Long Distance Charges, Length: 7043, dtype: object

My code outputs this, but does not write the new values back into the original data frame. How do I do this? Thanks.
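A minimal sketch of one way to write predictions back (keeping the statsmodels approach and column names from the question; this is an illustrative outline, not a verified fix): fit only on rows where the target is present, predict the rows where it is missing, and assign those predictions in place.

import pandas as pd
import statsmodels.api as sm

target = 'Avg Monthly Long Distance Charges'
features = ['Monthly Charge', 'Total Charges', 'Total Long Distance Charges']

# Fit only on rows where the target is known
known = df[df[target].notna()]
X_known = sm.add_constant(known[features])
model01 = sm.OLS(known[target], X_known).fit()

# Predict the rows where the target is missing and write them back
missing_mask = df[target].isna()
# has_constant='add' forces the constant column even if a feature happens to be constant
X_missing = sm.add_constant(df.loc[missing_mask, features], has_constant='add')
df.loc[missing_mask, target] = model01.predict(X_missing)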
How can I locate a cell with specific keys and variables in a table
[Image of table] I created this table using Excel and loaded it into a Jupyter notebook for the following question. Blood pressure (BP) in childhood tends to increase with age, but differently for boys and girls. Suppose that for both boys and girls, mean systolic blood pressure is 95 mm Hg at 3 years of age and increases 1.5 mm Hg per year up to the age of 13. Furthermore, starting at age 13, the mean increases by 2 mm Hg per year for boys and 1 mm Hg per year for girls up to the age of 18. Finally, assume that blood pressure is normally distributed and that the standard deviation is 12 mm Hg for all age-sex groups. I want to be able to pull out the mean from the table I created, given the sex and age, so that I can answer question 5.3: What is the probability that an 11-year-old boy will have an SBP greater than 130 mm Hg? So far I have this: data.loc('M')
Pandas boolean indexing makes this pretty easy:

import pandas as pd

data = []
# Ages 3-13: mean SBP is 95 at age 3 and rises 1.5 mm Hg per year (same for both sexes),
# i.e. 95 + (age - 3) * 1.5 = 90.5 + age * 1.5
for i in range(3, 14):
    data.append(('M', i, 90.5 + i * 1.5))
    data.append(('F', i, 90.5 + i * 1.5))
# Ages 14-18: from the age-13 mean of 110, boys rise 2 mm Hg per year, girls 1 mm Hg per year
for i in range(14, 19):
    data.append(('M', i, 110 + (i - 13) * 2))
    data.append(('F', i, 110 + (i - 13) * 1))
data = pd.DataFrame(data, columns=['Sex', 'Age', 'Mean'])
print(data)

# Mean SBP for an 11-year-old boy
print(data[(data['Sex'] == 'M') & (data['Age'] == 11)]['Mean'])
# Average of the means for all boys older than 12
print(data[(data['Sex'] == 'M') & (data['Age'] > 12)]['Mean'].mean())
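To actually answer 5.3, a minimal sketch (assuming the normality and the 12 mm Hg standard deviation stated in the question, and scipy being available) looks up the mean from the table and uses the normal survival function:

from scipy.stats import norm

mean_11_boy = data.loc[(data['Sex'] == 'M') & (data['Age'] == 11), 'Mean'].iloc[0]  # 107.0
# P(SBP > 130) for an 11-year-old boy, with SD = 12 mm Hg
prob = norm.sf(130, loc=mean_11_boy, scale=12)
print(prob)  # roughly 0.028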
Is there a way to analyze the impact/correlation of categorical variables on the label in Python?
The dataset I have is a lot larger than this (about 3000 rows * 50 columns); I'm just putting a sample here. It's a dataframe with one row of information per line. Basically, I intend to analyze the attributes of each label, e.g. whether Level 3 tends to have a higher annual income, or what contributes to a higher level. What statistical functions might be a good fit for this? I'm trying sklearn.preprocessing.OrdinalEncoder() to encode every categorical variable, and trying something like stats.chi2.ppf() or a correlation matrix, but I'm not sure they work in my case.

example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Age": ['Age 26-35', 'Age 18-25', 'Age 18-25', 'Age 18-25', 'Age 26-35', 'Older than 35'],
        "Location": ['VA', 'DC', 'DC', 'CA', 'DC', 'MA'],
        "Gender": ['male', 'male', 'female', 'male', 'male', 'female'],
        "Annual Income": ['$5,001 - $10,000', '<$5,000', '$15,001 - $25,000', '>$50,000', '<$5,000', '$15,001 - $25,000'],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)

          Degree            Age Location  Gender      Annual Income  Level
0       Graduate      Age 26-35       VA    male   $5,001 - $10,000      0
1  Undergraduate      Age 18-25       DC    male            <$5,000      1
2  Undergraduate      Age 18-25       DC  female  $15,001 - $25,000      2
3       Graduate      Age 18-25       CA    male           >$50,000      0
4  Undergraduate      Age 26-35       DC    male            <$5,000      0
5      Doctorate  Older than 35       MA  female  $15,001 - $25,000      3

Open to any ideas and comments.
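One way to make the chi-square idea mentioned above concrete (a sketch, assuming scipy is available): test each categorical column against Level with a chi-square test of independence on a contingency table. Small p-values suggest an association, although with only a handful of rows the test is not reliable.

import pandas as pd
from scipy.stats import chi2_contingency

# Chi-square test of independence between each categorical column and Level
for col in ['Degree', 'Age', 'Location', 'Gender', 'Annual Income']:
    table = pd.crosstab(example[col], example['Level'])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{col}: chi2={chi2:.2f}, p={p:.3f}")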
import pandas as pd

example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Age": ['Age 26-35', 'Age 18-25', 'Age 18-25', 'Age 18-25', 'Age 26-35', 'Older than 35'],
        "Location": ['VA', 'DC', 'DC', 'CA', 'DC', 'MA'],
        "Gender": ['male', 'male', 'female', 'male', 'male', 'female'],
        "Annual Income": ['$5,001 - $10,000', '<$5,000', '$15,001 - $25,000', '>$50,000', '<$5,000', '$15,001 - $25,000'],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)

# Collect the unique values of every column
unique_items = []
for key in example:
    unique_items.append(example[key].unique())
for item in unique_items:
    print(item)

# Figure out how to sort each unique item,
# for example: Degree = by more education,
# Income = ascending, Level = ascending, etc.
# Now use the index as the value and you can start to do math and pictures.
# "Analyze" for me means:
# pick any two columns and scatterplot them to see if there is a relationship,
# then pick all pairs for any one column and make a collage of thumbnail scatterplots,
# then measure correlation or other math properties that put them in groupings you like.
# Think of this like categorizing galaxies:
# straight lines sloping up is one type that would be high on the list,
# but randomness is another type, and some might look like butterflies.
# Then sort by correlation and groupings to show all the strongest in a top-100 list.
# Good luck ;)
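As a concrete illustration of the "sort each unique item and use the index as the value" idea (a sketch; the chosen category orderings are assumptions, not given in the question):

# Ordered categoricals turn each category into an integer code that respects the chosen order
degree_order = ['Undergraduate', 'Graduate', 'Doctorate']
income_order = ['<$5,000', '$5,001 - $10,000', '$15,001 - $25,000', '>$50,000']

example['Degree_code'] = pd.Categorical(example['Degree'], categories=degree_order, ordered=True).codes
example['Income_code'] = pd.Categorical(example['Annual Income'], categories=income_order, ordered=True).codes
print(example[['Degree_code', 'Income_code', 'Level']])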
For the correlation I suggest using seaborn. Heatmaps show relationships between two variables, one plotted on each axis; by observing how cell colors change across each axis, you can see whether there are any patterns in value for one or both variables. https://chartio.com/learn/charts/heatmap-complete-guide/

import seaborn as sns
sns.heatmap(example[['Level']])

However, a heatmap needs numeric values, so Annual Income and Age have to be converted to integers first. One approximation is to take the first number from the income and age strings (rather than the full range; a range midpoint or .mean() could also be used):

example['Income'] = example['Annual Income'].str.extract('(\d+)')
example['Age'] = example['Age'].str.extract('(\d+)')
example['Income'] = pd.to_numeric(example['Income'])
example['Age'] = pd.to_numeric(example['Age'])

import seaborn as sns
sns.heatmap(example[['Level', 'Income', 'Age']])

Once annual income and age are numeric, the .corr() function is also an option:

example.corr()
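A small extension of that last step (a sketch): plotting the correlation matrix itself, rather than the raw columns, is usually easier to read.

corr = example[['Level', 'Income', 'Age']].corr()
# annot shows the correlation value in each cell; vmin/vmax fix the color scale to [-1, 1]
sns.heatmap(corr, annot=True, vmin=-1, vmax=1)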
Pandas Fuzzy Matching
I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems to take a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.

My code is:

import pandas as pd
from fuzzywuzzy import fuzz, process

### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")

### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")

### Variable for accuracy scoring
accuracy = 0

for index, value in df["address"].iteritems():
    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"] == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.tolist()[0]

    ### If each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )

Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will output an Excel file with the correct values and a background colour to indicate which values were wrong and changed. I don't believe fuzzywuzzy has a method that lets you pull the index, value and ratio into one tuple - just the value and ratio of the match.
Hopefully the below code (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is nowhere near as big. You can pull the csv text from the links in the comments, run it, and see what could work on your larger sample. For five addresses in the reference frame and 100 contacts in the other, its execution timings are:

CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms

The below code should be quicker than .iteritems() etc.

Code:

# %%time
import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib

# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')
# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')

# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)

# given current row of ref_df (via apply) and a series (df['address']),
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=60)
    if o != None:
        return o[1]

# creating two lists from the address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())

# via fuzzywuzzy matching and using match_addresses() above,
# build a dictionary of addresses where there is a match;
# the keys are the addresses from ref_df and the associated values are from df (i.e., the 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)

# create a new frame from the fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])

# add fuzzywuzzy scoring to the original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)

# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()

# add difflib scoring for a bit of interest
# (a random thought passed through my head - maybe this is interesting?)
compare_df['difflib_score'] = compare_df.apply(lambda x: difflib.SequenceMatcher(
    None, x['ref_address'], x['matched_address']).ratio(), axis=1)

# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',
                         'fuzzywuzzy_score', 'difflib_score']]

# see what we've got
print(compare_df)

# remember: correct_address and ref_address are copies,
# so just pick one to compare to matched_address

           correct_address              ref_address          matched_address  fuzzywuzzy_score  difflib_score
0  86 Nw 66th Street #8673  86 Nw 66th Street #8673      86 Nw 66th St #8673                90       0.904762
1   2737 Pistorio Rd #9230   2737 Pistorio Rd #9230   2737 Pistorio Rd #9230               100       1.000000
2       6649 N Blue Gum St       6649 N Blue Gum St       6649 N Blue Gum St               100       1.000000
3       59 n Groesbeck Hwy       59 n Groesbeck Hwy       59 N Groesbeck Hwy               100       0.944444
4         1 Central Avenue         1 Central Avenue            1 Central Ave                90       0.896552
pandas: groupby and variable weights
I have a dataset with weights for each observation and I want to prepare weighted summaries using groupby, but I am rusty as to how best to do this. I think it implies a custom aggregation function. My issue is how to properly deal with group-wise data rather than item-wise data; perhaps it is best to do this in steps rather than in one go. In pseudo-code, I am looking for:

# first, calculate the weighted value for each row:
#     weighted jobs = weight * jobs
# then, for each city, sum these weighted values and divide by the sum of weights:
#     for each city: sum(weighted jobs) / sum(weight)

I am not sure how to work the "for each city" part into a custom aggregate function and get access to group-level summaries.

Mock data:

import pandas as pd
import numpy as np
np.random.seed(43)

## prep mock data
N = 100
industry = ['utilities', 'sales', 'real estate', 'finance']
city = ['sf', 'san mateo', 'oakland']
weight = np.random.randint(low=5, high=40, size=N)
jobs = np.random.randint(low=1, high=20, size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})
Simply multiply the two columns:

In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']

Now you can groupby the city (and take the sum):

In [12]: df_city_sums = df_city.groupby('city').sum()

In [13]: df_city_sums
Out[13]:
           jobs  weight  weighted_jobs
city
oakland     362     690           7958
san mateo   367    1017           9026
sf          253     638           6209

[3 rows x 3 columns]

Now you can divide the two sums to get the desired result:

In [14]: df_city_sums['weighted_jobs'] / df_city_sums['jobs']
Out[14]:
city
oakland      21.983425
san mateo    24.594005
sf           24.541502
dtype: float64
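Note that the question's pseudo-code divides by sum(weight) rather than by the summed job counts. If that weighted average is what is wanted, a minimal sketch using the same df_city as above:

# weighted mean of jobs per city: sum(weight * jobs) / sum(weight)
weighted_mean = (
    df_city.groupby('city')
    .apply(lambda g: (g['weight'] * g['jobs']).sum() / g['weight'].sum())
)
print(weighted_mean)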