Cleaning up & filling in categorical variables for Data Science analysis - python

I'm taking on my very first machine learning problem, and I'm struggling with cleaning my categorical features in my dataset. My goal is to build a rock climbing recommendation system.
PROBLEM 1:
I have three related columns that contain erroneous information:
What it looks like now and what I want it to look like are shown in the sample data at the end of this question (df vs. the desired new_df).
If you group by the location name, there are different location_id numbers and countries associated with that one name. However, there is a clear winner/clear majority for each of these discrepancies. I have a data set of 2 million entries, and the mode of the location_id & location_country GIVEN the location_name overwhelmingly points to one answer (example: "300" & "USA" for clear_creek).
Using pandas/python, how do I group my dataset by the location_name, compute the mode of location_id & location_country based on that location name, and then replace the entire id & country columns with these mode calculations based on location_name to clean up my data?
I've played around with groupby, replace, duplicated, but I think ultimately I will need to create a function that will do this, and I honestly have no idea where to start. (I apologize in advance for my coding naivety.) I know there's got to be a solution, I just need to be pointed in the right direction.
PROBLEM 2:
Also, does anyone have suggestions on filling in the NaN values in my location_name (42,012/2 million) and location_country (46,890/2 million) columns? Is it best to keep them as an unknown value? I feel like filling in these features based on frequency would introduce a horrible bias into my data set.
import pandas as pd

data = {'index': [1, 2, 3, 4, 5, 6, 7, 8, 9],
        'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
                          'clear_creek', 'clear_creek', 'clear_creek',
                          'clear_creek', 'clear_creek'],
        'location_id': [100, 100, 0, 100, 300, 625, 300, 300, 300],
        'location_country': ['GRC', 'GRC', 'ESP', 'GRC', 'USA', 'IRE',
                             'USA', 'USA', 'USA']}
df = pd.DataFrame.from_dict(data)

Looking for it to return:

improved_data = {'index': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                 'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
                                   'clear_creek', 'clear_creek', 'clear_creek',
                                   'clear_creek', 'clear_creek'],
                 'location_id': [100, 100, 100, 100, 300, 300, 300, 300, 300],
                 'location_country': ['GRC', 'GRC', 'GRC', 'GRC', 'USA', 'USA',
                                      'USA', 'USA', 'USA']}
new_df = pd.DataFrame.from_dict(improved_data)

We can use .agg in combination with pd.Series.mode and then bring the result back onto your dataframe with map:
m1 = df.groupby('location_name')['location_id'].agg(pd.Series.mode)
m2 = df.groupby('location_name')['location_country'].agg(pd.Series.mode)
df['location_id'] = df['location_name'].map(m1)
df['location_country'] = df['location_name'].map(m2)
print(df)
   index location_name  location_id location_country
0      1     kalaymous          100              GRC
1      2     kalaymous          100              GRC
2      3     kalaymous          100              GRC
3      4     kalaymous          100              GRC
4      5   clear_creek          300              USA
5      6   clear_creek          300              USA
6      7   clear_creek          300              USA
7      8   clear_creek          300              USA
8      9   clear_creek          300              USA
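One caveat: pd.Series.mode returns every tied value, so if a location name ever has a tie in the full 2-million-row data, map would write an array into the column. A minimal sketch that always keeps just the first mode (assuming ties can be broken arbitrarily):

def first_mode(s):
    # Keep only the first mode per group so ties never produce array-valued cells.
    return s.mode().iat[0]

m1 = df.groupby('location_name')['location_id'].agg(first_mode)
m2 = df.groupby('location_name')['location_country'].agg(first_mode)
df['location_id'] = df['location_name'].map(m1)
df['location_country'] = df['location_name'].map(m2)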

You can also use transform, taking the first mode of each group with .iat[0]; transform only the two columns that need fixing so the index column is left untouched:
df = (df[['index', 'location_name']]
      .join(df.groupby('location_name')[['location_id', 'location_country']]
              .transform(lambda x: x.mode().iat[0]))
      .reindex(df.columns, axis=1))
print(df)

   index location_name  location_id location_country
0      1     kalaymous          100              GRC
1      2     kalaymous          100              GRC
2      3     kalaymous          100              GRC
3      4     kalaymous          100              GRC
4      5   clear_creek          300              USA
5      6   clear_creek          300              USA
6      7   clear_creek          300              USA
7      8   clear_creek          300              USA
8      9   clear_creek          300              USA
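A slightly more direct variant of the same idea (a sketch along the same lines, not benchmarked on 2 million rows) assigns the transformed columns back in place, which avoids the join/reindex step entirely:

# Overwrite the two noisy columns with the per-group mode; the index column is untouched.
df[['location_id', 'location_country']] = (
    df.groupby('location_name')[['location_id', 'location_country']]
      .transform(lambda x: x.mode().iat[0])
)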

As Erfan mentions, it would be helpful to have a view of your expected output for the first question.
For the second, pandas has a fillna method you can use to fill the NaN values. For example, to fill them with 'UNKNOWN_LOCATION' you could do the following:
df.fillna('UNKNOWN_LOCATION')
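If you only want to touch the two location columns (rather than every NaN in the frame), a minimal sketch would be:

# Fill only the columns the question mentions; other columns keep their NaNs.
cols = ['location_name', 'location_country']
df[cols] = df[cols].fillna('UNKNOWN_LOCATION')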
A potential solution for the first question:
df.groupby('location_name')[['location_id', 'location_country']].apply(lambda x: x.mode())
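Note that this returns one row of modes per group rather than a cleaned copy of df. A hedged sketch for pushing those modes back onto the original rows (assuming each group has a single clear mode, as in the sample data; modes and cleaned are just illustrative names):

modes = (df.groupby('location_name')[['location_id', 'location_country']]
           .agg(lambda x: x.mode().iat[0])
           .reset_index())
cleaned = df[['index', 'location_name']].merge(modes, on='location_name', how='left')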

Related

How to convert "event" data into country-year data by summating information in columns? Using python/pandas

I am trying to convert a dataframe where each row is a specific event, and each column has information about the event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in the given year. In this data set, each event is an occurrence of terrorism, and I want to sum the columns nkill, nhostages, and nwounded per year. This data set has 16 countries in West Africa and covers the years 2000-2020, with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID     iyear  country_txt  nkill  nwounded  nhostages
10000102    2000   Nigeria      3      10        0
10000103    2000   Mali         1      3         15
10000103    2000   Nigeria      15     0         0
10000103    2001   Benin        1      0         0
10000103    2001   Nigeria      1      3         15
...
And I would like it to look like this:
country_txt  iyear  total_nkill  total_nwounded  total_nhostages
Nigeria      2000   200          300             300
Nigeria      2001   250          450             15
So basically, I want to add up the number of nkill, nwounded, and nhostages for each country-year group, so that I have a list of all the countries and years with the total number of deaths, injuries, and hostages taken per year. Each country also has an associated numeric code (in the "country" column) if it is easier to write the code with a number instead of country_txt.
For a solution, I've been looking at the pandas "groupby" function, but I'm really new to coding so I'm having trouble understanding the documentation. It also seems like melt or pivot functions could be helpful.
This simplified example shows how you could use groupby -
import pandas as pd
df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
                   'year': [2000, 2000, 2001, 2000],
                   'events1': [3, 4, 5, 2],
                   'events2': [1, 6, 3, 4]})
df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()
print(df2)
which gives the total of each type of event by country and by year
              events1  events2
country year
Mali    2000        2        4
Nigeria 2000        7        7
        2001        5        3
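Applied to the column names shown in the question's table (iyear, country_txt, nkill, nwounded, nhostages, taken as given from the snippet above), the same pattern would look something like this:

# Sum the three count columns per country-year and rename to the desired headers.
totals = (df.groupby(['country_txt', 'iyear'])[['nkill', 'nwounded', 'nhostages']]
            .sum()
            .rename(columns={'nkill': 'total_nkill',
                             'nwounded': 'total_nwounded',
                             'nhostages': 'total_nhostages'})
            .reset_index())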

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (like the one from Excel) that will add up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There can be multiple ways of solving this. One option is to use pd.concat to stack the salesperson/amount column pairs and then use groupby:
merged_df = pd.concat([df[['Salesperson', 'Amount']],
                       df[['Salesperson 2', 'Amount2']]
                         .rename(columns={'Salesperson 2': 'Salesperson',
                                          'Amount2': 'Amount'})])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

merged_df = pd.concat([df[['Salesperson', 'Amount']],
                       df[['Salesperson 2', 'Amount2']]
                         .rename(columns={'Salesperson 2': 'Salesperson',
                                          'Amount2': 'Amount'}),
                       df[['Salesperson 3', 'Amount3']]
                         .rename(columns={'Salesperson 3': 'Salesperson',
                                          'Amount3': 'Amount'})])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson': 'Salesperson 1', 'Amount': 'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson', 'Amount'],
                              i='Client', j='num',
                              suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you the required output:
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="Client",
     names_to=".value",
     names_pattern=r"([a-zA-Z]+).*",
 )
 .groupby("Salesperson", as_index=False)
 .Amount
 .sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only the parts of the column names that match the capture group as headers. The column names follow a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. This pattern is captured in names_pattern; .value is paired with the part of the regex inside the parentheses, and whatever falls outside it is discarded in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.

merging on pandas: reduce the set of merging variables when match is not possible

Using Python, I want to merge on multiple variables A, B, C, but when the combination a-b-c is missing for an observation in one dataset, use the finest combination that the observation does have (like b-c).
Example:
Suppose I have a dataset (df1) containing a person's characteristics (gender, married, city), and another dataset (df2) with the median income of a person according to their gender, city, and married status (created with a groupby).
Then I want to put that median income into the first dataset (df1), matching on as many characteristics as possible. That is, if the individual's gender-city-married combination has a median income, use that value; if only a city-married median income exists for that individual, use that value instead.
Something like this:
df1 = pd.DataFrame({'Male':['0', '0', '1','1'],'Married':['0', '1', '0','1'], 'City': ['NY', 'NY', 'NY', 'NY']})
Male Married City
0 0 NY
0 1 NY
1 0 NY
1 1 NY
df2 = pd.DataFrame({'Male':['0', '0', '1'],'Married':['0', '1', '1'], 'City': ['NY', 'NY','NY'], 'income':['300','400', '500']})
Male Married City income
0 0 NY 300
0 1 NY 400
1 1 NY 500
and the desired outcome:
desired_df1:
Male Married City income
0 0 NY 300
0 1 NY 400
1 0 NY 300
1 1 NY 400
I was thinking of doing a first merge on ['Male', 'Married', 'City'] and then filling missing values from a second merge on ['Married', 'City']. But I think there should be a more systematic and simpler way. Any suggestions?
Thanks, and sorry if the formulation is not correct or this is a duplicate (I looked thoroughly and didn't find anything).
You can do a groupby and fillna too after merging:
out = df1.merge(df2, on=['Male', 'Married', 'City'], how='left')
out['income'] = out['income'].fillna(
    out.groupby(['Married', 'City'])['income'].fillna(method='ffill')
)
print(out)
Male Married City income
0 0 0 NY 300
1 0 1 NY 400
2 1 0 NY 300
3 1 1 NY 500 # <- Note that this should be 500 not 400
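The cascading approach described in the question also works and is easy to extend to more fallback levels. A sketch under the example's column names (income is cast to numeric here because the toy data stores it as strings; coarse is just an illustrative name):

# Exact key first, then a coarser (Married, City) median as a fallback.
df2 = df2.assign(income=pd.to_numeric(df2['income']))
coarse = (df2.groupby(['Married', 'City'], as_index=False)['income']
             .median()
             .rename(columns={'income': 'income_coarse'}))
out = (df1.merge(df2, on=['Male', 'Married', 'City'], how='left')
          .merge(coarse, on=['Married', 'City'], how='left'))
out['income'] = out['income'].fillna(out['income_coarse'])
out = out.drop(columns='income_coarse')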

Split a dataframe and sum [pandas]

I have the following dataframe (dummy data):
score GDP
country
Bangladesh 6 12
Bolivia 4 10
Nigeria 3 9
Pakistan 2 3
Ghana 1 3
India 1 3
Algeria 1 3
And I want to split it into two groups based on whether GDP is less than 9, then sum the score of each group:
sum_score
country
rich 13
poor 5
You can use np.where to make your rich and poor categories, then groupby that category and get the sum:
df['country_cat'] = np.where(df.GDP < 9, 'poor', 'rich')
df.groupby('country_cat')['score'].sum()
country_cat
poor 5
rich 13
You can also do the same in one step, by not creating the extra column for the category (but IMO the code becomes less readable):
df.groupby(np.where(df.GDP < 9, 'poor', 'rich'))['score'].sum()
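If you want exactly the frame shown in the question (an index labelled country and a single sum_score column), a small follow-up, assuming numpy is imported as np, could be:

out = (df.groupby(np.where(df.GDP < 9, 'poor', 'rich'))['score']
         .sum()
         .rename_axis('country')      # label the index as in the expected output
         .to_frame('sum_score'))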
You can also aggregate by the boolean mask and then just rename the index (note that GDP < 9 is True for the poor group):
a = df.groupby(df.GDP < 9)['score'].sum().rename({True: 'poor', False: 'rich'})
print (a)
GDP
rich    13
poor     5
Name: score, dtype: int64
Finally, for a one-column DataFrame, add Series.to_frame:
df = a.to_frame('sum_score')
print (df)
      sum_score
GDP
rich         13
poor          5

Searching one Python dataframe / dictionary for fuzzy matches in another dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):
df1:
PRODUCT_ID PRODUCT_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce"
1 185965653252 "Chicken Salad with Dressing"
2 165958565556 "Pork and Honey Rissoles"
3 655262522233 "Cheese, Ham and Tomato Sandwich"
4 857485966653 "Coleslaw with Yoghurt Dressing"
5 524156285551 "Lemon and Raspberry Cheesecake"
I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:
df2 (also saved as dict_2)
PROD_ID PROD_DESCRIPTION
0 548576 "Fish Burger"
1 156956 "Chckn Salad w/Ranch Dressing"
2 257848 "Rissoles - Lamb & Rosemary"
3 298770 "Lemn C-cake"
4 651452 "Potato Salad with Bacon"
5 100256 "Cheese Cake - Lemon Raspberry Coulis"
What I want to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy lifting part. I would then need to check the matches manually, but it would be a lot quicker. The ideal outcome would look like this, e.g. with one or more partial matches noted:
PRODUCT_ID PRODUCT_DESCRIPTION PROD_ID PROD_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce" 548576 "Fish Burger"
1 185965653252 "Chicken Salad with Dressing" 156956 "Chckn Salad w/Ranch Dressing"
2 165958565556 "Pork and Honey Rissoles" 257848 "Rissoles - Lamb & Rosemary"
3 655262522233 "Cheese, Ham and Tomato Sandwich" NaN NaN
4 857485966653 "Coleslaw with Yoghurt Dressing" NaN NaN
5 524156285551 "Lemon and Raspberry Cheesecake" 298770 "Lemn C-cake"
6 524156285551 "Lemon and Raspberry Cheesecake" 100256 "Cheese Cake - Lemon Raspberry Coulis"
I have already completed a join which has identified the exact matches. It's not important that the index is retained as the Product ID's in each df are unique. The results can also be saved into a new dataframe as this will then be applied to a third dataframe that has around 14 million rows.
I've used the following questions and answers (amongst others):
Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array
and also various loops/functions/mapping etc. but have had no success, either getting the first "fuzzy match" which has a low score or no matches being detected.
I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.
I'm using Python 2.7, pandas and have fuzzywuzzy installed.
Using fuzz.ratio as my distance metric, calculate the distance matrix like this:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(i, j, fuzz.ratio(vi, vj))

print(df3)
0 1 2 3 4 5
0 63 15 24 23 34 27
1 26 84 19 21 52 32
2 18 31 33 12 35 34
3 10 31 35 10 41 42
4 29 52 32 10 42 12
5 15 28 21 49 8 55
Set a threshold for acceptable distance; I set 50. Then find the index value (for df2) that has the maximum ratio in every row:
threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)
Make assignments
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df
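Building the full 50,000 x 20,000 ratio matrix with two Python loops may be slow. fuzzywuzzy's process.extractOne helper scores one description against all candidates and applies a cutoff in a single call; a rough sketch (choices and best_match are illustrative names, and the cutoff of 50 mirrors the threshold above):

from fuzzywuzzy import fuzz, process

# PROD_ID -> description lookup; with a dict of choices, extractOne returns
# (matched description, score, PROD_ID), or None if nothing clears the cutoff.
choices = df2.set_index('PROD_ID')['PROD_DESCRIPTION'].to_dict()

def best_match(desc, cutoff=50):
    return process.extractOne(desc, choices, scorer=fuzz.ratio, score_cutoff=cutoff)

df['best_match'] = df['PRODUCT_DESCRIPTION'].apply(best_match)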
You should be able to iterate over both dataframes and populate either a dict or a third dataframe with your desired information:
d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}

for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        # Append (rather than assign) so every df1/df2 pair is collected.
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...
df3 = pd.DataFrame.from_dict(d)
I don't have enough reputation to comment on the answer from @piRSquared, hence this answer.
The definitions of 'vi' and 'vj' raised an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). It worked when I inserted an underscore, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION')
A similar issue came up for 'set_value', and the same fix worked there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj))
Generating idxmax raised another error (TypeError: reduction operation 'argmax' not allowed for this dtype) because the contents of df3 (the fuzzy ratios) were of type 'object'. Converting them all to numeric just before defining threshold fixed it, e.g. df3 = df3.apply(pd.to_numeric)
A million thanks to @piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.
