How can I merge these two datasets on 'Name' and 'Year'?

How can I merge these two datasets on 'Name' and 'Year'? - python

I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year', 'name'. But the problem is, these both data frames has different names like in the first dataset, it has name 'Glenn Davis' but in second dataset it has 'Glen Davis'.
Now, I want to know that How can I merge both of them using difflib library even it has different names?
Any help will be appreciated ...
Thanks in advance.
I have used this code which I got in a question asked at this platform but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest, If i can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
if cdifflib.CSequenceMatcher(None,comp_a,comp_b).ratio() > .6:
df_b.loc[ixb,'merge_year'] = comp_a # creates a merge key in df_b
if cdifflib.CSequenceMatcher(None,addr_a, addr_b).ratio() > .6:
df_b.loc[ixb,'merge_name'] = addr_a # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')

You can do
import difflib
df_b['name'] = df_b['name'].apply(lambda x: \
difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with closest match from df_a, then do your merge. See also this post.

Let me get to your problem by assuming that you have to make a data set with 2 columns and the 2 columns being 1. 'year' and 2. 'name'
okay
1. we will 1st rename all the names which are wrong
I hope you know all the wrong names from all_batting_statistics_df using this
all_batting_statistics_df.replace(regex=r'^Glen.$', value='Glenn Davis')
once you have corrected all the spellings, choose the smaller one which has the names you know, so it doesn't take long
2. we need both data sets to have the same columns i.e. only 'year' and 'name'
use this to drop the columns we don't need
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'])
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Name','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all the 31 columns so I left them, you have to add to the above code
3. we need to change the column names to look the same i.e. 'year' and 'name' using python dataframe rename
df_new_1 = all_batting_statistics_df(colums={'Year': 'year', 'Name':'name'})
4. next, to merge them
we will use this
all_batsman_df.merge(df_new_1, left_on='year', right_on='name')
FINAL THOUGHTS:
If you don't want to do all this find a way to export the data set to google sheets or microsoft excel and use edit them with those advanced software, if you like pandas then its not that difficult you will find a way, all the best!

Related

Rounding up pandas column to nearest n unit value

I have a 200000 row dataframe that looks like this
df =
index
name
d2b(m)
0
Jon
199.9
1
Amy
29
2
Fyn
19
3
Luc
30
4
And
76
5
Pia
90
I am writing a function to classify the "distance to bus stop (d2b)" column into a new column for every 10 meters, expecting:
index
name
d2b (m)
class (<= x meters)
0
Jon
199.9
200m
1
Amy
29
30m
2
Fyn
19
20m
3
Luc
33
40m
4
And
76
80m
5
Pia
90
90m
Code that works (updated):
numpy.ceil(data["d2b (m)"]/10)*10

This is one way of achieving this:
import math
df['class (<= x meters)'] = math.ceil(df[d2b(m)]/10)*10

pandas/python: replacing categorical values in dataframe through iteration

I created a database and I am trying to substitute the categorical variables with some numerical values
that I calculated via 'pivot'. In my code, I am trying to iterate through the whole dataframe and if the dataframe categorical columns cells have the same values as one of the elements in 'sublist_names', they should be replaced by the element in 'sublist_values' located in the same position as the value in sublist names.
For example, while iterating the dataframe and each of the categorical columns, the first value of column called 'Name' is the string 'tom'. 'tom' is exactly the 7th element in 'sublist_names', which means it should be replaced by the 7th element in 'sublist_values' which is equal to 150.
I was able to obtain all the needed values but when it comes to solving this last task by iterating the whole dataframe instead of working column by column, I am not sure how to do it.
I hope I explained clearly, but for any questions feel free to ask.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = [['tom', 10,6,'brown',200],
['nick', 15,5.10,'red',150],
['juli', 14,5.5,'black',170]
,['peter', 10,6,'blue',290],
['axel', 15,5.10,'yellow',190],
['william', 14,5.5,'yellow',170]
,['tom', 10,6,'orange',100],
['tom', 15,5.10,'brown',150],
['angela', 14,5.5,'black',160]
,['peter', 10,6,'purple',220],
['nick', 15,5.10,'orange',150],
['aroon', 14,5.5,'red',170] ]
df = pd.DataFrame(data, columns=['Name', 'Age','height','color','weight'])
categorical_variables= (df.select_dtypes('object') ) # categorical variables
categ_var_list=(list(categorical_variables))
print(categ_var_list)
condition_pivot_list_names=[]
pivot_values_list=[]
for i in categ_var_list:
condition_pivot = df.pivot_table(index=i, values='weight', aggfunc=np.mean)
pivot_names = (condition_pivot.index.values.tolist())
condition_pivot_list_names.append(pivot_names)
pivot_values_draft = ((condition_pivot.values.tolist()))
pivot_values = [i[0] for i in pivot_values_draft]
pivot_values_list.append(pivot_values)
print(condition_pivot_list_names, 'condition pivot list names')
print(pivot_values_list,'pivot values list')
sublist_names=[(sublists) for sublists in condition_pivot_list_names]
print(sublist_names)
sublist_values=[(sublists1) for sublists1 in pivot_values_list]
print(sublist_values)
def myfunc(x):
if x in sublist_names:
index=sublist_names.index(x)
return sublist_values[index]
return x
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
print(df['Name'])
This is what print( df[name]) shows:
0 tom
1 nick
2 juli
3 peter
4 axel
5 william
6 tom
7 tom
8 angela
9 peter
10 nick
11 aroon
And this is what should show:
0 150
1 150
2 170
3 255
4 190
5 170
6 150
7 150
8 160
9 255
10 150
11 170

You have two categorical values Name and Color. So you cam do something like this.
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
And than you can create a function myfunc() which will receive x from above code. What above code is doing is, it will iterate over the column one by one and pass value of each row one by one to the function. Inside the function you can define the logic to convert the categorical values something like this
def myfunc(x):
if x in sublist_names:
index=sublist_names.index(x)
return sublist_values[index]
return x
Do the same thing for the column Color.

Try this:
df.Name = np.where(df.groupby('Name', as_index=False)['Name'].cumcount().eq(0), df.Name, df.weight)
Output:
Name Age height color weight
0 tom 10 6.0 brown 200
1 nick 15 5.1 red 150
2 juli 14 5.5 black 170
3 peter 10 6.0 blue 290
4 axel 15 5.1 yellow 190
5 william 14 5.5 yellow 170
6 100 10 6.0 orange 100
7 150 15 5.1 brown 150
8 angela 14 5.5 black 160
9 220 10 6.0 purple 220
10 150 15 5.1 orange 150
11 aroon 14 5.5 red 170

Okay I see your problem. Just write the code below before the function declaration.
sub_names=[]
sub_values=[]
for i in sublist_names:
sub_names.extend(i)
for i in sublist_values:
sub_values.extend(i)
Also dont forget to update variable names in myfunc().

How do I calculate an average of a range from a series within in a dataframe?

Im new to Python and working with data manipulation
I have a dataframe
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you observe above, some of the lifespans are in a range like 14--16. The datatype of [Lifespan] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers i.e. 15. I do not want any ranges. Just the average as a single digit. How do I do this?

Using split and expand=True
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
.astype(str).str.split('--', expand=True)
.astype(float).mean(axis=1)
)
df
# Breed Lifespan
# 0 Dog1 12.0
# 1 Dog2 14.5

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, that condition needs to take in input both from the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
if height1 > height2 and weight1 < weight2:
return True
else:
return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo

You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I test it a bit and then change conditions with custom function:
def f(x):
#check boolean mask
#print ((df.height > x.height) & (df.weight < x.weight))
return ((df.height < x.height) & (df.weight > x.weight)).sum()
df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row compare values and for count simply sum values True.

For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)

I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons taller OR heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here for instance, no person is Heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to modify the & above to a | instead or switching out the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)

Getting a ratio in Pandas groupby object

I have a dataframe that looks like this:
I want to create another column called "engaged_percent" for each state which is basically the number of unique engaged_count divided by the user_count of each particular state.
I tried doing the following:
def f(x):
engaged_percent = x['engaged_count'].nunique()/x['user_count']
return pd.Series({'engaged_percent': engaged_percent})
by = df3.groupby(['user_state']).apply(f)
by
But it gave me the following result:
What I want is something like this:
user_state engaged_percent
---------------------------------
California 2/21 = 0.09
Florida 2/7 = 0.28
I think my approach is correct , however I am not sure why my result shows up like the one seen in the second picture.
Any help would be much appreciated! Thanks in advance!

How about:
user_count=df3.groupby('user_state')['user_count'].mean()
#(or however you think a value for each state should be calculated)
engaged_unique=df3.groupby('user_state')['engaged_count'].nunique()
engaged_pct=engaged_unique/user_count
(you could also do this in one line in a bunch of different ways)
Your original solution was almost fine except that you were dividing a value by the entire user count series. So you were getting a Series instead of a value. You could try this slight variation:
def f(x):
engaged_percent = x['engaged_count'].nunique()/x['user_count'].mean()
return engaged_percent
by = df3.groupby(['user_state']).apply(f)
by

I would just use groupby and apply directly
df3['engaged_percent'] = df3.groupby('user_state')
.apply(lambda s: s.engaged_count.nunique()/s.user_count).values
Demo
>>> df3
engaged_count user_count user_state
0 3 21 California
1 3 21 California
2 3 21 California
...
19 4 7 Florida
20 4 7 Florida
21 4 7 Florida
>>> df3['engaged_percent'] = df3.groupby('user_state').apply(lambda s: s.engaged_count.nunique()/s.user_count).values
>>> df3
engaged_count user_count user_state engaged_percent
0 3 21 California 0.095238
1 3 21 California 0.095238
2 3 21 California 0.095238
...
19 4 7 Florida 0.285714
20 4 7 Florida 0.285714
21 4 7 Florida 0.285714

titanic.groupby('Sex')['Fare'].mean()
you can try this example just put your example in that

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I merge these two datasets on 'Name' and 'Year'? - python

You can do import difflib df_b['name'] = df_b['name'].apply(lambda x: \ difflib.get_close_matches(x, df_a['name'])[0]) to replace names in df_b with closest match from df_a, then do your merge. See also this post.

Related

Rounding up pandas column to nearest n unit value

pandas/python: replacing categorical values in dataframe through iteration

How do I calculate an average of a range from a series within in a dataframe?

Pandas - Count the number of rows that would be true for a function - for each input row

Getting a ratio in Pandas groupby object

Categories

Resources