Combining Rows in a DataFrame - python

I have a DF that has the results of a NER classifier such as the following:
df =
s token pred tokenID
17 hakawati B-Loc 3
17 theatre L-Loc 3
17 jerusalem U-Loc 7
56 university B-Org 5
56 of I-Org 5
56 texas I-Org 5
56 here L-Org 6
...
5402 dwight B-Peop 1
5402 d. I-Peop 1
5402 eisenhower L-Peop 1
There are many other columns in this DataFrame that are not relevant. Now I want to group the tokens by their sentence ID (=s) and their predicted tags, combining them into single entities:
df2 =
s token pred
17 hakawati theatre Location
17 jerusalem Location
56 university of texas here Organisation
...
5402 dwight d. eisenhower People
Normally I would do so by simply using a line like
data_map = df.groupby(["s"], as_index=False, sort=False).agg(" ".join) and a rename function. However, since the data contains different kinds of tags (B/I/L/U prefixes combined with Loc/Org/Peop), I don't know exactly how to do it.
Any ideas are appreciated.

One solution via a helper column.
df['pred_cat'] = df['pred'].str.split('-').str[-1]
res = df.groupby(['s', 'pred_cat'])['token']\
        .apply(' '.join).reset_index()
print(res)
s pred_cat token
0 17 Loc hakawati theatre jerusalem
1 56 Org university of texas here
2 5402 Peop dwight d. eisenhower
Note this doesn't match exactly your desired output; there seems to be some data-specific treatment involved.
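If you also want the full category names from the question, a small lookup can be applied afterwards (the mapping below is an assumption based on the three tags visible in the sample):
# Hypothetical mapping from tag suffix to the full labels shown in the question
label_map = {'Loc': 'Location', 'Org': 'Organisation', 'Peop': 'People'}
res['pred_cat'] = res['pred_cat'].map(label_map).fillna(res['pred_cat'])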

You could group by both s and tokenID and aggregate like so:
import pandas as pd

def aggregate(df):
    token = " ".join(df.token)
    pred = df.iloc[0].pred.split("-", 1)[1]
    return pd.Series({"token": token, "pred": pred})

df.groupby(["s", "tokenID"]).apply(aggregate)
# Output
token pred
s tokenID
17 3 hakawati theatre Loc
7 jerusalem Loc
56 5 university of texas Org
6 here Org
5402 1 dwight d. eisenhower Peop
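On pandas 0.25 or newer, the same aggregation can also be written with named aggregation, avoiding the per-group Series construction (a sketch under the same column names):
out = (df.groupby(["s", "tokenID"])
         .agg(token=("token", " ".join),
              pred=("pred", lambda p: p.iloc[0].split("-", 1)[1])))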

Related

How can I merge these two datasets on 'Name' and 'Year'?

I am new to this field and stuck on this problem. I have two datasets:
all_batsman_df, which has 5 columns ('years', 'team', 'pos', 'name', 'salary'):
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, which has 31 columns:
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
How can I merge them using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used the following code, which I found in another question on this platform, but it is not working for me. It adds merge-key columns after matching names across the two datasets. I know this is not a good approach; kindly suggest a better way.
import pandas as pd
import cdifflib

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_year'], how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(lambda x: \
    difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with the closest match from df_a, then do your merge. See also this post.
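One caveat: get_close_matches returns an empty list when nothing clears its cutoff, so indexing [0] can raise an IndexError. A defensive sketch (closest_name is a helper introduced here, not part of the original answer):
import difflib

def closest_name(x, candidates, cutoff=0.6):
    # Fall back to the original name when no candidate is close enough
    matches = difflib.get_close_matches(x, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else x

df_b['name'] = df_b['name'].apply(lambda x: closest_name(x, list(df_a['name'])))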
Let me approach your problem by assuming that you need a data set with just 2 columns, 'year' and 'name'.
1. First, rename all the names that are wrong.
Assuming you know which names in all_batting_statistics_df are misspelled, you can fix them like this:
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, work from the smaller dataset that has the names you know, so it doesn't take long.
2. We need both data sets to have the same key columns, i.e. 'year' and 'name'.
Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'], axis=1)
I cannot see all the 31 columns, so you will have to add the rest to the code above.
3. We need to make the column names match, i.e. 'year' and 'name', using DataFrame rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, merge them on both keys (after also renaming 'years' to 'year' in the first frame):
df_final = all_batsman_df_1.rename(columns={'years': 'year'}).merge(df_new_1, on=['year', 'name'])
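A compact sketch of the same merge that keeps all columns (column names assumed from the question; the fuzzy matching from the other answers can be applied first):
# Fix known misspellings first (pattern assumed; adjust to your data)
fixed = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
# Align the key column names, then merge on both keys
left = all_batsman_df.rename(columns={'years': 'year'})
right = fixed.rename(columns={'Year': 'year', 'Name': 'name'})
merged = left.merge(right, on=['year', 'name'])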
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools. If you like pandas, it's not that difficult; you will find a way. All the best!

Writing own custom aggregation function for groupby

I have a Data Set that is available here
It gives us a DataFrame like
df=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
df.head()
user_id age gender occupation zip_code
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
I want to find out the ratio of males to females in each occupation.
I have used the expression below, but it is not the most optimal approach:
df.groupby(['occupation', 'gender']).agg({'gender':'count'}).div(df.groupby('occupation').agg('count'), level='occupation')['gender']*100
That gives us a result like
occupation gender
administrator F 45.569620
M 54.430380
artist F 46.428571
M 53.571429
The above output is in a very different format from what I want, which is something like this (demo):
occupation M:F
programmer 2:3
farmer 7:2
Can somebody please tell me how to write my own aggregation functions?
Actually, pandas has the built-in value_counts(normalize=True) for computing normalized counts. Then you can massage the numbers a bit:
new_df = (df.groupby('occupation')['gender']
          .value_counts(normalize=True)   # this gives normalized counts: 0.45
          .unstack('gender', fill_value=0)
          .round(2)                       # keep two digits
          .mul(100)                       # get the percentage
          .astype(int)                    # get rid of .0000
          .astype(str)                    # turn to string
          )
new_df['F:M'] = new_df['F'] + ':' + new_df['M']
new_df.head()
Output:
gender F M F:M
occupation
administrator 46 54 46:54
artist 46 54 46:54
doctor 0 100 0:100
educator 27 73 27:73
engineer 3 97 3:97
It is pretty easy, actually. Every group after groupby is a dataframe (a part of the initial dataframe), so you can apply your own functions to process each partial dataframe. You can add print statements inside compute_gender_ratio to see what df is.
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
    sep='|')

def compute_gender_ratio(df):
    gender_count = df['gender'].value_counts()
    return f"{gender_count.get('M', 0)}:{gender_count.get('F', 0)}"

result = data.groupby('occupation').apply(compute_gender_ratio)
result_df = result.to_frame(name='M:F')
result_df is:
M:F
occupation
administrator 43:36
artist 15:13
doctor 7:0
educator 69:26
engineer 65:2
entertainment 16:2
executive 29:3
healthcare 5:11
homemaker 1:6
lawyer 10:2
librarian 22:29
marketing 16:10
none 5:4
other 69:36
programmer 60:6
retired 13:1
salesman 9:3
scientist 28:3
student 136:60
technician 26:1
writer 26:19
Does this work for you?
from fractions import Fraction

df_g = df.groupby(['occupation', 'gender']).count().user_id / df.groupby(['occupation']).count().user_id
df_g = df_g.reset_index()
df_g['ratio'] = df_g['user_id'].apply(lambda x: str(Fraction(x).limit_denominator()).replace('/', ':'))
Output
occupation gender user_id ratio
0 administrator F 0.455696 36:79
1 administrator M 0.544304 43:79
2 artist F 0.464286 13:28
3 artist M 0.535714 15:28
4 doctor M 1.000000 1
5 educator F 0.273684 26:95
6 educator M 0.726316 69:95
7 engineer F 0.029851 2:67
8 engineer M 0.970149 65:67
9 entertainment F 0.111111 1:9
10 entertainment M 0.888889 8:9
11 executive F 0.093750 3:32
12 executive M 0.906250 29:32
13 healthcare F 0.687500 11:16
14 healthcare M 0.312500 5:16
15 homemaker F 0.857143 6:7
16 homemaker M 0.142857 1:7
17 lawyer F 0.166667 1:6
18 lawyer M 0.833333 5:6
19 librarian F 0.568627 29:51
20 librarian M 0.431373 22:51
21 marketing F 0.384615 5:13
22 marketing M 0.615385 8:13
23 none F 0.444444 4:9
24 none M 0.555556 5:9
25 other F 0.342857 12:35
26 other M 0.657143 23:35
27 programmer F 0.090909 1:11
28 programmer M 0.909091 10:11
29 retired F 0.071429 1:14
30 retired M 0.928571 13:14
31 salesman F 0.250000 1:4
32 salesman M 0.750000 3:4
33 scientist F 0.096774 3:31
34 scientist M 0.903226 28:31
35 student F 0.306122 15:49
36 student M 0.693878 34:49
37 technician F 0.037037 1:27
38 technician M 0.962963 26:27
39 writer F 0.422222 19:45
40 writer M 0.577778 26:45

Blank quotes usage in dataframe

I am trying to combine OR | with df.loc to extract data. The code I have written extracts everything in the csv file. Here is the original csv file: https://drive.google.com/open?id=16eo29mF0pn_qNw-BGpZyVM9PBxv2aN1G
import pandas as pd
df = pd.read_csv("yelp_business.csv")
df = df.loc[(df['categories'].str.contains('chinese', case=False)) |
            (df['name'].str.contains('subway', case=False)) |
            (df['categories'].str.contains('', case=False)) |
            (df['address'].str.contains('', case=False))]
print(df)
It looks like the blank quotes '' are not working in str.contains, or the OR | doesn't work in df.loc. Instead of returning only the rows with Chinese restaurants (4,171 of them) and the rows with the restaurant name Subway, it returns all 174,568 rows.
EDITED
The output I want is all rows of category chinese and all rows of name subway, taking into consideration that the address might have no assigned value or be null.
import pandas as pd
df = pd.read_csv("yelp_business.csv")
cusine = 'chinese'
name = 'subway'
address #address has no assigned value or is NULL
df = df.loc[(df['categories'].str.contains(cusine, case=False)) |
            (df['name'].str.contains(name, case=False)) |
            (df['address'].str.contains(address, case=False))]
print(df)
This code gives me an error NameError: name 'address' is not defined.
I think it is possible to chain conditions by | here. Note that str.contains('') matches every row, because every string contains the empty string; to find a literal empty quoted value, use the pattern ^""$, which matches a pair of double quotes at the start and end of the string:
df = pd.read_csv("yelp_business.csv")
df1 = df.loc[(df['categories'].str.contains('chinese|^""$', case=False)) |
             (df['name'].str.contains('subway', case=False)) |
             (df['address'].str.contains('^""$', case=False))]
print (len(df1))
11320
print (df1.head())
business_id name neighborhood \
9 TGWhGNusxyMaA4kQVBNeew "Detailing Gone Mobile" NaN
53 4srfPk1s8nlm1YusyDUbjg "Subway" Southeast
57 spDZkD6cp0JUUm6ghIWHzA "Kitchen M" Unionville
63 r6Jw8oRCeumxu7Y1WRxT7A "D&D Cleaning" NaN
88 YhV93k9uiMdr3FlV4FHjwA "Caviness Studio" NaN
address city state postal_code latitude \
9 ***"" Henderson NV 89014 36.055825
53 "6889 S Eastern Ave, Ste 101" Las Vegas NV 89119 36.064652
57 "8515 McCowan Road" Markham ON L3P 5E5 43.867918
63 ***"" Urbana IL 61802 40.110588
88 ***"" Phoenix AZ 85001 33.449967
longitude stars review_count is_open \
9 -115.046350 5.0 7 1
53 -115.118954 2.5 6 1
57 -79.283687 3.0 80 1
63 -88.207270 5.0 4 0
88 -112.070223 5.0 4 1
categories
9 Automotive;Auto Detailing
53 Fast Food;Restaurants;Sandwiches
57 Restaurants;Chinese
63 Home Cleaning;Home Services;Window Washing
88 Marketing;Men's Clothing;Restaurants;Graphic D...
EDIT: If need filter out empty and NaNs values:
df2 = df.loc[((df['categories'].str.contains('chinese', case=False)) |
              (df['name'].str.contains('subway', case=False))) &
             ~((df['address'] == '""') | (df['categories'] == '""'))]
print (df2.head())
business_id name neighborhood \
53 4srfPk1s8nlm1YusyDUbjg "Subway" Southeast
57 spDZkD6cp0JUUm6ghIWHzA "Kitchen M" Unionville
96 dTWfATVrBfKj7Vdn0qWVWg "Flavor Cuisine" Scarborough
126 WUiDaFQRZ8wKYGLvmjFjAw "China Buffet" University City
145 vzx1WdVivFsaN4QYrez2rw "Subway" NaN
address city state postal_code \
53 "6889 S Eastern Ave, Ste 101" Las Vegas NV 89119
57 "8515 McCowan Road" Markham ON L3P 5E5
96 "8 Glen Watford Drive" Toronto ON M1S 2C1
126 "8630 University Executive Park Dr" Charlotte NC 28262
145 "5111 Boulder Hwy" Las Vegas NV 89122
latitude longitude stars review_count is_open \
53 36.064652 -115.118954 2.5 6 1
57 43.867918 -79.283687 3.0 80 1
96 43.787061 -79.276166 3.0 6 1
126 35.306173 -80.752672 3.5 76 1
145 36.112895 -115.062353 3.0 3 1
categories
53 Fast Food;Restaurants;Sandwiches
57 Restaurants;Chinese
96 Restaurants;Chinese;Food Court
126 Buffets;Restaurants;Sushi Bars;Chinese
145 Sandwiches;Restaurants;Fast Food
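A related caveat, as a sketch on the same frame: if address contains real NaN values rather than the literal '""' string, str.contains propagates NaN into the mask. Passing na=False treats missing values as non-matches, and isna() selects them explicitly:
mask = (df['categories'].str.contains('chinese', case=False, na=False) |
        df['name'].str.contains('subway', case=False, na=False))
df3 = df.loc[mask & ~(df['address'].isna() | (df['address'] == '""'))]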
Detailed information about contains is available at
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html

Getting a ratio in Pandas groupby object

I have a dataframe with the columns engaged_count, user_count and user_state.
I want to create another column called "engaged_percent" for each state, which is basically the number of unique engaged_count values divided by the user_count of that particular state.
I tried doing the following:
def f(x):
    engaged_percent = x['engaged_count'].nunique()/x['user_count']
    return pd.Series({'engaged_percent': engaged_percent})
by = df3.groupby(['user_state']).apply(f)
by
But instead of a single value per state, it gave me a whole Series for each state.
What I want is something like this:
user_state engaged_percent
---------------------------------
California 2/21 = 0.09
Florida 2/7 = 0.28
I think my approach is correct; however, I am not sure why my result comes out in that shape.
Any help would be much appreciated! Thanks in advance!
How about:
user_count=df3.groupby('user_state')['user_count'].mean()
#(or however you think a value for each state should be calculated)
engaged_unique=df3.groupby('user_state')['engaged_count'].nunique()
engaged_pct=engaged_unique/user_count
(you could also do this in one line in a bunch of different ways)
Your original solution was almost fine except that you were dividing a value by the entire user count series. So you were getting a Series instead of a value. You could try this slight variation:
def f(x):
    engaged_percent = x['engaged_count'].nunique()/x['user_count'].mean()
    return engaged_percent
by = df3.groupby(['user_state']).apply(f)
by
I would just use groupby and apply directly
df3['engaged_percent'] = (df3.groupby('user_state')
                          .apply(lambda s: s.engaged_count.nunique()/s.user_count).values)
Demo
>>> df3
engaged_count user_count user_state
0 3 21 California
1 3 21 California
2 3 21 California
...
19 4 7 Florida
20 4 7 Florida
21 4 7 Florida
>>> df3['engaged_percent'] = df3.groupby('user_state').apply(lambda s: s.engaged_count.nunique()/s.user_count).values
>>> df3
engaged_count user_count user_state engaged_percent
0 3 21 California 0.095238
1 3 21 California 0.095238
2 3 21 California 0.095238
...
19 4 7 Florida 0.285714
20 4 7 Florida 0.285714
21 4 7 Florida 0.285714
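A variant that avoids relying on positional .values alignment is to compute one ratio per state and map it back onto the rows (a sketch with the same column names):
ratio = df3.groupby('user_state').apply(
    lambda s: s['engaged_count'].nunique() / s['user_count'].mean())
df3['engaged_percent'] = df3['user_state'].map(ratio)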
You can try this example and substitute your own columns into it:
titanic.groupby('Sex')['Fare'].mean()

How to find the common values in a particular column in a particular table and display the intersected output

Table
Roll Class Country Rights CountryAcc
1 x IND 23 US
1 x1 IND 32 Ind
2 s US 12 US
3 q IRL 33 CA
4 a PAK 12 PAK
4 e PAK 12 IND
5 f US 21 CA
5 g US 31 PAK
6 h US 21 BAN
I want to display only those Rolls whose CountryAcc is never US or CA. For example: Roll 1 has one row with CountryAcc US, so I don't want its other row with CountryAcc Ind; the same goes for Roll 5, which has one row with CountryAcc CA. So my final output would be:
Roll Class Country Rights CountryAcc
4 a PAK 12 PAK
4 e PAK 12 IND
6 h US 21 BAN
I tried getting that output following way:
Home_Country = ['US', 'CA']
#First I saved two countries in a variable
Account_Other_Count = df.loc[~df.CountryAcc.isin(Home_Country)]
Account_Other_Count_Var = df.loc[~df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
# Then I made two variables one with CountryAcc in US or CA and other variable with remaining and I got their Roll
Account_Home_Count = df.loc[df.CountryAcc.isin(Home_Country)]
Account_Home_Count_Var = df.loc[df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
#Here I got the common Rolls
Common_ROLL = list(set(Account_Home_Count_Var).intersection(list(Account_Other_Count_Var)))
Final_Output = Account_Other_Count.loc[~Account_Other_Count.Roll.isin(Common_ROLL)]
Is there a better, more pandas-idiomatic or pythonic way to do it?
One solution could be
In [37]: df.loc[~df['Roll'].isin(df.loc[df['CountryAcc'].isin(['US', 'CA']), 'Roll'])]
Out[37]:
Roll Class Country Rights CountryAcc
4 4 a PAK 12 PAK
5 4 e PAK 12 IND
8 6 h US 21 BAN
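The same idea reads a bit more clearly when split into two steps (a sketch using the same frame):
bad_rolls = df.loc[df['CountryAcc'].isin(['US', 'CA']), 'Roll']
result = df[~df['Roll'].isin(bad_rolls)]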
Another way is to drop every Roll that has any US or CA row, using groupby().filter:
sortdata = df.groupby('Roll').filter(lambda g: not g['CountryAcc'].isin(['US', 'CA']).any())
