How to do keyword mapping in pandas - python

I have a list of keywords:
India
Japan
United States
Germany
China
Here's a sample dataframe:
id Address
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2 Arcisstraße 21, 80333 München, Germany
3 Liberty Street, Manhattan, New York, United States
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5 Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India
My goal is to make:
id Address India Japan United States Germany China
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan 0 1 0 0 0
2 Arcisstraße 21, 80333 München, Germany 0 0 0 1 0
3 Liberty Street, Manhattan, New York, USA 0 0 1 0 0
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 0 0 0 0 1
5 Vaishnavi Summit,80feet Road,Bangalore, Karnataka, India 1 0 0 0 0
The basic idea is to create a keyword detector. I am thinking of using str.contains and word2vec, but I can't work out the logic.

Make use of pd.get_dummies():
countries = df.Address.str.extract('(India|Japan|United States|Germany|China)', expand=False)
dummies = pd.get_dummies(countries)
pd.concat([df, dummies], axis=1)
Also, the most straightforward way is to have the countries in a list and use a for loop, say
countries = ['India','Japan','United States','Germany','China']
for c in countries:
df[c] = df.Address.str.contains(c) * 1
but it can be slow if you have a lot of data and countries.
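Put together, both approaches can be sketched end-to-end (the small sample dataframe below is rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'Address': ['Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan',
                'Arcisstraße 21, 80333 München, Germany'],
})

# Approach 1: extract the first matching keyword, then one-hot encode it
countries = df.Address.str.extract('(India|Japan|United States|Germany|China)',
                                   expand=False)
out1 = pd.concat([df, pd.get_dummies(countries)], axis=1)

# Approach 2: one str.contains per keyword, 0/1 via multiplication
out2 = df.copy()
for c in ['India', 'Japan', 'United States', 'Germany', 'China']:
    out2[c] = out2.Address.str.contains(c) * 1
```

Note that approach 1 only records the first keyword found per row, while approach 2 flags every keyword that occurs anywhere in the address.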

In [58]: df = df.join(df.Address.str.extract(r'.*,(.*)', expand=False).str.get_dummies())
In [59]: df
Out[59]:
id Address China Germany India Japan United States
0 1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, J... 0 0 0 1 0
1 2 Arcisstraße 21, 80333 München, Germany 0 1 0 0 0
2 3 Liberty Street, Manhattan, New York, United St... 0 0 0 0 1
3 4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 1 0 0 0 0
4 5 Vaishnavi Summit,80feet Road,3rd Block,Bangalo... 0 0 1 0 0
NOTE: this method will not work if the country is not in the last position in the Address column, or if the country name itself contains a comma.

from numpy.core.defchararray import find
kw = 'India|Japan|United States|Germany|China'.split('|')
a = df.Address.values.astype(str)[:, None]
df.join(
    pd.DataFrame(
        find(a, kw) >= 0,
        df.index, kw,
        dtype=int
    )
)
id Address India Japan United States Germany China
0 1 Chome-2-8 Shibakoen, Minat... 0 1 0 0 0
1 2 Arcisstraße 21, 80333 Münc... 0 0 0 1 0
2 3 Liberty Street, Manhattan,... 0 0 1 0 0
3 4 30 Shuangqing Rd, Haidian ... 0 0 0 0 1
4 5 Vaishnavi Summit,80feet Ro... 1 0 0 0 0
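A similar wide comparison can be written with plain str.contains per keyword; this is a sketch of an alternative, since numpy.core.defchararray is deprecated in recent NumPy:

```python
import pandas as pd

df = pd.DataFrame({'Address': ['Minato, Tokyo, Japan',
                               'Beijing Shi, China']})
kw = 'India|Japan|United States|Germany|China'.split('|')

# One indicator column per keyword, without the deprecated
# numpy.core.defchararray module; regex=False treats each
# keyword as a literal substring
flags = pd.DataFrame({k: df.Address.str.contains(k, regex=False).astype(int)
                      for k in kw}, index=df.index)
out = df.join(flags)
```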

Related

How can I group multiple columns in a Data Frame?

I don't know if this is possible but I have a data frame like this one:
df
State County Homicides Man Woman Not_Register
Gto Celaya 2 2 0 0
NaN NaN 8 4 2 2
NaN NaN 3 2 1 0
NaN Yiriria 2 1 1 0
NaN Acambaro 1 1 0 0
Sin Culiacan 3 1 1 1
NaN NaN 5 4 0 1
Chih Juarez 1 1 0 0
I want to group by State and County and sum Homicides, Man, Woman and Not_Register, like this:
State County Homicides Man Woman Not_Register
Gto Celaya 13 8 3 2
Gto Yiriria 2 1 1 0
Gto Acambaro 1 1 0 0
Sin Culiacan 8 5 1 2
Chih Juarez 1 1 0 0
So far, I have been able to group by State and County and fill the NaN rows with the right name of the County and State. My result and code:
import numpy as np
import math
df = df.fillna(method='pad')  # repeat the State and County names in the right order
#To group
df = df.groupby(["State","County"]).agg('sum')
df = df.reset_index()
df
State County Homicides
Gto Celaya 13
Gto Yiriria 2
Gto Acambaro 1
Sin Culiacan 8
Chih Juarez 1
But when I tried to add Man and Woman:
df1 = df.groupby(["State", "County", "Man", "Women", "Not_Register"]).agg('sum')
df1 = df1.reset_index()
df1
my result repeats the counties instead of giving me a unique County per State.
How can I resolve this issue?
Thanks for your help!
Change it to:
df[['Homicides','Man','Woman','Not_Register']]=df[['Homicides','Man','Woman','Not_Register']].apply(pd.to_numeric,errors = 'coerce')
df = df.groupby(['State',"County"]).sum().reset_index()
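A runnable sketch of the full pipeline on a reduced version of the question's data (note that fillna(method='pad') is spelled ffill() in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State':  ['Gto', np.nan, np.nan, 'Sin'],
    'County': ['Celaya', np.nan, 'Yiriria', 'Culiacan'],
    'Homicides': [2, 8, 2, 3],
    'Man': [2, 4, 1, 1],
})

# Forward-fill the State/County labels, coerce the count columns
# to numeric, then sum per (State, County) group
df[['State', 'County']] = df[['State', 'County']].ffill()
df[['Homicides', 'Man']] = df[['Homicides', 'Man']].apply(pd.to_numeric,
                                                          errors='coerce')
out = df.groupby(['State', 'County']).sum().reset_index()
```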

How to use beautifulsoup and pandas to scrape data from a dataframe with a date filter?

I'm new to python and I'm looking to scrape data from a website. The issue is it has a date filter and I'm struggling to find how to extract multiple dates. Is there any good resource on this or does anyone have suggestions on how it could be done? I can't seem to find what I need online.
My code which extracts what's shown for today:
res = requests.get("https://www.inmo.ie/Trolley_Ward_Watch")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))
The data is loaded via JavaScript, but you can simulate the AJAX call with the requests library, for example (change the DateTrolley parameter to the required date):
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.inmo.ie/Trolley_Ward_Watch'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
form_url = 'https://www.inmo.ie' + soup.form['action']
data = {'DateTrolley': '01/05/2020'} # <-- change it to eg. 05/05/2020 to get other date
soup = BeautifulSoup(requests.post(form_url, data=data).content, 'html.parser')
df = pd.read_html(str(soup.table))
print(df)
Prints:
[ Date Hospital Region Trolley Total Ward Total Total
0 01/05/2020 Beaumont Hospital Eastern 0 0 0
1 01/05/2020 Connolly Hospital, Blanchardstown Eastern 0 0 0
2 01/05/2020 Connolly Hospital, Blanchardstown Eastern 0 0 0
3 01/05/2020 Mater Misericordiae University Hospital Eastern 0 0 0
4 01/05/2020 Naas General Hospital Eastern 0 0 0
5 01/05/2020 St James' Hospital Eastern 2 0 2
6 01/05/2020 St Vincent's University Hospital Eastern 0 0 0
7 01/05/2020 Tallaght University Hospital Eastern 1 0 1
8 01/05/2020 Bantry General Hospital Country 0 0 0
9 01/05/2020 Cavan General Hospital Country 2 0 2
10 01/05/2020 Cork University Hospital Country 2 0 2
11 01/05/2020 Letterkenny University Hospital Country 0 0 0
12 01/05/2020 Mayo University Hospital Country 0 0 0
13 01/05/2020 Mercy University Hospital, Cork Country 0 0 0
14 01/05/2020 Mid Western Regional Hospital, Ennis Country 0 0 0
15 01/05/2020 Midland Regional Hospital, Mullingar Country 1 0 1
16 01/05/2020 Midland Regional Hospital, Portlaoise Country 0 0 0
17 01/05/2020 Midland Regional Hospital, Tullamore Country 0 0 0
18 01/05/2020 Nenagh General Hospital Country 0 1 1
19 01/05/2020 Our Lady of Lourdes Hospital, Drogheda Country 0 0 0
20 01/05/2020 Our Lady's Hospital, Navan Country 0 0 0
21 01/05/2020 Portiuncula University Hospital Country 0 0 0
22 01/05/2020 Sligo University Hospital Country 0 0 0
23 01/05/2020 South Tipperary General Hospital Country 0 0 0
24 01/05/2020 St Lukes Hospital, Kilkenny Country 0 0 0
25 01/05/2020 University College Hospital Galway Country 0 0 0
26 01/05/2020 University Hospital Kerry Country 0 0 0
27 01/05/2020 University Hospital Waterford Country 0 0 0
28 01/05/2020 University Hospital, Limerick Country 8 0 8
29 01/05/2020 Wexford General Hospital Country 1 0 1]
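To cover several dates, the same POST can be repeated in a loop. A sketch of the payload-building part (build_payloads is a hypothetical helper name; the DateTrolley key and dd/mm/YYYY format come from the answer above), with the actual network calls and table concatenation left as in that answer:

```python
import pandas as pd

def build_payloads(start, end):
    """Build one form payload per day in [start, end]."""
    dates = pd.date_range(start, end, freq='D')
    return [{'DateTrolley': d.strftime('%d/%m/%Y')} for d in dates]

payloads = build_payloads('2020-05-01', '2020-05-03')
# Each payload can then be sent with requests.post(form_url, data=payload),
# the resulting tables parsed with pd.read_html and combined with pd.concat.
```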

Calculate nonzeros percentage for specific columns and each row in Pandas

If I have the following dataframe:
df = pd.DataFrame({'name':['john','mary','peter','jeff','bill','lisa','jose'], 'gender':['M','F','M','M','M','F','M'],'state':['california','dc','california','dc','california','texas','texas'],'num_children':[2,0,0,3,2,1,4],'num_pets':[5,1,0,5,2,2,3]})
name gender state num_children num_pets
0 john M california 2 5
1 mary F dc 0 1
2 peter M california 0 0
3 jeff M dc 3 5
4 bill M california 2 2
5 lisa F texas 1 2
6 jose M texas 4 3
I want to create a new row and a new column pct. holding the percentage of zero values in the columns num_children and num_pets.
Expected output:
name gender state num_children num_pets pct.
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
I have calculated the percentage of zeros in each row for the target columns:
df['pct'] = df[['num_children', 'num_pets']].astype(bool).sum(axis=1)/2
df['pct.'] = 1-df['pct']
del df['pct']
df['pct.'] = pd.Series(["{0:.0f}%".format(val * 100) for val in df['pct.']], index = df.index)
name gender state num_children num_pets pct.
0 john M california 2 5 0%
1 mary F dc 0 1 50%
2 peter M california 0 0 100%
3 jeff M dc 3 5 0%
4 bill M california 2 2 0%
5 lisa F texas 1 2 0%
6 jose M texas 4 3 0%
But I don't know how to insert the results below into the pct. row of the expected output. Please help me get the expected result in a more Pythonic way. Thanks.
df[['num_children', 'num_pets']].astype(bool).sum(axis=0)/len(df.num_children)
Out[153]:
num_children 0.714286
num_pets 0.857143
dtype: float64
UPDATE: the same thing, but for calculating sums (many thanks to @jezrael):
df['sums'] = df[['num_children', 'num_pets']].sum(axis=1)
df1 = (df[['num_children', 'num_pets']].sum()
         .to_frame()
         .T
         .assign(name='sums'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
               ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets sums
0 sums 12 18
1 john M california 2 5 7
2 mary F dc 0 1 1
3 peter M california 0 0 0
4 jeff M dc 3 5 8
5 bill M california 2 2 4
6 lisa F texas 1 2 3
7 jose M texas 4 3 7
You can use mean with a boolean mask, comparing to 0 with DataFrame.eq (because sum/len = mean by definition), then multiply by 100 and add the percent sign with apply:
s = df[['num_children', 'num_pets']].eq(0).mean(axis=1)
df['pct'] = s.mul(100).apply("{0:.0f}%".format)
For first row create new DataFrame with same columns like original and concat together:
df1 = (df[['num_children', 'num_pets']].eq(0)
         .mean()
         .mul(100)
         .apply("{0:.1f}%".format)
         .to_frame()
         .T
         .assign(name='pct.'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
               ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets pct
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
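Both steps of the answer can be put together as a runnable sketch on a reduced version of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'mary', 'peter'],
                   'num_children': [2, 0, 0],
                   'num_pets': [5, 1, 0]})

# Row-wise share of zeros across the two target columns
s = df[['num_children', 'num_pets']].eq(0).mean(axis=1)
df['pct.'] = s.mul(100).apply('{0:.0f}%'.format)

# Column-wise share of zeros, shaped into a one-row frame and prepended
df1 = (df[['num_children', 'num_pets']].eq(0)
         .mean()
         .mul(100)
         .apply('{0:.1f}%'.format)
         .to_frame()
         .T
         .assign(name='pct.'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
               ignore_index=True, sort=False)
```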

Split ("extract") Columns using Pandas [duplicate]

This question already has answers here:
Python Pandas: create a new column for each different value of a source column (with boolean output as column values)
(4 answers)
Closed 4 years ago.
I currently have a column called Country that can have a value of USA, Canada, Japan. For example:
Country
-------
Japan
Japan
USA
....
Canada
I want to split ("extract") the values into three individual columns (Country_USA, Country_Canada, and Country_Japan), and basically, a column will have a value of 1 if it matches the original value from the Country column. For example:
Country --> Country_Japan Country_USA Country_Canada
------- ------------- ----------- ---------------
Japan 1 0 0
USA 0 1 0
Japan 1 0 0
....
Is there a simple (non-tedious) way to do this using Pandas / Python 3.x? Thanks!
Use join with get_dummies and with add_prefix:
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Demo:
df=pd.DataFrame({'Country':['Japan','USA','Japan','Canada']})
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0
Better version, thanks to Scott:
print(df.join(pd.get_dummies(df)))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0
Another good version from Scott:
print(df.assign(**pd.get_dummies(df)))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0

How to group rows so as to use value_counts on the created groups with pandas?

I have some customer data such as this in a data frame:
S No Country Sex
1 Spain M
2 Norway F
3 Mexico M
...
I want to have an output such as this:
Spain
M = 1207
F = 230
Norway
M = 33
F = 102
...
My basic notion is to group my rows by country with something like df.groupby(df.Country), and then run something like df.Sex.value_counts() on each group.
Thanks!
I think you need crosstab:
df = pd.crosstab(df.Sex, df.Country)
Or, if you want to use your own solution, add unstack to move the first level of the MultiIndex into the columns:
df = df.groupby(df.Country).Sex.value_counts().unstack(level=0, fill_value=0)
print (df)
Country Mexico Norway Spain
Sex
F 0 1 0
M 1 0 1
EDIT:
If you want to add more columns, you can choose which level is converted to columns via the level parameter:
df1 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=0, fill_value=0).reset_index()
print (df1)
No Country Sex 1 2 3
0 Mexico M 0 0 1
1 Norway F 0 1 0
2 Spain M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=1, fill_value=0).reset_index()
print (df2)
Country No Sex Mexico Norway Spain
0 1 M 0 0 1
1 2 F 0 1 0
2 3 M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=2, fill_value=0).reset_index()
print (df2)
Sex No Country F M
0 1 Spain 0 1
1 2 Norway 1 0
2 3 Mexico 0 1
You can also use pandas.pivot_table:
res = df.pivot_table(index='Country', columns='Sex', aggfunc='count', fill_value=0)
print(res)
SNo
Sex F M
Country
Mexico 0 1
Norway 1 0
Spain 0 1
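A minimal runnable sketch of the crosstab and groupby + unstack variants on the question's sample rows:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Spain', 'Norway', 'Mexico'],
                   'Sex': ['M', 'F', 'M']})

# Sex counts per country, with countries as columns
ct = pd.crosstab(df.Sex, df.Country)

# The same table via groupby + value_counts + unstack
gb = df.groupby(df.Country).Sex.value_counts().unstack(level=0, fill_value=0)
```

Both give a Sex-by-Country table of counts; crosstab is simply the more direct spelling.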