How can I convert an unstructured string to a dataframe? - python

I have a long text string that I would like to convert to a dataframe for analysis. Please see a sample of the data below. I would like the columns to be "Facility", "Street", "City", "Phone", and "Store Hours".
string = AlaskaUSCG Base Ketchikan 1300 Stedman Street Ketchikan, AK (907) 228-0250 Mon-Fri 7:30am-5pm | Sat 10am-4pm | Closed Sunday USCG Base Kodiak Albatros Avenue, Building 26 (2nd Floor) Kodiak, AK (907) 487-5773 USCG Base Kodiak Albatros Avenue, Building 26 (1st Floor) Kodiak, AK (907) 487-5773 Mon-Fri: 7am-9pm | Sat: 9am-9pm |
I have used StringIO to convert it to a dataframe, but that produces a dataframe with 0 rows and 1000 columns. Instead I would like the columns mentioned above, with a row for each store.
I expect it to look like this with the data populated as rows:
Facility Street City Phone
Alaska USCG Base Ketchikan 1300 Stedman Street Ketchikan, AK (907) 228 0250

You may use simple web-scraping tools such as bs4 and requests.
import bs4
import requests

r = requests.get(URL)  # URL of the page the text was copied from
b = bs4.BeautifulSoup(r.text, "html.parser")
addresses = []
for val in b.find_all(name='p'):
    s = list(val.stripped_strings)
    if s and not s[0].startswith('HOURS'):
        addresses.append(' '.join(s[:-1]))
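If you only have the raw string and no URL to scrape, a regex-based split is one option. Below is a minimal, hedged sketch: it assumes every facility name starts with "USCG", anchors each record on the (xxx) xxx-xxxx phone pattern, keeps Facility and Street together (they are hard to separate reliably in this sample), and drops the leading "Alaska" state header for clarity.
import re
import pandas as pd

raw = ("USCG Base Ketchikan 1300 Stedman Street Ketchikan, AK (907) 228-0250 "
       "Mon-Fri 7:30am-5pm | Sat 10am-4pm | Closed Sunday "
       "USCG Base Kodiak Albatros Avenue, Building 26 (2nd Floor) Kodiak, AK (907) 487-5773")

rows = []
# each record starts with "USCG" in this sample, so split on that token
for rec in re.split(r"(?=USCG )", raw):
    rec = rec.strip()
    if not rec:
        continue
    # "City, ST" immediately followed by a phone number, then any hours text
    m = re.search(
        r"(?P<City>\S+, [A-Z]{2}) (?P<Phone>\(\d{3}\) \d{3}-\d{4})\s*(?P<StoreHours>.*)$",
        rec,
    )
    if not m:
        continue
    rows.append({
        # Facility and Street are ambiguous to split, so they stay in one column here
        "FacilityStreet": rec[:m.start()].strip(),
        **m.groupdict(),
    })

df = pd.DataFrame(rows)
print(df)
Splitting "FacilityStreet" further would need an extra assumption about your data, for example that the street always begins with a house number.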

Related

Parse *.txt file looping with a dict comprehension

I have a *.txt file coming from a SQL query organised in rows.
I'm reading it with pandas library through:
df = pd.read_csv('./my_file_path/my_file.txt', sep='\n', header=0)
df.rename(columns = {list(df.columns)[0]: 'cols'}, inplace = True)
The output is rows with the information separated by spaces in a standard structure (dots are meant to be spaces):
name................address........country..........age
0 Bob.Hope............Broadway.......United.States....101
1 Richard.Donner......Park.Avenue....United.States.....76
2 Oscar.Meyer.........Friedrichshain.Germany...........47
I tried to create a dictionary to get the info with list comprehensions:
col_dict = {'name': [df.cols[i][0:20].strip() for i in range(0, len(df.cols))],
            'address': [df.cols[i][21:36].strip() for i in range(0, len(df.cols))],
            'country': [df.cols[i][36:52].strip() for i in range(0, len(df.cols))],
            'age': [df.cols[i][53:].strip() for i in range(0, len(df.cols))],
            }
This script runs well and creates a dictionary I can use as the basis for a dataframe. But I was asking myself whether there is a more Pythonic way to write it, looping directly through a dictionary of column names and avoiding repeating the same code for every column (the actual dataset is much wider).
The question is: how can I store the string slice indices so I can use them later together with the column names and parse everything at once?
You can read it directly with pandas:
df = pd.read_csv('./my_file_path/my_file.txt', delim_whitespace=True)
If you know that the space between the columns is going to be at least 2 spaces, you can do it this way:
df = pd.read_csv('./my_file_path/my_file.txt', sep=r'\s{2,}', engine='python')
In your case, the file is fixed width so you need to use a different method:
from io import StringIO  # needed to read the in-memory text as a file
df = pd.read_fwf(StringIO(my_text), widths=[20, 15, 16, 10], skiprows=1)
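If you still want the comprehension-based approach the question describes, one option is to store the slices next to the column names and build the whole dict in a single comprehension. A minimal sketch, assuming df is the frame from the question and reusing the slice boundaries given there (they may need adjusting):
import pandas as pd

# map each column name to its fixed-width slice (boundaries taken from the question)
col_slices = {
    'name':    slice(0, 20),
    'address': slice(21, 36),
    'country': slice(36, 52),
    'age':     slice(53, None),
}

# build the dict in one pass instead of repeating a list comprehension per column
col_dict = {name: [row[sl].strip() for row in df.cols] for name, sl in col_slices.items()}
result = pd.DataFrame(col_dict)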
The pandas.read_fwf method is what you are looking for.
df = pd.read_fwf('data.txt')
data.txt
name            address         country         age
Bob Hope        Broadway        United States   101
Richard Donner  Park Avenue     United States    76
Oscar Meyer     Friedrichshain  Germany          47
df

id  name            address         country         age
0   Bob Hope        Broadway        United States   101
1   Richard Donner  Park Avenue     United States    76
2   Oscar Meyer     Friedrichshain  Germany          47

Pandas read_html producing empty df with tuple column names

I want to retrieve the tables on the following website and store them in a pandas dataframe: https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs
However, the third table on the page returns an empty dataframe with all the table's data stored in tuples as the column headers:
Empty DataFrame
Columns: [(Service Providers, State of Colorado), (Cuban - Haitian Program, $0), (Refugee Preventive Health Program, $150,000.00), (Refugee School Impact, $450,000), (Services to Older Refugees Program, $0), (Targeted Assistance - Discretionary, $0), (Total FY, $600,000)]
Index: []
Is there a way to "flatten" the tuple headers into header + values, then append this to a dataframe made up of all four tables? My code is below -- it has worked on other similar pages but keeps breaking because of this table's formatting. Thanks!
funds_df = pd.DataFrame()
url = 'https://www.acf.hhs.gov/programs/orr/resource/ffy-2011-12-state-of-colorado-orr-funded-programs'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
year = url.split('ffy-')[1].split('-orr')[0]
tables = page.content
df_list = pd.read_html(tables)
for df in df_list:
    df['URL'] = url
    df['YEAR'] = year
    funds_df = funds_df.append(df)
For this site, there's no need for BeautifulSoup or requests.
pandas.read_html creates a list of DataFrames for each <table> at the URL.
import pandas as pd
url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'
# read the url
dfl = pd.read_html(url)
# see each dataframe in the list; there are 4 in this case
for i, d in enumerate(dfl):
    print(i)
    display(d)  # display works in Jupyter, otherwise use print
    print('\n')
dfl[0]
Service Providers Cash and Medical Assistance* Refugee Social Services Program Targeted Assistance Program TOTAL
0 State of Colorado $7,140,000 $1,896,854 $503,424 $9,540,278
dfl[1]
WF-CMA 2 RSS TAG-F CMA Mandatory 3 TOTAL
0 $3,309,953 $1,896,854 $503,424 $7,140,000 $9,540,278
dfl[2]
Service Providers Refugee School Impact Targeted Assistance - Discretionary Services to Older Refugees Program Refugee Preventive Health Program Cuban - Haitian Program Total
0 State of Colorado $430,000 $0 $100,000 $150,000 $0 $680,000
dfl[3]
Volag Affiliate Name Projected ORR MG Funding Director
0 CWS Ecumenical Refugee & Immigration Services $127,600 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
1 ECDC ECDC African Community Center $308,000 Jennifer Guddiche 5250 Leetsdale Drive Denver, CO 80246 303-399-4500
2 EMM Ecumenical Refugee Services $191,400 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
3 LIRS Lutheran Family Services Rocky Mountains $121,000 Floyd Preston 132 E Las Animas Colorado Springs, CO 80903 719-314-0223
4 LIRS Lutheran Family Services Rocky Mountains $365,200 James Horan 1600 Downing Street, Suite 600 Denver, CO 80218 303-980-5400
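To get back to the original goal of tagging each table with the URL and year and stacking them, here is a minimal sketch. Note that DataFrame.append is deprecated in recent pandas, so this uses pd.concat; the column-flattening branch is only needed if a table does come back with its single data row parsed as tuple column headers, as in the question.
import pandas as pd

url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'
year = url.split('ffy-')[1].split('-orr')[0]

tagged = []
for df in pd.read_html(url):
    # if read_html swallowed the data row into tuple column headers,
    # push the tuples back out into a header row and a single data row
    if isinstance(df.columns[0], tuple):
        headers = [t[0] for t in df.columns]
        values = [t[1] for t in df.columns]
        df = pd.DataFrame([values], columns=headers)
    df['URL'] = url
    df['YEAR'] = year
    tagged.append(df)

funds_df = pd.concat(tagged, ignore_index=True)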

Best Python looping system for merging pandas DataFrame rows for export

I am a self-teaching data science student, currently doing my first big Python portfolio project in several steps, the first of which is using pandas to work with IMDb [Internet Movie Database]'s rather oddly structured .tsv files in an effort to create a fully searchable big data repository of all IMDb data (the officially supported searches and even APIs like OMDB (Open Movie Database) don't allow for the kinds of detailed queries I need to do for the larger project).
The structure of IMDb's public files is that they include all data on movies, TV shows, episodes, actors, directors, crew, the whole business, scattered rather haphazardly across seven massive tsv files. I've confirmed that pandas can, in fact, read in all of this data and that my computer's memory can handle it, but what I want to do is merge the seven tsv files into a single DataFrame object which can then be exported to (preferably) a SQL database, or even to a huge spreadsheet or another, larger TSV file.
Each thing in the database (movie, actor, individual TV episode) has a tconst identifier, which in one file is labelled "titleId", a string. In every other file it is labelled "tconst", also a string. I'm going to need to rename titleId to tconst when I read that file in; this is one of several challenges I haven't got to yet.
#set pandas formatting parameters
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)
#read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv",sep='\t')
#temporary hack - print the entire dataframe as test
print(showbiz_core)
This works, but I'm not sure exactly how to proceed next. I want to import each of the other tsv files to attempt to locally reconstruct the imdb database. This means that I don't want to have duplicate tconst strings, but rather to end up with new information about a tconst entry (like a film) appended to it as new columns.
Should I be looking to do a "for i in [new file]" type loop somehow? How would you go about this?
The IMDb files are actually highly structured. Looping is always a bad approach to merging data.
structured data sourcing - I used wget rather than sourcing the files manually
the files are large, so work with a subset for modelling purposes. I have just used popular movies and actors as the driver
the comma-separated columns in the tsv files are actually sub-tables. Treat them as such. I build a reference entity nmi to do this
there are other associative relationships there as well (primaryProfession, genres)
finally, join (merge) everything together from OMDb and IMDb, taking the first row where many items associate to a title
I have left the data as tsv for now; clearly it would be very simple to put it into a database using the to_sql() method. The main point is sourcing and transformation, aka ETL, which has become an unfashionable term. This can be further supplemented with web scraping. I looked at Box Office Mojo, however that would require selenium to scrape as it is dynamic HTML.
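For completeness, a minimal sketch of that final to_sql() export step; the SQLite file name and the derived table names are assumptions for illustration, and dfs is the dict of DataFrames built in the transform section below:
# hedged sketch: write each prepared DataFrame to a local SQLite database.
# "imdb_subset.db" and the table-name scheme are assumptions, not part of the answer above.
import sqlite3

with sqlite3.connect("imdb_subset.db") as con:
    for name, frame in dfs.items():
        # e.g. "title.basics" becomes table "title_basics"
        frame.to_sql(name.replace(".", "_"), con, if_exists="replace", index=False)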
IMDb sourcing
import requests, json, re, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget,gzip
from pathlib import Path
import numpy as np
# find what IMdB has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
IMDb transform
Set alldata=True on the first run to prepare the data. On the second run set it to False and you have a manageable subset.
alldata = False
subsetdata = True
dfs = {}
# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],
      'averageRating': [9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 9.3, 8.7, 8.8, 8.9, 9.2],
      'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}

# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
    if subsetdata:
        # manage down size of nmi
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][0])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][1])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][2])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][3])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][4])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][5])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][6])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][7])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][8])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][9])
                             )
                            & dfs[k]["knownForTitles"].str.contains("tt")]
    dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")
# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])

# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:, ["nconst","knownForTitles"]]
              .assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
              .explode("knownForTitles")
              ).rename(columns={"knownForTitles": "tconst"}).drop_duplicates()
# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()
for k in [k for k in files.keys() if k not in ["name.basics","omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
        if k == "title.akas":
            dfs[k] = dfs[k].rename(columns={"titleId": "tconst"})
        # subset titles to those we have names
        if subsetdata:
            c = "tconst" if k != "title.episode" else "parentTconst"
            try:
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")
dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"], on="nconst").merge(dfs["title.basics"], on="tconst")
OMDB sourcing
omdbcols = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
dfs[omdbk] = dfs[omdbk].astype({c: "Int64" for c in dfs[omdbk].columns}, errors="ignore")
k = "title.basics"
# limited to 1000 API calls a day, so only fetch if have not done already
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params = {"apikey": apikey, "i": tconst, "plot": "full"}  # apikey: your OMDb API key, defined elsewhere
    res = requests.get("http://www.omdbapi.com/", params=params)
    if res.status_code != 200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])
    dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
sample analysis
# The Dark Knight tt0468569
# Game of Throne tt0944947
# for demo purpose - just pick first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569","tt0944947"])
demo = (dfs[omdbk].loc[mask]
.rename(columns={c:f"OMDB{c}" for c in dfs[omdbk].columns})
.rename(columns={"OMDBimdbID":"tconst"})
.merge(dfs["title.basics"], on="tconst")
.merge(dfs["title.ratings"], on="tconst")
.merge(dfs["title.akas"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
left_on="tconst", right_on="parentTconst", how="left", suffixes=("","_ep"))
.merge(dfs["nmi"]
.merge(dfs["name.basics"], on="nconst")
.groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("","_name"))
).T
output
0 1
OMDBTitle The Dark Knight Game of Thrones
OMDBYear 2008 2011–2019
OMDBRated PG-13 TV-MA
OMDBReleased 18 Jul 2008 17 Apr 2011
OMDBRuntime 152 min 57 min
OMDBGenre Action, Crime, Drama, Thriller Action, Adventure, Drama, Fantasy, Romance
OMDBDirector Christopher Nolan NaN
OMDBWriter Jonathan Nolan (screenplay), Christopher Nolan (screenplay), Christopher Nolan (story), David S. Goyer (story), Bob Kane (characters) David Benioff, D.B. Weiss
OMDBActors Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington
OMDBLanguage English, Mandarin English
OMDBCountry USA, UK USA, UK
OMDBAwards Won 2 Oscars. Another 153 wins & 159 nominations. Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw##._V1_SX300.jpg https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc#._V1_SX300.jpg
OMDBRatings [{'Source': 'Internet Movie Database', 'Value': '9.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '84/100'}] [{'Source': 'Internet Movie Database', 'Value': '9.3/10'}]
OMDBMetascore 84 <NA>
OMDBimdbRating 9 9.3
OMDBimdbVotes 2234169 1679892
tconst tt0468569 tt0944947
OMDBType movie series
OMDBDVD 09 Dec 2008 NaN
OMDBBoxOffice $533,316,061 NaN
OMDBProduction Warner Bros. Pictures/Legendary NaN
OMDBWebsite <NA> <NA>
OMDBResponse 1 1
OMDBtotalSeasons <NA> 8
titleType movie tvSeries
primaryTitle The Dark Knight Game of Thrones
originalTitle The Dark Knight Game of Thrones
isAdult 0 0
startYear 2008 2011
endYear <NA> 2019
runtimeMinutes 152 57
genres Action,Crime,Drama Action,Adventure,Drama
averageRating 9 9.3
numVotes 2237966 1699318
ordering_x 10 10
title The Dark Knight Taht Oyunları
region GB TR
language en tr
types imdbDisplay imdbDisplay
attributes fake working title literal title
isOriginalTitle 0 0
directors nm0634240 nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers nm0634300,nm0634240,nm0333060,nm0004170 nm1125275,nm0552333,nm1888967,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y 10 10
nconst nm0746273 nm0322513
category producer actor
job producer creator
characters ["Bruce Wayne"] ["Jorah Mormont"]
parentTconst NaN tt0944947
tconst_ep NaN tt1480055
seasonNumber <NA> 1
episodeNumber <NA> 1
nconst_name nm0000198 nm0000293
primaryName Gary Oldman Sean Bean
birthYear 1958 1959
deathYear 1998 2020
primaryProfession actor,soundtrack,producer actor,producer,animation_department

concat pdf tables into one excel table using python

I'm using tabula in order to concatenate all the tables in the following PDF file
into one table in Excel format.
Here's my code:
from tabula import read_pdf
import pandas as pd
allin = []
for page in range(1, 115):
    table = read_pdf("goal.pdf", pages=page,
                     pandas_options={'header': None})[0]
    allin.append(table)
new = pd.concat(allin)
new.to_excel("out.xlsx", index=False)
I also tried the following:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Current output: check
But the issue I'm facing is that from page 91 onwards the data is not formatted correctly within the Excel file.
I've debugged the page individually and I couldn't figure out why it's formatted wrongly, especially since it's in the same format.
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]
print(table)
Example:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Here I've run the code for two pages, 90 and 91.
Starting from row 48 you will see the difference:
the name and address are placed into one cell, and the city and state are placed into one cell as well.
I dug into the source code and found that it has a columns option with which you can manually define the column boundaries. When you set columns you also have to use guess=False.
tabula-py uses the tabula-java program, and in its documentation I found that it needs values in percent or points (not pixels). So I used Inkscape to measure the boundaries in points.
from tabula import read_pdf
import pandas as pd
# display all columns in dataframe
pd.set_option('display.width', None)
columns = [210, 350, 420, 450] # boundaries in points
#columns = ['210,350,420,450'] # boundaries in points
pages = '90-92'
#pages = [90,91,92]
#pages = list(range(90,93))
#pages = 'all' # read all pages
tables = read_pdf("goal.pdf",
pages=pages,
pandas_options={'header': None},
columns=columns,
guess=False)
df = pd.concat(tables).reset_index(drop=True)
#df.rename(columns=df.iloc[0], inplace=True) # convert first row to headers
#df.drop(df.index[0], inplace=True) # remove first row with headers
# display
#for x in range(0, len(df), 20):
# print(df.iloc[x:x+20])
# print('----------')
print(df.iloc[45:50])
#df.to_csv('output-pdf.csv')
#print(df[ df['State'].str.contains(' ') ])
#print(df[ df.iloc[:,3].str.contains(' ') ])
Result:
0 1 2 3 4
45 JARRARD, GARY 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
46 JARRARD, GARY 2219 COLORADO BLVD DENTON TX (940) 380-1661
47 MASON HARRISON, RATLIFF ENTERPRISES 1815 W. UNIVERSITY DRIVE DENTON TX (940) 387-5431
48 MASON HARRISON, RATLIFF ENTERPRISES 109 N. LOOP #288 DENTON TX (940) 484-2904
49 MASON HARRISON, RATLIFF ENTERPRISES 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
EDIT:
It may also need the area option (also in points) to skip the headers, or you will have to remove the first row on the first page.
I didn't check all the rows, but it may need some changes to the column boundaries.
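A minimal sketch of the "remove the first row" route, generalised to drop the header row wherever it repeats; it assumes a header row's first cell contains the literal word "Name", which you would need to adjust to the real header text in goal.pdf:
# hedged sketch: drop rows that are just the page header repeated on each page.
# Assumes a header row's first cell reads "Name" - adjust to the actual header text.
header_mask = df[0].astype(str).str.strip().str.lower().eq("name")
df = df[~header_mask].reset_index(drop=True)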
EDIT:
A few rows cause problems, probably because the text in the City column is too long.
col3 = df.iloc[:,3]
print(df[ col3.str.contains(' ') ])
Result:
0 1 2 3 4
1941 UMSTATTD RESTAURANTS, LLC 120 WEST US HIGHWAY 54 EL DORADO SPRING MS O (417) 876-5755
2079 SIMONS, GARY 1412 BURLINGTON NORTH KANSAS CIT MY O (816) 421-5941
2763 GRISHAM, ROBERT (RB) 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
2764 STAUFFER, JACOB 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
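If those rows turn out to be a boundary problem rather than genuinely overlong text, one hedged option is to re-measure and widen the boundaries and re-run the check above; the 430/460 values below are illustrative assumptions, not measured values for goal.pdf:
# assumption: slightly wider City/State boundaries (in points) - re-measure in Inkscape
columns = [210, 350, 430, 460]
tables = read_pdf("goal.pdf", pages='all',
                  pandas_options={'header': None},
                  columns=columns, guess=False)
df = pd.concat(tables).reset_index(drop=True)
# the problem-row check from above should now return fewer (ideally no) rows
print(df[df.iloc[:, 3].astype(str).str.contains(' ')])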

Iterate geolocation over pandas dataframe

I have a dataframe that has two columns, Hospital name and Address, and I want to iterate through each address to find the latitude and longitude. My code seems to be taking the first row in the dataframe and I can't seem to select the address to find the coordinates.
import pandas
from geopy.geocoders import Nominatim
geolocator = Nominatim()
for index, item in df.iterrows():
    location = geolocator.geocode(item)
    df["Latitude"].append(location.latitude)
    df["Longitude"].append(location.longitude)
Here is the code I used to scrape the website. Copy and run this and you'll have the data set.
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np
r=requests.get("https://www.privatehealth.co.uk/hospitals-and-
clinics/orthopaedic-surgery/?offset=300")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"col-9"})
names = []
for item in all:
    d = {}
    d["Hospital Name"] = item.find(["h3"], {"class": "mb6"}).text.replace("\n", "")
    d["Address"] = item.find(["p"], {"class": "mb6"}).text.replace("\n", "")
    names.append(d)
df=pandas.DataFrame(names)
df = df[['Hospital Name','Address']]
df
Currently the data looks like (one hospital example):
Hospital Name |Address
Fulwood Hospital|Preston, PR2 9SZ
The final output that I'm trying to achieve looks like.
Hospital Name |Address | Latitude | Longitude
Fulwood Hospital|Preston, PR2 9SZ|53.7589938|-2.7051618
Seems like there are a few issues here. Using data from the URL you provided:
df.head()
Hospital Name Address
0 Fortius Clinic City London, EC4N 7BE
1 Pinehill Hospital - Ramsay Health Care UK Hitchin, SG4 9QZ
2 Spire Montefiore Hospital Hove, BN3 1RD
3 Chelsea & Westminster Hospital London, SW10 9NH
4 Nuffield Health Tunbridge Wells Hospital Tunbridge Wells, TN2 4UL
(1) If your data frame column names really are Hospital name and Address, then you need to use item.Address in the call to geocode().
Just using item will give you both Hospital name and Address.
for index, item in df.iterrows():
    print(f"index: {index}")
    print(f"item: {item}")
    print(f"item.Address only: {item.Address}")
# Output:
index: 0
item: Hospital Name Fortius Clinic City
Address London, EC4N 7BE
Name: 0, dtype: object
item.Address only: London, EC4N 7BE
...
(2) You noted that your data frame only has two columns. If that's true, you'll get a KeyError when you try to perform operations on df["Latitude"] and df["Longitude"], because they don't exist. (A corrected version of the iterrows() loop is sketched at the end of this answer.)
(3) Using apply() on the Address column might be clearer than iterrows().
Note that this is a stylistic point, and debatable. (The first two points are actual errors.)
For example, using the provided URL:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
tmp = df.head().copy()
latlon = tmp.Address.apply(lambda addr: geolocator.geocode(addr))
tmp["Latitude"] = [x.latitude for x in latlon]
tmp["Longitude"] = [x.longitude for x in latlon]
Output:
Hospital Name Address \
0 Fortius Clinic City London, EC4N 7BE
1 Pinehill Hospital - Ramsay Health Care UK Hitchin, SG4 9QZ
2 Spire Montefiore Hospital Hove, BN3 1RD
3 Chelsea & Westminster Hospital London, SW10 9NH
4 Nuffield Health Tunbridge Wells Hospital Tunbridge Wells, TN2 4UL
Latitude Longitude
0 51.507322 -0.127647
1 51.946413 -0.279165
2 50.840871 -0.180561
3 51.507322 -0.127647
4 51.131528 0.278068
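Going back to points (1) and (2): if you prefer to keep the iterrows() loop from the question, a minimal corrected sketch is to geocode item.Address, collect the results in plain lists, and assign them as new columns afterwards (assignment rather than append, since the columns don't exist yet):
latitudes, longitudes = [], []
for index, item in df.iterrows():
    location = geolocator.geocode(item.Address)  # geocode the address only, not the whole row
    latitudes.append(location.latitude if location else None)
    longitudes.append(location.longitude if location else None)

df["Latitude"] = latitudes
df["Longitude"] = longitudes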
