scrape address and phone numbers from this website - python

How do I scrape the data from the title tag and the contact-info class on this site, and export it to a CSV file using the bs4 and pandas libraries? I need help scraping the data from the title attribute and the contact-info class.
import pandas as pd
import bs4
import requests
import re

full_dict = {'Title': [], 'Description': [], 'Address': []}
res = requests.get("https://cupcakemaps.com/cupcakes/cupcakes-near-me/p:2")
soup = bs4.BeautifulSoup(res.text, 'lxml')
listings = soup.findAll(class_='media')
for listing in listings:
    listing_title = listing.find(True, {'title': True}).attrs['title']
    listing_Description = listing.find('p', {'class': 'summary-desc'})
    listing_address = listing.find('p', {'class': 'contact-info'})

.strip() - built-in Python string method that removes all leading and trailing whitespace from a string.
.to_csv() - writes the object to a comma-separated values (CSV) file.
Ex.
import pandas as pd
from bs4 import BeautifulSoup, Tag
import requests

res = requests.get("https://cupcakemaps.com/cupcakes/cupcakes-near-me/p:2")
soup = BeautifulSoup(res.text, 'lxml')
listings = soup.findAll(class_='media')
data = []
for listing in listings:
    listing_title = listing.find(True, {'title': True}).attrs['title']
    listing_Description = listing.find('p', {'class': 'summary-desc'})
    if isinstance(listing_Description, Tag):
        listing_Description = listing_Description.text.strip()
    listing_address = listing.find('p', {'class': 'contact-info'})
    if isinstance(listing_address, Tag):
        number_text = listing_address.text.strip()
        listing_address = ''.join(filter(str.isdigit, number_text))
    full_dict = {'Title': listing_title, 'Description': listing_Description, 'Address': listing_address}
    data.append(full_dict)

df = pd.DataFrame(data)
# save the scraped data into a csv file
df.to_csv("contact.csv")
print(df)
Output:
Title Description Address
0 Explore Category 'Anaheim CA Birthday Cupcakes... Delectable Anaheim, CA - Delectable check out ... 7147156086
1 Explore Category 'Costa Mesa CA Birthday Cupca... Lisa's Gourmet Snacks Costa Mesa CA check out... 7144275814
2 Explore Category 'Shorewood IL Birthday Cupcak... Acapulco Bakery Inc Shorewood, IL - Acapulco B... 8157291737
3 Explore Category 'San Francisco CA Birthday Cu... Hilda's Mart & Bake Shop San Francisco CA che... 4153333122
4 Explore Category 'Los Angeles CA Birthday Cupc... Lenny's Deli Los Angeles, CA - Lenny's Deli ch... 3104755771
5 Explore Category 'San Francisco CA Birthday Cu... Sweet Inspirations San Francisco CA check out... None
6 Explore Category 'Costa Mesa CA Birthday Cupca... The Cupcake Costa Mesa CA check out The Cupc... 9496420571
7 Explore Category 'Los Angeles CA Birthday Cupc... United Bread & Pastry Inc Los Angeles CA chec... 3236610037
8 Explore Category 'Garden Grove CA Birthday Cup... Pescadores Garden Grove CA check out Pescado... 7145395585
9 Explore Category 'Bakersfield CA Birthday Cupc... Bimbo Bakeries Usa Bakersfield CA check out ... 6613219352
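If you want to keep the phone number's punctuation rather than collapsing the contact text to bare digits, a regex variant of the address step works too. A minimal sketch (the pattern is an assumption about how the site formats US numbers):

import re

def extract_phone(text):
    # Grab the first US-style number such as (714) 715-6086 or 714-715-6086;
    # returns None when no match is found.
    m = re.search(r'\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]?\d{4}', text)
    return m.group(0) if m else None

This would replace the ''.join(filter(str.isdigit, number_text)) line inside the loop.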

Related

Pandas "Columns must be same length as key" error in Python

I have 2 CSV files. In file1 I have a list of research group names; in file2 I have a list of the research groups' full names along with their locations. I want to join these two CSV files wherever the names match.
I get the pandas ValueError: "Columns must be same length as key". I am using Jupyter Lab for this.
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
file1.csv has about 5,000 rows of data, and file2.csv about 15,000.
file1.csv
research_groups_names_f1
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
University of California, Los Angeles
Harvard University
file2.csv
research_groups_names_f2              Locatio_f2
Chinese Academy of Sciences (CAS)     China
University of Michigan (U-M)          USA
The University of Queensland (UQ)     USA
University of California              USA
file_output.csv
research_groups_names_f1                 research_groups_names_f2             Locatio_f2
Chinese Academy of Sciences              Chinese Academy of Sciences (CAS)    China
CAS                                      Chinese Academy of Sciences (CAS)    China
U-M                                      University of Michigan (U-M)         USA
UQ                                       The University of Queensland (UQ)    Australia
Harvard University                       Not found                            USA
University of California, Los Angeles    University of California             USA
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')

def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names_f1"] == row["research_groups_names_f2"]
            or row["research_groups_names_f1"] in n["research_groups_names_f2"]
        ):
            return n

df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help because the distance between CAS and Chinese Academy of Sciences (CAS) is quite large; the two have very little in common. You'll have to develop some custom approach based on your understanding of what the possible group names could be. Here is one approach that gets you most of the way there.
The idea is to match on the university name OR the abbreviation. So in df2 we can split off the abbreviation and explode it into a new row, then remove the parentheses; in df we remove any abbreviation surrounded by parentheses.
The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In this case, fuzzy matching like I mentioned in my first comment probably would help.
import pandas as pd

df = pd.DataFrame({'research_groups_names_f1': [
    'Chinese Academy of Sciences (CAS)',
    'CAS',
    'U-M',
    'UQ',
    'University of California, Los Angeles',
    'Harvard University']})

df2 = pd.DataFrame({'research_groups_names_f2': [
    'Chinese Academy of Sciences (CAS)',
    'University of Michigan (U-M)',
    'The University of Queensland (UQ)',
    'University of California'],
    'Locatio_f2': ['China', 'USA', 'USA', 'USA']})

df2['key'] = df2['research_groups_names_f2'].str.split(r'\(')
df2 = df2.explode('key')
df2['key'] = df2['key'].str.replace(r'\(|\)', '', regex=True)
df['key'] = df['research_groups_names_f1'].str.replace(r'\(.*\)', '', regex=True)
df.merge(df2, on='key', how='left').drop(columns='key')
Output
research_groups_names_f1 research_groups_names_f2 Locatio_f2
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) USA
3 UQ The University of Queensland (UQ) USA
4 University of California, Los Angeles NaN NaN
5 Harvard University NaN NaN
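For the leftover UCLA row, a fuzzy fallback can fill the gap. A minimal sketch using difflib from the standard library (thefuzz/fuzzywuzzy would work similarly); the 0.6 cutoff is an assumption you'd tune on your real data:

import difflib

merged = df.merge(df2, on='key', how='left').drop(columns='key')
choices = df2['research_groups_names_f2'].unique().tolist()

def closest(name):
    # Return the most similar df2 name, or None if nothing clears the cutoff.
    hits = difflib.get_close_matches(name, choices, n=1, cutoff=0.6)
    return hits[0] if hits else None

# Only re-match the rows the exact-key merge left empty.
mask = merged['research_groups_names_f2'].isna()
merged.loc[mask, 'research_groups_names_f2'] = merged.loc[mask, 'research_groups_names_f1'].map(closest)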

PySpark: create new column based on dictionary values matching with string in another column

I have a dataframe A that looks like this:
ID   SOME_CODE   TITLE
1    024df3      Large garden in New York, New York
2    0ffw34      Small house in dark Detroit, Michigan
3    93na09      Red carpet in beautiful Miami
4    8339ct      Skyscraper in Los Angeles, California
5    84p3k9      Big shop in northern Boston, Massachusetts
I have also another dataframe B:
City          Shortcut
Los Angeles   LA
New York      NYC
Miami         MI
Boston        BO
Detroit       DTW
I would like to add new "SHORTCUT" column to dataframe A, based on the fact that "Title" column in A contains city from column "City" in dataframe B.
I have tried to use dataframe B as a dictionary and map it onto dataframe A, but I can't get past the fact that the city names sit in the middle of the sentence.
The desired output is:
ID   SOME_CODE   TITLE                                        SHORTCUT
1    024df3      Large garden in New York, New York           NYC
2    0ffw34      Small house in dark Detroit, Michigan        DTW
3    93na09      Red carpet in beautiful Miami, Florida       MI
4    8339ct      Skyscraper in Los Angeles, California        LA
5    84p3k9      Big shop in northern Boston, Massachusetts   BO
I would appreciate your help.
You can leverage the pandas.apply function. See if this helps:
import numpy as np
import pandas as pd

data1 = {'id': range(5),
         'some_code': ["024df3", "0ffw34", "93na09", "8339ct", "84p3k9"],
         'title': ["Large garden in New York, New York",
                   "Small house in dark Detroit, Michigan",
                   "Red carpet in beautiful Miami",
                   "Skyscraper in Los Angeles, California",
                   "Big shop in northern Boston, Massachusetts"]}
df1 = pd.DataFrame(data=data1)

data2 = {'city': ["Los Angeles", "New York", "Miami", "Boston", "Detroit"],
         'shortcut': ["LA", "NYC", "MI", "BO", "DTW"]}
df2 = pd.DataFrame(data=data2)

# Creating a list of cities.
cities = list(df2['city'].values)

def matcher(x):
    for index, city in enumerate(cities):
        if x.lower().find(city.lower()) != -1:
            return df2.iloc[index]["shortcut"]
    return np.nan

df1['shortcut'] = df1['title'].apply(matcher)
print(df1.head())
This generates the following output:
id some_code title shortcut
0 0 024df3 Large garden in New York, New York NYC
1 1 0ffw34 Small house in dark Detroit, Michigan DTW
2 2 93na09 Red carpet in beautiful Miami MI
3 3 8339ct Skyscraper in Los Angeles, California LA
4 4 84p3k9 Big shop in northern Boston, Massachusetts BO
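If you need this in PySpark itself rather than pandas, a minimal sketch using a join on a substring condition (column names taken from the question; the contains-based join is my own suggestion, and only two sample rows are shown):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df_a = spark.createDataFrame(
    [(1, "024df3", "Large garden in New York, New York"),
     (2, "0ffw34", "Small house in dark Detroit, Michigan")],
    ["ID", "SOME_CODE", "TITLE"])
df_b = spark.createDataFrame(
    [("New York", "NYC"), ("Detroit", "DTW")],
    ["City", "Shortcut"])

# Left join on a substring condition: keep a B row when its city appears
# anywhere inside the title. Note this is effectively a cross join with a
# filter, so it can be slow on large data.
result = (df_a
          .join(df_b, F.col("TITLE").contains(F.col("City")), "left")
          .select("ID", "SOME_CODE", "TITLE",
                  F.col("Shortcut").alias("SHORTCUT")))
result.show(truncate=False)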

Scraping the BBB site and converting JSON to a DataFrame

I would like to put this information into a dataframe and then export it to Excel. So far the Python tutorials I've followed produce table errors, and I've had no luck converting the JSON data to a data frame.
Any tips would be very helpful.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
import bs4
import requests, re, json
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://www.bbb.org/search?find_country=USA&find_entity=10126-000&find_id=357_10126-000_alias&find_text=roofing&find_type=Category&page=1&touched=1', headers = headers)
p = re.compile(r'PRELOADED_STATE__ = (.*?);')
data = json.loads(p.findall(r.text)[0])
results = [(item['businessName'], ' '.join([item['address'],item['city'], item['state'], item['postalcode']]), item['phone']) for item in data['searchResult']['results']]
print(results)
import re
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://www.bbb.org/search?find_country=USA&find_entity=10126-000&find_id=357_10126-000_alias&find_text=roofing&find_type=Category&page=1&touched=1', headers = headers)
p = re.compile(r'PRELOADED_STATE__ = (.*?);')
data = json.loads(p.findall(r.text)[0])
results = [(item['businessName'], ' '.join([item['address'],item['city'], item['state'], item['postalcode']]), item['phone']) for item in data['searchResult']['results']]
df = pd.DataFrame(results, columns=['Business Name', 'Address', 'Phone'])
print(df)
df.to_csv('data.csv')
Prints:
Business Name Address Phone
0 Trinity Roofing, LLC Stilwell KS 66085-8238 [(913) 432-4425, (303) 699-7999]
1 Trinity Roofing, LLC 14241 E 4th Ave Ste 5-300 Aurora CO 80011-8733 [(913) 432-4425, (303) 699-7999]
2 CMR Construction & Roofing of Texas, LLC 12500 E US Highway 40, Ste. B1 Independence MO... [(855) 376-6326, (855) 766-3267]
3 All-Star Home Repairs LLC 1806 Grove Ave Richmond VA 23220-4506 [(804) 405-9337]
4 MadSky Roofing & Restoration, LLC Bank of America Center, 16th Floor 1111 E. Mai... [(855) 623-7597]
5 Robert Owens Roofing Bealeton VA 22712-9706 [(540) 878-3544]
6 Proof Seal of Athens PO Box 80732 Canton OH 447080732 [(330) 685-6363]
7 Proof Seal of Athens Athens OH 45701-1847 [(330) 685-6363]
8 Tenecela General Services Corp 57 Anderson St Lowell MA 01852-5357 None
9 Water Tight Roofing & Siding 57 Whitehall Way Hyannis MA 02601-2149 [(508) 364-8323]
10 Tenecela General Services Corp 745 Broadway St Fl 2 Lowell MA 01854-3137 None
11 Just In Time Roofing & Contracting, LLC ----- Ft Worth TX 76102 [(888) 666-3122, (254) 296-8016, (888) 370-3331]
12 Paramount Construction of Southerntier NY Inc. 323 Fluvanna Ave. Jamestown NY 14701 [(716) 487-0093]
13 Paramount Construction of Southerntier NY Inc. P O Box 488 Falconer NY 14733 [(716) 487-0093]
14 Paramount Construction of Southerntier NY Inc. 1879 Lyndon Boulevard Falconer NY 14733 [(716) 487-0093]
And saves the same table to data.csv.
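One follow-up worth noting: item['phone'] is often a list, which is why the Phone column prints as [(913) 432-4425, ...]. A small sketch (reusing the results list from above) that flattens it to a plain string before export:

# Join list-valued phone entries into a single comma-separated string;
# pass None through unchanged.
flat = [(name, addr, ', '.join(phones) if isinstance(phones, list) else phones)
        for name, addr, phones in results]
df = pd.DataFrame(flat, columns=['Business Name', 'Address', 'Phone'])
df.to_csv('data.csv', index=False)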

How to Nest If Statement Within For Loop When Scraping Div Class HTML

Below is a scraper that uses Beautiful Soup to scrape physician information off of this webpage. As you can see from the html code directly below, each physician has an individual profile on the webpage that displays the physician's name, clinic, profession, taxonomy, and city.
<div class="views-field views-field-title practitioner__name" >Marilyn Adams</div>
<div class="views-field views-field-field-pract-clinic practitioner__clinic" >Fortius Sport & Health</div>
<div class="views-field views-field-field-pract-profession practitioner__profession" >Physiotherapist</div>
<div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region" >Fraser River Delta</div>
<div class="views-field views-field-city practitioner__city" ></div>
As you can see from the sample html code, the physician profiles occasionally have information missing. If this occurs, I would like the scraper to print 'N/A'. I need the scraper to print 'N/A' because I would eventually like to put each div class category (name, clinic, profession, etc.) into an array where the lengths of each column are exactly the same so I can properly export the data to a CSV file. Here is an example of what I want the output to look like compared to what is actually showing up.
Actual                Expected
[Names]               [Names]
Greg                  Greg
Bob                   Bob
[Clinic]              [Clinic]
Sport/Health          Sport/Health
                      N/A
[Profession]          [Profession]
Physical Therapist    Physical Therapist
Physical Therapist    Physical Therapist
[Taxonomy]            [Taxonomy]
Fraser River          Fraser River
                      N/A
[City]                [City]
Vancouver             Vancouver
Vancouver             Vancouver
I have tried writing an if statement nested within each for loop, but the code does not seem to loop correctly: the "N/A" only shows up once for each div class section. Does anyone know how to properly nest an if statement within a for loop so I get the proper number of "N/A"s in each column? Thanks in advance!
import requests
import re
from bs4 import BeautifulSoup

page = requests.get('https://sportmedbc.com/practitioners')
soup = BeautifulSoup(page.text, 'html.parser')

# Find Doctor Info
for doctor in soup.find_all('div', attrs={'class': 'views-field views-field-title practitioner__name'}):
    for a in doctor.find_all('a'):
        print(a.text)

for clinic_name in soup.find_all('div', attrs={'class': 'views-field views-field-field-pract-clinic practitioner__clinic'}):
    for b in clinic_name.find_all('a'):
        if b == '':
            print('N/A')

profession_links = soup.findAll('div', attrs={'class': 'views-field views-field-field-pract-profession practitioner__profession'})
for profession in profession_links:
    if profession.text == '':
        print('N/A')
    print(profession.text)

taxonomy_links = soup.findAll('div', attrs={'class': 'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for taxonomy in taxonomy_links:
    if taxonomy.text == '':
        print('N/A')
    print(taxonomy.text)

city_links = soup.findAll('div', attrs={'class': 'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for city in city_links:
    if city.text == '':
        print('N/A')
    print(city.text)
For this problem you can use ChainMap from the collections module. That way you can define your default values, in this case 'n/a', and only grab the information that exists for each doctor:
from bs4 import BeautifulSoup
import requests
from collections import ChainMap

url = 'https://sportmedbc.com/practitioners'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

def get_data(soup):
    default_data = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a',
                    'region': 'n/a', 'city': 'n/a'}
    for doctor in soup.select('.view-practitioners .practitioner'):
        doctor_data = {}
        if doctor.select_one('.practitioner__name').text.strip():
            doctor_data['name'] = doctor.select_one('.practitioner__name').text
        if doctor.select_one('.practitioner__clinic').text.strip():
            doctor_data['clinic'] = doctor.select_one('.practitioner__clinic').text
        if doctor.select_one('.practitioner__profession').text.strip():
            doctor_data['profession'] = doctor.select_one('.practitioner__profession').text
        if doctor.select_one('.practitioner__region').text.strip():
            doctor_data['region'] = doctor.select_one('.practitioner__region').text
        if doctor.select_one('.practitioner__city').text.strip():
            doctor_data['city'] = doctor.select_one('.practitioner__city').text
        yield ChainMap(doctor_data, default_data)

for doctor in get_data(soup):
    print('name:\t\t', doctor['name'])
    print('clinic:\t\t', doctor['clinic'])
    print('profession:\t', doctor['profession'])
    print('city:\t\t', doctor['city'])
    print('region:\t\t', doctor['region'])
    print('-' * 80)
Prints:
name: Jaimie Ackerman
clinic: n/a
profession: n/a
city: n/a
region: n/a
--------------------------------------------------------------------------------
name: Marilyn Adams
clinic: Fortius Sport & Health
profession: Physiotherapist
city: n/a
region: Fraser River Delta
--------------------------------------------------------------------------------
name: Mahsa Ahmadi
clinic: Wellpoint Acupuncture (Sports Medicine)
profession: Acupuncturist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Tracie Albisser
clinic: Pacific Sport Northern BC, Tracie Albisser
profession: Strength and Conditioning Specialist, Exercise Physiologist
city: n/a
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Christine Alder
clinic: n/a
profession: n/a
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Steacy Alexander
clinic: Go! Physiotherapy Sports and Wellness Centre
profession: Physiotherapist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Page Allison
clinic: AET Clinic, .
profession: Athletic Therapist
city: Victoria
region: Vancouver Island - Central Coast
--------------------------------------------------------------------------------
name: Dana Alumbaugh
clinic: n/a
profession: Podiatrist
city: Squamish
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Manouch Amel
clinic: Mountainview Kinesiology Ltd.
profession: Strength and Conditioning Specialist
city: Anmore
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Janet Ames
clinic: Dr. Janet Ames
profession: Physician
city: Prince George
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Sandi Anderson
clinic: n/a
profession: n/a
city: Coquitlam
region: Fraser Valley
--------------------------------------------------------------------------------
name: Greg Anderson
clinic: University of the Fraser Valley
profession: Exercise Physiologist
city: Mission
region: Fraser Valley
--------------------------------------------------------------------------------
EDIT:
For getting the output in columns, you can use this example:
def print_data(header_text, data, key):
    print(header_text)
    for d in data:
        print(d[key])
    print()
data = list(get_data(soup))
print_data('[Names]', data, 'name')
print_data('[Clinic]', data, 'clinic')
print_data('[Profession]', data, 'profession')
print_data('[Taxonomy]', data, 'region')
print_data('[City]', data, 'city')
This prints:
[Names]
Jaimie Ackerman
Marilyn Adams
Mahsa Ahmadi
Tracie Albisser
Christine Alder
Steacy Alexander
Page Allison
Dana Alumbaugh
Manouch Amel
Janet Ames
Sandi Anderson
Greg Anderson
[Clinic]
n/a
Fortius Sport & Health
Wellpoint Acupuncture (Sports Medicine)
Pacific Sport Northern BC, Tracie Albisser
n/a
Go! Physiotherapy Sports and Wellness Centre
AET Clinic, .
n/a
Mountainview Kinesiology Ltd.
Dr. Janet Ames
n/a
University of the Fraser Valley
[Profession]
n/a
Physiotherapist
Acupuncturist
Strength and Conditioning Specialist, Exercise Physiologist
n/a
Physiotherapist
Athletic Therapist
Podiatrist
Strength and Conditioning Specialist
Physician
n/a
Exercise Physiologist
[Taxonomy]
n/a
Fraser River Delta
Vancouver & Sea to Sky
Cariboo - North East
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Vancouver Island - Central Coast
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Cariboo - North East
Fraser Valley
Fraser Valley
[City]
n/a
n/a
Vancouver
n/a
Vancouver
Vancouver
Victoria
Squamish
Anmore
Prince George
Coquitlam
Mission
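Since the original goal was a CSV export with equal-length columns, the ChainMap rows drop straight into pandas. A minimal sketch reusing get_data from above (the practitioners.csv filename is my own):

import pandas as pd

# Each ChainMap materializes into a plain dict with the 'n/a' defaults filled in.
df = pd.DataFrame([dict(d) for d in get_data(soup)])
df.to_csv('practitioners.csv', index=False)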

Apply fuzzy matching across a dataframe column and save results in a new column

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis=1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching commands from the fuzzywuzzy module, and to return the value of the best match and its name, storing both in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL from df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The result would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
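To land the best match and its score back on df1 as new columns, one way is the sketch below (my own column names; it assumes unique company names in df1):

scores = compare.apply(metrics)

# For each df1 company (index level 0), find the df2 name with the highest
# token_sort_ratio, plus the score itself.
best = scores['token'].groupby(level=0)
df1 = df1.set_index('Company')
df1['fuzzy_token_match'] = best.idxmax().str[1]  # second element of the (Company, FDA Company) tuple
df1['fuzzy_token_score'] = best.max()
print(df1.reset_index())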
