Beautiful_Soup loop over errors - python

Currently trying to web scrape numbeo.com with BeautifulSoup to extract the cost of living for ~200 cities in my data frame.
I have the following code, but there's an issue: the cities are reported differently in the URL. For example, some include only the name of the city while others end with a hyphen and state abbreviation.
https://www.numbeo.com/cost-of-living/in/Saint-Petersburg-FL
https://www.numbeo.com/cost-of-living/in/Detroit
There are some other issues, but how do I reconfigure the code below to jump to another option if there's an error:
import requests as r
from bs4 import BeautifulSoup as bs

cofl_list = []

def cost_living(cit):
    cit = str(cit)
    # cities whose numbeo slug includes the state need special handling
    cit = cit.replace('St. Petersburg', 'Saint-Petersburg-FL')
    cit = cit.replace(' ', '-')
    cit = cit.replace('St.', 'Saint')
    url = r.get(f'https://www.numbeo.com/cost-of-living/in/{cit}')
    soup = bs(url.content)
    cof = soup.find_all('span', attrs={'class': 'emp_number'})
    cof_rev = cof[1]
    cof_rev = str(cof_rev)
    cof_rev = cof_rev.replace('$', '')
    cof_rev = cof_rev.replace('<span class="emp_number">', '')
    cof_rev = cof_rev.replace('</span>', '')
    cof_rev = float(cof_rev)
    cofl_list.append(cof_rev)

For example:
def my_func():
    try:
        # execute a block of code
        ...
    except Exception:
        # if an error occurs in the try block it is caught here,
        # so you can handle it (or simply pass and move on)
        pass
Note that once an error is raised, the rest of the try block is skipped, so any code after the failing line will not be executed.
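Applied to the function above, a minimal sketch (reusing the requests/BeautifulSoup aliases r and bs from the question, and using None as an illustrative placeholder when a city page can't be parsed) could look like:
cofl_list = []

def cost_living(cit):
    cit = str(cit)
    cit = cit.replace('St. Petersburg', 'Saint-Petersburg-FL')
    cit = cit.replace(' ', '-')
    cit = cit.replace('St.', 'Saint')
    try:
        resp = r.get(f'https://www.numbeo.com/cost-of-living/in/{cit}')
        soup = bs(resp.content, 'html.parser')
        cof = soup.find_all('span', attrs={'class': 'emp_number'})
        # read the tag's text instead of stripping the HTML by hand
        value = float(cof[1].text.replace('$', '').replace(',', ''))
        cofl_list.append(value)
    except Exception:
        # request failed, the slug was wrong, or the page layout differs:
        # append a placeholder (None is an illustrative choice) so the list
        # stays aligned with the list of cities
        cofl_list.append(None)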

Related

My scraping code skips new lines - Scrapy

I have this code to scrape review text from IMDB. I want to retrieve the entire text from the review, but it skips every time there is a new line, for example:
Saw an early screening tonight in Denver.
I don't know where to begin. So I will start at the weakest link. The
acting. Still great, but any passable actor could have been given any
of the major roles and done a great job.
The code will only retrieve
Saw an early screening tonight in Denver.
Here is my code:
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')
first_review = reviews[0]
sel2 = Selector(text = first_review.get_attribute('innerHTML'))

rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []
error_url_list = []
error_msg_list = []

reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')
for d in tqdm(reviews):
    try:
        sel2 = Selector(text = d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except:
            rating = np.NaN
        try:
            review = sel2.css('.text.show-more__control::text').get()
        except:
            review = np.NaN
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except:
            review_date = np.NaN
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except:
            author = np.NaN
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except:
            review_title = np.NaN
        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)
    except Exception as e:
        error_url_list.append(url)
        error_msg_list.append(e)

review_df = pd.DataFrame({
    'review_date': review_date_list,
    'author': author_list,
    'rating': rating_list,
    'review_title': review_title_list,
    'review': review_list
})
Use .extract() instead of .get() to extract all of the matching text nodes as a list. Then, you can use ' '.join() to concatenate them into a single string.
review = sel2.css('.text.show-more__control::text').extract()
review = ' '.join(review)
output:
'For a teenager today, Dunkirk must seem even more distant than the
Boer War did to my generation growing up just after WW2. For some,
Christopher Nolan's film may be the most they will know about the
event. But it's enough in some ways because even if it doesn't show
everything that happened, maybe it goes as close as a film could to
letting you know how it felt. "Dunkirk" focuses on a number of
characters who are inside the event, living it ....'
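In the loop above, only the review lookup changes (a sketch; the surrounding try/except stays the same):
try:
    # .extract() returns every matching text node; join them into one string
    review = ' '.join(sel2.css('.text.show-more__control::text').extract())
except:
    review = np.NaN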

using beautiful soup to get consolidated data from a list of urls instead of just the first url

I'm trying to get the data for three states, based on the same URL format.
states = ['123', '124', '125']
urls = []
for state in states:
    url = f'www.something.com/geo={state}'
    urls.append(url)
and from there I have three separate URLs, each containing a different state ID.
However when I process them via BS, the output only shows data from state 123.
for url in urls:
    client = ScrapingBeeClient(api_key="API_KEY")
    response = client.get(url)
    doc = BeautifulSoup(response.text, 'html.parser')
subsequently I extracted the columns I wanted using this:
listings = doc.select('.is-9-desktop')

rows = []
for listing in listings:
    row = {}
    try:
        row['name'] = listing.select_one('.result-title').text.strip()
    except:
        print("no name")
    try:
        row['add'] = listing.select_one('.address-text').text.strip()
    except:
        print("no add")
    try:
        row['mention'] = listing.select_one('.review-mention-block').text.strip()
    except:
        pass
    rows.append(row)
But as mentioned it only showed data for state 123. Hugely appreciate it if anyone could let me know where I went wrong, thank you!
EDIT
I added the URL output into a list, and was able to get the data for all three states.
doc = []
for url in urls:
    client = ScrapingBeeClient(api_key="API_KEY")
    response = client.get(url)
    docs = BeautifulSoup(response.text, 'html.parser')
    doc.append(docs)
However when I ran it through BS it resulted in the error message:
AttributeError: 'list' object has no attribute 'select'
Do I run it through another loop?
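For reference, the AttributeError happens because .select is called on the Python list itself rather than on each BeautifulSoup object inside it. A minimal sketch of looping over the collected soups, reusing the doc list from the edit:
for docs in doc:
    listings = docs.select('.is-9-desktop')
    # ...then the same per-listing extraction as above...
As the answer below shows, though, the intermediate list isn't needed at all.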
You don't need all of these loops - just iterate over the states and append the listings to rows.
The most important thing is that rows = [] is placed outside the for loop so it doesn't get overwritten on every iteration.
Example
states = ['123', '124', '125']

rows = []
for state in states:
    url = f'www.something.com/geo={state}'
    client = ScrapingBeeClient(api_key="API_KEY")
    response = client.get(url)
    doc = BeautifulSoup(response.text, 'html.parser')

    listings = doc.select('.is-9-desktop')
    for listing in listings:
        row = {}
        try:
            row['name'] = listing.select_one('.result-title').text.strip()
        except:
            print("no name")
        try:
            row['add'] = listing.select_one('.address-text').text.strip()
        except:
            print("no add")
        try:
            row['mention'] = listing.select_one('.review-mention-block').text.strip()
        except:
            pass
        rows.append(row)
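A quick way to confirm that rows now holds listings from every state (pandas here is an assumption; it isn't in the original snippet):
print(len(rows))   # total listings across all three states
import pandas as pd
print(pd.DataFrame(rows).head())   # optional tabular view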

Scrape multiple pages with Beautiful soup

I am trying to scrape multiple pages of a URL, but I am only able to scrape the first page. Is there a way to get all the pages?
Here is my code.
from bs4 import BeautifulSoup as Soup
import urllib, requests, re, pandas as pd

pd.set_option('max_colwidth', 500)  # to remove column limit (Otherwise, we'll lose some info)
df = pd.DataFrame()
Comp_urls = ['https://www.indeed.com/jobs?q=Dell&rbc=DELL&jcid=0918a251e6902f97', 'https://www.indeed.com/jobs?q=Harman&rbc=Harman&jcid=4faf342d2307e9ed', 'https://www.indeed.com/jobs?q=johnson+%26+johnson&rbc=Johnson+%26+Johnson+Family+of+Companies&jcid=08849387e791ebc6', 'https://www.indeed.com/jobs?q=nova&rbc=Nova+Biomedical&jcid=051380d3bdd5b915']

for url in Comp_urls:
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')

for elem in targetElements:
    comp_name = elem.find('span', attrs={'class':'company'}).getText().strip()
    job_title = elem.find('a', attrs={'class':'turnstileLink'}).attrs['title']
    home_url = "http://www.indeed.com"
    job_link = "%s%s" % (home_url, elem.find('a').get('href'))
    job_addr = elem.find('span', attrs={'class':'location'}).getText()
    date_posted = elem.find('span', attrs={'class': 'date'}).getText()
    description = elem.find('span', attrs={'class': 'summary'}).getText().strip()
    comp_link_overall = elem.find('span', attrs={'class':'company'}).find('a')
    if comp_link_overall != None:
        comp_link_overall = "%s%s" % (home_url, comp_link_overall.attrs['href'])
    else:
        comp_link_overall = None
    df = df.append({'comp_name': comp_name, 'job_title': job_title,
                    'job_link': job_link, 'date_posted': date_posted,
                    'overall_link': comp_link_overall, 'job_location': job_addr, 'description': description
                    }, ignore_index=True)

df
df.to_csv('path\\web_scrape_Indeed.csv', sep=',', encoding='utf-8')
Please suggest if there is any way.
Case 1: The code presented here is exactly what you have
Comp_urls = ['https://www.indeed.com/jobs?q=Dell&rbc=DELL&jcid=0918a251e6902f97', 'https://www.indeed.com/jobs?q=Harman&rbc=Harman&jcid=4faf342d2307e9ed', 'https://www.indeed.com/jobs?q=johnson+%26+johnson&rbc=Johnson+%26+Johnson+Family+of+Companies&jcid=08849387e791ebc6', 'https://www.indeed.com/jobs?q=nova&rbc=Nova+Biomedical&jcid=051380d3bdd5b915']

for url in Comp_urls:
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')

for elem in targetElements:
The problem here is that targetElements is overwritten on every iteration of the first for loop, so by the time the second loop runs it only holds the elements from the last page.
To avoid this, indent the second for loop inside the first, like so:
for url in Comp_urls:
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')

    for elem in targetElements:
Case 2: The bug is not a result of improper indentation (i.e. your code is not like what is shown above)
If your code is properly indented, then it may be that targetElements is an empty list, meaning target.findAll('div', class_=' row result') does not return anything. In that case, visit the sites, inspect the DOM, and adjust your scraping program accordingly.
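A quick way to check for that second case (a sketch, reusing the imports, URLs, and selector from the question) is to print how many elements each page returns:
for url in Comp_urls:
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')
    # if this prints 0 for every URL, the class name no longer matches the page's DOM
    print(url, len(targetElements))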

Why is it skipping the whole for loop?

I have created a website scraper which will scrape all info from yellow pages (for educational purposes)
def actual_yellow_pages_scrape(link,no,dir,gui,sel,ypfind,terminal,user,password,port,type):
    print(link,no,dir,gui,sel,ypfind,terminal,user,password,port,type)
    r = requests.get(link, headers=REQUEST_HEADERS)
    soup = BeautifulSoup(r.content, "html.parser")
    workbook = xlwt.Workbook()
    sheet = workbook.add_sheet(str(ypfind))
    count = 0
    for i in soup.find_all(class_="business-name"):
        sheet.write(count, 0, str(i.text))
        sheet.write(count, 1, str("http://www.yellowpages.com"+i.get("href")))
        r1 = requests.get("http://www.yellowpages.com"+i.get("href"))
        soup1 = BeautifulSoup(r1.content, "html.parser")
        website = soup1.find("a", class_="custom-link")
        try:
            print("Acquiring Website")
            sheet.write(count, 2, str(website.get("href")))
        except:
            sheet.write(count, 2, str("None"))
        email = soup1.find("a", class_="email-business")
        try:
            print(email.get("href"))
            EMAIL = re.sub("mailto:", "", str(email.get("href")))
            sheet.write(count, 3, str(EMAIL))
        except:
            sheet.write(count, 3, str("None"))
        phonetemp = soup1.find("div", class_="contact")
        try:
            phone = phonetemp.find("p")
            print(phone.text)
            sheet.write(count, 4, str(phone.text))
        except:
            sheet.write(count, 4, str("None"))
        reviews = soup1.find(class_="count")
        try:
            print(reviews.text)
            sheet.write(count, 5, str(reviews.text))
        except:
            sheet.write(count, 5, str("None"))
        count += 1
    save = dir + "\\" + ypfind + str(no) + ".xls"
    workbook.save(save)
    no += 1
    for i in soup.find_all("a", class_="next ajax-page"):
        print(i.get("href"))
        actual_yellow_pages_scrape("http://www.yellowpages.com"+str(i.get("href")), no, dir, gui, sel, ypfind, terminal, user, password, port, type)
The code above is the relevant portion of the scraper. I set breakpoints at soup and inside the for loop, and not even a single line of the for loop gets executed. No errors are thrown. I tried the same structure just printing numbers from 1 to 10 and it works, but this does not. Why?
Thank you
The answer has been found.
I used a text visualizer to see what is in "r.content", soupified it to get clean HTML, went through the HTML file, and finally found that the page reported the browser as unsupported. So I removed the requests headers, ran the code, and finally got what I wanted.
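In other words, the custom REQUEST_HEADERS caused the site to serve an "unsupported browser" page containing no business-name elements, so the for loop had nothing to iterate over. A minimal check (a sketch, not the full scraper) would be:
import requests
from bs4 import BeautifulSoup

r = requests.get(link)  # 'link' is the same Yellow Pages URL passed into the function; note: no custom headers
soup = BeautifulSoup(r.content, "html.parser")
# if this prints 0, the served page is not the real results page
print(len(soup.find_all(class_="business-name")))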

Accessing non-uniform <dt></dt> <dd></dd> tags

from collections import defaultdict
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

r = requests.get("http://www.walmart.com/search/?query=marvel&cat_id=4096_530598")
r.content
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class" : "tile-content"})
data = defaultdict(list)

for tile in g_data:
    # the "tile" value in g_data contains what you are looking for...
    # find the product titles
    try:
        title = tile.find("a", "js-product-title")
        data['Product Title'].append(title.text)
    except:
        data['Product Title'].append("")
    # find the prices
    try:
        price = tile.find('span', 'price price-display').text.strip()
        data['Price'].append(price)
    except:
        data['Price'].append("")
    # find the stars
    try:
        g_star = tile.find("div", {"class" : "stars stars-small tile-row"}).find('span', 'visuallyhidden').text.strip()
        data['Stars'].append(g_star)
    except:
        data['Stars'].append("")
    try:
        dd_starring = tile.find('dd', {"class" : "media-details-multi-line media-details-artist-dd module"}).text.strip()
        data['Starring'].append(dd_starring)
    except:
        data['Starring'].append("")
    try:
        running_time = tile.find_all('dl', {"class" : "media-details dl-horizontal copy-mini"})
        for dd_run in running_time:
            running = dd_run.find_all('dd')[1:2]
            for run in running:
                #print run.text.strip()
                data['Running Time'].append(run.text.strip())
    except:
        data['Running Time'].append("")
    try:
        dd_format = tile.findAll('dd', {"class" : "media-details-multi-line"})[1:2]
        for formatt in dd_format:
            data['Format'].append(formatt.text.strip())
    except:
        data['Format'].append("")
    try:
        div_shipping = tile.find_all('div', {"data-offer-shipping-pass-eligible": "false"})
        data['Shipping'].append("")
    except:
        freeshipping = "Free Shipping"
        data['Shipping'].append(freeshipping)

df = pd.DataFrame(data)
df
I want to access the <dd> which is without a class name. How do I access it?
For example, row no. 11 has a Director field, and a few others have a Release date.
Currently I am accessing them with slices like [2:1] and so on, but that isn't flexible and doesn't populate my table correctly.
Is there a function to do this?
Substitute the Starring and Running Time parts with:
try:
    dd_starring = tile.find('dd', {"class" : "media-details-artist-dd"}).text.strip()
    data['Starring'].append(dd_starring)
except:
    data['Starring'].append("")

try:
    running = tile.find('dt', {'class': 'media-details-running-time'})
    running_time = running.find_next("dd")
    data['Running Time'].append(running_time.text)
except:
    data['Running Time'].append("")
This should run now. It seems that when you select on multiple classes at once BeautifulSoup can get confused, so you can get the actors just by the single CSS class media-details-artist-dd. For the running time I employed a simple trick :)
EDIT: Changed the code to find the <dt> for Running Time and then get the next <dd>. The previous code had an extra unneeded part.
It should work now.
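The same trick generalizes to the other unevenly-present fields: find the <dt> whose label you want and take the <dd> that follows it, instead of slicing by position. A sketch (the build_details helper and labels such as 'Director' are illustrative, not from the original code):
def build_details(tile):
    # map each <dt> label to the text of the <dd> that follows it
    details = {}
    for dt in tile.find_all('dt'):
        dd = dt.find_next('dd')
        if dd is not None:
            details[dt.text.strip().rstrip(':')] = dd.text.strip()
    return details

# usage inside the existing loop over g_data:
# details = build_details(tile)
# data['Starring'].append(details.get('Starring', ''))
# data['Director'].append(details.get('Director', ''))  # hypothetical label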
