Web-scraping: Accessing text information within a large list - python

Example: https://www.realtor.com/realestateandhomes-detail/20013-Hazeltine-Pl_Ashburn_VA_20147_M65748-31771
I am trying to access the number of garage spaces for several real estate listings. The problem is that the number of garage spaces isn't always at index 9 of the list: on some pages it appears earlier, and on others later.
garage = info[9].strip().replace('\n','')[15]
where
info = soup.find_all('ul', {'class': "list-default"})
info = [t.text for t in info]
and
header = {"user agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15"}
page = requests.get(url, headers = header)
page.reason
requests.utils.default_user_agent()
soup = bs4.BeautifulSoup(page.text, 'html5lib')
What is the best way for me to obtain how many garage spaces a house listing has?
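One robust alternative, along the lines of the answer below, is to search the scraped list for the label instead of relying on a fixed index. A minimal sketch against the info list built above (the exact "Garage Spaces:" label and number format are assumptions about the page text):
import re

# Scan every <ul class="list-default"> text blob for the garage label,
# wherever it happens to sit in the list.
garage = None
for text in info:
    match = re.search(r'Garage Spaces:\s*(\d+)', text)
    if match:
        garage = int(match.group(1))
        break
print(garage)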

You can use the CSS selector li:contains("Garage Spaces:"), which finds the <li> tag whose text contains "Garage Spaces:".
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.realtor.com/realestateandhomes-detail/20013-Hazeltine-Pl_Ashburn_VA_20147_M65748-31771'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
garage_spaces = soup.select_one('li:contains("Garage Spaces:")')
if garage_spaces:
    garage_spaces = garage_spaces.text.split()[-1]
    print('Found Garage spaces! num =', garage_spaces)
Prints:
Found Garage spaces! num = 2
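Note: newer versions of soupsieve (the CSS selector engine BeautifulSoup uses) deprecate :contains() in favor of the equivalent :-soup-contains() pseudo-class, so on a recent install the selector would be:
garage_spaces = soup.select_one('li:-soup-contains("Garage Spaces:")')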

Related

Getting an empty list when trying to extract URLs from Google with BeautifulSoup

I am trying to extract the first 100 URLs returned by a location search on Google, however I get an empty list every time ("no results found").
import requests
from bs4 import BeautifulSoup
def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all("div", class_="r")
    websites = []
    if results:
        counter = 0
        for result in results:
            websites.append(result.find("a")["href"])
            counter += 1
            if counter == 100:
                break
    else:
        print("No search results found.")
    return websites
location = "Athens"
print(get_location_info(location))
No search results found.
[]
I have also tried this approach:
import requests
from bs4 import BeautifulSoup
def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all("div", class_="r")
    websites = [result.find("a")["href"] for result in results][:10]
    return websites
location = "sifnos"
print(get_location_info(location))
and I get an empty list. I think I am doing everything suggested in similar posts, but I still get nothing.
Always and first of all, take a look at your soup to see if all the expected ingredients are in place.
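For example, a quick sanity check inside the function (a sketch; dumping the HTML to a file is my addition, not part of the original answer):
# Sanity check: is this a real results page or a consent/captcha page?
print(soup.title.get_text() if soup.title else 'no <title> in response')
with open('dump.html', 'w', encoding='utf-8') as f:
    f.write(response.text)  # open this file in a browser to inspect the real markup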
Select your elements more specifically, in this case for example with a CSS selector:
[a.get('href') for a in soup.select('a:has(>h3)')]
To avoid the consent banner, also send some cookies:
cookies={'CONSENT':'YES+'}
Example
import requests
from bs4 import BeautifulSoup
def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers, cookies={'CONSENT':'YES+'})
    soup = BeautifulSoup(response.text, 'html.parser')
    websites = [a.get('href') for a in soup.select('a:has(>h3)')]
    return websites

location = "sifnos"
print(get_location_info(location))
Output
['https://www.griechenland.de/sifnos/', 'http://de.sifnos-greece.com/plan-trip-to-sifnos/travel-information.php', 'https://www.sifnosisland.gr/', 'https://www.visitgreece.gr/islands/cyclades/sifnos/', 'http://www.griechenland-insel.de/Hauptseiten/sifnos.htm', 'https://worldonabudget.de/sifnos-griechenland/', 'https://goodmorningworld.de/sifnos-griechenland/', 'https://de.wikipedia.org/wiki/Sifnos', 'https://sifnos.gr/en/sifnos/', 'https://www.discovergreece.com/de/cyclades/sifnos']

Fix for missing 'tr' class in webscraping

I'm trying to web-scrape different stocks by rows, with the data scraped from https://www.slickcharts.com/sp500. I am following a tutorial that uses a similar website, however that website uses classes for each of its rows, while mine doesn't.
This is the code I'm trying to use, however I don't get any output whatsoever. I'm still pretty new at coding, so any feedback is welcome.
import requests
import pandas as pd
from bs4 import BeautifulSoup
company = []
symbol = []
url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for i in rows:
    row = i.find_all('td')
    print(row[0])
First of all, you need to add some headers to your request, because most likely you'll get the same thing I did: status code 403 Forbidden, since the website is blocking your request. Adding a User-Agent does the trick:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
page = requests.get(url, headers=headers)
Then you can iterate over the tr tags as you do. But be careful: the first tr, for example, contains no td tags, so row is empty and you'll get an IndexError on the line:
print(row[0])
Here is an example that prints the names of all the companies:
import requests
from bs4 import BeautifulSoup
company = []
symbol = []
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)
But this code also outputs some other data besides the company names, because you are iterating over all tr tags on the page. Instead, you need to iterate over a specific table only (the first table on the page, in this case).
import requests
from bs4 import BeautifulSoup
company = []
symbol = []
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
first_table_on_the_page = soup.find('table')
rows = first_table_on_the_page.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)
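Since the question already imports pandas, a shorter route worth mentioning is pandas.read_html, which parses every <table> on the page into a DataFrame. A sketch using the same headers (the 'Company' column name is an assumption; check df.columns on your end):
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get('https://www.slickcharts.com/sp500', headers=headers)

# read_html returns one DataFrame per <table>; the constituents table is the first
df = pd.read_html(page.text)[0]
print(df['Company'])  # assumed column name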

Empty lists when extracting tag classes

I am having some problems with extracting tags from a website:
r = req.get(web+"?pg=news&tf=G&page={}/".format(num))
soup = BeautifulSoup(r.content, 'html.parser')
results = [
    (
        x.select_one("h3.d-flex").text,
        x.select_one("div.i").text,
        x.select_one("div.a").a.text,
        x.select_one("div.entry-content").p.text,
    )
    for x in soup.findAll("section")
]
I need to scrape relevant information such as headlines, preview of content, date and link.
When I print the above tags, I get empty lists. Since I don't have a lot of experience selecting tags and I am not sure about the classes I chose above, could you have a look and tell me which one(s) are wrong?
I hope this code helps you. It assumes the URL http://gentedellarete.it/?pg=news&tf=G&page=1
import requests
from bs4 import BeautifulSoup
URL = "https://www.centrepointstores.com/sa/en/Women/Fashion-Accessories/Watches/CENTREPOINT-Citizen-Women%27s-Rose-Gold-Analog-Metal-Strap-Watch-EU-6039-86A/p/EU603986AGold"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}
r = requests.get("http://www.gentedellarete.it/?pg=news&tf=G&page={}/".format(1), headers=HEADERS)
soup = BeautifulSoup(r.content, 'html.parser')
for x in soup.findAll('div', {'class',"text py-5 pl-md-5"}):
print('\n',x.select_one("div > a:nth-child(2) h3").text, sep='\n') #heading ok
print('\n', x.select_one('p').text) #under h3 ok
print('\n', x.select('p')[1].text) # body ok
print('\n', x.select('p')[1].text.split('(')[1].strip(')')) # date ok?

Python web scrape numerical weather data

I am attempting to print the int value of the current outside air temperature (55).
Any chance for a tip on what I am doing wrong? (Sorry, not a lot of wisdom here!)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime as dt
#this is used at the end with plotting results to current hour
h = dt.datetime.now().hour
r = requests.get('https://www.google.com/search?q=weather+duluth')
soup = BeautifulSoup(r.text, 'html.parser')
stuff = []
for item in soup.select('vk_bk sol-tmp'):
    item = int(item.contents[1].get_text(strip=True)[:-1])
    # print(item)  # this is weather data
    stuff.append(item)
This is the web URL for the weather, and the current outdoor temperature is tied to a div class in the page (the screenshot of the element is omitted here).
If I attempt to print stuff I just get an empty list returned.
Adding a User-Agent header should give the expected result:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
r = requests.get('https://www.google.com/search?q=weather%20duluth', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
soup.find("span", {"class": "wob_t"}).text

How do I parse two elements that are stuck together?

I want to get rating and numVotes from zomato.com, but unfortunately it seems like the elements are stuck together. Hard to explain, but I made a quick video showcasing what I mean.
https://streamable.com/sdh0w
entire code: https://pastebin.com/JFKNuK2a
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    rating = zomato_container.find('div', {'class': 'search_result_rating'})
    # numVotes = zomato_container.find("div", {"class": "rating-votes-div"})
    print("rating: ", rating.get_text().strip())
    # print("numVotes: ", numVotes.text())
You can use the re module to parse the vote count:
import re
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    print('name:', zomato_container.select_one('.result-title').get_text(strip=True))
    print('rating:', zomato_container.select_one('.rating-popup').get_text(strip=True))
    votes = ''.join(re.findall(r'\d', zomato_container.select_one('[class^="rating-votes"]').text))
    print('votes:', votes)
    print('*' * 80)
Prints:
name: The Original Ghirardelli Ice Cream and Chocolate...
rating: 4.9
votes: 344
********************************************************************************
name: Tadich Grill
rating: 4.6
votes: 430
********************************************************************************
name: Delfina
rating: 4.8
votes: 718
********************************************************************************
...and so on.
OR:
If you don't want to use re, you can use str.split():
votes = zomato_container.select_one('[class^="rating-votes"]').get_text(strip=True).split()[0]
According to the requirements in your clip, you should alter your selectors to be more specific so as to target the appropriate child elements (rather than the parent). At present, by targeting the parent you are getting the unwanted extra child. To get the appropriate ratings element you can use a CSS attribute-value selector with the starts-with operator.
This
[class^=rating-votes-div]
matches elements whose class attribute value starts with rating-votes-div.
Example:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    name = zomato_container.select_one('.result-title').text.strip()
    rating = zomato_container.select_one('.rating-popup').text.strip()
    numVotes = zomato_container.select_one('[class^=rating-votes-div]').text
    print('name: ', name)
    print('rating: ', rating)
    print('votes: ', numVotes)
