Trouble extracting data from html-doc with BeautifulSoup - python

I'm trying to extract data from a page I scraped off the web, and I find it to be quite difficult. I tried soup.get_text(), but it's no good since it just returns single chars in a row instead of whole string objects.
Extracting the name is easy, because you can access it with the 'b' tag, but extracting the street, for example ("Am Vogelwäldchen 2"), proves to be quite difficult. I could try to assemble the address from single chars, but this seems overly complicated, and I feel there has to be an easier way of doing this. Maybe someone has a better idea. Oh, and don't mind the weird function; I returned the soup because I tried different methods on it.
import urllib.request
import time
from bs4 import BeautifulSoup

#Performs a HTTP-'POST' request, passes it to BeautifulSoup and returns the result
def doRequest(request):
    requestResult = urllib.request.urlopen(request)
    soup = BeautifulSoup(requestResult)
    return soup

def getContactInfoFromPage(page):
    name = ''
    straße = ''
    plz = ''
    stadt = ''
    telefon = ''
    mail = ''
    url = ''
    data = [
        #'Name',
        #'Straße',
        #'PLZ',
        #'Stadt',
        #'Telefon',
        #'E-Mail',
        #'Homepage'
    ]

    request = urllib.request.Request("http://www.altenheim-adressen.de/schnellsuche/" + page)
    request.add_header("Content-Type", "application/x-www-form-urlencoded;charset=utf-8")
    request.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0")

    soup = doRequest(request)

    #Save Name to data structure
    findeName = soup.findAll('b')
    name = findeName[2]
    name = name.string.split('>')
    data.append(name)

    return soup

soup = getContactInfoFromPage("suche2.cfm?id=267a0749e983c7edfeef43ef8e1c7422")
print(soup.getText())

You can rely on the field label and get the next sibling's text.
Making a nice reusable function from this would make it more transparent and easy to use:
def get_field_value(soup, field):
    field_label = soup.find('td', text=field + ':')
    return field_label.find_next_sibling('td').get_text(strip=True)
Usage:
print(get_field_value(soup, 'Name')) # prints 'AWO-Seniorenzentrum Kenten'
print(get_field_value(soup, 'Land')) # prints 'Deutschland'
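If some pages are missing a label, a slightly more defensive variant of this helper (just a sketch, not part of the original answer) could return None instead of raising an AttributeError:

def get_field_value_safe(soup, field):
    # return None when the label or its sibling cell is missing,
    # instead of raising an AttributeError
    field_label = soup.find('td', text=field + ':')
    if field_label is None:
        return None
    value_cell = field_label.find_next_sibling('td')
    return value_cell.get_text(strip=True) if value_cell else None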

Related

Scraping with Beautiful Soup does not update values properly

I am trying to web-scrape a weather website, but the data does not update properly. The code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'

while True:
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    data = soup.find("div", {"class": "weather__text"})
    print(data.text)
I am looking at 'WIND & WIND GUST' in the 'CURRENT CONDITIONS' section. It prints the first values correctly (for example 1.0 / 2.2 mph), but after that the values update very slowly (at times 5+ minutes pass), even though they change every 10-30 seconds on the website.
And when the values do update in Python, they are still different from the current values on the website.
You could try this alternate method: since the site actually retrieves the data from another URL, you can just make that request directly, and only scrape the site every hour or so to refresh the request URL.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
from datetime import datetime, timedelta

#def getReqUrl...

reqUrl = getReqUrl()
prevTime, prevAt = '', datetime.now()

while True:
    ures = json.loads(urlopen(reqUrl).read())
    if 'observations' not in ures:
        # the request url has expired - scrape a fresh one and retry
        reqUrl = getReqUrl()
        ures = json.loads(urlopen(reqUrl).read())

    #to see time since last update
    obvTime = ures['observations'][0]['obsTimeUtc']
    td = (datetime.now() - prevAt).seconds

    wSpeed = ures['observations'][0]['imperial']['windSpeed']
    wGust = ures['observations'][0]['imperial']['windGust']
    print('', end=f'\r[+{td}s -> {obvTime}]: {wGust} / {wSpeed} mph')

    if prevTime < obvTime:
        # only move to a fresh line when the observation time actually advances
        prevTime = obvTime
        prevAt = datetime.now()
        print('')
Even when making the request directly, the "observation time" in the retrieved data sometimes jumps around, which is why I only print on a fresh line when obvTime increases; without that, the output gets messy. (If you prefer that, you can just print normally without the '', end='\r...' format, and then the second if block is no longer necessary either.)
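For that simpler variant, the print inside the loop could be reduced to something like this (a sketch using the same variables as the loop above):

    print(f'[+{td}s -> {obvTime}]: {wGust} / {wSpeed} mph')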
The first if block is for refreshing the reqUrl (because it expires after a while), which is when I actually scrape the wunderground site, because the url is inside one of their script tags:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text
    nxtSt = json.loads(appText.replace('&q;', '"'))['wu-next-state-key']
    return [
        ns for ns in nxtSt.values()
        if 'observations' in ns['value'] and
        len(ns['value']['observations']) == 1
    ][0]['url'].replace('&a;', '&')
Or, since I know how the url starts, more simply:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text

    rUrl = 'https://api.weather.com/v2/pws/observations/current'
    rUrl = rUrl + appText.split(rUrl)[1].split('&q;')[0]
    return rUrl.replace('&a;', '&')
Try:

import requests
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

session = requests.Session()
r = session.get(url, timeout=30, headers=headers)  # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

#'WIND & WIND GUST' in 'CURRENT CONDITIONS' section
wind_gust = [float(i.text) for i in soup.select_one('.weather__header:-soup-contains("WIND & GUST")').find_next('div', class_='weather__text').select('span.wu-value-to')]

print(wind_gust)
# [1.8, 2.2]

wind = wind_gust[0]
gust = wind_gust[1]

print(wind)
# 1.8
print(gust)
# 2.2

How to get all page results - Web Scraping - Pagination

I am a beginner when it comes to coding. Right now I am trying to get a grip on simple web scrapers using Python.
I want to scrape a real estate website and get the title, price, sqm, and so on into a CSV file.
My questions:
It seems to work for the first page of results, but then it does not run through all 40 pages; it just fills the file with the same results over and over.
The listings have info about "square meters" and the "number of rooms". When I inspect the page, it seems that the same class is used for both elements. How would I extract the number of rooms, for example?
Here is the code that I have gathered so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={1}'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    divs = soup.find_all('div', class_ = 'col-xs-12 place-over-understitial sel-bg-gray-lighter')
    for item in divs:
        title = item.find('div', {'class': 'text-225'}).text.strip().replace('\n', '')
        title2 = title.replace('\t', '')
        hausart = item.find('span', class_ = 'text-100').text.strip().replace('\n', '')
        hausart2 = hausart.replace('\t', '')
        try:
            price = item.find('span', class_ = 'text-250 text-strong text-nowrap').text.strip()
        except:
            price = 'Auf Anfrage'
        wohnflaeche = item.find('p', class_ = 'text-250 text-strong text-nowrap').text.strip().replace('m²', '')
        angebot = {
            'title': title2,
            'hausart': hausart2,
            'price': price
        }
        hauslist.append(angebot)
    return

hauslist = []
for i in range(0, 40):
    print(f'Getting page {i}...')
    c = extract(i)
    transform(c)

df = pd.DataFrame(hauslist)
print(df.head())
df.to_csv('immonetHamburg.csv')
This is my first post on stackoverflow so please be kind if I should have posted my problem differently.
Thanks
Pat
You have a simple mistake: in the url you have to use {page} instead of {1}. That's all.
url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
I see another problem: you start scraping at page 0, but servers often return the same results for page 0 and page 1.
You should use range(1, ...) instead of range(0, ...) - see the sketch below.
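Putting both fixes together, a corrected extract() and loop might look like this (a sketch based on the code in the question):

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    # use the page argument in the f-string instead of the literal {1}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
    r = requests.get(url, headers)
    return BeautifulSoup(r.content, 'html.parser')

# start at page 1, not 0
for i in range(1, 41):
    print(f'Getting page {i}...')
    transform(extract(i))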
As for searching for elements:
BeautifulSoup can search not only by class but also by id and any other attribute of a tag - i.e. name, style, data, etc. It can also search by text such as "number of rooms", and it can use a regex for this. You can also pass your own function, which checks an element and returns True/False to decide whether to keep it in the results.
You can also combine .find() with another .find() or .find_all().
price = item.find('div', {"id": lambda value: value and value.startswith('selPrice')}).find('span')
if price:
    print("price:", price.text)
And if you know that "square meters" comes before "number of rooms", then you could use find_all() to get both of them, and later use [0] to get the first one and [1] to get the second one.
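A minimal sketch of that idea, assuming both values sit in <p> tags with class text-250 (the class used in the question and in the answer below):

details = item.find_all('p', {'class': 'text-250'})
if len(details) >= 2:
    wohnflaeche = details[0].text.strip()   # first one: square meters
    zimmer = details[1].text.strip()        # second one: number of rooms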
You should read all of the documentation, because it can be very useful.
I advise you to use Selenium instead, because you can physically click the 'next page' button until you cover all the pages, and the whole thing only takes a few lines of code.
As @furas mentioned, you have a mistake with the page parameter.
To get the rooms you need find_all() and then the last element with index -1, because sometimes there are 3 items and sometimes 2.
#to remove all \n and \t characters
translator = str.maketrans({chr(10): '', chr(9): ''})

rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
    rooms = rooms[-1].text.translate(translator).strip()
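Inside the transform() loop from the question, that could then be added to the dict roughly like this (a sketch; the 'rooms' key is just illustrative):

rooms = item.find_all('p', {'class': 'text-250'})
rooms_text = rooms[-1].text.translate(translator).strip() if rooms else ''

angebot = {
    'title': title2,
    'hausart': hausart2,
    'price': price,
    'rooms': rooms_text
}
hauslist.append(angebot)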

Scraping a website with a particular format using Python

I am trying to use Python to scrape the US News Ranking for universities, and I'm struggling. I normally use Python "requests" and "BeautifulSoup".
The data is here:
https://www.usnews.com/education/best-global-universities/rankings
Right-clicking and inspecting shows a bunch of links, and I don't even know which one to pick. I followed an example that I found on the web, but it just gives me empty data:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import math
from lxml.html import parse
from io import StringIO

url = 'https://www.usnews.com/education/best-global-universities/rankings'
urltmplt = 'https://www.usnews.com/education/best-global-universities/rankings?page=2'

css = '#resultsMain :nth-child(1)'
npage = 20

urlst = [url] + [urltmplt + str(r) for r in range(2, npage+1)]

def scrapevec(url, css):
    doc = parse(StringIO(url)).getroot()
    return([link.text_content() for link in doc.cssselect(css)])

usng = []
for u in urlst:
    print(u)
    ts = [re.sub("\n *", " ", t) for t in scrapevec(u, css) if t != ""]
This doesn't work: ts is just an empty list.
I'd really appreciate any help.
The MWE you posted does not run as-is (for example, re is used but never imported). I strongly suggest you look for basic scraping tutorials (with Python, Java, etc.): there are plenty of them, and they are a good starting point in general.
Below you can find a snippet of code that prints the universities' names listed on page 1 - you'll be able to extend the code to all 150 pages with a for loop.
import requests
from bs4 import BeautifulSoup

newheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
}

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings'
page1 = requests.get(baseurl, headers = newheaders)  # change headers or get blocked
soup = BeautifulSoup(page1.text, 'lxml')

res_tab = soup.find('div', {'id' : 'resultsMain'})  # find the results' table
for a, univ in enumerate(res_tab.findAll('a', href = True)):  # parse universities' names
    if a < 10:  # there are 10 listed universities per page
        print(univ.text)
Edit: now the example works, but as you say in your question, it only returns empty lists. Below is an edited version of the code that returns a list of all the universities (pages 1-150).
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers = newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id' : 'resultsMain'})  # find the results' table
    res = []
    for a, univ in enumerate(res_tab.findAll('a', href = True)):  # parse universities' names
        if a < 10:  # there are 10 listed universities per page
            res.append(univ.text)
    return res

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]  # unfold the list of lists
Re-edit following QHarr's suggestion (thanks!) - same output, but a shorter and more "pythonic" solution:
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers = newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id' : 'resultsMain'})  # find the results' table
    return [univ.text for univ in res_tab.select('[href]', limit=10)]

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]  # unfold the list of lists

Not able to scrape all the reviews

I am trying to scrape this website to get the reviews, but I am facing an issue:
the page loads only 50 reviews.
To load more, you have to click "Show More Reviews", and I don't know how to get all the data, since there is no page link; "Show More Reviews" also doesn't have a URL to explore, the address remains the same.
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

a = []

url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class": "review-comments"})
#print(table)
for x in table:
    a.append(x.text)

df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first.
Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an ajax call that returns the additional info; all you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}

url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"

Data = []

#Each page is equivalent to 50 comments:
MaximumCommentPages = 3

with requests.Session() as session:
    info = session.get(url)

    #Get product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)

    #Extract info from main data
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)

    #Number of pages to get:
    #Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # number 1 because one of them was the main page data which we already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        #print(additionalInfo.text)

        #Extract info from the additional data:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)

#Extract data the old fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the ajax call.
Edit 1: You can reload the webpage multiple times and call the ajax again to get even more data.
Edit 2: Save data using your own method.
Edit 3: Changed some things; it now gets any number of pages for you and saves to a file with good ol' open().
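If you'd rather keep the pandas approach from your question for saving, a sketch of that (reusing the Data list built by the code above):

import pandas as pd

# Data holds the parsed review-comments tags collected above
df = pd.DataFrame([one.text for one in Data], columns=['review'])
df.to_csv('review.csv', sep='\t', index=False)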

Python and BeautifulSoup encoding issue from UTF-8

I'm new to Python and am currently writing an application that scrapes data off the web. It's mostly done; there is only a little problem left with encoding. The site is encoded in ISO-8859-1, but when I try html.decode('iso-8859-1'), it doesn't do anything.
If you run the program, use 50000 and 50126 for the PLZs and you'll see what I mean in the output. It would be awesome if someone could help me out.
import urllib.request
import time
import csv
import operator
from bs4 import BeautifulSoup

#Performs a HTTP-'POST' request, passes it to BeautifulSoup and returns the result
def doRequest(request):
    requestResult = urllib.request.urlopen(request)
    soup = BeautifulSoup(requestResult)
    return soup

#Returns all the result links from the given search parameters
def getLinksFromSearch(plz_von, plz_bis):
    database = []
    links = []

    #The search parameters
    params = {
        'name_ff': '',
        'strasse_ff': '',
        'plz_ff': plz_von,
        'plz_ff2': plz_bis,
        'ort_ff': '',
        'bundesland_ff': '',
        'land_ff': 'DE',
        'traeger_ff': '',
        'Dachverband_ff': '',
        'submit2' : 'Suchen'
    }

    DATA = urllib.parse.urlencode(params)
    DATA = DATA.encode('utf-8')

    request = urllib.request.Request(
        "http://www.altenheim-adressen.de/schnellsuche/suche1.cfm",
        DATA)
    # adding charset parameter to the Content-Type header.
    request.add_header("Content-Type", "application/x-www-form-urlencoded;charset=utf-8")
    request.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0")

    #The search request
    html = doRequest(request)
    h = html.decode('iso-8859-1')
    soup = BeautifulSoup(h)

    for link in soup.find_all('a'):
        database.append(link.get('href'))

    #Remove the first Element ('None') to avoid Attribute Errors
    database.pop(0)

    for item in database:
        if item.startswith("suche"):
            links.append(item)

    return links

#Performs a search on the link results
def searchOnLinks(links):
    adresses = []
    i = 1
    j = len(links)
    print("Found", j, "results, collecting data.")
    for item in links:
        adresses.append(getContactInfoFromPage(item, i, j))
        i = i + 1
        time.sleep(0.1)
    print("All done.")
    return adresses

#A method to scrape the contact info from the search result
def getContactInfoFromPage(page, i, j):
    name = ''
    straße = ''
    plz = ''
    stadt = ''
    telefon = ''
    mail = ''
    url = ''

    data = [
        #'Name',
        #'Straße',
        #'PLZ',
        #'Stadt',
        #'Telefon',
        #'E-Mail',
        #'Homepage'
    ]

    request = urllib.request.Request("http://www.altenheim-adressen.de/schnellsuche/" + page)
    #request.add_header("Content-Type", "application/x-www-form-urlencoded;charset=utf-8")
    request.add_header("Content-Type", "text/html;charset=UTF-8")
    request.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0")

    print("(", i, "/", j, ") Making request...")
    soup = doRequest(request)
    print("Done.")

    findeName = soup.findAll('b')
    name = findeName[2]
    name = name.string.split('>')
    data.append(name[0])

    straße = getFieldValue(soup, "Straße")
    data.append(straße)

    ort = getFieldValue(soup, "Ort")
    (plz, stadt) = ort.split(' ', 1)
    data.append(plz)
    data.append(stadt)

    telefon = getFieldValue(soup, "Telefon")
    data.append(telefon)

    mail = getFieldValue(soup, "EMail")
    data.append(mail)

    url = getFieldValue(soup, "Internetadresse")
    data.append(url)

    return data

#Strips the text from the given field's sibling
def getFieldValue(soup, field):
    field_label = soup.find('td', text=field + ':')
    return field_label.find_next_sibling('td').get_text(strip=True)

#The main input/output function
def inputOutput():
    #PLZ is German for zip-code and consists of a five-digit number
    #The program passes the numbers to the servers, and the server
    #returns all search results between the two numbers
    plz_von = input("Please enter first PLZ: ")
    plz_bis = input("Please enter second PLZ: ")

    links = getLinksFromSearch(plz_von, plz_bis)

    #Checks if the search yielded any results
    if len(links) > 0:
        data = searchOnLinks(links)
        file_name = input("Save as: ")
        print("Writing to file...")
        with open(file_name + '.csv', 'w', newline='') as fp:
            a = csv.writer(fp, delimiter=',')
            a.writerows(data)
    else:
        print("The search yielded no results.")

inputOutput()
Your doRequest() function returns a BeautifulSoup object; you cannot decode that object. Just use it directly:
soup = doRequest(request)
You don't need to decode the response at all; BeautifulSoup uses both hints in the HTML (<meta> headers) as well as statistical analysis to determine the correct input encoding.
In this case the HTML document claims it is Latin-1:
<meta name="content-type" content="text/html; charset=iso-8859-1">
The response doesn't include a character set in the Content-Type header either, so this is a case of a misconfigured server. You can force BeautifulSoup to ignore the <meta> header with:
soup = BeautifulSoup(requestResult, from_encoding='utf8')
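If you want to see which encoding BeautifulSoup actually settled on, you can inspect its original_encoding attribute; a minimal sketch (reading the response bytes once so they can be parsed twice):

import urllib.request
from bs4 import BeautifulSoup

raw = urllib.request.urlopen(
    "http://www.altenheim-adressen.de/schnellsuche/suche1.cfm").read()

soup = BeautifulSoup(raw)                         # let BeautifulSoup guess
print(soup.original_encoding)                     # e.g. 'iso-8859-1'

soup = BeautifulSoup(raw, from_encoding='utf8')   # force the override suggested above
print(soup.original_encoding)                     # now reports 'utf8'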
