Scraping a website with a particular format using Python

I am trying to use Python to scrape the US News ranking for universities, and I'm struggling. I normally use the "requests" and "BeautifulSoup" libraries.
The data is here:
https://www.usnews.com/education/best-global-universities/rankings
Right-clicking and inspecting the page shows a bunch of links, and I don't even know which one to pick. I followed an example I found on the web, but it just gives me empty data:
import re
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import math
from lxml.html import parse
from io import StringIO

url = 'https://www.usnews.com/education/best-global-universities/rankings'
urltmplt = 'https://www.usnews.com/education/best-global-universities/rankings?page='
css = '#resultsMain :nth-child(1)'
npage = 20

urlst = [url] + [urltmplt + str(r) for r in range(2, npage + 1)]

def scrapevec(url, css):
    doc = parse(StringIO(url)).getroot()
    return [link.text_content() for link in doc.cssselect(css)]

usng = []
for u in urlst:
    print(u)
    ts = [re.sub(r"\n *", " ", t) for t in scrapevec(u, css) if t != ""]
This doesn't work as t is an empty array.
I'd really appreciate any help.

The MWE you posted is not working at all: urlst is never defined and cannot be called. I strongly suggest you look for basic scraping tutorials (in Python, Java, etc.): there are plenty, and in general they are a good starting point.
Below is a snippet of code that prints the universities' names listed on page 1; you'll be able to extend the code to all 150 pages with a for loop.
import requests
from bs4 import BeautifulSoup

newheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
}

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings'
page1 = requests.get(baseurl, headers=newheaders)  # change headers or get blocked
soup = BeautifulSoup(page1.text, 'lxml')

res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results table
for a, univ in enumerate(res_tab.findAll('a', href=True)):  # parse universities' names
    if a < 10:  # there are 10 listed universities per page
        print(univ.text)
Edit: the example now runs but, as you say in your question, it only returns empty lists. Below is an edited version of the code that returns a list of all universities (pp. 1-150):
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers=newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results table
    res = []
    for a, univ in enumerate(res_tab.findAll('a', href=True)):  # parse universities' names
        if a < 10:  # there are 10 listed universities per page
            res.append(univ.text)
    return res

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]  # unfold the list of lists
Re-edit following QHarr's suggestion (thanks!): same output, but a shorter and more "pythonic" solution:
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers=newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results table
    return [univ.text for univ in res_tab.select('[href]', limit=10)]

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]  # unfold the list of lists
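The final "unfold" step is plain list flattening; the same thing can also be done with itertools.chain.from_iterable from the standard library. A small sketch on a hypothetical list of lists standing in for the pages' results:

```python
from itertools import chain

# hypothetical per-page results standing in for the output of parse_univ
ll = [['Harvard University', 'MIT'], ['Stanford University']]

# chain.from_iterable lazily concatenates the inner lists
univs = list(chain.from_iterable(ll))
print(univs)  # ['Harvard University', 'MIT', 'Stanford University']
```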

Related

Scraping with Beautiful Soup does not update values properly

I'm trying to web-scrape a weather website, but the data does not update properly. The code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'

while True:
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    data = soup.find("div", {"class": "weather__text"})
    print(data.text)
I am looking at 'WIND & WIND GUST' in the 'CURRENT CONDITIONS' section. It prints the first values correctly (for example 1.0 / 2.2 mph), but after that the values update very slowly (at times 5+ minutes pass), even though they change every 10-30 seconds on the website.
And when the values do update in Python, they are still different from the current values on the website.
You could try this alternate method: since the site actually retrieves the data from another URL, you can just make that request directly, and scrape the site itself only every hour or so to refresh the request URL.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
from datetime import datetime, timedelta

# def getReqUrl...

reqUrl = getReqUrl()
prevTime, prevAt = '', datetime.now()

while True:
    ures = json.loads(urlopen(reqUrl).read())
    if 'observations' not in ures:
        # the request url has expired - scrape a fresh one
        reqUrl = getReqUrl()
        ures = json.loads(urlopen(reqUrl).read())

    # to see time since last update
    obvTime = ures['observations'][0]['obsTimeUtc']
    td = (datetime.now() - prevAt).seconds

    wSpeed = ures['observations'][0]['imperial']['windSpeed']
    wGust = ures['observations'][0]['imperial']['windGust']
    print('', end=f'\r[+{td}s -> {obvTime}]: {wGust} / {wSpeed} mph')

    if prevTime < obvTime:
        prevTime = obvTime
        prevAt = datetime.now()
        print('')
Even when making the request directly, the "observation time" in the retrieved data sometimes jumps around, which is why I only print on a fresh line when obvTime increases; otherwise the output gets noisy with repeats. (If you prefer, you can just print normally without the '', end='\r...' format, and then the second if block is no longer necessary either.)
The first if block refreshes reqUrl (because it expires after a while); that is when I actually scrape the wunderground site, because the URL is inside one of their script tags:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text
    nxtSt = json.loads(appText.replace('&q;', '"'))['wu-next-state-key']
    return [
        ns for ns in nxtSt.values()
        if 'observations' in ns['value'] and
        len(ns['value']['observations']) == 1
    ][0]['url'].replace('&a;', '&')
or, since I know how the URL starts, more simply:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text
    rUrl = 'https://api.weather.com/v2/pws/observations/current'
    rUrl = rUrl + appText.split(rUrl)[1].split('&q;')[0]
    return rUrl.replace('&a;', '&')
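As an aside, the prevTime < obvTime comparison in the loop works on strings because ISO-8601 UTC timestamps sort lexicographically. If you want an actual numeric lag between two observations, you can parse them with datetime; a small sketch, where the exact obsTimeUtc format is an assumption based on typical Z-suffixed ISO-8601 values:

```python
from datetime import datetime

def parse_obs_time(ts):
    # fromisoformat() before Python 3.11 rejects a trailing 'Z',
    # so swap it for an explicit UTC offset first
    return datetime.fromisoformat(ts.replace('Z', '+00:00'))

# fabricated example timestamps, 30 seconds apart
a = parse_obs_time('2022-11-07T12:00:00Z')
b = parse_obs_time('2022-11-07T12:00:30Z')
print((b - a).total_seconds())  # 30.0
```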
Try:

import requests
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

session = requests.Session()
r = session.get(url, timeout=30, headers=headers)  # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

# 'WIND & WIND GUST' in 'CURRENT CONDITIONS' section
wind_gust = [
    float(i.text)
    for i in soup.select_one('.weather__header:-soup-contains("WIND & GUST")')
               .find_next('div', class_='weather__text')
               .select('span.wu-value-to')
]

print(wind_gust)   # [1.8, 2.2]

wind = wind_gust[0]
gust = wind_gust[1]
print(wind)        # 1.8
print(gust)        # 2.2

How can I scrape data beyond the page limit on Zillow?

I created a script to scrape Zillow data and it works fine. The only problem I have is that it's limited to 20 pages, even though there are many more results. Is there a way to get around this page limitation and scrape all the data?
I'd also like to know if there is a general solution to this problem, since I encounter it on practically every site that I want to scrape.
Thank you
from bs4 import BeautifulSoup
import requests
import lxml
import json
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

search_link = 'https://www.zillow.com/homes/Florida--/'
response = requests.get(url=search_link, headers=headers)

pages_number = 19

def OnePage():
    soup = BeautifulSoup(response.text, 'lxml')
    data = json.loads(
        soup.select_one("script[data-zrr-shared-data-key]")
        .contents[0]
        .strip("!<>-")
    )
    all_data = data['cat1']['searchResults']['listResults']

    result = []
    for i in range(len(all_data)):
        property_link = all_data[i]['detailUrl']
        property_response = requests.get(url=property_link, headers=headers)
        property_page_source = BeautifulSoup(property_response.text, 'lxml')
        property_data_all = json.loads(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['apiCache'])
        zp_id = str(json.loads(property_page_source.find('script', {'id': 'hdpApolloPreloadedData'}).get_text())['zpid'])
        property_data = property_data_all['ForSaleShopperPlatformFullRenderQuery{"zpid":' + zp_id + ',"contactFormRenderParameter":{"zpid":' + zp_id + ',"platform":"desktop","isDoubleScroll":true}}']["property"]
        home_info = {}  # one dict per property
        home_info["Broker Name"] = property_data['attributionInfo']['brokerName']
        home_info["Broker Phone"] = property_data['attributionInfo']['brokerPhoneNumber']
        result.append(home_info)
    return result

all_page_property_info = []
for page in range(pages_number):
    property_info_one_page = OnePage()
    search_link = 'https://www.zillow.com/homes/Florida--/' + str(page + 2) + '_p'
    response = requests.get(url=search_link, headers=headers)
    all_page_property_info = all_page_property_info + property_info_one_page

data = pd.DataFrame(all_page_property_info)
data.to_csv(f"/Users//Downloads/Zillow Search Result.csv", index=False)
Actually, you can't grab most of this data from Zillow using bs4 alone, because it is loaded dynamically by JavaScript, and bs4 can't render JS; only 6 to 8 data items are static. However, all of the data is embedded in a script tag, inside an HTML comment, as JSON. To pull the required data, you can follow the next example.
This way you can extract all the items; pulling the rest of the data items is your task, or just add your fields to the loop below.
Zillow is a well-known and fairly smart website, so we should respect its terms and conditions.
Example:
import requests
import re
import json
import pandas as pd

url = 'https://www.zillow.com/fl/{page}_p/?searchQueryState=%7B%22usersSearchTerm%22%3A%22FL%22%2C%22mapBounds%22%3A%7B%22west%22%3A-94.21964006249998%2C%22east%22%3A-80.68448381249998%2C%22south%22%3A22.702203494269085%2C%22north%22%3A32.23788425255877%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A14%2C%22regionType%22%3A2%7D%5D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A6%2C%22pagination%22%3A%7B%22currentPage%22%3A2%7D%7D'

lst = []
for page in range(1, 21):
    r = requests.get(url.format(page=page), headers={'User-Agent': 'Mozilla/5.0'})
    # the listing data sits in an HTML comment as JSON
    data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
    for item in data['cat1']['searchResults']['listResults']:
        price = item['price']
        lst.append({'price': price})

df = pd.DataFrame(lst)
df.to_csv('out.csv', index=False)
print(df)
Output:
price
0 $354,900
1 $164,900
2 $155,000
3 $475,000
4 $245,000
.. ...
795 $295,000
796 $10,000
797 $385,000
798 $1,785,000
799 $1,550,000
[800 rows x 1 columns]
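To pull more than the price, extend the inner loop with extra keys from each listResults item. Apart from 'price' and 'detailUrl', which appear in the code above, any field name (such as 'address' here) is an assumption about the payload, so .get() is used to tolerate missing keys; a sketch on a fabricated item:

```python
def extract_listing(item):
    # 'price' and 'detailUrl' appear in the code above;
    # 'address' is an assumed key, hence .get() with a None fallback
    return {
        'price': item.get('price'),
        'address': item.get('address'),
        'url': item.get('detailUrl'),
    }

# fabricated sample item, not real Zillow data
sample = {'price': '$354,900', 'detailUrl': 'https://www.zillow.com/homedetails/x'}
print(extract_listing(sample))  # {'price': '$354,900', 'address': None, 'url': 'https://www.zillow.com/homedetails/x'}
```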

Can't parse different product links from a webpage

I've created a script in Python to fetch different product links from a webpage. Although I know the content of that site is dynamic, I tried the conventional way anyway, just to show that I tried. I looked for APIs in the dev tools but could not find one. Isn't there any way to get those links using requests?
Site Link
What I've written so far:
import requests
from bs4 import BeautifulSoup

link = "https://www.amazon.com/stores/node/10699640011"

def fetch_product_links(url):
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for item_link in soup.select("[id^='ProductGrid-'] li[class^='style__itemOuter__'] > a"):
        print(item_link.get("href"))

if __name__ == '__main__':
    fetch_product_links(link)
How can I fetch different product links from that site using requests?
I think you only need the ASINs, which you can collect via another URL construct visible in the network tab, i.e. you can significantly shorten the final URLs. You do, however, need to make a request to your original URL to pick up an identifier to use in the second URL. This returns 146 links.
import requests, re, json

node = '10699640011'

with requests.Session() as s:
    r = s.get(f'https://www.amazon.com/stores/node/{node}')
    p = re.compile(r'var slotsStr = "\[(.*?,){3} share\]";')
    identifier = p.findall(r.text)[0]
    identifier = identifier.strip()[:-1]
    r = s.get(f'https://www.amazon.com/stores/slot/{identifier}?node={node}')
    p = re.compile(r'var config = (.*?);')
    data = json.loads(p.findall(r.text)[0])
    asins = data['content']['ASINList']
    links = [f'https://www.amazon.com/dp/{asin}' for asin in asins]
    print(links)
EDIT:
With two given nodes:

import requests, re, json
from bs4 import BeautifulSoup as bs

nodes = ['3039806011', '10699640011']

with requests.Session() as s:
    for node in nodes:
        r = s.get(f'https://www.amazon.com/stores/node/{node}')
        soup = bs(r.content, 'lxml')
        identifier = soup.select('.stores-widget-btf:not([id=share],[id*=RECOMMENDATION])')[-1]['id']
        r = s.get(f'https://www.amazon.com/stores/slot/{identifier}?node={node}')
        p = re.compile(r'var config = (.*?);')
        data = json.loads(p.findall(r.text)[0])
        asins = data['content']['ASINList']
        links = [f'https://www.amazon.com/dp/{asin}' for asin in asins]
        print(links)
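The var config = (.*?); pattern is non-greedy, so it captures up to the first semicolon after the assignment; this works as long as the JSON itself contains no bare semicolon. The extraction step can be seen in isolation on a fabricated script body with made-up ASINs:

```python
import re
import json

# fabricated script text standing in for the slot page's response body;
# the ASIN values are made up for illustration
script = 'var foo = 1; var config = {"content": {"ASINList": ["B000000001", "B000000002"]}}; var bar = 2;'

p = re.compile(r'var config = (.*?);')
data = json.loads(p.findall(script)[0])
asins = data['content']['ASINList']
links = [f'https://www.amazon.com/dp/{asin}' for asin in asins]
print(links)  # ['https://www.amazon.com/dp/B000000001', 'https://www.amazon.com/dp/B000000002']
```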

Not able to scrape all the reviews

I am trying to scrape this website to get the reviews, but I am facing an issue: the page loads only 50 reviews. To load more, you have to click "Show More Reviews", and I don't know how to get all the data, as there is no page link; "Show More Reviews" also doesn't have a URL to explore, the address remains the same.
url =
"https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

a = []
url = requests.get(url)
html = url.text
soup = BeautifulSoup(html, "html.parser")
table = soup.findAll("div", {"class": "review-comments"})
# print(table)
for x in table:
    a.append(x.text)

df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')
I know this is not pretty code, but I am just trying to get the review text first.
Kindly help, as I am a little new to this.
Looking at the website, the "Show more reviews" button makes an AJAX call that returns the additional info; all you have to do is find its link and send a GET request to it (which I've done with some simple regex):
import requests
import re
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}

url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"

Data = []

# Each page is equivalent to 50 comments; number of pages to get:
MaximumCommentPages = 3

with requests.Session() as session:
    info = session.get(url)
    # Get product ID, needed for getting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)

    # Extract info from main data
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)

    # Get additional data:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # 1, because one of the pages was the main page data which we already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params)
        print(additionalInfo.url)
        # print(additionalInfo.text)

        # Extract the additional info:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)

# Extract data the old-fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1
Notice how I'm using a session to preserve cookies for the AJAX call.
Edit 1: You can reload the webpage multiple times and call the AJAX endpoint again to get even more data.
Edit 2: Save the data using your own method.
Edit 3: Changed some things; the code now gets any number of pages for you and saves to a file with good ol' open().
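If you'd rather not build the CSV by hand, the final write-out can use the standard csv module, which handles quoting of commas and newlines inside the review text. A sketch, with fabricated review strings standing in for the .text of each Data entry:

```python
import csv

# fabricated review texts standing in for the .text of each Data entry
reviews = ['Great app for our daycare.', 'Support was slow, but helpful.']

with open('review.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'review'])
    # csv.writer quotes fields containing commas or newlines for us
    for counter, text in enumerate(reviews, start=1):
        writer.writerow([counter, text])
```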

Trouble extracting data from an HTML doc with BeautifulSoup

I'm trying to extract data from a page I scraped off the web, and I'm finding it quite difficult. I tried soup.get_text(), but it's no good, since it just returns single characters in a row instead of whole string objects.
Extracting the name is easy, because you can access it with the 'b' tag, but extracting, for example, the street ("Am Vogelwäldchen 2") proves quite difficult. I could try to assemble the address from single characters, but that seems overly complicated, and I feel there has to be an easier way of doing this. Maybe someone has a better idea. Oh, and don't mind the weird function; I return the soup because I tried different methods on it.
import urllib.request
import time
from bs4 import BeautifulSoup

# Performs an HTTP 'POST' request, passes the result to BeautifulSoup and returns it
def doRequest(request):
    requestResult = urllib.request.urlopen(request)
    soup = BeautifulSoup(requestResult)
    return soup

def getContactInfoFromPage(page):
    name = ''
    straße = ''
    plz = ''
    stadt = ''
    telefon = ''
    mail = ''
    url = ''
    data = [
        # 'Name',
        # 'Straße',
        # 'PLZ',
        # 'Stadt',
        # 'Telefon',
        # 'E-Mail',
        # 'Homepage'
    ]

    request = urllib.request.Request("http://www.altenheim-adressen.de/schnellsuche/" + page)
    request.add_header("Content-Type", "application/x-www-form-urlencoded;charset=utf-8")
    request.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0")

    soup = doRequest(request)

    # Save name to data structure
    findeName = soup.findAll('b')
    name = findeName[2]
    name = name.string.split('>')
    data.append(name)

    return soup

soup = getContactInfoFromPage("suche2.cfm?id=267a0749e983c7edfeef43ef8e1c7422")
print(soup.getText())
You can rely on the field label and get the next sibling's text.
Making a nice reusable function out of this makes it more transparent and easier to use:
def get_field_value(soup, field):
    field_label = soup.find('td', text=field + ':')
    return field_label.find_next_sibling('td').get_text(strip=True)
Usage:
print(get_field_value(soup, 'Name')) # prints 'AWO-Seniorenzentrum Kenten'
print(get_field_value(soup, 'Land')) # prints 'Deutschland'
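Putting it together, the same helper can fill the data structure from the question. Here it is exercised on a minimal inline table that mimics the page's label/value layout; the HTML snippet is fabricated for illustration:

```python
from bs4 import BeautifulSoup

def get_field_value(soup, field):
    # find the label cell ('Name:', 'Land:', ...) and take its sibling's text
    field_label = soup.find('td', text=field + ':')
    return field_label.find_next_sibling('td').get_text(strip=True)

# fabricated snippet mimicking the label/value table on the real page
sample = """
<table>
  <tr><td>Name:</td><td>AWO-Seniorenzentrum Kenten</td></tr>
  <tr><td>Land:</td><td>Deutschland</td></tr>
</table>
"""
soup = BeautifulSoup(sample, 'html.parser')
record = {field: get_field_value(soup, field) for field in ('Name', 'Land')}
print(record)  # {'Name': 'AWO-Seniorenzentrum Kenten', 'Land': 'Deutschland'}
```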
