Beautiful Soup web scraping complex html for data - python

Ok, so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the Overwatch League website for stats etc., save them in a database, and then pull from that database with a Discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the HTML for the standings page.
As you can see, it's quite convoluted and hard to navigate with the repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title at the top of the table, access its parent element, and then iterate through the siblings to pull data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import re
import pprint

url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')

# for stat in page.find(string=re.compile("rank")):
#     statObject = {
#         'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
#     }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
#     print(tag.name)

print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The comments are all failed attempts, and all return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium, but I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even if you could just comment with some advice on how else to approach this, I would greatly appreciate it.
Thank you

As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get it using BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json
url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
script = soup.find("script",{"id":"__NEXT_DATA__"})
data = json.loads(script.text)
tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]
result = [
    { i["title"] : i["tables"][0]["teams"] }
    for i in tabs[0]
]
print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries; the keys are the titles of the 3 tabs and the values are the table data.
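If it helps to see what's inside before wiring up your database, here is a quick, hedged peek at the structure (the field names inside each team entry aren't guaranteed here, so this just prints the keys so you can decide what to store):
# Peek at the structure so you can decide which fields to keep for your DB.
for tab in result:
    for title, teams in tab.items():
        print(title, "-", len(teams), "teams")
        if teams:
            # Field names vary; inspect the keys of the first entry and adapt.
            print("  first team keys:", sorted(teams[0].keys()))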

Related

Can't access table and table elements using bs4

So I am trying to scrape the following webpage: https://www.omscentral.com/
The main table there is my item of interest. I want to scrape the table and all of its content. When I inspect the content of the page, the table is in a table tag, so I figured it would be easy to access with the code below.
import requests
from bs4 import BeautifulSoup

url = 'https://www.omscentral.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('table')
However, that code only returns the table header. I saw a similar example here, but the solution of switching the parser did not work.
When I look at the soup object itself, it seems that requests does not expand the table and only captures the header. Not too sure what to do here - any advice would be much appreciated!
Content is stored in a script tag and rendered dynamically, so you have to extract the data from there:
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
To display in DataFrame simply use:
pd.DataFrame(data)
Example
import requests, json
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0'}
url = 'https://www.omscentral.com/'
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
for item in data:
    print(item['name'], item.get('officialURL'))
Output
Introduction to Information Security https://omscs.gatech.edu/cs-6035-introduction-to-information-security
Computing for Good https://omscs.gatech.edu/cs-6150-computing-good
Introduction to Operating Systems https://omscs.gatech.edu/cs-6200-introduction-operating-systems
Advanced Operating Systems https://omscs.gatech.edu/cs-6210-advanced-operating-systems
Secure Computer Systems https://omscs.gatech.edu/cs-6238-secure-computer-systems
Computer Networks https://omscs.gatech.edu/cs-6250-computer-networks
...
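If you prefer the DataFrame route mentioned above, the same data drops straight into pandas (assuming pandas is installed; the column names referenced in the comment are the fields seen in the example output):
import pandas as pd

# Build a DataFrame from the same list of course dicts used above.
df = pd.DataFrame(data)
print(df.head())
# Columns such as 'name' and 'officialURL' (seen in the output above) can then
# be selected directly, e.g. df[['name', 'officialURL']].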

First time web scraping from a weathercast website

I am learning web scraping as my first mini-project, currently working in Python. I want to extract weather data and use Python to show the weather where I live. I have found the data I need by inspecting the tags, but it keeps giving me all the numbers in the weather forecast table instead of the specific one I need. I have tried writing its specific index number, but that still did not work. This is the code I have written so far:
import requests
from bs4 import BeautifulSoup as bs
url= "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
r= requests.get(url)
cast = bs(r.content, "lxml")
wthr = cast.findAll("div",{"class": "col-md-9"})
print (wthr)
Any help would be greatly appreciated. The data I want is the temperature data.
Also: can somebody explain to me the difference between using lxml and html.parser? I have seen both used widely and was curious how you would decide to use one over the other.
I think the element you want is <div class="temp">24.2 °c</div>.
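If that is the right element, a minimal sketch built on that assumption (the class name temp is a guess from the rendered page, not verified) would be:
import requests
from bs4 import BeautifulSoup

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Assumes the temperature sits in a <div class="temp"> element as suggested above.
temp_div = soup.find("div", class_="temp")
if temp_div is not None:
    print(temp_div.get_text(strip=True))
else:
    print("No <div class='temp'> found - check the markup in the inspector.")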
If your primary focus is just temperature data, you can check out weather APIs. There are several public APIs you can find here.
The legality of scraping should be considered before you act. You can find something about it here: https://www.tutorialspoint.com/python_web_scraping/legality_of_python_web_scraping.htm
This site doesn't have a robots.txt file, so crawling it is permitted.
Here is a very simplified way to get the table data published at the URL you use in the question. This uses html.parser to extract the data.
import requests
from bs4 import BeautifulSoup
def get_soup(my_url):
    HTML = requests.get(my_url)
    my_soup = BeautifulSoup(HTML.text, 'html.parser')
    if 'None' not in str(type(my_soup)):
        return my_soup
    else:
        return None
url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
# get the whole html document
soup = get_soup(url)
# get something from that soup
# here a table header and data are extracted from the soup
table_header = soup.find("table").findAll("th")
table_data = soup.find("table").findAll("td")
# header's and data's type is list
# combine lists
for x in range(len(table_header)):
    print(table_header[x].text + ' --> ' + table_data[x].text)
""" R e s u l t :
Tarih / Saat -->
Hava --> COK BULUTLU
Sıcaklık --> 27.5°C
İşba Sıcaklığı --> 17.9°C
Basınç --> 1003.5 hPa
Görüş --> 10 km
Rüzgar --> Batıdan (270) 5 kt.
12.06.2022 13:00 --> Genel Tablo Genel Harita
"""
This is just one way to do it, and it gets just the part shown in a transparent table on the site. Once more, take care to follow the instructions stated in the site's robots.txt file. Regards...
Did you check whether the web service has an API you can use? Many weather apps have APIs you can use for free if you stay under a certain request limit. If there is one, you could easily request only the data you need, so there would be no need to parse the page.

Webscraping in Python (beautifulsoup)

I am trying to web scrape and am currently stuck on how to continue with the code. I am trying to create code that scrapes the first 80 Yelp reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop that changes the web page to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time
all_reviews = ''
def get_description(pullman):
    url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag content based on class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # print the text within the tag
    return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, it's also going to work much more nicely if you visit the website once, parse it with BeautifulSoup, and then pass the soup object to your functions - visit once, query it as many times as you want. You won't be blacklisted by websites as often this way. An example structure is below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page-number links are contained within <a href> tags, so just write a for loop to iterate over the links; a rough sketch of both steps follows.
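The sketch below reuses the soup object from the structure above; the obfuscated class strings are copied from this answer, so expect them to change whenever Yelp redeploys:
# Class strings copied from the inspected page (they are generated and may change).
review_class = ('lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 '
                'border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT')
pagination_class = ('lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 '
                    'border-color--default__373c0__2oFDT nowrap__373c0__1_N1j')

# Each review is one element with the template class; pull the comment text out of each.
for review in soup.find_all(class_=review_class):
    p = review.find('p')
    if p is not None:
        print(p.text)

# The page-number links live inside the pagination container as <a href=...> tags.
pagination = soup.find(class_=pagination_class)
if pagination is not None:
    for a in pagination.find_all('a', href=True):
        print(a['href'])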
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus a #. Open the Inspector [Ctrl-Shift-I in Chrome and Firefox], switch to the Network tab, then click the Next button; you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website by sending the same headers a browser would; otherwise you get back different data that doesn't look like comments. You can see these headers in the request details on Chrome's Network tab.
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise, but this at least gives you a fighting chance.
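Putting those pieces together, a sketch of pulling the first 80 reviews from that endpoint might look like this. The business ID in the URL and the reviews/comment/text layout are taken from the request shown above, and the headers dict stands in for the full copied browser headers; all of these are assumptions to verify in your own Network tab:
import time
import requests

# Minimal placeholder; replace with the full headers dict built from raw_headers above.
headers = {'User-Agent': 'Mozilla/5.0'}

base = 'https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed'
all_reviews = []

for start in range(0, 80, 20):  # four pages of 20 reviews each
    params = {'rl': 'en', 'sort_by': 'relevance_desc', 'q': '', 'start': start}
    resp = requests.get(base, params=params, headers=headers)
    data = resp.json()
    # In the sample response above, each review's text sits at review["comment"]["text"].
    all_reviews.extend(r['comment']['text'] for r in data.get('reviews', []))
    time.sleep(2)  # be polite between requests

print(len(all_reviews), 'reviews collected')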

Python scraping website with flight tickets

I am trying to extract information about the prices of flight tickets with a Python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script, and my problem is that I am not sure how to get the right parts from the code behind the page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
ULR = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container. bottom-container-root comes back empty, despite being a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage:
from bs4 import BeautifulSoup

html = open("small.html").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.head.next_element)
print(soup.head.next_element.next_element)
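For .find_next_siblings, a comparable sketch against a made-up snippet (the markup here is invented purely to show the label-then-values pattern, not taken from the actual site) could be:
from bs4 import BeautifulSoup

# Toy markup standing in for a label followed by the values you want.
html = """
<div>
  <span class="label">Price</span>
  <span>121</span>
  <span>134</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the label, then walk its following siblings to collect the values.
label = soup.find("span", class_="label")
for sibling in label.find_next_siblings("span"):
    print(sibling.text)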

How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON)

I want to build a small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): the download page ("Laadi alla" = "Download").
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken HTML and bs4 / the parser is not able to find anything else.
I've tried different parsers, with no change in the result.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest you request the API link, which has 200 .mp3 links.
Please follow the below steps:
Request the API link, not the HTML page link
Check the response; it's JSON, so extract the fields you need
Help your Family, All Time :)
Solution
import requests, json
from bs4 import BeautifulSoup
myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)
all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']

print(all_mp3)
all_mp3 is what you need: a dictionary with download URLs as keys and episode names as values.
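To actually save the files from that dictionary, a small follow-up sketch (the filename handling here is just one option, not part of the original solution) could be:
import re
import requests

for url, header in all_mp3.items():
    # Build a filesystem-safe filename from the episode header.
    filename = re.sub(r'[^\w\- ]', '_', header).strip() + '.mp3'
    print('Downloading', url, '->', filename)
    audio = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(audio.content)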
