Can't access table and table elements using bs4 - python

So I am trying to scrape the following webpage: https://www.omscentral.com/
The main table there is my item of interest. I want to scrape the table and all of its content. When I inspect the page, the table is in a table tag, so I figured it would be easy to access with the code below.
import requests
from bs4 import BeautifulSoup
url = 'https://www.omscentral.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('table')
However, that code only returns the table header. I saw a similar example here, but the solution of switching the parser did not work.
When I look at the soup object itself, it seems that requests does not expand the table and only captures the header. I'm not sure what to do here; any advice would be much appreciated!

The content is stored as JSON in a script tag and rendered dynamically in the browser, so you have to extract the data from that tag.
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
To display it in a pandas DataFrame (with pandas imported as pd), simply use:
pd.DataFrame(data)
Example
import requests, json
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0'}
url = 'https://www.omscentral.com/'
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
# the course list is embedded as JSON in the __NEXT_DATA__ script tag
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
for item in data:
    print(item['name'], item.get('officialURL'))
Output
Introduction to Information Security https://omscs.gatech.edu/cs-6035-introduction-to-information-security
Computing for Good https://omscs.gatech.edu/cs-6150-computing-good
Introduction to Operating Systems https://omscs.gatech.edu/cs-6200-introduction-operating-systems
Advanced Operating Systems https://omscs.gatech.edu/cs-6210-advanced-operating-systems
Secure Computer Systems https://omscs.gatech.edu/cs-6238-secure-computer-systems
Computer Networks https://omscs.gatech.edu/cs-6250-computer-networks
...
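If the page layout changes, select_one returns None and the one-liner above fails with an AttributeError. Here is a minimal sketch of a more defensive variant, run on a made-up HTML snippet instead of the live site. The sample HTML, the helper name extract_next_data, and the two course entries are illustrative assumptions, not data from the site.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; the real page is much larger.
SAMPLE_HTML = """
<html><body>
<table><thead><tr><th>Name</th></tr></thead></table>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"courses": [
  {"name": "Course A", "officialURL": "https://example.com/a"},
  {"name": "Course B"}
]}}}
</script>
</body></html>
"""

def extract_next_data(html):
    """Return the parsed JSON from the __NEXT_DATA__ script tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("#__NEXT_DATA__")
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)

payload = extract_next_data(SAMPLE_HTML)
courses = payload["props"]["pageProps"]["courses"] if payload else []
for course in courses:
    # .get() avoids a KeyError when a field is missing from an entry
    print(course["name"], course.get("officialURL"))
```

Passing the text of a real response to extract_next_data should work the same way, as long as the page still embeds its data under that tag id.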

Related

First time web scraping from a weathercast website

I am learning web scraping as my first mini-project, currently working with Python. I want to extract weather data and use Python to show the weather where I live. I found the data I need by inspecting the tags, but my code keeps returning all the numbers in the weather forecast table instead of the specific one I need. I tried selecting it by its index number, but that still did not work. This is the code I have written so far:
import requests
from bs4 import BeautifulSoup as bs
url= "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
r= requests.get(url)
cast = bs(r.content, "lxml")
wthr = cast.findAll("div",{"class": "col-md-9"})
print (wthr)
Any help would be greatly appreciated. The data I want is the Temperature data.
Also, can somebody explain the differences between using lxml and html.parser? I have seen both used widely and was curious how you would decide to use one over the other.
I think the element is <div class="temp">24.2 °c</div>.
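If that guess about the element is right, selecting it by class and pulling out the number could look like the sketch below. The surrounding markup in SAMPLE_HTML and the helper name read_temperature are assumptions for illustration; only the div with class temp comes from the comment above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical excerpt; only the div.temp element is taken from the thread.
SAMPLE_HTML = """
<div class="col-md-9">
  <div class="temp">24.2 °c</div>
  <div class="humidity">61 %</div>
</div>
"""

def read_temperature(html):
    """Return the number inside div.temp as a float, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_="temp")
    if tag is None:
        return None
    # keep only the numeric part, dropping the unit
    match = re.search(r"-?\d+(?:\.\d+)?", tag.get_text())
    return float(match.group()) if match else None

print(read_temperature(SAMPLE_HTML))
```

Using find with class_="temp" returns a single element, which avoids the "all the numbers" problem that comes from printing the whole col-md-9 container.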
If your primary focus is just temperature data, you can check out weather APIs. There are several public APIs you could find here.
Legality of scraping should be considered before the action. You can find something about it here https://www.tutorialspoint.com/python_web_scraping/legality_of_python_web_scraping.htm
This site doesn't have a robots.txt file, so crawling is not explicitly disallowed.
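For sites that do publish a robots.txt, Python's standard library can check whether a given URL may be fetched. The sketch below parses a made-up robots.txt offline; the rules, the example.com URLs, and the helper name is_allowed are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS, "http://example.com/tahminler/merkezler"))
print(is_allowed(SAMPLE_ROBOTS, "http://example.com/private/data"))
```

Against a live site, RobotFileParser can also download the file itself via set_url() followed by read().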
Here is a very simplified way to get the table data published at the url that you use in the question. This uses html.parser to extract data
import requests
from bs4 import BeautifulSoup
def get_soup(my_url):
    # download the page and parse it with the built-in html.parser
    html = requests.get(my_url)
    # BeautifulSoup() always returns an object, so no None check is needed
    return BeautifulSoup(html.text, 'html.parser')
url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
# get the whole html document
soup = get_soup(url)
# get something from that soup
# here a table header and data are extracted from the soup
table_header = soup.find("table").findAll("th")
table_data = soup.find("table").findAll("td")
# header's and data's type is list
# combine lists
for header, cell in zip(table_header, table_data):
    print(header.text + ' --> ' + cell.text)
""" R e s u l t :
Tarih / Saat -->
Hava --> COK BULUTLU
Sıcaklık --> 27.5°C
İşba Sıcaklığı --> 17.9°C
Basınç --> 1003.5 hPa
Görüş --> 10 km
Rüzgar --> Batıdan (270) 5 kt.
12.06.2022 13:00 --> Genel Tablo Genel Harita
"""
This is just one way to do it, and it only gets the part shown in a transparent table on the site. Once more, follow any instructions stated in the site's robots.txt file. Regards...
Did you check whether the web service has an API you can use? Many weather apps have APIs you can use for free if you stay under a certain limit of requests. If there is one, you could easily request only the data you need, so there is no need to reformat it.

Beautiful Soup web scraping complex html for data

Ok so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the overwatch league website for stats etc, save them in a db and then pull from that db with a discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the html for the standings page.
As you can see it's quite convoluted and hard to navigate with the repeated div and body tags and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title on the top of the table and then access the parent line and then iterate through the siblings to pull the data such as the team name, position etc into a dictionary for now. I haven't been able to find anything online that helps me, most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import link
import re
import pprint
url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')
# for stat in page.find(string=re.compile("rank")):
# statObject = {
# 'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
# }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
# print(tag.name)
print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The commented-out lines are all failed attempts, and all return nothing. The only line not commented out returns a massive JSON with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium; however, I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even if you could comment with some advice on how else to approach this, I would greatly appreciate it.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page so it's easy to get it using beautifulsoup. Then you need to parse the json to extract all the tables (corresponding to the 3 tabs) :
import requests
from bs4 import BeautifulSoup
import json
url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
script = soup.find("script",{"id":"__NEXT_DATA__"})
data = json.loads(script.text)
tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]
result = [
    { i["title"] : i["tables"][0]["teams"] }
    for i in tabs[0]
]
print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries, one per tab: each key is the title of one of the 3 tabs and the value is that tab's table data.
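To see what the nested lookups are doing without hitting the site, the same traversal can be run on a small hand-written payload. The sketch below is an assumption-laden stand-in: the team names, tab titles, and the helper name teams_by_tab are made up, and only the key names (props, pageProps, blocks, standings, tabs, title, tables, teams) follow the answer's code. It returns a single dictionary rather than a list of one-key dictionaries, which may be easier to look things up in.

```python
import json

# Hypothetical, heavily trimmed payload; only the key names mirror the answer.
sample = {
    "props": {"pageProps": {"blocks": [
        {"hero": {"title": "ignored"}},
        {"standings": {"tabs": [
            {"title": "Tab one",
             "tables": [{"teams": [{"name": "Team A"}, {"name": "Team B"}]}]},
            {"title": "Tab two",
             "tables": [{"teams": [{"name": "Team C"}]}]},
        ]}},
    ]}}
}

def teams_by_tab(data):
    """Map each tab title to the team list of that tab's first table."""
    result = {}
    for block in data["props"]["pageProps"]["blocks"]:
        standings = block.get("standings")
        if standings is None:
            continue  # skip blocks that are not standings tables
        for tab in standings["tabs"]:
            result[tab["title"]] = tab["tables"][0]["teams"]
    return result

standings = teams_by_tab(sample)
print(json.dumps(standings, indent=4))
```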

Python scraping website with flight tickets

I am trying to extract information about prices of flight tickets with a python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script and my problem is that I am not sure how to get the right parts from the code behind page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
URL = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container. bottom_container_root comes back empty, even though the element is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage below.
from bs4 import BeautifulSoup
# parse a local HTML file
with open("small.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
# next_element walks the parse tree in document order
print(soup.head.next_element)
print(soup.head.next_element.next_element)
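As a rough illustration of both navigation helpers on container divs, here is a sketch run on invented markup. The header and price class names and the values are hypothetical; only ui-container appears in the question. This only helps when the elements are actually present in the downloaded HTML, which is not the case when a page fills its containers with JavaScript.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class names other than ui-container are made up.
SAMPLE_HTML = (
    '<div class="ui-container">'
    '<div class="header">Flights</div>'
    '<div class="price">121</div>'
    '<div class="price">135</div>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header = soup.find("div", class_="header")

# next_element follows document order, so the first thing after the
# header tag is its own text node, not the next div.
print(repr(header.next_element))

# find_next_siblings returns later tags that share the same parent.
prices = [
    int(tag.get_text())
    for tag in header.find_next_siblings("div", class_="price")
]
print(prices)
```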

HTML in browser doesn't correspond to scraped data in python

For a project I have to scrape data from different websites, and I'm having a problem with one of them.
When I look at the source code, the things I want are in a table, so it seems easy to scrape. But when I run my script, that part of the source code doesn't show up.
Here is my code. I tried different things. At first there weren't any headers; then I added some, but it made no difference.
# import libraries
from bs4 import BeautifulSoup
import csv
import requests
# specify the url
quote_page = 'http://www.airpl.org/Pollens/pollinariums-sentinelles'
# query the website, sending a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(quote_page, headers=headers)
print(response.text)
# parse the html using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(response.text, 'html.parser')
# soup.encode() returns bytes, so the file is opened in binary mode
with open('allergene.txt', 'wb') as f:
    f.write(soup.encode('UTF-8', errors='ignore'))
What I'm looking for on the website is the content after "Herbacée", whose HTML looks like:
<p class="level1">
<img src="/static/img/state-0.png" alt="pas d'émission" class="state">
Herbacee
</p>
Do you have any idea what's wrong ?
Thanks for your help and happy new year guys :)
This page uses JavaScript to render the table. The page that actually contains the table is:
http://www.alertepollens.org/gardens/garden/1/state/
You can find this URL in Chrome DevTools under the Network tab.
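Once the HTML that really contains the table has been downloaded, the entries can be read with ordinary selectors. The sketch below runs on a fragment shaped like the snippet in the question; that the linked page returns markup of exactly this shape is an assumption, and the helper name parse_states is made up.

```python
from bs4 import BeautifulSoup

# Fragment shaped like the snippet quoted in the question.
SAMPLE_HTML = """
<p class="level1">
<img src="/static/img/state-0.png" alt="pas d'émission" class="state">
Herbacee
</p>
"""

def parse_states(html):
    """Return (plant name, state label) pairs from p.level1 elements."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.find_all("p", class_="level1"):
        img = p.find("img", class_="state")
        # the state description lives in the image's alt attribute
        label = img.get("alt") if img else None
        rows.append((p.get_text(strip=True), label))
    return rows

print(parse_states(SAMPLE_HTML))
```

To try it on live data, pass the text of a requests.get call for the URL above to parse_states.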

Python web scraping - how to get resources with beautiful soup when page loads contents via JS?

So I am trying to scrape a table from a specific website using BeautifulSoup and urllib. My goal is to create a single list from all the data in this table. I have tried using this same code using tables from other websites, and it works fine. However, while trying it with this website the table returns a NoneType object. Can someone help me with this? I've tried looking for other answers online but I'm not having much luck.
Here's the code:
import requests
import urllib.request
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib.request.urlopen("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct").read(), "html.parser")
table = soup.find("table", attrs={'class':'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)
print(data)
It looks like this data is loaded via an AJAX call.
You should target that url instead: http://www.teamrankings.com/ajax/league/v3/stats_controller.php
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
# form fields the browser sends with the AJAX request
params = {
    "type":"team-detail",
    "league":"ncb",
    "stat_id":"3083",
    "season_id":"312",
    "cat_type":"2",
    "view":"stats_v1",
    "is_previous":"0",
    "date":"04/06/2015"
}
# passing data= makes urlopen send a POST request
content = urllib.request.urlopen("http://www.teamrankings.com/ajax/league/v3/stats_controller.php",data=urllib.parse.urlencode(params).encode('utf8')).read()
soup = BeautifulSoup(content, "html.parser")
table = soup.find("table", attrs={'class':'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        # first text node in the cell (empty string if the cell has none)
        text = td.find(string=True) or ''
        data.append(text)
print(data)
Using your web inspector, you can also view the parameters that are passed along with the POST request.
Generally the server on the other end will check for these values and reject your request if some or all of them are missing. The above code snippet ran fine for me. I switched to urllib (urllib2 in Python 2) because I generally prefer to use that library.
If the data loads in your browser, it is possible to scrape it. You just need to mimic the request your browser sends.
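Mimicking the browser's request mostly means sending the same method, form fields, and headers. The sketch below builds such a request with the standard library without sending it, so the pieces can be inspected first. The helper name build_post_request and the choice of User-Agent value are assumptions; the URL and the three fields are taken from the answer above.

```python
import urllib.parse
import urllib.request

def build_post_request(url, params, user_agent="Mozilla/5.0"):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,  # supplying data makes urllib issue a POST
        headers={"User-Agent": user_agent},
    )

request = build_post_request(
    "http://www.teamrankings.com/ajax/league/v3/stats_controller.php",
    {"type": "team-detail", "league": "ncb", "stat_id": "3083"},
)
print(request.get_method())
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request).read()
```

Comparing the printed body with the form data shown in the browser's Network tab is a quick way to spot a missing parameter.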
The table on that website is created via JavaScript, so it does not exist when you simply pass the page source to BeautifulSoup.
Either poke around with your web inspector of choice and find out where the JavaScript is getting the data from, or use something like Selenium to run a complete browser instance.
Since the table data is loaded dynamically, there may be some lag in updating the table due to factors such as network delay. You can wait by adding a delay before reading the data.
Check whether the table data is empty (i.e. its length is zero); if so, read the table data again after some delay. This should help.
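The check-and-wait idea can be sketched as a small retry loop. The helper name fetch_rows_with_retry, its parameters, and the stand-in fetcher are hypothetical. Note that waiting only helps when whatever is being polled can eventually return the data, such as a browser session or an AJAX endpoint; a plain download of the static page will stay empty no matter how long you wait.

```python
import time

def fetch_rows_with_retry(fetch_rows, attempts=3, delay_seconds=2.0):
    """Call fetch_rows() until it returns a non-empty list or attempts run out."""
    for attempt in range(attempts):
        rows = fetch_rows()
        if rows:
            return rows
        # wait before the next try, but not after the last one
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return []

# Stand-in fetcher that comes back empty twice, then returns data.
responses = iter([[], [], ["row 1", "row 2"]])
rows = fetch_rows_with_retry(lambda: next(responses), attempts=3, delay_seconds=0)
print(rows)
```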
I looked at the URL you used. Since you are using a class selector for the table, check whether that class is also present elsewhere in the HTML.
