I am new to Python and I am trying to scrape a value that constantly updates. When you first enter the site, the value says "laddar temperatur" (loading temperature) and then shows the actual temperature after a while. When I run my script, the only thing I get is the "loading temperature" value. I am guessing this has to do with the fact that the script reloads the site every time I run it. How do I get it to "stay" on the site and collect the information that appears after "loading temperature"?
Site: http://s-websrv02.lulea.se/ormberget/
from bs4 import BeautifulSoup
import requests
import time
r = requests.get("http://s-websrv02.lulea.se/ormberget/")
soup = BeautifulSoup(r.text, "html.parser")
match = soup.find("div", id="ReloadThis").text
for item in match:
    print(match)
    time.sleep(20)
The temperature is fetched via an XHR call, so you can request that endpoint directly.
The code below should return the temperature:
import requests
r = requests.get('http://s-websrv02.lulea.se/ormberget/Orm_Stadium.php')
print(r.text.strip())
If you'd like to get the temperature value periodically, do something like:
import time
import requests
collected_data = []
SLEEP_TIME = 5
while True:
    r = requests.get('http://s-websrv02.lulea.se/ormberget/Orm_Stadium.php')
    value = r.text.strip() if r.status_code == 200 else '-1000'
    collected_data.append({'time': time.time(), 'value': value})
    time.sleep(SLEEP_TIME)
I'm trying to write a Python script to check the status display text for a specific country (i.e. Ecuador)
on this website:
https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps.
How do I keep track of that specific text and notice when a change happens?
Currently, I tried comparing hash codes of the page after a time-delay interval; however, the hash code seems to change every time even though nothing changes visually.
import hashlib
import time
import urllib.request

input_website = 'https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps'
time_delay = 60

# Monitor the website
def monitor_website():
    # Run the loop to keep monitoring
    while True:
        # Visit the website to know if it is up
        status = urllib.request.urlopen(input_website).getcode()
        # If it returns 200, the website is up
        if status != 200:
            # Call email function (send_email is defined elsewhere in the script)
            send_email("The website is DOWN")
        else:
            send_email("The website is UP")
        # Open the url and create the hash code
        response = urllib.request.urlopen(input_website).read()
        current_hash = hashlib.sha224(response).hexdigest()
        # Revisit the website after the time delay
        time.sleep(time_delay)
        # Visit the website after the delay and generate the new hash
        response = urllib.request.urlopen(input_website).read()
        new_hash = hashlib.sha224(response).hexdigest()
        # Compare the hash codes
        if new_hash != current_hash:
            send_email("The website CHANGED")
Can you check it using Beautiful Soup?
Crawl the page for "Ecuador" and then check the following table cell for "suspended**":
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = "https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# create list of all tags 'td'
list_name = list()
tags = soup('td')
for tag in tags:
    # Take out whitespace and the \u200b unicode character
    url_grab = tag.get_text().strip(u'\u200b').strip()
    list_name.append(url_grab)

# Search the list for Ecuador and the following item in the list
country_status = {}
for i in range(len(list_name)):
    if "Ecuador" in list_name[i]:
        country_status[list_name[i]] = list_name[i + 1]
        print(country_status)
    else:
        continue

# Check the website
if country_status["Ecuador"] != "suspended**":
    print("Website has changed")
I am currently trying to get a spot in a Zoom class, and if I don't get a spot I owe a driving school 300 dollars. Long story, but not very important. I am trying to get a notification when the Zoom registration page is updated. I originally tried to check whether the hash of the site changed at any point, but there must be some internal clock on the page that changes, because it notified me every minute. The specific element I want to watch for removal is:
<div class="form-group registration-over">Registration is closed.</div>
I am not sure how to isolate it within the hash. Below is the code I have for checking for any update.
import time
import hashlib
from urllib.request import urlopen, Request
import webbrowser
url = Request('URL HERE',
              headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(url).read()
currentHash = hashlib.sha224(response).hexdigest()
print("running")
time.sleep(10)
while True:
    try:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        time.sleep(30)
        response = urlopen(url).read()
        newHash = hashlib.sha224(response).hexdigest()
        if newHash == currentHash:
            continue
        else:
            from datetime import datetime
            now = datetime.now()
            nowtime = now.strftime("%H:%M:%S: ")
            print(nowtime, "Something changed")
            webbrowser.open('URL HERE')
            response = urlopen(url).read()
            currentHash = hashlib.sha224(response).hexdigest()
            time.sleep(30)
            continue
    except Exception as e:
        print("error")
You can use Beautiful Soup to parse the HTML response into a tree-like structure. From your example it looks like you want to search for the class of the registration div. The find method returns None if no matching element is found, so you can simply test for that instead of comparing hashes.
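A minimal sketch of that approach, reusing the Request setup from your script (the URL placeholder stays as-is, the class name comes from the div you posted, and the 30-second delay is just an example):
import time
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = Request('URL HERE', headers={'User-Agent': 'Mozilla/5.0'})

while True:
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # find() returns None once the "Registration is closed." div is gone
    closed = soup.find('div', class_='registration-over')
    if closed is None:
        print("Registration appears to have opened")
        break
    time.sleep(30)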
I've been trying to scrape Bandcamp fan pages to get a list of the albums they have purchased, and I'm having trouble doing it efficiently. I wrote something with Selenium but it's mildly slow, so I'd like to learn a solution that instead sends a POST request to the site and parses the JSON from there.
Here's a sample collection page: https://bandcamp.com/nhoward
Here's the Selenium code:
def scrapeFanCollection(url):
    browser = getBrowser()
    setattr(threadLocal, 'browser', browser)
    # Go to url
    browser.get(url)
    try:
        # Click show more button
        browser.find_element_by_class_name('show-more').click()
        # Wait two seconds
        time.sleep(2)
        # Scroll to the bottom, loading the full collection
        scroll(browser, 2)
    except Exception:
        pass
    # Return full album collection
    soup_a = BeautifulSoup(browser.page_source, 'lxml', parse_only=SoupStrainer('a', {"class": "item-link"}))
    # Empty list of urls
    urls = []
    # Loop through all the a elements in the page source
    for item in soup_a.find_all('a', {"class": "item-link"}):
        url = item.get('href')
        if url is not None:
            urls.append(url)
    return urls
The API can be accessed as follows:
$ curl -X POST -H "Content-Type: Application/JSON" -d \
'{"fan_id":82985,"older_than_token":"1586531374:1498564527:a::","count":10000}' \
https://bandcamp.com/api/fancollection/1/collection_items
I didn't encounter a scenario where an "older_than_token" was stale, so the problem boils down to getting the "fan_id" given a URL.
This information is located in a blob in the id="pagedata" element.
>>> import json
>>> import requests
>>> from bs4 import BeautifulSoup
>>> res = requests.get("https://www.bandcamp.com/ggorlen")
>>> soup = BeautifulSoup(res.text, "lxml")
>>> user = json.loads(soup.find(id="pagedata")["data-blob"])
>>> user["fan_data"]["fan_id"]
82985
Putting it all together (building upon this answer):
import json
import requests
from bs4 import BeautifulSoup
fan_page_url = "https://www.bandcamp.com/ggorlen"
collection_items_url = "https://bandcamp.com/api/fancollection/1/collection_items"
res = requests.get(fan_page_url)
soup = BeautifulSoup(res.text, "lxml")
user = json.loads(soup.find(id="pagedata")["data-blob"])
data = {
    "fan_id": user["fan_data"]["fan_id"],
    "older_than_token": user["wishlist_data"]["last_token"],
    "count": 10000,
}
res = requests.post(collection_items_url, json=data)
collection = res.json()
for item in collection["items"][:10]:
    print(item["album_title"], item["item_url"])
I'm using user["wishlist_data"]["last_token"], which has the same format as the "older_than_token", just in case this matters.
In order to get the entire collection, I changed the previous code from
"older_than_token": user["wishlist_data"]["last_token"]
to
"older_than_token": user["collection_data"]["last_token"]
which contained the right token.
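For reference, this is the data dict from the snippet above with that single change applied (everything else stays the same):
data = {
    "fan_id": user["fan_data"]["fan_id"],
    # token taken from collection_data rather than wishlist_data, per the note above
    "older_than_token": user["collection_data"]["last_token"],
    "count": 10000,
}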
Unfortunately for you, this particular Bandcamp site doesn't seem to make any HTTP API call to fetch the list of albums. You can check that with your browser's developer tools: open the Network tab and click the XHR filter. The only call being made seems to be the one fetching your collection details.
I am trying to scrape some data from stockrow.com using BeautifulSoup.
However, there seem to be some differences between Inspect and View Source (I'm using Chrome, but I do not see that being a problem for Python).
This is causing trouble, as the source code itself does not show any HTML tags such as h1. They do, however, show up when I use the Inspect tool.
The part I am trying to scrape (among other things), shown using the Inspect tool:
<h1>Teva Pharmaceutical Industries Ltd<small>(TEVA)</small></h1>
My current code, printing an empty list:
import bs4 as bs
import urllib.request
class Stock:
    stockrow_url = "https://stockrow.com"
    url_suffix = "/financials/{}/annual"

    def __init__(self, ticker: str, stock_url=stockrow_url, url_suffix=url_suffix):
        # Stock ticker
        self.ticker = ticker.upper()
        # URLs for financial statements related to the ticker
        self.stock_url = stock_url + "/{}".format(self.ticker)
        sauce = urllib.request.urlopen(self.stock_url).read()
        soup = bs.BeautifulSoup(sauce, 'html.parser').h1
        print(soup)
        self.income_url = self.stock_url + url_suffix.format("income")
        self.balance_sheet_url = self.stock_url + url_suffix.format("balance")
        self.cash_flow_url = self.stock_url + url_suffix.format("cashflow")

teva = Stock("teva")
print(teva.get_income_statement())
The page is dynamically generated with JavaScript and cannot be handled by BeautifulSoup alone. You can capture the information using either Selenium and the like, or by looking for API calls.
In this case, you can get background information for TEVA using:
import json
import requests
hdr = {'User-Agent':'Mozilla/5.0'}
url = "https://stockrow.com/api/companies/TEVA.json?ticker=TEVA"
response = requests.get(url, headers=hdr)
info = json.loads(response.text)
info
Similarly, the income statement is hiding here:
url = 'https://stockrow.com/api/companies/TEVA/financials.json?ticker=TEVA&dimension=MRY&section=Income+Statement'
Using the same code as above but with this other URL will get you your income statement, in JSON format.
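Spelled out, that is just the previous request with the URL swapped in (a sketch; the structure of the returned JSON is whatever the endpoint provides):
import json
import requests

hdr = {'User-Agent': 'Mozilla/5.0'}
url = 'https://stockrow.com/api/companies/TEVA/financials.json?ticker=TEVA&dimension=MRY&section=Income+Statement'

response = requests.get(url, headers=hdr)
income_statement = json.loads(response.text)
print(income_statement)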
And you can take it from there. Search around - there is a lot of information available on this topic. Good luck.
For the last few days I have been trying to scrape the following website (link pasted below), which has a few Excel and PDF files available in a table. I am able to do it for the home page successfully. There are a total of 59 pages from which these Excel/PDF files have to be scraped. On most of the websites I have seen until now, there is a query parameter in the site URL which changes as you move from one page to another. In this case, the site uses a _doPostBack function, which is probably why the URL remains the same on every page you go to. I looked at multiple solutions and posts which suggest inspecting the parameters of the POST call and using them, but I am not able to make sense of the parameters provided in the POST call (this is the first time I am scraping a website).
Can someone please suggest some resource that can help me write code for moving from one page to another using Python? The details are as follows:
Website link - http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx
My current code, which extracts the CAP Excel sheet from the home page (this works perfectly and is provided just for reference):
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import re
import urllib
Base = "http://accord.fairfactories.org/ffcweb/Web"
html = urlopen("http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx")
bs = BeautifulSoup(html)
name = bs.findAll("td", {"class":"column_style_right column_style_left"})
i = 1
for link in bs.findAll("a", {"id": re.compile("CAP(?!\w)")}):
    if 'href' in link.attrs:
        name = str(i) + ".xlsx"
        a = link.attrs['href']
        b = a.strip("..")
        c = Base + b
        urlretrieve(c, name)
        i = i + 1
Please let me know if I have missed anything while providing the information, and please don't downvote me, otherwise I won't be able to ask any further questions.
For aspx sites, you need to look for things like __EVENTTARGET, __EVENTVALIDATION etc. and post those parameters with each request. The following will get all the pages, using requests with bs4:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin # python 3 use from urllib.parse import urljoin
# All the keys need values set bar __EVENTTARGET, that stays the same.
data = {
    "__EVENTTARGET": "gvFlex",
    "__VIEWSTATE": "",
    "__VIEWSTATEGENERATOR": "",
    "__VIEWSTATEENCRYPTED": "",
    "__EVENTVALIDATION": ""}

def validate(soup, data):
    for k in data:
        # update post values in data.
        if k != "__EVENTTARGET":
            data[k] = soup.select_one("#{}".format(k))["value"]

def get_all_excel():
    base = "http://accord.fairfactories.org/ffcweb/Web"
    url = "http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx"
    with requests.Session() as s:
        # Add a user agent for each subsequent request.
        s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
        r = s.get(url)
        bs = BeautifulSoup(r.content, "lxml")
        # get links from initial page.
        for xcl in bs.select("a[id*=CAP]"):
            yield urljoin(base, xcl["href"])
        # need to re-validate the post data in our dict for each request.
        validate(bs, data)
        last = bs.select_one("a[href*=Page$Last]")
        i = 2
        # keep going until the last page button is not visible
        while last:
            # Increase the counter to set the target to the next page
            data["__EVENTARGUMENT"] = "Page${}".format(i)
            r = s.post(url, data=data)
            bs = BeautifulSoup(r.content, "lxml")
            for xcl in bs.select("a[id*=CAP]"):
                yield urljoin(base, xcl["href"])
            last = bs.select_one("a[href*=Page$Last]")
            # again re-validate for next request
            validate(bs, data)
            i += 1

for x in get_all_excel():
    print(x)
If we run it on the first three pages, you can see we get the data you want:
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9965
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9552
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10650
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10086
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10905
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10840
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9229
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11310
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9178
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9614
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9734
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10063
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10871
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9468
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9799
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9278
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12252
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9342
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9966
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11595
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9652
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10271
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10365
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10087
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9967
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11740
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12375
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11643
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10952
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12013
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9810
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10953
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10038
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9664
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12256
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9262
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9210
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9968
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9811
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11610
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9455
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11899
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9766
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9969
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10088
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10366
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9393
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9813
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11795
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9814
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11273
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12187
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10954
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9556
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11709
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9676
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10251
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10602
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10089
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9908
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10358
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9469
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11333
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9238
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9816
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9817
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10736
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10622
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9394
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9818
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10592
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9395
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11271