Displaying all search results in Python web scraper - python

I am quite new to Python and am building a web scraper, which will scrape the following page and links in them: https://www.nalpcanada.com/Page.cfm?PageID=33
The problem is the page's default is to display the first 10 search results, however, I want to scrape all 150 search results (when 'All' is selected, there are 150 links).
I have tried messing around with the URL, but the URL remains static no matter what display results option is selected. I have also tried to look at the Network section of the Developer Tools on Chrome, but can't seem to figure out what to use to display all results.
Here is my code so far:
import bs4
import requests
import csv
import re
response = requests.get('https://www.nalpcanada.com/Page.cfm?PageID=33')
soup = bs4.BeautifulSoup(response.content, "html.parser")
urls = []
for a in soup.findAll('a', href=True, class_="employerProfileLink", text="Vancouver, British Columbia"):
urls.append(a['href'])
pagesToCrawl = ['https://www.nalpcanada.com/' + url + '&QuestionTabID=47' for url in urls]
for pages in pagesToCrawl:
html = requests.get(pages)
soupObjs = bs4.BeautifulSoup(html.content, "html.parser")
nameOfFirm = soupObjs.find('div', class_="ip-left").find('h2').next_element
tbody = soupObjs.find('div', {"id":"collapse8"}).find('tbody')
offers = tbody.find('td').next_sibling.next_sibling.next_element
seeking = tbody.find('tr').next_sibling.next_sibling.find('td').next_sibling.next_sibling.next_element
print('Firm name:', nameOfFirm)
print('Offers:', offers)
print('Seeking:', seeking)
print('Hireback Rate:', int(offers) / int(seeking))

Replacing your response call with this code seems to work. The reason is that you weren't passing in the cookie properly.
response = requests.get(
'https://www.nalpcanada.com/Page.cfm',
params={'PageID': 33},
cookies={'DISPLAYNUM': '100000000'}
)
The only other issue I came across was that a ValueError was being raised by this line when certain links (like YLaw Group) don't seem to have "offers" and/or "seeking".
print('Hireback Rate:', int(offers) / int(seeking))
I just commented out the line since you will have to decide what to do in those cases.

Related

API - Web Scrape

how to get access to this API:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via API, I found the url above and I can see its data , however I can't seem to get it right because I'm running into code 403.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve items category, they are visible for me, but I'm unable to take them.
Later I'll use these categories to iterate over products API.
API Category
Obs: please be gentle it's my first post here =]
To get the data as you shown in your image the following headers and endpoint are needed:
import requests
headers = {
'sm-token': '{"IdLoja":2691,"IdRede":884}',
'User-Agent': 'Mozilla/5.0',
'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
Not sure exactly what your issue is here.
Bu if you want to see the content of the response and not just the 200/400 reponses. You need to add '.content' to your print.
Eg.
#Create Session
s = requests.Session()
#Example Connection Variables, probably not required for your use case.
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language':'en-us'}
bodyJson = {"__type":"xxx","applicationName":"xxx","userID":"User01","password":"password2021"}
#Get Request
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p) #Print response (200 etc)
#print(p.headers)
#print(p.content) #Print the content of the response.
#print(s.cookies)
I'm also new here haha, but besides this requests library, you'll also need another one like beautiful soup for what you're trying to do.
bs4 installation: https:https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, it's just continuing what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
this gets the entire HTML content of the page, and so, you can get your data from this page based on their css selectors like this:
site_data = soup.select('selector')
site_data is an array of things with that 'selector', so a simple for loop and an array to add your items in would suffice (as an example, getting links for each book on a bookstore site)
For example, if i was trying to get links from a site:
import requests
from bs4 import BeautifulSoup
sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a") # list of all items with this selector
for link in links:
sites.append(link)
Also, a helpful tip is when you inspect the page (right click and at the bottom press 'inspect'), you can see the code for the page. Go to the HTML and find the data you want and right click it and select copy -> copy selector. This will make it really easy for you to get the data you want on that site.
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Recursive crawling with BeautifulSoup really slow

I'm building a crawler that downloads all .pdf Files of a given website and its subpages. For this, I've used built-in functionalities around the below simplified recursive function that retrieves all links of a given URL.
However this becomes quite slow, the longer it crawls a given website (may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup
pages = set()
def get_links(page_url):
global pages
pattern = re.compile("^(/)")
html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=pattern):
if "href" in link.attrs:
if link.attrs["href"] not in pages:
new_page = link.attrs["href"]
print(new_page)
pages.add(new_page)
get_links(new_page)
get_links("")
It is not that easy to figure out what activly slow down your crawling - It is maybe the way you crawl, server of the website, ...
In your code, you request a URL, grab the links and call the functions itself in the first iteration, so you only append requested urls.
You may want to work with "queues" to keep the processes more transparent.
One advantage is that if the script aborts, you have this information stored and can access it to start from the urls you already have collected to visit. Quite the opposite of your for loop, which may have to start at an earlier point to ensure it get all urls.
Another point is, you request the PDF files, but without using the response in any way. Wouldn't it make more sense to either download and save them directly or skip the request and at least keep the links in separate "queue" for post processing?
Collected information in comparison - Based on iterations
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
Just to demonstrate, feel free to create defs, classes and structure to your needs. Note added some delay, but you can skip it if you like. "Queues" to demonstrate the process are lists but should be files, database,... to store your data safely.
import requests, time
from bs4 import BeautifulSoup
baseUrl = 'https://www.srs-stahl.de'
urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []
def crawl(url):
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
for a in soup.select('a[href^="/"]'):
url = f"{baseUrl}{a['href']}"
if '.pdf' in url and url not in urlsToDownload:
urlsToDownload.append(url)
else:
if url not in urlsToVisit and url not in urlsVisited:
urlsToVisit.append(url)
while urlsToVisit:
url = urlsToVisit.pop(0)
try:
crawl(url)
except Exception as e:
print(f'Failed to crawl: {url} -> error {e}')
finally:
urlsVisited.append(url)
time.sleep(2)

Wait Before Scraping using Beatifulsoup

I'm trying to scrape data from this review site. It first go through first page, check if there's a 2nd page then go to it too. Problem is when getting to 2nd page. Page takes time to update and I still get the first page's data instead of 2nd
For example, if you go here, you will see how it takes time to load page 2 data
I tried to put a timeout or sleep but didn't work. Prefer a solution with minimal package/browser dependency (like webdriver.PhantomJS()) as I need to run this code on my employer's environment and not sure if I can use it. Thank you!!
from urllib.request import Request, urlopen
from time import sleep
from socket import timeout
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read()
webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")
true=parsed_html.find('div', {'class':['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
while(true):
true = parsed_html.find('div', {'class':['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
if(not True):
true=False
else:
req = Request(softwareadvice+'?review.page=2', headers=hdr)
sleep(10)
webpage = urlopen(req, timeout=10)
sleep(10)
webpage = webpage.read().decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")
The reviews are loaded from external source via Ajax request. You can use this example how to load them:
import re
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = (
"https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
)
params = {
"q": "s*|-s*",
"facet.gdm_industry_id": '{"sort":"bucket","size":200}',
"fq": "(and product_id: '{}' listed:1)",
"q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
"size": "50",
"start": "50",
"sort": "completeness_score desc,date_submitted desc",
}
# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))
params["fq"] = params["fq"].format(id_)
for start in range(0, 3): # <-- increase the number of pages here
params["start"] = 50 * start
data = requests.get(api_url, params=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data:
for h in data["hits"]["hit"]:
if "review" in h["fields"]:
print(h["fields"]["review"])
print("-" * 80)
Prints:
After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible, Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting. We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio.
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------
...and so on.
I have been scraping many types of websites and I think in the world of scraping, there are roughly 2 types of websites.
The first one is "URL-based" websites (i.e. you send request with URL, the server responds with HTML tags from which elements can be directly extracted), and the second one is "JavaScript-rendered" websites (i.e. the response you only get is the javascript and you can only see HTML tags after it is run).
In former's cases, you can freely navigate through the website with bs4. But in the latter's cases, you cannot always use URLs as a rule of thumb.
The site you are going to scrape is built with Angular.js, which is based on client-side rendering. So, the response you get is the JavaScript code, not HTML tags with page content in it. You have to run the code to get the content.
About the code you introduced:
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read() # response is javascript, not page content you want...
webpage = web_byte.decode('utf-8')
All you can get is the JavaScript code that must be run to get HTML elements. That is why you get the same pages(response) every time.
So, what to do? Is there any way to run JavaScript within bs4? I guess there aren't any appropriate ways to do this. You can use selenium for this one. You can literally wait until the page fully loads, you can click buttons and anchors, or get page content at any time.
Headless browsers in selenium might work, which means you don't have to see the controlled browser opening on your computer.
Here are some links that might be of help to you.
scrape html generated by javascript with python
https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium
Thanks for reading.

Error while scraping image with beautifulsoup

The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So i am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was
The code that caused this warning is on line 12 of the file/Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did :
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this :
import requests
from bs4 import BeautifulSoup
import urllib.request
import random
url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")
for link in soup.find_all("a",{"class":"photo_link "}):
href = link.get('href')
print(href)
img_name = random.randrange(1,500)
full_name = str(img_name) + ".jpg"
urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?
Actually the website is loaded via JavaScript using XHR request to the following API
So you can reach it directly via API.
Note that you can increase parameter rpp=50 to any number as you want for getting more than 50 result.
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['url'])
also you can access the image url itself in order to write it directly!
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['image_url'][-1])
Note that image_url key hold different img size. so you can choose your preferred one and save it. here I've taken the big one.
Saving directly:
import requests
with requests.Session() as req:
r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
result = []
for item in r['photos']:
print(f"Downloading {item['name']}")
save = req.get(item['image_url'][-1])
name = save.headers.get("Content-Disposition")[9:]
with open(name, 'wb') as f:
f.write(save.content)
Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page due to the fact that it does not run JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are to either see what APIs they're calling to get this data (check the inspector in your web browser for network calls, if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.

Python Beautifulsoup - Scrape elements from "inspect"

I am trying to get scrape some data from stockrow.com using BeautifulSoup.
However there seems to be some diffrences between inspect and view sourcecode (im using chrome, but i do not see that being a problem for Pyton).
This is resulting in some trouble as the sourcecode itself does not show any html-tags such as h1. They are however showing up when i use the inspect tool.
The part i am trying to scrape (among other things) - this is show using the inspect tool:
<h1>Teva Pharmaceutical Industries Ltd<small>(TEVA)</small></h1>
My current code, printing an empty list:
import bs4 as bs
import urllib.request
class Stock:
stockrow_url = "https://stockrow.com"
url_suffix = "/financials/{}/annual"
def __init__(self, ticker : str, stock_url=stockrow_url, url_suffix = url_suffix):
# Stock ticker
self.ticker = ticker.upper()
# URLs for financial statements related to the ticker
self.stock_url = stock_url + "/{}".format(self.ticker)
sauce = urllib.request.urlopen(self.stock_url).read()
soup = bs.BeautifulSoup(sauce, 'html.parser').h1
print(soup)
self.income_url = self.stock_url + url_suffix.format("income")
self.balance_sheet_url = self.stock_url + url_suffix.format("balance")
self.cash_flow_url = self.stock_url + url_suffix.format("cashflow")
teva = Stock("teva")
print(teva.get_income_statement())
The page is dynamically generated using jscript and cannot be handled by beautifulsoup. You can capture the information using either selenium and the like, or by looking for API calls.
In this case, you can get for TEVA, background information using
import json
import requests
hdr = {'User-Agent':'Mozilla/5.0'}
url = "https://stockrow.com/api/companies/TEVA.json?ticker=TEVA"
response = requests.get(url, headers=hdr)
info = json.loads(response.text)
info
Similarly, the income statement is hiding here:
url = 'https://stockrow.com/api/companies/TEVA/financials.json?ticker=TEVA&dimension=MRY&section=Income+Statement'
Using the same code as above but with this other url, will get you your income statement, in json format.
And you can take it from there. Search around - there is a lot of information available on this topic. Good luck.

Categories