I'm trying to scrape hotel data from Expedia, for example all the hotel links in Cavendish, Canada, from 01/01/2020 to 01/03/2020. The problem is that I can only scrape 20 of them, while each place actually contains 200+. The sample webpage and its URL look like this:
https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020
Scraping code:
import lxml
import re
import requests
from bs4 import BeautifulSoup
import xlwt
import pandas as pd
import numpy as np
url = 'https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020'
header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
res = requests.get(url,headers=header)
soup = BeautifulSoup(res.content,'lxml')
t1 = soup.select('a.listing__link.uitk-card-link')
So every link is stored in <a class="listing__link uitk-card-link" href=xxxxxxx> </a> inside an <li></li>, and there is no difference in the HTML structure between them. Can anyone explain this?
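For reference, pulling the hrefs out of t1 is straightforward; here is a minimal sketch (note it still only yields the ~20 listings present in the initial HTML):
base = 'https://www.expedia.com'
links = []
for a in t1:
    href = a.get('href')
    if href:
        # listing hrefs are typically relative, so prefix the domain
        links.append(base + href if href.startswith('/') else href)
print(len(links))  # only ~20, because the remaining listings load via a later API call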
They are using an API call to fetch the next 20 records, so there is no way to scrape them from the initial page alone.
Here are the API details they use when you click on "Show More":
API LINK
They use API authentication to get data via API calls.
Note: plain scraping like this only works when the page has no AJAX calls and no authentication.
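For illustration only, replaying such a "Show More" request from Python would look roughly like the sketch below. The endpoint, payload fields, and headers are placeholders (not Expedia's real API), and the real call would also need the authentication tokens mentioned above, copied from the browser's Network tab:
import requests
# Hypothetical example of replaying a "Show More" XHR; the URL, payload
# and headers are placeholders, not the site's actual API.
api_url = 'https://www.expedia.com/api/hotel-search'  # placeholder endpoint
payload = {'regionId': '6261119', 'startIndex': 20, 'pageSize': 20}  # placeholder fields
headers = {
    'User-Agent': 'Mozilla/5.0',
    # session / auth headers copied from the browser would go here
}
res = requests.post(api_url, json=payload, headers=headers)
print(res.status_code)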
This is the website in question (I want to extract the SMR Rating):
https://research.investors.com/stock-quotes/nasdaq-apple-inc-aapl.htm
If I have a list of stock names like AAPL, NVDA, TSM etc. and I want to iterate through them, how can I do it when the URL constantly changes in an unpredictable manner?
Take for example the same website with the ticker NVDA:
https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm
It's not possible to append the ticker name to the URL and be done with it. I searched for a hidden API and I got this:
https://research.investors.com/services/ChartService.svc/GetData
This website gives me access to a json file but it doesn't contain the desired SMR Rating. Apart from that, I couldn't find anything else that would lead to the SMR Rating. Is this simply impossible?
Here's what I've got so far; I can't even get past the HTML-reading stage:
from bs4 import BeautifulSoup as bs
import json
import re
import pandas as pd
import requests
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" ,
'referer':'https://www.google.com/'}
URL = "https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm"
page = requests.get(URL, headers = header)
soup = bs(page.content, "html.parser")
print(soup)
As you can see, I can't load the full HTML with Beautiful Soup, as the page assumes that some form of robotic activity is taking place (Error 405). Should I have specified a different header, or is it indeed the case that web scraping isn't allowed on this webpage?
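One thing worth trying (purely a suggestion; the site's bot detection may still block it) is sending a fuller set of browser-like headers rather than only user-agent and referer:
import requests
URL = "https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
}
page = requests.get(URL, headers=headers)
print(page.status_code)  # a persistent 405 would suggest the block goes deeper than headers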
I'm new to web scraping and to BeautifulSoup. I need help, as I don't understand why my code returns no text when there is text in the inspect view on the website.
Here is my simple code:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.nummerplade.net/nummerplade/Dd97487.html")
soup = BeautifulSoup(source.text,"html.parser")
name = soup.find("span",id="debitorer_name1")
print(name)
The output of running my code is:
<span id="debitorer_name1"></span>
When I inspect the HTML on the website I can see the desired name I want to extract, but not when running my script. Can anyone help me solve this issue?
Thanks!
If you reload the site, the data in the right-hand pane takes a moment to appear, which means it is loaded dynamically and will not be visible in the soup.
How to find the URL that renders the dynamic data:
Go to the Network tab, reload the site, and search for the data you are after; this will point you to the request URL.
Then go to Headers and copy the user-agent and referer values for your request headers. The endpoint returns JSON, from which you can extract whatever data you need.
import requests
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36", "referer": "https://www.nummerplade.net/"}
res=requests.get("https://data3.nummerplade.net/bilbogen2.php?stelnr=salza2bt3nh162519",headers=headers)
Output:
'Sebastian Carl Schwabe'
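The name above comes out of the JSON body of the response; the exact key path depends on the structure the endpoint returns, so a minimal follow-up is simply to load and inspect it:
data = res.json()
print(data)  # inspect the returned structure to locate the owner-name field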
I am trying to scrape some data from Yahoo Finance using BeautifulSoup, but I've run into a problem. I am trying to run the following code:
import xlwings as xw
import requests
import bs4 as bs
r = requests.get('https://finance.yahoo.com/quote/DKK=X?p=DKK=X&.tsrc=fin-srch')
soup = bs.BeautifulSoup(r.content,'lxml',from_encoding='utf-8')
However, when inspecting my output from "soup", I get the following status code in the <body> section:
<body>
<!-- status code : 404 -->
<!-- Not Found on Server -->
I've run the exact same piece of code on another trading pair on yahoo finance with no problem whatsoever.
Could anyone tell me what I am doing wrong?
Thanks in advance!
You need to add a User-Agent header to get a 200 response.
#import xlwings as xw
import requests
import bs4 as bs
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
r = requests.get('https://finance.yahoo.com/quote/DKK=X?p=DKK=X&.tsrc=fin-srch',headers=headers)
print(r)
soup = bs.BeautifulSoup(r.content,'lxml')
Output:
<Response [200]>
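As a quick sanity check that real page content came back (and not an error page), you can print the title; anything beyond that depends on Yahoo's current markup, which changes often:
print(soup.title.text)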
How can I get all the URLs from this particular link: https://www.oddsportal.com/results/#soccer
For every URL on this page, there are multiple pages e.g. the first link of the page:
https://www.oddsportal.com/soccer/africa/
leads to the below page as an example:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/...
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/#/page/2/...
I would ideally like to code this in Python, as I am pretty comfortable with it (more than with other languages, though not at a level I would really call comfortable).
After clicking on the link and inspecting the element, I can see that the links can be scraped; however, I am very new to this.
Please help.
I have extracted the URLs from the main page that you mentioned.
import requests
import bs4 as bs
url = 'https://www.oddsportal.com/results/#soccer'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')
base_url = 'https://www.oddsportal.com'
a = soup.findAll('a', attrs={'foo': 'f'})
# This set will have all the URLs of the main page
s = set()
for i in a:
    s.add(base_url + i['href'])
Since you are new to web scraping, I suggest you go through these:
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
Docs: https://selenium-python.readthedocs.io/
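The paginated results pages in the question (.../results/#/page/2/ and so on) fill their tables with JavaScript, so requests alone won't see those rows. A minimal Selenium sketch for one such page (assuming Chrome and a matching chromedriver are available) might look like this:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # give the JavaScript-rendered content time to appear
driver.get('https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/')
# collect every link rendered on the page after the scripts have run
links = {a.get_attribute('href') for a in driver.find_elements(By.TAG_NAME, 'a') if a.get_attribute('href')}
driver.quit()
print(len(links))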
I have a website where I'd like to get all the images from the website.
The website is kind of dynamic in nature. I tried using Google's Agenty Chrome extension and followed these steps:
I chose one image that I want to extract using a CSS selector; this makes the extension select the other matching images automatically.
Clicked the Show button and selected ATTR (attribute).
Set src as the ATTR field.
Gave the field a name in the name option.
Saved it & ran it using the Agenty platform/API.
This should yield the result, but it doesn't; it returns an empty output.
Is there any better option? Would BS4 be a better option for this? Any help is appreciated.
I am assuming you want to download all images on the website. It is actually very easy to do this effectively using Beautiful Soup 4 (BS4).
# code to find all images in a given webpage
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import urllib.request
import requests
import shutil

url = 'https://www.mcmaster.com/'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features="lxml")
for img in soup.findAll('img'):
    assa = img.get('src')
    # urljoin handles both relative and absolute src values
    new_image = urljoin(url, assa)
You can also download the image by tacking this onto the end of the loop:
    # note: this overwrites Mypic.bmp on every pass; vary the filename to keep them all
    response = requests.get(new_image, stream=True)
    with open('Mypic.bmp', 'wb') as file:
        shutil.copyfileobj(response.raw, file)
Everything in two lines:
from bs4 import BeautifulSoup; import urllib.request; from urllib.request import urlretrieve
for img in (BeautifulSoup((urllib.request.urlopen("https://apod.nasa.gov/apod/astropix.html")), features="lxml")).findAll('img'): assa=(img.get('src')); urlretrieve(("https://apod.nasa.gov/apod/"+assa), "Mypic.bmp")
The new image should be in the same directory as the python file, but can be moved with:
os.rename()
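As a sketch of how the pieces fit together (assuming the src values resolve with urljoin and you want one file per image, named after its basename; the page URL here is just an example):
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import os
import requests
import shutil

page_url = 'https://apod.nasa.gov/apod/astropix.html'  # example page
soup = BeautifulSoup(requests.get(page_url).content, 'lxml')
os.makedirs('images', exist_ok=True)
for img in soup.findAll('img'):
    src = img.get('src')
    if not src:
        continue
    full_url = urljoin(page_url, src)  # resolve relative src values
    filename = os.path.join('images', os.path.basename(full_url))
    response = requests.get(full_url, stream=True)
    with open(filename, 'wb') as f:
        shutil.copyfileobj(response.raw, f)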
In the case of the McMaster website, the images are linked differently, so the above methods won't work. The following code should get most of the images on the website:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import urllib.request
import shutil
import requests
req = Request("https://www.mcmaster.com/")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = []
for link in soup.findAll('link'):
    links.append(link.get('href'))
print(links)
UPDATE: I found the code below in a GitHub post; it is much more accurate:
import requests
import re
image_link_home=("https://images1.mcmaster.com/init/gfx/home/.*[0-9]")
html_page = requests.get(('https://www.mcmaster.com/'),headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text
for item in re.findall(image_link_home, html_page):
    if str(item).startswith('http') and len(item) < 150:
        print(item.strip())
    else:
        for elements in item.split('background-image:url('):
            for item in re.findall(image_link_home, elements):
                print((str(item).split('")')[0]).strip())
Hope this helps!
You should use Scrapy; it makes crawling seamless. By selecting the content you wish to download with CSS selectors, you can automate the crawl easily.
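A minimal Scrapy sketch of that idea (the spider name and output field are just examples, and the CSS selector assumes the images sit in ordinary <img> tags):
import scrapy

class ImageLinkSpider(scrapy.Spider):
    name = 'image_links'
    start_urls = ['https://www.mcmaster.com/']

    def parse(self, response):
        # pull the src attribute of every <img> tag and make it absolute
        for src in response.css('img::attr(src)').getall():
            yield {'image_url': response.urljoin(src)}
You can run it with something like: scrapy runspider image_links.py -o images.json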
You can use Agenty Web Scraping Tool.
Set up your scraper using the Chrome extension to extract the src attribute from the images.
Save the agent to run in the cloud.
Here is similar question answered on Agenty forum - https://forum.agenty.com/t/can-i-extract-images-from-website/24
Full disclosure: I work at Agenty.
This site uses CSS sprites to embed images. If you check the source code, you can find links containing https://images1.mcmaster.com/init/gfx/home/; those are the actual images, but they are stitched together (rows of images).
Example : https://images1.mcmaster.com/init/gfx/home/Fastening-and-Joining-Fasteners-sprite-60.png?ver=1539608820
import requests
import re
url=('https://www.mcmaster.com/')
image_urls = []
html_page = requests.get(url,headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text
for values in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', html_page):
    if str(values).startswith('http') and len(values) < 150:
        image_urls.append(values.strip())
    else:
        for elements in values.split('background-image:url('):
            for urls in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', elements):
                urls = str(urls).split('")')[0]
                image_urls.append(urls.strip())
print(len(image_urls))
print(image_urls)
Note: scraping a website is subject to copyright restrictions.