Using Python to scrape push data?

I'm trying to scrape the left-hand column of this news site (= SENESTE NYT):
https://www.dr.dk/nyheder/
But the data doesn't seem to be anywhere: it's neither in the HTML nor in any related API/JSON response I can find. Is it some kind of push data?
Using Chrome's Network tab I found this API, but it doesn't contain the news items on the left side:
https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100
Can anyone help me? How do I scrape "SENESTE NYT"?

I first loaded the page with Selenium and then parsed the page source with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.dr.dk/nyheder"
driver = webdriver.Chrome()
driver.get(url)
page_source = driver.page_source
driver.quit()  # close the browser once we have the rendered HTML
soup = BeautifulSoup(page_source, "lxml")
div = soup.find("div", {"class":"timeline-container"})
headlines = div.find_all("h3")
print(headlines)
And it seems to find the headlines:
[<h3>Puigdemont: Debatterede spørgsmål af interesse for hele Europa</h3>,
<h3>Afblæser tsunami-varsel for Hawaii</h3>,
<h3>56.000 flygter fra vulkan i udbrud </h3>,
<h3>Pence: USA offentliggør snart plan for ambassadeflytning </h3>,
<h3>Østjysk motorvej genåbnet </h3>]
Not sure if this is what you wanted.
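If you want the plain headline strings rather than the tag objects printed above, calling get_text() on each h3 strips the markup. A minimal sketch using a static snippet with the same structure (the Selenium part is unchanged; only the post-processing differs):

```python
from bs4 import BeautifulSoup

# Stand-in for the page_source obtained from Selenium above
html = """
<div class="timeline-container">
  <h3>Afblæser tsunami-varsel for Hawaii</h3>
  <h3>56.000 flygter fra vulkan i udbrud </h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "timeline-container"})
# get_text(strip=True) drops the tags and trims stray whitespace
headlines = [h3.get_text(strip=True) for h3 in div.find_all("h3")]
print(headlines)
```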
-----EDIT-----
A more efficient approach would normally be to send the request yourself with some custom headers, but it has already been confirmed that this endpoint does not return the SENESTE NYT items:
import requests

headers = {
    "Accept": "*/*",
    "Host": "www.dr.dk",
    "Referer": "https://www.dr.dk/nyheder",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
r = requests.get(url="https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100", headers=headers)
r.json()

Related

Python scraper not returning full html code on some subdomains

I am throwing together a Walmart review scraper; it currently scrapes HTML from most Walmart pages without a problem. As soon as I try scraping a page of reviews, though, it comes back with only a small portion of the page's code, mainly just text from reviews and a few errant tags. Does anyone know what the problem could be?
import requests

headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'Accept-Language': 'en-us',
    'Referer': 'https://www.walmart.com/',
    'sec-ch-ua-platform': 'Windows',
}
cookie_jar = {
    '_pxvid': '35ed81e0-cb1a-11ec-aad0-504d5a625548',
}
product_num = input('Enter Product Number: ')
url2 = 'https://www.walmart.com/reviews/product/' + str(product_num)
r = requests.get(url2, headers=headers, cookies=cookie_jar, timeout=5)
print(r.text)
As larsks already commented, some content is loaded in dynamically, for example when you scroll down far enough.
requests only fetches the initial HTML and BeautifulSoup only parses what it is given; neither executes the JavaScript that loads the rest of the page. You can solve this with Selenium.
Selenium opens your URL in a script-controlled web browser; it lets you fill out forms and also scroll down. Below is a code example of how to use Selenium with BS4.
from bs4 import BeautifulSoup
from selenium import webdriver

# Search on Google for the driver and save it in the path below
# (use a raw string so the backslashes are not treated as escapes)
driver = webdriver.Firefox(executable_path=r"C:\Program Files (x86)\geckodriver.exe")
# For Chrome it's: driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")
# Here you open the url with the reviews
driver.get("https://www.example.com")
driver.maximize_window()
# This scrolls down to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# Now you can scrape the rendered page from your Selenium browser:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
This solution assumes that the reviews are loaded in by scrolling down the page. Of course you don't have to use BeautifulSoup to parse the result; that's personal preference. Let me know if it helped.
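If a single scroll is not enough to trigger all the reviews, a common pattern is to keep scrolling until the page height stops growing. A sketch, assuming Firefox and a placeholder URL; the 2-second pause is a guess you may need to tune for the site's loading speed:

```python
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.example.com")  # replace with the reviews URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Jump to the current bottom to trigger the lazy loader
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # give the new content time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, so this is the real bottom
    last_height = new_height

html = driver.page_source
driver.quit()
```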

How can I get URLs from Oddsportal?

How can I get all the URLs from this particular link: https://www.oddsportal.com/results/#soccer
For every URL on this page, there are multiple pages e.g. the first link of the page:
https://www.oddsportal.com/soccer/africa/
leads to the below page as an example:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/...
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/#/page/2/...
I would ideally like to code this in Python, as I am most comfortable with it (more so than with other languages, though not at all close to what I would call comfortable).
After clicking on a link and inspecting an element, I can see that the links can be scraped; however, I am very new to this.
Please help.
I have extracted the URLs from the main page that you mentioned.
import requests
import bs4 as bs

url = 'https://www.oddsportal.com/results/#soccer'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')
base_url = 'https://www.oddsportal.com'
# Match the anchors that carry the foo="f" attribute used on this page
a = soup.findAll('a', attrs={'foo': 'f'})
# This set will hold all the URLs of the main page
s = set()
for i in a:
    s.add(base_url + i['href'])
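For the per-tournament result pages, the #/page/N/ suffix shown in the question follows a simple pattern, so the paginated variants can be generated once you know how many pages a tournament has (a sketch; detecting the real page count per tournament is left to you):

```python
def page_urls(results_url, num_pages):
    """Generate the paginated variants of an Oddsportal results URL.

    Page 1 is the bare URL; later pages use the '#/page/N/' fragment
    seen in the question. num_pages must be discovered per tournament.
    """
    urls = [results_url]
    for n in range(2, num_pages + 1):
        urls.append(f"{results_url}#/page/{n}/")
    return urls

print(page_urls(
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/", 3))
```

Note that the #/page/N/ part is a URL fragment handled client-side, so plain requests will return the same HTML for every page; to actually load each page's rows you would need something like Selenium.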
Since you are new to web scraping, I suggest you go through these:
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
Docs: https://selenium-python.readthedocs.io/

Scrape information from FlashScore.ro live

I am trying to scrape information from the live tab of this website: https://www.flashscore.ro/baschet/. I want to receive an email every time something happens, but my problem is with the scraping itself.
The code I have so far returns None. For now I just want to get the name of the home team.
I am quite new to scraping with Python.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.flashscore.ro/baschet/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}

def find_price():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    home_team = soup.html.find('div', {'class': 'event__participant event__participant--home'})
    return home_team

print(find_price())
The website uses JavaScript to render the match list, and requests doesn't execute JavaScript, so we can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

URL = "https://www.flashscore.ro/baschet/"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
# Wait for the page to fully render
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.find_all("div", {"class": "event__participant event__participant--home"}):
    print(tag.text)
driver.quit()
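As a side note, instead of a fixed sleep(5) you can use an explicit wait, which polls until the element actually appears and continues as soon as it does. A sketch using the same class name as above; the 15-second timeout is an assumption:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get("https://www.flashscore.ro/baschet/")

# Wait up to 15 s for at least one home-team element to be present,
# then continue immediately instead of always sleeping a fixed 5 s
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, ".event__participant--home")
    )
)
for el in driver.find_elements(By.CSS_SELECTOR, ".event__participant--home"):
    print(el.text)
driver.quit()
```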
Output:
Lyon-Villeurbanne
Fortitudo Bologna
Virtus Roma
Treviso
Trieste
Trento
Unicaja
Gran Canaria
Galatasaray
Horizont Minsk 2 F
...And on

My Python app doesn't work and gives None as the answer

Hi, I would like to know why my app is giving me this result. I've already tried everything I found on Google and still have no idea why this is happening.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W/ref=sr_1_1_sspa?dchild=1&keywords=type+c&qid=1598485860&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE3TDNSNUlITUNKTUMmZW5jcnlwdGVkSWQ9QTAwNDg4MTMyUlFQN0Y4RllGQzE2JmVuY3J5cHRlZEFkSWQ9QTAxNDk0NzMyMFNLSUdPU0taVUpRJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
When called:
C:\Proyecto nuevo>Python main.py
None
So if anyone could help me, that would be amazing!
If you look at the code of the page you are trying to scrape, you will find that it is pretty much all JavaScript that populates the page when it loads. The requests library fetches that code but doesn't run it, so your productTitle lookup returns None because the downloaded HTML doesn't contain that element.
To scrape this page you will have to run the JavaScript on it; check out the Selenium WebDriver for Python to do this.
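A minimal Selenium sketch of the same lookup might look like this (the URL is trimmed to the product path without the tracking parameters, and the 10-second timeout is an assumption):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W'

driver = webdriver.Chrome()
driver.get(URL)
# Wait until the JavaScript has actually inserted the title element,
# then grab its text; with plain requests this element never exists
title = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "productTitle"))
)
print(title.text)
driver.quit()
```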

How to do scraping from a page with BeautifulSoup

The question is very simple, but it doesn't work for me and I don't know why!
I want to scrape the rating beer from this page https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone with BeautifulSoup, but it doesn't work.
This is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url).text
soup = BeautifulSoup(test_html, "lxml")
rating = soup.findAll("span", class_="ratingValue")
rating
It doesn't work, but if I do the same thing with another page, it works... I don't know why. Can someone help me? The expected rating is 4.58.
Thanks everybody!
If you print test_html, you'll find you got a 403 Forbidden response.
You should add a header (at least a user-agent :) ) to your GET request.
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'html5lib')
rating = soup.find('span', {'itemprop': 'ratingValue'})
print(rating.text)
# 4.58
The reason you get a 403 Forbidden status code is that the server refuses to fulfill your request even though it understood it. You will often get this error when scraping the more popular websites, which have security features to block bots, so you need to disguise your request. For that you need to use headers.
You also need to correct the tag attribute whose data you're trying to get, i.e. itemprop, and use lxml (or any other builder of your choice) as your tree builder.
import requests
from bs4 import BeautifulSoup
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
# Add this
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'lxml')
rating = soup.find('span', {'itemprop':'ratingValue'})
print(rating.text)
The page you are requesting responds with 403 Forbidden, so you might not get an error, but you will get an empty result such as []. To avoid this, we add a user agent; this code will get you the desired result.
import urllib.request
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # The assembled request
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, "lxml")
rating = soup.find('span', {'itemprop': 'ratingValue'})
rating.text
You are facing this error because some websites render their content with JavaScript and can't be scraped with Beautiful Soup alone. For these kinds of websites you have to use Selenium:
download the latest ChromeDriver for your operating system from this link
install Selenium with the command "pip install selenium"
# import required modules
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup

current_dir = os.getcwd()
print(current_dir)
# Concatenate the web driver name with your current dir; if you are on Windows, change '/' to '\'.
# Make sure you placed chromedriver in the current directory
driver = webdriver.Chrome(current_dir + '/chromedriver')
# driver.get opens the url in your browser
driver.get('https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone')
time.sleep(1)
# Fetch the rendered HTML from the driver
super_html = driver.page_source
# Now parse it with 'html.parser'
soup = BeautifulSoup(super_html, "html.parser")
rating = soup.findAll("span", itemprop="ratingValue")
rating[0].text
