I'm trying to parse the soccer matches on the soccerstand front page, but I'm failing because the items I get with BeautifulSoup are really different from what I see in the browser.
My code is simple at the moment:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://soccerstand.com/') as response:
    url_data = response.read()

soup = BeautifulSoup(url_data, 'html.parser')
# find_all() expects a tag name, so a CSS selector has to go through select()
print(soup.select('div.event__match'))
I tried this and it failed. When I checked the soup variable, it turned out not to contain such divs at all, so what I get with BeautifulSoup is different from what I see by inspecting the code on the website.
What's the reason for that? Is there any workaround?
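In every case like this that I've seen (and as the answers further down explain), the match markup is built by JavaScript after the page loads, so the HTML that urllib receives never contains it. A minimal sketch of the usual workaround, driving a real browser with Selenium so the JS runs first (it assumes chromedriver is installed and that div.event__match exists in the rendered page):

from selenium import webdriver
from bs4 import BeautifulSoup

# a real browser executes the JavaScript that builds the match markup
driver = webdriver.Chrome()
driver.get('https://soccerstand.com/')
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.select('div.event__match'))
driver.quit()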
I tried using Beautiful Soup to parse a website, but when I printed page_soup I would only get a portion of the HTML; the beginning portion of the code, which has the info I need, was omitted. No one answered my earlier question about this. After doing some research I tried using Selenium to access the full HTML, but I got the same results. Below are both of my attempts, with Selenium and with Beautiful Soup. When I try to print the HTML, it starts off in the middle of the source code, skipping the doctype, lang, and the other initial statements.
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
browser.get('https://coronavirusbellcurve.com/')
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

# pageRequest was undefined in the original; presumably it pointed at the same page
pageRequest = Request('https://coronavirusbellcurve.com/')
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
print(page_soup)
The requests module seems to return the numbers in the first table on the page (assuming you are referring to US Totals):
import requests
r = requests.get('https://coronavirusbellcurve.com/').content
print(r)
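If the figures really are in the static HTML, a short sketch like this could pull the first table into a DataFrame instead of printing the raw bytes (using pandas.read_html is my assumption here; it needs lxml installed):

import pandas as pd
import requests

html = requests.get('https://coronavirusbellcurve.com/').text
tables = pd.read_html(html)  # parses every <table> on the page into a DataFrame
print(tables[0])             # the first table, presumably the US totals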
I'm having issues getting BeautifulSoup to find this text. I think it's because the text on the page has extra quotes around it. I was told it's because the class is actually blank. If that's the case, then any suggestions on how I can build my search?
Actual text on website: <span class="" data-product-price="">
My code (I've tried several variations): soup.find_all('span',{'class' : '" data-product-price="'})
I've also tried just doing a regular search, but I'm not doing that correctly. Any suggestions or should I use something other than bs?
Edited to include full code:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971')
soup = BeautifulSoup(r.text, features="html.parser")
print(soup)
#soup.find_all('span',{'class' : '" data-product-price="'})
#soup.find_all('span',{'class' : 'data-product-price'})[0].text
After looking at the URL, you can select the price with a CSS selector. The original search fails because the class attribute really is empty (the quotes you saw are just the attribute delimiters), so the reliable hook is the data-product-price attribute itself:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('span[data-product-price]').get_text(strip=True))
Prints:
$50.00
Or, with the bs4 API, set {'data-product-price': True} to search for tags that have this attribute regardless of its value:
print(soup.find('span', {'data-product-price':True}).get_text(strip=True))
I'm fairly new to coding and am trying to write a script that pulls market data at timed intervals while running, then compares the delta between each pull and notifies the user of the change. I'm looking for simple shifts, say >0.1% in any interval.
My initial approach is to run a Beautiful Soup script to obtain posted market data, using either Yahoo Finance or Barron's, as both seem to have the data available in the HTML code:
https://finance.yahoo.com/calendar
http://www.barrons.com/mdc/public/page/9_3000.html?mod=bol_mdc_topnav_9_3000
This is as far as I've gotten, and I'm not having much luck: the find function doesn't seem to return anything from the site. I'm looking for any nudge that might help me get on the right track with this.
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://finance.yahoo.com/calendar'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
print(soup.find("span"))
I would expect this to return the first span tag so I could later hone in on the DJIA data:
<span class="Trsdu(0.3s) Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-reactid="31">26,430.14</span>
but the script runs and returns nothing
You can use the same URL that the bottom one of your listed URLs uses to source the quote:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
soup = bs(r.content, 'lxml')
djia = soup.select_one('#quote_val').text
print(djia)
That becomes clear when you inspect the network traffic of the bottom URL you listed and focus on this request:
http://www.barrons.com/mdc/public/js/9_3001_Refresh.js?
which contains the JavaScript for refreshing that value. There you can see the source URL used for the quote.
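To get the timed-interval comparison described in the question, a minimal sketch could wrap that request in a loop, compute the percentage change between pulls, and flag anything over 0.1% (the 60-second interval is an illustrative assumption; the threshold comes from the question):

import time
import requests
from bs4 import BeautifulSoup

def get_djia():
    r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
    soup = BeautifulSoup(r.content, 'lxml')
    # strip thousands separators so the quote parses as a float
    return float(soup.select_one('#quote_val').text.replace(',', ''))

previous = get_djia()
while True:
    time.sleep(60)                # assumed polling interval
    current = get_djia()
    change = (current - previous) / previous * 100
    if abs(change) > 0.1:         # the >0.1% shift from the question
        print(f'DJIA moved {change:+.2f}%: {previous:,.2f} -> {current:,.2f}')
    previous = current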
When I use the Selenium library to find the number of related channels on a YouTube channel page, it gives me 12. But when I use the Requests library to find that number, it gives me 0.
I want to use requests; please help me if that's possible.
My code
Requests
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.youtube.com/channel/UCoykjkkJxsz7JukJR7mGrwg/about")
soup = BeautifulSoup(r.content, 'html.parser')
bb = soup.find_all("ytd-mini-channel-renderer", class_="style-scope ytd-vertical-channel-section-renderer")
print(len(bb))
Selenium
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(chrome_path)  # chrome_path holds the path to chromedriver
driver.get("https://www.youtube.com/channel/UCoykjkkJxsz7JukJR7mGrwg/about")
soup = BeautifulSoup(driver.page_source, 'html.parser')
bb = soup.find_all("ytd-mini-channel-renderer", class_="style-scope ytd-vertical-channel-section-renderer")
print(len(bb))
Every time I've run into an issue like this, it was because JS was creating the data I was after. If this is the case, you likely won't be able to use requests as it can't handle the JS.
If you navigate to that YouTube page in a browser, you can see that "ytd-mini-channel-renderer" exists when you inspect the element, but if you view the page source, you get 0 results. The code you see in "view source" is what requests is getting.
Sometimes the issue is caused by the soup object having different tags from the ones you see from dev tools, which is what is happening in your case. On analysing the soup object you'll notice the information you need is actually now in <h3 class="yt-lockup-title ">.
This code will pull the results you want:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.youtube.com/channel/UCoykjkkJxsz7JukJR7mGrwg/about")
soup = BeautifulSoup(r.content, 'html.parser')
bb = soup.find_all('h3', class_='yt-lockup-title')
print(len(bb))
I am trying to do some web scraping, and I wrote a simple script that aims to print all URLs present on the webpage. I don't know why it skips over many URLs and prints a list starting from the middle instead of from the first URL.
from urllib import request
from bs4 import BeautifulSoup

source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")

for links in soup.select('a'):
    print(links['href'])
Why is that? Could anyone explain to me what is happening?
I am using Python 3.7.1, OS Windows 10 - Visual Studio Code
Often, hrefs provide only part of a URL, not the complete thing. No worries.
Open one in a new tab or browser, find the missing part of the URL, and prepend it to the href as a string.
In this case, that must be 'http://www.bda-ieo.it/test/'.
Here is your code:
from urllib import request
from bs4 import BeautifulSoup

source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")

for links in soup.select('a'):
    print('http://www.bda-ieo.it/test/' + links['href'])
And this is the result:
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1
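One side note, not part of the original answer: plain string concatenation only works while every href is relative to the same directory. The standard library's urllib.parse.urljoin resolves both absolute and relative hrefs against the page URL, so a more robust variant of the same loop might look like this:

from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25"
source = request.urlopen(base)
soup = BeautifulSoup(source, "html.parser")

for links in soup.select('a[href]'):
    # urljoin resolves each href against the page URL, absolute or relative
    print(urljoin(base, links['href']))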