Web scraping using BeautifulSoup: Retrieving the source code of a website - python

Good day!
I am currently making a web scraper for the Alibaba website.
My problem is that the returned source code does not show some of the parts I am interested in. The data is there when I check the source code in the browser, but I can't retrieve it with BeautifulSoup.
Any tips?
from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = urlopen(url).read()
    except Exception:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)
I am interested in the highlighted part shown in the image from Chrome's Developer Tools. But when I write the page source to a text file, some parts, including the highlighted one, are nowhere to be found. Any tips? TIA!

At a minimum, you need to provide the User-Agent header.
Example using the requests package instead of urllib2:
import requests
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        # identify as a regular browser; many sites block the default client UA
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}).content
    except requests.RequestException:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)
print(soup.select_one("a.next").get('href'))
Prints http://www.alibaba.com/catalogs/products/CID144/2.
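If you then want to walk through all the result pages, a minimal sketch that keeps following that a.next link could look like this (assuming every results page keeps the same a.next markup and absolute hrefs, which I haven't verified):
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"

# follow the "next" link until there is no further page
while url:
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
    print(url)  # process the current page's soup here
    next_link = soup.select_one("a.next")
    url = next_link.get("href") if next_link else None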

Related

Error in finding urls from the website using beautifulsoup

import requests
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = "https://ascscotties.com"
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'html.parser')
links = soup.find_all('a', href=re.compile("roster"))
for link in links:
    print(link.get("href"))
The output:
https://ascscotties.com/roster.aspx?path=wbball
https://ascscotties.com/roster.aspx?path=wcross
https://ascscotties.com/roster.aspx?path=wsoc
https://ascscotties.com/roster.aspx?path=softball
https://ascscotties.com/roster.aspx?path=wten
https://ascscotties.com/roster.aspx?path=wvball
The code does not work for https://owlsports.com/, even though both websites run on the Sidearm platform. Also, the landing page of https://owlsports.com/ does not have any of the roster links.
Could you drop the href parameter from your find_all query?
If getting all the URLs from a website is your requirement, the documentation shows that a simple
links = soup.find_all('a')
for link in links:
    print(link.get("href"))
would do.
Please let me know how your case is different, so we can solve this together. Thank you.
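For owlsports.com specifically, a first thing to try is sending a User-Agent header there as well and then filtering all hrefs for "roster". This is only a sketch, since I haven't checked that site's markup; if the landing page genuinely carries no roster links, you would need to start from a sports index page instead:
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
reqs = requests.get('https://owlsports.com/', headers=headers)
soup = BeautifulSoup(reqs.text, 'html.parser')

# grab every anchor that has an href, then keep the roster-looking ones
for link in soup.find_all('a', href=True):
    if re.search('roster', link['href']):
        print(link['href'])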

Scraping HREF Links contained within a Table

I've been bouncing around a ton of similar questions, but nothing that seems to fix the issue... I've set this up (with help) to scrape the HREF tags from a different URL.
I'm now trying to take the HREF links in the "Result" column from this URL (the stats.ncaa.org page used in the code below).
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
print(profiles)
The following code works:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))
The main issue in this case is that you are not sending a user-agent. Some sites, regardless of whether it is a good idea, use that header to decide that you are a bot, and then serve no content or only some of it.
So, at a minimum, provide that information when making your request:
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look at your selection. Assuming you want only the team links, you should narrow it; I used CSS selectors:
for profile in soup.select('table a[href^="/team/"]'):
You also need to concatenate the base URL onto the extracted values:
profile = 'https://stats.ncaa.org'+profile.get('href')
Example
from bs4 import BeautifulSoup
import requests

profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']

for url in urls:
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org' + profile.get('href')
        profiles.append(profile)
print(profiles)
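As a side note, urllib.parse.urljoin from the standard library is a sturdier way to build the absolute URL than string concatenation, since it also copes with hrefs that are already absolute (the /team/... path below is just an illustrative value):
from urllib.parse import urljoin

base = 'https://stats.ncaa.org'
print(urljoin(base, '/team/12345'))            # https://stats.ncaa.org/team/12345
print(urljoin(base, 'https://example.com/x'))  # absolute hrefs pass through unchanged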

beautiful soup returns none when the element exists in browser

I have looked through the previous answers, but none seemed applicable. I am building an open-source Quizlet scraper to extract all links from a class (e.g. https://quizlet.com/class/3675834/). In this case, the tag is "a" and the class is "UILink". But when I use the following code, the returned list does not contain the element I am looking for. Is it because of the JavaScript issue described here?
I tried the previous method of importing the folder, as written here, but it does not contain the URLs.
How can I scrape these urls?
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}
url = 'https://quizlet.com/class/8536895/'
response = requests.get(url, verify=False, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
b = soup.find_all("a", class_="UILink")
You wouldn't be able to scrape dynamic webpages directly using just requests. What you see in the browser is the fully rendered page, and that rendering is done by the browser.
In order to scrape data from these kinds of webpages, you can follow either of the approaches below.
Use requests-html instead of requests
pip install requests-html
scraper.py
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
url = 'https://quizlet.com/class/8536895/'
response = session.get(url)
response.html.render() # render the webpage
# access html page source with html.html
soup = BeautifulSoup(response.html.html, 'html.parser')
b = soup.find_all("a", class_="UILink")
print(len(b))
Note: this uses a headless browser (Chromium) under the hood to render the page, so it can time out or be a little slow at times.
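If the render does time out, requests-html lets you raise the timeout and pause after rendering; the values below are arbitrary:
# give the headless browser more time, and sleep briefly after rendering
response.html.render(timeout=20, sleep=2)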
Use selenium webdriver
Use driver.get(url) to fetch the page, then pass the page source to Beautiful Soup via driver.page_source.
Note: run this in headless mode as well; there may be some latency at times.
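A minimal sketch of that Selenium approach, assuming Chrome and a matching chromedriver are installed (the URL and class name come from the question above):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # no visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://quizlet.com/class/8536895/')
    # driver.page_source now holds the JavaScript-rendered HTML;
    # slow-loading content may need an explicit wait (WebDriverWait) first
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(len(soup.find_all("a", class_="UILink")))
finally:
    driver.quit()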

Python not getting text between html tags

It looks like Python has trouble finding text when it's marked with display: none. What should I do to overcome this issue?
Here's my code
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.domcop.com/domains/great-expired-domains/')
soup = BeautifulSoup(r.text, 'html.parser')
data = soup.find('div', {'id':'all-domains'})
data.text
The code returns [].
I also tried with XPath:
from lxml import etree
data = etree.HTML(r.text)
anchor = data.xpath('//div[@id="all-domains"]/text()')
It returns the same thing...
Yes, the element with id="all-domains" is empty because it is populated dynamically by the JavaScript executed in the browser. With requests you only get the initial HTML page, without the "dynamic" part, so to speak. To get all the domains, I'd just iterate over the table rows and extract the domain link texts. Working sample:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.domcop.com/domains/great-expired-domains/',
                 headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"})
soup = BeautifulSoup(r.text, 'html.parser')
for domain in soup.select("tbody#domcop-table-body tr td a.domain-link"):
    print(domain.get_text())
Prints:
u2tourfans.com
tvadsview.com
gfanatic.com
blucigs.com
...
twply.com
sweethomeparis.com
vvchart.com

Python, trouble getting embedded video url

OK, I have been scratching my head on this for way too long. I am trying to retrieve the URL for an embedded video on a web page, using the Beautiful Soup and requests modules in Python 2.7.6. I inspect the HTML in Chrome and can see the URL to the video, but when I fetch the page with requests and parse it with Beautiful Soup, I can't find the "video" node. From looking at the source, it appears the video window is a nested HTML document. I have searched all over and can't figure out why I can't retrieve this. If anyone could point me in the right direction I would greatly appreciate it. Thanks.
here is the url to one of the videos:
http://www.growingagreenerworld.com/episode125/
The problem is that there is an iframe with the video tag inside it, and the iframe is loaded asynchronously in the browser.
The good news is that you can simulate that behavior by making an additional request to the iframe URL, passing the current page URL as the Referer.
Implementation:
import re
from bs4 import BeautifulSoup
import requests

url = 'http://www.growingagreenerworld.com/episode125/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}

with requests.Session() as session:
    session.headers = headers
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # follow the iframe url, sending the original page as the Referer
    response = session.get('http:' + soup.iframe['src'], headers={'Referer': url})
    soup = BeautifulSoup(response.content, 'html.parser')

    # extract the video URL from the script tag
    print re.search(r'"url":"(.*?)"', soup.script.text).group(1)
Prints:
http://pdl.vimeocdn.com/43109/378/290982236.mp4?token2=1424891659_69f846779e96814be83194ac3fc8fbae&aksessionid=678424d1f375137f
