I'm still learning this, but it's the first time I've seen a website tell me I have no permission to access it when using the requests module in Python.
My code should only fetch data from the site, and that's all.
import requests
from bs4 import BeautifulSoup
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
with requests.session() as sr:
    partUrl = sr.get(url_siemens_part)
    soup = BeautifulSoup(partUrl.content, 'html.parser')
    print(soup)
The response I get from this:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7" on this server.<p>
Reference #18.36d61202.1596089808.1cc0ef55
</p></body>
</html>
The website uses ASP.NET. The site is visible from the Chrome browser, but not via requests.
Can you maybe show me a way? Is it a problem with authentication? Maybe I need to use .ASPXAUTH or ASP.NET_SessionId?
Thanks in advance for your time and any answers.
Use a custom User-Agent HTTP header to obtain the correct response:
import requests
from bs4 import BeautifulSoup
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
with requests.session() as sr:
    sr.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'})
    partUrl = sr.get(url_siemens_part)
    soup = BeautifulSoup(partUrl.content, 'html.parser')
    print(soup)
Prints:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=10" http-equiv="X-UA-Compatible"/>
... and so on.
You can use requests-html. If you don't have the library, install it first: pip install requests-html
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url_siemens_part = "https://mall.industry.siemens.com/mall/en/WW/Catalog/Product/5SY6310-7"
sr = HTMLSession()
partUrl = sr.get(url_siemens_part)
soup = BeautifulSoup(partUrl.content,'html.parser')
print(soup)
With the login everything works fine :) I can download all the data, but I run into a problem when I have something like below.
price_catalog = soup.find_all("td",class_="priceDetailsListPrice")
After making the soup I need to find some values, which I do with find_all on "td".
I get this output:
[<td class="priceDetailsListPrice">244,86 EUR
</td>]
Is there some other way than writing a "for" loop like this:
for price_catalog in price_catalog:
    output = price_catalog.text
I think a "for" loop is too much for a single value :(
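If you only need the first (or only) matching cell, a minimal sketch (assuming the same class name as in your snippet) is to use find, which returns a single tag, and take its text directly:
from bs4 import BeautifulSoup

# stand-in for the markup shown in the question
html = '<td class="priceDetailsListPrice">244,86 EUR</td>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match (or None), so no loop is needed
cell = soup.find("td", class_="priceDetailsListPrice")
if cell is not None:
    output = cell.get_text(strip=True)
    print(output)  # 244,86 EUR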
I am trying to develop a program that can grab runes for a specific champion in League of Legends.
And here is my code:
import requests
import re
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
response = requests.get(url).text
soup = BeautifulSoup(response,'lxml')
tables = soup.find('div',class_ = 'img-align-block')
print(tables)
And here is the original HTML File:
<img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" alt="征服者" tooltip="<itemname><img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" width="24" height="24" alt="征服者" /> 征服者</itemname><br/><br/>基礎攻擊或技能在命中敵方英雄時獲得 2 層征服者效果,持續 6 秒,每層效果提供 2-5 適性之力。 最多可以疊加 10 次。遠程英雄每次普攻只會提供 1 層效果。<br><br>在疊滿層數後,你對英雄造成的 15% 傷害會轉化為對自身的回復效果(遠程英雄則為 8%)。" height="36" width="36" class="requireTooltip">
I am not able to access or parse this part, nor find the img src, even though I can browse it on their website.
How can I fix this issue?
The part you are interested in is not in the HTML. You can double-check by searching the output of:
soup.prettify()
Parts of the website are probably loaded with JavaScript, so you could use code that opens a browser and visits that page. For example, you could use Selenium:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(6)  # give the website some time to load
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
tables = soup.find('div', class_='img-align-block')
print(tables)
The website uses JavaScript processing, so you need to use Selenium or another scraping tool that supports JS loading.
Try setting a User-Agent in the headers of your request; without it, the website sends different content, i.e.:
import requests
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
h = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"}
response = requests.get(url, headers=h).text
soup = BeautifulSoup(response,'html.parser')
images = soup.find_all('img', {"class" : 'mainPicture'})
for img in images:
    print(img['src'])
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
I am using the Beautiful Soup module of Python to get the feed URL of any website, but the code does not work for all sites. For example, it works for http://www.extremetech.com/ but not for http://cnn.com/. Actually, http://cnn.com/ redirects to https://edition.cnn.com/, so I used the latter, but with no luck. However, I found by googling that the feed of CNN is here.
My code follows:
import urllib.parse
import requests
import feedparser
from bs4 import BeautifulSoup as bs4
# from bs4 import BeautifulSoup
def findfeed(site):
    user_agent = {
        'User-agent':
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
    raw = requests.get(site, headers=user_agent).text
    result = []
    possible_feeds = []
    # html = bs4(raw, "html5lib")
    html = bs4(raw, "html.parser")
    feed_urls = html.findAll("link", rel="alternate")
    for f in feed_urls:
        t = f.get("type", None)
        if t:
            if "rss" in t or "xml" in t:
                href = f.get("href", None)
                if href:
                    possible_feeds.append(href)
    parsed_url = urllib.parse.urlparse(site)
    base = parsed_url.scheme + "://" + parsed_url.hostname
    atags = html.findAll("a")
    for a in atags:
        href = a.get("href", None)
        if href:
            if "xml" in href or "rss" in href or "feed" in href:
                possible_feeds.append(base + href)
    for url in list(set(possible_feeds)):
        f = feedparser.parse(url)
        if len(f.entries) > 0:
            if url not in result:
                result.append(url)
    for result_indiv in result:
        print(result_indiv, end='\n ')
    # return result
# findfeed("http://www.extremetech.com/")
# findfeed("http://www.cnn.com/")
findfeed("https://edition.cnn.com/")
How can I make the code work for all sites, for example https://edition.cnn.com/? I am using Python 3.
EDIT 1: If I need to use any module other than Beautiful Soup, I am ready to do that.
How can I make the code work for all sites
You can't. Not every site follows the best practices.
It is recommended that the site homepage includes a <link rel="alternate" type="application/rss+xml" ...> or <link rel="alternate" type="application/atom+xml" ...> element, but CNN doesn't follow the recommendation. There is no way around this.
But I found by googling that the feed of CNN is here.
That is not the homepage, and CNN has not provided any means to discover it. There is currently no automated method to discover what sites have made this error.
Actually http://cnn.com/ redirects to https://edition.cnn.com/
Requests handles redirection for you automatically:
>>> response = requests.get('http://cnn.com')
>>> response.url
'https://edition.cnn.com/'
>>> response.history
[<Response [301]>, <Response [301]>, <Response [302]>]
If I need to use any module other than BeautifulSoup, I am ready to do that
This is not a problem a module can solve. Some sites don't implement autodiscovery or do not implement it correctly.
For example, established RSS feed software that implements autodiscovery (like the online https://inoreader.com) can't find the CNN feeds either, unless you use the specific /services/rss URL you found by googling.
Looking at this answer, this should work perfectly:
feeds = html.findAll(type='application/rss+xml') + html.findAll(type='application/atom+xml')
Trying that on the CNN RSS service works perfectly. Your main problem is that edition.cnn.com does not have any trace of RSS in any way or form.
I am trying to scrape this site for information:
https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0
I tried writing code that has worked for other sites, but it just leaves me with an empty text file instead of filling up with data like it has for other sites. Here is my code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import json
import time
outfile = open('/Users/Luca/Desktop/test/farm_data.text','w')
my_list = list()
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0"
my_list.append(site)
for item in my_list:
    time.sleep(5)
    html = urlopen(item)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    nameList = bsObj.prettify().split('.')
    count = 0
    for name in nameList:
        print(name[2:])
        outfile.write(name[2:] + ',' + item + '\n')
I am trying to split it into smaller parts and go from there. I have used this code on sites like this: https://www.mtggoldfish.com/price/Aether+Revolt/Heart+of+Kiran#online
for example, and it worked.
Any ideas why it works for some sites and not others? Thanks so much.
The website in question probably disallows webscraping, which is why you get:
HTTPError: HTTP Error 403: Forbidden
You can spoof your user agent by pretending to be a browser. Here's an example of how to do it using the fantastic requests module: you pass a User-Agent header when making the request.
import requests
from bs4 import BeautifulSoup

url = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj)
Output:
<!DOCTYPE doctype html>
<html class="no-js" lang="en" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
.
.
.
You can massage this code into your loop now.
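For instance, here is a rough sketch of your original loop with urlopen swapped for requests (keeping your prettify().split('.') approach, the URL list, and the file path from your question):
import time
import requests
from bs4 import BeautifulSoup

my_list = [
    "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0",
    "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0",
    "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0",
]

with open('/Users/Luca/Desktop/test/farm_data.text', 'w') as outfile:
    for item in my_list:
        time.sleep(5)  # be polite between requests
        html = requests.get(item, headers={'User-Agent': 'Mozilla/5.0'}).text
        bsObj = BeautifulSoup(html, "html.parser")
        nameList = bsObj.prettify().split('.')
        for name in nameList:
            outfile.write(name[2:] + ',' + item + '\n')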
I am trying to fetch the HTML source from this URL:
http://books.google.com/books?id=NZlV0M5Ije4C&dq=isbn:0470284889
I used the following code:
#!/usr/bin/env python
import urllib, urllib2, urlparse, argparse, re
from bs4 import BeautifulSoup

def getPageSoup(address):
    request = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    urlfile.close()
    print 'soup has been obtained!'
    return BeautifulSoup(page)

address = 'http://books.google.com/books?id=NZlV0M5Ije4C&dq=isbn:0470284889'
soup2 = getPageSoup(address)
metadata = soup2.findAll("metadata_row")  # this content is present when viewing from the web browser
However, the HTML source from soup2 looks hardly anything like the source of the Google Books page:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><head><title>Quantitative Trading: How to Build Your Own Algorithmic Trading Business - Ernie Chan - Google Books</title><script>(function(){function a(c){this.t={};this.tick=function(c,e,b){b=void 0!=b?b:(new Date).getTime();this.t[c]=[b,e]};this.tick("start",null,c)}var d=new a;window.jstiming={Timer:a,load:d};try{var f=null;window.chrome&&window.chrome.csi&&(f=Math.floor(window.chrome.csi().pageT));null==f&&window.gtbExternal&&(f=window.gtbExternal.pageT());null==f&&window.external&&(f=window.external.pageT);f&&(window.jstiming.pt=f)}catch(g){};})();
</script><link href="/books/css/_9937a87cb2905e754d8d5e36995f224d/kl_about_this_book_kennedy_full_bundle.css" rel="stylesheet" type="text/css"/></head></html>
The HTML source from urllib2 and from my web browser are very different. How can I get the correct page source?
Thanks!
It is the correct page source. All visible content of the page is generated by JavaScript, so it's impossible to fetch the actual content using urllib. You should use a browser extension, WebKit bindings, or something like that.
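As a minimal sketch of that approach (assuming Selenium with a Firefox driver is available; the URL is the one from the question), a real browser executes the JavaScript and the rendered source is then handed to BeautifulSoup:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://books.google.com/books?id=NZlV0M5Ije4C&dq=isbn:0470284889'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the page
rendered = driver.page_source
driver.quit()

soup = BeautifulSoup(rendered, 'html.parser')
print(soup.title)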
I am trying to scrape articles from a Chinese newspaper database. Here is some of the source code (pasting an excerpt because it is a keyed site):
<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web\" /><html>
<! -- <%# page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>
</html>
When I try to do some straightforward scraping of the links in the table like so:
import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')]
br.set_handle_robots(False)
url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(
Python does not find anything in the HTML, i.e. it returns an empty list. I think this is because the source code starts with the base href tag, and the parser only recognizes two tags in the document: base href and html.
Any idea how to scrape the links in this case? Thank you so much!!
Removing the second line made BS find all the tags. I didn't find a better way to parse this.
page = br.open(url)
page = page.read().replace('<! -- <%# page contentType="text/html;charset=GBK" %>', '')
soup = BeautifulSoup(page)
BeautifulSoup isn't really developed any longer, and I would suggest you have a look at lxml.
I don't have access to that specific URL, but I was able to get this to work using the HTML fragment (to which I added an a tag):
>>> import lxml.html
>>> soup = lxml.html.document_fromstring(u)  # u is the HTML fragment as a string
>>> soup.cssselect('a')
>>> soup.cssselect('a')[0].text_content()  # for example
When your HTML is very messed up, it's better to clean it up a little first; for instance, in this case, remove everything before <html> and everything after the (first) </html>. Download one page, mold it manually to see what is acceptable to BeautifulSoup, and then write some regexes to preprocess.
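A minimal sketch of that kind of preprocessing (the trimming around the first <html>...</html> block and the removal of the malformed JSP comment are assumptions based on the excerpt above):
import re
from BeautifulSoup import BeautifulSoup

def preprocess(page):
    # keep only the first <html>...</html> block; everything outside it is noise
    match = re.search(r'<html>.*?</html>', page, re.DOTALL | re.IGNORECASE)
    if match:
        page = match.group(0)
    # drop the malformed JSP comment that trips up the parser
    page = re.sub(r'<! -- <%# page contentType=.*?%>', '', page)
    return page

# br and url are the mechanize browser and search URL from the question
raw = br.open(url).read()
soup = BeautifulSoup(preprocess(raw))
links = soup.findAll('a')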