I am trying to make a simple price tracker for Bitcoin and other cryptocurrencies or stocks. I intend to use web scraping to get prices from Google, relying on the BeautifulSoup and requests libraries.
The code is this:
from bs4 import BeautifulSoup
import requests
import time

def getprice():
    url = 'https://www.google.com/search?q=bitcoin+price'
    HTML = requests.get(url)
    soup = BeautifulSoup(HTML.text, 'html.parser')
    text = soup.find('div', attrs={'class': 'BNeawe iBp4i AP7Wnd'}).find('div', attrs={'class': 'BNeawe iBp4i AP7Wnd'}).text
    return text

if __name__ == "__main__":
    bitcoin = getprice()
    print(bitcoin)
I get this error:
File "c:\Users\gabri\Visual Studio\crypto\bitcoinprice.py", line 19, in <module>
bitcoin = getprice()
File "c:\Users\gabri\Visual Studio\crypto\bitcoinprice.py", line 15, in getprice
text = soup.find('div', attrs={'class':'BNeawe iBp4i AP7Wnd'}).find("div", attrs={'class':'BNeawe iBp4i AP7Wnd'}).text
AttributeError: 'NoneType' object has no attribute 'find'
How can I solve it?
This doesn't work because you run into the following problems:
You are hitting Google's bot detection, which means that when you call requests.get you don't get back the Google results; instead you get a response from the bot-detection page asking you to tick a box to confirm you are human.
The class you are searching for doesn't exist.
You are using the default html.parser, which won't help here because Google does not put the price data in the raw HTML it returns to you. Instead you want to use something more capable, like the lxml parser.
Based on what you are trying to do, you could try to get past Google's bot detection by making your request look more legitimate, for example by adding the User-Agent header that a Chrome browser would normally send. Additionally, to get the price it looks like you want the pclqee class on a span element.
Try this instead:
First install the lxml parser:
pip3 install lxml
Then use the below snippet instead:
from bs4 import BeautifulSoup
import requests

def getprice():
    url = "https://www.google.com/search?q=bitcoin+price"
    HTML = requests.get(
        url,
        headers={
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
        },
    )
    soup = BeautifulSoup(HTML.text, "lxml")
    text = soup.find("span", attrs={"class": "pclqee"}).text
    return text

if __name__ == "__main__":
    bitcoin = getprice()
    print(bitcoin)
Although the modified snippet above will work, I wouldn't advise using it. Google will still occasionally detect your request as a bot, so this code would be unreliable.
If you want stock data, I suggest you scrape an API directly or use an API that does this for you already; for example, have a look at https://www.alphavantage.co/
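For illustration, here is a minimal sketch of what querying such an API instead of scraping HTML could look like. The GLOBAL_QUOTE endpoint and apikey parameter come from Alpha Vantage's public documentation; the IBM symbol and the demo key are just placeholders you would replace with your own.

```python
from urllib.parse import urlencode

def build_quote_url(symbol, api_key):
    # Assemble a request URL for Alpha Vantage's GLOBAL_QUOTE endpoint.
    # Fetching it with requests.get(url).json() returns structured price
    # data directly -- no HTML parsing and no bot detection to fight.
    params = {"function": "GLOBAL_QUOTE", "symbol": symbol, "apikey": api_key}
    return "https://www.alphavantage.co/query?" + urlencode(params)

url = build_quote_url("IBM", "demo")
print(url)
```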
The result of the first .find() call is None (no matching div was found), so you can't call .find() on it. It works on my machine, so try printing soup and troubleshooting from there.
I am new to web scraping. I tried to implement it after reading various web tutorials like this and this. Those articles cover Amazon and Netflix web scraping, and there are lots of other tutorials on IMDb, Rotten Tomatoes, and others. Those tutorials give me an overview of which attributes to target, such as class attributes and div tags. Different websites require different tags, but those tags are the fundamental elements of web scraping. When I follow those tutorials I can implement the code, but when I try to parse a website other than the one mentioned, I fail. Recently I tried the code on Priceline, but I just got lost in so much HTML.
My code for Priceline:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding": "gzip, deflate", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
url = 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=8848a774a531423bde3ed4ff3486f8bb'
r = requests.get(url, headers=headers)
content = r.content
soup = BeautifulSoup(content, 'html.parser')
name = []
hotel_div = soup.find_all('div', class_='Box-sc-8h3cds-0.Flex-sc-1ydst80-0.iNmVhl')
for container in hotel_div:
    name = container.find('span', attrs={'class': 'Box-sc-8h3cds-0 Flex-sc-1ydst80-0 BadgeRow__BadgeContainer-fofgl-0 kmpPcP SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf'})
    n = name.find_all('img', alt=True)
    row = {}
    if name is not None:
        row['Name'] = n[0]['alt']
    else:
        row['Name'] = "unknown-product"
print(name)
It returns an empty list.
Can anyone suggest a tutorial or blog that would help me identify the correct HTML tags for any website?
Thank you for the help.
Each web developer chooses to name their classes and tags differently.
To check how a new site is structured, right-click on the element you want to scrape and then click Inspect; a panel appears where you can find the tag, class name, etc.
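Once you have copied a tag and class name out of the inspector, you can turn them into a find() or select_one() call. A minimal sketch, using a made-up HTML fragment in place of a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for what you would see in the inspector.
html = '<div class="price-box"><span class="price">$39,500</span></div>'
soup = BeautifulSoup(html, "html.parser")

# CSS selector built from the class name found via right click -> Inspect.
price = soup.select_one("span.price").text
print(price)
```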
(UPDATED) Now it works:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=04bab06455d612983ec0c76e621d7c48'
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
container = soup.find('a', {'class': 'Link-sc-16qjtx7-0 TitleLink__TitleLinkText-vs18lp-0 jtrNVn'}).text
print(container)
https://i.stack.imgur.com/IJK0c.png
I am working on a project that requires me to view a webpage, but to use the HTML further I have to see it fully rendered, not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?
Here is my code:
import requests
from bs4 import BeautifulSoup

def get_html(url, name):
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text

link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))
WARNING: The page: https://www.labirint.ru/books/255282/ has a redirect to https://www.labirint.ru/books/733371/.
If your goal is to truly parse the CSS:
There are various methods here: Prev Question w/ Answers
I also have used a nice example from this site: Python Code Article
Beautiful Soup will pull the entire page, and it does include the header, styles, scripts, linked CSS and JS, etc. I have used the method in the Python Code article before and retested it for the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
By looking at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.
NOW, if you want to parse the result to pick up all CSS URLs, you can add this (I am still using parts of the code from the well-described Python Code article linked above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    # if the link tag has the 'href' attribute
    if css.attrs.get("href"):
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
The css_files output will be a list of all CSS files. You can now visit each of them separately and see the styles that are being imported.
NOTE: this particular site mixes styles inline with the HTML (i.e. they did not always use CSS files to set style properties; sometimes the styles are inside the HTML content).
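Since some of the styling is inline rather than in the linked CSS files, you can also collect <style> blocks and style attributes from the soup itself. A small sketch, using a made-up HTML fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical page mixing a <style> block with an inline style attribute.
html = """
<html><head><style>body { color: red; }</style></head>
<body><p style="font-weight: bold">hi</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# <style> blocks embedded directly in the page
style_blocks = [tag.get_text() for tag in soup.find_all("style")]

# inline style="..." attributes scattered through the content
inline_styles = [tag["style"] for tag in soup.find_all(style=True)]
```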
This should get you started.
I am trying to develop a program that can grab runes for a specific champion in League of Legends.
And here is my code:
import requests
import re
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
response = requests.get(url).text
soup = BeautifulSoup(response,'lxml')
tables = soup.find('div',class_ = 'img-align-block')
print(tables)
And here is the original HTML File:
<img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" alt="征服者" tooltip="<itemname><img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" width="24" height="24" alt="征服者" /> 征服者</itemname><br/><br/>基礎攻擊或技能在命中敵方英雄時獲得 2 層征服者效果,持續 6 秒,每層效果提供 2-5 適性之力。 最多可以疊加 10 次。遠程英雄每次普攻只會提供 1 層效果。<br><br>在疊滿層數後,你對英雄造成的 15% 傷害會轉化為對自身的回復效果(遠程英雄則為 8%)。" height="36" width="36" class="requireTooltip">
I am not able to access or parse this part at all, nor find the img src, even though I can browse it on their website.
How can I fix this issue?
The part you are interested in is not in the HTML you receive. You can double-check by searching through the output of:
soup.prettify()
Probably parts of the website are loaded with JavaScript, so you could use code that opens a browser and visits the page. For example, you could use Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(6)  # give the website some time to load
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
tables = soup.find('div', class_='img-align-block')
print(tables)
The website uses JavaScript processing, so you need to use Selenium or another scraping tool that supports JS loading.
Try setting a User-Agent in the headers of your request; without it, the website sends different content, i.e.:
import requests
from bs4 import BeautifulSoup

url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
h = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"}
response = requests.get(url, headers=h).text
soup = BeautifulSoup(response, 'html.parser')
images = soup.find_all('img', {"class": 'mainPicture'})
for img in images:
    print(img['src'])
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
I'm new to web scraping, and for learning purposes I want to find all href links on the https://retty.me/ website.
But I found that my code only finds one link on that website, even though the page source I viewed has many links that were not printed. I also printed the full page, and it contains only one link.
What did I do wrong?
Please correct me.
Here is my Python code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

data = []
html = urlopen('https://retty.me')
soup = BeautifulSoup(html, 'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
file = open('scraped_data.txt', 'w')
for item in data:
    file.write("%s\n" % item)
file.close()
If you paste the message shown in the HTML you get back into Google Translate, it says "We apologize for your trouble."
They don't want people scraping their site, so they filter requests based on the user agent. You just need to add a user agent to the request headers that looks like a browser.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

data = []
url = 'https://retty.me'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
html = urlopen(req)
soup = BeautifulSoup(html, 'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
for item in data:
    print(item)
In fact, this particular site only requires the presence of the User-Agent header and will accept any value, even an empty string. The requests library, as mentioned by Rishav, provides a user agent by default; that's why it works there without adding a custom header.
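You can see what requests sends by default with requests.utils.default_user_agent(); the site evidently accepts this identifier while rejecting urllib's default:

```python
import requests

# requests identifies itself as "python-requests/<version>" unless you
# override the User-Agent header; urllib sends "Python-urllib/<version>".
ua = requests.utils.default_user_agent()
print(ua)
```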
I don't know why the website returns different HTML when used with urllib, but you can use the excellent requests library, which is much easier to use than urllib anyway.
from bs4 import BeautifulSoup
import re
import requests

data = []
html = requests.get('https://retty.me').text
soup = BeautifulSoup(html, 'lxml')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
print(data)
You can find the official documentation for requests here and for Beautiful Soup here.
import requests
from bs4 import BeautifulSoup

# your Response object, called response
response = requests.get('https://retty.me')
# your HTML as a string
html = response.text
# verify that you get the correct HTML code
print(html)
# make the html a soup object
soup = BeautifulSoup(html, 'html.parser')
# initialize your list
data = []
# append to your list all the URLs found within the page's <a> tags
for link in soup.find_all('a'):
    data.append(link.get('href'))
# print your list items
print(data)
I am using the Beautiful Soup module of Python to get the feed URL of any website, but the code does not work for all sites. For example, it works for http://www.extremetech.com/ but not for http://cnn.com/. Actually, http://cnn.com/ redirects to https://edition.cnn.com/, so I used the latter, but with no luck. However, I found by Googling that the CNN feed is here.
My code follows:
import urllib.parse
import requests
import feedparser
from bs4 import BeautifulSoup as bs4

def findfeed(site):
    user_agent = {
        'User-agent':
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
    raw = requests.get(site, headers=user_agent).text
    result = []
    possible_feeds = []
    #html = bs4(raw, "html5lib")
    html = bs4(raw, "html.parser")
    feed_urls = html.findAll("link", rel="alternate")
    for f in feed_urls:
        t = f.get("type", None)
        if t:
            if "rss" in t or "xml" in t:
                href = f.get("href", None)
                if href:
                    possible_feeds.append(href)
    parsed_url = urllib.parse.urlparse(site)
    base = parsed_url.scheme + "://" + parsed_url.hostname
    atags = html.findAll("a")
    for a in atags:
        href = a.get("href", None)
        if href:
            if "xml" in href or "rss" in href or "feed" in href:
                possible_feeds.append(base + href)
    for url in list(set(possible_feeds)):
        f = feedparser.parse(url)
        if len(f.entries) > 0:
            if url not in result:
                result.append(url)
    for result_indiv in result:
        print(result_indiv, end='\n ')
    #return result

# findfeed("http://www.extremetech.com/")
# findfeed("http://www.cnn.com/")
findfeed("https://edition.cnn.com/")
How can I make the code work for all sites, for example https://edition.cnn.com/? I am using Python 3.
EDIT 1: If I need to use any module other than Beautiful Soup, I am ready to do that.
How can I make the code work for all sites
You can't. Not every site follows the best practices.
It is recommended that the site homepage includes a <link rel="alternate" type="application/rss+xml" ...> or <link rel="alternate" type="application/atom+xml" ...> element, but CNN doesn't follow the recommendation. There is no way around this.
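For comparison, a homepage that follows the recommendation carries something like the following in its <head>, which the findAll("link", rel="alternate") call in the question picks up (the feed URL here is a made-up example):

```python
from bs4 import BeautifulSoup

# Hypothetical homepage <head> from a site that implements feed autodiscovery.
html = """
<head>
  <link rel="alternate" type="application/rss+xml"
        title="Example feed" href="https://example.com/feed.xml">
</head>
"""
soup = BeautifulSoup(html, "html.parser")
feed = soup.find("link", rel="alternate")
href = feed["href"]
print(href)
```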
But I found by googling that the feed of CNN is here.
That is not the homepage, and CNN has not provided any means to discover it. There is currently no automated method to discover what sites have made this error.
Actually http://cnn.com/ redirects to https://edition.cnn.com/
Requests handles redirection for you automatically:
>>> response = requests.get('http://cnn.com')
>>> response.url
'https://edition.cnn.com/'
>>> response.history
[<Response [301]>, <Response [301]>, <Response [302]>]
If I need to use any module other than BeautifulSoup, I am ready to do that
This is not a problem a module can solve. Some sites don't implement autodiscovery or do not implement it correctly.
For example, established RSS feed software that implements autodiscovery (like the online https://inoreader.com) can't find the CNN feeds either, unless you use the specific /services/rss URL you found by Googling.
Looking at this answer. This should work perfectly:
feeds = html.findAll(type='application/rss+xml') + html.findAll(type='application/atom+xml')
Trying that on the CNN RSS service works perfectly. Your main problem is that edition.cnn.com does not have any trace of RSS in any way or fashion.