I am working on a project which requires me to view a webpage, but to use the HTML further, I have to see it fully and not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?
Here is my code:
import requests
from bs4 import BeautifulSoup

def get_html(url, name):
    # name is not used yet; kept to match the original signature
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text

link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))
WARNING: The page: https://www.labirint.ru/books/255282/ has a redirect to https://www.labirint.ru/books/733371/.
If your goal is to actually parse the CSS:
Various methods are described here: Prev Question w/ Answers
I have also used a nice example from this site: Python Code Article
The request downloads the entire page, and Beautiful Soup parses all of it, including the head, styles, scripts, and the tags that link in external CSS and JS. I have used the method from the Python Code Article before and retested it with the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
Looking at the soup output (it is very long, so I will not paste it here), you can see that it is a complete page. Just make sure to paste in your specific link.
If you then want to parse the result to pick up all the CSS URLs, you can add this (I am still using parts of the code from the well-described Python Code Article linked above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
css_files will then hold the resolved URL of every <link> tag on the page; these are mostly stylesheets, but can also include icons and other linked resources. You can visit those URLs separately to see the styles being imported.
NOTE: this particular site mixes inline styles with the HTML, i.e. the styles are not always set through CSS files; sometimes they sit inside the HTML content itself.
This should get you started.
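To separate real stylesheets from other linked resources, and to pick up the inline styles mentioned in the note, something along these lines may help. This is only a sketch: SAMPLE_HTML, the example.com base URL, and the collect_styles helper are made up for illustration and are not part of the linked article.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/css/main.css">
<link rel="icon" href="/favicon.ico">
<style>p { color: red; }</style>
</head><body>
<p style="margin: 0">Hello</p>
</body></html>
"""

def collect_styles(html, base_url):
    """Return (stylesheet URLs, <style> block contents, inline style attributes)."""
    soup = BeautifulSoup(html, "html.parser")
    # Only <link> tags whose rel includes "stylesheet" and that carry an href
    sheet_urls = [
        urljoin(base_url, link["href"])
        for link in soup.find_all("link", rel="stylesheet", href=True)
    ]
    # Contents of <style>...</style> blocks embedded in the page
    style_blocks = [style.string or "" for style in soup.find_all("style")]
    # Values of style="..." attributes set directly on elements
    inline_styles = [tag["style"] for tag in soup.find_all(style=True)]
    return sheet_urls, style_blocks, inline_styles

sheets, blocks, inline = collect_styles(SAMPLE_HTML, "https://example.com/books/1/")
print(sheets)
print(blocks)
print(inline)
```

With a real page you would pass the downloaded HTML and the page URL instead of the sample values.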
I am new to web scraping. I tried to implement it after reading various web tutorials, such as ones about scraping Amazon and Netflix; there are many others on IMDb, Rotten Tomatoes and so on. Those tutorials give me an overview of which attributes to target, such as class attributes and div tags. Different websites structure those tags differently, but the tags are the fundamental elements of web scraping. When I follow the tutorials I can get their code working, but when I try to parse a website other than the one they cover, I fail. Recently I tried the code below on Priceline, but I got lost in the sheer amount of HTML.
My code for Priceline:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
url = 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=8848a774a531423bde3ed4ff3486f8bb'
r = requests.get(url, headers=headers)  # , proxies=proxies)
content = r.content
soup = BeautifulSoup(content, 'html.parser')
names = []
# note: class_ expects class names, not a dotted CSS selector
hotel_div = soup.find_all('div', class_='Box-sc-8h3cds-0.Flex-sc-1ydst80-0.iNmVhl')
for container in hotel_div:
    badge = container.find('span', attrs={'class': 'Box-sc-8h3cds-0 Flex-sc-1ydst80-0 BadgeRow__BadgeContainer-fofgl-0 kmpPcP SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf'})
    # only look for images when the badge container was actually found
    n = badge.find_all('img', alt=True) if badge is not None else []
    row = {}
    if n:
        row['Name'] = n[0]['alt']
    else:
        row['Name'] = "unknown-product"
    names.append(row)
print(names)
It returns an empty list.
Can anyone suggest a tutorial or blog that would help me identify the correct HTML tags for any website?
Thank you for the help.
Each web developer chooses their own class names and tag structure.
To check how a new site is structured, right-click the element you want to scrape and click Inspect. A panel should appear where you can find the tag, class name, and so on.
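Once you have the tag and class names from the inspector, how you pass them to Beautiful Soup matters. The sketch below uses made-up markup and class names (not Priceline's real ones) to show why a dotted string in class_ finds nothing, while the same string works as a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are made up for illustration.
SAMPLE_HTML = """
<div class="Box Flex card"><a class="TitleLink jtrNVn" href="/h/1">Hotel One</a></div>
<div class="Box other"><a class="TitleLink" href="/h/2">Hotel Two</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ takes a class name, not a dotted CSS selector, so this matches nothing
dotted = soup.find_all("div", class_="Box.Flex")

# As a CSS selector, the dotted form requires both classes, in any order
both = soup.select("div.Box.Flex")

# Matching on one stable class avoids depending on generated-looking suffixes
titles = [a.get_text(strip=True) for a in soup.select("a.TitleLink")]

print(len(dotted), len(both), titles)
```

This only helps when the elements are present in the HTML you downloaded; content added later by JavaScript will still be missing.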
(UPDATED) Now it works:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=04bab06455d612983ec0c76e621d7c48'
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source  # HTML after the browser has run the page's JavaScript
driver.quit()
soup = BeautifulSoup(html, "lxml")
container = soup.find('a', {'class': 'Link-sc-16qjtx7-0 TitleLink__TitleLinkText-vs18lp-0 jtrNVn'}).text
print(container)
https://i.stack.imgur.com/IJK0c.png
I'm trying to scrape the inventory values from a webpage, but they don't show up in the output of my Python script.
Here's the original tag as it appears in the browser, with the text I want to scrape:
<span class="currentInv">251</span>
" in stock"
And this is the tag after parsing it with BeautifulSoup and the lxml parser (I also tried other parsers such as html.parser and html5lib):
<span class="currentInv"></span>
Here's my full Python script:
import requests
from bs4 import BeautifulSoup as bs
url = f'https://www.hancocks.co.uk/buy-wholesale-sweets?warehouse=1983&p=1'
parser = 'lxml'
headers = {'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
response = requests.get(url, headers=headers)
data = response.text
soup = bs(data, parser)
print(soup.find('span', class_ = 'currentInv').text)
The output is empty.
I have tried over and over, but nothing seems to work.
Any help would be so much appreciated.
If you view the source of the page, you'll see that the server-side rendered HTML sent to the browser also contains no value in that span tag (i.e. view-source:https://www.hancocks.co.uk/buy-wholesale-sweets?warehouse=1983&p=1).
The value 251 is likely added client-side via JavaScript after the DOM has loaded.
I'd go through the answers to "Web-scraping JavaScript page with Python" for more ways to try to extract that JavaScript value.
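One approach sometimes suggested for this situation is to check whether the data is already embedded in the page inside a script tag. The sketch below is purely illustrative: STATIC_HTML, the stock-data id, and the JSON layout are invented, and the real site may load its numbers some other way (for example through a separate request).

```python
import json

from bs4 import BeautifulSoup

# Hypothetical static HTML as a server might send it: the span is empty and
# the data sits in a JSON <script> block that client-side code reads later.
STATIC_HTML = """
<span class="currentInv"></span> in stock
<script type="application/json" id="stock-data">{"currentInv": 251}</script>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# Step 1: confirm the value is missing from the static markup
span_text = soup.find("span", class_="currentInv").get_text(strip=True)
print(repr(span_text))

# Step 2: look for the same data in an embedded script block
script = soup.find("script", id="stock-data")
stock = json.loads(script.string)["currentInv"] if script and script.string else None
print(stock)
```

If nothing like that exists in the page source, rendering the page in a real browser is the more general route.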
Most likely the page you see in your browser contains dynamic content. This means that when you inspect the page, you see the final result after some JavaScript code ran and manipulated the DOM that is rendered in the browser. When you load the same page in Python code using Beautiful Soup, you get the raw HTML that comes from the request. The JavaScript code for the dynamic content isn't executed, so you will not see the same results.
One solution is to load the page with Selenium rather than fetching it with a plain HTTP request. Selenium loads the page in a real browser and provides an API to interact with that page.
I'm trying to retrieve some info from the following web page:
https://web.archive.org/web/19990421025223/http://www.rbc.ru
I constructed a selector which does highlight the desired table in Chrome's Inspection mode:
selector = 'body > table:nth-of-type(2) > tbody:nth-of-type(1)>tr:nth-of-type(1)>td:nth-of-type(5)>table:nth-of-type(1)>tbody:nth-of-type(1)'
however when running a script with bs4 .select() method:
import requests
from bs4 import BeautifulSoup
import lxml
url = 'https://web.archive.org/web/19990421025223/http://www.rbc.ru'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
selector = 'body > table:nth-of-type(2) > tbody:nth-of-type(1)>tr:nth-of-type(1)>td:nth-of-type(5)>table:nth-of-type(1)>tbody:nth-of-type(1)'
print(soup.select(selector=selector))
the output is [], which is not what I expect, since the browser shows HTML at that location.
What am I missing here?
You cannot expect browser-generated selectors to work reliably in BeautifulSoup: when a page is rendered in the browser the markup changes, whereas when you download the page in Python code there is no rendering, and you only get the initial, non-rendered HTML.
Here, you have to come up with your own CSS selector or another way to locate the table element.
As the markup of the page is not really HTML-parsing-friendly, I'd locate the table element by one of its column names:
table = soup.find("b", text="спрос").find_parent("table")
Note that it only worked for me when I parsed the page with a lenient html5lib parser:
soup = BeautifulSoup(response.content, "html5lib")
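As a self-contained illustration of the same idea, here is a small sketch on made-up, well-formed markup. The ids, the numbers, and the table_with_header helper are hypothetical, and html.parser is used only because the sample is clean; the archived page itself needed html5lib for me. string= is the current name for the text= argument.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the archived page's layout tables.
SAMPLE_HTML = """
<table id="nav"><tr><td>menu</td></tr></table>
<table id="quotes">
  <tr><td><b>спрос</b></td><td><b>предложение</b></td></tr>
  <tr><td>24.10</td><td>24.25</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

def table_with_header(soup, header_text):
    """Return the nearest enclosing <table> of a <b> with this exact text, or None."""
    cell = soup.find("b", string=header_text)
    return cell.find_parent("table") if cell is not None else None

table = table_with_header(soup, "спрос")
# Collect the cell text row by row
rows = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in table.find_all("tr")]
print(table["id"], rows)
```

Returning None when the header is absent avoids the AttributeError you would get from chaining find(...).find_parent(...) directly.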
Since JavaScript can render the page at run time quite differently from the source, bs4 alone is not a good fit for websites that change dynamically.
I would recommend Selenium, as it actually opens the website and lets you wait until a certain element has been rendered before searching. There are also headless browser libraries that emulate the browser environment silently, if you don't want a browser window to pop up.
There are two problems in your code. First, in BeautifulSoup (at least in older releases) the combinators +, > and ~ in a CSS selector need to be separated by spaces; see here if you want to patch bs4.
Second, as in my previous answer to your question, there is no tbody in the page source; it is generated by the browser.
Here is the fixed CSS selector:
selector = 'body > table:nth-of-type(2) > tr:nth-of-type(1) > td:nth-of-type(5) > table:nth-of-type(1)'
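The tbody point is easy to check on a tiny example. The markup below is made up, and html.parser is chosen because it keeps the source structure as written; a parser that follows the browser's tree-building rules, such as html5lib, may add tbody the way a browser does.

```python
from bs4 import BeautifulSoup

# Hypothetical source markup with no <tbody>, as in the page source.
SOURCE_HTML = "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"

soup = BeautifulSoup(SOURCE_HTML, "html.parser")

# Selector copied from a browser, which shows a <tbody> the source never had
with_tbody = soup.select("body > table > tbody > tr > td:nth-of-type(2)")

# Same path with the browser-inserted <tbody> step removed
without_tbody = soup.select("body > table > tr > td:nth-of-type(2)")

print(with_tbody)
print([td.get_text() for td in without_tbody])
```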
I'm new to web scraping, and for learning purposes I want to find all the href links on the https://retty.me/ website.
However, my code finds only one link on that site. When I view the page source in the browser it has many links that were not printed. I also printed the full page, and it contains only one link.
What did I do wrong?
Please correct me.
Here is my Python code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
data=[]
html = urlopen('https://retty.me')
soup = BeautifulSoup(html,'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
# the with block closes the file automatically
with open('scraped_data.txt', 'w') as file:
    for item in data:
        file.write("%s\n" % item)
If you paste the message shown in the HTML you get into Google Translate, it says "We apologize for your trouble".
They don't want people scraping their site, so they filter requests based on the user agent. You just need to add a browser-like user agent to the request headers.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
data=[]
url = 'https://retty.me'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
html = urlopen(req)
soup = BeautifulSoup(html,'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
for item in data:
    print(item)
In fact, this particular site only requires the presence of a user agent header and will accept any user agent, even an empty string. The requests library, as mentioned by Rishav, provides a user agent by default, which is why it works there without a custom header.
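If you want to check what header a urllib request will carry before sending it, a small helper like the following can be handy. build_request and its default value are hypothetical names for this sketch; nothing is sent over the network here.

```python
from urllib.request import Request

def build_request(url, user_agent="Mozilla/5.0"):
    """Build a urllib Request carrying an explicit User-Agent header (nothing is sent)."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://retty.me")

# urllib stores header names in capitalized form, i.e. "User-agent"
print(req.has_header("User-agent"))
print(req.get_header("User-agent"))
```

The resulting object can be passed to urlopen exactly like the Request in the code above.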
I don't know why the website returns different HTML when used with urllib, but you can use the excellent requests library which is much easier to use than urllib anyway.
from bs4 import BeautifulSoup
import re
import requests
data = []
html = requests.get('https://retty.me').text
soup = BeautifulSoup(html, 'lxml')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
print(data)
You can find the official documentation for requests here and for Beautiful Soup here.
import requests
from bs4 import BeautifulSoup
# your Response object called response
response = requests.get('https://retty.me')
# your html as string
html = response.text
#verify that you get the correct html code
print(html)
#make the html, a soup object
soup = BeautifulSoup(html, 'html.parser')
# initialization of your list
data = []
# append to your list all the URLs found within a page’s <a> tags
for link in soup.find_all('a'):
    data.append(link.get('href'))
#print your list items
print(data)
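Note that link.get('href') returns None for anchors without an href, and relative links come back as written. A possible refinement, sketched on made-up markup (extract_links and SAMPLE_HTML are illustrative names, not part of the code above):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
SAMPLE_HTML = """
<a href="https://retty.me/area/">Area</a>
<a href="/restaurant/1">Relative link</a>
<a name="top">Anchor without href</a>
<a href="https://retty.me/area/">Duplicate</a>
"""

def extract_links(html, base_url):
    """Return absolute URLs from <a href> tags, in page order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    for a in soup.find_all("a", href=True):  # href=True skips tags with no href
        absolute = urljoin(base_url, a["href"])
        if absolute not in seen:
            seen.append(absolute)
    return seen

links = extract_links(SAMPLE_HTML, "https://retty.me")
print(links)
```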
I am using Python's Beautiful Soup module to get the feed URL of any website, but the code does not work for all sites. For example, it works for http://www.extremetech.com/ but not for http://cnn.com/. In fact http://cnn.com/ redirects to https://edition.cnn.com/, so I used the latter, but with no luck. By googling, though, I found that the CNN feed is here.
My code follows:
import urllib.parse
import requests
import feedparser
from bs4 import BeautifulSoup as bs4
# from bs4 import BeautifulSoup
def findfeed(site):
    user_agent = {
        'User-agent':
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
    raw = requests.get(site, headers = user_agent).text
    result = []
    possible_feeds = []
    #html = bs4(raw,"html5lib")
    html = bs4(raw,"html.parser")
    feed_urls = html.findAll("link", rel="alternate")
    for f in feed_urls:
        t = f.get("type",None)
        if t:
            if "rss" in t or "xml" in t:
                href = f.get("href",None)
                if href:
                    possible_feeds.append(href)
    parsed_url = urllib.parse.urlparse(site)
    base = parsed_url.scheme+"://"+parsed_url.hostname
    atags = html.findAll("a")
    for a in atags:
        href = a.get("href",None)
        if href:
            if "xml" in href or "rss" in href or "feed" in href:
                possible_feeds.append(base+href)
    for url in list(set(possible_feeds)):
        f = feedparser.parse(url)
        if len(f.entries) > 0:
            if url not in result:
                result.append(url)
    for result_indiv in result:
        print( result_indiv,end='\n ')
    #return(result)

# findfeed("http://www.extremetech.com/")
# findfeed("http://www.cnn.com/")
findfeed("https://edition.cnn.com/")
How can I make the code work for all sites, for example https://edition.cnn.com/? I am using Python 3.
EDIT 1: If I need to use a module other than Beautiful Soup, I am ready to do that.
"How can I make the code work for all sites"
You can't. Not every site follows the best practices.
It is recommended that the site homepage includes a <link rel="alternate" type="application/rss+xml" ...> or <link rel="alternate" type="application/atom+xml" ...> element, but CNN doesn't follow the recommendation. There is no way around this.
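For sites that do follow the recommendation, autodiscovery can be sketched roughly like this. The sample markup, the example.com URLs, and the discover_feeds helper are made up for illustration; urljoin is used so that relative and absolute href values are both handled.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

# Hypothetical homepage markup for illustration.
SAMPLE_HTML = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed/">
<link rel="alternate" type="application/atom+xml" href="https://example.com/atom.xml">
<link rel="alternate" hreflang="de" href="https://example.com/de/">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>
"""

def discover_feeds(html, page_url):
    """Return absolute feed URLs advertised via <link rel="alternate">."""
    soup = BeautifulSoup(html, "html.parser")
    feeds = []
    for link in soup.find_all("link", rel="alternate", href=True):
        # Keep only the RSS and Atom media types; other alternates are ignored
        if link.get("type", "").strip().lower() in FEED_TYPES:
            feeds.append(urljoin(page_url, link["href"]))
    return feeds

found = discover_feeds(SAMPLE_HTML, "https://example.com/")
none_found = discover_feeds("<html><head></head></html>", "https://example.com/")
print(found)
print(none_found)
```

A page without such link elements simply yields an empty list, which is the situation described for the CNN homepage.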
"But I found by googling that the feed of CNN is here."
That is not the homepage, and CNN has not provided any means to discover it. There is currently no automated method to discover what sites have made this error.
"Actually http://cnn.com/ redirects to https://edition.cnn.com/"
Requests handles redirection for you automatically:
>>> response = requests.get('http://cnn.com')
>>> response.url
'https://edition.cnn.com/'
>>> response.history
[<Response [301]>, <Response [301]>, <Response [302]>]
"If I need to use any module other than BeautifulSoup, I am ready to do that"
This is not a problem a module can solve. Some sites don't implement autodiscovery or do not implement it correctly.
For example, established RSS feed software that implements autodiscovery (like the online https://inoreader.com) can't find the CNN feeds either, unless you use the specific /services/rss URL you found by Googling.
Looking at this answer, this should work:
feeds = html.findAll(type='application/rss+xml') + html.findAll(type='application/atom+xml')
Trying that on the CNN RSS service page works. Your main problem is that edition.cnn.com does not contain any trace of RSS at all.