I am trying to scrape a web page that consists of JavaScript, CSS and HTML. The page also has some text. When I open the page with a file handle and run soup.get_text(), I would like to see only the text from the HTML portion, not the JavaScript or CSS. Is it possible to do this?
The current source code is:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/Desktop/try.html"), "html.parser")
print(soup.get_text())
What do I change to get only the text of the HTML portion of the page and nothing else?
Try to remove the contents of the tags that hold the unwanted text (or style attributes).
Here is some code (tested in basic cases):
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/Desktop/try.html"), "html.parser")

# Clear every script tag
for tag in soup.find_all('script'):
    tag.clear()

# Clear every style tag
for tag in soup.find_all('style'):
    tag.clear()

# Remove style attributes (if needed)
for tag in soup.find_all(style=True):
    del tag['style']

print(soup.get_text())
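If you would rather remove the script and style tags entirely instead of just emptying them, bs4's decompose() does that; a minimal variant of the loops above:

# Remove the tags themselves, not just their contents
for tag in soup.find_all(['script', 'style']):
    tag.decompose()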
It depends on what you mean by get. Dmralev's answer will clear the other tags, which will work fine. However, <HTML> is itself a tag within the soup, so

print(soup.html.get_text())

should also work, with fewer lines, assuming portion means that the HTML is separate from the rest of the code (i.e. the other code is not within the <HTML> tags).
I want to crawl the data from this website
I only need the text "Pictograph - A spoon 勺 with something 一 in it"
I checked Network -> Doc and I think the information is hidden there,
because I found a line that reads:
i.length > 0 && (r += '<span>» Formation: <\/span>' + i + _Eb)
I think this script generates the part of the page that we can see at the link.
However, I don't understand that code: it has HTML in it, but it also contains many function() calls.
Update
If the code is JavaScript, I would like to know how I can crawl the website without using Selenium.
Thanks!
This page uses JavaScript to add that element. Using Selenium, I can get the HTML after the element has been added and then search for the text in it. This HTML has a strange construction - all the text is in one tag, so that part has no special tag to find it by. But it is the last text in the tag and it starts with "Formation:", so I use BeautifulSoup to get all the text (with all subtags) using get_text(), and then split('Formation:') to take the text after that marker.
import selenium.webdriver
from bs4 import BeautifulSoup as BS

driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')

# Parse the rendered page, not the raw source
soup = BS(driver.page_source, "html.parser")
text = soup.find('div', {'id': "charDef"}).get_text()

# Everything after the "Formation:" label is the part we want
text = text.split('Formation:')[-1]
print(text.strip())
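If the element is injected by JavaScript after the initial load, page_source can be read too early; an explicit wait is the usual guard. A sketch using Selenium's WebDriverWait, keyed to the charDef id from the code above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the definition block exists in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "charDef"))
)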
Maybe Selenium works slower, but it was faster to create a working solution this way.
If I could find the URL that the JavaScript uses to load the data, then I would use it without Selenium, but I didn't see that information in the XHR responses. A few responses were compressed (probably gzip) or encoded, and maybe the text was in there, but I didn't try to uncompress/decode them.
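For reference, if the XHR URL ever turned up, a plain requests call could replace Selenium entirely. A sketch with a hypothetical endpoint (requests decompresses gzip responses transparently; raw gzip bytes captured some other way can be inflated by hand):

import gzip
import requests

# Hypothetical endpoint - the real XHR URL was not found in the Network tab
url = "https://www.archchinese.com/some_endpoint"
r = requests.get(url)
print(r.text)  # requests handles Content-Encoding: gzip automatically

# For raw gzip bytes obtained by other means:
# text = gzip.decompress(raw_bytes).decode("utf-8")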
I'm starting to make progress on a website scraper, but I've run into two snags. Here is the code first:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.nytimes.com")
soup = BeautifulSoup(r.text, "html.parser")
headlines = soup.find_all(class_="story-heading")
for headline in headlines:
    print(headline)
Questions
Why do you have to use find_all(class_="blahblahblah")
instead of just find_all("blahblahblah")? I realize that story-heading is a class of its own, but can't I just search all the HTML using find_all and get the same results? The BeautifulSoup notes show find_all("a") returning all the anchor tags in an HTML document, so why won't find_all("story-heading") do the same?
Is it because, if I try that, it will just look for instances of a "story-heading" tag within the HTML and return those? I am trying to get Python to return everything in that tag. That's my best guess.
Why do I get all this extra junk code? Shouldn't my find_all request just show me everything within the story-heading tag? I'm getting a lot more text than what I am trying to specify.
Beautiful Soup lets you use CSS selectors through the select() method; look in the docs for "CSS selector".
You can find all elements with class "story-heading" like so:
soup.select(".story-heading")
If instead you're looking for an id, just do:
soup.select("#id-name")
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this?
The easiest option would be to extract //body//text() and join everything found:
''.join(sel.xpath("//body//text()").extract()).strip()
where sel is a Selector instance.
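For a self-contained illustration, a Selector can be built straight from an HTML string (markup made up):

from scrapy.selector import Selector

html = "<body><p>First paragraph.</p><p>Second <b>one</b>.</p></body>"
sel = Selector(text=html)
print(''.join(sel.xpath("//body//text()").extract()).strip())
# First paragraph.Second one.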
Another option is to use nltk's clean_html() (note that clean_html() was removed in NLTK 3; recent versions raise NotImplementedError and recommend BeautifulSoup instead):
>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
...
... <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
...
... </div>"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"
Another option is to use BeautifulSoup's get_text():
get_text()
If you only want the text part of a document or tag, you
can use the get_text() method. It returns all the text in a document
or beneath a tag, as a single Unicode string.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
Another option is to use lxml.html's text_content():
.text_content()
Returns the text content of the element, including
the text content of its children, with no markup.
>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
Have you tried?
xpath('//body//text()').re('(\w+)')
OR
xpath('//body//text()').extract()
Note that xpath('//body//text()') selects every text node under body, however deeply nested; it is equivalent to xpath('//body/descendant::text()'). If you type xpath('//body/node()/text()').extract() instead, you will only get the text sitting directly inside the nodes that are children of the HTML body.
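A quick way to see the difference is a throwaway Selector over made-up markup:

from scrapy.selector import Selector

sel = Selector(text="<body><p>hi <b>there</b></p></body>")
print(sel.xpath('//body//text()').extract())             # ['hi ', 'there']
print(sel.xpath('//body/descendant::text()').extract())  # ['hi ', 'there']
print(sel.xpath('//body/node()/text()').extract())       # ['hi '] - only direct children's text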
Hi I'm pretty new to scraping and would appreciate your help.
I am trying to open the following URL using:

from bs4 import BeautifulSoup
import urllib.request
import csv
import re

amicales = urllib.request.urlopen("http://www.journal-officiel.gouv.fr/association/index.php?ACTION=Rechercher&HI_PAGE=1&HI_COMPTEUR=0&original_method=get&WHAT=&JTH_ID=014000%2F014040&JAN_BD_CP=&JRE_ID=%CEle-de-France%2FParis&JAN_LIEU_DECL=&JTY_ID=&JTY_WALDEC=&JTY_SIREN=&JPA_D_D=&JPA_D_F=&rechercher.x=36&rechercher.y=7&rechercher=Rechercher")
soup = BeautifulSoup(amicales, "html.parser")
I want to scrape the results of a search query. The problem is that every result I am interested in ends with an </html> tag.
I believe this forces BeautifulSoup to stop reading the source code after the first search result, so the remaining 20 or so results are ignored.
Here, for example, only the result "NATION INITIATIVE ET OU MACHROU3 WATTAN" is included:
print(soup.prettify())
Can anyone help me to read the whole page, and not just everything before the first </html> tag?
Oh dear, that website is thoroughly broken. You can only have one </html> tag per page. If you look at the source, you will see that there is only one <html> tag, as opposed to some 50 </html> tags.
One workaround would be to remove all the </html> tags before passing the markup to BeautifulSoup:

page = amicales.read()                 # raw bytes of the response
page = page.replace(b"</html>", b"")  # drop every stray closing tag
soup = BeautifulSoup(page, "html.parser")
I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:

from bs4 import BeautifulSoup
import re

def removeHtmlTags(page):
    # Strip anything that looks like a tag, quoted attribute values included
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Let the parser find the text nodes instead of guessing with a regex
    soup = BeautifulSoup(page, "html.parser")
    return ''.join(soup.findAll(text=True))
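As a quick check, both helpers can be run against the sample input from the question; the output matches the expected text, modulo line breaks and spacing:

page = '''<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>'''

print(removeHtmlTags(page))   # Page title / This is paragraph one. / This is paragraph two.
print(removeHtmlTags2(page))  # same text, whitespace differs slightly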
Related
Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (famous don't use regex to parse html rant)
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
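(In recent bs4 versions the text=True argument is spelled string=True, and soup.get_text() performs the same join in one call.)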
Personally, I use lxml because it's a swiss-army knife...
from lxml import html
print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())
This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser to simplify the task.
You want to look at Extracting data from HTML documents - Dive Into Python, because it does (almost) exactly what you want.
The best modules for this task are lxml or html5lib; Beautiful Soup is, imho, not worth using anymore. And for recursively nested markup, regular expressions are definitely the wrong method.
If I am reading your question correctly, this can simply be done using the urlopen function of urllib. Have a look at that function: it opens a URL and reads the response, which will be the HTML code of the page.
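A minimal sketch of that approach in Python 3 (the URL is a placeholder):

from urllib.request import urlopen

# Fetch the page and read the response body as a string
html = urlopen("http://example.com").read().decode("utf-8")
print(html)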
The quickest way to get a usable sample of what a browser would display is to remove any tags from the HTML and print the rest. This can be done, for example, with Python's re.
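A minimal sketch of that, with the usual caveat that regexes over HTML are fragile (see the lxml answer above):

import re

def strip_tags(html):
    # Replace every tag with a space so adjacent words don't run together
    return re.sub(r'<[^>]+>', ' ', html)

text = strip_tags('<p>This is paragraph <b>one</b>.</p>')
print(' '.join(text.split()))  # This is paragraph one .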