BeautifulSoup4 missing tags - python

I'm using BeautifulSoup 4 from Anaconda's distribution as bs4. Correct me if I'm wrong - my understanding is that BeautifulSoup is a library for transforming ill-formed HTML into well-formed HTML. But when I pass HTML to its constructor, I lose more than half of its characters. Shouldn't it only be fixing the HTML, not cleaning it? This isn't well described in the docs.
This is the code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
where html is the HTML of Google's homepage.
Edit:
Could it be caused by the way I'm retrieving the HTML string via str(soup)?

First of all, make sure you actually see these "missing tags" in the HTML that goes into BeautifulSoup for parsing. The problem may not be in how BeautifulSoup parses the HTML, but in how you are retrieving the HTML data to parse.
I suspect you are downloading the Google homepage via urllib2 or requests and comparing what you see inside str(soup) with what you see in a real browser. If that is the case, you cannot compare the two, since neither urllib2 nor requests is a browser: they cannot execute JavaScript, manipulate the DOM after the page load, or make asynchronous requests. What you get with urllib2 or requests is basically the initial HTML page "without the dynamic part".
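For example, a quick way to check what actually comes over the wire (a minimal sketch; the URL is just Google's homepage and requests is assumed to be installed):
import requests
html = requests.get('https://www.google.com/').text  # initial HTML only; no JavaScript runs
print(len(html))    # compare against "View Source" in the browser, not the DOM inspector
print(html[:200])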
If the problem is still in how BeautifulSoup parses the HTML...
As clearly stated in the docs, the behavior depends on which parser BeautifulSoup chooses to use under the hood:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document. But if the document is not perfectly-formed, different parsers will give different results.
See Installing a parser and Specifying the parser to use.
Since you don't specify a parser explicitly, the following rule is applied:
If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
See also Differences between parsers.
In other words, try different parsers and see how the results differ:
soup = BeautifulSoup(html, 'lxml')
soup = BeautifulSoup(html, 'html5lib')
soup = BeautifulSoup(html, 'html.parser')
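To illustrate why this matters, here is a minimal sketch parsing the same malformed fragment (the example comes from the bs4 docs) with all three parsers; the commented output is what recent versions produce, though exact trees can vary by library version:
from bs4 import BeautifulSoup
broken = '<a></p>'  # a deliberately malformed fragment
print(BeautifulSoup(broken, 'lxml'))
# <html><body><a></a></body></html>
print(BeautifulSoup(broken, 'html5lib'))
# <html><head></head><body><a><p></p></a></body></html>
print(BeautifulSoup(broken, 'html.parser'))
# <a></a>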

Related

Beautifulsoup: result is a long random string

I am learning web scraping; however, I ran into an issue preparing the soup. It doesn't even look like the HTML code I can see while inspecting the page.
import requests
from bs4 import BeautifulSoup
URL = "https://www.mediaexpert.pl/"
response = requests.get(URL).text
soup = BeautifulSoup(response,"html.parser")
print(soup)
The result is mostly one long, random-looking string instead of the HTML I expected. I tried to search the whole internet, but I think I have too little knowledge, for now, to find a solution. This random string is 85% of the result.
I will be glad for every bit of help.
BeautifulSoup does not deal with JavaScript-generated content. It only works with static HTML. To extract data generated by JavaScript, you would need to use a tool like Selenium.
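A minimal sketch of the Selenium approach (assuming Chrome, a matching driver, and the selenium package are installed; the parser choice is just an example):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://www.mediaexpert.pl/")
html = driver.page_source  # the DOM after JavaScript has run
driver.quit()
soup = BeautifulSoup(html, "html.parser")
print(soup.title)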

Why can't I find anything in BeautifulSoup documentation about .text or content method?

At the moment I am following a Python course on Udemy and I am learning the concept of web scraping. The way this is done is as follows:
import requests
import bs4
url = requests.get("http://example.com/")
soup = bs4.BeautifulSoup(url.text, "lxml")
Now, I cannot find anything about this text method in the BeautifulSoup documentation. I only know about it because it is clearly explained in the course I am following.
Is this usual? I am asking more from a general point of view, for when I search for relevant information in documentation in the future.
You have to use the .text attribute because it belongs to requests, not to BeautifulSoup: url in your example is a requests Response object, and .text gives you its decoded body. That is why it is not in the BeautifulSoup documentation. If you pass just url, BeautifulSoup only gets the status-code representation of your request (something like <Response [200]>), which cannot be a useful parameter for your soup object.
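A short sketch of what requests actually hands you (example.com as in the question):
import requests
from bs4 import BeautifulSoup

response = requests.get("http://example.com/")
print(type(response))        # <class 'requests.models.Response'>
print(response.status_code)  # 200
# .text is the decoded body as a str; .content is the raw bytes
soup = BeautifulSoup(response.text, "lxml")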

Scraping Google Patents with requests only returns style and script tags

I'm trying to scrape Google Patents using the following code.
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/?q=usb'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc)
But when I try to inspect the document, using
print(soup.prettify())
I cannot get anything other than this https://pastebin.com/Xu81LdfE .
I checked the requests status and it is returning 200. Where am I going wrong?
The results on that page come from a different URL:
https://patents.google.com/xhr/query?url=q%3Dusb&exp=
So instead of using BeautifulSoup, you could do r.json(), and find what you want in the dictionary it creates.
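A hedged sketch of that idea (the endpoint comes from the page's network traffic; its JSON layout is undocumented and may change, and Google may require browser-like headers or block automated requests):
import requests

url = 'https://patents.google.com/xhr/query?url=q%3Dusb&exp='
r = requests.get(url)
data = r.json()           # the results arrive as JSON, not HTML
print(list(data.keys()))  # inspect the structure to locate the result entries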
The data is not in the HTML but loaded with JavaScript, so BeautifulSoup cannot scrape it.
Consider using the official APIs instead; other usage likely violates Google's terms of service, and they will likely block you.

using beautiful soup on local content

I started a research project grabbing pages using wget with the local-links and mirror options. I did it this way at the time to get the data, as I did not know how long the sites would be active. So I have 60-70 sites fully mirrored, with localized links, sitting in a directory. I now need to glean what I can from them.
Is there a good example of parsing these pages using beautifulsoup? I realize that beautifulsoup is usually fed the result of an HTTP request and parses from there. I will be honest: I'm not savvy with beautifulsoup yet, and my programming skills are not awesome. Now that I have some time to devote to it, I would like to do this the easy way rather than the manual way.
Can someone point me to a good example, resource, or tutorial for parsing the HTML I have stored? I really appreciate it. Am I over-thinking this?
Using BeautifulSoup with local content is just the same as with Internet content. For example, to read a local HTML file into bs4:
import urllib.request
import bs4

response = urllib.request.urlopen('file:///Users/Li/Desktop/test.html', timeout=1)
html = response.read()
soup = bs4.BeautifulSoup(html, 'html.parser')
In terms of how to use bs4 for processing HTML, the documentation of bs4 is a pretty good tutorial. In most situations, spending a day reading it is enough for basic data processing.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")

Scraping Flipkart webpage using beautifulsoup

I am trying to scrape this page on Flipkart:
http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto+x+play&otracker=from-search
I am trying to find the div with class "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco", but it returns an empty result.
from bs4 import BeautifulSoup
import requests
url = "http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto%20x%20play&otracker=from-search"
page = requests.get(url)
soup = BeautifulSoup(page.text)
divs = soup.find_all("div",{"class":"fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco"})
print(divs)
divs is empty. I copied the class name using inspect element.
I found the answer in this question: http://stackoverflow.com/questions/22028775/tried-python-beautifulsoup-and-phantom-js-still-cant-scrape-websites
When you use requests.get(url) you load the HTML content of the url without JavaScript enabled. Without JavaScript enabled, the section of the page called 'customers who viewed this product also viewed' is never even rendered.
You can explore this behaviour by turning off JavaScript in your browser. If you scrape regularly, you might also want to download a JavaScript switcher plugin.
An alternative that you might want to look into is using a browser automation tool such as selenium.
requests.get(..) will return the content of a plain HTTP GET on that URL. None of the JavaScript resources the page references will be downloaded, and any inline JavaScript will not be executed either.
If Flipkart uses JS to modify the DOM after it is loaded in the browser, those changes will not be reflected in the page.content or page.text values.
You could try a different parser instead of the default parser in Beautiful Soup. I tried html5lib and it worked for a different website; maybe it will for you too. It will be slower than the default parser, but could be faster than Selenium or other full-fledged headless browsers.
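As a sketch of that last suggestion (html5lib must be installed separately, e.g. pip install html5lib, and it only helps if the tags are actually present in the static HTML):
import requests
from bs4 import BeautifulSoup

page = requests.get(url)  # url as defined in the question above
soup = BeautifulSoup(page.text, "html5lib")  # the most lenient of the three parsers
divs = soup.find_all("div", {"class": "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco"})
print(divs)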
