BeautifulSoup Object is different from request content - python

I call the get function from the requests module in Python and pass the response content to BeautifulSoup. But when I print the BeautifulSoup object, it is quite different from the response content: some tags are missing and some are repeated. Why does this happen? For example:
import requests
from bs4 import BeautifulSoup
req1 = requests.get(url, headers=headers)  # headers must be passed by keyword
print(req1.content)
s1 = BeautifulSoup(req1.content)
print(s1)

At the very least, this happens because the HTML may not be well-formed, and BeautifulSoup's underlying parser attempts to fix it. The behavior varies from parser to parser; see "Differences between parsers" in the Beautiful Soup documentation.
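To make the repair step concrete, here is a minimal sketch that feeds deliberately broken markup to html.parser and then to any other parsers that happen to be installed. The markup string is made up for illustration, and lxml and html5lib are optional extras, so the loop skips them when they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # illustrative input: unclosed <a>, stray closing </p>

# html.parser ships with Python, so this call needs no extra packages.
soup = BeautifulSoup(markup, "html.parser")
print("html.parser:", soup)

# Optional parsers; each one may repair the same markup differently.
for parser in ("lxml", "html5lib"):
    try:
        print(parser + ":", BeautifulSoup(markup, parser))
    except FeatureNotFound:
        print(parser + ": not installed")
```

Comparing the printed trees with the original string shows that the parsed document is a repaired version of the input, not a copy of it.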

Related

How to control if a string is present in a website through python

I'm trying to identify if a string like "data=sold" is present in a website.
Now I'm using requests and a while loop but I need it to be faster:
response = requests.get(link)
if ('data=sold' in response.text):
It works well, but it is not fast. Is there a way to "request" only the part of the website I need, to make the search faster?
I assume response.text is HTML, right?
Instead of searching the raw string, you can parse the page with Beautiful Soup (see its documentation):
from bs4 import BeautifulSoup
html = response.text
bs = BeautifulSoup(html, 'html.parser')
# collect the data-sold attribute of every <ul> that has one
[item['data-sold'] for item in bs.find_all('ul', attrs={'data-sold': True})]
You can see another reference here.
Alternatively, consider a parallel for loop in Python,
so that many requests run at the same time.
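A minimal sketch of the parallel idea, using a thread pool from the standard library. The name find_matches and the fetch parameter are hypothetical: fetch is any function you supply that takes a link and returns the page text, which keeps the sketch independent of the HTTP library.

```python
from concurrent.futures import ThreadPoolExecutor

def find_matches(links, fetch, needle="data=sold", max_workers=8):
    """Return the links whose fetched text contains `needle`.

    `fetch` is a caller-supplied function: link -> page text.
    """
    links = list(links)
    # Run the downloads concurrently; map() keeps results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, links))
    return [link for link, text in zip(links, pages) if needle in text]
```

With requests, fetch could be something like lambda link: requests.get(link, timeout=10).text. This helps when there are many pages to check; it does not make a single download faster.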
As already commented, whether you can request only part of the page depends on the website/server. Since it is a website, I would think it's not possible.
If the website is really big, the only way I can currently think of to make the search faster is to process the data as it arrives. When you call requests.get(link), the whole page is downloaded before you can process it. You could try calling
r = requests.get(link, stream=True)
instead. And then iterate through all the lines:
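for line in r.iter_lines():  # yields bytes, so compare against bytes
    if b'data=sold' in line:
        print("hooray")
        break  # stop downloading once the string is found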
Of course you could also analyze the raw stream and just skip x bytes, use the aiohttp library, ... maybe you need to give some more information about your problem.
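One caveat when scanning a stream piece by piece: the search string can be split across two pieces. The following is a minimal sketch of a chunk scanner that handles this by carrying over the tail of the previous data. The name stream_contains is hypothetical, and it works on any iterable of bytes objects.

```python
def stream_contains(chunks, needle):
    """Return True as soon as `needle` (bytes) appears in the byte chunks."""
    if not needle:
        return True
    keep = len(needle) - 1  # bytes to carry over between chunks
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if needle in window:
            return True  # stop early; the rest is never read
        # Keep just enough bytes to catch a match spanning two chunks.
        tail = window[-keep:] if keep else b""
    return False
```

With requests, the chunks could come from r.iter_content(chunk_size=8192) on a response opened with stream=True, for example stream_contains(r.iter_content(chunk_size=8192), b"data=sold").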

Correctly parse empty html tags using beautiful soup

HTML has a concept of empty elements, as listed on MDN. However, beautiful soup doesn't seem to handle them properly:
import bs4
soup = bs4.BeautifulSoup(
'<div><input name=the-input><label for=the-input>My label</label></div>',
'html.parser'
)
print(soup.contents)
I get:
[<div><input name="the-input"><label for="the-input">My label</label></input></div>]
I.e. the input has wrapped the label.
Question: Is there any way to get beautiful soup to parse this properly? Or is there an official explanation of this behaviour somewhere I haven't found yet?
At the very least I'd expect something like:
[<div><input name="the-input"></input><label for="the-input">My label</label></div>]
I.e. the input automatically closed before the label.
As stated in the documentation, html5lib parses the document the way a web browser does (like lxml in this case). It will try to fix your document tree by adding or closing tags where needed.
In your example I've used lxml as the parser and it gave the following result:
soup = bs4.BeautifulSoup(
'<div><input name=the-input><label for=the-input>My label</label></div>',
'lxml'
)
print(soup.body.contents)
[<div><input name="the-input"/><label for="the-input">My label</label></div>]
Note that lxml added html & body tags because they weren't present in the source, that is why I've printed the body contents.
I would say Beautiful Soup is doing what it can to fix this HTML structure, which is actually helpful on some occasions.
Anyway, for your case I would use lxml, which parses the HTML structure the way you want, or maybe give parsel a try.
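To see how each installed parser treats the unclosed input from the question, a small check like the following may help. It is only a sketch: label_nested_in_input is a hypothetical helper name, and the result depends on the Beautiful Soup and parser versions you have installed, so no output is shown.

```python
import bs4

MARKUP = '<div><input name=the-input><label for=the-input>My label</label></div>'

def label_nested_in_input(parser):
    """True if the parser put <label> inside <input>; None if parser is missing."""
    try:
        soup = bs4.BeautifulSoup(MARKUP, parser)
    except bs4.FeatureNotFound:
        return None
    # find() on a tag searches its descendants only.
    return soup.find('input').find('label') is not None

for parser in ('html.parser', 'lxml', 'html5lib'):
    print(parser, label_nested_in_input(parser))
```

A result of False means the input was closed before the label, which is the behaviour the question expects.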

Set lxml as default BeautifulSoup parser

I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:
soup = bs4.BeautifulSoup(html, 'lxml')
but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?
According to the Specifying the parser to use documentation page:
The first argument to the BeautifulSoup constructor is a string or an
open filehandle–the markup you want parsed. The second argument is how
you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
In other words, just installing lxml in the same python environment makes it a default parser.
Note, though, that explicitly stating a parser is considered best practice. There are differences between parsers that can result in subtle errors, which would be difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember that lxml needs to be installed. And if it were not installed, you would not even notice: BeautifulSoup would just pick the next available parser without throwing any errors.
If you still don't want to specify the parser explicitly, at least leave a note in the project's README/documentation for your future self and for others who will use your code, and list lxml in your project requirements alongside beautifulsoup4.
Besides: "Explicit is better than implicit."
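If you do rely on the implicit choice, one way to avoid the silent fallback described above is to fail loudly at start-up when lxml is missing. This is a minimal sketch using only the standard library; require_module is a hypothetical helper name, not part of Beautiful Soup.

```python
import importlib.util

def require_module(name):
    """Raise a clear error if the top-level module `name` is not installed."""
    if importlib.util.find_spec(name) is None:
        raise RuntimeError(
            f"{name} is not installed; install it or choose another parser"
        )

# At program start-up, before any BeautifulSoup(...) call:
# require_module("lxml")
```

The call is left commented out so the snippet runs even where lxml is absent; enable it in a project that depends on lxml.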
Obviously take a look at the accepted answer first. It is pretty good, and as for this technicality:
but I don't want to have to repeatedly type 'lxml' every time I call
BeautifulSoup. Is there a way I can set which parser to use once at
the beginning of my program?
If I understood your question correctly, I can think of two approaches that will save you some keystrokes:
- Define a wrapper function, or
- Create a partial function.
# V1 - define a wrapper function - most straightforward.
import bs4
def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')
# ...
html = ...
bs_parse(html)
Or if you feel like showing off ...
import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)

How do I get the HTML of a wiki page with Pywikibot?

I'm using pywikibot-core. Before that I used another Python MediaWiki API wrapper, Wikipedia.py (which has a .HTML method). I switched to pywikibot-core because I think it has many more features, but I can't find a similar method.
(beware: I'm not very skilled).
I'll post user283120's second answer here, since it is more precise than the first one:
Pywikibot core doesn't support any direct (HTML) way to interact with a wiki, so you should use the API.
If you need to, you can do it easily with urllib (urllib2 in Python 2).
This is an example I used to get the HTML of a wiki page on Commons:
from urllib.request import urlopen  # Python 3; the original used urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ", "_")
html = urlopen(url).read().decode('utf-8')
"[saveHTML.py] downloads the HTML-pages of articles and images and saves the interesting parts, i.e. the article-text and the footer to a file"
source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py
IIRC you want the HTML of the entire pages, so you need something that uses api.php?action=parse. In Python I'd often just use wikitools for such a thing, I don't know about PWB or the other requirements you have.
In general you should use pywikibot instead of wikipedia (e.g. instead of "import wikipedia" you should use "import pywikibot"). If you are looking for methods and classes that used to be in wikipedia.py, they are now split up and can be found in the pywikibot folder (mainly in page.py and site.py).
If you want to run scripts that you wrote for compat, you can use a script in pywikibot-core named compat2core.py (in the scripts folder). There is also detailed help on the conversion named README-conversion.txt; read it carefully.
The MediaWiki API has a parse action which lets you get the HTML snippet for the wiki markup, as returned by the MediaWiki markup parser.
For the pywikibot library there is already a function implemented which you can use like this:
def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page Title
    Args:
        pageTitle(str): the title of the page to retrieve
    Returns:
        str: the rendered HTML code for the page
    '''
    page = self.getPage(pageTitle)
    html = page._get_parsed_page()
    return html
With the mwclient Python library there is a generic api method; see:
https://github.com/mwclient/mwclient/blob/master/mwclient/client.py
Which can be used to retrieve the html code like this:
def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page Title
    Args:
        pageTitle(str): the title of the page to retrieve
    '''
    api = self.getSite().api("parse", page=pageTitle)
    if "parse" not in api:
        raise Exception("could not retrieve html for page %s" % pageTitle)
    html = api["parse"]["text"]["*"]
    return html
As shown above, this gives a duck-typed interface, which is implemented in the py-3rdparty-mediawiki library, for which I am a committer. This was resolved by closing issue 38 - add html page retrieval.
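The parse action used above can also be called without any wrapper library. The following is a minimal sketch with the standard library only: build_parse_url and extract_html are hypothetical helper names, and the response layout assumed here is the same one the mwclient example reads (the HTML under parse, text, "*").

```python
from urllib.parse import urlencode

def build_parse_url(api_url, page_title):
    """Build an api.php URL that asks for the rendered HTML of a page."""
    query = urlencode({
        "action": "parse",
        "page": page_title,
        "prop": "text",
        "format": "json",
    })
    return f"{api_url}?{query}"

def extract_html(payload):
    """Pull the HTML out of a decoded JSON response of the parse action."""
    if "parse" not in payload:
        raise ValueError("could not retrieve html: %r" % (payload.get("error"),))
    return payload["parse"]["text"]["*"]
```

To use it, fetch the URL (for example with urllib.request.urlopen), decode the JSON body, and pass the resulting dict to extract_html.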
With Pywikibot you may use http.request() to get the html content:
import pywikibot
from pywikibot.comms import http
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])
This should give the html content
'<title>Elvis Presley – Wikipedia</title>\n'
With Pywikibot 6.0, http.request() returns a requests.Response object rather than plain text. In this case you must use the text attribute:
print(r.text[94:135])
to get the same result.

Preventing BeautifulSoup from converting my XML tags to lowercase

I am using BeautifulStoneSoup to parse an XML document and change some attributes. I noticed that it automatically converts all XML tags to lowercase. For example, my source file has <DocData> elements, which BeautifulSoup converts to <docdata>.
This appears to be causing problems since the program I am feeding my modified XML document to does not seem to accept the lowercase versions. Is there a way to prevent this behavior in BeautifulSoup?
No, that's not a built-in option. The source is pretty straightforward, though. It looks like you want to change the value of encodedName in Tag.__str__.
Simple answer
Switch from the default parser (html.parser) to the XML parser, which keeps tag names case-sensitive. It requires lxml to be installed.
Code: soup = BeautifulSoup(yourXmlStr, 'xml')
Detailed explanation
See my answer in another post.
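A minimal sketch of the XML-parser approach, applied to the kind of edit described in the question. The sample document and the set_attribute helper are made up for illustration. The 'xml' feature needs lxml, so the call is guarded in case it is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

xml = '<DocData id="1"><Item>text</Item></DocData>'  # illustrative input

def set_attribute(xml_str, tag_name, attr, value):
    """Parse as XML, set an attribute on the first matching tag, return markup."""
    soup = BeautifulSoup(xml_str, "xml")  # raises FeatureNotFound without lxml
    soup.find(tag_name)[attr] = value
    return str(soup)

try:
    print(set_attribute(xml, "DocData", "id", "2"))
except FeatureNotFound:
    print("lxml is not installed, so the 'xml' parser is unavailable")
```

Note that the tag is looked up as "DocData", with its original capitalization, because the XML parser does not lowercase names.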
