I started a research project grabbing pages using wget with the local-links and mirror options. I did it this way at the time because I did not know how long the sites would be active. So I have 60-70 sites fully mirrored, with localized links, sitting in a directory, and I now need to glean what I can from them.
Is there a good example of parsing these pages using BeautifulSoup? I realize that BeautifulSoup is usually shown taking an HTTP response and parsing from there. I'll be honest: I'm not savvy with BeautifulSoup yet, and my programming skills are not awesome. Now that I have some time to devote to it, I would like to do this the easy way rather than the manual way.
Can someone point me to a good example, resource, or tutorial for parsing the HTML I have stored? I really appreciate it. Am I overthinking this?
Using BeautifulSoup with local content is just the same as with Internet content. For example, to read a local HTML file into bs4:
import urllib.request
import bs4

# A file:// URL works with urlopen just like an http:// one
response = urllib.request.urlopen('file:///Users/Li/Desktop/test.html', timeout=1)
html = response.read()
soup = bs4.BeautifulSoup(html, 'html.parser')
In terms of how to use bs4 for processing HTML, the bs4 documentation is itself a pretty good tutorial. In most situations, spending a day reading it is enough for basic data processing.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")   # from an open filehandle
soup = BeautifulSoup("<html>data</html>", "html.parser")  # from a string
I am learning web scraping; however, I ran into an issue preparing the soup. It doesn't even look like the HTML I can see when inspecting the page.
import requests
from bs4 import BeautifulSoup
URL = "https://www.mediaexpert.pl/"
response = requests.get(URL).text
soup = BeautifulSoup(response,"html.parser")
print(soup)
The result is mostly one long, random-looking string; it makes up about 85% of the output. I tried searching the whole internet, but I think I have too little knowledge, for now, to find a solution. I will be glad for every bit of help.
BeautifulSoup does not deal with JavaScript-generated content; it only works with the static HTML the server returns. To extract data that is generated by JavaScript, you would need to drive a real browser with a library like Selenium.
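If you go that route, here is a minimal sketch using Selenium with headless Chrome (assuming Selenium 4+, which manages the driver itself, and a local Chrome install):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without opening a window
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.mediaexpert.pl/")
    # page_source holds the DOM after JavaScript has run
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.title)
finally:
    driver.quit()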
I thought this would be funny and interesting to share. I ran into a weird situation that I had never encountered before.
I was fooling around with Python's BeautifulSoup. After scraping https://www.amazon.ca I got the strangest output at the end of the HTML.
Can anyone tell me whether this is intentional on the part of Amazon's developers, or is it something else?
FYI, here is the code I used, to show it has nothing to do with me:
import urllib.request
from bs4 import BeautifulSoup  # the 'lxml' parser below requires the lxml package

# Fetch the page and parse it
url = "https://www.amazon.ca"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
print(soup)
So, Amazon doesn't allow web scraping of its sites, and it may serve different HTML content to scraping programs. For me, the HTML just said: "Forbidden".
If you want to get data from Amazon, you will probably need to use their API.
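For what it's worth, some sites return a "Forbidden" page specifically when they see the default Python user agent. Below is a hedged sketch of sending a browser-like User-Agent header with requests; the header value is illustrative, and Amazon may well block the request anyway:
import requests
from bs4 import BeautifulSoup

# Illustrative browser-like header; no guarantee of access
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get("https://www.amazon.ca", headers=headers)
print(response.status_code)  # 200 means a page was served

soup = BeautifulSoup(response.text, "lxml")
print(soup.title)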
I'm using BeautifulSoup 4, as bs4, under Anaconda's distribution. Correct me if I'm wrong: my understanding is that BeautifulSoup is a library for transforming ill-formed HTML into well-formed HTML. But when I pass HTML to its constructor, I lose more than half of its characters. Shouldn't it only be fixing the HTML, not cleaning it? The docs don't describe this well.
This is the code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
where html is the HTML of Google's homepage.
Edit:
Could it come from the way I'm retrieving the HTML as a string, via str(soup)?
First of all, make sure you actually see these "missing" tags in the HTML coming into BeautifulSoup. The problem may not be in how BeautifulSoup parses the HTML, but in how you are retrieving the HTML data in the first place.
I suspect you are downloading the Google homepage via urllib2 or requests and comparing what you see inside str(soup) with what you see in a real browser. If that is the case, you cannot compare the two: neither urllib2 nor requests is a browser; they cannot execute JavaScript, manipulate the DOM after the page loads, or make asynchronous requests. What you get with urllib2 or requests is basically the initial HTML page, without the dynamic part.
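A quick sanity check (a minimal sketch; the URL and parser are just examples) is to compare the length of the raw response with the length of what BeautifulSoup gives back:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.google.com").text
print(len(html))       # size of the raw HTML actually received

soup = BeautifulSoup(html, "html.parser")
print(len(str(soup)))  # parsing alone should not shrink this much

If the two numbers are close, the "missing" markup was never in the response to begin with.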
If the problem is still in how BeautifulSoup parses the HTML...
As clearly stated in the docs, the behavior depends on which parser BeautifulSoup chooses to use under the hood:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document. But if the document is not perfectly-formed, different parsers will give different results.
See Installing a parser and Specifying the parser to use.
Since you don't specify a parser explicitly, the following rule is applied:
If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
See also Differences between parsers.
In other words, try approaching the problem with different parsers and see how the results differ:
soup = BeautifulSoup(html, 'lxml')         # fast; requires the lxml package
soup = BeautifulSoup(html, 'html5lib')     # parses the way a browser does; very lenient
soup = BeautifulSoup(html, 'html.parser')  # Python's built-in parser; no install needed
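A tiny demonstration of how the results diverge on malformed input (lxml and html5lib must be installed for this to run):
from bs4 import BeautifulSoup

broken = "<p>one<p>two"  # unclosed tags: each parser repairs this differently

for parser in ("lxml", "html5lib", "html.parser"):
    print(parser, "->", BeautifulSoup(broken, parser))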
I'm currently running this code:
import urllib.request  # urllib.urlopen was Python 2; it is urllib.request.urlopen in Python 3
from bs4 import BeautifulSoup

htmltext = urllib.request.urlopen("http://www.fifacoin.com/")
html = htmltext.read()
soup = BeautifulSoup(html, "html.parser")
# Find every <tr> that carries a data-price attribute
for item in soup.find_all('tr', {'data-price': True}):
    print(item['data-price'])
When I run this code I don't get any output at all, even though I know there are HTML tags matching these search parameters on that website. I'm probably making an obvious mistake here; I'm new to Python and BeautifulSoup.
The problem is that the price-list table is loaded through JavaScript, and urllib does not include a JavaScript engine as far as I know. So all of the JavaScript on that page, which is executed in a normal browser, is not executed for the page fetched by urllib.
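A quick way to verify this (a minimal diagnostic sketch) is to check whether the attribute appears anywhere in the raw response, before BeautifulSoup is involved:
import urllib.request

raw = urllib.request.urlopen("http://www.fifacoin.com/").read()
# False would confirm the table is injected by JavaScript after page load
print(b"data-price" in raw)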
The only way around this is to emulate a real browser. Solutions that come to mind are PhantomJS and Node.js.
I recently did a similar thing with Node.js (although I am a Python fan as well) and was pleasantly surprised. I did it a little differently, but this page seems to explain quite well what you would want to do: http://liamkaufman.com/blog/2012/03/08/scraping-web-pages-with-jquery-nodejs-and-jsdom/
I am currently trying to create a bot for the Betfair trading site. It involves using the Betfair API, which uses SOAP; the new API-NG will use JSON, so I can understand how to access the information that I need.
My question is: using Python, what would be the best way to get information from a website that serves just HTML? Can I convert it somehow, maybe to XML, or what is the best/easiest way?
JSON, XML, and basically all of this is new to me, so any help will be appreciated.
This is one of the websites I am trying to access, to get horse names and prices:
http://www.oddschecker.com/horse-racing-betting/chepstow/14:35/winner
I know there are some similar questions, but looking at the answers and at the source of the above page, I am no nearer to figuring out how to get the info I need.
For getting HTML from a website there are two widely used options:
urllib2 — built in (Python 2; in Python 3 it became urllib.request).
requests — third-party, but really easy to use.
If you then need to parse your HTML, I would suggest using BeautifulSoup.
Example:
import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com'
page_request = requests.get(url)
page_source = page_request.text  # the raw HTML of the page
soup = BeautifulSoup(page_source, 'html.parser')  # name a parser explicitly
page_source is just the raw HTML of the page, which is not much use on its own; the soup object, on the other hand, can be used to access different parts of the page programmatically.
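For instance, here are a few generic ways of pulling data out of the soup object (illustrative only; I have not inspected the oddschecker markup, so these selectors are assumptions):
# Continuing from the soup object above
print(soup.title.string)                 # the page <title>

for a in soup.find_all('a', href=True):  # every hyperlink on the page
    print(a['href'])

first_table = soup.find('table')         # the first <table>, if any
if first_table:
    for row in first_table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        print(cells)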