Relativally new to BeautifulSoup. Attempting to obtain raw html from locally saved html file. I've looked around and have found that I should probably be using Beautiful Soup for this. Though when I do this:
from bs4 import BeautifulSoup
url = r"C:\example.html"
soup = BeautifulSoup(url, "html.parser")
text = soup.get_text()
print (text)
An empty string is printed out. I assume I'm missing some step. Any nudge in the right direction would be greatly appreciated.
The first argument to BeautifulSoup is an actual HTML string, not a URL. Open the file, read its contents, and pass that in.
Touching upon the previous answer, there are two ways to open an HTML file:
1.
with open("example.html") as fp:
soup = BeautifulSoup(fp)
2.
soup = BeautifulSoup(open("example.html"))
Related
I have some HTML file. I want to extract from here an unordered list. I have class name for this unordered list. And I am trying the following code:
soup =BeautifulSoup(HTML(open('dtaa.html').read()).__html__())
soup.find("ul",{"class":"name of class"})
dtaa.html is my file
This gives me nothing. This unordered list is inside 2 division. Maybe this is the problem.
Thanks in advance
You can read the HTML file like this:
with open("dtaa.html") as fp:
soup = BeautifulSoup(fp, 'html.parser')
soup.find("ul", attrs={"class":"name of class"})
You could also try another parser, like:
soup = BeautifulSoup(fp, "html5lib")
Documentation:
BeautifulSoup
Reading & Writing files
Somebody is handing my function a BeautifulSoup object (BS4) that he has gotten using the typical call:
soup = BeautifulSoup(url)
my code:
def doSomethingUseful(soup):
url = soup.???
How do I get the original URL from the soup object? I tried reading the docs AND the BeautifulSoup source code... I'm still not sure.
If the url variable is a string of an actual URL, then you should just forget the BeautifulSoup here and use the same variable url. You should be using BeautifulSoup to parse HTML code, not a simple URL. In fact, if you try to use it like this, you get a warning:
>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
Since the URL is just a string, BeautifulSoup doesn't really know what to do with it when you "soupify" it, except for wrapping it up in basic HTML:
>>> soup
<html><body><p>https://foo</p></body></html>
If you still wanted to extract the URL from this, you could just use .text on the object, since it's the only thing in there:
>>> print(soup.text)
https://foo
If on the other hand url is not really a URL at all but rather a bunch of HTML code (in which case the variable name would be very misleading), then how you'd extract a specific link inside would beg the question of how it's in your code. Doing a find to get the first a tag, then extracting the href value would be one way.
>>> actual_html = '<html><body>My link text</body></html>'
>>> newsoup = BeautifulSoup(actual_html)
>>> newsoup.find('a')['href']
'http://moo'
I am parsing a weblink and I want to save the whole webpage to a local file in format .html. I want to directly output soup to an html file locally for uploading a copy to S3-AWS ?
from bs4 import BeautifulSoup
import requests
url_name = "https://<weblink>/"
soup = BeautifulSoup(url_name,"html.parser")
Now, I am just wondering, like .txt can we output soup to .html as well. Suggestions appreciated.
You imported requests but never actually used it. You need to GET the actual site
r=requests.get(url_name)
Then you can pass that to BS
soup=BeautifulSoup(r.text,'html.parser')
I need to download all the files under this links where only the suburb name keep changing in each link
Just a reference
https://www.data.vic.gov.au/data/dataset/2014-town-and-community-profile-for-thornbury-suburb
All the files under this search link:
https://www.data.vic.gov.au/data/dataset?q=2014+town+and+community+profile
Any possibilities?
Thanks :)
You can download file like this
import urllib2
response = urllib2.urlopen('http://www.example.com/file_to_download')
html = response.read()
To get all the links in a page
from bs4 import BeautifulSoup
import requests
r = requests.get("http://site-to.crawl")
data = r.text
soup = BeautifulSoup(data)
for link in soup.find_all('a'):
print(link.get('href'))
You should first read the html, parse it using Beautiful Soup and then find links according to the file type you want to download. For instance, if you want to download all pdf files, you can check if the links end with the .pdf extension or not.
There's a good explanation and code available here:
https://medium.com/#dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
I'm using Python 3.5 and bs4
The following code will not retrieve all the tables from the specified website. The page has 14 tables but the return value of the code is 2. I have no idea what's going on. I manually inspected the HTML and can't find a reason as to why it's not working. There doesn't seem to be anything special about each table.
import bs4
import requests
link = "http://www.pro-football-reference.com/players/B/BradTo00.htm"
htmlPage = requests.get(link)
soup = bs4.BeautifulSoup(htmlPage.content, 'html.parser')
all_tables = soup.findAll('table')
print(len(all_tables))
What's going on?
EDIT: I should clarify. If I inspect the soup variable, it contains all of the tables that I expected to see. How am I not able to extract those tables from soup with the findAll method?
this page is rendered by javascript, and if you disable the javascrip in you broswer, you will notice that this page only hava two table.
i recommend to use selenium for this situation.