I am parsing a weblink and I want to save the whole webpage to a local file in format .html. I want to directly output soup to an html file locally for uploading a copy to S3-AWS ?
from bs4 import BeautifulSoup
import requests
url_name = "https://<weblink>/"
soup = BeautifulSoup(url_name,"html.parser")
Now, I am just wondering, like .txt can we output soup to .html as well. Suggestions appreciated.
You imported requests but never actually used it. You need to GET the actual site
r=requests.get(url_name)
Then you can pass that to BS
soup=BeautifulSoup(r.text,'html.parser')
Related
I need to download all the files under this links where only the suburb name keep changing in each link
Just a reference
https://www.data.vic.gov.au/data/dataset/2014-town-and-community-profile-for-thornbury-suburb
All the files under this search link:
https://www.data.vic.gov.au/data/dataset?q=2014+town+and+community+profile
Any possibilities?
Thanks :)
You can download file like this
import urllib2
response = urllib2.urlopen('http://www.example.com/file_to_download')
html = response.read()
To get all the links in a page
from bs4 import BeautifulSoup
import requests
r = requests.get("http://site-to.crawl")
data = r.text
soup = BeautifulSoup(data)
for link in soup.find_all('a'):
print(link.get('href'))
You should first read the html, parse it using Beautiful Soup and then find links according to the file type you want to download. For instance, if you want to download all pdf files, you can check if the links end with the .pdf extension or not.
There's a good explanation and code available here:
https://medium.com/#dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
When I attempt to parse a locally stored copy of a webpage, beautifulsoup returns gibberish to me. I don't understand why as I've never faced this problem when using the requests and bs4 modules together for scraping tasks.
here's my code
import requests
from bs4 import BeautifulSoup as BS
import os
url_2 = r'/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/'
os.chdir(url_2)
f = open('re_2.html')
soup = BS(url_2, "lxml")
f.close()
print soup
this code returns the following :
<html><body><p>/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/</p></body></html>
I wasn't able to find a similar problem online so I've posted it here. any help would be much appreciated.
You are passing the path (which you named url_2) to BeautifulSoup so it treats that as a web page text and returns it, neatly wrapped in some minimal HTML. Seems fine.
Try constructing the BS from the file's contents instead. See here how it works: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
soup = BS(f)
should do...
Relativally new to BeautifulSoup. Attempting to obtain raw html from locally saved html file. I've looked around and have found that I should probably be using Beautiful Soup for this. Though when I do this:
from bs4 import BeautifulSoup
url = r"C:\example.html"
soup = BeautifulSoup(url, "html.parser")
text = soup.get_text()
print (text)
An empty string is printed out. I assume I'm missing some step. Any nudge in the right direction would be greatly appreciated.
The first argument to BeautifulSoup is an actual HTML string, not a URL. Open the file, read its contents, and pass that in.
Touching upon the previous answer, there are two ways to open an HTML file:
1.
with open("example.html") as fp:
soup = BeautifulSoup(fp)
2.
soup = BeautifulSoup(open("example.html"))
I am trying to get all the urls on a website using python. At the moment I am just copying the websites html into the python program and then using code to extract all the urls. Is there a way I could do this straight from the web without having to copy the entire html?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters I suggest to have a look at the requests library.
The most straightforward would probably be urllib.urlopen if you're using python2, or urllib.request.urlopen if you're using python3 (you have to do import urllib or import urllib.request first of course). That way you get an file like object from which you can read (ie f.read()) the html document.
Example for python 2:
import urllib
f = urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part which is analyzing the html document for links.
You might want to use the bs4(BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can download bs4 with the followig command at the cmd line. pip install BeautifulSoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
print urlparse.urljoin(url, link['href'])
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and supply it into the BeautifulSoup, which has done all the job to extract the DOM, and get all URLs, i.e. <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, parseOnlyThese=SoupStrainer('a'))
for a_element in bs:
if a_element.has_attr('href'):
print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...
I am trying to do some web scraping and I wrote a simple script that aims to print all URLs present in the webpage. I don't know why it passes over many URLs and is printing a list from the middle instead from the first URL.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
print(links['href'])
Why that? Anyone could explain me what happen?
I am using Python 3.7.1, OS Windows 10 - Visual Studio Code
Often, hrefs just provide part (not complete) of urls. No worries.
Open it in a new tab/ browser. Find the missing part of the url. Add it to the href as string.
in the case, that must be 'http://www.bda-ieo.it/test/'.
Here is your code.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
print('http://www.bda-ieo.it/test/' + links['href'])
And this' the result.
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1