Using urllib and BeautifulSoup to retrieve info from the web with Python

I can fetch an HTML page with urllib and parse it with BeautifulSoup, but it looks like I have to write the page to a file first so that BeautifulSoup can read it.
import urllib
sock = urllib.urlopen("http://SOMEWHERE")
htmlSource = sock.read()
sock.close()
# --> write htmlSource to a file
Is there a way to call BeautifulSoup without writing the urllib output to a file?

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(htmlSource)
No file writing needed: Just pass in the HTML string. You can also pass the object returned from urlopen directly:
f = urllib.urlopen("http://SOMEWHERE")
soup = BeautifulSoup(f)
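For Python 3 with the bs4 package, a minimal sketch of the same idea might look like this. The HTML literal is a made-up stand-in for whatever sock.read() returns, so nothing here touches the network or the disk:

```python
from bs4 import BeautifulSoup

# Stand-in for the string returned by sock.read(); no file is involved.
html_source = (
    "<html><body>"
    "<a href='/first'>First</a>"
    "<a href='/second'>Second</a>"
    "</body></html>"
)

# Pass the in-memory string straight to BeautifulSoup.
soup = BeautifulSoup(html_source, "html.parser")
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```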

You could open the url, download the html, and make it parse-able in one shot with gazpacho:
from gazpacho import Soup
soup = Soup.get("https://www.example.com/")

Related

Why am I unable to web-scrape URL from a hyperlink in this website?

I tried to extract the URLs from the hyperlinks on this website: https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/
I used the following Python code:
import requests
from bs4 import BeautifulSoup
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
The problem is this code does not return any URL.
I want to get all of the URLs linked from that page.
You are unable to parse it because the data is loaded dynamically. The HTML that ends up on the page does not exist in the HTML source you download. JavaScript later parses the window.__SITE variable and extracts the data from there.
However, we can replicate this in Python. After downloading the page:
import requests
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
You can use re (regex) to extract the encoded page source:
import re
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).groups()[0]
Afterwards, you can use urllib to URL-decode the text, and json to parse the JSON string data:
from urllib.parse import unquote
from json import loads
json_data = loads(unquote(encoded_data))
You can then parse the JSON tree to get to the HTML source data:
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
At that point, you can use your own code to parse the HTML:
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
If you put it all together, here's the final script:
import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup
# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
# Get encoded JSON from HTML source
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).groups()[0]
# Decode and load as dictionary
json_data = loads(unquote(encoded_data))
# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
# Get links
links = soup.find_all('a')
# For each link...
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
The links are generated dynamically by JavaScript code, and the data can be found in the structure below.
<script id="site-injection">
window.__SITE="your data is here"
</script>
So you need to grab this script element and parse the value of window.__SITE.
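Here is a rough, offline sketch of that approach. The payload and its "markdown" key are made up for illustration (the real page nests the data much deeper), and the sample HTML is built locally instead of being downloaded:

```python
import json
import re
from urllib.parse import quote, unquote

from bs4 import BeautifulSoup

# Hypothetical payload; the real site stores a much larger, deeper structure.
payload = {"markdown": '<a href="https://example.com/file.xlsx">file</a>'}
sample_html = (
    '<html><body><script id="site-injection">'
    'window.__SITE="' + quote(json.dumps(payload)) + '"'
    "</script></body></html>"
)

# 1. Grab the script element by its id.
soup = BeautifulSoup(sample_html, "html.parser")
script = soup.find("script", id="site-injection")

# 2. Pull out the URL-encoded value assigned to window.__SITE.
match = re.search(r'window\.__SITE="(.*)"', script.string)

# 3. URL-decode it and parse the JSON.
site_data = json.loads(unquote(match.group(1)))

# 4. Parse the embedded HTML fragment and collect the links.
inner = BeautifulSoup(site_data["markdown"], "html.parser")
links = [a["href"] for a in inner.find_all("a", href=True)]
print(links)
```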

BeautifulSoup html -- load from memory?

I'm using BeautifulSoup in Python 3.5 to parse HTML. I can load it from a file, but I need to load it from memory because I get it from an HTTP request. I've googled but found nothing about loading HTML into BeautifulSoup from memory. Is it possible?
If you are using version 4 of BeautifulSoup, try passing the response data to it:
from bs4 import BeautifulSoup
import requests
# replace the following URL
response = requests.get("https://www.python.org")
soup = BeautifulSoup(response.text,"html.parser")
from bs4 import BeautifulSoup
import requests
data = requests.get('https://google.com').text
soup = BeautifulSoup(data, 'html.parser')
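If the body is only available as bytes (for example response.content in requests), BeautifulSoup accepts that too. A small sketch, assuming the bytes are UTF-8; the byte string below is a made-up stand-in for a real response body:

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: raw bytes held in memory.
raw_bytes = "<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")

# from_encoding tells BeautifulSoup how to decode the bytes
# (assumption: the page is UTF-8).
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
paragraph = soup.p.get_text()
print(paragraph)
```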

python beautifulsoup get html tag content

How can I get the content of an HTML tag with BeautifulSoup? For example, the content of the <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
    print(each.get_text())
But nothing happened. I'm using python3.
You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())
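To see why the original attempt printed nothing, here is a small offline sketch. The HTML literal is made up, and the URL is the placeholder from the question. When BeautifulSoup is given a URL string it parses the URL text itself as markup (recent bs4 versions may also emit a warning about this), so there is no <title> tag to find:

```python
from bs4 import BeautifulSoup

# Parsing real markup: the title tag is found.
html = "<html><head><title>Example Domain</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")
title_tag = soup.title  # shortcut for soup.find('title')
title_text = title_tag.get_text() if title_tag is not None else None
print(title_text)

# Parsing a URL string: nothing is downloaded, so no title exists.
url_soup = BeautifulSoup("http://www.websiteaddress.com", "html.parser")
missing_title = url_soup.title
print(missing_title)
```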

Accessing a website in python

I am trying to get all the URLs on a website using Python. At the moment I am just copying the website's HTML into the Python program and then using code to extract all the URLs. Is there a way I could do this straight from the web without having to copy the entire HTML?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library.
The most straightforward way would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the HTML document.
Example for python 2:
import urllib
f = urllib.urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part which is analyzing the html document for links.
You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can install bs4 with the following command at the command line: pip install beautifulsoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
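The snippet above is Python 2. In Python 3 the same link resolution lives in urllib.parse.urljoin. A minimal offline sketch, where the base URL and the HTML fragment are invented for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and page content, used instead of a live download.
base_url = "http://www.example.com/docs/"
content = (
    '<a href="/about">About</a>'
    '<a href="page2.html">Next</a>'
    '<a href="http://other.example/x">External</a>'
    '<a name="no-href">Skipped</a>'
)

soup = BeautifulSoup(content, "html.parser")
# href=True skips anchors without an href; urljoin resolves relative links.
absolute_links = [
    urljoin(base_url, link["href"]) for link in soup.find_all("a", href=True)
]
print(absolute_links)
```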
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Then pass the HTML string to BeautifulSoup, which does all the work of building the DOM, and get all URLs, i.e. the href values of the <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
# parse_only is the bs4 name for the old parseOnlyThese argument
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))
for a_element in bs.find_all('a', href=True):
    print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...

How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt"  # nsfw link
page = urlopen(url)
html = page.read()  # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall(r'src."(\S*?media.tumblr\S*?tumblr_\S*?jpg)', html)
print image_links
The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is it is very slow.
BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.
The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information that I want.
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is, how can I get BeautifulSoup to extract all of those 'img src' urls cleanly without any other cruft?
I just want a list of matching urls. I've been trying to use soup.findall() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
for element in soup.findAll('img'):
    print(element.get('src'))
You can use the div.media > a > img CSS selector to find img tags inside an a tag that is itself inside a div tag with the media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
To make the parsing faster, you can use the lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install the lxml module first, of course.
Also, you can use the SoupStrainer class to parse only the relevant part of the document.
Hope that helps.
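A rough sketch of the SoupStrainer idea, run on a made-up HTML fragment modeled on the snippet in the question (the URLs are placeholders). Only img tags are kept while parsing, so everything else is never built into the tree:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up fragment modeled on the question's markup.
html = (
    '<div class="media"><a href="http://example.com/image/1">'
    '<img src="http://example.com/tumblr_a_500.jpg"/></a></div>'
    '<div class="sidebar"><img src="http://example.com/avatar.png"/></div>'
)

# Keep only <img> tags while parsing; other tags are discarded.
only_imgs = SoupStrainer("img")
soup = BeautifulSoup(html, "html.parser", parse_only=only_imgs)

image_urls = [img["src"] for img in soup.find_all("img", src=True)]
anchors = soup.find_all("a")  # empty: <a> tags were never parsed
print(image_urls)
```

Note that the strainer filters by tag, not by position, so the sidebar image is included as well. Combine it with a check on the src value if only some images are wanted.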
Have a look at combining BeautifulSoup.find_all with re.compile:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html, "html.parser")
a_tumblr = [a_element for a_element in bs.find_all(href=re.compile(r"media\.tumblr"))]
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]
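The output above matches href attributes, which is why it returns link tags, not images. A variation that filters on the src attribute of img tags might look like the sketch below. It runs on a small HTML fragment assembled from the URLs already shown in this thread, and the exact regex is an assumption about which images are wanted:

```python
import re

from bs4 import BeautifulSoup

# Fragment assembled from URLs shown earlier in the thread.
html = (
    '<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>'
    '<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">'
    '<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>'
    '</a></div>'
)

soup = BeautifulSoup(html, "html.parser")

# Assumed pattern: media.tumblr.com hosts, files named tumblr_*.jpg.
pattern = re.compile(r"media\.tumblr\.com/tumblr_\S*\.jpg$")
image_links = [img["src"] for img in soup.find_all("img", src=pattern)]
print(image_links)
```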
